Clustering 101 for Insurance Applications

April 25, 2019

Tom Kolde, FCAS, MAAA
Linda Brobeck, FCAS, MAAA



About the Presenters

• Linda Brobeck, FCAS, MAAA – Director & Consulting Actuary – San Francisco, CA

• Tom Kolde, FCAS, MAAA – Consulting Actuary – Chicago, Illinois


Agenda

• Supervised vs. Unsupervised Learning

• Clustering Algorithms Overview

– Hierarchical Clustering

– K-Means

• Clustering Application Examples


Supervised vs. Unsupervised Machine Learning

Machine Learning

• Supervised – Predictive; target variable; task driven (Regression, Classification)

• Unsupervised – Descriptive; no target variable; data driven (Clustering, Pattern Discovery, Dimension Reduction)

• Reinforcement – Algorithm learns to react


Polling Question #1

What types of unsupervised learning have you used in the past? (Select all applicable)

A. Principal Component Analysis

B. Clustering

C. Neural Networks

D. Other

E. None… YET


Types of Clustering

Clustering Algorithms

• Connectivity – Hierarchical (Agglomerative, Divisive)

• Centroid – K-Means, Fuzzy C-Means, K-Medoids

• Distribution – Expectation Maximization

• Density – OPTICS, DBSCAN


• Additional types of cluster models

– Neural models

– Principal component analysis

• Hard vs. Soft (Fuzzy) clustering

• Finer distinctions

– Strict partitioning (with or without outliers)

– Overlapping

Other Clustering Options


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

7 Clusters

[Scatter plot: points A–G, each its own cluster]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

6 Clusters

[Scatter plot: A–G grouped into 6 clusters]


Euclidean Distance

A = (x1, y1)

B = (x2, y2)

d = √((x2 − x1)² + (y2 − y1)²)
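The distance formula can be sketched directly in code; the two points below are labeled points from the distance-matrix example that follows.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two 2-D points, per the formula above."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

# Points a = (4.0, 5.0) and c = (6.3, 5.2) from the distance-matrix slide
print(round(euclidean((4.0, 5.0), (6.3, 5.2)), 2))  # 2.31
```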


Distance Matrix

Data points:

     x     y
a   4.0   5.0
b   6.0  10.0
c   6.3   5.2
d   6.4   4.7
e   9.0   5.4
f  10.0   5.2
g  10.2   5.0

Euclidean distances:

      a     b     c     d     e     f
b  5.39
c  2.31  4.81
d  2.42  5.32  0.51
e  5.02  5.49  2.71  2.69
f  6.00  6.25  3.70  3.63  1.02
g  6.20  6.53  3.91  3.81  1.26  0.28
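The whole distance matrix can be recomputed in a few lines; this sketch uses NumPy broadcasting on the coordinates above.

```python
import numpy as np

# Coordinates a..g from the data-point table above
labels = ["a", "b", "c", "d", "e", "f", "g"]
X = np.array([[4.0, 5.0], [6.0, 10.0], [6.3, 5.2], [6.4, 4.7],
              [9.0, 5.4], [10.0, 5.2], [10.2, 5.0]])

# All pairwise Euclidean distances at once via broadcasting
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

print(round(D[labels.index("f"), labels.index("g")], 2))  # 0.28, the closest pair
```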


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

6 Clusters

[Scatter plot: A–G grouped into 6 clusters]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

5 Clusters

[Scatter plot: A–G grouped into 5 clusters]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

4 Clusters

[Scatter plot: A–G grouped into 4 clusters]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

3 Clusters

[Scatter plot: A–G grouped into 3 clusters]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

2 Clusters

[Scatter plot: A–G grouped into 2 clusters]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

1 Cluster

[Scatter plot: A–G merged into a single cluster]


Hierarchical Clustering (HCA) – Bottom Up (Agglomerative)

[Dendrogram of the merges, with the points ordered B, A, C, D, E, F, G along the axis]


Hierarchical Algorithm

• Advantages

– Easy to understand

– Flexible

• Disadvantages

– Computationally expensive for large data sets

– Sensitive to outliers
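For readers who want to run agglomerative clustering rather than trace merges by hand, SciPy's hierarchy module covers the whole loop. This sketch clusters the seven points a–g from the earlier distance-matrix example; single linkage (merge the closest pair first) is one of several linkage choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Points a..g from the distance-matrix example
pts = np.array([[4.0, 5.0], [6.0, 10.0], [6.3, 5.2], [6.4, 4.7],
                [9.0, 5.4], [10.0, 5.2], [10.2, 5.0]])

Z = linkage(pts, method="single")                # bottom-up merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram at 3 clusters
print(labels)  # b alone; {a, c, d} together; {e, f, g} together
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the merge tree shown on the dendrogram slide.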


• Partition-based clustering method

• Relatively simple to understand & program

• K-means Algorithm:

1. Start with a random set of k cluster seeds

2. For each data point, calculate the distance to each cluster seed and assign the point to the closest seed

3. Once all data points have initially been assigned, calculate the centroid of each cluster

4. Repeat Step 2 using the cluster centroids instead of the initial cluster seed

5. For each new cluster, re-calculate the centroid

6. Repeat steps 4–5 until convergence

Introduction to the K-Means Algorithm
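The six steps above fit in a short NumPy function. This is a minimal sketch, not production code: it keeps the old centroid if a cluster empties, and a real run would restart from several random seeds.

```python
import numpy as np

def k_means(X, k, n_iter=50, seed=0):
    """Minimal k-means loop following the six steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial cluster seeds
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2 & 4: assign every point to its nearest seed/centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Steps 3 & 5: recompute each cluster's centroid (keep old one if empty)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # Step 6: stop once centroids settle
            break
        centroids = new
    return assign, centroids

# Two well-separated toy blobs: the loop should recover them as two clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
assign, centroids = k_means(X, k=2)
print(assign)
```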


• Our example begins with 95 data points

K-Means Cluster Analysis

[Scatter plot: the 95 data points, X vs. Y]


• Next we assume the data has 3 clusters and randomly generate initial seed centroids for each

K-Means Cluster Analysis

[Scatter plot: data points with Seed 1, Seed 2, and Seed 3 marked]


• Each data point is assigned to the closest seed for its initial cluster

K-Means Cluster Analysis

[Scatter plot: points colored as Cluster 1, Cluster 2, and Cluster 3 around the three seeds]


• New centroids are calculated for each cluster

K-Means Cluster Analysis

[Scatter plot: the three clusters with Centroid 1, Centroid 2, and Centroid 3 replacing the seeds]


• Data points are assigned to the nearest centroid

K-Means Cluster Analysis

[Scatter plot: points re-colored by nearest centroid]


• New centroids are calculated for the data points within each cluster

K-Means Cluster Analysis

[Scatter plot: updated centroids within each cluster]


• The process continues, with data points being assigned to the nearest cluster centroid, until convergence

K-Means Cluster Analysis

[Scatter plot: the final converged clusters]


Advantages/Disadvantages of K-Means Algorithm

• Advantage

– Computationally simple

• Disadvantages

– Number of clusters k must be pre-selected

– Results may not be repeatable when using randomly selected seed centroids

• K-Medians Algorithm

– Less sensitive to outliers

– More processing time (to sort dataset)


• Variable reduction for modeling

• Territory analysis for ratemaking

Clustering Applications


• 44 Macro Economic Variables

– Unemployment (current, long-term, local, state, countrywide, changes over time)

– Housing Prices (changes over time, local, state, countrywide)

– Treasury Rates (short-term, long-term, yield curve slope, etc.)

– GDP (change over time, duration negative or positive, ratios)

• Correlation Matrix

• PROC VARCLUS in SAS (Oblique Centroid Component Cluster Analysis)

• Variable selection for each cluster

Variable Reduction Example
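PROC VARCLUS is SAS-specific, but the underlying idea — grouping variables so that highly correlated ones share a cluster — can be approximated in open-source tools. The sketch below is a hypothetical stand-in, not the VARCLUS algorithm itself: it hierarchically clusters variables using 1 − |correlation| as the distance, with simulated series playing the role of three co-moving treasury rates plus one unrelated variable.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Three variables that move together plus one independent series
X = np.column_stack([base + rng.normal(scale=0.1, size=200) for _ in range(3)]
                    + [rng.normal(size=200)])

corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)          # highly correlated variables end up "close"
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # the correlated trio shares one cluster; the odd one out gets its own
```

A representative could then be picked per group — for example, the variable most correlated with the rest of its cluster — mirroring the 1-R² ratio selection shown on the later slides.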


Variable Reduction Example

• Treas rate 30 yr • UE 1prior MSA • UE 1prior ST • UE 1prior CW • UE 3prior MSA • UE 3prior ST • UE 3prior CW • UE rel ST • UE rel MSA • UE rel CW • UE ST • UE MSA • UE CW • Yield Curve Slope • GDP current • GDP Prior • GDP dur neg • GDP dur pos • GDP recession • GDP ratio • GDP ratio 1YR • GDP ratio 2YR

• UE 10 yr MSA • UE 10 yr CW • UE 10 yr ST • UE Delta ST • UE Delta MSA • UE Delta CW • Fixed 30 YR rate • House Price Apprec 2YR ST • House Price Apprec 2YR MSA • House Price Apprec 2YR CW • Home Price Index ST • Home Price Index MSA • Home Price Index CW • Treas rate 3 mo • Treas rate 6 mo • Treas rate 1 yr • Treas rate 2 yr • Treas rate 3 yr • Treas rate 5 yr • Treas rate 7 yr • Treas rate 10 yr • Treas rate 20 yr


Portion of the Correlation Matrix

Treasury rate      3 mo     6 mo     1 yr     2 yr     3 yr     5 yr     7 yr    10 yr    20 yr    30 yr

Treas rate 3 mo 1 0.99828 0.99324 0.9728 0.94592 0.88626 0.83375 0.78125 0.21737 0.64186

Treas rate 6 mo 0.99828 1 0.9976 0.97972 0.95364 0.89353 0.83958 0.78627 0.21897 0.64229

Treas rate 1 yr 0.99324 0.9976 1 0.99018 0.96911 0.91428 0.86236 0.8095 0.22962 0.66596

Treas rate 2 yr 0.9728 0.97972 0.99018 1 0.99336 0.9569 0.91471 0.86526 0.26694 0.72931

Treas rate 3 yr 0.94592 0.95364 0.96911 0.99336 1 0.98265 0.95119 0.90702 0.29029 0.78043

Treas rate 5 yr 0.88626 0.89353 0.91428 0.9569 0.98265 1 0.99115 0.96453 0.32941 0.86757

Treas rate 7 yr 0.83375 0.83958 0.86236 0.91471 0.95119 0.99115 1 0.98911 0.34906 0.91986

Treas rate 10 yr 0.78125 0.78627 0.8095 0.86526 0.90702 0.96453 0.98911 1 0.36967 0.96373

Treas rate 20 yr 0.21737 0.21897 0.22962 0.26694 0.29029 0.32941 0.34906 0.36967 1 0.39463

Treas rate 30 yr 0.64186 0.64229 0.66596 0.72931 0.78043 0.86757 0.91986 0.96373 0.39463 1


VARCLUS output

Number of   Total Variation        Proportion of Variation   Minimum Proportion       Minimum R-squared   Maximum 1-R**2 Ratio
Clusters    Explained by Clusters  Explained by Clusters     Explained by a Cluster   for a Variable      for a Variable

 1    2.0213   0.0459   0.0459   0
 2   11.8449   0.2692   0.1105   0        2.3283
 3   17.6347   0.4008   0.1212   0        2.0271
 4   23.8405   0.5418   0.1600   0.0126   1.8387
 5   27.7395   0.6304   0.3053   0.0825   1.5727
 6   30.1739   0.6858   0.4645   0.1161   1.3948
 7   31.5827   0.7178   0.5710   0.1292   1.3087
 8   32.4495   0.7375   0.5919   0.1292   1.5582
 9   33.4705   0.7607   0.6476   0.1292   1.5582
10   35.7483   0.8125   0.7100   0.1655   1.5582
11   36.3604   0.8264   0.7369   0.1655   1.5582
12   37.0867   0.8429   0.7459   0.1655   1.5582
13   37.9171   0.8618   0.7898   0.1655   1.5582



VARCLUS output – 10 Cluster Solution

R-squared with:                  Own Cluster   Next Closest   1-R**2 Ratio
Cluster 9    Fixed 30 YR rate       0.9003        0.7565         0.4093
             Treas rate 3 mo        0.8434        0.3789         0.2522
             Treas rate 6 mo        0.8520        0.3861         0.2412
             Treas rate 1 yr        0.8793        0.4136         0.2059
             Treas rate 2 yr        0.9346        0.4757         0.1247
             Treas rate 3 yr        0.9624        0.5218         0.0787
             Treas rate 5 yr        0.9746        0.6031         0.0639
             Treas rate 7 yr        0.9512        0.6584         0.1427
             Treas rate 10 yr       0.9107        0.7231         0.3223
             Treas rate 20 yr       0.1655        0.1063         0.9338
             Treas rate 30 yr       0.7495        0.7659         1.0701
Cluster 10   UE Delta ST            0.9249        0.2696         0.1028
             UE Delta CW            0.9249        0.3729         0.1197

where the 1-R**2 Ratio = (1 − R²own) / (1 − R²next closest)


Summary of Variable Reduction Clustering

• Calculated the correlation matrix to be used in VARCLUS

• Selected the number of clusters based on the proportion of variation explained by clusters and the minimum R-squared for a variable within the cluster

• Selected the variable with the smallest 1-R² ratio to represent each cluster:

– 5 year treasury rate

– Prior quarter countrywide unemployment rate

– Prior quarter MSA unemployment rate

– Ratio of current GDP to 2 years prior

– Current GDP

– GDP recession indicator

– State home price index

– MSA home price index

– Duration of positive GDP growth

– Change in unemployment rate by state


• Deriving territory definitions is a common application of cluster analysis in ratemaking

• Goals:

– Loss experience by territory should be actuarially credible

– Balance homogeneity of loss experience within territory while producing a manageable number of territories

– Contiguous territories

• Solution = Hierarchical clustering using Ward’s method with contiguity constraint

Introduction to Territorial Clustering


• Each square below represents a zip code in our hypothetical State X

Introduction to Territorial Cluster Analysis

[Map of hypothetical State X: a grid of zip codes with the cities West Town, North Center, Star City, Central City, South Shore City, and Old Town labeled]


• Step 1 – Determine raw pure premium by zip code

Introduction to Territorial Cluster Analysis

[Map: zip codes shaded from lower to higher pure premium]


• Not every zip code is fully credible

• Spatial smoothing allows us to obtain credible results by zip code

Engineering Credible Loss Experience by Zip Code

[Map: zip codes shaded by raw pure premium]


• Determine the credibility for a single zip code: Credibility = Z0, Pure Premium = PP0

Spatial Smoothing

[Map: a single highlighted zip code]


• Determine the pure premium and credibility for the area including surrounding zip codes: Credibility = Z1, Pure Premium = PP1

Spatial Smoothing

[Map: the zip code plus its first ring of surrounding zip codes]


• Determine the pure premium and credibility for the wider area including a second ring of surrounding zip codes: Credibility = Z2, Pure Premium = PP2

Spatial Smoothing

[Map: the zip code plus two rings of surrounding zip codes]


Spatial Smoothing

• Smoothed PP = PP0 × Z0 + PP1 × (Z1 − Z0) + PP2 × (Z2 − Z1) + PPState × (1 − Z2)

[Map: zip codes shaded by pure premium]
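The smoothed pure premium is a weighted average whose weights are the incremental credibilities each wider ring adds (the weights sum to 1). A one-function sketch, with all numbers hypothetical:

```python
def smoothed_pp(pp0, z0, pp1, z1, pp2, z2, pp_state):
    """Credibility-weighted blend: each wider ring gets the incremental
    credibility it adds; the statewide PP absorbs the remainder."""
    return pp0 * z0 + pp1 * (z1 - z0) + pp2 * (z2 - z1) + pp_state * (1 - z2)

# Hypothetical zip: 25% credible alone, 50% / 75% with one / two rings of neighbors
print(smoothed_pp(500, 0.25, 450, 0.50, 420, 0.75, 400))  # 442.5
```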


• Spatial smoothing helps uncover patterns hidden within the loss experience

Spatial Smoothing

[Side-by-side maps: raw pure premium vs. smoothed pure premium]


• Ward’s method seeks to minimize the variance of data characteristics within each cluster

• In territorial cluster analysis this means minimizing the within cluster variance of loss experience metrics, such as frequency or pure premium

• In this case, frequency/pure premium is not viewed as a target variable but rather as a risk characteristic of a zip code

Ward’s Method


• The variance measure for combining clusters is the within-cluster sum of squares between each data object and the mean of its cluster:

– Within-cluster sum of squares: ESS = Σi Σj (Xij − X̄i.)²

– Between-cluster sum of squares: BSS = Σi Σj (X̄i. − X̄..)²

– Total sum of squares: TSS = Σi Σj (Xij − X̄..)²

– TSS = ESS + BSS

Ward’s Method
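The decomposition is easy to verify numerically. A small sketch with hypothetical pure-premium values for two clusters of zip codes:

```python
import numpy as np

# Hypothetical pure premiums for two clusters of zip codes
clusters = [np.array([100.0, 110.0, 105.0]), np.array([200.0, 190.0])]

all_x = np.concatenate(clusters)
grand_mean = all_x.mean()

ess = sum(((c - c.mean()) ** 2).sum() for c in clusters)             # within
bss = sum(len(c) * (c.mean() - grand_mean) ** 2 for c in clusters)   # between
tss = ((all_x - grand_mean) ** 2).sum()                              # total

print(ess, bss, tss)  # 100.0 9720.0 9820.0 — the identity TSS = ESS + BSS holds
```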


• Begin with each zip code as its own cluster (N = 600)

• Evaluate each pair of contiguous zip codes to determine the within-cluster variance

• The pair of zip codes that is most similar (i.e., produces the smallest within-cluster variance) is merged to form the first cluster

• Next, the clusters from the 1st iteration (N − 1 = 599) are evaluated to find the pair with the minimum within-cluster variance; this pair is combined to form the second cluster

• The process continues until all zip codes are grouped into a single cluster


[Map: the highlighted pair of contiguous zip codes]

• The highlighted pair of zip codes produce the smallest within-cluster variance of any pair of contiguous zip codes

Territorial Cluster Analysis Using Ward’s Method


[Map: zip codes progressively merged into clusters]

• The process continues combining zip codes into clusters until all zip codes are combined into a single cluster

Territorial Cluster Analysis Using Ward’s Method



[Chart: Percentage of Total Variance Explained by Within-Cluster Variance (ESS/TSS) vs. Number of Territories, 1–30]

• Ward's Method does not explicitly optimize the number of territories, but it can provide insight via the percentage of total variance explained by the within-cluster variance

• A common metric used for this evaluation is ESS/TSS

Determining the Number of Territories


• 15.2% of the total variance is explained by the within-cluster variance at 12 territories

• 10.2% of the total variance is explained by the within-cluster variance at 22 territories
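One way to build such an elbow chart programmatically is to cut a Ward tree at several sizes and track ESS/TSS at each cut. The sketch below uses simulated zip-code pure premiums and SciPy's unconstrained Ward linkage; the contiguity constraint used on these slides would require custom code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Simulated smoothed pure premiums: 60 zip codes drawn from three loss regimes
pp = np.concatenate([rng.normal(m, 5, 20) for m in (100, 150, 220)]).reshape(-1, 1)

Z = linkage(pp, method="ward")
tss = ((pp - pp.mean()) ** 2).sum()

ratios = {}
for k in (2, 3, 4):
    lab = fcluster(Z, t=k, criterion="maxclust")
    ess = sum(((pp[lab == j] - pp[lab == j].mean()) ** 2).sum()
              for j in np.unique(lab))
    ratios[k] = ess / tss
    print(k, round(ratios[k], 3))  # ESS/TSS falls as territories are added
```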


Territorial Cluster Analysis Results

[Map: the State X zip-code grid labeled with final territory assignments (territories 2, 3, 4, 5, 6, 7, 8, 10, 11, and 12), with the cities West Town, North Center, Star City, Central City, South Shore City, and Old Town marked]


Territory Boundaries Overlaid Against Smoothed PP

[Map: the territory assignments overlaid against the smoothed pure premium shading, lower to higher PP]


• The results of the cluster analysis should be evaluated for:

– Reasonability

– Underwriting and competitive considerations

– Regulatory constraints

• The territories can be used in the context of GLMs or other supervised learning analyses to determine appropriate rating factors and/or further territory refinement.

Further Considerations


• Cluster analysis is a broad field with many possibilities for further exploration

• K-means and hierarchical clustering methods give the practitioner a starting point for expanding their knowledge

• Software packages offer out-of-the-box clustering procedures, but custom programming may be required for sophisticated applications that introduce actuarial considerations and constraints

Summary


Questions


Join Us for the Next APEX Webinar


Final notes

• We'd like your feedback and suggestions – please complete our survey

• For copies of this APEX presentation, visit the Resource Knowledge Center at Pinnacleactuaries.com

Commitment Beyond Numbers

Thank You for Your Time and Attention

Tom Kolde

[email protected]

Linda Brobeck

[email protected]