k-means clustering to group cereals using r


Upload: kevin-bahr

Post on 29-Dec-2015


DESCRIPTION

K-means clustering conducted using the open-source language R. Basic plots and figures illustrate the between-cluster and within-cluster variation after each pass of the k-means algorithm. K-means is then applied to a cereals data set to group cereals by nutritional value, with numerous figures plotted in R.

TRANSCRIPT

Page 1: K-Means Clustering to Group Cereals using R


Statistics 522, Clustering and Affinity Analysis. Assignment 2: K-Means Clustering. Kevin Bahr.

6. Suppose that we have the following data: a(2,0), b(1,2), c(2,2), d(3,2), e(2,3), f(3,3), g(2,4), h(3,4), i(4,4), j(3,5). Identify the clusters by applying the k-means algorithm with k = 2. Use data points a and j as your initial cluster centers.

For this problem we use the data set above (plotted in figure 1) and assign the initial cluster centers as m1 = (2,0) and m2 = (3,5), the values of a and j respectively. The distance of each data point from m1 and m2 is calculated using the Euclidean distance formula and shown in table 1, along with the resulting cluster membership.

Table 1 - First pass, k=2

Figure 1 - Plot of data points

After the first pass, cluster 1 contains points {a,b,c,d} and cluster 2 contains points {e,f,g,h,i,j}. The centroid for cluster 1 changes from (2, 0) to [(2+1+2+3)/4, (0+2+2+2)/4] = (2, 1.5) and the centroid for cluster 2 changes from (3, 5) to [(2+3+2+3+4+3)/6, (3+3+4+4+4+5)/6] = (2.83, 3.83). Since the centroids have moved, we will apply the Euclidean distance formula for each point to the new centroids to see if their cluster membership changes. As seen in table 2, the cluster membership of each point doesn’t change after the second pass and the algorithm terminates. The final plot of the data points and the clusters is seen in figure 2.

Point  Dist. m1  Dist. m2  Cluster
a      0.00      5.10      C1
b      2.24      3.61      C1
c      2.00      3.16      C1
d      2.24      3.00      C1
e      3.00      2.24      C2
f      3.16      2.00      C2
g      4.00      1.41      C2
h      4.12      1.00      C2
i      4.47      1.41      C2
j      5.10      0.00      C2
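The assignment's calculations were done in R; as an illustration, the first assignment pass can be sketched in Python (the data and initial centers are from the exercise; the helper name `assign` is ours):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Exercise 6 data: label -> (x, y)
points = {"a": (2, 0), "b": (1, 2), "c": (2, 2), "d": (3, 2), "e": (2, 3),
          "f": (3, 3), "g": (2, 4), "h": (3, 4), "i": (4, 4), "j": (3, 5)}

def assign(points, centers):
    """Assign each point to its nearest center (0-based cluster index)."""
    return {label: min(range(len(centers)), key=lambda c: dist(xy, centers[c]))
            for label, xy in points.items()}

# Initial centers: m1 = a = (2, 0), m2 = j = (3, 5)
membership = assign(points, [(2, 0), (3, 5)])
cluster1 = sorted(l for l, c in membership.items() if c == 0)
print(cluster1)  # ['a', 'b', 'c', 'd'], matching table 1
```

Running this reproduces the table 1 assignment: {a,b,c,d} to cluster 1 and the remaining six points to cluster 2.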


Table 2 - Second pass, k=2

Figure 2 - Cluster plot, k=2

7. Refer to Exercise 6. Show that the ratio of the between-cluster variation to the within-cluster variation increases with each pass of the algorithm.

The grand mean of the data points, M = (2.5, 2.9) (the means of the x and y values), is used to calculate the between-cluster and within-cluster variation. Mean squared error (MSE) represents the within-cluster variation, while the sum of squares between clusters (SSB) represents the between-cluster variation; the mean square between is MSB. The sum of squared errors (SSE) is the sum of each point's squared distance to its cluster centroid. A pseudo-F statistic is calculated and used to compare the quality of the clustering solutions.

SSB = Σ_{i=1..k} n_i · d(m_i, M)²

Pass 1 = 4 · d((2,1.5), (2.5,2.9))² + 6 · d((2.83,3.83), (2.5,2.9))² = 14.73
Pass 2 = 4 · d((2,1.5), (2.5,2.9))² + 6 · d((2.83,3.83), (2.5,2.9))² = 14.73

MSB = SSB / (k - 1)

Pass 1 = 14.73 / (2 - 1) = 14.73
Pass 2 = 14.73 / (2 - 1) = 14.73

SSE = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, m_i)²

Pass 1 = 0² + 2.24² + 2² + 2.24² + 2.24² + 2² + 1.41² + 1² + 1.41² + 0² = 28
Pass 2 = 1.50² + 1.12² + 0.50² + 1.12² + 1.18² + 0.85² + 0.85² + 0.24² + 1.18² + 1.18² = 10.67

Point  Dist. m1  Dist. m2  Cluster
a      1.50      3.92      C1
b      1.12      2.59      C1
c      0.50      2.01      C1
d      1.12      1.84      C1
e      1.50      1.18      C2
f      1.80      0.85      C2
g      2.50      0.85      C2
h      2.69      0.24      C2
i      3.20      1.18      C2
j      3.64      1.18      C2


MSE = SSE / (N - k)

Pass 1 = 28 / (10 - 2) = 3.50
Pass 2 = 10.67 / (10 - 2) = 1.33

F = MSB / MSE

Pass 1 = 14.73 / 3.50 = 4.21
Pass 2 = 14.73 / 1.33 = 11.05
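The second-pass figures can be checked numerically. A small Python sketch (the assignment itself used R; the helper and variable names here are ours) computes SSB, MSB, SSE, MSE, and F from the exact centroids:

```python
from math import dist

M = (2.5, 2.9)  # grand mean of the 10 points
clusters = [[(2, 0), (1, 2), (2, 2), (3, 2)],                  # cluster 1
            [(2, 3), (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)]]  # cluster 2

def centroid(cluster):
    xs, ys = zip(*cluster)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

cents = [centroid(c) for c in clusters]  # (2, 1.5) and (2.83..., 3.83...)
SSB = sum(len(c) * dist(m, M) ** 2 for c, m in zip(clusters, cents))
SSE = sum(dist(p, m) ** 2 for c, m in zip(clusters, cents) for p in c)
k, N = 2, 10
MSB, MSE = SSB / (k - 1), SSE / (N - k)
F = MSB / MSE
print(round(SSB, 2), round(SSE, 2), round(F, 2))  # 14.73 10.67 11.05
```

Using the exact centroid (17/6, 23/6) rather than the rounded (2.83, 3.83) reproduces the reported SSB of 14.73.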

The key difference from pass 1 to pass 2 is the movement of the centroids closer, as a whole, to the data points within their cluster. The result is a decreased SSE and MSE, which gives a greater pseudo-F statistic and thus a better cluster fit.

8. Once again identify the clusters in the Exercise 6 data, this time by applying the k-means algorithm with k = 3. Try using initial cluster centers as far apart as possible.

The data set could be eyeballed to find the three points furthest apart, but a scalable solution is more appropriate for data sets with more points, so one is explored here. We group the data points into unique sets of three and look for the set spanning the greatest area, which identifies the three points furthest from each other to use as initial centers. With n = 10 data points taken 3 at a time (k = 3), there are C(10,3) = 10! / (3!(10-3)!) = 120 unique combinations.

R code included in the supplementary documentation was used to organize the data points into the 120 unique sets and calculate the area of each using

area = |x_a(y_b - y_c) + x_b(y_c - y_a) + x_c(y_a - y_b)|

(the shoelace expression; the factor of 1/2 is dropped since it does not affect the ranking, so these values are twice the triangle areas). Table 3 shows the top 5 sets of data points with the greatest area.

Table 3 - Top 5 points w/ greatest area

This method gives us two options for the three furthest centroids: {a,b,i} and {a,g,i} both have an area of 8. Points {a,b,i} were chosen, so data points a (2,0), b (1,2), and i (4,4) serve as centroids m1, m2, and m3.

Points  Area
abi     8
agi     8
abj     7
abh     6
aei     6
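The exhaustive search over triples can be sketched in Python as follows (the original used R; `det_area` is our name for the unhalved shoelace determinant, i.e. twice the triangle area, which preserves the ranking):

```python
from itertools import combinations

points = {"a": (2, 0), "b": (1, 2), "c": (2, 2), "d": (3, 2), "e": (2, 3),
          "f": (3, 3), "g": (2, 4), "h": (3, 4), "i": (4, 4), "j": (3, 5)}

def det_area(p, q, r):
    """|x_a(y_b - y_c) + x_b(y_c - y_a) + x_c(y_a - y_b)|: twice the triangle area."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

triples = list(combinations(points, 3))  # C(10, 3) = 120 candidate seed sets
areas = {t: det_area(*(points[l] for l in t)) for t in triples}
best = sorted(areas, key=areas.get, reverse=True)[:5]
print(len(triples), best[:2])  # 120 [('a', 'b', 'i'), ('a', 'g', 'i')]
```

Note that several triples tie at area 6, so the tail of the top-5 list depends on tie-breaking; the two leaders at area 8 are unambiguous.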


Table 4 - First pass, k=3

Table 4 shows that after the first pass, cluster 1 contains point a, cluster 2 contains points {b,c,d,e}, and cluster 3 contains points {f,g,h,i,j}. The centroid for cluster 1 remains at (2, 0), while the centroid for cluster 2 changes from (1,2) to [(1+2+3+2)/4, (2+2+2+3)/4] = (2, 2.25) and the centroid for cluster 3 changes from (4, 4) to [(3+2+3+4+3)/5, (3+4+4+4+5)/5] = (3, 4). Since the centroids have moved, we apply the Euclidean distance formula for each point to the new

centroids to see if their cluster membership changes. After the second pass, the cluster membership of each point does not change and the algorithm terminates. The distances and cluster assignments are shown in table 5 and the cluster plot is shown in figure 3.

Table 5 - Second pass, k=3

Figure 3 - Cluster plot, k=3

Table 4 data (first pass, k=3):
Point  Dist. m1  Dist. m2  Dist. m3  Cluster
a      0.00      2.24      4.47      C1
b      2.24      0.00      3.61      C2
c      2.00      1.00      2.83      C2
d      2.24      2.00      2.24      C2
e      3.00      1.41      2.24      C2
f      3.16      2.24      1.41      C3
g      4.00      2.24      2.00      C3
h      4.12      2.83      1.00      C3
i      4.47      3.61      0.00      C3
j      5.10      3.61      1.41      C3

Table 5 data (second pass, k=3):
Point  Dist. m1  Dist. m2  Dist. m3  Cluster
a      0.00      2.25      4.12      C1
b      2.24      1.03      2.83      C2
c      2.00      0.25      2.24      C2
d      2.24      1.03      2.00      C2
e      3.00      0.75      1.41      C2
f      3.16      1.25      1.00      C3
g      4.00      1.75      1.00      C3
h      4.12      2.02      0.00      C3
i      4.47      2.66      1.00      C3
j      5.10      2.93      1.00      C3
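For completeness, the two-pass convergence for k = 3 can be reproduced end-to-end with a small Lloyd's-algorithm loop (a Python sketch, not the assignment's R code; all names are ours):

```python
from math import dist

pts = {"a": (2, 0), "b": (1, 2), "c": (2, 2), "d": (3, 2), "e": (2, 3),
       "f": (3, 3), "g": (2, 4), "h": (3, 4), "i": (4, 4), "j": (3, 5)}

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    for n_pass in range(1, max_iter + 1):
        labels = {p: min(range(len(centers)), key=lambda c: dist(xy, centers[c]))
                  for p, xy in points.items()}
        new = []
        for c in range(len(centers)):
            members = [points[p] for p, l in labels.items() if l == c]
            new.append((sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members)))
        if new == centers:  # centroids stopped moving: converged
            return labels, centers, n_pass
        centers = new
    return labels, centers, max_iter

# Initial centers m1 = a, m2 = b, m3 = i
labels, centers, n_passes = kmeans(pts, [(2, 0), (1, 2), (4, 4)])
print(n_passes, centers)  # converges on the second pass
```

The loop stops on pass 2 with centroids (2, 0), (2, 2.25), and (3, 4), matching tables 4 and 5.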


9. Refer to Exercise 8. Show that the ratio of the between-cluster variation to the within-cluster variation increases with each pass of the algorithm.

SSB:
Pass 1 = 1 · d((2,0), (2.5,2.9))² + 4 · d((2,2.25), (2.5,2.9))² + 5 · d((3,4), (2.5,2.9))² = 18.65
Pass 2 = 1 · d((2,0), (2.5,2.9))² + 4 · d((2,2.25), (2.5,2.9))² + 5 · d((3,4), (2.5,2.9))² = 18.65

MSB = SSB / (k - 1):
Pass 1 = 18.65 / (3 - 1) = 9.33
Pass 2 = 18.65 / (3 - 1) = 9.33

SSE:
Pass 1 = 0² + 0² + 1² + 2² + 1.41² + 1.41² + 2² + 1² + 0² + 1.41² = 16
Pass 2 = 0² + 1.03² + 0.25² + 1.03² + 0.75² + 1² + 1² + 0² + 1² + 1² = 6.75

MSE = SSE / (N - k):
Pass 1 = 16 / (10 - 3) = 2.29
Pass 2 = 6.75 / (10 - 3) = 0.96

F = MSB / MSE:
Pass 1 = 9.33 / 2.29 = 4.07
Pass 2 = 9.33 / 0.96 = 9.67

The within-cluster variation decreased greatly from the first to the second pass, while the between-cluster variation did not change. The greater value of F in the second pass indicates higher-quality clusters.

10. Which clustering solution do you think is preferable? Why?

Table 6 - Comparing k=2 and k=3

The first clustering solution, with k = 2, is preferable because of its larger pseudo-F statistic. As seen in table 6, the clusters with k = 2 are further apart (higher MSB) while still having a relatively low within-cluster variation (MSE), making the pseudo-F statistic greater and the clustering quality better.

       2nd pass, k=2   2nd pass, k=3
SSB    14.73           18.65
MSB    14.73           9.33
SSE    10.67           6.75
MSE    1.33            0.96
F      11.05           9.67


12. Using all of the variables except name and rating, run the k-means algorithm with k = 5 to identify clusters within the data.

Using R, the cereals data set was modified by removing the name and rating fields, converting the manufacturer and type fields to integers, and standardizing all numeric fields using z-score standardization. To make the results reproducible, a seed of 522 was set in R. The kmeans function was then run on the modified cereals data set with k (centers) equal to 5. Table 7 shows the color-coded variable means across the clusters.

Table 7 - Comparison of Variable Means across Clusters, k=5

                1      2      3      4      5
MANUF_z       -0.63   0.30  -0.19  -0.38   0.23
TYPE_z        -0.12  -0.12  -0.12   0.36  -0.12
CALORIES_z     0.80   0.52  -2.20   0.15  -0.43
PROTEIN_z      0.19   0.60   1.38  -0.79  -0.10
FAT_z          0.00   0.89  -0.33   0.00  -0.59
SODIUM_z       0.45  -0.09   0.17  -0.20   0.06
FIBER_z       -0.08   0.44   3.63  -0.66  -0.25
CARBO_z        0.84  -0.33  -2.09  -0.38   0.50
SUGARS_z       0.04   0.31  -0.79   0.71  -0.62
POTASS_z       0.09   0.70   2.98  -0.70  -0.38
VITAMINS_z     2.70  -0.24  -0.18  -0.18  -0.39
WEIGHT_z       1.58  -0.17  -0.17  -0.17  -0.17
CUPS_z         0.16  -0.05  -2.42   0.37   0.01
SHELF1_z      -0.58  -0.58  -0.58  -0.58   1.02
SHELF2_z      -0.60  -0.37  -0.60   1.63  -0.60
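Z-score standardization rescales each field to mean 0 and standard deviation 1, so no variable dominates the distance calculation. A minimal Python sketch (the assignment used R; the sample calorie values below are hypothetical, not from the cereals data):

```python
from statistics import mean, stdev

def z_score(values):
    """Standardize a column: (x - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

calories = [70, 120, 50, 110, 110, 110, 130]  # hypothetical calorie column
z = z_score(calories)
print([round(v, 2) for v in z])  # [-1.04, 0.69, -1.73, 0.35, 0.35, 0.35, 1.04]
```

In R, the corresponding steps described in the assignment are scale() for the standardization, then set.seed(522) followed by kmeans(data, centers = 5).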

Although included in the clustering, the variables MANUF and TYPE are not discussed in the analysis. MANUF is not an ordinal variable, so it cannot be meaningfully interpreted once converted to numeric and standardized, and TYPE is fairly homogeneous across the clusters. Future analysis could recode the MANUF and TYPE variables as binary indicators.

13. Develop clustering profiles that clearly describe the characteristics of the cereals within the cluster.

Cluster 1 (7 cereals): Get Your Vitamins. Cluster 1 cereals seem designed for people looking to get their vitamin intake from cereal: they have the highest vitamin content and a moderate amount of calories and carbohydrates.

Cluster 2 (19 cereals): Fatten Up. Cluster 2 cereals have the highest fat content of all the clusters. Other characteristics of cluster 2 include a moderate amount of calories,

Page 7: K-Means Clustering to Group Cereals using R

  7  

protein, fiber, sugars, and potassium, while having slightly lower than average carbohydrates and vitamins.

Cluster 3 (3 cereals): Health Nuts. Cereals in cluster 3 have the highest average fiber, potassium, and protein, and the lowest average calories, carbohydrates, and sugars. These are the cereals that someone concerned about their general health, or who works out regularly, would likely consume. The cups mean is also lower than in the other clusters. Cluster 3 contains only 3 cereals, less than 5% of the total.

Cluster 4 (18 cereals): Sugar Lovers. These cereals have the highest sugar content and the lowest protein, potassium, and fiber. They are likely the cereals that parents try to keep away from their children because of their sweet taste and general lack of nutritional value. Sugar cereals are also located more on the second shelf than the first.

Cluster 5 (27 cereals): Carbohydrate Loaders. Cereals in cluster 5 center on carbohydrates, while every other nutritional characteristic is present at a lower rate. These cereals might be good for runners to load up on the day or morning before a race, as they have low sugars, fat, and calories, yet a modest amount of carbohydrates. They are also more likely to be found on the first shelf than the second. Cluster 5 is the largest in the data set, at about 36% of the cereals.

14. Rerun the k-means algorithm with k = 3.

The same process from requirement 12 was followed for requirement 14. The color-coded output of the variable means is shown in table 8.

Table 8 - Comparison of Variable Means across Clusters, k=3

                1      2      3
MANUF_z        0.22  -0.27  -0.63
TYPE_z        -0.12   0.29  -0.12
CALORIES_z    -0.18   0.13   0.80
PROTEIN_z      0.35  -0.83   0.19
FAT_z          0.06  -0.14   0.00
SODIUM_z       0.03  -0.21   0.45
FIBER_z        0.32  -0.67  -0.08
CARBO_z        0.05  -0.39   0.84
SUGARS_z      -0.36   0.78   0.04
POTASS_z       0.32  -0.73   0.09
VITAMINS_z    -0.33  -0.18   2.70
WEIGHT_z      -0.17  -0.17   1.58
CUPS_z        -0.18   0.34   0.16
SHELF1_z       0.21  -0.26  -0.58
SHELF2_z      -0.51   1.31  -0.60


15. Provide values for MSB, MSE, and pseudo-F for each clustering solution. Which clustering solution do you prefer, and why?

Table 9 - Comparing k=5 and k=3

Table 9 shows the values of SSB, MSB, SSE, MSE, and pseudo-F (F) for the k = 5 and k = 3 clustering solutions. The calculations are shown in the attached R documentation, in the "kmeans_stats" function. The k = 5 solution is preferable to k = 3 because of its greater pseudo-F statistic: although the between-cluster variation (MSB) stays nearly the same between solutions, the within-cluster variation drops enough in the k = 5 solution to give a greater pseudo-F statistic.
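The R helper itself is in the supplementary documentation; a hypothetical Python analog of such a kmeans_stats function (names ours, not the original code), demonstrated on the toy k = 3 solution from Exercise 8, might look like:

```python
from math import dist

def kmeans_stats(points, labels, k):
    """Return (MSB, MSE, pseudo-F) for a labeled clustering."""
    N = len(points)
    M = tuple(sum(v) / N for v in zip(*points))  # grand mean of all points
    clusters = [[p for p, l in zip(points, labels) if l == c] for c in range(k)]
    cents = [tuple(sum(v) / len(cl) for v in zip(*cl)) for cl in clusters]
    ssb = sum(len(cl) * dist(m, M) ** 2 for cl, m in zip(clusters, cents))
    sse = sum(dist(p, m) ** 2 for cl, m in zip(clusters, cents) for p in cl)
    msb, mse = ssb / (k - 1), sse / (N - k)
    return msb, mse, msb / mse

pts = [(2, 0), (1, 2), (2, 2), (3, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)]
labels = [0, 1, 1, 1, 1, 2, 2, 2, 2, 2]  # final k = 3 membership from Exercise 8
msb, mse, F = kmeans_stats(pts, labels, 3)
# msb ~ 9.33, mse ~ 0.96, F ~ 9.67, the Exercise 9 second-pass values
```

The same function applied to the standardized cereals matrix would yield the table 9 values.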

Additionally, plotting the within-groups sum of squares for each candidate number of clusters produces a scree plot. From figure 4, we might want to try up to about 10 clusters, after which the decrease in the within-groups sum of squares for each added cluster begins to taper off.
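The scree-plot computation can be sketched as follows, using the Exercise 6 toy data for reproducibility (our Lloyd's-loop helper, not the assignment's R code; in R this is typically a loop over kmeans(data, centers = k)$tot.withinss):

```python
from math import dist

pts = [(2, 0), (1, 2), (2, 2), (3, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)]

def lloyd_wss(points, seeds, iters=50):
    """Run Lloyd's k-means from the given seeds; return the within-groups SS."""
    centers = list(seeds)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[min(range(len(centers)), key=lambda c: dist(p, centers[c]))].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else m
                   for g, m in zip(groups, centers)]
    return sum(min(dist(p, m) ** 2 for m in centers) for p in points)

# Within-groups sum of squares for k = 1..5, seeded with the first k points;
# plotted against k, these values form the scree (elbow) curve.
wss = [lloyd_wss(pts, pts[:k]) for k in range(1, 6)]
print([round(w, 2) for w in wss])
```

On this toy data the curve falls from the total sum of squares at k = 1 down through 6.75 at k = 3, illustrating the diminishing returns the scree plot makes visible.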

16. Develop clustering profiles that clearly describe the characteristics of the cereals within the cluster.

Cluster 1 (46 cereals): Your Average Cereal. Cluster 1 has no nutritional characteristic with a mean greater than |0.36|. Cereals in this cluster have slightly above average protein, fiber, and potassium, and slightly below average sugars, calories, and vitamins.

Cluster 2 (21 cereals): Sugar Lovers. Cluster 2 has similar characteristics to cluster 4 from the k = 5 solution. The negative characteristics are more prominent in this cluster than in the others, as protein, fiber, and potassium all have moderately negative averages. These cereals are still more likely to be found on the second shelf.

Cluster 3 (7 cereals): Get Your Vitamins. It is interesting to note that cluster 3 has the same characteristics as cluster 1 from the k = 5 solution.

17. Use cluster membership to predict rating. One way to do this would be to construct a histogram of rating based on cluster membership alone. Describe how the relationship you uncovered makes sense, based on your earlier profiles.

To predict rating based on cluster membership, a box plot was made for each cluster in the k = 5 solution (chosen for its higher pseudo-F statistic). In figure 5,

       k = 5    k = 3
SSB    452.03   232.95
MSB    113.01   116.47
SSE    642.97   862.05
MSE    9.32     12.14
F      12.13    9.59

Figure 4 - Within-groups sum of squares for cereal data


each cluster is plotted along the x-axis with its box, with RATING_z on the y-axis. Box/cluster 1 matches the Vitamin profile, box 2 the Fatty cereals, box 3 the Healthy cluster, box 4 the Sugar cereals, and box 5 the Carb Loader cereals.

Figure 5 - Box Plot of RATING_z vs. Cluster

It is interesting to note that the sugar cereals (cluster 4) have the most outliers, whereas the other clusters have barely any. The widest range in rating appears in the Carb Loaders cereals (cluster 5) and the narrowest range in the Get Your Vitamins cereals (cluster 1). The cluster profiles, ranked by median quality rating, are as follows:

• 1st – Cluster 3 (Health Nuts) – With only three cereals, this cluster has the greatest nutritional value out of all the clusters, which results in the highest median rating as expected.

• 2nd – Cluster 5 (Carb Loaders) – The most cereals are in this cluster (27) and they have the most average nutritional qualities compared to the other clusters. It is very understandable that cluster 5 is ranked higher than the Fatten Up and Sugar Lovers Cereals. Carb Loaders has a median quality rating just above average while the next three clusters all have lower than average quality rankings.

• 3rd – Cluster 2 (Fatten Up) – The median quality rating for cluster 2 (19 cereals) is slightly below average. Its place in the quality pecking order is understandable compared to the other clusters.

• 4th – Cluster 1 (Get Your Vitamins) – Ranking between the Fatty cereals and the Sugar cereals are the Vitamin cereals (cluster 1). At first glance, one might expect cereals rich in vitamins to rate higher than most others, but these 7 cereals rank below average in quality rating. This might be due to synthetic (man-made) vitamins that the body cannot digest as easily, in which case the added vitamins largely go to waste.


• 5th – Cluster 4 (Sugar Lovers) – The median quality rating of the 18 cereals in cluster 4 ranks the lowest. This ranking is no surprise, as not much nutritional value is found in these cereals. They taste good, though!