Lab 4
Make sure that you have completed the previous lab sessions (https://www.scss.tcd.ie/~arwhite/Teaching/STU33011.html) before moving on to this one. Remember to save your commands in an R script.
In this session, we will
• implement hierarchical clustering methods;
• interpret the results using simple plotting and summary statistics;
• explore how to choose the number of clusters.
Olive Oil data
Read the olive oil data into R. (Available at https://www.scss.tcd.ie/~arwhite/Teaching/STU33011/olive.csv, or see last week’s lab. Remember to set the right working directory.) Call the data olive.
Recall that the olive oil data consists of 572 observations of 10 variables, the first two of which are categorical and correspond to the region of origin and the specific area of origin, respectively. The final 8 variables in the data set consist of the percentage composition of 8 fatty acids in the oil.
names(olive)
##  [1] "Region"      "Area"        "palmitic"    "palmitoleic" "stearic"
##  [6] "oleic"       "linoleic"    "linolenic"   "arachidic"   "eicosenoic"
dim(olive)
## [1] 572 10
head(olive)
##   Region         Area palmitic palmitoleic stearic oleic linoleic
## 1  South North Apulia     1075          75     226  7823      672
## 2  South North Apulia     1088          73     224  7709      781
## 3  South North Apulia      911          54     246  8113      549
## 4  South North Apulia      966          57     240  7952      619
## 5  South North Apulia     1051          67     259  7771      672
## 6  South North Apulia      911          49     268  7924      678
##   linolenic arachidic eicosenoic
## 1        36        60         29
## 2        31        61         29
## 3        31        63         29
## 4        50        78         35
## 5        50        80         46
## 6        51        70         44
Let’s focus on the 8 fatty acid values. Perhaps we can uncover a structure similar to the origin information using hierarchical clustering methods.
Exercise
• Create a new data matrix called acids that consists only of the final 8 columns of olive.
Dissimilarity measures
The dist function can be used to create a dissimilarity matrix. Remember to check its help file. This function returns an object of class dist. Although this class is useful when used with some other functions, it is convenient to convert the object to matrix format (i.e., an object of class matrix) if we want to check particular entries.
acids_dis <- dist(acids, method = "euclidean")
acids_dis_mat <- as.matrix(acids_dis)
You can use acids_dis_mat to check the dissimilarity between observations in the usual way, by using the subset commands for matrices covered in earlier labs.
acids_dis_mat[1, 5]
## [1] 72.92462
acids_dis_mat[c(1:5, 331:334), c(1:5, 331:334)]
##             1        2         3        4         5       331       332
## 1     0.00000 158.3667  356.3706 180.0194  72.92462 597.92809 632.41284
## 2   158.36666   0.0000  499.2174 318.3944 139.16178 440.35667 475.04737
## 3   356.37059 499.2174    0.0000 185.7767 391.11379 925.43827 947.22225
## 4   180.01944 318.3944  185.7767   0.0000 208.28106 747.72923 772.68558
## 5    72.92462 139.1618  391.1138 208.2811   0.00000 567.85562 601.30691
## 331 597.92809 440.3567  925.4383 747.7292 567.85562   0.00000  71.09149
## 332 632.41284 475.0474  947.2223 772.6856 601.30691  71.09149   0.00000
## 333 692.17845 534.5157 1011.8562 835.9426 659.41868 102.56218  69.94998
## 334 675.30734 517.5133  993.9799 818.1485 643.46717  89.72179  51.38093
##            333       334
## 1    692.17845 675.30734
## 2    534.51567 517.51328
## 3   1011.85622 993.97988
## 4    835.94258 818.14852
## 5    659.41868 643.46717
## 331  102.56218  89.72179
## 332   69.94998  51.38093
## 333    0.00000  32.32646
## 334   32.32646   0.00000
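To see what dist and as.matrix are doing, it can help to try them on data small enough to check by hand. The following is a minimal sketch on a made-up three-point matrix called toy (a hypothetical example, not part of the olive oil data), comparing the Euclidean and Manhattan measures.

```r
# Toy data (hypothetical): three observations of two variables.
toy <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE)

# dist() returns an object of class "dist".
toy_euc <- dist(toy, method = "euclidean")
toy_man <- dist(toy, method = "manhattan")

# Converting to a full matrix makes individual entries easy to look up.
toy_euc_mat <- as.matrix(toy_euc)
toy_man_mat <- as.matrix(toy_man)

toy_euc_mat[1, 2]   # Euclidean: sqrt(3^2 + 4^2) = 5
toy_man_mat[1, 2]   # Manhattan: |3| + |4| = 7

# Sub-matrix of dissimilarities among observations 1 and 3 only.
toy_euc_mat[c(1, 3), c(1, 3)]
```

The matrix version is symmetric with zeros on the diagonal, which is why the same value appears at positions [1, 2] and [2, 1].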
Exercise
• Compare the Euclidean dissimilarity between observations 1 and 10.
• Compare the Manhattan dissimilarity between observations 1 and 10.
• Compare the dissimilarity between the first five observations from the Sardinia region (specifically, from Inland Sardinia) with the first five observations from the North region (specifically, Umbria). Are oils from the same region more similar? Use any dissimilarity measure you like.
Hierarchical Clustering
Let’s cluster the data. The hclust function takes an object of class dist as its input, and performs a hierarchical clustering algorithm using a specified linkage method. Plotting the output of the function produces a dendrogram.
clust1 <- hclust(acids_dis, method = "average")
plot(clust1)
[Figure: Cluster Dendrogram for hclust (*, "average") applied to acids_dis; vertical axis: Height, from 0 to 800.]
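The effect of the linkage method is easiest to see on a tiny example. The sketch below uses four made-up points on a line (a hypothetical data set called pts) and clusters the same dissimilarities with three linkage methods, so that the merge heights can be verified by hand.

```r
# Toy data (hypothetical): four points on a line.
pts <- matrix(c(0, 1, 3, 7), ncol = 1)
pts_dis <- dist(pts)

# Same dissimilarities, three different linkage methods.
hc_single   <- hclust(pts_dis, method = "single")
hc_complete <- hclust(pts_dis, method = "complete")
hc_average  <- hclust(pts_dis, method = "average")

# Heights at which the successive merges occur.
hc_single$height     # 1, 2, 4
hc_complete$height   # 1, 3, 7
hc_average$height    # 1, 2.5, 5.666667
```

The groups are joined in the same order here, but the heights differ: single linkage uses the smallest dissimilarity between two groups, complete linkage the largest, and average linkage the mean. Calling plot on each object shows how this stretches or compresses the dendrogram.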
Exercise
• Produce a dendrogram of the hierarchical clustering of the olive oil data using single linkage and a Manhattan measure of dissimilarity. Comment on the difference between this plot and that using average linkage and Euclidean distance. Which factor has more influence on the dendrogram, the linkage method or the dissimilarity measure?
The object clust1 contains several pieces of information concerning the clustering results. For example:
head(clust1$merge)
##      [,1] [,2]
## [1,] -450 -451
## [2,] -432 -434
## [3,] -437 -441
## [4,] -438 -444
## [5,] -427 -445
## [6,] -436    4
head(clust1$height)
## [1] 1.000000 3.741657 3.741657 4.898979 5.567764 5.720575
The clust1$merge component records the order in which observations were joined into groups: a negative entry denotes a single observation, while a positive entry denotes the group formed at that earlier step. clust1$height gives the dissimilarity (with respect to the linkage method) between groups at the point they were merged. Further information on both of these components, and others, is provided in the Value section of the hclust help file.
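Reading the merge component takes some practice, so here is a minimal sketch on four made-up points (the object names pts4 and hc4 are hypothetical). With so few observations the whole merge history fits on screen.

```r
# Toy data (hypothetical): four points on a line.
pts4 <- matrix(c(2, 2.5, 6, 10), ncol = 1)
hc4 <- hclust(dist(pts4), method = "single")

# One row per merge step; n observations give n - 1 steps.
hc4$merge
hc4$height   # 0.5, 3.5, 4

# Step 1 joins observations 1 and 2 (both negative entries).
# Step 2 joins observation 3 (entry -3) to the group formed at step 1 (entry 1).
# Step 3 joins observation 4 (entry -4) to the group formed at step 2 (entry 2).
step2 <- hc4$merge[2, ]
step3 <- hc4$merge[3, ]
```

A row such as -436 4 in the olive oil output is read the same way: observation 436 joined the group that was created at step 4.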
Exercise:
• Use clust1$height to find a “recommended” cut off height of h̄ + 3s_h, where h̄ is the mean height at which groups are joined, and s_h is the standard deviation of these heights.
This recommended cut off height is 306.5752243. To add a line to the dendrogram plot at this cut off point, enter:
plot(clust1)
abline(h = 306.5752, lty = 2, col = 2)
[Figure: Cluster Dendrogram for hclust (*, "average") applied to acids_dis, with a dashed horizontal line drawn at the cut off height of 306.5752; vertical axis: Height.]
The abline command adds a line to an already existing plot. The arguments lty and col specify the line type and colour of the line, respectively. (Use ?par to learn more about these and other plotting options.)
What do you think of the recommended cut off height? Does it look like a good choice to split the data into clusters?
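The same mean-plus-three-standard-deviations rule can be tried on any hclust object. The sketch below uses the built-in USArrests data purely as a stand-in (it is not part of this lab, and the names hc_arrests, cut_off and n_clusters are hypothetical), and then counts how many clusters the cut would produce.

```r
# Built-in USArrests data, used here only as a stand-in example.
hc_arrests <- hclust(dist(USArrests), method = "average")

# Mean merge height plus three standard deviations of the merge heights.
cut_off <- mean(hc_arrests$height) + 3 * sd(hc_arrests$height)
cut_off

# Dendrogram with a dashed line in palette colour 2 at the cut off.
plot(hc_arrests)
abline(h = cut_off, lty = 2, col = 2)

# Number of clusters obtained by cutting the tree at this height.
n_clusters <- length(unique(cutree(hc_arrests, h = cut_off)))
n_clusters
```

Counting the clusters alongside the plot gives a quick check on whether the rule is splitting the data sensibly or simply isolating a few late merges.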
Interpreting the clusters
Use the cutree function to split the data into a specific cluster structure. This function takes the hclust object and either a given cut off height for the dendrogram or a pre-specified number of clusters as its arguments:
acids_label1 <- cutree(clust1, k = 10)
acids_label2 <- cutree(clust1, h = 306.5752)
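The difference between the k and h arguments is perhaps clearest on a tree small enough to cut by eye. This is a minimal sketch using four made-up points (pts_cut and hc_cut are hypothetical names), where single linkage gives merge heights of 1, 2 and 4.

```r
# Toy data (hypothetical): four points on a line.
pts_cut <- matrix(c(0, 1, 3, 7), ncol = 1)
hc_cut <- hclust(dist(pts_cut), method = "single")   # merge heights 1, 2, 4

labels_k <- cutree(hc_cut, k = 2)     # ask for exactly two clusters
labels_h <- cutree(hc_cut, h = 1.5)   # cut the tree at height 1.5

labels_k   # 1 1 1 2
labels_h   # 1 1 2 3

# Cluster sizes for the two-cluster solution.
table(labels_k)
```

Cutting at height 1.5 keeps only the merge that happened below that height, so three clusters remain. Asking for k = 2 is equivalent to cutting anywhere between heights 2 and 4.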
To find the points assigned to a given cluster, we can use the following:
which(acids_label1 == 1)
##  [1]   1   3   4   5   6   7  10  11  12  14  19  22  24  25 263 267 271
## [18] 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438
## [35] 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455
## [52] 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472
## [69] 478 479 489 492 493 494 495 496 505 506 507 508 509 510 511 512 513
## [86] 514 515 516 517 518 519 520 521 542 563 565 572
The argument acids_label1 == 1 is a logical statement that checks each element of the vector and returns whether or not its value equals 1 (i.e., TRUE or FALSE). The which function returns the positions of those elements of a vector that satisfy the condition given as its argument.
Once we have chosen a specific clustering of the data, we should plot our results. Unfortunately, in R the default number of colours is 8, so we have to add some extra colours to the default palette:
palette(rainbow(10))
plot(acids[, 1], acids[, 2], col = acids_label1)
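As a small illustration of the two steps, the sketch below applies the same idea to a short made-up label vector (toy_labels is a hypothetical name), so that the logical vector and the resulting positions can be compared directly.

```r
# Toy cluster labels (hypothetical) for five observations.
toy_labels <- c(1, 2, 1, 3, 1)

is_one <- toy_labels == 1   # TRUE FALSE TRUE FALSE TRUE
idx <- which(is_one)        # positions of the TRUE values: 1 3 5

sum(is_one)                 # number of observations in cluster 1: 3

# Sizes of all clusters at once.
table(toy_labels)
```

Because TRUE counts as 1 when summed, sum gives the size of a single cluster, while table gives the sizes of all of them.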
[Figure: scatter plot of acids[, 2] against acids[, 1], with points coloured by cluster label.]
pairs(acids, col = acids_label1)
[Figure: pairs plot of the 8 fatty acid variables (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic), with points coloured by cluster label.]
The palette(rainbow(10)) command creates a new colour palette that consists of 10 (hopefully) distinct elements, which is then used in all further plot commands.
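The reason the palette matters is that R recycles colours once a numeric colour index exceeds the palette length, so with too few colours two different clusters would share one. A minimal sketch of checking and changing the palette follows (the names old_pal, n_before and n_after are hypothetical).

```r
# Remember the current palette so that it can be restored afterwards.
old_pal <- palette()
n_before <- length(old_pal)

# Switch to a palette of 10 rainbow colours.
palette(rainbow(10))
n_after <- length(palette())   # 10

# One large point per palette entry, to inspect how distinct the colours are.
plot(1:10, rep(1, 10), col = 1:10, pch = 19, cex = 3)

# Restore the previous palette.
palette(old_pal)
```

Restoring the old palette is optional in a lab script, but it avoids surprising colours in later, unrelated plots.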
Standardising data
Standardizing the data prior to calculating the dissimilarity matrix can sometimes dramatically affect results. To standardize acids we need to divide each variable by its standard deviation over all observations. To do this we can use the functions apply and sweep.
The function apply performs an operation on a matrix successively across its rows or columns:
acid_sd <- apply(acids, 2, sd)
acid_sd
##    palmitic palmitoleic     stearic       oleic    linoleic   linolenic
##   168.59226    52.49436    36.74494   405.81022   242.79922    12.96870
##   arachidic  eicosenoic
##    22.03025    14.08330
The first argument to apply is the matrix to apply the operation over. The second argument specifies that the operation is performed over successive columns (if we wanted the operation performed over successive rows we would replace the 2 with a 1). The final argument specifies the operation to be performed.
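The role of the second argument is easy to check on a matrix small enough to work out by hand. The sketch below uses a made-up 2 x 3 matrix (m_small is a hypothetical name) and also shows that the final argument can be a function we write ourselves.

```r
# Toy matrix (hypothetical): 2 rows, 3 columns, filled column by column.
m_small <- matrix(1:6, nrow = 2)
m_small

col_max <- apply(m_small, 2, max)   # over columns: 2 4 6
row_sum <- apply(m_small, 1, sum)   # over rows: 9 12

# The operation can also be a user-defined function, here the column range.
col_range <- apply(m_small, 2, function(x) max(x) - min(x))   # 1 1 1
```

The result has one entry per column when the second argument is 2, and one entry per row when it is 1.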
Exercise
• Use the apply function to find the column means of acids.
In order to divide each column in acids by its standard deviation we can use the function sweep. This function returns an altered version of the matrix acids in which the relevant summary statistic has been “swept” out.
standard_acids <- sweep(acids, 2, acid_sd, "/")
head(standard_acids)
##   palmitic palmitoleic  stearic    oleic linoleic linolenic arachidic
## 1 6.376331   1.4287248 6.150508 19.27748 2.767719  2.775915  2.723528
## 2 6.453440   1.3906255 6.096078 18.99656 3.216650  2.390371  2.768920
## 3 5.403569   1.0286818 6.694800 19.99210 2.261128  2.390371  2.859704
## 4 5.729800   1.0858308 6.531512 19.59537 2.549432  3.855437  3.540586
## 5 6.233975   1.2763275 7.048590 19.14935 2.767719  3.855437  3.631371
## 6 5.403569   0.9334335 7.293522 19.52637 2.792431  3.932546  3.177449
##   eicosenoic
## 1   2.059177
## 2   2.059177
## 3   2.059177
## 4   2.485214
## 5   3.266281
## 6   3.124269
In the above function, the second argument again specifies that the operation is performed over columns (use a 1 for rows), while the acid_sd and "/" arguments specify that the columns are to be divided by the standard deviations calculated earlier.
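A quick way to confirm that sweep has done what we intended is to recompute the standard deviations afterwards. The following minimal sketch does this for a made-up two-column matrix (m_two, two_sd and m_two_std are hypothetical names).

```r
# Toy matrix (hypothetical): columns (1, 2, 3) and (10, 20, 30).
m_two <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Column standard deviations: 1 and 10.
two_sd <- apply(m_two, 2, sd)

# Divide each column by its own standard deviation.
m_two_std <- sweep(m_two, 2, two_sd, "/")
m_two_std

# After standardizing, every column has standard deviation 1.
std_check <- apply(m_two_std, 2, sd)
std_check
```

Both columns end up as 1, 2, 3: the second variable no longer dominates simply because it was recorded on a larger scale, which is the motivation for standardizing before computing dissimilarities.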
Exercise
• Use the sweep function to create a centered version of acids whereby each column has mean 0.
In this case, and quite often in general, there is an already existing command that centers and standardizes the data in a single step:
acid_scale <- scale(acids, center = TRUE, scale = TRUE)
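It is worth checking how scale relates to the sweep approach. The sketch below, on a made-up matrix (all object names are hypothetical), suggests two things: centering does not change the Euclidean dissimilarities, and passing the standard deviations explicitly reproduces the divide-only version.

```r
# Toy matrix (hypothetical): columns (1, 2, 3) and (10, 20, 30).
m_toy <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
toy_sd <- apply(m_toy, 2, sd)

# Divide-only standardization, as with sweep in this lab.
m_div <- sweep(m_toy, 2, toy_sd, "/")

# Centered and standardized version.
m_scaled <- scale(m_toy, center = TRUE, scale = TRUE)
colMeans(m_scaled)       # both 0
apply(m_scaled, 2, sd)   # both 1

# Centering shifts every observation equally, so distances are unchanged.
same_dist <- isTRUE(all.equal(as.numeric(dist(m_div)),
                              as.numeric(dist(m_scaled))))

# Divide-only version via scale(): supply the standard deviations explicitly.
m_div2 <- scale(m_toy, center = FALSE, scale = toy_sd)
same_div <- isTRUE(all.equal(m_div, m_div2, check.attributes = FALSE))
```

Note that scale(x, center = FALSE, scale = TRUE) divides by the root mean square of each column rather than its standard deviation, which is why the standard deviations are supplied explicitly in the last step.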
Exercise
• Perform a hierarchical cluster analysis on the standardized version of acids.
Exercise
• The faithful dataset should be loaded in R by default. Perform a hierarchical cluster analysis on this data set. Is it appropriate to scale the data before doing so?