Lab 4

Make sure that you have completed the previous lab sessions (https://www.scss.tcd.ie/~arwhite/Teaching/STU33011.html) before moving on to this one. Remember to save your commands in an R script.

In this session, we will

• implement hierarchical clustering methods;
• interpret the results using simple plotting and summary statistics;
• explore how to choose the number of clusters.

Olive Oil data

Read the olive oil data into R. (Available at https://www.scss.tcd.ie/~arwhite/Teaching/STU33011/olive.csv, or see last week’s lab. Remember to set the right working directory.) Call the data olive.

Recall that the olive oil data consists of 572 observations of 10 variables, the first two of which are categorical and correspond to the region of origin and the specific area of origin, respectively. The final 8 variables in the data set consist of the percentage composition of 8 fatty acids in the oil.

names(olive)

## [1] "Region"      "Area"        "palmitic"    "palmitoleic" "stearic"
## [6] "oleic"       "linoleic"    "linolenic"   "arachidic"   "eicosenoic"

dim(olive)

## [1] 572  10

head(olive)

##   Region         Area palmitic palmitoleic stearic oleic linoleic
## 1  South North Apulia     1075          75     226  7823      672
## 2  South North Apulia     1088          73     224  7709      781
## 3  South North Apulia      911          54     246  8113      549
## 4  South North Apulia      966          57     240  7952      619
## 5  South North Apulia     1051          67     259  7771      672
## 6  South North Apulia      911          49     268  7924      678
##   linolenic arachidic eicosenoic
## 1        36        60         29
## 2        31        61         29
## 3        31        63         29
## 4        50        78         35
## 5        50        80         46
## 6        51        70         44

Let’s focus on the 8 fatty acid values. Perhaps we can uncover a structure similar to the origin information using hierarchical clustering methods.

Exercise

• Create a new data matrix called acids that consists only of the final 8 columns of olive.
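One possible way to approach this exercise is sketched below. To keep the snippet self-contained, it rebuilds only the six rows printed by head(olive) above, under the hypothetical names olive_head and acids_head; with the full data you would index olive itself.

```r
# Toy stand-in for olive: the six rows printed by head(olive) above.
olive_head <- data.frame(
  Region      = rep("South", 6),
  Area        = rep("North Apulia", 6),
  palmitic    = c(1075, 1088, 911, 966, 1051, 911),
  palmitoleic = c(75, 73, 54, 57, 67, 49),
  stearic     = c(226, 224, 246, 240, 259, 268),
  oleic       = c(7823, 7709, 8113, 7952, 7771, 7924),
  linoleic    = c(672, 781, 549, 619, 672, 678),
  linolenic   = c(36, 31, 31, 50, 50, 51),
  arachidic   = c(60, 61, 63, 78, 80, 70),
  eicosenoic  = c(29, 29, 29, 35, 46, 44)
)

# Keep columns 3 to 10 (the fatty acids) and drop Region and Area.
acids_head <- as.matrix(olive_head[, 3:10])

dim(acids_head)
colnames(acids_head)
```

Assuming olive has been read in as described, the analogous call on the full data would be acids <- as.matrix(olive[, 3:10]).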


Dissimilarity measures

The dist function can be used to create a dissimilarity matrix. Remember to check its help file. This function returns an object of class dist. Although this class is useful when used with some other functions, it is convenient to convert this object to matrix format (i.e., an object of class matrix) if we want to check particular entries.

acids_dis <- dist(acids, method = "euclidean")
acids_dis_mat <- as.matrix(acids_dis)

You can use acids_dis_mat to check the dissimilarity between observations in the usual way, by using the subset commands for matrices covered in earlier labs.

acids_dis_mat[1, 5]

## [1] 72.92462

acids_dis_mat[c(1:5, 331:334), c(1:5, 331:334)]

##             1        2         3        4         5       331       332
## 1     0.00000 158.3667  356.3706 180.0194  72.92462 597.92809 632.41284
## 2   158.36666   0.0000  499.2174 318.3944 139.16178 440.35667 475.04737
## 3   356.37059 499.2174    0.0000 185.7767 391.11379 925.43827 947.22225
## 4   180.01944 318.3944  185.7767   0.0000 208.28106 747.72923 772.68558
## 5    72.92462 139.1618  391.1138 208.2811   0.00000 567.85562 601.30691
## 331 597.92809 440.3567  925.4383 747.7292 567.85562   0.00000  71.09149
## 332 632.41284 475.0474  947.2223 772.6856 601.30691  71.09149   0.00000
## 333 692.17845 534.5157 1011.8562 835.9426 659.41868 102.56218  69.94998
## 334 675.30734 517.5133  993.9799 818.1485 643.46717  89.72179  51.38093
##            333       334
## 1    692.17845 675.30734
## 2    534.51567 517.51328
## 3   1011.85622 993.97988
## 4    835.94258 818.14852
## 5    659.41868 643.46717
## 331  102.56218  89.72179
## 332   69.94998  51.38093
## 333    0.00000  32.32646
## 334   32.32646   0.00000
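As a quick sanity check on what dist computes, the sketch below recomputes the dissimilarity between observations 1 and 5 by hand, using the fatty acid values printed by head(olive). The names obs1, obs5 and pair are illustrative only and are not part of the lab.

```r
# Fatty acid values of observations 1 and 5, copied from head(olive).
obs1 <- c(1075, 75, 226, 7823, 672, 36, 60, 29)
obs5 <- c(1051, 67, 259, 7771, 672, 50, 80, 46)

# Euclidean distance by hand: square root of the summed squared differences.
euc_by_hand <- sqrt(sum((obs1 - obs5)^2))

# The same two observations passed through dist() and converted to a matrix.
pair <- rbind(obs1, obs5)
euc_dist <- as.matrix(dist(pair, method = "euclidean"))[1, 2]

# Manhattan distance: the sum of absolute differences.
man_dist <- as.matrix(dist(pair, method = "manhattan"))[1, 2]

euc_by_hand  # about 72.92462, the value of acids_dis_mat[1, 5]
euc_dist     # same value, computed by dist()
man_dist     # 168
```

The Manhattan value is larger here because it adds the absolute differences directly rather than combining them through squares and a square root.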

Exercise

• Compare the Euclidean dissimilarity between observations 1 and 10.

• Compare the Manhattan dissimilarity between observations 1 and 10.

• Compare the dissimilarity between the first five observations from the Sardinia region (specifically, from Inland Sardinia) with the first five observations from the North region (specifically, Umbria). Are oils from the same region more similar? Use any dissimilarity measure you like.

Hierarchical Clustering

Let’s cluster the data. The hclust function takes an object of class dist as its input, and performs a hierarchical clustering algorithm using a specified linkage method. Plotting the output of the function produces a dendrogram.


clust1 <- hclust(acids_dis, method = "average")
plot(clust1)

[Figure: "Cluster Dendrogram" for hclust (*, "average") applied to acids_dis. Vertical axis: Height, with ticks from 0 to 800; the leaves are labelled with the observation numbers.]

Exercise

• Produce a dendrogram of the hierarchical clustering of the olive oil data using single linkage and a Manhattan measure of dissimilarity. Comment on the difference between this plot and that using average linkage and Euclidean distance. Which factor has more influence on the dendrogram, the linkage method or the dissimilarity measure?

The object clust1 contains several pieces of information concerning the clustering results. For example:

head(clust1$merge)

##      [,1] [,2]
## [1,] -450 -451
## [2,] -432 -434
## [3,] -437 -441
## [4,] -438 -444
## [5,] -427 -445
## [6,] -436    4

head(clust1$height)

## [1] 1.000000 3.741657 3.741657 4.898979 5.567764 5.720575

clust1$merge records the order in which observations were joined into groups. clust1$height gives the dissimilarity (with respect to the linkage method) between groups at the point where they were merged. Further information on both these terms, and others, is provided in the Value section of the hclust help file.
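To see how merge and height are read, it can help to run hclust on a data set small enough to check by hand. The sketch below uses five made-up points on a line; the names x, d and hc are illustrative and do not appear elsewhere in the lab.

```r
# Five points on a line (toy data, not the olive oil data).
x <- c(0, 1, 5, 7, 20)
d <- dist(x)                        # Euclidean distance by default
hc <- hclust(d, method = "average")

# Each row is one merge. Negative entries are single observations;
# positive entries refer to the group formed at that earlier row.
hc$merge

# Average-linkage dissimilarity at each merge: 1, 2, 5.5, 16.75.
hc$height
```

Points 1 and 2 are joined first (distance 1), then points 3 and 4 (distance 2). The two pairs are then merged at the average of the four between-pair distances, (5 + 7 + 4 + 6) / 4 = 5.5, and point 5 joins last.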

Exercise:

• Use clust1$height to find a “recommended” cut off height of $\bar{h} + 3s_h$, where $\bar{h}$ is the mean height at which groups are joined, and $s_h$ is the standard deviation of such heights.
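The rule is simply the mean of the merge heights plus three standard deviations. A minimal sketch on a made-up vector of heights (the names heights and h_cut are hypothetical; in the lab you would use clust1$height instead):

```r
# Toy merge heights, for illustration only.
heights <- c(1, 2, 5.5, 16.75)

# Mean height plus three standard deviations of the heights.
h_cut <- mean(heights) + 3 * sd(heights)
h_cut
```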

This recommended cut off height is 306.5752243. To add a line to the dendrogram plot at this cut off point, enter:

plot(clust1)
abline(h = 306.5752, lty = 2, col = 2)

[Figure: the same "Cluster Dendrogram" for hclust (*, "average") applied to acids_dis, now with a dashed horizontal line drawn at height 306.5752.]

The abline command adds a line to an already existing plot. The arguments lty and col specify the line type and colour of the line, respectively. (Use ?par to learn more about these and other plotting options.)

What do you think of the recommended cut off height? Does it look like a good choice to split the data into clusters?

Interpreting the clusters

Use the cutree function to split the data into a specific cluster structure. This function takes the hclust object and either a given cut off height for the dendrogram or a pre-specified number of clusters as its arguments:

acids_label1 <- cutree(clust1, k = 10)
acids_label2 <- cutree(clust1, h = 306.5752)
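The two ways of calling cutree can be compared on a toy example where the tree is easy to work out by hand. The data and the names x, hc, labels_k and labels_h below are made up for illustration.

```r
# Five points on a line; average linkage merges at heights 1, 2, 5.5, 16.75.
x <- c(0, 1, 5, 7, 20)
hc <- hclust(dist(x), method = "average")

labels_k <- cutree(hc, k = 2)   # ask for exactly two clusters
labels_h <- cutree(hc, h = 10)  # cut the tree at height 10

labels_k         # 1 1 1 1 2
labels_h         # 1 1 1 1 2
table(labels_k)  # cluster sizes: 4 and 1
```

Cutting at height 10 keeps every merge made below that height (those at 1, 2 and 5.5) and ignores the final merge at 16.75, which in this example gives the same two clusters as asking for k = 2.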


To find the points assigned to a given cluster, we can use the following:

which(acids_label1 == 1)

##  [1]   1   3   4   5   6   7  10  11  12  14  19  22  24  25 263 267 271
## [18] 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438
## [35] 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455
## [52] 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472
## [69] 478 479 489 492 493 494 495 496 505 506 507 508 509 510 511 512 513
## [86] 514 515 516 517 518 519 520 521 542 563 565 572

The argument acids_label1 == 1 is a logical statement that checks each element of the vector and returns whether or not its value equals 1 (i.e., TRUE or FALSE). The which function returns the positions of those elements of a vector that satisfy the property in its argument.
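The same two steps can be seen on a short made-up vector of labels (the names labels and is_one are illustrative only):

```r
# Toy cluster labels for five observations.
labels <- c(1, 2, 1, 3, 1)

# Element-wise comparison gives a logical vector.
is_one <- labels == 1
is_one         # TRUE FALSE TRUE FALSE TRUE

# which() gives the positions of the TRUE entries.
which(is_one)  # 1 3 5

# Summing a logical vector counts the TRUE entries.
sum(is_one)    # 3
```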

Once we have chosen a specific clustering of the data, we should plot our results. Unfortunately, in R the default number of colours is 8, so we have to add some extra colours to the default palette:

palette(rainbow(10))
plot(acids[,1], acids[,2], col = acids_label1)

[Figure: scatterplot of acids[, 1] (horizontal axis, roughly 600 to 1600) against acids[, 2] (vertical axis, roughly 50 to 250), with points coloured by cluster label.]

pairs(acids, col = acids_label1)


[Figure: scatterplot matrix of the eight fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic), with points coloured by cluster label.]

The palette(rainbow(10)) command creates a new colour palette that consists of 10 (hopefully) distinct elements, which is then used in all further plot commands.
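The effect of palette can be checked without drawing anything. The sketch below, with the illustrative names n_default, n_rainbow and old, counts the colours before and after the change and then restores the original palette; an integer passed to col, such as a cluster label, picks the colour at that position in the current palette.

```r
# Number of colours in the palette currently in use (8 by default).
n_default <- length(palette())

# Switch to 10 rainbow colours; palette() hands back the previous palette.
old <- palette(rainbow(10))
n_rainbow <- length(palette())

# Restore the previous palette so later plots are unaffected.
palette(old)

n_default
n_rainbow
```

With only 8 colours and 10 clusters, labels 9 and 10 would be drawn in recycled colours and so could not be told apart from clusters 1 and 2.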

Standardising data

Standardizing the data prior to calculating the dissimilarity matrix can sometimes dramatically affect results. To standardize acids we need to divide each variable by its standard deviation over all observations. To do this we can use the functions apply and sweep.

The function apply performs an operation on a matrix successively across its rows or columns:

acid_sd <- apply(acids, 2, sd)
acid_sd

##    palmitic palmitoleic     stearic       oleic    linoleic   linolenic
##   168.59226    52.49436    36.74494   405.81022   242.79922    12.96870
##   arachidic  eicosenoic
##    22.03025    14.08330

The first argument to apply is the matrix to apply the operation over. The second argument specifies that the operation is performed over successive columns (if we wanted the operation performed over successive rows we would replace the 2 with a 1). The final argument specifies the operation to be performed.
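The difference between the two settings of the second argument is easiest to see on a tiny made-up matrix (m is an illustrative name):

```r
# A 3 x 2 toy matrix with columns a = (1, 2, 3) and b = (4, 5, 6).
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# Second argument 2: one result per column.
col_sd <- apply(m, 2, sd)    # a = 1, b = 1

# Second argument 1: one result per row.
row_sum <- apply(m, 1, sum)  # 5 7 9

col_sd
row_sum
```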

Exercise

• Use the apply function to find the column means of acids.


In order to divide each column in acids by its standard deviation we can use the function sweep. This function returns an alteration to the matrix acids in which the relevant summary statistic will have been “swept” out.

standard_acids <- sweep(acids, 2, acid_sd, "/")
head(standard_acids)

##   palmitic palmitoleic  stearic    oleic linoleic linolenic arachidic
## 1 6.376331   1.4287248 6.150508 19.27748 2.767719  2.775915  2.723528
## 2 6.453440   1.3906255 6.096078 18.99656 3.216650  2.390371  2.768920
## 3 5.403569   1.0286818 6.694800 19.99210 2.261128  2.390371  2.859704
## 4 5.729800   1.0858308 6.531512 19.59537 2.549432  3.855437  3.540586
## 5 6.233975   1.2763275 7.048590 19.14935 2.767719  3.855437  3.631371
## 6 5.403569   0.9334335 7.293522 19.52637 2.792431  3.932546  3.177449
##   eicosenoic
## 1   2.059177
## 2   2.059177
## 3   2.059177
## 4   2.485214
## 5   3.266281
## 6   3.124269

In the above function call, the second argument again specifies that the operation is performed over columns (use a 1 for rows), while the acid_sd and "/" arguments specify that the columns are to be divided by the standard deviations calculated earlier.
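A small check on made-up data shows what sweeping out the standard deviations does: afterwards every column has standard deviation 1. The names m, col_sd and m_std are illustrative.

```r
# Two columns on very different scales.
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 30))

col_sd <- apply(m, 2, sd)          # a = 1, b = 10

# Divide each column by its own standard deviation.
m_std <- sweep(m, 2, col_sd, "/")

m_std                # both columns are now 1, 2, 3
apply(m_std, 2, sd)  # both equal to 1
```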

Exercise

• Use the sweep function to create a centered version of acids whereby each column has mean 0.

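In this case, and quite often in general, there is an already existing command that does the job. Note that with center = TRUE it also subtracts the column means, which leaves Euclidean and Manhattan dissimilarities between observations unchanged:

acid_scale <- scale(acids, center = TRUE, scale = TRUE)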
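How scale relates to the sweep approach can be checked on a toy matrix. The sketch below (with the illustrative names m, sd_only and centred_scaled) divides by the standard deviations only, as standard_acids does, and compares that with the output of scale.

```r
# Toy data with two columns on different scales.
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 40))

# Divide by the column standard deviations only.
sd_only <- sweep(m, 2, apply(m, 2, sd), "/")

# scale() subtracts the column means as well as dividing by the sds.
centred_scaled <- scale(m, center = TRUE, scale = TRUE)

colMeans(sd_only)         # not zero
colMeans(centred_scaled)  # zero, up to rounding error

# Shifting every observation by the same amount does not change the
# distances between observations, so the dissimilarities agree.
same_dist <- isTRUE(all.equal(as.vector(dist(sd_only)),
                              as.vector(dist(centred_scaled))))
same_dist
```

For clustering purposes the two versions should therefore behave identically, since hclust only sees the dissimilarities.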

Exercise

• Perform a hierarchical cluster analysis on the standardized version of acids.

Exercise

• The faithful dataset should be loaded in R by default. Perform a hierarchical cluster analysis on this data set. Is it appropriate to scale the data before doing so?
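One possible starting point for this exercise is sketched below. The choices of average linkage and k = 2 are assumptions made for illustration, not recommendations, and the object names are hypothetical; the sketch only sets up the comparison and leaves the interpretation to you.

```r
# faithful ships with R: eruption duration and waiting time, both in minutes.
head(faithful)

# The two variables have quite different spreads.
apply(faithful, 2, sd)

# Cluster the raw and the standardized data in the same way.
hc_raw <- hclust(dist(faithful), method = "average")
hc_std <- hclust(dist(scale(faithful)), method = "average")

lab_raw <- cutree(hc_raw, k = 2)
lab_std <- cutree(hc_std, k = 2)

# Cross-tabulate the two labellings to see where they agree.
table(raw = lab_raw, standardized = lab_std)
```

Plotting each tree with plot(hc_raw) and plot(hc_std), and colouring a scatterplot of the data by each set of labels, should help in deciding whether scaling is appropriate here.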
