practical course using the software ||||| multivariate

Practical course using the software

—————Multivariate analysis of genetic data: exploring

group diversity

Thibaut Jombart

—————

Abstract

This practical course tackles the question of group diversity in geneticdata analysis using [8]. It consists of two main parts: first, how to infergroups when these are unknown, and second, how to use group informationwhen describing the genetic diversity. The practical uses mostly the packagesadegenet [5], ape [7] and ade4 [1, 3, 2], but others like genetics [9] andhierfstat [4] are also required.

1

Contents

1 Defining genetic clusters 31.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . 31.2 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Describing genetic clusters 62.1 Analysis of group data . . . . . . . . . . . . . . . . . . . . . . 72.2 Between-class analyses . . . . . . . . . . . . . . . . . . . . . . 102.3 Discriminant Analysis of Principal Components . . . . . . . . 13

2

1 Defining genetic clusters

Group information is not always known when analysing genetic data. Evenwhen some prior clustering can be defined, it is not always obvious thatthese are the best genetic clusters that can be defined. In this section, weillustrate two simple approaches for defining genetic clusters.

1.1 Hierarchical clustering

Hierarchical clustering can be used to represent genetic distances as trees,and indirectly to define genetic clusters. This is achieved by cutting the treeat a certain distance, and pooling the tips descending from the few retainedbranches into the same clusters (cutree). Load the data microbov, replacethe missing data, and compute the Euclidean distances between individuals(functions na.replace and dist). Then, use hclust to obtain a hierarchicalclustering of the individual which forms strong groups (choose the rightmethod(s)).

488

503 49

651

549

749

552

548

252

049

049

4 493

274

485

478

510

484

487

508

486

489

269

521

261

650 4

76 492

244

481

514 39

150

935

264

231

436

265

443

363

1 271

273

275

248

686 25

225

125

824

567

770

464

967

135

026

564

425

640

741

124

042

933

524

739

568

068

225

727

836

4 266

238

260 23

527

2 383

343

373

694

702 36

052

869

970

030

452

624

969

770

166

366

768

3 665

655

698

678

695

659

688 24

353

937

856

528

154

868

468

927

970

366

066

165

769

3 674

679

384

369

388

396

405

393

394

409

691

353

681

367

504 52

457

550

267

363

263

864

664

863

063

462

664

031

331

731

8 295

296

322

310

324

302

309

289

319

320

312

298

299

307

286

297 31

528

829

430

832

532

745

729

052

754

4 516

522

459

507

501

519

499

270

450

470

573

563

566 51

254

949

152

933

454

1 241

545

232

268

557

348

374

337

696

547

400

408

375

542

398

357

372 38

752

354

653

055

255

633

934

6 572

532

536

355

276

427

446

342

356 3

4935

936

547

950

048

051

769

066

266

466

666

967

667

568

765

869

267

065

667

268

525

926

229

262

733

034

1 328

282

323

300

301

303

305 4

83 511 3

77 471

333

332

498 347

381

647

651

412

283

284

242

403

385

344

389

533

653

236

368

543

306

316

293

311

285

326 32

128

729

137

055

853

555

956

1 379

386

554

540

560 60

735

135

857

153

456

8 537

538 53

157

058

459

558

861

558

160

059

260

8 596 62

459

759

9 590

619

587

594

618

617

586

579

613

603

611 60

461

457

657

761

258

960

562

358

043

661

060

162

2 602

578

593

472

462

438

475 38

242

542

643

443

947

444

946

446

740

245

442

040

441

741

041

6 505

567

431

463

458

451

461 59

860

658

262

1 585 60

958

361

639

244

542

159

123

464

1 574

637

629

633 40

664

5 643

652 3

71 555

432

444

465 4

4844

145

344

746

055

363

646

662

825

362

539

039

941

341

940

141

4 415

418

338

397 38

062

033

144

243

042

243

744

343

545

642

345

5 468

469

440

452

473

424

639

361

428

336

366

513

506

518 23

334

537

6 263 63

555

156

256

424

655

056

926

436

323

723

934

035

426

747

725

025

425

528

066

8 277 32

902

903

8 085

019

004

034

050

080

017

062

026

071

061

066

030

049

012

013 07

608

7 001

002

016

092

095

055

060

081

056

059

005

011

014

090

098

083

089

065

063

078

068

075

051

053 0

52 057

020

069

084 0

77 100

067

070

096

097

099

079

093

091

086

088

048

064

072

094

054

058

082

073

074

128

130

110

138

122

123 13

715

010

311

914

914

214

614

513

313

513

413

612

411

211

612

011

711

812

512

913

120

217

611

514

112

610

210

713

914

010

110

913

214

7 108

105

113

151

174

178

169

179

218

207

219 16

022

115

703

117

504

604

503

504

704

204

303

304

403

603

903

204

000

317

002

320

322

602

402

5 021 04

1 208

223

205

216

197

209

195

183

184

220

231

222

224 19

119

4 217

198

214

229

008

009 03

701

802

202

702

815

916

818

722

818

019

0 156

154

210 2

0100

723

016

316

116

4 166

155

162

200

213

182

227

206

185

212 1

99 204

173

177

181

152

172

196

186

211

148

189

158

171 15

316

516

719

219

301

000

601

5 121 21

512

714

414

310

611

110

411

4 188 22

5

23

45

6

Cluster Dendrogram

secret clustering methodD

Hei

ght

Cut the tree into two groups. Do these groups match any prior clusteringof the data (see content of microbov$other)? Remember that table can beused to build contingency tables. Accordingly, what is the main componentof the genetic variability in these cattle breeds?

Repeat this analysis by cutting the tree into as many clusters as there arebreeds in the dataset (this can be extracted by pop). Build a contingency ta-

3

ble to see the match between inferred groups and breeds. Use table.value

to represent the result, then interprete it:

Borgou

Zebu

Lagunaire

NDama

Somba

Aubrac

Bazadais

BlondeAquitaine

BretPieNoire

Charolais

Gascon

Limousin

MaineAnjou

Montbeliard

Salers

grp

1

grp

2

grp

3

grp

4

grp

5

grp

6

grp

7

grp

8

grp

9

grp

10

grp

11

grp

12

grp

13

grp

14

grp

15

5 15 25 35 45

Can some groups be identified to species? to breeds? Are some speciesmore admixed than others?

1.2 K-means

K-means is another, non-hierarchical approach for defining genetic clusters.It relies on the usual ANOVA model for multivariate data X ∈ Rn×p:

VAR(X) = B(X) + W(X)

where VAR, B and W correspond to, respectively, the total variance, thevariance between groups, and the variance within groups. K-means thenconsists in finding groups that will minimize W(X). Using the functionkmeans, find 15 genetic clusters in the microbov data; how do the inferredclusters compare to the breeds? Reproduce the same figure as before forthese results; the figure should ressemble:

4

Borgou

Zebu

Lagunaire

NDama

Somba

Aubrac

Bazadais

BlondeAquitaine

BretPieNoire

Charolais

Gascon

Limousin

MaineAnjou

Montbeliard

Salers

grp

1

grp

2

grp

3

grp

4

grp

5

grp

6

grp

7

grp

8

grp

9

grp

10

grp

11

grp

12

grp

13

grp

14

grp

15 5 15 25 35 45 55

What differences can you see? What are the species which are easily genet-ically identified using K-means?

Look at the output of K-means clustering, and try to identify the sumsof squares corresponding to the variance partition model. How do thesebehave when increasing the number of clusters? Using sapply, retrieve theresults of several K-means where K is increased gradually. Plot the results;one example of obtained result would be:

5

●

●

●

●

●

●

●

●

●

●

10 20 30 40 50

1400

1600

1800

2000

2200

2400

2600

2800

K

Bet

wee

n su

m o

f squ

ares

How can you interprete this observation? Can B(X) or W(X) be usedfor selecting the best K?

Deciding of the number of optimal clusters (K) to be retained is oftentricky. This can be achieved by computing K-means solution for a range ofK, and then selecting the K giving the best Bayesian Information Criterion(BIC). This is achieved by the function find.clusters. In addition, thisfunction orthogonalizes and reduces the data using PCA as a prior step toK-means. Use this function to search for the optimal number of clustersin microbov. How many clusters would you retain? Compare them to thebreed information: when 8 clusters are retained, what are the most admixedbreeds?

Repeat the same analyses for the nancycats data. What can you sayabout the likely profile of admixture between these cat colonies?

2 Describing genetic clusters

There are different ways of using genetic cluster information in the multi-variate analysis of genetic data. The methods vary according to the purposeof the study: describing the differences between groups, describing how in-dividuals are structured in the different clusters, etc.

6

2.1 Analysis of group data

Sometimes we are just interested in the diversity between groups, and notinterested by variations occuring within clusters. In such cases, we can anal-yse data directly at a population level. The first step doing so is computingallele counts by populations (using genind2genpop). These data can thenbe:

? directly analysed using Correspondance Analysis (CA, dudi.coa), whichis appropriate for contingency tables

? translated into scaled allele frequencies (makefreq or scaleGen), andanalysed by PCA (PCA, dudi.pca)

? used to compute genetic distances between populations (dist.genpop)which are in turn summarised by Principal Coordinates Analysis (PCoA,dudi.pco)

Try applying these approaches to the dataset microbov, and comparethe obtained results.

d = 0.5

Borgou

Zebu

Lagunaire

NDama Somba

Aubrac

Bazadais BlondeAquitaine

BretPieNoire Charolais Gascon Limousin MaineAnjou Montbeliard

Salers

Eigenvalues

microbov − Correspondance Analysis

7

d = 0.5

Borgou

Zebu

Lagunaire

NDama

Somba

Aubrac

Bazadais

BlondeAquitaine BretPieNoire Charolais

Gascon Limousin

MaineAnjou Montbeliard

Salers

Eigenvalues

microbov − Principal Component Analysis

d = 0.1

Borgou

Zebu

Lagunaire

NDama Somba

Aubrac

Bazadais

BlondeAquitaine

BretPieNoire Charolais

Gascon Limousin

MaineAnjou Montbeliard

Salers

Eigenvalues

microbov − Principal Coordinates AnalysisEdwards distance

For the PCoA, try different Euclidean distances (see dist.genpop), andcompare the results. Does the genetic distance employed seem to matter?

We will leave the cattle at a rest for now, and switch to Human sea-sonal influenza, H3N2. The dataset H3N2 contains SNPs derived from 1903

8

hemagglutinin RNA sequences. Obtain allele counts per year of sampling(see content of H3N2$other) using genind2genpop, and compute i) Ed-ward’s distance ii) Reynolds’ distance and iii) Roger’s distance betweenyears. Perform the PCoA of these distances and plot the results; here is aresult for one of the requested distances:

d = 0.1

2001

2002 2003

2004

2005

2006

Eigenvalues

H3N2 − PCoA − secret distance

What is the meaning of the first principal components? Are the resultscoherent between the three distances? To have yet another point of view,perform the PCA of the individual data, and represent group informationwhen plotting the principal components using s.class.

9

d = 2

●●

●●●●

●

●

●

●●

●●●●

●●

●

●●●

●

●●●●●●●

●●

●

●●●●

●●●●●

●●

●

●●●● ●●

●●●●

●●●

●●

●●●

●●

●●

●●●●●

●●

●●●●●

●●●

●

●●●

●

●

●

●●

●●

●

●

●

●●

●

●●●●●●

●

●

●

●●●●

●●

●●●●●

●●

●●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●●●●

●●

●

●

●

●

●

●

●●●●●●●

●●●

●

●●

●

●●

●●

●

●●●

●

●●●

●

●

●●

●

●

●

●

●

●●●

●●

●● ●

●

●● ●●

●●

●

●

●

●●

●

●

●●

●●●

●●

●●●

●●

●●●●

●

●●●

●

●

●●

●●

●

●

●

●●●

●

●

●●

●

●●

●

●●●

●

●●

●●

●

●

●

● ●

●●

●●●●

●

●●●●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●●●●●●●●●

●●●●●●●●●

●●

●

●

●

●●●

●●●●

●

●●

●

●●

●

●●

●

●●

●

●

●●

●

●●

●

●●●

●

●

●●●

● ●

●

●●

●

● ●

●●●●

●

●●

●●●

●

●

●●●

●●●●

●

●

●

●●

●

●●●●●●

●●

●●●

●●

●

●●

●●

●

●

●

●

●

●●

●

●●●●●●●●

●

●●●

●

●

●

●●●

●●

●

●●

●

●

●●

●

●●

●●

●●

●●

●●●●

●

●

●● ●

●

●●

●

●

●●

●

●

●●●

●●

●●

●

●

●

●●

●

●

●

●●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●●

●●

●●●●

●

●

●●●

●

●●●

●

●

●

●

●

●●

●●

●●●●

●●

●●

●

●

●

●●

●

●

●

●

●

●●● ●

●

●

●

●●

●

●●

●●●

●●

●

●

●●

●●

●

●●

●●

●●

●

●●●

●

●

●

●●●●

●

●

●

●

●

●●

●●●●

●

●●●

●

●

●

●

●

●

●

●

●

●

●●●●

●

●●●

●●●

● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●●

●

●

●●

●● ●●●●

●

●●●●●●

●●

●

●●●

●●● ●

●●●●●●●●

●

●

●

●●

●●●● ●

●

●●

●

●●

●

●

●

●

●

● ●●

●

●

●

●●●

●

●

●

●

●●●

●

●●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●●●

●●●●

●

●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●

●

●

●

●

●

●●●●

●●● ●

●●

●●

●●

●

●●●●

●

●

●

●

●

●

●●●●●●●●●●●

●●

●

●●●●

●

●●●

●●

●●

●

●

●

●●●●

●

●

●

●●●●●

●●

●

●

●

●●●

●●●

●

●●●●●●

●●●

●●

●

●

● ●●●●

●

●●●●

●

●

●

●

●

●●

●

●●

●●●

●●

●

●●

●

●

●

●●

●●●

●

●

●●●●

●

●

●●

●

●●●●●●●●

●●●●

●●●●●●●●●●

●●● ●●●●●●●●●●●●●●●●●

●●●●●●

●

●

●●

●

●●

●

●●●

●●●

●●●

●●●

●

●●

●

●●●

●●●●

●

●●

●

●●●

●

●●●●●●●

●●●

●●

●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●●

●

●●

●●

●

●

●

●

●

●●

●●●●●

●

●

●●●

●●●●

●●●

●

●

●●●

●

●●●

●

●●●●

●

●

●●●●

●

●●

●

●

●

●● ●●

●●

●

●

●●

●

●●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●●●

●

●

●●

●●

●

●

●

●●

●●

●●

●● ●

●●●●

●●

●●●●

●

●●●●

●

●●●●

●●

●

●

●●

●

●● ●●

●

●●

●

●●

●

●

●●●●

●●●●●●●●

●●●

●●●●

●

●

●

●

●

●

●●●

●

●

●

●

●

●●●●●

●●●●●●

●

●●●●●●●

●●●

●

●●

●●●●●●●

●

●

●●

●●●

●

●

●

●●●●●

●●●

●●●

●

●

●●

●

●●●●

●

●●●●

●●

●

●

●

●

●

●

●

●

●●●

●

●

●●

●

●●

●

●●●●●

●

●●●●

●●

●

●●●

●

●

●●●●●

●● ●

●

●●●

●

●●●●

●●

●

●●●

●●●

●●●●●

●●●●●●●●

●

●

●●

●

●

●

●●

●●●

●

●

●●●

●

●

●●

●●●

●

●●

●●●●●

●●●●●

●

●

●●

●●

●

●

●

●●

●●

●

●●

●●

●

●

●●● ●●

●●

●

●●●●●

●

●

●●

●●

●●●●

●

●●●●

●●●

●

●

●●●

●

●

●

●●●

●●

●

●●●

●●●

●●●

●●●

●

●●●●●●

●

●

●●

●

●●

●

●

●

●●●●●

●●

●

●

●●

●

●●●●●

●

●●●●●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●●

●

●

●

●

●

●●●●

●●

●

●●

●●●●

●

●

●

●

●

●●

●

●●

●

●

●●●●●

●●

●●●●●

●

●

●

●●

●

●●●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●

●

● ●

●●●●●

●

●●

●●●

●

●●

●

●

●●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●●●

●

●

●

●●

●

●

● ●●

●

●

●

●●

●

●

●

●

●●●●●

●

●

●●●●●●●

●

●

●●

●

●

●●

●● ●

●

●●

●

●●●

●

●●

●

●●●●

●

●

● ●

●

●

●●

●●●

●

●●

●●

●●

●●●●

●●●

●

●

●

●●●

●●●●●●●●●●●

●●●●●●●●●●●

●

●●●

●●●●●●●●●●●●●●●●●

●●●●●

●

●●

●

●●

●●●●●●●● 2001

2002

2003

2004

2005

2006

Eigenvalues

What can we say about the genetic evolution of influenza between years2001-2006? Can we safely discard information about individual isolates,and just work with data pooled by year of epidemic?

2.2 Between-class analyses

In many situations, discarding information about the individual observa-tions leads to a considerable loss of information. However, basic analysesworking at an individual level –mainly PCA– are not optimal in terms ofgroup separation. This is due to the fact that PCA focuses on the totalvariability, while only variability between groups should be optimized.

Let us remember the basic multivariate ANOVA model (P being the pro-jector onto dummy vectors of group membership H: P = H(HTDH)−1HTD):

X = PX + (I−P)X

which leads to the variance partition seen before with the K-means:

VAR(X) = B(X) + W(X)

with:

? VAR(X) = trace(XTDX)

? B(X) = trace(XTPTDPX)

10

? W(X) = trace(XT (I−P)TD(I−P)X)

where D is a metric (in PCA, it can be replaced by 1/n) and I is the identitymatrix.

We can verify that ordinary PCA decomposes the total variance:

> X <- scaleGen(H3N2, miss = "mean", scale = FALSE)> pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 2)> sum(pca1$eig)

[1] 15.05856

> D <- diag(1/nrow(X), nrow(X))> sum(diag(t(X) %*% D %*% X))

[1] 15.05856

Interestingly, it is possible to modify a multivariate analysis so as todecompose B(X) or W(X) instead of VAR(X). Corresponding methods arerespectively called between-class (or inter-class) and within-class (or intra-class) analyses. If the basic method is a PCA, then we will be performingbetween-class PCA and within-class PCA. These methods are implementedby the functions between and within. Here is an example of how to performthe between-class PCA:

> f <- H3N2$other$epid> bpca1 <- between(pca1, fac = factor(f), scannf = FALSE, nf = 3)

Verify that the sum of eigenvalues of this analysis equates the variancebetween groups (B(X)); for this, you will need to compute H, and then P.This requires a few matrix operations:

? A %*% B: multiplies matrix A by matrix B

? diag: either creates a diagonal matrix, or extract the diagonal of anexisting square matrix

? t: transposes a matrix

? ginv: inverses a symmetric matrix

To obtain H, the matrix of dummy vectors of group memberships, do:

> H <- as.matrix(acm.disjonctif(data.frame(f)))

You should find that B(X) is exactly equal to the sum of the eigenvaluesof the between-class analysis.

Repeat the same approach for the within-class analysis, and confirm thatthe variance within groups is fully decomposed by the within-class PCA.

Now that you have checked that we can associate multivariate methods toeach component of the variance partitioning into groups, investigate furtherthe results of the between-class PCA.

11

> par(bg = "grey")> s.class(bpca1$ls, factor(f), col = rainbow(6))> add.scatter.eig(bpca1$eig, 2, 1, 2)> title("H3N2 - Between-class PCA")

d = 2

●●

●

●●●●

●

●

●●

●●●●

●●

●

●●●

●

●●●●●●●

●●

●●●●●

●●●●●

●●●

●●●

●●●●●

●●●●

●●

●●

●●●

●●

●●●●●

●●

●●●●●

●●●●

●

●●●

●

●

●

●●

●●

●

●●

●●

●

●●●●●●

●

●

●

●●●●

●●

●●●●●

●●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●●

●●●

●

●

●

●

●

●

●

●●●●●●●

●

●●

●

●●

●

●●

●●

●

●●●

●

●●●

●

●

●●

●

●

●

●

●

●●●

●●

●● ●

●●●

●●

●

●

●

●

●

●●

●

●

●●

●●●

●●

●●

●

●●

●●●●

●

●●●

●

●

●

●●●

●

●

●

●●●

●

●

●●

●

●

●

●

●●●

●

●●

●●

●

●

●

●●●●

●●●●●

●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●●●●●●●●●

●●●●●●●●●

●

●

●

●●

●

●●

●●●●

●

●

●

●

●●

●

●●

●

●●

●

●

●●

●

●●

●

●●●

●

●

●●●

●●●

●●

●● ●

●●●●

●

●●

●●●

●

●

●●●●●●●

●

●

●

●

●

●

●●●

●

●●

●●

●●●

●●

●

●●

●●

●

●

●

●

●

●●

●

●●●●●●●●

●

●●●

●

●

●

●●●

● ●

●

●●

●

●

●●●

●●

●●

●●

●●

●●●● ●

●

●●●

●

●●

●

●

● ●

●

●

●●●

●●

●●●

●●

●●

●

●

●

●●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●●

● ●

●●●●

●

●

●●●

●

●●●

●

●

●

●

●

●●

●●

●●●●

●●

●●

●

●

●

●●

●

●

●

●

●

●●● ●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●● ●●

●●

●

●●●

●

●

●

●●●●

●

●

●

●

●

●●

●●●●

●

●●●

●

●

●

●

●

●●

●

●

●

●●●●

●

●●●

●●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●●

●●●

●●●●

●●●●●

●

●●

●

●●●

●●

● ●●●●●●●●●

●

●

●

●●

●●●● ●

●

●

●

●

●●

●

●

●●●

● ●●

●

●

●

●●●

●

●

●

●● ●

●

●

●

●

●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●●●●●●●●●●

●

●●●●●●●●

●●●●●●

●●

●●●●●●●

●

●●●●

●●●●●●●●●

●●●●

●●●●●●

●●●●●

●●●●●

●●●●

●●●●●●●●●●●●●●●

●

●

●

●

●

●

●●●●

●●● ●

●●

●●

●

●

●

●●●●

●

●

●

●

●

●

●●●

●●●

●●●●●

●●

●

●●●●

●

●●●

●

●

●●

●

●

●●●●●

●

●

●

●●●●●

●●

●

●

●

●●●

●●●

●

●●●●●●●

● ●●

●

●

●

● ●●●●

●

●●●

● ●

●

●

●●

●●

●

●●

●●●

●●

●

●●

●

●

●

●●

●●●

●

●

●●●●

●

●

●●

●●●●●●●●●

●●●●

●●●●●●●●●●

●●●●

●●●●●●●●●●●●

●●

●●●

●●

●●

●

●

●

●●

●

●●

●

●●●

●●●

●●●

●●

●

●

●●

●

●

●

●

●●●●●

●●

●

●●●

●

●●●

●●●●

●●●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●●

●

●●

●●

●

●

●

●

●

●●

●●●●●

●

●

●●●

●●●

●●●

●●

●

●●

●

●

●●

●

●

●●●●

●

●

●●

●●

●

●●

●

●

●

●●

●●

●●

●

●

●●

●

●●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●●

●●

●●

●●●

●●●

●●

●

●●●●

●

●●●

●●

●●●●

●●

●

●

●●

●● ●●●

●

●●

●

●●

●

●

●●●

●●●

●●●●

●●

●

●●●

●

●

●●

●

●

●

●

●

●●●

●

●

●

●

●

●●●●●

●●●●●

●

●

●●●●●●●

●●●

●

●●

●●●●●●●

●

●

●●

●

●●

●

●

●

●●●●●

●●●

●●●

●

●

●●

●

●●●●

●

●●●●

●●

●

●

●

●

●

●

●

●

●●●

●

●

●●

●

●●

●

●●●●●

●

●●●●

●

●

●

●●●

●

●

●●●●●

●● ●

●

●●●●

●●●●

●●

●

●●●

●●●

●●●●●

●●

●●●●●●

●

●

●●

●

●

●

●●

●●●

●

●

●●●

●

●

●●

●●

●

●

●●

●●●

●

●

●●●●●

●

●

●●

●●

●

●

●

●●

●

●

●

●●

●●

●

●

●●● ●●

●●

●

●●●●●

●

●

●●

●●

●●●●

●

●●●●

●●●

●

●

●●●

●

●

●

●●●●●

●

●●●

●●●

●●●

●●●

●

●●●●●●●

●

●●

●

●●

●

●

●

●●●●●

●●

●

●

●

●

●

●●●●●

●

●●●●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●●

●

●

●

●

●

●●●●

●●

●

●

●

●●●●

●

●

●

●

●

●●

●

●●

●

●

●●●●●

●●

●●●●●

●

●

●

●●

●

●●●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●

●

● ●

●●●●●

●

●●

●●●

●

●●

●

●

●●

●

●●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●●●

●

●

●

●●

●

●

● ●●

●

●

●

●●

●

●

●

●

●●●●●

●

●

●●●●●●●

●

●

●

●

●

●

●●

●● ●

●

●●

●

●

●●

●

●●

●

●●

●●

●

●

● ●

●

●

●●

●●●

●

● ●

●●

●●

●

●●●

●●●●

●

●

●●

●

●●●●●●●●●●●

●●●●●●●●●●●

●

●●●

●●●●●●●●●●●●●●●●●

●●●●●

●

●●

●

●●

●●●●●●●●

2001

2002

2003

2004

2005

2006

Eigenvalues

H3N2 − Between−class PCA

On this scatterplot, the centres of the groups (labels of year) are as scatteredas possible. Why is it slightly disappointing? How does it compare to theoriginal PCA? The following figure represents the PCA axes onto the basisof the between-class analysis.

> s.corcircle(bpca1$as)

12

Axis1

Axis2

What can we say about the PCA and between-class PCA? One obviousproblem of the previous scatterplot is the dispersion of the isolates within2001 and 2002. To assess best the separation of the isolates by year, we needto maximize dispersion between groups, and to minimize dispersion withingroups. This is precisely what Discriminant Analysis does.

2.3 Discriminant Analysis of Principal Components

Discriminant Analysis aims at displaying the best discrimination of individ-uals into pre-defined groups. However, some technical requirements make itoften impractical for genetic data. Discriminant Analysis of Principal Com-ponents (DAPC, [6]) has been developped to circumvent this issue. ApplyDAPC (function dapc) to the H3N2 data, retaining 50 PCs in the prelim-inary PCA step, and plot the results using scatter. The result shouldressemble this:

13

d = 5

●●

●

●

●

●

●

●

●

●

● ●

●

●●

●●

●

●●

●●

●●●

●●●●

●●

●●

●

●

●

●

●

●●

●

● ● ●●●

●

●

●

●

●

●

●●

●

●● ●●

●

●● ●

●

●●

●

●●

●

●

●

●

●●

●●

●

● ●●●

●

●

● ●

●●

●●●

●

●● ●

●

●●

●●

●●●

●

●

●

●

●●●●

●●

●

●●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●●● ●●

●●

● ●

●

●

●●●

●

●●●

●

●●

●

●●

●

●●●

●

●

●●● ●●

●●

●●

●

●

●●

●

●

●

●●●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●●●

●

●●

●●

●●

●

●

●

●●●●

●●●

●

●

●

●

●

●●

● ●

● ●●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●●●

●

●

●

●

●

●●●●●

●●●●

●

●

●

●

●

●

● ●●●

●●

●●

●

●

●

●

●

●● ●●

●

●

●

●

●

●● ●

●

●●

●

●

●●●●●

●●●

●

●

●

●

●

●

●

●

●

●●●●●

●

●

●

●

●●

●

●●

●

●●

●

●

●●●

●●●●●●

●

●

●●

●●

● ●●●●●

●

●●●●

●

●●

●●● ●●●●●●●●

●

●

●

●

●●

●

●

●

●●●

●●

●●●●●●●

●

●●●● ●

●

●

●

●●

●

●●●●●●

●●

●

●●

●

●●

●

●

●

●

●●

●

●

●

●

● ● ●●●

●

●

●●●●

●●●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●

●●●

●

●

●

● ●

●●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●● ●

●

●●●●

●

●

●●

●●●

●●

●

●

●●

●

●

●●

●

●

●●

● ●

●● ●●

●●●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●

●●

●

●

●●● ●

●

●

●

●●

●

●

●●

●●

●

●

●

●

●●

●●●●●

●● ●

●

●

●

● ●●●

●

●●

●

●

●●

●●●●

●

●●●

●

●

●

●

●

●

●

●

●●

●●●●

●

● ●●●● ●●

●

●●●

●●

●

● ●●

●●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●●

●●●

●

●

●●●

●

●●

●

●

●

●

●●

●●●

●

●●●● ●●

●● ●

●

●●

●

●●●●

●

●

●

● ●

●●

●

●

●

●

●

●

●

●●

● ●●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●●

●

●

●

●

●

● ●

●

●

●

●●

●

●

●

●

●●

●

●

●

●●● ●

●

●●●

●●

●●●●

●● ●

●

●

●

●●

●

●

●●●

●●●●●●

●●

●

●●

●● ●

●●

●●●

●●

●●●

●●●

● ●●

●

●●●

●●●●●●●

●●●●●●

●●

●

●

●

●

●

●

●●●●

●

●●

●●●

●●

●

●

● ●●●

●

●

●●

●

●●

●●●

●

●●

●

●●

●

●

●●

●

●

●

●●

●

●●●

● ●

●●

●

●

●

●

●●●

●

●

●

●

●●●

●

●●●

●

●

● ●

●●●●

●

●

●

●● ●

●

●

●

●● ●

●●

●

●●●●

●

●●

●

●

●

●●

●●

●●●●

●

●●

●

●●

●

●

●

●●

●

●●●●●

●●

● ●●

● ●●

●

●

●● ●●

●●●●

●

●●

●

●

●● ●

●●

●

●●●●●

●

●●

●

●

●●●●

●●●●

●

●● ●●●

●●

●

●

●●

●

●

●

●

●

●

●● ●●●

●●●

●●

●

●

●●●

●

●

●

●

●

●

●

●●

●

●●●

●●

●●

●

●●●●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●● ●●

● ●

●

●●●

●

● ● ●●●

●●● ●●●

●●●●●

●●●

●●●

●●

●

●●

●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●●●

●●

●

●●

●●

●

●●

●●

●

●

●

●

●●

●

●

●

●

●

●

● ●

●

●

●

● ●

●

●

●

●

●

●

●

●●

●

● ●

●●

●

●

●

●

● ●

●●

●

●

●

●

●

●

●●●

●

●

●●

●

● ●●

●

●●

●●

● ●

●●

●

●● ●

●

●●●

●

●●

●●●

●

●

●●●●

● ●●

●●

●●

●●

●●

●●●

●●

●

●

●

●

●●●

●●●●●

●

●●

●

●

●

●●

●

●●

●

●

●

●

●●

●●●

●

●

●

●●

●●●

●●●●●●●

●

● ●●●●●●●

●●●

●

●●

●

●

●

● ●●●

●

● ●●

●

●

●

●

●

● ●●● ●

●

●●●

●●●

●

●●●

●●

●●

●

● ●●●●

●

●

●●●

●

●

●

●

● ●●●

●

●

●●●

●

●

●

●

●●●●●

●

●●●

●

●

●

●

●

●●

●●●●●●●●

●

●

●●

●

●

●●●●

●

●●●

●

●●●

●●●●

●

●●●

●

●●

●●

●

●

●

●●

● ●

●

●

●

●●●●

●●●

●

●

●

●

●

●

●●

●

●

●

●●●

●●

●●●●●

●

●

●

●

●●

●

●●

●●

●●●

●●

●●

●

●

●

● ●

●●

●●

●●●●

●●

●

●

●

●

●●

●●

●●●

●●

●●

●●●●

●●●●●

●

●●

●

●

●●● ●

●●●●●●

●

●●

●

●●

●●●●●

●

●

●

●

●

●

● ●

●●

●

●●●●

●

●

●

●

●●

●●

●●●●●

●

●●

●●

●●

●●●

●

●●

●●

●

●

●

●●●

●●

●●

●●

●

●●●●●●

●●

●

●

●

●

●

● ●●

●

●●●

●

● ●●●●●

●

●

●

●●●

●

●

●●

●

●●

●●

●

●●●●●●●

● ●

●●

●

●

●●●●●

●

●

●

●●●●

●

●

●

●

●●

●

●

●

●●●

●●

●

●●●

●● ●

●●●

●

●

●●

●

●

●

●●

●

●●●

●●

●

●●

● ●

●

●

●●●●

●● ●

●●●

●

●

●

●

●

●

●

●

●● ●

●

●

●

●

●

●●

● ●● ●●

●●●

●●

●●

●

●

●

●

●

●

●●

●●

●

●

●●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●●

●●

●

●

●●

●●●

●

●

●

●●

●

●●●●●

●

●

●

●●

●

●●●●●●●

●●

●●

●

●

●

●

●●●

●●●●●●●●●

●

●●

●

●

●●●●●● ●●● ●●

●●●

●●

●

●●

2001

2002

2003

2004

2005

2006

Eigenvalues

How does this compare to the previous results? What new feature appeared,that was not visible before? To have a better assessment of the structure oneach axis, you can plot one dimension at a time, specifying the same axisfor xax and yax in scatter.dapc; for instance:

−5 0 5

0.0

0.5

1.0

1.5

Discriminant function 1

Den

sity

14

To interprete results further, you will need the loadings of alleles, whichare not provided by default in dapc. Re-run the analysis if needed, andspecify you want allele contributions to be returned. Use loadingplot todisplay allele contributions to the first and to the second structure. This isthe result you should obtain for the first axis:

0.00

0.05

0.10

0.15

Loading plot

Variables

Load

ings

X6.aX6.g X90.aX90.gX157.aX157.gX243.cX243.tX267.aX267.gX280.cX280.t

X334.aX334.gX345.tX376.aX376.gX384.gX384.t

X396.aX396.gX399.cX399.t

X412.gX418.aX418.gX424.aX424.gX430.aX430.g

X435.aX435.c

X468.a

X468.t

X476.aX476.t

X529.cX529.tX546.cX546.tX555.aX555.g

X557.gX557.tX564.cX564.tX566.aX566.gX577.aX577.tX578.tX582.aX582.gX594.aX594.t

X595.cX595.t

X602.aX602.gX627.cX627.t

X647.aX647.g

X664.a

X676.aX676.gX685.aX685.c

X717.aX717.gX763.aX763.cX806.aX806.gX882.cX882.t

X906.cX906.t

X933.aX933.gX936.aX936.cX939.cX939.t

How can you interprete this result? Note that loadingplot returns invisi-bly some information about the most contributing alleles (thoses indicatedon the plot). Save this information, and then examine it. Interprete thefollowing commands and their result:

> which(H3N2$loc.names == "435")

L05151

> snp435 <- truenames(H3N2[loc = "L051"])> head(snp435)

435.a 435.c 435.gAB434107 1 0 0AB434108 1 0 0AB438242 1 0 0AB438243 1 0 0AB438244 1 0 0AB438245 1 0 0

> temp <- apply(snp435, 2, function(e) tapply(e, f, mean, na.rm = TRUE))> temp

15

435.a 435.c 435.g2001 1.000000000 0.000000000 0.0000000002002 0.995535714 0.004464286 0.0000000002003 0.942396313 0.055299539 0.0023041472004 0.046210721 0.953789279 0.0000000002005 0.008316008 0.991683992 0.0000000002006 0.000000000 1.000000000 0.000000000

> matplot(temp, type = "b", pch = c("a", "c", "g"), xlab = "year",+ ylab = "allele frequency", xaxt = "n", cex = 1.5)> axis(side = 1, at = 1:6, lab = 2001:2006)

a aa

aa a0.

00.

20.

40.

60.

81.

0

year

alle

le fr

eque

ncy

c cc

cc c

g g g g g g

2001 2002 2003 2004 2005 2006

Do the same for the second axis. How would you interprete this result?Should we expect a vaccine from 2005 influenza to have worked against the2006 virus?

References

[1] D. Chessel, A-B. Dufour, and J. Thioulouse. The ade4 package-I- one-table methods. R News, 4:5–10, 2004.

[2] S. Dray and A.-B. Dufour. The ade4 package: implementing the dualitydiagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.

16

[3] S. Dray, A.-B. Dufour, and D. Chessel. The ade4 package - II: Two-tableand K-table methods. R News, 7:47–54, 2007.

[4] J. Goudet. HIERFSTAT, a package for R to compute and test hierar-chical F-statistics. Molecular Ecology Notes, 5:184–186, 2005.

[5] T. Jombart. adegenet: a R package for the multivariate analysis ofgenetic markers. Bioinformatics, 24:1403–1405, 2008.

[6] Thibaut Jombart, Sebastien Devillard, and Francois Balloux. Discrim-inant analysis of principal components: a new method for the analysisof genetically structured populations. BMC Genetics, 11(1):94, 2010.

[7] E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogeneticsand evolution in R language. Bioinformatics, 20:289–290, 2004.

[8] R Development Core Team. R: A Language and Environment for Sta-tistical Computing. R Foundation for Statistical Computing, Vienna,Austria, 2009. ISBN 3-900051-07-0.

[9] G. R. Warnes. The genetics package. R News, 3(1):9–13, 2003.

17

practical course using the software ||||| multivariate

Documents