Practical course using the R software
Multivariate analysis of genetic data: exploring group diversity
Thibaut Jombart
Abstract
This practical course tackles the question of group diversity in genetic data analysis using R [8]. It consists of two main parts: first, how to infer groups when these are unknown, and second, how to use group information when describing the genetic diversity. The practical mostly uses the packages adegenet [5], ape [7] and ade4 [1, 3, 2], but others such as genetics [9] and hierfstat [4] are also required.
Contents
1 Defining genetic clusters
  1.1 Hierarchical clustering
  1.2 K-means
2 Describing genetic clusters
  2.1 Analysis of group data
  2.2 Between-class analyses
  2.3 Discriminant Analysis of Principal Components
1 Defining genetic clusters
Group information is not always known when analysing genetic data. Even when some prior clustering can be defined, it is not always obvious that it gives the best genetic clusters. In this section, we illustrate two simple approaches for defining genetic clusters.
1.1 Hierarchical clustering
Hierarchical clustering can be used to represent genetic distances as trees, and indirectly to define genetic clusters. This is achieved by cutting the tree at a certain distance, and pooling the tips descending from each of the few retained branches into the same cluster (cutree). Load the data microbov, replace the missing data, and compute the Euclidean distances between individuals (functions na.replace and dist). Then, use hclust to obtain a hierarchical clustering of the individuals which forms strong groups (choose the right method(s)).
[Figure: Cluster Dendrogram of the individuals, obtained with a "secret" clustering method; vertical axis: Height.]
Cut the tree into two groups. Do these groups match any prior clustering of the data (see the content of microbov$other)? Remember that table can be used to build contingency tables. Accordingly, what is the main component of the genetic variability in these cattle breeds?
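The steps described so far (distances, tree, cut, contingency table) can be sketched with base R only. This is a minimal illustration on simulated data, not the solution for microbov: the matrix X, the factor known and the choice of the "complete" method are assumptions made for the example, and choosing a suitable method remains part of the exercise.

```r
# Toy stand-in for a table of allele frequencies (assumption: the real
# input would be the microbov data after replacing missing values).
set.seed(1)
n_per_group <- 20
X <- rbind(
  matrix(rnorm(n_per_group * 10, mean = 0), ncol = 10),
  matrix(rnorm(n_per_group * 10, mean = 3), ncol = 10)
)
known <- factor(rep(c("A", "B"), each = n_per_group))  # hypothetical prior groups

D <- dist(X)                            # Euclidean distances between individuals
tree <- hclust(D, method = "complete")  # one possible method among several
grp <- cutree(tree, k = 2)              # cut the tree into two clusters

# Contingency table: inferred clusters (rows) against prior groups (columns)
tab <- table(inferred = grp, prior = known)
print(tab)
```

With real data, the prior grouping would come from the dataset itself rather than from a simulated factor.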
Repeat this analysis by cutting the tree into as many clusters as there are breeds in the dataset (the breeds can be extracted using pop). Build a contingency table to see the match between inferred groups and breeds. Use table.value to represent the result, then interpret it:
[Figure: table.value plot of the contingency table of breeds (Borgou, Zebu, Lagunaire, NDama, Somba, Aubrac, Bazadais, BlondeAquitaine, BretPieNoire, Charolais, Gascon, Limousin, MaineAnjou, Montbeliard, Salers) against inferred groups (grp 1 to grp 15); legend: 5, 15, 25, 35, 45.]
Can some groups be identified with species? With breeds? Are some species more admixed than others?
1.2 K-means
K-means is another, non-hierarchical approach for defining genetic clusters. It relies on the usual ANOVA model for multivariate data $X \in \mathbb{R}^{n \times p}$:
$$\mathrm{VAR}(X) = B(X) + W(X)$$
where VAR, B and W correspond to, respectively, the total variance, the variance between groups, and the variance within groups. K-means then consists in finding the groups that minimize W(X). Using the function kmeans, find 15 genetic clusters in the microbov data; how do the inferred clusters compare to the breeds? Reproduce the same figure as before for these results; the figure should resemble:
[Figure: table.value plot of the contingency table of breeds (Borgou to Salers, as before) against the 15 K-means groups (grp 1 to grp 15); legend: 5, 15, 25, 35, 45, 55.]
What differences can you see? Which species are easily identified genetically using K-means?
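As a reminder of how kmeans reports the variance partition, here is a small sketch on simulated data. The data and the factor known are made up for illustration. Note that kmeans returns sums of squares, not variances: dividing them by n gives the quantities of the model above.

```r
set.seed(2)
# Three simulated groups of 30 individuals described by 5 variables
X <- rbind(matrix(rnorm(30 * 5, mean = 0), ncol = 5),
           matrix(rnorm(30 * 5, mean = 4), ncol = 5),
           matrix(rnorm(30 * 5, mean = 8), ncol = 5))
known <- factor(rep(c("g1", "g2", "g3"), each = 30))  # hypothetical true groups

km <- kmeans(X, centers = 3, nstart = 10)  # several random starts

# Inferred clusters against the known groups
print(table(inferred = km$cluster, prior = known))

# Total sum of squares = between-group + within-group sums of squares
print(c(total = km$totss, between = km$betweenss, within = km$tot.withinss))
print(all.equal(km$totss, km$betweenss + km$tot.withinss))
```

Cluster labels returned by kmeans are arbitrary, so only the pattern of the contingency table matters, not the numbering of the rows.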
Look at the output of K-means clustering, and try to identify the sums of squares corresponding to the variance partition model. How do these behave when the number of clusters increases? Using sapply, retrieve the results of several K-means runs where K is increased gradually. Plot the results; one example of an obtained result would be:
[Figure: between-group sum of squares (vertical axis, 1400 to 2800) plotted against K (horizontal axis, 10 to 50).]
How can you interpret this observation? Can B(X) or W(X) be used for selecting the best K?
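One way of collecting the sums of squares over a range of K with sapply is sketched below. The data are simulated and the range of K is arbitrary; the same pattern would apply to the real data with a wider range.

```r
set.seed(3)
X <- matrix(rnorm(200 * 6), ncol = 6)  # simulated data without group structure
K <- 2:10                              # arbitrary range of numbers of clusters

# One K-means run per value of K; each run returns three sums of squares
res <- sapply(K, function(k) {
  km <- kmeans(X, centers = k, nstart = 5, iter.max = 50)
  c(between = km$betweenss, within = km$tot.withinss, total = km$totss)
})

# res is a 3 x length(K) matrix: one column per value of K
plot(K, res["between", ], type = "b",
     xlab = "K", ylab = "Between sum of squares")
```

Plotting res["within", ] on the same range of K gives the complementary picture.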
Deciding on the optimal number of clusters (K) to retain is often tricky. One approach is to compute K-means solutions for a range of K, and then to select the K giving the best Bayesian Information Criterion (BIC). This is implemented by the function find.clusters. In addition, this function orthogonalizes and reduces the data using PCA as a prior step to K-means. Use this function to search for the optimal number of clusters in microbov. How many clusters would you retain? Compare them to the breed information: when 8 clusters are retained, what are the most admixed breeds?
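The idea of combining a PCA step, K-means and an information criterion can be sketched in base R as follows. This is not the code of find.clusters: the number of retained components and the form of the criterion, n log(WSS/n) + K log(n), are assumptions made for this illustration, and the data are simulated.

```r
set.seed(4)
# Two simulated groups of 50 individuals described by 8 variables
X <- rbind(matrix(rnorm(50 * 8, mean = 0), ncol = 8),
           matrix(rnorm(50 * 8, mean = 5), ncol = 8))
n <- nrow(X)

# Prior PCA step: K-means is run on a few uncorrelated components
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:4]  # number of retained PCs chosen arbitrarily

K <- 1:8
wss <- sapply(K, function(k) {
  # For K = 1 the within sum of squares is the total sum of squares
  if (k == 1) return(sum(scale(scores, scale = FALSE)^2))
  kmeans(scores, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Assumed form of the criterion (lower is better)
bic <- n * log(wss / n) + log(n) * K
plot(K, bic, type = "b", xlab = "K", ylab = "BIC (assumed form)")
print(K[which.min(bic)])
```

In practice the curve is often inspected visually, since the minimum may be shallow.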
Repeat the same analyses for the nancycats data. What can you say about the likely profile of admixture between these cat colonies?
2 Describing genetic clusters
There are different ways of using genetic cluster information in the multivariate analysis of genetic data. The methods vary according to the purpose of the study: describing the differences between groups, describing how individuals are structured in the different clusters, etc.
2.1 Analysis of group data
Sometimes we are only interested in the diversity between groups, and not in the variation occurring within clusters. In such cases, we can analyse data directly at the population level. The first step in doing so is computing allele counts by population (using genind2genpop). These data can then be:
• directly analysed using Correspondence Analysis (CA, dudi.coa), which is appropriate for contingency tables
• translated into scaled allele frequencies (makefreq or scaleGen), and analysed by Principal Component Analysis (PCA, dudi.pca)
• used to compute genetic distances between populations (dist.genpop), which are in turn summarised by Principal Coordinates Analysis (PCoA, dudi.pco)
Try applying these approaches to the dataset microbov, and compare the obtained results.
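To make the aggregation step concrete, the following base R sketch pools individual allele counts by population, converts them to per-locus frequencies and runs a PCA on the populations. It only mimics the logic of the functions named above on simulated biallelic data; all object names are hypothetical.

```r
set.seed(5)
n <- 60; n_loc <- 6
pop <- factor(rep(c("popA", "popB", "popC"), each = 20))

# Simulated diploid data: copies of allele 1 (0, 1 or 2) at each locus,
# the remaining copies being allele 2
a1 <- matrix(rbinom(n * n_loc, size = 2, prob = 0.3), nrow = n)
X <- cbind(a1, 2 - a1)
colnames(X) <- c(paste0("L", 1:n_loc, ".1"), paste0("L", 1:n_loc, ".2"))
locus <- rep(1:n_loc, times = 2)  # locus of each allele column

# Allele counts per population (populations in rows, alleles in columns)
counts <- rowsum(X, group = pop)

# Per-locus totals for each population, expanded back to allele columns
loc_tot <- t(rowsum(t(counts), group = locus))[, locus]
freq <- counts / loc_tot          # allele frequencies within each locus

# PCA of the population allele frequencies
pca_pop <- prcomp(freq, center = TRUE, scale. = FALSE)
print(pca_pop$x[, 1:2])
```

The counts table is the kind of input a correspondence analysis expects, while freq is the kind of input used for PCA.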
[Figure: microbov - Correspondence Analysis; scatterplot of the 15 breeds (Borgou, Zebu, Lagunaire, NDama, Somba, Aubrac, Bazadais, BlondeAquitaine, BretPieNoire, Charolais, Gascon, Limousin, MaineAnjou, Montbeliard, Salers); scale d = 0.5; inset: Eigenvalues.]
[Figure: microbov - Principal Component Analysis; scatterplot of the 15 breeds; scale d = 0.5; inset: Eigenvalues.]
[Figure: microbov - Principal Coordinates Analysis, Edwards distance; scatterplot of the 15 breeds; scale d = 0.1; inset: Eigenvalues.]
For the PCoA, try different Euclidean distances (see dist.genpop), and compare the results. Does the genetic distance employed seem to matter?
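For readers unfamiliar with PCoA, the base R function cmdscale performs classical multidimensional scaling, which is the same technique. The sketch below uses a plain Euclidean distance on simulated population frequencies; it does not reproduce any of the genetic distances of dist.genpop.

```r
set.seed(6)
# Simulated allele frequencies for 5 populations and 10 alleles
freq <- matrix(runif(5 * 10), nrow = 5,
               dimnames = list(paste0("pop", 1:5), NULL))

d <- dist(freq)                         # plain Euclidean distance
pcoa <- cmdscale(d, k = 2, eig = TRUE)  # keep two principal axes

print(pcoa$points)  # coordinates of the populations
print(pcoa$eig)     # eigenvalues (non-negative for a Euclidean distance)

# With all n - 1 axes, the original distances are recovered
full <- cmdscale(d, k = 4)
print(all.equal(as.vector(dist(full)), as.vector(d)))
```

Comparing pcoa$points obtained from different distance matrices is one way to see whether the choice of distance matters.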
We will leave the cattle to rest for now, and switch to human seasonal influenza, H3N2. The dataset H3N2 contains SNPs derived from 1903 hemagglutinin RNA sequences. Obtain allele counts per year of sampling (see the content of H3N2$other) using genind2genpop, and compute i) Edwards' distance, ii) Reynolds' distance and iii) Rogers' distance between years. Perform the PCoA of these distances and plot the results; here is a result for one of the requested distances:
[Figure: H3N2 - PCoA - secret distance; scatterplot of the years 2001 to 2006; scale d = 0.1; inset: Eigenvalues.]
What is the meaning of the first principal components? Are the results consistent across the three distances? To get yet another point of view, perform the PCA of the individual data, and represent group information when plotting the principal components using s.class.
[Figure: s.class scatterplot of the H3N2 isolates on the principal components of the PCA, grouped by year (2001 to 2006); scale d = 2; inset: Eigenvalues.]
What can we say about the genetic evolution of influenza between 2001 and 2006? Can we safely discard information about individual isolates, and just work with data pooled by year of epidemic?
2.2 Between-class analyses
In many situations, discarding information about the individual observations leads to a considerable loss of information. However, basic analyses working at the individual level (mainly PCA) are not optimal in terms of group separation. This is because PCA focuses on the total variability, while only the variability between groups should be optimized.
Let us recall the basic multivariate ANOVA model, where $P$ is the projector onto the dummy vectors of group membership $H$, that is $P = H(H^T D H)^{-1} H^T D$:
$$X = PX + (I - P)X$$
which leads to the variance partition seen before with K-means:
$$\mathrm{VAR}(X) = B(X) + W(X)$$
with:
• $\mathrm{VAR}(X) = \mathrm{trace}(X^T D X)$
• $B(X) = \mathrm{trace}(X^T P^T D P X)$
• $W(X) = \mathrm{trace}(X^T (I - P)^T D (I - P) X)$
where $D$ is a metric (in PCA, it can be replaced by 1/n) and $I$ is the identity matrix.
We can verify that ordinary PCA decomposes the total variance:
> X <- scaleGen(H3N2, miss = "mean", scale = FALSE)
> pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 2)
> sum(pca1$eig)
[1] 15.05856
> D <- diag(1/nrow(X), nrow(X))
> sum(diag(t(X) %*% D %*% X))
[1] 15.05856
Interestingly, it is possible to modify a multivariate analysis so as to decompose B(X) or W(X) instead of VAR(X). The corresponding methods are respectively called between-class (or inter-class) and within-class (or intra-class) analyses. If the basic method is a PCA, then we will be performing between-class PCA and within-class PCA. These methods are implemented by the functions between and within. Here is an example of how to perform the between-class PCA:
> f <- H3N2$other$epid
> bpca1 <- between(pca1, fac = factor(f), scannf = FALSE, nf = 3)
Verify that the sum of the eigenvalues of this analysis equals the variance between groups (B(X)); for this, you will need to compute H, and then P. This requires a few matrix operations:
• A %*% B: multiplies matrix A by matrix B
• diag: either creates a diagonal matrix, or extracts the diagonal of an existing square matrix
• t: transposes a matrix
• ginv (package MASS): computes the generalized inverse of a matrix, used here to invert a symmetric matrix
To obtain H, the matrix of dummy vectors of group memberships, do:
> H <- as.matrix(acm.disjonctif(data.frame(f)))
You should find that B(X) is exactly equal to the sum of the eigenvalues of the between-class analysis.
Repeat the same approach for the within-class analysis, and confirm that the variance within groups is fully decomposed by the within-class PCA.
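The matrix computations involved can be checked on a small simulated example before applying them to H3N2. In this sketch, model.matrix stands in for acm.disjonctif and solve for ginv; both substitutions, as well as the data, are assumptions made to keep the example in base R.

```r
set.seed(7)
n <- 30; p <- 4
f <- factor(rep(c("a", "b", "c"), each = 10))        # hypothetical groups
X <- matrix(rnorm(n * p), nrow = n) + as.integer(f)  # shift the group means
X <- scale(X, center = TRUE, scale = FALSE)          # centre the columns

D <- diag(1 / n, n)             # uniform weights
H <- model.matrix(~ f - 1)      # dummy vectors of group membership
P <- H %*% solve(t(H) %*% D %*% H) %*% t(H) %*% D    # projector onto H
I <- diag(n)

tr <- function(M) sum(diag(M))  # trace of a square matrix
VAR <- tr(t(X) %*% D %*% X)
B <- tr(t(X) %*% t(P) %*% D %*% P %*% X)
W <- tr(t(X) %*% t(I - P) %*% D %*% (I - P) %*% X)
print(c(VAR = VAR, B = B, W = W, B_plus_W = B + W))

# Eigenvalues of the covariance of the group means (rows of P %*% X):
# their sum is the between-group variance
PX <- P %*% X
eigB <- eigen(t(PX) %*% D %*% PX, symmetric = TRUE)$values
print(all.equal(sum(eigB), B))
```

Each row of P %*% X is the mean vector of the group to which the corresponding individual belongs, which is why its variance is the between-group variance.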
Now that you have checked that a multivariate method can be associated with each component of the partition of the variance into groups, investigate further the results of the between-class PCA.
> par(bg = "grey")
> s.class(bpca1$ls, factor(f), col = rainbow(6))
> add.scatter.eig(bpca1$eig, 2, 1, 2)
> title("H3N2 - Between-class PCA")
[Figure: H3N2 - Between-class PCA; s.class scatterplot of the isolates grouped by year (2001 to 2006); scale d = 2; inset: Eigenvalues.]
On this scatterplot, the centres of the groups (labels of year) are as scattered as possible. Why is it slightly disappointing? How does it compare to the original PCA? The following figure represents the PCA axes projected onto the basis of the between-class analysis.
> s.corcircle(bpca1$as)
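[Figure: correlation circle (s.corcircle) showing the PCA axes Axis1 and Axis2 on the basis of the between-class analysis.]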
What can we say about the PCA and the between-class PCA? One obvious problem of the previous scatterplot is the dispersion of the isolates within 2001 and 2002. To best assess the separation of the isolates by year, we need to maximize the dispersion between groups while minimizing the dispersion within groups. This is precisely what Discriminant Analysis does.
2.3 Discriminant Analysis of Principal Components
Discriminant Analysis aims at displaying the best discrimination of individuals into pre-defined groups. However, some technical requirements often make it impractical for genetic data. Discriminant Analysis of Principal Components (DAPC, [6]) has been developed to circumvent this issue. Apply DAPC (function dapc) to the H3N2 data, retaining 50 PCs in the preliminary PCA step, and plot the results using scatter. The result should resemble this:
[Figure: DAPC scatterplot of the H3N2 isolates grouped by year (2001 to 2006); scale d = 5; inset: Eigenvalues.]
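The two-step logic of DAPC (a PCA, then a discriminant analysis on the retained components) can be illustrated with prcomp and lda from the MASS package. This is a sketch of the principle on simulated data, not the implementation of dapc; the number of retained components and all object names are arbitrary choices.

```r
library(MASS)  # provides lda

set.seed(8)
n <- 90; p <- 40
grp <- factor(rep(c("y1", "y2", "y3"), each = 30))  # hypothetical groups
X <- matrix(rnorm(n * p), nrow = n)
X[, 1:5] <- X[, 1:5] + 2 * as.integer(grp)  # only 5 variables carry signal

# Step 1: PCA, keeping a reduced number of uncorrelated components
pca <- prcomp(X, center = TRUE, scale. = FALSE)
n_pc <- 10                                  # arbitrary number of retained PCs
scores <- pca$x[, 1:n_pc]

# Step 2: discriminant analysis of the retained components
da <- lda(scores, grouping = grp)
ld <- predict(da, newdata = scores)$x       # coordinates on the discriminant functions

plot(ld, col = as.integer(grp), pch = 19,
     xlab = "Discriminant function 1", ylab = "Discriminant function 2")
```

With three groups there are at most two discriminant functions, so ld has two columns.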
How does this compare to the previous results? What new feature appeared that was not visible before? To get a better assessment of the structure on each axis, you can plot one dimension at a time by specifying the same axis for xax and yax in scatter.dapc; for instance:
[Figure: density of the isolates along Discriminant function 1 (horizontal axis, -5 to 5); vertical axis: Density (0.0 to 1.5).]
To interpret the results further, you will need the loadings of alleles, which are not provided by default in dapc. Re-run the analysis if needed, and specify that you want allele contributions to be returned. Use loadingplot to display allele contributions to the first and to the second structure. This is the result you should obtain for the first axis:
[Figure: Loading plot for the first axis; horizontal axis: Variables; vertical axis: Loadings (0.00 to 0.15); the labelled alleles span positions 6 (X6.a, X6.g) to 939 (X939.c, X939.t).]
How can you interpret this result? Note that loadingplot invisibly returns some information about the most contributing alleles (those indicated on the plot). Save this information, and then examine it. Interpret the following commands and their results:
> which(H3N2$loc.names == "435")
L051
  51
> snp435 <- truenames(H3N2[loc = "L051"])
> head(snp435)
         435.a 435.c 435.g
AB434107     1     0     0
AB434108     1     0     0
AB438242     1     0     0
AB438243     1     0     0
AB438244     1     0     0
AB438245     1     0     0
> temp <- apply(snp435, 2, function(e) tapply(e, f, mean, na.rm = TRUE))
> temp
           435.a       435.c       435.g
2001 1.000000000 0.000000000 0.000000000
2002 0.995535714 0.004464286 0.000000000
2003 0.942396313 0.055299539 0.002304147
2004 0.046210721 0.953789279 0.000000000
2005 0.008316008 0.991683992 0.000000000
2006 0.000000000 1.000000000 0.000000000
> matplot(temp, type = "b", pch = c("a", "c", "g"), xlab = "year",
+     ylab = "allele frequency", xaxt = "n", cex = 1.5)
> axis(side = 1, at = 1:6, lab = 2001:2006)
[Figure: frequencies of the alleles a, c and g at position 435 by year; horizontal axis: year (2001 to 2006); vertical axis: allele frequency (0.0 to 1.0).]
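The apply/tapply pattern used above may be easier to read on a tiny made-up example. Here snp and year are hypothetical objects: one row per isolate, one column per allele, with 1 marking the allele carried by the isolate.

```r
# Twelve hypothetical isolates sampled over three years
year <- factor(rep(2001:2003, each = 4))
snp <- cbind(a = c(1, 1, 1, 1,  1, 1, 0, 0,  0, 0, 0, 0),
             c = c(0, 0, 0, 0,  0, 0, 1, 1,  1, 1, 1, 1))

# For each allele (column), average the 0/1 values within each year:
# the result is the frequency of the allele in each year
freq_by_year <- apply(snp, 2, function(e) tapply(e, year, mean, na.rm = TRUE))
print(freq_by_year)
```

The result has one row per year and one column per allele; in this toy example allele a goes from frequency 1 to 0 while allele c goes from 0 to 1.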
Do the same for the second axis. How would you interpret this result? Should we expect a vaccine against the 2005 influenza to have worked against the 2006 virus?
References
[1] D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5–10, 2004.
[2] S. Dray and A.-B. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.
[3] S. Dray, A.-B. Dufour, and D. Chessel. The ade4 package - II: Two-table and K-table methods. R News, 7:47–54, 2007.
[4] J. Goudet. HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5:184–186, 2005.
[5] T. Jombart. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24:1403–1405, 2008.
[6] T. Jombart, S. Devillard, and F. Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1):94, 2010.
[7] E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289–290, 2004.
[8] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.
[9] G. R. Warnes. The genetics package. R News, 3(1):9–13, 2003.