
Journal of Classification 31:107-128 (2014) DOI: 10.1007/s00357-014-9149-8

We would like to thank the anonymous reviewers and the editor for their helpful comments and suggestions. This work is based upon research supported by the National Research Foundation of South Africa. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and therefore the NRF does not accept any liability in regard thereto.

Authors' Addresses: J.C. Gower, Department of Mathematics and Statistics, The Open University, Milton Keynes, MK7 6AA, UK, e-mail: [email protected]; N.J. le Roux, Department of Statistics and Actuarial Science, Stellenbosch University, Stellenbosch, 7600, South Africa, e-mail: [email protected]; S. Lubbe, Department of Statistical Sciences, University of Cape Town, Cape Town, 7701, South Africa, e-mail: [email protected].

    The Canonical Analysis of Distance

John C. Gower, The Open University, UK; Niel J. le Roux, Stellenbosch University, South Africa; Sugnet Gardner-Lubbe, University of Cape Town, South Africa

Abstract: Canonical Variate Analysis (CVA) is one of the most useful of multivariate methods. It is concerned with separating between and within group variation among N samples from K populations with respect to p measured variables. Mahalanobis distance between the K group means can be represented as points in a (K − 1) dimensional space and approximated in a smaller space, with the variables shown as calibrated biplot axes. Within group variation may also be shown, together with circular confidence regions and other convex prediction regions, which may be used to discriminate new samples.

This type of representation extends to what we term Analysis of Distance (AoD), whenever a Euclidean inter-sample distance is defined. Although the N × N distance matrix of the samples, which may be large, is required, eigenvalue calculations are needed only for the much smaller K × K matrix of distances between group centroids. All the ancillary information that is attached to a CVA analysis is available in an AoD analysis.

We outline the theory and the R programs we developed to implement AoD by presenting two examples.

Keywords: Analysis of distance; Biplot; Canonical variate analysis.

    Published online: 3 April 2014


    1. Introduction

Canonical Variate Analysis (CVA) gives a useful method for describing and assessing the differences between the means of K groups or classes (see e.g. Krzanowski 2000; Mardia, Kent and Bibby 1979; and McLachlan 1992). A key concept in CVA is the use of Mahalanobis distance to define inter-group distance. The group means occupy K − 1 dimensions and, after transformation into Mahalanobis space, will usually be approximated, essentially by Principal Component Analysis (PCA), in some smaller number of dimensions. This approximation may be exhibited graphically, together with points representing the individual samples. Confidence circles, or other regions, may be included to represent the degree of uncertainty, and the whole may be endowed with calibrated linear biplot axes. Thus, CVA has two aspects: (a) making maps that exhibit within and between group variability and (b) using such maps to aid discrimination by assigning samples to their best group. By extending the technique referred to by Digby and Gower (1981) as an Analysis of Distance (AoD), we show how the ideas behind CVA can be generalized to cope with other definitions of distance which often occur in the applied literature. We emphasise aspect (a), while (b), which is related to linear discriminant analysis, gets less prominence.

As in CVA, we have measurements on each of p variables for n samples distributed among K groups of sizes $n_1, n_2, \ldots, n_K$ summing to n. These measurements are available in an $n \times p$ matrix $\mathbf{X}$, with group-membership given in an $n \times K$ indicator matrix $\mathbf{G}$. Here, $\mathbf{G}$ is zero except that $g_{ik} = 1$ when the ith sample belongs to the kth group. Thus $\mathbf{G}\mathbf{1} = \mathbf{1}$ and $\mathbf{1}'\mathbf{G} = \mathbf{1}'\mathbf{N}$, where $\mathbf{N} = \mathrm{diag}(n_1, n_2, \ldots, n_K) = \mathbf{G}'\mathbf{G}$.

In AoD it is assumed that the distances $d_{ii'}$ between all pairs i and i′ of samples $(i, i' = 1, \ldots, n)$ are available in the form of an $n \times n$ matrix $\mathbf{D} = \{-\tfrac{1}{2}d_{ii'}^2\}$. To avoid tedious repetition we term such a matrix, of squared distances multiplied by $-\tfrac{1}{2}$, a ddistance matrix. Distances may be defined very generally, though it is desirable that they be Euclidean embeddable, as we shall assume in the following. We shall also require that each variable contributes additively to squared distance, thus satisfying:

$$d_{ii'}^2 = \sum_{j=1}^{p} f_j(x_{ij}, x_{i'j}), \qquad (1)$$

where $f_j(\cdot\,,\cdot)$ is a function defining squared distance for variable j. Distances may be expressed in terms of quantitative variables or by qualitative (categorical) variables. Gower and Legendre (1986) give a list of some of the possibilities. Usually, $f_j(\cdot\,,\cdot)$ represents the same function for each variable, but this is not necessary. It is especially useful to allow different functions when some variables are numerical and some categorical.
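As a concrete illustration, the following minimal R sketch builds a ddistance matrix under the additive definition (1). It is our own illustration, not the authors' code; it assumes the data are held in a data frame, and takes squared difference as $f_j$ for numeric variables and simple mismatch for factors.

```r
## Minimal sketch (ours): a ddistance matrix D = {-(1/2) d^2_{ii'}} from (1).
ddist <- function(X) {
  n  <- nrow(X)
  D2 <- matrix(0, n, n)                    # accumulates squared distances
  for (j in seq_len(ncol(X))) {
    xj <- X[[j]]
    if (is.numeric(xj)) {
      D2 <- D2 + outer(xj, xj, function(a, b) (a - b)^2)  # f_j = (x - x')^2
    } else {
      D2 <- D2 + outer(xj, xj, "!=")       # f_j = 1 if levels differ, else 0
    }
  }
  -0.5 * D2                                # the ddistance convention
}
```

Other choices of $f_j$, such as those listed by Gower and Legendre (1986), drop in by replacing the per-variable terms above.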

Thus, AoD differs from CVA in allowing any Euclidean embeddable measure of inter-sample distance and, by extension, any measure of inter-group distance. As with CVA, these distances may be represented in maps with points representing the centres of the groups, supplemented by additional points representing the within-group variation. In this paper we give examples of Pythagorean distance (equivalent to PCA), the square root of Manhattan distance and Clark's distance.

We could proceed by (a) doing a Principal Coordinate Analysis (PCO) (see Gower and Hand 1996) of the $n \times n$ matrix $\mathbf{D}$, followed by (b) evaluating the group means to produce a map of the K group means. The latter may be approximated in fewer, r say, dimensions by using K-dimensional PCA. PCA involves a rotation of the K group means, and the same rotation may then be applied to all n samples. A problem with this approach is that n may be very large, entailing a massive eigen-decomposition. This can be avoided by using the AoD methodology described below, which requires the whole of $\mathbf{D}$ but the eigenstructure of only a $K \times K$ matrix. This simplification allows large data sets to be handled efficiently and, at the same time, by focussing on the group-average space, helps interpretation. These provide the main motivations of the following.

    The basic methodology started with a somewhat hard-to-find publication by Digby and Gower (1981), followed by Gower (1989), Krzanowski (1994), Gower and Krzanowski (1999) and Gower, Lubbe and le Roux (2011). Ringrose (1996) and Krzanowski and Radley (1989) discussed nonparametric confidence and tolerance regions which may be used to aid discrimination in CVA, and which are readily adaptable to AoD. These papers present successive enhancements and generalizations, a process continued here.

The general plan followed below is:

• Representation of the K group means in K − 1 dimensions.
• Addition of points for all the n samples.
• The approximation of the above in r dimensions and summary in the form of an analysis of distance.
• Endowment of the approximation with predictive calibrated nonlinear biplot axes for quantitative variables.
• A discussion of the methodology of using group sizes as weights.
• Introduction to software written in R for performing an AoD as described above.
• Presentation of examples.


2. Representation of the Group Means

If $\mathbf{g}_k$ is the kth column of $\mathbf{G}$, then $\bar{D}_{kk'} = \frac{1}{n_k n_{k'}}\,\mathbf{g}_k'\mathbf{D}\mathbf{g}_{k'}$ gives the average of the ddistances between the members of the kth and k′th groups. When $k = k'$, the zero diagonals and repeated symmetric values are included in the averaging process. For all K groups we obtain the $K \times K$ matrix:

$$\bar{\mathbf{D}} = \mathbf{N}^{-1}\mathbf{G}'\mathbf{D}\mathbf{G}\mathbf{N}^{-1}. \qquad (2)$$

Using (2), Gower and Hand (1996, p. 249) showed that

$$\delta_{kk'} = \bar{D}_{kk'} - \tfrac{1}{2}\left(\bar{D}_{kk} + \bar{D}_{k'k'}\right) \qquad (3)$$

is the ddistance between the centroids of groups k and k′, forming a ddistance matrix $\boldsymbol{\Delta} = \{\delta_{kk'}\}: K \times K$. In order to obtain a map of the group means, analogous to the map of group means in CVA, any method of multidimensional scaling can be used. In the following we shall use PCO because of its simplicity and openness to algebraic analysis. When we define the distances to be Pythagorean, a PCO of $\boldsymbol{\Delta}$ is equivalent to a PCA of the group means. If in $\mathbf{D}$ we defined Mahalanobis distance between all pairs of samples, a PCO of $\boldsymbol{\Delta}$ would recover the CVA of the canonical means. With other choices of embeddable distance and MDS, different analyses and representations will ensue.
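A short R sketch of (2) and (3) may help fix the computations; group_ddist is our illustrative name, not a function from the authors' software:

```r
## Sketch of (2)-(3); D is the n x n ddistance matrix, G the n x K indicator.
group_ddist <- function(D, G) {
  Ninv  <- diag(1 / colSums(G))                    # N^{-1}
  Dbar  <- Ninv %*% t(G) %*% D %*% G %*% Ninv      # (2): averaged ddistances
  dbar  <- diag(Dbar)                              # within-group averages
  K     <- ncol(G)
  Delta <- Dbar - 0.5 * (outer(dbar, rep(1, K)) +
                         outer(rep(1, K), dbar))   # (3); diag(Delta) is zero
  list(Dbar = Dbar, Delta = Delta)
}
```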

    3. Adding Points for the Individual Samples

Using PCO to represent $\boldsymbol{\Delta}$, the coordinates of the group means $\mathbf{Y}: K \times m$ are obtained from the spectral decomposition of

$$(\mathbf{I} - \mathbf{1}\mathbf{1}'/K)\,\boldsymbol{\Delta}\,(\mathbf{I} - \mathbf{1}\mathbf{1}'/K) = \mathbf{Y}\mathbf{Y}', \qquad (4)$$

where $\mathbf{Y}'\mathbf{Y} = \boldsymbol{\Lambda}$ is diagonal. Gower (1968) showed that any point P, say, whose ddistances to the group means are given in a vector $\boldsymbol{\delta}_P$, has coordinates

$$\mathbf{y}_P' = (\boldsymbol{\delta}_P - \boldsymbol{\Delta}\mathbf{1}/K)'\,\mathbf{Y}\boldsymbol{\Lambda}^{-1}, \qquad (5)$$

together with $y_{P,m+1}^2 = \mathbf{1}'\boldsymbol{\Delta}\mathbf{1}/K^2 - 2\,\mathbf{1}'\boldsymbol{\delta}_P/K - \mathbf{y}_P'\mathbf{y}_P$, an extra (m + 1)th dimension required for each point added but rarely needed in applications.

We assume that the ddistances from the new point P to all the n original points are given in a column-vector $\mathbf{d}: n \times 1$ with $\mathbf{d}' = (\mathbf{d}_1', \mathbf{d}_2', \ldots, \mathbf{d}_K')$; here $\mathbf{d}_k$ is a vector of size $n_k$ giving the ddistances


of the new point from the samples in the kth group. To interpolate all the samples from the kth group, $\mathbf{d}$ must be taken successively as the $n_k$ columns of $\mathbf{D}$ corresponding to the kth group, $[\mathbf{D}_{1k}',\ \mathbf{D}_{2k}',\ \ldots,\ \mathbf{D}_{Kk}']'$. Let $\mathbf{d}^{(h)} = (\mathbf{d}_1^{(h)\prime}, \mathbf{d}_2^{(h)\prime}, \ldots, \mathbf{d}_K^{(h)\prime})'$ denote the hth column of the latter matrix; then Gower, Lubbe and le Roux (2011) show that substituting $\boldsymbol{\delta} = \{\delta_i\}: K \times 1$, with $n_k$ columns of the form

$$\delta_i = \frac{1}{n_i}\mathbf{1}'\mathbf{d}_i^{(h)} - \tfrac{1}{2}\bar{D}_{ii}, \qquad (6)$$

for $i = 1, \ldots, K$ and $h = 1, \ldots, n_k$, in (5) yields the coordinates of all n samples,

$$\mathbf{Z} = \left(\mathbf{D}\mathbf{G}\mathbf{N}^{-1} - \tfrac{1}{2}\,\mathbf{1}\,\mathrm{diag}(\bar{\mathbf{D}})' - \tfrac{1}{K}\,\mathbf{1}\mathbf{1}'\boldsymbol{\Delta}\right)\mathbf{Y}\boldsymbol{\Lambda}^{-1}: n \times m, \qquad (7)$$

where $\mathrm{diag}(\bar{\mathbf{D}})$ denotes the K-vector of the diagonal elements of $\bar{\mathbf{D}}$. The centroids of the $n_k$ inserted points are at the same position as the kth group-mean in $\mathbf{Y}$, as is verified in Section 5.8.1 of Gower, Lubbe and le Roux (2011).
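The following R sketch (again ours, not the AODplot/AODbiplot code) strings (2)-(7) together: a PCO of $\boldsymbol{\Delta}$ for the group-mean coordinates, followed by interpolation of all n samples.

```r
## Sketch of (4), (5) and (7); reuses group_ddist() from the Section 2 sketch.
aod_coords <- function(D, G, tol = 1e-9) {
  n  <- nrow(D); K <- ncol(G); nk <- colSums(G)
  gd <- group_ddist(D, G)                          # Dbar and Delta, (2)-(3)
  C  <- diag(K) - matrix(1 / K, K, K)              # centring I - 11'/K
  e  <- eigen(C %*% gd$Delta %*% C, symmetric = TRUE)
  m  <- sum(e$values > tol)
  Y  <- e$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(e$values[1:m]), m)
  Lam <- diag(e$values[1:m], m)                    # Lambda = Y'Y
  ## (6): ddistances of every sample to the K group means
  delta <- D %*% G %*% diag(1 / nk, K) -
           matrix(diag(gd$Dbar) / 2, n, K, byrow = TRUE)
  ## (5)/(7): interpolated sample coordinates as rows of Z
  Z <- (delta - matrix(rowMeans(gd$Delta), n, K, byrow = TRUE)) %*%
       Y %*% solve(Lam)
  list(Y = Y, Z = Z, Lambda = Lam)
}
```

By construction, averaging the rows of Z within each group reproduces the corresponding row of Y, the centroid property noted above.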

    4. Approximation in r Dimensions

Equation (5) gives the coordinates of an added point in K − 1 or fewer dimensions; in r-dimensional approximations, only the first r columns of $\mathbf{Y}$ and the first r eigenvalues will be needed. This is achieved by replacing $\mathbf{Y}\boldsymbol{\Lambda}^{-1}$ in (5) by $\mathbf{Y}\boldsymbol{\Lambda}^{-1}\mathbf{J}$, where

$$\mathbf{J} = \begin{pmatrix} \mathbf{I}_{r \times r} & \mathbf{0}_{r \times (K-r)} \\ \mathbf{0}_{(K-r) \times r} & \mathbf{0}_{(K-r) \times (K-r)} \end{pmatrix}.$$

The cloud of points surrounding each centroid may be enclosed in any tolerance region that expresses spread, analogously to the confidence circles of CVA. Thus, we may use minimal covering circles or ellipses enclosing, say, all or 95% of the points, or we may use bagplots or convex hulls (see e.g. Section 2.9 of Gower, Lubbe and le Roux 2011). Furthermore, a nonparametric permutation procedure can be used for testing group differences, as illustrated in an example below.

Next we show how the above may be summarized in the form of an analysis of distance. We may write $\boldsymbol{\Delta}$ in full matrix form as:

$$\boldsymbol{\Delta} = \bar{\mathbf{D}} - \tfrac{1}{2}\left[\mathrm{diag}(\bar{\mathbf{D}})\,\mathbf{1}' + \mathbf{1}\,\mathrm{diag}(\bar{\mathbf{D}})'\right],$$


whence

$$\mathbf{n}'\boldsymbol{\Delta}\mathbf{n} = \mathbf{n}'\bar{\mathbf{D}}\mathbf{n} - n\,\mathbf{n}'\mathrm{diag}(\bar{\mathbf{D}}), \qquad (8)$$

with $\mathbf{n}$ the K-vector of diagonal elements of $\mathbf{N}$. From (8), and noting that $\mathbf{n}'\bar{\mathbf{D}}\mathbf{n} = \mathbf{1}'\mathbf{D}\mathbf{1}$ and $n_k\bar{D}_{kk} = \mathbf{g}_k'\mathbf{D}\mathbf{g}_k/n_k$, we have

$$\mathbf{n}'\boldsymbol{\Delta}\mathbf{n} = \mathbf{1}'\mathbf{D}\mathbf{1} - n\sum_{k=1}^{K}\frac{\mathbf{g}_k'\mathbf{D}\mathbf{g}_k}{n_k},$$

which rearranges to:

$$\frac{\mathbf{n}'\boldsymbol{\Delta}\mathbf{n}}{n} = \frac{\mathbf{1}'\mathbf{D}\mathbf{1}}{n} - \sum_{k=1}^{K}\frac{\mathbf{g}_k'\mathbf{D}\mathbf{g}_k}{n_k}. \qquad (9)$$

Recalling that $\mathbf{1}'\mathbf{D}\mathbf{1}/n$ is the total sum of squares and $\mathbf{g}_k'\mathbf{D}\mathbf{g}_k/n_k$ is the sum of squares within the kth group, we see that, apart from sign, the analysis of distance (9) is analogous to the CVA orthogonal analysis of variance:

Total sum of squares = Between group sum of squares + Within group sum of squares.

Thus, from (9) we may form an analysis of distance table in which the contributions between and within groups are exhibited. Further, we may break this down into the contributions arising from different dimensions and sets of dimensions, especially the r fitted dimensions and the remaining residual dimensions. The latter may be further subdivided into the (K − 1 − r) dimensions holding the group means and the distances orthogonal to the group means. Note that with K groups, the means fit into K − 1 or fewer dimensions, so the remaining residual dimensions for the group means are null.
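In R, the partition (9) can be sketched as follows, with signs reversed so that all three terms are positive, as in the numerical examples of Section 8; aod_anova is our illustrative name:

```r
## Sketch of the analysis-of-distance partition (9).
aod_anova <- function(D, G) {
  n      <- nrow(D)
  nk     <- colSums(G)
  total  <- -sum(D) / n                          # -1'D1/n
  within <- -sum(diag(t(G) %*% D %*% G) / nk)    # -sum_k g_k'Dg_k/n_k
  c(total = total, between = total - within, within = within)
}
```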

    5. Biplot Axes

We have represented within group variation by choosing $\mathbf{d}$ as the successive columns of $\mathbf{D}$, eventually leading to (7). However, $\mathbf{d}$ may also refer to a genuine new sample, in which case (5) interpolates that sample into the map. In particular, $\mathbf{d}$ may be chosen as a pseudo sample and used to plot predictive trajectories for numerical variables. To do this, we set the pseudo sample for the jth variable to have value $\mu\mathbf{e}_j$, so that as μ varies we trace out a nonlinear trajectory for the jth variable, which may be calibrated for suitably chosen values of μ. In this way the AoD of the individual samples may be enhanced with a biplot to include information on the variables. These trajectories may be approximated in r dimensions by the methods given by Gower and Hand (1996) and Gower, Lubbe and le Roux (2011) for nonlinear biplots. Here, we extend the nonlinear theory


    to construct trajectories in the case of canonical analysis of distance. The trajectories act like coordinate axes and may be used by projecting sample points onto them and reading off the nearest calibrated value. An easier method for reading off predictions equivalent to normal projection is termed circular projection. In circular projection, the trajectories are constructed so that when a circle is drawn with diameter given by the origin and the sample point, the predictions are given where this circle intersects the trajectories. Alternatively, the regression method (see e.g. Chapter 4 in Gower, Lubbe and le Roux 2011) may be used to give approximate linear biplot axes. Krzanowski (2004) suggests a half-way house where a limited number of pseudo samples (say 10) are fitted and joined by linear axes.

    6. Weighting

Finally, we note that the above uses the unweighted centroid of the group means as its origin O, say. Again, analogously to CVA, we may use centroids weighted by sample sizes. The starting point is the weighted PCO of $\boldsymbol{\Delta}$, where now (4) is replaced by

$$(\mathbf{I} - \mathbf{1}\mathbf{n}'/n)\,\boldsymbol{\Delta}\,(\mathbf{I} - \mathbf{n}\mathbf{1}'/n) = \mathbf{Y}_2\mathbf{Y}_2'. \qquad (10)$$

Because $\mathbf{n}'\mathbf{Y}_2 = \mathbf{0}'$, the origin moves from O to G, the origin of the samples, which is the centroid of group centroids weighted by the group sizes. As with CVA, the use of a weighted centroid does not affect the distances between the individual centroids, but in approximations, groups with smaller sample sizes will be less well represented than those with larger sample sizes. A PCO type eigendecomposition of (10) is not entirely satisfactory because it gives an unweighted fit to a best-fitting plane through the weighted centroids. That is, residuals from projections of the group-centroids onto any r-dimensional approximation plane all have unit weight. To weight the residuals according to given weights $\mathbf{W}$, say, is readily accomplished by a weighted PCA of $\mathbf{Y}_2$. That is, we minimize $\operatorname{trace}\{(\mathbf{Y}_2 - \hat{\mathbf{Y}}_2)'\mathbf{W}(\mathbf{Y}_2 - \hat{\mathbf{Y}}_2)\}$, which may be written $\min\|\mathbf{W}^{1/2}(\mathbf{Y}_2 - \hat{\mathbf{Y}}_2)\|^2$. This requires a simple application of the Eckart-Young theorem, in which the singular value decomposition (SVD) $\mathbf{W}^{1/2}\mathbf{Y}_2 = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}'$ gives the r-dimensional approximation $\mathbf{W}^{1/2}\hat{\mathbf{Y}}_2 = \mathbf{U}\boldsymbol{\Sigma}\mathbf{J}\mathbf{V}'$ and, finally:


$$\hat{\mathbf{Y}}_2 = \mathbf{W}^{-1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{J}\mathbf{V}'. \qquad (11)$$

Note that (i) we may write $\hat{\mathbf{Y}}_2 = \mathbf{W}^{-1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}'\mathbf{V}\mathbf{J}\mathbf{V}' = \mathbf{Y}_2\mathbf{V}\mathbf{J}\mathbf{V}'$, showing $\hat{\mathbf{Y}}_2$ as an orthogonal projection of $\mathbf{Y}_2$, and (ii) for r-dimensional plotting purposes we may use $\hat{\mathbf{Y}}_2 = \mathbf{Y}_2\mathbf{V}\mathbf{J}$, because the final orthogonal matrix $\mathbf{V}$ merely rotates the r-dimensional solution into p dimensions.

The two steps of determining $\mathbf{Y}_2$ from (10), followed by a weighted PCA of $\mathbf{Y}_2$, may be subsumed into a single step, as follows. Combining the SVD of $\mathbf{W}^{1/2}\mathbf{Y}_2$ with (10) gives:

$$\mathbf{W}^{1/2}(\mathbf{I} - \mathbf{1}\mathbf{n}'/n)\,\boldsymbol{\Delta}\,(\mathbf{I} - \mathbf{n}\mathbf{1}'/n)\mathbf{W}^{1/2} = \mathbf{U}\boldsymbol{\Sigma}^2\mathbf{U}', \qquad (12)$$

which immediately yields $\mathbf{U}$ and $\boldsymbol{\Sigma}$. These may be substituted into (11), ignoring $\mathbf{V}$ as just discussed, to give $\hat{\mathbf{Y}}_2$ referred to r-dimensional principal axes. Normally, the weights $\mathbf{W}$ would be the diagonal matrix $\mathbf{N}$ with the group sizes in the diagonal.

Thus (12) immediately gives a weighted AoD for the group means; the problem of how to add individual samples remains. This is not difficult, but two points have to be borne in mind. Firstly, $\mathbf{Y}$ and $\mathbf{Y}_2$ are referred to different origins. Secondly, although $\mathbf{Y}$ given by (4) and the $\mathbf{Y}_2$ derived from (10) generate the same ddistances, their orientations will differ. Recall that any centring $\mathbf{s}$, where $\mathbf{s}'\mathbf{1} = 1$, does not affect the distances generated by $\mathbf{Y}_s$. This follows from noting that, writing $\mathbf{Y}_s\mathbf{Y}_s' = (\mathbf{I} - \mathbf{1}\mathbf{s}')\boldsymbol{\Delta}(\mathbf{I} - \mathbf{s}\mathbf{1}')$, the squared distance between the ith and i′th rows of $\mathbf{Y}_s$ is

$$(\mathbf{e}_i - \mathbf{e}_{i'})'(\mathbf{I} - \mathbf{1}\mathbf{s}')\boldsymbol{\Delta}(\mathbf{I} - \mathbf{s}\mathbf{1}')(\mathbf{e}_i - \mathbf{e}_{i'}) = (\mathbf{e}_i - \mathbf{e}_{i'})'\boldsymbol{\Delta}(\mathbf{e}_i - \mathbf{e}_{i'}) = \delta_{ii} + \delta_{i'i'} - 2\delta_{ii'} = -2\delta_{ii'}$$

(where $\mathbf{e}_i$ denotes a unit K-vector with its ith element equal to unity, else zero), as given in (3). In (4) we have chosen $\mathbf{s} = \mathbf{1}/K$ and in (10) $\mathbf{s} = \mathbf{n}/n$.

Denote the recentred matrix $(\mathbf{I} - \mathbf{1}\mathbf{n}'/n)\mathbf{Y}$ by $\mathbf{Y}_1$; then we require the rotation $\mathbf{Q}$ of the coordinates represented by $\mathbf{Y}_1$ that matches the coordinates represented by $\mathbf{Y}_2$ or $\hat{\mathbf{Y}}_2$. This match is given by the solution of the orthogonal Procrustes problem $\min_{\mathbf{Q}}\|\mathbf{Y}_2 - \mathbf{Y}_1\mathbf{Q}\|^2$ or $\min_{\mathbf{Q}}\|\hat{\mathbf{Y}}_2 - \mathbf{Y}_1\mathbf{Q}\|^2$.


The solution to this problem is well-known (see, for example, Gower and Dijksterhuis 2004) and is obtained through the SVD $\mathbf{Y}_2'\mathbf{Y}_1 = \mathbf{S}\boldsymbol{\Phi}\mathbf{T}'$ or $\hat{\mathbf{Y}}_2'\mathbf{Y}_1 = \mathbf{S}\boldsymbol{\Phi}\mathbf{T}'$, by setting $\mathbf{Q} = \mathbf{T}\mathbf{S}'$. Moreover, the fit is exact when $\mathbf{Y}_2$ is used. The difference in origin of $\mathbf{Y}_2$ or $\hat{\mathbf{Y}}_2$, relative to $\mathbf{Y}$, is the translation $\mathbf{n}'\mathbf{Y}/n$. These results imply that if a sample is added according to the methodology given in Section 3 to give a point $\mathbf{y}$, then relative to the weighted analysis the point has coordinates $(\mathbf{y}' - \mathbf{n}'\mathbf{Y}/n)\mathbf{Q}$ in the delta-space; the coordinates orthogonal to the delta-space are unchanged. This allows all samples and all biplot trajectories, as well as CLPs, to be placed in the space of the weighted analysis, whose first r dimensions then give the r-dimensional weighted approximation. Thus, the weighted analysis is easily derived from the unweighted analysis.
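A compact R sketch of the weighted analysis, under the assumption $\mathbf{W} = \mathbf{N}$ and with our own function name, is:

```r
## Sketch of (10)-(12) plus the Procrustes alignment; Delta, Y as before,
## nk the vector of group sizes.
weighted_aod <- function(Delta, nk, Y, tol = 1e-9) {
  K   <- length(nk); n <- sum(nk)
  Cw  <- diag(K) - rep(1, K) %o% (nk / n)            # I - 1n'/n
  W12 <- diag(sqrt(nk), K)
  M   <- W12 %*% Cw %*% Delta %*% t(Cw) %*% W12      # (12): M = U Sigma^2 U'
  e   <- eigen(M, symmetric = TRUE)
  m   <- min(sum(e$values > tol), ncol(Y))
  Y2  <- diag(1 / sqrt(nk), K) %*% e$vectors[, 1:m, drop = FALSE] %*%
         diag(sqrt(e$values[1:m]), m)                # W^{-1/2} U Sigma
  Y1  <- Cw %*% Y[, 1:m, drop = FALSE]               # recentred unweighted Y
  s   <- svd(t(Y2) %*% Y1)                           # Y2'Y1 = S Phi T'
  Q   <- s$v %*% t(s$u)                              # Q = TS'
  list(Y2 = Y2, Q = Q)
}
```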

    7. Software for Constructing AoD Biplots

A shortcoming, already referred to above, is that in the discussion of AoD given by Gower, Lubbe and le Roux (2011) no AoD plots are presented with trajectories fitted when the plot is based on general additive Euclidean embeddable distances; nor does their R function AODplot provide facilities for constructing AoD biplots. To address these shortcomings, the R function AODbiplot has been written to construct the AoD biplots discussed in this paper. AODbiplot extends AODplot by utilizing nonlinear biplots as discussed by Gower, Lubbe and le Roux (2011). All the above functions are available by following the instructions in the file ReadMe.txt at https://dl.dropbox.com/u/17860902/CanonAnalDist.zip.

These authors show that the coordinates for tracing a prediction biplot axis $\boldsymbol{\tau}$ for variable t along a series of values μ are based on a series of lines L(μ) with equation

    biplot axis W for variable t along a series of values P, is based on a series of lines L(P) with equation

    c

    c

    PP

    OO

    OO

    dd

    Kyy

    yy

    dd

    KK

    1z 1

    21

    11

    121

    111

    21

    21

    ## , (13)

where $\mathbf{z}$ denotes the two-dimensional coordinates in the biplot space, $\mathbf{J}_2$ selects the first two coordinates, $\boldsymbol{\delta}(\mu)$ is the vector of ddistances from the pseudo sample to the K group means, and $y_{ij}$ is the ijth element of $\mathbf{Y}: K \times m$. Thus (13) requires the sample point $\mathbf{z}$ to project onto the interpolated trajectory at the point corresponding to μ. The procedure is analogous to that of the nonlinear biplot (see Gower and Ngouenet 2005), except for the derivative of $\boldsymbol{\delta}$. Assuming additive distances defined by (1), the ddistances between the pseudo sample $\boldsymbol{\tau}(\mu) = \mu\mathbf{e}_t$ and the $n_k$ samples in the kth group are given by


$$\mathbf{d}^{(k)}(\mu) = -\tfrac{1}{2}\begin{pmatrix} \sum_{j=1}^{p} f_j(x_{1j},0) - f_t(x_{1t},0) + f_t(x_{1t},\mu) \\ \vdots \\ \sum_{j=1}^{p} f_j(x_{n_k j},0) - f_t(x_{n_k t},0) + f_t(x_{n_k t},\mu) \end{pmatrix}.$$

Therefore, for the kth group and variable t it follows that

$$\frac{d\,\mathbf{d}^{(k)}(\mu)}{d\mu} = -\tfrac{1}{2}\begin{pmatrix} \frac{d}{d\mu} f_t(x_{1t},\mu) \\ \vdots \\ \frac{d}{d\mu} f_t(x_{n_k t},\mu) \end{pmatrix}. \qquad (14)$$

Using (6) and (14), it follows that

$$\frac{d\,\delta_i(\mu)}{d\mu} = \frac{1}{n_i}\,\mathbf{1}'\frac{d\,\mathbf{d}^{(i)}(\mu)}{d\mu} = -\frac{1}{2n_i}\sum_{j=1}^{n_i}\frac{d}{d\mu} f_t(x_{jt},\mu)$$

and

$$\frac{d\,\boldsymbol{\delta}(\mu)}{d\mu} = -\tfrac{1}{2}\begin{pmatrix} \frac{1}{n_1}\sum_{i=1}^{n_1}\frac{d}{d\mu} f_t(x_{it},\mu) \\ \frac{1}{n_2}\sum_{i=1}^{n_2}\frac{d}{d\mu} f_t(x_{it},\mu) \\ \vdots \\ \frac{1}{n_K}\sum_{i=1}^{n_K}\frac{d}{d\mu} f_t(x_{it},\mu) \end{pmatrix}.$$

Writing (13) as $\mathbf{a}(\mu)'\mathbf{z} = c(\mu)$ with reparameterization

$$l_i(\mu) = a_i(\mu)\Big/\sqrt{a_1^2(\mu) + a_2^2(\mu)} \quad \text{for } i = 1, 2, \qquad c^*(\mu) = c(\mu)\Big/\sqrt{a_1^2(\mu) + a_2^2(\mu)},$$

the normal projection prediction biplot trajectories are given by the intersections of neighbouring lines L(μ), that is, by solving $l_1(\mu)z_1 + l_2(\mu)z_2 = c^*(\mu)$ together with its derivative with respect to μ:

$$\boldsymbol{\tau}(\mu)' = \frac{1}{l_1(\mu)\dot{l}_2(\mu) - l_2(\mu)\dot{l}_1(\mu)}\left[\;\dot{l}_2(\mu)\,c^*(\mu) - l_2(\mu)\,\dot{c}^*(\mu) \quad l_1(\mu)\,\dot{c}^*(\mu) - \dot{l}_1(\mu)\,c^*(\mu)\;\right],$$


where dots denote differentiation with respect to μ and $\mu_0$ denotes the solution to $c^*(\mu_0) = 0$, the value whose prediction line passes through the origin; the circle projection prediction biplot trajectories are given by

$$\boldsymbol{\tau}(\mu)' = c^*(\mu)\left[\;l_1(\mu) \quad l_2(\mu)\;\right]$$

(Gower, Lubbe and le Roux 2011).

This has been implemented in our R function AODbiplot, used for constructing the examples in the next section. It is of interest to note that the nonlinear biplot described by Gower and Hand (1996) or Gower, Lubbe and le Roux (2011) is obtained as a special case of an AoD biplot by specifying an n-group AoD biplot with each group consisting of a single sample.
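To make the construction concrete, the following R sketch (our own illustration, not AODbiplot itself) traces a circular-projection trajectory for variable t under Pythagorean distance, using numerical differentiation of $\boldsymbol{\delta}(\mu)$ in place of the analytical derivative (14); tvar names the variable so as not to mask R's t().

```r
## Sketch: circular-prediction trajectory points tau(mu) = c*(mu) l(mu).
## X is a numeric data matrix; Y, Lam, Dbar, Delta come from earlier sketches.
trajectory_t <- function(X, G, tvar, Y, Lam, Dbar, Delta, mus, eps = 1e-5) {
  nk <- colSums(G)
  delta_mu <- function(mu) {           # ddistances of pseudo sample mu*e_t
    d2 <- rowSums(X^2) - X[, tvar]^2 + (X[, tvar] - mu)^2   # additive, (1)
    dd <- -0.5 * d2
    as.vector(t(G) %*% dd) / nk - diag(Dbar) / 2            # (6)
  }
  rmD <- rowMeans(Delta)               # Delta 1 / K
  t(sapply(mus, function(mu) {
    del  <- delta_mu(mu)
    ddel <- (delta_mu(mu + eps) - delta_mu(mu - eps)) / (2 * eps)
    y    <- solve(Lam, t(Y) %*% (del - rmD))   # (5), full dimensionality
    ydot <- solve(Lam, t(Y) %*% ddel)
    a  <- ydot[1:2]                    # a(mu): the two plotting components
    cc <- sum(y * ydot)                # c(mu) = y(mu)' ydot(mu)
    (cc / sum(a^2)) * a                # tau(mu) = c*(mu) l(mu)
  }))
}
```

Calibration markers are then placed at the rows of the returned matrix corresponding to suitably chosen values in mus.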

    8. Examples

8.1 Ocotea Data

Our first example concerns the properties of timber sampled from species of the hard wood genus Ocotea. Gower, Lubbe and le Roux (2011) give a detailed description of the data, consisting of three groups (the species) and six continuous variables. Anatomical characteristics of 37 wood samples were determined by microscopic methods. The following measurements were made: vessel diameter in μm (VesD), vessel element length in μm (VesL), fibre length in μm (FibL), ray height in μm (RayH), ray width in μm (RayW) and the number of vessels per mm² (NumVes). The 37 samples consisted of three known species: Ocotea bullata, O. porosa and O. kenyensis.

Initially, we used Pythagorean distance, after normalizing the variables in the usual way to unit variances. With our methodology this gives a PCA of the group means; the individual samples are interpolated as described above. A two-dimensional biplot approximation is shown in Figure 1. We have used open symbols for the samples to differentiate the three species and the corresponding filled symbols to mark the positions of the sample means. Notice that, despite the normalization, for convenience we have calibrated the axes in terms of the actual measurements. We can immediately see that Oken scores high on RayH and FibL but low on NumVes. Similarly, Obul scores high on VesL but low on RayH, VesD and RayW. The main feature of Opor is its low score on VesL compared with the other species.

We can also see from Figure 1 that the sample variation within groups is neither homogeneous nor elliptical, contrary to the classical assumptions of CVA; similar heterogeneity is evident in other figures shown in this section. Nonparametric tolerance regions, as discussed by Ringrose (1996) and Krzanowski and Radley (1989), would have to be used if we had been concerned with discrimination.

The biplot axes in Figure 1 are linear and the individual samples are interpolated onto the plot to show within group variation.


    Figure 1. An AoD using Pythagorean distance, showing the group means (filled symbols) and surrounding sample variation (corresponding unfilled symbols). This is similar to a PCA of the group means and, because of the Pythagorean distance, continues to have linear biplot axes.

Because there are only three species, the group means fit exactly into our two-dimensional display. Therefore, the group means given in Table 1 can be read exactly from the biplot axes in Figure 1.

    The partitioning (9) of the total AoD sum of squared distances of 216.0 in the analysis underlying Figure 1 is: Between = 66.9625 and Within = 149.0375.

That there are real differences between the species is evident from Figure 1. A permutation test of the null hypothesis that any observed difference between the group means is due to chance may be made by assigning the 37 samples randomly to three groups of sizes 20, 10 and 7 respectively.



Table 1. The Group Mean Values of the Raw Ocotea Data.

        VesD    VesL     FibL     RayH    RayW   NumVes
Obul   98.10  412.00  1185.40   375.35   32.30    14.30
Oken  137.29  401.71  1568.86   446.14   37.29     9.14
Opor  129.30  342.40  1051.70   398.20   39.40    14.80

The between and within contributions to the total sum of squared distances of 216.0 were then determined for 10 000 repetitions, and the achieved significance level (ASL) determined as the proportion of times the between contribution exceeds the value of 66.9625. It is clear from the permutation density displayed in Figure 2 that the null hypothesis is rejected, with an ASL of approximately zero.
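The test is easy to sketch in R, reusing aod_anova() from the Section 4 sketch; perm_test and its defaults are our illustrative choices:

```r
## Sketch of the permutation test: random reallocation of the 37 samples
## to groups of sizes 20, 10 and 7, recomputing the between sum of squares.
perm_test <- function(D, sizes = c(20, 10, 7), nrep = 10000,
                      observed = 66.9625) {
  between <- replicate(nrep, {
    lab <- sample(rep(seq_along(sizes), sizes))    # random group labels
    G   <- outer(lab, seq_along(sizes), "==") + 0
    aod_anova(D, G)[["between"]]
  })
  mean(between >= observed)                        # achieved significance level
}
```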

    Next we repeat the analysis of Figure 1, but using the square root of the Manhattan distance in the AoD biplot function (after the same normalization of the data underlying the biplot in Figure 1). Notice that taking the derivative of an absolute value function results in the sharp turns in the biplot axes of Figure 3.

    In contrast to the Pythagorean distance used in Figure 1, the square root of the Manhattan distance results in perfect separation between the samples of the three groups. However, projections onto the trajectories can be ambiguous and are more problematic.

    We note that from (9) the partitioning of the sum of squared distances of the AoD based on the square root of Manhattan distance follows as: Total = 119.2082; Between = 27.3292 and Within = 91.8790.

Next, we give an example of a Euclidean embeddable distance that results in smooth nonlinear trajectories, namely Clark's distance (Gower and Ngouenet 2005), defined by

$$d_{ij} = \sqrt{\sum_{k=1}^{p}\left(\frac{x_{ik} - x_{jk}}{x_{ik} + x_{jk}}\right)^2}$$

for non-negative values $x_{ik}, x_{jk}$. Clark's distance is invariant to the scaling of the variables but not to the location of the origin. Thus, the variables in Clark's distance should always be positive (and preferably nonzero); it is ideal when there is a natural zero for every variable. Since the Ocotea data contain only non-negative values, Clark's distance (without any normalization) can be used in the AoD. The resulting biplot, with axes for circular prediction, is shown in Figures 4 and 5.

It is clear that the intersections of the circles with the respective axes in Figure 5 give relatively accurate predictions of the true values in Table 1, with the exception of the VesL value of Oken and the NumVes value of Obul. The corresponding partitioning (9) of the sum of the squared distances now becomes: Total = 2.7810; Between = 0.7926 and Within = 1.9884.


Figure 2. Permutation distribution of the between sum of squares for 10 000 repetitions, performing the AoD resulting in Figure 1 by randomly allocating the observed samples to three groups of sizes 20, 10 and 7 respectively.

Taking Figures 4 and 5 together, we can make the following remarks. Firstly, the axes are only slightly nonlinear; indeed some axes turn out to be close to linearity. The corresponding AoD biplot with axes constructed to enable normal prediction will give the same predictions as those in Figures 4 and 5. It is therefore not shown here, but we can report that the program used for constructing Figures 4 and 5 takes several orders of magnitude longer when instructed to construct axes for normal prediction. As well as the nonlinear nature of the axes, we can remark on the regularity of the scale markers used for calibration. Of course, in PCA everything is linear and regular, while in Figure 3 everything is nonlinear and irregular. However, Clark's distance seems to have produced only mild nonlinearity accompanied by mild irregularity. Most notable are probably the variables NumVes and FibL, the latter being particularly noteworthy because its irregularity occurs in the centre of the range of the Obul samples. Apart from speed considerations, circular projection has the property of giving all predictions for a sample simultaneously, which is useful, especially when used in conjunction with interactive software. Although we have shown only predictions for the group means, both circular prediction and normal prediction are equally valid for predicting the values of all sample points.


Figure 3. The same as Figure 1 but using the square root of Manhattan distance. In order to follow a trajectory more easily, different colours are used for them. The trajectories are now angular (at data-values) but the groups overlap less.

8.2 Pine Data

In our second example, we illustrate the effects of weighting on the AoD. We use a small set of data on 36 samples (group sizes 11, 5, 6, 9, 5) collected from five species of pine (the groups). The species are described by seven variables: TotYield (total pulp yield expressed as a percentage of the original mass); Alkali (percentage alkali consumption); Density (wood density in kg m⁻³); TEA (tensile energy absorption in mJ g⁻¹); Tensile (tensile index); Tear (tearing index in mN m² g⁻¹) and Burst (burst index in kPa m² g⁻¹) (see Gower, Lubbe and le Roux 2011 for a detailed description). The data come from an investigation at a South African wood mill into the underlying relationships between genetic (species) and physiological factors of wood, including pulp quality. The Pine data will also be used to show how the within sum of squares can be decomposed into components from the delta-space (here in four dimensions), the biplot display space (here in two dimensions), and the space orthogonal to the delta-space (here in three dimensions). In this example, we use Pythagorean distance throughout, but similar methodology would apply with other distances, though not necessarily with similar results.



Figure 4. AoD biplot with Clark's distance on the original Ocotea data. The axes are designed for circular prediction and are well-behaved, with no, or very slight, nonlinearity and near-regular calibrated intervals.



Figure 5. Similar to Figure 4 but omitting the samples to illustrate how circular prediction can be made for the group means.

The partitioning of the total sum of squares associated with the above AoD is: Total ss = 245.00; Between ss = 75.15 and Within ss = 169.85. The contributions in the delta-space to the above partitioning are given in Table 2. Notice that the between sum of squares is confined to the four dimensions of the delta-space, while the within group sum of squares has components in the three higher dimensions. Summing over the first r dimensions gives the contributions in the r-dimensional display space.



Table 2. Contributions to the overall partitioning of the sum of squares obtained in the AoD analysis of the Pine data for both the unweighted and weighted analyses. The first four columns refer to the delta-space and the fifth column to all dimensions orthogonal to the delta-space.

Unweighted analysis
             Dim 1   Dim 2   Dim 3   Dim 4   Dim >4      Sum
Between ss   34.36   27.41   11.60    1.79     0        75.16
Within ss    20.36   55.94   24.62   16.44    52.48    169.84
Total ss     54.72   83.35   36.22   18.23    52.48    245.00

Weighted analysis
             Dim 1   Dim 2   Dim 3   Dim 4   Dim >4      Sum
Between ss   35.97   27.18   10.25    1.75     0        75.16
Within ss    12.05   64.24   25.42   15.65    52.48    169.84
Total ss     48.02   91.42   35.67   17.40    52.48    245.00

Thus, the overall quality of the display in two dimensions is 82.18% (unweighted) and 84.04% (weighted). Figures 6 and 7 contain the unweighted and weighted AoD biplots for the Pine data, respectively.

    The only difference between the unweighted and weighted analyses is in the values found in delta-space. The most striking observation from Figures 6 and 7 is that the weighted and unweighted analyses are very close. Probably the only difference apparent to the naked eye lies in the different distribution of within group variances in the first two dimensions, and even that is accounted for to some extent by differences in orientation.

    Next, we consider the individual contributions to the total within sum of squares (52.48) in the space orthogonal to the delta-space. In Table 3 we show the samples having the five smallest together with the samples having the five largest individual within sum of squares in the space orthogonal to the delta-space.

    Table 3 shows that samples 15, 22, 25, 27 and 34 are nearest to the delta-space while samples 30, 10, 29, 2 and 31 are the furthest away from the delta-space.

Finally, the sum of squares orthogonal to the delta-space can be computed separately for each group (Table 4).

9. Conclusions

Within the context of grouped samples, the methodology described here generalizes canonical analysis to apply to any additive Euclidean embeddable distance, giving associated visualizations of samples and group means, together with the usual accessories of analysis of variance, representations of uncertainty regions and calibrated biplot axes. This generalization has two advantages: (a) it presents a methodology for grouped data that handles commonly used distances of the kind often met in applications to ungrouped data in fields such as ecology, taxonomy and sociology, and (b) our methods are computationally efficient, depending on the number of groups rather than the total number of samples. The generalization is not quite complete, as the additivity assumption does not directly allow for intra-group correlation, but even this may be possible if there is an independently available metric correction-matrix, possibly obtained from previously determined or hypothesized measures of within group dispersion.


Figure 6. Unweighted AoD of the Pine data for comparison with the weighted AoD in Figure 7. Pythagorean distance is used and predictive axes constructed. Axes pass through the unweighted centroid of the group centroids. With Pythagorean distance the axes are linear with regular calibrations.



    Figure 7. Weighted AoD for comparison with the unweighted AoD of Figure 6. The axes run through the weighted centroid of the group centroids which is the same as the centroid of all the individual samples. However, note that the intersection of the axes can be placed anywhere on the plot using orthogonal parallel shifts as explained in Gower, Lubbe and le Roux (2011).

Although we have presented our results in terms of continuous variables, the methodology extends to cover categorical variables, or mixtures of continuous and categorical variables. The main difference is that because a categorical variable has a limited number of possible levels, it will be represented by a set of points, the category level points (CLPs), rather than by a continuous linear or nonlinear trajectory. Although the basic ideas are similar to those we have discussed above, their algebraic development is quite demanding and we shall develop the details elsewhere, showing that many properties of CLPs for ungrouped data extend to grouped data.



Table 3. The five samples with the smallest and the five samples with the largest within sum of squares in the space orthogonal to the delta-space (unweighted and weighted analyses identical).

  Smallest within ss           Largest within ss
Sample  Sum of squares       Sample  Sum of squares
  15         0.09              31         3.13
  22         0.23              02         3.25
  25         0.31              29         3.56
  27         0.32              10         3.76
  34         0.50              30         4.37

Table 4. The within group sum of squares orthogonal to the delta-space, per group.

Group     SS     Group size
P.ell   19.18        11
P.kes    4.02         5
P.max    7.45         6
P.pat   15.51         9
P.tae    6.32         5
Sum     52.48        36

    References

DIGBY, P.G.N., and GOWER, J.C. (1981), "Ordination Between and Within Groups Applied to Soil Classification", in Down to Earth Statistics: Solutions Looking for Geological Problems, ed. D.F. Merriam, Syracuse University Geology Contributions, pp. 53-75.

GOWER, J.C. (1968), "Adding a Point to Vector Diagrams in Multivariate Analysis", Biometrika, 55, 582-585.

GOWER, J.C. (1989), "Generalized Canonical Analysis", in Multiway Data Analysis, eds. R. Coppi and S. Bolasco, Amsterdam: Elsevier (North Holland).

GOWER, J.C., and DIJKSTERHUIS, G.B. (2004), Procrustes Problems, Oxford: Oxford University Press.

GOWER, J.C., and NGOUENET, R.F. (2005), "Nonlinearity Effects in Multidimensional Scaling", Journal of Multivariate Analysis, 94, 344-365.

GOWER, J.C., and HAND, D.J. (1996), Biplots, London: Chapman and Hall.

GOWER, J.C., and KRZANOWSKI, W.J. (1999), "Analysis of Distance for Structured Multivariate Data", Applied Statistics, 48, 505-519.

GOWER, J.C., LUBBE, S., and LE ROUX, N.J. (2011), Understanding Biplots, Chichester: John Wiley & Sons Ltd.

GOWER, J.C., and LEGENDRE, P. (1986), "Metric and Euclidean Properties of Dissimilarity Coefficients", Journal of Classification, 3, 5-48.

KRZANOWSKI, W.J. (1994), "Ordination in the Presence of Group Structure, for General Multivariate Data", Journal of Classification, 11, 195-207.

KRZANOWSKI, W.J. (2004), "Biplots for Multifactorial Analysis of Distance", Biometrics, 60, 517-524.

KRZANOWSKI, W.J. (2000), Principles of Multivariate Analysis: A User's Perspective (Revised Edition), Oxford: Oxford University Press.

KRZANOWSKI, W.J., and RADLEY, D. (1989), "Nonparametric Confidence and Tolerance Regions in Canonical Variate Analysis", Biometrics, 45, 1163-1173.

MARDIA, K.V., KENT, J.T., and BIBBY, J.M. (1979), Multivariate Analysis, London: Academic Press.

McLACHLAN, G.J. (1992), Discriminant Analysis and Statistical Pattern Recognition, Chichester: John Wiley & Sons Ltd.

RINGROSE, T.J. (1996), "Alternative Confidence Regions for Canonical Variate Analysis", Biometrika, 83, 575-587.