
Estimation of Evolutionary Distance for Reconstructing Molecular Phylogenetic Trees

Fumio Tajima* and Naoko Takezaki†

*Department of Population Genetics, National Institute of Genetics, and †Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University

The most commonly used measure of evolutionary distance in molecular phylogenetics is the number of nucleotide substitutions per site. However, this number is not necessarily the most efficient for reconstructing a phylogenetic tree. In order to evaluate the accuracy of an evolutionary distance, $D(t)$, for obtaining the correct tree topology, an accuracy index, $A(t)$, was proposed. This index is defined as $D'(t)/\sqrt{V[D(t)]}$, where $D'(t)$ is the first derivative of $D(t)$ with respect to evolutionary time and $V[D(t)]$ is the sampling variance of the evolutionary distance. Using $A(t)$, namely, finding the condition under which $A(t)$ gives the maximum value, we can obtain an evolutionary distance which is efficient for obtaining the correct topology. Under the assumption that transversional changes do not occur as frequently as transitional changes, we obtained the evolutionary distances which are expected to give the correct topology more often than are the other distances.

Introduction

There are many methods for reconstructing a molecular phylogenetic tree from an evolutionary distance matrix, such as the unweighted pair-group method (Sneath and Sokal 1973) and the neighbor-joining method (Saitou and Nei 1987). In these methods the number of nucleotide substitutions per site is usually used as an evolutionary distance when DNA sequences are analyzed (see, e.g., Nei 1987).

Several studies, however, suggest that, in order to obtain the correct tree topology, the number of nucleotide substitutions per site may not be the best evolutionary distance when the rate of nucleotide substitution is the same among different lineages (Saitou and Nei 1987; Saitou and Imanishi 1989; Schöniger and von Haeseler 1993). For example, Saitou and Nei (1987) have shown by using computer simulations that the efficiency of obtaining the correct tree topology from the matrix of the proportion of different nucleotides is nearly the same as (or even higher than) that of obtaining the correct tree topology from the matrix of the number of nucleotide substitutions per site estimated by using Jukes and Cantor's (1969) method. This means that there

Key words: DNA sequences, evolutionary distance, molecular phylogenetic tree, Jukes and Cantor's method, Kimura's method, transition/transversion bias.

Address for correspondence and reprints: Fumio Tajima, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411, Japan.

Mol. Biol. Evol. 11(2):278-286. 1994. © 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1102-0011$02.00


might be a better evolutionary distance than the number of nucleotide substitutions per site.

When the rate of nucleotide substitution varies considerably among different lineages, however, the number of nucleotide substitutions per site gives the correct tree topology more often than does the proportion of different nucleotides between nucleotide sequences (Saitou and Imanishi 1989; Schöniger and von Haeseler 1993). This suggests that the evolutionary distance should be proportional to the number of nucleotide substitutions when the substitution rate varies among different lineages. Of course, this does not necessarily mean that the number of nucleotide substitutions per site is the best measure for obtaining the correct tree topology.

In this paper we shall present an evolutionary distance which is expected to give the correct tree topology more often than does the number of nucleotide substitutions per site, under the assumption that the rate of transitional change is not the same as that of transversional change, i.e., under Kimura's (1980) model. We consider only tree topologies, so that branch lengths must be recomputed by using the number of nucleotide substitutions per site.

Theory

Consider two nucleotide sequences which diverged $t$ years ago. Let $D(t)$ be the evolutionary distance between them. The accuracy of $D(t)$ for reconstructing the correct tree topology depends on whether we can distinguish $D(t+\Delta t)$ from $D(t)$, for a given $\Delta t$. If the sampling


Table 1
Ratio of the Accuracy Index for p(t) to That for d(t), under Jukes and Cantor's Model

n         d(t) = 0.5       d(t) = 1.0       d(t) = 1.5       d(t) = 2.0
100       1.004 (0.000)    1.025 (0.000)    1.131 (0.020)    1.333 (0.152)
200       1.002 (0.000)    1.010 (0.000)    1.061 (0.001)    1.223 (0.061)
500       1.001 (0.000)    1.003 (0.000)    1.015 (0.000)    1.099 (0.006)
1,000     1.000 (0.000)    1.002 (0.000)    1.006 (0.000)    1.038 (0.000)

NOTE: Numbers in parentheses are the proportions of inapplicable cases.

error of $D(t)$ is substantially smaller than the difference between $D(t+\Delta t)$ and $D(t)$, we can easily distinguish $D(t+\Delta t)$ from $D(t)$. Therefore, using a continuous approximation, we define the accuracy index, $A(t)$, by

$$A(t) = \frac{D'(t)}{\sqrt{V[D(t)]}}, \tag{1}$$

where $D'(t)$ is the first derivative of $D(t)$ with respect to divergence time and $V[D(t)]$ is the sampling variance of $D(t)$. It should be noted here that the smallest variance of $D(t)$ does not necessarily give the largest value of $A(t)$, unless $D(t)$ is linear with $t$. Using equation (1), we can examine which evolutionary distance is more accurate.

Jukes and Cantor's Model

Before we consider Kimura's model, we first consider Jukes and Cantor's (1969) model, under which the rate of nucleotide substitution is assumed to be the same among different nucleotides. We also assume that the substitution rate is the same among different nucleotide sites. Consider two nucleotide sequences which diverged $t$ years ago. Let $\lambda$ be the rate of nucleotide substitution per site per year. Then the number of nucleotide substitutions per site, $d(t)$, is

$$d(t) = 2\lambda t. \tag{2}$$

On the other hand, the proportion of different nucleotides, $p(t)$, is

$$p(t) = c\left[1 - e^{-d(t)/c}\right], \tag{3}$$

where $c = 3/4$, so that $d(t)$ can be estimated by

$$d(t) = -c \ln[1 - p(t)/c] \tag{4}$$

(Jukes and Cantor 1969; Kimura and Ohta 1972).
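Equations (3) and (4) are inverses of each other, which is easy to check numerically. The following is a minimal Python sketch, not part of the original paper; the function names `p_from_d` and `d_from_p` are hypothetical and chosen only for illustration. The error raised for $p \ge c$ reflects the fact that the logarithm in equation (4) is undefined there.

```python
import math

C = 3.0 / 4.0  # c = 3/4 under Jukes and Cantor's model


def p_from_d(d):
    """Equation (3): expected proportion of different nucleotides."""
    return C * (1.0 - math.exp(-d / C))


def d_from_p(p):
    """Equation (4): estimated number of substitutions per site."""
    if p >= C:
        # The logarithm is undefined, so the distance cannot be computed.
        raise ValueError("inapplicable case: p must be smaller than c = 3/4")
    return -C * math.log(1.0 - p / C)


for d in (0.5, 1.0, 1.5, 2.0):
    p = p_from_d(d)
    print(f"d = {d:.1f}  p = {p:.4f}  back-transformed d = {d_from_p(p):.4f}")
```

Because $p(t)$ saturates toward $c$ as $d(t)$ grows, an observed proportion from a short sequence can reach or exceed $3/4$ by chance, which is one way an estimate becomes inapplicable.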

Therefore, if the proportion of different nucleotides is used as an evolutionary distance, we have $D(t) = p(t)$, so that we obtain $D'(t) = 2\lambda e^{-d(t)/c} = 2\lambda[1-p(t)/c]$ and $V[D(t)] = p(t)[1-p(t)]/n$, where $n$ is the number of nucleotides in each sequence. From these, we obtain

$$A(t) = 2\lambda[1-p(t)/c]\sqrt{\frac{n}{p(t)[1-p(t)]}}. \tag{5}$$

On the other hand, if the number of nucleotide substitutions per site is used, we have $D(t) = d(t) = 2\lambda t$, so that we obtain $D'(t) = 2\lambda$. The large-sample variance expected when equation (4) is used is approximately given by

$$V[d(t)] = \left[\frac{\partial d(t)}{\partial p(t)}\right]^2 V[p(t)] = \frac{p(t)[1-p(t)]}{[1-p(t)/c]^2\, n} \tag{6}$$

(Kimura and Ohta 1972). Then, we have

$$A(t) = 2\lambda[1-p(t)/c]\sqrt{\frac{n}{p(t)[1-p(t)]}}. \tag{7}$$

Comparing equation (5) with equation (7), we can conclude that the accuracy of $p(t)$ is nearly the same as that of $d(t)$. In order to obtain the variance of $d(t)$, however, an approximation was involved (see eq. [6]). If we compute the exact variance, the accuracy of $d(t)$ becomes slightly smaller than that of $p(t)$. Table 1 shows the ratio of the accuracy index for $p(t)$ to that for $d(t)$, which was numerically computed from the probability distribution of $p(t)$. In the case of $d(t)$, $A(t)$ was computed excluding inapplicable cases. We can see from this table that the proportion of different nucleotides is more appropriate than is the number of nucleotide substitutions per site, when nucleotide sequences are short and when $d(t)$ is large. These results are consistent with the simulation results obtained by Saitou and Nei (1987), Saitou and Imanishi (1989), and Schöniger and von Haeseler (1993). Thus, we recommend that the proportion of different nucleotides be used when the rate of nucleotide substitution is the same among different lineages.
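The agreement between equations (5) and (7) can be confirmed numerically. The sketch below is an illustration only: the function names are hypothetical, and the default `lam=1.0` is an arbitrary choice, since $\lambda$ merely scales both indices. It does not reproduce the exact-variance ratios of table 1, which require the full probability distribution of $p(t)$.

```python
import math

C = 0.75  # c = 3/4 under Jukes and Cantor's model


def accuracy_p(d, n, lam=1.0):
    """Equation (5): A(t) when D(t) = p(t)."""
    p = C * (1.0 - math.exp(-d / C))
    slope = 2.0 * lam * (1.0 - p / C)   # D'(t)
    variance = p * (1.0 - p) / n        # binomial variance of p(t)
    return slope / math.sqrt(variance)


def accuracy_d(d, n, lam=1.0):
    """Equations (6) and (7): A(t) when D(t) = d(t), approximate variance."""
    p = C * (1.0 - math.exp(-d / C))
    slope = 2.0 * lam                   # D'(t)
    variance = p * (1.0 - p) / ((1.0 - p / C) ** 2 * n)
    return slope / math.sqrt(variance)


for d in (0.5, 1.0, 1.5, 2.0):
    ratio = accuracy_p(d, 100) / accuracy_d(d, 100)
    print(f"d(t) = {d:.1f}  ratio of accuracy indices = {ratio:.6f}")
```

Under the approximate variance the ratio is 1 up to rounding, so the departures from 1 in table 1 come entirely from using the exact variance.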


Kimura's Transition/Transversion Model

Consider two nucleotide sequences which diverged $t$ years ago. Let $\alpha$ and $2\beta$ be the rates of transitional and transversional substitutions per site per year, respectively, so that the total rate of substitution per site per year is $\alpha + 2\beta$. Denote the proportions of transitional and transversional differences by $P(t)$ and $Q(t)$, respectively. Then the number of nucleotide substitutions per site is given by

$$d(t) = 2\alpha t + 4\beta t. \tag{8}$$

In Kimura's (1980) method $d(t)$ is estimated by

$$d(t) = d_1(t) + d_2(t), \tag{9}$$

where $d_1(t)$ and $d_2(t)$ are given by

$$d_1(t) = 2(\alpha+\beta)t = -\tfrac{1}{2}\ln[1-2P(t)-Q(t)] \tag{9a}$$

and

$$d_2(t) = 2\beta t = -\tfrac{1}{4}\ln[1-2Q(t)]. \tag{9b}$$

$P(t)$ and $Q(t)$ can be expressed as

$$P(t) = \tfrac{1}{4} - \tfrac{1}{2}e^{-4(\alpha+\beta)t} + \tfrac{1}{4}e^{-8\beta t} \tag{10a}$$

and

$$Q(t) = \tfrac{1}{2} - \tfrac{1}{2}e^{-8\beta t}. \tag{10b}$$

Let us now obtain the accuracy index when the proportion of different nucleotides is used as an evolutionary distance, i.e., $D(t) = P(t) + Q(t)$. In this case, we obtain

$$D'(t) = 2(\alpha+\beta)e^{-4(\alpha+\beta)t} + 2\beta e^{-8\beta t} = 2(\alpha+\beta)[1-2P(t)-Q(t)] + 2\beta[1-2Q(t)] \tag{11}$$

and

$$V[D(t)] = \frac{D(t)[1-D(t)]}{n} = \frac{[P(t)+Q(t)][1-P(t)-Q(t)]}{n}. \tag{12}$$

Using equations (11) and (12), we can obtain $A(t) = D'(t)/\sqrt{V[D(t)]}$.
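As a numerical illustration of equations (10a) through (12), the sketch below evaluates $A(t)$ for $D(t) = P(t) + Q(t)$. It is a minimal example, not the authors' program; the function names are hypothetical, and the parameter values ($\alpha = 3 \times 10^{-9}$, $\beta = 1 \times 10^{-9}$, $n = 1{,}000$) are those listed for part A of table 2.

```python
import math


def k2p_proportions(alpha, beta, t):
    """Equations (10a) and (10b): expected P(t) and Q(t)."""
    e1 = math.exp(-4.0 * (alpha + beta) * t)
    e2 = math.exp(-8.0 * beta * t)
    P = 0.25 - 0.5 * e1 + 0.25 * e2
    Q = 0.5 - 0.5 * e2
    return P, Q


def accuracy_p_distance(alpha, beta, t, n):
    """A(t) for D(t) = P(t) + Q(t), from equations (11) and (12)."""
    P, Q = k2p_proportions(alpha, beta, t)
    slope = (2.0 * (alpha + beta) * (1.0 - 2.0 * P - Q)
             + 2.0 * beta * (1.0 - 2.0 * Q))
    variance = (P + Q) * (1.0 - P - Q) / n
    return slope / math.sqrt(variance)


alpha, beta, n = 3e-9, 1e-9, 1000
for d in (0.5, 1.0, 1.5, 2.0):
    t = d / (2.0 * alpha + 4.0 * beta)  # invert equation (8)
    print(f"d(t) = {d:.1f}  A(t) = {accuracy_p_distance(alpha, beta, t, n):.3g}")
```

The divergence time is obtained by inverting equation (8), so each row corresponds to a fixed expected number of substitutions per site.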

On the other hand, when the number of nucleotide substitutions per site is used as an evolutionary distance, i.e., $D(t) = d(t) = d_1(t) + d_2(t) = 2\alpha t + 4\beta t$, we have

$$D'(t) = 2\alpha + 4\beta. \tag{13}$$

The sampling variance expected when equations (9), (9a), and (9b) are used is approximately given by

$$V[D(t)] = \frac{a^2P(t) + b^2Q(t) - [aP(t)+bQ(t)]^2}{n}, \tag{14}$$

where

$$a = \frac{1}{1-2P(t)-Q(t)} \tag{14a}$$

and

$$b = \frac{1}{2}\left[\frac{1}{1-2P(t)-Q(t)} + \frac{1}{1-2Q(t)}\right]. \tag{14b}$$
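Equations (9a), (9b), and (13) through (14b) translate directly into code. The following sketch is illustrative only; the function names are hypothetical, and $P(t)$ and $Q(t)$ are set to their expected values from equations (10a) and (10b) rather than to observed proportions. The parameter values are those listed for part A of table 2.

```python
import math


def kimura_distance(P, Q):
    """Equations (9), (9a), (9b): returns d1, d2, and d = d1 + d2."""
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    if x <= 0.0 or y <= 0.0:
        raise ValueError("inapplicable case: logarithm argument is not positive")
    d1 = -0.5 * math.log(x)
    d2 = -0.25 * math.log(y)
    return d1, d2, d1 + d2


def accuracy_kimura(P, Q, alpha, beta, n):
    """A(t) for D(t) = d1(t) + d2(t), from equations (13) and (14)."""
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    a = 1.0 / x                                  # equation (14a)
    b = 0.5 * (1.0 / x + 1.0 / y)                # equation (14b)
    variance = (a * a * P + b * b * Q - (a * P + b * Q) ** 2) / n
    slope = 2.0 * alpha + 4.0 * beta             # equation (13)
    return slope / math.sqrt(variance)


alpha, beta, n = 3e-9, 1e-9, 1000
t = 0.5 / (2.0 * alpha + 4.0 * beta)             # d(t) = 0.5
e1 = math.exp(-4.0 * (alpha + beta) * t)
e2 = math.exp(-8.0 * beta * t)
P = 0.25 - 0.5 * e1 + 0.25 * e2                  # equation (10a)
Q = 0.5 - 0.5 * e2                               # equation (10b)
d1, d2, d = kimura_distance(P, Q)
print(f"d1 = {d1:.3f}  d2 = {d2:.3f}  d = {d:.3f}")
print(f"A(t) = {accuracy_kimura(P, Q, alpha, beta, n):.3g}")
```

With expected proportions the estimator returns the true $d(t)$ exactly; with observed proportions the same functions apply, subject to the inapplicable cases flagged by the error.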

Then $A(t) = D'(t)/\sqrt{V[D(t)]}$ can be obtained from equations (13) and (14).

Numerical examples are shown in table 2, where $\alpha = 3\beta$ and $\alpha = 8\beta$ were assumed. We can see from this table that the distance using the proportion of different nucleotides ($P(t)+Q(t)$) is more efficient for obtaining the correct tree topology than is the distance using the number of nucleotide substitutions per site ($d_1(t)+d_2(t)$) estimated by Kimura's method, when the rate of substitution is the same among different lineages. This conclusion is consistent with the simulation results obtained by Schöniger and von Haeseler (1993).

Transversional Differences

We often use only the transversional differences when the rate of transitional change is much higher than the rate of transversional change and when the number of nucleotide substitutions per site is very large, say $d(t) > 1$. Let us now examine the accuracies of the proportion of transversional differences and the number of transversional substitutions per site, i.e., $2d_2(t)$.

When $D(t) = Q(t)$, we have

$$D'(t) = 4\beta e^{-8\beta t} = 4\beta[1-2Q(t)] \tag{15}$$


Table 2
A(t) for d1(t) + d2(t), P(t) + Q(t), and Q(t), under Kimura's Model, Where n = 1,000 Was Assumed

                              A(t) for
d(t)      d1(t) + d2(t)    P(t) + Q(t)      Q(t)
A. α = 3 × 10^-9 and β = 1 × 10^-9:
0.5       3.19 × 10^-7     3.26 × 10^-7     2.29 × 10^-7
1.0       1.45 × 10^-7     1.59 × 10^-7     1.27 × 10^-7
1.5       6.95 × 10^-8     8.70 × 10^-8     7.99 × 10^-8
2.0       3.29 × 10^-8     4.94 × 10^-8     5.22 × 10^-8
B. α = 4 × 10^-9 and β = 5 × 10^-10:
0.5       2.87 × 10^-7     2.98 × 10^-7     1.80 × 10^-7
1.0       1.15 × 10^-7     1.36 × 10^-7     1.14 × 10^-7
1.5       4.82 × 10^-8     7.39 × 10^-8     8.30 × 10^-8
2.0       2.03 × 10^-8     4.54 × 10^-8     6.36 × 10^-8

NOTE: A(t) for 2d2(t) is nearly the same as A(t) for Q(t).

and $V[D(t)] = Q(t)[1-Q(t)]/n$.


Table 3
A(t) for wP(t) + Q(t) under Kimura's Model, Where n = 1,000 Was Assumed

w         d(t) = 0.5       d(t) = 1.0       d(t) = 1.5       d(t) = 2.0
A. α = 3 × 10^-9 and β = 1 × 10^-9:
0.0       2.29 × 10^-7     1.27 × 10^-7     7.99 × 10^-8     5.22 × 10^-8
0.2       2.73 × 10^-7     1.45 × 10^-7     8.78 × 10^-8     5.56 × 10^-8
0.4       3.06 × 10^-7     1.59 × 10^-7     9.31 × 10^-8     5.75 × 10^-8
0.6       3.23 × 10^-7     1.65 × 10^-7     9.47 × 10^-8     5.70 × 10^-8
0.8       3.28 × 10^-7     1.64 × 10^-7     9.22 × 10^-8     5.41 × 10^-8
1.0       3.26 × 10^-7     1.59 × 10^-7     8.70 × 10^-8     4.94 × 10^-8
Maximum   3.28 × 10^-7     1.65 × 10^-7     9.47 × 10^-8     5.76 × 10^-8
          (0.809)          (0.683)          (0.572)          (0.465)
B. α = 4 × 10^-9 and β = 5 × 10^-10:
0.0       1.80 × 10^-7     1.14 × 10^-7     8.30 × 10^-8     6.36 × 10^-8
0.2       2.45 × 10^-7     1.35 × 10^-7     9.00 × 10^-8     6.58 × 10^-8
0.4       2.69 × 10^-7     1.47 × 10^-7     9.22 × 10^-8     6.48 × 10^-8
0.6       3.02 × 10^-7     1.49 × 10^-7     8.91 × 10^-8     6.03 × 10^-8
0.8       3.03 × 10^-7     1.44 × 10^-7     8.22 × 10^-8     5.32 × 10^-8
1.0       2.98 × 10^-7     1.36 × 10^-7     7.39 × 10^-8     4.54 × 10^-8
Maximum   3.04 × 10^-7     1.49 × 10^-7     9.23 × 10^-8     6.59 × 10^-8
          (0.728)          (0.543)          (0.381)          (0.248)

NOTE: Numbers in parentheses are the w values which maximize A(t).

$$r(t) = [1-2Q(t)]\ln[1-2Q(t)] \tag{23a}$$

and

$$s(t) = [1-2P(t)-Q(t)]\ln[1-2P(t)-Q(t)] - r(t)/2. \tag{23b}$$

Therefore, using equation (23), we can estimate $w$ by replacing $P(t)$ and $Q(t)$ with their observed values. When $P(t)$ and $Q(t)$ are small, however, we cannot obtain a reliable estimate of $w$, and we often have $w > 1$. In fact, we cannot use equation (23) when $P(t) = 0$ and/or $Q(t) = 0$. When $2P(t) < Q(t)$, we suggest that $w = 1$ be used. In the case of $Q(t) = 0$, $w$ can be estimated by $w = s(t)/\{s(t)P(t) - 2P(t)[1-P(t)]\}$, which was obtained from equation (23) by letting $Q(t) \to 0$. It should be noted that $wP(t) + Q(t)$ cannot be used when the substitution rate varies among different lineages.

On the other hand, $D(t) = wd_1(t) + d_2(t)$, where $d_1(t)$ and $d_2(t)$ are estimated by equations (9a) and (9b), respectively, can be used even when the substitution rate varies among different lineages. In this case, we have

$$D'(t) = 2(\alpha+\beta)w + 2\beta \tag{24}$$

and

$$V[D(t)] = \frac{a^2P(t) + b^2Q(t) - [aP(t)+bQ(t)]^2}{n}, \tag{25}$$

where $a$ and $b$ are given by

$$a = \frac{w}{1-2P(t)-Q(t)} \tag{25a}$$

and

$$b = \frac{1}{2}\left[\frac{w}{1-2P(t)-Q(t)} + \frac{1}{1-2Q(t)}\right]. \tag{25b}$$

Then $A(t) = D'(t)/\sqrt{V[D(t)]}$ can be obtained. Using equation (22), we have

$$w = \frac{d_1(t)V_2 - d_2(t)\,\mathrm{Cov}}{d_2(t)V_1 - d_1(t)\,\mathrm{Cov}}, \tag{26}$$

where $V_1$, $V_2$, and $\mathrm{Cov}$ are given by

$$V_1 = \frac{4P(t) + Q(t) - [2P(t)+Q(t)]^2}{[1-2P(t)-Q(t)]^2}, \tag{26a}$$

$$V_2 = \frac{Q(t)[1-Q(t)]}{[1-2Q(t)]^2}, \tag{26b}$$

and

$$\mathrm{Cov} = \frac{Q(t)}{1-2Q(t)}, \tag{26c}$$


Table 4
A(t) for wd1(t) + d2(t) under Kimura's Model, Where n = 1,000 Was Assumed

w         d(t) = 0.5       d(t) = 1.0       d(t) = 1.5       d(t) = 2.0
A. α = 3 × 10^-9 and β = 1 × 10^-9:
0.0       2.29 × 10^-7     1.27 × 10^-7     7.99 × 10^-8     5.22 × 10^-8
0.05      2.61 × 10^-7     1.45 × 10^-7     9.04 × 10^-8     5.74 × 10^-8
0.1       2.85 × 10^-7     1.57 × 10^-7     9.44 × 10^-8     5.63 × 10^-8
0.2       3.13 × 10^-7     1.65 × 10^-7     9.22 × 10^-8     4.96 × 10^-8
0.5       3.28 × 10^-7     1.58 × 10^-7     7.88 × 10^-8     3.83 × 10^-8
1.0       3.19 × 10^-7     1.45 × 10^-7     6.95 × 10^-8     3.29 × 10^-8
Maximum   3.28 × 10^-7     1.65 × 10^-7     9.47 × 10^-8     5.76 × 10^-8
          (0.455)          (0.233)          (0.121)          (0.061)
B. α = 4 × 10^-9 and β = 5 × 10^-10:
0.0       1.80 × 10^-7     1.14 × 10^-7     8.30 × 10^-8     6.36 × 10^-8
0.05      2.41 × 10^-7     1.44 × 10^-7     8.96 × 10^-8     4.92 × 10^-8
0.1       2.76 × 10^-7     1.49 × 10^-7     7.81 × 10^-8     3.66 × 10^-8
0.2       3.00 × 10^-7     1.41 × 10^-7     6.44 × 10^-8     2.80 × 10^-8
0.5       2.98 × 10^-7     1.23 × 10^-7     5.27 × 10^-8     2.23 × 10^-8
1.0       2.87 × 10^-7     1.15 × 10^-7     4.82 × 10^-8     2.03 × 10^-8
Maximum   3.04 × 10^-7     1.49 × 10^-7     9.23 × 10^-8     6.59 × 10^-8
          (0.284)          (0.092)          (0.029)          (0.009)

NOTE: Numbers in parentheses are the w values which maximize A(t).

and $d_1(t)$ and $d_2(t)$ are given by equations (9a) and (9b), respectively. Therefore, from equation (26) we can estimate $w$ by replacing $P(t)$ and $Q(t)$ with their observed values.
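The estimate of $w$ from equation (26) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `weight_for_d1_d2` is hypothetical, the inputs are the expected $P(t)$ and $Q(t)$ for $\alpha = 3 \times 10^{-9}$, $\beta = 1 \times 10^{-9}$, and $d(t) = 0.5$, and the function does not handle $Q(t) = 0$ or other cases where the logarithms or the denominator break down.

```python
import math


def weight_for_d1_d2(P, Q):
    """Equation (26): the weight w for D(t) = w*d1(t) + d2(t)."""
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    d1 = -0.5 * math.log(x)                              # equation (9a)
    d2 = -0.25 * math.log(y)                             # equation (9b)
    v1 = (4.0 * P + Q - (2.0 * P + Q) ** 2) / x ** 2     # equation (26a)
    v2 = Q * (1.0 - Q) / y ** 2                          # equation (26b)
    cov = Q / y                                          # equation (26c)
    return (d1 * v2 - d2 * cov) / (d2 * v1 - d1 * cov)


# Expected proportions from equations (10a) and (10b) at d(t) = 0.5,
# where 4(alpha+beta)t = 0.8 and 8*beta*t = 0.4.
e1, e2 = math.exp(-0.8), math.exp(-0.4)
P = 0.25 - 0.5 * e1 + 0.25 * e2
Q = 0.5 - 0.5 * e2
print(f"w = {weight_for_d1_d2(P, Q):.3f}")
```

The result can be compared with the maximizing $w$ reported in parentheses in table 4 for the same parameter set.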

When $P(t)$ and $Q(t)$ are small, the estimate of $w$ obtained from equation (26) may not be reliable. As in the case of $wP(t) + Q(t)$, we set $w = 1$ if $2P(t) < Q(t)$. In the case of $Q(t) = 0$, $w$ can be estimated by $w = d_1(t)/[V_1/2 - d_1(t)]$, which was obtained by letting $Q(t) \to 0$.

Numerical examples are shown in tables 3 and 4, where $\alpha = 3\beta$ and $\alpha = 8\beta$ were assumed in parts A and B, and $n = 1{,}000$ was also assumed, as before. The $w$ value which maximizes $A(t)$ was computed by using

[Figure 1: Model Tree A and Model Tree B]

FIG. 1. Model trees used for computer simulations. Model tree A assumes constant rate of substitution, whereas model tree B assumes varying rate of substitution.

equation (23) or equation (26) and is shown in parentheses. We can see from these tables that the weighting method improves the accuracy of evolutionary distance.
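The effect of weighting can also be explored by brute force. The sketch below scans $w$ on a grid and evaluates $A(t)$ for $D(t) = wP(t) + Q(t)$. Two points are assumptions of this example and are not taken from the text: the sampling variance is written as the multinomial variance of the linear combination, $[w^2P + Q - (wP+Q)^2]/n$, and a grid search stands in for the closed-form solution. The function name is hypothetical; the parameter values are those of part A of table 3.

```python
import math


def accuracy_weighted_proportion(w, alpha, beta, t, n):
    """A(t) for D(t) = w*P(t) + Q(t) under Kimura's model."""
    e1 = math.exp(-4.0 * (alpha + beta) * t)
    e2 = math.exp(-8.0 * beta * t)
    P = 0.25 - 0.5 * e1 + 0.25 * e2                  # equation (10a)
    Q = 0.5 - 0.5 * e2                               # equation (10b)
    dP = 2.0 * (alpha + beta) * e1 - 2.0 * beta * e2 # derivative of P(t)
    dQ = 4.0 * beta * e2                             # derivative of Q(t)
    slope = w * dP + dQ
    # Assumed multinomial variance of the linear combination w*P + Q.
    variance = (w * w * P + Q - (w * P + Q) ** 2) / n
    return slope / math.sqrt(variance)


alpha, beta, n = 3e-9, 1e-9, 1000
t = 0.5 / (2.0 * alpha + 4.0 * beta)                 # d(t) = 0.5
grid = [i / 1000.0 for i in range(1001)]             # w from 0 to 1
best_w = max(grid, key=lambda w: accuracy_weighted_proportion(w, alpha, beta, t, n))
best_a = accuracy_weighted_proportion(best_w, alpha, beta, t, n)
print(f"w maximizing A(t) = {best_w:.3f}, A(t) = {best_a:.3g}")
```

Setting $w = 0$ recovers the index for $Q(t)$ alone and $w = 1$ recovers the index for $P(t) + Q(t)$, so the grid interpolates between the two unweighted distances of table 2.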

One problem involved in the weighting method is that there are $N(N-1)/2$ pairwise distances when we reconstruct a phylogenetic tree from $N$ sequences and that a value of $w$ can be estimated from each pairwise distance, so that we have $N(N-1)/2$ values of $w$. We must use, however, the same value of $w$ for all distances. One possible way to choose the value of $w$ is to compute values of $w$ for all pairwise comparisons by using equation (23) or (26), then to choose the smallest value of $w$ among them. Another possible way is to use the average of $w$ over all values of $w$. In the case of $D(t) = wP(t) + Q(t)$, table 3 suggests that the arithmetic mean can be used. On the other hand, in the case of $D(t) = wd_1(t) + d_2(t)$, table 4 suggests that the harmonic mean can be used.

Computer Simulation

In order to know the accuracy of new methods, we have conducted computer simulations. In these simulations we used model trees A and B shown in figure 1, which are the same as those used by Schöniger and von Haeseler (1993). Branch lengths ($u$ and $v$) and the number of nucleotides ($n$) assumed in these simulations are also the same as theirs. As evolutionary distances, we used (i) $\bar{w}P(t) + Q(t)$, where $\bar{w}$ is the arithmetic mean of $w$, (ii) $w_{\min}P(t) + Q(t)$, where $w_{\min}$ is the minimum of $w$, (iii) $\tilde{w}d_1(t) + d_2(t)$, where $\tilde{w}$ is the harmonic mean of $w$, and (iv) $w_{\min}d_1(t) + d_2(t)$. Following Schöniger and von Haeseler (1993), we also used nine


Table 5
Proportion (%) of Trials Obtaining the Correct Topology by Using the Neighbor-Joining Method, Where Model Tree A and Jukes and Cantor's Model (α = β) Were Used

                        u/v = 0.01/0.07         u/v = 0.02/0.19         u/v = 0.03/0.42
DISTANCE                n = 500     n = 1,000   n = 500     n = 1,000   n = 500     n = 1,000
w̄P(t) + Q(t)            72.6        96.0        58.3        88.7        14.2        35.8
w_min P(t) + Q(t)       72.7        96.0        58.5        88.7        13.6        35.8
w̃d1(t) + d2(t)          72.3        95.6        56.2        88.4        13.1        35.3
w_min d1(t) + d2(t)     72.3        95.7        56.2        88.4        11.1        32.8
uf/Uc                   72.9        96.3        57.9        88.8        13.9        35.9
uf/JC                   72.2        95.7        56.7        88.3        13.4        35.7
uf/Km                   72.0        95.7        56.2        88.3        13.2        35.6
ex/Uc                   71.4        96.0        58.4        89.3        14.0        36.2
ex/JC                   71.5        95.8        55.7        88.4        13.2        35.4
ex/Km                   71.2        95.7        56.0        88.1        13.1        35.9
co/Uc                   71.8        95.9        57.7        89.1        13.7        36.7
co/JC                   71.4        95.7        56.0        88.3        13.6        35.7
co/Km                   71.2        95.7        56.8        88.0        13.8        35.9

NOTE: All values are not significantly different from the maximum value in column, at the 10% level.

additional evolutionary distances, i.e., uf/Uc, uf/JC, uf/Km, ex/Uc, ex/JC, ex/Km, co/Uc, co/JC, and co/Km, where the weighting of nucleotide differences is uniform (uf), existential (ex), or combinatorial (co), and the evolutionary distances are computed without correction (Uc), by Jukes and Cantor's (1969) method (JC), or by Kimura's (1980) method (Km). In the case of uf, the distances for uf/Uc, uf/JC, and uf/Km are given by $p(t)$ ($= P(t) + Q(t)$), equation (4), and equation (9), respectively. For the definitions of ex and co, see the work of Schöniger and von Haeseler (1993). Phylogenetic trees were reconstructed from these distances by using the neighbor-joining method (Saitou and Nei 1987), and the proportions of trials obtaining the correct tree topology were recorded. The number of replications was 1,000 in each set of parameters.
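The three summaries of $w$ used in distances (i) through (iv), namely the arithmetic mean, the harmonic mean, and the minimum over all pairwise estimates, can be computed as in the sketch below. The function name and the example values of $w$ are hypothetical and serve only to show the calculation; they are not data from the simulations.

```python
import statistics


def combine_weights(pairwise_w):
    """Summarize pairwise estimates of w into single values for all distances."""
    return {
        "arithmetic_mean": statistics.mean(pairwise_w),
        "harmonic_mean": statistics.harmonic_mean(pairwise_w),
        "minimum": min(pairwise_w),
    }


# Hypothetical pairwise estimates of w, e.g. from three sequence pairs.
pairwise_w = [0.4, 0.5, 0.8]
summary = combine_weights(pairwise_w)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For positive values the minimum never exceeds the harmonic mean, which in turn never exceeds the arithmetic mean, so the three choices weight transitions progressively more heavily.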

Tables 5-8 give the proportions of trials obtaining the correct tree topology, i.e., the fraction of trials in which the given method gave the correct tree topology. The results for model tree A are shown in tables 5 and

Table 6
Proportion (%) of Trials Obtaining the Correct Topology by Using the Neighbor-Joining Method, Where Model Tree A and Kimura's Model (α = 8β) Were Used

                        u/v = 0.01/0.07         u/v = 0.02/0.19         u/v = 0.03/0.42
DISTANCE                n = 500     n = 1,000   n = 500     n = 1,000   n = 500     n = 1,000
w̄P(t) + Q(t)            60.6        89.1        43.7        72.9**      10.7*       28.7***
w_min P(t) + Q(t)       60.5        89.2        45.0        74.4*       13.4        32.7*
w̃d1(t) + d2(t)          59.8        88.6        42.6*       71.6***     10.2*       29.6**
w_min d1(t) + d2(t)     60.2        88.7        46.1        74.3*       14.8        37.0
uf/Uc                   58.6        88.4        37.4***     68.7***     5.4***      16.1***
uf/JC                   57.7        88.2        36.4***     66.9***     5.1***      15.4***
uf/Km                   56.3        87.0        31.1***     61.2***     1.8***      7.4***
ex/Uc                   52.6        87.1        51.7        82.0        11.6        30.9*
ex/JC                   52.2*       86.8        50.6        80.8        11.2        29.9**
ex/Km                   52.5*       86.8        50.3        81.2        8.8**       25.7***
co/Uc                   42.8***     80.1***     42.6*       77.2        16.5        40.6
co/JC                   42.5***     80.3**      42.5*       76.5        16.1        39.6
co/Km                   42.6***     80.2**      42.6*       76.2        16.1        39.6

* Significantly different from the maximum value in column, at the 1% level.
** Significantly different from the maximum value in column, at the 0.1% level.
*** Significantly different from the maximum value in column, at the 0.01% level.


Table 7
Proportion (%) of Trials Obtaining the Correct Topology by Using the Neighbor-Joining Method, Where Model Tree B and Jukes and Cantor's Model (α = β) Were Used

                        u/v = 0.01/0.07         u/v = 0.02/0.19         u/v = 0.03/0.42
DISTANCE                n = 500     n = 1,000   n = 500     n = 1,000   n = 500     n = 1,000
w̄P(t) + Q(t)            81.2        95.4        49.1***     53.7***     0.0***      0.0***
w_min P(t) + Q(t)       81.2        95.4        49.3***     53.8***     0.0***      0.0***
w̃d1(t) + d2(t)          80.5        97.1        67.0        91.7        _20.6_      _43.5_
w_min d1(t) + d2(t)     80.5        97.2        67.1        91.2        _18.5_      _40.1_
uf/Uc                   81.0        95.5        49.3***     53.2***     0.0***      0.0***
uf/JC                   80.4        97.3        67.2        91.5        19.5        42.4
uf/Km                   80.7        97.3        67.5        91.4        _18.8_      _43.5_
ex/Uc                   81.0        95.1        49.3***     54.0***     0.0***      0.0***
ex/JC                   79.0        96.9        67.9        91.6        19.0        42.5
ex/Km                   78.9        96.9        67.5        91.5        17.2 (0.5)  42.8
co/Uc                   80.5        95.1        48.3***     53.8***     0.0***      0.0***
co/JC                   78.9        96.9        67.9        91.6        20.2        42.7
co/Km                   79.0        96.9        68.1        91.5        18.7 (0.2)  41.5

a Numbers in parentheses are percentages of inapplicable cases. Underlined values were obtained by using Tajima's (1993) method.
* Significantly different from the maximum value in column, at the 1% level.
** Significantly different from the maximum value in column, at the 0.1% level.
*** Significantly different from the maximum value in column, at the 0.01% level.

6, those for model tree B in tables 7 and 8. In tables 5 and 7, $\alpha = \beta$ was assumed, whereas $\alpha = 8\beta$ was assumed in tables 6 and 8. The $\chi^2$ test was conducted in order to know whether the proportion of trials obtaining the correct topology is different from the maximum proportion in each set of parameters, although the test may not be appropriate, since the proportions obtained might be mutually correlated.

We can see from table 5 that the proportion of different nucleotides is slightly more efficient for reconstructing the correct topology than is the estimated number of nucleotide substitutions per site, when the

Table 8
Proportion (%) of Trials Obtaining the Correct Topology by Using the Neighbor-Joining Method, Where Model Tree B and Kimura's Model (α = 8β) Were Used

                        u/v = 0.01/0.07         u/v = 0.02/0.19         u/v = 0.03/0.42
DISTANCE                n = 500     n = 1,000   n = 500     n = 1,000   n = 500         n = 1,000
w̄P(t) + Q(t)            76.6        93.0        38.6***     44.6***     0.0***          0.0***
w_min P(t) + Q(t)       77.3        93.4        42.6***     50.8***     3.6***          3.0***
w̃d1(t) + d2(t)          74.4        94.5        54.4**      79.6*       _26.2_          _44.2_*
w_min d1(t) + d2(t)     75.8        95.3        61.9        86.4        29.4            _52.9_
uf/Uc                   76.2        92.5        33.1***     38.7***     0.0***          0.0***
uf/JC                   75.0        93.8        53.6**      77.1**      12.6***         18.9***
uf/Km                   72.6        92.8        45.2**      70.0***     6.2***          _14.2_***
ex/Uc                   70.3        91.3        53.7**      65.2***     [illegible]***  0.0***
ex/JC                   68.7*       92.2        65.6        85.5        19.9**          30.9***
ex/Km                   69.0*       92.2        65.6        85.3        18.4***         27.4***
co/Uc                   61.4***     87.7***     55.4**      72.9***     3.1***          2.2***
co/JC                   60.6***     88.2***     59.6        84.2        26.8            46.2
co/Km                   60.8***     88.2***     59.7        84.2        25.7            46.0

a Underlined values were obtained by using Tajima's (1993) method.
* Significantly different from the maximum value in column, at the 1% level.
** Significantly different from the maximum value in column, at the 0.1% level.
*** Significantly different from the maximum value in column, at the 0.01% level.


    substitution rate is the same among different lineages and when there is no transition/transversion bias. The difference in efficiency between the proportion of different nucleotides and the estimated number of nucleotide substitutions per site, however, is very small, so that either distance can be used. These results are consistent with those obtained by Schöniger and von Haeseler (1993).

    When there is transition/transversion bias, Schöniger and von Haeseler (1993) have shown that, among the nine distances they studied, which is the best depends on branch lengths. Table 6 shows that the present methods, especially $w_{\min} d_1(t) + d_2(t)$, are efficient regardless of branch lengths, although they are not the best in many cases.

    Table 7 shows the results obtained in the case where model tree B and $\alpha = \beta$ were assumed. In the case of u = 0.03 and v = 0.42, the underlined values were obtained by using almost unbiased estimates (Tajima 1993), since there were inapplicable cases. We can see from this table that the proportion of different nucleotides cannot be used unless the difference is very small when the substitution rate varies among different lineages.
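The inapplicable cases mentioned above arise because logarithmic distance estimators are undefined when the observed differences are too large. The following is a minimal Python sketch of that behavior, not the authors' program. It assumes that $d_1(t)$ and $d_2(t)$ are the numbers of transitional and transversional substitutions per site estimated with Kimura's (1980) two-parameter formulas, and that P and Q are the observed proportions of transitional and transversional differences. The function names and the weight argument `w` are illustrative, and Tajima's (1993) method is not implemented here.

```python
import math
from typing import Optional, Tuple


def kimura_d1_d2(P: float, Q: float) -> Optional[Tuple[float, float]]:
    """Estimate transitional (d1) and transversional (d2) substitutions per site.

    Assumes Kimura's (1980) two-parameter estimators. Returns None when
    either logarithm argument is non-positive, i.e. an inapplicable case.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        return None  # logarithm undefined: inapplicable case
    d1 = -0.5 * math.log(a) + 0.25 * math.log(b)
    d2 = -0.5 * math.log(b)
    return d1, d2


def weighted_distance(P: float, Q: float, w: float) -> Optional[float]:
    """Return w * d1 + d2, or None if the estimates are inapplicable."""
    estimates = kimura_d1_d2(P, Q)
    if estimates is None:
        return None
    d1, d2 = estimates
    return w * d1 + d2
```

With `w = 1.0` the weighted distance reduces to the usual Kimura two-parameter estimate of the total number of substitutions per site. A call such as `weighted_distance(0.4, 0.3, 0.5)` returns `None`, because 1 - 2P - Q is negative for those inputs.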

    The results for transition/transversion bias are given in table 8, which indicates that $w_{\min} d_1(t) + d_2(t)$ is the most efficient for reconstructing the correct tree topology. It is also recommended that Tajima's (1993) method be used to estimate $d_1(t)$ and $d_2(t)$ when the distance is large. For example, in the case of u = 0.03, v = 0.42, and n = 500, when we used Kimura's (1980) method for estimating $d_1(t)$ and $d_2(t)$, i.e., equations (9a) and (9b), 170 of 1,000 cases were inapplicable, and the proportion of trials obtaining the correct topology by using $w_{\min} d_1(t) + d_2(t)$ was 23.9% rather than 29.4%. Considering all the results shown in tables 5-8, we can conclude that $w_{\min} d_1(t) + d_2(t)$ might be the best among the distances examined.

    Discussion and Conclusion

    In this paper, new methods for estimating evolutionary distance are presented. Computer simulations suggest that $w_{\min} d_1(t) + d_2(t)$ might be the best for obtaining the correct tree topology, among the evolutionary distances examined. When the substitution rate is the same among different lineages, $w_{\min} P(t) + Q(t)$ also can be used. Since we usually do not know whether the substitution rate is the same among different lineages, we recommend that $w_{\min} d_1(t) + d_2(t)$ be used for determining a tree topology.
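The weight in these distances is chosen by maximizing the accuracy index $A(t) = D'(t)/\sqrt{V[D(t)]}$. As a hedged illustration for a single weight, suppose $D = wX + Y$. Then $D' = wX' + Y'$ and $V[D] = w^2 V[X] + 2w\,\mathrm{Cov}(X, Y) + V[Y]$, and setting $\partial A/\partial w = 0$ gives $w = (X' V[Y] - Y' \mathrm{Cov}(X, Y)) / (Y' V[X] - X' \mathrm{Cov}(X, Y))$. The Python sketch below evaluates this. The argument names (`dx`, `dy`, `vx`, `vy`, `cxy`) are hypothetical placeholders for the derivatives, variances, and covariance, whose model-specific expressions are not reproduced in this part of the text.

```python
import math


def accuracy_index(w: float, dx: float, dy: float,
                   vx: float, vy: float, cxy: float) -> float:
    """Accuracy index A = D'/sqrt(V[D]) for the distance D = w*X + Y.

    dx, dy: time derivatives of X and Y (hypothetical inputs).
    vx, vy, cxy: sampling variances of X and Y and their covariance.
    """
    slope = w * dx + dy
    variance = w * w * vx + 2.0 * w * cxy + vy
    return slope / math.sqrt(variance)


def optimal_weight(dx: float, dy: float,
                   vx: float, vy: float, cxy: float) -> float:
    """Weight w that solves dA/dw = 0 for D = w*X + Y.

    Assumes a positive-definite covariance matrix (vx * vy > cxy**2).
    The stationary point is the maximum only when the denominator is
    positive, so other inputs are rejected.
    """
    denominator = dy * vx - dx * cxy
    if denominator <= 0.0:
        raise ValueError("stationary point is not a maximum for these inputs")
    return (dx * vy - dy * cxy) / denominator


# Usage with made-up numbers: X changes as fast as Y but is four times noisier.
w_best = optimal_weight(dx=1.0, dy=1.0, vx=4.0, vy=1.0, cxy=0.0)
a_best = accuracy_index(w_best, dx=1.0, dy=1.0, vx=4.0, vy=1.0, cxy=0.0)
```

In this made-up example the noisier component receives the smaller weight, which is the intuition behind down-weighting one class of differences relative to the other.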

    We have assumed in this study that the pattern of substitution rates follows Kimura's (1980) transition/transversion model. Even when the pattern of substitution rates follows a different model, we can still use the same method for determining the value of w. For example, in the case in which D(t) is expressed as $D(t) = w_1 d_1(t) + w_2 d_2(t) + d_3(t)$, $w_1$ and $w_2$ can be determined by solving

    $$\frac{\partial A}{\partial w_1} = 0 \quad \text{and} \quad \frac{\partial A}{\partial w_2} = 0,$$

    where A(t) is defined by equation (1).

    Computer Program

    A computer program for estimating $\hat{w} P(t) + Q(t)$, $w_{\min} P(t) + Q(t)$, $\hat{w} d_1(t) + d_2(t)$, $w_{\min} d_1(t) + d_2(t)$, and their standard errors is available on request.

    Acknowledgments

    We thank Dr. M. Nei and anonymous reviewers for their valuable suggestions and comments. This work was supported in part by grants from the National Institutes of Health and the National Science Foundation (to M.N.) and by a grant from the Ministry of Education, Science and Culture (to F.T.). This is contribution 1971 from the National Institute of Genetics, Mishima, Shizuoka-ken 411, Japan.

    LITERATURE CITED

    JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. MUNRO, ed. Mammalian protein metabolism. Academic Press, New York.

    KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.

    KIMURA, M., and T. OHTA. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:87-90.

    NEI, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.

    SAITOU, N., and T. IMANISHI. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514-525.

    SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.

    SCHÖNIGER, M., and A. VON HAESELER. 1993. A simple method to improve the reliability of tree reconstructions. Mol. Biol. Evol. 10:471-483.

    SNEATH, P. H. A., and R. R. SOKAL. 1973. Numerical taxonomy. W. H. Freeman, San Francisco.

    TAJIMA, F. 1993. Unbiased estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 10:677-688.

    DANIEL HARTL, reviewing editor

    Received August 9, 1993

    Accepted November 5, 1993
