

Copyright © 1993 by the Genetics Society of America

A Cladistic Analysis of Phenotypic Associations with Haplotypes Inferred from Restriction Endonuclease Mapping. IV. Nested Analyses with Cladogram Uncertainty and Recombination

Alan R. Templeton* and Charles F. Sing†

*Department of Biology, Washington University, St. Louis, Missouri 63130, and †Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109-0618

Manuscript received August 19, 1992
Accepted for publication February 25, 1993

ABSTRACT

We previously developed an analytical strategy based on cladistic theory to identify subsets of haplotypes that are associated with significant phenotypic deviations. Our initial approach was limited to segments of DNA in which little recombination occurs. In such cases, a cladogram can be constructed from the restriction site data to estimate the evolutionary steps that interrelate the observed haplotypes to one another. The cladogram is then used to define a nested statistical design for identifying mutational steps associated with significant phenotypic deviations. The central assumption behind this strategy is that a mutation responsible for a particular phenotypic effect is embedded within the evolutionary history that is represented by the cladogram. The power of this approach depends on the accuracy of the cladogram in portraying the evolutionary history of the DNA region. This accuracy can be diminished both by recombination and by uncertainty in the estimated cladogram topology. In a previous paper, we presented an algorithm for estimating the set of likely cladograms and recombination events. In this paper we present an algorithm for defining a nested statistical design under cladogram uncertainty and recombination. Given the nested design, phenotypic associations can be examined using either a nested analysis of variance (for haploids or homozygous strains) or permutation testing (for outcrossed, diploid gene regions). In this paper we also extend this analytical strategy to include categorical phenotypes in addition to quantitative phenotypes. Some worked examples are presented using Drosophila data sets. These examples illustrate that having some recombination may actually enhance the biological inferences that may be derived from a cladistic analysis. In particular, recombination can be used to assign a physical localization to a given subregion for mutations responsible for significant phenotypic effects.

GENETIC studies of quantitative traits have traditionally utilized phenotypic correlations between related and unrelated individuals to estimate the fraction of the interindividual variance in the population that is attributable to unmeasured genotypic differences. Recent advances in molecular genetics are making it possible to locate and characterize the loci that determine this genetic component of variance. If the trait of interest is determined by biochemical or physiological functions under the control of identified genes (candidate genes), then the population can be screened for RFLP or sequence variability in and/or near a candidate gene to define haplotype variation. One then analyzes the associations between haplotype variation at the candidate gene and phenotypic variation in the quantitative trait. In this manner, haplotypes can be identified that are associated with statistically significant phenotypic deviations. Then individuals carrying these haplotypes can be subjected to more detailed molecular analyses to identify the responsible mutations. We have introduced a cladistics approach to identify haplotypes and individuals that most likely carry such mutations (TEMPLETON, BOERWINKLE and SING 1987; TEMPLETON et al. 1988).

Genetics 134: 659-669 (June, 1993)

Our initial application of cladistics assumed that the haplotypes are defined from restriction endonuclease mapping or sequencing in small segments of DNA in which little recombination occurs (TEMPLETON, BOERWINKLE and SING 1987; TEMPLETON et al. 1988). These haplotypes are organized into a cladogram that portrays the evolutionary steps that interrelate the observed haplotypes to one another. If the root of the cladogram can be determined or estimated, the cladogram represents a phylogenetic tree of the DNA region being characterized. However, the analysis does not require the cladogram to be rooted. We employ the cladogram to define a nested statistical design that is used to systematically detect significant associations between mutational steps defined by the RFLPs and deviations of a quantitative trait from the sample mean. The central assumption behind this approach is that if an unknown mutation causing a phenotypic effect occurred at some point in the evolutionary history of the population, it would be embedded within the same historical structure represented by the cladogram.

The power of the cladistic approach depends on the accuracy with which the cladogram portrays the evolutionary history of the haplotype variation. Many factors can reduce this accuracy. The first is recombination. When recombination occurs, different subregions in the same haplotype can reflect different evolutionary histories. In the previous paper of this series (TEMPLETON, CRANDALL and SING 1992), we presented an algorithm that either excludes rare recombinants or subdivides the candidate DNA region under investigation into two or more subregions, with each having little or no internal recombination. Consequently, we can recover the historical information by estimating separate cladograms for each DNA subregion. Each subregion can then be subjected to a separate cladistic analysis for phenotypic associations.

A second factor is inaccurate scoring of restriction sites or sequences. CLARK and WHITTAM (1992) have shown that the error rates often found in DNA sequence data banks were sufficiently high to sometimes influence the topology of an inferred phylogeny. Their focus was on interspecific evolutionary trees, but the problem is bound to be more serious for intraspecific haplotype trees. For this latter type of tree, most haplotypes are connected to other haplotypes by very few mutations, often only one. Hence, a single scoring error could have a large impact on the tree topology.

A third factor is incomplete scoring. Often, some restriction sites or some part of the sequence state is missing for some of the observations. This can lead to uncertainty in the resulting evolutionary tree. Moreover, it is sometimes impossible to infer the multisite haplotype state with complete certainty because of multiple heterozygosity when sampling outcrossed, diploid populations. Probabilities can be placed on the possible haplotypes in such cases (TEMPLETON et al. 1988), but often more than one possible haplotype state has high probability, leading to cladogram uncertainty.

A fourth factor is homoplasy. The same restriction site or nucleotide may have experienced more than one mutation, leading to parallelisms or convergences of patterns (TEMPLETON 1983). Because of homoplasy, there is often more than one cladogram that may be consistent with the data. TEMPLETON, CRANDALL and SING (1992) present an algorithm for determining the set of plausible cladograms that includes all the most probable evolutionary linkages of a haplotype to the other haplotypes until the cumulative probability of the linkages is 0.95 or greater. When this plausible set consists of more than one cladogram, the evolutionary history of the region is not known with certainty.
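The cumulative-probability cutoff described above can be sketched in a few lines of Python. This is only an illustration of the 0.95 stopping rule, not the estimation algorithm of TEMPLETON, CRANDALL and SING (1992): the function name, the haplotype labels and the linkage probabilities are all hypothetical, and the probabilities are assumed to have been computed elsewhere.

```python
def plausible_linkages(link_probs, threshold=0.95):
    """Keep the most probable linkages of one haplotype until their
    cumulative probability reaches the threshold (0.95 in the text)."""
    # Rank candidate linkages from most to least probable.
    ranked = sorted(link_probs.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for target, prob in ranked:
        chosen.append(target)
        cumulative += prob
        if cumulative >= threshold:
            break
    return chosen


# Hypothetical linkage probabilities for a single haplotype (not from the paper).
example_probs = {"A": 0.60, "B": 0.30, "C": 0.07, "D": 0.03}
print(plausible_linkages(example_probs))
```

With these made-up values, three linkages are retained, so more than one cladogram would remain in the plausible set.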

The primary purpose of this paper is to extend our cladistic analyses of phenotypic associations to the case in which there is uncertainty about the exact topology of the cladogram. A second purpose of this paper is to show how the cladistic analysis can be performed on categorical phenotypes in addition to quantitative phenotypes. A third purpose is to show that knowledge about recombination can be used to provide information about the physical location of a mutation responsible for associations between a candidate gene and phenotypic variation. We will illustrate these three aspects of cladistic analyses with worked examples from Drosophila.

AN ALGORITHM FOR A NESTED DESIGN UNDER CLADOGRAM UNCERTAINTY

The algorithm of TEMPLETON, CRANDALL and SING (1992) is first used to identify segments of the DNA region that have had little to no internal recombination (including the possibility that the entire region behaves as such a segment). This algorithm is then used to estimate the set of plausible cladograms within each subregion and assumes no scoring error. Hence, the only sources of uncertainty that are taken into account are those due to incomplete scoring and those due to homoplasy. With respect to homoplasy, two levels of cladogram uncertainty are identified by this estimation algorithm. First, haplotypes are united into one or more b-step networks in which b is the maximum number of mutational steps separating two adjacent haplotypes within the cladogram under the assumption of maximum parsimony, and where parsimony is unlikely (P < 0.05) to be violated by more than a single additional mutational step. These sets can therefore include both parsimonious and nonparsimonious alternative cladograms, but the deviations from parsimony are unlikely to be more than one additional step for a particular haplotype linkage. The second level of uncertainty occurs when linkages among haplotypes are likely (P ≥ 0.05) to deviate from parsimony by two or more additional steps. In these cases, the haplotypes are subdivided among two or more b-step networks. The algorithm of TEMPLETON, CRANDALL and SING (1992) does not attempt to specify the exact mutational connections that are possible among the different b-step networks because the number of possible cladograms that are consistent with such large deviations from parsimony is usually very high. Instead, the algorithm of TEMPLETON, CRANDALL and SING (1992) indicates only which networks are likely to be connected to one another even though the exact connections are not specified.

To perform a cladistic analysis of phenotypic association, we need to nest haplotypes (0-step clades in our terminology) that are in some sense evolutionarily adjacent into 1-step clades (branches of the evolutionary tree), nest adjacent 1-step clades into 2-step clades, etc., until finally at level n+1 all the data would become nested into a single clade. We now outline a three-step algorithm to accomplish this nesting on the set of plausible cladograms within each DNA (sub)region displaying little or no internal recombination for a particular data set. We limit our initial nesting algorithm to within b-step networks (i.e., to within haplotype networks in which the mutational linkages are unlikely to deviate from parsimony by more than one additional mutation). Each b-step network is analyzed separately in steps 1 and 2. Nesting among b-step networks and above is delayed until step 3.

FIGURE 1.-A case of a symmetrically stranded clade. In this simple example, five clades are considered that are mutationally related to one another as indicated by double-headed arrows. To form the next higher nesting categories, the nesting algorithm described in TEMPLETON, BOERWINKLE and SING (1987) would nest together clades 1 and 2, and clades 4 and 5, as indicated by the boxes in the figure. This would leave clade 3 unnested, as discussed in the text.

Step 1: We apply the standard nesting rules (TEMPLETON, BOERWINKLE and SING 1987) to the unambiguous steps within the b-step network. By an unambiguous step, we mean a mutational connection between clades that is shared by all cladograms in the plausible set. For the moment, we ignore all the ambiguous steps (that is, mutational connections that are not shared by all cladograms in the plausible set). This may result in disjointed sets of haplotypes, and each of these disjointed sets is nested by the standard rules as if it were a separate cladogram.
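As a minimal sketch of the first part of step 1, the unambiguous steps can be found by intersecting the edge sets of all cladograms in the plausible set, and the disjointed sets are then the connected components that remain. The representation of a cladogram as a list of node pairs, the function names and the four-node example are assumptions made for illustration; the standard nesting rules themselves are not implemented here.

```python
def unambiguous_edges(cladograms):
    """Mutational connections shared by every cladogram in the plausible set."""
    edge_sets = [{frozenset(edge) for edge in cladogram} for cladogram in cladograms]
    return set.intersection(*edge_sets)


def disjoint_sets(nodes, edges):
    """Connected components that remain when only the given edges are kept."""
    adjacency = {node: set() for node in nodes}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical plausible set: two cladograms that disagree on how B joins C and D.
cladogram_1 = [("A", "B"), ("B", "C"), ("C", "D")]
cladogram_2 = [("A", "B"), ("B", "D"), ("C", "D")]
shared = unambiguous_edges([cladogram_1, cladogram_2])
print(disjoint_sets({"A", "B", "C", "D"}, shared))
```

In this toy case the ambiguous connections B-C and B-D are ignored, leaving two disjointed sets, each of which would then be nested separately by the standard rules.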

TEMPLETON, BOERWINKLE and SING (1987) give the standard nesting rules when there is no ambiguity in the cladogram, so these rules will not be repeated here. However, the previously published rules fail to take into account one symmetrical situation that can arise within cladograms, as illustrated in Figure 1. In this case, the standard rules nest together clades 1 and 2 and clades 4 and 5, leaving clade 3 stranded. If the stranded clade is a missing intermediate (that is, there is no observed haplotype associated with this node), it is simply left unnested. However, when the stranded clade contains an observed haplotype, it needs to be nested either with the (1, 2) group or the (4, 5) group, but the rules given by TEMPLETON, BOERWINKLE and SING (1987) give no guidance in this symmetrical case. Hence, we will now offer some additional rules for dealing with this symmetry.

As discussed in TEMPLETON, BOERWINKLE and SING (1987), alternative nesting rules are possible to the ones given in that paper. However, the nesting rules must be able to be applied consistently, cover statistically all steps in the cladogram, create nesting categories that reflect evolutionary closeness within the cladogram, and be completely independent of the phenotypic data (otherwise a biased statistical design will result). These criteria are satisfied with the following additional nesting rules. First, the stranded clade should be grouped with the nesting category that has the smallest sample size because such a grouping will tend to maximize statistical power. Second, if the smallest sample size is found in two or more of the alternatives, then the stranded clade should be nested with the alternative to which it is connected by a nonrestriction site mutation. Nonrestriction site mutations tend to be unique more often than restriction site mutations (LLOYD and CALDER 1991), so the connection involving the nonrestriction site change in general is a more certain connection. If the above rules fail to discriminate, a random number generator can be used to link the stranded clade with one group.
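The three tie-breaking rules for a stranded clade can be written as a short decision function. The sketch below is one possible reading, with hypothetical field names (`sample_size`, `nonrestriction_link`) and group labels. It assumes that when the second rule still leaves a tie, the random draw is made among the groups that survived the earlier rules. Note that the function takes no phenotypic data, in keeping with the requirement that the design be independent of the phenotypes.

```python
import random


def nest_stranded_clade(candidates, rng=None):
    """Pick the nesting group for a stranded clade.

    candidates: list of dicts with keys
      'name'                - label of the alternative nesting group
      'sample_size'         - number of observations in that group
      'nonrestriction_link' - True if the stranded clade is connected to
                              the group by a nonrestriction site mutation
    """
    # Rule 1: prefer the group with the smallest sample size.
    smallest = min(c["sample_size"] for c in candidates)
    tied = [c for c in candidates if c["sample_size"] == smallest]
    if len(tied) == 1:
        return tied[0]["name"]
    # Rule 2: among ties, prefer a connection by a nonrestriction site mutation.
    nonrestriction = [c for c in tied if c["nonrestriction_link"]]
    if len(nonrestriction) == 1:
        return nonrestriction[0]["name"]
    # Rule 3 (assumed reading): draw at random among the remaining groups.
    rng = rng or random.Random()
    return rng.choice(nonrestriction or tied)["name"]


# Hypothetical sample sizes for the two groups of Figure 1.
groups = [
    {"name": "(1, 2)", "sample_size": 5, "nonrestriction_link": False},
    {"name": "(4, 5)", "sample_size": 9, "nonrestriction_link": True},
]
print(nest_stranded_clade(groups))
```

Passing a seeded `random.Random` instance as `rng` makes the third rule reproducible, which may be useful when the design has to be documented.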

After the application of the standard rules, say at clade level x to create x+1-step clades, we then see if any x-step clades are left unnested because of cladogram ambiguity. If all x+1-step clades have been nested, we repeat step 1 at the x+1 clade level to create x+2-step clades. For any x-step clades that are left unnested, we proceed to step 2.

Step 2: In many cases of cladogram uncertainty, four or more unnested x-step clades form a closed loop connected by ambiguous linkages. Obviously, true loops cannot be created by the evolutionary process, so one or more of these plausible linkages must not have occurred. However, it is not known which linkages did occur and which did not because all are likely. To deal with such loops of ambiguity, all the x-step clades in the loop are nested together and treated as a single x+1-step clade as long as x < n. (Recall that the standard nesting algorithm terminates at the n-step level.) If x = n, each of the x-step clades in the loop is regarded as a separate n-step clade and the nesting algorithm is terminated.

For unnested x-step clades that are not part of a loop of ambiguous linkages, we examine their connections to the parts of the cladogram that have already been nested to create x+1-step clades. If all ambiguous connections go to a single x+1-step clade, the unnested x-step clade is joined to that x+1-step clade. If unnested clades exist at the nth level of nesting, they are regarded as separate n-step clades and the nesting algorithm is terminated.

If the ambiguous linkages go to two or more already defined x+1-step clades and if x < n, the unnested x-step clade is left unnested. Such unnested x-step clades are treated just like degenerate clades (clades that add on only inferred intermediates in the cladogram that do not actually exist in the sample) in TEMPLETON, BOERWINKLE and SING (1987). Degenerate clades are simply excluded from the nested design at the degenerate level. A clade that is left unnested because of cladogram ambiguity is likewise excluded from the nested design at the level of ambiguity. This does not mean, however, that the data in these unnested clades is excluded from the overall analysis. The data points within these clades may have been analyzed at lower clade levels (low x values), and will always reenter the nested analysis at the higher clade levels, so they are never excluded from the nested analysis as a whole.
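The step 2 decisions can be summarized as two small functions, one for loops of ambiguity and one for unnested clades outside any loop. This is a hedged sketch: the function names, the return conventions and the use of integers for the clade level x and the terminal level n are assumptions for illustration, and detecting the loops themselves is left out.

```python
def nest_loop(loop_clades, level, top_level):
    """Loop of ambiguity among x-step clades (level = x, top_level = n).

    If x < n the whole loop becomes one (x+1)-step clade; if x = n each
    clade in the loop stays a separate n-step clade.
    """
    if level < top_level:
        return [set(loop_clades)]
    return [{clade} for clade in loop_clades]


def resolve_unnested(ambiguous_targets, level, top_level):
    """Unnested x-step clade that is not part of a loop.

    ambiguous_targets: set of (x+1)-step clades reached by its ambiguous links.
    Returns (action, target) where action is 'separate', 'join' or 'unnested'.
    """
    if level >= top_level:
        return ("separate", None)   # regarded as its own n-step clade
    if len(ambiguous_targets) == 1:
        return ("join", next(iter(ambiguous_targets)))
    return ("unnested", None)       # excluded from the design at this level only


# Hypothetical labels: all ambiguous links of one clade lead to clade "X".
print(resolve_unnested({"X"}, level=1, top_level=3))
print(resolve_unnested({"X", "Y"}, level=1, top_level=3))
```

A clade that comes back as `'unnested'` is dropped only at that level; its observations still count at lower levels and reenter at higher ones, as the text explains.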

Step 3: Once the above nesting procedure has been completed independently within each separate b-step haplotype network, we need to nest the b-step networks to form the highest order nesting categories of the analysis. Recall that b-step networks are clusters of haplotypes that are so distant from one another that their mutational linkages are likely (P ≥ 0.05) to deviate by two or more mutational steps from the maximum parsimony mutational path(s). If there are only two b-step networks in the total data set, the two networks are simply regarded as two categories nested within the total cladogram. The standard nesting rules of TEMPLETON, BOERWINKLE and SING (1987) stop at the level just below the total cladogram, so the b-step networks themselves are the terminal nesting level in this case and the final level of analysis performs contrasts between the b-step networks.

If there are three or more b-step networks, it is possible that there is some evolutionary structure among the b-step networks (e.g., two of the b-step networks may be more closely related evolutionarily to one another than either is to any other b-step network). Evidence for such evolutionary structure is inferred from the topological features of the maximum parsimony connections among the b-step networks. The maximum parsimony connections are first inferred using a computer program such as PAUP (SWOFFORD 1991). The cladograms of each b-step network should be input as a constraint, but the connections among b-step networks should be left unspecified. This focuses the power and efficiency of the program on these internetwork connections. After determining the most parsimonious connection(s) among the b-step networks, one inspects for shared mutational changes in the connections among three or more networks. Figure 2 shows a hypothetical example for four b-step networks (labeled I, II, III and IV). Note that b-step networks I and II are both separated by six mutational steps under parsimony from b-step network III. However, three of the six mutational steps are shared in common. Similarly, b-step networks I and II are both separated by eight mutational steps under parsimony from b-step network IV, but three of these eight are also shared, and indeed are the same three shared in the mutational connections to III. Whenever shared mutational changes occur, they define nodes of branching in the connections among the b-step networks. Two such nodes are indicated in the hypothetical example shown in Figure 2. The inferred genetic state for these nodal haplotypes is determined (an option for doing this exists in PAUP). The significance of these nodes is then determined by calculating the probabilities of their parsimonious and nonparsimonious linkages to one another and to the b-step networks using the procedures given in TEMPLETON, CRANDALL and SING (1992). Any of the linkages involving these nodes that are likely (P > 0.95) to deviate from parsimony by no more than one additional step can then be used to create nested sets of b-step networks using the rules given in steps 1 and 2. Any b-step networks that are left unnested after this procedure are regarded as separate highest-level categories nested within the total cladogram and the nesting algorithm is terminated.

FIGURE 2.-A hypothetical example of evolutionary structure among four b-step networks. The four networks are indicated by I, II, III and IV. The maximum parsimony mutational connections among the networks are indicated by solid lines, with the Os indicating that all intermediate haplotypes are missing from the sample. As can be seen in this example, the paths connecting either I or II to either III or IV share three common mutational links. These shared mutations define two nodes of branching, as indicated in the figure. If these nodes were accepted as statistically significant, networks I and II would be nested together, as would III and IV.
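The counting of shared mutational steps in the Figure 2 example can be illustrated with a small weighted tree. The text fixes only the totals (six steps from I or II to III, eight steps from I or II to IV, three of them shared), so the individual branch lengths below are an assumption chosen to reproduce those totals, and the names `node1` and `node2` are hypothetical labels for the two branching nodes.

```python
# Assumed branch lengths (in mutational steps) consistent with the totals in the text.
TREE = {
    ("I", "node1"): 1,
    ("II", "node1"): 1,
    ("node1", "node2"): 3,   # the three shared steps
    ("node2", "III"): 2,
    ("node2", "IV"): 4,
}


def build_adjacency(tree):
    """Turn the edge dictionary into a symmetric adjacency map."""
    adjacency = {}
    for (a, b), steps in tree.items():
        adjacency.setdefault(a, {})[b] = steps
        adjacency.setdefault(b, {})[a] = steps
    return adjacency


def path_edges(adjacency, start, goal, seen=None):
    """Edges on the unique path between two nodes of a tree (None if absent)."""
    if start == goal:
        return []
    seen = (seen or set()) | {start}
    for neighbor in adjacency[start]:
        if neighbor in seen:
            continue
        rest = path_edges(adjacency, neighbor, goal, seen)
        if rest is not None:
            return [frozenset((start, neighbor))] + rest
    return None


def path_length(adjacency, edges):
    """Total number of mutational steps along a collection of edges."""
    return sum(adjacency[a][b] for a, b in (tuple(edge) for edge in edges))


adjacency = build_adjacency(TREE)
pairs = [("I", "III"), ("II", "III"), ("I", "IV"), ("II", "IV")]
paths = {pair: path_edges(adjacency, *pair) for pair in pairs}
shared = set.intersection(*(set(edges) for edges in paths.values()))
print(path_length(adjacency, paths[("I", "III")]), path_length(adjacency, shared))
```

The edges common to all four paths are the ones that define the branching nodes. Whether those nodes are accepted still depends on the linkage probabilities described in the text, which this sketch does not compute.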

WORKED EXAMPLES

Amylase activity in Drosophila melanogaster: Our first example uses data on 49 lines of D. melanogaster that were scored with a battery of restriction enzymes in a 15-kb region surrounding the two transcriptional units of the duplicated Amy locus (LANGLEY et al. 1988). The duplicated genes in this region code for the enzyme α-amylase. Adult α-amylase was also measured for each of these lines (LANGLEY et al. 1988). TEMPLETON, CRANDALL and SING (1992) subjected these restriction site data to their cladogram estimation algorithm. The analysis indicated that the 15-kb region should be subdivided into three regions, each with no evidence for internal recombination but with recombination between. The first region (henceforth called the left region) extends from the HindIII -9.7 site on the extreme left of the Amy DNA region and ends just 5′ of the left duplicated Amy locus (see the restriction map in LANGLEY et al. 1988). The middle subregion is centered around the EcoRI -2.2 restriction site that lies between the two duplicated loci. The right subregion covers the remainder of this DNA region and includes the other duplicated Amy locus. We will subject each of these three regions to a cladistic analysis of α-amylase activity.

Figure 3 shows the set of plausible cladograms associated with the left DNA subregion. As discussed in TEMPLETON, CRANDALL and SING (1992), there are only two cladograms in the plausible set: one with the 0-1 haplotype connected by a parsimonious connection of length two to the [0-2,0-4] haplotype, and the other with the 0-1 haplotype connected by a nonparsimonious connection of length three to the 0-3 haplotype. (Note, the haplotype numbers refer to haplotypes as scored by variation in the entire 15-kb region. Hence, when dealing with the variation found in only one subregion, some haplotypes will be pooled together as being identical for the subset of variable sites found in that subregion.) Figure 3 also shows the nesting categories defined by the algorithm given above. In this case, only the standard rules suggested by TEMPLETON, BOERWINKLE and SING (1987) had to be invoked. Because the haplotype of ambiguous position, 0-1, is two or more mutational steps from its nearest neighbor, it forms a degenerate 1-step clade at the first round of standard nesting (the degenerate clade 1-4 in Figure 3). Note also that 0-step clades 0-3 and [0-2,0-4] are nested together to form a single 1-step clade, 1-3. When going to the second level of nesting, both of the ambiguous connections of clade 1-4 (haplotype 0-1) go to clade 1-3. Hence, clades 1-4 and 1-3 are nested together to form clade 2-2, and the ambiguity of the evolutionary position of haplotype 0-1 has no impact whatsoever on the nested design.

Table 1 gives the results of the nested analysis of variance of α-amylase activity that stems from the nested design given in Figure 3. Five of the 49 lines were not scored for the HindIII -9.7 restriction site in this subregion. Consequently, the exact haplotype affinities of these five lines were not known with certainty. However, based on the other variable sites, these five lines were all contained in clade 1-1 in Figure 3. This state of partial ambiguity caused by not scoring a site can be treated with the same rules given above for nesting under cladogram ambiguity. Because their haplotype affinities are ambiguous, these five lines were simply ignored at the 0-step level of analysis. However, they were incorporated into the analysis as part of clade 1-1 and likewise appeared at all higher levels of nesting along with the remainder of clade 1-1. As shown in Table 1, no significant associations with α-amylase activity were detected in this subregion.

FIGURE 3.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 44 lines described in LANGLEY et al. (1988) for which complete haplotype data were available in the left subregion of the Amy DNA region of D. melanogaster. The 44 lines define 21 haplotype categories when variation over the entire region is taken into account, designated as “0-n”, n = 1-21. However, some of these haplotypes are identical for the variable sites in this left subregion, and accordingly are pooled together as shown by the rounded boxes in the figure. Each arrow indicates one mutational event. The description of the event is indicated by the arrow, using the notation given in LANGLEY et al. (1988). In their notation, mutations designated by the form “AMY#” represent alterations in the electrophoretic pattern of amylase, “Ins” and “Del” refer to insertions and deletions relative to a reference chromosome (each of these events could be either an insertion or a deletion mutational event depending on the root of the network), “In(2R)NS” and “Inv(a)” refer to inversions that cover the entire region, and hence are incorporated into the cladograms of all three subregions. The remainder of the mutational events are restriction site changes. Because the network is unrooted, each arrow is double-headed. Each mutational event has a “+” and a “-” by it. The “+”s indicate the presence of the notated restriction recognition sequence and the “-”s its absence. Since the network is unrooted, both possibilities could have occurred in the evolutionary history of the Amy region, and the type of change as a function of evolutionary direction is indicated by the symbol closest to the arrowhead that defines the evolutionary direction. Solid arrows indicate transitions that are unambiguous whereas dashed arrows indicate alternative connections that are all likely, using the criteria given in TEMPLETON, CRANDALL and SING (1992). Single-width lines enclose haplotypes that are nested together to form 1-step clades, as designated by the notation 1-#. Double-width lines enclose the 1-step clades that are nested together to form 2-step clades, as notated by 2-#.

We next analyzed the middle subregion. Only one cladogram (Figure 4) is plausible with the data from this subregion. Moreover, the cladogram is so simple that all the haplotypes in this subregion would be placed in the same 1-step clade, so no nesting is needed. Accordingly, a simple unnested analysis of


TABLE 1

Nested analysis of variance of α-amylase activity in the left subregion of the D. melanogaster Amy region described in LANGLEY et al. (1988)

Source            Sum of squares   Degrees of freedom   Mean square   F-statistic   P
2-Step clades     0.03511           1                   0.03511       1.833         0.184
1-Step clades     0.00249           2                   0.00125       0.065         0.937
  Within 2-1      0.00246           1                   0.00246       0.128         0.722
  Within 2-2      0.00003           1                   0.00003       0.002         0.965
0-Step clades     0.22804           6                   0.03801       1.984         0.091
  Within 1-1      0.20524           5                   0.04105       2.143         0.081
  Within 1-3      0.02280           1                   0.02280       1.190         0.282
Error             0.74699          39                   0.01915

The nested design is given in Figure 3.

FIGURE 4.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 49 lines described in LANGLEY et al. (1988) for the middle subregion of the Amy DNA region of D. melanogaster. Two of the haplotype categories in this subregion include a large number of distinct haplotypes as characterized by variable sites in the whole region. Consequently, instead of boxing all of the haplotype numbers together, we simply indicate these two categories by EcoRI -2.2 + (to indicate haplotypes with this restriction site) and EcoRI -2.2 - (to indicate haplotypes without this site, with the exception of haplotypes 0-6 and 0-9). In this case there is no cladogram ambiguity. The inferred positions of mutations with significant phenotypic effects in the Amy region are marked by asterisks, where a double asterisk indicates significance at the 0.01 level.

TABLE 2

Analysis of variance of α-amylase activity in the middle subregion of the D. melanogaster Amy region described in LANGLEY et al. (1988)

Source            Sum of squares   Degrees of freedom   Mean square   F-statistic
0-Step clades     0.38987           3                   0.12996       8.09***
Error             0.72326          45                   0.01607

*** Significant at the 0.001 level.

TABLE 3

Bonferroni multiple comparisons tests of the evolutionarily relevant contrasts among the haplotypes found in the middle DNA subregion of the D. melanogaster Amy region

Contrast                          Bonferroni significance
EcoRI -2.2 - vs. EcoRI -2.2 +     <1%
EcoRI -2.2 - vs. 0-9              <1%
EcoRI -2.2 - vs. 0-6              >5%

variance is used to analyze the data. As shown in Table 2, this analysis revealed highly significant asso-

TABLE 4

Nested analysis of variance of α-amylase activity in the right subregion of the D. melanogaster Amy region described in LANGLEY et al. (1988)

Source            Sum of squares   Degrees of freedom   Mean square   F-statistic   P
1-Step clades     0.02410           2                   0.01205       0.573         0.569
0-Step clades     0.26574           7                   0.03796       1.804         0.114
  Within 1-1      0.07828           3                   0.02609       1.240         0.308
  Within 1-2      0.18745           4                   0.04686       2.227         0.084
Error             0.82078          39                   0.02105

The nested design is given in Figure 5.

FIGURE 5.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 49 lines described in LANGLEY et al. (1988) for the right subregion of the Amy DNA region of D. melanogaster. The notation is the same as given in Figure 2.

ciations. To localize these associations, the cladogram shown in Figure 4 was used to define the relevant multiple contrasts, as discussed in TEMPLETON et al. (1988). In this case, the evolutionarily meaningful contrasts are between the central clade lacking the EcoRI -2.2 site vs. the three peripheral clades. Table 3 gives the results of the Bonferroni multiple-comparisons tests. As can be seen, two significant (at the 5% level or lower) associations are revealed, as indicated by asterisks in the cladogram shown in Figure 4.
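The Bonferroni step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `bonferroni`, the contrast labels and the raw p-values in the usage lines are hypothetical placeholders (Table 3 reports only the adjusted significance levels), and the one fact taken from the text is that three cladogram-defined contrasts are tested.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust a dict of raw p-values.

    Returns {name: (adjusted_p, is_significant)}, where adjusted_p is the
    raw p-value multiplied by the number of contrasts (capped at 1).
    """
    k = len(p_values)
    result = {}
    for name, p in p_values.items():
        adjusted = min(1.0, p * k)
        result[name] = (adjusted, adjusted <= alpha)
    return result


# Hypothetical raw p-values for three contrasts (NOT values from the paper).
example = {"contrast A": 0.001, "contrast B": 0.002, "contrast C": 0.30}
for name, (adj, sig) in bonferroni(example).items():
    print(f"{name}: adjusted p = {adj:.3f}, significant = {sig}")
```

Because the set of contrasts is fixed by the shape of the cladogram before the phenotypes are inspected, the number of comparisons entering the correction stays small.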

Figure 5 shows the single plausible cladogram for the right region. The nested design is also shown in Figure 5, and the nested analysis of variance is given in Table 4. No significant phenotypic associations were discovered in this subregion.
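As a check on how the entries of Table 4 relate to one another, the mean squares and F-statistics can be recomputed from the reported sums of squares and degrees of freedom. The sketch below assumes that every effect is tested against the error mean square (an inference from the reported values, not a statement in the text); the function name `nested_anova_f` is hypothetical.

```python
def nested_anova_f(effects, error_ss, error_df):
    """Return {source: (mean_square, F)} using the error mean square
    as the denominator of every F ratio."""
    error_ms = error_ss / error_df
    table = {}
    for source, (ss, df) in effects.items():
        ms = ss / df
        table[source] = (ms, ms / error_ms)
    return table


# Sums of squares and degrees of freedom as reported in Table 4.
table4 = {
    "1-Step clades": (0.02410, 2),
    "0-Step clades": (0.26574, 7),
    "Within 1-1": (0.07828, 3),
    "Within 1-2": (0.18745, 4),
}
for source, (ms, f) in nested_anova_f(table4, error_ss=0.82078, error_df=39).items():
    print(f"{source:14s} MS = {ms:.5f}  F = {f:.3f}")
```

Under that assumption the recomputed values agree with the printed table to the reported precision, and the two "Within" rows partition the 0-step term (3 + 4 = 7 degrees of freedom).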

Isozyme phenotypes in the Esterase-6 DNA region of D. melanogaster: GAME and OAKESHOTT (1990) examined restriction site and isozyme variation in a 21.5-kb region surrounding the Esterase-6 locus. TEMPLETON, CRANDALL and SING (1992) analyzed these data, and split the region into three subregions having little or no internal recombination but with extensive recombination between subregions. The first subre-


TABLE 5

Nested exact contingency analysis of Esterase-6 isozyme phenotypes with clades in the 5’ subregion of the D. melanogaster Est-6 DNA region described in GAME and OAKESHOTT (1990)

Source            Chi-square statistic   P
2-Step clades     11.7086                0.27
1-Step clades
  Within 2-1       0.8042                1.00
  Within 2-2      17.9167                0.55
0-Step clades
  Within 1-1       6.1722                0.77
  Within 1-3       2.0000                1.00
  Within 1-4       4.0000                0.53
  Within 1-6       2.0000                1.00

The nested design is given in Figure 6. A standard contingency chi-square statistic is calculated, and its exact significance is determined by 1,000 random permutations that preserve the marginal values. The probability column refers to the frequency with which these randomly generated chi-square statistics were equal to or greater than the observed chi-square.

gion (henceforth known as the 5’ region) includes the DNA that is 5’ to the Est-6 coding region and a small portion of the 5’ coding region. The middle region includes the bulk of the coding region and some of the 3’ sequence. We will henceforth refer to this as the coding region. The last subregion is the remainder of the 3’ DNA sequence, and is called the 3’ region. TEMPLETON, CRANDALL and SING (1992) present the plausible cladogram sets for each of these subregions. In this case, there is ambiguity within each subregion. TEMPLETON, CRANDALL and SING (1992) also showed how the isozyme data could be overlaid on the coding region cladogram to help resolve some of the ambiguities found in that region. However, for the purposes of this analysis, we will regard the electrophoretic mobility of the Esterase-6 enzyme as the phenotype to be analyzed. This will allow us to illustrate how the cladistic approach may be extended to an analysis of categorical phenotypes.

GAME and OAKESHOTT (1990) identified a total of six distinct isozyme phenotypes in the sample they analyzed, labeled by the numbers 1, 2, 4, 5, 8 and 9. We want to see if these categorical isozyme phenotypes are significantly associated with the haplotypes found in any of the three subregions. This is accomplished by a nested contingency analysis. The cladograms will be used to define a nested statistical design. Within each nesting category, we will examine the data for associations among the clade categories with the isozyme categories by a contingency test. Such contingency tests are independent among nesting categories for a given nesting level and across nesting levels (PRUM, GUILLOUD-BATAILLE and CLERGET-DARPOUX 1990). Often, one would use a contingency chi-square to test the null hypothesis of no association, but in our case we frequently encounter such small

FIGURE 6.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 42 lines described in GAME and OAKESHOTT (1990) for the 5’ subregion of the Est-6 DNA region of D. melanogaster. The haplotypes are numbered as given in GAME and OAKESHOTT (1990). However, some of these haplotypes are identical for the variable sites in this 5’ subregion, and accordingly are pooled together as shown by the rounded boxes. Each arrow indicates one mutational event. The description of the event is indicated by the arrow, using the notation given in GAME and OAKESHOTT (1990). Solid arrows indicate transitions that are unambiguous whereas dashed arrows indicate alternative connections that are all likely, using the criteria given in TEMPLETON, CRANDALL and SING (1992). Single-width lines enclose haplotypes that are nested together to form 1-step clades, as designated by the notation 1-#. Double-width lines enclose the 1-step clades that are nested together to form 2-step clades, as notated by 2-#.

sample sizes that the asymptotic property of a chi-square distribution cannot be assumed. Instead, we will perform an exact test that uses the random permutation procedure of ROFF and BENTZEN (1989). In this procedure, a contingency chi-square statistic is calculated and the probability of observing the exact test statistic or larger is generated using a random permutation procedure that maintains the marginals but simulates the null hypothesis of no association. One thousand permutations are sufficient for obtaining an accurate test of statistical significance when the type I error rate is chosen to be 5% (EDGINGTON 1987).
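The permutation procedure just described can be sketched as follows. This is a minimal illustration in the spirit of the ROFF and BENTZEN (1989) approach, not their program: shuffling the phenotype labels across individuals is one simple way to hold both sets of marginal totals fixed while simulating no association. The function names, the seed and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def chi_square(rows, cols):
    """Pearson contingency chi-square for paired category labels."""
    n = len(rows)
    row_tot = Counter(rows)
    col_tot = Counter(cols)
    observed = Counter(zip(rows, cols))
    stat = 0.0
    for r, nr in row_tot.items():
        for c, nc in col_tot.items():
            expected = nr * nc / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def permutation_p_value(clades, phenotypes, n_perm=1000, seed=0):
    """Fraction of label permutations whose chi-square is at least as
    large as the observed one (marginal totals are preserved)."""
    rng = random.Random(seed)
    observed = chi_square(clades, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if chi_square(clades, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm


# Toy data (hypothetical): two clades, two isozyme classes, eight lines.
toy_clades = ["A"] * 4 + ["B"] * 4
toy_isozymes = [1, 1, 1, 1, 2, 2, 2, 2]
stat, p = permutation_p_value(toy_clades, toy_isozymes)
print(f"chi-square = {stat:.4f}, permutation P = {p:.3f}")
```

In a nested analysis the same function would be called once per nesting category, with the clade labels restricted to the clades contained in that category.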

Figure 6 gives the plausible set of eight cladograms and the nested design for the 5’ region. As discussed


in TEMPLETON, CRANDALL and SING (1992), two haplotypes (12 and 13, where the haplotype numbers are those given in GAME and OAKESHOTT 1990) are of ambiguous position in the cladogram. Haplotype 12 can be connected parsimoniously either to haplotype 5 or to haplotype [3,8,15], in both cases by two mutational steps. Haplotype 13 can be connected parsimoniously by three mutational steps to haplotype 2, but in this case, the nonparsimonious connections of length four to haplotypes 5, [3,8,15] and 4 et al. are also included in the plausible cladogram set. Thus, there are a total of eight different cladograms in the plausible set (2 positions for 12, times 4 positions for 13). This ambiguity has no impact on nesting the haplotypes into 1-step clades because both haplotypes 12 and 13 are at least two mutational steps from their nearest neighbor in the cladogram; hence, by the standard rules (TEMPLETON, BOERWINKLE and SING 1987) both haplotypes define degenerate 1-step clades (clades 1-7, which contains only haplotype 12, and 1-8, which contains only haplotype 13, as illustrated in Figure 6). However, there is ambiguity for the connections of these haplotypes at the 2-step level. Hence, for the moment we ignore them in defining the 2-step clades. Using the standard rules, clades 1-1 and 1-2 should be pooled into a single 2-step clade (clade 2-1 in Figure 6), as should 1-step clades 1-4, 1-5 and 1-6. This pooling would leave clade 1-3 stranded symmetrically, so we invoke the rules pertinent to symmetrically stranded clades. In this case, 1-3 is pooled with 1-4, 1-5 and 1-6 using the sample size criterion to form 2-step clade 2-2. We now turn attention to the ambiguous 1-step clades, 1-7 and 1-8 (which, because of degeneracy, are the same as haplotypes 12 and 13). For 1-step clade 1-7, all ambiguous connections at the 2-step level are to clade 2-2. Hence, 1-step clade 1-7 is unambiguously nested within 2-2. Because haplotype 13 is at least three mutational steps removed from the remainder of the cladogram, the 1-step clade 1-8 that is degenerate with haplotype 13 remains degenerate at the 2-step level and defines the degenerate 2-step clade 2-3 under the standard nesting rules. The next level of nesting would join all 2-step clades together regardless of how 2-3 is connected to the other 2-step clades, so the nesting algorithm is terminated at the 2-step level. Once again, the ambiguities contained within the plausible cladogram set have had no impact on the nested design.
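The count of eight cladograms follows from treating the two ambiguous placements as independent choices. A small sketch of that enumeration is given below; the dictionary layout and the function name `enumerate_cladograms` are illustrative assumptions, while the alternative attachment points are those listed in the text.

```python
from itertools import product

# Alternative attachment points for each haplotype of ambiguous position,
# as listed in the text for the 5' subregion.
alternative_connections = {
    "12": ["5", "[3,8,15]"],
    "13": ["2", "5", "[3,8,15]", "4 et al."],
}


def enumerate_cladograms(alternatives):
    """List every combination of one attachment point per ambiguous
    haplotype, assuming the placements are independent of one another."""
    names = sorted(alternatives)
    choices = product(*(alternatives[name] for name in names))
    return [dict(zip(names, choice)) for choice in choices]


plausible = enumerate_cladograms(alternative_connections)
print(len(plausible))  # 8
```

Each dictionary in `plausible` identifies one member of the plausible set by the neighbor to which haplotypes 12 and 13 are attached.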

Table 5 presents the results of the nested exact permutation tests for associations with isozyme phenotype using the nested design given in Figure 6. As can be seen, this analysis suggests that there is no significant association between an individual's esterase isozyme type and the DNA haplotypes in the region 5' to the gene coding for Esterase-6.

Figure 7 presents the plausible cladogram set for

FIGURE 7.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 42 lines described in GAME and OAKESHOTT (1990) for the coding subregion of the Est-6 DNA region of D. melanogaster. The haplotypes are numbered as given in GAME and OAKESHOTT (1990). However, some of these haplotypes are identical for the variable sites in this coding subregion, and accordingly are pooled together as shown by the boxes in the figure. Each arrow indicates one mutational event. The description of the event is indicated by the arrow, using the notation given in GAME and OAKESHOTT (1990). Solid arrows indicate transitions that are unambiguous whereas dashed arrows indicate alternative connections that are all likely, using the criteria given in TEMPLETON, CRANDALL and SING (1992). Single-width lines enclose haplotypes that are nested together to form 1-step clades, as designated by the notation 1-#.

the coding region. There are two sources of ambiguity in this region. First there is a loop of ambiguity involving haplotypes 6 et al., 2 et al., 3 et al. and 11 et al. that is due to parallel mutations at either the TaqI + 0.8 site or the EcoRI + 4.3 site. Second, there is ambiguity in the position of haplotype 14 because this haplotype was not scored for its state at the EcoRI + 4.3 site (TEMPLETON, CRANDALL and SING 1992). As already illustrated by the left subregion Amy locus analysis, both types of ambiguity (evolutionary and incomplete scoring) can be dealt with by the nesting rules given in steps 1 and 2. In this case, the application of the nesting rules to the unambiguous portions of the cladogram produces a single 1-step clade, labeled as 1-1 in Figure 7. All connections among the remaining haplotypes are ambiguous, so they are left unnested. At the 2-step level the entire cladogram would be united, so these stranded haplotypes are elevated to the status of separate 1-step clades nested in a single 2-step clade that encompasses the entire cladogram. Table 6 presents the results of this nested analysis, and in this case there is a highly significant association between isozyme mobility phenotypes and the DNA haplotypes found within the 1-step clades.

Figure 8 presents the plausible cladogram set for the 3' region. As discussed by TEMPLETON, CRANDALL and SING (1992), there is considerable ambiguity in this subregion, as shown by all the dashed arrows in Figure 8 that define 5400 possible cladograms. Indeed, the ambiguity is so extreme in this case that it is better to go through the nesting algorithm in more detail than in the previous examples.

The first step is to focus just on the unambiguous


TABLE 6

Nested exact contingency analysis of Esterase-6 isozyme phenotypes with clades in the coding subregion of the D. melanogaster Est-6 DNA region described in GAME and OAKESHOTT (1990)

Source            Chi-square statistic   P
1-Step clades     74.8381                0.00
0-Step clades
  Within 1-1       2.7611                0.68

The nested design is given in Figure 7. A standard contingency chi-square statistic is calculated, and its exact significance is determined by 1,000 random permutations that preserve the marginal values. The probability column refers to the frequency with which these randomly generated chi-square statistics were equal to or greater than the observed chi-square.

FIGURE 8.-The cladogram set estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 42 lines described in GAME and OAKESHOTT (1990) for the 3' subregion of the Est-6 DNA region of D. melanogaster. The haplotypes are numbered as given in GAME and OAKESHOTT (1990). However, some of these haplotypes are identical for the variable sites in this 3' subregion, and accordingly are pooled together as shown by the boxes in the figure. Each arrow indicates one mutational event. The description of the event is indicated by the arrow, using the notation given in GAME and OAKESHOTT (1990). Solid arrows indicate transitions that are unambiguous whereas dashed arrows indicate alternative connections that are all likely, using the criteria given in TEMPLETON, CRANDALL and SING (1992).

portions of the cladogram, which are shown in Figure 9. Here, the applications of the standard rules of TEMPLETON, BOERWINKLE and SING (1987) result in the construction of four 1-step clades, as shown in Figure 9. We next examine the connections of the ambiguous haplotypes to these 1-step clades, as illustrated in Figure 10. We note that each of the haplotypes of ambiguous position is separated by two or more mutational steps from its nearest neighbor, and by the standard nesting rules each of these haplotypes (10, 11, 12, 20, 26) defines its own degenerate 1-step clade. Hence, the considerable amount of cladogram ambiguity shown in Figure 8 has no impact on the 1-step nesting level.

We next go to the 2-step level, as also illustrated in Figure 10. First, note that haplotypes 10 and 12 (and the 1-step degenerate clades that they define) are unambiguously connected to 1-step clade 1-1 at the 2-

FIGURE 9.-The unambiguous portion of the cladogram estimated by the TEMPLETON, CRANDALL and SING (1992) algorithm as derived from the 42 lines described in GAME and OAKESHOTT (1990) for the 3' subregion of the Est-6 DNA region of D. melanogaster. Single-width lines enclose haplotypes that are nested together to form 1-step clades, as designated by the notation 1-#.

FIGURE 10.-The 1-step clades of unambiguous position and the haplotypes of ambiguous position for the 3' subregion of the Est-6 DNA region of D. melanogaster. Solid arrows indicate transitions that are unambiguous whereas dashed arrows indicate alternative connections that are all likely, using the criteria given in TEMPLETON, CRANDALL and SING (1992). Single-width lines enclose haplotypes that are nested together to form 2-step clades, as designated by the notation 2-#. The haplotypes not enclosed by solid lines are regarded as separate 2-step clades.

step level. Applying the standard rules to the unambiguous connections (shown by solid arrows in Figure 10) yields two 2-step clades. Haplotype 7 (and its degenerate 1-step clade) can be connected to either 1-step clade 1-2 or 1-4, but since both 1-2 and 1-4 are part of 2-step clade 2-2, haplotype 7 is unambiguously nested within 2-2. The three remaining haplotypes (11, 20 and 26) could be connected to either 2-step clade 2-1 or 2-2, so they are left unnested. Since the next level of nesting would unite all haplotypes into a single clade, haplotypes 11, 20 and 26 are regarded as separate 2-step clades, yielding a total of five 2-step clades at the highest nesting level. Table 7 presents the results of the nested analysis. As can be seen, no significant associations are detected between isozymes and haplotypes in the DNA region 3' to the Esterase-6 gene. In summary, all significant associations between isozymes and haplotypes are limited to haplotypes defined solely by mutations in the coding region.
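The rule applied to the ambiguous haplotypes at the 2-step level can be summarized in a few lines: a clade of ambiguous position is nested only when every one of its alternative connections leads into the same higher-level clade. The sketch below is an illustrative reading of that rule, not the authors' algorithm as published; the function name `resolve_parent` and the data layout are assumptions, and only the clade memberships stated in the text are encoded.

```python
def resolve_parent(candidate_parents):
    """Return the single higher-level clade shared by every alternative
    connection, or None if the alternatives disagree."""
    parents = set(candidate_parents)
    return next(iter(parents)) if len(parents) == 1 else None


# 2-step membership stated in the text for the two 1-step clades involved.
two_step_of = {"1-2": "2-2", "1-4": "2-2"}

# Alternative connections of each ambiguous haplotype, as 2-step clades.
alternatives = {
    "7": [two_step_of["1-2"], two_step_of["1-4"]],
    "11": ["2-1", "2-2"],
    "20": ["2-1", "2-2"],
    "26": ["2-1", "2-2"],
}

nested = {hap: resolve_parent(cands) for hap, cands in alternatives.items()}
unnested = [hap for hap, parent in nested.items() if parent is None]

# Clades 2-1 and 2-2, plus one separate 2-step clade per unnested haplotype.
n_two_step_clades = 2 + len(unnested)
print(nested["7"], unnested, n_two_step_clades)  # 2-2 ['11', '20', '26'] 5
```

The same test covers ambiguity that arises from incomplete scoring, since only the set of alternative connections matters and not the reason they exist.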

DISCUSSION

Both the esterase and the amylase examples illustrate that the cladistic approach can be applied when the


TABLE 7

Nested exact contingency analysis of Esterase-6 isozyme phenotypes with clades in the 3' subregion of the D. melanogaster Est-6 DNA region described in GAME and OAKESHOTT (1990)

Source            Chi-square statistic   P
2-Step clades     20.6250                0.40
1-Step clades
  Within 2-1       6.0857                0.65
  Within 2-2      20.7900                0.16
0-Step clades
  Within 1-1      10.6250                0.31
  Within 1-2       2.5926                0.63
  Within 1-3       2.0000                1.00
  Within 1-4      19.3375                0.32

The nested design is given in Figures 9 and 10. A standard contingency chi-square statistic is calculated, and its exact significance is determined by 1,000 random permutations that preserve the marginal values. The probability column refers to the frequency with which these randomly generated chi-square statistics were equal to or greater than the observed chi-square.

cladogram is uncertain. Four of the six DNA subregions analyzed in this paper had some degree of ambiguity, ranging from two possible cladograms for the left Amy subregion in Figure 3 to 5400 possible cladograms for the 3' Est-6 subregion in Figure 8. As illustrated by these examples, much of this cladogram ambiguity has no impact at all on the nested design. Hence, what seems at first glance to be a very large amount of uncertainty, in practice often has only a minimal impact on the nested statistical design. The ambiguities that cannot be dealt with using the standard nesting rules can be dealt with using the additional nesting rules given in this paper. Moreover, ambiguities can also arise due to incomplete data, and the same nesting rules can deal with that source of uncertainty as well.

The Est-6 example shows how the cladistic analysis may be extended to include categorical phenotypes. The random permutation algorithm of ROFF and BENTZEN (1989) used to perform exact contingency tests for categorical data should not be confused with the random permutation procedure that we had previously developed for performing a nested analysis on quantitative phenotypes in natural populations (TEMPLETON et al. 1988). The earlier algorithm was a hierarchical permutation procedure that used nested average excesses as test statistics for continuous phenotypes. The current algorithm is a nested permutation procedure that uses a contingency chi-square test statistic on categorical data.

Although our initial cladistic analyses assumed that the DNA region under analysis had experienced little or no recombination (TEMPLETON, BOERWINKLE and SING 1987), the examples given in this paper show that we can perform analyses on DNA regions that have experienced considerable recombination. Indeed, as pointed out in TEMPLETON, CRANDALL and SING (1992), having some recombination actually strengthens the biological utility of the analysis since it allows not only a localization of phenotypically important mutations in an evolutionary sense (i.e., in the cladogram), but also a physical localization to a given subregion. The cladistic analytical method can therefore take advantage of long-term evolutionary contrasts created by the accumulation of mutations in clades and of short-term genetic contrasts between clades created by recombination. Caution must be exercised when significant phenotypic associations are found in two or more subregions because substantial disequilibrium can still exist between subregions for some markers. However, in both of the examples provided above, significant phenotypic associations were found but were limited to only one of the subregions, thereby allowing an unambiguous physical localization. In the case of the Est-6 locus we can be particularly confident that the physical localization is an accurate one. In that case, we know a priori that the phenotype of isozyme electrophoretic mobility should most likely map to the coding region of the gene. As shown in Tables 5, 6 and 7, the significant associations with isozyme phenotype are indeed localized to the coding region and none are detected in the adjacent 5' and 3' DNA subregions.

For the Amy region, our analyses (Tables 1 through 4) localize the phenotypic associations to the DNA region between the duplicated genes. Because the duplicated genes are of reverse polarity, this region is 5' to both copies and should contain the transcription control regions for both loci. Moreover, our earlier analysis (TEMPLETON, CRANDALL and SING 1992) indicated that this middle region has high levels of recombination. Perhaps some of the phenotypic variation is generated by recombination in this region, which would create new combinations of the duplicate genes. However, one of the significant associations is with an inversion that covers the entire Amy region, In(2R)NS. The fact that the phenotypic effects associated with this inversion are localized to just the middle Amy region implies that the actual phenotypic effects are not due to the inversion per se, but rather to another mutation that is in disequilibrium with the inversion. It is known that other Amy markers are in disequilibrium with this inversion (YAMAZAKI et al. 1984; LANGLEY, ITO and VOELKER 1977).

Although the cladogram in the middle Amy region was so simple that the analysis of variance was unnested, the cladistic analysis still differs from the unnested analysis given in LANGLEY et al. (1988). Both analyses identified significant phenotypic associations with the EcoRI -2.2 site and the In(2R)NS inversion. However, LANGLEY et al. (1988) detected the significant effect of the inversion (which lacks the EcoRI


-2.2 recognition sequence) by contrasting it with all other stocks. Therefore, their contrast was confounded by the significant effect associated with the presence vs. absence of the EcoRI -2.2 site. However, the cladistic analysis cleanly separates the phenotypic associations of these two different mutations and demonstrates that the inversion has an association that is independent of that of the EcoRI -2.2 site.

In addition, LANGLEY et al. (1988) contrasted the activities of the non-AMY' isozyme haplotypes (0-1 through 0-4 and 0-8) with the AMY' haplotypes and concluded that the non-AMY' haplotypes are associated with higher activities. Such a contrast would not be part of the standard cladistic analysis. As shown in Figures 2 and 4, the non-AMY' isozyme haplotypes are found in noncontiguous locations in the cladograms and involve different copies of the duplicated gene. Hence, the non-AMY' class represents a collection of evolutionarily independent mutations. Accordingly, there is no evolutionary rationale for pooling the non-AMY' allozyme types together, and the biological significance of this allozyme contrast is not clear. Problems such as these are avoided with a cladistic analysis. The strength of the cladistic approach is that it automatically performs all the evolutionarily relevant contrasts. The standard ANOVA performed by LANGLEY et al. (1988) merely says that there is a significant line effect, but it does not localize where this effect may be. Such a localization is standardly achieved by inspection of the data after an ANOVA to define sets of multiple contrasts. Such an inspection procedure can both overlook relevant contrasts and create artificial categories, such as the non-AMY' haplotype category. Both problems are avoided by the cladistic procedure in which all relevant contrasts are defined by the shape of the cladogram and are not based on an inspection of the phenotypic data. Besides localizing any detected effects, the cladistic approach has more power than the standard ANOVA to detect a line effect in the first place (TEMPLETON, BOERWINKLE and SING 1987).

By extending the cladistic analysis to incorporate cladogram uncertainty, categorical as well as continuous phenotypes, and recombination, we have greatly extended the breadth of applicability of this approach to the analysis of phenotypic associations with molecular markers. As mentioned above and in our earlier papers, this approach has several advantages over nonhistorical approaches: the ability to detect phenotypically convergent but evolutionarily independent mutations, automatic definition of the relevant multiple contrasts, and the ability to narrow the possibilities of which mutations are actually responsible for the phenotypic effects both to a small subset of haplotypes and to a subregion of the DNA (if recombination is present).

This work was supported by National Institutes of Health Grant 1 R01 HL39107. We wish to thank MARTHA HAVILAND, KIM ZERBA, BRUCE WEIR, JOSEPH FELSENSTEIN, MARY KUHNER and an anonymous reviewer for their useful comments and suggestions on earlier versions of this work.

LITERATURE CITED

CLARK, A. G. and T. S. WHITTAM, 1992 Sequencing errors and molecular evolutionary analysis. Mol. Biol. Evol. 9: 744-752.

EDGINGTON, E. S., 1987 Randomization Tests. 2nd ed., Marcel Dekker, New York and Basel.

GAME, A. Y. and J. G. OAKESHOTT, 1990 Associations between restriction site polymorphism and enzyme activity variation for Esterase-6 in Drosophila melanogaster. Genetics 126: 1021-1031.

LANGLEY, C. H., K. ITO and R. A. VOELKER, 1977 Linkage disequilibrium in natural populations of Drosophila melanogaster. Genetics 86: 447-454.

LANGLEY, C. H., A. E. SHRIMPTON, T. YAMAZAKI, N. MIYASHITA, Y. MATSUO and C. F. AQUADRO, 1988 Naturally occurring variation in the restriction map of the Amy region of Drosophila melanogaster. Genetics 119: 619-629.

LLOYD, D. G. and V. L. CALDER, 1991 Multi-residue gaps, a class of molecular characters with exceptional reliability for phylogenetic analyses. J. Evol. Biol. 4: 9-21.

PRUM, B., M. GUILLOUD-BATAILLE and F. CLERGET-DARPOUX, 1990 On the use of χ² tests for nested categorized data. Ann. Hum. Genet. 54: 315-320.

ROFF, D. A. and P. BENTZEN, 1989 The statistical analysis of mitochondrial DNA polymorphisms: χ² and the problem of small samples. Mol. Biol. Evol. 6: 539-545.

SWOFFORD, D., 1991 PAUP 3.0 User's Manual (Draft 2/9/91). Illinois Natural History Survey.

TEMPLETON, A. R., 1983 Convergent evolution and nonparametric inferences from restriction data and DNA sequences, pp. 151-179 in Statistical Analysis of DNA Sequence Data, edited by B. S. WEIR. Marcel Dekker, New York.

TEMPLETON, A. R., E. BOERWINKLE and C. F. SING, 1987 A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics 117: 343-351.

TEMPLETON, A. R., K. A. CRANDALL and C. F. SING, 1992 A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics 132: 619-633.

TEMPLETON, A. R., C. F. SING, A. KESLING and S. HUMPHRIES, 1988 A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. II. The analysis of natural populations. Genetics 120: 1145-1154.

YAMAZAKI, T., Y. MATSUO, Y. INOUE and Y. MATSUO, 1984 Genetic analysis of natural populations of Drosophila melanogaster in Japan. I. Protein polymorphism, lethal gene, sterility gene, inversion polymorphism, and linkage disequilibrium. Jpn. J. Genet. 59: 33-49.

Communicating editor: B. S. WEIR