

Syst. Biol. 64(2):233–242, 2015
© The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: [email protected]
DOI:10.1093/sysbio/syu087
Advance Access publication November 19, 2014

Weighted Quartets Phylogenetics

ELIRAN AVNI1, REUVEN COHEN2, AND SAGI SNIR3,∗

1Department of Evolutionary Biology, University of Haifa, Haifa 31905, Israel; 2School of Engineering, Kinneret College, 15132, Israel; and 3Department of Evolutionary Biology, University of Haifa, Haifa 31905, Israel

∗Correspondence to be sent to: Department of Evolutionary Biology, University of Haifa, Haifa 31905, Israel; E-mail: [email protected]

Eliran Avni and Reuven Cohen contributed equally to this article.

Received 10 October 2013; reviews returned 10 January 2014; accepted 7 November 2014
Associate Editor: Tiffani Williams

Abstract.—Despite impressive technical and theoretical developments, reconstruction of phylogenetic trees for enormous quantities of molecular data is still a challenging task. A key tool in analyses of large data sets has been the construction of separate trees for subsets (e.g., quartets) of sequences, and subsequent combination of these subtrees into a single tree for the full set (i.e., supertree analysis). Unfortunately, even amalgamating quartets into a supertree remains a computationally daunting task. Assigning weights to quartets to indicate importance or reliability was proposed more than a decade ago, but handling weighted quartets is even more challenging and has scarcely been attempted in the past. In this work, we focus on weighted quartet-based approaches. We propose a scheme to assign weights to quartets coming from weighted trees and devise a tree similarity measure for weighted trees based on weighted quartets. We also extend the quartet MaxCut (QMC) algorithm to handle weighted quartets. We evaluate these tools on simulated and real data. Our simulated data analysis highlights the additional information that is conveyed when using the new weighted tree similarity measure, and shows that extending QMC to a weighted setting improves the quality of tree reconstruction. Our analyses of a cyanobacterial data set with weighted QMC reinforce previous results achieved with other tools. [Phylogenetic reconstruction; quartet maxcut; supertree reconstruction; weighted quartet trees; weighted tree similarity.]

Reconstruction of phylogenetic (evolutionary) trees is a fundamental task in evolutionary biology that has acquired growing importance with the recent developments in sequencing technologies. The availability of enormous quantities of molecular data has turned the reconstruction of the phylogenies of very large groups of species into a common routine in comparative genomics. For this goal, it is desirable to utilize the maximum amount of classification data available for the species group at hand. These data are normally represented as separate trees (e.g., specific gene histories) over the species group or subsets of it. Combining this information into a single tree over the complete species set is called supertree construction. As the task of building theoretically accurate trees for even a few dozen species is computationally demanding (i.e., the running time is exponential in the size of the input), constructing trees of larger size requires employing advanced heuristic techniques. One such heuristic follows two stages: 1) constructing very small trees by accurate phylogenetic methods, resulting in a large quantity of high quality trees over overlapping sets of taxa; 2) amalgamating these trees into a unified tree over the full taxa set by the supertree approach.

As evolution is time driven, evolutionary trees are naturally rooted where the root of the tree represents the last common ancestor of the taxa set. In such a rooted setting, the basic informational unit is a rooted triplet, a tree over three species, where one species is an outgroup with respect to the other two. However, because time scale is normally not part of the input data, most phylogenetic methods return an unrooted tree where the phylogenetic information represents splits rather than ancestry. In such an unrooted setting the basic information unit is an unrooted quartet tree or simply a quartet. Therefore, the simplest case of the supertree problem is when all input subtrees are quartets and the goal is to combine these quartets into a single tree. This task, named quartet-based supertree or quartet amalgamation, lies at the heart of many tasks in phylogenetics. The inference of such quartets is usually done accurately and rigorously from raw data (Strimmer and von Haeseler 1996; Chor et al. 2000, 2006; Chor and Snir 2004; Holland et al. 2013). Due to its fundamental role, quartet amalgamation has attracted a lot of interest mainly for practical reasons but also from its theoretical perspective (Bandelt and Dress 1986; Bryant and Steel 2001; Alon et al. 2014). A given set of quartets is not necessarily compatible (or consistent) in the sense that it agrees with some single tree. The problem of finding the largest compatible subset of a given quartet set was found to be computationally intractable more than 20 years ago (Steel 1992). Despite intensive efforts (Jiang et al. 2000; Snir and Yuster 2012), the best-known solution to the general problem is still a random tree, satisfying (only) a third of the input.

The importance of associating quartets with weights has already been demonstrated (Strimmer et al. 1997; Berry and Gascuel 2001), and it was specifically noted (Ranwez and Gascuel 2001; Holland et al. 2013) that a proper weighting of quartets may substantially increase the accuracy of quartet-based methods. In these works, quartets were inferred from molecular sequences, and the weight was used to represent the confidence of the quartet topology (out of the three possible topologies).

In this work, we focus on weighted extensions to quartet-based approaches. We first introduce weighted quartet MaxCut (wQMC), the weighted extension of the QMC algorithm that considers weighted quartet inputs and builds a tree reflecting the augmented information in the input. Next, we enhance the quartet fit (Qfit) measure to consider weighted trees and quartets. The new measure, wQfit, ranges between −1 and 1, where zero signifies random correlation. We also devise a scheme for associating quartets with weights derived from a tree. Such a scheme is essential for supertree tasks where subtrees are decomposed into quartets that are subsequently combined together.

In the Results section, we show that wQMC, under the weighting scheme we devised, is successful at handling noisy or incomplete data sets, even with poor model fit. wQMC consistently improves on the performance of QMC, sometimes dramatically. The wQfit measure is found to be significantly more informative than the traditional Robinson–Foulds (RF) Symmetric Difference. Our final result concerns the task of reconciling real data gene histories in prokaryotes that conflict mainly due to the phenomenon of horizontal gene transfer (HGT) (Doolittle 1999; Ochman et al. 2000; Koonin et al. 2001; Jin et al. 2007). We applied wQMC to over a thousand cyanobacterial gene trees extensively exposed to HGT and confirmed previous results obtained with unweighted quartets as input for the supertree method (Zhaxybayeva et al. 2006).

The wQMC software is publicly available at http://research.haifa.ac.il/~ssagi/software/wQMC.tar.gz. Supplementary material (data and scripts) is available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57.

METHODS

Preliminary

Phylogenetic trees.—For a set of species (taxa) $X$, a phylogenetic $X$-tree $t$ is a tree for which there is a one-to-one correspondence between $X$ and the set of leaves of $t$, denoted $L(t)$. Removing an edge (or branch) from a tree creates two subtrees that naturally split $X$. The split $(U, X \setminus U)$ that is identified by an edge $e$ is denoted by $e_U$, where $U$ is arbitrarily one of the parts of the split. A tree $t$ is weighted if there is a function associating weights to the edges of $t$. We denote the weight of an edge $e$ by $w_E(e)$. Let $t$ be an $X$-tree and $A \subseteq X$ be a subset of $L(t)$. We denote by $t|_A$ the subtree of $t$ that is induced by $A$, obtained as follows: First, all the leaves in $X \setminus A$, as well as paths leading exclusively to them, are removed. Next, all internal nodes with degree two are contracted. If $t$ is weighted, edges created by vertex contraction receive the sum of the weights of the joined edges.

Consistent trees.—For two trees $t_1$ and $t_2$, we say that $t_1$ satisfies $t_2$ if $L(t_2) \subseteq L(t_1)$ and $t_1|_{L(t_2)} = t_2$; otherwise, $t_2$ is violated by $t_1$. For a set of trees $T = \{t_1, \ldots, t_k\}$ with possibly overlapping leaves, we say that $T$ is consistent if there exists a tree $t^*$ over the union set of leaves of the trees in $T$ that satisfies every tree $t_i \in T$. Otherwise, $T$ is inconsistent. The problem of finding such a consistent tree $t^*$, or a similar one if none exists, is known as the supertree problem.

Quartets and the Maximum Quartet Consistency problem.—In this work we deal only with unrooted trees, where there are no ancestor–descendant relationships between the nodes. The basic unit of information in unrooted trees is a quartet tree $q$, an unrooted tree with four taxa $\{a,b,c,d\}$. A quartet $q$ is denoted $ab|cd$ if a split $(\{a,b\},\{c,d\})$ is induced by one of its edges. More generally, a quartet $q = ab|cd$ is satisfied by a tree $t$ if $t$ has a split separating $a,b$ from $c,d$. A common special case of the supertree problem is when the input consists solely of quartets and the objective is to find a tree that satisfies the maximum number of them. This is known as the Maximum Quartet Consistency (MQC) problem. Algorithms for MQC are quartet-dedicated supertree algorithms.
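The satisfaction test and the MQC objective can be illustrated with a minimal sketch. It assumes a tree is represented only by the taxon sets on one side of each internal split; the names `satisfies`, `count_satisfied`, and the five-taxon example are hypothetical and are not part of the wQMC software.

```python
from typing import Iterable, List, Set, Tuple

Quartet = Tuple[str, str, str, str]  # (a, b, c, d) encodes the quartet ab|cd


def satisfies(splits: Iterable[Set[str]], taxa: Set[str], quartet: Quartet) -> bool:
    """True if some split (U, taxa - U) puts {a, b} on one side and {c, d} on the other."""
    a, b, c, d = quartet
    left, right = {a, b}, {c, d}
    for side in splits:
        other = taxa - side
        if (left <= side and right <= other) or (left <= other and right <= side):
            return True
    return False


def count_satisfied(splits: List[Set[str]], taxa: Set[str], quartets: List[Quartet]) -> int:
    """The MQC objective for one candidate tree: number of input quartets it satisfies."""
    return sum(satisfies(splits, taxa, q) for q in quartets)


# Hypothetical example: the five-taxon tree ((1,2),3,(4,5)), given by its internal splits.
TAXA = {"1", "2", "3", "4", "5"}
SPLITS = [{"1", "2"}, {"4", "5"}]
QUARTETS = [("1", "2", "3", "4"), ("1", "3", "2", "4"), ("1", "3", "4", "5")]
print(count_satisfied(SPLITS, TAXA, QUARTETS))  # 12|34 and 13|45 are satisfied, 13|24 is not
```

Only internal splits need to be listed, because a split that isolates a single leaf cannot separate two taxa from two others.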

Weighted cuts and the MaxCut Problem.—A cut $C = (S, \bar{S})$ in a graph $G$ is a partition over the set of nodes $V$, i.e., $S \cup \bar{S} = V$ and $S \cap \bar{S} = \emptyset$. An edge $e$ connecting two nodes $u, v$ is in the cut if $u \in S \wedge v \in \bar{S}$. We denote the set of all edges which are in a given cut $C$ by $E_C$. The weight of a cut $C$, $w_c(C)$, is the sum of weights of the edges in $C$, i.e.,

$$w_c(C) = \sum_{e \in E_C} w_E(e).$$

A maximum cut $C_m = (S_m, \bar{S}_m)$ is a cut which has the maximum weight in $G$. The task of finding $C_m$ is called the MaxCut problem and is known to be computationally hard (Garey and Johnson 1979).

QMC and wQMC.—The Quartet MaxCut (QMC) algorithm of Snir and Rao (2010) is a heuristic intended to solve the MQC problem indirectly, by solving the MaxCut problem. It has proven to be a very powerful tool for constructing phylogenetic trees (Snir et al. 2008; Snir and Rao 2010). The algorithm receives as input a set of quartets $Q$ defined on a taxa set $X$. It then builds the following quartet (multi)graph $G = G(Q) = (V, E)$ with vertices $V = X$ and edges $E$ as follows: For every $q \in Q$ the algorithm adds to $G$ edges related to every pair of leaves in $q$. The edges that correspond to adjacent sister leaves (a cherry) in $q$ are denoted as bad edges, and the (four) other pairs are denoted as good edges. Note that between two nodes in $G(Q)$ there can be good and bad edges simultaneously, originating from different quartets. An example of such a graph and its constituent quartets is given in Figure 1.

The QMC algorithm seeks a cut $C$ in the quartet graph that maximizes the ratio between the good and the bad edges in $C$. The cut thus found defines a split $(U, X \setminus U)$ over the taxa set $X$. This procedure of finding the splits is applied recursively on the resulting subsets ($U$ and $X \setminus U$), until the subset’s size is smaller than four. The splits that were found are used to construct the final tree, where every split defines an edge in the construction. Notice that an input quartet may be violated by the resulting tree only if one of its “bad” edges was cut during the recursive process described above.


FIGURE 1. Building a Graph from Quartets. The edges (1,3) and (2,4) from the first quartet and (1,2) and (3,4) from the second quartet are added as bad edges to the graph as they connect cherry leaves, and the edges (1,2), (3,4), (1,4), (2,3) from the first quartet and (1,3), (2,4), (2,3), (1,4) from the second quartet are added as good edges as they connect noncherry leaves. The cut separates the leaves 1,2 from the leaves 3,4, and it includes two bad edges and six good edges.

To support weighted quartets, a very natural modification was incorporated into QMC. Under wQMC, every edge in $G(Q)$ is first assigned the weight of its “mother” quartet, and then the algorithm looks for a cut that maximizes the ratio between the total weight of the good edges and the total weight of the bad edges. This extension of QMC may be natural and straightforward, yet it is important nonetheless, because there are cases where we wish to differentiate or prioritize between the quartets, according to certain criteria (e.g., importance, precision, etc.).
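The weighted quartet graph and the quantities that a cut is scored on can be sketched as follows. This is only an illustration under stated assumptions: it evaluates a given cut and does not reproduce how wQMC searches for one, and the function names `build_quartet_graph` and `cut_totals` are hypothetical. The example uses the two quartets of Figure 1 (13|24 and 12|34), here with unit weights as an assumption.

```python
from collections import defaultdict
from itertools import combinations


def build_quartet_graph(weighted_quartets):
    """Return (good, bad): dicts mapping an unordered leaf pair to its total edge weight.

    Each input item is ((a, b, c, d), w), encoding the quartet ab|cd with weight w.
    Every edge inherits the weight of the quartet it originates from.
    """
    good = defaultdict(float)
    bad = defaultdict(float)
    for (a, b, c, d), w in weighted_quartets:
        cherries = {frozenset((a, b)), frozenset((c, d))}
        for pair in combinations((a, b, c, d), 2):
            key = frozenset(pair)
            (bad if key in cherries else good)[key] += w
    return good, bad


def cut_totals(good, bad, side):
    """Total good and total bad weight of the edges crossing the cut (side, rest)."""
    def crossing(edges):
        # An edge crosses the cut when exactly one of its endpoints lies in `side`.
        return sum(w for pair, w in edges.items() if len(pair & side) == 1)
    return crossing(good), crossing(bad)


# The two quartets of Figure 1, with unit weights (an assumption for illustration).
quartets = [((1, 3, 2, 4), 1.0), ((1, 2, 3, 4), 1.0)]
good, bad = build_quartet_graph(quartets)
good_w, bad_w = cut_totals(good, bad, side=frozenset({1, 2}))
print(good_w, bad_w)  # six good edges and two bad edges cross the cut {1,2} | {3,4}
```

With unit weights the totals reduce to the edge counts of unweighted QMC; replacing the weights changes only the totals, not the graph structure.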

In Figure 2, we demonstrate the issue using a toy example. The quartets on the left are over five taxa but are inconsistent: no tree satisfies them all simultaneously. Hence, some optimization criterion is necessary. An algorithm that is indifferent to weights will choose a tree maximizing the number of satisfied quartets. It is easy to verify that the upper tree on the right satisfies the three lower quartets on the left. The sum of their weights is 1.2. In contrast, the lower tree on the right satisfies only two quartets (the upper ones on the left), but achieves a significantly higher total weight of 2.0.

FIGURE 2. The Effect of the Weight on the Constructed Tree. The upper tree is constructed from quartets without referring to their weights. The tree satisfies the last three quartets and violates the first one. The construction of the lower tree considers the quartet weights; it satisfies the upper two quartets, which have higher weights, and violates the lower quartets, which have lower weights.

Computing the Quartet’s Weight

Quartets’ weights are computed based on data related to the quartet. When homologous sequences are analyzed, weights normally represent the quartet’s confidence or resolvability (Erdös et al. 1999; Gronau et al. 2008; Daskalakis et al. 2011). In the supertree realm, weight can be computed from the input subtrees and can represent confidence in the split. Here we propose a scheme for assigning weight to a given quartet based on the pairwise distances between its leaves. This scheme is relevant both to sequence-based reconstruction and to supertree reconstruction.

For an edge-weighted tree $t$, let $d^t_{xy}$ denote the weight (or distance) of the path in $t$ connecting leaves $x$ and $y$ (that is simply the sum of edge weights along that path). For a four-subset $\{a,b,c,d\} \subseteq X$ let $d^t_{abcd} = d^t_{ab} + d^t_{cd}$, $d^t_{acbd} = d^t_{ac} + d^t_{bd}$, and $d^t_{adbc} = d^t_{ad} + d^t_{bc}$. We note that if the distances come from a tree, then by the four points method (Buneman 1971), the maximum between the above three distances is not unique. We define a variation on the four points method that can be applied not only to distances coming from trees, but also to distances coming from molecular sequences.

Assuming that $d^t_{abcd} \le d^t_{acbd} \le d^t_{adbc}$, we set $q' = ab|cd$ and define our weight function $w_q(q')$ as follows:

$$w_q(q') = \frac{d_h - d_\ell}{\exp(d_h - d_m)\, d_h}, \tag{1}$$

where $d_\ell, d_m, d_h$ denote $d^t_{abcd}, d^t_{acbd}, d^t_{adbc}$, respectively.

When dealing with quartets, $d_h - d_\ell$ is twice the length of the internal edge. The quartet weight increases as the internal edge is longer and the quartet split is more significant. Note that when the quartet is unresolved, i.e., $d_h - d_\ell = 0$, the weight becomes 0. Adding $d_h$, which is the quartet’s diameter, to the denominator gives priority to short quartets, which are more reliable than the longer ones (Erdös et al. 1999; Gronau et al. 2008). As we expect $d_m$ and $d_h$ to be more similar the more reliable the data are, we have $\exp(d_h - d_m)$ in the denominator. Note that in a tree, when $d_h - d_m = 0$, we have $\exp(d_h - d_m) = 1$, yielding $w_q(q') = 1 - d_\ell / d_h$.
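The weighting scheme of Equation (1) can be sketched as a small function. The interface is an assumption made for illustration: `dist` is any symmetric pairwise distance function, the name `quartet_weight` is hypothetical, and returning weight 0 when all distances vanish is a choice of this sketch rather than something stated in the text.

```python
import math
from typing import Callable, Tuple


def quartet_weight(dist: Callable[[str, str], float],
                   a: str, b: str, c: str, d: str) -> Tuple[str, float]:
    """Return (topology, weight) for the four taxa, following Equation (1)."""
    sums = {
        f"{a}{b}|{c}{d}": dist(a, b) + dist(c, d),
        f"{a}{c}|{b}{d}": dist(a, c) + dist(b, d),
        f"{a}{d}|{b}{c}": dist(a, d) + dist(b, c),
    }
    # Sort the three pairings by their distance sums: d_low <= d_mid <= d_high.
    (topology, d_low), (_, d_mid), (_, d_high) = sorted(sums.items(), key=lambda kv: kv[1])
    if d_high == 0:
        return topology, 0.0  # degenerate input; weight 0 is an assumption of this sketch
    weight = (d_high - d_low) / (math.exp(d_high - d_mid) * d_high)
    return topology, weight


# Hypothetical tree quartet ab|cd with four pendant edges and one internal edge, all of length 1.
PATHS = {frozenset("ab"): 2.0, frozenset("cd"): 2.0,
         frozenset("ac"): 3.0, frozenset("ad"): 3.0,
         frozenset("bc"): 3.0, frozenset("bd"): 3.0}
topology, weight = quartet_weight(lambda x, y: PATHS[frozenset((x, y))], "a", "b", "c", "d")
print(topology, weight)
```

In this example $d_\ell = 4$ and $d_m = d_h = 6$, so the exponential factor is 1 and the weight equals $1 - d_\ell/d_h = 1/3$, as in the tree case described above.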

The Weighted Qfit Tree Similarity Measure

Motivation.—The Qfit measure (Estabrook 1985) expresses the similarity of two trees by counting the number of quartets shared by the compared trees and dividing it by the number of all possible quartets. This measure treats all quartets equally. We present here the wQfit (weighted Qfit) tree similarity measure, which enhances Qfit because it also takes the quartets’ weights into consideration. The wQfit measure ranks the level of similarity between two trees by rewarding quartets that are shared by both trees and penalizing quartets that are not, according to their weights. We now formally define this new measure and its use in both phylogenetic reconstruction and tree similarity.

Definition of the wQfit measure.—Let $t$ be an edge-weighted $X$-tree. For a subset of species $s \subseteq X$ such that $|s| = 4$, let $t_s$ be the quartet that is induced by $s$ and $t$, i.e., $t_s = t|_s$. Let $w_q(t_s)$ be the weight of the quartet $t_s$ (that can be induced from the tree $t$ or by any other manner). For two $X$-trees denoted $t_1$ and $t_2$ and a given four-taxa set $s$, let $t_{1,s}$ and $t_{2,s}$ be the two quartets that are induced by $s$ and by $t_1$ and $t_2$, respectively. We write $t_{1,s} = t_{2,s}$ when the two quartets have the same topology.

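Definition 1 For two quartets $q_1$ and $q_2$ over the same four taxa, the (quartet) weighted quartet fit, $\mathrm{wQfit}_q(q_1, q_2)$, is defined as:

$$\mathrm{wQfit}_q(q_1, q_2) = \alpha\, w_q(q_1)\, w_q(q_2), \tag{2}$$

where $\alpha$ is defined as:

$$\alpha = \begin{cases} 2 & q_1 = q_2 \\ -1 & q_1 \neq q_2 \end{cases}$$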

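Definition 2 For two $X$-trees $t_1$ and $t_2$, the (tree) wQfit, $\mathrm{wQfit}_t(t_1, t_2)$, between $t_1$ and $t_2$ is defined as:

$$\mathrm{wQfit}_t(t_1, t_2) = \frac{2 \sum_s \mathrm{wQfit}_q(t_{1,s}, t_{2,s})}{\sum_s \mathrm{wQfit}_q(t_{1,s}, t_{1,s}) + \sum_s \mathrm{wQfit}_q(t_{2,s}, t_{2,s})}, \tag{3}$$

where $s$ runs over all four-taxa subsets of $X$.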
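Definitions 1 and 2 translate directly into a few lines of code. The sketch below assumes that each tree has already been decomposed into a table mapping every four-taxa set to its induced topology and weight; the table layout, the helper `topo` (which only works for single-character taxon names), and the example weights are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Topology = FrozenSet[FrozenSet[str]]
QuartetTable = Dict[FrozenSet[str], Tuple[Topology, float]]


def topo(pair1: str, pair2: str) -> Topology:
    """Canonical form of a quartet topology, e.g. topo('ab', 'cd') for ab|cd."""
    return frozenset({frozenset(pair1), frozenset(pair2)})


def wqfit_q(q1: Topology, w1: float, q2: Topology, w2: float) -> float:
    """Quartet-level fit: coefficient 2 for equal topologies, -1 otherwise (Definition 1)."""
    coefficient = 2.0 if q1 == q2 else -1.0
    return coefficient * w1 * w2


def wqfit_t(tree1: QuartetTable, tree2: QuartetTable) -> float:
    """Tree-level fit (Definition 2); both tables must cover the same four-taxa sets."""
    if tree1.keys() != tree2.keys():
        raise ValueError("both trees must induce quartets on the same four-taxa sets")
    numerator = sum(wqfit_q(*tree1[s], *tree2[s]) for s in tree1)
    denominator = (sum(wqfit_q(*tree1[s], *tree1[s]) for s in tree1)
                   + sum(wqfit_q(*tree2[s], *tree2[s]) for s in tree2))
    return 2.0 * numerator / denominator  # undefined if every quartet weight is zero


# Hypothetical four-taxon example: a single four-taxa set with made-up weights.
s = frozenset("abcd")
t1 = {s: (topo("ab", "cd"), 0.5)}
t2 = {s: (topo("ab", "cd"), 0.5)}
t3 = {s: (topo("ac", "bd"), 0.5)}
print(wqfit_t(t1, t2), wqfit_t(t1, t3))
```

Identical topologies and weights give a score of 1, while the conflicting pair gives a negative score, in line with the interpretation of the measure given in the text.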

In the rest of the article, we omit the subscript “t” from the notation and use it only where it may reduce confusion.

We can observe that for two trees $t_1$ and $t_2$, $t_1 = t_2$ if and only if $\mathrm{wQfit}(t_1, t_2) = 1$ (here $t_1 = t_2$ signifies identical topologies and identical quartet weights).

We also observe that:

(i) For any two $X$-trees $t_1$ and $t_2$, we have $|\mathrm{wQfit}(t_1, t_2)| \le 1$.

(ii) For any weighted binary $X$-tree $t_1$, if the weight of a quartet is determined solely by the positions of its leaves on the tree, and if $t_2$ is obtained by assigning a random permutation of $X$ to the leaves of $t_1$, then $\mathrm{wQfit}(t_1, t_2)$, when regarded as a random variable, has zero expectation. In other words, $E[\mathrm{wQfit}(t_1, t_2)] = 0$ holds.
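A brief sketch of why observation (ii) is plausible, offered here as an informal argument rather than the proof given in the Supplementary Material. Write $\alpha$ for the coefficient of Definition 1 (2 for equal topologies, $-1$ otherwise). Fix a four-taxa set $s$ and let $P$ be the four leaf positions of $t_1$ that receive the labels of $s$ under the random permutation. Conditioned on $P$, the four labels are assigned to those positions uniformly, so each of the three topologies on $s$ is equally likely for $t_{2,s}$, while $w_q(t_{2,s})$ depends on $P$ alone:

```latex
\[
E[\alpha \mid P] \;=\; 2\cdot\tfrac{1}{3} + (-1)\cdot\tfrac{2}{3} \;=\; 0,
\qquad\text{hence}\qquad
E\big[\mathrm{wQfit}_q(t_{1,s},t_{2,s})\big]
 \;=\; w_q(t_{1,s})\,E\big[\,w_q(t_{2,s})\,E[\alpha \mid P]\,\big] \;=\; 0 .
\]
```

The denominator of Equation (3) is unchanged by relabeling the leaves, because the permutation only reorders the terms of $\sum_s w_q(t_{2,s})^2$. Summing the numerator over all $s$ then gives zero expectation for the whole score.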

Further discussion about these facts appears in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57.

Note that $\mathrm{wQfit}(t_1, t_2)$ can be negative if the total fitness of the different (topology) quartets is higher than the total fitness of the equal (topology) quartets. In this case, we say that the trees are negatively correlated as they exhibit a wQfit score that is worse than random (according to our second observation).

There are cases where quartets of one tree have weights, whereas quartets of another tree do not (e.g., when comparing a weighted tree with an unweighted one). We wish to obtain a score of 1 if the trees have the same topology, and we note that the straightforward extension, in which unweighted edges are assigned unitary weight, does not work. In such a case, we modify the fitness definition to consider weights only from one of the trees.

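Suppose that the quartets of $t_1$ have weights whereas the quartets of $t_2$ have no weights. Then, we define the fitness as:

$$\mathrm{wQfit}_t(t_1, t_2) = \frac{\sum_s \mathrm{wQfit}_q(t_{1,s}, t_{2,s})}{\sum_s \left|\mathrm{wQfit}_q(t_{1,s}, t_{2,s})\right|}, \tag{4}$$

where, with $\alpha$ as in Definition 1,

$$\mathrm{wQfit}_q(t_{1,s}, t_{2,s}) = \alpha\, w_q(t_{1,s})\, w_q(t_{2,s}) = \alpha\, w_q(t_{1,s}).$$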

Let us denote the total weight of the quartets of $t_1$ by $w_T$, and the total weight of the quartets shared by both trees by $w_S$. It can be shown that if $w_S / w_T = x$, then the wQfit score is

$$\mathrm{wQfit}_t(x) = \frac{3x - 1}{x + 1}. \tag{5}$$

Alternatively, if $\mathrm{wQfit}_t = y$ is known then

$$\frac{w_S}{w_T} = \frac{y + 1}{3 - y}. \tag{6}$$
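For completeness, Equations (5) and (6) can be obtained from Equation (4) in a few steps, using only the coefficient of Definition 1 (2 for shared quartets, $-1$ for the others). Shared quartets contribute $2 w_S$ in total and the remaining quartets contribute $-(w_T - w_S)$, so:

```latex
\[
\mathrm{wQfit}_t
 \;=\; \frac{2w_S - (w_T - w_S)}{2w_S + (w_T - w_S)}
 \;=\; \frac{3w_S - w_T}{w_S + w_T}
 \;=\; \frac{3x - 1}{x + 1},
 \qquad x = \frac{w_S}{w_T}.
\]
\[
y = \frac{3x-1}{x+1}
 \;\Longrightarrow\; y(x+1) = 3x - 1
 \;\Longrightarrow\; y + 1 = x\,(3 - y)
 \;\Longrightarrow\; x = \frac{y+1}{3-y}.
\]
```

As a sanity check, $x = 1$ gives a score of 1, $x = 1/3$ (the random expectation) gives 0, and $x = 0$ gives $-1$.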

As with the wQfit score for two weighted trees, it can also be shown that $|\mathrm{wQfit}(t_1, t_2)| \le 1$ for any two trees and that $E[\mathrm{wQfit}(t_1, t_2)] = 0$ whenever one of the trees is a result of a random permutation on the leaves of the other.

The Simulation Procedure

The quartet algorithm used throughout the simulation study is the new wQMC. We used the r8s (Sanderson 2003) software (version 1.71) to produce a random, edge-weighted, model tree over $n$ taxa. Subsequently, we drew four-taxa sets randomly and found their quartet topologies as induced by the model tree. Quartet weights were assigned according to Equation (1) (see section Computing the Quartet’s Weight). These quartets were used as input to wQMC (or another phylogenetic method if a comparison to another method was conducted). The (unweighted) tree constructed by wQMC from that input was compared with the originating (weighted) model tree.

There are several approaches to measuring similarity between phylogenies. Apart from wQfit, which was described above, we chose the following common measures: 1) RF Symmetric Difference (Robinson and Foulds 1981), which counts the number of different splits between two trees (calculated using Phylip; Felsenstein 1989). We used a variant measuring similarity instead of difference; 2) Maximum Agreement Subtree (MAST), which finds the largest subset of the taxa set under which both trees are the same (used only in the supplementary text available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57); and 3) Quartet Fit, which calculates the number of identically induced quartet trees (out of the total number of induced quartets). Full details of the above measures appear in the supplementary text available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57.

RESULTS

We provide here the main results of the various studies as described in the Methods section. Full details are given in the Supplementary text, and the raw data used are given in the Supplementary Material, both available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57.

Performance of the wQMC Algorithm

One of our goals was to measure the dependence of wQMC’s quality of reconstruction on the size of the taxa set $n$. The number of input quartets, denoted here #qrt, was $n^k$ where $k$ was a parameter. The values of $k$ ranged between 1.2 and 3.0 in increments of 0.2. For smaller $k$’s, it is very likely that some taxa will be absent from all quartets and hence also from the reconstructed tree. The maximal value of $k$ was chosen such that it is empirically the minimal value providing accurate (100%) reconstruction under all tree similarity measures and non-noisy input (see Supplementary text at http://dx.doi.org/10.5061/dryad.r9k57). In our simulations, we denoted the parameter $k$ as qrt-num-factor.
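To give a sense of scale, the following sketch tabulates #qrt $= n^k$ for $n = 200$ over the stated range of $k$ and compares it with the number of all four-taxa sets. Rounding $n^k$ to the nearest integer is an assumption of this sketch, and the function name `num_input_quartets` is hypothetical.

```python
import math


def num_input_quartets(n: int, k: float) -> int:
    """Number of input quartets for taxa-set size n and qrt-num-factor k (rounded n**k)."""
    return round(n ** k)


# qrt-num-factor values 1.2, 1.4, ..., 3.0 as described in the text.
QRT_NUM_FACTORS = [round(1.2 + 0.2 * i, 1) for i in range(10)]

n = 200
all_four_sets = math.comb(n, 4)  # number of distinct four-taxa subsets of n taxa
for k in QRT_NUM_FACTORS:
    count = num_input_quartets(n, k)
    print(f"k={k:.1f}  #qrt={count:>9,}  share of all four-taxa sets={count / all_four_sets:.4%}")
```

Even at the largest factor, $200^3 = 8{,}000{,}000$ quartets is only a fraction of the $\binom{200}{4} = 64{,}684{,}950$ possible four-taxa sets.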

We first set out to find how wQMC copes with non-noisy and noisy data. We introduced noise to the quartet data sets by rewiring a fraction of the input quartets, making some quartets inconsistent with the original tree. When a quartet was chosen to be rewired, we randomly and uniformly chose one of its two incorrect topologies, and replaced the original (correct) topology with it. We gave higher rewiring probabilities to lightweight quartets, assuming that light weights reflect low reliability. Denoting by $w_T$ and $w_R$ the total weight of the input quartets and the total weight of the rewired quartets, respectively, the rewiring mechanism that we devised ensured that the ratio between $w_R$ and $w_T$ was close to rewire, which is a parameter defined by the user. Therefore, when the weight of the satisfied input quartets was at least 1−rewire of the weight of the entire input quartets, we denoted it as optimal reconstruction. The details of the rewiring mechanism appear in the Supplementary text available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57.
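One possible way to realize such a rewiring step is sketched below. It is not the mechanism described in the Supplementary text: choosing quartets with probability inversely proportional to their weight, and skipping a quartet whose weight would overshoot the target, are assumptions made for this illustration, and the names `rewire_quartets` and `wrong_topologies` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Quartet = Tuple[str, str, str, str]  # (a, b, c, d) encodes ab|cd
Weighted = Tuple[Quartet, float]


def wrong_topologies(q: Quartet) -> List[Quartet]:
    """The two topologies over the same taxa that differ from ab|cd."""
    a, b, c, d = q
    return [(a, c, b, d), (a, d, b, c)]


def rewire_quartets(quartets: List[Weighted], rewire: float,
                    rng: Optional[random.Random] = None) -> Tuple[List[Weighted], float]:
    """Return (noisy quartets, rewired weight w_R) with w_R / w_T at most `rewire`."""
    rng = rng or random.Random()
    target = rewire * sum(w for _, w in quartets)
    result = list(quartets)
    candidates = [i for i, (_, w) in enumerate(quartets) if w > 0]
    rewired_weight = 0.0
    while candidates and rewired_weight < target:
        # Assumption: lighter quartets are proportionally more likely to be picked.
        inverse = [1.0 / quartets[i][1] for i in candidates]
        pos = rng.choices(range(len(candidates)), weights=inverse, k=1)[0]
        i = candidates.pop(pos)
        q, w = quartets[i]
        if rewired_weight + w > target:
            continue  # would overshoot the target weight; leave this quartet intact
        result[i] = (rng.choice(wrong_topologies(q)), w)  # uniform over the two wrong topologies
        rewired_weight += w
    return result, rewired_weight


# Hypothetical input: twenty quartets with weights 0.1, 0.2, ..., 2.0.
demo = [(("a", "b", "c", "d"), 0.1 * (i + 1)) for i in range(20)]
noisy, w_r = rewire_quartets(demo, rewire=0.3, rng=random.Random(0))
print(w_r / sum(w for _, w in demo))
```

The loop stops once the rewired weight reaches the target or no candidates remain, so the realized ratio $w_R / w_T$ approaches the requested value from below.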

We generated quartet sets with 30% and 60% quartet rewiring using the rewiring scheme described above. Based on our experience showing that wQfit scores depend solely on qrt-num-factor (see Supplementary text available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57), we set a constant number of species ($n = 200$) and only varied qrt-num-factor (from 1.2 to 3.0, as before). Trees were constructed using wQMC and similarity to the original trees was measured using RF and wQfit.

FIGURE 3. Quality of Tree Reconstruction by wQMC. We used wQfit and RF as tree similarity measures, and used a constant $n = 200$. The rewire=0% curve corresponds to consistent quartet sets. The rewire=30% and rewire=60% curves correspond to noisy data sets. We see that both measures, especially wQfit, increase as the reconstructed tree approaches the model tree (an indication of accuracy of reconstruction), even when the data are noisy.

The results show that wQMC can reconstruct a tree that is highly similar to the original, even when receiving noisy input. We see (Fig. 3) that when rewire=30%, it reconstructs a tree where most of the quartets are consistent with the original tree and wQfit is close to 100%. Even when rewire=60%, wQMC succeeds in building a tree with wQfit of roughly 80% (when the input is dense enough), which, according to Equation (6), is compatible with satisfying roughly 82% of the total weight of the quartets on the model tree. This result emphasizes an important property of the quartet approach, and of the wQMC algorithm in particular: even under very noisy input, the weak signal may suffice for recovering a highly accurate tree.

Figure 4 shows the wQfit between the reconstructed tree and the input quartets, hence it expresses the weight of the input quartets satisfied by the reconstructed tree. We see that when the input is not noisy (rewire=0), wQfit converges rapidly to 100%. However, when some of the quartets are rewired, wQfit decreases according to the fraction of the rewired quartets (thus, rewire=30% leads to wQfit=75% and rewire=60% leads to wQfit=30%). According to Equation (6), wQfit=75% is compatible with satisfying ∼78% of the weight of the input quartets, and wQfit=30% is compatible with ∼48%. These values, 78% and 48%, are higher than 1−rewire (70% and 40%, respectively), an indication of optimal reconstruction.
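The conversions quoted above can be checked numerically with Equations (5) and (6). The function names are hypothetical; the wQfit values 75% and 30% are the ones reported in the text.

```python
def wqfit_from_weight_fraction(x: float) -> float:
    """Equation (5): wQfit score when a fraction x of the total weight is shared."""
    return (3.0 * x - 1.0) / (x + 1.0)


def weight_fraction_from_wqfit(y: float) -> float:
    """Equation (6): fraction of the total weight that is shared, given wQfit score y."""
    return (y + 1.0) / (3.0 - y)


for rewire, y in ((0.30, 0.75), (0.60, 0.30)):
    x = weight_fraction_from_wqfit(y)
    print(f"rewire={rewire:.0%}: wQfit={y:.0%} -> satisfied weight {x:.1%}, "
          f"compared with 1-rewire={1 - rewire:.0%}")
```

Both satisfied-weight fractions exceed the corresponding 1−rewire thresholds, which is the criterion for optimal reconstruction used in the text.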

The high scores of wQfit for sparse noisy inputs are probably due to the fact that when the input is not dense, wQMC succeeds in finding a tree that can satisfy more quartets than the original one.

The robustness of wQMC to noise becomes evident when we integrate the data from Figures 3 and 4. Indeed, satisfying 78% of the weight of the input quartets (rewire=30%) allows for an almost perfect (100%) tree reconstruction, whereas satisfying 48% of the weight of the input (rewire=60%) suffices for constructing a tree satisfying 82% of the total weight of the model tree quartets.

FIGURE 4. Satisfying the Input Quartets by wQMC. We used wQfit as a similarity measure and $n = 200$. We used different levels of noise: 0%, 30%, and 60% quartet rewirings. We see that the wQfit scores stabilize at 100%, 75%, and 30%, respectively. Our in-depth analysis shows that these values correspond to satisfying at least 1−rewire of the total weight of the input quartets. Hence, optimal reconstruction is achieved in all three cases.

Comparison Between Qfit and wQfit

Qfit and wQfit are tree similarity measures quantifying quartet agreement between trees. Qfit expresses the number of quartets that are equal in both trees, whereas wQfit also considers the quartets’ weights. The quartet’s weight is correlated with the quartet’s quality and has an effect when input data are noisy and a prioritization policy needs to be employed. For illustration, suppose 30% of the quartets disagree with the constructed tree. We expect this fraction to be mainly composed of unreliable, low-weight quartets. Their share of the total weight should then be smaller, say only 10%. Hence, although the Qfit score corresponds to 70% (the fraction of correct quartets), the weight of correct quartets is 90% (the exact wQfit score can be calculated using Equation (5)), reflecting the low level of confidence in the wrong quartets.
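Carrying out the calculation for this illustration, with 90% of the total weight on correct quartets, Equation (5) gives:

```latex
\[
\mathrm{wQfit}_t(0.9) \;=\; \frac{3\cdot 0.9 - 1}{0.9 + 1} \;=\; \frac{1.7}{1.9} \;\approx\; 0.895 .
\]
```

For comparison, applying the same formula to the unweighted fraction 0.7 gives $(3\cdot 0.7 - 1)/(0.7 + 1) = 1.1/1.7 \approx 0.647$, so in this hypothetical case the weighted score is markedly higher than its unweighted counterpart.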

Therefore, the goal of this experiment was to show that heavier quartets have greater effect on the reconstruction, as reflected in the tree similarity score. The procedure we pursued was as follows: Quartets were generated from the model tree with weights assigned as described above. We applied our rewiring scheme to the generated quartets, and the resulting (postrewiring) quartets were used as input for wQMC. We expect the difference between Qfit and wQfit to increase with the rewire ratio, because wQMC prioritizes heavier quartets, and because heavy quartets contribute more to the wQfit score (whereas in Qfit all quartets are treated equally).

We ran wQMC on taxa sets with $n = 200$ having both non-noisy and noisy quartet sets. We used the same Equation (4) to compute Qfit and wQfit, where in Qfit we ignored the quartets’ weights. As expected, we see (Fig. 5) a difference between Qfit and wQfit when we have noisy data. We note that, when using wQMC, heavy quartets are also more likely to be satisfied by the constructed tree, and this is supported by comparing Qfit with wQfit: When rewire=60%, for instance, the Qfit score reaches roughly 40% whereas the wQfit score reaches roughly 80%. From Equation (6) we deduce that although about 46% of the quartets in the model tree disagree with the reconstructed tree, their combined weight is only about 18% of the total weight of the tree’s quartets. This shows that the violated quartets are lighter than average. Therefore, the new measure adds information to the score by segregating quartets according to quality. We see this as an important characteristic of the new proposed measure.

FIGURE 5. Comparing the Measures of Qfit and wQfit Between the wQMC Tree and the Model Tree. We used compatible and noisy quartet input sets, $n = 200$. We see that Qfit and wQfit have similar results for the non-noisy quartets, but wQfit scores are increasingly higher than Qfit scores as the quartet input becomes noisier.

Comparison Between QMC and wQMC

Both the QMC and the wQMC algorithms build trees from quartets; the difference between them is that wQMC also supports weighted quartets. Naturally, weights reflect confidence in quartet data, implying an advantage of wQMC over QMC when data are noisy and lightweight quartets are more prone to exhibit a wrong topology.

We ran QMC and wQMC on correct, 30% noisy, and 60% noisy sets of quartets for tree size $n = 200$, where weights were assigned according to Equation (1). The results are shown in Figure 6. For consistent data, we see that both algorithms exhibit similar results for both the wQfit and the RF measures. This is expected as all quartets have the correct topology. However, when the data are noisy, wQMC converges to its highest scores faster than QMC, and wQMC’s highest scores are slightly better than QMC’s. Moreover, when applied to the same input, wQMC consistently outperforms QMC (at times, by as much as 20%). Thus, although a tree constructed using QMC may be worse than random (i.e., with a negative wQfit score), a tree constructed using wQMC based on the same input may be quite informative (see other examples of this phenomenon in the supplementary text available on Dryad at http://dx.doi.org/10.5061/dryad.r9k57). QMC’s scores are relatively high as well, as much of the phylogenetic signal is encompassed in the quartet’s topology itself, allowing QMC to cope successfully with noisy data. The weights provide more information, allowing wQMC to prioritize correct quartets. We note that the advantage of the weighted approach is dependent on the weighting scheme. Therefore, the encouraging results we present here give hope that future weighting schemes will improve the reconstructions of phylogenies even further.

Reconstructing the Prokaryotic Tree of Life via Quartet-Based Approaches

Quartet amalgamation has recently become an essential component in prokaryotic evolutionary studies, where HGT is a widespread phenomenon. Zhaxybayeva et al. (2006) developed a new method, called embedded quartets, for exploring phylogenetic signals consistent with the plurality of genes over a group of organisms. In this method, a separate tree is constructed for every single gene in the set under study. Next, every such gene tree Ti is decomposed into its induced quartets, the embedded quartets, which we denote by Q(Ti). Finally, these embedded quartets are used to reconstruct the evolutionary tree.
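The decomposition of a gene tree into its embedded quartets can be illustrated with a minimal Python sketch. The nested-tuple tree encoding and the function names `clades` and `embedded_quartets` are assumptions made for illustration only; they are not taken from Zhaxybayeva et al. (2006) or from the wQMC software. Branch lengths and weights are ignored, and a 4-taxon subset that the tree leaves unresolved contributes no quartet.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of clades) for a nested-tuple tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade rooted at this internal node
    return leaves, found

def embedded_quartets(tree):
    """Return the set of quartet topologies induced by the tree.

    A topology ab|cd is stored as a frozenset of two frozensets, so it
    does not depend on the order of the pairs or of the taxa in a pair.
    """
    leaves, found = clades(tree)
    quartets = set()
    for four in combinations(sorted(leaves), 4):
        four_set = frozenset(four)
        for clade in found:
            inside = clade & four_set
            # A clade holding exactly two of the four taxa separates
            # them from the other two, which fixes the topology.
            if len(inside) == 2:
                quartets.add(frozenset([inside, four_set - inside]))
                break
    return quartets

# Hypothetical five-taxon gene tree, written as nested tuples.
example_tree = ((("a", "b"), "c"), ("d", "e"))
example_quartets = embedded_quartets(example_tree)
```

For this five-leaf binary tree the function returns five quartets, one per 4-taxon subset. The sketch enumerates all 4-taxon subsets, which is practical for small taxon sets such as the 11 cyanobacteria discussed here but would need a more careful implementation for large trees.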

FIGURE 6. QMC versus wQMC Performance. We used n=200 for all runs. For both wQfit and RF, scores are similar for the non-noisy quartet data, but wQMC has slightly better results for the similarity measure in question when the input is noisy.

In this part, we set out to extend the results of Zhaxybayeva et al. (2006) by using wQMC on weighted embedded quartets. After proposing the idea of embedded quartets, Zhaxybayeva et al. (2006) suggested combining the union (with quartet multiplicity considered) of the embedded quartets from all gene trees into a single unifying supertree. They demonstrated their method on a set of 11 cyanobacteria for which a clear phylogenetic signal is not evident (Yap et al. 1999). The set of genes analyzed consisted of 1128 genes that are shared by at least 9 out of the 11 cyanobacterial genomes in the study. The ground set of quartets was all the quartets induced by the 1128 gene trees. In Zhaxybayeva et al. (2006), quartet confidence was deduced by maximum likelihood (ML) using Quartet Puzzling (Strimmer and von Haeseler 1996), and quartets with low confidence were excluded from the input. The remaining 214,729 quartets were used as input for Matrix Representation with Parsimony (MRP; Baum 1992; Ragan 1992) as the supertree method.
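Taking the union of embedded quartets "with quartet multiplicity considered" can be sketched as a simple count of how many gene trees induce each quartet topology. The helper names `quartet` and `pool_quartets` are hypothetical, and the per-gene quartet lists are toy data. Reading the counts as weights is only an illustrative assumption; it is not the weighting scheme of Equation (1), nor the confidence filtering used by Zhaxybayeva et al. (2006).

```python
from collections import Counter

def quartet(a, b, c, d):
    """Hypothetical helper: the topology ab|cd as an order-free key."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def pool_quartets(per_gene_quartets):
    """Count, for every quartet topology, how many gene trees induce it."""
    counts = Counter()
    for quartets in per_gene_quartets:
        # Each gene tree contributes at most once per topology.
        counts.update(set(quartets))
    return counts

# Toy input: three gene trees over the taxa a, b, c, d.
# Two of them support ab|cd and one supports ac|bd.
gene_tree_quartets = [
    [quartet("a", "b", "c", "d")],
    [quartet("b", "a", "d", "c")],  # same topology, different order
    [quartet("a", "c", "b", "d")],
]
pooled = pool_quartets(gene_tree_quartets)
```

Conflicting topologies on the same four taxa are kept side by side with their counts, so a downstream amalgamation method, rather than this pooling step, resolves the conflict.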

Here we took the whole raw set of quartets and used it as a weighted input for wQMC. The tree produced by wQMC (Fig. 7) is identical to the tree constructed using MRP in Zhaxybayeva et al. (2006). Because it was obtained independently, using a different supertree method coupled with the genuine (unfiltered) set of weighted quartets, we view this result as important and as confirming the MRP tree of Zhaxybayeva et al. (2006).

FIGURE 7. The 11 Species Cyanobacterial Tree Resulting from the Embedded Quartets

SUMMARY AND CONCLUSION

In this work we focused on weighted quartets. We first showed that augmenting quartet inputs with additional information such as reliability may boost the accuracy of the resulting trees. This improvement is a function of the weighting scheme employed. Next, we extended the quartet fit measure to handle weighted quartets and trees. The new wQfit measure (where random matching receives score zero) is more informative than a simple count of agreeing quartets. We also showed the advantage in performance of this measure over widely accepted tree similarity measures. Accordingly, we have extended the quartet MaxCut algorithm (Snir and Rao 2012) to handle weighted quartet inputs and implemented it in software. Finally, we conducted an extensive simulation study to establish empirically the ideas explored in this work. Our results show that augmenting a quartet set with weights can turn the set from uninformative to informative, in the sense that a weight-oblivious algorithm such as QMC produces a tree with a score that is worse than random, whereas wQMC obtains a significant score.

The contribution of this work is severalfold. Admittedly, traditional quartet-based phylogenetic reconstruction has not been shown to outperform the commonly used sequence-based methods such as ML, in particular in light of technological advancements extending the viability of ML to a few hundred species. However, several recent works have shown that in the presence of conflicting evolutionary signals, quartet amalgamation can reconcile these conflicts, resulting in a more refined tree than the simple conservative consensus approach (Chifman and Kubatko 2014). This has wide potential in topical tasks such as reconciling contradictory gene trees, as was shown above and in Zhaxybayeva et al. (2006), or large-scale sequence-based reconstruction (Swenson et al. 2012).

There are also several theoretical implications to this work. The new measure we defined here, wQfit, is richer in the sense that it conveys more information than other measures usually used. However, it requires a deeper understanding than we provide here. Exploring its properties is left as further research.

SUPPLEMENTARY MATERIAL

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.r9k57.

FUNDING

The work of R.C. and E.A. was supported by the ISF, the Israeli Science Foundation. E.A. was also supported by The Graduate Studies Authority, University of Haifa, as part of his Ph.D. studies, and the Caesarea Edmond Benjamin de Rothschild Foundation.

ACKNOWLEDGMENTS

The authors wish to thank the EiC Frank Anderson, AE Tiffani Williams, and two anonymous reviewers for their invaluable comments and suggestions that greatly improved the quality of the article. S.S. and R.C. conceived and designed the model and experiments; R.C. and E.A. performed the experiments; E.A., R.C., and S.S. analyzed the data; S.S., E.A., and R.C. wrote the article.

REFERENCES

Alon N., Snir S., Yuster R. 2014. On the compatibility of quartet trees. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, p. 535–545.

Bandelt H.J., Dress A. 1986. Reconstructing the shape of a tree from observed dissimilarity data. Adv. Appl. Math. 7:309–343.

Baum B.R. 1992. Combining trees as a way of combining data sets for phylogenetic inference. Taxon. 41:3–10.

Berry V., Gascuel O. 2001. Inferring evolutionary trees with strong combinatorial evidence. Theor. Comput. Sci. 240:271–298.

Bryant D., Steel M.A. 2001. Constructing optimal trees from quartets. J. Algorithms. 38:237–259.

Buneman P. 1971. The recovery of trees from measures of dissimilarity. In: Hodson F.R., Kendall D.G., Tautu P., editors. Anglo-Romanian Conference on Mathematics in the Archaeological and Historical Sciences. Mamaia, Romania: Edinburgh University Press. p. 387–395.

Chifman J., Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics. doi:10.1093/bioinformatics/btu530.

Chor B., Snir S. 2004. Molecular clock fork phylogenies: Closed form analytic maximum likelihood solutions. Syst. Biol. 53:963–967.

Chor B., Hendy M., Holland B., Penny D. 2000. Multiple maxima of likelihood in phylogenetic trees: An analytic approach. Mol. Biol. Evol. 17:1529–1541. Earlier version appeared in RECOMB 2000.

Chor B., Khetan A., Snir S. 2006. Maximum likelihood molecular clock comb: Analytic solutions. J. Comput. Biol. Earlier version appeared in RECOMB 2003.

Daskalakis C., Mossel E., Roch S. 2011. Phylogenies without branch bounds: Contracting the short, pruning the deep. SIAM J. Discrete Math. 25:872–893.

Doolittle W.F. 1999. Phylogenetic classification and the universal tree. Science. 284:2124–9.

Erdös P., Steel M., Szekely L., Warnow T. 1999. A few logs suffice to build (almost) all trees (i). Random Struct. Algor. 14:153–184.

Estabrook G.F. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Biol. 34:193–200.

Felsenstein J. 1989. PHYLIP - phylogenetic inference package, (version 3.2). Cladistics 5:164–166.

Garey M.R., Johnson D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company.

Gronau I., Moran S., Snir S. 2008. Fast and reliable reconstruction of phylogenetic trees with very short branches. In: ACM-SIAM Symposium on Discrete Algorithms (SODA), p. 379–388.

Holland B.R., Jarvis P.D., Sumner J.G. 2013. Low-parameter phylogenetic inference under the general Markov model. Syst. Biol. 62:78–92.

Jiang T., Kearney P.E., Li M. 2000. A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J. Comput. 30:1942–1961.

Jin G., Nakhleh L., Snir S., Tuller T. 2007. Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Mol. Biol. Evol. 24:324–337.

Koonin E.V., Makarova K.S., Aravind L. 2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol. 55:709–742.

Ochman H., Lawrence J.G., Groisman E.A. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature. 405:299–304.

Ragan M.A. 1992. Matrix representation in reconstructing phylogenetic-relationships among the eukaryotes. Biosystems. 28:47–55.

Ranwez V., Gascuel O. 2001. Quartet-based phylogenetic inference: Improvements and limits. Mol. Biol. Evol. 18:1103–1116.

Robinson D.R., Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.

Roch S., Snir S. 2012. Recovering the tree-like trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis. In: RECOMB, p. 224–238.

Sanderson M.J. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics. 19:301–302. doi:10.1093/bioinformatics/19.2.301. URL http://bioinformatics.oxfordjournals.org/content/19/2/301.abstract. Available at http://ginger.ucdavis.edu/r8s/

Snir S., Rao S. 2006. Using max cut to enhance rooted trees consistency. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 3:323–333. Preliminary version appeared in WABI 2005.

Snir S., Rao S. 2010. Quartets maxcut: A divide and conquer quartets algorithm. Trans. Comput. Biol. Bioinform. 7:714–718.

Snir S., Rao S. 2012. Quartet maxcut: A fast algorithm for amalgamating quartet trees. Mol. Phylogenet. Evol. 62:1–8.

Snir S., Yuster R. 2012. Reconstructing approximate phylogenetic trees from quartet samples. SIAM J. Comput. 41:1466–1480. Preliminary version appeared in SODA 2010.

Snir S., Warnow T., Rao S. 2008. Short quartet puzzling: A new quartet-based phylogeny reconstruction algorithm. J. Comput. Biol. 1:91–103.

Steel M. 1992. The complexity of reconstructing trees from qualitative characters and subtrees. J. Classif. 9:91–116.

Strimmer K., von Haeseler A. 1996. Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969. Available from URL ftp://ftp.ebi.ac.uk/pub/software/unix/puzzle/.

Strimmer K., Goldman N., von Haeseler A. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211.

Swenson M.S., Suri R., Linder C.R., Warnow T. 2011. An experimental study of quartets maxcut and other supertree methods. Algorithms Mol. Biol. 6:7.

Swenson M.S., Suri R., Linder C.R., Warnow T. 2012. Superfine: fast and accurate supertree estimation. Syst. Biol. 61:214–227. doi:10.1093/sysbio/syr092.

Swofford D.L. 1998. PAUP*beta, Sinauer, Sunderland, Mass.

Yap W.H., Zhang Z., Wang Y. 1999. Distinct types of rRNA operons exist in the genome of the actinomycete Thermomonospora chromogena and evidence for horizontal transfer of an entire rRNA operon. J. Bacteriol. 181:5201–5209.

Zhaxybayeva O., Gogarten J.P., Charlebois R.L., Doolittle W.F., Papke R.T. 2006. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res. 16:1099–1108.

