orthology inference in nonmodel organisms using ...nov 20, 2013 · of tree-based orthology...

Article

Orthology Inference in Nonmodel Organisms UsingTranscriptomes and Low-Coverage Genomes ImprovingAccuracy and Matrix Occupancy for PhylogenomicsYa Yang1 and Stephen A Smith1

1Department of Ecology amp Evolutionary Biology University of Michigan Ann Arbor

Corresponding author E-mail yangyaumichedu eebsmithumichedu

Associate editor Todd Oakley

Abstract

Orthology inference is central to phylogenomic analyses Phylogenomic data sets commonly include transcriptomes andlow-coverage genomes that are incomplete and contain errors and isoforms These properties can severely violate theunderlying assumptions of orthology inference with existing heuristics We present a procedure that uses phylogenies forboth homology and orthology assignment The procedure first uses similarity scores to infer putative homologs that arethen aligned constructed into phylogenies and pruned of spurious branches caused by deep paralogs misassemblyframeshifts or recombination These final homologs are then used to identify orthologs We explore four alternative tree-based orthology inference approaches of which two are new These accommodate gene and genome duplications as wellas gene tree discordance We demonstrate these methods in three published data sets including the grape familyHymenoptera and millipedes with divergence times ranging from approximately 100 to over 400 Ma The proceduresignificantly increased the completeness and accuracy of the inferred homologs and orthologs We also found that datasets that are more recently diverged andor include more high-coverage genomes had more complete sets of orthologsTo explicitly evaluate sources of conflicting phylogenetic signals we applied serial jackknife analyses of gene regionskeeping each locus intact The methods described here can scale to over 100 taxa They have been implemented in pythonwith independent scripts for each step making it easy to modify or incorporate them into existing pipelines All scriptsare available from httpsbitbucketorgyangyaphylogenomic_dataset_construction

Key words Diplopoda phylotranscriptomics RNA-seq Vitaceae

IntroductionOrthology is a phylogenetic concept as orthologous genes aredefined as those genes that have descended from an ancestralsequence of their common ancestor through speciation(Fitch 1970 2000) Accurate orthology inference is criticalfor phylogenomic reconstruction and functional studiesHowever this inference is especially challenging for datasets using transcriptomes or low-coverage genomes thatoften contain misassemblies and partial or missing sequencesThe complexities of these data types also make it difficult todistinguish recently duplicated copies from allelic variationssplice variants and misassemblies

A number of orthology inference methods have been ap-plied to phylogenomic analyses based on transcriptomes andlow-coverage genomes such as orthoMCL (Li et al 2003)Hcluster_sg (as part of TreeFam Li et al 2006) SCaFoS(Roure et al 2007) HaMStR (Ebersberger et al 2009) andOrthoSelect (Schreiber et al 2009) Emerging tools such asOMA-GETHOGs (Roth et al 2008 Altenhoff et al 2013) andAgalma (Dunn et al 2013) have also attracted interest in theirphylogenomic applications Among them HaMStR is by farthe most widely used HaMStR is based on a modified recip-rocal similarity criterion that starts with querying a setof precomputed high-quality orthologs (ldquocore-orthologsrdquo)

against candidate sequences (Ebersberger et al 2009)The resulting significant hits are then queried against allgenes in the reference taxon HaMStR only adds the candidateto the ortholog group if the best hit in the reference taxon isalso member of the same ortholog group (Ebersberger et al2009) Considering that incomplete sequences gene andgenome duplication and molecular rate heterogeneity arealmost certainly present in most data sets the reciprocal cri-terion is frequently violated A number of other alternativeorthology inference pipelines also suffer from using similaritymeasurements as approximations to directly infer orthology(Li et al 2003 Roure et al 2007 Schreiber et al 2009 Altenhoffet al 2011 2013)

Given the incomplete and noisy nature of transcriptomicand low-coverage genomic data orthology is best inferred byusing phylogenies to separate paralogs and orthologs afterhomology has been established (Gabaldon 2008) A varietyof tree-based orthology inference methods have been devel-oped However with a few exceptions most of these tree-based methods require a known species tree This is oftenundesirable as many of these data were generated for thepurpose of estimating an unknown species tree PHYLDOG(Boussau et al 2013) estimates gene trees and the species treesimultaneously taking duplications and gene loss into

The Author 2014 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited Open AccessMol Biol Evol 31(11)3081ndash3092 doi101093molbevmsu245 Advance Access publication August 25 2014 3081

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract311130812925722 by University of M

innesota Duluth user on 10 Septem

ber 2018

account However it was designed for genomic data Besidespotential scaling issues with such an approach as data setsgrow transcriptomes may lack a particular gene due tosilencing or low expression and coverage Taxa with lowgene coverages tend to be grouped together due to sharedldquogene lossrdquo (Boussau et al 2013)

An alternative strategy adopted by Agalma (Dunn et al2013) and recent implementations of OrthologID (Chiu et al2006) consists of two stages Obtaining homologs and sepa-rating orthologs from paralogs Both pipelines infer homologsusing an all-by-all BLAST search (Altschul et al 1990) followedby Markov clustering (MCL) that identifies sequence clustersbased on the relative connectivity (presenceabsence of hits)and relative strength of connections (E values from BLASThits) among sequences (van Dongen 2000) A phylogenetictree is then inferred for each homolog To obtain orthologsthe two pipelines use different approaches Agalma takes onlythe homolog tree topology into account It looks for thesubtree that has the highest number of nonrepeating taxacuts it off as an ortholog and repeats the search and cuttingon the remaining tree (Dunn et al 2013) This approach hasthe advantage of being relatively assumption free Howeverwhen there are genome duplications it breaks orthologs intofragments This is especially problematic when there are mul-tiple nested genome duplications as is frequently seen inplants (De Smet et al 2013) On the other hand orthologIDconsiders both homolog tree topology and the homologsequence alignment using a partial guide tree determinedfrom taxa with genome sequences available (Chiu et al2006) It is able to accommodate gene and genome duplica-tions yet it is limited by the availability of annotated genomesrequired to build the guide tree for each ortholog groupMost areas of the tree of life still lack reasonable whole-genome sequence coverage Both Agalma and orthologIDimprove enormously on previous methods by taking genetree into account (Chiu et al 2006 Dunn et al 2013)However there is the great potential for additional compo-nents and methods that would allow for higher flexibilityand broader applications

Here we outline a flexible orthology inference procedurebased on identifying homologs followed by cleaning aligningand cutting homolog trees We demonstrate this approachand compare different methods for cutting homolog treesin three recently published phylogenomic data sets acrossdiverse taxonomic groups and ages The grape family orVitaceae consists of approximately 900 species with a stemage of approximately 95 Ma (Wen et al 2013) The grape dataset (GRP) (Wen et al 2013) includes 15 transcriptomes and aproteome from the grape genome annotation The millipedes(class Diplopoda) are an ancient and diverse group withfossils dating back to 428 Ma (Brewer and Bond 2013)The millipedes data set (MIL) (Brewer and Bond 2013) in-cludes nine transcriptomes one expressed sequence tag (EST)data set and two non-millipedes proteomes from genomeannotation The aculeate Hymenoptera includes ants beesand wasps and has a crown age of approximately 150 Ma(Wilson et al 2013) The aculeate Hymenoptera data set(MIL) (Johnson et al 2013) includes 18 ingroup data sets

(11 transcriptomes 1 low-coverage genome 6 annotated ge-nomes) and one outgroup from annotated genome

New Approaches

Our orthology inference approach is tree-based does not relyon a known species tree and is capable of accommodatinggenome duplications and different outgroup scenarios It dif-fers from previous published tree-based and species tree-in-dependent orthology inference methods in a number of waysWe take phylogenetic trees into account in both homologinference as well as ortholog inference We explore four alter-native strategies for obtaining orthologs from cutting homo-log trees two of which are newly proposed to explicitlyaccommodate gene duplications To evaluate conflictingphylogenetic signals we use a jackknife strategy with multipleresampling ratios This strategy resamples by locus not by siteand therefore keeps each locus intact and explicitly evaluatesconflicting phylogenetic signals among loci Finally given theever-changing landscape of sequence processing alignmentand tree inference methods each step of our procedureis written in separate python scripts with a lightweightphylogenetic tree library which allows for steps to be easilymodified swapped and moved between computer clustersand desktop machines

Results and Discussion

Homology Inference Using Clusters and Trees

Our homology inference method starts with an all-by-allBLAST followed by clustering filtered BLAST hits using MCL(van Dongen 2000) For each cluster we align into a multiplesequence alignment infer a phylogenetic tree using maxi-mum likelihood cut deep paralogs and remove aberrantand redundant tips (fig 1)

A number of methods can be used for conducting theinitial homology search For the analyses presented hereinitial all-by-all homology searches were conducted usingpeptide sequences against a peptide database (BLASTP)In addition we also conducted homology search usingcoding sequences (CDS) against a CDS database (BLASTN)in the GRP data set which contains the most recently di-verged group among the three we analyzed here A thirdapproach employed by Agalma (Dunn et al 2013) uses tran-scripts translated by all six frames (TBLASTX) for homologysearch and then simultaneously translate and align CDS usingMACSE (Ranwez et al 2011) This approach is promisingfor improving translation accuracy yet very time consumingand is currently limited to relatively small data sets

Once a BLAST score has been calculated between eachsequence pair there are two general strategies for homologyclustering The simplest method is to filter out BLAST resultsthat do not meet a minimal coverage percentage for eithersequence (ldquohit fractionrdquo Chiu et al 2006 Sanderson andMcMahon 2007) then obtain clusters that are connectedby the remaining hits However this approach is sensitiveto both the value of the minimum hit fraction filter andlarge gene families that have sequences of intermediate com-pleteness that can attract even less complete sequences

3082

Yang and Smith doi101093molbevmsu245 MBED

ownloaded from

httpsacademicoupcom

mbearticle-abstract311130812925722 by U

niversity of Minnesota D

uluth user on 10 September 2018

009 0404

GRP HYM MIL

All-by-all blast

Filter blast results by hit fraction

Markov Clustering

Alignment and tree inference

Separate long branches refine alignments and re-estimate trees (optional)

Cut spurious tips and mask tips to obtain homolog trees

Maximum inclusion(MI)

Rooted ingroups(RT)

Monophyletic outgroups(MO)

One-to-one orthologs(1to1)

Outgroup presentbut non-monophyleticduplicated taxa present

Outgroup presentand monophyleticduplicated taxa present

C

Outgroup absentduplicated taxa present

Outgrouppresent or absentduplicated taxa absent

Ingroups A B COutgroup OInfered orthologs

A

A

B

CO

AB

O

A

B

O

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

AB

C

AB

C

AB

C

AB

C

A

B

CA

B

O

A

B

CA

B

O

A

B

CA

B

O

A

B

Representative trees from clustering using peptidesshowing branch lengths

A

A

B

CO

AB

O

A

A

B

CO

AB

O

A

A

B

CO

AB

O

FIG 1 Flow chart of homology and orthology inferences

3083

Orthology Inference in Nonmodel Organisms doi101093molbevmsu245 MBED

ownloaded from

httpsacademicoupcom




Our experience with using hit fraction alone is that it resultsin ldquosnowballsrdquo of gigantic clusters that are difficult to alignbecause there is only partial overlap between many of thesequences A second approach for homology clustering isto use a clustering algorithm the most popular of which isMCL (van Dongen 2000) MCL is a general clustering algo-rithm that breaks any network of connected nodes into clus-ters by using the presenceabsence of connections and therelative strength of those connections It has the advantage ofbeing extremely fast and efficient with computer memoryHowever the algorithm uses only a single source of data (egE values from BLAST hits) for clustering without taking thehierarchical structure of gene families into account Because Evalues are dependent on data set size and sequence lengths(Altschul et al 1990) the behavior and robustness of MCLhave yet to be evaluated for data sets that include manypartial sequences and sequence isoforms In addition becausethe E values produced by the BLAST algorithms frequentlyreach the lowest and most significant value (10180) amonglarge gene families using MCL alone frequently produces clus-ters that are very large and difficult to align accurately Theselarge clusters are often removed from further analysis in ex-isting pipelines reducing the amount of usable data in thesephylogenomic data sets

To effectively separate gene families of various sizes wedeveloped a multistep procedure (fig 1) Sequence similaritysearch results (here we use BLAST) are filtered using a min-imum coverage fraction (hit fraction here we use at least 04Chiu et al 2006 Sanderson and McMahon 2007) to removehits from conserved motifs and short sequence fragmentsand then clustered with MCL based on the filtered hitsThe purpose of this initial clustering step is simply to formclusters of sizes that can be accurately aligned Thereforevalues of the hit fraction cutoff and the inflation values inMCL are chosen to be as low as possible to ensure a coarseclustering that produces clusters containing less than a fewthousand sequences each

When a cluster contains deep duplications the alignmentwill be poorly aligned and the resulting phylogenetic tree willcontain long branches subtending orthologs especially in rel-atively recent data sets such as GRP and HYM (representativetrees in fig 1 showing trees from initial clusters using pep-tides) These branches often root orthologs at random inter-nodes and interfere with orthology inference One way toremove these deep duplications is to use all-by-all BLASTNusing CDS instead of BLASTP using peptide sequences Thisapproach is effective only in recently diverged groups such asthe grape family and we found that with increasing diver-gence BLASTN is susceptible to Type II error A second ap-proach is using a higher inflation value in MCL (van Dongen2000) By increasing the inflation value the clustering algorismis more sensitive to the contrasts in E values and connectivityamong sequences and tend to produce smaller clusters at therisk of breaking apart homologs at unexpected places A thirdapproach is to cut apart these deep duplications using a set ofbranch length cutoffs that are empirically determined by thedistribution of branch lengths among ingroup taxa In doingso the accuracy of the alignments and the homolog trees are

significantly improved Although one can potentially detectsubclusters that are significantly more distantly relatedamong than within each subcluster given the hierarchicalstructure of gene families cutoffs for subclusters are oftenarbitrary and dependent on the phylogenetic distanceamong the ingroup taxa Therefore here we simply set em-pirical branch length cutoffs to eliminate branches that aremuch older than diversification of orthologs Finally we trimspurious terminal branches that are much longer than sisterbranches that are usually a result of misassembly

De novo assembled transcriptomes often have multipleisoforms for each gene that form monophyletic or paraphy-letic tips on the gene tree For phylogenomic purposesonly the isoform with the highest number of nonambiguouscharacters in the alignment is kept as the representative withthe rest removed This procedure differs from Smith et al(2011) and the ldquomonophyly maskingrdquo step in Agalma(Dunn et al 2013) in that instead of only masking monophy-letic tip duplicates we also mask paraphyletic grades of thesame taxon and we retain the isoform with the highestnumber of aligned characters after trimming instead of keep-ing a random one Alternatively one can keep the isoformwith the shortest distance from its sister taxa or simply arandom isoform However short branches often result fromincomplete sequences and a random isoform can containpoorly aligned sections from misassembly By choosingthe one with the most aligned characters after trimming wemaximize the information retained Another option is to pickeither the longest isoform or the isoform with the highestread coverage from each isoform group (eg Trinity subcom-ponent) However in practice a subcomponent from anassembler (eg Trinity) does not always correspond to agene and its splice variants (Grabherr et al 2011) A previousbenchmark study in a model plant species (Yang and Smith2013) shows that chimeric transcripts exist in around 4 ofTrinity assemblies and picking the longest isoform alone willlikely further increase the percentage of chimeric sequencesPicking the highest covered isoform per subcomponent onthe other hand reduces the percentage of chimera from 4to around 1 at the cost of reducing the total base pairsassembled by around 10 (Yang and Smith 2013) A finalconsideration for picking the representative isoform is alter-native splicing Splice variants with different exon contentwill introduce bias to distance-based orthology inferencemethods whereas tree-based methods are less likely to beaffected

Orthology Inference

We present four alternative orthology inference methods thatmay be used once homolog phylogenies have been inferred(fig 1) The maximum inclusion (MI) method iteratively cutsout the subtree with the highest number of taxa withouttaxon duplication (Dunn et al 2008 2013 Smith et al2011) A second method iteratively searches for the subtreewith the highest number of ingroup taxa cuts it out as arooted tree (RT) and infers gene duplications from root totips When duplicated taxa are found between the two sides

3084


ownloaded from

httpsacademicoupcom




at a bifurcating node the side containing a smaller numberof taxa is cut off The third method looks for clusters withmonophyletic outgroups (MO) roots the tree and infersgene duplication in a similar way as RT Both RT and MOare similar to the tree-pruning method implemented inTreeKO (Marcet-Houben and Gabaldon 2011) in that bothtraverse a rooted tree from root to tips and prune at nodeswith taxon duplications The two differ in that TreeKO con-siders all possible decompositions calculates the pairwise dis-tance between all candidate orthologs from two differenthomologs and chooses one ortholog from each homologthat minimizes pair wise tree distance However given theincompleteness and noise in both transcriptome andlow-coverage genome data using a particular homolog as areference for reconciliation bears the risk of introducing ad-ditional noise Instead we choose the decomposition thatretains the highest number of taxa to maximize final matrixoccupancy Finally we compare these results to only usinghomologs that had no duplicated taxon and are one-to-one(1to1) orthologs

Orthology Inferences from Example Data Sets

To demonstrate the utility of the methods presented here weanalyzed three data sets GRP (Wen et al 2013) MIL(Brewer and Bond 2013) and HYM (Johnson et al 2013)The original authors provided peptides for HYM whereasboth MIL and GRP data were downloaded as raw readsfrom the NCBI Sequence Read Archive (SRA) For the MILdata set our read filtering procedure differed from Brewerand Bond (2013) For details on deviations see Materials andMethods Our quality filter removed 15ndash23 of read pairs

(supplementary table S1 Supplementary Material online) Ofthe remaining read pairs 003ndash082 contained adapters andwere removed The cleaned data sets contained 19ndash40 millionread pairs each 11ndash17 less than Brewer and Bond (2013)For the GRP data set our quality filter removed 16ndash41 ofread pairs (supplementary table S2 Supplementary Materialonline) Of the remaining read pairs 001ndash092 containedadaptors and were removed After filtering 27ndash37 millionread pairs for each taxon were used for de novo assembly

Homology and orthology inference were conducted usingthe methods as described above (for more details seeMaterials and Methods) The resulting ortholog occupancycurves were convex for HYM and GRP (fig 2) indicating ahigh number of orthologs containing high percentage of taxawhereas the almost straight curves for MIL indicate that rel-atively few orthologs have high percentage of taxa The shapeswere determined by the divergence time and the complete-ness of sequences in individual taxon (annotated genome vstranscriptomelow-coverage genomes) whereas the orthol-ogy inference methods shifted the height and the slope ofthe curves

The HYM Data SetWith seven out of 19 taxa from annotated genomes thetaxon occupancy curves of HYM were convex and all had aplateau that contained around 3000ndash4000 orthologs withcomplete or near complete taxon occupancy (fig 2) Onetaxon in HYM Apterogyna ZA01 was from a low-coveragegenome and resulted in the narrow peak above the plateau inall four curves Given that the outgroup Nasonia vitripennishad 12925 genes and the ingroup Apis mellifera had 10570

0 5000 10000 15000 0 5000 10000 15000

0 5000 10000 15000

10

12

14

16

18

0 5000 10000 15000

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

Num

ber o

f tax

a in

eac

h or

thol

og

Orthologs ranked from high to low taxon occupancy

Maximum inclusion (MI) Rooted ingroups (RT)

Monophyletic outgroups (MO) One-to-one orthologs (1to1)

HYM GRPMIL

HYM GRPMIL HYM GRPMIL

MIL HYM GRP

FIG 2 Ortholog taxon occupation ranked from high to low Orthologs with less than eight taxa (six for MILndashRT) were not shown

3085


ownloaded from

httpsacademicoupcom




genes numbers of orthologs with at least eight taxa were high(MI 9128 RT 8251 MO 7665 and 1to1 4937)

Our RT method recovered 7558 orthologs with at leastnine taxa and 4172 orthologs with at least 16 ingroup taxasignificantly more than the 5214 and 3018 respectivelyfrom the Johnson et al (2013) analyses using OrthologID(Chiu et al 2006) Our final supermatrices contained ortho-logs with full taxon sets and at least 100 amino acids (aa)in trimmed alignments (MI 1160 loci 620150 aa RT 1116loci 588895 aa MO 992 loci 525061 aa and 1to1 761 loci369371 aa) all with high amino acid occupancy (890 888892 and 910 respectively) Again these numbers aremuch higher than the 525 orthologs with full taxon occu-pancy recovered using OrthologID (Chiu et al 2006 Johnsonet al 2013)

The GRP Data SetWith one annotated genomes and 15 transcriptomes theGRP curves were also convex and all had a plateau that con-tained around 5000ndash7000 orthologs with full taxon occu-pancy (fig 2) The large number of orthologs with fulltaxon occupancy likely reflects the fact that the GRP dataset is relatively recent with the split between the ingroupsand the outgroup being approximately 95 Ma (Wen et al2013) Compared with the 29971 genes in the Vitis vinifera(GRP) genome approximately half of genes (a third for 1to1)had orthologs with at least eight taxa (MI 15488 RT 14929MO 14016 and 1to1 9149)

We constructed the supermatrices using orthologs thathad the full taxon set and containing at least 300 nt intrimmed alignments (MI 7462 loci 10925506 nt RT 7070loci 10315343 nt MO 6686 loci 9870949 nt and 1to15166 loci 7403388 nt) All four supermatrices had highnucleotide occupancy (896 891 892 and 915respectively) Our ortholog sets with full taxon occupancywere much larger compared with the 417 (before filtering)and 229 (after filtering) orthologs by Wen et al (2013) usingHcluster_sg (Li et al 2006)

The MIL Data SetWith two annotated genomes and ten transcriptomes andingroups dating back to more than 400 Ma (Brewer and Bond

2013) the MIL taxon occupancy curves were almost straight(fig 2) The numbers of orthologs containing at least eighttaxa were MI 3398 MO 2335 and 1to1 2075 whereas RTrecovered 3125 orthologs with at least six ldquoingrouprdquo taxa(millipedes + Lithobius see Materials and Methods) Amongthe four methods MI recovered the highest number of ortho-logs whereas the numbers of orhtologs recovered by both RTand MO were reduced by the high level of phylogenetic un-certainty among deep nodes For the final supermatrices weincluded orthologs that had no more than one taxon miss-ing and each had at least 100 aa in the trimmed alignments(MI 1085 RT 736 MO 739 and 1to1 712) Despite thevariation in numbers all four ortholog sets contained signif-icantly more orthologs compared with the 221 orthologs re-covered using HaMStR (Ebersberger et al 2009) using similaralignment filtering procedures as Brewer and Bond (2013)

Species Trees and Sources of Conflicts

We used concatenated supermatrices for species tree infer-ence partitioning by each locus These supermatrices maycontain conflicting phylogenetic signals due to hybridizationdeep coalescence contamination and horizontal gene trans-fer Noise and bias from assembly and orthology and treeinference may also complicate phylogenetic signal To evalu-ate the presence of conflicting phylogenetic signal weconducted serial jackknife analyses for each supermatrixkeeping each locus intact

The HYM Data SetSpecies trees reconstructed from the HYM data set wereoverall highly consistent among all four orthology inferencemethods in topology branch lengths and support values(fig 3) They had identical topologies to those in the analysisby Johnson et al (2013) All branches received a supportvalue of 100 from both the bootstrap and 30 jackknifeanalyses Branches received less-than-perfect support valuesusing STAR or PhyloNet in Johnson et al (2013) similarlyreceived less-than-perfect jackknife support values in our10 andor 20 gene jackknife analyses The node unitingFormicidae and Apoidea (marked with an arrow in fig 3)received 81ndash97 jackknife support with around 100 loci

FIG 3 Maximum-likelihood analysis of the HYM data set Taxon names were abbreviated to the first four letters of the genus names except the left-most tree Orthology inference methods MI maximum inclusion RT extracting rooted ingroup clades MO monophyletic outgroups 1to1 filteredone-to-one orthologs All nodes received bootstrap and 30 jackknife support values of 100 and are not shown Node labels are also not shown if allsupport values are 100 Arrows indicate nodes with relatively low support

3086


ownloaded from

httpsacademicoupcom




and around 60 with 20 loci Given that five of the nine taxain this clade were from annotated genomes and the entiretree was otherwise well supported this Formicidae + Apoideanode warrants further investigation of the source of theconflict

The GRP Data SetThe topology we recovered was well supported and in con-gruent with the topology recovered by Wen et al (2013)except one of the basal nodes (fig 4 indicated witharrows) We reanalyzed both the 417- and 229-gene CDSmatrixes from Wen et al (2013) with RAxML v735(Stamatakis 2006) partitioning by gene The two resultingtrees similarly showed low support values at two of thebasal nodes (fig 4) The original species tree inference byWen et al (2013) did not apply any partition to the conca-tenated supermatrices and received bootstrap support of 100for all nodes Their topology stability test only included nodesamong the ingroups without examining the uncertaintyin the outgroup placement Also although their topologicalstability test involved serial subsampling by locus as we didhere they discarded the subsampled replicates when themaximum-likelihood tree had a topology different from theldquostandard topologyrdquo They then calculated mean bootstrapvalues using only those replicates that agreed with the stan-dard topology with no partitioning of subsampled matrixesWhen partitioning was applied all six supermatrices (fig 4)four from our orthology inference and two from Wen et al(2013) showed strong conflicting signal among the deep

nodes in Vitaceae Therefore there is a need to take a closerlook at the conflicting signals the topology from plastidsequences and perhaps also sequences from additional out-group samples

The MIL Data SetWe recovered similar results to Brewer and Bond (2013)for the MIL data set Despite the generally well-supportedtopology the placement of Pseudopolydesmus was unstableClades including Pseudopolydesmus had low supportvalues regardless of orthology inference methods used(fig 5 arrows in upper four trees) and the support valuesin both RT and 1to1 decreased with increasing subsamplingratios This indicates strong conflicting signals in the place-ment of Pseudopolydesmus We subsequently removedPseudopolydesmus from the initial RAxML output for homo-log tree inference trimmed tips and carried out orthologyand species tree inferences By doing so the support valueswere significantly improved (fig 5 lower four trees) Althoughthe resulting species trees were well supported the nodeuniting Prostemmiulus Cambala and Archispirostreptus(marked with an arrow in fig 5 lower four trees) receivedsupport values of 88ndash94 when subsampling 10 of totalgenes and 59ndash72 when subsampling 20 genes These valueswere relatively low compared with the rest of the tree andmay deserve further investigations

A number of other multilocus species tree methods havebeen used for reconstructing species trees and evaluatingtopological support using multiple genes without

FIG 4 Maximum-likelihood analysis of the GRP CDS data set Taxon names were replaced by the collection numbers except the top left tree Orthologyinference methods MI maximum inclusion RT extracting rooted ingroup clades MO monophyletic outgroups 1to1 filtered one-to-one orthologsNode labels are not shown when all support values are 100 Arrows indicate nodes with relatively low support

3087


ownloaded from

httpsacademicoupcom




concatenation Methods such as STAR (Liu Yu et al 2009)and MP-EST (Liu et al 2010) assume that coalescenceis the only source of gene tree discordance making boot-strap numbers derived from these models difficult to inter-pret The program BUCKy does not assume a single source ofdiscordance (Ane et al 2007 Larget et al 2010) HoweverBUCKy assumes that each individual gene has enoughphylogenetic information and that the Markov chainMonte Carlo chains mixed well enough such that theposterior distribution of each gene tree reflects truephylogenetic uncertainty For phylogenomic data sets withassembly error partial and missing sequences and geneswith significant diversity in information content and molec-ular rate and therefore posterior distributions spread overmany alternative topologies BUCKy gives low concordancevalues across the tree that are difficult to interpret (Cui et al2013)

Comparison among Methods of Homology andOrthology Inference

Among the four alternative orthology inference methodswe examined MI has the advantage of not requiring anyoutgroup information It works well even in the absence ofhigh-quality outgroups However in the presence of genomeduplication MI breaks orthologs each time duplicated taxonnames are detected Both RT and MO explicitly accommo-date gene and genome duplications among the ingroups andare especially suitable for clades that have many genegenome duplications However both require high-quality

outgroup taxa that are phylogenetically distinct from theingroup In addition RT requires outgroups that will not beincluded in the final ortholog sets and work best when thereare multiple successive outgroups The 1to1 strategy worksfor relatively small data sets but otherwise is not likely to beuseful With transcriptome data sets that are both incom-plete redundant and contain errors and isoforms restrictingto 1to1 relationships ignores the evolutionary history of genefamilies and is susceptible to repeated gene loss (De Smetet al 2013) Finally if the data set lacks high-quality outgroupsand is complicated by genome duplications the quality oforthology inference using any method will be problematic(table 1)

A final consideration for tree-based orthology inferenceis the computational cost Our experience is that with increas-ing data set size the computational bottleneck is at the stageof all-by-all homology search which scales exponentially withthe data set size One possible modification is to infer a corehomolog set using taxa with genome sequences and thencarry out homology search using sequences from the remain-ing taxa against these core homologs This approach has therisk of missing novel genes that are not represented in thecore homolog set Once clusters are obtained the alignmentand tree inference steps can be easily distributed in manycomputer cores With the recent advance in large-scalealignment and tree inference tools (Liu et al 2012Stamatakis et al 2012 Katoh and Standley 2013) it is thetime to fully take advantage of the information in genetrees to obtain more complete and more accurate homologand ortholog sets

FIG 5 Maximum-likelihood analysis of the MIL Taxon names were abbreviated to the first four letters of the genus names except the top left treeOrthology inference methods MI maximum inclusion RT extracting rooted ingroup clades MO monophyletic outgroups 1to1 filtered one-to-oneorthologs Node labels are not shown when all support values are 100 Arrows indicate nodes with relatively low support

3088


ownloaded from

httpsacademicoupcom




ConclusionThis study demonstrates the power of tree-based homologyand orthology inference to recover significantly more usabledata from short-read transcriptomic and low-coverage geno-mic data sets than existing heuristics By reanalyzing threepublished data sets we illustrate the procedures for obtainingcleaned and optimized homologs and orthologs and showthe utility of different strategies for resolving data sets of dif-ferent age completeness rooting scenarios and presenceof genome duplications We also illustrate the importanceof including complete genomes even if as members ofthe outgroup The number of orthologs recovered can bedramatically improved with more complete data fromindividual taxa

The real power of our tree-based procedure is that itpreserves the full complement of evolutionary historypresent in each gene family With this approach futurestudies will be able to explore the rich information in thesephylogenomic data sets such as functional and phyloge-netic location of tree discordance gene and genome duplica-tions shifts in molecular rates and signatures of naturalselection in nonmodel systems at an unprecedentedlybroad scale

Materials and Methods

Data Sets and Sequence Processing

For the MIL data set nine transcriptomes from Brewer andBond (2013) were downloaded from the SRA (accessionsSRX326775ndashSRX326777 SRX326779ndashSRX326784) Paired-end 50 bp reads were filtered using the read cleaning proce-dure from Yang and Smith (2013) Reads with average qualityscores lower than 32 were removed bases at the 30-end withquality scores lower than 20 were trimmed and only readslonger than 30 bp after trimming were kept Both reads in aread pair were removed if one of the reads did not pass thequality filter Adapter contamination was screened againstthe UniVec database (httpwwwncbinlmnihgovtoolsvecscreenunivec last accessed November 20 2013) andthe Illumina TruSeq adapters and all vector containingread pairs were removed This differed from the original pub-lication in that we removed the entire read pair when anadapter was detected in either of the reads instead of cuttingoff the first nine bases from all reads Given the typical inser-tion size for Illumina RNA-seq libraries (~130ndash200 bp) the

presence of an adapter dimer (~120 bp) would often rendera read pair to be useless All nine transcriptomes were assem-bled using Trinity version 20131110 with default settings(Grabherr et al 2011) except that min_kmer_cov was setto 2 instead of the default value of 1 consistent withBrewer and Bond (2013) Archispirostreptus gigas EST se-quences were downloaded from GenBank (4008 in total ac-cessions FN194820ndashFN198827 Meusemann et al 2010)All transcripts were translated using TransDecoder version20131137 assisted by pfam domain information (Haas et al2013) Following Brewer and Bond (2013) additional prote-ome data of Ixodes scapularis were downloaded fromVectorBase (wwwvectorbaseorg last accessed November19 2013 Megy et al 2012) and peptide sequences ofDaphnia pulex were downloaded from the Joint GenomeInstitute httpgenomejgi-psforg (filtered models v11 lastaccessed November 19 2013 Colbourne et al 2011)

We suggest that future NCBI SRA submissions containinformation about what kit and modifications were usedfor library preparation the adapters used and the distributionof insertion sizes in either or both the SRA submission andthe methods narratives even when the library preparationwas outsourced Such information would greatly facilitateeffective reuse of these archived data sets

For the GRP data set all 15 transcriptomes generatedby Wen et al (2013) were downloaded from GenBank (SRAaccessions SRX286217ndashSRX286231) Paired-end 90 bp readswere filtered by quality scores and adaptor contaminationwas removed with the same procedure as for MIL The re-maining reads were assembled using Trinity version 20140413with default settings (Grabherr et al 2011) and translatedusing TransDecoder version rel16JAN2014 assisted by pfamdomain information (Haas et al 2013) CDS of V vinifera weredownloaded from the Phytozome database v91 (Jaillon et al2007 Goodstein et al 2012)

For the HYM data set all peptide sequences were kindlyprovided by the authors (Johnson et al 2013) including pep-tide sequences from additional studies (httpwwwncbinlmnihgovbioproject66515 httpswwwhgscbcmeduarthropodsbumble-bee-genome-project Weinstock et al 2006Bonasio et al 2010 Werren et al 2010 Smith Smith et al2011 Smith Zimin et al 2011 Kocher et al 2013) Allpeptides were reduced with cd-hit (-c 099 -n 5) andCDS were reduced with cd-hit-est (-c 0995 -n 10 -r 1Fu et al 2012)

Table 1 Comparison of the Four Alternative Orthology Inference Methods Used in This Study

High-QualityOutgroups

GenomeDuplications

Examples MI RT MO 1to1

Present Absent HYM MIL Good Good Good if only interestedin low-copy genes

OK with a small number of taxa

Absent Absent Deep metazoanphylogeny GRP

Good Bad Bad OK with a small number of taxa

Present Present Many plant groups Bad Good Good if only interestedin low-copy genes


Absent Present Many plant groups Bad Bad Bad May be the only choice

3089


ownloaded from

httpsacademicoupcom




Homology Inference

Homology searches were carried out using all-by-all BLASTPfrom peptides of all three data sets and an additional all-by-allBLASTN search using CDS from GRP All BLAST searches usedan E value cutoff of 1 and max_target_seqs set to 100 BLASToutput was filtered by a requirement that the hit fractionbeing at least 04 (Chiu et al 2006 Sanderson andMcMahon 2007) MCL (MCL v12-068 van Dongen 2000Enright et al 2002 van Dongen and Abreu-Goodger 2012)was performed on filtered all-by-all BLASTP hits with the Evalue cutoff set to 105 and an inflation value of 14 Endswith no BLAST hits presumably from misassembly andorframeshift were cut off Remaining sequences shorter than40 characters were removed and clusters smaller than eighttaxa were removed Each resulting cluster was aligned usingMAFFT v7043b (ndashgenafpairndashmaxiterate 1000 if less than1000 sequences ndashauto when 1000 or more sequencesKatoh and Standley 2013)

Clusters may include divergent sequences and the align-ments therefore require refinement (the optional step fig 1)Alignments that included 200 or more sequences were re-fined with SATe v227 (Liu Raghavan et al 2009 Liu et al2012) starting with alignments from MAFFT Alignmentswere trimmed with Phyutility v226 (-clean 001) and an initialphylogenetic tree estimated with FastTree v 217 (Price et al2010) The resulting trees often contain misassembly recom-bination or paralogs with deep splits that formed longbranches These long branches (15 for MIL 12 for HYMand 06 for GRP with peptides) were cut and sequencesfrom each subtree were realigned using MAFFT followed bySATe as the previous step As for the GRP CDS data set theinitial alignments using MAFFT were well aligned and weredirectly used in subsequent steps

Resulting alignments were trimmed with Phyutility (-clean01) Maximum-likelihood phylogenies were inferred usingRAxML v735 (Stamatakis 2006) with the modelPROTCATWAG for peptides and the model GTRCAT forCDS The resulting trees occasionally still had unusually longtips that likely arose from misassembly andor frameshiftA tip was removed if it was more than 10 times longerthan the average distance to tips seen in its sister cladeand was longer than 075 for MIL 06 for HYM or 01 forGRP When monophyletic or paraphyletic tips from the sametaxa were present in a tree only the one with the highestnumber of nonambiguous characters in the trimmed align-ment was kept as the representative with the rest removed(Dunn et al 2008 Smith Wilson et al 2011)

Finally branches longer than 15 for MIL 10 for HYMand 03 for GRP were cut The resulting trees were calledhomologous gene trees Three homologous gene tree setswere obtained MIL GRP (CDS only onwards) and HYM

Orthology Inference

All homologous gene trees were further pruned to producethe orthologs containing one sequence per species Fouralternative strategies were applied for each homolog set(fig 1) MI RT MO and 1to1

We used the MI method (Dunn et al 2008 2013 SmithWilson et al 2011) to search the homologous gene tree forthe subtree that had the highest number of taxa without anytaxon repeat and cut it off as an ortholog (fig 1) We thencontinued searching the remaining tree with the same cuttingcriteria until no subtree with at least eight nonrepeating taxacould be found As the remaining tree occasionally containedtips subtended by long branches as a result of leftover frompruning off orthologs a tip was removed if it was more thanten times longer than the average distance to tips seen in itssister clade and was longer than 04 for MIL 03 for HYM or01 for GRP

The RT (fig 1) strategy uses predefined outgroups to orientand extract ingroup clades and then infer locations of geneduplication events from the extracted ingroup clade We firstiteratively searched for the subtree that had the highestnumber of ingroup taxa regardless of taxon duplicationsand cut it off as a rooted tree We then traversed thisrooted ingroup tree from the root toward the tips whileinferring locations of gene duplication events from deepones to more recent ones by looking for taxon duplicationbetween the two sides at each bifurcation When a gene du-plication was found at a node the side with a smaller numberof taxa was pruned to maximize taxon occupancy in theremaining tree This paralog pruning procedure was carriedout iteratively on all subtrees until no taxon duplication wasleft in any subtree with at least eight taxa As for homologsthat lack outgroups only those with no taxon duplicationwere included as unrooted ortholog trees Homologs withduplicated taxa but no outgroup taxa were ignored due todifficulties in inferring locations of gene duplications withoutrooting The MIL data set included three non-millipede ar-thropods forming a successive grade to the millipedes Two ofthe more distantly related outgroups Ixodes and Daphniawere both from genome annotations whereas the thirdtaxa Lithobius was the closest to the millipedes and wassampled using RNA-seq Therefore we regarded both Ixodesand Daphnia as ldquooutgrouprdquo in our outgroup-based orthologyinferences and treated Lithobius as a member of the ingroupAs for HYM one single nonaculeate genome-derived Nasoniawas used as the outgroup following the original analysis ForGRP the transcriptome-derived outgroup Leea was used fol-lowing the original analysis

Although RT is effective for data sets with genome dupli-cations outgroups used for extracting rooted clades will beabsent from the final orthologs A modification for RT is toroot ingroups only when the outgroups are monophyletic andnonrepeating (MO) By doing so all taxa will be preserved inthe resulting ortholog sets while losing a fraction of homologs

Finally we also present results using only homologs with-out any taxon repeat (1to1) for completeness similar to pro-cedures that use reciprocal criteria (Ebersberger et al 2009Schreiber et al 2009)

Estimating Species Tree

Following ortholog inference an alignment was obtained foreach ortholog by extracting aligned sequences from the

3090


ownloaded from

httpsacademicoupcom




homologs The resulting alignments were trimmed withPhyutility (-clean 03) for HYM and GRP Because MIL hadmore divergent sequences alignments were trimmed withGblocks v091b (Talavera and Castresana 2007 Soria-Carrasco and Castresana 2008) using the same settings as ofBrewer and Bond (2013) For the final supermatrices we onlyincluded trimmed ortholog alignments that were at least 100aa for MIL and HYM or 300 nt for GRP in trimmed lengthand each ortholog had no more than one missing taxon forMIL and only orthologs with the full taxon set for HYM andGRP

Maximum-likelihood trees were estimated using RAxMLwith PROTCATWAG model for the MIL and HYM data setsand in ExaML v204 (Stamatakis et al 2012) with GTRCAT forthe GRP data set partitioning each ortholog Conflicts amongorthologs were estimated by 200 jackknife replicates eachresampling a fixed proportion (10 or 30) or a fixednumber of 20 orthologs keeping each ortholog intactFinally despite the recognition that the bootstrap methodmay provide a poor measurement of confidence in genome-wide data sets (Salichos and Rokas 2013) we carried out 200rapid bootstrap replicates for the HYM and MIL data sets tocompare to their respective original analyses

Supplementary MaterialSupplementary tables S1 and S2 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Christopher Laumer Michael J MooreCody E Hinchliff Joseph W Brown and Anne K Greenbergfor helpful discussions andor comments on the manuscriptThis work was supported by the Department of Ecology andEvolutionary Biology and the MCubed initiative at theUniversity of Michigan Ann Arbor and the NationalScience Foundation (grant number DEB1354048 to SAS)

ReferencesAltenhoff AM Gil M Gonnet GH Dessimoz C 2013 Inferring hierar-

chical orthologous groups from orthologous gene pairs PLoS One8(1)e53786

Altenhoff AM Schneider A Gonnet GH Dessimoz C 2011 OMA 2011orthology inference among 1000 complete genomes Nucleic AcidsRes 39D289ndashD294

Altschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic localalignment search tool J Mol Biol 215(3)403ndash410

Ane C Larget B Baum DA Smith SD Rokas A 2007 Bayesian estimationof concordance among gene trees Mol Biol Evol 24(2)412ndash426

Bonasio R Zhang G Ye C Mutti NS Fang X Qin N Donahue G Yang PLi Q Li C 2010 Genomic comparison of the ants Camponotusfloridanus and Harpegnathos saltator Science 329(5995)1068ndash1071

Boussau B Szollosi GJ Duret L Gouy M Tannier E Daubin V 2013Genome-scale coestimation of species and gene trees Genome Res23323ndash330

Brewer MS Bond JE 2013 Ordinal-level phylogenomics of the arthropodclass Diplopoda (millipedes) based on an analysis of 221 nuclearprotein-coding loci generated using next-generation sequence anal-yses PLoS One 8(11)e79935

Chiu JC Lee EK Egan MG Sarkar IN Coruzzi GM DeSalle R 2006OrthologID automation of genome-scale ortholog identificationwithin a parsimony framework Bioinformatics 22(6)699ndash707

Colbourne JK Pfrender ME Gilbert D Thomas WK Tucker A OakleyTH Tokishita S Aerts A Arnold GJ Basu MK et al 2011 Theecoresponsive genome of Daphnia pulex Science331(6017)555ndash561

Cui R Schumer M Kruesi K Walter R Andolfatto P Rosenthal GG 2013Phylogenomics reviews extensive reticulate evolution inXiphophorus fishes Evolution 67(8)2166ndash2179

De Smet R Adams KL Vandepoele K Van Montagu MCE Maere S Vande Peer Y 2013 Convergent gene loss following gene and genomeduplications creates single-copy families in flowering plants ProcNatl Acad Sci U S A 110(8)2898ndash2903

Dunn C Howison M Zapata F 2013 Agalma an automated phyloge-nomics workflow BMC Bioinformatics 14(1)330

Dunn CW Hejnol A Matus DQ Pang K Browne WE Smith SA SeaverE Rouse GW Obst M Edgecombe GD et al 2008 Broad phyloge-nomic sampling improves resolution of the animal tree of lifeNature 452(7188)745ndash749

Ebersberger I Strauss S von Haeseler A 2009 HaMStR profile hiddenmarkov model based search for orthologs in ESTs BMC Evol Biol9(1)157

Enright A van Dongen S Ouzounis C 2002 An efficient algorithm forlarge-scale detection of protein families Nucleic Acids Res 301575ndash1584

Fitch WM 1970 Distinguishing homologous from analogous proteinsSyst Biol 19(2)99ndash113

Fitch WM 2000 Homology a personal view on some of the problemsTrends Genet 16(5)227ndash231

Fu L Niu B Zhu Z Wu S Li W 2012 CD-HIT accelerated for clusteringthe next generation sequencing data Bioinformatics28(23)3150ndash3152

Gabaldon T 2008 Large-scale assignment of orthology back to phylo-genetics Genome Biol 9(10)235

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-length transcriptome assembly from RNA-seq data without a refer-ence genome Nat Biotechnol 29(7)644ndash654

Goodstein DM Shu SQ Howson R Neupane R Hayes RD Fazo J MitrosT Dirks W Hellsten U Putnam N Rokhsar DS 2012 Phytozome acomparative platform for green plant genomics Nucleic Acids Res40D1178ndashD1186

Haas BJ Papanicolaou A Yassour M Grabherr M Blood PD Bowden JCouger MB Eccles D Li B Lieber M et al 2013 De novo transcriptsequence reconstruction from RNA-seq using the Trinity platformfor reference generation and analysis Nat Protoc 8(8)1494ndash1512

Jaillon O Aury J Noel B Policriti A Clepet C Casagrande A Choisne NAubourg S Vitulo N Jubin C et al 2007 The grapevine genomesequence suggests ancestral hexaploidization in major angiospermphyla Nature 449463ndash467

Johnson BR Borowiec ML Chiu JC Lee EK Atallah J Ward PS 2013Phylogenomics resolves evolutionary relationships among ants beesand wasps Curr Biol 23(20)2058ndash2062

Katoh K Standley DM 2013 MAFFT multiple sequence alignment soft-ware version 7 improvements in performance and usability Mol BiolEvol 30(4)772ndash780

Kocher S Li C Yang W Tan H Yi S Yang X Hoekstra H Zhang G PierceN Yu D 2013 The draft genome of a socially polymorphic halictidbee Lasioglossum albipes Genome Biol 14(12)R142

Larget BR Kotha SK Dewey CN Ane C 2010 BUCKy gene treespeciestree reconciliation with Bayesian concordance analysisBioinformatics 26(22)2910ndash2911

Li H Coghlan A Ruan J Coin L Heriche J Osmotherly L Li R Liu TZhang Z Bolund L et al 2006 TreeFam a curated database ofphylogenetic trees of animal gene families Nucleic Acids Res 34572ndash580

Li L Stoeckert CJ J Roos D 2003 OrthoMCL identification of orthologgroups for eukaryotic genomes Genome Res 132178ndash2189

3091


ownloaded from

httpsacademicoupcom




Liu K Raghavan S Nelesen S Linder CR Warnow T 2009 Rapid andaccurate large-scale coestimation of sequence alignments and phy-logenetic trees Science 324(5934)1561ndash1564

Liu K Warnow TJ Holder MT Nelesen SM Yu J Stamatakis AP LinderCR 2012 SATe-II very fast and accurate simultaneous estimation ofmultiple sequence alignments and phylogenetic trees Syst Biol 6190ndash106

Liu L Yu L Edwards S 2010 A maximum pseudo-likelihood approachfor estimating species trees under the coalescent model BMC EvolBiol 10(1)302

Liu L Yu L Pearl DK Edwards SV 2009 Estimating species phyloge-nies using coalescence times among sequences Syst Biol58(5)468ndash477

Marcet-Houben M Gabaldon T 2011 TreeKO a duplication-awarealgorithm for the comparison of phylogenetic trees Nucleic AcidsRes 39(10)e66

Megy K Emrich SJ Lawson D Campbell D Dialynas E Hughes DSKoscielny G Louis C MacCallum RM Redmond SN 2012VectorBase improvements to a bioinformatics resource for inverte-brate vector genomics Nucleic Acids Res 40(D1)D729ndashD734

Meusemann K von Reumont BM Simon S Roeding F Strauss S Kuck PEbersberger I Walzl M Pass Gn Breuers S et al 2010 A phyloge-nomic approach to resolve the arthropod tree of life Mol Biol Evol27(11)2451ndash2464

Price MN Dehal PS Arkin AP 2010 FastTree 2 approximatelymaximum-likelihood trees for large alignments PLoS One5(3)e9490

Ranwez V Harispe S Delsuc F Douzery E 2011 MACSE multiple align-ment of coding sequences accounting for frameshifts and stopcodons PLoS One 6(9)e22594

Roth A Gonnet G Dessimoz C 2008 Algorithm of OMA for large-scaleorthology inference BMC Bioinformatics 9(1)518

Roure B Rodriguez-Ezpeleta N Philippe H 2007 SCaFoS a tool forSelection Concatenation and Fusion of Sequences for phyloge-nomics BMC Evol Biol 7(Suppl 1) S2

Salichos L Rokas A 2013 Inferring ancient divergences requires geneswith strong phylogenetic signals Nature 497(7449)327ndash331

Sanderson M McMahon M 2007 Inferring angiosperm phylogenyfrom EST data with widespread gene duplication BMC Evol Biol7(Suppl 1) S3

Schreiber F Pick K Erpenbeck D Worheide G Morgenstern B 2009OrthoSelect a protocol for selecting orthologous groups in phylo-genomics BMC Bioinformatics 10219

Smith CD Zimin A Holt C Abouheif E Benton R Cash E Croset VCurrie CR Elhaik E Elsik CG 2011 Draft genome of the globally

widespread and invasive Argentine ant (Linepithema humile) ProcNatl Acad Sci U S A 108(14)5673ndash5678

Smith CR Smith CD Robertson HM Helmkampf M Zimin A YandellM Holt C Hu H Abouheif E Benton R et al 2011 Draft genome ofthe red harvester ant Pogonomyrmex barbatus Proc Natl Acad Sci US A 108(14)5667ndash5672

Smith SA Dunn CW 2008 Phyutility a phyloinformatics tool for treesalignments and molecular data Bioinformatics 24(5)715ndash716

Smith SA Wilson NG Goetz FE Feehery C Andrade SCS Rouse GWGiribet G Dunn CW 2011 Resolving the evolutionary relationshipsof molluscs with phylogenomic tools Nature 480(7377)364ndash369

Soria-Carrasco V Castresana J 2008 Estimation of phylogenetic incon-sistencies in the three domains of life Mol Biol Evol25(11)2319ndash2329

Stamatakis A 2006 RAxML-VI-HPC maximum likelihood-based phylo-genetic analyses with thousands of taxa and mixed modelsBioinformatics 22(21)2688ndash2690

Stamatakis A Aberer AJ Goll C Smith SA Berger SA Izquierdo-CarrascoF 2012 RAxML-Light a tool for computing terabyte phylogeniesBioinformatics 282064ndash2066

Talavera G Castresana J 2007 Improvement of phylogenies after re-moving divergent and ambiguously aligned blocks from proteinsequence alignments Syst Biol 56564ndash577

van Dongen S 2000 Graph clustering by flow simulation [PhD thesis][Utrecht Netherlands] University of Utrecht

van Dongen S Abreu-Goodger C 2012 Using MCL to extract clustersfrom networks In Helden JV Toussaint A Thieffry D editorsBacterial molecular networks methods and protocols Methods inmolecular biology Vol 804 New York Springer p 281ndash295

Weinstock GM Robinson GE Gibbs RA Worley KC Evans JD MaleszkaR Robertson HM Weaver DB Beye M Bork P 2006 Insights intosocial insects from the genome of the honeybee Apis melliferaNature 443(7114)931ndash949

Wen J Xiong Z Nie Z-L Mao L Zhu Y Kan X-Z Ickert-Bond SMGerrath J Zimmer EA Fang X-D 2013 Transcriptome sequencesresolve deep relationships of the grape family PLoS One 8(9)e74394

Werren JH Richards S Desjardins CA Niehuis O Gadau J Colbourne JKGroup TNGW 2010 Functional and evolutionary insights from thegenomes of three parasitoid Nasonia species Science327(5963)343ndash348

Wilson JS von Dohlen CD Forister ML Pitts JP 2013 Family-leveldivergences in the stinging wasps (Hymenoptera Aculeata) withcorrelations to angiosperm diversification Evol Biol 40(1)101ndash107

Yang Y Smith SA 2013 Optimizing de novo assembly of short-readRNA-seq data for phylogenomics BMC Genomics 14(1)328

3092


ownloaded from

httpsacademicoupcom




account However it was designed for genomic data Besidespotential scaling issues with such an approach as data setsgrow transcriptomes may lack a particular gene due tosilencing or low expression and coverage Taxa with lowgene coverages tend to be grouped together due to sharedldquogene lossrdquo (Boussau et al 2013)

An alternative strategy adopted by Agalma (Dunn et al2013) and recent implementations of OrthologID (Chiu et al2006) consists of two stages Obtaining homologs and sepa-rating orthologs from paralogs Both pipelines infer homologsusing an all-by-all BLAST search (Altschul et al 1990) followedby Markov clustering (MCL) that identifies sequence clustersbased on the relative connectivity (presenceabsence of hits)and relative strength of connections (E values from BLASThits) among sequences (van Dongen 2000) A phylogenetictree is then inferred for each homolog To obtain orthologsthe two pipelines use different approaches Agalma takes onlythe homolog tree topology into account It looks for thesubtree that has the highest number of nonrepeating taxacuts it off as an ortholog and repeats the search and cuttingon the remaining tree (Dunn et al 2013) This approach hasthe advantage of being relatively assumption free Howeverwhen there are genome duplications it breaks orthologs intofragments This is especially problematic when there are mul-tiple nested genome duplications as is frequently seen inplants (De Smet et al 2013) On the other hand orthologIDconsiders both homolog tree topology and the homologsequence alignment using a partial guide tree determinedfrom taxa with genome sequences available (Chiu et al2006) It is able to accommodate gene and genome duplica-tions yet it is limited by the availability of annotated genomesrequired to build the guide tree for each ortholog groupMost areas of the tree of life still lack reasonable whole-genome sequence coverage Both Agalma and orthologIDimprove enormously on previous methods by taking genetree into account (Chiu et al 2006 Dunn et al 2013)However there is the great potential for additional compo-nents and methods that would allow for higher flexibilityand broader applications

Here we outline a flexible orthology inference procedurebased on identifying homologs followed by cleaning aligningand cutting homolog trees We demonstrate this approachand compare different methods for cutting homolog treesin three recently published phylogenomic data sets acrossdiverse taxonomic groups and ages The grape family orVitaceae consists of approximately 900 species with a stemage of approximately 95 Ma (Wen et al 2013) The grape dataset (GRP) (Wen et al 2013) includes 15 transcriptomes and aproteome from the grape genome annotation The millipedes(class Diplopoda) are an ancient and diverse group withfossils dating back to 428 Ma (Brewer and Bond 2013)The millipedes data set (MIL) (Brewer and Bond 2013) in-cludes nine transcriptomes one expressed sequence tag (EST)data set and two non-millipedes proteomes from genomeannotation The aculeate Hymenoptera includes ants beesand wasps and has a crown age of approximately 150 Ma(Wilson et al 2013) The aculeate Hymenoptera data set(MIL) (Johnson et al 2013) includes 18 ingroup data sets

(11 transcriptomes 1 low-coverage genome 6 annotated ge-nomes) and one outgroup from annotated genome

New Approaches

Our orthology inference approach is tree-based does not relyon a known species tree and is capable of accommodatinggenome duplications and different outgroup scenarios It dif-fers from previous published tree-based and species tree-in-dependent orthology inference methods in a number of waysWe take phylogenetic trees into account in both homologinference as well as ortholog inference We explore four alter-native strategies for obtaining orthologs from cutting homo-log trees two of which are newly proposed to explicitlyaccommodate gene duplications To evaluate conflictingphylogenetic signals we use a jackknife strategy with multipleresampling ratios This strategy resamples by locus not by siteand therefore keeps each locus intact and explicitly evaluatesconflicting phylogenetic signals among loci Finally given theever-changing landscape of sequence processing alignmentand tree inference methods each step of our procedureis written in separate python scripts with a lightweightphylogenetic tree library which allows for steps to be easilymodified swapped and moved between computer clustersand desktop machines

Results and Discussion

Homology Inference Using Clusters and Trees

Our homology inference method starts with an all-by-allBLAST followed by clustering filtered BLAST hits using MCL(van Dongen 2000) For each cluster we align into a multiplesequence alignment infer a phylogenetic tree using maxi-mum likelihood cut deep paralogs and remove aberrantand redundant tips (fig 1)

A number of methods can be used for conducting theinitial homology search For the analyses presented hereinitial all-by-all homology searches were conducted usingpeptide sequences against a peptide database (BLASTP)In addition we also conducted homology search usingcoding sequences (CDS) against a CDS database (BLASTN)in the GRP data set which contains the most recently di-verged group among the three we analyzed here A thirdapproach employed by Agalma (Dunn et al 2013) uses tran-scripts translated by all six frames (TBLASTX) for homologysearch and then simultaneously translate and align CDS usingMACSE (Ranwez et al 2011) This approach is promisingfor improving translation accuracy yet very time consumingand is currently limited to relatively small data sets

Once a BLAST score has been calculated between eachsequence pair there are two general strategies for homologyclustering The simplest method is to filter out BLAST resultsthat do not meet a minimal coverage percentage for eithersequence (ldquohit fractionrdquo Chiu et al 2006 Sanderson andMcMahon 2007) then obtain clusters that are connectedby the remaining hits However this approach is sensitiveto both the value of the minimum hit fraction filter andlarge gene families that have sequences of intermediate com-pleteness that can attract even less complete sequences

3082


ownloaded from

httpsacademicoupcom




009 0404

GRP HYM MIL

All-by-all blast


Markov Clustering





Rooted ingroups(RT)





C




A

A

B

CO

AB

O

A

B

O

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

AB

C

AB

C

AB

C

AB

C

A

B

CA

B

O

A

B

CA

B

O

A

B

CA

B

O

A

B


A

A

B

CO

AB

O

A

A

B

CO

AB

O

A

A

B

CO

AB

O


3083


ownloaded from

httpsacademicoupcom









Orthology Inference


3084


ownloaded from

httpsacademicoupcom










0 5000 10000 15000 0 5000 10000 15000

0 5000 10000 15000

10

12

14

16

18

0 5000 10000 15000

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

Num

ber o

f tax

a in

eac

h or

thol

og




HYM GRPMIL


MIL HYM GRP


3085


ownloaded from

httpsacademicoupcom














3086


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom




009 0404

GRP HYM MIL

All-by-all blast


Markov Clustering





Rooted ingroups(RT)





C




A

A

B

CO

AB

O

A

B

O

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

A

A

B

B

C

AB

C

AB

C

AB

C

AB

C

A

B

CA

B

O

A

B

CA

B

O

A

B

CA

B

O

A

B


A

A

B

CO

AB

O

A

A

B

CO

AB

O

A

A

B

CO

AB

O


3083


ownloaded from

httpsacademicoupcom









Orthology Inference


3084


ownloaded from

httpsacademicoupcom










0 5000 10000 15000 0 5000 10000 15000

0 5000 10000 15000

10

12

14

16

18

0 5000 10000 15000

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

Num

ber o

f tax

a in

eac

h or

thol

og




HYM GRPMIL


MIL HYM GRP


3085


ownloaded from

httpsacademicoupcom














3086


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom









Orthology Inference


3084


ownloaded from

httpsacademicoupcom










0 5000 10000 15000 0 5000 10000 15000

0 5000 10000 15000

10

12

14

16

18

0 5000 10000 15000

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

Num

ber o

f tax

a in

eac

h or

thol

og




HYM GRPMIL


MIL HYM GRP


3085


ownloaded from

httpsacademicoupcom














3086


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom










0 5000 10000 15000 0 5000 10000 15000

0 5000 10000 15000

10

12

14

16

18

0 5000 10000 15000

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

10

12

14

16

18

6

8

Num

ber o

f tax

a in

eac

h or

thol

og




HYM GRPMIL


MIL HYM GRP


3085


ownloaded from

httpsacademicoupcom














3086


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom














3086


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom










3087


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom










3088


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom















GenomeDuplications









3089


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom




Homology Inference





Orthology Inference








3090


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom







Acknowledgments
































3091


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom


































3092


ownloaded from

httpsacademicoupcom




orthology inference in nonmodel organisms using ...nov 20, 2013 · of tree-based orthology...

Documents