Analytical and computational challenges in coalescent-based species tree estimation

Tandy Warnow, Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, http://tandy.cs.illinois.edu (joint work with Siavash Mirarab and M.S. Bayzid)

Post on 03-Mar-2020

Page 1

Analytical and computational challenges in coalescent-based species tree estimation

Tandy Warnow
Departments of Computer Science and Bioengineering
The University of Illinois at Urbana-Champaign
http://tandy.cs.illinois.edu
(joint work with Siavash Mirarab and M.S. Bayzid)

Page 2

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Species Tree

Page 3

Phylogenomics = genome-scale phylogeny estimation

Note: Jonathan Eisen coined this term, but used it to mean something else.

Page 4

Main contributions

• Multiple Sequence Alignment: Methods for large-scale MSA (up to 1,000,000 sequences, including fragments): SATe, PASTA, and UPP

• Phylogenomics: Methods for multi-locus species tree estimation that are robust to gene tree incongruence due to incomplete lineage sorting (ILS) and horizontal gene transfer (HGT)

• Metagenomics: Methods for taxon identification and abundance profiling of metagenomic datasets

Page 6

Concatenation

[Figure: the aligned sequences for gene 1, gene 2, and gene 3 (blocks of ten sites such as TCTAATGGAA, GGTAACCCTC, and TATTGATACA) are joined end to end into a single alignment with one row for each of the species S1 to S8. Where a species has no sequence for a gene, that block is filled with "?" characters.]
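The concatenation step pictured on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the talk: the function name `concatenate`, the toy sequences, and the choice of which species is missing from which gene are all assumptions made for the example.

```python
def concatenate(gene_alignments, taxa, missing="?"):
    """Join per-gene alignments into one supermatrix row per taxon.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa absent from a gene are padded with the missing-data symbol.
    """
    supermatrix = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the real sequence if present, otherwise a block of '?'.
            supermatrix[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


# Hypothetical toy data: S3 has no sequence for gene 2.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA"}
gene2 = {"S1": "GGTAACCCTC", "S2": "GCTAAACCTC"}
matrix = concatenate([gene1, gene2], taxa=["S1", "S2", "S3"])
for taxon, row in matrix.items():
    print(taxon, row)
```

A single tree is then estimated from the combined matrix, which is what makes concatenation sensitive to genes that evolved under different trees.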

Page 7

Red gene tree ≠ species tree (green gene tree okay)

Page 8

Gene Tree Incongruence

Gene trees can differ from the species tree due to:
• Duplication and loss
• Horizontal gene transfer
• Incomplete lineage sorting (ILS)
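Whatever its biological cause, incongruence shows up as a difference in topology. The sketch below is one simple way to test for it, assuming trees are written as nested Python tuples of species names and compared as unrooted trees through their non-trivial bipartitions. The helper names and the example trees are illustrative and do not come from the slides.

```python
def leaf_set(tree):
    """Return the set of leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of a nested-tuple tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_leaves - side
        # Keep only splits with at least two leaves on each side.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


species_tree = ((("Human", "Chimpanzee"), "Gorilla"), "Orangutan")
discordant = ((("Human", "Gorilla"), "Chimpanzee"), "Orangutan")
congruent = (("Human", "Chimpanzee"), ("Gorilla", "Orangutan"))

print(bipartitions(discordant) == bipartitions(species_tree))  # False
print(bipartitions(congruent) == bipartitions(species_tree))   # True
```

The third tree is drawn with a different root but has the same unrooted topology as the species tree, so it does not count as incongruent under this comparison.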

Page 9

Incomplete Lineage Sorting (ILS)

1000+ papers in 2013 alone

Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi

There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Page 10

The Multi-species Coalescent Model

[Figure: illustration of the multi-species coalescent, with a time axis labeled Present and Past. Courtesy James Degnan]
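A standard worked case makes the model concrete. It is not taken from the slide and assumes a rooted three-species tree ((A,B),C) with one sampled lineage per species, where t is the length of the internal branch in coalescent units. Under the multi-species coalescent, the three possible rooted gene tree topologies then have these probabilities:

```latex
% Rooted species tree ((A,B),C); t = length of the internal branch
% above the ancestor of A and B, in coalescent units.
\begin{align*}
\Pr[\text{gene tree} = ((A,B),C)] &= 1 - \tfrac{2}{3}e^{-t},\\
\Pr[\text{gene tree} = ((A,C),B)] &= \tfrac{1}{3}e^{-t},\\
\Pr[\text{gene tree} = ((B,C),A)] &= \tfrac{1}{3}e^{-t}.
\end{align*}
```

As t shrinks toward 0 all three topologies approach probability 1/3, which is why short internal branches (rapid radiations) produce so much gene tree discordance.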

Page 11

Two competing approaches

[Figure: starting from the alignments for gene 1, gene 2, ..., gene k, the species tree is estimated in one of two ways: (1) Concatenation, in which the alignments are combined and analyzed together, or (2) Analyze separately, in which a tree is estimated for each gene and the gene trees are combined by a Summary Method.]

Page 12

Statistically consistent under MSC?

NO:
• MDC
• Greedy consensus
• Unpartitioned concatenation under maximum likelihood or maximum parsimony
• MRP (supertree method)

Unknown:
• Fully partitioned concatenation under maximum likelihood

YES:
• MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree – YES, but…
• BUCKy-pop (Larget et al. 2010): quartet-based Bayesian species tree estimation – YES, but…
• *BEAST and BEST (co-estimation of gene trees and species trees) – YES, but…
• SNAPP, SVDquartets (site-based analyses) – YES, but…
• STEM, STELLS, GLASS, METAL, etc. – YES, but…
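For reference, the property asked about on this slide can be written as a limit. The notation is introduced here for illustration and is not from the slides: T* is the true species tree and \hat{T}_k is the tree a method returns from k loci generated under the MSC.

```latex
% Statistical consistency under the multi-species coalescent (MSC):
% the estimate converges in probability to the true species tree
% as the number of loci k grows.
\lim_{k \to \infty} \Pr\bigl[\hat{T}_k = T^{*}\bigr] = 1
```

For summary methods, the usual proofs take the k input gene trees to be error-free.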

Page 13

Avian Phylogenomics Project

• Approx. 50 species, whole genomes
• 14,000 loci

Erich Jarvis, HHMI
G. Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…

Challenge:
• Species tree estimation under the multi-species coalescent model from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)

Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

Page 14

1kp: Thousand Transcriptome Project

Plant Tree of Life based on transcriptomes of ~800 loci and ~100 species

G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UT-Austin
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…

Challenge:
• Gene tree incongruence suggestive of ILS, but we were unable to use MP-EST due to dataset size and many incomplete gene trees (we used ASTRAL, Mirarab et al. 2014)

Wickett, Mirarab, et al., PNAS 2014

Page 15

Avian whole genomes phylogenies [Jarvis*, Mirarab*, et al., Science, 2014]

• International team of more than 100 researchers

• Whole genomes for 48 bird species (~100 million years of evolution)

• Goal: a phylogeny of major bird lineages

• Extremely challenging due to rampant gene tree incongruence

• Implications for traits such as vocal learning

• 14,000 “genes” (typically short and relatively conserved)


[Slide image: first page of the research article "Whole-genome analyses resolve early branches in the tree of life of modern birds" by Erich D. Jarvis, Siavash Mirarab, et al., Science, vol. 346, issue 6215, p. 1320, 12 December 2014. The text legible on the page follows.]

To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.

The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).

Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,

Page 16

Gene trees on the avian dataset

14,000 genes from avian genome-scale data [Jarvis*, Mirarab*, et al., Science, 2014]

[Figure: histogram of branch bootstrap support (x-axis, 0% to 100%) against the percentage of branches (y-axis, 0 to 20%), with the median and mean marked.]

A measure of confidence in estimated gene tree branches

Page 17

Avian-like simulations (1000 genes) [Mirarab, et al., Science, 2014]

[Figure: species tree topological error (FN) (y-axis, 5% to 25%) for MP-EST, a statistically consistent summary method, plotted against gene sequence length (x-axis: infinity (true gene trees), 1,500, 1,000, 500, 250). Shorter sequences correspond to more gene tree error. The bootstrap support histogram from the previous slide is shown alongside.]

Page 18

Gene tree error matters

[Figure repeated from the previous slide: species tree topological error (FN) of MP-EST against gene sequence length, avian-like simulations (1000 genes), Mirarab, et al., Science, 2014.]

[Ané, et al., MBE, 2007] [Patel, et al., MBE, 2013] [Gatesy, Springer, MPE, 2014] [Mirarab, et al., Systematic Biology, 2014]
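The error measure on the simulation plot, species tree topological error (FN), is the false negative rate: the fraction of internal branches (splits) of the true tree that are missing from the estimated tree. A minimal sketch follows, assuming each tree has already been reduced to its set of non-trivial splits. The helper names and the five-taxon example are hypothetical.

```python
def split(side, taxa):
    """Canonical form of the split side | rest, whichever side is named."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])


def fn_rate(true_splits, estimated_splits):
    """Fraction of true-tree splits that the estimated tree fails to recover."""
    if not true_splits:
        raise ValueError("true tree has no internal branches")
    missing = true_splits - estimated_splits
    return len(missing) / len(true_splits)


taxa = "ABCDE"
# True tree ((A,B),C,(D,E)) has the internal splits AB|CDE and DE|ABC.
true_tree = {split("AB", taxa), split("DE", taxa)}
# The estimate recovers AB|CDE but groups C with D instead.
estimated = {split("AB", taxa), split("CD", taxa)}

print(fn_rate(true_tree, estimated))  # 0.5
```

When both trees are fully resolved on the same taxa, the FN rate equals the normalized Robinson-Foulds distance.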

Page 19

The individual gene sequence alignments in the avian datasets have poor phylogenetic signal, and result in poorly estimated gene trees.

Species trees obtained by combining poorly estimated gene trees have poor accuracy.

There are no theoretical guarantees for summary methods except for perfectly correct gene trees.


Page 23

The individual gene sequence alignments in the avian datasets have poor phylogenetic signal, and result in poorly estimated gene trees.

Species trees obtained by combining poorly estimated gene trees have poor accuracy.

There are no theoretical guarantees for standard summary methods except for perfectly correct gene trees.

COMMON PHYLOGENOMICS PROBLEM: many poor gene trees

See: S. Roch and T. Warnow. “On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods”, Systematic Biology, 64(4):663-676, 2015

Page 24

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Species tree estimation: difficult, even for small datasets!

Page 25

Idea: combine best aspects of concatenation and summary methods

• Concatenation (fully partitioned) works fine when the concatenated data evolve under identical (or very similar) trees

• Some pairs of genes are not discordant (at least in topology)

• Concatenate “combinable” sets of genes into “supergenes” to increase the phylogenetic signal

• But how do we know which genes are combinable if we cannot estimate them correctly?
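One way to approach the question in the last bullet is to call two genes combinable unless their estimated trees conflict on a well-supported branch, and then to partition the genes so that no bin contains a conflicting pair. The sketch below assumes the conflicting pairs have already been computed and uses a simple greedy coloring that prefers the smallest compatible bin. The function name, the input format, and the heuristic are illustrative assumptions, not the published statistical binning implementation.

```python
def bin_genes(genes, incompatible_pairs):
    """Greedily assign genes to bins so that no bin holds an incompatible pair.

    genes: list of gene names.
    incompatible_pairs: iterable of (gene_a, gene_b) pairs whose trees
        conflict on a well-supported branch (assumed precomputed).
    """
    conflicts = {gene: set() for gene in genes}
    for a, b in incompatible_pairs:
        conflicts[a].add(b)
        conflicts[b].add(a)

    bins = []
    # Place the most-constrained genes first, a common coloring heuristic.
    for gene in sorted(genes, key=lambda g: len(conflicts[g]), reverse=True):
        candidates = [b for b in bins if not (conflicts[gene] & b)]
        if candidates:
            # Prefer the smallest compatible bin to keep bin sizes balanced.
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pairs = [("g1", "g2"), ("g1", "g3"), ("g4", "g5")]
bins = bin_genes(genes, pairs)
print(bins)
```

Each resulting bin would then be concatenated into a supergene alignment and analyzed to give one supergene tree per bin.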


Page 26

IMPORTANT: Supergene trees are computed using fully partitioned maximum likelihood.

Theorem: Weighted statistical binning is statistically consistent under the MSC.

Theorem: Unweighted statistical binning is not statistically consistent under the MSC.

Proofs in Bayzid, Mirarab, and Warnow, PLOS One, 2015. See also discussion in Warnow, PLOS Currents: Tree of Life 2015.
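The practical difference between the two variants lies in what is handed to the summary method. As a hedged sketch: in the weighted variant each supergene tree is repeated once per gene in its bin, while in the unweighted variant each supergene tree appears once. The function name and the Newick strings below are made up for illustration.

```python
def summary_method_input(bins, supergene_trees, weighted=True):
    """Build the list of trees handed to a summary method after binning.

    bins: list of lists of gene names.
    supergene_trees: one tree (here a Newick string) per bin, same order.
    weighted=True repeats each supergene tree once per gene in its bin;
    weighted=False uses each supergene tree exactly once.
    """
    if len(bins) != len(supergene_trees):
        raise ValueError("need exactly one supergene tree per bin")
    trees = []
    for genes, tree in zip(bins, supergene_trees):
        copies = len(genes) if weighted else 1
        trees.extend([tree] * copies)
    return trees


bins = [["g1", "g4", "g6"], ["g2", "g3"]]
supergene_trees = ["((A,B),(C,D));", "((A,C),(B,D));"]
weighted_input = summary_method_input(bins, supergene_trees)
unweighted_input = summary_method_input(bins, supergene_trees, weighted=False)
print(len(weighted_input), len(unweighted_input))  # 5 2
```

With weighting, the summary method sees as many trees as there were genes before binning, so a large bin counts for more than a small one.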

Page 27

Statistical binning vs. unbinned

Binning produces bins with approximately 5 to 7 genes each.
Datasets: 11-taxon strongILS datasets with 50 genes, Chung and Ané, Systematic Biology

[Figure: bar chart of average FN rate (y-axis, 0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, and GC. Legend: Unbinned, Statistical-75.]

Page 28

Comparing Binned and Un-binned MP-EST on the Avian Dataset

[Figure: two species trees side by side, "Binned MP-EST (unweighted/weighted)" and "Unbinned MP-EST", with branch support values and the clades Cursores, Columbea, Otidimorphae, and Australaves labeled. Some branches are marked "Conflict with other lines of strong evidence". Tip labels: Podiceps cristatus, Phoenicopterus ruber, Passeriformes, Psittaciformes, Falco peregrinus, Cariama cristata, Coraciimorphae, Accipitriformes, Tyto alba, Pelecanus crispus, Egretta garzetta, Nipponia nippon, Phalacrocorax carbo, Procellariimorphae, Gavia stellata, Phaethon lepturus, Eurypyga helias, Balearica regulorum, Charadrius vociferus, Opisthocomus hoazin, Calypte anna, Chaetura pelagica, Antrostomus carolinensis, Tauraco erythrolophus, Chlamydotis macqueenii, Cuculus canorus, Columba livia, Pterocles gutturalis, Mesitornis unicolor, Meleagris gallopavo, Gallus gallus, Anas platyrhynchos, Struthio camelus, Tinamus guttatus.]

Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al., Science 2014.

Page 29

Summary so far

Standard coalescent-based methods (such as MP-EST) have poor accuracy in the presence of gene tree error.

Statistical binning improves the estimation of gene tree distributions, and so:
• Improves species tree estimation
• Improves species tree branch lengths
• Reduces incidence of strongly supported false positive branches

Page 30: Analy&cal(and(computaonal(challenges(in( ((coalescentbased ...tandy.cs.illinois.edu/SMBE-Warnow.pdf · Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar de Investigação

   Summary  so  far  

Standard  coalescent-­‐based  methods  (such  as  MP-­‐EST)  have  poor  accuracy  in  the  presence  of  gene  tree  error.    Sta&s&cal  binning  improves  the  es&ma&on  of  gene  tree  distribu&ons,  and  so:    •  Improves  species  tree  es&ma&on  •  Improves  species  tree  branch  lengths  Reduces  

incidence  of  strongly  supported  false  posi&ve  branches  

Page 31: Analy&cal(and(computaonal(challenges(in( ((coalescentbased ...tandy.cs.illinois.edu/SMBE-Warnow.pdf · Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar de Investigação

   Summary  so  far  

Standard  coalescent-­‐based  methods  (such  as  MP-­‐EST)  have  poor  accuracy  in  the  presence  of  gene  tree  error.    Sta&s&cal  binning  improves  the  es&ma&on  of  gene  tree  distribu&ons,  and  so:    •  Improves  species  tree  es&ma&on  •  Improves  species  tree  branch  lengths    •  Reduces  incidence  of  strongly  supported  false  

posi&ve  branches  

Page 32: Analy&cal(and(computaonal(challenges(in( ((coalescentbased ...tandy.cs.illinois.edu/SMBE-Warnow.pdf · Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar de Investigação

Summary so far

Standard coalescent-based methods (such as MP-EST) have poor accuracy in the presence of gene tree error. Statistical binning improves the estimation of gene tree distributions, and so:

•  Improves species tree estimation
•  Improves species tree branch lengths
•  Reduces incidence of strongly supported false positive branches

See Mirarab et al. Science 2015 and Bayzid et al. PLOS One 2015


1KP: Plant whole transcriptomes [Wickett*, Mirarab*, et al., PNAS, 2014]


Phylotranscriptomic analysis of the origin and early diversification of land plants

Norman J. Wickett(a,b,1,2), Siavash Mirarab(c,1), Nam Nguyen(c), Tandy Warnow(c), Eric Carpenter(d), Naim Matasci(e,f), Saravanaraj Ayyampalayam(g), Michael S. Barker(f), J. Gordon Burleigh(h), Matthew A. Gitzendanner(h,i), Brad R. Ruhfel(h,j,k), Eric Wafula(l), Joshua P. Der(l), Sean W. Graham(m), Sarah Mathews(n), Michael Melkonian(o), Douglas E. Soltis(h,i,k), Pamela S. Soltis(h,i,k), Nicholas W. Miles(k), Carl J. Rothfels(p,q), Lisa Pokorny(p,r), A. Jonathan Shaw(p), Lisa DeGironimo(s), Dennis W. Stevenson(s), Barbara Surek(o), Juan Carlos Villarreal(t), Béatrice Roure(u), Hervé Philippe(u,v), Claude W. dePamphilis(l), Tao Chen(w), Michael K. Deyholos(d), Regina S. Baucom(x), Toni M. Kutchan(y), Megan M. Augustin(y), Jun Wang(z), Yong Zhang(v), Zhijian Tian(z), Zhixiang Yan(z), Xiaolei Wu(z), Xiao Sun(z), Gane Ka-Shu Wong(d,z,aa,2), and James Leebens-Mack(g,2)

(a) Chicago Botanic Garden, Glencoe, IL 60022; (b) Program in Biological Sciences, Northwestern University, Evanston, IL 60208; (c) Department of Computer Science, University of Texas, Austin, TX 78712; (d) Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E9; (e) iPlant Collaborative, Tucson, AZ 85721; (f) Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721; (g) Department of Plant Biology, University of Georgia, Athens, GA 30602; (h) Department of Biology and (i) Genetics Institute, University of Florida, Gainesville, FL 32611; (j) Department of Biological Sciences, Eastern Kentucky University, Richmond, KY 40475; (k) Florida Museum of Natural History, Gainesville, FL 32611; (l) Department of Biology, Pennsylvania State University, University Park, PA 16803; (m) Department of Botany and (q) Department of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4; (n) Arnold Arboretum of Harvard University, Cambridge, MA 02138; (o) Botanical Institute, Universität zu Köln, Cologne D-50674, Germany; (p) Department of Biology, Duke University, Durham, NC 27708; (r) Department of Biodiversity and Conservation, Real Jardín Botanico-Consejo Superior de Investigaciones Cientificas, 28014 Madrid, Spain; (s) New York Botanical Garden, Bronx, NY 10458; (t) Department fur Biologie, Systematische Botanik und Mykologie, Ludwig-Maximilians-Universitat, 80638 Munich, Germany; (u) Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Succursale Centre-Ville, Montreal, QC, Canada H3C 3J7; (v) CNRS, Station d’Ecologie Experimentale du CNRS, Moulis, 09200, France; (w) Shenzhen Fairy Lake Botanical Garden, The Chinese Academy of Sciences, Shenzhen, Guangdong 518004, China; (x) Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109; (y) Donald Danforth Plant Science Center, St. Louis, MO 63132; (z) BGI-Shenzhen, Bei shan Industrial Zone, Yantian District, Shenzhen 518083, China; and (aa) Department of Medicine, University of Alberta, Edmonton, AB, Canada T6G 2E1

Edited by Paul O. Lewis, University of Connecticut, Storrs, CT, and accepted by the Editorial Board September 29, 2014 (received for review December 23, 2013)

Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but inconsistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.

land plants | Streptophyta | phylogeny | phylogenomics | transcriptome

The origin of embryophytes (land plants) in the Ordovician period roughly 480 Mya (1–4) marks one of the most important events in the evolution of life on Earth. The early evolution of embryophytes in terrestrial environments was facilitated by numerous innovations, including parental protection for the developing embryo, sperm and egg production in multicellular protective structures, and an alternation of phases (often referred to as generations) in which a diploid sporophytic life history stage gives rise to a multicellular haploid gametophytic phase. With

Significance

Early branching events in the diversification of land plants and closely related algal lineages remain fundamental and unresolved questions in plant evolutionary biology. Accurate reconstructions of these relationships are critical for testing hypotheses of character evolution: for example, the origins of the embryo, vascular tissue, seeds, and flowers. We investigated relationships among streptophyte algae and land plants using the largest set of nuclear genes that has been applied to this problem to date. Hypothesized relationships were rigorously tested through a series of analyses to assess systematic errors in phylogenetic inference caused by sampling artifacts and model misspecification. Results support some generally accepted phylogenetic hypotheses, while rejecting others. This work provides a new framework for studies of land plant evolution.

Author contributions: N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., P.S.S., D.W.S., M.K.D., J.W., G.K.-S.W., and J.L.-M. designed research; N.J.W., S. Mirarab, N.N., T.W., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., B.R., H.P., C.W.d., T.C., M.K.D., M.M.A., J.W., Y.Z., Z.T., Z.Y., X.W., X.S., G.K.-S.W., and J.L.-M. performed research; S. Mirarab, N.N., T.W., N.M., S.A., M.S.B., J.G.B., M.A.G., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., H.P., C.W.d., T.C., M.K.D., R.S.B., T.M.K., M.M.A., J.W., Y.Z., G.K.-S.W., and J.L.-M. contributed new reagents/analytic tools; N.J.W., S. Mirarab, N.N., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., B.R., H.P., and J.L.-M. analyzed data; N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., D.W.S., H.P., G.K.-S.W., and J.L.-M. wrote the paper; and N.M. archived data.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. P.O.L. is a guest editor invited by the Editorial Board.

Freely available online through the PNAS open access option.

Data deposition: The sequences reported in this paper have been deposited in the iplant datastore database, mirrors.iplantcollaborative.org/onekp_pilot, and the National Center for Biotechnology Information Sequence Read Archive, www.ncbi.nlm.nih.gov/sra [accession no. PRJEB4921 (ERP004258)].

(1) N.J.W. and S. Mirarab contributed equally to this work.

(2) To whom correspondence may be addressed. Email: [email protected], [email protected], or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323926111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1323926111 PNAS Early Edition | 1 of 10


• Whole transcriptomes for 103 plant species

• 1,200 in the next phase

• 400-800 single copy “genes”

• Spans ~1 billion years of evolution

• Many unanswered questions about plant evolution

Summary methods on the 1KP data (103 plants)

[Figure: histogram of branch bootstrap support (x-axis, 0% to 100%) against branches (percentage) (y-axis, 0% to 8%), with median and mean marked. 400 genes from 1KP data [Wickett*, Mirarab*, et al., PNAS, 2014]]

• Existing summary methods produced species trees with low support and unbelievable relationships

• ... despite having gene trees with relatively high bootstrap support

• Our simulation studies showed that the reason had to do with the number of taxa

The problem size (# species) matters too!

[Figure: 1000 simulated genes, “medium” levels of ILS [Mirarab and Warnow, ISMB, 2015]]

1KP: Thousand Transcriptome Project

•  1200 plant transcriptomes
•  More than 13,000 gene families (most not single copy)
•  Gene sequence alignments and trees computed using SATe (Liu et al., Science 2009 and Systematic Biology 2012)

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, S. Mirarab, N. Nguyen, Md. S. Bayzid, UT-Austin

Plus many other people…

MP-EST could not be used: dataset too large, and requirement that all gene trees be rooted correctly was also a problem.

We used ASTRAL to estimate a coalescent-based species tree.

1KP paper by Wickett, Mirarab et al., PNAS 2014


ASTRAL’s approach

•  Input: set of unrooted gene trees T1, T2, …, Tk and set X of bipartitions on species set S

•  Output: Tree T* maximizing the total quartet-similarity score to the unrooted gene trees, subject to Bipartitions(T*) drawn from X

Theorem: ASTRAL is statistically consistent under the multi-species coalescent model, and runs in polynomial time.
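The quartet-similarity score in the Output bullet can be illustrated on toy trees. The sketch below is a brute-force illustration only, not ASTRAL's algorithm: it assumes each unrooted tree is given as a list of bipartitions (one side of each internal edge, as a frozenset), and the function names `quartet_topology` and `quartet_score` are hypothetical. It scores one candidate tree; it does not search for T* or enforce the constraint set X.

```python
from itertools import combinations

def quartet_topology(splits, quartet):
    """Return the topology a tree induces on four taxa, or None if unresolved.

    `splits` holds one side of each internal edge as a frozenset of taxa.
    The result is a frozenset of two pairs, e.g. {{a, b}, {c, d}} for ab|cd.
    """
    q = frozenset(quartet)
    for side in splits:
        inside = q & side
        if len(inside) == 2:  # this edge separates the quartet 2 vs 2
            return frozenset([inside, q - inside])
    return None

def quartet_score(candidate_splits, gene_trees, taxa):
    """Count (quartet, gene tree) pairs on which candidate and gene tree agree."""
    score = 0
    for quartet in combinations(sorted(taxa), 4):
        topo = quartet_topology(candidate_splits, quartet)
        if topo is None:
            continue
        for gene_splits in gene_trees:
            if quartet_topology(gene_splits, quartet) == topo:
                score += 1
    return score

# Toy data (assumed for illustration): five taxa, two gene trees.
taxa = {"A", "B", "C", "D", "E"}
candidate = [frozenset("AB"), frozenset("DE")]   # ((A,B),C,(D,E))
gene_tree_1 = [frozenset("AB"), frozenset("DE")] # same topology
gene_tree_2 = [frozenset("AC"), frozenset("DE")] # ((A,C),B,(D,E))
total = quartet_score(candidate, [gene_tree_1, gene_tree_2], taxa)
print(total)
```

Enumerating all quartets takes time proportional to n^4 per gene tree, so this only works for small examples; the polynomial running time in the theorem comes from restricting the search to bipartitions in X.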


ASTRAL vs. MP-EST and Concatenation
200 genes, 500bp

[Figure: missing branch rate (y-axis, 0.00 to 0.15) for MP-EST, ASTRAL, and Concatenation-ML at ILS levels 0.2X, 0.5X, 1X, 2X, 5X (x-axis, annotated "Less ILS").]

Mammalian Simulation Study, Varying ILS level
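The missing branch rate on the y-axis is commonly computed as the fraction of internal branches (bipartitions) of the true tree that are absent from the estimated tree. A minimal sketch under that assumption, with trees given as lists of bipartitions; the helper names `normalize` and `missing_branch_rate` are hypothetical:

```python
def normalize(split, taxa):
    """Canonical form of a bipartition: the side without the reference taxon."""
    ref = min(taxa)
    side = frozenset(split)
    return frozenset(taxa) - side if ref in side else side

def missing_branch_rate(true_splits, estimated_splits, taxa):
    """Fraction of true-tree bipartitions not found in the estimated tree."""
    true_set = {normalize(s, taxa) for s in true_splits}
    est_set = {normalize(s, taxa) for s in estimated_splits}
    if not true_set:
        return 0.0
    return len(true_set - est_set) / len(true_set)

# Toy data (assumed for illustration).
taxa = {"A", "B", "C", "D", "E"}
true_tree = [{"A", "B"}, {"D", "E"}]
estimated_tree = [{"A", "C"}, {"A", "B", "C"}]  # {A,B,C} is the same edge as {D,E}
rate = missing_branch_rate(true_tree, estimated_tree, taxa)
print(rate)
```

Normalizing each bipartition to one canonical side means that an edge written as either of its two sides is still recognized as the same branch.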

Two coalescent-based analyses of the Song et al. mammals dataset


ASTRAL on plants dataset

• The ASTRAL tree:

• High support

• Similar to concatenation with some interesting differences (e.g., recovered bryophytes)

• ASTRAL took only about 10 minutes (serial running time) on 103 taxa and 400 genes


estimated using CAT+GTR+Gamma models on nucleotide (first and second codon positions) and amino acid alignments suggests that this model may still be too simple for concatenated alignments relative to the true gene coalescence and substitution processes (see also ref. 72). The placement of hornworts and a moss+liverwort clade as successively sister to vascular plants is consistent with analyses based on morphological and developmental characters (78, 79), including dextral sperm in hornworts rather than sinistral sperm, as in all other land plants, and the retention of the pyrenoid, a plastid structure that is the site of RUBISCO localization, shared by hornworts and streptophytic algae (reviewed in ref. 80). The possibility that some of these trait mappings are the product of evolutionary convergence should also be considered, and seems likely in the case of the pyrenoid (81). The significance of other morphological similarities is also not yet clear. For example, the development of gametangia in hornworts resembles antheridial (44, 82) and archegonial (82) development in monilophytes, whereas those of the liverworts and mosses are autapomorphic, suggesting a closer relationship between hornworts and vascular plants. A comparison can also be made with respect to the development of the embryo and the young sporophyte. The hornwort embryo and sporophyte have no apical growth at any stage, but rather exhibit an intercalary meristem. In contrast, mosses and monilophytes have apical growth on both ends of the sporophyte, although basal apical growth is ephemeral in the former. The possibility of multiple origins of the multicellular sporophyte in land plants can therefore be considered (83): once with intercalary growth, as in the hornworts, and once with apical growth, as in mosses and tracheophytes (liverworts have neither intercalary nor apical growth). Ultimately, this finding underscores the difficulty in placing hornworts, or bryophytes in general, within the phylogeny of land plants based on current evidence from morphology alone.

In summary, three primary hypotheses emerge from our analyses with respect to the resolution of the earliest branching events in land plant phylogeny: (i) (hornworts, ((liverworts, mosses), vascular plants)) supported in most ML analyses of nucleotide and amino acid supermatrices; (ii) [(liverworts, mosses), (hornworts, vascular plants)], supported by the PhyloBayes analysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]), vascular plants], supported by supertree and ASTRAL analyses of amino acids and first and second codon positions and some amino acid supermatrix analyses. However, we cannot dismiss alternative hypotheses recovered by some of our analyses, including [mosses (liverworts [hornworts, vascular plants])], which is supported by the PhyloBayes analysis of first and second codon positions (Fig. 4). Caution should be taken in rejecting any of these hypotheses given the sparse sampling, especially for the hornworts.

Monilophyte and Lycophyte Relationships. Phylogenetic analyses of multigene (generally plastid) datasets (84–87) have consistently resolved the lycophytes and monilophytes as successive sister lineages to the seed plants, with the euphyllophytes comprising the seed-free monilophytes (ferns) and seed-bearing spermatophytes. Aside from the clearly artifactual placement of Selaginella as sister to all other land plants in analyses including third codon positions (mirrors.iplantcollaborative.org/onekp_pilot), our results support this branching order (Figs. 2 and 3; other species trees at mirrors.iplantcollaborative.org/onekp_pilot). The placement of Selaginella has been problematic in previous analyses (49) and we interpret its misplacement in several analyses here as a consequence of GC content at the third codon position, which is more similar to streptophyte algae than to embryophytes (Fig. S2).

[Fig. 4 matrix: columns are analyses, grouped by matrix type (Supermatrix, ASTRAL), alignment (AA, DNA to AA, DNA), and codon positions (1 and 2, all, NA). Rows are hypotheses, grouped as: Sister to land plants (Zygnematophyceae-sister, Charales-sister, Coleochaetales-sister); Bryophytes (Mosses + liverworts, Bryophytes monophyletic, Hornworts-sister, Hornworts-basal, Liverworts-basal); Gymnosperms (Gnepine, Conifers monophyletic, Gnetifer, Gnetales-sister); Angiosperms (Eudicots + magnoliids, Eudicots + mag/Chlor, Magnoliids + Chloranthales, Mag + Chlor, monocots, Monocots + eudicots); ANA-grade angiosperms (Amborella + Nuphar, Amborella-sister). Cell legend: Strong Support, Weak Support, Compatible (Weak Rejection), Strong Rejection.]

Fig. 4. Summary of support for hypotheses of land plant relationships across 52 supermatrix and coalescent-based analyses including permutations of the full data matrix (Table S2). Occupancy-based gene trimming was carried out by removing genes for which >50% of the full taxon set were not included in the alignment. Site trimming removed columns in the alignment for which >50% of the full taxon set were represented by gap characters. Long-branch trimming was performed on gene trees when a terminal branch was 25-times longer than the median branch length. More stringent, blast-based removal of sequences identified as possible contaminants resulted in a set of 604 gene families (see SI Materials and Methods for stringent filtering strategy). Supermatrix analyses were done with and without partitioning of genes into model parameter classes. Filtering and analysis strategies are indicated below each column and include combinations of: (i) untrim: untrimmed/unfiltered data; (ii) unpart: no data partitions; (iii) 50genes: occupancy-based gene trimming at 50%; (iv) 50sites: occupancy-based site trimming at 50%; (v) gamma: full Gamma (vs. PSR approximation of Gamma); (vi) Chara: gene trimming to exclude genes not present in Chara vulgaris; (vii) 25X: long branch trimming; (viii) 33taxa: sequences with more than 66% gaps in gene alignments removed; (ix) 604genes.trimEXT: aggressive BLAST-based and long-branch filtering of sequence assemblies for each taxon followed by GBLOCKS filtering of sites (SI Materials and Methods). Strong support refers to bootstrap values above 75% for a clade containing the specified taxa. All trees and alignments are available in iPlant’s data store (mirrors.iplantcollaborative.org/onekp_pilot).
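The three trimming rules in the caption (50genes, 50sites, 25X) can be sketched as simple filters. This is an illustration of the stated thresholds only, not the authors' pipeline: the function names are hypothetical, the alignment is assumed to be a dict with one equal-length row per taxon of the full taxon set, and the median is assumed to be taken over all branch lengths of the gene tree.

```python
from statistics import median

def trim_sites(alignment, max_gap_fraction=0.5, gap="-"):
    """Drop columns in which more than max_gap_fraction of the rows are gaps."""
    names = list(alignment)
    rows = [alignment[n] for n in names]
    keep = [i for i in range(len(rows[0]))
            if sum(r[i] == gap for r in rows) / len(rows) <= max_gap_fraction]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

def keep_gene(gene_taxa, full_taxa, max_missing_fraction=0.5):
    """Keep a gene unless more than max_missing_fraction of taxa lack it."""
    missing = len(set(full_taxa) - set(gene_taxa))
    return missing / len(full_taxa) <= max_missing_fraction

def long_terminal_branches(terminal_lengths, all_lengths, factor=25):
    """List taxa whose terminal branch exceeds factor times the median length."""
    cutoff = factor * median(all_lengths)
    return [t for t, length in terminal_lengths.items() if length > cutoff]

# Toy data (assumed for illustration).
aln = {"t1": "AC-G", "t2": "A--G", "t3": "AT-G", "t4": "-T-G"}
trimmed = trim_sites(aln)
flagged = long_terminal_branches({"x": 0.1, "y": 3.0}, [0.1, 0.1, 0.1, 3.0, 0.1])
print(trimmed, flagged)
```

The comparisons are strict, matching the caption's ">50%" wording: a column or gene sitting at exactly 50% is kept.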

6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1323926111 Wickett et al.

estimated using CAT+GTR+Gamma models on nucleotide (firstand second codon positions) and amino acid alignments suggeststhat this model may still be too simple for concatenated alignmentsrelative to the true gene coalescence and substitution processes (seealso ref. 72). The placement of hornworts and a moss+liverwortclade as successively sister to vascular plants is consistent withanalyses based on morphological and developmental characters (78,79), including dextral sperm in hornworts rather than sinistralsperm, as in all other land plants, and the retention of the pyrenoid,a plastid structure that is the site of RUBISCO localization, sharedby hornworts and streptophytic algae (reviewed in ref. 80). Thepossibility that some of these trait mappings are the product ofevolutionary convergence should also be considered, and seemslikely in the case of the pyrenoid (81). The significance of othermorphological similarities is also not yet clear. For example, thedevelopment of gametangia in hornworts resembles antheridial (44,82) and archegonial (82) development in monilophytes, whereasthose of the liverworts and mosses are autapomorphic, suggestinga closer relationship between hornworts and vascular plants. Acomparison can also be made with respect to the development ofthe embryo and the young sporophyte. The hornwort embryo andsporophyte have no apical growth at any stage, but rather exhibit anintercalary meristem. In contrast, mosses and monilophytes haveapical growth on both ends of the sporophyte, although basal apicalgrowth is ephemeral in the former. The possibility of multiple ori-gins of the multicellular sporophyte in land plants can therefore beconsidered (83): once with intercalary growth, as in the hornworts,and once with apical growth, as in mosses and tracheophytes (liv-erworts have neither intercalary nor apical growth). 
Ultimately, thisfinding underscores the difficulty in placing hornworts—or bryo-phytes in general—within the phylogeny of land plants based oncurrent evidence from morphology alone.

In summary, three primary hypotheses emerge from ouranalyses with respect to the resolution of the earliest branchingevents in land plant phylogeny: (i) (hornworts, ((liverworts,mosses), vascular plants)) supported in most ML analyses of nu-cleotide and amino acid supermatrices; (ii) [(liverworts, mosses),(hornworts, vascular plants)], supported by the PhyloBayes anal-ysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]),vascular plants], supported by supertree and ASTRAL analyses ofamino acids and first and second codon positions and some aminoacid supermatrix analyses. However, we cannot dismiss alternativehypotheses recovered by some of our analyses, including [mosses(liverworts [hornworts, vascular plants])], which is supported bythe PhyloBayes analysis of first and second codon positions (Fig.4). Caution should be taken in rejecting any of these hypothesesgiven the sparse sampling, especially for the hornworts.

Monilophyte and Lycophyte Relationships. Phylogenetic analyses ofmultigene (generally plastid) datasets (84–87) have consistentlyresolved the lycophytes and monilophytes as successive sister lin-eages to the seed plants, with the euphyllophytes comprising theseed-free monilophytes (ferns) and seed-bearing spermatophytes.Aside from the clearly artifactual placement of Selaginella as sisterto all other land plants in analyses including third codonpositions (mirrors.iplantcollaborative.org/onekp_pilot), ourresults support this branching order (Figs. 2 and 3; otherspecies trees at mirrors.iplantcollaborative.org/onekp_pilot).The placement of Selaginella has been problematic in previousanalyses (49) and we interpret its misplacement in several ana-lyses here as a consequence of GC content at the third codonposition, which is more similar to streptophyte algae than toembryophytes (Fig. S2).

[Fig. 4 image not reproduced. Columns are grouped by matrix type (ASTRAL or Supermatrix), alignment (AA, DNA to AA, or DNA), and codon positions (1 and 2, all, or NA), with the filtering strategy labeled beneath each column. Rows are hypotheses, grouped as follows. Sister to land plants: Zygnematophyceae-sister; Charales-sister; Coleochaetales-sister. Bryophytes: Mosses + liverworts; Bryophytes monophyletic; Hornworts-sister; Hornworts-basal; Liverworts-basal. Gymnosperms: Gnepine; Conifers monophyletic; Gnetifer; Gnetales-sister. Angiosperms: Eudicots + magnoliids; Eudicots + mag/Chlor; Magnoliids + Chloranthales; Mag + Chlor, monocots; Monocots + eudicots. ANA-grade angiosperms: Amborella + Nuphar; Amborella-sister. Cell legend: Strong Support, Weak Support, Compatible (Weak Rejection), Strong Rejection.]

Fig. 4. Summary of support for hypotheses of land plant relationships across 52 supermatrix and coalescent-based analyses including permutations of the full data matrix (Table S2). Occupancy-based gene trimming was carried out by removing genes for which >50% of the full taxon set were not included in the alignment. Site trimming removed columns in the alignment for which >50% of the full taxon set were represented by gap characters. Long-branch trimming was performed on gene trees when a terminal branch was 25 times longer than the median branch length. More stringent, BLAST-based removal of sequences identified as possible contaminants resulted in a set of 604 gene families (see SI Materials and Methods for stringent filtering strategy). Supermatrix analyses were done with and without partitioning of genes into model parameter classes. Filtering and analysis strategies are indicated below each column and include combinations of: (i) untrim: untrimmed/unfiltered data; (ii) unpart: no data partitions; (iii) 50genes: occupancy-based gene trimming at 50%; (iv) 50sites: occupancy-based site trimming at 50%; (v) gamma: full Gamma (vs. PSR approximation of Gamma); (vi) Chara: gene trimming to exclude genes not present in Chara vulgaris; (vii) 25X: long-branch trimming; (viii) 33taxa: sequences with more than 66% gaps in gene alignments removed; (ix) 604genes.trimEXT: aggressive BLAST-based and long-branch filtering of sequence assemblies for each taxon followed by GBLOCKS filtering of sites (SI Materials and Methods). Strong support refers to bootstrap values above 75% for a clade containing the specified taxa. All trees and alignments are available in iPlant’s data store (mirrors.iplantcollaborative.org/onekp_pilot).
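The three trimming rules in the caption can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary-based alignment representation, and the choice to count taxa absent from a gene as gaps during site trimming are all assumptions made for illustration, as is taking the median over all branch lengths supplied by the caller.

```python
from statistics import median

GAP = "-"  # assumed gap character


def occupancy_filter_genes(alignments, n_taxa, max_missing=0.5):
    """Drop genes in which more than max_missing of the full taxon set is absent.

    alignments maps gene name -> {taxon name: aligned sequence} (hypothetical layout).
    """
    kept = {}
    for gene, aln in alignments.items():
        missing_fraction = 1 - len(aln) / n_taxa
        if missing_fraction <= max_missing:
            kept[gene] = aln
    return kept


def trim_sites(aln, n_taxa, max_gap=0.5):
    """Drop columns in which more than max_gap of the full taxon set is a gap.

    Assumption: taxa missing from this gene are counted as gaps in every column.
    """
    if not aln:
        return {}
    names = list(aln)
    length = len(aln[names[0]])
    absent = n_taxa - len(names)
    keep = [
        i for i in range(length)
        if (sum(aln[name][i] == GAP for name in names) + absent) / n_taxa <= max_gap
    ]
    return {name: "".join(aln[name][i] for i in keep) for name in names}


def long_branch_taxa(terminal_lengths, all_lengths, factor=25):
    """Return taxa whose terminal branch exceeds factor times the median branch length."""
    threshold = factor * median(all_lengths)
    return sorted(t for t, length in terminal_lengths.items() if length > threshold)


# Toy data, invented for illustration only.
genes = {
    "g1": {"A": "AC-T", "B": "A--T", "C": "A--T", "D": "ACGT"},
    "g2": {"A": "ACGT"},  # present in only 1 of 4 taxa
}
kept = occupancy_filter_genes(genes, n_taxa=4)
trimmed = trim_sites(kept["g1"], n_taxa=4)
flagged = long_branch_taxa({"A": 0.1, "B": 3.0, "C": 0.05}, [0.1, 3.0, 0.05, 0.1, 0.08])
```

In this toy input, gene g2 is dropped because three of the four taxa are missing, the third column of g1 is dropped because three of four entries are gaps, and taxon B is flagged because its terminal branch is more than 25 times the median length.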

6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1323926111 Wickett et al.

estimated using CAT+GTR+Gamma models on nucleotide (first and second codon positions) and amino acid alignments suggests that this model may still be too simple for concatenated alignments relative to the true gene coalescence and substitution processes (see also ref. 72). The placement of hornworts and a moss+liverwort clade as successively sister to vascular plants is consistent with analyses based on morphological and developmental characters (78, 79), including dextral sperm in hornworts rather than sinistral sperm, as in all other land plants, and the retention of the pyrenoid, a plastid structure that is the site of RUBISCO localization, shared by hornworts and streptophytic algae (reviewed in ref. 80). The possibility that some of these trait mappings are the product of evolutionary convergence should also be considered, and seems likely in the case of the pyrenoid (81). The significance of other morphological similarities is also not yet clear. For example, the development of gametangia in hornworts resembles antheridial (44, 82) and archegonial (82) development in monilophytes, whereas those of the liverworts and mosses are autapomorphic, suggesting a closer relationship between hornworts and vascular plants. A comparison can also be made with respect to the development of the embryo and the young sporophyte. The hornwort embryo and sporophyte have no apical growth at any stage, but rather exhibit an intercalary meristem. In contrast, mosses and monilophytes have apical growth on both ends of the sporophyte, although basal apical growth is ephemeral in the former. The possibility of multiple origins of the multicellular sporophyte in land plants can therefore be considered (83): once with intercalary growth, as in the hornworts, and once with apical growth, as in mosses and tracheophytes (liverworts have neither intercalary nor apical growth). Ultimately, this finding underscores the difficulty in placing hornworts, or bryophytes in general, within the phylogeny of land plants based on current evidence from morphology alone.

In summary, three primary hypotheses emerge from our analyses with respect to the resolution of the earliest branching events in land plant phylogeny: (i) (hornworts, ((liverworts, mosses), vascular plants)), supported in most ML analyses of nucleotide and amino acid supermatrices; (ii) [(liverworts, mosses), (hornworts, vascular plants)], supported by the PhyloBayes analysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]), vascular plants], supported by supertree and ASTRAL analyses of amino acids and first and second codon positions and some amino acid supermatrix analyses. However, we cannot dismiss alternative hypotheses recovered by some of our analyses, including [mosses (liverworts [hornworts, vascular plants])], which is supported by the PhyloBayes analysis of first and second codon positions (Fig. 4). Caution should be taken in rejecting any of these hypotheses given the sparse sampling, especially for the hornworts.
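To make the differences among these topologies concrete, the sketch below encodes each hypothesis as a nested tuple and lists which groupings appear as clades. It is only an illustration: the tuple encoding, the function names, and the treatment of each hypothesis as a rooted tree of the four lineages are assumptions, not part of the original analyses.

```python
def clades(tree):
    """Return the leaf sets of all internal nodes below the root of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(leaves(child) for child in node))
        found.add(tips)
        return tips

    root = leaves(tree)
    found.discard(root)  # the full leaf set is trivially a clade
    return found


def supports(tree, group):
    """True if the taxa in group form a clade in tree."""
    return frozenset(group) in clades(tree)


# Topologies as written in the text; labels "i", "ii", "iii", "alt" are ad hoc keys.
HYPOTHESES = {
    "i": ("hornworts", (("liverworts", "mosses"), "vascular plants")),
    "ii": (("liverworts", "mosses"), ("hornworts", "vascular plants")),
    "iii": (("hornworts", ("mosses", "liverworts")), "vascular plants"),
    "alt": ("mosses", ("liverworts", ("hornworts", "vascular plants"))),
}

moss_liverwort = {k: supports(t, ["mosses", "liverworts"]) for k, t in HYPOTHESES.items()}
bryophytes = {
    k: supports(t, ["hornworts", "mosses", "liverworts"]) for k, t in HYPOTHESES.items()
}
hornwort_vascular = {
    k: supports(t, ["hornworts", "vascular plants"]) for k, t in HYPOTHESES.items()
}
```

Under this encoding, a moss+liverwort clade appears in hypotheses (i) through (iii) but not in the alternative, bryophyte monophyly appears only in (iii), and a hornwort+vascular plant clade appears in (ii) and in the alternative.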

Monilophyte and Lycophyte Relationships. Phylogenetic analyses ofmultigene (generally plastid) datasets (84–87) have consistentlyresolved the lycophytes and monilophytes as successive sister lin-eages to the seed plants, with the euphyllophytes comprising theseed-free monilophytes (ferns) and seed-bearing spermatophytes.Aside from the clearly artifactual placement of Selaginella as sisterto all other land plants in analyses including third codonpositions (mirrors.iplantcollaborative.org/onekp_pilot), ourresults support this branching order (Figs. 2 and 3; otherspecies trees at mirrors.iplantcollaborative.org/onekp_pilot).The placement of Selaginella has been problematic in previousanalyses (49) and we interpret its misplacement in several ana-lyses here as a consequence of GC content at the third codonposition, which is more similar to streptophyte algae than toembryophytes (Fig. S2).

Matrix type

Alignment

Codon positions

ASTRAL

AA AADNA to AA DNA to AA DNA

1 and 2 1 and 2all allNA NA NA

DNA

NA

Supermatrix

Zygnematophyceae-sisterCharales-sister

Coleochaetales-sister

Sister to land plants

Mosses + liverwortsBryophytes monophyletic

Hornworts-sister

Hornworts-basalLiverworts-basal

Bryophytes

GnepineConifers monophyletic

GnetiferGnetales-sister

Gymnosperms

Eudicots + magnoliidsEudicots + mag/Chlor

Magnoliids + ChloranthalesMag + Chlor, monocots

Monocots + eudicots

Angiosperms

Amborella + NupharAmborella-sister

ANA-grade angiosperms

untr

im.u

npar

t50

gene

s.un

part

50ge

nes5

0site

s.un

part

50ge

nes5

0site

s.pa

rt50

gene

s50s

ites.

gam

ma.

part

50ge

nesC

hara

.unp

art

50ge

nes5

0site

s.25

X.u

npar

t50

gene

s33t

axa.

unpa

rt60

4gen

es.tr

imE

xt.u

npar

t60

4gen

es.tr

imE

xt.g

amm

a.un

part

604g

enes

.trim

Ext

.Bay

es.C

ATG

TR

604g

enes

.trim

Ext

.Bay

es.C

AT

untr

im.u

npar

t50

gene

s.un

part

50ge

nes5

0site

s.un

part

50ge

nes5

0site

s.pa

rt50

gene

s50s

ites.

gam

ma.

part

50ge

nesC

hara

.unp

art

50ge

nes5

0site

s.25

X.u

npar

t50

gene

s33t

axa.

unpa

rt60

4gen

es.tr

imE

xt.u

npar

t60

4gen

es.tr

imE

xt.g

amm

a.un

part

604g

enes

.trim

Ext

.Bay

es.C

ATG

TR

untr

im.u

npar

t50

gene

s.un

part

50ge

nes5

0site

s.un

part

50ge

nes5

0site

s.pa

rt50

gene

s50s

ites.

gam

ma.

part

50ge

nesC

hara

.unp

art

50ge

nes5

0site

s.25

X.u

npar

t50

gene

s33t

axa.

unpa

rt60

4gen

es.tr

imE

xt.u

npar

t60

4gen

es.tr

imE

xt.g

amm

a.un

part

50ge

nes5

0site

s.un

part

50ge

nes5

0site

s25X

.unp

art

untr

im50

gene

s50

gene

s.25

X50

gene

s33t

axa

untr

imun

trim

.gam

ma

50ge

nes

50ge

nes.

25X

50ge

nes3

3tax

a

untr

imun

trim

.gam

ma

50ge

nes

50ge

nes.

25X

50ge

nes3

3tax

a

untr

im50

gene

s50

gene

s.25

X

Strong Support Weak Support Compatible (Weak Rejection) Strong Rejection

Fig. 4. Summary of support for hypotheses of land plant relationships across 52 supermatrix and coalescent-based analyses including permutations of the fulldata matrix (Table S2). Occupancy-based gene trimming was carried out by removing genes for which >50% the full taxon set were not included in thealignment. Site trimming removed columns in the aligmment for which >50% of the full taxon set were represented by gap characters. Long-branch trimmingwas performed on gene trees when a terminal branch was 25-times longer than the median branch length. More stringent, blast-based removal of sequencesidentified as possible contaminants resulted in a set of 604 gene families (see SI Materials and Methods for stringent filtering strategy). Supermatrix analyseswere done with and without partitioning of genes into model parameter classes. Filtering and analysis strategies are indicated below each column andinclude combinations of: (i) untrim: untrimmed/unfiltered data; (ii) unpart: no data partitions; (iii) 50genes: occupancy-based gene trimming at 50%; (iv)50sites: occupancy-based site trimming at 50%; (v) gamma: full Gamma (vs. PSR approximation of Gamma); (vi) Chara: gene trimming to exclude genes notpresent in Chara vulgaris; (vii) 25X: long branch trimming; (viii) 33taxa: sequences with more than 66% gaps in gene alignments removed; (ix) 604genes.trimEXT: aggressive BLAST-based and long-branch filtering of sequence assemblies for each taxon followed by GBLOCKS filtering of sites (SI Materials andMethods). Strong support refers to bootstrap values above 75% for a clade containing the specified taxa. All trees and alignments are available in iPlant’sdata store (mirrors.iplantcollaborative.org/onekp_pilot).

6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1323926111 Wickett et al.

estimated using CAT+GTR+Gamma models on nucleotide (first and second codon positions) and amino acid alignments suggests that this model may still be too simple for concatenated alignments relative to the true gene coalescence and substitution processes (see also ref. 72). The placement of hornworts and a moss+liverwort clade as successively sister to vascular plants is consistent with analyses based on morphological and developmental characters (78, 79), including dextral sperm in hornworts rather than sinistral sperm, as in all other land plants, and the retention of the pyrenoid, a plastid structure that is the site of RUBISCO localization, shared by hornworts and streptophytic algae (reviewed in ref. 80). The possibility that some of these trait mappings are the product of evolutionary convergence should also be considered, and seems likely in the case of the pyrenoid (81). The significance of other morphological similarities is also not yet clear. For example, the development of gametangia in hornworts resembles antheridial (44, 82) and archegonial (82) development in monilophytes, whereas those of the liverworts and mosses are autapomorphic, suggesting a closer relationship between hornworts and vascular plants. A comparison can also be made with respect to the development of the embryo and the young sporophyte. The hornwort embryo and sporophyte have no apical growth at any stage, but rather exhibit an intercalary meristem. In contrast, mosses and monilophytes have apical growth on both ends of the sporophyte, although basal apical growth is ephemeral in the former. The possibility of multiple origins of the multicellular sporophyte in land plants can therefore be considered (83): once with intercalary growth, as in the hornworts, and once with apical growth, as in mosses and tracheophytes (liverworts have neither intercalary nor apical growth). Ultimately, this finding underscores the difficulty in placing hornworts, or bryophytes in general, within the phylogeny of land plants based on current evidence from morphology alone.

In summary, three primary hypotheses emerge from our analyses with respect to the resolution of the earliest branching events in land plant phylogeny: (i) (hornworts, ((liverworts, mosses), vascular plants)) supported in most ML analyses of nucleotide and amino acid supermatrices; (ii) [(liverworts, mosses), (hornworts, vascular plants)], supported by the PhyloBayes analysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]), vascular plants], supported by supertree and ASTRAL analyses of amino acids and first and second codon positions and some amino acid supermatrix analyses. However, we cannot dismiss alternative hypotheses recovered by some of our analyses, including [mosses (liverworts [hornworts, vascular plants])], which is supported by the PhyloBayes analysis of first and second codon positions (Fig. 4). Caution should be taken in rejecting any of these hypotheses given the sparse sampling, especially for the hornworts.

Monilophyte and Lycophyte Relationships. Phylogenetic analyses of multigene (generally plastid) datasets (84–87) have consistently resolved the lycophytes and monilophytes as successive sister lineages to the seed plants, with the euphyllophytes comprising the seed-free monilophytes (ferns) and seed-bearing spermatophytes. Aside from the clearly artifactual placement of Selaginella as sister to all other land plants in analyses including third codon positions (mirrors.iplantcollaborative.org/onekp_pilot), our results support this branching order (Figs. 2 and 3; other species trees at mirrors.iplantcollaborative.org/onekp_pilot). The placement of Selaginella has been problematic in previous analyses (49) and we interpret its misplacement in several analyses here as a consequence of GC content at the third codon position, which is more similar to streptophyte algae than to embryophytes (Fig. S2).

[Fig. 4 heat map; the cell shading is not recoverable from the extracted text. Column headers: matrix type (ASTRAL, Supermatrix), alignment (AA, DNA to AA, DNA), and codon positions (1 and 2, all, NA). Column labels combine the filtering codes defined in the caption, e.g. untrim.unpart, 50genes50sites.part, 604genes.trimExt.Bayes.CATGTR. Rows list the hypotheses tested, by group:]

• Sister to land plants: Zygnematophyceae-sister; Charales-sister; Coleochaetales-sister
• Bryophytes: Mosses + liverworts; Bryophytes monophyletic; Hornworts-sister; Hornworts-basal; Liverworts-basal
• Gymnosperms: Gnepine; Conifers monophyletic; Gnetifer; Gnetales-sister
• Angiosperms: Eudicots + magnoliids; Eudicots + mag/Chlor; Magnoliids + Chloranthales; Mag + Chlor, monocots; Monocots + eudicots
• ANA-grade angiosperms: Amborella + Nuphar; Amborella-sister


ASTRAL | Concatenation-ML

[Wickett*, Mirarab*, et al., PNAS, 2014]


Running time when varying the number of species

1000 genes, “medium” levels of recent ILS

[Plot: running time (hours; axis ticks at 0, 10, 20) against number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]


1kp: Thousand Transcriptome Project

G. Ka-Shu Wong (U Alberta)

N. Wickett (Northwestern)

J. Leebens-Mack (U Georgia)

N. Matasci (iPlant)

T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin)

Plus many many other people…

Upcoming Challenges (~1200 species, ~400 loci):

• Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species; we will use ASTRAL-II (Mirarab and Warnow, 2015)

• Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., Genome Biology, 2015)


ASTRAL-I on biological datasets


• 1KP: 103 plant species, 400-800 genes

• Yang, et al. 96 Caryophyllales species, 1122 genes

• Dentinger, et al. 39 mushroom species, 208 genes

• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes

• Laumer, et al. 40 flatworm species, 516 genes

• Grover, et al. 8 cotton species, 52 genes

• Hosner, Braun, and Kimball. 28 quail species, 11 genes

• Simmons and Gatesy. 47 angiosperm species, 310 genes

Phylotranscriptomic analysis of the origin and early diversification of land plants

Norman J. Wickett (a,b,1,2), Siavash Mirarab (c,1), Nam Nguyen (c), Tandy Warnow (c), Eric Carpenter (d), Naim Matasci (e,f), Saravanaraj Ayyampalayam (g), Michael S. Barker (f), J. Gordon Burleigh (h), Matthew A. Gitzendanner (h,i), Brad R. Ruhfel (h,j,k), Eric Wafula (l), Joshua P. Der (l), Sean W. Graham (m), Sarah Mathews (n), Michael Melkonian (o), Douglas E. Soltis (h,i,k), Pamela S. Soltis (h,i,k), Nicholas W. Miles (k), Carl J. Rothfels (p,q), Lisa Pokorny (p,r), A. Jonathan Shaw (p), Lisa DeGironimo (s), Dennis W. Stevenson (s), Barbara Surek (o), Juan Carlos Villarreal (t), Béatrice Roure (u), Hervé Philippe (u,v), Claude W. dePamphilis (l), Tao Chen (w), Michael K. Deyholos (d), Regina S. Baucom (x), Toni M. Kutchan (y), Megan M. Augustin (y), Jun Wang (z), Yong Zhang (v), Zhijian Tian (z), Zhixiang Yan (z), Xiaolei Wu (z), Xiao Sun (z), Gane Ka-Shu Wong (d,z,aa,2), and James Leebens-Mack (g,2)

(a) Chicago Botanic Garden, Glencoe, IL 60022; (b) Program in Biological Sciences, Northwestern University, Evanston, IL 60208; (c) Department of Computer Science, University of Texas, Austin, TX 78712; (d) Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E9; (e) iPlant Collaborative, Tucson, AZ 85721; (f) Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721; (g) Department of Plant Biology, University of Georgia, Athens, GA 30602; (h) Department of Biology and (i) Genetics Institute, University of Florida, Gainesville, FL 32611; (j) Department of Biological Sciences, Eastern Kentucky University, Richmond, KY 40475; (k) Florida Museum of Natural History, Gainesville, FL 32611; (l) Department of Biology, Pennsylvania State University, University Park, PA 16803; (m) Department of Botany and (q) Department of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4; (n) Arnold Arboretum of Harvard University, Cambridge, MA 02138; (o) Botanical Institute, Universität zu Köln, Cologne D-50674, Germany; (p) Department of Biology, Duke University, Durham, NC 27708; (r) Department of Biodiversity and Conservation, Real Jardín Botanico-Consejo Superior de Investigaciones Cientificas, 28014 Madrid, Spain; (s) New York Botanical Garden, Bronx, NY 10458; (t) Department fur Biologie, Systematische Botanik und Mykologie, Ludwig-Maximilians-Universitat, 80638 Munich, Germany; (u) Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Succursale Centre-Ville, Montreal, QC, Canada H3C 3J7; (v) CNRS, Station d’Ecologie Experimentale du CNRS, Moulis, 09200, France; (w) Shenzhen Fairy Lake Botanical Garden, The Chinese Academy of Sciences, Shenzhen, Guangdong 518004, China; (x) Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109; (y) Donald Danforth Plant Science Center, St. Louis, MO 63132; (z) BGI-Shenzhen, Bei shan Industrial Zone, Yantian District, Shenzhen 518083, China; and (aa) Department of Medicine, University of Alberta, Edmonton, AB, Canada T6G 2E1

Edited by Paul O. Lewis, University of Connecticut, Storrs, CT, and accepted by the Editorial Board September 29, 2014 (received for review December 23, 2013)

Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but inconsistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.

land plants | Streptophyta | phylogeny | phylogenomics | transcriptome

The origin of embryophytes (land plants) in the Ordovician period roughly 480 Mya (1–4) marks one of the most important events in the evolution of life on Earth. The early evolution of embryophytes in terrestrial environments was facilitated by numerous innovations, including parental protection for the developing embryo, sperm and egg production in multicellular protective structures, and an alternation of phases (often referred to as generations) in which a diploid sporophytic life history stage gives rise to a multicellular haploid gametophytic phase. With

Significance

Early branching events in the diversification of land plants and closely related algal lineages remain fundamental and unresolved questions in plant evolutionary biology. Accurate reconstructions of these relationships are critical for testing hypotheses of character evolution: for example, the origins of the embryo, vascular tissue, seeds, and flowers. We investigated relationships among streptophyte algae and land plants using the largest set of nuclear genes that has been applied to this problem to date. Hypothesized relationships were rigorously tested through a series of analyses to assess systematic errors in phylogenetic inference caused by sampling artifacts and model misspecification. Results support some generally accepted phylogenetic hypotheses, while rejecting others. This work provides a new framework for studies of land plant evolution.

Author contributions: N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., P.S.S., D.W.S., M.K.D., J.W., G.K.-S.W., and J.L.-M. designed research; N.J.W., S. Mirarab, N.N., T.W., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., B.R., H.P., C.W.d., T.C., M.K.D., M.M.A., J.W., Y.Z., Z.T., Z.Y., X.W., X.S., G.K.-S.W., and J.L.-M. performed research; S. Mirarab, N.N., T.W., N.M., S.A., M.S.B., J.G.B., M.A.G., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., H.P., C.W.d., T.C., M.K.D., R.S.B., T.M.K., M.M.A., J.W., Y.Z., G.K.-S.W., and J.L.-M. contributed new reagents/analytic tools; N.J.W., S. Mirarab, N.N., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., B.R., H.P., and J.L.-M. analyzed data; N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., D.W.S., H.P., G.K.-S.W., and J.L.-M. wrote the paper; and N.M. archived data.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. P.O.L. is a guest editor invited by the Editorial Board.

Freely available online through the PNAS open access option.

Data deposition: The sequences reported in this paper have been deposited in the iPlant datastore database, mirrors.iplantcollaborative.org/onekp_pilot, and the National Center for Biotechnology Information Sequence Read Archive, www.ncbi.nlm.nih.gov/sra [accession no. PRJEB4921 (ERP004258)].

1 N.J.W. and S. Mirarab contributed equally to this work.

2 To whom correspondence may be addressed. Email: [email protected], [email protected], or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323926111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1323926111 PNAS Early Edition | 1 of 10


ASTRAL-II on biological datasets (ongoing collaborations)

• 1200 plants with ~ 400 genes (1KP consortium)

• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)

• 200 avian species with whole genomes (with Genome 10K, international)

• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)

• 140 insect species with 1400 genes (with U. Illinois at Urbana-Champaign)

• 50 hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian)

• 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley)

• 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)


Summary

• Gene tree discord due to ILS is a common challenge in species tree estimation.

• Most of the first generation of coalescent-based methods are statistically consistent (in the presence of large amounts of perfect data), but are insufficiently accurate under some biologically realistic conditions (especially with large numbers of species).

• New methods have been developed that can analyze very large datasets (thousands of loci and taxa) with improved accuracy compared to previous methods.

• Yet, all methods have theoretical and/or practical limitations. New methods are needed, and this is an active research area.

• Concatenation is often a reasonable approach, despite not being statistically consistent.


Papers and Software

• M.S. Bayzid and T. Warnow. "Naive binning improves phylogenomic analyses". Bioinformatics 2013 29 (18): 2277-2284
• S. Mirarab, R. Reaz, Md. S. Bayzid, T. Zimmermann, M.S. Swenson, and T. Warnow. "ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation". Bioinformatics 2014 30 (17): i541-i548
• Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". BMC Genomics 2014, 15(Suppl 6): S7
• T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". BMC Genomics 2014, 15(Suppl 6): S11
• S. Mirarab, Md S. Bayzid, and T. Warnow. "Evaluating summary methods for multi-locus species tree estimation in the presence of incomplete lineage sorting". Systematic Biology, doi: 10.1093/sysbio/syu063
• S. Mirarab, Md. S. Bayzid, B. Boussau, and T. Warnow. "Statistical binning enables an accurate coalescent-based estimation of the avian tree". Science, 12 December 2014: 1250463
• M. S. Bayzid, S. Mirarab, B. Boussau, and T. Warnow. "Weighted Statistical Binning: enabling statistically consistent genome-scale phylogenetic analyses". PLOS One, 2015, DOI: 10.1371/journal.pone.0129183
• S. Mirarab and T. Warnow. "ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes". Proceedings ISMB 2015, and Bioinformatics 2015 31 (12): i44-i52
• S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods". Systematic Biology, 64(4):663-676, 2015
• T. Warnow. "Concatenation analyses in the presence of incomplete lineage sorting". PLOS Currents: Tree of Life 2015
• R. Davidson, P. Vachaspati, S. Mirarab, and T. Warnow. "Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer". In press, BMC Genomics, 2015
• J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". In press, BMC Genomics, 2015
• P. Vachaspati and T. Warnow. "ASTRID: Accurate Species TRees from Internode Distances". In press, BMC Genomics, 2015

Open source software available at github
Papers available at http://tandy.cs.illinois.edu/papers.html

Page 51: Analy&cal(and(computaonal(challenges(in( ((coalescentbased ...tandy.cs.illinois.edu/SMBE-Warnow.pdf · Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar de Investigação

   Papers  and  So|ware  •  M.S.  Bayzid  and  T.  Warnow.  "Naive  binning  improves  phylogenomic  analyses".  Bioinforma&cs  2013  29  (18):  2277-­‐2284  •  S.  Mirarab,  R.  Reaz,  Md.  S.  Bayzid,  T.  Zimmermann,  M.S.  Swenson,  and  T.  Warnow.  "ASTRAL:  Genome-­‐Scale  Coalescent-­‐

Based  Species  Tree  Es&ma&on.”    Bioinforma&cs  2014  30  (17):i541-­‐i548  •  Md  S.  Bayzid,  T.  Hunt,  and  T.  Warnow.  "Disk  Covering  Methods  Improve  Phylogenomic  Analyses”.  BMC  Genomics  2014,  

15(Suppl  6):  S7.  •  T.  Zimmermann,  S.  Mirarab  and  T.  Warnow.  "BBCA:  Improving  the  scalability  of  *BEAST  using  random  binning".  BMC  

Genomics  2014,  15(Suppl  6):  S11  •  S.  Mirarab,  Md  S.  Bayzid,  and  T.  Warnow.  "Evalua&ng  summary  methods  for  mul&-­‐locus  species  tree  es&ma&on  in  the  

presence  of  incomplete  lineage  sor&ng".  Systema&c  Biology,  doi  =  {10.1093/sysbio/syu063  •  S.  Mirarab,  Md.  S.  Bayzid,  B.  Boussau,  and  T.  Warnow.  "Sta&s&cal  binning  enables  an  accurate  coalescent-­‐based  

es&ma&on  of  the  avian  tree".  Science,  12  December  2014:  1250463  •  M.  S.  Bayzid,  S.  Mirarab,  B.  Boussau,  and  T.  Warnow.  "Weighted  Sta&s&cal  Binning:  enabling  sta&s&cally  consistent  

genome-­‐scale  phylogene&c  analyses",  PLOS  One,  2015,  DOI:  10.1371/journal.pone.0129183  •  S.  Mirarab  and  T.  Warnow.  "ASTRAL-­‐II:  coalescent-­‐based  species  tree  es&ma&on  with  many  hundreds  of  taxa  and  

thousands  of  genes",  Proceedings  ISMB  2015,  and  Bioinforma&cs  2015  31  (12):  i44-­‐i52  •  S.  Roch  and  T.  Warnow.  "On  the  robustness  to  gene  tree  es&ma&on  error  (or  lack  thereof)  of  coalescent-­‐based  species  

tree  methods",  Systema&c  Biology,  64(4):663-­‐676,  2015  •  T.  Warnow.  "Concatena&on  analyses  in  the  presence  of  incomplete  lineage  sor&ng",  PLOS  Currents:  Tree  of  Life  2015    •  R.    Davidson,  P.  Vachaspa&,  S.  Mirarab,    and  T.  Warnow.  Phylogenomic  species  tree  es&ma&on  in  the  presence  of  

incomplete  lineage  sor&ng  and  horizontal  gene  transfer.  In  press,  BMC  Genomics,  2015.  •  J.  Chou,  A.  Gupta,  S.  Yaduvanshi,  R.  Davidson,  M.  Nute,  S.  Mirarab  and  T.  Warnow.  A  compara&ve  study  of  SVDquartets  

and  other  coalescent-­‐based  species  tree  es&ma&on  methods.  In  press,  BMC  Genomics,  2015  •  P.  Vachaspa&  and  T.  Warnow.  ASTRID:  Accurate  Species  TRees  from  Internode  Distances.  In  press,  BMC  Genomics,  2015  

 Open  source  so|ware  available  at  github  Papers  available  at  hBp://tandy.cs.illinois.edu/papers.html  

Page 52: Analy&cal(and(computaonal(challenges(in( ((coalescentbased ...tandy.cs.illinois.edu/SMBE-Warnow.pdf · Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar de Investigação

Acknowledgments

PhD students: Siavash Mirarab* and Md. S. Bayzid**

Funding: Guggenheim Foundation, NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and Grainger Foundation (professorship).

TACC and UTCS computational resources

* Supported by HHMI Predoctoral Fellowship
** Supported by Fulbright Foundation