iPhy: tools for collation and analysis of phylogenomic data. M. Blaxter
Phyloinformatics Workshop Edinburgh 2007
iPhy
tools for collation and analysis of phylogenomic data
Martin Jones and Mark Blaxter
[Figure: tree of eukaryotes with the root marked. Labelled groups include alveolates, heterokonts, cercozoa, plants, discicristates, excavates, opisthokonts and amoebozoa, plus cryptophytes, haptophytes and marine groups I and II.]
1: Forests of trees, and loads of kindling
2: Organising principles
3: iPhy design
4: iPhy deployment
5: Nameless taxa & endless forms
1: Forests of trees, and loads of kindling
Phylogenetics is a growth area. The raw materials (sequences) are being added at a startling rate. Tree databases are also growing (both in number and size). So how does a lab worker bee keep up?
[Figure: Metazoan Phyla: Sequences per phylum (10/05/2006). Plot by phylum, log scale from 1 to 100,000,000.]
[Figure: Metazoan Phyla: Species per phylum (10/05/2006). Plot by phylum, log scale from 1 to 10,000,000.]
[Figure: Metazoan Phyla: Sequences per species (10/05/2006). Plot by phylum, log scale from 0.1 to 1,000.]
[Figure: cumulative number of molecular phylogenies and of TreeBASE studies by year, 1980 to 2000 (y-axis 0 to 7000). From Rod Page, “Towards a Taxonomically Intelligent Phylogenetic Database”.]
Two modes of data acquisition
(a) wet lab - compute lab synergy: explicitly source the sequences needed, with preformed ideas of the best taxa to sample and the best genes to sample [this is the source of most phylogenetic data]
(b) magpie surfing / tree surgery: using phyloinformatic tools to discover the set of available genes AND taxa to address a particular problem
2: Organising principles
On average …
• more data are better: more taxa, more genes
• multiple methods are better
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
While the NCBI taxonomy isn’t the best in the world, at least every sequence is attached to a taxon, and TAX_IDs are unique.
The Edinburgh EST analysis Pipeline
(trace2dbest) Process raw sequence traces; trim off vector & low quality
(CLOBB) Cluster into putative gene objects; predict consensus sequence
(prot4EST) Predict translation reading frame; generate protein translation
(annot8r) Annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
(PartiGene) Collate information in relational database
NEMBASE3 http://www.nematodes.org/
Mark Blaxter, James Wasmuth, Ann Hedley & Ralf Schmid
University of Edinburgh, Institute of Evolutionary Biology, Edinburgh UK EH9 3JT
The web portal to NEMBASE3
[Figure: Collectors’ curve of nematode protein families. Number of families against total number of proteins, with curves for Trichinella spiralis, Brugia malayi, Meloidogyne incognita, Strongyloides stercoralis, Ancylostoma caninum and Caenorhabditis elegans. Source: NEMBASE3 http://www.nematodes.org/]
[Figure: Earliest origins of nematode protein families. Family counts mapped onto the nodes of a tree of Nematoda covering Dorylaimia (Clade I), Spirurina (Clade III), Tylenchina (Clade IV) and Rhabditina (Clade V), with tips Trichinellida, Dorylaimida, Ascaridomorpha, Spiruromorpha, Panagrolaimomorpha, Tylenchomorpha, Cephalobomorpha, Strongyloidea, Rhabditoidea and Diplogasteromorpha.]
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
[Figure: gene-by-taxon matrix with many taxa and missing data; genes a to i against taxa 1 to 9.]
Generating a slice that
• maximises taxonomic coverage
• maximises present data / minimises missing data
[Figure: the selected slice; genes a, b, e, f, g, i against taxa 1, 3, 7, 9.]
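The slice idea can be sketched on a presence/absence matrix of genes by taxa. This is not the iPhy SliceSelecter (which the design slides associate with maximal bicliques); it is a simple greedy heuristic under assumed, hypothetical names (`greedy_slice`, `presence`, `max_missing`) that repeatedly drops the gene or taxon with the most empty cells until the remaining block is complete enough.

```python
from typing import Dict, Set, Tuple


def greedy_slice(presence: Dict[str, Set[str]],
                 max_missing: float = 0.0) -> Tuple[Set[str], Set[str]]:
    """Return (taxa, genes) whose sub-matrix has at most `max_missing`
    fraction of empty cells.

    `presence` maps each taxon to the set of genes sequenced for it.
    Hypothetical helper for illustration only.
    """
    taxa = set(presence)
    genes = set().union(*presence.values())

    def missing_fraction() -> float:
        cells = len(taxa) * len(genes)
        if cells == 0:
            return 0.0
        filled = sum(len(presence[t] & genes) for t in taxa)
        return 1.0 - filled / cells

    while taxa and genes and missing_fraction() > max_missing:
        # Fraction of empty cells in each row (taxon) and column (gene).
        taxon_gaps = {t: 1 - len(presence[t] & genes) / len(genes)
                      for t in taxa}
        gene_gaps = {g: 1 - sum(g in presence[t] for t in taxa) / len(taxa)
                     for g in genes}
        # sorted() makes tie-breaking reproducible.
        worst_taxon = max(sorted(taxa), key=taxon_gaps.get)
        worst_gene = max(sorted(genes), key=gene_gaps.get)
        # On a tie, drop the gene so that taxonomic coverage is kept.
        if gene_gaps[worst_gene] >= taxon_gaps[worst_taxon]:
            genes.remove(worst_gene)
        else:
            taxa.remove(worst_taxon)
    return taxa, genes
```

A greedy pass like this is fast but not guaranteed to find the largest complete block; an exhaustive biclique search trades running time for that guarantee.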
2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
• store trees locally
• store alternative taxonomic systems
[Figure: phylogenies of Nematoda, Fungi, Choanoflagellata, Cnidaria, Ctenophora, Platyhelminthes, Mollusca, Echinodermata, Cephalochordata, Urochordata, Vertebrata, Arthropoda, Tardigrada and Annelida, with nodes labelled C, P, L, E and D. Panels: Complete genome sequences; Including neglected taxa ESTs (Philippe et al.).]
3: iPhy design
Processing to
* identify relevant sequences and store locally
* associate sequences and taxa
* capture tree data
* reconcile tree nodes with existing systems
[Figure: iPhy architecture, built up over several slides. Inputs: sequences, alignments, systematic and user trees, TreeFam, TreeBASE. These feed the iPhy database, which is served by an Alignment Cycle (POA, tranAlign) and an Orthologue Inference Engine (TreeFam, Ortho-MCL). Dataset Exploration Tools (TreeComparer, SliceSelecter; maximal bicliques) feed a Phylogenetics Cycle (PhyML, MrBayes, PAUP...), leading to Publication Quality Analyses (trees & alignments).]
4: iPhy deployment version 0.1: ‘TaxMan’
BioMed Central, BMC Bioinformatics (Open Access, Software)
TaxMan: a taxonomic database manager
Martin Jones* and Mark Blaxter (* corresponding author)
Address: Institute of Evolutionary Biology, King's Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT, UK
BMC Bioinformatics 2006, 7:536 doi:10.1186/1471-2105-7-536
Received: 11 October 2006; Accepted: 18 December 2006; Published: 18 December 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/536
© 2006 Jones and Blaxter; licensee BioMed Central Ltd.
4: iPhy deployment version 0.1: ‘TaxMan’
• TaxMan automates assembly of large sequence datasets for chosen taxa
• TaxMan automates generation of aligned sequence sets for chosen genes
4: iPhy deployment version 0.1: ‘TaxMan’
TaxMan simplifies selection of taxa for analysis
• e.g. given a gene set, choosing one species per family (choosing the species with the least missing data)
• e.g. given a taxon set, choosing the genes (choosing genes with less than a given % missing data)
• e.g. generating custom defined alignments
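The two selection rules on this slide can be sketched in a few lines. This is an illustration under assumed data structures, not TaxMan's own code: `family_of`, `genes_of` and both function names are hypothetical, and the species and gene labels in the usage note are placeholders.

```python
from typing import Dict, List, Set


def one_species_per_family(family_of: Dict[str, str],
                           genes_of: Dict[str, Set[str]],
                           gene_set: Set[str]) -> Dict[str, str]:
    """For each family, keep the species with the least missing data,
    i.e. the one covering the most genes in `gene_set`."""
    best: Dict[str, str] = {}
    for species in sorted(family_of):  # sorted for reproducible ties
        family = family_of[species]
        score = len(genes_of.get(species, set()) & gene_set)
        current = best.get(family)
        if current is None or score > len(genes_of.get(current, set()) & gene_set):
            best[family] = species
    return best


def genes_below_missing(genes_of: Dict[str, Set[str]],
                        taxon_set: Set[str],
                        max_missing_pct: float) -> List[str]:
    """Return genes missing from less than `max_missing_pct` percent
    of the taxa in `taxon_set`."""
    if not taxon_set:
        return []
    all_genes = set().union(*(genes_of.get(t, set()) for t in taxon_set))
    kept = []
    for gene in sorted(all_genes):
        absent = sum(gene not in genes_of.get(t, set()) for t in taxon_set)
        if 100.0 * absent / len(taxon_set) < max_missing_pct:
            kept.append(gene)
    return kept
```

For example, with `family_of = {"sp1": "FamA", "sp2": "FamA", "sp3": "FamB"}` and `sp2` carrying more of the requested genes than `sp1`, the first function keeps `sp2` for `FamA`.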
4: iPhy deployment version 0.1: ‘TaxMan’
TaxMan simplifies analysis by exporting formatted alignments (NEXUS)
• of nucleotides (with codon positions and genes as defined partitions)
• of amino acids (with genes as defined partitions)
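A partitioned NEXUS export of this kind might look like the sketch below. It is not TaxMan's exporter: the function name `to_nexus` and its input layout are assumptions, each gene alignment is assumed to start in reading frame at its first column, and gene and taxon names are assumed to contain no whitespace.

```python
from typing import Dict, List


def to_nexus(alignments: Dict[str, Dict[str, str]], coding: bool = True) -> str:
    """Concatenate per-gene nucleotide alignments into one NEXUS file.

    `alignments` maps gene -> {taxon -> aligned sequence}. Taxa lacking
    a gene are padded with '?'. Each gene becomes a CHARSET; if `coding`
    is true, the three codon positions become CHARSETs as well.
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    matrix = {t: "" for t in taxa}
    charsets: List[str] = []
    start = 1
    for gene in sorted(alignments):
        aln = alignments[gene]
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"sequences for {gene} are not aligned")
        end = start + length - 1
        for t in taxa:
            matrix[t] += aln.get(t, "?" * length)
        charsets.append(f"  CHARSET {gene} = {start}-{end};")
        if coding:
            for pos in (1, 2, 3):
                # "\3" selects every third column from the start column.
                charsets.append(
                    f"  CHARSET {gene}_pos{pos} = {start + pos - 1}-{end}\\3;")
        start = end + 1
    lines = ["#NEXUS", "BEGIN DATA;",
             f"  DIMENSIONS NTAX={len(taxa)} NCHAR={start - 1};",
             "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
             "  MATRIX"]
    lines += [f"  {t}  {matrix[t]}" for t in taxa]
    lines += ["  ;", "END;", "BEGIN SETS;"] + charsets + ["END;"]
    return "\n".join(lines) + "\n"
```

Writing the partitions as CHARSETs in a SETS block keeps the data matrix untouched, so the same file can be analysed with or without partitioning.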
4: iPhy deployment version 0.1: ‘TaxMan’
TaxMan simplifies post-phylogenetic analysis by
• saving trees (with links to the original data)
• saving analytical metadata (algorithm, parameters, settings)
• saving tree statistics (bootstraps, branch lengths)
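One possible shape for such a saved tree record is sketched below. TaxMan's actual schema is not shown in these slides, so every field name here (`TreeRecord`, `alignment_id`, `bootstrap_replicates`, and so on) is hypothetical; branch lengths and support values are assumed to travel inside the Newick string.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class TreeRecord:
    """A tree plus the metadata needed to trace it back to its data."""
    newick: str                # topology, branch lengths, support values
    alignment_id: str          # link to the alignment the tree came from
    algorithm: str             # e.g. the program used
    parameters: Dict[str, str] = field(default_factory=dict)
    bootstrap_replicates: int = 0


def dump_record(record: TreeRecord) -> str:
    """Serialise a record to JSON text for storage."""
    return json.dumps(asdict(record), sort_keys=True)


def load_record(text: str) -> TreeRecord:
    """Rebuild a record from its JSON text."""
    return TreeRecord(**json.loads(text))
```

Keeping the alignment identifier and analysis settings next to the tree means a stored result can be re-run or audited later without guessing how it was produced.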
Lophotrochozoa
● 70,000 annotated sequences
● 630,000 EST sequences
● 21 genes (mt + 18S 28S actin H3 WG EF1A)
● 53,000 sequences extracted
● 17,000 aligned consensus sequences
● 8,700 species represented
● One day for data collection, one for alignment
Molecular Phylogenetics and Evolution 43 (2007) 583–595
www.elsevier.com/locate/ympev
The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions
Martin Jones a,*, Benjamin Gantenbein b, Victor Fet c, Mark Blaxter a
a Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
b AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland
c Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA
Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006
Available online 29 November 2006
5: Nameless taxa & endless forms
"... endless forms most beautiful and most wonderful have been, and are being, evolved" (Darwin 1859)
http://www.nematodes.org/NeglectedGenomes/ARTHROPODA/Chelicerata.html
[Figure: Metazoan species per phylum, log scale from 1 to 100,000,000, for phyla from Choanoflagellida and Porifera through to Echinodermata and Chordata.]
[Figure: organism-size curve for eukaryotes. Number of individuals (log scale, from few through lots to squillions) against size of organism (log scale: minuscule, tiny, small, just visible, big), with FOOD ITEMS and POSSIBLE PREDATORS marked.]
Sourhope farm
NERC "Soil Biodiversity and Ecosystem Function" Programme Study Site
120 m x 75 m of raw Scottish upland grass
13 000 000 000 nematodes
MAN IS BVT A WORM
[Figure: Marine Nematode Barcodes. Tree of specimens (labelled 1000ED to 1179ED) from three sites, Orkney, Loch Fyne (Fyne1, Fyne2) and Gullane; scale bar 5 changes.]
5: Nameless taxa & endless forms
MOTU: Molecular Operational Taxonomic Units
motu
1. to cut; to snap off. motu-á te hau, the fishing line snapped off
2. to engrave, to inscribe letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions for recitation in lines called kohau.
3. islet. some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao, Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko Hepa Ko Maihori, Motu Hava.
5: Nameless taxa & endless forms
MOTU
• specimen-based surveys: CBoL Barcode of Life (CO1)
• anonymous, specimen-free surveys: environmental sampling, bulk community DNA, millions of sequences
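To make the MOTU idea concrete, the sketch below groups aligned barcode sequences whenever they are linked by a chain of pairwise differences at or below a threshold (single-linkage clustering on uncorrected p-distance). This is an illustrative assumption, not the method used for the surveys in this talk; the function names and the threshold value in the usage note are hypothetical.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        diffs += x != y
    # Sequences with no comparable sites are treated as maximally distant.
    return diffs / compared if compared else 1.0


def cluster_motus(seqs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Single-linkage clustering of aligned sequences into MOTU."""
    ids = sorted(seqs)
    parent = {i: i for i in ids}

    def find(i: str) -> str:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if p_distance(seqs[i], seqs[j]) <= threshold:
                parent[find(j)] = find(i)

    groups: Dict[str, List[str]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster_motus(seqs, 0.02)` would, under these assumptions, group sequences linked by up to 2% difference; the all-pairs comparison is quadratic, so surveys with millions of sequences would need a faster strategy.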
5: Nameless taxa & endless forms
~1.2 million described species
~10-100 million species in reality
Thus, most ‘species’ will never be formally named.
5: Nameless taxa & endless forms
How do we incorporate these myriad ‘nameless taxa’ into our systems?
Martin Jones: TaxMan, iPhy & chelicerate evolution
Robin Floyd & Jenna Mann: MOTU and barcoding
Ralf Schmid, James Wasmuth & Ann Hedley: PartiGene & EST analysis