iPhy: tools for collation and analysis of phylogenomic data. M. Blaxter

Post on 10-May-2015


Phyloinformatics Workshop Edinburgh 2007

iPhy

tools for collation and analysis of phylogenomic data

Martin Jones and Mark Blaxter

[Figure: rooted tree of eukaryotes. Labels: cryptophytes; alveolates (ciliates, dinoflagellates, marine group I, apicomplexa); haptophytes; vahlkampfiid amoebas; acrasid slime molds; euglenids; trypanosomes; leishmania; discicristates; excavates; core jakobids; diplomonads; parabasalids; retortamonads; oxymonads; plants; *prasinophyte algae; chlorophyte algae; glaucophyte algae; red algae; charophyte algae; land plants; marine group II; heterokonts; more chl ac algae; labyrinthulids; opalinids; diatoms; oomycetes; brown algae; bicosoecids; chlorarachniophytes; cercomonads; foraminiferans; euglyphid amoebas; cercozoa; radiolarians; opisthokonts; fungi; animals; microsporidia; choanoflagellates; “choanozoa”; lobose amoebas; pelobionts; dictyostelid slime molds; plasmodial slime molds; *protostelid slime molds; amoebozoa; root]

1: Forests of trees, and loads of kindling
2: Organising principles
3: iPhy design
4: iPhy deployment
5: Nameless taxa & endless forms

1: Forests of trees, and loads of kindling

Phylogenetics is a growth area.
The raw materials (sequences) are being added at a startling rate.
Tree databases are also growing (both in number and size).
So how does a lab worker bee keep up?

[Figure: Metazoan Phyla: Sequences per phylum (10/05/2006). Bar chart, logarithmic axis. Phyla shown: Chaetognatha, Chordata, Echinodermata, Hemichordata, Enteropneusta, Xenoturbellida, Arthropoda, Onychophora, Tardigrada, Priapulida, Acanthocephala, Kinorhyncha, Nematoda, Nematomorpha, Platyhelminthes, Annelida, Echiura, Pogonophora, Brachiopoda, Bryozoa, Entoprocta, Mollusca, Nemertea, Sipuncula, Gastrotricha, Rotifera, Seisonidea, Gnathostomulida, Acoelomorpha, Cycliophora, Micrognathozoa, Cnidaria, Ctenophora, Mesozoa, Myxozoa, Buddenbrockia, Placozoa, Porifera]

[Figure: Metazoan Phyla: Species per phylum (10/05/2006). Bar chart, logarithmic axis, same phyla]

[Figure: Metazoan Phyla: Sequences per species (10/05/2006). Bar chart, logarithmic axis, same phyla]


[Figure: cumulative number of molecular phylogenies and of TreeBASE studies, by year (1980 to 2000); from Rod Page “Towards a Taxonomically Intelligent Phylogenetic Database”]

Two modes of data acquisition

(a) wet lab - compute lab synergy
    explicitly source the sequences needed
    preformed ideas of
        the best taxa to sample
        the best genes to sample
    [this is the source of most phylogenetic data]

(b) magpie surfing / tree surgery
    using phyloinformatic tools to discover the set of available genes AND taxa to address a particular problem

2: Organising principles

On average …
• more data are better
    more taxa
    more genes
• multiple methods are better

2: Organising principles
• assess all relevant taxa
• assess all relevant sequence

While the NCBI taxonomy isn’t the best in the world, at least every sequence is attached to a taxon, and TAX_IDs are unique.
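Because every sequence carries a unique TAX_ID, the taxon identifier can serve as the key that ties sequences to taxa. The following is a minimal sketch of that idea only; the record layout, the accessions and the function name `index_by_taxon` are illustrative assumptions, not part of iPhy or TaxMan.

```python
from collections import defaultdict

# Hypothetical records: (accession, NCBI TAX_ID, sequence).
# The accessions and sequences are made up for illustration.
records = [
    ("SEQ0001", 6239, "ACGGTCCGGAAGGCT"),
    ("SEQ0002", 6239, "ACGGTCCGGAAGGCA"),
    ("SEQ0003", 6279, "ACGGTCCGGAAGGCT"),
]


def index_by_taxon(records):
    """Group accessions under the TAX_ID they are attached to."""
    by_taxon = defaultdict(list)
    for accession, tax_id, _sequence in records:
        by_taxon[tax_id].append(accession)
    return dict(by_taxon)


taxon_index = index_by_taxon(records)
```

With an index like this, "all relevant sequence" for a set of taxa becomes a lookup on their TAX_IDs.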

The Edinburgh EST analysis Pipeline
• (CLOBB) Cluster into putative gene objects; predict consensus sequence
• (prot4EST) Predict translation reading frame; generate protein translation
• (annot8r) Annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
• (trace2dbest) Process raw sequence traces; trim off vector & low quality
• (PartiGene) Collate information in relational database

NEMBASE3 http://www.nematodes.org/
Mark Blaxter, James Wasmuth, Ann Hedley & Ralf Schmid
University of Edinburgh, Institute of Evolutionary Biology, Edinburgh UK EH9 3JT
mark.blaxter@ed.ac.uk

[Figure: the web portal to NEMBASE3]

[Figure: Collectors’ curve of nematode protein families (panels A, B, C). x-axis: total number of proteins; y-axis: number of families. Species labelled: Trichinella spiralis, Brugia malayi, Meloidogyne incognita, Strongyloides stercoralis, Ancylostoma caninum, Caenorhabditis elegans]

[Figure: Earliest origins of nematode protein families, with family counts mapped onto the nodes of a tree of the NEMATODA. Tree labels: Dorylaimia (Clade I), Trichinellida, Dorylaimida, Rhabditida, Spirurina (Clade III), Ascaridomorpha, Spiruromorpha, Tylenchina (Clade IV), Panagrolaimomorpha, Tylenchomorpha, Cephalobomorpha, Rhabditina (Clade V), Strongyloidea, Rhabditoidea, Diplogasteromorpha]

2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats

[Figure: gene-by-taxon matrix with many taxa and missing data; genes a to i across, taxa 1 to 9 down]

Generating a slice that
• maximises taxonomic coverage
• maximises present data/minimises missing data

[Figure: the resulting slice, keeping genes a, b, e, f, g, i and taxa 1, 3, 7, 9]
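Choosing a slice can be framed as a search over a gene-by-taxon presence table. The sketch below is one simple, brute-force way to find the largest slice with no missing data at all (a complete block of taxa and genes, which corresponds to a biclique in the taxon-gene graph). It is an illustration under stated assumptions, not the iPhy method: the function name `best_complete_slice`, the scoring rule (taxa × genes) and the toy data are all hypothetical, and the exhaustive search is only practical for a handful of genes.

```python
from itertools import combinations


def best_complete_slice(presence):
    """presence maps each taxon to the set of genes sequenced for it.

    Returns (taxa, genes) for the gene subset that gives the most
    filled cells when only taxa having every chosen gene are kept.
    """
    genes = sorted(set().union(*presence.values()))
    best_score, best_taxa, best_genes = 0, (), ()
    for size in range(1, len(genes) + 1):
        for gene_subset in combinations(genes, size):
            wanted = set(gene_subset)
            # Keep only taxa with no missing data for this gene subset.
            taxa = tuple(sorted(t for t, have in presence.items() if wanted <= have))
            score = len(taxa) * len(gene_subset)
            if score > best_score:
                best_score, best_taxa, best_genes = score, taxa, gene_subset
    return best_taxa, best_genes


# Toy presence table (made-up taxa and genes).
presence = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b"},
    "t3": {"a", "b", "c"},
    "t4": {"c"},
}
taxa, genes = best_complete_slice(presence)
```

Relaxing the "no missing data" condition to a tolerated fraction of gaps would trade completeness for wider taxonomic coverage, which is the tension the slide describes.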

2: Organising principles
• assess all relevant taxa
• assess all relevant sequence
• store aligned sequences locally
• output ‘slices’ of data in analysis-ready formats
• store trees locally
• store alternative taxonomic systems

[Figure: tree of Nematoda, Fungi, Choanoflagellata, Cnidaria, Ctenophora, Platyhelminthes, Mollusca, Echinodermata, Cephalochordata, Urochordata, Vertebrata, Arthropoda, Tardigrada, Annelida; clade labels C, P, L, E, D; panels “Complete genome sequences” and “Including neglected taxa ESTs” (Philippe et al.)]

3: iPhy design

Processing to
* identify relevant sequences and store locally
* associate sequences and taxa
* capture tree data
* reconcile tree nodes with existing systems

[Diagram: iPhy architecture, built up over several slides]
Inputs: sequence, alignment; trees from TreeBASE, TreeFam, user tree, systematic
iPhy database
Alignment Cycle: POA, tranAlign
Orthologue Inference Engine: TreeFam, Ortho-MCL
Dataset Exploration Tools: TreeComparer, SliceSelecter
maximal bicliques
Phylogenetics Cycle: PhyML, MrBayes, PAUP...
Publication Quality Analyses: trees & alignments

4: iPhy deployment
version 0.1: ‘TaxMan’

BMC Bioinformatics (BioMed Central), Software, Open Access
TaxMan: a taxonomic database manager
Martin Jones* and Mark Blaxter
Address: Institute of Evolutionary Biology, King's Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT, UK
Email: Martin Jones* - martin.jones@ed.ac.uk; Mark Blaxter - mark.blaxter@ed.ac.uk
* Corresponding author
Received: 11 October 2006
Accepted: 18 December 2006
Published: 18 December 2006
BMC Bioinformatics 2006, 7:536 doi:10.1186/1471-2105-7-536
This article is available from: http://www.biomedcentral.com/1471-2105/7/536
© 2006 Jones and Blaxter; licensee BioMed Central Ltd.

4: iPhy deployment
version 0.1: ‘TaxMan’

TaxMan automates assembly of large sequence datasets for chosen taxa
TaxMan automates generation of aligned sequence sets for chosen genes

4: iPhy deployment
version 0.1: ‘TaxMan’

TaxMan simplifies selection of taxa for analysis
e.g. given a gene set, choosing one species per family (choosing the species with the least missing data)
e.g. given a taxon set, choosing the genes (choosing genes with less than a given % missing data)
e.g. generating custom defined alignments
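The two selection rules above can be sketched in a few lines. This is an illustration of the rules as stated on the slide, not TaxMan's actual code or interface: the function names, the data layout (a species-to-genes presence table plus a species-to-family lookup) and the example species are assumptions.

```python
def pick_species_per_family(family_of, presence, genes):
    """For each family, keep the species missing the fewest of `genes`."""
    chosen = {}
    for species in sorted(presence):
        family = family_of[species]
        missing = len(set(genes) - presence[species])
        if family not in chosen or missing < chosen[family][0]:
            chosen[family] = (missing, species)
    return {family: species for family, (_missing, species) in chosen.items()}


def genes_below_missing(presence, taxa, max_missing_pct):
    """Keep genes absent from fewer than max_missing_pct percent of `taxa`."""
    all_genes = sorted(set().union(*(presence[t] for t in taxa)))
    kept = []
    for gene in all_genes:
        missing = sum(1 for t in taxa if gene not in presence[t])
        if 100.0 * missing / len(taxa) < max_missing_pct:
            kept.append(gene)
    return kept


# Made-up example: three species in two families.
family_of = {"sp1": "FamA", "sp2": "FamA", "sp3": "FamB"}
presence = {
    "sp1": {"18S"},
    "sp2": {"18S", "28S"},
    "sp3": {"18S", "actin"},
}
per_family = pick_species_per_family(family_of, presence, ["18S", "28S", "actin"])
shared_genes = genes_below_missing(presence, ["sp2", "sp3"], 50)
```

Ties within a family are broken here by alphabetical order of species name, which is an arbitrary choice made for the sketch.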

4: iPhy deployment
version 0.1: ‘TaxMan’

TaxMan simplifies analysis by exporting formatted alignments (NEXUS)
    of nucleotides (with codon positions and genes as defined partitions)
    of amino acids (with genes as defined partitions)
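In NEXUS, partitions of this kind are usually declared as character sets in a `sets` block, with codon positions expressed using the `\3` stride. The sketch below generates such a block for a concatenated alignment. It is not TaxMan's exporter: the function name `nexus_sets_block`, the charset naming scheme and the gene lengths are hypothetical, and it assumes each coding gene starts in frame at its first column.

```python
def nexus_sets_block(gene_lengths, coding=()):
    """Build a NEXUS sets block from (gene, aligned_length) pairs.

    Genes listed in `coding` also get one charset per codon position.
    """
    lines = ["begin sets;"]
    start = 1
    for gene, length in gene_lengths:
        end = start + length - 1
        lines.append(f"    charset {gene} = {start}-{end};")
        if gene in coding:
            for pos in (1, 2, 3):
                # Every third column, starting at this codon position.
                lines.append(f"    charset {gene}_pos{pos} = {start + pos - 1}-{end}\\3;")
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


# Tiny made-up lengths so the output is easy to read.
block = nexus_sets_block([("CO1", 6), ("SSU18S", 4)], coding={"CO1"})
print(block)
```

For an amino acid alignment the same function can be called with an empty `coding` argument, so that only the per-gene charsets are written.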

4: iPhy deployment
version 0.1: ‘TaxMan’

TaxMan simplifies post-phylogenetic analysis by
    saving trees (with links to the original data)
    saving analytical metadata (algorithm, parameters, settings)
    saving tree statistics (bootstraps, branch lengths)

Lophotrochozoa
● 70,000 annotated sequences
● 630,000 EST sequences
● 21 genes (mt + 18S 28S actin H3 WG EF1A)
● 53,000 sequences extracted
● 17,000 aligned consensus sequences
● 8,700 species represented
● One day for data collection, one for alignment

Molecular Phylogenetics and Evolution 43 (2007) 583–595
www.elsevier.com/locate/ympev

The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions

Martin Jones a,*, Benjamin Gantenbein b, Victor Fet c, Mark Blaxter a

a Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
b AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland
c Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA

Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006
Available online 29 November 2006

5: Nameless taxa & endless forms

"... endless forms most beautiful and most wonderful have been, and are being, evolved"
(Darwin 1859)

http://www.nematodes.org/NeglectedGenomes/ARTHROPODA/Chelicerata.html

[Figure: Metazoan species per phylum. Bar chart, logarithmic axis. Phyla shown: Choanoflagellida, Porifera, Placozoa, Cnidaria, Ctenophora, Acoela, Mesozoa, Myxozoa, Nematoda, Nematomorpha, Loricifera, Kinorhyncha, Priapulida, Onychophora, Arthropoda, Tardigrada, Gastrotricha, Nemertea, Myzostomida, Gnathostomulida, Cycliophora, Platyhelminthes, Acanthocephala, Rotifera, Chaetognatha, Sipunculida, Bryozoa, Brachiopoda, Entoprocta, Annelida, Pogonophora, Echiura, Mollusca, Hemichordata, Echinodermata, Chordata]

[Figure: organism-size curve, Eukaryotes. x-axis: size of organism (log scale): minuscule, tiny, small, just visible, big. y-axis: number of individuals (log scale): few, lots, squillions. Regions labelled FOOD ITEMS and POSSIBLE PREDATORS]

Sourhope farm
NERC "Soil Biodiversity and Ecosystem Function" Programme Study Site
120 m x 75 m of raw Scottish upland grass
13 000 000 000 nematodes

MAN IS BVT A WORM

[Figure: Marine Nematode Barcodes. Tree of barcoded specimens (labelled by specimen number and site) from three sampling sites, Orkney, Loch Fyne (Fyne1, Fyne2) and Gullane, with a diagram of counts per site and shared between sites; scale bar: 5 changes]

5: Nameless taxa & endless forms

MOTU
Molecular Operational Taxonomic Units

motu
1. to cut; to snap off
   motu-á te hau, the fishing line snapped off
2. to engrave, to inscribe letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions for recitation in lines called kohau.
3. islet
   some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao, Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko Hepa Ko Maihori, Motu Hava.

5: Nameless taxa & endless forms

MOTU

specimen-based surveys
    CBoL Barcode of Life (CO1)

anonymous, specimen-free surveys
    environmental sampling
    bulk community DNA
    millions of sequences
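The slides do not say how sequences are grouped into MOTU. One common way to think about it is to cluster aligned barcode sequences so that any two sequences differing by no more than a chosen number of bases end up in the same unit (single-linkage clustering). The sketch below assumes exactly that; the cutoff, the function names and the toy sequences are illustrative assumptions rather than the authors' procedure.

```python
def count_differences(a, b):
    """Number of mismatching columns between two aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_motu(sequences, cutoff):
    """Single-linkage clustering of aligned sequences.

    sequences maps specimen name -> aligned sequence (equal lengths assumed).
    Returns a sorted list of clusters, each a sorted list of specimen names.
    """
    names = sorted(sequences)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if count_differences(sequences[a], sequences[b]) <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Made-up aligned fragments for three specimens.
sequences = {
    "s1": "ACGGTCCGGAAGGCT",
    "s2": "ACGGTCCGGAAGGCA",
    "s3": "TTGGTCCGGTTGGCA",
}
motus = cluster_motu(sequences, cutoff=2)
```

The all-against-all comparison is quadratic in the number of sequences, so a survey with millions of sequences would need a faster clustering strategy than this sketch.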

5: Nameless taxa & endless forms

~1.2 million described species
~10-100 million species in reality
Thus, most ‘species’ will never be formally named.

5: Nameless taxa & endless forms

How do we incorporate these myriad ‘nameless taxa’ into our systems?

Martin Jones: TaxMan, iPhy & chelicerate evolution
Robin Floyd & Jenna Mann: MOTU and barcoding
Ralf Schmid, James Wasmuth & Ann Hedley: PartiGene & EST analysis
