phytome a plant comparative genomics resource

www.PHYTOME.org a plant comparative genomics resource Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann


Post on 13-Jan-2016


Page 1: PHYTOME  a plant comparative genomics resource

www.PHYTOME.org

a plant comparative genomics resource

Todd Vision,

Jason Phillips, Dihui Lu, Stefanie Hartmann

Page 2: PHYTOME  a plant comparative genomics resource

Outline of today’s presentation

1. What kind of data is stored in Phytome - and how did we generate this data?

2. How can you search Phytome?

3. What kind of results will Phytome give you?

Page 3: PHYTOME  a plant comparative genomics resource

Phytome integrates

organismal phylogeny

gene family information: sequences, alignments, phylogenies

genetic and physical maps

Page 4: PHYTOME  a plant comparative genomics resource

Phytome: applications

Starting with a gene family
• resolve orthology/paralogy relationships
• identify coevolving families

Starting with a species
• explore lineage-specific diversification
• guide comparative mapping bench-work

Starting with a chromosome segment
• identify homologous segments
• predict unobserved gene content (candidate QTL)

Page 5: PHYTOME  a plant comparative genomics resource

overview of the pipeline

Page 6: PHYTOME  a plant comparative genomics resource

data acquisition

EST - expressed sequence tags
• are partial sequences of expressed genes
• are error-prone, contain sequence or frame shift errors
• are very useful for discovering new genes, provide data on gene expression, make up much of the sequence data

EST contig assemblies
• contigs: continuous sequences of multiple overlapping ESTs
• singletons: don’t match other ESTs in the dataset

sources
• TIGR, Plant GDB, NCBI, TAIR, Sputnik, Plant Genome Network
• for each species, we used the source with the largest number of ESTs

[Figure labels: DNA, pre-RNA, mRNA, cDNA, cDNA clone, protein]

Page 7: PHYTOME  a plant comparative genomics resource

data acquisition/organismal phylogenies

Glycine max

Phaseolus coccineus

Lotus corniculatus

Medicago truncatula

Cucumis sativus

Prunus persica

Populus tremula x tremuloides

Arabidopsis thaliana

Brassica napus

Gossypium hirsutum

Theobroma cacao

Citrus sinensis

Vitis vinifera

Lycopersicon esculentum

Solanum tuberosum

Capsicum annuum

Nicotiana benthamiana

Helianthus annuus

Zinnia elegans

Stevia rebaudiana

Lactuca sativa

Beta vulgaris

Mesembryanthemum crystallinum

Eschscholzia californica

Hordeum vulgare

Triticum aestivum

Secale cereale

Avena sativa

Zea mays

Sorghum bicolor

Oryza sativa

Allium cepa

Amborella trichopoda

Cryptomeria japonica

Pinus taeda

Cycas rumphii

Ceratopteris richardii

Marchantia polymorpha

Physcomitrella patens

Saccharum officinarum

[Tree clade labels: Angiosperms, eudicotyledons, core eudicots, rosids, asterids, Liliopsida, conifers, cycad, fern, moss, liverwort]

Page 8: PHYTOME  a plant comparative genomics resource

protein sequence prediction

from EST contigs to peptide sequences: ESTwise

• translate cDNA sequence (ESTs) in all reading frames
• compare the translated DNA to a database of known proteins (Swiss-Prot, TrEMBL)
• use this information for gene prediction/translation
• correct frame shift errors based on the homology information

protein  TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV
         TVK++HF +WG + D DYF+ +G I  I I+ D+ +GKKRGF FV FD +D VDK+V+QK H +NG   +V
         TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV
EST      agaaactNctgacagtgttgctgaaggagaaagcgagaaagt2tgatggcgtggaagacatcagagcatgg
         ctaggataaggctcagaataaagatattattcaggggaaggt ttctagaactaatttaaaactagaaNat
         tgagcttgagagcgcttttagtaatagtacgtcactcgagct tactcctccgtgtctgacttgtccctat
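The first step listed above, translating a cDNA sequence in all reading frames, can be sketched in a few lines of Python. This is only an illustration of six-frame translation with the standard genetic code; it is not ESTwise, which additionally uses protein homology to choose the frame and correct frame shifts. The function names are made up for this example, and unknown codons (for instance those containing N) are rendered as X.

```python
# Standard genetic code, built from the compact TCAG-ordered encoding.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; incomplete trailing codons are dropped."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an ambiguous codon
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frame_translations("ATGGCCTAA").items():
        print(name, peptide)
```

A homology-based tool would then compare each translated frame against known proteins and keep the frame, or combination of frames, that aligns best.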

Page 9: PHYTOME  a plant comparative genomics resource

protein family clustering (Tribe-MCL)

input:
• a set of proteins
• BLAST-all vs. BLAST-all values

method:
• construct weighted graph
• convert into Markov matrix
• expansion and inflation, repeated until the matrix doesn’t change

output:
• clusters of related proteins: protein families
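The method steps above (weighted graph, Markov matrix, alternating expansion and inflation until the matrix stops changing) can be sketched as follows. This is a minimal pure-Python illustration of Markov clustering on a toy similarity matrix, not the Tribe-MCL implementation; the function names, the self-loop weight of 1.0, the inflation value of 2.0, and the toy weights standing in for BLAST scores are all assumptions made for the example.

```python
def normalize_columns(m):
    """Scale each column to sum to 1, giving a column-stochastic (Markov) matrix."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric weighted graph; returns sorted member lists."""
    n = len(weights)
    # Add self-loops (an assumed weight of 1.0) before normalizing.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:  # the matrix no longer changes
            break
    # Each row that still carries flow is an attractor; its non-zero
    # columns are the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy graph: two tight triangles joined by one weak edge (2-3).
    w = [[0.0] * 6 for _ in range(6)]
    for a, b, s in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                    (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
        w[a][b] = w[b][a] = float(s)
    print(mcl(w))
```

Raising the inflation value generally yields more, smaller clusters, which is the knob a family-clustering pipeline would tune.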

Page 10: PHYTOME  a plant comparative genomics resource

protein family clustering (Tribe-MCL)


image taken from the MCL homepage: http://micans.org/mcl/

Page 11: PHYTOME  a plant comparative genomics resource

protein family clustering (Tribe-MCL)

Page 12: PHYTOME  a plant comparative genomics resource

multiple sequence alignment

tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency-based

progressive sequence alignment:
1. generate pairwise distances from pairwise comparisons of the sequences
2. use distances to construct a guide tree
3. start by aligning the most similar sequences
4. progressively add more sequences to the existing alignment
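Steps 1 to 3 of the progressive scheme above can be illustrated with a small sketch that computes pairwise distances and derives the order in which sequences would be joined. It is a simplification under stated assumptions: plain edit distance stands in for the alignment-based distances a real program computes, average-linkage clustering stands in for its guide-tree method, and the actual profile alignment (step 4) is omitted. The function names are made up for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in pairwise distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Return a nested tuple of sequence indices giving the joining order."""
    dist = {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

    def average(c1, c2):
        pairs = [(min(a, b), max(a, b)) for a in c1 for b in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Each cluster is (member indices, subtree built so far).
    clusters = [((i,), i) for i in range(len(seqs))]
    while len(clusters) > 1:
        # Join the two closest clusters first (most similar sequences).
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]][0], clusters[p[1]][0]))
        merged = (clusters[x][0] + clusters[y][0],
                  (clusters[x][1], clusters[y][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][1]


if __name__ == "__main__":
    print(guide_tree(["ACGT", "ACGA", "TTTT"]))
```

In the toy input the first two sequences differ at one position, so they are paired before the third sequence is added.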

Page 13: PHYTOME  a plant comparative genomics resource

multiple sequence alignment

1. identification of homologous proteins, clustering these into a Phytome family, generation of a multiple sequence alignment

2. identification of homologous sequence positions within the homologous proteins, i.e., of columns of amino acids that share a common ancestral amino acid

Page 14: PHYTOME  a plant comparative genomics resource

multiple sequence alignment

1. find columns that will be retained
• remove columns with low average pairwise scores
• remove columns with high percentage of gaps

Page 15: PHYTOME  a plant comparative genomics resource

multiple sequence alignment

1. find columns that will be retained
• remove columns with low average pairwise scores
• remove columns with high percentage of gaps

2. find sequences that will be retained
• remove sequences with a high proportion of gaps within the retained columns
• remove misaligned sequences (i.e., with a low overall score)

3. final check
• are enough sequences left for a phylogeny?
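The gap-based parts of the three steps above can be sketched as a small filter. This is a hedged illustration only: it covers the gap criteria and the final count check, leaves out the pairwise-score criteria, and the thresholds (0.5 for columns and sequences, a minimum of 4 sequences) as well as the function names are assumptions, not values taken from the pipeline.

```python
def gap_fraction(chars):
    """Fraction of positions that are gap characters ('-')."""
    return sum(c == "-" for c in chars) / len(chars)


def filter_alignment(aln, max_col_gaps=0.5, max_seq_gaps=0.5, min_seqs=4):
    """Drop gappy columns, then gappy sequences; report whether enough remain.

    aln maps sequence names to aligned strings of equal length.
    Returns (reduced alignment, enough_sequences_left).
    """
    names = list(aln)
    length = len(next(iter(aln.values())))

    # Step 1: keep columns whose gap fraction is acceptable.
    keep_cols = [j for j in range(length)
                 if gap_fraction([aln[n][j] for n in names]) <= max_col_gaps]
    reduced = {n: "".join(aln[n][j] for j in keep_cols) for n in names}

    # Step 2: keep sequences that are not mostly gaps in the retained columns.
    kept = {n: s for n, s in reduced.items()
            if s and gap_fraction(s) <= max_seq_gaps}

    # Step 3: final check on the number of remaining sequences.
    return kept, len(kept) >= min_seqs


if __name__ == "__main__":
    toy = {"a": "AC-GT", "b": "AC-GT", "c": "A--GT", "d": "----T"}
    print(filter_alignment(toy, min_seqs=3))
```

In the toy alignment the all-gap third column is removed, and sequence "d" is then dropped because three of its four remaining positions are gaps.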

Page 16: PHYTOME  a plant comparative genomics resource

phylogenetic inference

• generate distance matrix
• generate unrooted neighbor-joining tree
• midpoint-root the tree
• do molecular clock test

[Figure labels for the programs used: TreePuzzle, PHYLIP, ?]
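The first two steps above can be illustrated with a short sketch: a distance matrix computed from aligned sequences, and the neighbor-joining criterion that picks the first pair to join. The slide names PHYLIP and TreePuzzle as the programs used; the code below is a hand-written illustration, not their implementation. The uncorrected p-distance and the function names are assumptions for the example, and the remaining neighbor-joining iterations, midpoint rooting, and the clock test are omitted.

```python
def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no comparable positions: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances for aligned sequences."""
    n = len(seqs)
    return [[0.0 if i == j else p_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]


def first_nj_pair(d):
    """Return the pair (i, j) minimizing the neighbor-joining Q criterion.

    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]
    """
    n = len(d)
    totals = [sum(row) for row in d]
    best, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best, best_q = (i, j), q
    return best, best_q


if __name__ == "__main__":
    # Small five-taxon distance matrix used only as a worked example.
    d = [[0, 5, 9, 9, 8],
         [5, 0, 10, 10, 9],
         [9, 10, 0, 8, 7],
         [9, 10, 8, 0, 3],
         [8, 9, 7, 3, 0]]
    print(first_nj_pair(d))
    print(distance_matrix(["ACGT", "ACGA", "AC-T"]))
```

Note that the pair with the smallest raw distance (taxa 3 and 4, distance 3) is not the one joined first; the Q criterion also accounts for how far each taxon is from all the others.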

Page 17: PHYTOME  a plant comparative genomics resource

defining subfamilies

ghir40678

taes49609

lsat28223

taes10592

lsat22003

taes12120

pper2228

soff68095

cjap1662

zmay5764

crum2659

soff59135

sbic29242

soff91873

lsat25221

taes42042

hvul18430

stub712

nben1351

taes10593

osat87929

zmay10735

lsat24951

sbic10907

lsat35999

gmax12743

taes100462

cann3062

ptre15750

lesc54493

stub32048

ghir40662

lsat25017

ecal221

ghir36382

bvul1173

ghir31978

ghir27968

stub12723

[Figure: tree of the family members listed above, with members numbered within eight subfamilies of 6, 6, 6, 4, 10, 2, 3, and 2 members]

Page 18: PHYTOME  a plant comparative genomics resource

webflow, overview

search pages

result pages

Page 19: PHYTOME  a plant comparative genomics resource
Page 20: PHYTOME  a plant comparative genomics resource

Lab meeting, Sept 13, 2004: Phytome demo

Dihui - BLAST search
• A friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, and it's salt-tolerant, and that's all I know about it). She just cloned a cDNA and wants to find out more about it: what it does and which other genes in which other taxa it is related to.
• Though Lophopyrum is not among the species represented in Phytome, I offered to see if I can find out more about her gene.
• Best to use for this: the single BLAST search.
• Navigate to the single BLAST search and explain the page. Mention batch BLAST.
• Paste the friend's sequence into the appropriate field:
• MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGMAGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH
• Explain the results page.
• View the best result: taes7111 from wheat.
• Go to the best scoring family: 1980.

Stefanie - Unigene search
• http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167
• Search Phytome for InterPro entry 000167.
• Look at the hvul1175 entry:
  o The family and subfamily ID
  o InterPro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily
  o The species name
  o A link to the primary source for this unigene sequence
  o A list of related unigenes (from all sources) that contain common GenBank accession numbers in their assembly
  o Predicted peptide sequence (available for download in FASTA format)

Jason - "restrict by species" search
• You can search for families that do or do not contain members from particular species. Navigate to the "restrict by species" search and explain the page.
• The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information),
• and you can select families to include or exclude using radio buttons to the right of each species name.
• If the default "either" is selected, Phytome will return a family regardless of whether there are members from that species.
• I'm interested in monocot gene families (Hordeum-barley to Allium-onion): I want to exclude all other taxa and only use gene families with monocot members. NOTE: explain the difference between "include" monocots and "either" monocots: because species with small numbers of Unipeptides will necessarily lack members in most families, selecting "include" will return NO families!
• 119273 families were retrieved. Their family ID is shown.
• Click on family number 1980.
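The include/exclude/either logic described in the notes above can be sketched as a simple set filter. This is an assumption about how such a search could behave, not Phytome's actual query code; the function name is hypothetical and the family memberships in the usage example are invented for illustration.

```python
def restrict_by_species(families, choices):
    """Return IDs of families that satisfy the per-species choices.

    families: dict mapping family ID -> set of species with members in it
    choices:  dict mapping species -> "include", "exclude" or "either"
              (species not listed are treated as "either")
    """
    required = {s for s, c in choices.items() if c == "include"}
    banned = {s for s, c in choices.items() if c == "exclude"}
    return [fid for fid, species in families.items()
            if required <= species and not (banned & species)]


if __name__ == "__main__":
    # Invented memberships, for illustration only.
    families = {
        1980: {"Hordeum vulgare", "Triticum aestivum"},
        7: {"Hordeum vulgare", "Glycine max"},
        8: {"Allium cepa"},
    }
    # Excluding a non-monocot keeps every family without soybean members.
    print(restrict_by_species(families, {"Glycine max": "exclude"}))
    # Requiring several monocots at once can easily return nothing.
    print(restrict_by_species(
        families,
        {"Hordeum vulgare": "include", "Allium cepa": "include"},
    ))
```

The second call shows the pitfall from the note: "include" demands a member from every selected species, so sparsely sampled species rule out most families.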

Stefanie - family results page
• The "Family Information Page" includes
  o Related families if this family is part of a superfamily (?)
  o Hyperlinks to subfamilies (these will work if the "Subfamily" tab is selected).
  o A link to a list of family members excluded from the reduced alignment by REAP
  o A list of those species represented within the family (these will work with the default species tab)
• The tabs below allow one to view
  o A list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected. From these lists, you may select members to include in a multiple alignment and/or phylogeny.
  o InterPro and GO assignments for an exemplar of each subfamily.
  o By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component GenBank accession numbers.

Page 21: PHYTOME  a plant comparative genomics resource
Page 22: PHYTOME  a plant comparative genomics resource
Page 23: PHYTOME  a plant comparative genomics resource
Page 24: PHYTOME  a plant comparative genomics resource

protein family clustering (Tribe-MCL)

I =  5    3.6  2.8  2.0  1.2

     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     2    2    1    1    1
     2    2    1    1    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     4    4    3    1    1
     4    4    3    1    1
     4    4    3    1    1
     4    4    2    1    1
     5    5    4    1    1
     5    5    4    1    1
     6    5    4    3    1
     6    5    4    3    1

Page 25: PHYTOME  a plant comparative genomics resource

almost 1 million EST contigs/singletons

ESTwise translation

730,000 unigenes

BLAST all vs. BLAST all

640,000 unigenes to be clustered into families

110,000 singletons

...some numbers

Page 26: PHYTOME  a plant comparative genomics resource

data acquisition

species                                tax_id  common name                  source (NCBI, PGDB, PGN, SPNK, or TIGR)
Allium cepa                            4679    onion                        X
Amborella trichopoda                   13333   amborella                    X
Arabidopsis thaliana                   3702    thale cress                  X
Avena sativa                           4498    oat                          X
Beta vulgaris                          161934  sugarbeet                    X
Brassica napus                         3708    rape                         X
Capsicum annuum                        4072    (ornamental) pepper          X
Ceratopteris richardii                 49495   water sprite or indian fern  X
Citrus sinensis                        2711    orange                       X
Cryptomeria japonica                   3369    Japanese cedar               X
Cucumis sativus                        3659    cucumber                     X
Cycas rumphii                          58031   sago palm or seashore cycad  X
Eschscholzia californica               3467    california poppy             X
Glycine max                            3847    soybean                      X
Gossypium hirsutum                     3635    cotton (tetraploid)          X
Helianthus annuus                      4232    sunflower                    X
Hordeum vulgare                        4513    barley                       X
Lactuca sativa                         4236    lettuce                      X
Lotus corniculatus                     47247   lotus                        X
Lycopersicon esculentum                4081    tomato                       X
Marchantia polymorpha                  3197    marchantia                   X
Medicago truncatula                    3880    barrel medic                 X
Mesembryanthemum crystallinum          3544    ice plant                    X
Nicotiana benthamiana                  4100    wild tobacco                 X
Oryza sativa                           4530    rice                         X
Physcomitrella patens                  3218    Physcomitrella moss          X
Pinus taeda                            3352    loblolly pine                X
Phaseolus coccineus                    3886    scarlet runner bean          X
Populus tremula x Populus tremuloides  47664   aspen                        X
Prunus persica                         3760    peach                        X
Saccharum officinarum                  4547    plume grass or sugar cane    X
Secale cereale                         4550    rye                          X
Solanum tuberosum                      4113    potato                       X
Sorghum bicolor                        4558    sorghum                      X
Stevia rebaudiana                      55670   candyleaf                    X
Theobroma cacao                        3641    cacao                        X
Triticum aestivum                      4565    wheat                        X
Vitis vinifera                         29760   wine grape                   X
Zea mays                               4577    corn                         X
Zinnia elegans                         34245   zinnia                       X

Page 27: PHYTOME  a plant comparative genomics resource

multiple sequence alignment

tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency-based

family  ClustalW  Mafft i  Mafft p2  Mafft p3  T-Coffee  Dialign
1       2061      12380    93        312       –         –
2       360       845      32        73        –         8429
3       5108      8414     182       467       –         –
4       950       2470     45        101       –         12533
5       307       404      22        59        –         3564
6       87        125      9         31        –         1376
7       104       128      9         24        –         1075
8       105       114      8         20        19207     887
9       46        33       6         16        11820     394
10      145       296      17        36        7736      898
11      4         5        1         3         177       27