Emerging Data Standards for Phylogenomics Research

Christian M Zmasek, PhD Burnham Institute for Medical Research Bioinformatics and Systems Biology www.phylosoft.org www.phyloxml.org


Post on 31-Dec-2015


Christian M Zmasek, PhD
Burnham Institute for Medical Research
Bioinformatics and Systems Biology
www.phylosoft.org
www.phyloxml.org

Phylogenomics

Original definition
  the application of phylogenetic information for gene function analysis (Eisen, 1998)

Recent usage
  species evolution based on whole genome analyses (for example, Dunn et al., 2008)
  various types of studies at the intersection of genomics and phylogenetics


[Figure: example trees with leaves labeled RAT, MOUSE, HUMAN, and CIONA, and with labels X, Y, and Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication.]


What information do we need for a phylogenomic analysis (sequence function analysis type)?

In phylogenomic analyses, tree nodes might be annotated with:
  Sequence name
  Species name
  Duplication: true/false

Branches might be annotated with:
  Branch lengths
  Support values (bootstrap, probability, …)
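The node and branch annotations listed above could be held in a small record type. The following is a minimal sketch only: the class name, field names, and helper function are illustrative assumptions, not part of any format or tool named in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AnnotatedNode:
    """One tree node plus the branch leading to it (hypothetical names)."""
    sequence_name: Optional[str] = None
    species_name: Optional[str] = None
    duplication: Optional[bool] = None   # True/False as on the slide; None = unknown
    branch_length: Optional[float] = None
    # Several support values may coexist, keyed by type, e.g. {"bootstrap": 90.0}.
    support: Dict[str, float] = field(default_factory=dict)
    children: List["AnnotatedNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        """A node without children is an external node (leaf)."""
        return not self.children


def count_duplications(node: AnnotatedNode) -> int:
    """Count the nodes flagged as duplications in the subtree rooted at node."""
    own = 1 if node.duplication else 0
    return own + sum(count_duplications(child) for child in node.children)
```

Keeping support values in a dictionary rather than a single number leaves room for the case, raised on the next slide, where a branch carries more than one kind of support.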


What information might we need for other types of phylogenomic analyses?

  Support values (possibly multiple)
  Taxonomy information (possibly detailed)
  Geographic information
  Host/parasite data (relation between tree nodes)
  Gene expression values
  Genomic location
  Mutations, variation, disease
  …


How is this information processed and stored?

Tree topologies are described by nested parentheses: ((A,B),C)
Unique tree node labels mapped to text files, spreadsheets, databases
Manual processing of text files with text editors
Macros, shell scripts, Perl scripts
New Hampshire eXtended (NHX) format
  Adds tags for different fields:
    Species: S=
    Bootstrap support: B=
  Example: ADH2:0.1[&&NHX:S=human:B=90]
  http://www.phylosoft.org/forester/NHX.html
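To make the NHX tag syntax concrete, here is a minimal sketch that splits a single node label such as the example above into its name, branch length, and tags. The function name and the regular expression are assumptions made for illustration. The sketch handles one node label only, not a whole tree, and it does not cover quoted names or the full NHX tag set.

```python
import re
from typing import Dict, Optional, Tuple

# Pattern: name, optional ":branch_length", optional "[&&NHX:...]" comment.
_NODE = re.compile(
    r"^(?P<name>[^:\[\]]*)"
    r"(?::(?P<length>[^\[\]]+))?"
    r"(?:\[&&NHX:(?P<tags>[^\]]*)\])?$"
)


def parse_nhx_node(text: str) -> Tuple[str, Optional[float], Dict[str, str]]:
    """Split one NHX node label into (name, branch length, tag dictionary)."""
    match = _NODE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable NHX node label: {text!r}")
    length = float(match.group("length")) if match.group("length") else None
    tags: Dict[str, str] = {}
    # Tags are colon-separated KEY=VALUE pairs, e.g. "S=human:B=90".
    for item in filter(None, (match.group("tags") or "").split(":")):
        key, _, value = item.partition("=")
        tags[key] = value
    return match.group("name"), length, tags


name, length, tags = parse_nhx_node("ADH2:0.1[&&NHX:S=human:B=90]")
print(name, length, tags)
```

For the slide's example this yields the name ADH2, the branch length 0.1, and the tags S=human and B=90. Tag values stay as strings, because the meaning of each tag is defined outside the label itself.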


How is this information published?

Mostly as images of phylogenetic trees in journals
  not suitable as input for further studies!
Submission to (publicly accessible) databases rare


Problems with this approach

  Tedious
  Error-prone
  Published images are difficult to use as input for further studies
  Meta-analyses are hard
  Different, and incompatible, "dialects" of NHX appeared
  Limited expressiveness


phyloXML by example

<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade>
        <name>A</name>
        <branch_length>0.102</branch_length>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
      </clade>
    </clade>
    <clade>
      <name>C</name>
      <branch_length>0.4</branch_length>
    </clade>
  </clade>
</phylogeny>
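Because phyloXML is ordinary XML, a general-purpose XML library can read it. As a minimal sketch, the following uses Python's standard xml.etree.ElementTree to walk the nested clade elements of the example above and print the equivalent parenthesized tree. The function name to_newick is an assumption for illustration. The snippet is parsed exactly as shown on the slide, without an XML namespace, so files that declare one would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

PHYLOXML_SNIPPET = """
<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade><name>A</name><branch_length>0.102</branch_length></clade>
      <clade><name>B</name><branch_length>0.23</branch_length></clade>
    </clade>
    <clade><name>C</name><branch_length>0.4</branch_length></clade>
  </clade>
</phylogeny>
"""


def to_newick(clade: ET.Element) -> str:
    """Recursively render a <clade> element as a parenthesized subtree string."""
    children = clade.findall("clade")            # direct child clades only
    label = clade.findtext("name", default="")
    if children:
        label = "(" + ",".join(to_newick(c) for c in children) + ")" + label
    length = clade.findtext("branch_length")     # None when the element is absent
    return label + (":" + length if length is not None else "")


phylogeny = ET.fromstring(PHYLOXML_SNIPPET)
root_clade = phylogeny.find("clade")
print(to_newick(root_clade) + ";")
# prints: ((A:0.102,B:0.23):0.06,C:0.4);
```

The recursion mirrors the nesting of the clade elements: each clade contributes its own name and branch length, and its child clades become the parenthesized group in front of it.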


phyloXML

Important elements:
  Taxonomy
  Sequence
  Confidence
  Events (duplication, speciation)
  Property ("custom data")
  Typed relations (between clades, sequences)

XSD schema, examples, description, applications: http://www.phyloxml.org/

Current version: 1.0


Important clade level elements

<taxonomy>
  <id source=""> <scientific_name> <common_name> <rank> <uri>
<sequence>
  <symbol> <accession source=""> <name> <uri>
<confidence type="">
<distribution>
  <desc>
  <point geodetic_datum="">
    <lat> <long> <alt>
<property ref="" unit="" datatype="">
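As an illustration of how some of these clade level elements nest, the sketch below builds one annotated clade with Python's standard xml.etree.ElementTree. The function name, its parameters, and the sample values are assumptions for illustration. The order of the child elements is also an assumption; the XSD schema at http://www.phyloxml.org/ is the authority on element order and on which attributes are required.

```python
import xml.etree.ElementTree as ET


def make_clade(name, common_name, symbol, support, support_type="bootstrap"):
    """Build one <clade> carrying confidence, taxonomy, and sequence children."""
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    # <confidence type=""> holds one support value; the type says which kind.
    confidence = ET.SubElement(clade, "confidence", {"type": support_type})
    confidence.text = str(support)
    # <taxonomy> and <sequence> are containers for more specific child elements.
    taxonomy = ET.SubElement(clade, "taxonomy")
    ET.SubElement(taxonomy, "common_name").text = common_name
    sequence = ET.SubElement(clade, "sequence")
    ET.SubElement(sequence, "symbol").text = symbol
    return clade


# Sample values taken from the NHX example (ADH2, human, bootstrap 90).
clade = make_clade("ADH2", "human", "ADH2", 90)
print(ET.tostring(clade, encoding="unicode"))
```

Where NHX packs species and bootstrap into S= and B= tags on one label, the same facts here sit in separately named elements, which is what leaves room for the more detailed taxonomy and sequence fields listed above.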


phyloXML applications/implementations (examples)

BioPerl:
  Parser, writer
ATV (A Tree Viewer)
  Java based tree display tool suitable for large (>10 000) and highly decorated phylogenetic/taxonomic trees
  http://www.phylosoft.org/atv
phyloxml_converter
  Command line tool to convert Newick (NH), NHX, and Nexus formatted trees to phyloXML
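To show what such a conversion involves, here is a minimal sketch that turns a plain Newick string into nested clade elements. It is not the phyloxml_converter tool and makes no claim about how that tool works. The function name and the rooted parameter are assumptions, and the sketch handles only names and branch lengths (no NHX tags, quoted names, or Nexus blocks).

```python
import xml.etree.ElementTree as ET


def newick_to_phyloxml(newick: str, rooted: bool = True) -> ET.Element:
    """Convert a plain Newick string into a <phylogeny> element tree."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_clade() -> ET.Element:
        nonlocal pos
        clade = ET.Element("clade")
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                                  # consume "("
            children.append(parse_clade())
            while pos < len(text) and text[pos] == ",":
                pos += 1                              # consume ","
                children.append(parse_clade())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1                                  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos].strip()
        if name:
            ET.SubElement(clade, "name").text = name
        if pos < len(text) and text[pos] == ":":
            pos += 1                                  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = text[start:pos].strip()
            float(length)                             # raises ValueError if not numeric
            ET.SubElement(clade, "branch_length").text = length
        clade.extend(children)                        # child clades follow name and length
        return clade

    phylogeny = ET.Element("phylogeny", {"rooted": "true" if rooted else "false"})
    phylogeny.append(parse_clade())
    if pos != len(text):
        raise ValueError("unexpected trailing characters in Newick string")
    return phylogeny


tree = newick_to_phyloxml("((A:0.102,B:0.23):0.06,C:0.4);")
print(ET.tostring(tree, encoding="unicode"))
```

Newick itself does not record whether a tree is rooted, which is why the sketch takes that as an explicit argument instead of inferring it.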
