

EuPathDB: an integrated resource and tool for eukaryotic pathogen bioinformatics

Aurrecoechea C., Heiges M., Warrenfeltz S. for the EuPathDB team

CTEGD, University of Georgia, Athens, GA USA

ABSTRACT: EuPathDB (http://eupathdb.org) is an integrated bioinformatics database covering several eukaryotic pathogens. Genera represented are Cryptosporidium, Encephalitozoon, Entamoeba, Enterocytozoon, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and Trypanosoma, and the newly added Theileria and Babesia. Each of these groups is supported by a taxon-specific database and web interface, which can be accessed independently of EuPathDB. EuPathDB provides a portal to all these databases and the opportunity to leverage orthology for searches across genera. The databases are updated and expanded about every 2 months, providing online access to the latest genome-scale datasets, including complete genome sequences, annotations, and functional genomics data such as proteomics, microarray, RNA-Seq, ChIP-chip, SAGE and EST data. The specific advantage of the EuPathDB databases lies in the graphical search interface, which allows users to combine datasets while building a search strategy. Multistep search strategies are built one step at a time, choosing from more than 100 searches. The latest EuPathDB release debuts a search for DNA motifs and a method of combining searches based on relative genomic location. This new operation allows the results of successive steps to be combined based on each feature’s location relative to other features. Parameters defining upstream/downstream distances and gene overlap restrict the search results in a way that highlights biologically relevant relationships such as antisense transcription and promoter sharing. The merger of EuPathDB’s user-friendly search strategy system with full and up-to-date databases offers researchers a powerful tool for data mining during computational experiments.

E. dispar, E. histolytica, E. invadens

C. hominis, C. muris, C. parvum

G. lamblia, G. assemblage_B, G. assemblage_E

E. cuniculi, E. intestinalis, E. bieneusi, N. parisii, O. bayeri

B. bovis, T. annulata, T. parva

P. berghei, P. chabaudi, P. falciparum, P. gallinaceum, P. knowlesi, P. vivax, P. yoelii

N. caninum, T. gondii

T. vaginalis

C. fasciculata, L. braziliensis, T. cruzi, L. infantum, L. major, L. mexicana, T. vivax, L. tarentolae, T. brucei, T. congolense

● Quick access to ID and text search options, login, contact, twitter, etc.

● Portal to EuPathDB databases by clicking on icons

● Main Header Tab Bar: mouse-over ‘New Search’ to initiate searches; click ‘My Strategies’ to enter your workspace

● Initiate searches from center panels. Over 100 search types available.

● Identify Genes by: look for genes based on a variety of datasets, including whole genome sequence, coding vs non-coding genes, transcript evidence (microarray, EST), exon count, etc.

● Identify Other Data Types: look for ESTs, SNPs or DNA motifs.

● Tools: access tools like BLAST and PubMed from any EuPathDB home page.

Building search strategies

New way of combining searches based on relative genomic location:

● Graphical representation of your search strategy. Each step can be revised by clicking on the step name.

Searches return a list of IDs (genes, ESTs, SNPs, proteins) that satisfy the conditions of your query parameters. This gene search for protein coding genes in P. falciparum returned 5418 gene IDs.

Taxon-specific databases provide access to the latest available genome-scale datasets. Because all databases are built on the same web architecture, search types and functions are the same across them.

● Filter table showing the distribution of gene IDs across all species in the database.

● Results table with ID as the first column. Columns can be added, changed, deleted or sorted. The entire table can be downloaded as Excel or other formats.

● Click on the ID name to access details in that ID’s record page.

New Search Type: DNA Motif Pattern

Search for DNA motifs such as restriction enzyme sites or transcription factor binding sites.

1. Choose Genomic Segments, DNA Motif Pattern.

2. Initiate the search. It will find all occurrences of GAATTC in the genome.

3. The search generates a step, and the results below it list the genomic segment IDs corresponding to the locations of the EcoRI site: one segment ID for each occurrence of GAATTC in the genome.
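The idea behind the DNA motif search in this panel can be sketched in a few lines of Python. This is only an illustration of scanning a sequence for a literal motif on both strands, not the EuPathDB implementation; the function names `reverse_complement` and `find_motif` and the toy sequences are assumptions made for the example.

```python
import re

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def find_motif(seq: str, motif: str) -> list[tuple[int, int, str]]:
    """Return (start, end, strand) for every occurrence of a literal motif.

    Coordinates are 1-based and inclusive. The motif is treated as a
    literal string; this hypothetical helper does not interpret
    ambiguity codes or regular-expression syntax.
    """
    seq = seq.upper()
    motif = motif.upper()
    hits = []
    # A zero-width lookahead lets overlapping occurrences be reported too.
    for m in re.finditer(f"(?={re.escape(motif)})", seq):
        hits.append((m.start() + 1, m.start() + len(motif), "+"))
    rc = reverse_complement(motif)
    # Palindromic sites such as GAATTC read the same on both strands,
    # so they are reported once, on the forward strand.
    if rc != motif:
        for m in re.finditer(f"(?={re.escape(rc)})", seq):
            hits.append((m.start() + 1, m.start() + len(motif), "-"))
    return sorted(hits)

# Toy sequence with two GAATTC sites (not real genome data).
print(find_motif("AAGAATTCTTGAATTC", "GAATTC"))
```

Each returned tuple plays the role of one genomic segment in the results list: one entry per occurrence of the motif.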

1. Run a query choosing from more than 100 searches. Build strategies for several data types: genes, ESTs, SNPs, ORFs, etc.

2. Add a step: run a second query, combining its results with those of previous searches. Query the results of Step 1 based on functional genomics. Nest strategies to build complexity.

3. Add more steps…
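One way to picture a multistep strategy is as a running set of IDs that each new step refines. The sketch below assumes that steps are combined with ordinary set operations (intersect, union, minus); the function name `combine_steps`, the operation names, and the gene IDs are hypothetical and chosen only for illustration.

```python
def combine_steps(previous: set[str], new: set[str], operation: str) -> set[str]:
    """Combine the ID set of the previous step with a new step's ID set."""
    if operation == "intersect":
        return previous & new   # IDs returned by both steps
    if operation == "union":
        return previous | new   # IDs returned by either step
    if operation == "minus":
        return previous - new   # IDs in the previous step only
    raise ValueError(f"unknown operation: {operation}")

# Hypothetical IDs standing in for real search results.
step1 = {"gene_A", "gene_B", "gene_C"}   # e.g. protein-coding genes
step2 = {"gene_B", "gene_C", "gene_D"}   # e.g. genes with EST evidence
step3 = {"gene_C"}                        # e.g. genes to exclude

result = combine_steps(step1, step2, "intersect")
result = combine_steps(result, step3, "minus")
print(sorted(result))
```

Because each step takes the previous result as its input, a strategy of any length is just a left-to-right chain of such calls.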

1. Run a search. This search for all protein-coding genes in P. falciparum returned 5418 genes.

2. Add a step. The second search here, based on a DNA motif, searches for the EcoRI restriction enzyme site.

3. Combine search results using the co-location function.

4. Carefully consider the 5 user-defined parameters in the logic statement of the co-location function.

5. View results. The results table lists 214 IDs of genes whose upstream 500 bp region contains the EcoRI site. The column ‘Matched Regions’ defines the genomic location of the EcoRI site within the gene.

i. Return IDs from either step 1 or 2.

ii. Define relative location (“Region”) of the returned data type. Search the exact region, upstream, or downstream of the returned data type.

iii. Define relationship between step 1 and 2 results’ regions: contains, overlaps, or is contained in.

iv. Define relative location (“Region”) of the other (non-returned) step result.

v. Define strand to be considered in the operation: either, same or both.
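The co-location logic described by these parameters can be sketched as an interval test. The example below covers just one configuration: return genes (parameter i) whose upstream region (ii) contains (iii) the exact region of a motif hit (iv), optionally restricted to the same strand (v). The `Feature` class, the function names, and all coordinates are assumptions made for illustration, not the EuPathDB implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    id: str
    seq_id: str   # chromosome or contig name
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"

def upstream_region(gene: Feature, length: int) -> tuple[int, int]:
    """Upstream interval of a gene; 'upstream' depends on the gene's strand."""
    if gene.strand == "+":
        return max(1, gene.start - length), gene.start - 1
    return gene.end + 1, gene.end + length

def genes_with_upstream_hit(genes, hits, length=500, same_strand=False):
    """Map gene ID -> matched hit regions fully contained in its upstream region."""
    matched: dict[str, list[tuple[int, int]]] = {}
    for gene in genes:
        lo, hi = upstream_region(gene, length)
        for hit in hits:
            if hit.seq_id != gene.seq_id:
                continue
            if same_strand and hit.strand != gene.strand:
                continue
            # "contains": the hit must lie entirely inside the upstream region.
            if lo <= hit.start and hit.end <= hi:
                matched.setdefault(gene.id, []).append((hit.start, hit.end))
    return matched

# Hypothetical genes and motif hits on one sequence.
genes = [Feature("g1", "chr1", 1000, 2000, "+"),
         Feature("g2", "chr1", 3000, 4000, "-")]
hits = [Feature("m1", "chr1", 600, 605, "+"),
        Feature("m2", "chr1", 4100, 4105, "+"),
        Feature("m3", "chr1", 2500, 2505, "+")]

print(genes_with_upstream_hit(genes, hits))
```

Note that for a gene on the minus strand the upstream region lies at higher coordinates, which is why `m2` matches `g2` here. Swapping the containment test for an overlap test (`hit.start <= hi and hit.end >= lo`) would correspond to the "overlaps" relationship.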

The graphical search interface motivates users to prioritize search results based on a variety of data types. The search strategy system provides the opportunity to explore and identify biologically meaningful relationships.