GMODTools, Argos & cetera
A Replicable Genome infOrmation System of Common Components
GMOD Meeting, Oct. 2004
Don Gilbert, [email protected]
• GMOD Tools for public data releases
• Argos framework for genome databases
• LuceGene fast document/object search
• Genome Directory System for genome data mining
• Unified Gene Pages (XML, web page)
Genome DB building blocks
GMOD Tools: Bulkfiles
cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools
• Support common data update and public release
tasks.
• GmodTools to load and extract reagent sequences
(EST, cDNA, GSS) to/from Chado databases.
• GMOD Bulkfiles creates bulk genome sequence and
feature files for public distribution from a Chado
database.
• Citrina is a workflow tool to automate external
databank updates, such as GenBank and Gene
Ontologies.
Genome Data Tools
12 New genomes to go
• Need to publish numerous new genomes
• Bulk files are the standard form of public access:
  • Sequence (fasta, …), features (gff, …), searches (Blast, …)
• 11 new Drosophila genomes; Daphnia genome; many more
• Chado database; XORT & other GMOD Tools to export data
• http://flybase.net/species
Bulkfiles
• Build release files from Chado DB
• Standardized files, headers
• DNA - fasta, raw
• Features - GFF3, gnomap
• Blast indices
• Lucene file indices
• Config files (blast, gbrowse, …)
Bulkfiles - BLAST indices
Bulkfiles - Map features
Bulkfiles OUTPUTS
• DNA files (full chromosomes) in raw and fasta formats
• GFF (v3) and FFF (used in FlyBase) feature files
• Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, …) from feature files
• NCBI BLAST indices & configs
• Gbrowse config files with feature sets matching db
• Others added as needed (more easily than before)
Bulkfiles Logic
• Organism/database logic (mostly) in configuration files
• Dump all chado db features using simple SQL to common intermediate table files
  • Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, …; GFF-like)
  • Easier checking of SQL to get all features desired
  • Fast (30 - 60 min for full fly genome)
• Postprocess table files to create public use formats
• Tested with FOUR different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
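The two-step logic above (dump features to a simple table, then postprocess into public formats) can be sketched roughly as follows. This is not Bulkfiles code: the column layout, the "FlyBase" source label, and the example row (including its ID and db_xref) are made up for illustration, and the real intermediate table format may differ.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical column layout for the intermediate table file;
# the real Bulkfiles table format may differ.
COLUMNS = ["seqid", "type", "start", "end", "strand", "id", "name", "db_xref"]


def row_to_gff3(row, source="FlyBase"):
    """Format one feature-table row as a nine-column GFF3 line."""
    attrs = [("ID", row["id"]), ("Name", row["name"]), ("Dbxref", row["db_xref"])]
    # Percent-encode attribute values; skip attributes with no value.
    attr_text = ";".join(
        f"{key}={quote(value, safe=':,')}" for key, value in attrs if value
    )
    return "\t".join([
        row["seqid"], source, row["type"], row["start"], row["end"],
        ".", row["strand"] or ".", ".", attr_text,
    ])


def table_to_gff3(table_text):
    """Convert a tab-delimited feature table to GFF3 text."""
    reader = csv.DictReader(
        io.StringIO(table_text), fieldnames=COLUMNS, delimiter="\t"
    )
    lines = ["##gff-version 3"]
    lines.extend(row_to_gff3(row) for row in reader)
    return "\n".join(lines)


# Illustrative row only; not real FlyBase data.
EXAMPLE_TABLE = "2L\tgene\t1000\t2500\t+\tFBgn0000001\texample-gene\tExampleDB:123\n"
print(table_to_gff3(EXAMPLE_TABLE))
```

Keeping the table this simple is what makes the SQL easy to check: each output format becomes a small formatter over the same rows.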
Bulkfiles stages
• Postprocess table files in stages
  • Recode feature “oddities” to public view needs
  • Better debugging of steps in the process
  • Engineering time and configuration here
• Stages are loosely coupled; go back, tweak configurations, re-run partially as needed.
• Convert common feature table + dna to several output formats in one step.
• Combine features from several dbs and other sources like cytology here.
Bulkfiles config example
<opt name="fbbulk-r3" relid="3"
     ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp"
     datadir="genomes/Drosophila_melanogaster">
  <title>FlyBase Chado DB r3.2</title>
  <about>
    Configuration for feature and sequence bulk files from
    FlyBase chado data release 3.2.1
  </about>
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <doc name="README">
    D. melanogaster euchromatin genome data from
    FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt
  </doc>
  <include>fbreleases</include>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />
  <idpattern>(FBgn|FBti)\d+</idpattern>
  <include>filesets</include>
  <include>featuresets</include>
</opt>
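To show how a configuration like this might be consumed, here is a minimal sketch that reads a trimmed copy of the example with Python's standard library. It is not the Bulkfiles loader: the `load_config` helper, the `/opt/gmod` root, and the test IDs are assumptions, and `<include>` handling is left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from string import Template

# Trimmed copy of the slide's example config (raw string keeps "\d" intact).
CONFIG_XML = r"""
<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/"
     datadir="genomes/Drosophila_melanogaster">
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302"
      user="" password="" />
  <idpattern>(FBgn|FBti)\d+</idpattern>
</opt>
"""


def load_config(xml_text, env):
    """Parse the config and expand ${VAR} references from env (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def expand(value):
        # Unknown variables are left as-is rather than raising.
        return Template(value).safe_substitute(env)

    return {
        "name": root.get("name"),
        "root": expand(root.get("ROOT")),
        "datadir": root.get("datadir"),
        "org": root.findtext("org"),
        "db": dict(root.find("db").attrib),
        "idpattern": re.compile(root.findtext("idpattern")),
    }


# "/opt/gmod" is an assumed install location, not one named in the slides.
config = load_config(CONFIG_XML, {"GMOD_ROOT": "/opt/gmod"})
print(config["root"] + config["datadir"])
print(config["db"]["name"], config["db"]["port"])
print(bool(config["idpattern"].fullmatch("FBgn0000001")))
```

The `idpattern` entry is an ordinary regular expression, so it can be compiled once and used to pick out FlyBase gene (FBgn) and insertion (FBti) identifiers.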
ARGOS http://www.gmod.org/argos
ARGOS Genome DBs
• Automate genome database install & update
  • Eliminate { fetch, compile, install, configure, … } cycle
  • Developers test, compile, config once; others copy/run
• Start new project quickly - copy existing project & edit to suit
• Clone servers easily (local cluster; global mirrors; company/lab; laptop)
• Compatible with most GMOD projects
• Secure collaborative genome db features
• Goal: easy for biologists to use with minimal informatics expertise
ARGOS Focus
ARGOS Components
Section: Components
FlyBase (e.g.): Data, database indices, documents, web tools specific to genome service
Java: Chado database tools, genome sequence reports, LuceGene search, Ant build system, database interfaces, XML tools, Tomcat web server, Axis web services
Perl: BioPerl, GBrowse, Chado database tools, Cmap comparative maps, database interfaces, Web tools, XML tools
Servers: BLAST (NCBI), Apache web server, PostgreSQL, and BerkeleyDB databases
Systems: Compiled portions for supported operating systems
Install & Root: Common configurations, web server, installation scripts and instructions
ARGOS INSTALL
Edit wFleaBase
LuceGene (‘Lucy Jean’) for Genome Information Search and Retrieval
Document/Object Search and Retrieval in Genome Databases
• high-volume data search and retrieval system for genomics and bioinformatics databases
• standard search features: booleans, phrase, near, relevance
• performance exceeds and extends relational databases
• suited to range of genome data: genes, literature, sequences,
XML annotations, Medline abstracts, HTML, PDF and text
documents.
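As a rough illustration of the boolean and phrase search features listed above, here is a toy inverted index. This is a generic sketch of the data structure behind document search, not LuceGene code; the `TinyIndex` class and the two sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy positional inverted index: term -> doc_id -> list of positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for pos, term in enumerate(tokenize(text)):
            self.postings[term][doc_id].append(pos)

    def all_of(self, *terms):
        """Boolean AND: documents containing every term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def phrase(self, text):
        """Phrase search: documents where the terms appear consecutively."""
        terms = tokenize(text)
        hits = set()
        for doc_id in self.all_of(*terms):
            starts = self.postings[terms[0]][doc_id]
            if any(
                all(p + i in self.postings[t][doc_id] for i, t in enumerate(terms))
                for p in starts
            ):
                hits.add(doc_id)
        return hits


# Made-up sample documents, for illustration only.
index = TinyIndex()
index.add("doc1", "white gene eye color pigment")
index.add("doc2", "eye development gene")

print(sorted(index.all_of("eye", "gene")))
print(sorted(index.phrase("eye color")))
```

Storing term positions alongside document IDs is what lets one structure answer both boolean and phrase queries; "near" search would relax the consecutive-position check to a distance window.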
LuceGene
Example LuceGene libraries
• FlyBase database
  • Annotation GAME XML, Medline XML (gamexml, medxml)
  • Genes, Annotation, References (fbgn, fban, fbrf)
  • Web, literature PDF Documents (docs)
  • Unified Gene Page XML (ugpxml)
  • Sequences, Genome Features (seqs)
• euGenes database
  • Gene summaries, Sequences, Genome Features
  • Unified Gene Page XML
  • Web Documents
• wFleaBase database
  • Sequences, Medline XML, Web documents
• Josh Goodman (gmod)
• Paul Poole (gmod/iubio)
• Hardik Sheth (flybase)
• Nihar Sheth (flybase)
• Vasanth Singan (gmod)
• Victor Strelets (flybase)
And to many developers whose work we learn from and borrow from
Thanks to these folks