gmodtools, argos & cetera a replicable genome information system of common components gmod...

23
GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd @indiana.edu

Upload: gordon-hutchinson

Post on 17-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components

GMOD Meeting, Oct. 2004

Don Gilbert, [email protected]

Page 2: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

• GMOD Tools for public data releases• Argos framework for genome databases

• LuceGene fast document/object search

• Genome Directory System for genome

data mining

• Unified Gene Pages (XML, web page)

Genome DB building blocks

Page 3: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

GMOD Tools: Bulkfilescvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools

Page 4: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

• Support common data update and public release

tasks.

• GmodTools to load and extract reagent sequences

(EST, cDNA, GSS) to/from Chado databases.

• GMOD Bulkfiles creates bulk genome sequence and

feature files for public distribution from a Chado

database.

• Citrina is a workflow tool to automate external

databank updates, such as GenBank and Gene

Ontologies.

Genome Data Tools

Page 5: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

12 New genomes to go

• Need to publish numerous new genomes• Bulk files are standard public access:

• Sequence (fasta, …), features (gff,…), searches (Blast, ..);

• 11 new Drosophila genomes; Daphnia genome; many more• Chado database; XORT & other GMOD Tools to export data• http://flybase.net/species

Page 6: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles

• Build release files from Chado DB

• Standardized files, headers

• DNA - fasta, raw• Features - GFF3,

gnomap• Blast indices• Lucene file indices• Config files (blast,

gbrowse,…)

Page 7: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles - BLAST indices

Page 8: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles - Map features

Page 9: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles OUTPUTS

• DNA files (full chromosomes) in raw and fasta formats

• GFF (v3) and FFF (used in FlyBase) feature files• Fasta sequence for each feature set, with

standardized headers (ID,names,db_xref,…)from feature files

• NCBI BLAST indices & configs• Gbrowse config files with feature sets matching db• Others added as needed (more easily than before)

Page 10: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles Logic

• Organism/database logic (mostly) in configuration files• Dump all chado db features using simple sql to common

intermediate table files• Feature info is simple: type, location, name/id, and a few

attributes (db_xrefs,.. GFF-like)• Easier checking of SQL to get all features desired• Fast (30 - 60 min for full fly genome)

• Postprocess table files to create public use formats• Tested with FOUR different Chado dbs (Dmel, Dmel_hetero,

Dpse_Dmel, and SGDLite)

Page 11: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles stages

• postprocess table files in stages• Recode feature “oddities” to public view needs• Better debugging of steps in the process• Engineering time and configuration here• Stages are loosely coupled; go back, tweak

configurations, re-run partially as needed.• convert common feature table + dna to several

output formats in one step.• combine features from several dbs and other

sources like cytology here.

Page 12: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Bulkfiles config example <opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp" datadir="genomes/Drosophila_melanogaster"> <title>FlyBase Chado DB r3.2</title> <about> Configuration for feature and sequence bulk files from FlyBase chado data release 3.2.1 </about> <org>dmel</org> <species>Drosophila melanogaster</species> <doc name="README"> D. melanogaster euchromatin genome data from

FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt

</doc>

<include>fbreleases</include> <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="” password="" />

<idpattern>(FBgn|FBti)\d+</idpattern>

<include>filesets</include> <include>featuresets</include></opt>

Page 13: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

ARGOS http://www.gmod.org/argos

Page 14: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

ARGOS Genome DBs

Page 15: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

• Automate genome database install & update• Eliminate { fetch, compile, install, configure,…} cycle• Developers test, compile, config once; others copy/run

• Start new project quickly - copy existing project & edit to suit

• Clone servers easily (local cluster; global mirrors; company/lab; laptop)

• Compatible with most GMOD projects• Secure collaborative genome db features• Goal: easy for biologists to use with minimal

informatics expertise

ARGOS Focus

Page 16: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

ARGOS Components

Section ComponentsFlyBase (e.g.) Data, database indices, documents, web tools specific to genome service

Java Chado database tools, genome sequence reports, LuceGene search, Ant buildsystem, database interfaces, XML tools, Tomcat web server, Axis web services

Perl BioPerl, GBrowse, Chado database tools, Cmap comparative maps,database interfaces, Web tools, XML tools

Servers BLAST (NCBI), Apache web server, PostgreSQL, and BerkeleyDBdatabases

Systems Compiled portions for supported operating systemsInstall & Root Common configurations, web server, installation scripts and

instructions

Page 17: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

ARGOS INSTALL

Page 18: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

ARGOS INSTALL

Page 19: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Edit wFleaBase

Page 20: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Lucegene (‘Lucy Jean’)for Genome Information Search and Retrieval

Page 21: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Document/Object Search and Retrieval in Genome Databases • high-volume data search and retrieval system for genomics and

bioinformatics databases

• standard search features: booleans, phrase, near, relevance

• performance exceeds and extends relational databases

• suited to range of genome data: genes, literature, sequences,

XML annotations, Medline abstracts, HTML, PDF and text

documents.

LuceGene

Page 22: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

Example LuceGene libraries

• FlyBase database• Annotation GAME XML, Medline XML (gamexml, medxml)• Genes, Annotation, References (fbgn, fban, fbrf)• Web, literature PDF Documents (docs) • Unified Gene Page XML (ugpxml)• Sequences, Genome Features (seqs)

• euGenes database• Gene summaries, Sequences, Genome Features • Unified Gene Page XML • Web Documents

• wFleaBase database• Sequences, Medline XML, Web documents

Page 23: GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components GMOD Meeting, Oct. 2004 Don Gilbert, gilbertd@indiana.edu

• Josh Goodman (gmod)• Paul Poole (gmod/iubio)• Hardik Sheth (flybase)• Nihar Sheth (flybase)• Vasanth Singan (gmod)• Victor Strelets (flybase)

And to many developers whose work we learn from and borrow from

Thanks to these folks