BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Hilmar Lapp (NESCent), Richard Holland, Aaron Mackey, William Piel, Mark Schreiber

BOSC 2008


Page 2: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

What is BioSQL?

Interoperable persistence layer for Bio* supporting

• BioPerl

• Biojava

• Biopython

• BioRuby

Page 3: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

What is BioSQL?

Generic & extensible relational model

• sequences

• features

• sequence and feature annotation

• a reference taxonomy

• ontologies, controlled vocabularies

• phylogenetic trees or networks


Page 5: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

A Brief History

• Ewan Birney started BioSQL and Bioperl-db in Nov 2001

• Major redesigns and refactorings at several BioHackathons in 2002-2003

• PhyloDB module added at 2006 Phyloinformatics Hackathon

• Reinvigorated at 2008 BioHackathon

• v1.0 released in March 2008

Page 6: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Use Cases

1) Local ‘GenBank’ with random access

2) ‘GenBank’ in relational format

3) Interoperable Bio* persistence

4) My lab sequence database

5) Integrate sequence & annotation databases

Page 7: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

• Website: http://biosql.org

• Mailing list: [email protected]

• Subversion: svn://code.open-bio.org/biosql/biosql-schema

• Bugs: http://bugzilla.open-bio.org

Page 10: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

BioSQL 1.0 -- Relational Model

[Entity-relationship diagram, dated 6/4/2003. Table groups: Annotation Bundle; Bioentry with Taxon and Namespace; Seqfeatures with Location and Annotation; Ontology Terms and Relationships. Tables and columns as labeled in the diagram; where the diagram separates key columns from the rest, a semicolon marks the split.]

• Biodatabase: Biodatabase Id; Name, Authority, Description

• Taxon: Taxon Id; Ncbi Taxon Id, Parent Taxon Id (FK), Node Rank, Genetic Code, Mito Genetic Code, Left Value, Right Value

• Taxon Name: Taxon Id (FK), Name, Name Class

• Ontology: Ontology Id; Name, Definition

• Term: Term Id; Name, Definition, Identifier, Is Obsolete, Ontology Id (FK)

• Term Synonym: Term Id (FK), Synonym

• Term Dbxref: Term Id (FK), Dbxref Id (FK); Rank

• Term Relationship: Term Relationship Id; Subject Term Id (FK), Predicate Term Id (FK), Object Term Id (FK), Ontology Id (FK)

• Term Path: Term Path Id; Subject Term Id (FK), Predicate Term Id (FK), Object Term Id (FK), Ontology Id (FK), Distance

• Bioentry: Bioentry Id; Biodatabase Id (FK), Taxon Id (FK), Name, Accession, Identifier, Division, Description, Version

• Bioentry Relationship: Bioentry Relationship Id; Object Bioentry Id (FK), Subject Bioentry Id (FK), Term Id (FK), Rank

• Bioentry Path: Object Bioentry Id (FK), Subject Bioentry Id (FK), Term Id (FK), Distance

• Biosequence: Bioentry Id (FK); Alphabet, Version, Length, Seq

• Dbxref: Dbxref Id; Dbname, Accession, Version

• Dbxref Qualifier Value: Dbxref Id (FK), Term Id (FK), Rank; Value

• Bioentry Dbxref: Bioentry Id (FK), Dbxref Id (FK); Rank

• Reference: Reference Id; Dbxref Id (FK), Location, Title, Authors, Crc

• Bioentry Reference: Bioentry Id (FK), Reference Id (FK), Rank; Start Pos, End Pos

• Comment: Comment Id; Bioentry Id (FK), Comment Text, Rank

• Bioentry Qualifier Value: Bioentry Id (FK), Term Id (FK), Value, Rank

• Seqfeature: Seqfeature Id; Bioentry Id (FK), Type Term Id (FK), Source Term Id (FK), Display Name, Rank

• Seqfeature Relationship: Seqfeature Relationship Id; Object Seqfeature Id (FK), Subject Seqfeature Id (FK), Term Id (FK), Rank

• Seqfeature Path: Object Seqfeature Id (FK), Subject Seqfeature Id (FK), Term Id (FK), Distance

• Seqfeature Qualifier Value: Seqfeature Id (FK), Term Id (FK), Rank; Value

• Seqfeature Dbxref: Seqfeature Id (FK), Dbxref Id (FK); Rank

• Location: Location Id; Seqfeature Id (FK), Dbxref Id (FK), Term Id (FK), Start Pos, End Pos, Strand, Rank

• Location Qualifier Value: Location Id (FK), Term Id (FK); Value, Int Value
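To make the relational model more concrete, here is a minimal sketch of how a sequence might be looked up by namespace and accession across the Biodatabase, Bioentry, and Biosequence tables. It runs against an in-memory SQLite database. The table definitions are a heavily simplified, hypothetical subset of the diagram, not the released DDL, and the sequence string is toy data.

```python
import sqlite3

# Hypothetical, simplified subset of the tables in the diagram; the
# released BioSQL DDL has more columns and constraints than shown here.
SCHEMA = """
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase(biodatabase_id),
    accession TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 0,
    UNIQUE (biodatabase_id, accession, version)
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet TEXT,
    length INTEGER,
    seq TEXT
);
"""


def fetch_sequence(conn, namespace, accession):
    """Return (alphabet, seq) for an accession in a namespace, or None."""
    return conn.execute(
        """
        SELECT s.alphabet, s.seq
        FROM biodatabase d
        JOIN bioentry e ON e.biodatabase_id = d.biodatabase_id
        JOIN biosequence s ON s.bioentry_id = e.bioentry_id
        WHERE d.name = ? AND e.accession = ?
        """,
        (namespace, accession),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO biodatabase (biodatabase_id, name) VALUES (1, 'RefSeq')")
conn.execute(
    "INSERT INTO bioentry (bioentry_id, biodatabase_id, accession) "
    "VALUES (1, 1, 'NM_000149')"
)
# Toy sequence, not the real record.
conn.execute("INSERT INTO biosequence VALUES (1, 'dna', 8, 'ACGTACGT')")

record = fetch_sequence(conn, "RefSeq", "NM_000149")
```

The namespace lives in Biodatabase rather than in Bioentry, so the same accession can appear in several source databases without colliding.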


Page 13: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Loading & updating the NCBI Taxonomy
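The Taxon table carries a Parent Taxon Id alongside Left Value and Right Value columns, which points to a nested-set encoding of the taxonomy. The sketch below shows, under that assumption, how such left/right values could be derived from parent pointers and then used to list all descendants of a node with a simple range comparison. The function names and the toy node ids are hypothetical. This is not the code of load_ncbi_taxonomy.pl.

```python
def assign_nested_set(parent_of):
    """Compute {node: (left, right)} from a {node: parent or None} mapping."""
    children = {}
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children.setdefault(parent, []).append(node)

    bounds = {}
    counter = 0
    for root in sorted(roots):
        # Iterative depth-first walk; each node is visited on entry and exit.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1
            if leaving:
                bounds[node] = (bounds[node][0], counter)
            else:
                bounds[node] = (counter, None)
                stack.append((node, True))
                for child in sorted(children.get(node, []), reverse=True):
                    stack.append((child, False))
    return bounds


def descendants(bounds, node):
    """All nodes whose interval lies strictly inside the given node's interval."""
    left, right = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)


# Toy hierarchy: 1 is the root, 2 and 3 are its children, 4 and 5 are under 2.
toy_parents = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
toy_bounds = assign_nested_set(toy_parents)
```

With this encoding a subtree query needs no recursion, at the cost of recomputing the intervals whenever the hierarchy changes, which is one reason a dedicated load-and-update step is useful.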

Page 14: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Language Bindings

Page 15: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Bioperl-db

• Step 1: connect, get adaptor factory

    use Bio::DB::BioDB;

    # create the database-specific adaptor factory
    # (implements Bio::DB::DBAdaptorI)
    $db = Bio::DB::BioDB->new(-database => "biosql",
                              # user, pwd, driver, host …
                              -dbcontext => $dbc);

Page 16: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Bioperl-db

• Step 2: e.g., retrieve sequence, add annotation, update in the db

    use Bio::Seq;
    use Bio::SeqFeature::Generic;

    # retrieve the sequence object somehow …
    $adp = $db->get_object_adaptor("Bio::SeqI");
    $dbseq = $adp->find_by_unique_key(
        Bio::Seq->new(-accession_number => "NM_000149",
                      -namespace => "RefSeq"));

    # create a feature as new annotation
    $feat = Bio::SeqFeature::Generic->new(
        -primary_tag => "TFBS", -source_tag => "My Lab",
        -start => 23, -end => 27, -strand => -1);

    # add new annotation to the sequence
    $dbseq->add_SeqFeature($feat);

    # update in the database
    $dbseq->store();

Page 18: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Tools for data loading: Sequences

• load_seqdatabase.pl (in Bioperl-db)

• All Bio::SeqIO and Bio::ClusterIO formats

• Flexible handling of updates: --lookup, --noupdate, --remove, --mergeobjs

• Filtering and processing sequences: --seqfilter, --pipeline

Page 19: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Bindings for most Bio* projects

• BioPerl (Bioperl-db)

• Biojava (BiojavaX)

• Biopython

• BioRuby (Active Objects-based)

• All updated at the 2008 BioHackathon

Page 20: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

v1.0 Release (Tokyo)

• Core BioSQL schema (stable since Nov 2004)

• DDL for MySQL, PostgreSQL, Oracle, HSQLDB, Apache Derby

• Ancillary (but optional) files for PostgreSQL

• Documentation and ERD

• load_ncbi_taxonomy.pl

• Now LGPL v3.0 licensed

Download at http://biosql.org/DIST


Page 22: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Tools for data loading: Ontologies

• load_ontology.pl (in Bioperl-db)

• All Bio::OntologyIO formats

• Additional options for term obsoletion: --noobsolete, --updobsolete, --delobsolete, --mergeobjs

• (Re-)computing the transitive closure: --computetc
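For readers unfamiliar with the term, the transitive closure of an ontology adds every indirect ancestor relationship to the direct ones, together with a distance, which is the kind of data the Term Path table is laid out to hold. The following is a minimal sketch of the idea for a single relationship type, using breadth-first search and recording the shortest distance only. The function name and the toy terms are hypothetical, and the sketch makes no claim about how load_ontology.pl computes the closure.

```python
from collections import deque


def transitive_closure(relationships):
    """Map (subject, object) to the shortest path length between them.

    `relationships` is an iterable of (subject, object) pairs, read as
    "subject is a direct child of object" for one predicate.
    """
    parents = {}
    for subject, obj in relationships:
        parents.setdefault(subject, set()).add(obj)

    closure = {}
    for start in parents:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            term = queue.popleft()
            for parent in parents.get(term, ()):
                if parent not in seen:  # also guards against cycles
                    seen[parent] = seen[term] + 1
                    closure[(start, parent)] = seen[parent]
                    queue.append(parent)
    return closure


# Toy chain of terms: D -> C -> B -> A.
toy_closure = transitive_closure([("D", "C"), ("C", "B"), ("B", "A")])
```

Precomputing these rows trades storage for query speed: a question such as "all features typed with any subtype of X" becomes a single join instead of a recursive walk.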

Page 27: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

PhyloDB Module

• Phylogenetic trees (or networks)

• Metadata for trees, nodes, edges

• Attribute-value pairs

• Database cross-references

• Can attach taxa or genes to nodes

Page 28: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

PhyloDB History

• Started at NESCent Phyloinformatics Hackathon 2006 with Bill Piel

• Expanded metadata capabilities at BioHackathon 2008 (Tokyo)

• Separate, optional module

• Not released yet, still in development

Page 29: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

[PhyloDB schema diagram. Tables and attributes as labeled in the figure:]

• Tree: Name, Identifier, Is_Rooted

• Node: Label, Left_Idx, Right_Idx

• Edge

• Node_Path: Distance

• Tree_Root: Is_Alternate, Significance

• Tree_Qualifier_Value: Value, Rank

• Node_Qualifier_Value: Value, Rank

• Edge_Qualifier_Value: Value, Rank

• Tree_Dbxref

• Node_Dbxref

• Node_Taxon: Rank

• Node_Bioentry: Rank

• Core BioSQL tables shown in the diagram: Biodatabase, Term, Taxon, Bioentry, Ontology, Dbxref
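As an illustration of the kind of topological query that Tree, Node, and Edge tables make possible, the sketch below stores a small rooted tree in an in-memory SQLite database and finds the last common ancestor of two nodes by walking the edges with a recursive query. The column names (child_node_id, parent_node_id), the helper functions, and the toy tree are assumptions made for this example. They are not taken from the PhyloDB DDL.

```python
import sqlite3

# Hypothetical, pared-down tables loosely following the diagram.
SCHEMA = """
CREATE TABLE tree (tree_id INTEGER PRIMARY KEY, name TEXT, is_rooted INTEGER);
CREATE TABLE node (
    node_id INTEGER PRIMARY KEY,
    tree_id INTEGER REFERENCES tree(tree_id),
    label TEXT
);
CREATE TABLE edge (
    edge_id INTEGER PRIMARY KEY,
    child_node_id INTEGER REFERENCES node(node_id),
    parent_node_id INTEGER REFERENCES node(node_id)
);
"""

# Walk from a node up to the root, counting edges along the way.
ANCESTORS = """
WITH RECURSIVE path(node_id, distance) AS (
    SELECT ?, 0
    UNION ALL
    SELECT e.parent_node_id, p.distance + 1
    FROM path p JOIN edge e ON e.child_node_id = p.node_id
)
SELECT node_id, distance FROM path
"""


def ancestors(conn, node_id):
    """Return {ancestor_id: distance}, including the node itself at 0."""
    return dict(conn.execute(ANCESTORS, (node_id,)).fetchall())


def last_common_ancestor(conn, a, b):
    """Return the shared ancestor closest to both nodes, or None."""
    anc_a, anc_b = ancestors(conn, a), ancestors(conn, b)
    shared = set(anc_a) & set(anc_b)
    if not shared:
        return None
    return min(shared, key=lambda n: anc_a[n] + anc_b[n])


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO tree VALUES (1, 'toy tree', 1)")
# Toy topology: root 1 has children 2 and 5; node 2 has children 3 and 4.
conn.executemany(
    "INSERT INTO node VALUES (?, 1, ?)",
    [(1, "root"), (2, "inner"), (3, "leaf A"), (4, "leaf B"), (5, "leaf C")],
)
conn.executemany(
    "INSERT INTO edge (child_node_id, parent_node_id) VALUES (?, ?)",
    [(2, 1), (5, 1), (3, 2), (4, 2)],
)
```

A precomputed Node_Path table or the Left_Idx/Right_Idx columns could answer the same question without recursion. The recursive form is used here only because it needs nothing beyond the edge list.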

Page 30: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Tools for loading data

Page 31: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

• James Estill (U. Georgia): “A Perl-based Command Line Interface to a Topological Query Application for BioSQL in Support of High Throughput Classification and Analysis of LTR Retrotransposons in Plant Genomes”


Page 33: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

What can you use BioSQL for?

Page 34: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Data Integration

[Architecture diagram. Components as labeled in the figure:]

• Ensembl, Celera, LocusLink, RefSeq, UniGene, OMIM, GNF1B, U133A

• Bioperl/Bioperl-DB

• Content Synthesis: genome mappings, relationship harvest

• SymGene (Oracle 9i)

• SQL API (Views, PL/SQL)

• J2EE API

• Genome Browser

• SymAtlas Web-Application (JSPs)

• Published Web-Services

• Rich Client App

Page 36: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Platonic gene graphs

[Graph diagram. Linked identifiers as labeled in the figure:]

• 6850 (LocusLink)

• GNF055813 (GNF cDNA clones)

• hCG29698 (Celera)

• P43405 (UniProt)

• NP_003168 (RefSeq)

• NM_003177 (RefSeq)

• hCT1962558 (Celera)

• hCT20865 (Celera)

• 207540_s_at (HG-U133A)

• Hs.192182 (UniGene)

• ENSG00000165025 (Ensembl)

• ENST00000297685 (Ensembl)

• 36885_s_at (HG_U95Av2)

Page 37: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

• Mostly used as a module with custom extensions

• Squares away sequence annotation and ontologies

Page 38: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Summary

• BioSQL has benefited tremendously from hackathons.

• The v1.0 release makes it possible to chart the way forward.

• PhyloDB module allows cross-project persistence of phylogenetic data.

• Use-cases range from simple to very complex.

Page 39: BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features

Acknowledgments

• Bio* contributors:

Aaron Mackey, Ewan Birney, Thomas Down, Matthew Pocock, Mark Schreiber, Richard Holland, Elia Stupka, Chris Mungall, Brad Chapman, Jeff Chang, Toshiaki Katayama

• Hackathon sponsors:

• DBCLS/CBRC (Tokyo 2008)

• NESCent (Durham 2006)

• Apple (Singapore 2003)

• O’Reilly (Tucson 2002)

• Electric Genetics (Cape Town 2002)