GUS: The Genomics Unified Schema and Application Framework
Michael Saffitz
Computational Biology and Informatics Laboratory (CBIL), University of Pennsylvania Center for Bioinformatics
Presentation Overview
Schema
Application Framework
GUS In Use
GUS and Oracle
Future Work
Obtaining GUS
Motivation
Functional Genomics: The analysis of gene, RNA, and protein information and its biological function
Represent diversity of functional genomics data
Integrate and establish relationships between these data
Provide facilities for the utilization of these data and their relationships
The creation of an extensible system for the storage, integration, and analysis of functional genomic data.
GUS Overview
Relational Schema Overview
7 major divisions representing approximately 50 concepts in over 400 tables and views:
  Central Dogma (Genes, RNAs, Proteins)
  Sequences and Features
  Reagents
  Microarray Experiments
  Transcription Regulation
  Controlled Vocabularies
  Misc: Bibliographic, External Database, Administration
Strongly typed, i.e. few key/value pairs
View-based subclassing
Extensive use of Controlled Vocabularies
Support for tracking, versioning, permissions
GUS Schemas
DoTS (Database of Transcribed Sequences): Genes, RNAs, Proteins, Sequences
RAD (RNA Abundance Database): Gene Expression and Microarray Experiments
TESS (Transcription Element Search System): Transcriptional regulation
SRes (Shared Resources): Controlled vocabularies, ontologies
Core: Non-Biological Tracking and Overhead
DoTS Schema Overview
Central Dogma
  Genes
  RNAs
  Proteins
Sequences and Features
  DNA
  Amino Acid
  Assemblies
  Alignment
Reagents
  Fingerprint
  Mapping
  Gene Traps
  Clones
DoTS Schema: Central Dogma
[Diagram: Gene, RNA, Protein]
Central Dogma of Biology: Single gene gives rise to RNAs, which in turn give rise to proteins
Foundational organizing structure
Central Dogma: Sequences
[Diagram: Gene, RNA, Protein; NASequence, AASequence]
Genes, RNAs, and Proteins all have sequences, either Nucleic Acid or Amino Acid
Sequences are stored independently of any other object
Central Dogma: Features
[Diagram: Gene, RNA, Protein; NAFeature, NALocation, AAFeature, AALocation; NASequence, AASequence]
Features are used to represent interesting regions of a sequence
Features may be hierarchical: multiple exon features share a parent gene feature
Features may have absolute or relative locations on the sequence, and be noncontiguous
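A minimal Python sketch of the feature model described above: a parent gene feature with exon children, relative locations, and one noncontiguous feature. The class and field names are hypothetical and are not the actual NAFeature or NALocation columns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Feature:
    """Hypothetical stand-in for a feature row plus its location rows."""
    name: str
    # Each (start, end) pair is one location; several pairs = noncontiguous.
    locations: List[Tuple[int, int]]
    # True: coordinates are offsets from the start of the parent feature.
    relative: bool = False
    parent: Optional["Feature"] = None
    children: List["Feature"] = field(default_factory=list)

    def add_child(self, child: "Feature") -> "Feature":
        child.parent = self
        self.children.append(child)
        return child

    def absolute_locations(self) -> List[Tuple[int, int]]:
        """Resolve relative coordinates against the parent's first location."""
        if not self.relative or self.parent is None:
            return list(self.locations)
        offset = self.parent.absolute_locations()[0][0]
        return [(start + offset, end + offset) for start, end in self.locations]


gene = Feature("gene1", [(1000, 5000)])
exon1 = gene.add_child(Feature("exon1", [(0, 200)], relative=True))
exon2 = gene.add_child(Feature("exon2", [(1500, 1800)], relative=True))
# A single feature made of two separate regions, given in absolute coordinates.
split_site = gene.add_child(Feature("site", [(1100, 1110), (1300, 1310)]))
```

Here both exons share the gene feature as parent, and their relative coordinates are resolved only when absolute positions are requested.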
Central Dogma: Instances
[Diagram: NASequence, AASequence; NAFeature, NALocation, AAFeature, AALocation; GeneInstance, RNAInstance, ProteinInstance; Gene, RNA, Protein]
Genes, RNAs, and Proteins are canonical objects with instances
Instances allow for a many to many relationship between objects and sequences.
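The many to many relationship can be sketched with an in-memory SQLite database. This is a simplified illustration: the table names follow the slide, but the columns are assumptions, and the instance row here points straight at a sequence, which may not match the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE NASequence (na_sequence_id INTEGER PRIMARY KEY, organism TEXT);
-- The instance table is the link: one row per (gene, sequence) pairing.
CREATE TABLE GeneInstance (
    gene_instance_id INTEGER PRIMARY KEY,
    gene_id          INTEGER NOT NULL REFERENCES Gene(gene_id),
    na_sequence_id   INTEGER NOT NULL REFERENCES NASequence(na_sequence_id)
);
INSERT INTO Gene VALUES (1, 'geneX'), (2, 'geneY');
INSERT INTO NASequence VALUES (1, 'Organism A'), (2, 'Organism B');
INSERT INTO GeneInstance (gene_id, na_sequence_id) VALUES (1, 1), (1, 2), (2, 2);
""")


def organisms_for_gene(name):
    """One canonical gene can have instances on sequences from several organisms."""
    rows = conn.execute(
        "SELECT s.organism FROM Gene g "
        "JOIN GeneInstance gi ON gi.gene_id = g.gene_id "
        "JOIN NASequence s ON s.na_sequence_id = gi.na_sequence_id "
        "WHERE g.name = ? ORDER BY s.organism", (name,))
    return [row[0] for row in rows]


def genes_on_sequence(na_sequence_id):
    """One sequence can carry instances of several genes."""
    rows = conn.execute(
        "SELECT g.name FROM Gene g "
        "JOIN GeneInstance gi ON gi.gene_id = g.gene_id "
        "WHERE gi.na_sequence_id = ? ORDER BY g.name", (na_sequence_id,))
    return [row[0] for row in rows]
```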
Central Dogma
[Diagram: Sequences (Organism A, Organism B), Gene Instances, Gene, Gene Features]
RAD Schema Overview
Representation and management of high-throughput gene expression data
  Microarray
  Serial Analysis of Gene Expression (SAGE)
Supports:
  Study Design
  Platform / Array
  Assay to Quantification (Hybridization, Scanning, Feature Extraction)
  Biomaterials
  Data Preprocessing (e.g. Normalization)
  Analysis Results (e.g. Clustering, Differential Expression)
  Misc: Ontologies, Protocol, Contact, Versioning, Privacy
MGED Standards Compliant: MIAME, MAGE
TESS Schema Overview
Represents the analysis and prediction of functional transcription factor binding sites
Supports:
  Proteins, Complex
  Activity -- Binding
  Model -- Weight Matrices
  Analysis -- Training & Learning
Integrates TRANSFAC, a public database of transcription factors
Partially designed to provide a bridge between DoTS and RAD
  e.g. RNA Sequences in DoTS and their expression levels in RAD are regulated by the transcription factors in TESS
Ontologies / Controlled Vocabularies
Explicit formal specification of terms and concepts
Represented individually to accommodate differences in structure (flat, tree, graph) and attributes (fields)
Provides an explicit relationship between a biological concept and a given controlled vocabulary
Supports about 15 in total, including:
  NCBI Taxonomy
  Gene Ontology Function Terms
  Sequence Ontology Terms
  Anatomy
GUS Evidence
Evidence may be provided for any item of any row in any table
Implemented as a relationship between that item and any other row (the evidence)
Evidence table provides linking and attributes:
  target_table, target_row, target_attribute
  evidence_table, evidence_row
Example:
  An assembly of ESTs and mRNAs containing a RefSeq uses the RefSeq as evidence to support that the assembly is full length coding.
  An RNA’s description uses comments, similarities, or a sequence as supporting evidence.
CBIL: 9 evidence tables provide support to 11 tables in 26 combinations
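A small sketch of the evidence link using SQLite. The five linking columns come from the slide; the Assembly and RefSeq tables, their columns, the `evidence_for` helper, and the accession value are made-up stand-ins for the first example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Assembly (assembly_id INTEGER PRIMARY KEY, full_length_cds INTEGER);
CREATE TABLE RefSeq   (refseq_id INTEGER PRIMARY KEY, accession TEXT);
-- Generic link: (target table, row, attribute) is supported by (evidence table, row).
CREATE TABLE Evidence (
    evidence_id      INTEGER PRIMARY KEY,
    target_table     TEXT NOT NULL,
    target_row       INTEGER NOT NULL,
    target_attribute TEXT,
    evidence_table   TEXT NOT NULL,
    evidence_row     INTEGER NOT NULL
);
INSERT INTO Assembly VALUES (10, 1);
INSERT INTO RefSeq   VALUES (7, 'placeholder-accession');
INSERT INTO Evidence
    (target_table, target_row, target_attribute, evidence_table, evidence_row)
VALUES ('Assembly', 10, 'full_length_cds', 'RefSeq', 7);
""")


def evidence_for(table, row, attribute):
    """Return (evidence_table, evidence_row) pairs supporting one attribute of one row."""
    cursor = conn.execute(
        "SELECT evidence_table, evidence_row FROM Evidence "
        "WHERE target_table = ? AND target_row = ? AND target_attribute = ?",
        (table, row, attribute))
    return cursor.fetchall()
```

Because the link stores table names rather than foreign keys, any row can serve as evidence for any other, at the cost of the database not enforcing that the referenced rows exist.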
Subclassing
One-level subclassing providing conceptual clarity and query simplification for tables with core commonality and slight divergence in attributes.
e.g. NAFeature superclass has GeneFeature, ExonFeature, RNAFeature, etc. subclasses
Implemented as views on an “implementation” table containing:
  Columns common to all subclasses
  Generic columns available for subclass-specific attributes
  A column indicating the subclass a given row belongs to
There are 19 superclasses and 111 subclasses in GUS
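View-based subclassing can be sketched in SQLite as follows. The implementation table name, the generic column names (`string1`, `number1`), the discriminator column, and the subclass attributes are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NAFeatureImp (
    na_feature_id INTEGER PRIMARY KEY,
    name          TEXT,              -- column common to all subclasses
    subclass_view TEXT NOT NULL,     -- which subclass the row belongs to
    string1       TEXT,              -- generic column, meaning set per subclass
    number1       INTEGER            -- generic column, meaning set per subclass
);
-- Each subclass is a view that filters on the discriminator
-- and gives the generic columns meaningful names.
CREATE VIEW GeneFeature AS
    SELECT na_feature_id, name, string1 AS gene_type
    FROM NAFeatureImp WHERE subclass_view = 'GeneFeature';
CREATE VIEW ExonFeature AS
    SELECT na_feature_id, name, number1 AS order_number
    FROM NAFeatureImp WHERE subclass_view = 'ExonFeature';
INSERT INTO NAFeatureImp VALUES
    (1, 'geneX',  'GeneFeature', 'protein_coding', NULL),
    (2, 'exon 1', 'ExonFeature', NULL, 1),
    (3, 'exon 2', 'ExonFeature', NULL, 2);
""")

genes = conn.execute("SELECT name, gene_type FROM GeneFeature").fetchall()
exons = conn.execute(
    "SELECT name, order_number FROM ExonFeature ORDER BY order_number").fetchall()
```

Queries against a subclass view never need to mention the discriminator or the generic column names, which is the query simplification the slide refers to.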
Data Provenance
Permissions: Row-level Unix-Style read/write
Versioning: Simple preservation of modified rows
Data Source: Tracking of external databases and their releases
Project Tracking: Data grouping by project
Algorithm: Tracking of algorithms, their execution and parameters, row-level impact, and result status
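One possible way to preserve modified rows is a trigger that copies the old row into a version table before each update. The sketch below uses SQLite; the table names, the trigger, and the choice of a trigger at all are assumptions, not a description of how GUS implements versioning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene    (gene_id INTEGER PRIMARY KEY, name TEXT);
-- Hypothetical version table: same columns plus the time the row was superseded.
CREATE TABLE GeneVer (gene_id INTEGER, name TEXT, version_date TEXT);
CREATE TRIGGER gene_keep_old_row BEFORE UPDATE ON Gene
BEGIN
    INSERT INTO GeneVer VALUES (OLD.gene_id, OLD.name, datetime('now'));
END;
INSERT INTO Gene VALUES (1, 'old name');
UPDATE Gene SET name = 'new name' WHERE gene_id = 1;
""")

current = conn.execute("SELECT name FROM Gene WHERE gene_id = 1").fetchone()[0]
history = conn.execute("SELECT gene_id, name FROM GeneVer").fetchall()
```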
Application Framework Overview
Provides consistent, reusable access, management, and display of data
Object Relational Layer
  Perl
  Java
Data Loading API & Plugins
Pipeline API
Web Development Kit (WDK)
[Diagram: GUS Database; Perl Object Layer; Java Object Layer; WDK; Data Loading API; Plugins; Pipeline; GUI Applications; Websites]
Object Layer
One-to-one relationship between objects and tables/views
Light weight: centered primarily around data loading
  Limited support for object-specific logic
  Applications generally define additional object models
Provides:
  Simple constructors and full accessors
  Smart update/insert
  Parent/child relationship management
  Cascading insert and delete
  Cache management
Automation of Data Provenance and Evidence
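A rough sketch of what smart update/insert with cascading to children could look like, written in Python against SQLite. The `Row` base class and its methods are hypothetical; the real GUS object layers are in Perl and Java and their API is not shown in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE RNA  (rna_id INTEGER PRIMARY KEY,
                   gene_id INTEGER REFERENCES Gene(gene_id), name TEXT);
""")


class Row:
    """Hypothetical object-per-row base class (one class per table)."""
    table = pk = parent_fk = None

    def __init__(self, **attrs):
        self.attrs = dict(attrs)
        self.children = []
        self._dirty = True

    def set(self, key, value):
        if self.attrs.get(key) != value:
            self.attrs[key] = value
            self._dirty = True

    def add_child(self, child):
        self.children.append(child)
        return child

    def submit(self, parent_id=None):
        if parent_id is not None and self.parent_fk:
            self.set(self.parent_fk, parent_id)
        if self.attrs.get(self.pk) is None:        # no key yet: INSERT
            cols = list(self.attrs)
            marks = ", ".join("?" for _ in cols)
            cursor = conn.execute(
                f"INSERT INTO {self.table} ({', '.join(cols)}) VALUES ({marks})",
                [self.attrs[c] for c in cols])
            self.attrs[self.pk] = cursor.lastrowid
        elif self._dirty:                          # existing and changed: UPDATE
            cols = [c for c in self.attrs if c != self.pk]
            assignments = ", ".join(f"{c} = ?" for c in cols)
            conn.execute(
                f"UPDATE {self.table} SET {assignments} WHERE {self.pk} = ?",
                [self.attrs[c] for c in cols] + [self.attrs[self.pk]])
        self._dirty = False
        for child in self.children:                # cascade to children
            child.submit(parent_id=self.attrs[self.pk])


class Gene(Row):
    table, pk = "Gene", "gene_id"


class RNA(Row):
    table, pk, parent_fk = "RNA", "rna_id", "gene_id"


gene = Gene(name="geneX")
gene.add_child(RNA(name="rna1"))
gene.add_child(RNA(name="rna2"))
gene.submit()                      # inserts the gene, then both RNAs
gene.submit()                      # nothing changed, so no SQL is issued
gene.set("name", "geneX-renamed")
gene.submit()                      # a single UPDATE on the gene row
```

The caller never chooses between insert and update; the object decides from whether it already has a primary key and whether any attribute changed.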
Data Loading API & Plugins
API provides:
  Data Provenance
  Object layer and database connectivity
  Standardized documentation
  Command line argument processing
  Logging
  Error Handling
Plugins are objects which utilize the Data Loading API
Example GUS plugins:
  Loading Data:
    Loading sequences from flat (FASTA) files
    Parsing GenBank records and storing results
  Analysis: Predicting RNA/Protein function using Gene Function Ontology
  General: Updating records from XML
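To illustrate the division of labour between API and plugin, here is a hedged Python sketch: a base class supplies argument processing, logging, and error handling, and a plugin only declares its arguments and does its work. All class and method names are invented for this example; the plugin returns the parsed records instead of writing them to a database.

```python
import argparse
import logging
import os
import tempfile


class Plugin:
    """Hypothetical base class standing in for the Data Loading API."""
    description = ""

    def __init__(self):
        self.log = logging.getLogger(type(self).__name__)
        self.parser = argparse.ArgumentParser(description=self.description)
        self.declare_args(self.parser)

    def declare_args(self, parser):
        pass

    def run(self, args):
        raise NotImplementedError

    def invoke(self, argv):
        """Parse arguments, run the plugin, and log success or failure."""
        args = self.parser.parse_args(argv)
        try:
            result = self.run(args)
        except Exception:
            self.log.exception("plugin failed")
            return None
        self.log.info("plugin finished")
        return result


class LoadFasta(Plugin):
    description = "Load sequences from a FASTA file"

    def declare_args(self, parser):
        parser.add_argument("--fasta", required=True)

    def run(self, args):
        records, name = {}, None
        with open(args.fasta) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    records[name] = ""
                elif line and name is not None:
                    records[name] += line
        return records


# Usage: write a tiny FASTA file, then run the plugin on it.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nAC\n>seq2\nGGG\n")
loaded = LoadFasta().invoke(["--fasta", tmp.name])
os.remove(tmp.name)
```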
Pipeline API
Allows for a chain of plugins and other Perl programs to be strung together for the automation of complex protocols
CBIL Example: DoTS Build
  Downloads and inserts data, assembles transcripts, produces consensus sequences, and performs annotation
  About 150 total steps
  About six weeks of processing (Human: 5-6M Sequences)
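The idea of chaining steps can be sketched in a few lines of Python. The four toy steps loosely mirror the stages named in the DoTS build example, but their bodies are placeholders, and `run_pipeline` is a hypothetical helper rather than the Pipeline API itself.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], None]]


def run_pipeline(steps: List[Step], state: Dict) -> List[str]:
    """Run steps in order, stopping at the first failure; return the names completed."""
    done = []
    for name, step in steps:
        try:
            step(state)
        except Exception as exc:
            state["error"] = f"{name}: {exc}"
            break
        done.append(name)
    return done


def download(state):
    state["sequences"] = ["ACGT", "ACGT", "ACGA"]   # toy data, not real input


def assemble(state):
    state["assembly"] = list(state["sequences"])    # toy: one assembly of everything


def consensus(state):
    columns = zip(*state["assembly"])
    state["consensus"] = "".join(
        Counter(column).most_common(1)[0][0] for column in columns)


def annotate(state):
    state["annotation"] = f"length={len(state['consensus'])}"


state = {}
completed = run_pipeline(
    [("download", download), ("assemble", assemble),
     ("consensus", consensus), ("annotate", annotate)],
    state)
```

Each step reads what earlier steps left in the shared state, so a failure stops the chain before later steps run on incomplete data.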
Web Development Kit (WDK)
Facilitates development of data mining oriented websites:
  Multiple parameterized canned queries
  Sophisticated records
  Graphical views
  Boolean query facility
  Query history
  Session management, process pooling, flow control
Model, View, Controller (MVC) Design
  Separates application logic (Model) from website layout (View) and application flow (Controller)
  Model: XML-based queries and records
  View: JSP
  Controller: Struts
New WDK under development, scheduled for release by the end of summer
GUS In Use
GUS In Use: Versatility
Large scale sites associated with sequencing centers
  GeneDB: Pathogen Sequencing Unit at the Sanger Institute
Lightly staffed genomics projects
  TcruziDB, CryptoDB: Kissinger Lab, University of Georgia
Data mining projects
  Multiple plant projects: Brett Tyler, Virginia Bioinformatics Institute and collaborators
Expression based projects
  dbDirt: Allen Okey, University of Toronto
Bioinformatics Core Facilities
  University of Pennsylvania Bioinformatics Core Facility
GUS In Use: Modularity
Several instances of GUS exclusively use RAD or DoTS
Allows for small initial investment of time and energy, while providing significant potential for future growth
GUS and Relational Database Systems
Oracle and PostgreSQL Support
  Oracle supplanted Sybase in 2001
  PostgreSQL added in 2004
GUS compatible RDBMSs require:
  Schemas
  Views
  Sequences
  Primary and Foreign Key Constraints
“Enhanced” functionality when using Oracle
  Under development: Workspace Manager Integration
  Implemented through database module, triggers, and GUS Projects
GUS: Adding PostgreSQL compatibility
Database module provides alternate SQL for use in the object layer
  SQL:
    Function calls
    Date functions
    Sequences
  Metadata:
    Constraint relations
    Table attributes
    Table definition views
Third party utilities (SQL::Translator) and hand-editing to convert table definitions from Oracle to PostgreSQL
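As an illustration of what a database module providing alternate SQL might look like, the Python sketch below hides two dialect differences (sequences and the current date) behind one interface. The class names, method names, and the sequence name are hypothetical; only the SQL fragments themselves are standard Oracle and PostgreSQL syntax.

```python
class OracleDialect:
    """Oracle flavour of the statements the object layer needs."""

    def next_id_sql(self, sequence: str) -> str:
        return f"SELECT {sequence}.NEXTVAL FROM dual"

    def current_date_sql(self) -> str:
        return "SYSDATE"


class PostgresDialect:
    """PostgreSQL flavour of the same statements."""

    def next_id_sql(self, sequence: str) -> str:
        return f"SELECT nextval('{sequence}')"

    def current_date_sql(self) -> str:
        return "now()"


def insert_timestamp_sql(dialect, table: str) -> str:
    """Caller code stays the same; only the dialect object differs."""
    return f"UPDATE {table} SET modification_date = {dialect.current_date_sql()}"


# 'dots.gene_sq' and 'modification_date' are placeholder names for this example.
oracle_sql = OracleDialect().next_id_sql("dots.gene_sq")
postgres_sql = PostgresDialect().next_id_sql("dots.gene_sq")
```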
GUS & GUS Projects at CBIL
Multiple GUS-based projects sharing the same database instance
Project-specific extensions use their own schemas for application specific functionality
These extensions may use Oracle specific functionality:
  Query optimization, hints
  Materialized views
  Advanced storage -- table compression
  Database links
New concepts are introduced within projects and migrate to GUS
GUS Future Development
Extension of GUS to include proteomics and other domains (e.g. in situ hybridization)
Improved distribution: documentation, installation, API
Oracle 10g Migration / Integration:
  Integrated analysis: BLAST, Regex
  Data loading: Upsert
  Implementing obvious performance features
  Workspace Manager Support
GUS & Workspace Manager
Multiple GUS-based projects, all sharing the same instance
Each project maintains its own release schedule and data build process
  Project data releases range from weekly to once every six weeks
  Data is unreliable during the build process
Oracle Workspace Manager
  Each project may manipulate data independently
  Upon completion of a build cycle, the data is committed back to the primary workspace
  Projects may release at any time because just the primary workspace is made available.
Workspace Manager provides functionality to GUS by allowing more powerful manipulation of data among many concurrent projects and researchers
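The workflow can be pictured with a toy Python model: each project writes into its own overlay, readers of the primary data see nothing until the build is merged. This is only a conceptual sketch with invented names; it does not use or imitate the actual Oracle Workspace Manager interface.

```python
class Workspace:
    """Toy model: private changes layered over shared primary data."""

    def __init__(self, primary: dict):
        self.primary = primary
        self.changes = {}

    def write(self, key, value):
        self.changes[key] = value          # visible only inside this workspace

    def read(self, key):
        if key in self.changes:
            return self.changes[key]
        return self.primary.get(key)

    def merge(self):
        """End of a build cycle: publish all changes to the primary data."""
        self.primary.update(self.changes)
        self.changes.clear()


primary = {"gene1": "release 1 annotation"}
build = Workspace(primary)
build.write("gene1", "release 2 annotation")

seen_during_build = primary["gene1"]       # public readers still see the old data
seen_in_workspace = build.read("gene1")    # the project sees its own work

build.merge()
seen_after_merge = primary["gene1"]
```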
Obtaining GUS
www.gusdb.org
Open Source
Documentation -- Wiki, Installation Guides
SourceForge: gusdev
  Mailing Lists
  Trackers
Coming Soon: Demonstration Instance
Acknowledgements
Steve Fischer
Jonathan Schug
Chris Stoeckert
The Computational Biology and Informatics Laboratory Group
GUS is funded by grants from the National Institutes of Health