GUS: The Genomics Unified Schema and Application Framework

Michael Saffitz
Computational Biology and Informatics Laboratory (CBIL), University of Pennsylvania Center for Bioinformatics

Presentation Overview

Schema
Application Framework
GUS In Use
GUS and Oracle
Future Work
Obtaining GUS

Motivation

Functional Genomics: The analysis of gene, RNA, and protein information and its biological function

Represent diversity of functional genomics data
Integrate and establish relationships between these data
Provide facilities for the utilization of these data and their relationships

The creation of an extensible system for the storage, integration, and analysis of functional genomic data.

GUS Overview

Relational Schema Overview

7 major divisions representing approximately 50 concepts in over 400 tables and views:

  Central Dogma (Genes, RNAs, Proteins)
  Sequences and Features
  Reagents
  Microarray Experiments
  Transcription Regulation
  Controlled Vocabularies
  Misc: Bibliographic, External Database, Administration

Strongly typed, i.e. few key/value pairs
View-based subclassing
Extensive use of Controlled Vocabularies

Support for tracking, versioning, permissions

GUS Schemas

DoTS (Database of Transcribed Sequences): Genes, RNAs, Proteins, Sequences
RAD (RNA Abundance Database): Gene Expression and Microarray Experiments
TESS (Transcription Element Search System): Transcriptional regulation
SRes (Shared Resources): Controlled vocabularies, ontologies
Core: Non-Biological Tracking and Overhead

DoTS Schema Overview

Central Dogma
  Genes
  RNAs
  Proteins

Sequences and Features
  DNA
  Amino Acid
  Assemblies
  Alignment

Reagents
  Fingerprint
  Mapping
  Gene Traps
  Clones

DoTS Schema: Central Dogma

Diagram: Gene, RNA, Protein

Central Dogma of Biology: a single gene gives rise to RNAs, which in turn give rise to proteins
Foundational organizing structure

Central Dogma: Sequences

Diagram: Gene, RNA, Protein; NASequence, AASequence

Genes, RNAs, and Proteins all have sequences, either Nucleic Acid or Amino Acid
Sequences are stored independently of any other object

Central Dogma: Features

Diagram: Gene, RNA, Protein; NAFeature, NALocation; AAFeature, AALocation; NASequence, AASequence

Features are used to represent interesting regions of a sequence
Features may be hierarchical: multiple exon features share a parent gene feature
Features may have absolute or relative locations on the sequence, and be noncontiguous
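
The feature ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not GUS code: the `Feature` class, its field names, and the 1-based inclusive coordinate convention are assumptions made for the example. It shows a parent gene feature with exon children, and a noncontiguous feature whose sequence is assembled from several locations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """Hypothetical stand-in for a feature row plus its location rows."""
    name: str
    kind: str
    # 1-based inclusive (start, end) pairs; several pairs = noncontiguous.
    locations: List[Tuple[int, int]] = field(default_factory=list)
    children: List["Feature"] = field(default_factory=list)

    def add_child(self, child: "Feature") -> "Feature":
        self.children.append(child)
        return child


def feature_sequence(sequence: str, feature: Feature) -> str:
    """Concatenate the residues covered by each location, in order."""
    return "".join(sequence[start - 1:end] for start, end in feature.locations)


def span(feature: Feature) -> Tuple[int, int]:
    """Smallest interval covering a feature and all of its descendants.

    Raises ValueError if neither the feature nor any descendant has a location.
    """
    starts, ends = [], []
    stack = [feature]
    while stack:
        node = stack.pop()
        for start, end in node.locations:
            starts.append(start)
            ends.append(end)
        stack.extend(node.children)
    return min(starts), max(ends)
```

Here a gene feature with no location of its own still has a well-defined span, derived from the exon features beneath it.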

Central Dogma: Instances

Diagram: NASequence, AASequence; NAFeature, NALocation; AAFeature, AALocation; GeneInstance, RNAInstance, ProteinInstance; Gene, RNA, Protein

Genes, RNAs, and Proteins are canonical objects with instances
Instances allow for a many-to-many relationship between objects and sequences.
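
A minimal sketch of the many-to-many idea, using SQLite from Python. The table and column names below are hypothetical, and the instance table is linked directly to sequences to keep the example short; the actual GUS tables may be organized differently (for example, by routing the link through feature tables).

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative, not the GUS DDL.
SCHEMA = """
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE na_sequence (
    na_sequence_id INTEGER PRIMARY KEY, organism TEXT, residues TEXT);
-- Junction table: one row per pairing of a canonical gene with a sequence.
CREATE TABLE gene_instance (
    gene_instance_id INTEGER PRIMARY KEY,
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    na_sequence_id INTEGER NOT NULL REFERENCES na_sequence(na_sequence_id));
"""


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "geneA"), (2, "geneB")])
    conn.executemany("INSERT INTO na_sequence VALUES (?, ?, ?)",
                     [(10, "Organism A", "ATGC"), (11, "Organism B", "ATGG")])
    conn.executemany(
        "INSERT INTO gene_instance (gene_id, na_sequence_id) VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11)])
    return conn


def sequences_for_gene(conn, gene_id):
    rows = conn.execute(
        "SELECT na_sequence_id FROM gene_instance "
        "WHERE gene_id = ? ORDER BY na_sequence_id", (gene_id,))
    return [row[0] for row in rows]


def genes_for_sequence(conn, na_sequence_id):
    rows = conn.execute(
        "SELECT gene_id FROM gene_instance "
        "WHERE na_sequence_id = ? ORDER BY gene_id", (na_sequence_id,))
    return [row[0] for row in rows]
```

One canonical gene can then be tied to sequences from several organisms, and one sequence can carry instances of several genes.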

Central Dogma

Diagram labels: Sequences (Organism A, Organism B); Gene Instances; Gene; Gene Features

RAD Schema Overview

Representation and management of high-throughput gene expression data

  Microarray
  Serial Analysis of Gene Expression (SAGE)

Supports:
  Study Design
  Platform / Array
  Assay to Quantification (Hybridization, Scanning, Feature Extraction)
  Biomaterials
  Data Preprocessing (e.g. Normalization)
  Analysis Results (e.g. Clustering, Differential Expression)
  Misc: Ontologies, Protocol, Contact, Versioning, Privacy

MGED Standards Compliant: MIAME, MAGE

TESS Schema Overview

Represents the analysis and prediction of functional transcription factor binding sites
Supports:
  Proteins, Complex
  Activity -- Binding
  Model -- Weight Matrices
  Analysis -- Training & Learning
Integrates TRANSFAC, a public database of transcription factors
Partially designed to provide a bridge between DoTS and RAD
  e.g. RNA Sequences in DoTS and their expression levels in RAD are regulated by the transcription factors in TESS

Ontologies / Controlled Vocabularies

Explicit formal specification of terms and concepts
Represented individually to accommodate differences in structure (flat, tree, graph) and attributes (fields)
Provides an explicit relationship between a biological concept and a given controlled vocabulary

Supports about 15 in total, including:
  NCBI Taxonomy
  Gene Ontology Function Terms
  Sequence Ontology Terms
  Anatomy

GUS Evidence

Evidence may be provided for any item of any row in any table
Implemented as a relationship between that item and any other row (the evidence)

Evidence table provides linking and attributes:
  target_table, target_row, target_attribute
  evidence_table, evidence_row

Examples:
  An assembly of ESTs and mRNAs containing a RefSeq uses the RefSeq as evidence to support that the assembly is full-length coding.
  An RNA’s description uses comments, similarities, or a sequence as supporting evidence.

CBIL: 9 evidence tables provide support to 11 tables in 26 combinations
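
The evidence link described on this slide can be sketched with SQLite from Python. Only the five linking columns (target_table, target_row, target_attribute, evidence_table, evidence_row) come from the slide; the `assembly` and `refseq` tables, their columns, the placeholder accession, and the helper functions are assumptions made for the illustration.

```python
import sqlite3


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Hypothetical data tables, just enough to have rows to point at.
    CREATE TABLE assembly (assembly_id INTEGER PRIMARY KEY, full_length_cds INTEGER);
    CREATE TABLE refseq (refseq_id INTEGER PRIMARY KEY, accession TEXT);
    -- Generic link: any (table, row, attribute) to any (table, row).
    CREATE TABLE evidence (
        evidence_id INTEGER PRIMARY KEY,
        target_table TEXT NOT NULL,
        target_row INTEGER NOT NULL,
        target_attribute TEXT,
        evidence_table TEXT NOT NULL,
        evidence_row INTEGER NOT NULL);
    """)
    conn.execute("INSERT INTO assembly VALUES (1, 1)")
    conn.execute("INSERT INTO refseq VALUES (7, 'EXAMPLE_ACCESSION')")
    return conn


def add_evidence(conn, target, evidence):
    """target = (table, row, attribute); evidence = (table, row)."""
    conn.execute(
        "INSERT INTO evidence (target_table, target_row, target_attribute, "
        "evidence_table, evidence_row) VALUES (?, ?, ?, ?, ?)",
        (*target, *evidence))


def evidence_for(conn, target_table, target_row, target_attribute):
    rows = conn.execute(
        "SELECT evidence_table, evidence_row FROM evidence "
        "WHERE target_table = ? AND target_row = ? AND target_attribute = ? "
        "ORDER BY evidence_id",
        (target_table, target_row, target_attribute))
    return rows.fetchall()
```

Because the link stores table names as data, a foreign key constraint cannot check it; in this sketch, keeping the references valid is left to the code that writes the rows.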

Subclassing

One-level subclassing provides conceptual clarity and query simplification for tables with core commonality and slight divergence in attributes.

e.g. NAFeature superclass has GeneFeature, ExonFeature, RNAFeature, etc. subclasses

Implemented as views on an “implementation” table containing:
  Columns common to all subclasses
  Generic columns available for subclass-specific attributes
  A column indicating the subclass a given row belongs to

There are 19 superclasses and 111 subclasses in GUS
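
A minimal sketch of view-based subclassing, using SQLite from Python. The three kinds of column (common, generic, subclass indicator) follow the slide; every concrete name here (`na_feature_imp`, `subclass_view`, `string1`, `int1`, `gene_type`, `order_number`) is an assumption chosen for the example rather than the actual GUS definition.

```python
import sqlite3


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE na_feature_imp (
        na_feature_id INTEGER PRIMARY KEY,
        name TEXT,                    -- common to all subclasses
        subclass_view TEXT NOT NULL,  -- which subclass the row belongs to
        string1 TEXT,                 -- generic column, meaning set per view
        int1 INTEGER                  -- generic column, meaning set per view
    );
    -- Each subclass is a view that filters rows and renames generic columns.
    CREATE VIEW gene_feature AS
        SELECT na_feature_id, name, string1 AS gene_type
        FROM na_feature_imp WHERE subclass_view = 'GeneFeature';
    CREATE VIEW exon_feature AS
        SELECT na_feature_id, name, int1 AS order_number
        FROM na_feature_imp WHERE subclass_view = 'ExonFeature';
    """)
    conn.executemany(
        "INSERT INTO na_feature_imp VALUES (?, ?, ?, ?, ?)",
        [(1, "gene1", "GeneFeature", "protein_coding", None),
         (2, "exon1", "ExonFeature", None, 1),
         (3, "exon2", "ExonFeature", None, 2)])
    return conn


def exon_rows(conn):
    return conn.execute(
        "SELECT name, order_number FROM exon_feature ORDER BY order_number"
    ).fetchall()


def gene_rows(conn):
    return conn.execute("SELECT name, gene_type FROM gene_feature").fetchall()
```

Queries against a subclass view see only that subclass's rows, with meaningful column names, while all rows live in one physical table.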

Data Provenance

Permissions: Row-level Unix-style read/write
Versioning: Simple preservation of modified rows
Data Source: Tracking of external databases and their releases
Project Tracking: Data grouping by project
Algorithm: Tracking of algorithms, their execution and parameters, row-level impact, and result status
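
One way to picture "simple preservation of modified rows" is a version table that receives the old copy of a row before it is overwritten. The sketch below does this with a SQLite trigger from Python. This is an assumption for illustration only: the slides do not say how GUS performs versioning (it may happen in the object layer rather than in triggers), and the `gene` / `gene_ver` tables and their columns are hypothetical.

```python
import sqlite3


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, description TEXT);
    -- Version table: same columns as the data table plus a timestamp.
    CREATE TABLE gene_ver (
        gene_id INTEGER, description TEXT, version_date TEXT);
    -- Copy the outgoing row into the version table before each update.
    CREATE TRIGGER gene_keep_old_row BEFORE UPDATE ON gene
    BEGIN
        INSERT INTO gene_ver
        VALUES (OLD.gene_id, OLD.description, datetime('now'));
    END;
    """)
    return conn


def history(conn, gene_id):
    rows = conn.execute(
        "SELECT description FROM gene_ver WHERE gene_id = ? ORDER BY rowid",
        (gene_id,))
    return [row[0] for row in rows]
```

After an update, the data table holds only the current row and the version table holds each earlier state in the order it was replaced.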

Application Framework Overview

Provides consistent, reusable access, management, and display of data

Object Relational Layer
  Perl
  Java
Data Loading API & Plugins
Pipeline API
Web Development Kit (WDK)

Diagram: GUS Database; Perl Object Layer; Java Object Layer; WDK; Data Loading API; Plugins; Pipeline; GUI Applications; Websites

Object Layer

One-to-one relationship between objects and tables/views
Lightweight: centered primarily around data loading
  Limited support for object-specific logic
  Applications generally define additional object models
Provides:
  Simple constructors and full accessors
  Smart update/insert
  Parent/child relationship management
  Cascading insert and delete
  Cache management

Automation of Data Provenance and Evidence
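
As an illustration of "smart update/insert", the sketch below maps one object onto one table and lets a single `submit` call decide whether to insert or update. The GUS object layer itself is written in Perl and Java; this Python class, its `submit` method, and the `gene` table are hypothetical names used only to show the idea.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT)")
    return conn


class Gene:
    """Hypothetical object mapped one-to-one onto the `gene` table."""

    def __init__(self, name, gene_id=None):
        self.gene_id = gene_id
        self.name = name

    def submit(self, conn):
        """Insert if the object has no primary key yet, otherwise update."""
        if self.gene_id is None:
            cursor = conn.execute(
                "INSERT INTO gene (name) VALUES (?)", (self.name,))
            self.gene_id = cursor.lastrowid  # remember the new primary key
            return "insert"
        conn.execute(
            "UPDATE gene SET name = ? WHERE gene_id = ?",
            (self.name, self.gene_id))
        return "update"
```

The caller never chooses between the two statements; the presence of a primary key on the object is what decides.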

Data Loading API & Plugins

API provides:
  Data Provenance
  Object layer and database connectivity
  Standardized documentation
  Command line argument processing
  Logging
  Error Handling

Plugins are objects which utilize the Data Loading API

Example GUS plugins:
  Loading Data:
    Loading sequences from flat (FASTA) files
    Parsing GenBank records and storing results
  Analysis:
    Predicting RNA/Protein function using Gene Function Ontology
  General:
    Updating records from XML
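
The plugin idea can be sketched as a small base class that supplies shared services (here only logging) and a subclass that does one loading job. GUS plugins are Perl objects; the Python `Plugin` base class, the `run(args)` signature, and the dictionary used in place of a database are all assumptions made for this illustration.

```python
import io
import logging


class Plugin:
    """Hypothetical base class: shared services every plugin would inherit."""
    name = "Plugin"

    def __init__(self):
        self.log = logging.getLogger(self.name)

    def run(self, args):
        raise NotImplementedError


def parse_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


class LoadFastaSequences(Plugin):
    """Hypothetical plugin: load FASTA records into a store."""
    name = "LoadFastaSequences"

    def run(self, args):
        store = args["store"]  # a dict stands in for the database here
        count = 0
        for header, sequence in parse_fasta(args["handle"]):
            store[header] = sequence
            count += 1
        self.log.info("loaded %d sequences", count)
        return count
```

A call such as `LoadFastaSequences().run({"handle": io.StringIO(text), "store": {}})` returns the number of records read.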

Pipeline API

Allows for a chain of plugins and other Perl programs to be strung together for the automation of complex protocols

CBIL Example: DoTS Build
  Downloads and inserts data, assembles transcripts, produces consensus sequences, and performs annotation
  About 150 total steps
  About six weeks of processing (Human: 5-6M Sequences)
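
Chaining steps can be sketched as an ordered list of named callables. The `run_pipeline` function below is hypothetical and not the GUS Pipeline API. Skipping steps that are already recorded as done is an added assumption, included because a build that runs for weeks would plausibly need to resume after an interruption.

```python
def run_pipeline(steps, done=None):
    """Run (name, callable) steps in order, skipping names already in `done`.

    `done` is updated in place, so the caller can persist it between runs.
    Returns the names of the steps executed by this call.
    """
    done = set() if done is None else done
    executed = []
    for name, step in steps:
        if name in done:
            continue
        step()  # an exception here stops the chain at this step
        done.add(name)
        executed.append(name)
    return executed
```

Calling it a second time with the same `done` set runs nothing, and a failed step can be retried without repeating the steps before it.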

Web Development Kit (WDK)

Facilitates development of data mining oriented websites:
  Multiple parameterized canned queries
  Sophisticated records
  Graphical views
  Boolean query facility
  Query history
  Session management, process pooling, flow control

Model, View, Controller (MVC) Design
  Separates application logic (Model) from website layout (View) and application flow (Controller)
  Model: XML-based queries and records
  View: JSP
  Controller: Struts

New WDK under development, scheduled for release by the end of summer

GUS In Use

GUS In Use: Versatility

Large scale sites associated with sequencing centers
  GeneDB: Pathogen Sequencing Unit at the Sanger Institute
Lightly staffed genomics projects
  TcruziDB, CryptoDB: Kissinger Lab, University of Georgia
Data mining projects
  Multiple plant projects: Brett Tyler, Virginia Bioinformatics Institute and collaborators
Expression based projects
  dbDirt: Allen Okey, University of Toronto
Bioinformatics Core Facilities
  University of Pennsylvania Bioinformatics Core Facility

GUS In Use: Modularity

Several instances of GUS use RAD or DoTS exclusively
Allows for a small initial investment of time and energy, while providing significant potential for future growth

GUS and Relational Database Systems

Oracle and PostgreSQL Support
  Oracle supplanted Sybase in 2001
  PostgreSQL added in 2004

GUS compatible RDBM systems require:
  Schemas
  Views
  Sequences
  Primary and Foreign Key Constraints

“Enhanced” functionality when using Oracle
  Under development: Workspace Manager Integration
  Implemented through database module, triggers, and GUS Projects

GUS: Adding PostgreSQL compatibility

Database module provides alternate SQL for use in the object layer

  SQL-Function Calls
    Date functions
  Sequences
  Metadata:
    Constraint relations
    Table attributes
    Table definition views

Third party utilities (SQL::Translator) and hand-editing to convert table definitions from Oracle to PostgreSQL
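
The role of a database module can be sketched as one small class per database that hands back vendor-specific SQL fragments, so the calling code stays the same. The SQL fragments themselves are standard (`SYSDATE` and `sequence.NEXTVAL` in Oracle, `now()` and `nextval('sequence')` in PostgreSQL). The Python classes, method names, the `modification_date` column, and the `gene_sq` sequence name are hypothetical, since the GUS module is Perl and its interface is not shown in the slides.

```python
class OracleSql:
    """Oracle-flavored SQL fragments."""

    def current_date(self):
        return "SYSDATE"

    def next_id(self, sequence_name):
        return f"SELECT {sequence_name}.NEXTVAL FROM dual"


class PostgresSql:
    """PostgreSQL-flavored SQL fragments."""

    def current_date(self):
        return "now()"

    def next_id(self, sequence_name):
        return f"SELECT nextval('{sequence_name}')"


def touch_row_sql(db, table, key_column):
    """Build a vendor-neutral 'stamp this row as modified' statement.

    `modification_date` is a hypothetical column name for this example.
    """
    return (f"UPDATE {table} SET modification_date = {db.current_date()} "
            f"WHERE {key_column} = ?")
```

Code that needs a new primary key or a timestamp asks the module for the SQL instead of hard-coding either dialect.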

GUS & GUS Projects at CBIL

Multiple GUS-based projects sharing the same database instance
Project-specific extensions use their own schemas for application-specific functionality
These extensions may use Oracle-specific functionality:
  Query optimization, hints
  Materialized views
  Advanced storage -- table compression
  Database links

New concepts are introduced within projects and migrate to GUS

GUS Future Development

Extension of GUS to include proteomics and other domains (e.g. in situ hybridization)
Improved distribution: documentation, installation, API
Oracle 10g Migration / Integration:
  Integrated analysis: BLAST, Regex
  Data loading: Upsert
  Implementing obvious performance features

Workspace Manager Support

GUS & Workspace Manager

Multiple GUS-based projects, all sharing the same instance
Each project maintains its own release schedule and data build process
  Project data releases range from weekly to once every six weeks
  Data is unreliable during the build process

Oracle Workspace Manager
  Each project may manipulate data independently
  Upon completion of a build cycle, the data is committed back to the primary workspace
  Projects may release at any time because only the primary workspace is made available.

Workspace Manager provides functionality to GUS by allowing more powerful manipulation of data among many concurrent projects and researchers

Obtaining GUS

www.gusdb.org
  Open Source
  Documentation -- Wiki, Installation Guides
SourceForge: gusdev
  Mailing Lists
  Trackers

Coming Soon: Demonstration Instance

Acknowledgements

Steve Fischer
Jonathan Schug
Chris Stoeckert

The Computational Biology and Informatics Laboratory Group

GUS is funded by grants from the National Institutes of Health