GUS Overview

June 18, 2002

GUS-3.0

• Supports application and data integration
• Uses an extensible architecture
• Is object-oriented even though it uses an underlying relational database management system (Oracle)
• Warehouse instead of federation for local stable copy
• Uses standards for bulk data exchange (e.g., MAGE)

Genomics Unified Schema

GUS Usage

• Annotation
  – of genomes: gene models, sequence features
  – of genes: gene function, gene expression, gene regulation
• Data mining
  – Develop algorithms and queryable resource
• Publish
  – Map identifiers with other resources/databases
  – URL for entry retrieval/ad hoc queries in web interface

GUS-3.0 Name Spaces

GUS has 5 name spaces compartmentalizing different types of information.

Namespace   Domain                    Features
Core        Data provenance           Workflows
SRes        Shared resources          Ontologies
DoTS        Sequence and annotation   Central dogma
RAD         Gene expression           MIAME
TESS        Gene regulation           Grammars

Application Integration: PlasmoDB

[Figure: PlasmoDB architecture diagram. Only the labels survive:
  Data providers: TIGR, Sanger, Stanford, Plasmodium investigators, public databases (GenBank, InterPro, GO, etc.)
  Data types: genomic sequence; GSSs & ESTs; microarray & SAGE experiments; mapping data; annotation; QTL, POP, SNP, clinical
  GUS components: DoTS (Oracle/SQL), RAD, Core, SRes, TESS; Object Layer; automated analysis & integration; Annotator’s Interface
  Outputs: WWW queries, browsing, & download (Java Servlets & Perl CGI); GenePlot software; GenePlot CD
  Legend: existing implementation, future implementation]

GUS Supports Multiple Projects

[Figure: layered diagram. Only the labels survive:
  Projects: AllGenes, PlasmoDB, EPConDB, other sites and other projects
  Layers: Java Servlets; Object Layer for Data Loading; Oracle RDBMS
  Namespaces: Core, SRes, TESS, RAD, DoTS]

Main Aspects of GUS Development

• Choice of development tools
  – Schema:
    • CREATE TABLE statements
    • Documentation plug-in: input is tab-delimited text
    • UML: Rational Rose, PowerDesigner
  – Code: CVS
• Areas to emphasize
  – Plug-ins
  – Work flow
  – TESS
  – Proteomics
  – Images
• Preferred type of user interface
  – JSP
  – PHP

Data Integration

SRes: Ontologies
• GO
• Species
• Tissue
• Dev. Stage
Query phrase on slide: “acute myeloid leukemia”

Core: Data Provenance
• Ownership
• Protection
• Algorithms
• Similarity
• Versioning
• Workflow
Query phrase on slide: “with sequence similarity to c-fos”

DoTS
GenomicSequence
• Genes, gene models
• STSs, repeats, etc.
• Cross-species analysis
TranscribedSequence
• Characterize transcripts
• RH mapping
• Library analysis
• Cross-species analysis
• DOTS
ProteinSequence
• Domains
• Function
• Structure
• Cross-species analysis
Query phrase on slide: “Transcription factors”

RAD: TranscriptExpression
• Arrays
• SAGE
• Conditions
Query phrase on slide: “up-regulated in”

TESS: Gene Regulation
• Binding Sites
• Patterns
• Grammars
Query phrase on slide: “and common promoter motifs”

[Figure: diagram relating GUS, RAD, and TESS. Only the labels survive:
  EST clustering and assembly
  Genomic alignment and comparative sequence analysis
  Identify shared TF binding sites]

GUS Approach to Schema

• Think objects
  – Parents and children
  – Subclassing with views
• Views
  – Start with generic Imp table (e.g., NAFeatureImp) that contains base attributes plus generic attributes of various datatypes
  – Superclass view (e.g., NAFeature) just has base attributes
  – Subclass views (e.g., RNAFeature) have additional attributes using generic attributes
• Strongly-typed
  – Tend to avoid “name-value” pairs
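
The Imp-table-plus-views approach can be illustrated with a small sketch. This is not the GUS DDL: it uses SQLite through Python's standard library instead of Oracle, and every column name (subclass_view, string1, int1, rna_type, exon_count) is a hypothetical stand-in chosen for illustration. Only the table and view names NAFeatureImp, NAFeature, and RNAFeature come from the slide.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic Imp table: base attributes plus generic, typed columns.
-- All column names here are hypothetical.
CREATE TABLE NAFeatureImp (
    na_feature_id INTEGER PRIMARY KEY,
    subclass_view TEXT NOT NULL,   -- which subclass the row belongs to
    name          TEXT,            -- base attribute
    string1       TEXT,            -- generic attribute
    int1          INTEGER          -- generic attribute
);

-- Superclass view: base attributes only, all rows.
CREATE VIEW NAFeature AS
    SELECT na_feature_id, subclass_view, name
    FROM NAFeatureImp;

-- Subclass view: gives the generic columns meaningful, typed names.
CREATE VIEW RNAFeature AS
    SELECT na_feature_id, name,
           string1 AS rna_type,
           int1    AS exon_count
    FROM NAFeatureImp
    WHERE subclass_view = 'RNAFeature';
""")

conn.executemany(
    "INSERT INTO NAFeatureImp (subclass_view, name, string1, int1) "
    "VALUES (?, ?, ?, ?)",
    [("RNAFeature", "rna-1", "mRNA", 3),
     ("GeneFeature", "gene-1", None, None)],
)

# The superclass view sees every feature; the subclass view only its own rows.
all_features = conn.execute(
    "SELECT name FROM NAFeature ORDER BY na_feature_id").fetchall()
rna_features = conn.execute(
    "SELECT name, rna_type, exon_count FROM RNAFeature").fetchall()
print(all_features)
print(rna_features)
```

Naming the generic columns in the subclass view is what keeps the schema strongly typed: queries refer to exon_count as an integer column rather than looking up a name-value pair.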

DoTS Central Dogma

[Figure: class diagram. Only the labels survive:
  Gene, RNA, Protein
  GeneInstance, RNAInstance, ProteinInstance
  GeneFeature, RNAFeature, ProteinFeature
  GenomicSequence, RNASequence, ProteinSequence
  NAFeature, AAFeature, NASequence, AASequence]

DoTS Schema Has Been Driven By Building Gene Indices

[Figure: pipeline diagram. Only the labels survive:
  Inputs: genomic sequence; mRNA/EST sequence
  Clustering and assembly; DoTS consensus sequences; RNAs; proteins
  Translation: framefinder
  Gene predictions: GenScan/HMMer, PHAT; predicted genes
  SIM4 or BLAT; gene/RNA cluster assignment; merge genes; gene index
  Functional predictions: GO functions; protein motifs (PFAM, Smart, ProDom); BLAST similarities (BLASTP, BLASTX)
  Other computed annotation (EPCR, AssemblyAnatomyPercent, index key words, SNP analysis)
  Annotate DoTS: manual annotation tasks]

DoTS Gene Indices Are Based on Clustering and Assembling ESTs

Identify new sequences in GenBank and dbEST

“Quality” assembly sequences
• Remove vector, polyA tails, ribosomal and poor quality sequences
• Mask repeats with RepeatMasker

Clusters of sequences (40 bp length, 92% identity)
• BLASTN vs self
• BLASTN vs DoTS
• Connected components analysis to form clusters

• Assemble clusters using CAP4
• Update database

GUS relational database

Iterate to complete build
- Extract consensus sequences
- Block with RepeatMasker
- BLASTN vs self
- Cluster (95% identity, 75 bp overlap)
- Assemble with CAP4

Annotation of DoTS consensus sequences
- protein translations with framefinder
- BLAST analyses vs nrdb, prodom and CDD
- assign description and index keywords
- GOFunction assignment
- EPCR to generate radiation hybrid mapping
- derive assembly -> anatomy mapping
- alignment to genomic DNA
- assignment to “Gene” clusters
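
The "connected components analysis to form clusters" step can be sketched as follows. This is a minimal illustration, not the DoTS build code: the hit tuple layout and the function name cluster_hits are hypothetical, and the slide's "40 bp length, 92% identity" is read here as a minimum match length and minimum percent identity for a BLASTN hit to link two sequences. Every sequence named in a hit is assumed to appear in seq_ids.

```python
from collections import defaultdict

MIN_LENGTH = 40       # bp, from the slide (interpreted as minimum match length)
MIN_IDENTITY = 92.0   # percent, from the slide


def cluster_hits(seq_ids, hits):
    """Group sequences into connected components of qualifying hits.

    hits: iterable of (query_id, subject_id, pct_identity, match_length);
    this tuple layout is an assumption made for the sketch.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, pct_identity, length in hits:
        if length >= MIN_LENGTH and pct_identity >= MIN_IDENTITY:
            parent[find(query)] = find(subject)   # union the two components

    clusters = defaultdict(set)
    for s in seq_ids:
        clusters[find(s)].add(s)
    return sorted(sorted(members) for members in clusters.values())


seq_ids = ["est1", "est2", "est3", "est4"]
hits = [
    ("est1", "est2", 95.0, 120),   # links est1 and est2
    ("est2", "est3", 93.5, 60),    # links est2 and est3
    ("est3", "est4", 99.0, 30),    # too short, ignored
    ("est1", "est4", 85.0, 200),   # identity too low, ignored
]
clusters = cluster_hits(seq_ids, hits)
print(clusters)
```

Each resulting cluster would then be handed to the assembler (CAP4 on the slide); sequences that link to nothing remain as singleton clusters.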

RAD 3.0 Schema Incorporates MAGE and Experience With Microarrays

LIMS for Data Analysis. Also holds SAGE.

[Figure: UML class diagram of the RAD 3.0 schema. The associations are drawn as 1 to 0..* and 0..1 to 0..*, but which classes they connect is not recoverable from the transcript. Class names as extracted:
  Analysis, AnalysisInput, AnalysisOutput, AnalysisImplementation, AnalysisParameter
  ARRAY, ARRAYANNOTATION, ELEMENTIMP, ELEMENTANNOTATION, COMPOSITEELEMENTIMP, COMPOSITEELEMENTANNOTATION
  ELEMENTRESULTIMP, COMPOSITEELEMENTRESULTIMP
  CONTROL, CONTROLTYPE
  ASSAY, RELATEDASSAY, ASSAYLABELEDEXTRACT, ASSAYGROUPFACTOR
  ACQUISITION, RELATEDACQUISITION, ACQUISITIONPARAMETER
  QUANTIFICATION, RELATEDQUANTIFICATION, QUANTIFICATIONPARAMETER
  GROUPFACTOR, EXPERIMENTGROUP
  BIOMATERIALIMP, BIOMATERIALMEASUREMENT, BIOSAMPLE, BIOSOURCE, BIOSOURCECHARACTERISTIC
  LABEL, LABELEDEXTRACT
  PROCESS, PROCESSTYPE, PROCESSIMPLEMENTATION, PROCESSIMPPARAMETER, PROCESSPARAMETER, ProcessInput, ProcessOutput
  PROTOCOLTREATMENT
  ONTOLOGYENTRY]

Status of GUS Namespaces

• Core
  – Tables exist, Workflow documented
• SRes
  – Tables exist
• DoTS
  – Tables exist, some documentation
• RAD
  – Version 3.0 to include MAGE, experience
    • Pretty much complete
  – Tables exist, mostly documented
• TESS
  – Tables ready but not created

Schema Development

• Releases on Sourceforge:
  – CREATE TABLE statements
  – Table dumps from Core::TableInfo, Core::DatabaseDocumentation
  – Gifs of ER diagrams
• Adding tables between releases
  – In CVS tree?
  – Use message forum for discussion

Documentation

• Schema Browser looks at TableInfo

• Plug-in
  – Populates DatabaseDocumentation
  – Input:
      Table\t\tDescription of table
      Table\tAttribute\tDescription of attribute
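
The tab-delimited input format above can be read with a few lines of code. The sketch below is not the GUS plug-in itself; the function name parse_doc_lines and the sample descriptions are hypothetical. It only assumes what the slide shows: three tab-separated fields per line, with an empty middle field marking a table-level description.

```python
def parse_doc_lines(lines):
    """Split documentation lines into table-level and attribute-level entries.

    Each line has three tab-separated fields: table, attribute, description.
    An empty attribute field means the description applies to the table.
    """
    table_docs = {}
    attribute_docs = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        table, attribute, description = (f.strip() for f in fields)
        if attribute:
            attribute_docs[(table, attribute)] = description
        else:
            table_docs[table] = description
    return table_docs, attribute_docs


# Hypothetical sample input in the format shown on the slide.
sample = [
    "NAFeatureImp\t\tGeneric table behind the NAFeature views\n",
    "NAFeatureImp\tname\tName of the feature\n",
]
table_docs, attribute_docs = parse_doc_lines(sample)
print(table_docs)
print(attribute_docs)
```

A loader built on this would then write each entry to DatabaseDocumentation; that step is omitted because the slide does not describe the target columns.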

GUS Schema Browser

• http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30
• Points at GUS30 on CBIL development database server (erebus).
  – Need to move? Maintain release view?
• DoTS Tables:
  – Central dogma
  – Evidence/Similarity
  – ProjectLink
  – SequenceGroupImp/SequenceGroupExperimentImp
  – Plasmomap?
• Other tables of interest?