Core 2: Bioinformatics NCBO-Berkeley

Upload: curtis-hopkins

Post on 11-Jan-2016

Page 1: Core 2: Bioinformatics NCBO-Berkeley. Berkeley Drosophila Genome Project  Finish the sequence of the euchromatic genome of Drosophila melanogaster

Core 2: Bioinformatics

NCBO-Berkeley

Page 2: Berkeley Drosophila Genome Project

Finish the sequence of the euchromatic genome of Drosophila melanogaster
Annotated biologically important features of this sequence
Produced gene disruptions using P element-mediated mutagenesis
Full-length sequencing and expression characterization of a cDNA for every gene
Developing informatics tools

Page 3: Who is here from NCBO-Berkeley

Sima, Chris, Mark, Shu

Page 4: Chris

GadFly database schema
GO database schema
Chado database schema
Perl libraries for all
OBD data architect

Page 5: Shu

AmiGO, ImaGO & database
Compute Pipeline
OBD dev & Data flow

Page 6: Mark

Apollo Genome Annotation Editor
Phenote and other OBD interfaces

Page 7: Sima

Adh region annotation
Annotation of entire Drosophila Genome
Project manager and coordinator nonpareil
Associate Director

Page 8: OBD Outline

Core 2 aims, refresher
Data models for OBD
  phenotypes
  clinical trials
  others
Modeling frameworks
  exchange formats
  database system
    SQL based vs ‘SemWeb’ dbs
Progress
Demo

Page 9: Core 2 Specific Aims

1. Apply ontologies
  Software toolkit for describing and classifying data
2. Capture, manage, and view data annotations
  Database (OBD) and interfaces to store and view annotations
3. Investigate and compare implications
  Linking human diseases to model systems
4. Maintain
  Ongoing reconciliation of ontologies with annotations

Page 10: Core 3 Driving Biological Projects

DBPs
  phenotypes: Fly and Zebrafish to human
  clinical trials

Core 2 Aims
1. Apply ontologies to describe data
2. Capture, manage, and view data annotations
3. Link disease genes to model systems
4. Reconcile annotation and ontology changes

Page 11: Apply ontologies to describe data

Requirements
  Data capture tools
    phenote demo tomorrow
    no tool requirements from UCSF
  Data model
  Database (OBD)
    --aim 2

Page 12: Core 2: Bioinformatics NCBO-Berkeley. Berkeley Drosophila Genome Project  Finish the sequence of the euchromatic genome of Drosophila melanogaster

dataflow

Page 13: Core 2: Bioinformatics NCBO-Berkeley. Berkeley Drosophila Genome Project  Finish the sequence of the euchromatic genome of Drosophila melanogaster

user’sview

Page 14: Data models

Common/shared domain specific models
Aim 3: linking disease genes
  model must support this
    granularity
    comparability

Page 15: Domain specific data models

FB, ZFIN
  genotype to phenotype
    ‘EAV’: qualities inhere in entities
  orthologs
  phenotype to disease
    core 2 will help define common model
UCSF
  clinical trials
  existing ontology-friendly schema - trialbank

Page 16: Phenotype data model

Qualities inhere in entities
  Entity term; PATO term
  brain FBbt:00005095; fused PATO:0000642
  gut MA:0000917; dysplastic PATO:0000640
  tail fin ZDB:020702-16; ventralized PATO:0000636
  kidney ZDB:020702-16; hypertrophied PATO:0000636
  midface ZDB:020702-16; hypoplastic PATO:0000636
Pre-composed phenotype terms
  Mammalian Phenotype Ontology
  “increased activated B-cell number” MPO:0000319
  “pink fur hue” MPO:0000374
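
A minimal Python sketch of the "qualities inhere in entities" record, for illustration only. The class name `Phenotype`, its field names, and the helper `with_quality` are assumptions made here and are not part of OBD; the two rows are copied from the slide as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """One 'quality inheres in entity' statement (hypothetical record shape)."""
    entity_label: str
    entity_id: str    # anatomy ontology term ID
    quality_label: str
    quality_id: str   # PATO term ID


# Rows taken from the slide; IDs are reproduced exactly as shown there.
PHENOTYPES = [
    Phenotype("brain", "FBbt:00005095", "fused", "PATO:0000642"),
    Phenotype("gut", "MA:0000917", "dysplastic", "PATO:0000640"),
]


def with_quality(records, quality_id):
    """Return every record whose quality is the given PATO term."""
    return [r for r in records if r.quality_id == quality_id]
```

Because entity and quality are separate fields, a query can select on either one independently, which a single pre-composed term ID does not allow without extra decomposition.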

Page 17: Extensions to simple model

What about
  Relational attributes
  Quantitative vs qualitative
  Post-composing entity and attribute terms
  Relative states/values
  Variation in place, space and time
  A better treatment of absence
See CSHL Pheno meeting talk
  also, more detailed formal presentation (available)
Not to mention genotypes, environments, provenance, etc

Page 18: Modeling clinical trials

Model already described using frame-based schema
Further modeling required?
  abstraction
    to integrate more with other OBD datatypes
  views
    to only show parts relevant to OBD/BioPortal

Page 19: Future DBPs and use cases

OBD will contain a variety of general types of data
Modeling is expensive
  use existing models where appropriate
  but whole must be cohesive and integrated
Most of this talk focuses on the pheno DBPs for illustrative purposes

Page 20: Modeling frameworks

language
technology

Page 21: Modeling data: underlying formalism

Model is expressed with modeling language
Options
  Relational/SQL
  Semi-structured, XML
  Object-centric (UML, frame-based?)
  Logic based
    description logic: e.g. OWL
    first-order logic: e.g. CL
  Natural language descriptions
Model should be independent of language it is expressed in

Page 22: Data exchange language: XML

Simple XML is suited for data exchange
XML can drive software spec
  constrains programmatic data model
  XSD can generate UML
  closed world assumption is useful
    cf Ruttenberg et al
Mature technology
  well understood by developers, MODs
  standards

Page 23: How OBD uses XML

obd-geno-pheno-xml (aka pheno-xml)
  actually multiple modular components
    genotype schema
    phenotype schema: ‘EAV’
    environment schema
    provenance schema
  used as exchange format
    cf: gene ontology association files
no need for ClinicalTrials-XML

Page 24: Example pheno-xml

<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
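
As a rough illustration of consuming this exchange format, the sketch below reads the example with Python's standard `xml.etree.ElementTree` and flattens it into one row per state. The function name `extract_annotations` and the row keys are assumptions, not part of pheno-xml tooling. The quality ID is elided on the slide, so the sample keeps a placeholder rather than a real PATO ID.

```python
import xml.etree.ElementTree as ET

# The slide's example; the quality type is elided in the source ("PATO:...").
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:...">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_annotations(xml_text):
    """Flatten a pheno-xml genotype element into one dict per state."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for state in entity.iter("state"):
            stage = state.find("time_range")
            rows.append({
                "genotype": root.get("id"),
                "entity": entity.get("type"),
                "state": state.get("type"),
                "stage": stage.get("type") if stage is not None else None,
            })
    return rows
```

Each returned row is already close to a relational tuple, which is one way to read the later claim that XML and SQL have a low impedance mismatch.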

Page 25: SQL Databases

Data storage, management and querying
  all MODs use SQL dbs
Lots of advantages
  scalable, standard QL, mature, APIs, etc
  pure relational model is reasonably formal
XML/SQL more or less compatible
  low impedance mismatch

Page 26: Schemas for geno-pheno data

We already have schema: Chado
  Used by many MODs (eg FB)
    others are ‘chado compliant’ (eg ZFIN)
  Modular
    ontologies
    genomic
    genotype
    phenotype
    phylogenies
    …etc
Phenotype module needs updating
  will be driven by pheno-xml

Page 27: Problem solved?

We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each

Is this enough to work with?

Page 28: Issues

OBD will be much more than geno-pheno
  clinical trials
  future DBPs, other NCBCs
  any data expressed in an ontology language
Software and schema development expensive
  fragility in face of schema evolution
  development gets bogged down in data exchange issues

Page 29: Major issue

SQL and XPath work great for ‘traditional’ data…
…but are too low level for ontology-centric data
  lack of inference
  no way to directly express ontology constraints

Page 30: Use cases from previous experience: AmiGO

GO
  “find all TF genes” (is_a closure)
  “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
Our solution (AmiGO & go-sqldb)
  pre-compute transitive closure over all relations in db
  (sort of) works for GO (for now)
    refresh problem
    explosive for tangled DAGs
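
The pre-computed closure idea can be sketched in a few lines of Python. This is a naive fixpoint over a toy edge set, not the go-sqldb implementation; the term names, the `compose` rule (any chain containing part_of yields part_of), and the function names are assumptions for illustration.

```python
# Toy ontology edges (child, relation, parent); term names are illustrative only.
EDGES = {
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("ER lumen", "part_of", "endoplasmic reticulum"),
    ("rough ER membrane", "part_of", "rough ER"),
    ("endoplasmic reticulum", "part_of", "cytoplasm"),
}


def compose(first, second):
    """is_a followed by is_a is is_a; any chain containing part_of is part_of."""
    return "is_a" if first == second == "is_a" else "part_of"


def transitive_closure(edges):
    """Naive fixpoint: keep joining edges until no new ones appear."""
    closure = set(edges)
    while True:
        derived = {
            (a, compose(r1, r2), c)
            for (a, r1, b) in closure
            for (b2, r2, c) in closure
            if b == b2
        }
        if derived <= closure:
            return closure
        closure |= derived


def parts_of(closure, whole):
    """Terms that are part_of `whole`, directly or by inference."""
    return {a for (a, rel, b) in closure if rel == "part_of" and b == whole}


CLOSURE = transitive_closure(EDGES)
```

The closure has to be rebuilt whenever an edge changes (the refresh problem), and the number of derived edges grows quickly when terms have many parents, which is the cost the slide describes for tangled DAGs.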

Page 31: OBD requires more ontological awareness

Other relations
  ontogenic (eg derives_from)
  transitive_over
Other types of data
Pre- versus post- composed terms
  E.g. MPO versus AO+PATO
  E.g. Entity+Spatial qualifier
  queries over either should be interchangeable

Page 32: Solution: more expressive formalisms

QLs and APIs should provide and abstract away common ontology operations
  ease of programming, optimisation
Choices
  ‘Semweb’ databases
    RDF + RDFS + OWL [ lite + DL ] + extra
    lots to choose from, emerging standards
    compatible with Obo v1.2 spec
  Deductive databases
    superset of relational databases
    from Prolog to full CL

Page 33: Modeling phenotypes as RDF/OWL or Obo instances

[Figure: classes/terms and instances; entity, quality]

Page 34: Example query in SeRQL

find mutations affecting the shape of the wing vein:

SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein"
  AND label(TaxN) = "Arthropoda"
  AND label(QN) = "ShapeValue"

results of query on OBD-sesame:
  one annotation to “wing vein L2”, “branched”
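
To make the shape of this query concrete, here is a small Python sketch of triple-pattern matching over an in-memory list. It is not Sesame and not SeRQL; the `ex:` identifiers, the sample triples, and the `match` function are all invented for illustration. The taxon path is left out for brevity, and the type triples that an RDFS reasoner would infer from subclass links are simply stated explicitly.

```python
# Invented sample data; superclass types are listed explicitly instead of inferred.
TRIPLES = [
    ("ex:e1", "rdf:type", "ex:WingVeinL2"),
    ("ex:e1", "rdf:type", "ex:WingVein"),
    ("ex:WingVein", "rdfs:label", "wing vein"),
    ("ex:WingVeinL2", "rdfs:label", "wing vein L2"),
    ("ex:e1", "OBO_REL:has_quality", "ex:q1"),
    ("ex:q1", "rdf:type", "ex:Branched"),
    ("ex:q1", "rdf:type", "ex:ShapeValue"),
    ("ex:Branched", "rdfs:label", "branched"),
    ("ex:ShapeValue", "rdfs:label", "ShapeValue"),
    ("ex:e2", "rdf:type", "ex:Eye"),
    ("ex:Eye", "rdfs:label", "eye"),
]

# Terms starting with "?" are variables; everything else must match exactly.
PATTERNS = [
    ("?EI", "rdf:type", "?ET"),
    ("?ET", "rdfs:label", "wing vein"),
    ("?EI", "OBO_REL:has_quality", "?QI"),
    ("?QI", "rdf:type", "?QT"),
    ("?QT", "rdfs:label", "ShapeValue"),
]


def match(triples, patterns, binding=None):
    """Yield every variable binding that satisfies all patterns (nested-loop join)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    head, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(head, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(triples, rest, candidate)


RESULTS = list(match(TRIPLES, PATTERNS))
```

The point of the sketch is only that each line of the FROM clause is a path of triple patterns sharing variables; a real store would add indexing and inference.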

Page 35: Advantages of ‘SemWeb’ dbs

Advantages over pure SQL
  The ontology is the model
    constraints encoded in ontology
      e.g. certain quality types only applicable to certain entity types
    agile development - fast database integration
  Rich modeling constructs
    transitivity, subsumption, intersection, etc
    powerful QLs and APIs
  More (technical) interoperation ‘for free’
    URIs
    proven?
  Open World Assumption (maybe a hindrance?)

Page 36: Disadvantages of ‘SemWeb’ dbs

Disadvantages
  speed
    may be slower than SQL
    ..but in-memory execution is fast
  lack of maturity
    new technology
    ..but has a LOT of momentum
  foundations
    are RDF triples appropriate?
    inherent difficulties modeling time
    SQL allows n-ary relations/predicates

Page 37: Hybrid model

SemWeb dbs are commonly layered over SQL DBs
We can have the best of both worlds
Data View layers
  mapping between Obo/OWL model and domain-specific relational schema
  (optionally) materialized for speed
  different applications use appropriate layer
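
One way to picture the view layer is a SQL view that re-exposes relational rows as subject/predicate/object triples. The sketch below uses Python's standard `sqlite3` module; the table `phenotype`, the view `triple`, the predicate names, and both function names are hypothetical and do not reflect the Chado or OBD schemas.

```python
import sqlite3


def build_hybrid_db():
    """Create a toy relational table plus a triple-shaped view over it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Domain-specific relational layer (hypothetical schema).
        CREATE TABLE phenotype (genotype TEXT, entity TEXT, quality TEXT);
        INSERT INTO phenotype
            VALUES ('ZFIN:tm84', 'ZDB-ANAT-010921-528', 'PATO:0000636');

        -- View layer: each row becomes two subject/predicate/object triples.
        CREATE VIEW triple AS
            SELECT genotype AS subject,
                   'has_phenotype_entity' AS predicate,
                   entity AS object
            FROM phenotype
            UNION ALL
            SELECT entity, 'has_quality', quality FROM phenotype;

        -- Optional materialization of the view for faster reads.
        CREATE TABLE triple_materialized AS SELECT * FROM triple;
    """)
    return conn


def triples(conn):
    """Read the triple view, ordered by predicate for a stable result."""
    return conn.execute(
        "SELECT subject, predicate, object FROM triple ORDER BY predicate"
    ).fetchall()
```

Applications that want the relational shape query `phenotype` directly, while ontology-oriented code reads `triple` or its materialized copy; the materialized copy trades freshness for speed and needs rebuilding after updates.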

Page 38:

Page 39: Current progress: OBD-Sesame

Sesame
  open source ‘triple store’
  based on Jena
    also used in Protégé-OWL
  storage layer options
    mysql/postgresql
      generic schema
    in-memory
    disk-based

Page 40: OBD in Sesame: current datasets

Pheno
  ZFIN & FB: EAV trial 2003 data
  Test ortholog set
  FB ‘simple phenotype’ alleles
  ZFIN legacy phenotype data, automatically parsed to EAV
  Ontologies: AOs, PATO, Cell, GO
  Method
    excel & flatfiles->pheno-xml->owl
    OWL from http://www.fruitfly.org/~cjm/obo-download
Trialbank
  Method: ocelot->obo-xml->owl
Soon
  human orthologs and omim

Page 41: Technology Evaluation: Sesame

Use case query set
Benchmarks
  preliminary conclusions
    SQL layering is terrible
    in-memory is fast
  optimisations? other triple stores?
  up to date results on wiki
    http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
Need to test OWL-DL entailment
Bigger dataset required for full evaluations
Community effort: public-semweb-lifesci list

Page 42: Parallel development: an OBD Prototype

Initiated prior to OBD-Sesame
Simple deductive database
  prolog-based
  chado-like schema
    can be views on Obo/OWL predicates
  amigo-clone user interface
Rapid prototyping
Current dataset
  as obd-sesame, plus CT
  trivial to drop in more

Page 43: Example logic query

find mutations affecting the shape of some part of the head capsule

inheres(QI,EI) &
inst(QI,QT) &
label(QT,shape) &
inst(EI,ETP) &
part_of*(ETP,ET) &
label(ET,'head capsule')

results of query on OBD-prolog:
  one annotation to “arista lateral”, “irregular shape”

Page 44: OBD TODO

Pheno-xml
  finalise release version
  finalise Obo/OWL mapping
  logic specification
Data
  orthologies
OBD - BioPortal integration
  how will it work?
Versioning and reconciling changes
  decide on ontology versioning first

Page 45: OBD dependencies

PATO development
UMLS into OBO-site
Ontologies
  FMA accessibility?
  species-centric AO alignments (XSPAN?)
  Sept meeting on AO development
  Nov meeting on disease ontologies
Data
  MOD pheno annotation
  OMIM annotation
BioPortal

Page 46: Misc

NLP for phenote
  Obol trial on evolutionary phenotype characters
  cambridge NLP project
  can be used to ‘prime’ phenote
Decomposing MPO
  pink fur def= fur, has_quality: pink
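
The "pink fur" decomposition suggests how pre-composed and post-composed annotations could be queried interchangeably. The Python sketch below assumes a lookup table from pre-composed term ID to an (entity, quality) pair; only the pink fur entry comes from the slides, and the table shape and both function names are hypothetical.

```python
# Hypothetical lookup from a pre-composed term ID to its entity/quality parts.
# Only the "pink fur" decomposition comes from the slides.
DECOMPOSITIONS = {
    "MPO:0000374": ("fur", "pink"),  # "pink fur hue"
}


def as_entity_quality(annotation):
    """Return an (entity, quality) pair for a pre- or post-composed annotation."""
    if isinstance(annotation, str):      # pre-composed term ID
        return DECOMPOSITIONS[annotation]
    return tuple(annotation)             # already an (entity, quality) pair


def annotations_about(annotations, entity):
    """Select annotations on `entity`, whichever way they were composed."""
    return [a for a in annotations if as_entity_quality(a)[0] == entity]
```

With this normalisation step, a query for "fur" finds both an annotation made with the pre-composed MPO term and one made directly as fur plus a quality.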

Page 47: Discussion

Will SemWeb dbs work?
  experiment
Ontology-based modeling
  the ontology is the model
  importance of
    relations ontology
    upper ontology

Page 48: Demos

http://yuri.lbl.gov/amigo/ct
http://yuri.lbl.gov/amigo/obd
http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db