Protein Data Management: Integration of Chemistry and Biology

Oracle Discovery User Group Meeting

May 22nd, 2003

Steve Potts, Ph.D., MBA

Sunil Patel, Ph.D.



Biological Drug Targets

• Genomic data explosion has driven the rise of bioinformatics

– Challenge is extracting functional information

• 3D structural information, in conjunction with traditional bioinformatics, offers direct insights into function

– 3D data often not experimentally available

– Predicted 3D structure is a valuable proxy

– vHTS (ligands) used to focus lead discovery

• Structural Genomics initiatives will provide 3D information for new protein families


Bringing Chemistry to Biologists

[Pipeline diagram; box labels: Protein Sequence Data, Sequence Similarity Search, HMMs, High-Throughput Modeling, SeqFold, 3D Annotations, Motifs Analysis (TM, LC, etc.), DS AtlasStore]

DS GeneAtlas Pipeline: a high-throughput pipeline that provides biochemical functional annotation of protein sequences

Trypsin (1trn), a serine protease, with active-site residues (catalytic triad: Ser, His and Asp)


Pharmaceutical Industry Needs

• System for storing proprietary proteomics and protein/ligand complex data

– Download public PDB structures

– Deposit X-ray crystal structures & complexes solved internally and by collaboration

– Deposit NMR & Homology Modeling Structures

– Deposit results from vHTS (ligands)

• Ability to share data across research organization

• Ability to annotate, search, visualize, analyze and link to external data

– Ligand based searching

– Protein based searching

– Experimental data based searching

• Ability to view experimental data

– E.g. X-ray crystallographic data: electron density


Bringing Target Data to Chemists: Issues with Current Systems

• Protein Centric Databases

– Limited ability for end-user to write back into the database

• Don’t include security features/versioning

– Don’t really handle chemistry the same as ligand databases

– No way to search on Chemical Structure

– No link to experimental information

• Ligand Centric Databases

– Handle proprietary compounds well

– But protein sequences & structures not stored

• Cross database querying is difficult because no common identifiers exist between databases!!


DS AtlasStore

[Architecture diagram; labels: Oracle Technology, DB Schema, Data Content, Public & Proprietary Genomes, Data Content from PDB via FTP, DS GeneAtlas, Mass Spec, In-house X-ray, NMR & vHTS groups, SNPs, User Deposits, Interface]


DS AtlasStore 2.0

• Complete protein data management solution for experimental and computational results

– Integration

– Mining

– Analysis

– Visualization

• A proteomic database with connections to many types of experimental information

– X-ray including ligand structure and reflection data

– (NMR planned)

– Mass spectrometry

– Computational (in silico)

• GeneAtlas annotation (protein sequences, structures, 3D-motifs, etc.)

• Results for vHTS

• Manage public, proprietary and third party data securely behind a firewall

• Flexible query interface allows queries using chemistry, biology and experimental information


DS AtlasStore 2.0

• DS AtlasStore main features in DS Modeling

– Data Loading (automated and via GUI)

• Public

• Proprietary

– From DS GeneAtlas

– Other in-house data

– Querying

– Visualization

– Sharing data between chemists and biologists

• DS ProjectKM - Oracle-based file system


DS AtlasStore is a module in Discovery Studio Modeling

• The next release, DS Modeling 1.1, is scheduled for July 2003

– Builds on release of DS Modeling 1.0 through new X-ray applications

• Science focused on best of breed from existing products

– CHARMm, CNX, MODELER, DelPhi, Profiles-3D, Ludi, etc…

• Designed to allow integration of protein sequence and structure data

– Facilitates target identification & characterization

– Work flow (wizard)-based applications for novices

– Jobs can be run interactively or in the background

– Automatic XML report generation

[Screenshot callouts: 3D Structure Windows, Project Navigator Window, Menus, toolbars and buttons, Hierarchy Window, Sequence Window]


DS Modeling for Modeling and Simulations

• DS Modeling Visualizer

• DS ProjectKM for knowledge capture

• Simulations

– DS CHARMm

– DS CHARMm Lite

– Forcefields (CHARMm, CFF, etc…)

– DS Analysis

– DS DelPhi

• Protein Homology Modeling and Characterization

– DS Protein Similarity Search

– DS MODELER

– DS Protein Health

– DS Protein Families

• Functional Annotation of Proteins

– DS GeneAtlas

• Enterprise Proteomics Database

– DS AtlasStore

• Structure-based Target Design *

– DS Ludi

– DS LigandFit

– DS LigandScore

• X-ray

– DS HT-XPIPE

– DS XBUILD

– DS XLIGAND

– DS CNX

* DS Modeling 1.2


Automated Data Loading from RCSB Public Database

• Data automatically downloaded from RCSB PDB mirror site

– Includes HKL (reflection) X-ray data

• Files are parsed to identify ligands

– XML file from RCSB used to automatically assign as much “real chemistry” to the ligand as possible

• Protein, ligand, etc. structures and x-ray reflection data are then loaded into DS AtlasStore
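The ligand-identification step can be pictured with a minimal Python sketch. It assumes standard fixed-column PDB `HETATM` records (residue name in columns 18-20, chain in column 22, residue number in columns 23-26) and treats every non-water hetero group as a candidate ligand. The names `hetatm` and `find_ligands` and the sample records are hypothetical illustrations; this is not the DS AtlasStore parser, which according to the slide also uses the RCSB XML file to assign chemistry.

```python
def hetatm(serial, atom, resname, chain, resseq):
    """Format a minimal HETATM line for the demo (coordinates omitted)."""
    return f"HETATM{serial:5d} {atom:<4s} {resname:>3s} {chain}{resseq:4d}"


def find_ligands(lines, exclude=("HOH",)):
    """Return sorted unique (residue name, chain, residue number) hetero groups."""
    groups = set()
    for line in lines:
        if not line.startswith("HETATM"):
            continue  # skip ATOM and all other record types
        resname = line[17:20].strip()  # columns 18-20
        if resname in exclude:
            continue  # water is not treated as a ligand here
        chain = line[21]               # column 22
        resseq = int(line[22:26])      # columns 23-26
        groups.add((resname, chain, resseq))
    return sorted(groups)


# Illustrative records only, not copied from a real PDB entry.
records = [
    "ATOM      1  N   VAL A   1",
    hetatm(1218, "FE", "HEM", "A", 154),
    hetatm(1219, "NA", "HEM", "A", 154),
    hetatm(1262, "S", "SO4", "A", 156),
    hetatm(1267, "O", "HOH", "A", 201),
]
ligands = find_ligands(records)
print(ligands)
```

Grouping by residue name, chain and residue number collapses the many atoms of one hetero group into a single ligand entry, which is the unit a chemistry search would operate on.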


Automated Data Loading from DS GeneAtlas Pipeline

• Public or Proprietary data

• High-throughput functional annotation pipeline

– GUI in DS Modeling

– Data stored in DS AtlasStore

[Pipeline diagram; box labels: Protein Sequence Data, Sequence Similarity Search, Motifs Analysis (LC, TransMem, etc.), HMMs, High-Throughput Modeling, SeqFold, 3D Annotations, DS AtlasStore™]


DS AtlasStore Overall Content

• 32 genomes processed with DS GeneAtlas at Accelrys

– 87% of sequences annotated, 3.7M alignments, 1.7M 3D structures

– 3D annotations based on PDB annotations, 3D functional motifs & SCOP

• Annotated Genomes

• Pharma

– Caenorhabditis elegans

– Drosophila melanogaster

– Escherichia coli

– Homo sapiens

– Mus musculus

• Bacterial/infectives

– Aquifex aeolicus

– Bacillus subtilis

– Campylobacter jejuni

– Chlamydia pneumoniae

– Chlamydia trachomatis

– Deinococcus radiodurans

– Escherichia coli

– Haemophilus influenzae

– Helicobacter pylori

– Methanobacterium thermoautotrophicum

– Methanococcus jannaschii

– Mycobacterium leprae

– Mycobacterium tuberculosis

– Mycoplasma genitalium

– Mycoplasma pneumoniae

– Neisseria meningitidis

– Pyrococcus horikoshii

– Rickettsia prowazekii

– Saccharomyces cerevisiae

– Salmonella typhimurium

– Staphylococcus aureus

– Streptococcus pneumoniae

– Treponema pallidum

– Ureaplasma urealyticum

– Vibrio cholerae

– Yersinia pestis

• Agrochemicals

– Agrobacterium tumefaciens

– Arabidopsis thaliana

– Synechocystis sp.

– Drosophila melanogaster

– Escherichia coli


DS AtlasStore Data Loading and Searching

• DS AtlasStore data content comes from three specific sources.

• PDB

– The Protein Data Bank is automatically parsed; protein structures, ligands, and x-ray data are stored in DS AtlasStore, where they can be queried

• Your data

– You can load your own proprietary protein, ligand, and x-ray data into DS AtlasStore through a loading wizard and then query it

• DS GeneAtlas

– 3D structural annotation and models from Accelrys' DS GeneAtlas high-throughput pipeline can be loaded into DS AtlasStore and queried there


DS AtlasStore Data Loading via GUI

Additional textual annotation can also be entered

Data loaded from DS Modeling Visualizer or from file

Automatically parse out the ‘HET’ groups into ligands


DS AtlasStore Data Loading via GUI

The chemistry of the ligand can also be verified

X-ray parameters are captured from the PDB and structure factor files and compared


DS AtlasStore Integrated Searching

Text Search

Bioinformatics (BLAST) Search

Cheminformatics (Accord) Search


DS AtlasStore Integrated Searching

Search by SCOP, Gene Ontology and E.C. (Enzyme Classification) number


DS AtlasStore Query Results

• Query results are displayed in a hit list table

• Links allow the visualization of ligand chemistry or loading of the structure (and electron density) into the DS Modeling Visualizer


Data Visualization – X-ray Structures


Data Visualization – Protein Annotations


SNP Visualization in DS AtlasStore

• Phenylalanine hydroxylase (PAH)

– Phenylketonuria

– Hyperphenylalaninaemia

Mutation type               # of mut.
Nucl. subst. (mis/nons)     235
Nucl. subst. (splicing)     45
Small deletions             36
Small insertions            3
Gross deletions             8
Complex rearrang.           3
TOTAL                       330

[Structure figure; labels: Active Site Annotation, SNP]


Present Relationships

• Confirmant: ACCELRYS TO PROVIDE FOUNDATION TECHNOLOGY FOR MANAGEMENT AND FUNCTIONAL ANNOTATION OF CONFIRMANT’S PROPRIETARY HUMAN PROTEOME

– Feb. 11, 2003

– Confirmant is a joint venture between Oxford Glycosciences and Marconi

• They have a product called Protein Atlas

– experimentally derived information using proprietary technology to unambiguously define genes in the human genome

– Confirmant has licensed GeneAtlas & DS AtlasStore

• Link between Protein Atlas & DS AtlasStore will provide experimental validation of GeneAtlas annotations

– Confirmant is processing their data with GeneAtlas

• Will sell resulting content in conjunction with their product offering


Accelrys’ Biological Informatics Enterprise Solutions:

DS AtlasStore: Proteomics database

– Protein sequences & alignments, structures and models, ligands, x-ray data, expressed proteins

– Text fields include: functional annotation from GeneAtlas, PDB headers, original sequence annotations, ligand descriptions

DS SeqStore: Bioinformatics database

– Protein and nucleic acid sequences & alignments

– Text fields include: sequence description, publication titles and authors, keywords, comments and functional annotations, cross-references and alternative names

• Enterprise Data Management Systems:

– Oracle 8i & 9i large databases (200–300 GB)

– Read & Write

– Versioning of data and security


User Interface for Text Searching

All searching contained in a powerful QueryBuilder

Original annotation

Novel annotation from GeneAtlas pipeline

Keywords in PDB

Text searches available in DS AtlasStore


Additional Text Search Functionality: Oracle Text

Feature                        Example                                           SeqStore 2.1   DS SeqStore 3.1   DS AtlasStore 2.0
Keyword                        Kinase                                            Y              Y                 Y
Results ranked by text score   Scoring using the Salton Algorithm                -              -                 Y
Stem                           Phosphorylate, phosphorylation, phosphorylating   -              Y                 Y
Sounds like                    Cinace                                            -              Y                 Y
Order of occurrence            Kinase P58 inhibitor                              -              -                 Y
Fuzzy                          fosforilate                                       -              -                 Y
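To make the "Fuzzy" example (fosforilate) concrete, here is a minimal Python sketch of approximate term matching with results ranked by score. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; it is not Oracle Text's fuzzy operator and not the Salton scoring named on the slide. The function `fuzzy_rank`, the vocabulary, and the 0.6 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_rank(query, vocabulary, cutoff=0.6):
    """Return (term, score) pairs at or above cutoff, best match first."""
    scored = []
    for term in vocabulary:
        # ratio() is 2*M/T, where M counts matched characters and T is the
        # combined length of both strings; 1.0 means identical.
        score = SequenceMatcher(None, query.lower(), term.lower()).ratio()
        if score >= cutoff:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical index vocabulary.
vocabulary = ["phosphorylate", "kinase", "inhibitor"]
matches = fuzzy_rank("fosforilate", vocabulary)
for term, score in matches:
    print(f"{term}: {score:.2f}")
```

A misspelled query such as "fosforilate" still retrieves "phosphorylate" because most of its characters line up in order, while unrelated terms fall below the cutoff.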


Handling XML Documents Inside the Database: Leveraging XML in Oracle 9i

DS AtlasStore:

• PDB headers stored as XML

• All PDB tags are indexed and can be searched: combines the best of XML with SQL for searching

DS SeqStore:

• Will be compatible with the extensible BSML-centric approach implemented in DS Gene


XML Searching of PDB tags

All searching contained in a powerful QueryBuilder

XML tag based searching of PDB header categories


XML Searching of PDB tags

Select PDB_ID where contains(‘oxygen transport’ within CLASSIFICATION)

HEADER OXYGEN TRANSPORT 24-DEC-97 112M

TITLE SPERM WHALE MYOGLOBIN D122N N-PROPYL ISOCYANIDE AT PH 9.0

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: MYOGLOBIN;

COMPND 3 CHAIN: NULL;

COMPND 4 ENGINEERED: SYNTHETIC GENE;

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: PHYSETER CATODON;

SOURCE 3 ORGANISM_COMMON: SPERM WHALE;

REMARK 3

REMARK 3 OTHER REFINEMENT REMARKS: NULL

REMARK 4

REMARK 4 112M COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996

REMARK 200

REMARK 200 EXPERIMENTAL DETAILS

REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION

<?xml version="1.0"?>

<PDB>

<PDBID>112M</PDBID>

<CLASSIFICATION>OXYGEN TRANSPORT</CLASSIFICATION>

<DATE>24-DEC-97</DATE>

<TITLE> <![CDATA[SPERM WHALE MYOGLOBIN D122N N-PROPYL ISOCYANIDE AT PH 9.0]]> </TITLE>

<KEYWDS> <![CDATA[LIGAND BINDING, OXYGEN STORAGE, OXYGEN BINDING, HEME, OXYGEN TRANSPORT]]> </KEYWDS>

<EXPDTA> <![CDATA[X-RAY DIFFRACTION]]> </EXPDTA>

<AUTHOR> <![CDATA[R.D.SMITH,J.S.OLSON,G.N.PHILLIPS JUNIOR]]> </AUTHOR>

<SOURCE> ORGANISM_COMMON: SPERM WHALE </SOURCE>

<REMARK 3> OTHER REFINEMENT REMARKS: NULL </REMARK 3>

<REMARK 4> 112M COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996 </REMARK 4>

<REMARK 200> EXPERIMENT TYPE: X-RAY DIFFRACTION </REMARK 200>

</PDB>

SQL in RDBMS

XPath in XML
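The header-to-XML idea on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DS AtlasStore loader or Oracle's XML support: it relies on the fixed-column layout of the PDB `HEADER` record (classification in columns 11-50, date in columns 51-59, ID code in columns 63-66), uses the standard-library `xml.etree.ElementTree`, and emulates the tag-restricted `contains(... within CLASSIFICATION)` query with a plain substring test. The function names `header_to_xml` and `contains_within` are hypothetical.

```python
import xml.etree.ElementTree as ET


def header_to_xml(header_line):
    """Convert one fixed-column PDB HEADER record into a small XML element."""
    root = ET.Element("PDB")
    ET.SubElement(root, "PDBID").text = header_line[62:66].strip()
    ET.SubElement(root, "CLASSIFICATION").text = header_line[10:50].strip()
    ET.SubElement(root, "DATE").text = header_line[50:59].strip()
    return root


def contains_within(root, phrase, tag):
    """Case-insensitive phrase match restricted to elements named tag."""
    needle = phrase.lower()
    return any(needle in (el.text or "").lower() for el in root.iter(tag))


# Rebuild the HEADER record with explicit padding, because the slide text
# has lost the original column alignment.
line = "HEADER".ljust(10) + "OXYGEN TRANSPORT".ljust(40) + "24-DEC-97" + "   " + "112M"
entry = header_to_xml(line)

hits = []
if contains_within(entry, "oxygen transport", "CLASSIFICATION"):
    hits.append(entry.findtext("PDBID"))
print(hits)
print(ET.tostring(entry, encoding="unicode"))
```

Restricting the match to one tag is what distinguishes this from a plain full-text search: the same phrase in the keyword list or title would not count as a hit on `CLASSIFICATION`.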


Combine Text Searching with Other Search Types

Ligand 2D chemistry

BLAST searching

X-ray metadata
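A combined search of this kind amounts to joining several data types on a shared entry identifier. The sketch below shows the idea with Python's built-in `sqlite3` module standing in for Oracle; the three-table schema, the column names, the entry IDs and all values are hypothetical and invented for illustration, and a simple `LIKE` filter stands in for the text search.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entry  (entry_id TEXT PRIMARY KEY, classification TEXT);
CREATE TABLE ligand (entry_id TEXT, ligand_name TEXT);
CREATE TABLE xray   (entry_id TEXT, resolution REAL);
""")

# Made-up rows; none of these describe real PDB entries.
con.executemany("INSERT INTO entry VALUES (?, ?)", [
    ("E001", "OXYGEN TRANSPORT"),
    ("E002", "OXYGEN TRANSPORT"),
    ("E003", "HYDROLASE"),
])
con.executemany("INSERT INTO ligand VALUES (?, ?)", [
    ("E001", "HEM"), ("E002", "HEM"), ("E003", "BEN"),
])
con.executemany("INSERT INTO xray VALUES (?, ?)", [
    ("E001", 1.8), ("E002", 2.5), ("E003", 1.5),
])

# Text criterion + ligand criterion + experimental criterion in one query.
rows = con.execute("""
    SELECT e.entry_id
    FROM entry e
    JOIN ligand l ON l.entry_id = e.entry_id
    JOIN xray   x ON x.entry_id = e.entry_id
    WHERE lower(e.classification) LIKE ?
      AND l.ligand_name = ?
      AND x.resolution <= ?
    ORDER BY e.entry_id
""", ("%oxygen transport%", "HEM", 2.0)).fetchall()

hits = [row[0] for row in rows]
print(hits)
```

The common `entry_id` key is what makes the three criteria composable in a single statement; without a shared identifier each search would have to be run separately and reconciled by hand.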


Ontologies in DS AtlasStore

Three ontologies are used in DS AtlasStore to federate and search data:

• Structural Classification of Proteins (SCOP)

– Human-curated structural ontology

– Linked to PDB structure chains

– 17406 (~88%) PDB entries in DS AtlasStore have a SCOP reference

• Gene Ontology (GO)

– Dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing

– Gene centric

– 13556 (~68%) PDB entries in DS AtlasStore have a Gene Ontology reference through SwissProt

• Enzyme Classification (EC)

– 9565 (~50%) RCSB entries have an Enzyme Classification reference

Ontology: “A controlled, hierarchical vocabulary for describing a knowledge system.”
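Because an ontology is hierarchical, a search on a parent term should also return everything classified beneath it. A minimal Python sketch for EC numbers follows; it compares dot-separated levels rather than raw string prefixes. The function name `ec_matches` and the entry labels are hypothetical, and this is not how the DS AtlasStore ontology browser is implemented.

```python
def ec_matches(ec_number, parent):
    """True if ec_number lies at or below parent in the EC hierarchy."""
    ec_levels = ec_number.split(".")
    parent_levels = parent.split(".")
    # Compare whole levels so that "2.71" is not mistaken for a child of "2.7".
    return ec_levels[:len(parent_levels)] == parent_levels


# Hypothetical entries mapped to EC numbers.
entries = {
    "entry_A": "2.7.11.1",  # falls under class 2.7
    "entry_B": "3.4.21.4",  # trypsin, a hydrolase
    "entry_C": "2.7.1.1",   # falls under class 2.7
    "entry_D": "2.71.1.1",  # made-up number to show the prefix pitfall
}

# EC 2.7 groups enzymes transferring phosphorus-containing groups,
# the class that contains the kinases.
under_2_7 = sorted(name for name, ec in entries.items() if ec_matches(ec, "2.7"))
print(under_2_7)
```

A naive `startswith("2.7")` test would wrongly include `entry_D`, which is why the comparison is done level by level.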


Searchable Ontology Browsers

Powerful Query Builder provides access to ontologies


Searchable Ontology Browsers

User is looking for a novel kinase sequence in DS AtlasStore:


Summary

• DS AtlasStore provides access to an integrated proteomics data source that will

– Aid the identification of novel targets

• View functional annotations

– Capture experimental target information in a central location

• Functions as a data ‘Warehouse’

• Share with chemists & biologists

– Use models & structures to suggest further experiments

• DS SeqStore and DS AtlasStore leverage Oracle’s new technologies in text searching and XML:

– More rapid searching

– Greater diversity of textual data mining search types

– Integration with other searches (structural, sequence, experimental data)

– Ontology searches


Acknowledgements

• Development team:

– Steve Potts

– Michael Pu

– Yin Yu

– Dave Dawley

– Yi-shiou Chen

• Management & Marketing:

– Sándor Szalma

– David Edwards

– Mary Donlan

– Dana Haley-Vicente