protein data management: integration of chemistry and biology · 2003-06-16 · protein data...
Protein Data Management: Integration of Chemistry and Biology
Oracle Discovery User Group Meeting
May 22nd, 2003
Steve Potts, Ph.D., MBA
Sunil Patel, Ph.D.
Biological Drug Targets
• Genomic data explosion has driven the rise of bioinformatics
– Challenge is extracting functional information
• 3D structural information, in conjunction with traditional bioinformatics, offers direct insights into function
– 3D data often not experimentally available
– Predicted 3D structure a valuable proxy
– vHTS (ligands) used to focus lead discovery
• Structural Genomics initiatives will provide 3D information for new protein families
Bringing Chemistry to Biologists
[Figure: DS GeneAtlas pipeline diagram. Labels: Protein Sequence Data; Sequence Similarity Search; HMMs; High-Throughput Modeling; SeqFold; 3D Annotations; Motifs Analysis (TM, LC, etc.); DS AtlasStore]
DS GeneAtlas Pipeline: A high-throughput pipeline that provides biochemical functional annotation of protein sequences
Trypsin (1TRN), a serine protease, with active-site residues (catalytic triad: Ser, His and Asp)
Pharmaceutical Industry Needs
• System for storing proprietary proteomics and protein/ligand complex data
– Download public PDB structures
– Deposit X-ray crystal structures & complexes solved internally and by collaboration
– Deposit NMR & Homology Modeling Structures
– Deposit results from vHTS (ligands)
• Ability to share data across research organization
• Ability to annotate, search, visualize, analyze and link to external data
– Ligand based searching
– Protein based searching
– Experimental data based searching
• Ability to view experimental data
– E.g. X-ray crystallographic data: electron density
Bringing Target Data to Chemists: Issues with Current Systems
• Protein Centric Databases
– Limited ability for end-user to write back into the database
• Don’t include security features/versioning
– Don’t really handle chemistry the same as ligand databases
– No way to search on Chemical Structure
– No link to experimental information
• Ligand Centric Databases
– Handle proprietary compounds well
– But protein sequences & structures not stored
• Cross-database querying is difficult because no common identifiers exist between databases
DS AtlasStore
[Figure: DS AtlasStore diagram. Labels: Oracle Technology; DB Schema; Data Content; Public & Proprietary Genomes; Data Content from PDB via FTP; DS GeneAtlas; Mass Spec; In-house X-ray, NMR & vHTS groups; SNPs; User Deposits; Interface]
DS AtlasStore 2.0
• Complete protein data management solution for experimental and computational results
– Integration
– Mining
– Analysis
– Visualization
• A proteomic database with connections to many types of experimental information
– X-ray including ligand structure and reflection data
– (NMR planned)
– Mass spectrometry
– Computational (in silico)
• GeneAtlas annotation (protein sequences, structures, 3D-motifs, etc.)
• Results for vHTS
• Manage public, proprietary and third party data securely behind a firewall
• Flexible query interface allows queries using chemistry, biology and experimental information
DS AtlasStore 2.0
• DS AtlasStore main features in DS Modeling
– Data Loading (automated and via GUI)
• Public
• Proprietary
– From DS GeneAtlas
– Other in-house data
– Querying
– Visualization
– Sharing data between chemists and biologists
• DS ProjectKM - Oracle-based file system
DS AtlasStore is a module in Discovery Studio Modeling
• The next release, DS Modeling 1.1, is scheduled for July 2003
– Builds on release of DS Modeling 1.0 through new X-ray applications
• Science focused on best of breed from existing products
– CHARMm, CNX, MODELER, DelPhi, Profiles-3D, Ludi, etc…
• Designed to allow integration of protein sequence and structure data
– Facilitates target identification & characterization
– Work flow (wizard)-based applications for novices
– Jobs can be run interactively or in the background
– Automatic XML report generation
[Screenshot: DS Modeling interface. Labels: 3D Structure Windows; Project Navigator Window; Menus, Tool bars and buttons; Hierarchy Window; Sequence Window]
DS Modeling for Modeling and Simulations
• DS Modeling Visualizer
• DS ProjectKM for knowledge capture
• Simulations
– DS CHARMm
– DS CHARMm Lite
– Forcefields (CHARMm, CFF, etc…)
– DS Analysis
– DS DelPhi
• Protein Homology Modeling and Characterization
– DS Protein Similarity Search
– DS MODELER
– DS Protein Health
– DS Protein Families
• Functional Annotation of Proteins
– DS GeneAtlas
• Enterprise Proteomics Database
– DS AtlasStore
• Structure-based Target Design *
– DS Ludi
– DS LigandFit
– DS LigandScore
• X-ray
– DS HT-XPIPE
– DS XBUILD
– DS XLIGAND
– DS CNX
* DS Modeling 1.2
Automated Data Loading from RCSB Public Database
• Data automatically downloaded from RCSB PDB mirror site
– Includes HKL (reflection) X-ray data
• Files are parsed to identify ligands
– XML file from RCSB used to automatically assign as much “real chemistry” to the ligand as possible
• Protein, ligand, etc. structures and X-ray reflection data are then loaded into DS AtlasStore
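The ligand-identification step can be pictured with a minimal Python sketch. It assumes the standard fixed-column PDB text format, in which non-polymer atoms appear as HETATM records, and groups those atoms by residue name, chain and residue number while skipping water (HOH). The function name `het_groups` and the sample records are illustrative assumptions, not part of DS AtlasStore, and the sketch does not attempt the chemistry assignment from the RCSB XML file.

```python
from collections import Counter

# Made-up sample records in PDB fixed-column layout.
# Only columns 1-26 are read by the parser below.
SAMPLE_PDB = "\n".join([
    "ATOM      1  N   ILE A  16      -8.100   9.500  20.300",
    "HETATM 1218  C1  BEN A 246      -1.853  14.311  16.658",
    "HETATM 1219  C2  BEN A 246      -1.500  15.600  16.200",
    "HETATM 1230 CA    CA A 247      10.100  24.800   3.200",
    "HETATM 1231  O   HOH A 301       5.000   6.000   7.000",
])


def het_groups(pdb_text, skip=("HOH",)):
    """Count HETATM atoms per (residue name, chain ID, residue number)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue  # polymer atoms and other record types are ignored
        res_name = line[17:20].strip()  # columns 18-20
        if res_name in skip:
            continue  # e.g. water is usually not treated as a ligand
        chain_id = line[21].strip()     # column 22
        res_seq = int(line[22:26])      # columns 23-26
        counts[(res_name, chain_id, res_seq)] += 1
    return dict(counts)


if __name__ == "__main__":
    for group, n_atoms in het_groups(SAMPLE_PDB).items():
        print(group, n_atoms)
```

Each key in the returned dictionary is one candidate ligand (or ion) that a loader could then pass on for chemistry assignment and verification.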
Automated Data Loading from DS GeneAtlas Pipeline
• Public or Proprietary data
• High-throughput functional annotation pipeline
– GUI in DS Modeling
– Data stored in DS AtlasStore
[Figure: DS GeneAtlas pipeline diagram. Labels: Protein Sequence Data; Sequence Similarity Search; Motifs Analysis (LC, TransMem, etc.); HMMs; High-Throughput Modeling; SeqFold; 3D Annotations; DS AtlasStore™]
DS AtlasStore Overall Content
• 32 genomes processed with DS GeneAtlas at Accelrys
– 87% of sequences annotated, 3.7M alignments, 1.7M 3D structures
– 3D annotations based on PDB annotations, 3D functional motifs & SCOP
• Annotated Genomes
• Pharma
– Caenorhabditis elegans
– Drosophila melanogaster
– Escherichia coli
– Homo sapiens
– Mus musculus
• Bacterial/infectives
– Aquifex aeolicus
– Bacillus subtilis
– Campylobacter jejuni
– Chlamydia pneumoniae
– Chlamydia trachomatis
– Deinococcus radiodurans
– Escherichia coli
– Haemophilus influenzae
– Helicobacter pylori
– Methanobacterium thermoautotrophicum
– Methanococcus jannaschii
– Mycobacterium leprae
• Agrochemicals
– Agrobacterium tumefaciens
– Arabidopsis thaliana
– Synechocystis sp.
– Drosophila melanogaster
– Escherichia coli
– Mycobacterium tuberculosis
– Mycoplasma genitalium
– Mycoplasma pneumoniae
– Neisseria meningitidis
– Pyrococcus horikoshii
– Rickettsia prowazekii
– Saccharomyces cerevisiae
– Salmonella typhimurium
– Staphylococcus aureus
– Streptococcus pneumoniae
– Treponema pallidum
– Ureaplasma urealyticum
– Vibrio cholerae
– Yersinia pestis
DS AtlasStore Data Loading and Searching
• DS AtlasStore data content comes from three specific sources.
• PDB
– The Protein Data Bank is automatically parsed, and protein structures, ligands, and X-ray data are stored in DS AtlasStore, where they can be queried
• Your data
– You can load your own proprietary protein, ligand, and X-ray data into DS AtlasStore through a loading wizard and then query it
• DS GeneAtlas
– 3D structural annotations and models from Accelrys' DS GeneAtlas high-throughput pipeline can be loaded into DS AtlasStore and queried
DS AtlasStore Data Loading via GUI
Additional textual annotation can also be entered
Data loaded from DS Modeling Visualizer or from file
Automatically parse out the ‘HET’ groups into ligands
DS AtlasStore Data Loading via GUI
The chemistry of the ligand can also be verified
X-ray parameters are captured from the PDB and structure factor files and compared
DS AtlasStore Integrated Searching
Text Search
Bioinformatics (BLAST) Search
Cheminformatics (Accord) Search
DS AtlasStore Integrated Searching
Search by SCOP, Gene Ontology and E.C. (Enzyme Classification) number
DS AtlasStore Query Results
• Query results are displayed in a hit list table
• Links allow the visualization of ligand chemistry or loading of the structure (and electron density) into the DS Modeling Visualizer
Data Visualization – X-ray Structures
Data Visualization – Protein Annotations
SNP Visualization in DS AtlasStore
• Phenylalanine hydroxylase (PAH)
– Phenylketonuria
– Hyperphenylalaninaemia
Mutation type                                   # of mutations
Nucleotide substitutions (missense/nonsense)    235
Nucleotide substitutions (splicing)              45
Small deletions                                  36
Small insertions                                  3
Gross deletions                                   8
Complex rearrangements                            3
TOTAL                                           330
Active Site Annotation
SNP
Present Relationships
• Confirmant: ACCELRYS TO PROVIDE FOUNDATION TECHNOLOGY FOR MANAGEMENT AND FUNCTIONAL ANNOTATION OF CONFIRMANT’S PROPRIETARY HUMAN PROTEOME
– Feb. 11, 2003
– Confirmant is a joint venture between Oxford Glycosciences and Marconi
• They have a product called Protein Atlas
– experimentally derived information using proprietary technology to unambiguously define genes in the human genome
– Confirmant has licensed GeneAtlas & DS AtlasStore
• Link between Protein Atlas & DS AtlasStore will provide experimental validation of GeneAtlas annotations
– Confirmant is processing their data with GeneAtlas
• Will sell resulting content in conjunction with their product offering
Accelrys’ Biological Informatics Enterprise Solutions:
DS AtlasStore
Proteomics database
Protein sequences & alignments, structures and models, ligands, X-ray data, expressed proteins
Text fields include:
– Functional annotation from GeneAtlas
– PDB headers
– Original sequence annotations
– Ligand descriptions
DS SeqStore
Bioinformatics database
Protein and nucleic acid sequences & alignments
Text fields include:
– Sequence description
– Publication titles and authors
– Keywords
– Comments and functional annotations
– Cross-references and alternative names
• Enterprise Data Management Systems:
– Oracle 8i & 9i large databases (200 – 300 GB)
– Read & Write
– Versioning of data and security
User Interface for Text Searching
All searching contained in a powerful QueryBuilder
Original annotation
Novel annotation from GeneAtlas pipeline
Keywords in PDB
Text searches available in DS AtlasStore
Additional Text Search Functionality: Oracle Text
Feature                        Example                                            SeqStore 2.1   DS SeqStore 3.1   DS AtlasStore 2.0
Keyword                        Kinase                                             Y              Y                 Y
Results ranked by text score   Scoring using the Salton Algorithm                 -              -                 Y
Stem                           Phosphorylate, phosphorylation, phosphorylating    -              Y                 Y
Sounds like                    Cinace                                             -              Y                 Y
Order of occurrence            Kinase P58 inhibitor                               -              -                 Y
Fuzzy                          fosforilate                                        -              -                 Y
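As a rough illustration of what fuzzy matching does with the misspelled example term above (fosforilate), here is a minimal Python sketch. It uses the standard library's `difflib` as a stand-in; it is not Oracle Text and makes no claim about how Oracle Text scores or expands terms. The vocabulary list, the function name `fuzzy_lookup` and the 0.6 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical indexed vocabulary (assumption for this sketch).
VOCABULARY = ["phosphorylate", "kinase", "inhibitor", "myoglobin"]


def fuzzy_lookup(term, vocabulary, cutoff=0.6):
    """Return vocabulary words whose similarity ratio to `term` meets the cutoff.

    difflib compares character sequences, so this tolerates misspellings,
    but it knows nothing about stems or pronunciation.
    """
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


if __name__ == "__main__":
    for query in ["fosforilate", "Cinace", "Kinase"]:
        print(query, "->", fuzzy_lookup(query, VOCABULARY))
```

A database text index would typically expand the query term into its near matches in the same spirit before running the search, which is why a misspelled query can still return hits.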
Handling XML Documents Inside the Database: Leveraging XML in Oracle 9i
DS AtlasStore:
• PDB headers XML storage
• All PDB tags are indexed and can be searched: combines the best of XML with SQL for searching
DS SeqStore:
• Will be compatible with the extensible BSML-centric approach implemented in DS Gene
XML Searching of PDB tags
All searching contained in a powerful QueryBuilder
XML tag based searching of PDB header categories
XML Searching of PDB tags
Select PDB_ID where contains('oxygen transport' within CLASSIFICATION)
HEADER OXYGEN TRANSPORT 24-DEC-97 112M
TITLE SPERM WHALE MYOGLOBIN D122N N-PROPYL ISOCYANIDE AT PH 9.0
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: MYOGLOBIN;
COMPND 3 CHAIN: NULL;
COMPND 4 ENGINEERED: SYNTHETIC GENE;
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: PHYSETER CATODON;
SOURCE 3 ORGANISM_COMMON: SPERM WHALE;
REMARK 3
REMARK 3 OTHER REFINEMENT REMARKS: NULL
REMARK 4
REMARK 4 112M COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
<?xml version="1.0"?>
<PDB>
<PDBID>112M</PDBID>
<CLASSIFICATION>OXYGEN TRANSPORT</CLASSIFICATION>
<DATE>24-DEC-97</DATE>
<TITLE> <![CDATA[SPERM WHALE MYOGLOBIN D122N N-PROPYL ISOCYANIDE AT PH 9.0]]> </TITLE>
<KEYWDS> <![CDATA[LIGAND BINDING, OXYGEN STORAGE, OXYGEN BINDING, HEME, OXYGEN TRANSPORT]]> </KEYWDS>
<EXPDTA> <![CDATA[X-RAY DIFFRACTION]]> </EXPDTA>
<AUTHOR> <![CDATA[R.D.SMITH,J.S.OLSON,G.N.PHILLIPS JUNIOR]]> </AUTHOR>
<SOURCE> ORGANISM_COMMON: SPERM WHALE </SOURCE>
<REMARK 3> OTHER REFINEMENT REMARKS: NULL </REMARK 3>
<REMARK 4> 112M COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996 </REMARK 4>
<REMARK 200> EXPERIMENT TYPE: X-RAY DIFFRACTION </REMARK 200>
SQL in RDBMS XPATH in XML
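The idea of restricting a text match to one XML tag, as in the "within CLASSIFICATION" query above, can be sketched in a few lines of Python. This is only an illustration of tag-scoped searching with the standard library's ElementTree; it is not how Oracle indexes or queries the stored headers. The first record is trimmed from the 112M example; the second record, its ID `0XYZ`, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

RECORDS = [
    # Trimmed from the 112M header shown in the slides.
    "<PDB><PDBID>112M</PDBID>"
    "<CLASSIFICATION>OXYGEN TRANSPORT</CLASSIFICATION>"
    "<TITLE><![CDATA[SPERM WHALE MYOGLOBIN D122N N-PROPYL ISOCYANIDE AT PH 9.0]]></TITLE></PDB>",
    # Hypothetical record: the phrase occurs in TITLE but not in CLASSIFICATION.
    "<PDB><PDBID>0XYZ</PDBID>"
    "<CLASSIFICATION>HYDROLASE</CLASSIFICATION>"
    "<TITLE><![CDATA[MODEL OF AN OXYGEN TRANSPORT INHIBITOR COMPLEX]]></TITLE></PDB>",
]


def ids_where_tag_contains(records, phrase, tag):
    """Return PDBID values of records whose `tag` text contains `phrase`."""
    hits = []
    for xml_text in records:
        root = ET.fromstring(xml_text)
        value = root.findtext(tag, default="")  # text of the first matching child
        if phrase.lower() in value.lower():      # case-insensitive phrase match
            hits.append(root.findtext("PDBID"))
    return hits


if __name__ == "__main__":
    print(ids_where_tag_contains(RECORDS, "oxygen transport", "CLASSIFICATION"))
    print(ids_where_tag_contains(RECORDS, "oxygen transport", "TITLE"))
```

Scoping the match to a tag is what separates entries classified as oxygen transport proteins from entries that merely mention the phrase elsewhere in the header.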
Combine Text Searching with Other Search Types
Ligand 2D chemistry
BLAST searching
X-ray metadata
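One simple way to picture combining search types is as an intersection of hit lists keyed on a shared identifier. The sketch below assumes that each search (text, BLAST, ligand chemistry) has already returned a set of entry IDs; the IDs and the function name are hypothetical, and the slides do not describe how DS AtlasStore actually executes combined queries.

```python
def combine_hits(*hit_sets):
    """Return the sorted IDs present in every hit set (logical AND)."""
    if not hit_sets:
        return []
    common = set(hit_sets[0])
    for hits in hit_sets[1:]:
        common &= set(hits)  # keep only IDs also found by this search
    return sorted(common)


if __name__ == "__main__":
    # Hypothetical results from three independent searches.
    text_hits = {"ID001", "ID002", "ID003"}    # e.g. keyword in annotation
    blast_hits = {"ID002", "ID003", "ID004"}   # e.g. sequence similarity
    ligand_hits = {"ID003", "ID005"}           # e.g. 2D substructure match
    print(combine_hits(text_hits, blast_hits, ligand_hits))
```

The approach only works when every search reports hits against the same identifier, which is the common-identifier problem raised earlier for cross-database querying.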
Ontologies in DS AtlasStore
Three ontologies are used in DS AtlasStore to federate and search data:
• Structural Classification of Proteins (SCOP)
– Human curated structural ontology
– Linked to PDB structure chains
– 17406 (~88%) PDB Entries in DS AtlasStore have a SCOP reference
• Gene Ontology (GO)
– Dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing
– Gene centric
– 13556 (~68%) PDB Entries in DS AtlasStore have a Gene Ontology reference through SwissProt.
• Enzyme Classification (EC)
– 9565 (~50%) RCSB entries have an Enzyme Classification reference
Ontology: “A controlled, hierarchical vocabulary for describing a knowledge system.”
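Because an ontology is hierarchical, a search on a parent term can return everything beneath it. EC numbers make this easy to show, since each is a dotted path of up to four levels (for example, 2.7 groups transferases that transfer phosphorus-containing groups, which includes the kinases). The following Python sketch assumes entries have already been mapped to EC numbers; the entry names and the function names are hypothetical, and this is not the DS AtlasStore ontology browser.

```python
# Hypothetical entry-to-EC mapping. The EC numbers themselves are real:
# 2.7.11.1 non-specific serine/threonine protein kinase, 2.7.1.1 hexokinase,
# 3.4.21.4 trypsin.
ENTRY_EC = {
    "entry_a": "2.7.11.1",
    "entry_b": "3.4.21.4",
    "entry_c": "2.7.1.1",
}


def under_ec_node(ec_number, ec_node):
    """True if `ec_number` sits at or below `ec_node` in the EC hierarchy.

    Levels are compared component by component, so node "2.7" does not
    accidentally match a number whose second level is, say, 71.
    """
    number_levels = ec_number.split(".")
    node_levels = ec_node.split(".")
    return number_levels[:len(node_levels)] == node_levels


def entries_under(ec_node, entry_ec):
    """Return the sorted entries annotated at or below the given EC node."""
    return sorted(e for e, ec in entry_ec.items() if under_ec_node(ec, ec_node))


if __name__ == "__main__":
    print(entries_under("2.7", ENTRY_EC))
    print(entries_under("3.4.21", ENTRY_EC))
```

The same prefix-on-a-path idea carries over to SCOP classifications, while Gene Ontology terms can have multiple parents and would need a graph traversal instead.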
Searchable Ontology Browsers
Powerful Query Builder provides access to ontologies
Searchable Ontology Browsers
User is looking for a novel kinase sequence in DS AtlasStore:
Summary
• DS AtlasStore provides access to an integrated proteomics data source that will
– Aid the identification of novel targets
• View functional annotations
– Capture experimental target information in a central location
• Functions as a data ‘Warehouse’
• Share with chemists & biologists
– Use models & structures to suggest further experiments
• DS SeqStore and DS AtlasStore leverage Oracle’s new technologies in text searching and XML:
– More rapid searching
– Greater diversity of textual data mining search types
– Integration with other searches (structural, sequence, experimental data)
– Ontology searches
Acknowledgements
• Development team:
– Steve Potts
– Michael Pu
– Yin Yu
– Dave Dawley
– Yi-shiou Chen
• Management & Marketing:
– Sándor Szalma
– David Edwards
– Mary Donlan
– Dana Haley-Vicente