CSRML – A New Markup Language Definition for Chemical Substructure Representation
Christof H. Schwab
Molecular Networks GmbH
Henkestraße 91, 91052 Erlangen, Germany
www.molecular-networks.com
Outline
Chemical subgraphs
  Representation and use cases
De facto standards
Requirements of new definition of subgraph representation
XML-based substructure representation
ACS Fall 2010 Meeting, Boston, MA, August 22-26, 2010 2
Chemical Subgraphs and Substructures
Well established concept in chemistry and chemoinformatics
  Ray and Kirsch, Finding Chemical Records by Digital Computers. Science, 1957, 126, 814-819
  Fisanick et al. Substructure Searching of Computer-Readable Chemical Abstracts Service Ninth Collective Index Chemical Nomenclature Files. J. Chem. Inf. Comput. Sci. 1975, 15 (2), 73-84
Employed by almost all software packages that deal with sets of chemical structures and reactions
Chemical Substructures – Use Cases
Chemical database queries
  Find structure(s) enclosing the query substructure
  Retrieval of analogs or similar structures
MCSS searches
Fingerprinting
Analysis of chemical structures
Structural alerts
TTC analysis
Highlighting of functional groups
Example: Database Lookup
ChemIDplus
Query: Chlorobenzene
Search mode: Substructure
Example: Database Lookup
25,512 hits
Example: Query with Properties
Find chlorobenzene derivatives which are easily hydrolyzed at standard conditions
[Reaction scheme: chlorobenzene derivative (R, Cl) + OH, giving the corresponding phenol derivative (R, OH)]
Substructure based query will return both
[Structures: chlorobenzene and a nitro-substituted chlorobenzene]
Example: Query with Properties
Nucleophilic aromatic substitution
[Reaction scheme: chlorobenzene derivative (R, Cl) + OH, giving the phenol derivative (R, OH) and Cl]
Chlorobenzene does not react at standard conditions

                         Chlorobenzene     Nitro-substituted chlorobenzene
Reaction conditions      400 °C, 300 bar   room temp.
Resonance stabilization  0 kJ/mol          43.5 kJ/mol
Example: Query with Properties
It is not sufficient to have queries solely based on substructures
[Structures: chlorobenzene and a nitro-substituted chlorobenzene]
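The point of the three "Query with Properties" slides can be sketched in a few lines of Python. This is only an illustration, not part of CSRML: the `Compound` record, the boolean `has_chlorobenzene` flag (a stand-in for a real substructure match) and the 20 kJ/mol threshold are all assumptions made for the example; only the two resonance stabilization values come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    has_chlorobenzene: bool          # stand-in for a real substructure match
    resonance_stabilization: float   # kJ/mol, values taken from the slides


def substructure_hits(compounds):
    """Pure substructure query: every compound containing the pattern."""
    return [c for c in compounds if c.has_chlorobenzene]


def property_hits(compounds, min_stabilization):
    """Substructure query combined with a property constraint."""
    return [
        c for c in substructure_hits(compounds)
        if c.resonance_stabilization >= min_stabilization
    ]


compounds = [
    Compound("chlorobenzene", True, 0.0),
    Compound("nitro-substituted chlorobenzene", True, 43.5),
]

# The substructure query alone cannot tell the two compounds apart.
all_hits = [c.name for c in substructure_hits(compounds)]

# The threshold of 20 kJ/mol is an arbitrary, hypothetical cut-off.
reactive = [c.name for c in property_hits(compounds, min_stabilization=20.0)]
```

Here `all_hits` contains both compounds, while `reactive` keeps only the one whose annotated property satisfies the constraint.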
Existing de facto Standards
SMARTS
  Substructure specification by text line notation
  Definition of complex substructure patterns including logical operations, recursion, etc.
MDL CTab Query
  CTab file based query definition
  Potentially extensible using SD properties in a non-standard way
SYBYL line notation (SLN)
  Substructure specification by text line notation
  Support of property annotations, macros, R-groups, etc.
Limitations of Existing Standards
No provision of built-in extension mechanisms
  No support of standardized property annotation (except SLN)
  No support of "inline" test cases
  Limited set of properties for annotation
No built-in support for comments, documentation of queries, etc.
Limitations of Existing Standards
No mechanisms to validate queries prior to execution
  Errors (both syntax and semantic ones) first seen when executing the query
Difficult and error-prone to input
Proprietary formats
  Very few free/open source libraries and GUI tools
⇒ Need for a new definition or standard?
Requirements for New Substructure Representation Definition
Well defined representation of (sub)structures
  Unambiguous interpretation
  Clear document structure
Support of (any) property annotation, query logic, etc.
  E.g., physicochemical properties, toxicity alerts, etc.
Support of comments, documentation, etc.
Requirements for New Substructure Representation Definition
Built-in validation
  Mechanisms to validate the syntax of queries
  Test cases to validate the semantics of queries
Conversion of queries into existing standards
Built-in support for future extensions
Non-proprietary, open format
⇒ XML (?)
Advantages of XML-based (Sub)Structure Representation
Structured representation of structures, test cases, etc.
Native support of
  Syntax validation
  Comments and documentation (extensible)
Easy to
  Transfer/exchange
  Integrate into other XML-based languages
  Extend and modify
XML is an open standard as well
  Large number of software tools available to work with it
XML-based (Sub)Structure Representation
Chemical Subgraph Representation Markup Language
CSRML
CSRML Object Model
[Diagram, reconstructed as a tree]
CSRML document
  Subgraph (drawn as several stacked boxes)
    (Sub)Structure
      Atoms, with Annotations
      Bonds, with Annotations
      E-Systems, with Annotations
    MustMatch
      Structure (drawn as several stacked boxes)
    MustNotMatch
      Structure (drawn as several stacked boxes)
[Legend: elements are marked as mandatory or optional]
Model of Single Subgraph Representation
Target (sub)structure, molecule, or a disconnected graph which represents the query
  Connectivity (atoms, bonds, e⁻-systems)
  Annotated query features and other properties
  Logical constructs
Test structure(s) that MUST match the target
Test structure(s) that MUST NOT match the target
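The idea of shipping MUST match and MUST NOT match test structures together with the query can be sketched as a small self-test harness. Everything below is a hypothetical illustration: the `Subgraph` class, its field names and the toy string matcher are assumptions made for this sketch and are not taken from the CSRML definition or its reference implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subgraph:
    """Hypothetical in-memory view of one subgraph with its inline test cases."""
    matcher: Callable[[str], bool]            # stand-in for a real query engine
    must_match: List[str] = field(default_factory=list)
    must_not_match: List[str] = field(default_factory=list)


def run_inline_tests(subgraph: Subgraph) -> List[str]:
    """Return human-readable failures; an empty list means the query passes."""
    failures = []
    for structure in subgraph.must_match:
        if not subgraph.matcher(structure):
            failures.append(f"expected match, got none: {structure}")
    for structure in subgraph.must_not_match:
        if subgraph.matcher(structure):
            failures.append(f"unexpected match: {structure}")
    return failures


# Toy matcher: a plain text test on SMILES-like strings, not real chemistry.
contains_cl = Subgraph(
    matcher=lambda s: "Cl" in s,
    must_match=["Clc1ccccc1"],
    must_not_match=["c1ccccc1"],
)
failures = run_inline_tests(contains_cl)
```

Keeping the test structures next to the query means a semantic mistake in the query definition can be detected before the query is run against a database.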
XML Grammar Definition for CSRML
Enables easy validation of query definitions
  XML documents have to be well-formed
  XML documents can be validated against data model (DTD or XSD)
Additional checks to validate the query prior to processing
  Referential integrity checks
  Unique or distinct constraints
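A minimal sketch of such pre-processing checks, using only the Python standard library. It covers well-formedness, unique `id` values and referential integrity; validation against a DTD or XSD is not shown because the standard library does not provide it. The `bond` element and its `atomRefs` attribute are hypothetical names chosen for this example and may differ from the actual CSRML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_query(xml_text):
    """Return a list of problems found before the query is processed."""
    try:
        root = ET.fromstring(xml_text)        # fails if not well-formed
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []

    # Uniqueness constraint: every id attribute must be distinct.
    ids = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    problems += [f"duplicate id: {i}" for i, n in ids.items() if n > 1]

    # Referential integrity: each bond must point at existing atom ids.
    # 'atomRefs' is a hypothetical attribute name used only in this sketch.
    atom_ids = {atom.get("id") for atom in root.iter("atom")}
    for bond in root.iter("bond"):
        for ref in bond.get("atomRefs", "").split():
            if ref not in atom_ids:
                problems.append(
                    f"bond {bond.get('id')} references unknown atom {ref}"
                )
    return problems


good = '<mol id="M1"><atom id="A1"/><atom id="A2"/><bond id="B1" atomRefs="A1 A2"/></mol>'
bad = '<mol id="M1"><atom id="A1"/><atom id="A1"/><bond id="B1" atomRefs="A1 A3"/></mol>'
```

Calling `check_query(good)` returns an empty list, while `check_query(bad)` reports the duplicated atom id and the dangling bond reference.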
Query with Properties – Example
<mol id="M1">
  <atomArray>
    <atom id="A1" element="N/A" x="0" y="0">
      <query feature="atomList">
        <value>N</value>
        <value>O</value>
      </query>
      <query feature="piCharge" logic="AND">
        <range>
          <min>-0.6</min>
          <max>-0.1</max>
        </range>
        …
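How an implementing library might evaluate these query features can be sketched as follows. The slide truncates the example, so the closing tags after `</range>` are an assumption added only to make the fragment parseable. The interpretation is also assumed: the atom list is read as "element must be one of the listed values", the range as inclusive, and all features are combined with AND. The function `atom_matches` is a hypothetical helper, not part of any CSRML library.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide; the closing tags after </range> are assumed,
# because the slide truncates the example at that point.
FRAGMENT = """
<mol id="M1">
  <atomArray>
    <atom id="A1" element="N/A" x="0" y="0">
      <query feature="atomList">
        <value>N</value>
        <value>O</value>
      </query>
      <query feature="piCharge" logic="AND">
        <range><min>-0.6</min><max>-0.1</max></range>
      </query>
    </atom>
  </atomArray>
</mol>
"""


def atom_matches(atom_el, element, pi_charge):
    """Check a candidate atom against every query feature (AND semantics only)."""
    for query in atom_el.findall("query"):
        feature = query.get("feature")
        if feature == "atomList":
            allowed = {value.text for value in query.findall("value")}
            if element not in allowed:
                return False
        elif feature == "piCharge":
            low = float(query.findtext("range/min"))
            high = float(query.findtext("range/max"))
            if not (low <= pi_charge <= high):   # inclusive bounds are assumed
                return False
    return True


query_atom = ET.fromstring(FRAGMENT).find("atomArray/atom")
```

With this sketch, a nitrogen atom with a pi charge of -0.3 satisfies the query, while a carbon atom, or an oxygen atom with a pi charge of 0.2, does not.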
Query with Properties
Easily accessible annotated query features
Easy nesting and logical combination of query features
Automatic validation of query syntax by XML parser
  Based on XML schema (grammar)
  Partial validation of query semantics
No chemical validation at this step!
⇒ The more checking is done by the XML parser, the less checking has to be done by the implementing library!
Annotation Model
Example of definition for a CSRML annotation
<annotation domain="bond" featureKey="ringBond"
            dataType="xsd:boolean" implementation="M_RING_BOND_IMPL"
            priority="2" severity="skip">
  <label>Ring bond</label>
  <description>
    Bond in ring system; bond order disregarded.
  </description>
  …
</annotation>
Storage of CSRML Queries
Storage as XML documents
  Single or multiple queries in a single document
Storage in (XML) databases
  Substructure searches in chemistry-aware XML databases
Integration into other XML-based formats, e.g., ToxML
Exchange of CSRML Queries
Transfer between different applications as XML documents
  Regular files
  Internet (SOAP, HTTP, …)
Conversion into existing formats
  Omission or separate export of unsupported features
  Transformation into query depictions (SVG)
Conversion from existing formats
  E.g., from SMARTS to CSRML
Current Status
First draft of CSRML definition
  Data model & schema design
  Design of annotation model
  Default set of query features
Beta version of reference implementation
  LGPL or similarly licensed library to support I/O and class structures for query documents, query objects and features
Next Steps
Development of graphical input tool
  Chemical Subgraph Editor, CSE
Publishing everything on the Web
  Announced via Newsletter / RSS feed
CSRML – Summary
Universal and extensible platform for specifying advanced substructure queries
  Connectivity (atoms, bonds, e⁻-systems)
  Annotated query features and other properties
  Logical constructs
Open standard for easy exchange of substructure queries between different applications and databases
Encourage developers to use and distribute CSRML
Acknowledgements
Molecular Networks (co-authors)
  Bruno Bienfait, Johann Gasteiger, Thomas Kleinöder, Joerg Marucszyk, Oliver Sacher, Aleksey Tarkhov, Lothar Terfloth
Chihae Yang (co-author)
  Discussions about the chemical subgraph definition
US FDA CFSAN
  Kirk Arvidson (Contract for development of the Chemical Subgraph Editor)
Thank You!
www.molecular-networks.com