CSRML – A New Markup Language Definition for Chemical Substructure Representation

Christof H. Schwab
Molecular Networks GmbH
Henkestraße 91, 91052 Erlangen, Germany
www.molecular-networks.com

bulletin.acscinf.org/pdfs/240nm77.pdf

Outline

Chemical subgraphs
Representation and use cases

De facto standards
Requirements of new definition of subgraph representation
XML-based substructure representation

ACS Fall 2010 Meeting, Boston, MA, August 22-26, 2010

Chemical Subgraphs and Substructures

Well established concept in chemistry and chemoinformatics

Ray and Kirsch, Finding Chemical Records by Digital Computers. Science, 1957, 126, 814-819
Fisanick et al. Substructure Searching of Computer-Readable Chemical Abstracts Service Ninth Collective Index Chemical Nomenclature Files. J. Chem. Inf. Comput. Sci. 1975, 15 (2), 73-84

Employed by almost all software packages that deal with sets of chemical structures and reactions

Chemical Substructures – Use Cases

Chemical database queries
Find structure(s) enclosing the query substructure
Retrieval of analogs or similar structures

MCSS searches
Fingerprinting
Analysis of chemical structures

Structural alerts
TTC analysis

Highlighting of functional groups

Example: Database Lookup

ChemIDplus
  Query: Chlorobenzene
  Search mode: Substructure

Example: Database Lookup

25,512 hits

Example: Query with Properties

Find chlorobenzene derivatives which are easily hydrolyzed at standard conditions

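[Figure: reaction scheme, substituted chlorobenzene (R, Cl) + OH giving the substituted phenol (R, OH)]

Substructure-based query will return both

[Figure: structures of chlorobenzene (Cl) and a nitro-substituted chlorobenzene (Cl, NO2)]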

Example: Query with Properties

Nucleophilic aromatic substitution

[Figure: reaction scheme, substituted chlorobenzene (R, Cl) + OH giving the substituted phenol (R, OH), - Cl]

Chlorobenzene does not react at standard conditions

                          Chlorobenzene      NO2-substituted chlorobenzene
Reaction conditions       400 °C, 300 bar    room temp.
Resonance stabilization   0 kJ/mol           43.5 kJ/mol

Example: Query with Properties

It is not sufficient to have queries solely based on substructures

[Figure: structures of chlorobenzene (Cl) and a nitro-substituted chlorobenzene (Cl, NO2)]
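As a rough illustration of why a structural match alone is not enough, the sketch below filters two substructure hits by a property value. The `Candidate` class, the function names, and the 20 kJ/mol threshold are assumptions made for this example; only the two resonance stabilization values (0 and 43.5 kJ/mol) are taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Outcome of a substructure match against the chlorobenzene core
    # (assumed to have been computed elsewhere).
    has_chlorobenzene_core: bool
    # Property value quoted on the slides, in kJ/mol.
    resonance_stabilization: float


def substructure_hits(candidates):
    """Substructure-only query: every compound containing the core is a hit."""
    return [c.name for c in candidates if c.has_chlorobenzene_core]


def property_hits(candidates, min_stabilization):
    """Substructure query combined with a property constraint."""
    return [
        c.name
        for c in candidates
        if c.has_chlorobenzene_core
        and c.resonance_stabilization >= min_stabilization
    ]


candidates = [
    Candidate("chlorobenzene", True, 0.0),
    Candidate("nitro-substituted chlorobenzene", True, 43.5),
]

# Hypothetical threshold, chosen only to separate the two examples.
THRESHOLD_KJ_MOL = 20.0

print(substructure_hits(candidates))
print(property_hits(candidates, THRESHOLD_KJ_MOL))
```

The first call returns both compounds, the second only the nitro-substituted one, which is the distinction a purely structural query cannot express.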

Existing de facto Standards

SMARTS
  Substructure specification by text line notation
  Definition of complex substructure patterns including logical operations, recursion, etc.

MDL CTab Query
  CTab file based query definition
  Potentially extendible using SD properties in a non-standard way

SYBYL line notation (SLN)
  Substructure specification by text line notation
  Support of property annotations, macros, R-groups, etc.

Limitations of Existing Standards

No provision of built-in extension mechanisms
No support of standardized property annotation (except SLN)
No support of "inline" test cases
Limited set of properties for annotation

No built-in support for comments, documentation of queries, etc.

Limitations of Existing Standards

No mechanisms to validate queries prior to execution
  Errors – both syntax and semantic ones – first seen when executing the query

Difficult and error-prone to input

Proprietary formats
  Very few free/open source libraries and GUI tools

⇒ Need for a new definition or standard?

Requirements for New Substructure Representation Definition

Well defined representation of (sub)structures
  Unambiguous interpretation
  Clear document structure

Support of (any) property annotation, query logic, etc.
  E.g., physicochemical properties, toxicity alerts, etc.

Support of comments, documentation, etc

Requirements for New Substructure Representation Definition

Built-in validation
  Mechanisms to validate the syntax of queries
  Test cases to validate the semantics of queries

Conversion of queries into existing standards
Built-in support for future extensions
Non-proprietary, open format

⇒ XML (?)

Advantages of XML-based (Sub)Structure Representation

Structured representation of structures, test cases, etc.
Native support of
  Syntax validation
  Comments and documentation (extensible)

Easy to
  Transfer/exchange
  Integrate into other XML-based languages
  Extend and modify

XML is an open standard as well
  Large amount of software available to work with

XML-based (Sub)Structure Representation

Chemical Subgraph Representation Markup Language

CSRML

CSRML Object Model

[Figure: object model diagram. A CSRML document contains Subgraphs. Boxes shown for each Subgraph: (Sub)Structure with Atoms, Bonds, and E-Systems, each carrying Annotations; MustMatch and MustNotMatch, each with Structures. Legend: mandatory, optional.]

Model of Single Subgraph Representation

Target (sub)structure, molecule, or a disconnected graph which represents the query

Connectivity (atoms, bonds, e⁻-systems)
Annotated query features and other properties
Logical constructs

Test structure(s) that MUST match the target
Test structure(s) that MUST NOT match the target
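The pairing of a target with positive and negative test structures could be modeled along the following lines. This is a minimal sketch, not the CSRML reference implementation: the `Subgraph` class, `failed_test_cases`, and the use of SMILES-like strings are assumptions, and `toy_matcher` is a plain substring test standing in for a real chemical substructure matcher.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Subgraph:
    target: str                                   # the query itself
    must_match: List[str] = field(default_factory=list)
    must_not_match: List[str] = field(default_factory=list)


def failed_test_cases(
    subgraph: Subgraph, matches: Callable[[str, str], bool]
) -> List[Tuple[str, str]]:
    """Return every inline test case that contradicts the query."""
    failures = []
    for structure in subgraph.must_match:
        if not matches(subgraph.target, structure):
            failures.append(("MustMatch", structure))
    for structure in subgraph.must_not_match:
        if matches(subgraph.target, structure):
            failures.append(("MustNotMatch", structure))
    return failures


def toy_matcher(target: str, structure: str) -> bool:
    # Stand-in only: substring search is NOT chemical substructure matching.
    return target in structure


query = Subgraph(
    target="Cl",
    must_match=["c1ccccc1Cl"],       # chlorobenzene
    must_not_match=["c1ccccc1O"],    # phenol
)
print(failed_test_cases(query, toy_matcher))
```

Passing the matcher as a parameter keeps the test-case logic independent of whichever chemistry toolkit performs the actual matching.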

XML Grammar Definition for CSRML

Enables easy validation of query definitions
  XML documents have to be well-formed
  XML documents can be validated against data model (DTD or XSD)

Additional checks to validate the query prior to processing
  Referential integrity checks
  Unique or distinct constraints
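The checks listed above might look roughly like this with Python's standard library. The `bondArray`, `bond`, and `atomRefs` names are assumptions for the example (the slides only show `mol`, `atomArray`, and `atom`), and validation against a DTD or XSD is not shown because `xml.etree.ElementTree` does not provide it.

```python
import xml.etree.ElementTree as ET


def check_query(xml_text):
    """Return a list of problems found before the query is processed."""
    # 1. Well-formedness: the parser rejects malformed documents.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    errors = []

    # 2. Uniqueness constraint: atom ids must be distinct.
    seen = set()
    for atom in root.iter("atom"):
        atom_id = atom.get("id")
        if atom_id in seen:
            errors.append(f"duplicate atom id {atom_id}")
        seen.add(atom_id)

    # 3. Referential integrity: bonds may only refer to declared atoms.
    for bond in root.iter("bond"):
        for ref in bond.get("atomRefs", "").split():
            if ref not in seen:
                errors.append(
                    f"bond {bond.get('id')} references unknown atom {ref}"
                )
    return errors


GOOD_QUERY = (
    '<mol id="M1"><atomArray><atom id="A1"/><atom id="A2"/></atomArray>'
    '<bondArray><bond id="B1" atomRefs="A1 A2"/></bondArray></mol>'
)
print(check_query(GOOD_QUERY))
```

In a schema-aware setup, uniqueness and key references can also be declared in the XSD itself, which moves these checks from the library into the parser.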

Query with Properties – Example

<mol id="M1">
  <atomArray>
    <atom id="A1" element="N/A" x="0" y="0">
      <query feature="atomList">
        <value>N</value>
        <value>O</value>
      </query>
      <query feature="piCharge" logic="AND">
        <range>
          <min>-0.6</min>
          <max>-0.1</max>
        </range>
        …
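One way to read the fragment above is as a predicate on a candidate atom: its element must be in the atom list and its π charge must fall inside the range. The sketch below evaluates exactly that. The closing tags in `ATOM_QUERY` are added here so the fragment parses, and `atom_matches` together with its treatment of every `query` element as an AND condition are assumptions, not the documented CSRML semantics.

```python
import xml.etree.ElementTree as ET

# The atom element from the slide, closed so that it is well-formed.
ATOM_QUERY = """
<atom id="A1" element="N/A" x="0" y="0">
  <query feature="atomList">
    <value>N</value>
    <value>O</value>
  </query>
  <query feature="piCharge" logic="AND">
    <range><min>-0.6</min><max>-0.1</max></range>
  </query>
</atom>
"""


def atom_matches(atom_xml, element, pi_charge):
    """Check a candidate atom against all query features (combined with AND)."""
    atom = ET.fromstring(atom_xml)
    for query in atom.findall("query"):
        feature = query.get("feature")
        if feature == "atomList":
            allowed = {value.text for value in query.findall("value")}
            if element not in allowed:
                return False
        elif feature == "piCharge":
            low = float(query.findtext("range/min"))
            high = float(query.findtext("range/max"))
            if not low <= pi_charge <= high:
                return False
        else:
            raise ValueError(f"unsupported feature: {feature}")
    return True


print(atom_matches(ATOM_QUERY, "N", -0.3))
print(atom_matches(ATOM_QUERY, "C", -0.3))
```

Note that the range bounds only parse as numbers when written with an ASCII hyphen-minus; a typographic dash in `<min>` or `<max>` would make `float()` fail.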

Query with Properties

Easily accessible annotated query features

Easy nesting and logical combination of query features

Automatic validation of query syntax by XML parser
  Based on XML schema (grammar)
  Partial validation of query semantics

No chemical validation at this step!

⇒ The more checking is done by the XML parser, the less checking has to be done by the implementing library!

Annotation Model

Example of definition for a CSRML annotation

<annotation domain="bond" featureKey="ringBond"
            dataType="xsd:boolean" implementation="M_RING_BOND_IMPL"
            priority="2" severity="skip">
  <label>Ring bond</label>
  <description>
    Bond in ring system; bond order disregarded.
  </description>
  …
</annotation>
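A library consuming such definitions would probably load them into a lookup table keyed by domain and feature key. The following is a sketch under that assumption; `AnnotationDef`, `parse_annotation`, and the `registry` layout are hypothetical, and the elided part of the definition (the `…`) is simply left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationDef:
    domain: str
    feature_key: str
    data_type: str
    implementation: str
    priority: int
    severity: str
    label: str
    description: str


def parse_annotation(xml_text):
    """Turn one <annotation> element into an AnnotationDef."""
    node = ET.fromstring(xml_text)
    return AnnotationDef(
        domain=node.get("domain"),
        feature_key=node.get("featureKey"),
        data_type=node.get("dataType"),
        implementation=node.get("implementation"),
        priority=int(node.get("priority")),
        severity=node.get("severity"),
        label=node.findtext("label", default="").strip(),
        description=node.findtext("description", default="").strip(),
    )


RING_BOND = """
<annotation domain="bond" featureKey="ringBond"
            dataType="xsd:boolean" implementation="M_RING_BOND_IMPL"
            priority="2" severity="skip">
  <label>Ring bond</label>
  <description>
    Bond in ring system; bond order disregarded.
  </description>
</annotation>
"""

# Registry keyed by (domain, featureKey), e.g. ("bond", "ringBond").
registry = {}
for definition in [parse_annotation(RING_BOND)]:
    registry[(definition.domain, definition.feature_key)] = definition

print(registry[("bond", "ringBond")].label)
```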

Storage of CSRML Queries

Storage as XML documents
  Single or multiple queries in a single document

Storage in (XML) databases
  Substructure searches in chemistry-aware XML databases

Integration into other XML-based formats, e.g., ToxML

Exchange of CSRML Queries

Transfer between different applications as XML documents
  Regular files
  Internet (SOAP, HTTP, …)

Conversion into existing formats
  Omission or separate export of not-supported features
  Transformation into query depictions (SVG)

Conversion from existing formats
  E.g., from SMARTS to CSRML
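Conversion into an existing format with separate export of unsupported features could be sketched as follows for a single query atom. This is an illustration only: `atom_to_smarts` and the small `ATOMIC_NUMBERS` table are assumptions, the atom list is mapped to SMARTS atomic-number primitives (`#7`, `#8`) so that aromaticity is left open, and any feature without a SMARTS counterpart in this sketch is reported rather than silently dropped.

```python
import xml.etree.ElementTree as ET

# Small lookup table, sufficient for this example only.
ATOMIC_NUMBERS = {"C": 6, "N": 7, "O": 8, "S": 16, "Cl": 17}

ATOM_QUERY = """
<atom id="A1" element="N/A" x="0" y="0">
  <query feature="atomList">
    <value>N</value>
    <value>O</value>
  </query>
  <query feature="piCharge" logic="AND">
    <range><min>-0.6</min><max>-0.1</max></range>
  </query>
</atom>
"""


def atom_to_smarts(atom_xml):
    """Return (SMARTS atom expression, list of features that were not converted)."""
    atom = ET.fromstring(atom_xml)
    smarts = "*"            # SMARTS wildcard: any atom
    unsupported = []
    for query in atom.findall("query"):
        feature = query.get("feature")
        if feature == "atomList":
            numbers = [ATOMIC_NUMBERS[v.text] for v in query.findall("value")]
            smarts = "[" + ",".join(f"#{n}" for n in numbers) + "]"
        else:
            # Exported separately instead of being lost in the conversion.
            unsupported.append(feature)
    return smarts, unsupported


print(atom_to_smarts(ATOM_QUERY))
```

For the atom above this yields `[#7,#8]` together with `piCharge` as the feature that needs separate handling.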

Current Status

First draft of CSRML definition
  Data model & schema design
  Design of annotation model
  Default set of query features

Beta version of reference implementation
  LGPL or similarly licensed library to support I/O and class structures for query documents, query objects and features

Next Steps

Development of graphical input tool
  Chemical Subgraph Editor, CSE

Publishing everything on the Web
  Announced via Newsletter / RSS feed

CSRML – Summary

Universal and extensible platform for specifying advanced substructure queries

Connectivity (atoms, bonds, e⁻-systems)
Annotated query features and other properties
Logical constructs

Open standard for easy exchange of substructure queries between different applications and databases

Encourage developers to use and distribute CSRML

Acknowledgements

Molecular Networks (co-authors)
  Bruno Bienfait, Johann Gasteiger, Thomas Kleinöder, Joerg Marucszyk, Oliver Sacher, Aleksey Tarkhov, Lothar Terfloth

Chihae Yang (co-author)
  Discussions about the chemical subgraph definition

US FDA CFSAN
  Kirk Arvidson
  (Contract for development of the Chemical Subgraph Editor)

Thank You!

www.molecular-networks.com
