Building support for the semantic web for chemistry at the Royal Society of Chemistry

Presented by Karen Karapetyan, Colin Batchelor, Jonathan Steele, David Sharpe, Valery Tkachenko, Antony Williams

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015


DESCRIPTION

The Royal Society of Chemistry provides a variety of databases and services covering multiple domains of chemistry. These include our electronic publishing platform, ChemSpider and its related databases, the National Chemistry Database, and digital access to the RSC archive, which spans over 170 years. To support the rising tide of semantic web technologies, we are now working on exposing our data in a way that conforms to the linked data paradigm. This presentation provides an overview of our work to introduce semantic structure to all RSC electronic resources and outlines ways to access this information using standard formats and various APIs.

TRANSCRIPT

Page 1: Building support for the semantic web for chemistry at the Royal Society of Chemistry

Presented by Karen Karapetyan, Colin Batchelor, Jonathan Steele, David Sharpe, Valery Tkachenko, Antony Williams

ACS Indianapolis September 2013


Page 2:
Page 3:

http://www.openphacts.org

Open PHACTS is an Innovative Medicines Initiative (IMI) project aiming to reduce the barriers to drug discovery in industry, academia, and small businesses.

The semantic web is one of its cornerstones.

Page 4:

RDF Export

Data: ChEMBL, HMDB, DrugBank

Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com

• Validation
• Standardization
• Parent generation
• Run on Hadoop-based farm

Page 5:

CVSP: chemical validation

Free chemistry validation platform that performs:

• Structure validation
  • Atoms
  • Bonds
  • Valence
  • Stereo
  • If aromatic, check that the structure can be uniquely dearomatized
  • Strongest acid not ionized first in a partially ionized system
• Cross-matching of SDF fields
  • Synonyms
  • InChIs
  • SMILES
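As a rough illustration of what an atom and valence check might look like, here is a minimal sketch. It is not CVSP's implementation: the toy connection table, the `MAX_VALENCE` lookup, and the `validate` function are all assumptions made for this example, and formal charges and implicit hydrogens are ignored.

```python
# Hypothetical, deliberately tiny valence table (neutral atoms only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}

def validate(atoms, bonds):
    """Return a list of issues for a toy connection table.

    atoms: list of element symbols, indexed from 0
    bonds: list of (atom_a, atom_b, bond_order) tuples
    """
    issues = []
    used = [0] * len(atoms)
    for a, b, order in bonds:
        if a == b:
            issues.append(f"bond connects atom {a} to itself")
            continue
        used[a] += order
        used[b] += order
    for idx, symbol in enumerate(atoms):
        if symbol not in MAX_VALENCE:
            issues.append(f"atom {idx}: unrecognized symbol {symbol!r}")
        elif used[idx] > MAX_VALENCE[symbol]:
            issues.append(
                f"atom {idx} ({symbol}): valence {used[idx]} exceeds {MAX_VALENCE[symbol]}"
            )
    return issues

methanol = (["C", "O"], [(0, 1, 1)])              # heavy atoms only
broken = (["C", "O", "X"], [(0, 1, 3), (1, 2, 1)])  # overloaded O, unknown atom X

for name, (atoms, bonds) in {"methanol": methanol, "broken": broken}.items():
    print(name, validate(atoms, bonds))
```

A real validator would also need charge-aware valences, aromaticity handling, and stereo checks, which this sketch leaves out.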

Page 6:

Input formats supported: CDX, MOL, SDF, Zip, Gz, tab-delimited text files

Page 7:

CVSP: standardization modules

• Custom processing lets the user assemble a workflow from a list of predefined standardization modules
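The idea of assembling a workflow from a list of predefined modules can be sketched as simple function composition. Everything below is hypothetical: the module names, the record layout, and `build_workflow` are placeholders for illustration and do not reflect CVSP's actual module list or interface.

```python
from functools import reduce

# Placeholder modules: each takes a record (a dict) and returns a new record.
def strip_whitespace(record):
    return {**record, "name": record["name"].strip()}

def uppercase_id(record):
    return {**record, "id": record["id"].upper()}

def drop_empty_synonyms(record):
    return {**record, "synonyms": [s for s in record["synonyms"] if s]}

# The "predefined list" the user picks from.
MODULES = {
    "strip_whitespace": strip_whitespace,
    "uppercase_id": uppercase_id,
    "drop_empty_synonyms": drop_empty_synonyms,
}

def build_workflow(module_names):
    """Return a function that applies the chosen modules in the given order."""
    steps = [MODULES[name] for name in module_names]  # KeyError for unknown names
    def run(record):
        return reduce(lambda rec, step: step(rec), steps, record)
    return run

workflow = build_workflow(["strip_whitespace", "drop_empty_synonyms"])
result = workflow({"id": "rec-1", "name": " example compound ", "synonyms": ["alias", ""]})
print(result)
```

Keeping each module a pure function of the record makes the order of steps explicit and lets the same module appear in several workflows.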

Page 8:
Page 9:

Data set examples

• ChemSpider (passed 100K records)
  • All records are planned to pass through CVSP
• DrugBank (~6.5K records)
• ChEMBL (~1.2 million records)

Page 10:

ChemSpider issues

Page 11:

DrugBank dataset (6516 records)

~60 records that can’t be dearomatized unambiguously

DB04283 DB04462

Page 12:

~30 records with bonds that do not make sense

DB04283

DB04009

Page 13:

2 records where SMILES, InChI, and name did not match the structure

DB00611 DB01547

Page 14:

~40 records where InChIs did not match the structure

DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DrugBank ID: DB00614

Page 15:

7 records with 2 stereo bonds at chiral atoms

DB08128

DB06287

J. Brecher, IUPAC, Graphical representation of stereochemical configuration, Section ST-1.1.10

Page 16:

CVSP validation of ChEMBL 16 (~1.3 million records)

• Overall, 0.7% of records had validation issues
• Stereo problems (~82%)
  • Directions of bonds do not make sense (~63%)
  • Ambiguous stereo: 2 stereo bonds at chiral center (~19%)
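To get a feel for the absolute numbers behind these percentages, here is a back-of-the-envelope calculation. It assumes the stereo percentages are shares of the flagged records rather than of the whole dataset (63% and 19% add up to the 82% headline figure), and it takes "~1.3 million" at face value, so the results are rough estimates only.

```python
# Integer arithmetic keeps the estimates free of floating-point noise.
total_records = 1_300_000                   # "~1.3 million" from the slide

flagged = total_records * 7 // 1000         # 0.7% with validation issues
stereo = flagged * 82 // 100                # ~82% of flagged: stereo problems
bad_direction = flagged * 63 // 100         # ~63% of flagged: bond direction
ambiguous = flagged * 19 // 100             # ~19% of flagged: ambiguous stereo

print(f"flagged records:       ~{flagged}")
print(f"stereo problems:       ~{stereo}")
print(f"bad bond direction:    ~{bad_direction}")
print(f"ambiguous stereo:      ~{ambiguous}")
```

Under these assumptions, roughly 9,100 records are flagged, of which about 7,500 have stereo problems.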

Page 17:

“Direction of bond makes no sense” – 63%

Page 18:

“Stereo types of the opposite bonds mismatch” – 15%

http://www.iupac.org/publications/pac/2006/pdf/7810x1897.pdf

Page 19:

“Stereo types of non-opposite bonds match” – 2%

Page 20:

“Atom not recognized” – 3% (isotopes)

In the molfile:

• The atom should be an atom from the periodic table
• No mass difference in the atom line
• No “M ISO” in the connection table
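A check of this kind can be sketched against the fixed-column V2000 molfile layout. This is an illustration only: the sample record, the non-element label "D", the truncated `ELEMENTS` set, and both helper functions are assumptions for the example, not taken from the ChEMBL data or from CVSP.

```python
# Truncated on purpose; a real check would use the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Na"}

def atom_line(symbol, mass_diff=0):
    """Build a V2000 atom line: x, y, z (10 chars each), space, symbol (3), mass diff (2)."""
    return f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3}{mass_diff:2d}  0"

def check_isotope_labels(molblock):
    """Report atoms whose symbol is not a known element.

    Each result is (atom number, symbol, mass difference, block has an 'M  ISO' line).
    """
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])                  # counts line is the fourth line
    has_iso_property = any(line.startswith("M  ISO") for line in lines)
    issues = []
    for number, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()
        mass_diff = int(line[34:36])
        if symbol not in ELEMENTS:
            issues.append((number, symbol, mass_diff, has_iso_property))
    return issues

sample = "\n".join([
    "", "  sketch", "",                            # three header lines
    "  2  1  0  0  0  0  0  0  0  0999 V2000",     # counts line: 2 atoms, 1 bond
    atom_line("C"),
    atom_line("D"),                                # hypothetical non-element label
    "  1  2  1  0",
    "M  END",
])

print(check_isotope_labels(sample))
```

In this sample the second atom is flagged: its label is not an element, the atom line carries no mass difference, and the block has no isotope property line.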

Page 21:

CVSP: standardization

• Standardization workflow was developed for Open PHACTS’s registration system
• Workflow includes modules like
  • SMIRKS rules derived from FDA SRS manual
  • Resetting symmetric stereo
  • Dearomatize
  • Layout
  • Fix “fixable” stereo issues
  • Disconnect all metals from N, O, F
  • Fold non-stereo hydrogens
  • Handle partial ionization of acid-base
  • etc.

Page 22:

Open PHACTS chemical registry system: what do we use as chemical identity?

• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)

Drawbacks

• SMILES: many flavors
• Standard InChI
  • does not include unknown/undefined stereo unless at least one defined stereo is present
  • does not distinguish between undefined and unknown stereo (always “?”)
  • does some basic tautomer canonicalization, which we wanted to prevent in order to distinguish between all tautomers (sometimes useful for linking spectral data to a specific tautomer)
  • assumes absolute stereo or no stereo at all

Path we took: non-standard InChI with options SUU SLUUD FixedH SUCF

• Always include unknown/undefined stereo (‘u’, ‘?’)
• Add Fixed H layer (to distinguish between tautomers)
• Use the chiral flag in the MOL/SD record (ON: absolute stereo, OFF: relative)
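One way these options might be passed to the InChI command-line program is sketched below. The executable name `inchi-1`, the file names, and the `inchi_command` helper are assumptions about a local setup; only the four option names come from the slide. The sketch builds the argument list and does not run anything.

```python
import sys

# Option names as listed on the slide.
INCHI_OPTIONS = ["SUU", "SLUUD", "FixedH", "SUCF"]

def inchi_command(molfile, outfile, executable="inchi-1"):
    """Build an argument list for the InChI command-line program.

    Options are prefixed with '/' on Windows and '-' elsewhere.
    """
    prefix = "/" if sys.platform.startswith("win") else "-"
    return [executable, molfile, outfile] + [prefix + opt for opt in INCHI_OPTIONS]

command = inchi_command("compound.mol", "compound.inchi.txt")
print(" ".join(command))
# To actually run it (requires the executable on PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Because these options produce a non-standard InChI, the resulting identifiers are not interchangeable with standard InChIs or their InChIKeys.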

Page 23:

For each compound (CSID), parent generation is attempted.

“Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)

Parent | Description
Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned to be an ongoing improvement as new cases appear.
Isotope-Insensitive | Isotopes are replaced by the common weight.
Stereo-Insensitive | Stereo is stripped.
Tautomer-Insensitive | Tautomer canonicalization attempts to generate a “reasonable” tautomer.
Super-Insensitive | This parent is all of the above.

No fragment-insensitive parent: we treat all fragments as equal entities.
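The stereo and isotope parents can be illustrated with a toy, string-level sketch on SMILES. This is not how CVSP generates parents: a real implementation would work on the molecular graph and re-canonicalize the result, and the charge and tautomer parents need actual chemistry. The function names and the example SMILES are assumptions for illustration.

```python
import re

def stereo_parent(smiles):
    """Drop tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    return re.sub(r"[@/\\]", "", smiles)

def isotope_parent(smiles):
    """Drop the mass number that follows '[' in a bracket atom, e.g. [13CH4] -> [CH4]."""
    return re.sub(r"\[\d+", "[", smiles)

def stereo_and_isotope_parent(smiles):
    """Apply both simplifications; charge and tautomer handling are out of scope here."""
    return isotope_parent(stereo_parent(smiles))

labelled = "[13CH3][C@H](N)C(=O)O"
print(stereo_parent(labelled))
print(isotope_parent(labelled))
print(stereo_and_isotope_parent(labelled))
```

The output strings are not canonical, so two children of the same parent may still differ textually; matching parents in practice requires a canonical identifier such as an InChI.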

Page 24:

Deposited SDF record: CTAB, REGID1, DataSource, Synonym1, Synonym2, XRef1, etc.

Standardized entity (OPS_ID1)

Parents

• Charge Parent (OPS_ID7)
• Isotope Parent (OPS_ID5)
• Stereo Parent (OPS_ID4)
• Tautomer Parent (OPS_ID6)
• Super Parent (OPS_ID8)

Fragments

• Fragment (OPS_ID2)
• Fragment (OPS_ID3)

Page 25:

Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com

• Validation
• Standardization
• Parent generation

RDF Export

Data

Page 26:

Data is being imported from ChemSpider to Open PHACTS in RDF/Turtle

Page 27:

RDF/VoID

– VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data. http://www.w3.org/TR/void
  • skos:exactMatch (SKOS: Simple Knowledge Organization System), e.g. to link compounds in OPS with compounds in ChEBI.
  • skos:closeMatch, e.g. to link Stereo Insensitive Parents to their Children within OPS.
  • skos:relatedMatch, e.g. to link Parent compounds that contain others as Fragments.
– Recommendations on how to create the VoID have been specified by Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
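The choice of SKOS predicate per relationship type can be sketched as a small lookup that emits Turtle statements. The relationship names, the `ops_base` namespace, the ChEBI identifier, and the `to_turtle` helper are placeholders invented for this example; only the three SKOS predicates and their intended uses come from the slide.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Relationship kind -> SKOS predicate, following the three uses listed above.
PREDICATE_FOR = {
    "same_compound_other_dataset": "skos:exactMatch",
    "parent_child": "skos:closeMatch",
    "fragment_of": "skos:relatedMatch",
}

def to_turtle(links, ops_base="http://example.org/ops/"):
    """Serialize (subject, kind, object) links as Turtle statements.

    ops_base is a placeholder namespace, not the real Open PHACTS one.
    """
    lines = [f"@prefix skos: <{SKOS}> .", f"@prefix ops: <{ops_base}> .", ""]
    for subject, kind, obj in links:
        lines.append(f"{subject} {PREDICATE_FOR[kind]} {obj} .")
    return "\n".join(lines)

links = [
    ("ops:OPS1", "same_compound_other_dataset", "<http://example.org/chebi/CHEBI_0000>"),
    ("ops:OPS3", "parent_child", "ops:OPS4"),
    ("ops:OPS2", "fragment_of", "ops:OPS1"),
]
turtle = to_turtle(links)
print(turtle)
```

For anything beyond a sketch, an RDF library with a proper Turtle serializer would be a safer choice than string formatting.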

Page 28:

[Figure: chemical structure of DrugBank ID DB07241, drawn with Na+ counterions, and the derived structures labelled OPS1 to OPS6]

ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241> .

ops:OPS2 skos:relatedMatch ops:OPS1 .

ops:OPS3 skos:relatedMatch ops:OPS1 .

ops:OPS3 skos:closeMatch ops:OPS4 .

ops:OPS3 skos:closeMatch ops:OPS5 .

ops:OPS4 skos:closeMatch ops:OPS6 .

ops:OPS5 skos:closeMatch ops:OPS6 .
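To show how these statements could be queried, here is a minimal sketch that loads them into memory and matches simple patterns. The `parse` and `match` helpers are written for this example and only handle one-line statements like those above; a SPARQL endpoint over a real triple store would replace both.

```python
STATEMENTS = """
ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241> .
ops:OPS2 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:closeMatch ops:OPS4 .
ops:OPS3 skos:closeMatch ops:OPS5 .
ops:OPS4 skos:closeMatch ops:OPS6 .
ops:OPS5 skos:closeMatch ops:OPS6 .
"""

def parse(text):
    """Naive parser: one 'subject predicate object .' statement per line."""
    triples = []
    for line in text.strip().splitlines():
        subject, predicate, obj = line.rstrip(" .").split()
        triples.append((subject, predicate, obj))
    return triples

def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

triples = parse(STATEMENTS)
# Which records are related to OPS1 via skos:relatedMatch?
related_to_ops1 = [s for s, _, _ in match(triples, p="skos:relatedMatch", o="ops:OPS1")]
# Which records point to OPS6 via skos:closeMatch?
close_to_ops6 = [s for s, _, _ in match(triples, p="skos:closeMatch", o="ops:OPS6")]
print(related_to_ops1, close_to_ops6)
```

The same two questions map directly onto SPARQL triple patterns with a variable in the subject position.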

Page 29:
Page 30:

Future work

Enabling full semantic web capabilities:

• Establish an RDF server with all relationships (including parent-child relationships)
• Develop SPARQL capability for querying RDF

Validate all records in ChemSpider by passing them through CVSP