
Post on 19-Feb-2017


Page 1: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Chemical Semantics, Inc.

Chemical Semantics

Mirek Sopek*, Neil Ostlund, Jacob W.G. Bloom, Stuart Chalk

Chemical Semantics Inc., 1115 NW 4th Street, Gainesville, Florida

*[email protected]

Representation of molecular structures and related computations on the Semantic Web.

Universal Data Model and its Ontology.

Page 2: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Hypercube / Chemical Semantics, Inc. – March 2016, San Diego

Chemical Semantics goals

Interoperable PUBLISHING of Computational Chemistry calculations

Semantic REPRESENTATION OF DATA for both humans and machines

FEDERATION of published data with existing web-based chemical datasets

Cloud-like ARCHIVING of Computational Chemistry calculation results, input/output files etc.

http://chemsem.com

Page 3: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Portal – a short review

chemsem.com – EXISTING PLATFORM FOR DATA PUBLISHING

Page 4: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Portal – what’s new?

Enhanced stability and security

SPARQL Query Generator based on chemical drawings

Range of supported QC packages extended to: ADF, DALTON, GAMESS, GAMESS-UK, Gaussian, Jaguar, Molpro, NWChem, ORCA, Psi4, and Q-Chem (thanks to the use of cclib)

Page 5: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology


Data Models in chemistry

Page 6: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

What is a data model and why is it important?

What is a data model:

A data model organizes data elements and standardizes how the data elements relate to one another.

As such, a data model should be distinguished from its serializations (i.e. file formats).

The most important place where we work directly with data models is in the software!

Page 7: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Data Models in Chemistry

TABULAR data models (most popular: MOL files, MOLDEN files, ZMT, GJF, HIN, Relational DBs etc.)

TREE based data models (CML, AniML, CSX etc.)

KEY VALUE/MIXED data models (CIF, new PDB/mmCIF, JCAMP-DX)

Page 8: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Why we need new data models and standards

Existing data models have various levels of extensibility, but all of them fall short when a new kind of data appears that was unknown or unpredicted when the model was created.

Adding such a new kind of data to a model usually breaks it or, in the best case, the data is ignored.

There is no provision for dynamic sharing of data where people can add new data in real time.

Page 9: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology


What is the solution?

We are convinced that the solution comes in the form of:

a GRAPH-based data model based on the smallest possible data pattern: A TRIPLE

The best implementation is offered by RDF – Resource Description Framework known from Semantic Technologies.

Page 10: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Why triples?

Arbitrary N-tuples can be constructed out of 3-tuples.

Proved by W. V. Quine, Mathematical Logic, Harvard University Press, 1940.
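One common way to see this in practice is to hang every field of an n-ary fact off a freshly created intermediate node. The sketch below is only an illustration of that pattern, not necessarily the construction used in the cited proof; the `ex:` names and the `_:rowN` node labels are hypothetical.

```python
from itertools import count

_ids = count(1)  # source of fresh node labels (hypothetical naming scheme)


def ntuple_to_triples(relation, fields):
    """Split one n-ary fact into binary facts that share a fresh subject node."""
    node = f"_:row{next(_ids)}"
    triples = {(node, "rdf:type", relation)}
    for name, value in fields.items():
        triples.add((node, name, value))
    return node, triples


# A 4-tuple (atom, x, y, z) becomes five triples with one common subject.
node, triples = ntuple_to_triples(
    "ex:Position",
    {"ex:ofAtom": "A1", "ex:x": 0.06968, "ex:y": 1.299703, "ex:z": 0.021584},
)
```

The original tuple can be recovered by collecting all triples whose subject is `node`.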

Page 11: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

RDF data model

Anatomy of the triple:

Subject Predicate Object

Thing Property Value

For example:

<molecule> gnvc:hasInChIString "1S/H2O/h1H2"

<molecule> gc:hasInChIKey "DUGIDELPOPULAW-UHFFFAOYSA-N"

Page 12: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

RDF data model

A typical data set contains a large number of triples forming a DIRECTED GRAPH.

Identification and addressing of nodes is done via a URI scheme – a generalization of URLs, the standard web addresses.
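Prefixed names such as gc:Atom are shorthand for full URIs. A minimal sketch of that expansion, using the three prefix declarations that appear in the SPARQL example of this deck; the helper name `expand` is an assumption made for illustration.

```python
# Prefix table taken from the PREFIX lines of the deck's SPARQL example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "gc": "http://purl.org/gc/",
}


def expand(name, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URI; leave anything else untouched."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name


atom_class = expand("gc:Atom")
```

Because every node is addressed this way, two data sets that use the same URI are talking about the same thing, which is what makes merging graphs from different sources possible.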

Page 13: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

RDF data model in software

The RDF data model in software is usually represented as:

An unordered SET of TRIPLES (3-TUPLES)

For example, in Python a triple is simply a 3-tuple:

(subject, predicate, object)
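A small sketch of what "unordered set of triples" means in plain Python, without any RDF library. The `ex:` identifiers are hypothetical; the predicates are written as strings for brevity.

```python
# Two small graphs, each a plain Python set of 3-tuples.
graph_a = {
    ("ex:mol1", "rdf:type", "gc:Molecule"),
    ("ex:mol1", "gc:hasAtom", "ex:A1"),
}
graph_b = {
    ("ex:mol1", "gc:hasAtom", "ex:A1"),  # same triple as in graph_a
    ("ex:A1", "gc:isElement", "F"),
}

# Merging graphs is set union: order is irrelevant and duplicates collapse.
merged = graph_a | graph_b
```

Since the set has no order and no fixed columns, new triples can be added without touching the ones that are already there.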

Page 14: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

How do we interact with the model?

Through SPARQL queries

Through specific API calls in your language of preference

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gc: <http://purl.org/gc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?graph
WHERE {
    GRAPH ?graph {
        {
            ?something gc:hasAtom ?atom1 ;
                rdf:type ?somethingType ;
                rdfs:label ?somethingLabel .
            ?atom1 gc:isElement "F" .
        }
        UNION
        {
            ?something gc:hasAtom ?atom2 ;
                rdf:type ?somethingType ;
                rdfs:label ?somethingLabel .
            ?atom2 gc:isElement "Cl" .
        }
        UNION
        {
            ?something gc:hasAtom ?atom3 ;
                rdf:type ?somethingType ;
                rdfs:label ?somethingLabel .
            ?atom3 gc:isElement "Br" .
        }
        UNION
        (…)

import rdflib
from rdflib import RDF, URIRef

# Class URIs from the Gainesville Core namespace
ua = URIRef(u'http://purl.org/gc/Atom')
um = URIRef(u'http://purl.org/gc/Molecule')
ur = URIRef(u'http://purl.org/gc/Residue')

# urn, uhr and nmc are defined in parts of the program not shown on the slide
g = rdflib.Graph()
ba = g.parse(urn, format="turtle")

for m in g.subjects(RDF.type, um):
    nmc += 1
    napm = 0  # number of atoms per molecule
    res1 = g.objects(m, uhr)
    lres = len(list(res1))
    if lres > 0:
        res = g.objects(m, uhr)
        (…)

# Reading single values for a node (graph, vURI and gcn come from elsewhere)
v = graph.value(subject=vURI, predicate=RDF.type)
h = graph.value(subject=vURI, predicate=gcn.hasName)
a = graph.value(subject=vURI, predicate=gcn.hasValue)
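To make the idea behind both styles concrete without a triple store, here is a library-free sketch of triple-pattern matching. It mimics the spirit of the UNION query above (find things that have an F, Cl or Br atom) but returns the subject rather than the named graph. The function names and the `ex:` data are assumptions made for illustration; in real code a library such as rdflib would do this work.

```python
def match(graph, s=None, p=None, o=None):
    """Return all triples that agree with the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def subjects_with_elements(graph, elements):
    """Find every subject that has at least one atom of the given elements."""
    found = set()
    for something, _, atom in match(graph, p="gc:hasAtom"):
        for _, _, element in match(graph, s=atom, p="gc:isElement"):
            if element in elements:
                found.add(something)
    return found


# Hypothetical sample data
graph = {
    ("ex:mol1", "gc:hasAtom", "ex:a1"), ("ex:a1", "gc:isElement", "F"),
    ("ex:mol2", "gc:hasAtom", "ex:a2"), ("ex:a2", "gc:isElement", "C"),
    ("ex:mol3", "gc:hasAtom", "ex:a3"), ("ex:a3", "gc:isElement", "Br"),
}
halogenated = subjects_with_elements(graph, {"F", "Cl", "Br"})
```

Each `match` call corresponds to one triple pattern of a SPARQL query, and the set of elements plays the role of the UNION branches.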

Page 15: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Software interaction with the model?

Of all the data models, the RDF GRAPH offers almost unlimited extensibility.

Its serializations (JSON-LD and Turtle) are the most convenient to work with.

Page 16: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Dr Mirek Sopek

SOFTWARE

Page 17: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Dr Mirek Sopek

ORIGINAL SOFTWARE

OTHER SOFTWARE

Page 18: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Data model and its serializations

There are a number of serializations for RDF graphs:

RDF/XML, N-Triples, Turtle, JSON-LD etc.

The most important today are: JSON-LD & Turtle

We should never forget that they are just SERIALIZATIONS of the underlying, more fundamental Data Model
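The point can be illustrated with two deliberately simple text formats. These are toy formats invented for this sketch, not Turtle or JSON-LD, and the helper names are assumptions; what matters is that both texts decode to the same set of triples.

```python
import json

# The data model: an unordered set of triples (all strings here).
triples = {
    ("ex:mol1", "rdf:type", "gc:Molecule"),
    ("ex:mol1", "gc:hasAtom", "ex:A1"),
    ("ex:A1", "gc:isElement", "F"),
}


def to_lines(graph):
    """Toy line format: one tab-separated triple per line."""
    return "\n".join(sorted("\t".join(t) for t in graph))


def from_lines(text):
    return {tuple(line.split("\t")) for line in text.splitlines() if line}


def to_json(graph):
    """Toy JSON format: a list of three-element lists."""
    return json.dumps(sorted(list(t) for t in graph))


def from_json(text):
    return {tuple(t) for t in json.loads(text)}


as_lines = to_lines(triples)
as_json = to_json(triples)
```

The two strings look nothing alike, yet parsing either one gives back exactly the same graph.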

Page 19: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Chemical Semantics Graph Data models

Page 20: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Models

Existing model (currently used on our portal):

Closely follows the CSX (XML) data model presented here last year

The new data model features:

Alternate methods to describe molecular geometry: Cartesian, Fractional and Internal coordinates

Flexible representation of molecular hierarchies (molecules, residues, groups, chains, templates etc.)

Cleaner serializations to both JSON-LD and Turtle, which are also easier for humans to work with

Closer integration with the Gainesville Core Ontology

Page 21: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Model

Geometrical objects: Top level class hierarchy

[Diagram: the classes gc:Locus, gc:Atom, gc:Point, gc:DummyAtom and gc:GhostAtom, connected by four subclass links (labelled rdf:subClass on the slide)]

Page 22: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Model

[Diagram: complete graph for a molecular system mSys, combining the Cartesian coordinates representation and the molecular hierarchy (molecule, residues, groups, atoms, bonds). The same two parts appear separately on the following two slides.]

Page 23: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Model

Cartesian coordinates representation

[Diagram, as far as the layout can be recovered: mSys (rdf:type gc:MolecularSystem) and cart (rdf:type gc:CartesianCoordinates) are linked to the points through gc:contains and gc:usesType. Each point is of rdf:type gc:Point and is tied to an atom by gc:isPositionFor.]

p1 (A1): gc:hasXValue 0.06968, gc:hasYValue 1.299703, gc:hasZValue 0.021584

p2 (A2): gc:hasXValue 1.000204, gc:hasYValue 1.658998, gc:hasZValue 0.011623

p9 (A7): gc:hasVectorValue 1.000204 1.658998 0.01162361
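A sketch of how software might read atom positions back out of such a graph, using the two points whose values are legible on the slide. It assumes that each point is the subject of gc:isPositionFor and of the three value properties; the links from mSys and cart to the points are left out because their exact form is not clear from the slide. The helper names are hypothetical.

```python
# Point triples for atoms A1 and A2, with the coordinate values from the slide.
graph = {
    ("p1", "rdf:type", "gc:Point"),
    ("p1", "gc:isPositionFor", "A1"),
    ("p1", "gc:hasXValue", 0.06968),
    ("p1", "gc:hasYValue", 1.299703),
    ("p1", "gc:hasZValue", 0.021584),
    ("p2", "rdf:type", "gc:Point"),
    ("p2", "gc:isPositionFor", "A2"),
    ("p2", "gc:hasXValue", 1.000204),
    ("p2", "gc:hasYValue", 1.658998),
    ("p2", "gc:hasZValue", 0.011623),
}


def value(graph, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None


def positions(graph):
    """Map each atom to its (x, y, z) tuple by following gc:isPositionFor."""
    result = {}
    for point, predicate, atom in graph:
        if predicate == "gc:isPositionFor":
            result[atom] = tuple(
                value(graph, point, f"gc:has{axis}Value") for axis in "XYZ"
            )
    return result


coords = positions(graph)
```

A point that stores its position with gc:hasVectorValue instead would need one extra branch in `positions`; the rest of the code would stay as it is.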

Page 24: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Model

Molecular hierarchy

[Diagram, as far as the layout can be recovered: mSys (rdf:type gc:MolecularSystem) is linked by gc:hasMolecules to molecule M1, which is linked by hasResidue to R1 and R2 (rdf:type gc:Residue). gc:hasAtom links attach the atoms A1 to A7 to the residues and to the groups g1 and g2 (rdf:type gc:Group).]

Groups: "Amine Group" and "Carboxylic acid group" (given as labels; both are typed chebi:CHEBI_32952 on the slide)

b1_2: rdf:type gc:SingleBond, gc:binds A1, gc:binds A2

b6_7: rdf:type gc:DoubleBond, gc:binds A6, gc:binds A7
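The bond nodes are the clearest part of this diagram: each bond is a node of its own with two gc:binds links. A minimal sketch of turning those triples into a neighbour table follows; the function name `adjacency` is an assumption, and only the two bonds legible on the slide are included.

```python
from collections import defaultdict

# Bond triples as shown on the slide.
graph = {
    ("b1_2", "rdf:type", "gc:SingleBond"),
    ("b1_2", "gc:binds", "A1"),
    ("b1_2", "gc:binds", "A2"),
    ("b6_7", "rdf:type", "gc:DoubleBond"),
    ("b6_7", "gc:binds", "A6"),
    ("b6_7", "gc:binds", "A7"),
}


def adjacency(graph):
    """Build an atom -> set of bonded atoms table from gc:binds triples."""
    ends = defaultdict(set)
    for bond, predicate, atom in graph:
        if predicate == "gc:binds":
            ends[bond].add(atom)
    neighbours = defaultdict(set)
    for atoms in ends.values():
        for atom in atoms:
            neighbours[atom] |= atoms - {atom}
    return dict(neighbours)


bonded = adjacency(graph)
```

Modelling the bond as a node rather than as a direct atom-to-atom link is what leaves room for the bond type and for any further bond properties added later.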

Page 25: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

CSI Molecular Data Model

Internal coordinates

[Diagram, as far as the layout can be recovered: mSys (rdf:type gc:MolecularSystem) gc:contains zmat (rdf:type gc:InternalCoordinates). gc:hasZmatLines points to an RDF list of Z-matrix lines zL1, zL2, zL3, zL4, ... Each line names its atoms with hasFirstAtom, hasSecondAtom, hasThirdAtom and hasFourthAtom, and its values (v1 to v6) with hasDistance, hasAngle and hasDihedral.]

Value nodes are of rdf:type Data_view_value and carry hasName and hasValue, for example: R2 = 1.399645; v1 = 1.081060; v6, named D3, = 118.774

v7: rdf:type Data_as_pointer, hasAdditiveInverseData v6
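The v6/v7 pair shows a value that is not stored directly but defined as the negative of another value. A sketch of how a reader of the graph might resolve such pointers, using the property names as they appear on the slide (without prefix); the function name `resolve` is an assumption.

```python
# v6 holds a dihedral named D3; v7 is defined as its additive inverse.
graph = {
    ("v6", "rdf:type", "Data_view_value"),
    ("v6", "hasName", "D3"),
    ("v6", "hasValue", 118.774),
    ("v7", "rdf:type", "Data_as_pointer"),
    ("v7", "hasAdditiveInverseData", "v6"),
}


def resolve(graph, variable):
    """Return the numeric value of a variable, following inverse pointers."""
    for s, p, o in graph:
        if s == variable and p == "hasValue":
            return o
    for s, p, o in graph:
        if s == variable and p == "hasAdditiveInverseData":
            return -resolve(graph, o)
    raise KeyError(variable)


d3 = resolve(graph, "v6")
minus_d3 = resolve(graph, "v7")
```

Storing the relation instead of a second number keeps the two dihedrals consistent when the underlying value changes.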

Page 26: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

POC - Representation of residues

Proof-of-Concept based on AMBER residues (http://ambermd.org/doc/prep.html)

As simple as adding a few more triples to the existing structure.

Another example of the data model’s flexibility and of the processing software’s immunity to changes in data patterns.

Page 27: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology


Amber residues

Page 28: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

The contents

[Diagram, as far as the layout can be recovered: mTemplate and residue are typed gc:Templates and gc:PolymericTemplates and linked by gc:contains. As in the internal coordinates model, gc:hasZmatLines points to an RDF list of Z-matrix lines (zL1 to zL4, ...) with hasFirstAtom to hasThirdAtom, hasDistance, hasAngle and hasDihedral. Each atom additionally carries the residue properties below.]

gc:residueAtomName DUMM, gc:residueAtomSymbol DU, gc:residueTopologicalType M, gc:AtomCharge 0.0

gc:residueAtomName CD2, gc:residueAtomSymbol CD, gc:residueTopologicalType E, gc:AtomCharge -0.0110

Fields of an AMBER prep atom line: I, IGRAPH(I), ISYMBL(I), ITREE(I), NA(I), NB(I), NC(I), R(I), THETA(I), PHI(I), CHG(I)
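A sketch of how one atom line with these eleven fields could be turned into residue triples. It assumes whitespace-separated fields and the property mapping shown on the slide (IGRAPH to gc:residueAtomName, ISYMBL to gc:residueAtomSymbol, ITREE to gc:residueTopologicalType, CHG to gc:AtomCharge). The sample line, the node naming scheme and the function names are illustrative assumptions, not taken from the presentation.

```python
FIELDS = ("I", "IGRAPH", "ISYMBL", "ITREE", "NA", "NB", "NC",
          "R", "THETA", "PHI", "CHG")


def parse_prep_line(line):
    """Split one whitespace-separated atom line into a typed record."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for key in ("I", "NA", "NB", "NC"):
        record[key] = int(record[key])
    for key in ("R", "THETA", "PHI", "CHG"):
        record[key] = float(record[key])
    return record


def atom_triples(residue, record):
    """Emit the residue-related triples for one parsed atom line."""
    atom = f"{residue}/atom{record['I']}"  # hypothetical node naming
    return atom, {
        (residue, "gc:hasAtom", atom),
        (atom, "gc:residueAtomName", record["IGRAPH"]),
        (atom, "gc:residueAtomSymbol", record["ISYMBL"]),
        (atom, "gc:residueTopologicalType", record["ITREE"]),
        (atom, "gc:AtomCharge", record["CHG"]),
    }


# Illustrative dummy-atom line; the geometry columns are placeholders.
line = " 1  DUMM  DU    M    0  -1  -2     0.000      0.0       0.0     0.0"
record = parse_prep_line(line)
atom, triples = atom_triples("residue", record)
```

The connectivity and geometry columns (NA, NB, NC, R, THETA, PHI) would feed the Z-matrix line triples in the same way.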

Page 29: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Amber residues

Creating residue templates on the basis of internal coordinate representations adds completely new data to the system. However, the existing information is still readable by the software that "knew" how to interpret it.

The new data can now be extracted by the software that "knows" about residues.
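This behaviour follows directly from the set-of-triples model and can be sketched in a few lines. The element assignments and function names below are hypothetical; the residue properties are the ones shown on the previous slide.

```python
# Graph as an "old" program knows it.
base = {
    ("A1", "gc:isElement", "C"),
    ("A2", "gc:isElement", "N"),
}

# The same graph after residue information has been added.
extended = base | {
    ("A1", "gc:residueAtomName", "CD2"),
    ("A1", "gc:residueTopologicalType", "E"),
}


def elements(graph):
    """Old consumer: only looks at gc:isElement and ignores everything else."""
    return {s: o for s, p, o in graph if p == "gc:isElement"}


def residue_names(graph):
    """New consumer: additionally understands gc:residueAtomName."""
    return {s: o for s, p, o in graph if p == "gc:residueAtomName"}


old_view = elements(extended)
new_view = residue_names(extended)
```

The old consumer returns the same answer before and after the extension, because it only selects the triples it understands.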

Page 30: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Use in software

Excel example

Python example

PHP example

http://chemicalsemantics.com/rda/

Page 31: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology


Ontological description of the data model

The structure of the RDF data model can be described in an Ontology.

http://purl.org/gc

Page 32: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Conclusions

The RDF data model delivers the maximum possible extensibility while preserving compatibility with the software used to create and consume it.

It is suitable not only for knowledge representation and metadata encoding, but is also the best data model for encoding molecular structure information.

Page 33: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Acknowledgements

I would like to thank the following people for making this presentation possible:

Dr. Neil S. Ostlund

Dr. Jacob W.G. Bloom

Dr. Bing Wang

Dr. Stuart Chalk

Page 34: Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology

Thank you!

Mirek Sopek, PhD

Chemical Semantics, Inc.

1115 NW 4th Street

32601 Gainesville, Florida

cell: +1 917 3467500

web: www.chemicalsemantics.com

email: [email protected]