Chemical Semantics, Inc.
Chemical Semantics
Mirek Sopek*, Neil Ostlund, Jacob W.G. Bloom, Stuart Chalk
Chemical Semantics Inc., 1115 NW 4th Street, Gainesville, Florida
Representation of molecular structures
and related computations on the Semantic Web.
Universal Data Model and its Ontology.
HypercubeChemical Semantics, Inc. – March 2016, San Diego
Chemical Semantics goals
Interoperable PUBLISHING of Computational Chemistry calculations
Semantic REPRESENTATION OF DATA for both humans and machines
FEDERATION of published data with existing web-based chemical datasets
Cloud-like ARCHIVING of Computational Chemistry calculation results, input/output files, etc.
http://chemsem.com
CSI Portal – a short review
chemsem.com – EXISTING PLATFORM FOR DATA PUBLISHING
CSI Portal – what's new?
Enhanced stability and security
SPARQL Query Generator based on chemical drawings
Extended range of QC packages: ADF, DALTON, GAMESS, GAMESS-UK, Gaussian, Jaguar, Molpro, NWChem, ORCA, Psi4, and QChem (thanks to the use of cclib)
Data Models in chemistry
What is a data model and why is it important?
What is a data model:
A data model organizes data elements and standardizes how the data elements relate to one another.
As such, a data model should be distinguished from its serializations (i.e. file formats).
The most important place where we work directly with data models is in the software!
Data Models in Chemistry
TABULAR data models (most popular: MOL files, MOLDEN files, ZMT, GJF, HIN, relational DBs, etc.)
TREE-based data models (CML, AniML, CSX, etc.)
KEY-VALUE/MIXED data models (CIF, new PDB/mmCIF, JCAMP-DX)
Why we need new data models and standards
Existing data models have various levels of extensibility, but all of them fall short when a new kind of data appears that was unknown or unforeseen when the model was created.
Adding such a new kind of data to a model usually breaks it or, in the best case, the data is ignored.
There is no provision for dynamic sharing of data where people can add new data in real time.
What is the solution?
We are convinced that the solution comes in the form of:
a GRAPH-based data model built on the smallest possible data pattern: a TRIPLE
The best implementation is offered by RDF (Resource Description Framework), known from Semantic Web technologies.
Why triples?
Arbitrary N-tuples can be constructed out of 3-tuples.
Proved by W. V. Quine, Mathematical Logic, Harvard University Press, 1940.
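The statement above can be illustrated with a minimal sketch: an n-tuple (here a hypothetical 4-field atom record) is split into triples by giving the record an identifier and emitting one triple per field. The `ex:` names and the helpers `row_to_triples` and `triples_to_row` are assumptions made for this illustration; they are not part of the Gainesville Core ontology or of any library.

```python
def row_to_triples(row_id, field_names, row):
    """Decompose an n-tuple into n triples that share one subject."""
    if len(field_names) != len(row):
        raise ValueError("field_names and row must have the same length")
    return {(row_id, name, value) for name, value in zip(field_names, row)}


def triples_to_row(triples, row_id, field_names):
    """Rebuild the original n-tuple from the triples about row_id."""
    lookup = {p: o for s, p, o in triples if s == row_id}
    return tuple(lookup[name] for name in field_names)


fields = ("ex:element", "ex:x", "ex:y", "ex:z")  # hypothetical predicates
atom_row = ("O", 0.06968, 1.299703, 0.021584)    # a 4-tuple

triples = row_to_triples("ex:atom1", fields, atom_row)
restored = triples_to_row(triples, "ex:atom1", fields)
print(len(triples), restored == atom_row)
```

Nothing is lost in the decomposition: the four triples carry the same information as the 4-tuple, and a fifth field could be added later as one more triple without touching the existing ones.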
RDF data model
Anatomy of the triple:
Subject      Predicate             Object
Thing        Property              Value
For example:
<molecule>   gnvc:hasInChIString   "1S/H2O/h1H2"
<molecule>   gc:hasInChIKey        "DUGIDELPOPULAW-UHFFFAOYSA-N"
RDF data model
A typical data set contains a large number of triples forming a DIRECTED GRAPH.
Nodes are identified and addressed via a URI scheme, a generalization of URLs (standard web addresses).
RDF data model in software
The RDF data model in software is usually represented as:
Unordered SET of TRIPLES (3-TUPLES)
For example, in Python a triple is a 3-tuple:
(subject, predicate, object)
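As a rough sketch of that representation, a graph can be held as a plain Python `set` of 3-tuples and queried with a wildcard pattern. The `match` helper and the `ex:` sample names are hypothetical; libraries such as rdflib offer comparable pattern matching on real RDF terms, whereas this example uses bare strings to stay self-contained.

```python
def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        (ts, tp, to)
        for ts, tp, to in graph
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    }


graph = {
    ("ex:water", "rdf:type", "gc:Molecule"),
    ("ex:water", "gc:hasAtom", "ex:O1"),
    ("ex:water", "gc:hasAtom", "ex:H1"),
    ("ex:water", "gc:hasAtom", "ex:H2"),
    ("ex:O1", "gc:isElement", "O"),
    ("ex:H1", "gc:isElement", "H"),
    ("ex:H2", "gc:isElement", "H"),
}

# A set is unordered and ignores duplicates, like an RDF graph.
graph.add(("ex:O1", "gc:isElement", "O"))

atoms = sorted(o for _, _, o in match(graph, s="ex:water", p="gc:hasAtom"))
hydrogens = sorted(s for s, _, _ in match(graph, p="gc:isElement", o="H"))
print(len(graph), atoms, hydrogens)
```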
How do we interact with the model?
Through SPARQL queries
Through specific API calls in your language of preference

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gc: <http://purl.org/gc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?graph
WHERE {
  GRAPH ?graph {
    {
      ?something gc:hasAtom ?atom1 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom1 gc:isElement "F" .
    }
    UNION
    {
      ?something gc:hasAtom ?atom2 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom2 gc:isElement "Cl" .
    }
    UNION
    {
      ?something gc:hasAtom ?atom3 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom3 gc:isElement "Br" .
    }
    UNION
    (…)

import rdflib
from rdflib import URIRef
from rdflib.namespace import RDF

# Fragments of a longer script: urn, uhr, nmc, gcn, graph and vURI are defined elsewhere.
ua = URIRef(u'http://purl.org/gc/Atom')
um = URIRef(u'http://purl.org/gc/Molecule')
ur = URIRef(u'http://purl.org/gc/Residue')

g = rdflib.Graph()
ba = g.parse(urn, format="turtle")

for m in g.subjects(RDF.type, um):
    nmc += 1
    napm = 0  # number of atoms per molecule
    res1 = g.objects(m, uhr)
    lres = len(list(res1))
    if lres > 0:
        res = g.objects(m, uhr)
        (…)

v = graph.value(subject=vURI, predicate=RDF.type)
h = graph.value(subject=vURI, predicate=gcn.hasName)
a = graph.value(subject=vURI, predicate=gcn.hasValue)
Software interaction with the model?
Of all data models, the RDF GRAPH offers almost unlimited extensibility.
Of its serializations, JSON-LD and Turtle are the best to work with.
[Diagram labels: SOFTWARE]
[Diagram labels: ORIGINAL SOFTWARE, OTHER SOFTWARE]
Data model and its serializations
There are a number of serializations for RDF graphs:
RDF/XML, N-Triples, Turtle, JSON-LD, etc.
The most important today are: JSON-LD & Turtle
We should never forget that they are just SERIALIZATIONS
of the underlying, more fundamental Data Model
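The distinction between a data model and its serializations can be sketched in a few lines: the same set of triples is written once as N-Triples-style lines and once as a JSON array, and both texts parse back to the identical set. This is a simplified illustration under stated assumptions: the `http://example.org/` namespace and all helper names are hypothetical, the JSON layout is ad hoc (it is not JSON-LD), and the writer handles only simple ASCII literals rather than the full N-Triples grammar.

```python
import json

GC = "http://purl.org/gc/"
EX = "http://example.org/"  # hypothetical namespace for the sample data
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Objects are tagged as ("iri", value) or ("lit", value).
graph = {
    (EX + "water", RDF_TYPE, ("iri", GC + "Molecule")),
    (EX + "water", GC + "hasAtom", ("iri", EX + "O1")),
    (EX + "O1", GC + "isElement", ("lit", "O")),
}


def to_ntriples(triples):
    """Write one N-Triples-style line per triple (simple literals only)."""
    lines = []
    for s, p, (kind, value) in sorted(triples):
        obj = f"<{value}>" if kind == "iri" else f'"{value}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def from_ntriples(text):
    """Parse the lines produced by to_ntriples back into a set of triples."""
    triples = set()
    for line in text.splitlines():
        s, p, o = line.rstrip(" .").split(" ", 2)
        kind = "iri" if o.startswith("<") else "lit"
        triples.add((s[1:-1], p[1:-1], (kind, o[1:-1])))
    return triples


def to_json(triples):
    """Write the same triples as a JSON array (ad hoc layout, not JSON-LD)."""
    return json.dumps(sorted(triples))


def from_json(text):
    """Parse the JSON array back into a set of triples."""
    return {(s, p, tuple(o)) for s, p, o in json.loads(text)}


nt_text = to_ntriples(graph)
json_text = to_json(graph)

# Two different texts, one and the same data model.
print(nt_text != json_text)
print(from_ntriples(nt_text) == graph == from_json(json_text))
```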
Chemical Semantics Graph Data Models
CSI Molecular Data Models
Existing model (currently used on our portal):
Closely follows the CSX (XML) data model presented here last year
The new data model features:
Alternate methods to describe molecular geometry: Cartesian, fractional and internal coordinates
Flexible representation of molecular hierarchies (molecules, residues, groups, chains, templates, etc.)
Cleaner serializations to both JSON-LD and Turtle, also easier for humans to work with
Closer integration with the Gainesville Core Ontology
CSI Molecular Data Model
Geometrical objects: Top level class hierarchy
[Diagram: class hierarchy with gc:Locus at the top and the classes gc:Atom, gc:Point, gc:DummyAtom and gc:GhostAtom attached through four subclass arrows (labelled rdf:subClass on the slide)]
CSI Molecular Data Model
[Diagram: overview of the complete model graph for the molecular system mSys (rdf:type gc:MolecularSystem). It combines the Cartesian coordinates (cart, rdf:type gc:CartesianCoordinates, with points p1, p2, ... p9 linked to atoms by gc:isPositionFor), the molecular hierarchy (molecule M1, residues R1 and R2, groups g1 and g2, atoms A1 to A7 linked by gc:hasAtom) and the bonds (b1_2, a gc:SingleBond binding A1 and A2; b6_7, a gc:DoubleBond binding A6 and A7).]
CSI Molecular Data Model
Cartesian coordinates representation
[Diagram: graph of the Cartesian coordinates.
Nodes: mSys (rdf:type gc:MolecularSystem), cart (rdf:type gc:CartesianCoordinates), points p1, p2, ... p9 (rdf:type gc:Point).
Edge labels between mSys, cart and the points: gc:contains, gc:usesType.
p1: gc:isPositionFor A1; gc:hasXValue 0.06968, gc:hasYValue 1.299703, gc:hasZValue 0.021584
p2: gc:isPositionFor A2; gc:hasXValue 1.000204, gc:hasYValue 1.658998, gc:hasZValue 0.011623
p9: gc:isPositionFor A7; gc:hasVectorValue "1.000204 1.658998 0.01162361"]
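A minimal sketch of how software might read such a Cartesian representation follows. It stores the point nodes from the slide as plain Python 3-tuples and looks up the position of an atom, accepting either the three separate X/Y/Z values or the single vector value. The helper names `value` and `position_of` are hypothetical, and the direction of the gc:isPositionFor edge (from point to atom) is an assumption read off the diagram.

```python
graph = {
    ("p1", "rdf:type", "gc:Point"),
    ("p1", "gc:isPositionFor", "A1"),
    ("p1", "gc:hasXValue", 0.06968),
    ("p1", "gc:hasYValue", 1.299703),
    ("p1", "gc:hasZValue", 0.021584),
    ("p9", "rdf:type", "gc:Point"),
    ("p9", "gc:isPositionFor", "A7"),
    ("p9", "gc:hasVectorValue", "1.000204 1.658998 0.01162361"),
}


def value(graph, s, p):
    """Return one object for (s, p), or None if there is no such triple."""
    for ts, tp, to in graph:
        if ts == s and tp == p:
            return to
    return None


def position_of(graph, atom):
    """Return (x, y, z) for an atom, from either coordinate encoding."""
    for point, pred, obj in graph:
        if pred == "gc:isPositionFor" and obj == atom:
            vector = value(graph, point, "gc:hasVectorValue")
            if vector is not None:
                return tuple(float(v) for v in vector.split())
            return tuple(
                value(graph, point, p)
                for p in ("gc:hasXValue", "gc:hasYValue", "gc:hasZValue")
            )
    return None  # no point is attached to this atom


print(position_of(graph, "A1"))
print(position_of(graph, "A7"))
```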
CSI Molecular Data Model
Molecular hierarchy
[Diagram: graph of the molecular hierarchy.
mSys (rdf:type gc:MolecularSystem) is linked by gc:hasMolecules to molecule M1.
M1 is linked by hasResidue to R1 and R2 (rdf:type gc:Residue).
Atoms A1, A3, A5, A7 and A2, A4, A6 are attached by gc:hasAtom.
Groups g1 and g2 (rdf:type gc:Group) are attached to their atoms by gc:hasAtom and carry the labels "Amine Group" and "Carboxylic acid group"; both are annotated with chebi:CHEBI_32952 on the slide.
Bond b1_2 (rdf:type gc:SingleBond) gc:binds A1 and A2; bond b6_7 (rdf:type gc:DoubleBond) gc:binds A6 and A7.]
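The hierarchy and the bonds can be traversed with a few lines of code once they are held as triples. The sketch below is only an illustration: the assignment of atoms to residues R1 and R2, the edge directions, and the helper names `atoms_under` and `bonded_to` are assumptions based on a reading of the diagram, not a confirmed part of the model.

```python
graph = {
    ("mSys", "rdf:type", "gc:MolecularSystem"),
    ("mSys", "gc:hasMolecules", "M1"),
    ("M1", "hasResidue", "R1"),
    ("M1", "hasResidue", "R2"),
    ("R1", "rdf:type", "gc:Residue"),
    ("R2", "rdf:type", "gc:Residue"),
    ("R1", "gc:hasAtom", "A1"), ("R1", "gc:hasAtom", "A3"),
    ("R1", "gc:hasAtom", "A5"), ("R1", "gc:hasAtom", "A7"),
    ("R2", "gc:hasAtom", "A2"), ("R2", "gc:hasAtom", "A4"),
    ("R2", "gc:hasAtom", "A6"),
    ("b1_2", "rdf:type", "gc:SingleBond"),
    ("b1_2", "gc:binds", "A1"), ("b1_2", "gc:binds", "A2"),
    ("b6_7", "rdf:type", "gc:DoubleBond"),
    ("b6_7", "gc:binds", "A6"), ("b6_7", "gc:binds", "A7"),
}


def atoms_under(graph, node):
    """Collect every atom reachable from node through the hierarchy edges."""
    atoms, stack, seen = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for s, p, o in graph:
            if s != current:
                continue
            if p == "gc:hasAtom":
                atoms.add(o)
            elif p in ("gc:hasMolecules", "hasResidue"):
                stack.append(o)
    return atoms


def bonded_to(graph, atom):
    """Return the atoms that share a bond node with the given atom."""
    bonds = {s for s, p, o in graph if p == "gc:binds" and o == atom}
    return {o for s, p, o in graph
            if s in bonds and p == "gc:binds" and o != atom}


print(sorted(atoms_under(graph, "mSys")))
print(bonded_to(graph, "A1"), bonded_to(graph, "A7"))
```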
CSI Molecular Data Model
Internal coordinates
[Diagram: graph of the internal (Z-matrix) coordinates.
mSys (rdf:type gc:MolecularSystem) gc:contains zmat (rdf:type gc:InternalCoordinates).
zmat gc:hasZmatLines an RDF list of lines zL1, zL2, zL3, zL4, ...
zL1: hasFirstAtom
zL2: hasFirstAtom, hasSecondAtom; hasDistance v1
zL3: hasFirstAtom, hasSecondAtom, hasThirdAtom; hasDistance v2, hasAngle v3
zL4: hasFirstAtom, hasSecondAtom, hasThirdAtom, hasFourthAtom; hasDistance v4, hasAngle v5, hasDihedral v6
The atom references point to A1, A2, A3 and A4.
Value nodes:
v1: rdf:type Data_view_value; hasName R2; hasValue 1.399645
v1 (as labelled on the slide): rdf:type Data_view_value; hasValue 1.081060
v6: rdf:type Data_view_value; hasName D3; hasValue 118.774
v7: rdf:type Data_as_pointer; hasAdditiveInverseData v6]
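The value nodes of the internal coordinates lend themselves to a small resolver. The sketch below assumes that a Data_view_value node carries its number directly in hasValue, and that a Data_as_pointer node with hasAdditiveInverseData stands for the negated value of the node it points to; that second reading is inferred from the property name and is not confirmed by the slides. The helpers `value` and `resolve` are hypothetical.

```python
graph = {
    ("zL2", "hasFirstAtom", "A2"),
    ("zL2", "hasSecondAtom", "A1"),
    ("zL2", "hasDistance", "v1"),
    ("v1", "rdf:type", "Data_view_value"),
    ("v1", "hasName", "R2"),
    ("v1", "hasValue", 1.399645),
    ("v6", "rdf:type", "Data_view_value"),
    ("v6", "hasName", "D3"),
    ("v6", "hasValue", 118.774),
    ("v7", "rdf:type", "Data_as_pointer"),
    ("v7", "hasAdditiveInverseData", "v6"),
}


def value(graph, s, p):
    """Return one object for (s, p), or None if there is no such triple."""
    for ts, tp, to in graph:
        if ts == s and tp == p:
            return to
    return None


def resolve(graph, node):
    """Return the number behind a value node, following inverse pointers."""
    kind = value(graph, node, "rdf:type")
    if kind == "Data_view_value":
        return value(graph, node, "hasValue")
    if kind == "Data_as_pointer":
        # Assumed semantics: the value is the negative of the target's value.
        target = value(graph, node, "hasAdditiveInverseData")
        return -resolve(graph, target)
    raise ValueError(f"unknown value node type for {node!r}: {kind!r}")


distance = resolve(graph, value(graph, "zL2", "hasDistance"))
print(distance, resolve(graph, "v6"), resolve(graph, "v7"))
```

Storing v7 as a pointer rather than as a second number keeps the two dihedrals tied together: changing v6 changes v7 as well.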
POC - Representation of residues
Proof-of-Concept based on AMBER residues (http://ambermd.org/doc/prep.html)
As simple as adding a few more triples to the existing structure.
Another example of the data model's flexibility and of the processing software's immunity to changes in the data patterns.
Amber residues
The contents
[Diagram: graph of a residue template.
mTemplate (rdf:type gc:Templates) gc:contains residue (rdf:type gc:PolymericTemplates).
residue gc:hasZmatLines an RDF list of lines zL1, zL2, zL3, zL4, ... using the same pattern as the internal coordinates (hasFirstAtom, hasSecondAtom, hasThirdAtom; hasDistance, hasAngle, hasDihedral with values v1 to v6).
Each atom additionally carries the residue attributes:
gc:residueAtomName DUMM; gc:residueAtomSymbol DU; gc:residueTopologicalType M; gc:AtomCharge 0.0
gc:residueAtomName CD2; gc:residueAtomSymbol CD; gc:residueTopologicalType E; gc:AtomCharge -0.0110
AMBER prep record fields: I, IGRAPH(I), ISYMBL(I), ITREE(I), NA(I), NB(I), NC(I), R(I), THETA(I), PHI(I), CHG(I)]
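As an illustration of how one record with the fields listed above could be turned into residue triples, the sketch below splits a whitespace-separated line and maps four of its fields onto the predicates shown on the slide. Several things here are assumptions: the sample line is made up for the example (only the name DUMM, symbol DU, type M and charge 0.0 come from the slide, the remaining numbers are placeholders), real prep files may use a fixed-column layout, and `prep_line_to_triples` and the `atom` subject prefix are hypothetical names.

```python
FIELDS = ("I", "IGRAPH", "ISYMBL", "ITREE", "NA", "NB", "NC",
          "R", "THETA", "PHI", "CHG")

# Field-to-predicate mapping as suggested by the slide.
PREDICATES = {
    "IGRAPH": "gc:residueAtomName",
    "ISYMBL": "gc:residueAtomSymbol",
    "ITREE": "gc:residueTopologicalType",
    "CHG": "gc:AtomCharge",
}


def prep_line_to_triples(line, subject_prefix="atom"):
    """Turn one whitespace-separated atom record into a set of triples."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    record = dict(zip(FIELDS, tokens))
    subject = f"{subject_prefix}{record['I']}"
    triples = set()
    for field, predicate in PREDICATES.items():
        raw = record[field]
        obj = float(raw) if field == "CHG" else raw
        triples.add((subject, predicate, obj))
    return triples


# Illustrative record; geometry and connectivity values are placeholders.
sample_line = "1 DUMM DU M 0 0 0 0.000 0.000 0.000 0.0"
triples = prep_line_to_triples(sample_line)
for triple in sorted(triples, key=lambda t: t[1]):
    print(triple)
```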
Amber residues
Creating residue templates on the basis of internal coordinate representations adds completely new data to the system. However, the existing information is still readable by the software that "knew" how to interpret it.
The new data can now be extracted by the software that "knows" about residues.
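This behaviour can be sketched with two hypothetical consumers of the same set of triples: an "old" one that only understands gc:hasAtom, and a "new" one that also understands gc:residueAtomName. The function names and the sample nodes are assumptions made for the illustration.

```python
def count_atoms(graph):
    """'Old' software: only knows about gc:hasAtom."""
    return len({o for s, p, o in graph if p == "gc:hasAtom"})


def residue_atom_names(graph):
    """'New' software: additionally knows about gc:residueAtomName."""
    return {s: o for s, p, o in graph if p == "gc:residueAtomName"}


graph = {
    ("M1", "gc:hasAtom", "A1"),
    ("M1", "gc:hasAtom", "A2"),
}
before = count_atoms(graph)

# Residue information is added as extra triples; nothing existing changes.
graph |= {
    ("A1", "gc:residueAtomName", "DUMM"),
    ("A2", "gc:residueAtomName", "CD2"),
}
after = count_atoms(graph)

print(before, after)              # the old consumer is unaffected
print(residue_atom_names(graph))  # the new consumer sees the new data
```

The old consumer skips predicates it does not recognise instead of failing on them, which is the point made above about tabular and tree formats breaking when unforeseen data appears.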
Use in software
Excel example
Python example
PHP example
http://chemicalsemantics.com/rda/
Ontological description of the data model
The structure of the RDF data model can be described in an Ontology.
http://purl.org/gc
Conclusions
The RDF data model delivers the maximum possible extensibility while preserving compatibility with the software used to create and consume it.
It is suitable not only for knowledge representation and metadata encoding, but is also the best data model for encoding molecular structure information.
Acknowledgements
I would like to thank the following people for making this presentation possible:
Dr. Neil S. Ostlund
Dr. Jacob W.G. Bloom
Dr. Bing Wang
Dr. Stuart Chalk
Thank you!
Mirek Sopek, PhD
Chemical Semantics, Inc.
1115 NW 4th Street
32601 Gainesville, Florida
cell: +1 917 3467500
web: www.chemicalsemantics.com
email: [email protected]