An architecture for an Open Science molecular compound database

Egon Willighagen, @egonwillighagen
Dept. of Bioinformatics - BiGCaT - Maastricht University

orcid.org/0000-0001-7542-0286

ACS New Orleans, 9 April 2013, #ACSNola


This session: Public Databases ...

• Public: what's that?
  – Free access?
  – Redistribute?
  – Modify?

• BTW, what is “Open Access” ???


This session: Serving the community...

• Service
  – What do people want?
  – Do they know what is possible?

• Community
  – Who are they? → Personas!
  – Usability must include learnability


Personas

• Not every scientist is alike
• You cannot and must not target one user
• Instead, target at least 2 different users, particularly:
  – The hacker doing all the actual bioinformatics in the lab
  – The professor who has too little time to understand things outside his narrow field


Reason #1: Bioclipse decision support

Spjuth, O. et al. JCIM 2011 51(8):1840-1847.


Data #1: Linked Open Drug Data

Samwald, M. et al. Linked open drug data for pharmaceutical research and development. JChemInf 2011.


Data^2: Linked Open Data

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Sept 2011, CC-BY-SA.



Linked Open Data in the Life Sciences


WikiPathways

Pico, AR et al. PLoS biology 6.7 (2008): e184.


PathVisio: Pathway Analysis

Van Iersel, M. et al. BMC Bioinfo. 2008 9(1):399.


Reason #2: Publishing

• Journals will increasingly require data deposition
  – e.g. BioMed Central:


Needs

• We must propagate rights
  – whether open or not!

• We must make things explicit
  – e.g. by using semantics
  – e.g. by using the InChI


Tool #1: licensing


Open Data #1: crystallography


Open Data #2: Open Notebook Science


Open Data #3: CrystalEye


Licensing → Open not Required

• But not providing info is a killer
  – no, not really, because no scientist seems to care
  – yes, because how will a machine decide? Think scalability and massive data integration efforts
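The machine's side of this argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the dataset names and the `triage` helper are hypothetical, and only the two license URIs (CCZero and PDDL) are real.

```python
# Hypothetical sketch: software can only act on license info that is explicit.
OPEN_LICENSES = {
    "http://creativecommons.org/publicdomain/zero/1.0/",  # CCZero
    "http://opendatacommons.org/licenses/pddl/1.0/",      # PDDL
}


def triage(datasets):
    """Sort dataset names into usable, restricted and unknown by license."""
    result = {"usable": [], "restricted": [], "unknown": []}
    for ds in datasets:
        lic = ds.get("license")
        if lic is None:
            # No license statement: a machine cannot decide, a human must ask.
            result["unknown"].append(ds["name"])
        elif lic in OPEN_LICENSES:
            result["usable"].append(ds["name"])
        else:
            # Explicit but not open: still an immediate answer.
            result["restricted"].append(ds["name"])
    return result


# Invented example records, for illustration only.
datasets = [
    {"name": "dataset-a",
     "license": "http://creativecommons.org/publicdomain/zero/1.0/"},
    {"name": "dataset-b",
     "license": "http://example.org/licenses/proprietary"},
    {"name": "dataset-c"},  # no license information at all
]

report = triage(datasets)
print(report)
```

Only the "unknown" bucket blocks automation: an explicit restrictive license is still an answer, while a missing one forces a manual round trip for every dataset.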


Why does explicit licensing matter?

Because when there is a fire, you want immediate access to the fire hose. You do not want to wait for permission from the mayor.

Because when you want to validate your scientific results, you want immediate access to related data. You do not want to wait for permission from that professor who is on a conference tour for the next 4 weeks. You must have an immediate answer, whatever it is.


Tool #2: Semantic Web to the rescue

• Allows provenance
  – tells us where data came from
  – tells us our rights
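One possible way to attach provenance and rights to data is to describe each named graph with Dublin Core terms. The Python sketch below emits such statements as N-Quads; it assumes the `dcterms:source` and `dcterms:license` properties are used for this purpose, and the graph and source IRIs are hypothetical placeholders under example.org.

```python
# Sketch: describe a named graph with its source and license as N-Quads.
DCT = "http://purl.org/dc/terms/"


def nquad(s, p, o, g):
    """Format one N-Quads statement; all four terms are IRIs in this sketch."""
    return f"<{s}> <{p}> <{o}> <{g}> ."


def provenance_quads(graph, source, license_iri):
    """Return statements saying where a graph's data came from and its license."""
    return [
        nquad(graph, DCT + "source", source, graph),
        nquad(graph, DCT + "license", license_iri, graph),
    ]


quads = provenance_quads(
    graph="http://example.org/graph/dataset-a",       # hypothetical graph IRI
    source="http://example.org/dataset-a/download",   # hypothetical source IRI
    license_iri="http://creativecommons.org/publicdomain/zero/1.0/",
)
print("\n".join(quads))
```

Storing the description inside the graph it describes is a design choice made here for brevity; a separate metadata graph would work just as well.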


App #1: Spidering the semantic web

Spjuth, O et al. JChemInf 2013 5:14.


App #2: Making a web

http://rdf.openmolecules.net/


App #3: Open PHACTS Explorer

http://www.openphacts.org/ → room 349, 2:20pm


How #1: RDF Graphs


PREFIX cheminf: <http://semanticscience.org/resource/>

SELECT ?graph ?p ?o WHERE {
  GRAPH ?graph {
    ?mol cheminf:CHEMINF_000200 [
        a cheminf:CHEMINF_000059 ;
        cheminf:SIO_000300 "$inchikey"
      ] ;
      ?p ?o .
  }
}
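The `$inchikey` in the query reads like a template placeholder. As a minimal sketch, assuming it is filled by plain string substitution before the query is sent, Python's `string.Template` can do this; the `build_query` helper and the format check are illustrative additions, not part of the original slides.

```python
import re
from string import Template

# The query from the slide, with $inchikey left as the template placeholder.
QUERY = Template("""\
PREFIX cheminf: <http://semanticscience.org/resource/>

SELECT ?graph ?p ?o WHERE {
  GRAPH ?graph {
    ?mol cheminf:CHEMINF_000200 [
        a cheminf:CHEMINF_000059 ;
        cheminf:SIO_000300 "$inchikey"
      ] ;
      ?p ?o .
  }
}
""")

# An InChIKey is 14 letters, a hyphen, 10 letters, a hyphen, and 1 letter.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def build_query(inchikey):
    """Fill the placeholder, rejecting input that is not shaped like an InChIKey."""
    if not INCHIKEY.fullmatch(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return QUERY.substitute(inchikey=inchikey)


query = build_query("VNWKTOKETHGBQD-UHFFFAOYSA-N")  # InChIKey of methane
print(query)
```

Checking the input shape first also keeps quotes and other SPARQL syntax out of the substituted literal.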


NanoPub.org


Graph output



Is that it?!? Just an architecture??

Yes, but a simple and flexible one. Keep an eye on my blog. This will happen in the next few months:

1. Aggregate all CCZero/PDDL data around chemical properties
   1. Open Notebook Science (solubility, melting point)
   2. ChemPedia
   3. Crystallography (COD, CrystalEye)
   4. ...

2. Calculate molecular properties with the CDK (and release as CCZero)

3. Host on http://linkedchemistry.info/chembox


CHEMINF ontology


Hastings, J. et al. PLoS ONE 2011 6(10):e25513.


Architecture

Triple Store (e.g. Virtuoso)

Web server (HTML / RDF)

• Graphs
• Explicit license info
• InChI/FixedH


/FixedH ?!?!


Conclusions & Outlook

• We must propagate rights
  – whether open or not!

• We must make things explicit
  – e.g. by using semantics
  – e.g. by using the InChI with FixedH


More information

• @egonwillighagen
• http://chem-bla-ics.blogspot.com/
• http://egonw.github.com/

• http://orcid.org/0000-0001-7542-0286
