An architecture for an Open Science molecular compound database

Egon Willighagen, @egonwillighagen
Dept. of Bioinformatics - BiGCaT - Maastricht University
orcid.org/0000-0001-7542-0286
ACS New Orleans, 9 April 2013, #ACSNola

Upload: egon-willighagen

Post on 27-Jan-2015

DESCRIPTION

The past few years have seen a tremendous leap forward in public compound databases. Both PubChem and ChemSpider have sent a clear message: the chemical sciences can only move forward if we can search existing chemistry. However, the exact Open nature of a “public” database is not always crystal clear. PubChem is mostly public domain but contains proprietary content too, while ChemSpider is mostly proprietary but has Open Data content. Neither is clear about how the Open Data parts of these databases can be used, modified, and redistributed, the three cornerstones of Open Science. We will demo, based on previous work on http://rdf.openmolecules.net/, an architecture where semantic web technologies, the InChI, and Open Source cheminformatics tools are used to create a Panton Principles-compliant compound database to aid the next-generation public databases. Standards proposed in the Open PHACTS community will be used to specify links between this new resource and other databases, and to provide compound properties. All this input will be available with provenance on the origin of that data, as separate downloadable files, and using ontologies to provide explicit meaning. Using ontologies like ChEBI and CHEMINF, applications in the areas of metabolomics and toxicology will be presented.

TRANSCRIPT

Page 1: An architecture for an Open Science molecular compound database

Department of Bioinformatics - BiGCaT 1

An architecture for an Open Science molecular compound database

Egon Willighagen, @egonwillighagen
Dept. of Bioinformatics - BiGCaT - Maastricht University

orcid.org/0000-0001-7542-0286

ACS New Orleans, 9 April 2013, #ACSNola

Page 2: An architecture for an Open Science molecular compound database

This session: Public Databases ...

• Public: what's that?
  – Free access?
  – Redistribute?
  – Modify?

• BTW, what is “Open Access” ???

Page 3: An architecture for an Open Science molecular compound database

This session: Serving the community...

• Service
  – What do people want?
  – Do they know what is possible?

• Community
  – Who are they? → Personas!
  – Usability must include learnability

Page 4: An architecture for an Open Science molecular compound database

Personas

• Not every scientist is alike
• You cannot and must not target one user
• Instead, target at least 2 different users, particularly:
  – The hacker doing all the actual bioinformatics in the lab
  – The professor who has too little time to understand things outside his narrow field

Page 5: An architecture for an Open Science molecular compound database

Reason #1: Bioclipse decision support

Spjuth, O. et al. JCIM 2011 51(8):1840-1847.

Page 6: An architecture for an Open Science molecular compound database

Data #1: Linked Open Drug Data

Samwald, M. et al. Linked open drug data for pharmaceutical research and development. JChemInf 2011.

Page 7: An architecture for an Open Science molecular compound database

Data^2: Linked Open Data

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Sept 2011, CC-BY-SA.

Page 8: An architecture for an Open Science molecular compound database

Linked Open Data in the Life Sciences

Page 9: An architecture for an Open Science molecular compound database

WikiPathways

Pico, A.R. et al. PLoS Biol. 2008 6(7):e184.

Page 10: An architecture for an Open Science molecular compound database

PathVisio: Pathway Analysis

Van Iersel, M. et al. BMC Bioinfo. 2008 9(1):399.

Page 11: An architecture for an Open Science molecular compound database

Reason #2: Publishing

• Journals will increasingly require data deposition
  – e.g. BioMed Central:

Page 12: An architecture for an Open Science molecular compound database

Needs

• We must propagate rights
  – whether open or not!

• We must make things explicit
  – e.g. by using semantics
  – e.g. by using the InChI

Page 13: An architecture for an Open Science molecular compound database

Tool #1: licensing

Page 14: An architecture for an Open Science molecular compound database

Open Data #1: crystallography

Page 15: An architecture for an Open Science molecular compound database

Open Data #2: Open Notebook Science

Page 16: An architecture for an Open Science molecular compound database

Open Data #3: CrystalEye

Page 17: An architecture for an Open Science molecular compound database

Licensing → Open not Required

• But not providing info is a killer
  – no, not really, because no scientist seems to care
  – yes, because how will a machine cope? Think scalability and massive data integration efforts

Page 18: An architecture for an Open Science molecular compound database

Why does explicit licensing matter?

Because when there is a fire, you want immediate access to the fire hose. You do not want to wait for permission from the mayor.

Because when you want to validate your scientific results, you want immediate access to related data. You do not want to wait for permission from that professor who is on a conference tour for the next 4 weeks. You must have an immediate answer, whatever it is.
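
The "immediate answer, whatever it is" can be sketched in a few lines of Python: when every data set states its license explicitly, a machine can decide on reuse without asking anyone. The license identifiers, the permission table, and the function names below are assumptions made for illustration; they are not taken from the talk and are not legal advice.

```python
from typing import Optional

# Illustrative table (an assumption of this sketch, not a legal reference):
# license identifier -> may the data be redistributed without asking first?
REDISTRIBUTION_ALLOWED = {
    "CC0-1.0": True,
    "PDDL-1.0": True,
    "CC-BY-SA-3.0": True,  # allowed, under attribution and share-alike terms
    "proprietary": False,  # hypothetical label for "ask the owner"
}


def may_redistribute(license_id: Optional[str]) -> Optional[bool]:
    """Return True or False for a known license, None when no answer exists."""
    if license_id is None:
        return None
    return REDISTRIBUTION_ALLOWED.get(license_id)


def select_reusable(records):
    """Keep only the records whose stated license allows redistribution."""
    return [r for r in records if may_redistribute(r.get("license")) is True]


if __name__ == "__main__":
    records = [
        {"id": "a", "license": "CC0-1.0"},
        {"id": "b", "license": "proprietary"},
        {"id": "c"},  # no license stated: the machine cannot decide
    ]
    print([r["id"] for r in select_reusable(records)])
```

A record without license information is treated the same way as a closed one: the machine has no answer, so the data is left out of the integration.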

Page 19: An architecture for an Open Science molecular compound database

Tool #2: Semantic Web to the rescue

• Allows provenance
  – provides where data came from
  – tells us our rights

Page 20: An architecture for an Open Science molecular compound database

App #1: Spidering the semantic web

Spjuth, O et al. JChemInf 2013 5:14.

Page 21: An architecture for an Open Science molecular compound database

App #2: Making a web

http://rdf.openmolecules.net/

Page 22: An architecture for an Open Science molecular compound database

App #3: Open PHACTS Explorer

http://www.openphacts.org/ → room 349, 2:20pm

Page 23: An architecture for an Open Science molecular compound database

How #1: RDF Graphs

Page 24: An architecture for an Open Science molecular compound database

How #1: RDF Graphs

PREFIX cheminf: <http://semanticscience.org/resource/>

# "$inchikey" is a placeholder for the InChIKey to look up
SELECT ?graph ?p ?o WHERE {
  GRAPH ?graph {
    ?mol cheminf:CHEMINF_000200 [
        a cheminf:CHEMINF_000059 ;
        cheminf:SIO_000300 "$inchikey"
      ] ;
      ?p ?o .
  }
}
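
The "$inchikey" placeholder in the query has to be filled in before the query is sent to a SPARQL endpoint. A minimal Python sketch of that step follows, using only the standard library. The function name build_query and the choice to accept only standard 27-character InChIKeys are assumptions of this sketch, not something the slides specify; how the query is then submitted is left out.

```python
import re
from string import Template

# The query from the slide; "$inchikey" is its only placeholder.
QUERY_TEMPLATE = Template("""\
PREFIX cheminf: <http://semanticscience.org/resource/>

SELECT ?graph ?p ?o WHERE {
  GRAPH ?graph {
    ?mol cheminf:CHEMINF_000200 [
        a cheminf:CHEMINF_000059 ;
        cheminf:SIO_000300 "$inchikey"
      ] ;
      ?p ?o .
  }
}
""")

# Assumption: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def build_query(inchikey: str) -> str:
    """Return the SPARQL query for one InChIKey, rejecting malformed keys."""
    if not INCHIKEY_PATTERN.fullmatch(inchikey):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return QUERY_TEMPLATE.substitute(inchikey=inchikey)


if __name__ == "__main__":
    # InChIKey of methane
    print(build_query("VNWKTOKETHGBQD-UHFFFAOYSA-N"))
```

Checking the key before substitution also keeps quotes and braces out of the query text, so a malformed input cannot change the shape of the query.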

Page 25: An architecture for an Open Science molecular compound database

NanoPub.org

Page 26: An architecture for an Open Science molecular compound database

Graph output

orcid.org/0000-0001-7542-0286

Page 27: An architecture for an Open Science molecular compound database

Is that it?!? Just an architecture??

Yes, but a simple and flexible one. Keep an eye on my blog. This will happen in the next few months:

1. Aggregate all CCZero/PDDL data around chemical properties
   1. Open Notebook Science (solubility, melting point)
   2. ChemPedia
   3. Crystallography (COD, CrystalEye)
   4. ...
2. Calculate molecular properties with the CDK (and release as CCZero)
3. Host on http://linkedchemistry.info/chembox

Page 28: An architecture for an Open Science molecular compound database

CHEMINF ontology

orcid.org/0000-0001-7542-0286

Hastings, J. et al. PLoS ONE 2011 6(10):e25513.

Page 29: An architecture for an Open Science molecular compound database

Architecture

Triple Store (e.g. Virtuoso)

Web server (HTML / RDF)

• Graphs
• Explicit license info
• InChI/FixedH
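
The three ingredients of this architecture (one graph per source, explicit license info per graph, and an InChI-derived key to join on) can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class names, graph names, and property names are hypothetical, and a real deployment would use a triple store such as Virtuoso rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class NamedGraph:
    """One source data set: its license plus (inchikey, property, value) facts."""
    license: str
    facts: list = field(default_factory=list)


class TinyGraphStore:
    """Toy stand-in for a triple store with named graphs (hypothetical)."""

    def __init__(self):
        self.graphs = {}

    def add_graph(self, name, license):
        """Register a source graph together with its explicit license."""
        self.graphs[name] = NamedGraph(license)

    def add(self, graph, inchikey, prop, value):
        """Store one fact about a compound in the given source graph."""
        self.graphs[graph].facts.append((inchikey, prop, value))

    def lookup(self, inchikey):
        """Return (graph, license, property, value) for each fact about a compound."""
        return [
            (name, g.license, prop, value)
            for name, g in self.graphs.items()
            for key, prop, value in g.facts
            if key == inchikey
        ]


if __name__ == "__main__":
    store = TinyGraphStore()
    store.add_graph("example:source-a", "CC0-1.0")   # hypothetical graph names
    store.add_graph("example:source-b", "PDDL-1.0")
    methane = "VNWKTOKETHGBQD-UHFFFAOYSA-N"
    store.add("example:source-a", methane, "label", "methane")
    store.add("example:source-b", methane, "formula", "CH4")
    for row in store.lookup(methane):
        print(row)
```

Every answer carries the graph it came from and that graph's license, which is the same information the SPARQL query retrieves through its ?graph variable.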

Page 30: An architecture for an Open Science molecular compound database

/FixedH ?!?!

Page 31: An architecture for an Open Science molecular compound database

Conclusions & Outlook

• We must propagate rights
  – whether open or not!

• We must make things explicit
  – e.g. by using semantics
  – e.g. by using the InChI with FixedH

Page 32: An architecture for an Open Science molecular compound database

More information

• @egonwillighagen
• http://chem-bla-ics.blogspot.com/
• http://egonw.github.com/
• http://orcid.org/0000-0001-7542-0286