dealing with the complex challenge of managing diverse chemistry data online

Dealing with the complex challenge of managing diverse chemistry data online Antony Williams, Valery Tkachenko, Alexey Pshenichnov and Ken Karapetyan ACS San Francisco August 2014

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015


DESCRIPTION

The Royal Society of Chemistry has provided access to data associated with millions of chemical compounds via our ChemSpider database for over 5 years. During this period the richness and complexity of the data have continued to expand dramatically, and the original vision of providing an integrated hub for structure-centric data has been delivered across the world to hundreds of thousands of users. With the intention of expanding the reach to cover more diverse aspects of chemistry-related data, including compounds, reactions and analytical data, to name just a few data types, we are in the process of implementing a new architecture to build a Chemistry Data Repository. The data repository will manage the challenges of associated metadata and the various levels of required security (private, shared and public), and will expose the data as appropriate using semantic web technologies. Ultimately this platform will become the host for all chemicals, reactions and analytical data contained within RSC publications, and specifically in supplementary information. This presentation will report on how our efforts to manage chemistry-related data have impacted chemists and projects across the world and will review specifically our contributions to projects involving natural products for collaborators in Brazil and China, for the Open Source Drug Discovery project in India, and our collaborations with scientists in Russia.

TRANSCRIPT

Page 1: Dealing with the complex challenge of managing diverse chemistry data online

Dealing with the complex challenge of managing diverse chemistry data online

Antony Williams, Valery Tkachenko, Alexey Pshenichnov and Ken Karapetyan

ACS San Francisco

August 2014


CAS Counter http://www.cas.org/content/counter


About Me…as a Chemist

• I’ve performed a few dozen chemical syntheses

• I’ve run thousands of analytical spectra

• I’ve generated thousands of NMR assignments

• I’ve probably published <5% of all work

• Most of it has been lost

• But things can be different today….

• But it still needs to be associated with me…


Think about chemistry a mo’

• If we imagine that permission exists… (i.e. forget IP, chemical and pharma companies etc… think students…)
– How many syntheses are performed
– How many spectra are run
– How many properties are measured
– How many compounds are made
– How many, how much, how big??.....
– Let’s go manage it all!!


Consider a shift to Openness


Times have changed…

Open Access funder mandates…


Publishers are responding


The world of Open Data is here


Open Data are everywhere

• Are Openness and Social Sharing changing the world?

• Cultural experiments in Open Data and exchange occur almost daily

• Mobile platforms enhance participation

• And then what of Chemistry Data???


An Experiment - ChemSpider

• ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data

• Successful experiment in terms of building a central hub for integrated web search

• More people are “users” than “contributors”

• Yet basic feedback and game-play help


An Experiment - CSSP


An EPSRC Call

“…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”


National Chemical Database Service


We set a vision…

• Manage “all” of the chemistry data associated with chemical substances – PUBLISHED and UNPUBLISHED

• Based on user-selected licensing, the data are to be downloadable, reusable and interactive

• Build a platform that enables the scientist
– Data storage, validation, standardization and curation
– Collaborative data sharing

• Provide a data platform that can enable and enhance publishing of scientific papers


Data Repository

• Registration of chemical compounds

• Deposition of chemical syntheses

• Addition of analytical data

• Integration to electronic notebooks

• Rewards and recognition for data sharing

• Document processing

• Hosting of data as private, embargoed or public


Development of Data Repository

• Data repository should not just be a data dump – should not be a “big disk”

• Searchable, integrated, segregated repository of data types

• Data access including private, shared, embargoed and public

• Delivery of derived models from data


New Repository Architecture

doi: 10.1007/s10822-014-9784-5


New Repository Architecture

• Data tier: Compounds, Reactions, Spectra, Materials, Documents

• Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API

• User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets

• User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, Paid 3rd party integrations (various platforms – SharePoint, Google, etc.)
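
The layering in this architecture slide can be sketched in a few lines of Python. This is only an illustration of the idea that widgets reach data through an API rather than touching the store directly; it is not the RSC implementation, and every name here (`CompoundStore`, `CompoundsAPI`, `compound_widget`, the record fields) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundStore:
    """Data tier: a hypothetical in-memory store of compound records."""
    records: dict = field(default_factory=dict)

    def add(self, compound_id: str, name: str, formula: str) -> None:
        self.records[compound_id] = {"name": name, "formula": formula}


class CompoundsAPI:
    """Data access tier: the only route by which upper tiers reach the store."""

    def __init__(self, store: CompoundStore) -> None:
        self._store = store

    def get(self, compound_id: str) -> dict:
        # Return a copy so callers cannot mutate the stored record.
        return dict(self._store.records[compound_id])


def compound_widget(api: CompoundsAPI, compound_id: str) -> str:
    """User interface components tier: renders one record as display text."""
    record = api.get(compound_id)
    return f"{record['name']} ({record['formula']})"


store = CompoundStore()
store.add("c1", "benzene", "C6H6")
api = CompoundsAPI(store)
print(compound_widget(api, "c1"))
```

An application in the user interface tier (for example a notebook or inventory tool) would then be assembled from such widgets, one family per data type.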


Input data pipeline

• Deposition routes: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; LabTrove and other templated data; Documents; API, FTP, etc.

• Deposition Gateway

• Staging databases (raw data)

• Processing modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module

• Databases (validated data): Compounds, Reactions, Spectra, Materials, Articles / CSSP

• All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
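
The per-slice security model described on this slide could look roughly like the following. It is a minimal sketch under stated assumptions: the slide only names the three states, so the `DataSlice` fields, the `can_read` function and the rule that an embargo lifts on a fixed date are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    """One data source/collection with a single visibility setting."""
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # assumed: only used when embargoed


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Decide whether `user` may read the slice on the given day."""
    if user == data_slice.owner:
        return True  # owners always see their own data
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        # Assumption: an embargoed slice opens once its date has passed.
        return (data_slice.embargo_until is not None
                and today >= data_slice.embargo_until)
    return False  # private slices are owner-only
```

Keeping the decision at the level of a whole slice, rather than per record, is what makes the model simple: one flag governs everything deposited from a given source.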


Compounds


Reactions


Analytical data


Crystallography data


For Deposition of Data

• Quality of data at source

• ensuring chemicals are correct – VALIDATION

• reactions map and balance as appropriate – VALIDATION and STANDARDIZATION

• file format handling for analytical data types – binary file formats are proprietary – STANDARDIZATION

• valid interpretation of data – VALIDATION and ANNOTATION
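
As one concrete reading of "reactions balance", the sketch below counts atoms on each side of a reaction from molecular formulas. It is an illustrative check only, not the repository's validation code: the function names are hypothetical, and the parser handles only simple formulas without parentheses, charges or hydrates.

```python
import re
from collections import Counter


def atom_counts(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C2H6O' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(reactants, products) -> bool:
    """True when both sides of the reaction contain the same atoms."""
    left, right = Counter(), Counter()
    for formula in reactants:
        left.update(atom_counts(formula))
    for formula in products:
        right.update(atom_counts(formula))
    return left == right


# Methane combustion, with each molecule listed once per stoichiometric unit.
print(is_balanced(["CH4", "O2", "O2"], ["CO2", "H2O", "H2O"]))
```

A real deposition check would also need atom-to-atom mapping, which this sketch does not attempt.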



Depositions Gateway User Interface


Deposition of Data


Validate and Standardize


CVSP Filtering


CVSP Filtering of DrugBank


ChEMBL (1.3 million records)

• 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973

• 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine

• 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
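
Findings of this kind come from rule-based checks over each structure. The sketch below shows the general shape of one such rule, flagging neutral atoms that carry more bonds than their usual valence. It is a deliberately simplified illustration and not CVSP's actual rule set: the valence table, the tuple layout and the function name are assumptions, and charged atoms are skipped rather than checked.

```python
# Usual maximum valence for a few neutral elements (illustrative subset).
MAX_NEUTRAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2}


def find_valence_issues(atoms):
    """Return (index, element, bond_order_sum) for suspicious neutral atoms.

    `atoms` is a list of (element, formal_charge, bond_order_sum) tuples.
    """
    issues = []
    for index, (element, charge, bond_order_sum) in enumerate(atoms):
        limit = MAX_NEUTRAL_VALENCE.get(element)
        if limit is None or charge != 0:
            continue  # unknown elements and charged atoms are not judged here
        if bond_order_sum > limit:
            issues.append((index, element, bond_order_sum))
    return issues


example = [("N", 0, 4), ("C", 0, 4), ("O", 0, 3), ("N", 1, 4)]
print(find_valence_issues(example))
```

In this example the neutral nitrogen with four bonds and the neutral oxygen with three bonds are flagged, while the positively charged nitrogen is left alone.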


Depositions User Interface


The challenges of analytical data

• Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AnIML)

• ChemSpider already hosts thousands of JCAMP spectra

• Support of “assigned spectra” in place

• Data validation approaches understood

• There are a myriad of analytical data types…
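
To show why a text-based standard such as JCAMP-DX is easier to handle than a proprietary binary file, here is a small sketch that reads the labelled header records (`##LABEL=value`) of a JCAMP-DX file. The function name and the sample file content are made up for illustration, and the parser is simplified: it does not decode the data table or apply the full label-normalisation rules of the standard.

```python
def parse_jcamp_header(text: str) -> dict:
    """Collect ##LABEL=value records from JCAMP-DX text into a dict."""
    header = {}
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()  # "$$" starts a comment
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and blank lines are skipped
        label, value = line[2:].split("=", 1)
        header[label.strip().upper()] = value.strip()
    return header


# A made-up, minimal file used only to exercise the parser.
SAMPLE = """##TITLE=example spectrum
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##NPOINTS=3  $$ number of points in the table
##XYDATA=(X++(Y..Y))
0 1 2 3
##END=
"""

header = parse_jcamp_header(SAMPLE)
print(header["TITLE"], header["NPOINTS"])
```

Because the metadata are plain labelled text, a repository can validate units, point counts and data type before storing the spectrum.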


ChemSpider ID 24528095 H1 NMR


ChemSpider ID 24528095 C13 NMR


ChemSpider ID 24528095 HHCOSY


ChemSpider ID 24528095 HSQC


ChemSpider ID 24528095 HMBC


Managing Assignments?


Depositions User Interface


Depositions from ELNs

• Development work integrating chemistry into the Southampton LabTrove notebook

• Stoichiometry table development

• Analytical data integration

• “ChemTrove” rolled out to a small test group in January


Document deposition/processing


Experimental data checker


User Interface Approach




Display Widgets


Work in Progress


Work in Progress



• Analytical Chemist: Characterize, Measure, Search, Store

• Synthetic Chemist: Search (synthetic procedure), Document (publish synthetic procedure), Retrosynthetic analysis

• Relationships: three <<include>> links between use cases


A Compounds Repository Interface


A Reactions/Document Interface


The PharmaSea Website


The Open PHACTS community ecosystem


Open Source Drug Discovery India


What can drive participation?

• What can drive scientists to participate and contribute?

• Ensuring provenance of their data for reuse

• Mandates from funding agencies

• Improved systems to ease contribution

• Additional contributions to science

• Improved publishing processes

• Recognition for contributions


Scientists are Increasingly Quantified…


AltMetrics as Scientist Impact


AltMetrics


Detailed Usage Statistics


Rewards and Recognition

Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.

The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
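
The badge rule stated above is simple enough to write down directly. This is a hypothetical sketch of how such a rule might be evaluated, not the CSSP code; the function name and the idea of passing a count of published articles are assumptions.

```python
def earned_badges(published_article_count: int) -> list:
    """Return the badges a user holds, given their published CSSP articles."""
    badges = []
    if published_article_count >= 1:
        # Awarded once the user's first submitted article has been published.
        badges.append("First Step")
    return badges


print(earned_badges(1))
```

Further badges would be added as extra rules in the same function, each tied to a threshold or event.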


http://orcid.org/0000-0002-2668-4821


AltMetrics Feeds

• For our data repository, ensure that contributions of data feed out to the AltMetrics platforms

• Every data point, every data download, use and reuse will be associated with the scientist

• Data will be DOI’ed (presently under review)

• Services provided will allow for AltMetrics use
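
Associating every use of a dataset with its contributor amounts to keeping a usage log keyed by a researcher identifier. The sketch below tallies such events per ORCID iD. It is illustrative only: the event tuple layout, the event names and the dataset identifiers are hypothetical, and no real AltMetrics service is called.

```python
from collections import Counter, defaultdict


def tally_events(events):
    """Count usage events per contributor.

    `events` is an iterable of (orcid, dataset_id, kind) tuples.
    """
    per_scientist = defaultdict(Counter)
    for orcid, _dataset_id, kind in events:
        per_scientist[orcid][kind] += 1
    return per_scientist


events = [
    ("0000-0002-2668-4821", "dataset-1", "download"),
    ("0000-0002-2668-4821", "dataset-2", "download"),
    ("0000-0002-2668-4821", "dataset-1", "reuse"),
]
totals = tally_events(events)
print(dict(totals["0000-0002-2668-4821"]))
```

Totals of this kind are what a feed to an external metrics platform would report for each scientist.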


What do we have in place?

• We are testing an early form of the data repository on our data – ChemSpider and our archive of publications

• Working with collaborators to define needs

• Testing and enhancing deposition systems

• Chemical validation & standardization platform

• Analytical data handling formats

• And lots in development…


The Challenges Ahead

• Chemistry is NOT just nicely defined structures!
– Materials, minerals, attached to beads, polymers, ambiguous materials

• Domain-specific measurements
– File format standards are limited in application

• Encouraging scientists to free up their data
– AltMetrics, open data mandates, systems

• The data explosion continues


But it’s not easy of course

• Not everything we would like for data handling is in place yet

• Many systems, tools and platforms are already available, but we don’t know about them, or even if we did, contributing is “more work”

• “What’s in it for me?”, “It’s my data”, “It’s too much work”, “What credit do I get?”


And yes…we know…


Thank you

Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams