Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data?

Post on 26-Jun-2015

Category:

Technology

DESCRIPTION

This presentation was given at the Wolfram Data Summit in Washington, DC on September 9th, 2010 as part of a panel series of presentations and discussions on crowdsourcing approaches for data. It was a rant by me on the quality of what's online, asking "who cares?"

TRANSCRIPT

Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data?

Antony Williams
Wolfram Summit, September 2010

A Pragmatic Vision

“Build a Structure Centric Community”

Integrate chemistry across the internet based on “chemical structure”

A “structure-based hub” to information and data
Let chemists contribute their own data
Allow the community to curate/correct data

www.chemspider.com

We Answer Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Search for a Chemical…by name

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Available Information….

Search for chemicals

ChemSpider Today

24.8 million structures
400 data sources
Grows daily
Community annotation and curation

We curate, edit, change, enhance data daily

Linked Data on the Web

Three Years of Experience

Internet-based chemistry is a mess!

Most public compound databases on the web are contaminated. Including ours!

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares?

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

What is the Structure of Vitamin K?

MeSH – Medical Subject Headings

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

What is the Structure of Vitamin K1?

Chemical Abstracts “Common Chemistry” Database

Wikipedia

Incorrect Structures

Wow!

Lack of Stereochemistry

Does stereochemistry matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
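To make the point about stereochemistry concrete, here is a minimal Python sketch that works only on the truncated name fragments shown on the slide. It assumes the stereodescriptors sit in a leading "[(...)-" group, which holds for these fragments but is not a general name parser. The helper names stereo_descriptors and strip_stereo are made up for this illustration.

```python
import re

# Name fragments exactly as they appear (truncated) on the slide.
variants = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Assumption: descriptors are written as "[(E,7R,11R)-" in front of the locants.
STEREO = re.compile(r"\[\(([^)]*)\)-")


def stereo_descriptors(name):
    """Return the stereodescriptors in a fragment, or () if none are given."""
    match = STEREO.search(name)
    return tuple(match.group(1).split(",")) if match else ()


def strip_stereo(name):
    """Drop the descriptor group so only the stereo-free skeleton name remains."""
    return STEREO.sub("(", name)


skeletons = {strip_stereo(v) for v in variants}
descriptor_sets = {stereo_descriptors(v) for v in variants}

print(len(skeletons), "skeleton name;", len(descriptor_sets), "distinct descriptor sets")
for v in variants:
    print(stereo_descriptors(v) or "(no stereo)", "->", strip_stereo(v))
```

Every fragment collapses to the same stereo-free skeleton, while the descriptor groups differ, which is the ambiguity the slide is pointing at: a name without stereochemistry cannot tell these records apart.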

ChEBI – Manual Curation

What’s Methane?

What ELSE is Methane???

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Internet-Based Chemistry is a Mess

Algorithms can get you so far

Human curation is necessary

Only the crowds can help with big data…

ChemSpider is approaching 25 million compounds

Search “Vitamin H”

“Curate” Identifiers

General curation activities

Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually

130 people have participated in validation or annotation. “Crowds” can be quite small!
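One way to picture a validated structure-identifier relationship is as a small record that collects approvals and rejections from robots and human curators. The Python sketch below is purely hypothetical: the class IdentifierLink, its fields, the majority rule, and the example record and curator names are assumptions made for illustration, not a description of how ChemSpider stores or validates its data.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierLink:
    """One name attached to one structure record (hypothetical model)."""

    structure_id: str
    name: str
    votes: list = field(default_factory=list)  # (curator, approved) pairs

    def vote(self, curator, approved):
        """Record one curator's (or robot's) verdict on this name."""
        self.votes.append((curator, approved))

    @property
    def status(self):
        """Simple majority rule; an assumption, not a documented policy."""
        if not self.votes:
            return "unvalidated"
        approvals = sum(1 for _, ok in self.votes if ok)
        rejections = len(self.votes) - approvals
        if approvals > rejections:
            return "validated"
        if rejections > approvals:
            return "rejected"
        return "disputed"


# Placeholder record and curators, invented for the example.
link = IdentifierLink(structure_id="record-1", name="Vitamin H")
link.vote("robot", True)
link.vote("curator_a", True)
link.vote("curator_b", False)
print(link.name, "->", link.status)
```

Even this toy model shows why a small crowd can be enough: each relationship needs only a handful of verdicts before its status settles.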

Crowdsourced “Annotations”

Registered Users can add

Descriptions/Syntheses/Commentaries
Links to articles, blogs, wikis etc
Add spectral data
Add photos
Add MP3 files
Add Videos

Data Validation – Not Vitamin K1

Data Validation – Not Beclomethasone Dipropionate

DailyMed Article

Data Validation …NOT Cholesterol

Data Validation – ONE Cymarin

Question Quality in Big Databases

First request to Database Hosts!

Every public compound database host should add ONE feature – “Leave Comments”

Second request to Database Hosts!

Show Comments

Always Question Online Chemistry

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
