online public compound databases

Online Public Compound Databases Antony Williams

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015



DESCRIPTION

This is a workshop I gave on "Online Public Compound Databases" at the BCCE in Dallas, Texas on August 3rd 2010. It is an overview of online resources, InChI, linking data, online data quality, searching and ChemSpider.

TRANSCRIPT

Page 1: Online Public Compound Databases

Online Public Compound Databases

Antony Williams

Page 2: Online Public Compound Databases

Introductions…

Hi… I’m Antony Williams, ChemSpiderman

NMR Spectroscopist by training
Worked in gov’t lab, academia, Fortune 500, start-up, founded ChemSpider, now work for the Royal Society of Chemistry

I am the host of ChemSpider…
25 million compounds
Linked to 400 data sources
You’ll hear more…

Page 3: Online Public Compound Databases

What’s your interest in Public Compound DBs?

What public compound databases do you use?
What are you looking to find?
What proprietary databases do you presently use?
What do you trust?
Why are for-fee databases not enough?
What issues do you have with free chemistry databases/resources online?

What would the ideal solution provide?

Page 4: Online Public Compound Databases

Content is King and Quality Costs

Chemistry “content” is big money: patent searching, structures and properties, drug databases, literature databases

Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information:
103 years of content
$260 million revenue (2006)
>50 million substances
>60 million sequences

Page 5: Online Public Compound Databases

What’s the Status of Chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases (eMolecules)
Metabolic pathway databases (WikiPathways)
Virtual Screening databases (ZINC DB)
Property databases (Beilstein)
Screening assay results (PubChem)
Patents with chemical structures (SureChem)
ADME/Tox data (OEChem)
Scientific publications (Many publishers)
Compound aggregators (ChemSpider)
Blogs/Wikis and Open Notebook Science (Many)

Page 6: Online Public Compound Databases

Synthesis Blogs…TotallySynthetic.com

Page 7: Online Public Compound Databases

Org Prep Daily

Page 8: Online Public Compound Databases

Molbank (Open Access Journal)

Page 9: Online Public Compound Databases

Synthetic Pages (Website)

Page 10: Online Public Compound Databases

Lots of “Public Compound” Databases

PubChem
Drugbank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
ChemSpider
Lots of chemical vendors

What’s missing? What do you use online?

Page 11: Online Public Compound Databases

Where Would You look? What Do You Trust?

Page 12: Online Public Compound Databases

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

Page 13: Online Public Compound Databases

What is a compound?

Page 14: Online Public Compound Databases

Connections Can Lead Anywhere

Page 15: Online Public Compound Databases

Where Would You look? What Do You Trust?

Page 16: Online Public Compound Databases

PubChem

Page 17: Online Public Compound Databases

PubChem

PubChem is “a repository of screening data”

BUT it also holds data from publishers, vendors and lots of other chemical data holders: almost 30 million compounds

Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names

PubChem is a repository – non-curated, no way to annotate or clean data. Data are free, not “open”

Page 18: Online Public Compound Databases

LIVE DEMO of PubChem

Name a chemical compound – search and review

Next slides: Methane, Diamond, Vancomycin, Taxol, Cholesterol

Page 19: Online Public Compound Databases

Chemistry on The Internet Is Messy

Page 20: Online Public Compound Databases

It’s Methane…

Page 21: Online Public Compound Databases

What’s Methane?

Page 22: Online Public Compound Databases

What’s Methane?

Page 23: Online Public Compound Databases

What ELSE is Methane???

Page 24: Online Public Compound Databases

The Challenges of Internet Data

Text-based searches commonly will get you to “representative data”

Accurate chemical structures are hard to find!

Wikipedia IS a good source of accurate chemistry data. Not perfect, but good. See Tacrolimus

Tell the story of Domoic Acid – Next Slide

Page 25: Online Public Compound Databases

The EXPERTS must get it right?!

Page 26: Online Public Compound Databases

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Page 27: Online Public Compound Databases

Feedback from Steve Ritter

“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.

“…we always double check them against one or more primary sources, typically Merck Index and SciFinder.

Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

Page 28: Online Public Compound Databases

Feedback from Steve Ritter

“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

Page 29: Online Public Compound Databases

The Challenges of Internet Data

Text-based searches commonly will get you to “representative data”

Accurate chemical structures are hard to find!

Wikipedia IS a good source of accurate chemistry data. Not perfect, but good. See Tacrolimus

Tell the story of Domoic Acid

Unfortunately, question everything

Page 30: Online Public Compound Databases

Question Everything online: www.dhmo.org

Page 31: Online Public Compound Databases

The FDA’s DailyMed

Page 32: Online Public Compound Databases

Structures on DailyMed

Page 33: Online Public Compound Databases

Lack of Stereochemistry

Page 34: Online Public Compound Databases

Does Stereochemistry Matter?

Page 35: Online Public Compound Databases

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon

Page 36: Online Public Compound Databases

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Page 37: Online Public Compound Databases

Incorrect Structures

Page 38: Online Public Compound Databases

Wow!

Page 39: Online Public Compound Databases

Collaborative Knowledge Management

Page 40: Online Public Compound Databases

Taxol on PubChem

Page 41: Online Public Compound Databases

Drugbank

Page 42: Online Public Compound Databases

Digitonin? More Crowdsourcing…

Page 43: Online Public Compound Databases

Comments on the Blog

September 15th, 2009 at 1:57 pm

It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to http://www3.interscience.wiley.com/journal/20330/abstract

“…for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.”

Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α).

Page 44: Online Public Compound Databases

CAS as an authority

Page 45: Online Public Compound Databases

The Blogging Community Participates

Page 46: Online Public Compound Databases

Will it ever end?

The community says the structure of digitonin has “up” 20-Methyl.

If so, then multiple substances related to digitonin have OPPOSITE stereo at 20-Methyl

The spirostane skeleton is considered to have a “down” Methyl group so all spirostane-related structures would be wrong

The ACD/Dictionary has 24 structures with close skeleton and all have the “down” Methyl group.

Page 47: Online Public Compound Databases

Chemistry is REALLY Messy

Page 48: Online Public Compound Databases

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Page 49: Online Public Compound Databases

An Introduction to the InChI Identifier

Page 50: Online Public Compound Databases

Multiple Layers
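The slide gives only the title, so the following is a minimal sketch of what "multiple layers" means in practice, not a full parser. An InChI string is a sequence of layers separated by `/`: a version prefix, the molecular formula, then layers marked by a leading letter. The function name `split_inchi` and the layer labels are assumptions made for illustration; the sketch assumes a formula layer is present and ignores repeated layers (for example stereo layers that reappear after an isotopic layer).

```python
# Hypothetical helper: split an InChI string into labelled layers.
# Simplified on purpose; the real InChI grammar has more cases.
LAYER_PREFIXES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi: str) -> dict:
    """Return a dict mapping layer labels to layer contents."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Assumption: the second field is the molecular formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        label = LAYER_PREFIXES.get(part[:1], "other:" + part[:1])
        layers[label] = part[1:]
    return layers


methane = split_inchi("InChI=1S/CH4/h1H4")
alanine = split_inchi(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
print(methane)
print(alanine["tetrahedral stereo"])
```

Methane needs only a formula and a hydrogen layer, while L-alanine also carries stereo layers; losing those last layers is exactly the "lack of stereochemistry" problem shown on the earlier slides.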

Page 51: Online Public Compound Databases

InChI Strings Hash to InChIKeys
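To make the idea of hashing concrete, here is a toy sketch. It is NOT the real InChIKey algorithm: the real key uses its own truncation and letter encoding, plus flag and check characters. The sketch only illustrates the principle that the skeleton layers (formula, connectivity, hydrogens) are hashed into one block and the remaining layers into a second block. `toy_key` and `_letters` are hypothetical names.

```python
import hashlib
import string


def _letters(digest: bytes, n: int) -> str:
    """Map the first n digest bytes to uppercase letters (toy encoding)."""
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:n])


def toy_key(inchi: str) -> str:
    """Illustrative two-block key; not compatible with real InChIKeys."""
    body = inchi.split("/", 1)[1]  # drop the "InChI=1S" prefix
    parts = body.split("/")
    # Block 1 input: formula plus connectivity (/c) and hydrogen (/h) layers.
    skeleton = [p for i, p in enumerate(parts) if i == 0 or p[:1] in ("c", "h")]
    # Block 2 input: everything else (stereo, charge, isotopes, ...).
    rest = [p for i, p in enumerate(parts) if i > 0 and p[:1] not in ("c", "h")]
    first = _letters(hashlib.sha256("/".join(skeleton).encode()).digest(), 14)
    second = _letters(hashlib.sha256("/".join(rest).encode()).digest(), 8)
    return first + "-" + second


# The two enantiomers of alanine differ only in the /m stereo flag.
l_alanine = toy_key("InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1")
d_alanine = toy_key("InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1")
print(l_alanine)
print(d_alanine)
```

With this construction the two keys share their first block and differ in the second, which is the pattern to look for when comparing keys from different databases.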

Page 52: Online Public Compound Databases

InChIs for Taxol

Page 53: Online Public Compound Databases

Back to Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ

Which one is correct???
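The disagreement can be made explicit with a few lines of code. This is a minimal sketch using only the three keys quoted on the slide; the function name `group_by_block` is hypothetical. It groups the sources by the first block of the key (before the hyphen) and by the full key. It does not say which structure is correct; it only shows where the sources diverge.

```python
from collections import defaultdict

# Keys exactly as quoted on the slide.
taxol_keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def group_by_block(keys):
    """Group source names by first key block and by complete key."""
    by_first_block = defaultdict(list)
    by_full_key = defaultdict(list)
    for source, key in keys.items():
        first_block = key.split("-", 1)[0]
        by_first_block[first_block].append(source)
        by_full_key[key].append(source)
    return dict(by_first_block), dict(by_full_key)


skeletons, full = group_by_block(taxol_keys)
print(len(skeletons), "distinct first block(s)")
print(len(full), "distinct full key(s)")
```

All three sources agree on the first block and disagree on the rest, which suggests they share the same skeleton but differ in details such as stereochemistry.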

Page 54: Online Public Compound Databases

Vancomycin – Search the Internet

Page 55: Online Public Compound Databases

Full Molecule Search: 4 Hits

Page 56: Online Public Compound Databases

Full Skeleton Search: 104 Hits
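The gap between 4 and 104 hits comes from how strictly records are matched. The sketch below assumes, as an illustration only, that a "full molecule" search compares the whole InChIKey while a "full skeleton" search compares only the first block; the `search` function and the record list are hypothetical and do not reproduce ChemSpider's implementation. The Taxol keys from the slides stand in for real search data, with methane as an unrelated record.

```python
# Hypothetical in-memory record list: (record name, InChIKey).
records = [
    ("Taxol (DrugBank)", "RCINICONZNJXQF-CLDWUXIMDD"),
    ("Taxol (ChEBI)", "RCINICONZNJXQF-GXKQXQCDDN"),
    ("Taxol (Wikipedia)", "RCINICONZNJXQF-MZXODVADBJ"),
    ("Methane", "VNWKTOKETHGBQD-UHFFFAOYSA-N"),
]


def search(records, query_key, mode="full"):
    """Return record names matching the query key.

    mode="full":     the whole key must match.
    mode="skeleton": only the first block (before the hyphen) must match.
    """
    if mode == "full":
        return [name for name, key in records if key == query_key]
    if mode == "skeleton":
        block = query_key.split("-", 1)[0]
        return [name for name, key in records if key.split("-", 1)[0] == block]
    raise ValueError("mode must be 'full' or 'skeleton'")


query = "RCINICONZNJXQF-CLDWUXIMDD"
full_hits = search(records, query, mode="full")
skeleton_hits = search(records, query, mode="skeleton")
print(len(full_hits), len(skeleton_hits))
```

A skeleton search deliberately over-retrieves: it surfaces every record that might be the same compound drawn with different or missing stereochemistry, which is the set a curator needs to review.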

Page 57: Online Public Compound Databases

Vancomycin on ChemSpider

1 compound – 3 days

Page 58: Online Public Compound Databases

Assertion and Chemical Entities

Who says what Taxol is?

What is the “timeline” for a molecule?

How do we clean up the Public data?

Page 59: Online Public Compound Databases

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ

Structure validation is tough work!

Who is validating chemistry data online?

Page 60: Online Public Compound Databases

Bio-Break

Next Up – QUALITY CHOICES for online data

An introduction to ChemSpider

Crowdsourced Participation and Curation

Page 61: Online Public Compound Databases

Tony’s Quality Choices For Data

Chemical Abstracts Service and Reaxys – not free but definitely high quality!

Wikipedia Chemistry is good

ChEBI (look for “starred” compounds to indicate manual curation)

DSSTox – manually curated EPA database. Very high quality

ChemIDPlus – ongoing curation and good quality

The databases of David Wishart – manually curated. Good but not perfect – DrugBank, HMDB, FooDB and others

Page 62: Online Public Compound Databases

ChEBI: http://www.ebi.ac.uk/chebi/

Chemical Entities of Biological Interest from the European Bioinformatics Institute

Page 63: Online Public Compound Databases

DSSTox: http://www.epa.gov/comptox/dsstox/

Distributed Structure-Searchable Toxicity (DSSTox) database from Ann Richard at the Environmental Protection Agency

Page 64: Online Public Compound Databases

ChemIDPlus – 350,000 Compounds

http://chem.sis.nlm.nih.gov/chemidplus/

Page 65: Online Public Compound Databases

DrugBank: http://www.drugbank.ca/

Page 66: Online Public Compound Databases

And Our Own Work... ChemSpider

ChemSpider is:

Building a Structure Centric Community for Chemists

>25 million compounds, >400 data sources

A deposition and curation platform

A publishing platform for the community

Grows daily – more depositions, more links, more data sources

Page 67: Online Public Compound Databases

How Was ChemSpider Built?

ChemSpider was a “hobby project”

Housed in a basement and running off three servers – one bought, two built

Sensitive to weather and power stability

Went live at ACS Spring 2007 in Chicago

Page 68: Online Public Compound Databases

Search Cholesterol

Page 69: Online Public Compound Databases

Live DEMO

ChemSpider demo…

Page 70: Online Public Compound Databases

Link off a structure in ChemSpider

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 71: Online Public Compound Databases

Answering Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

Page 72: Online Public Compound Databases

Complex Data and Information

Page 73: Online Public Compound Databases

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 74: Online Public Compound Databases

Multi-level Curation and Approval

Page 75: Online Public Compound Databases

ChemSpider SyntheticPages

ChemSpider Synthesis will be a home for all things “synthetic”

An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc.

Public peer-review and feedback for synthetic procedures

Page 76: Online Public Compound Databases

ChemSpider Everywhere: Embed

Page 77: Online Public Compound Databases

ChemSpider Everywhere: Spectral Game

Page 78: Online Public Compound Databases

ChemSpider Everywhere: Crowdsourced Curation of Spectra

Page 79: Online Public Compound Databases

ChemSpider Everywhere: ChemMobi

Page 80: Online Public Compound Databases

Where is ChemSpider Lacking?

More databases coming online monthly

Quality of data remains the primary issue

ChemSpider is limited to “defined chemicals”. No support for:
Polymers
Minerals
Markush structures

Page 81: Online Public Compound Databases

It’s a long road ahead…

Page 82: Online Public Compound Databases

Conclusions

The internet enables chemistry, at a reduced cost

Web 2.0 is here and improving quality

Question Quality!

Crowdsourcing to expand, curate and integrate

InChIs are enabling chemistry on the internet

Page 83: Online Public Compound Databases

Thank you

[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams