Hosting public domain chemicals data online for the community – the challenges of handling materials. Antony Williams, Opportunities in Materials Informatics, University of Wisconsin-Madison, February 9th, 2015. ORCID: 0000-0002-2668-4821

Post on 15-Jul-2015

Page 1: Hosting public domain chemicals data online for the community – the challenges of handling materials

Hosting public domain chemicals data online for the community – the challenges of handling materials

Antony Williams

Opportunities in Materials Informatics, University of Wisconsin-Madison

February 9th, 2015

ORCID: 0000-0002-2668-4821

Page 2: Hosting public domain chemicals data online for the community – the challenges of handling materials

About Me…

• I am NOT a materials chemist

• I am an NMR spectroscopist by training

• Worked on a LIMS while at Kodak

• 10 years in commercial cheminformatics

• Built the ChemSpider database as a hobby

• Worked on validating compounds on Wikipedia

• Manage cheminformatics team for RSC

• Believer in the value of social networking and Open Data for science

• Dane Morgan asked me to tell jokes…

Page 3: Hosting public domain chemicals data online for the community – the challenges of handling materials

I would tell a chemistry joke…

But all of the good ones…

Page 4: Hosting public domain chemicals data online for the community – the challenges of handling materials

An ambitious idea….

• Let’s map together all online chemistry data and build systems to integrate it

• Heck, let’s integrate chemistry and biology data and add in disease data too if we can

• Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative

• Let’s make it all available on the web…for free

Page 5: Hosting public domain chemicals data online for the community – the challenges of handling materials
Page 6: Hosting public domain chemicals data online for the community – the challenges of handling materials

What about this….

• We’re going to map the world

• We’re going to take photos of as many places as we can and link them together

• We’ll let people annotate and curate the map

• Then let’s make it available free on the web

• We’ll make it available for decision making

• Put it on Mobile Devices, give it away…

Page 7: Hosting public domain chemicals data online for the community – the challenges of handling materials

Where is chemistry online?

• Encyclopedic articles (Wikipedia)

• Chemical vendor databases

• Metabolic pathway databases

• Property databases

• Patents with chemical structures

• Drug Discovery data

• Scientific publications

• Compound aggregators

• Blogs/Wikis and Open Notebook Science

Page 9: Hosting public domain chemicals data online for the community – the challenges of handling materials

Chemistry on the Internet…

• Most searching for chemistry on the internet…

• Name searching Google/Bing/Yahoo

• Name searching Wikipedia

• Name searching Wolfram Alpha

• Name, name, name, name…searching

• Structure searching DOZENS of websites, each with different information or…

• Search ONE website integrating the others!

Page 10: Hosting public domain chemicals data online for the community – the challenges of handling materials

• ~30 million chemicals and growing

• Data sourced from >500 different sources

• Crowd sourced curation and annotation

• Ongoing deposition of data from our journals and our collaborators

• Structure centric hub for web-searching

• …and a really big dictionary!!!

• Note…NOT all websites connected

Page 11: Hosting public domain chemicals data online for the community – the challenges of handling materials

ChemSpider

Page 12: Hosting public domain chemicals data online for the community – the challenges of handling materials

ChemSpider

Page 13: Hosting public domain chemicals data online for the community – the challenges of handling materials

ChemSpider

Page 14: Hosting public domain chemicals data online for the community – the challenges of handling materials

Experimental/Predicted Properties

Page 15: Hosting public domain chemicals data online for the community – the challenges of handling materials

Literature references

Page 16: Hosting public domain chemicals data online for the community – the challenges of handling materials

Patents references

Page 17: Hosting public domain chemicals data online for the community – the challenges of handling materials

RSC Books

Page 18: Hosting public domain chemicals data online for the community – the challenges of handling materials

Google Books

Page 19: Hosting public domain chemicals data online for the community – the challenges of handling materials

Vendors and data sources

Page 20: Hosting public domain chemicals data online for the community – the challenges of handling materials

APIs

Page 21: Hosting public domain chemicals data online for the community – the challenges of handling materials

APIs

Page 22: Hosting public domain chemicals data online for the community – the challenges of handling materials

Organic Chemistry is hard…

Page 23: Hosting public domain chemicals data online for the community – the challenges of handling materials

…it has alkynes of trouble

Page 24: Hosting public domain chemicals data online for the community – the challenges of handling materials

Flavors of Chemistry

Page 25: Hosting public domain chemicals data online for the community – the challenges of handling materials

Molfiles

 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END

Page 26: Hosting public domain chemicals data online for the community – the challenges of handling materials

Molfiles

• Molfiles are the primary exchange format between structure drawing packages

• Can be different between different drawing packages

• Most commonly carry X,Y coordinates for layout

• Can support polymers, organometallics, etc.

• Can carry 3D coordinates
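
As a rough illustration of what a V2000 molfile carries, the sketch below reads the counts line, atom block and bond block of the record shown on the previous slide. It is a minimal reader written for this example (the name `parse_v2000` is hypothetical), and it assumes whitespace-separated fields with the three header lines already removed. Real molfiles are fixed-width, so a cheminformatics toolkit is the safer choice in practice.

```python
# Minimal sketch of a V2000 connection-table reader (illustrative only).
# Assumptions: fields are whitespace-separated, the three molfile header
# lines have already been removed, and atom/bond counts are below 100
# (larger counts can run together in the fixed-width counts line).
MOLBLOCK = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""


def parse_v2000(block):
    """Return (atoms, bonds) from a header-less V2000 block."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    # Atom block: x, y, z coordinates followed by the element symbol.
    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    # Bond block: first atom index, second atom index, bond order.
    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in ln.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLBLOCK)
print(len(atoms), "atoms,", len(bonds), "bonds")
```

The atom tuples keep the X,Y (and Z) coordinates, which is the layout information that SMILES and InChI strings do not carry.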

Page 27: Hosting public domain chemicals data online for the community – the challenges of handling materials

SMILES

• SMILES is a common format

• Can support polymers, organometallics, etc.

• Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!

• Generally different between drawing packages

Page 28: Hosting public domain chemicals data online for the community – the challenges of handling materials

Stereo

Page 29: Hosting public domain chemicals data online for the community – the challenges of handling materials

Tautomeric forms

Page 30: Hosting public domain chemicals data online for the community – the challenges of handling materials

Vendor-dependent SMILES

ACD/Labs

CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O

OpenEye

CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

ChEMBL

CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
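
The three strings look different, yet they are meant to describe the same compound. A quick sanity check, sketched below, is to count the heavy atoms in each string. The tokenizer (`heavy_atom_counts`, a hypothetical helper) only covers a small subset of SMILES and is an assumption of this example, not how any vendor's software works. Equal counts are necessary but not sufficient for identity; a real comparison would rely on a toolkit's canonicalization or on InChI.

```python
import re
from collections import Counter

# Small SMILES subset: bracket atoms such as [C@@H], the two-letter
# halogens, the organic-subset atoms, and aromatic lowercase atoms.
ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI])|([bcnops])"
)


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = next(g for g in match.groups() if g)
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return dict(counts)


# Raw strings, because SMILES uses backslashes for double-bond geometry.
ACD = r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O"
OPENEYE = r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C"
CHEMBL = r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C"

for label, smiles in (("ACD/Labs", ACD), ("OpenEye", OPENEYE), ("ChEMBL", CHEMBL)):
    print(label, heavy_atom_counts(smiles))
```

All three strings give 31 carbons and 2 oxygens, so string equality is clearly the wrong test for "same structure".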

Page 31: Hosting public domain chemicals data online for the community – the challenges of handling materials

Chemists are good…

Page 32: Hosting public domain chemicals data online for the community – the challenges of handling materials

The InChI Identifier

Page 33: Hosting public domain chemicals data online for the community – the challenges of handling materials

InChI

• SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES

• InChI Strings can be reversed to structures – same problem as with SMILES – no layout

• Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet

Page 34: Hosting public domain chemicals data online for the community – the challenges of handling materials

Multiple Layers
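
To make the layered structure concrete, here is a minimal sketch that splits an InChI string into its layers. The function name `split_inchi` is hypothetical, and the code assumes a simple standard InChI with a formula layer and no repeated layer prefixes (fixed-hydrogen or isotopic sections would need more care). The example string is the standard InChI for ethanol.

```python
def split_inchi(inchi):
    """Split a simple InChI string into a dict of layers.

    Simplifying assumptions: the second field is the formula layer and
    each later layer starts with a unique one-letter prefix.
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]  # e.g. "c" connectivity, "h" hydrogens
    return layers


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

Stereochemistry, charge and isotope information live in further layers (for example `/t`, `/m`, `/s`, `/q`, `/i`), which is what lets a search match on some layers while ignoring others.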

Page 35: Hosting public domain chemicals data online for the community – the challenges of handling materials

Tautomers

Page 36: Hosting public domain chemicals data online for the community – the challenges of handling materials

Stereo

Page 37: Hosting public domain chemicals data online for the community – the challenges of handling materials

InChI Strings Hash to InChIKeys
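
The idea of hashing a long, variable-length identifier into a short, fixed-length key can be sketched as follows. This is a toy and deliberately NOT the InChIKey algorithm: the real key is 27 characters long and is produced by the official InChI software from a truncated SHA-256 digest with its own encoding. The function `toy_key` and its letter mapping are assumptions made only for illustration.

```python
import hashlib
import string


def toy_key(identifier, length=14):
    """Hash any string to a fixed-length block of uppercase letters.

    Illustration only; this does not reproduce real InChIKeys.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    letters = string.ascii_uppercase
    return "".join(letters[byte % 26] for byte in digest[:length])


short = toy_key("InChI=1S/CH4/h1H4")
longer = toy_key("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(short, longer)
```

Both keys have the same length regardless of input size, which is what makes a hashed key convenient for web search engines. The price is that a hash cannot be reversed to the structure without a lookup table.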

Page 38: Hosting public domain chemicals data online for the community – the challenges of handling materials

Structure search the web

Page 39: Hosting public domain chemicals data online for the community – the challenges of handling materials

Exact Search

Page 40: Hosting public domain chemicals data online for the community – the challenges of handling materials

Skeleton Search

Page 41: Hosting public domain chemicals data online for the community – the challenges of handling materials

Data Quality/Standardization

• MANY structures meant to be something online are MISREPRESENTED.

• Commonly you will have better success finding information by name searches than structure – with many caveats of course…

• Validating chemical structure representations is laborious work – and it’s shocking to review data…

Page 42: Hosting public domain chemicals data online for the community – the challenges of handling materials

Data Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Page 43: Hosting public domain chemicals data online for the community – the challenges of handling materials

Data quality is a known issue

Page 44: Hosting public domain chemicals data online for the community – the challenges of handling materials

Data quality is a known issue

Page 45: Hosting public domain chemicals data online for the community – the challenges of handling materials

Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10

Only 34 out of 149 structures were correct!
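
The headline figure follows directly from the counts on this slide. The short check below uses only the numbers given there; the variable names are this example's own.

```python
# Counts from the slide: (hits, correct, no stereo, incomplete, incorrect).
rows = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}

# Each row's four outcome categories should add up to its hit count.
for name, (hits, correct, no_stereo, incomplete, incorrect) in rows.items():
    assert hits == correct + no_stereo + incomplete + incorrect, name

total_hits = sum(row[0] for row in rows.values())
total_correct = sum(row[1] for row in rows.values())
print(f"{total_correct} of {total_hits} correct ({total_correct / total_hits:.0%})")
```

So fewer than one in four of the retrieved structures had fully correct stereochemistry.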

Page 46: Hosting public domain chemicals data online for the community – the challenges of handling materials

Patent data in public databases

Page 47: Hosting public domain chemicals data online for the community – the challenges of handling materials

Patent data in public databases

Page 48: Hosting public domain chemicals data online for the community – the challenges of handling materials

You just can’t trust atoms!

Page 49: Hosting public domain chemicals data online for the community – the challenges of handling materials

You just can’t trust atoms!

They make up everything…

Page 50: Hosting public domain chemicals data online for the community – the challenges of handling materials

ALL variants of Yohimbine!!!

Page 51: Hosting public domain chemicals data online for the community – the challenges of handling materials

What’s Methane? OLD PUBCHEM

Page 52: Hosting public domain chemicals data online for the community – the challenges of handling materials

What ELSE is Methane???

Page 53: Hosting public domain chemicals data online for the community – the challenges of handling materials

NEW PUBCHEM

Page 54: Hosting public domain chemicals data online for the community – the challenges of handling materials

Depiction vs Accurate Representation

Page 55: Hosting public domain chemicals data online for the community – the challenges of handling materials

Depiction vs Accurate Representation

Page 56: Hosting public domain chemicals data online for the community – the challenges of handling materials

What is the Structure of Vitamin K1?

Page 57: Hosting public domain chemicals data online for the community – the challenges of handling materials

Standardize

• Use the SRS as a guidance document for standardization

• Adjust as necessary to our needs

Page 58: Hosting public domain chemicals data online for the community – the challenges of handling materials

Nitro groups

Page 59: Hosting public domain chemicals data online for the community – the challenges of handling materials

Salt and Ionic Bonds

Page 60: Hosting public domain chemicals data online for the community – the challenges of handling materials

Ammonium salts

Page 61: Hosting public domain chemicals data online for the community – the challenges of handling materials

Can we MAKE Quality Data?

• We are building systems for everyone to validate and standardize their data

Page 62: Hosting public domain chemicals data online for the community – the challenges of handling materials

DICTIONARIES are powerful

• Search all forms of structure IDs

• Systematic name(s)

• Trivial Name(s)

• SMILES

• InChI Strings

• InChIKeys

• Database IDs

• Registry Number
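
One way to picture such a dictionary is as an index from many identifier types to a single record. The sketch below is an assumption about how that could be modeled, not a description of ChemSpider's internals; the class `StructureDictionary` and the record ID `record-0001` are hypothetical. The identifiers shown are well-known ones for aspirin.

```python
class StructureDictionary:
    """Maps (kind, identifier) pairs to a single record ID."""

    def __init__(self):
        self._index = {}

    @staticmethod
    def _key(kind, identifier):
        value = " ".join(identifier.split())  # collapse stray whitespace
        if kind == "name":
            value = value.casefold()  # names are matched case-insensitively
        # SMILES, InChIKeys and registry numbers stay case-sensitive:
        # in SMILES, "c" (aromatic) and "C" (aliphatic) differ.
        return (kind, value)

    def add(self, record_id, kind, identifier):
        self._index[self._key(kind, identifier)] = record_id

    def lookup(self, kind, identifier):
        return self._index.get(self._key(kind, identifier))


dictionary = StructureDictionary()
for kind, identifier in [
    ("name", "aspirin"),
    ("name", "acetylsalicylic acid"),
    ("name", "2-acetoxybenzoic acid"),
    ("smiles", "CC(=O)Oc1ccccc1C(=O)O"),
    ("inchikey", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("registry", "50-78-2"),
]:
    dictionary.add("record-0001", kind, identifier)

print(dictionary.lookup("name", "Acetylsalicylic  ACID"))
```

Keeping the identifier kind in the key avoids accidental collisions between, say, a trivial name and a database ID that happen to share the same text.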

Page 63: Hosting public domain chemicals data online for the community – the challenges of handling materials

Many Names, One Structure

Page 64: Hosting public domain chemicals data online for the community – the challenges of handling materials

But big and often noisy

Page 65: Hosting public domain chemicals data online for the community – the challenges of handling materials

Text-Mining and Markup…

Page 66: Hosting public domain chemicals data online for the community – the challenges of handling materials

Text-Mining and Markup…

Page 67: Hosting public domain chemicals data online for the community – the challenges of handling materials

With links out to platforms

Page 68: Hosting public domain chemicals data online for the community – the challenges of handling materials

Dictionaries are invaluable

Page 69: Hosting public domain chemicals data online for the community – the challenges of handling materials

Text Mining on IUPAC Names

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
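
A dictionary-driven pass over this kind of text can be sketched in a few lines. This is a simplified stand-in for the text-mining tools mentioned in the talk, not their actual implementation; `tag_names` and the two-entry dictionary are assumptions of the example, and the text is abridged from the passage above. Long systematic names such as the urea derivatives would instead need name-to-structure conversion.

```python
import re


def tag_names(text, dictionary):
    """Return (offset, name) pairs for dictionary names found in text."""
    # Longest names first, so a longer name is preferred over a shorter
    # one that starts at the same position.
    names = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds stop "benzene" from matching inside "chlorobenzene".
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


TEXT = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a "
    "glass reaction vessel . After this time the benzene and unreacted "
    "thionyl chloride were stripped"
)
hits = tag_names(TEXT, {"benzene", "thionyl chloride"})
print([name for _, name in hits])
```

Each hit can then be marked up and linked to the matching database record, which is why the quality of the dictionary matters so much.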

Page 71: Hosting public domain chemicals data online for the community – the challenges of handling materials

Name to Structure Conversion

Page 72: Hosting public domain chemicals data online for the community – the challenges of handling materials

Name to Structure Conversion

Page 73: Hosting public domain chemicals data online for the community – the challenges of handling materials

ChemSpider “Annotations”

• Users can add

• Descriptions, Syntheses and Commentaries

• Links to PubMed articles

• Links to articles via DOIs

• Add spectral data

• Add Crystallographic Information Files

• Add photos

• Add MP3 files

• Add Videos

Page 74: Hosting public domain chemicals data online for the community – the challenges of handling materials

Spectral Data

• Spectral data to be deposited in standard formats – JCAMP or images

• All spectra available at: http://www.chemspider.com/spectra.aspx

• Data are deposited on a regular basis

• Students

• Chemical vendors

• Growing collection now
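
For readers unfamiliar with JCAMP, the format stores metadata as labeled records of the form `##LABEL= value` ahead of the data table. The sketch below reads those header records from a small, made-up sample (the sample content and the function `read_jcamp_header` are hypothetical). It is a minimal illustration, not a full JCAMP-DX parser.

```python
SAMPLE = """\
##TITLE= example spectrum (hypothetical)
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##XYDATA= (X++(Y..Y))
4000 0.98 0.97 0.95
##END=
"""


def read_jcamp_header(text):
    """Collect labeled header records that precede the data table."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue
        label = label.strip().upper()
        if label in ("XYDATA", "END"):
            break  # the numeric table starts here; stop reading metadata
        header[label] = value.strip()
    return header


header = read_jcamp_header(SAMPLE)
print(header["DATA TYPE"], header["XUNITS"], header["YUNITS"])
```

Because the units and data type travel with the numbers, a deposited JCAMP file can be re-plotted and re-analyzed, which an image of a spectrum does not allow.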

Page 75: Hosting public domain chemicals data online for the community – the challenges of handling materials

Student Submissions

Page 76: Hosting public domain chemicals data online for the community – the challenges of handling materials

Data on ChemSpider

Page 77: Hosting public domain chemicals data online for the community – the challenges of handling materials

Spectral Data EXTRACTION

Page 78: Hosting public domain chemicals data online for the community – the challenges of handling materials

ORIGINAL

EXTRACTED

Page 79: Hosting public domain chemicals data online for the community – the challenges of handling materials

It’s exactly the WRONG WAY!

• We should NOT be mining data out of future publications

• Structures should be submitted “correctly”

• Spectra should be digital spectral formats, not images

• ESI should be RICH and interactive, preferably with OPEN DATA

Page 80: Hosting public domain chemicals data online for the community – the challenges of handling materials

An Adventure into the World of Small but Significant Contributions…

Page 81: Hosting public domain chemicals data online for the community – the challenges of handling materials

ChemSpider SyntheticPages

Page 82: Hosting public domain chemicals data online for the community – the challenges of handling materials

Micropublishing with Peer Review

(a chemical synthesis blog?)

Page 83: Hosting public domain chemicals data online for the community – the challenges of handling materials

Multi-Step Synthesis

Page 84: Hosting public domain chemicals data online for the community – the challenges of handling materials

Interactive Data

Page 85: Hosting public domain chemicals data online for the community – the challenges of handling materials

Chemistry data is of value?

• Reference databases generate hundreds of millions of dollars/euros per year

• So much data generated that could go public

• Maybe 5% of all data generated is published

• There is no “Journal of Failed Experiments”

• Funding agencies start to demand Open Data

• Scientists want funding but also recognition

Page 86: Hosting public domain chemicals data online for the community – the challenges of handling materials

A shift to Openness

Page 87: Hosting public domain chemicals data online for the community – the challenges of handling materials

How will I get recognized?

• Who in the room has an ORCID?

Page 88: Hosting public domain chemicals data online for the community – the challenges of handling materials

Deposition of Research Data

• If we manage compounds, syntheses and analytical data…

• If we have security and provenance of data…

• If we deliver user interfaces to satisfy the various use cases…

• Then we have delivered electronic lab notebooks for chemistry laboratories. ELNs are research data repositories

Page 89: Hosting public domain chemicals data online for the community – the challenges of handling materials

Recognition: need to have Impact

Page 90: Hosting public domain chemicals data online for the community – the challenges of handling materials

Quantitating scientists?

Page 91: Hosting public domain chemicals data online for the community – the challenges of handling materials

National Information Standards Organization and “Altmetrics”

http://www.niso.org/apps/group_public/download.php/13295/niso_altmetrics_white_paper_draft_v4.pdf

Page 92: Hosting public domain chemicals data online for the community – the challenges of handling materials

What are we building?

• We are building the “RSC Data Repository”

• Containers for compounds, reactions, analytical data, tabular data

• Algorithms for data validation and standardization

• Flexible indexing and search technologies

• A platform for modeling data and hosting existing models and predictive algorithms

Page 93: Hosting public domain chemicals data online for the community – the challenges of handling materials

Compounds

Page 94: Hosting public domain chemicals data online for the community – the challenges of handling materials

Reactions

Page 95: Hosting public domain chemicals data online for the community – the challenges of handling materials

Analytical data

Page 96: Hosting public domain chemicals data online for the community – the challenges of handling materials

Crystallography data

Page 97: Hosting public domain chemicals data online for the community – the challenges of handling materials

Deposition of Data

• Developing systems that provide feedback to users regarding data quality

• Validate/standardize chemical compounds

• Check for balanced reactions

• Check spectral data

• EXAMPLE Future work

• Properties – compare experimental to predicted

• Automated structure verification – NMR
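
As an example of the kind of automated feedback listed above, a balanced-reaction check can be sketched by comparing element counts on both sides. This is a minimal illustration under stated assumptions (flat molecular formulas without parentheses or charges; the helper names are hypothetical), not the validation code of the repository itself.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C2H6O' (no parentheses)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum element counts over (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, count in parse_formula(formula).items():
            total[element] += coefficient * count
    return total


def is_balanced(reactants, products):
    return side_total(reactants) == side_total(products)


balanced = is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])
unbalanced = is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (1, "H2O")])
print(balanced, unbalanced)
```

A deposition system could run a check like this at upload time and report which elements fail to balance, rather than silently accepting the record.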

Page 98: Hosting public domain chemicals data online for the community – the challenges of handling materials

So we know about ORGANICS

• Comment – you don’t know all of the challenges until you start to work in the area!

• We, and cheminformatics companies, have solved MANY, but not all of the issues regarding organic chemistry management

• The majority of our approaches do not map to materials

• No standard ways to represent compounds

• No InChI for materials

Page 99: Hosting public domain chemicals data online for the community – the challenges of handling materials

Questions to consider…

• Organics are hard enough!

• What are your best dictionaries of materials?

• We have chemical ontologies. Status for materials?

• Is open annotation of your databases possible?

• What standards do you have for materials data exchange?

Page 100: Hosting public domain chemicals data online for the community – the challenges of handling materials

Polymorphism is common

Page 101: Hosting public domain chemicals data online for the community – the challenges of handling materials

Known Challenges

• Many materials are non-stoichiometric

• How to represent composite materials (e.g. supported catalysts)?

• Methods to distinguish novelty in materials (equivalent to diversity in organic structures)?

• Many more I will learn at this workshop..?

Page 102: Hosting public domain chemicals data online for the community – the challenges of handling materials

Collaboration is key

Page 103: Hosting public domain chemicals data online for the community – the challenges of handling materials

Internet Data

The Future

Commercial Software

Pre-competitive Data

Open Science

Open Data

Publishers

Educators

Open Databases

Chemical Vendors

Small organic molecules

Undefined materials

Organometallics

Nanomaterials

Polymers

Minerals

Particle bound

Links to Biologicals

Page 104: Hosting public domain chemicals data online for the community – the challenges of handling materials

Thank you

Email: [email protected]

ORCID: 0000-0002-2668-4821

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams