Building a Community Resource for the Life Sciences
DESCRIPTION
This is a presentation given in Track 4, Open Access and Cheminformatics, at the Bio-IT Meeting in Boston on April 21, 2010. It is a general overview of ChemSpider activities to link together the internet for chemists and to validate and curate data. We also won the Bio-IT Best Practices Community Service Award that evening.
TRANSCRIPT
Building A Community Platform to Support Chemistry and the Life Sciences
Where Would You Look? What Do You Trust?
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based searches across the internet
Data are dirty: sorting the wheat from the chaff. Who can you trust?
Too many searches required to resource data
The Final Search Strategy
All Those Names, One Structure
A problem to solve…
Trustworthy Chemistry?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Where Would You Look? What Do You Trust?
Structural Data for Life Sciences
DailyMed
Lack of Stereochemistry
Incorrect Structures
Ugh…
Drugs are REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Just “Public Compound” Databases
PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LIPID MAPS
ChemIDplus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider
As few interfaces as possible
What do humans want?
A Pragmatic Vision
“Build a Structure Centric Community to Serve Chemists”
Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
Answer Questions
Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider Searches
Search “OEA”
Search OEA
Link Farm Connections
Google Books
Google Scholar
Linked Patents for OEA
Google Patents
Microsoft Academic Search
RSC Journals
RSC Databases
Statistics for Today
Almost 25 million compounds from >350 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Searching Chemistry on the Internet
How complete a result set will we get if we search for “chemicals” by name?
Is there a better way to link chemistry databases? Linking by “names” is dangerous
Chemists want structure and SUBstructure searching
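The point about names can be illustrated with a minimal Python sketch. The records, catalog IDs and reference labels below are hypothetical and exist only for illustration; the InChIKey is the Standard InChIKey for aspirin. Joining two sources on the name field finds nothing because each source uses a different synonym, while joining on the structure-derived key links every record.

```python
# Hypothetical records from two sources: the names differ, the structure does not.
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # Standard InChIKey for aspirin

vendor_records = [
    {"name": "Aspirin", "inchikey": ASPIRIN_KEY, "catalog_id": "V-001"},
]
literature_records = [
    {"name": "acetylsalicylic acid", "inchikey": ASPIRIN_KEY, "ref": "paper-42"},
    {"name": "2-acetoxybenzoic acid", "inchikey": ASPIRIN_KEY, "ref": "paper-43"},
]


def link_by(field, left, right):
    """Return (left, right) record pairs whose `field` values match exactly
    (case-insensitively)."""
    index = {}
    for rec in right:
        index.setdefault(rec[field].lower(), []).append(rec)
    return [(l, r) for l in left for r in index.get(l[field].lower(), [])]


name_links = link_by("name", vendor_records, literature_records)
key_links = link_by("inchikey", vendor_records, literature_records)

print("linked by name:    ", len(name_links))
print("linked by InChIKey:", len(key_links))
```

A real integration would also need synonym handling and structure normalization; this sketch only shows why a structure-derived key is a safer join field than a name.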
The InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
Link the Internet with InChIKeys!
Taken from: Rafael Sidi’s Blog
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
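The difference between a skeleton search and a full-molecule search can be sketched in Python, assuming the search works on InChIKeys. A Standard InChIKey has three hyphen-separated blocks: 14 letters derived from the connectivity (the skeleton), 10 letters covering stereochemistry and other layers plus flag characters, and one protonation character. The function names and the small index below are hypothetical. The second key in the index is constructed for illustration by pairing the same first block with `UHFFFAOYSA`, the second block that Standard InChIKeys carry when no stereo or isotope layers are present.

```python
def split_inchikey(key):
    """Split a 27-character InChIKey into its three hyphen-separated blocks."""
    parts = key.split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def search(query_key, indexed_keys, skeleton_only=False):
    """Return indexed keys matching the query on the full key, or on the
    first (connectivity) block only when skeleton_only is True."""
    if skeleton_only:
        query_skeleton = split_inchikey(query_key)[0]
        return [k for k in indexed_keys if split_inchikey(k)[0] == query_skeleton]
    return [k for k in indexed_keys if k == query_key]


# Hypothetical index for illustration only.
index = [
    "RCINICONZNJXQF-MZXODVADSA-N",  # key shown in the slides
    "RCINICONZNJXQF-UHFFFAOYSA-N",  # same skeleton, no stereo layer (constructed)
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin, an unrelated skeleton
]

query = "RCINICONZNJXQF-MZXODVADSA-N"
full_hits = search(query, index)
skeleton_hits = search(query, index, skeleton_only=True)
print(len(full_hits), "full-molecule hit(s);", len(skeleton_hits), "skeleton hit(s)")
```

The skeleton search returns more hits because records drawn with missing or different stereochemistry still share the first block, which is consistent with the gap between the two hit counts on the slides.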
Vancomycin
Vancomycin on ChemSpider 1 compound – 3 days
InChIKeys
RCINICONZNJXQF-MZXODVADSA-N
Make the internet searchable by adding InChIKeys
Publishers add InChIKeys to papers now…
RCINICONZNJXQF-MZXODVADSA-N is what???
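If pages and papers carry InChIKeys, an indexer can pick them out of plain text by their fixed shape. The following is a minimal Python sketch under that assumption. The sample sentence and the function name are hypothetical, and the pattern checks only the 14-10-1 uppercase-letter layout, not whether a key corresponds to a real structure.

```python
import re

# 14 uppercase letters, hyphen, 10 uppercase letters, hyphen, 1 uppercase letter.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return InChIKey-shaped strings found in text, in order, without repeats."""
    seen = []
    for match in INCHIKEY_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical page text for illustration only.
page = (
    "Compound 1 (InChIKey: RCINICONZNJXQF-MZXODVADSA-N) was purified. "
    "See also InChIKey=RCINICONZNJXQF-MZXODVADSA-N and catalog code ABC-123."
)

keys = find_inchikeys(page)
print(keys)
```

Because the key is a fixed-length, text-friendly string, an ordinary text search engine can index it like any other word, which is what makes tagging pages with it practical.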
The InChI “Resolver”
InChI Resolver to DOIs
Structure Search the Web
Most Chemistry is NOT Published
Only a fraction of chemistry is published
Only a tiny fraction of chemistry is patented
What of the “Lost Chemistry” - never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found
Crowd-sourcing Curation and Deposition
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Building a Structure Centric Community for Chemists
Multi-level Curation and Approval
Semantic Markup: Project Prospect
Name-Structure Pairs
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Org Prep Daily (Blog)
ChemSpider SyntheticPages
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
ChemSpider Web Services
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams