ChemSpider and Traveling the Internet via Chemical Structures
Antony Williams
Drexel University, November 2012
Compounds and Identifiers
Chemistry on the Internet
Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
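To make the molfile layout concrete, here is a minimal sketch that reads the connection table shown on the earlier slide and reports what it carries: atoms with coordinates, bonds, and any wedge (stereo) bonds. It assumes the V2000 layout and splits fields on whitespace for brevity; real molfiles are fixed-width, so a production reader should slice by column or use a cheminformatics toolkit. The function names are illustrative only.

```python
from collections import Counter

# Connection table from the slide, starting at the counts line
# (the three molfile header lines are not shown on the slide).
CTAB = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""

# V2000 bond stereo codes for single bonds.
WEDGE = {1: "up", 4: "either", 6: "down"}


def parse_ctab(text):
    """Return (atoms, bonds) from a V2000 connection table (whitespace split)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_atoms, n_bonds = (int(v) for v in lines[0].split()[:2])
    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, element = ln.split()[:4]
        atoms.append({"xyz": (float(x), float(y), float(z)), "element": element})
    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        a, b, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append({"atoms": (a, b), "order": order, "stereo": stereo})
    return atoms, bonds


def heavy_atom_counts(atoms):
    """Count atoms per element as listed in the atom block."""
    return dict(Counter(atom["element"] for atom in atoms))


def stereo_bonds(bonds):
    """List (from_atom, to_atom, wedge) for bonds carrying a stereo flag."""
    return [(*b["atoms"], WEDGE[b["stereo"]]) for b in bonds if b["stereo"] in WEDGE]


atoms, bonds = parse_ctab(CTAB)
is_2d = all(atom["xyz"][2] == 0.0 for atom in atoms)
print(heavy_atom_counts(atoms), stereo_bonds(bonds), "2D layout" if is_2d else "3D")
```

The single wedge bond is the only stereo information in this file; if the drawing package drops it on export, the stereocentre is lost.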
SMILES (http://en.wikipedia.org/wiki/SMILES)
SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!
Generally different between drawing packages
Stereo
Tautomers
SMILES
ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O
OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
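The three strings above cannot be compared as text, which is the practical problem with SMILES from different packages. As a rough illustration, the sketch below counts heavy atoms in each string with a small regular expression. It assumes only the common organic subset plus simple bracket atoms, and it is not a SMILES parser: equal composition is necessary but not sufficient for identity, so a real comparison needs a toolkit canonicalizer or an InChI.

```python
import re
from collections import Counter

# SMILES strings from the slide, keyed by the package that produced them.
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Simplified atom tokenizer (an assumption of this sketch, not the full grammar).
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[cnops])[^\]]*\]"  # bracket atom, e.g. [C@@H]
    r"|(?P<organic>Cl|Br|[BCNOPSFI])"                  # aliphatic organic subset
    r"|(?P<aromatic>[bcnops])"                         # aromatic organic subset
)


def heavy_atoms(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group("bracket") or match.group("organic") or match.group("aromatic")
        symbol = symbol.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return dict(counts)


compositions = {name: heavy_atoms(s) for name, s in SMILES.items()}
for name, composition in compositions.items():
    print(name, composition)
```

All three strings differ character by character, yet they give the same heavy-atom composition.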
The InChI Identifier
InChI
SINGLE code base managed by IUPAC and integrated into drawing packages, so there is none of the package-to-package variability seen with SMILES
InChI Strings can be reversed to structures – same problem as with SMILES – no layout
Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet
The InChI Standard
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
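The three topics above map onto separate layers of the InChI string. The sketch below splits a Standard InChI into its layers and pulls out the mobile-hydrogen groups. The example string is the Standard InChI for L-alanine, chosen only as a small input with both a mobile H and a stereocentre. The layer names are informal labels of my own, and the sketch assumes a Standard InChI with no fixed-H (/f) layer, where layer prefixes would repeat.

```python
import re

# Standard InChI for L-alanine, used here only as a small worked input.
INCHI = "InChI=1S/C3H7NO2/c1-2(3(5)6)4/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

# Informal labels for the layer prefixes (labels are this sketch's own).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Split an InChI string into a dict of named layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        layers[LAYER_NAMES.get(part[0], part[0])] = part[1:]
    return layers


def mobile_h_groups(layers):
    """Return the mobile-H groups, e.g. '(H,5,6)': one H shared by atoms 5 and 6."""
    return re.findall(r"\(H[^)]*\)", layers.get("hydrogens", ""))


layers = split_inchi(INCHI)
for name, value in layers.items():
    print(f"{name}: {value}")
print("mobile H:", mobile_h_groups(layers))
```

Here the mobile hydrogen is shared between the two carboxyl oxygens, so either way of drawing the acid gives the same string. There is no double-bond stereo layer because the molecule has none to report.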
Checking for Stereochemistry
Use your drawing package!
InChIKeys: Search the Web by Structure
InChIs
Databases and Standardization
InChI
No support for polymers, organometallics
Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic
“Slight” chance of collisions of InChIKeys
VERY USEFUL FOR INTEGRATING THE WEB
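How slight is the collision risk? A back-of-envelope birthday-problem estimate gives a feel for it. The sketch assumes that the first InChIKey block carries 65 bits of hash and the second 37 bits (my reading of the InChI documentation, worth verifying), and that hash values are uniformly distributed. It estimates accidental collisions only.

```python
import math


def collision_probability(n_structures, hash_bits):
    """Birthday-problem estimate of at least one collision among n hashed items."""
    pairs = n_structures * (n_structures - 1) / 2
    # 1 - exp(-pairs / 2**bits), written with expm1 for numerical stability.
    return -math.expm1(-pairs / 2 ** hash_bits)


# Assumed hash sizes: 65 bits (first block), 65 + 37 bits (first two blocks).
for n in (10 ** 6, 10 ** 8, 10 ** 9):
    p_first = collision_probability(n, 65)
    p_both = collision_probability(n, 65 + 37)
    print(f"{n:>13,d} structures: first block {p_first:.3e}, both blocks {p_both:.3e}")
```

Under these assumptions a collection of a billion structures has roughly a 1% chance of containing one first-block collision, which is why the risk is small but not zero.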
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
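The two searches differ only in how much of the InChIKey is used. The first 14-letter block hashes the molecular skeleton (connectivity), and the full key adds the remaining detail such as stereochemistry. The sketch below splits a key and builds both search terms. The key shown is the one commonly listed for vancomycin in public databases; it is a sample input here, so check it against a trusted source before relying on it. The function names are illustrative.

```python
import re

# Layout: 14-letter skeleton block, 8-letter detail hash, standard flag (S/N),
# version letter, and a final protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, standard_flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,
        "detail": detail,
        "is_standard": standard_flag == "S",
        "version": version,
        "protonation": protonation,
    }


def search_terms(key):
    """Return the two strings to paste into a web search engine."""
    parts = split_inchikey(key)
    return {"skeleton_search": parts["skeleton"], "full_search": key.strip()}


# Sample input: key commonly listed for vancomycin (verify before use).
VANCOMYCIN_KEY = "MYPYJXKWCTUITO-LYRMYLQWSA-N"
terms = search_terms(VANCOMYCIN_KEY)
print(terms)
```

The skeleton term matches every page that lists any stereoisomer, isotopologue or protonation state sharing the connectivity, which is why it returns many more hits than the full key.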
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
www.chemspider.com
How do we build it?
We deal in Molfiles or SDF files – with coordinates
Valence checking, charge imbalance
We have our own “business logic” to standardize
InChI to “aggregate tautomers” to one record
We link out to external sites using their IDs
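The aggregation step can be pictured as grouping deposited records under a shared Standard InChI and keeping each depositor's own ID for linking out. The sketch below is an illustration only, not ChemSpider's actual business logic: the source names and external IDs are invented, and the InChI strings are assumed to have been generated already by the InChI software from the deposited molfiles.

```python
from collections import defaultdict

# Hypothetical deposited records (sources and IDs are invented for illustration).
RECORDS = [
    {"source": "VendorA", "external_id": "A-001", "name": "acetic acid",
     "inchi": "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"},
    {"source": "WikiB", "external_id": "B-17", "name": "ethanoic acid",
     "inchi": "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"},
    {"source": "VendorA", "external_id": "A-002", "name": "ethanol",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
]


def aggregate(records):
    """Merge records that share a Standard InChI into one compound entry."""
    merged = defaultdict(lambda: {"names": set(), "links": []})
    for record in records:
        entry = merged[record["inchi"]]
        entry["names"].add(record["name"])
        # Keep the depositor's own identifier so the record can link back out.
        entry["links"].append((record["source"], record["external_id"]))
    return dict(merged)


compounds = aggregate(RECORDS)
for inchi, entry in compounds.items():
    print(inchi, sorted(entry["names"]), entry["links"])
```

Because Standard InChI treats mobile hydrogens as shared, tautomers that are drawn differently often arrive with the same string and so land in the same record.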
Searches: The INTERNET
All ChemSpider and Internet searches are “simply algorithms” but synonym searching is based on an assertion
Validated Names for Searching…
Validating structures
Check for “full stereo”; stereo descriptors are especially useful for checking!
Check for quality of associated data sources
Check against reference literature when available – but it can be wrong
Question EVERYTHING!
Contributing to the Quality of Data: What is the Structure of Vitamin K?
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
Wolfram Alpha
DailyMed
ALL Different, ALL “Domoic Acids”
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
Slides: www.slideshare.net/AntonyWilliams