5th Meeting on U.S. Government Chemical Databases and Open Chemistry: Talk
TRANSCRIPT
NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space
U.S. Government Chemical Databases and Open Chemistry
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
Chemistry Space Analysis
• how many small molecules are there currently?
• since the early 2000s, the number of databases “publishing” small molecules has grown enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank): what is the overlap?
• many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
• growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)
chemical structure
Chemical Identifier Resolver
NCI/CADD Identifiers
InChI/InChIKey
ChemSpider ID
PubChem SID/CID
chemical names
CAS Registry Number
NSC number
FDA UNII
ChemNavigator SID
SMILES
SD File
Chemical Formula
ChEBI ID
PDB Ligand ID
MRV
CML
SYBYL Line Notation
GIF image
Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.
first beta release: July 2009
current release (beta 4): April 2011
Chemical Identifier Resolver (NCI/CADD Web Resources)
• it is usable by a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
MIME type: text/plain
example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
response: 204255-11-8
• XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
• if a request is not resolvable: HTTP 404 status message
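The URL scheme above can be exercised from a script. The following is a minimal sketch using only the Python standard library; the function names `build_resolver_url` and `resolve` are hypothetical, and percent-encoding the identifier is an assumption made here (useful for identifiers such as SMILES that contain reserved characters), not something the slide specifies.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier, representation, xml=False):
    """Compose BASE_URL/identifier/representation, optionally with /xml."""
    # Assumption: the identifier is percent-encoded so that characters
    # such as '#' or '/' do not break the URL path.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10):
    """Return the text/plain response body, or None on HTTP 404."""
    url = build_resolver_url(identifier, representation)
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # request is not resolvable
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` corresponds to the example request on the slide; it needs network access, and the service may behave differently today than at the time of the talk.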
Chemical Identifier Resolver (NCI/CADD Public Web Resources)
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
"identifier" (input types handled by the resolver):
• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• FDA UNII
"representation" (URL suffixes):
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid
Chemical Identifier Resolver (NCI/CADD Web Resources)
[Diagram: processing of a resolver request]
• http request: "identifier", "representation"
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation
• identifier is a hashed structure representation (e.g. InChIKey), trivial name, etc. (e.g. CAS number, chemical name): database lookup
• structure: requested representation (e.g. InChI, GIF image)
• http response: MIME type
• components shown: CACTVS, NCI/CADD Chemical Structure Database (CSDB)
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xmls?resolver=name_by_chemspider,name_by_opsin,name_by_cir
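An XML response of this shape can be read with the Python standard library. The sketch below parses a copy of the response shown on the slide (embedded as a string, so no network access is needed) and collects the result of each name resolver; the function name `parse_resolver_xml` is hypothetical, and the element and attribute names are taken from the slide only.

```python
import xml.etree.ElementTree as ET

# Copy of the response shown on the slide.
SAMPLE_RESPONSE = """<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>"""


def parse_resolver_xml(xml_text):
    """Return one dict per <data> element with its resolver and items."""
    root = ET.fromstring(xml_text)
    results = []
    for data in root.findall("data"):
        results.append({
            "resolver": data.get("resolver"),
            "string_class": data.get("string_class"),
            "items": [item.text for item in data.findall("item")],
        })
    return results


parsed = parse_resolver_xml(SAMPLE_RESPONSE)
# Distinct answers across all resolvers; a single value means they agree.
distinct_answers = {value for entry in parsed for value in entry["items"]}
```

Comparing the answers of several resolvers this way is one possible use of the `resolver=` query parameter: here all three name resolvers return the same SMILES string.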
Chemical Identifier Resolver: Chemical Structure Database (CSDB)
currently:
~150 chemical structure databases
~120 million structure records
~81.6 million unique structures by NCI/CADD FICuS Identifier
~84 million unique structures by Std. InChIKey
• ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~330 international chemistry suppliers
• PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …
• Commercial Sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, …
NCI/CADD Structure Identifiers
FICTS, FICuS, uuuuu
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes:
  represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
  high sensitivity to structural features of a compound
  change if connectivity changes
[structure drawing] 9850FD9F9E2B4E25
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Diagram: calculation of the NCI/CADD Identifiers]
• original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB)
• structure normalization
• parent structure (FICTS, FICuS, uuuuu)
• hashcode calculation (E_HASHISY)
• NCI/CADD Identifier
• further labels on the diagram: SDF, SMILES, database
• calculation of a set of parent structures with different sensitivity to chemical features
• representation of chemical structures on different levels
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Features encoded in the identifier names: Fragments, Isotopes, Charges, Tautomers, Stereo (sensitive / not sensitive; a lower-case "u" marks a feature the identifier is not sensitive to)
• FICTS: sensitive to Fragments, Isotopes, Charges, Tautomers, Stereo
• FICuS: sensitive to Fragments, Isotopes, Charges, Stereo; not sensitive to Tautomers
• uuuuu: not sensitive to any of the five features
Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
Example [structure drawing with O- and Na+]:
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
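The identifier layout `<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>` lends itself to a small parser. The sketch below is only an illustration: the field widths (two digits for the version, two hexadecimal digits for the checksum) are inferred from the three example identifiers on the slide, the names `StructureIdentifier` and `parse_identifier` are hypothetical, and the checksum is not verified because its algorithm is not given in the talk.

```python
import re
from typing import NamedTuple, Optional


class StructureIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned value of the 16 hex digits
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str   # kept as text; not verified here


# Assumption: widths of version and checksum follow the slide examples.
_PATTERN = re.compile(
    r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$"
)


def parse_identifier(text: str) -> Optional[StructureIdentifier]:
    """Split an NCI/CADD identifier into its parts, or return None."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    hashcode, tag, version, checksum = match.groups()
    return StructureIdentifier(int(hashcode, 16), tag, version, checksum)


example = parse_identifier("9850FD9F9E2B4E25-uuuuu-01-27")
```

Storing the hashcode as an integer makes the "64-bit unsigned" property explicit: any 16-digit hexadecimal string maps to a value below 2**64.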
[Figure: structure drawings of histidine and related forms of the same compound, labeled: charged form, tautomer, isotope, salt, stereoisomers, “errors”]
FICTS
[Figure: the histidine drawings (charged form, tautomer, isotope, salt, stereoisomers, “errors”) with their FICTS identifiers]
histidine: 9850FD9F9E2B4E25-FICTS
related drawings (in slide order):
A3DAE0788050DDE4-FICTS
E5F83F10C5DB080A-FICTS
B2FDA68AEDA06DB9-FICTS
9850FD9F9E2B4E25-FICTS
E5F83F10C5DB080A-FICTS
E92E4BA2869F3611-FICTS
8A7AD1EB498CC76A-FICTS
6C16DE2351F9FF50-FICTS
FICuS
[Figure: the histidine drawings (charged form, tautomer, isotope, salt, stereoisomers, “errors”) with their FICuS identifiers]
histidine: 9850FD9F9E2B4E25-FICuS
related drawings (in slide order):
A3DAE0788050DDE4-FICuS
E5F83F10C5DB080A-FICuS
B2FDA68AEDA06DB9-FICuS
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
9850FD9F9E2B4E25-FICuS
uuuuu
[Figure: the histidine drawings (charged form, tautomer, isotope, stereoisomers, salt, “errors”) with their uuuuu identifiers]
histidine: 9850FD9F9E2B4E25-uuuuu
related drawings (in slide order):
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-FICuS
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
Std. InChIKey
[Figure: the histidine drawings (charged form, tautomer, isotope, stereoisomers, salt, “errors”) with their Std. InChIKeys]
histidine: HNDVDQJCIGZPNO-UHFFFAOYSA-N
related drawings (in slide order):
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
UHPNKBYGGMJTIM-UHFFFAOYSA-M
NCI/CADD Chemical Structure Database: Structure Normalization
[Diagram: original records are grouped into FICTS parent structures, these into FICuS parent structures, and these into uuuuu parent structures; annotation on the slide: tautomer-invariant]
119.8 million original structure records in CSDB
83.1 million FICTS parent structures
81.6 million FICuS parent structures
76.2 million uuuuu parent structures
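The counts on this slide can be put in relation to each other with a few lines of arithmetic. The sketch below uses only the four numbers quoted above (in millions); the function names are hypothetical, and reading the step from FICTS to FICuS as the share of parent structures merged by ignoring tautomerism follows from the identifier definitions rather than from an explicit statement on the slide.

```python
# Counts from the slide, in millions.
COUNTS_MILLIONS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def share_of_original(counts):
    """Each count as a percentage of the original structure records."""
    total = counts["original records"]
    return {name: round(100.0 * value / total, 1) for name, value in counts.items()}


def step_reduction(counts, before, after):
    """Percentage of 'before' structures that disappear at the 'after' level."""
    return round(100.0 * (counts[before] - counts[after]) / counts[before], 1)


shares = share_of_original(COUNTS_MILLIONS)
# FICTS and FICuS differ only in tautomer sensitivity.
tautomer_merge = step_reduction(COUNTS_MILLIONS, "FICTS", "FICuS")
print(shares, tautomer_merge)
```

Under these numbers, roughly 69% of the original records remain as distinct FICTS parent structures, and dropping tautomer sensitivity merges a further 1.8% of those.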
Tautomer Analysis
How much “chemical space” is “just generated” by drawing tautomers?
NCI/CADD Chemical Structure Database: Tautomer Analysis
• CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• rule set is systematically applied to the original structure (and all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform operations/structure
• all tautomers are ranked by a scoring function
• the highest ranked tautomer is defined as the canonical tautomer
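The enumeration procedure described above (apply every rule to the start structure and to everything generated so far, stop after a fixed number of transform operations, then rank) can be sketched generically. This is not CACTVS code: the toy "structures" below are plain strings in which `K` and `E` stand for a keto and an enol site, the two transforms are stand-ins for SMIRKS rules, and the scoring function is invented for the example. Only the control flow (worklist, operation cap, exhaustiveness flag, highest-scoring form as canonical) mirrors the slide.

```python
from collections import deque


def enumerate_forms(start, transforms, max_operations=1000):
    """Apply all transforms to the start form and to every form found.

    Returns (forms, exhaustive); exhaustive is False when the operation
    cap was reached before the worklist was empty.
    """
    seen = {start}
    queue = deque([start])
    operations = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                return seen, False
            operations += 1  # one transform applied to one form
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_form(forms, score):
    """Highest-scoring form; ties are broken by the string itself."""
    return max(forms, key=lambda form: (score(form), form))


def swap_one(old, new):
    """Toy transform: change a single site from `old` to `new`."""
    def transform(form):
        return [form[:i] + new + form[i + 1:]
                for i, site in enumerate(form) if site == old]
    return transform


toy_rules = [swap_one("K", "E"), swap_one("E", "K")]
all_forms, exhaustive = enumerate_forms("KK", toy_rules)
# Hypothetical scoring: prefer forms with more keto sites.
best = canonical_form(all_forms, score=lambda form: form.count("K"))
capped_forms, capped_exhaustive = enumerate_forms("KK", toy_rules, max_operations=2)
```

The `exhaustive` flag corresponds to the situation mentioned later in the talk, where the enumeration for some structures stops at the cap before all tautomers have been generated.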
NCI/CADD Chemical Structure Database: Tautomer Analysis
• 21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: keten/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxim/nitroso
rule 17: oxim/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
NCI/CADD Chemical Structure Database: Tautomer Analysis (2009 DB version)
70.6 million FICuS parent structures
Starting from the set of FICuS parent structures, we systematically generated all tautomers based on the 21 SMIRKS rule set available in CACTVS.
generated 680 million tautomers
for 1.7% of the FICuS parent structures the enumeration was not exhaustive
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), x axis from 0.0 to 2.0; frequency (number of database releases), y axis from 0 to 90]
average: ~0.3% of original structure records
Database groups annotated on the histogram:
• Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs
• Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat
• NCI/DTP, PASS Training Set, SGC-Ox
• ChemDB, ZINC
• ChEBI, ChemSpider
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: occurrence of “tautomerism-critical” molecules within each individual database release (%), x axis from 0.5 to 24.5; frequency (number of database releases), y axis from 0 to 30]
percentage of FICuS parent structures in each database release occurring somewhere in CSDB with a conflict
average: ~9.5% of FICuS parent structures
Example for a Tautomer “Conflict”
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [structure drawing]
• HPMBP is used in liquid membranes (selective removal of metal ions)
• selectivity and efficiency depend on the tautomeric form of HPMBP
• the tautomeric form depends on solvent and concentration of HPMBP
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947
Example for a Tautomer “Conflict”
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers [structure drawings; one is marked as the canonical tautomer by CACTVS]
5 tautomers have potential stereo center on atoms or bonds [drawings labeled R/S or E/Z]
Example for a Tautomer “Conflict”
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned [structure drawings; stereo labels: (no stereo), (Z)]
CAS Registry Numbers (in slide order): 4551-69-1, 33064-14-1, 127117-31-1
reference counts (in slide order): 859 references, 49 references, 3 references
[remaining tautomer drawings labeled R/S or E/Z]
Example for a Tautomer “Conflict”
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases indexed in CSDB [drawings of the 7 tautomers, partly labeled R/S or E/Z]
• 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
• 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
• 3 databases (R): ChemSpider, ECOTOX, ZINC
• 2 databases (S): ChemSpider, ZINC
• 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
• 1 database (no stereo): ChemDB
Scaffold Analysis
NCI/CADD Chemical Structure Database: Scaffold Analysis
• molecular scaffold tree: Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
• archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
• simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
example: [structure drawings of an example compound and its scaffolds, including scaffold tree level 2 and level 1]
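The core idea behind ring-and-linker scaffolds of the kind cited above can be illustrated on a plain molecular graph: side chains are removed by repeatedly stripping atoms that have at most one remaining neighbour, which leaves ring atoms and the linkers between rings. The sketch below is a simplification under stated assumptions, not the procedure used for CSDB: atoms are hypothetical integer indices, bonds are index pairs, and chemical details (atom types, bond orders, exocyclic double bonds, the level-by-level ring removal of the scaffold tree) are ignored.

```python
def prune_to_framework(bonds):
    """Return the atoms left after stripping all side chains.

    bonds: iterable of (atom_a, atom_b) index pairs of one molecule.
    Atoms with at most one neighbour are removed repeatedly, so only
    rings and the chains connecting rings survive.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    leaves = [atom for atom, adjacent in neighbors.items() if len(adjacent) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in neighbors:  # already removed earlier
            continue
        for other in neighbors.pop(atom):
            neighbors[other].discard(atom)
            if len(neighbors[other]) <= 1:
                leaves.append(other)
    return set(neighbors)


# Six-membered ring (atoms 0-5) with a one-atom substituent (atom 6).
ring_with_substituent = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
# Two six-membered rings joined by linker atom 6, plus a two-atom side chain.
two_rings = (
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
    + [(0, 6), (6, 7)]
    + [(7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7)]
    + [(9, 13), (13, 14)]
)
framework_a = prune_to_framework(ring_with_substituent)
framework_b = prune_to_framework(two_rings)
```

In the second example the linker atom 6 is kept because it connects two rings, while the side chain on atom 9 is removed; an acyclic molecule prunes to an empty set.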
NCI/CADD Chemical Structure Database: Scaffold Analysis
CSDB uuuuu compound set: 76.2 million
scaffold types (in slide order): molecular scaffold tree, archetype scaffold, simple scaffold
scaffold counts (in slide order): 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds
[structure drawings: example scaffolds, scaffold tree level 2 and level 1]
NCI/CADD Chemical Structure Database: Scaffold Analysis
CSDB uuuuu compound set: 76.2 million
molecular scaffold tree: 8.1 million scaffolds
[Chart: number of unique scaffolds per hierarchy level; x axis: Hierarchy Level (1 to 10); left y axis: Number of Unique Scaffolds (in millions), 0 to 8.0; right y axis: Number of unique structures (in million), 0 to 80.0; structure drawings for level 2 and level 1]
Atom Neighborhoods
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
[structure drawing of the example molecule]
MNA level 1:
HC
HO
CHCC
CHCN
CCCC
CCOO
NCC
OHC
OC
MNA level 2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
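The nested notation of the level 2 descriptors suggests a recursive construction: the descriptor of an atom at level n is its element symbol followed by the sorted level n-1 descriptors of its neighbours. The sketch below implements only this recursion on a hand-written molecular graph. It is a simplified illustration, not the published MNA definition: the "-" marker that appears in the slide's descriptors is omitted, the output uses parentheses at every level (unlike the compact level 1 strings on the slide), and the function name and input format are hypothetical.

```python
def mna_descriptors(elements, bonds, level):
    """Simplified multilevel atom-neighbourhood strings for every atom.

    elements: list of element symbols, indexed by atom number.
    bonds: list of (atom_a, atom_b) index pairs.
    level 0 is the element symbol; level n wraps the sorted level n-1
    descriptors of all neighbours in parentheses.
    """
    neighbors = {index: [] for index in range(len(elements))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    current = list(elements)
    for _ in range(level):
        current = [
            elements[i] + "(" + "".join(sorted(current[j] for j in neighbors[i])) + ")"
            for i in range(len(elements))
        ]
    return current


# Methanol with explicit hydrogens: C0, O1, H2-H4 on carbon, H5 on oxygen.
methanol_elements = ["C", "O", "H", "H", "H", "H"]
methanol_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
level_1 = mna_descriptors(methanol_elements, methanol_bonds, level=1)
level_2 = mna_descriptors(methanol_elements, methanol_bonds, level=2)
unique_level_1 = set(level_1)
```

Collecting the set of distinct strings over all atoms of all structures is the kind of counting that yields "unique MNAs" per level; because each level nests the previous one, the number of distinct descriptors grows quickly with the level.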
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
CSDB uuuuu compound set: 76.2 million
Unique MNAs:
• level 1: 13,426 (1.3 billion relationships, ~17 MNAs per uuuuu parent structure)
• level 2: 918,516 (2.3 billion relationships, ~30 MNAs per uuuuu parent structure)
surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
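The averages quoted on this slide follow from dividing the number of structure-to-MNA relationships by the size of the uuuuu compound set. The short check below uses only numbers from the slide; assigning 1.3 billion relationships to level 1 and 2.3 billion to level 2 is an assumption, chosen because it reproduces the quoted averages of ~17 and ~30.

```python
# Numbers from the slide.
UUUUU_STRUCTURES = 76.2e6
# Assumption: pairing of relationship counts with MNA levels.
RELATIONSHIPS = {1: 1.3e9, 2: 2.3e9}
UNIQUE_LEVEL_2 = 918_516
EXCLUSIVE_TO_CHEMSPIDER_SET = 424_784


def mnas_per_structure(relationships, structures):
    """Average number of distinct MNAs per parent structure, per level."""
    return {level: round(count / structures) for level, count in relationships.items()}


averages = mnas_per_structure(RELATIONSHIPS, UUUUU_STRUCTURES)
# Share of all level 2 MNAs found only in the ChemSpider subset.
exclusive_share = round(100.0 * EXCLUSIVE_TO_CHEMSPIDER_SET / UNIQUE_LEVEL_2, 1)
print(averages, exclusive_share)
```

The last figure puts the "surprising" observation in proportion: under these numbers, about 46% of all unique level 2 MNAs occur only in that set of 1.3 million ChemSpider structures.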
Chemical Structure Web Services
Chemical Structure Web Services (NCI/CADD Web Resources)
[Diagram components, connected via http:]
• Chemical Identifier Resolver
• NCI/CADD web services
• NCI/CADD Chemical Structure Database (CSDB)
• CACTVS
• other software packages, e.g. OPSIN
• external (web) services
Chemical Identifier Resolver (NCI/CADD Web Resources)
• IUPHAR DATABASE: http://www.iuphar-db.org
• http://www.akosgmbh.eu/globalsearch/index.htm
• CACTVS: http://www.xemistry.com
• Symyx Draw Resolver: http://www.symyx.com/
• webel.py - A Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
• avogadro.openmolecules.net/
• gChem
• Virtual Molecular Model Kit: http://chemagic.com/web_molecules/script_page_large.aspx
Chemical Structure Lookup Service II
Work in progress …
Acknowledgments
• ChemNavigator: Scott Hutton, Tad Hurst
• University of Cambridge: Daniel Lowe, Peter Murray-Rust
• Noel O'Boyle (University College Cork, Ireland)
• Richard Apodaca (Metamolecular)
• Hans-Juergen Himmler
• CADD Group, CBL, NCI: Igor Filippov
• ChemSpider: Antony Williams, Valery Tkachenko
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov
Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
Acknowledgments - Software
• Python Web Framework
• ChemWriter
• Python SQL library
• Javascript library
• Peter Ertl
• CACTVS