5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk


Upload: markus-sitzmann

Post on 17-May-2015


TRANSCRIPT

Page 1: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space

U.S. Government Chemical Databases and Open Chemistry

Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]

[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany

Page 2: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemistry Space Analysis

• How many small molecules are there currently?

• Since the early 2000s, the number of databases “publishing” small molecules has grown enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank). What is the overlap?

• There are many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms).

• The number of chemical structure identifiers is growing (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …).

Page 3: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver

Diagram: the resolver connects a chemical structure with the following identifiers and representations:

• NCI/CADD Identifiers
• InChI/InChIKey
• ChemSpider ID
• PubChem SID/CID
• chemical names
• CAS Registry Number
• NSC number
• FDA UNII
• ChemNavigator SID
• SMILES
• SD File
• Chemical Formula
• ChEBI ID
• PDB Ligand ID
• MRV
• CML
• SYBYL Line Notation
• GIF image

Page 4: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Web Resources)

http://cactus.nci.nih.gov/chemical/structure

Works as a resolver for different chemical structure identifiers. It allows one to convert a given structure identifier into another representation or structure identifier.

• first beta release: July 2009
• current release (beta 4): April 2011

Page 5: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Web Resources)

• It is usable through a simple URL API:

http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"

• Example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8 (MIME type: text/plain).

• XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml

• If a request is not resolvable, the service answers with an HTTP 404 status message.
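The URL pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_url` and `resolve` are hypothetical, the percent-encoding of the identifier is a precaution not mentioned on the slide, and `resolve` needs network access to the service.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, xml=False):
    """Assemble a resolver URL following the pattern shown on the slide."""
    # Percent-encode the identifier so that names with spaces or SMILES
    # characters such as '#' or '/' stay inside one path segment (assumption).
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10):
    """Return the plain-text answer, or None if the request is not resolvable."""
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # the slide states that unresolvable requests give HTTP 404
            return None
        raise


if __name__ == "__main__":
    # Only builds and prints the URL; call resolve("Tamiflu", "cas") to query the service.
    print(build_url("Tamiflu", "cas"))
```

Mapping the 404 status to `None` keeps "not resolvable" distinct from other failures such as timeouts, which still raise.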

Page 6: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Public Web Resources)

http://cactus.nci.nih.gov/chemical/structure

“identifier” (input accepted by the resolver):

• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• FDA UNII

“representation” (output that can be requested):

• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid

Page 7: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Web Resources): processing of a request

• HTTP request with “identifier” and “representation”
• detection of the identifier type
• if the identifier is a full structure representation (e.g. SMILES, InChI): the structure is available directly
• if the identifier is a hashed structure representation (e.g. InChIKey), a trivial name, etc.: database lookup of the structure in the NCI/CADD Chemical Structure Database (CSDB)
• calculation of the requested structure representation (e.g. InChI, GIF image) with CACTVS, or lookup of the representation (e.g. CAS number, chemical name) in CSDB
• HTTP response with the appropriate MIME type
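The "detection of the identifier type" step can be illustrated with a small sketch. It is not the resolver's actual implementation: the function names are hypothetical, only three identifier types are recognized (InChI by its prefix, InChIKey by its 14-10-1 letter pattern, CAS numbers by pattern and check digit), and everything else, including SMILES, falls through to "name or other".

```python
import re

# InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_NUMBER = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text):
    """Check the CAS pattern and validate the check digit."""
    match = CAS_NUMBER.match(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def detect_identifier_type(text):
    """Classify an input string into one of a few coarse identifier types."""
    text = text.strip()
    if text.startswith("InChI="):
        return "full structure (InChI)"
    if INCHIKEY.match(text):
        return "hashed structure (InChIKey)"
    if is_cas_number(text):
        return "CAS number"
    return "name or other"


def route(text):
    """Full structures can be computed on; everything else needs a lookup."""
    kind = detect_identifier_type(text)
    return "calculate" if kind.startswith("full structure") else "database lookup"


if __name__ == "__main__":
    for example in ["204255-11-8", "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "InChI=1S/CH4/h1H4", "Tamiflu"]:
        print(example, "->", detect_identifier_type(example), "->", route(example))
```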

Page 9: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Web Resources)

Request:

http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xmls?resolver=name_by_chemspider,name_by_opsin,name_by_cir

Response:

<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
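A response of this shape can be read with Python's standard XML parser. The sketch below works on a copy of the XML from the slide rather than on a live request; the function name `results_by_resolver` is hypothetical, and the element and attribute names are taken from the example only, not from a published schema.

```python
import xml.etree.ElementTree as ET

# Copy of the response shown on the slide (typographic quote repaired).
SAMPLE_RESPONSE = """
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
"""


def results_by_resolver(xml_text):
    """Map each resolver name to the list of item texts it returned."""
    root = ET.fromstring(xml_text.strip())
    return {
        data.get("resolver"): [item.text for item in data.findall("item")]
        for data in root.findall("data")
    }


if __name__ == "__main__":
    results = results_by_resolver(SAMPLE_RESPONSE)
    for resolver, items in results.items():
        print(resolver, items)
    # All three resolvers agree if the set of answers has exactly one member.
    distinct = {smiles for items in results.values() for smiles in items}
    print("resolvers agree:", len(distinct) == 1)
```

Comparing the answers of several resolvers in this way is a cheap consistency check for name-to-structure conversion.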

Page 10: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Structure Database (CSDB) of the Chemical Identifier Resolver

Sources:

• ChemNavigator iResearch Library: compilation of commercially available screening compounds from ~330 international chemistry suppliers
• PubChem database, including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST WebBook, NLM ChemIDplus, ChemSpider, …
• Commercial sources / others: Asinex, Comgenex, eMolecules, ChEMBL, …

Shares (pie chart): ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%

Currently:

• ~150 chemical structure databases
• ~120 million structure records
• ~81.6 million unique structures by NCI/CADD FICuS Identifier
• ~84 million unique structures by Std. InChIKey

Page 11: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Structure Identifiers

FICTS, FICuS, uuuuu

Page 12: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned):
    • high sensitivity to structural features of a compound
    • they change if the connectivity changes

Example (structure drawing of histidine): 9850FD9F9E2B4E25

Page 13: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

Workflow (diagram):

original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure (SDF, SMILES, database) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier

Page 14: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

Workflow (diagram):

original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structures FICTS, FICuS, uuuuu (SDF, SMILES, database) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier

• calculation of a set of parent structures with different sensitivity to chemical features
• representation of chemical structures on different levels

Page 15: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

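NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

Sensitivity of the identifiers to chemical features (yes = sensitive, no = not sensitive; a lower-case u in the tag marks a feature the identifier is not sensitive to):

Identifier   Fragments   Isotopes   Charges   Tautomers   Stereo
FICTS        yes         yes        yes       yes         yes
FICuS        yes         yes        yes       no          yes
uuuuu        no          no         no        no          no

Format of the identifier:

<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>

Example (structure drawing: sodium salt of histidine):

4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27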
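The identifier format can be taken apart with a regular expression. The following is a minimal sketch, not the NCI/CADD implementation: the names `StructureIdentifier`, `parse_identifier` and `sensitive_features` are hypothetical, the checksum is assumed to be two hexadecimal digits (as in the three examples) and is not verified because its algorithm is not given on the slide, and a lower-case u in the tag is read as "not sensitive to this feature".

```python
import re
from typing import NamedTuple


class StructureIdentifier(NamedTuple):
    hashcode: int   # CACTVS hashcode (E_HASHISY) as an unsigned 64-bit value
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str


PATTERN = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")
FEATURES = ["fragments", "isotopes", "charges", "tautomers", "stereo"]


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its four fields."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    hex_hash, tag, version, checksum = match.groups()
    return StructureIdentifier(int(hex_hash, 16), tag, version, checksum)


def sensitive_features(tag):
    """List the features a tag is sensitive to; a 'u' marks 'not sensitive'."""
    return [name for letter, name in zip(tag, FEATURES) if letter != "u"]


if __name__ == "__main__":
    for text in ("4A122D094098B50D-FICTS-01-1D",
                 "0E26B623DF7FAD30-FICuS-01-70",
                 "9850FD9F9E2B4E25-uuuuu-01-27"):
        parsed = parse_identifier(text)
        print(f"{parsed.hashcode:016X}", parsed.tag, sensitive_features(parsed.tag))
```

Storing the hashcode as an integer rather than as text fits the "64-bit unsigned" description and allows compact indexing in a database column.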

Page 16: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Figure: histidine (structure drawing) surrounded by variants of the same compound as they occur in structure records:

• charged form
• tautomer
• isotope
• salt
• stereoisomers
• “errors”

Page 17: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Figure: FICTS identifiers of histidine and its variants (charged form, tautomer, isotope, salt, stereoisomers, “errors”).

Identifier of histidine: 9850FD9F9E2B4E25-FICTS

Labels shown on the slide (nine structures, including histidine itself):

• A3DAE0788050DDE4-FICTS
• E5F83F10C5DB080A-FICTS
• B2FDA68AEDA06DB9-FICTS
• 9850FD9F9E2B4E25-FICTS
• E5F83F10C5DB080A-FICTS
• E92E4BA2869F3611-FICTS
• 8A7AD1EB498CC76A-FICTS
• 6C16DE2351F9FF50-FICTS
• 9850FD9F9E2B4E25-FICTS

Page 18: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Figure: FICuS identifiers of histidine and its variants (charged form, tautomer, isotope, salt, stereoisomers, “errors”).

Identifier of histidine: 9850FD9F9E2B4E25-FICuS

Labels shown on the slide (nine structures, including histidine itself):

• A3DAE0788050DDE4-FICuS
• E5F83F10C5DB080A-FICuS
• B2FDA68AEDA06DB9-FICuS
• 9850FD9F9E2B4E25-FICuS
• E5F83F10C5DB080A-FICuS
• E92E4BA2869F3611-FICuS
• 8A7AD1EB498CC76A-FICuS
• 9850FD9F9E2B4E25-FICuS
• 9850FD9F9E2B4E25-FICuS

Page 19: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Figure: uuuuu identifiers of histidine and its variants (charged form, tautomer, isotope, stereoisomers, salt, “errors”).

Identifier of histidine: 9850FD9F9E2B4E25-uuuuu

Labels shown on the slide (nine structures, including histidine itself): eight read 9850FD9F9E2B4E25-uuuuu and one reads 9850FD9F9E2B4E25-FICuS, so all nine structures share the hashcode 9850FD9F9E2B4E25.

Page 20: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Figure: Std. InChIKeys of histidine and its variants (charged form, tautomer, isotope, stereoisomers, salt, “errors”).

Std. InChIKey of histidine: HNDVDQJCIGZPNO-UHFFFAOYSA-N

Labels shown on the slide (nine structures, including histidine itself):

• HNDVDQJCIGZPNO-UHFFFAOYSA-N
• HNDVDQJCIGZPNO-CDYZYAPPSA-N
• HNDVDQJCIGZPNO-RXMQYKEDSA-N
• HNDVDQJCIGZPNO-YFKPBYRVSA-N
• HNDVDQJCIGZPNO-UHFFFAOYSA-N
• HNDVDQJCIGZPNO-UHFFFAOYSA-N
• UHPNKBYGGMJTIM-UHFFFAOYSA-M
• UHPNKBYGGMJTIM-UHFFFAOYSA-M
• HNDVDQJCIGZPNO-UHFFFAOYSA-N

Page 21: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Structure Normalization

• 119.8 million original structure records in CSDB

Page 22: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Structure Normalization

• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures

Page 23: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Structure Normalization

• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures

Page 24: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Structure Normalization

• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures
• 76.2 million uuuuu parent structures

Page 25: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Structure Normalization

• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures
• 76.2 million uuuuu parent structures
• label on the diagram: tautomer-invariant
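The four counts can be turned into step-by-step reductions with a few lines of arithmetic. The sketch below uses only the numbers from the slide; the function name `reduction_steps` and the reading of the FICTS to FICuS step as "structures that collapse once tautomerism is ignored" are interpretations, not statements from the talk.

```python
# Counts in millions, in the order of the normalization levels on the slide.
COUNTS_MILLIONS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def reduction_steps(counts):
    """Return (from, to, removed in millions, removed in percent) per step."""
    names = list(counts)
    steps = []
    for previous, current in zip(names, names[1:]):
        removed = counts[previous] - counts[current]
        percent = 100 * removed / counts[previous]
        steps.append((previous, current, round(removed, 1), round(percent, 1)))
    return steps


if __name__ == "__main__":
    for previous, current, removed, percent in reduction_steps(COUNTS_MILLIONS):
        print(f"{previous} -> {current}: {removed} million fewer ({percent}%)")
```

Read this way, duplicate records account for by far the largest reduction (about 31% of the original records), while the tautomer-insensitive FICuS level removes only about 1.8% of the FICTS parent structures.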

Page 26: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Tautomer Analysis

How much “chemical space” is “just generated” by drawing tautomers?

Page 27: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis

• CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• the rule set is systematically applied to the original structure (and to all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform operations per structure
• all tautomers are ranked by a scoring function
• the highest-ranked tautomer is defined as the canonical tautomer
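The enumeration scheme in the bullets above (apply every rule to every structure found so far, stop after a fixed number of transform operations, then rank) can be sketched generically. This is a toy illustration and not CACTVS: the two "transforms" are plain text substitutions on a SMILES-like string rather than SMIRKS, the scoring function is invented, and all names are hypothetical.

```python
from collections import deque


def enumerate_tautomers(start, transforms, max_operations=1000):
    """Breadth-first closure of `start` under `transforms`.

    Returns (set of structures, exhaustive flag). The flag is False when
    the operation limit was reached before all transforms had been tried.
    """
    seen = {start}
    queue = deque([start])
    operations = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                return seen, False
            operations += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_tautomer(tautomers, score):
    """Pick the highest-scoring structure; ties are broken by string order."""
    return max(tautomers, key=lambda structure: (score(structure), structure))


# Toy stand-ins for two of the 21 rules (text substitution, not real chemistry).
def keto_to_enol(structure):
    return [structure.replace("CC(=O)", "C=C(O)", 1)] if "CC(=O)" in structure else []


def enol_to_keto(structure):
    return [structure.replace("C=C(O)", "CC(=O)", 1)] if "C=C(O)" in structure else []


def toy_score(structure):
    # Invented preference: favor the keto form.
    return structure.count("=O")


if __name__ == "__main__":
    found, exhaustive = enumerate_tautomers("CC(=O)C", [keto_to_enol, enol_to_keto])
    print(sorted(found), "exhaustive:", exhaustive)
    print("canonical:", canonical_tautomer(found, toy_score))
```

The `exhaustive` flag mirrors the bookkeeping needed to report for which structures the limit of transform operations cut the enumeration short.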

Page 28: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis

• 21 SMIRKS transform rules:

rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids

Page 29: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis (2009 DB version)

• Starting from the set of 70.6 million FICuS parent structures, we systematically generated all tautomers based on the 21-rule SMIRKS set available in CACTVS.
• This generated 680 million tautomers.
• For 1.7% of the FICuS parent structures the enumeration was not exhaustive.
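As a quick plausibility check, the figures of the 2009 run translate into an average number of tautomers per structure. The variable names below are illustrative, and the slide does not say whether the 680 million include the starting structures, so the average is only approximate.

```python
# Figures from the slide (2009 DB version).
FICUS_PARENTS = 70.6e6          # FICuS parent structures used as input
GENERATED_TAUTOMERS = 680e6     # tautomers generated in total
NOT_EXHAUSTIVE_SHARE = 0.017    # share of structures with incomplete enumeration

average_per_structure = GENERATED_TAUTOMERS / FICUS_PARENTS
incomplete_structures = NOT_EXHAUSTIVE_SHARE * FICUS_PARENTS

if __name__ == "__main__":
    print(f"average tautomers per FICuS parent: {average_per_structure:.1f}")
    print(f"structures with incomplete enumeration: {incomplete_structures / 1e6:.1f} million")
```

This works out to roughly 9.6 tautomers per FICuS parent structure and about 1.2 million structures for which the enumeration stopped early.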

Page 30: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis

Figure: histogram of the tautomeric overlap within each individual database release (x-axis: overlap in %, 0.0 to 2.0; y-axis: frequency as number of database releases, 0 to 90).

average: ~0.3% of original structure records

Page 31: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis

Figure: histogram of the tautomeric overlap within each individual database release (x-axis: overlap in %, 0.0 to 2.0; y-axis: frequency as number of database releases, 0 to 90), annotated with database names.

average: ~0.3% of original structure records

Databases named on the histogram, grouped as on the slide:

• Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs
• Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat
• NCI/DTP, PASS Training Set, SGC-Ox
• ChemDB, ZINC
• ChEBI, ChemSpider

Page 32: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Tautomer Analysis

Figure: histogram of the occurrence of “tautomerism-critical” molecules within each individual database release, i.e. the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict (x-axis: 0.5 to 24.5%; y-axis: frequency as number of database releases, 0 to 30).

average: ~9.5% of FICuS parent structures

Page 33: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) (structure drawing)

• HPMBP is used in liquid membranes (selective removal of metal ions)
• selectivity and efficiency depend on the tautomeric form of HPMBP
• the tautomeric form depends on the solvent and the concentration of HPMBP

He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54 (10), 2944-2947.

Page 34: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

• CACTVS generates 7 tautomers (structure drawings); one of them is the canonical tautomer by CACTVS
• 5 tautomers have potential stereo centers on atoms (R/S) or bonds (E/Z)

Page 35: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

3 tautomers have CAS Registry Numbers assigned (structure drawings). Values in the order they appear on the slide:

• CAS Registry Numbers: 4551-69-1, 33064-14-1, 127117-31-1
• reference counts: 859 references, 49 references, 3 references
• stereo annotations: (no stereo), (Z)

Page 36: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

Occurrences of the 7 tautomers (structure drawings) in databases indexed in CSDB:

• 6 databases
• 16 databases (no stereo)
• 3 databases (R)
• 2 databases (S)
• 12 databases
• 1 database (no stereo)

Page 37: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

Occurrences of the tautomers (structure drawings) in databases:

• 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
• 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
• 3 databases (R): ChemSpider, ECOTOX, ZINC
• 2 databases (S): ChemSpider, ZINC
• 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
• 1 database (no stereo): ChemDB

Page 38: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Scaffold Analysis

Page 39: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Scaffold Analysis

• molecular scaffold tree (level 1, level 2, …): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
• archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
• simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893

Figure: an example structure with its simple scaffold, its archetype scaffold, and levels 1 and 2 of its molecular scaffold tree.

Page 40: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Scaffold Analysis

• input: the 76.2 million uuuuu compound set in CSDB

Page 41: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Scaffold Analysis

• input: the 76.2 million uuuuu compound set in CSDB
• scaffold types: molecular scaffold tree (level 1, level 2), archetype scaffold, simple scaffold
• scaffold counts shown on the slide: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds

Page 42: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Scaffold Analysis

• input: the 76.2 million uuuuu compound set in CSDB
• 8.1 million scaffolds

Figure: number of unique scaffolds per hierarchy level of the molecular scaffold tree (x-axis: hierarchy level, 1 to 10; left y-axis: number of unique scaffolds in millions, 0 to 8.0; right y-axis: number of unique structures in millions, 0 to 80.0).

Page 43: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Atom Neighborhoods

Page 44: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)

Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.

Example (structure drawing):

MNA level 1:

HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC

MNA level 2:

C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
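How such descriptors are assembled can be shown with a simplified sketch. It follows the notation of the example (an atom symbol followed by its neighbors, nested once more for level 2) but makes loud simplifications: the "-" marker seen in the slide's descriptors is omitted, neighbors are ordered by plain string sorting, and the function names and the methanol test molecule are illustrative only. It is not the published MNA algorithm.

```python
def build_neighbors(n_atoms, bonds):
    """Adjacency lists from a list of (atom index, atom index) bonds."""
    neighbors = {atom: [] for atom in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def mna_level1(symbols, neighbors, atom):
    """Atom symbol followed by the sorted symbols of its neighbors."""
    return symbols[atom] + "".join(sorted(symbols[n] for n in neighbors[atom]))


def mna_level2(symbols, neighbors, atom):
    """Atom symbol with each neighbor written as symbol(sorted neighbor symbols)."""
    parts = []
    for n in neighbors[atom]:
        inner = "".join(sorted(symbols[m] for m in neighbors[n]))
        parts.append(f"{symbols[n]}({inner})")
    return f"{symbols[atom]}({''.join(sorted(parts))})"


# Methanol with explicit hydrogens: C0, O1, H2-H4 on carbon, H5 on oxygen.
SYMBOLS = ["C", "O", "H", "H", "H", "H"]
BONDS = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
NEIGHBORS = build_neighbors(len(SYMBOLS), BONDS)

LEVEL1 = {mna_level1(SYMBOLS, NEIGHBORS, atom) for atom in range(len(SYMBOLS))}
LEVEL2 = {mna_level2(SYMBOLS, NEIGHBORS, atom) for atom in range(len(SYMBOLS))}

if __name__ == "__main__":
    print(sorted(LEVEL1))
    print(sorted(LEVEL2))
```

Collecting the descriptors into sets, as done here, yields the unique MNAs of a molecule, which is the quantity needed for database-wide statistics on unique MNAs.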

Page 45: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)

• input: the 76.2 million uuuuu compound set in CSDB

Page 46: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)

• input: the 76.2 million uuuuu compound set in CSDB

Unique MNAs:

• level 1: 13,426
• level 2: 918,516

Page 47: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)

• input: the 76.2 million uuuuu compound set in CSDB

Unique MNAs:

• level 1: 13,426 unique MNAs, 1.3 billion relationships, ~17 MNAs per uuuuu parent structure
• level 2: 918,516 unique MNAs, 2.3 billion relationships, ~30 MNAs per uuuuu parent structure

Page 48: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)

• input: the 76.2 million uuuuu compound set in CSDB

Unique MNAs:

• level 1: 13,426 unique MNAs, 1.3 billion relationships, ~17 MNAs per uuuuu parent structure
• level 2: 918,516 unique MNAs, 2.3 billion relationships, ~30 MNAs per uuuuu parent structure

Surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider.

Page 49: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Structure Web Services (NCI/CADD Web Resources)

Components shown in the diagram:

• NCI/CADD web services, among them the Chemical Identifier Resolver, accessed via HTTP
• NCI/CADD Chemical Structure Database (CSDB)
• CACTVS
• external (web) services
• other software packages (e.g. OPSIN)

Page 50: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Identifier Resolver (NCI/CADD Web Resources)

Sites and tools shown on the slide:

• IUPHAR Database: http://www.iuphar-db.org
• http://www.akosgmbh.eu/globalsearch/index.htm
• CACTVS: http://www.xemistry.com
• Symyx Draw Resolver: http://www.symyx.com/
• webel.py, a Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
• avogadro.openmolecules.net/
• gChem
• Virtual Molecular Model Kit: http://chemagic.com/web_molecules/script_page_large.aspx

Page 51: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Structure Lookup Service II: Work in progress …

Page 52: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Chemical Structure Lookup Service II: Work in progress …

Page 53: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Acknowledgments

• ChemNavigator: Scott Hutton, Tad Hurst
• University of Cambridge: Daniel Lowe, Peter Murray-Rust
• Noel O'Boyle (University College Cork, Ireland)
• Richard Apodaca (Metamolecular)
• Hans-Juergen Himmler
• CADD Group, CBL, NCI: Igor Filippov
• ChemSpider: Antony Williams, Valery Tkachenko

Thanks to all database providers!

Our web site: http://cactus.nci.nih.gov

Page 54: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

http://cactus.nci.nih.gov/chemical/structure

Chemical Identifier Resolver (NCI/CADD Web Resources)

http://cactus.nci.nih.gov/blog

Page 55: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

Acknowledgments - Software

• Python Web Framework
• ChemWriter
• Python SQL library
• Javascript library
• Peter Ertl
• CACTVS

Page 56: 5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk
