Smashing Molecules: How Molecular Fragments Allow Us to Explore Large Chemical Spaces
Rajarshi Guha & Trung Nguyen
NIH Center for Translational Therapeutics
Chemaxon UGM September 2011
Outline
• Fragments as the building blocks of chemistry
• Fragments and SAR
• Fragments and activity profiles
Big Data for Some Problems
• Halevy et al. discuss the effectiveness of extremely large datasets
• Their application focuses on machine translation – see the Google n-gram corpus
• They suggest that such extremely large datasets are useful because they effectively encompass all n-grams (phrases) commonly used
• Domain is relatively constrained
Halevy et al., IEEE Intelligent Systems, 2009, 24, 8-12
Google Scale in Chemistry?
• What would be the equivalent of an n-gram corpus in chemistry?
– Fragments
– A more direct analogy can be made by using LINGOs
• It is possible to generate arbitrarily large (virtual) compound and fragment collections
• But would such a collection span all of “commonly used” chemistry?
– Depending on the initial compound set, yes
– But we’re also interested in going beyond such a “commonly used” set
Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
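To make the n-gram analogy concrete, the sketch below treats a SMILES string as text and collects its overlapping substrings of length q, which is the basic idea behind LINGOs. It is a simplified illustration only: the function names are hypothetical, and the SMILES normalization steps used in published LINGO implementations are deliberately left out.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping length-q substrings of a SMILES string.

    Simplified: no normalization of ring-closure digits or two-letter
    element symbols is attempted here.
    """
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(a: str, b: str, q: int = 4) -> float:
    """Tanimoto coefficient over the sets of distinct substrings of two SMILES."""
    la, lb = set(lingos(a, q)), set(lingos(b, q))
    if not la and not lb:
        return 0.0
    return len(la & lb) / len(la | lb)


if __name__ == "__main__":
    # Phenol vs. aniline, written as aromatic SMILES
    print(lingos("c1ccccc1O"))
    print(lingo_tanimoto("c1ccccc1O", "c1ccccc1N"))
```

Counting these substrings over a large compound collection would give a frequency table analogous to an n-gram corpus.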
Fragment Diversity
• Consider a set of bioactives such as the LOPAC collection, 1280 compounds
• Using exhaustive fragmentation we get 2,460 unique fragments
• On the MLSMR (~ 372K compounds), we get 164,583 fragments
[Figure: histogram of log fragment frequency (x-axis) vs. percent of total (y-axis); scatter plot of PC 1 vs. PC 2]
Fragment Diversity
• Distribution of MLSMR fragments in BCUT space
[Figure: scatter plots of PC 1 vs. PC 2; legend: all fragments, and fragments occurring in 5 to 50 molecules]
What Do We Do with Fragments?
• Assuming we obtain fragments from a large enough collection, what do we do?
– Learning from fragments – QSARs, generative models
– Use fragments as filters, as an alternative to clustering
– Explore chemotypes and activity
– Scaffold-level promiscuity
White, D and Wilson, RC, J Chem Inf Model, 2010, 50, 1257-1274
Scaffold Activity Diagrams
• Network-oriented view of fragment (scaffold) collections
– Similar in idea to Scaffold Hunter etc.
– Not purely hierarchical
• Color by arbitrary properties
• Quickly assess utility of a scaffold
• Try it online
What Makes a Good Scaffold?
• What makes a good scaffold?
– Size, complexity, …
– Do the members represent an SAR or not?
– Intuition and experience also play a role
Scaffold QSAR
[Figure: scatter plot of observed vs. predicted values; both axes run from -8 to 0]
Evaluate topological and physicochemical descriptors for the R-groups
Fit PLS or ridge regression model
Characterize the SAR landscape
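The fitting step can be sketched with a small ridge regression written in plain Python. This is only an illustration of the technique, not the authors' implementation: the descriptor matrix is hypothetical, no intercept is fitted (features and activities are assumed to be centered beforehand), and `ridge_fit` and `predict` are names invented for this example.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    X is a list of rows (one per scaffold member), y the activities.
    lam must be > 0 when there are more features than observations,
    otherwise the system is singular.
    """
    n, p = len(X), len(X[0])
    # Normal-equation matrix with the ridge penalty on the diagonal
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Forward elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = sum(A[i][j] * w[j] for j in range(i + 1, p))
        w[i] = (b[i] - s) / A[i][i]
    return w


def predict(X, w):
    """Predicted activity for each row of X."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


if __name__ == "__main__":
    # Made-up R-group descriptors for three scaffold members
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, 3.0, 5.0]
    w = ridge_fit(X, y, lam=1e-9)
    print(w, predict(X, w))
```

The penalty term keeps the system solvable even when a scaffold has fewer members than descriptors, which is the usual situation for small scaffolds.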
Scaffold QSAR - Drawbacks
• Many scaffolds have few (5 to 10) members
• Invariably, more features than observations
• If the number of R-groups is large, the feature matrix can be very sparse
– Less of a problem for combinatorial libraries
• A linear fit may not be the best approach to correlating R-groups to the activities
– Difficult to choose a model type a priori
Fragment Activity Profiles
• Using scaffolds in HTS triage usually leads to two questions
– What is known about the chemical series with respect to the intended target?
– What compound classes are known to modulate the intended target, and how similar are they to the series in question?
• We’re interested in exploring summaries of activity, grouped by scaffolds and targets
Fragment Activity Profiles
• We use ChEMBL (08) as the source of bioactivity across multiple targets
• Preprocess the database
– Generate scaffolds (exhaustive enumeration of combinations of SSSRs)
– Normalize activity data so that we can compare the activity of a molecule across different assays
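One possible reading of "exhaustive enumeration of combinations of SSSRs" is sketched below. It assumes the rings have already been perceived and are given as sets of atom indices, and it treats two rings as adjacent when they share at least one atom. Both choices are assumptions made for illustration (linkers between rings are ignored), and the function names are hypothetical.

```python
from itertools import combinations


def connected(subset, rings):
    """True if the rings in `subset` form one group linked by shared atoms."""
    subset = list(subset)
    seen, stack = {subset[0]}, [subset[0]]
    while stack:
        r = stack.pop()
        for s in subset:
            if s not in seen and rings[r] & rings[s]:
                seen.add(s)
                stack.append(s)
    return len(seen) == len(subset)


def enumerate_ring_fragments(rings):
    """Return the atom set of every connected combination of rings."""
    fragments = set()
    for k in range(1, len(rings) + 1):
        for subset in combinations(range(len(rings)), k):
            if connected(subset, rings):
                # Union of the atoms of all rings in this combination
                fragments.add(frozenset().union(*(rings[i] for i in subset)))
    return fragments


if __name__ == "__main__":
    # Three linearly fused six-membered rings (anthracene-like atom numbering)
    rings = [
        frozenset({0, 1, 2, 3, 4, 5}),
        frozenset({4, 5, 6, 7, 8, 9}),
        frozenset({8, 9, 10, 11, 12, 13}),
    ]
    for frag in sorted(enumerate_ring_fragments(rings), key=len):
        print(sorted(frag))
```

The number of subsets grows exponentially with the ring count, so a practical implementation would likely cap the number of rings considered per molecule.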
Database Setup
• Preprocessing steps available as a Java servlet
– http://tripod.nih.gov/files/chembl-servlets.zip
• Need ChEMBL installed in Oracle; we add some extra tables
– Fragment structures and computed properties
– Aggregated assay activity summary
• Only consider assays with IC50s in nM and uncensored data, more than 5 observations and a MAD > 0
– (Robust) z-scored activities
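The assay filter and the robust z-score can be sketched as follows. The median/MAD form used here is the common definition of a robust z-score. Whether the original preprocessing also rescales the MAD or log-transforms the IC50 values is not stated, so neither is done here, and the function name is hypothetical.

```python
from statistics import median


def robust_z_scores(values, min_obs=6):
    """Return (x - median) / MAD for each value in one assay.

    Returns None when the assay is filtered out: fewer than `min_obs`
    observations (i.e. not "more than 5") or a MAD of zero.
    """
    if len(values) < min_obs:
        return None
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return None
    return [(v - med) / mad for v in values]


if __name__ == "__main__":
    # Made-up IC50 values (nM) for a single assay
    print(robust_z_scores([1, 2, 3, 4, 5, 6, 100]))
```

Because the median and MAD are insensitive to a few extreme values, one very weak or very potent compound does not distort the scores of the rest of the assay.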
Some Fragment Statistics
• Considered Z-score range of -40 to 15
• There were 12,887 molecules lying outside this range
[Figure: histogram of log(number of molecules) (x-axis) vs. percentage of assays (y-axis); histogram of Z-score (x-axis, -40 to 10) vs. number of compounds (y-axis)]
Some Fragment Statistics
• Next, identify fragments with 8 to 20 atoms and occurring in 100 to 900 molecules
• Gives us 1,746 fragments
[Figure: histogram of number of molecules (x-axis, 200 to 800) vs. percentage of fragments (y-axis)]
Some Fragment Statistics
• We can query the fragment tables to get activity summaries for individual fragments
• For these examples we consider the full range of Z-scores
[Figure: histograms of Z-score (x-axis) vs. percent of total (y-axis), one panel per fragment; panel labels with N as extracted: 778 (N = 1280), 2723 (N = 1918), 4058 (N = 2641), 5390 (N = 1489), 5486 (N = 1578), 13485 (N = 1455), 40169 (N = 1457), 64473 (N = 1595), 115654 (N = 1515)]
Exploring Activity Profiles
• Fragments from ChEMBL
• Activity distributions of parent molecules across all targets
• Z-scores for individual molecules against a specific target
Exploring Activity Profiles
• User can draw a molecule and fragment on the fly
• Use generated fragments to create activity histograms
Target Selection
• Employs the ChEMBL target hierarchy
• Can select target families or individual targets
Similar Fragments with Similar Profiles?
• Consider 658 fragments with > 10 atoms and occurring in 500 to 1200 molecules
• Overall, the fragments tend to be dissimilar
– 95th percentile is just 0.50
• 1,873 pairs do exhibit Tc > 0.8
[Figure: histogram of Tanimoto similarity (x-axis, 0.0 to 1.0) vs. percentage of pairs (y-axis)]
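The pairwise comparison summarized above can be sketched as below, assuming each fragment fingerprint is available as a set of on-bit positions. The fingerprint type used in the original analysis is not stated, and the nearest-rank percentile is one convention among several. All function names are hypothetical.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient (Tc) between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_similarities(fps):
    """Sorted Tc values for all unordered pairs of fingerprints."""
    return sorted(tanimoto(fps[i], fps[j])
                  for i, j in combinations(range(len(fps)), 2))


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_vals:
        raise ValueError("no values")
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


if __name__ == "__main__":
    # Toy fingerprints, not real fragments
    fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    sims = pairwise_similarities(fps)
    print(percentile(sims, 95), sum(s > 0.8 for s in sims))
```

For n fragments this evaluates n(n-1)/2 pairs, so 658 fragments give 216,153 comparisons.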
Comparing Activity Profiles
• Compare activity profiles with the K-S statistic
• Color corresponds to p-value of the K-S test
• No obvious correlation between fragment similarity & activity profile similarity
• Probably not rigorous when a scaffold has few parent molecules
[Figure: K-S statistic (y-axis, 0.0 to 0.6) vs. Tanimoto similarity (x-axis, 0.80 to 1.00); color scale from 0.0 to 1.0]
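For reference, the two-sample K-S statistic is the largest vertical gap between the empirical cumulative distributions of the two activity profiles. The sketch below computes only the statistic; the p-value used for coloring is left out, and the function name is hypothetical.

```python
def ks_statistic(x, y):
    """Two-sample K-S statistic: max |F_x(t) - F_y(t)| over all observed t."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    for t in sorted(set(xs) | set(ys)):
        # Advance each pointer past all values <= t to get the ECDF counts
        while i < n and xs[i] <= t:
            i += 1
        while j < m and ys[j] <= t:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


if __name__ == "__main__":
    # Made-up Z-score profiles for two fragments
    print(ks_statistic([-3.1, -2.0, -0.5, 0.2], [-0.4, 0.1, 0.9, 1.5]))
```

With only a handful of parent molecules per fragment the empirical distributions are coarse step functions, which is one way to see why the comparison is less reliable for small scaffolds.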
Exploring Profiles for Fragment Pairs
• Compare activity distributions across all targets in a pairwise fashion
• Can also generate comparison for a single target, but requires data for all the fragments
Looking for Selective Fragments
• Interesting to visually explore fragment pairs
• Can become tedious, especially in a database as big as ChEMBL
• Can we automate this type of analysis?
– Identify fragment pairs with very different activity distributions?
– Identify fragments with a preference for a certain target (class)?
Targetwise Activity Profiles
[Figure: mean Z-score (y-axis) per target class (x-axis; classes from Acetylcholine receptor to Tk, including the CYP isoforms), annotated with the number of parent molecules per class; panel label 4056459]
• Evaluate mean activity of parent molecules within a target class
• Count number of parent molecules tested against the target
• Selectivity of 1-phenylimidazole for CYP450 has been noted
Wilkinson et al., Biochem Pharmacol, 1983, 32, 997-1003
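The two summary steps (mean activity and molecule count per target class) amount to a group-by. A minimal sketch follows, assuming one (target class, Z-score) record per parent molecule and target class. The record layout, function name and example values are all made up for illustration.

```python
from collections import defaultdict


def targetwise_profile(records):
    """Map each target class to (mean Z-score, number of parent molecules).

    `records` is an iterable of (target_class, z_score) pairs for the
    parent molecules of a single fragment.
    """
    groups = defaultdict(list)
    for target_class, z in records:
        groups[target_class].append(z)
    return {t: (sum(zs) / len(zs), len(zs)) for t, zs in groups.items()}


if __name__ == "__main__":
    # Hypothetical records, not taken from ChEMBL
    records = [("CYP_3A4", -6.0), ("CYP_3A4", -4.0), ("Tk", 0.5)]
    for target, (mean_z, n) in targetwise_profile(records).items():
        print(target, mean_z, n)
```

Reporting the count next to the mean matters because a strongly negative mean based on one or two molecules says little about target preference.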
Targetwise Activity Profiles
• Identified benzylpyrrolidine as a fragment with preference for a specific target class
• But reported as dopamine agonists
[Figure: mean Z-score (y-axis) per target class (x-axis; classes from A2A to Tkl), annotated with the number of parent molecules per class; panel label 4055899]
Fragment or Scaffold?
• I’ve been using fragment & scaffold interchangeably – not always true
• Chemists have an intuitive idea of what a scaffold is
• Can we encode the idea of scaffold-like or fragment-like?
• We use the concept of Signal-to-Noise Ratio, $\mathrm{SNR} = \mu / \sigma$
– $\mu$: size of fragment
– $\sigma$: SD of number of atoms not in the fragment, considered over the parent molecules
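A worked sketch of the SNR definition, assuming sizes are heavy-atom counts and that the sample standard deviation is intended (the slides do not say which SD estimator is used). The function name is hypothetical.

```python
import math
from statistics import stdev


def fragment_snr(fragment_size, parent_sizes):
    """SNR = fragment size / SD of the atoms outside the fragment.

    `parent_sizes` holds the atom counts of the parent molecules and
    needs at least two entries for the sample SD to be defined.
    """
    extra = [n - fragment_size for n in parent_sizes]
    sd = stdev(extra)
    # All parents decorate the fragment identically: no "noise" at all
    return math.inf if sd == 0 else fragment_size / sd


if __name__ == "__main__":
    # A 10-atom fragment found in parents of 12, 14 and 16 atoms
    print(fragment_snr(10, [12, 14, 16]))
```

A fragment whose parents all carry a similar number of extra atoms gets a high SNR, while a fragment buried in parents of widely varying size gets a low one.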
Fragment or Scaffold
• Partial distribution of SNR values for fragments with atom count > 8 & < 20
[Figure: histogram of SNR (x-axis, 0 to 6) vs. percentage of fragments (y-axis)]
• Large SNRs associated with Murcko-like fragments
• A useful SNR cutoff is an open question
Fragment or Scaffold
[Figure: example fragment structures labeled SNR = 8.50, SNR = 12.09, SNR = 9.10, SNR = 0.36, SNR = 0.43, SNR = 0.83]
Activity Profiles & SNR
• Given a fragment, evaluate SD of the number of atoms in the parent molecules that are not part of the fragment
• Label the parent molecules based on
– If number of atoms not in the fragment > SD, non core-like
– Otherwise core-like
• Visualize the activity distributions of the parent molecules, grouped by the label
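The labeling rule can be sketched as follows, again assuming the sample standard deviation and heavy-atom counts. The input layout (parent size paired with a Z-score) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import stdev


def group_by_core_label(fragment_size, parents):
    """Group parent Z-scores into "core-like" and "not core-like".

    `parents` is a list of (parent_atom_count, z_score) tuples with at
    least two entries. A parent is "not core-like" when its number of
    atoms outside the fragment exceeds the SD of that number.
    """
    extra = [n - fragment_size for n, _ in parents]
    sd = stdev(extra)
    groups = defaultdict(list)
    for e, (_, z) in zip(extra, parents):
        label = "not core-like" if e > sd else "core-like"
        groups[label].append(z)
    return dict(groups)


if __name__ == "__main__":
    # Made-up parents of a 10-atom fragment
    print(group_by_core_label(10, [(11, -3.0), (12, -2.5), (30, 0.4)]))
```

Each group's Z-scores can then be plotted as a separate histogram to compare the two activity distributions.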
Activity Profiles & SNR
[Figure: histograms of Z-score (x-axis) vs. percentage of total (y-axis) for parent molecules, split into core-like and not core-like panels, for fragments labeled 20967, 44591, 801 and 68604; rows annotated High SNR and Low SNR]
Downloads
• Scaffold activity networks
• Fragment Activity Profiler
– SQL & servlet sources
– Client sources
– Online version