Characterization of Chemical Libraries Using Scaffolds and Network Models
Dac-Trung Nguyen, Rajarshi Guha
NIH NCATS
ACS National Meeting, Boston 2015
Motivations
• Library comparison usually driven by a need to construct or expand a library
– Often with constraints on resources
• Two classes of features to consider
– Compound-centric (physchem properties, bioactivity, target preferences)
– Library-centric (diversity, chemical space coverage)
• Library comparisons generally reduce to
– Distributions of compound features (univariate)
– Overlap in some chemical space (multivariate)
Comparing Libraries
• Most comparisons employ a reduced (numerical) representation of the structure
– Fingerprints, BCUTs, physicochemical descriptors
• Perform comparisons in the new space
– PCA, SOM, MDS, GTM, …
Schamberger et al, DDT, 2011, 16, 636-641; Kireeva et al, Mol. Inf., 2012, 31, 301-312
Scaffolds & Networks
• Scaffolds represent a chemically meaningful reduced representation of the structures
• Can be challenging to define what a (good) scaffold is
• A network representation of the collection of structures allows for novel ways to perform library comparisons
– How fine grained can such comparisons be?
Scaffold Network Representations
• Scaffolds are generated by exhaustive enumeration of the SSSR (smallest set of smallest rings)
• Scaffolds are nodes, connected by directed edges
• Nodes are labeled by a hash key of the scaffold
[Figure: example scaffold networks for 4 compounds and for 1912 compounds]
Scaffold Network Construction
• A scaffold network is a directed graph
• Edges denote sub/super-structure relationships between scaffolds
• Each node in the network represents a unique scaffold
• Singletons are acyclic molecules
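The construction above can be sketched with a toy model. This is only an illustration under stated assumptions: each molecule is reduced to a set of ring labels, every non-empty subset of rings counts as a scaffold (ring connectivity is ignored), edges point from sub-structure to super-structure, and the hash key is a SHA-1 digest of the sorted labels. None of these choices is taken from the talk; a real implementation would enumerate ring systems with a cheminformatics toolkit.

```python
import hashlib
from itertools import combinations

def scaffold_key(rings):
    """Hash key for a scaffold: a digest of its sorted ring labels (assumed scheme)."""
    canonical = "|".join(sorted(rings))
    return hashlib.sha1(canonical.encode()).hexdigest()[:12]

def enumerate_scaffolds(rings):
    """All non-empty ring subsets of one molecule (toy stand-in for SSSR enumeration)."""
    rings = sorted(rings)
    return [frozenset(c) for k in range(1, len(rings) + 1)
            for c in combinations(rings, k)]

def build_network(molecules):
    """Return (nodes, edges): nodes maps key -> scaffold, edges are (sub_key, super_key)."""
    nodes, edges = {}, set()
    for rings in molecules:
        scaffolds = enumerate_scaffolds(rings)
        for s in scaffolds:
            nodes[scaffold_key(s)] = s
        for a in scaffolds:
            for b in scaffolds:
                # Link only scaffolds that differ by exactly one ring.
                if a < b and len(b) - len(a) == 1:
                    edges.add((scaffold_key(a), scaffold_key(b)))
    return nodes, edges

# Two hypothetical molecules sharing a benzene/pyridine core.
molecules = [{"benzene", "pyridine"}, {"benzene", "pyridine", "furan"}]
nodes, edges = build_network(molecules)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because nodes are keyed by hash, a scaffold shared by several molecules appears once, which is what makes each node a unique scaffold.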
Datasets
• CL1420, 31320 compounds
• CL886, 3552 compounds
• MIPE, 1920 compounds
• Natural Products, 5000 compounds
• LOPAC, 1280 compounds
Mathews and Guha et al, PNAS, 2014, 111, 11365; Singh et al, JCIM, 2009, 49, 1010
1079 nodes, 115287 edges 69 trees
2131 nodes, 1843 edges 129 trees
Approved, investigational drugs, constructed for functional diversity
Diverse library, designed for enrichment of bioactivity
15283 nodes, 13622 edges 729 trees
5563 nodes, 4832 edges 239 trees
23716 nodes, 21468 edges 750 trees
Scaffold Network Representations
• The overall structure of the complete network can characterize the library
• But distributions of vertex-level network metrics may be informative
• We can also consider approaches to identify “important” scaffolds
Metrics for the Complete Network
• Examined vertex-level measures of centrality
– Closeness, betweenness, …
– High similarity of MIPE & NP and low similarity of LOPAC & NP is surprising (Ertl et al, JCIM, 2008)
[Figure: density of log10(Betweenness) and number of scaffolds vs. log Closeness (in-degree), by library: CL1420, CL886, LOPAC, MIPE, NP]
[Figure: Centralization, CPL, and Transitivity values by library: CL1420, CL886, LOPAC, MIPE, NP]
Metrics for the Complete Network
• Useful to summarize distributions by scalar metrics
• Path length metrics are not discriminatory due to many short paths
• Extent of clustering differs but is quite low overall
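Two of the scalar metrics named on this slide can be sketched in plain Python. The sketch assumes that CPL stands for characteristic path length (the transcript does not expand the abbreviation), that edge direction is ignored, and that transitivity is the usual ratio of closed to connected triples. The function names are illustrative only.

```python
from collections import deque
from itertools import combinations

def undirected(edges):
    """Adjacency sets for an undirected view of an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def transitivity(adj):
    """Fraction of connected triples that are closed into triangles."""
    closed, triples = 0, 0
    for u, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

path = undirected([("a", "b"), ("b", "c")])
triangle = undirected([("a", "b"), ("b", "c"), ("a", "c")])
print(characteristic_path_length(path), transitivity(path))
print(characteristic_path_length(triangle), transitivity(triangle))
```

A tree-like network has no triangles at all, so low transitivity values are what one would expect from networks built on sub/super-structure relationships.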
Comparing Complete Networks
• Library overlap is characterized by the set of common scaffolds
• Scaffolds can be ranked (e.g., PageRank)
– Small fragments have low PR
– Large frameworks have high PR
– Interesting scaffolds lie in between?
• Similar libraries will have common scaffolds with similar PageRank values
[Diagram: for each library, PageRank vector → subset to common fragments → normalized dot product]
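The comparison pipeline on this slide (PageRank per library, restriction to common fragments, normalized dot product) might look roughly like the following. It is a sketch, not the authors' code: the damping factor of 0.85, the fixed iteration count, and the uniform handling of dangling nodes are conventional defaults assumed here, and scaffold keys are plain strings.

```python
import math

def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = list(nodes)
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {x: 1.0 / n for x in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-edges) is spread uniformly.
        dangling = sum(rank[x] for x in nodes if not out[x])
        new = {x: (1 - damping) / n + damping * dangling / n for x in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank

def pagerank_similarity(rank_a, rank_b):
    """Normalized dot product of two PageRank vectors over their shared keys."""
    common = sorted(set(rank_a) & set(rank_b))
    if not common:
        return 0.0
    dot = sum(rank_a[k] * rank_b[k] for k in common)
    norm_a = math.sqrt(sum(rank_a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(rank_b[k] ** 2 for k in common))
    return dot / (norm_a * norm_b)

# Hypothetical libraries: A and B share scaffolds s2 and s3, C shares nothing with A.
rank_a = pagerank(["s1", "s2", "s3"], [("s1", "s2"), ("s2", "s3")])
rank_b = pagerank(["s2", "s3", "s4"], [("s2", "s3"), ("s3", "s4")])
rank_c = pagerank(["t1", "t2"], [("t1", "t2")])
print(pagerank_similarity(rank_a, rank_b), pagerank_similarity(rank_a, rank_c))
```

One caveat of this formulation: when two libraries share a single scaffold, the normalized dot product is trivially 1, so the number of common scaffolds is worth reporting alongside the score.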
Comparing Complete Networks
[Figure: pairwise library similarity matrix]
         CL1420  CL886  LOPAC  MIPE  NP
CL1420   1       0      0      0     0
CL886    0       1      0      0     0
LOPAC    0       0      1      0.2   0.3
MIPE     0       0      0.2    1     0.3
NP       0       0      0.3    0.3   1
Scaffold Recognition
• What is a scaffold?
• Can be addressed through the scaffold network
– A scaffold is a hub within the scaffold network
• Provides a practical answer to “What are the missing scaffolds in my library?”
• Examples of scaffolds found in MIPE but not in NP
Scaffold Comparison
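Both ideas on these slides reduce to simple operations once scaffolds are keyed by hash. The sketch below assumes a hub is a node whose total degree meets a user-chosen threshold (the talk does not give a definition) and that "missing" scaffolds are a set difference of keys. The names `hub_scaffolds` and `unique_to` are hypothetical.

```python
from collections import Counter

def hub_scaffolds(edges, min_degree):
    """Scaffold keys whose total degree (in + out) is at least min_degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {k for k, d in degree.items() if d >= min_degree}

def unique_to(keys_a, keys_b):
    """Scaffold keys present in library A but absent from library B."""
    return set(keys_a) - set(keys_b)

edges = [("ring1", "core"), ("ring2", "core"), ("core", "framework")]
print(hub_scaffolds(edges, min_degree=3))
print(unique_to({"core", "ring1"}, {"core", "ring9"}))
```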
Reduced Network Representation
• The complete network can be reduced to a forest of trees
• Order nodes by out-degree
• From each node, traverse network until a terminal node is reached
• Result is a set of spanning trees
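One possible reading of the reduction steps above is sketched here. The slide leaves several details open, so the following are assumptions: nodes are visited in descending out-degree, the traversal is depth-first along out-edges, and a node already claimed by an earlier tree is never reassigned. The result is a parent map in which every node has at most one parent.

```python
def reduce_to_forest(nodes, edges):
    """Return {node: parent or None}; nodes with parent None are tree roots."""
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    # Assumption: highest out-degree first, ties broken by node name.
    order = sorted(nodes, key=lambda n: (-len(out[n]), n))
    parent = {}
    for root in order:
        if root in parent:
            continue  # already part of an earlier tree
        parent[root] = None
        stack = [root]
        while stack:
            u = stack.pop()
            for v in sorted(out[u]):
                if v not in parent:  # each node joins exactly one tree
                    parent[v] = u
                    stack.append(v)
    return parent

def count_trees(parent):
    """Number of roots, which equals the number of trees in the forest."""
    return sum(1 for p in parent.values() if p is None)

# "d" is reachable through both "b" and "c"; "e" is an isolated node.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
parent = reduce_to_forest(nodes, edges)
print(parent, count_trees(parent))
```

In this reading, a node with several super- or sub-structure neighbours keeps only the edge through which it was first reached, which is what turns the directed graph into a set of trees.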
Reduced Network Representation
[Figure: reduced network (forest) for MIPE, 1912 compounds]
Network Structure
• A scaffold forest is characterized by
– Disconnected components
  • structurally related scaffolds, scaffold diversity
– Singletons
  • scaffolds with no superstructure
– Branching within connected components
  • scaffold complexity
Forest Size vs Library Size
• A large library doesn’t imply a large forest
• Forest size is a function of scaffold diversity
[Figure: forests for CL1420 (31K, combinatorial library) and MIPE (1912, (target) diverse library)]
Summarizing Forests
• A key feature is the nature of branching in individual trees
• Characterized by ID, an information theoretic descriptor of branching derived from the distance matrix
Bonchev & Trinajstić, IJQC, 1978, 14, 293-303
[Figure: example trees with ID = 978, ID = 90794, ID = 3456, and ID = 979252]
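To make the descriptor concrete, here is a sketch of one distance-based information index from the Bonchev and Trinajstić family. The transcript does not say which variant is used as ID, so this is an assumption: the code computes the information on the magnitude of distances, I = W log2 W − Σ g_d · d · log2 d, where W is the sum of all pairwise distances (the Wiener index) and g_d is the number of node pairs at distance d.

```python
import math
from collections import Counter, deque

def distance_counts(adj):
    """Count how often each shortest-path distance occurs over unordered node pairs."""
    counts = Counter()
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                counts[d] += 1
    # Every pair was seen from both ends, so halve the counts.
    return {d: c // 2 for d, c in counts.items()}

def information_on_distances(adj):
    """W log2 W minus the sum over distances d of g_d * d * log2 d."""
    g = distance_counts(adj)
    wiener = sum(d * c for d, c in g.items())
    if wiener == 0:
        return 0.0
    spread = sum(c * d * math.log2(d) for d, c in g.items())
    return wiener * math.log2(wiener) - spread

# Three-node path a - b - c: distances 1, 1, 2, so W = 4.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(information_on_distances(path3))
```

For trees with the same number of nodes, a more branched tree has shorter distances overall and therefore a different index value than a linear chain, which is the property that lets such an index summarize branching.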
Summarizing Forests
• Distribution of ID distinguishes datasets primarily in the tails
• Aggregating by mean ID still discriminates well
– Driven by the tails
[Figure: density of log10(ID) by library: CL1420, CL886, LOPAC, MIPE, NP]
[Figure: mean log10(ID) by library: CL1420, CL886, LOPAC, MIPE, NP]
Exploring the Forest
• The metric also allows us to drill down
– Select scaffolds of given branching complexity
– Identify scaffolds of given complexity range across different libraries (equivalent to finding holes in scaffold coverage)
[Figure: trees of similar complexity: LOPAC, ID = 10214 ≈ MIPE, ID = 10197]
Library Comparison via Merging
• … reduces to comparing networks
• We compute a graph union and construct new edges between nodes with the same hash
• How does the network structure of the union differ from the original networks?
• Can be extended to merge more than two networks
Source Forests
• Structurally similar networks
• 2659 identical nodes
• Construct union by connecting nodes with identical hash
[Figure: source forests for LOPAC and MIPE]
Merged Network
• Green edges “bridge” the two networks
• Trees can now have two types of nodes
• How can we characterize the
– Contraction?
– Degree of mixing?
Contraction to Measure Overlap
• Merging very similar libraries should generate a smaller forest compared to the original forests
• But this doesn’t really describe how the individual trees become (more) connected
$$C_{norm} = \frac{F_{12}}{F_1 + F_2}, \quad \text{where } F_i = \{G_{1i}, G_{2i}, \ldots, G_{ni}\}$$
[Figure: Cnorm for the library pairs CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
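The merge and the contraction measure can be sketched together. The code assumes that F in the formula is read as the number of trees (connected components) in a forest, that each forest is given as a pair of (scaffold keys, undirected edges), and that merging keeps one node per library and adds a bridging edge between nodes sharing a hash key. The function names are illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

def merge_forests(forest1, forest2):
    """Tag nodes by library, then bridge nodes that share a hash key."""
    nodes, edges = [], []
    for tag, (keys, links) in (("lib1", forest1), ("lib2", forest2)):
        nodes += [(tag, k) for k in keys]
        edges += [((tag, u), (tag, v)) for u, v in links]
    shared = set(forest1[0]) & set(forest2[0])
    bridges = [(("lib1", k), ("lib2", k)) for k in shared]
    return nodes, edges + bridges

def contraction(forest1, forest2):
    """Trees in the merged forest divided by the sum of trees in the two sources."""
    f1 = count_components(*forest1)
    f2 = count_components(*forest2)
    f12 = count_components(*merge_forests(forest1, forest2))
    return f12 / (f1 + f2)

# Hypothetical forests that share the single scaffold key "b".
forest1 = (["a", "b", "c"], [("a", "b")])
forest2 = (["b", "d", "e"], [("d", "e")])
print(contraction(forest1, forest2))
```

Under this reading, two libraries with no shared scaffolds give a value of 1, and merging a library with itself gives 0.5, so smaller values indicate more overlap.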
[Figure: % of trees that are assortative vs. not assortative for CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
Assortativity to Measure Overlap
• Quantifies the notion that “like connects to like”
• Undefined for trees that only have one type of vertex (i.e., only from a single library)
• The number of trees that are assortative is a global indicator of library similarity
Newman, Phys. Rev. E., 2003, 026126
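For a categorical vertex label such as the source library, Newman's assortativity coefficient is r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i), where e_ij is the fraction of edge ends joining type i to type j and a_i, b_i are its row and column sums. The sketch below applies it to one merged tree. Treating edges as undirected and returning `None` for the undefined single-library case are choices made here for illustration.

```python
from collections import defaultdict

def assortativity(edges, library):
    """Newman's r for a categorical label; None when it is undefined."""
    e = defaultdict(float)
    for u, v in edges:
        # Count every undirected edge in both directions so e is symmetric.
        e[(library[u], library[v])] += 1
        e[(library[v], library[u])] += 1
    total = sum(e.values())
    if total == 0:
        return None
    labels = {a for a, _ in e}
    trace = sum(e[(a, a)] for a in labels) / total
    ab = 0.0
    for a in labels:
        row = sum(e[(a, b)] for b in labels) / total
        ab += row * row  # e is symmetric, so a_i equals b_i
    if abs(1.0 - ab) < 1e-12:
        return None  # only one vertex type: the coefficient is undefined
    return (trace - ab) / (1.0 - ab)

# Hypothetical merged tree: two MIPE nodes, two NP nodes, one bridging edge.
library = {"m1": "MIPE", "m2": "MIPE", "n1": "NP", "n2": "NP"}
edges = [("m1", "m2"), ("n1", "n2"), ("m2", "n1")]
print(assortativity(edges, library))
```

The example illustrates why a high value need not mean high overlap: a tree made of two large single-library halves joined by one bridge scores close to 1 even though the libraries barely touch.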
[Figure: density of assortativity across assortative trees for CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
Assortativity to Measure Overlap
• We then examine the distribution of assortativity across assortative trees
• Dissimilar libraries have few assortative trees
– But they have high values of assortativity
• However, high assortativity doesn’t imply high overlap
Assortativity to Measure Overlap
Assortativity = 0.85 (MIPE & NP)
Assortativity = 0.95 (CL886 & CL1420)
Overlap via Tree Complexity
• Similar libraries lead to fewer trees in the merged network, but also denser trees
• Change in density (branching) across the forest can also measure the extent of overlap
[Figure: forests for MIPE, LOPAC, and the merged network]
Summarizing via Tree Complexity
• Distributions of ID before and after merging don’t differ very much, visually
• However, a KS test does discriminate them
[Figure: density of log10(ID) for individual vs. merged forests. CL886 / CL1420: D = 0.0173, p = 1; MIPE / NP: D = 0.0582, p = .0008]
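The D values reported above are two-sample Kolmogorov-Smirnov statistics: the largest gap between the empirical cumulative distributions of ID before and after merging. A minimal sketch of the statistic is given below; it omits the p-value, for which a statistics package (for example `scipy.stats.ks_2samp`) would normally be used. The sample values are made up for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical log10(ID) values for individual and merged forests.
individual = [1.0, 2.0, 3.0, 4.0]
merged = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(individual, merged))
```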
Summary
• Scaffold networks are a relatively objective way to characterize & compare libraries
– Supports fast comparisons between libraries
• The approach supports multiplexing information into a single data structure
– Physchem properties, bioactivities, …
• “What is a good comparison?” quickly becomes a philosophical question