Molecular Descriptors
INTRODUCTION
• Molecular descriptors are numerical values that characterize properties of molecules
• Examples:
– Physicochemical properties (empirical)
– Values from algorithms, such as 2D fingerprints
• Vary in complexity of encoded information and in compute time
Descriptors for Large Data Sets
• Descriptors representing properties of complete molecules
– Examples: LogP, Molar Refractivity
• Descriptors calculated from 2D graphs
– Examples: Topological Indexes, 2D fingerprints
• Descriptors requiring 3D representations
– Example: Pharmacophore descriptors
DESCRIPTORS CALCULATED FROM 2D STRUCTURES
• Simple counts of features
– Lipinski Rule of Five (H bonds, MW, etc.)
– Number of ring systems
– Number of rotatable bonds
• Not likely to discriminate sufficiently when used alone
• Combined with other descriptors for best effect
Physicochemical Properties
• Hydrophobicity
– LogP – the logarithm of the partition coefficient between n-octanol and water
• ClogP (Leo and Hansch) – based on small set of values from a small set of simple molecules
– BioByte: http://www.biobyte.com/
– Daylight’s MedChem Help page
– http://www.daylight.com/dayhtml/databases/medchem/medchem-help.html
– Isolating carbon: one not doubly or triply bonded to a heteroatom
ACD Labs Calculated Properties
• http://www.acdlabs.com
• ACD Labs values now incorporated into the CAS Registry File for millions of compounds
• I-Lab: http://ilab.acdlabs.com/
– Name generation
– NMR prediction
– Physical property prediction
Molar Refractivity
• $MR = \dfrac{n^2 - 1}{n^2 + 2} \cdot \dfrac{MW}{d}$
where n is the refractive index, d is density, and MW is molecular weight.
• Measures the steric bulk of a molecule.
Topological Indexes
• Single-valued descriptors calculated from the 2D graph of the molecule
• Characterize structures according to size, degree of branching, and overall shape
• Example: Wiener Index – counts the number of bonds between pairs of atoms and sums the distances between all pairs
Wiener Index
• Add up all the off-diagonal elements of the distance matrix and divide by 2 (because the matrix is symmetrical)
• The Wiener index correlates well with the boiling points of alkanes
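The summation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the distance matrix is already available as a list of lists; the function name `wiener_index` and the n-butane example are illustrative choices, not part of the original slides.

```python
def wiener_index(dist):
    """Sum the off-diagonal elements of a symmetric distance matrix, then halve."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return total // 2  # every atom pair was counted twice

# Carbon skeleton of n-butane (C1-C2-C3-C4), distances counted in bonds.
butane = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
print(wiener_index(butane))  # 10
```

The more branched isomer (isobutane) gives a smaller value, 9, which is the kind of size and branching sensitivity the index is meant to capture.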
Zagreb Index
• For each non-hydrogen atom, add up the squares of the number of connections to other non-hydrogen atoms (regardless of bond order)
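As a rough sketch of this definition (usually called the first Zagreb index), the code below assumes the molecule is given as a list of bonds between non-hydrogen atoms numbered from 0. The function name and the isobutane example are illustrative assumptions.

```python
def zagreb_index(bonds, n_atoms):
    """Sum of squared connection counts over all non-hydrogen atoms."""
    degree = [0] * n_atoms
    for a, b in bonds:
        # Each bond adds one connection to both of its atoms,
        # regardless of bond order.
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree)

# Hydrogen-suppressed graph of isobutane: atom 1 is the branch point.
isobutane_bonds = [(0, 1), (1, 2), (1, 3)]
print(zagreb_index(isobutane_bonds, 4))  # 12
```

For comparison, the unbranched n-butane chain gives 10, so the index rises with branching.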
Topological Indexes: Others
• Molecular Connectivity Indexes
– Randić (et al.) branching index
• Defines a “degree” of an atom as the number of adjacent non-hydrogen atoms
• Bond connectivity value is the reciprocal of the square root of the product of the degree of the two atoms in the bond.
• Branching index is the sum of the bond connectivities over all bonds in the molecule.
– Chi indexes – introduces valence values to encode sigma, pi, and lone pair electrons
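The Randić branching index described above can be sketched as follows. This is a minimal example, assuming a hydrogen-suppressed bond list with atoms numbered from 0; `randic_index` is a hypothetical helper name, and the valence-modified chi indexes are not covered.

```python
from math import sqrt

def randic_index(bonds, n_atoms):
    """Sum over bonds of 1 / sqrt(degree(a) * degree(b))."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(1.0 / sqrt(degree[a] * degree[b]) for a, b in bonds)

n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (1, 2), (1, 3)]
print(round(randic_index(n_butane, 4), 4))   # 1.9142
print(round(randic_index(isobutane, 4), 4))  # 1.7321
```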
Kappa Shape Indexes
• Characterize aspects of molecular shape
– Compare the molecule with the “extreme shapes” possible for that number of atoms
• Range from linear molecules to completely connected graph
2D Fingerprints
• Two types:
– One based on a fragment dictionary
• Each bit position corresponds to a specific substructure fragment
• Fragments that occur infrequently may be more useful
– Another based on hashed methods
• Not dependent on a pre-defined dictionary
• Any fragment can be encoded
• Originally designed for substructure searching, not for molecular descriptors
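The difference between the two fingerprint types can be illustrated with a small sketch. It assumes fragments have already been enumerated and are represented as strings; the fragment labels, the 64-bit length, and the use of CRC32 as the hash are all illustrative assumptions, and real hashed fingerprint schemes are more elaborate.

```python
import zlib

def dictionary_fingerprint(fragments, dictionary):
    """One bit per dictionary entry; fragments outside the dictionary are ignored."""
    present = set(fragments)
    return [1 if frag in present else 0 for frag in dictionary]

def hashed_fingerprint(fragments, n_bits=64):
    """Any fragment can be encoded: its bit position comes from a hash."""
    bits = [0] * n_bits
    for frag in fragments:
        position = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits[position] = 1  # different fragments may collide on the same bit
    return bits

fragments = ["C-C", "C=O", "C-N"]       # hypothetical fragment labels
dictionary = ["C-C", "C=O", "C-O"]      # hypothetical fragment dictionary
print(dictionary_fingerprint(fragments, dictionary))  # [1, 1, 0]
print(sum(hashed_fingerprint(fragments)))             # number of bits set
```

Note that "C-N" is lost by the dictionary fingerprint because it has no assigned bit, while the hashed version encodes it at the cost of possible bit collisions.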
Topological indexes
another type of numerical descriptor that can be calculated from a 2D structure diagram
there are many different topological indexes
• some are designed to represent structural features such as branching or shape
they can be calculated from connection tables, or closely-related formats
• e.g. the distance matrix
o an N x N table showing the distance (in bonds) between each pair of atoms
“Redundant” Connection Table
Atom  Element  H count  Neighbour / bond-order pairs
 1.   O        1        2 1
 2.   C        0        1 1   3 2   4 1
 3.   O        0        2 2
 4.   C        1        2 1   5 1   6 1
 5.   N        2        4 1
 6.   C        2        4 1   7 1
 7.   C        0        6 1   8 2   12 1
 8.   C        1        7 2   9 1
 9.   C        1        8 1   10 2
10.   C        0        9 2   11 1   13 1
11.   C        1        10 1   12 2
12.   C        1        11 2   7 1
13.   O        1        10 1
Distance Matrix
      1  2  3  4  5  6  7  8  9 10 11 12 13
 1.   O  1  2  2  3  3  4  5  6  7  6  5  8
 2.   1  C  1  1  2  2  3  4  5  6  5  4  7
 3.   2  1  O  2  3  3  4  5  6  7  6  5  8
 4.   2  1  2  C  1  1  2  3  4  5  4  3  6
 5.   3  2  3  1  N  2  3  4  5  6  5  4  7
 6.   3  2  3  1  2  C  1  2  3  4  3  2  5
 7.   4  3  4  2  3  1  C  1  2  3  2  1  4
 8.   5  4  5  3  4  2  1  C  1  2  3  2  3
 9.   6  5  6  4  5  3  2  1  C  1  2  3  2
10.   7  6  7  5  6  4  3  2  1  C  1  2  1
11.   6  5  6  4  5  3  2  3  2  1  C  1  2
12.   5  4  5  3  4  2  1  2  3  2  1  C  3
13.   8  7  8  6  7  5  4  3  2  1  2  3  O
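A distance matrix like the one above can be generated from a connection table by a breadth-first search from each atom. The sketch below is one possible way to do this, not necessarily the method used to produce the slide; it takes the bonds of the 13-atom example, keeps the 1-based atom numbers, ignores bond orders, and assumes the graph is connected.

```python
from collections import deque

# Bonds of the 13-atom example, each listed once, using 1-based atom numbers.
BONDS = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
         (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]

def distance_matrix(bonds, n_atoms):
    """Bond-count distances between all atom pairs (assumes a connected graph)."""
    adj = {i: [] for i in range(1, n_atoms + 1)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    matrix = []
    for start in range(1, n_atoms + 1):
        seen = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search visits atoms in order of distance
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        matrix.append([seen[j] for j in range(1, n_atoms + 1)])
    return matrix

D = distance_matrix(BONDS, 13)
print(D[0])  # [0, 1, 2, 2, 3, 3, 4, 5, 6, 7, 6, 5, 8], the row for atom 1
wiener = sum(sum(row) for row in D) // 2
print(wiener)  # 268
```

The code writes 0 on the diagonal where the slide shows the element symbol; the off-diagonal values are the same. Halving the sum of all entries gives the Wiener index of this graph.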
Kier Shape Indexes
Several indexes based on the number of atoms (N) and the number of paths of a given length (P) in the graph: bonds for κ1, two-bond paths for κ2, three-bond paths for κ3
$\kappa_1 = N (N-1)^2 / P_1^2$
$\kappa_2 = (N-1)(N-2)^2 / P_2^2$
$\kappa_3 = (N-1)(N-3)^2 / P_3^2$ (if N is odd)
$\kappa_3 = (N-3)(N-2)^2 / P_3^2$ (if N is even)
“alpha-modified” kappa indexes can be generated where N is adjusted to take into account the sizes of atoms, relative to sp3-hybridised carbons
a “molecular flexibility index” is derived from these
$\phi = \kappa_{1\alpha} \, \kappa_{2\alpha} / N$
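The kappa formulas can be tried out with a short sketch. It assumes the usual Kier convention that P is the number of m-bond paths (bonds for κ1, two-bond paths for κ2, three-bond paths for κ3), represents the hydrogen-suppressed graph as an adjacency dictionary, and leaves out the alpha modification. The helper names are hypothetical.

```python
def count_paths(adj, length):
    """Count simple paths containing `length` bonds in an undirected graph."""
    def extend(path, remaining):
        if remaining == 0:
            return 1
        return sum(extend(path + [nxt], remaining - 1)
                   for nxt in adj[path[-1]] if nxt not in path)
    # Each path is found once from each end, so halve the total.
    return sum(extend([start], length) for start in adj) // 2

def kappa_indexes(adj):
    """Return (kappa1, kappa2, kappa3); fails if a path count is zero."""
    n = len(adj)
    p1, p2, p3 = (count_paths(adj, m) for m in (1, 2, 3))
    k1 = n * (n - 1) ** 2 / p1 ** 2
    k2 = (n - 1) * (n - 2) ** 2 / p2 ** 2
    if n % 2 == 1:
        k3 = (n - 1) * (n - 3) ** 2 / p3 ** 2
    else:
        k3 = (n - 3) * (n - 2) ** 2 / p3 ** 2
    return k1, k2, k3

# Carbon skeleton of n-pentane: a linear five-atom chain.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(kappa_indexes(pentane))  # (5.0, 4.0, 4.0)
```

For a linear chain κ1 equals N and κ2 equals N-1, which is one of the "extreme shapes" the indexes are normalised against.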
Molecular Connectivity Indexes
a whole series of indexes, developed by Kier and Hall in the late 1970s, following earlier work by Randić
involves identifying all possible subgraphs of different sizes in the molecule
size of subgraph determines the order of the index
• 0-bond subgraphs give the $^0\chi$ index
• 1-bond subgraphs give the $^1\chi$ index
• 2-bond subgraphs give the $^2\chi$ index
• 3-bond subgraphs give the $^3\chi$ indexes etc.
Molecular Connectivity Indexes
At higher orders the subgraphs are divided into
“path” subgraphs (only 1 and 2-connected nodes)
“cluster” subgraphs (no 2-connected nodes)
“path-cluster” subgraphs (any sort of node)
“chain” subgraphs (involving rings)
Molecular Connectivity Indexes
For each subgraph order and type the index is calculated as
$\chi = \sum_{\text{subgraphs}} \prod_{i} \delta_i^{-1/2}$
where $\delta_i$ is number of connections of node i in the subgraph
molecular connectivity indexes also exist in a “valence-modified” form that takes into account the heteroatoms present
Molecular Connectivity Indexes
many experiments have been done to find correlations between them (and other indexes) and measured physico-chemical or biological properties
this uses a statistical technique called multiple regression analysis to build an equation of the form
$\text{Property} = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + c_5 x_5 + \dots$
where $x_1$, $x_2$ etc. are topological indexes and $c_1$, $c_2$ etc. are constants
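To make the regression step concrete, here is a minimal pure-Python least-squares fit of an equation of this form. It solves the normal equations by Gauss-Jordan elimination, which is only one of several ways to do this and will fail on collinear descriptors. The data are synthetic, generated from known constants purely to check that the fit recovers them; they are not measured properties.

```python
def fit_linear(X, y):
    """Least-squares constants [c0, c1, ...] for y = c0 + c1*x1 + c2*x2 + ..."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # intercept column first
    p = len(rows[0])
    # Normal equations (A^T A) c = A^T y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Synthetic descriptor values (x1, x2) and a property built from known constants.
X_toy = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y_toy = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X_toy]
coeffs = fit_linear(X_toy, y_toy)
print([round(c, 6) for c in coeffs])  # [2.0, 0.5, -1.5]
```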
good correlations have often been obtained
What do topological indexes mean?
Good question!
it is often difficult to assign some chemical meaning to, e.g. the order-6 path-cluster, valence-modified Kier index
topological indexes effectively encode the same information as fingerprint fragments
• in a less obvious way
• but one which can be processed numerically
Atom-Pair Descriptors
• Encode all pairs of atoms in a molecule
• Include the length of the shortest bond-by-bond path between them
• Atoms are typed by element, number of attached non-hydrogen atoms, and number of π-bonding electrons
BCUT Descriptors
• Designed to encode atomic properties that govern intermolecular interactions
• Used in diversity analysis
• Encode atomic charge, atomic polarizability, and atomic hydrogen bonding ability
BCUT descriptors
• A type of topological index with a complex history
• B = Frank Burden
• C = Chemical Abstracts Service
• UT = University of Texas (Bob Pearlman)
• based on 3D structure of molecule
• 6 different indexes generated for each molecule
• often used as descriptors for cell-based partitioning of chemical space
• 6 descriptors = 6 dimensions
DESCRIPTORS BASED ON 3D REPRESENTATIONS
• Require the generation of 3D conformations
– Can be computationally time consuming with large data sets
– Usually must take into account conformational flexibility
– 3D fragment screens encode spatial relationships between atoms, ring centroids, and planes
Pharmacophore Keys & Other 3D Descriptors
• Based on atoms or substructures thought to be relevant for receptor binding
• Typically include hydrogen bond donors and acceptors, charged centers, aromatic ring centers and hydrophobic centers
• Others: 3D topographical indexes, geometric atom pairs, quantum mechanical calculations for HOMO and LUMO
DATA VERIFICATION AND MANIPULATION
• Data spread and distribution
– Coefficient of variation (standard deviation divided by the mean)
• Scaling (standardization): making sure that each descriptor has an equal chance of contributing to the overall analysis
• Correlations
• Reducing the dimensionality of a data set: Principal Components Analysis
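The first two checks above can be sketched with the standard library. This example assumes the population standard deviation is used (a sample standard deviation would be equally defensible) and takes scaling to mean subtracting the mean and dividing by the standard deviation; the descriptor values are toy numbers chosen for easy arithmetic.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

def autoscale(values):
    """Rescale a descriptor to zero mean and unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

descriptor = [2, 4, 4, 4, 5, 5, 7, 9]  # toy values: mean 5, standard deviation 2
print(coefficient_of_variation(descriptor))  # 0.4
print(autoscale(descriptor))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After scaling, every descriptor has the same spread, so none dominates a distance or similarity calculation simply because of its units.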
Chemical Structure Representation and Search Systems
Topics to be Covered
Clustering
• identifying classes of molecules similar to each other, but different to those in other classes
Topological indexes
• numbers that can be calculated from connection tables
Property prediction
• predicting physicochemical or biological properties directly from connection tables
The Drug Discovery Process
• virtual screening
Cluster Analysis
process of putting molecules (or other objects) into classes, based on similarity
molecules in the same cluster are similar to each other
molecules in different clusters are different from each other
many different methods and algorithms
• different clustering methods will result in different clusters, with different relationships between them
• different algorithms can be used to implement the same method (some may be more efficient than others)
Downs, G. M., Barnard, J. M., Rev. Comput. Chem., 18 (2002)
Hierarchical and non-hierarchical
A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not
Hierarchical Agglomerative
the hierarchy is built from the bottom upwards
several different methods and algorithms
basic Lance-Williams algorithm (common to all methods) starts with table of similarities between all pairs of items
• at each step the most similar pair of molecules (or previously-formed clusters) are merged together
• until everything is in one big cluster
• methods differ in how they determine the similarity between clusters
o “single link” chooses clusters whose closest members are most similar
o “complete link” chooses clusters whose furthest members are most similar
o other methods (e.g. Group-average method and Ward’s method) use some sort of “average” member
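The merge loop described above can be sketched naively as follows. This is not the Lance-Williams update formula: for simplicity it recomputes cluster similarities from the members at every step, which is slower still. The similarity matrix is a toy example and `agglomerate` is a hypothetical function name.

```python
def agglomerate(sim, linkage="single"):
    """Naive agglomerative clustering on a symmetric similarity matrix.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    # Single link scores two clusters by their closest (most similar) members,
    # complete link by their furthest (least similar) members.
    pick = max if linkage == "single" else min
    clusters = [(i,) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = pick(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

toy_sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
for step in agglomerate(toy_sim, "single"):
    print(step)
```

With this matrix both linkages merge items 0 and 1 first, then 2 and 3; they differ only in the similarity assigned to the final merge (0.3 for single link, 0.1 for complete link).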
Hierarchical Agglomerative
Lance-Williams algorithm is slow
• O(N²) to generate pairwise similarity table initially
• this table must be updated N times, once for each merge (agglomeration) of clusters
• overall time requirements are O(N³)
more efficient algorithms can be used for some methods
• single link can be O(N log N) with k-D trees algorithm
• Ward’s method and Group-Average method can be O(N²) using Murtagh’s Reciprocal Nearest-Neighbour algorithm
Hierarchical Divisive
the hierarchy is built from the top downwards
at each step a cluster is chosen to divide, until each cluster has only one member
various ways of choosing next cluster to divide
• one with most members
• one with least similar pair of members
• etc.
various ways of dividing it
• using a single descriptor (e.g. a fingerprint bit) [“monothetic”]
• using all descriptors (based on similarities between pairs of members) [“polythetic”]
most polythetic methods are slow
Non-hierarchical methods
usually faster than hierarchical
several different methods
e.g. Leader algorithm
• make a single pass through the dataset (O(N))
o if molecule is similar enough (need to define threshold) to an existing cluster, it joins that cluster
o otherwise it starts (leads) a new cluster
• results depend on order of processing
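A minimal sketch of the Leader algorithm follows. It assumes that "similar enough to an existing cluster" means similar enough to that cluster's leader (its first member), and that a molecule joins the first cluster that qualifies. The Tanimoto coefficient on sets of "on" bits is used as the similarity measure, and the fingerprints are toy examples.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two non-empty sets of 'on' bit positions."""
    return len(a & b) / len(a | b)

def leader_cluster(items, similarity, threshold):
    """Single pass: join the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list; its first element is the leader
    for item in items:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])  # no match: this item leads a new cluster
    return clusters

fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}, {1, 2}]
forward = leader_cluster(fps, tanimoto, 0.6)
backward = leader_cluster(list(reversed(fps)), tanimoto, 0.6)
print(len(forward), len(backward))  # 2 3
```

Processing the same five fingerprints in reverse order yields three clusters instead of two, which illustrates the order dependence noted above.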
Nearest neighbour methods
non-hierarchical
best-known example is the Jarvis-Patrick method
• identify top k (e.g. 20) nearest neighbours for each molecule
• two molecules join same cluster if they have at least kmin of their top k nearest neighbours in common
very popular for chemical applications from mid 1980s
rather less popular now
tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
some variations have been tried
• variable-length nearest-neighbour lists (threshold similarity)
• reclustering of singletons
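The Jarvis-Patrick rule can be sketched as below, working from a precomputed similarity matrix. By default the code follows the rule as stated above (at least kmin shared neighbours); the original method additionally requires each molecule to appear in the other's neighbour list, which is exposed here as the hypothetical `require_mutual` flag. The toy data are one-dimensional positions, with similarity taken as negative distance.

```python
def jarvis_patrick(sim, k, k_min, require_mutual=False):
    """Cluster items that share at least k_min of their top-k nearest neighbours."""
    n = len(sim)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])
        neighbours.append(set(ranked[:k]))

    parent = list(range(n))  # union-find over items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbours[i] & neighbours[j]) >= k_min
            mutual = i in neighbours[j] and j in neighbours[i]
            if shared and (mutual or not require_mutual):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

positions = [0, 1, 2, 50, 51, 52, 90]  # toy 1-D "descriptor" values
toy_sim = [[-abs(a - b) for b in positions] for a in positions]
print(jarvis_patrick(toy_sim, k=2, k_min=1))
# [[0, 1, 2], [3, 4, 5, 6]]
print(jarvis_patrick(toy_sim, k=2, k_min=1, require_mutual=True))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Without the mutual-neighbour condition the outlier at position 90 is pulled into the nearest cluster; with it, the outlier is left as a singleton.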
Relocation methods
non-hierarchical
• clusters are initialised (sometimes randomly)
• iterative refinement then relocates molecules between clusters to improve some objective function
simplest and most common example is K-means
• select k random molecules to act as cluster seeds
o k is required number of clusters
• assign each remaining molecule to closest seed
• calculate “centroid” (mean) of each cluster
• relocate molecules to nearest cluster centroid if necessary
• recalculate centroids and repeat until no further changes
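The K-means steps listed above can be sketched in plain Python. This version assumes molecules are represented as numeric descriptor vectors compared by squared Euclidean distance, and it keeps the old centroid if a cluster becomes empty; the `seed` parameter and the toy points are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster index for each point, list of centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k random seeds
    assignment = None
    for _ in range(max_iter):
        # Assign (or relocate) every point to its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # no point moved, so the clustering has converged
        assignment = new_assignment
        # Recalculate each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(points, k=2)
print(labels)
```

The two well-separated groups end up in different clusters; which group receives label 0 depends on the random seeds chosen.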
K-means clustering
K-means has the advantage of being fast (O(Nk)) and is popular with statisticians
however it has several disadvantages
• sensitive to the initial choice of seeds
o can try non-random sets of seeds
• can converge to a local (rather than global) optimum
• tends to produce only “spherical” clusters of similar size
• difficult to decide what value of k to choose
Overlapping and fuzzy clusters
some clustering methods produce overlapping clusters, in which some molecules are members of more than one cluster
in fuzzy clustering, each molecule has partial membership of all clusters
• degree of membership in each cluster is in range 0.0 to 1.0
• sum of membership over all clusters is 1.0
fuzzy clustering is arguably a better representation of the “real world” but makes it difficult to make decisions
Which method is best?
as with similarity measures and structure descriptors, there is no definite agreement
• this is probably why there are so many methods
empirical property-prediction experiments have been done to evaluate different methods
• predicted property value is average of other members of same cluster (Sheffield University work)
o calculate correlation coefficient between observed and predicted properties
• active and inactive molecules should be in separate clusters (Abbott Laboratories work)
Which method is best?
• Sheffield University work (mid-1980s) showed Ward’s (hierarchical agglomerative) and Jarvis-Patrick method gave best predictions
o at that time Jarvis-Patrick was significantly faster
• Joint CAS/Sheffield/BCI study in early 1990s showed Ward’s and “minimum diameter” (hierarchical divisive) significantly better than Jarvis-Patrick
• similar conclusions in Abbott study (mid 1990s)
• more recent work at Eli Lilly recommended K-means
o certainly better for very large datasets, because of speed
• still a very active area of research
How many clusters to choose?
Hierarchical methods allow user to choose any slice across the hierarchy
but what level is the best one to choose?
there are methods that give a “score” to each level
• get the fewest and “tightest” clusters
How many clusters to choose?
Non-hierarchical methods
• Jarvis-Patrick method decides for itself on basis of user-selected k and kmin
• with other methods (e.g. k-means) it is more difficult
o what is the “natural” number of clusters?
The “natural number” of clusters
What is clustering used for?
• compound acquisition
o purchase compounds from clusters that contain no compounds from existing collections
• high-throughput screening
o choose one compound per cluster in first round
o test other compounds from clusters where hits are found
• homogeneous subsets for QSAR
• diverse subset selection from combinatorial libraries
o maximise different clusters represented; penalise over-representation of individual clusters
• classification of new compounds
o which existing cluster is a new compound closest to?
A clustering of clustering methods
Descriptor calculation
various numerical descriptors can be calculated for chemical structures
• molecular weight
• counts of features
o hydrogen bond donors/acceptors
o aromatic rings
o rotatable bonds
o etc
these can be used in similarity searching and clustering
Property Prediction
it is often useful to be able to calculate a physico-chemical property for a compound from its structure
• regression equations have been used to do this from topological indexes, but usually only for limited sets of molecules
• it would be better to have a more general method
some important properties have had a lot of attention in this respect
logP
octanol-water partition coefficient
• has been found very useful in predicting the bioavailability of a drug
o it needs to be soluble enough in lipid to be able to cross cell membranes
o but soluble enough in water not to get stuck there
• many methods have been proposed for calculating a good estimate from the structure
Leo, A. J. Chemical Reviews, 1993, 93, 1281-1306
logP calculation
fragment-based methods (ClogP)
• pioneered by Corwin Hansch and Al Leo (Pomona College)
• identify large fragments, whose contribution to logP value is known from their occurrence in other compounds with measured logP
• large “training set” of compounds with accurately-measured logP (the “Starlist”)
• works very well if test compound has the right fragments
o problems arise if test compound contains fragments that are “missing” from the training set
logP calculation
atom-based methods (AlogP, XlogP, SlogP)
• pioneered by Gordon Crippen (Univ. Michigan)
• based on identifying a series of “atom types” in the molecule
o essentially, small atom-centred fragments
o usually 60-200 such fragments are involved
• each atom-type is assigned a numerical value
• logP is obtained by adding values for the atom types present in the test molecule
• atom-type values are obtained by regression analysis, based on a set of compounds with measured logP
• sometimes some extra correction factors are used too
Atom-based property calculations
atom-based principle has also been used for other properties
• molar refractivity
• charged partial surface area
• intestinal absorption
• etc.
The Drug Discovery Process
pharmaceutical companies are in the business of identifying compounds that may be useful new drugs
• tens or hundreds of thousands of compounds are made and tested every year (“screening”)
o tests are usually simple binding assays (does the molecule bind to a target protein?)
• testing is done in two stages
o Lead Generation (find a compound that binds)
o Lead Optimisation (find a compound that binds better)
• chemical informatics techniques are important at both these stages
Drug development
Patents will be applied for as soon as a good compound (or class of compounds) is identified
• need to get in before the competition
• patent life (20 years) starts counting down from here
Much development work has still to be done
• animal tests
• clinical trials (several phases)
• regulatory requirements
• many drugs may “fail” during the process
Patent may have only 10 years left to run by the time a new drug is marketed
The need for early attrition
Only a tiny proportion of compounds make it all the way through this process
If a potential new drug is going to “fail” it is better that it fail early
• before too much money has been spent on it
If you can identify the failures before you even synthesise them, so much the better
• “virtual screening”
Three stages of screening
in silico (“in silicon”)
• virtual screening
• entirely in the computer
in vitro (“in glass”)
• uses test tube models of biological systems
• enzyme assays etc
• requires real compounds
in vivo (“in life”)
• compounds tested in living organisms
Virtual Screening
Often based on concept of “drug-likeness”
• do these compounds actually look like drugs?
• need to calculate appropriate properties
o Is compound likely to have suitable properties for
• Absorption
• Distribution
• Metabolism
• Excretion
• Toxicity
o ADMET or ADME/Tox
• suitable property ranges identified by analysing databases of existing drugs
Lipinski Rule of Five
Widely used set of property limits for virtual screening
Developed at Pfizer, 1997
• molecular weight ≤ 500
• logP ≤ 5.0
• ≤ 5 hydrogen bond donors
o number of –OH and –NH groups
• ≤ 10 hydrogen bond acceptors
o number of O and N atoms
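As a sketch of how these limits might be applied in a virtual screen, the function below lists which criteria a compound breaks. It assumes the four descriptor values have already been calculated, and it treats a value exactly at a limit as acceptable, as in Lipinski's original formulation. The function name and the example descriptor values are hypothetical.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of Rule of Five criteria that a compound violates."""
    checks = {
        "molecular weight > 500": mol_weight > 500,
        "logP > 5": logp > 5,
        "more than 5 H-bond donors": h_donors > 5,
        "more than 10 H-bond acceptors": h_acceptors > 10,
    }
    return [name for name, failed in checks.items() if failed]

# Hypothetical descriptor values for two candidate compounds.
print(lipinski_violations(350.4, 2.1, 2, 5))
# []
print(lipinski_violations(712.9, 6.3, 3, 12))
# ['molecular weight > 500', 'logP > 5', 'more than 10 H-bond acceptors']
```

Returning the list of violations, rather than a single pass/fail flag, leaves the caller free to decide how many violations to tolerate.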
Lead generation
when testing a large number of compounds to identify a new “lead”, it is obviously desirable to have them as different from each other as possible
• pharmaceutical companies purchase large numbers of compounds from 3rd party suppliers (often Eastern European) to test
• they also synthesise combinatorial libraries of compounds
chemical “diversity” is important feature of such compound collections and libraries
• the idea is to cover as much of “chemical space” as possible
Lead optimisation
when a “lead” compound has been identified, the next stage is to find compounds that are similar to it, which might bind even better
• this can involve similarity searching to find compounds previously made, or available commercially for purchase
in later stages, as activity of compound becomes better understood, medicinal chemists will make specific changes to the molecule which they hope will improve its binding affinity
Conclusions
Clustering is a useful technique for identifying classes of molecules in a dataset
• there are many different methods and algorithms
• some are faster or more effective than others
Topological indices are numbers that can be calculated from structures represented as connection tables
• there are many different indices available, some of which are designed to represent gross features like shape and branching
Topological indices can be used in regression equations to predict properties of a structure
• other methods are available for property prediction, based on summing scores for different fragments or atom types
Calculated properties can be used in “virtual screening”
Conclusions
Many computer techniques are available to manipulate chemical structure representations
• some have inherent limitations but are nonetheless useful
Structure and substructure search algorithms are among the most important and useful
There are useful techniques for calculating estimates of physico-chemical and other properties
Identifying structurally similar molecules can lead to identifying molecules with similar biological activities
Chemoinformatics is now a vital part of the drug discovery process in the pharmaceutical industry