Kernel-based Machine Learning for Virtual Screening Dipl.-Inf. Matthias Rupp Beilstein Endowed Chair for Chemoinformatics Johann Wolfgang Goethe-University Frankfurt am Main, Germany 2008-04-11, Helmholtz Center, Munich


Page 1: Kernel-based Machine Learning for Virtual Screening · Virtual screening: Ligand-based approach Input: Known ligands (training samples) Compound library (test samples) Output: Molecules

Kernel-based Machine Learning for Virtual Screening

Dipl.-Inf. Matthias Rupp

Beilstein Endowed Chair for Chemoinformatics

Johann Wolfgang Goethe-University

Frankfurt am Main, Germany

2008-04-11, Helmholtz Center, Munich


Outline

Virtual screening Setting, definition, aspects

Representation Descriptors, graphs, shape, densities

Methods Gaussian process regression, novelty detection

Application Virtual screening for PPARγ agonists


Virtual screening: Drug development

Disease → Target → Screening → Optimization → Preclinical → Clinical Phases I, II, III → Market authorization → Clinical Phase IV


Screening step: Systematic testing of compounds for activity

I Biochemical assay

I High-throughput screening

I Virtual screening

I Receptor-based versus ligand-based

[Figure labels: COX-2 (receptor), celecoxib (ligand)]


Virtual screening: Ligand-based approach

Input: Known ligands (training samples), compound library (test samples)

Output: Molecules with best predicted activity

Particularities

I Small training sets ($10^1$ to $10^3$)

I Large test sets ($10^5$ to $10^6$)

I False positives worse than false negatives

I Only top predictions are of interest

I Available binding activity information varies

Key questions

I How to represent (and compare) molecules?

I How to learn from the training data?


Representation: Descriptors

I Computable properties in vector form

I Most frequently used representation

I Comparison by metric, inner product or similarity coefficient

1-pentyl acetate

I Bonds in longest chain: 7

I Rotatable bonds: 4

I Negative partial charge surface fraction: 0.13

I Hydrogen bond acceptors: 1

I . . .

Figure courtesy Dr. Michael Schmuker
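The three ways of comparing descriptor vectors named on this slide (metric, inner product, similarity coefficient) can be illustrated with a minimal sketch. The two vectors below are made-up values (hypothetical, not taken from the talk), and the continuous Tanimoto coefficient is only one of several similarity coefficients in use.

```python
import math

def euclidean_distance(a, b):
    """Metric: straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Standard inner product of two descriptor vectors."""
    return sum(x * y for x, y in zip(a, b))

def tanimoto(a, b):
    """Continuous Tanimoto coefficient: <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = inner_product(a, b)
    return ab / (inner_product(a, a) + inner_product(b, b) - ab)

# Hypothetical descriptor vectors:
# (bonds in longest chain, rotatable bonds, hydrogen bond acceptors)
mol_a = [7.0, 4.0, 1.0]
mol_b = [5.0, 3.0, 2.0]

dist = euclidean_distance(mol_a, mol_b)
dot = inner_product(mol_a, mol_b)
sim = tanimoto(mol_a, mol_b)
print(dist, dot, sim)
```

Descriptors on very different scales would normally be standardized before any of these comparisons; that step is left out here for brevity.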

M. Rupp, G. Schneider, P. Schneider: Distance phenomena in high-dimensional chemical descriptor spaces: consequences for similarity-based approaches, in preparation, 2008.


Alternatives: Structured data representations

I Graph models (structure graph)

I Surface models (molecular shape)

I Density models (spatial distribution)

I . . .


Representation: ISOAK

Iterative similarity optimal assignment graph kernel

Iterative graph similarity

I $|V| \times |V'|$ matrix $X$ of pairwise vertex similarities

I "Two vertices are similar if their neighbours are similar"

I Recursive definition; iterative computation

$$X_{i,j} = (1-\alpha)\, k_v(v_i, v'_j) + \alpha \max_{\pi} \frac{1}{|v'_j|} \sum_{v \in n(v_i)} X_{v,\pi(v)}\, k_e\bigl(\{v_i, v\}, \{v'_j, \pi(v)\}\bigr)$$

Optimal assignment

I Find assignment $\rho : V \to V'$ such that $\sum_{i=1}^{|V|} X_{i,\rho(i)}$ is maximal

M. Rupp, E. Proschak, G. Schneider: Kernel Approach to Molecular Similarity Based on Iterative Graph Similarity, Journal of Chemical Information and Modeling 47(6): 2280–2286, 2007.


Representation: ISOAK example

ISOAK with $\alpha = \frac{1}{2}$, Dirac vertex kernel using element types and Dirac edge kernel using bond types. Overall similarity is $4.64/\sqrt{5 \cdot 7} = 0.78$.

Pairwise atom similarities $10^2 X_{ij}$ (rows: glycine atoms, columns: serine atoms)

        1   2   3   4   5   6   7
    1  98  50  00  00  00  00  50
    2  50  98  11  34  16  17  89
    3  00  11  96  14  68  78  13
    4  00  34  14  91  13  20  38
    5  00  24  67  17  81  77  20
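The optimal assignment step can be checked against the numbers on this slide with a short sketch. It uses a brute-force search over all injective row-to-column assignments, which is only feasible for tiny matrices like this one and is not claimed to be the method used in the talk (the Hungarian algorithm would be the usual choice in practice). The function name `optimal_assignment` is hypothetical, and the normalisation by $\sqrt{5 \cdot 7}$ simply follows the slide's arithmetic.

```python
from itertools import permutations
from math import sqrt

# Pairwise atom similarities from the slide, in hundredths
# (5 rows x 7 columns).
X100 = [
    [98, 50,  0,  0,  0,  0, 50],
    [50, 98, 11, 34, 16, 17, 89],
    [ 0, 11, 96, 14, 68, 78, 13],
    [ 0, 34, 14, 91, 13, 20, 38],
    [ 0, 24, 67, 17, 81, 77, 20],
]

def optimal_assignment(matrix):
    """Brute-force the injective row-to-column assignment with maximal sum.

    Assumes the matrix has at least as many columns as rows.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best_total, best_cols = -1, None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(matrix[i][c] for i, c in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return best_total, best_cols

total100, assignment = optimal_assignment(X100)
score = total100 / 100
similarity = score / sqrt(len(X100) * len(X100[0]))
print(assignment, score, round(similarity, 2))
```

On this matrix the search selects the diagonal entries 98, 98, 96, 91 and 81, which sum to 4.64 and give the slide's normalised similarity of 0.78.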


Methods: Kernel-based machine learning

Linear algorithms and the kernel trick

1. Transformation into higher-dimensional space

[Figure: two classes of points along the $x$ axis from -6 to 6; not linearly separable]

[Figure: the same points after the mapping $x \mapsto (x, \sin(x))$, plotted in the $(x, y)$ plane; linearly separable]


2. Implicit computation of inner products

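kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$

Example: Quadratic kernel

$$\Phi : \mathbb{R}^n \to \mathbb{R}^{n^2}, \quad x \mapsto (x_i x_j)_{i,j=1}^{n}$$

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle = \sum_{i,j=1}^{n} x_i x_j x'_i x'_j = \sum_{i=1}^{n} x_i x'_i \sum_{j=1}^{n} x_j x'_j = \langle x, x' \rangle^2$$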
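The identity for the quadratic kernel can be verified numerically. The sketch below builds the explicit $n^2$-dimensional feature vector and compares its inner product with the squared inner product in the input space; the example vectors and function names are arbitrary choices for illustration.

```python
def feature_map(x):
    """Explicit quadratic feature map: all n*n products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

def dot(a, b):
    """Standard inner product."""
    return sum(p * q for p, q in zip(a, b))

def quadratic_kernel(x, y):
    """Same value via the kernel trick, without building the n*n features."""
    return dot(x, y) ** 2

x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

explicit = dot(feature_map(x), feature_map(y))   # works in R^(n^2)
implicit = quadratic_kernel(x, y)                # works in R^n only
print(explicit, implicit)
```

The explicit route needs $n^2$ multiplications per vector, while the kernel needs only $n$, which is the practical point of computing inner products implicitly.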

3. Rewrite linear algorithms using only inner products


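Example: Centering in feature space $\mathcal{H}$

$$k^*(x, x') = \Bigl\langle \Phi(x) - \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i),\; \Phi(x') - \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Bigr\rangle$$

$$= \langle \Phi(x), \Phi(x') \rangle - \frac{1}{n} \sum_{i=1}^{n} \langle \Phi(x_i), \Phi(x') \rangle - \frac{1}{n} \sum_{i=1}^{n} \langle \Phi(x), \Phi(x_i) \rangle + \frac{1}{n^2} \sum_{i,j=1}^{n} \langle \Phi(x_i), \Phi(x_j) \rangle$$

$$= k(x, x') - \frac{1}{n} \sum_{i=1}^{n} k(x_i, x') - \frac{1}{n} \sum_{i=1}^{n} k(x, x_i) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(x_i, x_j)$$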
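Applied to every pair of training samples, the last line of the derivation becomes a matrix expression. The following is a minimal sketch using NumPy; the function name `center_kernel_matrix` and the random data are hypothetical. A linear kernel is used only because its feature map is known, so the result can be compared with explicit centering.

```python
import numpy as np

def center_kernel_matrix(K):
    """Center a training kernel matrix in feature space.

    Implements K* = K - 1K - K1 + 1K1, where 1 is the n x n matrix
    whose entries are all 1/n (the four terms of the derivation).
    """
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # hypothetical descriptor vectors
K = X @ X.T                        # linear kernel, so Phi is the identity

K_centered = center_kernel_matrix(K)
X_centered = X - X.mean(axis=0)    # explicit centering, possible here because Phi is known
print(np.allclose(K_centered, X_centered @ X_centered.T))
```

For a kernel with an implicit feature map, the explicit route in the last two lines is unavailable, and the kernel-only expression is the only way to center.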


Methods: Gaussian process regression

I Gaussian process as data model

I Generalization of multivariate normal distribution to functions

I Determined by mean and covariance

I Kernel matrix as covariance matrix

I Conditioning of prior on training data yields posterior distribution

I Variance as confidence estimates for predictions
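The points above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the model used in the talk: a zero-mean prior, a squared-exponential kernel with unit length scale, a fixed noise variance, and toy one-dimensional data are all hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    # Kernel matrix of the training data acts as prior covariance (plus noise).
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # Variance of the latent function values (observation noise not added).
    var = np.diag(K_ss) - (v**2).sum(axis=0)
    return mean, var

X_train = np.array([[-3.0], [-1.0], [0.0], [0.5], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.0], [4.0]])   # one input at a training point, one far from all of them

mean, var = gp_posterior(X_train, y_train, X_test)
print(mean, var)
```

The predicted variance is small at the training input and close to the prior variance far from the data, which is what makes it usable as a confidence estimate when ranking compounds.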

[Figure: two plots of target (-3 to 3) against input (-4 to 4); training samples are marked with +]


Methods: Principal component analysis novelty detection

I Orthogonal directions of maximum variance

I Dimensionality reduction

I Descriptive statistic

I Non-linear variants recover underlying Riemannian manifolds

I Novelty detection via projection error
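Novelty detection via projection error can be sketched with ordinary linear PCA; the non-linear (kernel) variant mentioned above follows the same idea in feature space and is not shown. The training data, the number of components, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the mean and the leading principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def projection_error(x, mean, components):
    """Squared distance between x and its projection onto the principal subspace."""
    centered = x - mean
    projected = components.T @ (components @ centered)
    return float(np.sum((centered - projected) ** 2))

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=200)
# Hypothetical training data: points close to the line y = 2x in the plane.
X_train = np.column_stack([t, 2.0 * t]) + 0.01 * rng.normal(size=(200, 2))

mean, components = fit_pca(X_train, n_components=1)
err_typical = projection_error(np.array([0.5, 1.0]), mean, components)
err_novel = projection_error(np.array([1.0, -2.0]), mean, components)
print(err_typical, err_novel)
```

A sample that resembles the training data lies close to the principal subspace and has a small error; a large error flags the sample as novel. Only positive (training) samples are needed to fit the model.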


Application: Material and methods

I Target: PPARγ (peroxisome proliferator-activated receptor γ)

I Dataset: 144 published ligands with pKi values

I Screening library: Asinex Gold and Platinum (360 000 compounds)

I Representation:

  I Vectorial (CATS2D, MOE 2D, Ghose-Crippen fragments)

  I ISOAK molecular graph kernel

I Method:

  I Gaussian process regression

  I Multiple kernel learning

  I Leave-one-cluster-out cross-validation

  I Fraction of actives (FA20) as success measure

T. Schroeter, M. Rupp, K. Hansen, E. Proschak, K.-R. Müller, G. Schneider: Virtual screening for PPARγ ligands using ISOAK molecular graph kernel and Gaussian processes, 4th German Conference on Chemoinformatics, 2008.


Application: Results

I Top 30 of three best performing models

I 16 cherry-picked compounds with novel scaffolds

I PPARγ selective activator (EC50 9.3 ± 0.3 µM), natural product related

I 3 dual PPARα/γ activators (µM range, two ≤ 10µM)

I 4 selective PPARα activators (µM range, one ≤ 10µM)

I 8 out of 16 compounds are active

I 4 out of 16 compounds with EC50 ≤ 10µM

I Results preliminary since testing is still on-going

M. Rupp, T. Schroeter, R. Steri, E. Proschak, K. Hansen, O. Rau, M. Schubert-Zsilavecz, K.-R. Müller, G. Schneider, in preparation, 2008.


Summary

I Virtual screening as a machine learning problem

I Importance of molecular representation

I Virtual screening using only positive samples


Acknowledgements

I Prof. Dr. Gisbert Schneider and modlab team (molecular design laboratory, www.modlab.de)

I Prof. Dr. Klaus-Robert Müller, Timon Schroeter, Katja Hansen (TU Berlin and Fraunhofer FIRST)

I Prof. Dr. Manfred Schubert-Zsilavecz, Ramona Steri (University of Frankfurt)

I Beilstein-Institute for the advancement of chemical sciences

I FIRST (Frankfurt international research graduate school on translational biomedicine)

Thank you for your attention
