bioactivity predictive modelingmay2016

Download Bioactivity Predictive ModelingMay2016

Post on 16-Apr-2017

124 views

Category:

Documents

1 download

Embed Size (px)

TRANSCRIPT

  • Elsevier Predictive Modeling |

    Presented By

    Date

    Bioactivity Predictive Modeling

    Using Reaxys Medicinal Chemistry Data to Build Models for Any Target

    Matthew CLARK

    5 May 2016

  • Elsevier Predictive Modeling |

    R&D Solutions FOR PHARMA & LIFE SCIENCES Essential Tools and Solutions for Pharma & Life Sciences

    https://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutions

    Matthew Clark, Ph. D. m.clark@elsevier.com

    https://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionshttps://www.elsevier.com/rd-solutions/pharma-and-life-sciences-solutionsmailto:m.clark@elsevier.com

  • Elsevier Predictive Modeling |

    Reaxys Medicinal Chemistry has millions of bioactivity measurements for thousands of targets.

    Data from Aureus, Elsevier, and GVK Bio

    Covering patents and journal articles

    The measurements are associated with compounds with chemical structures.

    We can combine the structures/activities to make Quantitative Structure Activity Relationship (QSAR) predictive models.

    Here we use a KNIME automation workflow to demonstrate a system that can create and evaluate a predictive model for any selected target.

    Of course not all are predictive, but the workflow evaluates this

    Each model takes from 2-15 minutes to create

    It provides an extremely flexible workbench for exploring QSAR

    Elsevier services can integrate the system with your in-house structure/activity databases.

    Summary

  • Elsevier Predictive Modeling | 4

    THE WORLDS LARGEST & BEST-ORGANIZED MEDICINAL CHEMISTRY DATABASE

    The Reaxys Medicinal Chemistry database contains:

    This unique collection of data is sourced from:

  • Elsevier Predictive Modeling | 5

    Various Activity Metrics are Consolidated to a Single Value

    pIC50, pEC50, pED50, pID50, pLC50, pLD50, pKi, pKd,

    pKb, pD2, pD2, pA2 , IC50, EC50, ED50, ID50, LC50, LD50,

    Ki, Kd, Kb, Ka, Ke , % Inhibition

    pX is computed

    Original values are preserved; this is an additional computed descriptor

  • Elsevier Predictive Modeling | 6

    Conversion of % inhibition

    If Concentration of the compound is not Available p[X] is not calculated

    IF CONCENTRATION Available as a range

    p[X] is not calculated

    If concentration is available as a single value

    p[X] is calculated

    If % inhibition > 25 p[X] = -log10(IC50)

    =[]

    % [C]

    If % inhibition < 25 p[X] = 1

  • Elsevier Predictive Modeling | 7

    Qualitative results

    Reported as Inactive

    p[x] = 1

    Reported as Active

    Concentration reported p[X] = - log10(Concentration)

    Concentration is not reported p[X] not calculated

  • Elsevier Predictive Modeling | 8

    Process for Computing pX

    Conc

    Unit?

    Data

    concentration

    Not concentration

    Conversion to concentration

    Quality metric?

    Yes

    No

    Quality Rules

    pX not calculated

    is Range?

    Range processing

    Log unit?

    No

    Yes

    pX

    Yes

    -log10(affinity)

    No

    Low quality

    Acceptable quality

    pX not calculated

    Not convertible

  • Elsevier Predictive Modeling |

    Issue: some targets have thousands of measurements

    Solution: randomly select N examples from each decade of pX activity

    Selecting from each decade provides a more balanced data set across the activity range

    In this study only IC50 measurements are used, not % inhibition or Kd

    N is from 50 to 200 in these examples.

    Compounds are selected so that training and test sets have no overlap

    Use random selection of both

    For each compound the median pX value is chosen when multiple measurements are reported

    pX = -log(molar activity)

    Partial-least-squares correlation is used, selecting the optimal number of descriptor components

    Model quality is assessed

    I like RMS error of prediction because it most directly answers the question how accurate do I expect the prediction to be?

    r2, F and other statistics are also computed.

    9

    Elements of the Model Making Workflow

  • Elsevier Predictive Modeling | 10

    Random Selection to Flatten Activity Distribution

    1 2 4 3 6 5 7 8 9 10

    -log(M)

    # o

    f co

    mp

    ou

    nd

    s

    1 2 4 3 6 5 7 8 9 10

    -log(M)

    # o

    f co

    mp

    ou

    nd

    s

    Raw reports may be thousands for a popular target. Activities and structure distribution may skew statistics

    Randomly selected N compounds per decade More diverse set of structures and more uniform distribution of activities

  • Elsevier Predictive Modeling |

    KNIME Workflow

    Input: target name, number of compounds to use from

    each decade of log(Activity) activity type, e.g. IC50

    Output: model from training set predictions of test set

  • Elsevier Predictive Modeling |

    Powerful statistical methods can select variables to create a model that looks great on the training set

    However the real test is how well the model can predict activity for a compound not in the training set.

    In these examples the Training and Test sets are not overlapping; no test compound is in the training set.

    In this study the best metric for the model is the RMS Error of Prediction the errors seen when predicting activities for compounds not in the training set.

    This is useful to provide the error bar for the predictions.

    The required accuracy depends on what you are using the prediction for

    The Pearson r2 provides a metric of fit but is dependent on the range of the data and does not directly tell you how much error to expect.

    Other metrics for prediction might include how similar the compound is any compounds in the training set. One might expect the predictions to be less reliable if the compound is very different.

    12

    The Importance of the Test Set

  • Elsevier Predictive Modeling |

    The descriptors are properties computed for the compounds that are related to the activities

    They encode elements of the structure and other properties of the compounds.

    In this study we combined 2 sets of computed descriptors

    RD Kit

    CDK Toolkit

    13

    Descriptors

  • Elsevier Predictive Modeling |

    library(pls)

    model

  • Elsevier Predictive Modeling |

    This model is moderately useful to separate best from worst molecules.

    The test set statistics are influenced by 2 outliers with absurdly high predicted binding Without those the rms error is about 0.9.

    15

    ORL-1

    Model RMS Error of

    Prediction

    Training Set 0.71

    Test Set 1.3

    y = 0.9912x R = 0.6787

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Training Set y = 0.9956x R = 0.2808

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Test Set

  • Elsevier Predictive Modeling |

    p38a has 2 binding sites (normal and allosteric) which make a simple QSAR more difficult if the compounds are not separated

    Test set RMSEP suggests predictive value is limited

    16

    p38a (Q16539)

    Model RMS Error of

    Prediction

    Training Set 1.15

    Test Set 1.45

    y = 0.9734x R = 0.3237

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Training Set y = 0.951x

    R = 0.0775

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Test Set

  • Elsevier Predictive Modeling |

    This target has fewer reported compounds and activities than others

    Test suggests reasonable predictive power to distinguish active from inactive compounds in-silico

    17

    ABL Kinase (P00519)

    Model RMS Error of

    Prediction

    Training Set 0.63

    Test Set 0.96

    y = 0.9919x R = 0.8741

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Training Set y = 0.9868x R = 0.7179

    0

    2

    4

    6

    8

    10

    12

    14

    16

    0 2 4 6 8 10 12 14 16

    pX

    Pre

    dic

    ted

    pX Measured

    Test Set

  • Elsevier Predictive Modeling |

Recommended

View more >