
Page 1:

April 25th 2018

Automatic optimization of predictive Bioactivity models

Chi Chung Lam, Fabian Steinmetz, Paul Czodrowski

Page 2:

Predictive Models in Production

Multiple models are trained for biological targets:

Random Forests

Neural Networks

Gradient Boosted Trees

NNs and GBTs are very sensitive to hyperparameter changes

Automated methods are needed to build models with the right hyperparameters

Page 3:

NN Architectures & Hyperparameters

Millions of unique combinations are possible.

NN architecture:
• Layer type
• Number of layers
• Neurons per layer
• Activation functions

Training parameters:
• Optimizer
• Learning rate
• Weight decay
• Batch size
• Loss function
• …

Guido Bolick: Automatic Generation of Neural Network Architectures Using a Genetic Algorithm | 27.09.2016

Page 4:

Genetic Algorithm for hyperparameter optimization

[Diagram: genetic algorithm cycle]

Guido Bolick: Automatic Generation of Neural Network Architectures Using a Genetic Algorithm | 27.09.2016

Page 5:

Genetic Algorithm Workflow

Guido Bolick: Automatic Generation of Neural Network Architectures Using a Genetic Algorithm | 27.09.2016
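The workflow above can be sketched as a minimal genetic algorithm over hyperparameter dictionaries. This is a toy sketch, not the deck's implementation: the search space and fitness function are simplified stand-ins for real NN training, while the drop-worst-50% survivor strategy and uniform crossover follow the GA settings shown later in the deck.

```python
import random

# Hypothetical, simplified search space (stand-in for the full NN hyperparameters)
SPACE = {
    "layers": [1, 2, 3, 4],
    "neurons": [32, 64, 128, 256, 512],
    "learning_rate": [0.05, 0.1, 0.5, 1.0],
}

def random_entity():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(entity, rate=0.05):
    # With probability `rate`, resample a gene from the search space
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in entity.items()}

def crossover(a, b):
    # Uniform crossover: each gene comes from either parent
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolve(fitness, generations=10, pop_size=100):
    population = [random_entity() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]          # drop worst 50%
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness; in reality this would be the validation kappa of a trained NN
toy_fitness = lambda e: e["neurons"] / 512 - abs(e["learning_rate"] - 0.1)
best = evolve(toy_fitness, generations=5, pop_size=20)
print(best)
```

In the real setting the fitness evaluation (training one NN) dominates the runtime, which is why the deck distributes it across workers.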

Page 6:

Comparing Global Models

Model          Description
RF             Random Forest with fixed hyperparams
Leiden DNN     DNN with fixed hyperparams
GA DNN         DNN with GA-optimized hyperparams
Random DNN     DNN with grid-search-optimized hyperparams
Feature-Wise   Baseline model that takes the best fingerprint bit as prediction
XGBoost        Gradient Boosted Trees with fixed hyperparams

Page 7:

Feature-Wise Baseline

Assume that each fingerprint bit is a prediction, and select the best bit:

             Bit 0   Bit 1   Bit 2   Bit 3   Activity
Sample 1       1       0       0       1        0
Sample 2       1       0       0       0        0
Sample 3       1       1       1       1        1
Sample 4       1       1       1       0        1
Sample 5       0       0       1       1        0
Kappa score   0.41    1.00    0.67   -0.17
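The baseline can be computed by treating each bit column as a prediction and scoring it against the activity labels, e.g. with scikit-learn's cohen_kappa_score. A sketch on the slide's example data (the exact per-bit values on the slide may stem from a different kappa variant; bit 1, which matches the activity perfectly, scores 1.0 in any case):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Fingerprint bits (columns) and activity labels from the slide's example
bits = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])
activity = np.array([0, 0, 1, 1, 0])

# Score every bit as if it were a prediction, then keep the best one
scores = [cohen_kappa_score(bits[:, j], activity) for j in range(bits.shape[1])]
best_bit = int(np.argmax(scores))
print(best_bit, scores[best_bit])
```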

Page 8:

Global Model Performance

[Bar chart: Kappa score (0 to 0.8) per target (CACO, CLINT_H, CLINT_M, CLINT_R, HERG, SOL) for RF, Leiden DNN, GA DNN, Random DNN, Feature-Wise, XGBoost, and XGBoost Random]

Page 9:

GA vs Random Search Comparison

The mean kappa score increases as the GA evolves

However, a good solution is found very early (already within the initial 100 architectures)

A random search over the same search space finds a similar or better solution

Page 10:

Fingerprints hash a molecule's substructures into a fixed-length bit vector

A small fingerprint size causes "collisions"

A large fingerprint size causes many redundant bits

Fingerprint Filtering: CLINT_R

FP size                                   1024     4096
Avg substructures per bit                79.84    20.64
Bits removed by 0.01 variance filter         3     2388
Substructures/bit after 0.01 var filter  80.00    21.86
True size after 0.01 var filter           1021     1708
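scikit-learn's VarianceThreshold implements exactly this kind of filter. A sketch on synthetic fingerprint-like data (the 0.01 threshold follows the slide; the data and dimensions here are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# Synthetic 1024-bit fingerprints for 500 compounds:
# each bit fires with its own small probability, so some bits are near-constant
fps = (rng.random((500, 1024)) < rng.random(1024) * 0.1).astype(int)

# Drop bits whose variance is below 0.01 (near-constant, uninformative bits)
selector = VarianceThreshold(threshold=0.01)
filtered = selector.fit_transform(fps)
print(fps.shape[1], "->", filtered.shape[1], "bits after 0.01 variance filter")
```

For a binary bit with "on"-frequency q the variance is q(1-q), so the threshold effectively removes bits that are almost always 0 or almost always 1.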

Page 11:

Fingerprint Filtering: CLINT_R

Feature selection of fingerprint bits by variance

Control: unfiltered FP of the same length as the filtered FP

Problem: the choice of the variance threshold is arbitrary

[Bar charts: mean Kappa score for DNN, RF, and XGB under 0.01-var filtering, its control, 0.0-var filtering, its control, and unfiltered, at 1024 bits (0 to 0.25) and 4096 bits (0 to 0.30)]

Page 12:

Finding the optimal variance: CLINT_R

[Bar chart: mean Kappa score (0 to 0.25) for 0.01 var, 0.0 var, optimal var, and unfiltered; CLINT_R optimal-variance filtering]

Page 13:

Finding the optimal variance: HERG

[Bar chart: mean Kappa score (0 to 0.6) for 0.01 var, 0.0 var, optimal var, and unfiltered; HERG optimal-variance filtering]

Page 14:

Fingerprint Filtering: Problems

The variance of a bit depends strongly on the sample size

Use a threshold relative to the sample size instead of an absolute value

Can we combine this filtering with the "feature-wise baseline" analysis?

Drop fingerprint bits that correlate poorly with the dependent variable?

Page 15:

Nested Cluster Validation

Page 16:

On-line Updating of Models

The final models are used in production and served to chemists and other users

Retraining occurs every 3 months; during these three months the models are "outdated"

Retraining more frequently is impractical time-wise

XGB and DNNs allow "on-line" updating: new data is fitted in an additional training step on the existing models

This can happen in near real-time

Full retraining is only necessary when performance starts declining
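For XGBoost the mechanism is continued training (passing the existing booster via the xgb_model argument of xgb.train), and a Keras DNN can simply be fit again on the new batch. As a library-agnostic sketch of the same idea, scikit-learn's partial_fit performs an incremental update on new data without retraining from scratch; everything here (data, model choice) is an illustrative stand-in, not the deck's production setup:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins: the "quarterly" dataset and a small batch of new compounds
X_old = rng.random((200, 16))
y_old = (X_old[:, 0] > 0.5).astype(int)
X_new = rng.random((20, 16))
y_new = (X_new[:, 0] > 0.5).astype(int)

# Initial (quarterly) training run
model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=np.array([0, 1]))

# On-line update: an additional training step on only the new compounds
model.partial_fit(X_new, y_new)
preds = model.predict(X_new)
print((preds == y_new).mean())
```

The update step touches only the new samples, which is why it can run in near real-time compared to a full retrain.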

Page 17:

Our in-house environments: CREAM and MOCCA

CREAM (Classification REgression At Merck)

- Python environment and modelling tool
- Used for the majority of predictive models
- Offers versatile features, such as:
  - Multiple machine learning algorithms
  - Different validation methods
  - Interface to MOCCA

MOCCA is the Merck Online Computational Chemistry Analyzer, our web-based in-house prediction tool

Page 18:

Page 19:

Global vs. local models

Global models:
• Large dataset
• Large applicability domain (AD)
• Endpoints such as:
  • Physico-chemical properties
  • Pharmacokinetics
  • Toxicity
  • General selectivity

Local models:
• Smaller dataset
• Smaller applicability domain
• Endpoints such as:
  • Activity
  • Selectivity
  • Toxicity, pharmacokinetics

Generally, global models are preferable due to greater in-house modelling experience and a larger AD, but we are happy to support projects with local models if needed.

Page 20:

e.g.

Page 21:

Page 22:

Acknowledgement

• Chi Chung Lam
• Wolf-Guido Bolick (Andreas Dominik)
• Fabian Steinmetz
• Kristina Preuer, Günter Klambauer (Sepp Hochreiter)
• Friedrich Rippmann
• Marcel Baltruschat
• Cornelius Kohl
• Samo Turk
• Jan Fiedler
• Christian Röder

Page 23:

back-up

Page 24:

Datasets

Set       Train   Test   Classes
CACO       9637    523      3
CLINT_H   16264    797      3
CLINT_M   18313    981      3
CLINT_R   15910    760      3
HERG       6894    288      2
SOL       19615    667      3

Page 25:

NN Architectures & Hyperparameters

Millions of unique combinations are possible.

NN architecture:
• Layer type
• Number of layers
• Neurons per layer
• Activation functions

Training parameters:
• Optimizer
• Learning rate
• Weight decay
• Batch size
• Loss function
• …

Page 26:

Optimization of Hyperparameters

Expert: hyperparameters derived from literature & experience; hyperparameter search within promising parameter areas

Lucky people: Random search (Bergstra et al. 2012); Grid search (Larochelle et al. 2007)

Everyone: probability-based algorithms (Brochu et al. 2010, Bergstra et al. 2011); directed random search (e.g. genetic algorithms)

Page 27:

What is a Genetic Algorithm?

[Diagram: genetic algorithm cycle]

Page 28:

Validation Strategies

• Use as much data as possible for training
• Get a realistic glimpse of the performance

• 5-fold cross-validation
  • Every compound represented in 4/5 models
  • Hyperparameter optimization to increase performance on the validation sets
  • Resulting performance trustworthy?!

• 5-fold nested cross-validation: 25 models
  • Every compound represented in 16/25 models
  • Increased computational requirements
  • 5x hyperparameter optimizations to increase performance on the validation sets
  • Final performances evaluated using the corresponding outer-loop test sets
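The 16/25 count follows directly from the fold structure: with 5 outer folds a compound sits in 4 outer training sets, and within each of those it sits in 4 of the 5 inner training sets, giving 4 × 4 = 16 of the 25 inner models. A small sketch with scikit-learn's KFold verifying this on synthetic indices:

```python
import numpy as np
from sklearn.model_selection import KFold

n = 25                       # any sample count works; 25 keeps folds even
outer = KFold(n_splits=5)
count = 0                    # inner-training sets that contain sample 0
for train_idx, _ in outer.split(np.arange(n)):
    inner = KFold(n_splits=5)
    for inner_train, _ in inner.split(train_idx):
        if 0 in train_idx[inner_train]:
            count += 1
print(count, "of 25 inner models see sample 0 in training")
```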

Page 29:

Training of a NN

1. Get a job (hyperparameters) from the job server
2. Repeat for all training/test sets:
   2.1 Build a NN based on the hyperparameters
   2.2 Train the NN using a training set
       - A balanced-batch generator maintains the same active/inactive ratio within each batch
       - Early stopping when the mean validation loss of a sliding window (15 epochs) does not improve for 100 epochs
   2.3 Evaluate the best state (center of the best window) using the validation set; metric: Cohen's Kappa
       (agreement of labels vs. prediction, corrected for the agreement of 2 random observers)
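The sliding-window early-stopping rule in step 2.2 can be sketched as a standalone check over a recorded loss curve. The window size (15) and patience (100) follow the slide; the loss curve and the exact tie-breaking behaviour here are illustrative assumptions, not the deck's implementation:

```python
def window_early_stopping(val_losses, window=15, patience=100):
    """Return (stop_epoch, best_window_center): stop when the mean validation
    loss over a sliding window has not improved for `patience` epochs."""
    best_mean, best_epoch = float("inf"), 0
    for epoch in range(window, len(val_losses) + 1):
        mean = sum(val_losses[epoch - window:epoch]) / window
        if mean < best_mean:
            best_mean, best_epoch = mean, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop, report best-window center
            return epoch, best_epoch - window // 2
    return len(val_losses), best_epoch - window // 2

# Toy loss curve: improves for 50 epochs, then plateaus
losses = [1.0 / (e + 1) for e in range(50)] + [0.02] * 200
stop, best_center = window_early_stopping(losses, window=15, patience=100)
print(stop, best_center)
```

Averaging over a window rather than tracking the single best epoch makes the stopping rule robust to noisy per-epoch validation losses, and the window center (step 2.3) picks a representative state from the best-performing stretch.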

Page 30:

So many parameters…

Genetic Algorithm:
• Population size: 100
• Workers: 10
• Fingerprint size: 1024
• SMARTS patterns: 826
• Evolution strategy: drop worst 50%

Mutation settings:
• Default: mutation rate 5%, mutation strength 1, crossing-over rate 30%
• Increased: mutation rate 10%, mutation strength 2, crossing-over rate 30%

Training:
• Optimizer: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam
• Loss functions: mae, mse, msle
• Learning rate: 0.05, 0.1, 0.5, 1.0
• Weight decay: 0.0, 1E-7, 5E-7
• Momentum: 0.0, 0.1, …, 0.9
• Nesterov: 0, 1
• Batch size: 5%, 6%, …, 20%

Architecture:
• Layers: 1-4
• Layer types: Dense, Dropout
• Neurons: 32, 64, …, 512
• Dropout ratio: 5%, 10%, …, 90%
• Activation functions: linear, sigmoid, hard-sigmoid, softmax, relu, tanh
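The training-parameter options alone already multiply out to tens of thousands of combinations, and including the architecture choices pushes the full grid into the millions claimed earlier. A quick count plus one random-search draw, using only the training parameters listed above (the momentum and batch-size ranges are expanded as the "…" on the slide suggests):

```python
import random

# Training-parameter options as listed on the slide
space = {
    "optimizer": ["sgd", "rmsprop", "adagrad", "adadelta", "adam", "adamax", "nadam"],
    "loss": ["mae", "mse", "msle"],
    "learning_rate": [0.05, 0.1, 0.5, 1.0],
    "weight_decay": [0.0, 1e-7, 5e-7],
    "momentum": [round(0.1 * i, 1) for i in range(10)],  # 0.0 .. 0.9
    "nesterov": [0, 1],
    "batch_size": [f"{p}%" for p in range(5, 21)],       # 5% .. 20%
}

# Size of the full training-parameter grid
n_train = 1
for options in space.values():
    n_train *= len(options)
print("training-parameter combinations:", n_train)

# One random-search draw: sample each option list independently
random.seed(0)
sample = {k: random.choice(v) for k, v in space.items()}
print(sample)
```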

Page 31:

Datasets

Dataset     hERG          Micronucleus-Test
Compounds   6999          798
Actives     3205 (46%)    263 (33%)
Inactives   3794 (54%)    535 (67%)

Binary classification: inactive = 0, active = 1

Page 32:

Found NN-Hyperparameters

Page 33:

Found NN-Hyperparameters

Page 34:

Improvement of NNs while running the GA

The initial population starts with inner-kappa values of ~0.6 in all splits

The GA is able to improve the performance of the best entities even further (red line)

Mutations can lead to badly performing entities (blue line) until the last generation

Page 35:

Novelty of Architectures

The proportion of new entities in the population decreases during the runtime of the GA

A higher mutation rate (red line) increases the searchable space for the GA

Page 36:

Influence of Hyperparameters

Example label: "1_activation (344)" = first hidden layer, the activation function of this layer, and the number of contributing pairs

Contributing pairs differ only by the shown parameter

Boxplots are based on the absolute difference of the two inner-kappa values of all contributing pairs

Page 37:

User-Interface

Page 38:

Conclusion

Implemented an algorithm to create a consensus model using 5-fold nested cross-validation

Each compound is represented in 16 of 25 NNs

The calculation needs 8-14 hours (e.g. overnight) on a GTX cluster

The GA improves the already high kappa values of the NNs even further

Kappa values of the final NN models are mostly larger than 0.5 ("moderate" according to Landis & Koch 1977)

Further steps:

Possibility to use chemical descriptors and multiple fingerprints

Option to create multi-class models (more classes than just 0 and 1) and regression models

(Polishing up and writing a paper)

Page 39:

Implementation of the GA