SMILES based encoders: base for QSAR feature modelling and generative modelling
Esben Jannik Bjerrum, Principal Scientist
Eighth Joint Sheffield Conference on Chemoinformatics, 17-19 June 2019


Page 1: SMILES based encodersSMILES Based Autoencoders Example: Gómez-Bombarelli, Rafael et al. 2018. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules.”

SMILES based encoders: base for QSAR feature modelling and generative modelling

Esben Jannik Bjerrum, Principal Scientist

Eighth Joint Sheffield Conference on Chemoinformatics, 17-19 June 2019

Page 2:

2

Introduction - Outline

The Players:

SMILES

Recurrent Neural Networks (RNN)

Long Short-Term Memory cells (LSTM)

SMILES enumeration

RNN as SMILES readers and generators

Combining it:

Autoencoders

Heteroencoder training trick

Latent space properties

Sampling in a 1:many setting

Use in QSAR

Use in Generative Modelling

Conclusion

Toolkits and Code

Page 3:

SMILES, a Chemical Language and Information System

3

Page 4:

6

Recurrent Neural Networks (RNN)

● Sequences of features as inputs

● The same task for every element of a sequence, with the output being affected by the previous computations

● Modeling of sequences such as text, tweets, time series etc.

Page 5:

• Neural networks learn on vectors, matrices or tensors.

• One-hot encoding with a defined vocabulary converts SMILES strings into 2D matrices

7

One hot encoding
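
As a minimal sketch of the one-hot step described on this slide: the vocabulary, the start/end tokens (`^`, `$`) and the space padding below are assumptions made for illustration, not the speaker's actual setup. The sketch also treats every character as one token, which is a simplification (two-letter elements such as Cl or Br would need their own tokens).

```python
# Hypothetical toy vocabulary; a real one would be derived from the training set.
VOCAB = ["^", "$", " ", "C", "c", "N", "O", "1", "(", ")", "="]
CHAR_TO_INT = {ch: i for i, ch in enumerate(VOCAB)}


def one_hot_smiles(smiles, max_len):
    """Return a max_len x len(VOCAB) matrix (list of rows) for one SMILES string."""
    # Add start/end tokens, then pad with spaces up to the fixed length.
    padded = ("^" + smiles + "$").ljust(max_len)
    if len(padded) > max_len:
        raise ValueError("SMILES is longer than max_len")
    matrix = []
    for ch in padded:
        row = [0] * len(VOCAB)
        row[CHAR_TO_INT[ch]] = 1  # raises KeyError for characters outside VOCAB
        matrix.append(row)
    return matrix


matrix = one_hot_smiles("c1ccccc1", max_len=12)
for row in matrix:
    print("".join(str(v) for v in row))
```

Each row is one sequence position and each column one vocabulary entry, so a batch of such matrices forms the 3D tensor an RNN would read.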

Page 6:

Long Short-Term Memory cells (LSTM)

8

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation.
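
The slide's figure did not survive extraction. As a rough stand-in, here is a sketch of one step of a single-unit LSTM in plain Python. It uses the commonly taught variant with a forget gate, which is a later addition to the 1997 formulation; the weight values are arbitrary and untrained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a single-unit LSTM.

    weights maps a gate name to a (w_x, w_h, bias) tuple.
    """
    def gate(name, activation):
        w_x, w_h, bias = weights[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Arbitrary illustrative weights (an assumption, not trained values).
WEIGHTS = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.8, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (1.2, 0.6, 0.0),
}

h, c = 0.0, 0.0
for x in [0.2, 0.9, 0.4]:
    h, c = lstm_step(x, h, c, WEIGHTS)
    print(round(h, 4), round(c, 4))
```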

Page 7:

9

RNNs as an encoder

The internal LSTM states change from step to step. The full sequence influences the final vector used for the prediction task.
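
To make the "full sequence influences the final vector" point concrete, here is a toy sketch. It uses a plain tanh RNN rather than an LSTM to stay short, with fixed arbitrary weights and a crude per-character feature; all of these are assumptions for illustration only.

```python
import math


def encode(sequence, hidden_size=3):
    """Run a tiny untrained tanh RNN over a string; return the final hidden state."""
    h = [0.0] * hidden_size
    for ch in sequence:
        x = (ord(ch) % 17) / 17.0  # crude scalar feature per character (illustrative)
        # Every new state mixes the current input with the previous state.
        h = [
            math.tanh(0.9 * x * (j + 1) + 0.5 * h[j] - 0.3 * h[(j + 1) % hidden_size])
            for j in range(hidden_size)
        ]
    return h


# Same characters, different order: the final vectors differ.
print(encode("CCO"))
print(encode("OCC"))
```

Because each step feeds the previous state back in, the order of the characters matters, not just which characters are present.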

Page 8:

• Canonical SMILES ensures a 1:1 relationship between molecule and SMILES

• I go the other way and generate multiple SMILES for the same molecule

• Works as data augmentation

10

Enumeration of non-canonical SMILES

Live DEMO of generation

Bjerrum, Esben Jannik. 2017. “SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules.” http://arxiv.org/abs/1703.07076

Page 9:

11

Random SMILES in practice

Note: over 11000 SMILES generated

Page 10:

12

RNNs as generators

REINVENT: Olivecrona et al. “Molecular de-novo design through deep reinforcement learning.” J. Cheminf. 2017

Page 11:

13

SMILES Based Autoencoders

Example: Gómez-Bombarelli, Rafael et al. 2018. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules.” ACS Central Science 4(2): 268–76.

Page 12:

14

Projecting non-canonical SMILES in latent space

Conv2RNN RNN2RNN

Page 13:

15

A SMILES string is not a Molecule

Molecule

c1ccccc1

Page 14:

16

HeteroEncoders

Also possible with InChIs and CAS names: Winter et al. 2018

From chemical images to SMILES: Bjerrum & Sattarov 2018
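
A sketch of how the four training regimes named later in the talk (Can2Can, Enum2Can, Can2Enum, Enum2Enum) could pair encoder input with decoder target. The SMILES lists are hand-written equivalents, and treating the first entry as the canonical form is an assumption of this sketch; in practice a cheminformatics toolkit would generate both.

```python
import random

# Hand-written equivalent SMILES per molecule; first entry stands in for canonical.
ENUMERATED = {
    "toluene": ["Cc1ccccc1", "c1ccccc1C", "c1cc(C)ccc1"],
    "ethanol": ["CCO", "OCC", "C(O)C"],
}


def make_pair(forms, mode, rng):
    """Return (encoder input, decoder target) for one molecule."""
    canonical = forms[0]
    source_kind, target_kind = mode.split("2")
    source = canonical if source_kind == "Can" else rng.choice(forms)
    if target_kind == "Can":
        target = canonical
    else:
        # Pick a target string that differs from the source string.
        target = rng.choice([s for s in forms if s != source])
    return source, target


rng = random.Random(0)
pairs = {
    mode: [make_pair(forms, mode, rng) for forms in ENUMERATED.values()]
    for mode in ("Can2Can", "Enum2Can", "Can2Enum", "Enum2Enum")
}
for mode, examples in pairs.items():
    print(mode, examples)
```

Only Can2Can is a plain autoencoder here; the other three ask the decoder to write a different string for the same molecule.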

Page 15:

17

Projecting non-canonical SMILES in the latent space

Page 16:

18

Latent space molecular similarities

SMILES sequence similarity

Morgan FP Tanimoto similarity

Model       R2 (FP metric)   R2 (Seq metric)
Can2Can     0.24             0.58
Enum2Can    0.37             0.53
Can2Enum    0.58             0.55
Enum2Enum   0.49             0.40
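
For readers unfamiliar with the two quantities in this table, a small sketch follows. It assumes fingerprints are given as sets of on-bit indices and that R2 means the squared Pearson correlation between two lists of values; the slide does not spell out how R2 was computed, so that part is a guess. The bit sets are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Made-up on-bit sets for two molecules: 2 shared bits out of 6 distinct bits.
print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))
```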

Page 17:

20

Non deterministic sampling of molecules

                                Can2Can   Enum2Enum (2l)
Unique SMILES                   1         111
% Correct MOL                   100       57
Unique SMILES for Correct MOL   1         42
Unique Molecules                1         17
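
The contrast in this table can be illustrated with a single decoding step. The sketch below compares always taking the most probable character with drawing the character from the predicted distribution; the vocabulary and logits are made up, and a real decoder would repeat this step for every position.

```python
import math
import random

TOKENS = ["C", "c", "O", "N"]      # hypothetical tiny vocabulary
LOGITS = [2.0, 1.5, 0.5, 0.1]      # made-up decoder output for one step


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(probs):
    """Deterministic: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def pick_sampled(probs, rng):
    """Non-deterministic: draw an index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax(LOGITS)
rng = random.Random(0)
greedy = {TOKENS[pick_greedy(probs)] for _ in range(200)}
sampled = {TOKENS[pick_sampled(probs, rng)] for _ in range(200)}
print("greedy picks: ", sorted(greedy))
print("sampled picks:", sorted(sampled))
```

Greedy decoding can only ever return one string per latent point, whereas sampling can return many, including strings that decode to a different molecule.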

Page 18:

21

1:Many sampling from latent space point

Page 19:

22

Latent vectors as a base for QSAR models

Figure adapted from: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.

Page 20:

23

Latent Space gets more relevant for QSAR modelling tasks

RMSEP of 5 datasets modelled using deep neural networks

Model       IGC50   LD50   BCF    Solubility   MP   Norm Mean
Enum2Enum   0.43    0.54   0.71   0.65         37   0.75
Can2Enum    0.46    0.54   0.69   0.69         37   0.77
Enum2Can    0.46    0.57   0.71   0.66         38   0.78
Can2Can     0.53    0.62   0.79   0.87         43   0.89
ECFP4       0.62    0.59   0.94   1.21         43   1.00

(Figure annotation beside the table: improvement)

ECFP4 performance low when compared to literature, Enum2Enum close

Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.
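
The slide does not define the "Norm Mean" column. One reading that is consistent with most of the reported values is: divide each RMSEP by the ECFP4 RMSEP for the same dataset, then average over the five datasets. The sketch below checks that reading against the table; it is an inference, not the authors' stated procedure.

```python
# RMSEP values copied from the table: IGC50, LD50, BCF, Solubility, MP.
RMSEP = {
    "Enum2Enum": [0.43, 0.54, 0.71, 0.65, 37],
    "Can2Enum": [0.46, 0.54, 0.69, 0.69, 37],
    "Enum2Can": [0.46, 0.57, 0.71, 0.66, 38],
    "Can2Can": [0.53, 0.62, 0.79, 0.87, 43],
    "ECFP4": [0.62, 0.59, 0.94, 1.21, 43],
}


def normalised_mean(model, baseline="ECFP4"):
    """Mean over datasets of RMSEP(model) / RMSEP(baseline)."""
    ratios = [value / base for value, base in zip(RMSEP[model], RMSEP[baseline])]
    return sum(ratios) / len(ratios)


for model in RMSEP:
    print(f"{model:10s} {normalised_mean(model):.2f}")
```

Under this reading Enum2Enum, Enum2Can, Can2Can and ECFP4 reproduce the reported 0.75, 0.78, 0.89 and 1.00. Can2Enum comes out at 0.76 rather than 0.77, which may simply reflect rounding of the tabulated inputs.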

Page 21:

25

Trends in searching for Hyperparameters

Larger is better for the decoder (except for the linear QSAR model)

Page 22:

26

Final Architecture

Validity: 99%

Reconstruction: 76 (80) %

Page 23:

28

QSAR performance using Linear regression model

Datasets from ExCAPE-DB

Page 24:

29

QSAR performance using SVM models

Latent space is non-linear, works with non-linear models (SVM, NN), but not so well with linear (MLR)
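
A toy illustration of why a linear model can struggle where a non-linear one does not. This is not the speaker's data or models: the single feature, the target y = x^2 and the k-nearest-neighbour regressor (standing in for SVM or NN) are all assumptions chosen to keep the sketch dependency-free.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def knn_predict(x, xs, ys, k=2):
    """Average the targets of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k


def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


train_x = [i / 4 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25
train_y = [x * x for x in train_x]       # non-linear toy target
test_x = [-1.9, -0.6, 0.1, 1.1, 1.7]
test_y = [x * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)
linear_rmse = rmse([slope * x + intercept for x in test_x], test_y)
knn_rmse = rmse([knn_predict(x, train_x, train_y) for x in test_x], test_y)
print(f"linear RMSE: {linear_rmse:.3f}  kNN RMSE: {knn_rmse:.3f}")
```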

Page 25:

30

Optimization of molecular properties

We currently use REINVENT

Decoder based:

Page 26:

31

Conclusions

• SMILES based autoencoders can be improved by training on non-canonical to different non-canonical SMILES (Heteroencoders)

• Binomial sampling makes it possible for the RNN to solve the 1:Many task on the output level

• But reconstruction of molecules is non-deterministic

• Latent space is more relevant to QSAR tasks compared to autoencoders

• Nearly on par with ECFP4

• Latent space is non-linear and works best with non-linear ML models

• Can be used to optimize molecules

• Benefits over current approaches at AZ (REINVENT) still not resolved

Page 27:

32

Toolkits – Source code - Links

Molvecgen: github.com/Ebjerrum/molvegen

Blogposts: www.wildcardconsulting.com

Deep Drug Coder (to be released)

https://github.com/EBjerrum/Deep-Drug-Coder.git

Page 28:

33

Acknowledgements

De Novo Design group

Ola Engkvist, Associate Director, Molecular AI

Christian Tyrchan, Team Leader - Computational Chemistry

Atanas Patronov, Associate Principal Scientist, Molecular AI

Michael Withnall, Ph.D student, Molecular AI

Rocio Mercado, Post.doc. Molecular AI

Jiazhen He, post.doc. Molecular AI

Josep Arus Pous, Ph.D student, BIGCHEM

Dhanushka Weerakoon, Graduate Scientist, IMED Graduate Programme

Simon Johansson, Master student

Oleksii Prydkhodko, Master student

Panagiotis-Christos Kotsias, Graduate Scientist, IMED Graduate Programme

Boris Sattarov, Informatics Programmer, Science Data Software LLC (Ext.)

Hongming Chen, Principal Scientist, Molecular AI

Page 29:

Thank you for listening

34

Page 30:

Confidentiality Notice

This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com

35

Figure on Slides 15 - 21 from Open Access: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.

Note: Approved by publication sign off 2019-06-19