Interactive tools for improved prediction of the effect of non-synonymous single nucleotide mutations on the protein Gilad Wainreb Department of Biochemistry and Molecular Biology Under the supervision of Prof. Nir Ben-Tal


Post on 13-Jan-2016


Page 1

Interactive tools for improved prediction of the effect of non-synonymous single nucleotide mutations on the protein

Gilad Wainreb Department of Biochemistry and Molecular Biology

Under the supervision of Prof. Nir Ben-Tal

Page 2

Mutations and diseases

• The human population contains approximately 10 million single nucleotide polymorphism (SNP) sites [1].

• A large portion of the known genetic diseases is associated with non-synonymous SNPs (nsSNPs) [2].

[1] Sachidanandam et al. (2001) Nature, 409, 928–933.
[2] Stenson, P.D. et al. (2008) J. Med. Genet., 45, 124–126.

Page 3

nsSNPs analysis

[Diagram] R683G, R683S, R683K in the Janus kinase (JAK2) → Does it alter the protein’s function? → How does the mutation affect the protein’s function? → Phenotypic expression: myeloproliferative disorders. The remaining links are marked with question marks.

Cell culture: constitutive phosphorylation

Bercovich, D., Ganmore, I., Scott, L.M., Wainreb, G., Birger, Y., Elimelech, A., Shochat, C., Cazzaniga, G., Biondi, A., Basso, G. et al. (2008) Mutations of JAK2 in acute lymphoblastic leukaemias associated with Down's syndrome. Lancet, 372, 1484-1492.

Page 4

JAK2 R683G, R683S, R683K SNPs analysis

A homology model of the JAK2 pseudokinase domain colored according to the ConSurf [1] color scheme

[1] Ashkenazy et al. (2010) NAR, web-server issue.

Page 5

nsSNPs analysis

[Diagram] R683G, R683S, R683K in the Janus kinase (JAK2) → Does it alter the protein’s function? → Why does the mutation affect the protein’s function? → Phenotypic expression: myeloproliferative disorders. The remaining links are marked with question marks.

Mutagenesis studies: deleterious

Hindering the binding site

Bercovich, D., Ganmore, I., Scott, L.M., Wainreb, G., Birger, Y., Elimelech, A., Shochat, C., Cazzaniga, G., Biondi, A., Basso, G. et al. (2008) Mutations of JAK2 in acute lymphoblastic leukaemias associated with Down's syndrome. Lancet, 372, 1484-1492.

Page 6

Distribution of the Effects of Missense SNPs on Protein Molecular Function

Wang and Moult (2001) Human Mutation, 17, 263-270.

Page 7

Non-synonymous SNPs analysis

[Diagram] Changes in protein stability (Protein Mutant stAbilitY Analyzer) → Deleterious/neutral → Disease. The links are marked with question marks.

Page 8

Protein Mutant stAbilitY Analyzer

Wainreb et al. (2011) Bioinformatics 27, 3286

Page 9

What is protein stability?
• Protein stability is defined as the difference in Gibbs free energy (ΔG) between the unfolded and folded states of the protein.
• Change in protein stability (ΔΔG): $\Delta\Delta G = \Delta G_{\mathrm{mutant}} - \Delta G_{\mathrm{WT}}$

Why is protein stability important?
• Stability of the native conformation is important for proper function.
• 83% of the deleterious SNPs involve changes in protein stability [1].

[1] Wang and Moult (2001) Human Mutation, 17, 263-270.
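
As a small worked illustration of the ΔΔG definition on this slide, the following Python sketch subtracts the wild-type folding free energy from the mutant one. The function name `ddg` and the two numeric values are made up for illustration and do not come from the talk.

```python
def ddg(dg_mutant: float, dg_wt: float) -> float:
    """Return ddG = dG_mutant - dG_WT (kcal/mol), following the slide's formula."""
    return dg_mutant - dg_wt


# Hypothetical folding free energies in kcal/mol (illustrative values only).
dg_wt = 5.0
dg_mutant = 3.5
print(f"ddG = {ddg(dg_mutant, dg_wt):+.2f} kcal/mol")
```

The sign convention simply follows the formula as written on the slide; which sign counts as destabilizing depends on how ΔG itself is defined.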

Page 10

Computational prediction methods

Level of description: atom-based, amino acid-based

• Physical effective potentials [1,2]: molecular dynamics
• Statistical effective potentials [3,4]: observed frequencies potentials
• Empirical effective potentials: machine learning (mostly)

Methods: FoldX [5], Prethermut [6], PoPMuSiC-2.0 [7], I-Mutant2.0 [8]

Terms and features: Van der Waals interactions, torsion angle, electrostatic terms; evolutionary, physicochemical or sequence-based features

Energy terms for mutations with a known ΔΔG (VDW: Van der Waals, TA: torsion angle, E: electrostatic, EC: entropic cost):

#   VDW   TA     E     ...   EC     Known ΔΔG
1   0.8   30     0.4   ...   0.2    1.02
2   1     -0.7   0.9   ...   0.4    -0.3
3   0.6   0.5    0.2   ...   -0.9   0.8

$\Delta\Delta G = w_1 \cdot VDW + w_2 \cdot TA + w_3 \cdot E + \dots + w_i \cdot EC$

$\Delta\Delta G = VDW + TA + E + \dots + EC$

[1] Prevost et al. (1991) PNAS, 88, 10880-10884.
[2] Seeliger et al. (2010) Biophysical Journal, 98, 2309-2316.
[3] Bahar and Jernigan (1997) JMB, 266, 195-214.
[4] Samudrala and Moult (1998) JMB, 275, 895-916.
[5] Guerois et al. (2002) JMB, 320, 369-387.
[6] Tian et al. (2010) BMC Bioinformatics, 11, 370.
[7] Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
[8] Capriotti et al. (2005) NAR, 33, W306-310.
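
The two formulas on this slide differ only in the weights: an empirical potential scales each energy term before summing, while the plain sum uses a weight of 1 for every term. A minimal Python sketch of that idea follows. The weights are hypothetical, and the term values are only loosely based on the slide's table, whose column order is itself an assumption.

```python
# Energy terms named on the slide; only four of them are spelled out there.
TERMS = ("VDW", "TA", "E", "EC")


def weighted_ddg(weights: dict, terms: dict) -> float:
    """ddG = w1*VDW + w2*TA + w3*E + ... + wi*EC."""
    return sum(weights[name] * terms[name] for name in TERMS)


# Hypothetical term values for one mutation (illustration only).
mutation = {"VDW": 0.8, "TA": 30.0, "E": 0.4, "EC": 0.2}

# Hypothetical fitted weights; in practice they would be learned from
# mutations with a known ddG.
weights = {"VDW": 1.0, "TA": 0.02, "E": 0.5, "EC": 0.3}

# With all weights equal to 1 the formula reduces to the unweighted sum.
unit_weights = {name: 1.0 for name in TERMS}

print(weighted_ddg(weights, mutation))
print(weighted_ddg(unit_weights, mutation))
```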

Page 11

Preliminary study (goal)

• The ΔΔG values of mutations occurring at the same position tend to cluster.

• Prior knowledge of the ΔΔG values of other mutations at the query position might help in predicting the query’s ΔΔG.

Page 12

Machine learning

• Learn from experience
• Correlate between description <--> outcome

Page 13

The prediction scheme

Dataset: mutations with a measured ΔΔG
• PoPMuSiC-DB [1]: 2646 mutations in 137 proteins
• Potapov-DB [2]: 2155 mutations in 79 proteins

Features: describe the substitutions

Structural-based
• Solvent accessibility.
• Prethermut.
• PoPMuSiC-2.0.

Sequence-based
• The wild-type and mutant AAs.
• Physicochemical deviation.

Machine learning: find relations between the features and the observed ΔΔG
• Random Forests [3] and collaborative filtering based prediction

[1] Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
[2] Potapov et al. (2009) Protein Eng Des Sel, 22, 553-560.
[3] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. Kluwer Academic Publishers.

Page 14

[Flowchart with panels A, B and C] No known ΔΔG records at the query position (traditional scheme):

Calculate the query’s features → Predict the ΔΔG using the features and a pre-calculated Random Forests model → Prediction results

Page 15

Machine learning algorithm: Random Forests [1]

[Diagram] A decision tree splits the substitutions on descriptors (for example "Buried?", "WT = P", "WT = A") down to a leaf value such as ΔΔG = 0.01. The forest contains 700 such trees (× 700).

[1] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. Kluwer Academic Publishers.
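
To make the decision-tree picture concrete, here is a toy Python sketch of how a forest turns per-tree outputs into one ΔΔG estimate by averaging. The two trees are written by hand with invented leaf values, whereas a real Random Forests model learns each tree from a bootstrap sample of the training data; the descriptor names `buried` and `wt` are assumptions taken from the labels on the slide.

```python
from statistics import mean


def tree_a(m: dict) -> float:
    # Hypothetical tree: split on burial first, then on the wild-type residue.
    if m["buried"]:
        return -1.2 if m["wt"] == "A" else -2.0
    return 0.01 if m["wt"] == "P" else -0.3


def tree_b(m: dict) -> float:
    # A second hypothetical tree that asks the questions in another order.
    if m["wt"] == "P":
        return 0.1
    return -1.5 if m["buried"] else -0.2


def forest_predict(trees, m: dict) -> float:
    """Regression forests average the leaf values returned by all trees."""
    return mean(tree(m) for tree in trees)


query = {"buried": True, "wt": "A"}
print(forest_predict([tree_a, tree_b], query))
```

The slide's forest uses 700 trees; two are enough here to show the averaging step.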

Page 16

[Flowchart with panels A, B and C] No known ΔΔG records at the query position (traditional scheme):

Calculate the query’s descriptors → Predict the ΔΔG using the descriptors and a pre-calculated Random Forests model → Prediction results

Page 17

Query mutation with other known mutations at the query position

Input: ΔΔG values of the “known” mutations

Random Forests route:
1. Calculate descriptors for the mutations with a known ΔΔG and for the query mutation
2. Add the “known” mutations to the training set
3. Rebuild the Random Forests model
4. Predict the query’s ΔΔG using the new model

Alternative route: collaborative filtering and content-based algorithm

Output: ΔΔG prediction

Page 18

Collaborative-filtering

Page 19

Bellkor [1] collaborative-filtering algorithm

First we represent the mutation data as a sparse matrix.

Movie analogy (rows: users, columns: movies):

User   Star Wars   E.T.   Jaws   Top Gun   ConAir   Pulp Fiction   Troy   Wall-E
Jane   4           1      NA     NA        NA       5              3      NA
...
Tom    4           NA     1      NA        NA       NA             NA     NA
John   2           NA     NA     NA        3        NA             NA     NA

Mutation data (rows: possible mutation outcomes, MU; columns: protein positions):

MU   1WQ549   1LZ12   4LYZ91   1EY041   5DFR91   1EY071   1SAK342   1HMS64
A    -0.41    1.51    NA       NA       NA       0.55     0.38      NA
...
W    -2.43    NA      NA       NA       1.34     NA       NA        NA
Y    NA       NA      3.07     NA       NA       NA       NA        NA

The purpose of the algorithm is to predict the missing elements in the matrix by creating a model according to the available data.

[1] Koren, Y. (2008) Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), pp. 426-434.
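
A sparse matrix of this kind is conveniently stored as a mapping from (mutation, position) to the measured ΔΔG, so that missing cells (NA) simply have no entry. The Python sketch below uses the values shown on this slide; the names `known_ddg`, `lookup` and `known_at_position` are hypothetical helpers, not part of Pro-Maya.

```python
# Keys are (mutant amino acid, position label); absent keys play the role of NA.
known_ddg = {
    ("A", "1WQ549"): -0.41,
    ("A", "1LZ12"): 1.51,
    ("A", "1EY071"): 0.55,
    ("A", "1SAK342"): 0.38,
    ("W", "1WQ549"): -2.43,
    ("W", "5DFR91"): 1.34,
    ("Y", "4LYZ91"): 3.07,
}


def lookup(mu: str, position: str):
    """Return the known ddG, or None when the cell is missing (NA)."""
    return known_ddg.get((mu, position))


def known_at_position(position: str) -> dict:
    """All known records at one position, keyed by mutant amino acid."""
    return {mu: v for (mu, pos), v in known_ddg.items() if pos == position}


print(lookup("Y", "4LYZ91"))
print(known_at_position("1WQ549"))
```

The cells that return `None` are exactly the ones the collaborative-filtering model is asked to predict.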

Page 20

Bellkor collaborative-filtering model

The Bellkor algorithm takes into account only the ΔΔG table. Its components:
• Baseline estimators model
• The neighborhood model
• The latent factor model

Baseline estimators model

Different positions and amino acids have different ΔΔG tendencies. Predict the baseline estimators for each amino acid ($b_u$) and position ($b_i$):

$R_{ui} = b_{ui} + \dots$

$b_{ui} = b_i + b_u$

MU    1WQ549   1LZ12   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64   b_u
A     -0.41    1.51    NA       NA       0.55     0.38      NA       .
...
W     -2.43    NA      NA       1.34     NA       NA        NA       .
Y     NA       NA      3.07     NA       NA       2.3       NA       .
b_i   .        .       .        .        .        .         .

Page 21

Bellkor collaborative-filtering model

Baseline estimators model: $R_{ui} = b_{ui} + \dots$

Up to now no “biology” was introduced into our model, only ΔΔG data.

Content-based model

Use a linear regression with a subset of the features to describe the mutation: $X_{ui} F$

1) Pro-Maya (RF)
2) Prethermut
3) PoPMuSiC-2.0
4) Solvent accessibility

Pro-Maya algorithm

For example: $R_{ui} = b_{ui} + F_1 \cdot \text{solvent accessibility} + F_2 \cdot \text{Prethermut} + \dots$
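
The example formula on this slide can be evaluated directly once the baselines and the regression coefficients are known. The Python sketch below assumes $b_{ui} = b_i + b_u$ as on the previous slide; the function name `content_based_prediction` and every number in it are invented for illustration.

```python
def content_based_prediction(b_i: float, b_u: float, features, coeffs) -> float:
    """R_ui = b_ui + X_ui . F, with b_ui = b_i + b_u."""
    if len(features) != len(coeffs):
        raise ValueError("one coefficient is needed per feature")
    return b_i + b_u + sum(x * f for x, f in zip(features, coeffs))


# Hypothetical inputs: baselines for the position and the amino acid, then
# two features (solvent accessibility, Prethermut score) with coefficients.
b_i, b_u = -0.5, 0.2
x_ui = [0.25, -1.0]
F = [0.4, 0.5]
print(content_based_prediction(b_i, b_u, x_ui, F))
```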

Page 22

Collaborative-Filtering

Training: a matrix representation of the known ΔΔGs → generate a model to relate the ΔΔG of mutations (stochastic gradient descent) → calibrated model

Prediction: query mutation (at a position that is present in the training set) → calibrated model → predicted ΔΔG

Page 23

Stochastic gradient descent

We want to find the best model, i.e. the best set of $b_i$, $b_u$ and $F$ for which $|R_{ui} - r_{ui}| < \varepsilon$, where

$R_{ui} = b_{ui} + X_{ui} F$

1. Create random values for $b_i$, $b_u$ and $F$.
2. Iterate over all the known values of the table.
3. For each value, measure the error between the model’s prediction $R_{ui}$ and the known value $r_{ui}$.
4. Move down the slope and go to step 2 until $|R_{ui} - r_{ui}| < \varepsilon$.
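
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Pro-Maya implementation: it uses plain squared-error updates with no regularization, and the learning rate, the stopping threshold, the function names and the four toy records are all hypothetical.

```python
import random


def predict(b_i, b_u, F, mu, pos, x):
    """R_ui = b_i + b_u + X_ui . F for amino acid `mu` at position `pos`."""
    return b_i[pos] + b_u[mu] + sum(xk * fk for xk, fk in zip(x, F))


def train_sgd(records, n_features, lr=0.05, max_epochs=2000, eps=1e-3, seed=0):
    """records: list of (mu, pos, feature_vector, known_ddg) tuples."""
    rng = random.Random(seed)
    # Step 1: random initial values for b_i, b_u and F.
    b_i = {pos: rng.uniform(-0.1, 0.1) for _, pos, _, _ in records}
    b_u = {mu: rng.uniform(-0.1, 0.1) for mu, _, _, _ in records}
    F = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(max_epochs):
        worst = 0.0
        # Step 2: iterate over all known values of the table.
        for mu, pos, x, r in records:
            # Step 3: error between the model's prediction and the known value.
            err = predict(b_i, b_u, F, mu, pos, x) - r
            worst = max(worst, abs(err))
            # Step 4: move each parameter down the gradient of the squared error.
            b_i[pos] -= lr * err
            b_u[mu] -= lr * err
            for k in range(n_features):
                F[k] -= lr * err * x[k]
        if worst < eps:  # stop once every error seen in this pass is below eps
            break
    return b_i, b_u, F


# Toy table: two amino acids, two positions, one feature per mutation.
records = [
    ("A", "p1", [0.0], 0.5),
    ("A", "p2", [1.0], -1.0),
    ("W", "p1", [1.0], -2.0),
    ("W", "p2", [0.0], 1.0),
]
b_i, b_u, F = train_sgd(records, n_features=1)
for mu, pos, x, r in records:
    print(mu, pos, round(predict(b_i, b_u, F, mu, pos, x), 3), r)
```

Updating after every single record, rather than after a full pass over the table, is what makes the descent "stochastic".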


Page 25

Results

Page 26

How do we test our prediction performance? Cross-validation

[Diagram] All substitutions are divided into a learning set and a test set. The test set is rotated so that each part serves as the test set in turn, and the results are reported on all substitutions.
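
The rotation of test sets in the diagram corresponds to k-fold cross-validation, which can be sketched in Python as follows. The number of folds is not stated on the slide, so `k`, the shuffling seed and the function name `k_fold_splits` are assumptions.

```python
import random


def k_fold_splits(items, k: int, seed: int = 0):
    """Yield (learning_set, test_set) pairs; each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        learn = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield learn, test


substitutions = list(range(10))  # stand-ins for substitution records
for learn, test in k_fold_splits(substitutions, k=5):
    print(sorted(test), len(learn))
```

Pooling the predictions made on every test fold gives the "results on all substitutions" that the slide refers to.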

Page 27

Cross-validation results

Performance of current methods on the Potapov-DB

Method             Pearson correlation coefficient
Rosetta            0.26
Hunter             0.45
I-Mutant2.0        0.54
FoldX              0.55
CC/PBSA            0.56
EGAD               0.59
PoPMuSiC-2.0       0.62
Combining method   0.64
Prethermut         0.72
Pro-Maya           0.77

Page 28

Cross-validation results

PCC: Pearson correlation coefficient

Query position           Number of mutations   Performance measure   Pro-Maya (Random Forests)   Pro-Maya (collaborative filtering)   Prethermut
No known ΔΔGs            910                   PCC                   0.65±0.02                   -                                    0.61±0.02
                                               RMSE (kcal/mol)       1.09                        -                                    1.14
One or more known ΔΔGs   1735                  PCC                   0.79±0.01                   0.82±0.01                            0.75±0.01
                                               RMSE (kcal/mol)       0.92                        0.86                                 0.99
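
Both performance measures in this table have standard definitions, sketched below in Python for reference. The predicted and observed values are invented; only the two formulas are standard.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rmse(predicted, observed) -> float:
    """Root mean square error, in the units of the inputs (here kcal/mol)."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical predicted and measured ddG values (kcal/mol).
predicted = [-1.0, 0.2, 1.5, -2.1]
observed = [-1.4, 0.0, 1.1, -2.5]
print(round(pearson(predicted, observed), 3), round(rmse(predicted, observed), 3))
```

PCC rewards getting the ranking and trend right, while RMSE penalizes absolute errors, which is why the table reports both.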

Page 29

Validation set results

Performance of current methods on the validation set

Method         Pearson correlation coefficient
I-mutant-2.0   0.29
Eris           0.35
CUPSAT         0.37
FoldX          0.4
Automute       0.46
Dmutant        0.48
PoPMuSiC-2.0   0.69
Prethermut     0.72
Pro-Maya       0.79

Page 30

Pro-Maya also performs well in its sequence-based prediction scheme

[Bar chart: Pearson correlation coefficient (axis 0.5 to 0.85) for Pro-Maya, Pro-Maya sequence only, and Prethermut.]

Page 31

How does the number of mutations with known ΔΔG values at the query position affect the prediction accuracy?

[Bar chart: Pearson correlation coefficient (axis 0.55 to 0.85) against the number of known ΔΔG values per query position, with the number of mutations in parentheses: 0 (910), 1 (524), 2 (327), >2 (690).]

• The prediction accuracy improves as the number of known records at the query position increases.
• One or two known ΔΔGs are sufficient to improve the prediction accuracy significantly (alanine scanning).

Page 32

Conclusions
• One or two known records at the query position improve the prediction.
• The improvement is independent of the amino acid identity of the known records and of the sequence identity of the query protein to the training set.
• Availability: bental.tau.ac.il/ProMaya

Page 33

Non-synonymous SNPs analysis

[Diagram] Changes in protein stability (Protein Mutant stAbilitY Analyzer) → Deleterious/neutral → Disease. The links are marked with question marks.

Page 34

Prediction of deleterious SNPs

Wainreb G, Ashkenazy H, Bromberg Y, Starovolsky-Shitrit A, Haliloglu T, Ruppin E, Avraham KB, Rost B, Ben-Tal N. MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res. 2010 Jul 1;38 Suppl:W523-8.

Page 35

Previous work

Sequence-based:

1) Agreement with the profiles of AA residues in the alignment (SIFT) [1].

2) Physicochemical-based features (MAPP) [2].

3) Sequence-based prediction of structural features (SNAP etc.) (NN) [3,4].

Sequence- and structure-based:

1) Observed solvent accessibility (Tree classifier) [5].

2) Distance to the ligand [6].

3) Micro-environment description (SVM) [7].

4) SWISS-PROT annotations (PolyPhen) [8].

[1] Ng et al. (2001) Gen. Res., 11.
[2] Stone et al. (2005) Gen. Res., 15.
[3] Bromberg et al. (2007) NAR, 35.
[4] Ferrer-Costa et al. (2004) Proteins, 57.
[5] Saunders et al. (2002) JMB, 322.
[6] Chasman et al. (2001) JMB, 307.
[7] Capriotti et al. (2005) Bioinfo, 21.
[8] Ramensky et al. (2002) NAR, 30.

Page 36

Prediction of deleterious SNPs

A harder problem

[Diagram] The following lead, through links marked with question marks, to Deleterious/neutral → Disease:
? Ligand binding site
? Catalytic site
? Protein-protein interface site
? Stop codon
? Changes in protein stability

• Missing data
• Dirty datasets

Page 37

Prediction of deleterious mutations

Dataset

Substitution number   1      2      ...   3000   3001
Substitution          A54V   H89Y   ...   N30A   K90F
Observed ΔΔG          1      -1     ...   1      1

Features: describe the substitutions

Machine learning: find a pattern that distinguishes deleterious versus neutral substitutions

Output: Deleterious/Neutral

Page 38

Dealing with the problem of noisy data

• Known mutations in proteins with a solved crystal structure
• Experimentally validated
• Evolutionary model substitutions [3]
• Protein Mutant Database [2]
• Swiss-Prot [1] variants

[1] Bairoch et al. (2005) NAR, 33.
[2] Kawabata et al. (1999) NAR, 27.
[3] Bromberg et al. (2007) NAR, 35.


Page 40

Dealing with the problem of missing data

Structural and sequence based descriptors

• Sequence identity with nearest homolog bearing the substitution
• Swiss-Prot annotations
• Apo-Holo
• Substitution matrix distance
• SIFT analysis
• Secondary structure assignment
• Neighborhood composition in homologs
• Evolutionary conservation (ConSurf [1])
• Physicochemical deviation
• Cα B-factor
• Solvent accessibility
• Number of sequences in the alignment
• Distance from the ligand and binding site conservation
• SNAP’s prediction

Describes the importance of the query position

[1] Ashkenazy et al. (2010) NAR, web-server issue.

Page 41

Prediction of deleterious mutations

Dataset

Substitution number   1      2      ...   3000   3001
Substitution          A54V   H89Y   ...   N30A   K90F
Observed ΔΔG          1      -1     ...   1      1

Features: describe the substitutions

Machine learning: Random Forests

Output: Deleterious/Neutral

Page 42

Results

Page 43

Cross-validation results

[Bar chart: Cross-validation results of MuD and current methods. Matthews correlation coefficient (axis 0.35 to 0.51) for PolyPhen, SIFT, SNAP and MuD.]

[Bar chart: Cross-validation results of MuD and current methods analyzed according to the SCOP class of the query protein. Matthews correlation coefficient (axis 0.05 to 0.65) for MuD, SNAP, SIFT and PolyPhen. SCOP classes: all alpha proteins (18%), all beta proteins (16%), alpha and beta proteins (a/b) (27%), alpha and beta proteins (a+b) (22%), multi-domain proteins (alpha and beta) (7%), membrane and cell surface proteins and peptides (4%), small proteins (3%).]
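
The Matthews correlation coefficient used in these charts is computed from the four cells of a confusion matrix. A short Python sketch of the standard formula follows; the counts are invented, and returning 0 when the denominator vanishes is a common convention rather than something stated in the talk.

```python
from math import sqrt


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: deleterious substitutions are the positive class.
print(round(mcc(tp=80, tn=60, fp=20, fn=40), 3))
```

Unlike plain accuracy, the coefficient stays near 0 for a predictor that always outputs the majority class, which matters when deleterious and neutral substitutions are unevenly represented.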

Page 44

Importance of interactivity

Test set

Method     Lac repressor   T4 lysozyme   HIV protease
PolyPhen   0.54            0.36          NA
SIFT       0.40            0.34          0.51
SNAP       0.41            0.37          0.30
MuD        0.61            0.27          0.45
MuD SA     0.64            0.45          0.54

• Predicted oligomerization state, predicted naturally occurring ligands
• Known oligomerization state, known naturally occurring ligands

Page 45

[Flowchart]

Input, one of:
• Protein sequence → PDB structure template selection
• PDB ID or user uploaded coordinate file

Steps:
• Accept query mutations
• Oligomerization state selection and removal of irrelevant chains
• Filtering of biologically irrelevant ligands
• Selection of interesting residues
• Results page

Page 46

Conclusions
• Development of interactive tools that can incorporate user-reported data into the prediction and thereby improve it.
• Biological data is increasing rapidly, allowing us to further improve the accuracy (annotations, protein structures).

Page 47

Thanks
• Prof. Nir Ben-Tal
• Dr. Guy Nimrod
• Haim Ashkenazy

Protein Mutant stAbilitY Analyzer
• Prof. Lior Wolf
• Dr. Yves Dehouck

• Dr. Yana Bromberg
• Alina Starovolsky-Shitrit
• Prof. Turkan Haliloglu
• Prof. Eytan Ruppin
• Prof. Karen B. Avraham
• Prof. Burkhard Rost

To all my friends in the lab:
• Dr. Yanay Ofran
• Maya Schushan
• Yana Gofman
• Daphna Meroz
• Inbar Fish
• Noam Chen
• Ofir Goldenberg
• Dr. Gal Almogy
• Ori Kalid
• Dr. Meytal Landau
• Dr. Sarel Fleishman
• Uri Ron
• Adva Suez
• Yariv Barkan
• Matan Kalman

To my wife Adi for putting up with all the Saturdays and nights.

Funding:
• Eurohear project.
• DIP program.

Just a reminder that not all mutations are bad

Page 48

Additional experimental data

Page 49

Page 50

Random Forests cross-validation results

Page 51

How does the amino acid type of mutations with known ΔΔG values at the query position affect the prediction accuracy?

Position 1WQ549:

MU    ΔΔG
A     NA
C     NA
D     ?
E     0.02
F     NA
G     1.2
H     NA
I     -0.4
...
Y     NA

Amino acid pair   Miyata physicochemical distance
A G               0.9
A E               3.98

Minimal Miyata physicochemical distance → accuracy?

RMSE   Minimal Miyata physicochemical distance
0.3    0.9
0.5    3.98
...
0.02   2.37

0.14

Miyata et al. (1979) Journal of Molecular Evolution, 12, 219-236.

Page 52

Page 53

How well Pro-Maya performs on query mutations in proteins that are not homologous to any of the proteins in the training set

Methods: PM*, ΔΔG RF, CFCB. LOO type: LOO unseen, LOO all.

Subset              Measure           Values
The whole dataset   PCC               0.71±0.01   0.74±0.01   0.75±0.01   0.77±0.01
                    RMSE (kcal/mol)   1.04        0.98        0.96        0.94
SRPM                PCC               0.60±0.02   0.64±0.02
                    RMSE (kcal/mol)   1.15        1.10
MRPM                PCC               0.76±0.01   0.79±0.01   0.83±0.01   0.82±0.01
                    RMSE (kcal/mol)   0.98        0.91        0.84        0.84

Page 54

Random Forests cross-validation results on the SRPM and MRPM subsets of PoPMuSiC-DB and Potapov-DB

SRPM: Single-Replacement Position Mutation
MRPM: Multi-Replacement Position Mutation

Page 55

validation set results

Page 56

Sequence-based Pro-Maya

Subset            Dataset       Performance measure   Pro-Maya sequence-based   Pro-Maya structure-based   Prethermut
All the dataset   Potapov-DB    PCC                   0.68±0.02                 0.77±0.01                  0.72±0.01
                                RMSE (kcal/mol)       1.27                      1.09                       1.20
                  PoPMuSiC-DB   PCC                   0.69±0.01                 0.77±0.01                  0.71±0.01
                                RMSE (kcal/mol)       1.06                      0.94                       1.05
SRPM              Potapov-DB    PCC                   0.44±0.03                 0.59±0.04                  0.57±0.03
                                RMSE (kcal/mol)       1.45                      1.28                       1.30
                  PoPMuSiC-DB   PCC                   0.55±0.02                 0.64±0.02                  0.61±0.02
                                RMSE (kcal/mol)       1.20                      1.11                       1.14
MRPM              Potapov-DB    PCC                   0.77±0.01                 0.83±0.01                  0.77±0.01
                                RMSE (kcal/mol)       1.15                      0.98                       1.14
                  PoPMuSiC-DB   PCC                   0.76±0.01                 0.82±0.01                  0.75±0.01
                                RMSE (kcal/mol)       0.97                      0.85                       0.99

Page 57

Relative importance of Pro-Maya's features

[Bar chart: mean decrease in mean square error (axis 0 to 1) for each feature: mutant and WT AA, isoelectric point deviation, number of sequences, protein flexibility, torsion angle potentials, molecular weight deviation, SIDCH, hydrophobicity deviation, solvent accessibility, PoPMuSiC-2.0, Prethermut.]

Page 58

Additional explanatory slides

Page 59

More models

Page 60

Bellkor collaborative-filtering model

The Bellkor algorithm takes into account only the ΔΔG table. Its components: baseline estimators model, the neighborhood model, the latent factor model.

MU   1WQ549   1LZ12   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64
A    -0.41    0.4     NA       NA       0.55     0.38      NA
...
W    -2.43    NA      NA       -1.34    NA       NA        NA
Y    4.5      NA      3.07     2.1      NA       2.3       NA

The neighborhood model

• Find k positions with similar ΔΔG values
• Learn the pairwise weights for the neighboring positions: $w_{ij}$, $c_{ij}$
• Relate the ΔΔG values of neighboring positions i and j

$R_{ui} = b_{ui} + |R^k(i;u)|^{-0.5} \sum_{j \in R^k(i;u)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$

Page 61

Bellkor collaborative-filtering model

The Bellkor algorithm takes into account only the ΔΔG table. Its components: baseline estimators model, the neighborhood model, the latent factor model.

The latent factor model

Break down the ΔΔG matrix into two matrices, $p_u^T$ and $q_i$ (matrix decomposition): the MU × Position matrix is approximated by the product of an MU × f matrix and an f × Position matrix.

MU   1WQ549   1LZ12   4LYZ91   1EY071   1SAK342
A    -0.41    0.4     NA       0.55     0.38
...
W    -2.43    NA      NA       NA       NA
Y    NA       NA      3.07     NA       2.3

$R_{ui} = b_{ui} + p_u^T q_i + |R^k(i;u)|^{-0.5} \sum_{j \in R^k(i;u)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$
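
The latent factor term adds the dot product of two short vectors, one per amino acid and one per position, to the baseline. The Python sketch below shows only that term plus the baseline, leaving out the neighborhood sum; the function name `latent_prediction`, the choice f = 2 and all numbers are hypothetical.

```python
def latent_prediction(b_ui: float, p_u, q_i) -> float:
    """R_ui = b_ui + p_u^T q_i, ignoring the neighborhood term."""
    if len(p_u) != len(q_i):
        raise ValueError("p_u and q_i must both have f entries")
    return b_ui + sum(p * q for p, q in zip(p_u, q_i))


# Hypothetical factors with f = 2 latent dimensions.
p_u = [1.0, 2.0]    # factor vector of the mutant amino acid (row of the MU x f matrix)
q_i = [0.5, -0.25]  # factor vector of the position (column of the f x Position matrix)
print(latent_prediction(-0.5, p_u, q_i))
```

In a trained model both vectors would be learned from the known ΔΔG entries, for instance with the stochastic gradient descent procedure described earlier in the talk.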

Page 62

Latent factors

Latent factor model example

[Diagram] The MU × Positions matrix (rows A, ..., W, Y; columns 1, 2, ..., m) is reduced to an MU × f matrix (rows A, ..., W, Y; columns 1, ..., f), where f = number of latent factors.

Clustering of similar positions into a representative position while taking into account only the ΔΔG values
