
Interactive tools for improved prediction of the effect of non-synonymous single nucleotide mutations on the protein

Gilad Wainreb Department of Biochemistry and Molecular Biology

Under the supervision of Prof. Nir Ben-Tal

Mutations and diseases

• The human population contains approximately 10 million single nucleotide polymorphism (SNP) sites1.

• A large portion of the known genetic diseases is associated with non-synonymous SNPs (nsSNP)2.

1 Sachidanandam et al. (2001) Nature, 409, 928–933.
2 Stenson, P.D. et al. (2008) J. Med. Genet., 45, 124–126.

Cell culture: constitutive phosphorylation

nsSNPs analysis

Bercovich, D., Ganmore, I., Scott, L.M., Wainreb, G., Birger, Y., Elimelech, A., Shochat, C., Cazzaniga, G., Biondi, A., Basso, G. et al. (2008) Mutations of JAK2 in acute lymphoblastic leukaemias associated with Down's syndrome. Lancet, 372, 1484-1492.

Does it alter the protein’s function?

Phenotypic expression

How does the mutation affect the protein’s function?

R683G, R683S, R683K in the Janus kinase (JAK2)

Myeloproliferative disorders

1Ashkenazy et al. (2010) NAR, web-server issue.

JAK2 R683G, R683S, R683K SNPs analysis

A homology model of the JAK2 pseudo kinase domain colored according to the ConSurf1 color scheme

Mutagenesis studies: Deleterious


Why does the mutation affect the protein’s function?


Hindering the binding site

Distribution of the Effects of Missense SNPs on Protein Molecular Function

Wang and Moult (2001) , Human Mutation 17:263-270.

Non-synonymous SNPs analysis

Deleterious/neutral

Disease

Protein Mutant stAbilitY Analyzer

Changes in protein stability

Wainreb et al. (2011) Bioinformatics 27, 3286

What is protein stability?

• Protein stability is defined as the difference in Gibbs free energy (ΔG) between the unfolded and folded states of the protein.

• ΔΔG = ΔGmutant - ΔGWT

Why is protein stability important?

• Stability of the native conformation is important for proper function.

• 83% of the deleterious SNPs involve changes in protein stability1.
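The ΔΔG definition above can be written as a one-line helper. This is only a sketch: the function name `ddg` and the two free-energy values are hypothetical and chosen for illustration, not taken from any dataset in this talk.

```python
def ddg(dg_mutant: float, dg_wt: float) -> float:
    """Return ΔΔG = ΔG_mutant - ΔG_WT, in the same units as the inputs."""
    return dg_mutant - dg_wt


# Hypothetical folding free energies (kcal/mol), for illustration only.
dg_wt = 8.0
dg_mutant = 6.5

change = ddg(dg_mutant, dg_wt)
print(f"ddG = {change:+.2f} kcal/mol")
```

The sign of the result depends on the sign convention chosen for ΔG, so the same convention has to be used for the wild type and the mutant.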

Change in protein stability (ΔΔG)

1Wang and Moult (2001). Human Mutation 17:263-270.

Atom-based

Amino acid-based

Computational prediction methods

1 Prevost et al. (1991) PNAS, 88, 10880-10884.
2 Seeliger et al. (2010) Biophysical Journal, 98, 2309-2316.
3 Bahar and Jernigan (1997) JMB, 266, 195-214.
4 Samudrala and Moult (1998) JMB, 275, 895-916.
5 Guerois et al. (2002) JMB, 320, 369-387.
6 Tian et al. (2010) BMC Bioinformatics, 11, 370.
7 Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
8 Capriotti et al. (2005) NAR, 33, W306-310.

Physical effective potentials1,2 Molecular dynamics

Statistical effective potentials3,4 Observed frequencies potentials

Empirical effective potentials Machine learning (mostly)

FoldX5

Prethermut6

PopMuSic-2.07

Evolutionary- physicochemical- or sequence-based features

I-Mutant2.08

Van der Waals interactions, torsion angle, electrostatic terms

#   Van der Waals (VDW)   Torsion angle (TA)   Electrostatic (E)   . .   Entropic cost (EC)   Known ΔΔG
1   0.8                   30                   0.4                 . .   0.2                  1.02
2   1                     -0.7                 0.9                 . .   0.4                  -0.3
3   0.6                   0.5                  0.2                 . .   -0.9                 0.8

ΔΔG = w1*VDW + w2*TA + w3*E + .. + wi*EC

ΔΔG = VDW + TA + E + .. + EC
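The two formulas above differ only in whether each energy term is scaled by a weight. A minimal sketch of both variants follows; the term values and the weights are hypothetical and serve only to show the arithmetic (in an empirical potential the weights would be fitted to mutations with a known ΔΔG).

```python
def predict_ddg(terms, weights=None):
    """Sum energy terms; with weights, compute w1*VDW + w2*TA + ... instead."""
    if weights is None:
        return sum(terms.values())
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical energy terms for one mutation (arbitrary units).
terms = {"VDW": 0.8, "TA": 0.3, "E": 0.4, "EC": 0.2}
# Hypothetical weights; not fitted to any real data.
weights = {"VDW": 0.5, "TA": 1.0, "E": 2.0, "EC": 1.5}

unweighted = predict_ddg(terms)
weighted = predict_ddg(terms, weights)
print(unweighted, weighted)
```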

Preliminary study (goal)

• The ΔΔG of mutations occurring at the same position tend to cluster.

• Prior knowledge of ΔΔG values of other mutations at the query position might help in the prediction of the query’s ΔΔG.

Machine learning

• Learn from experience

• Correlate between description <--> outcome

The prediction scheme

Dataset

Features: describe the substitutions

Find relations between the features and the observed ΔΔG

PoPMuSiC-DB1: 2646 mutations in 137 proteins

Potapov-DB2: 2155 mutations in 79 proteins

Structural-based

• Solvent accessibility.
• Prethermut.
• PoPMuSiC-2.0.

Sequence-based

• The wild-type and mutant AAs.
• Physicochemical deviation.

Random Forests3 and collaborative filtering based prediction

Machine learning

1 Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
2 Potapov et al. (2009) Protein Eng Des Sel, 22, 553-560.
3 Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. Kluwer Academic Publishers.

Mutations with a measured ΔΔG

Calculate the query’s features

Prediction results

Predict the ΔΔG using the features and a pre-calculated Random Forests model

No known ΔΔG records at the query position (traditional scheme)

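Machine learning algorithm: Random Forests1

Descriptors

[Figure: a decision tree that splits on descriptors, e.g. "Buried?" and the wild-type residue (WT = P, WT = A), and ends in a leaf value such as ΔΔG = 0.01. The forest is made of 700 such trees.]

1 Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. Kluwer Academic Publishers.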

Calculate the query’s descriptors

Prediction results

Predict the ΔΔG using the descriptors and a pre-calculated Random Forests model

No known ΔΔG records at the query position (traditional scheme)

ΔΔG values of the “known” mutations

Calculate descriptors for the mutation with the known ΔΔG and the query mutations

Add the “known” mutations to the training set

Rebuild Random Forests model

Predict the query’s ΔΔG using the new model

ΔΔG Prediction

Query mutation with other known mutations at the query position

Collaborative filtering and content-based algorithm

Collaborative-filtering

MU \ Position   1WQ549   4LYZ91   1EY041   5DFR91   1EY071   1SAK342   1HMS64
A               -0.41    1.51     NA       NA       NA       0.55      0.38      NA
. . .
W               -2.43    NA       NA       NA       1.34     NA        NA        NA
Y               NA       NA       3.07     NA       NA       NA        NA        NA

User \ Movie   Star Wars   E.T.   Jaws   Top Gun   Con Air   Pulp Fiction   Troy   Wall-E
Jane           4           1      NA     NA        NA        5              3      NA
. . .
Tom            4           NA     1      NA        NA        NA             NA     NA
John           2           NA     NA     NA        3         NA             NA     NA

Bellkor1 collaborative-filtering algorithm

First we represent the mutation data as a sparse matrix:

The purpose of the algorithm is to predict the missing elements in the matrix by creating a model according to the available data
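One way to hold such a sparse matrix is a dictionary keyed by (position, mutant amino acid), so that only measured ΔΔG values are stored and every absent key is an element to predict. The sketch below is an assumption about representation, not the authors' implementation; the three entries are illustrative values in the style of the slide's table, and `missing_entries` is a hypothetical helper.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Known ΔΔG values (kcal/mol); keys are (position label, mutant amino acid).
# Illustrative entries only.
known_ddg = {
    ("1WQ549", "A"): -0.41,
    ("1WQ549", "W"): -2.43,
    ("4LYZ91", "A"): 1.51,
}


def missing_entries(known, positions):
    """List every (position, amino acid) cell that has no measured ΔΔG."""
    return [
        (pos, aa)
        for pos in positions
        for aa in AMINO_ACIDS
        if (pos, aa) not in known
    ]


positions = sorted({pos for pos, _ in known_ddg})
to_predict = missing_entries(known_ddg, positions)
print(len(to_predict), "matrix elements have no measured value")
```

For simplicity the sketch does not exclude the wild-type residue at each position, which a real table would.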

1 Koren, Y. (2008) Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), pp. 426-434.

Protein position

Possible mutation outcomes

Bellkor collaborative-filtering model

The neighborhood model

The latent factor model

The Bellkor algorithm takes into account only the ΔΔG table:

Different positions and amino acids have different ΔΔG tendencies

Predict the baseline estimators for each amino acid and position

Baseline estimators model

Rui = bui + ...

bui = bi + bu

MU \ Position   1WQ549   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64
A               -0.41    1.51     NA       NA       0.55      0.38      NA
. . .
W               -2.43    NA       NA       1.34     NA        NA        NA
Y               NA       NA       3.07     NA       NA        2.3       NA
bi              .        .        .        .        .         .         .

Up to now, no "biology" has been introduced into our model, only ΔΔG data.

Content-based model

Use a linear regression with a subset of the features to describe the mutation:

1) Pro-Maya (RF)
2) Prethermut
3) PoPMuSiC-2.0
4) Solvent accessibility

Pro-Maya algorithm

For example: Rui = bui + F1*solvent accessibility + F2*Prethermut + …

Collaborative-Filtering

Generate a model to relate the ΔΔG of mutations

A matrix representation of the known ΔΔGs

Training:

Calibrated model

Stochastic gradient descent

Query mutation (at a position that is present in the training set) Predicted ΔΔG

Prediction:

We want to find the best model, i.e. the best set of bi, bu and F for which |Rui - rui| < ε

Rui = bui + Xui·F

1. Create random values for bi, bu and F.
2. Iterate over all the known values of the table.
3. For each value, measure the error between the model's prediction (Rui) and the known value (rui).
4. Move down the slope and go to 2, until |Rui - rui| < ε.
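The four steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the Pro-Maya code: there is no regularization, the learning rate `lr`, the epoch limit and the tolerance `eps` are arbitrary choices, and the function name `train_sgd` and the four training records are hypothetical.

```python
import random


def train_sgd(records, n_features, lr=0.1, epochs=2000, eps=1e-4, seed=0):
    """Fit R = b_pos + b_aa + X.F by stochastic gradient descent.

    records: list of (position, amino_acid, feature_vector, known_ddg).
    Returns a predict(position, amino_acid, feature_vector) function.
    """
    rng = random.Random(seed)
    # Step 1: random starting values for the baselines and the weights F.
    b_pos = {pos: rng.uniform(-0.1, 0.1) for pos, _, _, _ in records}
    b_aa = {aa: rng.uniform(-0.1, 0.1) for _, aa, _, _ in records}
    F = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]

    def predict(pos, aa, x):
        linear = sum(f * v for f, v in zip(F, x))
        return b_pos.get(pos, 0.0) + b_aa.get(aa, 0.0) + linear

    for _ in range(epochs):
        # Step 2: iterate over all known values of the table.
        for pos, aa, x, r in records:
            # Step 3: error between the known value and the prediction.
            err = r - predict(pos, aa, x)
            # Step 4: move each parameter down the slope of the squared error.
            b_pos[pos] += lr * err
            b_aa[aa] += lr * err
            for k, v in enumerate(x):
                F[k] += lr * err * v
        # Stop once every known value is reproduced to within eps.
        if max(abs(r - predict(p, a, x)) for p, a, x, r in records) < eps:
            break
    return predict


# Hypothetical training records: one feature per mutation.
records = [
    ("P1", "A", [0.2], -0.4),
    ("P1", "W", [0.9], -2.4),
    ("P2", "A", [0.1], 1.5),
    ("P2", "Y", [0.5], 0.6),
]
model = train_sgd(records, n_features=1)
worst = max(abs(r - model(p, a, x)) for p, a, x, r in records)
print(f"largest training error: {worst:.5f}")
```

A new mutation at a position seen in training is then scored with `model(position, amino_acid, features)`; an unseen position or amino acid falls back to a baseline of zero in this sketch.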


Results

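How do we test our prediction performance? Cross-validation

[Diagram: all substitutions are divided into subsets; each subset serves in turn as the test set while the remaining substitutions form the learning set, and the results are reported on all substitutions.]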
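A compact sketch of this evaluation scheme, together with the two performance measures used in the results (Pearson correlation coefficient and RMSE), is given below. The number of folds, the random seed, the toy data and the straight-line model `fit_line` are all assumptions made for illustration; any model with the same `fit(train_x, train_y) -> predict` shape could be plugged in.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))


def fit_line(xs, ys):
    """Least-squares straight line; stands in for any trainable model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope /= sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_validate(xs, ys, fit, k=5, seed=0):
    """Predict every item once, using a model trained on the other folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    predictions = [0.0] * len(xs)
    for test in folds:
        held_out = set(test)
        train = [j for j in order if j not in held_out]
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        for j in test:
            predictions[j] = model(xs[j])
    return pearson(predictions, ys), rmse(predictions, ys)


# Toy data lying exactly on a line, so the fit should be near perfect.
xs = [0.1 * j for j in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
pcc, err = cross_validate(xs, ys, fit_line)
print(f"PCC = {pcc:.3f}, RMSE = {err:.3f}")
```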

Cross-validation results

[Bar chart: performance of current methods on the Potapov-DB, measured by the Pearson correlation coefficient (PCC); axis from 0.2 to 0.8. Methods shown: Rosetta, Hunter, I-Mutant2.0, CC/PBSA, PoPMuSiC-2.0, Combining method, Prethermut, Pro-Maya.]

Query position                                 Number of mutations   Performance measure   Pro-Maya (Random Forests)   Pro-Maya (Collaborative filtering)   Prethermut
No known ΔΔGs at the query position            910                   PCC                   0.65±0.02                   -                                    0.61±0.02
                                                                     RMSE (kcal/mol)       1.09                        -                                    1.14
One or more known ΔΔGs at the query position   1735                  PCC                   0.79±0.01                   0.82±0.01                            0.75±0.01
                                                                     RMSE (kcal/mol)       0.92                        0.86                                 0.99

Validation set results

[Bar chart: performance of current methods on the validation set, measured by the Pearson correlation coefficient; axis from 0.2 to 0.8. Methods shown: I-Mutant2.0, CUPSAT, Automute, Dmutant, PoPMuSiC-2.0, Prethermut, Pro-Maya.]

Pro-Maya also performs well in its sequence-based prediction scheme

[Bar chart: Pearson correlation coefficient of Pro-Maya, Pro-Maya (sequence only) and Prethermut.]

How does the number of mutations with known ΔΔG values in the query position affect the prediction accuracy?

One or two known ΔΔGs are sufficient to improve the prediction accuracy significantly (alanine scanning)

The prediction accuracy improves as the number of known records at the query position increases

[Plot: Pearson correlation coefficient (axis starting at 0.55) versus the number of known ΔΔG values per query position. Groups, with the number of mutations in parentheses: 0 (910), 1 (524), 2 (327), >2 (690).]

Conclusions

• One or two known records at the query position improve the prediction.

• The improvement is independent of the amino acid identity of the known records and of the sequence identity of the query protein to the training set.

• Availability: bental.tau.ac.il/ProMaya

Non-synonymous SNPs analysis

Deleterious/neutral

Disease

Changes in protein stability

Prediction of deleterious SNPs

Wainreb G, Ashkenazy H, Bromberg Y, Starovolsky-Shitrit A, Haliloglu T, Ruppin E, Avraham KB, Rost B, Ben-Tal N. MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res. 2010 Jul 1;38 Suppl:W523-8.

Previous work

Sequence-based:

1) Agreement with the profiles of AA residues in the alignment (SIFT)1.

2) Physicochemical-based features (MAPP)2.

3) Sequence-based prediction of structural features (SNAP etc.)(NN)3,4.

Sequence- and structure-based:

1) Observed solvent accessibility (Tree classifier)5.

2) Distance to the ligand6.

3) Micro-environment description (SVM)7.

4) SWISS-PROT annotations (PolyPhen)8.

1 Ng et al. (2001) Gen. Res., 11.
2 Stone et al. (2005) Gen. Res., 15.
3 Bromberg et al. (2007) NAR, 35.
4 Ferrer-Costa et al. (2004) Proteins, 57.
5 Saunders et al. (2002) JMB, 322.
6 Chasman et al. (2001) JMB, 307.
7 Capriotti et al. (2005) Bioinfo., 21.
8 Ramensky et al. (2002) NAR, 30.

Prediction of deleterious SNPs

? Ligand binding site
? Catalytic site
? Protein-protein interface site
? Stop codon
? Changes in protein stability

Disease

Missing data

Dirty datasets

A harder problem Deleterious/neutral

Prediction of deleterious mutations

Dataset

Substitution number   1      2      . .   3000   3001
Substitution          A54V   H89Y   . .   N30A   K90F
Observed ΔΔG          1      -1     . .   1      1

Features: describe the substitutions

Find a pattern that distinguishes deleterious versus neutral substitutions

Machine learning

Deleterious/Neutral

Dealing with the problem of noisy data

1 Bairoch et al. NAR, 2005, 33.
2 Kawabata et al. NAR, 1999, 27.
3 Bromberg et al. NAR, 2007, 35.

Known mutations in proteins with a solved crystal structure

Experimentally validated

Evolutionary model substitutions3

Protein Mutant Database2

Swiss-Prot1 variants


Dealing with the problem of missing data

Structural and sequence based descriptors

Sequence identity with nearest homolog bearing the substitution

Swiss-Prot annotations

Apo- Holo

Substitution matrix distance

SIFT analysis Secondary structure assignment

Neighborhood composition in homologs

Evolutionary conservation (ConSurf1)

Physicochemical deviation

Cα B-factor

Solvent accessibility

Number of sequences in the alignment

Distance from the ligand and binding site conservation

Describes the importance of the query position

SNAP’s prediction

1Ashkenazy et al. (2010) NAR, web-server issue.

Random Forests

Machine learning

Deleterious/Neutral

Results

[Bar chart: cross-validation results of MuD and current methods (SNAP, SIFT, PolyPhen), analyzed according to the SCOP class of the query protein. SCOP classes: all alpha proteins (18%), all beta proteins (16%), alpha and beta proteins (a/b), alpha and beta proteins (a+b) (22%), multi-domain proteins (alpha and beta), membrane and cell surface proteins and peptides, small proteins (3%).]

[Bar chart: cross-validation results of MuD and current methods (PolyPhen, SIFT, SNAP, MuD).]

Method     Lac repressor   T4 lysozyme   HIV protease
PolyPhen   54.2            36.4
SIFT       40.2            34.8          51.7
SNAP       41.2            37.7          30.7
MuD        60.93           27.4          45.6
MuD SA     0.64            0.45          0.54

Known oligomerization state
Known naturally occurring ligands

Method     Lac repressor   T4 lysozyme   HIV protease
PolyPhen   0.54            0.36          NA
SIFT       0.40            0.34          0.51
SNAP       0.41            0.37          0.30
MuD        0.61            0.27          0.45

Importance of interactivity

Test set

Predicted oligomerization state
Predicted naturally occurring ligands

PDB structure template selection

PDB ID or user uploaded coordinate file

Protein Sequence

Accept query mutations Accept query mutations

Oligomerization state selection and removal of irrelevant chains

Filtering of biologically irrelevant ligands

Selection of interesting residues

Results page

Conclusions

• Development of an interactive tool that can incorporate user-reported data into the prediction and thereby improve it.

• Biological data (annotations, protein structures) is increasing rapidly, allowing us to further improve the accuracy.

Thanks

• Prof. Nir Ben-Tal
• Dr. Guy Nimrod
• Haim Ashkenazy

• Prof. Lior Wolf
• Dr. Yves Dehouck

• Dr. Yana Bromberg
• Alina Starovolsky-Shitrit
• Prof. Turkan Haliloglu
• Prof. Eytan Ruppin
• Prof. Karen B. Avraham
• Prof. Burkhard Rost

To all my friends in the lab:
• Dr. Yanay Ofran
• Maya Schushan
• Yana Gofman
• Daphna Meroz
• Inbar Fish
• Noam Chen
• Ofir Goldenberg
• Dr. Gal Almogy
• Ori Kalid
• Dr. Meytal Landau
• Dr. Sarel Fleishman
• Uri Ron
• Adva Suez
• Yariv Barkan
• Matan Kalman

To my wife Adi for putting up with all the Saturdays and nights.

Funding:
• Eurohear project.
• DIP program.

Just a reminder that not all mutations are bad

Additional experimental data

Random Forests cross-validation results

How does the amino acid type of mutations with known ΔΔG values in the query position affect the prediction accuracy?

MU    Position 1WQ549
C     NA
D     ?
E     0.02
F     NA
G     1.2
H     NA
I     -0.4
. .
Y     NA

Amino acid pair   Miyata physicochemical distance
A G               0.9
A E               3.98

Accuracy versus minimal Miyata physicochemical distance:

RMSE   Minimal Miyata physicochemical distance
0.3    0.9
0.5    3.98
. .
0.02   2.37

Miyata et al (1979) Journal of molecular evolution, 12, 219-236.
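The "minimal Miyata physicochemical distance" used in this analysis can be read as the smallest pairwise distance between the query amino acid and any amino acid with a known ΔΔG at the same position. The sketch below follows that reading, which is an assumption; the helper name `minimal_distance` is hypothetical, and the distance table holds only the two pairs quoted on the slide rather than the full Miyata matrix.

```python
def minimal_distance(query_aa, known_aas, pair_distance):
    """Smallest physicochemical distance from query_aa to any known amino acid."""

    def distance(a, b):
        if a == b:
            return 0.0
        # Distances are symmetric, so pairs are stored as unordered sets.
        return pair_distance[frozenset((a, b))]

    return min(distance(query_aa, known) for known in known_aas)


# Only the two example pairs from the slide; a real table covers all pairs.
pair_distance = {
    frozenset(("A", "G")): 0.9,
    frozenset(("A", "E")): 3.98,
}

closest = minimal_distance("A", ["G", "E"], pair_distance)
print(closest)
```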

Columns: PM*, ΔΔG RF, CFCB, ΔΔG RF, CFCB
LOO type per column: LOO unseen, LOO all, LOO unseen, LOO all, LOO unseen, LOO

Subset              Performance measure
The whole dataset   PCC                   0.71±0.01   0.74±0.01   0.75±0.01   0.77±0.01
                    RMSE (kcal/mol)       1.04        0.98        0.96        0.94
SRPM                PCC                   0.60±0.02   0.64±0.02
                    RMSE (kcal/mol)       1.15        1.10
MRPM                PCC                   0.76±0.01   0.79±0.01   0.83±0.01   0.82±0.01
                    RMSE (kcal/mol)       0.98        0.91        0.84        0.84

How well Pro-Maya performs on query mutations in proteins that are not homologous to any of the proteins in the training set

Random Forests cross-validation results on the SRPM and MRPM subsets of PoPMuSiC-DB and Potapov-DB

SRPM – Single-Replacement Position Mutation
MRPM – Multi-Replacement Position Mutation

Validation set results

Subset            Dataset       Performance measure   Pro-Maya sequence-based   Pro-Maya structure-based   Prethermut
All the dataset   Potapov-DB    PCC                   0.68±0.02                 0.77±0.01                  0.72±0.01
                                RMSE (kcal/mol)       1.27                      1.09                       1.20
                  PoPMuSiC-DB   PCC                   0.69±0.01                 0.77±0.01                  0.71±0.01
                                RMSE (kcal/mol)       1.06                      0.94                       1.05

                  Potapov-DB    PCC                   0.44±0.03                 0.59±0.04                  0.57±0.03
                                RMSE (kcal/mol)       1.45                      1.28                       1.30
                  PoPMuSiC-DB   PCC                   0.55±0.02                 0.64±0.02                  0.61±0.02
                                RMSE (kcal/mol)       1.20                      1.11                       1.14

                  Potapov-DB    PCC                   0.77±0.01                 0.83±0.01                  0.77±0.01
                                RMSE (kcal/mol)       1.15                      0.98                       1.14
                  PoPMuSiC-DB   PCC                   0.76±0.01                 0.82±0.01                  0.75±0.01
                                RMSE (kcal/mol)       0.97                      0.85                       0.99

Sequence-based Pro-Maya

Relative importance of Pro-Maya's features

[Bar chart: mean decrease in mean square error (axis from 0 to 1) for each feature. Features shown: mutant and WT AA, isoelectric point deviation, number of sequences, protein flexibility, torsion angle potentials, molecular weight deviation, SIDCH, hydrophobicity deviation, solvent accessibility, PoPMuSiC-2.0, Prethermut.]

Additional explanatory slides

More models

MU \ Position   1WQ549   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64
A               -0.41    0.4      NA       NA       0.55      0.38      NA
. . .
W               -2.43    NA       NA       -1.34    NA        NA        NA
Y               4.5      NA       3.07     2.1      NA        2.3       NA

Find k positions with similar ΔΔG values

Learn the pairwise weights for the neighboring positions: wij, cij

Relate the ΔΔG values of neighboring positions i and j:

$$R_{ui} = b_{ui} + |R^k(u;i)|^{-0.5} \sum_{j \in R^k(u;i)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$$

MU \ Position   1WQ549   4LYZ91   1EY071   1SAK342
A               -0.41    0.4      NA       0.55      0.38
. . .
W               -2.43    NA       NA       NA        NA
Y               NA       NA       3.07     NA        2.3

Break down the ΔΔG matrix into two matrices whose rows are the latent-factor vectors $p_u$ and $q_i$:

$$R_{ui} = b_{ui} + p_u^{T} q_i + |R^k(u;i)|^{-0.5} \sum_{j \in R^k(u;i)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$$

Latent factors

f = number of latent factors

Clustering of similar positions into a representative position while taking into account only the ΔΔG values
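The latent factor part of the model adds a dot product of two f-dimensional vectors to the baseline estimate. A minimal sketch of that single term follows; the function name `latent_prediction` and all numbers are hypothetical, the neighborhood term is left out, and in practice the vectors would be learned from the known ΔΔG values rather than typed in.

```python
def latent_prediction(baseline, p_u, q_i):
    """Return baseline + p_u . q_i for two latent-factor vectors of equal length."""
    if len(p_u) != len(q_i):
        raise ValueError("both vectors must have f entries")
    return baseline + sum(a * b for a, b in zip(p_u, q_i))


# Hypothetical values with f = 2 latent factors.
baseline = 0.3
p_u = [0.5, -1.0]
q_i = [0.4, 0.2]

estimate = latent_prediction(baseline, p_u, q_i)
print(estimate)
```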

Latent factor model example

[Figure: the ΔΔG matrix of the previous slide broken down into two matrices, each with f latent factors.]
