Post on 13-Jan-2016
Interactive tools for improved prediction of the effect of non-synonymous single nucleotide mutations on the protein
Gilad Wainreb Department of Biochemistry and Molecular Biology
Under the supervision of Prof. Nir Ben-Tal
Mutations diseases
• The human population contains approximately 10 million single nucleotide polymorphism (SNP) sites1.
• A large portion of the known genetic diseases is associated with non-synonymous SNPs (nsSNP)2.
1Sachidanandam et al. (2001) Nature, 409, 928–933.
2Stenson, P.D. et al. (2008) J. Med. Genet., 45, 124–126.
Cell culture: constitutive phosphorylation
nsSNPs analysis
Bercovich, D., Ganmore, I., Scott, L.M., Wainreb, G., Birger, Y., Elimelech, A., Shochat, C., Cazzaniga, G., Biondi, A., Basso, G. et al. (2008) Mutations of JAK2 in acute lymphoblastic leukaemias associated with Down's syndrome. Lancet, 372, 1484-1492.
Does it alter the protein’s function?
?
Phenotypic expression
How does the mutation affect the protein’s function?
?
?
R683G, R683S, R683K in the Janus kinase (JAK2)
?
?
?
Myeloproliferative disorders
?
?
?
1Ashkenazy et al. (2010) NAR, web-server issue.
JAK2 R683G, R683S, R683K SNPs analysis
A homology model of the JAK2 pseudo kinase domain colored according to the ConSurf1 color scheme
Mutagenesis studies: Deleterious
nsSNPs analysis
Does it alter the protein’s function?
?
Phenotypic expression
Why does the mutation affect the protein’s function?
?
?
R683G, R683S, R683K in the Janus kinase (JAK2)
?
?
?
Myeloproliferative disorders
Hindering the binding site
Distribution of the Effects of Missense SNPs on Protein Molecular Function
Wang and Moult (2001) , Human Mutation 17:263-270.
Non-synonymous SNPs analysis
Deleterious/neutral
?
?
Disease
?
Protein Mutant stAbilitY Analyzer
Changes in protein stability
Protein Mutant stAbilitY Analyzer
Wainreb et al. (2011) Bioinformatics 27, 3286
What is protein stability?
• Protein stability is defined as the difference in Gibbs free energy (ΔG) between the unfolded and folded states of the protein.
• ΔΔG = ΔGmutant − ΔGWT
Why is protein stability important?
• Stability of the native conformation is important for proper function.
• 83% of the deleterious SNPs involve changes in protein stability1.
Change in protein stability (ΔΔG)
1Wang and Moult (2001). Human Mutation 17:263-270.
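As a small worked example of the definition ΔΔG = ΔGmutant − ΔGWT, the sketch below computes the stability change for one substitution. The ΔG values are hypothetical and only illustrate the arithmetic; the function name is an assumption for this example.

```python
def stability_change(dg_mutant: float, dg_wt: float) -> float:
    """Return ΔΔG = ΔG(mutant) - ΔG(wild type), in kcal/mol."""
    return dg_mutant - dg_wt


# Hypothetical folding free energies (kcal/mol), not measured values.
dg_wt = -6.5
dg_mutant = -5.2
ddg = stability_change(dg_mutant, dg_wt)
print(f"ddG = {ddg:.2f} kcal/mol")
```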
Atom-based / Amino acid-based
Computational prediction methods
1 Prevost et al. (1991) PNAS, 88, 10880-10884.
2 Seeliger et al. (2010) Biophysical Journal, 98, 2309-2316.
3 Bahar and Jernigan (1997) JMB, 266, 195-214.
4 Samudrala and Moult (1998) JMB, 275, 895-916.
5 Guerois et al. (2002) JMB, 320, 369-387.
6 Tian et al. (2010) BMC Bioinformatics, 11, 370.
7 Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
8 Capriotti et al. (2005) NAR, 33, W306-310.
Physical effective potentials (refs. 1, 2): molecular dynamics
Statistical effective potentials (refs. 3, 4): observed frequencies potentials
Empirical effective potentials: machine learning (mostly)
FoldX (ref. 5)
Prethermut (ref. 6)
PoPMuSiC-2.0 (ref. 7)
Evolutionary, physicochemical or sequence-based features
I-Mutant2.0 (ref. 8)
Van der Waals interactions, torsion angle, electrostatic terms
#    Van der Waals (VDW)   Torsion angle (TA)   Electrostatic (E)   . .   Entropic cost (EC)   Known ΔΔG
1    0.8                   30                   0.4                 . .   0.2                  1.02
2    1                     -0.7                 0.9                 . .   0.4                  -0.3
3    0.6                   0.5                  0.2                 . .   -0.9                 0.8
ΔΔG = w1*VDW + w2*TA + w3*E + .. + wi*EC
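The weighted sum ΔΔG = w1*VDW + w2*TA + w3*E + .. can be calibrated against mutations with a known ΔΔG. The sketch below is one way to do this, assuming an ordinary least-squares fit through the normal equations; the energy terms, the "true" weights and the helper names are all hypothetical, and the tools named in the slides may fit their weights differently.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_weights(terms, ddg):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    n = len(terms[0])
    xtx = [[sum(row[p] * row[q] for row in terms) for q in range(n)]
           for p in range(n)]
    xty = [sum(row[p] * y for row, y in zip(terms, ddg)) for p in range(n)]
    return solve_linear_system(xtx, xty)


# Hypothetical energy terms per mutation: (VDW, TA, E).
terms = [
    [0.8, 3.0, 0.4],
    [1.0, -0.7, 0.9],
    [0.6, 0.5, 0.2],
    [0.1, 1.2, -0.5],
    [-0.4, 0.3, 0.7],
]
# Hypothetical weights used only to generate consistent "known" ddG values.
true_weights = [0.5, 0.1, -0.3]
known_ddg = [sum(w * t for w, t in zip(true_weights, row)) for row in terms]

weights = fit_weights(terms, known_ddg)
print([round(w, 3) for w in weights])
```

Because the illustrative ΔΔG values are generated without noise, the fit recovers the generating weights; with measured data the fit would only approximate them.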
Protein Mutant stAbilitY Analyzer
ΔΔG = VDW+ TA+ E+ ..+ EC
Preliminary study (goal)
• The ΔΔG of mutations occurring at the same position tend to cluster.
• Prior knowledge of ΔΔG values of other mutations at the query position might help in the prediction of the query’s ΔΔG.
Protein Mutant stAbilitY Analyzer
Machine learning
• Learn from experience
• Correlate between description <--> outcome
The prediction scheme
Dataset
Features: describe the substitutions
Find relations between the features and the observed ΔΔG
PoPMuSiC-DB1: 2646 mutations in 137 proteins
Potapov-DB2: 2155 mutations in 79 proteins
Structural-based
• Solvent accessibility.
• Prethermut.
• PoPMuSiC-2.0.
Sequence-based
• The wild-type and mutant AAs.
• Physicochemical deviation.
Random Forests3 and collaborative filtering based prediction
Machine learning
1Dehouck et al. (2009) Bioinformatics, 25, 2537-2543.
2Potapov et al. (2009) Protein Eng Des Sel, 22, 553-560.
3Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32.
Mutations with a measured ΔΔG
Calculate the query’s features
Prediction results
Predict the ΔΔG using the features and a pre-calculated Random Forests model
A
B
No known ΔΔG records at the query position (traditional scheme)
C
Machine learning algorithm: Random Forests1
[Figure: a decision tree over the substitutions' descriptors (e.g. Buried?, WT = P, WT = A) ending in a leaf with ΔΔG = 0.01; decision tree × 700]
1Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32.
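To make the decision-tree picture concrete, the sketch below shows how a forest turns descriptors into a ΔΔG: each tree routes the query through yes/no tests to a leaf value, and the forest averages the leaves. The tree shapes, tests and leaf values are hypothetical (only the 0.01 leaf echoes the slide), and a real forest would be learned from the training set rather than written by hand.

```python
def predict_tree(node, descriptors):
    """Walk one tree: inner nodes are dicts with a test, leaves are ddG values."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](descriptors) else node["no"]
    return node


def predict_forest(trees, descriptors):
    """Random Forests regression output: the mean of the per-tree predictions."""
    return sum(predict_tree(tree, descriptors) for tree in trees) / len(trees)


# Two hand-written, hypothetical trees standing in for a learned forest.
tree_1 = {
    "test": lambda d: d["buried"],
    "yes": {"test": lambda d: d["wt"] == "P", "yes": 1.8, "no": 0.9},
    "no": {"test": lambda d: d["wt"] == "A", "yes": 0.01, "no": 0.3},
}
tree_2 = {
    "test": lambda d: d["wt"] == "P",
    "yes": 1.5,
    "no": 0.21,
}
forest = [tree_1, tree_2]

query = {"buried": False, "wt": "A"}  # hypothetical query descriptors
print(predict_forest(forest, query))
```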
Calculate the query’s descriptors
Prediction results
Predict the ΔΔG using the descriptors and a pre-calculated Random Forests model
A
B
No known ΔΔG records at the query position (traditional scheme)
C
ΔΔG values of the “known” mutations
Calculate descriptors for the mutation with the known ΔΔG and the query mutations
Add the “known” mutations to the training set
Rebuild Random Forests model
Predict the query’s ΔΔG using the new model
ΔΔG Prediction
Query mutation with other known mutations at the query position
Collaborative filtering and content-based algorithm
Collaborative-filtering
MU \ Position   1WQ549   1LZ12   4LYZ91   1EY041   5DFR91   1EY071   1SAK342   1HMS64
A               -0.41    1.51    NA       NA       NA       0.55     0.38      NA
.               .        .       .        .        .        .        .         .
.               .        .       .        .        .        .        .         .
W               -2.43    NA      NA       NA       1.34     NA       NA        NA
Y               NA       NA      3.07     NA       NA       NA       NA        NA
User \ Movie    Star Wars   E.T.   Jaws   Top Gun   Con Air   Pulp Fiction   Troy   Wall-E
Jane            4           1      NA     NA        NA        5              3      NA
.               .           .      .      .         .         .              .      .
.               .           .      .      .         .         .              .      .
Tom             4           NA     1      NA        NA        NA             NA     NA
John            2           NA     NA     NA        3         NA             NA     NA
Bellkor1 collaborative-filtering algorithm
First we represent the mutation data as a sparse matrix:
The purpose of the algorithm is to predict the missing elements in the matrix by creating a model according to the available data
1Koren, Y. (2008) Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’08). pp. 426-434.
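A minimal sketch of the sparse-matrix representation, assuming a plain dictionary keyed by (mutant amino acid, position) that stores only the measured ΔΔG values; every absent key is an element the algorithm has to predict. The values are the ones shown in the slide's table; the variable names are illustrative.

```python
# Known ddG values, keyed by (mutant amino acid, position label).
known_ddg = {
    ("A", "1WQ549"): -0.41,
    ("A", "1LZ12"): 1.51,
    ("A", "1EY071"): 0.55,
    ("A", "1SAK342"): 0.38,
    ("W", "1WQ549"): -2.43,
    ("W", "5DFR91"): 1.34,
    ("Y", "4LYZ91"): 3.07,
}

mutations = sorted({u for u, _ in known_ddg})
positions = sorted({i for _, i in known_ddg})

# The "NA" cells of the matrix: combinations with no measured value.
missing = [(u, i) for u in mutations for i in positions
           if (u, i) not in known_ddg]

print(len(known_ddg), "known,", len(missing), "to predict")
```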
Protein position
Possible mutation outcomes
Bellkor collaborative-filtering model
The neighborhood model
The latent factor model
The Bellkor algorithm takes into account only the ΔΔG table:
Different positions and amino acids have different ΔΔG tendencies
Predict the baseline estimators for each amino acid and position
Baseline estimators model
Rui = bui + ...
bui = bi + bu
MU \ Position   1WQ549   1LZ12   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64   bu
A               -0.41    1.51    NA       NA       0.55     0.38      NA       .
.               .        .       .        .        .        .         .        .
.               .        .       .        .        .        .         .        .
W               -2.43    NA      NA       1.34     NA       NA        NA       .
Y               NA       NA      3.07     NA       NA       2.3       NA       .
bi              .        .       .        .        .        .         .
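The baseline bui = bi + bu can be illustrated with simple averages: bi as the mean ΔΔG observed at position i, and bu as the mean residual left for amino acid u. This is a simplified sketch under that assumption (plain means, no regularization), not necessarily how the Bellkor model estimates its baselines; the data are the known entries of the table above.

```python
from collections import defaultdict


def baseline_estimators(known):
    """Return (b_i per position, b_u per mutant amino acid) from known ddG values."""
    by_position = defaultdict(list)
    for (_, i), r in known.items():
        by_position[i].append(r)
    b_i = {i: sum(v) / len(v) for i, v in by_position.items()}

    residuals = defaultdict(list)
    for (u, i), r in known.items():
        residuals[u].append(r - b_i[i])  # what the position baseline misses
    b_u = {u: sum(v) / len(v) for u, v in residuals.items()}
    return b_i, b_u


known_ddg = {
    ("A", "1WQ549"): -0.41, ("A", "1LZ12"): 1.51,
    ("A", "1EY071"): 0.55, ("A", "1SAK342"): 0.38,
    ("W", "1WQ549"): -2.43, ("W", "5DFR91"): 1.34,
    ("Y", "4LYZ91"): 3.07, ("Y", "1SAK342"): 2.3,
}

b_i, b_u = baseline_estimators(known_ddg)
# Baseline estimate for an unmeasured cell: W at position 1SAK342.
b_ui = b_i["1SAK342"] + b_u["W"]
print(round(b_ui, 3))
```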
Bellkor collaborative-filtering model
Baseline estimators model
Rui = bui+
Up to now, no “Biology” was introduced into our model, only ΔΔG data.
Content-based model
Use a linear regression with a subset of the features to describe the mutation: XuiF
1) Pro-Maya (RF)
2) Prethermut
3) PoPMuSiC-2.0
4) Solvent accessibility
Pro-Maya algorithm
For example: Rui = bui + F1*solvent accessibility + F2*Prethermut + …
Collaborative-Filtering
Generate a model to relate the ΔΔG of mutations
A matrix representation of the known ΔΔGs
Training:
Calibrated model
Stochastic gradient descent
Query mutation (at a position that is present in the training set) Predicted ΔΔG
Prediction:
Stochastic gradient descent
We want to find the best model, i.e. the best set of bi, bu and F for which |Rui − rui| < Ɛ, where rui is the known ΔΔG value:
Rui = bui + XuiF
1. Create random values for bi, bu and F.
2. Iterate over all the known values of the table.
3. For each value, measure the error between the model's prediction and the known value.
4. Move down the slope and go to 2 until |Rui − rui| < Ɛ.
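The four steps can be sketched as a plain stochastic gradient descent loop for the model Rui = bi + bu + XuiF. This is an illustrative sketch, not the Pro-Maya implementation: the learning rate, stopping threshold, epoch cap and the two feature values per record (standing in for solvent accessibility and a Prethermut score) are all hypothetical, and no regularization is applied.

```python
import math
import random


def predict(b_i, b_u, f, u, i, x):
    """R_ui = b_i + b_u + X_ui . F"""
    return b_i[i] + b_u[u] + sum(fk * xk for fk, xk in zip(f, x))


def rmse(records, b_i, b_u, f):
    """Root mean square error of the model over all known records."""
    total = sum((predict(b_i, b_u, f, u, i, x) - r) ** 2
                for u, i, x, r in records)
    return math.sqrt(total / len(records))


def train_sgd(records, n_features, rate=0.05, eps=0.01, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting values for b_i, b_u and F.
    b_i = {i: rng.uniform(-0.1, 0.1) for _, i, _, _ in records}
    b_u = {u: rng.uniform(-0.1, 0.1) for u, _, _, _ in records}
    f = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(max_epochs):
        worst = 0.0
        for u, i, x, r in records:               # Step 2: every known value
            err = r - predict(b_i, b_u, f, u, i, x)   # Step 3: the error
            worst = max(worst, abs(err))
            b_i[i] += rate * err                 # Step 4: move down the slope
            b_u[u] += rate * err
            for k, xk in enumerate(x):
                f[k] += rate * err * xk
        if worst < eps:                          # stop once every error is small
            break
    return b_i, b_u, f


# (mutant AA, position, [hypothetical features], known ddG from the slide table)
records = [
    ("A", "1WQ549", [0.10, -0.2], -0.41),
    ("A", "1LZ12", [0.80, 1.1], 1.51),
    ("A", "1EY071", [0.35, 0.6], 0.55),
    ("A", "1SAK342", [0.50, 0.2], 0.38),
    ("W", "1WQ549", [0.10, -1.9], -2.43),
    ("W", "5DFR91", [0.60, 1.0], 1.34),
    ("Y", "4LYZ91", [0.90, 2.5], 3.07),
    ("Y", "1SAK342", [0.50, 1.8], 2.3),
]

b_i, b_u, f = train_sgd(records, n_features=2)
print("training RMSE:", round(rmse(records, b_i, b_u, f), 3))
```

A prediction for an unmeasured mutation at a position present in the training set is then `predict(b_i, b_u, f, u, i, x)` with that mutation's own feature vector.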
Results
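How do we test our prediction performance? Cross-validation
[Figure: all substitutions are divided into parts; each part serves in turn as the test set while the remaining substitutions form the learning set, and the test-set predictions are combined into results on all substitutions]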
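A minimal sketch of the cross-validation split, assuming k equally sized folds chosen at random; the substitution names, the number of folds and the function name are hypothetical. Each substitution ends up in exactly one test set, so predictions are obtained for all substitutions.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (learning set, test set) pairs; each item is tested exactly once."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[n::k] for n in range(k)]
    for n in range(k):
        test = folds[n]
        learn = [x for m, fold in enumerate(folds) if m != n for x in fold]
        yield learn, test


# Hypothetical substitution labels.
substitutions = ["A54V", "H89Y", "N30A", "K90F", "L12G",
                 "T7S", "D45E", "R68K", "V33I", "G21A"]

for learn, test in k_fold_splits(substitutions, k=5):
    # A model would be trained on `learn` and evaluated on `test` here.
    print(len(learn), "learning /", len(test), "test")
```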
Cross-validation results
Performance of current methods on the Potapov-DB (PCC – Pearson correlation coefficient)
Method             PCC
Rosetta            0.26
Hunter             0.45
I-Mutant2.0        0.54
FoldX              0.55
CC/PBSA            0.56
EGAD               0.59
PoPMuSiC-2.0       0.62
Combining method   0.64
Prethermut         0.72
Pro-Maya           0.77
Cross-validation results
Query position                                     Number of mutations   Performance measure   Pro-Maya (Random Forests)   Pro-Maya (Collaborative filtering)   Prethermut
No known ΔΔGs at the query position                910                   PCC                   0.65±0.02                   -                                    0.61±0.02
                                                                         RMSE (kcal/mol)       1.09                        -                                    1.14
One or more known ΔΔGs at the query position       1735                  PCC                   0.79±0.01                   0.82±0.01                            0.75±0.01
                                                                         RMSE (kcal/mol)       0.92                        0.86                                 0.99
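The two performance measures reported here, the Pearson correlation coefficient (PCC) and the root mean square error (RMSE), can be computed as in the sketch below. The predicted and observed ΔΔG lists are hypothetical and serve only to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def rmse(predicted, observed):
    """Root mean square error, in the units of the inputs (kcal/mol for ddG)."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Hypothetical predicted and observed ddG values (kcal/mol).
predicted = [0.4, -1.1, 2.0, 0.9, -0.2]
observed = [0.6, -1.5, 2.4, 0.3, 0.1]
print("PCC:", round(pearson(predicted, observed), 2))
print("RMSE:", round(rmse(predicted, observed), 2))
```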
Validation set results
Performance of current methods on the validation set (Pearson correlation coefficient)
Method          PCC
I-Mutant2.0     0.29
Eris            0.35
CUPSAT          0.37
FoldX           0.4
Automute        0.46
Dmutant         0.48
PoPMuSiC-2.0    0.69
Prethermut      0.72
Pro-Maya        0.79
Pro-Maya performs well also in its sequence-based prediction scheme
[Figure: bar chart of the Pearson correlation coefficient for Pro-Maya, Pro-Maya (sequence only) and Prethermut]
How does the number of mutations with known ΔΔG values at the query position affect the prediction accuracy?
One or two known ΔΔGs are sufficient to improve the prediction accuracy significantly (alanine scanning).
The prediction accuracy improves as the number of known records at the query position increases.
[Figure: Pearson correlation coefficient versus the number of known ΔΔG values per query position; categories, with the number of mutations in parentheses: 0 (910), 1 (524), 2 (327), >2 (690)]
Conclusions
• One or two known records at the query position improve the prediction.
• The improvement is independent of the amino acid identity of the known records and of the sequence identity of the query protein to the training set.
• Availability: bental.tau.ac.il/ProMaya
Non-synonymous SNPs analysis
Deleterious/neutral
?
?
Disease
?
Protein Mutant stAbilitY Analyzer
Changes in protein stability
Prediction of deleterious SNPs
Wainreb G, Ashkenazy H, Bromberg Y, Starovolsky-Shitrit A, Haliloglu T, Ruppin E, Avraham KB, Rost B, Ben-Tal N. MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res. 2010 Jul 1;38 Suppl:W523-8.
Previous work
Sequence-based:
1) Agreement with the profiles of AA residues in the alignment (SIFT)1.
2) Physicochemical-based features (MAPP)2.
3) Sequence-based prediction of structural features (SNAP etc.)(NN)3,4.
Sequence- and structure-based:
1) Observed solvent accessibility (Tree classifier)5.
2) Distance to the ligand6.
3) Micro-environment description (SVM)7.
4) SWISS-PROT annotations (PolyPhen)8.
1Ng et al. (2001) Gen. Res., 11.
2Stone et al. (2005) Gen. Res., 15.
3Bromberg et al. (2007) NAR, 35.
4Ferrer-Costa et al. (2004) Proteins, 57.
5Saunders et al. (2002) JMB, 322.
6Chasman et al. (2001) JMB, 307.
7Capriotti et al. (2005) Bioinfo, 21.
8Ramensky et al. (2002) NAR, 30.
Prediction of deleterious SNPs
? Ligand binding site
? Catalytic site
? Protein-protein interface site
? Stop codon
? Changes in protein stability
?
?
?
Disease
Missing data
Dirty datasets
A harder problem Deleterious/neutral
Prediction of deleterious mutations
Dataset
Substitution number   1      2      . .   3000   3001
Substitution          A54V   H89Y   . .   N30A   K90F
Observed ΔΔG          1      -1     . .   1      1
Features: describe the substitutions
Machine learning: find a pattern that distinguishes deleterious versus neutral substitutions
Deleterious/Neutral
Dealing with the problem of Noisy data
1Bairoch et al. NAR, 2005, 33.
2Kawabata et al. NAR, 1999, 27.
3Bromberg et al. NAR, 2007, 35.
Known mutations in proteins with a solved crystal structure
Experimentally validated
Evolutionary model substitutions3
Protein Mutant Database2
Swiss-Prot1 variants
Prediction of deleterious mutations
Dataset
Substitution number   1      2      . .   3000   3001
Substitution          A54V   H89Y   . .   N30A   K90F
Observed ΔΔG          1      -1     . .   1      1
Features: describe the substitutions
Machine learning: find a pattern that distinguishes deleterious versus neutral substitutions
Deleterious/Neutral
Dealing with the problem of missing data
Structural and sequence based descriptors
Sequence identity with nearest homolog bearing the substitution
Swiss-Prot annotations
Apo- Holo
Substitution matrix distance
SIFT analysis Secondary structure assignment
Neighborhood composition in homologs
Evolutionary conservation (ConSurf1)
Physicochemical deviation
Cα B-factor
Solvent accessibility
Number of sequences in the alignment
Distance from the ligand and binding site conservation
Describes the importance of the query position
SNAP’s prediction
1Ashkenazy et al. (2010) NAR, web-server issue.
Prediction of deleterious mutations
Dataset
Substitution number   1      2      . .   3000   3001
Substitution          A54V   H89Y   . .   N30A   K90F
Observed ΔΔG          1      -1     . .   1      1
Features: describe the substitutions
Machine learning: Random Forests
Deleterious/Neutral
Results
Cross-validation results of MuD and current methods analyzed according to the SCOP class of the query protein
[Figure: Matthews correlation coefficient of MuD, SNAP, SIFT and PolyPhen per SCOP class: all alpha proteins (18%), all beta proteins (16%), alpha and beta proteins (a/b) (27%), alpha and beta proteins (a+b) (22%), multi-domain proteins (alpha and beta) (7%), membrane and cell surface proteins and peptides (4%), small proteins (3%)]
Cross-validation results of MuD and current methods
[Figure: bar chart of the Matthews correlation coefficient for PolyPhen, SIFT, SNAP and MuD]
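The Matthews correlation coefficient used in these comparisons is computed from the four cells of the deleterious/neutral confusion matrix. The sketch below uses the standard definition; the counts are hypothetical, and returning 0 when the denominator vanishes is a convention assumed for this example.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: deleterious substitutions predicted deleterious/neutral,
    tn/fp: neutral substitutions predicted neutral/deleterious.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one classifier on one test set.
print(round(matthews_cc(tp=80, tn=70, fp=30, fn=20), 2))
```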
Cross-validation results
            Lac repressor   T4 lysozyme   HIV protease
PolyPhen    54.2            36.4
SIFT        40.2            34.8          51.7
SNAP        41.2            37.7          30.7
MuD         60.93           27.4          45.6
MuD SA      0.64            0.45          0.54
Known oligomerization state
Known naturally occurring ligands
            Lac repressor   T4 lysozyme   HIV protease
PolyPhen    0.54            0.36          NA
SIFT        0.40            0.34          0.51
SNAP        0.41            0.37          0.30
MuD         0.61            0.27          0.45
Importance of interactivity
Test set
Predicted oligomerization state
Predicted naturally occurring ligands
PDB structure template selection
PDB ID or user uploaded coordinate file
Protein Sequence
Accept query mutations Accept query mutations
Oligomerization state selection and removal of irrelevant chains
Filtering of biologically irrelevant ligands
Selection of interesting residues
Results page
Conclusions
• Development of an interactive tool that can incorporate user-reported data into the prediction and thereby improve it.
• Biological data is increasing rapidly, allowing us to further improve the accuracy (annotations, protein structures).
Thanks
• Prof. Nir Ben-Tal
• Dr. Guy Nimrod
• Haim Ashkenazy
Protein Mutant stAbilitY Analyzer
• Prof. Lior Wolf
• Dr. Yves Dehouck
• Dr. Yana Bromberg
• Alina Starovolsky-Shitrit
• Prof. Turkan Haliloglu
• Prof. Eytan Ruppin
• Prof. Karen B. Avraham
• Prof. Burkhard Rost
To all my friends in the lab:
• Dr. Yanay Ofran
• Maya Schushan
• Yana Gofman
• Daphna Meroz
• Inbar Fish
• Noam Chen
• Ofir Goldenberg
• Dr. Gal Almogy
• Ori Kalid
• Dr. Meytal Landau
• Dr. Sarel Fleishman
• Uri Ron
• Adva Suez
• Yariv Barkan
• Matan Kalman
To my wife Adi for putting up with all the Saturdays and nights.
Funding:
• Eurohear project.
• DIP program.
Just a reminder that not all mutations are bad
Additional experimental data
Random Forests cross-validation results
How does the amino acid type of mutations with known ΔΔG values in the query position affect the prediction accuracy
Known ΔΔG values at position 1WQ549 (MU: mutant amino acid)
MU   ΔΔG
A    NA
C    NA
D    ?
E    0.02
F    NA
G    1.2
H    NA
I    -0.4
.    .
.    .
Y    NA
Amino acid pair   Miyata physicochemical distance
A G               0.9
A E               3.98
Minimal Miyata physicochemical distance ? Accuracy X
RMSE   Minimal Miyata physicochemical distance
0.3    0.9
0.5    3.98
.      .
.      .
0.02   2.37
Miyata et al (1979) Journal of molecular evolution, 12, 219-236.
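The "minimal Miyata physicochemical distance" for a query can be read as the smallest distance between the query amino acid and any amino acid with a known ΔΔG at the same position. The sketch below assumes that reading; the lookup table holds only the two pair distances shown in the slide, and a full analysis would need the complete Miyata matrix.

```python
# Pair distances as shown in the slide; a complete table is assumed elsewhere.
MIYATA = {
    frozenset("AG"): 0.9,
    frozenset("AE"): 3.98,
}


def miyata_distance(aa_1, aa_2):
    """Symmetric lookup of the Miyata distance between two amino acids."""
    if aa_1 == aa_2:
        return 0.0
    return MIYATA[frozenset((aa_1, aa_2))]


def minimal_miyata_distance(query_aa, known_aas):
    """Smallest distance from the query to any amino acid with a known ddG."""
    return min(miyata_distance(query_aa, aa) for aa in known_aas)


# Query amino acid A, with known ddG records for G and E at the same position.
print(minimal_miyata_distance("A", ["G", "E"]))
```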
0.14
Column headers as shown: PM* ΔΔGRF CFCB ΔΔGRF CFCB
LOO type: LOO unseen, LOO all, LOO unseen, LOO all, LOO unseen, LOO all
The whole dataset   PCC               0.71±0.01   0.74±0.01   0.75±0.01   0.77±0.01
                    RMSE (kcal/mol)   1.04        0.98        0.96        0.94
SRPM                PCC               0.60±0.02   0.64±0.02
                    RMSE (kcal/mol)   1.15        1.10
MRPM                PCC               0.76±0.01   0.79±0.01   0.83±0.01   0.82±0.01
                    RMSE (kcal/mol)   0.98        0.91        0.84        0.84
How well does Pro-Maya perform on query mutations in proteins that are not homologous to any of the proteins in the training set?
Random Forests cross-validation results on the SRPM and MRPM subsets of PoPMuSiC-DB and Potapov-DB
SRPM – Single-Replacement Position Mutation
MRPM – Multi-Replacement Position Mutation
validation set results
Subset            Dataset        Performance measure   Pro-Maya (sequence-based)   Pro-Maya (structure-based)   Prethermut
All the dataset   Potapov-DB     PCC                   0.68±0.02                   0.77±0.01                    0.72±0.01
                                 RMSE (kcal/mol)       1.27                        1.09                         1.20
                  PoPMuSiC-DB    PCC                   0.69±0.01                   0.77±0.01                    0.71±0.01
                                 RMSE (kcal/mol)       1.06                        0.94                         1.05
SRPM              Potapov-DB     PCC                   0.44±0.03                   0.59±0.04                    0.57±0.03
                                 RMSE (kcal/mol)       1.45                        1.28                         1.30
                  PoPMuSiC-DB    PCC                   0.55±0.02                   0.64±0.02                    0.61±0.02
                                 RMSE (kcal/mol)       1.20                        1.11                         1.14
MRPM              Potapov-DB     PCC                   0.77±0.01                   0.83±0.01                    0.77±0.01
                                 RMSE (kcal/mol)       1.15                        0.98                         1.14
                  PoPMuSiC-DB    PCC                   0.76±0.01                   0.82±0.01                    0.75±0.01
                                 RMSE (kcal/mol)       0.97                        0.85                         0.99
Sequence-based Pro-Maya
Relative importance of Pro-Maya's features
[Figure: bar chart of the mean decrease in mean square error (axis from 0 to 1) for the features, in chart order: mutant and WT AA, isoelectric point deviation, number of sequences, protein flexibility, torsion angle potentials, molecular weight deviation, SIDCH, hydrophobicity deviation, solvent accessibility, PoPMuSiC-2.0, Prethermut]
Additional explanatory slides
More models
Bellkor collaborative-filtering model
The latent factor model
The Bellkor algorithm takes into account only the ΔΔG table:
Baseline estimators model
Rui = bui+
MU \ Position   1WQ549   1LZ12   4LYZ91   5DFR91   1EY071   1SAK342   1HMS64
A               -0.41    0.4     NA       NA       0.55     0.38      NA
.               .        .       .        .        .        .         .
.               .        .       .        .        .        .         .
W               -2.43    NA      NA       -1.34    NA       NA        NA
Y               4.5      NA      3.07     2.1      NA       2.3       NA
The neighborhood model
Find k positions with similar ΔΔG values
Learn the pairwise weights w_ij and c_ij for the neighboring positions:
$$R_{ui} = b_{ui} + |R^k(u;i)|^{-0.5} \sum_{j \in R^k(u;i)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$$
Relate the ΔΔG values of neighboring positions i and j
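A sketch of how the neighborhood contribution could be evaluated for one query, assuming the term has the form |R^k(u;i)|^-0.5 Σ_j [(r_uj − b_uj) w_ij + c_ij] over the k neighboring positions j that have a known value for amino acid u. The weights, baselines and position labels below are hypothetical; in the model they would be learned by stochastic gradient descent.

```python
def neighborhood_term(u, i, neighbors, r, b, w, c):
    """Sum the weighted baseline residuals of neighboring positions.

    neighbors: positions j similar to i with a known value for amino acid u
    r, b:      known ddG values and baseline estimates, keyed by (u, j)
    w, c:      learned pairwise weights and offsets, keyed by (i, j)
    """
    if not neighbors:
        return 0.0
    total = sum((r[(u, j)] - b[(u, j)]) * w[(i, j)] + c[(i, j)]
                for j in neighbors)
    return len(neighbors) ** -0.5 * total


# Hypothetical example: amino acid W at query position "p0" with four neighbors.
neighbors = ["p1", "p2", "p3", "p4"]
r = {("W", j): 1.5 for j in neighbors}   # known ddG values
b = {("W", j): 0.5 for j in neighbors}   # baseline estimates
w = {("p0", j): 0.5 for j in neighbors}  # pairwise weights
c = {("p0", j): 0.1 for j in neighbors}  # pairwise offsets

term = neighborhood_term("W", "p0", neighbors, r, b, w, c)
print(round(term, 3))
```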
Bellkor collaborative-filtering model
The Bellkor algorithm takes into account only the ΔΔG table:
Baseline estimators model
Rui = bui+
The neighborhood model
[Figure: matrix decomposition of the MU × Position matrix into two matrices with f latent factors each]
The latent factor model
MU \ Position   1WQ549   1LZ12   4LYZ91   1EY071   1SAK342
A               -0.41    0.4     NA       0.55     0.38
.               .        .       .        .        .
.               .        .       .        .        .
W               -2.43    NA      NA       NA       NA
Y               NA       NA      3.07     NA       2.3
Highlighted entry: -2.43
Break down the ΔΔG matrix into two matrices, p_u^T and q_i:
$$R_{ui} = b_{ui} + p_u^{T} q_i + |R^k(u;i)|^{-0.5} \sum_{j \in R^k(u;i)} \left[ (r_{uj} - b_{uj})\, w_{ij} + c_{ij} \right]$$
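The latent factor contribution is the inner product of an amino-acid factor vector p_u and a position factor vector q_i, each of length f. The sketch below evaluates bui + p_u · q_i for one cell; the factor values, the baseline and f = 2 are hypothetical, and in the model the factors would be learned from the known ΔΔG values.

```python
def latent_factor_prediction(b_ui, p_u, q_i):
    """Baseline plus the inner product of the two latent factor vectors."""
    if len(p_u) != len(q_i):
        raise ValueError("p_u and q_i must both have f entries")
    return b_ui + sum(p * q for p, q in zip(p_u, q_i))


# Hypothetical values with f = 2 latent factors.
b_ui = 0.5          # baseline estimate for amino acid u at position i
p_u = [1.0, 2.0]    # latent factors of the mutant amino acid
q_i = [0.5, -0.25]  # latent factors of the position

print(latent_factor_prediction(b_ui, p_u, q_i))
```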
Latent factors
f = number of latent factors
[Figure: the MU × Positions matrix (positions m ... 2, 1; rows A ... W, Y) is reduced to a MU × f matrix (factors f ... 1; rows A ... W, Y)]
Clustering of similar positions into a representative position while taking into account only the ΔΔG values
Latent factor model example