© Burkhard Rost 1
20.06.2017
title: TMSEG
Michael Bernhofer
short title: pp1_tmseg
lecture: Protein Prediction 1 (for Informatics) – Protein structure
TUM summer semester
Last time…
Yet another transmembrane predictor?
More data available
• Re-training old methods is viable but no one does it
Less extensive machine learning
Runtime
Dataset – Transmembrane helices I
166 membrane protein sequences (TMP166)
• TMH assignment from 3D-structure by OPM & PDBTM
  Assignments differ, both used for training
• Map to UniProt sequence using SIFTS
• Redundancy reduction with UniqueProt at HVAL>0
Lomize et al., 2006, Bioinformatics; Velankar et al., 2013, NAR
Kozma et al., 2013, NAR; Mika et al., 2003, NAR
Dataset – Transmembrane helices II
Inside/Outside topology assignment from OPM
Lomize et al., 2006, Bioinformatics
Dataset – Proteins w/ and w/o signal peptides
Derived from the SignalP 4.0 training set
Redundancy reduced against the set of 166 TMPs at HVAL>0
Redundancy reduced within the set at HVAL>0
SP1441:
Soluble: 1142 (452 w/ SP)
Membrane: 299 (25 w/ SP)
Petersen et al., 2011, Nat. Methods
Dataset – Split
Split into 4 subsets, maintaining distribution of
TMPs, SPs and sequence lengths
Use 3 sets for cross-validation, keep one for
final independent evaluation (Blind set)
[Figure: TMP166 and SP1441 each split into Train and Blind; counts shown: 41 (TMP166), 285 (SP1441)]
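One way to picture the split described on this slide is the sketch below. It is not the procedure used for TMSEG: the record keys (`id`, `is_tmp`, `has_sp`, `length`), the 100-residue length bins and the round-robin assignment are all assumptions made for illustration, and the data are made up.

```python
from collections import defaultdict

def stratified_split(proteins, n_subsets=4, length_bin=100):
    """Distribute proteins over n_subsets, stratum by stratum.

    proteins: list of dicts with the hypothetical keys
    'id', 'is_tmp', 'has_sp' and 'length'.
    """
    strata = defaultdict(list)
    for p in proteins:
        # Proteins sharing class labels and a coarse length bin form one stratum.
        key = (p["is_tmp"], p["has_sp"], p["length"] // length_bin)
        strata[key].append(p)

    subsets = [[] for _ in range(n_subsets)]
    slot = 0
    for key in sorted(strata):
        for p in sorted(strata[key], key=lambda q: q["id"]):
            # Round-robin assignment spreads every stratum evenly over the subsets.
            subsets[slot % n_subsets].append(p)
            slot += 1
    return subsets

# Toy data only: 166 made-up entries, not the real TMP166 set.
toy = [
    {"id": i, "is_tmp": True, "has_sp": i % 7 == 0, "length": 100 + (i * 37) % 900}
    for i in range(166)
]
subsets = stratified_split(toy)
train_folds, blind = subsets[:3], subsets[3]
print([len(s) for s in subsets])
```

The first three subsets would serve as cross-validation folds and the last one as the blind set that is only touched for the final evaluation.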
Classification trees
Classification trees example
Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Random forests
Random forests - Popularity
Fast
No black box
Intuitive to interpret
Good performance
Jensen et al., 2011, Bioinformatics
TMSEG step 1 – Initial prediction
TMSEG overview – Step 1
TMSEG step 1 - Feature set I
Global features:
• Global amino acid composition
• Protein length
Local features:
• PSSM score
• Distance to N- and C-terminus
• Average hydrophobicity (Kyte-Doolittle)
• % hydrophobic
• % charged (positive & negative)
• % polar
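A minimal sketch of how some of the local features listed above could be computed for one residue. The Kyte-Doolittle values are the published scale; the window half-width of 9, the residue groupings (`HYDROPHOBIC`, `POSITIVE`, `NEGATIVE`) and the function name are assumptions for illustration and are not taken from the slides.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Assumed residue groupings; the slides do not define them.
HYDROPHOBIC = set("AILMFVCW")
POSITIVE = set("KR")
NEGATIVE = set("DE")

def window_features(seq, center, half_width=9):
    """Local features for the residue at index `center` (0-based)."""
    start = max(0, center - half_width)
    end = min(len(seq), center + half_width + 1)
    window = seq[start:end]  # truncated at the sequence ends
    n = len(window)
    return {
        "avg_kd": sum(KD[aa] for aa in window) / n,
        "pct_hydrophobic": sum(aa in HYDROPHOBIC for aa in window) / n,
        "pct_positive": sum(aa in POSITIVE for aa in window) / n,
        "pct_negative": sum(aa in NEGATIVE for aa in window) / n,
        "dist_n_term": center,
        "dist_c_term": len(seq) - 1 - center,
    }

features = window_features("MGPRARPALLLL", center=9, half_width=2)
print(features)
```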
TMSEG step 1 - Feature set II
Use homology information to improve prediction
• Position-specific scoring matrix (PSSM)
TMSEG step 1 - Feature set III
Adjusting for conservation
…
Substitutions with score > 0 = 16
Substitutions with score < 0 = 79
TMSEG step 1 - Feature set IV
Adjusting for conservation
…
Amino acid composition M (PSSM>0) = 1/16
TMSEG step 1 - Feature set V
Adjusting for conservation
…
Amino acid composition M (PSSM>0) = 1/16
Amino acid composition M (PSSM<0) = 3/79
TMSEG step 1 - Feature set VI
Adjusting for conservation
…
% positive charge (PSSM>0) = 2/16
% positive charge (PSSM<0) = 8/79
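The counting behind fractions such as 1/16 and 3/79 can be sketched as follows. This is one reading of the slides, not a confirmed implementation: each PSSM cell with a positive score counts towards the "PSSM>0" composition of its amino acid, each negative cell towards the "PSSM<0" composition, and zero scores are ignored. The helper names and the toy PSSM rows are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def conservation_split(pssm_rows):
    """pssm_rows: one dict {amino acid: score} per residue in the window."""
    pos, neg = Counter(), Counter()
    for row in pssm_rows:
        for aa, score in row.items():
            if score > 0:
                pos[aa] += 1
            elif score < 0:
                neg[aa] += 1  # scores equal to 0 are ignored (assumption)

    def normalise(counts):
        total = sum(counts.values())
        return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

    return normalise(pos), normalise(neg)

def toy_row(positive):
    """Made-up PSSM row: -1 everywhere except the given positive scores."""
    row = {aa: -1 for aa in AMINO_ACIDS}
    row.update(positive)
    return row

rows = [toy_row({"L": 2, "M": 1}), toy_row({"L": 3})]
comp_pos, comp_neg = conservation_split(rows)
print(comp_pos["M"], comp_neg["M"])                          # M among PSSM>0 and PSSM<0 cells
print(comp_pos["K"] + comp_pos["R"])                         # % positive charge (PSSM>0)
```

In the toy window, 3 cells are positive and 37 are negative, so methionine contributes 1/3 to the PSSM>0 composition and 1/37 to the PSSM<0 composition.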
TMSEG step 1 - Feature set VII
Global features:
• Global amino acid composition (PSSM≶0): 2*20
• Protein length (binned): 1
Local features:
• PSSM score: 21*19
• Distance to N- and C-terminus: 2
• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1
• % hydrophobic (PSSM≶0): 2*1
• % charged (positive & negative) (PSSM≶0): 2*2
• % polar (PSSM≶0): 2*1
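Adding up the per-feature counts from this slide gives the size of the step 1 input vector. The short check below only restates the slide's numbers; the dictionary keys are descriptive labels, not identifiers from TMSEG.

```python
# Feature counts as listed on the slide.
step1_features = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM score": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity (Kyte-Doolittle)": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}
total_step1 = sum(step1_features.values())
print(total_step1)  # 452
```

The total of 452 agrees with the "RF1: 452 features" figure quoted near the end of the deck.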
TMSEG step 2 – Empirical filter
TMSEG step 2 – Example
SEQ: M   G   P   R   A   R   P   A   L   L   L   L   ...
SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...
Median filter
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...
Adjust for overprediction
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...
OUT: S   S   S   S   S   S   S   S   S   S   ...
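The numbers in this example can be reproduced with the sketch below. The window width of 5, the edge replication at the sequence start, and the offsets of -185 (SOL) and -60 (TMH) are inferred from the example values, not stated on the slide, so treat them as assumptions. Only the output letter `S` appears on the slide; the letters used for the other two classes are hypothetical.

```python
from statistics import median

def median_filter(scores, width=5):
    """Sliding median; positions beyond the ends reuse the edge value."""
    half = width // 2
    last = len(scores) - 1
    out = []
    for i in range(len(scores)):
        window = [scores[min(max(j, 0), last)] for j in range(i - half, i + half + 1)]
        out.append(median(window))
    return out

raw = {
    "SIG": [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800],
    "SOL": [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200],
    "TMH": [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0],
}
# Offsets inferred from the slide example (assumption).
OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}
# Only "S" is shown on the slide; "-" and "H" are placeholders.
LETTER = {"SIG": "S", "SOL": "-", "TMH": "H"}

# The slide truncates the sequence, so only the first 10 positions are comparable.
smoothed = {k: median_filter(v)[:10] for k, v in raw.items()}
adjusted = {k: [x + OFFSETS[k] for x in v] for k, v in smoothed.items()}
out = "".join(
    LETTER[max(adjusted, key=lambda k: adjusted[k][i])] for i in range(10)
)
print(smoothed["TMH"])
print(adjusted["SOL"])
print(out)
```

After the adjustment the signal-peptide score is the largest at each of the ten positions, which matches the `OUT` row of the example.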
TMSEG overview – Step 1 & 2
TMSEG step 3 – Refine TMH prediction I
Neural Network (25 hidden nodes)
Input: TMH segments of variable length
Features:
• Amino acid composition (PSSM≶0): 2*20
• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1
• % hydrophobic (PSSM≶0): 2*1
• % charged (PSSM≶0): 2*1
• Segment length (exact): 1
TMSEG step 3 – Refine TMH prediction II
Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues)
• Keep two TMHs if higher average score after split
Adjust TMH endpoints by up to 3 residues in either direction
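The two refinement rules on this slide could look roughly like the sketch below. Several points are assumptions: segments are half-open `(start, end)` index pairs, the two halves of a split are adjacent with no residue dropped between them, and `segment_score` is a hypothetical stand-in for the neural network's segment score (here a plain mean over made-up per-residue values).

```python
def refine_split(start, end, segment_score, min_len=17, long_len=35):
    """Split a long TMH in two if the halves score better on average."""
    if end - start < long_len:
        return [(start, end)]
    best, best_score = [(start, end)], segment_score(start, end)
    for cut in range(start + min_len, end - min_len + 1):
        avg = (segment_score(start, cut) + segment_score(cut, end)) / 2
        if avg > best_score:
            best, best_score = [(start, cut), (cut, end)], avg
    return best

def adjust_endpoints(start, end, segment_score, seq_len, max_shift=3):
    """Try every endpoint shift of up to max_shift residues, keep the best."""
    candidates = [
        (s, e)
        for s in range(max(0, start - max_shift), start + max_shift + 1)
        for e in range(end - max_shift, min(seq_len, end + max_shift) + 1)
        if e > s
    ]
    return max(candidates, key=lambda seg: segment_score(*seg))

# Made-up per-residue scores with a dip in the middle of a 38-residue segment.
toy_scores = [0.9] * 18 + [0.1] * 2 + [0.9] * 18

def mean_score(start, end):
    """Hypothetical segment score: mean of the toy per-residue scores."""
    return sum(toy_scores[start:end]) / (end - start)

parts = refine_split(0, 38, mean_score)
short = refine_split(0, 20, mean_score)
shifted = adjust_endpoints(15, 23, mean_score, len(toy_scores))
print(parts, short, shifted)
```

The 38-residue toy segment is split because the mean score of the two halves exceeds that of the whole, while the 20-residue segment is returned unchanged.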
TMSEG overview – Step 1-3
TMSEG step 4 – Topology prediction I
[Figure: feature list; three entries labelled PSSM≶0]
TMSEG step 4 – Topology prediction II
Consider only residues close to TMHs
• 15 residues next to TMHs and 8 residues into TMHs
Predict topology of N-terminus and extrapolate
If an SP is predicted: residues after the SP are “outside”
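The "predict the N-terminus and extrapolate" idea can be illustrated as follows: once the side of the N-terminus is known, every TMH flips the side of the next loop. This is a sketch under assumptions; the function name, the half-open `(start, end)` segments and the label letters (`i`, `o`, `M`, `S`) are hypothetical choices, not TMSEG's output format.

```python
def extrapolate_topology(n_term_side, tmhs, seq_len, sp_end=None):
    """Label every residue given the N-terminal side and the TMH segments.

    n_term_side: "inside" or "outside"
    tmhs: sorted list of half-open (start, end) TMH segments
    sp_end: length of a predicted signal peptide, or None
    """
    flip = {"inside": "outside", "outside": "inside"}
    symbol = {"inside": "i", "outside": "o"}
    labels, pos, side = [], 0, n_term_side
    if sp_end is not None:
        # Residues after a signal peptide are taken to be outside.
        labels.append("S" * sp_end)
        pos, side = sp_end, "outside"
    for start, end in tmhs:
        labels.append(symbol[side] * (start - pos))  # loop before the TMH
        labels.append("M" * (end - start))           # membrane segment
        pos, side = end, flip[side]                  # crossing the membrane flips the side
    labels.append(symbol[side] * (seq_len - pos))
    return "".join(labels)

two_helices = extrapolate_topology("inside", [(3, 6), (9, 12)], 15)
with_sp = extrapolate_topology("inside", [(4, 7)], 9, sp_end=2)
print(two_helices)
print(with_sp)
```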
TMSEG overview – Step 1-4
Performance measures I
Per-residue measures often misleading
Score by TMH segments instead
Whole-protein scores: Qok and Qtop
Performance measures II
Performance measures III
What is a correctly predicted TMH?
Strict criteria
• Endpoint deviation ≤5 residues
• Overlap at least 50%
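The strict criteria above translate into a small check like the one below. Two details are assumptions because the slide does not spell them out: both endpoints must individually lie within 5 residues, and the 50% overlap is measured against the longer of the two segments. Segments are inclusive `(start, end)` residue positions.

```python
def is_correct_tmh(predicted, observed, max_endpoint_dev=5, min_overlap=0.5):
    """True if a predicted TMH matches an observed one under the strict criteria."""
    p_start, p_end = predicted
    o_start, o_end = observed
    overlap = min(p_end, o_end) - max(p_start, o_start) + 1
    longer = max(p_end - p_start + 1, o_end - o_start + 1)
    endpoints_ok = (
        abs(p_start - o_start) <= max_endpoint_dev
        and abs(p_end - o_end) <= max_endpoint_dev
    )
    # Overlap measured relative to the longer segment (assumption).
    return endpoints_ok and overlap >= min_overlap * longer

close_match = is_correct_tmh((10, 30), (12, 33))
shifted_too_far = is_correct_tmh((10, 30), (18, 40))
print(close_match, shifted_too_far)
```

In the first call the endpoints deviate by 2 and 3 residues and 19 of 22 residues overlap, so the helix counts as correct; in the second call the start is off by 8 residues, so it does not.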
Performance measures IV
ti: 100% if topology is correct, otherwise 0%
Qtop = (1/N) Σ ti, averaged over all N proteins
Performance of TMH predictions
Performance measures – TMP classification
TMP classification
Very low misclassification rates
Method        TMP sensitivity (%)  TMP FPR (%)  Topology correct (%)  Misclassified in human  More mistakes than TMSEG in human
TMSEG         98 ± 2               3 ± 1        93 ± 4                558                     -
PolyPhobius   100 ± 0              5 ± 1        78 ± 7                770                     212
MEMSAT3       100 ± 0              28 ± 2       93 ± 4                4,313                   3,755
MEMSAT-SVM    98 ± 2               14 ± 2       88 ± 5                2,253                   1,695
Baseline      95 ± 3               31 ± 2       75 ± 7                5,015                   4,457
Dataset of 12 new proteins
How to get more data?
• Use what was published since starting work
  Data unknown by any method
From 07/2013 to 02/2016:
• Only 12 new TMPs published
• Very small dataset
TMSEG predicts every TMH of the 10 recognized TMPs
Applying TMSEG to other methods I
High modularity (steps 1-4)
Apply steps 3 and 4 to other methods
• Step 3: NN-based TMH prediction improvement
• Step 4: RF-based topology prediction
Can this improve other methods?
Applying TMSEG to other methods II
Potential extensions
Re-entrant regions not modelled (little data)
Idea: Check “abnormal” TMH segments for re-entrant regions
• Does not switching topology increase scores?
Feature space too big
Wanted: ~10 samples per free parameter
RF1: 452 features
• 2800, 10400, and 60900 samples
NN: 47 features (x25 hidden units)
• 2100 samples
RF2: 86 features
• only 40 samples!
More data desirable
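The imbalance on this slide can be made concrete with a little arithmetic. The sketch below uses only numbers from the deck; counting just the input-to-hidden weights of the neural network (47 × 25) as its free parameters and using "samples per feature" as a rough proxy for the random forests are simplifying assumptions.

```python
TARGET = 10  # wanted: ~10 samples per free parameter

# Step 3 neural network: feature counts from the step 3 slide.
nn_features = 2 * 20 + 2 * 1 + 2 * 1 + 2 * 1 + 1
nn_weights = nn_features * 25          # input-to-hidden weights only (assumption)
nn_ratio = 2100 / nn_weights

# Random forests: samples per feature as a rough proxy (assumption).
rf1_ratios = [n / 452 for n in (2800, 10400, 60900)]
rf2_ratio = 40 / 86

print(nn_features, nn_weights, round(nn_ratio, 2))
print([round(r, 2) for r in rf1_ratios], round(rf2_ratio, 2))
```

Under these assumptions the neural network has fewer than 2 samples per weight and RF2 has fewer than 1 sample per feature, both well below the target of about 10.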
Availability
Debian package: http://rostlab.org/debian/pool/main/t/tmseg/
GitHub: github.com/Rostlab/TMSEG
PredictProtein: predictprotein.org
Yachdav et al., 2014, NAR
Thank you
References
Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., … Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337–43. http://doi.org/10.1093/nar/gku366
Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331–3332. http://doi.org/10.1093/bioinformatics/btr585
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23. http://doi.org/10.1002/widm.8
Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623–5. http://doi.org/10.1093/bioinformatics/btk023
Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262–5. http://doi.org/10.1093/nar/gki058
Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524–9. http://doi.org/10.1093/nar/gks1169
Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789–3791. http://doi.org/10.1093/nar/gkg620
Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods, 8, 785–786.