
20.06.2017

title: TMSEG

Michael Bernhofer

short title: pp1_tmseg

lecture: Protein Prediction 1 (for Informatics) – Protein structure

TUM summer semester

Last time…

More data available

• Re-training old methods is viable but no one does it

Less extensive machine learning

Runtime

Yet another transmembrane predictor?

166 membrane protein sequences (TMP166)

• TMH assignment from 3D-structure by OPM & PDBTM

Assignments differ, both used for training

• Map to UniProt sequence using SIFTS

• Redundancy reduction with UniqueProt at HVAL>0

Dataset – Transmembrane helices I

Lomize et al., 2006, Bioinformatics; Velankar et al., 2013, NAR; Kozma et al., 2013, NAR; Mika & Rost, 2003, NAR
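The HVAL>0 cutoff used for redundancy reduction can be illustrated with a small sketch. It assumes the usual HSSP-curve definition of HVAL (percentage sequence identity minus a length-dependent threshold); the function names are hypothetical, and the formula should be checked against Mika & Rost (2003) before relying on it.

```python
import math

def hssp_threshold(aligned_length):
    """Length-dependent identity threshold (assumed HSSP-curve form)."""
    if aligned_length <= 11:
        return 100.0
    if aligned_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
        return 480.0 * aligned_length ** exponent
    return 19.5

def hval(percent_identity, aligned_length):
    """Distance of an alignment from the threshold curve."""
    return percent_identity - hssp_threshold(aligned_length)

def is_redundant(percent_identity, aligned_length, cutoff=0.0):
    """A pair counts as redundant if its HVAL exceeds the cutoff (here 0)."""
    return hval(percent_identity, aligned_length) > cutoff
```

Under this definition, an alignment over 100 residues with 40% identity lies above the curve and would be treated as redundant, while one with 20% identity would not.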

Inside/outside topology assignment from OPM

Dataset – Transmembrane helices II

Lomize et al., 2006, Bioinformatics

Derived from the SignalP 4.0 training set

Redundancy reduced against the set of 166 TMPs at HVAL>0

Redundancy reduced within the set at HVAL>0

Dataset – Proteins w/ and w/o signal peptides

Petersen et al., 2011, Nat. Methods

Soluble: 1142 (452 w/ SP)

Membrane: 299 (25 w/ SP)

SP1441

Split into 4 subsets, maintaining distribution of TMPs, SPs and sequence lengths

Use 3 sets for cross-validation, keep one for final independent evaluation (Blind set)

Dataset – Split

TMP166

SP1441
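One simple way to obtain four subsets with similar proportions of TMPs, SPs and sequence lengths is a round-robin assignment within each class, sorted by length. This is only a sketch of the idea; the dictionary keys (`is_tmp`, `has_sp`, `length`) and the function name are hypothetical, and the slides do not say how the actual split was produced.

```python
from collections import defaultdict

def stratified_split(proteins, n_subsets=4):
    """Distribute proteins over n_subsets, balancing class and length."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[(protein["is_tmp"], protein["has_sp"])].append(protein)

    subsets = [[] for _ in range(n_subsets)]
    counter = 0
    for key in sorted(groups):
        # Sorting by length spreads short and long sequences evenly.
        for protein in sorted(groups[key], key=lambda p: p["length"]):
            subsets[counter % n_subsets].append(protein)
            counter += 1
    return subsets

# Toy data: 40 hypothetical proteins with alternating class labels.
toy = [
    {"id": i, "is_tmp": i % 2 == 0, "has_sp": i % 4 < 2, "length": 50 + i}
    for i in range(40)
]
folds = stratified_split(toy)
cross_validation_sets, blind_set = folds[:3], folds[3]
```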

Classification trees

Classification trees example

Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

Random forests

No black box

Intuitive to interpret

performance

Random forests - Popularity

Jensen & Bateman, 2011, Bioinformatics

TMSEG step 1 – Initial prediction

TMSEG overview – Step 1

Global features:

• Global amino acid composition

• Protein length

Local features:

• PSSM score

• Distance to N- and C-terminus

• Average hydrophobicity (Kyte-Doolittle)

• % hydrophobic

• % charged (positive & negative)

• % polar

TMSEG step 1 - Feature set I
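The local features listed above can be sketched for a single sequence window as follows. The Kyte-Doolittle values are the published scale; the residue groups, the window half-width and the function name are illustrative assumptions, not the TMSEG definitions, and the sketch ignores the PSSM-based variants.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Illustrative residue groups (assumptions for this sketch).
HYDROPHOBIC = set("ACFILMVW")
POSITIVE = set("KR")
NEGATIVE = set("DE")
POLAR = set("NQST")

def local_features(seq, center, half_width=9):
    """Features for the window around position `center` (0-based)."""
    start = max(0, center - half_width)
    window = seq[start:center + half_width + 1]
    n = len(window)

    def pct(group):
        return 100.0 * sum(aa in group for aa in window) / n

    return {
        "dist_n_term": center,
        "dist_c_term": len(seq) - 1 - center,
        "avg_hydrophobicity": sum(KD[aa] for aa in window) / n,
        "pct_hydrophobic": pct(HYDROPHOBIC),
        "pct_positive": pct(POSITIVE),
        "pct_negative": pct(NEGATIVE),
        "pct_polar": pct(POLAR),
    }
```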

Use homology information to improve prediction

• Position-specific scoring matrix (PSSM)

TMSEG step 1 - Feature set II

TMSEG step 1 - Feature set III

Adjusting for conservation

Substitutions with score > 0 = 16

Substitutions with score < 0 = 79

TMSEG step 1 - Feature set IV

Adjusting for conservation

Amino acid composition M (PSSM>0) = 1/16

TMSEG step 1 - Feature set V

Adjusting for conservation

Amino acid composition M (PSSM>0) = 1/16

Amino acid composition M (PSSM<0) = 3/79

TMSEG step 1 - Feature set VI

Adjusting for conservation

% positive charge (PSSM>0) = 2/16

% positive charge (PSSM<0) = 8/79
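The fractions above (for example 1/16 and 3/79) can be read as counts over PSSM entries: among all substitutions in the window with a positive score, how many fall on a given amino acid, and likewise for negative scores. A minimal sketch of that reading follows; representing each PSSM row as a dictionary and the function name are assumptions made for illustration.

```python
def split_composition(pssm_rows, amino_acid):
    """Count entries for `amino_acid` among positive and negative PSSM scores.

    pssm_rows: one dict per window position, mapping amino acid -> score.
    Returns ((hits_pos, total_pos), (hits_neg, total_neg)).
    """
    hits_pos = total_pos = hits_neg = total_neg = 0
    for row in pssm_rows:
        for aa, score in row.items():
            if score > 0:
                total_pos += 1
                hits_pos += aa == amino_acid
            elif score < 0:
                total_neg += 1
                hits_neg += aa == amino_acid
            # Entries with score == 0 are counted in neither group.
    return (hits_pos, total_pos), (hits_neg, total_neg)

# Toy window with two positions and three amino acids.
toy_rows = [
    {"M": 2, "L": 1, "K": -3},
    {"M": -1, "L": 4, "K": -2},
]
positive, negative = split_composition(toy_rows, "M")
```

In the toy window, M accounts for 1 of 3 positive entries and 1 of 3 negative entries, analogous to the 1/16 and 3/79 on the slides.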

Global features (number of inputs):

• Global amino acid composition: 2*20

• Protein length (binned): 1

Local features (number of inputs):

• PSSM score: 21*19

• Distance to N- and C-terminus: 2

• Average hydrophobicity (Kyte-Doolittle): 2*1

• % hydrophobic: 2*1

• % charged (positive & negative): 2*2

• % polar: 2*1

TMSEG step 1 - Feature set VII

Factor 2*: one value each for PSSM>0 and PSSM<0 (PSSM≶0)
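As a quick arithmetic check, the per-feature counts listed for step 1 add up to the 452 inputs quoted later for the first random forest (RF1). The dictionary below only restates the numbers from the slide.

```python
rf1_features = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM score": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity (Kyte-Doolittle)": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}
total_rf1 = sum(rf1_features.values())
print(total_rf1)
```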

TMSEG step 2 – Empirical filter

TMSEG step 2 – Example

SEQ: M G P R A R P A L L L L ...

SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...

SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...

TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...

Median filter

SIG: 400 400 400 400 600 700 700 600 600 600 ...

SOL: 500 500 500 400 100 100 100 100 100 100 ...

TMH: 100 200 200 300 300 200 200 300 300 300 ...

Adjust for overprediction

SIG: 400 400 400 400 600 700 700 600 600 600 ...

SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...

TMH: 040 140 140 240 240 140 140 240 240 240 ...

OUT: S S S S S S S S S S ...
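The example above can be reproduced with a short sketch. A median filter of width 5 with edge replication, followed by offsets of -185 (SOL) and -60 (TMH), yields exactly the numbers shown for the first ten positions; the window width and edge handling are inferred from the example and may differ from the actual TMSEG settings.

```python
from statistics import median

# Raw per-residue scores from the example.
RAW = {
    "SIG": [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800],
    "SOL": [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200],
    "TMH": [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0],
}
# Offsets read off the example (assumed constant along the sequence).
OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}

def median_filter(scores, width=5):
    """Median filter with edge replication (inferred, not confirmed)."""
    half = width // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(scores))]

def empirical_filter(raw):
    """Smooth each score track, apply offsets, pick the best class."""
    adjusted = {
        name: [s + OFFSETS[name] for s in median_filter(track)]
        for name, track in raw.items()
    }
    n = len(raw["SIG"])
    labels = [max(adjusted, key=lambda name: adjusted[name][i]) for i in range(n)]
    return adjusted, labels

adjusted, labels = empirical_filter(RAW)
```

Only the first ten positions are comparable with the slide, because the last two depend on residues that the example truncates.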

TMSEG overview – Step 1 & 2

TMSEG step 3 – Refine TMH prediction I

Neural Network (25 hidden nodes)

Input: TMH segments of variable length

Features:

• Amino acid composition: 2*20

• Average hydrophobicity (Kyte-Doolittle): 2*1

• % hydrophobic: 2*1

• % charged: 2*1

• Segment length (exact): 1

Factor 2*: one value each for PSSM>0 and PSSM<0 (PSSM≶0)

Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues)

• Keep two TMHs if higher average score after split

Adjust TMH endpoints by up to 3 residues in either direction

TMSEG step 3 – Refine TMH prediction II
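The splitting rule can be sketched as a search over cut positions. The `score` callable is a hypothetical stand-in for the neural network's segment score, and the sketch assumes the two resulting helices are directly adjacent; neither detail is stated on the slides.

```python
def best_split(start, end, score, min_len=17, min_total=35):
    """Split a long TMH in two if that raises the average segment score.

    start, end: inclusive residue indices of the predicted TMH.
    score: callable (start, end) -> float, hypothetical segment scorer.
    Returns a list with one or two (start, end) segments.
    """
    if end - start + 1 < min_total:
        return [(start, end)]

    best, best_score = [(start, end)], score(start, end)
    # Every cut that leaves at least min_len residues on both sides.
    for cut in range(start + min_len, end - min_len + 2):
        left, right = (start, cut - 1), (cut, end)
        average = (score(*left) + score(*right)) / 2
        if average > best_score:
            best, best_score = [left, right], average
    return best

# Toy scorer that prefers segments of about 20 residues.
def toy_score(s, e):
    return -abs((e - s + 1) - 20)
```

With the toy scorer, a 40-residue segment is split into two 20-residue helices, while a 21-residue segment is left untouched.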

TMSEG overview – Step 1-3

TMSEG step 4 – Topology prediction I

PSSM≶0

Consider only residues close to TMHs

• 15 residues next to TMHs and 8 residues into TMHs

Predict topology of N-terminus and extrapolate

If an SP is predicted: residues after the SP are “outside”

TMSEG step 4 – Topology prediction II
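Extrapolating from the N-terminus amounts to flipping the side at every TMH. A minimal sketch, with a hypothetical function name and the signal-peptide rule from the slide applied as an override:

```python
def extrapolate_topology(n_term_inside, n_tmh, signal_peptide=False):
    """Side of each non-membrane region, listed from N- to C-terminus."""
    if signal_peptide:
        # Residues after a predicted signal peptide are "outside".
        n_term_inside = False
    side = "inside" if n_term_inside else "outside"
    regions = [side]
    for _ in range(n_tmh):
        # Each transmembrane helix crosses the membrane once.
        side = "outside" if side == "inside" else "inside"
        regions.append(side)
    return regions
```

For a protein with two TMHs and an N-terminus predicted inside, this gives inside, outside, inside.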

TMSEG overview – Step 1-4

Per-residue measures often misleading

Score by TMH segments instead

Whole-protein scores: Qok and Qtop

Performance measures I

Performance measures II

What is a correctly predicted TMH?

Strict criteria

• Endpoint deviation ≤5 residues

• Overlap at least 50%

Performance measures III
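The two strict criteria can be written as a small predicate. The slide does not say what the 50% overlap is measured against; the sketch assumes it is the longer of the two helices, and the function name is hypothetical.

```python
def tmh_correct(observed, predicted, max_endpoint_dev=5, min_overlap=0.5):
    """Check one predicted TMH against one observed TMH.

    Both helices are (start, end) tuples with inclusive residue indices.
    """
    (o_start, o_end), (p_start, p_end) = observed, predicted
    endpoints_ok = (
        abs(o_start - p_start) <= max_endpoint_dev
        and abs(o_end - p_end) <= max_endpoint_dev
    )
    overlap = min(o_end, p_end) - max(o_start, p_start) + 1
    # Assumption: overlap is measured relative to the longer helix.
    longer = max(o_end - o_start, p_end - p_start) + 1
    return endpoints_ok and overlap >= min_overlap * longer
```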

ti: 100% if topology is correct, otherwise 0%

Performance measures IV

Performance of TMH predictions

Performance measures – TMP classification

Very low misclassification rates

TMP classification

Method        TMP sensitivity (%)   TMP FPR (%)   Topology correct (%)   Misclassified in human   More mistakes than TMSEG in human
TMSEG         98 ± 2                3 ± 1         93 ± 4                 558                      -
PolyPhobius   100 ± 0               5 ± 1         78 ± 7                 770                      212
MEMSAT3       100 ± 0               28 ± 2        93 ± 4                 4,313                    3,755
MEMSAT-SVM    98 ± 2                14 ± 2        88 ± 5                 2,253                    1,695
Baseline      95 ± 3                31 ± 2        75 ± 7                 5,015                    4,457
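The last column is simply the difference to TMSEG's 558 misclassified human proteins. The following lines restate the table values and recompute that column as a consistency check.

```python
misclassified_in_human = {
    "TMSEG": 558,
    "PolyPhobius": 770,
    "MEMSAT3": 4313,
    "MEMSAT-SVM": 2253,
    "Baseline": 5015,
}
more_than_tmseg = {
    method: count - misclassified_in_human["TMSEG"]
    for method, count in misclassified_in_human.items()
    if method != "TMSEG"
}
print(more_than_tmseg)
```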

How to get more data?

• Use what was published since starting work

Data unknown by any method

From 07/2013 to 02/2016:

• Only 12 new TMPs published

• Very small dataset

TMSEG predicts every TMH of the 10 recognized TMPs

Dataset of 12 new proteins

High modularity (steps 1-4)

Apply steps 3 and 4 to other methods

• Step 3: NN-based TMH prediction improvement

• Step 4: RF-based topology prediction

Can this improve other methods?

Applying TMSEG to other methods I

Applying TMSEG to other methods II

Re-entrant regions not modelled (little data)

Idea: Check “abnormal” TMH segments for re-entrant regions

• Does not switching topology increase scores?

Potential extensions

Feature space too big

Wanted: ~10 samples per free parameter

RF1: 452 features

• 2800, 10400, and 60900 samples

NN: 47 features (x25 hidden units)

• 2100 samples

RF2: 86 features

• only 40 samples!

More data desirable
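The gap to the rule of thumb of about 10 samples per free parameter can be made concrete with the numbers above. As a simplification, the sketch counts each input feature as one free parameter for the random forests and 47 x 25 input-to-hidden weights for the neural network; both are assumptions about how the slide counts parameters.

```python
TARGET_SAMPLES_PER_PARAMETER = 10

def samples_per_parameter(n_samples, n_parameters):
    return n_samples / n_parameters

ratios = {
    # RF1: 452 features, three training sets of different size.
    "RF1": [samples_per_parameter(n, 452) for n in (2800, 10400, 60900)],
    # NN: 47 features times 25 hidden units.
    "NN": [samples_per_parameter(2100, 47 * 25)],
    # RF2: 86 features, only 40 samples.
    "RF2": [samples_per_parameter(40, 86)],
}

for model, values in ratios.items():
    for value in values:
        status = "ok" if value >= TARGET_SAMPLES_PER_PARAMETER else "too few"
        print(f"{model}: {value:.2f} samples per parameter ({status})")
```

Under these assumptions, only the two larger RF1 training sets reach the target, which matches the conclusion that more data is desirable.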

Debian package: http://rostlab.org/debian/pool/main/t/tmseg/

GitHub: github.com/Rostlab/TMSEG

PredictProtein: predictprotein.org

Availability

Yachdav et al., 2014, NAR

Thank you


References

Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., … Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337–43. http://doi.org/10.1093/nar/gku366

Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331–3332. http://doi.org/10.1093/bioinformatics/btr585

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23. http://doi.org/10.1002/widm.8

Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623–5. http://doi.org/10.1093/bioinformatics/btk023

Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262–5. http://doi.org/10.1093/nar/gki058

Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524–9. http://doi.org/10.1093/nar/gks1169

Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789–3791. http://doi.org/10.1093/nar/gkg620

Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods, 8, 785–786.
