
Page 1: TMSEG

© Burkhard Rost 1

20.06.2017

title: TMSEG

Michael Bernhofer

short title: pp1_tmseg

lecture: Protein Prediction 1 (for Informatics) – Protein structure

TUM summer semester

Page 2: Last time…


Page 4: Yet another transmembrane predictor?

More data available

• Re-training old methods is viable but no one does it

Less extensive machine learning

Runtime

Page 5: Dataset – Transmembrane helices I

166 membrane protein sequences (TMP166)

• TMH assignment from 3D structure by OPM & PDBTM

Assignments differ; both were used for training

• Map to UniProt sequence using SIFTS

• Redundancy reduction with UniqueProt at HVAL>0

Lomize et al., 2006, Bioinformatics; Velankar et al., 2013, NAR; Kozma et al., 2013, NAR; Mika et al., 2003, NAR
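The slide gives the redundancy cutoff (HVAL>0) but not how HVAL is computed. The sketch below assumes the usual HSSP-curve definition used with UniqueProt, written from memory and worth checking against Mika & Rost (2003) before relying on it. The function names are illustrative only and are not part of UniqueProt.

```python
import math


def hssp_threshold(aligned_length: int) -> float:
    """Percent identity the HSSP curve requires at this alignment length.

    Assumed form of the curve (not stated on the slide):
      L <= 11        -> 100
      11 < L <= 450  -> 480 * L ** (-0.32 * (1 + exp(-L / 1000)))
      L > 450        -> 19.5
    """
    if aligned_length <= 11:
        return 100.0
    if aligned_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
        return 480.0 * aligned_length ** exponent
    return 19.5


def hval(percent_identity: float, aligned_length: int) -> float:
    """Distance of an alignment from the HSSP curve, in percentage points."""
    return percent_identity - hssp_threshold(aligned_length)


def is_redundant(percent_identity: float, aligned_length: int,
                 cutoff: float = 0.0) -> bool:
    """True if a sequence pair would count as redundant at HVAL > cutoff."""
    return hval(percent_identity, aligned_length) > cutoff
```

Under this assumption, HVAL>0 means that no two proteins kept in the set align above the curve, for example at more than roughly 20% identity over long alignments.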

Page 6: Dataset – Transmembrane helices II

Inside/outside topology assignment: OPM

Lomize et al., 2006, Bioinformatics

Page 7: Dataset – Proteins w/ and w/o signal peptides

Derived from the SignalP 4.0 training set

Redundancy reduced against the set of 166 TMPs at HVAL>0

Redundancy reduced within the set at HVAL>0

SP1441:

• Soluble: 1142 (452 w/ SP)

• Membrane: 299 (25 w/ SP)

Petersen et al., 2011, Nat. Methods

Page 8: Dataset – Split

Split into 4 subsets, maintaining the distribution of TMPs, SPs and sequence lengths

Use 3 sets for cross-validation, keep one for final independent evaluation (Blind set)

[Figure: TMP166 and SP1441, each divided into Train and Blind; numbers shown: 41 (TMP166), 285 (SP1441)]
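One way to obtain four subsets with similar distributions of TMPs, SPs and sequence lengths is to deal proteins out round-robin after grouping by class and sorting by length. This is only a sketch of that idea, not the procedure actually used for TMSEG. The record keys (`id`, `is_tmp`, `has_sp`, `length`) are hypothetical.

```python
from collections import defaultdict


def stratified_split(proteins, n_subsets=4):
    """Deal proteins into n_subsets with similar class and length distributions.

    proteins: list of dicts with the (hypothetical) keys
              'id', 'is_tmp', 'has_sp' and 'length'.
    """
    # Group by class so that every subset receives its share of each class.
    groups = defaultdict(list)
    for protein in proteins:
        groups[(protein["is_tmp"], protein["has_sp"])].append(protein)

    subsets = [[] for _ in range(n_subsets)]
    counter = 0
    for key in sorted(groups):
        # Sorting by length spreads short and long sequences over all subsets.
        for protein in sorted(groups[key], key=lambda p: p["length"]):
            subsets[counter % n_subsets].append(protein)
            counter += 1
    return subsets


def cross_validation_and_blind(subsets):
    """Use the last subset as the blind set and the rest for cross-validation."""
    return subsets[:-1], subsets[-1]
```

The blind subset is set aside once and only touched for the final evaluation. The remaining three subsets rotate through the training and validation roles.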

Page 9: Classification trees

Page 10: Classification trees example

Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

Page 11: Random forests

Page 12: Random forests - Popularity

Fast

No black box

Intuitive to interpret

Good performance

Jensen et al., 2011, Bioinformatics

Page 13: TMSEG step 1 – Initial prediction

Page 14: TMSEG overview – Step 1

Page 15: TMSEG step 1 - Feature set I

Global features:

• Global amino acid composition

• Protein length

Local features:

• PSSM score

• Distance to N- and C-terminus

• Average hydrophobicity (Kyte-Doolittle)

• % hydrophobic

• % charged (positive & negative)

• % polar
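The local features listed on this slide can be sketched for a plain sequence window as follows. The Kyte-Doolittle values are the published scale. Everything else is an assumption made for illustration: the window width of 19, the residue groupings (hydrophobic, polar, charged) and the function name are not taken from the slides. TMSEG itself derives these features from the PSSM rather than from the raw sequence.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue groupings are assumptions; the slides do not define them.
HYDROPHOBIC = set("AVLIMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")
POLAR = set("STNQYH")


def local_features(seq, center, width=19):
    """Window-based features for the residue at index `center` (0-based)."""
    half = width // 2
    # The window is simply truncated at the sequence ends.
    window = seq[max(0, center - half): center + half + 1]
    n = len(window)

    def fraction(group):
        return sum(aa in group for aa in window) / n

    return {
        "dist_n_term": center,
        "dist_c_term": len(seq) - 1 - center,
        # Unknown residues (e.g. X) contribute 0.0 to the average.
        "avg_hydrophobicity": sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in window) / n,
        "pct_hydrophobic": fraction(HYDROPHOBIC),
        "pct_positive": fraction(POSITIVE),
        "pct_negative": fraction(NEGATIVE),
        "pct_polar": fraction(POLAR),
    }
```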

Page 16: TMSEG step 1 - Feature set II

Use homology information to improve prediction

• Position-specific scoring matrix (PSSM)

Page 17: TMSEG step 1 - Feature set III: Adjusting for conservation

Substitutions with score > 0 = 16

Substitutions with score < 0 = 79

Page 18: TMSEG step 1 - Feature set IV: Adjusting for conservation

Amino acid composition M (PSSM>0) = 1/16

Page 19: TMSEG step 1 - Feature set V: Adjusting for conservation

Amino acid composition M (PSSM>0) = 1/16

Amino acid composition M (PSSM<0) = 3/79

Page 20: TMSEG step 1 - Feature set VI: Adjusting for conservation

% positive charge (PSSM>0) = 2/16

% positive charge (PSSM<0) = 8/79
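The fractions on these slides (1/16, 3/79, 2/16, 8/79) can be read as counts over the PSSM entries of a window, taken separately for positive and negative scores. The sketch below implements that reading. It is an interpretation of the slide numbers, not the TMSEG source code. The column order is the usual PSI-BLAST order, and the function name is hypothetical.

```python
# Usual column order of a PSI-BLAST PSSM.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def sign_split_composition(pssm_window, alphabet=AMINO_ACIDS):
    """Composition of positive- and negative-scoring PSSM entries.

    pssm_window: list of rows (one per residue), each with one integer
                 score per amino acid in `alphabet`.
    Returns (comp_pos, comp_neg, n_pos, n_neg); zero scores are ignored.
    """
    pos_counts = {aa: 0 for aa in alphabet}
    neg_counts = {aa: 0 for aa in alphabet}
    for row in pssm_window:
        for aa, score in zip(alphabet, row):
            if score > 0:
                pos_counts[aa] += 1
            elif score < 0:
                neg_counts[aa] += 1

    n_pos = sum(pos_counts.values())  # e.g. 16 on the slide
    n_neg = sum(neg_counts.values())  # e.g. 79 on the slide
    comp_pos = {aa: (c / n_pos if n_pos else 0.0) for aa, c in pos_counts.items()}
    comp_neg = {aa: (c / n_neg if n_neg else 0.0) for aa, c in neg_counts.items()}
    return comp_pos, comp_neg, n_pos, n_neg
```

With this reading, "Amino acid composition M (PSSM>0) = 1/16" is `comp_pos["M"]`. Grouped features such as "% positive charge" are sums over the relevant amino acids, for example `comp_pos["K"] + comp_pos["R"]`. Counting only K and R as positively charged is again an assumption.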

Page 21: TMSEG step 1 - Feature set VII

Global features:

• Global amino acid composition (PSSM≶0): 2*20

• Protein length (binned): 1

Local features:

• PSSM score: 21*19

• Distance to N- and C-terminus: 2

• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1

• % hydrophobic (PSSM≶0): 2*1

• % charged (positive & negative) (PSSM≶0): 2*2

• % polar (PSSM≶0): 2*1
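The per-feature counts on this slide add up to the 452 features quoted for the first random forest later in the deck. The short check below spells out that sum. The dictionary keys simply repeat the slide's feature names. Reading the factor 2 as "once for PSSM>0 and once for PSSM<0" is an interpretation of the "PSSM≶0" labels.

```python
# Feature counts as listed on the slide.
RF1_FEATURES = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM score": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity (Kyte-Doolittle)": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}

total_features = sum(RF1_FEATURES.values())
print(total_features)  # 452
```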

Page 22: TMSEG step 2 – Empirical filter

Page 23: TMSEG step 2 – Example

SEQ:   M   G   P   R   A   R   P   A   L   L   L   L ...
SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...

Median filter

SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...

Adjust for overprediction

SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...

OUT:   S   S   S   S   S   S   S   S   S   S ...
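The numbers in this example can be reproduced with a median filter of width 5 that repeats the first and last value at the sequence ends, followed by a constant offset per class (0 for SIG, -185 for SOL, -60 for TMH) and a per-residue maximum. The window width, the edge handling and the offsets are inferred from the example values. They are not stated on the slide, and the real filter may differ elsewhere. Only the first ten positions can be compared, because the slide truncates the score rows.

```python
# Raw scores from the example slide (first 12 residues of the sequence).
SIG = [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800]
SOL = [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200]
TMH = [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0]

# Offsets inferred from the "Adjust for overprediction" rows.
OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}


def median_filter(scores, width=5):
    """Sliding median; the ends are padded by repeating the edge values."""
    half = width // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [sorted(padded[i:i + width])[half] for i in range(len(scores))]


def classify(raw_scores):
    """Smooth, apply offsets, and pick the highest-scoring class per residue."""
    adjusted = {
        name: [value + OFFSETS[name] for value in median_filter(scores)]
        for name, scores in raw_scores.items()
    }
    length = len(next(iter(raw_scores.values())))
    labels = [max(adjusted, key=lambda name: adjusted[name][i])
              for i in range(length)]
    return adjusted, labels


adjusted, labels = classify({"SIG": SIG, "SOL": SOL, "TMH": TMH})
print(adjusted["SOL"][:10])  # [315, 315, 315, 215, -85, -85, -85, -85, -85, -85]
print(labels[:10])           # ten times 'SIG', shown as S on the slide
```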

Page 24: TMSEG overview – Step 1 & 2

Page 25: TMSEG step 3 – Refine TMH prediction I

Neural Network (25 hidden nodes)

Input: TMH segments of variable length

Features:

• Amino acid composition (PSSM≶0): 2*20

• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1

• % hydrophobic (PSSM≶0): 2*1

• % charged (PSSM≶0): 2*1

• Segment length (exact): 1

Page 26: TMSEG step 3 – Refine TMH prediction II

Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues each)

• Keep the two TMHs if their average score is higher than the score of the unsplit TMH

Adjust TMH endpoints by up to 3 residues in either direction
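The two refinement rules can be sketched as a small search over candidate segments. In the sketch, `score(start, end)` is a hypothetical stand-in for the neural network's output on a segment (inclusive, 0-based indices). Two further assumptions are made: the two halves of a split are contiguous, and the best-scoring endpoint shift is chosen. The slide does not confirm either of these.

```python
def refine_segment(start, end, score, long_len=35, min_len=17):
    """Split a long TMH into two if that raises the average score."""
    if end - start + 1 < long_len:
        return [(start, end)]

    best, best_score = [(start, end)], score(start, end)
    # Try every cut that leaves at least min_len residues on both sides.
    for cut in range(start + min_len, end - min_len + 2):
        first, second = (start, cut - 1), (cut, end)
        average = (score(*first) + score(*second)) / 2
        if average > best_score:
            best, best_score = [first, second], average
    return best


def adjust_endpoints(start, end, score, seq_len=None, max_shift=3):
    """Move each endpoint by up to max_shift residues if the score improves."""
    best, best_score = (start, end), score(start, end)
    for d_start in range(-max_shift, max_shift + 1):
        for d_end in range(-max_shift, max_shift + 1):
            s, e = start + d_start, end + d_end
            if s < 0 or s > e or (seq_len is not None and e >= seq_len):
                continue  # segment would leave the sequence or be empty
            candidate = score(s, e)
            if candidate > best_score:
                best, best_score = (s, e), candidate
    return best
```

A toy score such as `lambda s, e: -abs((e - s + 1) - 20)`, which prefers 20-residue segments, splits a 40-residue segment into two halves of 20 and trims a 23-residue segment down to 20.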

Page 27: TMSEG overview – Step 1-3

Page 28: TMSEG step 4 – Topology prediction I

[Figure; only the label "PSSM≶0" (three times) survives]

Page 29: TMSEG step 4 – Topology prediction II

Consider only residues close to TMHs

• 15 residues next to TMHs and 8 residues into TMHs

Predict topology of N-terminus and extrapolate

If an SP is predicted: residues after the SP are “outside”
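Extrapolating from the N-terminus means that the side flips at every TMH. The sketch below shows this, together with one possible reading of the "15 residues next to, 8 residues into" rule. The label letters (`i`, `o`, `M`, `S`), the function names and the exact flank boundaries are assumptions. The random forest that decides the N-terminal side is not modelled here and is passed in as a boolean.

```python
def flank_regions(start, end, seq_len, outside=15, inside=8):
    """Residue ranges (inclusive) considered on both sides of one TMH.

    Assumed reading: up to 15 residues outside the helix plus the first
    (or last) 8 residues inside it.
    """
    left = (max(0, start - outside), min(end, start + inside - 1))
    right = (max(start, end - inside + 1), min(seq_len - 1, end + outside))
    return left, right


def extrapolate_topology(seq_len, tmhs, n_term_inside, signal_peptide_end=None):
    """Per-residue labels: i = inside, o = outside, M = TMH, S = signal peptide.

    tmhs: list of (start, end) tuples, inclusive, 0-based.
    """
    labels = []
    position = 0
    side = "i" if n_term_inside else "o"
    if signal_peptide_end is not None:
        labels.extend("S" * (signal_peptide_end + 1))
        position = signal_peptide_end + 1
        side = "o"  # residues after a signal peptide are outside

    for start, end in sorted(tmhs):
        labels.extend(side * (start - position))  # loop before the helix
        labels.extend("M" * (end - start + 1))    # the helix itself
        side = "o" if side == "i" else "i"        # crossing flips the side
        position = end + 1
    labels.extend(side * (seq_len - position))    # C-terminal tail
    return "".join(labels)
```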

Page 30: TMSEG overview – Step 1-4

Page 31: Performance measures I

Per-residue measures often misleading

Score by TMH segments instead

Whole-protein scores: Qok and Qtop

Page 32: Performance measures II

Page 33: Performance measures III

What is a correctly predicted TMH?

Strict criteria

• Endpoint deviation ≤5 residues

• Overlap at least 50%
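The two criteria translate directly into a check on a pair of segments. The slide does not say what the 50% refers to. The sketch assumes that the overlap must cover at least half of the longer of the two helices. Segments are (start, end) tuples with inclusive indices, and the function name is illustrative.

```python
def tmh_correct(observed, predicted, max_deviation=5, min_overlap=0.5):
    """True if a predicted TMH matches an observed one under the strict criteria."""
    obs_start, obs_end = observed
    pred_start, pred_end = predicted

    # Criterion 1: neither endpoint deviates by more than max_deviation residues.
    if abs(obs_start - pred_start) > max_deviation:
        return False
    if abs(obs_end - pred_end) > max_deviation:
        return False

    # Criterion 2: overlap of at least min_overlap of the longer helix (assumption).
    overlap = max(0, min(obs_end, pred_end) - max(obs_start, pred_start) + 1)
    longer = max(obs_end - obs_start + 1, pred_end - pred_start + 1)
    return overlap >= min_overlap * longer
```

The overlap criterion matters mainly for short segments. Two five-residue segments at (10, 14) and (15, 19) satisfy the endpoint rule but do not overlap at all.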

Page 34: Performance measures IV

ti: 100% if topology is correct, otherwise 0%

Qtop: [formula missing in transcript]
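The formula for Qtop did not survive in the transcript. Given the definition of t_i and the description of Qtop as a whole-protein score, a plausible reconstruction is the average of t_i over all N evaluated proteins. This form is an assumption. In particular, the transcript does not show which proteins are counted in N.

```latex
Q_{top} = \frac{1}{N} \sum_{i=1}^{N} t_i,
\qquad
t_i =
\begin{cases}
100\% & \text{if the topology of protein } i \text{ is predicted correctly} \\
0\%   & \text{otherwise}
\end{cases}
```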

Page 35: Performance of TMH predictions

Page 36: Performance measures – TMP classification

Page 37: TMP classification

Very low misclassification rates

Method       TMP sensitivity (%)  TMP FPR (%)  Topology correct (%)  Misclassified in human  More mistakes than TMSEG in human
TMSEG        98 ± 2               3 ± 1        93 ± 4                558                     -
PolyPhobius  100 ± 0              5 ± 1        78 ± 7                770                     212
MEMSAT3      100 ± 0              28 ± 2       93 ± 4                4,313                   3,755
MEMSAT-SVM   98 ± 2               14 ± 2       88 ± 5                2,253                   1,695
Baseline     95 ± 3               31 ± 2       75 ± 7                5,015                   4,457
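The last column of the table is the difference between each method's number of misclassified human proteins and the TMSEG count. The following check reproduces it from the values in the table.

```python
# Misclassified human proteins per method, taken from the table.
MISCLASSIFIED = {
    "TMSEG": 558,
    "PolyPhobius": 770,
    "MEMSAT3": 4313,
    "MEMSAT-SVM": 2253,
    "Baseline": 5015,
}

# Additional mistakes relative to TMSEG.
extra_mistakes = {
    method: count - MISCLASSIFIED["TMSEG"]
    for method, count in MISCLASSIFIED.items()
    if method != "TMSEG"
}
print(extra_mistakes)
# {'PolyPhobius': 212, 'MEMSAT3': 3755, 'MEMSAT-SVM': 1695, 'Baseline': 4457}
```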

Page 38: Dataset of 12 new proteins

How to get more data?

• Use what was published since starting work

Data not seen by any method

From 07/2013 to 02/2016:

• Only 12 new TMPs published

• Very small dataset

TMSEG predicts every TMH of the 10 recognized TMPs

Page 39: Applying TMSEG to other methods I

High modularity (steps 1-4)

Apply steps 3 and 4 to other methods

• Step 3: NN-based TMH prediction improvement

• Step 4: RF-based topology prediction

Can this improve other methods?

Page 40: Applying TMSEG to other methods II

Page 41: Potential extensions

Re-entrant regions not modelled (little data)

Idea: Check “abnormal” TMH segments for re-entrant regions

• Does the score increase if the topology is not switched?

Page 42: More data desirable

Feature space too big

Wanted: ~10 samples per free parameter

RF1: 452 features

• 2800, 10400, and 60900 samples

NN: 47 features (x25 hidden units)

• 2100 samples

RF2: 86 features

• only 40 samples!
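To see how far each model is from the "~10 samples per free parameter" target, the numbers on this slide can be turned into ratios. What counts as a free parameter is not defined on the slide. As a rough proxy, the sketch uses the number of features for the random forests and features × hidden units for the neural network, ignoring biases and output weights.

```python
# Numbers from the slide; "parameters" is a rough proxy (see lead-in).
MODELS = {
    "RF1": {"parameters": 452, "samples": [2800, 10400, 60900]},
    "NN": {"parameters": 47 * 25, "samples": [2100]},
    "RF2": {"parameters": 86, "samples": [40]},
}

TARGET = 10  # desired samples per free parameter


def samples_per_parameter(model):
    """Ratio of samples to (proxy) free parameters for each sample count."""
    return [n / model["parameters"] for n in model["samples"]]


for name, model in MODELS.items():
    ratios = ", ".join(f"{r:.2f}" for r in samples_per_parameter(model))
    print(f"{name}: {ratios} samples per parameter (target ~{TARGET})")
```

Under this proxy, RF2 has fewer than one sample per feature. That matches the slide's emphasis on "only 40 samples!".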

Page 43: Availability

Debian package: http://rostlab.org/debian/pool/main/t/tmseg/

GitHub: github.com/Rostlab/TMSEG

PredictProtein: predictprotein.org

Yachdav et al., 2014, NAR

Page 44: Thank you

Page 45: References

Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., … Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337–43. http://doi.org/10.1093/nar/gku366

Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331–3332. http://doi.org/10.1093/bioinformatics/btr585

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23. http://doi.org/10.1002/widm.8

Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623–5. http://doi.org/10.1093/bioinformatics/btk023

Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262–5. http://doi.org/10.1093/nar/gki058

Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524–9. http://doi.org/10.1093/nar/gks1169

Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789–3791. http://doi.org/10.1093/nar/gkg620

Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods, 8, 785–786.