Transmembrane Protein Prediction

Project Presentation

CMPUT 606

Overview

Transmembrane (TM) protein: associated with the plasma membrane
“A protein that has domains exposed on both sides of the membrane” [Genes VII]
Some of the TM proteins that span the lipid layer several times form a hydrophilic channel that permits various ions and molecules to circulate through the plasma membrane.

Transmembrane Proteins

Transmembrane Segments

Ion Channels

Transmembrane Domains

Data Sets

Data Set    Brief Description

DB-TMR Database of TM segments (not fasta). After translation into fasta: DB-TMR40672.fasta (TM segments flanked by 5 amino acids at each end of the segment) and DB-TMR40672onlytm.fasta (only TM segments). They each contain 40672 protein sequences.

PDB Database of 3D structures in PDB format. After translation into fasta and removing all nucleotide sequences: pdb61042.fasta. The file PDBseqsnontm.fasta, which contains 645 globular proteins, constitutes a negative test set. Testing the TMHMM predictor on the protein chains extracted from PDB produced outputTMHMMonPDB61042.txt, a file containing the prediction for each of the 61042 sequences. Of these, 1294 were predicted to be TM. The sequences predicted as TM are stored in fasta format in seqOutputTMHMM1294.fasta (each entry is prefixed with “sp|” to mark it as TM for prediction and testing purposes).

PDB_TM Database of TM proteins in XML format. The site provides lists of PDB IDs representing TM proteins (test_mem.txt) and globular proteins (globpdb.txt). From these initial lists, the following fasta files are generated: tm.fasta with 916 TM protein chains, nontm.fasta with 900 protein chains, and both.fasta with 1816 protein chains. We also have two files obtained from the PDB_TM site, pdbtm_all.seq with 1363 chains.

TMHMMset160 This dataset is used to train TMHMM and comprises 160 protein sequences in fasta format. They are all TM proteins and are preceded by “sp|”.

TMPDB Database of 302 verified transmembrane protein sequences, together with their TM domain location and number, in SwissProt format. After translation into fasta for all the TM categories: alpha helix non-redundant (231), alpha buried non-redundant (7), and beta non-redundant (15), the file obtained is the sum of these three files and it contains 253 protein sequences, tmpdb253.fasta in fasta format (“sp|”).

TMbase Database of transmembrane proteins and their helical membrane-spanning domains. It is mainly based on Swiss-Prot.
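Several of the files above mark TM entries with an “sp|” header prefix, and the HMMer output later in the deck shows “nontm|” headers for globular chains. A minimal sketch of how such a labelled fasta file could be read, assuming plain fasta text and that only the “sp|” prefix marks a TM entry; the function names and the sample records are hypothetical and not taken from the real files:

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from fasta-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def label_of(header):
    """Assumed convention: 'sp|' marks a TM entry, anything else is non-TM."""
    return "TM" if header.startswith("sp|") else "non-TM"


# Hypothetical toy records; the sequences are made up.
sample = """>sp|toy_tm_chain
MKVLAAGIVG
LLLAVF
>nontm|toy_globular_chain
MSDKEEQR
"""

records = list(read_fasta(sample))
counts = {}
for header, _sequence in records:
    counts[label_of(header)] = counts.get(label_of(header), 0) + 1
print(counts)
```

Keeping the label in the header means one file such as both.fasta can carry both classes without a separate annotation file.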

Predictors

ePST bPST TMHMM TMpred HMMTOP HMMer TMDET

Predictors

Predictor    Major/Minor Contribution    Impact

TMHMM Predictor for TM helices. Based on TMHMM predictions, the authors estimated that 20-30 % of all genes in most genomes encode membrane proteins.

In July 2001: rated best for prediction of TM helices. The accuracy reported is 97-98%.

TMpred Predictor for membrane spanning regions and their orientation. The underlying algorithm is based on the statistical analysis of TMbase. The prediction is made using a combination of several weight-matrices for scoring.

Still a reference comparison for TM protein prediction.

HMMer Searches for homologues of a sequence family. Builds an HMM from the training data and searches a sequence database with that model to find homologues. The model is built from an input file on which a multiple sequence alignment (MSA) has been performed.

Improves upon the methods for sensitive database searches using multiple sequence alignments as queries.

HMMTOP Builds on an HMM architecture. The training model is a regularizer that is estimated from a set of known TM proteins. The prediction model is estimated from the query sequence and then it is used to predict the structure of that sequence. The server only accepts one test sequence at a time.

The accuracy reported is 78%.

TMDET Predictor for transmembrane domains. Based only on the structural information (3D) of the protein. Determines the membrane planes relative to the position of atomic coordinates. A discrimination function separates TM and globular proteins even in cases of low resolution or incomplete structures such as fragments or parts of large multi chain complexes. First algorithm that uses the 3D structure as input, identifies TM proteins, and determines membrane location. This method can be used to annotate protein structures having TM segments.

Generates PDB_TM: automatically updated database for TM proteins from PDB. The algorithm can also construct a globular protein database.

bPST Histories are represented in the tree. Alternative approach for detecting significant patterns in protein sequences based on probabilistic suffix trees (PSTs) without any prior information about the input sequences and without the prior alignment of the input sequences.

The PST model detects many more related sequences than pairwise methods; it is much faster than an HMM and almost as sensitive.

ePST Training sequences are represented in the tree. Prediction of the probability of a protein sequence function using an efficient PST is possible in linear time.

Good results for protein function prediction.
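bPST and ePST are both described above as probabilistic suffix trees. A minimal sketch of the underlying idea, assuming a plain dictionary of contexts in place of a real tree, no pruning, and add-one smoothing. None of these choices, nor the names `train_counts`, `log_prob`, and `score_sequence`, come from the ePST or bPST implementations; the training sequences are made up.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_counts(sequences, max_depth):
    """Count next-symbol occurrences for every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(i, max_depth) + 1):
                counts[seq[i - depth:i]][symbol] += 1
    return counts


def log_prob(counts, context, symbol, max_depth, pseudo=1.0):
    """log P(symbol | longest suffix of context seen in training)."""
    suffix = context[-max_depth:] if max_depth else ""
    while suffix and suffix not in counts:
        suffix = suffix[1:]  # back off to a shorter suffix
    seen = counts.get(suffix, {})
    total = sum(seen.values()) + pseudo * len(ALPHABET)
    return math.log((seen.get(symbol, 0) + pseudo) / total)


def score_sequence(counts, seq, max_depth):
    """Per-residue log-probabilities; each residue looks at most max_depth back."""
    return [
        log_prob(counts, seq[max(0, i - max_depth):i], seq[i], max_depth)
        for i in range(len(seq))
    ]


training = ["LLVLFMTIPGIALF", "AVLLGIALLV"]  # hypothetical toy sequences
counts = train_counts(training, max_depth=3)
scores = score_sequence(counts, "GIALF", max_depth=3)
print([round(s, 2) for s in scores])
```

Because each residue is scored from a bounded context, scoring a sequence takes time proportional to its length, which is the linear-time property the ePST row mentions.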

Predictors Performance: Theoretical Time

TMHMM

Short form prediction
sp_1xqe_A len=418 ExpAA=243.54 First60=39.67 PredHel=11 Topology=o10-32i45-67o98-120i127-149o159-181i193-215o225-247i259-281o285-302i315-337o352-374i
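The short-form line above packs the helix positions into the Topology field. A minimal sketch of how that line could be split into fields and helix ranges, assuming whitespace-separated key=value fields after the identifier; the function names are hypothetical:

```python
import re


def parse_tmhmm_short(line):
    """Split a short-format line into an id plus its key=value fields."""
    fields = line.split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record


def helices(topology):
    """Return (start, end) pairs for each helix in a topology string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]


line = ("sp_1xqe_A len=418 ExpAA=243.54 First60=39.67 PredHel=11 "
        "Topology=o10-32i45-67o98-120i127-149o159-181i193-215o225-247"
        "i259-281o285-302i315-337o352-374i")
record = parse_tmhmm_short(line)
segments = helices(record["Topology"])
print(len(segments), segments[0], segments[-1])  # 11 (10, 32) (352, 374)
```

The number of parsed ranges matches the PredHel field, which is a quick consistency check on the parse.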

TMpred

HMMTOP

HMMer Flow

HMMer

Scores for complete sequences (score includes all domains):
Sequence      Description          Score  E-value  N
--------      -----------          -----  -------  ---
nontm|1ALO._  OXIDOREDUCTASE       -20.6  4.7      1
nontm|1CDE._  TRANSFERASE(FORMYL)  -26.1  9.9      1
nontm|1AKO._  NUCLEASE             -27.4  10       1
nontm|1ARU._  PEROXIDASE           -37.1  10       1
sp|1pv7_A                          -41.7  10       1
sp|1pw4_A                          -46.0  10       1
sp|1pxs_A                          -48.9  10       1
sp|1xqe_A                          -49.0  10       1
sp|1r2c_L                          -53.2  10       1
nontm|1HSB.B  HISTOCOMPATIBILITY   -61.4  10       1

Parsed for domains:
Sequence      Domain  seq-f  seq-t     hmm-f  hmm-t     score  E-value
--------      ------  -----  -----     -----  -----     -----  -------
nontm|1ALO._  1/1     125    323   ..  1      199   []  -20.6  4.7
nontm|1CDE._  1/1     4      202   ..  1      199   []  -26.1  9.9
nontm|1AKO._  1/1     5      202   ..  1      199   []  -27.4  10
nontm|1ARU._  1/1     112    295   ..  1      199   []  -37.1  10
sp|1pv7_A     1/1     116    314   ..  1      199   []  -41.7  10
sp|1pw4_A     1/1     162    329   ..  1      199   []  -46.0  10
sp|1pxs_A     1/1     51     249   .]  1      199   []  -48.9  10
sp|1xqe_A     1/1     39     226   ..  1      199   []  -49.0  10
sp|1r2c_L     1/1     62     260   ..  1      199   []  -53.2  10
nontm|1HSB.B  1/1     2      99    .]  1      199   []  -61.4  10

HMMer

Total sequences searched: 10

Whole sequence top hits:
tophits_s report:
  Total hits: 10
  Satisfying E cutoff: 9
  Total memory: 16K

Domain top hits:
tophits_s report:
  Total hits: 10
  Satisfying E cutoff: 10
  Total memory: 22K

ePST Output

TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365

Total # ePST segments = 13

ePST Outputs

# i  char  pos    neg      odds    tot       win       maxwin    regions
0    A     -1.87  -708.40  706.52  706.52    706.52    0.00      -s
1    P     -2.96  -708.40  705.44  1411.96   1411.96   0.00      -s
2    A     -1.87  -708.40  706.52  2118.48   2118.48   0.00      -s
3    V     -0.75  -708.40  707.64  2826.13   2826.13   0.00      -s
4    A     -1.80  -708.40  706.60  3532.72   3532.72   0.00      -s
5    D     -6.47  -708.40  701.92  4234.65   4234.65   0.00      -s
6    K     -3.53  -708.40  704.87  4939.52   4939.52   0.00      -s
7    A     -3.40  -708.40  705.00  5644.51   5644.51   0.00      -s
8    D     -6.47  -708.40  701.92  6346.43   6346.43   0.00      -s
9    N     -5.22  -708.40  703.18  7049.61   7049.61   0.00      -s
10   A     -1.87  -708.40  706.52  7756.14   7756.14   0.00      -s
11   F     -3.91  -708.40  704.49  8460.63   8460.63   0.00      -s
12   M     -3.76  -708.40  704.63  9165.26   9165.26   0.00      -s
13   M     -3.76  -708.40  704.63  9869.89   9869.89   0.00      -s
14   I     -2.06  -708.40  706.34  10576.23  10576.23  0.00      -s
15   C     -4.54  -708.40  703.86  11280.08  10573.56  10573.56  -s
16   T     -2.71  -708.40  705.69  11985.77  10573.81  10573.81  -s
17   A     -2.48  -708.40  705.91  12691.68  10573.20  10573.81  -s
18   L     -4.01  -708.40  704.38  13396.07  10569.94  10573.81  -s
19   V     -1.29  -708.40  707.11  14103.18  10570.45  10573.81  -s
20   L     -0.59  -708.40  707.81  14810.99  10576.34  10576.34  -s
21   F     -1.12  -708.40  707.28  15518.26  10578.75  10578.75  +s
22   M     -3.76  -708.40  704.63  16222.90  10578.39  10578.75  +s
23   T     -3.12  -708.40  705.27  16928.17  10581.74  10581.74  +s
24   I     -0.87  -708.40  707.52  17635.69  10586.08  10586.08  +s
25   P     -0.51  -708.40  707.89  18343.58  10587.44  10587.44  +s
26   G     -2.25  -708.40  706.15  19049.73  10589.11  10589.11  +s
27   I     -1.49  -708.40  706.91  19756.64  10591.38  10591.38  +s
28   A     -1.54  -708.40  706.85  20463.50  10593.61  10593.61  +s
29   L     -4.01  -708.40  704.38  21167.88  10591.65  10593.61  +s
30   F     -1.92  -708.40  706.48  21874.36  10594.27  10594.27  +s
31   Y     -6.07  -708.40  702.33  22576.69  10590.91  10594.27  +s
32   G     -2.25  -708.40  706.15  23282.84  10591.15  10594.27  +s
33   G     -4.38  -708.40  704.02  23986.86  10590.79  10594.27  +s
34   L     -1.54  -708.40  706.85  24693.71  10590.53  10594.27  +s
35   I     -2.06  -708.40  706.34  25400.05  10589.06  10594.27  +s
36   R     -2.75  -708.40  705.65  26105.70  10587.43  10594.27  +s
37   G     -2.25  -708.40  706.15  26811.85  10588.95  10594.27  +
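In the listing above, the win value at each position equals the sum of the odds values over the most recent 15 residues (for example, win at i=15 is tot at i=15 minus the odds at i=0). A minimal sketch of that sliding-window sum, assuming a window of 15 as that arithmetic suggests. The thresholding step and its cutoff are assumptions for illustration, not the ePST rule, and the function names are hypothetical:

```python
def window_sums(odds, width):
    """Sum of the last `width` odds values at each position (fewer at the start)."""
    sums, running = [], 0.0
    for i, value in enumerate(odds):
        running += value
        if i >= width:
            running -= odds[i - width]  # drop the value that left the window
        sums.append(running)
    return sums


def flag_positions(sums, threshold):
    """Assumed rule: mark a position '+' when its window sum reaches the threshold."""
    return ["+" if s >= threshold else "-" for s in sums]


# odds column for i = 0..15, copied from the listing (two-decimal rounding).
odds = [706.52, 705.44, 706.52, 707.64, 706.60, 701.92, 704.87, 705.00,
        701.92, 703.18, 706.52, 704.49, 704.63, 704.63, 706.34, 703.86]
win = window_sums(odds, 15)
print(round(win[14], 2), round(win[15], 2))
```

The two printed values agree with the win column at i=14 and i=15 to within the rounding of the listed odds.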

ePST Execution Flow

Training Set, Testing Set -> ePST -> ePST Prediction -> Post-processing Scripts -> TM segment table

TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365

Total # segments predicted by ePST = 13
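The post-processing step turns the per-residue ePST output into a segment table like the one above. The actual scripts are not shown in the slides; a minimal sketch, assuming the input is a string of per-residue '+'/'-' flags and that positions are 1-based and inclusive (both assumptions), with hypothetical function names:

```python
def flags_to_segments(flags):
    """Return 1-based inclusive (start, end) pairs for each run of '+'."""
    segments, start = [], None
    for pos, flag in enumerate(flags, start=1):
        if flag == "+" and start is None:
            start = pos  # a new run begins
        elif flag != "+" and start is not None:
            segments.append((start, pos - 1))  # the run ended at the previous position
            start = None
    if start is not None:
        segments.append((start, len(flags)))  # run reaches the end of the sequence
    return segments


def format_table(segments):
    """Render segments in the TM# / Start / End layout."""
    lines = ["TM# Start End"]
    lines += [f"{n} {s} {e}" for n, (s, e) in enumerate(segments, start=1)]
    lines.append(f"Total # segments = {len(segments)}")
    return "\n".join(lines)


flags = "--+++--++++-"  # hypothetical toy input
segments = flags_to_segments(flags)
print(format_table(segments))
```

This version does not merge runs separated by a short gap, which is consistent with segments 5 and 6 in the table being reported separately even though they are one residue apart.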

HMMer Results for both.fasta

Step Time

CLUSTALW 41.37s (41.05s)

hmmbuild 0.56s (0.25s -f)

hmmcalibrate 5.47s (2.71s -f)

hmmsearch 1.56s (0.73s -f)

HMMer vs. ePST

Predictor  Train Global  Train Local  Test Global      Test Local       Accuracy Global  Accuracy Local
HMMer      6.03          (2.96)       1.56             (0.73)           240/916 = 26%
ePST       0.33          0.23         5.59, fp=fn=288  0.82, fp=fn=333  69%              64%

ePST

Training    Testing     Local Accuracy          Global Accuracy
DBTMR40672  q.fasta     100% (W 15)             100%
DBTMR1000   q.fasta     60% (W 15), fp=2; fn=0  100%
DBTMR40672  both.fasta  fp=fn=269, 71%          fp=fn=247, 73%
DBTMR1000   both.fasta  fp=fn=333, 64%          fp=fn=288, 69%
Set 160     both.fasta  fp=fn=321, 65%          72.55% (-), 73% (+)
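The percentages reported with fp=fn counts on both.fasta are consistent with dividing the correctly recovered TM chains by the 916 TM chains in tm.fasta, for example (916 - 288) / 916 ≈ 69%. This reading is inferred from the numbers and is not a definition given in the slides. A short sketch that reproduces the check, with a hypothetical function name:

```python
N_TM = 916  # TM chains in tm.fasta, per the Data Sets slide


def tm_recall_percent(fn, n_tm=N_TM):
    """Share of TM chains still predicted as TM, given fn missed chains."""
    return 100.0 * (n_tm - fn) / n_tm


# fn values from the ePST table, paired with the reported percentages.
reported = {269: 71, 247: 73, 333: 64, 288: 69, 321: 65}
for fn, percent in reported.items():
    print(fn, round(tm_recall_percent(fn)), percent)
```

When fp equals fn, the number of predicted TM chains equals the number of true TM chains, so precision and recall coincide and a single percentage describes both.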

Cross-validation (5 folds) - ePST

Data Set    Train Global  Train Local  Test Global  Test Local  Accuracy Global  Accuracy Local
DBTMR40672  7.15          7.20         0.36         0.42        100%             100%
DBTMR1000   0.13          0.13         0.01         0.01        100%             100%
tm.fasta    0.60          0.60         0.01         0.01        100%             100%
both.fasta  0.61          0.61         1.67         1.67        99%              99%
Set 160     0.84          0.84         0.00         0.00        100%             100%
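A minimal sketch of how a 5-fold split of a data set could be produced, assuming a shuffled partition of sequence indices into roughly equal folds. How the folds were actually drawn for the table above is not stated; the function name and the seed are hypothetical:

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Example with 253 items, the size of tmpdb253.fasta.
for train, test in k_fold_splits(253):
    print(len(train), len(test))
```

Each sequence lands in exactly one test fold, so every item is used for testing once and for training in the other four rounds.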

TMHMM and ePST

Predictor                  Testing     Local Accuracy                                  Global Accuracy
TMHMM                      both.fasta                                                  99.11% (-), 60% (+)
ePST trained on Set 160    both.fasta  65%                                             72.55% (-), 73% (+)
ePST trained on mix.fasta  both.fasta  74% (W 15, 20), 78% (W 10, 35), 80% (W 25, 27)  78%
ePST trained on Set 160    q.fasta     100%                                            100%
TMHMM                      q.fasta                                                     100%

Scanning PDB

Training: DBTMR40672
Testing: PDB
Threshold 705.37 -> Nrtm = 1665 chains
PDB_TM retrieves 1673 chains
Validation necessary – lack of ground truth

TMH Benchmark

tmeval.fasta: 2247 non-annotated sequences
Script for converting ePST output to TMH submit format
Comparison with other predictors
4 tables
8 evaluation parameters

Window 25, 35, T 10584 - High Resolution

Window 25, 35, T 10584 - Low Resolution

Window 15, T 10588 – High Resolution

Window 15, T 10588 – Low Resolution

Window 15, T 10588 – False Positives

Window 15, T 10588 – Confusion with Signal Peptides

Conclusions

ePST is a competitive predictor
Fast training
Scales well, in contrast with HMMs
ePST does not suffer from poor local minima, as HMMs do
ePST does not require an MSA of the sequences
ePST allows more than one test sequence at a time

Future Work

More tuning, use pruning
Applications to other tasks (phosphorylation) involved in signal transduction pathways
Search for a verified data set for training and testing (no consensus in the literature)
Extract features from the sequence
Analyze the false negatives with particular helix topologies (such as 1orq)
