Localization prediction of transmembrane proteins
Stefan Maetschke, Mikael Bodén and Marcus Gallagher
The University of Queensland
Protein classes
• Protein: soluble or membrane
• Membrane: integral or peripheral
• Integral: transmembrane or anchored
• Transmembrane: α-helical or β-barrel; single-spanning or multi-spanning
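Transmembrane protein types
[Figure: membrane orientation of Type-I, Type-II, Type-III and Type-IV (multi-spanning) transmembrane proteins, with the N- and C-termini marked relative to the cytosol (inside); a signal peptide is indicated.]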
Eukaryotic cell
[Figure: eukaryotic cell with labelled compartments: nucleus, mitochondrion, peroxisome, lysosome, endoplasmic reticulum, Golgi complex, ERGIC and endosome; RNA and ribosome are also labelled.]
Secretory and endocytic pathway
Problem and hypothesis
• Sorting signals for transmembrane proteins serve multiple purposes (targeting, retention, retrieval, avoidance) and are largely unknown: the problem is challenging and multi-faceted.
• Current localization prediction of eukaryotic transmembrane proteins is poor, because models based on soluble proteins are ill-suited: previous work is inadequate or incomplete.
• Localization prediction specifically for transmembrane proteins is virtually unexplored (paucity and variance of data): it is an open problem.
• Explicit modelling of protein topology should enhance localization prediction accuracy, because parameter tuning receives explicit guidance towards biologically sensible solutions: this is the proposed approach.
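Hidden Markov model
Initial state probabilities:
$\pi_i = P(q_1 = S_i)$
State transition probabilities:
$A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
Observation probabilities:
$B = \{b_i(k)\}, \quad b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three-state left-to-right HMM (S1, S2, S3) with self-transitions a11, a22, a33 and forward transitions a12, a23. Each state has an emission distribution (b1, b2, b3) over the 20 amino acids (A = 1, R = 2, ..., V = 20).]
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Observation sequence: [shown in the figure; not recoverable from the transcript]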
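As an illustration of how the three parameter sets above combine, the sketch below computes log P(O | model) with the standard forward algorithm. It is a minimal example, not the authors' implementation: the toy values of `pi`, `A` and `B` and the two-symbol alphabet are made up, and only the left-to-right three-state shape follows the slide.

```python
import math

def safe_log(p):
    """log(p) that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(obs, pi, A, B):
    """Return log P(obs | pi, A, B) using the forward algorithm.

    obs     : list of observation symbol indices k
    pi[i]   : P(q_1 = S_i)
    A[i][j] : P(q_{t+1} = S_j | q_t = S_i)
    B[i][k] : P(o_t = V_k | q_t = S_i)
    """
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + safe_log(A[i][j]) for i in range(n)])
            + safe_log(B[j][o])
            for j in range(n)
        ]
    # Termination: sum over all final states
    return log_sum_exp(alpha)

# Hypothetical toy model: three left-to-right states, two-symbol alphabet
pi = [1.0, 0.0, 0.0]
A = [[0.7, 0.3, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5]]

print(math.exp(forward_log_likelihood([0, 1], pi, A, B)))  # approximately 0.279
```

Working in log space avoids numerical underflow on sequences as long as whole proteins.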
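Second-order Hidden Markov model
Initial state probabilities:
$\pi_i = P(q_1 = S_i)$
State transition probabilities:
$A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
Observation probabilities:
$B = \{b_i(k)\}, \quad b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three-state left-to-right HMM (S1, S2, S3) with self-transitions a11, a22, a33 and forward transitions a12, a23. Each emission distribution (b1, b2, b3) ranges over the 400 di-peptides (AA = 1, AR = 2, AN = 3, AD = 4, ..., VV = 400).]
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Observation sequence: [shown in the figure; not recoverable from the transcript]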
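Third-order Hidden Markov model
Initial state probabilities:
$\pi_i = P(q_1 = S_i)$
State transition probabilities:
$A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
Observation probabilities:
$B = \{b_i(k)\}, \quad b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three-state left-to-right HMM (S1, S2, S3) with self-transitions a11, a22, a33 and forward transitions a12, a23. Each emission distribution (b1, b2, b3) ranges over the 8000 tri-peptides (AAA = 1, AAR = 2, AAN = 3, AAD = 4, AAC = 5, AAQ = 6, ..., VVV = 8000).]
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Observation sequence: [shown in the figure; not recoverable from the transcript]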
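The emission alphabet grows as 20^k with the model order k (20, 400, 8000). The sketch below shows one way to number k-peptides so that the result matches the labels on the slides (AA = 1, AR = 2, ..., VV = 400). Two points are assumptions rather than facts from the slides: the full amino-acid order `ARNDCQEGHILKMFPSTWYV`, of which the slides only show A, R, N, D, C, Q and V, and the use of overlapping windows in `encode`.

```python
# Amino-acid order assumed from the slide labels (A, R, N, D, C, Q, ..., V)
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def alphabet_size(order):
    """Number of distinct emission symbols for a model of the given order."""
    return len(AMINO_ACIDS) ** order

def kmer_symbol(kmer):
    """Map a k-peptide to its 1-based symbol number (base-20 value plus one)."""
    value = 0
    for aa in kmer:
        value = value * len(AMINO_ACIDS) + INDEX[aa]
    return value + 1

def encode(sequence, order):
    """Encode a sequence as overlapping k-peptide symbols (assumed windowing)."""
    return [kmer_symbol(sequence[i:i + order])
            for i in range(len(sequence) - order + 1)]

print([alphabet_size(k) for k in (1, 2, 3)])  # [20, 400, 8000]
print(kmer_symbol("AD"), kmer_symbol("VV"), kmer_symbol("AAQ"))  # 4 400 6
```

The jump from 400 to 8000 parameters per state is a practical concern when the smallest class in the dataset has only a few dozen proteins.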
Signal peptide
[Figure: signal peptide made up of an N-terminal region, a hydrophobic core and a cleavage region, followed by the mature protein.]
Transmembrane domain
icap TMD ocap
Protein topology model
[Figure: topology model built from the sub-models N-term, SP, outside, ocap, TMD, icap, inside and C-term.]
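Localization model (5 x topology models)
[Figure: eukaryotic cell with labelled compartments (nucleus, mitochondrion, peroxisome, lysosome, endoplasmic reticulum, Golgi complex, ERGIC, endosome); one topology model is used per predicted location.]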
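The slides do not state how the five topology models are combined into one prediction. A common choice, assumed here purely for illustration, is to score a protein with every location model and pick the location with the highest log-likelihood, optionally adding a log prior. The function name `predict_location` and the scores are hypothetical.

```python
def predict_location(log_likelihoods, log_priors=None):
    """Return the location whose model scores highest (assumed decision rule).

    log_likelihoods : dict mapping location name to log P(sequence | model)
    log_priors      : optional dict mapping location name to log P(location)
    """
    if log_priors is None:
        log_priors = {loc: 0.0 for loc in log_likelihoods}
    return max(log_likelihoods,
               key=lambda loc: log_likelihoods[loc] + log_priors[loc])

# Made-up log-likelihoods for one protein under the five location models
scores = {"PM": -412.7, "ER": -409.3, "GO": -415.1, "EN": -420.8, "LY": -418.2}
print(predict_location(scores))  # ER
```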
LOCATE dataset
• Subset of the LOCATE database
• FANTOM3, mouse proteome
• Filtered for transmembrane proteins
• No multi-targeted proteins
• Redundancy reduced (<25%)
• TMDs and SPs are labeled (predicted)
• High-quality localization annotation

Location               Proteins
Plasma Membrane        873
Endoplasmic Reticulum  261
Golgi Complex          141
Lysosome               45
Endosome               31
Total                  1351
Prediction performance
[Figure: bar chart of prediction performance (MCC, axis from 0 to 0.5) for SVM-1, SVM-2, HMM-1, HMM-2 and HMM-3; bar values are not recoverable from the transcript.]
• LOCATE dataset
• Mean correlation coefficient (MCC)
• 10-fold cross-validation, repeated 10 times
• Five locations (ER, PM, GO, EN, LY)
• SVM: linear kernel
• First-, second- and third-order HMMs
[Figure: confusion matrix for HMM-2]
=> Di-peptide composition superior to single amino acid composition
=> Topological model superior to non-topological model
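The slides report performance as MCC without giving the formula. The sketch below assumes it is the Matthews correlation coefficient, computed one-vs-rest for each location from a confusion matrix and then averaged. Both function names are illustrative.

```python
import math

def matthews_per_class(confusion, cls):
    """One-vs-rest Matthews correlation coefficient for class index cls.

    confusion[i][j] = number of proteins of true class i predicted as class j.
    Returns 0.0 when the coefficient is undefined (zero denominator).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = total - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mean_mcc(confusion):
    """Average the per-class coefficients over all classes."""
    n = len(confusion)
    return sum(matthews_per_class(confusion, c) for c in range(n)) / n

# Class sizes from the LOCATE subset (PM, ER, GO, LY, EN), with every
# protein predicted as plasma membrane: a trivial majority-class predictor.
sizes = [873, 261, 141, 45, 31]
majority = [[size, 0, 0, 0, 0] for size in sizes]
print(mean_mcc(majority))  # 0.0
```

This matters for a dataset as unbalanced as this one. The majority-class predictor above is right for 873 of 1351 proteins, yet scores 0, so a correlation-based measure does not reward it.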
Predictor comparison
[Figure: bar chart of prediction accuracy in %, axis from 0 to 80; values listed in the order shown on the slide.]

Predictor   Prediction accuracy (%)
CELLO       18
WolfPSort   33
PAnalyst    48
HMM-2       75
CELLO 2.5: http://cello.life.nctu.edu.tw/
WolfPSort: http://wolfpsort.seq.cbrc.jp/
ProteomeAnalyst 2.5: http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
HMM-2: http://pprowler.itee.uq.edu.au/TMPHMMLoc
Test set: 20 PM, 20 ER, 20 Golgi
• HMM: only three classes, but the test set is not part of the training set
• Other predictors: more classes, but the test set may be part of their training sets
→ difficult to compare!
Conclusion
• Novel predictor for subcellular localization of transmembrane proteins along the secretory pathway: http://pprowler.itee.uq.edu.au/TMPHMMLoc
• Protein model has fewer states than topology predictors (TMHMM, HMMTOP, etc.) but is of second order
• Localization model is trained and tested using LOCATE, a recent, high-quality localization dataset
• Overall better performance than current localization predictors (transmembrane proteins, eukaryotic, secretory pathway)
  – Di-peptide composition superior to single amino acid composition
  – "Topological" model superior to "non-topological" baseline model