Prediction of Protein Function from Sequence Derived Protein Features
Technical University of Denmark, Lyngby, October 23, 2002
Center for Biological Sequence Analysis
Lars Juhl Jensen
Function unknown for 40% of human proteins
Pairwise alignment
>carp Cyprinus carpio growth hormone 210 aa vs.
>chicken Gallus gallus growth hormone 216 aa
scoring matrix: BLOSUM50, gap penalties: -12/-2
40.6% identity; Global alignment score: 487
10 20 30 40 50 60 70
carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
:: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.
chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE
10 20 30 40 50 60 70 80
80 90 100 110 120 130 140 150
carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
: ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .
chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G
90 100 110 120 130 140 150 160
170 180 190 200 210
carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
.: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.
chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
170 180 190 200 210
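As a minimal sketch of how an identity figure like the one quoted for this alignment can be derived, the Python snippet below counts identical columns in two gapped strings of equal length. The function name percent_identity is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; alignment programs differ in which denominator they report.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if the residues match and neither is a gap.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    # Assumption: the denominator is the full alignment length, gaps included.
    return 100.0 * identical / len(aligned_a)


# First six columns of the carp/chicken alignment above.
print(round(percent_identity("MA--RV", "MAPGSW"), 1))
```

Applied to the complete pair of aligned rows, the same function gives a whole-alignment identity under the stated denominator assumption.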
Functional assignment: alignment versus prediction
Alignment is good for transferring knowledge about the function of homologous proteins
For orphan proteins there is no knowledge to transfer
Orphan sequences must thus be handled by true prediction tools rather than alignment
Develop a prediction method that works for orphans but only requires sequence input
Assign a possible function to as many of the orphans as possible
Screen the human genome for novel pharmaceutical targets such as transcription factors and receptors
The paradigm: sequence to structure to function
Structure does play a very important role for the function of proteins
Structure is not very useful for prediction of protein function
• For proteins of unknown function, the structure is rarely known
• Prediction of 3D structure from sequence is a very difficult unsolved problem
• Prediction of protein function from structure is by many considered an even harder problem
Predicted secondary structure/fold can be used
1AOZ (129 aa) vs. 1PLC (99 aa)
scoring matrix: BLOSUM50, gap penalties: -12/-2
15.5% identity; Global alignment score: -23
10 20 30 40 50 60
1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
.. .. : ... . . ..: . :...: . .: ...:.
1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD
10 20 30 40
70 80 90 100 110 120
1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
.: :. . . : . :::: .. . .:. : : ::. :..
1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT
50 60 70 80 90
1AOZ VDPPQGKKE
:.
1PLC VN-------
An enzyme and a non-enzyme from the Cupredoxin superfamily
Function prediction from post translational modifications
Proteins with similar function may not be related in sequence
Still they must perform their function in the context of the same cellular machinery
Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
Functional classes predicted
Functional role (Monica Riley categories)
• The original scheme had 14 categories
• We reduce it to 12 categories by skipping the category “other” and combining replication and transcription
Enzyme prediction
• Enzyme vs. non-enzyme
• Major enzyme class in the EC system
Gene Ontology
• A subset of classes can be predicted
Systems related categories
• For example “cell cycle regulated”
The concept of ProtFun
Predict as many biologically relevant features as we can from the sequence
Train artificial neural networks for each category, also optimizing the feature combinations
Assign a probability for each category from the NN outputs
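Among the sequence-derived features named in this talk are hydrophobicity and the content of positive and negative residues. The Python sketch below computes three such values directly from a sequence. It is an illustration only: the function name simple_features, the use of the Kyte-Doolittle hydropathy scale, and the choice of K/R as positive and D/E as negative residues are assumptions, not a description of the actual ProtFun feature encoding. Features such as PTM sites or subcellular localization come from separate predictors and are not covered here.

```python
# Kyte-Doolittle hydropathy values (assumed scale, chosen for illustration).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def simple_features(sequence: str) -> dict:
    """Return three global physical/chemical features of a protein sequence."""
    # Ignore characters that are not one of the 20 standard amino acids.
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    return {
        # Assumption: only K and R count as positively charged.
        "positive_fraction": sum(aa in "KR" for aa in residues) / n,
        # Assumption: only D and E count as negatively charged.
        "negative_fraction": sum(aa in "DE" for aa in residues) / n,
        "mean_hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in residues) / n,
    }


# Start of the carp growth hormone sequence shown earlier in the talk.
print(simple_features("MARVLVLLSVVLVSLLVNQGRASDN"))
```

A vector of such values, one per protein, is the kind of input a per-category neural network could be trained on.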
Training of neural networks
Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary
The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets
Neural networks were first trained for single features and subsequently for combinations of the best performing features
Prediction performance on cellular role categories
An enzyme and a non-enzyme from the Cupredoxin superfamily
# Functional category               1AOZ    1PLC
Amino_acid_biosynthesis             0.126   0.070
Biosynthesis_of_cofactors           0.100   0.075
Cell_envelope                       0.429   0.032
Cellular_processes                  0.057   0.059
Central_intermediary_metabolism     0.063   0.041
Energy_metabolism                   0.126   0.268
Fatty_acid_metabolism               0.027   0.072
Purines_and_pyrimidines             0.439   0.088
Regulatory_functions                0.102   0.019
Replication_and_transcription       0.052   0.089
Translation                         0.079   0.150
Transport_and_binding               0.032   0.052
# Enzyme/nonenzyme
Enzyme                              0.773   0.310
Nonenzyme                           0.227   0.690
# Enzyme class
Oxidoreductase (EC 1.-.-.-)         0.077   0.077
Transferase (EC 2.-.-.-)            0.260   0.099
Hydrolase (EC 3.-.-.-)              0.114   0.071
Lyase (EC 4.-.-.-)                  0.025   0.020
Isomerase (EC 5.-.-.-)              0.010   0.068
Ligase (EC 6.-.-.-)                 0.017   0.017
Similar structure different functions
Many examples exist of structurally similar proteins which have different functions
Two PDB structures from the Cupredoxin superfamily were shown
• 1AOZ is an enzyme
• 1PLC is not an enzyme
Despite their structural similarity, our method predicts both correctly
Evolution conserves protein features and function
Protein features are more conserved between orthologs than paralogs
This leads to ProtFun predicting orthologs to be more likely to share function than paralogs
That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins
ProtFun performance for other organisms
Our predictors work in general for eukaryotes
Some categories work quite well for prokaryotes
• Most metabolism categories
• Transport and binding
While other categories fail
• Energy metabolism
• Regulatory functions
[Figure: performance per functional category (Amino acid biosynthesis, Biosynthesis of cofactors, Cell envelope, Cellular processes, Central intermediary metab., Energy metabolism, Fatty acid metabolism, Purines and pyrimidines, Regulatory functions, Replication and transcription, Translation, Transport and binding) across organisms (hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec)]
Mapping category performances onto input features
[Figure, two panels across organisms (hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec)]
[Panel 1, functional categories: Amino acid biosynthesis, Biosynthesis of cofactors, Cell envelope, Cellular processes, Central intermediary metab., Energy metabolism, Fatty acid metabolism, Purines and pyrimidines, Regulatory functions, Replication and transcription, Translation, Transport and binding]
[Panel 2, input features: Extinction coefficient, Hydrophobicity, Negative residues, Positive residues, O-glycosylation, S/T-phosphorylation, Y-phosphorylation, N-glycosylation, PEST regions, Secondary structure, Subcellular localization, Low complexity regions, Signal peptides, Transmembrane helices]
Performance contribution of sequence derived features
The correlations between features and function are conserved for eukaryotes
Some correlations extend to archaea and bacteria
• Physical/chemical properties
• Secondary structure and transmembrane helices
Other correlations only hold for eukaryotes
• PTMs and subcellular localization features
[Figure: performance contribution per input feature (Extinction coefficient, Hydrophobicity, Negative residues, Positive residues, O-glycosylation, S/T-phosphorylation, Y-phosphorylation, N-glycosylation, PEST regions, Secondary structure, Subcellular localization, Low complexity regions, Signal peptides, Transmembrane helices) across organisms (hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec)]
Are our classes meaningful?
A better classification system: the Gene Ontology
Standardized by the Gene Ontology Consortium
Proteins can belong to multiple classes
Different kinds of function can be annotated:
• Molecular function
• Biological process
• Cellular component
GO assigns the “function” at several levels of detail
Training of the Gene Ontology predictor
GO numbers were assigned to all human SWISS-PROT and TREMBL entries based on matches to InterPro
Classes annotated to fewer than 20 different InterPro families were discarded
The sequences were split into five sets of equal size where significant similarity exists only within sets, not between sets
Using this data set, neural networks were trained in sets of five, constituting a five-fold cross-validation
Single feature neural nets were first trained on each remaining category
Neural networks using combinations of features were trained on promising categories resulting in 14 good predictors
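The requirement that significant similarity exists only within the five sets can be met by first grouping similar sequences into clusters and then assigning whole clusters to sets. The Python sketch below does this with a union-find structure and a greedy size balancing step. It is an assumed procedure for illustration: the function names and the input format (a list of sequence identifiers plus a precomputed list of significantly similar pairs) are hypothetical, and the talk does not state how the actual partition was made.

```python
from collections import defaultdict


def find(parent: dict, x):
    """Return the cluster representative of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def split_into_folds(ids, similar_pairs, n_folds=5):
    """Partition ids into n_folds so that no similar pair is split across folds."""
    parent = {i: i for i in ids}
    # Merge every pair of significantly similar sequences into one cluster.
    for a, b in similar_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    clusters = defaultdict(list)
    for i in ids:
        clusters[find(parent, i)].append(i)
    # Greedy balancing: place the largest clusters first, always into the
    # currently smallest fold. Fold sizes end up roughly, not exactly, equal.
    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


ids = [f"p{i}" for i in range(1, 11)]
pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
for k, fold in enumerate(split_into_folds(ids, pairs), start=1):
    print(k, fold)
```

Each fold can then serve once as the test set while the remaining four are used for training.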
Prediction performance on Gene Ontology categories
Predicts many pharmaceutically interesting classes
70% of hormones and receptors can be predicted at a false positive rate of only 5%
All categories can be predicted with a sensitivity of 50% at a false positive rate of 10%
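The two quantities quoted on this slide, sensitivity and false positive rate, can be computed from predicted scores and known class labels once a score threshold is chosen. The Python sketch below shows the calculation. The function name, the toy scores, and the threshold value are made up for illustration and are not ProtFun results.

```python
def sensitivity_and_fpr(scores, labels, threshold):
    """Return (sensitivity, false positive rate) at the given score threshold.

    scores: predicted probabilities for one category
    labels: 1 if the protein truly belongs to the category, else 0
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    # Sensitivity: fraction of true members that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate: fraction of non-members that are wrongly predicted.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, fpr


# Toy data: three members and three non-members of a category.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(sensitivity_and_fpr(scores, labels, threshold=0.5))
```

Raising the threshold lowers the false positive rate at the cost of sensitivity, which is why the two numbers are reported together.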
Feature usage
Transmembrane helices important for prediction of
• Receptors
• Transporters
• Ion channels
Subcellular localization good for predicting
• Receptors
• Transcription (regulation)
############## ProtFun 2.0 predictions ##############
>ENSP00000257015
# Functional category                  Prob     Odds
Amino_acid_biosynthesis                0.021    0.955
Biosynthesis_of_cofactors              0.032    0.444
Cell_envelope                       => 0.661   10.836
Cellular_processes                     0.039    0.534
Central_intermediary_metabolism        0.042    0.667
Energy_metabolism                      0.043    0.478
Fatty_acid_metabolism                  0.043    3.308
Purines_and_pyrimidines                0.164    0.675
Regulatory_functions                   0.014    0.087
Replication_and_transcription          0.020    0.075
Translation                            0.033    0.750
Transport_and_binding                  0.834    2.034
# Enzyme/nonenzyme                     Prob     Odds
Enzyme                                 0.202    0.705
Nonenzyme                           => 0.798    1.118
# Enzyme class                         Prob     Odds
Oxidoreductase (EC 1.-.-.-)            0.055    0.264
Transferase (EC 2.-.-.-)               0.032    0.093
Hydrolase (EC 3.-.-.-)                 0.077    0.243
Isomerase (EC 4.-.-.-)                 0.020    0.426
Ligase (EC 5.-.-.-)                    0.010    0.313
Lyase (EC 6.-.-.-)                     0.017    0.334
# Gene Ontology category               Prob     Odds
Signal_transducer                      0.493    2.304
Receptor                            => 0.734    4.318
Hormone                                0.001    0.154
Structural_protein                     0.001    0.036
Transporter                            0.050    0.459
Ion_channel                            0.035    0.614
Voltage-gated_ion_channel              0.002    0.091
Cation_channel                         0.010    0.217
Transcription                          0.050    0.391
Transcription_regulation               0.021    0.168
Stress_response                        0.364    4.136
Immune_response                        0.477    5.612
Growth_factor                          0.117    8.357
Metabolism                             0.142    0.307
Metal_ion_transport                    0.013    0.394
Possible novel receptor
No BLAST matches against SWISS-PROT with an E-value below 1
A Pfam search yielded a questionable match to TGF-beta type III receptors (E-value 0.28)
While this match is not significant on its own, it supports the predictions
Summary
A method for prediction of “protein function” has been developed for human proteins
This method has been successfully applied to a number of different categorization systems
The feature usage of the neural networks is in agreement with current biological knowledge
Cross-species tests show that the prediction methods developed on human proteins work for most eukaryotes
The evolutionary aspects of “feature space” have been discussed
Acknowledgements
Other people at CBS
• David Ussery
• Marie Skovgaard
• Ulrik de Lichtenberg
• Thomas Skøt Jensen
• Anne Mølgaard
The EUCLID team at CNB/CSIC, Madrid
• Alfonso Valencia
• Damien Devos
• Javier Tamames
The ProtFun team at CBS
• Søren Brunak
• Ramneek Gupta
• Can Kesmir
• Kristoffer Rapacki
• Hans-Henrik Stærfeldt
• Henrik Nielsen
• Nikolaj Blom
• Claus A.F. Andersen
• Anders Krogh
• Steen Knudsen
• Chris Workman
Thank you!