System approaches to the prediction of protein function
Søren Brunak
Center for Biological Sequence Analysis
Technical University of Denmark
www.cbs.dtu.dk
40-60% of the proteins in the human genome have unknown function
Molecular_function unknown (134)
Catalytic activity (109)
Binding (54)
Enzyme regulator activity (19)
Transcription regulator activity (17)
Structural molecule activity (13)
Transporter activity (8)
Motor activity (7)
Signal transducer activity (7)
Chaperone activity (1)
Diverse functional categories of cell cycle regulated yeast proteins
Level 1 GO categories for 349 cell cycle regulated yeast genes. Only 95 of these belong to the “Cell Cycle” category (biological process).
Binding (77)
Structural molecule activity (51)
Catalytic activity (35)
Chaperone activity (5)
Enzyme regulator activity (4)
Transporter activity (4)
Transcription regulator activity (3)
Translation regulator activity (1)
Diverse functional categories for human nucleolus proteins
Level 1 GO categories for 148 human genes located in the nucleolus. Only 5 of these belong to the “Nucleolus” category (cellular component).
Pairwise alignment
>carp Cyprinus carpio growth hormone 210 aa vs.
>chicken Gallus gallus growth hormone 216 aa
scoring matrix: BLOSUM50, gap penalties: -12/-2
40.6% identity; Global alignment score: 487
10 20 30 40 50 60 70
carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
:: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.
chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE
10 20 30 40 50 60 70 80
80 90 100 110 120 130 140 150
carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
: ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .
chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G
90 100 110 120 130 140 150 160
170 180 190 200 210
carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
.: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.
chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
170 180 190 200 210
An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily
1AOZ (129 aa) vs. 1PLC (99 aa)
scoring matrix: BLOSUM50, gap penalties: -12/-2
15.5% identity; Global alignment score: -23
10 20 30 40 50 60
1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
.. .. : ... . . ..: . :...: . .: ...:.
1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD
10 20 30 40
70 80 90 100 110 120
1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
.: :. . . : . :::: .. . .:. : : ::. :..
1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT
50 60 70 80 90
1AOZ VDPPQGKKE
:.
1PLC VN-------
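The identity percentages quoted for the two alignments are derived from the aligned rows. The following is a minimal sketch, assuming identity is counted over all alignment columns; alignment programs differ in the denominator they use, so it is not meant to reproduce the figures above exactly. The helper name `percent_identity` is hypothetical.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        return 0.0
    # A column counts as identical only if both rows hold the same non-gap residue.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


# Final block of the 1AOZ vs. 1PLC alignment shown above.
tail_1aoz = "VDPPQGKKE"
tail_1plc = "VN-------"
print(f"{percent_identity(tail_1aoz, tail_1plc):.1f}% identity in the final block")
```

Only the leading V matches in this block, which illustrates how little sequence signal remains between the two cupredoxin structures.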
Transfer of functional information – in what space?
Recognize function in:
Sequence space – sequence alignment
Structure space – structural comparison
Gene expression spaces – array data
Interaction spaces – network/pathway extraction
Paper space – text mining
…
Protein feature space
Predict orphan protein function in feature space
Orphan sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
A similar pattern of modification may imply similar function
Predict sequence attributes independently, e.g. local and global properties such as
- post-translational modifications
- localization signals
- degradation signals
- structure
- composition, length, isoelectric point, …
Then integrate and correlate using neural networks
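As an illustration of the simplest global attributes listed above, the sketch below computes length and amino acid composition directly from a sequence. It is a minimal example under stated assumptions: the function name `sequence_features` and the layout of the feature vector are hypothetical, and the predicted attributes (modifications, localization and degradation signals, structure) would come from dedicated predictors that are not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(seq: str) -> list[float]:
    """Return [length, fraction of each of the 20 standard amino acids]."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    composition = [counts[aa] / n if n else 0.0 for aa in AMINO_ACIDS]
    return [float(n)] + composition


# First residues of the carp growth hormone from the alignment slide (gaps removed).
features = sequence_features("MARVLVLLSVVLVSLLVNQGRASDN")
print(len(features))  # 21 values: length plus 20 composition fractions
print(features[0])    # 25.0
```

A vector of this kind, extended with the predicted attributes, is what a neural network would take as input in the integration step.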
Acceptor site   Pos.    Target
AKKG S EQES     S-10    PKA (1CMK)
GFGD S IEAQ     S-87    Ovalbumin (1OVA)
EVVG S AEAG     S-350   Ovalbumin (1OVA)
GDLG S CEFH     S-80    Cystatin (1CEW)
Serine phosphorylation sites
Length distributions and functional role categories
Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides
Furin-specific (a) and other proprotein convertase cleavage sites (b)
PCs activate a large variety of proteins
Peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin).
PCs play an essential role in many vital biological processes like embryonic development and neural function, and in viral and bacterial pathogenesis. PCs are implicated in pathologies such as cancer and neurodegenerative diseases.
Mucin-type O-glycosylation
N-acetylgalactosamine (GalNAc) α-1 linked to the hydroxyl group of a serine or threonine
Responsible for the high carbohydrate content of mucin proteins (>50% of the dry weight)
Mucins, the principal component of mucus, protect epithelial surfaces from dehydration, mechanical injury, proteases and pathogens
Mucin-type glycosylation contributes to this by changing the structure to a stiff extended one and charging the protein to make it bind more water
Mucin-type O-glycosylation site conservation
Positional preference of N-Glyc sites across cellular role categories
Functional classes predicted
Functional role (Monica Riley categories)
• The original scheme had 14 categories
• Reduced to 12 categories by skipping the category “other” and combining replication and transcription
Enzyme prediction
• Enzyme vs non-enzyme
• Major enzyme class in the EC system
Gene Ontology
• A subset of classes can be predicted
Systems biology related categories
• For example “cell cycle regulated”, secreted, nucleolar
Predicting Gene Ontology categories
The GO system is designed for proteins to belong to multiple classes rather than one
Different kinds of function can be annotated:
• Molecular function
• Biological process
• Cellular component
GO assigns the ”function” at several levels of detail rather than only one
The concept of ProtFun
Predict as many biologically relevant features as we can from the sequence
Train artificial neural networks for each category
Assign a probability for each category from the NN outputs
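The per-category scoring step can be pictured with a small sketch. It assumes one ensemble of one-hidden-layer networks per category and takes the mean of the network outputs as the category score. The slides do not say how the NN outputs are converted into probabilities, so the averaging, the weights and all names below are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def network_output(features, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer network with a sigmoid output in (0, 1)."""
    hidden = [sigmoid(sum(w * f for w, f in zip(row, features)))
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


def category_scores(features, ensembles):
    """Average the outputs of each category's networks into one score per category."""
    return {
        category: sum(network_output(features, hw, ow) for hw, ow in nets) / len(nets)
        for category, nets in ensembles.items()
    }


# Hypothetical, untrained weights: two input features, two hidden units,
# two networks per category.
ensembles = {
    "Enzyme": [([[0.5, -0.2], [0.1, 0.3]], [1.0, -1.0]),
               ([[0.4, 0.0], [-0.3, 0.2]], [0.5, 0.5])],
    "Transport_and_binding": [([[-0.6, 0.1], [0.2, 0.2]], [-1.0, 0.3]),
                              ([[0.0, 0.7], [0.1, -0.1]], [0.2, -0.4])],
}
scores = category_scores([0.8, 0.1], ensembles)
for category, score in scores.items():
    print(category, round(score, 3))
```

Because each category has its own networks, a protein can score highly in several categories at once, which fits the multi-class nature of the GO system.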
An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily
1AOZ and 1PLC predictions
# Functional category              1AOZ    1PLC
Amino_acid_biosynthesis            0.126   0.070
Biosynthesis_of_cofactors          0.100   0.075
Cell_envelope                      0.429   0.032
Cellular_processes                 0.057   0.059
Central_intermediary_metabolism    0.063   0.041
Energy_metabolism                  0.126   0.268
Fatty_acid_metabolism              0.027   0.072
Purines_and_pyrimidines            0.439   0.088
Regulatory_functions               0.102   0.019
Replication_and_transcription      0.052   0.089
Translation                        0.079   0.150
Transport_and_binding              0.032   0.052
# Enzyme/nonenzyme
Enzyme                             0.773   0.310
Nonenzyme                          0.227   0.690
# Enzyme class
Oxidoreductase (EC 1.-.-.-)        0.077   0.077
Transferase (EC 2.-.-.-)           0.260   0.099
Hydrolase (EC 3.-.-.-)             0.114   0.071
Lyase (EC 4.-.-.-)                 0.025   0.020
Isomerase (EC 5.-.-.-)             0.010   0.068
Ligase (EC 6.-.-.-)                0.017   0.017
Similar structure, different functions
Many examples exist of structurally similar proteins which have different functions
Two PDB structures from the Cupredoxin superfamily
• 1AOZ is an ascorbate oxidase (enzyme)
• 1PLC performs electron transport (non-enzyme)
Despite their structural similarity, our method predicts both correctly
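The enzyme/non-enzyme call for the two structures can be read off the prediction table. The sketch below uses the Enzyme and Nonenzyme scores from that table and assumes that the label with the highest score is chosen; the slides do not state the decision rule, so this is an illustration only.

```python
# Enzyme/nonenzyme scores copied from the 1AOZ and 1PLC prediction table.
predictions = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def call_class(scores: dict[str, float]) -> str:
    """Return the label with the highest score (assumed decision rule)."""
    return max(scores, key=scores.get)


calls = {protein: call_class(scores) for protein, scores in predictions.items()}
print(calls)  # {'1AOZ': 'Enzyme', '1PLC': 'Nonenzyme'}
```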
Performance on Gene Ontology categories (worst case)
Example: Eukaryotic Cell Cycle
Systems Biology – Whole system description
• Focus on whole systems, rather than individual units
• Requires identification of all units in the system
• High diversity in biological systems
• Inference of system features/functions from experimental data
• Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms
Microarray identification of periodic genes
Synchronous yeast cells → DNA chips → Gene expression → Temporal expression
Look for those with a periodic expression
Periodic / Non-Periodic
70% / 91% / 47% (104 known genes)
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Problems
• Cho uses non-objective criteria
• Spellman identifies too many genes
• Zhao identifies fewer than half of the previously identified cell cycle regulated genes
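To make the Fourier-based idea in item 2 concrete, the sketch below scores an expression profile by the strength of its Fourier component at the cell-cycle frequency. It is a simplified illustration and not the published procedure; the function name `periodicity_score`, the 60-minute period and the synthetic profiles are all assumptions.

```python
import math


def periodicity_score(values, times, period):
    """Squared magnitude of the Fourier component at 1/period, for mean-centred data."""
    mean = sum(values) / len(values)
    omega = 2.0 * math.pi / period
    cos_sum = sum((v - mean) * math.cos(omega * t) for v, t in zip(values, times))
    sin_sum = sum((v - mean) * math.sin(omega * t) for v, t in zip(values, times))
    return (cos_sum ** 2 + sin_sum ** 2) / len(values)


# Synthetic profiles sampled every 10 minutes over two hypothetical 60-minute cycles.
times = [10 * i for i in range(12)]
periodic = [math.sin(2 * math.pi * t / 60) for t in times]
flat = [0.1 * ((-1) ** i) for i in range(12)]  # rapid jitter, no cell-cycle signal

print(periodicity_score(periodic, times, 60) > periodicity_score(flat, times, 60))  # True
```

Any such score still needs a cut-off, and the choice of that cut-off determines how many genes are called periodic.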
Identification of periodically expressed genes
Sequence-based “machine learning approach”
6200 genes
Positive set (97 sequences): periodic genes
Negative set (556 sequences): non-periodic genes
Grey zone area (~5600 genes): ?
Learn
Consistency filter
Our novel strategy
Prediction of cell cycle regulated genes from protein sequence
Features of cell cycle regulated genes used by neural net ensemble
Non-linear function prediction! Responds to single AA change
ORF      ANN   F-score  Intensity  Protein function
YIL169C  0.98  2.8      176        Protein of unknown function
YNL322C  0.98  1.7      870        Cell wall protein needed for cell wall beta-1,6-glucan assembly
YJL078C  0.98  5.5      86         Protein that may have a role in mating efficiency
YDL038C  0.98  5.3      165        Protein of unknown function
YOL155C  0.97  3.0      391        Protein with similarity to glucan 1,4-alpha-glucosidase
YJR151C  0.97  1.3      251        Member of the seripauperin (PAU) family
YLR286C  0.97  9.3      520        Endochitinase
YOL030W  0.97  4.1      817        Protein with similarity to Gas1p
YOR220W  0.97  2.5      340        Protein of unknown function
YNR044W  0.97  6.5      172        Anchor subunit of a-agglutinin
YGR023W  0.97  1.8      129        Signal transduction of cell wall stress during morphogenesis
YDL016C  0.97  0.8      338        Protein of unknown function
YDL152W  0.97  1.0      156        Protein of unknown function
YPR136C  0.97  1.1      76         Protein of unknown function
YGR115C  0.97  1.0      71         Protein of unknown function, questionable ORF
YMR317W  0.97  2.1      260        Protein of unknown function
YCR089W  0.97  3.4      104        Protein involved in mating induction
YLR194C  0.96  5.4      1870       Protein of unknown function
YIL011W  0.96  2.6      565        Member of the seripauperin (PAU) family
YGR161C  0.96  2.4      190        Protein of unknown function
YBR067C  0.96  5.9      825        Cold- and heat-shock induced mannoprotein of the cell wall
YNL228W  0.96  1.9      250        Protein of unknown function; questionable ORF
YNL327W  0.96  8.7      1320       Cell-cycle regulation protein involved in cell separation
YLR332W  0.96  1.5      642        Putative sensor for cell wall integrity signaling during growth
YNR067C  0.96  6.3      222        Protein with similarity to endo-1,3-beta-glucanase
Functional grouping: unknown, kinase & phosphatase, transcription, RNA binding, Serine rich, hydrolase, other
Subcellular localization: unknown, wall, nuclear, membrane, cytoplasmic, cytoskeleton, other
Among the “top 250 predicted” genes not used for training are
• 75 previously identified as cell cycle regulated genes
• 175 new potentially cell cycle regulated genes
Top 250 genes predicted from the entire genome
Experimental validation results
More than 100 new periodic genes identified/validated
For many of them, a role in the cell cycle is supported by other sources of evidence
About 30% of them have no known functional role
Gene  p-value  Neural Network score  GO Biological Process & Gene Description
Gene A 0.0009 0.76 Regulates the cell size requirement for passage through Start and commitment to cell division
Gene B 0.0026 0.70 cyclin involved in G1/S transition of mitotic cell cycle
Gene C 0.0081 0.59 Involved in cell cycle dependent gene expression
Gene D 0.0111 0.76 cell wall organization and biogenesis*
Gene E 0.0142 0.90 Required for spindle pole body duplication and a mitotic checkpoint function.
Gene F 0.0169 0.85 DNA repair*
Gene G 0.0192 0.74 G1/S transition of mitotic cell cycle*
Gene H 0.0222 0.76 DNA repair*
Gene I 0.0247 0.75 cellular morphogenesis*
Gene J 0.0255 0.81 regulation of exit from mitosis
Gene K 0.0353 0.46 Protein with similarity to putative glycosidase of the cell wall
Gene L 0.0482 0.74 G2/M transition of mitotic cell cycle*
Gene M 0.0520 0.81 chromatin assembly/disassembly*
Gene N 0.0630 0.92 actin cytoskeleton organization and biogenesis*
High confidence set
The eukaryotic cell cycle
The cell division process is divided into four phases:
• G1 growth/synthesis
• S replication of DNA
• G2 growth/synthesis
• M mitosis/cell division
Temporal variation in feature space
S phase ?
40% into the cell cycle, the plots show:
• High isoelectric point
• Many nuclear proteins
• Short proteins
• Low potential for N-glycosylation
• Low potential for Ser/Thr-phosphorylation
• Few PEST regions
• Low aliphatic index
S phase feature snapshot
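Of the features in this snapshot, the aliphatic index has a simple closed form. The sketch below assumes the standard definition by Ikai (1980), in which the mole percentages of Ala, Val, Ile and Leu are weighted by 1, 2.9, 3.9 and 3.9. The slides do not say which definition was used, so treat this as an illustration.

```python
def aliphatic_index(seq: str) -> float:
    """Aliphatic index: Ala + 2.9*Val + 3.9*(Ile + Leu), each in mole percent."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Toy sequence with 25% each of Ala, Val, Ile and Leu.
print(round(aliphatic_index("AVILAVIL"), 1))  # 292.5
```

A low value, as reported for the S phase snapshot, indicates a small share of aliphatic side chains.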
Name     F-score  Avg. Int.  pI    Length  Protein function or role
IRS4     0.98     122        9.8   615     Protein involved in silencing of ribosomal DNA
SHE1     2.09     60         10.4  338     Protein that causes lethality when overexpressed
HHT1     8.89     2920       11.4  136     Histone H3, identical to Hht2p
YGR079W  1.06     194        5.4   370     Protein of unknown function
HTB1     9.68     1171       10.1  131     Histone H2B
MKC7     2.00     533        4.6   596     Aspartyl protease found in the periplasmic space
YNL228W  1.92     250        4.9   258     Protein of unknown function; questionable ORF
HTB2     9.70     1071       10.1  131     Histone H2B, nearly identical to Htb1p
HHF2     9.18     1955       11.4  103     Histone H4, identical to Hhf1p
TOF2     4.15     270        8.0   771     Protein that interacts with DNA topoisomerase I
ENT4     1.47     73         9.4   247     Protein of unknown function
HTA1     9.82     1340       10.7  132     Histone H2A, nearly identical to Hta2p
HHT2     7.86     2084       11.4  136     Histone H3, core component of the nucleosome
YPL150W  0.66     95         9.4   901     Serine/threonine protein kinase with unknown role
YKR045C  1.01     242        11.0  191     Protein of unknown function
YNR014W  1.80     312        8.7   212     Protein of unknown function
HHO1     9.17     625        10.2  258     Histone H1
S phase peaking genes
Identify areas where prediction approaches can clean up noisy experimental data
• High-throughput proteomics data
• DNA array data
Strength of prediction approaches can indeed be complementary to the experimental data due to experimental constraints
Generate hypotheses on the dynamics of protein feature space, e.g. the periodicity of the phospho-proteome.
Acknowledgements
People at CBS
• Lars Juhl Jensen
• Ramneek Gupta
• + 20 others
• Karin Julenius (O-glyc conservation)
• Thomas Skøt Jensen (cell cycle)
• Ulrik de Lichtenberg (cell cycle)
• Rasmus Wernersson (Febit experiments)
• Jannick Bendtsen (SecretomeP)
• Lars Kiemer (NucleolusP)
• Anders Fausbøll (NucleolusP)
• Thomas Schiritz-Ponten (new ProFun method)
Febit AG
• Peer Smith
CNB/CSIC, Madrid
• Alfonso Valencia
• Javier Tamames
• Damien Devos
Gunnar von Heijne, Stockholm (SecretomeP)
References
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle
L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Stærfeldt, K. Rapacki, C. Workman, C.A.F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of human protein function from post-translational modifications and localization features", J. Mol. Biol., 319, 1257-1265, 2002.
L.J. Jensen, M. Skovgaard, and S. Brunak, "Prediction of novel archaeal enzymes from sequence derived features", Protein Sci., 11, 2894-2898, 2002.
L.J. Jensen, R. Gupta, H.-H. Stærfeldt, and S. Brunak, "Prediction of human protein function according to Gene Ontology categories", Bioinformatics, 19, 635-642, 2003.
L.J. Jensen, D.W. Ussery, and S. Brunak, "Functionality of system components: Conservation of protein function in protein feature space", Genome Res., Oct 14, 2003.
U. de Lichtenberg, T.S. Jensen, L.J. Jensen, and S. Brunak, "Protein feature based identification of cell cycle regulated proteins in yeast", J. Mol. Biol., 329, 663-674, 2003.