prediction of protein function lars juhl jensen embl heidelberg
![Page 1: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/1.jpg)
Prediction of protein function
Lars Juhl Jensen, EMBL Heidelberg
![Page 2: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/2.jpg)
Overview
• Part 1
  – Homology-based transfer of annotation
  – Function prediction from protein domains
• Part 2
  – Prediction of functional motifs from sequence
  – Feature-based prediction of protein function
• Part 3
  – Prediction of functional interaction networks
![Page 3: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/3.jpg)
Why do we need to predict function?
![Page 4: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/4.jpg)
What do we mean by function?
• The concept “function” is not clearly defined
  – A structural biologist, a cell biologist, and a medical doctor will have very different views
• Many levels of granularity
  – For the overall definition of “function”, the knowledge and description can be more or less specific
• Functional categories are somewhat artificial
  – People like to put things in boxes …
![Page 5: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/5.jpg)
![Page 6: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/6.jpg)
Descriptions of protein function
• Controlled vocabularies
  – Gene Ontology
  – SwissProt keywords
  – KEGG pathways
  – EcoCyc pathways
• Interaction networks
• More accurate data models
  – Reactome
  – Systems Biology Markup Language (SBML)
![Page 7: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/7.jpg)
Molecular function
• Molecular function describes activities, such as catalytic or binding activities, at the molecular level
• GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place
• Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
![Page 8: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/8.jpg)
Biological process
• A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions
• An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport
• It can be difficult to distinguish between a biological process and a molecular function
![Page 9: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/9.jpg)
Cellular component
• A cellular component is just that, a component of a cell that is part of some larger object
• It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer)
• The cellular component categories are probably the best defined categories since they correspond to actual entities
![Page 10: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/10.jpg)
Homology-based transfer of annotation
Lars Juhl Jensen, EMBL Heidelberg
![Page 11: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/11.jpg)
Detection of homologs
• Pairwise sequence similarity searches
  – BLAST (fastest)
  – FASTA
  – Full Smith-Waterman (most sensitive)
• Profile-based similarity searches
  – PSI-BLAST
  – Hidden Markov Models (HMMs)
• Sequence similarity should always be evaluated at the protein level
![Page 12: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/12.jpg)
![Page 13: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/13.jpg)
Sequence similarity, sequence homology, and functional homology
• Sequence similarity means that the sequences are similar – no more, no less
• Sequence homology implies that the proteins are encoded by genes that share a common ancestry
• Functional homology means that two proteins from two organisms have the same function
• Sequence similarity or sequence homology does not guarantee functional homology
![Page 14: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/14.jpg)
Orthologs vs. paralogs
![Page 15: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/15.jpg)
Functional consequences of gene duplication
• Neofunctionalization
  – One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog)
  – The other copy has changed its function and behaves much like a paralog
• Subfunctionalization
  – Each copy has taken on a part of the ancestral function
  – A functional homolog cannot be defined
  – Each ortholog typically has the same molecular function in a different sub-process or location
![Page 16: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/16.jpg)
1–to–1 orthology
• A single gene in one organism corresponds to a single gene in another organism
• These can generally be assumed to encode functionally equivalent proteins
  – Same molecular function
  – Same biological process
  – Same localization
• 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms
![Page 17: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/17.jpg)
1–to–many orthology
• A single gene in one organism corresponds to multiple genes in another organism
• Any mixture of neo- and sub-functionalizations can have occurred
  – Typically same molecular function
  – Often different biological process or sub-process
  – Often different sub-cellular localization or tissue
• 1–to–many orthology is very common between simple model organisms and higher eukaryotes
![Page 18: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/18.jpg)
Many–to–many orthology
• Many genes in each organism have arisen from a single gene in their last common ancestor
• Different neo- and sub-functionalizations have likely taken place in each lineage
  – Typically same molecular function
  – Often different biological process or sub-process
  – Often different sub-cellular localization or tissue
• Many–to–many orthology is common between higher eukaryotes that are distantly related
![Page 19: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/19.jpg)
Detection of orthologs
• Reconstruction of phylogenetic trees
  – The theoretically most correct way
  – Works for analyzing particular genes of interest
• Methods based on reciprocal matches
  – What currently works at the genomic scale
• Manual curation
  – Detection of very remote orthologs may require that knowledge on gene synteny and/or protein function is taken into account
![Page 20: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/20.jpg)
Construction of gene trees
• Identify the relevant proteins
  – Sequence similarity and possibly additional information
• Construct a blocked multiple sequence alignment
  – Use, for example, Muscle and Gblocks
• Reconstruct the most likely phylogenetic tree
  – Use, for example, PhyML
• Orthologs and paralogs can be trivially extracted based on a gene tree
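To make the last bullet concrete, here is a minimal sketch of reading orthologs and paralogs off a gene tree with the species-overlap rule. This is one common heuristic, not necessarily what any of the tools named above do. The nested-tuple tree, the `species|gene` leaf labels, and all function names are assumptions made for illustration, and the tree is assumed to be rooted and binary.

```python
from itertools import product

def leaves(node):
    """Return all leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def species(leaf):
    """Hypothetical labeling convention: 'species|gene'."""
    return leaf.split("|")[0]

def classify_pairs(node, orthologs=None, paralogs=None):
    """Label each gene pair by the node at which the two lineages split."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    left, right = node  # assumes a binary tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    # Species-overlap rule: if a species occurs on both sides, treat the
    # node as a duplication (paralogs); otherwise as a speciation (orthologs).
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    target = paralogs if shared else orthologs
    for a, b in product(left_leaves, right_leaves):
        target.add(frozenset((a, b)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs

# Toy tree: a duplication in the human/mouse ancestor after the split from yeast.
tree = ((("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")), "yeast|A")
orthologs, paralogs = classify_pairs(tree)
```

In this toy tree the yeast gene is orthologous to all four mammalian genes (1–to–many), while `human|A1` and `mouse|A2` come out as paralogs because their lineages split at the duplication node.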
![Page 21: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/21.jpg)
Reciprocal matches
• Simple “best reciprocal match” is a bad choice
  – Can only deal with one-to-one orthology
• Detection of in-paralogs
  – Similarity higher within species than between species
• Orthologs can now be detected based on best reciprocal matches between in-paralogous groups
• One or more out-group organisms can optionally be used to improve the definition of orthologs
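The two steps above (detect in-paralogs, then take best reciprocal matches between the in-paralogous groups) can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular tool: the score table holds made-up similarity scores, a group-to-group score is taken as the best score between any two members, and no out-group is used.

```python
from itertools import combinations

def make_score(table):
    """Wrap a dict of pairwise scores (hypothetical toy values) as a symmetric lookup."""
    return lambda x, y: table.get(frozenset((x, y)), 0.0)

def in_paralog_groups(genes, other_genes, score):
    """Merge genes that are more similar within the species than to any gene between species."""
    best_between = {g: max((score(g, o) for o in other_genes), default=0.0) for g in genes}
    groups = {g: {g} for g in genes}  # gene -> the group it currently belongs to
    for g, h in combinations(genes, 2):
        if score(g, h) > max(best_between[g], best_between[h]) and groups[g] is not groups[h]:
            merged = groups[g] | groups[h]
            for member in merged:
                groups[member] = merged
    unique = []
    for g in genes:
        if not any(groups[g] is u for u in unique):
            unique.append(groups[g])
    return unique

def group_score(group_a, group_b, score):
    """Best score between any member of one group and any member of the other."""
    return max(score(a, b) for a in group_a for b in group_b)

def reciprocal_best_groups(groups_a, groups_b, score):
    """Pairs of groups that are each other's best match across the two species."""
    pairs = []
    for ga in groups_a:
        gb = max(groups_b, key=lambda grp: group_score(ga, grp, score))
        back = max(groups_a, key=lambda grp: group_score(grp, gb, score))
        if back is ga and group_score(ga, gb, score) > 0:
            pairs.append((ga, gb))
    return pairs

# Toy data: b1 and b2 arose by a recent duplication in species B.
score = make_score({
    frozenset(("a1", "b1")): 100, frozenset(("a1", "b2")): 90,
    frozenset(("b1", "b2")): 150, frozenset(("a2", "b3")): 80,
})
genes_a, genes_b = ["a1", "a2"], ["b1", "b2", "b3"]
groups_a = in_paralog_groups(genes_a, genes_b, score)
groups_b = in_paralog_groups(genes_b, genes_a, score)
pairs = reciprocal_best_groups(groups_a, groups_b, score)
```

A plain best reciprocal match would report only `a1`/`b1` and lose `b2`; grouping the in-paralogs first recovers the 1–to–many relationship.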
![Page 22: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/22.jpg)
![Page 23: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/23.jpg)
Orthologous groups
• Orthologs and paralogs are in principle always defined with respect to two organisms
• Orthologous groups instead try to encompass an entire set of organisms
• The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover
![Page 24: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/24.jpg)
Definition of orthologous groups
![Page 25: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/25.jpg)
![Page 26: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/26.jpg)
![Page 27: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/27.jpg)
COGs, KOGs, and NOGs
• The COGs and KOGs were manually curated
  – These were automatically expanded to more species
• Tri-clustering
  – Detection of in-paralogs
  – Identification of triangles of best reciprocal matches
  – Merging of triangles that share an edge
• Broad phylogenetic coverage
  – COGs and NOGs cover all three domains of life
  – KOGs cover all eukaryotes
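The last two tri-clustering steps can be sketched in a few lines. The example assumes that the input edges are already best reciprocal matches between genes from different species and skips the in-paralog step; gene names and function names are hypothetical, and this is an illustration of the idea rather than the actual COG procedure.

```python
from itertools import combinations

def triangles(edges):
    """All sets of three genes that are mutually connected by an edge."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for a, b in edges:
        for c in neighbors[a] & neighbors[b]:
            found.add(frozenset((a, b, c)))
    return found

def merge_triangles(tris):
    """Union triangles that share an edge (two genes) into orthologous groups."""
    tris = sorted(tris, key=sorted)
    parent = list(range(len(tris)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:
            parent[find(i)] = find(j)
    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(groups.values(), key=sorted)

# Toy reciprocal-best-match edges between genes of four species (e, b, h, y).
edges = [("e1", "b1"), ("b1", "h1"), ("e1", "h1"),   # triangle 1
         ("b1", "y1"), ("h1", "y1"),                 # triangle 2 shares edge b1-h1
         ("e2", "b2"), ("b2", "h2"), ("e2", "h2")]   # unrelated triangle
groups = merge_triangles(triangles(edges))
```

Note that `y1` joins the first group even though it has no direct match to `e1`; sharing the `b1`–`h1` edge is enough.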
![Page 28: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/28.jpg)
![Page 29: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/29.jpg)
![Page 30: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/30.jpg)
Clustering based on similarity
• All-against-all sequence similarity is calculated
• A standard clustering method is applied to define groups of homologous genes
  – TribeMCL
  – Hierarchical clustering
• These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
![Page 31: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/31.jpg)
![Page 32: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/32.jpg)
Meta-servers
• Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge
• These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest
• However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies
![Page 33: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/33.jpg)
Function prediction from protein domains
Lars Juhl Jensen, EMBL Heidelberg
![Page 34: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/34.jpg)
When homology searches fail
• Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function
• No functional information can thus be transferred based on simple sequence homology
• By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
![Page 35: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/35.jpg)
Protein domains
• Many eukaryotic proteins consist of multiple globular domains that can fold independently
• These domains have been mixed and matched through evolution
• Each type of domain contributes towards the molecular function of the complete protein
• Numerous resources are able to identify such domains from sequence alone using HMMs
![Page 36: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/36.jpg)
![Page 37: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/37.jpg)
![Page 38: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/38.jpg)
![Page 39: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/39.jpg)
![Page 40: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/40.jpg)
![Page 41: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/41.jpg)
![Page 42: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/42.jpg)
![Page 43: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/43.jpg)
![Page 44: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/44.jpg)
Which domain resource should I use?
• SMART is focused on signal transduction domains
• Pfam is very actively developed and thus tends to have the most up-to-date domain collection
• InterPro is useful for genome annotation since the domains are annotated with GO terms
• CDD is conveniently integrated with the NCBI BLAST web interface
![Page 45: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/45.jpg)
Predicting globular domains and intrinsically disordered regions
• Not all globular domains have been discovered and the databases are thus not comprehensive
• Methods exist for predicting from sequence which regions are globular and which are disordered
  – GlobPlot uses a simple propensity scale
  – DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks
• Many disordered regions are important for protein function and they should thus not be ignored
![Page 46: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/46.jpg)
![Page 47: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/47.jpg)
![Page 48: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/48.jpg)
![Page 49: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/49.jpg)
![Page 50: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/50.jpg)
![Page 51: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/51.jpg)
![Page 52: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/52.jpg)
Summary
• Functional annotation
  – Molecular function vs. biological process
  – Inference of molecular function by sequence similarity
  – Biological process only transferable between orthologs
• Detection of orthologs
  – In-depth studies: phylogenetic trees
  – Automated analysis: InParanoid and COG/KOG/NOG
• Profile searches for protein domains
  – Each domain contributes a different molecular function
![Page 53: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/53.jpg)
Acknowledgments
Christian von Mering
Christopher Creevey
Ivica Letunic
Rune Linding
Tobias Doerks
Francesca Ciccarelli
Berend Snel
Martijn Huynen
Toby Gibson
Rob Russell
Peer Bork
![Page 54: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/54.jpg)
Prediction of functional motifs from sequence
Lars Juhl Jensen, EMBL Heidelberg
![Page 55: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/55.jpg)
Proteins – more than just globular domains
• Transmembrane helices
• Disordered regions
• Eukaryotic linear motifs (ELMs)
  – Modification sites, e.g. phosphorylation sites
  – Ligand peptides, e.g. SH3 binding sites
  – Targeting signals, e.g. nuclear localization sequences
• The short functional motifs are as important as the globular domains
![Page 56: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/56.jpg)
Insulin Receptor Substrate 1
![Page 57: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/57.jpg)
Databases of functional motifs
• Fewer and smaller databases
  – General databases of motifs: ProSite and ELM
  – Phosphorylation sites: Phospho.ELM and PhosphoSite
  – These databases contain far fewer instances than protein domain databases
• Curation is more difficult
  – Protein domain databases can be constructed based on analysis of protein sequences alone
  – Short functional motifs must be curated based on experimental evidence
![Page 58: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/58.jpg)
![Page 59: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/59.jpg)
![Page 60: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/60.jpg)
![Page 61: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/61.jpg)
![Page 62: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/62.jpg)
Prediction of ELMs
• Most functional motifs are “information poor”
  – Weak/short consensus sequences for ELMs
  – The typical ELM only has three conserved residues
  – Some variance is often allowed even for these
• ELMs are very hard to predict from sequence
  – Simple consensus sequences match everywhere
  – Even more advanced methods like PSSMs, ANNs, or SVMs give poor specificity
  – The full information is not in the site itself
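A rough back-of-the-envelope calculation shows why simple consensus sequences match everywhere. The sketch below assumes uniform amino acid frequencies (real proteomes are not uniform), uses `P..P` as a simplified stand-in for a proline-rich SH3-binding consensus, and the proteome size of ten million residues is a hypothetical round number.

```python
import re

def expected_chance_matches(n_fixed, motif_length, sequence_length, alphabet=20):
    """Expected number of matches in a random sequence with uniform residue frequencies."""
    positions = max(sequence_length - motif_length + 1, 0)
    return positions * (1 / alphabet) ** n_fixed

def scan(pattern, sequence):
    """0-based start positions of all matches, including overlapping ones."""
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]

# Three conserved residues within a five-residue motif, scanned against a
# hypothetical proteome of ten million residues: roughly 1,250 chance hits.
chance_hits = expected_chance_matches(3, 5, 10_000_000)

# A consensus with only two fixed residues matches even a short toy peptide twice.
hits = scan("P..P", "APPLPAPGGP")  # → [1, 6]
```

With three fixed positions, each start position matches by chance with probability (1/20)^3 = 1/8000, so any sizable sequence set yields far more random hits than true sites.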
![Page 63: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/63.jpg)
![Page 64: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/64.jpg)
![Page 65: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/65.jpg)
![Page 66: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/66.jpg)
![Page 67: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/67.jpg)
Construction of data sets
• Compiling an initial data set
  – Positive examples can be obtained from existing databases or curated from the literature
  – Good negative examples are often harder to get
• Separate training and test sets
  – A method may be able to learn the training examples but fail to generalize to new examples
• Homology reduction!
  – It is crucial that there is no significant sequence similarity between examples in the training and test sets
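One way to enforce homology reduction is to cluster the sequences first and then assign whole clusters to either the training or the test set. The sketch below is only an illustration: `difflib.SequenceMatcher` is a crude stand-in for a proper alignment-based similarity measure, the 0.5 threshold is arbitrary, and the sequences are made up.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.5):
    """Toy similarity test; a real pipeline would use alignment scores or E-values."""
    ratio = max(SequenceMatcher(None, a, b).ratio(), SequenceMatcher(None, b, a).ratio())
    return ratio >= threshold

def cluster(sequences, threshold=0.5):
    """Single-linkage clustering: a sequence merges every cluster it resembles."""
    clusters = []
    for seq in sequences:
        hits = [c for c in clusters if any(similar(seq, member, threshold) for member in c)]
        merged = [seq]
        for c in hits:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

def split(sequences, test_fraction=0.3, threshold=0.5):
    """Assign whole clusters to test (smallest first, up to the target size) or train."""
    train, test = [], []
    target = test_fraction * len(sequences)
    for c in sorted(cluster(sequences, threshold), key=len):
        (test if len(test) + len(c) <= target else train).extend(c)
    return train, test

# Made-up sequences; the first two and the middle two are near-duplicates.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GSSPLLWWCD", "GSSPLLWWCE", "HHHHHHDDDD", "QNQNQNQNQN"]
train, test = split(seqs)
```

Because clusters are never divided, no test sequence passes the similarity threshold against any training sequence. The test set can end up smaller than requested (or empty) when the clusters are large.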
![Page 68: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/68.jpg)
Machine learning
• Numerous algorithms exist
  – Artificial neural networks
  – Support vector machines
  – Decision trees
• The choice of algorithm is not so important
• Providing the relevant input is important
• Having high-quality training data is crucial
![Page 69: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/69.jpg)
![Page 70: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/70.jpg)
![Page 71: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/71.jpg)
Kinase-specific prediction of phosphorylation sites (NetPhosK)
• Artificial neural networks (ANNs) were trained for several different kinases
• The sequence logos show only the positive examples
• Negative examples also provide information
• Also, ANNs and SVMs can capture correlations between positions
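The point about negative examples can be illustrated with a log-odds weight matrix, a much simpler technique than the neural networks used by NetPhosK. In this sketch the background frequencies come from the negative windows instead of being ignored; the five-residue windows are invented toy data, and the pseudocount of 1 is an arbitrary choice.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(windows, pseudocount=1.0):
    """Per-position residue frequencies with a pseudocount to avoid zeros."""
    table = []
    for i in range(len(windows[0])):
        counts = Counter(w[i] for w in windows)
        total = len(windows) + pseudocount * len(AMINO_ACIDS)
        table.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return table

def log_odds_matrix(positives, negatives):
    """log2(frequency among positives / frequency among negatives) per position."""
    pos = column_frequencies(positives)
    neg = column_frequencies(negatives)
    return [{aa: math.log2(p[aa] / n[aa]) for aa in AMINO_ACIDS} for p, n in zip(pos, neg)]

def score(window, matrix):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(column[aa] for aa, column in zip(window, matrix))

# Invented windows centered near a serine; positives carry basic residues upstream.
positives = ["RRASL", "RRGSV", "KRQSI"]
negatives = ["AGASL", "PLGSV", "EDQSI"]
matrix = log_odds_matrix(positives, negatives)
site_like = score("RRLSV", matrix)
background_like = score("AGLSV", matrix)
```

Residues that are equally common in both sets (here the serine itself) contribute nothing to the score, which a logo built from positives alone would not reveal. Since each position is scored independently, a matrix like this cannot capture the correlations between positions that ANNs and SVMs can.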
![Page 72: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/72.jpg)
Prediction of signal peptides from sequence (SignalP)
• Function
  – Eukaryotic proteins are targeted to the ER
  – Prokaryotic proteins are targeted for secretion
• Architecture
  – Positively charged N-terminus
  – Hydrophobic core
  – Short, more polar region
  – Cleavage site
• Signal peptides can be accurately predicted
![Page 73: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/73.jpg)
Machine learning can help identify errors in curated databases
• Some of the manually curated databases contain obvious errors that can be eliminated
• General “SIGNAL” errors
  – Wrong signal peptide cleavage site
  – The secreted protein is processed by proteases
  – Signal peptide includes propeptide
  – Wrong start codon used
![Page 74: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/74.jpg)
Signal peptide or propeptide
[Figure: precursor layout from the N-terminus: signal peptide, propeptide, mature protein]
![Page 75: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/75.jpg)
Signal peptide or propeptide
[Figure labels: propeptide cleavage, signal peptide cleavage]
![Page 76: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/76.jpg)
Wrong start codon
![Page 77: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/77.jpg)
Use of short linear motifs for function prediction
• Only a few motifs (mostly localization signals) can be predicted with high accuracy
  – Even in these cases advanced machine learning methods are typically needed
  – These can be treated in the same way as domains
• Most motifs are weak, and predictions should be approached with care
  – To tell if these sites are likely to be true, one needs to consider the context
  – An experiment is needed to prove that a site is functional
![Page 78: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/78.jpg)
Feature-based prediction of protein function
Lars Juhl Jensen, EMBL Heidelberg
![Page 79: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/79.jpg)
Function prediction from post-translational modifications
• Proteins with similar function may not be related in sequence
• Still they must perform their function in the context of the same cellular machinery
• Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
Henrik Nielsen, CBS, DTU Lyngby
![Page 80: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/80.jpg)
The concept of ProtFun
![Page 81: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/81.jpg)
![Page 82: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/82.jpg)
Function prediction on the human prion sequence

    ############## ProtFun 1.1 predictions ##############

    >PRIO_HUMAN

    # Functional category                Prob     Odds
      Amino_acid_biosynthesis            0.020    0.909
      Biosynthesis_of_cofactors          0.032    0.444
      Cell_envelope                      0.146    2.393
      Cellular_processes                 0.053    0.726
      Central_intermediary_metabolism    0.130    2.063
      Energy_metabolism                  0.029    0.322
      Fatty_acid_metabolism              0.017    1.308
      Purines_and_pyrimidines            0.528    2.173
      Regulatory_functions               0.013    0.081
      Replication_and_transcription      0.020    0.075
      Translation                        0.035    0.795
      Transport_and_binding           => 0.831    2.027

    # Enzyme/nonenzyme                   Prob     Odds
      Enzyme                             0.250    0.873
      Nonenzyme                       => 0.750    1.051

    # Enzyme class                       Prob     Odds
      Oxidoreductase (EC 1.-.-.-)        0.070    0.336
      Transferase (EC 2.-.-.-)           0.031    0.090
      Hydrolase (EC 3.-.-.-)             0.057    0.180
      Isomerase (EC 4.-.-.-)             0.020    0.426
      Ligase (EC 5.-.-.-)                0.010    0.313
      Lyase (EC 6.-.-.-)                 0.017    0.334
![Page 83: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/83.jpg)
ProtFun data sets
• Labeling of training and test data
  – Cellular role categories: human SwissProt sequences were categorized using EUCLID
  – Enzyme categories: top-level enzyme classifications were extracted from human SwissProt description lines
  – Gene Ontology terms were transferred from InterPro
• The sequences were divided into training and test sets without significant sequence similarity
• Binary predictors were trained for each category
![Page 84: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/84.jpg)
Prediction performance on cellular role categories
![Page 85: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/85.jpg)
Prediction performance on enzyme categories
![Page 86: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/86.jpg)
Predictive performance on Gene Ontology categories
![Page 87: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/87.jpg)
Non-classical secretion
• Some proteins without N-terminal signal peptides are secreted via alternative secretion pathways
  – Several growth factors, e.g. FGF1 and FGF2
  – Interleukin-1 beta
  – HIV-1 Tat
• No consensus sequence motif is known
• Maybe they have some features in common with other secreted proteins …
![Page 88: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/88.jpg)
SecretomeP data sets
• Training and test set
  – Positive examples: 3321 extracellular mammalian proteins with their signal peptides removed
  – Negative examples: 3654 mammalian proteins from cytoplasm or nucleus
• Validation set
  – 14 known non-classically secreted proteins
![Page 89: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/89.jpg)
Secreted proteins are typically small
![Page 90: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/90.jpg)
ROC plot for SecretomeP
![Page 91: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/91.jpg)
Similar properties of classically and non-classically secreted proteins
![Page 92: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/92.jpg)
![Page 93: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/93.jpg)
A look into the black box
• Neural networks are often criticized for being a “black box” method
• However, there are several ways to investigate what a neural network ensemble has learned
– Which fraction of the ensemble uses a certain feature?
– How good a performance can be attained using each of the features individually?
– How much does performance decrease if the neural networks are retrained without a certain feature (or combination of features)?
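The last question, retraining without a feature, can be sketched as a simple ablation loop. This is an illustration only: a nearest-centroid classifier stands in for the neural network ensemble, and all function names and the toy data layout are assumptions rather than anything from the slides.

```python
def train_centroids(X, y):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(centroids, key=dist)

def accuracy(X_train, y_train, X_test, y_test, keep):
    """Train and evaluate using only the feature columns listed in `keep`."""
    def select(X):
        return [[row[j] for j in keep] for row in X]
    model = train_centroids(select(X_train), y_train)
    hits = sum(predict(model, r) == t for r, t in zip(select(X_test), y_test))
    return hits / len(y_test)

def ablation(X_train, y_train, X_test, y_test, names):
    """Accuracy drop when the model is retrained without each feature in turn."""
    cols = list(range(len(names)))
    full = accuracy(X_train, y_train, X_test, y_test, cols)
    return {
        name: full - accuracy(X_train, y_train, X_test, y_test,
                              [j for j in cols if j != i])
        for i, name in enumerate(names)
    }
```

A large drop for a feature suggests the predictor depends on it; a drop near zero suggests it is redundant or unused.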
![Page 94: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/94.jpg)
![Page 95: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/95.jpg)
![Page 96: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/96.jpg)
![Page 97: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/97.jpg)
SecretomeP feature usage
![Page 98: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/98.jpg)
ProtFun performance for other organisms
• Our predictors work in general for eukaryotes
– Best performance on metazoan proteins
• Some categories work quite well for prokaryotes
– Most metabolism categories
– Transport and binding
• While other categories fail
– Energy metabolism
– Regulatory functions
![Page 99: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/99.jpg)
Mapping category performances onto input features
![Page 100: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/100.jpg)
Performance contribution of sequence-derived features
• The correlations between features and function are conserved for eukaryotes
• Some correlations extend to archaea and bacteria
– Physical/chemical properties
– Secondary structure and transmembrane helices
• Other correlations only hold for eukaryotes
– PTMs and subcellular localization features
![Page 101: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/101.jpg)
Evolution conserves protein features and function
• Protein features are more conserved between orthologs than paralogs
• This leads to ProtFun predicting orthologs to be more likely to share function than paralogs
• That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins
![Page 102: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/102.jpg)
Conclusions
• Short linear motifs are likely as important for protein function as the large, well-studied domains
• These are much harder to predict from sequence
– Reasonable accuracy can be obtained by applying machine learning methods to high-quality datasets
• Many classes of proteins can be predicted based on such sequence-derived protein features
– These methods are not nearly as reliable as homology
– However, often they are the only option
![Page 103: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/103.jpg)
Acknowledgments
Ramneek Gupta
Can Kesmir
Jannick Dyrløv Bendtsen
Henrik Nielsen
Nikolaj Blom
Francesca Diella
Rune Linding
Damien Devos
Alfonso Valencia
Søren Brunak
Toby Gibson
![Page 104: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/104.jpg)
Prediction of functional interaction networks
Lars Juhl Jensen
EMBL Heidelberg
![Page 105: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/105.jpg)
What is an interaction?
• Physical protein interactions
– Proteins that physically touch each other
– Members of the same stable complex
– Transient interactions, e.g. a kinase and its substrate
• The pragmatic definition – whatever the assay in question can measure
• Functional interactions
– Neighbors in metabolic networks
– Members of the same pathway
![Page 106: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/106.jpg)
The use of interaction networks for function prediction
• A functional interaction implies that two proteins are involved in the same biological process
• However, the networks do not divide proteins into a predefined set of functional classes such as the Gene Ontology terms
• Functional associations do not require homology to proteins of known function, and can complement the predictions even when homology is present
![Page 107: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/107.jpg)
![Page 108: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/108.jpg)
![Page 109: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/109.jpg)
Functional interaction networks
![Page 110: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/110.jpg)
Evidence types
• Genomic context methods
– Phylogenetic profiles, gene neighborhood, and fusion
• Primary experimental data
– Physical protein interactions and gene expression data
• Manually curated databases
– Pathways and protein complexes
• Automatic literature mining
– Co-occurrence and Natural Language Processing
![Page 111: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/111.jpg)
Phylogenetic profiles
![Page 112: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/112.jpg)
![Page 113: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/113.jpg)
![Page 114: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/114.jpg)
![Page 115: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/115.jpg)
[Figure labels: Cell, Cellulosomes, Cellulose]
![Page 116: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/116.jpg)
Formalizing the phylogenetic profile method
Align all proteins against all
Calculate best-hit profile
Join similar species by PCA
Calculate PC profile distances
Calibrate against KEGG maps
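The core idea of comparing profiles can be sketched in a deliberately simplified form. Note the substitution: the pipeline above uses best-hit profiles and PCA-reduced distances, whereas this illustration uses plain binary presence/absence profiles and Hamming distance. The function names and the `max_distance` threshold are assumptions.

```python
def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def closest_profiles(profiles, max_distance=1):
    """Return (gene, gene, distance) for pairs with near-identical profiles.

    `profiles` maps a gene name to a tuple of 0/1 flags, one per genome.
    """
    genes = sorted(profiles)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            d = hamming(profiles[g], profiles[h])
            if d <= max_distance:
                pairs.append((g, h, d))
    return sorted(pairs, key=lambda t: t[2])
```

Genes that are gained and lost together across genomes end up at small distances and become candidate functional partners.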
![Page 117: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/117.jpg)
Gene neighborhood
![Page 118: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/118.jpg)
Gene neighborhood
Identify runs of adjacent genes with the same direction
Score each gene pair based on intergenic distances
Calibrate against KEGG maps
Infer associations in other species
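The first two steps can be sketched as follows. This is an illustrative assumption, not the scoring actually used: genes are given as `(name, start, end, strand)` tuples sorted by position on one contig, and the raw score for a pair is simply the summed intergenic distance separating them (smaller means closer).

```python
from itertools import combinations, groupby

def same_strand_runs(genes):
    """Yield runs of two or more adjacent genes that share a strand."""
    for _, run in groupby(genes, key=lambda g: g[3]):
        run = list(run)
        if len(run) > 1:
            yield run

def neighbor_scores(genes):
    """Map each gene pair within a run to the summed intergenic gap between them."""
    scores = {}
    for run in same_strand_runs(genes):
        for i, j in combinations(range(len(run)), 2):
            # Sum the gaps between consecutive genes from position i to j;
            # overlapping genes contribute a gap of zero.
            gap = sum(max(0, run[k + 1][1] - run[k][2]) for k in range(i, j))
            scores[(run[i][0], run[j][0])] = gap
    return scores
```

Such raw scores would then be calibrated against a reference set before being compared with other evidence types.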
![Page 119: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/119.jpg)
Gene fusion
![Page 120: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/120.jpg)
Gene fusion
Calculate all-against-all pairwise alignments
Find in A genes that match the same gene in B
Exclude overlapping alignments
Calibrate against KEGG maps
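A minimal sketch of the fusion search, assuming the alignments are already available as `(gene_in_A, gene_in_B, start_on_B, end_on_B)` tuples. The tuple layout and function name are hypothetical; the slides do not specify an input format.

```python
from collections import defaultdict
from itertools import combinations

def fusion_candidates(hits):
    """Return (gene_a1, gene_a2, fused_gene_b) triples.

    Two genes from genome A are reported when both align to the same gene
    in genome B and the aligned regions on that gene do not overlap.
    """
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))

    pairs = set()
    for target, aligned in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(aligned, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                pairs.add((min(q1, q2), max(q1, q2), target))
    return sorted(pairs)
```

Excluding overlapping alignments avoids reporting paralogs that match the same region of the target gene as if they were separate halves of a fusion.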
![Page 121: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/121.jpg)
Calibration of quality scores
• Different pieces of evidence are not directly comparable
– A different raw quality score is used for each evidence type
– Quality differences exist among data sets of the same type
• Solved by calibrating all scores against a common reference
– The accuracy relative to a “gold standard” is calculated within score intervals
– The resulting points are approximated by a sigmoid
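The two calibration steps can be sketched as below. Assumptions are labeled: the two-parameter logistic form, the grid-search fit, and all function and parameter names are illustrative choices, since the slides only say that a sigmoid approximates the binned points.

```python
import math

def binned_accuracy(scored_pairs, bin_width):
    """Return sorted (bin_center, fraction_true) points.

    `scored_pairs` is a list of (raw_score, is_true_in_gold_standard).
    """
    bins = {}
    for score, is_true in scored_pairs:
        b = math.floor(score / bin_width)
        hits, total = bins.get(b, (0, 0))
        bins[b] = (hits + bool(is_true), total + 1)
    return [((b + 0.5) * bin_width, h / t) for b, (h, t) in sorted(bins.items())]

def sigmoid(x, midpoint, slope):
    """Two-parameter logistic curve mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def fit_sigmoid(points, midpoints, slopes):
    """Least-squares grid search over candidate midpoints and slopes."""
    def sse(params):
        m, k = params
        return sum((sigmoid(x, m, k) - y) ** 2 for x, y in points)
    return min(((m, k) for m in midpoints for k in slopes), key=sse)
```

Once fitted, the curve converts any raw score of that evidence type into an estimated probability, which makes different evidence types comparable.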
![Page 122: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/122.jpg)
Data integration
![Page 123: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/123.jpg)
Protein-protein interaction databases
• Imported databases
– BIND, Biomolecular Interaction Network Database
– DIP, Database of Interacting Proteins
– GRID, General Repository for Interaction Datasets
– HPRD, Human Protein Reference Database
– MINT, Molecular Interactions Database
• Databases to be added
– IntAct
– PDB
![Page 124: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/124.jpg)
Physical protein interactions
Make binary representation of complexes
Yeast two-hybrid data sets are inherently binary
Calculate score from number of (co-)occurrences
Calculate score from non-shared partners
Calibrate against KEGG maps
Infer associations in other species
Combine evidence from experiments
![Page 125: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/125.jpg)
Binary representations of purification data
![Page 126: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/126.jpg)
Topology based quality scores
• Scoring scheme for yeast two-hybrid data:
– S1 = -log((N1+1)·(N2+1))
– N1 and N2 are the numbers of non-shared interaction partners
– Similar scoring schemes have been published by Saito et al.
• Scoring scheme for complex pull-down data:
– S2 = log[(N12·N)/((N1+1)·(N2+1))]
– N12 is the number of purifications containing both proteins
– N1 is the number containing protein 1, N2 is defined similarly
– N is the total number of purifications
• Both schemes aim at identifying ubiquitous interactors
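Both formulas are easy to compute directly from the data. The sketch below makes some assumptions that the slide leaves open: the natural logarithm is used, "non-shared partners" is read as partners of one protein that the other does not have (excluding the pair itself), and the function names and input layouts are hypothetical.

```python
import math

def y2h_score(partners, a, b):
    """S1 = -log((N1+1)*(N2+1)) for a yeast two-hybrid pair.

    `partners` maps each protein to the set of its interaction partners.
    """
    pa = partners[a] - {b}
    pb = partners[b] - {a}
    n1 = len(pa - pb)  # partners of a not shared with b
    n2 = len(pb - pa)  # partners of b not shared with a
    return -math.log((n1 + 1) * (n2 + 1))

def pulldown_score(purifications, a, b):
    """S2 = log((N12*N)/((N1+1)*(N2+1))) for a pull-down pair.

    `purifications` is a list of sets, one set of proteins per purification.
    """
    n = len(purifications)
    n1 = sum(a in p for p in purifications)
    n2 = sum(b in p for p in purifications)
    n12 = sum(a in p and b in p for p in purifications)
    if n12 == 0:
        return float("-inf")  # never co-purified, so log(0)
    return math.log((n12 * n) / ((n1 + 1) * (n2 + 1)))
```

In both cases a protein with many partners, or one that turns up in many purifications, pulls the score down, which is how ubiquitous interactors are penalized.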
![Page 127: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/127.jpg)
Mining microarray expression databases
Re-normalize arrays by modern method to remove biases
Build expression matrix
Combine similar arrays by PCA
Construct predictor by Gaussian kernel density estimation
Calibrate against KEGG maps
Infer associations in other species
![Page 128: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/128.jpg)
Databases of curated knowledge
• Pathway databases
– BioCarta
– KEGG, Kyoto Encyclopedia of Genes and Genomes
– Reactome
– STKE, Signal Transduction Knowledge Environment
• Curated protein complexes
– MIPS, Munich Information center for Protein Sequences
• Databases to be added
– Gene Ontology annotation
![Page 129: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/129.jpg)
Co-occurrence in the scientific texts
Associate abstracts with species
Identify gene names in title/abstract
Count (co-)occurrences of genes
Test significance of associations
Calibrate against KEGG maps
Infer associations in other species
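The counting and significance steps can be sketched as follows. The slide does not name the statistical test, so the hypergeometric tail used here is an assumption, as are the function names and the representation of each abstract as a set of recognized gene names.

```python
from itertools import combinations
from math import comb

def cooccurrence_counts(abstracts):
    """Return (single, pair) count dictionaries.

    `abstracts` is a list of sets of gene names found in each abstract.
    """
    single, pair = {}, {}
    for genes in abstracts:
        for g in genes:
            single[g] = single.get(g, 0) + 1
        for a, b in combinations(sorted(genes), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return single, pair

def hypergeom_pvalue(n_total, n_a, n_b, n_ab):
    """P(co-occurrences >= n_ab) if genes a and b were mentioned independently."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_ab, upper + 1))
    return tail / comb(n_total, n_b)
```

A small p-value means the two genes are mentioned together more often than their individual frequencies would explain.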
![Page 130: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/130.jpg)
Databases used for text mining
• Corpora
– Medline
– OMIM, Online Mendelian Inheritance in Man
– SGD, Saccharomyces Genome Database
– The Interactive Fly
• These text sources are all parsed and converted into a unified format
• Gene synonyms
– Ensembl
– SwissProt
– HUGO
– LocusLink
– SGD
– TAIR
• Cross references and sequence comparison are used for merging
![Page 131: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/131.jpg)
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
![Page 132: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/132.jpg)
Multiple types of interactions
![Page 133: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/133.jpg)
Transfer of evidence
• STRING “red” – COG mode
– Each node in the network represents a COG
– For each pair of COGs, the highest confidence score for each evidence type counts from each clade
– The scores are combined using naïve Bayes
• STRING “blue” – protein mode
– Each node in the network represents a single locus
– Evidence from other organisms is transferred based on fuzzy orthology
– The scores are combined using naïve Bayes
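One common reading of a naive-Bayes-style combination is that calibrated scores are treated as independent probabilities, so the combined score is the probability that at least one line of evidence is correct. The sketch below assumes exactly that and nothing more; any further corrections the authors apply are not shown on the slide, and the function names are hypothetical.

```python
from math import prod

def combine_scores(scores):
    """Probability that at least one independent line of evidence is correct."""
    return 1.0 - prod(1.0 - s for s in scores)

def combine_by_type(evidence):
    """Keep the best score per evidence type, then combine across types.

    `evidence` maps an evidence type to a list of calibrated scores.
    """
    return combine_scores(max(scores) for scores in evidence.values() if scores)
```

Taking the maximum within a type before combining avoids counting several observations of the same kind as if they were independent.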
![Page 134: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/134.jpg)
![Page 135: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/135.jpg)
![Page 136: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/136.jpg)
Evidence transfer based on “fuzzy orthology”
[Figure labels: Source species, Target species, ?]
• Orthology transfer is tricky
– Correct assignment of orthology is difficult for distant species
– Functional equivalence is not guaranteed for paralogs
• These problems are addressed by our “fuzzy orthology” scheme
– Functional equivalence scores are calculated from all-against-all alignment
– Evidence is distributed across possible pairs
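How evidence might be distributed across possible pairs can be illustrated with a small sketch. The weighting rule here is an assumption made purely for illustration: each source protein's candidate counterparts in the target species receive a share proportional to their functional equivalence score, and the shares of the two proteins are multiplied. The slides do not give the actual formula.

```python
def transfer_evidence(score, equiv_a, equiv_b):
    """Distribute one source-species association score over target-species pairs.

    `equiv_a` and `equiv_b` map candidate target proteins to functional
    equivalence scores for the two source proteins of the association.
    """
    total_a = sum(equiv_a.values())
    total_b = sum(equiv_b.values())
    return {
        (ta, tb): score * (wa / total_a) * (wb / total_b)
        for ta, wa in equiv_a.items()
        for tb, wb in equiv_b.items()
    }
```

With this rule the transferred scores always sum to the original score, so ambiguous orthology dilutes the evidence instead of duplicating it.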
![Page 137: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/137.jpg)
The power of cross-species transfer and evidence integration
![Page 138: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/138.jpg)
![Page 139: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/139.jpg)
![Page 140: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/140.jpg)
![Page 141: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/141.jpg)
![Page 142: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/142.jpg)
![Page 143: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/143.jpg)
The big challenge
![Page 144: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/144.jpg)
Prediction of “mode of action”
![Page 145: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/145.jpg)
Summary
• Functional interaction networks are useful for predicting the biological role of a protein
• Many algorithms and types of data can be used for predicting functional interactions
– Each method must be benchmarked
– The different types of evidence should be integrated in a probabilistic scoring scheme
• To make the most of the available data, evidence should also be transferred between organisms
![Page 146: Prediction of protein function Lars Juhl Jensen EMBL Heidelberg](https://reader038.vdocuments.site/reader038/viewer/2022110304/551c15045503469e4f8b55b8/html5/thumbnails/146.jpg)
Acknowledgments
Christian von Mering
Jasmin Saric
Berend Snel
Sean Hooper
Rossitza Ouzounova
Samuel Chaffron
Julien Lagarde
Mathilde Foglierini
Isabel Rojas
Martijn Huynen
Peer Bork