
2003/12/5 PPLAB 1

Prediction of Human Protein Function According to Gene Ontology Categories

L. J. Jensen, R. Gupta, H.-H. Stærfeldt and S. Brunak

Bioinformatics, Vol. 19, No. 5, pp. 635-642, 2003

Outline

Introduction
System and Methods
Discussion
Conclusion

Introduction

For most whole-genome sequencing projects, the function of a large fraction of the proteins remains unknown

This paper extends the ProtFun prediction method, which predicted cellular role categories as well as enzymatic function, to also cover a number of GO categories
– GO categories give a more specific description of the function

System and Methods

The Neural Networks Approach
– Data set
– Data set partitioning
– Choosing the classes to predict
– Sequence derived protein features
– Neural network training and feature selection
– Making predictions with the neural networks

The Neural Networks

Data Set

Generation of a Labeled Data Set
– Uses the InterPro database, in which protein families have been assigned GO numbers
– Links these assignments to a list of InterPro domain matches to SWISS-PROT and TrEMBL
– Result: a set of 21,401 human sequences with annotated GO numbers
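The labeling step can be sketched as a join between two lookup tables. This is a minimal illustration only: the function name, the dictionary layout and the toy identifiers (`IPR_A`, `GO:1`, `P1`, ...) are assumptions and do not come from the paper or from the real databases.

```python
from collections import defaultdict


def label_sequences(interpro2go, protein_matches):
    """Transfer GO numbers from InterPro families to the proteins that match them."""
    labels = defaultdict(set)
    for protein, families in protein_matches.items():
        for family in families:
            # A family without GO numbers contributes nothing.
            labels[protein].update(interpro2go.get(family, ()))
    # Keep only proteins that received at least one GO number.
    return {protein: gos for protein, gos in labels.items() if gos}


# Hypothetical toy data (placeholder identifiers, not real accessions).
interpro2go = {"IPR_A": {"GO:1"}, "IPR_B": {"GO:2", "GO:3"}}
protein_matches = {"P1": ["IPR_A", "IPR_B"], "P2": ["IPR_C"], "P3": ["IPR_B"]}
labeled = label_sequences(interpro2go, protein_matches)
```

In this toy case `P2` is dropped because its only matching family carries no GO number.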

Data Set Partitioning

Cross Validation - Heuristic
– Divide the data set into five sets of equal size with minimal sequence similarity overlap between the sets

Unfortunately, either it was impossible to split the set into five unrelated subsets, or the heuristic at least failed to find a sufficiently good solution

Data Set Partitioning (cont.)

Each of the five subsets was reduced to 2,500 sequences
– By removing the sequences with the highest connectivity

The result is a five-fold cross-validation set of 12,500 sequences with no significant similarity between sequences in different subsets
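One way to picture the removal of highly connected sequences is the greedy sketch below. It is an assumption-laden illustration, not the authors' procedure: it stops once no similarity link crosses a subset border instead of trimming to a fixed subset size, and the names `prune_cross_links`, `subset_of` and `similar_pairs` are hypothetical.

```python
def prune_cross_links(subset_of, similar_pairs):
    """Greedily drop the sequence with the most similarity links to other subsets."""
    kept = set(subset_of)
    # Only links between sequences in different subsets are a problem.
    links = {frozenset(pair) for pair in similar_pairs
             if subset_of[pair[0]] != subset_of[pair[1]]}
    while links:
        degree = {}
        for link in links:
            for seq in link:
                degree[seq] = degree.get(seq, 0) + 1
        # Highest connectivity first; ties are broken alphabetically.
        worst = max(sorted(degree), key=degree.get)
        kept.discard(worst)
        links = {link for link in links if worst not in link}
    return kept


# Hypothetical toy data: sequence -> subset index, plus similar pairs.
subset_of = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 0}
similar_pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
kept = prune_cross_links(subset_of, similar_pairs)
```

Here `a` is removed first (three cross-subset links), then `d`; the link between `b` and `c` is harmless because both sit in the same subset.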

Choosing the Classes to Predict

GO Categories as of June 10th, 2001
– 1,532 of 7,949 different classes were represented in the data set described above
– Of these, 347 categories were annotated to at least 20 different InterPro families
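The family-count filter can be sketched as follows. The threshold of 20 comes from the slide; the function name, the data layout and the toy identifiers are assumptions for illustration.

```python
from collections import defaultdict


def select_classes(interpro2go, min_families=20):
    """Return GO classes annotated to at least `min_families` InterPro families."""
    families_per_go = defaultdict(set)
    for family, gos in interpro2go.items():
        for go in gos:
            families_per_go[go].add(family)
    return sorted(go for go, families in families_per_go.items()
                  if len(families) >= min_families)


# Hypothetical toy mapping; a threshold of 2 keeps the example small.
toy_mapping = {
    "IPR_A": {"GO:1", "GO:2"},
    "IPR_B": {"GO:1"},
    "IPR_C": {"GO:3"},
}
selected = select_classes(toy_mapping, min_families=2)
```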

Sequence Derived Protein Features

Features Used
– Aliphatic index
– Extinction (absorbance) coefficient
– Hydrophobicity (water aversion)
– Instability index
– Number of atoms
– Number of negative residues
– Number of positive residues
– Isoelectric point
– Secondary structure
– Transmembrane (membrane-spanning) helices
– Low complexity regions
– Propeptides
– Signal peptides
– Protein targeting
– Protein sorting
– N-glycosylation
– S/T-phosphorylation
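Two of the simpler features can be computed directly from the amino acid sequence. The sketch below assumes Ikai's definition of the aliphatic index (mole percentages of Ala, Val, Ile and Leu with weights 1, 2.9 and 3.9) and counts Asp/Glu as negative and Arg/Lys as positive residues; the slides do not say which definitions the authors used, and the function names are hypothetical.

```python
def mole_percent(seq, residue):
    """Percentage of positions in `seq` occupied by `residue`."""
    return 100.0 * seq.count(residue) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index, assuming Ikai's weights for Val (2.9) and Ile/Leu (3.9)."""
    return (mole_percent(seq, "A")
            + 2.9 * mole_percent(seq, "V")
            + 3.9 * (mole_percent(seq, "I") + mole_percent(seq, "L")))


def charge_counts(seq):
    """Return (number of negative residues D/E, number of positive residues R/K)."""
    negative = sum(seq.count(aa) for aa in "DE")
    positive = sum(seq.count(aa) for aa in "RK")
    return negative, positive


# Hypothetical toy peptide in one-letter code.
toy_seq = "AVILDEKR"
toy_ai = aliphatic_index(toy_seq)
toy_charges = charge_counts(toy_seq)
```

Most of the other listed features (secondary structure, transmembrane helices, signal peptides and so on) are themselves outputs of dedicated prediction tools and cannot be reduced to a one-line formula like this.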

Training

For each GO class, standard feed-forward neural networks with a single layer of hidden neurons were used to predict which examples belong to the class

For each feature combination, the input vector for the neural networks is the concatenation of the respective feature vectors, while the target output is a single value (1 or 0)
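The network layout described above can be sketched as a forward pass in plain Python. The sigmoid activation, the weight values and the names `encode` and `forward` are assumptions; the slides only state that there is one hidden layer, a concatenated input and a single output.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(feature_vectors):
    """Concatenate the per-feature vectors into one input vector."""
    return [value for vector in feature_vectors for value in vector]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer, one output unit; returns a value between 0 and 1."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical example: two feature vectors, two hidden neurons.
x = encode([[0.2, 0.7], [1.0]])
w_hidden = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.6]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.8]
b_out = 0.05
output = forward(x, w_hidden, b_hidden, w_out, b_out)
```

During training the output would be compared with the target value (1 if the protein belongs to the GO class, 0 otherwise) and the weights adjusted accordingly; the training loop is omitted here.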

Training (cont.)

Only 26 GO classes remained after removing those not strongly correlated with any of the predicted features

For each of them, the optimal feature combination was searched for using a greedy search heuristic

The final set of predictors consists of cross-validation ensembles of five neural networks for each of 14 GO classes
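A greedy search over feature combinations can be sketched as forward selection: add the feature that improves the score most, and stop when nothing improves it. This is one common form of greedy search and only an assumption about the heuristic used; the scoring function below is a made-up stand-in for cross-validated network performance, and all names and numbers are hypothetical.

```python
def greedy_select(features, score):
    """Forward selection: repeatedly add the feature that raises the score most."""
    chosen, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidate_score, candidate = max(
            (score(chosen + [feature]), feature) for feature in remaining)
        if candidate_score <= best:
            break  # no remaining feature improves the score
        chosen.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return chosen, best


# Hypothetical usefulness of each feature and a small cost per input.
usefulness = {"tm_helices": 0.3, "signal_peptide": 0.2, "n_glycosylation": 0.0}


def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


chosen, best = greedy_select(usefulness, toy_score)
```

In the toy run the uninformative feature is never added, because including it would lower the score.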

Making Predictions

Use the neural networks to predict the function of novel sequences
– Sequence derived features are computed and encoded
– For each GO class, the average output is calculated and converted to a probability using a calibration curve
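The averaging and calibration step can be sketched as below. The slides do not say how the calibration curve is represented; this sketch assumes a sorted list of (score, probability) points with linear interpolation between them, and all numbers are made up.

```python
from bisect import bisect_right


def ensemble_average(outputs):
    """Mean output of the networks in one ensemble."""
    return sum(outputs) / len(outputs)


def calibrate(score, curve):
    """Map a raw score to a probability by linear interpolation in `curve`.

    `curve` is a list of (score, probability) points sorted by score.
    """
    xs = [x for x, _ in curve]
    if score <= xs[0]:
        return curve[0][1]
    if score >= xs[-1]:
        return curve[-1][1]
    i = bisect_right(xs, score)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)


# Hypothetical outputs of five networks for one GO class, and a toy curve.
network_outputs = [0.6, 0.7, 0.8, 0.7, 0.7]
toy_curve = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.9)]
average = ensemble_average(network_outputs)
probability = calibrate(average, toy_curve)
```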

Discussion

Why can so relatively few GO classes be predicted?
– Lack of data: for 90% of the GO classes we cannot assign a single positive example among human SWISS-PROT and TrEMBL entries
– Many of the categories were removed for various reasons

Discussion (cont.)

The method appears to be better at predicting biological process than molecular function

Novel putative receptors

Chromosomal clustering of proteins with similar function

Conclusions

We have succeeded in making a sequence based function prediction method for a subset of the GO

The method is well suited for computational screening of the human genome for novel drug targets

Thank you!

Yi-Yao Huang (黃奕堯)

[email protected]@par.cse.nsysu.edu.tw

What Is GO (Gene Ontology)?

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases

Three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner

DAG (directed acyclic graph)
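Because the ontology is a DAG rather than a tree, a term can have several parents, and a gene product annotated to a term is implicitly annotated to all of its ancestors. A minimal sketch of collecting those ancestors follows; the term labels and the `parents` layout are hypothetical placeholders, not real GO terms.

```python
def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy DAG: D has two parents, which share the root A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
toy_ancestors = ancestors("D", toy_parents)
```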

Glossary

Protein sorting and protein targeting
– Proteins are sorted (delivered to their destination within the cell) in accordance with sorting signals

Propeptides, signal peptides and glycosylation
– To have the product of a transferred gene expressed or retained in a specific organelle, a signal peptide or propeptide that can direct the gene product to a specific location or organelle usually has to be added after the promoter. A signal peptide usually directs the gene product to the ER (endoplasmic reticulum), where it undergoes glycosylation; it finally enters the secretory pathway and is secreted out of the cell.