Application of Stacked Generalization to a Protein Localization Prediction Task
Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.
Pace University, School of Computer Science and Information Systems
September 27, 2003
Overview
• Introduction
• Purpose
• Methods
• Algorithms
• Results
• Conclusions and Future Work
Introduction
Introduction: Data Mining
• Application of machine learning algorithms to large databases
• Often used to classify future data based on a training set
• The “target” variable is the variable to be predicted
• Theoretically, algorithms are context-independent
Introduction: Stacked Generalization
• Method for combining models
• Part of the training set is used to train level-0, or base, models as usual
• Level-1 data are built from the predictions of the level-0 models on the remainder of the set
• Level-1 generalizers are models trained on the level-1 data
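The split-train-predict flow above can be sketched as follows; the dataset, the two level-0 learners, and the even half-split are illustrative assumptions, not the configuration used in the talk.

```python
# Minimal sketch of stacked generalization (hypothetical data and models):
# level-0 models train on part of the training set; their predictions on
# the held-out remainder become the level-1 dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Split the training set: the first half trains the level-0 models as usual.
X0, y0 = X[:150], y[:150]   # level-0 training portion
Xh, yh = X[150:], y[150:]   # held-out portion -> source of level-1 data

level0 = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
for m in level0:
    m.fit(X0, y0)

# Level-1 data: each column holds one level-0 model's predictions on the holdout.
X1 = np.column_stack([m.predict(Xh) for m in level0])

# The level-1 generalizer learns to combine the level-0 predictions.
generalizer = LogisticRegression()
generalizer.fit(X1, yh)
```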
Introduction: Bioinformatics and Protein Localization
• Bioinformatics: application of computing to molecular biology
• Currently much interest in information about proteins
• Expression of proteins localized in a particular type or part of cell (localization)
• Knowledge of protein localization can shed light on a protein’s function
• Data mining employed to predict localization from a database of information about the encoding genes
Introduction: KDD Cup 2001 Task
• KDD Cup: annual data mining competition sponsored by ACM SIGKDD
• Participants use a training set to predict target variable values in a test dataset of different instances
• Winner is the most accurate model (correct predictions / total instances in the test set)
• 2001 task: predict the protein localization of genes
• Anonymized genes were the instances; information about the genes formed the attributes
• Datasets (incl. the revealed target) were used in this project
Purpose
• Use Stacked Generalization approach on this task
• Compare inter-algorithm performance using level-0 models and level-1 generalizers
• Evaluate strategy of equally distributing target variable
Methods
Methods: Dataset Manipulations
• Reduce the number of input variables
• Reduce the number of potential target values to 3
• Separate the original training dataset into training and validation sets for stacking
• Eliminate effectively unary variables in the final training dataset
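A possible reading of the last step is a filter that drops columns dominated by a single value. The 99% threshold, the column names, and the values below are all assumptions for illustration; the talk does not state a cutoff.

```python
# Sketch: drop "effectively unary" variables, i.e. columns where one value
# accounts for (almost) every instance in the training set.
def effectively_unary(column, threshold=0.99):
    """True if a single value accounts for >= threshold of the column."""
    counts = {}
    for v in column:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(column) >= threshold

# Hypothetical attribute columns keyed by name.
data = {
    "motif_a": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],      # unary -> dropped
    "chromosome": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],   # varied -> kept
}
kept = {name: col for name, col in data.items() if not effectively_unary(col)}
```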
Table: Target Variable Distribution
| Localization | Training N (%) | Validation N (%) | Test N (%) |
|---|---|---|---|
| Nucleus | 366 (58.4%) | 189 (60.4%) | 174 (63.3%) |
| Cytoplasm | 192 (30.6%) | 90 (28.75%) | 66 (24.0%) |
| Mitochondria | 69 (11.0%) | 34 (10.9%) | 35 (12.7%) |
Methods: Equally Distributed Approach
• A second training set was created by stratifying to ensure equally distributed localizations
• Level-0 models were trained on both the raw (unequally distributed) and the equally distributed training sets
• Separate level-1 data and level-1 generalizers were built from this dataset
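One plausible way to equalize the target distribution is to downsample each localization to the size of the smallest class; the talk does not specify its stratification mechanics, so this sketch and its toy labels are assumptions.

```python
import random

def equalize(instances, labels, seed=0):
    """Downsample each class to the size of the smallest class so the
    target variable is equally distributed in the resulting set."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(v) for v in by_class.values())
    out = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n):  # sample without replacement
            out.append((x, y))
    return out

# Toy example: 6 nucleus, 4 cytoplasm, 2 mitochondria instances.
sample = equalize(range(12), ["nucleus"] * 6 + ["cytoplasm"] * 4 + ["mito"] * 2)
```

Downsampling to the minority class keeps the distribution exactly equal, at the cost of the smaller sample size the conclusions slide later flags as a possible problem.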
Algorithms
Algorithms: Level-0 Artificial Neural Network (ANN)
• Fully connected feedforward network
• Input variables converted to dummy variables → 186 input nodes
• Target variable converted to dummy variables → 2 output nodes
• 1 hidden node
• Training based on change in misclassification rate
Algorithms: Level-0 Decision Tree
• Used a CHAID-like algorithm
• Chi-squared p value splitting criterion: p < 0.2
• Model selection based on the proportion of instances correctly classified
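The chi-squared splitting criterion can be sketched as a test on the attribute/target contingency table, keeping a candidate split only when p < 0.2. The contingency counts below are invented for illustration, and the use of `scipy.stats.chi2_contingency` is an assumption about tooling, not the talk's implementation.

```python
# CHAID-style split screening: an attribute qualifies as a split only if its
# association with the target is significant at the p < 0.2 level.
from scipy.stats import chi2_contingency

def passes_split_test(contingency, alpha=0.2):
    """True if the attribute/target contingency table shows association."""
    _, p, _, _ = chi2_contingency(contingency)
    return p < alpha

# Rows: attribute values; columns: localization counts (made-up numbers).
informative = [[30, 5], [5, 30]]       # strong association -> small p
uninformative = [[18, 17], [17, 18]]   # near-uniform -> large p
```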
Algorithms: Level-0 Nearest Neighbor (NN)
• Compare each instance between the two datasets
• Count the number of matching attributes
• Predict the target value of the instance matching on the greatest number of attributes
• Use relative frequency in the unequally distributed dataset to break ties
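The matching-attributes rule with a frequency tie-breaker can be sketched directly; the three-attribute training instances below are hypothetical.

```python
# Nearest neighbor over categorical attributes: score each training instance
# by its count of matching attribute values, breaking ties in favor of the
# label that is more frequent in the training set.
from collections import Counter

def predict_nn(query, train):
    """train: list of (attributes, label) pairs with categorical attributes."""
    freq = Counter(label for _, label in train)
    def score(item):
        attrs, label = item
        matches = sum(a == b for a, b in zip(query, attrs))
        return (matches, freq[label])   # second key implements the tie-breaker
    return max(train, key=score)[1]

train = [(("a", "x", "p"), "nucleus"),
         (("a", "y", "p"), "nucleus"),
         (("b", "y", "q"), "cytoplasm")]
```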
Algorithms: Level-0 Hybrid Decision Tree/ANN
• Difficult for an ANN to learn with too many variables
• A decision tree can be used as a “feature selector”
• Important variables are those used as branching criteria
• A new ANN is trained using only the important variables as inputs
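The tree-as-feature-selector idea can be sketched with scikit-learn: fit a tree, keep only the variables it branches on, then train the network on those. The dataset, tree depth, and network settings are illustrative assumptions, not the talk's configuration.

```python
# Hybrid decision tree / ANN sketch: the tree's branching variables become
# the network's inputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# tree_.feature holds the split feature per node (negative values at leaves),
# so the non-negative entries are the variables used as branching criteria.
important = sorted(set(f for f in tree.tree_.feature if f >= 0))

# Train a small network on the selected variables only.
ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=0)
ann.fit(X[:, important], y)
```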
Algorithms: Level-1 Generalizers
• ANN and Decision Tree
  – Designed and trained essentially the same as their level-0 counterparts
  – The ANN had 8 input nodes
• Naïve Bayesian Model
  – Calculated the likelihood of each target value based on Bayes’ rule
  – Predicted the value with the highest likelihood
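A naïve Bayesian generalizer over the categorical level-0 predictions can be sketched as prior times per-feature conditional likelihoods. The add-one smoothing and the toy level-1 rows are assumptions; the talk does not detail the estimator.

```python
# Naive Bayes over level-1 data: score each class y by
# P(y) * prod_i P(prediction_i | y), with add-one smoothing.
from collections import Counter, defaultdict

def train_nb(rows, labels):
    n = len(rows)
    prior = Counter(labels)
    cond = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    def predict(row):
        def likelihood(y):
            p = prior[y] / n
            for i, v in enumerate(row):
                p *= (cond[(i, y)][v] + 1) / (prior[y] + 2)  # smoothed
            return p
        return max(prior, key=likelihood)
    return predict

# Hypothetical level-1 rows: (ANN prediction, tree prediction) per instance.
rows = [("nuc", "nuc"), ("nuc", "cyt"), ("cyt", "cyt"), ("nuc", "nuc")]
labels = ["nuc", "nuc", "cyt", "nuc"]
predict = train_nb(rows, labels)
```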
Results
Results: Accuracy Rates

Level-0 Models

| Approach | Dataset | ANN | Tree | NN | Hybrid |
|---|---|---|---|---|---|
| Unequally Distributed | Validation | 65.8% | 72.2% | 73.8% | 71.3% |
| Unequally Distributed | Test | 64.7% | 65.1% | 70.6% | 71.3% |
| Equally Distributed | Validation | 62.9% | 61.7% | 64.9% | 62.3% |
| Equally Distributed | Test | 66.9% | 59.6% | 65.1% | 61.8% |

Level-1 Generalizers

| Approach | Level-1 ANN | Level-1 Tree | Level-1 Bayesian |
|---|---|---|---|
| Unequally Distributed | 71.27% | 71.64% | 72.00% |
| Equally Distributed | 65.82% | 67.64% | 70.18% |
Results: Evaluation of Accuracy Rates
• Similar to the highest-performing KDD Cup models
• However, predictions were drawn from a much smaller pool of potential localizations
• Also not much better than simply predicting nucleus
• Still, had fewer input variables with which to work
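The "just predicting nucleus" comparison follows directly from the test column of the target-distribution table:

```python
# Majority-class baseline: always predicting "nucleus" on the test set.
# Counts come from the target-distribution table earlier in the talk.
test_counts = {"nucleus": 174, "cytoplasm": 66, "mitochondria": 35}
baseline = max(test_counts.values()) / sum(test_counts.values())  # ~0.633
```

At roughly 63% accuracy, the baseline is indeed close to several of the reported model accuracies.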
Level-1 Decision Tree Diagram
Results: Statistical Comparisons
• No significant inter-algorithm differences for level-0 models
• Hybrid offered some improvement over ANN alone
• Equal distribution usually resulted in slightly worse performance
• Stacked Generalization resulted in better performance, sometimes significantly so
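The slides do not name the significance test used for these comparisons. One standard choice for comparing two classifiers on the same test set is McNemar's exact test on the discordant counts, sketched here with hypothetical numbers:

```python
# Exact McNemar test: compare two classifiers via the instances on which
# exactly one of them is correct (the discordant pairs).
from math import comb

def mcnemar_p(b, c):
    """b: cases only model A got right; c: cases only model B got right.
    Two-sided exact p value under the null that discordance is symmetric."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical disagreement counts between a level-0 and a level-1 model.
p = mcnemar_p(10, 25)
```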
Conclusions and Future Work
Conclusions and Future Work: Stratifying for Equal Distribution
• Not worth it, and perhaps harmful
• The resulting small sample size may be to blame
• Could sample from the full training set
• Other sampling approaches could be used
• The weight variable is not necessarily meaningful
Conclusions and Future Work: Specific Models
• Algorithms performed comparably to each other
• The ANN may need more hidden nodes
• The hybrid model improved the ANN’s performance slightly, but not much
• The NN may owe some of its performance to the tie-breaker implementation
• Naïve Bayesian was not a standout, as might have been expected
  – Could run an Apriori search first
Conclusions and Future Work: Stacked Generalization in General
• Somewhat, though not drastically, better performance
• Possible ways to improve performance:
  – Cross-validation could improve both performance and evaluation
  – Use posterior probabilities instead of the actual predictions
  – Try different algorithms
  – Continue stacking on more levels (level-2, level-3, etc.)
• Apply Stacked Generalization to the actual KDD Cup task
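The first two refinements combine naturally: build the level-1 data out-of-fold by cross-validation and use posterior probabilities as the level-1 features. A sketch with scikit-learn's `cross_val_predict`; the data, models, and fold count are illustrative assumptions.

```python
# Cross-validated stacking on posterior probabilities: each instance's
# level-1 features come from level-0 models that never trained on it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
level0 = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

# Out-of-fold class probabilities from each level-0 model, stacked side by side.
X1 = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                for m in level0])
generalizer = LogisticRegression().fit(X1, y)
```

Unlike the single holdout split used in the talk, cross-validation lets every training instance contribute a level-1 row, addressing the small-sample concern raised earlier.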
References
• Page, D. (2001). KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/.
• Ting, K.M., Witten, I.H. (1997). Stacked generalization: when does it work? Proc. International Joint Conference on Artificial Intelligence, Japan, 866-871.
• Witten, I.H., Frank, E. (2000). Data Mining. Morgan Kaufmann, San Francisco.
• Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5:241-259.