Application of Stacked Generalization to a Protein Localization Prediction Task
Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.
Pace University, School of Computer Science and Information Systems
September 27, 2003
Overview
• Introduction
• Purpose
• Methods
• Algorithms
• Results
• Conclusions and Future Work
Introduction
Introduction: Data Mining
• Application of machine learning algorithms to large databases
• Often used to classify future data based on a training set
• The “target” variable is the variable to be predicted
• Theoretically, algorithms are context-independent
Introduction: Stacked Generalization
• Method for combining models
• Part of the training set is used to train level-0, or base, models as usual
• Level-1 data are built from the predictions of the level-0 models on the remainder of the set
• Level-1 generalizers are models trained on the level-1 data
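The split-train-predict flow above can be sketched as follows; the dataset, the two level-0 learners, and the even half-split are illustrative assumptions, not the configuration used in the talk.

```python
# Minimal sketch of stacked generalization (hypothetical data and models):
# level-0 models train on part of the training set; their predictions on
# the held-out remainder become the level-1 dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Split the training set: the first half trains the level-0 models as usual.
X0, y0 = X[:150], y[:150]   # level-0 training portion
Xh, yh = X[150:], y[150:]   # held-out portion -> source of level-1 data

level0 = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
for m in level0:
    m.fit(X0, y0)

# Level-1 data: each column holds one level-0 model's predictions on the holdout.
X1 = np.column_stack([m.predict(Xh) for m in level0])

# The level-1 generalizer learns to combine the level-0 predictions.
generalizer = LogisticRegression()
generalizer.fit(X1, yh)
```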
Introduction: Bioinformatics and Protein Localization
• Bioinformatics: application of computing to molecular biology
• Currently much interest in information about proteins
• Expression of proteins localized in a particular type or part of cell (localization)
• Knowledge of protein localization can shed light on a protein’s function
• Data mining employed to predict localization from a database of information about the encoding genes
Introduction: KDD Cup 2001 Task
• KDD Cup: annual data mining competition sponsored by ACM SIGKDD
• Participants use a training set to predict target variable values in a test dataset of different instances
• Winner is the most accurate model (correct predictions / total instances in the test set)
• 2001 task: predict the protein localization of genes
• Anonymized genes were the instances; information about the genes formed the attributes
• Datasets (incl. the revealed target) were used in this project
Purpose
• Use Stacked Generalization approach on this task
• Compare inter-algorithm performance using level-0 models and level-1 generalizers
• Evaluate strategy of equally distributing target variable
Methods
Methods: Dataset Manipulations
• Reduce the number of input variables
• Reduce the number of potential target values to 3
• Separate the original training dataset into training and validation sets for stacking
• Eliminate effectively unary variables in the final training dataset
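A possible reading of the last step is a filter that drops columns dominated by a single value. The 99% threshold, the column names, and the values below are all assumptions for illustration; the talk does not state a cutoff.

```python
# Sketch: drop "effectively unary" variables, i.e. columns where one value
# accounts for (almost) every instance in the training set.
def effectively_unary(column, threshold=0.99):
    """True if a single value accounts for >= threshold of the column."""
    counts = {}
    for v in column:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(column) >= threshold

# Hypothetical attribute columns keyed by name.
data = {
    "motif_a": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],      # unary -> dropped
    "chromosome": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],   # varied -> kept
}
kept = {name: col for name, col in data.items() if not effectively_unary(col)}
```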
Table: Target Variable Distribution
| Localization | Training N (%) | Validation N (%) | Test N (%) |
|---|---|---|---|
| Nucleus | 366 (58.4%) | 189 (60.4%) | 174 (63.3%) |
| Cytoplasm | 192 (30.6%) | 90 (28.75%) | 66 (24.0%) |
| Mitochondria | 69 (11.0%) | 34 (10.9%) | 35 (12.7%) |
Methods: Equally Distributed Approach
• A second training set was created by stratifying to ensure equally distributed localizations
• Level-0 models were trained on both the raw (unequally distributed) and the equally distributed training sets
• Separate level-1 data and level-1 generalizers were built from this dataset
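One plausible way to equalize the target distribution is to downsample each localization to the size of the smallest class; the talk does not specify its stratification mechanics, so this sketch and its toy labels are assumptions.

```python
import random

def equalize(instances, labels, seed=0):
    """Downsample each class to the size of the smallest class so the
    target variable is equally distributed in the resulting set."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(v) for v in by_class.values())
    out = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n):  # sample without replacement
            out.append((x, y))
    return out

# Toy example: 6 nucleus, 4 cytoplasm, 2 mitochondria instances.
sample = equalize(range(12), ["nucleus"] * 6 + ["cytoplasm"] * 4 + ["mito"] * 2)
```

Downsampling to the minority class keeps the distribution exactly equal, at the cost of the smaller sample size the conclusions slide later flags as a possible problem.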
Algorithms
Algorithms: Level-0 Artificial Neural Network (ANN)
• Fully connected feedforward network
• Input variables converted to dummy variables → 186 input nodes
• Target variable converted to dummy variables → 2 output nodes
• 1 hidden node
• Training based on change in misclassification rate
Algorithms: Level-0 Decision Tree
• Used a CHAID-like algorithm
• Chi-squared p value splitting criterion: p < 0.2
• Model selection based on the proportion of instances correctly classified
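The chi-squared splitting criterion can be sketched as a test on the attribute/target contingency table, keeping a candidate split only when p < 0.2. The contingency counts below are invented for illustration, and the use of `scipy.stats.chi2_contingency` is an assumption about tooling, not the talk's implementation.

```python
# CHAID-style split screening: an attribute qualifies as a split only if its
# association with the target is significant at the p < 0.2 level.
from scipy.stats import chi2_contingency

def passes_split_test(contingency, alpha=0.2):
    """True if the attribute/target contingency table shows association."""
    _, p, _, _ = chi2_contingency(contingency)
    return p < alpha

# Rows: attribute values; columns: localization counts (made-up numbers).
informative = [[30, 5], [5, 30]]       # strong association -> small p
uninformative = [[18, 17], [17, 18]]   # near-uniform -> large p
```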
Algorithms: Level-0 Nearest Neighbor (NN)
• Compare each instance between the two datasets
• Count the number of matching attributes
• Predict the target value of the instance matching on the greatest number of attributes
• Use relative frequency in the unequally distributed dataset to break ties
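The matching-attributes rule with a frequency tie-breaker can be sketched directly; the three-attribute training instances below are hypothetical.

```python
# Nearest neighbor over categorical attributes: score each training instance
# by its count of matching attribute values, breaking ties in favor of the
# label that is more frequent in the training set.
from collections import Counter

def predict_nn(query, train):
    """train: list of (attributes, label) pairs with categorical attributes."""
    freq = Counter(label for _, label in train)
    def score(item):
        attrs, label = item
        matches = sum(a == b for a, b in zip(query, attrs))
        return (matches, freq[label])   # second key implements the tie-breaker
    return max(train, key=score)[1]

train = [(("a", "x", "p"), "nucleus"),
         (("a", "y", "p"), "nucleus"),
         (("b", "y", "q"), "cytoplasm")]
```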
Algorithms: Level-0 Hybrid Decision Tree/ANN
• Difficult for an ANN to learn with too many variables
• A decision tree can be used as a “feature selector”
• Important variables are those used as branching criteria
• A new ANN is trained using only the important variables as inputs
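The tree-as-feature-selector idea can be sketched with scikit-learn: fit a tree, keep only the variables it branches on, then train the network on those. The dataset, tree depth, and network settings are illustrative assumptions, not the talk's configuration.

```python
# Hybrid decision tree / ANN sketch: the tree's branching variables become
# the network's inputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# tree_.feature holds the split feature per node (negative values at leaves),
# so the non-negative entries are the variables used as branching criteria.
important = sorted(set(f for f in tree.tree_.feature if f >= 0))

# Train a small network on the selected variables only.
ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=0)
ann.fit(X[:, important], y)
```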
Algorithms: Level-1 Generalizers
• ANN and Decision Tree
  – Designed and trained essentially the same as their level-0 counterparts
  – The ANN had 8 input nodes
• Naïve Bayesian Model
  – Calculated the likelihood of each target value based on Bayes’ rule
  – Predicted the value with the highest likelihood
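A naïve Bayesian generalizer over the categorical level-0 predictions can be sketched as prior times per-feature conditional likelihoods. The add-one smoothing and the toy level-1 rows are assumptions; the talk does not detail the estimator.

```python
# Naive Bayes over level-1 data: score each class y by
# P(y) * prod_i P(prediction_i | y), with add-one smoothing.
from collections import Counter, defaultdict

def train_nb(rows, labels):
    n = len(rows)
    prior = Counter(labels)
    cond = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    def predict(row):
        def likelihood(y):
            p = prior[y] / n
            for i, v in enumerate(row):
                p *= (cond[(i, y)][v] + 1) / (prior[y] + 2)  # smoothed
            return p
        return max(prior, key=likelihood)
    return predict

# Hypothetical level-1 rows: (ANN prediction, tree prediction) per instance.
rows = [("nuc", "nuc"), ("nuc", "cyt"), ("cyt", "cyt"), ("nuc", "nuc")]
labels = ["nuc", "nuc", "cyt", "nuc"]
predict = train_nb(rows, labels)
```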
Results
Results: Accuracy Rates

Level-0 Models

| Approach | Dataset | ANN | Tree | NN | Hybrid |
|---|---|---|---|---|---|
| Unequally Distributed | Validation | 65.8% | 72.2% | 73.8% | 71.3% |
| Unequally Distributed | Test | 64.7% | 65.1% | 70.6% | 71.3% |
| Equally Distributed | Validation | 62.9% | 61.7% | 64.9% | 62.3% |
| Equally Distributed | Test | 66.9% | 59.6% | 65.1% | 61.8% |

Level-1 Generalizers

| Approach | Level-1 ANN | Level-1 Tree | Level-1 Bayesian |
|---|---|---|---|
| Unequally Distributed | 71.27% | 71.64% | 72.00% |
| Equally Distributed | 65.82% | 67.64% | 70.18% |
Results: Evaluation of Accuracy Rates
• Similar to the highest-performing KDD Cup models
• However, predictions were drawn from a much smaller pool of potential localizations
• Also not much better than simply predicting nucleus
• Still, had fewer input variables with which to work
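The "just predicting nucleus" comparison follows directly from the test column of the target-distribution table:

```python
# Majority-class baseline: always predicting "nucleus" on the test set.
# Counts come from the target-distribution table earlier in the talk.
test_counts = {"nucleus": 174, "cytoplasm": 66, "mitochondria": 35}
baseline = max(test_counts.values()) / sum(test_counts.values())  # ~0.633
```

At roughly 63% accuracy, the baseline is indeed close to several of the reported model accuracies.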
Level-1 Decision Tree Diagram
Results: Statistical Comparisons
• No significant inter-algorithm differences for level-0 models
• Hybrid offered some improvement over ANN alone
• Equal distribution usually resulted in slightly worse performance
• Stacked Generalization resulted in better performance, sometimes significantly so
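The slides do not name the significance test used for these comparisons. One standard choice for comparing two classifiers on the same test set is McNemar's exact test on the discordant counts, sketched here with hypothetical numbers:

```python
# Exact McNemar test: compare two classifiers via the instances on which
# exactly one of them is correct (the discordant pairs).
from math import comb

def mcnemar_p(b, c):
    """b: cases only model A got right; c: cases only model B got right.
    Two-sided exact p value under the null that discordance is symmetric."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical disagreement counts between a level-0 and a level-1 model.
p = mcnemar_p(10, 25)
```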
Conclusions and Future Work
Conclusions and Future Work: Stratifying for Equal Distribution
• Not worth it, and perhaps harmful
• The resulting small sample size may be to blame
• Could sample from the full training set
• Other sampling approaches could be used
• The weight variable is not necessarily meaningful
Conclusions and Future Work: Specific Models
• Algorithms performed comparably to each other
• The ANN may need more hidden nodes
• The hybrid model improved the ANN’s performance slightly, but not much
• The NN may owe some of its performance to the tie-breaker implementation
• Naïve Bayesian was not a standout, as might have been expected
  – Could run an Apriori search first
Conclusions and Future Work: Stacked Generalization in General
• Somewhat, though not drastically, better performance
• Possible ways to improve performance:
  – Cross-validation could improve both performance and evaluation
  – Use posterior probabilities instead of the actual predictions
  – Try different algorithms
  – Continue stacking on more levels (level-2, level-3, etc.)
• Apply Stacked Generalization to the actual KDD Cup task
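The first two refinements combine naturally: build the level-1 data out-of-fold by cross-validation and use posterior probabilities as the level-1 features. A sketch with scikit-learn's `cross_val_predict`; the data, models, and fold count are illustrative assumptions.

```python
# Cross-validated stacking on posterior probabilities: each instance's
# level-1 features come from level-0 models that never trained on it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
level0 = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

# Out-of-fold class probabilities from each level-0 model, stacked side by side.
X1 = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                for m in level0])
generalizer = LogisticRegression().fit(X1, y)
```

Unlike the single holdout split used in the talk, cross-validation lets every training instance contribute a level-1 row, addressing the small-sample concern raised earlier.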
References
• Page, D. (2001). KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/.
• Ting, K.M., Witten, I.H. (1997). Stacked generalization: when does it work? Proc. International Joint Conference on Artificial Intelligence, Japan, 866-871.
• Witten, I.H., Frank, E. (2000). Data Mining. Morgan Kaufmann, San Francisco.
• Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5:241-259.