Guest Lecture: Feature Selection. Alan Qi, yuanqi@mit.edu. Dec 2, 2004.
TRANSCRIPT
Outline
• Problems
• Overview of feature selection (FS)
  – Filtering: correlation & information criteria
  – Wrapper approach: greedy FS & regularization
• Classical Bayesian feature selection
• New Bayesian approach: predictive-ARD
Feature selection
• Gene expression: thousands of gene measurements
• Documents: “bag of words” model with more than 10,000 words
• Images: histograms, colors, wavelet coefficients, etc.
• Task: find a small subset of features for prediction
Gene Expression Classification
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.

Challenge: thousands of genes are measured in the microarray data, but only a small subset of them is likely relevant to the classification task.
Filtering approach
• Feature ranking based on sensible criteria:
  – Correlation between features and labels
  – Mutual information between features and labels
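As an illustration (not from the slides), correlation-based filtering can be sketched in a few lines of Python; the toy data below is made up:

```python
import math

def pearson(xs, ys):
    # Pearson correlation between a feature column and the labels
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy data: 4 samples, 3 features; labels in {-1, +1}
X = [[1.0, 0.20, 5.0],
     [2.0, 0.10, 3.0],
     [3.0, 0.25, 4.0],
     [4.0, 0.15, 6.0]]
t = [-1, -1, 1, 1]

# rank features by the magnitude of their correlation with the labels
scores = [abs(pearson([row[j] for row in X], t)) for j in range(3)]
ranking = sorted(range(3), key=lambda j: -scores[j])
```

The top-ranked features (here, the front of `ranking`) would then be kept and the rest discarded, without ever training a predictor.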
Wrapper Approach

Assess subsets of variables according to their usefulness to a given predictor. A combinatorial problem: 2^K combinations given K features.
- Sequentially adding/removing features: Sequential Forward Selection (SFS), Sequential Backward Selection (SBS).
- Recursively adding/removing features: Sequential Forward Floating Selection (SFFS). (When to stop? Overfitting?)
- Regularization: use a sparse prior to enhance the sparsity of a trained predictor (classifier).
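Sequential Forward Selection can be sketched as follows; the scoring function here is a made-up stand-in for cross-validated accuracy of the given predictor:

```python
def forward_select(features, score, k):
    """Greedy SFS: repeatedly add the feature that most improves `score`,
    stopping at k features or when no candidate improves the score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # one possible stopping rule against overfitting
        selected.append(best)
        remaining.remove(best)
    return selected

# hypothetical score: rewards features 0 and 2, mildly penalizes subset size
def score(subset):
    return sum(1.0 for f in subset if f in (0, 2)) - 0.1 * len(subset)

selected = forward_select([0, 1, 2, 3], score, k=3)
```

SBS would run the same loop in reverse, starting from all K features and greedily removing them.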
Regularization
Regularization: combining the fit to the data and a penalty for complexity, i.e., minimizing an objective of the form

$$E(\mathbf{w}) = -\log p(\mathbf{t} \mid X, \mathbf{w}) + \lambda\, \mathrm{penalty}(\mathbf{w})$$

Labels: $\mathbf{t} = [t_1, t_2, \ldots, t_N]$

Inputs: $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$

Parameters: $\mathbf{w}$

Likelihood for the data set (for classification):

$$p(\mathbf{t} \mid X, \mathbf{w}) = \prod_{i=1}^{N} \Psi(t_i\, \mathbf{w}^{\mathrm{T}} \mathbf{x}_i)$$
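A concrete sketch of such an objective, assuming a probit likelihood (standard Gaussian CDF, computable via `math.erf`) and an L2 penalty; the toy data and weights are hypothetical:

```python
import math

def gauss_cdf(z):
    # standard Gaussian CDF Psi(z), via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def objective(w, X, t, lam):
    # negative log-likelihood of the probit model plus an L2 complexity penalty
    nll = -sum(math.log(gauss_cdf(ti * sum(wj * xj for wj, xj in zip(w, xi))))
               for xi, ti in zip(X, t))
    return nll + lam * sum(wj * wj for wj in w)

# toy data: 3 samples, 2 features; labels in {-1, +1}
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]]
t = [1, 1, -1]
```

A larger trade-off parameter `lam` raises the cost of large weights, shrinking the fitted classifier toward simpler models.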
Bayesian feature selection
• Background
  – Bayesian classification model
  – Automatic relevance determination (ARD)
• Risk of overfitting by optimizing hyperparameters
• Predictive ARD by expectation propagation (EP):
  – Approximate prediction error
  – EP approximation
• Experiments
• Conclusions
Motivation
Task 1: Classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data.
Task 2: Sparse Bayesian kernel classifiers for fast test performance.
Bayesian Classification Model
Labels: $\mathbf{t}$, inputs: $X$, parameters: $\mathbf{w}$

Prior of the classifier $\mathbf{w}$:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i;\, 0,\, \alpha_i^{-1})$$

Likelihood for the data set:

$$p(\mathbf{t} \mid X, \mathbf{w}) = \prod_{i=1}^{N} \Psi(t_i\, \mathbf{w}^{\mathrm{T}} \mathbf{x}_i)$$

where $\Psi(\cdot)$ is the cumulative distribution function of a standard Gaussian.
Evidence and Predictive Distribution
The evidence, i.e., the marginal likelihood of the hyperparameters $\boldsymbol{\alpha}$:

$$p(\mathbf{t} \mid X, \boldsymbol{\alpha}) = \int p(\mathbf{t} \mid X, \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}$$

The predictive posterior distribution of the label $t_{N+1}$ for a new input $\mathbf{x}_{N+1}$:

$$p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{t}) = \int p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$
Automatic Relevance Determination (ARD)
• Give the classifier weights independent Gaussian priors whose variance, $\alpha_i^{-1}$, controls how far away from zero each weight is allowed to go:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i;\, 0,\, \alpha_i^{-1})$$

• Maximize $p(\mathbf{t} \mid X, \boldsymbol{\alpha})$, the marginal likelihood of the model, with respect to $\boldsymbol{\alpha}$.

• Outcome: many elements of $\boldsymbol{\alpha}$ go to infinity, which naturally prunes irrelevant features in the data.
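The pruning outcome can be sketched as follows; the gene names, $\alpha$ values, and threshold are all hypothetical:

```python
# hypothetical hyperparameters alpha_i after type-II ML training, one per feature;
# a huge alpha_i forces w_i ~ N(0, 1/alpha_i) to zero, pruning that feature
alpha = {"gene_17": 2.3, "gene_42": 1e8, "gene_99": 0.7, "gene_3": 5e9}

THRESHOLD = 1e6  # treat prior variances below 1/THRESHOLD as effectively zero
kept = sorted(f for f, a in alpha.items() if a < THRESHOLD)
```

Only the features in `kept` would enter the final classifier; the rest are removed without any explicit search over feature subsets.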
Two Types of Overfitting
• Classical maximum likelihood:
  – Optimizing the classifier weights w can directly fit noise in the data, resulting in a complicated model.
• Type II maximum likelihood (ARD):
  – Optimizing the hyperparameters corresponds to choosing which variables are irrelevant. Choosing one out of exponentially many models can also overfit if we maximize the model marginal likelihood.
Risk of Optimizing the Hyperparameters
[Figure: a 2-D toy dataset (X: Class 1 vs. O: Class 2) with decision boundaries found by evidence-maximized ARD (Evd-ARD-1, Evd-ARD-2) compared against the Bayes point.]
Outline
• Background
  – Bayesian classification model
  – Automatic relevance determination (ARD)
• Risk of overfitting by optimizing hyperparameters
• Predictive ARD by expectation propagation (EP):
  – Approximate prediction error
  – EP approximation
• Experiments
• Conclusions
Predictive-ARD
• Choosing the model with the best estimated predictive performance instead of the most probable model.
• Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.
Estimate Predictive Performance

• Predictive posterior given a test data point $\mathbf{x}_{N+1}$:

$$p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{t}) = \int p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$

• EP can estimate the predictive leave-one-out error probability:

$$\epsilon_{\mathrm{prob}} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \int p(t_i \mid \mathbf{x}_i, \mathbf{w})\, q(\mathbf{w} \mid \mathbf{t}_{\backslash i})\, d\mathbf{w} \right)$$

where $q(\mathbf{w} \mid \mathbf{t}_{\backslash i})$ is the approximate posterior obtained by leaving out the $i$th label.

• EP can also estimate the predictive leave-one-out error count:

$$\epsilon_{\mathrm{count}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{I}\!\left( p(t_i \mid \mathbf{x}_i, \mathbf{t}_{\backslash i}) < \tfrac{1}{2} \right)$$
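Given per-point leave-one-out predictive probabilities (the values below are hypothetical, standing in for what EP would estimate), the two error estimates reduce to simple averages:

```python
# hypothetical leave-one-out predictive probabilities p(t_i | x_i, t_{\i})
# of the *correct* label for each of N = 5 training points
p_loo = [0.9, 0.8, 0.4, 0.95, 0.55]

N = len(p_loo)
# average probability of being wrong on each held-out point
err_prob = sum(1.0 - p for p in p_loo) / N
# fraction of held-out points where the wrong label is the more probable one
err_count = sum(1 for p in p_loo if p < 0.5) / N
```

Both quantities come for free from EP's intermediate factors, so no separate cross-validation runs are needed.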
Expectation Propagation in a Nutshell
• Approximate a probability distribution by a product of simpler parametric terms:

$$p(\mathbf{w} \mid \mathbf{t}) \propto \prod_i f_i(\mathbf{w}), \quad f_i(\mathbf{w}) = \Psi(t_i\, \mathbf{w}^{\mathrm{T}} \mathbf{x}_i)$$

$$q(\mathbf{w}) = \prod_i \tilde{f}_i(\mathbf{w})$$

• Each approximation term $\tilde{f}_i(\mathbf{w})$ lives in an exponential family (e.g., Gaussian).
EP in a Nutshell

Three key steps:
• Deletion step: approximate the "leave-one-out" predictive posterior for the $i$th point:

$$q^{\backslash i}(\mathbf{w}) \propto q(\mathbf{w}) / \tilde{f}_i(\mathbf{w}) = \prod_{j \neq i} \tilde{f}_j(\mathbf{w})$$

• Moment matching: minimize the following KL divergence:

$$\tilde{f}_i(\mathbf{w}) = \arg\min_{\tilde{f}_i} \mathrm{KL}\!\left( f_i(\mathbf{w})\, q^{\backslash i}(\mathbf{w}) \,\middle\|\, \tilde{f}_i(\mathbf{w})\, q^{\backslash i}(\mathbf{w}) \right)$$

• Inclusion:

$$q(\mathbf{w}) \propto \tilde{f}_i(\mathbf{w})\, q^{\backslash i}(\mathbf{w})$$
The key observation: we can use the approximate predictive posterior, obtained in the deletion step, for model selection. No extra computation!
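A minimal 1-D sketch of the moment-matching step (a toy illustration, not the slides' multivariate algorithm): for a Gaussian family, minimizing the forward KL divergence amounts to matching the mean and variance of the tilted distribution. Here the tilted distribution is a standard Gaussian prior times a single probit factor, and its moments are computed numerically on a grid:

```python
import math

def psi(z):
    # probit factor: standard Gaussian CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss(z):
    # standard Gaussian density N(z; 0, 1)
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

# numerical moments of the tilted distribution p(z) ∝ gauss(z) * psi(z);
# the best Gaussian approximation q simply copies this mean and variance
step = 0.001
zs = [-5.0 + step * i for i in range(10001)]
weights = [gauss(z) * psi(z) for z in zs]
Z = sum(weights) * step
mean = sum(z * w for z, w in zip(zs, weights)) * step / Z
var = sum((z - mean) ** 2 * w for z, w in zip(zs, weights)) * step / Z
```

This tilted density is a skew-normal, whose known normalizer, mean, and variance ($1/2$, $1/\sqrt{\pi}$, and $1 - 1/\pi$) can be used to check the numerical moments.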
Comparison of different model selection criteria for ARD training
• 1st row: Test error
• 2nd row: Estimated leave-one-out error probability
• 3rd row: Estimated leave-one-out error counts
• 4th row: Evidence (model marginal likelihood)
• 5th row: Fraction of selected features
The estimated leave-one-out error probabilities and counts are better correlated with the test error than evidence and sparsity level.
Gene Expression Classification
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.

Challenge: thousands of genes are measured in the microarray data, but only a small subset of them is likely relevant to the classification task.
Classifying Leukemia Data
• The task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• The dataset: 47 and 25 samples of type ALL and AML respectively with 7129 features per sample.
• The dataset was randomly split 100 times into 36 training and 36 testing samples.
Classifying Colon Cancer Data
• The task: distinguish normal and cancer samples
• The dataset: 22 normal and 40 cancer samples with 2000 features per sample.
• The dataset was randomly split 100 times into 50 training and 12 testing samples.
• SVM results are from Li et al. (2002).
Bayesian Sparse Kernel Classifiers
• Using feature/kernel expansions defined on the training data points: $\boldsymbol{\phi}(\mathbf{x}) = [k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_N)]^{\mathrm{T}}$
• Predictive-ARD-EP trains a classifier that depends on a small subset of the training set.
• Fast test performance.
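A sketch of such a kernel expansion, with a hypothetical sparse weight vector standing in for the output of Predictive-ARD-EP training; the Gaussian kernel width of 5 follows the later experiments:

```python
import math

def rbf(x, z, width=5.0):
    # Gaussian (RBF) kernel with kernel width 5
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * width ** 2))

# toy training points and hypothetical post-training weights:
# only 2 of the 4 basis points survive as relevance vectors
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
w = [1.2, 0.0, -0.8, 0.0]

def f(x):
    # kernel expansion f(x) = sum_i w_i k(x, x_i); zero-weight terms are skipped,
    # so test-time cost scales with the number of surviving relevance vectors
    return sum(wi * rbf(x, xi) for wi, xi in zip(w, train) if wi != 0.0)
```

Since most weights are pruned to zero, each test prediction only evaluates the kernel against the few retained training points, which is what makes test performance fast.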
Test error rates and numbers of relevance or support vectors on the breast cancer dataset.
50 partitionings of the data were used. All these methods use the same Gaussian kernel with kernel width = 5. The trade-off parameter C in SVM is chosen via 10-fold cross-validation for each partition.
Test error rates on diabetes data
100 partitionings of the data were used. Evidence and Predictive ARD-EPs use the Gaussian kernel with kernel width = 5.
Summary
• Two kinds of feature selection methods:
  – Filtering and wrapper methods
• Classical Bayesian feature selection:
  – Excellent classical approach: tuning the prior to prune features.
  – However, maximizing the marginal likelihood can lead to overfitting in the model space if there are many features.
• New Bayesian approach: Predictive-ARD, which focuses on prediction performance:
  – feature selection
  – sparse kernel learning