Feature selection I
TRANSCRIPT
Dr. Athanasios Tsanas (‘Thanasis’), Wellcome Trust post-doctoral fellow, Institute of Biomedical Engineering (IBME); Affiliate researcher, Centre for Mathematical Biology, Mathematical Institute; Junior Research Fellow, Kellogg College, University of Oxford
Information Driven Healthcare: Machine Learning course
Lecture: Feature selection I --- Concepts
Centre for Doctoral Training in Healthcare Innovation
The big picture: signal processing and statistical machine learning
[Pipeline diagram: Feature generation → Feature selection/transformation → Statistical mapping]
Signal processing course: tools for extracting information (patterns) from the data (feature generation)
This course: using the extracted information from the data
Statistical mapping: associating features with another measured quantity (response) – supervised learning
Goal: maximize the information available in the data to predict the response
Supervised learning setting
The design matrix X has N rows (samples) and M columns (features or characteristics); y is the outcome.

Subjects   feature 1   feature 2   ...   feature M   Outcome
S1         3.1         1.3         ...   0.9         type 1
S2         3.7         1.0         ...   1.3         type 2
S3         2.9         2.6         ...   0.6         type 1
…          …           …           ...   …           …
SN         1.7         2.0         ...   0.7         type 5

y = f(X), where f is the mechanism, X the feature set (the design matrix), and y the outcome
Feature selection: which features 1…M in the design matrix X should we keep?
Feature transformation: project the features to a new lower dimensional feature space
Introduction to the problem
Many features M: the curse of dimensionality
Obstructs interpretability and is detrimental to the learning process
Solution to the problem
Reduce the initial feature space of M features to m features, with m < M (ideally m << M)
Feature selection
Feature transformation
Main concepts
Principle of parsimony
Information content
Statistical associations
Computational constraints
We want to determine the most parsimonious feature subset with maximum joint information content
Feature transformation
Construct a lower dimensional space in which the new data points retain the distances between the data points in the original feature space
Different algorithms arise depending on how we define the distance
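A minimal sketch of one common feature transformation, principal component analysis (PCA), assuming MATLAB's Statistics Toolbox and its bundled Fisher iris data; the choice of PCA, the dataset and the variable names are illustrative assumptions, not from the lecture.

% Illustrative PCA sketch (assumed example, not lecture material)
load fisheriris                        % bundled example dataset: meas is 150 x 4
X = meas;                              % N x M design matrix (here M = 4)
Xz = zscore(X);                        % standardise features before projecting
[~, score, latent] = pca(Xz);          % projected data and component variances
m = 2;                                 % target dimensionality (m << M)
Xreduced = score(:, 1:m);              % N x m transformed feature matrix
explained = cumsum(latent) / sum(latent);
fprintf('Variance explained by %d components: %.1f%%\n', m, 100*explained(m));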
Feature transformation problems
Results are not easily interpretable
Does not save resources in data collection or data processing
Reliable transformation in high-dimensional spaces is problematic
Feature selection
Discard features that do not contribute towards predicting the response
Feature selection advantages
Interpretable
Retain domain expertise
Often the only practical approach (e.g. in micro-array data)
Saves resources in data collection and data processing
Feature selection approaches
Two approaches:
Wrappers (involve a learner)
Filters (rely on information content of the feature subset, e.g. using statistical tests)
Wrappers
Computationally intensive
Rely on incorporating a learner
Feature exportability problems (the selected subset is tied to the chosen learner)
Wrappers may produce models with better predictive performance compared to filters
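To make the wrapper idea concrete, a rough sketch using MATLAB's sequentialfs, where a cross-validated learner (linear discriminant analysis via classify) decides which feature to add at each step; the dataset and the criterion function are illustrative assumptions.

% Illustrative wrapper sketch (assumed example, not lecture material)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels (cell array of strings)
% criterion: number of misclassifications of a linear classifier on held-out folds
critfun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum(~strcmp(ytest, classify(Xtest, Xtrain, ytrain)));
selected = sequentialfs(critfun, X, y, 'cv', 10);   % greedy forward selection
find(selected)                         % indices of the retained features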
Filters
Rely on basic concepts: statistics and information theory
Computationally fast
The learner enters only at a later stage, hence filters may generalize better than wrappers
Filter concept: relevance
Maximum relevance: features (F) and response (y)
Which features would you choose? In which order?
[Diagram: overlap of features F1, F2, F3 with the response y]
Filter concept: redundancy
Minimum redundancy amongst features in the subset
Which features would you choose? In which order?
[Diagram: overlap of features F1, F2, F3, F4 illustrating redundancy]
Filter concept: complementarity
Conditional relevance (feature interaction or complementarity): a feature that is weakly relevant on its own may become highly relevant when considered jointly with another feature
Formalizing these concepts
How to express relevance and redundancy (i.e. which are the appropriate metrics?)
Metrics include: correlation coefficients, mutual information, statistical tests, p-values, information gain… (a small relevance-ranking sketch follows after this list)
How to compromise between relevance and redundancy?
Process? (forward selection vs backward elimination)
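A small relevance-ranking sketch of the filter idea, using the absolute Spearman correlation between each feature and the response (one of the metrics listed above); the dataset and the encoding are illustrative assumptions. Redundancy could be scored the same way from the feature-to-feature correlations.

% Illustrative filter (relevance ranking) sketch (assumed example)
load fisheriris
X = meas;                              % N x M design matrix
y = grp2idx(species);                  % encode the class outcome numerically
relevance = abs(corr(X, y, 'type', 'Spearman'));    % M x 1 relevance scores
[~, order] = sort(relevance, 'descend');
disp([order, relevance(order)])        % features ranked by decreasing relevance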
To be continued…
In the following lecture we will look at specific algorithms!
Usual steps in forward selection
LASSO
Start with classical ordinary least squares regression and add an L1 penalty on the coefficients
The L1 penalty is sparsity promoting: some coefficients become exactly 0, and the corresponding features are discarded
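A minimal usage sketch of the MATLAB lasso function linked at the end of this lecture, on a toy regression problem in which only two features carry signal; the synthetic data are an illustrative assumption.

% Illustrative LASSO sketch (toy data are assumed; lasso is a real function)
rng(1);                                % reproducible toy example
X = randn(100, 10);                    % N = 100 samples, M = 10 features
y = 3*X(:,2) - 2*X(:,7) + 0.5*randn(100, 1);   % only features 2 and 7 matter
[B, FitInfo] = lasso(X, y, 'CV', 10);  % L1-penalised least squares, 10-fold CV
w = B(:, FitInfo.Index1SE);            % coefficients at the selected penalty
find(w ~= 0)                           % features kept (non-zero coefficients)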
RELIEF
Feature weighting algorithm
Concept: work with nearest neighbours
Nearest hit (NH) and nearest miss (NM)
Great for datasets with interactions but does not account for information redundancy
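A minimal usage sketch of the MATLAB relieff function linked at the end of this lecture; the dataset and the number of neighbours k are illustrative assumptions.

% Illustrative RELIEF sketch (assumed example; relieff is a real function)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels, so relieff runs in classification mode
k = 10;                                % number of nearest neighbours
[ranked, weights] = relieff(X, y, k);  % ranked feature indices and their weights
disp([ranked' weights(ranked)'])       % most important features first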
mRMR
minimum Redundancy Maximum Relevance (mRMR)
Trades off relevance against redundancy
Does not account for interactions or non-pairwise redundancy
Generally works very well
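A rough greedy mRMR-style sketch, not the file-exchange implementation linked later: absolute correlations stand in for mutual information as simple relevance and redundancy measures, and at each step the candidate feature with the largest relevance minus mean redundancy (with the already selected features) is added.

% Illustrative greedy mRMR-style sketch (correlation replaces mutual information)
load fisheriris
X = zscore(meas);                      % N x M design matrix, standardised
y = grp2idx(species);                  % numerically encoded outcome
M = size(X, 2);
m = 3;                                 % number of features to select (m < M)
rel = abs(corr(X, y));                 % relevance of each feature to y
red = abs(corr(X));                    % pairwise redundancy between features
selected = zeros(1, m);
remaining = 1:M;
for step = 1:m
    score = rel(remaining);            % relevance of the candidate features
    if step > 1                        % penalise redundancy with chosen features
        score = score - mean(red(remaining, selected(1:step-1)), 2);
    end
    [~, best] = max(score);
    selected(step) = remaining(best);
    remaining(best) = [];
end
disp(selected)                         % features in the order they were chosen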
Comparing feature selection algorithms
Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)
o Possible only for artificial datasets
Maximize the out-of-sample prediction performance (a small sketch follows after this list)
o proxy for assessing feature selection algorithms
o adds an additional ‘layer’: the learner
o beware of feature exportability (different learners may give different results)
o BUT… in practice this is really what is of most interest!
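A minimal sketch of the out-of-sample proxy: compare a hypothetical selected subset against the full feature set using the same cross-validated learner (here a k-nearest-neighbour classifier); the subset, learner and dataset are illustrative assumptions.

% Illustrative out-of-sample comparison sketch (assumed example)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels
subset = [3 4];                        % hypothetical features chosen by a selector
cv = cvpartition(y, 'KFold', 10);      % shared folds for a fair comparison
errFull = kfoldLoss(fitcknn(X, y, 'CVPartition', cv));
errSub  = kfoldLoss(fitcknn(X(:, subset), y, 'CVPartition', cv));
fprintf('10-fold error: all features %.3f, selected subset %.3f\n', errFull, errSub);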
Matlab code
LASSO: http://www.mathworks.co.uk/help/stats/lasso.html
RELIEF: http://www.mathworks.co.uk/help/stats/relieff.html
mRMR: http://www.mathworks.com/matlabcentral/fileexchange/14888
Be careful: the latter implementation relies on discrete features and computes densities using histograms. For continuous features you would need another density estimator (e.g. kernel density estimation).
UCI ML repository http://archive.ics.uci.edu/ml/
Conclusions
Multi-faceted problem, fertile field for research
No free lunch theorem (no universally best algorithm)
Trade-offs
o algorithmic: relevance, redundancy, complementarity
o computational: wrappers are costly but often give better results
o comprehensive search of the feature space, e.g. genetic algorithms (very costly)
Reducing the number of features may improve prediction performance and always improves interpretability