Feature selection I
TRANSCRIPT
Dr. Athanasios Tsanas (‘Thanasis’), Wellcome Trust post-doctoral fellow, Institute of Biomedical Engineering (IBME); Affiliate researcher, Centre for Mathematical Biology, Mathematical Institute; Junior Research Fellow, Kellogg College, University of Oxford
Information Driven Healthcare: Machine Learning course
Lecture: Feature selection I --- Concepts
Centre for Doctoral Training in Healthcare Innovation
The big picture: signal processing and statistical machine learning
[Pipeline diagram: Feature generation → Feature selection/transformation → Statistical mapping]
Signal processing course: tools for extracting information (patterns) from the data (feature generation)
This course: using the extracted information from the data
Statistical mapping: associating features with another measured quantity (response) – supervised learning
Goal: maximize the information available in the data to predict the response
Supervised learning setting
The design matrix X has N rows (samples) and M columns (features or characteristics); y is the outcome.

Subjects   feature 1   feature 2   ...   feature M   Outcome
S1         3.1         1.3         ...   0.9         type 1
S2         3.7         1.0         ...   1.3         type 2
S3         2.9         2.6         ...   0.6         type 1
…          …           …           ...   …           …
SN         1.7         2.0         ...   0.7         type 5

y = f(X), where f is the mechanism, X the feature set (the design matrix), and y the outcome
Feature selection: which features 1…M in the design matrix X should we keep?
Feature transformation: project the features to a new lower dimensional feature space
Introduction to the problem
Many features M: the curse of dimensionality
Obstructs interpretability and is detrimental to the learning process
Solution to the problem
Reduce the initial feature space of M features to m features, with m < M (ideally m << M)
Feature selection
Feature transformation
Main concepts
Principle of parsimony
Information content
Statistical associations
Computational constraints
We want to determine the most parsimonious feature subset with maximum joint information content
Feature transformation
Construct a lower dimensional space in which the new data points retain the distances between the data points in the original feature space
Different algorithms arise depending on how we define the distance
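A minimal sketch of one common feature transformation, principal component analysis (PCA), assuming MATLAB's Statistics Toolbox and its bundled Fisher iris data; the choice of PCA, the dataset and the variable names are illustrative assumptions, not from the lecture.

% Illustrative PCA sketch (assumed example, not lecture material)
load fisheriris                        % bundled example dataset: meas is 150 x 4
X = meas;                              % N x M design matrix (here M = 4)
Xz = zscore(X);                        % standardise features before projecting
[~, score, latent] = pca(Xz);          % projected data and component variances
m = 2;                                 % target dimensionality (m << M)
Xreduced = score(:, 1:m);              % N x m transformed feature matrix
explained = cumsum(latent) / sum(latent);
fprintf('Variance explained by %d components: %.1f%%\n', m, 100*explained(m));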
Feature transformation problems
Results are not easily interpretable
Does not save resources in data collection or data processing
Reliable transformation in high-dimensional spaces is problematic
Feature selection
Discard features that do not contribute towards predicting the response
Feature selection advantages
Interpretable
Retain domain expertise
Often the only practical approach (e.g. in micro-array data)
Saves resources in data collection and data processing
Feature selection approaches
Two approaches:
Wrappers (involve a learner)
Filters (rely on information content of the feature subset, e.g. using statistical tests)
Wrappers
Computationally intensive
Rely on incorporating a learner
Feature exportability problems (the selected subset is tied to the chosen learner)
Wrappers may produce models with better predictive performance compared to filters
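To make the wrapper idea concrete, a rough sketch using MATLAB's sequentialfs, where a cross-validated learner (linear discriminant analysis via classify) decides which feature to add at each step; the dataset and the criterion function are illustrative assumptions.

% Illustrative wrapper sketch (assumed example, not lecture material)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels (cell array of strings)
% criterion: number of misclassifications of a linear classifier on held-out folds
critfun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum(~strcmp(ytest, classify(Xtest, Xtrain, ytrain)));
selected = sequentialfs(critfun, X, y, 'cv', 10);   % greedy forward selection
find(selected)                         % indices of the retained features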
Filters
Rely on basic concepts: statistics and information theory
Computationally fast
The learner enters only at a later stage, hence filters may generalize better than wrappers
Filter concept: relevance
Maximum relevance: features (F) and response (y)
Which features would you choose? In which order?
[Diagram: overlap of features F1, F2, F3 with the response y]
Filter concept: redundancy
Minimum redundancy amongst features in the subset
Which features would you choose? In which order?
[Diagram: overlap of features F1, F2, F3, F4 illustrating redundancy]
Filter concept: complementarity
Conditional relevance (feature interaction or complementarity): a feature that is weakly relevant on its own may become highly relevant when considered jointly with another feature
Formalizing these concepts
How to express relevance and redundancy (i.e. which are the appropriate metrics?)
Metrics include: correlation coefficients, mutual information, statistical tests, p-values, information gain… (a small relevance-ranking sketch follows after this list)
How to compromise between relevance and redundancy?
Process? (forward selection vs backward elimination)
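A small relevance-ranking sketch of the filter idea, using the absolute Spearman correlation between each feature and the response (one of the metrics listed above); the dataset and the encoding are illustrative assumptions. Redundancy could be scored the same way from the feature-to-feature correlations.

% Illustrative filter (relevance ranking) sketch (assumed example)
load fisheriris
X = meas;                              % N x M design matrix
y = grp2idx(species);                  % encode the class outcome numerically
relevance = abs(corr(X, y, 'type', 'Spearman'));    % M x 1 relevance scores
[~, order] = sort(relevance, 'descend');
disp([order, relevance(order)])        % features ranked by decreasing relevance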
To be continued…
In the following lecture we will look at specific algorithms!
Usual steps in forward selection
LASSO
Start with classical ordinary least squares regression and add an L1 penalty on the coefficients
The L1 penalty is sparsity promoting: some coefficients become exactly 0, and the corresponding features are discarded
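A minimal usage sketch of the MATLAB lasso function linked at the end of this lecture, on a toy regression problem in which only two features carry signal; the synthetic data are an illustrative assumption.

% Illustrative LASSO sketch (toy data are assumed; lasso is a real function)
rng(1);                                % reproducible toy example
X = randn(100, 10);                    % N = 100 samples, M = 10 features
y = 3*X(:,2) - 2*X(:,7) + 0.5*randn(100, 1);   % only features 2 and 7 matter
[B, FitInfo] = lasso(X, y, 'CV', 10);  % L1-penalised least squares, 10-fold CV
w = B(:, FitInfo.Index1SE);            % coefficients at the selected penalty
find(w ~= 0)                           % features kept (non-zero coefficients)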
RELIEF
Feature weighting algorithm
Concept: work with nearest neighbours
Nearest hit (NH) and nearest miss (NM)
Great for datasets with interactions but does not account for information redundancy
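A minimal usage sketch of the MATLAB relieff function linked at the end of this lecture; the dataset and the number of neighbours k are illustrative assumptions.

% Illustrative RELIEF sketch (assumed example; relieff is a real function)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels, so relieff runs in classification mode
k = 10;                                % number of nearest neighbours
[ranked, weights] = relieff(X, y, k);  % ranked feature indices and their weights
disp([ranked' weights(ranked)'])       % most important features first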
mRMR
minimum Redundancy Maximum Relevance (mRMR)
Trades off relevance against redundancy
Does not account for interactions or non-pairwise redundancy
Generally works very well
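A rough greedy mRMR-style sketch, not the file-exchange implementation linked later: absolute correlations stand in for mutual information as simple relevance and redundancy measures, and at each step the candidate feature with the largest relevance minus mean redundancy (with the already selected features) is added.

% Illustrative greedy mRMR-style sketch (correlation replaces mutual information)
load fisheriris
X = zscore(meas);                      % N x M design matrix, standardised
y = grp2idx(species);                  % numerically encoded outcome
M = size(X, 2);
m = 3;                                 % number of features to select (m < M)
rel = abs(corr(X, y));                 % relevance of each feature to y
red = abs(corr(X));                    % pairwise redundancy between features
selected = zeros(1, m);
remaining = 1:M;
for step = 1:m
    score = rel(remaining);            % relevance of the candidate features
    if step > 1                        % penalise redundancy with chosen features
        score = score - mean(red(remaining, selected(1:step-1)), 2);
    end
    [~, best] = max(score);
    selected(step) = remaining(best);
    remaining(best) = [];
end
disp(selected)                         % features in the order they were chosen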
Comparing feature selection algorithms
Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)
o Possible only for artificial datasets
Maximize the out-of-sample prediction performance (a small sketch follows after this list)
o proxy for assessing feature selection algorithms
o adds an additional ‘layer’: the learner
o beware of feature exportability (different learners may give different results)
o BUT… in practice this is really what is of most interest!
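A minimal sketch of the out-of-sample proxy: compare a hypothetical selected subset against the full feature set using the same cross-validated learner (here a k-nearest-neighbour classifier); the subset, learner and dataset are illustrative assumptions.

% Illustrative out-of-sample comparison sketch (assumed example)
load fisheriris
X = meas;                              % N x M design matrix
y = species;                           % class labels
subset = [3 4];                        % hypothetical features chosen by a selector
cv = cvpartition(y, 'KFold', 10);      % shared folds for a fair comparison
errFull = kfoldLoss(fitcknn(X, y, 'CVPartition', cv));
errSub  = kfoldLoss(fitcknn(X(:, subset), y, 'CVPartition', cv));
fprintf('10-fold error: all features %.3f, selected subset %.3f\n', errFull, errSub);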
Matlab code
LASSO: http://www.mathworks.co.uk/help/stats/lasso.html
RELIEF: http://www.mathworks.co.uk/help/stats/relieff.html
mRMR: http://www.mathworks.com/matlabcentral/fileexchange/14888
Be careful: the latter implementation relies on discrete features and computes densities using histograms. For continuous features you would need another density estimator (e.g. kernel density estimation).
UCI ML repository http://archive.ics.uci.edu/ml/
Conclusions
Multi-faceted problem, fertile field for research
No free lunch theorem (no universally best algorithm)
Trade-offs
o algorithmic: relevance, redundancy, complementarity
o computational: wrappers are costly but often give better results
o comprehensive search of the feature space, e.g. genetic algorithms (very costly)
Reducing the number of features may improve prediction performance and always improves interpretability