lecture08 - data - feature selection & extraction… · dr. patrick chan @ scut feature...

Machine Learning

Lecture 8

Data Processing

Feature & Sample Selection

Dr. Patrick Chan

South China University of Technology, China


Dr. Patrick Chan @ SCUT

Agenda

Feature Selection

Search

Criterion

Wrapper / Filter / Embedded Method

Feature Extraction

PCA

LDA

Active Learning



Feature Space

Designing a suitable feature space is often more important than the choice of classifier

Some collected features may not be useful

Low storage complexity

Low model complexity

Accuracy may increase

Improving the understanding of the data and the model



Curse of Dimensionality

Exponential growth with dimensionality in the number of examples required to accurately estimate a function

For a given sample size, there is a maximum number of features that yields the best performance

Beyond that number, the classifier degrades rather than improves as more features are used

The information lost by discarding some features is compensated by a more accurate mapping in the lower-dimensional space



Feature Selection & Extraction

Feature Selection

Selecting a subset of features without a transformation

Feature Extraction

Transforming existing features into a lower dimensional space



Feature Selection

Given a feature set F = {x1, x2, …, xd}

Aim to maximize a selection criterion by selecting a subset S, where $S \subseteq F$

Major components:

Search

Criterion



Feature Selection

Search

Exhaustive Search

Explore all possible feature subsets

Impractical in applications with a large number of features (d is usually large, otherwise no feature selection is needed)

Optimal Feature Subset

d features, $2^d$ candidates

$2^{20} = 1048576$

Optimal Feature Subset with given subset size

With m, the number of selected features, fixed: $\binom{d}{m}$ candidates

$\binom{20}{10} = 184756$
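The two candidate counts for exhaustive search can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example.

```python
import math

def count_all_subsets(d: int) -> int:
    """Number of feature subsets of d features (the empty set included)."""
    return 2 ** d

def count_fixed_size_subsets(d: int, m: int) -> int:
    """Number of subsets with exactly m of the d features."""
    return math.comb(d, m)

print(count_all_subsets(20))             # 1048576
print(count_fixed_size_subsets(20, 10))  # 184756
```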


Feature Selection

Search

Heuristic Search

Avoids brute-force search

Gives only a sub-optimal solution in general

Finds a subset of features that is close to the optimal one

Naïve Search

Sequential Search

Randomized Search



Feature Selection

Search: Heuristic Search

Naïve Search

Evaluate each feature individually

Select the m features with highest scores

Example:

S(x2) = S(x4) > S(x3) > S(x1)

With m = 2, naïve search selects x2 and x4

But the best combination is x2 and x3, or x4 and x3

[Figure: the four features x1, x2, x3, x4]
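The pitfall of naïve search can be reproduced with a toy criterion. The sketch below is hypothetical: the individual scores and the assumption that x2 and x4 are redundant (so the pair counts only once) are invented to mirror the ordering S(x2) = S(x4) > S(x3) > S(x1); they are not taken from the lecture.

```python
# Hypothetical individual scores; x2 and x4 are ASSUMED to be redundant.
INDIVIDUAL = {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 3.0}

def toy_criterion(subset):
    """Toy subset score: sum of individual scores, counting the x2/x4 pair once."""
    score = sum(INDIVIDUAL[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= INDIVIDUAL["x4"]
    return score

def naive_search(features, criterion, m):
    """Score every feature on its own and keep the m best."""
    ranked = sorted(features, key=lambda f: criterion({f}), reverse=True)
    return ranked[:m]

features = ["x1", "x2", "x3", "x4"]
picked = naive_search(features, toy_criterion, 2)
print(picked, toy_criterion(set(picked)))  # ['x2', 'x4'] 3.0
print(toy_criterion({"x2", "x3"}))         # 5.0
```

Under these assumed scores, the pair chosen by individual ranking scores lower than {x2, x3}, because individual scores say nothing about how features interact.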


Feature Selection


Sequential Search

Search for the answer by adding or removing one feature at a time

Greedy method

At each step, select the best move

Forward Selection

Backward Elimination


[Figure: lattice of feature subsets for d = 4, each subset written as a bit string]

0000

0001 0010 0100 1000

1100 1010 0110 1001 0011 0101

1110 1101 1011 0111

1111


Feature Selection


Sequential Forward Selection (SFS)

Start from the null set

At each step, add the feature that improves the result the most


F : Full feature set

S : Selected feature set

C : Evaluation criterion

S = {}

Repeat

    For each f in F

        score(f) = Evaluation(S ∪ {f}, C)

    End For

    f* = argmax score(f)

    S = S ∪ {f*}

    F = F - {f*}

Until F = {}
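A minimal Python sketch of sequential forward selection follows. Two things are assumptions of this example rather than part of the slide: the loop stops once m features are selected (the pseudocode runs until F is empty), and `toy_criterion` is a hypothetical evaluation function in which x2 and x4 are treated as redundant.

```python
# Hypothetical individual scores; x2 and x4 are ASSUMED to be redundant.
INDIVIDUAL = {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 3.0}

def toy_criterion(subset):
    """Toy subset score: sum of individual scores, counting the x2/x4 pair once."""
    score = sum(INDIVIDUAL[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= INDIVIDUAL["x4"]
    return score

def sequential_forward_selection(features, criterion, m):
    """Greedily add the feature whose addition gives the best criterion value."""
    remaining = list(features)   # F in the pseudocode
    selected = []                # S in the pseudocode
    while remaining and len(selected) < m:
        best = max(remaining, key=lambda f: criterion(set(selected) | {f}))
        selected.append(best)
        remaining.remove(best)
    return selected

print(sequential_forward_selection(["x1", "x2", "x3", "x4"], toy_criterion, 2))
# ['x2', 'x3']
```

Each step evaluates every remaining feature together with the current selection, so interactions with already selected features are taken into account.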



Feature Selection


Sequential Backward Elimination (SBE)

Start from the full set

At each step, remove the feature whose removal affects the result the least


F : Full feature set

S : Selected feature set

C : Evaluation criterion

S = F

Repeat

    For each f in S

        score(f) = Evaluation(S - {f}, C)

    End For

    f* = argmax score(f)

    S = S - {f*}

Until S = {}
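Sequential backward elimination can be sketched in the same way. As assumptions of this example, the loop stops when m features remain (the pseudocode continues until S is empty), and `toy_criterion` is a hypothetical evaluation function in which x2 and x4 are treated as redundant.

```python
# Hypothetical individual scores; x2 and x4 are ASSUMED to be redundant.
INDIVIDUAL = {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 3.0}

def toy_criterion(subset):
    """Toy subset score: sum of individual scores, counting the x2/x4 pair once."""
    score = sum(INDIVIDUAL[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= INDIVIDUAL["x4"]
    return score

def sequential_backward_elimination(features, criterion, m):
    """Greedily drop the feature whose removal leaves the best-scoring subset."""
    selected = list(features)    # S starts as the full set F
    while len(selected) > m:
        # The feature whose removal hurts least gives the highest remaining score.
        drop = max(selected, key=lambda f: criterion(set(selected) - {f}))
        selected.remove(drop)
    return selected

print(sequential_backward_elimination(["x1", "x2", "x3", "x4"], toy_criterion, 2))
# ['x3', 'x4']
```

With this toy criterion the redundant x2 is removed first, then the weak x1, leaving x3 and x4.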



Feature Selection


Comparison: SFS and SBE

First step:

SFS: |S| = 0, d candidates

SBE: |S| = d, d candidates

Last step:

SFS: |S| = d-2, 2 candidates

SBE: |S| = 2, 2 candidates

The time complexity of SBE is usually higher

At a step with k features left, k candidates with about k features each are considered, and SBE starts from the largest k

SBE is nevertheless commonly used in practice

E.g. remove 10% of the features


Feature Selection

Search

Randomized Heuristic Selection

Generate better subsets iteratively based on the existing candidate pool

Keep improving the quality of selected features

Next subset is generated randomly

There is no way to know when the optimal subset has been obtained

Usually it is unnecessary to wait until the search ends



Feature Selection: Search

Randomized Heuristic Selection

Example: Genetic Algorithm

Random initial population

Repeat

Evaluate the fitness of each candidate in population

Remove some bad candidates

Create a new population by:

Mutation(c): change a candidate slightly

Crossover(c1, c2): generate a candidate containing elements of both c1 and c2

Until a good candidate is found
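The loop above can be sketched as a small genetic algorithm over bit masks (1 = feature selected). Everything specific here is an assumption made for illustration: the fitness function (a hypothetical criterion with a per-feature penalty), the population size, single-point crossover, one-bit mutation, and stopping after a fixed number of generations instead of "until a good candidate is found".

```python
import random

FEATURES = ["x1", "x2", "x3", "x4"]
# Hypothetical individual scores; x2 and x4 are ASSUMED to be redundant.
INDIVIDUAL = {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 3.0}

def fitness(mask):
    """Toy fitness: subset score minus a penalty of 1.5 per selected feature."""
    subset = {f for f, bit in zip(FEATURES, mask) if bit}
    score = sum(INDIVIDUAL[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= INDIVIDUAL["x4"]
    return score - 1.5 * len(subset)

def mutate(mask, rng):
    """Flip one randomly chosen bit."""
    i = rng.randrange(len(mask))
    return tuple(bit ^ 1 if j == i else bit for j, bit in enumerate(mask))

def crossover(a, b, rng):
    """Single-point crossover: head of a, tail of b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_search(n_features, fitness_fn, pop_size=6, generations=30, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate and remove the worse half of the population.
        population.sort(key=fitness_fn, reverse=True)
        survivors = population[: pop_size // 2]
        # Refill the population with mutated crossovers of survivors.
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return max(population, key=fitness_fn)

best = genetic_search(len(FEATURES), fitness)
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.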



Feature Selection

Criterion

Dependence Measures

Quantify whether a feature and class ID are correlated or dependent

Pearson correlation coefficient:
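As a hedged illustration, the sketch below uses the standard sample definition of the Pearson coefficient, r = Σ(x − x̄)(y − ȳ) / (√Σ(x − x̄)² · √Σ(y − ȳ)²), between one feature and a numeric class label. The example data are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

feature = [1.0, 2.0, 3.0, 4.0]   # made-up feature values
label = [0, 0, 1, 1]             # made-up class IDs
print(round(pearson(feature, label), 3))  # 0.894
```

A value near +1 or -1 suggests the feature is strongly linearly related to the class ID; a value near 0 suggests no linear dependence.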



Feature Selection

Criterion

Information Gain

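Measures the reduction in uncertainty obtained by using a feature

Entropy:

$H(X) = -\sum_i \frac{|X_i|}{|X|} \log \frac{|X_i|}{|X|}$

$X_i$: the set containing the samples of X in class i

Information gain:

$IG(X, A) = H(X) - \sum_v \frac{|X_v|}{|X|} H(X_v)$

$H(X)$: current entropy

$\sum_v \frac{|X_v|}{|X|} H(X_v)$: entropy after using feature A, where $X_v$ contains the samples for which A takes value v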
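Entropy and information gain for a discrete feature can be sketched as below. The base-2 logarithm and the tiny data set are assumptions of this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2, assumed) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Current entropy minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after

labels = [0, 0, 1, 1]
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0 (feature separates the classes)
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0 (feature is uninformative)
```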


Feature Selection

Criterion

Accuracy Measures

Classifier dependent

Evaluate a feature subset by the performance of a classifier trained on that subset

Any accuracy or error measure



Feature Selection

Criterion

Consistency Measures

Classifier dependence

Aim to achieve P( C | Fullset ) = P( C | Subset )

Rather than accuracy, only the consistency of the outputs is measured

Find the minimum number of features that separate the classes as well as the full set of features can



Feature Selection

Selection Type

Filter

Depends only on the structure of the data, not on a classifier

No bias towards a particular model

No training is involved, so time complexity is low

Can handle larger data sets

Wrapper

Evaluates a feature set according to model performance

Selected features yield better results

Time-consuming


Feature Selection

Selection Type

Embedded method

Features are selected for a specific model (similar to wrapper)

Features are selected during training

Low time complexity

Avoids re-training for each feature subset

No data splitting (into a training set and a test set) is needed



Feature Extraction

Given a feature space $x \in \mathbb{R}^d$, find a mapping $z = f(x)$ with $z \in \mathbb{R}^m$, where $m < d$, such that z preserves (most of) the information in x

d: original number of features

m: number of features after the mapping

Optimal case: no information loss

The mapping may not be linear



Feature Extraction

Principal Components Analysis (PCA)

Unsupervised

Maximizes the variance of the projected data

Linear Discriminant Analysis (LDA)

Supervised

Aims to increase classification accuracy



Feature Extraction

PCA

What is the characteristic of important features?

Variance

If the values of all samples are very similar on a feature, the samples cannot be separated by this feature

E.g. x1 is better than x2

[Figure: samples plotted on axes x1 and x2]


Feature Extraction

PCA

Principal Components Analysis (PCA) linearly projects the data along the directions where the data varies most

The first axis has the greatest variance, the second axis the next greatest, and so on

Dimensionality can be reduced by eliminating the later principal components



Feature Extraction

PCA

[Figure: 2-D data on axes x1 and x2 projected onto three candidate directions (Project 1, Project 2, Project 3); one projection is marked "Larger Var"]


Feature Extraction

PCA

The projection directions are determined by the eigenvectors of the covariance matrix corresponding to the largest eigenvalues

The magnitude of the eigenvalues corresponds to the variance of the data along the eigenvector directions



Feature Extraction

PCA

Vectors v that have the same direction as Cv are called eigenvectors of C

$Cv = \lambda v$

C : a d by d covariance matrix

v : an eigenvector of C

$\lambda$ : an eigenvalue of C

E.g.



Feature Extraction

PCA

Covariance Matrix Calculation

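1. Scalar operation:

$C_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_j^{(i)} - \mu_j)(x_k^{(i)} - \mu_k)$

$C = \begin{bmatrix} C_{11} & \cdots & C_{1d} \\ \vdots & \ddots & \vdots \\ C_{d1} & \cdots & C_{dd} \end{bmatrix}$

2. Vector operation:

$C = \frac{1}{n-1} \sum_{i=1}^{n} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$

$x^{(i)}$ is the i-th of n samples and $\mu$ is the sample mean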


Feature Extraction

PCA

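Calculation of eigenvalue $\lambda$ and eigenvector v

$Cv = \lambda v$

$(C - \lambda I)v = 0$

Solve for the eigenvalues from $\det(C - \lambda I) = 0$

Expanding the determinant generates a polynomial in $\lambda$ of degree d

i.e. there are at most d different roots

For each eigenvalue $\lambda$, solve $(C - \lambda I)v = 0$ to obtain the eigenvector v

Determinant expansion:

$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$

$\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix}$

$\begin{vmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{vmatrix} = a \begin{vmatrix} f & g & h \\ j & k & l \\ n & o & p \end{vmatrix} - b \begin{vmatrix} e & g & h \\ i & k & l \\ m & o & p \end{vmatrix} + c \begin{vmatrix} e & f & h \\ i & j & l \\ m & n & p \end{vmatrix} - d \begin{vmatrix} e & f & g \\ i & j & k \\ m & n & o \end{vmatrix}$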


Feature Extraction

PCA

The eigenvector with the largest eigenvalue is called the First Principal Component (PC1)

It indicates that the data have the largest variance along this eigenvector

PC2: the direction with the maximum variation left in the data, orthogonal to PC1

PCi: the direction with the maximum variation left in the data, orthogonal to all previous PCj, j = 1, …, i-1

[Figure: data on axes x1 and x2 with two orthogonal principal directions]


Feature Extraction

PCA

How to choose m?

Preserve a percentage of the information (variance) in the data

If m = d, all information is preserved

[Figure: preserved variance versus selected dimensionality]
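One common way to pick m is to keep the smallest number of components whose eigenvalues sum to a chosen fraction of the total variance. The sketch below assumes that rule and a hypothetical threshold parameter `keep`; the eigenvalues 1.2840 and 0.0491 are the ones from the worked PCA example in this lecture.

```python
def choose_m(eigenvalues, keep=0.95):
    """Smallest m whose top-m eigenvalues preserve at least `keep` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, value in enumerate(ordered, start=1):
        running += value
        if running / total >= keep:
            return m
    return len(ordered)

print(choose_m([1.2840, 0.0491], keep=0.95))  # 1 (PC1 alone keeps about 96%)
print(choose_m([1.2840, 0.0491], keep=0.99))  # 2
```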


Feature Extraction

PCA: Example


x1     x2
2.5    2.4
0.5    0.7
2.2    2.9
1.9    2.2
3.1    3.0
2.3    2.7
2.0    1.6
1.0    1.1
1.5    1.6
1.1    0.9
Mean:  1.81   1.91

$C = \frac{1}{n-1} \sum_{i=1}^{n} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$


Feature Extraction

PCA: Example


$Cv = \lambda v$

$(C - \lambda I)v = 0$

$\det(C - \lambda I) = 0$

Eigenvalues: $\lambda = 0.0491$ and $\lambda = 1.2840$


Feature Extraction

PCA: Example


Eigenvectors

For $\lambda = 0.0491$: solve $(C - 0.0491 I)v = 0$

For $\lambda = 1.2840$: solve $(C - 1.2840 I)v = 0$


Feature Extraction

PCA: Example


Since 1.2840 > 0.0491, the eigenvector of $\lambda = 1.2840$ is PC1 and the eigenvector of $\lambda = 0.0491$ is PC2

PC1 and PC2 are orthogonal
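The example can be reproduced numerically with the sketch below. It assumes the covariance is normalized by 1/(n-1) and, because the matrix is only 2 by 2 and symmetric, uses the closed-form eigen-decomposition instead of a linear algebra library. The function names are made up for this example.

```python
import math

X1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
X2 = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance with the 1/(n-1) normalization (assumed)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def eigen_symmetric_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] with b != 0, largest eigenvalue first."""
    centre = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (centre + radius, centre - radius):
        vx, vy = b, lam - a          # solves (C - lam*I) v = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

c11, c12, c22 = covariance(X1, X1), covariance(X1, X2), covariance(X2, X2)
(lam1, pc1), (lam2, pc2) = eigen_symmetric_2x2(c11, c12, c22)

# Project the centred data onto PC1 to get a 1-D representation.
m1, m2 = mean(X1), mean(X2)
z = [(a - m1) * pc1[0] + (b - m2) * pc1[1] for a, b in zip(X1, X2)]

print(round(lam1, 4), round(lam2, 4))  # 1.284 0.0491
```

The variance of the projected values `z` equals the larger eigenvalue, which is the sense in which PC1 preserves the most variance.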

Feature Extraction

PCA: Limitation

PCA is not always suitable for classification, since it ignores the class labels

The data are more spread out along x1 than along x2

Eigenvalue of x1 > Eigenvalue of x2

But x2 is more useful in classification

[Figure: two classes plotted on axes x1 and x2]


Feature Extraction

LDA

Linear Discriminant Analysis (LDA) projects the original data linearly to a new space, aiming to preserve as much discriminatory information as possible

Seeks to find directions along which the classes are best separated



Feature Extraction

LDA: Two-Class Problem

Define a measure for class separation

Between-Class Scatter ($S_B$)

A class far away from the others is preferable

Distance between the means of the classes

Within-Class Scatter ($S_W$)

A compact class is preferable

Variance of a class

[Figure: two panels, "More Condense" and "Further Away"]


Feature Extraction


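Between-Class Scatter ($S_B$)

Distance between means of classes

$z^{(k,i)} = w^T x^{(k,i)}$ is the projection of the i-th sample of class k

$\tilde{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} z^{(k,i)} = \frac{1}{n_k} \sum_{i=1}^{n_k} w^T x^{(k,i)} = w^T \mu_k$

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$

$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$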


Feature Extraction


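Within-Class Scatter ($S_W$)

Variance of a class

$\tilde{s}_k^2 = \sum_{i=1}^{n_k} (z^{(k,i)} - \tilde{\mu}_k)^2 = \sum_{i=1}^{n_k} (w^T x^{(k,i)} - w^T \mu_k)^2$

$\tilde{s}_k^2 = w^T \left[ \sum_{i=1}^{n_k} (x^{(k,i)} - \mu_k)(x^{(k,i)} - \mu_k)^T \right] w = w^T S_k w$

$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T (S_1 + S_2) w = w^T S_W w$

$S_W = S_1 + S_2$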


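Feature Extraction



Maximize the between-class scatter relative to the within-class scatter:

$J(w) = \frac{w^T S_B w}{w^T S_W w}$

Take the derivative with respect to w and set it to zero:

$\frac{dJ}{dw} = \frac{2 S_B w \, (w^T S_W w) - 2 S_W w \, (w^T S_B w)}{(w^T S_W w)^2} = 0$

$S_B w \, (w^T S_W w) = S_W w \, (w^T S_B w)$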


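Feature Extraction



When $\frac{dJ}{dw} = 0$:

$S_B w = \frac{w^T S_B w}{w^T S_W w} S_W w$

$\frac{w^T S_B w}{w^T S_W w}$ is a scalar, let $\lambda = \frac{w^T S_B w}{w^T S_W w}$

$S_B w = \lambda S_W w$

$S_W^{-1} S_B w = \lambda w$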


Feature Extraction


Example

x1    x2    y
4     2     1
2     4     1
2     3     1
3     6     1
4     4     1
9     10    2
6     8     2
9     5     2
8     7     2
10    8     2

From these samples, compute the class means $\mu_1$ and $\mu_2$, then $S_W$ and $S_B$


Feature Extraction



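$S_W^{-1} S_B w = \lambda w$

Since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$ and $(\mu_1 - \mu_2)^T w$ is a scalar, $S_B w$ always points in the direction of $\mu_1 - \mu_2$

Therefore $w \propto S_W^{-1} (\mu_1 - \mu_2)$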
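For the two-class case, the standard Fisher solution is w ∝ S_W⁻¹(μ1 − μ2). The sketch below applies it to the ten samples of the example, under two assumptions: the scatter matrices are plain sums of outer products (no 1/n factor, which does not change the direction here because both classes have 5 samples), and the 2 by 2 inverse is written out in closed form. Function names are made up for this example.

```python
CLASS1 = [(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)]
CLASS2 = [(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)]

def mean_vector(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, mu):
    """Sum of (x - mu)(x - mu)^T, returned as (s11, s12, s22)."""
    s11 = sum((p[0] - mu[0]) ** 2 for p in points)
    s12 = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    s22 = sum((p[1] - mu[1]) ** 2 for p in points)
    return s11, s12, s22

def lda_direction(class_a, class_b):
    """Unit vector proportional to S_W^{-1} (mu_a - mu_b)."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    w11, w12, w22 = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]  # S_W = S_1 + S_2
    det = w11 * w22 - w12 * w12
    d0, d1 = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Closed-form inverse of a symmetric 2x2 matrix applied to (d0, d1).
    wx = (w22 * d0 - w12 * d1) / det
    wy = (-w12 * d0 + w11 * d1) / det
    norm = (wx * wx + wy * wy) ** 0.5
    return wx / norm, wy / norm

w = lda_direction(CLASS1, CLASS2)
z1 = [w[0] * x + w[1] * y for x, y in CLASS1]   # projected class 1
z2 = [w[0] * x + w[1] * y for x, y in CLASS2]   # projected class 2
print(w)
print(min(z1) > max(z2))  # True: the projected classes do not overlap
```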


Feature Extraction





Feature Extraction




Feature Extraction

LDA: Multi-Class Problem

How about multi-class?

Within-class Scatter

2-class: $S_W = S_1 + S_2$

2-class can be generalized to multi-class

Multi-class: $S_W = \sum_{k=1}^{K} S_k$

[Figure: several classes plotted on axes x1 and x2]


Feature Extraction


How about multi-class?

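Between-Class Scatter

2-class: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$

Define $\mu$ as the mean of the means of all classes: $\mu = \frac{1}{K} \sum_{k=1}^{K} \mu_k$

Multi-class: $S_B = \sum_{k=1}^{K} (\mu_k - \mu)(\mu_k - \mu)^T$

[Figure: several classes plotted on axes x1 and x2]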


Feature Extraction


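Original x                                                        Mapped z

$S_W = \sum_{k=1}^{K} S_k$                                        $\tilde{S}_W = \sum_{k=1}^{K} \tilde{S}_k$

$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x^{(k,i)}$                $\tilde{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} z^{(k,i)}$

$S_k = \sum_{i=1}^{n_k} (x^{(k,i)} - \mu_k)(x^{(k,i)} - \mu_k)^T$   $\tilde{S}_k = \sum_{i=1}^{n_k} (z^{(k,i)} - \tilde{\mu}_k)(z^{(k,i)} - \tilde{\mu}_k)^T$

$S_B = \sum_{k=1}^{K} (\mu_k - \mu)(\mu_k - \mu)^T$               $\tilde{S}_B = \sum_{k=1}^{K} (\tilde{\mu}_k - \tilde{\mu})(\tilde{\mu}_k - \tilde{\mu})^T$

$\mu = \frac{1}{K} \sum_{k=1}^{K} \mu_k$                          $\tilde{\mu} = \frac{1}{K} \sum_{k=1}^{K} \tilde{\mu}_k$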


Feature Extraction


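The detailed proof is omitted (it is not difficult)

The objective function of the multi-class problem is:

$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$

The eigenvalues and eigenvectors can be obtained by solving:

$S_W^{-1} S_B w = \lambda w$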


Feature Extraction

LDA: Limitation

LDA assumes unimodal Gaussian likelihoods

Performs badly if this assumption does not hold



Feature Extraction

LDA: Limitation

LDA fails when the discriminatory information is not in the mean but in the variance



Feature Extraction

Artificial Neural Network

ANN aims to extract a meaningful feature space in some settings

E.g. Deep Learning

Will be discussed later



Feature Space

Summary

Feature Selection

Selecting a subset

No transformation

Selected features are understandable

Feature Extraction

Transforming features into another space

Only for numeric features

Meaning of original features is lost

Suitable for visualization

Extracted features contain more information



Active Learning

The models discussed so far in this course use Passive Learning

Samples are pre-collected

Some samples may not be useful in learning

Active Learning

Training samples are selected according to the need of the current model

The learner in different learning states needs different samples

Label information is queried for the selected samples



Active Learning

Incremental Learning framework

Algorithm

Given a set of unlabeled samples

Initialization: Query some samples selected randomly

Repeat:

Train a model using labelled samples queried so far

Query the label information of the most useful sample for the current model

How to quantify usefulness?

1. Uncertainty Sampling

2. Query-By-Committee

3. Expected Model Change



Active Learning: Sample Evaluation

Uncertainty Sampling

An active learner queries the instances on whose decision it is least certain

Three strategies

Least confident

Margin sampling

Entropy





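$g_i(x)$: the model output for class i, $i = 1, \dots, C$

Least Confident

$\phi(x) = 1 - g_{max}(x)$, where $g_{max}(x) = \max_{i=1,\dots,C} g_i(x)$

Only focuses on the most probable class

Margin Sampling

$\phi(x) = g_{max}(x) - g_{max2}(x)$

$g_{max}$ and $g_{max2}$ are the largest and 2nd largest g; a smaller margin means a less certain decision

Ignores the output distribution for the remaining classes

Entropy

$\phi(x) = -\sum_{i=1}^{C} g_i(x) \log g_i(x)$

Considers all outputs


Example: Heat maps illustrating the query behavior for a three-label classification problem

[Figure: three heat maps, one per strategy (least confident, margin sampling, entropy); marked outputs include (0.33, 0.33, 0.33) and (0.5, 0.5, 0)]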
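The three uncertainty measures can be compared on a few output vectors. This is a sketch only: the natural logarithm, the function names and the third candidate (0.9, 0.05, 0.05) are assumptions of this example; the first two candidates correspond to the outputs (0.33, 0.33, 0.33) and (0.5, 0.5, 0) marked in the heat maps.

```python
import math

def least_confident(g):
    """1 minus the largest output; larger means more uncertain."""
    return 1.0 - max(g)

def margin(g):
    """Largest minus 2nd largest output; smaller means more uncertain."""
    top, second = sorted(g, reverse=True)[:2]
    return top - second

def entropy(g):
    """Entropy of the output distribution; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in g if p > 0)

candidates = {
    "uniform": (1 / 3, 1 / 3, 1 / 3),
    "two_way_tie": (0.5, 0.5, 0.0),
    "confident": (0.9, 0.05, 0.05),
}

for name, g in candidates.items():
    print(name, round(least_confident(g), 3), round(margin(g), 3), round(entropy(g), 3))

# Query choice under each strategy (margin sampling takes the smallest margin).
print(max(candidates, key=lambda k: least_confident(candidates[k])))  # uniform
print(max(candidates, key=lambda k: entropy(candidates[k])))          # uniform
```

Margin sampling scores the uniform output and the two-way tie identically (margin 0), which illustrates that it ignores the remaining classes, while entropy separates the two.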



Query-By-Committee (QBC)

A committee contains m diverse models trained on the current labeled set

Each committee member votes on the labeling of query candidates

Pick the instances generating the most disagreement among hypotheses

Vote entropy

$x^* = \arg\max_x \left( -\sum_i \frac{V(y_i)}{m} \log \frac{V(y_i)}{m} \right)$

$V(y_i)$ is the number of votes for class i

m is the committee size
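Vote entropy can be sketched as follows. The committee votes are invented for illustration, and the natural logarithm is an assumption of this example.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's votes for one candidate; higher means more disagreement."""
    m = len(votes)  # committee size
    return -sum((v / m) * math.log(v / m) for v in Counter(votes).values())

# Hypothetical votes of a 4-member committee on three unlabeled candidates.
committee_votes = {
    "sample_a": ["cat", "cat", "cat", "cat"],   # full agreement
    "sample_b": ["cat", "cat", "cat", "dog"],
    "sample_c": ["cat", "cat", "dog", "dog"],   # maximal disagreement for two classes
}

query = max(committee_votes, key=lambda s: vote_entropy(committee_votes[s]))
print(query)  # sample_c
```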



Expected Model Change

The instance yielding the greatest change to the current model will be queried

Expected Gradient Length (EGL)

For models trained with gradient-based methods

Query the instance x which, if labeled and added to D, would result in the new training gradient of the largest magnitude

$x^* = \arg\max_x \sum_y P(y \mid x) \, \| \nabla J_\theta(D \cup (x, y)) \|$

$\nabla J_\theta(D)$ : the gradient of the objective function J with respect to the parameters θ

$\nabla J_\theta(D \cup (x, y))$ : the new gradient after adding the training tuple (x, y) to D

Computationally expensive


References

http://www.sci.utah.edu/~shireen

