Machine Learning
Lecture 8
Data Processing
Feature & Sample Selection
Dr. Patrick Chan
South China University of Technology, China
Dr. Patrick Chan @ SCUT
Agenda
Feature Selection
Search
Criterion
Wrapper / Filter / Embedded Method
Feature Extraction
PCA
LDA
Active Learning
Feature Space
Designing a suitable feature space matters more than the choice of classifier
Some collected features may not be useful; removing them gives:
Low storage complexity
Low model complexity
Accuracy may increase
Improved understanding of the data and the model
Curse of Dimensionality
The number of examples required to accurately estimate a function grows exponentially with the dimensionality
For a given sample size, there is a maximum number of features that yields the best performance
Beyond that number, the classifier degrades rather than improves as more features are used
The information lost by discarding some features is compensated by a more accurate mapping in the lower-dimensional space
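To make the exponential growth concrete, here is a minimal sketch. It assumes, purely for illustration, that each feature axis is cut into a fixed number of bins and that at least one sample is wanted in every cell; `cells_to_cover` is a hypothetical helper, not part of the lecture.

```python
def cells_to_cover(bins_per_axis, d):
    """Number of grid cells when each of d feature axes is cut into bins."""
    return bins_per_axis ** d


# With 10 bins per axis, the number of cells (and so the number of samples
# needed for one sample per cell) grows by a factor of 10 per extra feature.
for d in (1, 2, 3, 10):
    print(d, cells_to_cover(10, d))
```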
Feature Selection & Extraction
Feature Selection
Selecting a subset of features without a transformation
Feature Extraction
Transforming existing features into a lower dimensional space
Feature Selection
Given a feature set F = {x1, x2, …, xd}
Aim to maximize a selection criterion by selecting a subset S, where $S \subseteq F$
Major components:
Search
Criterion
Feature Selection
Search
Exhaustive Search
Explore all possible feature subsets
Impractical in applications with a large number of features (d is usually large, otherwise no FS is needed)
Optimal Feature Subset
d features, $2^d$ candidates
$2^{20} = 1048576$
Optimal Feature Subset with a given subset size
Fixed m, the number of selected features: $\binom{d}{m}$ candidates
$\binom{20}{10} = 184756$
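The two counts above can be checked directly. The sketch below also shows what an exhaustive search over fixed-size subsets looks like; `exhaustive_search` and its `criterion` argument are hypothetical names used only for illustration.

```python
from itertools import combinations
from math import comb

d, m = 20, 10
n_all_subsets = 2 ** d        # every feature is either in or out of the subset
n_fixed_size = comb(d, m)     # subsets with exactly m of the d features

print(n_all_subsets)
print(n_fixed_size)


def exhaustive_search(features, criterion, m):
    """Try every size-m subset and return the one with the highest score."""
    return max(combinations(features, m), key=criterion)
```

Even for d = 20 the criterion would be evaluated over a million times, which is why the heuristic searches are used in practice.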
Feature Selection
Search
Heuristic Search
Avoids brute force
Only a sub-optimal solution is guaranteed
Finds a subset of features "close" to the optimal one
Naïve Search
Sequential Search
Randomized Search
Feature Selection
Search: Heuristic Search
Naïve Search
Evaluate each feature individually
Select the m features with the highest scores
Example:
S(x2) = S(x4) > S(x3) > S(x1)
But the best combination is x2 and x3, or x4 and x3, since features that score well individually may be redundant with each other
Feature Selection
Search: Heuristic Search
Sequential Search
Search for the answer by adding or removing one feature at a time
Greedy method
Each time, select the best move
Forward Selection
Backward Elimination
[Figure: lattice of feature subsets for d = 4, one bit per feature]
0000
0001 0010 0100 1000
1100 1010 0110 1001 0011 0101
1110 1101 1011 0111
1111
Feature Selection
Search: Heuristic Search
Sequential Forward Selection (SFS)
Start from the null set
Each time, add the feature that improves the result the most
F : Full feature set
S : Selected feature set
C : Evaluation Criterion
S = {}
Repeat
    For each f in F
        score(f) = Evaluation(S ∪ {f}, C)
    End For
    f* = argmax score(f)
    S = S ∪ {f*}
    F = F – {f*}
Until F = {}
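The SFS loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it stops after m features instead of running until F is empty, and `toy_criterion` is a made-up score in which x2 and x4 are individually strong but redundant with each other (an assumption chosen to mirror the naïve search example).

```python
RELEVANCE = {"x1": 2, "x2": 6, "x3": 5, "x4": 6}  # hypothetical individual scores


def toy_criterion(subset):
    """Sum of individual scores, with a penalty when x2 and x4 are both selected."""
    score = sum(RELEVANCE[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= 6  # the second of the two redundant features adds nothing
    return score


def sequential_forward_selection(features, criterion, m):
    selected = []                 # S = {}
    remaining = list(features)    # F
    while remaining and len(selected) < m:
        # f* = argmax over f in F of Evaluation(S U {f})
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


print(sequential_forward_selection(["x1", "x2", "x3", "x4"], toy_criterion, 2))
```

With this toy criterion the greedy search picks x2 first and then x3, not x4, because the pair is scored jointly.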
Feature Selection
Search: Heuristic Search
Sequential Backward Elimination (SBE)
Start from the full set
Each time, remove the feature whose removal affects the result the least
F : Full feature set
S : Selected feature set
C : Evaluation Criterion
S = F
Repeat
    For each f in S
        score(f) = Evaluation(S – {f}, C)
    End For
    f* = argmax score(f)
    S = S – {f*}
Until S = {}
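A matching sketch for SBE is given below. As assumptions, the loop stops once m features are left rather than running until S is empty, and `toy_criterion` is a made-up score in which x2 and x4 are redundant with each other.

```python
RELEVANCE = {"x1": 2, "x2": 6, "x3": 5, "x4": 6}  # hypothetical individual scores


def toy_criterion(subset):
    """Sum of individual scores, with a penalty when x2 and x4 are both selected."""
    score = sum(RELEVANCE[f] for f in subset)
    if "x2" in subset and "x4" in subset:
        score -= 6
    return score


def sequential_backward_elimination(features, criterion, m):
    selected = list(features)     # S = F
    while len(selected) > m:
        # f* = the feature whose removal leaves the best-scoring subset
        least_useful = max(
            selected,
            key=lambda f: criterion([g for g in selected if g != f]),
        )
        selected.remove(least_useful)
    return selected


print(sequential_backward_elimination(["x1", "x2", "x3", "x4"], toy_criterion, 2))
```

Here the first feature removed is one of the redundant pair, and the search ends with x3 and x4.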
Feature Selection
Search: Heuristic Search
Comparison: SFS and SBE
First Step:
SFS: |S| = 0, d candidates
SBE: |S| = d, d candidates
Last Step:
SFS: |S| = d-2, 2 candidates
SBE: |S| = 2, 2 candidates
Time complexity of SBE is usually higher
At a step with k candidates, SBE evaluates k subsets of k-1 features each, so its early steps work with almost the full feature set
SBE is nevertheless commonly used in practice
E.g. Remove 10% of features
Feature Selection
Search
Randomized Heuristic Selection
Generate better subsets iteratively based on the existing candidate pool
Keep improving the quality of selected features
Next subset is generated randomly
The search cannot tell when the optimal set has been obtained
It is usually unnecessary to wait until the search ends
Feature Selection: Search
Randomized Heuristic Selection
Example: Genetic Algorithm
Random initial population
Repeat
    Evaluate the fitness of each candidate in the population
    Remove some bad candidates
    Create a new population by
        Mutation(c): change a candidate slightly
        Crossover(c1, c2): generate a candidate containing elements of both c1 and c2
Until a good candidate is found
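A minimal sketch of such a genetic search is shown below. Several choices are assumptions made for illustration: candidates are bit masks (1 = feature selected), the loop runs for a fixed number of generations instead of "until a good candidate is found", every child gets one crossover followed by one bit-flip mutation, and `toy_fitness` is a made-up score.

```python
import random

WEIGHTS = [2, 6, 5, 6]  # hypothetical usefulness of features x1..x4


def toy_fitness(mask):
    """Sum the weights of selected features; x2 and x4 are treated as redundant."""
    score = sum(w for w, bit in zip(WEIGHTS, mask) if bit)
    if mask[1] and mask[3]:
        score -= 6
    return score


def genetic_feature_search(d, fitness, pop_size=6, generations=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # evaluate fitness
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]      # remove bad candidates
        children = []
        while len(survivors) + len(children) < pop_size:
            c1, c2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, d)
            child = c1[:cut] + c2[cut:]              # crossover
            flip = rng.randrange(d)
            child[flip] = 1 - child[flip]            # mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history


best, history = genetic_feature_search(4, toy_fitness)
print(best, toy_fitness(best))
```

Because the survivors are always kept, the best fitness never decreases from one generation to the next, but there is no signal that the optimum has been reached.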
Feature Selection
Criterion
Dependence Measures
Quantify whether a feature and class ID are correlated or dependent
Pearson correlation coefficient:
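The formula itself did not survive on this slide. The standard definition for a feature x and class label y over n samples is $r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \sum_k (y_k - \bar{y})^2}}$. A small sketch, with made-up data, of using it as a feature score:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


labels = [0, 0, 1, 1]                     # class ID (made-up data)
informative = [1.0, 2.0, 3.0, 4.0]        # rises with the class ID
uninformative = [1.0, -1.0, -1.0, 1.0]    # unrelated to the class ID

print(abs(pearson(informative, labels)))
print(abs(pearson(uninformative, labels)))
```

The absolute value is used for ranking because a strong negative correlation is as useful as a strong positive one.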
Feature Selection
Criterion
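Information Gain
Measures the reduction in uncertainty (entropy) obtained by using a feature
Entropy: $H(X) = -\sum_i \frac{|X_i|}{|X|} \log_2 \frac{|X_i|}{|X|}$
X_i: the set containing the samples in class i
Information gain: $IG(X, A) = H(X) - \sum_{v \in \mathrm{values}(A)} \frac{|X_v|}{|X|} H(X_v)$
The first term is the current entropy; the second term is the entropy after using feature A (X_v: the samples for which feature A takes the value v)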
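Entropy and information gain for a discrete feature can be computed as in the sketch below. The function names and the tiny data set are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Current entropy minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    after = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after


labels = ["a", "a", "b", "b"]
print(information_gain(["u", "u", "v", "v"], labels))  # feature separates the classes
print(information_gain(["u", "v", "u", "v"], labels))  # feature tells nothing
```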
Feature Selection
Criterion
Accuracy Measures
Classifier dependent
Evaluate a feature subset by the performance of a classifier trained on that subset
Any accuracy or error measure can be used
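As one possible accuracy measure, the sketch below scores a feature subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier. The choice of classifier, the function name and the data are assumptions for illustration; any classifier and any accuracy or error measure could be substituted.

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of 1-NN using only the feature indices in subset."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_dist = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue  # the held-out sample cannot be its own neighbour
            dist = sum((xi[k] - xj[k]) ** 2 for k in subset)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += y[best_j] == y[i]
    return correct / len(X)


# Made-up data: feature 0 separates the classes, feature 1 does not.
X = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 1.1), (1.2, 9.1)]
y = [0, 0, 0, 1, 1, 1]

print(loo_1nn_accuracy(X, y, [0]))
print(loo_1nn_accuracy(X, y, [1]))
```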
Feature Selection
Criterion
Consistency Measures
Classifier dependent
Aim to achieve P( C | Fullset ) = P( C | Subset )
Rather than accuracy, only the consistency of the outputs is measured
Find the minimum number of features that separate the classes as well as the full set of features can
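One common way to quantify consistency for discrete features is an inconsistency rate: samples that look identical on the chosen features but carry different class labels count as inconsistent. The sketch below is an assumption about how such a measure can be written, not necessarily the one intended in the lecture.

```python
from collections import Counter, defaultdict


def inconsistency_rate(X, y, subset):
    """Fraction of samples that disagree with the majority class of their pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        pattern = tuple(row[k] for k in subset)
        groups[pattern][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(X)


# Made-up data with three discrete features.
X = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
y = [0, 0, 1, 1]

print(inconsistency_rate(X, y, [0, 1, 2]))  # full set
print(inconsistency_rate(X, y, [0]))        # as consistent as the full set
print(inconsistency_rate(X, y, [1]))        # loses the class information
```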
Feature Selection
Selection Type
Filter
Depends only on the structure of the data, not on a classifier
No bias towards a particular model
No training is involved, so the time complexity is low
Can handle larger data sets
Wrapper
Evaluates a feature set according to model performance
Selected features yield better results for that model
Time-consuming
Feature Selection
Selection Type
Embedded method
Features are selected for a model (similar to wrapper)
Features are selected during training
Low time complexity
Avoids re-training for each feature subset
No data splitting
A wrapper has to split the data into a training and a test set to evaluate each subset
Feature Extraction
Given a feature space $x \in \mathbb{R}^d$, find a mapping $z = f(x): \mathbb{R}^d \rightarrow \mathbb{R}^m$, where $m < d$, such that z preserves (most of) the information in x
d: original feature number
m: extracted feature number
In the optimal case, there is no information loss
The mapping may not be linear
Feature Extraction
Principal Components Analysis (PCA)
Unsupervised
Increase variance
Linear Discriminant Analysis (LDA)
Supervised
Increase accuracy
Feature Extraction
PCA
What is the characteristic of an important feature?
Variance
If the values of all samples are very similar along a feature, the samples cannot be separated by this feature
E.g. x1 is better than x2
[Figure: data plotted on axes x1 and x2]
Feature Extraction
PCA
Principal Components Analysis (PCA) linearly projects the data along the directions where the data varies most
The first axis has the greatest variance, the second axis the next greatest, and so on
Dimensionality can be reduced by eliminating the later principal components
Feature Extraction
PCA
[Figure: the same 2-D data (axes x1, x2) projected onto three candidate directions, Project 1, Project 2 and Project 3; the projection with the larger variance is marked "Larger Var"]
Feature Extraction
PCA
The projection directions are the eigenvectors of the covariance matrix that correspond to the largest eigenvalues
The magnitude of each eigenvalue corresponds to the variance of the data along its eigenvector direction
Feature Extraction
PCA
Vectors v having the same direction as Cv are called eigenvectors of C
$Cv = \lambda v$
C : a d by d covariance matrix
v : an eigenvector of C
$\lambda$ : an eigenvalue of C
Feature Extraction
PCA
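Covariance Matrix Calculation
For n samples $x^{(1)}, \ldots, x^{(n)}$ with d features each:
1. Scalar Operation:
$\sigma_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_i^{(k)} - \mu_i)(x_j^{(k)} - \mu_j)$
$C = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ \sigma_{d1} & \cdots & \cdots & \sigma_{dd} \end{bmatrix}$
2. Vector Operation:
$C = \frac{1}{n-1} \sum_{k=1}^{n} (x^{(k)} - \mu)(x^{(k)} - \mu)^T$, where $\mu = \frac{1}{n} \sum_{k=1}^{n} x^{(k)}$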
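The scalar form of the covariance calculation can be sketched in plain Python as below. The 1/(n-1) normalization is an assumption; it is the one that reproduces the eigenvalues of the worked PCA example. The data are made up.

```python
def covariance_matrix(X):
    """Return (mean vector, covariance matrix) for a list of d-dimensional samples."""
    n, d = len(X), len(X[0])
    mean = [sum(row[i] for row in X) / n for i in range(d)]
    cov = [
        [
            sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov


mean, C = covariance_matrix([(1, 2), (3, 6), (5, 10)])
print(mean)
print(C)
```

The matrix is symmetric, with the variance of each feature on the diagonal.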
Feature Extraction
PCA
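Calculation of eigenvalue $\lambda$ and eigenvector v
$Cv = \lambda v \Rightarrow (C - \lambda I)v = 0$
Solve $\det(C - \lambda I) = 0$ for the eigenvalues $\lambda$
Expanding the determinant gives a polynomial of degree d in $\lambda$
i.e. at most d different roots
For each eigenvalue $\lambda$, solve $(C - \lambda I)v = 0$ to obtain the eigenvector v
Determinant expansion:
$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$
$\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix}$
$\begin{vmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{vmatrix} = a \begin{vmatrix} f & g & h \\ j & k & l \\ n & o & p \end{vmatrix} - b \begin{vmatrix} e & g & h \\ i & k & l \\ m & o & p \end{vmatrix} + c \begin{vmatrix} e & f & h \\ i & j & l \\ m & n & p \end{vmatrix} - d \begin{vmatrix} e & f & g \\ i & j & k \\ m & n & o \end{vmatrix}$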
Feature Extraction
PCA
The eigenvector with the largest eigenvalue is called the First Principal Component (PC1)
The data have the largest variance along this eigenvector
PC2: the direction with the maximum variation left in the data, orthogonal to PC1
PCi: the direction with the maximum variation left in the data, orthogonal to all previous PCj, j = 1, …, i-1
[Figure: PC1 and PC2 drawn as orthogonal directions on axes x1 and x2]
Feature Extraction
PCA
How to choose m?
Preserve a chosen percentage of the information (variance) in the data
If m = d, all information is preserved
[Figure: preserved variance (vertical axis) against selected dimensionality (horizontal axis)]
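Since each eigenvalue is the variance along its principal component, the preserved percentage for a given m is the sum of the m largest eigenvalues divided by the sum of all of them. A minimal sketch, where `choose_m` and the threshold `keep` are hypothetical names:

```python
def choose_m(eigenvalues, keep=0.95):
    """Smallest m whose top-m eigenvalues preserve at least `keep` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= keep:
            return m
    return len(ordered)


print(choose_m([1.2840, 0.0491], keep=0.95))
print(choose_m([4.0, 3.0, 2.0, 1.0], keep=0.6))
```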
Feature Extraction
PCA: Example
        x1      x2
        2.5     2.4
        0.5     0.7
        2.2     2.9
        1.9     2.2
        3.1     3.0
        2.3     2.7
        2.0     1.6
        1.0     1.1
        1.5     1.6
        1.1     0.9
Mean    1.81    1.91
$\mu = \frac{1}{n} \sum_{k=1}^{n} x^{(k)}$, $C = \frac{1}{n-1} \sum_{k=1}^{n} (x^{(k)} - \mu)(x^{(k)} - \mu)^T$
Feature Extraction
PCA: Example
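$Cv = \lambda v \Rightarrow (C - \lambda I)v = 0$
$\det(C - \lambda I) = 0$
With $C \approx \begin{bmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{bmatrix}$ computed from the data: $(0.6166 - \lambda)(0.7166 - \lambda) - 0.6154^2 = 0$
Eigenvalues: $\lambda = 0.0491$ and $\lambda = 1.2840$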
Feature Extraction
PCA: Example
Eigenvectors
For $\lambda = 0.0491$: solving $(C - 0.0491 I)v = 0$ gives $v \approx (-0.7352, 0.6779)^T$
For $\lambda = 1.2840$: solving $(C - 1.2840 I)v = 0$ gives $v \approx (0.6779, 0.7352)^T$
(Eigenvectors are scaled to unit length; their sign is arbitrary)
Since $1.2840 > 0.0491$, PC1 is the eigenvector of $\lambda = 1.2840$ and PC2 is the eigenvector of $\lambda = 0.0491$
PC1 and PC2 are orthogonal
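The worked example can be reproduced with a few lines of plain Python. This sketch relies on two assumptions: the covariance uses the 1/(n-1) normalization, and the eigenvalues come from the closed-form solution of the 2x2 characteristic polynomial (for larger d a numerical eigen-solver would be used instead).

```python
from math import hypot, sqrt

x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
n = len(x1)

# Means and covariance matrix C = [[c11, c12], [c12, c22]]
m1, m2 = sum(x1) / n, sum(x2) / n
c11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
c22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
c12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)

# det(C - lambda I) = lambda^2 - trace * lambda + det = 0
trace, det = c11 + c22, c11 * c22 - c12 ** 2
root = sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + root) / 2, (trace - root) / 2


def unit_eigenvector(lam):
    """Solve (C - lam I) v = 0 using its first row: (c11 - lam) v1 + c12 v2 = 0."""
    v1, v2 = c12, lam - c11
    norm = hypot(v1, v2)
    return (v1 / norm, v2 / norm)


pc1, pc2 = unit_eigenvector(lam1), unit_eigenvector(lam2)

# Project the centred data onto PC1 to get a 1-D representation.
z = [(a - m1) * pc1[0] + (b - m2) * pc1[1] for a, b in zip(x1, x2)]

print(round(lam1, 4), round(lam2, 4))
print(pc1, pc2)
```

The variance of the projected values z equals the larger eigenvalue, which is the sense in which PC1 keeps most of the information.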
Feature Extraction
PCA: Limitation
PCA is not always suitable for classification because it ignores the class labels
More spread in x1 than in x2
Eigenvalue of x1 > Eigenvalue of x2
But x2 is more useful in classification
[Figure: data plotted on axes x1 and x2]
Feature Extraction
LDA
Linear Discriminant Analysis (LDA) projects the original data linearly to a new space, aiming to preserve as much discriminatory information as possible
Seeks to find directions along which the classes are best separated
Feature Extraction
LDA: Two-Class Problem
Define a measure for class separation
Between-Class Scatter ($S_B$)
A class far away from the others is preferable
Distance between the means of the classes
Within-Class Scatter ($S_W$)
A compact class is preferable
Variance of a class
[Figure: classes that are more compact and further away from each other]
Feature Extraction
LDA: Two-Class Problem
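Between-Class Scatter ($S_B$)
Distance between the means of the classes
Each sample is projected linearly: $z^{(i,j)} = w^T x^{(i,j)}$, where $x^{(i,j)}$ is the j-th sample of class i and $n_i$ is the number of samples in class i
Class mean before projection: $\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x^{(i,j)}$
Class mean after projection: $\tilde{\mu}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} z^{(i,j)} = \frac{1}{n_i} \sum_{j=1}^{n_i} w^T x^{(i,j)} = w^T \mu_i$
$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
where $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$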
Feature Extraction
LDA: Two-Class Problem
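Within-Class Scatter ($S_W$)
Variance of a class
With $z^{(i,j)} = w^T x^{(i,j)}$ and $\tilde{\mu}_i = w^T \mu_i$, the scatter of class i after projection is
$\tilde{s}_i^2 = \sum_{j=1}^{n_i} (z^{(i,j)} - \tilde{\mu}_i)^2 = \sum_{j=1}^{n_i} (w^T x^{(i,j)} - w^T \mu_i)^2 = w^T S_i w$
where $S_i = \sum_{j=1}^{n_i} (x^{(i,j)} - \mu_i)(x^{(i,j)} - \mu_i)^T$
$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T (S_1 + S_2) w = w^T S_W w$
where $S_W = S_1 + S_2$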
Feature Extraction
LDA: Two-Class Problem
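Objective: find the w that maximizes the between-class scatter relative to the within-class scatter after projection
$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$
Set the derivative to zero:
$\frac{dJ}{dw} = \frac{2 S_B w \, (w^T S_W w) - 2 S_W w \, (w^T S_B w)}{(w^T S_W w)^2} = 0$
$\Rightarrow S_B w \, (w^T S_W w) - S_W w \, (w^T S_B w) = 0$
$\Rightarrow S_B w - \frac{w^T S_B w}{w^T S_W w} S_W w = 0$
$\frac{w^T S_B w}{w^T S_W w}$ is a scalar, let $\lambda = \frac{w^T S_B w}{w^T S_W w}$
$\Rightarrow S_B w = \lambda S_W w \Rightarrow S_W^{-1} S_B w = \lambda w$
When $\frac{dJ}{dw} = 0$, w is an eigenvector of $S_W^{-1} S_B$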
Feature Extraction
LDA: Two-Class Problem
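Class 1:
x1    x2    y
4     2     1
2     4     1
2     3     1
3     6     1
4     4     1
Class 2:
x1    x2    y
9     10    2
6     8     2
9     5     2
8     7     2
10    8     2
$\mu_1 = (3, 3.8)^T$, $\mu_2 = (8.4, 7.6)^T$
With $S_i = \sum_j (x^{(i,j)} - \mu_i)(x^{(i,j)} - \mu_i)^T$: $S_W = S_1 + S_2 = \begin{bmatrix} 13.2 & -1.2 \\ -1.2 & 22.0 \end{bmatrix}$
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{bmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{bmatrix}$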
Feature Extraction
LDA: Two-Class Problem
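Solve $S_W^{-1} S_B w = \lambda w$
Since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$ and $(\mu_1 - \mu_2)^T w$ is a scalar, $S_B w$ always points in the direction of $\mu_1 - \mu_2$
Therefore $w \propto S_W^{-1} (\mu_1 - \mu_2)$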
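The two-class example can be worked through numerically as below. This is a sketch under stated assumptions: the within-class scatter is the unnormalized sum $S_W = S_1 + S_2$, and the projection direction is taken as $w \propto S_W^{-1}(\mu_1 - \mu_2)$, so only the direction of w (not its length or sign) is meaningful. The helper names are illustrative.

```python
class1 = [(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)]
class2 = [(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)]


def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, mu):
    """Sum over samples of (x - mu)(x - mu)^T as a 2x2 nested list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        dev = (p[0] - mu[0], p[1] - mu[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s


mu1, mu2 = mean(class1), mean(class2)
S1, S2 = scatter(class1, mu1), scatter(class2, mu2)
SW = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]

# Inverse of the 2x2 matrix SW
det = SW[0][0] * SW[1][1] - SW[0][1] * SW[1][0]
SW_inv = [[SW[1][1] / det, -SW[0][1] / det], [-SW[1][0] / det, SW[0][0] / det]]

# w is proportional to SW^-1 (mu1 - mu2)
diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
w = (
    SW_inv[0][0] * diff[0] + SW_inv[0][1] * diff[1],
    SW_inv[1][0] * diff[0] + SW_inv[1][1] * diff[1],
)

# 1-D projections z = w^T x of both classes
z1 = [w[0] * p[0] + w[1] * p[1] for p in class1]
z2 = [w[0] * p[0] + w[1] * p[1] for p in class2]
print(w)
print(min(z1), max(z2))
```

After projection the two classes occupy disjoint intervals on the line, so a single threshold on z separates them.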
Feature Extraction
LDA: Multi-Class Problem
How about multi-class?
Within-Class Scatter
2-class: $S_W = S_1 + S_2$
The 2-class definition can be generalized to multi-class
Multi-class: $S_W = \sum_{i=1}^{c} S_i$, where c is the number of classes and $S_i = \sum_{j=1}^{n_i} (x^{(i,j)} - \mu_i)(x^{(i,j)} - \mu_i)^T$
Feature Extraction
LDA: Multi-Class Problem
How about multi-class?
Between-Class Scatter
2-class: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
Define $\mu$ as the mean of the means of all classes: $\mu = \frac{1}{c} \sum_{i=1}^{c} \mu_i$
Multi-class: $S_B = \sum_{i=1}^{c} (\mu_i - \mu)(\mu_i - \mu)^T$
Feature Extraction
LDA: Multi-Class Problem
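Original x | Mapped z = W^T x
$S_W = \sum_{i=1}^{c} S_i$ | $\tilde{S}_W = \sum_{i=1}^{c} \tilde{S}_i$
$\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x^{(i,j)}$ | $\tilde{\mu}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} z^{(i,j)}$
$S_i = \sum_{j=1}^{n_i} (x^{(i,j)} - \mu_i)(x^{(i,j)} - \mu_i)^T$ | $\tilde{S}_i = \sum_{j=1}^{n_i} (z^{(i,j)} - \tilde{\mu}_i)(z^{(i,j)} - \tilde{\mu}_i)^T$
$S_B = \sum_{i=1}^{c} (\mu_i - \mu)(\mu_i - \mu)^T$ | $\tilde{S}_B = \sum_{i=1}^{c} (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$
$\mu = \frac{1}{c} \sum_{i=1}^{c} \mu_i$ | $\tilde{\mu} = \frac{1}{c} \sum_{i=1}^{c} \tilde{\mu}_i$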
Feature Extraction
LDA: Multi-Class Problem
The detailed proof is omitted (it is not difficult)
The objective function of the multi-class problem is: $J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$
The eigenvalues and eigenvectors can be obtained by solving $S_W^{-1} S_B w = \lambda w$
Feature Extraction
LDA: Limitation
LDA assumes unimodal Gaussian likelihoods
Performs badly if the assumption is wrong
Feature Extraction
LDA: Limitation
LDA fails when the discriminatory information is not in the means but in the variances of the classes
Feature Extraction
Artificial Neural Network
An ANN aims to extract a meaningful feature space in some settings
E.g. Deep Learning
Will be discussed later
Feature Space
Summary
Feature Selection
Selecting a subset
No transformation
Selected features are understandable
Feature Extraction
Transforming features into another space
Only for numeric features
Meaning of original features is lost
Suitable for visualization
Extracted features contain more information
Active Learning
Models discussed in this course so far use Passive Learning
Samples are pre-collected
Some samples may not be useful in learning
Active Learning
Training samples are selected according to the needs of the current model
The learner needs different samples in different learning states
Label information is queried for the selected samples
Active Learning
Incremental Learning framework
Algorithm
Given a set of unlabeled samples
Initialization: Query some samples selected randomly
Repeat:
Train a model using labelled samples queried so far
Query the label information of the most useful sample for the current model
How to quantify usefulness?
1. Uncertainty Sampling
2. Query-By-Committee
3. Expected Model Change
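The loop above can be sketched with a deliberately tiny model. Everything specific here is an assumption for illustration: the pool is 1-D with unique integer values, the "model" is a threshold halfway between the largest labelled negative and the smallest labelled positive, usefulness is closeness to that threshold (a form of uncertainty sampling), and the initial samples are fixed rather than random so that the run is reproducible.

```python
def train_threshold(labeled):
    """Fit the toy model: a threshold between the two labelled classes."""
    largest_negative = max(x for x, y in labeled.items() if y == 0)
    smallest_positive = min(x for x, y in labeled.items() if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learning(pool, oracle, initial, n_queries):
    labeled = {x: oracle(x) for x in initial}        # initialization
    queried = []
    for _ in range(n_queries):
        threshold = train_threshold(labeled)         # train on labels so far
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        # most useful sample = the one the current model is least sure about
        x = min(unlabeled, key=lambda u: abs(u - threshold))
        labeled[x] = oracle(x)                       # query its label
        queried.append(x)
    return train_threshold(labeled), queried


pool = list(range(0, 101, 10))                       # 0, 10, ..., 100


def oracle(x):
    """Stands in for the human labeller; the true boundary is at 65."""
    return 1 if x >= 65 else 0


threshold, queried = active_learning(pool, oracle, initial=[0, 100], n_queries=3)
print(threshold, queried)
```

Three well-chosen queries are enough to locate the boundary here, whereas labelling the pool in arbitrary order could need many more.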
Active Learning: Sample Evaluation
Uncertainty Sampling
An active learner queries the instances whose labels it is least certain about
Three strategies
Least confident
Margin sampling
Entropy
Active Learning: Sample Evaluation
Uncertainty Sampling
Least Confident
$u(x) = 1 - g_{max}(x)$
Only focuses on the most probable class
Margin Sampling
$u(x) = g_{max}(x) - g_{max2}(x)$
$g_{max}$ and $g_{max2}$ are the largest and 2nd largest g
Ignores the output distribution for the remaining classes
Entropy
$u(x) = -\sum_i g_i(x) \log g_i(x)$
Considers all outputs
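The three scores can be compared on a vector g of class outputs that sums to 1. The function names are illustrative. Note the direction of each score in this sketch: a larger least-confident or entropy value means more uncertainty, while a smaller margin means more uncertainty.

```python
from math import log


def least_confident(g):
    """1 minus the largest output."""
    return 1 - max(g)


def margin(g):
    """Largest output minus the 2nd largest output."""
    top, second = sorted(g, reverse=True)[:2]
    return top - second


def entropy(g):
    """Entropy of the outputs; zero outputs contribute nothing."""
    return -sum(p * log(p) for p in g if p > 0)


for g in ([1 / 3, 1 / 3, 1 / 3], [0.5, 0.5, 0.0], [0.9, 0.05, 0.05]):
    print(g, least_confident(g), margin(g), entropy(g))
```

For the output (0.5, 0.5, 0) the margin is zero, the same as for a uniform output, while the entropy is lower than for the uniform output because the third class has been ruled out.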
Active Learning: Sample Evaluation
Uncertainty Sampling
Example: Heat maps illustrating the query behavior for a three-label classification problem
[Figure: three heat maps over the outputs $(g_1, g_2, g_3)$, one per strategy: least confident $u(x) = 1 - g_{max}(x)$, margin $u(x) = g_{max}(x) - g_{max2}(x)$, entropy $u(x) = -\sum_i g_i(x) \log g_i(x)$; marked points (0.33, 0.33, 0.33) and (0.5, 0.5, 0)]
Active Learning: Sample Evaluation
Query-By-Committee (QBC)
A committee contains m diverse models trained on the current labeled set
Each committee member votes on the labeling of query candidates
Pick the instances generating the most disagreement among hypotheses
Vote entropy: $u(x) = -\sum_i \frac{V(y_i)}{m} \log \frac{V(y_i)}{m}$
V(y_i) is the number of votes for class i
m is the committee size
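Vote entropy can be computed from the list of labels predicted by the committee members for one candidate. A minimal sketch, with `vote_entropy` as an illustrative name and made-up votes:

```python
from collections import Counter
from math import log


def vote_entropy(votes):
    """Entropy of the committee's votes; classes with no votes contribute nothing."""
    m = len(votes)  # committee size
    return -sum((v / m) * log(v / m) for v in Counter(votes).values())


print(vote_entropy(["a", "a", "a", "a"]))  # full agreement
print(vote_entropy(["a", "a", "a", "b"]))  # mild disagreement
print(vote_entropy(["a", "a", "b", "b"]))  # maximal disagreement for two classes
```

The candidate with the highest vote entropy is the one the committee disagrees on most, so it is the one to query.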
Active Learning: Sample Evaluation
Expected Model Change
The instance yielding the greatest change to the current model will be queried
Expected Gradient Length (EGL)
For models trained with gradient-based methods
Query the instance x which, if labeled and added to D, would result in the new training gradient of the largest magnitude
Since the true label is unknown, the expectation is taken over the possible labels:
$x^* = \arg\max_x \sum_i P(y_i \mid x; \theta) \, \| \nabla J_\theta(D \cup (x, y_i)) \|$
$\nabla J_\theta(D)$ : the gradient of the objective function J with respect to the parameters θ
$\nabla J_\theta(D \cup (x, y))$ : the new gradient after adding the training tuple (x, y) to D
Computationally expensive
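A minimal sketch of EGL for binary logistic regression is given below. It rests on two assumptions that should be read as such: the model is logistic regression with log-loss, whose per-sample gradient is $(p - y)x$ with $p = \sigma(\theta^T x)$, and the current model has converged on D so that $\nabla J_\theta(D) \approx 0$ and the new gradient is approximated by the gradient of the added sample alone.

```python
from math import exp, sqrt


def sigmoid(t):
    return 1 / (1 + exp(-t))


def expected_gradient_length(theta, x):
    """Expected norm of the log-loss gradient of (x, y), averaged over y in {0, 1}."""
    p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))  # P(y = 1 | x; theta)
    norm_x = sqrt(sum(xi * xi for xi in x))
    egl = 0.0
    for y, p_y in ((1, p1), (0, 1 - p1)):
        grad_norm = abs(p1 - y) * norm_x   # norm of the gradient (p1 - y) * x
        egl += p_y * grad_norm
    return egl


theta = [1.0, -1.0]
print(expected_gradient_length(theta, [1.0, 1.0]))    # on the decision boundary
print(expected_gradient_length(theta, [5.0, -5.0]))   # far from the boundary
```

Under these assumptions a sample on the decision boundary scores far higher than a confidently classified one, even though the latter has a larger norm.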
References
http://www.sci.utah.edu/~shireen