Regularized Double Nearest Neighbor Feature Extraction for Hyperspectral Image Classification
Hsiao-Yun Huang
Department of Statistics and Information Science, Fu-Jen University
Hyperspectral Image Introduction 1
(image credit: AFRL)
Hyperspectral Image Introduction 2
(image credit: AFRL)
Applications of Hyperspectral Images
Military: detection of military equipment.
Commercial: mineral exploration, agriculture, and forest production.
Ecology: chlorophyll, leaf water, cellulose, lignin.
Agriculture: identifying plant disease or plant type.
Classification of Hyperspectral Image Pixels
How to distinguish different land-cover types precisely and automatically in hyperspectral images is an interesting and important research problem.
Generally, each pixel in a hyperspectral image consists of hundreds or even thousands of bands. This makes discriminating among pixels a high-dimensional classification problem.
High-Dimensional Data Analysis
"We can say with complete confidence that in the coming century, high-dimensional data analysis will be a very significant activity, and completely new methods of high-dimensional data analysis will be developed; ..." (Lecture to the American Mathematical Society, 'Math Challenges of the 21st Century', by David L. Donoho, August 8, 2000)
Blessing: The Power of Increasing Dimensionality
[Figure: density plots over the features x1, x2, and x3, illustrating how adding dimensions can improve class separability]
Curse: Hughes Phenomenon
[Figure: Hughes curves of mean recognition accuracy versus measurement complexity n (total discrete values) for training sample sizes m ranging from 2 to 1000]
The Curse of Dimensionality
In statistics, it refers to the fact that the convergence of any estimator to the true value of a smooth function defined on a high-dimensional space is very slow; that is, an extremely large number of observations is needed. (Bellman, 1961) http://www.stat.ucla.edu/~sabatti/statarray/textr/node5.html
The Challenge
Unfortunately, in hyperspectral image classification the p > N case is the usual situation, because access to training samples (ground truth data) can be very difficult and expensive.
This large-dimension, few-samples problem can make the accuracy of hyperspectral image classification unsatisfactory.
Dimensionality Reduction
One common way to deal with the curse of dimensionality is to reduce the number of dimensions.
Two major reduction ideas: feature selection and feature extraction.
[Figure: schematic comparing the two reduction ideas]
Feature selection: select l out of the p measurements.
Feature extraction: map the p measurements to l new features.
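To make the distinction concrete, here is a tiny illustration (my own, not from the slides); the chosen band indices and the random transformation matrix are purely hypothetical.

```python
# Selection keeps l of the original p bands; extraction maps all p bands
# to l new features via a p x l transformation matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 200))      # 100 pixels, p = 200 bands

selected = X[:, [10, 55, 120]]           # feature selection: keep l = 3 bands
A = rng.standard_normal((200, 3))        # some p x l transformation (placeholder)
extracted = X @ A                        # feature extraction: l = 3 new features
```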
Feature Extraction vs. Feature Selection
[Figure: two-class example comparing the projections obtained by feature selection and by feature extraction]
Basic Ideas of Feature Extraction
Feature extraction consists of choosing those features which are most effective for preserving class separability.
Class Separability depends not only on the class distributions but also on the classifier to be used.
We seek the minimum feature set with reference to the Bayes classifier; this will result in the minimum error for the given distributions. Therefore, the Bayes error is the optimum measure of feature effectiveness.
One Consideration
A major disadvantage of the Bayes error as a criterion is that an explicit mathematical expression is not available except for a very few special cases; therefore, we cannot expect a great deal of theoretical development.
Practical Alternatives
Two types of criteria that have explicit mathematical expressions and are frequently used in practice:
Functions of scatter matrices (not directly related to the Bayes error): conceptually simple and give systematic algorithms.
Bhattacharyya-distance-type criteria (give upper bounds on the Bayes error): only for two-class problems, and based on a normality assumption.
Discriminant Analysis Feature Extraction (DAFE, or Fisher's LDA)
The between-class scatter matrix is
S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T  (pairwise structure),
where L is the number of classes, m_i is the mean of class i, m_0 is the overall mean, and S_w^{DA} is the within-class scatter matrix (the pooled covariance estimate).
The feature transformation matrix of DAFE is composed of the eigenvectors of (S_w^{DA})^{-1} S_b^{DA}.
Note: the number of extracted features is at most min{p, L-1}, where p is the dimension of the mean vector.
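A minimal NumPy/SciPy sketch of the DAFE recipe above; the function and variable names are my own, and it assumes S_w is nonsingular (the singularity issue discussed later).

```python
# Fisher's LDA / DAFE feature extraction: eigenvectors of Sw^{-1} Sb.
# X: (n, p) training pixels, y: (n,) class labels.
import numpy as np
from scipy.linalg import eigh

def dafe_transform(X, y):
    classes = np.unique(y)
    L, p = len(classes), X.shape[1]
    m0 = X.mean(axis=0)                                  # overall mean
    Sw = np.zeros((p, p))                                # within-class scatter
    Sb = np.zeros((p, p))                                # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += np.cov(Xc, rowvar=False) * (len(Xc) - 1)   # pooled scatter
        d = (mc - m0)[:, None]
        Sb += len(Xc) * d @ d.T
    # generalized symmetric eigenproblem Sb v = w Sw v (requires Sw nonsingular)
    evals, evecs = eigh(Sb, Sw)
    order = np.argsort(evals)[::-1][: L - 1]             # at most L-1 useful features
    return evecs[:, order]                               # (p, L-1) transformation matrix

# usage: A = dafe_transform(X_train, y_train); Z = X_train @ A
```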
DAFE vs. PCA
[Figure: comparison of the projections obtained by PCA and by DAFE]
Drawbacks of Fisher's LDA (1)
In some situations, J_{LDA} = tr((S_w^{DA})^{-1} S_b^{DA}) is not a good measure of class separability:
Classes share the same mean: there is no scatter of M1 and M2 around M0.
Multimodal classes: more than L-1 features are needed.
[Figure: unimodal classes sharing the same mean; multimodal classes sharing the same mean; multimodal classes]
Drawbacks of Fisher's LDA (2)
The unbiased estimate S (the pooled covariance estimate) of the within-class scatter matrix is adopted in LDA. If it is singular, the performance will be poor.
When dim >> n, S loses its full rank as a growing number of its eigenvalues become zero, so it is not positive definite and cannot be inverted.
[Figure: true eigenvalues versus the eigenvalues of Sw when dim/n = 10]
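The rank deficiency described above is easy to verify numerically; the snippet below is my own illustration (simulated data, not the hyperspectral images used later).

```python
# When the dimension p exceeds the sample size n, the sample covariance
# loses full rank and cannot be inverted.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                                    # dim >> n
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)                       # p x p sample covariance
print(np.linalg.matrix_rank(S))                   # at most n - 1 = 19, far below p = 200
print(np.sum(np.linalg.eigvalsh(S) > 1e-10))      # same count of nonzero eigenvalues
```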
Feature Extraction Methods with Other Measures of Separability
Nonparametric Discriminant Analysis (NDA; Fukunaga and Mantock, 1983)
Nonparametric Weighted Feature Extraction (NWFE; Bor-Chen Kuo and Landgrebe, 2004)
Regularized Double Nearest Proportion Feature Extraction (RDNP; Hsiao-Yun Huang and Bor-Chen Kuo, submitted)
The Idea of Nonparametric Discriminant Analysis (NDA; Fukunaga and Mantock, 1983)
Instead of separating the class means M_i and M_j as LDA does, NDA tries to separate the classes along the boundary.
[Figure: classes i and j with their means, contrasting mean separation with boundary separation]
Nearest Neighbor Structure
[Figure: a point X_{ik} of class i together with its k nearest neighbors in class i and its k nearest neighbors in class j]
Pairwise Between-Class Scatter Matrix
[Figure: a boundary point X_{ik} receives a large weight while an interior point X_{ih} receives a small weight; M_j(x_k^{(i)}) and M_j(x_h^{(i)}) are their local kNN means in class j]
The nonparametric between-class scatter matrix of NDA is
S_b^{NDA} = \sum_{i=1}^{L} P_i \frac{1}{n_i} \sum_{l=1}^{n_i} \sum_{j=1, j \neq i}^{L} w_l^{(i,j)} \, (x_l^{(i)} - M_j(x_l^{(i)}))(x_l^{(i)} - M_j(x_l^{(i)}))^T,
where M_j(x_l^{(i)}) is the mean of the k nearest neighbors of x_l^{(i)} in class j, and the weight is
w_l^{(i,j)} = \frac{\min\{ d^{\alpha}(x_l^{(i)}, x_l^{kNN(i)}),\; d^{\alpha}(x_l^{(i)}, x_l^{kNN(j)}) \}}{d^{\alpha}(x_l^{(i)}, x_l^{kNN(i)}) + d^{\alpha}(x_l^{(i)}, x_l^{kNN(j)})}.
Here \alpha is a control parameter between zero and infinity, and d(x_l^{(i)}, x_l^{kNN(j)}) is the distance from x_l^{(i)} to its kNN point in class j.
The feature transformation matrix of NDA is composed of the eigenvectors of (S_w^{DA})^{-1} S_b^{NDA}.
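The sketch below is my own two-class reading of the S_b^{NDA} formula above; the function and variable names are assumptions, not the authors' implementation.

```python
# Nonparametric between-class scatter of NDA for two classes Xi, Xj
# (each an (n, p) array), using k-NN means and min/sum distance weights.
import numpy as np

def knn_dist_and_mean(x, pool, k):
    """Distance to the k-th nearest neighbour of x in `pool`, and the mean
    of those k neighbours (the local kNN mean M(x))."""
    d = np.linalg.norm(pool - x, axis=1)
    idx = np.argsort(d)[:k]
    return d[idx[-1]], pool[idx].mean(axis=0)

def nda_Sb_two_class(Xi, Xj, k=3, alpha=1.0):
    p = Xi.shape[1]
    Sb = np.zeros((p, p))
    for own, other in ((Xi, Xj), (Xj, Xi)):          # both directions i->j and j->i
        n = len(own)
        for l in range(n):
            x = own[l]
            own_wo_x = np.delete(own, l, axis=0)     # own class without x itself
            d_self, _ = knn_dist_and_mean(x, own_wo_x, k)
            d_other, M_other = knn_dist_and_mean(x, other, k)
            w = min(d_self**alpha, d_other**alpha) / (d_self**alpha + d_other**alpha)
            diff = (x - M_other)[:, None]
            Sb += (0.5 / n) * w * diff @ diff.T      # P_i = P_j = 0.5 for two classes
    return Sb
```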
The Properties of NDA
The between-class scatter matrix Sb is usually of full rank, so the restriction that only min(#classes - 1, dim) features can be extracted is lifted.
Since the parametric Sb is replaced by a nonparametric Sb that preserves important boundary information for classification, NDA is more robust.
Some Considerations about NDA When Overlap Occurs (1)
Based on NDA's definition of the boundary (the focal portion of the distribution), points at a similar distance from the two groups under consideration are regarded as boundary points.
This definition of the boundary fails when overlap occurs, because points around and within the overlap region all tend to receive the same weight.
The Boundary of NDA When Overlap Occurs
[Figure: overlapping classes; the projection direction suggested by NDA's boundary points becomes ambiguous]
Some Considerations about NDA When Overlap Occurs (2)
In NDA, kNN is adopted for measuring the 'local' between-class scatter, so the selected k is a very small integer, as is usual for kNN (all the experiments in Fukunaga's paper and book use either k = 1 or k = 3).
This setting of k might make a data point and its local kNN mean in class j very similar (close). The consequence is that the entries of the corresponding term of Sb become very close to zero, which cancels out the effect of the weight or gives that term even less influence on the overall Sb.
Some Considerations about NDA When Overlap Occurs (3)
Also, in the Sb of NDA, only one data point is used to represent one group, while the kNN mean represents the other local group.
As a result, Sb may not measure the scatter between "groups" very well and is easily influenced by outliers.
S_b^{NDA} = \sum_{i=1}^{L} P_i \frac{1}{n_i} \sum_{l=1}^{n_i} \sum_{j=1, j \neq i}^{L} w_l^{(i,j)} \, (x_l^{(i)} - M_j(x_l^{(i)}))(x_l^{(i)} - M_j(x_l^{(i)}))^T
Another Consideration
In NDA, the boundary is estimated from the sample. Even when the sample distributions do not overlap, with the settings of NDA the estimated boundary may lie too close to the edge (since k is small and only one x_l is used per group in Sb).
As with the hard SVM, an extremely clear-cut boundary (support vectors) estimated from the sample may perform poorly because of overfitting.
The Singularity Problem
In NDA, the unbiased covariance estimate S is still adopted; thus, the singularity problem still exists in NDA.
Nonparametric Weighted Feature Extraction (NWFE)
[Figure: for each point x_l^{(i)} of class i, NWFE builds weighted local means M_i(x_l^{(i)}) and M_j(x_l^{(i)}); points near the boundary, where x_l^{(i)} - M_j(x_l^{(i)}) is small, receive a large weight, and points far from it a light weight]
The weight of each point is
\lambda_l^{(i,j)} = \frac{dist(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}{\sum_{k=1}^{n_i} dist(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}.
Nonparametric Weighted Feature Extraction (NWFE; Kuo & Landgrebe, 2002, 2004)
The between-class and within-class scatter matrices of NWFE are
S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{j=1, j \neq i}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i} \, (x_k^{(i)} - M_j(x_k^{(i)}))(x_k^{(i)} - M_j(x_k^{(i)}))^T,
S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i} \, (x_k^{(i)} - M_i(x_k^{(i)}))(x_k^{(i)} - M_i(x_k^{(i)}))^T,
where the weighted local mean is
M_j(x_k^{(i)}) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)} x_l^{(j)},  with n_j the number of training samples of class j,
\lambda_k^{(i,j)} = \frac{dist(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}{\sum_{l=1}^{n_i} dist(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}},
w_{kl}^{(i,j)} = \frac{dist(x_k^{(i)}, x_l^{(j)})^{-1}}{\sum_{l=1}^{n_j} dist(x_k^{(i)}, x_l^{(j)})^{-1}}.
The feature transformation matrix of NWFE is composed of the eigenvectors of [0.5\, S_w^{NW} + 0.5\, diag(S_w^{NW})]^{-1} S_b^{NW}.
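A condensed sketch of the NWFE computation as reconstructed above; this is my own illustration under the stated formulas (function and variable names are assumptions, not the authors' code), and it simply excludes each point from its own local mean to avoid a zero distance.

```python
# NWFE: inverse-distance local means, lambda weights, Sb and Sw, and the
# eigenvectors of the regularized [0.5 Sw + 0.5 diag(Sw)]^{-1} Sb.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def local_means(Xa, Xb, same_class=False):
    """Inverse-distance weighted local mean M_b(x) for every row x of Xa."""
    D = cdist(Xa, Xb)
    if same_class:
        np.fill_diagonal(D, np.inf)              # exclude each point from its own mean
    W = 1.0 / (D + 1e-12)
    W /= W.sum(axis=1, keepdims=True)
    return W @ Xb

def weighted_scatter(Xa, M):
    """Scatter of Xa around the pointwise local means M, weighted by lambda."""
    d = np.linalg.norm(Xa - M, axis=1) + 1e-12
    lam = (1.0 / d) / (1.0 / d).sum()            # the lambda_k weights
    diffs = Xa - M
    return (diffs * lam[:, None]).T @ diffs / len(Xa)

def nwfe_transform(X, y, n_features):
    classes = np.unique(y)
    p = X.shape[1]
    Sb, Sw = np.zeros((p, p)), np.zeros((p, p))
    for i in classes:
        Xi = X[y == i]
        Pi = len(Xi) / len(X)
        Sw += Pi * weighted_scatter(Xi, local_means(Xi, Xi, same_class=True))
        for j in classes:
            if j != i:
                Sb += Pi * weighted_scatter(Xi, local_means(Xi, X[y == j]))
    Sw_reg = 0.5 * Sw + 0.5 * np.diag(np.diag(Sw))   # the regularized Sw
    evals, evecs = eigh(Sb, Sw_reg)                  # eigenvectors of Sw_reg^{-1} Sb
    return evecs[:, np.argsort(evals)[::-1][:n_features]]
```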
Double Nearest Proportion Structure
[Figure: for a reference point x_l^{(i)} of class i, a self-class nearest proportion with mean M_i^{(i)} and an other-class nearest proportion with mean M_j^{(i)} are formed; the distance between the two proportion means serves as the weight reference]
Robust Against the Overlap
[Figure: with overlapping classes, a reference point x_l^{(i)} near the boundary has close proportion means M_i(x_l^{(i)}) and M_j(x_l^{(i)}) and receives a larger weight, while a point x_t^{(i)} far from the boundary has distant proportion means and receives a smaller weight]
The Improvement of the Estimation of Sw (1)
Regularized Discriminant Analysis (RDA; Friedman, 1989), an extension of LDA, also proposed an improved version of the Sw used in LDA. A generalized version of that estimate is
\hat{\Sigma}^{*} = \lambda \hat{\Sigma} + (1 - \lambda)\, \hat{\sigma}^{2} I,  with \lambda between 0 and 1.
The question is how to choose \lambda. (Friedman suggested using cross-validation.)
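A minimal sketch of this regularized estimate and the cross-validation idea; the helper name, the use of the average variance for sigma-hat-squared, and the lambda grid are my own illustrative assumptions, not Friedman's exact procedure.

```python
# Regularized covariance: Sigma* = lam * Sigma_hat + (1 - lam) * sigma_hat^2 * I.
import numpy as np

def regularize(S, lam):
    p = S.shape[0]
    sigma2 = np.trace(S) / p                       # average variance as sigma_hat^2
    return lam * S + (1.0 - lam) * sigma2 * np.eye(p)

# usage idea: for lam in np.linspace(0.0, 1.0, 11), plug regularize(Sw, lam)
# into the feature extractor / classifier and keep the lam with the best
# cross-validated accuracy, as Friedman suggested.
```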
The Improvement of the Estimation of Sw (2)
In NWFE, different ways of obtaining the local means and weights of NDA were proposed. But the most influential effect on the performance improvement comes from its proposed estimate of Sw:
0.5\, S_w^{NW} + 0.5\, diag(S_w^{NW}).
Why 0.5?
The Shrinkage Estimation of Sw
Let Ψ denote the parameters of the unrestricted high-dimensional model, and Θ the matching parameters of a lower-dimensional restricted submodel. Also, let U be the estimate of Ψ and T the estimate of Θ. Then the shrinkage (regularized) estimate is
U* = λ T + (1 - λ) U,  where λ is between 0 and 1.
λ can be determined analytically by the Ledoit and Wolf lemma (2003): once the target T is specified, λ can be calculated.
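One readily available implementation of an analytically chosen shrinkage intensity is scikit-learn's LedoitWolf estimator; note that its target is the scaled identity, which is just one particular choice of T (the slides allow other targets), and the simulated data here are only for illustration.

```python
# Analytic Ledoit-Wolf shrinkage: well-conditioned covariance even when p >> n.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 200))            # n = 30 samples, p = 200 bands
lw = LedoitWolf().fit(X)
print(lw.shrinkage_)                          # the analytically determined lambda
print(np.linalg.matrix_rank(lw.covariance_))  # full rank even though p >> n
```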
Some Targets
J. Schafer and K. Strimmer (2005) proposed six targets for the shrinkage estimate of the Sw.
RDNP Feature Extraction
The feature transformation matrix of RDNP is composed of the eigenvectors of (S_w^{RDNP})^{-1} S_b^{RDNP}, where
S_b^{RDNP} = \sum_{i=1}^{L} P_i \frac{1}{N_i} \sum_{l=1}^{N_i} \sum_{j=1, j \neq i}^{L} \lambda_l^{(i,j)} \, (M_i(x_l^{(i)}) - M_j(x_l^{(i)}))(M_i(x_l^{(i)}) - M_j(x_l^{(i)}))^T,
S_w^{RDNP} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} P_l^{(i)} S_l^{*(i)},  with  S_l^{*(i)} = \lambda_l^{(i)} T_l^{(i)} + (1 - \lambda_l^{(i)}) S_l^{(i)},
\lambda_l^{(i,j)} = \frac{d(M_i(x_l^{(i)}), M_j(x_l^{(i)}))^{-1}}{\sum_{t=1}^{N_i} d(M_i(x_t^{(i)}), M_j(x_t^{(i)}))^{-1}},
\lambda_l^{(i)} = \frac{\sum_{g \neq h} \widehat{Var}(s_{gh}^{(l,i)})}{\sum_{g \neq h} (s_{gh}^{(l,i)})^{2}},
where M_i(x_l^{(i)}) and M_j(x_l^{(i)}) are the self-class and other-class nearest-proportion means of x_l^{(i)}, S_l^{(i)} is the covariance estimate of the self-class nearest proportion with entries s_{gh}^{(l,i)}, and T_l^{(i)} is the shrinkage target.
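A very rough two-class sketch of the double-nearest-proportion between-class scatter as reconstructed above; the proportion size `prop` and all function and variable names are my own assumptions, and this is not the authors' implementation.

```python
# Double-nearest-proportion between-class scatter: for each reference point,
# form the self-class and other-class nearest-proportion means and weight the
# difference by the inverse distance between the two means.
import numpy as np

def nearest_prop_mean(x, pool, prop=0.3):
    """Mean of the nearest `prop` fraction of `pool` to the reference point x."""
    k = max(1, int(np.ceil(prop * len(pool))))
    idx = np.argsort(np.linalg.norm(pool - x, axis=1))[:k]
    return pool[idx].mean(axis=0)

def rdnp_Sb_two_class(Xi, Xj, prop=0.3):
    p = Xi.shape[1]
    Sb = np.zeros((p, p))
    for own, other in ((Xi, Xj), (Xj, Xi)):
        Ms = np.array([nearest_prop_mean(x, own, prop) for x in own])    # self-class
        Mo = np.array([nearest_prop_mean(x, other, prop) for x in own])  # other-class
        d = np.linalg.norm(Ms - Mo, axis=1) + 1e-12
        lam = (1.0 / d) / (1.0 / d).sum()            # larger weight near the boundary
        diffs = Ms - Mo
        Sb += 0.5 * (diffs * lam[:, None]).T @ diffs / len(own)   # P_i = P_j = 0.5
    return Sb
```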
The Properties of RDNP (1)
RDNP is more likely to identify the boundary when overlap occurs.
Because a proportion mean is used for each group, the scatter between groups can be measured more properly, the entries of Sb will not be so close to zero, the influence of outliers is reduced, and the estimated boundary will not lie too close to the edge.
The Properties of RDNP (2)
When NPi = Ni and NPj = Nj, it can easily be shown that the features extracted by RDNP are exactly the same as those extracted by Fisher's LDA. That is, LDA is a special case of RDNP.
Washington DC Mall Image
Indian Pine Site Image
Experiment Result 1 (Washington DC Mall, Classifier: 1-NN, 6 Features)
# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5771   0.5825   0.8564   0.8851   0.9217
40                      0.8122   0.8160   0.8840   0.9231   0.9420
100                     0.8897   0.8979   0.9206   0.9347   0.9688
Experiment Result 2 (Washington DC Mall, Classifier: SVM, 6 Features)
# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5809   0.5990   0.8441   0.8933   0.9266
40                      0.8244   0.8067   0.8799   0.9243   0.9385
100                     0.8902   0.8922   0.9302   0.9330   0.9701
[Figure: a color IR image of a portion of the DC data set, and the classification maps of 1NN-NS (191 bands), RDA with 1-NN, NWFE with 1-NN, and RDNP with 1-NN]
Experiment Result 3 (Indian Pine Site, Classifier: 1-NN, 8 Features)
# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5512   0.5825   0.7662   0.8012   0.8377
40                      0.5729   0.6060   0.7911   0.8331   0.8503
100                     0.6345   0.6495   0.8180   0.8452   0.8910
Experiment Result 4 (Indian Pine Site, Classifier: SVM, 8 Features)
# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5512   0.5825   0.7662   0.8012   0.8377
40                      0.5729   0.6060   0.7911   0.8331   0.8503
100                     0.6345   0.6495   0.8180   0.8452   0.8910
Other Applications
Microarray data discrimination
Quality control
EEG signal classification
The End
Thank you for listening.