
Nonparametric Weighted Feature Extraction for Classification

Bor-Chen Kuo

Department of Mathematics Education

National Taichung Teachers College, Taichung, Taiwan 403

Tel: 886-4-22263181 ext 223

Email: [email protected]

David A. Landgrebe

School of Electrical and Computer Engineering

Purdue University, West Lafayette, Indiana 47907-1285

Tel: 765-494-3486

Email: [email protected]

Copyright © 2004 IEEE. Reprinted from IEEE Transactions on Geoscience and Remote Sensing, Volume 42, No. 5, pp. 1096-1105, May 2004.

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by sending a blank email message to [email protected].

By choosing to view this document, you agree to all provisions of the copyright laws protecting it.


Nonparametric Weighted Feature Extraction for Classification1

Bor-Chen Kuo, Member, IEEE and David A. Landgrebe, Life Fellow, IEEE

Abstract

In this paper, a new nonparametric feature extraction method is proposed for high

dimensional multiclass pattern recognition problems. It is based on a nonparametric extension of

scatter matrices. There are at least two advantages to using the proposed nonparametric scatter

matrices. First, they are generally of full rank. This provides the ability to specify the number of

extracted features desired and to reduce the effect of the singularity problem. This is in contrast

to parametric discriminant analysis, which can usually extract only L–1 (number of classes

minus one) features. In a real situation, this may not be enough. Second, the nonparametric

nature of scatter matrices reduces the effects of outliers and works well even for non-normal data

sets. The new method provides greater weight to samples near the expected decision boundary.

This tends to provide for increased classification accuracy.

Index Terms—Dimensionality reduction, discriminant analysis, nonparametric feature

extraction.

1. Introduction

Among the ways to approach high dimensional data classification, a useful processing

model that has evolved in the last several years [1,2] is shown schematically in Figure 1. Given

the availability of data (box 1), the process begins with the analyst specifying what classes are

desired, usually by labeling training samples for each class (box 2). New elements that have

Page 3: Nonparametric Weighted Feature Extraction for …landgreb/NWFE20031212.pdf · Nonparametric Weighted Feature Extraction for ... of outliers and works well even for non ... accuracy

- 3 - 6/18/04

proven important in the case of high dimensional data are those indicated by boxes in the

diagram marked 3 and 4. These are the focus of this work.

Figure 1. A schematic diagram for a hyperspectral data analysis procedure: (1) Hyperspectral Data; (2) Label Training Samples; (3) Determine Quantitative Class Descriptions; (4) Class Conditional Feature Extraction; (5) Feature Selection; (6) Classifier.

The reason for the importance of elements 3 and 4 in this context is as follows.

Classification techniques in pattern recognition typically assume that there are enough training

samples available to obtain reasonably accurate class descriptions in quantitative form.

Unfortunately, the number of training samples required to train a classifier for high-dimensional

data is much greater than that required for conventional data, and gathering these training

samples can be difficult and expensive. Therefore, the assumption that enough training samples

are available to accurately estimate the class quantitative description is frequently not satisfied

for high-dimensional data. Small training sets usually result in the Hughes phenomenon [3] and

singularity problems. There are several ways to overcome these problems. In [4], these

techniques are categorized into three groups:

1 The work described in this paper was sponsored in part by the National Imagery and Mapping Agency under grant NMA 201-01-C-0023.


A. Dimensionality reduction by feature extraction or feature selection.

B. Regularization of the class sample covariance matrices (e.g. [5], [6], [7], [8], [9]).

C. Structurization of a true covariance matrix described by a small number of parameters [4].

Group C is useful when the properties and structure of the true covariance are known; otherwise, methods in Groups A and B are suggested. Generally, methods in Group B, or Group B followed by Group A methods, are useful when class training sample sizes are small, especially when the total number of training samples is less than the dimensionality of the data.

When the total number of training samples is greater than the dimensionality, feature extraction

methods may be a better choice. This paper will focus on the situation in which general feature

extraction methods can be used and develop a new nonparametric feature extraction algorithm

that is suitable for data with both simple and complex distributions.

2. Background: Relevant Existing Feature Extraction Methods

2.1 Parametric Feature Extraction

Discriminant Analysis Feature Extraction (DAFE) is often used for dimension reduction in

classification problems. It is also called the parametric feature extraction method in [10], since

DAFE uses the mean vector and covariance matrix of each class. Usually within-class, between-

class, and mixture scatter matrices are used to formulate the criteria of class separability. A

within-class scatter matrix for L classes is expressed by [10]:

$$S_w^{DA} = \sum_{i=1}^{L} P_i \Sigma_i = \sum_{i=1}^{L} P_i S_{w_i}^{DA} \qquad (1)$$

where $P_i$ denotes the prior probability of class $i$, $m_i$ is the class mean, and $\Sigma_i$ is the class covariance matrix. A between-class scatter matrix is expressed as

$$S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1}\sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T \qquad (2)$$

where $m_0$ represents the expected vector of the mixture distribution and is given by


$$m_0 = \sum_{i=1}^{L} P_i m_i \qquad (3)$$

The optimal features are determined by optimizing the Fisher criterion given by

$$J_{DAFE} = \mathrm{tr}\!\left[\left(S_w^{DA}\right)^{-1} S_b^{DA}\right] \qquad (4)$$
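As an illustration, the following minimal sketch (in Python with NumPy/SciPy, an environment assumed here rather than used in the paper) forms Eqs. (1)-(3) and solves the generalized eigenvalue problem that maximizes (4); the function and variable names are illustrative only, and $S_w^{DA}$ is assumed nonsingular.

```python
import numpy as np
from scipy.linalg import eigh

def dafe_features(X, y, n_features):
    """Illustrative DAFE sketch: eigenvectors of (S_w^DA)^{-1} S_b^DA, Eqs. (1)-(4).

    X : (n_samples, n_bands) training data, y : class labels.
    S_w^DA is assumed nonsingular (enough samples or prior regularization).
    """
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    m0 = np.average(class_means, axis=0, weights=priors)          # Eq. (3)
    p = X.shape[1]
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for c, P, mc in zip(classes, priors, class_means):
        Sw += P * np.cov(X[y == c], rowvar=False)                 # Eq. (1)
        Sb += P * np.outer(mc - m0, mc - m0)                      # Eq. (2)
    # maximizing Eq. (4) is equivalent to the generalized eigenproblem S_b v = lambda S_w v
    eigval, eigvec = eigh(Sb, Sw)
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:n_features]]
```

Consistent with the discussion below, at most L–1 of these eigenvalues are nonzero, so only L–1 meaningful features can be obtained in this way.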

In [11], DAFE is shown to be equivalent to finding the ML estimators of a Gaussian model,

assuming that all class discrimination information resides in the transformed subspace and the

within-class covariances are equal for all classes. The advantage of DAFE is that it is distribution-free, but it has three major disadvantages. The first is that it works well only

if the distributions of classes are normal-like distributions [10]. When the distributions of classes

are nonnormal-like or multi-modal mixture distributions, the performance of DAFE is not

satisfactory. The second disadvantage is that the rank of the between-class scatter matrix is at most L–1 (the number of classes minus one), so, assuming sufficient observations and a within-class scatter matrix of rank v, only min(L–1, v) features can be extracted. From [10], Chapter 10, we know that

unless a posterior probability function is specified, L–1 features are suboptimal in a Bayes sense,

although they are optimal based on the chosen criterion. In real situations, the data distributions

are often complicated and not normal-like; therefore, using only L–1 features is not sufficient for

much real data. The third limitation is that if the within-class covariance is singular, which often

occurs in high dimensional problems, DAFE will perform poorly in classification.

Foley-Sammon feature extraction and its extension [13][14][15][19] can help to extract

more than L-1 orthogonal features from n-dimensional space based on the following:

$$r_i = \arg\max_{r} \frac{r^T S_b^{DA} r}{r^T S_w^{DA} r}, \quad i = 1, 2, \dots, n-1, \quad \text{subject to } r_i^T S_w^{DA} r_j = 0, \; j \neq i$$

This third limitation can be relieved by using regularized covariance estimators in the

estimating procedure of the within-class scatter matrix [16] or by adding Singular Value

Perturbation to the within-class scatter matrix to solve the generalized eigenvalue problem [17].


Approximated Pairwise Accuracy Criterion Linear Dimension Reduction (aPAC-LDR) [18] can be seen as DAFE with the contributions of individual class pairs weighted according to the Euclidean distance between the respective class means. The major difference between DAFE and aPAC-LDR is that the Fisher criterion is redefined as

$$J_{LDR} = \sum_{i=1}^{L-1}\sum_{j=i+1}^{L} P_i P_j\, \omega(\Delta_{ij})\, \mathrm{tr}\!\left[\left(S_w^{DA}\right)^{-1} S_{ij}^{LDR}\right], \qquad (5)$$

where
$$S_{ij}^{LDR} = (m_i - m_j)(m_i - m_j)^T, \qquad \omega(\Delta_{ij}) = \frac{1}{2\Delta_{ij}^{2}}\,\mathrm{erf}\!\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right),$$
and
$$\Delta_{ij} = (m_i - m_j)^T \left(S_w^{DA}\right)^{-1} (m_i - m_j) \qquad (6)$$

The above weighted Fisher criterion is the same as (4) if the between-class scatter matrix is redefined as

$$S_b^{LDR} = \sum_{i=1}^{L-1}\sum_{j=i+1}^{L} P_i P_j\, \omega(\Delta_{ij})\, (m_i - m_j)(m_i - m_j)^T \qquad (7)$$

Hence the optimization problem is the same as in DAFE.
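For concreteness, a small sketch of how the weighted between-class scatter of Eq. (7) could be assembled is given below (Python/NumPy with SciPy's erf; all names are illustrative, and $\Delta_{ij}$ is computed exactly as written in (6) above).

```python
import numpy as np
from scipy.special import erf

def apac_between_scatter(means, priors, Sw):
    """Illustrative sketch of the aPAC-LDR between-class scatter S_b^LDR, Eqs. (5)-(7).

    means : (L, p) class mean vectors, priors : (L,) prior probabilities,
    Sw    : (p, p) within-class scatter matrix S_w^DA.
    """
    L, p = means.shape
    Sw_inv = np.linalg.inv(Sw)
    Sb = np.zeros((p, p))
    for i in range(L - 1):
        for j in range(i + 1, L):
            dm = means[i] - means[j]
            delta = dm @ Sw_inv @ dm                              # Delta_ij, as in Eq. (6)
            w = erf(delta / (2 * np.sqrt(2))) / (2 * delta ** 2)  # omega(Delta_ij)
            Sb += priors[i] * priors[j] * w * np.outer(dm, dm)    # Eq. (7)
    return Sb
```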

One simulated and one real data experiment in [18] show that the advantages of this method are:

1. It can be designed to confine the influence of outlier classes on the final LDR transformation.

2. aPAC-LDR needs fewer features to reach the optimal accuracy of DAFE, but the best accuracy of aPAC-LDR is almost the same as that of DAFE.

aPAC-LDR is the same as DAFE in that it still uses the mean vectors and covariances to formulate the scatter matrices; hence it still suffers from the three major disadvantages of

DAFE.


2.2 Nonparametric Discriminant Analysis

Nonparametric Discriminant Analysis (NDA) [10][20] was proposed to solve the problems

of DAFE. In NDA, the between-class scatter matrix is redefined as a new nonparametric

between-class scatter matrix (for the two-class problem), denoted $S_b^{NDA}$, as

$$S_b^{NDA} = P_1\, E\!\left\{\left(X^{(1)} - M_2(X^{(1)})\right)\left(X^{(1)} - M_2(X^{(1)})\right)^T \,\middle|\, \omega_1\right\} + P_2\, E\!\left\{\left(X^{(2)} - M_1(X^{(2)})\right)\left(X^{(2)} - M_1(X^{(2)})\right)^T \,\middle|\, \omega_2\right\}$$

where $X^{(i)}$ denotes the random variable used to describe the distribution of class $i$, and $x_l^{(i)}$ denotes the $l$-th outcome of this random variable. $M_j(x_l^{(i)}) = \frac{1}{k}\sum_{t=1}^{k} x_{t\mathrm{NN}}^{(j)}$ is called the local kNN mean, where $x_{t\mathrm{NN}}^{(j)}$ is the $t$-th nearest neighbor (NN) from class $j$ ($\omega_j$) to the sample $x_l^{(i)}$. If $k = N_i$, the training sample size of class $i$, [10] shows that the features extracted by maximizing $\mathrm{tr}[(S_w^{NDA})^{-1} S_b^{NDA}]$ must be the same as the ones from $\mathrm{tr}[(S_w^{DA})^{-1} S_b^{DA}]$. Thus, the parametric feature extraction obtained by maximizing $\mathrm{tr}[(S_w^{DA})^{-1} S_b^{DA}]$ is a special case of feature extraction with the more general nonparametric criterion $\mathrm{tr}[(S_w^{NDA})^{-1} S_b^{NDA}]$, where $S_w^{NDA}$ is defined in (12).


Figure 2. The relationship between sample points and their local means (∗'s are neighbors of $x_l^{(i)}$, +'s are neighbors of $x_t^{(i)}$, and ⊗'s represent local means).

Further understanding of $S_b^{NDA}$ is obtained by examining the vector $x_l^{(i)} - M_j(x_l^{(i)})$. Figure 2 shows the importance of using boundary points and local means. Pointing to the local mean from the other class, each vector indicates the direction to the other class locally. If we select these vectors only from the samples located near the classification boundary (e.g. $x_l^{(i)} - M_j(x_l^{(i)})$), the scatter matrix of these vectors should specify the subspace in which the boundary region is embedded. Vectors of samples that are far away from the boundary (e.g. $x_t^{(i)} - M_j(x_t^{(i)})$) tend to have large magnitudes. These large magnitudes can exert a considerable influence on the scatter matrix and distort the information of the boundary structure.


Therefore, some method of de-emphasizing samples far from the boundary seems appropriate. To accomplish this, [10] uses a weighting function for each $x_l^{(i)} - M_j(x_l^{(i)})$. The value of the weighting function, denoted $w_l$, for $x_l^{(i)}$ is defined as

$$w_l = \frac{\min\!\left\{d^{\alpha}\!\left(x_l^{(i)}, x_{kNN}^{(i)}\right),\, d^{\alpha}\!\left(x_l^{(i)}, x_{kNN}^{(j)}\right)\right\}}{d^{\alpha}\!\left(x_l^{(i)}, x_{kNN}^{(i)}\right) + d^{\alpha}\!\left(x_l^{(i)}, x_{kNN}^{(j)}\right)}, \qquad (10)$$

where $\alpha$ is a control parameter between zero and infinity, and $d(x_l^{(i)}, x_{kNN}^{(j)})$ is the Euclidean distance from $x_l^{(i)}$ to its kNN point in class $j$.

Based on [18] and [26], the final discrete forms of the within- and between-class scatter matrices for the multiclass problem are expressed by

$$S_b^{NDA} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \neq i}}^{L} \sum_{l=1}^{N_i} \frac{w_l}{N_i} \left(x_l^{(i)} - M_j(x_l^{(i)})\right)\left(x_l^{(i)} - M_j(x_l^{(i)})\right)^T = \sum_{i=1}^{L} P_i S_{b_i}^{NDA} \qquad (11)$$

$$S_w^{NDA} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} \frac{w_l}{N_i} \left(x_l^{(i)} - M_i(x_l^{(i)})\right)\left(x_l^{(i)} - M_i(x_l^{(i)})\right)^T = \sum_{i=1}^{L} P_i S_{w_i}^{NDA} \qquad (12)$$
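The following sketch assembles the two-class nonparametric scatter matrices of Eqs. (10)-(12) with a brute-force kNN search (Python/NumPy; names are illustrative, and excluding a sample from its own-class neighbor search is an implementation choice not stated in the text).

```python
import numpy as np

def nda_scatter_matrices(X1, X2, k=2, alpha=1.0):
    """Illustrative two-class sketch of the NDA scatter matrices, Eqs. (10)-(12).

    X1, X2 : (N1, p) and (N2, p) training samples of the two classes.
    Priors are taken as N_i / (N1 + N2); brute-force kNN search.
    """
    def knn_mean_and_dist(x, Xc, k):
        # local kNN mean of x within Xc, and the distance to its k-th nearest neighbour
        d = np.linalg.norm(Xc - x, axis=1)
        idx = np.argsort(d)[:k]
        return Xc[idx].mean(axis=0), d[idx[-1]]

    p = X1.shape[1]
    Sb, Sw = np.zeros((p, p)), np.zeros((p, p))
    data = {1: X1, 2: X2}
    n_total = len(X1) + len(X2)

    for i, j in ((1, 2), (2, 1)):
        Xi, Xj = data[i], data[j]
        Ni = len(Xi)
        Pi = Ni / n_total
        for l, x in enumerate(Xi):
            own = np.delete(Xi, l, axis=0)   # a sample is not its own neighbour (implementation choice)
            Mi, d_ii = knn_mean_and_dist(x, own, k)
            Mj, d_ij = knn_mean_and_dist(x, Xj, k)
            w = min(d_ii ** alpha, d_ij ** alpha) / (d_ii ** alpha + d_ij ** alpha)  # Eq. (10)
            Sb += Pi * (w / Ni) * np.outer(x - Mj, x - Mj)                           # Eq. (11)
            Sw += Pi * (w / Ni) * np.outer(x - Mi, x - Mi)                           # Eq. (12)
    return Sb, Sw
```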

Although the nonparametric version of the within-class matrix was proposed in [10] and [20], the authors still suggested that the parametric $S_w^{DA}$ be used in NDA, i.e., that $J_{NDA} = \mathrm{tr}\!\left[\left(S_w^{DA}\right)^{-1} S_b^{NDA}\right]$ be optimized.

The disadvantages of NDA are:

1. The parameters k and α are usually decided by rules of thumb, so a good result usually comes only after several trials.

2. The within-class scatter matrix in NDA still has a parametric form. When the training set size is small, NDA will have the singularity problem.


2.3 Discriminant Analysis Using Malina’s Criterion

In [21] and [22], the criterion function of DAFE and NDA was modified based on Malina’s

criterion [23], [24], [25], and DAM is used to represent it here. For a two-class classification problem, the following criterion was proposed:

$$J_{DAM} = \frac{(1-\beta)\, r^T S_b\, r + \beta\, r^T S_w^{(i-j)} r}{r^T S_w\, r} \qquad (13)$$

where $r$ is the feature vector to be extracted; $S_b$, $S_w$, $S_{b_i}$, and $S_{w_i}$ can be the parametric (DAFE) or nonparametric (NDA) versions; $\beta$ denotes a user-supplied parameter; and $S_w^{(i-j)} = S_{w_i} - S_{w_j}$ or $S_{w_j} - S_{w_i}$.
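A direct transcription of the two-class criterion (13) into code is straightforward; the sketch below (Python/NumPy, illustrative names) simply evaluates $J_{DAM}$ for a candidate direction $r$.

```python
import numpy as np

def j_dam(r, Sb, Sw, Sw_i, Sw_j, beta):
    """Illustrative evaluation of the two-class DAM criterion of Eq. (13).

    r          : candidate feature (direction) vector.
    Sb, Sw     : between- and within-class scatter, parametric or nonparametric.
    Sw_i, Sw_j : per-class within-class scatter matrices; their difference is S_w^{(i-j)}.
    beta       : user-supplied parameter in [0, 1].
    """
    Sw_diff = Sw_i - Sw_j                       # or Sw_j - Sw_i, as noted above
    numerator = (1 - beta) * (r @ Sb @ r) + beta * (r @ Sw_diff @ r)
    return numerator / (r @ Sw @ r)
```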

In [26], this criterion was extended to a multiclass version. For data normalized by the

common covariance, the criteria are

Parametric:
$$J_{DAM\text{-}P}(r, \beta) = (1-\beta)\, r^T S_b\, r + \beta \sum_{i=1}^{L-1}\sum_{j=i+1}^{L} P_i P_j\, r^T S_w^{(i-j)} r \qquad (14)$$

Nonparametric:
$$J_{DAM\text{-}N}(r, \beta, \alpha, k) = (1-\beta)\, r^T S_b^{NDA} r + \beta \sum_{i=1}^{L-1}\sum_{j=i+1}^{L} P_i P_j\, r^T S_w^{NDA(i-j)} r \qquad (15)$$

where $S_w^{NDA(i-j)} = S_{w_i}^{NDA} - S_{w_j}^{NDA}$ or $S_{w_j}^{NDA} - S_{w_i}^{NDA}$, and $0 \le \beta \le 1$.

For convenience, DAM-P (parametric) and DAM-N (nonparametric) are used to represent them respectively in this paper. As with NDA, DAM-NP is used for the situation in which the between-class scatter matrix is in nonparametric form and the within-class scatter matrix is in parametric form. [21], [22], and [26] suggested that, for extracting the first feature, the Euclidean distance should be used in the weighting function, and for extracting the second feature, the projected distance


$$d(x_i, x_j) = \left\| r_1^T x_i - r_1^T x_j \right\| \qquad (16)$$

should be applied.

In [21], the results of simulated and real data experiments showed that for 2-class

classification problems, and using just one or two features, the performance of DAM-N is better

than those of NDA, DAFE, and DAM-P.

There are a few advantages of DAM. First, it is a generalized version of NDA, so it has the advantages of NDA. Second, it performs better when the difference between class variances is large. The disadvantage is that if the number of classes is L, then there are $2^{L(L-1)/2}$ different $S_w^{(i-j)}$, and for one $(r, \beta, \alpha, k)$ it is necessary to perform $2^{L(L-1)/2}$ eigenvalue decompositions [26]. For example, in a 7-class problem, finding the optimal eigenvector for one case $(r, \beta, \alpha, k)$ requires computing eigenvectors 2,097,152 times. To solve this problem, a binary tree multiclass mapping technique was proposed in [26]. The method is user-friendly for 2D mappings and interactive classifier design.

The main idea of the method [21], [22], and [26] is to compute discriminant vectors by

successive optimization of the discriminant criterion for specific values of the control parameters

searched for by a trial and error procedure. In [21], [22], and [26], the distance (16) was used for

extracting the second discriminant vector. In [27], another extraction method, called “removal of

classification structure” was proposed and experimental results show that this method is better

than the successive extraction methods proposed in [26]. In this study, this removal-based successive extraction method (REM) and the traditional simultaneous orthogonal extraction method (ORTH) are used for finding successive feature vectors.

3 Nonparametric Weighted Feature Extraction

In this section, a new feature extraction method called nonparametric weighted feature

extraction (NWFE) is proposed. From NDA, we know that the “local information” is important

and useful for improving DAFE. The main ideas of NWFE are to put different weights on every sample to compute "weighted means" and to define new nonparametric between-class and

within-class scatter matrices to obtain more than L–1 features. In NWFE, the nonparametric

between-class scatter matrix for L classes is defined as

$$S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \neq i}}^{L} \sum_{l=1}^{N_i} \frac{\lambda_l^{(i,j)}}{N_i} \left(x_l^{(i)} - M_j(x_l^{(i)})\right)\left(x_l^{(i)} - M_j(x_l^{(i)})\right)^T \qquad (17)$$

where $x_l^{(i)}$ refers to the $l$-th sample from class $i$, $N_i$ is the training sample size of class $i$, and $P_i$ denotes the prior probability of class $i$.

Basically, Equation (17) is similar to Equation (11). The differences are in the definitions of the weights and weighted means. The scatter matrix weight $\lambda_l^{(i,j)}$ is a function of $x_l^{(i)}$ and $M_j(x_l^{(i)})$, and is defined as:

$$\lambda_l^{(i,j)} = \frac{\mathrm{dist}\!\left(x_l^{(i)}, M_j(x_l^{(i)})\right)^{-1}}{\sum_{t=1}^{N_i} \mathrm{dist}\!\left(x_t^{(i)}, M_j(x_t^{(i)})\right)^{-1}}, \qquad (18)$$

where $\mathrm{dist}(a, b)$ denotes the Euclidean distance from $a$ to $b$.

If the distance between $x_l^{(i)}$ and $M_j(x_l^{(i)})$ is small, then its weight $\lambda_l^{(i,j)}$ will be close to 1; otherwise, $\lambda_l^{(i,j)}$ will be close to 0. The sum of the $\lambda_l^{(i,j)}$ over class $i$ is 1.

$M_j(x_l^{(i)})$ denotes the weighted mean of $x_l^{(i)}$ in class $j$ and is defined as:

$$M_j(x_l^{(i)}) = \sum_{k=1}^{N_j} w_{lk}^{(i,j)}\, x_k^{(j)}, \qquad (19)$$

where
$$w_{lk}^{(i,j)} = \frac{\mathrm{dist}\!\left(x_l^{(i)}, x_k^{(j)}\right)^{-1}}{\sum_{t=1}^{N_j} \mathrm{dist}\!\left(x_l^{(i)}, x_t^{(j)}\right)^{-1}}. \qquad (20)$$


The weight $w_{lk}^{(i,j)}$ for computing the weighted means is a function of $x_l^{(i)}$ and $x_k^{(j)}$. If the distance between $x_l^{(i)}$ and $x_k^{(j)}$ is small, then its weight $w_{lk}^{(i,j)}$ will be close to 1; otherwise, $w_{lk}^{(i,j)}$ will be close to 0. The sum of the $w_{lk}^{(i,j)}$ for $M_j(x_l^{(i)})$ is 1.

The nonparametric within-class scatter matrix is defined as

$$S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} \frac{\lambda_l^{(i,i)}}{N_i} \left(x_l^{(i)} - M_i(x_l^{(i)})\right)\left(x_l^{(i)} - M_i(x_l^{(i)})\right)^T \qquad (21)$$

In NDA, nearest neighbors are used to estimate the local mean, and a weighting method is used to emphasize the importance of boundary points and the related between-class vectors. Using only the kNN points to estimate the local mean may lose some information, and not all kNN points carry the same information about the class boundary. Based on this observation, NWFE proposes the "weighted mean" (Eq. (19)) and uses weighted between- and within-class vectors to improve NDA.

The $f$ extracted features are the $f$ eigenvectors with the largest eigenvalues of the matrix $\left(S_w^{NW}\right)^{-1} S_b^{NW}$.

To reduce the effect of the cross products of within-class distances and to prevent singularity, regularization techniques [5], [29] can be applied to the within-class scatter matrix. In this study, the within-class scatter matrix is regularized by
$$S_w^{NW} = 0.5\, S_w^{NW} + 0.5\, \mathrm{diag}\!\left(S_w^{NW}\right),$$
where $\mathrm{diag}(A)$ denotes the diagonal part of matrix $A$.

Finally, the NWFE algorithm is (a code sketch follows the steps below):

1. Compute the distances between each pair of sample points and form the distance matrix.
2. Compute $w_{lk}^{(i,j)}$ using the distance matrix.
3. Use $w_{lk}^{(i,j)}$ to compute the weighted means $M_j(x_l^{(i)})$.
4. Compute the scatter matrix weight $\lambda_l^{(i,j)}$.
5. Compute $S_b^{NW}$ and the regularized $S_w^{NW}$.
6. Extract features by using the ORTH or REM methods.
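A compact sketch of these six steps is given below (Python/NumPy, ORTH-style simultaneous extraction; the function name and the handling of a sample's zero distance to itself are implementation choices, not specified above).

```python
import numpy as np

def nwfe_features(X, y, n_features):
    """Illustrative sketch of NWFE steps 1-6 (ORTH-style simultaneous extraction).

    X : (n_samples, n_bands) training data, y : class labels.
    Excluding a sample from its own weighted mean is an implementation choice.
    """
    classes = np.unique(y)
    p = X.shape[1]
    Sb, Sw = np.zeros((p, p)), np.zeros((p, p))

    for i in classes:
        Xi = X[y == i]
        Ni = len(Xi)
        Pi = Ni / len(X)
        for j in classes:
            Xj = X[y == j]
            # step 1: pairwise distances between class-i and class-j samples
            d = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)
            if i == j:
                np.fill_diagonal(d, np.inf)          # exclude a sample from its own weighted mean
            # step 2: weights w_lk^(i,j) of Eq. (20); each row sums to 1
            w = 1.0 / np.maximum(d, 1e-12)
            w /= w.sum(axis=1, keepdims=True)
            # step 3: weighted means M_j(x_l^(i)) of Eq. (19)
            M = w @ Xj
            diff = Xi - M
            # step 4: scatter-matrix weights lambda_l^(i,j) of Eq. (18); they sum to 1 over class i
            lam = 1.0 / np.maximum(np.linalg.norm(diff, axis=1), 1e-12)
            lam /= lam.sum()
            # step 5: accumulate S_b^NW (Eq. 17) and S_w^NW (Eq. 21)
            S = Pi * (diff * (lam / Ni)[:, None]).T @ diff
            if i == j:
                Sw += S
            else:
                Sb += S

    Sw = 0.5 * Sw + 0.5 * np.diag(np.diag(Sw))       # regularized within-class scatter
    # step 6: eigenvectors of (S_w^NW)^{-1} S_b^NW with the largest eigenvalues (ORTH)
    eigval, eigvec = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigval.real)[::-1]
    return eigvec[:, order[:n_features]].real
```

In practice, the returned eigenvectors would be used to project both training and test pixels before applying a Gaussian, kNN, or Parzen classifier, as in the experiments that follow.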

4 Experiment Design

In this paper, only real data experiment results are displayed; related simulated data

experiment results can be found in [16].

The design of Experiment 1 is to compare the multiclass classification performances of

using DAFE, aPAC-LDR, NDA and NWFE (with ORTH method) features applied to Gaussian,

2NN, and Parzen classifiers. The design of Experiment 2 is to compare the 2-class classification

performances of using DAFE, NDA, DAM-P, DAM-NP, DAM-N and NWFE features applied to

Gaussian, 2NN, and Parzen classifiers. In Experiment 1, only the simultaneous orthogonal

feature extraction method (ORTH, [27]) is used. In Experiment 2, ORTH and the successive

extracting method (REM, [27]) are used. Euclidean distance and 2NN are used in NDA, DAM-P,

DAM-NP, DAM-N, and the kNN classifier. The grid method is used for successively finding the optimal β1 (first feature) and β2 (second feature) in the DAM cases. All classifiers are from

[28].

There are four different real data sets, Cuprite: a site of geologic interest in western

Nevada, Jasper Ridge: a site of ecological interest in California, Indian Pine: a mixed

forest/agricultural site in Indiana, and the Washington, DC Mall as an urban site, in the

experiments. The first three of these data sets were gathered by a sensor known as AVIRIS,

mounted in an aircraft flown at 65,000 ft. altitude and operated by the NASA/Jet Propulsion Lab.

It produces pixels in 220 spectral bands measuring approximately 20 m across on the ground.

The fourth data set was gathered with a sensor system flown in a lower-altitude aircraft, again producing data in 220 bands but at a spatial resolution of approximately 5 m. Some water absorption

channels are discarded, so only 191 bands are used in the experiments. There are 8, 6, 6, and 7

classes in Cuprite, Jasper Ridge, Indian Pine, and DC Mall data sets respectively. There are 40

training samples, which are different from testing samples, in each class of Cuprite, Jasper

Ridge, Indian Pine, and DC Mall experiments. The data sets in Experiment 2 only contain the


first two classes of those four real data sets. In each experiment, 10 training and testing sample sets are selected randomly to establish the classification process and evaluate its performance.

5 Experiment Results

5.1 Results of Experiment 1

The results of experiment 1 are displayed in Figure 3(a) to 3(c) (Cuprite), 4(a) to 4(c)

(Jasper Ridge), 5(a) to 5(c) (Indian Pine) and 6(a) to 6(c) (DC Mall), respectively. In each case,

L indicates the number of classes, Ni the number of training samples in each class, and p the

number of features (spectral bands) used. These figures show that

1. For these three classifiers, NWFE performs better than the other methods.

2. The Gaussian and 2NN classifiers perform better than the Parzen classifier.

3. Figure 5(a) shows that if only 5 (i.e., L–1) features are used, then the accuracy (expressed as a percentage of the test samples correctly classified) of DAFE is 57% and that of NWFE is 86%. But if 7 NWFE features are used, then the accuracy increases to 91%. This shows that using only L–1 features is not enough in this real situation. DAFE cannot go beyond L–1 features due to the restriction on the rank of the between-class scatter matrix; NWFE does not have this restriction.

4. Comparing Figure 7(b) and 7(c), one sees that the performance of NWFE is better than

that of DAFE in almost all classes.


Figure 3(a) Mean of accuracies using 1~15 features (Cuprite, L=8, Ni=40, p=191, Gaussian Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 3(b) Mean of accuracies using 1~15 features (Cuprite, L=8, Ni=40, p=191, 2NN Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 3(c) Mean of accuracies using 1~15 features (Cuprite, L=8, Ni=40, p=191, Parzen Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 4(a) Mean of accuracies using 1~15 features (Jasper Ridge, L=6, Ni=40, p=191, Gaussian Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 4(b) Mean of accuracies using 1~15 features (Jasper Ridge, L=6, Ni=40, p=191, 2NN Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 4(c) Mean of accuracies using 1~15 features (Jasper Ridge, L=6, Ni=40, p=191, Parzen Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 5(a) Mean of accuracies using 1~15 features (Indian Pine, L=6, Ni=40, p=191, Gaussian Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 5(b) Mean of accuracies using 1~15 features (Indian Pine, L=6, Ni=40, p=191, 2NN Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 5(c) Mean of accuracies using 1~15 features (Indian Pine, L=6, Ni=40, p=191, Parzen Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 6(a) Mean of accuracies using 1~15 features (DC Mall, L=7, Ni=40, p=191, Gaussian Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 6(b) Mean of accuracies using 1~15 features (DC Mall, L=7, Ni=40, p=191, 2NN Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).

Figure 6(c) Mean of accuracies using 1~15 features (DC Mall, L=7, Ni=40, p=191, Parzen Classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).


Figure 7(a) A simulated color IR image of a portion of the DC data set.

Figure 7(b). The thematic map resulting from the classification of the area of Figure 7(a) using

DAFE features and Gaussian Classifier.


Figure 7(c). The thematic map resulting from the classification of the area of Figure 7(a) using

NWFE features and Gaussian Classifier.

5.2 Results of Experiment 2

The results of Experiment 2 are displayed in Tables 1, 2, and 3. The shaded entries indicate cases where the performance of REM is better than that of ORTH. The tables show that:

1. REM is useful when the DAM criteria are applied and the data distribution is non-normal (e.g., Indian Pine). For NWFE and NDA, ORTH is still the better choice.

2. For NWFE/ORTH, the three classifiers have similar best results. For NDA, the 2NN classifier gives the better performance. For the DAM criteria, the Gaussian classifier gives the better performance.

3. Overall, using the NWFE criterion and the ORTH extraction method is a good and

robust choice.


Table 1 Performances of the Gaussian classifier using different criteria and extraction methods (accuracy using 1 and 2 extracted features for each data set)

| Criterion | Method | Cuprite 1 | Cuprite 2 | Jasper Ridge 1 | Jasper Ridge 2 | Indian Pine 1 | Indian Pine 2 | DC Mall 1 | DC Mall 2 |
|---|---|---|---|---|---|---|---|---|---|
| DAFE | ORTH | 0.9307 | N/A | 0.9994 | N/A | 0.7992 | N/A | 0.8044 | N/A |
| NWFE | ORTH | 0.9985 | 0.9982 | 0.9992 | 0.9993 | 0.9422 | 0.9391 | 0.9645 | 0.9634 |
| NWFE | REM | 0.9985 | 0.9981 | 0.9992 | 0.9993 | 0.9422 | 0.9401 | 0.9645 | 0.9634 |
| NDA | ORTH | 0.6476 | 0.9306 | 0.8299 | 0.9991 | 0.5878 | 0.7774 | 0.6200 | 0.7465 |
| NDA | REM | 0.6476 | 0.6562 | 0.8299 | 0.9192 | 0.5878 | 0.636 | 0.6200 | 0.6281 |
| DAM-P | ORTH | 0.9583 | 0.9583 | 0.8415 | 0.8415 | 0.5393 | 0.5393 | 0.8559 | 0.8559 |
| DAM-P | REM | 0.9583 | 0.8473 | 0.8415 | 0.9999 | 0.5393 | 0.9292 | 0.8559 | 0.8234 |
| DAM-NP | ORTH | 0.8192 | 0.7010 | 0.9992 | 0.9994 | 0.8707 | 0.8685 | 0.6043 | 0.6924 |
| DAM-NP | REM | 0.8192 | 0.7964 | 0.9992 | 1 | 0.8707 | 0.9242 | 0.6043 | 0.6859 |
| DAM-N | ORTH | 0.8192 | 0.7070 | 0.9992 | 0.9994 | 0.8707 | 0.9042 | 0.6048 | 0.7078 |
| DAM-N | REM | 0.8192 | 0.8111 | 0.9992 | 1 | 0.8707 | 0.9264 | 0.6048 | 0.6509 |

Table 2 Performances of the 2NN classifier using different criteria and extraction methods (accuracy using 1 and 2 extracted features for each data set)

| Criterion | Method | Cuprite 1 | Cuprite 2 | Jasper Ridge 1 | Jasper Ridge 2 | Indian Pine 1 | Indian Pine 2 | DC Mall 1 | DC Mall 2 |
|---|---|---|---|---|---|---|---|---|---|
| DAFE | ORTH | 0.9348 | N/A | 0.9994 | N/A | 0.7985 | N/A | 0.7964 | N/A |
| NWFE | ORTH | 0.9992 | 0.9707 | 0.9992 | 0.9993 | 0.9265 | 0.9407 | 0.9598 | 0.9594 |
| NWFE | REM | 0.9992 | 0.9466 | 0.9992 | 0.9990 | 0.9265 | 0.9352 | 0.9598 | 0.9571 |
| NDA | ORTH | 0.6515 | 0.9453 | 0.8173 | 0.9993 | 0.5899 | 0.7885 | 0.6807 | 0.7743 |
| NDA | REM | 0.6515 | 0.6306 | 0.8173 | 0.9441 | 0.5899 | 0.9027 | 0.6807 | 0.6838 |
| DAM-P | ORTH | 0.6278 | 0.9583 | 0.7033 | 0.9553 | 0.8671 | 0.5393 | 0.8700 | 0.8561 |
| DAM-P | REM | 0.6278 | 0.7775 | 0.7033 | 0.9997 | 0.8671 | 0.9202 | 0.8700 | 0.8126 |
| DAM-NP | ORTH | 0.6732 | 0.7010 | 0.9992 | 0.9971 | 0.8696 | 0.8764 | 0.5535 | 0.6924 |
| DAM-NP | REM | 0.6732 | 0.7707 | 0.9992 | 1 | 0.8696 | 0.9027 | 0.5535 | 0.5792 |
| DAM-N | ORTH | 0.6732 | 0.7070 | 0.9992 | 0.9992 | 0.8696 | 0.8801 | 0.5562 | 0.7078 |
| DAM-N | REM | 0.6732 | 0.7707 | 0.9992 | 1 | 0.8696 | 0.9053 | 0.5562 | 0.5792 |


Table 3 Performances of the Parzen classifier using different criteria and extraction methods (accuracy using 1 and 2 extracted features for each data set)

| Criterion | Method | Cuprite 1 | Cuprite 2 | Jasper Ridge 1 | Jasper Ridge 2 | Indian Pine 1 | Indian Pine 2 | DC Mall 1 | DC Mall 2 |
|---|---|---|---|---|---|---|---|---|---|
| DAFE | ORTH | 0.9695 | N/A | 0.9182 | N/A | 0.6364 | N/A | 0.8657 | N/A |
| NWFE | ORTH | 0.9964 | 0.9718 | 0.9949 | 0.9993 | 0.9374 | 0.9455 | 0.9625 | 0.9612 |
| NWFE | REM | 0.9964 | 0.9181 | 0.9949 | 0.9990 | 0.9374 | 0.9383 | 0.9625 | 0.9612 |
| NDA | ORTH | 0.7127 | 0.9474 | 0.7995 | 0.9992 | 0.5685 | 0.7358 | 0.7526 | 0.8542 |
| NDA | REM | 0.7127 | 0.5943 | 0.7995 | 0.9492 | 0.5685 | 0.7436 | 0.7526 | 0.5979 |
| DAM-P | ORTH | 0.8234 | 0.9583 | 0.9956 | 0.8415 | 0.8347 | 0.5393 | 0.7823 | 0.8557 |
| DAM-P | REM | 0.8234 | 0.8267 | 0.9956 | 1 | 0.8347 | 0.9179 | 0.7823 | 0.8541 |
| DAM-NP | ORTH | 0.8053 | 0.7010 | 0.9894 | 0.9971 | 0.8836 | 0.8764 | 0.5851 | 0.6924 |
| DAM-NP | REM | 0.8053 | 0.8413 | 0.9894 | 1 | 0.8836 | 0.9246 | 0.5851 | 0.7592 |
| DAM-N | ORTH | 0.8053 | 0.7070 | 0.9992 | 0.9992 | 0.8836 | 0.9067 | 0.5994 | 0.7078 |
| DAM-N | REM | 0.8053 | 0.8296 | 0.9992 | 1 | 0.8836 | 0.9348 | 0.5994 | 0.6983 |

7 Concluding Comments

The volume available in high dimensional feature spaces is very large, making possible the

discrimination between classes with only very subtle differences. On the other hand, this large

volume makes increasingly challenging the problem of defining the desired classes adequately and precisely in terms of the feature space variables. The problem of class statistics estimation error resulting from training sets of finite size grows rapidly with dimensionality, making it desirable to use no larger feature space dimensionality than necessary for the problem at hand; hence the importance of an effective, case-specific feature extraction procedure.

The NWFE algorithm presented here is intended to take advantage of the desirable

characteristics of DAFE and NDA, while avoiding their shortcomings. DAFE is fast and easy to

apply, but its limitation to L–1 features, its reduced performance particularly when the difference in the class mean values is small, and the fact that it is based on the statistical description of the entire training set, which makes it sensitive to outliers, limit its performance in many cases. NDA does not have these limitations; it focuses attention on training samples near the required decision boundary. However, NDA does not perform well on data with unequal covariances or complex distributions.

NWFE does not have any of these limitations. It appears to have improved performance in

a broad set of circumstances, making possible substantially better classification accuracy in the

data sets tested, which included sets of agricultural, geological, ecological and urban

significance. This improved performance is perhaps due to the fact that, like NDA, attention is

focused upon training samples that are near the eventual decision boundary, rather than weighting all training pixels equally as DAFE does. It also appears to provide feature sets

which are relatively insensitive to the precise choice of feature set size, since the accuracy versus

dimensionality curves are relatively flat beyond the initial knee of the curve. This characteristic

would appear to be significant for the circumstance when this technology begins to be used by

general remote sensing practitioners who are not otherwise highly versed in signal processing

principles and thus might not realize how to choose the right dimensionality to use.

The weighted between- and within-class scatter matrices and the regularization are the most important parts of NWFE. Applying only one of them does not give a satisfactory result.

An implementation of NWFE is available for testing in MultiSpec. MultiSpec is a personal

computer multispectral data analysis software package that may be downloaded free from

http://dynamo.ecn.purdue.edu/~biehl/MultiSpec/.


References

[1] D. A. Landgrebe, "Information Extraction Principles and Methods for Multispectral and

Hyperspectral Image Data," Chapter 1 of Information Processing for Remote Sensing,

edited by C. H. Chen, published by the World Scientific Publishing Co., Inc., 1060 Main

Street, River Edge, NJ 07661, USA 1999.

[2] David Landgrebe, Signal Theory Methods In Multispectral Remote Sensing, 508 pages

plus a CD containing exercises and data. John Wiley & Sons, January 2003, ISBN 0-471-

42028-X.

[3] G. F. Hughes, “ On the mean accuracy of statistical pattern recognition”, IEEE Trans.

Information Theory, 1968, vol. IT-14, no. 1, pp. 55-63.

[4] S. Raudys and A. Saudargiene, “Structures of the Covariance Matrices in Classifier Design”,

Advances in Pattern Recognition, A. Amin, D. Dori, P. Pudil, and H. Freeman, ed., Berlin

Heidelberg: Springer-Verlag, 1998, pp. 583-592.

[5] J.H. Friedman, “Regularized Discriminant Analysis,” Journal of the American Statistical

Association, vol. 84, 1989, pp. 165-175.

[6] W. Rayens and T. Greene, “ Covariance pooling and stabilization for classification.”

Computational Statistics and Data Analysis, vol. 11, 1991, pp. 17-42.

[7] J. P. Hoffbeck and D.A. Landgrebe, “ Covariance matrix estimation and classification with

limited training data” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.

18, No. 7, 1996, pp. 763-767.

[8] S. Tadjudin and D.A. Landgrebe, Classification of High Dimensional Data with Limited

Training Samples, PhD thesis Purdue University, West Lafayette, IN., ECE Technical Report

TR-EE 98-8, April, 1998, pp. 35-82.


[9] W.J. Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas, “Discriminant analysis

with singular covariance matrices: methods and applications to spectroscopic data.”

Applied Statistics, vol. 44, 1995, pp. 101-115.

[10] K. Fukunaga, Introduction to Statistical Pattern Recognition, San Diego: Academic Press

Inc., 1990.

[11] C. B. Moler and G.W. Stewart, "An Algorithm for Generalized Matrix Eigenvalue

Problems", SIAM J. Numer. Anal., vol. 10, no. 2, April 1973.

[12] A. Campbell, “Canonical Variate Analysis—A General Model Formulation,” Australian J.

Statistics, vol. 26, pp.86-96, 1984.

[13] D.H. Foley and J.W. Sammon, "An optimal set of discriminant vectors", IEEE Trans.

Comput., vol.C-24, pp.281-289, 1975.

[14] T. Okada and S. Tomita, "An optimal orthonormal system for discriminant analysis",

Pattern Recognition, 18, pp.139-144, 1985.

[15] J. Duchene and S. Leclercq, “An optimal transformation for discriminant analysis and

principal component analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence,

vol. 10, no. 6, 1988, pp. 978-983.

[16] B-C. Kuo and D. A. Landgrebe, Improved Statistics Estimation And Feature Extraction For

Hyperspectral Data Classification, PhD Thesis and School of Electrical & Computer

Engineering Technical Report. TR-ECE 01-6, December 2001. (Available for download

from http://dynamo.ecn.purdue.edu/~landgreb/publications.html)

[17] Z-Q Hong and J-Y Yang, "Optimal discriminant plane for a small number of samples",

Pattern Recognition, vol. 24, no. 4, 1991, pp. 317-324.


[18] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass Linear Dimension Reduction by

Weighted Pairwise Fisher Criteria,” IEEE Trans. Pattern Analysis and Machine

Intelligence, vol. 23, 2001, pp. 762-766.

[19] Y. Hamamoto, Y. Matsuura, T. Kanaoka, and S. Tomita, ”A Note on the Orthonormal

Discriminant Vector Method for Feature Extraction," Pattern Recognition, vol. 24, pp. 681-

684, 1991.

[20] K. Fukunaga and M. Mantock, Nonparametric Discriminant Analysis, IEEE Trans. Pattern

Analysis and Machine Intelligence, vol. 5, 1983, pp. 671-678.

[21] M. Aladjem, "Parametric and nonparametric linear mappings of multidimensional data",

Pattern Recognition, vol.24, no 6, 1991, pp. 543-553.

[22] M. Aladjem, "PNM: A program for parametric and nonparametric mapping of

multidimensional data", Computers in Biology and Medicine, vol.21, 1991, pp. 321-343.

[23] W. Malina, "Some extended Fisher criterion for feature selection," IEEE Transactions on

Pattern Analysis and Machine Intelligence, vol. 3, 1981, pp. 611-614.

[24] W. Malina, "Some multiclass Fisher feature selection algorithms and their comparison with

Karhunen-Loeve algorithms," Pattern Recognition Letters, vol. 6, 1987, pp. 279-285.

[25] W. Malina, "Two-parameter Fisher criterion," IEEE Transactions on Systems, Man, and

Cybernetics – Part B: Cybernetics, vol. 31, no. 4, 2001, pp. 629-636.

[26] M. Aladjem, " Multiclass discriminant mappings", Signal Processing, vol. 35, 1994, pp.1-

18.

[27] M. Aladjem, "Linear Discriminant analysis for two classes via removal of classification

structure", IEEE Transaction on Pattern Analysis and Machine Intelligence, vol.19, no 2,

1997, pp.187-191.


[28] R. P. W. Duin, "PRTools, a Matlab Toolbox for Pattern Recognition", August 2002.

(Available for download from http://www.ph.tn.tudelft.nl/prtools/)

[29] B-C. Kuo, D. A. Landgrebe, L-W. Ko, and C-H. Pai. “Regularized Feature Extractions for

Hyperspectral Data Classification”. International Geoscience and Remote Sensing

Symposium, Toulouse, France, July 21-25, 2003.


Bor-Chen Kuo

Bor-Chen Kuo (S'01-M'02) received the B.S. and M.S. degrees from National Taichung Teachers College, Taiwan, R.O.C., in 1993 and 1996, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, in 2001. He is currently an associate professor in the Department of Mathematics Education and the Graduate Institute of Educational Measurement and Statistics at National Taichung Teachers College, Taiwan, R.O.C. His research interests are pattern recognition, remote sensing, image processing, and nonparametric functional estimation.


David A. Landgrebe

Dr. Landgrebe holds the BSEE, MSEE, and PhD degrees from Purdue University. He is Professor (Emeritus) of Electrical and Computer Engineering at Purdue University. His area of specialty in research is communication science and signal processing, especially as applied to Earth observational remote sensing.

He was President of the IEEE Geoscience and Remote Sensing Society for 1986 and 1987 and a member of its Administrative Committee from 1979 to 1990. He received that Society's Outstanding Service Award in 1988. He is a co-author of the textbook Remote Sensing: The Quantitative Approach (1978), and a contributor to the ASP Manual of Remote Sensing (1st edition, 1974) and the books Remote Sensing of Environment (1976) and Information Processing for Remote Sensing (1999). He is the author of the textbook Signal Theory Methods in Multispectral Remote Sensing (2003). He has been a member of the editorial board of the journal Remote Sensing of Environment since its inception in 1970.

Dr. Landgrebe is a Life Fellow of the Institute of Electrical and Electronics Engineers, a Fellow of the American Society of Photogrammetry and Remote Sensing, a Fellow of the American Association for the Advancement of Science, a member of the Society of Photo-Optical Instrumentation Engineers and the American Society for Engineering Education, as well as the Eta Kappa Nu, Tau Beta Pi, and Sigma Xi honor societies. He received the NASA Exceptional Scientific Achievement Medal in 1973 for his work in the field of machine analysis methods for remotely sensed Earth observational data. In 1976, on behalf of Purdue's Laboratory for Applications of Remote Sensing, which he directed, he accepted the William T. Pecora Award, presented by NASA and the U.S. Department of the Interior. He was the 1990 individual recipient of the William T. Pecora Award for contributions to the field of remote sensing. He was the 1992 recipient of the IEEE Geoscience and Remote Sensing Society's Distinguished Achievement Award and the 2003 recipient of the IEEE Geoscience and Remote Sensing Society's Education Award.