SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Technology

By

SOUMYADIP CHANDRA
(Y8103044)

DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

July 2010

ABSTRACT

Hyperspectral data (HD) can provide far more spectral information than multispectral data. However, it suffers from problems such as the curse of dimensionality and data redundancy, and the data sets are very large. Consequently, it is difficult to process these data sets and obtain satisfactory classification results.

The objectives of this thesis are to find the best feature extraction (FE) techniques and to improve the accuracy and time of classification of HD using parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest neighbors (KNN)) and support vector machine (SVM) algorithms. To achieve these objectives, experiments were performed with different FE techniques: segmented principal component analysis (SPCA), kernel principal component analysis (KPCA), orthogonal subspace projection (OSP) and projection pursuit (PP). The DAIS 7915 hyperspectral sensor data set was used for the investigations in this thesis.

In the experiments performed with the parametric and non-parametric classifiers, the GML classifier gave the best results, with an overall kappa value (k-value) of 95.89%. This was achieved using 300 training pixels (TP) per class and 45 bands of the SPCA feature extracted data set.

The SVM algorithm with the quadratic programming (QP) optimizer gave the best results amongst all optimizers and approaches. An overall k-value of 96.91% was achieved using 300 TP per class and 20 bands of the SPCA feature extracted data set. However, the supervised FE techniques such as KPCA and OSP failed to improve the results obtained by SVM significantly.

The best results obtained for GML, KNN and SVM were compared using one-tailed hypothesis testing. The SVM classifier performed significantly better than the GML classifier for a statistically large set of TP (300). For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM on the SPCA extracted data set was not statistically better than that of the GML classifier.


ACKNOWLEDGEMENTS

I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit, for his involvement, motivation and encouragement throughout and beyond the thesis work. His expert direction has inculcated in me qualities which I will treasure throughout my life. His patient hearing, critical comments and approach to the research problem made me do better every time. His valuable suggestions at all stages of the thesis work helped me to overcome various shortcomings of my work. I also express my sincere thanks for his effort in going through the manuscript carefully and making it more readable. It has been a great learning and life changing experience working with him.

I would like to express my sincere tribute to Dr. Bharat Lohani for his

friendly nature, excellent guidance and teaching during my stay at IITK.

I would especially like to thank Sumanta Pasari for his valuable comments on and corrections to the manuscript of my thesis.

I would like to thank all of my friends, especially Shalabh, Pankaj, Amar, Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all the other GI people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous, pleasant and memorable one.

In closing, I express my cordial homage to my parents and my best friend for their unwavering support and encouragement to complete my study at IITK.

SOUMYADIP CHANDRA

July 2010

CONTENTS

CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER 1 - Introduction
  1.1 High dimensional space
    1.1.1 What is hyperspectral data?
    1.1.2 Characteristics of high dimensional space
    1.1.3 Hyperspectral imaging
  1.2 What is classification?
    1.2.1 Difficulties in hyperspectral data classification
  1.3 Background of work
  1.4 Objectives
  1.5 Study area and data set used
  1.6 Software details
  1.7 Structure of thesis

CHAPTER 2 - Literature Review
  2.1 Dimensionality reduction by feature extraction
    2.1.1 Segmented principal component analysis (SPCA)
    2.1.2 Projection pursuit (PP)
    2.1.3 Orthogonal subspace projection (OSP)
    2.1.4 Kernel principal component analysis (KPCA)
  2.2 Parametric classifiers
    2.2.1 Gaussian maximum likelihood (GML)
  2.3 Non-parametric classifiers
    2.3.1 KNN
    2.3.2 SVM
  2.4 Conclusions from literature review

CHAPTER 3 - Mathematical Background
  3.1 What is kernel?
  3.2 Feature extraction techniques
    3.2.1 Segmented principal component analysis (SPCA)
    3.2.2 Projection pursuit (PP)
    3.2.3 Kernel principal component analysis (KPCA)
    3.2.4 Orthogonal subspace projection (OSP)
  3.3 Supervised classifier
    3.3.1 Bayesian decision rule
    3.3.2 Gaussian maximum likelihood classification (GML)
    3.3.3 k-nearest neighbor classification
    3.3.4 Support vector machine (SVM)
  3.4 Analysis of classification results
    3.4.1 One tailed hypothesis testing

CHAPTER 4 - Experimental Design
  4.1 Feature extraction technique
    4.1.1 SPCA
    4.1.2 PP
    4.1.3 KPCA
    4.1.4 OSP
  4.2 Experimental design
  4.3 First set of experiment (SET-I) using parametric and non-parametric classifier
  4.4 Second set of experiment (SET-II) using advance classifier
  4.5 Parameters

CHAPTER 5 - Results
  5.1 Visual inspection of feature extraction techniques
  5.2 Results for parametric and non-parametric classifiers
    5.2.1 Results of classification using GML classifier (GMLC)
    5.2.2 Class-wise comparison of result for GMLC
    5.2.3 Classification results using KNN classifier (KNNC)
    5.2.4 Class wise comparison of results for KNNC
  5.3 Experiment results for SVM based classifiers
    5.3.1 Experiment results for SVM_QP algorithm
    5.3.2 Experiment results for SVM_SMO algorithm
    5.3.3 Experiment results for KPCA_SVM algorithm
    5.3.4 Class wise comparison of the best result of SVM
    5.3.5 Comparison of results for different SVM algorithms
  5.4 Comparison of best results of different classifiers
  5.5 Ramifications of results

CHAPTER 6 - Summary of Results and Conclusions
  6.1 Summary of results
  6.2 Conclusions
  6.3 Recommendations for future work

REFERENCES

APPENDIX A

LIST OF TABLES

Table   Title
2.1     Summary of literature review
3.1     Examples of common kernel functions
4.1     List of parameters
5.1     The time taken for each FE technique
5.2     The best kappa values and z-statistic (at 5% significance values) for GML
5.3     Ranking of FE techniques and time required to obtain the best k-value
5.4     Classification with KNNC on OD and feature extracted data set
5.5     The best k-values and z-statistic for KNNC
5.6     Rank of FE techniques and time required to obtain best k-value
5.7     The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set
5.8     The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set
5.9     The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets
5.10    Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms
5.11    Statistical comparison of different classifier’s results obtained for different data sets
5.12    Ranking of different classification algorithms depending on classification accuracy and time (Rank 1 indicates the best)

LIST OF FIGURES

Figure  Title
1.1     Hyperspectral image cube
1.2     Fractional volume of a hypersphere inscribed in hypercube decreases as dimension increases
1.3     Study area in La Mancha region, Madrid, Spain (Pal, 2002)
1.4     FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area
1.5     Google earth image of study area
3.1     Overview of FE methods
3.2     Formation of blocks for SPCA
3.2a    Chart of multilayered segmented PCA
3.3     Layout of the regions for the chi-square projection index
3.4     (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only
3.5     Outline of KPCA algorithm
3.6     KNN classification scheme
3.7     Outline of KNN algorithm
3.8     Linear separating hyperplane for linearly separable data
3.9     Non-linear mapping scheme
3.10    Brief description of SVM_QP algorithm
3.11    Overview of KPCA_SVM algorithm
3.12    Definitions and values used in applying one-tail hypothesis testing
4.1     SPCA feature extraction method
4.2     Projection pursuit feature extraction method
4.3     KPCA feature extraction method
4.4     OSP feature extraction method
4.5     Overview of classification procedure
4.6     Experimental scheme for Set-I experiments
4.7     The experimental scheme for advanced classifier (Set-II)
5.1     Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively
5.2     Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction
5.3     First six Segmented Principal Components (SPCs) (b) shows water body and salt lake
5.4     First six Kernel Principal Components (KPCs) obtained by using 400 TP
5.5     First six features obtained by using eight end-members
5.6     Two components of most interesting projections
5.7     Correlation images after applying various feature extraction techniques
5.8     Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands
5.9     Comparison of kappa values and classification times for GML classification method
5.10    Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP
5.11    Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP
5.15    Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class
5.16    Comparison of best k-value and classification time for original and feature extracted data set
5.17    Class wise accuracy comparison of OD and different feature extracted data for KNNC
5.18    Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer
5.19    Classification time comparison using 200 and 300 TP per class
5.20    Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer
5.21    Comparison of classification time of different sets of TP with respect to number of bands for SVM_SMO classification algorithm
5.22    Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm
5.23    Comparison of classification accuracy of individual classes for different SVM algorithms

LIST OF ABBREVIATIONS

AC          Advance classifier
DAFE        Discriminant analysis feature extraction
DAIS        Digital airborne imaging spectrometer
DBFE        Decision boundary feature extraction
FE          Feature extraction
GML         Gaussian maximum likelihood
HD          Hyperspectral data
ICA         Independent component analysis
KNN         k-nearest neighbors
k-value     Kappa value
KPCA        Kernel principal component analysis
KPCA_SVM    Support vector machine with kernel principal component analysis
MS          Multispectral data
NWFE        Nonparametric weighted feature extraction
Ncri        Critical value
OD          Original data
OSP         Orthogonal subspace projection
PCA         Principal component analysis
PCT         Principal component transform
PP          Projection pursuit
rbf         Radial basis function
SPCA        Segmented principal component analysis
SV          Support vectors
SVM         Support vector machine
SVM_QP      Support vector machine with quadratic programming optimizer
SVM_SMO     Support vector machine with sequential minimal optimizer
TP          Training pixels

Dedicated to

my family & guide


CHAPTER 1 INTRODUCTION

Remote sensing technology has brought a new dimension to earth observation, mapping and many other fields. In the early days of this technology, multispectral sensors were used to capture data. Multispectral sensors capture data in a small number of bands with broad wavelength intervals. Because they have few spectral bands, their spectral resolution is insufficient to discriminate amongst many earth objects. If the spectral measurement is instead performed using hundreds of narrow wavelength bands, many earth objects can be characterized precisely. This is the key concept of hyperspectral imagery.

Compared to a multispectral (MS) data set, hyperspectral data (HD) has larger information content, is voluminous and differs in its characteristics. Extracting this large amount of information from HD therefore remains a challenge, and cost-effective, computationally efficient procedures are required to classify HD. Data classification is the categorization of data for its most effective and efficient use. The desired outcome of classification is a high-accuracy thematic map, and HD has the potential to provide one.

This chapter introduces the concept of high dimensional space, HD and the difficulties in classifying HD. It then presents the objectives of the thesis, followed by an overview of the data set used. Details of the software used are given next, followed by the structure of the thesis.

1.1 High dimensional space

In mathematics, an n-dimensional space is a topological space whose dimension is n (where n is a fixed natural number). A typical example is n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions. n-dimensional spaces with large values of n are sometimes called high-dimensional spaces (Werke, 1876). Many familiar geometric objects generalize to any number of dimensions. For example, the two-dimensional triangle and the three-dimensional tetrahedron can be seen as specific instances of the n-dimensional simplex. Likewise, the circle and the sphere are particular forms of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).

1.1.1 What is hyperspectral data?

When spectral measurement is done using hundreds of narrow contiguous wavelength intervals, the captured image is called a hyperspectral image. A hyperspectral image is usually represented as a hyperspectral image cube (Figure 1.1). In this cube, the x and y axes specify the size of the image and the λ axis specifies the dimension, i.e. the number of bands. Hyperspectral sensors collect information as a set of images, one per band. Each image represents a narrow range of the electromagnetic spectrum.

Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006)

These images are then combined to form a three dimensional hyperspectral cube. Because the dimensionality of HD is very high, it is comparable to a high dimensional space, and HD shares the characteristics of high dimensional spaces described in the following section.
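The cube layout can be illustrated with a short sketch. It is not taken from the thesis software (which was written in Matlab); the array sizes and the random values are assumptions chosen only to show how a band image and a pixel spectrum are read from a cube stored as rows x columns x bands.

```python
import numpy as np

# Hypothetical cube size (rows x columns x bands); values are illustrative only.
rows, cols, bands = 4, 5, 65

rng = np.random.default_rng(0)
cube = rng.random((rows, cols, bands))  # synthetic stand-in for sensor data

# Fixing the band index gives one 2-D image of the scene.
band_image = cube[:, :, 10]

# Fixing the spatial position gives the spectrum of one pixel.
pixel_spectrum = cube[2, 3, :]

# Most classifiers work on a (pixels x bands) matrix, so the two
# spatial axes are flattened into one.
pixels = cube.reshape(rows * cols, bands)
```

Each row of `pixels` is then one point in a space whose dimension equals the number of bands, which is why HD is treated as high dimensional data.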


1.1.2 Characteristics of high dimensional space

High dimensional spaces, i.e. spaces with a dimensionality greater than three, have properties that differ substantially from our ordinary sense of distance, volume and shape. In particular, in a high-dimensional Euclidean space, volume expands far more rapidly with increasing diameter than in lower-dimensional spaces, so that, for example:

(i). Almost all of the volume within a high-dimensional hypersphere lies in a thin

shell near its outer "surface"

(ii). The volume within a high-dimensional hypersphere relative to a hypercube of

the same width tends to zero as dimensionality tends to infinity, and almost all

of the volume of the hypercube is concentrated in its "corners".

These characteristics have two immediate consequences for high dimensional data. First, high dimensional space is mostly empty. As a consequence, high dimensional data can be projected onto a lower dimensional subspace without losing significant information in terms of separability among the different statistical classes (Jimenez and Landgrebe, 1995). Second, normally distributed data tend to concentrate in the tails and, similarly, uniformly distributed data tend to collect in the corners, which makes density estimation more difficult. Local neighborhoods are almost empty, so the bandwidth of estimation must be large, and detail in the density estimate is lost (Abhinav, 2009).

Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases (modified after Jimenez and Landgrebe, 1995)
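Both volume properties listed above follow from the standard formula for the volume of an n-dimensional ball of radius r, V_n(r) = π^(n/2) r^n / Γ(n/2 + 1). The sketch below is an illustration added for the reader rather than part of the thesis software; the function names are hypothetical. It computes the fraction of a hypercube occupied by its inscribed hypersphere and the fraction of a hypersphere's volume lying in a thin outer shell.

```python
import math

def sphere_to_cube_volume_ratio(n):
    """Volume of the n-ball inscribed in a unit n-cube, divided by the cube volume (1)."""
    # Inscribed ball has radius 1/2, so V = pi^(n/2) * (1/2)^n / Gamma(n/2 + 1).
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))

def outer_shell_fraction(n, eps):
    """Fraction of an n-ball's volume within relative distance eps of its surface."""
    # Volume scales as r^n, so the inner ball of radius (1 - eps) holds (1 - eps)^n.
    return 1.0 - (1.0 - eps) ** n

for n in (2, 3, 10, 65):
    print(n, sphere_to_cube_volume_ratio(n), outer_shell_fraction(n, 0.1))
```

For n = 2 the ratio is π/4 and for n = 3 it is π/6. By n = 10 it is already well below 1%, and for a 65-dimensional space more than 99% of the hypersphere's volume lies in the outer 10% of its radius.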

1.1.3 Hyperspectral imaging

Hyperspectral imaging collects and processes information from across the electromagnetic spectrum. Hyperspectral imagery can distinguish between many types of earth objects that appear to be the same color to the human eye. Hyperspectral sensors look at objects using a vast portion of the electromagnetic spectrum. The whole process of hyperspectral imaging can be divided into three steps: preprocessing, radiance to reflectance transformation and data analysis (Varshney and Arora, 2004).

In particular, preprocessing is required to convert the raw radiance to at-sensor radiance. The preprocessing steps include operations such as spectral calibration, geometric correction, geo-coding and signal-to-noise adjustment. The radiometric and geometric accuracy of hyperspectral data differs significantly from one band to another (Varshney and Arora, 2004).

1.2 What is classification?

Classification means putting data into groups according to their characteristics. In spectral classification, areas of the image that have similar spectral reflectance are put into the same group or class (Abhinav, 2009). Classification can also be seen as a means of compressing image data, reducing the large range of digital numbers (DN) in several spectral bands to a few classes in a single image. Classification reduces this large spectral space to relatively few regions and therefore loses numerical information from the original image. Depending on the information available about the imaged region, either supervised or unsupervised classification methods are used.

1.2.1 Difficulties in hyperspectral data classification

Although HD can provide a more accurate thematic map than MS data, classifying high dimensional data involves several difficulties, listed below:

1. Curse of dimensionality and Hughes phenomenon: As the dimensionality of the data set increases with the number of bands, the number of training pixels (TP) required to train a given classifier must also increase to achieve the desired classification accuracy. Obtaining a large number of TP for each sub-class is difficult and expensive. This was termed the “curse of dimensionality” by Bellman (1960) and leads to the concept of the “Hughes phenomenon” (Hughes, 1968).

2. Characteristics of high dimensional space: The characteristics of high dimensional space were discussed in Sec. 1.1.2. For those reasons, algorithms used to classify multispectral data often fail for hyperspectral data.

3. Large number of highly correlated bands: Hyperspectral sensors use a large number of contiguous spectral bands, and some of these bands are highly correlated. Correlated bands do not give good classification results. An important task is therefore to select the uncorrelated bands, or to decorrelate the bands by applying feature reduction algorithms (Varshney and Arora, 2004).

4. Optimum number of features: Selecting the optimum number of bands to use in classification out of a large number of bands (e.g. 224 bands for an AVIRIS image) is critical. To date there is no suitable algorithm or rule for selecting the optimal number of features.

5. Large data size and high processing time due to complexity of classifier: Hyperspectral imaging systems produce large amounts of data. A large memory and a powerful system, which are generally very expensive, are needed to store and handle the data.
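The first difficulty in the list above can be made concrete for a Gaussian classifier. Such a classifier estimates a mean vector and a symmetric covariance matrix for every class, so the number of free parameters per class grows quadratically with the number of bands. The sketch below is an illustration only; the function name is hypothetical and the band counts are those quoted in this thesis.

```python
def gaussian_parameters_per_class(n_bands):
    """Free parameters of one multivariate normal class model."""
    mean_terms = n_bands                              # one mean per band
    covariance_terms = n_bands * (n_bands + 1) // 2   # symmetric matrix, upper triangle
    return mean_terms + covariance_terms

for n_bands in (20, 45, 65):
    print(n_bands, gaussian_parameters_per_class(n_bands))
```

With 65 bands each class needs 2210 parameters, against 230 for 20 bands. In addition, a sample covariance matrix estimated from N pixels has rank at most N - 1, so at least n_bands + 1 training pixels per class are needed before it can be inverted at all.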

1.3 Background of work

This thesis extends the work done by Abhinav Garg (2009) in his M.Tech thesis. He showed that among the conventional classifiers (Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER), GML provides the best result. The performance of GML improved significantly after applying feature extraction (FE) techniques. Among the FE techniques considered (principal component analysis (PCA), discriminant analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and independent component analysis (ICA)), PCA was found to work best in improving the classification accuracy of GML.

For the advanced classifiers, the results of SVM did not depend on the choice of parameters, whereas those of the artificial neural network (ANN) did. He also showed that the SVM results were improved by the PCA and ICA techniques, while supervised FE techniques such as NWFE and DBFE failed to improve them significantly.

He identified some drawbacks of advanced classifiers such as SVM and suggested some FE techniques that may improve the results of conventional classifiers (CC) as well as advanced classifiers (AC). In particular, for a large number of TP (e.g. 300 per class), SVM takes more processing time than for a small number of TP. The objectives of this thesis are to address these problems and to find the best FE technique for improving the classification results for HD. These objectives are described in the next section.

1.4 Objectives

This thesis investigates the following two objectives pertaining to classification with hyperspectral data:

Objective-1:

To evaluate various FE techniques for classification of hyperspectral data.

Objective-2:

To study the extent to which advanced classifiers can reduce problems related to classification of hyperspectral data.

1.5 Study area and data set used

The study area for this research lies within a region known as 'La Mancha Alta', which covers approximately 8000 sq. km to the south of Madrid, Spain (Fig. 1.4). The area is mainly used for the cultivation of wheat, barley and other crops such as vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer on 29th June, 2000, at 5 m resolution.

Data were collected in 79 wavebands ranging from 0.4 μm to 12.5 μm, with a gap between 1.1 μm and 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to 2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were observed between bands 41 and 72. All 72 bands were visually examined, and 7 bands (41, 42 and 68 to 72) were found unusable because of very severe striping and were removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels covering the area of interest was extracted (Abhinav, 2009).
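The band bookkeeping in the preceding paragraph can be checked with a few lines. This is an illustrative sketch rather than the preprocessing code used for the thesis; band numbers are the 1-based sensor band numbers quoted in the text.

```python
# Bands 1..72 cover the 0.4 to 2.5 micrometre range selected for analysis.
all_bands = list(range(1, 73))

# Bands rejected after visual inspection because of severe striping.
removed = [41, 42] + list(range(68, 73))   # 41, 42 and 68 to 72

retained = [band for band in all_bands if band not in removed]

print(len(removed), len(retained))  # 7 bands removed, 65 retained
```

Note that after removal the retained bands are no longer numbered contiguously, so any later reference to "band k" of the 65-band data set refers to a position in `retained`, not to the original sensor band number.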

The data set available for this research comprises the 65 bands retained after pre-processing and the reference image, which was generated with the help of field data collected by local farmers, as described in Pal (2002). The area covered by the imagery is divided into eight land cover types: wheat, water body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built up area.


Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)

Figure 1.4: FCC obtained by first 3 principal components and superimposed

reference image showing training data available for classes identified for study area (Pal, 2002).


Figure 1.5: Google earth image of study area (Google earth, 2007)

1.6 Software details

Processing HD requires a very powerful system because of the size of the data set and the complexity of the algorithms. The machine used for this thesis has a 2.16 GHz Intel processor, 2 GB of RAM and the Windows 7 operating system. Matlab 7.8.0 (R2009a) was used to code the different algorithms. All results reported here were obtained on the same machine so that the different algorithms can be compared.

1.7 Structure of thesis

The present thesis is organized into six chapters. Chapter 1 focuses on the characteristics of high dimensional space, the challenges of HD classification and the outline of the experiments in this thesis. It also discusses the study region, the data set and the software used. Chapter 2 reviews HD classification and previous research in this domain. Chapter 3 describes the mathematical background of the different methods used in this work. Chapter 4 outlines the methodology followed. Chapter 5 presents the experiments conducted for this thesis, followed by their interpretation. Chapter 6 provides the conclusions of the present work and the scope for future work.


CHAPTER 2 LITERATURE REVIEW

This chapter outlines the important research and major achievements in the field of high dimensional data analysis and data classification. It begins with some of the FE techniques and classification approaches that various researchers have suggested for solving problems related to HD classification. The results of relevant experiments with HD are included to highlight the usefulness and reliability of these approaches, and are presented in tabulated form. Some other issues related to the classification of HD are discussed at the end of the chapter.

2.1 Dimensionality reduction by feature extraction

Swain and Davis (1978) described various separability measures for multivariate normal class models. Statistical classes are often found to overlap, which causes misclassification errors because most classifiers use a decision boundary approach. The idea was to obtain a separability measure that gives an overall estimate of the range of classification accuracies achievable with a sub-set of selected features, so that the sub-set of features corresponding to the highest classification accuracy can be selected for classification (Abhinav, 2009).

FE is the process of transforming the given data from a higher dimensional

space to a lower dimensional space while conserving the underlying information

(Fukunaga, 1990). The philosophy behind such transformation is to re-distribute the

underlying information spread in high dimensional space by containing it into

comparatively smaller number of dimensions without loss of significant amount of

useful information. FE techniques, in case of classification, try to enhance class

separability while reducing data dimensionality (Abhinav, 2009).

2.1.1 Segmented principal component analysis (SPCA)

The principal component transform (PCT) has been successfully applied to multispectral data for feature reduction. It can also be used as a tool for image enhancement and digital change detection (Lodwick, 1979). For dimension reduction of HD, PCA outperforms those FE techniques which are based on class statistics (Muasher and Landgrebe, 1983). Further, because the number of TP is limited and its ratio to the number of dimensions is low for HD, the class covariance matrix cannot be estimated properly. To overcome these problems, Jia (1996) proposed segmented principal component analysis (SPCA), which applies the PCT to each of the highly correlated blocks of bands. This approach also reduces the processing time by partitioning the complete set of bands into several subgroups of highly correlated bands. Jensen and James (1999) reported that SPCA-based compression generally outperforms PCA-based compression in terms of detection and classification accuracy on decompressed HD. PCA works efficiently for highly correlated data sets, whereas SPCA works efficiently for both highly correlated and weakly correlated data sets (Jia, 1996).

Jia (1996) compared SPCA and PCA extracted features for target detection and concluded that SPCA is a better FE technique than PCA. She also showed that the two feature extracted data sets are identical, with no loss of variance in the middle stages, as long as no components are removed.

2.1.2 Projection pursuit (PP)

Projection pursuit (PP) methods were originally posed and experimented with by Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman and Tukey (1974). They described PP as a way of searching for and exploring nonlinear structure in multi-dimensional data by examining many 2-D projections. Their goal was to find interesting views of high dimensional data sets. The next stages in the development of the technique were presented by Jones (1983) who, amongst other things, developed a projection index based on polynomial moments of the data. Huber (1985) presented several aspects of PP, including the design of projection indices. Friedman (1987) derived a transformed projection index. Hall (1989) developed an index using methods similar to Friedman's, and also developed theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b) introduced a projection index called the chi-square projection pursuit index. Posse also used a random search method to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to get a sequence of interesting 2-D projections. Each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. More recently, the PP technique has also been used to obtain 1-D projections (Martinez, 2005). In this research work, Posse's method is followed, which reduces an n-dimensional data set to 2-dimensional data.

2.1.3 Orthogonal subspace projection (OSP)

Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP) method, which simultaneously reduces the data dimensionality, suppresses undesired or interfering spectral signatures, and detects the presence of a spectral signature of interest. The concept is to project each pixel vector onto a subspace which is orthogonal to the undesired signatures. For OSP to be effective, the number of bands must not be less than the number of signatures. This is a major limitation for multispectral images. To overcome it, Ren and Chang (2000) presented the generalized OSP (GOSP) method, which relaxes this constraint so that OSP can be extended to multispectral image processing in an unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci, 2001) and also for magnetic resonance image classification (Wang et al., 2001).
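The projection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a linear mixing model is assumed, the columns of the matrix `U` hold the undesired signatures, `d` is the signature of interest, and the function names and toy numbers are invented for the example rather than taken from the cited papers.

```python
import numpy as np

def osp_projector(U):
    """Return P = I - U U^+, the projector onto the subspace orthogonal to the columns of U."""
    n_bands = U.shape[0]
    return np.eye(n_bands) - U @ np.linalg.pinv(U)

def osp_score(x, d, U):
    """OSP detector output d^T P x for a single pixel vector x."""
    return float(d @ osp_projector(U) @ x)

# Toy example with 4 bands, one desired and two undesired signatures (made-up values).
d = np.array([1.0, 0.5, 0.2, 0.1])
U = np.array([[0.2, 0.9],
              [0.4, 0.7],
              [0.8, 0.3],
              [0.9, 0.1]])
P = osp_projector(U)

pixel_background = U @ np.array([0.6, 0.4])   # mixture of undesired signatures only
pixel_target = 0.5 * d + pixel_background     # same mixture plus the desired signature

score_background = osp_score(pixel_background, d, U)
score_target = osp_score(pixel_target, d, U)
print(score_background, score_target)
```

The background pixel is annihilated by the projector, while the pixel containing the desired signature keeps a positive response. The toy case has 4 bands and 3 signatures in total, which respects the requirement that the number of bands is not less than the number of signatures.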

2.1.4 Kernel principal component analysis (KPCA)

Linear PCA cannot always detect all the structure in a given data set. With a suitable nonlinear feature extractor, more information can be extracted from the data set. Kernel principal component analysis (KPCA) can be used as a strong nonlinear FE method (Scholkopf and Smola, 2002): it maps the input vectors to a feature space and then applies PCA to the mapped vectors. KPCA is also a powerful preprocessing step for classification algorithms (Mika et al., 1998). Rosipal et al. (2001) proposed applying KPCA for feature selection in a high-dimensional feature space where the input variables were mapped by a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of the higher-order statistics. Estimating these higher-order statistics requires a large number of TP. This causes problems for KPCA, since it requires storing and manipulating the kernel matrix, whose size is the square of the number of TP. To overcome this problem, an iterative algorithm for KPCA, the Kernel Hebbian Algorithm (KHA), was introduced (Scholkopf et al., 2005).
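A minimal sketch of batch KPCA with a Gaussian kernel is given below to make the storage issue concrete. It is not the KHA and not the implementation used in this thesis; the function names, the synthetic data and the value of `sigma` are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gaussian Gram matrix for the rows of X (one sample per row)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def kernel_pca(X, n_components, sigma=1.0):
    """Project the training samples onto the leading kernel principal components."""
    m = X.shape[0]
    K = rbf_gram(X, sigma)                      # m x m matrix: grows with the square of m
    # Centre the mapped data in feature space: Kc = K - 1K - K1 + 1K1.
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale the coefficients so that each feature-space eigenvector has unit norm.
    alphas = eigvecs / np.sqrt(eigvals)
    return Kc @ alphas, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                    # 40 synthetic "pixels" with 5 bands
features, eigvals = kernel_pca(X, n_components=3, sigma=2.0)
print(features.shape, eigvals)
```

With 40 training pixels the kernel matrix has 40 x 40 entries; with tens of thousands of TP the same matrix would no longer fit in the memory of a typical workstation, which is the motivation for iterative schemes such as the KHA.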

2.2 Parametric classifiers

Parametric classifiers (Fukunaga, 1990) require some parameters to build the assumed density function model for the given data. These parameters are computed from a set of already classified or labeled data points called training data. The training data is a subset of the given data for which the class labels are known and is chosen by sampling techniques (Abhinav, 2009). It is used to compute class statistics from which the assumed density function of each class is obtained. Such classes are referred to as statistical classes (Richards and Jia, 2006), as they depend upon the training data and may differ from the actual classes.

2.2.1 Gaussian maximum likelihood (GML)

The maximum likelihood method is based on the assumption that the frequency distribution of class membership can be approximated by the multivariate normal probability distribution (Mather, 1987). Gaussian maximum likelihood (GML) is one of the most popular parametric classifiers and has conventionally been used for the classification of remotely sensed data (Landgrebe, 2003). The advantages of the GML classification method are that it attains the minimum classification error under the assumption that the spectral data of each class is normally distributed, and that it considers not only the class centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrix of the clusters (Lillesand et al., 2002).
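The statistical distance mentioned above can be illustrated with a short sketch of a Gaussian maximum likelihood rule. The discriminant used here, log prior minus half the log-determinant of the class covariance minus half the Mahalanobis distance, is the standard textbook form; the function names and the synthetic two-class data are assumptions for illustration only.

```python
import numpy as np

def fit_gml(X, y):
    """Estimate mean, covariance and prior of each class from labelled samples (rows of X)."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        model[label] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return model

def predict_gml(model, X):
    """Assign each row of X to the class with the largest Gaussian log-discriminant."""
    labels = list(model)
    scores = []
    for label in labels:
        mean, cov, prior = model[label]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of every sample to the class mean.
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        scores.append(np.log(prior) - 0.5 * logdet - 0.5 * maha)
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Two synthetic, well separated classes with 100 training pixels and 2 bands each.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = fit_gml(X, y)
accuracy = float(np.mean(predict_gml(model, X) == y))
print(accuracy)
```

The sketch inverts each class covariance matrix directly, so it only works when every class has clearly more training pixels than bands; this is the practical reason why feature extraction is applied before GML on HD.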

Lee and Landgrebe (1993) compared the results of the GML classifier on PCA and DBFE feature extracted data sets and concluded that the DBFE feature extracted data set provides better accuracy than the PCA feature extracted data set. Kuo and Landgrebe (2004) compared the NWFE and DAFE FE techniques in terms of the classification accuracy achieved by nearest neighbor and GML classifiers. They concluded that NWFE is a better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA, DAFE, DBFE and NWFE feature extracted data sets on the GML classifier. He showed that, among these feature extractors, PCA is the best FE technique for HD with the GML classifier. He also suggested that FE techniques like KPCA, OSP, SPCA and PP may improve the classification results of the GML classifier.

2.3 Non–parametric classifiers

Non–parametric classifiers (Fukunaga, 1990) use some control parameters, carefully chosen by the user, to estimate the best fitting function by an iterative or learning algorithm. They may or may not require training data for estimating the probability density function (PDF). The Parzen window (Parzen, 1962) and k–nearest neighbor (KNN) (Cover and Hart, 1967) are two popular classifiers in this category. Edward (1972) gave brief descriptions of many non–parametric approaches for estimating data density functions.

2.3.1 KNN

The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern recognition. The technique can achieve high classification accuracy in problems with unknown and non-normal distributions. However, its major drawback is that a large number of TP is required, which results in high computational complexity for classification (Hwang and Wen, 1998).
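The computational cost mentioned above is visible in a brute-force sketch of the KNN rule: every query pixel is compared with every training pixel. The function name, the Euclidean distance and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    predictions = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training pixel
        nearest = np.argsort(dists)[:k]
        values, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.2, 0.3], [5.5, 5.5]])

predicted = knn_predict(X_train, y_train, X_query, k=3)
print(predicted)
```

For each query the cost grows linearly with both the number of TP and the number of bands, so reducing the number of bands by FE directly reduces the classification time.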

Pechenizkiy (2005) compared the performance of the KNN classifier on PCA and random projection (RP) feature extracted data sets. He concluded that KNN performs well on the PCA feature extracted data set. Zhu et al. (2007) showed that KNN works better on the ICA feature extracted data set than on the original data set (OD), which was captured by a hyperspectral imaging system developed by the ISL. The ICA-KNN method with a few wavelengths had the same performance as the KNN classifier alone using information from all wavelengths.

Some more non–parametric classifiers based on geometrical approaches to data classification were found during the literature survey. These approaches consider the data points to be located in Euclidean space and exploit the geometrical patterns of the data points for classification. Such approaches are grouped into a new class of classifiers known as machine learning techniques. Support vector machines (SVM) (Boser et al., 1992) and k-nearest neighbor (KNN) (Fix and Hodges, 1951) are among the popular classifiers of this kind. They make no assumptions regarding the data density function or the discriminating functions and hence are purely non–parametric classifiers. However, these classifiers also need to be trained using the training data.

2.3.2 SVM

SVM is regarded as an advanced classifier. It belongs to a new generation of classification techniques based on statistical learning theory, has its origins in machine learning, and was introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995, 1998) discussed SVM based classification in detail. SVM seeks to improve learning by using empirical risk minimization (ERM) to minimize the learning error and structural risk minimization (SRM) to minimize the upper bound on the overall expected classification error. SVM uses the principle of optimal separation of classes to find a separating hyperplane that separates the classes of interest to the maximum extent by maximizing the margin between them (Vapnik, 1992). This technique differs from the estimation of effective decision boundaries used by Bayesian classifiers, as only the data vectors near the decision boundary (known as support vectors) are required to find the optimal hyperplane. A linear hyperplane may not be enough to classify the given data set without error. In such cases, the data is transformed to a higher dimensional space using a non–linear transformation that spreads the data apart such that a linear separating hyperplane may be found. Kernel functions are used to reduce the computational complexity that arises from the increased dimensionality (Varshney and Arora, 2004).

The advantages of SVM (Varshney and Arora, 2004) lie in its high generalization capability and its ability to adapt its learning characteristics through kernel functions. As a result, it can adequately classify data in a high–dimensional feature space with a limited number of training samples and is not affected by the Hughes phenomenon and other effects of dimensionality. The ability to classify with even a limited number of training samples makes SVM a very powerful classification tool for remotely sensed data. Thus, SVM has the potential to produce accurate classifications from HD with a limited number of training samples. SVMs are believed to be better learning machines than neural networks, which tend to overfit classes and cause misclassification (Abhinav, 2009), as they rely on margin maximization rather than finding a decision boundary directly from the training samples.

A conventional SVM uses an optimizer based on quadratic programming (QP) or linear programming (LP) methods to solve the optimization problem. The major disadvantage of the QP algorithm is that the kernel matrix must be stored in memory. When the kernel matrix is large, it requires a huge amount of memory that may not always be available. To overcome this, Benett and Campbell (2000) suggested the kernel adatron (KA) algorithm, an optimization method which updates the Lagrange multipliers sequentially. Another approach is the decomposition method, which updates the Lagrange multipliers in parallel: it updates many parameters in each iteration, unlike other methods that update one parameter at a time (Varshney and Arora, 2004). The QP optimizer used here updates the Lagrange multipliers on a working data set of fixed size. The decomposition method uses a QP or LP optimizer to solve the problem of a huge data set by considering many small data sets rather than a single huge one (Varshney, 2001). The sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of the decomposition method in which the size of the working data set is fixed such that an analytical solution can be derived in very few numerical operations. It does not use the QP or LP optimization methods. This method needs more iterations but requires only a small number of operations per iteration, and thus increases the optimization speed for very large data sets.
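The idea of updating one Lagrange multiplier at a time can be sketched with a simplified kernel adatron. The sketch assumes a hard margin, no bias term and a Gaussian kernel (so that k(x, x) = 1 and a step size of 1 is a plain coordinate update); the function names, the data and the number of epochs are illustrative choices, not the settings used in this thesis.

```python
import math

def rbf(a, b, sigma=1.0):
    """Gaussian kernel between two points given as tuples."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_adatron(X, y, kernel, epochs=200, eta=1.0):
    """Sequentially update one Lagrange multiplier at a time (hard margin, no bias term)."""
    n = len(X)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Kernel values are computed on the fly, so no n x n matrix is stored.
            margin = y[i] * sum(alpha[j] * y[j] * kernel(X[i], X[j]) for j in range(n))
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - margin))
    return alpha

def decision(x, X, y, alpha, kernel):
    """Value of the decision function sum_j alpha_j y_j k(x_j, x)."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

# Two small, well separated clusters with labels -1 and +1.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (4, 5), (5, 4), (5, 5)]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

alpha = kernel_adatron(X, y, rbf)
margins = [yi * decision(xi, X, y, alpha, rbf) for xi, yi in zip(X, y)]
print(alpha)
print(margins)
```

The trade-off is the one described in the text: memory use stays small because kernel values are recomputed when needed, at the price of more kernel evaluations and more passes over the data.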

The speed of SVM classification decreases as the number of support vectors (SV) increases. By using kernel mapping, different SVM algorithms have successfully incorporated effective and flexible nonlinear models. For large data sets, however, computing the nonlinear kernel matrix poses major difficulties. To overcome these computational difficulties, some authors have proposed low rank approximations to the full kernel matrix (Wiens, 92). As an alternative, Lee and Mangasarian (2002) proposed the reduced support vector machine (RSVM), which reduces the size of the kernel matrix. However, selecting the number of support vectors remained a problem. Sundaram (2009) proposed a method which reduces the number of SV through the application of KPCA. This method differs from the other proposed methods in that the exact choice of support vectors is not important as long as the vectors span a fixed subspace.

Benediktsson et al. (2000) applied KPCA to the ROSIS-03 data set. They then used a linear SVM on the feature extracted data set and showed that KPCA features are more linearly separable than the features extracted by conventional PCA. Shah et al. (2003) compared SVM, GML and ANN classifiers at full dimensionality and with the DAFE and DBFE FE techniques on an AVIRIS data set. They concluded that SVM gives higher accuracies than GML and ANN at full dimensionality but poor accuracies for features extracted by DAFE and DBFE. Abhinav (2009) compared SVM, GML and ANN on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. He concluded that SVM provides better results on the OD than GML. SVM works best with PCA and ICA feature extracted data sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.

The work done by various researchers with different hyperspectral data sets, classifiers and FE methods, and the results they obtained, are summarized in Table 2.1.

Table 2.1: Summary of literature review

| Author | Dataset used | Method used | Results obtained |
|---|---|---|---|
| Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier was used to compare the classification accuracies obtained by DBFE and PCA FE. | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods. |
| Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics were studied with respect to the effects of dimensionality and the order of data statistics used on supervised classification techniques. | The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy was observed to increase with the use of higher order statistics. Lower order statistics were observed to be less affected by the Hughes phenomenon. |
| Benediktsson et al (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets were used for classification with a linear SVM. | KPCA features are more linearly separable than features extracted by conventional PCA. |
| Shah et al. (2003) | AVIRIS | SVM, GML and ANN classifiers were compared at full dimensionality and with DAFE and DBFE feature extraction techniques. | SVM gave higher accuracies than GML and ANN at full dimensionality, but poor accuracies were obtained for features extracted by DAFE and DBFE. |
| Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC mall, Washington, US) | NWFE and DAFE FE techniques were compared for the classification accuracy achieved by nearest neighbor and GML classifiers. | NWFE produced better classification accuracies than DAFE. |
| Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier was used to compare the classification accuracies obtained by PCA and random projection FE. | PCA gave better results than random projection. |
| Zhu et al (2007) | Hyperspectral imaging system developed by ISL | ICA ranking methods were used to select the optimal wavelengths, after which KNN was applied. KNN alone was also used. | The ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands. |
| Sundaram (2009) | The Adult data set, part of the UCI Machine Learning Repository | KPCA was applied to the support vectors, then the usual SVM algorithm was used. | Processing time was significantly reduced without affecting the classification accuracy. |
| Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques were used on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. | GML was the best among these techniques and performed best on the PCA extracted data set. |
| Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques were used on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare the accuracy. | GML performed much worse than SVM on the OD. SVM provided better accuracy than GML and performed better on PCA and ICA extracted data sets. |

2.4 Conclusions from literature review

1. From Table 2.1, it can be concluded that FE techniques like PCA, ICA, DAFE, DBFE and NWFE perform well in improving the classification accuracies when used with GML. However, the features extracted by DBFE and DAFE failed to improve the results obtained by SVM, implying a limitation of these techniques for advanced classifiers. KNN works best with PCA and ICA feature extracted data sets. In the surveyed literature, the effects of PP, SPCA, KPCA and OSP extracted features on the classification accuracy of advanced classifiers like SVM, the parametric classifier GML and the non–parametric classifier KNN have not been reported.

2. Another important aspect found missing in the literature is a comparison of classification time for SVM classifiers, given that SVM takes a long time to train with a large number of TP. Many SVM approaches have been proposed to reduce the classification time, but there is no conclusion on the best SVM algorithm in terms of classification accuracy and processing time.

3. Although KNN is an effective classification technique for HD, there is no guideline on classification time or suggestion of the best FE techniques for the KNN classifier. The effect of different parameters, such as the number of nearest neighbors, the number of TP and the number of bands, has also not been reported for KNN.

4. The literature survey further shows that there is no suggestion of the best FE techniques for the different SVM algorithms, GML and KNN.

These missing aspects are investigated in this thesis work, and guidelines for choosing an efficient and less time consuming classification technique are presented as the result of this research.

This chapter presented the FE and classification techniques for mitigating the effects of dimensionality. These techniques result from different approaches to dealing with the problem of high dimensionality and to improving the performance of advanced, parametric and non–parametric classifiers. The approaches were applied to real-life HD, and the comparative results reported in the literature were compiled and presented here. In addition, the important aspects found missing in the literature survey were highlighted; this thesis work attempts to investigate them. The mathematical rationale and the algorithms used to apply these techniques are discussed in detail in the next chapter.

CHAPTER 3 MATHEMATICAL BACKGROUND

This chapter provides the detailed mathematical background of each of the techniques used in this thesis. Starting with some basic concepts of kernels and kernel space, the chapter describes the unsupervised and supervised FE techniques, followed by the classification and optimization rules for supervised classifiers. Finally, the scheme for statistical analysis, which has been used for comparing the results of the different classification techniques, is discussed.

The notation followed in this chapter for matrices and vectors is given below:

$X$: A two-dimensional matrix whose columns represent the data points ($m$) and whose rows represent the bands ($n$), where $X = [X]_{n \times m}$.

$x_i$: An $n$-dimensional single-pixel column vector, where $X = [x_1, x_2, \ldots, x_m]$ and $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]^T$.

$c_j$: Represents the $j$th class.

$\Phi(z)$: Mapping of the input vector $z$ into kernel space, using some kernel function.

$\langle a, b \rangle$: Defines the inner product of the vectors $a$ and $b$.

$\in$: Belongs to.

$R^n$: Set of $n$-dimensional real numbers.

$N$: Set of natural numbers.

$[\;]^T$: Denotes the transpose of a matrix.

$\forall$: For all.

3.1 What is kernel?

Before defining a kernel, consider the following two definitions:

• Input space: The space where the data points originally lie.

• Feature space: The space spanned by the transformed data points, which are mapped from the original space by some function.

A kernel is the dot product in a feature space $H$ reached via a map $\Phi$ from the input space, such that $\Phi : X \to H$. It can be defined as $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where $x, x'$ and $\Phi(x), \Phi(x')$ are elements of the input space and the feature space respectively; $k$ is called the kernel and $\Phi$ is called the feature map associated with $k$. The space containing these dot products is called the kernel space. This is a nonlinear mapping from input space to feature space which can increase the separation between points in a data set. As a result, a data set which is not linearly separable in input space can become linearly separable in kernel space. A few definitions related to kernels are given below:

Gram matrix: Given a kernel $k$ and inputs $x_1, x_2, \ldots, x_n \in X$, the $n \times n$ matrix $K := (k(x_i, x_j))_{ij}$ is called the Gram matrix of $k$ with respect to $x_1, x_2, \ldots, x_n \in X$.

Positive definite matrix: A real $n \times n$ symmetric matrix $K$ satisfying $x_1^T K x_1 \geq 0$ for all $x_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T \in R^n$ is called positive definite, where $x_1$ is a column vector. If equality in the previous expression occurs only for $x_{11} = x_{21} = \cdots = x_{n1} = 0$, then the matrix is called strictly positive definite.

Positive definite kernel: Let $X$ be a nonempty set. A function $k : X \times X \to R$ which, for all $n \in N$ and all $x_i \in X$, $i = 1, \ldots, n$, gives rise to a positive definite Gram matrix is called a positive definite kernel. A function $k : X \times X \to R$ which, for all $n \in N$ and all distinct $x_i \in X$, gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
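For a real symmetric matrix, the quadratic form $x^T K x$ is nonnegative for every $x$ exactly when all eigenvalues of $K$ are nonnegative, and it vanishes only at $x = 0$ exactly when all eigenvalues are positive. A small numerical check along these lines is sketched below; the function name, the tolerance and the three test matrices are assumptions made for illustration.

```python
import numpy as np

def is_positive_definite(K, strict=False, tol=1e-10):
    """Check x^T K x >= 0 (or > 0 if strict) for all x via the eigenvalues of symmetric K."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False                       # the definition only covers symmetric matrices
    smallest = np.linalg.eigvalsh(K).min()
    return bool(smallest > tol) if strict else bool(smallest >= -tol)

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1 and 3
B = np.array([[1.0, 1.0], [1.0, 1.0]])     # eigenvalues 0 and 2
C = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues -1 and 3

print(is_positive_definite(A, strict=True))
print(is_positive_definite(B), is_positive_definite(B, strict=True))
print(is_positive_definite(C))
```

Matrix `B` is positive definite in the non-strict sense only, because the nonzero vector (1, -1) gives a quadratic form of exactly zero.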

Definitions of some commonly used kernel functions are shown in Table 3.1.

Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)

| Kernel function type | Definition $K(x, x_i)$ | Parameters | Performance depends on |
|---|---|---|---|
| Linear | $x \cdot x_i$ | | Decision boundary either linear or non linear |
| Polynomial with degree $n$ | $(x \cdot x_i + 1)^n$ | $n$ is a positive integer | User defined parameters |
| Radial basis function | $\exp\left(-\dfrac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)$ | $\sigma$ is a user defined value | User defined parameters |
| Sigmoid | $\tanh(k\,(x \cdot x_i) + \Theta)$ | $k$ and $\Theta$ are user defined parameters | User defined parameters |
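The four kernel functions of Table 3.1 translate directly into code. The sketch below uses plain Python sequences for the vectors; the function names and the default parameter values are illustrative assumptions.

```python
import math

def dot(x, xi):
    """Ordinary dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(x, xi))

def linear_kernel(x, xi):
    return dot(x, xi)

def polynomial_kernel(x, xi, n=2):
    """(x . xi + 1)^n with n a positive integer."""
    return (dot(x, xi) + 1.0) ** n

def rbf_kernel(x, xi, sigma=1.0):
    """exp(-||x - xi||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xi, k=1.0, theta=0.0):
    """tanh(k (x . xi) + theta)."""
    return math.tanh(k * dot(x, xi) + theta)

x, xi = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, xi))
print(polynomial_kernel(x, xi, n=2))
print(rbf_kernel(x, xi, sigma=1.0))
print(sigmoid_kernel(x, xi, k=0.1, theta=0.0))
```

Only the linear kernel is free of user defined parameters; for the other three, the parameter values have to be tuned for the data at hand.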

All the above definitions are illustrated with the following simple example.

Let

$$X = [x_1 \;\; x_2 \;\; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$$

be a matrix in input space whose columns ($x_i$, $i = 1, 2, 3$) denote the data points and whose rows denote the dimensions of the data points. Let this matrix be mapped into the feature space by using the Gaussian kernel function, and let $\langle x_i, x_j \rangle$ denote the inner product of the columns of the matrix $X$ using the Gaussian kernel function.

Then the Gram matrix (kernel matrix) $K$ takes precisely the form

$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$

The numerical value of the matrix $K$ is

$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$

$K$ is a symmetric matrix. If the matrix $K$ turns out to be positive definite, then the kernel is called a positive definite kernel, and if it is strictly positive definite, then the kernel is called a strictly positive definite kernel.
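The numerical Gram matrix can be checked with a few lines of Python. Two assumptions are made here because the text does not state them: the three data points are taken to be (1, 2, 1), (2, 1, 3) and (1, 1, 3), and the Gaussian kernel width is taken to be $\sigma = 1$. Under these assumptions the printed values are reproduced to four decimals.

```python
import math

# Assumed data points x1, x2, x3 and assumed kernel width (neither is stated explicitly).
points = [(1, 2, 1), (2, 1, 3), (1, 1, 3)]
sigma = 1.0

def gaussian_kernel(a, b, sigma):
    """exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

K = [[round(gaussian_kernel(a, b, sigma), 4) for b in points] for a in points]
for row in K:
    print(row)
```

The squared distances between the three points are 6, 5 and 1, which give exp(-3), exp(-2.5) and exp(-0.5) for the off-diagonal entries.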

3.2 Feature extraction techniques

FE techniques are based on the simple assumption that a given data sample $(x \in X : R^n)$, belonging to an unknown probability distribution in n-dimensional space, can be represented in some coordinate system in an m-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate system such that when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. Two main conditions must be met to obtain an optimal dimension reduction (Carreira-Perpinan, 1997):

(i) Elimination of dimensions with very low information content. Features with low information content can be discarded as noise.

(ii) Removal of redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors.

Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). Segmented principal component analysis (SPCA) and projection pursuit (PP) are used as the unsupervised techniques, and kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) as the supervised techniques. The next sub-sections discuss the assumptions used by these FE techniques in detail.

Figure 3.1: Overview of FE methods

3.2.1 Segmented principal component analysis (SPCA)

The principal component transform (PCT) has been successfully applied in multispectral data analysis, where it is used as a powerful tool for FE. For hyperspectral image data, PCT outperforms those FE techniques which are based on class statistics. The main advantage of using a PCT is that global statistics are used to determine the transform functions. However, implementing the PCT on a high dimensional data set imposes a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996).

The complete data set is first partitioned into K subgroups with respect to the correlation of bands. From the correlation image of HD, it can be seen that blocks are formed from highly correlated bands (Figure 3.2). These blocks are selected as the subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2 and k respectively (Figure 3.2a). The PCT is then applied to each subgroup of data. After the PCT has been applied to each subgroup, significant features are selected using the variance information of each component. The PCs which contain about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.
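The first stage of this procedure can be sketched as follows. The sketch assumes that the block boundaries have already been read off the correlation image and are passed in as a list of block sizes; the function names, the synthetic data and the 99% threshold parameter `var_keep` are illustrative choices rather than the implementation used in this thesis.

```python
import numpy as np

def pca_block(X, var_keep=0.99):
    """PCT of one block of bands (rows of X are bands, columns are pixels)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.atleast_2d(np.cov(Xc)))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the smallest number of components whose cumulative variance reaches var_keep.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_keep = min(int(np.searchsorted(ratio, var_keep)) + 1, len(eigvals))
    return eigvecs[:, :n_keep].T @ Xc

def segmented_pca(X, block_sizes, var_keep=0.99):
    """Apply the PCT separately to consecutive blocks of bands and stack the kept PCs."""
    parts, start = [], 0
    for size in block_sizes:
        parts.append(pca_block(X[start:start + size], var_keep))
        start += size
    return np.vstack(parts)

# Synthetic data: 10 bands in two blocks (6 and 4 bands), each driven by one source.
rng = np.random.default_rng(0)
n_pixels = 500
s1, s2 = rng.normal(size=(2, n_pixels))
block1 = np.outer(rng.uniform(0.5, 1.5, 6), s1) + 0.01 * rng.normal(size=(6, n_pixels))
block2 = np.outer(rng.uniform(0.5, 1.5, 4), s2) + 0.01 * rng.normal(size=(4, n_pixels))
X = np.vstack([block1, block2])

features = segmented_pca(X, block_sizes=[6, 4])
print(features.shape)
```

Each block requires an eigen-decomposition of only its own small covariance matrix, which is where the saving in processing time comes from. The retained components can be passed through `pca_block` once more if further compression is wanted.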

Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27

bands respectively, corresponding to highly correlated bands have been formed from the correlation image of HYDICE hyperspectral sensor data.

Segmented PCT retains all the variance, as does the conventional PCT. No information is lost whether the transformation is conducted on the complete vector at once or on a few sub-vectors separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the middle stages compared to the case of no segmentation. However, the difference in compression rate is relatively small if the segmented transformation is applied to subgroups which have poor correlation with each other.
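The claim that no variance is lost can be checked numerically: each block-wise PCT is an orthogonal transform, so stacking the block transforms rotates the data without changing the eigenvalues of its covariance matrix. The following sketch uses synthetic data and hypothetical helper names, and keeps every component at each stage.

```python
import numpy as np

def pct_matrix(X):
    """Orthogonal PCT matrix: rows are eigenvectors of the band covariance, largest first."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(X))
    return eigvecs[:, np.argsort(eigvals)[::-1]].T

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 6)) @ rng.normal(size=(6, 300))   # 6 correlated bands, 300 pixels

# Stage 1: PCT on two blocks of 3 bands each, keeping every component.
stage1 = np.vstack([pct_matrix(X[:3]) @ X[:3], pct_matrix(X[3:]) @ X[3:]])
# Stage 2: PCT on the regrouped components.
stage2 = pct_matrix(stage1) @ stage1
# Conventional PCT on all bands at once.
direct = pct_matrix(X) @ X

var_segmented = np.sort(np.var(stage2, axis=1, ddof=1))[::-1]
var_direct = np.sort(np.var(direct, axis=1, ddof=1))[::-1]
print(var_segmented)
print(var_direct)
```

The two variance spectra agree to numerical precision. Differences between SPCA and PCA therefore arise only when components are discarded in the middle stages.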

Figure 3.2a: Chart of multilayered segmented PCA

3.2.2 Projection pursuit (PP)

Projection pursuit (PP) refers to a technique first described by Friedman and Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by means of selected low dimensional linear projections. To reach this goal, an objective function called the projection index is assigned to every projection, characterizing the structure present in that projection. Interesting projections are then picked automatically by optimizing the projection index numerically. Interesting projections have usually been defined as those exhibiting departure from normality (normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).

Posse (1990) proposed an algorithm based on a random search and a chi-squared projection index for finding the most interesting plane (two-dimensional view). The optimization method was generally able to locate the global maximum of the projection index over all two-dimensional projections (Posse, 1995). The chi-squared index is efficient, being fast to compute and sensitive to departure from normality in the core rather than in the tail of the distribution. In this investigation only the chi-squared projection index (Posse, 1995a, 1995b) has been used.

Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:

(i) A projection pursuit index, which measures the degree of departure from normality.

(ii) A method for finding the projection that yields the highest value of the index.

Posse (1995a, 1995b) used a random search to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to get a sequence of interesting 2-D projections. The interesting projections are found in decreasing order of the value of the PP index, so each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. In the following discussion, the chi-squared PP index is described first, followed by the structure finding procedure. Finally, the structure removal procedure is illustrated.

3.2.2.1 Posse chi-square index

Posse proposed a projection index based on the chi-square statistic. The plane is first divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, that are distributed in the form of rings (Figure 3.3). Inner boxes have the same radial width $R/5$ and all boxes have the same angular width of $45^\circ$. $R$ is chosen so that the boxes have approximately the same weight under normally distributed data; the resulting radial width is $(2\log 6)^{1/2}/5$. The outer boxes have weight 1/48 under normally distributed data. This choice for the radial width provides regions with approximately the same probability for the standard bivariate normal distribution (Martinez, 2001). The projection index is given as:

$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\!\left( z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)} \right) - c_k \right]^2 \tag{3.1}$$

Where,

$\phi$: the standard bivariate normal density.

$c_k$: probability evaluated over the $k$th region using the normal density function, given by $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$.

$B_k$: box in the projection plane.

$\lambda_j$: $\lambda_j = \pi j / 36$, $j = 0, \ldots, 8$, is the angle by which the data are rotated in the plane before being assigned to regions.

$\alpha, \beta$: orthonormal $p$-dimensional vectors which span the projection plane (these can be the first two PCs or two randomly chosen pixels of the OD set).

$P(\alpha, \beta)$: the plane spanned by the two orthonormal vectors $\alpha, \beta$.

$z_i^{\alpha}, z_i^{\beta}$: sphered observations projected onto the vectors $\alpha$ and $\beta$ ($z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$).

$\alpha(\lambda_j)$: $\alpha \cos\lambda_j - \beta \sin\lambda_j$.

$\beta(\lambda_j)$: $\alpha \sin\lambda_j + \beta \cos\lambda_j$.

$I_{B_k}$: the indicator function for region $B_k$.

$PI_{\chi^2}(\alpha, \beta)$: the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.

The chi-square projection index is not affected by the presence of outliers.

However, it is sensitive to distributions that have a hole in the core, and it will also

yield projections that contain clusters. The chi-square projection pursuit index is fast

and easy to compute, making it appropriate for large sample sizes. Posse (1995a)

provides a formula to approximate the percentiles of the chi-square index.
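The index of Eq. (3.1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the data are assumed to be already sphered and projected onto $\alpha$ and $\beta$ (the lists `za` and `zb`), and the names `box_probabilities`, `box_of` and `chi_square_index` are hypothetical. The box probabilities $c_k$ are computed in closed form from the radial distribution of the standard bivariate normal, $P(r \le \rho) = 1 - e^{-\rho^2/2}$, shared equally by the eight sectors of each ring.

```python
import math
import random

R = math.sqrt(2.0 * math.log(6.0))        # radius that bounds the five inner rings
EDGES = [R * i / 5.0 for i in range(6)]   # ring boundaries 0, R/5, ..., R

def box_probabilities():
    """c_k for the 48 boxes: 6 rings (5 inner + 1 unbounded) x 8 sectors."""
    probs = []
    for ring in range(6):
        inner = math.exp(-EDGES[ring] ** 2 / 2.0)
        outer = math.exp(-EDGES[ring + 1] ** 2 / 2.0) if ring < 5 else 0.0
        probs.extend([(inner - outer) / 8.0] * 8)
    return probs

def box_of(u, v):
    """Index (0..47) of the box that contains the point (u, v)."""
    ring = min(int(math.hypot(u, v) / (R / 5.0)), 5)
    angle = math.atan2(v, u) % (2.0 * math.pi)
    sector = int(angle / (math.pi / 4.0)) % 8
    return ring * 8 + sector

def chi_square_index(za, zb):
    """Eq. (3.1) for sphered data projected onto alpha (za) and beta (zb)."""
    n = len(za)
    c = box_probabilities()
    total = 0.0
    for j in range(9):                               # rotations lambda_j = pi * j / 36
        lam = math.pi * j / 36.0
        counts = [0] * 48
        for a, b in zip(za, zb):
            u = a * math.cos(lam) - b * math.sin(lam)    # projection on alpha(lambda_j)
            v = a * math.sin(lam) + b * math.cos(lam)    # projection on beta(lambda_j)
            counts[box_of(u, v)] += 1
        total += sum((counts[k] / n - c[k]) ** 2 / c[k] for k in range(48))
    return total / 9.0
```

For normally distributed projections the index stays close to zero, while clustered projections produce a large value, which is the behaviour the random search in the next section exploits.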

Figure 3.3: Layout of the regions for the chi-square projection index: rings of radial width R/5 out to radius R, divided into sectors of angular width 45°, with each outer box having probability 1/48. (Modified after Posse, 1995a)

3.2.2.2 Finding the structure (PPEDA algorithm)

For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by,

$$\begin{aligned}
a_1 &= \frac{\alpha^* + c v}{\|\alpha^* + c v\|}, & b_1 &= \frac{\beta^* - (a_1^T \beta^*)\, a_1}{\|\beta^* - (a_1^T \beta^*)\, a_1\|} \\
a_2 &= \frac{\alpha^* - c v}{\|\alpha^* - c v\|}, & b_2 &= \frac{\beta^* - (a_2^T \beta^*)\, a_2}{\|\beta^* - (a_2^T \beta^*)\, a_2\|}
\end{aligned} \tag{3.2}$$

Where $c$ is a scalar that determines the size of the neighborhood visited, and $v$ is a unit $p$-vector uniformly distributed on the unit $p$-dimensional sphere. The idea is to start with a global search and then to concentrate on the region of the global maximum by decreasing the value of $c$. After a specified number of steps, called half, without an increase of the projection index, the value of $c$ is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of $m$ such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane.

The steps for PPEDA are given below:

1. Sphere the OD set; let $Z$ be the matrix of the sphered data set.

2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and $\beta_0$ are orthonormal. Consider this as the current best plane $(\alpha^*, \beta^*)$.

3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.

4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).

5. Calculate the projection index for these candidate planes.

6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane $(\alpha^*, \beta^*)$.

7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.

8. If the index does not improve for a certain number of steps (half), then decrease the value of $c$ by half.

9. Repeat steps 4 to 8 until $c$ becomes some small number (say 0.01).
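The steps above can be sketched as a short Python routine. This is only an illustration under assumptions: the rows of `Z` are taken to be sphered already (step 1 is not shown), the projection index is passed in as a function `index(za, zb)`, and all names (`ppeda_search`, `candidates`, `demo_index`) are hypothetical. The `demo_index` included here is a simple kurtosis-based stand-in used only to make the sketch runnable; it is not the chi-square index of Eq. (3.1).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def random_unit(p, rng):
    """Unit p-vector uniformly distributed on the unit sphere."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(p)])

def orthonormal_to(a, beta):
    """Remove from beta its component along the unit vector a, then normalize."""
    proj = dot(a, beta)
    return unit([b - proj * x for x, b in zip(a, beta)])

def candidates(alpha, beta, v, c):
    """The two candidate planes (a1, b1) and (a2, b2) of Eq. (3.2)."""
    planes = []
    for sign in (1.0, -1.0):
        a = unit([x + sign * c * y for x, y in zip(alpha, v)])
        planes.append((a, orthonormal_to(a, beta)))
    return planes

def ppeda_search(Z, index, c=1.0, half=30, c_min=0.01, rng=None):
    """One random search (steps 2 to 9); Z holds the sphered observations as rows."""
    rng = rng or random.Random()
    p = len(Z[0])

    def evaluate(a, b):
        return index([dot(z, a) for z in Z], [dot(z, b) for z in Z])

    alpha = random_unit(p, rng)                        # step 2: random starting plane
    beta = orthonormal_to(alpha, random_unit(p, rng))
    best = evaluate(alpha, beta)                       # step 3
    stalls = 0
    while c > c_min:                                   # step 9
        planes = candidates(alpha, beta, random_unit(p, rng), c)    # step 4
        values = [evaluate(a, b) for a, b in planes]                # step 5
        k = 0 if values[0] >= values[1] else 1
        if values[k] > best:                           # step 6: keep an improving plane
            (alpha, beta), best, stalls = planes[k], values[k], 0
        else:
            stalls += 1
        if stalls >= half:                             # step 8: shrink the neighborhood
            c, stalls = c / 2.0, 0
    return alpha, beta, best

def demo_index(za, zb):
    """Stand-in index (absolute excess kurtosis), NOT the chi-square index."""
    def excess(x):
        mean = sum(x) / len(x)
        var = sum((t - mean) ** 2 for t in x) / len(x)
        return abs(sum((t - mean) ** 4 for t in x) / len(x) / var ** 2 - 3.0)
    return excess(za) + excess(zb)
```

A complete search would call `ppeda_search` m times with different random starts and keep the plane with the highest index value.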

3.2.2.3 Structure removal

There may be more than one interesting projection, and there may be other

views that reveal insights about the hyperspectral data. To locate other views,

Friedman (1987) proposed a method called structure removal. In this approach, first

we perform the PP algorithm on the data set to obtain the structure which means the

optimal projection plane. The approach then removes the structure found at that

projection, and repeats the projection pursuit process to find a projection that yields

another maximum value of the projection pursuit index. By proceeding in this


manner, it will give a sequence of projections providing informative views of the data.

The procedure repeatedly transforms the projected data to standard normal until

they stop becoming more normal as measured by the projection pursuit index. One

starts with a $p \times p$ matrix, where the first two rows of the matrix are the vectors of the projection obtained from PPEDA. The rest of the rows have '1' on the diagonal and '0' elsewhere. For example, if $p = 4$, then

$$U^* = \begin{bmatrix}
\alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\
\beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} \tag{3.3}$$

The Gram-Schmidt orthonormalization process (Strang, 1988) makes the rows of $U^*$ orthonormal. Let $U$ be the orthonormalized version of $U^*$. The next step in the structure removal process is to transform the $Z$ matrix using the following equation,

$$T = U Z^T \tag{3.4}$$

Where $T$ is a $p \times n$ matrix. With this transformation, the first two rows of $T$ hold the projections of every observation onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then performed by applying a transformation $\Theta$ that transforms the first two rows of $T$ to a standard normal distribution and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows,

$$\begin{aligned}
\Theta(T_1) &= \phi^{-1}\left[F(T_1)\right] \\
\Theta(T_2) &= \phi^{-1}\left[F(T_2)\right] \\
\Theta(T_i) &= T_i, \quad i = 3, 4, \ldots, p
\end{aligned} \tag{3.5}$$

Where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$ and $T_2$ are the first two rows of the matrix $T$, and $F$ is a function defined through Eqs. (3.7) and (3.8). From Eq. (3.5), it is seen that only the first two rows of $T$ change. $T_1$ and $T_2$ can be written as,

$$\begin{aligned}
T_1 &= \left(z_1^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*}\right) \\
T_2 &= \left(z_1^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*}\right)
\end{aligned} \tag{3.6}$$

Where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the $j$th observation projected onto the plane spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as follows

$$\begin{aligned}
\tilde{z}_j^{1(t)} &= z_j^{1(t)} \cos\gamma + z_j^{2(t)} \sin\gamma \\
\tilde{z}_j^{2(t)} &= z_j^{2(t)} \cos\gamma - z_j^{1(t)} \sin\gamma
\end{aligned} \tag{3.7}$$

Where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_j^{1(t)}$ represents the $j$th element of $T_1$ at the $t$th iteration of the process. The following transformation is then applied to the rotated points of Eq. (3.7); it replaces each rotated observation by its normal score in the projection.

$$\begin{aligned}
z_j^{1(t+1)} &= \phi^{-1}\left\{ \frac{r\!\left(\tilde{z}_j^{1(t)}\right) - 0.5}{n} \right\} \\
z_j^{2(t+1)} &= \phi^{-1}\left\{ \frac{r\!\left(\tilde{z}_j^{2(t)}\right) - 0.5}{n} \right\}
\end{aligned} \tag{3.8}$$

Where $r\!\left(\tilde{z}_j^{1(t)}\right)$ represents the rank of $\tilde{z}_j^{1(t)}$.

With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed using this process, the data are transformed back using the following equation,

$$Z' = U^T \Theta\left(U Z^T\right) \tag{3.9}$$

From Matrix Theory (Strang, 1988), it is known that all directions that are

orthogonal to the structure (i.e., all rows of T other than the first two) have not been

changed, whereas the structure has been Gaussianized and then transformed back.
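The Gaussianization of the two structure rows (Eqs. (3.7) and (3.8)) can be sketched as follows. This is a minimal illustration, assuming `t1` and `t2` are the first two rows of $T$ given as plain lists; the function names are hypothetical, and the sketch follows only the rotate-then-rank steps as written above, with the number of complete iterations left as a user parameter.

```python
import math
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by phi^{-1}((rank - 0.5) / n), ranks starting at 1 (Eq. 3.8)."""
    n = len(values)
    order = sorted(range(n), key=lambda j: values[j])
    inverse_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, j in enumerate(order, start=1):
        scores[j] = inverse_cdf((rank - 0.5) / n)
    return scores

def remove_structure(t1, t2, iterations=5):
    """Gaussianize the two rows of T that span the structure plane."""
    for _ in range(iterations):
        for gamma in (0.0, math.pi / 4, math.pi / 8, 3 * math.pi / 8):
            cg, sg = math.cos(gamma), math.sin(gamma)
            r1 = [a * cg + b * sg for a, b in zip(t1, t2)]    # Eq. (3.7), first row
            r2 = [b * cg - a * sg for a, b in zip(t1, t2)]    # Eq. (3.7), second row
            t1, t2 = normal_scores(r1), normal_scores(r2)     # Eq. (3.8)
    return t1, t2
```

After the call, both rows contain exactly the set of normal scores, so any clustering or skewness that was visible in the plane is no longer present in its two marginals.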

The next section summarizes the steps of PP.

3.2.2.4 Steps of PP

1. Load the data and set the values of the parameters: the number of best projection planes (N), the number of random starts (m), the value of c, and half.

2. Sphere the data and obtain the Z matrix.

3. Find each of the desired number of projection planes (structures) (Section 3.2.2.2) using the Posse chi-square index.

4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.

5. Continue the process until the desired number of best projection planes (orthogonal to each other) is obtained.

3.2.3 Kernel principal component analysis (KPCA)

Kernel principal component analysis (KPCA) means conducting PCT in feature

space (kernel space). KPCA is applied on the variables which are nonlinearly related

to the input variables. In this section KPCA algorithm has been described through

PCA algorithm.

First, $m$ TP ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the following covariance matrix,

$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T \tag{3.10}$$

The covariance matrix $C$ is positive semidefinite; hence, its eigenvalues are non-negative.

$$\lambda v = C v \tag{3.11}$$

For PCA, first sort the eigenvalues in decreasing order and find the corresponding eigenvectors. Then project the test point onto the eigenvectors. PCs are obtained in this manner. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) in Eq. (3.11),

$$C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T v = \lambda v$$

Thus

$$v = \frac{1}{m\lambda} \sum_{j=1}^{m} x_j x_j^T v = \frac{1}{m\lambda} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \tag{3.12}$$

since $(x x^T) v = (x \cdot v)\, x$.

In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions $v$ with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.

$$v = \sum_{i=1}^{m} \alpha_i x_i \tag{3.13}$$

Steps for KPCA

1. For KPCA, first transform the TPs using a kernel function ($\Phi$) to feature space ($H$). The data set ($\Phi(x_i)$, $i = 1, \ldots, m$) in feature space is assumed to be centered to reduce the complexity of the calculation. The covariance matrix of the data set in $H$ takes the following form

$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^T \tag{3.14}$$

2. Find the eigenvalues $\lambda \geq 0$ and the corresponding nonzero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix $C$ from the equation,

$$\lambda v = C v \tag{3.15}$$

3. As shown previously (for PCA), all solutions $v$ with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.,

$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.16}$$

Therefore,

$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.17}$$

Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),

$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) \Phi(x_i)^T \Phi(x_j) \tag{3.18}$$

4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Substituting this in Eq. (3.18), the following equation is obtained.

$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) K(x_i, x_j) \tag{3.19}$$

5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^T$ for all $k = 1, \ldots, m$. Define the $m \times m$ matrix $K$, called the kernel matrix, whose $ij$th element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length $m$, whose $j$th element is the coefficient $\alpha_j$.

6. Finally, Eq. (3.19) can be written as,

$$\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_j) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_i) \Phi(x_i)^T \Phi(x_j), \quad \forall k = 1, 2, \ldots, m \tag{3.20}$$

Now Eq. (3.20) can be transformed as (using $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$),

$$m \lambda K \alpha = K^2 \alpha \tag{3.21}$$

To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved,

$$m \lambda \alpha = K \alpha \tag{3.22}$$

7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel matrix $K$. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigenvalues of $K$ and $\beta_1, \beta_2, \ldots, \beta_m$ be the corresponding set of eigenvectors, with $\lambda_p$ being the last nonzero eigenvalue.

Figure 3.4: (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).

8. To extract principal components, it is needed to compute the projection onto the eigenvectors $\beta_n$ in $H$ ($n = 1, \ldots, p$). Let $x$ be a test point, with an image $\Phi(x)$ in $H$. Then

$$\langle \beta_n, \Phi(x) \rangle = \sum_{i=1}^{m} \beta_{n,i} \langle \Phi(x_i), \Phi(x) \rangle \tag{3.23}$$

9. In the above algorithm, it has been assumed that the data set is centered, but it is certainly difficult to obtain the mean of the mapped data in feature space $H$ (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in feature space. However, there is a way to do it by slightly modifying the equation for kernel PCA. It is needed to diagonalize the kernel matrix $\tilde{K}$,

$$\tilde{K}_{ij} = \left(K - 1_m K - K 1_m + 1_m K 1_m\right)_{ij}, \quad \text{where } (1_m)_{ij} := \frac{1}{m} \;\; \forall i, j \tag{3.24}$$

Figure 3.5 provides the outline of the KPCA algorithm.


Figure 3.5: Outline of KPCA algorithm
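A compact sketch of the KPCA steps for the first component is given below. It is an illustration under assumptions, not the implementation used in this work: a Gaussian (RBF) kernel with a hypothetical width `gamma` is chosen, the leading eigenpair of the centered kernel matrix of Eq. (3.24) is found by plain power iteration instead of a full eigen-decomposition, and the expansion coefficients are rescaled by $1/\sqrt{\mu}$ (with $\mu$ the eigenvalue of $\tilde{K}$), the usual normalization that gives the eigenvector unit length in feature space. All function names are hypothetical.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian kernel; gamma is an assumed width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_matrix(X, kernel=rbf):
    return [[kernel(xi, xj) for xj in X] for xi in X]

def center(K):
    """Eq. (3.24): K~ = K - 1m K - K 1m + 1m K 1m with (1m)_ij = 1/m."""
    m = len(K)
    row = [sum(r) / m for r in K]     # row means (K is symmetric, so also column means)
    total = sum(row) / m              # grand mean
    return [[K[i][j] - row[i] - row[j] + total for j in range(m)] for i in range(m)]

def leading_eigenpair(A, iterations=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    m = len(A)
    b = [math.sin(i + 1.0) for i in range(m)]        # arbitrary deterministic start
    for _ in range(iterations):
        Ab = [sum(A[i][j] * b[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in Ab))
        b = [v / norm for v in Ab]
    mu = sum(b[i] * sum(A[i][j] * b[j] for j in range(m)) for i in range(m))
    return mu, b

def first_component(X, kernel=rbf):
    """Eigenvalue, eigenvector and training-point scores of the first kernel PC."""
    Kc = center(kernel_matrix(X, kernel))
    mu, beta = leading_eigenpair(Kc)
    alpha = [v / math.sqrt(mu) for v in beta]        # unit-norm eigenvector in H
    m = len(X)
    scores = [sum(alpha[i] * Kc[i][j] for i in range(m)) for j in range(m)]
    return mu, beta, scores
```

For two well separated groups of points, the sign of the first-component score is already enough to tell the groups apart, which mirrors the behaviour illustrated in Figure 3.4.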

3.2.4 Orthogonal subspace projection (OSP)

The idea of orthogonal subspace projection is to eliminate all unwanted or undesired spectral

signatures (background) within a pixel, then use a matched filter to extract the

desired spectral signature (endmember) present in that pixel.


3.2.4.1 Automated target generation process algorithm (ATGP)

In hyperspectral image analysis a pixel may encompass many different materials; such pixels are called mixed pixels. A mixed pixel contains multiple spectral signatures. Let a column vector $r_i$ represent the $i$th mixed pixel by the linear model,

$$r_i = M \alpha_i + n_i \tag{3.25}$$

where $r_i$ is an $l \times 1$ column vector and $l$ is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember. Assume that there are $p$ spectrally distinct endmembers in the $i$th mixed pixel. $M$ is a matrix of dimension $l \times p$ made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$. The system is considered overdetermined ($l > p$), and $m_j$ denotes the spectral signature of the $j$th distinct material or endmember. Let $\alpha_i$ be a $p \times 1$ column vector given by $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^T$, where the $j$th element represents the fraction of the $j$th signature present in the $i$th mixed pixel. $n_i$ is an $l \times 1$ column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where $I$ is an $l \times l$ identity matrix.

In Eq. (3.25), the $r_i$'s are assumed to be linear combinations of the $p$ endmembers, with the weight coefficients given by the fraction vector $\alpha_i$. The term $M\alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures; in other words, targets are separated from the background. In searching for a single spectral signature this can be written as:

$$M\alpha = d\alpha_p + U\gamma \tag{3.26}$$

Where $d$ is an $l \times 1$ column vector, the desired signature of interest (the column vector $m_p$), while $\alpha_p$ is $1 \times 1$, the fraction of the desired signature. The matrix $U$ is composed of the remaining column vectors of $M$. These are the undesired spectral signatures or background information. It is given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components (fractions) of $\alpha$.


Suppose $P$ is an operator which eliminates the effects of $U$, the undesired signatures. To do this, an operator (orthogonal subspace operator) has been developed that projects $r$ onto a subspace that is orthogonal to the columns of $U$. This results in a vector that only contains energy associated with the target $d$ and noise $n$. The operator used is the $l \times l$ matrix

$$P = I - U\left(U^T U\right)^{-1} U^T \tag{3.27}$$

The operator $P$ maps $d$ into a space orthogonal to the space spanned by the uninteresting signatures in $U$. Now apply the operator $P$ to the mixed pixel $r$ from Eq. (3.25):

$$Pr = Pd\alpha_p + PU\gamma + Pn \tag{3.28}$$

It should be noticed that $P$ operating on $U\gamma$ reduces the contribution of $U$ to zero (close to zero in real data applications). Therefore, the above rearranges to

$$Pr = Pd\alpha_p + Pn \tag{3.29}$$

3.2.4.2 Signal-to-Noise Ratio (SNR) Maximization

The second step in deriving the pixel classification operator is to find the $1 \times l$ operator $X^T$ that maximizes the SNR. Operating on Eq. (3.28) gives

$$X^T P r = X^T P d \alpha_p + X^T P U \gamma + X^T P n \tag{3.30}$$

The operator $X^T$ acting on $Pr$ produces a scalar (Ientilucci, 2001). The SNR is given by,

$$\lambda = \frac{X^T P d \, \alpha_p^2 \, d^T P^T X}{X^T P \, E\left[n n^T\right] P^T X} \tag{3.31}$$

$$\lambda = \left(\frac{\alpha_p^2}{\sigma^2}\right) \frac{X^T P d d^T P^T X}{X^T P P^T X} \tag{3.32}$$

where $E[\cdot]$ denotes the expected value. Maximization of this quotient is the generalized eigenvector problem

$$P d d^T P^T X = \tilde{\lambda} P P^T X \tag{3.33}$$

where $\tilde{\lambda} = \lambda \left(\sigma^2 / \alpha_p^2\right)$. The value of $X^T$ which maximizes $\lambda$ can be determined in general using techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^T$ which maximizes the SNR is

$$X^T = k d^T \tag{3.34}$$

where $k$ is an arbitrary scalar. Substituting Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$ vector

$$q^T = d^T P \tag{3.35}$$

This result first nulls the interfering signatures, and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar which is a measure of the presence of the signature of interest. The ultimate aim is to reduce the $l$ images that make up the hyperspectral image cube into a single image where pixels with high intensity indicate the presence of the desired signature.

This operator can be easily extended to seek out $k$ signatures of interest. The vector operator simply becomes a $k \times l$ matrix operator which is given by,

$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \tag{3.36}$$

When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Thus, for $k$ desired signatures, the $l$-dimensional hyperspectral image is reduced to a $k$-dimensional feature extracted image in which each band corresponds to one desired signature, and pixels with high intensity in a band indicate the presence of that signature.

The above algorithm is illustrated with the following example. Let us start with three vectors or classes, each six elements or bands long. The vectors are in reflectance units and can be seen below.

$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \quad
\text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \quad
\text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$

Suppose the image consists of 100 pixels, numbered from left to right. Let the 40th pixel be

$$pixel_{40} = (0.08)\,\text{concrete} + (0.75)\,\text{tree} + (0.07)\,\text{water} + \text{noise} \tag{3.37}$$

Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image. The same procedure can be followed to extract the tree and water materials.

Assume that $pixel_{40}$ is made up of some weighted linear combination of endmembers.

$$pixel_{40} = M\alpha + \text{noise} \tag{3.38}$$

Now $M\alpha$ can be broken up into the desired signature, $d\alpha_p$, and the undesired signatures, $U\gamma$. Let concrete be the vector $d$, and let tree and water be the column vectors of the matrix $U$. The mixing fractions are unknown; it is only known that $pixel_{40}$ is made up of some combination of $d$ and $U$.

$$d = [\text{concrete}] \quad \text{and} \quad U = [\text{tree}, \text{water}]$$

Now it is required to reduce the effect of $U$. To do this, a projection operator $P$ is needed that, when applied to $U$, reduces its contribution to zero. To find concrete, $d$, $pixel_{40}$ is projected onto a subspace that is orthogonal to the columns of $U$ using the operator $P$. In other words, $P$ maps $d$ into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of $U$. If $P$ is applied to $U$, which contains tree and water, it is seen that the effect of $U$ is removed:

$$PU = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \tag{3.39}$$

Now let $r_1 = pixel_{40}$ and $n$ = noise; then, from Eq. (3.29),

$$P r_1 = P d \alpha_p + P n \tag{3.40}$$

Next, the operator $X^T$ that maximizes the signal-to-noise ratio (SNR) needs to be found. The operator $X^T$ acting on $P r_1$ produces a scalar. As stated before, the value of $X^T$ which maximizes the SNR is $X^T = k d^T$. This leads to the overall OSP operator (Eq. (3.35)). In this way the matrix $Q$ in Eq. (3.36) can be formed. The entire data set can then be projected along the columns of $Q$ to form the OSP feature extracted image.
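The worked example can be checked numerically with a short sketch. It assumes, as in the example, that concrete is the desired signature $d$, that tree and water form the two columns of $U$, and that the noise is zero; the mixture uses the fractions 0.08, 0.75 and 0.07 quoted in Eq. (3.37). The helper names (`rejection_operator`, `osp_operator`) are hypothetical, and the closed-form $2 \times 2$ inverse is only valid for a two-column $U$.

```python
CONCRETE = [0.26, 0.30, 0.31, 0.31, 0.31, 0.31]
TREE = [0.07, 0.07, 0.11, 0.54, 0.55, 0.54]
WATER = [0.07, 0.13, 0.19, 0.25, 0.30, 0.34]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rejection_operator(u1, u2):
    """P = I - U (U^T U)^{-1} U^T for a two-column U = [u1, u2] (Eq. 3.27)."""
    l = len(u1)
    g11, g12, g22 = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    det = g11 * g22 - g12 * g12
    inv = [[g22 / det, -g12 / det], [-g12 / det, g11 / det]]     # (U^T U)^{-1}
    cols = (u1, u2)
    P = [[1.0 if i == j else 0.0 for j in range(l)] for i in range(l)]
    for i in range(l):
        for j in range(l):
            P[i][j] -= sum(cols[a][i] * inv[a][b] * cols[b][j]
                           for a in range(2) for b in range(2))
    return P

def osp_operator(d, P):
    """q^T = d^T P (Eq. 3.35)."""
    return [sum(d[i] * P[i][j] for i in range(len(d))) for j in range(len(d))]

P = rejection_operator(TREE, WATER)
q = osp_operator(CONCRETE, P)

# Noise-free mixed pixel built from the fractions of Eq. (3.37).
pixel40 = [0.08 * c + 0.75 * t + 0.07 * w for c, t, w in zip(CONCRETE, TREE, WATER)]
score = dot(q, pixel40)      # proportional to the concrete fraction only
```

Because $P$ annihilates the tree and water signatures, `score` equals 0.08 times $d^T P d$; dividing by $d^T P d$ would recover the concrete fraction itself in this noise-free case.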

3.3 Supervised classifier

This section describes the mathematical background of supervised classifiers.

First, it will describe the Bayesian decision rule followed by the decision rule for Gaussian maximum likelihood classifier (GML). Afterwards it will describe the k-

nearest neighbor (KNN) and Support vector machine (SVM) classification rules.

3.3.1 Bayesian decision rule

In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal in the sense of minimizing the probability of misclassification (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are $K$ classes and let $f_k(\mathbf{x})$ be the distribution function of the $k$th class, where $1 \leq k \leq K$, and $P(c_k)$ is the prior probability of the $k$th class such that $\sum_{k=1}^{K} P(c_k) = 1$. For any class $k$, the posterior probability for a pixel vector $\mathbf{x}$ is denoted by $p(c_k | \mathbf{x})$ and defined by (assuming all classes are mutually exclusive):

$$p(c_k | \mathbf{x}) = \frac{P(\mathbf{x} | c_k) P(c_k)}{\sum_{j=1}^{K} f_j(\mathbf{x}) P(c_j)} \tag{3.41}$$

where $P(\mathbf{x} | c_k) = f_k(\mathbf{x})$. Therefore, the Bayes decision rule is:

$$\mathbf{x} \in c_i \;\; \text{if} \;\; p(c_i | \mathbf{x}) = \max_k \, p(c_k | \mathbf{x}) \tag{3.41a}$$

3.3.2 Gaussian maximum likelihood classification (GML):

The Gaussian maximum likelihood classifier assumes that the distribution of the data points is Gaussian (normally distributed) and classifies an unknown pixel based on the variance and covariance of the spectral response patterns. This classification is based on the probability density function associated with the training data. Pixels are assigned to the most likely class based on a comparison of the posterior probability that they belong to each of the signatures being considered. Under this assumption, the distribution of a category response pattern can be completely described by the mean vector and the covariance matrix. With these parameters, the statistical probability of a given pixel value being a member of a particular land cover class can be computed (Lillesand et al., 2002). GML classification can obtain minimum classification error under the assumption that the spectral data of each class is normally distributed. It considers not only the cluster centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrix of the clusters. The decision boundary function for the GML classification is:

$$g_k(\mathbf{x}) = -\frac{1}{2} \left[ \ln\left|\hat{\Sigma}_k\right| + (\mathbf{x} - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x} - \hat{\mu}_k) \right] \tag{3.42}$$

where $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are the mean vector and covariance matrix estimated for the $k$th class. The final Bayesian decision rule is:

$$\mathbf{x} \in c_j \;\; \text{if} \;\; g_j(\mathbf{x}) = \max_k \, g_k(\mathbf{x})$$

where $g_k(\mathbf{x})$ is the decision boundary function for the $k$th class.
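The decision rule built on Eq. (3.42) can be sketched as follows. This is a minimal illustration and not the classifier code used in this work: class means and covariances are estimated from the TP with the unbiased (n − 1) estimator, which is an assumption, and $\ln|\hat{\Sigma}_k|$ and the quadratic form are evaluated through a Cholesky factor, which requires each class covariance to be positive definite (enough TP per class). All function names are hypothetical.

```python
import math

def mean_and_cov(samples):
    """Sample mean vector and unbiased covariance matrix of a list of pixel vectors."""
    n, d = len(samples), len(samples[0])
    mu = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in samples) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mu, cov

def cholesky(S):
    """Lower-triangular L with L L^T = S, for a positive definite S."""
    d = len(S)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def discriminant(x, mu, L):
    """g_k(x) of Eq. (3.42), using the Cholesky factor L of the class covariance."""
    diff = [a - b for a, b in zip(x, mu)]
    y = []
    for i in range(len(diff)):                 # forward substitution: solve L y = diff
        y.append((diff[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))
    return -0.5 * (log_det + sum(v * v for v in y))

def train(training_pixels):
    """training_pixels: dict mapping a class label to its list of pixel vectors."""
    model = {}
    for label, samples in training_pixels.items():
        mu, cov = mean_and_cov(samples)
        model[label] = (mu, cholesky(cov))
    return model

def classify(x, model):
    """Assign x to the class with the largest g_k(x)."""
    return max(model, key=lambda label: discriminant(x, *model[label]))
```

With many bands and few TP per class the estimated covariance becomes singular and the Cholesky step fails, which is one practical reason for applying feature extraction before GML classification.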

3.3.3 k – nearest neighbor classification

The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has been proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical applications. One of its shortcomings is lazy learning: all computation is deferred until classification time, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951).

The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant. An unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.

Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is classified to first class (3 squares vs. 2 triangles inside the outer circle).If k = 11, it is classified to first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).

Let $x$ be an $n$-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) be the $n$-dimensional TP. The Euclidean distance between them is defined by:

$$d(x, y_i) = \sqrt{(x_{11} - y_{i1})^2 + (x_{12} - y_{i2})^2 + \cdots + (x_{1n} - y_{in})^2} \tag{3.43}$$

Where $x = (x_{11}, x_{12}, \ldots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$ and $D = \{d_1, d_2, \ldots, d_p\}$; $p$ is the number of TP.

The final KNN decision rule is:

$$x \in c_j \;\; \text{if the number of the } k \text{ minimum elements of } D \text{ corresponding to } c_j \text{ is at least} \;
\begin{cases}
\left[\dfrac{k}{2}\right] + 1, & k \text{ even} \\[2mm]
\left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd}
\end{cases} \tag{3.44}$$

In case of a tie, the test pixel is assigned to the class $c_j$ whose mean vector is closest to it. Here $k \in \{1, 2, \ldots, p\}$ is a user-defined parameter that specifies the number of nearest neighbors used for classification. The outline of the KNN classification algorithm is given in Figure 3.7.

Figure 3.7: Outline of KNN algorithm
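The KNN rule described above, including the tie-break on the distance to the class mean, can be sketched as follows. This is a minimal illustration with hypothetical names; it uses a plain majority vote among the k nearest TP, which coincides with the threshold of Eq. (3.44) in the two-class case.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Eq. (3.43): Euclidean distance between two pixel vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(x, training_pixels, labels, k):
    # Rank the TP by their distance to the test pixel (the set D).
    order = sorted(range(len(training_pixels)),
                   key=lambda i: euclidean(x, training_pixels[i]))
    votes = Counter(labels[i] for i in order[:k])
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]

    # Tie: fall back to the class whose mean vector is closest to x.
    def class_mean(c):
        members = [p for p, lab in zip(training_pixels, labels) if lab == c]
        return [sum(col) / len(members) for col in zip(*members)]

    return min(tied, key=lambda c: euclidean(x, class_mean(c)))
```

Every call sorts all TP by distance, so the cost of classifying one pixel grows with the number of TP and the number of bands; this is the lazy-learning cost mentioned earlier.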

3.3.4 Support vector machine (SVM):

The foundations of Support Vector Machines (SVM) were developed by Vapnik (1995). The formulation embodies the Structural Risk Minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional Empirical Risk Minimization (ERM) principle employed by conventional neural networks. SRM minimizes an upper bound on the expected risk, as opposed to ERM, which minimizes the error on the training data. SVMs were developed to solve the classification problem, but they have since been extended to the domain of regression problems (Vapnik et al., 1997).

SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the classes of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions.

(i) All the data vectors that belong to the same class are placed on the same side of the separating hyperplane.

(ii) The distance between the closest data vectors of the two classes is maximized (Vapnik, 1982).

The main aim of SVM is to define an optimum hyperplane between two classes which maximizes the margin between them. For each class, the data vectors forming the boundary of the class are called the support vectors (SV) and the hyperplane is called the decision surface (Pal, 2002).

3.3.4.1 Statistical learning theory

The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from training data with known classes and predicting the outcome for data points with unknown identity. Two induction principles are used for this purpose. The first is called ERM, whose aim is to reduce the training error, and the second is called SRM, whose goal is to minimize the upper bound on the expected error on the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999). First, it does not depend on the unknown cumulative distribution function. Secondly, it can be minimized with respect to the parameter which is used in the decision rule.

3.3.4.2 Vapnik-Chervonenkis dimension (VC-dimension):

The VC-dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by $h$, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways. For example, for an arbitrary two-class classification problem, the VC-dimension is the maximum number of points $k$ which can be separated into two classes without error in all possible $2^k$ ways (Varshney and Arora, 2004).

3.3.4.3 Support vector machine algorithm with quadratic optimization method (SVM_QP):

The procedure for obtaining a separating hyperplane by SVM is explained for the simple linearly separable case of two classes, which can be extended to the multiclass classification problem. The procedure can then be extended, through the kernel method for SVM, to the case where a hyperplane cannot separate the two classes.

Let there be $n$ training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, $m$ is the dimension of the data vector, and each sample belongs to either of the two classes labeled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in $m$-dimensional space whose orientation is given by a vector $w$ and whose location is determined by a scalar $b$, the offset of this hyperplane from the origin (Figure 3.8). If such a hyperplane exists, then the given set of training data points must satisfy the following inequalities:

$$w \cdot x_i + b \geq +1, \quad \forall i : y_i = +1 \tag{3.45}$$

$$w \cdot x_i + b \leq -1, \quad \forall i : y_i = -1 \tag{3.46}$$

Thus, the equation of the hyperplane is given by $w \cdot x + b = 0$.


Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after Gunn, 1998).

The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality as:

$$y_i (w \cdot x_i + b) \geq 1 \tag{3.47}$$

Thus, the decision rule for the linearly separable case can be defined in the following form:

$$x_i \in \operatorname{sign}(w \cdot x_i + b) \tag{3.48}$$

Where $\operatorname{sign}(\cdot)$ is the signum function, whose value is +1 for any element greater than or equal to zero and –1 if it is less than zero. The signum function, thus, can easily represent the two classes given by labels +1 and –1.

The separating hyperplane (Figure 3.8) will be able to separate the two classes

optimally when its margin from both the classes is equal and maximum (Varshney,

2004) i.e. the hyperplane should be located exactly in the middle of the two classes.

50

The distance $D(x; w, b)$ is used to express the margin of separation, or margin, of a point $x$ from the hyperplane defined by $w$ and $b$. It is given by

$D(x; w, b) = \frac{|w \cdot x + b|}{\|w\|_2}$ (3.49)

where $\|\cdot\|_2$ denotes the 2-norm, which is the Euclidean length of the vector for which it is computed, and $|\cdot|$ is the absolute value function. Let $d$ be the value of the margin between the two separating planes. To maximize the margin, express the value of $d$ as

$d = \frac{|w \cdot x + b + 1|}{\|w\|_2} - \frac{|w \cdot x + b - 1|}{\|w\|_2} = \frac{2}{\|w\|_2} = \frac{2}{\sqrt{w^T w}}$ (3.49a)

To obtain an optimal hyperplane, the margin $d$, i.e. $\frac{2}{\|w\|_2}$, should be maximized, which is equivalent to minimizing the 2-norm of the vector $w$. Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to

$\Phi(w) = \frac{1}{2} w^T w$ (3.50)

A constrained optimization problem can be constructed for minimizing the objective function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of constrained optimization problem, with a convex objective function of $w$ and linear constraints, is called a primal problem and can be solved using standard Quadratic Programming (QP) optimization techniques. The QP optimization can be implemented by replacing the inequalities with a simpler form, transforming the problem into a dual space representation using Lagrange multipliers ($\lambda_i$) (Leunberger, 1984). The vector $w$ can be defined in terms of the Lagrange multipliers ($\lambda_i$) as shown:

$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0$ (3.51)

The dual optimization problem obtained using the Lagrange multipliers ($\lambda_i$) thus becomes

$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j)$ (3.52)

subject to the constraints:

$\sum_{i=1}^{n} \lambda_i y_i = 0$ (3.53)

$\lambda_i \ge 0, \quad i = 1, 2, \ldots, n$ (3.54)

The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality condition (Taylor, 2000), some of the Lagrange multipliers will be zero. The training samples whose multipliers are nonzero are called support vectors (SVs). The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^{o} = (\lambda_1^{o}, \lambda_2^{o}, \ldots, \lambda_{n_s}^{o})$, where $n_s$ is the number of support vectors found. Substituting these in Eq. (3.51) gives the orientation of the optimal separating hyperplane ($w^{o}$) as

$w^{o} = \sum_{i=1}^{n} \lambda_i^{o} y_i x_i$ (3.55)

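The offset from the origin ($b^{o}$) is determined from the equation given below:

$b^{o} = -\frac{1}{2} \left[ w^{o} \cdot x_{+1}^{o} + w^{o} \cdot x_{-1}^{o} \right]$ (3.56)

where $x_{+1}^{o}$ and $x_{-1}^{o}$ are support vectors of the classes labeled +1 and –1 respectively. The negative sign follows from Eq. (3.47) holding with equality at these two support vectors, i.e. $w^{o} \cdot x_{+1}^{o} + b^{o} = +1$ and $w^{o} \cdot x_{-1}^{o} + b^{o} = -1$. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and –1:

$f(x) = \mathrm{sign}\left( \sum_{\text{support vectors}} \lambda_i^{o} y_i (x_i \cdot x) + b^{o} \right)$ (3.57)

Eq. (3.57) implies that

$x \in \mathrm{sign}\left( \sum_{\text{support vectors}} \lambda_i^{o} y_i (x_i \cdot x) + b^{o} \right)$ (3.58)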
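As a minimal sketch of how the decision rule is assembled from the multipliers, the following Python example uses a hypothetical two-point training set whose optimal multipliers are known by hand. The data, the multiplier values and the function name `classify` are illustrative assumptions and are not taken from the experiments of this thesis; the offset is computed with the sign that places both support vectors exactly on their margins.

```python
import numpy as np

# Toy support vectors (assumed for illustration): one per class.
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])  # multipliers; note sum(lam * y) = 0

# Orientation of the hyperplane as a weighted sum of support vectors.
w = (lam * y_sv) @ X_sv

# Offset from one support vector of each class; the minus sign follows
# from w.x + b = +1 and w.x + b = -1 holding at those two points.
b = -0.5 * (w @ X_sv[0] + w @ X_sv[1])


def classify(x):
    """Sign of the decision function written in terms of support vectors."""
    score = np.sum(lam * y_sv * (X_sv @ x)) + b
    return 1 if score >= 0 else -1


label_a = classify(np.array([2.0, 0.0]))
label_b = classify(np.array([-3.0, 1.0]))
```

Only dot products between the support vectors and the query point appear in `classify`, which is what later allows them to be replaced by a kernel function.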


Generally, it may not be possible to separate the classes optimally by a linear hyperplane, and a non-linear manifold in hyperspace would be required for optimal separation among the classes. The data present in $m$-dimensional space can be mapped into a higher dimensional space, where it spreads out and can be separated by a linear hyperplane in that space, as shown in Figure 3.9.

Suppose the non-linear transformation function $\phi$ maps the data into a higher dimensional space, where a data point $x$ in the original $m$-dimensional space is represented as $\phi(x)$. The dual optimization problem in Eq. (3.52) is then modified as:

$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (\phi(x_i) \cdot \phi(x_j))$ (3.59)

The computation of the dot product $\phi(x_i) \cdot \phi(x_j)$ is computationally very expensive, as the computations are done in a higher dimensional space. So, kernel functions are used to substitute the value of the dot product of the transformed vectors, according to Mercer's Theorem (Mercer, 1909). Suppose there exists a kernel function $K$ such that

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ (3.60)
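The identity in Eq. (3.60) can be checked numerically for a kernel whose feature map is known in closed form. The sketch below uses the homogeneous degree-2 polynomial kernel in two dimensions, chosen only because its explicit map is small; it is not one of the kernels evaluated in this thesis, and the names `phi` and `poly2_kernel` are illustrative.

```python
import numpy as np


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])


def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated entirely in the input space."""
    return float(np.dot(x, z)) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = poly2_kernel(x, z)               # input-space evaluation
rhs = float(np.dot(phi(x), phi(z)))    # feature-space dot product
```

Both values agree, while the kernel evaluation never forms the three-dimensional vectors. For kernels such as the Gaussian, the feature space is infinite dimensional and only the kernel route is practical.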

(a) Input space (b) Feature space

Figure 3.9: Non-linear mapping scheme. The nonlinear mapping $\phi$ transforms the pixels from input space to feature space; the $\phi(x_i)$ are pixels in feature space. Pixels that are not linearly separable in input space become linearly separable in feature space (Cristianini, 2000).


Putting Eq. (3.60) into Eq. (3.59), the modified form of the dual optimization problem becomes:

$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$ (3.61)

subject to the constraints:

$\sum_{i=1}^{n} \lambda_i y_i = 0$ (3.62)

Similarly, the final decision rule can be modified as:

$x \in \mathrm{sign}\left( \sum_{i=1}^{n_s} \lambda_i^{o} y_i K(x_i, x) + b^{o} \right)$ (3.63)

Some of the commonly used kernel functions for classification are presented in Table

3.2. Selection of suitable kernel function is essential for better classification of a

particular data set. The details on effects of different kernel functions on

classification accuracy are available in Varshney and Arora (2004).

SVMs were originally developed to perform binary classification. They have since been extended to multiclass classification, where the number of classes is more than two. Pal (2004) proposed two multiclass classification methods: the one-against-the-rest method and the pairwise classification method. In the first, $K$ binary classifiers are created for a $K$ class classification problem, and each classifier is trained to distinguish one class from the remaining $K - 1$ classes. The second approach considers one pair of classes at a time and performs SVM based binary classification, assigning all the pixels to one of the two classes under consideration. A total of $\frac{K(K-1)}{2}$ pairs of classes are possible for a $K$ class problem, and thus that many binary SVM classifiers have to be created. A pixel is finally assigned to the class chosen by the largest number of the $\frac{K(K-1)}{2}$ SVM classifiers (Varshney and Arora, 2004).
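The pairwise voting scheme can be sketched as follows. The binary classifiers here are hypothetical stand-ins (each simply picks the class label nearer to a scalar input) so that the example is self-contained; in practice each entry would be a trained binary SVM. The function name `pairwise_vote` and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(x, classes, binary_classifiers):
    """Assign x to the class that wins the most pairwise contests.

    binary_classifiers maps a class pair (a, b) to a function returning a or b.
    """
    votes = Counter(
        binary_classifiers[pair](x) for pair in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]


classes = list(range(8))                 # eight classes, K = 8
pairs = list(combinations(classes, 2))   # K (K - 1) / 2 pairs

# Hypothetical stand-in classifiers: choose the label closer to x.
toy = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - a) <= abs(x - b) else b)
    for a, b in pairs
}
winner = pairwise_vote(5.2, classes, toy)
```

For eight classes this creates 28 binary classifiers, compared with eight for the one-against-the-rest method, although each pairwise classifier is trained on the pixels of only two classes.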

Figure 3.10 shows summary of the SVM classification algorithm.


Figure 3.10: Brief description of SVM_QP algorithm


3.3.4.4 SMO optimization for SVM

Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly

solve the SVM QP problem without any extra matrix storage and without using

numerical QP optimization steps at all. SMO decomposes the overall QP problem into

QP sub-problems, using Osuna’s theorem (Osuna, 1997) to ensure convergence.

Unlike the previous methods, SMO chooses to solve the smallest possible

optimization problem at every step. For the standard SVM QP problem, the smallest

possible optimization problem involves two Lagrange multipliers, because the

Lagrange multipliers must obey a linear equality constraint. At every step, SMO

chooses two Lagrange multipliers to jointly optimize, finds the optimal values for

these multipliers, and updates the SVM to reflect the new optimal values. The

advantage of SMO lies in the fact that solving for two Lagrange multipliers can be

done analytically. Thus, numerical QP optimization is avoided entirely. Even though

more optimization sub-problems are solved in the course of the algorithm, each sub-

problem is so fast that the overall QP problem is solved quickly. In addition, SMO

requires no extra matrix storage at all. Thus, very large SVM training problems can

fit inside the memory of an ordinary personal computer or workstation. Because no

matrix algorithms are used in SMO, it is less susceptible to numerical precision

problems. There are two components to SMO: an analytic method for solving for the

two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.
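The analytic part of SMO can be illustrated with a single joint update of two multipliers. The sketch below follows the published two-multiplier update under the assumption that the multipliers have no upper bound, matching the constraint $\lambda_i \ge 0$ of Eq. (3.54); it omits the offset update and the pair-selection heuristic, and it is not the implementation inside the Matlab routine used in this thesis. The function name `smo_pair_update` and the toy data are illustrative.

```python
import numpy as np


def smo_pair_update(i, j, lam, y, K, b=0.0):
    """One analytic SMO step on the pair (i, j) for the hard-margin dual."""
    f = (lam * y) @ K + b                 # current outputs on all training points
    E_i, E_j = f[i] - y[i], f[j] - y[j]   # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0:
        return lam                        # degenerate pair: skipped in this sketch
    # Bounds keeping both multipliers non-negative along the constraint line.
    if y[i] != y[j]:
        low, high = max(0.0, lam[j] - lam[i]), np.inf
    else:
        low, high = 0.0, lam[i] + lam[j]
    new_j = float(np.clip(lam[j] + y[j] * (E_i - E_j) / eta, low, high))
    new_i = lam[i] + y[i] * y[j] * (lam[j] - new_j)
    out = lam.copy()
    out[i], out[j] = new_i, new_j
    return out


# Toy problem: two points, linear kernel, multipliers initialised to zero.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
K = X @ X.T
lam_new = smo_pair_update(0, 1, np.zeros(2), y, K)
```

Because the second multiplier is tied to the first through the equality constraint, the update reduces to a one-dimensional quadratic with a closed-form minimizer, which is then clipped to the feasible segment.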

In this thesis, all computations for the SMO optimization method were done with the Matlab built-in function “SVMSMOSET”.

3.3.4.4 KPCA-SVM

Nonlinear SVMs are generally more accurate than linear SVMs. However, they are slow, and the time taken for classification increases linearly with the number of SVs. Reduced set methods try to speed up SVM classification by reducing the number of SVs (Burges and Scholkopf, 1996). This section presents a technique for reducing the number of SVs using the KPCA algorithm (Sundaram, 2009). It should be kept in mind that the space spanned by the original set of SVs should always be equivalent to the space spanned by the reduced set of SVs. This is the criterion for choosing the minimum number of SVs to improve the classification time.


The solution of the optimization problem in Eq. (3.52) is obtained in terms of the Lagrange multipliers, and the SVs are extracted by solving Eq. (3.52). The algorithm for this method is stated below.

1. First choose an appropriate kernel function. Then calculate the kernel matrix $K_{xx}$ from the set of SVs $x_i$, $i = 1, 2, \ldots, N$:

$K_{xx}(i, j) = K(x_i, x_j)$ (3.64)

where $j = 1, 2, \ldots, N$.

2. Center the kernel matrix $K_{xx}$:

$K_{xx}^{c} = H K_{xx} H$ (3.65)

where $H = I - \frac{1}{N} I$, $I$ is the $N \times N$ identity matrix and $H$ is the centering matrix.

Sundaram (2009) used Eq. (3.65) to center the kernel matrix. However, according to the literature, the kernel matrix should be centered using Eq. (3.24), which is the standard procedure for centering a kernel matrix.

3. Perform kernel PCA by implementing an eigenvalue decomposition of the centered kernel matrix ($K_{xx}^{c}$):

$K_{xx}^{c} = A \Lambda A^T$ (3.66)

where $A$ is the matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues whose diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_N$.

4. Sort the eigenvalues and the corresponding eigenvectors. Discard eigenvalues smaller than a threshold. A value of $10^{-5}$ has been used in this thesis work. This was done to prevent numerical problems in the later stages of the algorithm.

5. Calculate the normalized principal directions:

$V_k = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{N} a_{jk} \tilde{\Phi}(x_j)$ (3.67)

where $\tilde{\Phi}(x_j) = \Phi(x_j) - \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i)$.

In matrix form this becomes:

$V = K A \Lambda^{-1/2}$ (3.68)

Select the first $M$ principal directions, which together retain 99% of the variance.


6. Calculate the new SVs by choosing the projections on the principal directions from a uniform distribution $U[-\sigma_k, +\sigma_k]$, where $\sigma_k = \sqrt{\lambda_k / N}$. In matrix form this becomes

$\tilde{V} = V R$ (3.69)

where $R = \frac{1}{\sqrt{N}} \Lambda^{1/2} U$ and $U$ is a matrix of points chosen from the uniform distribution $U[-1, +1]$.

7. Each column of $\tilde{V}$ corresponds to a new SV. Now project the images of the old SVs ($\Phi(x_i)$) along the direction of the new set of SVs (i.e. along the direction of the PCs):

$\Phi(z_k) = \sum_{i=1}^{N} \tilde{V}_{ik} \Phi(x_i)$ (3.70)

8. Calculate the approximate pre-images of the points obtained in the previous step ($\Phi(z_k)$) according to the formula given below (Scholkopf, 1996):

$z_k = \frac{\sum_{i=1}^{N} \tilde{V}_{ik} \left( \frac{1}{2} \left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right) \right) x_i}{\sum_{i=1}^{N} \tilde{V}_{ik} \left( \frac{1}{2} \left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right) \right)}$ (3.71)

where $k_{x_i} = [K(x_i, x_1) \; K(x_i, x_2) \; \ldots \; K(x_i, x_N)]^T$.

9. Calculate the new coefficients $\beta$ by solving

$K_{zz} \beta = K_{zx} \alpha$ (3.72)

This ensures that both SVMs produce the same results for all the $z_k$, $k = 1, 2, \ldots, M$ (Scholkopf and Mika, 1999).

Therefore, a new set of SVs $z_k$, $k = 1, 2, \ldots, M$, and the new coefficients $\beta_i$, $i = 1, 2, \ldots, M$, of the SVs are obtained. The general SVM classification algorithm is then applied on the new set of SVs. Figure 3.11 outlines the above algorithm.
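Steps 1 to 5 of the algorithm above can be sketched in Python as follows. This is an illustrative outline under stated assumptions: a Gaussian kernel with a hypothetical width parameter `gamma`, the common double-centering matrix $I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ in step 2, and synthetic points in place of real support vectors. The eigenvalue threshold of $10^{-5}$ and the 99% variance criterion are the values quoted in the text; the function names are not from the thesis code.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma=1.0):
    """Step 1: K(i, j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)


def kpca_directions(X, gamma=1.0, eig_tol=1e-5, var_keep=0.99):
    """Return retained eigenvalues and normalized expansion coefficients."""
    N = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    # Step 2: double-centering (assumed form of the centering matrix).
    H = np.eye(N) - np.ones((N, N)) / N
    Kc = H @ K @ H
    # Step 3: eigendecomposition of the symmetric centered matrix.
    evals, A = np.linalg.eigh(Kc)
    # Step 4: sort in descending order and discard tiny eigenvalues.
    order = np.argsort(evals)[::-1]
    evals, A = evals[order], A[:, order]
    keep = evals > eig_tol
    evals, A = evals[keep], A[:, keep]
    # Step 5: keep the first M directions retaining var_keep of the variance,
    # with coefficients scaled by Lambda^(-1/2).
    ratio = np.cumsum(evals) / np.sum(evals)
    M = int(np.searchsorted(ratio, var_keep) + 1)
    return evals[:M], A[:, :M] / np.sqrt(evals[:M])


X_demo = np.random.default_rng(0).normal(size=(20, 3))  # synthetic stand-in SVs
evals, coeffs = kpca_directions(X_demo)
```

The scaling in step 5 makes each retained direction have unit length in feature space, which is why the threshold in step 4 matters: dividing by the square root of a near-zero eigenvalue would amplify numerical noise.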


Figure 3.11: Overview of KPCA_SVM algorithm

3.4 Analysis of classification results

The classification results obtained using various classification techniques are expressed in a standard confusion matrix (Landgrebe, 2003) showing the class-wise user ($k_{ua}$), producer ($k_{pa}$) and overall ($k$) kappa measures (Congalton, 1991). The overall kappa ($k$) values obtained from different classification techniques were used for one-tailed hypothesis testing (Congalton, 1991) to compare any two classification results, while the class-wise producer's kappa ($k_{pa}$) values were used to check the performance of different classification techniques in separating different classes (Abhinav, 2009).

3.4.1 One tailed hypothesis testing

The z-statistic (Congalton, 1991) is computed from the kappa values obtained, for comparing any two classification techniques:

$Z_{12} = \frac{\hat{k}_1 - \hat{k}_2}{\left( \hat{\sigma}_1^2 + \hat{\sigma}_2^2 \right)^{1/2}}$ (3.73)

where $\hat{k}_1$ and $\hat{k}_2$ are the kappa estimates obtained for the two classification techniques under consideration, and $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$ are the respective estimates of the variances of the observed kappa values. The z-statistic obtained is used for one-tailed hypothesis testing with the following null ($H_0$) and alternate ($H_1$) hypotheses:

$H_0 : Z_{12} = k_1 - k_2 \le 0$

$H_1 : Z_{12} = k_1 - k_2 > 0$ (3.74)

The null hypothesis chosen here is that, of the two classification results $\hat{k}_1$ and $\hat{k}_2$, $\hat{k}_1$ is not significantly better than $\hat{k}_2$, which means that the first classification technique is not significantly better than the second. The alternate hypothesis states that the two classification results are statistically different and that the result corresponding to $\hat{k}_1$ is statistically better than that corresponding to $\hat{k}_2$, so that the first classification technique can be said to be significantly better than the second (Abhinav, 2009).

The z-statistic obtained in Eq. (3.73) follows the standard normal distribution (Congalton, 1991). Thus, according to one-tailed hypothesis testing (Fig. 3.12), if the value of the $Z_{12}$-statistic is greater than a critical value (say, 1.65) for a confidence level of 95%, the null hypothesis can be rejected, and it can be said with 95% confidence that the two classification results are statistically different, with the first one performing better than the second one (Abhinav, 2009).
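The test can be written in a few lines of Python. The kappa estimates and variances used below are hypothetical numbers chosen only to exercise the function; they are not results from this thesis, and the function names are illustrative.

```python
import math


def kappa_z_statistic(k1, var1, k2, var2):
    """Eq. (3.73): Z12 = (k1 - k2) / sqrt(var1 + var2)."""
    return (k1 - k2) / math.sqrt(var1 + var2)


def first_is_better(k1, var1, k2, var2, z_critical=1.65):
    """Reject H0 (first not better than second) when Z12 exceeds z_critical."""
    return kappa_z_statistic(k1, var1, k2, var2) > z_critical


# Hypothetical kappa estimates and variances, for illustration only.
z = kappa_z_statistic(0.95, 0.0001, 0.92, 0.0001)
decision = first_is_better(0.95, 0.0001, 0.92, 0.0001)
```

Because the test is one-tailed, swapping the two classifiers does not give the mirrored conclusion: a large negative $Z_{12}$ simply fails to reject $H_0$.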

Figure 3.12: Definitions and values used in applying one-tailed hypothesis testing (Abhinav, 2009). Figure labels: critical value Zc = 1.65; non-rejection region for $H_0$; rejection region for $H_0$.


CHAPTER 4 EXPERIMENTAL DESIGN

This chapter will address the methodology followed for this thesis work.

Experiments were designed to investigate the best FE technique, classification

algorithm and best time saving strategy for HD. On the basis of conclusions from the

literature survey and recommendations for future work by Abhinav (2009), several

FE and classification algorithms have been tested which have potential for improving

classification accuracy and time for HD. The theoretical background of these

algorithms was presented in Chapter 3.

The following FE methods and classification algorithms have been tested:

(1) Feature extraction algorithms

• Unsupervised feature extraction algorithm

a) Segmented principal component analysis (SPCA) (Jia, 1996).

b) Projection pursuit (PP) (Friedman and Tukey, 1974).

• Supervised feature extraction algorithm

a) Kernel principal component analysis (KPCA) (Scholkopf, 1995).

b) Orthogonal subspace projection (OSP) (Lentilucci, 2001).

(2) Classification algorithms

• Parametric classification approach

a) Gaussian maximum likelihood (GML) (Savage, 1976).

• Non-parametric classification approach

a) k nearest neighborhood (KNN) (Fix and Hodges, 1951).

• Advanced classification approach

a) Support vector machine (quadratic programming optimization method) (SVM_QP) (Vapnik, 1995).

b) Support vector machine (sequential minimal optimization method) (SVM_SMO) (Platt, 1999).

c) Kernel principal component analysis support vector machine (KPCA_SVM) (Sundaram, 2009).

This chapter starts with experimental details for different FE and selection

techniques. Then it explains the classification techniques for parametric and non-

parametric classifier followed by advanced classifier.

4.1 Feature extraction technique

Two types of FE techniques, unsupervised and supervised, were used in this experiment. SPCA and PP are unsupervised FE techniques, while KPCA and OSP are supervised FE techniques. The details of the FE methods are given below.

4.1.1 SPCA

For SPCA, the complete data set is divided into subgroups on the basis of the correlation between bands. PCA is then applied separately to each subgroup. After this first transformation, features are selected from each subgroup using the variance information (the first few PCs retaining 99% of the variance were selected). The selected features are then regrouped and transformed again to compress the data further. The flowchart of the SPCA method is shown in Figure 4.1.
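The two-stage procedure can be sketched as below. The sketch is a simplification under stated assumptions: PCA is computed from the covariance matrix of each block, the block boundaries are passed in directly rather than derived from the correlation image, each block is assumed to contain at least two bands, and synthetic data stand in for the image. The function names `pca_keep_variance` and `spca` are illustrative, and the block sizes of 32, 6 and 27 bands mirror the three correlated blocks reported for the OD.

```python
import numpy as np


def pca_keep_variance(block, var_keep=0.99):
    """PCA on one block (pixels x bands); keep the leading PCs."""
    centered = block - block.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / np.sum(evals)
    m = int(np.searchsorted(ratio, var_keep) + 1)
    return centered @ evecs[:, :m]


def spca(data, blocks, var_keep=0.99):
    """PCA per block of correlated bands, then PCA on the regrouped PCs."""
    selected = [
        pca_keep_variance(data[:, start:stop], var_keep)
        for start, stop in blocks
    ]
    return pca_keep_variance(np.hstack(selected), var_keep)


blocks = [(0, 32), (32, 38), (38, 65)]   # 32, 6 and 27 bands
demo = np.random.default_rng(1).normal(size=(200, 65))  # synthetic pixels
features = spca(demo, blocks)
```

Working block by block keeps each eigen-decomposition small and lets the local correlation structure of each spectral region drive the transformation, rather than the global covariance of all 65 bands.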

Figure 4.1: SPCA feature extraction method

4.1.2 PP

For PP, Posse's (1995a) algorithm was used in this research work, in which the OD (n-dimensional) is projected onto a two-dimensional space. Thus, the dimension of the PP feature extracted data set is two. The chi-square projection pursuit index was chosen here. The methodology adopted for the PP method is shown in Figure 4.2.

Figure 4.2: PP feature extraction method

4.1.3 KPCA

The number of PCs is equal to the number of TP used for FE. In this experiment, up to 400 TP in total have been used for FE with the KPCA method. Hence, the dimension of the KPCA feature extracted data set is up to 400. First, the TP are mapped into feature space using different kernel functions (linear, polynomial and Gaussian) in the form of a Gram matrix. Then the eigenvalues and eigenvectors of the Gram matrix are calculated. Afterwards, the OD is mapped into kernel space using the same kernel function (as used for the TP) and projected along the direction of the eigenvectors. Finally, the KPCA feature extracted data set is obtained. The outline of the KPCA method is shown in Figure 4.3.

Figure 4.3: KPCA feature extraction method


4.1.4 OSP

The dimensionality of the feature extracted data set depends upon the number of classes present in the OD. OSP starts by finding the endmembers using the automated target generation process (ATGP). The OD is then projected along the endmembers to obtain the feature extracted data set. The data set used for this thesis has eight classes, so the number of endmembers is also eight. The dimension of the feature extracted data set is equal to the number of endmembers. A brief description of the OSP method is shown in Figure 4.4.
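One common way to form an OSP feature for a given endmember is to project each pixel onto the subspace orthogonal to all other endmembers and then correlate it with the target signature. The sketch below shows that operator under the assumption that the endmember signatures are already available (the ATGP search is not implemented); the function name `osp_feature` and the random signatures are illustrative and are not taken from the thesis code.

```python
import numpy as np


def osp_feature(pixels, endmembers, target_index):
    """Suppress the other endmembers, then correlate with the target.

    pixels: (n_pixels, n_bands); endmembers: (n_endmembers, n_bands).
    """
    d = endmembers[target_index]
    U = np.delete(endmembers, target_index, axis=0).T   # bands x (p - 1)
    P = np.eye(U.shape[0]) - U @ np.linalg.pinv(U)      # annihilates span(U)
    return pixels @ (P @ d)


# Synthetic stand-in: eight endmember signatures over 65 bands.
E = np.random.default_rng(2).normal(size=(8, 65))
on_target = osp_feature(E[[0]], E, 0)[0]    # response to the target itself
off_target = osp_feature(E[[1]], E, 0)[0]   # response to another endmember
```

Repeating the call for each of the eight endmembers gives eight feature bands, one per class signature, which is why the output dimension equals the number of endmembers.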

Figure 4.4: OSP feature extraction method

4.2 Experimental design

This section provides the detailed classification methodology followed in this research work. The feature extracted data or OD, the TP and the selected bands are given as input to the classifier. In this thesis work, the same set of TP has been used to train the classifier on every data set. For example, to perform classification using 200 TP per class on the SPCA modified data set, the same 200 TP were used as for the OD. To verify the results obtained by Abhinav (2009), the same sets of TP are also used here. Those TP were obtained by the multinomial TP selection algorithm. The statistically sufficient sample size for training and testing was calculated at a confidence level of 99% and a desired precision of 4%, using the formula suggested by Toratora (1976). Following this approach, a minimum of 99 TP per class has to be chosen to train a classifier.

Experiments were performed with the GML, KNN and advanced (SVM) classifiers. For each classifier, two types of experiments were performed: the first on the OD and the second on the feature extracted data sets. For each set of experiments, the classifier was trained with 25, 100, 200 and 300 TP per class. Using the same set of TP ensures that no discrepancy arises from different training data sets when comparing classification results. These numbers were chosen in order to consider the following cases of training sample size.

a) Statistically insufficient training sample size (25 TP)

b) Statistically exact training sample size (100 TP)

c) Statistically sufficient training sample size (200 TP)

d) Very large training sample size (300 TP)

The classifier provides a thematic map as the output of classification. These maps were used to obtain the test accuracy of the classifiers in terms of the confusion matrix. Accuracy analysis of the resulting maps was performed using the kappa values for different algorithms, compared through z-statistics on the basis of one-tailed hypothesis testing at the 95% confidence level (Congalton, 1991).

For each classification technique, initially five bands of OD or feature extracted

data set (except OSP and PP feature extracted data set) were chosen. Later on, it was

incremented by five in a stepwise manner up to the available bands (number of

available bands may be different for different feature extracted data set). The

classification was performed to evaluate if there was any improvement in accuracy.

This was performed for each set of TP.

The dimension of the OSP feature extracted data set is equal to the number of classes present in the OD. Each band of the OSP feature extracted data set contains information corresponding to one class. Therefore, for classification, all bands of the OSP feature extracted data set should be taken together; otherwise, classification errors may result. For all the experiments in this thesis work, the eight bands of the OSP feature extracted data set were taken together.

The dimension of the PP feature extracted data set is two. Therefore, the

maximum number of bands available for PP feature extracted data set is two. For all

the experiment on PP feature extracted data set both the bands were taken together.

The methodology of the classification procedure for this thesis work is shown in

Figure 4.5.


Figure 4.5: Overview of classification procedure

4.3 First set of experiment (SET-I) using parametric and non-parametric classifier

Set-I experimental set up was designed to investigate the results of parametric

(GML) and non-parametric (KNN) classifier. The classification was performed by

selecting different parameters of KNN and GML.

For KNN, three neighboring pixels were chosen initially, and the neighborhood size was then increased in steps of one up to 11. After that, classification was performed only for a neighborhood size of 15. However, there was negligible improvement in accuracy beyond five neighboring pixels. The experiment was conducted to study the effect of the number of neighboring pixels on accuracy.
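A minimal sketch of the KNN rule and of the neighborhood sizes described above is given below. The Euclidean distance, the function name `knn_classify` and the toy training pixels are assumptions made for illustration; ties between classes, which can occur for even neighborhood sizes, are resolved here by whichever label `Counter` returns first.

```python
import numpy as np
from collections import Counter


def knn_classify(x, train_X, train_y, k=3):
    """Label x by majority vote among its k nearest training pixels."""
    dist = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]


# Neighborhood sizes tested: 3, 4, ..., 11 and then 15.
neighborhood_sizes = list(range(3, 12)) + [15]

# Toy training pixels in two bands, two classes (illustrative only).
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
label_near_origin = knn_classify(np.array([0.2, 0.2]), train_X, train_y, k=3)
label_far = knn_classify(np.array([5.5, 5.5]), train_X, train_y, k=3)
```

Looping `knn_classify` over `neighborhood_sizes` and recording the kappa value for each size reproduces the structure of the experiment, though not its data.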

The best classification result for KNN and GML for feature extracted data sets

as well as OD were independently observed along with the parameters responsible for

the best result. The experimental scheme is given in Figure 4.6.


Figure 4.6: Experimental scheme for Set-I experiments

4.4 Second set of experiment (SET-II) using advance classifier

The second set of experiments was designed for the advanced classifier, i.e. the SVM algorithms. Different optimization techniques and algorithms for SVM were chosen to compare the accuracy and the time taken to train the classifier. In this thesis work, SVM_QP, SVM_SMO and another approach, KPCA_SVM, were used to compare classification accuracy and time. As mentioned before, all these algorithms were applied to the OD as well as to the feature extracted data sets.

The purpose for this experiment is summarized below:

(i) Investigation of the best classification algorithm among these SVM

algorithms, depending upon the accuracy and processing time

(ii) Inquiry of the best FE techniques for SVM classifier

For KPCA_SVM, the SVs were first extracted by solving the dual optimization problem using the quadratic programming (QP) optimization method. The KPCA algorithm with a Gaussian kernel was then applied to the SVs, and the PCs were arranged in descending order of the eigenvalues of the kernel matrix. These PCs are the new set of SVs. In this research work, for all the experiments related to KPCA_SVM, about 70% of the original number of SVs were chosen from the new set of SVs (for details, section 3.2.3.4), because about 99% of the variance was stored in the first 70% of the PCs. Finally, the SVM decision rule was applied to the new set of SVs to obtain the classified map.


For SVM_QP and SVM_SMO, quadratic programming optimization and

sequential minimal optimization methods were used respectively to solve the dual

optimization problem. The classification scheme for Set-II experiment is given in

Figure 4.7.

Figure 4.7: The experimental scheme for advanced classifier (Set-II)

4.5 Parameters

Parameters also play an important role in HD classification, so choosing them is an important task. All the parameters chosen for the different FE techniques and classification algorithms are listed in Table 4.1.

Table 4.1: List of parameters

FE techniques    Parameters
SPCA             Correlation matrix of the bands
PP               No. of random searches – 5
                 half – 15
                 Stopping value – .01
KPCA             Kernel function – rbf
OSP              No. of endmembers – 8

Classifiers      Parameters
GML              Confidence interval – 99%
KNN              Neighbors – 3, 4, 5, …, 11 and 15
SVM              Kernel function – rbf


CHAPTER-5 RESULTS

This chapter provides observations for various experiments and interpretation of the

same. Starting with the visual interpretation of feature extracted data sets, the

chapter will discuss the result of GML classifier on feature-extracted data set. These

results are compared with the best result for GML as observed by Abhinav (2009).

Then it will discuss the effect of KNN classification algorithm on OD and feature

extracted data set followed by the discussion of the results of different SVM

algorithms.

5.1 Visual inspection of feature extraction techniques

Apart from comparison of k-values, the features extracted by the various FE techniques can be visually inspected using grayscale views of the first few features. The image form of the correlation matrix is also used for this purpose.

From the correlation image of OD (Figure 5.1), it is clear that there are three

highly correlated blocks of bands. The first block contains 32 bands, the second 6

bands and the last contains 27 bands (Figure 5.1). The average correlation values for

each block are 0.931, 0.997 and 0.941 respectively. Thus, the OD is segmented based

on correlation of these three blocks of bands. Then PCT was applied on the basis of

correlation matrix of each block of bands for which SPCA feature extracted data set

was obtained. Total time taken to complete the aforementioned process was about 8

seconds.


Figure 5.1: Correlation image of the OD set consisting of three blocks having bands 32, 6 and 27 respectively.

In the PP process, two-dimensional structures can be found sequentially, from the most important to the less important. Two structures, in decreasing order of interest (the first being the most interesting), are given in Figure 5.2. The PP index after five random searches was 0.3825 and the size of the neighborhood (c) around the best projection plane was 0.011. The total time taken to complete the whole process was about 11.30 hours. Table 5.1 presents the time required for each FE technique under different constraints.


Table 5.1: The time taken for each FE technique

FE methods              Time
SPCA                    6-8 seconds
KPCA with rbf kernel    1) 4 minutes for 25 TP
                        2) 5.5 minutes for 100 TP
                        5) 10 minutes for 400 TP
OSP                     90 seconds for 8 endmembers
PP                      11.30 hours


Figure 5.2: Projection of the data points. (a) Most interesting projection direction

(b) Second most interesting projection direction.

The grayscale images of the features extracted using the various FE techniques are provided in Figures 5.3 to 5.6, followed by the corresponding correlation images shown in Figure 5.7.


Figure 5.3: First six Segmented Principal Components (SPCs), panels (a) SPCA-1 to (f) SPCA-6; (b) shows water body and salt lake.

Figure 5.4: First six Kernel Principal Components (KPCs) obtained by using 400 TP, panels (a) KPCA-1 to (f) KPCA-6.


Figure 5.5: First six features obtained by using eight end-members, panels (a) OSP-1 to (f) OSP-6; (b) shows vineyards and wheat, (c) shows bare soil, (d) shows salt lake.

Figure 5.6: Two components of the most interesting projections, panels (a) PP-1 and (b) PP-2; (a) shows salt lake.


Figure 5.7: Correlation images after applying various FE techniques

The following observations were made from visual inspection of the feature extracted data sets (Figures 5.3 to 5.6) and their correlation images (Figure 5.7):

(i) Since the extracted SPCs were ranked according to their eigenvalues, a higher amount of information can easily be noticed in the first four SPCs. No interesting structures could be visually identified beyond the 4th SPC. As SPCA uses the local correlation of the bands rather than the global correlation (as PCA does), it is able to decorrelate the bands involved more strongly than PCA. So a better classification result is expected from the SPCs. It has also been visually observed that SPCA-2 is associated with the water body and salt lake classes.

(ii) The first few features extracted by KPCA were visually inferior to those obtained by SPCA (not revealing any single class). Some of the features, like KPCA-1 and KPCA-2, show water body and salt lake prominently, but other classes are also present there.

[Figure 5.7 panels: (a) SPCA, (b) KPCA, (d) OSP, (e) PP; color scale from 0.0 to 1]

(iii) OSP is generally used to extract the same number of features as the number of classes present in the data set (in this case eight classes, hence eight features). Although the number of features extracted by OSP is low, it can identify some structures prominently. For example, OSP-4 identifies salt lake, OSP-2 identifies vineyards and wheat, and OSP-3 shows bare soil. The OSP algorithm suggests that each band of the OSP extracted data set is associated with one of the predefined classes. Therefore, OSP is expected to perform well for classification.

(iv) The dimension of the PP extracted feature set is two. Salt lake can be identified very clearly from the first extracted feature, but the second feature contains no identifiable structures and has a hazy appearance.

(v) The quality improvement of the features extracted by different FE techniques can be observed by comparing the correlation images of the OD (Figure 5.1) and the feature extracted data (Figure 5.7). The correlation matrices of the SPCA and PP extracted data sets are found to be perfectly diagonal, with diagonal values equal to unity and all off-diagonal elements equal to zero. On the other hand, the data extracted using supervised FE techniques (OSP, KPCA) are correlated. This is because the SPCA and PP algorithms extract only orthogonal features, while the FE criterion is different for OSP, so highly correlated features are observed for OSP. In the correlation image of the KPCA feature extracted data set, it can be observed that the correlation is unity along the diagonal and decreases with increasing distance from the diagonal of the correlation matrix, except for bands 80 to 100. These bands are observed to be fully uncorrelated.

5.2 Results for parametric and non-parametric classifiers

This section presents the results of the GML and KNN classifiers using different data sets. First, it describes the results for the GML classifier, followed by KNN.

5.2.1 Results of classification using GML classifier (GMLC)

The performance of GMLC with feature modified data sets (SPCA, KPCA, OSP, PP FE methods) was compared to the best result obtained by Abhinav (2009) for the GML classifier, to evaluate the improvement in classification due to these FE techniques. It may be noted that he obtained the best results with the PCA modified data set. Figure 5.8 shows the k-values obtained for different feature modified data sets.

The following observations can be made from Figure 5.8:

(i) With sufficient TP (100, 200, 300), the k-values obtained for the PCA, SPCA and OSP extracted data sets were higher than those for the PP and KPCA modified data sets.

(ii) For statistically insufficient TP (25), GML performs poorly for the SPCA, PCA and OSP modified data sets. As the number of bands increases, beyond a certain number of bands the k-value for the PCA and SPCA modified data sets becomes negative for 25 TP per class. This is because at least p+1 sample points are required to obtain a numerically well conditioned inverse of a p × p matrix. Consequently, GML fails when more than 25 bands are used with 25 TP per class, since 25 TP are insufficient for computing the inverse of the class covariance matrix.

(iii) An interesting phenomenon can be observed for the k-values of the KPCA modified data set. The k-value increases for the first 35 bands, falls suddenly at 40 bands, and starts to increase again from 45 bands onwards. The result for the KPCA modified data set is observed up to 65 bands (the dimension of OD is 65).

(iv) The k-values obtained for SPCA and OSP appear to outperform those obtained for PCA and KPCA.

(v) The performance of PP is very poor because of the very low number of features (two features). Hence, PP was not considered any further for classification.

(vi) For all FE techniques (except KPCA and OSP), the k-values increase significantly with the number of bands up to a critical number of bands (say, Ncri), after which no improvement is observed. This is because the features extracted by these techniques are arranged in decreasing order of eigenvalues, so the useful information is stored in the first few features, while the lower order features contain less useful information and are very noisy. When these noisy bands are added, the probability of misclassification increases and, as a result, the classification accuracy stagnates.

(vii) Ncri differs for different sets of TP: when the number of TP increases, Ncri increases. Because of the Hughes phenomenon, classification with a large number of bands gives poor results unless the number of TP is large.
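The sample-size requirement for inverting the class covariance matrix can be illustrated with a small sketch. The helper names and the toy samples below are hypothetical and only serve as an illustration: with n samples of p bands, the sample covariance has rank at most n-1, so with n = p samples it is singular, and with n = p+1 generic samples it can reach full rank.

```python
def covariance(samples):
    """Sample covariance matrix (p x p) of n samples with p bands each."""
    n, p = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    return [[sum((s[a] - means[a]) * (s[b] - means[b]) for s in samples) / (n - 1)
             for b in range(p)] for a in range(p)]

def matrix_rank(matrix, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, rows):
            factor = m[i][c] / m[rank][c]
            for j in range(c, cols):
                m[i][j] -= factor * m[rank][j]
        rank += 1
    return rank

# Hypothetical toy data with p = 3 bands.
too_few = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]   # n = p = 3
enough = too_few + [[3.0, 1.0, 2.0]]                            # n = p + 1 = 4

print(matrix_rank(covariance(too_few)))  # rank below p: covariance is singular
print(matrix_rank(covariance(enough)))   # full rank: covariance is invertible
```

In the same way, a class covariance matrix estimated from 25 TP cannot be inverted reliably once the number of bands reaches the number of TP.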

Figure 5.8: Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands (panels: PCA, SPCA, KPCA, OSP, PP)

To confirm these observations, a statistical analysis was performed. The k-values obtained for each FE technique are given in Table 5.1. The best results obtained by GML classification on the different feature extracted data sets for three training data sets (100, 200 and 300 TP) were selected for comparison with the best GML result obtained with the PCA extracted data set. The best classification result (best k-value) is taken as the one using the least number of bands after which no statistically significant improvement in k-value could be achieved. A comparison of the best results between the PCA and the other FE modified data sets, and among the various FE techniques, is presented in Table 5.2 in terms of the z-statistic values obtained for one-tailed hypothesis testing at the 5% significance level.
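The comparison of two kappa values can be sketched as follows, assuming the usual form of the statistic, Z12 = (k1 - k2) / sqrt(var1 + var2), and a one-tailed critical value of 1.645 at the 5% level. The kappa variances are not listed in this section, so the numbers in the example are hypothetical.

```python
import math

Z_CRITICAL_ONE_TAILED_5PCT = 1.645  # standard normal quantile for alpha = 0.05

def kappa_z(k1, var1, k2, var2):
    """z-statistic for the difference between two independent kappa estimates."""
    return (k1 - k2) / math.sqrt(var1 + var2)

def significantly_better(z):
    """True if classification 1 is significantly better than classification 2."""
    return z > Z_CRITICAL_ONE_TAILED_5PCT

# Hypothetical kappa values and variances, for illustration only.
z_small = kappa_z(0.95, 0.0001, 0.93, 0.0001)
z_large = kappa_z(0.95, 0.00005, 0.93, 0.00005)
print(z_small, significantly_better(z_small))
print(z_large, significantly_better(z_large))
```

A negative z below -1.645 indicates, by symmetry, that the second classification is the significantly better one.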

The following observations can be made from Table 5.2:

(i) PCA and SPCA give statistically similar results for 100 and 300 TP per class, while SPCA provides a statistically significantly better result than PCA for 200 TP per class. SPCA is thus an improvement over PCA.

(ii) In the case of OSP, a statistically better result than PCA could not be achieved for the statistically exact TP set (100 TP per class), but when the number of TP increases (200), OSP provides a statistically better result than PCA. For the large TP set (300), the result is statistically similar to PCA.

(iii) For the 200 and 300 TP sets, SPCA and OSP provide statistically similar results, but for the small set of TP (100), SPCA provides a better result than OSP. Since the SPCA extracted data set is more orthonormal than the OSP extracted data set, it can be concluded that SPCA is a better FE technique than OSP for GML classification.

(iv) The PP extracted data set always gives statistically much poorer results than OSP, because of the low dimensionality (two dimensions) of the PP extracted data set.

(v) KPCA always fails (for large or small TP) to provide a statistically better result than PCA or OSP, and OSP is statistically better than PP for all sets of TP. Again, SPCA provides statistically better results than PCA or OSP. Therefore, it can be concluded that SPCA is the best FE technique among PCA, OSP, PP and KPCA.

(vi) The best kappa accuracy for the GML classifier is obtained using the SPCA extracted data set with 300 TP. The kappa value is 0.9589 and the number of bands used for classification is 45.
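For reference, the overall kappa value used throughout these comparisons can be computed from a confusion matrix as sketched below. This is the standard Cohen's kappa definition; the function name and the confusion matrices are hypothetical and are not taken from the experiments.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: reference, columns: classified)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total ** 2)
    return (observed - expected) / (1.0 - expected)

# Hypothetical two-class confusion matrices.
good = [[45, 5], [10, 40]]
worse_than_chance = [[0, 10], [10, 0]]
print(cohen_kappa(good))
print(cohen_kappa(worse_than_chance))
```

The second example shows how kappa becomes negative when agreement is worse than chance, which is the situation reported above for 25 TP with a large number of bands.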

Table 5.2: Best kappa values and z-statistic (at 5% significance level) for GML

NB* = number of bands used; $Z_{12} = \dfrac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}$

From Table 5.3 it is observed that the best results for the PCA, SPCA and KPCA extracted data sets were obtained with 30-45 features at 300 TP, and for the OSP extracted data set with 8 features at 300 TP. During the experiments, GMLC took around 55-70 seconds to process 30-45 bands with 300 TP per class for the SPCA and PCA extracted data sets, and about 32 seconds for the OSP extracted data. OSP provides statistically similar results to PCA and SPCA for 300 TP, but its processing time is much lower than that of the other FE techniques. Therefore, considering both accuracy and processing time, OSP can be rated as the most effective FE technique for GMLC. For statistically insufficient TP (25) and statistically sufficient TP (200), SPCA is rated as the best FE technique. For 100 TP per class, the performance of PCA and SPCA for GMLC is the same. From Figure 5.9, it can be observed that GMLC on OSP is faster than on any other FE technique, while PCA and SPCA take about the same time to provide the best k-values.

Table 5.3: Ranking of FE techniques and time required to obtain the best k-value

TP    SPCA                     PCA                      KPCA                     OSP                      PP
      k1*     Time(s)*  Rank   k2      Time(s)   Rank   k3      Time(s)   Rank   k4      Time(s)   Rank   k5      Time(s)   Rank
25    0.8409  53.6      1      0.8296  53.6      2      0.8215  59.7      2      0.2700  35.4      3      0.1960  -         4
100   0.9384  60.6      1      0.9362  60.6      1      0.8489  75.6      3      0.9205  39.2      2      0.2220  -         5
200   0.9579  65.2      1      0.9460  59.4      3      0.8332  74.4      4      0.9505  36.7      2      0.2146  -         5
300   0.9589  83.5      1      0.9568  72.3      1      0.8569  62.8      2      0.9572  39.8      1      0.2228  -         3

Time* = time (seconds) for obtaining the best k-value; ki* = k-value for the ith FE technique; Rank 1 indicates the best

Table 5.2 (continued): best k-values, number of bands (NB) and z-statistics

TP    PCA              SPCA             KPCA             OSP              PP               z-statistic
      Best k1  NB*     Best k2  NB      Best k3  NB      Best k4  NB      Best k5  NB      Z12     Z13     Z14     Z24     Z34      Z45
100   0.9362   20      0.9384   20      0.8489   35      0.9205   8       0.2220   2       -1.35   41.95   8.87    10.51   -41.95   222.81
200   0.9460   20      0.9579   30      0.8332   35      0.9505   8       0.2146   2       -3.97   53.47   -4.07   0.99    -53.47   304.51
300   0.9568   40      0.9589   45      0.8569   35      0.9572   8       0.2228   2       -1.45   50.65   -0.28   1.20    -50.65   290.75

Figure 5.9: Comparison of kappa values and classification times for GML classification method.

5.2.2 Class-wise comparison of result for GMLC

The class-wise accuracy for GMLC has been observed for different feature extracted data sets. From Figure 5.10, the following can be observed:

(i) For all sizes of TP, GMLC can extract the salt lake class from all sets of feature extracted data with a very high k-value. The water class is also extracted with a very high k-value for all feature modified data sets (except PP). From the PP modified data set, only the salt lake class can be separated with a satisfactory k-value, because the first feature of the PP modified data set distinguishes salt lake very clearly (Figure 5.6).

(ii) Besides salt lake and water body, GMLC separates the hydrophytic class from other classes with a very high k-value for all feature extracted data sets and all sets of TP. For 300 TP, GMLC separates the hydrophytic veg class with very high accuracy from the SPCA modified data set.

(iii) GMLC classifies vineyards and wheat with about the same k-value. Some vineyard pixels have been classified as wheat pixels due to the presence of mixed pixels. The classification accuracy for the vineyards, bare soil, pasture land and built-up area classes is about the same for 200 and 300 TP for the SPCA, PCA and OSP modified data sets. It is low for the KPCA modified data set.

Figure 5.10: Best producer accuracy of individual classes observed for GMLC on different feature extracted data sets with respect to different sets of TP. Panels: 25, 100, 200 and 300 Training Pixels. Legend: WT: Water; SLT: Salt lake; HV: Hydrophytic Veg; WHT: Wheat; VY: Vineyards; BS: Bare Soil; PL: Pasture Land; BUA: Built-up Area.

5.2.3 Classification results using KNN classifier (KNNC)

To understand the effect of FE techniques on the KNN classifier, experiments were performed with OD as well as feature extracted data. The same sets of TP as used in GML classification were chosen to compare classification accuracy. Observations from Figures 5.11 to 5.14 are as follows:

(i) In the case of KNN, poor performance is observed for the statistically insufficient TP set (i.e. 25 TP). However, KNN on OD performs better than on the PCA, OSP and SPCA extracted data sets. The maximum k-value was obtained for 65 bands and three neighbors. For the KPCA extracted data set, the k-value was comparatively better than OD when 50 bands were taken, for all neighbors. PP was not taken into the accuracy analysis because, owing to its very low dimensionality, it would not be able to provide good k-values.

(ii) For statistically exact TP (100 TP), the performance of KNN on OD is better than on any feature extracted data set. A larger number of bands increases the k-values for all feature extracted data sets except SPCA, for which increasing the number of bands did not show any significant change. However, changes were easily observed when the number of neighbors was increased: beyond a critical number of neighbors (say, Nnbd), the k-value starts decreasing, independently of the number of bands. This may be due to noisy points present in the training data set, since a large number of neighbors increases the chance of using noisy TP and consequently adds to the misclassification error.

(iii) For 200 TP per class, no improvement over OD is observed for the PCA, KPCA and OSP extracted data sets, but an improvement was observed for the SPCA extracted data set. However, no appreciable change was observed for the PCA and KPCA extracted data sets for KNNC with the 100 and 200 TP sets respectively. The effect of the neighborhood on accuracy can be seen from Table 5.4. For all sets of TP, the highest k-value is always achieved with the first few neighbors (Table 5.2).

(iv) For the large training data set (300 TP), k-values better than those of OD were observed with the PCA and SPCA extracted data sets. After a certain threshold neighborhood, the k-value decreases monotonically for the PCA, OSP and SPCA extracted data sets.

(v) The KPCA extracted data set provides better results at high dimension since it is more refined than the PCA or SPCA extracted data sets.

(vi) For all training data sets except the statistically insufficient one, the k-value for the OSP extracted data set varies only a little (0.02 - 0.05) because of its very low dimensionality. If the number of extracted end members were large enough, the result could be further improved.

(vii) Another important aspect was observed for the feature extracted data sets. The difference between the k-values obtained using the minimum and the maximum number of bands (for all sets of TP) is about 0.15 to 0.20. This could be because most of the information is gathered in the first few bands of the feature extracted data sets; additional bands cannot provide enough useful information to change the k-value significantly.
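The role of the number of neighbors discussed above can be made concrete with a minimal brute-force KNN sketch. It assumes Euclidean distance and simple majority voting; the function name, class labels and training pixels are hypothetical and do not reproduce the implementation used for the experiments.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, pixel, k):
    """Label a pixel by majority vote among its k nearest training pixels."""
    # Distance from the pixel to every training pixel, nearest first.
    ranked = sorted((math.dist(x, pixel), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-band training pixels for two classes.
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["water", "water", "water", "soil", "soil", "soil"]

print(knn_predict(train_x, train_y, [0.2, 0.2], k=3))
print(knn_predict(train_x, train_y, [5.5, 5.5], k=3))
```

As k grows, training pixels that lie farther from the test pixel, possibly noisy ones, take part in the vote, which is one way to read the decrease in k-value beyond Nnbd.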

Table 5.4: Classification with KNNC on OD and feature extracted data set

Data sets   100 TP         200 TP         300 TP
            Bnd*   NN*     Bnd    NN      Bnd    NN
Original    55     3       35     3       30     3
PCA         35     5       45     5       20     3
SPCA        10     3       15     3       40     3
KPCA        35     3       45     3       30     6
OSP         8      3       8      3       8      3
PP          2      15      2      11      2      15

Bnd* = number of bands for which the best k-value was obtained; NN* = number of neighbors for which the best k-value was obtained

Figure 5.11: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb*: number of nearest neighbors)

Figure 5.12: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb*: number of nearest neighbors)

Figure 5.13: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb*: number of nearest neighbors)

Figure 5.14: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb*: number of nearest neighbors)

The k-values for the classification of these data sets were analyzed to select the best results for each data set, following the same approach as in the case of GML. The z-statistic values obtained for the selected best k-values are shown in Table 5.5. The following can be inferred from these results:

(i) The results obtained using the PCA and SPCA modified data sets were found to be significantly better than those obtained using the OD for the large training data size (300). However, SPCs and PCs were still found to perform worse than OD for 100 TP. Statistically similar results were obtained for the OD and SPCA modified data sets using a training data set of 200 TP. For the other feature extracted data sets, and for all sets of training data, OD provides statistically significantly better results for KNN classification.

(ii) The best results with OD were obtained using 30 to 55 bands and three neighbors. For 300 TP, statistically better results than OD were obtained using the SPCA (40 bands) and PCA (20 bands) modified data sets with three neighbors. For 200 TP, the SPCA modified data set (15 features and 3 neighbors) provides results statistically similar to OD.

(iii) The SPCA extracted data sets were observed to perform significantly better than the PCA extracted data sets with smaller training data sets, whereas the best results obtained with the 300 TP training data set using SPCs were statistically similar to those obtained with PCs.

(iv) SPCs were also observed to perform significantly better than the KPCA and OSP modified data sets for all training data sets. In addition, the best results for PCA and OSP were found to be statistically poor for all training data sizes.

Table 5.5: The best k-values and z-statistic for KNNC

TP    OD            KPCA          SPCA          PCA           OSP           z-statistic
      k1      NB*   k2      NB    k3      NB    k4      NB    k5      NB    Z12     Z13     Z14     Z23      Z34     Z45
100   0.8889  55    0.7773  35    0.8669  10    0.7715  35    0.8268  8     42.51   9.42    44.72   -34.98   37.24   -20.55
200   0.9037  35    0.7881  45    0.9040  15    0.8062  45    0.8514  8     48.98   0.15    41.31   -49.10   11.43   -17.72
300   0.9244  30    0.8141  30    0.9325  40    0.9320  20    0.8701  8     47.91   -4.58   -4.29   -52.68   0.29    30.95

NB* = number of bands used to obtain the best k-value

The time taken by the KNN classifier is strongly affected by the number of TP, because a distance has to be computed between each test pixel and each TP: for n TP and m test pixels, the number of distances calculated is mn. Increasing the number of TP therefore extends the calculation time. Increasing the number of neighbors, however, has significantly less effect on run time. The times taken for classification with three and with 15 neighbors are almost the same (the maximum difference is 60-120 seconds) (Figure 5.15). It is also noticed that increasing the number of bands affects the calculation time proportionally (Figure 5.15). From Figure 5.16, it can be observed that PCA takes the least time, compared to OD and SPCA extracted data, to provide its best result. Considering both the time constraint and the k-value, PCA could be chosen as the best FE technique, followed by SPCA, among the available techniques for KNN classification. Figure 5.15 shows the comparison of time between 200 TP and 300 TP for the same number of bands and neighbors. The rank of the FE techniques with respect to accuracy for KNNC, for each set of TP, can be inferred from Table 5.6.

From Table 5.6, it is further observed that for the statistically exact TP size (i.e. 100), KNNC produced the best result with OD. For statistically sufficient TP (i.e. 200), SPCA secured the first rank. For statistically large TP (i.e. 300), SPCA and PCA both perform better. Therefore, it is concluded that among all the data sets, feature modified and original, SPCA and PCA provide the best results for KNNC and, once classification time is also considered as noted above, PCA is the best FE technique among these techniques for KNNC.
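The mn distance count mentioned above can be illustrated with a short sketch. It assumes a brute-force search with Euclidean distance; the function names and the toy pixels are hypothetical. The number of neighbors only enters when the nearest entries of each row are selected, which is why it has much less influence on run time than the number of TP, while the number of bands enters through the length of each distance computation.

```python
import math

def distance_matrix(test_pixels, training_pixels):
    """Return an m x n list of Euclidean distances (m test pixels, n TP)."""
    return [[math.dist(t, p) for p in training_pixels] for t in test_pixels]

def distance_count(m, n):
    """Number of distances a brute-force KNN search evaluates."""
    return m * n

# Hypothetical toy sizes: 3 test pixels, 4 training pixels, 2 bands each.
test_pixels = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
training_pixels = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]]

D = distance_matrix(test_pixels, training_pixels)
print(len(D), len(D[0]), distance_count(len(test_pixels), len(training_pixels)))
```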

Table 5.6: Rank of FE techniques and time required to obtain best k-value (Rank 1 indicates the best)

TP    Original                 KPCA                     SPCA                     PCA                      OSP
      k1      Time(s)*  Rank   k2      Time(s)   Rank   k3      Time(s)   Rank   k4      Time(s)   Rank   k5      Time(s)   Rank
100   0.8889  875.1     1      0.7773  722.9     4      0.8669  661.2     2      0.7715  789.6     5      0.8268  655.2     3
200   0.9037  1200.6    1      0.7881  1271.1    4      0.9040  1122.1    1      0.8062  1272.0    3      0.8514  1022.7    2
300   0.9244  1574.6    2      0.8141  1556.0    4      0.9325  1712.5    1      0.9320  1434.0    1      0.8701  1291.9    3

Time(s)*: required time in seconds

Figure 5.15: Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class. (NNb*: number of nearest neighbors)

Figure 5.16: Comparison of best k-value and classification time for original and feature extracted data set

5.2.4 Class wise comparison of results for KNNC

From Figure 5.17, the following observations can be made for the class wise accuracy of KNNC.

KNNC extracts the water and salt lake classes with very high accuracy for both feature modified data and OD. However, the built-up area is classified very poorly due to the presence of a large number of mixed pixels: for all data sets, some built-up area pixels are classified into the hydrophytic veg, wheat and pasture land classes. For the classification of the hydrophytic veg class, the accuracy of the OD, KPCA and OSP modified data sets is lower than that of the SPCA and PCA modified data sets for all sets of TP. For vineyards and built-up area classes [...] for all data sets and TP.

Figure 5.17: Class wise accuracy comparison of OD and different feature extracted data for KNNC. Panels: 100, 200 and 300 Training Pixels. Legend: WT: Water; SLT: Salt lake; HV: Hydrophytic Veg; WHT: Wheat; VY: Vineyards; BS: Bare Soil; PL: Pasture Land; BUA: Built-up Area.

5.3 Experiment results for SVM based classifiers

In this section, the results of different SVM algorithms are described. The results of the SVM_QP algorithm are described first, followed by SVM_SMO and KPCA_SVM. The section also provides a comparison of the classification time of the different SVM algorithms.

5.3.1 Experiment results for SVM_QP algorithm

Using the optimal set of parameter values for SVM classifiers (Table 4.5, recommended by Abhinav, 2009), classification was performed on the feature modified data sets. The results of these experiments are compared with the best result obtained by Abhinav (2009) for the SVM classifier; he noted that the performance of SVM_QP was best for the PCA extracted data set. The same training and input data sets were used as for the GML and KNN classifiers. The classification results obtained by SVM are presented in Figure 5.18, from which the following observations can be made:

(i) The k-values improve with increasing training data size for all input data set types (PCA, SPCA, KPCA, OSP and PP modified data sets).

(ii) The best classification results were obtained with the PCA and SPCA modified data sets. For the KPCA modified data set, the k-values increase as the number of bands increases. It is possible that, for very high dimension, the KPCA extracted data set can provide k-values as high as the SPCA or PCA extracted data sets.

(iii) For PCs and SPCs, the k-values increase and then stagnate after a critical number of features, after which they start to decrease gradually. This could be due to the same reason as discussed for the GML classification algorithm in section 5.2.1.

(iv) A similarity can be observed between the KPCA, PCA and SPCA modified data sets: for statistically insufficient TP (25), the k-values suddenly drop to about zero for classification using 50 bands. The reason is not clear. Probably, with this number of bands and TP, SVM_QP was unable to find a proper decision boundary.

(v) The best results for the KPCA and OSP extracted data sets are similar for each set of TP except 25 TP.

Figure 5.18: Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer. Panels: (a) PCA, (b) SPCA, (c) KPCA, (d) PP, (e) OSP.

The k-values for the classification of these data sets were statistically analyzed to select the best results for each data set. The approach was similar to that followed in the case of GML. The z-statistic values obtained for the best k-values are shown in Table 5.7. The following can be inferred from these results:

(i) PCA and SPCA give statistically similar results for all sets of TP. On the other hand, PCA always provides statistically significantly better results than the KPCA and OSP modified data sets for all sets of TP for the SVM_QP classifier.

(ii) Classification with the SPCA modified data set always performs statistically better than with the KPCA modified data set for all sets of TP. OSP, however, performs statistically better than the KPCA modified data set for 100 and 200 TP per class; for the large set of TP (300), OSP performs statistically similarly to the KPCA modified data set.

(iii) Table 5.7 also shows that the SPCA modified data set always performs statistically better than the OSP modified data set.

(iv) It can be concluded that PCs and SPCs have a better ability to improve the k-value than any other FE technique, and that KPCA performs the worst among all the FE techniques.

Table 5.7: The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set

TP    PCA           KPCA          SPCA          OSP           z-statistic
      k1*     NB*   k2      NB    k3      NB    k4      NB    Z12     Z13     Z14     Z23      Z24     Z34
100   0.9408  15    0.8703  55    0.9408  15    0.8874  8     36.30   0.00    28.70   -36.30   -7.79   28.70
200   0.9621  15    0.8901  65    0.9573  15    0.9050  8     7.89    0.53    6.26    -33.39   -7.26   30.40
300   0.9643  15    0.9090  60    0.9691  20    0.9069  8     6.07    -0.59   6.30    -7.40    1.06    7.65

NB* = no. of bands used to achieve the best k-value; ki* = k-value for ith FE technique

During the above experiments, it was observed that the time taken to train the SVM based classifier is strongly affected by the number of training samples used. This is because a kernel matrix has to be computed for every pair of TP. There was very little change in training time with an increase in the number of bands.

Generally, the total time taken to perform SVM based classification was observed to range from 23 to 102 seconds when the bands were increased from 5 to 65 for 25 TP. The same range was 82 to 273 seconds for 100 TP and 522 to 615 seconds for 200 TP.

An important aspect has been observed for the classification time using 200 TP with the SPCA modified data set (Figure 5.19). When the number of bands is increased beyond a critical number of bands (30 bands), the classification time decreases monotonically. The same trend was observed for 300 TP per class. This could be due to the fact that, with a large number of TP and a large number of bands, SVM_QP was unable to find the sufficient number of support vectors (SV) required for classification. For the SPCA or PCA modified data sets, all bands except the first few contain a large amount of noise. Due to the presence of noise, the optimization problem might not be solved properly for a large number of bands with a large set of TP, which means that a sufficient number of SV could not be found. When the number of SV is smaller, the classification time is also lower and the k-values may also decrease. This is supported by Figure 5.18 (a) and (b), where the k-values for the SPCA and PCA modified data sets start to decrease after 25 bands.

Exceptionally high times, of the order of 2600 seconds, were observed when the training data size was increased to 300 TP. Such high times were due to the QP optimizer used. Varshney and Arora (2004) suggested a few better optimizers which would give the same classification accuracies in shorter training times. It is known that the same performance would be achieved irrespective of the choice of optimizer in the case of SVM, as it makes use of statistical learning theory, as pointed out by Varshney and Arora (2004).

Figure 5.19: Classification time comparison using 200 and 300 TP per class.
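The dependence of SVM training time on the number of TP, noted in Section 5.3.1, can be illustrated with a sketch of the kernel matrix computation. It assumes an rbf kernel of the form exp(-gamma * ||x - y||^2); the function name, the value of gamma and the toy samples are hypothetical and are not the parameter values of Table 4.5. For n training pixels the matrix has n x n entries, so its size grows quadratically with the number of TP, whereas the number of bands only affects the cost of each individual entry.

```python
import math

def rbf_kernel_matrix(samples, gamma):
    """Symmetric n x n rbf kernel matrix for n training pixels."""
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance over all bands of the two pixels.
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * sq_dist)
    return kernel

# Hypothetical training pixels with two bands each.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = rbf_kernel_matrix(samples, gamma=0.5)
for row in K:
    print(row)
```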

5.3.2 Experiment results for SVM_SMO algorithm

The classification results obtained using SVM with the SMO optimization technique are presented in Figure 5.20. The rbf kernel function is used for classification of the different data sets with the SVM_SMO algorithm. The following observations can be made on the basis of the k-values presented in Figure 5.20:

(i) The k-values improve with increasing training data size (except for 200 TP) for all input data sets.

(ii) As with SVM_QP, a sudden decrease in k-value is observed with 25 TP for the OD, SPCA, KPCA and OSP extracted data sets. For all data sets, this happens at 50 features.

(iii) For all data sets (except the KPCA extracted data), the statistically sufficient training data set (200 TP) is unable to provide a positive k-value. This could be due to a failure to solve the optimization problem for these data sets using 200 TP. For the KPCA extracted data set, the first few bands give a very low k-value for 200 TP; from 20 bands onwards, the k-value for 200 TP is acceptable.

(iv) Increasing k-values were observed for the original and KPCA modified data sets, which stop increasing after a critical number of features and then start to decrease, for the same reason as reported for the GML classifier. For the OD and KPCA modified data sets, the k-values increase monotonically for 100 and 300 TP per class.

(v) For the PP modified data set, however, very low k-values are observed. All results for the PP extracted data set are therefore ignored in the comparison of results for the SVM_SMO classifier.

The k-values for the classification of these data sets were statistically analyzed to select the best results for each data set. The approach was similar to the one followed in the previous cases. The z-statistic values were obtained to compare each data set. The best k-values are shown in Table 5.8. The following can be inferred from these results:

(i) The best results obtained using feature modified data sets were found to be significantly better than those obtained using the OD set for the large training data size (300 TP). For the OSP modified data the improvement is marginal, but it can still be said to be significantly better than the OD set. For 200 TP training data, the performance of the OD, SPCA and OSP modified data is very poor, whereas the performance of the KPCA modified data is very high. SPCs were found to perform statistically better than the OD set for 100 TP per class and statistically similarly to OD for 200 TP.

(ii) The best results with the OD were obtained using 50-60 bands, while significantly better results than OD were obtained using the SPCA modified data sets with 15-30 features. For 300 TP, a result statistically similar to OD is obtained using the OSP modified data set with eight bands.

(iii) KPCs were observed to perform significantly better than the SPCA and OSP modified data sets for 200 TP. For 100 and 300 TP, the best results obtained with the SPCA modified data set are significantly better than those of the OSP and KPCA modified data sets.

(iv) Classification with OSP is found to be significantly better than with KPCA for 100 TP, while KPCA is observed to be statistically better than the OSP modified data for 200 and 300 TP. Thus it can be said that SPCA performs better than OD and any other feature extracted data, and that the performance of OSP is the worst for SVM_SMO based classification.

Figure 5.20: Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer. Panels: Original, KPCA, SPCA, OSP and PP.

Table 5.8: The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set

TP    OD            KPCA          SPCA          OSP           z-statistic
      k1      NB*   k2      NB    k3      NB    k4      NB    Z12      Z13      Z14     Z23      Z24      Z34
100   0.8955  50    0.8626  40    0.9304  15    0.8739  8     15.00    -15.91   9.85    -33.90   -4.99    28.25
200   0.1694  5     0.8826  50    0.1694  5     0.0001  8     -336.2   0.00     12.90   475.46   630.00   -
300   0.8934  60    0.9013  50    0.9436  30    0.8999  8     -3.80    -26.98   1.65    -23.75   5.56     28.86

NB* = No. of band used to obtain best k-value; ki* = k-value for ith FE technique

Figure 5.21: Comparison of classification time for different set of TP with respect to number of bands for SVM_SMO classification algorithm.

The total time taken to perform SVM_SMO based classification was observed to range from 55-90 seconds when the bands were increased from 5 to 65 for 25 TP. The same range was 145-194 seconds for 100 TP, 350-409 seconds for 200 TP and 1184 to 1814 seconds for 300 TP (Figure 5.21). Unlike SVM_QP, it is observed that when the number of bands increases, the classification time also increases. The time requirement for a large number of TP for SVM_SMO is observed to be significantly less than that of the SVM classification method based on the QP optimizer. This is due to the SMO optimization method. The solution derived for the SMO method needs very few numerical operations: the method needs a larger number of iterations, but each requires a small number of operations, resulting in an increase in optimization speed for very large data sets.

5.3.3 Experiment results for KPCA_SVM algorithm

The classification results obtained using the KPCA_SVM algorithm (QP optimizer) are presented in Figure 5.22. The rbf kernel function is used for classification of the different data sets using the KPCA_SVM algorithm. The following observations can be made on the basis of the k-values presented in Figure 5.22:

(i) For OD and KPCA extracted data, the KPCA_SVM classifier behaves unpredictably for all data sets, all sets of TP and different numbers of bands. The maximum k-value for OD is obtained for 200 TP with 35 bands and for KPCA for 200 TP with 25 bands.

(ii) For the SPCA extracted data set, k-values drop to about zero after 20 bands for each set of TP. The maximum k-value obtained by SPCA is better than that obtained by the OD and KPCA extracted data sets. The maximum k-value for each set of TP is obtained with five bands.

(iii) For the OSP extracted data set, the highest k-value is obtained for 200 TP. This value is higher than the k-values obtained for 200 TP with the other feature modified data sets. The reverse is seen for the OSP modified data set with 300 TP.

(iv) One important phenomenon is observed for the KPCA_SVM algorithm. For a large set of TP (300), KPCA_SVM provides a very low k-value. The best k-value is obtained for all data sets using 200 TP per class.

Figure 5.22: Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm (legend: Original, KPCA, SPCA, OSP).

The k-values for classification of these data sets were analyzed to select the best results for each data set. The approach was similar to that followed in previous cases. The following can be inferred from these results (Table 5.9):

(i) The best results obtained using feature modified data sets (except KPCA) were found to be significantly better than those obtained using the OD set for 200 TP. For 300 TP, OD provides a statistically better result than the other feature modified data. The performance of OD, KPCA and OSP modified data is not good. However, the performance of SPCA modified data is very high for 100 TP. SPCA is observed to perform statistically better than the OD set for 100 TP per class.

(ii) The best results with the OD were obtained with 50-60 bands, while significantly better results than OD were obtained using SPCA modified data sets with five to ten features for 100 and 200 TP per class. For the OSP modified data set, a statistically better result than OD is obtained using 200 TP with eight bands.

(iii) SPCs were observed to perform significantly better than OSP for 100 and 300 TP, while OSP performs statistically better than SPCs for 200 TP. KPCs perform statistically better than OSP for 100 TP. However, the performance of KPCs for 200 and 300 TP is statistically significantly lower than that of OSP.

(iv) SPCs always perform statistically better than KPCs, and OSP performs better than SPCA only for 200 TP. It can be concluded that for 100, 200 and 300 TP, KPCA_SVM performs best with the SPCA modified data set, the OSP modified data set and OD respectively. KPCA_SVM provides low k-values compared with the SVM_QP and SVM_SMO algorithms.

Table 5.9: The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets.

TP    OD            KPCA          SPCA          OSP           z-statistic
      k1      NB*   k2      NB    k3      NB    k4      NB    Z12     Z13      Z14      Z34
100   0.7110  50    0.5150  25    0.7565  10    0.4192  8     62.93   -15.32   93.69    96.77
200   0.6736  45    0.6514  30    0.6976  5     0.7917  8     6.98    -6.64    -39.72   -21.83
300   0.7142  55    0.5109  45    0.5340  5     0.3488  8     61.15   5.10     203.10   104.90

NB* = No. of band used to obtain best k-value

5.3.4 Class wise comparison of the best result of SVM

The ability of the SVM classifiers to separate different classes can be observed from Figure 5.23.

(i) The ability to distinguish the salt lake class is about the same for all SVM classifiers.

(ii) The accuracy with which the wheat class is separated by the SVM_QP and SVM_SMO classifiers is about the same. However, KPCA_SVM separates every class other than salt with much lower accuracy than the other two classifiers.

(iii) SVM_SMO separates all other classes with slightly lower accuracy than SVM_QP.

(iv) SVM_QP is the best classifier. It is able to separate all classes with high k-values.

Figure 5.23: Comparison of classification accuracy of individual classes for different SVM algorithms. WT – water, SLT – Salt Lake, HV – Hydrophytic veg, WHT – wheat, VY – Vineyards, BS – Bare soil, PL – Pasture land, BUA – Built-up area

5.3.5 Comparison of results for different SVM algorithms

The overall best results obtained by the different SVM algorithms were compared statistically to find the best SVM classification method in terms of the classification accuracy obtained. The same was done for the time scales observed for these methods in order to compare their practical applicability.

(i) From Table 5.10, it is observed that the SVM_QP method is statistically better than all other SVM algorithms for all sets of TP. The best results of SVM_QP are obtained for SPCA and PCA modified data sets for 100 and 200 TP per class. For 300 TP the best result is obtained for the SPCA modified data set. The classification time ranges from 148 seconds (for 100 TP) to 2596 seconds (300 TP). This time range is very high.

(ii) From Table 5.10, it is observed that the SVM_SMO algorithm is the second best SVM decision rule. The best k-values for SVM_SMO are obtained with 100, 200 and 300 TP using SPCA, KPCA and SPCA modified data sets respectively. The k-values obtained for 100 and 300 TP are slightly lower than those of SVM_QP, though

the required classification time using 300 TP is about two thirds of that of SVM_QP. Although SVM_SMO needs more bands than SVM_QP to obtain the best k-values for different sets of TP, its processing time is much lower than that of SVM_QP.

(iii) KPCA_SVM is the poorest of the three SVM methods. The highest k-value for KPCA_SVM is obtained by using the OSP modified data set with 200 TP. When the number of pixels is large, the performance of KPCA_SVM is lower.

From the above discussion, it can be concluded that SVM_QP is the best classifier with respect to accuracy. Considering both classification time and accuracy, SVM_SMO can be considered an effective SVM classifier. The best accuracy is obtained by SVM_QP using 300 TP with the first 20 bands of the SPCA modified data set. For SVM_SMO the best accuracy is obtained using 300 TP with the first 30 bands of the SPCA modified data set.

Table 5.10: Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms.

TP    SVM_QP                               SVM_SMO                        KPCA_SVM                       z-statistic
      k1      FEA*       Time (s)*  NB*    k2      FEA    Time (s)  NB    k3      FEA    Time (s)  NB    Z12     Z13      Z23
100   0.9408  PCA, SPCA  122.6      15     0.9304  SPCA   148.1     15    0.7565  SPCA   94.3      10    6.14    77.61    71.94
200   0.9621  PCA, SPCA  585.7      15     0.8836  KPCA   363.9     50    0.7927  OSP    262.3     8     45.16   77.51    36.4
300   0.9691  SPCA       2596.2     20     0.9446  SPCA   1694.8    30    0.7142  OD     1190.2    55    18.38   113.47   97.01

ki = best k-value for ith classifier; FEA* = Feature extraction algorithms; NB* = No. of band used to obtain best k-value; Time (s)* = Required time to obtain best k-value, presented in seconds
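The time comparison discussed above can be checked directly from the values in Table 5.10. The following is only a small arithmetic sketch; the dictionary layout and the helper name `time_ratio` are illustrative assumptions, while the numbers are copied from the table.

```python
# Best-result classification times in seconds, copied from Table 5.10.
# The dictionary layout and function name are illustrative, not from the thesis.
times = {
    100: {"SVM_QP": 122.6, "SVM_SMO": 148.1, "KPCA_SVM": 94.3},
    200: {"SVM_QP": 585.7, "SVM_SMO": 363.9, "KPCA_SVM": 262.3},
    300: {"SVM_QP": 2596.2, "SVM_SMO": 1694.8, "KPCA_SVM": 1190.2},
}


def time_ratio(tp, method, baseline="SVM_QP"):
    """Return the time of `method` as a fraction of the baseline time."""
    return times[tp][method] / times[tp][baseline]


for tp in sorted(times):
    # Ratio of SVM_SMO time to SVM_QP time for each set of TP.
    print(tp, round(time_ratio(tp, "SVM_SMO"), 2))
```

For 300 TP the ratio is about 0.65, which corresponds to the "about two thirds" stated in the text; for 100 TP the tabulated SVM_SMO time is slightly higher than that of SVM_QP.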

5.4 Comparison of best results of different classifiers

The best results obtained by the parametric (GML), non-parametric (KNN) and advanced (SVM) classifiers with different feature modified data sets have already been presented in Tables 5.2, 5.5 and 5.9. The best advanced classifier (SVM_QP) was chosen by statistically comparing all the advanced classifiers. The parametric, non-parametric and best advanced classifiers are compared statistically in order to identify the best among them with respect to classification accuracy and time. The corresponding z-statistic is presented in Table 5.11:


The following are observed from Table 5.11:

(i) GML performs statistically better than the KNN classifier for all sets of TP. Also, the classification time of GMLC is very small compared with that of KNNC.

(ii) GMLC performs statistically similarly to SVM_QP for 100 and 200 TP. For a large set of TP (300), the performance of the SVM_QP classifier is statistically significantly better than that of GMLC. However, the required classification time is very high for the SVM classifier.

(iii) SVM_QP provides statistically better results than KNNC for all sets of TP. From this it can be concluded that SVM_QP is the best classifier on the basis of classification accuracy. GML is ranked as the second best classifier.

(iv) It is also observed that all the classifiers obtain their best results using the SPCA modified data set. It is therefore concluded that SPCA is the best feature reduction technique among those tested, for all classifiers.

(v) The processing time of GMLC is much lower than that of any other classifier. GMLC provides a slightly poorer k-value than SVM_QP for 300 TP. Considering both classification time and accuracy, it can be concluded that GMLC is the best of the classifiers compared.

Table 5.11: Statistical comparison of different classifiers' results obtained for different data sets

TP    GML                             KNN                            SVM_QP                              z-statistic
      k1      FEA*   Time (s)*  NB*   k2      FEA    Time (s)  NB    k3      FEA        Time (s)  NB    Z12     Z13     Z23
100   0.9384  SPCA   60.6       20    0.8669  SPCA   661.2     10    0.9408  SPCA, PCA  122.6     15    36.82   -1.54   -38.06
200   0.9579  SPCA   64.7       30    0.9040  SPCA   1122.1    15    0.9573  SPCA, PCA  585.7     15    31.33   0.42    -30.98
300   0.9589  SPCA   82.6       45    0.9325  SPCA   1712.5    40    0.9691  SPCA       2596.2    20    16.00   -7.97   -25.37

ki = best k-value for ith classifier; FEA* = Feature extraction algorithm; NB* = No. of band used to obtain best k-value; Time (s)* = Required time to obtain best k-value, presented in seconds
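The k-values compared in these tables are kappa coefficients. As a reminder of how such a value is derived from a confusion matrix, a minimal sketch is given below. It uses the standard definition of Cohen's kappa; the function name `kappa` and the two-class example matrix are illustrative assumptions and are not results from this thesis.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical two-class confusion matrix (rows: reference, columns: classified).
example = [[45, 5],
           [10, 40]]
print(kappa(example))
```

In this hypothetical example the overall accuracy is 0.85 and the chance agreement is 0.50, so the kappa value is lower than the overall accuracy.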

The difference in performance of the GML, KNN and SVM classifiers can be attributed to the difference in their classification mechanisms. GML and KNN are capable of forming only simple decision boundaries, whereas SVM can form highly complex non-linear decision boundaries. In the given data, different kinds of class separability were observed for different classes. The water and salt classes were found to be easily separable from the rest of the classes. Classification accuracies of about 100% were observed for these classes with a very small number of features for all the classifiers. After these, the wheat, vineyards and bare-soil classes showed slightly lower accuracy values, which means they are a little more difficult to separate. The lowest accuracies were observed for the pasture land, built-up area and hydrophytic vegetation classes. These classes are very poorly separated and thus complex decision boundaries would be required to separate them. For a large set of TP, SVM_QP is able to achieve higher classification accuracies than the parametric and non-parametric classifiers because those classifiers were not able to separate the poorly separable classes as well.
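To illustrate why the decision boundary of a parametric classifier is comparatively simple, a minimal sketch of a Gaussian maximum likelihood rule is given below. It assumes equal prior probabilities and uses the usual discriminant based on the class mean and covariance; NumPy, the function names `fit_gml` and `predict_gml`, and the synthetic two-class data are assumptions made for illustration and do not reproduce the implementation used in this thesis.

```python
import numpy as np


def fit_gml(X, y):
    """Estimate a mean vector and covariance matrix for each class."""
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        params[label] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params


def predict_gml(params, X):
    """Assign each row of X to the class with the largest Gaussian discriminant."""
    labels = list(params)
    scores = []
    for label in labels:
        mean, cov = params[label]
        diff = X - mean
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of every pixel to the class mean.
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
        # Log-likelihood up to a constant, assuming equal priors.
        scores.append(-0.5 * logdet - 0.5 * maha)
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]


# Synthetic two-class, two-band data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gml(X, y)
print(predict_gml(model, np.array([[0.0, 0.0], [5.0, 5.0]])))
```

Because each class is summarized by a single mean and covariance, the boundary between two classes in this sketch is at most a quadratic surface, whereas a kernel SVM is not restricted in this way.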

Classified maps corresponding to the best results of different classifiers are

shown in Appendix A (Figure A.1).

5.5 Ramifications of results

HD classification is a demanding task due to the characteristics and large volume of the data. It is clear from the analysis that, depending on the availability of TP, the selection of FE techniques and classification algorithms is very important for classification of HD. Another important aspect to keep in mind is that the classification and FE procedures are time-consuming. This thesis work has pointed out some important guidelines for classification of HD (Table 5.12).

(i) When only a statistically insufficient number of TP is available, it is suggested to apply the SVM_QP algorithm with the OSP FE technique. This will provide high classification accuracy in minimum time.

(ii) GML is strongly recommended on the SPCA modified data set to achieve very high accuracy in very little time for statistically exact and statistically sufficient training data sets.

(iii) For a statistically large training data set, high accuracy could be achieved by implementing SVM_QP on the SPCA modified data set. Nevertheless, this method takes a very long processing time. It is therefore strongly recommended to apply GML on the SPCA modified data set: although the classification accuracy achieved is slightly lower than that of SVM_QP, the processing time is negligible compared with SVM_QP. SVM_SMO could also be used for a large set of TP on the SPCA modified data set.

(iv) Among all the popular FE techniques for HD, SPCA is the most effective, and it could be used to achieve high classification accuracy for HD with all classification techniques.


Table 5.12: Ranking of different classification algorithms depending on classification accuracy and time (rank 1 indicates the best).

Ranking depending on accuracy

TP    Parametric    Non-parametric    Advanced
      GML   FEA     KNN   FEA         SVM_QP  FEA          SVM_SMO  FEA     KPCA_SVM  FEA
25    2     SPCA    3     KPCA        1       SPCA, OSP    1        SPCA    4
100   1     SPCA    3     SPCA        1       PCA, SPCA    2        SPCA    4         SPCA
200                                                        3        KPCA    4         OSP
300   2     SPCA    4     SPCA        1       SPCA         3        SPCA    5         OD

Ranking depending on accuracy & time

TP    Parametric    Non-parametric    Advanced
      GML   FEA     KNN   FEA         SVM_QP  FEA          SVM_SMO  FEA     KPCA_SVM  FEA
25    2     SPCA    3     SPCA        1       OSP          1        SPCA    4         SPCA
100                                                        3        SPCA    5         SPCA
200                                                        3        KPCA    5         OSP
300   1     SPCA    4     SPCA        3       SPCA         2        SPCA    5         OD


CHAPTER 6 SUMMARY OF RESULTS AND CONCLUSIONS

Starting with a summary of the observations made in the previous chapter, this chapter mainly aims to summarize the conclusions corresponding to the main objectives defined in the first chapter. It also suggests some areas and methods for further research.

6.1 Summary of results

This research work is an extension of the work done by Abhinav (2009). For this research work, DAIS 7915 hyperspectral sensor data was used for testing different FE techniques and classification algorithms. The best results obtained in these experiments were compared with those obtained by Abhinav (2009). Based on the conclusions from the literature survey and the recommendations for future work by Abhinav (2009), several FE techniques (SPCA, KPCA, OSP, PP) and classification algorithms (KNN, GML, SVM based classifiers) have been tested to achieve the objectives mentioned in section 1.4.

For the parametric classifier (GML), experiments were performed on the different feature extracted data sets mentioned above. The best result obtained in these experiments was compared with the best result obtained by Abhinav (2009) to observe the improvement. For the non-parametric classifier (KNN), the first experiment was performed with OD. The algorithm was then applied to the different feature modified data. The best results for OD and feature extracted data were compared to obtain the best result for the non-parametric classifier. For the advanced classifiers (SVM_QP, SVM_SMO and KPCA_SVM), experiments were performed on OD as well as on feature modified data sets. For SVM_QP, as for GML, the best result was also compared with the best result obtained by Abhinav (2009). The best results of the different SVM classifiers were examined to identify the best SVM algorithm.


Lastly, the best results for the parametric, non-parametric and advanced classifiers were compared to find the best classifier for HD. All the comparisons were performed by one-tailed hypothesis testing at the 5% significance level.
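A comparison of two k-values by one-tailed testing at the 5% significance level can be sketched as follows. The sketch assumes the commonly used statistic in which the difference of two independent kappa values is divided by the square root of the sum of their variances, and the critical value 1.645 of the standard normal distribution. The function names and the numerical values are hypothetical; the kappa variances themselves are assumed to be available from the accuracy assessment.

```python
import math

# One-tailed critical value of the standard normal distribution at 5%.
Z_CRITICAL_ONE_TAILED_5PCT = 1.645


def kappa_z(k1, var1, k2, var2):
    """z-statistic for the difference of two independent kappa values."""
    return (k1 - k2) / math.sqrt(var1 + var2)


def significantly_better(k1, var1, k2, var2, z_crit=Z_CRITICAL_ONE_TAILED_5PCT):
    """True if classifier 1 is significantly better than classifier 2."""
    return kappa_z(k1, var1, k2, var2) > z_crit


# Hypothetical kappa values and variances, for illustration only.
print(round(kappa_z(0.95, 1e-5, 0.93, 1e-5), 2))
print(significantly_better(0.95, 1e-5, 0.93, 1e-5))
```

With this sign convention a positive z-statistic above the critical value favours the first classifier, and a negative value below the negative critical value favours the second.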

Classification experiments were performed using four FE techniques, namely SPCA, KPCA, OSP and PP. From the statistical analysis of the classification results obtained using these feature modified data sets, it can be concluded that, among the four FE techniques, the SPCA modified data set provides the best results. These results were also compared with the best classification results obtained by Abhinav (2009) using different FE techniques. SPCA performs better because it uses local statistics rather than global statistics.
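The idea of using local rather than global statistics can be sketched as follows: the bands are divided into contiguous segments and PCA is applied to each segment separately. This is only a minimal illustration using NumPy; the segment boundaries, the number of components kept per segment and the function names are hypothetical and are not the settings used in this thesis.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project X (pixels x bands) onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order, so take the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]


def segmented_pca(X, segments, n_components):
    """Apply PCA separately to each band segment and stack the results."""
    parts = [pca_scores(X[:, start:stop], n_components) for start, stop in segments]
    return np.hstack(parts)


# Hypothetical data: 100 pixels, 10 bands, split into two band segments.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
features = segmented_pca(X, segments=[(0, 4), (4, 10)], n_components=2)
print(features.shape)
```

Each segment contributes its own components, so the covariance matrices that are decomposed are small and describe only the bands within one segment.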

Analyzing the results of the different classifiers, it is observed that the results obtained from the PCA modified data set sometimes compete with those obtained from the SPCA modified data set. Generally, the different classifiers provide their best results using 15 to 30 bands of the SPCA or PCA modified data sets, which effectively reduces the classification time. OSP and PP, because of their very low dimensionality, generally fail to produce satisfactory results. However, the results obtained by using eight bands of the OSP modified data set are reasonably good, though they are not always statistically significantly better than those of the SPCA or PCA modified data sets. There is a possibility of improving the result by increasing the dimension of the OSP modified data set through the extraction of more endmembers. For the KPCA modified data set, the performance was observed to be consistently poor. KPCA can produce a satisfactory result if the dimension is increased, but this also increases the classification time proportionally. Therefore, KPCA is not considered an effective FE technique.

From the experiments performed with the parametric classifier (GML), it was observed that the performance of GML improved significantly after applying FE techniques. Comparing the results obtained with the best result obtained by Abhinav (2009), SPCA was found to work best among all available FE techniques in improving the classification accuracy of GML.

Moving on to the non-parametric classifier, it is observed that the result of the KNN classifier depends on the choice of the number of bands and neighbors. The best results were selected for KNN with and without applying FE techniques, and it was found that the result of KNN was enhanced by the PCA and SPCA techniques while the supervised FE techniques like KPCA and OSP failed to do so.

The SVM algorithm was selected as the advanced classifier. It is based on statistical learning theory and is expected to produce consistent and optimal results compared with the parametric and non-parametric classifiers. Different SVM algorithms (SVM_QP, SVM_SMO and KPCA_SVM) were tested to reach this goal. For SVM based classifiers, it was observed that the dimension of the data sets and the choice of optimizer significantly affect the results. The best result of SVM_QP was achieved with the SPCA feature extracted data set with 20 bands. It was also observed that the classification result of the advanced classifier improved on the best result obtained by Abhinav (2009), who obtained his best result using PCA modified data sets. This result was further improved by using the SPCA modified data set. This shows that, by using selected FE techniques, the classification results of the advanced classifier can be improved further. It was observed that supervised FE techniques like KPCA and OSP could not improve the result of SVM, while the unsupervised FE technique (SPCA) did improve the result. The best results of SVM_SMO and KPCA_SVM were obtained by using the SPCA and OSP modified data sets respectively. Comparing the best results of the different SVM algorithms, SVM_QP is concluded to be the best SVM classifier.

On comparing the best results obtained by the SVM classifiers with the best results of the parametric and non-parametric classifiers, it was found that the advanced classifier performs significantly better for both kinds of data set, original and feature extracted. The reason for the better performance of this classifier is the improvement in separating a few classes which show poor k-values when parametric or non-parametric classifiers are used. This observation is expected because of the difference in how the decision boundary is formed. The decision boundaries formed by parametric or non-parametric classifiers are simpler. For this reason they are unable to separate the poorly separable classes efficiently. Advanced classifiers are able to form complex, non-linear decision boundaries, which help them to separate the poorly separable classes.

Compared with the parametric classifier, SVM required more computation time and memory. In spite of these difficulties, a significant improvement over the parametric and non-parametric classifiers was observed for the advanced classifier. This strongly suggests that SVM is able to reduce the difficulties associated with HD classification.

6.2 Conclusions

Based on these results, the following conclusions are drawn:

1. Out of the various FE techniques for classification of HD, SPCA is the best FE technique, followed by PCA. In addition, orthogonal subspace projection could be an effective FE technique if its dimension could be increased.

2. Although advanced classifiers need a large processing time, they are able to reduce the problems concerned with the classification of HD much better than the parametric or non-parametric classifiers. For statistically exact and sufficient sets of TP, the performance of SVM_QP is not statistically better than that of the parametric classifier. For a large set of TP, SVM_QP produces statistically better results than all other classifiers. In addition, the SPCA FE technique was found to increase the accuracy significantly for the advanced, parametric and non-parametric classifiers alike.

6.3 Recommendations for future work

During the literature survey, some additional methods were found that are not included in this thesis work. These appear to offer scope for improving the accuracy and computation time of the advanced classifiers presented in this thesis. The following methods are recommended for future work:

(i). In this thesis work the high memory and computation time required by SVM methods were reduced a little by using different optimizers and algorithms. There is still a chance to reduce the computation time of the SVM algorithm by using the Lagrangian SVM algorithm (Mangasarian and Musicant, 2000). This requires further testing. In addition, some optimization techniques like Kernel Adatron (Bennett and Campbell, 200) and Successive Overrelaxation (SOR) (Mangasarian and Musicant, 1998) should also be tested, as they may reduce the computation time significantly.


(ii). For a large set of TP, the KPCA method takes much time. Lima and Zen (2005) suggested a method called Sparse KPCA which may reduce the computation time. This needs to be tested.

(iii). The KNN classifier was found to require a high computation time in this thesis work. This is because a large number of computations is required to classify a single pixel, and this cost grows in proportion to the size of the data set. In order to reduce it, a hash-table approach could be applied, which would reduce the number of computations.
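The computational cost mentioned in item (iii) can be seen from a minimal brute-force KNN sketch, in which every pixel to be classified is compared with every training pixel. The function names and the toy training data below are hypothetical and serve only to illustrate the cost; they are not the implementation or the data used in this thesis, and the hash-table approach itself is not sketched here.

```python
import math
from collections import Counter


def knn_classify(pixel, train_pixels, train_labels, k=3):
    """Classify one pixel by majority vote among its k nearest training pixels."""
    # One distance computation per training pixel for every classified pixel.
    distances = sorted(
        (math.dist(pixel, t), label) for t, label in zip(train_pixels, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def distance_count(n_pixels, n_train):
    """Number of distance computations needed by the brute-force search."""
    return n_pixels * n_train


# Hypothetical two-band training pixels for two classes.
train_pixels = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = ["water", "water", "salt", "salt"]
print(knn_classify((0.2, 0.1), train_pixels, train_labels, k=3))
print(distance_count(n_pixels=1000, n_train=2400))
```

In this sketch the number of distance computations is the product of the number of pixels to classify and the number of training pixels, which is the quantity an indexing structure would aim to reduce.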


REFERENCES

Barros, A. S. and Rutledge, D. N. (2005) ‘Segmented principal component transform–principal component analysis,’ Chemometrics and Intelligent Laboratory Systems, Vol. 78, pp. 125-137.

Bhattacharyya, A. (1943) ‘On a measure of divergence between two statistical populations defined by probability distributions,’ Bulletin of Calcutta Mathematical Society, Vol. 35, pp. 99-109.

Ben-Dor, E., Patkin K., Banin A. and Karnieli, A. (2002) ‘Mapping of several soil properties using DAIS-7915 hyperspectral scanner data – a case study over clayey soils in Israel,’ International Journal of Remote Sensing, Vol. 23, No. 6, pp. 1043-1062.

Bierwirth, P., Huston, D., and Blewett, R. (2002) ‘Hyperspectral mapping of mineral assemblages associated with gold mineralization in the Central Pilbara, Western Australia,’ Economic Geology and the Bulletin of the Society of Economic Geologists, Vol. 97, No. 4, pp. 819-826.

Boser, B. E., Guyon, I. M. and Vapnik, V. N. (1992) ‘A training algorithm for optimal margin classifiers,’ Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM, New York, NY, USA, pp. 144-152.

Carreira-Perpinan, M. A. (1997) ‘A review of dimension reduction techniques,’ Technical Report, Vol. 9, No. CS-96, Department of Computer Science, University of Sheffield.

Cha, G. H. (2005) ‘Kernel principal component analysis for content based image retrieval’, PAKDD 2005, LNAI 3518, pp. 844 – 849, Springer-Verlag Berlin Heidelberg.

Chang, C. I., Sun, T. L. E., and Althouse, M. L. G. (1998) ‘An unsupervised interference rejection approach to target detection and classification for hyperspectral imagery,’ Opt. Eng., Vol. 37, pp. 735-743.

Chang, C. I. (2005) ‘Orthogonal subspace projection (OSP) revisited: A comprehensive study and analysis,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 43, No. 3.

Cristianini, N., Shawe-Taylor, J. (2000) An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, UK.


Congalton, R. G. (1991) ‘A review of assessing the accuracy of classifications of remotely sensed data,’ Remote Sensing of Environment, Elsevier Science (pub.), Vol. 37, No. 1, pp. 35-46.

Cover, T. M. and Hart, P. E. (1967) ‘Nearest neighbor pattern classification,’ IEEE Transactions Information Theory, Vol. IT-13, No. 1, pp. 21–27.

Curran, P. J. and Dungan J. L. (1989) ‘Estimation of signal-to-noise – a new procedure applied to AVIRIS data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 27, No. 5, pp. 620-628.

Dasarathy, B. V. (1991) ‘Nearest neighbour (NN) norms: NN pattern classification techniques’, IEEE Computer Society Press, Los Alamitos, CA

Devijver, P. and Kittler, J. (1982) Pattern recognition: A statistical approach, Englewood Cliffs, New Jersey.

Dundar, M. M. and Landgrebe, D. A. (2004) ‘Toward an optimal supervised classifier for the analysis of hyperspectral data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 42, No. 1, pp. 271-277.

Friedman, J. H. (1987) "Exploratory projection pursuit," Journal of the American statistical association, 82, 249-266.

Fukunaga, K. (1990) Introduction to statistical pattern recognition, Rheinboldt, W. (edt.), II edn., Academic Press, Inc., San Diego, USA.

Garg, A (2009) Investigations on classification techniques for hyperspectral imagery, M. Tech Thesis, Indian Institute of Technology, Kanpur.

Harsanyi, J. C. and Chang, C. I. (1994) ‘Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 32, pp. 779-785.

Harsanyi, J. C. (1993) Detection and classification of subpixel spectral signatures in hyperspectral image sequences, Ph.D. dissertation, Dept. Elect. Eng., Univ. Maryland Baltimore County, Baltimore, MD.

Huber, P. J. (1985) ‘Projection pursuit’, The Annals of Statistics, 13, 435-475.

Hughes, G. (1968) ‘On the mean accuracy of statistical pattern recognizers,’ IEEE Transactions on Information Theory, Vol. IT-14, No. 1, pp. 55-63.

Hwang, W. J. and Wen, K. W. (1998) ‘Fast KNN classification algorithm based on partial distance search,’ Electronics Letters, Vol. 34, No. 21.


Hwang, J., Lay, S., and Lippman, A. (1994), ‘Nonparametric multivariate density estimation: A comparative study,’ IEEE Transactions Signal Processing, Vol.42, No. 10, pp. 2795-2810.

Ifarraguerri, A. and Chang, C. I. (2000) ‘Unsupervised hyperspectral image analysis with projection pursuit,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 6.

Jia, X. (1996) Classification techniques for hyperspectral remote sensing data, Ph. D. Thesis, University of Canberra.

Jones, M. C., and Sibson, R. (1987) ‘What is projection pursuit?’, Journal of the Royal Statistical Society, Ser. A, 150, 1-38.

Jimenez, L. O. and Landgrebe, D. A. (1998) ‘Supervised classification in high dimensional space: Geometrical, statistical and asymptotic properties of multivariate data,’ IEEE Transactions Systems, Man and Cybernetics - Part C: Applications and Reviews, Vol. 28, No. 1, pp. 39-54.

Kim, K. I., Franz, M. O., and Scholkopf, B. (2005) ‘Iterative Kernel principal component analysis for image modeling,’ IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 9.

Kohram. M. and Sap, M. N. M. (2008) ‘Composite kernel for support vector classification of hyperspectral data’, MICAI 2008, LNAI 5317, pp. 360 – 370, Springer-Verlag Berlin Heidelberg.

Kolahdouzan, M. and Shahabi, C. (2004) ‘Voronoi-based K Nearest Neighbor search for spatial network databases’. Proceedings of the 30th VLDB Conference,Toronto, Canada, 2004.

Lee, Y. J. and Huang, S. Y. (2005) ‘Reduced support vector machines: A statistical theory’, Taiwan.

Landgrebe, A. (1971) ‘Description and results of the LARS/GE data compression study,’ LARS Information Note, Vol. 21171.

Luenberger, D. (1984) Linear and nonlinear programming, II edn., Addison-Wesley, Menlo Park, California.

Luttrell, R. D. and Vogt, F. (2008) ‘Accelerating kernel principal component analysis (KPCA) by utilizing two dimensional wavelet compression: applications to spectroscopic imaging’, Wiley Inter Science.

Martinez, W. L. and Martinez, A. R. (2004) Exploratory data analysis with Matlab, Chapman and Hall /CRC


Mercer, J. (1909) ‘Functions of positive and negative type, and their connection with the theory of integral equations,’ Philosophical Transactions of the Royal Society of London, Series A, Vol. 209, pp. 415-446.

Nilsson, N. J. (1990) The mathematical foundations of learning machines, Morgan Kaufmann Publishers Inc., San Mateo, CA.

Pal, M. (2002) Factors influencing the accuracy of remote sensing classifications: A comparative study, Ph. D. Thesis, University of Nottingham.

Pechenizkiy, M. (2005) ‘The Impact of Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5’. B. Kégl and G. Lapalme (Eds.): AI 2005, LNAI 3501, pp. 268 – 279, 2005., Springer-Verlag Berlin Heidelberg

Ping, X., Guo, G., and Chen, G. (2006) A fast document classification algorithm based on improved KNN, IEEE Transaction.

Posse, C. (1995) ‘Tools for two-dimensional exploratory projection pursuit’, Journal of Computational and Graphical Statistics, Vol. 4, No. 2 (June, 1995), pp. 83- 100.

Richards, J. A. and Jia, X. (2006) Remote sensing digital image analysis: An introduction, IV edn., Springer, Berlin.

Robila, S. A. and Varshney, P. K. (2002) ‘Target detection in hyperspectral images based on independent component analysis,’ Proceedings of SPIE: Automatic Target Recognition XII, SPIE-International Society for Optical Engineering, Vol. 4726, pp. 173-182.

Schraudolph, N. N., Gunter, S. S., and Vishwanathan, V. N. Fast iterative kernel PCA, Statistical Machine Learning, National ICT Australia.

Smola, A. J. and Scholkopf, B. (1997) ‘On a kernel-based method for pattern recognition, regression, approximation, and operator inversion’, GMD Technical Report: 1064.

Sundaram, N. (2009) ‘Support vector machine approximation using kernel PCA’, Technical Report No. UCB/EECS-2009-94.

Vapnik, V. N. (1995) The nature of statistical learning theory, Springer, NY.

Vapnik V. N. (1998) Statistical learning theory. John Wiley and Sons, NY.

Varshney, P. K. and Arora, M. K. (2004) Advanced image processing techniques for remotely sensed hyperspectral data, Springer, NY.

Wegman, E. J. (1990) ‘Hyperdimensional data analysis using parallel coordinates,’ Journal of the American Statistical Association, Vol. 85, No. 411, pp. 664-675.


Welling, M. ‘Kernel principal component analysis,’ Department of Computer Science, University of Toronto.

Zhu, B., Jiang, L., Jin, F., Qin, L.,Vogel, A., and Tao, Y. (2007) ‘Walnut shell and meat differentiation using fluorescence hyperspectral imagery with ICA-KNN optimal wavelength selection’, Sens. & Instrumen. Food Qual. (2007) 1:123–131 DOI 10.1007/s11694-007-9015-z, Springer Science+Business Media, LLC 2007


APPENDIX A

[Map panels: GML, KNN, SVM_QP, SVM_SMO, KPCA_SVM, with a common legend]

Figure A.1: Classified maps corresponding to the best results of different classifiers
