SUPERVISED LEARNING WITH HYPERSPECTRAL DATA
A Dissertation Submitted
In Partial Fulfillment of the Requirements
for the Degree of
Master of Technology
By
SOUMYADIP CHANDRA
(Y8103044)
DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2010
ABSTRACT
Hyperspectral data (HD) can provide a much larger amount of spectral
information than multispectral data. However, it suffers from problems such as the
curse of dimensionality and data redundancy, and the data sets are very large.
Consequently, it is difficult to process these data sets and obtain satisfactory
classification results.
The objectives of this thesis are to find the best feature extraction (FE)
techniques and to improve the accuracy and time of HD classification using
parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest
neighbors (KNN)) and support vector machine (SVM) algorithms. To achieve these
objectives, experiments were performed with different FE techniques: segmented
principal component analysis (SPCA), kernel principal component analysis (KPCA),
orthogonal subspace projection (OSP) and projection pursuit (PP). A DAIS-7915
hyperspectral sensor data set was used for the investigations in this thesis.
From the experiments performed with the parametric and non-parametric
classifiers, the GML classifier was found to give the best results, with an overall
kappa value (k-value) of 95.89%. This was achieved using 300 training pixels (TP)
per class and 45 bands of the SPCA feature-extracted data set.
The SVM algorithm with the quadratic programming (QP) optimizer gave the best
results amongst all optimizers and approaches. An overall k-value of 96.91% was
achieved using 300 TP per class and 20 bands of the SPCA feature-extracted data
set. However, the supervised FE techniques such as KPCA and OSP failed to improve
the results obtained by SVM significantly.
The best results obtained for GML, KNN and SVM were compared by
one-tailed hypothesis testing. The SVM classifier performed significantly better
than the GML classifier for a statistically large set of TP (300).
For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM
on the SPCA-extracted data set was not statistically better than that of the
GML classifier.
ACKNOWLEDGEMENTS
I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit, for
his involvement, motivation and encouragement throughout and beyond the thesis
work. His expert direction has inculcated in me qualities which I will treasure
throughout my life. His patient hearing, critical comments and approach to the
research problem made me do better every time. His valuable suggestions at all
stages of the thesis work helped me to overcome various shortcomings of my
work. I also express my sincere thanks for his effort in going through the
manuscript carefully and making it more readable. It has been a great learning
and life-changing experience working with him.
I would like to express my sincere tribute to Dr. Bharat Lohani for his
friendly nature, excellent guidance and teaching during my stay at IITK.
I would especially like to thank Sumanta Pasari for his valuable
comments on and corrections of the manuscript of my thesis.
I would like to thank all of my friends, especially Shalabh, Pankaj, Amar,
Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all the other GI
people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous,
pleasant and memorable one.
In closing, I express my cordial homage to my parents and my best friend
for their unwavering support and encouragement to complete my study at IITK.
SOUMYADIP CHANDRA
July 2010
CONTENTS
CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 - Introduction
1.1 High dimensional space
1.1.1 What is hyperspectral data?
1.1.2 Characteristics of high dimensional space
1.1.3 Hyperspectral imaging
1.2 What is classification?
1.2.1 Difficulties in hyperspectral data classification
1.3 Background of work
1.4 Objectives
1.5 Study area and data set used
1.6 Software details
1.7 Structure of thesis
CHAPTER 2 - Literature Review
2.1 Dimensionality reduction by feature extraction
2.1.1 Segmented principal component analysis (SPCA)
2.1.2 Projection pursuit (PP)
2.1.3 Orthogonal subspace projection (OSP)
2.1.4 Kernel principal component analysis (KPCA)
2.2 Parametric classifiers
2.2.1 Gaussian maximum likelihood (GML)
2.3 Non-parametric classifiers
2.3.1 KNN
2.3.2 SVM
2.4 Conclusions from literature review
CHAPTER 3 - Mathematical Background
3.1 What is kernel?
3.2 Feature extraction techniques
3.2.1 Segmented principal component analysis (SPCA)
3.2.2 Projection pursuit (PP)
3.2.3 Kernel principal component analysis (KPCA)
3.2.4 Orthogonal subspace projection (OSP)
3.3 Supervised classifier
3.3.1 Bayesian decision rule
3.3.2 Gaussian maximum likelihood classification (GML)
3.3.3 k-nearest neighbor classification
3.3.4 Support vector machine (SVM)
3.4 Analysis of classification results
3.4.1 One tailed hypothesis testing
CHAPTER 4 - Experimental Design
4.1 Feature extraction technique
4.1.1 SPCA
4.1.2 PP
4.1.3 KPCA
4.1.4 OSP
4.2 Experimental design
4.3 First set of experiment (SET-I) using parametric and non-parametric classifier
4.4 Second set of experiment (SET-II) using advance classifier
4.5 Parameters
CHAPTER 5 - Results
5.1 Visual inspection of feature extraction techniques
5.2 Results for parametric and non-parametric classifiers
5.2.1 Results of classification using GML classifier (GMLC)
5.2.2 Class-wise comparison of result for GMLC
5.2.3 Classification results using KNN classifier (KNNC)
5.2.4 Class wise comparison of results for KNNC
5.3 Experiment results for SVM based classifiers
5.3.1 Experiment results for SVM_QP algorithm
5.3.2 Experiment results for SVM_SMO algorithm
5.3.3 Experiment results for KPCA_SVM algorithm
5.3.4 Class wise comparison of the best result of SVM
5.3.5 Comparison of results for different SVM algorithms
5.4 Comparison of best results of different classifiers
5.5 Ramifications of results
CHAPTER 6 - Summary of Results and Conclusions
6.1 Summary of results
6.2 Conclusions
6.3 Recommendations for future work
REFERENCES
APPENDIX A
LIST OF TABLES
Table Title
2.1 Summary of literature review
3.1 Examples of common kernel functions
4.1 List of parameters
5.1 The time taken for each FE techniques
5.2 The best kappa values and z-statistic (at 5% significance values) for GML
5.3 Ranking of FE techniques and time required to obtain the best k-value
5.4 Classification with KNNC on OD and feature extracted data set
5.5 The best k-values and z-statistic for KNNC
5.6 Rank of FE techniques and time required to obtain best k-value
5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set
5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set
5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets
5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms
5.11 Statistical comparison of different classifier’s results obtained for different data sets
5.12 Ranking of different classification algorithms depending on classification accuracy and time. (Rank: 1 indicate the best)
LIST OF FIGURES
Figure Title
1.1 Hyperspectral image cube
1.2 Fractional volume of a hypersphere inscribed in hypercube decreases as dimension increases
1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002)
1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area
1.5 Google earth image of study area
3.1 Overview of FE methods
3.2 Formation of blocks for SPCA
3.2a Chart of multilayered segmented PCA
3.3 Layout of the regions for the chi-square projection index
3.4 (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only
3.5 Outline of KPCA algorithm
3.6 KNN classification scheme
3.7 Outline of KNN algorithm
3.8 Linear separating hyperplane for linearly separable data
3.9 Non-linear mapping scheme
3.10 Brief description of SVM_QP algorithm
3.11 Overview of KPCA_SVM algorithm
3.12 Definitions and values used in applying one-tail hypothesis testing
4.1 SPCA feature extraction method
4.2 Projection pursuit feature extraction method
4.3 KPCA feature extraction method
4.4 OSP feature extraction method
4.5 Overview of classification procedure
4.6 Experimental scheme for Set-I experiments
4.7 The experimental scheme for advanced classifier (Set-II)
5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively
5.2 Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction
5.3 First six Segmented Principal Components (SPCs) (b) shows water body and salt lake
5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP
5.5 First six features obtained by using eight end-members
5.6 Two components of most interesting projections
5.7 Correlation images after applying various feature extraction techniques
5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands
5.9 Comparison of kappa values and classification times for GML classification method
5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP
5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP
5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP
5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP
5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP
5.15 Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class
5.16 Comparison of best k-value and classification time for original and feature extracted data set
5.17 Class wise accuracy comparison of OD and different feature extracted data for KNNC
5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer
5.19 Classification time comparison using 200 and 300 TP per class
5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer
5.21 Comparison of classification time different set of TPs with respect to number of bands for SVM_SMO classification algorithm
5.22 Overall kappa values observed for classification original and feature modified data sets using KPCA_SVM algorithm.
5.23 Comparison of classification accuracy of individual classes for different SVM algorithms
LIST OF ABBREVIATIONS
AC          Advanced classifier
DAFE        Discriminant analysis feature extraction
DAIS        Digital airborne imaging spectrometer
DBFE        Decision boundary feature extraction
FE          Feature extraction
GML         Gaussian maximum likelihood
HD          Hyperspectral data
ICA         Independent component analysis
KNN         k-nearest neighbors
k-value     Kappa value
KPCA        Kernel principal component analysis
KPCA_SVM    Support vector machine with kernel principal component analysis
MS          Multispectral data
NWFE        Nonparametric weighted feature extraction
Ncri        Critical value
OD          Original data
OSP         Orthogonal subspace projection
PCA         Principal component analysis
PCT         Principal component transform
PP          Projection pursuit
rbf         Radial basis function
SPCA        Segmented principal component analysis
SV          Support vectors
SVM         Support vector machine
SVM_QP      Support vector machine with quadratic programming optimizer
SVM_SMO     Support vector machine with sequential minimal optimizer
TP          Training pixels
Dedicated to
my family & guide
CHAPTER 1 INTRODUCTION
Remote sensing technology has brought a new dimension to earth
observation, mapping and many other fields. In the early days of this
technology, multispectral sensors were used for capturing data. Multispectral
sensors capture data in a small number of bands with broad wavelength intervals.
Because they have so few spectral bands, their spectral resolution is insufficient
to discriminate amongst many earth objects. If, however, the spectral measurement
is performed using hundreds of narrow wavelength bands, several earth objects can be
characterized precisely. This is the key concept of hyperspectral imagery.
Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger
information content, is voluminous and differs in its characteristics. Extracting
this large amount of information from HD therefore remains a challenge, and
cost-effective and computationally efficient procedures are required to classify
HD. Data classification is the categorization of data for its most effective and
efficient use. The desired outcome of classification is a high-accuracy thematic
map, and HD has the potential to provide one.
This chapter introduces the concept of high dimensional space, HD and the
difficulties in classification of HD. It then states the objectives of the thesis,
gives an overview of the data set and the software used, and ends with the
structure of the thesis.
1.1 High dimensional space
In mathematics, an n-dimensional space is a topological space whose
dimension is n (where n is a fixed natural number). A typical example is
n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions.
Spaces with large values of n are sometimes called high-dimensional
spaces (Werke, 1876). Many familiar geometric objects can be generalized to any
number of dimensions. For example, the two-dimensional triangle and the
three-dimensional tetrahedron can be seen as specific instances of the
n-dimensional simplex. Similarly, the circle and the sphere are particular forms
of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).
1.1.1 What is hyperspectral data?
When spectral measurement is made using hundreds of narrow contiguous
wavelength intervals, the captured image is called a hyperspectral image. A
hyperspectral image is usually represented by a hyperspectral image cube (Figure 1.1).
In this cube, the x and y axes specify the size of the image and the λ axis
specifies the dimension, i.e. the number of bands. Hyperspectral sensors collect
information as a set of images, one per band; each image represents one range of
the electromagnetic spectrum.
Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006)
These images are then combined to form a three-dimensional hyperspectral
cube. Because the dimensionality of HD is very high, it can be treated as a high
dimensional space, and it shares the characteristics of high dimensional space
described in the following section.
1.1.2 Characteristics of high dimensional space
High dimensional spaces, i.e. spaces with a dimensionality greater than three,
have properties that are substantially different from the ordinary intuition of
distance, volume and shape. In particular, in a high-dimensional Euclidean space,
volume expands far more rapidly with increasing diameter than in lower-dimensional
spaces, so that, for example:
(i) Almost all of the volume within a high-dimensional hypersphere lies in a thin
shell near its outer "surface".
(ii) The volume within a high-dimensional hypersphere relative to a hypercube of
the same width tends to zero as dimensionality tends to infinity, and almost all
of the volume of the hypercube is concentrated in its "corners".
These characteristics have two important consequences for high dimensional
data. First, high dimensional space is mostly empty. As a consequence, high
dimensional data can be projected to a lower dimensional subspace without losing
significant information in terms of separability among the different statistical
classes (Jimenez and Landgrebe, 1995). Second, normally distributed data tend to
concentrate in the tails; similarly, uniformly distributed data are more likely to
be found in the corners, which makes density estimation more difficult. Local
neighborhoods are almost empty, so the bandwidth of estimation must be large,
with the effect that detail in the density estimate is lost (Abhinav, 2009).
Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube
decreases as dimension increases (modified after Jimenez and Landgrebe, 1995)
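The two geometric statements in Section 1.1.2 can be checked numerically. The following Python sketch is an illustration added here, not code from the thesis (which used Matlab); the function names `inscribed_sphere_fraction` and `shell_fraction` are hypothetical. It relies on the standard volume of a d-dimensional ball of radius r, π^(d/2) r^d / Γ(d/2 + 1), and on the volume (2r)^d of the enclosing hypercube.

```python
import math

def inscribed_sphere_fraction(d):
    """Volume of a d-ball divided by the volume of the hypercube that encloses it."""
    # Ball radius r, cube side 2r; r cancels, leaving pi^(d/2) / (2^d * Gamma(d/2 + 1)).
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

def shell_fraction(d, eps):
    """Fraction of a d-ball's volume within relative thickness eps of its surface."""
    # Volume scales as r^d, so the inner ball of radius (1 - eps) holds (1 - eps)^d.
    return 1.0 - (1.0 - eps) ** d

if __name__ == "__main__":
    # Print dimension, inscribed-sphere fraction, and 5% outer-shell fraction.
    for d in (1, 2, 3, 5, 10, 20, 65):
        print(d, inscribed_sphere_fraction(d), shell_fraction(d, 0.05))
```

For d = 2 and d = 3 the first function returns π/4 and π/6, the familiar circle-in-square and sphere-in-cube ratios. The value d = 65 is included only because it matches the number of bands retained in this thesis.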
1.1.3 Hyperspectral imaging
Hyperspectral imaging collects and processes information from across the
electromagnetic spectrum. Hyperspectral imagery can distinguish between many
types of earth objects that appear to be the same color to the human eye, because
hyperspectral sensors look at objects using a vast portion of the electromagnetic
spectrum. The whole process of hyperspectral imaging can be divided into three steps:
preprocessing, radiance-to-reflectance transformation and data analysis (Varshney
and Arora, 2004).
Preprocessing is required to convert the raw radiance to sensor
radiance. It includes operations such as spectral calibration,
geometric correction, geo-coding and signal-to-noise adjustment. The radiometric and
geometric accuracy of hyperspectral data differs significantly from one band to
another (Varshney and Arora, 2004).
1.2 What is classification?
Classification means putting data into groups according to their characteristics.
In spectral classification, the areas of the image that have similar spectral
reflectance are put into the same group or class (Abhinav, 2009). Classification can
also be seen as a means of compressing image data, reducing the large range of
digital numbers (DN) in several spectral bands to a few classes in a single image.
Classification reduces this large spectral space to relatively few regions and
therefore results in a loss of numerical information from the original image.
Depending on the availability of information about the imaged region, supervised or
unsupervised classification methods are used.
1.2.1 Difficulties in hyperspectral data classification
Although HD can provide a more accurate thematic map than MS data,
classification of high dimensional data faces the difficulties listed below:
1. Curse of dimensionality and Hughes phenomenon: As the
dimensionality of the data set increases with the number of bands, the
number of training pixels (TP) required to train a given classifier
must also increase to achieve the desired classification accuracy.
It becomes very difficult and expensive to obtain a large
number of TP for each sub-class. This has been termed the “curse of
dimensionality” by Bellman (1960), and it leads to the concept of the “Hughes
phenomenon” (Hughes, 1968).
2. Characteristics of high dimensional space: The characteristics of high
dimensional space have been discussed above (Sec. 1.1.2). For
those reasons, the algorithms that are used to classify multispectral
data often fail for hyperspectral data.
3. Large number of highly correlated bands: Hyperspectral sensors use
a large number of contiguous spectral bands, and some of these
bands are highly correlated. Correlated bands do not
provide good classification results. An important task is therefore to
select the uncorrelated bands, or to make the bands uncorrelated by applying
feature reduction algorithms (Varshney and Arora, 2004).
4. Optimum number of features: It is critical to select the optimum
number of bands out of a large number of bands (e.g. 224 bands for an AVIRIS
image) for use in classification. To date there is no suitable algorithm or
rule for selecting the optimal number of features.
5. Large data size and high processing time due to complexity of
classifier: Hyperspectral imaging systems provide a large amount of data.
A large memory and a powerful system, which are generally very expensive,
are therefore needed to store and handle the data.
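Item 1 can be made concrete by counting the parameters that a Gaussian classifier such as GML must estimate for each class: a mean vector with d entries and a symmetric covariance matrix with d(d + 1)/2 distinct entries. The Python sketch below is an added illustration, not part of the original work; the function names are hypothetical, and the band counts are simply examples (224 is the AVIRIS figure quoted in item 4).

```python
def gaussian_parameters_per_class(d):
    """Number of free parameters of one d-dimensional Gaussian class model."""
    mean_terms = d                       # one mean per band
    covariance_terms = d * (d + 1) // 2  # distinct entries of a symmetric d x d matrix
    return mean_terms + covariance_terms

def min_pixels_for_invertible_covariance(d):
    """Fewest training pixels for which a sample covariance matrix can have full rank."""
    # A sample covariance built from n pixels has rank at most n - 1.
    return d + 1

if __name__ == "__main__":
    # Print band count, parameters per class, and minimum pixels per class.
    for bands in (3, 20, 45, 65, 224):
        print(bands,
              gaussian_parameters_per_class(bands),
              min_pixels_for_invertible_covariance(bands))
```

The parameter count grows quadratically with the number of bands, which is one way to see why the number of TP per class has to grow with dimensionality.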
1.3 Background of work
This thesis extends the work done by Abhinav Garg (2009) in his
M.Tech thesis. He showed that among the conventional classifiers
(Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER),
GML provides the best results. The performance of GML improved significantly
after applying feature extraction (FE) techniques. Principal component analysis
(PCA) was found to work best among all FE techniques (discriminant analysis
FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and
independent component analysis (ICA)) in improving the classification accuracy of GML.
Among the advanced classifiers, the results of SVM did not depend on the choice of
parameters, whereas those of the artificial neural network (ANN) did. He also
showed that the results of SVM were improved by PCA and ICA, while supervised FE
techniques such as NWFE and DBFE failed to improve them significantly.
He identified some drawbacks of advanced classifiers such as SVM and suggested
some FE techniques which may improve the results of conventional classifiers (CC) as
well as advanced classifiers (AC). In particular, for a large set of TP (e.g. 300 per
class) SVM takes more processing time than for a small set of TP. The objectives of
this thesis are to address these problems and to find the best FE technique for
improving the classification results for HD. These objectives are described in the
next section.
1.4 Objectives
This thesis investigates the following two objectives pertaining to
classification with hyperspectral data:
Objective-1:
To evaluate various FE techniques for classification of hyperspectral data.
Objective-2:
To study the extent to which advanced classifiers can reduce problems related to
classification of hyperspectral data.
1.5 Study area and data set used
The study area for this research is located within an area known as 'La
Mancha Alta', covering approximately 8000 sq. km to the south of Madrid, Spain (Fig.
1.4). The area is mainly used for cultivation of wheat, barley and other crops such as
vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer
on 29th June, 2000, at 5 m resolution.
Data were collected over 79 wavebands ranging from 0.4 μm to 12.5 μm, with the
exception of 1.1 μm to 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to
2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were
observed between bands 41 and 72. All 72 bands were visually examined, and 7
bands (41, 42 and 68 to 72) were found to be unusable due to very severe striping and
were removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels
covering the area of interest was extracted (Abhinav, 2009).
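The band bookkeeping described above can be sketched as follows. This Python fragment is an added illustration rather than the preprocessing code used in the thesis; the variable names are hypothetical and the band numbers are 1-based, as in the text.

```python
# Bands 1..72 cover 0.4 um to 2.5 um and were kept for analysis (Pal, 2002).
selected_bands = list(range(1, 73))

# Bands rejected after visual inspection because of severe striping.
striped_bands = {41, 42, 68, 69, 70, 71, 72}

# Keep the original order of the remaining bands.
retained_bands = [b for b in selected_bands if b not in striped_bands]

# Size of the extracted scene: rows x columns x retained bands.
rows, cols = 512, 512
values_in_cube = rows * cols * len(retained_bands)

print(len(retained_bands))   # number of bands retained
print(values_in_cube)        # number of samples in the image cube
```

The storage this cube needs depends on the data type of the samples, which is not stated here.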
The data set available for this research includes the 65 bands retained after
pre-processing and the reference image, generated with the help of field
data collected by local farmers, as described in Pal (2002). The area covered by the
imagery is divided into eight land cover types, namely wheat, water
body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and
built-up area.
Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)
Figure 1.4: FCC obtained by first 3 principal components and superimposed
reference image showing training data available for classes identified for study area (Pal, 2002).
Figure 1.5: Google earth image of study area (Google earth, 2007)
1.6 Software details
Processing HD requires a powerful system because of the size of the
data set and the complexity of the algorithms. The machine used for this thesis
had a 2.16 GHz Intel processor with 2 GB RAM and the Windows 7 operating system.
Matlab 7.8.0 (R2009a) was used to code the different algorithms. All results
were obtained on the same machine so that the different algorithms can be compared.
1.7 Structure of thesis
The present thesis is organized into six chapters. Chapter 1 focuses on the
characteristics of high dimensional space, the challenges of HD classification and
an outline of the experiments of this thesis. It also describes the study region,
the data set and the software used. Chapter 2 presents a detailed description of
HD classification and the previous research related to this domain. Chapter
3 describes the mathematical background of the different processes used in
this work. Chapter 4 outlines the methodology followed. Chapter 5 presents the
experiments conducted for this thesis, followed by their interpretation. Chapter 6
provides the conclusions of the present work and the scope for future work.
CHAPTER 2 LITERATURE REVIEW
This chapter outlines the important research and major achievements in
the field of high dimensional data analysis and data classification. The chapter
begins with some of the FE techniques and classification approaches suggested by
various researchers for solving problems related to HD classification. The results
of useful experiments with HD are also included to highlight the usefulness and
reliability of these approaches. These results are presented in tabular form. Some
other issues related to classification of HD are discussed at the end of the
chapter.
2.1 Dimensionality reduction by
Swain and Davis (1978) mentioned details of various separability measures for
multivariate normal class models. Various statistical classes are found to be
overlapping which causes error of misclassification as most of the classifiers use
decision boundary approach for classification. The idea was to obtain such a
separability measure which could give an overall estimate of range of classification
accuracies that can be achieved by using a sub-set of selected features so that the
sub-set of features corresponding to highest classification accuracy can be selected for
classification (Abhinav, 2009).
FE is the process of transforming the given data from a higher dimensional
space to a lower dimensional space while conserving the underlying information
(Fukunaga, 1990). The philosophy behind such transformation is to re-distribute the
underlying information spread in high dimensional space by containing it into
comparatively smaller number of dimensions without loss of significant amount of
useful information. FE techniques, in case of classification, try to enhance class
separability while reducing data dimensionality (Abhinav, 2009).
2.1.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied to
multispectral data for feature reduction. It can also be used as a tool for image
enhancement and digital change detection (Lodwick, 1979). For dimension
reduction of HD, PCA outperforms those FE techniques which are based on class
statistics (Muasher and Landgrebe, 1983). Further, because the number of TP is limited
and its ratio to the number of dimensions is low for HD, the class covariance matrix
cannot be estimated properly. To overcome these problems, Jia (1996) proposed the scheme
of segmented principal component analysis (SPCA), which applies the PCT to each of the
highly correlated blocks of bands. This approach also reduces the processing time by
partitioning the complete set of bands into several blocks of highly correlated bands.
Jensen and James (1999) reported that SPCA-based compression generally outperforms
PCA-based compression in terms of detection and classification accuracy on
decompressed HD. PCA works efficiently for highly correlated data sets, but SPCA
works efficiently for both highly correlated and weakly correlated data sets (Jia,
1996).
Jia (1996) compared SPCA and PCA extracted features for target detection and
concluded SPCA as a better FE technique than PCA. She also showed that both
feature extracted data sets are identical and there is no loss of variance in the middle
stages, as long as no components are removed.
2.1.2 Projection pursuit (PP)
Projection pursuit (PP) methods were originally posed and experimented with by
Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman
and Tukey (1974). They described PP as a way of searching for and exploring
nonlinear structure in multi-dimensional data by examining many 2-D projections.
Their goal was to find interesting views of a high dimensional data set. The next stages
in the development of the technique were presented by Jones (1983) who, amongst
other things, developed a projection index based on polynomial moments of the data.
Huber (1985) presented several aspects of PP, including the design of projection
indices. Friedman (1987) derived a transformed projection index. Hall (1989)
developed an index using methods similar to Friedman's, and also developed
theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b)
introduced a projection index called the chi-square projection pursuit index. He
used a random search method to locate a plane with an optimal value
of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. Each projection found in this
manner shows a structure that is less important (in terms of the projection index)
than the previous one. More recently, the PP technique has also been used to obtain 1-D
projections (Martinez, 2005). In this research work, Posse's method is followed, which
reduces an n-dimensional data set to 2-dimensional data.
2.1.3 Orthogonal subspace projection (OSP)
Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP)
method, which simultaneously reduces the data dimensionality, suppresses undesired
or interfering spectral signatures, and detects the presence of a spectral signature of
interest. The concept is to project each pixel vector onto a subspace which is
orthogonal to the undesired signatures. For OSP to be effective, the number of
bands must not be less than the number of signatures, which is a major limitation
for multispectral images. To overcome this, Ren and Chang (2000)
presented the Generalized OSP (GOSP) method, which relaxes this constraint in such a
manner that OSP can be extended to multispectral image processing in an
unsupervised fashion. OSP can be used to classify hyperspectral images (Ientilucci,
2001) and also for magnetic resonance image classification (Wang et al., 2001).
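The projection step described above can be sketched as follows. This is only a minimal illustration, not the implementation used in this thesis: the undesired signatures are orthonormalized with Gram-Schmidt and their components are subtracted from the pixel vector, which is equivalent to applying the projector P = I - U(U^T U)^(-1) U^T when the undesired signatures (columns of U) are linearly independent. The function names and the four-band toy vectors are assumptions made for the example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip signatures that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def project_out(pixel, undesired):
    """Project `pixel` onto the subspace orthogonal to all `undesired` signatures."""
    residual = list(pixel)
    for q in orthonormal_basis(undesired):
        coeff = dot(residual, q)
        residual = [ri - coeff * qi for ri, qi in zip(residual, q)]
    return residual


# Hypothetical 4-band example: two undesired signatures and one mixed pixel.
undesired = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
pixel = [3.0, 2.0, 1.0, 4.0]
cleaned = project_out(pixel, undesired)
print(cleaned)  # approximately [1, 0, -1, 4]
```

The cleaned vector has zero inner product with every undesired signature, so whatever remains can be matched against the signature of interest.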
2.1.4 Kernel principal component analysis (KPCA)
Linear PCA cannot always detect all the structure in a given data set. By the use of a
suitable nonlinear feature extractor, more information can be extracted from the data
set. Kernel principal component analysis (KPCA) can be used as a strong
nonlinear FE method (Scholkopf and Smola, 2002) which maps the input vectors to a
feature space and then applies PCA to the mapped vectors. KPCA is also a
powerful preprocessing step for classification algorithms (Mika et al.,
1998). Rosipal et al. (2001) proposed the application of the KPCA technique for feature
selection in a high-dimensional feature space where input variables were mapped by
a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of
the higher-order statistics. To obtain these higher-order statistics, a large number of
TP is required. This causes problems for KPCA, since KPCA requires storing and
manipulating the kernel matrix, whose size is the square of the number of TP. To
overcome this problem, a new iterative algorithm for KPCA, the Kernel Hebbian
Algorithm (KHA), was introduced by Scholkopf et al. (2005).
2.2 Parametric classifiers
Parametric classifiers (Fukunaga, 1990) require some parameters to develop
the assumed density function model for the given data. These parameters are
computed with the help of a set of already classified or labeled data points called
training data. It is a subset of the given data for which the class labels are known and is
chosen by sampling techniques (Abhinav, 2009). It is used to compute some class
statistics to obtain the assumed density function for each class. Such classes are
referred to as statistical classes (Richards and Jia, 2006) as these are dependent upon
the training data and may differ from the actual classes.
2.2.1 Gaussian maximum likelihood (GML)
The maximum likelihood method is based on the assumption that the frequency
distribution of the class membership can be approximated by the multivariate normal
probability distribution (Mather, 1987). Gaussian Maximum Likelihood (GML) is one
of the most popular parametric classifiers and has conventionally been used for
classification of remotely sensed data (Landgrebe, 2003). The advantages
of the GML classification method are that it can obtain the minimum classification error
under the assumption that the spectral data of each class is normally distributed, and
that it considers not only the class centre but also its shape, size and orientation by
calculating a statistical distance based on the mean values and covariance matrix of
the clusters (Lillesand et al., 2002).
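As a hedged illustration of how the mean vector and covariance matrix enter the GML decision, the sketch below classifies two-band pixels with the standard Gaussian discriminant g_j(x) = -0.5 ln|S_j| - 0.5 (x - m_j)^T S_j^(-1) (x - m_j), assuming equal prior probabilities. The class names, training values and function names are hypothetical and chosen only for this example; the restriction to two bands keeps the matrix inverse explicit.

```python
import math


def class_statistics(samples):
    """Mean vector and 2x2 sample covariance matrix of two-band training pixels."""
    n = len(samples)
    mean = [sum(s[b] for s in samples) / n for b in range(2)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mean, cov


def discriminant(x, mean, cov):
    """Gaussian log-likelihood up to an additive constant (equal priors assumed)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mean[0], x[1] - mean[1]]
    mahalanobis = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return -0.5 * math.log(det) - 0.5 * mahalanobis


def classify(x, stats):
    """Assign x to the class with the largest discriminant value."""
    return max(stats, key=lambda label: discriminant(x, *stats[label]))


# Hypothetical two-band training pixels for two classes.
training = {
    "water": [(1.0, 1.2), (1.4, 0.9), (0.8, 1.1), (1.1, 0.7)],
    "soil": [(4.0, 3.8), (4.5, 4.4), (3.6, 4.1), (4.2, 3.5)],
}
stats = {label: class_statistics(pixels) for label, pixels in training.items()}
print(classify((1.2, 1.0), stats), classify((4.1, 4.0), stats))
```

The covariance term is what lets the classifier account for the shape and orientation of each class cluster; with too few training pixels per band the covariance estimate becomes unreliable, which is the difficulty noted for HD.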
Lee and Landgrebe (1993) compared the results of the GML classifier on PCA and
DBFE feature extracted data sets and concluded that the DBFE feature extracted data set
provides better accuracy than the PCA feature extracted data set. The NWFE and DAFE FE
techniques were compared for the classification accuracy achieved by nearest neighbor
and GML classifiers by Kuo and Landgrebe (2004). They concluded that NWFE is a
better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA,
DAFE, DBFE and NWFE feature extracted data sets on the GML classifier. He showed
that PCA is the best FE technique for HD among the mentioned feature
extractors for the GML classifier. He also suggested that some FE techniques like KPCA,
OSP, SPCA and PP may improve the classification result using the GML classifier.
2.3 Non–parametric classifiers
Non–parametric classifiers (Fukunaga, 1990) use some control
parameters, carefully chosen by the user, to estimate the best fitting function by
using an iterative or learning algorithm. They may or may not require any training
data for estimating the PDF. Parzen window (Parzen, 1962) and k–nearest neighbor
(KNN) (Cover and Hart, 1967) are two popular classifiers under this
category. Edward (1972) gave brief descriptions of many non-parametric approaches
for estimation of data density functions.
2.3.1 KNN
The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern
recognition. The technique can achieve high classification accuracy in problems which
have unknown and non-normal distributions. However, it has the major drawback that
a large number of TP is required by the classifier, resulting in high computational
complexity for classification (Hwang and Wen, 1998).
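A minimal sketch of the KNN decision rule is given below, assuming Euclidean distance and a simple majority vote. The training pixels, class names and the choice k = 3 are hypothetical. The sketch also makes the computational drawback visible: every pixel to be classified is compared with every training pixel.

```python
import math
from collections import Counter


def knn_classify(x, training_pixels, training_labels, k=3):
    """Label x by majority vote among its k nearest training pixels."""
    # One distance per training pixel: the cost grows linearly with the number of TP.
    neighbours = sorted(
        (math.dist(x, p), label) for p, label in zip(training_pixels, training_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-band training pixels.
training_pixels = [(1.0, 1.2), (1.4, 0.9), (0.8, 1.1), (4.0, 3.8), (4.5, 4.4), (3.6, 4.1)]
training_labels = ["water", "water", "water", "soil", "soil", "soil"]
print(knn_classify((1.1, 1.0), training_pixels, training_labels))  # water
print(knn_classify((4.2, 4.0), training_pixels, training_labels))  # soil
```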
Pechenizkiy (2005) compared the performance of the KNN classifier on PCA
and random projection (RP) feature extracted data sets. He concluded that KNN
performs well on the PCA feature extracted data set. Zhu et al. (2007) showed that
KNN works better on the ICA feature extracted data set than on the original data set
(OD), which was captured by a hyperspectral imaging system developed by the ISL. The
ICA-KNN method with a few wavelengths had the same performance as the KNN
classifier alone using information from all wavelengths.
Some more non–parametric classifiers based on geometrical approaches to data
classification were found during the literature survey. These approaches consider the
data points to be located in Euclidean space and exploit the geometrical patterns
of the data points for classification. Such approaches are grouped into a new class of
classifiers known as machine learning techniques. Support Vector Machines (SVM)
(Boser et al., 1992) and k-nearest neighborhood (KNN) (Fix and Hodges, 1951) are among
the popular classifiers of this kind. These do not make any assumptions regarding
the data density function or the discriminating functions and hence are purely non–
parametric classifiers. However, these classifiers also need to be trained using
training data.
2.3.2 SVM
SVM is considered an advanced classifier. It is a new generation of
classification techniques based on Statistical Learning Theory, having its origins in
Machine Learning, and was introduced by Boser, Guyon and Vapnik (1992). Vapnik (1995,
1998) discussed SVM based classification in detail. SVM tends to improve learning by
empirical risk minimization (ERM), which minimizes the learning error, and by
structural risk minimization (SRM), which minimizes the upper bound on the overall
expected classification error. SVM makes use of the principle of optimal separation of
classes to find a separating hyperplane that separates the classes of interest to the
maximum extent by maximizing the margin between the classes (Vapnik, 1992). This
technique is different from the estimation of effective decision boundaries used by
Bayesian classifiers, as only data vectors near the decision boundary (known as support
vectors) are required to find the optimal hyperplane. A linear hyperplane may not be
enough to classify the given data set without error. In such cases, the data is transformed
to a higher dimensional space using a non–linear transformation that spreads the
data apart such that a linear separating hyperplane may be found. Kernel functions
are used to reduce the computational complexity that arises due to the increased
dimensionality (Varshney and Arora, 2004).
The advantages of SVMs (Varshney and Arora, 2004) lie in their high generalization
capability and their ability to adapt their learning characteristics by using kernel
functions, due to which they can adequately classify data in a high–dimensional feature
space with a limited number of training samples and are not affected by the Hughes
phenomenon and other effects of dimensionality. The ability to classify using even a
limited number of training samples makes SVM a very powerful classification tool
for remotely sensed data. Thus, SVM has the potential to produce accurate
classifications from HD with a limited number of training samples. SVMs are believed
to be better learning machines than neural networks, which tend to overfit classes,
causing misclassification (Abhinav, 2009), as SVMs rely on margin maximization
rather than finding a decision boundary directly from the training samples.
For a conventional SVM, an optimizer based on quadratic programming
(QP) or linear programming (LP) methods is used to solve the optimization problem. The
major disadvantage of the QP algorithm is the storage requirement of the kernel matrix in
memory. When the kernel matrix is large, it requires a huge amount of
memory that may not always be available. To overcome this, Bennett and Campbell
(2000) suggested an optimization method which sequentially updates the Lagrange
multipliers, called the kernel adatron (KA) algorithm. Another approach is the
decomposition method, which updates the Lagrange multipliers in parallel: it
updates many parameters in each iteration, unlike other methods that update one
parameter at a time (Varshney and Arora, 2004). The QP optimizer used here
updates the Lagrange multipliers on a fixed size working data set. The decomposition
method uses a QP or LP optimizer to solve the problem of a huge data set by considering
many small data sets rather than a single huge data set (Varshney, 2001). The
sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of the
decomposition method in which the size of the working data set is fixed such that an
analytical solution can be derived in very few numerical operations. It does not use
the QP or LP optimization methods. This method needs a larger number of iterations
but each iteration requires only a small number of operations, which results in an
increase in optimization speed for very large data sets.
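To illustrate what a sequential update of the Lagrange multipliers looks like, the sketch below follows the commonly cited form of the kernel adatron update, alpha_i <- max(0, alpha_i + rate * (1 - y_i * sum_j alpha_j y_j k(x_i, x_j))). It is an illustrative assumption rather than the optimizer used in this thesis: the RBF kernel, the learning rate, the fixed number of epochs, the omission of a bias term and the toy data are all chosen only for the example. Kernel values are computed on demand, so no kernel matrix is stored.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * sigma ** 2))


def kernel_adatron(points, labels, kernel=rbf, rate=0.5, epochs=200):
    """Sequentially update one Lagrange multiplier at a time (no bias term)."""
    n = len(points)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Margin of point i under the current multipliers.
            margin = labels[i] * sum(
                alpha[j] * labels[j] * kernel(points[i], points[j]) for j in range(n)
            )
            # Gradient step, clipped so that the multiplier stays non-negative.
            alpha[i] = max(0.0, alpha[i] + rate * (1.0 - margin))
    return alpha


def decide(x, points, labels, alpha, kernel=rbf):
    """Sign of the kernel expansion at x."""
    score = sum(a * y * kernel(x, p) for a, y, p in zip(alpha, labels, points))
    return 1 if score >= 0.0 else -1


# Hypothetical two-class toy data.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
labels = [1, 1, 1, -1, -1, -1]
alpha = kernel_adatron(points, labels)
print(decide((0.5, 0.5), points, labels, alpha), decide((3.5, 3.5), points, labels, alpha))
```

Each pass touches the multipliers one by one, which trades the memory needed by a QP solver for a larger number of cheap iterations.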
The speed of SVM classification decreases as the number of support vectors
(SV) increases. By using kernel mapping, different SVM algorithms have successfully
incorporated effective and flexible nonlinear models. There are some major difficulties
for large data sets due to the calculation of the nonlinear kernel matrix. To overcome the
computational difficulties, some authors have proposed a low rank approximation to
the full kernel matrix (Wiens, 1992). As an alternative, Lee and Mangasarian (2002)
proposed the reduced support vector machine (RSVM), which reduces
the size of the kernel matrix. However, the problem of selecting the number of
support vectors (SV) remained. In 2009, Sundaram proposed a method which reduces the
number of SV through the application of KPCA. This method differs from the other
proposed methods in that the exact choice of support vectors is not important as long as
the vectors span a fixed subspace.
Benediktsson et al (2000) applied KPCA to the ROSIS-03 data set. They then
used a linear SVM on the feature extracted data set and showed that KPCA features
are more linearly separable than the features extracted by conventional PCA. Shah et
al (2003) compared SVM, GML and ANN classifiers for accuracies at full
dimensionality and using the DAFE and DBFE FE techniques on an AVIRIS data set and
concluded that SVM gives higher accuracies than GML and ANN for full
dimensionality but poor accuracies for features extracted by DAFE and DBFE.
Abhinav (2009) compared SVM, GML and ANN on the OD and on PCA, ICA, NWFE,
DBFE and DAFE feature extracted data sets. He concluded that SVM provides better
results for the OD than GML. SVM works best with PCA and ICA feature extracted data
sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.
The work done by various researchers with different hyperspectral data sets
using different classifiers and FE methods, and the results obtained by them, are
summarized in Table 2.1.
Table 2.1: Summary of literature review

| Author | Dataset used | Method used | Results obtained |
|---|---|---|---|
| Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier was used to compare classification accuracies obtained by DBFE and PCA FE. | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharyya feature selection methods. |
| Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics were studied with respect to the effects of dimensionality and the order of data statistics used on supervised classification techniques. | The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy was observed to increase with the use of higher order statistics. Lower order statistics were observed to be less affected by the Hughes phenomenon. |
| Benediktsson et al (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets were used for classification using linear SVM. | KPCA features are more linearly separable than features extracted by conventional PCA. |
| Shah et al. (2003) | AVIRIS | Compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques. | SVM was found to give higher accuracies than GML and ANN for full dimensionality, but poor accuracies were obtained for features extracted by DAFE and DBFE. |
| Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC Mall, Washington, US) | NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor and GML classifiers. | NWFE was found to produce better classification accuracies than DAFE. |
| Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier was used to compare classification accuracies obtained by PCA and Random Projection FE. | PCA gave better results than Random Projection. |
| Zhu et al (2007) | Hyperspectral imaging system developed by ISL | ICA ranking methods were used to select the optimal wavelengths, after which KNN was used. KNN alone was then used for comparison. | ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands. |
| Sundaram (2009) | The Adult data set, part of the UCI Machine Learning Repository | KPCA was applied to the support vectors, then the usual SVM algorithm was used. | Significantly reduced the processing time without affecting the classification accuracy. |
| Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques were used on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. | GML was the best among these techniques and performs best on the PCA extracted data set. |
| Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques were used on the OD and on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare the accuracy. | GML performed much worse than SVM on the OD. SVM provides better accuracy than GML. SVM performs better on PCA and ICA extracted data sets. |
2.4 Conclusions from literature review
1. From Table 2.1, it can be concluded that FE techniques like PCA,
ICA, DAFE, DBFE and NWFE perform well in improving the classification
accuracies when used with GML. But the features extracted by DBFE and
DAFE failed to improve the results obtained by SVM, implying a limitation of these
techniques for advanced classifiers. KNN works best with PCA and ICA
feature extracted data sets. However, in the surveyed literature the effects of
PP, SPCA, KPCA and OSP extracted features on the classification accuracy
obtained from advanced classifiers like SVM, parametric classifiers like GML
and the non-parametric classifier KNN have not been observed.
2. Another important aspect found missing in the literature is the comparison of
classification time for SVM classifiers, because SVM takes a long time for
training with a large number of TP. Many SVM approaches have been
proposed to reduce the classification time, but there is no conclusion on the best
SVM algorithm in terms of classification accuracy and processing time.
3. Although KNN is an effective classification technique for HD, there is no guideline
on classification time or suggestion of the best FE techniques for the KNN classifier.
Also, the effect of different parameters like the number of nearest neighbors, the
number of TP and the number of bands has not been reported for KNN.
4. During the literature survey, it was further found that there is no suggestion of
the best FE techniques for the different SVM algorithms, GML and KNN.
These missing aspects will be investigated in this thesis work, and
guidelines for choosing an efficient and less time consuming classification technique
shall be presented as the result of this research.
This chapter presented the FE and classification techniques for mitigating the
effects of dimensionality. These techniques are the result of different approaches used
to deal with the problem of high dimensionality and to improve the performance of
advanced, parametric and non-parametric classifiers. The approaches were applied to
real life HD, and comparative results as reported in the literature were compiled and
presented here. In addition, the important aspects found missing in the literature
survey, which this thesis work shall try to investigate, were highlighted. The
mathematical rationale and algorithms used to apply these techniques will be
discussed in detail in the next chapter.
CHAPTER 3 MATHEMATICAL BACKGROUND
This chapter provides the detailed mathematical background of each of the
techniques used in this thesis. Starting with some basic concepts of kernels and
kernel space, this chapter describes the unsupervised and supervised FE
techniques, followed by the classification and optimization rules for supervised
classifiers. Finally, the scheme for statistical analysis which has been used for
comparing the results of the different classification techniques is discussed.
The notations followed in this chapter for matrices and vectors are given
below:
$X$ : A two dimensional matrix whose columns represent the data points (m) and whose
rows represent the bands (n), i.e. $X = [X]_{n \times m}$.
$x_i$ : n-dimensional single pixel column vector, where $X = [x_1, x_2, \ldots, x_m]$ and
$x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]^T$.
$c_j$ : Represents the jth class.
$\Phi(z)$ : Mapping of the input vector z into kernel space, using some kernel function.
$\langle a, b \rangle$ : Inner product of the vectors a and b.
$\in$ : Belongs to.
$R^n$ : Set of n-dimensional real vectors.
$N$ : Set of natural numbers.
$[\ ]^T$ : Denotes the transpose of a matrix.
$\forall$ : For all.
3.1 What is kernel?
Before defining a kernel, let's look at the following two definitions:
• Input space: The space in which the data points originally lie.
• Feature space: The space spanned by the transformed data points (from the
original space) which were mapped by some functions.
A kernel is the dot product in a feature space H reached via a map $\Phi$ from the input
space, $\Phi : X \rightarrow H$. It can be defined as $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where
$x, x'$ are elements of the input space and $\Phi(x), \Phi(x')$ are elements of the feature
space; k is called the kernel and $\Phi$ is called the feature map associated with k. $\Phi$ can
also be called the kernel function. The space containing these dot products is called
the kernel space. The mapping from input space to feature space is in general nonlinear
and can increase the distance between points of a data set, so that a data set which is
not linearly separable in input space may become linearly separable in
kernel space. A few definitions related to kernels are given below:
Gram matrix: Given a kernel k and inputs $x_1, x_2, \ldots, x_n \in X$, the $n \times n$ matrix
$K := (k(x_i, x_j))_{ij}$ is called the Gram matrix of k with respect to $x_1, x_2, \ldots, x_n$.
Positive definite matrix: A real $n \times n$ symmetric matrix K satisfying $x_1^T K x_1 \geq 0$
for all $x_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T \in R^n$ is called positive definite, where $x_1$ is a
column vector. If equality in the previous expression occurs only for
$x_{11} = x_{21} = \ldots = x_{n1} = 0$, then the matrix is called strictly positive definite.
Positive definite kernel: Let X be a nonempty set. A function $k : X \times X \rightarrow R$ which,
$\forall\, n \in N$ and $x_i \in X$, $i = 1, \ldots, n$, gives rise to a positive definite Gram matrix is
called a positive definite kernel. A function $k : X \times X \rightarrow R$ which, $\forall\, n \in N$ and
distinct $x_i \in X$, gives rise to a strictly positive definite Gram matrix is called a
strictly positive definite kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)

| Kernel function type | Definition $K(x, x_i)$ | Parameters | Performance depends on |
|---|---|---|---|
| Linear | $x \cdot x_i$ | - | Decision boundary either linear or non linear |
| Polynomial with degree n | $(x \cdot x_i + 1)^n$ | n is a positive integer | User defined parameters |
| Radial basis function | $\exp\left(-\frac{(x - x_i)^2}{2\sigma^2}\right)$ | $\sigma$ is a user defined value | User defined parameters |
| Sigmoid | $\tanh(k\,(x \cdot x_i) + \Theta)$ | k and $\Theta$ are user defined parameters | User defined parameters |
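The kernel functions of Table 3.1 can be written down directly, as in the hedged sketch below. The function names and default parameter values are assumptions made for illustration. The last part checks the statement that a kernel is a dot product in feature space: for two-dimensional inputs, the degree-2 polynomial kernel equals the ordinary dot product of the explicitly mapped vectors, where the six-component feature map is the standard expansion of $(x \cdot x_i + 1)^2$.

```python
import math


def dot(x, xi):
    """Ordinary dot product in input space."""
    return sum(a * b for a, b in zip(x, xi))


def linear_kernel(x, xi):
    return dot(x, xi)


def polynomial_kernel(x, xi, n=2):
    return (dot(x, xi) + 1.0) ** n


def rbf_kernel(x, xi, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, xi, k=1.0, theta=0.0):
    return math.tanh(k * dot(x, xi) + theta)


def poly2_feature_map(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    r2 = math.sqrt(2.0)
    return [x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0]


x, xi = (1.0, 2.0), (3.0, -1.0)
in_input_space = polynomial_kernel(x, xi)                           # (1 + 1)^2
in_feature_space = dot(poly2_feature_map(x), poly2_feature_map(xi))
print(in_input_space, in_feature_space)  # both equal 4 up to rounding
```

The kernel evaluates the six-dimensional dot product without ever forming the mapped vectors, which is why kernels keep the computation tractable when the feature space is very high dimensional.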
All the above definitions are illustrated with the following simple
example.
Let
$$X = [x_1\ x_2\ x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 1 \\ 1 & 3 & 3 \end{bmatrix}$$
be a matrix in input space whose columns $x_i$, $i = 1, 2, 3$, are the data points
$x_1 = (1, 2, 1)^T$, $x_2 = (2, 1, 3)^T$ and $x_3 = (1, 1, 3)^T$, and whose rows correspond to
the dimensions of the data points.
Let this matrix be mapped into the feature space by using the Gaussian kernel function, and
let $\langle x_i, x_j \rangle$ denote the inner product of the columns of the matrix X under the Gaussian
kernel function.
Then the Gram matrix (kernel matrix) K takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
The numerical value of the matrix K (the values correspond to $\sigma = 1$) is
$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$
K is a symmetric matrix. If the matrix K turns out to be positive definite, then the kernel
is called a positive definite kernel, and if it is strictly positive definite, then the kernel
is called a strictly positive definite kernel.
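The numerical Gram matrix of this example can be reproduced with a short sketch. The snippet takes the three data points as (1, 2, 1), (2, 1, 3) and (1, 1, 3) and assumes sigma = 1 in the Gaussian kernel; under this reading it reproduces the off-diagonal values 0.0498, 0.0821 and 0.6065 quoted above. The check for strict positive definiteness uses a Cholesky factorisation, which is an implementation choice made here and not a step taken from the text.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel exp(-|a - b|^2 / (2 sigma^2)); sigma = 1 is an assumption."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Gram matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def is_strictly_positive_definite(matrix):
    """Try a Cholesky factorisation; it succeeds only for strictly PD symmetric matrices."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


points = [(1, 2, 1), (2, 1, 3), (1, 1, 3)]
K = gram_matrix(points, gaussian_kernel)
for row in K:
    print([round(v, 4) for v in row])
print(is_strictly_positive_definite(K))  # True
```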
3.2 Feature extraction techniques
FE techniques are based on the simple assumption that a given data sample
$x \in X \subseteq R^n$, belonging to an unknown probability distribution in n-dimensional
space, can be represented in some coordinate system of an m-dimensional space (Carreira-
Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate
system such that when the data points from the higher dimensional space are projected
onto it, a dimensionally compact representation of these data points is obtained.
There are two main conditions for obtaining an optimal dimension reduction
(Carreira-Perpinan, 1997):
(i) Elimination of dimensions with very low information content. Features with
low information content can be discarded as noise.
(ii) Removal of redundancy among the dimensions of the data space, i.e. the reduced
feature set should be spanned by orthogonal vectors.
Both unsupervised and supervised FE techniques have been investigated in this research
work (Figure 3.1). Segmented principal component analysis (SPCA) and projection
pursuit (PP) are used as unsupervised FE techniques, and
kernel principal component analysis (KPCA) and orthogonal subspace projection
(OSP) are used as supervised FE techniques. The next sub-sections discuss the
assumptions used by these FE techniques in detail.
Figure 3.1: Overview of FE methods
3.2.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied in
multispectral data analysis, where it is used as a powerful tool for FE. For hyperspectral
image data, the PCT outperforms those FE techniques which are based on class
statistics. The main advantage of using a PCT is that global statistics are used to
determine the transform functions. However, implementing the PCT on a high dimensional
data set imposes a high computational load. SPCA can overcome the problem of long
processing time by partitioning the complete data set into several highly correlated
subgroups (Jia, 1996).
The complete data set is first partitioned into K subgroups with respect to the
correlation of bands. From the correlation image of HD, it can be seen that blocks are
formed by highly correlated bands (Figure 3.2). These blocks are selected as the
subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2 and k
respectively (Figure 3.2a). The PCT is then applied to each subgroup of data. After
applying the PCT to each subgroup, significant features are selected using the variance
information of each component. The PCs which together contain about 99% of the variance
are chosen for each block; the selected features can then be regrouped and transformed
again to compress the data further.
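The selection rule described above, keeping in each block the leading PCs that together hold about 99% of the variance, can be sketched as follows. The eigenvalues listed for each block are hypothetical; in practice they would be the eigenvalues of the covariance matrix of the bands in that block. The function name and the exact 0.99 threshold are assumptions made for the example.

```python
def components_for_variance(eigenvalues, fraction=0.99):
    """Smallest number of leading PCs whose cumulative variance share reaches `fraction`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


# Hypothetical covariance eigenvalues for three blocks of highly correlated bands.
block_eigenvalues = {
    "block 1": [52.0, 6.1, 0.4, 0.2, 0.1],
    "block 2": [18.0, 0.15, 0.05],
    "block 3": [30.0, 9.0, 2.5, 0.3, 0.1, 0.1],
}
kept = {name: components_for_variance(vals) for name, vals in block_eigenvalues.items()}
print(kept)  # {'block 1': 3, 'block 2': 2, 'block 3': 4}
```

Here 14 bands are reduced to 9 components in the first stage; the retained components of all blocks would then be regrouped and transformed again.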
Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27
bands respectively, corresponding to highly correlated bands have been formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all the variance, as does the conventional PCT. No
information is lost whether the transformation is conducted on the
complete vector at once or a few sub-vectors are transformed separately (Jia, 1996).
When the new components obtained from each segmented PCT are gathered and
transformed again, the resulting data variance and covariance are identical to
those of the conventional PCT. The main effect is that the data compression rate is
lower in the middle stages compared to the no-segmentation case. However, the
difference in compression rate is relatively small if the segmented transformation is
applied to subgroups which have poor correlation with each other.
Figure 3.2a: Chart of multilayered segmented PCA
3.2.2 Projection pursuit (PP)
Projection pursuit (PP) refers to a technique first described by Friedman and
Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by
means of selected low dimensional linear projections. To reach this goal, an objective
function, called the projection index, is assigned to every projection, characterizing the
structure present in the projection. Interesting projections are then automatically
picked up by optimizing the projection index numerically. Interesting
projections have usually been defined as the ones exhibiting departure from normality
(normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).
Posse (1990) proposed an algorithm based on a random search and a chi-
squared projection index for finding the most interesting plane (two-dimensional
view). The optimization method was able to locate in general the global maximum of
the projection index over all two-dimensional projections (Posse, 1995). The chi-
squared index was efficient, being fast to compute and sensitive to departure from
normality in the core rather than in the tail of the distribution. In this investigation
only chi-squared (Posse, 1995a, 1995b) projection index has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:
(i) A projection pursuit index that measures the degree of departure from normality.
(ii) A method for finding the projection that yields the highest value of the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal
value of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. The interesting projections are found in decreasing order of the value of the PP index. This implies that each
projection found in this manner shows a structure that is less important (in terms of
the projection index) than the previous one. In the following discussion, first the chi-
squared PP index has been described followed by the structure finding procedure.
Finally, the structure removal procedure is illustrated.
3.2.2.1 Posse chi-square index
Posse proposed an index based on the chi-square statistic. The plane is first
divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, that are distributed in the form of
rings (Figure 3.3). Inner boxes have the same radial width $R/5$ and all boxes have the
same angular width of $45^\circ$. $R$ is chosen so that the boxes have approximately the
same weight under normally distributed data, which gives $R = (2\log 6)^{1/2}$ and hence a radial width of $(2\log 6)^{1/2}/5$. The
outer boxes have weight 1/48 under normally distributed data. This choice for
the radial width provides regions with approximately the same probability for the
standard bivariate normal distribution (Martinez, 2001). The projection index is
given as:
$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9}\sum_{j=0}^{8}\sum_{k=1}^{48}\frac{1}{c_k}\left[\frac{1}{n}\sum_{i=1}^{n} I_{B_k}\!\left(z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)}\right) - c_k\right]^2 \qquad (3.1)$$
where
$\phi$: the standard bivariate normal density.
$c_k$: the probability evaluated over the kth region using the normal density function, given by $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$.
$B_k$: a box in the projection plane.
$\lambda_j = \pi j/36$, $j = 0, \ldots, 8$: the angle by which the data are rotated in the plane before being assigned to regions.
$\alpha, \beta$: orthonormal p-dimensional vectors which span the projection plane (they can be the first two PCs or two randomly chosen pixels of the OD set).
$P(\alpha, \beta)$: the plane spanned by the two orthonormal vectors $\alpha, \beta$.
$z_i^{\alpha}, z_i^{\beta}$: the sphered observations projected onto the vectors $\alpha$ and $\beta$ ($z_i^{\alpha} = z_i^{T}\alpha$ and $z_i^{\beta} = z_i^{T}\beta$).
$\alpha(\lambda_j) = \alpha\cos\lambda_j - \beta\sin\lambda_j$
$\beta(\lambda_j) = \alpha\sin\lambda_j + \beta\cos\lambda_j$
$I_{B_k}$: the indicator function for region $B_k$.
$PI_{\chi^2}(\alpha, \beta)$: the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.
The chi-square projection index is not affected by the presence of outliers.
However, it is sensitive to distributions that have a hole in the core, and it will also
yield projections that contain clusters. The chi-square projection pursuit index is fast
and easy to compute, making it appropriate for large sample sizes. Posse (1995a)
provides a formula to approximate the percentiles of the chi-square index.
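The index of Eq. (3.1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the thesis: the function names `box_probabilities` and `chi2_index` are hypothetical, the data matrix `Z` is assumed to be already sphered with one observation per row, and the box probabilities $c_k$ are computed in closed form from the fact that the radius of a standard bivariate normal satisfies $P(\text{radius} > a) = e^{-a^2/2}$.

```python
import numpy as np

def box_probabilities(n_rings=5, n_wedges=8):
    """Ring edges and the probabilities c_k of the 48 boxes (hypothetical helper)."""
    R = np.sqrt(2.0 * np.log(6.0))
    # Five inner rings of width R/5 plus one unbounded outer ring.
    edges = np.append(np.linspace(0.0, R, n_rings + 1), np.inf)
    # P(a <= radius < b) = exp(-a^2/2) - exp(-b^2/2) for a standard bivariate normal.
    ring_mass = np.exp(-edges[:-1] ** 2 / 2.0) - np.exp(-edges[1:] ** 2 / 2.0)
    # Each ring is split into 8 wedges of 45 degrees with equal probability.
    c = np.repeat(ring_mass / n_wedges, n_wedges)
    return edges, c

def chi2_index(Z, alpha, beta, n_rings=5, n_wedges=8):
    """Chi-square projection index of Eq. (3.1) for sphered data Z (n x p)."""
    edges, c = box_probabilities(n_rings, n_wedges)
    n = Z.shape[0]
    za, zb = Z @ alpha, Z @ beta
    total = 0.0
    for j in range(9):
        lam = np.pi * j / 36.0
        # Projections onto the rotated vectors alpha(lambda_j), beta(lambda_j).
        u = za * np.cos(lam) - zb * np.sin(lam)
        v = za * np.sin(lam) + zb * np.cos(lam)
        radius = np.hypot(u, v)
        angle = np.mod(np.arctan2(v, u), 2.0 * np.pi)
        ring = np.searchsorted(edges, radius, side="right") - 1
        wedge = np.minimum((angle / (2.0 * np.pi / n_wedges)).astype(int),
                           n_wedges - 1)
        counts = np.bincount(ring * n_wedges + wedge, minlength=c.size)
        total += np.sum((counts / n - c) ** 2 / c)
    return total / 9.0
```

For normally distributed data the observed box frequencies stay close to $c_k$ and the index is near zero; clustered data or data with a hole in the core leave some boxes over- or under-filled and raise the index.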
Figure 3.3: Layout of the regions for the chi-square projection index; the inner rings have radial width R/5, every box spans 45°, and each outer box has weight 1/48 under the standard bivariate normal (modified after Posse, 1995a).
3.2.2.2 Finding the structure (PPEDA algorithm)
For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all
possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal
procedure, this gives a sequence of interesting two-dimensional views of
decreasing importance. Starting with random planes, the algorithm tries to improve
the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and
$(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by,
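$$a_1 = \frac{\alpha^* + cv}{\|\alpha^* + cv\|}, \qquad b_1 = \frac{\beta^* - (a_1^{T}\beta^*)\,a_1}{\|\beta^* - (a_1^{T}\beta^*)\,a_1\|}$$
$$a_2 = \frac{\alpha^* - cv}{\|\alpha^* - cv\|}, \qquad b_2 = \frac{\beta^* - (a_2^{T}\beta^*)\,a_2}{\|\beta^* - (a_2^{T}\beta^*)\,a_2\|} \qquad (3.2)$$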
where c is a scalar that determines the size of the neighborhood visited, and v is a
unit p-vector uniformly distributed on the unit p-dimensional sphere. The idea is to
start a global search and then to concentrate on the region of the global maximum by
decreasing the value of c. After a specified number of steps, called half, without an
increase of the projection index, the value of c is halved. When this value is small
enough, the optimization is stopped. Part of the search still remains global to avoid
being trapped in a spurious local optimum. The complete search for the best plane consists of
m such random searches with different random starting planes. The goal of the PP
algorithm is to find the best projection plane.
The steps for PPEDA are given below:
1. Sphere the OD set; let Z be the matrix of the sphered data set.
2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and $\beta_0$ are orthonormal.
Consider this as the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with a higher value of the projection pursuit index
as the current best plane $(\alpha^*, \beta^*)$.
7. Repeat steps 4 through 6 while there are improvements in the projection
pursuit index.
8. If the index does not improve for the specified number of steps (half), decrease the value of c by
half.
9. Repeat steps 4 to 8 until c becomes some small number (say 0.01).
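The steps above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the thesis code: `random_search`, `candidate_planes`, `random_unit_vector` and the `max_iter` safety cap are hypothetical names introduced here, and `toy_index` is only a stand-in projection index (squared excess kurtosis of the two projections) so that the example is self-contained; in practice the chi-square index of Eq. (3.1) would be passed in its place.

```python
import numpy as np

def random_unit_vector(p, rng):
    """A unit p-vector uniformly distributed on the sphere (normalized Gaussian)."""
    v = rng.standard_normal(p)
    return v / np.linalg.norm(v)

def candidate_planes(alpha, beta, c, v):
    """The two candidate planes (a1, b1) and (a2, b2) of Eq. (3.2)."""
    planes = []
    for sign in (1.0, -1.0):
        a = alpha + sign * c * v
        a = a / np.linalg.norm(a)
        b = beta - (a @ beta) * a          # remove the component of beta along a
        b = b / np.linalg.norm(b)
        planes.append((a, b))
    return planes

def toy_index(Z, a, b):
    """Stand-in index (an assumption): squared excess kurtosis of both projections."""
    value = 0.0
    for d in (a, b):
        y = Z @ d
        y = (y - y.mean()) / y.std()
        value += (np.mean(y ** 4) - 3.0) ** 2
    return value

def random_search(Z, index, c=1.0, half=30, c_min=0.01, max_iter=20000, seed=0):
    """One random search for a plane with a high value of index(Z, a, b)."""
    rng = np.random.default_rng(seed)
    p = Z.shape[1]
    # Random orthonormal starting plane from a QR factorization.
    q, _ = np.linalg.qr(rng.standard_normal((p, 2)))
    alpha, beta = q[:, 0], q[:, 1]
    best = index(Z, alpha, beta)
    stalled = 0
    for _ in range(max_iter):
        if c <= c_min:
            break
        improved = False
        for a, b in candidate_planes(alpha, beta, c, random_unit_vector(p, rng)):
            value = index(Z, a, b)
            if value > best:
                alpha, beta, best, improved = a, b, value, True
        stalled = 0 if improved else stalled + 1
        if stalled >= half:                # no improvement for `half` steps
            c, stalled = c / 2.0, 0
    return alpha, beta, best
```

The complete search described in the text would call `random_search` m times with different seeds and keep the plane with the largest index.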
3.2.2.3 Structure removal
There may be more than one interesting projection, and there may be other
views that reveal insights about the hyperspectral data. To locate other views,
Friedman (1987) proposed a method called structure removal. In this approach, first
we perform the PP algorithm on the data set to obtain the structure which means the
optimal projection plane. The approach then removes the structure found at that
projection, and repeats the projection pursuit process to find a projection that yields
another maximum value of the projection pursuit index. By proceeding in this
manner, it will give a sequence of projections providing informative views of the data.
The procedure repeatedly transforms the projected data to standard normal until
they stop becoming more normal as measured by the projection pursuit index. One
starts with a $p \times p$ matrix, where the first two rows of the matrix are the vectors of
the projection obtained from PPEDA. The rest of the rows have ‘1’ on the diagonal and ‘0’ elsewhere. For example, if p = 4, then
$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$
The Gram-Schmidt orthonormalization process (Strang, 1988) makes the rows of $U^*$
orthonormal. Let $U$ be the orthonormalized version of $U^*$. The next step in the structure
removal process is to transform the Z matrix using the following equation,
$$T = UZ^{T} \qquad (3.4)$$
where T is a $p \times n$ matrix. With this transformation, the first two rows of T hold, for
every observation, the projection onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then
performed by applying a transformation ($\Theta$) that transforms the first two rows of T to a
standard normal and leaves the rest unchanged
(Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows,
$$\Theta(T_1) = \phi^{-1}\left[F(T_1)\right], \qquad \Theta(T_2) = \phi^{-1}\left[F(T_2)\right], \qquad \Theta(T_i) = T_i, \quad i = 3, 4, \ldots, p \qquad (3.5)$$
where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$
and $T_2$ are the first two rows of the matrix T, and F is a function defined through Eqs. (3.7) and (3.8).
From Eq. (3.5), it is seen that only the first two rows of T change. $T_1$ and $T_2$ can
be written as,
$$T_1 = \left(z_1^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*}\right), \qquad T_2 = \left(z_1^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*}\right) \qquad (3.6)$$
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the jth observation projected onto the plane
spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as
follows,
$$\tilde{z}_j^{1(t)} = z_j^{1(t)}\cos\gamma + z_j^{2(t)}\sin\gamma, \qquad \tilde{z}_j^{2(t)} = z_j^{2(t)}\cos\gamma - z_j^{1(t)}\sin\gamma \qquad (3.7)$$
where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_j^{1(t)}$ represents the jth element of $T_1$ at the tth
iteration of the process. Applying the following transformation to the
rotated points of Eq. (3.7), denoted $\tilde{z}_j^{1(t)}$ and $\tilde{z}_j^{2(t)}$, replaces each rotated observation by its normal score in the
projection,
$$z_j^{1(t+1)} = \phi^{-1}\left[\frac{r\!\left(\tilde{z}_j^{1(t)}\right) - 0.5}{n}\right], \qquad z_j^{2(t+1)} = \phi^{-1}\left[\frac{r\!\left(\tilde{z}_j^{2(t)}\right) - 0.5}{n}\right] \qquad (3.8)$$
where $r\!\left(\tilde{z}_j^{1(t)}\right)$ represents the rank of $\tilde{z}_j^{1(t)}$ among the n rotated values.
With this procedure, the projection index is reduced by making the data more
normal. During the first few iterations, the projection index should decrease rapidly
(Friedman, 1987). After approximate normality is obtained, the index might oscillate
with small changes. Usually, the process takes between 5 and 15 complete iterations to
remove the structure. Once the structure is removed using this process, the data are
transformed back using the following equation,
$$Z' = U^{T}\,\Theta\!\left(UZ^{T}\right) \qquad (3.9)$$
From matrix theory (Strang, 1988), it is known that all directions that are
orthogonal to the structure (i.e., all rows of T other than the first two) have not been
changed, whereas the structure has been Gaussianized and then transformed back.
The next section summarizes the steps of PP.
3.2.2.4 Steps of PP
1. Load the data and set the values of the parameters: the number of best
projection planes (N), the number of random starts (m), the value of c,
and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) (Section 3.2.2.2) using
the Posse chi-square index.
4. Remove the structure (to reduce the effect of local optima) and find another
structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the N best projection planes (orthogonal to each other)
are obtained.
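The structure removal used in step 4 (Eqs. (3.3) to (3.9)) can be sketched as below. This is a hedged sketch, not the thesis implementation: `normal_scores` and `remove_structure` are hypothetical names, a fixed number of passes `n_iter` replaces the stopping test on the projection index, the Gram-Schmidt step is carried out through a QR factorization, and the inverse normal distribution function comes from Python's standard `statistics.NormalDist`. Z is assumed to be the sphered n × p data matrix, and the result is returned as n × p.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by its normal score, as in Eq. (3.8)."""
    n = values.size
    ranks = np.argsort(np.argsort(values)) + 1      # ranks 1..n
    probs = (ranks - 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])

def remove_structure(Z, alpha, beta, n_iter=10):
    """Gaussianize the plane (alpha, beta) of sphered data Z (n x p)."""
    p = Z.shape[1]
    # U*: projection vectors in the first two rows, identity rows below (Eq. 3.3).
    U_star = np.eye(p)
    U_star[0], U_star[1] = alpha, beta
    # QR of U*^T orthonormalizes the rows of U* (Gram-Schmidt up to sign).
    Q, Rm = np.linalg.qr(U_star.T)
    U = (Q * np.sign(np.diag(Rm))).T
    T = U @ Z.T                                      # Eq. (3.4), p x n
    for _ in range(n_iter):
        for gamma in (0.0, np.pi / 4, np.pi / 8, 3 * np.pi / 8):
            # Rotation of Eq. (3.7); both rows are computed before assignment.
            t1 = T[0] * np.cos(gamma) + T[1] * np.sin(gamma)
            t2 = T[1] * np.cos(gamma) - T[0] * np.sin(gamma)
            T[0], T[1] = normal_scores(t1), normal_scores(t2)
    return (U.T @ T).T                               # Eq. (3.9), back to n x p
```

Only the first two rows of T are touched, so every direction orthogonal to the removed plane is returned unchanged.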
3.2.3 Kernel principal component analysis (KPCA)
Kernel principal component analysis (KPCA) means conducting PCT in feature
space (kernel space). KPCA is applied to variables that are nonlinearly related
to the input variables. In this section the KPCA algorithm is described by way of the
PCA algorithm.
First, m TP ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal
axes by diagonalizing the following covariance matrix,
$$C = \frac{1}{m}\sum_{j=1}^{m} x_j x_j^{T} \qquad (3.10)$$
The covariance matrix C is positive semi-definite; hence, its eigen values are non-negative.
They are obtained from
$$\lambda v = Cv \qquad (3.11)$$
For PCA, first sort the eigen values in decreasing order and find the corresponding
eigen vectors. Then project the test point onto the eigen vectors. PCs are obtained in this
manner. The next step is to rewrite PCA in terms of dot products. Substituting
Eq. (3.10) in Eq. (3.11) gives
$$Cv = \frac{1}{m}\sum_{j=1}^{m} x_j x_j^{T} v = \lambda v$$
Thus
$$v = \frac{1}{m\lambda}\sum_{j=1}^{m} x_j x_j^{T} v = \frac{1}{m\lambda}\sum_{j=1}^{m} (x_j \cdot v)\, x_j \qquad (3.12)$$
since $(x x^{T})\,v = (x \cdot v)\,x$.
In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions v with $\lambda \neq 0$
lie in the span of $x_1, \ldots, x_m$, i.e.
$$v = \sum_{i=1}^{m} \alpha_i x_i \qquad (3.13)$$
Steps for KPCA
1. For KPCA, first transform the TPs to the feature space
($H$) using a mapping function ($\Phi$). The data set ($\Phi(x_i)$, $i = 1, \ldots, m$) in feature space is assumed to be centered to
reduce the complexity of calculation. The covariance matrix of the data
set in H takes the following form,
$$C = \frac{1}{m}\sum_{j=1}^{m} \Phi(x_j)\,\Phi(x_j)^{T} \qquad (3.14)$$
2. Find the eigen values $\lambda \geq 0$ and the corresponding non-zero eigen vectors
$v \in H \setminus \{0\}$ of the covariance matrix C from the equation,
$$\lambda v = Cv \qquad (3.15)$$
3. As shown previously (for PCA), all solutions v (with $\lambda \neq 0$) lie in the span of
$\Phi(x_1), \ldots, \Phi(x_m)$, i.e.,
$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.16)$$
Therefore,
$$Cv = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.17)$$
Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),
$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_i)\,\Phi(x_i)^{T}\,\Phi(x_j) \qquad (3.18)$$
4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^{T}\Phi(x_j)$. Substituting this in Eq.
(3.18), the following equation is obtained.
$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_i)\, K(x_i, x_j) \qquad (3.19)$$
5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product
kernel, premultiply both sides by $\Phi(x_k)^{T}$ for all $k = 1, \ldots, m$. Define the $m \times m$
matrix K, called the kernel matrix, whose ijth element is the inner-product
kernel $K(x_i, x_j)$, and the vector $\alpha$ of length m, whose jth element is the coefficient
$\alpha_j$.
6. Finally, Eq. (3.19) can be written as,
$$\lambda \sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^{T}\Phi(x_j) = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^{T}\Phi(x_i)\,\Phi(x_i)^{T}\Phi(x_j), \qquad \forall\, k = 1, 2, \ldots, m \qquad (3.20)$$
Now Eq. (3.20) can be transformed (using $K(x_i, x_j) = \Phi(x_i)^{T}\Phi(x_j)$) into
$$m\lambda K\alpha = K^{2}\alpha \qquad (3.21)$$
To find the solution of Eq. (3.21), the eigen value problem of Eq. (3.22) needs to be
solved,
$$m\lambda\alpha = K\alpha \qquad (3.22)$$
7. The solution of Eq. (3.22) provides the eigen values and eigen vectors of the kernel
matrix K. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigen values of K and $\beta_1, \beta_2, \ldots, \beta_m$ be
the corresponding set of eigen vectors, with $\lambda_p$ being the last non-zero eigen
value.
Figure 3.4: (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).
8. To extract the principal components, it is necessary to compute the projection onto the
eigen vectors $\beta_n$ in H ($n = 1, \ldots, p$). Let x be a test point, with an image $\Phi(x)$ in
H. Then
$$\left(\beta_n, \Phi(x)\right) = \sum_{i=1}^{m} \beta_{ni}\left(\Phi(x_i), \Phi(x)\right) \qquad (3.23)$$
9. In the above algorithm, it has been assumed that the data set is centered, but
it is certainly difficult to obtain the mean of the mapped data in feature space
H (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in
feature space. However, there is a way to do it by slightly modifying the
equation for kernel PCA. It is necessary to diagonalize the kernel matrix $\tilde{K}$,
$$\tilde{K}_{i,j} = \left(K - 1_m K - K 1_m + 1_m K 1_m\right)_{i,j}, \qquad \text{where } (1_m)_{ij} = \frac{1}{m} \;\; \forall\, i, j \qquad (3.24)$$
Figure 3.5 provides the outline of the KPCA algorithm.
Figure 3.5: Outline of KPCA algorithm
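The KPCA steps can be sketched with NumPy as follows. This is a minimal sketch, not the thesis code: the names `rbf_kernel`, `kpca_fit` and `kpca_transform` are hypothetical, a Gaussian (RBF) kernel with width parameter `gamma` is assumed because the text does not fix a kernel at this point, and the eigen vectors of the centered kernel matrix are rescaled by $1/\sqrt{\text{eigen value}}$ so that the corresponding feature-space axes have unit length (a common normalization, assumed here).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and B (assumed kernel)."""
    sq = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_fit(X, n_components, gamma=1.0):
    """Diagonalize the centered kernel matrix of Eq. (3.24)."""
    m = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one_m = np.full((m, m), 1.0 / m)
    K_c = K - one_m @ K - K @ one_m + one_m @ K @ one_m
    eigval, eigvec = np.linalg.eigh(K_c)             # ascending order
    order = np.argsort(eigval)[::-1][:n_components]  # keep the largest ones
    eigval, eigvec = eigval[order], eigvec[:, order]
    coeffs = eigvec / np.sqrt(eigval)                # unit-norm axes in H
    return {"X": X, "K": K, "coeffs": coeffs, "gamma": gamma}

def kpca_transform(model, X_new):
    """Project new points onto the kernel principal axes (Eq. 3.23)."""
    X, K = model["X"], model["K"]
    m = X.shape[0]
    K_new = rbf_kernel(X_new, X, model["gamma"])
    one_m = np.full((m, m), 1.0 / m)
    one_new = np.full((X_new.shape[0], m), 1.0 / m)
    # Center the test kernel consistently with the training kernel.
    K_new_c = K_new - one_new @ K - K_new @ one_m + one_new @ K @ one_m
    return K_new_c @ model["coeffs"]
```

A typical use would fit on the training pixels, for example `model = kpca_fit(X_train, n_components=10)`, and then call `kpca_transform(model, X_pixels)` on every pixel to build the feature extracted bands; `X_train` and `X_pixels` are placeholder names.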
3.2.4 Orthogonal subspace projection (OSP)
The aim of orthogonal subspace projection is to eliminate all unwanted or undesired spectral
signatures (background) within a pixel, and then to use a matched filter to extract the
desired spectral signature (endmember) present in that pixel.
3.2.4.1 Automated target generation process algorithm (ATGP)
In hyperspectral image analysis a pixel may encompass many different
materials; such pixels are called mixed pixels. A mixed pixel contains multiple spectral
signatures. Let a column vector $r_i$ represent the ith mixed pixel by the linear model,
$$r_i = M\alpha_i + n_i \qquad (3.25)$$
where $r_i$ is an $l \times 1$ column vector representing the ith mixed pixel and l is the
number of spectral bands. Each distinct material in the mixed pixel is called an
endmember. Assume that there are p spectrally distinct endmembers in the ith
mixed pixel. M is a matrix of dimension $l \times p$ made up of linearly independent
columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$. The system is
considered over-determined ($l > p$), and $m_j$ denotes the spectral signature of
the jth distinct material or endmember. Let $\alpha_i$ be a $p \times 1$ column vector given by
$(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^{T}$, where the jth element represents the fraction of the jth
signature present in the ith mixed pixel. $n_i$ is an $l \times 1$ column vector representing
white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where I is an $l \times l$
identity matrix.
In Eq. (3.25), the $r_i$ are assumed to be linear combinations of the p endmembers with
the weight coefficients given by the fraction vector $\alpha_i$. The term $M\alpha_i$ can be
rewritten to separate the desired spectral signature from the undesired signatures;
in other words, targets are separated from background. When searching for a single
spectral signature this can be written as:
$$M\alpha = d\alpha_p + U\gamma \qquad (3.26)$$
where d is an $l \times 1$ column vector, the desired signature of interest (the column vector $m_p$),
while $\alpha_p$ is $1 \times 1$, the fraction of the desired signature. The matrix U is composed of
the remaining column vectors of M. These are the undesired spectral signatures or
background information, given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with
dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components
(fractions) of $\alpha$.
Suppose P is an operator that eliminates the effects of U, the undesired
signatures. To do this, an orthogonal subspace operator has been developed
that projects r onto a subspace that is orthogonal to the columns of U. This results in
a vector that only contains energy associated with the target d and noise n. The
operator used is the $l \times l$ matrix
$$P = I - U\left(U^{T}U\right)^{-1}U^{T} \qquad (3.27)$$
The operator P maps d into a space orthogonal to the space spanned by the
uninteresting signatures in U. Applying the operator P to the mixed pixel r from
Eq. (3.25) gives
$$Pr = Pd\alpha_p + PU\gamma + Pn \qquad (3.28)$$
Note that P operating on $U\gamma$ reduces the contribution of U to zero
(close to zero in real data applications). Therefore, Eq. (3.28) becomes
$$Pr = Pd\alpha_p + Pn \qquad (3.29)$$
3.2.4.2 Signal-to-Noise Ratio (SNR) Maximization
The second step in deriving the pixel classification operator is to find the $1 \times l$
operator $X^{T}$ that maximizes the SNR. Applying $X^{T}$ to Eq. (3.28) gives
$$X^{T}Pr = X^{T}Pd\alpha_p + X^{T}PU\gamma + X^{T}Pn \qquad (3.30)$$
The operator $X^{T}$ acting on $Pr$ produces a scalar (Ientilucci, 2001). The SNR is
given by,
$$\lambda = \frac{X^{T}Pd\,\alpha_p^{2}\,d^{T}P^{T}X}{X^{T}P\,E\left[nn^{T}\right]P^{T}X} \qquad (3.31)$$
$$\lambda = \left(\frac{\alpha_p^{2}}{\sigma^{2}}\right)\frac{X^{T}Pdd^{T}P^{T}X}{X^{T}PP^{T}X} \qquad (3.32)$$
where $E[\cdot]$ denotes the expected value. Maximization of this quotient is the
generalized eigenvector problem
$$Pdd^{T}P^{T}X = \tilde{\lambda}\,PP^{T}X \qquad (3.33)$$
where $\tilde{\lambda} = \lambda\left(\sigma^{2}/\alpha_p^{2}\right)$. The value of $X^{T}$ which maximizes $\lambda$ can be determined in general
using techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value
of $X^{T}$ which maximizes the SNR is
$$X^{T} = kd^{T} \qquad (3.34)$$
where k is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is
seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$
vector
$$q^{T} = d^{T}P \qquad (3.35)$$
This result first nulls the interfering signatures, and then uses a matched filter for
the desired signature to maximize the SNR. When the operator is applied to all of the
pixels in a hyperspectral scene, each 1l× pixel is reduced to a scalar which is a
measure of the presence of the signature of interest. The ultimate aim is to reduce the
l images that make-up the hyperspectral image cube into a single image where pixels
with high intensity indicate the presence of the desired signature.
This operator can be easily extended to seek out k signatures of interest. The
vector operator simply becomes a $k \times l$ matrix operator, which is given by,
$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \qquad (3.36)$$
When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral
scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Thus, for k
desired signatures, the l-dimensional hyperspectral image is reduced to a k-dimensional feature
extracted image, in which each band corresponds to one desired signature and
pixels with high intensity in a band indicate the presence of that signature.
The above algorithm is illustrated with the following example.
Let us start with three vectors or classes, each six elements or bands long. The
vectors are in reflectance units and are given below.
$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \qquad \text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \qquad \text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$
Suppose the image consists of 100 pixels numbered from left to right. Let the 40th pixel
be
$$pixel_{40} = (0.08)\,concrete + (0.75)\,tree + (0.07)\,water + noise \qquad (3.37)$$
Let us assume that the noise is zero. If all the pixel mixture fractions have been
defined, a particular class spectrum can be chosen for extraction from the image. Suppose
the concrete material has to be extracted throughout the image. The same procedure can
be followed to extract the other materials.
Assume that $pixel_{40}$ is made up of some weighted linear combination of
endmembers,
$$pixel_{40} = M\alpha + noise \qquad (3.38)$$
Now $M\alpha$ can be broken up into the desired signature, $d\alpha_p$, and the undesired signatures, $U\gamma$. Let concrete be
the vector d and tree and water be the column vectors of the matrix U. The
fractions of mixing are unknown, but it is known that $pixel_{40}$ is made up of
some combination of d and U.
$$d = [\,concrete\,] \quad \text{and} \quad U = [\,tree \;\; water\,]$$
Now it is required to reduce the effect of U. To do this, a
projection operator P is needed that, when applied to U, reduces its contribution to zero.
To find concrete, d, $pixel_{40}$ is projected onto a subspace that is orthogonal to the
columns of U using the operator P. In other words, P maps d into a space orthogonal
to the space spanned by the undesired signatures while simultaneously minimizing
the effects of U. If P is applied to U, which contains tree and water, then it is seen
that the effect of U is minimized.
$$PU = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad (3.39)$$
Now let $r_1 = pixel_{40}$ and n = noise; then from Eq. (3.29),
$$Pr_1 = Pd\alpha_p + Pn \qquad (3.40)$$
Next, the operator $X^{T}$ that maximizes the signal-to-noise
ratio (SNR) needs to be found. The operator $X^{T}$ acting on $Pr_1$ produces a scalar. As stated before, the
value of $X^{T}$ which maximizes the SNR is $X^{T} = kd^{T}$. This leads to the overall OSP
operator (Eq. (3.35)). In this way the matrix Q in Eq. (3.36) can be formed. The entire data vector can then be projected along the columns of Q, and the OSP feature extracted
image is formed.
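The worked example can be checked numerically with the short sketch below. It is an illustration under assumptions rather than the thesis code: `osp_operator` is a hypothetical name, the third endmember of the mixed pixel is taken to be water (the third vector listed in the example), the noise is set to zero as assumed in the text, and the final division by $q^{T}d$ is an added convenience that turns the OSP output into an abundance estimate.

```python
import numpy as np

# Reflectance vectors of the three classes from the example (six bands each).
concrete = np.array([0.26, 0.30, 0.31, 0.31, 0.31, 0.31])
tree = np.array([0.07, 0.07, 0.11, 0.54, 0.55, 0.54])
water = np.array([0.07, 0.13, 0.19, 0.25, 0.30, 0.34])

def osp_operator(d, U):
    """Return q^T = d^T P with P = I - U (U^T U)^{-1} U^T (Eqs. 3.27, 3.35)."""
    l = U.shape[0]
    P = np.eye(l) - U @ np.linalg.solve(U.T @ U, U.T)
    return d @ P

d = concrete                                  # desired signature
U = np.column_stack([tree, water])            # undesired signatures
q = osp_operator(d, U)

# Mixed pixel with zero noise; the third material is assumed to be water.
pixel_40 = 0.08 * concrete + 0.75 * tree + 0.07 * water

score = q @ pixel_40                          # scalar OSP output for this pixel
abundance = score / (q @ d)                   # normalization (an added step)
```

Because $q^{T}U = d^{T}PU = 0$, the tree and water contributions vanish and `abundance` recovers the concrete fraction of the mixed pixel.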
3.3 Supervised classifier
This section describes the mathematical background of supervised classifiers.
First, it describes the Bayesian decision rule, followed by the decision rule for the Gaussian maximum likelihood classifier (GML). Afterwards it describes the k-nearest
neighbor (KNN) and support vector machine (SVM) classification rules.
3.3.1 Bayesian decision rule
In pattern recognition, patterns need to be classified. Many
decision rules are available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on Bayes' theorem. Suppose
there are K classes and let $f_k(x)$ be the distribution function of the kth class, where
$1 \leq k \leq K$, and let $P(c_k)$ be the prior probability of the kth class such that $\sum_{k=1}^{K} P(c_k) = 1$.
For any class k, the posterior probability for a pixel vector x is denoted by $p(c_k \mid x)$
and defined by (assuming all classes are mutually exclusive, and with $P(x \mid c_k) = f_k(x)$):
$$p(c_k \mid x) = \frac{P(x \mid c_k)\,P(c_k)}{\sum_{j=1}^{K} f_j(x)\,P(c_j)} \qquad (3.41)$$
Therefore, the Bayes decision rule is:
$$x \in c_i \quad \text{if} \quad p(c_i \mid x) = \max_{k}\, p(c_k \mid x) \qquad (3.41a)$$
3.3.2 Gaussian maximum likelihood classification (GML):
Gaussian maximum likelihood classifier assumes that the distribution of the data points is
Gaussian (normally distributed) and classifies an unknown pixel based on the variance and
covariance of the spectral response patterns. This classification is based on probability density
function associated with training data. Pixels are assigned to the most likely class based on a
comparison of the posterior probability that it belongs to each of the signatures being considered.
Under this assumption, the distribution of a category response pattern can be completely described
by the mean vector and the covariance matrix. With these parameters, the statistical probability of
a given pixel value being a member of a particular land cover class can be computed (Lillesand et
al., 2002). GML classification can obtain minimum classification error under the assumption that
the spectral data of each class is normally distributed. It considers not only the cluster centre but
also its shape, size and orientation by calculating a statistical distance based on the mean values
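and covariance matrix of the clusters. The discriminant function for the GML classification is:
$$g_k(x) = -\frac{1}{2}\left[\ln\left|\hat{\Sigma}_k\right| + (x - \hat{\mu}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(x - \hat{\mu}_k)\right] \qquad (3.42)$$
and the final Bayesian decision rule is:
$$x \in c_j \quad \text{if} \quad g_j(x) = \max_{k}\, g_k(x)$$
where $g_k(x)$ is the discriminant function for the kth class, and $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are the mean vector and covariance matrix estimated from the training pixels of that class.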
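A GML classifier built on this rule can be sketched as below. This is an illustrative sketch rather than the thesis implementation: `fit_gml`, `discriminant` and `predict_gml` are hypothetical names, equal prior probabilities are assumed (so no prior term appears), and each class is assumed to have enough training pixels for its covariance matrix to be non-singular.

```python
import numpy as np

def fit_gml(X, y):
    """Estimate the mean vector and covariance matrix of every class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params

def discriminant(x, mean, cov):
    """g_k(x) = -1/2 [ ln|Sigma_k| + (x - mu_k)^T Sigma_k^{-1} (x - mu_k) ]."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))

def predict_gml(params, X):
    """Assign every row of X to the class with the largest discriminant."""
    classes = list(params)
    scores = np.array([[discriminant(x, *params[c]) for c in classes]
                       for x in X])
    return np.array(classes)[scores.argmax(axis=1)]
```

Using `np.linalg.slogdet` and `np.linalg.solve` avoids forming the determinant and the inverse explicitly, which matters when many bands make the covariance matrix poorly conditioned.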
3.3.3 k-nearest neighbor classification
The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification
technique which has been proven to be effective in pattern recognition. However, its
inherent limitations and disadvantages restrict its practical applications. One of the
shortcomings is lazy learning, which makes the traditional KNN time-consuming. In this
thesis work the traditional KNN process has been applied (Fix and Hodges, 1951).
The k-nearest neighbor classifier is commonly based on the Euclidean distance
between a test pixel and the specified TP. The TP are vectors in a multidimensional
feature space, each with a class label. In the classification phase, k is a user-defined
constant. An unlabelled vector, i.e. a test pixel, is classified by assigning the label which
is most frequent among the k training samples nearest to that test pixel.
Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is classified to the first class (3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified to the first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).
Let x be an n-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) be the n-dimensional TP. The
Euclidean distance between them is defined by:
$$d_i(x, y_i) = \sqrt{(x_{11} - y_{i1})^2 + (x_{12} - y_{i2})^2 + \cdots + (x_{1n} - y_{in})^2} \qquad (3.43)$$
where $x = (x_{11}, x_{12}, \ldots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{d_1, d_2, \ldots, d_p\}$, and p is the number of TP.
The final KNN decision rule is:
$$x \in c_j \quad \text{if the number of the } k \text{ minimum elements of } D \text{ corresponding to } c_j \text{ is at least } \begin{cases} \lfloor k/2 \rfloor + 1, & k \text{ even} \\ \lceil k/2 \rceil, & k \text{ odd} \end{cases} \qquad (3.44)$$
In case of a tie, the test pixel is assigned to the class $c_j$ whose mean
vector is closest to it.
Here k ($1 \leq k \leq p$) is a user-defined parameter that specifies the number of
nearest neighbors used for classification. The outline of the KNN
classification algorithm is given in Figure 3.7.
Figure 3.7: Outline of KNN algorithm
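The KNN rule, including the tie-break on the class mean vectors, can be sketched as follows. This is a minimal sketch and not the thesis implementation: `knn_classify` is a hypothetical name, the class labels are assumed to be stored in a NumPy array, and the vote is taken as the most frequent label among the k nearest training pixels.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Label of test pixel x from its k nearest training pixels."""
    # Euclidean distances D = {d_1, ..., d_p} of Eq. (3.43).
    D = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
    nearest = np.argsort(D)[:k]
    votes = Counter(train_y[nearest].tolist())
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: choose the tied class whose mean vector is closest to x.
    means = {c: train_X[train_y == c].mean(axis=0) for c in tied}
    return min(tied, key=lambda c: np.linalg.norm(x - means[c]))
```

Every call scans all p training pixels, which is the lazy-learning cost mentioned above; the sketch makes no attempt to speed this up.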
3.3.4 Support vector machine (SVM):
The foundations of support vector machines (SVM) were developed by
Vapnik (1995). The formulation embodies the Structural Risk Minimization (SRM)
principle, which has been shown to be superior (Gunn et al., 1997) to the traditional
Empirical Risk Minimization (ERM) principle employed by conventional neural
networks. SRM minimizes an upper bound on the expected risk, as opposed to ERM,
which minimizes the error on the training data. SVMs were developed to solve the
classification problem, but they have since been extended to the domain of
regression problems (Vapnik et al., 1997).
SVM is basically a linear learning machine based on the principle of optimal
separation of classes. The aim is to find a hyperplane which linearly separates the
class of interest. The linear separating hyperplane is placed between the classes in
such a way that it satisfies two conditions.
(i) All the data vectors that belong to the same class are placed on the same side of the separating hyperplane.
(ii) The distance between the two closest data vectors of the two classes is maximized (Vapnik, 1982).
The main aim of SVM is to define an optimum hyperplane between two classes
which will maximize the boundary of two classes. For each class, the data vectors
forming the boundary of classes are called the support vectors (SV) and the
hyperplane is called decision surface (Pal, 2002).
3.3.4.1 Statistical learning theory
The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical
framework for learning from input training data with known classes and predicting the outcome for data points
with unknown identity. Two induction principles are used. The first is called ERM, whose aim is to reduce the training error, and the
second is called SRM, whose goal is to minimize the upper bound on the expected error on the
whole data set. The empirical risk is different from the expected risk in two ways (Haykin, 1999).
First, it does not depend on the unknown cumulative distribution function. Secondly, it can be
minimized with respect to the parameter which is used in the decision rule.
3.3.4.2 Vapnik-Chervonenkis dimension (VC-dimension):
The VC-dimension is a measure of the capacity of a set of classification functions. The
VC-dimension, generally denoted by h, is an integer that represents the largest number of
data points that can be separated by a set of functions $f_\alpha$ in all possible ways. For
example, for an arbitrary classification problem, the VC-dimension is the maximum
number of points k which can be separated into two classes without error in all
possible $2^k$ ways (Varshney and Arora, 2004).
3.3.4.3 Support vector machine algorithm with quadratic optimization method (SVM_QP):
The procedure of obtaining a separating hyperplane by SVM is explained for the
simple linearly separable case of two classes which can be separated by a hyperplane;
it can be extended to the multiclass classification problem. The procedure
can then be extended, through the kernel method for SVM, to the case where a hyperplane cannot separate the two classes.
Let there be n training samples obtained from two classes,
represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, m is the dimension of the
data vector, and each sample belongs to either of the two classes labeled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a
hyperplane in m-dimensional space whose orientation is given by a vector w and
whose location is determined by a scalar b, the offset of this hyperplane from the origin
(Figure 3.8). If such a hyperplane exists, then the given set of training data
points must satisfy the following inequalities:
$$w \cdot x_i + b \geq +1, \quad \forall\, i : y_i = +1 \qquad (3.45)$$
$$w \cdot x_i + b \leq -1, \quad \forall\, i : y_i = -1 \qquad (3.46)$$
Thus, the equation of the hyperplane is given by $w \cdot x + b = 0$.
Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after Gunn, 1998).
The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality as:
$$y_i\,(w \cdot x_i + b) \geq 1 \qquad (3.47)$$
Thus, the decision rule for the linearly separable case can be defined in the following
form:
$$x_i \in \mathrm{sign}(w \cdot x_i + b) \qquad (3.48)$$
where $\mathrm{sign}(\cdot)$ is the signum function, whose value is +1 for any argument greater than
or equal to zero and –1 if it is less than zero. The signum function, thus, can easily
represent the two classes given by labels +1 and –1.
The separating hyperplane (Figure 3.8) will be able to separate the two classes
optimally when its margin from both the classes is equal and maximum (Varshney,
2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation for a
point x from the hyperplane defined by w and b. It is given by
$$D(x; w, b) = \frac{|w \cdot x + b|}{\|w\|_2} \qquad (3.49)$$
where $\|\cdot\|_2$ denotes the 2-norm, which is equivalent to the Euclidean length of
the vector for which it is computed, and $|\cdot|$ is the absolute value function. Let
d be the value of the margin between the two separating planes. To maximize the
margin, express the value of d as
$$d = \frac{|w \cdot x + b + 1|}{\|w\|_2} - \frac{|w \cdot x + b - 1|}{\|w\|_2} = \frac{2}{\|w\|_2} = \frac{2}{\sqrt{w^{T}w}} \qquad (3.49a)$$
To obtain an optimal hyperplane the margin value (d) should be maximized, i.e. $2/\|w\|_2$
should be maximized, which is equivalent to minimizing the 2-norm of the vector w.
Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to
$$\Phi(w) = \frac{1}{2}\,w^{T}w \qquad (3.50)$$
A constrained optimization problem can be constructed for minimizing the objective
function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of
constrained optimization problem with a convex objective function of w and linear
constraints is called a primal problem and can be solved using standard Quadratic
Programming (QP) optimization techniques. The QP optimization technique can be
implemented by replacing the inequalities with a simpler form, transforming the
problem into a dual space representation using Lagrange multipliers ($\lambda_i$)
(Luenberger, 1984). The vector w can be defined in terms of the Lagrange multipliers ($\lambda_i$)
as shown:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.51)$$
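The dual optimization problem expressed through the Lagrange multipliers ($\lambda_i$) thus
becomes
$$\max_{\lambda}\; L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j\,(x_i \cdot x_j) \qquad (3.52)$$
subject to the constraints:
$$\sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.53)$$
$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \qquad (3.54)$$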
The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero. The training vectors whose multipliers have nonzero values are called SVs. The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^o = (\lambda_1^o, \lambda_2^o, \ldots, \lambda_{n_s}^o)$, where $n_s$ is the number of support vectors found. Substituting these into Eq. (3.51) gives the orientation of the optimal separating hyperplane ($w_o$) as
$$w_o = \sum_{i=1}^{n_s} \lambda_i^o y_i x_i \tag{3.55}$$
The offset from the origin ($b_o$) is determined from the equation given below,
$$b_o = -\frac{1}{2} \left[ w_o \cdot x_{+1}^o + w_o \cdot x_{-1}^o \right] \tag{3.56}$$
Where $x_{+1}^o$ and $x_{-1}^o$ are support vectors of class labels +1 and -1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and -1:
$$f(x) = \mathrm{sign}\left( \sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b_o \right) \tag{3.57}$$
Eq. (3.57) implies that
$$x \in \mathrm{sign}\left( \sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b_o \right) \tag{3.58}$$
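As a small numerical illustration of Eqs. (3.55) to (3.58), the sketch below builds the hyperplane from a given set of multipliers for a toy two-point problem. The data, the multiplier values and the function names are hypothetical and chosen only so that the result can be checked by hand; the offset is computed so that w·x + b equals +1 and -1 at the two support vectors.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def sign(t):
    """Signum as defined in the text: +1 for t >= 0, otherwise -1."""
    return 1 if t >= 0 else -1

# Hypothetical toy problem: one support vector per class.
svs = [((2.0, 2.0), +1), ((0.0, 0.0), -1)]
lam = [0.25, 0.25]  # assumed optimal multipliers; sum(lam_i * y_i) = 0

# Orientation of the hyperplane, Eq. (3.55).
w = [sum(l * y * x[d] for l, (x, y) in zip(lam, svs)) for d in range(2)]

# Offset, Eq. (3.56), using one support vector from each class.
x_pos, x_neg = svs[0][0], svs[1][0]
b = -0.5 * (dot(w, x_pos) + dot(w, x_neg))

# Margin between the two separating planes, Eq. (3.49a).
margin = 2.0 / dot(w, w) ** 0.5

def classify(x):
    """Decision rule of Eq. (3.57) written in terms of the support vectors."""
    s = sum(l * y * dot(xi, x) for l, (xi, y) in zip(lam, svs))
    return sign(s + b)
```

For this toy case the margin equals the distance between the two support vectors, which is what one would expect when each class contributes a single point.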
Generally, it may not be possible to separate the classes optimally by a linear hyperplane, and a non-linear manifold in hyperspace would then be required for optimal separation among the classes. The data present in the m-dimensional space can be mapped into a higher dimensional space where they spread out and can be separated by a linear hyperplane in that space, as shown in Figure 3.9.
Suppose the non-linear transformation function φ maps the data into a higher dimensional space, so that a data point x in the original m-dimensional space is represented as φ(x) in the higher dimensional space. Thus, the dual optimization problem in Eq. (3.52) is modified as:
$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \left( \phi(x_i) \cdot \phi(x_j) \right) \tag{3.59}$$
The computation of the dot product $\phi(x_i) \cdot \phi(x_j)$ is computationally very expensive because it is carried out in the higher dimensional space. So, kernel functions are used to substitute the value of the dot product of the transformed vectors according to Mercer’s Theorem (Mercer, 1909). Suppose there exists a kernel function K such that
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \tag{3.60}$$
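The substitution in Eq. (3.60) can be checked numerically for a simple case. The sketch below uses a homogeneous polynomial kernel of degree two in two dimensions, for which an explicit mapping φ is known; the kernel choice and the function names are illustrative assumptions and are not the kernels used in the experiments.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated entirely in the input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, z)     # no mapping needed
k_mapped = dot(phi(x), phi(z))    # same value via the 3-D feature space
```

Both routes give the same number, but the kernel evaluation never forms the higher dimensional vectors, which is the saving the text describes.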
(a) Input space (b) Feature space
Figure 3.9: Non-linear mapping scheme. φ is a nonlinear mapping that transforms the pixels from input space to feature space; the $\phi(x_i)$ are pixels in feature space. Pixels that are not linearly separable in input space become linearly separable in feature space (Cristianini, 2000).
Putting Eq. (3.60) into Eq. (3.59), the modified form of the dual optimization problem becomes:
$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n_t} \lambda_i - \frac{1}{2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \tag{3.61}$$
Subject to the constraints:
$$\sum_{i=1}^{n_t} \lambda_i y_i = 0 \tag{3.62}$$
Similarly, the final decision rule can be modified as:
$$x \in \mathrm{sign}\left( \sum_{i=1}^{n_s} \lambda_i^o y_i K(x_i, x) + b_o \right) \tag{3.63}$$
Some of the commonly used kernel functions for classification are presented in Table 3.2. Selection of a suitable kernel function is essential for good classification of a particular data set. Details on the effects of different kernel functions on classification accuracy are available in Varshney and Arora (2004).
Originally, SVMs were developed to perform binary classification. They have since been extended to multiclass classification, where the number of classes is more than two. Pal (2004) proposed two multiclass classification methods: the one-against-the-rest method and the pairwise classification method. In the first, K binary classifiers are created for a K class classification problem, each trained to distinguish one class from the remaining K − 1 classes. The second approach considers one pair of classes at a time and performs SVM based binary classification to assign every pixel to one of the two classes under consideration. A total of K(K − 1)/2 pairs of classes are possible for a K class problem, and thus that many SVM binary classifiers have to be created. A pixel is finally assigned to the class chosen by the largest number of the K(K − 1)/2 SVM classifiers (Varshney and Arora, 2004).
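The pairwise voting scheme can be sketched in a few lines. In the sketch below the binary classifier is replaced by a hypothetical nearest-class-mean rule on a single feature, purely to make the example self-contained; in the thesis each pairwise decision would come from a trained binary SVM.

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(x, classes, decide):
    """Assign x to the class that wins the most of the K(K-1)/2 pairwise decisions.

    decide(a, b, x) must return either a or b. Ties are broken by the order
    in which classes first receive a vote (an arbitrary choice for this sketch).
    """
    votes = Counter(decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical one-feature class means standing in for trained binary SVMs.
means = {"water": 0.0, "soil": 5.0, "wheat": 10.0}

def nearest_mean(a, b, x):
    return a if abs(x - means[a]) <= abs(x - means[b]) else b

label = pairwise_vote(4.2, list(means), nearest_mean)
n_pairs_for_8_classes = len(list(combinations(range(8), 2)))
```

For the eight classes of the data set used in this thesis, the pairwise approach therefore needs 28 binary classifiers, against 8 for the one-against-the-rest approach.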
Figure 3.10 shows a summary of the SVM classification algorithm.
Figure 3.10: Brief description of SVM_QP algorithm
3.3.4.4 SMO optimization for SVM
Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly
solve the SVM QP problem without any extra matrix storage and without using
numerical QP optimization steps at all. SMO decomposes the overall QP problem into
QP sub-problems, using Osuna’s theorem (Osuna, 1997) to ensure convergence.
Unlike the previous methods, SMO chooses to solve the smallest possible
optimization problem at every step. For the standard SVM QP problem, the smallest
possible optimization problem involves two Lagrange multipliers, because the
Lagrange multipliers must obey a linear equality constraint. At every step, SMO
chooses two Lagrange multipliers to jointly optimize, finds the optimal values for
these multipliers, and updates the SVM to reflect the new optimal values. The
advantage of SMO lies in the fact that solving for two Lagrange multipliers can be
done analytically. Thus, numerical QP optimization is avoided entirely. Even though
more optimization sub-problems are solved in the course of the algorithm, each sub-
problem is so fast that the overall QP problem is solved quickly. In addition, SMO
requires no extra matrix storage at all. Thus, very large SVM training problems can
fit inside the memory of an ordinary personal computer or workstation. Because no
matrix algorithms are used in SMO, it is less susceptible to numerical precision
problems. There are two components to SMO: an analytic method for solving for the
two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.
In this thesis, all the computations for the SMO optimization method were carried out with the Matlab built-in function “SVMSMOSET”.
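The analytic two-multiplier step that SMO relies on can be sketched as follows. This is a minimal sketch of the update described by Platt for the soft-margin formulation, so it assumes an upper bound C on the multipliers that does not appear in Eq. (3.54); the function name and argument layout are hypothetical, and the sketch does not reproduce the Matlab implementation or the heuristic for choosing the pair.

```python
def smo_pair_step(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Jointly optimize two multipliers while keeping y1*a1 + y2*a2 constant.

    e1, e2 are the current errors f(x_i) - y_i, k11, k12, k22 are kernel
    values, and c is the assumed box constraint 0 <= a_i <= c.
    """
    # Feasible segment for a2 implied by the box and equality constraints.
    if y1 != y2:
        lo, hi = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        lo, hi = max(0.0, a1 + a2 - c), min(c, a1 + a2)
    eta = k11 + k22 - 2.0 * k12   # curvature along the constraint line
    if eta <= 0.0 or lo >= hi:
        return a1, a2             # degenerate pair: skipped in this sketch
    a2_new = min(hi, max(lo, a2 + y2 * (e1 - e2) / eta))
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new

# Toy problem: x1 = (2, 2) with y1 = +1, x2 = (0, 0) with y2 = -1, linear kernel,
# all multipliers initially zero, so f(x) = 0 and the errors are -1 and +1.
a1_new, a2_new = smo_pair_step(0.0, 0.0, +1, -1, -1.0, 1.0, 8.0, 0.0, 0.0, 10.0)
```

Because only two multipliers change and both are obtained in closed form, no numerical QP solver is called inside the step, which is the property the section emphasizes.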
3.3.4.4 KPCA-SVM
Nonlinear SVMs are generally more accurate than linear SVMs. However, they are slow, and the time taken for classification increases linearly with the number of SVs. Reduced set methods try to speed up SVM classification by reducing the number of SVs (Burges and Scholkopf, 1996). This section presents a technique for reducing the number of SVs using the KPCA algorithm (Sundaram, 2009). It should be kept in mind that the space spanned by the reduced set of SVs must always be equivalent to the space spanned by the original set of SVs. This is the criterion for choosing the minimum number of SVs to improve the classification time.
The solution of the optimization problem in Eq. (3.52) is obtained in terms of the Lagrange multipliers, and the SVs are extracted by solving Eq. (3.52). The algorithm for this method is stated below.
1. First choose an appropriate kernel function. Then calculate the kernel matrix $K_{xx}$ from the set of SVs $x_i$, $i = 1, 2, \ldots, N$:
$$K_{xx}(i, j) = K(x_i, x_j) \tag{3.64}$$
where $j = 1, 2, \ldots, N$.
2. Center the kernel matrix $K_{xx}$:
$$K_{xx}^c = H K_{xx} H \tag{3.65}$$
where $H = I - \frac{1}{N} I$, I is the $N \times N$ identity matrix and H is the centering matrix. Sundaram (2009) used Eq. (3.65) to center the kernel matrix. However, according to the literature, the kernel matrix should be centered using Eq. (3.24), which is the standard procedure for centering a kernel matrix.
3. Perform kernel PCA by carrying out an eigen value decomposition of the centered kernel matrix ($K_{xx}^c$):
$$K_{xx}^c = A \Lambda A^T \tag{3.66}$$
Where A is the matrix of eigen vectors and Λ is a diagonal matrix of eigen values whose diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_N$.
4. Sort the eigen values and the corresponding eigen vectors. Discard eigen values smaller than a threshold; a value of $10^{-5}$ has been used in this thesis work. This was done to prevent numerical problems in the later stages of the algorithm.
5. Calculate the normalized principal directions:
$$V_k = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{N} a_{jk} \tilde{\Phi}(x_j) \tag{3.67}$$
where $\tilde{\Phi}(x_j) = \Phi(x_j) - \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i)$.
In matrix form this becomes:
$$V = K A \Lambda^{-\frac{1}{2}} \tag{3.68}$$
Select the first M principal directions, which together retain 99% of the total variance.
6. Calculate the new SVs by choosing the projections on the principal directions from a uniform distribution $U[-\sigma_k, +\sigma_k]$, where $\sigma_k = \sqrt{\lambda_k / N}$. In matrix form this becomes
$$\tilde{V} = V R \tag{3.69}$$
Where $R = \frac{1}{\sqrt{N}} \Lambda^{\frac{1}{2}} U$ and U is a matrix of points chosen from the uniform distribution $U[-1, +1]$.
7. Each column of $\tilde{V}$ corresponds to a new SV. Now project the images of the old SVs ($\Phi(x_i)$) along the directions of the new set of SVs (i.e. along the directions of the PCs):
$$\Phi(z_k) = \sum_{i=1}^{N} \tilde{V}_{ik} \Phi(x_i) \tag{3.70}$$
8. Calculate the approximate pre-images of the points obtained in the previous step ($\Phi(z_k)$) according to the formula given below (Scholkopf, 1996):
$$z_k = \frac{\sum_{i=1}^{N} \tilde{V}_{ik} \, \frac{1}{2} \left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right) x_i}{\sum_{i=1}^{N} \tilde{V}_{ik} \, \frac{1}{2} \left( 1 - \tilde{V}_k^T K_{xx} \tilde{V}_k + 2 \tilde{V}_k^T k_{x_i} \right)} \tag{3.71}$$
where $k_{x_i} = [K(x_i, x_1) \; K(x_i, x_2) \; \ldots \; K(x_i, x_N)]^T$.
9. Calculate the new coefficients β by solving
$$K_{zz} \beta = K_{zx} \alpha \tag{3.72}$$
This ensures that both SVMs produce the same results for all the $z_k$, $k = 1, 2, \ldots, M$ (Scholkopf and Mika, 1999).
A new set of SVs $z_k$, $k = 1, 2, \ldots, M$, and the new coefficients $\beta_i$, $i = 1, 2, \ldots, M$, of the SVs are thus obtained. The general SVM classification algorithm is then applied to the new set of SVs. Figure 3.11 outlines the above algorithm.
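Steps 1 to 4 of the algorithm, together with the 99% variance selection of step 5, can be sketched with NumPy as follows. The sketch assumes a Gaussian kernel with a hypothetical width parameter `gamma` and uses the common centering matrix with a matrix of ones, which is an assumption rather than a restatement of Eq. (3.65); the random vectors merely stand in for SVs.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Step 1: Gaussian kernel matrix of the rows of X (gamma is an assumed parameter)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca_directions(K, eig_tol=1e-5, var_keep=0.99):
    """Steps 2-4 plus the variance-based choice of M."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # assumed standard centering matrix
    Kc = H @ K @ H                            # step 2: centered kernel matrix
    lam, A = np.linalg.eigh(Kc)               # step 3: eigen decomposition (ascending)
    order = np.argsort(lam)[::-1]             # step 4: sort in descending order
    lam, A = lam[order], A[:, order]
    keep = lam > eig_tol                      # discard eigen values below the threshold
    lam, A = lam[keep], A[:, keep]
    ratio = np.cumsum(lam) / np.sum(lam)
    M = min(int(np.searchsorted(ratio, var_keep)) + 1, lam.size)
    return Kc, lam[:M], A[:, :M]

rng = np.random.default_rng(0)
svs = rng.normal(size=(30, 4))                # hypothetical stand-in for N = 30 SVs
K = rbf_kernel_matrix(svs)
Kc, lam, A = kpca_directions(K)
```

The number of retained columns of `A` is the reduced size M; the later steps (new SVs, pre-images and the coefficients β) are not covered by this sketch.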
Figure 3.11: Overview of KPCA_SVM algorithm
3.4 Analysis of classification results
The classification results obtained using the various classification techniques are expressed as standard confusion matrices (Landgrebe, 2003) showing the class-wise user’s ($k_{ua}$), producer’s ($k_{pa}$) and overall (k) kappa measures (Congalton, 1991). The overall kappa (k) values obtained from the different classification techniques were used in one-tailed hypothesis testing (Congalton, 1991) to compare any two classification results, while the class-wise producer’s kappa ($k_{pa}$) values were used to check the performance of the different classification techniques in separating different classes (Abhinav, 2009).
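For reference, the overall kappa can be computed from a confusion matrix as in the sketch below. It uses the usual definition (observed agreement corrected for chance agreement); the two-class matrix is a made-up example and not a result from this thesis, and the class-wise user's and producer's kappa values are not covered.

```python
def overall_kappa(cm):
    """Overall kappa of a square confusion matrix given as a list of rows."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - chance) / (1.0 - chance)

# Hypothetical two-class confusion matrix (rows: classified, columns: reference).
example_cm = [[45, 5],
              [10, 40]]
kappa = overall_kappa(example_cm)
```

Here the observed agreement is 0.85 and the chance agreement is 0.5, so kappa is 0.7.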
3.4.1 One tailed hypothesis testing
The z-statistic (Congalton, 1991) for comparing any two classification techniques is computed from the kappa values obtained:
$$Z_{12} = \frac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}} \tag{3.73}$$
Where $\hat{k}_1$ and $\hat{k}_2$ are the kappa estimates obtained for the two classification techniques under consideration, and $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$ are the respective estimates of the variances of the observed kappa values. The z-statistic obtained is used for one-tailed hypothesis testing with the following null ($H_0$) and alternate ($H_1$) hypotheses:
$$H_0: Z_{12} = k_1 - k_2 \le 0, \qquad H_1: Z_{12} = k_1 - k_2 > 0 \tag{3.74}$$
The null hypothesis chosen here is that, of the two classification results $\hat{k}_1$ and $\hat{k}_2$, $\hat{k}_1$ is not significantly better than $\hat{k}_2$, which means that the first classification technique is not significantly better than the second. The alternate hypothesis states that the two classification results are statistically different and that the result corresponding to $\hat{k}_1$ is statistically better than that corresponding to $\hat{k}_2$, so that the first classification technique can be said to be significantly better than the second (Abhinav, 2009).
The z-statistic obtained in Eq. (3.73) follows the standard normal distribution (Congalton, 1991). Thus, according to one-tailed hypothesis testing (Figure 3.12), if the value of the $Z_{12}$ statistic is greater than the critical value (1.65 for a confidence level of 95%), the null hypothesis can be rejected, and it can be said with 95% confidence that the two classification results are statistically different, with the first performing better than the second (Abhinav, 2009).
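The test can be written compactly as below. The kappa values and variances in the example are hypothetical and only illustrate the arithmetic of Eq. (3.73) and the comparison with the critical value of 1.65.

```python
import math

def z_statistic(k1, var1, k2, var2):
    """Z_12 of Eq. (3.73) for two kappa estimates and their variance estimates."""
    return (k1 - k2) / math.sqrt(var1 + var2)

def first_is_better(k1, var1, k2, var2, z_crit=1.65):
    """Reject H0 (one-tailed, 95% confidence) when Z_12 exceeds the critical value."""
    return z_statistic(k1, var1, k2, var2) > z_crit

# Hypothetical values: kappa 0.90 versus 0.80, each with variance 0.0008.
z12 = z_statistic(0.90, 0.0008, 0.80, 0.0008)
decision = first_is_better(0.90, 0.0008, 0.80, 0.0008)
```

With these numbers Z is about 2.5, so the null hypothesis would be rejected; swapping the two classifiers gives a negative Z and no rejection, which reflects the one-sided nature of the test.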
Figure 3.12: Definitions and values used in applying one-tailed hypothesis testing (Abhinav, 2009). The critical value Zc = 1.65 separates the non-rejection region for $H_0$ from the rejection region for $H_0$.
CHAPTER 4 EXPERIMENTAL DESIGN
This chapter will address the methodology followed for this thesis work.
Experiments were designed to investigate the best FE technique, classification
algorithm and best time saving strategy for HD. On the basis of conclusions from the
literature survey and recommendations for future work by Abhinav (2009), several
FE and classification algorithms have been tested which have potential for improving
classification accuracy and time for HD. The theoretical background of these
algorithms was presented in Chapter 3.
The following FE methods and classification algorithms have been tested:
(1) Feature extraction algorithms
• Unsupervised feature extraction algorithms
a) Segmented principal component analysis (SPCA) (Jia, 1996).
b) Projection pursuit (PP) (Friedman and Tukey, 1974).
• Supervised feature extraction algorithms
a) Kernel principal component analysis (KPCA) (Scholkopf, 1995).
b) Orthogonal subspace projection (OSP) (Lentilucci, 2001).
(2) Classification algorithms
• Parametric classification approach
a) Gaussian maximum likelihood (GML) (Savage, 1976).
• Non-parametric classification approach
a) k-nearest neighborhood (KNN) (Fix and Hodges, 1951).
• Advanced classification approach
a) Support vector machine with the quadratic programming optimization method (SVM_QP) (Vapnik, 1995).
b) Support vector machine with the sequential minimal optimization method (SVM_SMO) (Platt, 1999).
c) Kernel principal component analysis support vector machine (KPCA_SVM) (Sundaram, 2009).
This chapter starts with the experimental details for the different FE and selection techniques. It then explains the classification experiments with the parametric and non-parametric classifiers, followed by the advanced classifier.
4.1 Feature extraction technique
Two types of FE techniques, unsupervised and supervised, were used in this experiment. SPCA and PP are unsupervised FE techniques, while KPCA and OSP are supervised FE techniques. The details of the FE methods are given below.
4.1.1 SPCA
For SPCA, the complete data set is divided into subgroups on the basis of the correlation between bands. PCA is then applied separately to each subgroup. After this first subgroup transformation, features are selected from the new data set using variance information (the first few PCs retaining 99% of the variance were selected). The selected features are then regrouped and transformed again to compress the data further. The flowchart of the SPCA method is shown in Figure 4.1.
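The two-stage procedure can be sketched with NumPy as follows. The block boundaries (32, 6 and 27 bands) are the ones reported for this data set, but the pixel values are synthetic, the function names are hypothetical, and the sketch is not the code used for the experiments.

```python
import numpy as np

def pca_keep_variance(X, var_keep=0.99):
    """PCA on the correlation matrix of X (pixels x bands), keeping the
    leading PCs that together retain `var_keep` of the variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each band
    R = np.corrcoef(Z, rowvar=False)           # band correlation matrix
    lam, E = np.linalg.eigh(R)                 # eigen values in ascending order
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]
    ratio = np.cumsum(lam) / lam.sum()
    m = min(int(np.searchsorted(ratio, var_keep)) + 1, lam.size)
    return Z @ E[:, :m]

def spca(X, blocks, var_keep=0.99):
    """Stage 1: PCA per correlated block; stage 2: PCA on the regrouped features."""
    stage1 = np.hstack([pca_keep_variance(X[:, lo:hi], var_keep) for lo, hi in blocks])
    return pca_keep_variance(stage1, var_keep)

# Synthetic stand-in for a 65-band image with three correlated blocks of bands.
rng = np.random.default_rng(0)
blocks = [(0, 32), (32, 38), (38, 65)]
parts = []
for lo, hi in blocks:
    latent = rng.normal(size=(500, 2))
    mixing = rng.normal(size=(2, hi - lo))
    parts.append(latent @ mixing + 0.05 * rng.normal(size=(500, hi - lo)))
X = np.hstack(parts)
spca_features = spca(X, blocks)
```

Because each block is transformed separately, every eigen decomposition involves a much smaller matrix than a PCA of all 65 bands at once.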
Figure 4.1: SPCA feature extraction method
4.1.2 PP
For PP, Posse’s (1995a) algorithm was used in this research work, in which the OD (n-dimensional) is projected onto a two-dimensional space. Thus the dimension of the PP feature extracted data set is two. The chi-square projection pursuit index was chosen here. The methodology adopted for the PP method is shown in Figure 4.2.
Figure 4.2: PP feature extraction method
4.1.3 KPCA
The number of PCs is equal to the number of TP used for FE. In this experiment, up to 400 TP in total have been used for FE with the KPCA method. Hence, the dimension of the KPCA feature extracted data set is up to 400. First, the TP are mapped into feature space using different kernel functions (linear, polynomial and Gaussian) in the form of a Gram matrix. Then the eigen values and eigen vectors of the Gram matrix are calculated. Afterwards, the OD is mapped into kernel space using the same kernel function (as used for the TP) and projected along the directions of the eigen vectors. Finally, the KPCA feature extracted data set is obtained. The outline of the KPCA method is shown in Figure 4.3.
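A minimal NumPy sketch of this fit-then-project scheme is given below. It assumes a Gaussian kernel with a hypothetical width `gamma`, the usual centering of training and test kernel matrices, and a user-chosen number of components `n_comp`; the random arrays stand in for the TP and the OD.

```python
import numpy as np

def rbf(X, Z, gamma):
    """Gaussian kernel between the rows of X and the rows of Z."""
    d2 = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca_fit(TP, gamma, n_comp):
    """Eigen decomposition of the centered Gram matrix of the training pixels."""
    N = TP.shape[0]
    K = rbf(TP, TP, gamma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:n_comp]      # leading components only
    lam, A = lam[order], A[:, order]
    return {"TP": TP, "K": K, "gamma": gamma, "alpha": A / np.sqrt(lam)}

def kpca_transform(model, X):
    """Map new pixels with the same kernel and project them on the eigen vectors."""
    K, TP = model["K"], model["TP"]
    N = TP.shape[0]
    Kx = rbf(X, TP, model["gamma"])
    one_n = np.full((N, N), 1.0 / N)
    one_m = np.full((X.shape[0], N), 1.0 / N)
    Kxc = Kx - one_m @ K - Kx @ one_n + one_m @ K @ one_n
    return Kxc @ model["alpha"]

rng = np.random.default_rng(1)
TP = rng.normal(size=(20, 5))    # hypothetical training pixels
OD = rng.normal(size=(50, 5))    # hypothetical image pixels
model = kpca_fit(TP, gamma=0.1, n_comp=3)
kpca_features = kpca_transform(model, OD)
```

Only the kernel values between each image pixel and the TP are needed for the projection, so the cost of transforming the OD grows with the number of TP used.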
Figure 4.3: KPCA feature extraction method
4.1.4 OSP
The dimensionality of the feature extracted data set depends upon the number of classes present in the OD. OSP starts by finding the endmembers with the automated target generation process (ATGP). The OD is then projected along the endmembers to obtain the feature extracted data set. The data set used for this thesis has eight classes, so the number of endmembers is also eight. The dimension of the feature extracted data set is equal to the number of endmembers. A brief description of the OSP method is shown in Figure 4.4.
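The two stages can be sketched as follows, assuming the commonly described form of ATGP (start from the brightest pixel, then repeatedly pick the pixel with the largest residual after projecting out the targets already found) and an OSP feature per endmember obtained by annihilating the other endmembers. The tiny three-band data set and the function names are hypothetical.

```python
import numpy as np

def atgp(X, n_targets):
    """Return indices of n_targets pixels (rows of X) found by the assumed ATGP scheme."""
    idx = [int(np.argmax(np.sum(X ** 2, axis=1)))]       # brightest pixel first
    for _ in range(n_targets - 1):
        U = X[idx].T                                     # bands x targets found so far
        P = np.eye(X.shape[1]) - U @ np.linalg.pinv(U)   # orthogonal-complement projector
        residual = X @ P
        idx.append(int(np.argmax(np.sum(residual ** 2, axis=1))))
    return idx

def osp_features(X, T):
    """One feature per endmember: project out the other endmembers, then correlate."""
    n_bands = X.shape[1]
    feats = []
    for k in range(T.shape[0]):
        U = np.delete(T, k, axis=0).T                    # undesired endmembers
        P = np.eye(n_bands) - U @ np.linalg.pinv(U)
        feats.append(X @ P @ T[k])
    return np.column_stack(feats)

# Hypothetical 5-pixel, 3-band data set.
pixels = np.array([[10.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0],
                   [0.0, 0.0, 3.0],
                   [1.0, 1.0, 0.0],
                   [2.0, 0.0, 1.0]])
target_idx = atgp(pixels, 3)
osp_bands = osp_features(pixels, pixels[target_idx])
```

Each output band responds to one endmember while suppressing the others, which is why the number of features equals the number of endmembers.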
Figure 4.4: OSP feature extraction method
4.2 Experimental design
This section provides the detailed methodology of the classification followed in this research work. The feature extracted data or OD, the TP and the selected bands are given as input to the classifier. In this thesis work, the same set of TP has been used to train the classifier for every data set. For example, to perform classification using 200 TP per class on the SPCA modified data set, the same 200 TP were used as for the OD. To vet the results obtained by Abhinav (2009), the same sets of TP are also used here. Those TP were obtained by the multinomial TP selection algorithm. The statistically sufficient sample size for training and testing was calculated at a confidence level of 99% and a desired precision of 4% using the formula suggested by Toratora (1976). Following this approach, a minimum of 99 TP per class has to be chosen to train a classifier.
Experiments were performed with GML, KNN and the advanced classifier (SVM). For each classifier, two types of experiments were performed. The first type of classification experiment was implemented on the OD and the second type was carried out on the feature extracted data sets. For each set of experiments, the classifier was trained with 25, 100, 200 and 300 TP per class. Using the same set of TP ensures that there is no discrepancy due to different training data sets when comparing different classification results. These numbers were chosen in order to consider the following cases of training sample size.
a) Statistically insufficient training sample size (25 TP)
b) Statistically exact training sample size (100 TP)
c) Statistically sufficient training sample size (200 TP)
d) Very large training sample size (300 TP)
The classifier provides a thematic map as the output of classification. These maps were used to obtain the test accuracy of the classifiers in terms of a confusion matrix. Accuracy analysis of the resulting maps was performed using the kappa values of the different algorithms, which were compared through z-statistics on the basis of one-tailed hypothesis testing at the 95% confidence level (Congalton, 1991).
For each classification technique, initially five bands of the OD or feature extracted data set (except the OSP and PP feature extracted data sets) were chosen. The number of bands was then increased in steps of five up to the number of available bands (which may differ between feature extracted data sets). The classification was performed at each step to evaluate whether there was any improvement in accuracy. This was done for each set of TP.
The dimension of the OSP feature extracted data set is equal to the number of classes present in the OD. Each band of the OSP feature extracted data set contains information corresponding to one class. Therefore, for classification, all bands of the OSP feature extracted data set should be taken together; otherwise, classification errors may result. For all the experiments in this thesis work, the eight bands of the OSP feature extracted data set were taken together.
The dimension of the PP feature extracted data set is two. Therefore, the maximum number of bands available for the PP feature extracted data set is two. For all the experiments on the PP feature extracted data set, both bands were taken together.
The methodology of the classification procedure for this thesis work is shown in Figure 4.5.
Figure 4.5: Overview of classification procedure
4.3 First set of experiment (SET-I) using parametric and non-parametric classifier
The Set-I experimental setup was designed to investigate the results of the parametric (GML) and non-parametric (KNN) classifiers. The classification was performed by selecting different parameters for KNN and GML.
For KNN, initially three neighboring pixels were chosen, and the neighborhood size was then increased by one up to 11. After that, the classification was performed only for a neighborhood size of 15. However, there were negligible improvements in accuracy for more than five neighboring pixels. The experiment was conducted to study the effect of the number of neighboring pixels on accuracy.
The best classification results of KNN and GML for the feature extracted data sets as well as the OD were observed independently, along with the parameters responsible for the best results. The experimental scheme is given in Figure 4.6.
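The KNN parameter sweep described above can be illustrated with a minimal majority-vote classifier. The two-class training set is a hypothetical toy example, Euclidean distance is assumed, and the sketch is not the implementation used for the experiments.

```python
from collections import Counter

def knn_classify(x, train, k):
    """Label x by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs; squared Euclidean
    distance is assumed as the distance measure.
    """
    nearest = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Neighborhood sizes used in the Set-I experiments: 3, 4, ..., 11 and 15.
neighborhood_sizes = list(range(3, 12)) + [15]

# Hypothetical toy training set: eight samples per class.
offsets = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
train = [((float(a), float(b)), "A") for a, b in offsets]
train += [((a + 5.0, b + 5.0), "B") for a, b in offsets]

labels_by_k = {k: knn_classify((0.5, 0.5), train, k) for k in neighborhood_sizes}
```

In this toy case the label does not change with the neighborhood size, which mirrors the observation that accuracy improved only negligibly beyond five neighbors.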
Figure 4.6: Experimental scheme for Set-I experiments
4.4 Second set of experiment (SET-II) using advance classifier
The second set of experiments was designed with the advanced classifier, i.e. the SVM algorithms. Different optimization techniques and algorithms for SVM were chosen to compare the accuracy and the time taken to train the classifier. In this thesis work, SVM_QP, SVM_SMO and a further approach, KPCA_SVM, were used to compare the classification accuracy and time. As mentioned before, all these algorithms were applied to the OD as well as to the feature extracted data sets.
The purposes of this experiment are summarized below:
(i) Investigation of the best classification algorithm among these SVM algorithms, depending upon the accuracy and processing time
(ii) Investigation of the best FE techniques for the SVM classifier
For KPCA_SVM, the SVs were initially extracted by solving the dual optimization problem using the quadratic programming (QP) optimization method. Then the KPCA algorithm with a Gaussian kernel was applied to the SVs, and the PCs were arranged in descending order of the eigen values of the kernel matrix. These PCs are the new set of SVs. In this research work, for all the experiments related to KPCA_SVM, about 70% of the original number of SVs was chosen from the new set of SVs (for details, see section 3.2.3.4), because about 99% of the variance was contained in the first 70% of the PCs. Finally, the SVM decision rule was applied to the new set of SVs to obtain the classified map.
For SVM_QP and SVM_SMO, quadratic programming optimization and
sequential minimal optimization methods were used respectively to solve the dual
optimization problem. The classification scheme for Set-II experiment is given in
Figure 4.7.
Figure 4.7: The experimental scheme for advanced classifier (Set-II)
4.5 Parameters
Parameters also play an important role in HD classification, so choosing them is an important task. All the parameters chosen for the different FE techniques and classification algorithms are listed in Table 4.1.

FE techniques    Parameters
SPCA             Correlation matrix of the bands
PP               No. of random searches – 5; half – 15; Stopping value – .01
KPCA             Kernel function – rbf
OSP              No. of endmembers – 8

Classifiers      Parameters
GML              Confidence interval – 99%
KNN              Neighbors – 3, 4, 5, …, 11 and 15
SVM              Kernel function – rbf
Table 4.1: List of parameters
CHAPTER-5 RESULTS
This chapter provides the observations from the various experiments and their interpretation. Starting with the visual interpretation of the feature extracted data sets, the chapter discusses the results of the GML classifier on the feature extracted data sets. These results are compared with the best result for GML observed by Abhinav (2009). It then discusses the effect of the KNN classification algorithm on the OD and the feature extracted data sets, followed by a discussion of the results of the different SVM algorithms.
5.1 Visual inspection of feature extraction techniques
Apart from the comparison of k-values, the features extracted by the various FE techniques can be inspected visually using grayscale views of the first few features. The image form of the correlation matrix is also used for this purpose.
From the correlation image of the OD (Figure 5.1), it is clear that there are three highly correlated blocks of bands. The first block contains 32 bands, the second 6 bands and the last 27 bands (Figure 5.1). The average correlation values of the blocks are 0.931, 0.997 and 0.941 respectively. Thus, the OD is segmented into these three blocks of bands on the basis of correlation. PCT was then applied to each block using its correlation matrix, giving the SPCA feature extracted data set. The total time taken to complete this process was about 8 seconds.
Figure 5.1: Correlation image of the OD set, consisting of three blocks of 32, 6 and 27 bands respectively.
In the PP process, two-dimensional structures are found sequentially, from the most important to the less important. The two structures, in decreasing order of interest (the first being the most interesting), are given in Figure 5.2. The PP index after five random searches was 0.3825 and the size of the neighborhood (c) around the best projection plane was 0.011. The total time taken to complete the whole process was about 11.30 hours. Table 5.1 presents the time required by each FE technique under different constraints.

Table 5.1: The time taken for each FE technique
FE methods              Time
SPCA                    6-8 seconds
KPCA with rbf kernel    1) 4 minutes for 25 TP
                        2) 5.5 minutes for 100 TP
                        3) 6.3 minutes for 200 TP
                        4) 8.5 minutes for 300 TP
                        5) 10 minutes for 400 TP
OSP                     90 seconds for 8 endmembers
PP                      11.30 hours
(a)
(b)
Figure 5.2: Projection of the data points. (a) Most interesting projection direction
(b) Second most interesting projection direction.
The grayscale images of the features extracted using the various FE techniques are provided in Figures 5.3 to 5.6, followed by the corresponding correlation images in Figure 5.7.
(a) SPCA-1 (b) SPCA-2 (c) SPCA-3 (d) SPCA-4 (e) SPCA-5 (f) SPCA-6
Figure 5.3: First six Segmented Principal Components (SPCs); (b) shows water body and salt lake.
(a) KPCA-1 (b) KPCA-2 (c) KPCA-3 (d) KPCA-4 (e) KPCA-5 (f) KPCA-6
Figure 5.4: First six Kernel Principal Components (KPCs) obtained by using 400 TP
(a) OSP-1 (b) OSP-2 (c) OSP-3 (d) OSP-4 (e) OSP-5 (f) OSP-6
Figure 5.5: First six features obtained by using eight end-members; (b) shows vineyards and wheat, (c) shows bare soil, (d) shows salt lake.
(a) PP-1 (b) PP-2
Figure 5.6: Two components of the most interesting projections; (a) shows salt lake.
Figure 5.7: Correlation images after applying various FE techniques (panels: SPCA, KPCA, OSP and PP)
The following were observed from visual inspection of the feature extracted data sets (Figures 5.3 to 5.6) and their correlation images (Figure 5.7):
(i) Since the extracted SPCs were ranked according to their eigen values, a higher amount of information can easily be noticed in the first four SPCs. No interesting structures could be visually identified beyond the 4th SPC. As SPCA uses the local correlation of the bands rather than the global correlation (as PCA does), it is able to decorrelate the bands involved more strongly than PCA. A better classification result is therefore expected from the SPCs. It has also been visually observed that SPCA-2 is associated with the water body and salt lake classes.
(ii) The first few features extracted by KPCA were visually inferior to those obtained by SPCA (not revealing any single class). Some of the features, such as KPCA-1 and KPCA-2, show water body and salt lake prominently, but other classes are also present there.
(iii) OSP is generally used to extract the same number of features as the number of classes present in the data set (in this case eight classes, hence eight features). Although the number of features extracted by OSP is low, it can identify some structures prominently. For example, OSP-4 identifies salt lake, OSP-2 identifies vineyards and wheat, and OSP-3 shows bare soil. From the OSP algorithm, it can be inferred that each band of the OSP extracted data set is associated with one of the predefined classes. Therefore, OSP is expected to perform well for classification.
(iv) The dimension of the PP extracted feature set is two. Salt lake can be identified very clearly from the first extracted feature, but the second feature contains no identifiable structures and has a hazy appearance.
(v) The quality improvement of the features extracted by the different FE techniques can be observed by comparing the correlation images of the OD (Figure 5.1) and of the feature extracted data (Figure 5.7). The correlation matrices of the SPCA and PP extracted data sets are found to be perfectly diagonal, with diagonal values equal to unity and all off-diagonal elements equal to zero. On the other hand, the data extracted using the supervised FE techniques (OSP, KPCA) are correlated. This is because the SPCA and PP algorithms extract only orthogonal features, while the FE criterion is different for OSP, so highly correlated features are observed for OSP. In the correlation image of the KPCA feature extracted data set, it can be observed that the correlation is unity along the diagonal and decreases with increasing distance from the diagonal of the correlation matrix, except for bands 80 to 100. These bands are observed to be fully uncorrelated.
5.2 Results for parametric and non-parametric classifiers
This section presents the results of the GML and KNN classifiers on the different data sets. It first describes the results for the GML classifier, followed by those for KNN.
5.2.1 Results of classification using GML classifier (GMLC)
The performance of GMLC with the feature modified data sets (SPCA, KPCA, OSP and PP FE methods) was compared with the best result obtained by Abhinav (2009) for the GML classifier, in order to evaluate the improvement in classification due to these FE techniques. It may be noted that he obtained the best results with the PCA modified data set. Figure 5.8 shows the k-values obtained for the different feature modified data sets.
The following observations can be made from Figure 5.8:
(i) For the cases with sufficient TP (100, 200, 300), the k-values obtained for the PCA, SPCA and OSP extracted data sets were observed to be higher than those for the PP and KPCA modified data sets.
(ii) For statistically insufficient TP (25), GML performs poorly for the SPCA, PCA and OSP modified data sets. As the number of bands increases beyond a certain number, the k-value for the PCA and SPCA modified data sets becomes negative for 25 TP per class. This is because at least p + 1 sample points are required to obtain a numerically well conditioned inverse of a p × p matrix. Owing to this effect, GML fails when more than 25 bands are used with 25 TP per class, since these are insufficient for computing the inverse of the class covariance matrix.
(iii) An interesting phenomenon can be observed in the k-values of the KPCA modified data set. The k-value increases over the first 35 bands, falls suddenly at 40 bands, and starts to increase again from 45 bands onwards. The result for the KPCA modified data set was observed up to 65 bands (the dimension of the OD is 65).
(iv) The k-values obtained for SPCA and OSP appear to outperform those obtained for PCA and KPCA.
(v) The performance of PP is found to be very poor owing to the very low number of features (two). Hence, PP was not considered any further for classification.
(vi) For all FE techniques (except KPCA and OSP), the k-values increase significantly with the number of bands up to a critical number of bands (say, Ncri), after which no improvement could be observed in the k-values. This is because the features extracted by these techniques are arranged in decreasing order of eigen values, so the useful information is stored in the first few features only, while the lower order features contain less useful information and are very noisy. When these noisy bands are added, the probability of misclassification increases. As a result, the classification accuracy stagnates.
(vii) Ncri differs between sets of TP: when the number of TP increases, Ncri increases. Because of the Hughes phenomenon, classification with a large number of bands gives poor results unless the number of TP is large.
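The failure of GML with 25 TP per class noted in observation (ii) can be reproduced numerically: a covariance matrix estimated from n samples has rank at most n − 1, so it cannot be inverted once the number of bands reaches n. The sketch below uses synthetic Gaussian samples and a hypothetical helper name; it illustrates the rank argument only and does not reproduce the thesis data.

```python
import numpy as np

def class_covariance_rank(n_samples, n_bands, seed=0):
    """Rank of the sample covariance matrix estimated from synthetic training pixels."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_bands))   # stand-in for one class's TP
    S = np.cov(X, rowvar=False)                 # n_bands x n_bands covariance estimate
    return int(np.linalg.matrix_rank(S))

rank_25_tp = class_covariance_rank(25, 30)      # fewer samples than bands
rank_100_tp = class_covariance_rank(100, 30)    # enough samples
```

With 25 samples and 30 bands the estimate is rank deficient, so its inverse in the Gaussian likelihood does not exist, whereas 100 samples give a full-rank estimate for the same number of bands.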
Figure 5.8: Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands (panels: PCA, OSP, KPCA, SPCA, PP)
To confirm these observations, statistical analysis was performed. The k-values obtained for each FE technique are given in Table 5.1. The best results obtained by GML classification on the different feature extracted data sets for three training data sets (100, 200 and 300 TP) were selected for comparison with the best GML result obtained with the PCA extracted data set. The best classification result (best k-value) is taken to be the one obtained with the smallest number of bands beyond which no statistically significant improvement in k-value could be achieved. A comparison of the best results between PCA and the other FE modified data sets, and among the various FE techniques, is presented in Table 5.2 in terms of the z-statistic values obtained for one-tailed hypothesis testing at the 5% significance level.
Following observations can be viewed from the Table 5.2.
(i) PCA and SPCA give statistically similar results for 100 and 300 TP per class,
while SPCA gives a statistically significantly better result than PCA for
200 TP per class. SPCA is thus an improvement over PCA.
(ii) OSP does not give a statistically better result than PCA for the statistically
exact TP set (100 TP per class), but it does when the number of TP increases.
For the large TP set (300), the result is statistically similar to PCA.
(iii) For the 200 and 300 TP sets, SPCA and OSP give statistically similar results, but
for the small TP set (100) SPCA gives a better result than OSP. Since the SPCA
extracted data set is more orthonormal than the OSP extracted data set, it can
be concluded that SPCA is a better FE technique than OSP for GML
classification.
(iv) The PP extracted data set always gives a statistically much poorer result than
OSP because of its low dimensionality (two dimensions).
(v) KPCA always fails (for large or small TP) to give a statistically better result
than PCA or OSP, and OSP is statistically better than PP for all sets of TP.
Again, SPCA gives a statistically better result than PCA or OSP. Therefore, it
can be concluded that SPCA is a better FE technique than PCA and the other FE
techniques (OSP, PP and KPCA).
(vi) The best kappa accuracy for the GML classifier is obtained using the SPCA
extracted data set with 300 TP. The kappa value is 0.9589 and the number of
bands used for classification is 45.
Table 5.2: Best kappa values and z-statistic (at 5% significance level) for GML
NB* = number of bands used; $Z_{12} = \dfrac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}$
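The z-statistic defined for Table 5.2 can be sketched in a few lines. This is only an
illustration: the function names are hypothetical, the kappa values are taken from the
200 TP row of Table 5.2 (PCA and SPCA), and the two variances are placeholder numbers
because the kappa variances are not listed in this section. The critical value 1.645 is the
standard normal quantile for a one-tailed test at the 5% level.

```python
import math

def kappa_z_statistic(k1, var1, k2, var2):
    """Z = (k1 - k2) / sqrt(var1 + var2) for two independent kappa estimates."""
    return (k1 - k2) / math.sqrt(var1 + var2)

def significantly_better(k1, var1, k2, var2, z_critical=1.645):
    """One-tailed test at the 5% level: is kappa 1 significantly larger than kappa 2?"""
    return kappa_z_statistic(k1, var1, k2, var2) > z_critical

# k-values for PCA (k1) and SPCA (k2) at 200 TP; variances are assumed placeholders
k_pca, var_pca = 0.9460, 4.0e-6
k_spca, var_spca = 0.9579, 5.0e-6

z12 = kappa_z_statistic(k_pca, var_pca, k_spca, var_spca)
print(round(z12, 2))
print(significantly_better(k_spca, var_spca, k_pca, var_pca))
```

A negative Z12 whose magnitude exceeds 1.645 means that the second data set (here SPCA)
gives the significantly higher k-value; a magnitude below 1.645 is read as a statistically
similar result.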
From Table 5.3 it is observed that the best results for the PCA, SPCA and KPCA
extracted data sets were obtained with 30-45 features at 300 TP, and for the OSP
extracted data set with 8 features at 300 TP. During the experiments, GMLC took
around 55-70 seconds to process 30-45 bands with 300 TP per class for the SPCA and
PCA extracted data sets, and about 32 seconds for the OSP extracted data. OSP
gives a statistically similar result to PCA and SPCA for 300 TP, but its processing
time is much lower than that of the other FE techniques. Therefore, considering both
accuracy and processing time, OSP can be rated as the most effective FE technique
for GMLC. For statistically insufficient TP (25) and statistically sufficient TP
(200), SPCA is rated as the best FE technique. For 100 TP per class, the performance
of PCA and SPCA for GMLC is the same. From Figure 5.9, it can be observed that GMLC
on OSP is faster than on any other FE technique. PCA and SPCA take about the same
time to provide the best k-values.
Table 5.3: Ranking of FE techniques and time required to obtain the best k-value
TP   | SPCA                     | PCA                      | KPCA                     | OSP                      | PP
     | k1*     Time (s)*  Rank  | k2      Time (s)   Rank  | k3      Time (s)   Rank  | k4      Time (s)   Rank  | k5      Time (s)   Rank
25   | 0.8409  53.6       1     | 0.8296  53.6       2     | 0.8215  59.7       2     | 0.2700  35.4       3     | 0.1960  -          4
100  | 0.9384  60.6       1     | 0.9362  60.6       1     | 0.8489  75.6       3     | 0.9205  39.2       2     | 0.2220  -          5
200  | 0.9579  65.2       1     | 0.9460  59.4       3     | 0.8332  74.4       4     | 0.9505  36.7       2     | 0.2146  -          5
300  | 0.9589  83.5       1     | 0.9568  72.3       1     | 0.8569  62.8       2     | 0.9572  39.8       1     | 0.2228  -          3
Time* = time (seconds) for obtaining the best k-value; ki* = k-value for the ith FE technique; Rank 1 indicates the best
Table 5.2 (values):
TP   | PCA           | SPCA          | KPCA          | OSP           | PP            | z-statistic
     | Best k1  NB*  | Best k2  NB   | Best k3  NB   | Best k4  NB   | Best k5  NB   | Z12     Z13     Z14     Z24     Z34      Z45
100  | 0.9362   20   | 0.9384   20   | 0.8489   35   | 0.9205   8    | 0.2220   2    | -1.35   41.95   8.87    10.51   -41.95   222.81
200  | 0.9460   20   | 0.9579   30   | 0.8332   35   | 0.9505   8    | 0.2146   2    | -3.97   53.47   -4.07   0.99    -53.47   304.51
300  | 0.9568   40   | 0.9589   45   | 0.8569   35   | 0.9572   8    | 0.2228   2    | -1.45   50.65   -0.28   1.20    -50.65   290.75
Figure 5.9: Comparison of kappa values and classification times for GML
classification method.
5.2.2 Class-wise comparison of result for GMLC
The class-wise accuracy for GMLC has been observed for different feature
extracted data set. From Figure 5.10, following can be observed:
(i) For all sizes of TP, GMLC can extract salt lake class from all sets of feature
extracted data with very high k-value. Water class is also extracted with very
high k-value for all feature modified data set (except PP). From PP modified
data set, only Salt lake class can be separated with satisfactory k-value.
Because first feature of PP modified data set can distinguish salt lake very
clearly (Figure 5.6)
(ii) Beside salt lake and water body, GMLC separates hydrophytic class from other
classes with very high k-value for all feature extracted data set and all set of
TP. For 300 TP, GMLC separates hydrophytic veg class with very high
accuracy from SPCA modified data set.
(iii) GMLC classifies vineyards and wheat with about same k-value. Some
vineyards pixels have been classified to wheat pixels due to presence of mixed
pixel. Accuracy of classification for vineyards, bare soil, pasture land and built-
up area classes are about same for 200 and 300 TP for SPCA, PCA, OSP
modified data set. It is low for KPCA modified data set.
Figure 5.10: Best producer accuracy of individual classes observed for GMLC on different
feature extracted data set with respect to different set of TP. (Panels: 25, 100, 200 and
300 Training Pixels. Classes: WT: Water, SLT: Salt lake, HV: Hydrophytic Veg, WHT: Wheat,
VY: Vineyards, BS: Bare Soil, PL: Pasture Land, BUA: Built-up Area.)
5.2.3 Classification results using KNN classifier (KNNC)
To understand the effect of FE techniques on KNN classifier, experiment was
performed with OD as well as feature extracted data. Same set of TP, as used in GML
classification, was chosen to compare classification accuracy. Observations from
figure no. 5.11 to 5.14 are as following:
(i) In case of KNN, poor performance is observed for statistically insufficient TP
set (i.e. 25 TP). However, KNN on OD performs better than PCA, OSP, SPCA
extracted data set. The maximum k-value was obtained for 65 bands and three
neighbors. For the KPCA extracted data set, the k-value was comparatively better
than for OD when 50 bands were taken, for all neighbors. PP was not included in
the accuracy analysis because, owing to its very low dimensionality, it would not
be able to provide good k-values.
(ii) For statistically exact TP (100 TP), the performance of KNN on OD is better
than on any feature extracted data set. A larger number of bands increases the
k-values for all feature extracted data sets except SPCA, for which increasing
the number of bands did not show any significant change. However, when the
number of neighbors is increased, changes are easily observed: beyond a
critical number of neighbors (say, Nnbd), the k-value starts decreasing,
independently of the number of bands. This may be due to noisy points present
in the training data set, since a large number of neighbors increases the
chance of using noisy TP. Consequently, the misclassification error adds up.
(iii) For 200 TP per class, no improvement over OD is observed for the PCA, KPCA
and OSP extracted data sets, but an improvement was observed for the SPCA
extracted data set. However, it did not show a prior change in PCA and KPCA
extracted data set for KNNC with 100 and 200 TP set respectively. The effect of
the neighborhood on accuracy can be seen from Table 5.4. For all sets of TP, the
highest k-value is always achieved with the first few neighbors (Table 5.2).
(iv) For the large training data set (300 TP), k-values better than OD were
observed. This is due to the PCA and SPCA extracted data sets. After a
certain threshold neighborhood, the k-value decreases monotonically for the
PCA, OSP and SPCA extracted data sets.
(v) The KPCA extracted data set gives a better result at high dimension since it is
more refined than the PCA or SPCA extracted data set.
(vi) For all training data sets except the statistically insufficient one, the k-value
for the OSP extracted data set varies little (0.02 - 0.05) because of its very low
dimensionality. If the number of extracted end members were large enough, the
result could be further improved.
(vii) Another important aspect was observed for the feature extracted data sets. The
difference between the k-values (for all sets of TP) obtained using the minimum
and the maximum number of bands is about 0.15 to 0.20. This could be because
most of the information is concentrated in the first few bands of the feature
extracted data set. Additional bands cannot provide enough useful information
to change the k-value significantly.
Table 5.4: Classification with KNNC on OD and feature extracted data set
Data sets | 100 TP      | 200 TP      | 300 TP
          | Bnd*   NN*  | Bnd    NN   | Bnd    NN
Original  | 55     3    | 35     3    | 30     3
PCA       | 35     5    | 45     5    | 20     3
SPCA      | 10     3    | 15     3    | 40     3
KPCA      | 35     3    | 45     3    | 30     6
OSP       | 8      3    | 8      3    | 8      3
PP        | 2      15   | 2      11   | 2      15
Bnd* = number of bands for which the best k-value is obtained; NN* = number of neighbors for which the best k-value is obtained
Figure 5.11: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 25 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb* = number of nearest
neighbors)
Figure 5.12: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 100 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb* = number of nearest
neighbors)
Figure 5.13: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 200 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb* = number of nearest
neighbors)
Figure 5.14: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 300 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb* = number of nearest
neighbors)
The k-values for the classification of these data sets were analyzed to select the
best results for each data set, following the same approach as in the case of GML.
The z-statistic values obtained for the selected best k-values are shown in Table 5.5.
The following can be inferred from these results:
(i) Results obtained using the PCA and SPCA modified data sets were found to be
significantly better than those obtained using the OD for the large training
data size (300 TP). However, SPCs and PCs were still found to perform worse
than OD for 100 TP. Statistically similar results were obtained for OD and
SPCA modified data sets using a training data set of 200 TP. For the other
feature extracted data sets and for all sets of training data, OD gives a
statistically significantly better result for KNN classification.
(ii) The best results with OD were obtained using 30 to 55 bands and three
neighbors. For 300 TP, statistically better results than OD were obtained using
the SPCA (40 bands) and PCA (20 bands) modified data sets with three neighbors.
For 200 TP, the SPCA modified data set (15 features and 3 neighbors) gives
results statistically similar to OD.
(iii) SPCA extracted data sets were observed to perform statistically significantly
better than PCA extracted data sets with the smaller training data sets, whereas
the best results obtained with the 300 TP training data set using SPCs were
statistically similar to those obtained with PCs.
(iv) SPCs were also observed to perform significantly better than the KPCA and
OSP modified data sets for all training data sets. In addition, the best results
for PCA and OSP were found to be statistically poor for all training data sizes.
Table 5.5: The best k-values and z-statistic for KNNC
TP   | OD           | KPCA         | SPCA         | PCA          | OSP          | z-statistic
     | k1      NB*  | k2      NB   | k3      NB   | k4      NB   | k5      NB   | Z12     Z13     Z14     Z23      Z34     Z45
100  | 0.8889  55   | 0.7773  35   | 0.8669  10   | 0.7715  35   | 0.8268  8    | 42.51   9.42    44.72   -34.98   37.24   -20.55
200  | 0.9037  35   | 0.7881  45   | 0.9040  15   | 0.8062  45   | 0.8514  8    | 48.98   0.15    41.31   -49.10   11.43   -17.72
300  | 0.9244  30   | 0.8141  30   | 0.9325  40   | 0.9320  20   | 0.8701  8    | 47.91   -4.58   -4.29   -52.68   0.29    30.95
NB* = number of bands used to obtain the best k-value
Time taken to train the KNN classifier is highly affected by the number of TP.
This is because a distance has to be computed between each test pixel and each TP.
Increasing the number of TP therefore extends the calculation time: for n TP and
m test pixels, the number of distances calculated is mn. However, increasing the
number of neighbors has significantly less effect on run time. It has been
observed that the times taken for classification with three and with 15 neighbors
are almost similar (the maximum difference is 60-120 seconds) (Figure 5.15). It was
also noticed that increasing the number of bands affects the calculation time
proportionally (Figure 5.15). From Figure 5.16, it can be observed that PCA takes
the least time, compared to OD and SPCA extracted data, to provide the best result.
Considering the time constraint and k-value, PCA could be chosen as the best FE
technique, followed by SPCA, among the available techniques for KNN classification.
Figure 5.15 shows the comparison of time between 200 TP and 300 TP for the same
number of bands and neighbors. The rank of the FE techniques with respect to
accuracy for KNNC for each set of TP can be inferred from Table 5.6.
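The cost argument above (one distance per test pixel and training pixel, followed by a
vote among the nearest neighbors) can be sketched as a brute-force KNN. This is an
illustrative sketch on synthetic data, not the implementation used for the experiments; the
function name knn_classify and the two-cluster toy data are assumptions.

```python
import numpy as np

def knn_classify(train_x, train_y, test_x, n_neighbors=3):
    """Brute-force KNN: one Euclidean distance per (test pixel, training pixel) pair."""
    # distances has shape (m, n): m test pixels by n training pixels
    diff = test_x[:, None, :] - train_x[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    # indices of the n_neighbors closest training pixels for each test pixel
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
    votes = train_y[nearest]
    # majority vote among the neighbours' class labels
    labels = np.array([np.bincount(row).argmax() for row in votes])
    return labels, distances.shape

rng = np.random.default_rng(0)
# two well separated synthetic classes, 20 training pixels each, 5 bands
train_x = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(1.0, 0.1, (20, 5))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.vstack([np.zeros((3, 5)), np.ones((3, 5))])

labels, shape = knn_classify(train_x, train_y, test_x, n_neighbors=3)
print(labels, shape)
```

The distance array grows with both the number of test pixels and the number of TP, and each
distance costs one pass over the bands, whereas changing n_neighbors only changes how many
sorted entries are kept. This is consistent with the small effect of the neighborhood size
on run time reported above.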
From Table 5.6, it is further observed that for the statistically exact size of TP
(i.e. 100), KNNC produced the best result with OD. For statistically sufficient TP
(i.e. 200), SPCA secured the first rank. However, for statistically large TP (i.e.
300), SPCA and PCA both perform better. Therefore, it is concluded that among all
the data sets, feature modified and original, SPCA and PCA provide the best result
for KNNC, which in turn indicates that PCA is the best FE technique among all of
these techniques for KNNC.
Table 5.6: Rank of FE techniques and time required to obtain best k-value (Rank 1
indicates the best)
TP   | Original                  | KPCA                      | SPCA                      | PCA                       | OSP
     | k1      Time (s)*  Rank   | k2      Time (s)   Rank   | k3      Time (s)   Rank   | k4      Time (s)   Rank   | k5      Time (s)   Rank
100  | 0.8889  875.1      1      | 0.7773  722.9      4      | 0.8669  661.2      2      | 0.7715  789.6      5      | 0.8268  655.2      3
200  | 0.9037  1200.6     1      | 0.7881  1271.1     4      | 0.9040  1122.1     1      | 0.8062  1272.0     3      | 0.8514  1022.7     2
300  | 0.9244  1574.6     2      | 0.8141  1556.0     4      | 0.9325  1712.5     1      | 0.9320  1434.0     1      | 0.8701  1291.9     3
Time (s)* = required time in seconds
Figure 5.15: Time comparison for KNN classification. Time for different bands at
different neighbors for (a) 300 TP (b) 200 TP training data per class. (NNb* = number of
nearest neighbors)
Figure 5.16: Comparison of best k-value and classification time for original and
feature extracted data set
5.2.4 Class wise comparison of results for KNNC
From Figure 5.17, following observations can be viewed for class wise accuracy
of KNNC.
KNNC extracts water and salt lake classes with very high accuracy for both
feature modified data and OD. However, the built up area is classified very poorly
due to presence of large number of mixed pixels. For built up area some pixels are
classified into hydrophytic veg, wheat, pasture land classes for all data sets due to the
presence of large number of mixed pixels in built-up area class. Performance of OD,
KPCA and OSP modified data sets are lower than SPCA and PCA modified data set
to provide good classification accuracy for classification of hydrophytic veg class for
all sets of TP, and for vineyards and built-up area classes for all data sets and TP.
Figure 5.17: Class wise accuracy comparison of OD and different feature extracted
data for KNNC (panels: 100, 200 and 300 Training Pixels. Classes: WT: Water, SLT: Salt
lake, HV: Hydrophytic Veg, WHT: Wheat, VY: Vineyards, BS: Bare Soil, PL: Pasture Land,
BUA: Built-up Area)
5.3 Experiment results for SVM based classifiers
In this section, results of different SVM algorithms have been described. First
it will describe the results of SVM_QP algorithm followed by SVM_SMO and
KPCA_SVM. The section also provides a comparison of classification time of different
SVM algorithms.
5.3.1 Experiment results for SVM_QP algorithm
Using the optimal set of parameter values (Table 4.5, recommended by
Abhinav, 2009) for SVM classifiers, classification was performed on the feature modified
data sets. Results from these experiments are compared with the best result obtained
by Abhinav (2009) for the SVM classifier. He noted that the performance of SVM_QP was
the best for the PCA extracted data set. The same training and input data sets were used
as for the GML and KNN classifiers. The classification results obtained by SVM are
presented in Figure 5.18, from which the following observations can be made:
(i) The k-values improve with increasing training data size for all input data set
types (PCA, SPCA, KPCA, OSP and PP modified data sets).
(ii) The best classification results were obtained with the PCA and SPCA modified
data sets. For the KPCA modified data set, the k-values increase as the number
of bands increases. It is possible that, at very high dimension, the KPCA
extracted data set could provide k-values as high as the SPCA or PCA extracted
data sets.
(iii) For PCs and SPCs the k-values increase and then stagnate after a critical
number of features; beyond that they start to decrease gradually. This could
be due to the same reason discussed for the GML classification algorithm in
section 5.1.
(iv) A similarity can be observed among the KPCA, PCA and SPCA modified data sets.
For statistically insufficient TP (25), the k-values suddenly drop to about zero
for classification using 50 bands. The reason is not clear. Probably, with this
number of bands and TP, SVM_QP was unable to find a proper decision
boundary.
(v) The best results for the KPCA and OSP extracted data sets are about the same for
each set of TP except 25 TP.
Figure 5.18: Overall kappa values observed for classification of FE modified data
sets using SVM and QP optimizer (panels: (a) PCA, (b) SPCA, (c) KPCA, (d) PP, (e) OSP)
The k-values for the classification of these data sets were statistically analyzed
to select the best results for each data set. The approach was similar to that followed
in case of GML. The z-statistic values obtained for the best k-values are shown in Table
5.7. The following can be inferred from these results:
(i) PCA and SPCA give statistically similar results for all sets of TP. On the
other hand, PCA always gives a statistically significantly better result than
the KPCA and OSP modified data sets for all sets of TP with the SVM_QP
classifier.
(ii) Classification with the SPCA modified data set always performs statistically
better than with the KPCA modified data set for all sets of TP. OSP performs
statistically better than the KPCA modified data set for 100 and 200 TP per
class; for the large set of TP (300), OSP performs statistically similarly to the
KPCA modified data set.
(iii) It is also observed from Table 5.7 that the SPCA modified data set always
performs statistically better than the OSP modified data set.
(iv) It can be concluded that PCs and SPCs have a better ability to improve the k-
value than any other FE technique. KPCA performs the worst among all the
FE techniques.
Table 5.7: The best kappa accuracy and z-statistic for SVM_QP on different feature
modified data set
TP   | PCA          | KPCA         | SPCA         | OSP          | z-statistic
     | k1*     NB*  | k2      NB   | k3      NB   | k4      NB   | Z12     Z13     Z14     Z23      Z24     Z34
100  | 0.9408  15   | 0.8703  55   | 0.9408  15   | 0.8874  8    | 36.30   0.00    28.70   -36.30   -7.79   28.70
200  | 0.9621  15   | 0.8901  65   | 0.9573  15   | 0.9050  8    | 7.89    0.53    6.26    -33.39   -7.26   30.40
300  | 0.9643  15   | 0.9090  60   | 0.9691  20   | 0.9069  8    | 6.07    -0.59   6.30    -7.40    1.06    7.65
NB* = no. of bands used to achieve the best k-value; ki* = k-value for ith FE technique
During above experiments, it was observed that time taken to train the SVM
based classifier is affected very much by the number of training samples used. This is
because a kernel matrix has to be computed for every pair of TP. There were very
little changes in training times with increase in number of bands.
Generally the total time taken to perform SVM based classification was
observed to be ranging from 23 to 102 seconds when bands were increased from 5 to
65 for 25 TP. The same range for 100 TP was observed as 82 to 273 seconds, and 522
to 615 seconds for 200 TP.
An important aspect has been observed for the classification time using 200 TP
with SPCA modified data set (Figure 5.19). When the bands are increased, after a
critical number of bands (30 bands), the classification time decreases monotonically.
The same trend was observed for 300 TP per class. This could be due to the fact that,
with a large number of TP and a large number of bands, SVM_QP was unable to find
the sufficient number of support vectors required for classification. For SPCA or PCA
modified data sets, except the first few bands, all remaining bands contain a large
amount of noise. Due to the presence of noise, the optimization problem might not be
solved properly for a large number of bands with a large set of TP for SPCA or PCA
modified data sets. That means that a sufficient number of SV could not be found.
When the number of SV is smaller, the classification time is also smaller and k-values
may also decrease. This is supported by Figure 5.18 (a), (b). It is observed that for
SPCA and PCA modified data sets k-values start to decrease after 25 bands.
Exceptionally higher times of the order of 2600 seconds were observed when
the training data size was increased to 300 TP. Such higher times were observed due
to the QP optimizer used. Varshney and Arora (2004) suggested a few better
optimizers which would give the same classification accuracies in shorter training
times. It is known that the same performance would be achieved irrespective of the
choice of optimizer in case of SVM as it makes use of the statistical learning theory,
as pointed out by Varshney and Arora (2004).
Figure 5.19: Classification time comparison using 200 and 300 TP per class.
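The dependence of SVM training cost on the number of TP rather than on the number of bands
can be related to the size of the kernel matrix. The sketch below builds an rbf kernel
matrix for synthetic pixels. It is only an illustration: the function name rbf_kernel_matrix
and the value of gamma are assumptions and are not the parameter values of Table 4.5.

```python
import numpy as np

def rbf_kernel_matrix(x, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for every pair of training pixels."""
    sq_norms = (x ** 2).sum(axis=1)
    # squared Euclidean distances between all pairs of rows of x
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    # clip tiny negative values caused by floating point round-off
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
for n_bands in (5, 65):
    # 200 synthetic training pixels with n_bands features each
    pixels = rng.normal(size=(200, n_bands))
    k = rbf_kernel_matrix(pixels, gamma=0.1)
    print(n_bands, k.shape)
```

For n training pixels the matrix handed to the optimizer is n by n whatever the number of
bands, so the size of the optimization problem is set by the number of TP; the number of
bands only affects the cost of filling in each entry.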
5.3.2 Experiment results for SVM_SMO algorithm
The classification results obtained using SVM with the SMO optimization
technique are presented in Figure 5.20. The rbf kernel function is used for
classification of the different data sets with the SVM_SMO algorithm. The following
observations can be made on the basis of the k-values presented in Figure 5.20:
(i) The k-values improve with increasing training data size (except for 200 TP)
for all input data sets.
(ii) As with SVM_QP, a sudden decrease in k-value is observed with 25 TP for the
OD, SPCA, KPCA and OSP extracted data sets. For all data sets, this
happens at 50 features.
(iii) For all data sets (except the KPCA extracted data), the statistically sufficient
training data set (200 TP) is unable to provide a positive k-value. This could
be due to failure to solve the optimization problem for these data sets using
200 TP. For the KPCA extracted data set, the first few bands give very low
k-values for 200 TP. From 20 bands onwards, the k-value given by the KPCA
extracted data set for 200 TP is acceptable.
(iv) For the original and KPCA modified data sets, the k-values increase up to a
critical number of features and then start to decrease. This is for the same
reason as reported for the GML classifier. For the OD and KPCA modified
data sets the k-values increase monotonically for 100 and 300 TP per class.
(v) For the PP modified data set, however, very low k-values are observed. All
the results for the PP extracted data set are therefore ignored in the
comparison of results of the SVM_SMO classifier.
The k-values for the classification of these data sets were statistically analyzed
to select the best results for each data set. The approach was similar to the one
followed in previous cases. The z-statistic values are obtained to compare each data
set. The best k-values are shown in Table 5.8. The following can be inferred from
these results:
(i) The best results obtained using feature modified data sets were found to be
significantly better than those obtained using the OD set for the large
training data size (300 TP). For the OSP modified data set the improvement
is marginal, but it can still be said to be significantly better than the OD
set. The performance of the OD, SPCA and OSP modified data is very poor,
while the performance of the KPCA modified data is very high, for 200 TP
training data. SPCs were found to perform statistically better than the OD
set for 100 TP per class and statistically similar to OD for 200 TP.
(ii) The best results with the OD were obtained using 50-60 bands, while
significantly better results than OD were obtained using SPCA modified
data sets with 15-30 features. For 300 TP, a result statistically similar to OD
is obtained using the OSP modified data set with eight bands.
(iii) KPCs were observed to perform significantly better than the SPCA and
OSP modified data sets for 200 TP. For 100 and 300 TP, the best results
obtained with the SPCA modified data set are significantly better than those
of the OSP and KPCA modified data sets.
(iv) Classification with OSP is found to be significantly better than with KPCA
for 100 TP, while KPCA is observed to be statistically better than the OSP
modified data for 200 and 300 TP. Thus it can be said that SPCA performs
better than OD and any other feature extracted data, and the performance of
OSP is the worst, for SVM_SMO based classification.
Figure 5.20: Overall kappa values observed for classification of original and FE
modified data sets using SVM with SMO optimizer (panels: Original, SPCA, KPCA, OSP and PP)
Table 5.8: The best k-value and z-statistic for SVM_SMO on OD and different feature
modified data set
TP   | OD           | KPCA         | SPCA         | OSP          | z-statistic
     | k1      NB*  | k2      NB   | k3      NB   | k4      NB   | Z12      Z13      Z14     Z23      Z24      Z34
100  | 0.8955  50   | 0.8626  40   | 0.9304  15   | 0.8739  8    | 15.00    -15.91   9.85    -33.90   -4.99    28.25
200  | 0.1694  5    | 0.8826  50   | 0.1694  5    | 0.0001  8    | -336.2   0.00     12.90   475.46   630.00   -
300  | 0.8934  60   | 0.9013  50   | 0.9436  30   | 0.8999  8    | -3.80    -26.98   1.65    -23.75   5.56     28.86
NB* = No. of band used to obtain best k-value; ki* = k-value for ith FE technique
Figure 5.21: Comparison of classification time for different set of TP with respect to
number of bands for SVM_SMO classification algorithm.
The total time taken to perform SVM_SMO based classification was observed
to be ranging from 55-90 seconds when bands were increased from 5 to 65 for 25 TP,
the same range for 100 TP was observed as 145-194 seconds, 350-409 seconds for 200
TP and 1184 to 1814 for 300 TP (Figure 5.21). Unlike to SVM_QP it is observed that
when number of bands increases the classification time also increases. The time
requirement for large number of TP for SVM_SMO is observed to be significantly less
than the SVM classification method based on QP optimizer. This is due to SMO
optimization method. The solution derived for SMO methods needs very few
numerical operations. This method needs more number of iterations but requires a
small number of operations thus resulting in an increase in optimization speed for
very large data sets.
5.3.3 Experiment results for KPCA_SVM algorithm
The classification results obtained using KPCA_SVM algorithm (QP optimizer)
is presented in Figure 5.22. The rbf kernel function is used for classification of
different data set using KPCA_SVM algorithm. The following observations can be
made on the basis of k-values presented in Figure 5.22:
(i) For the OD and KPCA extracted data, the KPCA_SVM classifier behaves
unpredictably for all data sets, sets of TP and numbers of bands. The maximum
k-value for OD is obtained for 200 TP with 35 bands, and for KPCA for 200 TP
with 25 bands.
(ii) For the SPCA extracted data set, the k-values drop to about zero after 20 bands
for each set of TP. The maximum k-value obtained with SPCA is better than that
obtained with the OD and KPCA extracted data sets. The maximum k-value for
each set of TP is obtained with five bands.
(iii) For the OSP extracted data set, the highest k-value is obtained for 200 TP. This
value is higher than the k-values of the other feature modified data sets
obtained for 200 TP. The reverse is seen for the OSP modified data set
with 300 TP.
(iv) One important phenomenon is observed for the KPCA_SVM algorithm: for the
large set of TP (300), KPCA_SVM gives very low k-values. The best k-value for
all data sets is obtained using 200 TP per class.
Figure 5.22: Overall kappa values observed for classification original and feature
modified data sets using KPCA_SVM algorithm. (panels: Original, SPCA, KPCA, OSP)
The k-values for classification of these data sets were analyzed to select the
best results for each data set. The approach was similar to that followed in previous
cases. The following can be inferred from these results (Table 5.9):
(i) The best results obtained using feature modified data sets (except KPCA) were
found to be significantly better than those obtained using the OD set for 200
TP. For 300 TP, OD provides statistically better result than other feature
modified data. Performance of OD, KPCA and OSP modified data is not good.
However, performance of SPCA modified data is very high for 100 TP. SPCA is
observed to be performing statistically better than OD set for 100 TP per class.
(ii) The best results with the OD were obtained using 50-60 bands, while
significantly better results than OD were obtained using SPCA modified data
sets with five to ten features for 100 and 200 TP per class. For the OSP modified
data set, a statistically better result than OD is obtained using 200 TP with
eight bands.
(iii) SPCs were observed to perform significantly better than OSP for 100 and
300 TP, while OSP performs statistically better than SPCs for 200 TP. KPCs
perform statistically better than OSP for 100 TP. However, the performance of
KPCs for 200 and 300 TP is statistically significantly lower than that of OSP.
(iv) SPCs always perform statistically better than KPCs, and OSP performs better
than SPCA only for 200 TP. It can be concluded that for 100, 200 and 300 TP,
KPCA_SVM performs best with the SPCA modified data set, the OSP modified
data set and the OD respectively. KPCA_SVM gives low k-values compared to
the SVM_QP or SVM_SMO algorithms.
Table 5.9: The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets.
TP   | OD           | KPCA         | SPCA         | OSP          | z-statistic
     | k1      NB*  | k2      NB   | k3      NB   | k4      NB   | Z12     Z13      Z14      Z34
100  | 0.7110  50   | 0.5150  25   | 0.7565  10   | 0.4192  8    | 62.93   -15.32   93.69    96.77
200  | 0.6736  45   | 0.6514  30   | 0.6976  5    | 0.7917  8    | 6.98    -6.64    -39.72   -21.83
300  | 0.7142  55   | 0.5109  45   | 0.5340  5    | 0.3488  8    | 61.15   5.10     203.10   104.90
NB* = No. of band used to obtain best k-value
5.3.4 Class wise comparison of the best result of SVM
The ability of the SVM classifiers to separate different classes is observed from Figure
5.23.
(i) The ability to distinguish the salt lake class is about the same for all SVM
classifiers.
(ii) The accuracy of separation of the wheat class by the SVM_QP and SVM_SMO
classifiers is about the same. However, KPCA_SVM separates every class
other than salt lake with much lower accuracy than the other two classifiers.
(iii) SVM_SMO separates all other classes with slightly lower accuracy than
SVM_QP.
(iv) SVM_QP is the best classifier. It has the ability to separate all classes with
high k-value.
Figure 5.23: Comparison of classification accuracy of individual classes for
different SVM algorithms. WT – water, SLT – Salt Lake, HV – Hydrophobic veg,
WHT – wheat, VY – Vineyards, BS – Bare soil, PL – Pasture land, BUA – Built-up area
5.3.5 Comparison of results for different SVM algorithms
The overall best results obtained by different SVM algorithms were compared
statistically to find out the best SVM classification method in terms of classification
accuracy obtained. The same was done for the time scales observed for these in order
to compare the practical applicability of these methods.
(i) From Table 5.10, it is observed that the SVM_QP method is statistically better
than all other SVM algorithms for all sets of TP. The best results of SVM_QP are
obtained for SPCA and PCA modified data sets for 100 and 200 TP per class.
For 300 TP the best result is obtained for the SPCA modified data set. The
classification time ranges from 148 seconds (for 100 TP) to 2596 seconds (300
TP). This time range is very high.
(ii) From Table 5.10, it is observed that the SVM_SMO algorithm is the second best
SVM decision rule. The best k-values for SVM_SMO are obtained with 100, 200
and 300 TP using SPCA, KPCA and SPCA modified data sets respectively. The
k-values obtained for 100 and 300 TP are a little lower than those of SVM_QP,
though the required classification time using 300 TP is about two thirds of that
of SVM_QP. Although SVM_SMO needs more bands than SVM_QP to obtain the best
k-values for different sets of TP, its processing time is much less than that of
SVM_QP.
(iii) KPCA_SVM is the poorest method compared with SVM_QP and SVM_SMO. The
highest k-value for KPCA_SVM is obtained by using the OSP modified data set
with 200 TP. When the number of pixels is large, the performance of KPCA_SVM
is lower.
From the above discussion, it can be concluded that SVM_QP is the best
classifier with respect to accuracy. Considering both the classification time and
accuracy, SVM_SMO can be considered as the effective SVM classifier. The best
accuracy is obtained by SVM_QP using 300 TP with the first 20 bands of the SPCA
modified data set. For SVM_SMO, the best accuracy is obtained using 300 TP with
the first 30 bands of the SPCA modified data set.
Table 5.10: Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms.
TP   SVM_QP                             SVM_SMO                     KPCA_SVM                    z-statistic
     k1      FEA*       Time (s)*  NB*  k2      FEA   Time (s)  NB  k3      FEA   Time (s)  NB  Z12    Z13     Z23
100  0.9408  PCA, SPCA  122.6      15   0.9304  SPCA  148.1     15  0.7565  SPCA  94.3      10  6.14   77.61   71.94
200  0.9621  PCA, SPCA  585.7      15   0.8836  KPCA  363.9     50  0.7927  OSP   262.3     8   45.16  77.51   36.4
300  0.9691  SPCA       2596.2     20   0.9446  SPCA  1694.8    30  0.7142  OD    1190.2    55  18.38  113.47  97.01
ki = best k-value for ith classifier; FEA* = Feature extraction algorithms; NB* = No. of bands used to obtain best k-value; Time (s)* = Required time to obtain best k-value, in seconds
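As a quick arithmetic check on the timing remarks for the SVM algorithms, the following Python sketch recomputes the ratio of SVM_SMO to SVM_QP classification time from the values in Table 5.10. Only the numbers come from the table; the dictionary layout and the helper name `time_ratio` are illustrative assumptions.

```python
# Classification times in seconds, copied from Table 5.10.
# The dictionary layout is an illustrative choice, not part of the thesis.
times = {
    100: {"SVM_QP": 122.6, "SVM_SMO": 148.1, "KPCA_SVM": 94.3},
    200: {"SVM_QP": 585.7, "SVM_SMO": 363.9, "KPCA_SVM": 262.3},
    300: {"SVM_QP": 2596.2, "SVM_SMO": 1694.8, "KPCA_SVM": 1190.2},
}


def time_ratio(tp, method, reference="SVM_QP"):
    """Return time(method) / time(reference) for a given number of TP per class."""
    return times[tp][method] / times[tp][reference]


for tp in sorted(times):
    print(tp, round(time_ratio(tp, "SVM_SMO"), 3))
```

For 300 TP the ratio is about 0.65, in line with the "about two thirds" statement. For 100 TP the tabulated SVM_SMO time is slightly higher than that of SVM_QP, so the time advantage of SVM_SMO appears only for the larger training sets.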
5.4 Comparison of best results of different classifiers
The best results obtained by the parametric (GML), non-parametric (KNN) and
advanced (SVM) classifiers with different feature modified data sets have already
been presented in Tables 5.2, 5.5 and 5.9. The best advanced classifier (SVM_QP) was
chosen by statistically comparing all the advanced classifiers. The statistical
comparison of the parametric, non-parametric and best advanced classifiers is carried
out in order to find the best classifier among them with respect to classification
accuracy and time. The corresponding z-statistic is presented in Table 5.11.
The following are observed from Table 5.11:
(i) GML performs statistically better than the KNN classifier for all sets of TP.
Also, the classification time of GMLC is negligible compared with that of KNNC.
(ii) GMLC performs statistically similarly to SVM_QP for 100 and 200 TP. For the
large set of TP (300), the performance of the SVM_QP classifier is statistically
significantly better than that of GMLC. However, the required classification
time is very high for the SVM classifier.
(iii) SVM_QP provides statistically better results than KNNC for all sets of TP.
From this it can be concluded that SVM_QP is the best classifier on the basis of
classification accuracy. GML is ranked as the second best classifier.
(iv) It is also observed that all the classifiers obtain their best results using the
SPCA modified data set. It is therefore concluded that SPCA is the best feature
reduction technique among all the techniques tested, for all classifiers.
(v) The processing time of GMLC is much lower than that of any other classifier.
GMLC provides a slightly lower k-value than SVM_QP for 300 TP. Considering
both classification time and accuracy, it can be concluded that GMLC is the best
of all the classifiers.
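The way the tabulated z-statistics lead to the statements "statistically better" and "statistically similar" can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that are not spelled out in this section: that Zij is positive when classifier i has the higher k-value, and that the one-tailed 5% critical value of 1.645 is applied in either direction. The function name `compare` is hypothetical.

```python
# One-tailed critical value of the standard normal distribution at the 5% level.
Z_CRITICAL = 1.645


def compare(z, first, second):
    """Interpret a z-statistic assumed to be (first - second) / standard error."""
    if z > Z_CRITICAL:
        return f"{first} better"
    if z < -Z_CRITICAL:
        return f"{second} better"
    return "similar"


# Z13 values (GML vs SVM_QP) copied from Table 5.11.
z13 = {100: -1.54, 200: 0.42, 300: -7.97}
results = {tp: compare(z, "GML", "SVM_QP") for tp, z in z13.items()}
for tp, verdict in results.items():
    print(tp, verdict)
```

Under these assumptions the sketch gives "similar" for 100 and 200 TP and "SVM_QP better" for 300 TP, which agrees with observation (ii).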
Table 5.11: Statistical comparison of different classifiers' results obtained for different data sets
TP   GML                           KNN                         SVM_QP                           z-statistic
     k1      FEA*  Time (s)*  NB*  k2      FEA   Time (s)  NB  k3      FEA        Time (s)  NB  Z12    Z13    Z23
100  0.9384  SPCA  60.6       20   0.8669  SPCA  661.2     10  0.9408  SPCA, PCA  122.6     15  36.82  -1.54  -38.06
200  0.9579  SPCA  64.7       30   0.9040  SPCA  1122.1    15  0.9573  SPCA, PCA  585.7     15  31.33  0.42   -30.98
300  0.9589  SPCA  82.6       45   0.9325  SPCA  1712.5    40  0.9691  SPCA       2596.2    20  16.00  -7.97  -25.37
ki = best k-value for ith classifier; FEA* = Feature extraction algorithm; NB* = No. of bands used to obtain best k-value; Time (s)* = Required time to obtain best k-value, in seconds
The difference in performance of the GML, KNN and SVM classifiers can be
attributed to differences in their classification mechanisms. GML and KNN are
capable of forming only simple decision boundaries, whereas SVM can form highly
complex non-linear decision boundaries. In the given data, different degrees of class
separability were observed for different classes. The water and salt classes were
found to be easily separable from the rest of the classes. Classification accuracies of
about 100% were observed for these classes with a very small number of features for
all the classifiers. After these, the wheat, vineyards and bare-soil classes showed
slightly lower accuracy values, which means these are a little more difficult to
separate. The lowest accuracies were observed for the pasture land, built-up area and
hydrophytic vegetation classes. These classes are very poorly separated and thus
complex decision boundaries would be required to separate them. For the large set of
TP, SVM_QP is able to achieve higher classification accuracies than the parametric
and non-parametric classifiers because the latter were not able to separate the poorly
separated classes as well.
Classified maps corresponding to the best results of different classifiers are
shown in Appendix A (Figure A.1).
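To make the link between per-class accuracy and the overall k-value concrete, the sketch below computes Cohen's kappa and the per-class accuracy from a confusion matrix. The matrix is hypothetical (it is not thesis data) and the function name is an assumption; the kappa formula is the standard one based on observed and chance agreement.

```python
def kappa_and_class_accuracy(cm):
    """cm[i][j] = number of test pixels of reference class i labelled as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = sum(cm[i][i] for i in range(k)) / n          # overall agreement
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)  # chance agreement
    kappa = (observed - expected) / (1 - expected)
    per_class = [cm[i][i] / row_tot[i] for i in range(k)]   # accuracy per reference class
    return kappa, per_class


# Hypothetical 3-class confusion matrix, for illustration only.
cm = [
    [50, 0, 0],    # an easily separable class
    [0, 45, 5],
    [0, 10, 40],   # a poorly separated class
]
kappa, per_class = kappa_and_class_accuracy(cm)
print(round(kappa, 4), per_class)
```

In this toy example the confusion between the second and third classes alone lowers kappa to 0.85, even though the first class is separated perfectly.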
5.5 Ramifications of results
HD classification is a very crucial task due to the characteristics and large
volume of the data. It is clear from the analysis that, depending on the availability of
TP, the selection of FE techniques and classification algorithms is very important for
classification of HD. Another important aspect to be kept in mind is that the
classification and FE procedures are time-consuming. This thesis work has pointed out
some important guidelines for classification of HD (Table 5.12).
(i) When only a statistically insufficient set of TP is available, it is suggested to
apply the SVM_QP algorithm with the OSP FE technique. This will provide high
classification accuracy in minimum time.
(ii) GML is strongly recommended for application on the SPCA modified data set to
achieve very high accuracy in very little time for statistically exact and
statistically sufficient training data sets.
(iii) For a statistically large training data set, high accuracy could be achieved by
implementing SVM_QP on the SPCA modified data set. Nevertheless, this method
takes a very long processing time. So, it is strongly recommended to apply
GML on the SPCA modified data set: although the achieved classification accuracy
is a little lower than that of SVM_QP, the processing time is negligible compared
with that of SVM_QP. SVM_SMO could also be used for a large set of TP on the
SPCA modified data set.
(iv) Among all the popular FE techniques for HD, SPCA is the most effective FE
technique, which could be used to achieve high classification accuracy for HD
with all classification techniques.
Table 5.12: Ranking of different classification algorithms depending on classification accuracy and time. (Rank 1 indicates the best)
Ranking depending on accuracy
TP   Parametric  Non-parametric  Advanced
     GML  FEA    KNN  FEA        SVM_QP  FEA        SVM_SMO  FEA   KPCA_SVM  FEA
25   2    SPCA   3    KPCA       1       SPCA, OSP  1        SPCA  4
100  1    SPCA   3    SPCA       1       PCA, SPCA  2        SPCA  4         SPCA
200  1    SPCA   2    SPCA       1       PCA, SPCA  3        KPCA  4         OSP
300  2    SPCA   4    SPCA       1       SPCA       3        SPCA  5         OD
Ranking depending on accuracy & time
TP   Parametric  Non-parametric  Advanced
     GML  FEA    KNN  FEA        SVM_QP  FEA        SVM_SMO  FEA   KPCA_SVM  FEA
25   2    SPCA   3    SPCA       1       OSP        1        SPCA  4         SPCA
100  1    SPCA   4    SPCA       2       PCA, SPCA  3        SPCA  5         SPCA
200  1    SPCA   4    SPCA       2       PCA, SPCA  3        KPCA  5         OSP
300  1    SPCA   4    SPCA       3       SPCA       2        SPCA  5         OD
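For the 300 TP case, the accuracy ranking in Table 5.12 can be reproduced by simply sorting the best k-values reported in Tables 5.10 and 5.11, as the following Python sketch shows. The variable names are illustrative. For 100 and 200 TP the table assigns tied ranks on the basis of the statistical tests, which a plain sort does not capture, and the rule used to combine accuracy and time is not stated here, so neither is attempted.

```python
# Best k-values for 300 TP per class, copied from Tables 5.10 and 5.11.
best_k_300 = {
    "GML": 0.9589,
    "KNN": 0.9325,
    "SVM_QP": 0.9691,
    "SVM_SMO": 0.9446,
    "KPCA_SVM": 0.7142,
}

# Rank 1 is the classifier with the highest k-value.
ranked = sorted(best_k_300, key=best_k_300.get, reverse=True)
ranks = {name: position for position, name in enumerate(ranked, start=1)}
print(ranks)
```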
CHAPTER 6 SUMMARY OF RESULTS AND CONCLUSIONS
Starting with a summary of the observations noted in the previous chapter, this
chapter mainly aims to summarize the conclusions corresponding to the main
objectives as defined in the first chapter. It also suggests some areas and methods
for further research.
6.1 Summary of results
This research work is an extension of the work done by Abhinav (2009). For
this research work, DAIS 7915 hyperspectral sensor data was used for testing
different FE techniques and classification algorithms. The best results obtained in
these experiments were compared with those obtained by Abhinav (2009). Based on
the conclusions from the literature survey and the recommendations for future work by
Abhinav (2009), several FE techniques (SPCA, KPCA, OSP, PP) and classification
algorithms (KNN, GML, SVM based classifiers) have been tested to achieve the
objectives mentioned in section 1.4.
For the parametric classifier (GML), experiments were performed on the different
feature extracted data sets mentioned above. The best result obtained in these
experiments was compared with the best result obtained by Abhinav (2009) to
observe the improvement. For the non-parametric classifier (KNN), the first experiment
was performed with OD. The algorithm was then applied to the different feature
modified data sets. The best results for OD and feature extracted data were compared
to obtain the best result for the non-parametric classifier. For the advanced classifiers
(SVM_QP, SVM_SMO and KPCA_SVM), experiments were performed on OD as well as on
feature modified data sets. For SVM_QP, as for GML, the best result was also compared
with the best result obtained by Abhinav (2009). The best results of the different SVM
classifiers were examined to find the best SVM algorithm.
Lastly, the best results for the parametric, non-parametric and advanced classifiers
were compared to find the best classifier for HD. All the comparisons were
performed by one-tailed hypothesis testing at the 5% significance level.
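A minimal sketch of such a test is given below. It assumes the commonly used form of the z-statistic for two independent kappa estimates, z = (k1 - k2) / sqrt(var1 + var2), which is not restated in this chapter. The k-values and variances in the example are hypothetical and chosen only for illustration.

```python
import math


def kappa_z(k1, var1, k2, var2):
    """z-statistic for the difference between two independent kappa estimates."""
    return (k1 - k2) / math.sqrt(var1 + var2)


# Hypothetical k-values and variances, for illustration only.
z = kappa_z(0.95, 1.0e-5, 0.94, 1.2e-5)

# One-tailed test at the 5% significance level: reject "no improvement"
# when z exceeds the standard normal critical value 1.645.
significant = z > 1.645
print(round(z, 3), significant)
```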
Classification experiments were performed using the four FE techniques,
namely SPCA, KPCA, OSP and PP. From the statistical analysis of the classification
results obtained using these feature modified data sets, it could be concluded that
among the four FE techniques mentioned above, the SPCA modified data set provides
the best results. These results were also compared with the best classification results
obtained by Abhinav (2009) using different FE techniques. SPCA performs better
because it uses local statistics rather than global statistics.
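The idea of using local rather than global statistics can be sketched as follows: the bands are split into contiguous segments, PCA is run on the covariance matrix of each segment separately, and the leading components of all segments are concatenated. This is only an illustrative sketch using NumPy. The segment boundaries, the number of components kept per segment and the synthetic data are assumptions, not the settings used in this thesis.

```python
import numpy as np


def segmented_pca(pixels, segments, n_keep):
    """Run PCA separately on each band segment and concatenate the leading scores.

    pixels   : (n_pixels, n_bands) array
    segments : list of (start, stop) band index pairs, stop exclusive
    n_keep   : number of principal components kept per segment
    """
    features = []
    for start, stop in segments:
        block = pixels[:, start:stop]
        centred = block - block.mean(axis=0)
        cov = np.cov(centred, rowvar=False)        # local (per-segment) statistics
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][:n_keep]  # indices of the largest eigenvalues
        features.append(centred @ eigvecs[:, order])
    return np.hstack(features)


# Synthetic data standing in for a hyperspectral image: 200 pixels, 12 bands.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(200, 12))
spcs = segmented_pca(pixels, segments=[(0, 4), (4, 8), (8, 12)], n_keep=2)
print(spcs.shape)  # (200, 6)
```

Each segment needs only a small covariance matrix, so the eigen-decomposition is cheaper than a single PCA over all bands, and each component reflects the correlation structure of its own band group.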
Analyzing the results of the different classifiers, it is observed that the
results obtained from the PCA modified data set sometimes compete with those obtained
from the SPCA modified data set. Generally, the different classifiers provide their best
results using 15 to 30 bands of SPCA or PCA modified data sets, which effectively
reduces the classification time. OSP and PP, due to their very low dimensionality,
generally fail to produce satisfactory results. However, the results obtained by using
eight bands of the OSP modified data set are reasonably good, though they are not
always statistically significantly better than those of SPCA or PCA modified data sets.
There is a possibility of improving the result by increasing the dimension of the OSP
modified data set by extracting a larger number of endmembers. For the KPCA modified
data set, it was observed that its performance is always poor. However, it is observed
that KPCA can produce satisfactory results if the dimension is increased, which will
also increase the classification time proportionally. Therefore, KPCA is not considered
an effective FE technique.
From the experiments performed with the parametric classifier (GML), it was
observed that the performance of GML was significantly improved after applying FE
techniques. Comparing the obtained results with the best result obtained by Abhinav
(2009), SPCA was found to work best among all the available FE techniques in
improving the classification accuracy of GML.
Moving on to the non-parametric classifier, it is observed that the result of the
KNN classifier depends on the choice of the number of bands and neighbors. The best
results were selected for KNN with and without applying FE techniques, and it was found
that the result of KNN was enhanced by the PCA and SPCA techniques while the
supervised FE techniques like KPCA and OSP failed to do so.
The SVM algorithm was selected as the advanced classifier. It uses statistical
learning theory, which is expected to produce consistent and optimal results as
compared to the parametric and non-parametric classifiers. Different SVM algorithms
(SVM_QP, SVM_SMO and KPCA_SVM) were tested to reach this goal. For SVM based
classifiers, it was observed that the dimension of the data sets and the choice of
optimizer significantly affect the results. The best result of SVM_QP was achieved with
the SPCA feature extracted data set with 20 bands. It was also observed that the
classification result using the advanced classifier was an improvement over the best
result obtained by Abhinav (2009), which was obtained using PCA modified
data sets. This result was further improved by using the SPCA modified data set. This
shows that by using selected FE techniques, the classification results of the advanced
classifier can be further improved. It was observed that supervised FE techniques like
KPCA and OSP could not improve the result of SVM, while the unsupervised FE technique
(SPCA) improved the result. On the other hand, the best results of
SVM_SMO and KPCA_SVM were obtained by using SPCA and OSP modified data
sets respectively. Comparing the best results of the different SVM algorithms, SVM_QP
is concluded to be the best SVM classifier.
On comparing the best results obtained by the SVM classifiers with the best
results of the parametric and non-parametric classifications, it was found that the
advanced classifier performs significantly better for both kinds of data set, original
and feature extracted. The reason for the better performance of this classifier is the
improvement in separating a few classes which show poor k-values when parametric
or non-parametric classifiers are used. This observation is expected because of the
difference in the way the decision boundary is formed. The decision boundaries formed
by parametric or non-parametric classifiers are simpler. For this reason they are unable
to separate the poorly separated classes efficiently. The advanced classifier has the
ability to form complex, non-linear decision boundaries, which helps it to separate the
poorly separated classes.
Compared to the parametric classifier, SVM required more computation time and
memory. In spite of these difficulties, a significant improvement over the parametric
and non-parametric classifiers was observed with the advanced classifier. This
strongly suggests that SVM has the ability to reduce the difficulties of HD
classification.
6.2 Conclusions
Based on these results, the following conclusions are drawn:
1. Out of the various FE techniques for classification of HD, SPCA is the best FE
technique, followed by PCA. In addition, orthogonal subspace projection can be
taken as an effective FE technique if its dimension could be increased.
2. Although advanced classifiers need a large processing time, they are able to
reduce the problems concerned with the classification of HD in a much better
manner than the parametric or non-parametric classifiers. For statistically
exact and sufficient sets of TP, the performance of SVM_QP is not statistically
better than that of the parametric classifier. For a large set of TP, SVM_QP
produces statistically better results than all other classifiers. In addition, the
SPCA FE technique was found to be helpful in increasing the accuracy
significantly for all of the advanced, parametric and non-parametric classifiers.
6.3 Recommendations for future work
During the literature survey, some additional methods were found that are not
included in this thesis work. These appear to offer scope for improving the accuracy
and computation time of the advanced classifiers presented in this thesis. The
following methods are recommended for future work:
(i). In this thesis work, the high memory and computational time required by SVM
methods were slightly reduced by using different optimizers and algorithms.
There is still a chance to reduce the computation time of the SVM algorithm by
using the Lagrangian SVM algorithm (Mangasarian and Musicant, 2000). This
requires further testing. In addition, some optimization techniques like Kernel
Adatron (Bennett and Campbell, 200) and Successive Overrelaxation (SOR)
(Mangasarian and Musicant, 1998), which may reduce the computation time
significantly, should also be tested.
(ii). Moreover, for a large set of TP, the KPCA method takes much time. Lima and
Zen (2005) suggested a method called Sparse KPCA which may reduce the
computation time. This needs to be tested.
(iii). A high computation time was required by KNN in this thesis work. This is
because a large number of computations is required to classify a single
pixel. For a large data set the computation time will increase considerably. In
order to reduce it, a hash-table approach could be applied. By using a hash
table, the number of computations will be lower.
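The cost referred to in (iii) can be illustrated with a brute-force KNN sketch in Python, assuming a straightforward implementation in which every pixel is compared with every training pixel. The training spectra, labels and function name below are hypothetical, and the hash-table variant suggested above is not sketched, since its details are not specified here.

```python
from collections import Counter


def knn_classify(pixel, train, labels, k):
    """Brute-force KNN: one squared-distance computation per training pixel."""
    dists = [
        (sum((a - b) ** 2 for a, b in zip(pixel, t)), lab)
        for t, lab in zip(train, labels)
    ]
    dists.sort(key=lambda d: d[0])
    votes = Counter(lab for _, lab in dists[:k])
    # Return the majority label and the number of distances that were computed.
    return votes.most_common(1)[0][0], len(dists)


# Hypothetical two-band training spectra, for illustration only.
train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["water", "water", "soil", "soil"]
label, n_distances = knn_classify((0.95, 1.0), train, labels, k=3)
print(label, n_distances)
```

In this form, classifying an image requires (number of image pixels) x (number of training pixels) distance computations, each over all retained bands, which is why reducing the number of candidates examined per pixel is attractive.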
REFERENCES
Barros, A. S. and Rutledge, D. N. (2005) ‘Segmented principal component transform–principal component analysis’, Chemometrics and Intelligent Laboratory Systems, Vol. 78, pp. 125–137.
Bhattacharyya, A. (1943) ‘On a measure of divergence between two statistical populations defined by probability distributions,’ Bulletin of Calcutta Mathematical Society, Vol. 35, pp. 99-109.
Ben-Dor, E., Patkin K., Banin A. and Karnieli, A. (2002) ‘Mapping of several soil properties using DAIS-7915 hyperspectral scanner data – a case study over clayey soils in Israel,’ International Journal of Remote Sensing, Vol. 23, No. 6, pp. 1043-1062.
Bierwirth, P., Huston, D., and Blewett, R. (2002) ‘Hyperspectral mapping of mineral assemblages associated with gold mineralization in the Central Pilbara, Western Australia,’ Economic Geology and the Bulletin of the Society of Economic Geologists, Vol. 97, No. 4, pp. 819-826.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992) ‘A training algorithm for optimal margin classifiers,’ Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM New York, NY, USA, pp. 144-152.
Carreira-Perpinan, M. A. (1997) ‘A review of dimension reduction techniques,’ Technical Report, Vol. 9, No. CS-96, Department of Computer Science, University of Sheffield.
Cha, G. H. (2005) ‘Kernel principal component analysis for content based image retrieval’, PAKDD 2005, LNAI 3518, pp. 844 – 849, Springer-Verlag Berlin Heidelberg.
Chang, C. I., Sun, T. L. E., and Althouse, M. L. G. (1998) ‘An unsupervised interference rejection approach to target detection and classification for hyperspectral imagery,’ Opt. Eng., Vol. 37, pp. 735–743.
Chang, C. I. (2005) ‘Orthogonal subspace projection (OSP) revisited: A comprehensive study and analysis’, IEEE Transactions on Geoscience and Remote Sensing, Vol. 43, No. 3.
Cristianini, N., Shawe-Taylor, J. (2000) An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, UK.
Congalton, R. G. (1991) ‘A review of assessing the accuracy of classifications of remotely sensed data,’ Remote Sensing of Environment, Elsevier Science (pub.), Vol. 37, No. 1, pp. 35-46.
Cover, T. M. and Hart, P. E. (1967) ‘Nearest neighbor pattern classification,’ IEEE Transactions Information Theory, Vol. IT-13, No. 1, pp. 21–27.
Curran, P. J. and Dungan J. L. (1989) ‘Estimation of signal-to-noise – a new procedure applied to AVIRIS data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 27, No. 5, pp. 620-628.
Dasarathy, B. V. (1991) ‘Nearest neighbour (NN) norms: NN pattern classification techniques’, IEEE Computer Society Press, Los Alamitos, CA
Devijver, P. and Kittler, J. (1982) Pattern recognition: A statistical approach, Englewood Cliffs, New Jersey.
Dundar, M. M. and Landgrebe, D. A. (2004) ‘Toward an optimal supervised classifier for the analysis of hyperspectral data,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 42, No. 1, pp. 271-277.
Friedman, J. H. (1987) ‘Exploratory projection pursuit,’ Journal of the American Statistical Association, Vol. 82, pp. 249-266.
Fukunaga, K. (1990) Introduction to statistical pattern recognition, Rheinboldt, W. (edt.), II edn., Academic Press, Inc., San Diego, USA.
Garg, A (2009) Investigations on classification techniques for hyperspectral imagery, M. Tech Thesis, Indian Institute of Technology, Kanpur.
Harsanyi, J. C. and Chang, C. I. (1994) ‘Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection,’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 32, pp. 779–785.
Harsanyi, J. C.(1993) Detection and classification of subpixel spectral signatures in hyperspectral image sequences, Ph.D. dissertation, Dept. Elect. Eng., Univ. Maryland Baltimore County, Baltimore, MD.
Huber, P. J. (1985) ‘Projection pursuit’, The Annals of Statistics, 13, 435-475.
Hughes, G. (1968) ‘On the mean accuracy of statistical pattern recognizers,’ IEEE Transactions on Information Theory, Vol. IT-14, No. 1, pp. 55-63.
Hwang, W. J. and Wen, K.W. (1998) ‘Fast KNN classification algorithm based on partial distance search’, IEEE Transaction, Electronics Filter, Vol. 34, No. 21.
Hwang, J., Lay, S., and Lippman, A. (1994), ‘Nonparametric multivariate density estimation: A comparative study,’ IEEE Transactions Signal Processing, Vol.42, No. 10, pp. 2795-2810.
Ifarraguerri, A. and Chang, C. I. (2000) ‘Unsupervised hyperspectral image analysis with projection pursuit’ IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 6.
Jia, X. (1996) Classification techniques for hyperspectral remote sensing data, Ph. D. Thesis, University of Canberra.
Jones, M. C., and Sibson, R. (1987) ‘What is projection pursuit?’, Journal of the Royal Statistical Society, Ser. A, 150, 1-38.
Jimenez, L. O. and Landgrebe, D. A. (1998) ‘Supervised classification in high dimensional space: Geometrical, statistical and asymptotic properties of multivariate data,’ IEEE Transactions Systems, Man and Cybernetics - Part C: Applications and Reviews, Vol. 28, No. 1, pp. 39-54.
Kim, K. I., Franz, F. O., and Scholkopf, B. (2005) ‘Iterative Kernel principal component analysis for image modeling’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 9.
Kohram. M. and Sap, M. N. M. (2008) ‘Composite kernel for support vector classification of hyperspectral data’, MICAI 2008, LNAI 5317, pp. 360 – 370, Springer-Verlag Berlin Heidelberg.
Kolahdouzan, M. and Shahabi, C. (2004) ‘Voronoi-based K Nearest Neighbor search for spatial network databases’. Proceedings of the 30th VLDB Conference,Toronto, Canada, 2004.
Lee, Y. J. and Huang, S. Y. (2005) ‘Reduced support vector machines: A statistical theory’, Taiwan.
Landgrebe, A. (1971) ‘Description and results of the LARS/GE data compression study,’ LARS Information Note, Vol. 21171.
Luenberger, D. (1984) Linear and nonlinear programming, II edn., Addison-Wesley, Menlo Park, California.
Luttrell, R. D. and Vogt, F. (2008) ‘Accelerating kernel principal component analysis (KPCA) by utilizing two dimensional wavelet compression: applications to spectroscopic imaging’, Wiley Inter Science.
Martinez, W. L. and Martinez, A. R. (2004) Exploratory data analysis with Matlab, Chapman and Hall /CRC
Mercer, J. (1909) ‘Functions of positive and negative type, and their connection with the theory of integral equations,’ Transactions of the London Philosophical Society, Vol.-209, No. A, pp. 415-446.
Nilsson, N. J. (1990) The mathematical foundations of learning machines, Morgan Kaufmann Publishers Inc., San Mateo, CA.
Pal, M. (2002) Factors influencing the accuracy of remote sensing classifications: A comparative study, Ph. D. Thesis, University of Nottingham.
Pechenizkiy, M. (2005) ‘The Impact of Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5’. B. Kégl and G. Lapalme (Eds.): AI 2005, LNAI 3501, pp. 268 – 279, 2005., Springer-Verlag Berlin Heidelberg
Ping, X., Guo, G., and Chen, G. (2006) A fast document classification algorithm based on improved KNN, IEEE Transaction.
Posse, C. (1995) ‘Tools for two-dimensional exploratory projection pursuit’, Journal of Computational and Graphical Statistics, Vol. 4, No. 2 (June, 1995), pp. 83- 100.
Richards, J. A. and Jia, X. (2006) Remote sensing digital image analysis: An introduction, IV edn., Springer, Berlin.
Robila, S. A. and Varshney, P. K. (2002) ‘Target detection in hyperspectral images based on independent component analysis,’ Proceedings of SPIE: Automatic Target Recognition XII, SPIE-International Society for Optical Engineering, Vol. 4726, pp. 173-182.
Schraudolph, N. N., Gunter, S. S., and Vishwanathan, V. N. Fast iterative kernel PCA, Statistical Machine Learning, National ICT Australia.
Smola, A. J. and Scholkopf, B. (1997) ‘On a kernel-based method for pattern recognition, regression, approximation, and operator inversion’, GMD Technical Report: 1064.
Sundaram, N. (2009) ‘Support vector machine approximation using kernel PCA’, Technical Report No. UCB/EECS-2009-94.
Vapnik, V. N. (1995) The nature of statistical learning theory, Springer, NY.
Vapnik V. N. (1998) Statistical learning theory. John Wiley and Sons, NY.
Varshney, P. K. and Arora, M. K. (2004) Advanced image processing techniques for remotely sensed hyperspectral data, Springer, NY.
Wegman, E. J. (1990) ‘Hyperdimensional data analysis using parallel coordinates’, Journal of the American Statistical Association, Vol. 85, No. 411, pp. 664-675.
Welling, M. ‘Kernel principal component analysis’, Department of Computer Science, University of Toronto.
Zhu, B., Jiang, L., Jin, F., Qin, L.,Vogel, A., and Tao, Y. (2007) ‘Walnut shell and meat differentiation using fluorescence hyperspectral imagery with ICA-KNN optimal wavelength selection’, Sens. & Instrumen. Food Qual. (2007) 1:123–131 DOI 10.1007/s11694-007-9015-z, Springer Science+Business Media, LLC 2007
APPENDIX A
[Figure panels: GML, KNN, SVM_QP, SVM_SMO, KPCA_SVM, with a common legend]
Figure A.1: Classified maps corresponding to the best results of different classifiers