
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 52, NO. 5, MAY 2014 2715

Is There a Preferred Classifier for Operational Thematic Mapping?

John A. Richards, Life Fellow, IEEE, and Nick G. Kingsbury, Fellow, IEEE

Abstract—The importance of properly exploiting a classifier's inherent geometric characteristics when developing a classification methodology is emphasized as a prerequisite to achieving near optimal performance when carrying out thematic mapping. It is argued that, when used properly, the long-standing maximum likelihood approach and the more recent support vector machine can perform comparably. Both contain the flexibility to segment the spectral domain in such a manner as to match inherent class separations in the data, as do most reasonable classifiers. The choice of which classifier to use in practice is determined largely by preference and related considerations, such as ease of training, multiclass capabilities, and classification cost.

Index Terms—Classification, maximum likelihood classifier (MLC), neural network, support vector machine (SVM), thematic mapping.

I. INTRODUCTION

Many different supervised classification procedures have been used for thematic mapping in remote sensing. Some are simple, such as the parallelepiped and minimum distance rules, and some are more complex, such as the artificial neural network and support vector machine (SVM). For many years, the maximum likelihood classifier (MLC), based on the use of the multivariate normal or Gaussian distribution, had been the standard operational algorithm [1], but its position has been challenged in the past 15 years or so by the rise of the support vector approach, in which kernel methods are used to improve separability. It is not unusual now to find assertions that the SVM outperforms the MLC [2], both in absolute terms and when high dimensionalities have to be handled, such as with hyperspectral data sets. The latter is a direct result of the Hughes phenomenon [1, p. 344], which draws attention to the difficulty in obtaining reliable estimates of the second-order parameters in the MLC algorithm when the data space dimensionality is large compared with the number of labeled training samples. Poor parameter estimation prejudices the ability of the algorithm to generalize well.
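To make the Hughes phenomenon concrete, the following sketch (ours, purely illustrative; the function name is an assumption) counts the parameters a Gaussian MLC must estimate per class: N mean values plus N(N+1)/2 distinct entries of the symmetric covariance matrix.

```python
# Parameters the MLC must estimate per class for an N-band sensor:
# N mean values plus N(N+1)/2 distinct covariance entries
# (the covariance matrix is symmetric).
def mlc_params_per_class(n_bands: int) -> int:
    return n_bands + n_bands * (n_bands + 1) // 2

# A 6-band multispectral sensor versus a 220-band hyperspectral one
# (e.g. AVIRIS, as used for the Indian Pines data set):
print(mlc_params_per_class(6))    # 27
print(mlc_params_per_class(220))  # 24530
```

With only a few hundred labeled pixels per class, the 24 530 second-order parameters of the 220-band case cannot be estimated reliably, which is exactly the difficulty the Hughes phenomenon describes.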

Any reasonable classifier can be made to perform well provided that it is used properly; in some cases, any reasonable classifier can even be made to outperform all others on some data sets, an assertion of the No Free Lunch Theorem [3]. That is why it is important, where possible, to test and compare new algorithms on more than a single data set and desirably on data sets of different complexities. It is the purpose of this paper to consider whether any particular algorithm will always outperform others when faced with an operational thematic mapping task and whether considerations such as the amount and nature of the training data required, training complexity, classification time, and the need for feature selection impact on the choice of algorithm. Effectively, we argue that it is the collection of these points and how a classifier is used—the classification methodology—that determines performance.

Manuscript received December 5, 2012; revised April 2, 2013; accepted May 10, 2013. Date of publication July 4, 2013; date of current version March 3, 2014.

J. A. Richards is with the Research School of Engineering, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia (e-mail: [email protected]).

N. G. Kingsbury is with the Signal Processing and Communications Group, Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2013.2264831

We will carry through the analysis by comparing the properties of the Gaussian maximum likelihood approach and the SVM, but our observations will allow comment on other methods as well, particularly the multilayer perceptron (MLP) neural network. We restrict our attention to the basic algorithms—those that are encountered most frequently in operational thematic mapping—rather than variants developed to overcome particular performance limitations, often in a research setting. Our principal goals are to show that the maximum likelihood rule is no less effective than other common procedures when used properly and that all reasonable approaches can be made to perform well, so that the choice of one among several candidates can be a matter of personal preference and/or computational considerations. This treatment is not about classifier algorithms, as such, but rather how they are used; no new algorithms are proposed.

In principle, the techniques considered here can be applied to any data type, but more commonly, they are used with optical data sets. Radar imagery has its own preferred methods of analysis [4], and multisource problems are usually better handled by methodologies that exploit the best match between algorithms and the constituent data types [5].

II. SEGMENTING THE DATA DOMAIN

We commence our comparison by examining the data domain in which the algorithms operate. We often regard spectral measurements as points in spectral space, but it is helpful to think of each measurement vector defining a cell whose size is set by the dynamic range of the sensor and its radiometric resolution. These, together with the number of wavebands recorded, establish the total number of cells in the measurement space. If we were able to attach a thematic label to each cell, then the classification task would be trivial, and in principle, we would have the ideal classifier. With the limited-resolution sensors of several decades ago that was technically feasible and led to algorithms such as the table look-up classifier. With the large numbers of cells now available, it is technically impossible to treat them separately, and instead, we regard the data space as a continuum. Successful classifiers have to segment the continuum such that there is a match of the segments to the information classes of interest to the user. The term information class describes the class labels defined by the user. They are not necessarily the classes that can be found from the data; we call the latter spectral or, more generally, data classes, depending on the sensor used. During classification, the analyst has to form the bridge between information and data classes [1].

0196-2892 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In this analysis, we focus first on how effectively classifiers, and the SVM and MLC in particular, segment the spectral domain to allow the information classes to be mapped. That is helped by a consideration of decision boundaries since flexibility in how a boundary can be fitted between classes of data effectively determines the efficacy of the classifier algorithm. In the next section, we look at the decision boundaries of the SVM and the Gaussian maximum likelihood rule. We also comment on how the MLP segments the spectral domain.

III. DECISION BOUNDARIES

A. SVM

The SVM is a binary classifier that implements an optimal separating hyperplane w^T Φ(x) + b between categories in the transformed feature space Φ(x), where x is the pixel vector in the original measurement domain and w is the weight vector of coefficients. It is trained by minimizing the functional [6]

    \frac{1}{2}\|w\|^2 + C \sum_i \xi_i .    (1)

By minimizing the norm of the weight vector \|w\| in (1) the margin between classes is maximized, while the nonnegative slack variables ξ_i allow a degree of classification error for overlapping training data. Together these define a constraint on the optimization that the training pixels lie on the correct sides of the hyperplane apart from some permitted to be in error by their associated slack variables. The regularization parameter C controls the balance between maximizing the class margin and the degree of classification error that can be tolerated. During training, a set of support vectors x_s, s ∈ S, is identified. Labeling an unknown pixel vector x into the class pair {ω_i, \bar{ω}_i} is carried out using the sign of the decision function

    f(x) = \sum_{s \in S} c_s k(x_s, x) + b    (2)

with c_s = α_s y_s; y_s is the indicator variable for the sth support vector x_s, which is +1 for class ω_i vectors and −1 for class \bar{ω}_i vectors; α_s is the corresponding Lagrange multiplier, which is always positive. Thus, c_s is positive for pixels from class ω_i and negative for pixels from class \bar{ω}_i. k(x_s, x) = Φ(x_s)^T Φ(x) is the chosen kernel. The decision surface between the two classes in the spectral domain is described explicitly by

    \sum_{s \in S} c_s k(x_s, x) + b = 0    (3)

the particular form of which depends on the kernel. Consider a radial basis function kernel with radius parameter γ

    k(x_s, x) = \exp\{-γ \|x - x_s\|^2\}.    (4)
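As a concrete check of (2), the sketch below (our illustration, not from the paper; it assumes scikit-learn and synthetic two-band data) trains an RBF-kernel SVM and then reproduces the library's decision values directly from the stored support vectors x_s, coefficients c_s = α_s y_s, and offset b.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic "spectral" classes in a 2-band space (illustrative only).
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Recompute f(x) of (2) by hand: sum_s c_s k(x_s, x) + b, where
# c_s = alpha_s * y_s is stored in dual_coef_ and b in intercept_.
def decision(x):
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_test = np.array([1.5, 1.5])
manual = decision(x_test)
library = clf.decision_function(x_test.reshape(1, -1))[0]
assert np.isclose(manual, library)
```

The agreement between the manual sum and the library value shows that the classifier really is the aggregated kernel response over the support vectors, as discussed next.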

With this, (2) says that a decision about the correct class for pixel x is made on the basis of the aggregated value of the responses of the kernel function for each of the support vectors [7]. Effectively, it is a weighted near-neighbor classifier, where the distance metric is exponentiated. What do its surfaces look like? They are the result of the difference between two weighted sums of the form in (3), one corresponding to positive values of c_s and the other to negative values of c_s. Each sum involves radial basis functions, similar to Gaussian distributions with diagonal covariance matrices of equal variances—called scalar matrices. For two support vectors, one from each class and each using the same value of γ, the surfaces will be hyperplanes, but for several support vectors in each class the surfaces can be curved, with flexibility determined by the number of vectors and the radius parameter γ. It is this flexibility that gives the kernel-based SVM its strength, in that quite irregular decision surfaces in the data domain can be accommodated. More flexibility can be obtained by choosing different radius parameters in the radial basis functions for some strategic support vectors [7].

If we choose a polynomial kernel in (3), then the boundaries will be polynomial in data space. Huang et al. [2] and Burges [8] show examples of decision boundaries for polynomial kernels; Huang et al. also show boundaries for radial basis functions.

The binary SVM classifier has to be embedded in a multiclass strategy, the most common of which now tends to be the one-against-one decision tree [6]. Although straightforward in principle, that adds to the complexity of training. For each binary decision in the tree, the support vectors and optimal values for the kernel parameters and the regularization parameter C in (1) have to be found.
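The training burden can be illustrated as follows (our sketch; scikit-learn's one-against-one voting scheme is used here as a stand-in for the decision-tree variant, and the synthetic data are assumptions): for M classes, M(M−1)/2 binary machines must each be trained and parameterized.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Four hypothetical information classes in a 3-band space.
X = np.vstack([rng.normal(m, 0.5, (20, 3)) for m in (0.0, 2.0, 4.0, 6.0)])
y = np.repeat(np.arange(4), 20)

# decision_function_shape="ovo" exposes the underlying binary machines:
# one per class pair, i.e. M(M-1)/2 of them for M classes.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0,
          decision_function_shape="ovo").fit(X, y)
scores = clf.decision_function(X[:1])
print(scores.shape)  # (1, 6): 4 classes give 4*3/2 = 6 binary decisions
```

Each of those six machines carries its own support vectors, kernel parameters, and regularization parameter, which is why multiclass SVM training can be costly.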

B. MLC

The separating surface between two multidimensional normal distributions is, in general, quadratic, a feature that gives the MLC its strength compared with linear methods. Nevertheless, if the actual separation between two information classes is not roughly quadratic, the MLC may not perform well. For that reason, in any practical classification exercise using maximum likelihood, more than one Gaussian distribution (called a data class, subclass, or mode) should be used to model (describe) the data spread of a particular information class in the spectral domain. That is the basis for adopting several spectral classes per information class, well known for many years [1], [9], [10]. Nonetheless, many investigators, particularly when comparing classifiers, fail to take into account the need to resolve information classes represented by the training data into spectral modes, leading to less than optimal performance with this approach. That is the single most common shortcoming in many analyses that purport to present a fair comparison of classifier algorithms. Failure to identify an acceptable modal structure for the maximum likelihood rule would be like not optimizing the set of support vectors and the regularization and kernel parameters for the SVM to ensure that the decision boundaries best match the training data.

This leads to an important related comment. The use of the multivariate Gaussian distribution in the MLC has nothing to do with the natural distribution of measurements over information classes in the data domain. Rarely are they sufficiently Gaussian to contemplate that. Instead, the normal distribution is used because it is easily understood and its multivariate properties are well known; it has meaningful parameters in its mean and covariance that can be related directly to the spectral reflectance characteristics of ground cover types, and it is robust to errors. It also naturally interfaces, via posterior probabilities, with postclassification processes such as Markov random fields, allowing a classification consistent with spatial context to be developed [11]. It is also thought to be more tolerant to imbalances in training data sets than geometric algorithms such as the SVM [12].

When using the MLC in thematic mapping applications, the key requirement is to resolve the largely continuous data space for a given information class into a set of normal distributions that well characterize the data spread. If that is not done, good results cannot be expected. What do the decision boundaries look like when the information classes are represented in this manner? We assume that the information classes have been resolved into sets of single data classes. That can be done, with differing degrees of difficulty, by using a Gaussian mixture model [9], [10], [13], by adopting the Fleming clustering method [14], or, in some simple cases, empirically by inspecting the data domain using mechanisms such as scatter diagrams. The distribution function for each information class ω_i can then be expressed as

    p(x|ω_i) = \sum_{k=1}^{K_i} α_{ki} p(x|d_{ki})    (5)

in which d_{ki}, k = 1 . . . K_i, is the kth data class in information class ω_i, assumed to have a Gaussian distribution with parameters {m_{ki}, C_{ki}}, and α_{ki} is its proportion in the set of modes. The distribution function in (5) can be weighted according to the prior probabilities of class membership so that class allocation can be based on a comparison of posterior probabilities in the well-known MAP rule [3]. The priors can be used to bias the classification by perceived class abundances or to inject analyst knowledge of the classes from other sources of information, such as topography. It serves our purpose here to assume equal priors so that classification is carried out according to

    x ∈ ω_i if p(x|ω_i) > p(x|ω_j) for all j ≠ i.    (6)
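The mixture-per-class construction of (5) and the equal-priors rule of (6) can be sketched as follows (a minimal illustration using scikit-learn's GaussianMixture on synthetic 2-D data; the class and mode locations are our own assumptions, not from the paper).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two information classes; class 0 is bimodal in the "spectral" space,
# so it is modeled with K = 2 Gaussian data classes as in (5).
class0 = np.vstack([rng.normal(0.0, 0.4, (60, 2)),
                    rng.normal(3.0, 0.4, (60, 2))])
class1 = rng.normal((1.5, 4.0), 0.4, (120, 2))

# One mixture model per information class; the mixing weights alpha_ki
# and component parameters {m_ki, C_ki} are estimated by EM.
gmms = [GaussianMixture(n_components=2, random_state=0).fit(class0),
        GaussianMixture(n_components=1, random_state=0).fit(class1)]

def classify(x):
    # Equal priors, as in (6): choose the class maximizing p(x | omega_i).
    log_likes = [g.score_samples(x.reshape(1, -1))[0] for g in gmms]
    return int(np.argmax(log_likes))

assert classify(np.array([0.0, 0.0])) == 0   # near class 0's first mode
assert classify(np.array([3.0, 3.0])) == 0   # near class 0's second mode
assert classify(np.array([1.5, 4.0])) == 1
```

Note that a unimodal Gaussian fitted to class 0 would center itself between the two clusters, which is exactly the failure mode the mode-resolution step avoids.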

This is a multiclass discriminant function, but for ready comparison with the SVM we will assume for the moment that we are just dealing with a two-class case. Each data class is modeled as a multivariate normal distribution

    p(x|d_{ki}) = (2π)^{-N/2} |C_{ki}|^{-1/2} \exp\{-\frac{1}{2}(x - m_{ki})^T C_{ki}^{-1} (x - m_{ki})\}    (7)

where N is the dimensionality of the data space. The decision boundary between information classes 1 and 2 is then given by

    \sum_{k=1}^{K_1} α_{k1} p(x|d_{k1}) - \sum_{k=1}^{K_2} α_{k2} p(x|d_{k2}) = 0.    (8)

Fig. 1. (a) Decision boundary formed from five data classes over two information classes; the projection on the base plane shows how the decision boundary appears in the 2-D data space. (b) Case where each information class is unimodal.

These are differences in sums of Gaussians and are similar in some respects to the decision surfaces generated by the support vector classifier, with some important caveats discussed next. They will be more complex than the simple quadratic surface that separates two Gaussians. This is illustrated in the contrived 2-D example of Fig. 1(a), in which two information classes are represented, respectively, by three and two Gaussian modes with the parameters shown in Table I. As seen, the decision boundary is quite flexible and certainly of order greater than a simple quadratic. The unimodal case is shown in Fig. 1(b) for comparison. The boundary in Fig. 1(a) should be compared with that generated by six support vectors shown in Fig. 1 of Kingsbury et al. [7] for the special case of equal kernel radii.
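Equation (7) can be checked term by term against a reference implementation (an illustrative sketch; the parameter values are arbitrary and scipy is assumed to be available).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Gaussian data-class parameters {m, C} (illustrative values only).
m = np.array([1.0, 2.0])
C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
x = np.array([0.5, 1.0])
N = len(x)

# Equation (7) evaluated directly.
diff = x - m
p_manual = ((2 * np.pi) ** (-N / 2) * np.linalg.det(C) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff))

assert np.isclose(p_manual, multivariate_normal(mean=m, cov=C).pdf(x))
```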


TABLE I
CLASS PARAMETERS FOR THE EXAMPLE OF FIG. 1

C. Some Comments on the Decision Surfaces

In most remote sensing applications, many support vectors are generated with the SVM. The decision boundary is thus potentially quite detailed, although overfitting may be likely in some cases if a highly detailed surface is generated. More often, the typically large number of support vectors would define the optimal placement of a smoother decision surface; this can be seen in [2, Fig. 5].

While the decision boundary for the SVM is determined by the symmetric structure of the radial basis function kernel, for the MLC, the surface has the flexibility available in the covariances of the data classes. There are usually many fewer data classes than support vectors, so that the maximum likelihood boundaries, while more flexible than simple quadratics, will be smoother than those that can be achieved by the SVM approach, particularly if a range of kernel radii is used [7].

As a result of these observations, it is clear that individual support vectors can exert strong local influence in setting the position of the decision boundary, while the training vectors in the maximum likelihood approach enter in an average sense when generating the mean vectors and covariance matrices. Atypical training patterns are less likely to be influential for the MLC. In the case of the SVM, slack variables are used to permit a degree of misclassification and thereby minimize the influence of atypical patterns. However, because their contribution to the error increases linearly with distance from the separating hyperplane, outliers can still be a problem [13].
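The linear growth of the slack penalty noted above can be seen directly (a toy sketch; the helper name `slack` is ours): a pixel with label y at decision value f(x) contributes ξ = max(0, 1 − y f(x)) to the error term in (1).

```python
# Slack variable for a training pixel with label y and decision value f(x):
# xi = max(0, 1 - y * f(x)). It grows linearly as a misclassified pixel
# moves further onto the wrong side of the hyperplane.
def slack(y, fx):
    return max(0.0, 1.0 - y * fx)

# A class +1 outlier drifting deeper into class -1 territory:
penalties = [slack(+1, fx) for fx in (0.5, -1.0, -3.0, -9.0)]
print(penalties)  # [0.5, 2.0, 4.0, 10.0]
```

Because the penalty is unbounded, a single extreme outlier can still pull the optimized boundary noticeably, which is the point made in the text.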

Finally, a comment on the decision surfaces implemented by the MLP is relevant. Once trained, each of the processing elements in the hidden (or processing) layer of a three-layer machine effectively implements a linear binary discrimination. The output layer combines those binary results using logical decisions to generate piecewise linear surfaces between classes [15]. It is this piecewise linearity that gives the MLP its strength. Networks with more than a single hidden layer will clearly implement more flexible surfaces. For a tutorial on neural networks, see Jain et al. [16].
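A minimal sketch of this piecewise-linear behavior (our illustration with hand-picked weights and hard thresholds, not a trained network): two hidden units each test a half-plane, and the output unit combines them logically.

```python
import numpy as np

# Two hidden units, each a linear discriminant (hard threshold for clarity);
# the output unit ANDs them, carving out a wedge-shaped, piecewise-linear
# decision region. Weights here are hand-picked for illustration.
W_hidden = np.array([[1.0, 0.0],    # unit 1 tests x1 > 1
                     [0.0, 1.0]])   # unit 2 tests x2 > 1
b_hidden = np.array([-1.0, -1.0])

def classify(x):
    h = (W_hidden @ x + b_hidden > 0).astype(float)  # two half-plane tests
    return int(h.sum() >= 2)                         # logical AND at output

assert classify(np.array([2.0, 2.0])) == 1   # inside the wedge
assert classify(np.array([2.0, 0.0])) == 0   # fails the x2 > 1 test
assert classify(np.array([0.0, 2.0])) == 0   # fails the x1 > 1 test
```

The resulting class boundary is the union of two line segments, i.e. piecewise linear; adding hidden units or layers adds more linear pieces.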

IV. COMPARATIVE EXPERIMENTAL EVIDENCE

Before proceeding, there is value in reviewing some salient classification results as a baseline for further comments on classifier comparison and the importance of mode resolution with the MLC as an essential step in any thematic mapping methodology.

The most direct evidence of the performance improvement possible when information classes are represented by sets of Gaussian modes has been given in two studies involving Gaussian mixture models [9], [10]. Dundar and Landgrebe [10] note an improvement in overall performance when labeling six classes of the Indian Pines data (see https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html) from 93.9% using a simple unimodal classifier to 96.9% when using a mixture model. Their study was based on the best ten of the available bands, chosen using the NWFE feature extraction process [17]. Most information classes required two modes. These studies draw attention to two potential difficulties. First, the algorithms used to find mixture models are time-consuming to run. Second, when an information class is resolved into modes, the number of training samples available is shared among the individual Gaussian distributions, giving fewer per class for parameter determination. This increases the chance of generating poor estimates, potentially leading to unacceptable generalization, although regularized covariance estimators [17] can assist in generating more robust Gaussian models. While the first limitation is peculiar to mixture modeling, the second will apply to all subclass determination methods.

Because of the paucity of reliably labeled hyperspectral images in the public domain, the Indian Pines data set has been used by many machine learning investigators in remote sensing and is thus a useful benchmark, featuring in further studies cited below. It consists of a 145 × 145 pixel, 220-channel AVIRIS image over Indian Pines in northwestern Indiana; reference data (ground truth) are available for 16 classes, largely agricultural. Landgrebe [17] notes it as a particularly challenging data set since the primary crops of corn and soybean were very early in their growth cycle, with only about 5% canopy cover, and in both cases three different tillage practices were used. He makes the point that those two crop classes could be viewed as having (at least) three spectral classes each. If that were the case, then the 16 classes available could be viewed as being composed of 12 information classes. We have not used that combination in this study, choosing in our following experiments to retain the soybean and corn subclasses as separate and resolving modes within them.

The first application of the SVM to remote sensing problems was demonstrated by Gualtieri et al. [18], [19]. Although little methodological detail is given, an overall accuracy of 95.6% is cited for labeling the full 16 classes for a subscene of the Indian Pines data set using all 220 bands [19]. This was compared with the application of the object classifier ECHO [20] to the same data set by Tajudin and Landgrebe [21], which gave 93.5% on the same subscene. Without knowing whether the same training samples were used, it is hard to know if that is a fair comparison.

Landgrebe [17] also labels the full scene using maximum likelihood classification, treating the 16 classes as distinct and again using all 220 channels. He chose only one data class for each of the 16 classes; note, however, his earlier remark that sets of those classes can be considered as subclasses resulting from tillage practices. He used a robust covariance estimator to generate viable class signatures for the smaller training classes, achieving an overall accuracy of 99.8% on the training data. However, the results were based on large numbers of training pixels (see his Table 7-5) with as many as 2468 for one class, although as small as 54 for another. The average training class size was 648 pixels. In another example on the same data set, he used an average training class size of 93 pixels and achieved a testing set accuracy of 72.4% when feature selection was used to reduce the data space to 80 dimensions (which was optimal in his case).

TABLE II
INFORMATION CLASSES AND TWO SETS OF SAMPLE PIXELS USED IN EACH CASE FOR TRAINING AND TESTING; FIGURES IN BRACKETS INDICATE NUMBERS OF SEPARATE TRAINING FIELDS

In an early review of SVMs in remote sensing, Huang et al. [2] presented comparative analyses of several classifiers when labeling a spatially degraded TM data set from Maryland based on three and seven features. Although it is difficult to summarize the results succinctly, in general, the SVM showed about 1.5%–4% better performance at the 70%–75% level than the MLC on the same data, but no detail is given on how the MLC was developed; in particular, it appears that no attempt was made to resolve spectral subclasses. This study also looked at neural networks and decision tree classifiers and concluded that their performances lay between those of the SVM and the MLC. An interesting observation made by the authors is that training the MLC took minutes whereas training the SVM took days!

Another comparison of decision tree classification with the MLC has appeared recently [22]. Using the three bands of the IRS/LISS III instrument for an image recorded over the Mayurakshi Reservoir in Jharkhand, India, an overall accuracy of 98% is quoted for the performance of the decision tree and 95% for the MLC. Again, it appears that no attempt was made to resolve modes for the MLC, even though effort was made to optimize the tree design.

Using the three components of a color IR image of Jacksonville, Florida, as their data set, Qiu and Jensen [23] compared the performances of an MLP neural network using the radial basis function activation function, the MLC, and a new fuzzy neural network of their own design. Although no attempt is reported on resolving Gaussian modes, they show that the relative overall performances were 90.8%, 83.8%, and 81.3% for their algorithm, the standard MLP, and the MLC, respectively.

A general qualitative summary of all these studies is that MLC is seen to perform about 2%–4% poorer than other methods (apart from the fuzzy neural networks just cited) when multimodality is not resolved.

Possibly, the most detailed studies of the application of SVMs in remote sensing have been those of Melgani and Bruzzone [24] and Camps-Valls and Bruzzone [25], both based on the Indian Pines data set. Together they provide a comprehensive assessment of different kernel methods and multiclass strategies and give a comparison of SVM performance against other classification techniques, although not MLC. Their general conclusions are that SVMs with polynomial kernels perform best, closely followed by those with radial basis function kernels. In turn, SVMs are observed to perform better than neural networks with radial basis function activation functions, the k nearest neighbor rule, and linear support vector methods (i.e., without the kernel transform). We use their SVM and MLP results in the following discussion for comparison.

V. MLC ACCURACY IMPROVEMENT WITH DATA CLASS MODELING

To underscore the accuracy gain that can be obtained by appropriately resolving data classes with the maximum likelihood rule, we chose the same exercise as Melgani and Bruzzone [24]. They labeled 9 of the 16 classes available in the Indian Pines data set. The others were discarded because they do not contain sufficient numbers of samples, while the nine retained include the most difficult to separate. Table II shows those classes along with two sets of labeled samples that we selected for each. Notwithstanding the observation by Landgrebe that the different corn and soybean classes could be treated as subclasses, we regarded them as separate information classes for this exercise, consistent with [24] and [25].

Apart from one instance, the samples in Table II consist of sets of fields; the brackets show the number of separate fields in each case. We adopted the following strategy to generate our results. Sample set 1 was used to train classifiers, and sample set 2 was used to test their performances. The process was then repeated in reverse: training was done on sample set 2, and results were assessed on sample set 1. The outcomes from the two trials were then averaged to give figures hopefully free of bias in the choice of training and testing data. Also, the numbers of samples in each column and for each class were chosen to be of similar order so that bias in the values of overall accuracy


2720 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 52, NO. 5, MAY 2014

TABLE III
AVIRIS BANDS CHOSEN FOR THIS EXERCISE

introduced by significant training and testing set imbalances could be minimized. All processing was carried out using the MultiSpec software package.
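The train-then-swap-then-average strategy described above can be sketched as follows. This is an illustrative Python sketch, not the MultiSpec processing actually used in the paper, and the nearest-class-mean "classifier" here is only a stand-in for whichever classifier is being evaluated:

```python
import numpy as np

def swap_evaluate(train_fn, accuracy_fn, set1, set2):
    """Train on one labeled sample set, test on the other, then swap
    the roles and average the two overall accuracies, so the figure is
    hopefully free of bias in the choice of training and testing data."""
    acc_12 = accuracy_fn(train_fn(*set1), *set2)  # train on set 1, test on set 2
    acc_21 = accuracy_fn(train_fn(*set2), *set1)  # train on set 2, test on set 1
    return 0.5 * (acc_12 + acc_21)

# Toy stand-in classifier: nearest class mean.
def train(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(means, X, y):
    labels = np.array(sorted(means))
    centres = np.stack([means[c] for c in labels])
    pred = labels[np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)]
    return float((pred == y).mean())
```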

Although there are 220 channels in the Indian Pines data set, we chose to use just the nine shown in Table III. They were selected from an assessment of the spectral reflectance characteristics of the cover types in the scene, a knowledge of key spectral regions that are diagnostic of those cover types, and an examination of a number of two-channel scatter plots in order to judge that the channels chosen offered reasonable separation. Although not as objective as running a test based on pairwise separability measures, the manual approach avoids the very large computational burden of applying metrics such as divergence for all band combinations. Clearly, there are large correlations among adjacent channels for all cover types in this scene, and it is only necessary to select channels thought to be discriminatory. We selected two channels in the farthest mid-IR band to weight up the part of the spectrum thought to favor largely bare cover types, such as appear in this image. As a check on the effectiveness of this selection, the best (top-ranked) nine transformed features were found using the DBFE [17] distribution-free feature selection process; those nine features were then classified using the data sets in Table II. Because of limitations in applying DBFE in MultiSpec with the numbers of training samples per class shown in Table II, not all 220 channels could be used as inputs. Instead, every second channel up to 201 was selected from which the best transformed set was computed; given the very high correlations between adjacent channels and the low dynamic range of the data beyond channel 200, this was seen to be a supportable process. The results are shown in Table V in which it will be seen that the transformed features perform no better than our manually selected set of nine.
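To make the computational burden concrete, the divergence between two Gaussian classes can be computed as below (a sketch using the standard Gaussian divergence formula, as in Swain and Davis [1]); the combination count shows why evaluating it for every candidate band subset is prohibitive:

```python
import numpy as np
from math import comb

def divergence(m1, c1, m2, c2):
    """Divergence between two Gaussian classes with means m and
    covariances c:
    d = 0.5 tr[(c1 - c2)(inv(c2) - inv(c1))]
      + 0.5 tr[(inv(c1) + inv(c2)) (m1 - m2)(m1 - m2)^T]."""
    c1i, c2i = np.linalg.inv(c1), np.linalg.inv(c2)
    dm = (m1 - m2).reshape(-1, 1)
    return (0.5 * np.trace((c1 - c2) @ (c2i - c1i))
            + 0.5 * np.trace((c1i + c2i) @ (dm @ dm.T)))

# The burden referred to above: choosing 9 of 220 bands gives
# comb(220, 9) candidate subsets, each needing all class-pair divergences.
n_subsets = comb(220, 9)
```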

Three maximum likelihood classifications were performed with MultiSpec using the nine channels identified in Table III. Those three trials are summarized in Table IV. The first was based on just one Gaussian mode per information class (a simple maximum likelihood classification). In this case, all of the training pixels for a given information class (nine classes in all) were aggregated and used to train the classifier. Likewise, the relevant testing pixels were aggregated.

In the second case, the three separate training fields for the soybean-min till information class were treated as separate subclasses (data classes). Inspection of the image shows that there is considerable variability among those fields such that treating all the pixels as though they came from a single Gaussian distribution would most likely prejudice the results. Thus, in this second exercise, there were 11 different data classes, three in the case of the soybean-min till class and one for each of the other information classes. Once trained, the three soybean-min till results were combined to give a single measure for the classification of soybean-min till.

In the third exercise, in addition to treating the soybean-min till class as three data classes, we also resolved the corn-min till class into three data classes. To obtain three subclasses from the two fields shown for class 9 in Table II, we generated two subclasses for the training field that exists midway down the left-hand side of the image using Isodata [3] clustering.
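A minimal sketch of this subclass-resolution step follows, using plain Lloyd's k-means as a simplified stand-in for the Isodata clustering actually used (Isodata adds cluster split/merge heuristics on top of this loop); each resulting subclass then supplies its own mean and covariance for maximum likelihood training:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means, a simplified stand-in for Isodata."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest centre, then recompute centres.
        assign = np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)
        centres = np.stack([X[assign == j].mean(axis=0) for j in range(k)])
    return assign, centres

def resolve_subclasses(X_field, k=2):
    """Split one training field into k spectral subclasses and return
    per-subclass (mean, covariance) pairs for maximum likelihood training."""
    assign, _ = kmeans(X_field, k)
    return [(X_field[assign == j].mean(axis=0),
             np.cov(X_field[assign == j], rowvar=False))
            for j in range(k)]
```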

Table V shows the classification performance by class (producer's accuracy) and overall. Several sets of results are included. First, the best set obtained by Melgani and Bruzzone with an SVM and an optimized radial basis function MLP, based on 30 channels, is shown [24]. Second, our results for the three trials in Table IV based on nine channels are given. The accuracy gain by class is shown when modes are resolved.

Rather than generating our own SVM and MLP trials, we chose the results obtained by Melgani and Bruzzone as independent benchmarks that were carefully optimized by those investigators. However, a word of caution is in order when comparing our MLC performance with their figures for the SVM and MLP. Strictly, a fair comparison requires the training and testing samples used in each case to be the same. While that is true for our three MLC results, the training and testing fields used by Melgani and Bruzzone for the SVM and MLP would undoubtedly have been different from those chosen here; there were also significant differences in the sizes of their training and testing samples by class. Those contrasts between Melgani and Bruzzone's samples and ours would lead to differences in training and generalization, albeit most likely small differences. Also, at the 95% confidence level, with the numbers of training samples that we used, there is an uncertainty of about ±1% in our figures based on map accuracy considerations [26].
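The order of that ±1% figure can be checked with the usual normal approximation to the binomial used in map-accuracy assessment [26]; the sample size used below is illustrative, not the exact count used in the paper:

```python
import math

def accuracy_half_width(p, n, z=1.96):
    """Approximate 95% confidence half-width for an overall accuracy p
    estimated from n testing samples (normal approximation to the
    binomial, as in standard map-accuracy assessment)."""
    return z * math.sqrt(p * (1.0 - p) / n)
```

For example, at an accuracy of about 90%, a few thousand testing pixels give a half-width close to 0.01, i.e., ±1%.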

The beneficial result of resolving the soybean-min till class into three data classes is evident in Table V, with a 2.2% improvement in the accuracy of that class and a 1.4% improvement overall. The performance on the other two soybean classes has improved as well, but there has also been a large gain in accuracy for the corn-no till class (class 8), presumably because of significant class overlap caused by a large covariance when the three soybean classes were aggregated. When the corn-min till class is resolved into three subclasses, the accuracy gain on that class is 9.2%, improving the overall performance by 2.1%. While that does not quite lift the overall accuracy to that reported on the data set for the SVM on 30 channels, the performance by class is now more uniform and improved for those classes that were resolved into sets of Gaussians; also recall that these results have been obtained using just nine of the total of 220 channels of data. Although Melgani and Bruzzone did not give results for nine channels, they can be assessed from the trends evident in [24, Fig. 4]; the results for nine channels would appear to be substantially poorer than that for


RICHARDS AND KINGSBURY: IS THERE A PREFERRED CLASSIFIER FOR OPERATIONAL THEMATIC MAPPING? 2721

TABLE IV
THREE MAXIMUM LIKELIHOOD CLASSIFICATION TRIALS WITH DIFFERENT NUMBERS OF DATA CLASSES FOR SOYBEAN-MIN TILL AND CORN-MIN TILL

TABLE V
COMPARATIVE RESULTS (IN PERCENT) FOR A SIMPLE AND MULTIMODAL MAXIMUM LIKELIHOOD CLASSIFICATION: IN THE FIRST MULTIMODAL CASE, CLASS 6 (SOYBEAN-MIN TILL) WAS RESOLVED INTO THREE DATA CLASSES BEFORE TRAINING, WHILE IN THE SECOND, BOTH CLASS 6 AND CLASS 9 (CORN-MIN TILL) WERE EACH RESOLVED INTO THREE DATA CLASSES BEFORE TRAINING. THE SVM AND MLP RESULTS ARE THE BEST SET FROM MELGANI AND BRUZZONE, BASED ON 30 FEATURES [24]; THE DBFE RESULTS ARE FROM A SIMPLE MAXIMUM LIKELIHOOD CLASSIFICATION WITH THE BEST NINE TRANSFORMED FEATURES AND NO MODE RESOLUTION

Fig. 2. Channel 46 versus channel 31 bispectral plots. (a) Unimodal case. (b) When the soybean-min till and corn-min till information classes are each resolved into three spectral classes.

their optimal 30 channels and most likely below those reported here for the MLC.

The reason for the MLC performance improvement can be appreciated by examining simple bispectral plots of the class means before and after spectral class resolution. Fig. 2 shows plots in the channel 46 versus channel 31 subspace. The dispersions of the spectral classes for both the soybean-min till and corn-min till data classes about each other and the other classes demonstrate the confusion likely when allocating unseen (testing set) pixels to single aggregated classes of those types.

Finally, it is important to recognize that the data classes used in this exercise and shown in Fig. 2 are not unique; the methods that we used for their generation illustrate that point. Instead,



TABLE VI
SIMPLE COMPLEXITY COMPARISON. N = NO. OF FEATURES, K = NO. OF CLASSES, T = NO. OF TRAINING SAMPLES, AND E = NO. OF MLP EPOCHS

they represent a convenient segmentation of the data space to describe the data distribution by class in such a manner that higher classification performance can be achieved, compared with the case when the corresponding information classes have not been decomposed.

VI. OTHER CONSIDERATIONS AND CLASSIFIER COMPLEXITY

We now look at related considerations such as computational complexity since that can influence the choice of one particular classifier methodology over others. The dependence of training time on the numbers of samples, channels, and features is examined, as is the dependence of classification time on the number of channels and features. We also comment on the ease or difficulty of the training process. Our observations are summarized in Tables VI and VII.

We approach the complexity problem by considering the numbers of multiplications required. In the case of the SVM and MLP, they are embedded in RBF or sigmoidal functions; while those operations add to the overall costs, they do not affect the dependences on the variables noted previously.

Consider the maximum likelihood approach first. It is well known that the time required to classify unseen pixels by the MLC is quadratically dependent on the number of features, or channels, and depends directly, and thus linearly, on the number of data classes since they set the number of discriminant functions to be evaluated and compared.
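The quadratic feature dependence can be seen directly in the Gaussian discriminant functions; the sketch below assumes equal priors and drops constant terms, and is an illustration rather than the MultiSpec implementation:

```python
import numpy as np

def mlc_discriminants(x, means, inv_covs, log_dets):
    """Maximum likelihood class allocation (equal priors assumed):
    g_i(x) = -ln|C_i| - (x - m_i)^T C_i^{-1} (x - m_i).
    The quadratic form costs O(N^2) multiplications per data class for
    N features, so classification time grows quadratically with N and
    linearly with the number of data classes."""
    g = []
    for m, ci, ld in zip(means, inv_covs, log_dets):
        d = x - m
        g.append(-ld - d @ ci @ d)
    return int(np.argmax(g))
```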

Training complexity for the MLC depends on the process used to generate the data classes. If data classes are not resolved, the training process is straightforward and fast, requiring only the computation of class-conditional mean and covariance estimates from the training data. In this case, the training time is proportional to the number of training samples and classes and is quadratically dependent on the number of channels. The main requirement is that there be sufficient representative samples available to ensure that the covariance is well conditioned. Although we look for independent unbiased samples, often the use of fields of training pixels means that independence is not always assured, even though the results are accepted. Regularized covariance estimators can be used partly to obviate the problem of insufficient samples [17], but then, training time is extended by the need to run trials from which parameters in the estimators are optimized. The sparseness of the inverse covariance matrix can be exploited to improve MLC performance on hyperspectral data sets [27], [28], as can the approximately block diagonal form of the correlation matrix [29]. Another transform approach seeks sums of contiguous hyperspectral channels as a reduced feature set [30].
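One simple form of regularized covariance estimator of the kind mentioned above shrinks the sample covariance toward its diagonal; this particular blend is an illustrative choice (the estimators of [17] mix sample and common covariances in a similar spirit), and the mixing parameter `lam` is exactly what must be optimized by trials, extending training time:

```python
import numpy as np

def regularized_cov(X, lam):
    """Shrinkage covariance estimate for training samples X (rows):
    C(lam) = (1 - lam) * S + lam * diag(S),
    where S is the sample covariance. Intermediate lam trades bias for a
    better-conditioned estimate when samples are scarce."""
    S = np.cov(X, rowvar=False)
    return (1.0 - lam) * S + lam * np.diag(np.diag(S))
```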

If Gaussian mixture modeling is used to find data classes (by trialing numbers of data classes larger than the number of information classes, as in Dundar and Landgrebe [10]), then the MLC training process can be quite complex, usually involving the process of expectation maximization [13]. This will be quadratically channel dependent and linearly class dependent but with a large overhead caused by the need to iterate over the expectation and maximization steps.

If clustering is used to resolve Gaussian modes, the process is simpler and quicker than mixture modeling. Based on the Isodata algorithm, the clustering step has a time demand that depends directly on the number of channels, the number of classes, and the number of pixels used for clustering. It also depends directly on the number of iterations required to achieve acceptable convergence. Because clustering is usually performed on a small, but representative, subset of the data, the training overhead added is significant but generally not excessive. An added benefit of mode resolution via clustering is that boundary pixels and mixed pixels often get picked up as separate clusters and thus data classes. After classification, they can be kept separate or combined with the most appropriate information class.

If clustering is used to generate the subclasses, the decision boundaries between information classes are slightly different from those shown in Fig. 1(a), which were generated from sums of Gaussians. In the clustering approach, each cluster, or subclass, is treated as a separate class; relevant subclasses are then labeled the same after classification if they belong to the same information class. That will give decision boundaries that are piecewise quadratic. Fig. 3 shows the case for the five data classes of Fig. 1(a).

Classification cost for the SVM using a radial basis function kernel is determined by the number of multiplications inherent in (4), which is linearly dependent on the number of channels and the number of support vectors, as seen in the sum operation in (3). The dependence on the number of classes is set by the multiclass strategy used. The one-against-all strategy is linearly dependent on the number of classes, while the one-against-one strategy, which is usually preferred because it employs balanced sets in training, is quadratically dependent.
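Since (3) and (4) are not reproduced in this excerpt, the sketch below uses the standard RBF-kernel SVM decision function; it makes visible both the linear dependence on support vectors and channels, and the class dependence of the two multiclass strategies:

```python
import numpy as np
from math import comb

def rbf_decision(x, support_vectors, coeffs, b, gamma):
    """Binary RBF-kernel SVM decision value
    f(x) = sum_i a_i y_i exp(-gamma ||x - s_i||^2) + b,
    with coeffs holding the products a_i y_i. Cost per pixel is linear
    in the number of support vectors, and linear in the number of
    channels inside each squared distance."""
    d2 = ((support_vectors - x) ** 2).sum(axis=1)
    return float(coeffs @ np.exp(-gamma * d2) + b)

def n_machines(K):
    """Number of binary machines needed for K classes: linear for
    one-against-all, quadratic (K choose 2) for one-against-one."""
    return {"one_vs_all": K, "one_vs_one": comb(K, 2)}
```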

Training the SVM will show the same dependence on the number of classes as it does in classification, set by the multiclass strategy adopted. Its dependence on the number of channels is less clear but is generally regarded as linear [8]. Dependence on the number of training samples is also linear if techniques such as chunking are used [8], but it occurs as a product with the number of support vectors. Since the support vectors are a subset of the training data, we can assume a higher order dependence on the number of training samples, up to quadratic.

Training cost for the SVM can be minimized by keeping the number of training pixels small. Interestingly, this requirement



TABLE VII
OTHER COMPARATORS FOR THE THREE BASIC CLASSIFIER TYPES

Fig. 3. MLC piecewise quadratic decision boundaries when data classes are resolved by clustering.

is opposite to that for maximum likelihood methods. A large number is needed for the maximum likelihood rule to guarantee good estimates of the class statistics. There are drawbacks with having too few training pixels for the SVM if some are in error or are outliers since the SVM is generally regarded as not optimized to deal with noisy training data [31].

Complexity is added to SVM training by the need to find optimal values for the kernel and regularization parameters, often through a grid search process. Analyst experience can allow the search ranges to be limited, sometimes minimizing the cost of this step.

To assess the training and classification cost of the MLP, we examine the three-layer MLP; we assume that the number of nodes in the first layer is chosen to be the same as the dimensionality of the feature space (the number of channels), that the first hidden (processing) layer has the same number of nodes as the first, and that the number of nodes in the output layer is the same as the number of classes. That is the most common and simplest configuration found in practice. At each node in the hidden and output layers, a Euclidean distance calculation is performed using, as inputs, the outputs of the previous layers.

With this structure, classification time, and thus cost, depends directly on the number of classes. Dependence on the number of features is linear at each processing element; since the number of processing elements depends directly on dimensionality, the overall complexity is proportional to the square of the number of features.

Since MLP training involves running the network for as many epochs as necessary to achieve an acceptable degree of convergence on the training data, training cost essentially follows the same trend as classification but is also proportional to the product of the number of training samples and epochs. Since the number of epochs is typically large in remote sensing applications, training time can be very high.
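Under the layer-size assumptions above (N input and hidden nodes, K output nodes), the multiplication counts can be sketched as follows; the factor of 2 for the backward pass is a crude allowance of our own, not a figure from the paper:

```python
def mlp_costs(N, K, T, E):
    """Rough multiplication counts for the three-layer MLP described
    above. One forward pass costs about N*N + N*K multiplications
    (quadratic in the number of features N); training repeats this,
    plus a comparable backward pass, over T training samples for each
    of E epochs."""
    forward = N * N + N * K
    training = 2 * forward * T * E  # factor 2: crude backward-pass allowance
    return forward, training
```

For example, even the modest configuration N = K = 9 with 1000 training samples and 500 epochs already implies on the order of 10^8 multiplications for training, which is why training time can be very high.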

From these observations, the greatest challenge for the MLC and MLP approaches is the dependence on the number of features, while for the SVM it is the dependence on the number of classes if the one-against-one multiclass approach is used. These complexity remarks, however, relate to trends and not absolute computational requirements. The latter is difficult to comment on with any sense of precision, but some general observations are possible.

First, the clustering step used to identify data classes prior to maximum likelihood classification is costly if performed on a large data set. However, only a representative sample, with sufficient pixels to generate reliable class signatures, is needed. While adding to the training overhead, it is generally not a difficult or costly step.



Feature reduction is the greater consideration when using the MLC, in order that problems with high dimensionality are not encountered. It is a step that must be factored into thematic mapping using the MLC. If transform-based methods are used, such as DBFE or NWFE [17], that overhead is not high; however, the features generated are then linear combinations of the original measurements, and thus direct association of the new features and the original measurements is lost. Nevertheless, classes in the transformed data space still need to be represented as sums of Gaussians if optimal performance is to be achieved. If the reduced feature set is selected from among the measurements using pairwise separability measures, then a set of original features is maintained, but the computational overhead can be enormous since all class pair separations have to be evaluated for each feature set of interest. Whether it is of comparable order to the parameter searching and optimization steps for the SVM, or the cost of the large numbers of epochs with the MLP, is difficult to say, and will, undoubtedly, be data dependent. However, the feature selection step can be guided by the analyst's knowledge of the broad spectral reflectance character of the cover types in a given exercise, as was the case here. That permits attention to be focused just on spectral regimes known to be significant, thus reducing the cost overhead of feature selection.

The last remark points to a contrast between two schools of thought in image interpretation: one adopts a domain-independent approach, while the other is sensitive to the analyst's knowledge of the remote sensing aspects of the problem being considered, called domain knowledge. In an operational context, the expert analyst would almost always have a good understanding of the problem domain. In the early days of remote sensing, when the number of channels available in the recorded image data was not sufficient to allow a good characterization of spectral reflectance characteristics, image analysts had little alternative other than to treat the pixel labeling problem as a signal or image processing exercise. With the large numbers of channels now available from imaging spectrometers, it is injudicious to follow a blind signal processing approach involving all available channels and thereby not exploit analyst expertise during thematic mapping. Too many recent labeling exercises make that mistake; while the SVM has clear advantages over the MLC in such a situation, because of its linear dependence on the number of features, adoption of a domain-independent viewpoint belies the ability of the MLC to produce good results when used in context and denies the analyst other benefits that flow from the MLC approach. The production of class signatures, which can be compared directly with known spectral reflectance characteristics, is one advantage of the MLC, as is the direct generation of posterior probabilities that form natural interfaces with postclassification processing techniques. SVM outputs can be interpreted probabilistically using regression, but the approximation can be poor [13]; likewise, MLP outputs can be shown to provide approximations to, but not the actual, posterior probabilities [16]. Use of the class conditional likelihoods that are available prior to the maximum selection step in the MLC has also been shown to be beneficial in a semisupervised methodology that helps offset the small training sample size problem [33].

VII. CONCLUDING REMARKS

The need to model the data space in a manner that matches the properties of the algorithm is not new. Implicitly, the SVM and MLP, respectively, do that by the identification of support vectors and the choice of kernel, and through the identification of a suitable network topology. Because of the normality assumption in the MLC, matching the classifier geometric character to the data space does not always happen naturally; that is why data class resolution is needed. As demonstrated here, that is not a difficult or costly step and, once done, ensures that the MLC generalizes as well as the other basic algorithms. Even simple techniques like the parallelepiped classifier can be made to work well, provided the data space is resolved into many data classes of prismatic nature; indeed, that can be seen as the basis behind the success of many tree-structured approaches, such as CART [13]. As another simple example, the minimum distance rule requires the data space to be described by sets of hyperspheres; this is seen in Moreton and Richards [34].

Data class modeling is needed for algorithms such as the MLC because, by its nature, it describes the distribution of data in the measurement space. Decision surfaces then arise, as seen in Fig. 1, as boundaries between distributions. In contrast, the SVM and MLP are based on finding separating surfaces implicitly and do not inherently describe the dispersion of the measurement vectors by class. That is why they do not need modal decomposition to work properly. They learn the separation of classes during training, whereas the MLC learns the data structure. Both are equally relevant means for discriminating among different information classes.

It is how the algorithms are used that determines the efficacy of the thematic mapping process and not their inherent properties. In principle, provided enough training data are available to train the constituent classifiers reliably, resolving the data space into as many modes as desired will improve generalization.

If optimal methodologies are employed, other factors enter into a comparison of preferred methods. While the choice of technique will often be based on experience and personal preference, ease of training and classification speed are significant considerations. Comments on those matters are given in Table VII.

ACKNOWLEDGMENT

The authors would like to thank Emeritus Prof. D. Landgrebe of Purdue University for reading the draft of this paper and for suggesting, in the first place, that it be written.

REFERENCES

[1] P. H. Swain and S. M. Davis, Remote Sensing: The Quantitative Approach. New York, NY, USA: McGraw-Hill, 1978.

[2] C. Huang, L. S. Davis, and J. R. G. Townshend, “An assessment of support vector machines for land cover classification,” Int. J. Remote Sens., vol. 23, no. 4, pp. 725–749, 2002.

[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York, NY, USA: Wiley, 2001.

[4] J. A. Richards, Remote Sensing With Imaging Radar. Berlin, Germany: Springer-Verlag, 2009.

[5] J. A. Richards, “Analysis of remotely sensed data: The formative decades and the future,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 422–432, Mar. 2005.

[6] G. Camps-Valls and L. Bruzzone, Kernel Methods for Remote Sensing Data Analysis. Chichester, U.K.: Wiley, 2009.

[7] N. G. Kingsbury, D. B. H. Tay, and M. Palaniswami, “Multi-scale kernel methods for classification,” in Proc. IEEE Mach. Learn. Signal Process. Workshop, Sep. 28–30, 2005, pp. 43–48.

[8] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining Knowl. Discov., vol. 2, no. 2, pp. 121–167, Jun. 1998.

[9] B.-C. Kuo and D. A. Landgrebe, “A robust classification procedure based on mixture classifiers and nonparametric weighted feature extraction,” IEEE Trans. Geosci. Remote Sens., vol. 40, no. 11, pp. 2486–2494, Nov. 2002.

[10] M. M. Dundar and D. A. Landgrebe, “A model-based supervised classification approach in hyperspectral data analysis,” IEEE Trans. Geosci. Remote Sens., vol. 40, no. 12, pp. 2692–2699, Dec. 2002.

[11] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.

[12] K. Song, “Tackling uncertainties and errors in the satellite monitoring of forest cover change,” Ph.D. dissertation, Univ. Maryland, College Park, MD, USA, 2010.

[13] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer Science+Business Media, 2006.

[14] M. D. Fleming, J. S. Berkebile, and R. M. Hoffer, Computer Aided Analysis of Landsat 1 MSS Data: A Comparison of Three Approaches Including a Modified Clustering Approach, Lab. Appl. Remote Sens., Purdue Univ., West Lafayette, IN, USA, Inf. Note 072475. [Online]. Available: http://www.lars.purdue.edu/home/references/LTR_072475.pdf

[15] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA, USA: Addison-Wesley, 1989.

[16] A. Jain, J. Mao, and K. M. Mohiuddin, “Artificial neural networks: A tutorial,” Computer, vol. 29, no. 3, pp. 31–44, Mar. 1996.

[17] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Hoboken, NJ, USA: Wiley, 2003.

[18] J. A. Gualtieri and R. F. Cromp, “Support vector machines for hyperspectral remote sensing classification,” in Proc. SPIE, 1998, vol. 3584, pp. 221–232.

[19] J. A. Gualtieri and S. Chettri, “Support vector machines for classification of hyperspectral data,” in Proc. IEEE IGARSS, Sydney, Australia, Jul. 24–28, 2000, vol. 2, pp. 813–815.

[20] R. L. Kettig and D. A. Landgrebe, “Classification of multispectral image data by extraction and classification of homogeneous objects,” IEEE Trans. Geosci. Electron., vol. GE-14, no. 1, pp. 19–26, Jan. 1976.

[21] S. Tadjudin and D. A. Landgrebe, “Covariance estimation with limited training samples,” IEEE Trans. Geosci. Remote Sens., vol. 37, no. 4, pp. 2113–2118, Jul. 1999.

[22] M. K. Ghose, R. Pradhan, and S. S. Ghose, “Decision tree classification of remotely sensed satellite data using spectral separability matrix,” Int. J. Adv. Comput. Sci. Appl., vol. 1, no. 5, pp. 93–101, Nov. 2010.

[23] F. Qiu and J. R. Jensen, “Opening the black box of neural networks for remote sensing image classification,” Int. J. Remote Sens., vol. 25, no. 9, pp. 1749–1768, May 2004.

[24] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[25] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, Jun. 2005.

[26] R. G. Congalton and K. Green, Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 2nd ed. Boca Raton, FL, USA: CRC Press, 2009.

[27] A. Berge, A. C. Jensen, and A. H. Schistad Solberg, “Sparse inverse covariance estimates for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1399–1407, May 2007.

[28] R. E. Roger, “Sparse inverse covariance matrices and the efficient maximum likelihood classification of hyperspectral data,” Int. J. Remote Sens., vol. 17, no. 3, pp. 589–613, Feb. 1996.

[29] X. Jia and J. A. Richards, “Efficient maximum likelihood classification for imaging spectrometer data sets,” IEEE Trans. Geosci. Remote Sens., vol. 32, no. 2, pp. 274–281, Mar. 1994.

[30] S. B. Serpico and G. Moser, “Extraction of spectral channels from hyperspectral images for classification purposes,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 2, pp. 484–495, Feb. 2007.

[31] G. Mountrakis, J. Im, and C. Ogole, “Support vector machines in remote sensing: A review,” ISPRS J. Photogramm. Remote Sens., vol. 66, no. 3, pp. 247–259, May 2011.

[32] M. Pal and G. M. Foody, “Feature selection for classification of hyperspectral data by SVM,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, May 2010.

[33] Q. Jackson and D. A. Landgrebe, “An adaptive classifier design for high-dimensional data analysis with a limited training data set,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 12, pp. 2664–2679, Dec. 2001.

[34] G. E. Moreton and J. A. Richards, “Irrigated crop inventory by classification of satellite image data,” Photogramm. Eng. Remote Sens., vol. 50, no. 6, pp. 729–730, Jun. 1984.

John A. Richards (LF’10) received the B.E. (Hons 1) and Ph.D. degrees in electrical engineering from the University of New South Wales, Kensington, NSW, Australia, in 1968 and 1972, respectively.

He was Master of University House with the Australian National University (ANU), Canberra, ACT, Australia, where he was formerly the Deputy Vice-Chancellor and Vice President, and the Dean of the College of Engineering and Computer Science. In the 1980s, he was the foundation Director of the Centre for Remote Sensing, University of New South Wales. His research interests are in image interpretation and imaging radar.

Dr. Richards is a Fellow of the Australian Academy of Technological Sciences and Engineering and a Fellow of the Institution of Engineers Australia. He is the President of the International Society for Digital Earth.

Nick G. Kingsbury (F’13) received the Honours and Ph.D. degrees in electrical engineering from the University of Cambridge, Cambridge, U.K., in 1970 and 1974, respectively.

From 1973 to 1983, he was a Design Engineer and subsequently a Group Leader with Marconi Space and Defence Systems, Portsmouth, U.K., specializing in digital signal processing and coding theory. Since 1983, he has been a Lecturer in communications systems and image processing with the University of Cambridge, where he became a Professor of signal processing in 2007 and where he is the Head of the Signal Processing and Communications Research Group. Since 1983, he has been a Fellow of Trinity College, Cambridge. His current research interests include image analysis and enhancement techniques, object recognition, motion analysis, and registration methods. He has developed the dual-tree complex wavelet transform and is especially interested in the application of wavelet frames to the analysis of images and 3-D data sets.