an assessment of support vector machines for land cover classification

int j remote sensing 2002 vol 23 no 4 725ndash749

An assessment of support vector machines for land cover classi cation

C HUANGdagger

Department of Geography University of Maryland College Park MD 20742USA

L S DAVIS

Institute for Advanced Computing Studies University of Maryland CollegePark MD 20742 USA

and J R G TOWNSHEND

Department of Geography and Institute for Advanced Computing StudiesUniversity of Maryland College Park MD 20742 USA

(Received 27 October 1999 in nal form 27 November 2000)

Abstract The support vector machine (SVM) is a group of theoretically superiormachine learning algorithms It was found competitive with the best availablemachine learning algorithms in classifying high-dimensionaldata sets This papergives an introduction to the theoretical development of the SVM and an experi-mental evaluation of its accuracy stability and training speed in deriving landcover classi cations from satellite images The SVM was compared to three otherpopular classi ers including the maximum likelihood classi er (MLC) neuralnetwork classi ers (NNC) and decision tree classi ers (DTC) The impacts ofkernel con guration on the performance of the SVM and of the selection oftraining data and input variables on the four classi ers were also evaluated inthis experiment

1 IntroductionLand cover information has been identi ed as one of the crucial data components

for many aspects of global change studies and environmental applications (Sellerset al 1995) The derivation of such information increasingly relies on remote sensingtechnology due to its ability to acquire measurements of land surfaces at variousspatial and temporal scales One of the major approaches to deriving land coverinformation from remotely sensed images is classi cation Numerous classi cationalgorithms have been developed since the rst Landsat image was acquired in early1970s (Townshend 1992 Hall et al 1995) Among the most popular are the maximumlikelihood classi er (MLC) neural network classi ers and decision tree classi ersThe MLC is a parametric classi er based on statistical theory Despite limitationsdue to its assumption of normal distribution of class signature (eg Swain and Davis

daggerCurrent address Raytheon ITSS USGSEROS Data Center Sioux Falls SD 57108USA e-mail address huangedcmailcrusgsgov

International Journal of Remote SensingISSN 0143-1161 printISSN 1366-5901 online copy 2002 Taylor amp Francis Ltd

httpwwwtandfcoukjournalsDOI 10108001431160110040323

Chengquan Huang et al726

1978) it is perhaps one of the most widely used classi ers (eg Wang 1990 Hansenet al 1996) Neural networks avoid some of the problems of the MLC by adoptinga non-parametric approach Their potential discriminating power has attracted agreat deal of research eVort As a result many types of neural networks have beendeveloped (Lippman 1987) the most widely used in the classi cation of remotelysensed images is a group of networks called a multi-layer perceptron (MLP) (egPaola and Schowengerdt 1995 Atkinson and Tatnall 1997)

A decision tree classi er takes a diVerent approach to land cover classi cationIt breaks an often very complex classi cation problem into multiple stages of simplerdecision-making processes (Safavian and Landgrebe 1991) Depending on thenumber of variables used at each stage there are univariate and multivariate decisiontrees (Friedl and Brodley 1997) Univariate decision trees have been used to developland cover classi cations at a global scale (DeFries et al 1998 Hansen et al 2000)Though multivariate decision trees are often more compact and can be more accuratethan univariate decision trees (Brodley and UtgoV 1995) they involve more complexalgorithms and as a result are aVected by a suite of algorithm-related factors (Friedland Brodley 1997) The univariate decision tree developed by Quinlan (1993) isevaluated in this study

The support vector machine (SVM) represents a group of theoretically superiormachine learning algorithms As shall be described in the following section the SVMemploys optimization algorithms to locate the optimal boundaries between classesStatistically the optimal boundaries should be generalized to unseen samples withleast errors among all possible boundaries separating the classes therefore minimizingthe confusion between classes In practice the SVM has been applied to opticalcharacter recognition handwritten digit recognition and text categorization (Vapnik1995 Joachims 1998b) These experiments found the SVM to be competitive withthe best available classi cation methods including neural networks and decision treeclassi ers The superior performance of the SVM was also demonstrated in classifyinghyperspectral images acquired from the Airborne VisibleInfrared ImagingSpectrometer (AVIRIS) (Gualtieri and Cromp 1998) While hundreds of variableswere used as the input in the experiments mentioned above there are far fewervariables in data acquired from operational sensor systems such as Landsat theAdvanced Very High Resolution Radiometer (AVHRR) and the Moderate ResolutionImaging Spectroradiometer (MODIS) Because these are among the major sensorsystems from which land cover information is derived an evaluation of the perform-ance of the SVM using images from such sensor systems should have practicalimplications for land cover classi cation The purpose of this paper is to demonstratethe applicability of this algorithm to deriving land cover from such operationalsensor systems and systematically to evaluate its performances in comparison toother popular classi ers including the statistical maximum likelihood classi er(MLC) a back propagation neural network classi er (NNC) (Pao 1989) and adecision tree classi er (DTC) (Quinlan 1993) The SVM was implemented byJoachims (1998a) as SVMlight

A brief introduction to the theoretical development of the SVM is given in thefollowing section This is deemed necessary because the SVM is relatively new tothe remote sensing community as compared to the other three methods The dataset and experimental design are presented in sect3 Experimental results are discussedin the following three sections including impacts of kernel con guration on theperformance of the SVM comparative performances of the four classi ers and

Support vector machines for land cover classi cation 727

impacts of non-algorithm factors The results of this study are summarized in thelast section

2 Theoretical development of SVMThere are a number of publications detailing the mathematical formulation of

the SVM (see eg Vapnik 1995 1998 Burges 1998) The algorithm development ofthis section follows Vapnik (1995) and Burges (1998)

The inductive principle behind SVM is structural risk minimization (SRM)According to Vapnik (1995) the risk of a learning machine (R) is bounded by thesum of the empirical risk estimated from training samples (Remp) and a con denceinterval (Y ) R Remp+Y The strategy of SRM is to keep the empirical risk (Remp ) xed and to minimize the con dence interval (Y ) or to maximize the margin betweena separating hyperplane and closest data points ( gure 1) A separating hyperplanerefers to a plane in a multi-dimensional space that separates the data samples of twoclasses The optimal separating hyperplane is the separating hyperplane that maxim-izes the margin from closest data points to the plane Currently one SVM classi ercan only separate two classes Integration strategies are needed to extend this methodto classifying multiple classes

21 T he optimal separating hyperplaneLet the training data of two separable classes with k samples be represented by

(x1 y1 ) (xk y

k) where xmicroRn is an n-dimensional space and ymicro+1 shy 1 is class

label Suppose the two classes can be separated by two hyperplanes parallel to theoptimal hyperplane ( gure 1(a))

wmiddotxi+b 1 for y

i=1 i=1 2 k (1)

wmiddotxi+b shy 1 for y

i=shy 1 (2)

Figure 1 The optimal separating hyperplane between (a) separable samples and (b)non-separable data samples


where w=(w1 w

n) is a vector of n elements Inequalities (1) and (2) can be

combined into a single inequality

yi[ w acutex

i+b] 1 i=1 k (3)

As shown in gure 1 the optimal separating hyperplane is the one that separatesthe data with maximum margin This hyperplane can be found by minimizing thenorm of w or the following function

F (w)=Oslash(wacutew) (4)

under inequality constraint (3)The saddle point of the following Lagrangean gives solutions to the above

optimization problem

L(w b a)=Oslash(wacutew)shy aeligk

i= 1

aiy

i[ wacutex

i+b] shy 1 (5)

where ai

0 are Lagrange multipliers (Sundaram 1996) The solution to this optim-ization problem requires that the gradient of L (w b a) with respect to w and bvanishes giving the following conditions

w= aeligk

i= 1

yiaix

i(6)

aeligk

i= 1aiyi=0 (7)

By substituting (6 ) and (7) into (5) the optimization problem becomes maximize

L(a)= aeligk

i= 1

aishy

1

2aeligk

i= 1

aeligk

j= 1

aiajyiyj(x

iacutex

j) (8)

under constraints ai

0 i=1 kGiven an optimal solution a0=(a01 a0

k) to (8) the solution w0 to (5) is a linear

combination of training samples

w0= aeligk

i= 1

yia0ix

i(9)

According to the KuhnndashTucker theory (Sundaram 1996) only points that satisfythe equalities in (1) and (2) can have non-zero coeYcients a0

i These points lie on

the two parallel hyperplanes and are called support vectors ( gure 1) Let x0 (1) bea support vector of one class and x0 (shy 1) of the other then the constant b0 can becalculated as follows

b0=Oslash[ w0acutex0(1)+w0acutex0(shy 1)] (10)

The decision rule that separates the two classes can be written as

f (x)=signA aeligsupport vector

yia0i(x

iacutex)shy b0B (11)

22 Dealing with non-separable casesAn important assumption to the above solution is that the data are separable in

the feature space It is easy to check that there is no optimal solution if the datacannot be separated without error To resolve this problem a penalty value C for


misclassi cation errors and positive slack variables ji

are introduced ( gure 1(b))These variables are incorporated into constraint (1) and (2) as follows

wacutexi+b 1 shy j

ifor y

i=1 (12)

wiacutex

i+b shy 1+j

ifor y

i=1 (13)

ji

0 i=1 k (14)

The objective function (4) then becomes

F (w j)=Oslash(w acutew)+CA aeligk

i= 1jiBl

(15)

C is a preset penalty value for misclassi cation errors If l=1 the solution to thisoptimization problem is similar to that of the separable case

23 Support vector machinesTo generalize the above method to non-linear decision functions the support

vector machine implements the following idea it maps the input vector x into ahigh-dimensional feature space H and constructs the optimal separating hyperplanein that space Suppose the data are mapped into a high-dimensional space H throughmapping function W

W Rn H (16)

A vector x in the feature space can be represented as W (x) in the high-dimensionalspace H Since the only way in which the data appear in the training problem (8)are in the form of dot products of two vectors the training algorithm in the high-dimensional space H would only depend on data in this space through a dot productie on functions of the form W(x

i)acuteW(x

j) Now if there is a kernel function K such

that

K (xi x

j)=W(x

i)acuteW(x

j) (17)

we would only need to use K in the training program without knowing the explicitform of W The same trick can be applied to the decision function (11) because theonly form in which the data appear are in the form of dot products Thus if a kernelfunction K can be found we can train and use a classi er in the high-dimensionalspace without knowing the explicit form of the mapping function The optimizationproblem (8) can be rewritten as

L (a)= aeligk

i= 1aishy

1

2aeligk

i= 1aeligk

j= 1aiajyiyjK (x

i x

j) (18)

and the decision rule expressed in equation (11) becomes


yia0iK (x

i x)shy b0B (19)

A kernel that can be used to construct a SVM must meet Mercerrsquos condition(Courant and Hilbert 1953) The following two types of kernels meet this condition


and will be considered in this study (Vapnik 1995) The polynomial kernels

K (x1 x2)=(x1acutex2+1)p (20)

and the radial basis functions (RBF)

K (x1 x2)=e Otilde c(x1 Otilde x2)2 (21)

24 From binary classi er to multi-class classi erIn the above theoretical development the SVM was developed as a binary

classi er ie one SVM can only separate two classes Strategies are needed to adaptthis method to multi-class cases Two simple strategies have been proposed to adaptthe SVM to N-class problems (Gualtieri and Cromp 1998) One is to construct amachine for each pair of classes resulting in N (N shy 1)2 machines When applied toa test pixel each machine gives one vote to the winning class and the pixel is labelledwith the class having most votes The other strategy is to break the N-class caseinto N two-class cases in each of which a machine is trained to classify one classagainst all others When applied to a test pixel a value measuring the con dencethat the pixel belongs to a class can be calculated from equation (19) and the pixelis labelled with the class with which the pixel has the highest con dence value(Vapnik 1995) Without an evaluation of the two strategies the second one is usedin this study because it only requires to train N SVM machines for an N-class casewhile for the same classi cation the rst strategy requires to train N (N shy 1)2 SVMmachines

With the second strategy each SVM machine is constructed to separate one classfrom all other classes An obvious problem with this strategy is that in constructingeach SVM machine the sizes of the two concerned classes can be highly unbalancedbecause one of them is the aggregation of N shy 1 classes For data samples that cannotbe separated without error a classi er may not be able to nd a boundary betweentwo highly unbalanced classes For example a classi er may not be able to nd aboundary between the two classes shown in gure 2 because the classi er probablymakes least errors by labelling all pixels belonging to the smaller class with the

Figure 2 An example of highly unbalanced training samples in a two-dimensional spacede ned by two arbitrary variables features 1 and 2 A classi er might incur moreerrors by drawing boundaries between the two classes than labelling pixels of thesmaller class with the larger one


larger one To avoid this problem the samples of the smaller class are replicatedsuch that the two classes have approximately the same sizes Similar tricks wereemployed in constructing decision tree classi ers for highly unbalanced classes(DeFries et al 1998)

3 Data and experimental design31 Data and preprocessing

A spatially degraded Thematic Mapper (TM) image and a corresponding refer-ence map were used in this evaluation study The TM image acquired in easternMaryland on 14 August 1985 has a spatial resolution of 285m The six spectralbands (bands 1ndash5 and 7) of the TM image were converted to top-of-atmosphere(TOC) re ectance according to Markham and Barker (1986) Atmospheric correctionwas not necessary because the image was quite clear within the study area Threebroad cover typesmdashforest non-forest land and watermdashwere delimited from thisimage giving a land cover map with the same spatial resolution as the TM imageThis three-class scheme was selected to ensure the achievement of high accuracyof the collected land cover map at this resolution Confused pixels were labelledaccording to aerial photographs and eld visits covering the study area

Both the TM image and derived land cover map were degraded to a spatialresolution of 2565m with a degrading ratio of 91 ie each degraded pixel corre-sponds to 9 by 9 TM pixels The main reason for evaluating the classi ers usingdegraded data is that a highly reliable reference land cover map with a reasonablenumber of classes can be generated at the degraded resolution The image wasdegraded using a simulation programme embedded with models of the point spreadfunctions (PSF) of TM and MODIS sensors (Barker and Burelhach 1992) Byconsidering the PSF of both sensor systems the simulation programme gives morerealistic images than spatial averaging (Justice et al 1989) Overlaying the 2565mgrids on the 285m land cover map and calculating the proportions of forest non-forest land and water within each 2565m grid gave proportion images of forestnon-forest land and water at the 2565m resolution A land cover map at the 2565mresolution was developed by reclassifying the proportion images according to classde nitions given in table 1 These de nitions were based on the IGBP classi cationscheme (Belward and Loveland 1996 DeFries et al 1998) Class names were sochosen to match the de nitions used in this study

32 Experimental designMany factors aVect the performance of a classi er including the selection of

training and testing data samples as well as input variables (Gong and Howarth1990 Foody et al 1995) Because the impact of testing data selection on accuracy

Table 1 De nition of land cover classes for the Maryland data set

Code Cover type De nition

1 Closed forest tree covergt60 water 202 Open forest 30lttree cover 60 water 203 Woodland 10lttree cover 30 water 204 Non-forest land tree cover 10 water 205 Land-water mix 20ltwater 706 Water watergt70


assessment has been investigated in many works (eg Genderen and Lock 1978Stehman 1992) only the selection of training sample and the selection of inputvariable were considered in this study In order to avoid biases in the con dencelevel of accuracy estimates due to inappropriately sampled testing data (Fitzpatrick-Lins 1981 Dicks and Lo 1990) the accuracy measure of each test was estimatedfrom all pixels not used as training data

321 T raining data selectionTraining data selection is one of the major factors determining to what degree

the classi cation rules can be generalized to unseen samples (Paola and Schowengerdt1995) A previous study showed that this factor could be more important forobtaining accurate classi cations than the selection of classi cation algorithms(Hixson et al 1980) To assess the impact of training data size on diVerent classi ca-tion algorithms the selected algorithms were tested using training data of varyingsizes Speci cally the four algorithms were trained using approximately 2 4 6 8 10and 20 pixels of the entire image

With data sizes xed training pixels can be selected in many ways A commonlyused sampling method is to identify and label small patches of homogeneous pixelsin an image (Campbell 1996) However adjacent pixels tend to be spatially correlatedor have similar values (Campbell 1981) Training samples collected this way under-estimate the spectral variability of each class and are likely to give degraded classi- cations (Gong and Howarth 1990) A simple method to minimize the eVect ofspatial correlation is random sampling (Campbell 1996) Two random samplingstrategies were investigated in this experiment One is called equal sample rate (ESR)in which a xed percentage of pixels are randomly sampled from each class astraining data The other is called equal sample size (ESS) in which a xed numberof pixels are randomly sampled from each class as training data In both strategiesthe total number of training samples is approximately the same as those calculatedaccording to the prede ned 2 4 6 8 10 and 20 sampling rates for the wholedata set

322 Selection of input variablesThe six TM spectral bands roughly correspond to six MODIS bands at 250m

and 500m resolutions (Barnes et al 1998) Only the red (TM band 3) and nearinfrared (NIR TM band 4) bands are available at 250m resolution The other fourTM bands are available at 500m resolution Because these four bands containinformation that is complementary to the red and NIR bands (Townshend 1984Toll 1985) not having them at 250m resolution may limit the ability to derive landcover information at this resolution Two sets of tests were performed to evaluatethe impact of not having the four TM bands on land cover characterization at the250m resolution In the rst set only the red NIR band and the normalized diVerencevegetation index (NDVI) were used as input to the classi ers while in the secondset the other four bands were also included NDVI is calculated from the red andNIR bands as follows

NDVI=NIR shy red

NIR+red(22)

Table 2 summarizes the training conditions under which the four classi cationalgorithms were evaluated


Table 2 Training data conditions under which the classi cation algorithms were tested

Sample size ( Number of inputSampling method of entire image) variables Training case no

Equal sample size 2 3 17 2

4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12

Equal sample rate 2 3 137 14

4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24

323 Cross validationIn the above experiment only one training data set was sampled from the image

at each training size level In order to evaluate the stability of the selected classi ersand for the results to be statistically valid cross validations were performed at twotraining data size levels 6 pixels representing a relatively small training size and20 pixels representing a relatively large training size At each size level ten sets oftraining samples were randomly selected from the image using the equal sample rate(ESR) method As will be discussed in sect61 this method gave slightly higher accuraciesthan the ESS On each training data set the four classi cation algorithms weretrained using three and seven variables

33 Methods for performance assessmentThe criteria for evaluating the performance of classi cation algorithms include

accuracy speed stability and comprehensibility among others Which criterion or


which group of criteria to use depends on the purpose of the evaluation As acriterion most relevant to all parties and all purposes accuracy was selected as theprimary criterion in this assessment Speed and stability are also important factorsin algorithm selection and these were considered as well Two widely used accuracymeasuresmdashoverall accuracy and the kappa coeYcientmdashwere used in this study(Rosen eld and Fitzpatrick-Lins 1986 Congalton 1991 Janssen and Wel 1994) Theoverall accuracy has the advantage of being directly interpretable as the proportionof pixels being classi ed correctly (Janssen and Wel 1994 Stehman 1997) while thekappa coeYcient allows for a statistical test of the signi cance of the diVerencebetween two algorithms (Congalton 1991)

4 Impact of kernel con guration on the performances of the SVMAccording to the theoretical development of SVM presented in sect2 the kernel

function plays a major role in locating complex decision boundaries between classesBy mapping the input data into a high-dimensional space the kernel functionconverts non-linear boundaries in the original data space into linear ones in thehigh-dimensional space which can then be located using an optimization algorithmTherefore the selection of kernel function and appropriate values for correspondingkernel parameters referred to as kernel con guration may aVect the performanceof the SVM

41 Polynomial kernelsThe parameter to be prede ned for using the polynomial kernels is the polynomial

order p According to previous studies (Cortes and Vapnik 1995) p values of 1 to 8were tested for each of the 24 training cases Rapid increases in computing time asp increases limited experiments with higher p values Kernel performance is measuredusing the overall agreement between a classi cation and a reference mapmdashthe overallaccuracy (Stehman 1997) Figure 3 shows the impact of p on kernel performance Ingeneral the linear kernel (p=1) performed worse than nonlinear kernels which isexpected because boundaries between many classes are more likely to be non-linearWith three variables as the input there are obvious trends of improved accuracy asp increases (Figure 3(c) and (d )) Such trends are also observed in training cases withseven input variables when p increases from 1 to 4 ( gure 3(a) and (b)) This observa-tion is in contrast to the studies of Cortes and Vapnik (1995) in which no obvioustrend was observed when the polynomial order p increased from 2 to higher valuesThis is probably because the number of input variables used in this study is quitediVerent from those used in previous studies The data set used in this experimenthas only several variables while those used in previous studies had hundreds ofvariables DiVerences between observations of this experiment and those of previousstudies suggest that polynomial order p has diVerent impacts on kernel performancewhen diVerent number of input variables is used With large numbers of inputvariables complex nonlinear decision boundaries can still be mapped into linearones using relatively low-order polynomial kernels However if a data set has onlyseveral variables it is necessary to try high-order polynomial kernels in order toachieve optimal performances using polynomial SVM

42 RBF kernelsThe parameter to be preset for using the RBF kernel de ned in equation (21) is

c In previous studies c values of around 1 were used (Vapnik 1995 Joachims 1998b)


(a) (b)

(c) (d)

Figure 3 Performance of polynomial kernels as a function of polynomial order p (trainingdata size is pixel of the image) (a) Equal sample size 7 variables (b) equal samplerate 7 variables (c) equal sample size 3 variables (d ) equal sample rate 3 variables

For this speci c data set c values between 1 and 20 gave reasonable results ( gure 4)A comparison between gure 3 and gure 4 reveals that the performance of the RBFkernel is less aVected by c than that of the polynomial kernel by p With seven inputvariables ( gure 4(a) and (b)) the overall accuracy only changed slightly when cvaried between 1 and 20 With three input variables however the impact is moresigni cant Figure 4(c) and (d ) show obvious trends of increased performance whenc increased from 1 to 75 For most training cases the overall accuracy only changedslightly when c increased beyond 75

The impact of kernel parameter on kernel performance can be illustrated usingan experiment performed on arbitrary data samples collected in a two-dimensionalspace Figure 5 shows the data samples of two classes and the decision boundariesbetween the two classes as located by polynomial and RBF kernels Notice thatalthough the decision boundaries located by all non-linear kernels (all polynomialkernels with pgt1 and all RBF kernels) are similar for this speci c set of samplesthe shape of the decision boundary is adjusted slightly and misclassi cation errorsare reduced gradually as p increases from 3 to 12 for the polynomial kernel( gure 5(a)) or as c decreases from 1 to 01 for the RBF kernel ( gure 5(b)) Withappropriate kernel parameter values both polynomial (p=12) and RBF (c=01)kernels classi ed this arbitrary data set without error though the decision boundariesde ned by the two types of kernels are not exactly the same How well these decision


(a) (b)

(c) (d)

Figure 4 Performance of RBF kernels as a function of c (training data size is pixel of theimage) (a) Equal sample size 7 variables (b) equal sample rate 7 variables (c) equalsample size 3 variables (d ) equal sample rate 3 variables

boundaries can be generalized to unseen samples depends on the distribution ofunseen data samples

As will be discussed in sect6 classi cation accuracy is aVected by training samplesize and number of input variables Figures 3 and 4 show that most SVM kernelsgave higher accuracies with a larger training size and more input variables Withthree input variables however most SVM kernels gave unexpectedly higheraccuracies on the training case with 2 pixels sampled using the equal sample size(ESS) method than on several larger training data sets selected using the samesampling method ( gures 3(c) and 4(c)) This is probably because SVM de nesdecision boundaries between classes using support vectors rather than statisticalattributes which are sample size dependent ( gure 5) Although a larger trainingdata set has a better chance of including support vectors that de ne the actualdecision boundaries and hence should give higher accuracies there are occasionswhen a smaller training data set includes such support vectors while larger ones donot In sect61 we will show that the other three classi ers did not have such abnormalhigh accuracies on this training case (see gure 8(c) later)

5 Comparative performances of the four classi ersThe previous section has already illustrated the impact of kernel para-

meter setting on the accuracy of the SVM Similarly the performance of the other


Figure 5 Impact of kernel con guration on the decision boundaries and misclassi cationerrors of the SVM Empty and solid circles represent two arbitrary classes Circledpoints are support vectors Checked points represent misclassi cation errors Red andblue represent high con dence areas for class one (empty circle) and two (solid circle)respectively Optimal separating hyperplanes are highlighted in white

classi cation algorithms may also be aVected by the parameter settings of thosealgorithms For example the performance of the NNC is in uenced by the networkstructure (eg Sui 1994 Paola and Schowengerdt 1997) while that of the DTC isaVected by the degree of pruning (Breiman et al 1984 Quinlan 1993) In thisexperiment the NNC took a three-layer (input hidden and output) network structurewhich is considered suYcient for classifying multispectral imageries (Paola andSchowengerdt 1995) The numbers of units of the rst and last layers were set to


the numbers of input variables and output classes respectively There is no guidelinefor determining the number of hidden units In this experiment it was determinedaccording to the number of input variables Three hidden layer con gurations weretested on each training case the number of hidden units equals one two and threetimes of the number of input variables A major issue in pruning a classi cation treeis when to stop to produce a tree that generalizes well to unseen data samples Toosimple a tree may not be able to exploit fully the explanatory power of the datawhile too complex a tree may generalize poorly Yet there is no practical guidelinethat guarantees a lsquoperfectrsquo tree that is not too simple and not too complex In thisexperiment a wide range of pruning degrees were tested

Because of the diVerent nature of the impacts of algorithm parameters on diVerentalgorithms it is impossible to account for such diVerences in evaluating the comparat-ive performances of the algorithms To avoid this problem the best performance ofeach algorithm on each training case was reported in the following comparison Theperformances were evaluated in terms of algorithm accuracy stability and speed

51 Classi cation accuracyThe accuracy of classi cations was measured using the overall accuracy The

signi cance of accuracy diVerences was tested using the kappa statistics accordingto Congalton et al (1983) and Hudson and Ramm (1987) Figure 6 shows the overallaccuracies of the four algorithms on the 24 training cases Table 3 gives the signi c-ance values of accuracy diVerences between the four algorithms Table 4 gives themean and standard deviation of the overall accuracies of classi cations developedthrough cross validation at two training size levels 6 and 20 pixels of the imageSeveral patterns can be observed from gure 6 and tables 3 and 4 as follows

(1) Generally the SVM was more accurate than DTC or the MLC It gavesigni cantly higher accuracies than the MLC in 18 out of 20 training cases(MLC could not run on four training cases due to insuYcient trainingsamples) and than DTC in 14 of 24 training cases In all remaining trainingcases the MLC and DTC did not generate signi cantly better results thanthe SVM The SVM also gave signi cantly better results than NNC in six ofthe 12 training cases with seven input variables and though insigni cantlygave higher accuracies than NNC in ve of the remaining six training casesOn average when seven variables were used the overall accuracy of the SVMwas 1ndash2 higher than that of NNC and 2ndash4 higher than those of DTCand the MLC (table 4) When only three variables were used the averageoverall accuracies of the SVM were about 1ndash2 higher than those of DTCand the MLC These observations are in general agreement with previousworks in which the SVM was found to be more accurate than either NNCor DTC (Vapnik 1995 Joachims 1998b) This is expected because as discussedin sect2 the SVM is designed to locate an optimal separating hyperplane whilethe other three algorithms may not be able to locate this separating hyper-plane Statistically the optimal separating hyperplane located by theSVM should be generalized to unseen samples with least errors among allseparating hyperplanes

(2) Unexpectedly however SVM did not give signi cantly higher accuraciesthan NNC in any of the 12 training cases with three input variables On thecontrary it was signi cantly less accurate than NNC in three of the 12


Figure 6 Overall accuracies of classi cations developed using the four classi ers Y -axis isoverall accuracy () X-axis is training data size ( pixel of the image) (a) Equalsample size 7 variables (b) equal sample rate 7 variables (c) equal sample size 3variables (d ) equal sample rate 3 variables

training cases The average overall accuracies of the SVM were slightly lowerthan those of NNC (table 4) The lower accuracies of SVM than those ofNNC on data with three variables are probably due to the inability of theSVM to transform non-linear class boundaries in the original space intolinear ones in a high-dimensional space According to the algorithm develop-ment detailed in sect2 the applicability of the SVM to non-linear decisionboundaries depends on whether the decision boundaries can be transformedinto linear ones by mapping the input data into a high-dimensional spaceWith only three input variables the SVM might have less success in trans-forming complex decision boundaries in the original input space into linearones in a high-dimensional space The complex network structure of NNChowever might be able to approximate complex decision boundaries evenwhen the data contain very few variables and therefore have better compar-ative performances over the SVM The comparative performances of theSVM on data sets with very few variables should be further investigatedbecause data sets with such few variables were not considered in previousstudies (Cortes and Vapnik 1995 Joachims 1998b)


Table 3 Signi cance value (Z ) of diVerences between the accuracies of the four classi ers

Equal sample Equal sample Equal sample Equal sampleSample size rate size ratesize () 7 variables 7 variables 3 variables 3 variables

SVM vs NNC2 177 365 120 shy 1024 196 shy 150 shy 229 shy 2386 192 100 shy 460 0228 228 119 shy 106 shy 088

10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143

SVM vs MLC2 803 NA 504 NA4 727 NA 033 NA6 634 338 235 3038 330 424 480 648

10 473 754 151 45120 632 503 339 386

DTC vs NNC2 117 117 shy 231 shy 2704 shy 037 shy 069 shy 291 shy 1016 shy 252 shy 089 shy 507 shy 2798 shy 230 shy 106 shy 560 shy 240

10 shy 076 shy 061 shy 248 shy 52220 shy 213 shy 083 shy 271 shy 142

DTC vs MLC2 744 NA 160 NA4 494 NA shy 028 NA6 190 149 188 0028 shy 129 199 028 498

10 202 297 shy 096 shy 0720 163 194 219 246

NNC vs MLC2 625 NA 391 NA4 533 NA 264 NA6 442 238 699 2808 101 305 588 739

10 278 358 154 45020 376 277 493 253

Notes1 DiVerences signi cant at 95 con dence level (Z 196) are highlighted in bold face

A positive value indicates better performance of the rst classi er while a negativeone indicates better performance of the second classi er

2 NA indicates that the MLC did not work due to insuYcient training samples forcertain classes and no comparison was made


Table 4 Mean and standard deviation (s) of the overall accuracies () of classi cationsdeveloped using ten sets of training samples randomly selected from the Marylanddata set

SVM NNC DTC MLC

Training condition Mean s Mean s Mean s Mean s

Training size=20 7562 019 7402 081 7331 065 7176 079Input variables=7Training size=6 7420 060 7210 131 7182 094 7092 104Input variables=7Training size=20 6641 039 6682 091 6592 052 6459 062Input variables=3Training size=6 6549 120 6597 079 6445 058 6395 097Input variables=3

(3) Of the other three algorithms NNC gave signi cantly higher results thanDTC in ten of the 12 training cases with three input variables and in threeof the 12 training cases with seven input variables Again NNC showed bettercomparative performances on training cases with three variables than ontraining cases with seven variables DTC did not give signi cantly betterresults than NNC on any of the remaining training cases Both NNC andDTC were more accurate than the MLC NNC had signi cantly higheraccuracies than the MLC in 18 of 20 training cases while DTC did soin eight of 20 training cases The MLC did not have signi cantly higheraccuracies than NNC and DTC on any of the remaining training cases

(4) The accuracy diVerences of the four algorithms on the data set used in thisstudy were generally small However many of them were statisticallysigni cant

52 Algorithm stability and speedThe standard deviation of the overall accuracy of an algorithm estimated in cross

validation is a quantitative measure of its relative stability (table 4) Figure 7 showsthe variations of the accuracies of the four classi ers Both table 4 and gure 7 revealthat the stabilities of the algorithms diVered greatly and were aVected by trainingdata size and number of input variables In general the overall accuracies of thealgorithms were more stable when trained using 20 pixels than using 6 pixelsespecially when seven variables were used ( gures 7(a) and (b)) The SVM gave farmore stable overall accuracies than the other three algorithms when trained using20 pixels with seven variables It also gave more stable overall accuracies than theother three algorithms when trained using 6 pixels with seven variables ( gure 7(b))and using 20 pixels with three variables ( gure 7(c)) But when trained using 6pixels with three variables it gave overall accuracies in a wider range than the otherthree algorithms ( gure 7(d )) Of the other three algorithms DTC gave slightlymore stable overall accuracies than NNC or the MLC both of which gave overallaccuracies in wider ranges in all cases

The training speeds of the four classi ers were substantially diVerent In alltraining cases training the MLC and DTC did not take more than a few minuteson a SUN Ultra 2 workstation while training NNC and the SVM took hours anddays respectively Furthermore the training speeds of the above algorithms were


(a) (b)

(d)(c)

Figure 7 Boxplots of the overall accuracies of classi cations developed using ten sets oftraining samples randomly selected from the Maryland data set (a) Training size=20 pixels of the image number of input variables=7 (b) Training size=6 pixelsof the image number of input variables=7 (c) Training size=20 pixels of the imagenumber of input variables=3 (d ) Training size=6 pixels of the image number ofinput variables=3

aVected by many factors including numbers of training samples and input variablesnoise level in the training data set as well as algorithm parameter setting This isespecially the case for the SVM and NNC Many studies have demonstrated thatthe training speed of NNC depends on network structure momentum rate learningrate and converging criteria (Paola and Schowengerdt 1995) The training ofthe SVM was aVected by training data size kernel parameter setting and classseparability Generally when the training data size was doubled the training timewould be more than doubled Training the SVM to classify two highly mixed classescould take several times longer than training it to classify two separable classes Forthe SVM programme used in this study polynomial kernels especially high-orderkernels took far more time to train than RBF kernels

6 Impacts of non-algorithm factors61 Impact of training sample selection

Training sample selection includes two parts training data size and selectionmethod Reorganizing the numbers in gure 6 shows the impact of training data sizeon algorithm performance ( gure 8) As expected increases in training data sizegenerally led to improved performances While the increases in overall accuracy were


Figure 8 Impact of training data size on the performances of the classi ers Y -axis is overallaccuracy () Training data size is pixel of the image (a) Equal sample size 7variables (b) equal sample rate 7 variables (c) equal sample size 3 variables (d ) equalsample rate 3 variables

not monotonic as training data size increased larger training data sets (gt6 of theimage) generally gave results better than smaller ones (lt6)

One of the goals of this experiment was to determine the minimum training datasize for suYcient training of an algorithm The obvious increases in overall accuracyas training data size increased from 2 to 6 indicate that for this test data settraining pixels less than 6 of the entire image are insuYcient for training the fouralgorithms Beyond 6 however it is hard to tell when an algorithm is trainedadequately When seven variables were used and the training samples were selectedusing the equal sample rate (ESR) method ( gure 8(b)) the largest training data set(20 pixels) gave the best results For other training cases however the bestperformance of an algorithm was often achieved with training pixels less than 20of the image ( gure 8(a) (c) (d )) Hepner et al (1990) considered a training data sizeof a 10 by 10 block for each class as the minimum data size for training NNCZhuang et al (1994) suggested that training data sets of approximately 5ndash10 ofan image were needed to train a neural network classi er adequately The resultsof this experiment suggest that the minimum number of samples for adequately


training an algorithm may depend on the algorithm concerned the number of inputvariables the method used to select the training samples and the size and spatialvariability of the study area

The impact of the two sampling methods for selecting training datamdashequalsample size (ESS) and equal sample rate (ESR)mdashon classi cation accuracy wasassessed using kappa statistics Table 5 shows that the two sampling methods didgive signi cantly diVerent accuracies for some training cases For most training casesslightly higher accuracies were achieved when the training samples were selectedusing the ESR method Considering the disadvantage of undersampling or eventotally missing rare classes of the ESR method the sampling rate of very rare classesshould be increased when this method is employed

62 Impact of input variablesIt is evident from gures 6 and 8 that substantial improvements were achieved

when the classi cations were developed using seven variables instead of using threeThe respective average improvements in overall accuracy for the SVM NNC DTCand the MLC were 88 58 80 and 59 when training samples were selectedusing the ESS method and 81 61 76 and 73 when training samples wereselected using the ESR method respectively Figure 9 shows two SVM classi cationsdeveloped using three and seven variables They were developed from the trainingdata set consisting of 20 pixels of the image selected using the ESR method Avisual inspection of the two classi cations reveals that using the four additional TMbands led to substantial improvements in discriminating between the four landclasses (closed forest open forest woodland and non-forest land) Table 6 gives thenumber of pixels classi ed correctly in the two classi cations The last row showsthat the relative increases in number of pixels classi ed correctly for the four landclasses are much higher than those for the classes of water and landndashwater mix

It should be noted that improvements in classi cation accuracy achieved byusing more variables were substantially higher than those achieved by choosingbetter classi cation algorithms or by increasing training data size underlining theimportance of using as much information as possible in land cover classi cation

Table 5 Signi cance value (Z ) of diVerences between classi cations developed from trainingsamples selected using the equal sample size (ESS) and equal sample rate (ESR)methods

Algorithm

SVM DTC NNC MLCSamplerate () 3-band 7-band 3-band 7-band 3-band 7-band 3-band 7-band

2 272 shy 316 shy 094 shy 128 shy 054 shy 583 mdash mdash4 shy 104 192 shy 301 shy 121 shy 119 shy 153 mdash mdash6 shy 307 112 shy 053 shy 142 174 021 shy 240 shy 1838 shy 081 085 shy 383 shy 147 shy 063 024 085 180

10 shy 270 shy 207 shy 001 shy 020 shy 267 006 030 07520 shy 313 shy 174 shy 293 shy 335 shy 164 shy 124 shy 267 shy 306

Note DiVerences signi cant at the 95 con dence level (Z 196) are highlighted in boldface Positive Z values indicate higher accuracies for the ESS method while negative onesindicate higher accuracies for the ESR method


Figure 9 SVM classi cations developed for the study area in eastern Maryland USAusing three and seven variables from the training data set consisting of 20 trainingpixels selected using the equal sample rate (ESR) method The classi cations coveran area of 225km by 225km (a) Classi cation developed using three variables(b) Classi cation developed using seven variables

Table 6 Number of pixels classi ed correctly in the two classi cations shown in gure 8 andper-class improvement due to using seven instead of three variables in the classi cation

Classi cation Closed Open Wood- Non-forest Land-waterdeveloped using forest forest land land mix Water

Per-class agreement (number of pixel) between a classi cation and the reference map

Three variables 1317 587 376 612 276 974Seven variables 1533 695 447 752 291 982

Relative increases () in per-class agreement when the number of input variables increasedfrom 3 to 7

164 184 189 229 54 08

Many studies have demonstrated the usefulness of the two mid-infrared bands ofthe TM sensor in discriminating between vegetation types (eg DeGloria 1984Townshend 1984) yet the two bands will not be available at 250m resolution onthe MODIS instrument (Running et al 1994) Results from this experiment showthat the loss of discriminatory power due to not having the two mid-infrared bandsat 250m resolution could not be fully compensated for by using better classi cationalgorithms or by increasing training data size Whether the lost information can befully compensated for by incorporating spatial and temporal information needs tobe further investigated


7 Summary and conclusionsThe support vector machine (SVM) is a machine learning algorithm based on

statistical learning theory In previous studies it had been found competitive withthe best available machine learning algorithms for handwritten digit recognition andtext categorization In this study an experiment was performed to evaluate thecomparative performances of this algorithm and three other popular classi ers (themaximum likelihood classi er (MLC) neural network classi ers (NNC) and decisiontree classi ers (DTC)) in land cover classi cation In addition to the comparativeperformances of the four classi ers impacts of the con gurations of SVM kernelson its performance and of the selection of training data and input variables on allfour classi ers were also evaluated

SVM uses kernel functions to map non-linear decision boundaries in the originaldata space into linear ones in a high-dimensional space Results from this experimentrevealed that kernel type and kernel parameter aVect the shape of the decisionboundaries as located by the SVM and thus in uence the performance of the SVMFor polynomial kernels better accuracies were achieved on data with three inputvariables as the polynomial order p increased from 1 to 8 suggesting the need forusing high-order polynomial kernels when the input data have very few variablesWhen seven variables were used in the classi cation improved accuracies wereachieved when p increased from 1 to 4 Further increases in p had little impact onclassi cation accuracy For RBF kernels the accuracy increased slightly when cincreased from 1 to 75 No obvious trend of improvement was observed when cincreased from 5 to 20 However an experiment using arbitrary data points revealedthat misclassi cation error is a function of c

Of the four algorithms evaluated the MLC had lower accuracies than the threenon-parametric algorithms The SVM was more accurate than DTC in 22 out of 24training cases It also gave higher accuracies than NNC when seven variables wereused in the classi cation This observation is in agreement with several previousstudies The higher accuracies of the SVM should be attributed to its ability to locatean optimal separating hyperplane As shown in gure 1 statistically the optimalseparating hyperplane found by the SVM algorithm should be generalized to unseensamples with fewer errors than any other separating hyperplane that might be foundby other classi ers Unexpectedly however NNC were more accurate than the SVMwhen only three variables were used in the classi cation This is probably becausethe SVM had less success in transforming non-linear class boundaries in a very low-dimensional space into linear ones in a high-dimensional space On the other handthe complex network structure of NNC might be more eYcient in approximatingnon-linear decision boundaries even when the data have only very few variablesGenerally the absolute diVerences of classi cation accuracy were small among thefour classi ers However many of the diVerences were statistically signi cant

In terms of algorithm stability the SVM gave more stable overall accuraciesthan the other three algorithms except when trained using 6 pixels with threevariables Of the other three algorithms DTC gave slightly more stable overallaccuracies than NNC or the MLC both of which gave overall accuracies in wideranges In terms of training speed the MLC and DTC were much faster than theSVM and NNC While the training speed of NNC depended on network structuremomentum rate learning rate and converging criteria that of the SVM was aVectedby training data size kernel parameter setting and class separability

All four classi ers were aVected by the selection of training samples It was not


possible to determine the minimum number of samples for suYciently training analgorithm according to results from this experiment However the initial trends ofimproved classi cation accuracies for all four classi ers as training data size increasedemphasize the necessity of having adequate training samples in land cover classi ca-tion Feature selection is another factor aVecting classi cation accuracy Substantialincreases in accuracy were achieved when all six TM spectral bands and the NDVIwere used instead of only the red NIR and the NDVI The additional four TMbands improved the discrimination between land classes Improvements due to theinclusion of the four TM bands exceeded those due to the use of better classi cationalgorithms or increased training data size underlining the need to use as muchinformation as possible in deriving land cover classi cation from satellite images

AcknowledgmentsThis study was made possible through a NSF grant (BIR9318183) and a contract

from the National Aeronautics and Space Administration (NAS596060) The SVMprogramme used in this study was made available by Mr Thorstan Joachain

References

Atkinson P M and Tatnall A R L 1997 Neural networks in remote sensingInternational Journal of Remote Sensing 18 699ndash709

Barker J L and Burelhach J W 1992 MODIS image simulation from LandsatTM imagery In Proceedings ASPRSACSMRT Washington DC April 22ndash25 1992(Washington DC ASPRS) pp 156ndash165

Barnes W L Pagano T S and Salomonson V V 1998 Prelaunch characteristics ofthe Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1 IEEET ransactions on Geoscience and Remote Sensing 36 1088ndash1100

Belward A and Loveland T 1996 The DIS 1 km land cover data set Global ChangeNews L etter 27 7ndash9

Breiman L Friedman J H Olshend R A and Stone C J 1984 Classi cation andRegression T rees (Belmont CA Wadsworth International Group)

Brodley C E and Utgoff P E 1995 Multivariate decision trees Machine L earning19 45ndash77

Burges C J C 1998 A tutorial on support vector machines for pattern recognition DataMining and Knowledge Discovery 2 121ndash167

Campbell J B 1981 Spatial correlation eVects upon accuracy of supervised classi cationof land cover Photogrammetric Engineering and Remote Sensing 47 355ndash363

Campbell J B 1996 Introduction to Remote Sensing (New York The Guilford Press)Congalton R 1991 A review of assessing the accuracy of classi cations of remotely sensed

data Remote Sensing of Environment 37 35ndash46CongaltonR G Oderwald R G and Mead R A 1983 Assessing Landsat classi cation

accuracy using discrete multivariate analysis statistical techniques PhotogrammetricEngineering and Remote Sensing 49 1671ndash1678

Cortes C and Vapnik V 1995 Support vector networks Machine L earning 20 273ndash297Courant R and Hilbert D 1953 Methods of Mathematical Physics (New York John

Wiley)DeFries R S Hansen M Townshend J R G and Sohlberg R 1998 Global land

cover classi cations at 8km spatial resolution the use of training data derived fromLandsat imagery in decision tree classi ers International Journal of Remote Sensing19 3141ndash3168

DeGloria S 1984 Spectral variability of Landsat-4 Thematic Mapper and MultispectralScanner data for selected crop and forest cover types IEEE T ransactions on Geoscienceand Remote Sensing GE-22 303ndash311

Dicks S E and Lo T H C 1990 Evaluation of thematic map accuracy in a land-use andland-cover mapping program Photogrammetric Engineering and Remote Sensing 561247ndash1252


Fitzpatrick-Lins K 1981 Comparison of sampling procedures and data analysis for a land-use and land-cover map Photogrammetric Engineering and Remote Sensing 47343ndash351

Foody G M McCulloch M B and Yates W B 1995 The eVect of training set sizeand composition on arti cial neural network classi cation International Journal ofRemote Sensing 16 1707ndash1723

Friedl M A and Brodley C E 1997 Decision tree classi cation of land cover fromremotely sensed data Remote Sensing of Environment 61 399ndash409

Genderen V J L and Lock B F 1978 Remote sensing statistical testing of thematicmap accuracy Remote Sensing of Environment 7 3ndash14

Gong P and Howarth P J 1990 An assessment of some factors in uencing multispectralland-cover classi cation Photogrammetric Engineering and Remote Sensing 56597ndash603

Gualtieri J A and Cromp R F 1998 Support vector machines for hyperspectral remotesensing classi cation In Proceedings of the 27th AIPR WorkshopAdvances in ComputerAssisted Recognition Washington DC Oct 27 1998 (Washington DC SPIE)pp 221ndash232

Hall F G Townshend J R and Engman E T 1995 Status of remote sensing algorithmsfor estimation of land surface state parameters Remote Sensing of Environment 51138ndash156

Hansen M DeFries R S Townshend J R G and Sohlberg R 2000 Global landcover classi cation at 1 km spatial resulution using a classi cation tree approachInternational Journal of Remote Sensing 21 1331ndash1364

Hansen M Dubayah R and DeFries R 1996 Classi cation trees an alternative totraditional land cover classi ers International Journal of Remote Sensing 171075ndash1081

Hepner G F Logan T Ritter N and Bryant N 1990 Arti cial neural networkclassi cation using a minimal training set comparison to conventional supervisedclassi cation Photogrammetric Engineering and Remote Sensing 56 496ndash473

Hixson M Scholz D Fuhs N and Akiyama T 1980 Evaluation of several schemesfor classi cation of remotely sensed data Photogrammetric Engineering and RemoteSensing 46 1547ndash1553

Hudson W D and Ramm C W 1987 Correct formulation of the Kappa coeYcient ofagreement Photogrammetric Engineering and Remote Sensing 53 421ndash422

Janssen L L F and Wel F 1994 Accuracy assessment of satellite derived land coverdata a review IEEE Photogrammetric Engineering and Remote Sensing 60 419ndash426

Joachims T 1998a Making large scale SVM learning practical In Advances in KernelMethodsmdashSupport Vector L earning edited by B Scholkopf C Burges and A Smola(New York MIT Press)

Joachims T 1998b Text categorization with support vector machinesmdashlearning withmany relevant features In Proceedings of European Conference on Machine L earningChemnitz Germany April 10 1998 (Berlin Springer) pp 137ndash142

Justice C O Markham B L Townshend J R G and Kennard R L 1989 Spatialdegradation of satellite data International Journal of Remote Sensing 10 1539ndash1561

Lippman R P 1987 An introduction to computing with neural nets IEEE ASSP Magazine4 2ndash22

Markham B L and Barker J L 1986 Landsat MSS and TM post-calibration dynamicranges exoatmospheric re ectances and at-satellite temperatures EOSAT L andsatT echnical Notes 1 3ndash8

Pao Y-H 1989 Adaptive Pattern Recognition and Neural Networks (New York Addison-Wesley)

Paola J D and Schowengerdt R A 1995 A review and analysis of backpropagationneural networks for classi cation of remotely sensed multi-spectral imageryInternational Journal of Remote Sensing 16 3033ndash3058

Paola J D and Schowengerdt R A 1997 The eVect of neural network structure ona multispectral land-useland cover classi cation Photogrammetric Engineering andRemote Sensing 63 535ndash544

Quinlan J R 1993 C45 Programs for Machine L earning (San Mateo CA MorganKaufmann Publishers)


Rosenfield G H and Fitzpatrick-Lins K 1986 A coeYcient of agreement as a measureof thematic classi cation accuracy Photogrammetric Engineering amp Remote Sensing52 223ndash227

Running S W Justice C O Salomonson V Hall D Barker J Kaufmann Y JStrahler A H Huete A R Muller J P Vanderbilt V Wan Z MTeilletP and Carneggie D 1994 Terrestrial remote sensing science and algorithmsplanned for EOSMODIS International Journal of Remote Sensing 15 3587ndash3620

Safavian S R and Landgrebe D 1991 A survey of decision tree classi er methodologyIEEE T ransactions on Systems Man and Cybernetics 21 660ndash674

Sellers P J Meeson B W Hall F G Asrar G Murphy R E Schiffer R ABretherton F P et al 1995 Remote sensing of the land surface for studies ofglobal change modelsmdashalgorithmsmdashexperiments Remote Sensing of Environment51 3ndash26

Stehman S V 1992 Comparison of systematic and random sampling for estimating theaccuracy of maps generated from remotely sensed data Photogrammetric Engineeringand Remote Sensing 58 1343ndash1350

Stehman S V 1997 Selecting and interpreting measures of thematic classi cation accuracyRemote Sensing of Environment 62 77ndash89

Sui D Z 1994 Recent applications of neural networks for spatial data handling CanadianJournal of Remote Sensing 20 368ndash380

Sundaram R K 1996 A First Course in Optimization T heory (New York CambridgeUniversity Press)

Swain P H and Davis S M (editors) 1978 Remote Sensing the Quantitative Approach(New York McGraw-Hill)

Toll D L 1985 EVect of Landsat Thematic Mapper sensor parameters on land coverclassi cation Remote Sensing of Environment 17 129ndash140

Townshend J R G 1984 Agricultural land-cover discrimination using Thematic Mapperspectral bands International Journal of Remote Sensing 5 681ndash698

Townshend J R G 1992 Land cover International Journal of Remote Sensing 131319ndash1328

Vapnik V N 1995 T he Nature of Statistical L earning T heory (New York Springer-Verlag)Vapnik V N 1998 Statistical L earning T heory (New York Wiley)Wang F 1990 Fuzzy supervised classi cation of remote sensing images IEEE T ransactions

on Geoscience and Remote Sensing 28 194ndash201Zhuang X Engel B A Lozano-Garcia D F Fernandez R N and Johannsen C J

1994 Optimization of training data required for neuro-classi cation InternationalJournal of Remote Sensing 15 3271ndash3277


1978) it is perhaps one of the most widely used classi ers (eg Wang 1990 Hansenet al 1996) Neural networks avoid some of the problems of the MLC by adoptinga non-parametric approach Their potential discriminating power has attracted agreat deal of research eVort As a result many types of neural networks have beendeveloped (Lippman 1987) the most widely used in the classi cation of remotelysensed images is a group of networks called a multi-layer perceptron (MLP) (egPaola and Schowengerdt 1995 Atkinson and Tatnall 1997)

A decision tree classi er takes a diVerent approach to land cover classi cationIt breaks an often very complex classi cation problem into multiple stages of simplerdecision-making processes (Safavian and Landgrebe 1991) Depending on thenumber of variables used at each stage there are univariate and multivariate decisiontrees (Friedl and Brodley 1997) Univariate decision trees have been used to developland cover classi cations at a global scale (DeFries et al 1998 Hansen et al 2000)Though multivariate decision trees are often more compact and can be more accuratethan univariate decision trees (Brodley and UtgoV 1995) they involve more complexalgorithms and as a result are aVected by a suite of algorithm-related factors (Friedland Brodley 1997) The univariate decision tree developed by Quinlan (1993) isevaluated in this study

The support vector machine (SVM) represents a group of theoretically superiormachine learning algorithms As shall be described in the following section the SVMemploys optimization algorithms to locate the optimal boundaries between classesStatistically the optimal boundaries should be generalized to unseen samples withleast errors among all possible boundaries separating the classes therefore minimizingthe confusion between classes In practice the SVM has been applied to opticalcharacter recognition handwritten digit recognition and text categorization (Vapnik1995 Joachims 1998b) These experiments found the SVM to be competitive withthe best available classi cation methods including neural networks and decision treeclassi ers The superior performance of the SVM was also demonstrated in classifyinghyperspectral images acquired from the Airborne VisibleInfrared ImagingSpectrometer (AVIRIS) (Gualtieri and Cromp 1998) While hundreds of variableswere used as the input in the experiments mentioned above there are far fewervariables in data acquired from operational sensor systems such as Landsat theAdvanced Very High Resolution Radiometer (AVHRR) and the Moderate ResolutionImaging Spectroradiometer (MODIS) Because these are among the major sensorsystems from which land cover information is derived an evaluation of the perform-ance of the SVM using images from such sensor systems should have practicalimplications for land cover classi cation The purpose of this paper is to demonstratethe applicability of this algorithm to deriving land cover from such operationalsensor systems and systematically to evaluate its performances in comparison toother popular classi ers including the statistical maximum likelihood classi er(MLC) a back propagation neural network classi er (NNC) (Pao 1989) and adecision tree classi er (DTC) (Quinlan 1993) The SVM was implemented byJoachims (1998a) as SVMlight

A brief introduction to the theoretical development of the SVM is given in thefollowing section This is deemed necessary because the SVM is relatively new tothe remote sensing community as compared to the other three methods The dataset and experimental design are presented in sect3 Experimental results are discussedin the following three sections including impacts of kernel con guration on theperformance of the SVM comparative performances of the four classi ers and







(x1 y1 ) (xk y



wmiddotxi+b 1 for y

i=1 i=1 2 k (1)


i=shy 1 (2)



where w=(w1 w



yi[ w acutex

i+b] 1 i=1 k (3)






i= 1

aiy

i[ wacutex

i+b] shy 1 (5)

where ai


w= aeligk

i= 1

yiaix

i(6)

aeligk

i= 1aiyi=0 (7)


L(a)= aeligk

i= 1

aishy

1

2aeligk

i= 1

aeligk

j= 1

aiajyiyj(x

iacutex

j) (8)





w0= aeligk

i= 1

yia0ix

i(9)







yia0i(x







wacutexi+b 1 shy j

ifor y

i=1 (12)

wiacutex

i+b shy 1+j

ifor y

i=1 (13)

ji

0 i=1 k (14)



i= 1jiBl

(15)




W Rn H (16)


i)acuteW(x


that

K (xi x

j)=W(x

i)acuteW(x

j) (17)


L (a)= aeligk

i= 1aishy

1

2aeligk

i= 1aeligk

j= 1aiajyiyjK (x

i x

j) (18)



yia0iK (x

i x)shy b0B (19)




K (x1 x2)=(x1acutex2+1)p (20)
























NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References






























































(x1 y1 ) (xk y



wmiddotxi+b 1 for y

i=1 i=1 2 k (1)


i=shy 1 (2)



where w=(w1 w



yi[ w acutex

i+b] 1 i=1 k (3)






i= 1

aiy

i[ wacutex

i+b] shy 1 (5)

where ai


w= aeligk

i= 1

yiaix

i(6)

aeligk

i= 1aiyi=0 (7)


L(a)= aeligk

i= 1

aishy

1

2aeligk

i= 1

aeligk

j= 1

aiajyiyj(x

iacutex

j) (8)





w0= aeligk

i= 1

yia0ix

i(9)







yia0i(x







wacutexi+b 1 shy j

ifor y

i=1 (12)

wiacutex

i+b shy 1+j

ifor y

i=1 (13)

ji

0 i=1 k (14)



i= 1jiBl

(15)




W Rn H (16)


i)acuteW(x


that

K (xi x

j)=W(x

i)acuteW(x

j) (17)


L (a)= aeligk

i= 1aishy

1

2aeligk

i= 1aeligk

j= 1aiajyiyjK (x

i x

j) (18)



yia0iK (x

i x)shy b0B (19)




K (x1 x2)=(x1acutex2+1)p (20)
























NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References

























































where w=(w1 w



yi[ w acutex

i+b] 1 i=1 k (3)






i= 1

aiy

i[ wacutex

i+b] shy 1 (5)

where ai


w= aeligk

i= 1

yiaix

i(6)

aeligk

i= 1aiyi=0 (7)


L(a)= aeligk

i= 1

aishy

1

2aeligk

i= 1

aeligk

j= 1

aiajyiyj(x

iacutex

j) (8)





w0= aeligk

i= 1

yia0ix

i(9)







yia0i(x







wacutexi+b 1 shy j

ifor y

i=1 (12)

wiacutex

i+b shy 1+j

ifor y

i=1 (13)

ji

0 i=1 k (14)



i= 1jiBl

(15)




W Rn H (16)


i)acuteW(x


that

K (xi x

j)=W(x

i)acuteW(x

j) (17)


L (a)= aeligk

i= 1aishy

1

2aeligk

i= 1aeligk

j= 1aiajyiyjK (x

i x

j) (18)



yia0iK (x

i x)shy b0B (19)




K (x1 x2)=(x1acutex2+1)p (20)
























NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References



























































wacutexi+b 1 shy j

ifor y

i=1 (12)

wiacutex

i+b shy 1+j

ifor y

i=1 (13)

ji

0 i=1 k (14)



i= 1jiBl

(15)




W Rn H (16)


i)acuteW(x


that

K (xi x

j)=W(x

i)acuteW(x

j) (17)


L (a)= aeligk

i= 1aishy

1

2aeligk

i= 1aeligk

j= 1aiajyiyjK (x

i x

j) (18)



yia0iK (x

i x)shy b0B (19)




K (x1 x2)=(x1acutex2+1)p (20)
























NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References


























































K (x1 x2)=(x1acutex2+1)p (20)
























NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References









































































NDVI=NIR shy red

NIR+red(22)






4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References




























































4 3 37 4

6 3 57 6

8 3 77 8

10 3 97 10

20 3 117 12


4 3 157 16

6 3 177 18

8 3 197 20

10 3 217 22

20 3 237 24














(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References

































































(a) (b)

(c) (d)





(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References

























































(a) (b)

(c) (d)























10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References









































































10 194 396 shy 002 00220 255 226 shy 150 002

SVM vs DTC2 061 248 346 1654 233 shy 081 061 shy 1376 443 189 046 3018 458 225 451 152

10 270 458 246 52320 468 310 119 143


10 473 754 151 45120 632 503 339 386




10 202 297 shy 096 shy 0720 163 194 219 246


10 278 358 154 45020 376 277 493 253






SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References


























































SVM NNC DTC MLC









(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References

























































(a) (b)

(d)(c)
















Algorithm












164 184 189 229 54 08













References



































































Algorithm












164 184 189 229 54 08













References































































164 184 189 229 54 08













References



































































References
























































an assessment of support vector machines for land cover classification

Documents