
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 1

Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework

Zilong Zhong, Student Member, IEEE, Jonathan Li, Senior Member, IEEE, Zhiming Luo, and Michael Chapman

Abstract: In this paper, we designed an end-to-end spectral–spatial residual network (SSRN) that takes raw 3-D cubes as input data without feature engineering for hyperspectral image classification. In this network, the spectral and spatial residual blocks consecutively learn discriminative features from abundant spectral signatures and spatial contexts in hyperspectral imagery (HSI). The proposed SSRN is a supervised deep learning framework that alleviates the declining-accuracy phenomenon of other deep learning models. Specifically, the residual blocks connect every other 3-D convolutional layer through identity mapping, which facilitates the backpropagation of gradients. Furthermore, we impose batch normalization on every convolutional layer to regularize the learning process and improve the classification performance of trained models. Quantitative and qualitative results demonstrate that the SSRN achieved the state-of-the-art HSI classification accuracy in agricultural, rural–urban, and urban data sets: Indian Pines, Kennedy Space Center, and University of Pavia.

Index Terms: 3-D deep learning, hyperspectral image classification, spectral–spatial feature extraction, spectral–spatial residual network (SSRN).

I. INTRODUCTION

CLASSIFYING every pixel with a certain land-cover type is the cornerstone of hyperspectral image analysis, which spans a broad range of applications, including image segmentation, object recognition, land-cover mapping, and anomaly detection [1]–[4]. Two major characteristics of hyperspectral imagery (HSI) should be taken into account to obtain discriminative features for HSI classification. First, abundant spectral information, which derives from hundreds of contiguous spectral bands, makes the accurate identification of corresponding ground materials possible [5]. Second, high spatial correlation, which originates from homogeneous areas in HSI, provides complementary information to spectral features for precise mapping [6].

Manuscript received March 14, 2017; revised May 28, 2017 and July 25, 2017; accepted August 20, 2017. This work was supported by the China Scholarship Council. (Corresponding author: Jonathan Li.)

Z. Zhong is with the Department of Geography and Environmental Management, University of Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

J. Li is with the Department of Geography and Environmental Management, University of Waterloo, ON N2L 3G1, Canada, and also with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Xiamen 361005, China (e-mail: [email protected]).

Z. Luo is with the Department of Cognitive Science, Xiamen University, Xiamen 361005, China, and also with the Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K 2R1, Canada.

M. Chapman is with the Department of Civil Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2017.2755542

To take advantage of abundant spectral bands, traditional pixelwise HSI classification models mainly concentrate on two steps: feature engineering and classifier training. Feature engineering methods include feature selection (band selection) and feature extraction [7]. The main objectives of feature engineering are to reduce the high dimensionality of HSI pixels and extract the most discriminative features or bands. Next, general-purpose classifiers are trained using the discriminative features obtained from the feature engineering step. Feature extraction approaches usually learn representative features through nonlinear transformation. For example, [8] integrated multiple features derived from different kinds of dimensionality reduction methods to train support vector machine (SVM) classifiers. Unlike feature extraction, feature selection methods try to find the most representative features from raw HSIs without transforming them, so as to retain their physical meaning. For instance, [9] adopted manifold ranking as an unsupervised feature selection method, which chooses the most representative bands for training the classifiers that follow. Moreover, a multitask joint sparse representation-based method [10] integrated a band selection method with a smooth prior imposed by the Markov random field. These two band selection-based paradigms used spectral bands from all available pixels for feature selection and can be interpreted as semisupervised learning methods.

On the other hand, there are two ways to incorporate spatial information for HSI classification: spatialized input and postprocessing. The spatialized input methods impose a feature engineering step on 3-D cubes obtained from HSI. Many papers suggested that methods expanding input data with more spatial information can improve the classification performance [11], [12]. Among these methods, SVMs are the most commonly used classifiers for HSI classification, because SVMs perform robustly with high-dimensional input data [13], [14]. For example, [15] employed a region-based kernel to extract spectral–spatial features on which the learned SVM classifier identifies the categories of hyperspectral pixels. In contrast, the postprocessing approaches have taken into consideration the prior knowledge of smoothness, namely that neighboring pixels with similar spectral information are likely to belong to the same land-cover categories. For instance, [16] incorporated a probabilistic graphical model as the postprocessing step to improve the classification outcomes of kernel SVMs. Although many works use typical classification frameworks, which are composed of feature extractors followed by trainable classifiers, they suffer from two drawbacks. First, the feature engineering step normally does not generalize well to other scenarios. Second, the de facto one-layer nonlinear transformation (e.g., kernel methods) being applied before the linear classifiers has limited representation capacity to fully utilize the abundant spectral and spatial features.

0196-2892 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

https://orcid.org/0000-0001-7899-0049

In the face of these shortcomings of feature engineering-based frameworks, supervised deep learning models have attracted increased attention, due to the fact that the objective functions of deep learning models directly focus on classification in lieu of two independent steps. The fundamental philosophy of deep learning is to let the trained model itself decide which features are more important than other features with fewer constraints imposed by human beings. In other words, deep learning frameworks simultaneously learn feature representation and corresponding classifiers through the training process. Furthermore, multilayer neural networks can extract robust and discriminative features of HSI and outperform SVMs [17], [18]. For example, the stacked autoencoders (SAEs) were used as feature extractors to capture the representative stacked spectral and spatial features with a greedy layerwise pretraining strategy [17]. Similarly, the potential of deep belief networks for HSI classification was explored in [18]. However, both models suffer from the same problem of spatial information loss, which is caused by the requirement for 1-D input data.

Recently, convolutional neural networks (CNNs) and their extensions have obtained unprecedented advances in computer vision tasks [19], [20]. Multiple papers have demonstrated that CNNs can deliver the state-of-the-art results using spatialized input for HSI classification [21]–[23]. For example, [23] used CNNs to extract spatial features, which were integrated with spectral features learned from balanced local discriminant embedding, for HSI classification. However, the input of these CNN models consists of three principal components of the original HSIs, which means that the spatial feature extraction process still loses some spectral–spatial information. A CNN-based feature extractor was proposed in [21], which can learn discriminative representations from pixel pairs and use a voting strategy to smooth final classification maps. In addition, 3-D CNNs were adopted to extract deep spectral–spatial features directly from raw HSIs and delivered promising classification outcomes [22]. Similarly, [24] further studied 3-D CNNs for spectral–spatial classification using input cubes of HSIs with a smaller spatial size. These models generate thematic maps using an approach that can directly process raw HSIs, but the classification accuracy of the CNN models decreases when the network becomes deeper.

To resolve this problem, inspired by [25], we proposed a supervised spectral–spatial residual network (SSRN)¹ with consecutive learning blocks that takes the characteristics of HSI into account. The designed spectral and spatial residual blocks extract discriminative spectral–spatial features from HSI cubes and can be regarded as an extension of convolutional layers in CNNs. The SSRN has a deeper structure than those of the 3-D CNNs used in [21]–[24], and contains shortcut connections between every other convolutional layer. Hence, the SSRN can learn robust spectral–spatial representations from original HSIs. Similar to the SSRN, [26] incorporated residual learning with fully convolutional layers to form a contextual CNN. However, this method fails to distinguish spectral features and spatial features. Thus, this paper investigates the effectiveness of two types of residual architecture toward the spectral–spatial feature learning for HSI classification and their robustness in different scenarios.

¹ https://github.com/zilongzhong/SSRN

Compared with the large amount of annotated data in the computer vision and pattern recognition communities, which plays a significant role in the unprecedented success achieved by deep learning models [20], the available amount of training and testing samples in the widely studied HSI data sets is relatively small. Moreover, the unbalanced amounts of differently labeled samples undermine the accuracy of HSI classification. In addition, the input data of SSRN are 3-D cubes of raw HSI, and the multidimensional input data bring more challenges. Therefore, this paper aims to study the generalization ability of the SSRN on HSI data sets with large and small training sizes, high and medium spatial resolution, and various land-cover types with uneven samples for different categories.

The four major contributions of this paper are listed as follows.

1) The designed SSRN adopts residual connections to mitigate the decreasing-accuracy phenomenon and improve the HSI classification accuracy.

2) Two consecutive residual blocks learn spectral and spatial representations separately, through which more discriminative features can be extracted.

3) This paper validates the effectiveness of batch normalization (BN) as a regularization method to improve classification outcomes using unbalanced training samples.

4) The uniform architecture design makes the SSRN a framework that generalizes well in three commonly studied HSI data sets. More importantly, the SSRN achieves the state-of-the-art classification accuracy using limited training data with a fixed spatial size.

The rest of this paper is organized as follows. Section II describes two types of residual block and introduces the detailed architecture of SSRN. The network configuration and experimental results are reported, and some discussions are offered in Section III. Some conclusions are drawn in Section IV.

II. PROPOSED FRAMEWORK

Fig. 1. SSRN-based framework for HSI classification. (Top) Training group $Z^1$ and their corresponding labels $Y^1$ are used for updating the parameters of the network. The validation group $Z^2$ and their corresponding labels $Y^2$ are used for monitoring the interim models generated in the training stage. (Bottom) Testing group $Z^3$ is employed for assessing the optimal trained network.

Fig. 1 shows the whole deep learning framework of HSI classification based on the SSRN. In this framework, all available annotated data are separated into three groups: training, validation, and testing groups for each data set. Suppose the HSI data set $X$ contains $N$ labeled pixels $\{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times b}$ and $Y = \{y_1, y_2, \ldots, y_N\} \in \mathbb{R}^{1 \times 1 \times L}$ is the set of corresponding one-hot label vectors, where $b$ and $L$ represent the numbers of spectral bands and land-cover categories, respectively. Neighboring cubes centered at pixels in $X$ form a new data set $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{w \times w \times b}$. To fully utilize the spectral and spatial information provided by HSIs, the proposed networks take cubes of size $w \times w \times b$ from raw data as input, where $w$ is the spatial width of the 3-D cubes in training group $Z^1$, validation group $Z^2$, and testing group $Z^3$ in Fig. 1. Their corresponding label vector sets are $Y^1$, $Y^2$, and $Y^3$. For example, the size of HSI cubes for the Indian Pines (IN) data set is $7 \times 7 \times 200$. Therefore, the objective of the training process is to update the parameters of the SSRN until the model can make high-accuracy predictions $\hat{Y}^3$ with regard to the ground-truth labels $Y^3$ given the neighboring cubes $Z^3$.
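The construction of the cube set $Z$ from the labeled pixels can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `extract_cubes` is hypothetical, and zero-padding the image border so that edge pixels also receive a full $w \times w$ neighborhood is an assumption, since the text does not state how border pixels are handled.

```python
import numpy as np


def extract_cubes(image, pixel_coords, w):
    """Return one w x w x b cube per labeled pixel.

    image:        array of shape (H, W, b), the hyperspectral image.
    pixel_coords: iterable of (row, col) positions of labeled pixels.
    w:            odd spatial width of each cube.
    """
    if w % 2 != 1:
        raise ValueError("w must be odd so that each cube has a center pixel")
    r = w // 2
    # Assumption: zero-pad the two spatial axes so border pixels get full cubes.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="constant")
    # Pixel (i, j) sits at (i + r, j + r) in the padded image, so the window
    # padded[i:i + w, j:j + w] is centered on it.
    cubes = [padded[i:i + w, j:j + w, :] for i, j in pixel_coords]
    return np.stack(cubes)
```

For the IN example in the text, calling this with `w=7` on an image with 200 bands would yield cubes of shape `(7, 7, 200)`.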

After the architecture of deep learning models is built and the hyperparameters for training are configured, the models are trained for hundreds of epochs using the training group $Z^1$ and their ground-truth label vector set $Y^1$. In this process, the parameters of the SSRN are updated through backpropagating the gradients of the cross-entropy objective function in (1), which represents the difference between the predicted label vector $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$ and the ground-truth label vector $y = [y_1, y_2, \ldots, y_L]$

$$C(\hat{y}, y) = \sum_{i=1}^{L} y_i \left( \log \sum_{j=1}^{L} e^{\hat{y}_j} - \hat{y}_i \right). \tag{1}$$
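The objective in (1) is the usual softmax cross-entropy written in terms of the raw network outputs $\hat{y}$. A minimal sketch of its evaluation for a single sample is given below; the function name `cross_entropy` and the max-subtraction step (a common numerical-stability trick that leaves the value unchanged) are illustrative choices, not taken from the paper.

```python
import math


def cross_entropy(y_hat, y):
    """Eq. (1): sum_i y_i * (log(sum_j exp(y_hat_j)) - y_hat_i).

    y_hat: list of L raw output scores; y: one-hot label list of length L.
    """
    # Compute log(sum_j exp(y_hat_j)) after shifting by the maximum score,
    # which avoids overflow and does not change the result.
    m = max(y_hat)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in y_hat))
    return sum(y_i * (log_sum_exp - y_hat_i) for y_i, y_hat_i in zip(y, y_hat))
```

For a one-hot $y$, this equals the negative log of the softmax probability assigned to the true class.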

The validation group $Z^2$ is used for monitoring the training process by measuring the classification performance of interim models, which are intermediate networks generated during the training stage, to select the network with the highest classification accuracy. Finally, the testing group $Z^3$ is employed for assessing the generalizability of the trained SSRN through calculating classification metrics and visualizing thematic maps.

A. Three-Dimensional Convolutional Layer With Batch Normalization

Deep learning models consist of multiple layers of nonlinear neurons that can learn hierarchical representations through a large number of labeled images [19]. CNNs have achieved or surpassed human-level intelligence in several perception tasks [20], [27], because convolutional layers enable CNNs to learn more discriminative features with sparsity constraint.

In this paper, 3-D convolutional layers are adopted as the basic element of the SSRN. In addition, BN [28] is conducted at every convolutional layer in SSRN. This strategy makes the training process of deep learning models more efficient. As shown in Fig. 2, if the $(k+1)$th 3-D convolutional layer has $n^k$ input feature cubes of size $w^k \times w^k \times d^k$, a convolutional filter bank that contains $n^{k+1}$ convolutional filters of size $a^{k+1} \times a^{k+1} \times m^{k+1}$, and subsampling strides of $(s_1, s_1, s_2)$ for the convolutional operation, then this layer generates $n^{k+1}$ output feature cubes of size $w^{k+1} \times w^{k+1} \times d^{k+1}$, where the spatial width $w^{k+1} = \lfloor 1 + (w^k - a^{k+1})/s_1 \rfloor$ and the spectral depth $d^{k+1} = \lfloor 1 + (d^k - m^{k+1})/s_2 \rfloor$. The $i$th output of the $(k+1)$th 3-D convolutional layer with BN (CONVBN) can be formulated as

$$X_i^{k+1} = R\left( \sum_{j=1}^{n^k} \hat{X}_j^k * H_i^{k+1} + b_i^{k+1} \right) \tag{2}$$

$$\hat{X}^k = \frac{X^k - E(X^k)}{\mathrm{Var}(X^k)} \tag{3}$$

where $\hat{X}_j^k \in \mathbb{R}^{w \times w \times d}$ is the $j$th input feature tensor of the $(k+1)$th layer, $\hat{X}^k$ is the normalization result of the batch feature cubes $X^k$ in the $k$th layer, and $E(\cdot)$ and $\mathrm{Var}(\cdot)$ represent the expectation and variance function of the input feature tensor, respectively. $H_i^{k+1}$ and $b_i^{k+1}$ denote the parameters and bias of the $i$th convolutional filter bank in the $(k+1)$th layer, $*$ represents the 3-D convolutional operation, and $R(\cdot)$ is the rectified linear unit activation function that sets elements with negative numbers to zero.
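The output-size rule stated above, $\lfloor 1 + (\text{input} - \text{kernel})/\text{stride} \rfloor$ along each axis, can be checked with a few lines of code. The helper name `conv_output_size` is hypothetical; the sketch assumes integer sizes, no padding, and an input at least as large as the kernel.

```python
def conv_output_size(in_size, kernel_size, stride=1):
    """Output length along one axis of an unpadded strided convolution.

    Implements floor(1 + (in_size - kernel_size) / stride) for integers.
    """
    if in_size < kernel_size:
        raise ValueError("kernel does not fit inside the input")
    # Integer floor division gives the floor directly for non-negative values.
    return 1 + (in_size - kernel_size) // stride
```

With the values used later for the IN data set, a spectral depth of 200 with a kernel depth of 7 and stride 2 gives `conv_output_size(200, 7, 2) == 97`, and a spatial width of 7 with a 3-wide kernel and stride 1 gives `conv_output_size(7, 3, 1) == 5`.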



Fig. 2. Three-dimensional CONVBN. The $(k+1)$th layer conducts a 3-D convolution of input feature cubes $X^k$ and a convolutional filter bank $H^{k+1}$ and generates output feature cubes $X^{k+1}$.

B. Spectral and Spatial Residual Blocks

Although CNN models have been used for HSI classification and achieved the state-of-the-art results, it is counterintuitive that, after several layers, the classification accuracy decreases with the increase of convolutional layers [22]. This phenomenon stems from the fact that the representation capacity of CNNs is too high compared with the relatively small number of training samples under the same regularization settings. However, this decreasing-accuracy issue can be alleviated by adding shortcut connections between every other layer to build residual blocks [25]. To this end, we designed two residual blocks in a general architecture to consecutively extract spectral and spatial features from raw 3-D HSI cubes, owing to the high spectral resolution and high spatial correlation of HSI. As shown in Fig. 3, a residual block can be regarded as an extension of two convolutional layers. This architecture enables gradients in higher layers to propagate rapidly back to the lower layers, thereby facilitating and regularizing the model training process.

In the spectral residual blocks, as shown in Fig. 3, convolutional kernels of size $1 \times 1 \times m$ are used in the successive filter banks $h^{p+1}$ and $h^{p+2}$ for the $(p+1)$th and $(p+2)$th layers, respectively. At the same time, the spatial size of the 3-D feature cubes $X^{p+1}$ and $X^{p+2}$ is kept unchanged at $w \times w$ through a padding strategy, which means that the output feature cubes copy the values from the border area to the padding area after the convolutional operation in the spectral dimension. Then, these two convolutional layers build a residual function $F(X^p; \theta)$ instead of directly mapping $X^p$ using a skip connection. The spectral residual architecture can be formulated as follows:

$$X^{p+2} = X^p + F(X^p; \theta) \tag{4}$$

$$F(X^p; \theta) = R(X^{p+1}) * h^{p+2} + b^{p+2} \tag{5}$$

$$X^{p+1} = R(X^p) * h^{p+1} + b^{p+1} \tag{6}$$

where $\theta = \{h^{p+1}, h^{p+2}, b^{p+1}, b^{p+2}\}$, $X^{p+1}$ represents the $n$ input 3-D feature cubes of the $(p+1)$th layer, and $h^{p+1}$ and $b^{p+1}$ denote the spectral convolutional kernels and bias in the $(p+1)$th layer, respectively. In fact, the convolutional kernels $h^{p+1}$ and $h^{p+2}$ are composed of 1-D vectors, which can be regarded as a special case of 3-D convolutional kernels. The output tensor of the spectral residual block also includes $n$ 3-D feature cubes.
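Equations (4)–(6) can be traced with a small NumPy sketch. Several choices here are assumptions made for illustration only: the function names are hypothetical, BN is omitted, the kernel length $m$ is taken to be odd, the convolution is implemented as cross-correlation (as is common in deep learning libraries), and the spectral axis is padded by replicating border values, which is one possible reading of the padding strategy described above.

```python
import numpy as np


def relu(x):
    """R(.) in the text: set negative elements to zero."""
    return np.maximum(x, 0.0)


def spectral_conv(x, h, b):
    """Apply 1 x 1 x m kernels along the spectral axis, keeping the depth.

    x: (n_in, w, w, d) feature cubes.
    h: (n_out, n_in, m) spectral kernels with odd m.
    b: (n_out,) biases.
    """
    m = h.shape[2]
    d = x.shape[3]
    pad = m // 2
    # Assumption: replicate border values along the spectral axis only.
    xp = np.pad(x, ((0, 0), (0, 0), (0, 0), (pad, pad)), mode="edge")
    out = np.zeros((h.shape[0],) + x.shape[1:])
    for t in range(m):
        # Mix input cubes with the t-th kernel tap for every output cube.
        out += np.einsum("oi,iwvd->owvd", h[:, :, t], xp[..., t:t + d])
    return out + b[:, None, None, None]


def spectral_residual_block(x, h1, b1, h2, b2):
    x1 = spectral_conv(relu(x), h1, b1)   # Eq. (6)
    f = spectral_conv(relu(x1), h2, b2)   # Eq. (5)
    return x + f                          # Eq. (4): identity skip connection
```

Because the skip connection adds $X^p$ unchanged, a block whose kernels and biases are all zero reduces to the identity mapping, which is the property that eases gradient propagation in deeper networks.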

In the spatial residual block, as shown in Fig. 4, a focus is primarily placed on the spatial feature extraction using $n$ 3-D convolutional kernels of size $a \times a \times d$ in the successive filter banks $H^{q+1}$ and $H^{q+2}$ for the two successive layers. The spectral depth $d$ of these kernels equals that of the input 3-D feature cubes $X^q$. The spatial size of feature cubes $X^{q+1}$ and $X^{q+2}$ is kept unchanged at $w \times w$. Thus, the spatial residual architecture can be formulated as follows:

$$X^{q+2} = X^q + F(X^q; \xi) \tag{7}$$

$$F(X^q; \xi) = R(X^{q+1}) * H^{q+2} + b^{q+2} \tag{8}$$

$$X^{q+1} = R(X^q) * H^{q+1} + b^{q+1} \tag{9}$$

where $\xi = \{H^{q+1}, H^{q+2}, b^{q+1}, b^{q+2}\}$, $X^{q+1}$ represents the 3-D input feature volume in the $(q+1)$th layer, and $H^{q+1}$ and $b^{q+1}$ denote the $n$ spatial convolutional kernels and bias in the $(q+1)$th layer, respectively. Compared with their spectral counterparts, the convolutional filter banks in spatial residual blocks comprise 3-D tensors. The output of this block is a 3-D feature volume.

Fig. 3. Spectral residual block for spectral feature learning. This block includes two successive 3-D convolutional layers, and a skip connection directly adds input feature cubes $X^p$ to output feature cubes $X^{p+2}$.

Fig. 4. Spatial residual block for spatial feature learning. This block includes two successive 3-D convolutional layers, and a skip connection directly adds input feature cubes $X^q$ to the output feature cubes $X^{q+2}$.

C. Spectral–Spatial Residual Network

Considering that HSIs contain one spectral dimension and two spatial dimensions, we proposed a framework that consecutively extracts spectral and spatial features for pixelwise HSI classification. As shown in Fig. 5, the SSRN includes a spectral feature learning section, a spatial feature learning section, an average pooling layer, and a fully connected (FC) layer. Compared with CNNs, the SSRN alleviates the decreasing-accuracy phenomenon by adding skip connections between every other layer, which reformulates the hierarchical feature representation layers as consecutive residual blocks. We take the IN data set, the 3-D samples of which have the size of 7 × 7 × 200, as an example to explain the designed SSRN.

The spectral feature learning section includes two convolutional layers and two spectral residual blocks. In the first convolutional layer, 24 1 × 1 × 7 spectral kernels with a subsampling stride of (1, 1, 2) convolve the input HSI volume to generate 24 7 × 7 × 97 feature cubes. Because the raw input data contain rich and redundant spectral information, 1 × 1 × 7 vector kernels are used in this section. This layer reduces the high dimensionality of input cubes and extracts low-level spectral features of HSI. Then, two consecutive spectral residual blocks, which contain four convolutional layers and two identity mappings, use 24 1 × 1 × 7 vector kernels at each convolutional layer to learn deep spectral representation. In the spectral residual blocks, all convolutional layers use padding to keep the sizes of output feature cubes the same as those of the input. Following the spectral residual blocks, the last convolutional layer in this learning section, which includes 128 1 × 1 × 97 spectral kernels for keeping discriminative spectral features, convolves the 24 7 × 7 × 97 feature tensors to produce a 7 × 7 × 128 feature volume as input for the spatial feature learning section.

Fig. 5. SSRN with a 7 × 7 × 200 input HSI volume. The network includes two spectral and two spatial residual blocks. An average pooling layer and an FC layer transform a 5 × 5 × 24 spectral–spatial feature volume into a 1 × 1 × L output feature vector $\hat{y}$.

The spatial feature learning section extracts discriminative spatial features using successive 3-D convolutional filter banks, where the kernels have the same depth as the input 3-D feature volume. The section comprises a 3-D convolutional layer and two spatial residual blocks. The first convolutional layer in this section reduces the spatial size of input feature cubes and extracts low-level spatial features with 24 3 × 3 × 128 spatial kernels, resulting in an output 5 × 5 × 24 feature tensor. Then, similar to their spectral counterparts, the two spatial residual blocks learn deep spatial representation with four convolutional layers, all of which use 24 3 × 3 × 24 spatial kernels and keep the sizes of feature cubes unchanged.

After the above two feature learning sections, an average pooling layer (POOL) transforms the extracted 5 × 5 × 24 spectral–spatial feature volume to a 1 × 1 × 24 feature vector. Next, an FC layer adapts the SSRN to the HSI data set according to the number of land-cover categories and generates an output vector $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$. The total number of trainable parameters (about 360 000) of the SSRN is much larger than the number of available training samples in the three hyperspectral data sets, which means that the network possesses enough capacity to learn the feature representations of HSI but also tends to overfit the training sets. Therefore, BN and dropout [29] are investigated as regularization strategies to further improve the classification performance of SSRN.
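The layer sizes and the parameter count quoted above can be cross-checked with a short tally. This is a sketch under stated assumptions, not the authors' accounting: all function names are hypothetical, BN parameters are excluded, each kernel carries one bias, and the outputs of the 128 (or 24) kernels are treated as the depth of a single volume fed to the next spatial layer, which is our reading of the 7 × 7 × 128 and 5 × 5 × 24 volumes in the text.

```python
def conv_output_size(in_size, kernel_size, stride=1):
    """floor(1 + (in_size - kernel_size) / stride) for an unpadded convolution."""
    return 1 + (in_size - kernel_size) // stride


def conv_params(n_filters, in_channels, kernel_shape):
    """Weights plus one bias per filter for a 3-D convolutional layer."""
    a, b, c = kernel_shape
    return n_filters * (in_channels * a * b * c + 1)


def ssrn_summary(w=7, bands=200, n_classes=16, k=24, mid=128):
    """Return (spectral depth, spatial width, per-section parameter counts)."""
    depth = conv_output_size(bands, 7, 2)    # after the first spectral layer
    spatial = conv_output_size(w, 3, 1)      # after the first spatial layer
    params = [
        ("first spectral conv", conv_params(k, 1, (1, 1, 7))),
        ("spectral residual blocks", 4 * conv_params(k, k, (1, 1, 7))),
        ("last spectral conv", conv_params(mid, k, (1, 1, depth))),
        ("first spatial conv", conv_params(k, 1, (3, 3, mid))),
        ("spatial residual blocks", 4 * conv_params(k, 1, (3, 3, k))),
        ("fully connected", k * n_classes + n_classes),
    ]
    return depth, spatial, params
```

Under these assumptions, `sum(count for _, count in ssrn_summary()[2])` evaluates to 363 432 for the IN configuration, which is in line with the figure of about 360 000 given in the text; most of the parameters sit in the last spectral convolutional layer.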

III. RESULTS AND DISCUSSION

In this section, we introduced the three HSI data sets, specified the model configuring process, and evaluated the proposed methods using classification metrics, such as overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ). We adopted the IN, Kennedy Space Center (KSC), and University of Pavia (UP) data sets for assessing the classification performance of the SSRN framework in the cases of unbalanced training data, a small number of training samples, and high spatial resolution. In all three cases, we ran each experiment ten times with randomly selected training data and reported the mean and standard deviation of the main classification metrics.
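For reference, the three metrics can be computed from a confusion matrix as sketched below. The definitions are the standard ones (OA as the fraction of correctly classified samples, AA as the mean of per-class accuracies, and Cohen's kappa as agreement corrected for chance); the function name is hypothetical, and the sketch assumes every class has at least one test sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    diag = [confusion[i][i] for i in range(n_classes)]

    oa = sum(diag) / total                                     # overall accuracy
    aa = sum(d / r for d, r in zip(diag, row_sums)) / n_classes  # mean class accuracy
    # Expected agreement by chance, from the row and column marginals.
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

AA weights every class equally, so it is more sensitive than OA to errors on small classes, which matters for unbalanced data sets such as IN.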

A. Experimental Data Sets

The IN data set, gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 from Northwest Indiana, includes 16 vegetation classes and has 145 × 145 pixels with a resolution of 20 m per pixel. Once the 20 bands corrupted by water absorption effects have been discarded, the remaining 200 bands are adopted for analysis and range from 400 to 2500 nm.

The KSC data set, collected by AVIRIS in Florida in 1996, contains 13 upland and wetland classes and has 512 × 614 pixels with a resolution of 18 m per pixel. Once the bands with low signal-to-noise ratio have been removed, the remaining 176 bands are used for assessment and range from 400 to 2500 nm.

The UP data set, acquired by the Reflective Optics System Imaging Spectrometer in Northern Italy in 2001, contains nine urban land-cover types and has 610 × 340 pixels with a resolution of 1.3 m per pixel. Once the noisy bands have been discarded, the remaining 103 bands are employed for evaluation and range from 430 to 860 nm.



TABLE I. TRAINING, VALIDATION, AND TESTING NUMBERS IN THE IN DATA SET

TABLE II. TRAINING, VALIDATION, AND TESTING NUMBERS IN THE KSC DATA SET

In the IN and KSC data sets, 20%, 10%, and 70% of the labeled data are randomly assigned to training, validation, and testing groups, respectively. In the UP data set, the ratio is 10%:10%:80%. In addition, all input data of the three HSI data sets are standardized to zero mean and unit variance. Tables I–III list the numbers of samples in the three groups for all data sets.
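A random assignment of labeled samples to the three groups can be sketched as follows. Splitting within each class (a stratified split) is an assumption made here so that small classes are represented in every group; the text only states that the labeled data are randomly assigned. The function name, the rounding rule, and the fixed seed are likewise illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac, val_frac, seed=0):
    """Split sample indices into (train, val, test) lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        n_val = round(val_frac * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # whatever remains is for testing
    return train, val, test
```

For the IN and KSC ratios one would call `stratified_split(labels, 0.2, 0.1)`, and for UP `stratified_split(labels, 0.1, 0.1)`; repeating the call with different seeds gives the randomly selected training sets used across repeated runs.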

B. Framework Setting

After designing the SSRN framework, we configured the training process that updates the parameters of 3-D filter banks through backpropagating the gradients of the cost function. Next, we analyzed four factors that control the training process and classification performance of the trained SSRN. The four factors are the learning rate, the kernel number of convolutional layers, the regularization method, and the spatial size of the input cubes. Because the training sets are small, we set the batch size to 16 and adopted the RMSProp optimizer [30] to harness the training process. In the training process of each configuration, the models with the highest classification performance in validation groups were preserved, and the reported results were generated by these optimal models.

TABLE III. TRAINING, VALIDATION, AND TESTING NUMBERS IN THE UP DATA SET

Fig. 6. OA (%) of SSRNs with different kernel numbers in the IN, KSC, and UP data sets.

TABLE IV. OA (%) OF SSRN WITH DIFFERENT REGULARIZERS

First, learning rates control the learning step for each training iteration. Specifically, inappropriate learning rate settings will lead to divergence or slow convergence. Therefore, we used the grid search method and ran each experiment for 200 epochs to find the optimum learning rate from {0.01, 0.003, 0.001, 0.0003, 0.0001, 0.00003} for each data set. Based on the classification outcomes, the optimum learning rates for the IN, KSC, and UP data sets are 0.0003, 0.0001, and 0.0003, respectively.

Second, the kernel numbers of convolutional filter banks decide the representation capacity and computational consumption of SSRN. As shown in Fig. 5, the proposed network has the same kernel number in each convolutional layer of the spectral and spatial residual blocks. We assessed different kernel numbers from 8 to 32 in an interval of 8 in each convolutional layer to find a general framework. As shown in Fig. 6, the models with 24 kernels in each convolutional filter bank achieved the highest classification accuracy in the IN and UP data sets, and the model with 16 kernels obtained the best performance in the KSC data set. These results are acquired in 200-epoch training processes for each setting in the three data sets.

TABLE V. OA (%) OF SSRN WITH DIFFERENT INPUT SIZES

TABLE VI. CLASSIFICATION RESULTS OF DIFFERENT METHODS FOR THE IN DATA SET

Third, given that there are more parameters than training samples and deep learning models tend to overfit training data, BN and a 50% dropout can be used for regularizing the training process. Hence, we evaluated the models without any regularization method, with dropout, with BN, and with both dropout and BN under the same condition for 200-epoch training. As shown in Table IV, BN outperforms dropout in terms of mean overall classification accuracy. More importantly, the SSRN performs the best when using both regularization strategies in all three HSI data sets.

Fourth, to evaluate the influence of the spatialized input, we tested the proposed models with respect to input cubes of different spatial sizes. Table V shows that the proposed SSRNs perform robustly for different spatial sizes if these sizes are equal to or larger than 7 × 7, because the SSRN learns discriminative spatial features of input data. In all three data sets, the classification accuracy increases with the spatial size of input cubes. The important role of spatial context that this experiment demonstrated is in accordance with results in other publications [3], [15]. Considering that larger input sizes lead to higher classification accuracy, we fixed the spatial size of input HSI data to make a fair comparison between different classification methods.

TABLE VII. CLASSIFICATION RESULTS OF DIFFERENT METHODS FOR THE KSC DATA SET

TABLE VIII. CLASSIFICATION RESULTS OF DIFFERENT METHODS FOR THE UP DATA SET



Fig. 7. Classification results of the best models for the IN data set. (a) False color image. (b) Ground-truth labels. (c)–(i) Classification results of SVM,SAE, CNN, CNNL, SPA, SPC, and SSRN.

Fig. 8. Classification results of the best models for the KSC data set. (a) False color image. (b) Ground-truth labels. (c)–(i) Classification results of SVM,SAE, CNN, CNNL, SPA, SPC, and SSRN.

C. Classification Results

We compared the SSRN with the kernel SVM [31] and state-of-the-art deep learning models, such as SAE [17] and 3-D CNN [22]. To demonstrate the effectiveness of the spectral and spatial residual blocks in the proposed framework, we also tested the networks that only contain the spectral feature learning part (SPC) and the ones that only contain the spatial feature learning part (SPA). Moreover, we evaluated the longer versions of 3-D CNN (denoted as CNNL) generated from the SPA models without skip connections to study the effect of the designed spatial residual architecture on the decreasing-accuracy phenomenon [22]. To make a fair comparison, we set the same input volume size of 7 × 7 × b for all methods and tuned these competitors to their optimal settings. We randomly selected 20%, 20%, and 10% of the labeled 3-D HSI cubes as training groups for the IN, KSC, and UP data sets, respectively.

Tables VI–VIII report the OAs, AAs, kappa coefficients, and the classification accuracies of all classes for HSI classification. In all three cases, the SSRN achieved the highest classification accuracy and lower standard deviation than the 3-D CNN. For example, in the KSC data set, SSRN (99.61%) delivered an increase of roughly 2.5 percentage points in mean overall classification accuracy compared with CNN (97.08%). All deep learning methods generated clearly better outcomes than the kernel SVM. In all three data sets, the classification results of CNNL were worse than those of CNN. On the other hand, the SPA performed better than the CNN. These outcomes showed that the proposed spatial residual structures mitigate the declining-accuracy phenomenon. Furthermore, the SSRN consistently performed better than the SPA, because the spectral residual blocks learned spectral representations that are complementary to spatial features. Although there are few training samples for the oats and grass-pasture-mowed classes in the IN data set, the SSRN classified the testing data of these classes with a mean classification accuracy higher than 98%. These results validated the robustness of the designed models in the face of difficult conditions.

Figs. 7–9 visualize the classification results of the best trained models for the three data sets, along with the false color images of the original HSI and their corresponding ground-truth maps. In all three cases, the qualitative comparison between different methods is in line with the quantitative comparison in Tables VI–VIII. The SPC generated classification maps with considerable noise. The SPA generated smoother results, but some dot noise remains in some classes. For example, the SPA reduced the speckles in the Wheat class of the IN data set and the Bare Soil class of the UP data set. Compared with other methods, the SSRN delivered the most accurate and smooth classification maps for all three HSIs, because the SSRN learned discriminative spectral and spatial features consecutively.

Fig. 9. Classification results of the best models for the UP data set. (a) False color image. (b) Ground-truth labels. (c)–(i) Classification results of SVM, SAE, CNN, CNNL, SPA, SPC, and SSRN.

Fig. 10. OA of different methods with different training data percentages. (a) IN data set. (b) KSC data set. (c) UP data set.

To test the robustness and generalizability of the proposed SSRN toward different numbers of training samples, 5%, 10%, 15%, and 20% of the labeled samples were randomly chosen as training data for the IN and KSC data sets, and 4%, 6%, 8%, and 10% for the UP data set. Fig. 10 illustrates the overall accuracies of different classifiers using different numbers of training data. For a small number of training samples, where the SVM generated inferior OA, the SSRN still produced high classification accuracy, and its advantage over the other methods is more evident, because the SSRN extracts more discriminative features. For a large number of training samples, the SSRN still generates the best classification outcomes in all three HSI data sets, but the improvements are less pronounced, simply because the classification accuracy is already very high (higher than 99% OA).

To further validate the effectiveness of residual blocks for mitigating the accuracy-decreasing phenomenon, SSRN models with varying numbers of residual blocks were constructed for classifying 3-D HSI data. We tested SSRNs with two to five blocks and treated spectral and spatial residual blocks differently, using the same settings as in Tables VI–VIII. In Fig. 11, the overall classification accuracy differences between the deeper SSRNs and their shallower counterparts are negligible. Therefore, in contrast to the obvious accuracy-decreasing effects reported in [17] and [22], the consistent HSI classification performance of SSRNs with varying numbers of layers demonstrates that the residual connections mitigate the decreasing-accuracy effects observed in other deep learning models.

Fig. 11. OA of spectral–spatial neural networks with varying layers and combinations of residual blocks. The x + y notation on the horizontal axis denotes an SSRN with x spectral and y spatial residual blocks.

The training and testing times provide a direct measure of computational efficiency for the SSRN. All experiments were conducted on an MSI GT72S laptop with a GTX 980M graphics processing unit (GPU). Table IX lists the training and testing times of the SSRN and other deep learning models. As presented in Table IX, the training times of the spectral part (SPC) are 5–10 times longer than those of its spatial counterpart (SPA), because the spectral residual blocks preserved abundant features and kept the spatial size unchanged. In other words, the spectral residual blocks in the SSRN require more computational power than their spatial counterparts. The SSRN takes 6–10 times longer to train than the CNN, which means that the SSRN is more computationally expensive than the CNN. Fortunately, the adoption of GPUs has largely alleviated the extra computational costs and reduced the training times.

D. Discussion

The experimental outcomes validate the effectiveness of the SSRN framework. It is worth noting that different deep learning models usually prefer different hyperparameters, which poses a challenge for deploying these models. However, the classification performance of the SSRN with different settings is stable according to the experimental results. Compared with traditional feature engineering-based machine learning methods (e.g., kernel SVM), deep learning models have four advantages: 1) automatic feature extraction; 2) hierarchical nonlinear transformation; 3) objective functions that directly focus on classification in lieu of two independent steps; and 4) the ability to utilize computational hardware (especially GPUs) efficiently.

TABLE IX

TRAINING AND TESTING TIMES OF DIFFERENT MODELS FOR THREE HSI DATA SETS

Three major differences exist between SSRNs and other deep learning models (e.g., SAE and CNN). First, the SSRN adopts residual connections that improve the classification accuracy and make deep learning models much easier to train. Second, the SSRN treats spectral features and spatial features separately in two consecutive blocks, through which more discriminative features can be extracted. Third, owing to the BN operation at each convolutional layer, we need only hundreds of iterations for training the SSRN instead of the hundreds of thousands in [24].
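The interaction between the identity skip connection and BN can be illustrated with a toy residual block. The sketch below is not the SSRN implementation: it replaces the 3-D convolution kernels with a 1×1×1 channel-mixing stand-in (`pointwise_conv`), omits the learned BN scale and shift, and assumes a conv, BN, ReLU ordering; all function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each channel (last axis) over all other axes; no learned scale or shift."""
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pointwise_conv(x, weight):
    """1x1x1 convolution: mixes channels at every voxel (stand-in for a 3-D kernel)."""
    return x @ weight

def residual_block(x, w1, w2):
    """Two conv + BN layers whose output is added to the unchanged input."""
    h = np.maximum(batch_norm(pointwise_conv(x, w1)), 0.0)  # conv, BN, ReLU
    h = batch_norm(pointwise_conv(h, w2))                   # conv, BN
    return np.maximum(x + h, 0.0)                           # identity skip, then ReLU

rng = np.random.default_rng(0)
# Batch of 2 cubes, 3x3 spatial size, 4 spectral positions, 5 feature channels.
x = np.abs(rng.normal(size=(2, 3, 3, 4, 5)))
zero_w = np.zeros((5, 5))
passthrough = residual_block(x, zero_w, zero_w)  # residual branch contributes nothing
```

With zero weights the residual branch vanishes and the block passes its non-negative input through unchanged, which is why stacking additional residual blocks need not degrade what earlier layers have learned.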

Three main factors influence the HSI classification performance of supervised deep learning models: 1) the number of training samples; 2) the spatial size of input data; and 3) the representative capacity of the designed models. Because the SSRN obtained very high classification accuracy for relatively few land-cover categories, we did not employ data augmentation [22] to further boost the classification performance of the SSRN despite the small number of training samples. Given a fixed model, the more data used for training, and the more information these data contain, the higher the classification accuracy that deep learning models can achieve. Therefore, to make a fair comparison, we need to test different models with the same number of training samples and the same size for each input sample.
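To make the notion of a fixed input size concrete, the sketch below cuts a size × size × b cube centered on a given pixel, matching the 7×7×b volumes used in the experiments. The function name `extract_cube` is hypothetical, and zero padding at the image border is an assumption; the border handling actually used is not stated here.

```python
import numpy as np

def extract_cube(image, row, col, size=7):
    """Return the size x size x b neighborhood centered on pixel (row, col).

    image: array of shape (height, width, b) with b spectral bands.
    The image is zero-padded (an assumption) so that border pixels
    also yield full-size cubes.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    # After padding, pixel (row, col) sits at (row + half, col + half),
    # so the window starting at (row, col) is centered on it.
    return padded[row:row + size, col:col + size, :]

# Hypothetical 5 x 6 image with 3 bands.
image = np.arange(5 * 6 * 3).reshape(5, 6, 3)
cube = extract_cube(image, row=0, col=0, size=7)
```

Using one such function for every method under comparison guarantees that all classifiers see the same amount of spatial context.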

IV. CONCLUSION

In this paper, we have presented a supervised 3-D deep learning framework for spectral–spatial representation learning and HSI classification. The designed SSRN, which contains consecutive spectral and spatial residual blocks, has alleviated the decreasing-accuracy phenomenon. The experimental results demonstrated that the SSRN consistently achieves the highest classification accuracy for all three types of HSI data sets with different challenges. It is worth noting that this network has delivered robust classification performance using small as well as large numbers of uneven training samples. In addition, the BN strategy regularized the training process and improved the classification accuracy. Finally, the SSRN achieved state-of-the-art results with limited labeled 3-D cubes as training data in three cases and can easily be generalized to other remote-sensing scenarios because of its uniform structural design and deep feature learning capacity.



The essence of deep learning models is learning the representation of input data automatically without feature engineering, because the models themselves can extract discriminative features given appropriate architectural designs and training process settings. Moreover, these hyperparameter settings depend on the number of training samples and the spatial size of each sample. In the case of HSI classification, one prominent challenge is the shortage of annotations. Thus, this paper counters this obstacle with the proposed spectral–spatial residual architecture that takes both abundant spectral signatures and spatial contexts into account.

It has been suggested that deep learning methods need a significant amount of labeled data for training [21]. However, the experimental results have demonstrated that the proposed models, which have a spectral–spatial residual architecture and an appropriate regularization strategy, perform robustly with large as well as limited numbers of training samples. Also, according to the sensitivity test results, the proposed network can extract more discriminative spatial features with larger input cubes, and simply expanding the size of the input data will increase the classification accuracy. In other words, HSI classification models using training samples with more spatial information tend to have an advantage over the ones using training data with less spatial information. Therefore, we advocate that the spatial size of input HSI data should be the same when comparing different classification methods. Considering the consistent performance in three widely studied HSI cases, we believe that the SSRN can still outperform other machine learning competitors for HSI classification under the same comparison standards in other cases.

REFERENCES

[1] X. Huang and L. Zhang, “An adaptive mean-shift analysis approach for object extraction and classification from urban hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 12, pp. 4173–4185, Dec. 2008.

[2] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 22–40, Jun. 2016.

[3] J. Li, P. R. Marpu, A. Plaza, J. M. Bioucas-Dias, and J. A. Benediktsson, “Generalized composite kernel framework for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 9, pp. 4816–4829, Sep. 2013.

[4] L. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du, “Hyperspectral remote sensing image subpixel target detection based on supervised metric learning,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 4955–4965, Aug. 2014.

[5] L. Zhang, Y. Zhong, B. Huang, J. Gong, and P. Li, “Dimensionality reduction based on clonal selection for hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 4172–4186, Dec. 2007.

[6] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Spectral–spatial classification of hyperspectral data using loopy belief propagation and active learning,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 2, pp. 844–856, Feb. 2013.

[7] X. Jia, B.-C. Kuo, and M. M. Crawford, “Feature mining for hyperspectral image classification,” Proc. IEEE, vol. 101, no. 3, pp. 676–697, Mar. 2013.

[8] L. Zhang, L. Zhang, D. Tao, and X. Huang, “On combining multiple features for hyperspectral remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, pp. 879–893, Mar. 2012.

[9] Q. Wang, J. Lin, and Y. Yuan, “Salient band selection for hyperspectral image classification via manifold ranking,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1279–1289, Jun. 2016.

[10] Y. Yuan, J. Lin, and Q. Wang, “Hyperspectral image classification via multitask joint sparse representation and stepwise MRF optimization,” IEEE Trans. Cybern., vol. 46, no. 12, pp. 2966–2977, Dec. 2016.

[11] M. Khodadadzadeh, J. Li, A. Plaza, H. Ghassemian, J. M. Bioucas-Dias, and X. Li, “Spectral–spatial classification of hyperspectral data using local and global probabilities for mixed pixel characterization,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6298–6314, Oct. 2014.

[12] X. Kang, S. Li, and J. A. Benediktsson, “Spectral–spatial hyperspectral image classification with edge-preserving filtering,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2666–2677, May 2014.

[13] M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson, “Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 11, pp. 3804–3814, Nov. 2008.

[14] M. Pal and G. M. Foody, “Feature selection for classification of hyperspectral data by SVM,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, May 2010.

[15] J. Peng, Y. Zhou, and C. L. P. Chen, “Region-kernel-based support vector machines for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 9, pp. 4810–4824, Sep. 2015.

[16] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, “SVM- and MRF-based method for accurate classification of hyperspectral images,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 736–740, Oct. 2010.

[17] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.

[18] Y. Chen, X. Zhao, and X. Jia, “Spectral–spatial classification of hyperspectral data based on deep belief network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2381–2392, Jun. 2015.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114.

[20] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.

[21] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.

[22] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232–6251, Oct. 2016.

[23] W. Zhao and S. Du, “Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544–4554, Aug. 2016.

[24] Y. Li, H. Zhang, and Q. Shen, “Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sens., vol. 9, no. 1, p. 67, 2017.

[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.

[26] H. Lee and H. Kwon, “Contextual deep CNN based hyperspectral classification,” in Proc. Int. Geosci. Remote Sens. Symp., Jul. 2016, pp. 3322–3325.

[27] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.

[28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.

[29] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.

[30] T. Tieleman and G. E. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA, Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, 2012.

[31] B. Waske, S. van der Linden, J. A. Benediktsson, A. Rabe, and P. Hostert, “Sensitivity of support vector machines to random feature selection in classification of hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, Jul. 2010.



Zilong Zhong (S’15) received the M.Eng. degree in electronic and communication engineering from Lanzhou University, Lanzhou, China, in 2014. He is currently pursuing the Ph.D. degree in remote sensing with the University of Waterloo, Waterloo, ON, Canada.

His research interests include computer vision, deep learning, probabilistic graphical models, and their applications in the context of high-dimensional remotely sensed data.

Jonathan Li (M’00–SM’11) received the Ph.D. degree in geomatics engineering from the University of Cape Town, Cape Town, South Africa.

He is currently a Professor with the Department of Geography and Environmental Management, University of Waterloo, Waterloo, ON, Canada. He is also with the Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Information Science and Engineering, Xiamen University, Xiamen, China. He has co-authored more than 300 publications, more than 150 of which were published in refereed journals, including the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, ISPRS Journal of Photogrammetry and Remote Sensing, and Remote Sensing of Environment. His research interests include information extraction from mobile LiDAR point clouds and from earth observation images.

Dr. Li is the Chair of the ISPRS WG I/2 on LiDAR, Airborne and Spaceborne Optical Sensing (2016–2020) and the ICA Commission on Sensor-driven Mapping (2015–2019). He is an Associate Editor of the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS and JSTARS.

Zhiming Luo received the B.Sc. degree in cognitive science from Xiamen University, Xiamen, China, in 2011. He is currently pursuing the Ph.D. degree in computer science with Xiamen University and the University of Sherbrooke, Sherbrooke, QC, Canada.

His research interests include traffic surveillance video analytics, computer vision, and machine learning.

Michael Chapman received the Ph.D. degree in photogrammetry from Laval University, Quebec City, QC, Canada.

He is currently a Professor of geomatics with the Department of Civil Engineering, Ryerson University, Toronto, ON, Canada. He has co-authored over 160 technical papers. His research interests include algorithms and processing methods for airborne sensors using GPS/INS geometric processing of digital imagery in industrial environments, terrestrial imaging systems for transportation infrastructure mapping, and algorithms and processing strategies for metrology applications.
