

One-two-one networks for compression artifacts reduction in remote sensing

https://doi.org/10.1016/j.isprsjprs.2018.01.003

⁎ Corresponding author. E-mail addresses: [email protected] (B. Zhang), [email protected] (C. Chen).


Baochang Zhang a, Jiaxin Gu a, Chen Chen c,⁎, Jungong Han d, Xiangbo Su a, Xianbin Cao b, Jianzhuang Liu e

a School of Automation Science and Electrical Engineering, Beihang University, Beijing, China
b School of Electronics and Information Engineering, Beihang University, Beijing, China
c Center for Research in Computer Vision (CRCV), University of Central Florida, Orlando, FL, USA
d School of Computing and Communications, Lancaster University, LA1 4YW, UK
e Noah's Ark Lab, Huawei Technologies Co. Ltd., China

Article history: Received 27 May 2017; Received in revised form 22 November 2017; Accepted 4 January 2018; Available online xxxx

2010 MSC: 00-01; 99-00

Keywords: Compression artifacts reduction; Remote sensing; Deep learning; One-two-one network

Abstract: Compression artifacts reduction (CAR) is a challenging problem in the field of remote sensing. Most recent deep learning based methods have demonstrated superior performance over the previous handcrafted methods. In this paper, we propose an end-to-end one-two-one (OTO) network, to combine different deep models, i.e., summation and difference models, to solve the CAR problem. Particularly, the difference model motivated by the Laplacian pyramid is designed to obtain the high frequency information, while the summation model aggregates the low frequency information. We provide an in-depth investigation into our OTO architecture based on the Taylor expansion, which shows that these two kinds of information can be fused in a nonlinear scheme to gain more capacity of handling complicated image compression artifacts, especially the blocking effect in compression. Extensive experiments are conducted to demonstrate the superior performance of the OTO networks, as compared to the state-of-the-arts on remote sensing datasets and other benchmark datasets. The source code will be available here: https://github.com/bczhangbczhang/.

© 2018 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

1. Introduction

In remote sensing, satellite- or aircraft-based sensor technologies are used to capture and detect objects on Earth. Thanks to various propagated signals (e.g., electromagnetic radiation), remote sensing makes data collection from dangerous or inaccessible areas possible, and therefore plays a significant role in many applications including monitoring, military information collection and land-use classification (Chen et al., 2016; Li et al., 2014, 2015; Vosselman et al., 2017). With the technological development of various satellite sensors, the volume of high-resolution remote sensing image data is increasing rapidly. Hence, proper compression of satellite images becomes essential, as it makes information exchange much more efficient given a limited bandwidth.

Existing compression methods generally fall into two categories: lossless (e.g., PNG) and lossy (e.g., JPEG) (Wang et al., 2002). Lossless methods usually provide a better visual experience to users, but lossy methods often achieve higher compression ratios via non-invertible compression functions along with trade-off parameters to balance the data amount and the decompressed quality. Therefore, lossy compression schemes are usually preferred by consumer devices in practice due to their higher compression rates (Wang et al., 2002). However, a high compression rate comes at the cost of compression artifacts on the decoded image, which is a barrier for many applications, such as image analysis. Therefore, there is a clear need for compression artifact reduction, which improves the visual quality of the decompressed image and benefits both the visual effect and low-level vision processing (Yu et al., 2016).

The compression artifacts are related to the schemes used for compression. Taking JPEG compression as an example, blocking artifacts are caused by discontinuities at the borders of adjacent 8 × 8 pixel blocks, while ringing effects and blurring result from the coarse quantization of the high frequency components. To deal with these compression artifacts, an improved version of JPEG, named JPEG 2000, was proposed, which


adopts the wavelet transform to avoid blocking artifacts, but still suffers from ringing effects and blurring. As an excellent alternative, SPIHT (Said et al., 1996) showed that using simple uniform scalar quantization, rather than complicated vector quantization, also yields superior results. Due to its simplicity, SPIHT has been successful on natural (portraits, landscapes, weddings, etc.) and medical (X-ray, CT, etc.) images. Furthermore, its embedded encoding process has proved to be effective over a broad range of reconstruction qualities. For instance, it can code fair-quality portraits and high-quality medical images equally well (as compared with other methods under the same conditions). However, in the field of remote sensing, images usually suffer from severe artifacts after compression, as shown in Fig. 1, which poses challenges to many high-level vision tasks, such as object detection (Cheng and Han, 2016; Xiao et al., 2016), classification (Chen et al., 2016; Bian et al., 2017), and anomaly detection (Chang and Chiang, 2002).

To cope with various compression artifacts, many conventional approaches have been proposed, such as filtering approaches (List et al., 2003; Reeve and Lim, 1984; Wang et al., 2013), specific priors (e.g., the quantization table in DSC (Liu et al., 2015)), and thresholding techniques (Liew and Yan, 2004; Foi et al., 2007). Inspired by the great success of deep learning in many image processing applications, researchers have started to exploit this powerful tool to reduce compression artifacts. Specifically, the Super-Resolution Convolutional Neural Network (SRCNN) (Dong et al., 2014) exhibits the great potential of end-to-end learning in image super-resolution. It has also been pointed out that the conventional sparse-coding-based image restoration model can be equally seen as a

Fig. 1. Left: the SPIHT-compressed remotely sensed images with obvious blocking artifacts. Right: the restored images by our OTO network, where lines are sharp and blurring is removed.


deep model. However, if we directly apply SRCNN to the compression artifact reduction task, the features extracted by its first layer are noisy, which causes undesirable noisy patterns in the reconstruction. Thus the three-layer SRCNN is not suitable for compressed image restoration, especially when dealing with complex artifacts. Thanks to transfer learning, ARCNN (Yu et al., 2016) has been successfully applied to image restoration tasks. However, without exploiting multi-scale information, ARCNN fails to solve more complicated compression artifact problems. Although many deep models with different architectures have been explored (e.g., Dong et al., 2014; Yu et al., 2016; Cavigelli et al., 2017) to solve the artifact reduction problem, there is little work incorporating different models in a unified framework to inherit their respective advantages.

In this paper, a generic fusion network, dubbed the one-two-one (OTO) network, is developed for complex compression artifacts reduction. The general framework of the proposed OTO network is presented in Fig. 2. Specifically, it consists of three sub-networks: a normal-scale network, a small-scale network with max pooling to increase the network receptive field, and a fusion network to perform principled fusion of the outputs from the summation and difference models. The summation model aggregates the low frequency information captured from different network scales, while the difference model is motivated by the Laplacian pyramid, which is able to describe high frequency information, such as detailed information. By combining the summation and difference models, both the low and high frequency information of the image can be better characterized. This is motivated by the fact


Fig. 2. The architecture of the One-Two-One (OTO) network. Two different CNN models are combined in a principled framework, the outputs of which are further processed by a fusion network. The details of the three sub-networks, ResNet (R), DenseNet (D) and Classic CNN (C), are also included. Legend: Conv = convolution layer; Scale = scale layer; Down = down-sampling (max-pooling); Up = up-sampling (deconvolution); BN = batch normalization layer; ReLU = rectified linear unit; ResUnit = residual unit; DenseUnit = DenseNet unit; CnnUnit = classic CNN unit; + = summation; − = subtraction; ×n = stack n times.

Table 1. A brief description of variables used in the paper.
$Y$: input compressed image; $\tilde{Y}$: output of the first convolutional layer; $X$: uncompressed target image.
$F_1$: normal-scale network; $F_2$: small-scale network; $F_3$: fusion network.
$N_1$: output of $F_1$; $N_2$: output of $F_2$.
$Sum$: summation of $N_1$ and $N_2$; $Dif$: difference between $N_1$ and $N_2$.
$G_S$: nonlinear operation of the summation model; $G_D$: nonlinear operation of the difference model.
$H_S$: output of $G_S$; $H_D$: output of $G_D$.
$G'_S(\cdot)$: derivative of $G_S$; $G'_D(\cdot)$: derivative of $G_D$.
$\alpha$: weight term between the two sub-networks; $c$: constant term; $o[\cdot,\cdot]$: higher order infinitesimal.


that adopting different schemes to process high frequency and low frequency information always benefits low-level image processing applications, such as image denoising (Zhang et al., 2017a) and image reconstruction (Early and Long, 2001). Most importantly, we provide an in-depth investigation into our OTO architecture based on the Taylor expansion, which shows that these two kinds of information are fused in a nonlinear scheme to gain more capacity to handle complicated image compression artifacts. From a theoretical perspective, this paper proposes a principled combination of different CNN models, providing the capability of coping with the extremely challenging task of the large blocking effect. Extensive experimental results verify that combining diverse models effectively boosts the performance. In summary, our contributions in this paper are as follows.

1. We develop a new one-two-one (OTO) network to combine different models in an end-to-end deep framework, aiming to effectively deal with complicated artifacts, i.e., the big blocking effect in compression.

2. We are motivated by the idea of the Laplacian pyramid, which is extended to the deep learning framework and explicitly used to capture the high frequency information in images. We show in the experiments that the difference model is able to effectively improve the compression artifact reduction performance.

3. Based on the Taylor expansion, we derive two OTO variants, which provide a more profound investigation into our method.


4. Extensive experiments are conducted to validate the performance of OTO over the state-of-the-art methods on both benchmark datasets and remote sensing datasets.

For ease of explanation, we summarize the main variables in Table 1. The rest of the paper is organized as follows. Section 2 introduces the related work, and Section 3 describes the details of the proposed method. Experiments and results are presented in Section 4. Finally, Section 5 concludes the paper.


2. Related work

The OTO network is proposed to combine summation and difference models in an end-to-end framework. Particularly, the difference model motivated by the Laplacian pyramid is designed to obtain the high frequency information, while the summation model aggregates the low frequency information. Compared to the summation model, the difference model can provide more detailed information. In this section, we briefly describe the related work on how high frequency information is used in low-level image processing, as well as previous CAR methods.

On high frequency information. High frequency information has been exploited in tasks such as pansharpening (Vivone et al., 2015), super-resolution (Kim et al., 2016) and denoising (Wang et al., 2017). However, the way of exploiting it is different from ours. Specifically, in image super-resolution, a low resolution input image is first interpolated to have the same size as the high resolution image. Then the goal of the network becomes learning the high resolution image from the interpolated low resolution image (Kim et al., 2016). In other words, the network essentially aims to learn the high frequency information in order to obtain the high resolution output (Lim et al., 2017; Ledig et al., 2016). In pansharpening, the high frequency details are not available for multispectral bands, and must be inferred through the model (Masi et al., 2017; Scarpa et al., 2017; Vivone et al., 2015) starting from those of Pan images. In denoising, residual learning is utilized to speed up the training process as well as boost the denoising performance (Zhang et al., 2017b,a; Wang et al., 2017). The Laplacian pyramid is ubiquitous for decomposing images into multiple scales and is widely used for image analysis (Paris et al., 2011; Burt and Adelson, 1983). It is computed as the difference between the original image and the low-pass filtered image; this process is continued to obtain a set of band-pass filtered images, each being the difference between two levels of the Gaussian pyramid. Laplacian pyramids have been used to analyze images at multiple scales for a broad range of applications such as compression (Burt and Adelson, 1983), texture synthesis (Heeger and Bergen, 1995), and harmonization (Sunkavalli et al., 2010). We are motivated by the idea of the Laplacian pyramid, which is extended to the deep learning framework and explicitly used to capture more detailed information (high frequency) to solve the CAR problem.
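To make this decomposition concrete, a minimal sketch of a Laplacian pyramid is given below; the choice of a Gaussian low-pass filter, the number of levels and all function names are our own illustrative assumptions, not details from the works cited above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_pyramid(image, levels=3, sigma=1.0):
    """Decompose an image into band-pass (high frequency) levels plus a
    low-pass residual, in the spirit of Burt and Adelson (1983): each level
    is the difference between the current image and its low-pass version."""
    pyramid = []
    current = image.astype(np.float64)
    for _ in range(levels):
        low = gaussian_filter(current, sigma)   # low-pass filtering
        pyramid.append(current - low)           # band-pass = difference (high frequency)
        current = low[::2, ::2]                 # move to the next (coarser) scale
    pyramid.append(current)                     # coarsest low-pass residual
    return pyramid
```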

Traditional CAR methods. Traditional methods for the CAR problem are generally categorized into deblocking-based and dictionary-based algorithms. The deblocking-based algorithms mainly focus on removing blocking and ringing artifacts, either using filters in the spatial domain or utilizing wavelet transforms and setting thresholds at different wavelet scales in the frequency domain. Among them, the most successful work is the Shape-Adaptive Discrete Cosine Transformation (SA-DCT) (Foi et al., 2007), which achieved the state-of-the-art performance during the 2000s. However, similar to other deblocking-based methods, SA-DCT suffers from blurry edges and smoothed texture regions as well. It is worth noting that SA-DCT is an unsupervised method, which is more powerful than supervised methods when there are not enough samples available. The supervised dictionary-based algorithms, such as RTF (Jancsary et al., 2012) and S-D2 (Liu et al., 2015), take compression artifacts reduction as a restoration problem and reverse the impact of DCT-domain quantization with learned dictionaries. Unfortunately, the optimization procedure of sparse-coding-based approaches is always complicated, and end-to-end training does not seem to be possible, which limits their reconstruction performance.

Deep CAR methods. Recently, deep convolutional neural networks have shown promising performance on both high-level


vision tasks, such as classification (Krizhevsky and Hinton, 2009; Cheng et al., 2017), detection (Ren et al., 2015; Yao et al., 2015; Han et al., 2015; Cheng et al., 2016) and segmentation (Long et al., 2015; Yao et al., 2016, 2017), and low-level image processing like super-resolution (Kim et al., 2016). The Super-Resolution Convolutional Neural Network (SRCNN) (Dong et al., 2014) utilizes a three-layer CNN to increase the resolution of images and achieves superior results over traditional SR algorithms like A+ (Timofte et al., 2014). Following the idea of SRCNN, Yu et al. (2016) eliminate the undesired noisy patterns by directly applying the SRCNN architecture to compression artifacts suppression and prove that transfer learning also succeeds in low-level vision problems. The compression artifacts reduction CNN (Yu et al., 2016) mainly benefits from transfer learning in three aspects: from shallow networks to deep networks, from high-quality training datasets to low-quality ones, and from one compression scheme to another. Svoboda et al. (2016) learn a feed-forward CNN by combining residual learning, a skip architecture and symmetric weight initialization to improve image restoration performance. The generative adversarial network (GAN) has also been successfully used to solve the CAR problem. In Leonardo et al. (2017), the compression artifact removal problem is re-formulated in a generative adversarial framework with a Structural Similarity (SSIM) loss, a better loss than the simpler Mean Squared Error (MSE); the method obtains better performance than MSE-trained networks.

Due to the fixed quantization table in the JPEG compression standard, it is reasonable to take advantage of JPEG-related priors for better restoration performance. The Deep Dual-domain Convolutional neural Network (DDCN) (Guo and Chao, 2016) adds a DCT-domain prior into the dual networks so that the network is able to learn the difference between the original images and the compressed images in both the pixel domain and the DCT domain. Likewise, the D3 method (Wang et al., 2016) converts sparse-coding approaches into a LISTA-based (Gregor and LeCun, 2010) deep neural network, and gains both speed and performance. Both DDCN and D3 adopt JPEG-related priors to improve reconstruction quality. The one-to-many network (Jun and Hongyang, 2017) is proposed for compression artifacts reduction. It combines three losses, a perceptual loss, a naturalness loss, and a JPEG loss, to measure the output quality. By combining multiple different losses, the one-to-many network is able to achieve visually pleasing artifacts reduction.

Challenges of the CAR problem. In spite of already achieving good compression artifact removal performance, existing methods still have limitations, especially when dealing with satellite imagery. Prior-based methods may not generalize to other compression schemes like SPIHT, and their applications are therefore limited, because satellite- or aircraft-based sensor technologies use variable compression standards. Another ignored problem-specific prior is the size of blocks, which is typically 8 × 8. The existing JPEG-based methods crop images into sub-samples or patches with a small size like 32 × 32 and use 8 × 8 blocks for processing. However, a larger block size like 32 × 32 is often adopted in the digital signal processor (DSP) of satellites for parallel processing. In this case, an image patch only contains one whole block, which might have a negative impact on the training process. As a result, it is important for sub-samples to contain several blocks so that the networks can perceive the spatial context between adjacent blocks. On the other hand, the existing deep learning based compression artifact removal approaches mainly focus on the architecture design (Yu et al., 2016; Dong et al., 2014; Svoboda et al., 2016) or on changing the loss function (Leonardo et al., 2017; Jun and Hongyang, 2017), without theoretical explanations, and thus fail to provide a more profound investigation into the methodologies. Moreover, the benefits of different network architectures have not been fully explored for solving the CAR problem.


Fig. 3. The architecture of OTO(Linear) with a learned $\alpha$ to balance the two branch networks.


3. One-two-one networks

The OTO networks are designed to reduce compression artifacts in a unified framework. As shown in Fig. 2, two different models (the summation model and the difference model) are used to restore the input image individually, and their advantages are inherited by a CNN fusion network, leading to a better performance than using each of them individually. In what follows, we address two issues to build the OTO network. We first describe the motivation of OTO, along with a theoretical investigation into the network architecture which leads to two variants. We then elaborate the architecture of the proposed OTO network, which is divided into three specific sub-networks. For each of them, we give the details of the implementation.

3.1. Theoretical investigation of OTO

OTO is a general framework aiming to combine deep models of different architectures. In OTO, a hierarchical CNN structure is exploited to capture multi-scale texture information, which is very effective in dealing with various compression artifacts. In addition, each network in our framework carries out a specific objective, i.e., different-scale textures, and we combine them together to obtain better results. The idea originates from the Laplacian pyramid for capturing detailed information, but we use different-scale networks to implement it in the deep learning framework. The small-scale network involves spatial max pooling, which essentially increases the network receptive field and aggregates information over a larger spatial area. Therefore, by combining small-scale and normal-scale network features, the network learns features from different scales. Inspired by the Laplacian pyramid, the difference model is exploited in the deep framework and is able to describe the high frequency information, while the summation model captures the low frequency information. We then combine both in a principled end-to-end deep framework. We would like to highlight our idea from a more basic perspective: we provide a sensible way to combine the low and high frequency information in the deep learning framework, and also explain it theoretically with the Taylor expansion. In OTO, we have:

$$N_1 = F_1(\tilde{Y}), \quad (1)$$

and

$$N_2 = F_2(\tilde{Y}), \quad (2)$$

where $\tilde{Y}$ is the output of the first convolutional network, which is designed to pre-process the input compressed image $Y$ with convolution layers. $N_1$ and $N_2$ denote the outputs of the two branch networks, i.e., the normal-scale network and the small-scale network. To better restore the image $X$, we exploit two different models, i.e., the summation model and the difference model, based on $\tilde{Y}$, which complement each other in terms of network architectures. The summation model is used to mitigate the disparity between the two networks, while the difference model highlights that different CNNs are designed for different purposes in order to obtain better restoration results. We have:

$$\mathrm{Sum} = N_1 + N_2, \quad (3)$$

which actually aggregates the low frequency information, and

$$\mathrm{Dif} = N_1 - N_2, \quad (4)$$

which describes the high frequency information, as in the Laplacian pyramid. $G_S$ and $G_D$ denote the two branches following the summation and subtraction operations in Fig. 2, respectively.


Both kinds of information are then combined for a better restoration performance, and we have:

$$H_S = G_S(\mathrm{Sum}), \quad (5)$$

and

$$H_D = G_D(\mathrm{Dif}), \quad (6)$$

where $H_S$ and $H_D$ are the outputs of the two branches. They are then combined via a nonlinear operation, which is designed to be robust to the artifacts in the compressed images:

$$F_3(\tilde{Y}) = H_S + \alpha H_D = G_S(\mathrm{Sum}) + \alpha G_D(\mathrm{Dif}) = G_S(N_1 + N_2) + \alpha G_D(N_1 - N_2), \quad (7)$$

where $\alpha$ is a weight factor to balance the two models. Based on the Taylor expansion of $G_S$ and $G_D$, we can show that OTO is actually a combination of $N_1$ and $N_2$ in a nonlinear scheme:

$$F_3(\tilde{Y}) = G'_S((N_1+N_2)^*)(N_1+N_2) + \alpha G'_D((N_1-N_2)^*)(N_1-N_2) + c + o[(N_1+N_2), (N_1-N_2)]$$
$$= \left[G'_S((N_1+N_2)^*) + \alpha G'_D((N_1-N_2)^*)\right] N_1 + \left[G'_S((N_1+N_2)^*) - \alpha G'_D((N_1-N_2)^*)\right] N_2 + c + o[(N_1+N_2), (N_1-N_2)], \quad (8)$$

where $*$ denotes the (always differentiable) point used in the Taylor expansion, $c$ denotes the constant term, and $o[(N_1+N_2), (N_1-N_2)]$ denotes the higher order infinitesimal. More specifically, $o[(N_1+N_2), (N_1-N_2)]$ in Eq. (8) is the nonlinear part and the remainder is the linear part. Note that the adopted nonlinear OTO model includes both the linear and nonlinear parts.

Based on Eq. (8), two linear OTO variants can be obtained, as shown in Figs. 3 and 4. The first one, termed OTO(Linear), is:

$$F_{3,1}(\tilde{Y}) = N_1 + \alpha N_2, \quad (9)$$

which can be derived from the linear part of Eq. (8). In its implementation we learn $\alpha$, which is elaborated in the experimental part. Setting $\alpha = 1$, we obtain the second variant:

$$F_{3,2}(\tilde{Y}) = N_1 + N_2, \quad (10)$$

which leads to our baseline, termed OTO(Sum).
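The data flow of Eqs. (1)-(7) can be summarized in a short sketch. We give it in PyTorch-style Python purely for illustration; our implementation is based on Caffe, and the module names below are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class OTOFusion(nn.Module):
    """Sketch of Eq. (7): F3(Y~) = G_S(N1 + N2) + alpha * G_D(N1 - N2)."""

    def __init__(self, first_conv, f1, f2, g_sum, g_dif):
        super().__init__()
        self.first_conv = first_conv  # pre-processing convolution producing Y~
        self.f1 = f1                  # normal-scale sub-network F1
        self.f2 = f2                  # small-scale sub-network F2 (max-pool and up-sample inside)
        self.g_sum = g_sum            # nonlinear branch G_S of the summation model
        self.g_dif = g_dif            # nonlinear branch G_D of the difference model
        self.alpha = nn.Parameter(torch.tensor(0.1))  # learned weight (scale layer)

    def forward(self, y):
        y_tilde = self.first_conv(y)
        n1 = self.f1(y_tilde)          # N1, Eq. (1)
        n2 = self.f2(y_tilde)          # N2, Eq. (2)
        h_s = self.g_sum(n1 + n2)      # low frequency path, Eqs. (3) and (5)
        h_d = self.g_dif(n1 - n2)      # high frequency path, Eqs. (4) and (6)
        return h_s + self.alpha * h_d  # Eq. (7)
```

Setting `self.alpha` to a fixed value of 1 and replacing `g_sum`/`g_dif` by identities recovers the OTO(Sum) baseline of Eq. (10).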

3.2. The architectures of OTOs

There are three distinct parts in OTOs: (a) a normal-scale restoration network, (b) a small-scale restoration network, and (c) a fusion network. For sub-networks (a) and (b), three kinds of CNN models are available: R, D and C (short for ResNet, DenseNet and Classic CNN, respectively). The details of the OTO network are shown in Fig. 2.


Fig. 4. The architecture of OTO(Sum) without the difference model.


ResNet (R): For each ResUnit, we follow the latest variant proposed in He et al. (2016), which is more powerful than its predecessors. More specifically, in each ResUnit, a batch normalization layer (Ioffe and Szegedy, 2015), a ReLU layer (Krizhevsky et al., 2012) and a convolution layer are stacked twice in sequence.

DenseNet (D): Inspired by Densely Connected Convolutional Networks (Huang et al., 2016), to further improve the information flow between layers we adopt a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. In DenseNet, the feature fusion method is changed from addition to concatenation compared with ResNet, resulting in wider feature maps. The growth rate k, an important concept in DenseNet, controls how fast the width of the feature maps grows; in our implementation, we set k to 8. Each DenseUnit also follows the pre-activation style of the ResUnit, except that the number of convolutional layers is reduced to 1. As can be seen in Fig. 2, five DenseUnits are stacked sequentially, followed by a convolutional layer that reduces the width of the feature map so that it can be fused with the other sub-network.

Classic CNN (C): The classic CNN model only takes advantage of convolutional layers and activation layers. The CnnUnit consists of one convolutional layer and one ReLU layer, and 6 CnnUnits are stacked to form the Classic CNN sub-network.
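As an illustration, the three unit types described above can be sketched as follows (PyTorch-style pseudocode; the channel counts and 3 × 3 kernels are our own assumptions, only the layer ordering is taken from the text):

```python
import torch
import torch.nn as nn

class ResUnit(nn.Module):
    """Pre-activation residual unit: (BN -> ReLU -> Conv) stacked twice,
    plus the identity shortcut (He et al., 2016)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class DenseUnit(nn.Module):
    """Pre-activation dense unit with a single convolution; the output is
    concatenated with the input, so the width grows by the growth rate k = 8."""
    def __init__(self, in_channels, growth_rate=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, 3, padding=1))

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)

class CnnUnit(nn.Module):
    """Classic CNN unit: one convolution followed by a ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)
```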

In sub-network (b), we utilize 2 × 2 max-pooling to decrease the size of the feature map by half, which brings the following benefits: the computational cost is decreased to 1/4, and more robust features are extracted compared to sub-network (a), thus enlarging the receptive field.

Fusion Network: Following Eq. (7), we construct the fusion network. Convolutional layers with ReLUs serve as the nonlinear operation, and scale layers serve as the weight term. After fusion, we stack 5 more ResUnits to further restore the images.

OTO Naming Rules: For convenience, we use abbreviations to represent the three kinds of sub-networks. The first and second abbreviations after OTO represent the normal-scale and small-scale sub-networks, respectively. For example, OTO_RD stands for an OTO network whose normal-scale sub-network is a ResNet and whose small-scale sub-network is a DenseNet.

Fig. 5. The architecture of the multi-scale OTO network (OTO_RRR), in which ×1, ×1/2 and ×1/4-scale features are exploited. The sub-networks are the same as in Fig. 2.



Multi-scale OTO Networks: To further investigate our proposed OTO network, we design a multi-scale network whose structure is shown in Fig. 5. The ×1/2 and ×1/4-scale features are first fused by the first fusion network to get a combined ×1/2-scale feature. The fused feature, along with the ×1-scale feature, then serves as the input of the second fusion network. Except for the architecture, all the other details are the same as in the two-scale OTO network.

It should be noted that the proposed OTO network exploits a series of ResUnits to fit the residual between the input and target images. In other words, there is a long, direct shortcut connecting the input image and the output of the subsequent network, apart from the identity shortcut of each ResUnit. VDSR (Kim et al., 2016) has already proved that learning the residual between the low-resolution and high-resolution image is more efficient and effective in the super-resolution task, because the difference is small, i.e., the residual is sparse. In the CAR task, this intuition holds because the compression algorithms do not change the essence of the image. As a result, the ResUnit leads to a sparse residual and thus we can train the network efficiently.
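A minimal sketch of this global residual shortcut, with hypothetical names:

```python
def restore(oto_network, compressed):
    """Global residual learning: the network predicts the sparse residual,
    which the long shortcut adds back to the compressed input."""
    residual = oto_network(compressed)
    return compressed + residual
```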

4. Implementation and experiments

4.1. Datasets

In order to evaluate the OTO network, three groups of training and test dataset settings are designed, which are described according to their different test sets.

LIVE1 and Classic 5: Following the protocol of ARCNN (Yu et al., 2016), we evaluate our proposed network on BSD500 (Arbelaez et al., 2011), where the training and test sets are combined to form a 400-image training set, with the 100-image validation set used for validation. The disjoint dataset LIVE1 (Sheikh et al., 2005), containing 29 images, is chosen as our test set. Another test set we adopt is Classic 5, one of the most frequently used datasets for evaluating the quality of images.

BSD500 Test set: The BSD500 test set contains 200 images, which is more challenging than LIVE1 and more widely adopted in recent research papers (Guo and Chao, 2016). Considering that the 200-image BSD500 training set is too small to generate enough sub-samples when a large stride is chosen, we perform data augmentation by rotating the original images by 90, 180 and 270 degrees. The remaining 100-image BSD500 validation set is used for validation.

Remotely Sensed Datasets: There are two public remote sensing datasets from the "ISPRS Test Project on Urban Classification and 3D Building Reconstruction": "Downtown Toronto" and "Vaihingen"



(Cramer, 2010). To validate the performance of OTO on remote sensing images, the "Downtown Toronto" dataset is employed, which contains various landscapes, such as ocean, roads, houses, vehicles and plants. To build a dataset for the CAR problem, we preprocess the high-resolution images in the "Downtown Toronto" dataset with a compression algorithm, obviously without the need for labeling any ground truth. The SPIHT compression algorithm is used; it can be applied to satellite images, where the original images are cropped into sub-images of a specific size, 32 × 32. Consequently, the size of the block artifacts in SPIHT is 32 × 32, which is much larger than that in JPEG. Unlike the quality factor in JPEG, the compression degree is decided by the compression ratio, such as 8, 16, 32 and 64. Afterwards, we build the datasets used for training and validation. We randomly pick 400 non-overlapping sub-images from the source images and the compressed images to form the training set, and each image has a uniform size of 512 × 512. We then do the same to get a 200-image disjoint validation set. For testing, we use the other dataset, "Vaihingen", to build a 400-image test set with the same setting as the training set.

Evaluation Metrics: To quantitatively evaluate the proposed method, three widely used metrics are adopted in our experiments: peak signal-to-noise ratio (PSNR), PSNR-B (Yim and Bovik, 2011) and structural similarity (SSIM) (Wang et al., 2004). PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation; it is most commonly used to measure the quality of a reconstructed image after lossy compression. PSNR-B modifies PSNR by including a blocking effect factor, resulting in a better metric than PSNR for the quality assessment of impaired images. The SSIM index is a method for predicting the perceived quality of digital images. SSIM considers image degradation as a perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms.
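For reference, PSNR on 8-bit images reduces to the following NumPy sketch (PSNR-B additionally includes the blocking-effect factor of Yim and Bovik (2011) and SSIM the structural terms of Wang et al. (2004), both omitted here):

```python
import numpy as np

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio between two 8-bit images, in dB:
    PSNR = 10 * log10(peak^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) -
                   restored.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```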

Other Settings: We only focus on restoring the luminance channel of the compressed image, and the RGB-to-YCbCr conversion is done with a MATLAB function. We also use MATLAB to carry out JPEG compression to generate compressed images of different qualities, such as QF-10, 20, 30 and 40. It is also worth noting that we crop every image such that the numbers of pixels in height and width are even, since an odd number would affect the down-sampling and up-sampling processes (padding would be necessary). To train the proposed OTO network, we choose SGD as the optimization algorithm with a momentum of 0.9 and a weight decay of 0.001. The initial learning rate is 0.01, with a degradation of 10% every 30000 iterations, before reaching the maximum iteration number of 120000.
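Reading the "degradation of 10%" as the usual step policy that multiplies the rate by 0.1 (this reading is our assumption), the schedule can be sketched as:

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.1, step=30000):
    """Stepped learning-rate policy: multiply base_lr by gamma
    every `step` iterations (e.g., 0.01 -> 0.001 at iteration 30000)."""
    return base_lr * gamma ** (iteration // step)
```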

Table 2. Results of different combinations of sub-networks. Red marks denote the best results and blue marks the second best results.


4.2. Sub-networks and multi-scale network

As mentioned before, the OTO network is a framework that can take any CNNs, e.g., ResNet (R), DenseNet (D) and Classic CNN (C), as its sub-networks. The results of combining different kinds of sub-networks are shown in Table 2. Classic CNNs obtain the worst results, which can however be improved by using different scales (OTO_CC) or by combining with ResNet (OTO_CR and OTO_RC). The OTO based on the densely connected network (OTO_DD) is designed to encourage feature reuse, but the lack of an identity mapping forces the network to learn the residual explicitly, which explains its failure. The combinations of DenseNet and ResNet at different scales (OTO_DR and OTO_RD) are affected by the two kinds of discriminated features. In contrast, residual learning benefits the CAR problem more, and the combination of two ResNets (OTO_RR) outperforms all other combinations. Multi-scale features show promising results, so we design a multi-scale OTO network (OTO_RRR) by adding a 1/4-scale sub-network to OTO_RR. It outperforms OTO_RR by a large margin on all three metrics. Even though OTO_RRR has outstanding performance, its computational cost increases by almost 25%, resulting in more training and test time. After weighing the pros and cons, we choose OTO_RR as our main framework; if not otherwise mentioned, OTO means OTO_RR in the following. We also design an experiment that removes one of the sub-networks at a time to investigate the function of each sub-network. The results in Table 3 indicate that the normal-scale feature is more helpful than the small-scale feature when only one sub-network is adopted.

4.3. OTO vs. its two variants

As mentioned above, OTO has the capability to utilize the nonlinear model, i.e., the summation and difference models, which is fully evaluated in this section. OTO(Linear) and OTO(Sum) are used in our comparison. The former learns a weight factor to balance the significance of the two branch networks, adaptively combining two CNNs to solve the CAR problem. This is verified to be very effective, because the significance of each CNN should be well considered in the fusion process. For the OTO(Sum) network, the weight factor is fixed to 1, which means that this version of OTO not only lacks the nonlinear representation ability but is also unable to tell which branch network contains more important information for suppressing compression artifacts. In other words, OTO(Sum) just directly applies the addition operation to the two sub-networks. In this experiment, all compared networks are trained on the BSD500 training and test sets, and then tested on LIVE1 and Classic5. The results are shown in Tables 4 and 5. The OTO network along with its


Table 3. Experiments on the sub-networks (LIVE1, QF-10).
Algorithm | PSNR | PSNR-B | SSIM
OTO (normal-scale) | 28.65 | 27.91 | 0.8263
OTO (small-scale) | 28.26 | 27.62 | 0.8245
OTO | 29.28 | 28.95 | 0.8298

Table 4. Comparative results between OTO and its two variants on LIVE1.
QF | Algorithm | PSNR | PSNR-B | SSIM
QF-10 | JPEG | 27.77 | 25.33 | 0.7905
QF-10 | OTO(Linear) | 29.26 | 28.91 | 0.8295
QF-10 | OTO(Sum) | 29.26 | 28.91 | 0.8287
QF-10 | OTO | 29.28 | 28.95 | 0.8298
QF-20 | JPEG | 30.07 | 27.57 | 0.8683
QF-20 | OTO(Linear) | 31.62 | 31.12 | 0.8950
QF-20 | OTO(Sum) | 31.60 | 31.07 | 0.8941
QF-20 | OTO | 31.67 | 31.17 | 0.8954
QF-30 | JPEG | 31.41 | 28.92 | 0.9000
QF-30 | OTO(Linear) | 33.06 | 32.45 | 0.9218
QF-30 | OTO(Sum) | 32.65 | 32.27 | 0.9160
QF-30 | OTO | 33.08 | 32.48 | 0.9218
QF-40 | JPEG | 32.35 | 29.96 | 0.9173
QF-40 | OTO(Linear) | 34.03 | 33.43 | 0.9349
QF-40 | OTO(Sum) | 33.68 | 33.35 | 0.9316
QF-40 | OTO | 34.10 | 33.48 | 0.9362

Table 5. Comparisons between OTO and its two variants on Classic 5.
QF | Algorithm | PSNR | PSNR-B | SSIM
QF-10 | JPEG | 27.82 | 25.21 | 0.7800
QF-10 | OTO(Sum) | 29.36 | 28.92 | 0.8207
QF-10 | OTO(Linear) | 29.34 | 28.93 | 0.8222
QF-10 | OTO | 29.36 | 28.94 | 0.8222
QF-20 | JPEG | 30.12 | 27.50 | 0.8541
QF-20 | OTO(Sum) | 31.56 | 31.00 | 0.8767
QF-20 | OTO(Linear) | 31.54 | 31.01 | 0.8774
QF-20 | OTO | 31.64 | 31.10 | 0.8785
QF-30 | JPEG | 31.48 | 28.94 | 0.8844
QF-30 | OTO(Sum) | 32.52 | 31.99 | 0.8966
QF-30 | OTO(Linear) | 32.93 | 32.28 | 0.9021
QF-30 | OTO | 32.95 | 32.33 | 0.9022
QF-40 | JPEG | 32.43 | 29.92 | 0.9011
QF-40 | OTO(Sum) | 33.46 | 32.91 | 0.9114
QF-40 | OTO(Linear) | 33.77 | 33.06 | 0.9139
QF-40 | OTO | 33.85 | 33.13 | 0.9155

Table 6. Evaluation of the weight factor α on LIVE1.
QF | α | PSNR | PSNR-B | SSIM
QF-20 | 0.01 | 31.59 | 31.12 | 0.8947
QF-20 | 0.1 | 31.59 | 31.12 | 0.8949
QF-20 | 1.0 | 31.60 | 31.07 | 0.8941
QF-20 | 0.0651 (auto-learned) | 31.62 | 31.12 | 0.8950
QF-30 | 0.01 | 33.03 | 32.41 | 0.9214
QF-30 | 0.1 | 33.04 | 32.43 | 0.9214
QF-30 | 1.0 | 32.65 | 32.27 | 0.9160
QF-30 | 0.0544 (auto-learned) | 33.06 | 32.45 | 0.9218


two variants achieves promising restoration performance across the four quality factors. These results demonstrate the effectiveness of the auto-learned weight factor $\alpha$ and of the nonlinear operation on the summation and difference of the two branch networks. The PSNR-B metric is designed specifically to measure blocking artifacts. In particular, we analyze the PSNR-B gains of OTO and OTO(Linear) over their baseline OTO(Sum). We observe that for low-quality compressed images (QF-10, QF-20), the nonlinear operation contributes more than the weight factor to suppressing blocking artifacts, while the opposite holds for high-quality images (QF-30, QF-40). We trained the OTO models on a GTX 1070 and an i7-6700K with 32 GB memory. The training times for OTO, OTO(Linear) and OTO(Sum) are 6 h 12 m, 5 h 42 m and 5 h 31 m, respectively. The average test times for OTO, OTO(Linear) and OTO(Sum) are 0.1803 s, 0.1738 s and 0.1734 s per image, respectively.

We further evaluate how the weight factor $\alpha$ affects the final performance. The results are shown in Table 6. $\alpha$ is implemented based


on a scale layer of the Caffe platform, which can be updated by the backpropagation algorithm. We can also fix $\alpha$ to a constant by manually setting the learning rate of this layer to 0, so that $\alpha$ stays unchanged during training. Firstly, we revisit OTO(Linear) and obtain the learned $\alpha$, 0.0651 and 0.0544 for QF = 20 and 30, respectively. The weight of the small-scale sub-network is about 20 times smaller than that of the normal-scale sub-network, indicating that the normal-scale features contain much richer information than the small-scale ones. Then, we set $\alpha$ to 0.01, 0.1 and 1.0 (when $\alpha = 1.0$, it leads to OTO(Sum)). The results show that when $\alpha$ is set to 0.01 or 0.1, close to the auto-learned value, the performance is slightly worse than OTO(Linear) (learned $\alpha$), but much better than OTO(Sum) ($\alpha = 1$), particularly at QF = 30. Considering that OTO(Linear) achieves better results in all cases, we conclude that an auto-learned $\alpha$ is significant for a practical CAR system, especially when a proper $\alpha$ cannot be given in advance. In addition, OTO(Sum) means that no difference model is in use; in contrast, our OTO with the difference model always achieves a better performance, as shown in Tables 4 and 5, which proves that OTO benefits from the high frequency information. We visualize the feature maps after the summation and difference models in Fig. 6 for a picture from LIVE1. The results show that the difference model provides more detailed information (high frequency) than the summation model, which clearly supports our motivation. Some qualitative results of ARCNN and OTO are compared in Figs. 8 and 9.
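As described above, $\alpha$ is a Caffe scale layer with a single learnable weight. A minimal PyTorch-style equivalent, with hypothetical names, is sketched below; setting `trainable=False` mimics fixing the layer's learning rate to 0 in Caffe.

```python
import torch
import torch.nn as nn

class ScaleLayer(nn.Module):
    """A single learnable scalar weight alpha, as in a Caffe scale layer."""
    def __init__(self, init=1.0, trainable=True):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init), requires_grad=trainable)

    def forward(self, x):
        return self.alpha * x
```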

4.4. On remote sensing image datasets

For JPEG-based compression artifacts reduction methods, the target block size is 8 × 8, but our SPIHT-based setting produces blocking artifacts of a larger size, 32 × 32, as shown in Fig. 7. Remotely sensed images are quite different from natural images like those of BSD500 in terms of color richness, texture distribution and so on. ARCNN was originally designed for restoring natural images; for a fair comparison, we retrain ARCNN on the remotely sensed image dataset with the architecture of the network unchanged. We adopt better training parameters with a step-attenuated learning rate instead of the original fixed one. The network converges early for the four compression ratios, 8, 16, 32 and 64, and we then evaluate it on the remote sensing task. We train and test our OTOs on the remotely sensed image dataset, and the results are shown in Table 7. The parameters of the PSNR-B and SSIM algorithms are modified to evaluate the 32 × 32-sized blocking artifacts.

At the various compression ratios, ARCNN does not increase the three scores except for PSNR-B, while OTO successfully suppresses compression artifacts on all measures. In Figs. 10 and 11, the images restored by ARCNN tend to be blurry, with blocking artifacts remaining. The failure of ARCNN and the success of OTO verify that OTO is quite effective for remote sensing image restoration when suffering from larger blocking artifacts. However, when the compression ratio becomes as large as 64, the details of the


Fig. 6. Left: the feature map of the summation model, which contains more low frequency information. Right: the feature map of the difference model, which provides more detailed information (high frequency). With the Fourier transform, we can compare the amounts of the high frequency components of the two feature maps. Removing the DC (0 frequency) component from the frequency domain and considering components with spatial frequencies > 100 as high frequency components (the size of the feature maps is 1067 × 1600), we find that the ratio of the high frequency energy to the whole energy is about 38% for the left map, and about 68% for the right map.

Fig. 8. Qualitative comparison of OTO and ARCNN under JPEG compression with Quality Factor = 20, where ringing effects are carefully handled after restoration by the OTO network (PSNR/SSIM/PSNR-B: JPEG 32.92/0.9316/29.67; ARCNN 35.61/0.9564/34.57; OTO 36.13/0.9628/35.68).

Fig. 7. The difference between the JPEG and SPIHT compression algorithms. Left: JPEG with block size 8 × 8. Right: SPIHT with block size 32 × 32. The blocking artifacts caused by SPIHT are more severe than those caused by JPEG.


compressed images are almost lost, and our OTO fails to restore the edges and structural details of the balcony, as shown in Fig. 12.

4.5. On LIVE1 and BSD500 test sets

LIVE1: As mentioned above, the proposed OTO outperforms ARCNN on the remote sensing image dataset and shows promising results in restoring SPIHT-based compression artifacts. The following experiments further show that, even compared


with recently proposed deep learning methods, OTO still achieves state-of-the-art results on the public LIVE1 and BSD500 test sets with JPEG compression.

We compare OTO with the most successful deblocking-oriented method, SA-DCT, which achieved the previous state-of-the-art results, and ARCNN is also included for a complete assessment, using the same metrics as before. The results are shown in Table 8. ARCNN did not use data augmentation on the training set in its initial conference version, but in its extended journal version a 20× augmentation is used to gain a restoration performance improvement.


Table 7. Results on the remotely sensed dataset.
Compression ratio | Metric | SPIHT | ARCNN | OTO
8 | PSNR | 37.19 | 35.71 | 39.21
8 | SSIM | 0.9782 | 0.9775 | 0.9854
8 | PSNR-B | 31.90 | 33.12 | 37.95
16 | PSNR | 33.47 | 32.69 | 35.23
16 | SSIM | 0.9523 | 0.9554 | 0.9652
16 | PSNR-B | 28.80 | 30.61 | 34.38
32 | PSNR | 30.47 | 30.39 | 32.23
32 | SSIM | 0.9111 | 0.9200 | 0.9361
32 | PSNR-B | 26.31 | 28.54 | 31.53
64 | PSNR | 27.90 | 27.83 | 29.54
64 | SSIM | 0.8489 | 0.8547 | 0.8875
64 | PSNR-B | 24.19 | 26.55 | 29.20

Fig. 9. Qualitative comparison of OTO and ARCNN under JPEG compression with Quality Factor = 10, where severe blocking artifacts are removed and the edges become sharp again (PSNR/SSIM/PSNR-B: JPEG 30.49/0.7727/28.19; ARCNN 31.71/0.8101/31.59; OTO 31.94/0.8173/31.81).


In our experiments, no data augmentation is applied, with the aim of accelerating the training process. Specifically, for the

Fig. 10. Qualitative comparison of OTO and ARCNN under SPIHT compression with Compression Ratio = 16 (PSNR/SSIM/PSNR-B, first scene: SPIHT 32.86/0.9256/28.30; ARCNN 31.88/0.9311/30.36; OTO 34.63/0.9494/33.88; second scene: SPIHT 34.24/0.9617/29.44; ARCNN 32.86/0.9619/31.43; OTO 35.86/0.9723/35.08).

PSNR metric, we achieve an average gain of 0.90 dB over SA-DCT and 0.32 dB over ARCNN. For the PSNR-B metric, the gains are even larger, at 1.38 dB and 0.34 dB, respectively. This shows that OTOs are also suitable for suppressing compression artifacts in natural images.

BSD500: We compare OTO with the traditional approaches likeDSC and also convolutional deep learning based approaches, suchas ARCNN and Trainable Nonlinear Reaction Diffusion (TNRD)(Chen et al., 2015). In DDCN, the DCT-Domain branch took advan-tage of JPEG-based prior so it is unfair for OTO only using pixel-domain information. Guo et al. propose a variant of DDCN byremoving the DCT-domain branch so that no extra prior is utilized,which is alternatively used in the comparison. The comparativeresults are shown in Table 9 with four quality factors from 10 to40. OTO outperforms all the other algorithms in terms of threemetrics, which indicates that OTO has a competent restorationability. More specifically, OTO obtains about 0.7 dB and 0.4 dBgains compared with DSC on the PSNR and PSNR-B respectively.ARCNN is beaten by 0.35 dB on the PSNR and 0.26 dB on thePSNR-B, which is consistent with the results on LIVE1.


Fig. 11. Qualitative comparison of OTO and ARCNN under SPIHT compression with Compression Ratio = 32 (PSNR/SSIM/PSNR-B, first scene: SPIHT 30.61/0.9054/26.38; ARCNN 30.76/0.9158/28.71; OTO 32.17/0.9290/31.34; second scene: SPIHT 31.34/0.9284/26.68; ARCNN 30.55/0.9312/29.03; OTO 33.84/0.9528/33.09).

Fig. 12. Qualitative comparison of OTO and ARCNN under SPIHT compression with Compression Ratio = 64 (PSNR/SSIM/PSNR-B: SPIHT 27.87/0.8503/24.67; ARCNN 27.61/0.8564/25.23; OTO 28.13/0.8728/27.68).

Table 8. Results on LIVE1.
QF | Algorithm | PSNR | PSNR-B | SSIM
QF-10 | JPEG | 27.77 | 25.33 | 0.7905
QF-10 | SA-DCT | 28.65 | 28.01 | 0.8093
QF-10 | AR-CNN | 29.13 | 28.74 | 0.8232
QF-10 | OTO | 29.28 | 28.95 | 0.8298
QF-20 | JPEG | 30.07 | 27.57 | 0.8683
QF-20 | SA-DCT | 30.81 | 29.82 | 0.8781
QF-20 | AR-CNN | 31.40 | 30.69 | 0.8886
QF-20 | OTO | 31.67 | 31.17 | 0.8954
QF-30 | JPEG | 31.41 | 28.92 | 0.9000
QF-30 | SA-DCT | 32.08 | 30.92 | 0.9078
QF-30 | AR-CNN | 32.69 | 32.15 | 0.9166
QF-30 | OTO | 33.08 | 32.48 | 0.9218
QF-40 | JPEG | 32.35 | 29.96 | 0.9173
QF-40 | SA-DCT | 32.99 | 31.79 | 0.9240
QF-40 | AR-CNN | 33.63 | 33.12 | 0.9306
QF-40 | OTO | 34.10 | 33.48 | 0.9362


Table 9. Results on the BSD500 test set.
QF | Metric | JPEG | DSC | DDCN(-DCT) | TNRD | ARCNN | OTO
10 | PSNR | 27.8 | 28.79 | 29.26 | 29.16 | 29.10 | 29.31
10 | SSIM | 0.7875 | 0.8124 | 0.8267 | 0.8225 | 0.8198 | 0.8278
10 | PSNR-B | 25.1 | 28.45 | 28.89 | 28.81 | 28.73 | 28.92
20 | PSNR | 30.05 | 30.97 | 31.55 | 31.41 | 31.28 | 31.64
20 | SSIM | 0.8671 | 0.8804 | 0.8923 | 0.8889 | 0.8854 | 0.8943
20 | PSNR-B | 27.22 | 30.57 | 30.84 | 30.83 | 30.55 | 30.95
30 | PSNR | 31.37 | 32.29 | 32.92 | 32.77 | 32.67 | 33.03
30 | SSIM | 0.8994 | 0.9093 | 0.9193 | 0.9166 | 0.9152 | 0.9211
30 | PSNR-B | 28.53 | 31.84 | 32.01 | 31.99 | 31.94 | 32.17
40 | PSNR | 32.3 | 33.23 | 33.87 | 33.73 | 33.55 | 34.00
40 | SSIM | 0.9171 | 0.9253 | 0.9336 | 0.9316 | 0.9296 | 0.9357
40 | PSNR-B | 29.49 | 32.71 | 32.86 | 32.79 | 32.78 | 32.98


5. Conclusion and future work

The CAR problem is a challenge in the field of remote sensing. In this paper, we have developed a new and general framework that combines different models through a nonlinear scheme to effectively deal with complicated compression artifacts, in particular the severe blocking effect introduced by compression. Based on the Taylor expansion, we derive two simple OTO variants, which provide a more profound investigation into our method and point to a new direction for solving the artifact reduction problem. Extensive experiments validate the performance of OTO, and new state-of-the-art results are obtained. In future work, we will deploy more complicated networks in our framework to gain better performance.
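As a schematic illustration of the topology the name refers to (one input stream, two parallel models, one fused output), a minimal sketch is given below, assuming PyTorch; the branch and fusion layers are placeholders and do not reproduce the exact configuration used in our experiments.

import torch
import torch.nn as nn

class OTOSketch(nn.Module):
    # Schematic one-two-one topology: a low-frequency branch and a
    # high-frequency (Laplacian-like) branch, fused by a learned nonlinear mapping.
    def __init__(self, channels=64):
        super().__init__()
        def make_branch():
            return nn.Sequential(
                nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, 1, 3, padding=1))
        self.low = make_branch()   # aggregates low-frequency content
        self.high = make_branch()  # recovers high-frequency detail
        self.fuse = nn.Sequential(  # nonlinear fusion of the two estimates
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, x):
        return self.fuse(torch.cat([self.low(x), self.high(x)], dim=1))

For a grayscale batch x of shape (N, 1, H, W), OTOSketch()(x) returns a restored batch of the same shape.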

Acknowledgements

The work was supported by the Natural Science Foundation of China under Contracts 61672079 and 61473086. This work was also supported by the Open Projects Program of the National Laboratory of Pattern Recognition and by the Shenzhen Peacock Plan. Baochang Zhang and Jiaxin Gu contributed equally to this paper.

References

Arbelaez, P., Maire, M., Fowlkes, C., Malik, J., 2011. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33 (5), 898–916.

Bian, X., Chen, C., Tian, L., Du, Q., 2017. Fusing local and global features for high-resolution scene classification. IEEE J. Sel. Top. Appl. Earth Obser. Remote Sens. 10 (6), 2889–2901.

Burt, P., Adelson, E., 1983. The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 31 (4), 532–540.

Cavigelli, L., Hager, P., Benini, L., 2017. CAS-CNN: a deep convolutional neural network for image compression artifact suppression. In: 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 752–759.

Chang, C.-I., Chiang, S.-S., 2002. Anomaly detection and classification for hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 40 (6), 1314–1325.

Chen, Y., Yu, W., Pock, T., 2015. On learning optimized reaction diffusion processes for effective image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5261–5269.

Chen, C., Zhang, B., Su, H., Li, W., Wang, L., 2016. Land-use scene classification using multi-scale completed local binary patterns. SIViP 10 (4), 745–752.

Cheng, G., Han, J., 2016. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 117, 11–28.

Cheng, G., Zhou, P., Han, J., 2016. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54 (12), 7405–7415.

Cheng, G., Han, J., Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105, 1865–1883.

Cramer, M., 2010. The DGPF-test on digital airborne camera evaluation – overview and test design. Photogramm.-Fernerkund.-Geoinform. 2010 (2), 73–82.

Dong, C., Loy, C.C., He, K., Tang, X., 2014. Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision. Springer, pp. 184–199.

Early, D.S., Long, D.G., 2001. Image reconstruction and enhanced resolution imaging from irregular samples. IEEE Trans. Geosci. Remote Sens. 39 (2), 291–302.


Foi, A., Katkovnik, V., Egiazarian, K., 2007. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Trans. Image Process. 16 (5), 1395–1411.

Gregor, K., LeCun, Y., 2010. Learning fast approximations of sparse coding. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 399–406.

Guo, J., Chao, H., 2016. Building dual-domain representations for compression artifacts reduction. In: European Conference on Computer Vision. Springer, pp. 628–644.

Han, J., Zhang, D., Cheng, G., Guo, L., Ren, J., 2015. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 53 (6), 3325–3337.

Heeger, D.J., Bergen, J.R., 1995. Pyramid-based texture analysis/synthesis. In: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques. ACM, pp. 229–238.

He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity mappings in deep residual networks. In: European Conference on Computer Vision. Springer, pp. 630–645.

Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2016. Densely Connected Convolutional Networks. Available from: arXiv preprint <arXiv:1608.06993>.

Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Available from: arXiv preprint <arXiv:1502.03167>.

Jancsary, J., Nowozin, S., Rother, C., 2012. Loss-specific training of non-parametric image restoration models: a new state of the art. In: European Conference on Computer Vision. Springer, pp. 112–125.

Jun, G., Hongyang, C., 2017. One-to-many Network for Visually Pleasing Compression Artifacts Reduction. Available from: arXiv preprint <arXiv:1611.04994>.

Kim, J., Kwon Lee, J., Mu Lee, K., 2016. Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654.

Krizhevsky, A., Hinton, G., 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.

Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., 2016. Photo-realistic Single Image Super-resolution using a Generative Adversarial Network. Available from: arXiv preprint <arXiv:1609.04802>.

Leonardo, G., Lorenzo, S., Marco, B., Bimbo, A.D., 2017. Deep Generative Adversarial Compression Artifact Removal. Available from: arXiv preprint <arXiv:1704.02518>.

Li, J., Gamba, P., Plaza, A., 2014. A novel semi-supervised method for obtaining finer resolution urban extents exploiting coarser resolution maps. IEEE J. Sel. Top. Appl. Earth Obser. Remote Sens. 7 (10), 4276–4287.

Li, W., Chen, C., Su, H., Du, Q., 2015. Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Trans. Geosci. Remote Sens. 53 (7), 3681–3693.

Liew, A.-C., Yan, H., 2004. Blocking artifacts suppression in block-coded images using overcomplete wavelet representation. IEEE Trans. Circ. Syst. Video Technol. 14 (4), 450–461.

Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M., 2017. Enhanced deep residual networks for single image super-resolution. In: Computer Vision and Pattern Recognition Workshops, pp. 1132–1140.

List, P., Joch, A., Lainema, J., Bjontegaard, G., Karczewicz, M., 2003. Adaptive deblocking filter. IEEE Trans. Circ. Syst. Video Technol. 13 (7), 614–619.

Liu, X., Wu, X., Zhou, J., Zhao, D., 2015. Data-driven sparsity-based restoration of JPEG-compressed images in dual transform-pixel domain. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5171–5178.

Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.

Masi, G., Cozzolino, D., Verdoliva, L., Scarpa, G., 2017. CNN-based pansharpening of multi-resolution remote-sensing images. In: Urban Remote Sensing Event, pp. 1–4.


Paris, S., Hasinoff, S.W., Kautz, J., 2011. Local Laplacian filters: edge-aware image processing with a Laplacian pyramid. ACM Trans. Graph. 30 (4), Article 68.

Reeve III, H.C., Lim, J.S., 1984. Reduction of blocking effects in image coding. Opt. Eng. 23 (1), 230134.

Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99.

Said, A., Pearlman, W.A., 1996. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. Circ. Syst. Video Technol. 6 (3), 243–250.

Scarpa, G., Vitale, S., Cozzolino, D., 2017. Target-adaptive CNN-based Pansharpening. Available from: arXiv preprint <arXiv:1709.06054>.

Sheikh, H.R., Wang, Z., Cormack, L., Bovik, A.C., 2005. LIVE Image Quality Assessment Database Release 2. http://live.ece.utexas.edu/research/quality.

Sunkavalli, K., Johnson, M.K., Matusik, W., Pfister, H., 2010. Multi-scale image harmonization. ACM Trans. Graph. 29 (4), Article 125.

Svoboda, P., Hradis, M., Barina, D., Zemcik, P., 2016. Compression artifacts removal using convolutional neural networks. Available from: arXiv preprint <arXiv:1605.00366>.

Timofte, R., De Smet, V., Van Gool, L., 2014. A+: adjusted anchored neighborhood regression for fast super-resolution. In: Asian Conference on Computer Vision. Springer, pp. 111–126.

Vivone, G., Alparone, L., Chanussot, J., Dalla Mura, M., Garzelli, A., Licciardi, G.A., Restaino, R., Wald, L., 2015. A critical comparison among pansharpening algorithms. In: Geoscience and Remote Sensing Symposium, pp. 2565–2586.

Vosselman, G., Coenen, M., Rottensteiner, F., 2017. Contextual segment-based classification of airborne laser scanner data. ISPRS J. Photogramm. Remote Sens. 128, 354–371.

Wang, Z., Bovik, A.C., Lu, L., 2002. Why is image quality assessment so difficult? In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4. IEEE, pp. IV-3313–IV-3316.

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), 600–612.

Wang, C., Zhou, J., Liu, S., 2013. Adaptive non-local means filter for image deblocking. Signal Process.: Image Commun. 28 (5), 522–530.

Wang, Z., Liu, D., Chang, S., Ling, Q., Yang, Y., Huang, T.S., 2016. D3: deep dual-domain based fast restoration of JPEG-compressed images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2764–2772.

Wang, T., Sun, M., Hu, K., 2017. Dilated Deep Residual Network for Image Denoising. Available from: arXiv preprint <arXiv:1708.05473>.

Xiao, P., Zhang, X., Wang, D., Yuan, M., Feng, X., Kelly, M., 2016. Change detection of built-up land: a framework of combining pixel-based detection and object-based recognition. ISPRS J. Photogramm. Remote Sens. 119, 402–414.

Yao, X., Han, J., Guo, L., Bu, S., Liu, Z., 2015. A coarse-to-fine model for airport detection from remote sensing images using target-oriented visual saliency and CRF. Neurocomputing 164, 162–172.

Yao, X., Han, J., Cheng, G., Qian, X., Guo, L., 2016. Semantic annotation of high-resolution satellite images via weakly supervised learning. IEEE Trans. Geosci. Remote Sens. 54 (6), 3660–3671.

Yao, X., Han, J., Zhang, D., Nie, F., 2017. Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering. IEEE Trans. Image Process. 26 (7), 3196–3209.

Yim, C., Bovik, A.C., 2011. Quality assessment of deblocked images. IEEE Trans. Image Process. 20 (1), 88–98.

Yu, K., Dong, C., Loy, C.C., Tang, X., 2016. Deep Convolution Networks for Compression Artifacts Reduction. Available from: arXiv preprint <arXiv:1608.02778>.

Zhang, K., Chen, Y., Chen, Y., Meng, D., Zhang, L., 2017a. Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26 (7), 3142–3155.


Zhang, K., Zuo, W., Zhang, L., 2017b. FFDNet: Toward a Fast and Flexible Solution for CNN-based Image Denoising. Available from: arXiv preprint <arXiv:1710.0402>.

Baochang Zhang received the B.S., M.S. and Ph.D. degrees in Computer Science from Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively. From 2006 to 2008, he was a research fellow with the Chinese University of Hong Kong, Hong Kong, and with Griffith University, Brisbane, Australia. Currently, he is an associate professor with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University, Beijing, China. He was supported by the Program for New Century Excellent Talents in University of the Ministry of Education of China. His current research interests include pattern recognition, machine learning, face recognition, and wavelets.

Jiaxin Gu received the B.S. degree from the School of Automation Science and Electrical Engineering, Beihang University, in 2017. He is pursuing the Master's degree in the same school at Beihang University, and his current research interests include image restoration, object detection and deep learning.

Chen Chen received the B.E. degree in automation from Beijing Forestry University, Beijing, China, in 2009, the M.S. degree in electrical engineering from Mississippi State University, Starkville, MS, USA, in 2012, and the Ph.D. degree from the Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX, USA, in 2016. He is currently a Post-Doc in the Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA. He has published more than 50 papers in refereed journals and conferences. His research interests include compressed sensing, signal and image processing, pattern recognition, and computer vision.

Jungong Han is currently a tenured faculty member with the School of Computing and Communications at Lancaster University, UK. His research interests include video analysis, computer vision and artificial intelligence.

Xiangbo Su received the B.E. degree in Automation Science from Beihang University, Beijing, China, in 2015. He is currently an M.S. student at the School of Automation Science and Electrical Engineering, Beihang University. His current research interests include machine learning and deep learning in general, with computer vision applications in object tracking, recognition and image restoration.

Xianbin Cao received the B.Eng. and M.Eng. degrees in computer applications and information science from Anhui University, Hefei, China, in 1990 and 1993, respectively, and the Ph.D. degree in information science from the University of Science and Technology of China, Beijing, in 1996. He is currently a Professor with the School of Electronic and Information Engineering, Beihang University, Beijing, China. He is also the Director of the Laboratory of Intelligent Transportation System. His current research interests include intelligent transportation systems, airspace transportation management, and intelligent computation.

Jianzhuang Liu received the Ph.D. degree in computer vision from The Chinese University of Hong Kong, Hong Kong, in 1997. He was a Research Fellow with Nanyang Technological University, Singapore, from 1998 to 2000. From 2000 to 2012, he was a Post-Doctoral Fellow, an Assistant Professor, and an Adjunct Associate Professor with The Chinese University of Hong Kong. He became a Professor with the University of Chinese Academy of Sciences in 2011. He is currently a Principal Researcher with Huawei Technologies Co., Ltd., Shenzhen. He has authored over 100 papers, most of which are in prestigious journals and conferences in computer science. His research interests include computer vision, image processing, machine learning, multimedia, and graphics.
