
IET Computer Vision
IET Research Journals, Regular Paper
ISSN 1751-8644, doi: 0000000000, www.ietdl.org

Learning across Views for Stereo Image Completion

Wei Ma 1, Mana Zheng 1, Wenguang Ma 1, Shibiao Xu 2*, Xiaopeng Zhang 2
1 Faculty of Information Technology, Beijing University of Technology, No. 100 Pingleyuan Street, Chaoyang District, Beijing, China
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* E-mail: [email protected]

Abstract: Stereo image completion (SIC for short) is to fill holes existing in a pair of stereo images. SIC is more complicated than single image repairing, since it needs to complete the pair of images while keeping their stereoscopic consistency. In recent years, deep learning has been introduced into single image repairing but seldom used for SIC. In this paper, we present a novel deep learning based approach for high quality SIC. In our method, an X-shaped network (SICNet for short) is proposed and designed to complete stereo images, which is composed of two branches of CNN layers to encode the context of the left and right images separately, a fusion module for stereo-interactive completion, and two branches of decoders to produce the completed left and right images respectively. In consideration of both inter-view and intra-view cues, we introduce auxiliary networks and define comprehensive losses to train SICNet to perform single-view coherent and cross-view consistent completion simultaneously. Extensive experiments are conducted to show the state-of-the-art performance of the proposed approach and its key components.

1 Introduction

The goal of image completion is to fill holes in images with contents coherent with the remaining parts [1, 2]. The holes might be caused by removing unwanted objects [3, 4], or by data corruption due to malicious attacks, careless preservation or transmission [5, 6]. Image completion is a hot research topic in computer vision. It is also key to other tasks, e.g. image-based scene modeling and rendering [7, 8].

Compared to single image repairing, stereo image completion (SIC for short) is more complicated. SIC needs to fill contents coherent with both the surroundings and the corresponding parts in the other view. Although single image completion in the deep learning framework has been studied for years [2, 9, 10], deep-learning based stereo image completion methods remain rare. An intuitive solution to SIC is treating a stereo pair as two single images and repairing them one by one using methods for single image completion. However, this cannot guarantee the consistency between the filled left and right views, which is essential for stereo image data [11, 12]. On the other hand, missing areas in a given pair of stereo images might be differently positioned in the left and right views, like the first example given in Fig. 1. Repairing one view by referring to the existing contents in the other view is more reliable than filling from scratch view by view [13]. This also argues against treating stereo pairs as two independent single images. In 2019, Chen et al. [14] attempted to solve the SIC problem by extending the Context Encoder structure in [2]. However, their method is limited in practical applications since it can only deal with holes of fixed size and shape centered in stereo image pairs.

Given the above facts, in this paper, we present a practical approach for SIC based on convolutional neural networks (CNNs). The main body is a fully convolutional network with an encoder-fusion-decoder structure trained for stereo image completion. We call the network SICNet for short. SICNet is composed of three parts: encoders, a fusion module and decoders. At first, SICNet uses two branches of CNN encoders to extract the context cues in the left and right images separately. Then, a convolution-based fusion module merges the two and performs stereo-interactive repairing. Next, two CNN decoders connected to the fusion module generate the completed left and right views, respectively. The encoder-fusion-decoder structure ensures that our hole filling can refer to both inter-view and intra-view cues. Moreover, since it is a fully convolutional structure, SICNet can complete images with holes of any size and shape located at any position in the images.

In order to train the proposed SICNet model, we introduce several complementary losses, some of which are defined based on auxiliary networks. Specifically, by referring to available methods for single image completion, we introduce a pixel-level reconstruction loss and a GAN-based adversarial loss to supervise learning for overall structure and subtle detail repairing in each view, respectively. More importantly, we present two types of stereo consistency losses, defined based on the completed pair and a disparity estimation network, respectively, to supervise learning across views for stereo-interactive repairing.

Fig. 1 gives two examples used to test the trained SICNet in cases of content missing in one view and in both views, respectively. The two cases generally arise from random data corruption and intentional object removal, respectively. Note that we have no ground-truth data for training in the case of object removal; therefore, we only use data for corruption restoration to train SICNet and test it in both cases. From the two examples, we can see that, in the case of corruption restoration, SICNet can confidently repair a view by referring to the other view. In terms of object removal, SICNet fills the holes existing in the two views consistently with each other and coherently with their surroundings. More results are given in the experiment part.

The main contributions of this paper are summarized as follows:

i. We propose a novel deep-learning based approach for high quality SIC. It is capable of generating contents coherent with surroundings and consistent with corresponding parts in the other view.

ii. An X-shaped fully convolutional network, called SICNet, is proposed and designed. The trained SICNet can deal with stereoscopic images of arbitrary size with holes at any position, which is significant for real applications.

iii. A group of complementary losses is defined, with assistance from auxiliary networks via inter-view and intra-view cues, to train the proposed SICNet to learn across views.

iv. We conduct extensive experiments to show the state-of-the-art performance of the proposed method and the effectiveness of its key components.



Fig. 1: Applying our proposed SICNet to repairing missing regions (indicated in green), caused by corruption (the example given in the left two columns) and object removal (the example given in the right two columns). In each example, the first row shows the input left and right views, and the second row gives the results for the two views. All the results are directly generated by SICNet without any extra post-processing operation.

This paper is organized as follows. Section 2 introduces related works. In Section 3, we describe our method in detail. Experimental results and analyses are given in Section 4. Section 5 presents conclusions and limitations.

2 Related works

In this section, we introduce works related to SIC, i.e. single image completion and stereo image completion. Considering that single image repairing can be treated as a special case of SIC, and that methods for SIC are generally extensions of those for single image repairing, we start from the review of methods for single image repairing.

Traditional methods for single image completion. Traditional methods for single image repairing can be divided into two classes, diffusion-based and patch-based. Diffusion-based methods [15–18] propagate surrounding information into missing regions. However, these methods often fail to recover meaningful structures. The contents they fill are generally restricted by locally available information around the holes. Moreover, it is hard for these methods to deal with large holes. Patch-based methods [1, 4, 19–21] fill missing regions in a given image by searching and copying patches from the remaining parts of the image. These methods are computationally expensive and have difficulties in reconstructing locally unique patterns.

Traditional methods for stereo image completion. Stereo image completion is more complicated than single image repairing. Besides being meaningful, the generated new stereo pairs must be left-right consistent so that viewers can obtain a convincing 3D visual experience [22, 23]. In order to achieve this purpose, traditional methods generally estimate completed disparity/depth maps and use them to guide stereo view repairing [24–26]. The involved algorithms are all derived from traditional methods for single image completion. For example, Wang et al. [13] used depth-assisted texture synthesis to simultaneously fill in both color and depth images. Morse et al. [27] estimated depth using a diffusion-based method. They then extended the PatchMatch algorithm [28] for cross-view searching and matching with consistency constraints defined by the depth. These depth-guided stereo image completion methods, developed by referring to existing algorithms for single image completion, are sensitive to errors in depth recovery and also inherit the problems of traditional methods for single image completion.

Deep learning methods for single image completion. Deep learning frameworks have been widely used for single image repairing in recent years [2, 5, 6, 29, 30]. Among these methods, some mainly target inpainting small corruptions [29]. To deal with large holes, Pathak et al. [2] proposed the Context Encoder. A decoder is used to generate the completed image by referring to the context cues captured by the encoder. In order to propagate context cues across the entire image content, a channel-wise fully-connected layer was used to connect the encoder and decoder. An L2 reconstruction loss and an adversarial loss were combined to train the Context Encoder. Iizuka et al. [9] changed the Context Encoder to be fully convolutional so as to complete images of arbitrary resolutions with holes of any shape. Moreover, they trained their network with a globally and locally consistent adversarial training approach. Their two-scale adversarial loss has been demonstrated to be effective and is also adopted as part of our method. Results obtained with only the adversarial consistency check might have subtle color inconsistencies, so the authors performed post-processing with fast marching [31] followed by Poisson image blending [32]. Yu et al. [33] extended [9] to two stages. In the first stage, a coarse result is produced. Inspired by traditional patch-based methods, in the second stage, the authors used the features of known patches as convolutional filters to refine the patches generated in the first stage. Nazeri et al. presented EdgeConnect [10], which is composed of an edge generator and an image completion network. The former hallucinates edges in the missing regions and the latter completes the regions guided by the edges. EdgeConnect shows good performance in reconstructing reasonable structures.

In summary, traditional methods are intuitive and inspiring for deep-learning based methods, but their performance is limited. Deep learning has been widely used for single image repairing but hardly used for stereo image completion. Single-image methods could be used to repair stereo images view by view; however, they cannot refer to reliable cues which might exist in the other view and often fail to generate stereo-consistent results, as we will demonstrate in the experiment part. Chen et al. [14] tried to solve the SIC problem by extending the Context Encoder in [2]. However, their method is limited to repairing image pairs with square holes in the center. In this paper, we propose a stereo image completion network which is capable of completing holes of any size and shape positioned anywhere.

3 Proposed Method

In this section, we first introduce the proposed network for stereo image completion. Next, we present the losses designed to train our network. Finally, we describe the training strategy in detail.


[Fig. 2 depicts, within a dashed box, the SICNet generator (left/right inputs -> two encoders -> fusion module -> two decoders -> left/right outputs), together with the auxiliary training branches: a disparity network supervised by L_disp against the ground-truth disparity, and left/right global and local discriminators supervised by L_disc (real or fake).]

Fig. 2: Overall architecture of the proposed approach. It is composed of a stereo image completion network (within the dashed box in the figure) and auxiliary branches for training.

3.1 Deep Architecture

The architecture of the proposed method for stereo image completion is shown in Fig. 2. The main body, i.e. the part within the dashed box in the figure, is a fully convolutional network responsible for stereo image completion, called SICNet for short. Besides, we have several auxiliary branches used to train the network, together with directly defined losses which will be given in Section 3.2.

SICNet is designed as an X-shaped encoder-fusion-decoder structure. It has two branches of encoders and decoders, and the two branches intersect in the middle to exchange cues. Given a pair of stereo images and their binary masks indicating the regions to be repaired, SICNet treats each view as a four-channel (RGB plus the binary mask) map and feeds the two views separately into the two branches of encoders. Each view is encoded into lower-resolution feature maps which well express the spatial context and lower the subsequent computational complexity. The two views of feature maps are then sent to the fusion module for stereo-interactive repairing. Next, the generated feature maps are used to produce the two completed views, each being a three-channel RGB image, by the two branches of decoders separately.

Fig. 3 shows the structure of the fusion module. At first, it concatenates the feature maps of the left and right views along the channel dimension. Then, two 1×1 convolution layers are used to integrate cues across different channels and views and to generate two views of feature maps for the decoders. The integration across views helps cue sharing and consistency keeping between the two views. Note that, here we assume elements in the same position of the two views are correspondent, which is approximate but reasonable after resolution reduction and context aggregation via the encoders.
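To make this concrete, below is a minimal PyTorch sketch of such a concatenation-plus-1×1-convolution fusion block. The class name, the per-view channel count of 256 (taken from Fig. 4) and the layer arrangement are our illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn as nn

    class FusionModule(nn.Module):
        """Concatenate left/right feature maps and mix them with 1x1 convolutions."""
        def __init__(self, channels=256):
            super().__init__()
            # Each 1x1 convolution sees both views' channels and outputs one view.
            self.merge_left = nn.Conv2d(2 * channels, channels, kernel_size=1)
            self.merge_right = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, feat_left, feat_right):
            fused = torch.cat([feat_left, feat_right], dim=1)  # (B, 2C, H, W)
            return self.merge_left(fused), self.merge_right(fused)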

Fig. 3: Fusion module. The integration across views helps cue sharing and consistency keeping between the two views.

The detailed structure of our SICNet is listed in Fig. 4. The encoding part downsamples the input twice and uses four dilated convolution layers to increase the receptive field. Following [9], the dilation factors are 2, 4, 8, 16 in order. In Fig. 4, dila stands for the dilation rate, and ConvBnRe represents convolution with batch normalization and a ReLU activation function. ConvMergeLeft/ConvMergeRight is the layer for the left/right view after concatenation in the fusion module. The Output column gives the output channel numbers. The resolutions of the images/feature maps are restored to the original in the decoding process by upsampling twice.

We design two types of auxiliary branches (as can be seen in Fig. 2), one for discriminating the realism of the filled contents at both local and global scales, and one for assessing stereo consistency by estimating the disparity of the completed image pair. The discriminators and the main body, the SICNet introduced above, form a GAN framework, which has proved very effective in image generation tasks [5, 6, 9, 33]. The global and local discriminators check the context coherency in both the whole images and local areas. All the discriminators are defined the same as in [9]; for more details, please refer to [9].


PSMNet [34], a state-of-the-art disparity estimation network, is used to compute the disparity of the completed pair. The disparity is then compared with the ground-truth disparity, computed also by PSMNet from the ground-truth image pair. This design indirectly checks the stereo consistency of the completed pair, based on the fact that if the generated views are stereo consistent and close to the ground truth, their disparity will be close to the ground-truth disparity.

SICNet Encoder
Name               Kernel  Stride  Dilation          Output
ConvBnRe           5×5     1×1     -                 64
ConvBnRe           3×3     2×2     -                 128
ConvBnRe           3×3     1×1     -                 128
ConvBnRe           3×3     2×2     -                 256
ConvBnRe ×2        3×3     1×1     -                 256
DilateConvBnRe ×4  3×3     1×1     dila=2, 4, 8, 16  256
ConvBnRe ×2        3×3     1×1     -                 256

Fusion module
Concat             -       -       -                 512
ConvMergeLeft      1×1     1×1     -                 256
ConvMergeRight     1×1     1×1     -                 256

Decoder
DeconvBnRe         4×4     2×2     -                 128
ConvBnRe           3×3     1×1     -                 128
DeconvBnRe         4×4     2×2     -                 64
ConvBnRe           3×3     1×1     -                 32
ConvBnRe           3×3     1×1     -                 3

Global and Local Discriminators
Global discriminator: ConvBnRe ×6, 5×5, stride 2×2; FC 1024
Local discriminator:  ConvBnRe ×5, 5×5, stride 2×2; FC 1024
Concat: 2048; FC: 1

Fig. 4: Detailed architecture of the proposed method.
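As a rough PyTorch rendering of the encoder column in Fig. 4 (the ConvBnRe helper, padding choices and class name are our own assumptions reconstructed from the table, not the authors' code):

    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch, kernel, stride=1, dilation=1):
        # "ConvBnRe" block of Fig. 4: convolution + batch normalization + ReLU.
        pad = dilation * (kernel // 2)
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=pad, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            layers = [
                conv_bn_relu(4, 64, 5),              # RGB + mask input, 4 channels
                conv_bn_relu(64, 128, 3, stride=2),  # first downsampling
                conv_bn_relu(128, 128, 3),
                conv_bn_relu(128, 256, 3, stride=2), # second downsampling
                conv_bn_relu(256, 256, 3),
                conv_bn_relu(256, 256, 3),
            ]
            # Four dilated ConvBnRe layers with dilation rates 2, 4, 8, 16.
            layers += [conv_bn_relu(256, 256, 3, dilation=d) for d in (2, 4, 8, 16)]
            layers += [conv_bn_relu(256, 256, 3), conv_bn_relu(256, 256, 3)]
            self.body = nn.Sequential(*layers)

        def forward(self, x):
            return self.body(x)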

3.2 Loss function

In this part, we present the losses defined to train SICNet. We denote an input pair as I_input, and the masks indicating the areas to be repaired in the input pair as M, in which missing areas are filled with 1 and the other pixels are set to 0. G is SICNet, which generates a pair of completed images. The ground truth of I_input is denoted as I_gt. We use superscripts l and r to indicate the left and right views, respectively.

Reconstruction loss. Given a pair of images to be repaired and its ground truth, we adopt the L2 distance to measure the pixel-level reconstruction loss of the generated areas in each view, left or right, which is given by

L^v_{MSE} = \left\| \left( G^v(I_{input}) - I^v_{gt} \right) \odot M^v \right\|_2    (1)

Here, \odot is the pixel-wise multiplication, \|\cdot\| is the Euclidean norm, and v \in \{l, r\}.
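A minimal PyTorch sketch of Eq. (1), where the mask is 1 inside the holes and 0 elsewhere (the function name and tensor layout are our assumptions):

    import torch

    def reconstruction_loss(completed, ground_truth, mask):
        # Eq. (1): Euclidean norm of the difference, restricted to the hole region.
        return torch.norm((completed - ground_truth) * mask, p=2)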

Adversarial loss. Considering that the L2 loss is incapable of supervising SICNet to generate sharper details [2, 9], we introduce an adversarial loss L_adv for each view,

L^v_{adv} = -\left[ \log\left( D^v\left( G^v(I_{input}) \right) \right) \right]    (2)

where D denotes the auxiliary discriminator networks introduced in Section 3.1. It evaluates the realism of the completed left/right view at two scales, as done in [9].
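A sketch of Eq. (2), assuming the discriminator outputs a probability in (0, 1), e.g. after a sigmoid (the epsilon term is ours, for numerical safety):

    import torch

    def adversarial_loss(disc_score_on_fake, eps=1e-8):
        # Eq. (2): -log D(G(I_input)) for one view, averaged over the batch.
        return -torch.log(disc_score_on_fake + eps).mean()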

Directly-defined consistency loss. We guide learning for stereo consistency with two losses. Here we present the first one, which is directly defined on the completed image pair. The computation process of the loss is illustrated in Fig. 5. Given a pair of completed images G(I_input), we synthesize a left view from the right view G^r(I_input) via warping (denoted as \overleftarrow{W}) based on the left ground-truth disparity map d^l_{gt}. Then we compare the completed left view G^l(I_input) with the synthesized one \overleftarrow{W}(G^r(I_{input}), d^l_{gt}) in the newly generated areas. The loss is defined as

L_{consD} = 1 - \frac{\sum a \odot b}{\sqrt{\left( \sum a \odot a \right) \times \left( \sum b \odot b \right)}}    (4)

Here,

a = M^l \odot G^l(I_{input}), \quad b = M^l \odot \overleftarrow{W}\left( G^r(I_{input}), d^l_{gt} \right)

L_{consD} is defined by referring to Normalized Cross Correlation (NCC) [35]. We will verify the effectiveness of this consistency loss in the experiment part.
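The warping and the NCC-style comparison of Eq. (4) might be sketched as follows in PyTorch; the backward-warping convention (x_right = x_left - d for a left disparity map) and all function names are our assumptions rather than the authors' implementation:

    import torch
    import torch.nn.functional as F

    def warp_right_to_left(right, disp_left):
        # Backward-warp the right view to the left view using the left disparity map.
        # right: (B, C, H, W); disp_left: (B, 1, H, W), in pixels.
        b, _, h, w = right.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        xs = xs.to(right).unsqueeze(0) - disp_left.squeeze(1)   # shift x by the disparity
        ys = ys.to(right).unsqueeze(0).expand_as(xs)
        grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
        return F.grid_sample(right, grid, align_corners=True)

    def ncc_consistency_loss(completed_left, warped_right, mask_left, eps=1e-8):
        # Eq. (4): 1 - NCC between the completed and warped views inside the holes.
        a = mask_left * completed_left
        b = mask_left * warped_right
        return 1 - (a * b).sum() / torch.sqrt((a * a).sum() * (b * b).sum() + eps)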

Fig. 5: Illustration of stereo consistency computation: synthesizing the left view by warping the right and comparing it with the original left. (Panels from left to right: right view, disparity, synthesized left view, left view.)

Total loss. We combine the reconstruction loss, the adversarial loss, and the directly-defined consistency loss to train SICNet. The total loss is given by

L_{total} = \sum_{v \in \{l, r\}} \left( \alpha L^v_{MSE} + \beta L^v_{adv} + \lambda L_{consD} \right)    (7)

Here, \alpha, \beta, \lambda are the weights of the three types of loss, respectively. In implementation, we set \alpha = 1, \beta = 0.0004, \lambda = 0.01. Note that we use the same directly-defined consistency loss for both the left and right views for simplification.
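Using the sketches above, Eq. (7) might then be assembled as below (the per-view packaging is hypothetical; the weights are the settings stated above):

    def total_loss(mse_losses, adv_losses, cons_loss, alpha=1.0, beta=4e-4, lam=1e-2):
        # Eq. (7): weighted sum over the left and right views; the same
        # directly-defined consistency loss is counted once per view.
        total = 0.0
        for mse, adv in zip(mse_losses, adv_losses):  # (left, right)
            total = total + alpha * mse + beta * adv + lam * cons_loss
        return total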

PSMNet-based consistency loss. Considering that a well-repaired pair should have a disparity map coherent with that obtained from the ground-truth pair, we define a consistency loss based on PSMNet, a state-of-the-art disparity network proposed in [34]. The loss is given by

L_{consP}(d_{gt}, d_{est}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\left( d^i_{gt} - d^i_{est} \right)    (5)

in which

\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}    (6)

Here, d_{gt} and d_{est} are the ground-truth disparity map and that estimated from the completed pair, respectively. d_{gt} is obtained by PSMNet with the ground-truth image pair. N is the total number of pixels.
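Since Eqs. (5)-(6) coincide with the standard smooth-L1 (Huber) penalty with a threshold of 1, a sketch using PyTorch's built-in function is:

    import torch.nn.functional as F

    def psmnet_consistency_loss(disp_gt, disp_est):
        # Eq. (5)-(6): mean smooth-L1 distance between ground-truth and estimated disparity.
        return F.smooth_l1_loss(disp_est, disp_gt, reduction="mean")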


Binary cross entropy loss. A binary cross entropy loss L_bin is introduced to help train the two discriminators introduced in Section 3.1. The loss is given by

L^v_{bin} = -\left[ \log D^v(I_{gt}) + \log\left( 1 - D^v\left( G^v(I_{input}) \right) \right) \right]    (8)

Here, D denotes the discriminators.
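A sketch of Eq. (8) for one view, again assuming the discriminator outputs probabilities (the names and the epsilon are ours):

    import torch

    def discriminator_loss(score_real, score_fake, eps=1e-8):
        # Eq. (8): binary cross entropy on real (ground truth) and fake (completed) scores.
        return -(torch.log(score_real + eps) + torch.log(1 - score_fake + eps)).mean()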

3.3 Training strategy

The proposed network is relatively complex, with multiple functional branches and modules. In order to stabilize its training process, we adopt a multistage strategy, which can be treated as an extended version of the one in [9] for training their single image completion network. Our training is composed of three stages. In the first stage, we only train the main body, i.e. SICNet, using the reconstruction loss. In the second stage, we train the two discriminators. In the third stage, we conduct joint training of SICNet and the discriminators, and then joint training of SICNet and the disparity estimation network. Details of the training process are given in Algorithm 1.

Algorithm 1: Training the proposed SICNet.
Input: Stereo image pairs
1:  while iterations t < T_total do
2:    Randomly sample a mini-batch of stereo images {(I^l_input, I^r_input)} from the training data
3:    Generate masks {(M^l, M^r)} with random holes for each pair of images in the mini-batch
4:    if t < T_G then
5:      update the generator network SICNet with the MSE loss L_MSE using {(I^l_input, I^r_input, M^l, M^r)}
6:    end if
7:    if T_G <= t < T_G + T_D then
8:      generate masks {(M^l_d, M^r_d)} with random holes for each pair of images in the mini-batch {(I^l_input, I^r_input)}
9:      update the discriminators with the binary cross entropy loss L_bin using both {(G(I^l_input, I^r_input, M^l, M^r), M^l_d, M^r_d)} and {(I^l_input, I^r_input, M^l_d, M^r_d)}
10:   end if
11:   if t > T_G + T_D then
12:     update the generator network SICNet with the total loss L_total using {(I^l_input, I^r_input, M^l, M^r)} and the discriminators with the binary cross entropy loss, together
13:     update the disparity network and the generator with the L_consP loss
14:   end if
15: end while

4 Experiments

In this section, we first introduce the dataset used to train and test the proposed approach, along with the implementation details. Then, we describe the evaluation metrics used in the experiments. Next, we compare our method with state-of-the-art methods, in terms of image restoration and object removal, respectively. Finally, we conduct an ablation study to verify the key components of the proposed method.

4.1 Dataset

Our experiments are performed on the KITTI [36] dataset, which contains 42382 rectified stereo pairs from 61 scenes. In order to avoid high correlation among image pairs, we resample the dataset at 1/5 of the original frequency. For convenience of training and fair quantitative comparison of different methods, we rescale the resampled 8476 images to 256×256, with 8226 images for training and 250 images for testing. In practice, the trained model can be used for images of any size.

4.2 Implementation details

All computations are performed on a computer with a single GTX 1080 Ti GPU. The proposed architecture is implemented in Python with the PyTorch [37] library and optimized using the Adam optimizer [38] with β1 = 0.9, β2 = 0.999. The initial generator learning rate is set to 5e-4 and the initial discriminator learning rate is set to 1e-5. T_total, T_G, and T_D in Algorithm 1 are set to 1000, 200, and 100 epochs, respectively.
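In PyTorch, these optimizer settings correspond roughly to the following (the function and argument names are placeholders, not the authors' code):

    import torch

    def build_optimizers(generator, discriminators):
        # Adam with beta1 = 0.9, beta2 = 0.999; learning rates as in Section 4.2.
        gen_opt = torch.optim.Adam(generator.parameters(), lr=5e-4, betas=(0.9, 0.999))
        disc_opt = torch.optim.Adam(discriminators.parameters(), lr=1e-5, betas=(0.9, 0.999))
        return gen_opt, disc_opt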

4.3 Evaluation metrics

We present both quantitative and qualitative evaluations of the completed image pairs in the test dataset. The quantitative evaluation is conducted in terms of image quality and stereo consistency. We use several popular criteria for single image quality evaluation, including the L1 distance, L2 distance, PSNR and SSIM. They are computed by comparing the completed contents with their ground-truth counterparts. The L1 and L2 distances measure pixel-level errors. PSNR [39] quantifies the overall consistency with the ground truth from a perceptual perspective. SSIM [40] measures local structural coherency with the ground truth. Higher PSNR or SSIM and lower L1 or L2 mean better quality.

We compute the disparity map of the completed image pair and compare it with the ground-truth disparity map obtained by PSMNet with the ground-truth image pair. Then, we compute the percentage of erroneous pixels to express stereo consistency, which is given by

DispE = \frac{1}{N} \sum_i \left[ \left( d^i_E > p_1 \right) \,\&\, \left( \frac{d^i_E}{d^i_{gt}} > p_2 \right) \right], \quad d^i_E = \left| d^i_{est} - d^i_{gt} \right|    (9)

where d_{est} is the disparity map of the completed image contents, d_{gt} is the ground-truth disparity map, and d_E is the error map. A pixel is considered erroneous if its disparity error is larger than p_1 pixels; we set p_1 = 3. p_2 is the relative disparity error threshold, which is set as 0.05. N is the total number of pixels.
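A sketch of Eq. (9) on PyTorch tensors (the clamp on the denominator is our own safeguard against zero ground-truth disparities):

    def disparity_error_rate(disp_est, disp_gt, p1=3.0, p2=0.05):
        # Eq. (9): fraction of pixels whose disparity error exceeds p1 pixels
        # and also exceeds p2 relative to the ground-truth disparity.
        err = (disp_est - disp_gt).abs()
        bad = (err > p1) & (err / disp_gt.clamp(min=1e-8) > p2)
        return bad.float().mean()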

4.4 Comparison with state-of-the-art methods

We test our method and compare it with state-of-the-art methods in terms of corruption restoration and object removal, respectively. The corruptions are simulated by randomly positioning blocks of arbitrary size on both the training and test datasets. Each pair of corrupted images has ground-truth data for supervision during training or quantitative evaluation during testing.

As for object removal, we have no data behind the objects to be removed, which means there are no ground-truth data available for training or quantitative evaluation. Therefore, we train our model only on corruption data but test it on both corruption and object removal data. For the same reason, we can only present qualitative comparisons with state-of-the-art methods for object removal.

Note that comparison with traditional methods is not considered in our experiments, for the following reasons. First, deep-learning based methods learn priors from large datasets while traditional methods rely on the remaining parts of the input, which leads the two to work well in different scenarios. Specifically, deep-learning based methods perform better in repairing scenes having structures similar to those in the training data, while traditional methods can fully exploit the clues in the remaining parts, which are generally much more reliable than those from other images. Therefore, it is considered unfair to compare them. Second, traditional methods repair the holes gradually through optimization, which generally consumes much time. For example, Laplacian [19] takes around 50 minutes to repair a 640×384 image while deep learning methods, like our SICNet, take only a few seconds. This makes it hard to perform a sufficient comparison with traditional methods on 250 testing images.

On the other hand, the only deep-learning based stereo image restoration method [14], which was also developed by us, cannot deal with holes of arbitrary size at any position. Therefore, we only present comparison with recent deep-learning based methods for single image repairing, including GL [9] and EdgeConnect [10]. To complete stereo views using methods for single images, we treat each pair of stereo images as two single ones and repair them one by one.

4.4.1 Corruption restoration: Here we compare our method with state-of-the-art deep learning based methods in terms of corruption restoration, quantitatively and qualitatively.

Quantitative comparison. Table 1 lists the results. From the table, we can see that our method obtains the highest PSNR and SSIM, and the lowest L1 and L2 errors in both the left and right views, which means our method obtains the best single image repairing results. This is because we can refer to the other view for confident repairing in case only the contents in one view are missing. Even in the case of no reference in the other view, our method can perform repairing under the constraints of both single-view context and stereo-view consistency. Moreover, our method has the lowest disparity error, which means that our method generates the most stereo-consistent results, thanks to the stereo-consistency constraints during repairing.

Table 1 Quantitative comparison on image restoration.
Metrics               GL           EdgeConnect   SICNet
PSNR (left/right)     28.26/29.00  28.91/29.57   31.62/32.36
SSIM (left/right)     0.986/0.988  0.987/0.989   0.995/0.995
L1 (%) (left/right)   1.02/0.93    0.98/0.86     0.64/0.59
L2 (%) (left/right)   0.32/0.27    0.32/0.26     0.14/0.12
DispE (%)             6.17         5.77          2.14

Qualitative comparison. Fig. 6 presents some qualitative results. Here, we only give results of stereo pairs with contents missing in one view, since results for pairs with contents missing in both views can be seen in the object removal experiments. GL and EdgeConnect try to complete the holes from the surroundings in the single view, with priors learned from data. However, they cannot correctly guess what is missing in the images, as can be seen from Fig. 6. Since we have reference contents in the other view, we can recover the content correctly. This fact suggests that in digital media recording and preservation, auxiliary views might be necessary for safer preservation.

[Fig. 6 columns: Ground-truth, Input, GL (2017), EdgeConnect (2019), SICNet. PSNR(left) | DispE(%) per example: (a) GL 25.43 | 0.693, EdgeConnect 25.39 | 0.919, SICNet 33.24 | 0.021; (b) GL 31.75 | 0.005, EdgeConnect 32.34 | 0.417, SICNet 33.43 | 0.000; (c) GL 26.07 | 0.150, EdgeConnect 27.12 | 0.172, SICNet 31.65 | 0.128.]

Fig. 6: Qualitative comparison on image restoration. Green color indicates the corrupted regions.


Note that the stereo consistency metric DispE is computed by comparing the disparity maps of the repaired image pairs and the ground-truth image pairs, both obtained using PSMNet. PSMNet estimates disparities by referring to stereo correspondence, context and pre-learned priors; therefore, some stereo inconsistencies cannot be reflected by DispE. As can be seen from the second example in Fig. 6, the parts repaired by GL are clearly inconsistent, yet its DispE value is close to ours.

[Fig. 7 rows for both examples: Input (left and right views, objects to be removed in green), GL (2017), EdgeConnect (2019), SICNet.]

Fig. 7: Qualitative comparison on completion after object removal. Green color indicates the objects to be removed.


[Fig. 8 columns: Ground truth, Input, SCL, Fusion, SCL + Fusion, SICNet. PSNR(left/right) and DispE(%) per example: (b) SCL 23.04/37.67, 1.41; Fusion 27.41/36.40, 0.41; SCL + Fusion 28.07/38.95, 1.19; SICNet 29.47/39.87, 0.07. (c) SCL 33.04/24.53, 1.37; Fusion 36.22/27.76, 0.21; SCL + Fusion 37.08/29.28, 0.07; SICNet 36.49/30.00, 0.003. (d) SCL 27.61/28.54, 24.72; Fusion 30.52/30.65, 18.86; SCL + Fusion 33.20/31.42, 2.89; SICNet 34.02/31.77, 2.36. (a) SCL 23.16/22.87, 13.03; Fusion 24.73/25.50, 9.84; SCL + Fusion 25.93/26.84, 4.51; SICNet 25.97/27.32, 3.27.]

Fig. 8: Results obtained by SICNet with different modules or losses.


[Fig. 9 rows: Input (left and right views, object to be removed in green), Laplacian, SIC_trad, SICNet.]

Fig. 9: Qualitative comparison with traditional methods.

4.4.2 Object removal: In this part, we compare our method with the others in terms of object removal. Since no ground truth is available, we present only qualitative results, given in Fig. 7. From the figure, we can see that compared with GL and EdgeConnect, the results of our network noticeably reduce the problem of color distortion and connect more naturally with the surroundings, although there is still some room for improvement in content clarity.

4.5 Ablation Study

In order to fully demonstrate the effectiveness of the proposed method, we disassemble the network framework and verify the performance of the key components, especially the fusion module, the warp-based stereo consistency loss (SCL for short) and the disparity-network-based consistency loss. We test the method with subsets of the three key elements, including only SCL, only the fusion module, both SCL and the fusion module, and all three. The last one is our complete method (the trained SICNet). From Table 2, we can see that simply using SCL only obtains results comparable with GL. The fusion module is effective in increasing both the single image quality and the stereo consistency. Integrating SCL to train SICNet is helpful. The disparity-network-based consistency loss further helps boost the performance of SICNet in all metrics.

We show some visual results in Fig. 8. From Fig. 8(a), we can notice that with SCL the texture of the repaired leaves looks natural but inconsistent. The phenomenon is alleviated after involving the fusion module. From Fig. 8(b), it can be seen that with the fusion module, the structure of the roof in the left view is repaired correctly. From Fig. 8(c), we can see that with both the fusion module and SCL, we can recover the room with correct structure and smooth walls in the right view. From Fig. 8(d), it can be seen that after integrating the disparity network, the sky becomes smooth and stereo consistent. In all, the three components are all helpful for improving the performance of the proposed method.

We also test the stability of our method in dealing with different scales of holes. We set the size of the test images to 256×256 and produce several groups of corruptions with increasing hole sizes. As shown in Fig. 10, as the size of the holes increases, the quality of the repaired images decreases. Regardless of the hole size, all three components contribute.


Table 2 PSNR, SSIM, L1, L2, and DispE achieved by SICNet with different modules or losses.
Module        PSNR (left/right)  SSIM (left/right)  L1 (%) (left/right)  L2 (%) (left/right)  DispE (%)
SCL           28.05/28.75        0.9855/0.9867      1.04/0.95            0.33/0.28            6.30
Fusion        29.48/30.16        0.9920/0.9928      0.82/0.76            0.22/0.20            3.55
SCL + Fusion  31.06/31.78        0.9941/0.9943      0.68/0.63            0.15/0.14            2.56
SICNet        31.62/32.36        0.9950/0.9952      0.64/0.59            0.14/0.12            2.14

Fig. 10: PSNR achieved by SICNet in completing images with different sizes of holes. Only the PSNRs of the left views are presented for simplification.

5 Conclusion and limitation

In this paper, we presented an approach for stereo image completion in the deep learning framework. It can repair images of arbitrary size with holes of any position, shape and dimension by learning across stereoscopic views. Experimental results validated that the proposed stereo image completion network outperforms state-of-the-art methods.

The proposed method still has limitations. Like other deep learning based methods, our network is not good at dealing with scenes different from those in the training dataset. As can be seen from Fig. 9, the original images, taken from [26] which presents a traditional method for stereo image completion, are much different from the street images captured at a distance in our training data. In this case, our method cannot obtain satisfying results while traditional methods [19][26] can, by searching for cues only within the current images. In the future, we will try to borrow ideas from traditional methods and formulate them into our deep network for image completion.

6 References
1 Darabi, S., Shechtman, E., Barnes, C., et al.: 'Image melding: combining inconsistent images using patch-based synthesis', ACM Trans. Graphics (TOG), 2012, 31, (4), article No. 82.
2 Pathak, D., Krahenbuhl, P., Donahue, J., et al.: 'Context encoders: feature learning by inpainting', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, June 2016, pp. 2536-2544.
3 Ma, W., Yang, L., Zhang, Y., et al.: 'Fast interactive stereo image segmentation', Multimedia Tools and Applications, 2016, 75, (18), pp. 10935-10948.
4 Criminisi, A., Perez, P., Toyama, K.: 'Object removal by exemplar-based inpainting', IEEE Trans. Image Processing (TIP), 2004, 13, (9), pp. 1200-1212.
5 Liu, G., Reda, F.A., Shih, K.J., et al.: 'Image inpainting for irregular holes using partial convolutions', Euro. Conf. Computer Vision (ECCV), Munich, Germany, September 2018, pp. 85-100.
6 Yu, J., Lin, Z., Yang, J., et al.: 'Free-form image inpainting with gated convolution', arXiv preprint, arXiv:1806.03589, 2018.
7 Tauber, Z., Li, Z., Drew, M.S.: 'Review and preview: disocclusion by inpainting for image-based rendering', IEEE Trans. Systems, Man, and Cybernetics, Part C, 2007, 37, (6), pp. 527-540.
8 Thonat, T., Shechtman, E., Paris, S., et al.: 'Multi-view inpainting for image-based scene editing and rendering', Fourth Int. Conf. 3D Vision (3DV), Stanford, California, USA, October 2016.
9 Iizuka, S., Simo-Serra, E., Ishikawa, H.: 'Globally and locally consistent image completion', ACM Trans. Graphics, 2017, 36, (4), article No. 107.
10 Nazeri, K., Ng, E., Joseph, T., et al.: 'EdgeConnect: generative image inpainting with adversarial edge learning', arXiv preprint, arXiv:1901.00212, 2019.
11 Gupta, R.K., Cho, S.-Y.: 'Window-based approach for fast stereo correspondence', IET Computer Vision, 2013, 7, (2), pp. 123-134.
12 Lopez-Quintero, M.I., Marin-Jimenez, M.J., Munoz-Salinas, R., et al.: 'Mixing body-parts model for 2D human pose estimation in stereo videos', IET Computer Vision, 2017, 11, (6), pp. 426-433.
13 Wang, L., Jin, H., Yang, R., et al.: 'Stereoscopic inpainting: joint color and depth completion from stereo images', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, USA, June 2008, pp. 1-8.
14 Chen, S., Ma, W., Qin, Y.: 'CNN-based stereoscopic image inpainting', Int. Conf. Image and Graphics (ICIG), Beijing, China, August 2019.
15 Ballester, C., Bertalmio, M., Caselles, V., et al.: 'Filling-in by joint interpolation of vector fields and gray levels', IEEE Trans. Image Processing (TIP), 2001, 10, (8), pp. 1200-1211.
16 Bertalmio, M., Sapiro, G., Caselles, V., et al.: 'Image inpainting', Proc. ACM SIGGRAPH, New Orleans, Louisiana, USA, 2000, pp. 417-424.
17 Levin, A., Zomet, A., Weiss, Y.: 'Learning how to inpaint from global image statistics', IEEE Int. Conf. Computer Vision (ICCV), Nice, France, October 2003, pp. 305-312.
18 Xiao, M., Li, G., Jiang, Y., et al.: 'Image completion using belief propagation based on planar priorities', KSII Trans. Internet and Information Systems (TIIS), 2016, 10, (9), pp. 4405-4418.
19 Lee, J.H., Choi, I., Kim, M.H.: 'Laplacian patch-based image synthesis', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, June 2016, pp. 2727-2735.
20 Huang, J.-B., Kang, S.B., Ahuja, N., et al.: 'Image completion using planar structure guidance', ACM Trans. Graphics (TOG), 2014, 33, (4), article No. 129.
21 Xiao, M., Liu, Y., Xie, L., et al.: 'A novel image completion algorithm based on planar features', KSII Trans. Internet and Information Systems (TIIS), 2018, 12, (8), pp. 3842-3855.
22 Lo, W.Y., Van Baar, J., Knaus, C., et al.: 'Stereoscopic 3D copy & paste', ACM Trans. Graphics (TOG), 2010, 29, (6), article No. 147.
23 Luo, S.J., Shen, I.C., Chen, B.Y., et al.: 'Perspective-aware warping for seamless stereoscopic image cloning', ACM Trans. Graphics (TOG), 2012, 31, (6), article No. 182.
24 Hervieux, A., Papadakis, N., Bugeau, A., et al.: 'Stereoscopic image inpainting: distinct depth maps and images inpainting', Int. Conf. Pattern Recognition (ICPR), Istanbul, Turkey, August 2010, pp. 4101-4104.
25 Hervieux, A., Papadakis, N., Bugeau, A., et al.: 'Stereoscopic image inpainting using scene geometry', IEEE Int. Conf. Multimedia and Expo (ICME), Barcelona, Spain, July 2011, pp. 1-6.
26 Mu, T.J., Wang, J.H., Du, S.P., et al.: 'Stereoscopic image completion and depth recovery', The Visual Computer, 2014, 30, (6), pp. 833-843.
27 Morse, B., Howard, J., Cohen, S., et al.: 'PatchMatch-based content completion of stereo image pairs', Int. Conf. 3D Imaging, Modeling, Processing, Visualization & Transmission (3DIMPVT), Zurich, Switzerland, October 2012, pp. 555-562.
28 Barnes, C., Shechtman, E., Finkelstein, A., et al.: 'PatchMatch: a randomized correspondence algorithm for structural image editing', ACM Trans. Graphics (TOG), 2009, 28, (3), article No. 24.
29 Ren, J., Xu, L., Yan, Q., et al.: 'Shepard convolutional neural networks', Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, December 2015, pp. 901-909.
30 Yang, C., Lu, X., Lin, Z., et al.: 'High-resolution image inpainting using multi-scale neural patch synthesis', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Hawaii, USA, July 2017.
31 Telea, A.: 'An image inpainting technique based on the fast marching method', J. Graphics Tools, 2004, 9, (1), pp. 23-34.
32 Pérez, P., Gangnet, M., Blake, A.: 'Poisson image editing', ACM Trans. Graphics (TOG), 2003, 22, (3), pp. 313-318.
33 Yu, J., Lin, Z., Yang, J., et al.: 'Generative image inpainting with contextual attention', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, June 2018, pp. 5505-5514.
34 Chang, J.R., Chen, Y.S.: 'Pyramid stereo matching network', IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, June 2018, pp. 5410-5418.
35 Hirschmuller, H., Scharstein, D.: 'Evaluation of cost functions for stereo matching', IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, USA, June 2007.
36 Geiger, A., Lenz, P., Stiller, C., et al.: 'Vision meets robotics: the KITTI dataset', Int. J. Robotics Research (IJRR), 2013, 32, (11), pp. 1231-1237.
37 Paszke, A., Gross, S., Lerer, A., et al.: 'Automatic differentiation in PyTorch', Neural Information Processing Systems (NIPS) Workshop on Autodiff, Long Beach, California, USA, December 2017.
38 Kingma, D.P., Ba, J.: 'Adam: a method for stochastic optimization', Int. Conf. Learning Representations (ICLR), San Diego, California, USA, May 2015.
39 Huynh-Thu, Q., Ghanbari, M.: 'Scope of validity of PSNR in image/video quality assessment', Electronics Letters, 2008, 44, (13), pp. 800-801.
40 Wang, Z., Bovik, A.C., Sheikh, H.R., et al.: 'Image quality assessment: from error visibility to structural similarity', IEEE Trans. Image Processing (TIP), 2004, 13, (4), pp. 600-612.
