
RGBD Salient Object Detection using Spatially Coherent Deep Learning Framework

Posheng Huang, Chin-Han Shen, Hsu-Feng Hsiao

Department of Computer Science

National Chiao Tung University

Hsinchu, Taiwan

[email protected], [email protected], [email protected]

Abstract—In this paper, a learning-based salient object detection method for RGBD images is introduced. With the assistance of depth information, the silhouette features of an object can be readily retrieved, which leads to better detection of salient objects. In addition, while many recent works still rely on image post-processing methods to improve their performance, we develop a more efficient end-to-end model with a modified loss function used in training the network. The new loss function is designed to increase the spatial coherence of the detected salient objects. The evaluation results show that the proposed approach performs well compared with methods that are considered to be state-of-the-art.

Keywords—fully convolutional networks, salient object detection, deep learning

I. INTRODUCTION

Recently, an increasing research interest can be observed on the subject of salient object detection. The early works primarily focused on predicting eye fixations in images. When emphasizing object-level integrity of saliency prediction results, the detection of salient objects can be used as a pre-processing step in various visual applications, such as image classification [1], image segmentation [2], image and video compression [4, 5], content-aware image editing [6], object detection [7], and object tracking [3].

Most conventional approaches [8] take advantage of prior knowledge of human vision, and this had been the mainstream in this field before deep learning methods started to thrive. Since the emergence of AlexNet [9], more and more studies have employed deep convolutional neural networks (deep CNNs or DNNs) to reach substantially better results than the previous state-of-the-art. In a variety of tasks, such as image classification [9, 16], semantic segmentation [10, 11], and edge detection [12], CNN approaches have significantly outperformed the conventional methods.

In [13], an end-to-end deep contrast network consisting of two complementary models is used. The two models, a pixel-wise fully convolutional stream and a segment-based spatial pooling stream, compensate each other to enhance the performance. Inspired by the Holistically-Nested Edge Detector (HED) [12], Hou et al. [14] developed a simpler solution with just one modified fully convolutional network (FCN) model. In addition, short connections between the branches are utilized to transfer high-level features to low-level outputs, which helps to better locate the most salient region. These studies took RGB images as input and did not consider depth information.

While most studies focus on detection in RGB images, RGBD inputs, which carry depth information and can represent 3D environments well, are less often considered. Depth images can serve as primary information for object segmentation, as shown in Fig. 1. The silhouette of an object can often be captured better in the depth map than in the RGB components. Therefore, using depth information is beneficial for detecting salient objects in an image.

Fig. 1 The components of the inputs and the ground truth of the salient object.

In [15], images with RGB plus depth are used in a CNN-based network for salient object detection. It is a region-based method, which slices an input image into segments as prediction units. The color, depth, and other information are then fused in a relatively small CNN to predict a single saliency value for each region.

It is common to use post-processing methods to refine the results of a trained network in order to enhance the performance of salient object detection. For example, a fully connected conditional random field (CRF) can be used as an external refinement of an output saliency map. CRF is a probabilistic model originally used for labeling or segmenting sequence data [19]. In [18], an energy function consisting of two counterpart potentials is employed. With this energy function, both the original saliency value and the relations between neighboring pixels are considered to enhance spatial coherence. However, CRF refinement forces the entire prediction task to be split into two stages. For a deep learning network, the training procedure can be time-consuming as the data scale increases, and using post-processing to enhance the performance may imply that the deep learning network itself has room for improvement.

In our work, we introduce the FCN structure into the field of RGBD salient object detection. In addition, we believe that a deep learning network should have the capacity of being trained to make the detection results more spatially coherent, and an end-to-end deep learning network is proposed in this paper. The proposed network detects salient objects better than previous CNN/FCN-based methods as well as traditional methods.

Fig. 2 The flow of the proposed method.

The rest of the paper is organized as follows. Section II presents the proposed FCN network and the newly-designed loss function. In Section III, we present simulation results, followed by the conclusion in Section IV.

II. THE PROPOSED NETWORK

In our structure, an FCN is utilized to retrieve different levels of features to predict salient objects. The original network is extended to accept RGBD images: the input to the network consists of a depth map concatenated with the corresponding RGB image. The network used in this paper, which is a modified framework based on the FCN model with short connections [14], is shown in Fig. 2.
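As a minimal sketch of this input construction (assuming the depth map is resized to the resolution of the RGB image and normalized to the same value range, which the paper does not specify), the RGB image and depth map can be stacked into a four-channel tensor before being fed to the network:

```python
import numpy as np

def make_rgbd_input(rgb, depth):
    """Stack an RGB image (H, W, 3) and a depth map (H, W) into a
    four-channel RGBD input (H, W, 4). Assumes both arrays already
    have the same spatial resolution."""
    rgb = rgb.astype(np.float32) / 255.0                               # normalize RGB to [0, 1]
    depth = depth.astype(np.float32)
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize depth to [0, 1]
    return np.concatenate([rgb, depth[..., None]], axis=-1)
```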

The basic architecture is based on VGG-16 [16], where the final two fully connected layers, which originally function as classifiers, are discarded. This removal results in so-called "dense prediction." In computer vision, a pixel-wise dense prediction task is to predict a label for each pixel in the image [17]; in our work, dense prediction is used to produce the saliency map. As also shown in Fig. 2, there are six branches of layers stemming from the backbone network, which extract features at different levels. Those layers are listed in Table 1. For each (n, k × k) tuple, n is the number of output channels of the additional convolutional layer, and k × k is the kernel size. The column "Layer" gives the position of each branch in the original VGG-16 network. Table 1 also lists the three convolutional sublayers in each branch.

Table 1 Details of each side output (m)

Branch (m) | Layer   | Sublayer 1 | Sublayer 2 | Sublayer 3
1          | conv1_2 | 128, 3 × 3 | 128, 3 × 3 | 1, 1 × 1
2          | conv2_2 | 128, 3 × 3 | 128, 3 × 3 | 1, 1 × 1
3          | conv3_3 | 256, 5 × 5 | 256, 5 × 5 | 1, 1 × 1
4          | conv4_3 | 256, 5 × 5 | 256, 5 × 5 | 1, 1 × 1
5          | conv5_3 | 512, 5 × 5 | 512, 5 × 5 | 1, 1 × 1
6          | pool_5  | 512, 7 × 7 | 512, 7 × 7 | 1, 1 × 1
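As a rough sketch of one side branch from Table 1 (using TensorFlow/Keras rather than the authors' original TensorFlow 1.7 code; the ReLU activations and the omission of weight initialization details are assumptions), each branch applies two convolutions of the listed width and kernel size, followed by a 1 × 1 convolution that produces the single-channel side output:

```python
import tensorflow as tf

def side_branch(features, channels, kernel_size):
    """Side branch per Table 1: two conv layers of the given width and
    kernel size, then a 1x1 conv producing a single-channel side output."""
    x = tf.keras.layers.Conv2D(channels, kernel_size, padding="same", activation="relu")(features)
    x = tf.keras.layers.Conv2D(channels, kernel_size, padding="same", activation="relu")(x)
    return tf.keras.layers.Conv2D(1, 1, padding="same")(x)  # single-channel side output (logits)

# Example: branch 3 takes the conv3_3 feature map and uses 256-channel, 5x5 convolutions.
# side3 = side_branch(conv3_3_features, channels=256, kernel_size=5)
```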

These extracted features pass through additional convolutional layers and form the single-channel feature maps called side outputs. Most importantly, the deeper side outputs are considered to carry higher-level knowledge, and they are connected to the feature maps at lower levels in order to combine that knowledge when locating the most visually attractive region. These connections are called "short connections." The side activations after the short connections are computed as follows:

$$R_{\text{side}}^{(m)} =
\begin{cases}
\sum_{i=3}^{6} r_i^m R_{\text{side}}^{(i)} + A_{\text{side}}^{(m)}, & \text{for } m = 1, 2,\\
r_5^m R_{\text{side}}^{(5)} + r_6^m R_{\text{side}}^{(6)} + A_{\text{side}}^{(m)}, & \text{for } m = 3, 4,\\
A_{\text{side}}^{(m)}, & \text{for } m = 5, 6,
\end{cases} \quad (1)$$

where $R_{\text{side}}^{(m)}$ represents the combined side activation of branch $m$, and $A_{\text{side}}^{(m)}$ is the corresponding side output. $r_i^m$ is the weight with which the side activation $R_{\text{side}}^{(i)}$ of a deeper branch $i$ contributes to the current side activation $R_{\text{side}}^{(m)}$; these weights are determined during training. The combinations of the side activations and side outputs are implemented using concatenation and 1 × 1 convolution.
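A minimal sketch of how one short connection in (1) might be realized with concatenation and a 1 × 1 convolution (assuming Keras layers and that the deeper activations have already been upsampled to the resolution of branch m; the learned 1 × 1 kernel plays the role of the weights $r_i^m$):

```python
import tensorflow as tf

def short_connection(own_side_output, deeper_activations):
    """Combine the side output of branch m with side activations from
    deeper branches, as in Eq. (1): concatenate, then fuse with a 1x1
    convolution whose weights act as the learned combination weights."""
    x = tf.keras.layers.Concatenate(axis=-1)([own_side_output] + deeper_activations)
    return tf.keras.layers.Conv2D(1, 1, padding="same")(x)  # learned weighted sum -> R_side^(m)

# Example for m = 3: combine with the activations of branches 5 and 6.
# r_side_3 = short_connection(a_side_3, [r_side_5_up, r_side_6_up])
```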

In this work, we propose to calculate the loss function based on three loss terms: side losses, fusion loss, and neighboring losses. The side activations mentioned above are used to compute the side losses and the fusion loss. Let $X = \{x_j,\ j = 1, \dots, |X|\}$ denote one of the images in the training data, and let $Z = \{z_j,\ j = 1, \dots, |Z|\}$, $z_j \in [0, 1]$, denote the corresponding ground-truth saliency map of $X$. We denote the collection of all parameters in the standard VGG-16 structure as $\mathbf{W}$, the parameters in the layers of the branches as $\mathbf{w} = (w^{(1)}, w^{(2)}, \dots, w^{(M)})$ with $M$ side outputs ($M = 6$), and the parameters of the short connections as $\mathbf{r}$. With this notation, the cross-entropy losses of the side outputs and the fusion loss are:

$$L_{\text{side}}(\mathbf{W}, \mathbf{w}, \mathbf{r}) = \sum_{m=1}^{M} \alpha_m\, l_{\text{side}}^{(m)}(\mathbf{W}, w^{(m)}, \mathbf{r}), \qquad
L_{\text{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{f}, \mathbf{r}) = \sigma\!\Big(Z, \sum_{m=1}^{M} f_m R_{\text{side}}^{(m)}\Big), \quad (2)$$

where

$$l_{\text{side}}^{(m)}(\mathbf{W}, w^{(m)}, \mathbf{r}) = \sigma\big(Z, R_{\text{side}}^{(m)}\big) \quad (3)$$

is the loss of each side output. In (2) and (3), $\sigma(\cdot,\cdot)$ is computed in the same way as the cross entropy. $\alpha_m$ denotes the weight with which the cross-entropy loss of each side activation contributes to (2). The weights $f_m$ are used in the weighted fusion layer to combine the side activations from all branches; each $f_m$ is initialized to 0.16667 and is updated during the training phase.
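As a rough sketch of the side and fusion losses in (2) and (3), assuming the side activations are logits and that $\sigma(\cdot,\cdot)$ is the standard pixel-wise sigmoid cross entropy averaged over the map (the weights alpha and the fusion weights f are passed in as plain lists here, whereas in the paper the fusion weights are trainable):

```python
import tensorflow as tf

def cross_entropy(gt, logits):
    """sigma(Z, R): pixel-wise sigmoid cross entropy, averaged over the map."""
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=gt, logits=logits))

def side_and_fusion_loss(gt, side_activations, alpha, fusion_weights):
    """Eq. (2)-(3): weighted sum of side losses plus the loss of the fused map."""
    l_side = tf.add_n([a * cross_entropy(gt, r)
                       for a, r in zip(alpha, side_activations)])
    fused = tf.add_n([f * r for f, r in zip(fusion_weights, side_activations)])
    l_fuse = cross_entropy(gt, fused)
    return l_side, l_fuse, fused
```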

For the neighboring losses, our concept is similar to the pairwise potential in [18]. The pairwise potential is defined as

$$\theta_{i,j}(x_i, x_j) = \mu(x_i, x_j)\left[w_1 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)\right], \quad (4)$$

where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$ and $\mu(x_i, x_j) = 0$ otherwise. $I_i$ and $p_i$ are the fused saliency value and the position of $x_i$, respectively. The parameters $w_1$, $w_2$, $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_\gamma$ control the importance of each term and are all determined through cross validation [14]. We extract the core of the pairwise potential as our neighboring loss, which is calculated with the restriction that $x_j \in \{\text{the 8 neighbors of } x_i\}$. The differences of pixel positions between such neighbors are small and may be ignored. Finally, our neighboring loss function is defined as follows:

$$L_{\text{neighbor}}(\mathbf{W}, \mathbf{w}, \mathbf{f}, \mathbf{r}) = w_1 \frac{1}{n} \sum_{x_j} \exp\!\left(-\frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right), \quad (5)$$

where $I_i$ is the predicted saliency value from the fused saliency map used for computing the fusion loss, $\sigma_\beta$ is the controlling parameter with the same value as in (4), which is 8, and $n$ is the number of neighbors of $x_i$. According to the discussion above, the complete loss function is:

$$L_{\text{final}}(\mathbf{W}, \mathbf{w}, \mathbf{f}, \mathbf{r}) = L_{\text{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{f}, \mathbf{r}) + L_{\text{side}}(\mathbf{W}, \mathbf{w}, \mathbf{r}) + L_{\text{neighbor}}(\mathbf{W}, \mathbf{w}, \mathbf{f}, \mathbf{r}). \quad (6)$$

The proposed loss function trains the network to take spatial coherence into account in the output saliency map, without the need for additional post-processing.
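As an illustrative sketch of the neighboring loss in (5), the fused saliency map can be compared against shifted copies of itself. For brevity this sketch uses only the 4 horizontal/vertical neighbors instead of the full 8-neighborhood of the paper, and uses the stated value sigma_beta = 8; both simplifications are assumptions of the sketch:

```python
import tensorflow as tf

def neighbor_loss(fused_saliency, w1=1.0, sigma_beta=8.0):
    """Sketch of Eq. (5): for every pixel, compare its fused saliency value
    with neighboring values and average exp(-(I_i - I_j)^2 / (2 sigma_beta^2)).
    Uses 4-connected neighbors obtained by shifting the map."""
    s = fused_saliency  # shape (batch, H, W, 1), values in [0, 1]
    diffs = [
        s[:, 1:, :, :] - s[:, :-1, :, :],   # vertical neighbor differences
        s[:, :, 1:, :] - s[:, :, :-1, :],   # horizontal neighbor differences
    ]
    terms = [tf.reduce_mean(tf.exp(-tf.square(d) / (2.0 * sigma_beta ** 2)))
             for d in diffs]
    return w1 * tf.add_n(terms) / len(diffs)
```

The final loss in (6) would then simply add this term to the side and fusion losses.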

Fig. 3 Visual comparisons for different salient object detection methods. “Ours” is the proposed method.


Table 2 Comparison of F-measures for different methods on two datasets

Method              | NLPR   | NJUD
S-CNN (2015)        | 0.5141 | 0.6096
BSCA (2015)         | 0.5634 | 0.6133
MB+ (2015)          | 0.6049 | 0.6156
LEGS (2015)         | 0.6335 | 0.6791
FCNSC (RGB) (2017)  | 0.8071 | 0.7640
LMH (2014)          | 0.6519 | 0.6381
ACSD (2014)         | 0.5448 | 0.6952
GP (2015)           | 0.7184 | 0.7246
DF (2017)           | 0.7823 | 0.7874
Ours                | 0.8760 | 0.8189

III. PERFORMANCE EVALUATION

Our proposed model is evaluated on two datasets: the NLPR RGBD1000 dataset and the NJU2000 dataset.

NLPR RGBD1000 dataset [20]. This dataset contains 1000 color images and their corresponding depth images, captured with a Microsoft Kinect in different scenes. We split them randomly into two parts: 750 for training and 250 for testing.

NJU2000 dataset [21]. This dataset contains 2000 stereo images, with their depth images and ground truth. The depth images were generated using an optical flow method. We also split this dataset randomly into two parts: 1000 for training and 1000 for testing [15].

Our implementation is based on TensorFlow 1.7 in Python, running on an Intel Core i7-6700K CPU @ 4.00 GHz × 8 with an Nvidia GTX TITAN X GPU. We set the momentum of our network to 0.9 and the weight decay to 0.0005. The learning rate is set between 1e-4 and 1e-5 depending on the layer, with no decay during training. The total number of epochs is 80.
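A minimal sketch of how this training configuration could be expressed (assuming TF 1.x-style momentum SGD; the split of the two learning rates between the pretrained backbone and the new branch layers, and the way the weight decay is applied, are assumptions, since the paper does not detail them):

```python
import tensorflow as tf

MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
LR_BACKBONE = 1e-5   # slower rate assumed for the pretrained VGG-16 layers
LR_BRANCHES = 1e-4   # faster rate assumed for the newly added branch layers
EPOCHS = 80

def total_loss_with_decay(task_loss, variables):
    """Add L2 weight decay over all trainable variables to the task loss."""
    l2 = tf.add_n([tf.nn.l2_loss(v) for v in variables])
    return task_loss + WEIGHT_DECAY * l2

# optimizer_backbone = tf.train.MomentumOptimizer(LR_BACKBONE, MOMENTUM)
# optimizer_branches = tf.train.MomentumOptimizer(LR_BRANCHES, MOMENTUM)
```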

For the result analysis, we use the precision-recall (PR) curve, the mean of the average precision and recall, and the F-measure score to evaluate the performance of the proposed and compared methods. The PR curve reveals the mean precision and recall of the saliency maps at different thresholds. For average precision and recall, we adopt the adaptive threshold described in [22], which is twice the mean saliency value of each saliency map. We then gather the precisions and recalls from all thresholded maps and average them. The F-measure is defined as

$$F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R},$$

where $\beta^2$ is set to 0.3, $P$ is the average precision, and $R$ is the average recall under the adaptive thresholds.
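A small sketch of this adaptive-threshold evaluation, assuming the saliency maps and ground-truth masks are arrays with values in [0, 1] (the exact binarization convention for the ground truth is an assumption):

```python
import numpy as np

BETA_SQ = 0.3

def precision_recall(saliency_map, gt_mask):
    """Binarize the saliency map with the adaptive threshold (twice its mean
    value) and compute precision/recall against the binary ground truth."""
    threshold = 2.0 * saliency_map.mean()
    pred = saliency_map >= threshold
    gt = gt_mask >= 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return precision, recall

def f_measure(avg_precision, avg_recall, beta_sq=BETA_SQ):
    """F_beta with beta^2 = 0.3, computed from dataset-averaged P and R."""
    return ((1 + beta_sq) * avg_precision * avg_recall) / (
        beta_sq * avg_precision + avg_recall + 1e-8)
```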

Nine methods are compared with the proposed approach. Five are designed for RGB images (S-CNN [23], BSCA [24], MB+ [25], LEGS [26], and FCNSC (RGB) [14]), and four are designed for RGBD images (LMH [27], ACSD [28], GP [29], and DF [15]). In Fig. 3, the results of the other methods are either taken directly from their papers or obtained by running their publicly available code. The visual comparison shows that our method detects salient objects better. We compare the F-measure scores of the proposed method and the other methods in Table 2. Part of the results of the compared methods are obtained from [15].

IV. CONCLUSION

We have proposed an RGBD salient object detection model based on the FCN network with short connections. A deep learning network with the capacity of making the detection results more spatially coherent is developed in this paper. A new loss function is proposed to enhance spatial coherence, which makes the network an end-to-end network without the need for post-processing algorithms.

REFERENCES

[1] R. Wu, Y. Yu, and W. Wang, "Scale: Supervised and cascaded Laplacian eigenmaps for visual object recognition based on nearest neighbors," in CVPR, 2013.
[2] M. Donoser, M. Urschler, M. Hirzer, and H. Bischof, "Saliency driven total variation segmentation," in ICCV, pp. 817-824, 2009.
[3] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in CVPR, 2009.
[4] C. Guo and L. Zhang, "A novel multiresolution spatiotemporal saliency detection model and its application in image and video compression," IEEE TIP, 19(1): 185-198, 2010.
[5] L. Itti, "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE TIP, 13(10): 1304-1318, 2004.
[6] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," ACM Transactions on Graphics (TOG), vol. 26, p. 10, 2007.
[7] V. Navalpakkam and L. Itti, "An integrated model of top-down and bottom-up attention for optimizing detection speed," in CVPR, vol. 2, pp. 2049-2056, 2006.
[8] A. Borji et al., "Salient object detection: A survey," arXiv preprint arXiv:1411.5878, 2014.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, pp. 1106-1114, 2012.
[10] S. Hong, H. Noh, and B. Han, "Decoupled deep neural network for semi-supervised semantic segmentation," in NIPS, pp. 1495-1503, 2015.
[11] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, pp. 3431-3440, 2015.
[12] S. Xie and Z. Tu, "Holistically-nested edge detection," in ICCV, pp. 1395-1403, 2015.
[13] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in CVPR, 2016.
[14] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, "Deeply supervised salient object detection with short connections," in CVPR, pp. 3203-3212, 2017.
[15] L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang, "RGBD salient object detection via deep fusion," IEEE TIP, 26(5): 2274-2285, 2017.
[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[17] T. Sercu and V. Goel, "Dense prediction on sequences with time-dilated convolutions for speech recognition," arXiv preprint arXiv:1611.09288, 2016.
[18] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in NIPS, 2011.
[19] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in ICML, pp. 282-289, 2001.
[20] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, "RGBD salient object detection: A benchmark and algorithms," in ECCV, pp. 92-109, 2014.
[21] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu, "Depth saliency based on anisotropic center-surround difference," in ICIP, pp. 1115-1119, 2014.
[22] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in CVPR, pp. 1597-1604, 2009.
[23] S. He, R. W. Lau, W. Liu, Z. Huang, and Q. Yang, "SuperCNN: A superpixelwise convolutional neural network for salient object detection," International Journal of Computer Vision, 115(3): 330-344, 2015.
[24] Y. Qin, H. Lu, Y. Xu, and H. Wang, "Saliency detection via cellular automata," in CVPR, pp. 110-119, 2015.
[25] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mĕch, "Minimum barrier salient object detection at 80 fps," in ICCV, 2015.
[26] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in CVPR, pp. 3183-3192, 2015.
[27] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, "RGBD salient object detection: A benchmark and algorithms," in ECCV, pp. 92-109, 2014.
[28] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu, "Depth saliency based on anisotropic center-surround difference," in ICIP, pp. 1115-1119, 2014.
[29] J. Ren, X. Gong, L. Yu, W. Zhou, and M. Yang, "Exploiting global priors for RGB-D saliency detection," in CVPRW, pp. 25-32, 2015.