Volumetric Spatial Transformer Network for Object Recognition

Min Liu, Yifei Shi, Lintao Zheng, Yueshan Xiong, Kai Xu∗

National University of Defense Technology, HPCL
∗[email protected]

Figure 1: Framework of the volumetric spatial transformer network. Pipeline: 3D shape → volumetric grid → 3D CNN → (θ, ϕ) → depth layer → depth image → 2D CNN → soft-max loss over 40 classes.

1. Introduction

Understanding 3D environments is a vital element of modern computer vision research due to its paramount relevance in many vision systems, spanning application scenarios from self-driving cars to autonomous robots [1]. At present, object recognition mainly employs two kinds of methods: volumetric CNNs [2] and multi-view CNNs [3, 4]. In this paper, we propose a volumetric spatial transformer network for object recognition. It bridges the gap between 3D CNNs and 2D CNNs for the first time and can be trained end to end. Given a 3D shape, the network automatically selects the best view, i.e., the view that maximizes the accuracy of object recognition.

2. Approach

The main idea of our volumetric spatial transformer network is to build an end-to-end trainable deep neural network. As illustrated in Figure 1, the network consists of three parts: a 3D CNN, a depth layer, and a 2D CNN. Given a 3D shape, we first convert it into a volumetric representation at a resolution of 60 × 60 × 60. The volumetric data is then fed into the 3D CNN. To mitigate overfitting, we adopt the mlpconv layer from [1]. The 3D CNN ends with a regression layer that produces the spatial transformation parameters (θ, ϕ), which the depth layer uses to generate a depth image. For the 2D CNN, we use a 2D NIN [1] to classify the 60 × 60 depth image of the original 3D shape.
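
As a concrete illustration of this wiring (not the authors' implementation), the PyTorch sketch below pairs a small 3D CNN with mlpconv-style 1 × 1 × 1 convolutions that regresses (θ, ϕ) with a NIN-style 2D classifier that maps the rendered depth image to 40 class scores. The poster does not give layer configurations, so every channel width and kernel size here is an assumption; the depth layer that connects the two halves is described in Sections 2.1–2.3.

```python
# Hypothetical PyTorch sketch of the two CNN halves of the pipeline.
# Channel widths and kernel sizes are guesses, not the authors' configuration.
import torch
import torch.nn as nn

class ViewRegressor3D(nn.Module):
    """3D CNN with mlpconv-style 1x1x1 convolutions; predicts (theta, phi)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=1), nn.ReLU(),      # mlpconv: 1x1x1 conv
            nn.Conv3d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.regress = nn.Linear(32, 2)                       # (theta, phi)

    def forward(self, vox):                                   # vox: (B, 1, 60, 60, 60)
        return self.regress(self.features(vox).flatten(1))    # (B, 2)

class DepthClassifier2D(nn.Module):
    """NIN-style 2D classifier over the rendered 60x60 depth image."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=1), nn.ReLU(),      # mlpconv: 1x1 conv
            nn.Conv2d(32, num_classes, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),                          # global average pooling
        )

    def forward(self, depth):                                 # depth: (B, 1, 60, 60)
        return self.net(depth).flatten(1)                     # (B, 40) class logits
```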

Figure 2: Projection of a 3D shape. The viewpoint V(X, Y, Z) looks at the shape center O; a shape point P(X_g, Y_g, Z_g) projects to p(x, y) on the image plane, whose center is M and whose x-axis passes through W.

Key to our network is the implementation of the depth layer. The depth layer generates the depth image and can be trained with standard back-propagation, which allows end-to-end training. As illustrated in Figure 2, a view is represented by (ρ, θ, ϕ), where ρ is the radial distance to the shape center and θ and ϕ are the polar and azimuthal angles, respectively. The view direction points to the 3D shape's center, which is also the origin of the coordinate system. In our approach, ρ is a constant. The view (ρ, θ, ϕ) is transformed into Cartesian coordinates (X, Y, Z) using Equation (1):

X = ρ sin(θ) cos(ϕ)
Y = ρ sin(θ) sin(ϕ)     (1)
Z = ρ cos(θ)
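
For illustration only, Equation (1) amounts to a few lines of NumPy (the example values of ρ, θ and ϕ below are arbitrary):

```python
import numpy as np

def view_to_cartesian(rho, theta, phi):
    """Equation (1): spherical view parameters -> Cartesian viewpoint V = (X, Y, Z)."""
    return np.array([rho * np.sin(theta) * np.cos(phi),
                     rho * np.sin(theta) * np.sin(phi),
                     rho * np.cos(theta)])

V = view_to_cartesian(rho=2.0, theta=np.pi / 3, phi=np.pi / 4)  # arbitrary example view
```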

2.1 Depth calculation

As illustrated in Figure 2, a projection plane lies between the viewpoint V and the 3D shape. M is the intersection point of the view line and the projection plane, and the vector OM is perpendicular to the plane. Since MW ⊥ OM, the cross product OM × MW points opposite to the y-axis of the projection plane, and MW lies along its x-axis. We then compute three unit vectors i_OM, i_h and i_m using Equation (2):

i_OM = OM / |OM| = OV / |OV|
i_h  = i_MW × i_OM     (2)
i_m  = i_OM × i_h

Given a point P(X_g, Y_g, Z_g), its projected coordinates in the plane xMy are computed with Equation (3), where z is the projection of OP along i_OM:

x = i_m · OP
y = i_h · OP     (3)
z = i_OM · OP

The depth of point P is then d = ρ − z, i.e., d = ρ − (1/ρ)(X X_g + Y Y_g + Z Z_g).
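
A NumPy sketch of Equations (2)–(3) is given below. One detail is an assumption on our part: the poster does not say how the in-plane x direction MW is chosen, so here it is derived from a fixed world up vector, as in a standard look-at camera.

```python
import numpy as np

def camera_basis(V, up=np.array([0.0, 0.0, 1.0])):
    """Equation (2). The `up` vector (hence the choice of MW) is an assumption."""
    i_om = V / np.linalg.norm(V)         # unit view direction: OV/|OV| = OM/|OM|
    i_mw = np.cross(up, i_om)
    i_mw /= np.linalg.norm(i_mw)         # degenerate if V is parallel to `up`
    i_h = np.cross(i_mw, i_om)           # image y-axis
    i_m = np.cross(i_om, i_h)            # image x-axis (equals i_mw here)
    return i_m, i_h, i_om

def project_points(P, V, rho):
    """Equation (3) plus d = rho - z for an (N, 3) array of shape points P."""
    i_m, i_h, i_om = camera_basis(V)
    x = P @ i_m                          # x = i_m . OP
    y = P @ i_h                          # y = i_h . OP
    z = P @ i_om                         # z = i_OM . OP
    return x, y, rho - z                 # depth d = rho - z
```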

2.2 Depth image generation

To go from projected coordinates to pixel coordinates, the camera intrinsic matrix is applied. Each point of the 3D shape is projected onto pixel coordinates (we round the projected coordinates to the nearest pixel). Although one pixel may receive many points of the 3D shape, only the point with the minimum depth value is recorded; this yields the depth image. At the same time, the index of the point with the minimum depth at each pixel is recorded in an index map, which is used during backpropagation through the depth layer.
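
A minimal sketch of this z-buffering step is shown below; the intrinsics (a square pixel grid centred on the principal point, with a crudely estimated pixel size) are assumptions, not the authors' calibration.

```python
import numpy as np

def render_depth(x, y, d, res=60, pixel_size=None):
    """Scatter projected points into a res x res depth image, keeping the minimum
    depth per pixel and recording which point produced it (the index map)."""
    if pixel_size is None:
        pixel_size = (x.max() - x.min()) / res        # crude scale, an assumption
    u = np.clip(np.round(x / pixel_size).astype(int) + res // 2, 0, res - 1)
    v = np.clip(np.round(y / pixel_size).astype(int) + res // 2, 0, res - 1)
    depth = np.full((res, res), np.inf)               # background stays at +inf
    index_map = np.full((res, res), -1, dtype=int)
    for k in range(len(d)):                           # keep the closest point per pixel
        if d[k] < depth[v[k], u[k]]:
            depth[v[k], u[k]] = d[k]
            index_map[v[k], u[k]] = k                 # reused in backpropagation
    return depth, index_map
```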

2.3 Backpropagation through the depth layer

Backpropagation through the depth layer computes the loss gradients with respect to the input (θ, ϕ) given the loss gradients with respect to the output, the depth image. During backpropagation, each pixel receives a gradient ∂loss/∂d. Since the depth d is known and (X_g, Y_g, Z_g) can be retrieved from the index map, ∂loss/∂θ and ∂loss/∂ϕ in Equation (5) can be computed using Equations (1) and (4):

∂d/∂θ = (∂d/∂X)(∂X/∂θ) + (∂d/∂Y)(∂Y/∂θ) + (∂d/∂Z)(∂Z/∂θ)
∂d/∂ϕ = (∂d/∂X)(∂X/∂ϕ) + (∂d/∂Y)(∂Y/∂ϕ) + (∂d/∂Z)(∂Z/∂ϕ)     (4)

∂loss/∂θ = (∂loss/∂d)(∂d/∂θ)
∂loss/∂ϕ = (∂loss/∂d)(∂d/∂ϕ)     (5)

Each pixel of the depth image thus yields a ∂loss/∂θ and a ∂loss/∂ϕ; we average these over all pixels to obtain the final loss gradients with respect to (θ, ϕ).
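
For completeness, the individual terms of Equation (4) follow directly from differentiating Equation (1) and d = ρ − (1/ρ)(X X_g + Y Y_g + Z Z_g); the poster does not spell them out, so the expansion below is our own:

```latex
% Terms of Equation (4), obtained by differentiating Equation (1)
% and d = \rho - \tfrac{1}{\rho}(X X_g + Y Y_g + Z Z_g).
\begin{aligned}
\frac{\partial X}{\partial \theta}  &= \rho\cos\theta\cos\varphi, &
\frac{\partial Y}{\partial \theta}  &= \rho\cos\theta\sin\varphi, &
\frac{\partial Z}{\partial \theta}  &= -\rho\sin\theta,\\
\frac{\partial X}{\partial \varphi} &= -\rho\sin\theta\sin\varphi, &
\frac{\partial Y}{\partial \varphi} &= \rho\sin\theta\cos\varphi, &
\frac{\partial Z}{\partial \varphi} &= 0,\\
\frac{\partial d}{\partial X}       &= -\frac{X_g}{\rho}, &
\frac{\partial d}{\partial Y}       &= -\frac{Y_g}{\rho}, &
\frac{\partial d}{\partial Z}       &= -\frac{Z_g}{\rho}.
\end{aligned}
```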

3. Results

To evaluate the performance of view selection, we compare our spatial transformer network against two baseline methods: Projected Area, which selects the view maximizing the projected area of the 3D model, and Random, which selects a viewpoint at random. The recognition network is trained on 40 object categories, each containing 250 3D shapes. Figure 3 shows some of the recognition results: our view selection by the spatial transformer network clearly outperforms both baselines. Figure 4 shows some view selection results.
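
One way to realize the Projected Area baseline, reusing the helpers sketched in Section 2, is given below; the candidate-view grid and the pixel-coverage criterion are our assumptions, since the poster gives no implementation details.

```python
import numpy as np

def projected_area_view(points, candidate_views, rho, res=60):
    """Pick the (theta, phi) whose depth image covers the most pixels.
    Uses view_to_cartesian / project_points / render_depth from the sketches above."""
    best_view, best_area = None, -1
    for theta, phi in candidate_views:
        V = view_to_cartesian(rho, theta, phi)
        x, y, d = project_points(points, V, rho)
        depth, _ = render_depth(x, y, d, res=res)
        area = int(np.isfinite(depth).sum())          # number of covered pixels
        if area > best_area:
            best_view, best_area = (theta, phi), area
    return best_view
```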

Figure 3: The accuracy of object recognition with three approaches (Ours, Projected Area, Random) on the chair, airplane and bike categories (vertical axis: accuracy, 0%–100%)

Figure 4: Some results of view selection

4. Conclusions

We propose a volumetric spatial transformer network and show that it is effective for object recognition. The intermediate result (θ, ϕ) can be used for many other tasks. We plan to extend this work by combining a recurrent neural network (RNN) with our volumetric spatial transformer network, so that it can develop next-best-view strategies that reduce the recognition uncertainty of 3D objects with a minimal number of views. This work was supported in part by NSFC (61379103, 61572507, 61532003, 61622212).

References

[1] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.

[2] Zhirong Wu, Shuran Song, Aditya Khosla, et al. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015.

[3] Kai Xu, Hui Huang, Yifei Shi, Hao Li, Pinxin Long, Jiannong Caichen, Wei Sun, and Baoquan Chen. Autoscanning for coupled scene reconstruction and proactive object analysis. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia 2015), 34(6):177:1–177:14, 2015.

[4] Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or, and Baoquan Chen. 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia 2016), 35(6):to appear, 2016.

SA ’16 Posters, December 05-08, 2016, Macao