Volumetric Spatial Transformer Network for Object Recognition
Min Liu, Yifei Shi, Lintao Zheng, Yueshan Xiong, Kai Xu∗
National University of Defense Technology, HPCL. ∗[email protected]
Figure 1: Framework of the volumetric spatial transformer network (pipeline: 3D shape → volumetric grid → 3D CNN → (θ, ϕ) → depth layer → depth image → 2D CNN → soft-max loss over 40 classes).
1. Introduction
Understanding 3D environments is a vital element of modern computer vision research due to its paramount relevance in many vision systems, spanning a wide range of application scenarios from self-driving cars to autonomous robots [1]. At present, object recognition mainly employs two kinds of methods: volumetric CNNs [2] and multi-view CNNs [3][4]. In this paper, we propose a volumetric spatial transformer network for object recognition. It bridges the gap between 3D CNNs and 2D CNNs for the first time, and supports end-to-end training. Given a 3D shape, the network automatically selects the best view, i.e., the one that maximizes the accuracy of object recognition.
2. Approach
The main idea of our volumetric spatial transformer network is to build an end-to-end deep neural network. As illustrated in Figure 1, our volumetric spatial transformer network mainly consists of three parts: a 3D CNN, a depth layer and a 2D CNN. Given a 3D shape, we first convert it into a volumetric representation at 60 × 60 × 60 resolution. The 3D volumetric data is then fed into the 3D CNN. To mitigate overfitting, we adopt the mlpconv layer from [1]. Our 3D CNN includes a final regression layer that produces the spatial transformation parameters (θ, ϕ), which are employed by the depth layer to generate a depth image. For the 2D CNN, we use a 2D NIN [1] to classify the 2D depth image (60 × 60 resolution) of the original 3D shape.
Figure 2: Projection of a 3D shape. The viewpoint V(X, Y, Z) looks toward the origin O of the XYZ coordinate system; the projection plane has in-plane axes x and y centered at M, with W marking a point on its x-axis, and a point P(Xg, Yg, Zg) projects to p(x, y).
Key to our network is the implementation of the depth layer. The depth layer generates the depth image and can be trained with standard back-propagation, allowing end-to-end training. As illustrated in Figure 2, a view is represented by (ρ, θ, ϕ), where ρ, θ and ϕ are the radial distance to the shape center, the polar angle and the azimuthal angle, respectively. The view direction points to the 3D shape's center, which is also the origin of the coordinate system. In our approach, ρ is a constant. (ρ, θ, ϕ) can be easily transformed into Cartesian coordinates (X, Y, Z) using Equation (1):

X = ρ sin θ cos ϕ
Y = ρ sin θ sin ϕ
Z = ρ cos θ    (1)
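As a concrete illustration, Equation (1) maps directly to code. This is a minimal sketch; the function name and the use of NumPy are our own, not from the poster:

```python
import numpy as np

def view_to_cartesian(rho, theta, phi):
    """Convert a view (rho, theta, phi) to Cartesian coordinates
    via Equation (1): theta is the polar angle, phi the azimuthal angle."""
    X = rho * np.sin(theta) * np.cos(phi)
    Y = rho * np.sin(theta) * np.sin(phi)
    Z = rho * np.cos(theta)
    return np.array([X, Y, Z])
```

For example, θ = 0 places the viewpoint on the Z-axis at distance ρ from the shape center.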
2.1 Depth calculation

As illustrated in Figure 2, a projection plane lies between the viewpoint V and the 3D shape. M is the intersection point of the view line and the projection plane, and the vector OM is perpendicular to the plane. At the same time MW ⊥ OM, so OM × MW points in the opposite direction of the y-axis of the projection plane, and MW lies along its x-axis. We then calculate three unit vectors i_OM, i_h and i_m using Equation (2):

i_OM = OM / |OM| = OV / |OV|
i_h = i_MW × i_OM
i_m = i_OM × i_h    (2)
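Equation (2) amounts to building an orthonormal camera basis from the view direction. The sketch below follows that construction; note the poster does not specify how the direction i_MW is obtained, so here it is derived from an assumed world up vector, which is our own convention:

```python
import numpy as np

def camera_basis(view_point, up=np.array([0.0, 0.0, 1.0])):
    """Unit vectors of Equation (2). i_om points from the shape center O
    toward the viewpoint V; i_mw (the direction of MW, the plane's x-axis)
    is assumed here to come from a world 'up' reference vector."""
    i_om = view_point / np.linalg.norm(view_point)
    i_mw = np.cross(up, i_om)
    i_mw = i_mw / np.linalg.norm(i_mw)  # degenerate if view_point is parallel to 'up'
    i_h = np.cross(i_mw, i_om)          # i_h = i_MW x i_OM
    i_m = np.cross(i_om, i_h)           # i_m = i_OM x i_h
    return i_om, i_h, i_m
```

The three returned vectors are mutually orthogonal unit vectors, as the projection in Equation (3) requires.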
Given a point P(Xg, Yg, Zg), the projected coordinates of P in the plane xMy are calculated using Equation (3), where z is the projection of OP along i_OM:

x = i_m · OP
y = i_h · OP
z = i_OM · OP    (3)

The depth of point P is then d = ρ − z, i.e., d = ρ − (1/ρ)(X Xg + Y Yg + Z Zg).
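Given the basis vectors from Equation (2), the projection and depth of a point follow directly from Equation (3); a minimal sketch (the function name is ours):

```python
import numpy as np

def project_and_depth(P, i_om, i_h, i_m, rho):
    """Project point P onto the plane xMy (Equation 3) and return
    its depth d = rho - z, where z is the projection of OP along i_om."""
    x = i_m @ P
    y = i_h @ P
    z = i_om @ P
    return x, y, rho - z
```

As a sanity check, a point at the shape center (the origin O) has z = 0 and therefore depth d = ρ.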
2.2 Depth image generation

To map projected coordinates to pixel coordinates, the camera intrinsic matrix is applied. Each point of the 3D shape is projected onto pixel coordinates (we round distances to fit the pixel grid). Since one pixel may receive many points of the 3D shape, only the point with the minimum depth value is recorded; this yields the depth image. At the same time, the indices of the points with minimum depth are recorded in an index map, which is used for backpropagation through the depth layer.

2.3 Backpropagation through depth layer
Backpropagation through the depth layer computes the loss gradients with respect to the input (θ, ϕ) given the loss gradients of the output (the depth image). During backpropagation, each pixel receives a ∂loss/∂d. Since the depth d is known and (Xg, Yg, Zg) can be retrieved from the index map, ∂loss/∂θ and ∂loss/∂ϕ in Equation (5) can be calculated with Equation (1) and Equation (4):

∂d/∂θ = ∂d/∂X · ∂X/∂θ + ∂d/∂Y · ∂Y/∂θ + ∂d/∂Z · ∂Z/∂θ
∂d/∂ϕ = ∂d/∂X · ∂X/∂ϕ + ∂d/∂Y · ∂Y/∂ϕ + ∂d/∂Z · ∂Z/∂ϕ    (4)

∂loss/∂θ = ∂loss/∂d · ∂d/∂θ
∂loss/∂ϕ = ∂loss/∂d · ∂d/∂ϕ    (5)

Each pixel of the depth image yields a ∂loss/∂θ and a ∂loss/∂ϕ; we average them over all pixels to obtain the final loss gradients of (θ, ϕ).
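The chain rule in Equations (4) and (5) can be written out explicitly for one pixel, using ∂d/∂X = −Xg/ρ (and likewise for Y, Z) from the depth formula d = ρ − (1/ρ)(X Xg + Y Yg + Z Zg). A sketch, with a function name of our own choosing:

```python
import numpy as np

def view_gradients(theta, phi, rho, Pg, dloss_dd):
    """Loss gradients w.r.t. (theta, phi) for one pixel whose recorded
    3D point (from the index map) is Pg = (Xg, Yg, Zg); Equations (4), (5)."""
    Xg, Yg, Zg = Pg
    # d = rho - (X*Xg + Y*Yg + Z*Zg)/rho  =>  dd/dX = -Xg/rho, etc.
    dd_dX, dd_dY, dd_dZ = -Xg / rho, -Yg / rho, -Zg / rho
    # Partial derivatives of Equation (1)
    dX_dth = rho * np.cos(theta) * np.cos(phi)
    dY_dth = rho * np.cos(theta) * np.sin(phi)
    dZ_dth = -rho * np.sin(theta)
    dX_dph = -rho * np.sin(theta) * np.sin(phi)
    dY_dph = rho * np.sin(theta) * np.cos(phi)
    dZ_dph = 0.0
    dd_dth = dd_dX * dX_dth + dd_dY * dY_dth + dd_dZ * dZ_dth  # Equation (4)
    dd_dph = dd_dX * dX_dph + dd_dY * dY_dph + dd_dZ * dZ_dph
    return dloss_dd * dd_dth, dloss_dd * dd_dph                # Equation (5)
```

Averaging these per-pixel gradients over the whole depth image gives the final gradients of (θ, ϕ) passed back to the 3D CNN.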
3. Results
To evaluate the performance of view selection, we compare our spatial transformer network against two baseline approaches: Projected Area, which selects the view maximizing the projected area of the 3D model, and Random, which selects a viewpoint randomly. The recognition network is trained on 40 object categories, each containing 250 3D shapes. Figure 3 shows some of the recognition results: our view selection approach clearly outperforms the baseline approaches. Figure 4 shows some view selection results.
Figure 3: The accuracy of object recognition with the three approaches (bar chart of accuracy for chair, airplane and bike under Ours, Projected Area and Random).
Figure 4: Some results of view selection
4. Conclusions
We propose a volumetric spatial transformer network and show it to be effective for object recognition. The intermediate result (θ, ϕ) can also be used for many other tasks. The work can be extended by combining a recurrent neural network (RNN) with our volumetric spatial transformer network to develop next-best-view approaches that reduce the recognition uncertainty of 3D objects with a minimal number of views. This work was supported in part by NSFC (61379103, 61572507, 61532003, 61622212).
References
[1] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
[2] Zhirong Wu, Shuran Song, Aditya Khosla, et al. 3D ShapeNets: A deep representation for volumetric shapes. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015.
[3] Kai Xu, Hui Huang, Yifei Shi, Hao Li, Pinxin Long, Jiannong Caichen, Wei Sun, and Baoquan Chen. Autoscanning for coupled scene reconstruction and proactive object analysis. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia 2015), 34(6):177:1–177:14, 2015.
[4] Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or, and Baoquan Chen. 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia 2016), 35(6):to appear, 2016.
SA ’16 Posters, December 05-08, 2016, Macao