
Supplementary Material: Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

Zhile Ren    Erik B. Sudderth
Department of Computer Science, Brown University, Providence, RI 02912, USA

1. Proposing Layout Candidates

We predict floors and ceilings as the 0.001 and 0.999 quantiles of the 3D points along the gravity direction. After that, we only need to propose wall candidates that follow a Manhattan structure.

We discretize orientation into 18 evenly spaced angles between 0° and 180°. For each orientation, the frontal wall is bounded by the 0.99 quantile of the farthest points, and the back wall is bounded by the camera location. Because the layout candidates should follow a Manhattan structure, the left and right walls must be orthogonal to the frontal wall and are bounded by the 3D points. We further discretize along the width and depth directions by 0.1 m, and propose candidates that capture at least 80% of all 3D points. For typical scenes, there are 5,000-20,000 layout hypotheses.
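The Python sketch below illustrates one way this enumeration could be implemented. It is not the authors' code: the gravity-aligned frame with z up, the camera placed at the origin, and all function and variable names (`propose_layouts`, `points`, etc.) are assumptions made for illustration, and the triple loop is kept deliberately naive for clarity.

```python
# Hypothetical sketch of the layout-candidate proposal described above.
# Assumes `points` is an (N, 3) NumPy array in a gravity-aligned frame
# (z up) with the camera at the origin looking along +y.
import numpy as np

def propose_layouts(points, n_angles=18, step=0.1, min_coverage=0.8):
    floor = np.quantile(points[:, 2], 0.001)      # floor height
    ceiling = np.quantile(points[:, 2], 0.999)    # ceiling height
    candidates = []
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        # Rotate the horizontal coordinates so the frontal wall is axis-aligned.
        c, s = np.cos(theta), np.sin(theta)
        xy = points[:, :2] @ np.array([[c, s], [-s, c]])
        depth_max = np.quantile(xy[:, 1], 0.99)   # bound on the frontal wall
        x_min, x_max = xy[:, 0].min(), xy[:, 0].max()
        # Discretize frontal, left, and right wall positions by `step` (0.1 m);
        # the back wall stays at the camera (depth 0).
        for front in np.arange(step, depth_max + step, step):
            for left in np.arange(x_min, x_max, step):
                for right in np.arange(left + step, x_max + step, step):
                    inside = ((xy[:, 1] >= 0.0) & (xy[:, 1] <= front) &
                              (xy[:, 0] >= left) & (xy[:, 0] <= right)).mean()
                    if inside >= min_coverage:    # keep candidates covering >= 80% of points
                        candidates.append((theta, floor, ceiling, front, left, right))
    return candidates
```

In practice the inner enumeration would be vectorized or pruned, but the structure (quantile bounds, 18 orientations, 0.1 m discretization, 80% coverage filter) follows the description above.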

2. Contextual Features

Here, we give a detailed explanation of the contextual features we use to model object-object and object-layout relationships in the second-stage cascaded classifier.

For completeness, we define the notation again. For an overlapping pair of detected bounding boxes $B_i$ and $B_j$, we denote their volumes as $V(B_i)$ and $V(B_j)$, the volume of their overlap as $O(B_i, B_j)$, and the volume of their union as $U(B_i, B_j)$. We characterize their geometric relationship via three features:
$$S_1(i,j) = \frac{O(B_i,B_j)}{V(B_i)}, \qquad S_2(i,j) = \frac{O(B_i,B_j)}{V(B_j)}, \qquad S_3(i,j) = \frac{O(B_i,B_j)}{U(B_i,B_j)},$$
where $S_3(i,j)$ is the intersection-over-union (IOU). To model object-layout context [1], we compute the distance $D(B_i, M)$ and angle $A(B_i, M)$ of cuboid $B_i$ to the closest wall in layout $M$.

The first-stage detectors provide a most-probable layout hypothesis, as well as a set of detections (following non-maximum suppression) for each category. For each bounding box $B_i$ with confidence score $z_i$, there may be several bounding boxes of various categories $c \in \{1, 2, \ldots, C\}$ that overlap with it. We let $i_c$ be the instance of category $c$ with the maximum confidence score $z_{i_c}$. The features $\psi_i$ for bounding box $B_i$ are then as follows (a sketch assembling them appears after this list):

1. Constant bias feature, and confidence score $z_i$ from the first-stage detector.

2. For $m \in \{1, 2, 3\}$ and $c \in \{1, 2, \ldots, C\}$, we calculate $S_m(i, i_c)$, $S_m(i, i_c) \cdot z_{i_c}$, and $S_m(i, i_c) \cdot z_i$, and concatenate those numbers.

3. For $c \in \{1, 2, \ldots, C\}$, we calculate the difference in confidence score from each first-stage detector, $z_i - z_{i_c}$, and concatenate those numbers.

4. For $D(B_i, M)$, we consider radial basis functions of the form
$$f_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right) \qquad (1)$$
For a typical indoor scene, the largest object-to-wall distance is usually less than 5 m, therefore we space the basis function centers $\mu_j$ evenly between 0 and 5 with step size 0.5, and choose $\sigma = 0.5$. We expand $D(B_i, M)$ using this radial basis expansion.

5. The absolute value of the cosine of the angle to the closest wall, $|\cos(A(B_i, M))|$.
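A minimal sketch of how the per-detection feature vector $\psi_i$ could be assembled from the quantities above. The inputs are assumed to be precomputed and all names are hypothetical: `S[m][c]` stands for $S_{m+1}(i, i_c)$, `z_best[c]` for $z_{i_c}$, and `dist_to_wall`, `angle_to_wall` for $D(B_i, M)$ and $A(B_i, M)$; only the structure of the vector follows the list.

```python
# Sketch of the second-stage object feature vector psi_i (hypothetical names).
import numpy as np

def rbf_expand(x, centers=np.arange(0.0, 5.0 + 1e-9, 0.5), sigma=0.5):
    # Radial basis expansion of Eq. (1): centers 0, 0.5, ..., 5.0 and sigma = 0.5.
    return np.exp(-(x - centers) ** 2 / (2.0 * sigma ** 2))

def object_features(z_i, S, z_best, dist_to_wall, angle_to_wall):
    C = len(z_best)
    psi = [1.0, z_i]                                       # bias + first-stage score
    for m in range(3):                                     # S1, S2, S3
        for c in range(C):
            psi += [S[m][c], S[m][c] * z_best[c], S[m][c] * z_i]
    psi += [z_i - z_best[c] for c in range(C)]             # score differences
    psi += list(rbf_expand(dist_to_wall))                  # distance-to-wall expansion
    psi.append(abs(np.cos(angle_to_wall)))                 # wall-alignment feature
    return np.array(psi)
```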

To model the second-stage layout candidates, we select the bounding box $i_c$ with the highest confidence score $z_{i_c}$ from the first-stage classifier in each category $c \in \{1, 2, \ldots, C\}$, and use the following features for layout $M_i$ with confidence score $z'_i$ (see the sketch after this list):

1. All the features used in the first stage to model $M_i$ using Manhattan Voxels.

2. For $c \in \{1, 2, \ldots, C\}$, we calculate the radial basis expansion of $D(B_{i_c}, M_i)$, and its product with $z'_i$ and with $z_{i_c}$.

3. For $c \in \{1, 2, \ldots, C\}$, we calculate the absolute value of the cosine of the angle to the closest wall, $|\cos(A(B_{i_c}, M_i))|$, together with $|\cos(A(B_{i_c}, M_i))| \cdot z'_i$ and $|\cos(A(B_{i_c}, M_i))| \cdot z_{i_c}$.

4. For $c \in \{1, 2, \ldots, C\}$, we calculate the difference in confidence score from each first-stage detector, $z'_i - z_{i_c}$, and concatenate those numbers.
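An analogous sketch for the layout feature vector, reusing `rbf_expand` from the previous snippet. Here `mv_feats` stands in for the first-stage Manhattan Voxel features, `z_layout` for $z'_i$, `z_best[c]` for $z_{i_c}$, and `dists[c]`, `angles[c]` for the distance and angle of box $i_c$ to the walls of layout $M_i$; all names are hypothetical.

```python
# Sketch of the second-stage layout feature vector (hypothetical names).
def layout_features(mv_feats, z_layout, z_best, dists, angles):
    phi = list(mv_feats)                                   # first-stage Manhattan Voxel features
    for c in range(len(z_best)):
        r = rbf_expand(dists[c])                           # RBF expansion of the wall distance
        phi += list(r) + list(r * z_layout) + list(r * z_best[c])
        a = abs(np.cos(angles[c]))                         # wall-alignment feature
        phi += [a, a * z_layout, a * z_best[c]]
        phi.append(z_layout - z_best[c])                   # score difference
    return np.array(phi)
```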


3. Additional Experimental Results

Orientation. Besides evaluating detection based on the intersection-over-union score, we can also evaluate the accuracy of the predicted orientations. In Fig. 1 we plot the cumulative counts of true-positive detections whose orientation error in degrees is below each threshold. Using geometric features and COG, our detector detects more true positives than Sliding Shapes [3], while maintaining comparable accuracy in orientation estimation. These curves are not related to the precision-recall curves, as we only evaluate on true positives.
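The curves in Fig. 1 amount to a cumulative histogram over matched detections. The snippet below is a sketch of that tabulation; it assumes the true-positive matching to ground truth (by IoU) has already been done, and all names are illustrative.

```python
# Sketch of the cumulative orientation-error counts plotted in Fig. 1.
import numpy as np

def cumulative_orientation_counts(pred_angles_deg, gt_angles_deg,
                                  thresholds=np.arange(0, 181, 5)):
    err = np.abs(np.asarray(pred_angles_deg) - np.asarray(gt_angles_deg)) % 360.0
    err = np.minimum(err, 360.0 - err)            # wrap errors into [0, 180] degrees
    return np.array([(err <= t).sum() for t in thresholds])
```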

[Figure 1: for each of the four categories (Bed, Sofa, Chair, Toilet), the count of true positives is plotted against orientation error in degrees (0-180), for Geom+COG and Sliding-Shape.]

Figure 1. Quantification of orientation estimation accuracy for the four object categories considered by [2].

Additional Layout Prediction Results. Besides the two examples of layout prediction shown in the paper, in Figure 2 we show some additional results to demonstrate the effectiveness of Manhattan Voxels.

References

[1] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. In ICCV, pages 1417-1424. IEEE, 2013.

[2] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR. IEEE, 2015.

[3] S. Song and J. Xiao. Sliding Shapes for 3D object detection in depth images. In ECCV, pages 634-651. Springer, 2014.

Figure 2. Comparison of our Manhattan voxel 3D layout predictions (blue) to the SUN RGB-D baseline ([2], green) and the ground truth annotations (red). Our learning-based approach is less sensitive to outliers and degrades gracefully in cases where the true scene structure violates the Manhattan world assumption.