
Replacing LIDAR: 3D Object Detection using Monodepth Estimation
Benjamin Goeing, Lars Jebe

Motivation

• LIDAR is very expensive, but currently irreplaceable for many autonomous driving and robotics applications

• The goal of this project is to perform 3D object detection (3D bounding box prediction and classification) from a single RGB image:

• We use Frustum PointNet [1], which predicts a 3D bounding box from an RGB image in combination with LIDAR data

• We replace the LIDAR data with unsupervised depth estimates predicted from a single RGB image [2]

• Current results show that 3D object detection relies on LIDAR: with LIDAR, accuracy reaches up to 65%, whereas the best performance without LIDAR is below 8%

• The existing architecture relies on both a depth input and an RGB input to produce a 3D bounding box, as detailed in the Existing Approach figure below:

Related Work

References:
1. Qi, Charles R., et al. "Frustum pointnets for 3d object detection from rgb-d data." CVPR 2018.

2. Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised monocular depth estimation with left-right consistency." CVPR 2017.

3. Chen, Xiaozhi, et al. "Monocular 3d object detection for autonomous driving." CVPR 2016.

4. Zhou, Yin, and Oncel Tuzel. "Voxelnet: End-to-end learning for point cloud based 3d object detection." CVPR 2018.


• 3D object detection from a single image is a fundamentally ill-posed problem, which can potentially be solved for homogeneous datasets with strong priors

• We evaluated our model on the KITTI dataset, which contains a variety of street-view images

New Technique

Existing Approach [1]:

(Figure: an RGB image and a LIDAR point cloud are fed to Frustum PointNet, which outputs a 3D bounding box)

Our Approach:

• We propose to replace the measured depth data with a monodepth prediction from the RGB image. The following diagram represents our approach (a sketch of the depth-to-point-cloud step follows the diagram):

(Diagram: RGB image → Subgraph 1: monodepth prediction → differentiable transformation from real depth to point cloud → Subgraph 2: Frustum PointNet → 3D bounding box)
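The differentiable transformation in the diagram is what turns the monodepth prediction into a pseudo-LIDAR input for Frustum PointNet. The poster contains no code, so the following is only a minimal NumPy sketch of that back-projection under a pinhole camera model with metric depth; the function name, the hypothetical run_monodepth call, and the KITTI-like intrinsics values are illustrative assumptions, not taken from the poster.

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud
    in the camera frame (illustrative helper, not from the poster)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid: u = column, v = row
    z = depth
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels without valid depth

# Usage with KITTI-like intrinsics (placeholder values):
# depth = run_monodepth(rgb)   # hypothetical call to the monodepth network [2]
# cloud = depth_to_point_cloud(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)

Writing the same arithmetic in a deep-learning framework's tensor ops keeps the transformation differentiable, which is what allows gradients from the detection loss to reach the depth network in the end-to-end step.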

Two Steps of Implementation

1. Re-training of Frustum PointNet (Subgraph 2) (completed)
   ○ This step involves retraining the model on the monodepth predictions from Subgraph 1, which runs inference only (see the training sketch after this list)

2. End-to-end training (work in progress)
   ○ This step involves training both subgraphs end-to-end to achieve even higher performance
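As a rough illustration of the difference between the two steps, here is a hedged PyTorch-style sketch. All module and helper names (MonodepthNet, FrustumPointNet, detection_loss, loader, and the intrinsics fx, fy, cx, cy) are placeholder assumptions, since the poster does not provide an implementation.

import torch

monodepth = MonodepthNet()      # Subgraph 1: pretrained monodepth network [2] (placeholder class)
detector = FrustumPointNet()    # Subgraph 2: Frustum PointNet [1] (placeholder class)

# Step 1 (completed): freeze Subgraph 1 and retrain Subgraph 2 on its depth predictions.
monodepth.eval()
for p in monodepth.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)

for rgb, frustum, target in loader:                      # assumed KITTI data loader
    with torch.no_grad():
        depth = monodepth(rgb)                           # Subgraph 1 runs inference only
    cloud = depth_to_point_cloud(depth, fx, fy, cx, cy)  # tensor version of the helper above
    loss = detection_loss(detector(cloud, frustum), target)
    optimizer.zero_grad()
    loss.backward()                                      # gradients stop at the frozen depth net
    optimizer.step()

# Step 2 (work in progress): drop the freeze / no_grad above and optimize both
# subgraphs jointly with a combined depth + detection loss.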


Experimental Results

Baseline Performance (using LIDAR):
● eval segmentation accuracy: 0.904
● eval segmentation avg class acc: 0.907
● eval box IoU (ground/3D): 0.743 / 0.696
● eval box estimation accuracy (IoU=0.7): 0.629

Step 1 Performance (Frustum PointNet retrained on monodepth predictions):
● eval segmentation accuracy: 0.715
● eval segmentation avg class acc: 0.756
● eval box IoU (ground/3D): 0.253 / 0.230
● eval box estimation accuracy (IoU=0.7): 0.035
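The box metrics above are intersection-over-union (IoU) scores between predicted and ground-truth 3D boxes; a box counts as correct at IoU=0.7 if it overlaps the ground truth by at least 70%. As a simplified, hedged illustration of the metric (the KITTI / Frustum PointNet evaluation additionally handles box rotation, which this sketch ignores), an axis-aligned 3D IoU can be computed as follows:

import numpy as np

def axis_aligned_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).
    Simplified illustration; the actual evaluation uses rotated boxes."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    lo = np.maximum(a[:3], b[:3])                  # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])                  # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0, None))     # zero volume if the boxes do not overlap
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

# Example: a prediction is counted as correct at the IoU=0.7 threshold if
# axis_aligned_iou_3d(pred_box, gt_box) >= 0.7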

Discussion

• Given the ill-posed nature of the problem, our approach shows promising results, but falls short of near-LIDAR quality

• One particularly noteworthy observation is that the predicted point distribution is often correct in a relative sense but wrong in an absolute sense (e.g. an object's points are shifted by a roughly constant offset)

Next steps:

• Train Subgraphs (1) and (2) end-to-end with a combined loss function for depth estimation and object detection (a sketch of such a loss follows)
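The poster does not specify the combined loss; one plausible form, stated purely as an assumption, is a weighted sum of the monodepth training loss (e.g. the left-right consistency loss of [2]) and the Frustum PointNet detection loss:

def combined_loss(depth_loss, detection_loss, w_depth=0.1, w_det=1.0):
    """Weighted sum of the two task losses (both assumed to be scalar framework
    tensors). The weights are illustrative, not values reported on the poster."""
    return w_depth * depth_loss + w_det * detection_loss

# Step 2 (end-to-end) would compute both losses in one forward pass and
# backpropagate the combined loss through both subgraphs:
#   loss = combined_loss(depth_loss, detection_loss)
#   loss.backward()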