
Deep Models for 3D Reconstruction

Andreas Geiger

Autonomous Vision Group, MPI for Intelligent Systems, Tübingen
Computer Vision and Geometry Group, ETH Zürich

October 12, 2017


3D Reconstruction

[Furukawa & Hernandez: Multi-View Stereo: A Tutorial]

Task:
- Given a set of 2D images
- Reconstruct 3D shape of object/scene

3D Reconstruction Pipeline

Input Images → Camera Poses → Dense Correspondences → Depth Maps → Depth Map Fusion → 3D Reconstruction

Large 3D Datasets and Repositories

[Newcombe et al., 2011] [Choi et al., 2011] [Dai et al., 2017]

[Wu et al., 2015] [Chang et al., 2015] [Chang et al., 2017]

Can we learn 3D Reconstruction from Data?

OctNet: Learning Deep 3D Representations at High Resolutions

[Riegler, Ulusoy, & Geiger, CVPR 2017]

Deep Learning in 2D

[LeCun, 1998]


Deep Learning in 3D

- Existing 3D networks limited to ∼ 32³ voxels

3D Data is often Sparse

[Geiger et al., 2012] [Li et al., 2016]

Can we exploit sparsity for efficient deep learning?

Network Activations

Layer 1: 32³, Layer 2: 16³, Layer 3: 8³

Idea:
- Partition space adaptively based on sparse input
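A minimal sketch of the adaptive-partition idea (plain Python/NumPy, not the OctNet data structure; `build_octree` and its nested-dict representation are illustrative): split a cell into eight children only where it contains occupied voxels, so empty space stays coarse.

```python
import numpy as np

def build_octree(occ, depth=0, max_depth=3):
    """Recursively partition a cubic occupancy grid.

    Empty cells stay as single leaves; occupied regions are split
    into 8 children until max_depth. `occ` is a boolean array of
    shape (n, n, n) with n a power of two.
    """
    if not occ.any() or depth == max_depth:
        # Homogeneous or finest level: one leaf covers the whole cell.
        return {"leaf": True, "occupied": bool(occ.any())}
    h = occ.shape[0] // 2
    children = []
    for x in (0, h):
        for y in (0, h):
            for z in (0, h):
                children.append(build_octree(occ[x:x+h, y:y+h, z:z+h],
                                             depth + 1, max_depth))
    return {"leaf": False, "children": children}

# Example: a 32^3 grid with one small occupied block stays coarse
# everywhere except around the block.
grid = np.zeros((32, 32, 32), dtype=bool)
grid[4:8, 4:8, 4:8] = True
tree = build_octree(grid)
```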

Convolution

3×3 kernel weights shown in the slide animation (arranged here row-major):

0.125  0.250  0.125
0.000  0.000  0.000
0.125  0.250  0.125

- Differentiable ⇒ allows for end-to-end learning
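As a concrete illustration of the differentiability point (a minimal PyTorch sketch, not the OctNet implementation), the kernel above can be applied as a convolution whose weights receive gradients, which is what makes end-to-end learning possible:

```python
import torch
import torch.nn.functional as F

# The 3x3 kernel from the slide, as a learnable convolution weight.
kernel = torch.tensor([[0.125, 0.250, 0.125],
                       [0.000, 0.000, 0.000],
                       [0.125, 0.250, 0.125]]).reshape(1, 1, 3, 3)
weight = kernel.clone().requires_grad_(True)

x = torch.rand(1, 1, 8, 8)             # toy input grid
y = F.conv2d(x, weight, padding=1)     # differentiable convolution
y.sum().backward()                     # gradients flow to the kernel
print(weight.grad.shape)               # torch.Size([1, 1, 3, 3])
```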

Efficient Convolution

This operation can be implemented very efficiently:
- 4 different cases
- First case requires only 1 evaluation!
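A hedged 1D illustration of why constant cells make convolution cheap (a simplified stand-in for the 3D octree case; `conv_piecewise_constant` is an illustrative name): inside a run of constant value the output equals the value times the kernel sum, so one evaluation covers the whole run, and only window positions crossing a run boundary need explicit evaluation.

```python
import numpy as np

def conv_piecewise_constant(runs, kernel):
    """'Same'-size cross-correlation (deep-learning convention) of a
    1D piecewise-constant signal given as (length, value) runs."""
    signal = np.concatenate([np.full(n, v, dtype=float) for n, v in runs])
    k = len(kernel) // 2
    ksum = float(np.sum(kernel))
    padded = np.pad(signal, k)            # zero padding
    out = np.empty_like(signal)
    start = 0
    for n, v in runs:
        end = start + n
        lo, hi = start + k, end - k
        if lo < hi:
            out[lo:hi] = v * ksum          # cheap case: one evaluation
        for i in range(start, end):
            if lo <= i < hi:
                continue
            # Window crosses a run boundary: evaluate explicitly.
            out[i] = float(np.dot(padded[i:i + 2 * k + 1], kernel))
        start = end
    return out

runs = [(8, 0.0), (4, 1.0), (8, 0.0)]
kernel = np.array([0.125, 0.250, 0.125])
print(conv_piecewise_constant(runs, kernel))
```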

Pooling

- Unpooling operation defined similarly
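Continuing the toy octree from the earlier sketch (again illustrative, not the OctNet implementation, and pooling occupancy rather than feature vectors), one pooling level simply merges eight sibling leaves into a single coarser leaf:

```python
def max_pool_octree(node):
    """One pooling level on the toy octree from build_octree:
    eight sibling leaves collapse into one coarser leaf holding
    their maximum (logical OR of occupancy)."""
    if node["leaf"]:
        return node
    children = [max_pool_octree(c) for c in node["children"]]
    if all(c["leaf"] for c in children):
        return {"leaf": True,
                "occupied": any(c["occupied"] for c in children)}
    return {"leaf": False, "children": children}
```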

Results: 3D Shape Classification

[Figure: network architecture with convolution and pooling stages followed by fully connected layers; example input: Airplane]
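A rough sketch of that layer layout (a dense PyTorch stand-in: OctNet replaces these dense conv/pool ops with octree-based ones, and `VoxelClassifier` is an illustrative name, not the paper's model):

```python
import torch
import torch.nn as nn

class VoxelClassifier(nn.Module):
    """Conv+pool stages followed by fully connected layers,
    operating on a dense occupancy grid."""
    def __init__(self, num_classes=10, in_res=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        feat = 32 * (in_res // 8) ** 3   # channels * spatial cells
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(feat, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                # x: (B, 1, D, H, W)
        return self.classifier(self.features(x))

logits = VoxelClassifier()(torch.zeros(2, 1, 32, 32, 32))
```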

Results: 3D Shape Classification

[Figure: memory [GB] vs. input resolution (8³ to 256³) for OctNet and DenseNet]

Results: 3D Shape Classification

[Figure: runtime [s] vs. input resolution (8³ to 256³) for OctNet and DenseNet]

Results: 3D Shape Classification

[Figure: accuracy vs. input resolution (8³ to 256³) for OctNet and DenseNet]

- Input: voxelized meshes from ModelNet

Results: 3D Shape Classification

[Figure: accuracy vs. input resolution (8³ to 256³) for OctNet 1, OctNet 2, and OctNet 3]

- Input: voxelized meshes from ModelNet


Results: 3D Semantic Labeling

[Figure: input and prediction]

- Dataset: RueMonge2014

Results: 3D Semantic Labeling

[Figure: encoder-decoder network with convolution and pooling, skip connections, and unpooling and convolution stages]

- Decoder octree structure copied from encoder
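A hedged sketch of this encoder-decoder shape (a dense PyTorch stand-in; in OctNet the decoder reuses the encoder's octree structure, which the dense tensors here make implicit, and `SegNet3D` is an illustrative name):

```python
import torch
import torch.nn as nn

class SegNet3D(nn.Module):
    """Conv+pool encoder, unpool+conv decoder, skip connections;
    outputs per-voxel class scores."""
    def __init__(self, in_ch=1, num_classes=4):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(in_ch, 8, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(8, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec2 = nn.Sequential(nn.Conv3d(16 + 16, 16, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv3d(16 + 8, 8, 3, padding=1), nn.ReLU())
        self.out = nn.Conv3d(8, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                 # skip at full resolution
        s2 = self.enc2(self.pool(s1))     # skip at half resolution
        b = self.pool(s2)                 # bottleneck
        d2 = self.dec2(torch.cat([self.up(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), s1], dim=1))
        return self.out(d1)

scores = SegNet3D()(torch.zeros(1, 1, 32, 32, 32))
```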

Results: 3D Semantic Labeling

Method                           IoU
[Riemenschneider et al., 2014]   42.3
[Martinovic et al., 2015]        52.2
[Gadde et al., 2016]             54.4
OctNet 64³                       45.6
OctNet 128³                      50.4
OctNet 256³                      59.2

OctNetFusion: Learning Depth Fusion from Data

[Riegler, Ulusoy, Bischof & Geiger, 3DV 2017]

Volumetric Fusion

d_{i+1}(p) = (w_i(p) d_i(p) + w(p) d(p)) / (w_i(p) + w(p))

w_{i+1}(p) = w_i(p) + w(p)

- p ∈ R³: voxel location
- d: distance, w: weight

[Curless and Levoy, SIGGRAPH 1996]
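A direct NumPy transcription of the running weighted-average update above (a minimal sketch; `fuse` is an illustrative name):

```python
import numpy as np

def fuse(d_acc, w_acc, d_new, w_new):
    """Curless & Levoy TSDF update over a voxel grid:
    d <- (w*d + w_new*d_new) / (w + w_new),  w <- w + w_new."""
    w_sum = w_acc + w_new
    d_out = np.where(w_sum > 0,
                     (w_acc * d_acc + w_new * d_new) / np.maximum(w_sum, 1e-8),
                     d_acc)               # keep old value where no observation
    return d_out, w_sum

# Example: fuse two noisy depth observations of the same voxel grid.
d, w = np.zeros((4, 4, 4)), np.zeros((4, 4, 4))
d, w = fuse(d, w, d_new=np.full((4, 4, 4), 0.10), w_new=np.ones((4, 4, 4)))
d, w = fuse(d, w, d_new=np.full((4, 4, 4), 0.14), w_new=np.ones((4, 4, 4)))
# d is now 0.12 everywhere, w is 2.
```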

Volumetric Fusion

Pros:
- Simple, fast, easy to implement
- De facto "gold standard" (KinectFusion, Voxel Hashing, ...)

Cons:
- Requires many redundant views to reduce noise
- Can't handle outliers / complete missing surfaces

[Figure: Ground Truth vs. Volumetric Fusion vs. TV-L1 Fusion vs. OctNetFusion]

TV-L1 Fusion

Pros:
- Prior on surface area
- Noise reduction

Cons:
- Simplistic local prior (penalizes surface area, shrinking bias)
- Can't complete missing surfaces

Learned Fusion

Pros:
- Learn noise suppression from data
- Learn surface completion from data

Cons:
- Requires large 3D datasets for training
- How to scale to high resolutions?

Learning 3D Fusion

[Figure: encoder-decoder network with convolution and pooling, skip connections, and unpooling and convolution stages]

Input Representation:
- TSDF
- Higher-order statistics

Output Representation:
- Occupancy
- TSDF

Learning 3D Fusion

What is the problem?
- Octree structure unknown ⇒ needs to be inferred as well!

OctNetFusion Architecture

[Figure: coarse-to-fine architecture with stages at 64³, 128³, and 256³; each stage has its own input, output, features, and predicted octree structure, with per-resolution updates ∆64, ∆128, ∆256]
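One way to read the figure as code (a hypothetical sketch, assuming each stage refines a trilinearly upsampled estimate from the previous resolution; `coarse_to_fine` and the placeholder stages are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def coarse_to_fine(tsdf_in, stages, resolutions=(64, 128, 256)):
    """Run per-resolution stages, feeding each one the input resampled
    to its resolution plus the upsampled estimate from the stage below."""
    estimate = None
    for res, stage in zip(resolutions, stages):
        x = F.interpolate(tsdf_in, size=(res,) * 3, mode="trilinear",
                          align_corners=False)
        if estimate is not None:
            # Upsample the coarser estimate and feed it in as a feature.
            estimate = F.interpolate(estimate, size=(res,) * 3,
                                     mode="trilinear", align_corners=False)
            x = torch.cat([x, estimate], dim=1)
        estimate = stage(x)   # refined prediction at this resolution
    return estimate

stages = [lambda x: x[:, :1]] * 3   # placeholder per-resolution networks
out = coarse_to_fine(torch.zeros(1, 1, 16, 16, 16), stages,
                     resolutions=(16, 32, 64))
```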

Results: Surface Reconstruction

[Figure: VolFus vs. TV-L1 vs. Ours vs. Ground Truth at 64³, 128³, and 256³]

Results: Volumetric Completion

[Figure: [Firman, 2016] vs. Ours vs. Ground Truth]

Thank you!
