Deep Models for 3D Reconstruction
Andreas Geiger
Autonomous Vision Group, MPI for Intelligent Systems, Tübingen
Computer Vision and Geometry Group, ETH Zurich
October 12, 2017
Autonomous Vision Group
Max Planck Institute for Intelligent Systems
3D Reconstruction
[Furukawa & Hernandez: Multi-View Stereo: A Tutorial]
Task:
- Given a set of 2D images
- Reconstruct 3D shape of object/scene
3D Reconstruction Pipeline
Input Images → Camera Poses → Dense Correspondences → Depth Maps → Depth Map Fusion → 3D Reconstruction
Large 3D Datasets and Repositories
[Newcombe et al., 2011] [Choi et al., 2011] [Dai et al., 2017]
[Wu et al., 2015] [Chang et al., 2015] [Chang et al., 2017]
Can we learn 3D Reconstruction from Data?
OctNet: Learning Deep 3D Representations at High Resolutions
[Riegler, Ulusoy, & Geiger, CVPR 2017]
Deep Learning in 2D
[LeCun, 1998]
Deep Learning in 3D
- Existing 3D networks limited to ∼32³ voxels
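The resolution limit follows from simple counting: a dense activation volume grows cubically with its side length. A rough sketch, assuming float32 activations and a hypothetical 8 feature channels per layer (both are illustrative choices, not figures from the talk):

```python
def dense_grid_bytes(resolution, channels=8, bytes_per_value=4):
    """Memory for one dense activation volume of size resolution^3.

    channels=8 and float32 (4 bytes) are illustrative assumptions.
    """
    return resolution ** 3 * channels * bytes_per_value


for res in (32, 64, 128, 256):
    mib = dense_grid_bytes(res) / 2 ** 20
    print(f"{res}^3: {mib:8.1f} MiB per layer")
```

Each doubling of the resolution multiplies the cost by eight, before counting gradients, further layers, or batch size.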
3D Data is often Sparse
[Geiger et al., 2012] [Li et al., 2016]
Can we exploit sparsity for efficient deep learning?
Network Activations
Layer 1: 32³   Layer 2: 16³   Layer 3: 8³
Idea:
- Partition space adaptively based on sparse input
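The idea of adaptive partitioning can be sketched with a plain recursive octree: a cell is subdivided only while it contains occupied voxels, so empty space ends up in a few large leaves. This is a generic illustration with hypothetical names (`build_octree`, `surface`), not the data structure used in OctNet itself.

```python
def build_octree(occupied, origin=(0, 0, 0), size=32):
    """Return a list of leaf cells (origin, size, is_occupied).

    occupied: set of (x, y, z) integer voxel coordinates.
    A cell is subdivided only while it contains occupied voxels, so
    empty space is covered by a few large leaves.
    """
    x0, y0, z0 = origin
    inside = {
        (x, y, z) for (x, y, z) in occupied
        if x0 <= x < x0 + size and y0 <= y < y0 + size and z0 <= z < z0 + size
    }
    if not inside:
        return [(origin, size, False)]
    if size == 1:
        return [(origin, 1, True)]
    half = size // 2
    leaves = []
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                leaves += build_octree(inside, (x0 + dx, y0 + dy, z0 + dz), half)
    return leaves


# A thin diagonal "surface" in a 32^3 volume: 32 occupied voxels.
surface = {(i, i, i) for i in range(32)}
leaves = build_octree(surface)
print(len(leaves), "leaves instead of", 32 ** 3, "dense voxels")
```

The leaves still tile the whole volume, but fine cells appear only near the occupied voxels.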
Convolution
[Figure: convolution animated step by step; values shown: 0.125, 0.250, 0.125, 0.000, 0.000, 0.000, 0.125, 0.250, 0.125]
- Differentiable ⇒ allows for end-to-end learning
Efficient Convolution
This operation can be implemented very efficiently:
- 4 different cases
- First case requires only 1 evaluation!
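One way to read the single-evaluation case, assuming it refers to the interior of an octree cell whose voxels all share one value: wherever the kernel lies entirely inside the cell, the response is the cell value times the sum of the kernel weights. The sketch below checks this against a naive loop; the kernel weights and cell size are arbitrary illustrative choices.

```python
import itertools


def conv_at(volume, kernel, x, y, z):
    """Naive 3x3x3 convolution response centred at (x, y, z)."""
    total = 0.0
    for dx, dy, dz in itertools.product((-1, 0, 1), repeat=3):
        total += kernel[dx + 1][dy + 1][dz + 1] * volume[x + dx][y + dy][z + dz]
    return total


size, value = 8, 0.5  # one octree cell of 8^3 voxels sharing one value
cell = [[[value] * size for _ in range(size)] for _ in range(size)]

# Arbitrary illustrative kernel weights.
kernel = [[[0.01 * (i + 3 * j + 9 * k) for k in range(3)]
           for j in range(3)] for i in range(3)]
weight_sum = sum(w for plane in kernel for row in plane for w in row)

# Interior voxels: the kernel never leaves the cell.
interior = range(1, size - 1)
naive = [conv_at(cell, kernel, x, y, z)
         for x, y, z in itertools.product(interior, repeat=3)]

shortcut = value * weight_sum  # one evaluation for the whole interior
assert all(abs(r - shortcut) < 1e-9 for r in naive)
print(f"{len(naive)} interior responses, all equal to {shortcut:.4f}")
```

Voxels near the cell boundary also see neighbouring cells, so they cannot use this shortcut; those positions are presumably what the remaining cases cover.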
Pooling
- Unpooling operation defined similarly
Results: 3D Shape Classification
[Figure: classification network built from Convolution and Pooling blocks and Fully Conn. layers; example output label: Airplane]
[Plot: Memory [GB] vs. Input Resolution (8³, 16³, 32³, 64³, 128³, 256³); curves: OctNet, DenseNet]
[Plot: Runtime [s] vs. Input Resolution (8³, 16³, 32³, 64³, 128³, 256³); curves: OctNet, DenseNet]
[Plot: Accuracy vs. Input Resolution (8³, 16³, 32³, 64³, 128³, 256³); curves: OctNet, DenseNet]
- Input: voxelized meshes from ModelNet
[Plot: Accuracy vs. Input Resolution (8³, 16³, 32³, 64³, 128³, 256³); curves: OctNet 1, OctNet 2, OctNet 3]
Results: 3D Semantic Labeling
[Figure panels: Input, Prediction]
- Dataset: RueMonge2014
[Figure: encoder-decoder network with two Convolution and Pooling blocks, two Unpooling and Conv. blocks, and two Skip connections]
- Decoder octree structure copied from encoder
Method                            IoU
[Riemenschneider et al., 2014]    42.3
[Martinovic et al., 2015]         52.2
[Gadde et al., 2016]              54.4
OctNet 64³                        45.6
OctNet 128³                       50.4
OctNet 256³                       59.2
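For readers unfamiliar with the metric, IoU (intersection over union) for one class divides the number of elements labeled with that class in both prediction and ground truth by the number labeled with it in either. A toy sketch follows; the class names and labels are made up, and how the benchmark averages over classes is not stated on the slide.

```python
def class_iou(pred, truth, label):
    """Intersection over union for one class label."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    return inter / union if union else float("nan")


# Toy labels for six voxels (hypothetical class names).
truth = ["wall", "wall", "window", "window", "door", "wall"]
pred = ["wall", "window", "window", "window", "door", "wall"]

for label in ("wall", "window", "door"):
    print(label, round(class_iou(pred, truth, label), 3))
```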
OctNetFusion: Learning Depth Fusion from Data
[Riegler, Ulusoy, Bischof & Geiger, 3DV 2017]
Volumetric Fusion
$$d_{i+1}(p) = \frac{w_i(p)\, d_i(p) + w(p)\, d(p)}{w_i(p) + w(p)}$$
$$w_{i+1}(p) = w_i(p) + w(p)$$
- $p \in \mathbb{R}^3$: voxel location
- $d$: distance, $w$: weight
[Curless and Levoy, SIGGRAPH 1996]
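The update is a running weighted average per voxel. A minimal sketch for a single voxel, reading the indexed quantities as the accumulated estimate after i views and the unindexed ones as the incoming observation; the function name `fuse` and the zero-weight guard are additions for illustration:

```python
def fuse(d_i, w_i, d_new, w_new):
    """One Curless-Levoy update for a single voxel p.

    d_i, w_i: running distance and weight after i views.
    d_new, w_new: distance and weight observed in the incoming view.
    Returns (d_{i+1}, w_{i+1}).
    """
    w_next = w_i + w_new
    if w_next == 0:
        return d_i, 0.0  # voxel not observed so far (guard added here)
    d_next = (w_i * d_i + w_new * d_new) / w_next
    return d_next, w_next


# Three noisy observations of one voxel, each with unit weight.
d, w = 0.0, 0.0
for obs in (0.13, 0.08, 0.09):
    d, w = fuse(d, w, obs, 1.0)
print(round(d, 4), w)
```

With equal weights the result is the plain mean of the observations, which is why noise only averages out when many redundant views are available.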
Volumetric Fusion
- Pros:
  - Simple, fast, easy to implement
  - De facto "gold standard" (KinectFusion, Voxel Hashing, ...)
- Cons:
  - Requires many redundant views to reduce noise
  - Can't handle outliers / complete missing surfaces
[Figure panels: Ground Truth, Volumetric Fusion, TV-L1 Fusion, OctNetFusion]
TV-L1 Fusion
- Pros:
  - Prior on surface area
  - Noise reduction
- Cons:
  - Simplistic local prior (penalizes surface area, shrinking bias)
  - Can't complete missing surfaces
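For reference, TV-L1 fusion is commonly written as a variational problem: a total-variation regularizer on the fused distance field plus an L1 data term against the per-view distance fields. The symbols below ($u$ for the fused field, $f_i$ for the per-view fields, $\lambda$ for the trade-off weight, $\Omega$ for the volume) do not appear on the slide and are assumptions of this sketch:

```latex
\min_{u} \int_{\Omega} \Big( |\nabla u(p)| + \lambda \sum_{i} |u(p) - f_i(p)| \Big) \, dp
```

The first term is the surface-area prior named above. Because it can always be lowered by making the surface smaller, it is also the source of the shrinking bias.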
Learned Fusion
- Pros:
  - Learn noise suppression from data
  - Learn surface completion from data
- Cons:
  - Requires large 3D datasets for training
  - How to scale to high resolutions?
Learning 3D Fusion
[Figure: encoder-decoder network with two Convolution and Pooling blocks, two Unpooling and Conv. blocks, and two Skip connections]
Input Representation:
- TSDF
- Higher-order statistics
Output Representation:
- Occupancy
- TSDF
What is the problem?
- Octree structure unknown ⇒ needs to be inferred as well!
OctNetFusion Architecture
[Figure: architecture at three resolutions (64³, 128³, 256³), each with Input and Output; labels: Features, Octree Structure, ∆64, ∆128, ∆256]
Results: Surface Reconstruction
[Figure: columns VolFus, TV-L1, Ours, Ground Truth; rows 64³, 128³, 256³]
Results: Volumetric Completion
[Figure panels: [Firman, 2016], Ours, Ground Truth]
Thank you!