TRANSCRIPT
DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES
Sifei Liu, NVIDIA Research
Xueting Li, University of California, Merced
March 19, 2019
2
scene image segmentation human pose estimation
UNDERSTANDING SCENE AND HUMAN
semantic segmentation from cityscapes dataset pose estimation via OpenCV
3
instance placement human placement
CREATING SCENE OR HUMAN?
semantic segmentation from cityscapes dataset rendered scene from the SUNCG dataset
✘ ✘
4
LET’S MAKE IT MORE CHALLENGING!
shape synthesis
semantic segmentation from cityscapes dataset
5
?
LET’S MAKE IT MORE CHALLENGING!
shape synthesis
6
LET’S MAKE IT MORE CHALLENGING!
placement in the real world
videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation
7
WHAT IS AFFORDANCE?
Where are they?
scene image indoor environment
human car sitting standing
8
WHAT IS AFFORDANCE?
What do they look like?
scene image indoor environment
9
WHAT IS AFFORDANCE?
How do they interact with each other?
Input Image
Generated Poses
10
OUTLINE
Context-Aware Synthesis and Placement of Object Instances
NeurIPS 2018
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz
Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
CVPR 2019
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
11
QUIZ
Which object is a fake one?
12
SEQUENTIAL EDITING
Insert new objects one by one
13
Add a person
PROBLEM DEFINITION
Semantic map manipulation by inserting objects
14
WHY SEMANTIC MAP?
• Editing RGB image is difficult
Image 1 Image 2
Image-to-image translation, image editing, ...
15
WHY SEMANTIC MAP?
• We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world
Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017
Rendering
Semantic map
Visualization
16
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input
MAIN GOALS
17
“WHERE” MODULE
How can we learn where to put a new object?
18
“WHERE” MODULE
Pixel-wise annotation: almost impossible to get
p=0.2
p=0
p=0.8
19
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Object
20
Removed object
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
21
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Inpainting
?
22
“WHERE” MODULE
Our approach: put a box and see if it is reasonable
Good
box
Bad
box
Why box?
1) We don’t want to care about the object shape for now.
2) All objects can be covered by a bounding box.
23
“WHERE” MODULE
How to put a box?
Unit box
Affine transform
Why not use (x, y, w, h) directly?
Placing a box via discrete indices is not differentiable; the affine parameters are continuous, so gradients can flow through them.
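The differentiability argument can be sketched in a few lines: instead of indexing with (x, y, w, h), the network predicts affine parameters that warp a unit box, exactly as a spatial transformer would. A minimal numpy sketch (the parameter names are illustrative, not from the paper):

```python
import numpy as np

def place_unit_box(sx, sy, tx, ty):
    """Warp a unit box (side 1, centered at the origin) into the image
    with a 2x3 affine transform, as the STN in the "where" module does.
    The affine parameters (sx, sy, tx, ty) are continuous, so gradients
    can flow through them; integer box indices could not."""
    A = np.array([[sx, 0.0, tx],
                  [0.0, sy, ty]])
    corners = np.array([[-0.5, -0.5, 1.0],   # homogeneous unit-box corners
                        [ 0.5, -0.5, 1.0],
                        [-0.5,  0.5, 1.0],
                        [ 0.5,  0.5, 1.0]])
    return corners @ A.T  # (4, 2) bbox corners in normalized image coords

# A box of width 0.2 and height 0.4, centered at (0.1, -0.3):
box = place_unit_box(sx=0.2, sy=0.4, tx=0.1, ty=-0.3)
```

In the actual network the same affine matrix would drive a grid-sampling warp so the discriminator sees the box rendered onto the semantic map.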
24
bbox
“WHERE” MODULE
Affine transform
25
[Diagram: STN places a bbox; tile and concat]
“WHERE” MODULE
26
[Diagram: input tiled and concatenated; STN places a bbox; discriminator real/fake loss]
“WHERE” MODULE
27
“WHERE” MODULE
Results with 100 different random vectors
28
[Diagram: tile and concat; STN places a bbox; real/fake loss (the latent input is ignored)]
“WHERE” MODULE
29
“WHERE” MODULE
30
“WHERE” MODULE
31
“WHERE” MODULE
Results with 100 different random vectors
32
“WHERE” MODULE
(annotation: samples z1 and z2 both give “lazy” placements)
33
“WHERE” MODULE
34
“WHERE” MODULE
[Diagram: two parallel paths with shared weights; each applies an STN to place a bbox, tiles and concatenates, and feeds the real/fake loss]
35
“WHERE” MODULE
[Diagram: a supervised path (encoder-decoder with reconstruction and supervision) and an unsupervised path, with shared weights; both place a bbox via the STN and share the real/fake adversarial loss]
36
(red: person, blue: car)
“WHERE” MODULE
Results with 100 different random vectors
37
“WHERE” MODULE
Results from epoch 0 to 30
38
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input.
MAIN GOAL
39
“WHAT” MODULE
40
“WHAT” MODULE
[Diagram: the output of the “where” module is tiled and concatenated with the input]
41
“WHAT” MODULE
[Diagram: a supervised path (encoder-decoder with supervision) and an unsupervised path with shared weights, built on top of the “where” module; real/fake adversarial loss]
42
OVERALL ARCHITECTURE
Forward pass
[Pipeline: input → bounding box prediction (affine transform of a unit box) → object shape generation → output]
43
OVERALL ARCHITECTURE
Backward pass for “where” loss
[Pipeline: input → bounding box prediction → object shape generation → output; the “where” discriminator provides the loss]
44
OVERALL ARCHITECTURE
Backward pass for “what” loss
[Pipeline: input → bounding box prediction → object shape generation → output; the “what” discriminator provides the loss]
45
“WHAT” MODULE
Fix “where”, change “what”
46
EXPERIMENTS
Input | Generated | Synthesized RGB (pix2pixHD)
47
EXPERIMENTS
Generated | Nearest Neighbor | Synthesized RGB (nearest-neighbor)
48
EXPERIMENTS
Input | Generated | Synthesized RGB (pix2pixHD)
49
EXPERIMENTS
Generated | Nearest Neighbor | Synthesized RGB (nearest-neighbor)
50
51
52
53
54
Ideal: 50%
Our result: 43%
USER STUDY
55
BASELINES
Baseline 1: input → encoder-decoder → generated object → result (real/fake).
Baseline 2: input → generator → generated object → encoder → STN → result (real/fake).
56
CONCLUSION
Learning Affordance in 2D
where are they? what do they look like?
PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR
ENVIRONMENTS
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
WHAT IS AFFORDANCE IN 3D?
• General definition:
➢ opportunities for interaction in the scene, i.e., what actions an object can be used for.
• Applications:
➢ Robot navigation
➢ Game development
Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)
The floor can be used for standing. The desk can be used for sitting.
AFFORDANCE IN 3D WORLD
• Given a single image of a 3D scene, generate plausible human poses in that scene.
?
LEARNING 3D AFFORDANCE
• Semantically plausible: the human should take common actions in the indoor environment.
How to define a “reasonable” human pose in indoor scenes?
• Physically stable: the human should be well supported by its surrounding objects.
LEARNING 3D AFFORDANCE
semantic knowledge + geometry knowledge → fuse
A data-driven way?
LEARNING 3D AFFORDANCE
• Stage I: Build a fully-automatic 3D pose synthesizer.
• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
LEARNING 3D AFFORDANCE
• Stage I: Build a fully-automatic 3D pose synthesizer.
• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Fusing semantic & geometry knowledge
[1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017
[2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR 2017.
The Sitcom [1] dataset. (no 3D annotations) The SUNCG [2] dataset. (no human poses)
Combine ?
semantic knowledge | geometry knowledge
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation | geometry adjustment
[Pipeline: input image → location heat map → generated poses → (domain adaptation) → mapped pose (image-to-voxel mapping) → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map
[Network: ResNet-18 encoder, convolution + deconvolution layers]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map generated poses
Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map generated poses
Domain adaptation
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
mapping from image to voxel
generated pose mapped pose
𝑑 = (𝐻 × 𝑓) / (𝐻𝑝 × 𝑟₃₂), with 𝐻 ~ 𝒩(1.7, 0.1)
[Figure: image-to-voxel mapping; 𝐻 is the sampled body height, 𝐻𝑝 the pose height in pixels]
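The depth formula above can be sketched numerically. This is an assumption-labeled illustration: H is the sampled body height in meters, f a focal length in pixels, H_p the pose height in pixels, and r a scale factor standing in for the slide's r₃₂; the concrete numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_depth(Hp_pixels, f, r=1.0):
    """Estimate the depth of a 2D pose via d = H * f / (H_p * r),
    sampling the real-world body height H ~ N(1.7 m, 0.1 m)."""
    H = rng.normal(1.7, 0.1)  # assumed human height in meters
    return H * f / (Hp_pixels * r)

# A 170-pixel-tall pose seen with a 500-pixel focal length
# lands roughly 5 m from the camera:
d = pose_depth(Hp_pixels=170.0, f=500.0)
```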
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
geometry adjustment
mapped pose adjusted pose
• Free space constraint [1]: no human body part may intersect with any object in the scene, such as furniture.
⨂
[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.
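As a sketch, the free space constraint reduces to an elementwise AND between the scene's occupancy grid and the voxels covered by the pose. This is a toy numpy version; the grid size and joint coordinates are made up for illustration.

```python
import numpy as np

def satisfies_free_space(scene_occ, joint_voxels):
    """Free-space check: the pose passes only if none of its joint
    voxels fall inside an occupied scene voxel."""
    pose_occ = np.zeros_like(scene_occ, dtype=bool)
    pose_occ[tuple(joint_voxels.T)] = True   # rasterize joints into the grid
    return not np.any(scene_occ & pose_occ)  # no overlap allowed

scene = np.zeros((4, 4, 4), dtype=bool)
scene[0, :, :] = True                        # bottom layer occupied (e.g., furniture)
joints = np.array([[2, 1, 1], [3, 2, 2]])    # two joints in free space
```

Here `satisfies_free_space(scene, joints)` passes; adding a joint inside the occupied layer would fail it.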
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
geometry adjustment
[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.
generated pose adjusted pose
Support constraint [2]. The human pose should be supported by a surface of surrounding objects (e.g., floor, bed).
[Figure: (a) generated pose, (b) pose in voxel, (c) scene voxel, (d) sittable surface, (e) positive response (⨂ with a 3D Gaussian), (f) adjusted 3D pose via sampling]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
• In total, we generate ~1.5 million poses in 13,774 scenes: 13,074 scenes for training the data-driven 3D affordance generative model and 700 scenes for testing.
Synthesized poses visualization
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework: scene image {𝐼} → where module (𝑧 ~ 𝒩(𝜇,𝜎)) → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′) → what module (𝑧 ~ 𝒩(𝜇,𝜎)) → 𝑝′; a discriminator scores real/fake with adversarial loss 𝐿𝑎𝑑𝑣]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• The where module predicts pose locations.
• It should support sampling at inference time.
• It should be conditioned on the scene image.
• Losses:
• 𝐿 = 𝜆𝐺𝐸𝑂·𝐿𝐺𝐸𝑂 + 𝜆𝐾𝐿𝐷·𝐿𝐾𝐿𝐷 + 𝜆𝑀𝑆𝐸·𝐿𝑀𝑆𝐸
• 𝐿𝑀𝑆𝐸 = (𝑥 − 𝑥′)² + (𝑦 − 𝑦′)² + (𝑑 − 𝑑′)² + (𝑝𝑐 − 𝑝𝑐′)²
• 𝐿𝐾𝐿𝐷 = KL[ 𝑄(𝑧 | 𝜇(𝑥, 𝑦, 𝑑, 𝐼), 𝜎(𝑥, 𝑦, 𝑑, 𝐼)) ‖ 𝒩(0, 1) ]
• 𝐿𝐺𝐸𝑂 = (𝑀𝑒𝑀𝑖(𝑥, 𝑦, 𝑑) − 𝑀𝑒𝑀𝑖(𝑥′, 𝑦′, 𝑑′))²
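The three loss terms can be sketched in numpy. This is a toy, assumption-labeled version: the weights are made up, and the real model computes these on batched tensors with learned geometry features.

```python
import numpy as np

def kld_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def where_loss(pred, target, mu, sigma, geo_pred, geo_target,
               lam_geo=1.0, lam_kld=0.1, lam_mse=1.0):
    """L = lam_geo * L_GEO + lam_kld * L_KLD + lam_mse * L_MSE.
    pred/target hold (x, y, d, p_c); geo_pred/geo_target are the
    geometry features evaluated at the two locations."""
    mse = np.sum((np.asarray(pred) - np.asarray(target)) ** 2)
    kld = kld_gaussian(np.asarray(mu), np.asarray(sigma))
    geo = np.sum((np.asarray(geo_pred) - np.asarray(geo_target)) ** 2)
    return lam_geo * geo + lam_kld * kld + lam_mse * mse

# Perfect reconstruction with a standard-normal posterior gives zero loss:
loss = where_loss((1.0, 2.0, 3.0, 0.9), (1.0, 2.0, 3.0, 0.9),
                  mu=np.zeros(8), sigma=np.ones(8),
                  geo_pred=np.array([0.5]), geo_target=np.array([0.5]))
```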
The where module
[Diagram: (𝑥, 𝑦, 𝑑, 𝑝𝑐) + {𝐼} → where module with 𝑧 ~ 𝒩(𝜇,𝜎) → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′); losses 𝐿𝑚𝑠𝑒, 𝐿𝑘𝑙𝑑]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• The what module predicts pose gestures.
• Losses:
• 𝐿 = 𝜆𝐺𝐸𝑂·𝐿𝐺𝐸𝑂 + 𝜆𝐾𝐿𝐷·𝐿𝐾𝐿𝐷 + 𝜆𝑀𝑆𝐸·𝐿𝑀𝑆𝐸
• 𝐿𝑀𝑆𝐸 = (𝑝 − 𝑝′)²
• 𝐿𝐾𝐿𝐷 = KL[ 𝑄(𝑧 | 𝜇(𝑥, 𝑦, 𝑑, 𝑝𝑐, 𝐼), 𝜎(𝑥, 𝑦, 𝑑, 𝑝𝑐, 𝐼)) ‖ 𝒩(0, 1) ]
• 𝐿𝐺𝐸𝑂 = (𝑀𝑒𝑀𝑖(𝑥ⱼ, 𝑦ⱼ, 𝑑ⱼ) − 𝑀𝑒𝑀𝑖(𝑥ⱼ′, 𝑦ⱼ′, 𝑑ⱼ′))²
The what module
[Diagram: 𝑝 + (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′, 𝐼) → what module with 𝑧 ~ 𝒩(𝜇,𝜎) → 𝑝′; losses 𝐿𝑚𝑠𝑒, 𝐿𝑘𝑙𝑑]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
Geometry-aware discriminator
[Diagram: supervised path: encoder-decoder with 𝑧 ~ 𝒩(𝜇,𝜎) reconstructs (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′); unsupervised path: decoder with 𝑧 ~ 𝒩(0,1); the discriminator takes the depth heat map and pose and outputs real/fake (𝐿𝑎𝑑𝑣)]
D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NIPS, 2018.
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram: {𝐼} → where module → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′) → what module → 𝑝′, each with 𝑧 ~ 𝒩(𝜇,𝜎)]
QUANTITATIVE RESULTS
• Use a pre-trained classifier to score the “plausibility” of generated poses.
Semantic score
[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017
[Classifier: pose → plausible? 0/1]
Method | Semantic Score (%)
Baseline [1] | 72.53
Ours (RGB) | 91.69
Ours (RGBD) | 91.14
Ours (Depth) | 89.86
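The semantic score itself is just the percentage of generated poses the pre-trained classifier marks plausible. A trivial sketch, assuming the classifier emits 0/1 labels:

```python
import numpy as np

def semantic_score(labels):
    """Percentage of generated poses labeled plausible (1) by the
    pre-trained classifier."""
    return 100.0 * np.mean(np.asarray(labels) == 1)

# Three of four poses judged plausible gives a score of 75.0:
score = semantic_score([1, 1, 0, 1])
```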
QUANTITATIVE RESULTS
• Use a pre-trained classifier to score the “plausibility” of generated poses.
• User study.
Semantic score
[User study, preference (%): GT 46.43 vs. ours 53.57; GT 74.45 vs. baseline 25.55; ours 72.36 vs. baseline 27.64]
(a) User study interface. (b) User study result.
QUANTITATIVE RESULTS
• Map each generated pose into the 3D scene voxel grid and check whether it satisfies the free-space and support constraints.
Geometry score
Metric (%) | Baseline [1] | Ours (RGB) | Ours (RGBD) | Ours (Depth)
geometry score | 23.25 | 66.40 | 71.17 | 72.11
QUALITATIVE RESULTS
[Figure: input image, generated poses, generated poses in scene voxel]
QUALITATIVE RESULTS
[Figure: input image, generated poses, generated poses in scene voxel]
FAILURE CASES
Poses that are not semantically plausible.
Poses that violate geometry rules in the 3D world.
CONCLUSION
• We propose a fully-automatic 3D human pose synthesizer that leverages pose distributions learned from the 2D world and physical feasibility extracted from the 3D world.
• We develop a generative model for 3D affordance prediction that generates plausible human poses with full 3D information from a single scene image.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
Locations for sitting poses Locations for standing poses
Locations for cars Locations for people
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
HIGHLIGHT
• Effective regularization by introducing adversarial training in the unsupervised path.
Parallel-path training
THANK YOU
Q&A
97
Transform a template object to fit in the input image
RELATED WORK
“Where” problem
Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018
98
?(side view car)
Limitation
1. A template has to be given (the method cannot generate templates).
2. Sometimes the template cannot be fit into the input scene by an affine transform alone.
RELATED WORK
“Where” problem
99
Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou, “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018
Add a new pedestrian at a target region
Limitation
The location and size of the pedestrian have to be given by the user
RELATED WORK
“What” problem
100
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018
Limitation
Layout prediction and image generation networks are not end-to-end trainable
RELATED WORK
“Where” and “What” problem