TRANSCRIPT
DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES
Sifei Liu, NVIDIA Research
Xueting Li, University of California, Merced
March 19, 2019
2
scene image segmentation human pose estimation
UNDERSTANDING SCENE AND HUMAN
semantic segmentation from cityscapes dataset pose estimation via OpenCV
3
instance placement human placement
CREATING SCENE OR HUMAN?
semantic segmentation from cityscapes dataset rendered scene from the SUNCG dataset
✘ ✘
4
LET’S MAKE IT MORE CHALLENGING!
shape synthesis
semantic segmentation from cityscapes dataset
5
?
LET’S MAKE IT MORE CHALLENGING!
shape synthesis
6
LET’S MAKE IT MORE CHALLENGING!
placement in the real world
videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation
7
WHAT IS AFFORDANCE?
Where are they?
scene image indoor environment
human car sitting standing
8
WHAT IS AFFORDANCE?
What do they look like?
scene image indoor environment
9
WHAT IS AFFORDANCE?
How do they interact with each other?
Input Image
Generated Poses
10
OUTLINE
Context-Aware Synthesis and Placement of Object Instances
NeurIPS 2018
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz
Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
CVPR 2019
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
11
QUIZ
Which object is a fake one?
12
SEQUENTIAL EDITING
Insert new objects one by one
13
Add a person
PROBLEM DEFINITION
Semantic map manipulation by inserting objects
14
WHY SEMANTIC MAP?
• Editing RGB image is difficult
Image 1 Image 2
Image-to-image translation, image editing, ...
15
WHY SEMANTIC MAP?
• We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world
Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017
Rendering
Semantic map
Visualization
16
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input
MAIN GOALS
17
“WHERE” MODULE
How can we learn where to put a new object?
18
“WHERE” MODULE
Pixel-wise annotation: almost impossible to get
p=0.2
p=0
p=0.8
19
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Object
20
Removed object
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
21
“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Inpainting
?
22
“WHERE” MODULE
Our approach: put a box and see if it is reasonable
Good
box
Bad
box
Why box?
1) We don’t want to care about the object shape for now.
2) All objects can be covered by a bounding box.
23
“WHERE” MODULE
How to put a box?
Unit box
Affine transform
Why not use (x, y, w, h) directly?
Placing a box via discrete indices is not differentiable; the affine parameters are continuous, so gradients can flow through them.
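The differentiability argument can be sketched in a few lines: instead of indexing with (x, y, w, h), the network predicts affine parameters that warp a unit box, exactly as a spatial transformer would. A minimal numpy sketch (the parameter names are illustrative, not from the paper):

```python
import numpy as np

def place_unit_box(sx, sy, tx, ty):
    """Warp a unit box (side 1, centered at the origin) into the image
    with a 2x3 affine transform, as the STN in the "where" module does.
    The affine parameters (sx, sy, tx, ty) are continuous, so gradients
    can flow through them; integer box indices could not."""
    A = np.array([[sx, 0.0, tx],
                  [0.0, sy, ty]])
    corners = np.array([[-0.5, -0.5, 1.0],   # homogeneous unit-box corners
                        [ 0.5, -0.5, 1.0],
                        [-0.5,  0.5, 1.0],
                        [ 0.5,  0.5, 1.0]])
    return corners @ A.T  # (4, 2) bbox corners in normalized image coords

# A box of width 0.2 and height 0.4, centered at (0.1, -0.3):
box = place_unit_box(sx=0.2, sy=0.4, tx=0.1, ty=-0.3)
```

In the actual network the same affine matrix would drive a grid-sampling warp so the discriminator sees the box rendered onto the semantic map.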
24
bbox
“WHERE” MODULE
Affine transform
25
[Diagram: STN places a bbox; tile and concat]
“WHERE” MODULE
26
[Diagram: input tiled and concatenated; STN places a bbox; discriminator real/fake loss]
“WHERE” MODULE
27
“WHERE” MODULE
Results with 100 different random vectors
28
[Diagram: tile and concat; STN places a bbox; real/fake loss (the latent input is ignored)]
“WHERE” MODULE
29
“WHERE” MODULE
30
“WHERE” MODULE
31
“WHERE” MODULE
Results with 100 different random vectors
32
“WHERE” MODULE
(annotation: samples z1 and z2 both give “lazy” placements)
33
“WHERE” MODULE
34
“WHERE” MODULE
[Diagram: two parallel paths with shared weights; each applies an STN to place a bbox, tiles and concatenates, and feeds the real/fake loss]
35
“WHERE” MODULE
[Diagram: a supervised path (encoder-decoder with reconstruction and supervision) and an unsupervised path, with shared weights; both place a bbox via the STN and share the real/fake adversarial loss]
36
(red: person, blue: car)
“WHERE” MODULE
Results with 100 different random vectors
37
“WHERE” MODULE
Results from epoch 0 to 30
38
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input.
MAIN GOAL
39
“WHAT” MODULE
40
“WHAT” MODULE
[Diagram: the output of the “where” module is tiled and concatenated with the input]
41
“WHAT” MODULE
[Diagram: a supervised path (encoder-decoder with supervision) and an unsupervised path with shared weights, built on top of the “where” module; real/fake adversarial loss]
42
OVERALL ARCHITECTURE
Forward pass
[Pipeline: input → bounding box prediction (affine transform of a unit box) → object shape generation → output]
43
OVERALL ARCHITECTURE
Backward pass for “where” loss
[Pipeline: input → bounding box prediction → object shape generation → output; the “where” discriminator provides the loss]
44
OVERALL ARCHITECTURE
Backward pass for “what” loss
[Pipeline: input → bounding box prediction → object shape generation → output; the “what” discriminator provides the loss]
45
“WHAT” MODULE
Fix “where”, change “what”
46
EXPERIMENTS
Input | Generated | Synthesized RGB (pix2pixHD)
47
EXPERIMENTS
Generated | Nearest Neighbor | Synthesized RGB (nearest-neighbor)
48
EXPERIMENTS
Input | Generated | Synthesized RGB (pix2pixHD)
49
EXPERIMENTS
Generated | Nearest Neighbor | Synthesized RGB (nearest-neighbor)
50
51
52
53
54
Ideal: 50%
Our result: 43%
USER STUDY
55
BASELINES
Baseline 1: input → encoder-decoder → generated object → result (real/fake).
Baseline 2: input → generator → generated object → encoder → STN → result (real/fake).
56
CONCLUSION
Learning Affordance in 2D
where are they? what do they look like?
PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR
ENVIRONMENTS
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
WHAT IS AFFORDANCE IN 3D?
• General definition:
➢ opportunities for interaction in the scene, i.e., what actions an object can be used for.
• Applications:
➢ Robot navigation
➢ Game development
Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)
The floor can be used for standing. The desk can be used for sitting.
AFFORDANCE IN 3D WORLD
• Given a single image of a 3D scene, generate plausible human poses in that scene.
?
LEARNING 3D AFFORDANCE
• Semantically plausible: the human should take common actions in the indoor environment.
How to define a “reasonable” human pose in indoor scenes?
• Physically stable: the human should be well supported by its surrounding objects.
LEARNING 3D AFFORDANCE
semantic knowledge + geometry knowledge → fuse
A data-driven way?
LEARNING 3D AFFORDANCE
• Stage I: Build a fully-automatic 3D pose synthesizer.
• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
LEARNING 3D AFFORDANCE
• Stage I: Build a fully-automatic 3D pose synthesizer.
• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Fusing semantic & geometry knowledge
[1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017
[2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR 2017.
The Sitcom [1] dataset. (no 3D annotations) The SUNCG [2] dataset. (no human poses)
Combine ?
semantic knowledge | geometry knowledge
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation | geometry adjustment
[Pipeline: input image → location heat map → generated poses → (domain adaptation) → mapped pose (image-to-voxel mapping) → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map
[Network: ResNet-18 encoder, convolution + deconvolution layers]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map generated poses
Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
semantic knowledge adaptation
input image location heat map generated poses
Domain adaptation
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
mapping from image to voxel
generated pose mapped pose
𝑑 = (𝐻 × 𝑓) / (𝐻𝑝 × 𝑟₃₂), with 𝐻 ~ 𝒩(1.7, 0.1)
[Figure: image-to-voxel mapping; 𝐻 is the sampled body height, 𝐻𝑝 the pose height in pixels]
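The depth formula above can be sketched numerically. This is an assumption-labeled illustration: H is the sampled body height in meters, f a focal length in pixels, H_p the pose height in pixels, and r a scale factor standing in for the slide's r₃₂; the concrete numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_depth(Hp_pixels, f, r=1.0):
    """Estimate the depth of a 2D pose via d = H * f / (H_p * r),
    sampling the real-world body height H ~ N(1.7 m, 0.1 m)."""
    H = rng.normal(1.7, 0.1)  # assumed human height in meters
    return H * f / (Hp_pixels * r)

# A 170-pixel-tall pose seen with a 500-pixel focal length
# lands roughly 5 m from the camera:
d = pose_depth(Hp_pixels=170.0, f=500.0)
```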
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
geometry adjustment
mapped pose adjusted pose
• Free space constraint [1]: no human body part may intersect with any object in the scene, such as furniture.
⨂
[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.
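As a sketch, the free space constraint reduces to an elementwise AND between the scene's occupancy grid and the voxels covered by the pose. This is a toy numpy version; the grid size and joint coordinates are made up for illustration.

```python
import numpy as np

def satisfies_free_space(scene_occ, joint_voxels):
    """Free-space check: the pose passes only if none of its joint
    voxels fall inside an occupied scene voxel."""
    pose_occ = np.zeros_like(scene_occ, dtype=bool)
    pose_occ[tuple(joint_voxels.T)] = True   # rasterize joints into the grid
    return not np.any(scene_occ & pose_occ)  # no overlap allowed

scene = np.zeros((4, 4, 4), dtype=bool)
scene[0, :, :] = True                        # bottom layer occupied (e.g., furniture)
joints = np.array([[2, 1, 1], [3, 2, 2]])    # two joints in free space
```

Here `satisfies_free_space(scene, joints)` passes; adding a joint inside the occupied layer would fail it.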
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
geometry adjustment
[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.
generated pose adjusted pose
Support constraint [2]. The human pose should be supported by a surface of surrounding objects (e.g., floor, bed).
[Figure: (a) generated pose, (b) pose in voxel, (c) scene voxel, (d) sittable surface, (e) positive response (⨂ with a 3D Gaussian), (f) adjusted 3D pose via sampling]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
[Pipeline: input image → location heat map → generated poses → mapped pose → adjusted pose]
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
• In total, we generate ~1.5 million poses in 13,774 scenes: 13,074 scenes for training the data-driven 3D affordance generative model and 700 scenes for testing.
Synthesized poses visualization
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework: scene image {𝐼} → where module (𝑧 ~ 𝒩(𝜇,𝜎)) → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′) → what module (𝑧 ~ 𝒩(𝜇,𝜎)) → 𝑝′; a discriminator scores real/fake with adversarial loss 𝐿𝑎𝑑𝑣]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• The where module predicts pose locations.
• It should support sampling at inference time.
• It should be conditioned on the scene image.
• Losses:
• 𝐿 = 𝜆𝐺𝐸𝑂·𝐿𝐺𝐸𝑂 + 𝜆𝐾𝐿𝐷·𝐿𝐾𝐿𝐷 + 𝜆𝑀𝑆𝐸·𝐿𝑀𝑆𝐸
• 𝐿𝑀𝑆𝐸 = (𝑥 − 𝑥′)² + (𝑦 − 𝑦′)² + (𝑑 − 𝑑′)² + (𝑝𝑐 − 𝑝𝑐′)²
• 𝐿𝐾𝐿𝐷 = KL[ 𝑄(𝑧 | 𝜇(𝑥, 𝑦, 𝑑, 𝐼), 𝜎(𝑥, 𝑦, 𝑑, 𝐼)) ‖ 𝒩(0, 1) ]
• 𝐿𝐺𝐸𝑂 = (𝑀𝑒𝑀𝑖(𝑥, 𝑦, 𝑑) − 𝑀𝑒𝑀𝑖(𝑥′, 𝑦′, 𝑑′))²
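The three loss terms can be sketched in numpy. This is a toy, assumption-labeled version: the weights are made up, and the real model computes these on batched tensors with learned geometry features.

```python
import numpy as np

def kld_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def where_loss(pred, target, mu, sigma, geo_pred, geo_target,
               lam_geo=1.0, lam_kld=0.1, lam_mse=1.0):
    """L = lam_geo * L_GEO + lam_kld * L_KLD + lam_mse * L_MSE.
    pred/target hold (x, y, d, p_c); geo_pred/geo_target are the
    geometry features evaluated at the two locations."""
    mse = np.sum((np.asarray(pred) - np.asarray(target)) ** 2)
    kld = kld_gaussian(np.asarray(mu), np.asarray(sigma))
    geo = np.sum((np.asarray(geo_pred) - np.asarray(geo_target)) ** 2)
    return lam_geo * geo + lam_kld * kld + lam_mse * mse

# Perfect reconstruction with a standard-normal posterior gives zero loss:
loss = where_loss((1.0, 2.0, 3.0, 0.9), (1.0, 2.0, 3.0, 0.9),
                  mu=np.zeros(8), sigma=np.ones(8),
                  geo_pred=np.array([0.5]), geo_target=np.array([0.5]))
```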
The where module
[Diagram: (𝑥, 𝑦, 𝑑, 𝑝𝑐) + {𝐼} → where module with 𝑧 ~ 𝒩(𝜇,𝜎) → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′); losses 𝐿𝑚𝑠𝑒, 𝐿𝑘𝑙𝑑]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• The what module predicts pose gestures.
• Losses:
• 𝐿 = 𝜆𝐺𝐸𝑂·𝐿𝐺𝐸𝑂 + 𝜆𝐾𝐿𝐷·𝐿𝐾𝐿𝐷 + 𝜆𝑀𝑆𝐸·𝐿𝑀𝑆𝐸
• 𝐿𝑀𝑆𝐸 = (𝑝 − 𝑝′)²
• 𝐿𝐾𝐿𝐷 = KL[ 𝑄(𝑧 | 𝜇(𝑥, 𝑦, 𝑑, 𝑝𝑐, 𝐼), 𝜎(𝑥, 𝑦, 𝑑, 𝑝𝑐, 𝐼)) ‖ 𝒩(0, 1) ]
• 𝐿𝐺𝐸𝑂 = (𝑀𝑒𝑀𝑖(𝑥ⱼ, 𝑦ⱼ, 𝑑ⱼ) − 𝑀𝑒𝑀𝑖(𝑥ⱼ′, 𝑦ⱼ′, 𝑑ⱼ′))²
The what module
[Diagram: 𝑝 + (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′, 𝐼) → what module with 𝑧 ~ 𝒩(𝜇,𝜎) → 𝑝′; losses 𝐿𝑚𝑠𝑒, 𝐿𝑘𝑙𝑑]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram, as above]
A DATA-DRIVEN 3D POSE PREDICTION MODEL
Geometry-aware discriminator
[Diagram: supervised path: encoder-decoder with 𝑧 ~ 𝒩(𝜇,𝜎) reconstructs (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′); unsupervised path: decoder with 𝑧 ~ 𝒩(0,1); the discriminator takes the depth heat map and pose and outputs real/fake (𝐿𝑎𝑑𝑣)]
D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NIPS, 2018.
A DATA-DRIVEN 3D POSE PREDICTION MODEL
• Our 3D pose prediction model includes a where module that predicts pose locations and a what module that predicts pose gestures.
Framework
[Framework diagram: {𝐼} → where module → (𝑥′, 𝑦′, 𝑑′, 𝑝𝑐′) → what module → 𝑝′, each with 𝑧 ~ 𝒩(𝜇,𝜎)]
QUANTITATIVE RESULTS
• Use a pre-trained classifier to score the “plausibility” of generated poses.
Semantic score
[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017
[Classifier: pose → plausible? 0/1]
Method | Semantic Score (%)
Baseline [1] | 72.53
Ours (RGB) | 91.69
Ours (RGBD) | 91.14
Ours (Depth) | 89.86
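The semantic score itself is just the percentage of generated poses the pre-trained classifier marks plausible. A trivial sketch, assuming the classifier emits 0/1 labels:

```python
import numpy as np

def semantic_score(labels):
    """Percentage of generated poses labeled plausible (1) by the
    pre-trained classifier."""
    return 100.0 * np.mean(np.asarray(labels) == 1)

# Three of four poses judged plausible gives a score of 75.0:
score = semantic_score([1, 1, 0, 1])
```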
QUANTITATIVE RESULTS
• Use a pre-trained classifier to score the “plausibility” of generated poses.
• User study.
Semantic score
[User study, preference (%): GT 46.43 vs. ours 53.57; GT 74.45 vs. baseline 25.55; ours 72.36 vs. baseline 27.64]
(a) User study interface. (b) User study result.
QUANTITATIVE RESULTS
• Map each generated pose into the 3D scene voxel grid and check whether it satisfies the free-space and support constraints.
Geometry score
Metric (%) | Baseline [1] | Ours (RGB) | Ours (RGBD) | Ours (Depth)
geometry score | 23.25 | 66.40 | 71.17 | 72.11
QUALITATIVE RESULTS
[Figure: input image, generated poses, generated poses in scene voxel]
QUALITATIVE RESULTS
[Figure: input image, generated poses, generated poses in scene voxel]
FAILURE CASES
Poses that are not semantically plausible.
Poses that violate geometry rules in the 3D world.
CONCLUSION
• We propose a fully-automatic 3D human pose synthesizer that leverages pose distributions learned from the 2D world and physical feasibility extracted from the 3D world.
• We develop a generative model for 3D affordance prediction that generates plausible human poses with full 3D information from a single scene image.
[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where & what]
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
Locations for sitting poses Locations for standing poses
Locations for cars Locations for people
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
HIGHLIGHT
• Where
• What
• Interaction
Three aspects of affordance modeling
HIGHLIGHT
• Effective regularization by introducing adversarial training in the unsupervised path.
Parallel-path training
THANK YOU
Q&A
97
Transform a template object to fit in the input image
RELATED WORK
“Where” problem
Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018
98
?(side view car)
Limitation
1. A template has to be given (the method cannot generate templates).
2. Sometimes the template cannot be fit into the input scene by an affine transform alone.
RELATED WORK
“Where” problem
99
Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou, “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018
Add a new pedestrian at a target region
Limitation
The location and size of the pedestrian have to be given by the user
RELATED WORK
“What” problem
100
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018
Limitation
Layout prediction and image generation networks are not end-to-end trainable
RELATED WORK
“Where” and “What” problem