DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES Sifei Liu, NVIDIA Research Xueting Li, University of California, Merced March 19, 2019


UNDERSTANDING SCENE AND HUMAN

Scene image segmentation: semantic segmentation from the Cityscapes dataset

Human pose estimation: pose estimation via OpenCV

CREATING SCENE OR HUMAN?

Instance placement: semantic segmentation from the Cityscapes dataset ✘

Human placement: rendered scene from the SUNCG dataset ✘

LET’S MAKE IT MORE CHALLENGING!

Shape synthesis (semantic segmentation from the Cityscapes dataset)

Placement in the real world (videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation)

WHAT IS AFFORDANCE?

Where are they? (scene image: human, car; indoor environment: sitting, standing)

What do they look like? (scene image; indoor environment)

How do they interact with the others? (input image; generated poses)

OUTLINE

Context-Aware Synthesis and Placement of Object Instances, NeurIPS 2018

Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments, CVPR 2019

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

QUIZ

Which object is a fake one?

SEQUENTIAL EDITING

Insert new objects one by one

PROBLEM DEFINITION

Semantic map manipulation by inserting objects (example: add a person)

WHY SEMANTIC MAP?

• Editing an RGB image is difficult (Image 1, Image 2: image-to-image translation, image editing, ...)

• We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world (diagram labels: rendering, semantic map, visualization)

Image is from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017

MAIN GOALS

1. Learn “where” and “what” jointly

2. End-to-end trainable network

3. Diverse outputs given the same input

“WHERE” MODULE

How can we learn where to put a new object?

• Pixel-wise annotation: almost impossible to get (example per-location values in the figure: p=0.2, p=0, p=0.8)

• Existing objects: need to remove and inpaint objects (figure labels: object, removed object, inpainting, “?”)

“WHERE” MODULE

Our approach: put a box and see if it is reasonable (figure: a good box and a bad box)

Why a box?

1) We don’t want to care about the object shape for now.

2) All objects can be covered by a bounding box.

“WHERE” MODULE

How to put a box? Apply an affine transform to a unit box.

Why not use (x, y, w, h) directly?

Placing a box by indexing pixels is not differentiable.
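The contrast between indexing and an affine transform can be illustrated with a minimal sketch. It assumes spatial-transformer-style bilinear sampling with zero padding; the slides only name an affine transform and an STN. The names `place_box` and `box_to_theta`, the 8x8 unit box, and the normalized [-1, 1] coordinate convention are all hypothetical choices made for this illustration, not the authors' code.

```python
import math

def bilinear_sample(img, x, y):
    """Sample img (list of rows) at continuous pixel coords (x, y); zero outside."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                value += (1 - abs(x - xi)) * (1 - abs(y - yi)) * img[yi][xi]
    return value

def box_to_theta(cx, cy, half_w, half_h):
    """Affine parameters mapping output coords to unit-box coords (hypothetical helper)."""
    return (1.0 / half_w, 0.0, -cx / half_w, 0.0, 1.0 / half_h, -cy / half_h)

def place_box(theta, out_h, out_w, src_size=8):
    """Render a soft box mask by sampling an all-ones unit box through an affine map."""
    unit_box = [[1.0] * src_size for _ in range(src_size)]
    a, b, tx, c, d, ty = theta
    mask = []
    for i in range(out_h):
        v = -1 + 2 * (i + 0.5) / out_h          # normalized output row coordinate
        row = []
        for j in range(out_w):
            u = -1 + 2 * (j + 0.5) / out_w      # normalized output column coordinate
            xs = a * u + b * v + tx             # source coords in [-1, 1]
            ys = c * u + d * v + ty
            px = (xs + 1) * src_size / 2 - 0.5  # source coords in pixels
            py = (ys + 1) * src_size / 2 - 0.5
            row.append(bilinear_sample(unit_box, px, py))
        mask.append(row)
    return mask

centered = place_box(box_to_theta(0.0, 0.0, 0.5, 0.5), 16, 16)
shifted = place_box(box_to_theta(0.03, 0.0, 0.5, 0.5), 16, 16)
```

Because every mask value is a piecewise-linear function of the six affine parameters, a small shift of the box center changes the edge pixels gradually instead of flipping them, which is what lets a gradient flow back to the box parameters.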

“WHERE” MODULE

(Diagram labels: tile, concat, affine transform, STN, bbox)
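The “tile” and “concat” labels in the diagram can be read as the common trick of replicating a vector over the spatial grid and stacking it onto a feature map as extra channels. The sketch below assumes that reading; the function name `tile_and_concat`, the channel-first nested-list layout, and the idea that the tiled vector is a random code are assumptions for illustration only.

```python
def tile_and_concat(feature_map, z):
    """Append one constant channel per entry of z to a [C][H][W] feature map.

    feature_map: nested lists indexed as [channel][row][col].
    z: list of floats (assumed here to be a random vector).
    Returns a new [C + len(z)][H][W] nested list.
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Each entry of z becomes a full H x W plane filled with that value.
    tiled = [[[zk] * width for _ in range(height)] for zk in z]
    return feature_map + tiled

features = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # 2 channels, 3 x 4
combined = tile_and_concat(features, [0.5, -1.0, 2.0])
```

Tiling keeps the vector available at every spatial location, so later convolution layers can condition on it everywhere in the image.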

“WHERE” MODULE

(Diagram labels: tile, concat, STN, bbox, real, fake, real/fake loss)
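The slides do not state which form the real/fake loss takes. As a minimal sketch, assuming the standard binary cross-entropy GAN objective with the non-saturating generator loss, it could look like the following; `discriminator_loss` and `generator_loss` are illustrative names, and the inputs are discriminator outputs already squashed to probabilities.

```python
import math

def _bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability, clamped for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """Push scores on real box placements toward 1 and on generated ones toward 0."""
    return _mean(_bce(p, 1.0) for p in d_real) + _mean(_bce(p, 0.0) for p in d_fake)

def generator_loss(d_fake):
    """Non-saturating generator loss: reward placements the discriminator calls real."""
    return _mean(_bce(p, 1.0) for p in d_fake)
```

Under this assumed objective, a box that the discriminator finds implausible yields a large generator loss, which is the signal that teaches the generator where a box is reasonable.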

“WHERE” MODULE

Results with 100 different random vectors

(Diagram labels: tile, concat, STN, bbox, real, fake, real/fake loss, “Ignored”)

(Diagram labels: z1, z2, “Lazy”)

“WHERE” MODULE

(Diagram labels: supervised path, unsupervised path, two branches marked “(shared)”, tile, concat, STN, bbox, real, fake, real/fake loss, encoder-decoder + reconstruct + supervision)

“WHERE” MODULE

Results with 100 different random vectors (red: person, blue: car)

Results from epoch 0 to 30

MAIN GOALS

1. Learn “where” and “what” jointly

2. End-to-end trainable network

3. Diverse outputs given the same input

“WHAT” MODULE

(Diagram labels: “where” module, tile, concat, supervised path, unsupervised path, “(shared)”, real, fake, encoder-decoder + supervision)

OVERALL ARCHITECTURE

Forward pass (diagram labels: input, bounding box prediction, unit box, affine, object shape generation, output)

Backward pass for “where” loss (adds the “where” discriminator)

Backward pass for “what” loss (adds the “what” discriminator)

“WHAT” MODULE

Fix “where”, change “what”

EXPERIMENTS

Panels: input, generated, synthesized RGB (pix2pixHD)

Panels: generated, nearest neighbor, synthesized RGB (nearest-neighbor)

USER STUDY

Ideal: 50%

Our result: 43%

BASELINES

Baseline 1 (diagram labels: input, encoder-decoder, generated object, result, real)

Baseline 2 (diagram labels: input, generator, generated object, encoder, STN, result, real)

CONCLUSION

Learning Affordance in 2D

Where are they? What do they look like?

PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

WHAT IS AFFORDANCE IN 3D?

• General definition:

➢ Opportunities for interaction in the scene, i.e., what actions an object can be used for.

• Applications:

➢ Robot navigation

➢ Game development

Examples: The floor can be used for standing. The desk can be used for sitting.

Image credit: David F. Fouhey et al., In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)

AFFORDANCE IN 3D WORLD

• Given a single image of a 3D scene, generate reasonable human poses in the 3D scene.

LEARNING 3D AFFORDANCE

How to define a “reasonable” human pose in indoor scenes?

• Semantically plausible: the human should take common actions in an indoor environment.

• Physically stable: the human should be well supported by its surrounding objects.

Can semantic knowledge and geometry knowledge be fused in a data-driven way?

LEARNING 3D AFFORDANCE

• Stage I: Build a fully automatic 3D pose synthesizer.

• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.

(Diagram labels: semantic knowledge, geometry knowledge, pose synthesizer, where, what)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Fusing semantic & geometry knowledge

Semantic knowledge: the Sitcom [1] dataset (no 3D annotations)

Geometry knowledge: the SUNCG [2] dataset (no human poses)

Combine?

[1] X. Wang, R. Girdhar, and A. Gupta. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017.

[2] S. Song, F. Yu, A. Zeng, et al. Semantic scene completion from a single depth image. CVPR, 2017.

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Pipeline stages: input image, location heat map, generated poses, mapped pose, adjusted pose

Steps: semantic knowledge adaptation (domain adaptation), mapping from image to voxel, geometry adjustment

(Axis labels in the figure: X, Y, Z; x, y; U, V, W)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Semantic knowledge adaptation

Input image to location heat map (diagram labels: ResNet-18, convolution, deconvolution)

Location heat map to generated poses (Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR, 2017)

Domain adaptation

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Mapping from image to voxel

Generated pose to mapped pose

$d = \dfrac{H \times f}{H_p \times r32}$

$H \sim \mathcal{N}(1.7, 0.1)$

(Figure labels: X, Y, Z; x, y; U, V, W; $H_p$; $H$)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Geometry adjustment

Mapped pose to adjusted pose

• Free space constraint [1]: No human body part may intersect any object in the scene, such as furniture.

[1] A. Gupta et al. From 3D scene geometry to human workspace. CVPR, 2011.
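A free space test of this kind can be sketched on a voxel grid. The version below is only an approximation under stated assumptions: the scene is a set of occupied integer voxel coordinates, body parts are approximated by straight segments between joints, and intersection is detected by sampling points along each segment. The names `violates_free_space`, `joints`, and `limbs` are hypothetical, and the slides do not describe how the authors implement the check.

```python
import math

def violates_free_space(occupied, joints, limbs, samples=10):
    """Return True if any sampled point on any limb falls inside an occupied voxel.

    occupied: set of (u, v, w) integer voxel coordinates filled by scene objects.
    joints:   list of (u, v, w) joint positions in continuous voxel units.
    limbs:    list of (i, j) index pairs naming which joints a body part connects.
    """
    for start, end in limbs:
        pa, pb = joints[start], joints[end]
        for step in range(samples + 1):
            t = step / samples
            # Interpolate along the limb and snap to the containing voxel.
            cell = tuple(
                int(math.floor(pa[k] + t * (pb[k] - pa[k]))) for k in range(3)
            )
            if cell in occupied:
                return True
    return False

furniture = {(5, 5, 5)}
clear_pose = [(0.5, 0.5, 0.5), (0.5, 0.5, 3.5)]
colliding_pose = [(5.2, 5.2, 4.0), (5.2, 5.2, 6.0)]
```

A denser sampling, or a volumetric body model instead of line segments, would tighten the approximation at the cost of more lookups.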

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Geometry adjustment

Generated pose to adjusted pose

• Support constraint [1]: The human pose should be supported by a surface of surrounding objects (e.g., floor, bed).

Figure panels: (a) generated pose, (b) pose in voxel, (c) scene voxel, (d) sittable surface, (e) positive response, (f) adjusted 3D pose. Figure label: 3D Gaussian sampling.

[1] A. Gupta et al. From 3D scene geometry to human workspace. CVPR, 2011.
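One simple way to test a support constraint on a voxel grid is to require an occupied voxel directly beneath each contact point of the pose. The sketch below assumes exactly that rule. Which joints count as contact points (for example feet when standing, hips when sitting), the `up_axis` convention, and the name `is_supported` are assumptions for illustration; the 3D Gaussian sampling step named in the figure is not reproduced here.

```python
import math

def is_supported(occupied, contact_points, up_axis=2):
    """Return True if every contact point rests on an occupied voxel.

    occupied:       set of (u, v, w) integer voxel coordinates filled by the scene.
    contact_points: list of (u, v, w) positions in continuous voxel units.
    up_axis:        index of the vertical axis (assumed to be the third one).
    """
    for point in contact_points:
        cell = [int(math.floor(c)) for c in point]
        cell[up_axis] -= 1  # the voxel immediately below the contact point
        if tuple(cell) not in occupied:
            return False
    return True

floor = {(x, y, 0) for x in range(4) for y in range(4)}
standing_feet = [(1.5, 1.5, 1.2), (2.5, 1.5, 1.0)]
floating_feet = [(1.5, 1.5, 2.2)]
```

Combined with a free space test, this gives a binary accept or reject decision per candidate pose, which an adjustment step can use when searching nearby positions.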

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

• In total, we generate ~1.5 million poses in 13,774 scenes. We use 13,074 scenes to train the data-driven 3D affordance generative model and 700 scenes for testing.

Figure: visualization of synthesized poses

A DATA-DRIVEN 3D POSE PREDICTION MODEL

Framework

• Our 3D pose prediction model includes a where module for pose location prediction and a what module for pose gesture prediction.

(Diagram labels: $\{I\}$; where module with $(x, y, d), p_c$ and $z \sim \mathcal{N}(\mu, \sigma)$; what module with $p$, $p'$, and $z \sim \mathcal{N}(\mu, \sigma)$; discriminator with real and fake inputs, $(x', y', d'), p_c$, and $L_{adv}$)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

The where module

• The where module predicts pose locations.

• It should be able to sample during inference.

• It should be conditioned on the scene image.

• Losses:

$L = \lambda_{GEO} L_{GEO} + \lambda_{KLD} L_{KLD} + \lambda_{MSE} L_{MSE}$

$L_{MSE} = (x - x')^2 + (y - y')^2 + (d - d')^2 + (p_c - p_c')^2$

$L_{KLD} = \mathrm{KL}\left[\, Q(z \mid \mu(x, y, d, I), \sigma(x, y, d, I)) \,\|\, \mathcal{N}(0, 1) \,\right]$

$L_{GEO} = \left( M_e M_i (x, y, d) - M_e M_i (x', y', d') \right)^2$

(Diagram labels: $(x, y, d), p_c$; $z \sim \mathcal{N}(\mu, \sigma)$; $(x', y', d'), p_c'$; $\{I\}$; where module; $L_{mse}$; $L_{kld}$)
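The three loss terms of the where module can be combined as in the following minimal sketch. Several points are assumptions rather than details from the slides: $\sigma$ is treated as a per-dimension standard deviation, the KL term uses the closed form for a diagonal Gaussian against $\mathcal{N}(0, 1)$, the callable `to_3d` is a hypothetical stand-in for the $M_e M_i$ mapping, and the default weights of 1.0 are placeholders.

```python
import math

def mse_loss(target, pred):
    """Sum of squared differences, e.g. over (x, y, d, p_c)."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))

def kld_loss(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), with sigma as standard deviations."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def geo_loss(to_3d, target_xyd, pred_xyd):
    """Squared distance after mapping both locations with the same transform."""
    mapped_target = to_3d(target_xyd)
    mapped_pred = to_3d(pred_xyd)
    return sum((a - b) ** 2 for a, b in zip(mapped_target, mapped_pred))

def total_loss(l_geo, l_kld, l_mse, lam_geo=1.0, lam_kld=1.0, lam_mse=1.0):
    """Weighted sum L = lam_geo * L_GEO + lam_kld * L_KLD + lam_mse * L_MSE."""
    return lam_geo * l_geo + lam_kld * l_kld + lam_mse * l_mse

identity = lambda v: v  # placeholder for the M_e M_i mapping
example = total_loss(
    geo_loss(identity, (0.0, 0.0, 1.0), (0.0, 0.0, 3.0)),
    kld_loss([1.0], [1.0]),
    mse_loss((1.0, 2.0, 3.0, 0.5), (1.0, 2.0, 4.0, 0.5)),
    lam_geo=0.5, lam_kld=2.0, lam_mse=1.0,
)
```

The KL term is zero only when the encoder outputs $\mu = 0$ and $\sigma = 1$, which is what keeps the learned latent distribution close to the prior used for sampling at inference.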


A DATA-DRIVEN 3D POSE PREDICTION MODEL

The what module

• The what module predicts pose gestures.

• Losses:

$L = \lambda_{GEO} L_{GEO} + \lambda_{KLD} L_{KLD} + \lambda_{MSE} L_{MSE}$

$L_{MSE} = (p - p')^2$

$L_{KLD} = \mathrm{KL}\left[\, Q(z \mid \mu(x, y, d, p_c, I), \sigma(x, y, d, p_c, I)) \,\|\, \mathcal{N}(0, 1) \,\right]$

$L_{GEO} = \left( M_e M_i (x_j, y_j, d_j) - M_e M_i (x_j', y_j', d_j') \right)^2$

(Diagram labels: $p$; $(x', y', d'), p_c', I$; $z \sim \mathcal{N}(\mu, \sigma)$; $p'$; what module; $L_{mse}$; $L_{kld}$)
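The latent variable $z \sim \mathcal{N}(\mu, \sigma)$ in these modules is usually drawn with the reparameterization trick during training, and from the prior when no encoder input is available. The sketch below assumes that standard variational autoencoder practice; the function names and the explicit `rng` argument are illustrative and not taken from the slides.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps z a differentiable function of
    mu and sigma, because the randomness lives only in eps.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def sample_prior(dim, rng):
    """Draw z ~ N(0, 1) per dimension, as used when sampling without an encoder."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)
deterministic = sample_latent([1.0, 2.0], [0.0, 0.0], rng)
prior_sample = sample_prior(4, rng)
```

Drawing several prior samples for the same scene image is what yields diverse poses from a single input.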


A DATA-DRIVEN 3D POSE PREDICTION MODEL

Geometry-aware discriminator

(Diagram labels: supervised path with encoder-decoder pairs, inputs $(x, y, d), p_c$, $p$, and $I$, and $z \sim \mathcal{N}(\mu, \sigma)$; unsupervised path with decoders and $z \sim \mathcal{N}(0, 1)$; outputs $(x', y', d'), p_c'$ and $p'$; depth heat map; real, fake, $L_{adv}$)

D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.

Framework

(Diagram labels: $\{I\}$; where module with $z \sim \mathcal{N}(\mu, \sigma)$ and output $(x', y', d'), p_c'$; what module with $z \sim \mathcal{N}(\mu, \sigma)$ and output $p'$)

QUANTITATIVE RESULTS

Semantic score

• Using a pre-trained classifier to score the “plausibility” of generated poses (classifier output: 0/1, plausible?).

Method          Semantic score (%)
Baseline [1]    72.53
Ours (RGB)      91.69
Ours (RGBD)     91.14
Ours (Depth)    89.86

[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR, 2017

QUANTITATIVE RESULTS

Semantic score

• Using a pre-trained classifier to score the “plausibility” of generated poses.

• User study.

Figure: (a) User study interface. (b) User study result: bars on a 0% to 100% axis for the comparisons GT vs. ours, GT vs. baseline, and ours vs. baseline, with legend GT, ours, baseline; bar values 46.43, 74.45, 53.57, 72.36, 25.55, 27.64.

QUANTITATIVE RESULTS

Geometry score

• Mapping each generated pose into the 3D scene voxel and checking whether it satisfies the free space and support constraints.

Method          Geometry score (%)
Baseline [1]    23.25
Ours (RGB)      66.40
Ours (RGBD)     71.17
Ours (Depth)    72.11
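A score of this kind is a pass rate: the percentage of generated poses that satisfy every geometric check. The following sketch assumes that definition; the name `geometry_score` and the use of plain predicate functions as checks are illustrative, and the toy predicates below stand in for real free space and support tests.

```python
def geometry_score(poses, checks):
    """Percentage of poses for which every check returns True.

    poses:  list of pose objects (any type the checks understand).
    checks: list of functions pose -> bool, e.g. free space and support tests.
    """
    if not poses:
        return 0.0
    passed = sum(1 for pose in poses if all(check(pose) for check in checks))
    return 100.0 * passed / len(poses)

# Toy stand-ins: each "pose" is a number, each check a simple predicate.
toy_poses = [1, 2, 3, 4]
toy_checks = [lambda p: p > 1, lambda p: p < 4]
toy_score = geometry_score(toy_poses, toy_checks)
```

Requiring all checks to pass means a pose that is well supported but intersects furniture still counts as a failure.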

QUALITATIVE RESULTS

Panels: input image, generated poses, generated poses in scene voxel

FAILURE CASES

Poses that are not semantically plausible.

Poses that violate geometry rules in the 3D world.

CONCLUSION

• We propose a fully automatic 3D human pose synthesizer that leverages the pose distributions learned from the 2D world and the physical feasibility extracted from the 3D world.

• We develop a generative model for 3D affordance prediction that generates plausible human poses with full 3D information from a single scene image.

(Diagram labels: semantic knowledge, geometry knowledge, pose synthesizer, where, what)

HIGHLIGHT

Three aspects of affordance modeling

• Where (figure panels: locations for sitting poses, locations for standing poses, locations for cars, locations for people)

• What

• Interaction

HIGHLIGHT

Parallel-path training

• Effective regularization by introducing adversarial training in the unsupervised path.

THANK YOU

Q&A

RELATED WORK

“Where” problem

Transform a template object to fit in the input image

Limitations

1. A template has to be given (it cannot generate templates)

2. Sometimes there’s no way to fit the template into the input scene by applying affine transforms (figure example: a side-view car)

Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018

RELATED WORK

“What” problem

Add a new pedestrian at a target region

Limitation

The location and size of a pedestrian have to be given by a user

Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou, “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018

RELATED WORK

“Where” and “What” problem

Limitation

Layout prediction and image generation networks are not end-to-end trainable

Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018