
DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES

Sifei Liu, NVIDIA Research

Xueting Li, University of California, Merced

March 19, 2019

2

scene image segmentation / human pose estimation

UNDERSTANDING SCENE AND HUMAN

semantic segmentation from the Cityscapes dataset / pose estimation via OpenCV

3

instance placement / human placement

CREATING SCENE OR HUMAN?

semantic segmentation from the Cityscapes dataset / rendered scene from the SUNCG dataset

✘ ✘

4

LET’S MAKE IT MORE CHALLENGING!

shape synthesis

semantic segmentation from the Cityscapes dataset

5

?

LET’S MAKE IT MORE CHALLENGING!

shape synthesis

6

LET’S MAKE IT MORE CHALLENGING!

placement in the real world

videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation

7

WHAT IS AFFORDANCE?

Where are they?

scene image (human, car) / indoor environment (sitting, standing)

8

WHAT IS AFFORDANCE?

What do they look like?

scene image / indoor environment

9

WHAT IS AFFORDANCE?

How do they interact with each other?

Input Image

Generated Poses

10

OUTLINE

Context-Aware Synthesis and Placement of Object Instances

NeurIPS 2018

Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

CVPR 2019

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

11

QUIZ

Which object is fake?

12

SEQUENTIAL EDITING

Insert new objects one by one

13

Add a person

PROBLEM DEFINITION

Semantic map manipulation by inserting objects

14

WHY SEMANTIC MAP?

• Editing RGB images directly is difficult

Image 1 / Image 2

Image-to-image translation, image editing, ...

15

WHY SEMANTIC MAP?

• We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world

Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017

Rendering

Semantic map

Visualization

16

1. Learn “where” and “what” jointly

2. End-to-end trainable network

3. Diverse outputs given the same input

MAIN GOALS

17

“WHERE” MODULE

How can we learn where to put a new object?

18

“WHERE” MODULE

Pixel-wise annotation: almost impossible to get

(example placement probabilities on the image: p = 0.2, p = 0, p = 0.8)

19

“WHERE” MODULE

Existing objects: need to remove and inpaint objects

Object

20

Removed object

“WHERE” MODULE

Existing objects: need to remove and inpaint objects

21

“WHERE” MODULE

Existing objects: need to remove and inpaint objects

Inpainting

?

22

“WHERE” MODULE

Our approach: put a box and see if it is reasonable

Good box

Bad box

Why box?

1) We don’t want to care about the object shape for now.

2) All objects can be covered by a bounding box.

23

“WHERE” MODULE

How to put a box?

Unit box

Affine transform

Why not use (x, y, w, h) directly?

Placing a box by indexing coordinates is not differentiable.
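
Because direct indexing is not differentiable, the box is placed by warping a "unit box" with a predicted affine transform through a spatial transformer. Below is a minimal PyTorch sketch of that idea; the function name and the example transform values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def place_unit_box(theta, out_size):
    """Warp an all-ones "unit box" into the scene with a differentiable affine transform.

    theta: (B, 2, 3) affine matrices predicted by the "where" network.
    out_size: (H, W) of the semantic map.
    Returns a (B, 1, H, W) soft box mask, differentiable w.r.t. theta.
    """
    B = theta.shape[0]
    H, W = out_size
    canvas = torch.ones(B, 1, H, W)
    # affine_grid/grid_sample form a spatial transformer: sampling the ones canvas
    # through the predicted transform yields a box mask, and gradients flow to theta.
    grid = F.affine_grid(theta, size=(B, 1, H, W), align_corners=False)
    return F.grid_sample(canvas, grid, align_corners=False)

# Example: a box covering roughly 20% of the width/height, offset from the center.
# Note: affine_grid maps output coords to input coords, so a scale of 5 shrinks the box.
theta = torch.tensor([[[5.0, 0.0, 0.3],
                       [0.0, 5.0, -0.2]]], requires_grad=True)
mask = place_unit_box(theta, (256, 512))
print(mask.shape, mask.sum().item())
```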

24

“WHERE” MODULE

(diagram: an affine transform maps the unit box to a bbox in the scene)

25

“WHERE” MODULE

(diagram: a random vector is tiled and concatenated with the semantic map; an STN places the bbox)

26

“WHERE” MODULE

(diagram: tiled random vector + semantic map → STN-placed bbox → real/fake discriminator loss)
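
A rough sketch of how this pipeline could be wired up in PyTorch: a generator maps the semantic map plus a tiled random vector to affine parameters, a spatial transformer places the unit box, and a discriminator scores the map with the box channel appended. All module names, channel counts, and layer sizes here are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WherePlacementGAN(nn.Module):
    """Toy "where" module: predict an affine transform for a unit box and score
    the resulting (semantic map + box) composite with a real/fake discriminator."""

    def __init__(self, num_classes=8, z_dim=16):
        super().__init__()
        # Generator: semantic map + tiled noise -> 6 affine parameters.
        self.gen = nn.Sequential(
            nn.Conv2d(num_classes + z_dim, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 6))
        # Discriminator: semantic map + box-mask channel -> patch scores.
        self.disc = nn.Sequential(
            nn.Conv2d(num_classes + 1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, stride=2, padding=1))

    def forward(self, seg, z):
        B, _, H, W = seg.shape
        z_tiled = z.view(B, -1, 1, 1).expand(-1, -1, H, W)          # "tile" the noise
        theta = self.gen(torch.cat([seg, z_tiled], dim=1)).view(B, 2, 3)
        grid = F.affine_grid(theta, (B, 1, H, W), align_corners=False)
        box = F.grid_sample(torch.ones(B, 1, H, W, device=seg.device), grid,
                            align_corners=False)                    # STN-placed box
        return self.disc(torch.cat([seg, box], dim=1)), box         # fake score, mask

seg = torch.randn(2, 8, 64, 128)              # stand-in for a one-hot semantic map
score, box = WherePlacementGAN()(seg, torch.randn(2, 16))
print(score.shape, box.shape)
```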

27

“WHERE” MODULE

Results with 100 different random vectors

28

“WHERE” MODULE

(diagram: the same pipeline, but the random input vector ends up ignored by the generator)

29

“WHERE” MODULE

(diagram: tiled random vector + semantic map → STN-placed bbox → real/fake discriminator loss)

30

“WHERE” MODULE

(diagram: tiled random vector + semantic map → STN-placed bbox → real/fake discriminator loss)

31

“WHERE” MODULE

Results with 100 different random vectors

32

“WHERE” MODULE

(diagram: with only the adversarial loss, the generator becomes lazy and produces the same bbox for different random vectors z1 and z2)

33

“WHERE” MODULE

(diagram: tiled random vector + semantic map → STN-placed bbox → real/fake discriminator loss)

34

“WHERE” MODULE

(diagram: two parallel paths with shared weights, each tiling and concatenating a vector with the input and producing an STN-placed bbox judged by the real/fake loss)

35

“WHERE” MODULE

(diagram: a supervised path, where an encoder-decoder reconstructs an existing object's bbox with direct supervision, runs in parallel with the unsupervised adversarial path; the STN and discriminator weights are shared between the two paths)
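
One way to read this parallel-path training: a supervised branch encodes an existing object's box and reconstructs it, while the unsupervised branch decodes from random noise and is driven only by the adversarial loss, with weights shared between the branches. A hedged sketch of combining the two losses in a single generator step; the `encode`/`decode`/`disc` interfaces and the loss weights are assumptions, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def parallel_path_step(model, seg, gt_theta, z, lambda_rec=10.0, lambda_adv=1.0):
    """One generator update mixing the supervised and unsupervised paths.

    Assumed interfaces (weights shared between the two paths):
      model.encode(seg, theta) -> latent code for an existing (ground-truth) box
      model.decode(seg, code)  -> predicted affine parameters theta'
      model.disc(seg, theta)   -> real/fake logits for a placement
    """
    # Supervised path: encode a real box, decode it back, and reconstruct it.
    theta_rec = model.decode(seg, model.encode(seg, gt_theta))
    loss_rec = F.l1_loss(theta_rec, gt_theta)

    # Unsupervised path: decode from random noise; only the adversarial loss applies.
    theta_fake = model.decode(seg, z)
    fake_logits = model.disc(seg, theta_fake)
    loss_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))   # generator wants "real"

    return lambda_rec * loss_rec + lambda_adv * loss_adv
```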

36

“WHERE” MODULE

Results with 100 different random vectors (red: person, blue: car)

37

“WHERE” MODULE

Results from epoch 0 to 30

38

1. Learn “where” and “what” jointly

2. End-to-end trainable network

3. Diverse outputs given the same input.

MAIN GOALS

39

“WHAT” MODULE

(diagram: a random vector is tiled and concatenated with the input to the shape generator)

40

“WHAT” MODULE

(diagram: the “where” module's output feeds the “what” module, where a random vector is tiled and concatenated to generate the object shape)

41

“WHAT” MODULE

(diagram: like the “where” module, the “what” module is trained with a supervised encoder-decoder path and an unsupervised adversarial path with shared weights, on top of the “where” module's output)

42

OVERALL ARCHITECTURE

Forward pass

(diagram: input semantic map → affine/bounding box prediction places the unit box → object shape generation → output)
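
Reading the forward pass as a whole: the "where" network predicts an affine transform, the spatial transformer warps the unit box (and the generated shape) into the scene, and the result is pasted into the semantic map. A schematic sketch under assumed module interfaces, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def forward_pass(where_net, what_net, seg, z_where, z_what, class_id):
    """Schematic end-to-end forward pass: "where", then "what".

    Assumed interfaces:
      where_net(seg, z) -> (B, 2, 3) affine matrices for the unit box
      what_net(seg, z)  -> (B, 1, h, w) object shape mask in unit-box coordinates
    """
    B, _, H, W = seg.shape
    theta = where_net(seg, z_where)                               # "where": bounding box
    grid = F.affine_grid(theta, (B, 1, H, W), align_corners=False)
    shape = what_net(seg, z_what)                                 # "what": local shape mask
    # Warping the local mask through the same transform keeps the whole pipeline
    # differentiable, so both discriminators can back-propagate into both modules.
    placed = F.grid_sample(shape, grid, align_corners=False)
    out = seg.clone()
    out[:, class_id] = torch.max(out[:, class_id], placed[:, 0])  # paste into the map
    return out, placed
```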

43

OVERALL ARCHITECTURE

Backward pass for the “where” loss

(diagram: the “where” discriminator's gradient flows back through the output into the bounding box prediction)

44

OVERALL ARCHITECTURE

Backward pass for the “what” loss

(diagram: the “what” discriminator's gradient flows back through the output into the object shape generation)

45

“WHAT” MODULE

Fix “where”, change “what”

46

EXPERIMENTS

Input / Generated / Synthesized RGB (pix2pixHD)

47

EXPERIMENTS

Generated / Nearest Neighbor / Synthesized RGB (nearest-neighbor)

48

EXPERIMENTS

Input / Generated / Synthesized RGB (pix2pixHD)

49

EXPERIMENTS

Generated / Nearest Neighbor / Synthesized RGB (nearest-neighbor)

50

51

52

53

54

Ideal: 50%

Our result: 43%

USER STUDY

55

BASELINES

Baseline 1 (diagram): input → encoder-decoder → generated object → result, trained with real/fake supervision

Baseline 2 (diagram): input → generator → generated object → encoder → STN → result, trained with real/fake supervision

56

CONCLUSION

Learning Affordance in 2D

Where are they? What do they look like?

PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

WHAT IS AFFORDANCE IN 3D?

• General definition:

➢ opportunities for interaction in the scene, i.e., what actions an object can be used for.

• Applications:

➢ Robot navigation

➢ Game development

Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)

The floor can be used for standing.

The desk can be used for sitting.

AFFORDANCE IN 3D WORLD

• Given a single image of a 3D scene, generate reasonable human poses in that scene.

?

LEARNING 3D AFFORDANCE

How do we define a “reasonable” human pose in indoor scenes?

• Semantically plausible: the human should take common actions in indoor environments.

• Physically stable: the human should be well supported by surrounding objects.

LEARNING 3D AFFORDANCE

(diagram: fuse semantic knowledge and geometry knowledge, in a data-driven way?)

LEARNING 3D AFFORDANCE

• Stage I: Build a fully-automatic 3D pose synthesizer.

• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.

(diagram: semantic knowledge + geometry knowledge → pose synthesizer → where / what)

LEARNING 3D AFFORDANCE

• Stage I: Build a fully-automatic 3D pose synthesizer.

• Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.

(diagram: semantic knowledge + geometry knowledge → pose synthesizer → where / what)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Fusing semantic & geometry knowledge

[1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017

[2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR 2017.

The Sitcom dataset [1] (no 3D annotations) / The SUNCG dataset [2] (no human poses)

Combine semantic knowledge and geometry knowledge?

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

semantic knowledge adaptation → geometry adjustment

(pipeline figure: input image → location heat map → generated poses → mapped pose → adjusted pose, with domain adaptation on the semantic side and a mapping from image (x, y) to voxel (X, Y, Z) coordinates in between)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

(pipeline figure: input image → location heat map → generated poses → mapped pose → adjusted pose, as above)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

semantic knowledge adaptation

(diagram: input image → ResNet-18 with convolution/deconvolution layers → location heat map)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

semantic knowledge adaptation

input image → location heat map → generated poses

Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

semantic knowledge adaptation

input image → location heat map → generated poses

Domain adaptation

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

(pipeline figure: input image → location heat map → generated poses → mapped pose → adjusted pose, as above)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

mapping from image to voxel

(figure: generated pose → mapped pose)

d = (H × f) / (H_p × r), with H ~ 𝒩(1.7, 0.1)

(figure: image coordinates (x, y) with pose pixel height H_p, and voxel coordinates (X, Y, Z) with real-world height H)
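
As a quick sanity check of the depth formula above, here is a back-of-the-envelope evaluation in Python; only H ~ 𝒩(1.7, 0.1) comes from the slide, while the focal length, pose pixel height, and scale factor r are made-up values.

```python
# Back-of-the-envelope check of d = (H * f) / (H_p * r).
import random

H = random.gauss(1.7, 0.1)   # sampled body height in meters (from the slide)
f = 518.8                    # assumed focal length in pixels
H_p = 220.0                  # assumed height of the 2D pose in pixels
r = 1.0                      # assumed image-to-voxel scale factor

d = (H * f) / (H_p * r)
print(f"sampled height {H:.2f} m -> estimated depth {d:.2f} m")
```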

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

(pipeline figure: input image → location heat map → generated poses → mapped pose → adjusted pose, as above)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

geometry adjustment

mapped pose → adjusted pose

• Free space constraint [1]: no human body part may intersect with any object in the scene, such as furniture.

[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

geometry adjustment

[1] A. Gupta et al. From 3d scene geometry to human workspace. CVPR, 2011.

generated pose → adjusted pose

• Support constraint [1]: the human pose should be supported by a surface of surrounding objects (e.g., floor, bed).

(figure: (a) generated pose, (b) pose in voxel, (c) scene voxel, (d) sittable surface, (e) positive response from 3D Gaussian sampling, (f) adjusted 3D pose)
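
A toy illustration of enforcing the support constraint during geometry adjustment: drop the mapped pose so a chosen support joint rests on the highest occupied voxel beneath it. This is a simplification of the 3D Gaussian sampling shown in the figure, and the vertical-axis convention and single support joint are assumptions.

```python
import numpy as np

def snap_to_support(occupancy, joints, support_joint=0):
    """Rigidly drop a mapped pose so its support joint rests on the highest
    occupied voxel beneath it (e.g. a seat surface or the floor).

    occupancy: (X, Y, Z) boolean grid, True where the scene is occupied; z is "up".
    joints: (J, 3) integer voxel coordinates of the mapped pose.
    """
    sx, sy, sz = joints[support_joint]
    column = occupancy[sx, sy, :sz]                 # voxels below the support joint
    if not column.any():
        return joints                               # nothing to rest on; leave as is
    surface_z = int(np.nonzero(column)[0].max())    # highest occupied voxel below
    drop = sz - (surface_z + 1)                     # land one voxel above the surface
    adjusted = joints.copy()
    adjusted[:, 2] -= drop                          # translate the whole pose down
    return adjusted
```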

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

(pipeline figure: input image → location heat map → generated poses → mapped pose → adjusted pose, as above)

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

• In total, we generate ~1.5 million poses in 13,774 scenes. We use 13,074 scenes for training the data-driven 3D affordance generative model and 700 scenes for testing.

Synthesized poses visualization

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• Our 3D pose prediction model includes a “where” module that predicts pose locations and a “what” module that predicts pose gestures.

Framework

(diagram: given the scene image I, the where module samples z ~ 𝒩(μ, σ) and predicts the location (x′, y′, d′) and pose class p_c′; the what module samples z ~ 𝒩(μ, σ) and predicts the pose p′; a discriminator provides the adversarial real/fake loss L_adv)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• Our 3D pose prediction model includes a “where” module that predicts pose locations and a “what” module that predicts pose gestures.

Framework

(framework diagram, as above)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• The “where” module predicts pose locations.

• It should be able to sample during inference.

• It should be conditioned on the scene image.

• Losses:

• L = λ_GEO · L_GEO + λ_KLD · L_KLD + λ_MSE · L_MSE

• L_MSE = (x − x′)² + (y − y′)² + (d − d′)² + (p_c − p_c′)²

• L_KLD = KL[ Q(z | μ(x, y, d, I), σ(x, y, d, I)) ‖ 𝒩(0, 1) ]

• L_GEO = (MeM_i(x, y, d) − MeM_i(x′, y′, d′))²
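
A compact sketch of this loss in PyTorch, using the closed-form KL divergence between the encoder's Gaussian and 𝒩(0, 1); since the slide does not define MeM_i precisely, the geometry term below just compares two precomputed lookup values, which is an assumption.

```python
import torch

def where_module_loss(pred, target, mu, logvar, geo_pred, geo_gt,
                      lam_mse=1.0, lam_kld=0.1, lam_geo=1.0):
    """L = lam_geo*L_GEO + lam_kld*L_KLD + lam_mse*L_MSE, as on the slide.

    pred / target: (B, 4) tensors holding (x, y, d, p_c), predicted and ground truth.
    mu, logvar: encoder outputs parameterizing Q(z | x, y, d, I).
    geo_pred / geo_gt: stand-ins for the MeM_i(x, y, d) lookups (assumed precomputed).
    """
    l_mse = ((pred - target) ** 2).sum(dim=-1).mean()
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ).
    l_kld = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
    l_geo = ((geo_pred - geo_gt) ** 2).mean()
    return lam_mse * l_mse + lam_kld * l_kld + lam_geo * l_geo

# Toy shapes: batch of 4, (x, y, d, p_c) targets, 32-dim latent, 8-dim geometry lookup.
loss = where_module_loss(torch.randn(4, 4), torch.randn(4, 4),
                         torch.randn(4, 32), torch.randn(4, 32),
                         torch.randn(4, 8), torch.randn(4, 8))
print(loss.item())
```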

The where module

(diagram: the location (x, y, d) and pose class p_c are encoded, with the image I, into z ~ 𝒩(μ, σ); the decoder predicts (x′, y′, d′) and p_c′ under L_mse and L_kld)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• Our 3D pose prediction model includes a “where” module that predicts pose locations and a “what” module that predicts pose gestures.

Framework

(framework diagram, as above)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• The “what” module predicts pose gestures.

• Losses:

• L = λ_GEO · L_GEO + λ_KLD · L_KLD + λ_MSE · L_MSE

• L_MSE = (p − p′)²

• L_KLD = KL[ Q(z | μ(x, y, d, p_c, I), σ(x, y, d, p_c, I)) ‖ 𝒩(0, 1) ]

• L_GEO = (MeM_i(x_j, y_j, d_j) − MeM_i(x_j′, y_j′, d_j′))²

The what module

(diagram: the pose p is encoded, with (x′, y′, d′), p_c′, and the image I, into z ~ 𝒩(μ, σ); the decoder predicts the pose p′ under L_mse and L_kld)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• Our 3D pose prediction model includes a “where” module that predicts pose locations and a “what” module that predicts pose gestures.

Framework

(framework diagram, as above)

A DATA-DRIVEN 3D POSE PREDICTION MODEL

Geometry-aware discriminator

(diagram: a supervised encoder-decoder path, conditioned on the pose, its location (x, y, d), p_c, the image I, and a depth heat map, runs alongside an unsupervised path that decodes from z ~ 𝒩(0, 1); a shared discriminator provides the real/fake loss L_adv)

D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.

A DATA-DRIVEN 3D POSE PREDICTION MODEL

• Our 3D pose prediction model includes a “where” module that predicts pose locations and a “what” module that predicts pose gestures.

Framework

(framework diagram, as above)

QUANTITATIVE RESULTS

• Using a pre-trained classifier to score the “plausibility” of generated poses.

Semantic score

[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017

(diagram: classifier → plausible? 0/1)

Method          Semantic score (%)
Baseline [1]    72.53
Ours (RGB)      91.69
Ours (RGBD)     91.14
Ours (Depth)    89.86
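
The semantic score can be read as the percentage of generated poses that the pre-trained plausibility classifier accepts. A minimal sketch, where `classifier` is an assumed callable returning one probability per (image, pose) pair:

```python
import torch

def semantic_score(classifier, images, poses, threshold=0.5):
    """Percentage of generated poses judged plausible by a pre-trained classifier.

    classifier(images, poses) -> (N,) plausibility probabilities; assumed interface.
    """
    with torch.no_grad():
        probs = classifier(images, poses)
    return (probs > threshold).float().mean().item() * 100.0
```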

QUANTITATIVE RESULTS

• Using a pre-trained classifier to score the “plausibility” of generated poses.

• User study.

Semantic score

(user study chart: GT 53.57% vs. ours 46.43%; GT 74.45% vs. baseline 25.55%; ours 72.36% vs. baseline 27.64%)

(a) User study interface. (b) User study result.

QUANTITATIVE RESULTS

• Map each generated pose into the 3D scene voxels and check whether it satisfies the free-space and support constraints.

Geometry score

Metric (%)        Baseline [1]   Ours (RGB)   Ours (RGBD)   Ours (Depth)
Geometry score    23.25          66.40        71.17         72.11
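
A simplified sketch of the geometry score: map each pose's joints into a boolean occupancy grid, reject poses whose joints land inside occupied voxels (free-space constraint), and require the support joint to rest on an occupied voxel (support constraint). The joint indexing, the one-voxel support test, and the vertical-axis convention are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np

def satisfies_geometry(occupancy, joints, support_joint=0):
    """Free-space and support check for one pose mapped into the voxel grid.

    occupancy: (X, Y, Z) boolean grid, True where occupied; z is assumed to be "up".
    joints: (J, 3) integer voxel coordinates of the mapped pose joints.
    """
    x, y, z = joints[:, 0], joints[:, 1], joints[:, 2]
    if occupancy[x, y, z].any():                 # free-space constraint violated
        return False
    sx, sy, sz = joints[support_joint]
    # Support constraint (simplified): the voxel right below the support joint is occupied.
    return sz > 0 and bool(occupancy[sx, sy, sz - 1])

def geometry_score(occupancy, pose_list):
    """Percentage of poses satisfying both constraints."""
    return 100.0 * float(np.mean([satisfies_geometry(occupancy, j) for j in pose_list]))
```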

QUALITATIVE RESULTS

Generated Poses in Scene Voxel

Input Image

Generated Poses

QUALITATIVE RESULTS

Generated Poses in Scene Voxel

Input Image

Generated Poses

FAILURE CASES

Poses that are not semantically plausible.

Poses that violate geometry rules in the 3D world.

CONCLUSION

• We propose a fully-automatic 3D human pose synthesizer that leverages pose distributions learned from the 2D world and physical feasibility extracted from the 3D world.

• We develop a generative model for 3D affordance prediction that generates plausible human poses with full 3D information from a single scene image.

(diagram: semantic knowledge + geometry knowledge → pose synthesizer → where / what)

HIGHLIGHT

• Where

• What

• Interaction

Three aspects of affordance modeling

Locations for sitting poses / Locations for standing poses

Locations for cars / Locations for people

HIGHLIGHT

• Where

• What

• Interaction

Three aspects of affordance modeling

HIGHLIGHT

• Where

• What

• Interaction

Three aspects of affordance modeling

HIGHLIGHT

• Effective regularization by introducing adversarial training in the unsupervised path.

Parallel-path training

THANK YOU

Q&A

97

Transform a template object to fit into the input image

RELATED WORK

“Where” problem

Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018

98

? (side-view car)

Limitation

1. A template has to be given (it cannot generate templates)

2. Sometimes there’s no way to fit the template into the input scene by applying affine transforms

Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018

RELATED WORK

“Where” problem

99

Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou, “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018

Add a new pedestrian at a target region

Limitation

The location and size of a pedestrian have to be given by the user

RELATED WORK

“What” problem

100

Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018

Limitation

Layout prediction and image generation networks are not end-to-end trainable

RELATED WORK

“Where” and “What” problem
