
3D-Aware Scene Manipulation via Inverse Graphics

Shunyu Yao∗ (Tsinghua University)
Tzu Ming Harry Hsu∗ (MIT CSAIL)
Jun-Yan Zhu (MIT CSAIL)
Jiajun Wu (MIT CSAIL)
Antonio Torralba (MIT CSAIL)
William T. Freeman (MIT CSAIL, Google Research)
Joshua B. Tenenbaum (MIT CSAIL)

Abstract

We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we address the above issues by integrating disentangled representations for object geometry, appearance, and pose into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into the structured object representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of geometry, appearance, and pose supports various 3D-aware scene manipulations, e.g., rotating and moving objects freely while maintaining consistent shape and texture, or changing object appearance without affecting its shape. We systematically evaluate our model and demonstrate that our editing scheme is superior to its 2D counterpart.

1 Introduction

Humans are incredible at perceiving the world, but even more distinctive is our mental ability to simulate and imagine what will happen. Given an image of a street as in Fig. 1, we can effortlessly detect and recognize cars and their attributes, and, more interestingly, imagine how cars may move and rotate in the 3D world. Motivated by such human abilities, in this work we seek to obtain an interpretable, expressive, and disentangled scene representation for machines, and employ the learned representation for flexible, 3D-aware scene manipulation.

Deep generative models [Goodfellow et al., 2014, Kingma and Welling, 2014, van den Oord et al., 2016] have led to remarkable breakthroughs in learning hierarchical representations of images and decoding the representation back to images. However, the obtained representation is often limited to a single isolated object, hard to interpret, and missing the complex 3D structure behind the input. As a result, these deep generative models cannot support manipulation tasks such as moving an object around as in Fig. 1. On the other hand, graphics engines use a predefined structured and disentangled input (i.e., graphics code) to render images. They can therefore be used for scene manipulation straightforwardly. However, it is in general intractable to recover the graphics code from the image.

In this paper, we propose to incorporate an object-based, interpretable scene representation into deep generative models. Our model employs an encoder-decoder architecture and has three branches: one for object geometry and pose, one for background appearance, and one for object appearance. The geometry branch learns to infer object shape and pose through an approximately differentiable renderer. The appearance branches first predict the instance label map for the input image. They then employ a texture autoencoder to obtain the texture representation for each object.

∗ indicates equal contributions

Preprint. Work in progress.

arXiv:1808.09351v2 [cs.CV] 29 Aug 2018


Figure 1: We propose to learn a holistic scene representation that encodes scene semantics as well as 3D information and texture of objects. Our encoder-decoder model learns disentangled representations for image reconstruction and 3D-aware image editing. For example, cars can be moved to various locations with new 3D poses.

Disentangling 3D geometry and pose from texture enables 3D-aware scene manipulation. For example, to move a car closer, we can simply edit its position and pose, but leave its texture and semantics unchanged.

We report quantitative and qualitative results to demonstrate the effectiveness of our framework on two datasets, Virtual KITTI [Gaidon et al., 2016] and Cityscapes [Cordts et al., 2016]. As the 3D-aware scene manipulation problem has not been well formulated, beyond qualitative results, we have created an image editing benchmark on Virtual KITTI to evaluate our model against 2D pipelines capable of doing so. We also investigate the design of our model through evaluations on representation accuracy and image reconstruction quality.

2 Related Work

Interpretable image representation. Our work is inspired by prior work on obtaining interpretable image representations with neural networks [Cheung et al., 2015, Kulkarni et al., 2015, Chen et al., 2016]. To achieve this goal, DC-IGN [Kulkarni et al., 2015] freezes a subset of latent codes while feeding images that move along a specific direction on the image manifold. A recurrent model [Yang et al., 2015] learns to alter latent factors for view synthesis. InfoGAN [Chen et al., 2016] proposes to disentangle an image into independent factors without supervised data. Another line of approaches is built on intrinsic image decomposition [Barrow and Tenenbaum, 1978] and has shown promising results on faces [Shu et al., 2017, Sengupta et al., 2018] and objects [Janner et al., 2017, Liu et al., 2017a].

While prior work focuses on a single object, we aim to obtain a holistic scene understanding. Our work most resembles Wu et al. [2017a], which proposes to de-render an entire image with a graphics engine as the decoder and a neural network as the encoder to perform inverse graphics. However, their method cannot back-propagate gradients from the graphics engine or generalize to a new environment, and their results were limited to simple game environments such as Minecraft. Different from Wu et al. [2017a], we use differentiable neural networks for both the encoder and the decoder, making it possible to handle more complex natural images.

Deep generative models. Deep generative models [Goodfellow et al., 2014, Kingma and Welling, 2014, van den Oord et al., 2016, Dinh et al., 2017, Salimans et al., 2017, Gulrajani et al., 2017, Karras et al., 2018] have been used to synthesize realistic images and learn rich internal representations. Unfortunately, the learned representations are often hard for humans to interpret and understand. Besides, existing deep generative models fail to consider the 3D nature of our visual world. In contrast, our differentiable geometric renderer can discover the 3D geometry of objects, which improves the quality of image generation and enables 3D-aware scene manipulation.


Figure 2: Illustration of our framework. The encoder consists of a semantic, a textural, and a geometric de-renderer; their output representations are combined by a textural renderer to reconstruct the input image.

Deep image manipulation. Recently, deep networks have enabled various image editing tasks, such as style transfer [Gatys et al., 2016, Selim et al., 2016], image-to-image translation [Isola et al., 2017, Zhu et al., 2017a, Liu et al., 2017b, Zhu et al., 2017b, Wang et al., 2018], automatic colorization [Zhang et al., 2016, Sangkloy et al., 2017], inpainting [Pathak et al., 2016, Yang et al., 2017], interactive editing [Zhu et al., 2016], attribute editing [Yan et al., 2016a], and denoising [Gharbi et al., 2016]. Different from prior work that operates in a 2D setting, our model allows 3D-aware image manipulation. Besides, while the above methods often require a given structured representation (e.g., label maps [Wang et al., 2018]) as input, our algorithm can automatically learn an internal representation suitable for image editing. Our work is also inspired by previous semi-automatic 3D editing systems [Karsch et al., 2011, Chen et al., 2013, Kholgade et al., 2014]. While these systems require detailed human annotation of object geometry and scene layout, our method is fully automatic.

3 Method

We use an encoder-decoder framework for the scene de-rendering problem. In particular, we first de-render (encode) an image into a compact representation. Then a renderer (decoder) reconstructs the image from the representation. Fig. 2 shows three branches for extracting semantic, geometric, and textural information from an input image, respectively.

The semantic branch aims to produce an instance segmentation map of the image and assigns a unique instance ID to the background and each foreground object. The texture branch learns to encode the color and texture of each instance (object/background) into a latent texture code, which later helps reconstruct the input image. The 3D geometry branch infers 3D attributes (e.g., 3D pose, silhouette) of objects (cars and vans) with a differentiable shape renderer.
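To make this structured representation concrete, the following is a minimal sketch of the kind of per-object and per-segment records the three branches produce. It is illustrative only, not the authors' code; all class and field names are our assumptions.

```python
# Illustrative sketch of the structured scene representation: every background
# segment carries a semantic class and a texture code; every object additionally
# carries a 3D pose inferred by the geometric branch.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class ObjectPose:
    scale: np.ndarray        # s in R^3
    rotation_y: float        # theta, the single rotational DoF about the up-axis
    translation: np.ndarray  # t in R^3 (distance to camera handled in log-space)


@dataclass
class SceneObject:
    instance_id: int
    texture_code: np.ndarray  # low-dimensional appearance code
    pose: ObjectPose


@dataclass
class BackgroundSegment:
    semantic_class: str       # e.g. "sky", "road", "tree"
    texture_code: np.ndarray


@dataclass
class SceneRepresentation:
    objects: List[SceneObject] = field(default_factory=list)
    background: List[BackgroundSegment] = field(default_factory=list)
```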

3.1 3D Geometry Inference

Fig. 3 shows our 3D geometry inference module. We first segment object instances with Mask-RCNN [He et al., 2017]. For each object, we infer its 3D mesh model and object pose.

3D attributes. We describe a 3D object with a mesh model M = (V, F) of vertices and faces and its pose information π, which consists of scale s ∈ R3, rotation represented by a unit quaternion q ∈ R4, and translation t ∈ R3. For most real-world scenarios such as road scenes, we assume that all objects lie parallel to the ground, i.e., the quaternion has only one rotational degree of freedom θ.

To handle multiple objects in a scene, we follow previous depth prediction work [Eigen et al., 2014] and parametrize the distance t of objects to the camera in log-space. Determining the log-distance from the cropped image patch alone turns out to be an under-determined problem. We find that multiplying the distance t by √A, where A denotes the area of the object mask, gives a more predictive quantity τ = t·√A. Fig. 4 shows that the proposed value τ has a more concentrated distribution and hence possibly yields a smaller prediction error. This parametrization improves results, as shown in later experiments (Sec. 4.2).
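A minimal sketch of this parametrization, assuming a binary numpy object mask (the helper names are ours, not the paper's):

```python
import numpy as np

def distance_to_tau(t, mask):
    """Map the camera distance t to the better-conditioned target tau = t * sqrt(A),
    where A is the pixel area of the object mask; the network regresses log(tau)."""
    return t * np.sqrt(float(mask.sum()))

def tau_to_distance(tau, mask):
    """Invert the parametrization at inference time to recover the distance t."""
    return tau / np.sqrt(float(mask.sum()))
```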


Figure 3: The geometric interpretation module. The module takes the whole image, infers 3D attributes with proper formulations from object proposals, and can generate interpretable representations for understanding and manipulation.

Figure 4: Proposed improvements to the de-rendering task. (a) Distributions of the original distance t compared to the proposed τ; the x-axes span ranges of equal width. (b) Object silhouettes rendered without and with our proposed re-projection loss.

To predict the 3D center of the object on the image, $(x_c, y_c)$, we follow Ren et al. [2015] and parametrize it as $e = \left(\frac{x_c - x_r}{w_r}, \frac{y_c - y_r}{h_r}\right)$, where $(x_r, y_r)$ is the center of the bounding box and $(w_r, h_r)$ are its width and height. Letting $\hat{\cdot}$ denote an inferred attribute and $\cdot$ the corresponding ground truth, we calculate the 3D attribute regression loss on π as
$$L_\pi = \sin^2\!\left(\frac{\hat\theta - \theta}{2}\right) + \|\hat{e} - e\|_2^2 + (\log\hat\tau - \log\tau)^2 + \|\hat{s} - s\|_2^2,$$
where θ, e, τ, and s denote the rotation, center position, distance, and scale as defined above.
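Read as code, the regression loss could look like the following PyTorch-style sketch (illustrative only; the squared-sine rotation penalty follows the reconstruction above, and all inputs are assumed to be batched tensors):

```python
import torch

def pose_regression_loss(theta_hat, theta, e_hat, e, log_tau_hat, log_tau, s_hat, s):
    """L_pi: squared-sine penalty on the rotation offset plus squared errors on the
    projected 3D center e, the log-distance, and the scale."""
    rot = torch.sin((theta_hat - theta) / 2.0) ** 2
    center = ((e_hat - e) ** 2).sum(dim=-1)
    dist = (log_tau_hat - log_tau) ** 2
    scale = ((s_hat - s) ** 2).sum(dim=-1)
    return (rot + center + dist + scale).mean()
```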

Reprojection consistency. Regressing to the ground truth 3D attributes does not necessarily guarantee that the 2D projection of the 3D model fits the object silhouette. Therefore, we enforce a reprojection consistency loss [Yan et al., 2016b, Rezende et al., 2016, Wu et al., 2016, Tulsiani et al., 2017, Wu et al., 2017b] between the segmented object silhouette S and the 3D model's projection onto the 2D image plane. Fig. 4b shows an example. To optimize this reprojection consistency loss, we use a differentiable renderer [Kato et al., 2018] to project a 3D model M to a 2D mask according to the object pose π, i.e., Proj(M, π). We then calculate the re-projection loss as Lr = ‖S − Proj(M, π)‖₁. We obtain the optimal 3D pose by minimizing Lπ + λr·Lr, where λr controls the relative importance of the two terms.
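A sketch of how the two geometry losses combine; here `project` stands in for a differentiable renderer such as a neural mesh renderer, and its interface is an assumption of ours:

```python
import torch

def reprojection_loss(silhouette, mesh, pose, project):
    """L_r: mean absolute difference (a normalized L1) between the segmented
    silhouette S and the soft 2D mask obtained by projecting mesh M under pose pi."""
    return (silhouette - project(mesh, pose)).abs().mean()

def geometry_objective(l_pi, l_r, lambda_r=0.1):
    """The pose is obtained by minimizing L_pi + lambda_r * L_r (lambda_r = 0.1 in Sec. 3.3)."""
    return l_pi + lambda_r * l_r
```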

3.2 Semantic and Texture Inference

Our semantic branch obtains an instance label map L by combining the output of a semantic segmentation model, DRN [Yu et al., 2017, Zhou et al., 2017], and an instance segmentation model, Mask-RCNN [He et al., 2017], with simple heuristics [Kirillov et al., 2018]. Built on recent work on multimodal image-to-image translation [Zhu et al., 2017b, Wang et al., 2018], our texture branch encodes the texture of each object into a low-dimensional code so that the decoder can later reconstruct the original object from the code. We use a background textural encoder to encode the color and texture of background segments (e.g., road, sky) and use an object textural encoder for objects (e.g., car, van). Later, we combine the output of this object textural encoder with the estimated 3D attributes to better represent objects.

As objects often appear with coherent visual appearances, our texture network encodes each object instance into a unique latent code, similar to pix2pixHD [Wang et al., 2018]. Formally, given an image I and its instance label map L, we want to obtain a feature embedding z such that (L, z) can later reconstruct I. We formulate our method under a conditional adversarial learning framework with three networks (G, D, E): a textural encoder E : (L, I) → z, a generator G : (L, z) → I, and a discriminator D : (L, I) → (0, 1), trained jointly with the following objective.


To increase the photorealism of generated images, we use a standard conditional GAN loss [Goodfellow et al., 2014, Mirza and Osindero, 2014, Isola et al., 2017]:
$$L_{\text{GAN}}(G, D, E) = \mathbb{E}_{L, I}\big[\log D(L, I) + \log\big(1 - D(L, G(L, E(L, I)))\big)\big]. \quad (1)$$

To stabilize training, we add a feature matching loss [Gatys et al., 2016, Dosovitskiy and Brox, 2016, Johnson et al., 2016] to match the statistics of intermediate features between generated and real images:
$$L_{\text{FM}}(G, E) = \mathbb{E}_{L, I}\left[\frac{1}{n}\sum_{i=1}^{n} \|F_i(I) - F_i(G(L, E(L, I)))\|_1\right]. \quad (2)$$

Here $F_i$ denotes the feature extractor at the i-th layer of a convolutional network. Finally, we use a pixel-wise image reconstruction loss:
$$L_{\text{Recon}}(G, E) = \mathbb{E}_{L, I}\big[\|I - G(L, E(L, I))\|_1\big]. \quad (3)$$

The learning is formulated as a three-player minimax game between (G, E) and D:
$$G^*, E^* = \arg\min_{G, E}\,\max_{D}\,\big(L_{\text{GAN}}(G, D, E) + \lambda_{\text{FM}} L_{\text{FM}}(G, E) + \lambda_{\text{Recon}} L_{\text{Recon}}(G, E)\big), \quad (4)$$
where λFM and λRecon control the relative importance of each term.
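One possible way to assemble the (G, E) side of this objective is sketched below (the discriminator is updated with the opposite GAN term; the feature lists, weights, and function names are our assumptions, with λFM and λRecon set as in Sec. 3.3):

```python
import torch

def generator_encoder_loss(d_fake, feats_real, feats_fake, image_real, image_fake,
                           lambda_fm=5.0, lambda_recon=20.0, eps=1e-8):
    """Sketch of Eq. (4) from the (G, E) perspective: the generator term of the GAN
    loss, the feature matching loss over n intermediate layers, and the L1
    reconstruction loss."""
    gan = torch.log(1.0 - d_fake + eps).mean()                       # generator side of Eq. (1)
    fm = sum((fr - ff).abs().mean()                                  # Eq. (2)
             for fr, ff in zip(feats_real, feats_fake)) / len(feats_real)
    recon = (image_real - image_fake).abs().mean()                   # Eq. (3)
    return gan + lambda_fm * fm + lambda_recon * recon
```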

Decoupling geometry and texture. We observe that the textural encoder often learns not only texture but also object pose. To further decouple these two factors, we concatenate the inferred 3D attributes from the geometric branch to the texture code map z and feed both to the texture generator G. We also reduce the dimension of the texture code so that the code can focus on texture, as 3D poses are already available. These two modifications help encode textural features that are independent of the object geometry. They also resolve an ambiguity in object poses: cars share similar silhouettes when facing forward or backward, and the texture generator needs the 3D pose code to synthesize the correct texture (see Fig. 5b for an example).

3.3 Implementation Details

Our semantic branch adopts DRN for semantic segmentation [Yu et al., 2017, Zhou et al., 2017] and Mask-RCNN for instance segmentation and object proposal generation [He et al., 2017]. Then, we fuse object instance masks with semantic segmentation maps, resolving any conflict in favor of the object masks [Kirillov et al., 2018]. We feed the fused object instance maps to the texture branch and the object proposals to the geometric branch.
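A sketch of such a fusion heuristic (the instance-ID offset and the mask ordering are our assumptions, not the paper's exact rule):

```python
import numpy as np

def fuse_label_map(semantic_map, instance_masks):
    """Start from the semantic segmentation and let the instance masks override it,
    so any conflict is resolved in favor of the object masks."""
    label_map = semantic_map.astype(np.int64).copy()
    for i, mask in enumerate(instance_masks):
        label_map[mask.astype(bool)] = 1000 + i  # hypothetical per-instance ID offset
    return label_map
```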

For our geometric branch, we use a single generic car CAD model from ShapeNet [Chang et al., 2015] and use the same mesh model for all cars and vans. Given an object proposal, we predict 3D attributes with a ResNet-18 network [He et al., 2015] and then render the 2D projection with a Neural 3D Mesh Renderer [Kato et al., 2018]. We empirically set λr = 0.1. We first train the network with Lπ using Adam [Kingma and Ba, 2015] with a learning rate of $10^{-3}$ for 128 epochs; we then fine-tune the model with Lπ + λr·Lr with a learning rate of $10^{-4}$ for another 128 epochs.

For our textural branch, we use the same architecture as Wang et al. [2018] for the encoder, generator, and discriminator. We set the dimension of the texture code to 3 for Virtual KITTI [Gaidon et al., 2016] and 5 for Cityscapes [Cordts et al., 2016]. The rendered object pose θ is quantized into 24 bins and concatenated to the texture code map z using a one-hot encoding. We use λFM = 5, λRecon = 20 and train the network for 60 epochs on Virtual KITTI and 100 epochs on Cityscapes.
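The pose conditioning can be sketched as follows; only the 24-bin one-hot encoding and the concatenation with the texture code come from the text, while the binning convention is ours:

```python
import numpy as np

def pose_one_hot(theta, num_bins=24):
    """Quantize the object pose theta (radians) into one of 24 bins and return a one-hot vector."""
    bin_idx = int((theta % (2 * np.pi)) / (2 * np.pi) * num_bins) % num_bins
    one_hot = np.zeros(num_bins, dtype=np.float32)
    one_hot[bin_idx] = 1.0
    return one_hot

def generator_object_code(texture_code, theta):
    """Concatenate the low-dimensional texture code (3-d on Virtual KITTI) with the pose one-hot."""
    return np.concatenate([texture_code, pose_one_hot(theta)])
```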

4 Results

We report the results in three parts. First, we present new 3D-aware image editing applications enabled by our framework. Second, we build a Virtual KITTI [Gaidon et al., 2016] Image Editing Benchmark and compare our editing scheme to baselines without 3D knowledge. Lastly, we analyze our design through evaluating representation accuracy and reconstruction quality.

Datasets. We conduct experiments on two datasets: Virtual KITTI [Gaidon et al., 2016] and Cityscapes [Cordts et al., 2016]. Virtual KITTI serves as a proxy to the KITTI dataset [Geiger et al., 2012].


Figure 5: Image editing examples on Virtual KITTI. (a) A car far away can be pulled closer with consistent texture. (b) A car can be rotated to its left side, front side, and right side by changing only its pose code; the same texture code is used for the different poses. (c) We apply texture codes from cars of different poses to the red car without affecting its pose. We can also change the texture codes of background segments to change the environment condition. (d) We show occlusion recovery and object removal.

It contains five virtual worlds, each rendered under ten environment conditions, yielding 21,260 images in total. For each condition in each world, we use either the first or the last 80% of consecutive frames for training and the rest for testing. For object-wise evaluations, we use objects that have at least 256 visible pixels, an occlusion ratio below 70%, and a truncation ratio below 50% [Gaidon et al., 2016]. Cityscapes does not have ground truth pose annotations, yet it provides pixel-level annotations for 2,975 training images.

We have also built the Virtual KITTI Image Editing Benchmark, allowing users to systematically evaluate image editing algorithms. The benchmark contains 92 pairs of images in the test set, chosen from varying worlds but the same environment condition, with the camera either stationary or almost still. An example pair is presented on the left of Fig. 1.

For each pair of source and target images, we provide object-wise operations. Each operation is parametrized by the starting position $(u_{\mathrm{src}}, v_{\mathrm{src}})$ and the ending position $(u_{\mathrm{tgt}}, v_{\mathrm{tgt}})$ (both are 2D pixel coordinates of the object's 3D center), the zoom-in factor ρ, and the rotation difference ∆ry with respect to the y-axis. Objects can move out of the image, so we label each operation as either "modify" or "delete".
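For concreteness, each benchmark operation can be represented by a record like the following; the field names are illustrative and do not reflect the benchmark's actual file format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EditOperation:
    """One object-wise operation in the Virtual KITTI editing benchmark."""
    src: Tuple[float, float]   # (u_src, v_src): pixel coordinates of the object's 3D center
    tgt: Tuple[float, float]   # (u_tgt, v_tgt): target position of that center
    rho: float                 # zoom-in factor
    delta_ry: float            # rotation difference about the y-axis
    kind: str                  # "modify" or "delete" (object may leave the image)
```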

4.1 3D-Aware Image Editing

The disentanglement of geometry and texture suggests a natural image manipulation scheme. We can modify the 3D attributes of an object to translate, scale, or rotate it in the 3D world, while keeping its appearance intact. We can also replace its texture code with one from another object to change its appearance without affecting its pose.

Methods. We compare our full model with two simplified alternatives:

• 2D: The naïve 2D baseline takes the source and target positions and applies the translation and scale of the operation.

• 2D+: The 2D+ baseline includes the 2D operations above and additionally rotates the planar silhouette by ∆ry about the y-axis (a minimal sketch of the 2D baseline is given below).
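The following sketch illustrates the naïve 2D baseline under our own simplifying assumptions (nearest-neighbor resizing, no hole filling behind the moved object, and no 2D+ planar rotation); it is not the benchmark's reference implementation:

```python
import numpy as np

def _resize_nn(a, rho):
    """Nearest-neighbor rescaling of an image or mask by the zoom factor rho."""
    h, w = a.shape[:2]
    ys = np.minimum((np.arange(int(round(h * rho))) / rho).astype(int), h - 1)
    xs = np.minimum((np.arange(int(round(w * rho))) / rho).astype(int), w - 1)
    return a[ys][:, xs]

def apply_2d_edit(image, mask, tgt, rho):
    """Crop the masked object, rescale it by rho, and paste its masked pixels so the
    patch center lands at the target position tgt = (u_tgt, v_tgt)."""
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    patch = _resize_nn(image[y0:y1, x0:x1], rho)
    pmask = _resize_nn(mask[y0:y1, x0:x1], rho)
    out = image.copy()
    h, w = patch.shape[:2]
    top, left = int(tgt[1]) - h // 2, int(tgt[0]) - w // 2
    for dy, dx in zip(*np.where(pmask)):
        y, x = top + dy, left + dx
        if 0 <= y < out.shape[0] and 0 <= x < out.shape[1]:
            out[y, x] = patch[dy, dx]
    return out
```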


Figure 6: Image editing examples on Cityscapes. (a) We move an occluded car to the right and then move it closer to the camera. Note that our model can synthesize the occluded part automatically. (b) We move two cars closer to the camera. (c) We rotate the left car with small angles.

(a) Perceptual similarity scores (LPIPS; lower is better)

                         Ours     2D       2D+
  LPIPS (predicted seg)  0.1742   0.1941   0.1946
  LPIPS (GT seg)         0.1708   0.1893   0.1902

(b) Human study results (preference for Ours)

          vs. 2D    vs. 2D+
  Ours    67.46%    73.48%

Table 1: Evaluations on the Virtual KITTI editing benchmark. (a) We evaluate the perceptual similarity [Zhang et al., 2018]; lower is better. We consider two cases: the model uses either its own predicted semantic segmentation or the ground truth segmentation. (b) Human subjects compare two results from two methods. The percentage shows how often they prefer the left method (Ours) to the top one. Our method outperforms previous 2D approaches consistently.

Metrics. We cannot simply evaluate the similarity between the edited image and the target at the pixel level, because two images that look similar may still have a large L1 or L2 distance. Instead, we adopt the Learned Perceptual Image Patch Similarity (LPIPS) metric [Zhang et al., 2018], which is designed to match human perception. LPIPS ranges from 0 to 1, with 0 being the most similar.
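With the publicly released lpips package, this metric can be computed roughly as follows; images are assumed to be float tensors of shape (N, 3, H, W) scaled to [-1, 1], and the AlexNet backbone choice is ours:

```python
import torch
import lpips  # pip install lpips; reference implementation of the LPIPS metric

metric = lpips.LPIPS(net='alex')

def editing_distance(edited, target):
    """Mean LPIPS between edited and target images; lower means closer to the target."""
    with torch.no_grad():
        return metric(edited, target).mean().item()
```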

We also conduct a human study: we present each target image and the editing results from two different methods (Ours vs. 2D, or Ours vs. 2D+) to 15 subjects on Amazon Mechanical Turk and ask them which edited image looks closer to the target. For each pair of methods, we then compute how often one is preferred over the other across all test data and all human responses.

Results. Fig. 5 and Fig. 6 show qualitative results on Virtual KITTI and Cityscapes, respectively. As stated before, the Cityscapes dataset does not contain ground-truth object poses for fine-tuning; we therefore exploit the differentiable renderer by first predicting attributes with the trained de-renderer and then iteratively optimizing the attributes via the re-projection loss. The resulting geometric attributes turn out to be useful for manipulation tasks. Fig. 7 shows a direct comparison to a state-of-the-art 2D manipulation method, pix2pixHD [Wang et al., 2018].
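A sketch of this test-time refinement, with `project` again standing in for the differentiable renderer; the optimizer, step count, and learning rate are illustrative choices of ours:

```python
import torch

def refine_pose(pose_init, mesh, silhouette, project, steps=100, lr=1e-2):
    """Start from the de-renderer's predicted pose and descend on the reprojection
    loss through the differentiable renderer when no ground-truth poses exist."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (silhouette - project(mesh, pose)).abs().mean()
        loss.backward()
        opt.step()
    return pose.detach()
```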

Quantitatively, Table 1 shows that our algorithm outperforms both baselines in LPIPS, whether the predicted semantic segmentation or the ground truth segmentation is provided for manipulation. Our results are also preferred by human subjects over the baselines (67.46% vs. 2D and 73.48% vs. 2D+).

4.2 Evaluation on the Geometric Representation

Methods. As described in Section 3.1, we adopt multiple strategies to formulate the inference problem. As an ablation study, we compare our full model (full), which is first trained using Lπ and then fine-tuned using Lr, with the following variants:

• π only: optimizes only Lπ; trained from scratch.

• π + r only: optimizes L = Lπ + Lr; trained from scratch.


               Rotation (°)   Direction (°)   Distance (×10⁻²)   Scale   Label Map (×10⁻³)
  Bbox3D       5.18           0.341           4.41               0.391   9.80
  π only       2.44           0.397           3.76               0.372   9.54
  π + r only   2.33           0.333           3.44               0.442   5.24
  Quat         5.22           0.380           4.51               0.405   11.23
  Naïve-d      2.68           0.357           4.13               0.413   10.15
  Ours (full)  2.16           0.295           3.37               0.464   4.60

Table 2: Performance of 3D attribute prediction on Virtual KITTI. We compare our full model, two variants, and three other formulations, using ground truth proposals. Our full model performs the best on most metrics and obtains a much lower error for the instance label map. Refer to the text for details on our metrics.

Figure 7: Comparison to pix2pixHD [Wang et al., 2018] on the Virtual KITTI editing benchmark. (a) We successfully recover the mask of an occluded car and move it closer to the camera, while pix2pixHD fails. (b) We rotate the car from back to front. With the texture code encoded from the back view and a new pose code for the front view, our model figures out that the tail lights should be removed, while pix2pixHD cannot.

• Quat: uses the full rotation space SO(3), characterized by a unit quaternion q.

• Naïve-d: reverts to the original distance parametrization in Fig. 4a and regresses the distance in log space.

We also compare with Mousavian et al. [2017], who first infer the object's 2D bounding box and pose from the input and then search for its 3D bounding box. To eliminate the effect of the object proposal network, we fix the object proposal to the ground truth bounding box. The evaluation therefore covers only geometry inference.

Metrics. To quantify the error in the estimated object parameters, we use a variety of measures on the different quantities. For rotation, we compute the geodesic distance $\Delta_q = \cos^{-1}\!\big(2(q \cdot \hat{q})^2 - 1\big)$; for the observation direction e, we use the angle between the ground truth and inferred vectors, $\Delta_e = \cos^{-1}(\hat{e} \cdot e)$; for distance, we use the absolute logarithmic error $\Delta_t = |\log\hat{t} - \log t|$; and for scale, we adopt the standard Euclidean distance $\Delta_s = \|\hat{s} - s\|_2$.
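These measures translate directly into code; a minimal numpy sketch follows, assuming unit quaternions and (for the direction term) vectors that we normalize defensively:

```python
import numpy as np

def rotation_error(q_hat, q):
    """Geodesic distance between unit quaternions: arccos(2 (q_hat . q)^2 - 1)."""
    d = float(np.dot(q_hat, q))
    return np.arccos(np.clip(2 * d * d - 1, -1.0, 1.0))

def direction_error(e_hat, e):
    """Angle between the predicted and ground-truth observation directions."""
    c = float(np.dot(e_hat, e)) / (np.linalg.norm(e_hat) * np.linalg.norm(e))
    return np.arccos(np.clip(c, -1.0, 1.0))

def distance_error(t_hat, t):
    """Absolute logarithmic error on the camera distance."""
    return abs(np.log(t_hat) - np.log(t))

def scale_error(s_hat, s):
    """Euclidean distance between predicted and ground-truth scales."""
    return float(np.linalg.norm(np.asarray(s_hat) - np.asarray(s)))
```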

In addition to calculating these errors across objects, we also compute a per-scene re-projection error, defined as the fraction of pixels whose instance IDs differ between the ground truth and rendered instance maps. Note that the instance-map metric effectively emphasizes closer objects, which is what our formulation has been optimized for. All numbers reported are averaged, either over objects or over scenes, across the test set.

Results. Quantitative results are shown in Table 2. Our full model performs the best on most criteria; in particular, its instance label map error is significantly lower than that of the other formulations. Each of the proposed components (re-projection consistency, normalized depth, constrained rotation, and fine-tuning) contributes to the performance.

5 Conclusion

In this work, we have developed an inverse graphics network to obtain an interpretable and disentangled scene representation with rich 3D structural and textural information. Though our work mainly focuses on 3D-aware scene manipulation, the learned representations could be potentially useful for various tasks such as image reasoning, captioning, and analogy-making.


Acknowledgements. This work is supported by NSF #1231216, ONR MURI N00014-16-1-2007, the Center for Brains, Minds and Machines (NSF #1231216), Toyota Research Institute, and Facebook.

References

Harry G Barrow and Jay M Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.
Tao Chen, Zhe Zhu, Ariel Shamir, Shi-Min Hu, and Daniel Cohen-Or. 3-Sweep: Extracting editable objects from a single photo. ACM Transactions on Graphics (TOG), 32(6):195, 2013.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. In ICLR Workshop, 2015.
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
A Gaidon, Q Wang, Y Cabon, and E Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG), 35(6):191, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Michael Janner, Jiajun Wu, Tejas D. Kulkarni, Ilker Yildirim, and Josh Tenenbaum. Self-supervised intrinsic image decomposition. In NIPS, 2017.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
Kevin Karsch, Varsha Hedau, David Forsyth, and Derek Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6):157, 2011.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
Natasha Kholgade, Tomas Simon, Alexei Efros, and Yaser Sheikh. 3D object manipulation in a single photograph using stock 3D models. ACM Transactions on Graphics (TOG), 33(4):127, 2014.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. arXiv:1801.00868, 2018.
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. Material editing using a physically based rendering network. In ICCV, 2017a.
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017b.
Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Košecká. 3D bounding box estimation using deep learning and geometry. In CVPR, 2017.
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In NIPS, 2016.
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv:1701.05517, 2017.
Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017.
Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Painting style transfer for head portraits using convolutional neural networks. ACM Transactions on Graphics (TOG), 35(4):129, 2016.
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces ‘in the wild’. In CVPR, 2018.
Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.
Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3D interpreter network. In ECCV, 2016.
Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017a.
Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, 2017b.
Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2Image: Conditional image generation from visual attributes. In ECCV, 2016a.
Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS, 2016b.
Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, 2017.
Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.
Fisher Yu, Vladlen Koltun, and Thomas A Funkhouser. Dilated residual networks. In CVPR, 2017.
Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017a.
Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017b.