3D-Aware Scene Manipulation via Inverse Graphics (3dsdn.csail.mit.edu/3dsdn.pdf)




Motivation & Contributions

• Humans are good at perceiving and simulating the world with 3D structure in mind.

• Previous deep generative models are often limited to a single object, hard to interpret, and lacking 3D structure.

3D-Aware Scene Manipulation via Inverse Graphics
Shunyu Yao*1, Tzu-Ming Harry Hsu*2, Jun-Yan Zhu2, Jiajun Wu2, Antonio Torralba2, William T. Freeman2,3, and Joshua B. Tenenbaum2

1Tsinghua University, 2MIT CSAIL, 3Google Research

[Teaser figure, panels (a)(b): original image, 3D-SDN (ours) result, and pix2pixHD result.]

• 3D-SDNs learn and incorporate (1) scene semantic labels, (2) texture encodings for objects and the background, and (3) 3D geometry and pose for objects.

3D Scene De-rendering Networks (3D-SDN): Overview

[Overview diagram: the input image is de-rendered by three branches (a geometric de-renderer, a textural de-renderer, and a semantic de-renderer) and re-rendered by the corresponding geometric and textural renderers.]


Scene Manipulation via 3D-SDN

Example (input frame 0006, 'fog', 00067). The image is first de-rendered (semantically + texturally, and geometrically) into an interpretable scene representation, e.g.:

  <seg sky code=[0.90, 0.98, 0.43, …]>
  <obj type=car center forward black>
  <seg tree code=[0.93, 0.99, 0.52, …]>
  <obj type=car center forward red>
  ...

The representation is then manipulated, e.g. the cars are moved from the center to the right:

  <obj type=car right forward black>
  <obj type=car right forward red>
  ...

Finally, the geometric and textural renderers re-render the edited representation into the output image. (A minimal data-structure sketch follows.)
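The sketch below illustrates one possible data structure for such an interpretable scene representation and a single edit step. The class and field names are illustrative assumptions, not the authors' implementation; in 3D-SDN the geometric attributes come from the geometric de-renderer and the texture codes from the textural de-renderer.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class SceneObject:
        """One de-rendered foreground object (attributes from the geometric
        branch plus a texture code from the textural branch)."""
        category: str                              # e.g. "car"
        translation: Tuple[float, float, float]    # 3D position
        rotation_y: float                          # yaw only: objects lie on the ground
        scale: Tuple[float, float, float]
        texture_code: List[float]                  # low-dimensional appearance code

    @dataclass
    class Scene:
        """Background segments (class name -> texture code) plus objects."""
        background: Dict[str, List[float]]
        objects: List[SceneObject] = field(default_factory=list)

    def move_object(scene: Scene, index: int,
                    new_translation: Tuple[float, float, float]) -> Scene:
        """An edit touches only this representation; the geometric and
        textural renderers then re-render the modified scene."""
        scene.objects[index].translation = new_translation
        return scene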

Geometric De-renderer & Renderer

• Mask-RCNN generates object proposals.
• The 3D de-renderer infers object attributes and free-form deformation (FFD) coefficients, and selects a mesh model (a prediction-head sketch follows this list).
• The 3D mesh renderer renders silhouettes and a normal map (byproducts: object edge map and object pose map).
• REINFORCE + regular gradients train on the reprojection loss $\mathcal{L}_\text{reproj} = \|\tilde{S} - S\|$.
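The following is a sketch of the per-object prediction heads implied by the diagram and bullets above: ROI-pooled features feeding fully connected heads for scale, rotation, 2D offset, normalized depth, FFD coefficients, and a distribution over candidate meshes. The feature dimension, FFD grid size, and head layout are assumptions for illustration, not the released architecture.

    import torch
    import torch.nn as nn

    class GeometricHeads(nn.Module):
        """Per-object heads of the geometric de-renderer (illustrative sizes)."""
        def __init__(self, feat_dim=2048, num_meshes=8, ffd_dim=3 * 4 * 4 * 4):
            super().__init__()
            self.scale = nn.Linear(feat_dim, 3)        # log-scale
            self.quat = nn.Linear(feat_dim, 4)         # rotation quaternion
            self.offset = nn.Linear(feat_dim, 2)       # 2D offset e
            self.log_tau = nn.Linear(feat_dim, 1)      # normalized depth (log-space)
            self.ffd = nn.Linear(feat_dim, ffd_dim)    # FFD grid coefficients
            self.mesh_logits = nn.Linear(feat_dim, num_meshes)

        def forward(self, roi_feat):
            q = self.quat(roi_feat)
            q = q / q.norm(dim=-1, keepdim=True)       # normalize to a unit quaternion
            return {
                "scale": self.scale(roi_feat).exp(),
                "quat": q,
                "offset": self.offset(roi_feat),
                "tau": self.log_tau(roi_feat).exp(),
                "ffd": self.ffd(roi_feat),
                "mesh_logits": self.mesh_logits(roi_feat),
            }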

[Figure 3 diagram (object-wise 3D inference): for each Mask-RCNN proposal, the masked object image and bounding box are ROI-pooled and passed through fully connected layers that output the scale, rotation, translation, FFD coefficients, and a distribution over candidate mesh models (trained with REINFORCE); a 3D renderer then produces the instance map, normal map, and pose map.]

Figure 3: 3D geometric inference. Given a masked object image and its bounding box, the geometric branch of the 3D-SDN predicts the object's mesh model, scale, rotation, translation, and the free-form deformation (FFD) coefficients. We then compute 3D information (instance map, normal maps, and pose map) using a differentiable renderer [Kato et al., 2018].

3D estimation. We describe a 3D object with a mesh $M$, its scale $s \in \mathbb{R}^3$, rotation $q \in \mathbb{R}^4$ as a unit quaternion, and translation $\mathbf{t} \in \mathbb{R}^3$. For most real-world scenarios such as road scenes, objects often lie on the ground. Therefore, the quaternion has only one rotational degree of freedom, i.e., $q \in \mathbb{R}$.

As shown in Fig. 3, given an object's masked image and estimated bounding box, the geometric de-renderer learns to predict the mesh $M$ by first selecting a mesh from eight candidate shapes, and then applying a free-form deformation (FFD) [Sederberg and Parry, 1986] with inferred grid point coordinates $\phi$. It also predicts the scale, rotation, and translation of the 3D object. Below we describe the training objective for the network.

3D attribute prediction loss. The geometric de-renderer directly predicts the values of scale $s$ and rotation $q$. For translation $\mathbf{t}$, it instead predicts the object's distance to the camera $t$ and the image-plane 2D coordinates of the object's 3D center, denoted as $[x_{3D}, y_{3D}]$. Given the intrinsic camera matrix, we can calculate $\mathbf{t}$ from $t$ and $[x_{3D}, y_{3D}]$. We parametrize $t$ in log-space [Eigen et al., 2014]. As determining $t$ from the image patch of the object is under-constrained, our model predicts a normalized distance $\tau = t\sqrt{wh}$, where $[w, h]$ is the width and height of the bounding box. This reparameterization improves results, as shown in later experiments (Sec. 4.2). For $[x_{3D}, y_{3D}]$, we follow the prior work [Ren et al., 2015] and predict the offset $e = [(x_{3D} - x_{2D})/w,\ (y_{3D} - y_{2D})/h]$ relative to the estimated bounding box center $[x_{2D}, y_{2D}]$. The 3D attribute prediction loss for scale, rotation, and translation can be calculated as

$\mathcal{L}_\text{pred} = \|\log \tilde{s} - \log s\|_2^2 + \big(1 - (\tilde{q} \cdot q)^2\big) + \|\tilde{e} - e\|_2^2 + (\log \tilde{\tau} - \log \tau)^2, \quad (1)$

where $\tilde{\cdot}$ denotes the predicted attributes; the four terms penalize errors in scale, rotation, 2D offset, and depth, respectively.
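Below is a minimal PyTorch-style sketch of the prediction loss in Eq. (1). Tensor names, shapes, and batching are illustrative assumptions; inputs are expected to satisfy the parameterization above (positive scales, unit quaternions, positive normalized distances).

    import torch

    def prediction_loss(s_pred, s_gt, q_pred, q_gt, e_pred, e_gt, tau_pred, tau_gt):
        """Eq. (1): log-scale, quaternion, 2D-offset, and log-depth terms.
        s_*: (B, 3) scales, q_*: (B, 4) unit quaternions,
        e_*: (B, 2) offsets, tau_*: (B,) normalized distances."""
        scale_term = (torch.log(s_pred) - torch.log(s_gt)).pow(2).sum(dim=1)
        # 1 - (q_pred . q_gt)^2 is insensitive to the sign ambiguity of quaternions.
        rotation_term = 1.0 - (q_pred * q_gt).sum(dim=1).pow(2)
        offset_term = (e_pred - e_gt).pow(2).sum(dim=1)
        depth_term = (torch.log(tau_pred) - torch.log(tau_gt)).pow(2)
        return (scale_term + rotation_term + offset_term + depth_term).mean()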

Reprojection consistency loss. We also use a reprojection loss to ensure that the 2D rendering of the predicted shape fits its silhouette $S$ [Yan et al., 2016b, Rezende et al., 2016, Wu et al., 2016a, 2017b]. Fig. 4a and Fig. 4b show an example. Note that for mesh selection and deformation, the reprojection loss is the only training signal, as we do not have a ground-truth mesh model.

We use a differentiable renderer [Kato et al., 2018] to render the 2D silhouette of a 3D mesh $M$, according to the FFD coefficients $\phi$ and the object's scale, rotation, and translation $\pi = \{s, q, \mathbf{t}\}$: $\tilde{S} = \mathrm{RenderSilhouette}(\mathrm{FFD}_\phi(M), \pi)$. We then calculate the reprojection loss as $\mathcal{L}_\text{reproj} = \|\tilde{S} - S\|$. We ignore the region occluded by other objects. The full loss function for the geometric branch is thus $\mathcal{L}_\text{pred} + \lambda_\text{reproj}\mathcal{L}_\text{reproj}$, where $\lambda_\text{reproj}$ controls the relative importance of the two terms.

3D model selection via REINFORCE. We choose the mesh $M$ from a set of eight meshes to minimize the reprojection loss. As the model selection process is non-differentiable, we formulate the model selection as a reinforcement learning problem and adopt a multi-sample REINFORCE paradigm [Williams, 1992] to address the issue. The network predicts a multinomial distribution over the mesh models. We use the negative reprojection loss as the reward. We experimented with a single mesh without FFD in Fig. 4c. Fig. 4d shows a significant improvement when the geometric branch learns to select from multiple candidate meshes and allows flexible deformation.
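The following is a hedged sketch of the multi-sample REINFORCE update for mesh selection, using the negative reprojection loss as reward and the mean over samples as a baseline. render_silhouette is a stand-in for the differentiable renderer; all names and the baseline choice are assumptions for illustration, not the authors' training code.

    import torch

    def mesh_selection_loss(mesh_logits, meshes, ffd, pose, target_sil,
                            render_silhouette, num_samples=4):
        """mesh_logits: (K,) scores over K candidate meshes for one object.
        render_silhouette(mesh, ffd, pose) -> predicted silhouette (H, W).
        target_sil: observed silhouette (H, W).
        Returns a surrogate loss whose gradient is the REINFORCE estimate."""
        dist = torch.distributions.Categorical(logits=mesh_logits)
        samples = dist.sample((num_samples,))
        rewards = []
        for k in samples:
            sil = render_silhouette(meshes[int(k)], ffd, pose)
            rewards.append(-(sil - target_sil).abs().mean())   # reward = -L_reproj
        rewards = torch.stack(rewards)
        baseline = rewards.mean()                              # multi-sample baseline
        log_probs = dist.log_prob(samples)
        # Minimizing this surrogate performs gradient ascent on the expected reward.
        return -((rewards - baseline).detach() * log_probs).mean()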


Figure 4: (a)(b) Re-projection consistency loss: object silhouettes rendered without and with the re-projection consistency loss. (c)(d) Multiple CAD models and free-form deformation (FFD): in (c), a generic car model without FFD fails to represent the input vans; in (d), our model learns to choose the best-fitting mesh from eight candidate meshes and allows FFD. As a result, we can reconstruct the silhouettes more precisely.

Textural De-renderer & Renderer

• Image Translator Network takes a texture code map, semantic label map, normal map, object edge map, and object pose map as input, generating an image [pix2pixHD] (a sketch of the per-instance texture-code pooling follows this list).
• GAN Loss accounts for photorealism (Eq. (2) below).
• Feature Matching Loss stabilizes training (Eq. (3)).
• Reconstruction Loss encourages faithful reconstruction of the input (Eq. (4)).
• Minimax Game trains $(G, E)$ against $D$ on the combined loss (Eq. (5)).
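The sketch below shows how a per-instance texture code map could be built by instance-wise average pooling of the textural de-renderer's features, in the style of the pix2pixHD encoder [Wang et al., 2018]. It is a simplified illustration under assumed shapes, not the released implementation.

    import torch

    def instance_texture_code_map(features, instance_map):
        """features:     (C, H, W) feature map from the textural de-renderer E.
        instance_map: (H, W) integer ids (background classes and objects).
        Returns a (C, H, W) map in which every pixel of an instance carries
        that instance's pooled, low-dimensional texture code."""
        code_map = torch.zeros_like(features)
        for inst_id in instance_map.unique():
            mask = instance_map == inst_id
            code = features[:, mask].mean(dim=1)               # (C,)
            code_map[:, mask] = code.unsqueeze(1).expand(-1, int(mask.sum()))
        return code_map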

3.2 Semantic and Textural Inference

The semantic branch of the 3D-SDN uses a semantic segmentation model, DRN [Yu et al., 2017, Zhou et al., 2017], to obtain a semantic map of the input image. The textural branch of the 3D-SDN first obtains an instance-wise semantic label map $L$ by combining the semantic map generated by the semantic branch and the instance map generated by the geometric branch, resolving any conflict in favor of the instance map [Kirillov et al., 2018]. Built on recent work on multimodal image-to-image translation [Zhu et al., 2017b, Wang et al., 2018], our textural branch encodes the texture of each instance into a low-dimensional latent code, so that the textural renderer can later reconstruct the appearance of the original instance from the code. By 'instance' we mean a background semantic class (e.g., road, sky) or a foreground object (e.g., car, van). Later, we combine the object textural code with the estimated 3D information to better reconstruct objects.

Formally speaking, given an image $I$ and its instance label map $L$, we want to obtain a feature embedding $z$ such that $(L, z)$ can later reconstruct $I$. We formulate the textural branch of the 3D-SDN under a conditional adversarial learning framework with three networks $(G, D, E)$: a textural de-renderer (encoder) $E: (L, I) \to z$, a texture renderer (generator) $G: (L, z) \to I$, and a discriminator $D: (L, I) \to [0, 1]$ are trained jointly with the following objectives.

To increase the photorealism of generated images, we use a standard conditional GAN loss [Goodfellow et al., 2014, Mirza and Osindero, 2014, Isola et al., 2017]:

$\mathcal{L}_\text{GAN}(G, D, E) = \mathbb{E}_{L,I}\left[\log D(L, I) + \log\big(1 - D(L, \tilde{I})\big)\right], \quad (2)$

where $\tilde{I} = G(L, E(L, I))$ is the reconstructed image and $\mathbb{E}_{L,I}$ denotes $\mathbb{E}_{(L,I)\sim p_\text{data}(L,I)}$ for simplicity. To stabilize the training, we follow the prior work [Wang et al., 2018] and use both a discriminator feature matching loss [Wang et al., 2018, Larsen et al., 2016] and a perceptual loss [Dosovitskiy and Brox, 2016, Johnson et al., 2016], both of which aim to match the statistics of intermediate features between generated and real images:

$\mathcal{L}_\text{FM}(G, D, E) = \mathbb{E}_{L,I}\left[\sum_{i=1}^{T_F}\frac{1}{N_i}\left\|F^{(i)}(I) - F^{(i)}(\tilde{I})\right\|_1 + \sum_{i=1}^{T_D}\frac{1}{M_i}\left\|D^{(i)}(I) - D^{(i)}(\tilde{I})\right\|_1\right], \quad (3)$

where $F^{(i)}$ denotes the $i$-th layer of a pre-trained VGG network (the featurizer) [Simonyan and Zisserman, 2015] with $N_i$ elements. Similarly, for our discriminator $D$, $D^{(i)}$ denotes the $i$-th layer with $M_i$ elements. $T_F$ and $T_D$ denote the number of layers in networks $F$ and $D$. We fix the network $F$ during training. Finally, we use a pixel-wise image reconstruction loss:

$\mathcal{L}_\text{Recon}(G, E) = \mathbb{E}_{L,I}\left[\left\|I - \tilde{I}\right\|_1\right]. \quad (4)$

The final training objective is formulated as a minimax game between $(G, E)$ and $D$:

$G^*, E^* = \arg\min_{G,E}\Big(\max_D \mathcal{L}_\text{GAN}(G, D, E) + \lambda_\text{FM}\mathcal{L}_\text{FM}(G, D, E) + \lambda_\text{Recon}\mathcal{L}_\text{Recon}(G, E)\Big), \quad (5)$

where $\lambda_\text{FM}$ and $\lambda_\text{Recon}$ control the relative importance of each term.
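Below is a sketch of the generator-side objective of Eqs. (2)-(5) for one batch. The interfaces of D (returning per-layer features with the logits last) and of the fixed VGG featurizer F are assumptions; the GAN term uses the common non-saturating variant, and the discriminator's own update is omitted. This is not the released training code.

    import torch
    import torch.nn.functional as F

    def textural_generator_loss(D, F_vgg, real, fake, label_map,
                                lambda_fm=10.0, lambda_recon=10.0):
        """real, fake: (B, 3, H, W); fake = G(label_map, E(label_map, real)).
        D(label_map, image) -> list of feature maps, last entry = logits.
        F_vgg(image)        -> list of VGG feature maps (fixed network)."""
        d_real, d_fake = D(label_map, real), D(label_map, fake)

        # Eq. (2): generator GAN term (non-saturating variant).
        gan = F.binary_cross_entropy_with_logits(
            d_fake[-1], torch.ones_like(d_fake[-1]))

        # Eq. (3): match discriminator and VGG features of real and fake images;
        # mean L1 per layer plays the role of the 1/N_i and 1/M_i normalizations.
        fm = sum(F.l1_loss(a, b.detach()) for a, b in zip(d_fake[:-1], d_real[:-1]))
        fm = fm + sum(F.l1_loss(a, b) for a, b in zip(F_vgg(fake), F_vgg(real)))

        # Eq. (4): pixel-wise L1 reconstruction.
        recon = F.l1_loss(fake, real)

        # Eq. (5): weighted sum minimized over (G, E).
        return gan + lambda_fm * fm + lambda_recon * recon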



Semantic De-renderer

• DRN generates a semantic label map for the background; it is merged with the geometric branch's instance map into the instance-wise label map $L$ (a small sketch of this step follows).
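The sketch below overlays the geometric branch's instance map on the DRN semantic map, resolving conflicts in favor of instances as described in the excerpt above. The id convention (0 = no instance, instance ids offset by a constant) is an assumption for illustration.

    import numpy as np

    def build_instance_label_map(semantic_map, instance_map, instance_offset=1000):
        """semantic_map: (H, W) int class ids from the semantic de-renderer (DRN).
        instance_map: (H, W) int, 0 for background, k > 0 for the k-th object.
        Pixels covered by an object get an instance-specific id; elsewhere the
        background semantic label is kept."""
        label_map = semantic_map.copy()
        covered = instance_map > 0
        label_map[covered] = instance_offset + instance_map[covered]
        return label_map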

Virtual KITTI Editing Benchmark

  {
    …
    "operations": [
      {
        "type": "modify",
        "from": {"u": "750.9", "v": "213.9"},
        "to": {"u": "804.4", "v": "227.1", "roi": [194, 756, 269, 865]},
        "zoom": "1.338",
        "ry": "0.007"
      },
      …
      {
        "type": "delete",
        "from": {"u": "1328.5", "v": "271.3"}
      }
    ]
  }

• 92 pairs of images picked from the Virtual KITTI dataset.
• Each pair specifies its editing operations in .json format (vkitti_editing_benchmark.json); a loading sketch follows this panel.

[Illustration: a "modify" operation moves the object at the source (u, v) to a target (u, v) within a region of interest (roi) and rescales/rotates it; a "delete" operation removes the object at (u, v).]
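The following is a hedged sketch of loading the benchmark and applying one entry's operations to a de-rendered scene. Only the keys visible in the JSON above are used; the overall file layout, the object-matching rule, and the scene helpers (nearest_object, remove_object, transform_object) are hypothetical.

    import json

    def load_benchmark(path="vkitti_editing_benchmark.json"):
        """Each benchmark entry pairs a source frame with its edit operations."""
        with open(path) as f:
            return json.load(f)

    def apply_operations(scene, entry):
        """Apply 'delete' / 'modify' operations; objects are matched to the
        operation's source (u, v) by nearest projected 2D center (assumed rule)."""
        for op in entry["operations"]:
            u, v = float(op["from"]["u"]), float(op["from"]["v"])
            idx = scene.nearest_object(u, v)                 # hypothetical helper
            if op["type"] == "delete":
                scene.remove_object(idx)                     # hypothetical helper
            elif op["type"] == "modify":
                scene.transform_object(                      # hypothetical helper
                    idx,
                    u=float(op["to"]["u"]), v=float(op["to"]["v"]),
                    zoom=float(op.get("zoom", 1.0)),
                    ry=float(op.get("ry", 0.0)))
        return scene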

• 2D/2D+ baselines: use only the texture code map and semantic label map; object edits are naïve 2D translation and scaling (2D+ additionally allows out-of-plane rotation).

Image Editing Examples

[Original image vs. edited images, panels (a)-(d).
Virtual KITTI sources: (0006, overcast, 00067) (white: 0001, 30-deg-left, 00420; black: 0020, overcast, 00101); (0020, clone, 00073); (0018, fog, 00324); (0006, 30-deg-right, 00043).
Cityscapes sources: (berlin, 000044, 000019); (bonn, 000031, 000019); (bielefeld, 000000, 012584); (munich, 000005, 000019).]




Results: 3D-SDN (ours) vs. 2D Method

[Bar chart: perception similarity (LPIPS) scores, lower is better, for 3D-SDN, 2D, and 2D+ on the Virtual KITTI editing benchmark, evaluated over the whole image, all edited regions, and the largest edited region; scores fall roughly between 0.100 and 0.200.]

Human study scores (pairwise preference, higher is better):
• 3D-SDN vs. 2D: 76.9% vs. 23.1%
• 3D-SDN vs. 2D+: 74.3% vs. 25.7%

References

[pix2pixHD] Wang et al. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In CVPR, 2018.