
Image Enhancement with Conditional Adversarial Networks

Jiahui Wei
[email protected]

Yuan-Ping Chen
[email protected]

Peng Ren
[email protected]

Bo-Syun Liao
[email protected]

Abstract

In this project we explore the possibility of using Conditional Adversarial Networks (Conditional GAN) to enhance images. Conditional Adversarial Networks can learn an image-to-image translation and apply that translation to future images. We use a Conditional GAN to learn the translation from original images to enhanced images and automatically translate original images into the images we want. We compare the purpose and architecture of the Conditional GAN with those of Cycle-Consistent Adversarial Networks (Cycle GAN). We train both Conditional GAN and Cycle GAN on our images, and the results demonstrate that Conditional GAN is a promising approach for image enhancement.

1 Introduction

Image enhancement is a very popular topic, as people want to retouch images to look better, but it takes a lot of time to learn the techniques for retouching images. Image enhancement is also a subjective task, as different people have different preferences and judgments about enhanced images.

Conditional GANs have been proposed as a general-purpose solution to image-to-image translation problems [1]. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. The approach has proven effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images.

However, for many tasks, paired training data is not available. [2] presents an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. A mapping G: X → Y is learned such that the distribution of images from G(X) is indistinguishable from the distribution Y under an adversarial loss. Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, and season transfer.

Image enhancement data is difficult to acquire, and in most cases we cannot obtain pixel-to-pixel aligned images. In our project we train both Conditional GAN and Cycle GAN on such images and try to understand their potential and limitations for image enhancement.

2 Related Work

2.1 Image-to-Image Translation with Conditional Adversarial Networks

As is well known, GANs [3] consist of a generator G and a discriminator D. In image translation, the generator G tries to produce "fake" images that cannot be distinguished from "true" images.



In contrast, the discriminator D tries to detect the "fake" images generated by the generator G.

Traditional GANs such as DCGAN [4] and WGAN [5] differ from the conditional GAN in the following way. With an ordinary GAN, we provide only a random noise vector z and the output image y, and the network learns the mapping from z to y. With a conditional GAN, we also provide the observed image x together with z and y, so that both the generator and the discriminator can observe the image x. The following formula is the objective of a conditional GAN:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y \sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))] \tag{1}

In addition, [1] found it beneficial to mix the objective above with an L1 distance term:

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y \sim p_{data}(x,y),\, z \sim p_z(z)}[\| y - G(x, z) \|_1] \tag{2}

Thus, their final objective is:

G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \tag{3}
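As an illustration of how Eqs. (1)-(3) can be computed in practice, the following is a minimal PyTorch-style sketch. It is our assumption of a typical implementation, not the exact code of [1] or [7]; G, D, the paired batch (x, y), and the weight lam are placeholder names, and the noise z is assumed to enter implicitly through dropout in the generator, as in pix2pix.

import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y, lam=100.0):
    """Discriminator and generator losses corresponding to Eqs. (1)-(3)."""
    fake = G(x)  # plays the role of G(x, z); z enters implicitly via dropout inside G
    # The discriminator sees the input image concatenated with the real or generated output.
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # The generator tries to fool D and to stay close to y in L1 distance (Eq. 2).
    d_fake_for_g = D(torch.cat([x, fake], dim=1))
    loss_G = F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g)) + \
             lam * F.l1_loss(fake, y)
    return loss_D, loss_G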

The network is defined as a combination of an encoder and a decoder. Figure 1 shows the structure of the encoder and the decoder. In the encoder, the input is first passed through a convolutional layer and then a batch normalization layer; a ReLU then produces the final activation of the block. The decoder has a similar structure, with the convolutional layer replaced by a deconvolutional (transposed convolution) layer.

Figure 1: Structure of Encoder and Decoder

As shown in Figure 2, the structure of the generator network is similar to sequence-to-sequence translation models in NLP. The input is a 256 × 256 image with three color channels. After a series of encoding steps, we obtain a high-dimensional vector representation of the input image, and after a series of decoding steps, an output image is generated from this vector. Note that skip connections are added so that details of the input image can be preserved.

Figure 2: Structure of generator network
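To make the encoder/decoder blocks of Figure 1 and the skip-connected generator of Figure 2 concrete, here is a hedged PyTorch sketch. It uses only three encoder/decoder levels and common pix2pix-style settings (4 × 4 kernels, stride 2); the actual generator in [1] is deeper, so the layer counts and channel sizes here are assumptions for illustration only.

import torch
import torch.nn as nn

def encoder_block(c_in, c_out):
    # Conv -> BatchNorm -> ReLU, halving the spatial resolution.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def decoder_block(c_in, c_out):
    # Deconv (transposed conv) -> BatchNorm -> ReLU, doubling the resolution.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TinyGenerator(nn.Module):
    """Three-level encoder-decoder with skip connections (U-Net style)."""
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = encoder_block(3, 64), encoder_block(64, 128), encoder_block(128, 256)
        self.d3 = decoder_block(256, 128)
        self.d2 = decoder_block(128 + 128, 64)              # input includes the skip from e2
        self.d1 = nn.ConvTranspose2d(64 + 64, 3, 4, 2, 1)   # input includes the skip from e1
    def forward(self, x):                                   # x: (N, 3, 256, 256)
        h1 = self.e1(x)                                     # (N, 64, 128, 128)
        h2 = self.e2(h1)                                    # (N, 128, 64, 64)
        h3 = self.e3(h2)                                    # (N, 256, 32, 32)
        u3 = self.d3(h3)                                    # (N, 128, 64, 64)
        u2 = self.d2(torch.cat([u3, h2], dim=1))            # skip connection from e2
        return torch.tanh(self.d1(torch.cat([u2, h1], dim=1)))  # (N, 3, 256, 256)

The final tanh maps outputs to [-1, 1], matching images normalized to that range.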



As shown in Figure 3, the structure of the discriminator network is similar to what is commonly used for classification with a CNN. There are two inputs: in is an input image and unknown represents an image that is either generated by the generator G or a target image. The network can be viewed as operating on a 6-channel image instead of the 3 channels of a traditional CNN. Note that the output is a 30 × 30 image with one channel. Each pixel of that output corresponds to a 70 × 70 patch in the input image, and the pixel value gives the probability that the patch was generated.

Figure 3: Structure of discriminator network
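The following is a hedged sketch of such a patch discriminator in PyTorch. It takes the 6-channel concatenation of the input image and the real or generated output and produces a 30 × 30 map of real/fake logits, one per overlapping 70 × 70 patch; the exact channel widths are our assumption, following the common 70 × 70 PatchGAN configuration rather than the paper's exact code.

import torch.nn as nn

patch_discriminator = nn.Sequential(
    # Input: (N, 6, 256, 256) -- the input image and the unknown image stacked along channels.
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),                          # -> 128 x 128
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # -> 64 x 64
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # -> 32 x 32
    nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),  # -> 31 x 31
    nn.Conv2d(512, 1, 4, stride=1, padding=1),                                            # -> (N, 1, 30, 30)
)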

2.2 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

To address the setting where pixel-to-pixel aligned data cannot be provided, Cycle GAN is introduced. The network structure is quite similar, using an encoder and decoder in the generator network and a traditional CNN in the discriminator network. The idea is to use a cycle between the two domains instead of a single directional connection.

Figure 4 illustrates the idea of cycle consistency. X and Y represent two different sets of images, G represents a mapping from X to Y, F represents a mapping from Y to X, and DX and DY represent two discriminator networks.

First, x is mapped into domain Y by G, and DY tries to distinguish the generated image from the images in the collection Y. Likewise, y is mapped into domain X by F, and DX tries to distinguish the result from the images in the collection X. Mapping the output image back to the original image prevents the network from mapping all images to one single output and getting stuck in a local minimum.

Figure 4: Idea of Cycle-Consistent

To ensure good results, a cycle consistency loss is used in addition to the classical adversarial loss:

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1] \tag{4}

F(G(x)) means that the input x is first mapped to the Y domain and the result is then mapped back to the X domain. Similarly, G(F(y)) means that the input y is first mapped to the X domain and the result is then mapped back to the Y domain. The aim of this loss function is to ensure that F(G(x)) is similar to x and that G(F(y)) is similar to y. In this way, the network can generate better results.
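A minimal sketch of Eq. (4), assuming G and F are the two generator networks and x, y are batches from the two domains (the names are placeholders, not the exact code of [7]):

import torch.nn.functional as nnf

def cycle_consistency_loss(G, F_map, x, y):
    # F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y, under an L1 penalty.
    # F_map stands for the mapping F from Y back to X.
    return nnf.l1_loss(F_map(G(x)), x) + nnf.l1_loss(G(F_map(y)), y)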

The following formula is the full objective of Cycle-GAN:

\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) \tag{5}



where λ controls the relative importance of the two objectives. The aim is then to solve:

G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \tag{6}

This model can be viewed as training two auto-encoders: one auto-encoder F ∘ G: X → X is learned jointly with another, G ∘ F: Y → Y. However, these auto-encoders each have a special internal structure: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of "adversarial autoencoders", which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In this case, the target distribution for the X → X auto-encoder is that of domain Y.
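For illustration, the generator-side part of the full objective in Eqs. (5)-(6) could be computed as below. The least-squares form of the adversarial terms and λ = 10 follow common CycleGAN practice but are assumptions here, as are all the names; the discriminators are updated separately with the opposite targets.

import torch
import torch.nn.functional as nnf

def generator_objective(G, F_map, D_X, D_Y, x, y, lam=10.0):
    fake_y, fake_x = G(x), F_map(y)
    # Adversarial terms: each generator tries to make its discriminator output "real".
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    adv = nnf.mse_loss(pred_y, torch.ones_like(pred_y)) + nnf.mse_loss(pred_x, torch.ones_like(pred_x))
    # Cycle-consistency term from Eq. (4), weighted by lambda.
    cyc = nnf.l1_loss(F_map(fake_y), x) + nnf.l1_loss(G(fake_x), y)
    return adv + lam * cyc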

3 Implementation

The MIT-Adobe FiveK Dataset [6] is used for evaluation. This dataset consists of 5,000 images taken by a group of photographers, covering various scenes, subjects, and lighting conditions. Five photography students from an art school were hired to retouch these images and make them visually pleasing. The images are pixel-to-pixel aligned and are therefore ideal training data for our network. We choose the original images and the images retouched by Expert A as our input and output.

We used the PyTorch code released with the two papers [7] and adapted it to work with our training data. The network is trained and tested on UCSD's remote GPU server, which is shared with UCSC.

To reduce the training time, all input images are resized to 256 × 256 and the maximum number of epochs is set to 200. Larger output image sizes could be achieved by adding more encoders and decoders to the networks. We use a batch size of 8 to increase the training speed.
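A small sketch of this preprocessing, assuming torchvision transforms and a hypothetical paired dataset class PairedFiveKDataset (the actual data loading in [7] differs in detail):

from torch.utils.data import DataLoader
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),                            # resize every image to 256 x 256
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # map pixel values to [-1, 1]
])

# PairedFiveKDataset is a placeholder for a dataset yielding (original, Expert A) pairs.
# loader = DataLoader(PairedFiveKDataset("fivek/", transform=preprocess),
#                     batch_size=8, shuffle=True)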

4 Evaluation

To keep the evaluation objective, we use the images from Expert A in the dataset as ground truth and focus only on comparing the output images with this ground truth.

It took us 4 days to train the pix2pix network. As shown in our presentation slides, the results are already quite good at about 80 epochs. The results after 200 epochs are better at details and adapt better to different scenes.

As shown in Figure 5, the output image is quite similar to the target image, which implies that our model works well in this case. The network can learn the translation between image pairs and apply the proper translation to new images.

Failure cases are shown in Figure 6. In the first row of images, the output image is similar to the target image, but there is a strange white glow around the bird. We think the main cause lies in the limitations of the convolutional layer structure: the convolutional kernel mixes pixels that are similar, such as the sky and the neck of the bird, and fails to translate these regions well. The second row also shows a bad result. The tone of our output image is totally different from the input image and the target image, which implies that pixel-to-pixel translation fails to capture global color temperature changes in the images.

We train Cycle-GAN with the same images to compare the results. It took 7 days to train the network; training takes much longer because the cycle training for each image consumes more time. Similar to the Conditional GAN, the results are already very good at an early stage of about 70 epochs, and the remaining epochs learn more difficult representations of color changes.

In Figure 7, we can see that Cycle-GAN obtains results quite similar to those of the Conditional GAN. However, there are some differences: in the second row of chairs the images are darker, and when you look closely at the details you can see that the front chair is brighter while the back chairs are darker.

Interestingly, Cycle GAN also does poorly on the images on which the Conditional GAN fails, as shown in Figure 8. However, the image with the bird does not have the strange white circle around the neck, because Cycle GAN does not focus on exact pixel-to-pixel translation but rather on general style changes.



Figure 5: Results from Conditional GAN

Figure 6: Failure case with Conditional GAN

In the second row of grass, Cycle GAN handles the color temperature change better than the Conditional GAN, although the result is still far from the ground truth.

From the results, we can see that Conditional GAN and Cycle GAN can perform well in some image enhancement scenarios. In the future, this could change the way images are retouched: each person could have a personal network that knows the style and changes he or she prefers, instead of using generic filters that are the same for everyone.



Figure 7: Results from Cycle-GAN

Figure 8: Failure case with Cycle-GAN

The results also reveal some limitations of the networks. The pixel-to-pixel Conditional GAN is good at exact pixel-to-pixel learning but may show strange behavior, such as the white circle around the bird. Cycle GAN is good at learning style changes, but it lacks exact pixel-to-pixel information, which makes it more of an enhancement filter than a detailed retouching tool.

5 Extra Work on Edge2Pokemon

Besides the image enhancement network, we use a Conditional GAN to learn the translation between edge images and Pokemon images. The network is built upon [8].



Figure 9 illustrates our results after around 130 epochs. The output image is similar to the ground truth image, which implies that our model works quite well. It is interesting to see that in the second row the eye color of the Pokemon changes: the target Pokemon has red eyes while the model outputs a Pokemon with yellow eyes. Moreover, as we can see in the third row, the Pokemon in the edge image is missing its right eye, but the model still generates it, which is also quite interesting.

Figure 9: The Edge2Pokemon

6 Summary and Future Work

In this project we implemented Conditional Adversarial Networks for image enhancement. The network takes in pairs of images and learns the translation between them. We learned the basic structure and the mathematical loss of the Conditional Adversarial Network. This will help us better understand what GANs are and what we can do with them in the future.

To improve the project results, we will focus on changing the original structure to combine local pixel-to-pixel features with global changes. We will also try to change the structure to generate images of larger size.

Acknowledgments

For this project, we want to thank the authors of [1] and [2] for providing this amazing idea of using GANs to do image translation. We also want to thank Christopher Hesse for his implementation of pixel-to-pixel Conditional GAN in TensorFlow; we built our Edge2Pokemon upon his work.

We want to thank Professor Manfred K. Warmuth for his teaching during this quarter; we learned a lot about machine learning as well as about the attitude needed for future graduate study. We also want to thank Ehsan Amid and Tianyi Luo for serving as TAs for this course. They helped us better understand the lecture materials as well as other questions about machine learning.



References

[1] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[6] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[7] Jun-Yan Zhu. pytorch-CycleGAN-and-pix2pix. https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix, 2017.

[8] Christopher Hesse. pix2pix-tensorflow. https://github.com/affinelayer/pix2pix-tensorflow, 2017.
