


UNSUPERVISED DOMAIN ADAPTATION BY OPTICAL FLOW AUGMENTATION IN SEMANTIC SEGMENTATION (A PROPOSAL)

A PREPRINT

Azeez Oluwafemi
Carnegie Mellon University Africa

Kigali, Rwanda
oazeez@andrew.cmu.edu

November 22, 2019

ABSTRACT

It is expensive to generate real-life image labels, and there is a domain gap between real-life and simulated images, so a model trained on the latter does not readily transfer to the former. Closing this gap could eliminate the need to label real-life datasets. Class-balanced self-training is one existing technique that attempts to reduce the domain gap. Moreover, augmenting RGB images with optical flow maps has improved performance in plain semantic segmentation, and geometry is preserved across domains. Hence, augmenting images with dense optical flow maps may improve domain adaptation in semantic segmentation.

Keywords Semantic segmentation · Domain Adaptation · Optical Flow

1 Introduction

Semantic segmentation is the task of labeling every pixel in an image by assigning each pixel to a specific class. The problem can be posed as a supervised learning one: a machine learning model is trained on a labeled dataset and then tested on an unlabelled one. Labeling a dataset is commonly referred to as annotating.

Annotating datasets can be very expensive in terms of time and labor; for example, annotating a single Cityscapes image takes about 90 minutes [2]. In an attempt to overcome such limitations, rendered scenes or synthetic datasets were created, such as Grand Theft Auto V [13], SYNTHIA [14], Virtual KITTI [5] and VIPER [12]. However, there is a large appearance-based domain gap between synthetic and real datasets, and models trained on synthetic data tend to perform poorly when tested on real-life data as a result. If this gap can be closed or made negligible, then we can bypass collecting real-life annotations entirely: annotate synthetic datasets at scale, train on them, and adapt the resulting models for real-life inference.

According to Yuhua Chen et al. [18], rich geometric information that can be obtained easily and cheaply from synthetic data, such as surface normals, optical flow and depth, has been overlooked; geometric information does not suffer from domain shift, and there is a strong correlation between semantics and geometry. The authors built a deep network that jointly reasons about depth and semantics. This inspired me to consider reasoning about optical flow instead: would augmenting an appearance (RGB) dataset with dense optical flow improve domain adaptation in semantic segmentation?

Mean Intersection over Union (mIoU) is the standard metric for measuring the quality of a semantic segmentation [3]. IoU is estimated for each class, and the mean is taken over all available classes. If augmentation with optical flow increases the IoU of motion-sensitive classes, it would increase the mIoU for the whole task, thereby reducing the overall domain gap between synthetic and real data.
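For concreteness, a minimal sketch of how per-class IoU and mIoU can be computed from a pixel-level confusion matrix is shown below. This is an illustrative Python/NumPy helper, not the official Cityscapes evaluation script, which would be used for the actual benchmark numbers.

import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Mean IoU from a (C x C) confusion matrix where entry (i, j)
    counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf_matrix).astype(float)        # true positives per class
    fp = conf_matrix.sum(axis=0) - tp              # false positives per class
    fn = conf_matrix.sum(axis=1) - tp              # false negatives per class
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))                  # average over classes that appear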

I would work on adapting a model trained on the synthetic GTA5 scenes [13] to the real-life Cityscapes dataset [2]. The unsupervised training technique I plan to use is class-balanced self-training [20].

I would then use mIoU [3] as the measurement metric. Optical flow maps would be generated with FlowNet v2 [7]. A better mIoU than the one obtained by Zou, Yang et al. [20], whose images were not augmented at all, would indicate that augmenting with optical flow actually improved domain adaptation.

This would extend the previous work of Zou, Yang et al. [20] and Yuhua Chen et al. [18]: what I would do differently from Zou, Yang et al. [20] is to augment the images with optical flow, inspired by Yuhua Chen et al.'s [18] use of depth in a completely different architecture. It would also be a new idea, since this would be the first time images are augmented with flow maps for the purpose of domain adaptation, giving insight into the role of optical flow maps in helping to close the domain gap.

The approach would be to generate flow maps using FlowNet v2, the OpenCV Farneback function, or ground truth (if available). The existing RGB images would then be augmented with the generated flow maps and finally trained with class-balanced self-training. The datasets used would be GTA5 and Cityscapes, on which 19 classes would be evaluated: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bike. The GTA5 dataset has 24,966 images of size 1052 by 1914.

2 Literature Review

There has been steady progress in deep network architectures for semantic segmentation; one such architecture is DeepLab v2 [1]. Input images are passed through atrous (dilated) convolutional layers, which help adjust the field of view and control the resolution of the feature map produced by a deep convolutional neural network. A coarse score map is produced and upsampled by bilinear interpolation; finally, a fully connected conditional random field (CRF) is used as a post-processing step to incorporate low-level details into the segmentation result, since skip connections are not used here. In attempts to close domain gaps, the main idea has been to reduce the distance between source and target distributions by learning embeddings that are invariant to domain, as in the Deep Adaptation Network (DAN) architecture, which generalizes standard convolutional networks to domain adaptation [9]. The backbone convolutional neural network for DeepLab v2 in this setting is ResNet-38 [17], which has been pretrained on the ImageNet dataset [4] via transfer learning [10].
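For reference, the atrous convolution at the heart of DeepLab can be written, for a one-dimensional input signal x and a filter w of length K with dilation rate r, as

    y[i] = \sum_{k=1}^{K} x[i + r \cdot k] \, w[k],

so that r = 1 recovers standard convolution, while larger r enlarges the field of view without adding parameters or reducing the feature-map resolution.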

One of the first adversarial approaches to unsupervised domain adaptation was CyCADA (cycle-consistent adversarial domain adaptation) [6], which combined cycle consistency with adversarial domain adaptation. In cycle consistency, the goal is to learn a mapping G : X → Y, where X is the source domain and Y is the target domain, training G(X) ≈ Y with an adversarial loss. Since this is under-constrained, a cycle-consistency loss is introduced to enforce F(G(X)) ≈ X by also learning the mapping F : Y → X [19]. In CyCADA, the goal was to train a target-domain encoder by trying to make its representation indistinguishable from that of a pretrained source-domain encoder, and then to predict on the target domain with the source classifier while avoiding divergence by enforcing cycle consistency [6].
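Written out, the cycle-consistency loss of Zhu et al. [19] that CyCADA builds on is

    L_cyc(G, F) = E_{x \sim X} [ \| F(G(x)) - x \|_1 ] + E_{y \sim Y} [ \| G(F(y)) - y \|_1 ],

which penalizes the reconstruction error after mapping a sample to the other domain and back, and is added to the adversarial losses on G and F.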

Yuhua Chen et al. [18] took this idea further by exploiting the correlation between semantics and geometry, specifically using depth maps. They employed both input-level and output-level adaptation. A realistic target-style image is generated by a generator Gimg, which takes a combination of synthetic image, synthetic label and synthetic depth map as input, and a discriminator Dimg, which takes the target image and the generator output and predicts fake or real. The generator output is then used for the task of producing labels and depth maps, again with a generator and discriminator network. The intuition for output adaptation is that the network will produce good semantic labels if it also has to produce good depth maps, since this ensures the correlation between semantics and geometry is exploited.

Zou, Yang et al. [20] then applied the concept of class-balanced self-training. A self-training method generates pseudo-labels on the target domain from high-confidence predictions, and the network is trained end-to-end with a unified domain- and task-specific loss minimization. The concept of "class balancing" is also introduced, which involves normalizing the confidence of each pseudo-label by a reference confidence for that particular class. Spatial priors are also introduced, since traffic scenes share similar spatial layouts. This improves cross-domain adaptation and yielded state-of-the-art results.

Rashed et al. [11] showed that augmenting color images with optical flow maps can improve semantic segmentation, in particular the IoU of moving-object classes. They augmented RGB data with optical flow maps using four CNN architectures: three were one-stream networks based on FCN8s [8], and the fourth was a two-stream network influenced by the work of Simonyan and Zisserman [16] and by MODNet by Siam et al. [15]. They used three options for the flow representation: a color-wheel representation in 3D, magnitude and direction in 2D, and magnitude in 1D. The optical flow was then normalized to the range 0–255.


The OpenCV Farneback function was used to generate optical flow magnitude, along with FlowNet v2 [7]. Their results showed generally better performance for the two-stream RGB-and-flow networks, with the 3D color-wheel representation of the flow map performing best.

3 Proposed Methodology

The general approach would be to generate flow maps (and use ground-truth flow maps where available), augment the RGB images with the optical flow maps, and then use class-balanced self-training for domain adaptation.

3.1 Optical Flow Generator

FlowNet v2 [7] was built to estimate dense optical flow and would be used here, since the Cityscapes dataset does not have ground-truth flow maps. There are three options for the flow-map representation: a color-wheel representation in three dimensions, magnitude and direction in two dimensions, and magnitude in one dimension. [11] observed better performance from the color-wheel representation when augmenting data with optical flow for semantic segmentation, so that is the representation we would choose, since our task is similar. The optical flow values would then be normalized to the range 0 to 255 so that their scale matches the corresponding RGB image. Another option for generating optical flow maps would be the OpenCV Farneback function.
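As a rough sketch of this step, the snippet below estimates Farneback flow between two consecutive frames with OpenCV and converts it to the 3D color-wheel representation scaled to 0–255. The Farneback parameters are illustrative defaults rather than tuned values, and FlowNet v2 would simply replace the flow-estimation call.

import cv2
import numpy as np

def flow_to_colour_wheel(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Dense Farneback flow between two consecutive frames, encoded as a
    3-channel colour-wheel image in the 0-255 range."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    # Positional args: pyr_scale=0.5, levels=3, winsize=15, iterations=3,
    # poly_n=5, poly_sigma=1.2, flags=0 (default settings, not tuned values).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(prev_bgr)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # hue: flow direction
    hsv[..., 1] = 255                                                 # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # value: flow magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)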

3.2 Augmentation Approach

The next stage would be to augment the RGB dataset with a flow map. [11] tried augmentation with four CNN architectures: an RGB image with an RGB encoder, a flow map with a flow encoder, concatenated RGB and flow maps, and a two-stream setup with separate RGB and flow encoders. They got their best performance with the RGB + Flow two-stream network. Concatenated RGBF would be the approach used here, since it also performs well on the plain segmentation task.
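A minimal PyTorch sketch of the concatenation (early-fusion) variant is shown below. It uses a torchvision ResNet-50 purely as a stand-in for the actual DeepLab v2 / ResNet-38 backbone; the point is only how the first convolution is widened from 3 to 6 input channels while reusing the pretrained RGB filters.

import torch
import torch.nn as nn
from torchvision.models import resnet50   # stand-in backbone, not the actual ResNet-38

def widen_input_conv(model: nn.Module, extra_channels: int = 3) -> nn.Module:
    """Replace the first convolution so the network accepts RGB + flow input."""
    old = model.conv1
    new = nn.Conv2d(3 + extra_channels, old.out_channels,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                             # keep pretrained RGB filters
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)   # initialise flow channels
    model.conv1 = new
    return model

backbone = widen_input_conv(resnet50(weights="IMAGENET1K_V1"))
rgb = torch.rand(1, 3, 512, 1024)
flow = torch.rand(1, 3, 512, 1024)               # colour-wheel flow map, scaled like an image
out = backbone(torch.cat([rgb, flow], dim=1))    # 6-channel RGBF input, forward-pass sanity check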

3.3 Class Balanced Self Training

Finally, class-balanced self-training is used to adapt a model trained on the GTA5 dataset to Cityscapes. This handles the hard case of domain adaptation where the source-domain dataset is synthetic and labeled, the target domain is unlabelled, and the task is to predict labels for the target dataset. Class-balanced self-training was introduced by [20]. A self-training method generates pseudo-labels on the target domain from high-confidence predictions, and the network is trained end-to-end with a unified domain- and task-specific loss minimization. The concept of "class balancing" involves normalizing the confidence of each pseudo-label by a reference confidence for that particular class. Spatial priors are also used, since traffic scenes share similar spatial layouts. This should improve cross-domain adaptation.
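The core per-class thresholding idea can be sketched as follows. This is a simplified, per-image illustration; the actual CBST formulation of [20] determines the thresholds from the whole target set and grows the selected portion over successive self-training rounds.

import numpy as np

def class_balanced_pseudo_labels(probs: np.ndarray, portion: float = 0.2,
                                 ignore_index: int = 255) -> np.ndarray:
    """probs: (C, H, W) softmax output on a target image.
    Keeps, per class, only the `portion` most confident pixels as pseudo-labels;
    everything else is marked ignore_index and excluded from the loss."""
    pred = probs.argmax(axis=0)    # hard prediction per pixel
    conf = probs.max(axis=0)       # its confidence
    pseudo = np.full(pred.shape, ignore_index, dtype=np.int64)
    for c in range(probs.shape[0]):
        mask = pred == c
        if not mask.any():
            continue
        # Per-class threshold: this is the "class balancing" that prevents
        # dominant classes from monopolising the pseudo-labels.
        thresh = np.quantile(conf[mask], 1.0 - portion)
        pseudo[mask & (conf >= thresh)] = c
    return pseudo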

3.4 Datasets, Framework, and Base Deep Neural Network

The datasets used would be GTA5 (synthetic and labeled) and Cityscapes (real and unlabelled), on which 19 classes would be evaluated. The GTA5 dataset has 24,966 images of size 1052 by 1914. PyTorch, a Python framework for building deep neural networks, would be used. DeepLab v2 [1], pre-trained on ImageNet [4], would be used here following the transfer-learning technique [10].
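Since DeepLab v2 with a ResNet-38 backbone is not packaged with torchvision, the sketch below substitutes torchvision's DeepLab v3 with a ResNet-50 backbone purely to illustrate the transfer-learning setup: an ImageNet-pretrained backbone combined with a freshly initialised head for the 19 evaluated classes.

from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 19   # road, sidewalk, building, ..., motorcycle, bike
model = deeplabv3_resnet50(weights=None,
                           weights_backbone="IMAGENET1K_V1",   # ImageNet-pretrained backbone [4]
                           num_classes=NUM_CLASSES)            # new, randomly initialised head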


Figure 1: Class-balanced self-training [20].


Figure 2: Training process

4 Expected Results

Mean Intersection over Union (mIoU) [3] is the metric that would be used to measure performance. The expectation is that augmenting with optical flow should increase the IoU of motion-sensitive classes and, therefore, the mIoU for the whole task, reducing the overall domain gap between synthetic and real data. [11] obtained IoU increases of 17% and 7% on the motorcycle and train classes respectively by augmenting with optical flow maps on Cityscapes for a semantic segmentation task. The expected mIoU should be greater than 47% for the task of adapting from GTA5 to Cityscapes; 47% is the baseline obtained by [20] for this task without augmentation.

5 Conclusion

Annotating a real-life image dataset like Cityscapes for semantic segmentation can be very expensive, while annotating a synthetic dataset like GTA5 is much easier. The naive approach of training a model for semantic segmentation on a synthetic dataset and testing it on a real-life dataset therefore easily comes to mind, but it does not work well because of the domain gap between the RGB appearance of the two datasets. The reasonable approach is then to try unsupervised domain adaptation techniques such as class-balanced self-training (CBST) [20], which produces good results but does not completely close the domain gap. Augmenting the RGB dataset with optical flow is therefore proposed, because geometry is preserved across both domains, and [11] already achieved better results on the plain segmentation task by augmenting RGB with optical flow. Future work could be correcting for the Doppler effect by adding depth information.


References

[1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[3] G. Csurka, D. Larlus, F. Perronnin, and F. Meylan. What is a good evaluation measure for semantic segmentation? In BMVC, volume 27, page 2013. Citeseer, 2013.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2016.

[6] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.

[7] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.

[8] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[9] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.

[10] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766. ACM, 2007.

[11] H. Rashed, S. Yogamani, A. El-Sallab, P. Krizek, and M. El-Helw. Optical flow augmented semantic segmentation networks for automated driving. arXiv preprint arXiv:1901.07355, 2019.

[12] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2213–2222, 2017.

[13] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.

[14] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.

[15] M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand, and A. El-Sallab. MODNet: Moving object detection network with motion and appearance for autonomous driving. arXiv preprint arXiv:1709.04821, 2017.

[16] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[17] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. Pattern Recognition, 90:119–133, 2019.

[18] Y. Chen, W. Li, X. Chen, and L. Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

[20] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.
