[DL Reading Group] Learning What and Where to Draw (NIPS'16)

Paper reading: Learning What and Where to Draw (NIPS'16)

2017/1/20 1

Bibliographic information
• Learning What and Where to Draw
• Scott Reed (Google), Zeynep Akata (MPI), Santosh Mohan (umich), Samuel Tenka (umich), Bernt Schiele (MPI), Honglak Lee (umich)
• NIPS'16 (Conference Event Type: Poster)
• https://papers.nips.cc/paper/6111-learning-what-and-where-to-draw


c.f. Generative Adversarial Text to Image Synthesis

• ICML'16
• http://www.slideshare.net/mmisono/generative-adversarial-text-to-image-synthesis


Generative Adversarial What-Where Network (GAWWN)
• A GAN that lets you specify "what" to draw and "where" to draw it

Inputs: text, plus a bounding box or keypoints


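Bounding-box-conditional text-to-image model
1. Convert the text embedding into an M x M x T feature map
2. Normalize it to fit the bounding box; fill the surrounding area with zeros

(Figure labels: masked with 0, M x M x T, masked with 0)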
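The two steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: it assumes an axis-aligned box given in grid cells and simply copies the embedding into every cell inside the box, whereas the actual model relies on a spatial transformer to warp the feature map. The function name `bbox_text_map` and the `(row0, col0, row1, col1)` box convention are hypothetical.

```python
def bbox_text_map(embedding, M, box):
    """Build an M x M x T nested list from a length-T text embedding.

    Cells inside the box hold a copy of the embedding; all other cells are zeros.
    box = (row0, col0, row1, col1) in grid cells, end-exclusive (assumed convention).
    """
    T = len(embedding)
    row0, col0, row1, col1 = box
    # Start from an all-zero M x M x T grid (the "masked with 0" region).
    grid = [[[0.0] * T for _ in range(M)] for _ in range(M)]
    # Copy the embedding into each cell covered by the box, clipped to the grid.
    for r in range(max(row0, 0), min(row1, M)):
        for c in range(max(col0, 0), min(col1, M)):
            grid[r][c] = list(embedding)
    return grid


feature_map = bbox_text_map([0.5, -1.0, 2.0], M=4, box=(1, 1, 3, 4))
```

Moving or resizing `box` changes only where the text information is placed, which is the sense in which the bounding box controls "where".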


Keypoint-conditional text-to-image model
• Keypoints are specified in grid coordinates
• Each keypoint corresponds to a part such as head, left foot, etc.
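One way to picture the grid encoding is an M x M x K binary map with one channel per part. The sketch below assumes that layout (the slide only states that keypoints are given in grid coordinates), and the helper name `keypoint_map` and the `(row, col, visible)` tuple format are hypothetical.

```python
def keypoint_map(keypoints, M):
    """Encode K keypoints as an M x M x K nested list of 0/1 values.

    keypoints: list of (row, col, visible) tuples, one per part (assumed format).
    Channel i is 1 at the grid cell of part i and 0 everywhere else.
    Parts that are not visible leave their channel entirely zero.
    """
    K = len(keypoints)
    grid = [[[0] * K for _ in range(M)] for _ in range(M)]
    for part, (row, col, visible) in enumerate(keypoints):
        if visible:
            grid[row][col][part] = 1
    return grid


# Two parts on a 3 x 3 grid: part 0 (e.g. head) is visible, part 1 is not.
encoded = keypoint_map([(0, 1, 1), (2, 2, 0)], M=3)
```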


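Conditional keypoint generation model

• Entering every keypoint by hand is tedious
  • In these experiments, each bird has 15 keypoints

• So keypoints are generated here with a conditional GAN

• Keypoints:
  • x, y: coordinates; v: visible flag
  • If v = 0, then x = y = 0

• Generator:

• D is trained to output 1 for real keypoints and 0 for synthesized ones

s: 1 at the entries corresponding to keypoints specified by the user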
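The role of s can be illustrated with a small gating sketch. It assumes the generator output has the form s * k + (1 - s) * f, where k holds the user-specified keypoints and f is the network's proposal; that form is recalled from the paper and does not appear in this transcript, so treat it as an assumption. Here `proposal` is a plain list standing in for the learned network output, and `complete_keypoints` is a hypothetical name.

```python
def complete_keypoints(user_keypoints, s, proposal):
    """Combine user-specified and proposed keypoints with a 0/1 gate.

    user_keypoints, proposal: lists of (x, y, v) tuples, one per part.
    s: list of 0/1 flags; 1 means the user specified that keypoint.
    Each output entry is s_i * k_i + (1 - s_i) * f_i, applied to x, y and v.
    """
    completed = []
    for k_i, s_i, f_i in zip(user_keypoints, s, proposal):
        completed.append(tuple(s_i * a + (1 - s_i) * b for a, b in zip(k_i, f_i)))
    return completed


user = [(0.2, 0.3, 1), (0, 0, 0)]          # only the first keypoint is given
gate = [1, 0]
proposal = [(0.9, 0.9, 1), (0.6, 0.7, 1)]  # stand-in for the network output
result = complete_keypoints(user, gate, proposal)
```

With this gating, keypoints the user supplies pass through unchanged and only the missing ones are filled in.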


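Experiments: Dataset

• CUB Birds dataset
  • 200 bird species, 11,788 images
  • Each image has 10 captions, 1 bounding box, and 15 keypoints

• MHP
  • 25k images, 410 kinds of actions
  • 3 captions per image
  • 19k images remain after excluding those that show multiple people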


Experiments: Misc

• Text encoder: char-CNN-GRU
  • Probably the same as in Generative Adversarial Text to Image Synthesis

• Solver: Adam
  • Batch size 16
  • Learning rate 0.0002

• Implementation: torch
  • Spatial transform: https://github.com/qassemoquab/stnbhwd
  • Loosely based on dcgan.torch
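For reference, a single Adam update with the learning rate listed above looks roughly like this. It is a scalar sketch in plain Python, not the Torch code used in the paper. Only the learning rate 0.0002 comes from the slide; beta1, beta2 and eps are the defaults from the original Adam formulation and are assumptions here, since the slide does not give them.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    m, v: running first and second moment estimates; t: 1-based step count.
    beta1, beta2, eps are assumed defaults, not values stated in the slides.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param = 1.0 with gradient 1.0 moves the parameter by about lr.
new_param, m1, v1 = adam_step(1.0, 1.0, m=0.0, v=0.0, t=1)
```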


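Conditional bird location via bounding boxes

・Text and noise are the same for all three images
・The backgrounds are similar across the three images but not identical
・The bird's orientation stays the same even when the bounding box changes
・z presumably carries the information that cannot be controlled, such as background and orientation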

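Conditional individual part locations via keypoints

・Keypoints are fixed to the ground truth (not synthesized)
・Noise differs for each example

・The keypoints are invariant to the noise
・The background and other details change with the noise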


Using keypoints condition

・Beak and tail are specified
・All birds face left (c.f. condition on bounding box)


Generating both bird keypoints and images from text alone

・Keypoints are generated from text alone, then the image is generated
・Quality drops when all keypoints are generated

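Comparison with prior work
・Prior work captures the text almost accurately, but parts such as the beak are sometimes missing (64x64)
・The proposed method generates almost accurate images at 128x128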


Generating Human

・Quality is lower than for birds
・Few captions resemble each other, and complex poses are difficult (something like yoga works reasonably well)

Summary
• GAWWN: conditions where to draw on bounding boxes and keypoints

• High-quality 128x128 images can be generated on the CUB dataset

• Future work
  • Learn object locations in an unsupervised or weakly supervised way
  • Better text-to-human generation


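Impressions
• The novel point is how the "where" information is encoded
  • Bounding box
  • Keypoints

• Text alone leaves too much freedom; supplying location information makes images easier to generate

• The detailed network architecture is unclear, since the design choices are not explained
  • Some more theoretical grounding would be welcome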
