computer vision meets fashion (第12回ステアラボ人工知能セミナー)

Computer Vision meets Fashion Kota Yamaguchi CyberAgent, Inc. STAIR Lab AI seminar, Aug 17, 2017

Upload: stair-lab-chiba-institute-of-technology

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Computer Vision meets Fashion

Kota Yamaguchi

CyberAgent, Inc.

STAIR Lab AI seminar, Aug 17, 2017

Page 2: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Who am I

Kota Yamaguchi

Research scientist

CyberAgent, Inc.

vision.is.tohoku.ac.jp/~kyamagu

Computer vision and machine learning

2017 Assistant professor, Tohoku University

2014 PhD, CS, Stony Brook University

2008 MS, 2006 BE, University of Tokyo

twitter.com/kotymg

github.com/kyamagu

Page 3: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Research agenda

1. Learning visual perception on the Web [ACCV16] [ECCV16]

2. Clothing and body recognition [BMVC15] [TPAMI14] [ICCV13] [CVPR12]

3. Understanding fashion and behavior [WACV15] [ACMMM14] [ECCV14]

4. Language and vision [PACLIC16] [IJCV15] [CVPR12] [NAACL12] [EACL12]

Page 4: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

style.com

Computer vision and fashion?

Page 5: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

style.com

Page 6: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Computer vision provides machine perception to fashion

Page 7: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

For e-commerce

Garment search

Shorts

Blazer

T-shirt

query results

Clothing search [Liu 12] [Kalantidis 13] [Cushen 13] [Kiapour 15] [Liu 16]

Recommendation [Liu 12] [McAuley 15] [Veit 15] [Yang 17]

Categorization [Borras 03] [Berg 10] [Bourdev 11] [Chen 12] [Bossard 12] [Di 13]

Page 8: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

For re-identification

Person identification [Anguelov 07] [Gallagher 08] [Vaquero 09] [Wang 11]

[Gallagher 08]

[Vaquero 09]

Page 9: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

For social science

Social groups [Murillo 12] [Kwak 13]

[Kwak 13]

[Figure: excerpt of a paper page on occupation recognition, showing a structural SVM formulation (Algorithm 2: 1-slack formulation for structure SVM) and Figure 3, "Illustrations of the collected occupation database. There are 14 occupations and over 7K images in total." Occupations shown: soccer player, marathoner, chef, lawyer, doctor, firefighter, policeman, waiter, soldier, student, clergy, mailman, construction labor, teacher.]

Occupation [Song 11] [Shao 13]

[Shao 13]

Page 10: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion industry

"The global apparel market is valued at 3 trillion dollars, 3,000 billion, and accounts for 2 percent of the world's Gross Domestic Product (GDP)." (FashionUnited.com)

Page 11: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Amazon Echo Look: Hands-Free Camera and Style Assistant

https://www.amazon.com/Echo-Hands-Free-Camera-Style-Assistant/dp/B0186JAEWK

Page 12: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Start-ups

• Cutting-edge computer vision in business

Fashwell, Wide Eyes Technologies, Shopagon, Vasily, Markable

Page 13: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Start-ups

Original Stitch

• Cutting-edge computer vision in business

http://jp.techcrunch.com/2017/03/14/original-stitch-open-platform/

Makip

Page 14: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Computer vision meets fashion

• Recent topics

1. Clothing recognition

2. Retrieval for e-commerce

3. Style / trend analysis

4. Learning from big data

5. Image generation

• Clothing parsing

• Attribute discovery

• Style recognition

• Trend analysis

• Popularity analysis

Page 15: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Recent topics

Page 16: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

1. Clothing recognition

• Classification

• Segmentation

Semantic segmentation [Shotton 06] [Gould 09] [Liu 09] [Eigen 12] [Singh 13] [Tighe 10, 13, 14] [Dong 13] [Liang 15]

Pose estimation [Ferrari 08] [Bourdev 09] [Yang 11] [Ukita 12] [Dantone 13] [Ladicky 13]

Contexts, joint models, dependency

Page 17: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Mix and Match: Joint Model for Clothing and Attribute Recognition

matched

unmatched

[Yamaguchi, BMVC 2015]

Page 18: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Human Parsing with Contextualized Convolutional Neural Network [Liang, ICCV 2015]

Page 19: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

2. Retrieval for e-commerce

• Street2shop

• Attribute-based search

• VQA shopping assistant

Exact retrieval, interaction, attributes

Page 20: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Where to Buy It: Matching Street Clothing Photos in Online Shops

Liu et al., Street-to-shop: Cross-scenario clothing

retrieval via parts alignment and auxiliary set.

CVPR 2012

Huang et al., Cross-Domain Image Retrieval With

a Dual Attribute-Aware Ranking Network, ICCV

2015

[Kiapour, ICCV 2015]

Page 21: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Whittlesearch: Image search with relative attribute feedback [Kovashka, CVPR 2012]

Page 22: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search

[Zhao, CVPR 2017]

Page 23: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visual Search at eBay

• VQA in shopping scenario

[Yang, KDD 2017]

Page 24: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations

http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html

[Liu, CVPR 2016]

800K images

50 categories

1K attributes, bbox,

landmarks

Attribute prediction

Street-to-shop

In-shop retrieval

Landmark detection

Retrieval demo: http://fashion.sensetime.com/

Page 25: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

3. Style / trend analysis

• Learning style/outfit as a whole

• Item compatibility

• Geographical trend

• Temporal trend

Unsupervised learning, weakly supervised learning

Page 26: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction [Simo-Serra, CVPR 2016]

Page 27: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Learning the Latent "Look": Unsupervised Discovery of a Style-Coherent Embedding from Fashion Images [Hsiao, arXiv 2017]

Polylingual LDA

Page 28: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences

[Veit, ICCV 2015]

Incompatible / Compatible

Page 29: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Learning Fashion Compatibility with Bidirectional LSTMs [Han, ACM MM 2017]

Page 30: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

StreetStyle: Exploring world-wide clothing styles from millions of photos

[Matzen, arXiv 2017]

Page 31: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Changing Fashion Cultures [Abe, MIRU 2017]

Page 32: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion Forward: Forecasting Visual Style in Fashion [Al-Halah, ICCV 2017]

Popularity prediction by styles Keyword popularity prediction

Page 33: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

4. Learning from fashion big data

• Social signal + visual signal

• Textual signal

Page 34: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion Conversation Data on Instagram

Successful categorization

Unsuccessful categorization

[Ha, ICWSM 2017]

Page 35: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features

[Chen, WWW 2017]

Page 36: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

The Elements of Fashion Style [Vaccaro, UIST 2016]

Learning semantic relationship between high-level and low-level fashion concepts in text

Page 37: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

5. Image generation

• Domain translation

• Creative inspiration

• Virtual fitting

Page 38: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Pixel-Level Domain Transfer

• Generating a product photo given a street snap

[Yoo, ECCV 2016]

https://github.com/fxia22/PixelDTGAN

Page 39: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Pose Guided Person Image Generation [Ma, arXiv 2017]

Page 40: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

A Generative Model of People in Clothing

• Generating people from pose map and styling pipeline

[Lassner, ICCV 2017]

Page 41: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Recent topics: discussion

• Fashion tech is strongly application-oriented

  • UX in e-commerce and social media

  • Computer vision as a building block

• Deep learning almost solves recognition problems

  • Contextual modeling is still under investigation

• Data issues: research towards unsupervised / weakly-annotated data

• Machine learning for creativity?

Page 42: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Clothing parsing

CVPR 2012, ICCV 2013, TPAMI 2014, arXiv

Page 43: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

style.com

Clothing parsing

Page 44: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fully-convolutional Neural Networks (FCN)

• CNN for semantic segmentation

• All-convolution architecture

Fully Convolutional Networks for Semantic Segmentation

Jonathan Long, Evan Shelhamer, Trevor Darrell

CVPR 2015

Page 45: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Looking at outfit to parse clothing

[Figure: network diagram. Blocks: Conv, 3xConv, Pool1 to Pool5, FC, FC, Sigmoid, Deconv, Crop, Sum, Product, Softmax, CRF, Loss. Labeled parts: Input, FCN, Outfit encoder, Output.]

1. Outfit prediction (clothing combination)

2. Filter out inappropriate garments

3. Smooth out boundary

[Pongsate, arXiv 2017]

Page 46: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Parsing results

input truth prediction input truth prediction

Page 47: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

skin

hair

dress

hat/headband

shoes

skin

hair

bag

dress

jacket/blazer

necklace

shoes

sweater/cardigan

top/t-shirt

vest

watch/bracelet

prediction truth input

Page 48: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Performance [%]

Dataset              Method                  Accuracy  IoU
Fashionista v0.2     Paper doll [ICCV 2013]  84.68     -
Fashionista v0.2     Clothlets [ACCV 2014]   84.88     -
Fashionista v0.2     FCN-8s [CVPR 2015]      87.51     33.97
Fashionista v0.2     Our model               88.34     37.23
Refined Fashionista  FCN-8s [CVPR 2015]      90.09     44.72
Refined Fashionista  Our model               91.74     51.78
CFPD                 CFPD [TMM 2015]         -         42.10
CFPD                 FCN-8s [CVPR 2015]      91.58     51.28
CFPD                 Our model               92.35     54.65
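The Accuracy and IoU columns can be illustrated with a small sketch. It assumes that Accuracy means per-pixel accuracy and that IoU is averaged over all labels present in either map; the exact evaluation protocol behind the table (for example, how background is handled) is not stated on the slide, and the function names and toy label maps here are hypothetical.

```python
from collections import Counter

def pixel_accuracy(truth, pred):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    pairs = [(t, p) for t_row, p_row in zip(truth, pred) for t, p in zip(t_row, p_row)]
    return sum(t == p for t, p in pairs) / len(pairs)

def mean_iou(truth, pred):
    """Mean over labels of |truth AND pred| / |truth OR pred|."""
    inter, t_count, p_count = Counter(), Counter(), Counter()
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            t_count[t] += 1
            p_count[p] += 1
            if t == p:
                inter[t] += 1
    labels = set(t_count) | set(p_count)
    # Union = pixels labeled c in truth + pixels labeled c in pred - overlap.
    ious = [inter[c] / (t_count[c] + p_count[c] - inter[c]) for c in labels]
    return sum(ious) / len(ious)

# Toy 2x2 label maps (hypothetical data, not from the datasets in the table).
truth = [["skin", "dress"], ["dress", "shoes"]]
pred = [["skin", "dress"], ["skin", "shoes"]]
accuracy = pixel_accuracy(truth, pred)
iou = mean_iou(truth, pred)
```

In the toy example one of four pixels is wrong, yet IoU drops further than accuracy because both "skin" and "dress" are penalized, which matches the pattern in the table where IoU is much lower than accuracy.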

Page 49: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Refined Fashionista dataset

• 685 high-quality, manually annotated pictures

• Major improvement from v0.2

  • CVPR 2012

• Used to learn FCN by fine-tuning from pre-trained model [Long 2015]

Page 50: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Pixel-based annotation using superpixels

https://github.com/kyamagu/js-segment-annotator

Page 51: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Coarse-to-fine superpixels on the fly

• SLIC superpixels computed on the client side

  • Takes just a second in modern browsers

• Efficient annotation from large to small segments

Page 52: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Limitations

• CRF tends to trim small items

  • sunglasses

  • watch/bracelet

• Dress vs. top+skirt distinction is still hard

truth prediction

input truth prediction

Page 53: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Pose estimation using FCN

• Human pose as heatmap of parts

• Predict heatmap by FCN

• Can pose help segmentation, or vice versa?
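The "pose as heatmap" idea can be sketched as follows. This is a minimal illustration, assuming each body part is encoded as a 2D Gaussian bump centred on its location and decoded by taking the argmax; the slide does not say which encoding was actually used, and `keypoint_heatmap`, `decode_keypoint`, and `sigma` are hypothetical names.

```python
import math

def keypoint_heatmap(height, width, cx, cy, sigma=1.5):
    """Return a height x width grid with a Gaussian bump centred on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]

def decode_keypoint(heatmap):
    """Recover the (x, y) location of a part as the argmax of its heatmap."""
    best = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]

# One heatmap per body part would be the regression target of the network.
heatmap = keypoint_heatmap(8, 6, cx=4, cy=2)
location = decode_keypoint(heatmap)
```

Because the target has the same spatial layout as a segmentation map, the same fully convolutional architecture can be trained to predict it, which is what makes the question of sharing information between pose and segmentation natural.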

Page 54: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Pose estimation results

[Figure: three truth/prediction pairs, labeled Good, Small mistake, Failure]

Page 55: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Discussion

• Lack of data

  • 685 pictures are not sufficient for a deep-learning approach

• Global information in segmentation

  • Local appearance cannot resolve the confusion

  • Need a global prediction of clothing combination to avoid confusion between items

Page 56: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Attribute discovery

ECCV 2016

Page 57: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visual attribute perception

• What does a _______ t-shirt look like?

• yellow

• large

• surfer

• comfy

• original

• popular

...onehourtees.com

www.justclick.ie

www.matsongraphics.com

polyvore.com

Page 58: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Another question

• How many words can we use to describe visual attributes of a t-shirt?

  • My t-shirt looks __________.

Page 59: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Automatic attribute discovery

• Finding vocabulary of attributes

  • Open-world recognition challenge

• Using pre-trained deep neural networks to identify visual words in the Web data

Page 60: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Our approach

Pre-trained deep CNN

Image

Text: beautiful soft blush handmade leather ballet flats. ***please, note, our new blush ballet flats are without the beige trim line (around the edges), still just as beautiful and perhaps even more*** SIZING ✍ how to take measurements ✍ there are a number of ways to measure your feet, however we find the quickest and most reliable practice is by tracing your feet. Here is how to do it: stand on a piece of paper that's bigger than your feet, circle your feet around with a straight standing pencil (without pressing the pencil too hard to the edges of your feet). Once you have the tracing, measure distance between longest and widest points. Compare the measurements to the list below.

Attributes: white, red, striped, wooden, silky, ...

1. Get Web data

2. Analyze DNN's internal activity

Page 61: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Web data: unlimited vocabulary with images

Textual description: Feel So Good ... Purple Halter Maxi Cotton dress 2 Sizes Available

Tags: used, american casual, summer, shorts, t-shirt, surfer, printed, duffer

Etsy dataset: e-commerce

Wear dataset: fashion-blog

Page 62: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Discovery: intuition

• Contrast positive and negative sets to identify difference of semantics

pink not pink

Page 63: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Identifying difference at neurons

conv1

conv2

conv3

conv4

conv5

fc6

fc7

positive

negative

Deep neural

network

Activation

histograms

unit #1

unit #2

...

KL

divergence

Images

neurons

Page 64: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Why neural activation?

• Discriminability

  • If the attribute is visual, the positive set should activate a different set of neurons

• Semantic depth

  • The depth of the activating layer should encode semantic information

...

conv1

conv2

conv3

conv4

conv5

fc6

fc7

Activation

histograms

Deep neural

network

Page 65: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Kullback-Leibler divergence

• Measure of difference between P+ and P-

• Used to identify highly-activating units

$$D_{\mathrm{KL}}(P^+ \,\|\, P^-) \equiv \sum_x P^+(x) \log \frac{P^+(x)}{P^-(x)}$$
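As a rough sketch of how the KL divergence between positive and negative activation histograms could rank units, the code below bins each unit's activations on the two image sets and sorts units by divergence. The bin count, the value range, the `eps` smoothing, and all names are assumptions made for illustration; the slides do not specify these details.

```python
import math

def histogram(values, bins, lo, hi, eps=1e-6):
    """Normalised histogram; eps keeps every bin non-zero so the log is defined."""
    counts = [eps] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[max(idx, 0)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def rank_units(pos_acts, neg_acts, bins=10, lo=0.0, hi=1.0):
    """pos_acts / neg_acts map a unit name to its activations on each image set."""
    scores = {
        unit: kl_divergence(
            histogram(pos_acts[unit], bins, lo, hi),
            histogram(neg_acts[unit], bins, lo, hi),
        )
        for unit in pos_acts
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical activations: unit1 separates the two sets, unit2 does not.
pos = {"unit1": [0.92, 0.83, 0.87, 0.96], "unit2": [0.12, 0.52, 0.92, 0.32]}
neg = {"unit1": [0.12, 0.23, 0.17, 0.04], "unit2": [0.17, 0.57, 0.97, 0.37]}
ranking = rank_units(pos, neg)
```

A unit whose histograms coincide on both sets scores zero, so units that respond differently to the word's positive images rise to the top of the ranking.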

Page 66: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

KL visualization: shorts

Positive Negative

pool5 KL

average

image

norm2 KL

Page 67: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

KL visualization: red

Positive Negative

pool5 KL

average

image

norm2 KL

Page 68: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Is the attribute visual?

• Which attribute is visually perceptible?

• Measure the classification performance, and compare against human

yellow, large, surfer, comfy, original, popular

Page 69: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visualness

• Visualness of word u given a classifier f and dataset D+, D-:

$$V(u \mid f) \equiv \mathrm{accuracy}(f, D_u^+, D_u^-)$$

positive negative

D+ D-
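A minimal sketch of the visualness score, assuming it is plain classification accuracy over the union of the positive and negative sets. The toy classifier and the colour tuples standing in for images are hypothetical; the slide does not say whether the sets are balanced or held out.

```python
def visualness(classifier, positives, negatives):
    """V(u | f): accuracy of `classifier` on images with (D+) and without (D-) word u."""
    correct = sum(1 for x in positives if classifier(x))
    correct += sum(1 for x in negatives if not classifier(x))
    return correct / (len(positives) + len(negatives))

def is_red(rgb):
    """Toy stand-in for f: fires when the mean colour of an image is reddish."""
    return rgb[0] > 0.5 and rgb[0] > rgb[1] and rgb[0] > rgb[2]

# Each "image" is reduced to a mean (r, g, b) colour for illustration only.
d_pos = [(0.9, 0.1, 0.1), (0.8, 0.2, 0.3), (0.4, 0.4, 0.4)]
d_neg = [(0.1, 0.2, 0.9), (0.2, 0.8, 0.1), (0.7, 0.1, 0.2)]
score = visualness(is_red, d_pos, d_neg)
```

A word such as "popular" would be expected to score near chance under this measure, while a word such as "yellow" should score high, which is the contrast the slide draws.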

Page 70: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Discovered attributes

lovely / NOT lovely

bright / NOT bright

orange / NOT orange

acrylic / NOT acrylic

elegant / NOT elegant

Page 71: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Discovery in noisy data

[Figure: image grid. Rows: annotated floral, NOT annotated floral. Columns: predicted MOST floral, predicted LEAST floral. Labels: False positives, False negatives.]

Page 72: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Most/least visual attributes

Method: Human
Most: flip, pink, red, floral, blue, sleeve, purple, little, black, yellow
Least: url, due, last, right, additional, sure, free, old, possible, cold

Method: Pre-trained + Resample
Most: flip, pink, red, yellow, green, purple, floral, blue, sexy, elegant
Least: big, great, due, much, own, favorite, new, free, different, good

Method: Attribute-tuned
Most: flip, sexy, green, floral, yellow, pink, red, purple, lace, loose
Least: right, same, own, light, happy, best, small, different, favorite, free

Method: Language prior
Most: top, sleeve, front, matching, waist, bottom, lace, dry, own, right
Least: organic, lightweight, classic, gentle, adjustable, floral, adorable, url, elastic, super

Page 73: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Perceptual depth

• Which layer affects attribute recognition?

conv1

conv2

conv3

conv4

conv5

fc6

fc7

orange

bright

elegant

lovely

acrylic

Page 74: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Most salient words (Etsy)

norm1 norm2 conv3 conv4 pool5 fc6 fc7

orange green bright flattering lovely many sleeve

colorful red pink lovely elegant soft sole

vibrant yellow red vintage natural new acrylic

bright purple purple romantic beautiful upper cold

blue colorful green deep delicate sole flip

welcome blue lace waist recycled genuine newborn

exact vibrant yellow front chic friendly large

yellow ruffle sweet gentle formal sexy floral

red orange French formal decorative stretchy waist

specific only black delicate romantic great American

Page 75: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Most salient words (Wear)

norm1 norm2 conv3 conv4 pool5 fc6 fc7

blue denim-jacket border-striped-tops kids shorts white-skirt long-skirt

green pink stripes bucket-hat half-length flared-skirt suit-style

red-black red dark-style hat-n-glasses pants spring midi-skirt

red red-socks stripes black denim upper gaucho-pants

denim-on-denim red-black backpack sleeveless dotted beret handmade

denim-skirt champion red American-casual border-stripes shirt-dress straw-hat

pink blue dark-n-dark long-cardigan white-pants overalls white-n-white

denim white denim-shirt white-n-white border-striped-tops hair-band white

yellow shirt navy stole gingham-check loincloth-style white-coordinate

leopard i-am-clumsy outdoor-style mom-style sandals matched-pair white-pants

Page 76: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Saliency detection

• Can we identify salient region of the discovered attribute?

tulle-skirt

Page 77: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Our approach: cumulative field

• Accumulating receptive field of highly-activating neurons by KL

conv1

conv2

conv3

conv4

conv5

fc6

fc7

masked input

...

unit#1 unit#2

unit#3

saliency map

Page 78: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

image

sunglasses

shorts sneakers

gingham check

white style yellow

human K=64 image human K=64

Page 79: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Attribute discovery

• Web data + deep network

• Highly-activating neurons identify the visual stimuli associated with a given word

• Neural activations can further identify salient regions

Page 80: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Studying fashion styles

ECCV 2014

Page 81: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Q: What makes the boy on the right look Harajuku-style?

Tie? Shoes?

tokyofashion.com

Page 82: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Goal

• Finding what constitutes a fashion style

• Approach

• Game-based annotation

• Attribute factorization

Goth

Page 83: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Who’s more Bohemian?

Page 84: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Other people think...

hipsterwars.com

Page 85: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

hipsterwars.com

Game-based relative "style-ness" collection

Asking our online friends for participation

NO MONETARY REWARDS!

Initial keyword-search on Google or Fashion SNS

Page 86: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Participation statistics

Most users played the game for only a few clicks

Some motivated users clicked A LOT

Page 87: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

TrueSkill game algorithm

• Algorithm to select which pair to play

• Idea:

• Represent each image by Gaussian over rating

• Update Gaussian parameters after each click

• Chooses expected-to-tie images for play

[R Herbrich, 2007]
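The three ideas above can be sketched with a simplified two-player TrueSkill update. This is an illustration under stated assumptions, not the code behind hipsterwars.com: draws are ignored, `BETA` uses the commonly quoted default of 25/6, and `update`, `match_quality`, and `pick_pair` are hypothetical names.

```python
import math
from itertools import combinations

BETA = 25.0 / 6.0  # assumed performance noise (a commonly used default)

def _pdf(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def _cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def update(winner, loser, beta=BETA):
    """Update the (mu, sigma) of both images after one click; draws are ignored."""
    (mu_w, sd_w), (mu_l, sd_l) = winner, loser
    c = math.sqrt(2.0 * beta ** 2 + sd_w ** 2 + sd_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)  # large when the outcome was surprising
    w = v * (v + t)        # fraction of variance removed
    new_w = (mu_w + sd_w ** 2 / c * v, sd_w * math.sqrt(1.0 - sd_w ** 2 / c ** 2 * w))
    new_l = (mu_l - sd_l ** 2 / c * v, sd_l * math.sqrt(1.0 - sd_l ** 2 / c ** 2 * w))
    return new_w, new_l

def match_quality(a, b, beta=BETA):
    """Close to 1 when the two images are expected to tie."""
    (mu_a, sd_a), (mu_b, sd_b) = a, b
    c2 = 2.0 * beta ** 2 + sd_a ** 2 + sd_b ** 2
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-(mu_a - mu_b) ** 2 / (2.0 * c2))

def pick_pair(ratings):
    """Choose the pair of image ids with the highest expected-tie quality."""
    return max(
        combinations(ratings, 2),
        key=lambda pair: match_quality(ratings[pair[0]], ratings[pair[1]]),
    )

# Two unrated images start from the same prior; one click separates them.
prior = (25.0, 25.0 / 3.0)
more_hipster, less_hipster = update(prior, prior)
next_pair = pick_pair({"a": (25.0, 8.0), "b": (26.0, 8.0), "c": (40.0, 8.0)})
```

Showing pairs that are expected to tie makes each click as informative as possible, which matters when most participants contribute only a few clicks.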

Page 88: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Score distribution after game

Most Hipster

Least Hipster

Page 89: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Annotation examples

[Figure: Fig. 5 of the ECCV-14 submission. Example results of the within-class classification task with δ = 0.5. Top (most) and bottom (least) predictions are shown for each style category: pinup, goth, hipster, bohemian, preppy.]

MOST LEAST

High-quality dataset without Amazon MTurk

Page 90: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Relative vs. absolute

• Gamification resulted in higher-quality annotation

  • Asked MTurk workers for 1-10 ratings

  • Much noisier results from MTurk

Page 91: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Analyzing what makes her look preppy

Factorization

results

Page 92: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion style analysis

• Game-based annotation collected high-quality data without monetary rewards

• How can we collect seed images?

Page 93: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion trend analysis

WACV 2015

Page 94: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Fashion trend: Runway to realway

Fashion show / Street

style.com chictopia.com

Page 95: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Runway dataset

~35k images in 9k fashion shows over 15 years, from 2000 to 2014

Page 96: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

What does "similar" mean?

The real challenge is the definition of similarity.

Page 97: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

The query image is given in the left column, while five candidate

images are shown in the right columns.

1. Select an image with the most similar outfit to the query.

2. If there is NO similar image, please select NONE.

Query image

NONE

Collecting human judgments to learn similarity

Select an image with the most similar outfit to the query image

Page 98: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visual processing

Pose estimation Foreground

segmentation

Boundary map

Page 99: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Runway-to-runway retrieval

Retrieving similar styles from other fashion shows

Page 100: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Runway-to-realway retrieval

Retrieving similar styles from street snaps

Page 101: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visually analyzing floral trend

Runway image of floral / Retrieved images in street with timestamp

Peaks in spring!

[Plot: % retrieved images over time]

Page 102: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Runway to realway analysis

• What is considered similar in fashion?

• Our approach: Learn human judgment

• Tracking similarity over time = trend analysis of a specific style

Page 103: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Visual Popularity Analysis

ACM Multimedia 2014

Page 104: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Online fashion networks

Chictopia

Lookbook

Chicisimo

Pinterest

Tumblr

...

www.chictopia.com

Page 105: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Like button in Chictopia

Long tail

Promotion effect?

~300K posts

Page 106: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Why do some pictures get popular?

Social factors

• Active posting

• Lots of friends

Content factors

• Good fashion items

• Photo quality

How much do they matter?

Page 107: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Regression analysis in 300K posts

Input

Content factors

• Tag TF-IDF

• Image composition

• Color entropy

• Style descriptor

• Parse descriptor

Social factors

• User identity

• Previous posts

• Node degrees

Output

Popularity

• Votes

Page 108: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Findings

• Your outfit doesn’t matter (!!!)

• Popularity is mostly the outcome of the network (social bias)

  • #votes ∝ #followers

  • People just click on friends' photos

  • cf. rich-get-richer phenomenon

Page 109: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Regression performance

Factors           R2     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.491  0.682     0.847               0.779
Content           0.248  0.488     0.778               0.737
Social + Content  0.493  0.685     0.845               0.775

Social factors significantly boost performance

Page 110: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

What if there is no social network?

• Popularity = f(content factors)?

Page 112: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Out-of-network popularity

#posts

#votes

No social factors in the voting process

Page 113: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Task

• Predict crowd popularity using Content factors and/or Social factors in Chictopia

Social factors

Chictopia

Content factors

MTurk

Voting data

?

Page 114: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Predicting crowd votes

Factors           R2     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.423  0.634     0.845               0.787
Content           0.428  0.647     0.888               0.862
Social + Content  0.473  0.686     0.884               0.858

• Content factors matter

• Social factors from Chictopia predict crowd votes well

• User-content correlation: Top bloggers consistently post good pictures

Page 115: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Predicted most popular

Predicted least popular

Popularity prediction

Page 116: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

The data told us...

• Popularity is mostly the outcome of the social network

  • People click on friends' photos

• Content affects popularity, but we conjecture the existence of user-content correlation

Page 117: Computer Vision meets Fashion (第12回ステアラボ人工知能セミナー)

Computer Vision meets Fashion

• Computer vision = machine perception to quantify visual information

  • Tool to analyze semantics of fashion

• Research topics

  • Recognition, street2shop, style understanding, social influence, fashion trend, creativity