Computer Vision meets Fashion (12th STAIR Lab Artificial Intelligence Seminar)
TRANSCRIPT
Computer Vision meets Fashion
Kota Yamaguchi
CyberAgent, Inc.
STAIR Lab AI seminar, Aug 17, 2017
Who am I
Kota Yamaguchi, Research scientist
CyberAgent, Inc.
vision.is.tohoku.ac.jp/~kyamagu
Computer vision and machine learning
2017 Assistant professor, Tohoku University
2014 PhD, CS, Stony Brook University
2008 MS, 2006 BE, University of Tokyo
twitter.com/kotymg
github.com/kyamagu
Research agenda
1. Learning visual perception on the Web [ACCV16] [ECCV16]
2. Clothing and body recognition [BMVC15] [TPAMI14] [ICCV13] [CVPR12]
3. Understanding fashion and behavior [WACV15] [ACMMM14] [ECCV14]
4. Language and vision [PACLIC16] [IJCV15] [CVPR12] [NAACL12] [EACL12]
style.com
Computer vision and fashion?
Computer vision provides machine perception to fashion
For e-commerce
Garment search
Shorts, Blazer, T-shirt
query → results
Clothing search [Liu 12] [Kalantidis 13] [Cushen 13] [Kiapour 15] [Liu 16]
Recommendation [Liu 12] [McAuley 15] [Veit 15] [Yang 17]
Categorization [Borras 03] [Berg 10] [Bourdev 11] [Chen 12] [Bossard 12] [Di 13]
For re-identification
Person identification [Anguelov 07] [Gallagher 08] [Vaquero 09] [Wang 11]
[Gallagher 08]
[Vaquero 2009]
For social science
Social groups[Murillo 12] [Kwak 13]
[Kwak 13]
[Slide shows a page from the cited occupation-recognition paper: the 1-slack cutting-plane formulation of a structured SVM (Algorithm 2), which iteratively solves for (w, ξ) over a working set H, finds the most violated label hypothesis for each training example, and stops once no constraint is violated by more than the desired precision ε; and Figure 3, an occupation database of 14 occupations (soccer player, marathoner, chef, lawyer, doctor, firefighter, policeman, waiter, soldier, student, clergy, mailman, construction laborer, teacher) with over 7K images, each category containing at least 500 images.]
Occupation [Song 11] [Shao 13]
FashionIndustry
“The global apparel market is valued at 3 trillion dollars (3,000 billion) and accounts for 2 percent of the world's Gross Domestic Product (GDP).” – FashionUnited.com
Amazon Echo Look: Hands-Free Camera and Style Assistant
https://www.amazon.com/Echo-Hands-Free-Camera-Style-Assistant/dp/B0186JAEWK
Start-ups
• Cutting-edge computer vision in business
Fashwell, Wide Eyes Technologies, Shopagon, Vasily, Markable
Start-ups
Original Stitch
• Cutting-edge computer vision in business
http://jp.techcrunch.com/2017/03/14/original-stitch-open-platform/
Makip
Computer vision meets fashion
• Recent topics
1. Clothing recognition
2. Retrieval for e-commerce
3. Style / trend analysis
4. Learning from big data
5. Image generation
• Clothing parsing
• Attribute discovery
• Style recognition
• Trend analysis
• Popularity analysis
Recent topics
1. Clothing recognition
• Classification
• Segmentation
Semantic segmentation / Pose estimation
Contexts, joint models, dependency
[Shotton 06] [Gould 09] [Liu 09] [Eigen 12] [Singh 13] [Tighe 10, 13, 14] [Dong 13] [Liang 15]
[Ferrari 08] [Bourdev 09] [Yang 11] [Ukita 12] [Dantone 13] [Ladicky 13]
Mix and Match: Joint Model for Clothing and Attribute Recognition
matched
unmatched
[Yamaguchi, BMVC 2015]
Human Parsing with Contextualized Convolutional Neural Network [Liang, ICCV 2015]
2. Retrieval for e-commerce
• Street2shop
• Attribute-based search
• VQA shopping assistant
Exact retrieval, interaction, attributes
Where to Buy It: Matching Street Clothing Photos in Online Shops
Liu et al., Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. CVPR 2012
Huang et al., Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network. ICCV 2015
[Kiapour, ICCV 2015]
Whittlesearch: Image search with relative attribute feedback [Kovashka, CVPR 2012]
Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search
[Zhao, CVPR 2017]
Visual Search at eBay
• VQA in shopping scenario
[Yang, KDD 2017]
DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations
http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
[Liu, CVPR 2016]
800K images, 50 categories, 1K attributes, bounding boxes, landmarks
Attribute prediction
Street-to-shop
In-shop retrieval
Landmark detection
Retrieval demo: http://fashion.sensetime.com/
3. Style / trend analysis
• Learning style/outfit as a whole
• Item compatibility
• Geographical trend
• Temporal trend
Unsupervised learning, weakly supervised learning
Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction [Simo-Serra, CVPR 2016]
Learning the Latent "Look": Unsupervised Discovery of a Style-Coherent Embedding from Fashion Images [Hsiao, arXiv 2017]
Polylingual LDA
Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences
[Veit, ICCV 2015]
Compatible vs. incompatible
Learning Fashion Compatibility with Bidirectional LSTMs [Han, ACM MM 2017]
StreetStyle: Exploring world-wide clothing styles from millions of photos
[Matzen, arXiv 2017]
Changing Fashion Cultures [Abe, MIRU 2017]
Fashion Forward: Forecasting Visual Style in Fashion [Al-Halah, ICCV 2017]
Popularity prediction by styles Keyword popularity prediction
4. Learning from fashion big data
• Social signal + visual signal
• Textual signal
Fashion Conversation Data on Instagram
Successful categorization
Unsuccessful categorization
[Ha, ICWSM 2017]
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features
[Chen, WWW 2017]
The Elements of Fashion Style [Vaccaro, UIST 2016]
Learning semantic relationship between high-level and low-level fashion concepts in text
5. Image generation
• Domain translation
• Creative inspiration
• Virtual fitting
Pixel-Level Domain Transfer
• Generating a product photo given a street snap
[Yoo, ECCV 2016]
https://github.com/fxia22/PixelDTGAN
Pose Guided Person Image Generation [Ma, arXiv 2017]
A Generative Model of People in Clothing
• Generating people from pose map and styling pipeline
[Lassner, ICCV 2017]
Recent topics: discussion
• Fashion tech is strongly application-oriented
• UX in e-commerce and social media
• Computer vision as a building block
• Deep learning almost solves recognition problems
• Contextual modeling is still under investigation
• Data issues: research towards unsupervised / weakly-annotated data
• Machine learning for creativity?
Clothing parsing
CVPR 2012, ICCV 2013, TPAMI 2014, arXiv
style.com
Clothing parsing
Fully-convolutional Neural Networks (FCN)
• CNN for semantic segmentation
• All-convolution architecture
Fully Convolutional Networks for Semantic Segmentation
Jonathan Long, Evan Shelhamer, Trevor Darrell
CVPR 2015
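The "all-convolution" point can be sketched with a toy numpy example (hypothetical layer sizes, not the actual FCN weights): a fully-connected layer over channels is the same operation as a 1x1 convolution, which is what lets an FCN accept inputs of any size and emit a dense prediction map.

```python
import numpy as np

rng = np.random.default_rng(0)

C_in, C_out = 8, 3                       # hypothetical channel counts
W = rng.standard_normal((C_out, C_in))   # weights of an FC layer over channels

def fc(vec):
    """Fully-connected layer applied to a single C_in-dim feature vector."""
    return W @ vec

def conv1x1(fmap):
    """The same weights used as a 1x1 convolution over an H x W x C_in map."""
    return np.einsum('oc,hwc->hwo', W, fmap)

# On a 1x1 feature map the two are identical...
x = rng.standard_normal((C_in,))
assert np.allclose(fc(x), conv1x1(x[None, None, :])[0, 0])

# ...and the convolutional form runs on any spatial size,
# producing a C_out-dim prediction at every location.
fmap = rng.standard_normal((5, 7, C_in))
print(conv1x1(fmap).shape)  # (5, 7, 3)
```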
Looking at outfit to parse clothing
[Architecture diagram: an FCN backbone (stacked conv/pool stages with deconv, crop, and sum skip connections) combined with an outfit encoder (conv layers followed by FC layers and a sigmoid); the encoder output gates the FCN prediction via an elementwise product, followed by deconv, softmax, a CRF, and the loss.]
1. Outfit prediction (clothing combination)
2. Filter out inappropriate garments
3. Smooth out boundary
FCN
Input
[Pongsate, arXiv 2017]
Input Output
Parsing results
[Figure: input / ground-truth / prediction triplets, with labels such as skin, hair, dress, hat/headband, bag, jacket/blazer, necklace, shoes, sweater/cardigan, top/t-shirt, vest, and watch/bracelet.]
Performance [%]

Dataset              Method                   Accuracy  IoU
Fashionista v0.2     Paper doll [ICCV 2013]   84.68     -
Fashionista v0.2     Clothlets [ACCV 2014]    84.88     -
Fashionista v0.2     FCN-8s [CVPR 2015]       87.51     33.97
Fashionista v0.2     Our model                88.34     37.23
Refined Fashionista  FCN-8s [CVPR 2015]       90.09     44.72
Refined Fashionista  Our model                91.74     51.78
CFPD                 CFPD [TMM 2015]          -         42.10
CFPD                 FCN-8s [CVPR 2015]       91.58     51.28
CFPD                 Our model                92.35     54.65
Refined Fashionista dataset
• High-quality, manually-annotated 685 pictures
• Major improvement over v0.2 [CVPR 2012]
• Used to learn the FCN by fine-tuning from a pre-trained model [Long 2015]
Pixel-based annotation using superpixels
https://github.com/kyamagu/js-segment-annotator
Coarse-to-fine superpixels on the fly
• SLIC superpixels computed on the client-side
• Takes just a second in modern browsers
• Efficient annotation from large to small segments
Limitations
• CRF tends to trim small items
  • sunglasses
  • watch/bracelet
• Dress vs. top+skirt distinction is still hard
Pose estimation using FCN
• Human pose as heatmap of parts
• Predict heatmap by FCN
• Can pose help segmentation, or vice versa?
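A minimal sketch of the heatmap target such pose FCNs regress, assuming a single part and a hypothetical Gaussian width:

```python
import numpy as np

def keypoint_heatmap(h, w, cy, cx, sigma=2.0):
    """Render one body joint as a 2D Gaussian heatmap (peak = 1 at the joint)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# One heatmap channel per part; the FCN regresses this stack.
hm = keypoint_heatmap(64, 48, cy=20, cx=30)
print(hm.shape, hm.max())  # (64, 48) 1.0
assert np.unravel_index(hm.argmax(), hm.shape) == (20, 30)
```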
Pose estimation results
[Examples with truth and prediction: good / small mistake / failure.]
Discussion
• Lack of data size
  • 685 pictures are not sufficient for a deep-learning approach
• Global information in segmentation
  • Local appearance alone cannot resolve the confusion
  • Need a global prediction of the clothing combination to avoid confusion between items
Attribute discovery
ECCV 2016
Visual attribute perception
• What does a _______ t-shirt look like?
• yellow
• large
• surfer
• comfy
• original
• popular
...onehourtees.com
www.justclick.ie
www.matsongraphics.com
polyvore.com
Another question
• How many words can we use to describe the visual attributes of a t-shirt?
• My t-shirt looks __________.
Automatic attribute discovery
• Finding vocabulary of attributes
• Open-world recognition challenge
• Using pre-trained deep neural networks to identify visual words in the Web data
Our approach
Pre-trained deep CNN
beautiful soft blush handmade leather ballet flats.
***please, note, our new blush ballet flats are
without the beige trim line (around the edges),
still just as beautiful and perhaps even more***
SIZING
✍ how to take measurements ✍
there are a number of ways to measure your
feet, however we find the quickest and most
reliable practice is by tracing your feet. Here is
how to do it: stand on a piece of paper that's
bigger than your feet, circle your feet around
with a straight standing pencil (without pressing
the pencil too hard to the edges of your feet).
Once you have the tracing, measure distance
between longest and widest points. Compare
the measurements to the list below.
Image Text
white
red
striped
wooden
sliky
...
Attributes
1. Get Web data
2. Analyze DNN's
internal activity
Web data:unlimited vocabulary with images
Textual description: Feel So Good ... Purple Halter Maxi Cotton dress 2 Sizes Available
Tags: used, american casual, summer, shorts, t-shirt, surfer, printed, duffer
Etsy dataset: e-commerce Wear dataset: fashion-blog
Discovery: intuition
• Contrast positive and negative sets to identify difference of semantics
pink not pink
Identifying difference at neurons
conv1
conv2
conv3
conv4
conv5
fc6
fc7
positive
negative
Deep neural
network
Activation
histograms
unit #1
unit #2
...
KL
divergence
Images
neurons
Why neural activation?
• Discriminability
  • If the attribute is visual, the positive set should activate a different set of neurons
• Semantic depth
  • The depth of the activating layer should encode semantic information
...
conv1
conv2
conv3
conv4
conv5
fc6
fc7
Activation
histograms
Deep neural
network
Kullback-Leibler divergence
• Measure of difference between P+ and P-
• Used to identify highly-activating units
D_KL(P+ ‖ P−) ≜ Σ_x P+(x) log( P+(x) / P−(x) )
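A minimal numpy sketch of this contrastive step, with synthetic activations standing in for real DNN features (unit count, bin count, and shapes are all hypothetical):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """D_KL(P+ || P-) over histogram bins, with smoothing for empty bins."""
    p = p + eps; q = q + eps
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def rank_units(acts_pos, acts_neg, bins=20):
    """Score each unit by KL between its positive and negative activation histograms."""
    scores = []
    for u in range(acts_pos.shape[1]):
        lo = min(acts_pos[:, u].min(), acts_neg[:, u].min())
        hi = max(acts_pos[:, u].max(), acts_neg[:, u].max())
        p, _ = np.histogram(acts_pos[:, u], bins=bins, range=(lo, hi))
        q, _ = np.histogram(acts_neg[:, u], bins=bins, range=(lo, hi))
        scores.append(kl_divergence(p.astype(float), q.astype(float)))
    return np.argsort(scores)[::-1]  # units sorted by decreasing KL

# Synthetic activations: unit 0 responds differently to positive images, units 1-2 do not.
rng = np.random.default_rng(0)
pos = rng.standard_normal((500, 3)); pos[:, 0] += 3.0
neg = rng.standard_normal((500, 3))
assert rank_units(pos, neg)[0] == 0
```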
KL visualization: shorts
Positive / Negative
pool5 KL
average
image
norm2 KL
KL visualization: red
Positive / Negative
pool5 KL
average
image
norm2 KL
Is the attribute visual?
• Which attribute is visually perceptible?
• Measure the classification performance, and compare against human
yellow, comfy, large, original, surfer, popular
Visualness
• Visualness of word u given a classifier f and dataset D+, D-:
V(u | f) ≜ accuracy(f, D_u+, D_u−)
positive negative
D+ D-
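A hedged sketch of the visualness score, using a nearest-centroid classifier as a stand-in for the actual classifier f, on synthetic features:

```python
import numpy as np

def visualness(feats_pos, feats_neg):
    """V(u | f): held-out accuracy of a classifier separating D_u+ from D_u-.
    A nearest-centroid classifier stands in for f here."""
    X = np.vstack([feats_pos, feats_neg])
    y = np.array([1] * len(feats_pos) + [0] * len(feats_neg))
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    split = len(X) // 2
    tr, te = idx[:split], idx[split:]
    c1 = X[tr][y[tr] == 1].mean(axis=0)   # positive centroid
    c0 = X[tr][y[tr] == 0].mean(axis=0)   # negative centroid
    pred = (np.linalg.norm(X[te] - c1, axis=1) <
            np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    return float((pred == y[te]).mean())

# A visually separable word scores high; a non-visual one scores near chance.
rng = np.random.default_rng(1)
visual = visualness(rng.normal(2, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
non_visual = visualness(rng.normal(0, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
assert visual > non_visual
```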
Discovered attributes: lovely, bright, orange, acrylic, elegant
(contrasted with NOT lovely, NOT bright, NOT orange, NOT acrylic, NOT elegant)
Discovery in noisy data
annotated floral vs. NOT annotated floral
predicted MOST floral predicted LEAST floral
False positives
False negatives
Most/least visual attributes
Method: Human
  Most:  flip, pink, red, floral, blue, sleeve, purple, little, black, yellow
  Least: url, due, last, right, additional, sure, free, old, possible, cold
Method: Pre-trained + Resample
  Most:  flip, pink, red, yellow, green, purple, floral, blue, sexy, elegant
  Least: big, great, due, much, own, favorite, new, free, different, good
Method: Attribute-tuned
  Most:  flip, sexy, green, floral, yellow, pink, red, purple, lace, loose
  Least: right, same, own, light, happy, best, small, different, favorite, free
Method: Language prior
  Most:  top, sleeve, front, matching, waist, bottom, lace, dry, own, right
  Least: organic, lightweight, classic, gentle, adjustable, floral, adorable, url, elastic, super
Perceptual depth
• Which layer affects attribute recognition?
[Chart: salience of example words (orange, bright, elegant, lovely, acrylic) across layers conv1 through fc7.]
Most salient words (Etsy)
norm1 norm2 conv3 conv4 pool5 fc6 fc7
orange green bright flattering lovely many sleeve
colorful red pink lovely elegant soft sole
vibrant yellow red vintage natural new acrylic
bright purple purple romantic beautiful upper cold
blue colorful green deep delicate sole flip
welcome blue lace waist recycled genuine newborn
exact vibrant yellow front chic friendly large
yellow ruffle sweet gentle formal sexy floral
red orange French formal decorative stretchy waist
specific only black delicate romantic great American
Most salient words (Wear)

norm1: blue, green, red-black, red, denim-on-denim, denim-skirt, pink, denim, yellow, leopard
norm2: denim-jacket, pink, red, red-socks, red-black, champion, blue, white, shirt, i-am-clumsy
conv3: border-striped-tops, stripes, dark-style, stripes, backpack, red, dark-n-dark, denim-shirt, navy, outdoor-style
conv4: kids, bucket-hat, hat-n-glasses, black, sleeveless, American-casual, long-cardigan, white-n-white, stole, mom-style
pool5: shorts, half-length, pants, denim, dotted, border-stripes, white-pants, border-striped-tops, gingham-check, sandals
fc6: white-skirt, flared-skirt, spring, upper, beret, shirt-dress, overalls, hair-band, loincloth-style, matched-pair
fc7: long-skirt, suit-style, midi-skirt, gaucho-pants, handmade, straw-hat, white-n-white, white, white-coordinate, white-pants
Saliency detection
• Can we identify salient region of the discovered attribute?
tulle-skirt
Our approach: cumulative field
• Accumulating receptive field of highly-activating neurons by KL
[Diagram: each highly-activating unit (ranked by KL across layers conv1 through fc7) contributes its receptive field on the masked input; summing these fields over units yields the saliency map.]
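A sketch of the accumulation step under simplifying assumptions (axis-aligned receptive-field boxes and precomputed KL weights, both hypothetical here):

```python
import numpy as np

def cumulative_field(img_shape, fields, kl_scores):
    """Accumulate the receptive fields of highly-activating units,
    each weighted by its KL score, into a saliency map."""
    saliency = np.zeros(img_shape, dtype=float)
    for (y0, y1, x0, x1), score in zip(fields, kl_scores):
        saliency[y0:y1, x0:x1] += score
    if saliency.max() > 0:
        saliency /= saliency.max()   # normalize to [0, 1]
    return saliency

# Hypothetical receptive-field boxes (y0, y1, x0, x1) and KL weights for 3 units.
sal = cumulative_field((8, 8),
                       [(1, 5, 1, 5), (2, 6, 2, 6), (0, 3, 5, 8)],
                       [0.9, 0.7, 0.2])
assert sal.max() == 1.0
assert sal[3, 3] == 1.0   # overlap of the two strongest fields
```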
[Saliency examples for sunglasses, shorts, sneakers, gingham check, white, style, yellow: human annotation vs. model with K=64.]
Attribute discovery
• Web data + deep network
• Highly-activating neurons identify the visual stimuli associated with a given word
• Neural activations can further identify salient regions
Studying fashion styles
ECCV 2014
Q: What makes the boy on the right look Harajuku-style?
Tie? Shoes?
tokyofashion.com
Goal
• Finding what constitutes a fashion style
• Approach
  • Game-based annotation
  • Attribute factorization
Goth
Who’s more Bohemian?
hipsterwars.com
Game-based relative “style-ness” collection
Asking our online friends for participation
NO MONETARY REWARDS!
Initial keyword-search on Google or Fashion SNS
Participation statistics
Most users played the game for only a few clicks
Some motivated users clicked A LOT
TrueSkill game algorithm
• Algorithm to select which pair to play
• Idea:
  • Represent each image by a Gaussian over its rating
  • Update the Gaussian parameters after each click
  • Choose expected-to-tie images for play
[R Herbrich, 2007]
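A simplified two-player TrueSkill-style update, assuming no draw margin and hypothetical parameter values (the full Herbrich 2007 algorithm also models draws and teams):

```python
import math

def _pdf(x):  # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def _cdf(x):  # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def update(winner, loser, beta=4.0):
    """One TrueSkill-style update: each image is a Gaussian (mu, sigma) over its
    rating; a click on `winner` shifts the means apart and shrinks both sigmas."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2 * beta ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # mean-shift factor
    w = v * (v + t)         # variance-shrink factor, 0 < w < 1
    winner = (mu_w + s_w ** 2 / c * v, s_w * math.sqrt(1 - s_w ** 2 / c ** 2 * w))
    loser = (mu_l - s_l ** 2 / c * v, s_l * math.sqrt(1 - s_l ** 2 / c ** 2 * w))
    return winner, loser

a, b = (25.0, 8.0), (25.0, 8.0)
a, b = update(a, b)   # a "wins" one comparison
assert a[0] > 25.0 > b[0] and a[1] < 8.0 and b[1] < 8.0
```

Choosing "expected-to-tie" pairs then amounts to pairing images with similar means and large remaining uncertainty, which maximizes the information gained per click.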
Score distribution after game
Most Hipster
Least Hipster
Annotation examples
[Slide shows a page from the ECCV 2014 paper: Fig. 5, top and bottom predictions of the within-class classification task (δ = 0.5) for the Pinup, Goth, Hipster, Bohemian, and Preppy styles. The within-class task trains one linear SVM per style to separate the top-rated fraction δ of images from the bottom-rated fraction δ (δ from 10% to 50%, using the ratings of Sec 3.2), over 100 random folds with a 9:1 train/test ratio and C chosen by 5-fold cross-validation. Performance is best for pinup, with goth second, and degrades as δ grows and less extreme examples are included.]
High-quality dataset without Amazon MTurk
Relative vs. absolute
• Gamification resulted in higher-quality annotation
• Asked MTurk workers for 1-10 ratings
• Much noisier results from MTurk
Analyzing what makes her look preppy
Factorization results
Fashion style analysis
• Game-based annotation collected high-quality data without monetary rewards
• How can we collect seed images?
Fashion trend analysis
WACV 2015
Fashion trend: runway to realway
Fashion show / Street
style.com chictopia.com
Runway dataset
~35K images from 9K fashion shows over 15 years (2000 to 2014)
What does "similar" mean?
The real challenge is the definition of similarity.
The query image is given in the left column, while five candidate images are shown in the right columns.
1. Select an image with the most similar outfit to the query.
2. If there is NO similar image, please select NONE.
Query image
NONE
Collecting human judgments to learn similarity
Select an image with the most similar outfit to the query image
Visual processing
Pose estimation, foreground segmentation, boundary map
Runway-to-runway retrieval
Retrieving similar styles from other fashion shows
Runway-to-realway retrieval
Retrieving similar styles from street snaps
Visually analyzing the floral trend
Runway image of floral; retrieved street images with timestamps
Peaks in spring!
% retrieved images
Runway to realway analysis
• What is considered similar in fashion?
• Our approach: Learn human judgment
• Tracking similarity over time = trend analysis of a specific style
Visual Popularity AnalysisACM Multimedia 2014
Online fashion networks
Chictopia
Lookbook
Chicisimo
Tumblr
...
www.chictopia.com
Like button in Chictopia
Long tail
Promotion effect?
~300K posts
Why do some pictures get popular?
Content factors
Social factors
• Active posting
• Lots of friends
• Good fashion items
• Photo quality
How much do they matter?
Regression analysis in 300K posts
Input → Output
• Content factors: tag TF-IDF, image composition, color entropy, style descriptor, parse descriptor
• Social factors: user identity, previous posts, node degrees
• Output: popularity (votes)
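The regression setup can be sketched on synthetic data (feature names and coefficients are illustrative, not the actual Chictopia statistics); when popularity is driven by the social block, the social-only fit dominates, mirroring the slide's finding.

```python
import numpy as np

def r2(X, y):
    """Least-squares fit of y on X; return the coefficient of determination R^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    return float(1 - resid @ resid / ((y - y.mean()) @ (y - y.mean())))

# Synthetic posts: popularity driven mainly by social features (e.g. #followers),
# with content features nearly uninformative.
rng = np.random.default_rng(0)
n = 1000
social = rng.standard_normal((n, 3))    # user identity, previous posts, node degrees
content = rng.standard_normal((n, 5))   # tag TF-IDF, composition, color entropy, ...
votes = social @ np.array([2.0, 1.0, 0.5]) + 0.3 * rng.standard_normal(n)

print(r2(social, votes), r2(content, votes), r2(np.hstack([social, content]), votes))
assert r2(social, votes) > 0.9 > 0.5 > r2(content, votes)
```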
Findings
• Your outfit doesn’t matter (!!!)
• Popularity is mostly the outcome of the network – social bias
• #votes ∝ #followers
• People just click on friends’ photos
• c.f., Rich-get-richer phenomenon
Regression performance
Factors           R2     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.491  0.682     0.847               0.779
Content           0.248  0.488     0.778               0.737
Social + Content  0.493  0.685     0.845               0.775
Social factors significantly boost the performance
What if there is no social network?
• Popularity = f(content factors)?
Ask crowds!
• Collecting popularity votes on Amazon MTurk
• No network!
3,000 pictures, 25 assignments each
Out-of-network popularity
#posts
#votes
No social factors in the voting process
Task
• Predict crowd popularity using Content factors and/or Social factors in Chictopia
Social factors
Chictopia
Content factors
MTurk
Voting data
?
Predicting crowd votes
Factors           R2     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.423  0.634     0.845               0.787
Content           0.428  0.647     0.888               0.862
Social + Content  0.473  0.686     0.884               0.858
• Content factors matter
• Social factors from Chictopia predict crowd votes well
• User-content correlation: top bloggers consistently post good pictures
Predicted most popular
Predicted least popular
Popularity prediction
The data told us...
• Popularity is mostly the outcome of the social network
• People click on friends' photos
• Content affects popularity, but we conjecture the existence of user-content correlation
Computer Vision meets Fashion
• Computer vision = machine perception to quantify visual content
• Tool to analyze the semantics of fashion
• Research topics
• Recognition, street2shop, style understanding, social influence, fashion trend, creativity