
Confidential + Proprietary

Collaborative Deep Metric Learning for Video Understanding

Machine Perception, Google Research

August 23, 2018

Balakrishnan Varadarajan

Paul Natsev

Joonseok Lee

Sami Abu-El-Haija


What is Video Understanding?

Figure skating, Winter sports, Ice rink, Pair skating


What is Video Understanding?

{(238, 204, 187), (238, 187, 187), …, (255, 221, 221), (255, 238, 204), …, (255, 238, 221), (238, 238, 221), …}

Figure skating, Winter sports, Ice rink, Pair skating


Goal

Collaborative Deep Metric Learning

We’d like to learn a content-aware video embedding that preserves the video-video similarity given by collaborative filtering.


Overview

Pre-trained Deep Models → Visual/Audio Features → Embedding Model → Video Embedding

Video Embedding → Nearest Neighbor Search → Related Video Retrieval

Video Embedding → User Modeling → Video Recommendation

Video Embedding → Classifier → Video Annotation


Feature Extraction

[Figure: feature-extraction diagram. Recoverable labels: Frames, Audio, Image feature extractor, Audio feature extractor, Pooling (×2), L2 (×2), Video feature, Audio feature, X, FC (×2), L2, Final embedding.]


Embedding Models: Triplet Loss

● Train a model that preserves pairwise distances between videos.

● Triplet loss: the {anchor, positive} pair should be closer than the {anchor, negative} pair.

[1] F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015.
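The triplet objective can be sketched in a few lines. This is a minimal illustration, not the model used in the talk: the squared Euclidean distance and the margin value 0.2 are assumptions in the style of FaceNet [1], and the function names are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin is an assumed hyperparameter).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive, measured in squared L2 distance.
    """
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (illustrative values only).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]    # near the anchor
negative = [-1.0, 0.0]   # far from the anchor

well_ordered = triplet_loss(anchor, positive, negative)
violated = triplet_loss(anchor, negative, positive)  # roles swapped
```

With these toy vectors the well-ordered triplet incurs no loss, while swapping the positive and negative produces a positive loss that a gradient step would reduce.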



Ground Truth for “Related Videos”?

[2] D. Goldberg, D. Nichols, B. M. Oki, D. Terry. Using collaborative filtering to weave an information tapestry, Communications of the ACM, 1992.

● Collaborative Filtering [2]: Users (implicitly) collaborate to filter items relevant to themselves by annotating their preferences.

● Collected users’ aggregated watch history from YouTube.

○ Co-watched videos are regarded as related.
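One way to turn watch histories into "related" pairs is to count how many users watched both videos. The sketch below is only an illustration: the slides do not say how co-watches were aggregated, so the per-user history format, the `min_cowatch` threshold, and the function names are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cowatch_counts(histories):
    """Count, for each unordered video pair, how many users watched both."""
    counts = Counter()
    for videos in histories:
        # De-duplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(videos)), 2):
            counts[(a, b)] += 1
    return counts


def related_pairs(histories, min_cowatch=2):
    """Pairs co-watched by at least `min_cowatch` users (assumed threshold)."""
    counts = cowatch_counts(histories)
    return {pair for pair, n in counts.items() if n >= min_cowatch}


# Hypothetical watch histories, one list per user.
histories = [
    ["v1", "v2", "v3"],
    ["v1", "v2"],
    ["v2", "v3", "v4"],
]
pairs = related_pairs(histories)
```

Here only (v1, v2) and (v2, v3) are co-watched by two users, so they are the pairs treated as related.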



Related Video Retrieval



Related Video Retrieval

● Task: Given a query video q with its content features x_q, rank videos in a candidate set according to relevance to q.

● Cold-start

○ Training on triplets from T2T.

○ Evaluation on T2E and E2T.

● Dataset

○ Trained on 500M triplets.

○ Tested on 100M eval set (T2E + E2T).

                         Positive: Training (70%)   Positive: Eval (30%)
Anchor: Training (70%)   1. T2T                     2. T2E
Anchor: Eval (30%)       3. E2T                     -
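The cold-start split can be illustrated by bucketing each (anchor, positive) pair according to which side of the video split its two members fall on. This is a sketch under assumptions: the slides do not say how videos were assigned to the 70%/30% splits, nor what happens to pairs whose videos are both in the eval split, so the explicit `train_videos` set and the `E2E` bucket are hypothetical.

```python
def partition_pairs(pairs, train_videos):
    """Bucket (anchor, positive) pairs as T2T, T2E, E2T, or E2E.

    A video counts as "T" if it is in `train_videos`, otherwise "E".
    """
    buckets = {"T2T": [], "T2E": [], "E2T": [], "E2E": []}
    for anchor, positive in pairs:
        a = "T" if anchor in train_videos else "E"
        p = "T" if positive in train_videos else "E"
        buckets[a + "2" + p].append((anchor, positive))
    return buckets


# Stand-in for the 70% training split; every other video is eval.
train_videos = {"v1", "v2", "v3"}
pairs = [("v1", "v2"), ("v1", "v9"), ("v8", "v3"), ("v8", "v9")]
buckets = partition_pairs(pairs, train_videos)
```

Training on T2T only and evaluating on T2E and E2T means every evaluation pair contains at least one video the model never saw during training, which is what makes the setting cold-start.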



Examples



Personalized Video Recommendation



Personalized Video Recommendation

● Task: Given a user with watch history Q = {x_1, x_2, …, x_|Q|}, rank videos in a candidate set according to the user’s preference.

○ Similar to related video retrieval, but with multiple query videos.

● Average aggregation: favors videos that are, on the whole, in harmony with the entire watch history.

● Max aggregation: favors videos most related to any one of the user’s tastes.
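The two aggregation rules can be compared on a toy example. The sketch assumes that relevance is the dot product between embeddings; the vectors and the function names are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_candidates(history, candidates, aggregate="avg"):
    """Score each candidate against a watch history.

    aggregate="avg": mean similarity to all watched videos.
    aggregate="max": similarity to the single closest watched video.
    """
    if aggregate not in ("avg", "max"):
        raise ValueError("aggregate must be 'avg' or 'max'")
    scores = []
    for v in candidates:
        sims = [dot(q, v) for q in history]
        scores.append(max(sims) if aggregate == "max" else sum(sims) / len(sims))
    return scores


# A user with two distinct tastes (illustrative 2-D embeddings).
history = [[1.0, 0.0], [0.0, 1.0]]
# Candidate 0 matches one taste exactly; candidate 1 sits between both.
candidates = [[1.0, 0.0], [0.6, 0.6]]

avg_scores = score_candidates(history, candidates, "avg")
max_scores = score_candidates(history, candidates, "max")
```

In this example average aggregation ranks the in-between candidate first, while max aggregation prefers the candidate that matches one taste exactly.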



Scalability Issues

● A naive implementation: |Q|·|V| dot-products.

Dot-product scores (rows: watch history Q; columns: candidate set V):

       0.92  0.75  0.23  0.90
       0.05  0.14  0.76  0.06
       0.43  0.56  0.37  0.14
       0.02  0.11  0.04  0.03
MAX    0.92  0.75  0.76  0.90
AVG    0.36  0.39  0.35  0.28

● At YouTube scale? (figure labels: 5B, 200)

● But ranking must be done in ~100ms.

● With prefiltering, |V| = 100~1000. (figure label: 200)

● Still, computing 40,000 dot-products per user in 100ms is hard.

● So, what should we do?


Optimizations for Scalability

● Pre-computing averaged watch history

○ Thanks to the linearity of the dot-product, we compute the averaged watch history once and reuse it for every candidate.

○ Total time complexity is O(|Q| + |V|) instead of O(|Q||V|).

○ Unfortunately, max aggregation cannot be optimized in this way.
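The linearity argument is that (1/|Q|) Σ_q ⟨q, v⟩ = ⟨(1/|Q|) Σ_q q, v⟩, so the average score needs only one dot-product per candidate once the mean watch-history vector is known. The sketch below checks this numerically; the vectors and function names are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def avg_scores_naive(history, candidates):
    """|Q|*|V| dot-products: score every (watched, candidate) pair."""
    return [sum(dot(q, v) for q in history) / len(history) for v in candidates]


def avg_scores_fast(history, candidates):
    """One pass over Q to average, then |V| dot-products."""
    q_bar = mean_vector(history)
    return [dot(q_bar, v) for v in candidates]


history = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
candidates = [[1.0, 0.0], [0.6, 0.6], [0.0, -1.0]]

naive = avg_scores_naive(history, candidates)
fast = avg_scores_fast(history, candidates)
```

Both functions return the same scores up to floating-point rounding. The same trick does not carry over to max aggregation because the maximum of dot-products is not linear in the watch-history vectors.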



MovieLens Experiment

● Collected MovieLens trailers from YouTube.

○ 22,798 trailers (out of 27,279 movies, 83.6%) for MovieLens 20M

○ Released pre-computed features through the MovieLens official site.

● Cold-start experiment

○ Our content-based models outperform the CF model when we know less about the users!



Video Annotation/Classification



Video Annotation Problem

● Model as a multi-label classification problem, learning a mapping from video features x to d binary labels y ∈ {0, 1}^d.

● Data preprocessing:

○ PCA to 256-D, then L2 normalize.

Input x is correlated; PCA decorrelates the dimensions.

Whitening ensures each dimension is equally important.

z: L2-normalized input data
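The preprocessing can be written compactly as follows. This is the standard form of PCA whitening followed by L2 normalization, given here as an assumption about what the slide describes. The symbols μ (training-set mean of x), U_k (top-k eigenvectors of the covariance of x) and Λ_k (diagonal matrix of the corresponding eigenvalues) do not appear on the slide.

```latex
\tilde{x} = \Lambda_{k}^{-1/2}\, U_{k}^{\top} (x - \mu),
\qquad
z = \frac{\tilde{x}}{\lVert \tilde{x} \rVert_{2}},
\qquad
k = 256
```

Projecting onto U_k decorrelates the dimensions, scaling by Λ_k^{-1/2} gives each retained dimension unit variance, and the final division puts z on the unit sphere.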


Video Annotation Problem

● Mixture of Experts (MoE) classifier:

○ Given z, the MoE model estimates the probability p(e|z) that an entity e exists in the video as a weighted average over experts h. For each expert h, we use a binary logistic regression classifier.

p(e|z) = Σ_h p(h|z) · p(e|z, h)

p(e|z): probability of label e given features z (e.g., Hyundai Sonata)

p(h|z): probability of a hidden state h given features z (e.g., interior of car, engine)

p(e|z, h): probability of label e given features z and hidden state h (e.g., probability of Sonata given it is a view of the engine); logistic regression

○ We train a different MoE for each label. Each MoE is trained independently, in parallel.

○ More info about the model is available here: https://docs.google.com/presentation/d/1BpIeYvS1IjYXJIXK1p51KRBSSG3J3itocpggrXcUqjI/
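The per-label mixture, p(e|z) as a weighted average of logistic-regression experts, can be sketched as below. The softmax gate for p(h|z), the omission of bias terms, and all weight values are assumptions made for illustration. The slide only states that the experts are binary logistic regressions combined by a weighted average.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(t - m) for t in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_probability(z, gate_weights, expert_weights):
    """p(e|z) = sum over h of p(h|z) * p(e|z, h) for a single label e."""
    gates = softmax([dot(w, z) for w in gate_weights])       # p(h|z), assumed softmax
    experts = [sigmoid(dot(w, z)) for w in expert_weights]   # p(e|z, h), logistic regression
    return sum(g * p for g, p in zip(gates, experts))


z = [0.6, 0.8]  # an L2-normalized feature vector (illustrative)

# With all-zero weights the gate is uniform and every expert outputs 0.5.
p_uniform = moe_probability(z, [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])

# Arbitrary hypothetical weights for two experts.
p_example = moe_probability(z, [[1.0, -1.0], [-1.0, 1.0]], [[2.0, 0.0], [0.0, -2.0]])
```

Because the gate outputs sum to one and each expert outputs a value in (0, 1), the mixture is itself a valid probability. One such model would be trained per label, independently of the others.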


Experimental Result: YouTube-8M

Rank  Team Name      Video-level features only    Frame-level features used
                     Single      Ensemble         Single      Ensemble
1     WILLOW         -           -                0.8300      0.8469
2     monkeytyping   0.8106      0.8225           0.8179      0.8458
3     offline        0.8082      -                0.8275      0.8454
4     FDT            -           -                0.8178      0.8419
5     You8M          -           0.8308           -           0.8418
6     Rankyou        0.8041      -                0.8246      0.8408
7     Yeti           -           -                0.8254      0.8396
8     SNUVL X SKT    -           -                0.8200      0.8389
9     LanzanRamen    -           -                -           0.8372
10    Samartian      -           -                0.8139      0.8366
      Ours           0.8430      -                -           -

Scores in GAP (Global Average Precision); higher values are better.


Summary



Take-home Messages

● Signals that are indirectly related can be useful to various tasks.

○ CF signals are useful for video annotation as well.

● Pure content models can perform comparably to CF models.

○ They even outperform them in cold/cool-start cases.

● Analyzing video content at large scale is challenging, but we are improving.

○ Video features are extracted and widely used in Google products (YouTube, Photos, and more).


Thank you!
