
Page 1: An In-Depth Evaluation of Multimodal Video Genre Categorization

University Politehnica of Bucharest

An In-Depth Evaluation of Multimodal Video Genre Categorization

Ionuț MIRONICĂ 1, [email protected]
Bogdan IONESCU 1,2, [email protected]
Peter KNEES 3, [email protected]
Patrick LAMBERT 2, [email protected]

11th International Workshop on Content-Based Multimedia Indexing, CBMI 2013, Veszprém, Hungary, June 17-19, 2013.

Page 2: An In-Depth Evaluation of Multimodal Video Genre Categorization


Presentation outline

• Introduction

• Video Content Description

• Fusion Techniques

• Experimental Results

• Conclusions

Page 3: An In-Depth Evaluation of Multimodal Video Genre Categorization


Problem Statement

Concepts:
• Content-Based Video Retrieval
• Genre Retrieval

[Diagram: a genre query is run against the video database and returns the query results.]

Page 4: An In-Depth Evaluation of Multimodal Video Genre Categorization


Global Approach

> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm.

[Diagram: labeled data from a tagged video database is used to train a classifier; the trained classifier then labels unlabeled data with genre tags such as web, food, autos.]

Page 5: An In-Depth Evaluation of Multimodal Video Genre Categorization


Global Approach

• the entire process relies on the concept of "similarity" computed between content annotations (numeric features).

We focus on:
• objective 1: go (truly) multimodal: visual, audio, text & metadata;
• objective 2: test a broad range of classifiers;
• objective 3: test a broad range of fusion techniques.

Page 6: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - audio

Standard audio features (audio frame-based) [B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]:
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.

Frame-level features f1, f2, ..., fn are aggregated over time: the global feature is the mean & variance of each frame feature (a sketch follows below).
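A minimal sketch of this frame-based description with mean/variance aggregation. The slides use the Yaafe toolbox; librosa is substituted here purely for illustration, and the particular feature subset and sample rate below are assumptions.

```python
# Frame-based audio features aggregated into a global mean/variance vector.
import numpy as np
import librosa

def audio_global_feature(path):
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, n_frames)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, n_frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, n_frames)
    frames = np.vstack([mfcc, zcr, centroid])                 # (15, n_frames)
    # Global descriptor: mean and variance of each feature over time.
    return np.concatenate([frames.mean(axis=1), frames.var(axis=1)])
```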

Page 7: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

MPEG-7 & color/texture descriptors (visual frame-based) [OpenCV toolbox, http://opencv.willowgarage.com]:
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Structure Color Descriptor,
• Classic color histogram,
• Color moments.

Frame-level features f1, f2, ..., fn are aggregated over time: the global feature concatenates the mean, dispersion, skewness, kurtosis, median and root mean square of each frame feature (see the sketch below).
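A minimal sketch of the temporal aggregation step only: given per-frame visual descriptors, it builds a global video feature from the six statistics named above. The input shape is an assumption.

```python
# Aggregate per-frame descriptors into one global video descriptor.
import numpy as np
from scipy import stats

def aggregate_frames(frame_feats):
    """frame_feats: array of shape (n_frames, dim), one descriptor per frame."""
    return np.concatenate([
        frame_feats.mean(axis=0),                   # mean
        frame_feats.std(axis=0),                    # dispersion
        stats.skew(frame_feats, axis=0),            # skewness
        stats.kurtosis(frame_feats, axis=0),        # kurtosis
        np.median(frame_feats, axis=0),             # median
        np.sqrt((frame_feats ** 2).mean(axis=0)),   # root mean square
    ])
```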

Page 8: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Feature descriptors: Bag of Visual Words [CIVR 2009, J. Uijlings et al.]
• we train the model with 4,096 words;
• rgbSIFT and spatial pyramids (2x2).

Bag-of-Visual-Words framework: detect interest points, build a codeword dictionary, generate BoW histograms, train a classifier. A minimal end-to-end sketch follows below.
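A minimal Bag-of-Visual-Words sketch: SIFT keypoints, a k-means codebook, and per-image histograms. The slides use rgbSIFT with spatial pyramids; plain grayscale SIFT without pyramids is used here to keep the illustration short, so treat the details as assumptions.

```python
# Bag-of-Visual-Words: local descriptors -> codebook -> normalized histograms.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc  # (n_keypoints, 128) or None if no keypoints found

def build_codebook(all_descriptors, n_words=4096):
    # Cluster the pooled training descriptors into the visual vocabulary.
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    kmeans.fit(np.vstack(all_descriptors))
    return kmeans

def bow_histogram(desc, kmeans):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()  # L1-normalized BoW histogram
```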

Page 9: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Feature descriptors: Histogram of Oriented Gradients (HoG) [CITS 2009, O. Ludwig et al.]
• divides the image into 3x3 cells and, for each of them, builds a pixel-wise histogram of edge orientations (a sketch follows below).
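A minimal HoG sketch using scikit-image. The slide describes a 3x3 grid of cells with per-cell edge-orientation histograms; deriving the cell size from the image dimensions to mimic that grid, and the orientation count, are assumptions.

```python
# Whole-image HoG over a 3x3 grid of cells, no block normalization.
from skimage.feature import hog

def hog_descriptor(gray_image, n_cells=3, n_orientations=9):
    h, w = gray_image.shape
    return hog(
        gray_image,
        orientations=n_orientations,
        pixels_per_cell=(h // n_cells, w // n_cells),  # 3x3 cell grid
        cells_per_block=(1, 1),                        # one cell per block
        feature_vector=True,
    )  # typically 3 * 3 * n_orientations = 81 values
```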

Page 10: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Structural descriptors [IJCV, C. Rasche'10]

Objective: describe structural information in terms of contours and their relations.

Contour properties:
• b: degree of curvature (proportional to the maximum amplitude of the bowness space); straight vs. bow;
• ζ: degree of circularity; half circle vs. full circle;
• e: edginess parameter; zig-zag vs. sinusoid;
• y: symmetry parameter; irregular vs. "even".

Appearance parameters:
• c_m, c_s: mean and standard deviation of intensity along the contour;
• f_m, f_s: fuzziness, obtained from a blob (DOG) filter: I * DOG.

Page 11: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - text

TF-IDF descriptors (Term Frequency-Inverse Document Frequency)

Text sources: ASR transcripts and metadata.

1. remove XML markup,
2. remove terms below the 5%-percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g., m = 150 for ASR and 20 for metadata) with the highest χ² values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.

A sketch of steps 2-4 follows below.
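A minimal sketch of per-class chi-squared term selection followed by TF-IDF weighting, using scikit-learn. The per-genre top-m selection mirrors the slide; the exact preprocessing and thresholds in the paper may differ.

```python
# Per-genre chi-squared vocabulary selection, then TF-IDF representation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import chi2

def tfidf_features(docs, labels, m_per_class=150):
    counts = CountVectorizer(min_df=2).fit_transform(docs)  # raw term counts
    labels = np.asarray(labels)
    vocab_idx = set()
    for genre in np.unique(labels):
        # chi² of each term for this genre vs. its complement classes.
        scores, _ = chi2(counts, labels == genre)
        vocab_idx.update(np.argsort(scores)[-m_per_class:])
    keep = sorted(vocab_idx)
    return TfidfTransformer().fit_transform(counts[:, keep])
```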

Page 12: An In-Depth Evaluation of Multimodal Video Genre Categorization


Classifiers

We test a broad range of classifiers:
• SVM with linear, RBF and χ² kernels (a sketch of the χ² kernel SVM follows below);
• 5-NN;
• Random Forests and Extremely Random Forests.
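A minimal sketch of an SVM with a χ² kernel via scikit-learn's precomputed-kernel interface, since SVC has no built-in χ² kernel. The feature matrices are assumed to be non-negative histograms (e.g., BoW), and the C/gamma values are placeholders.

```python
# SVM with a precomputed chi-squared kernel.
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, X_test, gamma=1.0):
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K_train, y_train)
    # Kernel between test rows and the training rows, in that order.
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)
    return clf, clf.predict(K_test)
```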

Page 13: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Early Fusion
1. Feature extraction: compute descriptors 1 to n;
2. Feature normalization: normalize each descriptor;
3. Feature concatenation: concatenate the normalized descriptors into a single global descriptor;
4. Classification step: a single classifier produces the global confidence score used for the decision.

A sketch of this pipeline follows below.
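A minimal early-fusion sketch: normalize each descriptor, then concatenate into one global descriptor for a single classifier. The slides do not specify the normalization; min-max scaling is assumed here.

```python
# Early fusion: per-descriptor min-max normalization, then concatenation.
import numpy as np

def early_fusion(descriptor_list):
    """descriptor_list: list of arrays, each of shape (n_videos, dim_i)."""
    normalized = []
    for X in descriptor_list:
        lo, hi = X.min(axis=0), X.max(axis=0)
        # Guard against constant columns to avoid division by zero.
        normalized.append((X - lo) / np.where(hi > lo, hi - lo, 1.0))
    return np.hstack(normalized)  # (n_videos, sum of dims)
```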

Page 14: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Late Fusion
1. Feature extraction: compute descriptors 1 to n;
2. Classification step: train one classifier per descriptor (classifiers 1 to n);
3. Confidence score normalization: normalize each classifier's confidence values;
4. Decision: aggregate the normalized confidence values into the global confidence score (the combination rules and a sketch follow on the next page).

Page 15: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Late Fusion

The normalized confidence values are aggregated into a global score; in their standard forms, the combination rules are:

CombSUM: $\mathrm{CombSUM}(q, d) = \sum_{i=1}^{N} w_i \, cv_i(q, d)$

CombMean: $\mathrm{CombMean}(q, d) = \frac{1}{N} \sum_{i=1}^{N} w_i \, cv_i(q, d)$

CombMNZ: $\mathrm{CombMNZ}(q, d) = \mathrm{CombSUM}(q, d) \cdot |\{\, i : cv_i(q, d) > 0 \,\}|$

CombRank: as CombSUM, but with the rank $\mathrm{rank}_i(d)$ substituted for the confidence value $cv_i(q, d)$,

where cv_i is the confidence value of classifier i for class q, d is the current video, w_i are weights, N is the number of classifiers to be aggregated, and rank_i(d) is the rank assigned to d by classifier i.
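A minimal sketch of CombSUM and CombMNZ under these definitions, assuming each classifier outputs a (n_videos, n_classes) matrix of normalized confidence values and uniform weights by default.

```python
# Late fusion of per-classifier confidence matrices.
import numpy as np

def comb_sum(conf_list, weights=None):
    conf = np.stack(conf_list)                # (N, n_videos, n_classes)
    w = np.ones(len(conf_list)) if weights is None else np.asarray(weights)
    return np.tensordot(w, conf, axes=1)      # weighted sum over classifiers

def comb_mnz(conf_list, weights=None):
    conf = np.stack(conf_list)
    # Number of classifiers with a nonzero score per (video, class) pair.
    nonzero = (conf > 0).sum(axis=0)
    return comb_sum(conf_list, weights) * nonzero
```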

Page 16: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup: MediaEval 2012 Dataset - Tagging Task
• 14,838 episodes from 2,249 shows, ~3,260 hours of data;
• split into development and test sets: 5,288 episodes for development / 9,550 for test;
• focuses on semi-professional video on the Internet.

Page 17: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup: MediaEval 2012 Dataset

• 26 genre labels: 1000 art, 1001 autos_and_vehicles, 1002 business, 1003 citizen_journalism, 1004 comedy, 1005 conferences_and_other_events, 1006 default_category, 1007 documentary, 1008 educational, 1009 food_and_drink, 1010 gaming, 1011 health, 1012 literature, 1013 movies_and_television, 1014 music_and_entertainment, 1015 personal_or_auto-biographical, 1016 politics, 1017 religion, 1018 school_and_education, 1019 sports, 1020 technology, 1021 the_environment, 1022 the_mainstream_media, 1023 travel, 1024 videoblogging, 1025 web_development_and_sites.

Page 18: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup

• Mean Average Precision (MAP) summarizes rankings from multiple queries by averaging the per-query average precision (see the formula below);

• classifier parameters and late-fusion weights were optimized on the development set.
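For reference, the standard definition of the metric, with Q the query set, R_q the number of relevant videos for query q, P(k) the precision at cut-off k, and rel(k) = 1 if the item at rank k is relevant and 0 otherwise:

```latex
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),
\qquad
\mathrm{AP}(q) = \frac{1}{R_q} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)
```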

Page 19: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (MAP values)

| Feature                | SVM Linear | SVM RBF | SVM χ² | 5-NN   | Random Forest | Ext. Random Forests |
|------------------------|------------|---------|--------|--------|---------------|---------------------|
| HoG                    | 9.08%      | 25.63%  | 22.44% | 17.92% | 16.62%        | 23.44%              |
| Bag of Words           | 14.63%     | 17.61%  | 19.96% | 8.55%  | 14.89%        | 16.32%              |
| MPEG-7                 | 6.12%      | 4.26%   | 17.49% | 9.61%  | 20.90%        | 26.17%              |
| Structural descriptors | 7.55%      | 17.17%  | 22.76% | 8.65%  | 13.85%        | 14.85%              |
| Audio descriptors      | 20.68%     | 24.52%  | 35.56% | 18.31% | 34.41%        | 42.33%              |
| TF-IDF on ASR          | 32.96%     | 35.05%  | 28.85% | 12.96% | 30.56%        | 27.93%              |
| TF-IDF on Metadata     | 56.33%     | 58.14%  | 47.95% | 57.19% | 58.66%        | 57.52%              |

Page 20: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (visual); MAP values, table repeated from Page 19.

Visual performance:
- best results are obtained with MPEG-7 (Ext. Random Forests) and HoG (SVM RBF);
- Bag-of-Visual-Words does not perform very well.

Page 21: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (audio); MAP values, table repeated from Page 19.

Audio performance:
- best results with Extremely Random Forests (42.33%);
- audio descriptors provide higher discriminative power than the visual features.

Page 22: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (text); MAP values, table repeated from Page 19.

Text performance:
- best results with metadata and Random Forests (58.66%);
- TF-IDF on ASR provides lower performance than the audio descriptors;
- the metadata features outperform all the other features.

Page 23: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(2) Performance on multimodal integration (MAP values)

| Modality   | CombSUM | CombMean | CombMNZ | CombRank | Early Fusion |
|------------|---------|----------|---------|----------|--------------|
| All Visual | 35.82%  | 36.76%   | 38.21%  | 30.90%   | 30.11%       |
| All Audio  | 43.86%  | 44.19%   | 44.50%  | 41.81%   | 42.33%       |
| All Text   | 62.62%  | 62.81%   | 62.69%  | 50.60%   | 55.68%       |
| All        | 64.24%  | 65.61%   | 65.82%  | 53.84%   | 60.12%       |

Fusion techniques performance:
- late fusion provides higher performance than early fusion;
- CombMNZ tends to provide the most accurate results.

Page 24: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(3) Comparison to MediaEval 2012 Tagging Task results (MAP values)

| Team         | Modality      | Method                                                                   | MAP    |
|--------------|---------------|--------------------------------------------------------------------------|--------|
| proposed     | all           | Late fusion CombMNZ with all descriptors                                 | 65.82% |
| proposed     | text          | Late fusion CombMean with TF-IDF of ASR and metadata                     | 62.81% |
| TUB          | text          | Naive Bayes with Bag-of-Words on text (metadata)                         | 52.25% |
| proposed     | all           | Late fusion CombMNZ with all descriptors except metadata                 | 51.90% |
| proposed     | audio         | Late fusion CombMean with standard audio descriptors                     | 44.50% |
| proposed     | visual        | Late fusion CombMean with MPEG-7, structural, HoG and BoVW with rgbSIFT  | 38.21% |
| ARF          | text          | SVM linear on early fusion of TF-IDF of ASR and metadata                 | 37.93% |
| TUD          | visual & text | Late fusion of SVM with BoW (visual words, ASR & metadata)               | 35.81% |
| KIT          | visual        | SVM with visual descriptors (color, texture, BoVW with rgbSIFT)          | 35.81% |
| TUD-MM       | text          | Dynamic Bayesian networks on text (ASR & metadata)                       | 25.00% |
| UNICAMP-UFMG | visual        | Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BoW (text ASR)  | 21.12% |
| ARF          | audio         | SVM linear with block-based audio features                               | 18.92% |

Page 25: An In-Depth Evaluation of Multimodal Video Genre Categorization


Conclusions

> we provided an in-depth evaluation of truly multimodal video description in the context of a real-world genre-categorization scenario;

> we showed that late fusion can boost the performance of the individual content descriptors;

> we demonstrated the potential of appropriate late fusion for genre categorization, achieving very high categorization performance;

> we set a new baseline for the Genre Tagging Task by outperforming the other participants.

Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J. Uijlings from University of Trento for their support.

We also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).

Page 26: An In-Depth Evaluation of Multimodal Video Genre Categorization


Thank you!

Questions?