
Page 1: An In-Depth Evaluation of Multimodal Video Genre Categorization

University Politehnica of Bucharest

An In-Depth Evaluation of Multimodal Video Genre Categorization

Ionuț MIRONICĂ 1, [email protected]
Bogdan IONESCU 1,2, [email protected]
Peter KNEES 3, [email protected]
Patrick LAMBERT 2, [email protected]

11th International Workshop on Content-Based Multimedia Indexing, CBMI 2013, Veszprém, Hungary, June 17-19, 2013.

Page 2: An In-Depth Evaluation of Multimodal Video Genre Categorization


Presentation outline

• Introduction

• Video Content Description

• Fusion Techniques

• Experimental Results

• Conclusions

Page 3: An In-Depth Evaluation of Multimodal Video Genre Categorization


Problem Statement

Concepts:
• Content-Based Video Retrieval
• Genre Retrieval

[Diagram: a genre query is run against the video database and returns the query results.]

Page 4: An In-Depth Evaluation of Multimodal Video Genre Categorization


Global Approach

> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm.

[Diagram: labeled data from a tagged video database is used to train a classifier; the trained classifier then labels unlabeled data with genre tags such as web, food, autos.]

Page 5: An In-Depth Evaluation of Multimodal Video Genre Categorization


Global Approach

• the entire process relies on the concept of "similarity" computed between content annotations (numeric features).

We focus on:
• objective 1: go (truly) multimodal: visual, audio, text & metadata;
• objective 2: test a broad range of classifiers;
• objective 3: test a broad range of fusion techniques.

Page 6: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - audio

Standard audio features (audio frame-based) [B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]:
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.

Frame-level features f1, f2, ..., fn are aggregated over time: the global feature is the mean & variance of each frame feature (a sketch follows below).
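A minimal sketch of this frame-based description with mean/variance aggregation. The slides use the Yaafe toolbox; librosa is substituted here purely for illustration, and the particular feature subset and sample rate below are assumptions.

```python
# Frame-based audio features aggregated into a global mean/variance vector.
import numpy as np
import librosa

def audio_global_feature(path):
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, n_frames)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, n_frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, n_frames)
    frames = np.vstack([mfcc, zcr, centroid])                 # (15, n_frames)
    # Global descriptor: mean and variance of each feature over time.
    return np.concatenate([frames.mean(axis=1), frames.var(axis=1)])
```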

Page 7: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

MPEG-7 & color/texture descriptors (visual frame-based) [OpenCV toolbox, http://opencv.willowgarage.com]:
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Structure Color Descriptor,
• Classic color histogram,
• Color moments.

Frame-level features f1, f2, ..., fn are aggregated over time: the global feature concatenates the mean, dispersion, skewness, kurtosis, median and root mean square of each frame feature (see the sketch below).
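A minimal sketch of the temporal aggregation step only: given per-frame visual descriptors, it builds a global video feature from the six statistics named above. The input shape is an assumption.

```python
# Aggregate per-frame descriptors into one global video descriptor.
import numpy as np
from scipy import stats

def aggregate_frames(frame_feats):
    """frame_feats: array of shape (n_frames, dim), one descriptor per frame."""
    return np.concatenate([
        frame_feats.mean(axis=0),                   # mean
        frame_feats.std(axis=0),                    # dispersion
        stats.skew(frame_feats, axis=0),            # skewness
        stats.kurtosis(frame_feats, axis=0),        # kurtosis
        np.median(frame_feats, axis=0),             # median
        np.sqrt((frame_feats ** 2).mean(axis=0)),   # root mean square
    ])
```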

Page 8: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Feature descriptors: Bag of Visual Words [CIVR 2009, J. Uijlings et al.]
• we train the model with 4,096 words;
• rgbSIFT and spatial pyramids (2x2).

Bag-of-Visual-Words framework: detect interest points, build a codeword dictionary, generate BoW histograms, train a classifier. A minimal end-to-end sketch follows below.
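A minimal Bag-of-Visual-Words sketch: SIFT keypoints, a k-means codebook, and per-image histograms. The slides use rgbSIFT with spatial pyramids; plain grayscale SIFT without pyramids is used here to keep the illustration short, so treat the details as assumptions.

```python
# Bag-of-Visual-Words: local descriptors -> codebook -> normalized histograms.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc  # (n_keypoints, 128) or None if no keypoints found

def build_codebook(all_descriptors, n_words=4096):
    # Cluster the pooled training descriptors into the visual vocabulary.
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    kmeans.fit(np.vstack(all_descriptors))
    return kmeans

def bow_histogram(desc, kmeans):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()  # L1-normalized BoW histogram
```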

Page 9: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Feature descriptors: Histogram of Oriented Gradients (HoG) [CITS 2009, O. Ludwig et al.]
• divides the image into 3x3 cells and, for each of them, builds a pixel-wise histogram of edge orientations (a sketch follows below).
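A minimal HoG sketch using scikit-image. The slide describes a 3x3 grid of cells with per-cell edge-orientation histograms; deriving the cell size from the image dimensions to mimic that grid, and the orientation count, are assumptions.

```python
# Whole-image HoG over a 3x3 grid of cells, no block normalization.
from skimage.feature import hog

def hog_descriptor(gray_image, n_cells=3, n_orientations=9):
    h, w = gray_image.shape
    return hog(
        gray_image,
        orientations=n_orientations,
        pixels_per_cell=(h // n_cells, w // n_cells),  # 3x3 cell grid
        cells_per_block=(1, 1),                        # one cell per block
        feature_vector=True,
    )  # typically 3 * 3 * n_orientations = 81 values
```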

Page 10: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - visual

Structural descriptors [IJCV, C. Rasche'10]

Objective: describe structural information in terms of contours and their relations.

Contour properties:
• b: degree of curvature (proportional to the maximum amplitude of the bowness space); straight vs. bow;
• ζ: degree of circularity; half circle vs. full circle;
• e: edginess parameter; zig-zag vs. sinusoid;
• y: symmetry parameter; irregular vs. "even".

Appearance parameters:
• c_m, c_s: mean and standard deviation of intensity along the contour;
• f_m, f_s: fuzziness, obtained from a blob (DOG) filter: I * DOG.

Page 11: An In-Depth Evaluation of Multimodal Video Genre Categorization


Video Content Description - text

TF-IDF descriptors (Term Frequency-Inverse Document Frequency)

Text sources: ASR transcripts and metadata.

1. remove XML markup,
2. remove terms below the 5%-percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g., m = 150 for ASR and 20 for metadata) with the highest χ² values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.

A sketch of steps 2-4 follows below.
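A minimal sketch of per-class chi-squared term selection followed by TF-IDF weighting, using scikit-learn. The per-genre top-m selection mirrors the slide; the exact preprocessing and thresholds in the paper may differ.

```python
# Per-genre chi-squared vocabulary selection, then TF-IDF representation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import chi2

def tfidf_features(docs, labels, m_per_class=150):
    counts = CountVectorizer(min_df=2).fit_transform(docs)  # raw term counts
    labels = np.asarray(labels)
    vocab_idx = set()
    for genre in np.unique(labels):
        # chi² of each term for this genre vs. its complement classes.
        scores, _ = chi2(counts, labels == genre)
        vocab_idx.update(np.argsort(scores)[-m_per_class:])
    keep = sorted(vocab_idx)
    return TfidfTransformer().fit_transform(counts[:, keep])
```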

Page 12: An In-Depth Evaluation of Multimodal Video Genre Categorization


Classifiers

We test a broad range of classifiers:
• SVM with linear, RBF and χ² kernels (a sketch of the χ² kernel SVM follows below);
• 5-NN;
• Random Forests and Extremely Random Forests.
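A minimal sketch of an SVM with a χ² kernel via scikit-learn's precomputed-kernel interface, since SVC has no built-in χ² kernel. The feature matrices are assumed to be non-negative histograms (e.g., BoW), and the C/gamma values are placeholders.

```python
# SVM with a precomputed chi-squared kernel.
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, X_test, gamma=1.0):
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K_train, y_train)
    # Kernel between test rows and the training rows, in that order.
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)
    return clf, clf.predict(K_test)
```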

Page 13: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Early Fusion
1. Feature extraction: compute descriptors 1 to n;
2. Feature normalization: normalize each descriptor;
3. Feature concatenation: concatenate the normalized descriptors into a single global descriptor;
4. Classification step: a single classifier produces the global confidence score used for the decision.

A sketch of this pipeline follows below.
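A minimal early-fusion sketch: normalize each descriptor, then concatenate into one global descriptor for a single classifier. The slides do not specify the normalization; min-max scaling is assumed here.

```python
# Early fusion: per-descriptor min-max normalization, then concatenation.
import numpy as np

def early_fusion(descriptor_list):
    """descriptor_list: list of arrays, each of shape (n_videos, dim_i)."""
    normalized = []
    for X in descriptor_list:
        lo, hi = X.min(axis=0), X.max(axis=0)
        # Guard against constant columns to avoid division by zero.
        normalized.append((X - lo) / np.where(hi > lo, hi - lo, 1.0))
    return np.hstack(normalized)  # (n_videos, sum of dims)
```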

Page 14: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Late Fusion
1. Feature extraction: compute descriptors 1 to n;
2. Classification step: train one classifier per descriptor (classifiers 1 to n);
3. Confidence score normalization: normalize each classifier's confidence values;
4. Decision: aggregate the normalized confidence values into the global confidence score (the combination rules and a sketch follow on the next page).

Page 15: An In-Depth Evaluation of Multimodal Video Genre Categorization


Fusion Techniques

Late Fusion

The normalized confidence values are aggregated into a global score; in their standard forms, the combination rules are:

CombSUM: $\mathrm{CombSUM}(q, d) = \sum_{i=1}^{N} w_i \, cv_i(q, d)$

CombMean: $\mathrm{CombMean}(q, d) = \frac{1}{N} \sum_{i=1}^{N} w_i \, cv_i(q, d)$

CombMNZ: $\mathrm{CombMNZ}(q, d) = \mathrm{CombSUM}(q, d) \cdot |\{\, i : cv_i(q, d) > 0 \,\}|$

CombRank: as CombSUM, but with the rank $\mathrm{rank}_i(d)$ substituted for the confidence value $cv_i(q, d)$,

where cv_i is the confidence value of classifier i for class q, d is the current video, w_i are weights, N is the number of classifiers to be aggregated, and rank_i(d) is the rank assigned to d by classifier i.
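A minimal sketch of CombSUM and CombMNZ under these definitions, assuming each classifier outputs a (n_videos, n_classes) matrix of normalized confidence values and uniform weights by default.

```python
# Late fusion of per-classifier confidence matrices.
import numpy as np

def comb_sum(conf_list, weights=None):
    conf = np.stack(conf_list)                # (N, n_videos, n_classes)
    w = np.ones(len(conf_list)) if weights is None else np.asarray(weights)
    return np.tensordot(w, conf, axes=1)      # weighted sum over classifiers

def comb_mnz(conf_list, weights=None):
    conf = np.stack(conf_list)
    # Number of classifiers with a nonzero score per (video, class) pair.
    nonzero = (conf > 0).sum(axis=0)
    return comb_sum(conf_list, weights) * nonzero
```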

Page 16: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup: MediaEval 2012 Dataset - Tagging Task
• 14,838 episodes from 2,249 shows, ~3,260 hours of data;
• split into development and test sets: 5,288 episodes for development / 9,550 for test;
• focuses on semi-professional video on the Internet.

Page 17: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup: MediaEval 2012 Dataset

• 26 genre labels: 1000 art, 1001 autos_and_vehicles, 1002 business, 1003 citizen_journalism, 1004 comedy, 1005 conferences_and_other_events, 1006 default_category, 1007 documentary, 1008 educational, 1009 food_and_drink, 1010 gaming, 1011 health, 1012 literature, 1013 movies_and_television, 1014 music_and_entertainment, 1015 personal_or_auto-biographical, 1016 politics, 1017 religion, 1018 school_and_education, 1019 sports, 1020 technology, 1021 the_environment, 1022 the_mainstream_media, 1023 travel, 1024 videoblogging, 1025 web_development_and_sites.

Page 18: An In-Depth Evaluation of Multimodal Video Genre Categorization


Experimental Setup

• Mean Average Precision (MAP) summarizes rankings from multiple queries by averaging the per-query average precision (see the formula below);

• classifier parameters and late-fusion weights were optimized on the development set.
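For reference, the standard definition of the metric, with Q the query set, R_q the number of relevant videos for query q, P(k) the precision at cut-off k, and rel(k) = 1 if the item at rank k is relevant and 0 otherwise:

```latex
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),
\qquad
\mathrm{AP}(q) = \frac{1}{R_q} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)
```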

Page 19: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (MAP values)

| Feature                | SVM Linear | SVM RBF | SVM χ² | 5-NN   | Random Forest | Ext. Random Forests |
|------------------------|------------|---------|--------|--------|---------------|---------------------|
| HoG                    | 9.08%      | 25.63%  | 22.44% | 17.92% | 16.62%        | 23.44%              |
| Bag of Words           | 14.63%     | 17.61%  | 19.96% | 8.55%  | 14.89%        | 16.32%              |
| MPEG-7                 | 6.12%      | 4.26%   | 17.49% | 9.61%  | 20.90%        | 26.17%              |
| Structural descriptors | 7.55%      | 17.17%  | 22.76% | 8.65%  | 13.85%        | 14.85%              |
| Audio descriptors      | 20.68%     | 24.52%  | 35.56% | 18.31% | 34.41%        | 42.33%              |
| TF-IDF on ASR          | 32.96%     | 35.05%  | 28.85% | 12.96% | 30.56%        | 27.93%              |
| TF-IDF on Metadata     | 56.33%     | 58.14%  | 47.95% | 57.19% | 58.66%        | 57.52%              |

Page 20: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (visual); MAP values, table repeated from Page 19.

Visual performance:
- best results are obtained with MPEG-7 (Ext. Random Forests) and HoG (SVM RBF);
- Bag-of-Visual-Words does not perform very well.

Page 21: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (audio); MAP values, table repeated from Page 19.

Audio performance:
- best results with Extremely Random Forests (42.33%);
- audio descriptors provide higher discriminative power than the visual features.

Page 22: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(1) Classification performance on individual modalities (text); MAP values, table repeated from Page 19.

Text performance:
- best results with metadata and Random Forests (58.66%);
- TF-IDF on ASR provides lower performance than the audio descriptors;
- the metadata features outperform all the other features.

Page 23: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(2) Performance on multimodal integration (MAP values)

| Modality   | CombSUM | CombMean | CombMNZ | CombRank | Early Fusion |
|------------|---------|----------|---------|----------|--------------|
| All Visual | 35.82%  | 36.76%   | 38.21%  | 30.90%   | 30.11%       |
| All Audio  | 43.86%  | 44.19%   | 44.50%  | 41.81%   | 42.33%       |
| All Text   | 62.62%  | 62.81%   | 62.69%  | 50.60%   | 55.68%       |
| All        | 64.24%  | 65.61%   | 65.82%  | 53.84%   | 60.12%       |

Fusion techniques performance:
- late fusion provides higher performance than early fusion;
- CombMNZ tends to provide the most accurate results.

Page 24: An In-Depth Evaluation of Multimodal Video Genre Categorization


Evaluation
(3) Comparison to MediaEval 2012 Tagging Task results (MAP values)

| Team         | Modality      | Method                                                                   | MAP    |
|--------------|---------------|--------------------------------------------------------------------------|--------|
| proposed     | all           | Late fusion CombMNZ with all descriptors                                 | 65.82% |
| proposed     | text          | Late fusion CombMean with TF-IDF of ASR and metadata                     | 62.81% |
| TUB          | text          | Naive Bayes with Bag-of-Words on text (metadata)                         | 52.25% |
| proposed     | all           | Late fusion CombMNZ with all descriptors except metadata                 | 51.90% |
| proposed     | audio         | Late fusion CombMean with standard audio descriptors                     | 44.50% |
| proposed     | visual        | Late fusion CombMean with MPEG-7, structural, HoG and BoVW with rgbSIFT  | 38.21% |
| ARF          | text          | SVM linear on early fusion of TF-IDF of ASR and metadata                 | 37.93% |
| TUD          | visual & text | Late fusion of SVM with BoW (visual words, ASR & metadata)               | 35.81% |
| KIT          | visual        | SVM with visual descriptors (color, texture, BoVW with rgbSIFT)          | 35.81% |
| TUD-MM       | text          | Dynamic Bayesian networks on text (ASR & metadata)                       | 25.00% |
| UNICAMP-UFMG | visual        | Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BoW (text ASR)  | 21.12% |
| ARF          | audio         | SVM linear with block-based audio features                               | 18.92% |

Page 25: An In-Depth Evaluation of Multimodal Video Genre Categorization


Conclusions

> we provided an in-depth evaluation of truly multimodal video description in the context of a real-world genre-categorization scenario;

> we showed that late fusion can boost the performance of the individual content descriptors;

> we demonstrated the potential of appropriate late fusion for genre categorization, achieving very high categorization performance;

> we set a new baseline for the Genre Tagging Task by outperforming the other participants.

Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J. Uijlings from University of Trento for their support.

We also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).

Page 26: An In-Depth Evaluation of Multimodal Video Genre Categorization


Thank you!

Questions?