
MIREX 2011 AUDIO TAG CLASSIFICATION USING WEIGHTED-VOTE NEAREST NEIGHBOR CLASSIFICATION

Mohamed Sordo¹, Òscar Celma², Dmitry Bogdanov¹

¹ Music Technology Group, Universitat Pompeu Fabra {[email protected]}
² Gracenote {[email protected]}

Method

General Framework

The audio features are captured on a short-time, frame-by-frame basis, and then averaged over the whole audio excerpt. We take the means, variances and their corresponding deltas.
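A minimal sketch of this aggregation step, assuming frame-level features arrive as a NumPy array of shape (n_frames, n_dims); the function name is illustrative:

    import numpy as np

    def summarize_excerpt(frames: np.ndarray) -> np.ndarray:
        """Collapse frame-level features (n_frames x n_dims) into a single
        excerpt-level vector: means and variances of the features, plus the
        means and variances of their frame-to-frame deltas."""
        deltas = np.diff(frames, axis=0)  # first-order differences between consecutive frames
        return np.concatenate([
            frames.mean(axis=0), frames.var(axis=0),   # means and variances
            deltas.mean(axis=0), deltas.var(axis=0),   # ... and their delta counterparts
        ])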

[Framework diagram: raw audio files → Feature Extraction (spectral, rhythm, tonal, high-level, ...) → Original Dataset → Feature Selection (remove, clean, normalize, PCA, ...) → Transformed Dataset (points in a d-dimensional space) → Autotagging algorithm (k-NN). A query audio is transformed in the same way and compared by similarity distance to request and propose tags (e.g. TIME, GENDER, GENRE, CHARACTERISTIC, TEMPO, INSTRUMENTATION) for the new audio.]
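The poster only summarizes the weighted-vote k-NN step (details are in [1]). A minimal sketch, assuming inverse-distance vote weighting and a fixed affinity threshold; all names are illustrative:

    import numpy as np

    def propose_tags(query_vec, dataset, tags_of, k=10, threshold=0.5):
        """Weighted-vote k-NN autotagging sketch: each of the k nearest training
        excerpts votes for its tags with weight 1/(distance + eps); a tag is
        proposed when its normalized vote mass exceeds `threshold`."""
        dists = np.linalg.norm(dataset - query_vec, axis=1)   # Euclidean distances
        nn = np.argsort(dists)[:k]                            # indices of the k nearest neighbors
        weights = 1.0 / (dists[nn] + 1e-9)                    # closer neighbors vote more
        affinity = {}
        for idx, w in zip(nn, weights):
            for tag in tags_of[idx]:
                affinity[tag] = affinity.get(tag, 0.0) + w
        total = weights.sum()
        affinity = {t: v / total for t, v in affinity.items()}  # affinity in [0, 1]
        return affinity, [t for t, v in affinity.items() if v >= threshold]

    # toy usage: 50 training excerpts in a 30-dimensional feature space
    X = np.random.rand(50, 30)
    tags = [["rock", "guitar"] if i % 2 else ["electronic"] for i in range(50)]
    aff, proposed = propose_tags(np.random.rand(30), X, tags, k=5)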

Feature Selection + Distance Measures

* SC1: PCA (retaining 75% of the variance) + Euclidean distance (EUC)
* SBC1: linear combination of four distances: PCA-EUC, a tempo-based distance, the Kullback-Leibler divergence between single-Gaussian (1G) MFCC models, and a semantic classifier-based distance [2] (both configurations are sketched below)
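A minimal sketch of both configurations with scikit-learn, assuming X_train is the excerpt-level feature matrix; the SBC1 mixture weights are placeholders, since the poster does not report the actual values:

    import numpy as np
    from sklearn.decomposition import PCA

    X_train = np.random.rand(100, 60)   # stand-in for the real (n_excerpts x n_dims) features

    # SC1: keep enough principal components to cover 75% of the variance,
    # then compare excerpts with plain Euclidean distance.
    pca = PCA(n_components=0.75)        # a float means "fraction of variance to retain"
    X_red = pca.fit_transform(X_train)

    def sc1_distance(a, b):
        return np.linalg.norm(a - b)    # Euclidean distance in the reduced space

    # SBC1: weighted sum of normalized component distances. `components` would be
    # (pca_euc, tempo_dist, kl_1g_mfcc, semantic_dist); the weights are illustrative.
    def sbc1_distance(a, b, components, weights=(0.4, 0.2, 0.2, 0.2)):
        return sum(w * d(a, b) for w, d in zip(weights, components))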

[Result figures: Affinity Ranking (Overall by Fold, AUC-ROC by Tag) and Classification (Overall by Fold, F-measure by Tag) panels on the MajorMiner and Mood Tag datasets, each with significance tests.]
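For reference, the per-tag flavor of these measures can be computed as below (a sketch with scikit-learn on random stand-in data; in the actual evaluation the ground truth, affinities and binary decisions come from the MIREX folds):

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=(200, 25))   # binary ground truth: 200 excerpts x 25 tags
    y_aff = rng.random((200, 25))                 # tag affinities in [0, 1]
    y_pred = (y_aff >= 0.5).astype(int)           # binary tag decisions

    # "by Tag": score each tag column independently, then average across tags.
    auc_by_tag = [roc_auc_score(y_true[:, t], y_aff[:, t]) for t in range(y_true.shape[1])]
    f_by_tag = [f1_score(y_true[:, t], y_pred[:, t]) for t in range(y_true.shape[1])]
    print(np.mean(auc_by_tag), np.mean(f_by_tag))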

Significance tests

* AUC-ROC by Fold: no statistically significant differences between any of the presented algorithms.

* AUC-ROC by Tag: PH2>SBC1 is statistically significant. PH2>SC1 is significant in MajorMiner only. SSKS1 shows no statistically significant differences in either dataset. SBC1>(JR4-5, BAx, TCCPx) and SC1>(JR5, BAx, TCCPx) are statistically significant in MajorMiner. SC1>SBC1 is not statistically significant.

* Precision at N: no statistically significant differences between any of the presented algorithms.

* F-measure by Fold: no statistically significant differences between any of the presented algorithms.

* F-measure by Tag: (PH2, SSKS1, BAx)>(SC1, SBC1) are statistically significant in MajorMiner only; there is no statistical significance between them in the Mood Tag dataset. SC1, SBC1>TCCPx are statistically significant in MajorMiner. SBC1>SC1 is not statistically significant.
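Pairwise comparisons of this kind are commonly run as a Friedman test followed by a post-hoc procedure such as Tukey's HSD; the poster does not state the exact test used, so the following is only a sketch of the Friedman step over a per-fold score matrix (random stand-in data):

    import numpy as np
    from scipy.stats import friedmanchisquare

    # scores[i, j]: evaluation measure for algorithm j on fold (or tag) i.
    rng = np.random.default_rng(1)
    scores = rng.random((5, 4))  # 5 folds x 4 algorithms

    # Friedman's test asks whether any algorithm consistently out-ranks the
    # others across folds; a large p-value means "no statistical significance",
    # as in several of the comparisons reported above.
    stat, p = friedmanchisquare(*scores.T)
    print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")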

Conclusions

* Our algorithm has proven to be:
  - Fast: at most 2-3 minutes to train and classify a fold.
  - Scalable: in SC1, for example, each audio excerpt is described by a single vector of 29-30 dimensions.
  - Easy to implement: built on top of a k-NN classifier [1], it defines simple heuristics for classification and affinity.
  - Consistent: the overall results, averaged per fold, show a very low standard deviation (see, for instance, how PH2 performs across different folds). Furthermore, the algorithm ranks between second and fourth (out of 15 participants) in almost all the evaluation results.

* Statistical significance was found mainly in MajorMiner at the tag level, probably due to the diverse (and correlated) concepts/categories used in its tag vocabulary. Yet in many evaluation results there was no statistical significance; this issue is not new in MIREX audio tag classification.

* After 3 years with the same set of evaluation measures, are all of them good and sufficient to discriminate between different algorithms?

* Are we actually reaching a "glass ceiling" for audio tag classification, even with novel algorithms that take into account correlations between higher-level concepts?

References

[1] M. Sordo, C. Laurier, and O. Celma, "Annotating Music Collections: How Content-based Similarity Helps to Propagate Labels," Proceedings of the International Conference on Music Information Retrieval (ISMIR), pp. 531–534, 2007.
[2] D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and X. Serra, "Unifying Low-level and High-level Music Similarity Measures," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 687–701, 2011.