Deep Learning and Feature Learning for MIR


TRANSCRIPT

  • Deep learning and feature learning for MIR

    Sander Dieleman, July 23rd, 2014

  • PhD student at Ghent University (Gent), graduating in December 2014

    Currently interning in NYC
    Working on audio-based music classification and recommendation

    http://reslab.elis.ugent.be
    http://github.com/benanne
    http://benanne.github.io
    [email protected]

  • I. Multiscale music audio feature learning

    II. Deep content-based music recommendation

    III. End-to-end learning for music audio

    IV. Transfer learning by supervised pre-training

    V. More music recommendation + demo

  • I. Multiscale music audio feature learning

  • Feature learning is receiving more attention from the MIR community

    Inspired by good results in: speech recognition; computer vision and image classification; NLP and machine translation

  • Music exhibits structure on many different timescales

    Musical form
    Themes
    Motifs
    Periodic waveforms

  • K-means for feature learning: cluster centers are features

    Spherical K-means: means lie on the unit sphere, i.e. have unit L2 norm
    + conceptually very simple
    + only one parameter to tune: the number of means
    + orders of magnitude faster than RBMs, autoencoders, sparse coding

    (Coates and Ng, 2012)
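
    A minimal NumPy sketch of spherical K-means as described above (my own illustration under an assumed data layout, not the author's code): cluster means are re-projected onto the unit sphere after every update.

      import numpy as np

      def spherical_kmeans(X, k, n_iter=20, seed=0):
          """Spherical K-means: the cluster centers (the learned features) are
          constrained to unit L2 norm. X has shape (n_samples, n_dims) and is
          assumed to be whitened/normalized beforehand."""
          rng = np.random.default_rng(seed)
          D = rng.standard_normal((k, X.shape[1]))
          D /= np.linalg.norm(D, axis=1, keepdims=True)            # unit-norm initialization
          for _ in range(n_iter):
              assign = np.argmax(X @ D.T, axis=1)                  # closest mean by dot product
              for j in range(k):
                  members = X[assign == j]
                  if len(members):
                      D[j] = members.sum(axis=0)
              D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8 # project back onto the unit sphere
          return D                                                 # rows are the learned filters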

  • Spherical K-means features work well with linear feature encoding

    Feature extraction is a convolution of the learned filters with the input data

    During training: one-of-K encoding, e.g. [0, 0, 1.7, 0]
    During feature extraction: linear encoding, e.g. [-0.2, 2.3, 1.7, 0.7]

    (Coates and Ng, 2012)
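
    A sketch of the linear encoding step under the assumptions above (frame-based windows, filters from spherical K-means): the features are dot products between the unit-norm filters and sliding windows of the input, i.e. a valid convolution over time.

      import numpy as np

      def linear_encode(spectrogram, D, frame_len):
          """Linear feature encoding as a convolution over time: slide a window of
          `frame_len` spectrogram frames over the input and take dot products with
          the learned filters D (rows = unit-norm K-means centroids)."""
          n_frames, n_bins = spectrogram.shape
          windows = np.stack([spectrogram[t:t + frame_len].ravel()
                              for t in range(n_frames - frame_len + 1)])
          return windows @ D.T                     # shape: (n_positions, n_filters)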

  • Multiresolution spectrograms: different window sizes

    From fine to coarse: window sizes of 1024, 2048, 4096 and 8192 samples

    (Hamel et al., 2012)
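
    A short sketch of the multiresolution idea, assuming librosa is available; the window sizes are those on the slide, the hop size is my assumption.

      import numpy as np
      import librosa

      def multires_spectrograms(y, window_sizes=(1024, 2048, 4096, 8192), hop=512):
          """Magnitude spectrograms of the same signal at several time/frequency
          resolutions, obtained by varying the STFT window size."""
          return {n_fft: np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
                  for n_fft in window_sizes}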

  • Gaussian pyramid: repeated smoothing and subsampling

    From fine to coarse: repeatedly smooth and subsample by a factor of 2

    (Burt and Adelson, 1983)

  • Laplacian pyramid: difference between levels of the Gaussian pyramid

    Each level is the corresponding Gaussian pyramid level minus the (upsampled) next, coarser level; a sketch of both pyramids follows below.

    (Burt and Adelson, 1983)
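
    A NumPy/SciPy sketch of both pyramids along the time axis of a feature matrix (my own illustration; the smoothing width and the nearest-neighbour upsampling are assumptions).

      import numpy as np
      from scipy.ndimage import gaussian_filter1d

      def gaussian_pyramid(x, n_levels=4, sigma=1.0):
          """Gaussian pyramid along the time axis (axis 0): repeatedly smooth
          and subsample by a factor of 2."""
          levels = [x]
          for _ in range(n_levels - 1):
              smoothed = gaussian_filter1d(levels[-1], sigma=sigma, axis=0)
              levels.append(smoothed[::2])                         # subsample /2
          return levels

      def laplacian_pyramid(x, n_levels=4, sigma=1.0):
          """Laplacian pyramid: each level is the difference between a Gaussian
          pyramid level and the upsampled next, coarser level."""
          gauss = gaussian_pyramid(x, n_levels, sigma)
          laps = []
          for fine, coarse in zip(gauss[:-1], gauss[1:]):
              upsampled = np.repeat(coarse, 2, axis=0)[:fine.shape[0]]
              laps.append(fine - upsampled)                        # band-pass detail
          laps.append(gauss[-1])                                   # keep the coarsest level
          return laps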

  • Our approach: feature learning on multiple timescales

  • Task: tag prediction on the Magnatagatune dataset

    25863 clips of 29 seconds, annotated with 188 tags
    Tags are versatile: genre, tempo, instrumentation, dynamics, ...

    We trained a multilayer perceptron (MLP):
    - 1000 rectified linear hidden units
    - cross-entropy objective
    - predict the 50 most common tags

    (Law and von Ahn, 2009)
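
    A minimal sketch of such a tag-prediction MLP in a modern framework (PyTorch); this is not the original implementation, and the input dimensionality is an assumption.

      import torch
      import torch.nn as nn

      n_features, n_tags = 2048, 50          # clip-level feature size (assumed), 50 tags as on the slide

      mlp = nn.Sequential(
          nn.Linear(n_features, 1000),       # 1000 hidden units
          nn.ReLU(),                         # rectified linear units
          nn.Linear(1000, n_tags),           # one logit per tag
      )
      loss_fn = nn.BCEWithLogitsLoss()       # multi-label cross-entropy objective

      x = torch.randn(32, n_features)                   # a batch of clip-level feature vectors
      y = torch.randint(0, 2, (32, n_tags)).float()     # binary tag annotations
      loss = loss_fn(mlp(x), y)
      loss.backward()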

  • Results: tag prediction on the Magnatagatune dataset

  • Results: importance of each timescale for different types of tags

  • Learning features at multiple timescales improves performance over single-timescale approaches

    Spherical K-means features consistently improve performance

  • II. Deep content-based music recommendation

  • Music recommendation is becoming an increasingly relevant problem

    Shift to digital distribution: the long tail
    The long tail is particularly long for music

  • Collaborative filtering: use listening patterns for recommendation

    + good performance
    - cold start problem: many niche items that only appeal to a small audience

  • Content-based: use audio content and/or metadata for recommendation

    + no usage data required: allows all items to be recommended, regardless of popularity
    - worse performance

  • There is a large semantic gap between audio signals and listener preference

    Factors along the gap include: genre, popularity, time, lyrical themes, mood, instrumentation, location

  • Matrix factorization: model listening data as a product of latent factors

    R ≈ X Y^T, where R (users x songs) contains the listening data (play counts),
    X contains the user profiles (latent factors) and Y the song profiles (latent factors)

  • Weighted matrix factorization (WMF): latent factor model for implicit feedback data

    Play count > 0 is a strong positive signal
    Play count = 0 is a weak negative signal
    WMF uses a confidence matrix to emphasize positive signals:

    min_{x*, y*} sum_{u,i} c_ui (p_ui - x_u^T y_i)^2 + lambda (sum_u ||x_u||^2 + sum_i ||y_i||^2)

    (Hu et al., ICDM 2008)
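
    A compact NumPy sketch of WMF trained with alternating least squares, following the Hu et al. formulation above (dense matrices for clarity; real implementations exploit sparsity, and the alpha/lambda values are assumptions).

      import numpy as np

      def wmf_als(R, n_factors=40, alpha=2.0, lam=1e-3, n_iter=10, seed=0):
          """Weighted matrix factorization by alternating least squares.
          R: (n_users, n_songs) play-count matrix."""
          P = (R > 0).astype(float)                    # binary preference p_ui
          C = 1.0 + alpha * R                          # confidence c_ui: emphasize positive signals
          rng = np.random.default_rng(seed)
          X = 0.1 * rng.standard_normal((R.shape[0], n_factors))
          Y = 0.1 * rng.standard_normal((R.shape[1], n_factors))
          I = lam * np.eye(n_factors)

          def solve(fixed, Cmat, Pmat):
              # For each row u: minimize sum_i c_ui (p_ui - x_u^T y_i)^2 + lam ||x_u||^2
              out = np.empty((Cmat.shape[0], n_factors))
              for u in range(Cmat.shape[0]):
                  Cu = Cmat[u]
                  A = (fixed * Cu[:, None]).T @ fixed + I
                  b = (fixed * (Cu * Pmat[u])[:, None]).sum(axis=0)
                  out[u] = np.linalg.solve(A, b)
              return out

          for _ in range(n_iter):
              X = solve(Y, C, P)                       # update user factors, song factors fixed
              Y = solve(X, C.T, P.T)                   # update song factors, user factors fixed
          return X, Y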

  • We predict latent factors from music audio signals

    As before, R ≈ X Y^T; a regression model maps the audio signal of a song to its
    latent factor vector (a row of Y), so factors can be obtained for songs without usage data

  • Deep learning approach: convolutional neural network

    Input: spectrograms (128 bins, windows of ~3 seconds), processed by alternating
    convolution and max-pooling layers along the time axis
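
    A minimal sketch of such a convnet in PyTorch (my own illustration, not the exact architecture from the talk): the 128 spectrogram bins are treated as input channels and convolved over time, ending in a regression onto the latent song factors.

      import torch
      import torch.nn as nn

      n_bins, n_factors = 128, 40            # 40 latent factors is an assumption

      cnn = nn.Sequential(
          nn.Conv1d(n_bins, 128, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
          nn.Conv1d(128, 128, kernel_size=4), nn.ReLU(), nn.MaxPool1d(2),
          nn.Flatten(),
          nn.LazyLinear(n_factors),          # regression onto the WMF song factors
      )

      x = torch.randn(8, n_bins, 130)        # a batch of ~3 s spectrogram excerpts
      pred = cnn(x)                          # trained against the WMF factors, e.g. with MSE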

  • The Million Song Dataset provides metadata for 1,000,000 songs

    + Echo Nest Taste Profile subset: listening data from 1.1M users for 380k songs
    + 7digital: raw audio clips (for over 99% of the dataset)

    (Bertin-Mahieux et al., ISMIR 2011)

  • Quantitative evaluation: music recommendation performance

    Subset (9330 songs, 20000 users)

    Model                     mAP@500   AUC
    Metric learning to rank   0.01801   0.60608
    Linear regression         0.02389   0.63518
    Multilayer perceptron     0.02536   0.64611
    CNN with MSE              0.05016   0.70987
    CNN with WPE              0.04323   0.70101

  • Quantitative evaluation: music recommendation performance

    Full dataset (382,410 songs, 1M users)

    Model               mAP@500   AUC
    Random              0.00015   0.49935
    Linear regression   0.00101   0.64522
    CNN with MSE        0.00672   0.77192
    Upper bound         0.23278   0.96070


  • Qualitative evaluation: some queries and their closest matches

    Query: Coldplay - I Ran Away

    Most similar tracks (WMF):
    Coldplay - Careful Where You Stand
    Coldplay - The Goldrush
    Coldplay - X & Y
    Coldplay - Square One
    Jonas Brothers - BB Good

    Most similar tracks (predicted):
    Arcade Fire - Keep The Car Running
    M83 - You Appearing
    Angus & Julia Stone - Hollywood
    Bon Iver - Creature Fear
    Coldplay - The Goldrush

  • Qualitative evaluation: some queries and their closest matches

    Query: Beyoncé - Speechless

    Most similar tracks (WMF):
    Beyoncé - Gift From Virgo
    Beyoncé - Daddy
    Rihanna / J-Status - Crazy Little Thing Called ...
    Beyoncé - Dangerously In Love
    Rihanna - Haunted

    Most similar tracks (predicted):
    Daniel Bedingfield - If You're Not The One
    Rihanna - Haunted
    Alejandro Sanz - Siempre Es De Noche
    Madonna - Miles Away
    Lil Wayne / Shanell - American Star

  • Qualitative evaluation: some queries and their closest matches

    Query: Daft Punk - Rock'n Roll

    Most similar tracks (WMF):
    Daft Punk - Short Circuit
    Daft Punk - Nightvision
    Daft Punk - Too Long
    Daft Punk - Aerodynamite
    Daft Punk - One More Time

    Most similar tracks (predicted):
    Boys Noize - Shine Shine
    Boys Noize - Lava Lava
    Flying Lotus - Pet Monster Shotglass
    LCD Soundsystem - One Touch
    Justice - One Minute To Midnight

  • Qualitative evaluation: visualisation of predicted usage patterns (t-SNE)

    (McFee et al., TASLP 2012)


  • Predicting latent factors is a viable method for music recommendation

  • III. End-to-end learning for music audio

  • The traditional two-stage approach: feature extraction + shallow classifier

    Vision: extract features (SIFT, HOG, ...) -> shallow classifier (SVM, RF, ...)
    Audio: extract features (MFCCs, chroma, ...) -> shallow classifier (SVM, RF, ...)

  • Integrated approach: learn both the features and the classifier

    Extract a mid-level representation (spectrograms, constant-Q) -> learn features + classifier
    End-to-end learning: try to remove the mid-level representation and learn features + classifier directly from the audio

  • Convnets can learn the features and the classifier simultaneously

    features -> features -> features -> predictions

  • We use log-scaled mel spectrograms as a mid-level representation

    hop size = window size / 2
    power spectrogram: X(f) = |STFT[x(t)]|^2
    logarithmic loudness compression (DRC): X'(f) = log(1 + C X(f))
    mel binning: X''(f) = M X'(f)
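
    A short sketch of this representation using librosa (a minimal illustration, not the original code: here mel binning is applied to the power spectrogram before the logarithmic compression, and the default values are assumptions).

      import numpy as np
      import librosa

      def log_mel_spectrogram(y, sr=22050, n_fft=1024, n_mels=128, C=10000.0):
          """Log-scaled mel spectrogram: power STFT with hop size = window size / 2,
          mel binning, then logarithmic loudness compression log(1 + C*X)."""
          S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=n_fft // 2,
                                             n_mels=n_mels, power=2.0)
          return np.log(1.0 + C * S)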

  • Evaluation: tag prediction on Magnatagatune

    25863 clips of 29 seconds, annotated with 188 tags
    Tags are versatile: genre, tempo, instrumentation, dynamics, ...


  • Spectrograms vs. raw audio signals

    Length   Stride   AUC (spectrograms)   AUC (raw audio)
    1024     1024     0.8690               0.8366
    1024     512      0.8726               0.8365
    512      512      0.8793               0.8386
    512      256      0.8793               0.8408
    256      256      0.8815               0.8487

  • The learned filters are mostly frequency-selective (and noisy)

  • Their dominant frequencies resemble the mel scale

  • Changing the nonlinearity to introduce compression does not help

    Nonlinearity                   AUC (raw audio)
    Rectified linear, max(0, x)    0.8366
    Logarithmic, log(1 + C x^2)    0.7508
    Logarithmic, log(1 + C |x|)    0.7487


  • Adding a feature pooling layer lets the network learn invariances

    Pooling method   Pool size   AUC (raw audio)
    No pooling       1           0.8366
    L2 pooling       2           0.8387
    L2 pooling       4           0.8387
    Max pooling      2           0.8183
    Max pooling      4           0.8280
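
    A minimal sketch of L2 pooling over groups of adjacent feature maps (my own illustration of the idea on the slide, not the original code).

      import torch

      def l2_feature_pooling(h, pool_size):
          """L2 pooling across groups of `pool_size` adjacent filters: each pool is
          collapsed into a single feature map, so the filters within a pool are free
          to become e.g. shifted versions of each other."""
          b, c, t = h.shape                        # (batch, channels, time); c must be divisible by pool_size
          h = h.view(b, c // pool_size, pool_size, t)
          return h.pow(2).sum(dim=2).sqrt()        # (batch, channels // pool_size, time)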

  • The pools consist of filters that are shifted versions of each other

  • Learning features from raw audio is possible, but this doesn't work as well as using spectrograms (yet)

  • IV. Transfer learning by supervised pre-training

  • Supervised feature learning

    Train a network on a supervised task (input -> output labels such as dog, cat, rabbit,
    penguin, car, table); the hidden-layer activations are features that can be reused

  • Supervised feature learning for MIR tasks

    Lots of training data for:
    - automatic tagging
    - user listening preference prediction (i.e. recommendation)

    Target tasks:
    GTZAN           genre classification   10 genres
    Unique          genre classification   14 genres
    1517-artists    genre classification   19 genres
    Magnatagatune   automatic tagging      188 tags

  • Tag and listening prediction differ from typical classification tasks

    - multi-label classification
    - large number of classes (tags, users)
    - weak labeling
    - redundancy
    - sparsity

    Use WMF for label space dimensionality reduction

  • Schematic overview

  • Source task results: user listening preference prediction

    Model                  NMSE    AUC     mAP
    Linear regression      0.986   0.75    0.0076
    MLP (1 hidden layer)   0.971   0.76    0.0149
    MLP (2 hidden layers)  0.961   0.746   0.0186

  • Source task results: tag prediction

    Model                  NMSE    AUC     mAP
    Linear regression      0.965   0.823   0.0099
    MLP (1 hidden layer)   0.939   0.841   0.0179
    MLP (2 hidden layers)  0.924   0.837   0.0179

  • Target task results: GTZAN genre classification

  • Target task results: Unique genre classification

  • Target task results: 1517-artists genre classification

  • Target task results: Magnatagatune auto-tagging (50 tags)

  • Target task results: Magnatagatune auto-tagging (188 tags)

  • V. More music recommendation

  • Convnet architecture: 30-second spectrograms -> latent factors

    Input: spectrograms (30 seconds), 128 bins x 599 frames
    Three convolutional layers (filter length 4) with 256, 256 and 512 filters,
    interleaved with 4x, 2x and 2x max-pooling over time (599 -> 149 -> 73 -> 35 frames)
    Global temporal pooling (mean, L2, max) -> 1536 features
    Dense layers: 2048 -> 2048 -> 40 latent factors
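
    A PyTorch sketch of this architecture under the dimensions recovered from the slide (the global temporal pooling concatenates mean, L2 and max statistics over time; activation choices are assumptions).

      import torch
      import torch.nn as nn

      class GlobalTemporalPooling(nn.Module):
          """Summarize the whole time axis with mean, L2 and max statistics."""
          def forward(self, h):                                  # h: (batch, channels, time)
              return torch.cat([h.mean(dim=2),
                                h.pow(2).mean(dim=2).sqrt(),
                                h.amax(dim=2)], dim=1)           # 3 * 512 = 1536 features

      features = nn.Sequential(
          nn.Conv1d(128, 256, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),   # 599 -> 149 frames
          nn.Conv1d(256, 256, kernel_size=4), nn.ReLU(), nn.MaxPool1d(2),   # 149 -> 73 frames
          nn.Conv1d(256, 512, kernel_size=4), nn.ReLU(), nn.MaxPool1d(2),   # 73 -> 35 frames
      )
      head = nn.Sequential(nn.Linear(3 * 512, 2048), nn.ReLU(),
                           nn.Linear(2048, 2048), nn.ReLU(),
                           nn.Linear(2048, 40))                  # 40 latent factors

      x = torch.randn(4, 128, 599)                               # batch of 30 s spectrograms
      z = head(GlobalTemporalPooling()(features(x)))             # predicted factors, shape (4, 40)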


  • DEMO

  • Papers

    Multiscale approaches to music audio feature learning
    Sander Dieleman, Benjamin Schrauwen, ISMIR 2013

    Deep content-based music recommendation
    Aäron van den Oord, Sander Dieleman, Benjamin Schrauwen, NIPS 2013

    End-to-end learning for music audio
    Sander Dieleman, Benjamin Schrauwen, ICASSP 2014

    Transfer learning by supervised pre-training for audio-based music classification
    Aäron van den Oord, Sander Dieleman, Benjamin Schrauwen, ISMIR 2014