Computer Vision for Music Identification
Computer Vision and Pattern Recognition (CVPR) 2005
Yan Ke, Derek Hoiem, and Rahul Sukthankar, CMU
Presented by Eugene Weinstein, April 4th, 2006

Music Identification

- Scenario:
  - User records a few seconds of audio
  - Need to match it to a database of songs
- Recording could be distorted due to
  - Noise: background, crosstalk, etc.
  - Transmission over limited channels (e.g., cell phone)
- Recording can come from any point of song
  - Must align to correct position in reference recording
- Practical issues: must support >100,000 songs

Main Contributions

- Novel computer vision approach to an audio task
- Pairwise variant of boosting
- Functional music identification system with state-of-the-art performance

Background Topics To Be Covered

- Spectrograms
- Viola/Jones image features
- Boosting/AdaBoost
- Expectation Maximization (EM)
- Random Sample Consensus (RANSAC)

Spectrogram

- Graphical representation of frequency content of sound
- Based on short-time Fourier transform
  - Extract frequency content over a time window
  - Plot time against frequency “density”
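A minimal sketch of the short-time Fourier transform behind a spectrogram, using NumPy. The 11.6 ms hop matches the frame length quoted later in the talk; the 46.4 ms window, the Hann taper, and the function name `spectrogram` are assumptions chosen for illustration, not the settings used by the authors.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_s=0.0464, hop_s=0.0116):
    """Return (times, freqs, power) of a magnitude-squared STFT.

    window_s and hop_s are illustrative assumptions, not the paper's settings.
    """
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    taper = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    # Slice the signal into overlapping, tapered frames.
    frames = np.stack([signal[i * hop:i * hop + win] * taper
                       for i in range(n_frames)])
    # One row per time frame, one column per frequency bin.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + win / 2) / sample_rate
    return times, freqs, power

# One second of a pure 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
times, freqs, power = spectrogram(tone, sr)
peak_hz = freqs[power.mean(axis=0).argmax()]  # strongest bin lies near 440 Hz
```

Plotting `power` as an image, with time on one axis and frequency on the other, gives the picture the slide describes.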

Another Example

- Utterance shown: “The ship was torn apart on the sharp (reef)”

Spectrograms of Music

- Identify music snippets by matching spectrogram of test recording to reference recording
- Comparing by correlation is inaccurate and slow
- Solution: match based on simple features

Viola/Jones Features

- Use rectangle features instead of pixels
- Can compute efficiently with integral image
  - Compute sum of pixels within a box; features are combinations of box sums
- B, W: black, white regions
  - Two rectangles: W - B
  - Three: W1 + W2 - B
  - Four: W1 + W2 - (B1 + B2)
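The box-sum trick can be sketched as follows. The integral image stores cumulative sums, so any rectangle sum needs only four lookups. The function names and the particular two-rectangle layout (white upper half, black lower half) are hypothetical choices for illustration.

```python
import numpy as np

def integral_image(img):
    """Cumulative-sum table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """W - B, with the white box on top and the black box below (assumed layout)."""
    half = height // 2
    white = box_sum(ii, top, left, half, width)
    black = box_sum(ii, top + half, left, half, width)
    return white - black

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
inner = box_sum(ii, 1, 1, 2, 2)              # equals img[1:3, 1:3].sum()
feature = two_rect_feature(ii, 0, 0, 4, 4)   # top two rows minus bottom two rows
```

On a spectrogram, a feature like this compares power between two adjacent regions of the time/frequency plane.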

Viola/Jones Features

Above feature classes model

- Power differences across frequencies
- Power differences across time
- Up/downward drifts of dominant frequency
- Power peaks across frequencies
- Power peaks across time

Viola/Jones Features

- Each feature can vary in
  - Frequency location: 1 to 33
  - Frequency width: 1 to 33
  - Time width: 1 frame (11.6 ms) to 82 frames (951 ms)
- ~25,000 total possible features

Feature Selection

- Need to pick M features (out of ~25,000)
  - Matching song classifier composed of selected features
- Idea: use boosting to select features that yield
  - Similar output when recording is a match
  - Differing output when recording is not a match

AdaBoost Review

- Standard AdaBoost scenario: boost classification performance of a “weak” classifier, e.g., perceptron
  - Apply to successively harder problems
  - Tweak parameters at each classification stage
- This work: use Viola/Jones features as weak classifiers
  - Find sequence of best features by boosting

Boosting Framework

- x1, x2: spectrogram images
- Want to find “strong” classifier H(x1, x2) =
  - 1 if images derive from same audio source
  - -1 if they derive from different sources
- Find weak classifiers of the form
- Want matching images on same side of threshold
- Difference from AdaBoost: label assigned to pairs of images (pairwise boosting)
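The formula for the weak classifier is not preserved in this transcript. One form consistent with "matching images on same side of threshold" is h(x1, x2) = sgn[(f(x1) - t)(f(x2) - t)], where f is a rectangle feature and t a threshold. The sketch below assumes that form; `weak_classifier` and its arguments are hypothetical names.

```python
import numpy as np

def weak_classifier(f1, f2, threshold):
    """Return +1 where both feature values fall on the same side of the
    threshold (predicted match), and -1 otherwise (assumed form)."""
    same_side = (np.asarray(f1) > threshold) == (np.asarray(f2) > threshold)
    return np.where(same_side, 1, -1)

# Feature values of two image pairs: the first pair agrees, the second does not.
f_x1 = np.array([0.2, 0.9])
f_x2 = np.array([0.4, 0.1])
votes = weak_classifier(f_x1, f_x2, threshold=0.5)
```

Under this assumption each selected feature contributes one bit per spectrogram frame (which side of the threshold it falls on), which is what makes the M-bit descriptors used later in the talk possible.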

Boosting Initialization

Given: n spectrogram image pairs:

Labels for each pair:

Initialize weights:

Boosting Training Loop

For m = 1, …, M:

1. Select min-error classifier
2. If ith image pair is classified incorrectly and y_i = 1 (matching pair of images), adjust its weight up:
   - If y_i = -1, don’t do anything
3. Normalize the weights such that:
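A rough sketch of the initialization and training loop. The equations are missing from this transcript, so several specifics are assumptions and should be checked against the paper: uniform initial weights w_i = 1/n, confidence α = log((1 - err)/err), up-weighting of misclassified matching pairs by exp(α), and normalization so that matching and non-matching weights each sum to 1/2. The candidate weak classifiers are represented by a precomputed matrix `H` of their ±1 outputs; all names are hypothetical.

```python
import numpy as np

def pairwise_boost(H, y, M):
    """H: (K, n) array of +/-1 outputs of K candidate weak classifiers on
    n image pairs. y: (n,) labels, +1 matching, -1 non-matching.
    Returns the chosen candidate indices, their confidences, final weights."""
    n = len(y)
    pos, neg = (y == 1), (y == -1)
    w = np.full(n, 1.0 / n)                      # assumed uniform start
    chosen, alphas = [], []
    for _ in range(M):
        # Step 1: weighted error of every candidate; keep the best.
        errs = (H != y).astype(float) @ w
        k = int(np.argmin(errs))
        err = float(np.clip(errs[k], 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err)          # assumed confidence formula
        # Step 2: raise weights of misclassified MATCHING pairs only.
        wrong_match = pos & (H[k] != y)
        w[wrong_match] *= np.exp(alpha)
        # Step 3: renormalize each class to total weight 1/2 (assumption).
        w[pos] *= 0.5 / w[pos].sum()
        w[neg] *= 0.5 / w[neg].sum()
        chosen.append(k)
        alphas.append(alpha)
    return chosen, np.array(alphas), w

y = np.array([1, 1, 1, -1, -1, -1])
H = np.array([
    [1,  1, -1, -1, -1, -1],   # misses one matching pair
    [1, -1, -1, -1, -1,  1],   # three mistakes
    [-1, -1, 1, -1, -1, -1],   # misses two matching pairs
])
chosen, alphas, w = pairwise_boost(H, y, M=2)
```

In this toy run the first round picks candidate 0. Its one mistake is then weighted heavily enough that the second round picks candidate 2, which gets that pair right.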

Final (Strong) Classifier

- Linear combination of weak classifiers
  - Weighted by performance of each classifier
- Note: if the weight of classifier t is zero, classifier t does not contribute to the combination
- Strong classifier apparently not used in final system, just for evaluating selected features

Differences From AdaBoost

- AdaBoost reweights all correctly learned points down and incorrect points up
- Our boosting algorithm cannot do that
  - Recall, our classifier has the form
  - Let us draw a pair of non-matching spectrogram images x1, x2 at random
  - Then let
  - But then
  - Thus,
  - Violates weak classifier criterion: correct at least ½ the time
- Solution: reweight only matching examples
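The equations for this argument are not preserved in the transcript. One way to fill in the reasoning is sketched below. It assumes the weak classifier form h(x1, x2) = sgn[(f(x1) - t)(f(x2) - t)], and that the two images of a non-matching pair are independent draws; the symbol p is introduced here for illustration.

```latex
\begin{align*}
\text{Let } p &= P\bigl(f(x) > t\bigr) \text{ for a randomly drawn image } x. \\
P\bigl(h(x_1, x_2) = -1\bigr)
  &= P\bigl(f(x_1) > t,\; f(x_2) \le t\bigr) + P\bigl(f(x_1) \le t,\; f(x_2) > t\bigr) \\
  &= 2p(1 - p) \;\le\; \tfrac{1}{2},
  \qquad \text{with equality only at } p = \tfrac{1}{2}.
\end{align*}
```

So on random non-matching pairs a classifier of this form is correct at most half the time, whatever feature and threshold are chosen. Up-weighting those pairs cannot lead to a better-than-chance weak classifier, which motivates reweighting only the matching examples.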

Occlusion Model

- Boosting classifier identifies distorted versions of the same song
- However, some parts of the recorded song might be mostly noise or interference
- Thus, need to model whether audio chunk is the song or some distraction (occlusion)

Occlusion Model

- Compute M weak features at each time step (11.6 ms): “descriptor” (M-bit vector)
- Probability that current descriptor is caused by an occlusion depends on
  - Current descriptor: x_i
  - Whether previous descriptor was caused by occlusion: y_{i-1}

Occlusion Model Details

- Given
  - n vector descriptors from recorded song’s spectrogram:
  - Descriptors from original song:
  - Differences between recorded and original descriptors:
- Find: y_i ∈ {0, 1}, whether ith chunk is due to distortion

What’s the problem?

- Have to simultaneously estimate distributions for
  - x_i^{r-o}: data, with underlying distribution
  - y_i: occlusion labels
- Solution:
  - Model data, labels with Bernoulli distribution
  - Apply Expectation Maximization (EM) algorithm

EM: An Aside

- Given dependent random variables:
  - Observed variable x
  - Latent (unobserved) variable y that generates x
- Assume probability distributions: P_θ(x), P_θ(y)
  - θ represents all parameters of distribution
- Repeat until convergence
  - E-step: Compute “expectation” of log P_θ(y, x)
    - θ', θ: old, new distribution parameters
  - M-step: Find θ that maximizes above sum
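The "expectation" formula itself is missing from the transcript. In the standard formulation of EM it is the expected complete-data log-likelihood under the old parameters, which is presumably the quantity later slides refer to as (1):

```latex
\begin{equation}
\sum_{y} P_{\theta'}(y \mid x)\, \log P_{\theta}(y, x)
\tag{1}
\end{equation}
```

The E-step evaluates the posterior weights P_{θ'}(y | x) using the old parameters θ'. The M-step then treats those weights as fixed and maximizes the sum over the new parameters θ.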

EM Derivation

- Lemma (special case of Jensen’s inequality): Let p(x), q(x) be probability distributions. Then
- Proof: rewrite as:
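The lemma's formulas are not preserved here. Assuming it is the usual inequality used in EM proofs (as in Jelinek's textbook treatment), the statement and a short proof can be written as:

```latex
\begin{align*}
&\textbf{Lemma.}\quad \sum_x p(x) \log p(x) \;\ge\; \sum_x p(x) \log q(x). \\[4pt]
&\textbf{Proof.}\quad \text{Rewrite as } \sum_x p(x) \log \frac{q(x)}{p(x)} \le 0,
  \text{ and use } \log z \le z - 1: \\
&\qquad \sum_x p(x) \log \frac{q(x)}{p(x)}
  \;\le\; \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right)
  \;=\; \sum_x q(x) - \sum_x p(x) \;\le\; 0,
\end{align*}
```

where the sums run over x with p(x) > 0. Equality holds exactly when p = q.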

EM Derivation

EM Theorem:

If then

Proof:

By a lot of algebra and lemma on last slide,

So, if this quantity is positive, so is
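The theorem and the "lot of algebra" can be sketched as follows. This assumes the theorem is the standard EM monotonicity result, and the symbol Q is shorthand introduced here for the E-step sum.

```latex
\begin{align*}
&\text{Write } Q(\theta', \theta) = \sum_y P_{\theta'}(y \mid x) \log P_{\theta}(y, x). \\[4pt]
&\textbf{Theorem.}\quad \text{If } Q(\theta', \theta) > Q(\theta', \theta')
  \text{ then } P_{\theta}(x) > P_{\theta'}(x). \\[4pt]
&\textbf{Proof.}\quad \text{Since } \log P_{\theta}(x) = \log P_{\theta}(y, x) - \log P_{\theta}(y \mid x)
  \text{ for every } y, \text{ averaging over } P_{\theta'}(y \mid x) \text{ gives} \\
&\qquad \log P_{\theta}(x) = Q(\theta', \theta) - \sum_y P_{\theta'}(y \mid x) \log P_{\theta}(y \mid x). \\
&\text{Subtracting the same identity written for } \theta': \\
&\qquad \log P_{\theta}(x) - \log P_{\theta'}(x)
  = \bigl[ Q(\theta', \theta) - Q(\theta', \theta') \bigr]
  + \sum_y P_{\theta'}(y \mid x) \log \frac{P_{\theta'}(y \mid x)}{P_{\theta}(y \mid x)} \\
&\qquad\qquad \ge\; Q(\theta', \theta) - Q(\theta', \theta'),
\end{align*}
```

because the last sum is non-negative by the lemma. If the difference in Q is positive, so is log P_θ(x) - log P_{θ'}(x).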

EM Summary

- Repeat until convergence
  - E-step: Compute “expectation” of log P_θ(x, y)
    - θ', θ: old, new distribution parameters
  - M-step: Find θ that maximizes (1)
- EM Theorem: If then
- Interpretation
  - As long as we can improve the “expectation” in (1), EM improves our model of observed variable x

EM Discussion

- Problems with EM?
  - Local maxima
  - Need to bootstrap training process (pick a θ)
- When is EM most useful?
  - When model distributions are easy to maximize, e.g., Gaussian mixture models
- EM is a meta-algorithm, needs to be adapted to particular application

Applying EM to Our Problem

- EM “score”:
  - x_i^{r-o}: data, y_i: labels
- Model P(x_i^{r-o} | y_i) with 2M Bernoulli variables
  - Each x_i consists of M = 32 weak classifier outputs
- Model P(y_i | y_{i-1}) with 2 Bernoulli variables
- 2M + 2 = 66 total parameters to estimate
- Repeat until convergence
  - E-step: Compute “expectation” of log P_θ(x, y)
    - θ', θ: old, new distribution parameters
  - M-step: Find θ that maximizes (1)
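A model with a binary label chain P(y_i | y_{i-1}) and per-bit Bernoulli emissions P(x_i | y_i) is a two-state hidden Markov model. The sketch below is not the EM procedure itself. It shows the forward algorithm, a standard way to compute the sequence likelihood that an E-step builds on. The initial-state distribution `init`, the array layouts, and all names are assumptions made for illustration.

```python
import numpy as np

def sequence_log_likelihood(diffs, p_bit, trans, init):
    """Forward algorithm in the log domain.

    diffs : (n, M) 0/1 array, diffs[i, j] = 1 if bit j of the recorded and
            original descriptors differs at time step i
    p_bit : (2, M) P(bit differs | y), row 0 = not occluded, row 1 = occluded
    trans : (2, 2) trans[a, b] = P(y_i = b | y_{i-1} = a)
    init  : (2,)   distribution of the first label (extra assumption)
    """
    # log P(x_i | y) for both label values at every step, shape (n, 2).
    log_emit = diffs @ np.log(p_bit).T + (1 - diffs) @ np.log(1 - p_bit).T
    log_alpha = np.log(init) + log_emit[0]
    for i in range(1, len(diffs)):
        # prev[a, b] = log alpha(a) + log P(b | a); sum out the previous label a.
        prev = log_alpha[:, None] + np.log(trans)
        log_alpha = np.logaddexp(prev[0], prev[1]) + log_emit[i]
    return np.logaddexp(log_alpha[0], log_alpha[1])

M, n = 4, 5
rng = np.random.default_rng(0)
diffs = rng.integers(0, 2, size=(n, M))
p_bit = np.full((2, M), 0.5)                    # uninformative emissions
trans = np.array([[0.9, 0.1], [0.2, 0.8]])      # 2 free parameters
init = np.array([0.5, 0.5])
ll = sequence_log_likelihood(diffs, p_bit, trans, init)
n_params = p_bit.size + 2                        # 2M + 2
```

With M = 32 the same count gives 2M + 2 = 66 parameters, matching the slide.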

EM For Song Matching

- Given recording x^r, find most likely original song x^o that produced the recording
- Reject any match where EM score is less than threshold T: need
  - Unclear how T is determined
- So, now we can calculate the likelihood of a recording snippet matching a given original song
- But, matching against entire song database is too slow
- Solution: search database for near-neighbors of recording

Retrieval

- Calculate M-bit descriptor of each song in database at each time step (11.6 ms)
  - Store in hash table (descriptor → song)
- To look up song, perturb descriptor vector
  - Try all flips of 1 bit, 2 bits, etc. (= Hamming distance 1, 2, etc.)
  - Look up perturbed vectors in hash table
- Get back near-neighbor candidates
- Now, need to align: use RANSAC
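The lookup scheme can be sketched with a plain dictionary. Descriptors are stored as integers, and storing a time offset alongside each song id is an assumption made here so that candidates can later be aligned. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_index(song_descriptors):
    """Map each descriptor value to the (song_id, time_step) pairs where it occurs."""
    index = defaultdict(list)
    for song_id, descriptors in song_descriptors.items():
        for t, d in enumerate(descriptors):
            index[d].append((song_id, t))
    return index

def neighbors(descriptor, n_bits, max_dist):
    """Yield every n_bits-wide value within Hamming distance max_dist."""
    for dist in range(max_dist + 1):
        for positions in combinations(range(n_bits), dist):
            flipped = descriptor
            for p in positions:
                flipped ^= 1 << p        # flip one bit
            yield flipped

def lookup(index, descriptor, n_bits=32, max_dist=2):
    """Collect candidates for all perturbed versions of the query descriptor."""
    hits = []
    for candidate in neighbors(descriptor, n_bits, max_dist):
        hits.extend(index.get(candidate, []))
    return hits

# Toy database with 4-bit descriptors.
index = build_index({"a": [0b1010, 0b1111], "b": [0b0000]})
hits = lookup(index, 0b1011, n_bits=4, max_dist=1)
n_probes = sum(1 for _ in neighbors(0, 32, 2))   # 1 + 32 + 496 probes
```

For 32-bit descriptors, a Hamming radius of 2 costs 529 hash probes per query descriptor. The probe count grows quickly with the radius, so fast retrieval needs a small one.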

Another Aside: RANSAC

- Random Sample Consensus (Fischler & Bolles, 1981)
- Assumption: data to be modeled consists of
  - Mostly data points matching the model (“inliers”)
  - A few outliers
- Idea:
  - Keep picking random samples of data points
  - Eventually we will pick a set with few outliers
  - Improvement: pick data points intelligently
- What is “a few” outliers?
  - When selected data points are explained by model within a certain error tolerance
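To illustrate the general idea, here is a minimal RANSAC sketch on a generic line-fitting problem rather than on song alignment. The tolerance, iteration count, and function name are arbitrary choices made for this example.

```python
import numpy as np

def ransac_line(x, y, n_iters=200, tol=0.5, seed=0):
    """Fit y = slope * x + intercept while ignoring gross outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(n_iters):
        # Minimal random sample: two points define a candidate line.
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Consensus set: points explained within the error tolerance.
        inliers = np.abs(y - (slope * x + intercept)) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the largest consensus set found.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return slope, intercept, best_inliers

x = np.arange(20.0)
y = 2.0 * x + 1.0
y[[3, 7, 15]] += 30.0                    # three gross outliers
slope, intercept, inliers = ransac_line(x, y)
```

A least-squares fit over all 20 points would be pulled toward the outliers. The consensus step lets the 17 consistent points determine the line.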

Applying RANSAC

- Given: sequence of M-bit descriptors over test recording
- Iterate over all time alignments of test recording to candidate originals
  - Select alignments at random
  - Compute EM score over all descriptors for each alignment
- Pick candidate original with best EM score
  - Subject to
- <500 iterations usually “sufficient”

Experiments

- Data set: 1,861 songs from variety of genres
- First, learn “bootstrap” features and EM parameters on synthetically distorted data
  - This yields basic model good enough to align training data to original
- Training data: 78 songs played, recorded using low-quality equipment
- Test data
  - A: 71 songs played at low volume, recorded with distorted microphone
  - B: 220 songs recorded in “very noisy” setup

Testing Descriptor Performance

- Test data: ~100,000 (+) / 1,000,000 (-) examples
  - 15-second snippets of 71 songs from test set A
- Baseline: original and improved algorithm from other authors (Haitsma & Kalker, 2002)
- Vary Hamming distance threshold to generate ROC curve

Varying Hamming Threshold

- For fast retrieval, need HamDist ≤ 2
- For 10-sec query, need recall of a few %
  - Recall = TP/(TP+FN)
  - Precision = TP/(TP+FP)
- Recall rates table vs. HamDist threshold
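A small worked example of why a per-descriptor recall of only a few percent can be enough. The counts below are hypothetical and not taken from the paper's results table; the 860 descriptors per 10-second query is the figure quoted in the talk.

```python
def recall(tp, fn):
    """Fraction of true matches that are retrieved."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of retrieved items that are true matches."""
    return tp / (tp + fp)

def expected_hits(n_descriptors, per_descriptor_recall):
    """Expected number of query descriptors that retrieve the correct song."""
    return n_descriptors * per_descriptor_recall

r = recall(tp=30, fn=970)        # hypothetical counts: 3% recall
p = precision(tp=30, fp=10)      # hypothetical counts: 75% precision
hits = expected_hits(860, r)     # descriptors in a 10-second query
```

Even at 3% recall, a couple of dozen of the 860 descriptors are expected to point at the correct song, which gives the alignment step candidates to work with.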

Song Retrieval

- 10/15 seconds (860, 1290 descriptors)
- Both Test Sets A and B
- HamDist = {0, 1, 2}

Interface Screenshot

Thank You!

Any questions?

References:

- Lecture notes, MIT class 6.345: Automatic Speech Recognition
- F. Jelinek, Statistical Methods for Speech Recognition, 1997
- M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Comm. of the ACM, Vol. 24, pp. 381-395, June 1981.
- Wikipedia