computer vision for music identificationeugenew/publications/ke-cvmusicid05-talk.pdf · recording...

Computer Vision for Music Identification

Computer Vision and Pattern Recognition (CVPR) 2005
Yan Ke, Derek Hoiem, and Rahul Sukthankar, CMU

Presented by Eugene Weinstein
April 4th, 2006

2

Music Identification
  Scenario:
    User records a few seconds of audio
    Need to match it to a database of songs
  Recording could be distorted due to
    Noise: background, crosstalk, etc.
    Transmission over limited channels (e.g., cell phone)
  Recording can come from any point in the song
    Must align to the correct position in the reference recording
  Practical issue: must support >100,000 songs

3

Main Contributions
  Novel computer vision approach to an audio task
  Pairwise variant of boosting
  Functional music identification system with state-of-the-art performance

4

Background Topics To Be Covered
  Spectrograms
  Viola/Jones image features
  Boosting/AdaBoost
  Expectation Maximization (EM)
  Random Sample Consensus (RANSAC)

5

Spectrogram
  Graphical representation of the frequency content of sound
  Based on the short-time Fourier transform
    Extract frequency content over a time window
    Plot time against frequency “density”
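The short-time Fourier transform described above can be sketched in a few lines. This is a minimal illustration that uses a naive DFT from the standard library so it stays self-contained; a real system would use an FFT library. The frame length, hop size, Hann window, and the name `spectrogram` are assumptions, not the settings used in the talk.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram from a naive short-time Fourier transform.

    Returns a list of frames; each frame holds frame_len // 2 + 1 magnitudes.
    frame_len and hop are illustrative, not the settings used in the talk.
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT over the non-negative frequency bins (O(N^2), fine for a demo).
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames

# A 1 kHz tone sampled at 8 kHz falls in bin 1000 * 64 / 8000 = 8 of every frame.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
spec = spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
```

Each row of `spec` is one time window and each column one frequency bin, which is the time-versus-frequency image the later slides treat as a picture.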

6

Another Example
  “The ship was torn apart on the sharp (reef)”

7

Spectrograms of Music
  Identify music snippets by matching spectrogram of test recording to reference recording
  Comparing by correlation is inaccurate and slow
  Solution: match based on simple features

8

Viola/Jones Features
  Use rectangle features instead of pixels
  Can compute efficiently with integral image
  Compute sum of pixels within a box; features are combinations of box sums:
    B, W: black, white regions
    Two rectangles: W-B
    Three: W1+W2-B
    Four: W1+W2-(B1+B2)
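The integral-image trick behind these box sums can be sketched as follows. This is an illustrative version, assuming the image is a plain list of rows; the function names and the side-by-side layout of the two-rectangle feature are assumptions made for the example.

```python
def integral_image(img):
    """Return ii with ii[r][c] = sum of img[0..r-1][0..c-1] (zero-padded by one row/column)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the height x width box whose top-left pixel is (top, left): four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """W - B for two equal boxes side by side: white on the left, black on the right."""
    white = box_sum(ii, top, left, height, width)
    black = box_sum(ii, top, left + width, height, width)
    return white - black

img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(img)
whole = box_sum(ii, 0, 0, 3, 4)            # sum of every pixel
feature = two_rect_feature(ii, 0, 0, 3, 2)  # left half minus right half
```

Once `ii` is built, every box sum costs four lookups regardless of box size, which is why many candidate rectangle features can be evaluated cheaply.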

9

Viola/Jones Features

Above feature classes model
  Power differences across frequencies
  Power differences across time
  Up/downward drifts of dominant frequency
  Power peaks across frequencies
  Power peaks across time

10

Viola/Jones Features

Each feature can vary in
  Frequency location: 1 to 33
  Frequency width: 1 to 33
  Time width: 1 frame (11.6ms) to 82 frames (951ms)
~25,000 total possible features

11

Feature Selection
  Need to pick M features (out of ~25,000)
  Matching song classifier composed of selected features
  Idea: use boosting to select features that yield
    Similar output when recording is a match
    Differing output when recording is not a match

12

AdaBoost Review
  Standard AdaBoost scenario: boost classification performance of a “weak” classifier, e.g., perceptron
    Apply to successively harder problems
    Tweak parameters at each classification stage
  This work: use Viola/Jones features as weak classifiers
    Find sequence of best features by boosting

13

Boosting Framework
  x1, x2: spectrogram images
  Want to find “strong” classifier H(x1, x2) =
    1 if images derive from same audio source
    -1 if images derive from different sources
  Find weak classifiers of the form
    $h_m(x_1, x_2) = \mathrm{sgn}[(f_m(x_1) - t_m)(f_m(x_2) - t_m)]$, for filter $f_m$ and threshold $t_m$
  Want matching images on same side of threshold
  Difference from AdaBoost: label assigned to pairs of images (pairwise boosting)

14

Boosting Initialization
  Given: n spectrogram image pairs: $(x_{11}, x_{21}), \ldots, (x_{1n}, x_{2n})$
  Labels for each pair: $y_i \in \{-1, +1\}$
  Initialize weights: $w_i = 1/n$

15

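Boosting Training Loop
  For m=1,…,M
    1. Select min-error classifier: $h_m = \arg\min_h \sum_i w_i \, \mathbf{1}[h(x_{1i}, x_{2i}) \ne y_i]$, with weighted error $\mathrm{err}_m$
    2. If ith image pair classified incorrectly and yi=1 (matching pair of images), adjust its weight up: $w_i \leftarrow w_i \exp(\alpha_m)$, where $\alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$
      – If yi=-1, don’t do anything
    3. Normalize the weights such that: $\sum_{i: y_i = 1} w_i = \sum_{i: y_i = -1} w_i = 1/2$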

16

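Final (Strong) Classifier
  Linear combination of weak classifiers
  Weighted by performance of each classifier:
    $H(x_1, x_2) = \mathrm{sgn}\left(\sum_{m=1}^{M} \alpha_m h_m(x_1, x_2)\right)$, where $\alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$
  Note: if $\mathrm{err}_m = 1/2$, then $\alpha_m = 0$ and classifier m does not contribute to combination
  Strong classifier apparently not used in final system, just for evaluating selected features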
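The pairwise boosting procedure from the preceding slides can be sketched end to end. This is a rough illustration, not the authors' implementation. It assumes that a weak classifier outputs +1 when two filter responses fall on the same side of a threshold, that confidence is alpha = log((1 - err) / err), and that weights are renormalized so each class carries half the total. The precomputed filter responses, the fixed threshold grid, and all function names are hypothetical.

```python
import math

def weak_predict(f1, f2, t):
    """+1 if both filter responses fall on the same side of threshold t, else -1."""
    return 1 if (f1 - t) * (f2 - t) > 0 else -1

def pairwise_boost(pairs, labels, thresholds, rounds):
    """pairs[i] = (responses_of_x1, responses_of_x2); labels[i] in {+1, -1}.

    Returns a list of (feature_index, threshold, alpha) triples.
    """
    n = len(pairs)
    w = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        # 1. Pick the (feature, threshold) with the lowest weighted error.
        best = None
        for j in range(len(pairs[0][0])):
            for t in thresholds:
                err = sum(wi for wi, (a, b), y in zip(w, pairs, labels)
                          if weak_predict(a[j], b[j], t) != y)
                if best is None or err < best[0]:
                    best = (err, j, t)
        err, j, t = best
        err = min(max(err, 1e-9), 1 - 1e-9)  # keep the log finite
        alpha = math.log((1 - err) / err)
        chosen.append((j, t, alpha))
        # 2. Raise the weight of misclassified matching pairs only.
        for i, ((a, b), y) in enumerate(zip(pairs, labels)):
            if y == 1 and weak_predict(a[j], b[j], t) != y:
                w[i] *= math.exp(alpha)
        # 3. Renormalize so matching and non-matching pairs each sum to 1/2.
        for cls in (1, -1):
            total = sum(wi for wi, y in zip(w, labels) if y == cls)
            if total > 0:
                w = [wi / (2 * total) if y == cls else wi
                     for wi, y in zip(w, labels)]
    return chosen

def strong_predict(chosen, feats1, feats2):
    """Sign of the alpha-weighted vote of the selected weak classifiers."""
    score = sum(alpha * weak_predict(feats1[j], feats2[j], t)
                for j, t, alpha in chosen)
    return 1 if score > 0 else -1

# Toy data: feature 0 separates matching from non-matching pairs, feature 1 is noise.
pairs = [
    ([0.9, 0.1], [0.8, 0.7]), ([0.1, 0.5], [0.2, 0.9]), ([0.7, 0.2], [0.9, 0.8]),
    ([0.9, 0.3], [0.1, 0.4]), ([0.2, 0.6], [0.8, 0.7]),
]
labels = [1, 1, 1, -1, -1]
chosen = pairwise_boost(pairs, labels, thresholds=[0.25, 0.5, 0.75], rounds=2)
predictions = [strong_predict(chosen, a, b) for a, b in pairs]
```

On this toy set the loop keeps selecting feature 0, since it alone puts matching pairs on the same side of the threshold and non-matching pairs on opposite sides.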

17

Differences From AdaBoost
  AdaBoost reweights all correctly learned points down and incorrect points up
  Our boosting algorithm cannot do that
    Recall, our classifier has the form $h(x_1, x_2) = \mathrm{sgn}[(f(x_1) - t)(f(x_2) - t)]$
    Let us draw a pair of non-matching spectrogram images x1, x2 at random
    Then let $p = P(f(x_1) > t) = P(f(x_2) > t)$
    But then $P(h(x_1, x_2) = 1) = p^2 + (1 - p)^2 \ge 1/2$
    Thus, $P(h(x_1, x_2) = -1) = 2p(1 - p) \le 1/2$
    Violates weak classifier criterion: correct at least ½ the time
  Solution: reweight only matching examples
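A quick numeric check of why non-matching pairs cannot be boosted. It assumes the two images of a random non-matching pair are independent and that each filter response exceeds the threshold with the same probability p; the function name is illustrative. Under those assumptions the chance that both responses land on the same side is p^2 + (1 - p)^2 = 1/2 + 2(p - 1/2)^2, which never drops below 1/2.

```python
def prob_same_side(p):
    """P(two independent responses fall on the same side) when each exceeds t with probability p."""
    return p * p + (1 - p) * (1 - p)

# Sweep p over [0, 1]; the minimum sits at p = 0.5.
grid = [k / 100 for k in range(101)]
worst = min(prob_same_side(p) for p in grid)
best_case_accuracy = 1 - worst  # highest achievable P(h = -1) on non-matching pairs
```

So no choice of filter or threshold labels random non-matching pairs as "different" more than half the time, which is consistent with reweighting only the matching examples.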

18

Occlusion Model
  Boosting classifier identifies distorted versions of the same song
  However, some parts of the recorded song might be mostly noise or interference
  Thus, need to model whether audio chunk is the song or some distraction (occlusion)

19

Occlusion Model
  Compute M weak features at each time step (11.6ms): “descriptor” (M-bit vector)
  Probability that current descriptor is caused by an occlusion depends on
    Current descriptor: $x_i$
    Whether previous descriptor was caused by occlusion: $y_{i-1}$

20

Occlusion Model Details
  Given
    n vector descriptors from recorded song’s spectrogram: $x^r_1, \ldots, x^r_n$
    Descriptors from original song: $x^o_1, \ldots, x^o_n$
    Differences between recorded and original descriptors: $x^{r-o}_1, \ldots, x^{r-o}_n$
  Find: $y_i \in \{0, 1\}$, whether ith chunk is due to distortion

21

What’s the problem?
  Have to simultaneously estimate distributions for
    $x_i^{r-o}$: data, with underlying distribution
    $y_i$: occlusion labels
  Solution:
    Model data, labels with Bernoulli distribution
    Apply Expectation Maximization (EM) algorithm

22

EM: An Aside
  Given dependent random variables:
    Observed variable x
    Latent (unobserved) variable y that generates x
  Assume probability distributions: $P_\theta(x)$, $P_\theta(y)$
    θ represents all parameters of distribution
  Repeat until convergence
    E-step: Compute “expectation” of $\log P_\theta(y, x)$:
      $\sum_y P_{\theta'}(y \mid x) \log P_\theta(y, x)$
      θ’, θ: old, new distribution parameters
    M-step: Find θ that maximizes above sum

23

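EM Derivation
  Lemma (special case of Jensen’s inequality): Let p(x), q(x) be probability distributions. Then
    $\sum_x p(x) \log p(x) \ge \sum_x p(x) \log q(x)$
  Proof: rewrite as:
    $\sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) = \log 1 = 0$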

24

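EM Derivation
  EM Theorem:
    If $\sum_y P_{\theta'}(y \mid x) \log P_\theta(y, x) > \sum_y P_{\theta'}(y \mid x) \log P_{\theta'}(y, x)$
    then $P_\theta(x) > P_{\theta'}(x)$
  Proof:
    By a lot of algebra and lemma on last slide,
      $\log P_\theta(x) - \log P_{\theta'}(x) \ge \sum_y P_{\theta'}(y \mid x) \log P_\theta(y, x) - \sum_y P_{\theta'}(y \mid x) \log P_{\theta'}(y, x)$
    So, if this quantity is positive, so is $\log P_\theta(x) - \log P_{\theta'}(x)$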

25

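EM Summary
  Repeat until convergence
    E-step: Compute “expectation” of $\log P_\theta(x, y)$:
      $\sum_y P_{\theta'}(y \mid x) \log P_\theta(x, y)$   (1)
      θ’, θ: old, new distribution parameters
    M-step: Find θ that maximizes (1)
  EM Theorem:
    If $\sum_y P_{\theta'}(y \mid x) \log P_\theta(x, y) > \sum_y P_{\theta'}(y \mid x) \log P_{\theta'}(x, y)$ then $P_\theta(x) > P_{\theta'}(x)$
  Interpretation
    As long as we can improve the “expectation” in (1), EM improves our model of observed variable x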

26

EM Discussion
  Problems with EM?
    Local maxima
    Need to bootstrap training process (pick a θ)
  When is EM most useful?
    When model distributions easy to maximize
      e.g., Gaussian mixture models
  EM is a meta-algorithm, needs to be adapted to particular application

27

Applying EM to Our Problem
  EM “score”
    $x_i^{r-o}$: data, $y_i$: labels
  Model $P(x_i^{r-o} \mid y_i)$ with 2M Bernoulli variables
    Each $x_i$ consists of M=32 weak classifier outputs
  Model $P(y_i \mid y_{i-1})$ with 2 Bernoulli variables
    2M+2=66 total parameters to estimate
  Repeat until convergence
    E-step: Compute “expectation” of $\log P_\theta(x, y)$:
      $\sum_y P_{\theta'}(y \mid x) \log P_\theta(x, y)$   (1)
      θ’, θ: old, new distribution parameters
    M-step: Find θ that maximizes (1)
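To make the E-step and M-step concrete, here is a simplified sketch of EM with Bernoulli parameters. It is not the talk's model: it drops the dependence of each label on the previous label and treats chunks as independent, which turns the problem into a two-component Bernoulli mixture. The toy data uses 4-bit difference vectors instead of M=32, and every name and starting value is an assumption.

```python
import math

def bernoulli_lik(x, q):
    """P(x | label) for independent bits with P(bit j = 1) = q[j]."""
    return math.prod(qj if bit else 1.0 - qj for bit, qj in zip(x, q))

def em_step(data, prior, q_occ, q_song, eps=1e-9):
    """One EM iteration. Returns (prior, q_occ, q_song, log-likelihood of the input parameters)."""
    # E-step: posterior probability that each chunk is an occlusion.
    resp, loglik = [], 0.0
    for x in data:
        a = prior * bernoulli_lik(x, q_occ)
        b = (1.0 - prior) * bernoulli_lik(x, q_song)
        resp.append(a / (a + b))
        loglik += math.log(a + b)
    # M-step: responsibility-weighted bit frequencies maximize the expected log-likelihood.
    n_occ = sum(resp)
    n_song = len(data) - n_occ
    clip = lambda v: min(max(v, eps), 1.0 - eps)  # keep parameters strictly inside (0, 1)
    bits = range(len(data[0]))
    new_occ = [clip(sum(r * x[j] for r, x in zip(resp, data)) / n_occ) for j in bits]
    new_song = [clip(sum((1 - r) * x[j] for r, x in zip(resp, data)) / n_song) for j in bits]
    return clip(n_occ / len(data)), new_occ, new_song, loglik

# Toy difference vectors: few flipped bits where the song is audible, many where it is occluded.
data = [
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0],
    [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0],
    [1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 0],
]
prior, q_occ, q_song = 0.5, [0.6] * 4, [0.3] * 4  # arbitrary starting point
history = []
for _ in range(20):
    prior, q_occ, q_song, loglik = em_step(data, prior, q_occ, q_song)
    history.append(loglik)
```

The values in `history` should never decrease from one iteration to the next, which is the guarantee the EM theorem gives for the likelihood of the observed data.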

28

EM For Song Matching
  Given recording $x^r$, find most likely original song $x^o$ that produced the recording
  Reject any match where EM score less than threshold T: need EM score ≥ T
    Unclear how T is determined
  So, now we can calculate the likelihood of recording snippet matching a given original song
    But, matching against entire song database too slow
    Solution: search database for near-neighbors of recording

29

Retrieval
  Calculate M-bit descriptor of each song in database at each time step (11.6ms)
    Store in hash table (descriptor → song)
  To look up song, perturb descriptor vector
    Try all flips of 1 bit, 2 bits, etc. = Hamming distance 1, 2, etc.
    Look up perturbed vectors in hash table
  Get back near-neighbor candidates
  Now, need to align: use RANSAC
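The hash-table lookup with bit-flip perturbation can be sketched like this. It is a minimal illustration: descriptors are stored as Python integers, the toy descriptors are 8 bits wide rather than M=32, and the song names and function names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_index(songs):
    """songs: {song_id: [descriptor_int, ...]} with one descriptor per time step."""
    index = defaultdict(list)
    for song_id, descriptors in songs.items():
        for t, d in enumerate(descriptors):
            index[d].append((song_id, t))
    return index

def neighbors(descriptor, num_bits, max_dist):
    """Yield every num_bits-wide value within Hamming distance max_dist of descriptor."""
    for dist in range(max_dist + 1):
        for positions in combinations(range(num_bits), dist):
            flipped = descriptor
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def lookup(index, descriptor, num_bits, max_dist=2):
    """Return (song_id, time_step) candidates whose stored descriptor is near the query."""
    hits = []
    for candidate in neighbors(descriptor, num_bits, max_dist):
        hits.extend(index.get(candidate, []))
    return hits

songs = {"song_a": [0b10110010, 0b01100111], "song_b": [0b11110000]}
index = build_index(songs)
query = 0b10110011  # song_a's first descriptor with its lowest bit flipped
exact_hits = lookup(index, query, num_bits=8, max_dist=0)
near_hits = lookup(index, query, num_bits=8, max_dist=1)
probes_for_32_bits = sum(1 for _ in neighbors(0, 32, 2))  # 1 + 32 + 496 probes
```

The number of probes grows combinatorially with the allowed Hamming distance, which is why keeping the distance small matters for fast retrieval.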

30

Another Aside: RANSAC
  Random Sample Consensus (Fischler & Bolles, 1981)
  Assumption: data to be modeled consists of
    Mostly data points matching the model (“inliers”)
    A few outliers
  Idea:
    Keep picking random samples of data points
    Eventually we will pick a set with few outliers
    Improvement: pick data points intelligently
  What is “a few” outliers?
    When selected data points are explained by model within a certain error tolerance
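A generic RANSAC loop, shown on the textbook problem of fitting a line to points with gross outliers. This illustrates the idea only and is not the alignment procedure used in the talk. The tolerance, iteration count, fixed seed, and toy points are all assumptions.

```python
import random

def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Fit y = a*x + b by RANSAC; returns (a, b, inliers) for the best sample found."""
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        # Minimal sample: two points define a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Consensus set: points explained by the candidate within the tolerance.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tolerance]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best

# Ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(2, 30), (5, -20), (8, 50)]
a, b, inliers = ransac_line(points)
```

A least-squares fit over all thirteen points would be pulled toward the outliers; the random-sample loop ignores them as soon as one sample contains two inliers.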

31

Applying RANSAC
  Given: sequence of M-bit descriptors over test recording
  Iterate over all time alignments of test recording to candidate originals
    Select alignments at random
    Compute EM score over all descriptors for each alignment
  Pick candidate original with best EM score
    Subject to the EM score threshold T
  <500 iterations usually “sufficient”

32

Experiments
  Data set: 1,861 songs from variety of genres
  First, learn “bootstrap” features and EM parameters on synthetically distorted data
    This yields basic model good enough to align training data to original
  Training data: 78 songs played, recorded using low-quality equipment
  Test data
    A: 71 songs played at low volume, recorded with distorted microphone
    B: 220 songs recorded in “very noisy” setup

33

Testing Descriptor Performance
  Test data: ~100,000 positive (+) / 1,000,000 negative (-) examples
    15-second snippets of 71 songs from test set A
  Baseline: original and improved algorithm from other authors (Haitsma & Kalker, 2002)
  Vary Hamming distance threshold to generate ROC curve

34

Varying Hamming Threshold
  For fast retrieval, need HamDist ≤ 2
  For 10-sec query, need recall of a few %
    Recall = TP/(TP+FN)
    Precision = TP/(TP+FP)
  [Table: recall rates vs. HamDist threshold]
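The two metrics, plus a rough illustration of why a per-descriptor recall of only a few percent can be enough. The counts below are hypothetical and chosen only to show the arithmetic. The 860-descriptor figure for a 10-second query is taken from the Song Retrieval slide, and treating every descriptor as an independent retrieval attempt is a simplifying assumption.

```python
def recall(tp, fn):
    """Fraction of true matches that were retrieved."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of retrieved candidates that are true matches."""
    return tp / (tp + fp)

# Hypothetical counts, not results from the talk.
tp, fn, fp = 30, 970, 10
r = recall(tp, fn)      # 30 of 1000 true matches found
p = precision(tp, fp)   # 30 of 40 retrieved candidates are correct

# With about 860 descriptors in a 10-second query, a 3% recall still gives
# roughly 26 correct hash hits to seed the alignment step.
expected_hits = 860 * r
```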

35

Song Retrieval
  10/15 seconds (860, 1290 descriptors)
  Both Test Sets A and B
  HamDist={0,1,2}

36

Interface Screenshot

37

Thank You!
  Any questions?
  References:
    Lecture notes, MIT class 6.345: Automatic Speech Recognition
    F. Jelinek, Statistical Methods for Speech Recognition, 1997
    M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Comm. of the ACM, Vol 24, pp 381-395, June 1981.
    Wikipedia
