TRANSCRIPT
Computer Vision for Music Identification
Computer Vision and Pattern Recognition (CVPR) 2005
Yan Ke, Derek Hoiem, and Rahul Sukthankar, CMU
Presented by Eugene Weinstein, April 4th, 2006
2
Music Identification
- Scenario:
  - User records a few seconds of audio
  - Need to match it to a database of songs
- Recording could be distorted due to
  - Noise: background, crosstalk, etc.
  - Transmission over limited channels (e.g., cell phone)
- Recording can come from any point of song
  - Must align to correct position in reference recording
- Practical issues: must support >100,000 songs
3
Main Contributions
- Novel computer vision approach to an audio task
- Pairwise variant of boosting
- Functional music identification system with state-of-the-art performance
4
Background Topics To Be Covered
- Spectrograms
- Viola/Jones image features
- Boosting/AdaBoost
- Expectation Maximization (EM)
- Random Sample Consensus (RANSAC)
5
Spectrogram
- Graphical representation of frequency content of sound
- Based on short-time Fourier transform
  - Extract frequency content over a time window
  - Plot time against frequency, with “density” (shading) showing power
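The short-time Fourier transform described above can be sketched in a few lines. This is a minimal illustration, not the authors' front end: the frame length, hop size, and Hann window are assumptions chosen for the example, and `spectrogram` is a hypothetical helper name.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames x frequency bins) array of short-time power."""
    window = np.hanning(frame_len)               # assumed window choice
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies of each real-valued frame.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Usage: one second of a pure 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))

# Convert the strongest average bin back to a frequency in Hz.
peak_hz = spec.mean(axis=0).argmax() * sample_rate / 256
```

Each row of `spec` is one time window and each column one frequency band, which is the image the later rectangle features operate on.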
6
Another Example
- Spectrogram of the utterance “The ship was torn apart on the sharp (reef)”
7
Spectrograms of Music
- Identify music snippets by matching spectrogram of test recording to reference recording
- Comparing by correlation is inaccurate and slow
- Solution: match based on simple features
8
Viola/Jones Features
- Use rectangle features instead of pixels
  - Can compute efficiently with integral image
- Compute sum of pixels within a box; features are combinations of box sums:
  - B, W: black, white regions
  - Two rectangles: W-B
  - Three: W1+W2-B
  - Four: W1+W2-(B1+B2)
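The box sums above can be computed in constant time once an integral image is available. The sketch below shows the idea for the two-rectangle case; the function names and the layout of the white and black boxes are illustrative assumptions.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums, padded with a zero row and column so box sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """W - B for a white box stacked directly above a black box of equal size."""
    white = box_sum(ii, top, left, height, width)
    black = box_sum(ii, top + height, left, height, width)
    return white - black

# Usage on a small 4x4 "spectrogram" holding the values 0..15.
img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
center = box_sum(ii, 1, 1, 2, 2)
feature = two_rect_feature(ii, 0, 0, 2, 4)
```

The three- and four-rectangle features follow the same pattern with more `box_sum` calls.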
9
Viola/Jones Features
- Above feature classes model
  - Power differences across frequencies
  - Power differences across time
  - Up/downward drifts of dominant frequency
  - Power peaks across frequencies
  - Power peaks across time
10
Viola/Jones Features
- Each feature can vary in
  - Frequency location: 1 to 33
  - Frequency width: 1 to 33
  - Time width: 1 frame (11.6 ms) to 82 frames (951 ms)
- ~25,000 total possible features
11
Feature Selection
- Need to pick M features (out of ~25,000)
  - Matching song classifier composed of selected features
- Idea: use boosting to select features that yield
  - Similar output when recording is a match
  - Differing output when recording is not a match
12
AdaBoost Review
- Standard AdaBoost scenario: boost classification performance of a “weak” classifier, e.g., perceptron
  - Apply to successively harder problems
  - Tweak parameters at each classification stage
- This work: use Viola/Jones features as weak classifiers
  - Find sequence of best features by boosting
13
Boosting Framework
- x1, x2: spectrogram images
- Want to find “strong” classifier H(x1, x2) =
  - 1 if images derive from same audio source
  - -1 if they derive from different sources
- Find weak classifiers of the form
- Want matching images on same side of threshold
- Difference from AdaBoost: label assigned to pairs of images (pairwise boosting)
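One weak classifier consistent with the description above (matching images land on the same side of a threshold) can be written as follows. The symbols $f_m$ (a single rectangle feature) and $t_m$ (its threshold) are assumed notation introduced for illustration.

```latex
h_m(x_1, x_2) = \operatorname{sgn}\big[\,(f_m(x_1) - t_m)\,(f_m(x_2) - t_m)\,\big]
```

The product is positive exactly when both feature responses fall on the same side of $t_m$, so $h_m = 1$ predicts a matching pair and $h_m = -1$ a non-matching pair.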
14
Boosting Initialization
Given: n spectrogram image pairs:
Labels for each pair:
Initialize weights:
15
Boosting Training Loop
For m=1,…,M
1. Select min-error classifier
2. If ith image pair classified incorrectly and yi=1 (matching pair of images), adjust its weight up:
   - If yi=-1, don’t do anything
3. Normalize the weights such that:
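The loop above can be sketched as follows. Several details are assumptions made only so the example runs: uniform initial weights, an AdaBoost-style confidence `alpha`, normalization so that the weights sum to 1, and a small fixed set of candidate thresholds. The function `pairwise_boost` and its arguments are hypothetical names.

```python
import numpy as np

def pairwise_boost(f1, f2, y, thresholds, n_rounds):
    """f1, f2: (n_pairs, n_features) feature responses for each image of a pair.
    y: labels, +1 for matching pairs and -1 for non-matching pairs.
    Returns a list of (feature index, threshold, confidence)."""
    n_pairs, n_features = f1.shape
    w = np.full(n_pairs, 1.0 / n_pairs)           # assumed: uniform start
    chosen = []
    for _ in range(n_rounds):
        best = None
        for j in range(n_features):
            for t in thresholds:
                # +1 when both images fall on the same side of t, else -1.
                h = np.where((f1[:, j] > t) == (f2[:, j] > t), 1, -1)
                err = w[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, h)
        err, j, t, h = best
        err = min(max(err, 1e-9), 1 - 1e-9)       # avoid log(0)
        alpha = np.log((1 - err) / err)           # assumed: AdaBoost-style confidence
        # Only misclassified MATCHING pairs are reweighted upward.
        w[(h != y) & (y == 1)] *= np.exp(alpha)
        w /= w.sum()                              # assumed: weights sum to 1
        chosen.append((j, t, alpha))
    return chosen

# Usage: feature 0 survives distortion on matching pairs, features 1-2 do not.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 3))
y = np.where(np.arange(200) < 100, 1, -1)
f1 = clean.copy()
f2 = rng.normal(size=(200, 3))
f2[:100, 0] = clean[:100, 0] + 0.05 * rng.normal(size=100)
selected = pairwise_boost(f1, f2, y, thresholds=[-0.5, 0.0, 0.5], n_rounds=2)
```

In this synthetic setup the first selected feature should be the one that stays stable under distortion, which is the behavior feature selection is after.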
16
Final (Strong) Classifier
- Linear combination of weak classifiers
  - Weighted by performance of each classifier
- Note, if , classifier t does not contribute to combination
- Strong classifier apparently not used in final system, just for evaluating selected features
17
Differences From AdaBoost
- AdaBoost reweights all correctly learned points down and incorrect points up
- Our boosting algorithm cannot do that
  - Recall, our classifier has the form
  - Let us draw a pair of non-matching spectrogram images x1, x2 at random
  - Then let
  - But then
  - Thus,
  - Violates weak classifier criterion: correct at least ½ the time
- Solution: reweight only matching examples
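The argument on this slide can be worked through under an assumed weak-classifier form, $h(x_1,x_2)=\operatorname{sgn}[(f(x_1)-t)(f(x_2)-t)]$, where $f$ and $t$ are illustrative symbols for a feature and its threshold. A sketch, assuming the two images of a non-matching pair are independent draws:

```latex
\begin{aligned}
p &= P\big(f(x) > t\big) \\
P\big(h(x_1, x_2) = 1\big) &= p^2 + (1-p)^2 = 1 - 2p(1-p) \;\ge\; \tfrac{1}{2} \\
P\big(h(x_1, x_2) = -1\big) &\le \tfrac{1}{2}
\end{aligned}
```

Since $p(1-p) \le \tfrac14$ for every $p$, no threshold makes such a classifier right on non-matching pairs more than half the time, which motivates reweighting only the matching pairs.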
18
Occlusion Model
- Boosting classifier identifies distorted versions of the same song
- However, some parts of the recorded song might be mostly noise or interference
- Thus, need to model whether audio chunk is the song or some distraction (occlusion)
19
Occlusion Model
- Compute M weak features at each time step (11.6 ms): “descriptor” (M-bit vector)
- Probability that current descriptor is caused by an occlusion depends on
  - Current descriptor: xi
  - Whether previous descriptor was caused by occlusion: yi-1
20
Occlusion Model Details
- Given
  - n vector descriptors from recorded song’s spectrogram:
  - Descriptors from original song:
  - Differences between recorded and original descriptors:
- Find: yi ∈ {0,1}, whether ith chunk is due to distortion
21
What’s the problem?
- Have to simultaneously estimate distributions for
  - $x_i^{r-o}$: data, with underlying distribution
  - $y_i$: occlusion labels
- Solution:
  - Model data, labels with Bernoulli distribution
  - Apply Expectation Maximization (EM) algorithm
22
EM: An Aside
- Given dependent random variables:
  - Observed variable x
  - Latent (unobserved) variable y that generates x
- Assume probability distributions: Pθ(x), Pθ(y)
  - θ represents all parameters of distribution
- Repeat until convergence
  - E-step: Compute “expectation” of log Pθ(y,x)
    - θ’, θ: old, new distribution parameters
  - M-step: Find θ that maximizes above sum
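The sum that the M-step maximizes is conventionally written as the auxiliary function below. This is the textbook form of the E-step quantity, given here as an assumption about the intended formula; the name $Q$ is introduced for illustration.

```latex
\begin{aligned}
\text{E-step:}\quad & Q(\theta', \theta) = \sum_{y} P_{\theta'}(y \mid x)\, \log P_{\theta}(x, y) \\
\text{M-step:}\quad & \theta^{\text{new}} = \arg\max_{\theta}\; Q(\theta', \theta)
\end{aligned}
```

The posterior over the latent variable is computed with the old parameters $\theta'$, while the joint log-probability inside the sum is evaluated at the candidate parameters $\theta$.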
23
EM Derivation
- Lemma (special case of Jensen’s inequality): Let p(x), q(x) be probability distributions. Then
- Proof: rewrite as:
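A standard statement of this lemma, assumed here to be the one intended, together with the rewriting used in its proof:

```latex
\begin{aligned}
\text{Lemma:}\quad & \sum_x p(x) \log p(x) \;\ge\; \sum_x p(x) \log q(x) \\
\text{Proof:}\quad & \sum_x p(x) \log \frac{q(x)}{p(x)}
  \;\le\; \log \sum_x p(x)\, \frac{q(x)}{p(x)}
  \;=\; \log \sum_x q(x) \;=\; \log 1 \;=\; 0
\end{aligned}
```

The inequality in the proof is Jensen's inequality applied to the concave function $\log$.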
24
EM Derivation
- EM Theorem:
  - If then
- Proof:
  - By a lot of algebra and lemma on last slide,
  - So, if this quantity is positive, so is
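A common textbook form of the EM theorem and of the inequality behind it is sketched below. It assumes the auxiliary sum $\sum_y P_{\theta'}(y \mid x) \log P_\theta(x,y)$ is the quantity being improved; the notation is illustrative.

```latex
\begin{aligned}
&\text{If}\quad \sum_y P_{\theta'}(y \mid x) \log P_{\theta}(x, y)
   \;>\; \sum_y P_{\theta'}(y \mid x) \log P_{\theta'}(x, y)
   \quad\text{then}\quad P_{\theta}(x) > P_{\theta'}(x). \\[6pt]
&\log P_{\theta}(x) - \log P_{\theta'}(x) \\
&\quad= \sum_y P_{\theta'}(y \mid x) \log \frac{P_{\theta}(x, y)}{P_{\theta'}(x, y)}
   \;+\; \sum_y P_{\theta'}(y \mid x) \log \frac{P_{\theta'}(y \mid x)}{P_{\theta}(y \mid x)} \\
&\quad\ge \sum_y P_{\theta'}(y \mid x) \log P_{\theta}(x, y)
   \;-\; \sum_y P_{\theta'}(y \mid x) \log P_{\theta'}(x, y)
\end{aligned}
```

The second sum in the middle line is non-negative by the lemma (with $p = P_{\theta'}(\cdot \mid x)$ and $q = P_{\theta}(\cdot \mid x)$), so a positive gain in the auxiliary sum forces a positive gain in $\log P_\theta(x)$.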
25
EM Summary
- Repeat until convergence
  - E-step: Compute “expectation” of log Pθ(x,y)
    - θ’, θ: old, new distribution parameters
  - M-step: Find θ that maximizes (1)
- EM Theorem: If then
- Interpretation
  - As long as we can improve the “expectation” in (1), EM improves our model of observed variable x
26
EM Discussion
- Problems with EM?
  - Local maxima
  - Need to bootstrap training process (pick a θ)
- When is EM most useful?
  - When model distributions easy to maximize
    - e.g., Gaussian mixture models
- EM is a meta-algorithm, needs to be adapted to particular application
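As one concrete adaptation, the sketch below runs EM on a mixture of two independent-Bernoulli components, a case where the M-step has a closed form. It is a generic illustration rather than the model used in the talk; the function name, the random initialization, and the clipping constant are assumptions.

```python
import numpy as np

def em_bernoulli_mixture(x, n_iters=50, seed=0):
    """x: (n, d) binary array. Fits a 2-component mixture of independent Bernoullis.
    Returns (mixing weights, per-component bit probabilities, log-likelihood history)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.array([0.5, 0.5])
    mu = rng.uniform(0.25, 0.75, size=(2, d))     # the "bootstrap" choice of theta
    history = []
    for _ in range(n_iters):
        # E-step: posterior P(y = k | x_i) under the current parameters.
        log_joint = (np.log(pi)
                     + x @ np.log(mu).T
                     + (1 - x) @ np.log(1 - mu).T)          # shape (n, 2)
        log_px = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        resp = np.exp(log_joint - log_px[:, None])
        history.append(log_px.sum())
        # M-step: closed-form maximizer of the expected complete-data log-likelihood.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = np.clip((resp.T @ x) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, mu, history

# Usage: 300 samples of 8 bits drawn from two clusters (bit probability 0.9 or 0.1).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=300)
probs = np.where(labels[:, None] == 1, 0.9, 0.1)
x = (rng.random((300, 8)) < probs).astype(float)
pi, mu, history = em_bernoulli_mixture(x)
```

The `history` list makes the EM theorem visible: the log-likelihood of the observed data does not decrease from one iteration to the next, although the result still depends on the starting `mu`.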
27
Applying EM to Our Problem
- EM “score”:
  - $x_i^{r-o}$: data, $y_i$: labels
- Model $P(x_i^{r-o} \mid y_i)$ with 2M Bernoulli variables
  - Each $x_i$ consists of M=32 weak classifier outputs
- Model $P(y_i \mid y_{i-1})$ with 2 Bernoulli variables
  - 2M+2=66 total parameters to estimate
- Repeat until convergence
  - E-step: Compute “expectation” of log Pθ(x,y)
    - θ’, θ: old, new distribution parameters
  - M-step: Find θ that maximizes (1)
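To make the parameter count concrete, the sketch below evaluates the likelihood of a sequence of difference descriptors under a two-state chain with Bernoulli emissions, summing over all label sequences with the forward algorithm. It is an illustration under stated assumptions, not the scoring used in the system: the uniform start distribution, the parameter values, and the function name are all made up for the example.

```python
import numpy as np

def sequence_log_likelihood(diff, flip_prob, stay_prob, start=(0.5, 0.5)):
    """diff: (n, M) binary array, 1 where recorded and original descriptor bits differ.
    flip_prob: (2, M), P(bit differs | y) for y = 0 (song) and y = 1 (occlusion).
    stay_prob: (2,), P(y_i = y_{i-1}) for each previous label.
    Returns log P(diff), summed over all label sequences (forward algorithm)."""
    log_emit = diff @ np.log(flip_prob).T + (1 - diff) @ np.log(1 - flip_prob).T
    trans = np.array([[stay_prob[0], 1 - stay_prob[0]],
                      [1 - stay_prob[1], stay_prob[1]]])
    log_alpha = np.log(start) + log_emit[0]        # assumed uniform start
    for t in range(1, len(diff)):
        # Log-sum-exp over the previous label for each current label.
        log_alpha = log_emit[t] + np.logaddexp.reduce(
            log_alpha[:, None] + np.log(trans), axis=0)
    return np.logaddexp.reduce(log_alpha)

# Usage with M = 32 bits per descriptor and illustrative parameter values.
M = 32
flip_prob = np.vstack([np.full(M, 0.1), np.full(M, 0.5)])
stay_prob = np.array([0.95, 0.90])
n_parameters = flip_prob.size + stay_prob.size     # 2M + 2

clean = np.zeros((20, M))                          # recording agrees with original
noisy = np.tile(np.arange(M) % 2, (20, 1))         # half of the bits disagree
score_clean = sequence_log_likelihood(clean, flip_prob, stay_prob)
score_noisy = sequence_log_likelihood(noisy, flip_prob, stay_prob)
```

A recording that agrees with the candidate original on most bits scores higher than one that disagrees on half of them, which is the property a match score needs.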
28
EM For Song Matching
- Given recording xr, find most likely original song xo that produced the recording
- Reject any match where EM score less than threshold T: need
  - Unclear how T is determined
- So, now we can calculate the likelihood of recording snippet matching a given original song
  - But, matching against entire song database too slow
  - Solution: search database for near-neighbors of recording
29
Retrieval
- Calculate M-bit descriptor of each song in database at each time step (11.6 ms)
  - Store in hash table (descriptor → song)
- To look up song, perturb descriptor vector
  - Try all flips of 1 bit, 2 bits, etc. = Hamming distance 1, 2, etc.
  - Look up perturbed vectors in hash table
  - Get back near-neighbor candidates
- Now, need to align: use RANSAC
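The lookup described above can be sketched with an ordinary dictionary. Descriptors are stored as Python integers, and storing the time offset alongside the song id is an assumption made so that candidates can later be aligned; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_index(songs):
    """songs: {song_id: [descriptor, ...]}, one integer descriptor per time step."""
    index = defaultdict(list)
    for song_id, descriptors in songs.items():
        for offset, d in enumerate(descriptors):
            index[d].append((song_id, offset))
    return index

def neighbors(descriptor, n_bits, max_dist):
    """Yield every n_bits-wide value within Hamming distance max_dist."""
    for dist in range(max_dist + 1):
        for bits in combinations(range(n_bits), dist):
            flipped = descriptor
            for b in bits:
                flipped ^= 1 << b          # flip one chosen bit
            yield flipped

def lookup(index, descriptor, n_bits=32, max_dist=2):
    """Return (song_id, offset) candidates for one query descriptor."""
    hits = []
    for probe in neighbors(descriptor, n_bits, max_dist):
        hits.extend(index.get(probe, []))
    return hits

# Usage with toy 4-bit descriptors.
index = build_index({"song_a": [0b1011, 0b0110], "song_b": [0b1111]})
near = lookup(index, 0b1010, n_bits=4, max_dist=1)
wider = sorted(lookup(index, 0b1010, n_bits=4, max_dist=2))
probes_32_bit = sum(1 for _ in neighbors(0, 32, 2))
```

For 32-bit descriptors a Hamming radius of 2 already means 1 + 32 + 496 = 529 hash probes per query descriptor, which is why the radius has to stay small for fast retrieval.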
30
Another Aside: RANSAC
- Random Sample Consensus (Fischler & Bolles, 1981)
- Assumption: data to be modeled consists of
  - Mostly data points matching the model (“inliers”)
  - A few outliers
- Idea:
  - Keep picking random samples of data points
  - Eventually we will pick a set with few outliers
  - Improvement: pick data points intelligently
- What is “a few” outliers?
  - When selected data points are explained by model within a certain error tolerance
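The idea can be illustrated on the classic toy problem of fitting a line to points that include gross outliers. This is a generic RANSAC sketch, not the alignment procedure from the talk; the tolerance, iteration count, and function name are assumptions.

```python
import random

def ransac_line(points, n_iters=200, tol=0.5, seed=0):
    """Fit y = a*x + b to points containing outliers. Returns (a, b, inliers)."""
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal sample for a line
        if x1 == x2:
            continue                                  # vertical pair, skip
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Consensus set: points explained by this model within the tolerance.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best

# Usage: 20 points on y = 2x + 1 plus 5 gross outliers.
points = [(x, 2 * x + 1) for x in range(20)]
points += [(3, 40), (7, -15), (11, 60), (15, -30), (18, 90)]
a, b, inliers = ransac_line(points)
```

A least-squares fit over all 25 points would be pulled toward the outliers, whereas the consensus step recovers the line supported by the 20 inliers.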
31
Applying RANSAC
- Given: sequence of M-bit descriptors over test recording
- Iterate over all time alignments of test recording to candidate originals
  - Select alignments at random
  - Compute EM score over all descriptors for each alignment
- Pick candidate original with best EM score
  - Subject to
- <500 iterations usually “sufficient”
32
Experiments
- Data set: 1,861 songs from variety of genres
- First, learn “bootstrap” features and EM parameters on synthetically distorted data
  - This yields basic model good enough to align training data to original
- Training data: 78 songs played, recorded using low-quality equipment
- Test data
  - A: 71 songs played at low volume, recorded with distorted microphone
  - B: 220 songs recorded in “very noisy” setup
33
Testing Descriptor Performance
- Test data: ~100,000 (+) / 1,000,000 (-) examples
  - 15-second snippets of 71 songs from test set A
- Baseline: original and improved algorithm from other authors (Haitsma & Kalker, 2002)
- Vary Hamming distance threshold to generate ROC curve
34
Varying Hamming Threshold
- For fast retrieval, need HamDist ≤ 2
- For 10-sec query, need recall of a few %
  - Recall = TP/(TP+FN)
  - Precision = TP/(TP+FP)
- Recall rates table vs. HamDist Threshold
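The recall and precision definitions above can be applied while sweeping the Hamming threshold, as in the sketch below. The toy descriptor pairs are made up for illustration and are not results from the talk; `sweep_threshold` is a hypothetical helper.

```python
def hamming(a, b):
    """Number of differing bits between two integer descriptors."""
    return bin(a ^ b).count("1")

def sweep_threshold(pairs, labels, max_dist):
    """pairs: [(descriptor_a, descriptor_b)], labels: True for matching pairs.
    A pair is called a match when its Hamming distance is <= threshold.
    Returns [(threshold, precision, recall)]."""
    rows = []
    for threshold in range(max_dist + 1):
        tp = fp = fn = 0
        for (a, b), is_match in zip(pairs, labels):
            predicted = hamming(a, b) <= threshold
            if predicted and is_match:
                tp += 1
            elif predicted:
                fp += 1
            elif is_match:
                fn += 1
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 1.0
        rows.append((threshold, precision, recall))
    return rows

# Usage: three matching pairs (distances 0, 1, 3) and two non-matching (2, 4).
pairs = [(0b0000, 0b0000), (0b0000, 0b0001), (0b0000, 0b0111),
         (0b0000, 0b0011), (0b0000, 0b1111)]
labels = [True, True, True, False, False]
rows = sweep_threshold(pairs, labels, max_dist=2)
```

Raising the threshold can only keep recall the same or increase it, while precision may drop as non-matching pairs start to fall inside the radius.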
35
Song Retrieval
- 10/15 seconds (860, 1290 descriptors)
- Both Test Sets A and B
- HamDist={0,1,2}
36
Interface Screenshot
37
Thank You! Any questions?
References:
- Lecture notes, MIT class 6.345: Automatic Speech Recognition
- F. Jelinek, Statistical Methods for Speech Recognition, 1997
- M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Comm. of the ACM, Vol. 24, pp. 381-395, June 1981.
- Wikipedia