on the use of sparse time-relative auditory codes for musicpift6080/h09/documents/... · genre...
TRANSCRIPT
![Page 1: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/1.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
On the use of sparsetime-relative auditory codes for music
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck
June 25th 2008
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 2: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/2.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
What’s wrong with conventional features?
time-frequency tradeoff
sensitive to arbitrary alignment of blocks with signal
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 3: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/3.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Sparse Auditory codes
The signal x is modelled as a linear superposition of time shiftablekernels φ that are placed at precise time locations τ with a givenscaling coefficient s. Formally:
x(t) =M∑
m=1
nm∑i=1
smi φm(t − τm
i ) + ε(t),
where m runs over the different kernels, i over the differentinstances of a given kernel and ε represents additive noise. Thelength of the kernels is variable.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 4: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/4.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Gammatones
Gammatone kernels are set according to an equivalent rectangularband (ERB) filter bank cochlear model using Slaney’s auditorytoolbox for Matlab [5], as done in [6].
0 100 200 300 400 500-0.15
-0.10
-0.05
0.00
0.05
0.10
0.15
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 5: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/5.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
C-major scale piano
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 6: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/6.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Close-up of piano sounding an A4
4.9 4.95 5 5.05 5.1 5.15 5.2 5.25 5.3−1
−0.5
0
0.5
1
4.9 4.95 5 5.05 5.1 5.15 5.2 5.25 5.30
10
20
30
40
50
60
Time (sec)
Ker
nel
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 7: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/7.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Sparse time-relative auditory codes
Sparse: notion that only a small number of things (out of alarge number of possible things) are going on at the sametime.
Time-relative: the events are located at a precise point intime.
Auditory: biologically validated from what goes on in thecochlea.
Efficient
No time-frequency trade-off or sensitivity to arbitraryalignment of signal
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 8: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/8.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Extra
Various encoding algorithms.
Possibility of either using predefined kernels (sayGammatones) or of learning kernels.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 9: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/9.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Contributions
First use of these codes on complex western commercialmusic.
Used the codes as naive input for genre classification (as goodas MFCCs)
Learn some kernels which are qualitatively different, butsomewhat disappointing.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 10: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/10.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Encoding Tzanetakis
# of kernels 103 104 2.5× 104 5× 104
32 2.82 (0.97) 8.57 (2.17) 12.69 (2.87) 16.60 (2.88)
64 3.02 (1.03) 9.07 (2.29) 13.44 (2.98) 17.43 (2,70)
128 3.08 (1.04) 9.24 (2.31) 13.76 (2.99) 17.76 (2.52)
Table: Mean (standard deviation) of the SNR (dB) obtained over 100songs (ten from each genre), as a function of the number ofGammatones in the codebook and of the maximum number of spikesallowed to reach a maximal value of 20 dB SNR.
64 Gammatones, ERB 20Hz-Nyquist, 20db SNR
About 1 hour to encode a 30 second segment!!!
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 11: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/11.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Encoding Tzanetakis - basic genre statistics
classical disco jazz pop rock blues countryhiphop metal reggae10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
classical disco jazz pop rock blues countryhiphop metal reggae0
5000
10000
15000
20000
25000
30000
Figure: Top: box-and-whisker plot of the number of spikes (out of a maximum of 105) required to encodesongs from various genres up to a SNR of 20dB. The red line represents the median and the box extends from thelower quartile to the upper one. Bottom: Bar plot of the mean number of kernels used per song depending on thegenre. The 64 kernels are split into eight bins going from the lowest frequency Gammatones (left) to the highestones (right).
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 12: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/12.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Genre Recognition on Tzanetakis
Features: basic statistics on the spikegrams for windows of 5seconds.
Results similar to those obtained with MFCCs.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 13: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/13.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Learned kernels
0.0 1500.0
0
0.0 2000.0
1
0.0 350.0
2
0.0 200.0
3
0.0 800.0
4
0.0 1600.0
5
0.0 800.0
6
0.0 100.0
7
0.0 500.0
8
0.0 1400.0
9
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 14: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/14.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Encoding complex western musicGenre recognition resultsLearning kernels
Learned kernels - results
kernel 103 104 2.5× 104 5× 104
gammatones 2.82 (0.97) 8.57 (2.17) 12.69 (2.87) 16.60 (2.88)
learned 1.86 (0.57) 6.06 (1.49) 8.86 (2.00) 11.53 (2,37)
Table: Mean (standard deviation) of the SNR (dB) obtained over 100songs (ten from each genre) with either 32 Gammatones or 32 learnedkernels in the codebook.
Encoding not as good.
Genre recognition results not as good.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 15: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/15.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Future work
Three specific issues that will need to be dealt with:
How to put the features in a usable form?
Learning kernels.
Computational issue.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 16: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/16.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
References
T. Blumensath and M. Davies.
Sparse and shift-invariant representations of music.IEEE Transactions on Speech and Audio Processing, 2006.
G. Davis, S. Mallat, and M. Avellaneda.
Adaptive greedy approximations.Constructive approximation, 13(1):57–98, 2004.
S.G. Mallat and Zhifeng Zhang.
Matching pursuits with time-frequency dictionaries.Signal Processing, IEEE Transactions on, 41(12):3397–3415, December 1993.
M. Plumbley, S. Abdallah, T. Blumensath, and M. Davies.
Sparse representations of polyphonic music.Signal Processing, 86(3):417–431, March 2006.
M. Slaney.
Auditory toolbox, 1998.(Tech. Rep. No. 1998-010). Palo Alto, CA:Interval Research Corporation.
E. Smith and M. S. Lewicki.
Efficient coding of time-relative structure using spikes.Neural Computation, 17(1):19–45, 2005.
E. Smith and M. S. Lewicki.
Efficient auditory coding.Nature, 439:978–982, February 2006.
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google
![Page 17: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f32e8bace0604000a3fd151/html5/thumbnails/17.jpg)
Smith and Lewicki’s sparse auditory codesOur ISMIR paper
Outlook
Questions?
Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google