Joint Beat & Tatum Tracking from Music ?· Joint Beat & Tatum Tracking from Music Signals Jarno Seppanen…

Download Joint Beat & Tatum Tracking from Music ?· Joint Beat & Tatum Tracking from Music Signals Jarno Seppanen…

Post on 07-Sep-2018




0 download


<ul><li><p>Joint Beat &amp; Tatum Tracking from Music Signals</p><p>Jarno Seppanen Antti Eronen Jarmo HiipakkaNokia Research Center</p><p>P.O.Box 407, FI-00045 NOKIA GROUP, Finland{jarno.seppanen, antti.eronen, jarmo.hiipakka}</p><p>AbstractThis paper presents a method for extracting two key met-rical properties, the beat and the tatum, from acoustic sig-nals of popular music. The method is computationally veryefficient while performing comparably to earlier methods.High efficiency is achieved through multirate accent analy-sis, discrete cosine transform periodicity analysis, and phaseestimation by adaptive comb filtering. During analysis, themusic signals are first represented in terms of accentuationon four frequency subbands, and then the accent signals aretransformed into periodicity domain. Beat and tatum peri-ods and phases are estimated in a probabilistic setting, incor-porating primitive musicological knowledge of beattatumrelations, the prior distributions, and the temporal continu-ities of beats and tatums. In an evaluation with 192 songs,the beat tracking accuracy of the proposed method was foundcomparable to the state of the art. Complexity evaluationshowed that the computational cost is less than 1% of earliermethods. The authors have written a real-time implementa-tion of the method for the S60 smartphone platform.</p><p>Keywords: Beat tracking, music meter estimation, rhythmanalysis.</p><p>1. IntroductionRecent years have brought significant advances in the fieldof automatic music signal analysis, and music meter estima-tion is no exception. In general, the music meter containsa nested grouping of pulses called metrical levels, wherepulses on higher levels are subsets of the lower level pulses;the most salient level is known as the beat, and the lowestlevel is termed the tatum [1, p. 21].</p><p>Metrical analysis of music signals has many applicationsranging from browsing and visualization to classificationand recommendation of music. The state of the art has ad-vanced high in performance, but the computational require-ments have also remained restrictively high. The proposedmethod significantly improves computational efficiency whilemaintaining satisfactory performance.</p><p>Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copiesare not made or distributed for profit or commercial advantage and thatcopies bear this notice and the full citation on the first page.c 2006 University of Victoria</p><p>The technical approaches for meter estimation are vari-ous, including e.g. autocorrelation based methods [6], inter-onset interval histogramming [5], or banks of comb filterresonators [4], possibly followed by a probabilistic model [3].See [2] for a review on rhythm analysis systems.</p><p>2. Algorithm DescriptionThe algorithm overview is presented in Fig. 1: the input isaudio signals of polyphonic music, and the output consistsof the times of beats and tatums. The implementation of thebeat and tatum tracker has been done in C++ programminglanguage in the S60 smartphone platform. The algorithmdesign is causal and the implementation works in real time.</p><p>The operation of the system can be described in six stages(see Fig. 1):</p><p>1. Resampling stage,</p><p>2. Accent filter bank stage,</p><p>3. Buffering stage,</p><p>4. Periodicity estimation stage,</p><p>5. Period estimation stage, and</p><p>6. Phase estimation stage.</p><p>First, the signal is resampled to a fixed sample rate, tosupport arbitrary input sample rates. Second, the accentfilter bank transforms the acoustic signal of music into aform that is suitable for beat and tatum analysis. In thisstage, subband accent signals are generated, which consti-tute an estimate of the perceived accentuation on each sub-band. The accent filter bank stage significantly reduces theamount of data.</p><p>Then, the accent signals are accumulated into four-secondframes. Periodicity estimation looks for repeating accentson each subband. The subband periodicities are then com-bined, and summary periodicity is computed.</p><p>Next, the most likely beat and tatum periods are esti-mated from each periodicity frame. This uses a probabilisticformulation of primitive musicological knowledge, includ-ing the relation, the prior distribution, and the temporal con-tinuity of beats and tatums. Finally, the beat phase is foundand beat and tatum times are positioned. The accent signalis filtered with a pair of comb filters, which adapt to differentbeat period estimates.</p></li><li><p>Phase estimation</p><p>Period estimation</p><p>Accentfilter bank Buffering</p><p>Periodicity estimation</p><p>Audio signal</p><p>Subbandaccent signals</p><p>Accent frames</p><p>Summary periodicity</p><p>Beat and tatum </p><p>periods</p><p>Beat and tatum times</p><p>Resampling</p><p>Audio signal</p><p>Figure 1. Beat and tatum analyzer.</p><p>x2 LPF</p><p>(a)</p><p>DiffComp Rect 0.8</p><p>0.2</p><p>(b)</p><p>M1QMF2 6 kHz</p><p>QMF</p><p>QMF2 1,5 kHz</p><p>QMF</p><p>QMF2</p><p>QMF</p><p>375 Hz</p><p>24 kHz</p><p>375 Hz</p><p>(c)</p><p>125 Hz 125 Hz</p><p>a2</p><p>a1</p><p>a3</p><p>a4</p><p>x</p><p>125 Hz</p><p>125 Hz</p><p>125 Hz(b) (c)</p><p>(b) (c)</p><p>(b) (c)</p><p>Figure 2. Accent filter bank overview. (a) The audio signal is first divided into subbands, then (b) power estimates on each subbandare calculated, and (c) accent computation is performed on the subband power signals.</p><p>2.1. ResamplingBefore any audio analysis takes place, the signal is con-verted to a 24 kHz sample rate. This is required becausethe filter bank uses fixed frequency regions. The resamplingcan be done with a relatively low-quality algorithm, linearinterpolation, because high fidelity is not required for suc-cessful beat and tatum analysis.</p><p>2.2. Accent Filter BankFigure 2 presents an overview of the accent filter bank. The in-coming audio signal x[n] is (a) first divided into subband au-dio signals, and (b) a power estimate signal is calculated foreach band separately. Last, (c) an accent signal is computedfor each subband.</p><p>The filter bank divides the acoustic signal into seven fre-quency bands by means of six cascaded decimating quadra-ture mirror filters (QMF). The QMF subband signals arecombined pairwise into three two-octave subband signals,as shown in Fig. 2(a). When combining two consecutivebranches, the signal from the higher branch is decimatedwithout filtering. However, the error caused by the alias-ing produced in this operation is negligible for the proposedmethod. The sampling rate decreases by four between suc-cessive bands due to the two QMF analysis stages and theextra decimation step. As a result, the frequency bands arelocated at 0190 Hz, 190750 Hz, 7503000 Hz, and 312 kHz, when the filter bank input is at 24 kHz.</p><p>There is a very efficient structure that can be used to im-</p><p>plement the downsampling QMF analysis with just two all-pass filters, an addition, and a subtraction. This structure isdepicted in Fig. 5.2-5 in [7, p. 203]. The allpass filters forthis application can be first-order filters, because only mod-est separation is required between bands.</p><p>The subband power computation is shown Fig. 2(b). Theaudio signal is squared, low-pass filtered (LPF), and dec-imated by subband specific factor Mi to get the subbandpower signal. The low-pass filter is a digital filter having10 Hz cutoff frequency. The subband decimation ratios Mi ={48, 12, 3, 3} have been chosen so that the power signal sam-ple rate is 125 Hz on all subbands.</p><p>The subband accent signal computation in Fig. 2(c) ismodelled according to Klapuri et al. [3, p. 344345]. In theprocess, the power signal first is mapped with a nonlinearlevel compression function labeled Comp in Fig. 2(c),</p><p>f(x) ={</p><p>5.213 ln(1 + 10</p><p>x), x &gt; 0.00015.213 ln 1.1 otherwise. (1)</p><p>Following compression, the first-order difference signal iscomputed (Diff ) and half-wave rectified (Rect). In accor-dance with Eq. (3) in [3], the rectified signal is summed tothe power signal after constant weighting, see Fig. 2(c). Thehigh computational efficiency of the proposed method liesmostly in the accent filter bank design. In addition to effi-ciency, the resulting accent signals are comparable to thoseof Klapuri et al., see e.g. Fig. 3 in [3].</p></li><li><p>0 0.5 1 1.5 2 2.5 3 3.5 40</p><p>0.01</p><p>0.02 BT</p><p>(a)</p><p>Time lag [s]</p><p>1 2 3 4 5 6 7 8 9 100</p><p>0.5</p><p>1B T</p><p>(b)</p><p>Frequency [Hz]</p><p>Figure 3. (a) Normalized autocorrelation and (b) summary pe-riodicity, with beat (B) and tatum (T) periods shown.</p><p>2.3. BufferingThe buffering stage implements a ring buffer which accu-mulates the signal into fixed-length frames. The incomingsignal is split into consecutive accent signal frames of a fixedlength N = 512 (4.1 seconds). The value of N can be mod-ified to choose a different performancelatency tradeoff.</p><p>2.4. Accent Periodicity EstimationThe accent signals are analyzed for intrinsic repetitions. Here,periodicity is defined as the combined strength of accentsthat repeat with a given period. For all subband accent sig-nals, a joint summary periodicity vector is computed.</p><p>Autocorrelation [`] =N1</p><p>n=0 a[n]a[n `], 0 ` N1, is first computed from each N -length subband accentframe a[n]. The accent signal reaches peak values wheneverthere are high accents in the music and remains low other-wise. Computing autocorrelation from an impulsive accentsignal is comparable to computing the inter-onset interval(IOI) histogram as described by Seppanen [5], with addi-tional robustness due to not having to discretize the accentsignal into onsets.</p><p>The accent frame power [0] is stored for later weight-ing of subband periodicities. Offset and scale variations areeliminated from autocorrelation frames by normalization,</p><p>[`] =[`]minn [n]N1</p><p>n=0 [n]N minn [n]. (2)</p><p>See Fig. 3(a) for an example normalized autocorrelation frame.The figure shows also the correct beat period B, 0.5 seconds,and tatum period T, 0.25 seconds, as vertical lines.</p><p>Next, accent periodicity is estimated by means of the N -point discrete cosine transform (DCT)</p><p>R[k] = ckN1n=0</p><p>[n] cos(2n + 1)k</p><p>2N(3)</p><p>c0 =</p><p>1/N (4)</p><p>ck =</p><p>2/N, 1 k N 1. (5)</p><p>Calculate final </p><p>weight matrix</p><p>Summary periodicity strength</p><p>Update beat and </p><p>tatum weights</p><p>Previous period </p><p>estimates</p><p>Priors</p><p>Beat &amp; tatum </p><p>weights</p><p>Period relation matrix</p><p>Weight matrix</p><p>Calculate weighted periodicity</p><p>Find maxima</p><p>Weighted periodicity</p><p>Beat and </p><p>tatum periods</p><p>Figure 4. The period estimator.</p><p>Similarly to an IOI histogram [5], accent peaks with a periodp cause high responses in the autocorrelation function at lags` = 0, ` = p (nearest peaks), ` = 2p (second-nearest peaks),` = 3p (third-nearest peaks), and so on. Such response is ex-ploited in DCT-based periodicity estimation, which matchesthe autocorrelation response with zero-phase cosine func-tions; see dashed lines in Fig. 3(a).</p><p>Only a specific periodicity window, 0.1 s p 2 s, isutilized from the DCT vector R[k]. This window specifiesthe range of beat and tatum periods for estimation. The sub-band periodicities Ri[k] are combined into an M -point sum-mary periodicity vector, M = 128,</p><p>S[k] =4</p><p>i=1</p><p>i[0]Ri[k] 0 k M 1, (6)</p><p>where Ri[k] has interpolated values of Ri[k] from 0.5 Hzto 10 Hz, and the parameter = 1.2 controls weighting.Figure 3(b) shows an example summary periodicity vector.</p><p>2.5. Beat and Tatum Period EstimationThe period estimation stage finds the most likely beat pe-riod Bn and tatum period </p><p>An for the current frame at time n</p><p>based on the observed periodicity S[k] and primitive mu-sicological knowledge. Likelihood functions are used formodeling primitive musicological knowledge as proposedby Klapuri et al. in [3, p. 344345], although the actualcalculations of the model are different. An overview of theperiod estimator are depicted in Fig. 4.</p><p>First, weights f i( in) for the different beat and tatum pe-riod candidates are calculated as a product of prior distribu-tions pi( i) and continuity functions:</p><p>f iC</p><p>( in</p><p> in1</p><p>)=</p><p>11</p><p>2exp</p><p>[ 1</p><p>221</p><p>(ln</p><p> in in1</p><p>)2], (7)</p><p>as defined in Eq. (21) in [3, p. 348]. Here, i = A denotesthe tatum and i = B denotes the beat. The value 1 = 0.63is used. The continuity function describes the tendency thatthe periods are slowly varying, thus taking care of tyingthe successive period estimates together. in1 is defined asthe median of three previous period estimates. This is foundto be slightly more robust than just using the estimate from</p></li><li><p>00.5</p><p>11.5</p><p>2</p><p>00.511.5</p><p>20</p><p>0.2</p><p>0.4</p><p>0.6</p><p>0.8</p><p>Beat period [s]Tatum period [s]</p><p>Like</p><p>lihoo</p><p>d</p><p>Figure 5. Likelihood of different beat and tatum periods tooccur jointly.</p><p>the previous frame. The priors are lognormal distributionsas described in Eq. (22) in [3, p. 348].</p><p>The output of the Update beat and tatum weights step inFig. 4 are two weighting vectors containing the evaluatedvalues of the functions fB(Bn ) and f</p><p>A(An ). The valuesare obtained by evaluating the continuity functions for theset of possible periods given the previous beat and tatumestimates, and multiplying with the priors.</p><p>The next step, Calculate final weight matrix, adds in themodelling of the most likely relations between simultaneousbeat and tatum periods. For example, the beat and tatum aremore likely to occur at ratios of 2, 4, 6, and 8 than in ratiosof 1, 3, 5, and 7. The likelihood of possible beat and tatumperiod combinations B , A is modelled with a Gaussianmixture density, as described in Eq. (25) in [3, p. 348]:</p><p>g(B , A) =9</p><p>l=1</p><p>wlN (B</p><p>A; l, 2) (8)</p><p>where l are the component means and 2 is the commonvariance. Eq. (8) is evaluated for the set of M M periodcombinations. The weights wl were hand adjusted to givegood performance on a small set of test data. Fig. 5 de-picts the resulting likelihood surface g(B , A). The finalweighting function is</p><p>h(Bn , An ) =</p><p>fB(Bn )</p><p>g(Bn , An )fA(An ). (9)</p><p>Taking the square root spreads the function such that thepeaks do not become too narrow. The result is a final MMlikelihood weighting matrix H with values of h(Bn , </p><p>An ) for</p><p>all beat and tatum period combinations.The Calculate weighted periodicity step weights the sum-</p><p>mary periodicity observation with the obtained likelihoodweighting matrix H. We assume that the likelihood of ob-serving a certain beat and tatum combination is proportionalto a sum of the corresponding values of the summary peri-odicity, and define the observation O(Bn , </p><p>An ) = (S[kB ] +</p><p>Comb filter 1</p><p>Time of last beat in the </p><p>previous frame</p><p>Beat period</p><p>Phase prediction</p><p>Weighted accentuation</p><p>Phase prediction</p><p>Comb filter 2</p><p>Previous beat period</p><p>Calculate scores for </p><p>phase candidates</p><p>Calculate scores for </p><p>phase candidates</p><p>Previous beat </p><p>period</p><p>Beat period Select winning </p><p>phase and refine beat period. Store </p><p>winning Comb filter state. Beat period </p><p>and phase</p><p>Predicted phase</p><p>Predicted phase</p><p>Figure 6. The phase estimation stage finds the phase of the beatand tatum pulses, and may also refine the beat period estimate.</p><p>S[kA])/2, where the indices kB and kA correspond to theperiods Bn and </p><p>An , respectively. This gives an observation</p><p>matrix of the same size as our weighting matrix. The ob-servation matrix is multiplied pointwise with the weightingmatrix, giving the weighted M M periodicity matrix Pwith values P (Bn , </p><p>An ) = h(</p><p>Bn , </p><p>An )O(</p><p>Bn , </p><p>An ). The fi-</p><p>nal step is to Find the maximum from P. The indices ofthe maximum correspond to the beat and tatum period es-timates Bn , </p><p>An . The period estima...</p></li></ul>


View more >