    Joint Beat & Tatum Tracking from Music Signals

    Jarno Seppanen, Antti Eronen, Jarmo Hiipakka

    Nokia Research Center, P.O.Box 407, FI-00045 NOKIA GROUP, Finland

    {jarno.seppanen, antti.eronen, jarmo.hiipakka}

    Abstract

    This paper presents a method for extracting two key metrical properties, the beat and the tatum, from acoustic signals of popular music. The method is computationally very efficient while performing comparably to earlier methods. High efficiency is achieved through multirate accent analysis, discrete cosine transform periodicity analysis, and phase estimation by adaptive comb filtering. During analysis, the music signals are first represented in terms of accentuation on four frequency subbands, and then the accent signals are transformed into the periodicity domain. Beat and tatum periods and phases are estimated in a probabilistic setting, incorporating primitive musicological knowledge of beat-tatum relations, the prior distributions, and the temporal continuities of beats and tatums. In an evaluation with 192 songs, the beat tracking accuracy of the proposed method was found comparable to the state of the art. Complexity evaluation showed that the computational cost is less than 1% of earlier methods. The authors have written a real-time implementation of the method for the S60 smartphone platform.

    Keywords: Beat tracking, music meter estimation, rhythm analysis.

    1. Introduction

    Recent years have brought significant advances in the field of automatic music signal analysis, and music meter estimation is no exception. In general, the music meter contains a nested grouping of pulses called metrical levels, where pulses on higher levels are subsets of the lower level pulses; the most salient level is known as the beat, and the lowest level is termed the tatum [1, p. 21].

    Metrical analysis of music signals has many applications, ranging from browsing and visualization to classification and recommendation of music. The state of the art has advanced considerably in performance, but its computational requirements have remained restrictively high. The proposed method significantly improves computational efficiency while maintaining satisfactory performance.

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2006 University of Victoria

    Technical approaches to meter estimation vary and include, e.g., autocorrelation-based methods [6], inter-onset interval histogramming [5], and banks of comb filter resonators [4], possibly followed by a probabilistic model [3]. See [2] for a review of rhythm analysis systems.

    2. Algorithm Description

    An overview of the algorithm is presented in Fig. 1: the input is an audio signal of polyphonic music, and the output consists of the times of beats and tatums. The beat and tatum tracker has been implemented in the C++ programming language on the S60 smartphone platform. The algorithm design is causal and the implementation works in real time.

    The operation of the system can be described in six stages (see Fig. 1):

    1. Resampling stage,

    2. Accent filter bank stage,

    3. Buffering stage,

    4. Periodicity estimation stage,

    5. Period estimation stage, and

    6. Phase estimation stage.

    First, the signal is resampled to a fixed sample rate, to support arbitrary input sample rates. Second, the accent filter bank transforms the acoustic signal of music into a form that is suitable for beat and tatum analysis. In this stage, subband accent signals are generated, which constitute an estimate of the perceived accentuation on each subband. The accent filter bank stage significantly reduces the amount of data.

    Then, the accent signals are accumulated into four-second frames. Periodicity estimation looks for repeating accents on each subband. The subband periodicities are then combined, and summary periodicity is computed.

    Next, the most likely beat and tatum periods are estimated from each periodicity frame. This uses a probabilistic formulation of primitive musicological knowledge, including the relation, the prior distribution, and the temporal continuity of beats and tatums. Finally, the beat phase is found and beat and tatum times are positioned. The accent signal is filtered with a pair of comb filters, which adapt to different beat period estimates.

    [Figure 1: block diagram. Blocks: Accent filter bank, Buffering, Periodicity estimation, Period estimation, Phase estimation. Signals: Audio signal, Subband accent signals, Accent frames, Summary periodicity, Beat and tatum times.]

    Figure 1. Beat and tatum analyzer.

    [Figure 2: block diagram. Block labels: QMF, x^2, LPF, M_i, Comp, Diff, Rect, 0.8. Sample rate labels: 24 kHz, 6 kHz, 1.5 kHz, 375 Hz, 125 Hz.]

    Figure 2. Accent filter bank overview. (a) The audio signal is first divided into subbands, then (b) power estimates on each subband are calculated, and (c) accent computation is performed on the subband power signals.

    2.1. Resampling

    Before any audio analysis takes place, the signal is converted to a 24 kHz sample rate. This is required because the filter bank uses fixed frequency regions. The resampling can be done with a relatively low-quality algorithm, linear interpolation, because high fidelity is not required for successful beat and tatum analysis.
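    As an illustration of the resampling step, the following is a minimal sketch of linear-interpolation resampling to 24 kHz. The function name and its handling of the final sample are assumptions made for this example and are not taken from the authors' C++ implementation.

```python
def resample_linear(x, fs_in, fs_out=24000):
    """Resample the sequence x from fs_in to fs_out by linear interpolation.

    No anti-aliasing filter is applied; this is a deliberately low-quality
    resampler of the kind the text says is sufficient here.
    """
    if not x:
        return []
    step = fs_in / fs_out                  # input samples advanced per output sample
    n_out = int((len(x) - 1) / step) + 1   # stay within the last input sample
    y = []
    for m in range(n_out):
        pos = m * step                     # fractional read position in the input
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        y.append((1.0 - frac) * x[i] + frac * nxt)
    return y
```

    For example, a 48 kHz input is reduced to every second sample, while a 12 kHz input gains one interpolated value between each pair of neighbours.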

    2.2. Accent Filter Bank

    Figure 2 presents an overview of the accent filter bank. The incoming audio signal x[n] is (a) first divided into subband audio signals, and (b) a power estimate signal is calculated for each band separately. Last, (c) an accent signal is computed for each subband.

    The filter bank divides the acoustic signal into seven frequency bands by means of six cascaded decimating quadrature mirror filters (QMF). The QMF subband signals are combined pairwise into three two-octave subband signals, as shown in Fig. 2(a). When combining two consecutive branches, the signal from the higher branch is decimated without filtering. However, the error caused by the aliasing produced in this operation is negligible for the proposed method. The sampling rate decreases by a factor of four between successive bands due to the two QMF analysis stages and the extra decimation step. As a result, the frequency bands are located at 0–190 Hz, 190–750 Hz, 750–3000 Hz, and 3–12 kHz when the filter bank input is at 24 kHz.
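    The band layout can be checked with a short calculation. The sketch below assumes that each QMF stage splits the remaining low band in half (an octave-band cascade), which is consistent with the band edges quoted above; the function names are hypothetical.

```python
def qmf_band_edges(fs=24000.0, stages=6):
    """Return (low, high) band edges in Hz, highest band first."""
    bands = []
    top = fs / 2.0                        # Nyquist frequency of the input
    for _ in range(stages):
        bands.append((top / 2.0, top))    # each stage peels off the upper half
        top /= 2.0
    bands.append((0.0, top))              # what remains is the lowest band
    return bands


def combine_pairwise(bands):
    """Merge consecutive pairs of bands; the final low band is kept as is."""
    merged = [(bands[i + 1][0], bands[i][1]) for i in range(0, len(bands) - 1, 2)]
    merged.append(bands[-1])
    return merged
```

    With a 24 kHz input this gives seven bands, which combine into 3–12 kHz, 750–3000 Hz, 187.5–750 Hz, and 0–187.5 Hz, matching the rounded figures in the text.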

    There is a very efficient structure that can be used to implement the downsampling QMF analysis with just two allpass filters, an addition, and a subtraction. This structure is depicted in Fig. 5.2-5 in [7, p. 203]. The allpass filters for this application can be first-order filters, because only modest separation is required between bands.
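    A rough sketch of such a two-allpass polyphase structure is given below. It is an assumption-laden illustration rather than the structure of [7] reproduced exactly: the coefficients a0 and a1 are untuned placeholders, not values from the paper, and the function names are hypothetical.

```python
def allpass1(x, a):
    """First-order allpass A(z) = (a + z^-1) / (1 + a z^-1)."""
    y, x_prev, y_prev = [], 0.0, 0.0
    for v in x:
        out = a * v + x_prev - a * y_prev
        y.append(out)
        x_prev, y_prev = v, out
    return y


def qmf_analysis(x, a0=0.1, a1=0.5):
    """Split x into half-rate low and high bands with two allpass branches.

    a0 and a1 are placeholder coefficients (assumption), not a tuned design.
    """
    even = x[0::2]
    odd = ([0.0] + x[1::2])[:len(even)]   # odd samples, delayed by one input sample
    b0 = allpass1(even, a0)               # one allpass per polyphase branch
    b1 = allpass1(odd, a1)
    low = [0.5 * (p + q) for p, q in zip(b0, b1)]    # the single addition
    high = [0.5 * (p - q) for p, q in zip(b0, b1)]   # the single subtraction
    return low, high
```

    Because both branches run at half the input rate, each output sample pair costs two first-order filter updates plus one addition and one subtraction, which is where the efficiency comes from.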

    The subband power computation is shown in Fig. 2(b). The audio signal is squared, low-pass filtered (LPF), and decimated by a subband-specific factor M_i to get the subband power signal. The low-pass filter is a digital filter with a 10 Hz cutoff frequency. The subband decimation ratios M_i = {48, 12, 3, 3} have been chosen so that the power signal sample rate is 125 Hz on all subbands.
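    The power computation and the resulting sample rates can be sketched as follows. The subband rates are read from the labels of Fig. 2(a) and their pairing with the M_i values is inferred; the one-pole low-pass filter is an assumption, since the text specifies only the 10 Hz cutoff.

```python
import math

SUBBAND_RATES_HZ = [6000, 1500, 375, 375]   # from the Fig. 2(a) labels (assumed pairing)
DECIMATION = [48, 12, 3, 3]                 # M_i from the text


def subband_power(x, fs, m, cutoff_hz=10.0):
    """Square x, smooth it with a one-pole low-pass filter, keep every m-th sample."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)   # one-pole coefficient
    state, smoothed = 0.0, []
    for v in x:
        state += alpha * (v * v - state)    # unity gain at DC
        smoothed.append(state)
    return smoothed[::m]
```

    Dividing each assumed subband rate by its decimation ratio gives 125 Hz in every case, so one second of a 6 kHz subband yields 125 power samples.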

    The subband accent signal computation in Fig. 2(c) is modelled according to Klapuri et al. [3, pp. 344–345]. In the process, the power signal is first mapped with a nonlinear level compression function labeled Comp in Fig. 2(c),

    \[ f(x) = \begin{cases} 5.213 \ln(1 + 10\sqrt{x}), & x > 0.0001, \\ 5.213 \ln 1.1, & \text{otherwise.} \end{cases} \qquad (1) \]

    Following compression, the first-order difference signal is computed (Diff) and half-wave rectified (Rect). In accordance with Eq. (3) in [3], the rectified signal is added to the power signal after constant weighting, see Fig. 2(c). The high computational efficiency of the proposed method lies mostly in the accent filter bank design. Besides being efficient to compute, the resulting accent signals are comparable to those of Klapuri et al.; see e.g. Fig. 3 in [3].
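    The Comp, Diff, and Rect chain can be sketched as below. Several points are assumptions: the compression is taken to have a square-root argument, which makes the two branches of Eq. (1) meet at x = 0.0001; the weight 0.8 is read from the label in Fig. 2(c); and applying 1 - weight to the compressed power is a guess at the weighting, not a statement of Eq. (3) in [3].

```python
import math


def compress(x):
    """Level compression of Eq. (1), assuming a square-root argument."""
    if x > 0.0001:
        return 5.213 * math.log(1.0 + 10.0 * math.sqrt(x))
    return 5.213 * math.log(1.1)


def accent_signal(power, weight=0.8):
    """Comp -> Diff -> Rect, then a weighted sum with the compressed power.

    The weighting scheme is an assumption made for illustration.
    """
    c = [compress(p) for p in power]
    accent = []
    prev = c[0] if c else 0.0
    for v in c:
        rect = max(v - prev, 0.0)          # half-wave rectified first difference
        accent.append((1.0 - weight) * v + weight * rect)
        prev = v
    return accent
```

    A rise in subband power produces a positive spike in the accent signal, whereas a fall contributes nothing through the rectified difference, so onsets are emphasized over decays.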

    [Figure 3: two plots with beat (B) and tatum (T) markers. Panel (a) horizontal axis: Time lag [s]. Panel (b) horizontal axis: Frequency [Hz].]

    Figure 3. (a) Normalized autocorrelation and (b) summary periodicity, with beat (B) and tatum (T) periods shown.

    2.3. Buffering

    The buffering stage implements a ring buffer which accumulates the signal into fixed-length frames. The incoming signal is split into consecutive accent signal frames of a fixed length N = 512 (4.1 seconds). The value of N can be modified to choose a different performance-latency tradeoff.
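    A minimal sketch of the framing behaviour follows. It assumes non-overlapping frames and uses a deque in place of a true ring buffer for brevity; the class and method names are hypothetical.

```python
from collections import deque

ACCENT_RATE_HZ = 125     # accent signal sample rate from Section 2.2


class FrameBuffer:
    """Accumulate incoming accent samples into fixed-length frames."""

    def __init__(self, frame_len=512):
        self.frame_len = frame_len
        self.buf = deque()

    def push(self, samples):
        """Append samples and return every frame that became complete."""
        self.buf.extend(samples)
        frames = []
        while len(self.buf) >= self.frame_len:
            frames.append([self.buf.popleft() for _ in range(self.frame_len)])
        return frames
```

    At 125 Hz, a frame of N = 512 samples spans 512 / 125 = 4.096 seconds, which is the 4.1 seconds quoted above. A smaller N lowers latency at the cost of a shorter analysis window.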

    2.4. Accent Periodicity Estimation

    The accent signals are analyzed for intrinsic repetitions. Here, periodicity is defined as the combined strength of accents that repeat with a given period. For all subband accent signals, a joint summary periodicity vector is computed.

    Autocorrelation

    \[ \rho[\ell] = \sum_{n=0}^{N-1} a[n]\, a[n-\ell], \qquad 0 \le \ell \le N-1, \]

    is first computed from each N-length subband accent frame a[n]. The accent signal reaches peak values whenever there are high accents in the music and remains low otherwise. Computing autocorrelation from an impulsive accent signal is comparable to computing the inter-onset interval (IOI) histogram as described by Seppanen [5], with additional robustness due to not having to discretize the accent signal into onsets.

    The accent frame power \rho[0] is stored for later weighting of subband periodicities. Offset and scale variations are eliminated from autocorrelation frames by normalization,

    \[ \hat{\rho}[\ell] = \frac{\rho[\ell] - \min_n \rho[n]}{\sum_{n=0}^{N-1} \rho[n] - N \min_n \rho[n]}. \qquad (2) \]
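    The autocorrelation and the normalization of Eq. (2) can be sketched directly. This is a plain O(N^2) illustration, not the authors' implementation; it assumes that a[n] is zero outside the frame, and the function names are hypothetical.

```python
def autocorrelation(a):
    """Autocorrelation of one accent frame, treating a[n] as zero for n < 0."""
    n_len = len(a)
    return [sum(a[n] * a[n - lag] for n in range(lag, n_len))
            for lag in range(n_len)]


def normalize(rho):
    """Remove offset and scale as in Eq. (2): minimum becomes 0, sum becomes 1."""
    lo = min(rho)
    denom = sum(rho) - len(rho) * lo
    if denom == 0.0:                      # flat frame: no periodicity to report
        return [0.0] * len(rho)
    return [(r - lo) / denom for r in rho]
```

    For an impulse train with a period of four samples, the normalized autocorrelation is zero everywhere except at lags that are multiples of four, which is the behaviour that makes it comparable to an inter-onset interval histogram.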

    See Fig. 3(

