
REAL-TIME IMPLEMENTATION OF HMM-BASED CHORD ESTIMATION IN MUSICAL AUDIO

Taemin Cho, Juan P. Bello

Music and Audio Research Laboratory (MARL), New York University, New York, USA

[email protected]

ABSTRACT

In this paper, we implement a real-time chord estimation system based on the hidden Markov model (HMM) using chroma features. HMMs, with the Viterbi decoding algorithm, have proven a powerful tool for chord recognition in polyphonic music signals. However, the direct application of the traditional, non-causal decoding approach is not the optimal choice for real-time processing, with limited memory capabilities and no access to future observations. We propose a system of buffers and a modified decoding process that approximates offline results while minimizing the system's latency.

1. INTRODUCTION

The automatic recognition of musical attributes such as melody, rhythm, and harmony from polyphonic music signals is an important area of development in computer music and machine listening research. Particularly, the characterization of the harmonic content (e.g. chords and keys) finds applicability in tasks such as song thumbnailing, automatic music transcription and the development of interactive music systems. Some of these tasks require attribute recognition to occur in real-time, allowing performers to effectively communicate with computers using the sound of their instruments without recourse to MIDI devices or triggering operations.

Unfortunately, most existing state-of-the-art approaches to chord recognition are designed for offline processing [1, 4, 6], which precludes their use in realtime applications. A few realtime approaches exist; however, they either do not use adaptive machine learning [3] or have not yet been implemented as realtime systems [7].

In this paper, we propose a method for real-time chord recognition in polyphonic audio signals. We adapt the formulation that combines the hidden Markov model (HMM) with chroma vectors to a real-time context by introducing modifications to the decoding process intended to approximate offline results with minimal decoding latency.

2. BACKGROUND

Pitch class profiles (PCP), commonly known as chroma features, represent the distribution of a signal's frequency content across the 12 semitones of the chromatic scale, making them the features of choice for analysis tasks such as chord and key estimation. Fujishima [3] pioneered the use of chroma features for audio-based chord recognition. His system, intended for real-time operation, uses a simple pattern matching approach that performs well on samples containing a single instrument, but does not cope well with the complexities of polyphonic and multi-instrumental music.

In an attempt to overcome the shortcomings of the simple pattern matching approach, Sheh and Ellis [6] introduced chord estimation from chroma features using HMMs. Although their results were poor (maximum accuracy of 26.4%), their work laid a foundation for future studies. The combination of chroma features and hidden Markov models is now the de facto standard for audio-based chord recognition. All approaches presented at the MIREX 2008 audio chord detection task are fundamentally variations of this basic architecture. For a comparative review see [4].

2.1. Approach

Our approach is a re-implementation of the system discussed in [1], one of the top performers in MIREX 2008 (see footnote 1). It begins with an efficient algorithm for the calculation of the constant Q transform $X_{cq}$, as the multiplication of the short-time Fourier transform and the complex conjugate of a pre-calculated sparse kernel [2]. $X_{cq}$ is then used to compute the chroma features as

$$O_b = \sum_{m=0}^{M-1} \left| X_{cq}(b + m\beta) \right|,$$

where $\beta$ is the number of bins per octave, $b = 1, \cdots, \beta$ are the chroma bin indices, and $M$ is the total number of octaves from a minimum frequency $f_{min}$. In our implementation, $\beta = 36$, $M = 6$ and $f_{min} = 130.81$ Hz are used. Finally, the chroma feature vectors are quantized into 12 bins using a Gaussian filter bank.
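The octave-folding sum above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fold_to_chroma` and its argument names are hypothetical, indices are zero-based (the text uses $b = 1, \cdots, \beta$), and the constant Q transform itself and the final Gaussian filter bank quantization to 12 bins are not shown.

```python
def fold_to_chroma(cq_bins, bins_per_octave=36, num_octaves=6):
    """Fold constant-Q bins into one octave: O_b = sum_m |X_cq(b + m * beta)|.

    cq_bins holds bins_per_octave * num_octaves real or complex values,
    ordered from the lowest frequency upwards (an assumption of this sketch).
    """
    if len(cq_bins) != bins_per_octave * num_octaves:
        raise ValueError("expected bins_per_octave * num_octaves constant-Q bins")
    chroma = [0.0] * bins_per_octave
    for m in range(num_octaves):            # octave index m = 0 .. M-1
        for b in range(bins_per_octave):    # zero-based chroma bin index
            chroma[b] += abs(cq_bins[b + m * bins_per_octave])
    return chroma


# Toy example with beta = 3 bins per octave and M = 2 octaves.
toy_chroma = fold_to_chroma([1, 2, 3, 4, 5, 6], bins_per_octave=3, num_octaves=2)
```

With the parameters reported in the text, the input would have 36 × 6 = 216 bins and the folded vector 36 bins, before quantization to 12.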

A hidden Markov model, $\lambda$, is defined by the initial state distribution $\pi$, the set of state-to-observation probability distributions $B = \{b_i\}$, and the matrix of state transition probabilities $A = \{a_{ij}\}$ [5]. In our system, an ergodic 24-state HMM is used; each state is assigned a single chord among the 12 major and 12 minor triads. For each state $S_i$, we use a single multivariate Gaussian as the continuous observation probability distribution $b_i(O)$.

[Footnote 1] MIREX 2008: http://www.music-ir.org/mirex/2008/

Proceedings of the International Computer Music Conference (ICMC 2009), Montreal, Canada, August 16-21, 2009

The HMM $\lambda$ is trained using the Baum-Welch algorithm and decoded using the Viterbi algorithm. The approach uses unsupervised learning, constrained such that $A$ and $\pi$ are updated during training, while the parameters of the distributions $b_i$ are not. This strategy is supported by basic music theory; the model parameters are initialized with basic chord templates and chord relationships organized according to the circle of fifths (see [1] for more details).
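As a rough illustration of the 24-state layout, the following Python sketch enumerates the 12 major and 12 minor triads together with a binary pitch-class template for each. The binary templates, the state ordering, and the names `triad_template` and `build_chord_states` are assumptions made for this example; the actual Gaussian means, covariances and transition initialization are those described in [1] and are not reproduced here.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def triad_template(root, minor=False):
    """Return a 12-bin binary template with ones on the root, third and fifth."""
    third = 3 if minor else 4               # minor third = 3 semitones, major third = 4
    template = [0.0] * 12
    for interval in (0, third, 7):          # the perfect fifth is 7 semitones above the root
        template[(root + interval) % 12] = 1.0
    return template


def build_chord_states():
    """List (label, template) pairs: 12 major triads followed by 12 minor triads."""
    states = []
    for minor in (False, True):
        for root, name in enumerate(PITCH_CLASSES):
            label = name + ("m" if minor else "")
            states.append((label, triad_template(root, minor)))
    return states


chord_states = build_chord_states()         # 24 entries, one per HMM state
```

In a model of this kind, such templates could serve as starting points for the means of the observation distributions $b_i$, which the text says remain fixed during training.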

3. REALTIME DECODING SYSTEM

Unlike an offline system, an online system does not have access to the entire signal at the outset, nor does it know the length of the signal. Data is fed to the system in real time and must be processed with the lowest latency possible.

To accommodate these constraints, we use two different types of buffers: an audio input buffer and an observation buffer. The audio input buffer holds the most recent block size samples of the incoming audio signal. Chroma features are extracted from this buffer each time a new hop size of samples arrives. The observation buffer keeps the $L$ most recently extracted chroma vectors for local decoding. Figure 1 shows the process at times $t$ and $t+1$.
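The two buffers can be sketched with fixed-length double-ended queues. This is a minimal illustration, not the system's C++ implementation: the class name `StreamBuffers`, the callback `extract_chroma`, and the choice to emit the first chroma vector as soon as one full block has been received are all assumptions of the sketch.

```python
from collections import deque


class StreamBuffers:
    """Audio input buffer plus observation buffer, as described in the text."""

    def __init__(self, block_size, hop_size, obs_length, extract_chroma):
        self.block_size = block_size
        self.hop_size = hop_size
        self.audio = deque(maxlen=block_size)         # audio input buffer
        self.observations = deque(maxlen=obs_length)  # observation buffer (L vectors)
        self.extract_chroma = extract_chroma          # hypothetical feature callback
        self._since_last = 0                          # samples since the last extraction

    def push_samples(self, samples):
        """Feed new samples; return the chroma vectors extracted along the way."""
        extracted = []
        for sample in samples:
            self.audio.append(sample)                 # oldest sample drops out when full
            self._since_last += 1
            block_ready = len(self.audio) == self.block_size
            if block_ready and self._since_last >= self.hop_size:
                vector = self.extract_chroma(list(self.audio))
                self.observations.append(vector)      # oldest observation drops out
                extracted.append(vector)
                self._since_last = 0
        return extracted
```

With the settings reported later in the paper, `block_size` would be 8192 and `hop_size` 2048; decoding would be triggered once `observations` holds $L$ vectors.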

3.1. The Viterbi Algorithm

The Viterbi algorithm [5] is the most commonly used decoding algorithm for HMMs. For a given model $\lambda$ and a $T$-long observation sequence $O = \{O_1, O_2, \cdots, O_T\}$, it finds the single best state path $Q = \{q_1, q_2, \cdots, q_T\}$ that maximizes $P[Q \mid O, \lambda]$, which is the same as maximizing $P[Q, O \mid \lambda]$. The highest probability, $\delta_t(i)$, along a single path ending in state $S_i$, from the beginning of the observation sequence to time $t$, is defined as:

$$\delta_t(i) = \max_{q_1, q_2, \cdots, q_{t-1}} P[q_1 q_2 \cdots q_t = S_i,\ O_1 O_2 \cdots O_t \mid \lambda] \tag{1}$$

By induction, $\delta_t(j)$ can be written as:

$$\delta_t(j) = \left[\max_i \delta_{t-1}(i)\, a_{ij}\right] \cdot b_j(O_t), \quad t > 1 \tag{2}$$
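The induction step behind (2) can be spelled out as follows. This is a sketch of the standard argument from the HMM literature [5], not a derivation given in the original text; it relies only on the first-order Markov property of the state sequence and on each observation depending solely on the current state.

$$
\begin{aligned}
\delta_t(j) &= \max_{q_1, \cdots, q_{t-1}} P[q_1 \cdots q_{t-1},\ q_t = S_j,\ O_1 \cdots O_t \mid \lambda] \\
&= \max_i\ \max_{q_1, \cdots, q_{t-2}} P[q_1 \cdots q_{t-2},\ q_{t-1} = S_i,\ O_1 \cdots O_{t-1} \mid \lambda]\ a_{ij}\, b_j(O_t) \\
&= \left[\max_i \delta_{t-1}(i)\, a_{ij}\right] b_j(O_t)
\end{aligned}
$$

The factor $a_{ij}\, b_j(O_t)$ is non-negative and does not depend on $q_1, \cdots, q_{t-2}$, so the inner maximization acts only on the first factor, which is $\delta_{t-1}(i)$ by definition (1).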

With an array $\psi_t(j)$, a container for the argument which maximizes (2) at time $t$ and state $S_j$, the complete Viterbi procedure for the given $N$-state model $\lambda = \{A, B, \pi\}$ and the observation sequence $O = \{O_1, O_2, \cdots, O_T\}$ can be stated as follows:

1. Initialization:

$$\delta_1(i) = \pi_i\, b_i(O_1), \quad 1 \le i \le N \tag{3a}$$

$$\psi_1(i) = 0, \quad 1 \le i \le N \tag{3b}$$

[Figure 1: block diagram with two panels, time t (left) and time t + 1 (right). In each panel the Audio Input Buffer feeds the Constant Q transform, which yields a 12-bin Chroma observation (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) that is pushed into the Observation Buffer. The Modified Viterbi Decoder outputs the sequence Q_t = F F F C C for observations O_{t-4} to O_t (left) and Q_{t+1} = F F C C D for observations O_{t-3} to O_{t+1} (right). The two panels are one hop size apart.]

Figure 1. Buffering and decoding process at times $t$ (left) and $t+1$ (right) where $L = 5$. At $t+1$, after a new hop size of audio data comes in from time $t$, a new chroma vector $O_{t+1}$ is extracted from the audio input buffer and pushed into the observation buffer. The modified Viterbi decoder uses the decoding calculations at time $t$ to generate the sequence $Q_{t+1}$.

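2. Recursion:

$$\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \quad 2 \le t \le T,\ 1 \le j \le N \tag{4a}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right], \quad 2 \le t \le T,\ 1 \le j \le N \tag{4b}$$

3. Termination:

$$P^* = \max_{1 \le i \le N} \left[\delta_T(i)\right] \tag{5a}$$

$$q_T^* = \arg\max_{1 \le i \le N} \left[\delta_T(i)\right] \tag{5b}$$

4. Backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \cdots, 1 \tag{6}$$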

The obtained sequence $Q^* = \{q_1^*, q_2^*, \cdots, q_T^*\}$ from (5b) and (6) is the single best state path, which has the maximum likelihood, $P^*$.
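The four steps above translate directly into code. The following is a minimal Python sketch, assuming zero-based state and time indices and a precomputed table `obs_probs[t][j]` holding $b_j(O_t)$; the function and argument names are illustrative, and the plain probability products used here would underflow on long sequences.

```python
def viterbi(pi, A, obs_probs):
    """Standard Viterbi decoding following steps (3) to (6).

    pi[i]           initial probability of state i
    A[i][j]         transition probability from state i to state j
    obs_probs[t][j] observation likelihood b_j(O_t), precomputed by the caller
    Returns (best_path, best_probability).
    """
    N, T = len(pi), len(obs_probs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]

    # 1. Initialization (3a, 3b)
    for i in range(N):
        delta[0][i] = pi[i] * obs_probs[0][i]

    # 2. Recursion (4a, 4b)
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * obs_probs[t][j]

    # 3. Termination (5a, 5b)
    last = max(range(N), key=lambda i: delta[T - 1][i])

    # 4. Backtracking (6)
    path = [0] * T
    path[T - 1] = last
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[T - 1][last]
```

For the chord model, `pi`, `A` and `obs_probs` would have 24 states; the Gaussian evaluation that fills `obs_probs` is left to the caller.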

3.2. A Variation on the Viterbi Algorithm

The standard Viterbi algorithm shown in Section 3.1 is designed to decode an independent observation sequence. Therefore, applying it directly to the observation buffer isolates the current decoding process from previous ones. This isolation is a critical limitation, because the effectiveness of dynamic programming in the Viterbi algorithm rests on reusing previous computations. In order to include previous calculations without losing the dynamic programming properties, we propose a variation of the Viterbi algorithm.



Time:  1  2  3  4  5  6  7  8  9  10
Q5:    C  C  C  C  C  -  -  -  -  -
Q6:    -  C  C  C  F  F  -  -  -  -
Q7:    -  -  C  C  F  F  F  -  -  -
Q8:    -  -  -  C  F  F  F  F  -  -
Q9:    -  -  -  -  F  F  F  F  G  -
Q10:   -  -  -  -  -  F  F  F  G  G

Figure 2. The decoded sequence matrix with $L = 5$, $t = 1, 2, \cdots, 10$. The decoded sequences are stacked into an $L \times L$ matrix top to bottom. At time $t = 10$, $Q_{10}$ is pushed into the matrix and $Q_5$ is pushed out of it. The dotted square (rows $Q_5$ to $Q_9$) shows the matrix at time $t = 9$.

The basic idea is to substitute the fixed $\pi$-based initialization step (3) of the standard algorithm with an initialization that reuses previous buffer calculations. When the total number of extracted observations reaches $T = L$, the observation buffer is full and the decoding process is triggered (see Figure 1). The standard Viterbi algorithm is used only at this time, to initialize $\delta_1(i)$ and $\psi_1(i)$ and to calculate $\delta_t(i)$ and $\psi_t(i)$ for $t = 2, \cdots, L$, thus generating the first decoded sequence, $Q_L$. For $T > L$ and $t = T$, the modified Viterbi algorithm is used to decode the observation sequence. As we already have $\delta_t(i)$ and $\psi_t(i)$ for $t \in [T-L+1, T-1]$, we can skip the initialization step (3) and most calculations at the recursion step (4). The resulting variation is denoted as:

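1. Recursion:

$$\delta_T(j) = \max_{1 \le i \le N} \left[\delta_{T-1}(i)\, a_{ij}\right] b_j(O_T), \quad T > L,\ 1 \le j \le N \tag{7a}$$

$$\psi_T(j) = \arg\max_{1 \le i \le N} \left[\delta_{T-1}(i)\, a_{ij}\right], \quad T > L,\ 1 \le j \le N \tag{7b}$$

2. Termination:

$$q_T^* = \arg\max_{1 \le i \le N} \left[\delta_T(i)\right] \tag{8}$$

3. Backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \cdots, T-L+1 \tag{9}$$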

By using the results of previous calculations, we make dynamic programming possible in real time, operating in very much the same way as the offline process, with the exception of the partial backtracking process (9). Therefore, we expect the results of the modification to be closer to those of non-realtime decoding than the results of the standard algorithm applied to each buffer independently. In addition, as the modified algorithm skips many duplicated procedures, we expect it to be faster than the standard algorithm.
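The modified procedure can be sketched as a small streaming decoder that performs one recursion step per new observation and backtracks only over the last $L$ frames. This is a hedged illustration rather than the system's implementation: the class name `StreamingViterbi` is hypothetical, indices are zero-based, and the sketch works with log-probabilities (sums instead of products), an assumption added here because repeated products of probabilities underflow on an unbounded stream. The first $L$ frames reproduce the standard algorithm, since the recursion starts from the $\pi$-based initialization.

```python
from collections import deque


class StreamingViterbi:
    """One recursion step (7) per frame, termination (8), partial backtracking (9)."""

    def __init__(self, log_pi, log_A, L):
        self.log_pi = log_pi                 # log initial state probabilities
        self.log_A = log_A                   # log transition probabilities, [i][j]
        self.L = L
        self.N = len(log_pi)
        self.log_delta = None                # log delta at the most recent frame
        self.psi = deque(maxlen=L - 1)       # back-pointers of the last L-1 frames
        self.frames_seen = 0

    def push(self, log_b):
        """log_b[j] = log b_j(O_T). Returns the L-long path, or None while filling."""
        N = self.N
        if self.log_delta is None:
            # First frame only: pi-based initialization (3a).
            self.log_delta = [self.log_pi[i] + log_b[i] for i in range(N)]
        else:
            prev = self.log_delta
            new_delta, new_psi = [], []
            for j in range(N):
                best_i = max(range(N), key=lambda i: prev[i] + self.log_A[i][j])
                new_psi.append(best_i)
                new_delta.append(prev[best_i] + self.log_A[best_i][j] + log_b[j])
            self.log_delta = new_delta
            self.psi.append(new_psi)         # the oldest back-pointers drop out
        self.frames_seen += 1
        if self.frames_seen < self.L:
            return None                      # observation buffer not yet full
        state = max(range(N), key=lambda i: self.log_delta[i])
        path = [state]
        for back in reversed(self.psi):      # partial backtracking over L-1 steps
            state = back[state]
            path.append(state)
        path.reverse()
        return path
```

Each call to `push` costs one recursion step of order $N^2$ plus $L-1$ table lookups, whereas rerunning the standard algorithm on the buffer would repeat $L-1$ recursion steps per frame.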

Figure 3. Realtime Chord Recognizer

3.3. Frame-Level Chord Estimation

As seen in Figure 2, after $2L - 1$ frames, the decoded sequences always form an $L \times L$ lower triangular matrix. The estimated chord at time $t$ is the mode of the first column vector of the matrix. For example, in Figure 2, the final estimated chords at times $t = 5$ and 6 are both F.
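The column-wise vote can be illustrated with the sequences of Figure 2. In this Python sketch the $L$ most recent decoded sequences are stored oldest first, and the column of interest is the frame at which the newest sequence starts; the function name `estimate_chord` and this storage layout are assumptions of the example. Ties are resolved in favor of the label encountered first, a choice the text does not specify.

```python
from collections import Counter


def estimate_chord(decoded_sequences):
    """Return the most frequent label in the first fully populated column.

    decoded_sequences: the L most recent L-long sequences, oldest first,
    each one starting one frame later than the previous one.
    """
    L = len(decoded_sequences)
    # Sequence k (0 = oldest) covers the target frame at offset L - 1 - k.
    votes = [seq[L - 1 - k] for k, seq in enumerate(decoded_sequences)]
    return Counter(votes).most_common(1)[0][0]


# Decoded sequences Q5 .. Q10 from Figure 2.
Q = {
    5: ["C", "C", "C", "C", "C"],
    6: ["C", "C", "C", "F", "F"],
    7: ["C", "C", "F", "F", "F"],
    8: ["C", "F", "F", "F", "F"],
    9: ["F", "F", "F", "F", "G"],
    10: ["F", "F", "F", "G", "G"],
}
chord_at_5 = estimate_chord([Q[k] for k in range(5, 10)])   # matrix at t = 9
chord_at_6 = estimate_chord([Q[k] for k in range(6, 11)])   # matrix at t = 10
```

Both calls return "F", in agreement with the example given in the text.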

3.4. Decoding Latency

The decoding latency is determined by the block size, the hop size and the length of the observation sequence, $L$. Because the recognized chord is the chord of the $L$th frame, and assuming that the chord represents the block size of the signal, the latency can be calculated as:

$$\text{latency} = \frac{\text{block size}/2 + \text{hop size} \cdot (L-1)}{\text{sample rate}} \quad \text{(sec.)} \tag{10}$$

Both the accuracy and the latency are highly dependent on $L$. We argue that the modified Viterbi algorithm is capable of simultaneously increasing accuracy and decreasing the latency.
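Equation (10) is easy to evaluate for a given configuration. The short Python sketch below uses the block size, hop size and sampling rate reported in Section 4; the function name `decoding_latency` is illustrative.

```python
def decoding_latency(block_size, hop_size, L, sample_rate):
    """Latency in seconds according to equation (10)."""
    return (block_size / 2 + hop_size * (L - 1)) / sample_rate


# Settings from Section 4: 8192-sample blocks, 2048-sample hops, 44100 Hz.
latency_L5 = decoding_latency(8192, 2048, 5, 44100)
latency_L15 = decoding_latency(8192, 2048, 15, 44100)
```

For $L = 5$ this gives 12288 / 44100, approximately 0.2786 seconds, and for $L = 15$ it gives 32768 / 44100, approximately 0.7430 seconds. Each additional frame in the observation buffer adds one hop, about 46 ms at these settings.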

4. IMPLEMENTATION AND EXPERIMENTS

Our system, shown in Figure 3, was developed on an Apple MacBook Pro (2.33 GHz Intel Core 2 Duo, 2 GB 667 MHz DDR2 SDRAM). The prime consideration for this implementation was having every DSP calculation finished within the time interval determined by the hop size. For maximum performance, we used a C-based library optimized for SSE (see footnote 2): the vDSP Accelerate framework. Coding was done in C++ and wrapped by Objective-C++ 2.0 within the Cocoa framework. In our experiments, it takes an average of 0.35 ms to generate a chroma vector, while the decoding times for 30-long observation sequences are 3.8 and 1.2 ms using the standard and modified Viterbi algorithms, respectively.

[Footnote 2] Intel's Streaming SIMD Extensions (called SSE) is a 128-bit SIMD vector extension to the x86 ISA.

For the experiments, we performed 12-fold cross validation on a set of 169 annotated Beatles recordings, the same used for the 2008 MIREX evaluation. Each song in the set is a CD-quality WAV file (2 channels, 44100 Hz and 16-bit PCM). We use 8192 and 2048 samples for the block size and hop size respectively, and use a 44100 Hz sampling rate. Model training was unsupervised and performed offline.

For the evaluation of the modified Viterbi algorithm, the realtime decoding process was computed with 6 different observation sequence lengths, $L$ = 5, 10, 15, 20, 25 and 30. We compared the results against non-realtime decoding and against the standard Viterbi algorithm applied in real time.

Observation length L:        5      10     15     20     25     30
Latency (sec.):              0.278  0.511  0.743  0.975  1.207  1.430
Standard Viterbi (%):        56.06  61.53  64.39  65.74  66.60  67.13
Modified Viterbi (%):        64.30  66.17  67.25  67.68  67.98  68.13
Non-realtime accuracy (%):   68.56

Figure 4. Comparison of decoding accuracies.

As seen in Figure 4, both decoding algorithms, the modified Viterbi and the standard Viterbi, converge towards the non-realtime decoding result. However, the results of the modified Viterbi algorithm converge faster than those of the standard Viterbi algorithm. This means that the modified Viterbi algorithm achieves higher accuracy at lower latencies. For example, the modified Viterbi algorithm needs 5 frames to achieve a mean accuracy of 64.30%, while the standard Viterbi algorithm needs 15 frames for a similar accuracy of 64.39%, resulting in an additional latency of roughly 0.5 seconds. This is an important difference in real-time processing.

Using a paired t-test, we found the difference between the results of the modified and standard algorithms to be statistically significant at the 5% level.

5. CONCLUSION

In this paper, we propose a new approach to the implementation of an audio-based chord recognition system in real-time. To cope with the restrictions imposed by online processing, we implement a system of buffers and a modification of the traditional Viterbi algorithm for the decoding of state sequences.

We accomplish the initial goal of this paper, namely, to make the results of realtime decoding as close as possible to the results of non-realtime decoding while minimizing latency. However, the inevitable and comparatively long decoding latency still remains a problem, made worse by the proportionality between this latency and the accuracy of the system. Simple solutions, such as a reduction of the block size, result in unstable and noisier features that negatively affect accuracy.

The experimental results clearly show that the real-time system can only be as good as the offline system is. Thus, future work will focus on improving both the feature extraction and the modeling stages, by exploring means beyond the use of chroma features with the HMM.

6. ACKNOWLEDGMENTS

This work is made possible by grant LG-06-08-0073-08 from the U.S. Institute of Museum and Library Services.

7. REFERENCES

[1] J. P. Bello and J. Pickens, "A Robust Mid-Level Representation for Harmonic Content in Music Signals," Proc. of the Int. Symposium on Music Information Retrieval, pp. 304–311, September 2005.

[2] J. C. Brown and M. S. Puckette, "An efficient algorithm for the calculation of a constant Q transform," Journal of the Acoustical Society of America, vol. 92, no. 5, pp. 2698–2701, November 1992.

[3] T. Fujishima, "Realtime chord recognition of musical sound: A system using common lisp music," Proc. of the Int. Computer Music Conference, pp. 464–467, 1999.

[4] H. Papadopoulos and G. Peeters, "Large-scale study of chord estimation algorithms based on chroma representation and HMM," Proc. of the Int. Workshop on Content-Based Multimedia Indexing (CBMI), pp. 53–60, 2007.

[5] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[6] A. Sheh and D. P. Ellis, "Chord Segmentation and Recognition using EM-Trained Hidden Markov Models," Proc. of the Int. Symposium on Music Information Retrieval, pp. 185–191, October 2003.

[7] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata and H. G. Okuno, "Automatic Chord Transcription with Concurrent Recognition of Chord Symbols and Boundaries," Proc. 5th Int. Conf. on Music Information Retrieval (ISMIR), pp. 100–105, 2004.

