Music Analysis Using Hidden Markov Mixture Models
Yuting Qi, John William Paisley and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Durham, NC, 27708-0291
Email: {yuting, jwp4, lcarin}@ee.duke.edu
Abstract
We develop a hidden Markov mixture model based on a Dirichlet process (DP) prior, for represen-
tation of the statistics of sequential data for which a single hidden Markov model (HMM) may not be
sufficient. The DP prior has an intrinsic clustering property that encourages parameter sharing, and this
naturally reveals the proper number of mixture components. The evaluation of posterior distributions for
all model parameters is achieved in two ways: (i) via a rigorous Markov chain Monte Carlo method, and
(ii) approximately and efficiently via a variational Bayes formulation. Using DP HMM mixture models
in a Bayesian setting, we propose a novel scheme for music analysis, highlighting the effectiveness of
the DP HMM mixture model. Music is treated as a time-series data sequence and each music piece is
represented as a mixture of HMMs. We approximate the similarity of two music pieces by computing
the distance between the associated HMM mixtures. Experimental results are presented for synthesized
sequential data and for classical music clips. Music similarities computed using DP HMM mixture
modeling are compared to those computed from Gaussian mixture modeling, for which the mixture
modeling is also performed using DP. The results show that the performance of DP HMM mixture
modeling exceeds that of the DP GMM modeling.
Index Terms
Dirichlet Process, HMM mixture, Music, MCMC, Variational Bayes.
I. INTRODUCTION
With the advent of online music, a large quantity and variety of music spanning many eras and genres is now highly accessible. However, this wealth of available music can also pose a problem for
music listeners and researchers alike: how can the listener efficiently find new music he or she would like
from a vast library, and how can this database be organized? This challenge has motivated researchers
working in the field of music recognition to find ways of building robust and efficient classification,
retrieval, browsing, and recommendation systems for listeners.
The use of statistical models has great potential for analyzing music, and for recognition of similarities
and relationships between different musical pieces. Motivated by these goals, ideas from statistical
machine learning have attracted growing interest in the music-analysis community. For example, in the
work of Logan and Salomon [19], the sampled music signal is divided into overlapping frames and Mel-
frequency cepstral coefficients (MFCCs) are computed as a feature vector for each frame. A K-means
method is then applied to cluster frames in the MFCC feature space. Aucouturier and Pachet [3] model
the distribution of the MFCCs over all frames of an individual song with a Gaussian mixture model, and
the distance between two pieces is evaluated based on their corresponding GMMs. Similar work can be
found in [7] as well. Recently support vector machines (SVMs) have been used in the context of genre
classification [32][33], where individual frames are classified based on short-time features, which then
vote for the classification of the entire piece. In [21], SVMs are utilized for genre classification, with
song-level features and a Kullback-Leiber (KL) divergence-based kernel employed to measure distances
between songs.
A common disadvantage of the aforementioned methods is that either frame-level or song-level feature
vectors are treated as i.i.d. samples from a distribution assumed to characterize the song (for frame-
level feature vectors) or the genre (song-level feature vectors), with no dynamic behavior of the music
accounted for. However, “in order to understand and segment acoustic phenomena, the brain dynamically
links a multitude of short events which cannot always be separated” [4]. For the brain to recognize and
appreciate music, temporal cues are critical and contain information that should not be ignored. Therefore,
there have been many attempts at modeling frame-to-frame dynamics in music (notably hidden Markov
models (HMMs)), and complex and multi-faceted music analysis may benefit from considering dynamical
information (see [1] for a review). An HMM can accurately represent the statistics of sequential data,
and such models have been exploited extensively in speech recognition [5][22]; one may utilize HMMs
to learn short-term transitions across a piece of music. For a piece of monophonic music, Raphael [23]
modeled the overall music as an HMM and segmented the sampled music data into a sequence of
contiguous regions that correspond to notes. In [4], a song is considered as a collection of distinct
regions that present steady statistical properties, which are uncovered through HMM modeling. More
recently, Sheh and Ellis [29] transcribe audio recordings into chord sequences for music indexing and
retrieval via HMMs. Shao [28] and Scaringella [25] use HMMs for music genre classification. Building a
single HMM for a song performs well when the music’s “movement pattern”, modeled as a Markov-chain transition pattern, is relatively simple and thus the structure is of modest complexity (e.g., the number of
states is few). However, most real music is a complicated signal, which has a quasi-uniform feature set
and may have more than one “movement pattern” across the entire piece. One may in principle model
the entire piece by a single HMM, with distinct segments of music characterized by an associated set
of HMM states. In such a model there would be limited state transitions between states associated with
distinct segments of music. Semi-parametric techniques such as the iHMM [30] could be used to infer
the appropriate number of HMM states. We alternatively consider an HMM mixture model, because
the distinct mixture components allow analysis of the characteristics of different segments of a given
piece. An important question centers around the proper number of mixture components, and this issue is
addressed using ideas from modern Bayesian statistics.
Mixture models provide an effective means of density estimation. For example, GMMs have been
widely used in modeling uncertain data distributions. However, few researchers have worked on mixtures
of HMMs designed using Bayesian principles. The most related work can be found in [17] and [26],
where an HMM mixture is learned not in a Bayesian setting, but via the EM algorithm with the number
of mixture components preset. The work reported here develops an HMM mixture model in a Bayesian
setting using a non-parametric Dirichlet process (DP) as a common prior distribution on the parameters of
the individual HMMs. Through Bayesian inference, we learn the posterior of the model parameters, which
yields an ensemble of HMM mixture models rather than a point estimation of the model parameters. In
contrast, the maximum-likelihood (ML) Expectation-Maximization (EM) algorithm [10] provides a point
estimate of the model parameters, i.e., a single HMM mixture model.
Research on DP models can be traced to Ferguson [12]. A DP is characterized by a base distribution G0
and a positive scalar “innovation” parameter α. The DP is a distribution on a distribution G, specifically
the distribution G is drawn from the DP, denoted G ∼ DP (α,G0). Ferguson proved that there is positive
probability that a sample G drawn from a DP will be as close as desired to any probability function
defined on the support of G0. Therefore DP is rich enough to model parameters of individual components
with arbitrarily high complexity, and flexible enough to fit them well without any assumptions about the
functional form of the prior distribution.
In a DP formulation we let {xn}n=1,N represent the N segments of data associated with a given
piece, where each xn represents a sequence of time-evolving features. Each xn is assumed to be drawn
from an HMM characterized by parameters Θn, xn ∼ H(Θn). All Θn are assumed to be drawn i.i.d.
from the distribution G, i.e., Θ_n|G ∼ G i.i.d., where G is drawn from DP(α, G_0), i.e., G ∼ DP(α, G_0).
As discussed further below, this hierarchical model naturally yields a clustering of the data {xn}n=1,N ,
from which an HMM mixture model is constituted. Importantly, the DP HMM mixture is essentially an
ensemble of HMM mixtures where each mixture may have a different number of components and different
component parameters, therefore the number of mixture components need not be set a priori. The same
basic framework may be applied to virtually any data model, as shown by West [31], and for comparison
we also consider a Gaussian mixture model.
Two schemes are considered to perform DP-based mixture modeling, Gibbs sampling [15] and varia-
tional Bayes [9]. The variational Bayesian inference algorithm avoids the expensive computation of Gibbs
sampling while retaining much of the rigor of the Bayesian formulation. In this paper we focus on HMM
mixture models based on discrete observations; we have also considered continuous observation HMMs,
which yielded very similar results as the discrete case, and therefore the case of continuous observations
is omitted for brevity. We also note that the discrete HMMs are computationally much faster to learn
than their continuous counterparts. Although music modeling is the motivation and specific application
in this paper, our method is applicable to any discrete sequential data sets containing multiple underlying
patterns, for instance behavior modeling in video.
The remainder of the paper is organized as follows. The proposed HMM mixture model is described
in Section 2. Section 3 provides an introduction to the Dirichlet process and its application to HMM
mixture models. An MCMC-based sampling scheme and variational Bayes inference are developed in
Section 4. Section 5 describes the application of HMM mixture models in music analysis. In Section 6
we present experimental results on synthetic data as well as for real music data. Section 7 concludes the
work and outlines future directions.
II. HIDDEN MARKOV MIXTURE MODEL
A. Hidden Markov Model
For a sequence of observations x = {xt}t=1,T , an HMM assumes that the observation xt at time t
is generated by an underlying, unobservable discrete state st and that the state sequence s = {st}t=1,T
follows a first-order Markov process. In the discrete case considered here, xt ∈ {1, · · · ,M} and st ∈
{1, · · · , I}, where M is the alphabet size and I the number of states. Therefore, an HMM can be modeled
as Θ = {A,B,π}, where A, B and π are defined as
• A = {aij}, aij = P (st+1 = j|st = i): state transition probabilities,
• B = {bim}, bim = P (xt = m|st = i): emission probabilities,
• π = {πi}, πi = P (s1 = i): initial state distribution.
For given model parameters Θ, the joint probability of the observation sequence and the underlying state
sequence is expressed as
p(x, s | \Theta) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t x_t}.   (1)
The likelihood of data x given model parameters Θ results from a summation over all possible hidden
state sequences,
p(x | \Theta) = \sum_{s} \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t x_t}.   (2)
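In practice the sum in (2) is never enumerated directly; it can be evaluated in O(T I^2) operations with the standard forward recursion. The following minimal sketch (Python/NumPy, our own illustration with hypothetical variable names) is one way to compute p(x|Θ) for the discrete HMM defined above; for long sequences one would normally add per-step scaling or work in the log domain.

import numpy as np

def hmm_likelihood(x, A, B, pi):
    """Evaluate p(x | Theta) of (2) by the forward recursion.
    x  : sequence of observation symbols in {0, ..., M-1}
    A  : I x I state transition matrix, A[i, j] = P(s_{t+1}=j | s_t=i)
    B  : I x M emission matrix, B[i, m] = P(x_t=m | s_t=i)
    pi : length-I initial state distribution
    """
    alpha = pi * B[:, x[0]]               # alpha_1(i) = pi_i * b_{i, x_1}
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]  # alpha_t(j) = sum_i alpha_{t-1}(i) a_{ij} * b_{j, x_t}
    return alpha.sum()                    # p(x | Theta) = sum_i alpha_T(i)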
B. Hidden Markov Mixture Model
The hidden Markov mixture model with K^* mixture components may be written as

p(x | \{w_k\}_{k=1}^{K^*}, \{\Theta_k\}_{k=1}^{K^*}) = \sum_{k=1}^{K^*} w_k\, p(x | \Theta_k),   (3)

where p(x | \Theta_k) represents the kth HMM component with associated parameters \Theta_k, and w_k represents the mixing weight for the kth HMM, with \sum_{k=1}^{K^*} w_k = 1.
Most HMM mixture models have been designed using a maximum-likelihood (ML) solution, via the
EM algorithm [10] for which a single mixture model is learned. An important question concerns the
proper number of mixture components K∗, this constituting a problem of model selection. For this
purpose researchers have used such measures as the minimum description length (MDL) [24]. In the
work reported here, rather than a point estimate of an HMM mixture we learn an ensemble of HMM
mixtures. We assume a set X = {xn}n=1,N of N sequences of data. Each data sequence xn is assumed
to be drawn from an associated HMM with parameters Θn = {An,Bn,πn}, i.e., xn ∼ H(Θn), where
H(Θ) represents the HMM. The set of associated parameters {Θn}n=1,N are drawn i.i.d from a shared
prior G, i.e., Θ_n|G ∼ G i.i.d. The distribution G is itself drawn from a distribution, in particular a Dirichlet
process. We discuss below that proper selection of the DP “innovation parameter” yields a framework
by which the parameters {Θn}n=1,N are encouraged to cluster, and each such cluster corresponds to
an HMM mixture component in (3). The algorithm automatically balances the DP-generated desire to
cluster with the likelihood’s desire to choose parameters that match the data X well. The likelihood and
the DP prior are balanced in the posterior density function for parameters {Θn}n=1,N , and the posterior
distribution for {Θn}n=1,N is learned to constitute an ensemble of HMM mixture models.
III. DIRICHLET PROCESS PRIOR
A. Dirichlet Process
The Dirichlet process, denoted as DP (α,G0), is a random measure on measures and is parameterized
by the “innovation parameter” α and a base distribution G0. Assume we have N random variables
{Θn}n=1,N distributed according to G, and G itself is a random measure drawn from a Dirichlet process,
Θn|G ∼ G, n = 1, · · · , N,
G ∼ DP (α,G0),
where G0 is the expectation of G,
E[G] = G0. (4)
Defining \Theta_{-n} = \{\Theta_1, \cdots, \Theta_{n-1}, \Theta_{n+1}, \cdots, \Theta_N\} and integrating out G, the conditional distribution of \Theta_n given \Theta_{-n} follows a Polya urn scheme and has the following form [8],

p(\Theta_n | \Theta_{-n}, \alpha, G_0) = \frac{\alpha}{\alpha + N - 1} G_0 + \frac{1}{\alpha + N - 1} \sum_{i=1, i \neq n}^{N} \delta_{\Theta_i},   (5)
where \delta_{\Theta_i} denotes the distribution concentrated at the single point \Theta_i. Let \{\Theta^*_k\}_{k=1}^{K^*} be the distinct values taken by \{\Theta_n\}_{n=1}^{N} and let n^{-n}_k be the number of values in \Theta_{-n} that equal \Theta^*_k. We can rewrite (5) as

p(\Theta_n | \Theta_{-n}, \alpha, G_0) = \frac{\alpha}{\alpha + N - 1} G_0 + \frac{1}{\alpha + N - 1} \sum_{k=1}^{K^*} n^{-n}_k \delta_{\Theta^*_k}.   (6)
Equation (6) shows that when considering \Theta_n given all other observations \Theta_{-n}, this new sample is either drawn from the base distribution G_0 with probability \alpha/(\alpha + N - 1), or is selected from the existing draws \Theta^*_k according to a multinomial allocation, with probabilities proportional to the existing group sizes n^{-n}_k.
This highlights the valuable sharing property of the Dirichlet process: a new sample prefers to join a
group with a large population, i.e., the more often a parameter is shared, the more likely it will be shared
in the future.
The parameter α plays a balancing role between sampling a new parameter from the base distribution
G0 (“innovating”), or sharing previously sampled parameters. A larger α yields more clusters, and in the
limit α → ∞, G → G0; as α → 0, all {Θn}n=1,N are aggregated into a single cluster and take on the
same value.
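The sharing behavior in (6)-(7) and the role of α are easy to simulate. The short sketch below (our own illustration, not part of the model itself) draws cluster assignments sequentially from the Polya urn; a small α yields a few large clusters, while a large α creates many.

import numpy as np

def polya_urn_assignments(N, alpha, rng=np.random.default_rng(0)):
    """Sequentially assign N items to clusters following the Polya urn of (6)/(7)."""
    assignments = np.zeros(N, dtype=int)
    counts = []                               # n_k, number of items already in cluster k
    for n in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                  # existing clusters prop. to n_k, a new cluster prop. to alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                  # a new cluster is created ("innovation")
            counts.append(1)
        else:
            counts[k] += 1
        assignments[n] = k
    return assignments

# A small alpha yields few clusters; a large alpha yields many.
print(len(set(polya_urn_assignments(100, 0.5))), len(set(polya_urn_assignments(100, 20.0))))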
Prediction of a future sample can be directly extended from (6),

p(\Theta_{N+1} | \{\Theta_n\}_{n=1}^{N}, \alpha, G_0) = \frac{\alpha}{\alpha + N} G_0 + \frac{1}{\alpha + N} \sum_{k=1}^{K^*} n_k \delta_{\Theta^*_k},   (7)
where n_k is the number of \Theta_n that take value \Theta^*_k. It is proven in [31] that the posterior of G is still a DP, with the scaling parameter and the base distribution updated as follows,

p(G | \{\Theta_n\}_{n=1}^{N}, \alpha, G_0) = DP\left(\alpha + N,\ \frac{\alpha}{\alpha + N} G_0 + \frac{1}{\alpha + N} \sum_{k=1}^{K^*} n_k \delta_{\Theta^*_k}\right).   (8)
The above DP representation highlights its sharing property, but without an explicit form for G.
Sethuraman [27] provides an explicit characterization of G in terms of a stick-breaking construction.
Consider two infinite collections of independent random variables vk and Θ∗k, k = 1, 2, · · · ,∞, where
vk is drawn from a Beta distribution, denoted Beta(1, α), and Θ∗k is drawn independently from the base
distribution G_0. The stick-breaking representation of G is then defined as

G = \sum_{k=1}^{\infty} p_k \delta_{\Theta^*_k},   (9)

with

p_k = v_k \prod_{i=1}^{k-1} (1 - v_i),   (10)

where

v_k | \alpha \sim Beta(1, \alpha),
\Theta^*_k | G_0 \sim G_0.
This representation makes explicit that the random measure G is discrete with probability one and the
support of G consists of an infinite set of atoms located at Θ∗k, drawn independently from G0. The mixing
weights pk for atom Θ∗k are given by successively breaking a unit length “stick” into an infinite number
of pieces [27], with 0 \le p_k \le 1 and \sum_{k=1}^{\infty} p_k = 1.
The relationship between the stick-breaking representation and the Polya urn scheme is interpreted as
follows: if α is large, each vk drawn from Beta(1, α) will be very small, which means we will tend to
have many sticks of very short length. Consequently, G will consist of an infinite number of Θ∗k with
very small weights pk and therefore G will approach G0, the base distribution. For a small α, each vk
drawn from Beta(1, α) will be large, which will result in several large sticks with the remaining sticks
very small. This leads to a clustering effect on the parameters \{\Theta_n\}_{n=1}^{N}, as G will only have a large mass on a small subset of \{\Theta^*_k\}_{k=1}^{\infty} (those \Theta^*_k corresponding to the large sticks p_k).
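The truncated form of this construction, used later in the paper, can be sketched directly from (9)-(10); the function below (our own illustration, with an arbitrary truncation level) draws the weights p_k and makes the clustering effect of a small α visible.

import numpy as np

def stick_breaking_weights(alpha, K, rng=np.random.default_rng(0)):
    """Draw truncated stick-breaking weights p_1, ..., p_K as in (10)."""
    v = rng.beta(1.0, alpha, size=K)          # v_k ~ Beta(1, alpha)
    v[-1] = 1.0                               # truncation: the last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                      # p_k = v_k * prod_{i<k} (1 - v_i)

p_small = stick_breaking_weights(alpha=1.0, K=50)
p_large = stick_breaking_weights(alpha=20.0, K=50)
# With alpha = 1 most of the mass sits on a few atoms; with alpha = 20 it is spread thinly.
print(p_small[:5].sum(), p_large[:5].sum())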
B. DP HMM mixture models
Given the observed data X = {xn}n=1,N , each xn is assumed to be drawn from its own HMM H(Θn)
parameterized by Θn with the underlying state sequence sn. The common prior G on all Θn is given
as (9). Since G is discrete, different \Theta_n may share the same value \Theta^*_k, and take the value \Theta^*_k with probability p_k. Introducing an indicator variable c = \{c_n\}_{n=1}^{N} and letting c_n = k indicate that \Theta_n takes the value of \Theta^*_k, the hidden Markov mixture model with DP prior can be expressed as

x_n | c_n, \{\Theta^*_k\}_{k=1}^{\infty} \sim H(\Theta^*_{c_n}),
c_n | p \sim Mult(p),
v_k | \alpha \sim Beta(1, \alpha),
\Theta^*_k | G_0 \sim G_0,   (11)
where p = {pk}k=1,∞ is given by (10) and Mult(p) is the multinomial distribution with parameter p.
Assuming A, B and \pi are independent of each other, the base distribution G_0 is represented as G_0 = p(A)p(B)p(\pi). For computational convenience (use of appropriate conjugate priors), p(A) is specified as a product of Dirichlet distributions

p(A | u^A) = \prod_{i=1}^{I} Dir(\{a_{ij}\}_{j=1}^{I}; u^A),   (12)

where u^A = \{u^A_i\}_{i=1}^{I} are parameters of the Dirichlet distribution. Similarly, for p(B) and p(\pi), we have

p(B | u^B) = \prod_{i=1}^{I} Dir(\{b_{im}\}_{m=1}^{M}; u^B),   (13)

and

p(\pi | u^\pi) = Dir(\{\pi_i\}_{i=1}^{I}; u^\pi),   (14)

where u^B = \{u^B_m\}_{m=1}^{M} and u^\pi = \{u^\pi_i\}_{i=1}^{I}. As discussed in Section IV-A, the use of conjugate distributions yields analytic update equations for the MCMC sampler as well as for the variational Bayes algorithm.
It has been shown in [31] that the innovation parameter \alpha plays a critical role in defining the number of clusters inferred, with the appropriate \alpha depending on the number of data points. Therefore a prior distribution is placed on \alpha and a posterior on \alpha is learned from the data. We choose
p(α) = Ga(α; γ01, γ02), (15)
where Ga(α; γ01, γ02) is the Gamma distribution with selected parameters γ01 and γ02.
The corresponding graphical representation of the model is shown in Figure 1. We note that the stick-
breaking representation in (11) in principle uses an infinite number of sticks. It has been demonstrated [15]
that in practice a finite set of sticks may be employed with minimal error and may reduce computational
complexity; in the work reported here an appropriate truncation level K, i.e., a finite number of sticks,
is employed.
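For concreteness, the full truncated generative process of (11)-(15) may be sketched as follows (the dimensions, hyper-parameters, and uniform Dirichlet base below are illustrative choices of ours, not prescriptions from the paper).

import numpy as np
rng = np.random.default_rng(0)

I, M, K, alpha = 4, 8, 20, 1.0                 # states, alphabet size, truncation level, innovation

# Stick-breaking weights p from (10) and atoms Theta*_k drawn from the base G0 of (12)-(14).
v = rng.beta(1.0, alpha, size=K); v[-1] = 1.0
p = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
A = rng.dirichlet(np.ones(I), size=(K, I))     # A*_k: row i ~ Dir(u^A), here u^A = 1
B = rng.dirichlet(np.ones(M), size=(K, I))     # B*_k: row i ~ Dir(u^B)
pi = rng.dirichlet(np.ones(I), size=K)         # pi*_k ~ Dir(u^pi)

def sample_sequence(T):
    """Draw one observation sequence x_n of length T from the mixture model (11)."""
    k = rng.choice(K, p=p)                     # c_n ~ Mult(p)
    s = rng.choice(I, p=pi[k])                 # initial state from pi*_k
    x = []
    for _ in range(T):
        x.append(rng.choice(M, p=B[k, s]))     # emit symbol from row s of B*_k
        s = rng.choice(I, p=A[k, s])           # transition using row s of A*_k
    return np.array(x), k

x, k = sample_sequence(T=40)
print("sequence from component", k, ":", x[:10])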
IV. INFERENCE FOR HMM MIXTURE MODELS WITH DP PRIOR
A. MCMC Inference
The posterior distribution of the model parameters is expressed as p(A^*, B^*, \pi^*, p, \alpha, S, c | X), where A^* = \{A^*_k\}_{k=1}^{K}, B^* = \{B^*_k\}_{k=1}^{K}, \pi^* = \{\pi^*_k\}_{k=1}^{K}, S = \{\{s_{nk}\}_{n=1}^{N}\}_{k=1}^{K}, s_{nk} is the hidden state sequence corresponding to x_n when assuming x_n is generated from the kth HMM, and c is the indicator variable defined in Section III-B. The truncation level K corresponds to the number of sticks used in the stick-breaking construction. The posterior can be approximated by MCMC methods based on Gibbs sampling,
Fig. 1. Graphical representation of a Dirichlet process mixture model in terms of the stick-breaking construction, where x_n are observed sequences, \gamma_{01}, \gamma_{02} and G_0 are preset, and the other parameters are hidden variables to be learned.
by iteratively drawing samples for each random variable from its full conditional posterior distribution
given the most recent values of all other random variables. We follow Bayes’ rule to derive the full
conditional distributions for each random variable in p(A∗,B∗,π∗,p, α,S, c|X). These distributions are
prerequisites for performing the Gibbs sampling of the posterior. In each iteration of the Gibbs sampling
scheme, we sample a state sequence s_{nk} = \{s_{nk,t}\}_{t=1}^{T_n} for observation x_n. Given the HMM parameters \Theta^*_k and a particular state sequence s_{nk}, the joint likelihood for x_n and s_{nk} is obtained from (1).
Using A^*_{-k} to represent all A^*'s except A^*_k, the conditional posterior for A^*_k is obtained as

p(A^*_k | A^*_{-k}, B^*, \pi^*, p, \alpha, S, c, X) \propto p(A^*_k)\, p(X, S | A^*_k, B^*_k, \pi^*_k, c)
 \propto \prod_{i,j=1}^{I} a_{ij,k}^{u^A_j - 1} \prod_{\{l: c_l = k\}} \left\{ \pi_{s_{lk,1},k} \left[ \prod_{t=1}^{T_l - 1} a_{s_{lk,t} s_{lk,t+1},k} \right] \left[ \prod_{t=1}^{T_l} b_{s_{lk,t} x_{l,t},k} \right] \right\}
 \propto \prod_{i,j=1}^{I} a_{ij,k}^{u^A_j + u^A_{ij,k} - 1},   (16)

where u^A_{ij,k} \equiv \sum_{l: c_l = k} \sum_{t=1}^{T_l - 1} \delta(i = s_{lk,t}, j = s_{lk,t+1}). Equation (16) indicates that the conditional posterior for A^*_k is still a product of Dirichlet distributions,

p(A^*_k | A^*_{-k}, B^*, \pi^*, p, \alpha, S, c, X) = \prod_{i=1}^{I} Dir(\{a_{ij,k}\}_{j=1}^{I}; \{u^A_j + u^A_{ij,k}\}_{j=1}^{I}).   (17)
Similarly, the conditional posterior for B^*_k is given as

p(B^*_k | A^*, B^*_{-k}, \pi^*, p, \alpha, S, c, X) = \prod_{i=1}^{I} Dir(\{b_{im,k}\}_{m=1}^{M}; \{u^B_m + u^B_{im,k}\}_{m=1}^{M}),   (18)

where u^B_{im,k} \equiv \sum_{\{l: c_l = k\}} \sum_{t=1}^{T_l} \delta(i = s_{lk,t}, m = x_{l,t}), and the conditional posterior for \pi^*_k is

p(\pi^*_k | A^*, B^*, \pi^*_{-k}, p, \alpha, S, c, X) = Dir(\{\pi_{i,k}\}_{i=1}^{I}; \{u^\pi_i + u^\pi_{i,k}\}_{i=1}^{I}),   (19)
where u^\pi_{i,k} \equiv \sum_{l: c_l = k} \delta(i = s_{lk,1}).
The conditional posterior for a state sequence s_{nk} is given as

p(s_{nk,1} | S_{-n}, s_{nk,-1}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,1})\, p(s_{nk,2} | s_{nk,1})\, p(x_{n,1} | s_{nk,1}),
p(s_{nk,t} | S_{-n}, s_{nk,-t}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,t} | s_{nk,t-1})\, p(s_{nk,t+1} | s_{nk,t})\, p(x_{n,t} | s_{nk,t}),   2 \le t \le T_n - 1,
p(s_{nk,T_n} | S_{-n}, s_{nk,-T_n}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,T_n} | s_{nk,T_n - 1})\, p(x_{n,T_n} | s_{nk,T_n}).   (20)

Equation (20) shows that K state sequences will be sampled for each x_n, corresponding to \{\Theta^*_k\}_{k=1}^{K}.
West proved in [11] that the conditional posterior of \alpha is

p(\alpha | A^*, B^*, \pi^*, p, S, c, X) \propto p(\alpha)\, P(K | \alpha)
 \propto p(\alpha)\, \alpha^{K-1} (\alpha + N) \int_0^1 t^{\alpha} (1 - t)^{N-1} dt.   (21)

Given the prior p(\alpha) = Ga(\alpha; \gamma_{01}, \gamma_{02}), drawing a sample \alpha from the above conditional posterior can be realized in two steps: 1) sample an intermediate variable \eta from a beta distribution p(\eta | \alpha, K) = Beta(\eta; \alpha + 1, N) given the most recent value of \alpha; 2) sample a new \alpha value from

p(\alpha | \eta, K) = \rho\, Ga(\alpha; \gamma_{01} + K, \gamma_{02} - \log(\eta)) + (1 - \rho)\, Ga(\alpha; \gamma_{01} + K - 1, \gamma_{02} - \log(\eta)),   (22)

where \rho is defined by \rho/(1 - \rho) = (\gamma_{01} + K - 1)/\{N[\gamma_{02} - \log(\eta)]\}. The detailed derivation of this sampling scheme is found in [11].
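This two-step auxiliary-variable draw is only a few lines of code; a minimal sketch follows (assuming K is the current number of occupied components and N the number of sequences; the variable names are our own).

import numpy as np

def resample_alpha(alpha, K, N, gamma01, gamma02, rng=np.random.default_rng()):
    """Draw a new alpha from its conditional posterior via the mixture of Gammas in (22)."""
    eta = rng.beta(alpha + 1.0, N)                          # step 1: eta ~ Beta(alpha + 1, N)
    rate = gamma02 - np.log(eta)                            # updated rate gamma02 - log(eta)
    odds = (gamma01 + K - 1.0) / (N * rate)                 # rho / (1 - rho)
    rho = odds / (1.0 + odds)
    if rng.random() < rho:                                  # step 2: pick one of the two Gamma components
        return rng.gamma(shape=gamma01 + K, scale=1.0 / rate)
    return rng.gamma(shape=gamma01 + K - 1.0, scale=1.0 / rate)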
The prior for c_n in (11) can be rewritten as p(c_n) = \sum_{k=1}^{K} p_k \delta_k. Thus the conditional posterior for c_n is

p(c_n | c_{-n}, A^*, B^*, \pi^*, p, \alpha, S, X) \propto \sum_{k=1}^{K} p_{k,n} \delta_k,   n = 1, \cdots, N,   (23)

where p_{k,n} \propto p_k\, p(x_n, s_{nk} | A^*_k, B^*_k, \pi^*_k).
The conditional posterior for p is

p_1 = v^*_1,
p_k = v^*_k \prod_{j=1}^{k-1} (1 - v^*_j),   k \ge 2,
p(v^*_k) = Beta(v^*_k; 1 + n_k, \alpha + \sum_{l=k+1}^{K} n_l),   k = 1, \cdots, K - 1,   (24)

where n_k is the number of x_n associated with the kth component, i.e., the number of c_n = k.
A Gibbs sampling scheme called the Blocked Gibbs Sampler [15] is adopted to implement the MCMC
method. In the Blocked Gibbs Sampler, vK is set to 1. Let c∗ denote the set of current unique values of
c, a set of indices of those components that contain at least one sequence. In each iteration of the Gibbs
sampling scheme, the values are drawn in the following order:
(a) Draw samples for HMM mixture component parameters A^*_k, B^*_k, \pi^*_k, and state sequences s_{nk} given a particular HMM component, k = 1, \cdots, K:
For k \notin c^*, draw A^*_k, B^*_k, and \pi^*_k from (12)-(14) respectively, i.e., for those mixture components that are currently not occupied by any data, new samples are generated from the base distribution. A new state sequence s_{nk} for x_n is generated by the current \Theta^*_k according to a Markov process, which follows the same procedure as in (20), but without the emission probabilities.
For k \in c^*, draw A^*_k, B^*_k, and \pi^*_k from (17)-(19) respectively, i.e., for those mixture components that contain at least one sequence, component parameters are drawn from their conditional posteriors. A state sequence s_{nk} is generated from (20) given the current component parameters.
(b) Draw the membership, cn, for the data according to the density given by (23), where n = 1, · · · , N .
(c) Draw mixing weights, p, from the posterior in (24).
(d) Draw innovation parameter α from (21).
The above steps proceed iteratively, with each variable drawn either from the base distribution or its
conditional posterior given the most recent values of all other samples. According to MCMC theory, after
the Markov chain reaches its stationary distribution, these samples can be viewed as random draws from
the full posteriors p(A∗,B∗,π∗,S, c,p|X), and thus, in practice, the posteriors can be constructed by
collecting a sufficient number of samples after the above iteration stabilizes (the change of each variable
becomes stable) [13].
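Of the steps above, the DP-specific draws (b) and (c) are compact enough to sketch concretely; the functions below (a minimal illustration of ours, with the joint log-likelihoods log p(x_n, s_nk | Θ*_k) from (1) assumed to be precomputed) implement (23) and (24). The component-parameter and state-sequence draws in step (a) are standard Dirichlet and categorical draws and are omitted here.

import numpy as np

def draw_memberships(log_joint, p, rng=np.random.default_rng()):
    """Step (b): draw c_n from (23). log_joint[n, k] = log p(x_n, s_nk | Theta*_k)."""
    logw = np.log(p)[None, :] + log_joint                # log( p_k * p(x_n, s_nk | Theta*_k) )
    logw -= logw.max(axis=1, keepdims=True)              # normalize in the log domain for stability
    w = np.exp(logw); w /= w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=w_n) for w_n in w])

def draw_mixing_weights(c, alpha, K, rng=np.random.default_rng()):
    """Step (c): draw p from the stick-breaking posterior (24)."""
    n = np.bincount(c, minlength=K).astype(float)        # n_k = number of sequences with c_n = k
    tail = np.concatenate((np.cumsum(n[::-1])[::-1][1:], [0.0]))   # tail[k] = sum_{l > k} n_l
    v = rng.beta(1.0 + n[:-1], alpha + tail[:-1])        # v*_k ~ Beta(1 + n_k, alpha + sum_{l>k} n_l)
    v = np.append(v, 1.0)                                # v*_K = 1 in the blocked Gibbs sampler
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))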
An approximate predictive distribution can be obtained by averaging the predictive distributions across
these collected samples,
p(x_{N+1} | X, \gamma_{01}, \gamma_{02}, G_0) = \frac{1}{N_{sam}} \sum_{i=1}^{N_{sam}} \left[ \sum_{k=1}^{K} p^{(i)}_k\, p(x_{N+1} | \Theta^{*(i)}_k) \right],   (25)

where N_{sam} is the number of collected samples and p^{(i)}_k and \Theta^{*(i)}_k are the ith collected sample values of p_k and \Theta^*_k.
It is interesting to compare the traditional HMM mixture model in (3) to the DP-based predictive distribution in (25). In (3) the data are characterized by an HMM mixture model, in which each mixture component has a fixed set of parameters. In this model we must set the number of mixture components K^* and must learn the associated parameters. Equation (25) makes it explicit that the DP HMM mixture model is essentially an ensemble of HMM mixtures of the form in (3). Each mixture, the term in the parentheses in (25), is sampled from the Gibbs sampler. Because of the clustering properties of the DP, most components of each such mixture will have near-zero probability of being used, but the number of utilized mixture components for each mixture is different (and less than K).
B. Variational Inference
As stated above, in MCMC the posterior is estimated from collected samples. Although this yields an
accurate result theoretically (with sufficient samples) [13], it often requires vast computational resources
and the convergence of the algorithm is often difficult to diagnose. Variational Bayes inference is
introduced as an alternative method for approximating likelihoods and posteriors.
From Bayes’ rule, we have
p(\Phi | X, \Psi) = \frac{p(X | \Phi)\, p(\Phi | \Psi)}{\int p(X | \Phi)\, p(\Phi | \Psi)\, d\Phi},   (26)
where Φ = {A∗,B∗,π∗,v, α,S, c} are hidden variables of interest and Ψ = {uA,uB,uπ, γ01, γ02} are
hyper-parameters which determine the distribution of the model parameters. Since p is a function of v (see
(10)), estimating the posterior of v is equivalent to estimating the posterior of p. The integration in the
denominator of (26), called the marginal likelihood, or “evidence”, is generally intractable analytically
except for simple cases and thus estimating the posterior p(Φ|X,Ψ) cannot be achieved analytically.
Instead of directly estimating p(Φ|X,Ψ), variational methods seek a distribution q(Φ) to approximate
the true posterior distribution p(Φ|X,Ψ). Consider the log marginal likelihood
\log p(X | \Psi) = L(q(\Phi)) + D_{KL}(q(\Phi) \| p(\Phi | X, \Psi)),   (27)

where

L(q(\Phi)) = \int q(\Phi) \log \frac{p(X | \Phi)\, p(\Phi | \Psi)}{q(\Phi)}\, d\Phi,   (28)

and

D_{KL}(q(\Phi) \| p(\Phi | X, \Psi)) = \int q(\Phi) \log \frac{q(\Phi)}{p(\Phi | X, \Psi)}\, d\Phi.   (29)
DKL(q(Φ)||p(Φ|X,Ψ)) is the KL divergence between the approximate and true posterior. The approxima-
tion of the true posterior p(Φ|X,Ψ) using q(Φ) can be achieved by minimizing DKL(q(Φ)||p(Φ|X,Ψ)).
Since the KL divergence is nonnegative, from (27) this minimization is equivalent to maximization of
L(q(Φ)), which forms a strict lower bound on log p(X|Ψ),
log p(X|Ψ) ≥ L(q). (30)
For computational convenience, q(Φ) is expressed in a factorized form, with the same functional form
as the priors p(Φ|Ψ) and each parameter represented by its own conjugate prior. For the HMM mixture
model proposed in this paper, we assume
q(\Phi) = q(A^*, B^*, \pi^*, S, c, v, \alpha)
 = q(\alpha)\, q(v) \left\{ \prod_{k=1}^{K} [q(A^*_k)\, q(B^*_k)\, q(\pi^*_k)] \right\} \left\{ \prod_{n=1}^{N} \prod_{c_n=1}^{K} [q(c_n)\, q(s_{nc_n})] \right\},   (31)

where q(A^*_k), q(B^*_k), q(\pi^*_k) have the same form as in (12), (13) and (14) respectively but with different parameters, q(v) = \prod_{k=1}^{K-1} q(v_k) with q(v_k) = Beta(v_k; \beta_{1k}, \beta_{2k}), and q(\alpha) = Ga(\alpha; \gamma_1, \gamma_2). Once we
learn the parameters of these variational distributions from the data, we obtain the approximation of
p(\Phi | X, \Psi) by q(\Phi). The joint distribution of \Phi and the observations X is given as

p(X, \Phi) = p(X, A^*, B^*, \pi^*, v, \alpha, S, c | \Psi)
 = p(\alpha)\, p(v | \alpha) \prod_{k=1}^{K} [p(A^*_k)\, p(B^*_k)\, p(\pi^*_k)] \prod_{n=1}^{N} \prod_{c_n=1}^{K} [p(c_n | v)\, p(x_n, s_{nc_n} | A^*, B^*, \pi^*, c_n)],   (32)

where the priors p(A^*_k), p(B^*_k), p(\pi^*_k), and p(\alpha) are given in (12), (13), (14), and (15) respectively, and p(v | \alpha) = \prod_{k=1}^{K-1} p(v_k | \alpha) with p(v_k | \alpha) = Beta(v_k; 1, \alpha). All parameters \Psi in these prior distributions are assumed to be set.
We substitute (31) and (32) into (28) to yield

L(q) = \int q(\alpha)\, q(v) \left\{ \prod_{k=1}^{K} [q(A^*_k)\, q(B^*_k)\, q(\pi^*_k)] \right\} \left\{ \prod_{n=1}^{N} \prod_{c_n=1}^{K} [q(c_n)\, q(s_{nc_n})] \right\}
 \cdot \Big\{ \log p(\alpha) + \log p(v | \alpha) + \sum_{k=1}^{K} [\log p(A^*_k) + \log p(B^*_k) + \log p(\pi^*_k)]
 + \sum_{n=1}^{N} \sum_{c_n=1}^{K} [\log p(c_n | v) + \log p(x_n, s_{nc_n} | A^*, B^*, \pi^*, c_n)] - \log q(v) - \log q(\alpha)
 - \sum_{k=1}^{K} [\log q(A^*_k) + \log q(B^*_k) + \log q(\pi^*_k)] - \sum_{n=1}^{N} \sum_{c_n=1}^{K} [\log q(c_n) + \log q(s_{nc_n})] \Big\}\, d\Phi.   (33)
The optimization of the lower bound in (33) is realized by taking functional derivatives with respect to
each of the q(·) distributions while fixing the other q distributions and setting ∂L(q)/∂q(·) = 0 to find
the distribution q(·) that increases L [6]. The update equations for the variational posteriors are listed as
follows (their derivation is summarized in the Appendix):

(1) q(A^*_k) = \prod_{i=1}^{I} Dir(\{a_{ij,k}\}_{j=1}^{I}; \{u^A_{ij,k}\}_{j=1}^{I}), where
    u^A_{ij,k} = u^A_j + \sum_{n=1}^{N} \phi_{n,k} \left[ \sum_{t=1}^{T_n - 1} q(s_{nk,t} = i, s_{nk,t+1} = j) \right].
(2) q(B^*_k) = \prod_{i=1}^{I} Dir(\{b_{im,k}\}_{m=1}^{M}; \{u^B_{im,k}\}_{m=1}^{M}), where
    u^B_{im,k} = u^B_m + \sum_{n=1}^{N} \phi_{n,k} \left[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) \right].
(3) q(\pi^*_k) = Dir(\{\pi_{i,k}\}_{i=1}^{I}; \{u^\pi_{i,k}\}_{i=1}^{I}), where
    u^\pi_{i,k} = u^\pi_i + \sum_{n=1}^{N} \phi_{n,k}\, q(s_{nk,1} = i).
(4) q(v) = \prod_{k=1}^{K-1} q(v_k) = \prod_{k=1}^{K-1} Beta(v_k; \beta_{1k}, \beta_{2k}), where
    \beta_{1k} = 1 + \sum_{n=1}^{N} \phi_{n,k} and \beta_{2k} = \frac{\gamma_1}{\gamma_2} + \sum_{n=1}^{N} \sum_{i=k+1}^{K} \phi_{n,i}.
(5) q(\alpha) = Ga(\alpha; \gamma_1, \gamma_2), where
    \gamma_1 = \gamma_{01} + K - 1 and \gamma_2 = \gamma_{02} - \sum_{k=1}^{K} [\psi(\beta_{2k}) - \psi(\beta_{1k} + \beta_{2k})].
(6) q(s_{nk}) \propto \pi_{s_{nk,1},k} \left[ \prod_{t=1}^{T_n - 1} a_{s_{nk,t} s_{nk,t+1},k} \right] \left[ \prod_{t=1}^{T_n} b_{s_{nk,t} x_{n,t},k} \right], where a_{ij,k}, b_{im,k}, and \pi_{i,k} are given in (56)-(58) in the Appendix.
(7) q(c_n = k) = \phi_{n,k} = \frac{q'(c_n = k)}{\sum_{k'=1}^{K} q'(c_n = k')}, where q'(c_n = k) is given as (61) in the Appendix.
The local maximum of the lower bound L(q) is achieved by iteratively updating the parameters of the variational distributions q(·) according to the above equations. Each iteration is guaranteed to either increase the lower bound or leave it unchanged. We terminate the algorithm when the change in L(q) is negligibly small. L(q) can be computed by substituting the updated q(·) and the prior distributions p(\Phi|\Psi) into (33).
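As one concrete piece of this coordinate-ascent loop, updates (4) and (5) depend only on the responsibilities φ_{n,k}; a minimal sketch is given below (assuming phi is the N x K matrix of φ_{n,k} from step (7), and gamma1_old, gamma2_old are the current parameters of q(α); these names are ours).

import numpy as np
from scipy.special import psi            # digamma function, psi(x) = d/dx log Gamma(x)

def update_stick_and_alpha(phi, gamma01, gamma02, gamma1_old, gamma2_old):
    """Variational updates (4) and (5) for q(v_k) = Beta(beta1k, beta2k) and q(alpha) = Ga(gamma1, gamma2)."""
    N, K = phi.shape
    beta1 = 1.0 + phi[:, :-1].sum(axis=0)                          # beta_1k = 1 + sum_n phi_{n,k}
    tail = phi[:, ::-1].cumsum(axis=1)[:, ::-1]                    # tail[:, k] = sum_{i >= k} phi_{n,i}
    beta2 = gamma1_old / gamma2_old + tail[:, 1:].sum(axis=0)      # beta_2k = E[alpha] + sum_n sum_{i>k} phi_{n,i}
    gamma1 = gamma01 + K - 1
    gamma2 = gamma02 - np.sum(psi(beta2) - psi(beta1 + beta2))     # sum runs over the K-1 stick variables
    return beta1, beta2, gamma1, gamma2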
The prediction for a new observation sequence y is

p(y | X, \Psi) = \int \sum_{k=1}^{K} p_k(v)\, p(y | \Theta^*_k)\, dP(\Theta^*_k, v | X, \Psi)
 \approx \sum_{k=1}^{K} E[p_k] \int q(\Theta^*_k)\, p(y | \Theta^*_k)\, d\Theta^*_k.   (34)
Since the true posteriors are unknown, we use the variational posterior q(Θ∗k,v) from the VB optimization
to approximate p(Θ∗k,v|X,Ψ) in (34). However, the above quantity is still intractable because the states
and the model parameters are coupled. There are several possible methods for approximating the predictive
probability. One such method is to sample parameters from the posterior distribution and construct a
Monte Carlo estimate, but this approach is not efficient. An alternative is to construct a lower bound on
the approximation of the predictive quantity in (34) [6]. Another way suggested in [20] assumes that the
states and the model parameters are independent and the model can be evaluated at the mean (or mode)
of the variational posterior. This approach makes the prediction tractable and so is used in our following
experiments. Equation (34) can thus be approximated as
p(y | X, \Psi) \approx \sum_{k=1}^{K} E[p_k] \sum_{s_{yk}} E[\pi_{s_{yk,1},k}] \cdot \prod_{t=1}^{T_y - 1} E[a_{s_{yk,t} s_{yk,t+1},k}] \cdot \prod_{t=1}^{T_y} E[b_{s_{yk,t} y_t,k}],   (35)

where E[p_1] = E[v_1], E[p_k] = E[v_k] \prod_{i=1}^{k-1} (1 - E[v_i]) with E[v_k] = \frac{\beta_{1k}}{\beta_{1k} + \beta_{2k}}, E[a_{ij,k}] = \frac{u^A_{ij,k}}{\sum_{j'} u^A_{ij',k}}, E[b_{im,k}] = \frac{u^B_{im,k}}{\sum_{m'} u^B_{im',k}}, and E[\pi_{i,k}] = \frac{u^\pi_{i,k}}{\sum_{i'} u^\pi_{i',k}}. Equation (35) can be evaluated efficiently using the forward-backward algorithm.
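Concretely, (35) amounts to running a forward recursion for each component with the posterior-mean parameters and weighting the result by E[p_k]. The sketch below assumes uA, uB and upi are arrays of shape (K, I, I), (K, I, M) and (K, I) holding the updated Dirichlet parameters, and beta1, beta2 the Beta parameters of q(v); these array names are our own.

import numpy as np

def forward_likelihood(y, A, B, pi):
    """p(y | A, B, pi) for a discrete HMM via the forward recursion."""
    a = pi * B[:, y[0]]
    for t in range(1, len(y)):
        a = (a @ A) * B[:, y[t]]
    return a.sum()

def vb_predictive_likelihood(y, uA, uB, upi, beta1, beta2):
    """Approximate p(y | X, Psi) as in (35) using posterior-mean HMM parameters."""
    K = uA.shape[0]
    Ev = beta1 / (beta1 + beta2)                                               # E[v_k]
    Ep = np.append(Ev, 1.0) * np.concatenate(([1.0], np.cumprod(1.0 - Ev)))    # E[p_k]
    total = 0.0
    for k in range(K):
        A = uA[k] / uA[k].sum(axis=1, keepdims=True)                           # E[a_{ij,k}]
        B = uB[k] / uB[k].sum(axis=1, keepdims=True)                           # E[b_{im,k}]
        pi = upi[k] / upi[k].sum()                                             # E[pi_{i,k}]
        total += Ep[k] * forward_likelihood(y, A, B, pi)
    return total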
C. Truncation level
In the DP prior without truncation, G is a probability mass function with an infinite number of atoms. In practice, a mixture model estimated from N observed data sequences usually cannot have more than N mixture components (in the limit, each datum is drawn from its own mixture component). We therefore truncate G at a finite level K \le N, which saves computational resources.
Here we need to make clear the relationship between the truncation level K and the utilized number of mixture components K^*. A given prior G consists of K distinct values \{\Theta^*_k\}_{k=1}^{K} with associated probabilities p_k. However, \{\Theta_n\}_{n=1}^{N} may take only a subset of \{\Theta^*_k\}_{k=1}^{K}, which means the true utilized number of mixture components K^* may be less than K (and the clustering properties of the DP almost always yield fewer than K mixture components, unless \alpha is very large). Furthermore, since G is itself random, drawn from a DP, \{\Theta^*_k\}_{k=1}^{K} will have different values rather than a set of fixed values. Consequently, the utilized number of mixture components K^* varies with different \{\Theta^*_k\}_{k=1}^{K} but will typically be less than K. However, in the DP formulation, the unoccupied HMMs are still included as part of the mixture with very small mixing weights p_k, and continue to draw from the base distribution G_0, with the potential of being occupied at some point. Therefore, from this point forward we will refer to K-component DP HMM mixtures, essentially an ensemble of K-component mixture models, though the actual number of occupied HMMs will likely be less than K.
V. MUSIC ANALYSIS VIA HMM MIXTURE MODELS
Considering music to be a set of concurrently played notes (each note defining a location in feature
space) and note transitions (time-evolving features), music can be represented as a time series, and thus
modeled by an HMM. As music often follows a deliberate structure, the underlying, hidden mechanism
of that music should not be viewed as homogenous, but rather originating from a mixture of HMMs.
We are interested in modeling music in this manner to ultimately develop a similarity metric to aid in
music browsing and query. Our experiments are confined to classical music with a highly overlapping
observation space, but multiple “motion patterns”.
We sample the music clips at 22 kHz and divide each clip into 25 ms non-overlapping frames. We
extract 10-dimensional MFCC features for each frame using downloaded software1 with each vector an
observation in feature space. We then quantize our features into discrete symbols using vector quantization
(VQ), in which the codebook is learned on the whole set of tested music [18]. For our experiments, we
use a sequence of 1 second, or 40 observations without overlap; this transforms the music into a collection
of sequences, with each sequence assumed to originate from an HMM.
1http://www.ee.columbia.edu/∼dpwe/resources/matlab/rastamat/
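This front end can be approximated with standard tools; the sketch below substitutes librosa for the MFCC extraction and SciPy's k-means for the VQ codebook (these libraries, parameter names, and settings are our own stand-ins for the software cited above, not the authors' exact configuration).

import numpy as np
import librosa
from scipy.cluster.vq import kmeans2, vq

def music_to_symbol_sequences(wav_path, codebook=None, M=32, seq_len=40):
    """Convert a clip into 1-second sequences of discrete symbols (one symbol per 25 ms frame)."""
    y, sr = librosa.load(wav_path, sr=22050)                       # sample the clip at 22 kHz
    hop = int(0.025 * sr)                                          # 25 ms non-overlapping frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10, n_mels=40,
                                n_fft=hop, hop_length=hop).T       # one 10-D MFCC vector per frame
    if codebook is None:                                           # in the paper the codebook is learned
        codebook, _ = kmeans2(mfcc.astype(float), M, seed=0)       # on the whole set of tested music
    symbols, _ = vq(mfcc.astype(float), codebook)                  # quantize frames to discrete symbols
    n_seq = len(symbols) // seq_len
    return symbols[:n_seq * seq_len].reshape(n_seq, seq_len), codebook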
A. Music Similarity Measure
The proposed method for measuring music similarity computes the distance between the respective
HMM mixture models. Similar to the work in [2], we use Monte Carlo sampling to compare two HMM
mixture models. Let Mg be the learned HMM mixture model for music g, and Mh for music h. We
draw a sample set Sg of size Ng from Mg and a sample set Sh of size Nh from Mh. The distance
between any two HMM mixture models is defined as
D(Mg,Mh) =1
2[log p(Sh|Mg) − log p(Sh|Mh)] +
1
2[log p(Sg|Mh) − log p(Sg|Mg)] , (36)
where the first two terms on the right hand side of (36) are a measure of how well model Mg matches
observations generated by model Mh, relative to how well Mh matches the observations generated by
itself. The last two terms are in the same spirit and make the distance D(Mg,Mh) symmetric. Equation
(36) can be rewritten in terms of individual samples as
D(Mg,Mh) =1
2
Nh∑
n=1
log p(S(n)h |Mg) − log p(S
(n)h |Mh)
Tn+
1
2
Ng∑
m=1
log p(S(m)g |Mh) − log p(S
(m)g |Mg)
Tm,
(37)
where S(n)h is the nth sample in Sh with length Tn, S(m)
g is the mth sample in Sg with length Tm, and the
log-likelihood for each sample given the HMM mixture model can be obtained from (25) in the MCMC
implementation or from (35) in the VB approach. The similarity Sim(g, h) of music g and h is defined
by a kernel function as
Sim(g, h) = exp(−|D(Mg,Mh)|2
σ2), (38)
where σ is a fixed parameter; we notice that the choice of σ will not change the order of similarities.
This approach is well suited to large music databases, as it only requires the storage of the HMM mixture
parameters for each piece rather than the original music data itself.
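Given per-sequence log-likelihoods under each model, (37) and (38) reduce to a few lines; the sketch below (an illustration of ours) takes those log-likelihood functions, e.g. evaluations of (25) or (35), as arguments.

import numpy as np

def mixture_distance(samples_g, samples_h, loglik_g, loglik_h):
    """Distance (37) between two HMM mixture models M_g and M_h.
    samples_g / samples_h : lists of symbol sequences drawn from M_g / M_h
    loglik_g / loglik_h   : callables returning log p(sequence | model), e.g. from (25) or (35)"""
    term_h = np.sum([(loglik_g(s) - loglik_h(s)) / len(s) for s in samples_h])
    term_g = np.sum([(loglik_h(s) - loglik_g(s)) / len(s) for s in samples_g])
    return 0.5 * (term_h + term_g)

def similarity(distance, sigma=1.0):
    """Similarity kernel (38); the choice of sigma does not change the ordering of similarities."""
    return np.exp(-abs(distance) ** 2 / sigma ** 2)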
B. DP GMM modeling
For comparison, we also model each piece of music as a DP Gaussian mixture model (DP GMM),
where the 10-dimensional MFCC feature vector of a frame corresponds to one data point in the feature
space. In the same spirit as the DP HMM mixture model, each datum in the DP GMM is assumed drawn
from a Gaussian and a DP prior is placed on the mean and precision (inverse of variance) parameters
of all Gaussians, encouraging sharing of these parameters. An MCMC solution and variational approach
for the DP GMM can be found in [31] and [9], respectively. The posterior on the number of mixtures
used in DP GMM is learned automatically from the algorithm (as for the DP-based HMM mixture model
discussed above), however the dynamic (time-evolving) information between observations is discarded.
The music similarity under DP GMM modeling is defined similarly to (36), while the log-likelihood is
given by the DP GMM. In our experiments the DP GMM is trained via the VB method for computational
efficiency.
VI. EXPERIMENTS
The effectiveness of the proposed method is demonstrated with both synthetic data and real music
data. Our synthetic problem, for which ground truth is known, exhibits how a data set, including data
with different hidden mechanisms, can be characterized by an HMM mixture. We then explore music
similarity within the classical genre with three experiments.
For the DP HMM mixture modeling in each experiment, a Ga(\alpha; 1, 1) prior is placed on the scaling parameter \alpha, which slightly favors the data over the base distribution once enough data are collected to learn the mixture model. To avoid overfitting problems in the VB approach, we choose a reasonably large K with K < N (K = 50 in our experiments). In the MCMC implementation, we set uniform, or non-informative, priors for A, B, and \pi, i.e., u^A, u^B, and u^\pi are set to the unit vector 1. In the VB algorithm, we set u^A = 1/I, u^B = 1/M, and u^\pi = 1/I to ensure good convergence [6].
For the DP GMM analysis in our music experiments, we also employ a Ga(α; 1, 1) prior on the
scaling parameter α. The base distribution G0 for model parameters (mean µ and precision matrix Γ)
is a Normal-Wishart distribution, G0 = Norm(µ;m, (βΓ)−1)Wish(Γ; ν,W), where m and W are set
to the mean and inverse covariance matrix of all observed data, respectively; β = 0.01, and ν equals the
feature dimension.
A. Synthetic data
The synthetic data are generated from three distinct HMMs, each of which generates 50 sequences of
length 30 (i.e., mixing weights for the three HMM components are 1/3). The alphabet size of the discrete
observations is M = 3 for each HMM, and the numbers of states are 2, 2, and 3, respectively.
The parameters for these three HMMs are:
A_1 = [0.8 0.2; 0.2 0.8], B_1 = [0.85 0.10 0.05; 0.10 0.85 0.05], \pi_1 = [0.5; 0.5],
A_2 = [0.2 0.8; 0.8 0.2], B_2 = [0.05 0.10 0.85; 0.05 0.85 0.10], \pi_2 = [0.5; 0.5],
A_3 = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3], B_3 = [0.90 0.05 0.05; 0.05 0.90 0.05; 0.05 0.05 0.90], \pi_3 = [1/3; 1/3; 1/3].
. In this configuration, HMM 1 tends to generate the
first two symbols with a high probability of maintaining the previous symbol, HMM 2 primarily generates
the last two symbols and has a high tendency to switch, and HMM 3 produces all three observations
with equal probability. We apply both MCMC and VB implementations to model the synthetic data as an
HMM mixture. In our algorithm, we assume that each HMM mixture component has the same number of
states, which we set to I . Note that if a given mixture component has less than I states, the I-state model
may be used with a very small probability of transitioning to un-needed states, i.e., for a superfluous
state, the corresponding row and column in the state transition matrix A will be close to zero. This is
achieved through the Dirichlet priors put on each row of A, which promote sparseness. Therefore, I is
set to a relatively large value, which can be estimated in principle by choosing the upper bound of the
model evidence: the log likelihood in MCMC implementation and the log-marginal likelihood (the lower
bound) in the VB algorithm. For brevity, we neglect this step and set I empirically. We have observed that,
in general, setting different values of I does not influence the final results in our experiments, as long
as I is sufficiently large.
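For reference, the synthetic sequences described above can be generated with a few lines (a sketch reproducing the three HMMs and the 50 sequences of length 30 per component; the fixed random seed is our own choice).

import numpy as np
rng = np.random.default_rng(0)

# The three generating HMMs (state transition A, emission B, initial distribution pi).
hmms = [
    (np.array([[0.8, 0.2], [0.2, 0.8]]),
     np.array([[0.85, 0.10, 0.05], [0.10, 0.85, 0.05]]), np.array([0.5, 0.5])),
    (np.array([[0.2, 0.8], [0.8, 0.2]]),
     np.array([[0.05, 0.10, 0.85], [0.05, 0.85, 0.10]]), np.array([0.5, 0.5])),
    (np.full((3, 3), 1.0 / 3),
     np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]), np.full(3, 1.0 / 3)),
]

def sample_hmm(A, B, pi, T):
    """Draw one length-T observation sequence from a discrete HMM."""
    s = rng.choice(len(pi), p=pi)
    x = np.empty(T, dtype=int)
    for t in range(T):
        x[t] = rng.choice(B.shape[1], p=B[s])    # emit a symbol from state s
        s = rng.choice(len(pi), p=A[s])          # move to the next state
    return x

data = [sample_hmm(A, B, pi, T=30) for (A, B, pi) in hmms for _ in range(50)]   # 150 sequences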
Fig. 2 shows the clustering results of modeling the synthetic data as an HMM mixture, where (a)-(c) employ MCMC and (d)-(f) employ VB. We set the truncation level in the DP prior to K = 50 and the number of states to I = 5. Fig. 2 (a) and (d) represent the estimated distribution of the indicators, i.e., the probability that each sequence belongs to a given HMM component, which is computed by averaging the 200 collected c samples (spaced evenly 25 iterations apart in the collection phase) in MCMC, and is the variational posterior of the indicator matrix \phi_{n,k} in VB. The memberships in Fig. 2 (b) and (e) are obtained by setting the membership of a sequence to the component for which it has the highest probability. Fig. 2 (c) and (f) show the mixing weights of the HMM mixture model computed from the MCMC and VB algorithms respectively, with the mean and standard deviation estimated from 500 collected samples in MCMC, and in VB derived readily from (10) given E(v_k) = \frac{\beta_{1k}}{\beta_{1k} + \beta_{2k}} and var(v_k) = \frac{\beta_{1k}\beta_{2k}}{(\beta_{1k} + \beta_{2k})^2 (\beta_{1k} + \beta_{2k} + 1)}.
The results show that although the true number of HMM mixture components (here 3) is unknown a
priori, both algorithms automatically reduce the superfluous components and give an accurate approxi-
mation of the true component number (3 dominant mixture components in MCMC and 4 in VB). Fig. 2
(a) and (d) show that sequences generated from the same HMM have a high probability of belonging
to the same HMM mixture component. Fig. 2 (b) and (e) clearly indicates that the synthetic data can
be clustered into three major groups, which matches the ground truth. In comparison, MCMC slightly
outperforms VB; MCMC yields a mixture of 3 HMMs to VB’s 4 and less data are indexed incorrectly.
However, the computation of MCMC is expensive; it requires roughly 4 hours of CPU in MatlabTM on a
Pentium IV PC with a 1.73 GHz CPU to compute the results (5000 burn-in and 5000 burn-out iterations),
while VB requires less than two minutes. Considering the high efficiency and acceptable performance of
Fig. 2. DP HMM mixture modeling for synthetic data. (a) The averaged indicator matrix obtained from MCMC, where the element at (i, j) is the probability that the jth sequence is generated from the ith mixture component. (b) The membership of each sequence from MCMC, which is the maximum value along the columns of the averaged indicator matrix in (a). (c) Mixing weights of the DP HMM mixture with K = 50 computed by MCMC, the error bars representing the standard deviation of each weight. (d) The variational posteriors of the indicator matrix, \phi_{n,k}, in VB. (e) The sequence membership from VB, i.e., the maximum along the columns of (d). (f) Mixing weights of the DP HMM mixture with K = 50 by VB.
Fig. 3. Hinton diagram for the similarity matrix for 4 violin music clips (1: Bach, Violin Concerto BWV 1041 Mvt I; 2: Bach, Violin Concerto BWV 1042 Mvt I; 3: Stravinsky, Violin Concerto Mvt I; 4: Stravinsky, Violin Concerto Mvt IV). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

Considering the high efficiency and acceptable performance of the VB algorithm, we adopt it for our following application.
It is important to note that the synthetic data used have overlapping observation spaces, {1,2}, {2,3},
and {1,2,3}, and are generated by HMMs with more than one unique state number. Despite this, the
algorithm correctly clusters the data into three major HMMs, and setting one comprehensive state number
(I = 5) worked well in this problem.
Fig. 4. 2-D PCA features (x1 vs. x2, one panel per clip) for the four violin music clips considered in Fig. 3.
B. Music data
For our first experiment, we choose four 3-minute violin concerto clips from two different composers.
We chose to model the first few minutes rather than the whole of a particular piece in our experiments,
because we felt that in this way we could better control the ground truth for our initial application of these
methods. Clips 1 and 2 are from Bach and are considered similar; clips 3 and 4 are from Stravinsky and are also considered similar. The clips of one composer are considered different from those of the other. All four
music clips are played using essentially the same instruments, but their styles vary, which would indicate
a high overlap in feature space, but significantly different movement. We divided each clip into 180
sequences of length 40 and quantized the MFCC feature vectors into symbols of size M = 32 with the
VQ algorithm. An HMM mixture model is then built for each clip with mixture component number, or
truncation level, set to K = 50 and number of states to I = 8; the truncation level of the DP GMM
was set to 50 as well. Fig. 3 shows the computed similarity between each clip for both HMM mixture
and GMM modeling using a Hinton diagram [14], in which the size of a block is inversely proportional
to the value of the corresponding matrix elements. To compute the distances, we use 100 sequences
of length 50 drawn from the HMM mixture and 200 samples drawn from the GMM. It is difficult to
compare the similarity matrices directly since the distance metrics are not on the same scale, but as our
goal is to suggest music by similarities, we may focus on the relative similarities within each matrix. Our
modeling by HMM mixture produces results that fit with our intuition2. However, our GMM results do
not catch the connection between clips 3 and 4, and, proportionally, do not contrast clips 1 and 2 from
3 and 4 as well. This is because of their high overlap in feature space, which we show by reducing the
10-dimensional MFCC features to a 2-dimensional space through principal component analysis (PCA)
[16] in Fig. 4 to give a sense of their distribution. We observe that the features for all four clips almost
share the same range in the feature space and have a similar distribution. If we model each piece of
music without taking into consideration the motion in feature space, it is clear that the results will be very
similar. The improved similarity recognition can be attributed to the temporal consideration given by the
HMM mixture model. Fig. 5 shows the mixing weights of the VB-learned HMM mixture models for the
four violin clips. Although the number of significant weights is initially high, the algorithm automatically
reduces this number by suppressing the superfluous components to that necessary to model each clip:
the expected mixing weights for these unused HMMs are near zero with high confidence, indicated by
the small variance of the mixing weights. We notice that each clip is represented by a different number
of dominant HMMs. For example, clips 1 and 2 require fewer HMMs, which is understandable, as the
2Prof. Scott Lindroth, Chair of the Duke University Music Department, provided guidance in assessing similarities between
the pieces considered
Fig. 5. Mixing weights of the HMM mixtures for the four violin clips (one panel per clip) when K = 50, computed by VB; the error bars represent the standard deviation of each weight.
music for these particular clips is more homogeneous. We give an example of a posterior membership
for clip 4 in Fig. 6, where those parts having similar styles should be drawn from the same HMM. The
fact that the first 20 seconds of this clip are repeated during the last 20 can be seen in their similar
membership patterns.
Fig. 6. Memberships for violin clip 4 considered in Fig. 3-5.

Fig. 7. Hinton diagram for the similarity matrix for 10 music clips (1: Beethoven, Consecration of the House; 2: Chopin, Etudes Op 10 No 01; 3: Rachmaninov, Preludes Op 23 No 02; 4: Scarlatti, Sonata K. 135; 5: Scarlatti, Sonata K. 380; 6: Debussy, String Quartet Op 10 Mvt II; 7: Ravel, String Quartet in F Mvt II; 8: Shostakovich, String Quartet No 08 Mvt II; 9: Bach, Violin Concerto BWV 1041 Mvt I; 10: Bach, Violin Concerto BWV 1042 Mvt I). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

For our second experiment, we compute the similarities between ten 3-minute clips, this time of works of different formats (instrumentation). These clips were chosen deliberately with the following intended
clustering: 1) clip 1 is unique in style and instrumentation; 2) clips 2 and 3, 4 and 5, 6 and 7, and 9 and 10 are intended to be paired together; 3) clip 8 is also unique, but is of the same format (instrumentation) as clips 6 and 7. The similarities of these clips are computed in the same manner as in the first experiment
via HMM mixture (with M = 32, I = 8, and K = 50) and GMM modeling. The Hinton diagrams of the
corresponding similarity matrices are shown in Fig. 7. Again, our intuition is consistent in this experiment
with HMM mixture modeling, but less accurate with GMM modeling. Though the GMM model does
not contradict our intuition, the similarities are not as stark as in the HMM mixture, especially in the
case of clip 1, which was selected to be unique.
Table I lists the number of dominant and utilized HMMs in the mixture model for each clip. Dominant
HMMs are those for which the corresponding expected mixing weights E[pk] > ε (ε is a small value).
Utilized HMMs are those that contain at least one sequence after the membership is calculated. Given
K = 50, the results show that each clip needs fewer HMMs than the preset truncation level.
TABLE I
NUMBER OF HMM MIXTURE COMPONENTS UTILIZED FOR EACH CLIP CONSIDERED IN FIG. 7.
Clip index:                               1   2   3   4   5   6   7   8   9  10
Number of dominant HMMs (p_k > 0.01):    26  20  22  25  25  26  27  27  29  29
Number of dominant HMMs (p_k > 0.015):   10   6   7  14  11  13  12  10  13  10
Number of utilized HMMs:                 12   6   9  14  11  14  11  11  13  10

Our third experiment posed our most challenging problem. In this experiment, we look at 15 two-minute clips from three different classical formats: solo piano, string quartet and orchestral, with five pieces from each. The feature space covered by these works is significantly larger than that spanned by the other two experiments. Given M = 64 and I = 8, we built an HMM mixture for each of the 15 clips, of which we considered 1 and 2, 3 and 4, 6 and 7, 9 and 10, and 11 and 12 to be the most similar pairs. Clips 8 and 13 were considered to be very different from all other clips and can therefore be considered
anomalies. Clips 5 and 14 were also considered more unique than similar to any other piece. The results
of the HMM modeling as well as GMM modeling are shown in Fig. 8. The HMM mixture better catches
the connection between clips 3 and 4 and deemphasizes the connections between clip 6 and clips 9 and
10. Also, though still not considered unique, the similarities of clip 5 are reduced in proportion to other
similarities within the solo piano format, indicating the beneficial effects of the temporal information. The
results indicate that clip 14 is closer to the other orchestral works than expected from human listening. In
general, the GMM and HMM mixture modeling approaches are more comparable in our third experiment,
with a slight edge given to the HMM mixture resulting from the temporal information.
It is important to note that, theoretically, the GMM should always perform well when comparing two
similar pieces, as their similarity will be manifested in part by an overlapping feature space. However,
in cases where the overlapping features contain significantly different motions, be it in rhythm or note
transitions, the GMM inherently will not perform well and will tend to yield larger similarity than the
“truth”, whereas our HMM mixture, in theory, will detect and account for these variations.
VII. CONCLUSION AND FUTURE WORK
We have developed a discrete HMM mixture model for situations in which a sequential data set may have several different underlying mechanisms that cannot be adequately modeled by a single HMM. This model is built in a Bayesian setting using DP priors, which have the advantage of automatically determining the number of components and the associated memberships by encouraging parameter sharing.
Gibbs sampler, and a VB approach via maximizing a variational lower bound.
The performance of HMM mixture modeling was demonstrated on both synthetic and music data
sets. We compared MCMC and VB implementations with the synthetic data, where we showed that VB
provides an acceptable and highly efficient alternative to MCMC, allowing consideration of large data
sets, such as music. For our music application, we presented three experiments within the classical music
genre, where the MFCC feature spaces for each piece of music was highly overlapped. We compared our
Fig. 8. Hinton diagram for the similarity matrix for 15 music clips (1: Bach, Sinfonia No 02; 2: Bach, Sinfonia No 11; 3: Beethoven, Piano Works/Bagatelle Op 126 No 2; 4: Beethoven, Piano Works/Bagatelle Op 126 No 4; 5: Prokofiev, Romeo and Juliet No 01; 6: Brahms, String Quartet No 1 Mvt I; 7: Brahms, String Quartet No 1 Mvt III; 8: Bartok, String Quartet No 4 Mvt V; 9: Mozart, String Quartet No 16 Mvt III; 10: Mozart, String Quartet No 17 Mvt I; 11: Beethoven, Symphony No 8 Mvt IV; 12: Schubert, Symphony No 9 Mvt IV; 13: Bach, Brandenburg Concerto No 1 Mvt I; 14: Stravinsky, Symphony in Three Movements I; 15: Brahms, Symphony No 1 Mvt I). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

We compared our
We compared our HMM mixture model to the GMM, computing similarities between music pieces as a measure
of performance. The results showed that HMM mixture modeling was better able to distinguish the content
of the given music, generally providing sharper contrasts in similarity than the GMM. The results from
this and the synthetic data are both promising. Where GMMs have been shown to succeed at genre
classification, our HMM mixture model matches this performance and exceeds it within a single genre
by taking the temporal character of the given music into account. However, we have to break the input
sequence into a number of short segments, which creates a trade-off between the accuracy of boundary
placement (which favors short segments) and the accuracy of HMM parameter estimation and recognition
(which favors longer segments). In addition, passages that span the boundary between two segments may
show characteristics of more than one HMM. These issues constitute important subjects for future work.
APPENDIX: DERIVATION OF UPDATING EQUATIONS IN VB APPROACH
1) Update q(A_k^*), k = 1, \cdots, K: Only those terms related to q(A_k^*) in (33) are kept for the derivation,
and then we have

\mathcal{L}(q(A_k^*)) = \int q(A_k^*) \Big[ \log p(A_k^*) + \sum_{n=1}^{N} q(c_n = k) \int q(s_{nk}) \log p(x_n, s_{nk} \mid A^*, B^*, \pi^*, c_n) - \log q(A_k^*) \Big] dA_k^*
= \int q(A_k^*) \big[ Z - \log q(A_k^*) \big] \, dA_k^*,   (39)
where
Z = \sum_{i,j=1}^{I} (u_j^A - 1) \log a_{ij,k} + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{s_{nk}} q(s_{nk}) \sum_{t=1}^{T_n} \log a_{s_{nk,t} s_{nk,t+1}, k} \Big]   (40)
with \phi_{n,k} = q(c_n = k). The quantity Z can be rewritten as

Z = \sum_{i,j=1}^{I} (u_j^A - 1) \log a_{ij,k} + \sum_{n=1}^{N} \phi_{n,k} \sum_{i,j=1}^{I} \sum_{s_{nk}} q(s_{nk}) \sum_{t=1}^{T_n} \log a_{ij,k} \, \delta(s_{nk,t} = i, s_{nk,t+1} = j)
= \sum_{i,j=1}^{I} \Big\{ u_j^A + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) \Big] - 1 \Big\} \log a_{ij,k},   (41)
where \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) can be computed from the forward-backward algorithm [22]. Then
we have

\mathcal{L}(q(A_k^*)) = - \int q(A_k^*) \log \frac{q(A_k^*)}{\prod_{i,j=1}^{I} a_{ij,k}^{\,u_{ij,k}^A - 1}} \, dA_k^*,   (42)
where
u_{ij,k}^A = u_j^A + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) \Big].   (43)
According to Gibbs' inequality, i.e.,

q^*(x) = \arg\min_{q(x)} \int q(x) \log \frac{q(x)}{p(x)} \, dx = \frac{1}{C} p(x),   (44)

where C is a normalizing constant ensuring \int q^*(x) dx = 1, \mathcal{L}(q(A_k^*)) is maximized with respect to
q(A_k^*) by choosing

q(A_k^*) \propto \prod_{i,j=1}^{I} a_{ij,k}^{\,u_{ij,k}^A - 1}, \quad \text{i.e.,} \quad q(A_k^*) = \prod_{i=1}^{I} \mathrm{Dir}(\{a_{ij,k}\}_{j=1,\ldots,I}\,;\,\{u_{ij,k}^A\}_{j=1,\ldots,I}),   (45)

which is a product of Dirichlet distributions with hyper-parameters u_{ij,k}^A.
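For concreteness, the hyper-parameter update (43) reduces to a weighted accumulation of expected transition counts. The Python sketch below illustrates this; the array names and shapes (phi, xi, uA0) and the random stand-in inputs are our own conventions rather than the outputs of an actual forward-backward pass.

import numpy as np

I, K, N = 3, 2, 4                 # HMM states, mixture components, sequences (toy sizes)
rng = np.random.default_rng(0)
uA0 = np.ones(I)                  # prior Dirichlet parameters u^A_j (assumed symmetric here)
phi = rng.dirichlet(np.ones(K), size=N)          # phi[n, k] = q(c_n = k)

# xi[n, k, i, j] stands in for sum_t q(s_{nk,t} = i, s_{nk,t+1} = j), the expected
# transition counts for sequence n under component k (random stand-ins here).
xi = rng.gamma(1.0, 1.0, size=(N, K, I, I))

# u^A_{ij,k} = u^A_j + sum_n phi[n, k] * xi[n, k, i, j], as in (43)
uA = uA0[None, None, :] + np.einsum('nk,nkij->kij', phi, xi)
print(uA.shape)                   # (K, I, I); each row i of component k parameterizes a Dirichlet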
2) Update q(B_k^*), k = 1, \cdots, K: \mathcal{L}(q) is maximized with respect to q(B_k^*) following a similar procedure,
and the optimal q(B_k^*) is obtained as

q(B_k^*) = \prod_{i=1}^{I} \mathrm{Dir}(\{b_{im,k}\}_{m=1,\ldots,M}\,;\,\{u_{im,k}^B\}_{m=1,\ldots,M}),   (46)

where

u_{im,k}^B = u_m^B + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) \Big].   (47)
3) Update q(\pi_k^*), k = 1, \cdots, K: Similarly, the optimal q(\pi_k^*) to maximize \mathcal{L}(q) is derived as

q(\pi_k^*) = \mathrm{Dir}(\{\pi_{i,k}\}_{i=1,\ldots,I}\,;\,\{u_{i,k}^{\pi}\}_{i=1,\ldots,I}),   (48)

where

u_{i,k}^{\pi} = u_i^{\pi} + \sum_{n=1}^{N} \phi_{n,k} \, q(s_{nk,1} = i).   (49)

The quantities \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) in (47) and q(s_{nk,1} = i) in (49) can also be computed from the forward-backward algorithm.
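The updates (47) and (49) can be accumulated in the same manner. In the sketch below, gamma_marg stands in for the smoothed marginals q(s_{nk,t} = i) returned by the forward-backward algorithm, and all names, shapes, and random inputs are our own illustrative conventions.

import numpy as np

I, M, K, N, T = 3, 5, 2, 4, 20    # states, codebook size, components, sequences, length (toy)
rng = np.random.default_rng(1)
uB0 = np.ones(M)                  # prior parameters u^B_m
upi0 = np.ones(I)                 # prior parameters u^pi_i
phi = rng.dirichlet(np.ones(K), size=N)                  # phi[n, k] = q(c_n = k)
x = rng.integers(0, M, size=(N, T))                      # discrete (VQ-coded) observations
gamma_marg = rng.dirichlet(np.ones(I), size=(N, K, T))   # gamma_marg[n, k, t, i] = q(s_{nk,t} = i)

uB = np.tile(uB0, (K, I, 1))      # will hold u^B_{im,k}, laid out as (K, I, M)
upi = np.tile(upi0, (K, 1))       # will hold u^pi_{i,k}, laid out as (K, I)
for n in range(N):
    for k in range(K):
        for t in range(T):
            # q(s_{nk,t} = i, x_{n,t} = m) is nonzero only at the observed symbol m = x[n, t]
            uB[k, :, x[n, t]] += phi[n, k] * gamma_marg[n, k, t, :]
        upi[k, :] += phi[n, k] * gamma_marg[n, k, 0, :]  # the q(s_{nk,1} = i) term in (49)
print(uB.shape, upi.shape)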
4) Update q(\alpha): The lower bound related to q(\alpha) is given as

\mathcal{L}(q(\alpha)) = \int q(\alpha) \Big[ \log p(\alpha) + \int q(\mathbf{v}) \log p(\mathbf{v} \mid \alpha) \, d\mathbf{v} - \log q(\alpha) \Big] d\alpha.   (50)

Using the property that \int \mathrm{Dir}(p; u) \log p_i \, dp = \psi(u_i) - \psi(\sum_j u_j), where \psi(x) = \frac{\partial}{\partial x} \log \Gamma(x), and
noting that the Beta distribution is a two-parameter Dirichlet distribution, (50) is maximized at

q(\alpha) = \mathrm{Ga}(\alpha; \gamma_1, \gamma_2)   (51)

with \gamma_1 = \gamma_{01} + K - 1 and \gamma_2 = \gamma_{02} - \sum_{k=1}^{K-1} [\psi(\beta_{2k}) - \psi(\beta_{1k} + \beta_{2k})].
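As a small numerical illustration of (51), with arbitrary stand-in values for the prior parameters and for the Beta parameters of q(v):

import numpy as np
from scipy.special import digamma

K = 10
gamma01, gamma02 = 1.0, 1.0                    # prior Ga(alpha; gamma01, gamma02) (stand-ins)
rng = np.random.default_rng(2)
beta1 = 1.0 + rng.gamma(1.0, 5.0, size=K - 1)  # stand-ins for the Beta parameters in (54)
beta2 = 1.0 + rng.gamma(1.0, 5.0, size=K - 1)

gamma1 = gamma01 + K - 1
gamma2 = gamma02 - np.sum(digamma(beta2) - digamma(beta1 + beta2))
print(gamma1, gamma2)                          # q(alpha) = Ga(alpha; gamma1, gamma2)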
5) Update q(\mathbf{v}): Collecting all the quantities related to q(\mathbf{v}), we have

\mathcal{L}(q(\mathbf{v})) = \int q(\mathbf{v}) \Big[ \int q(\alpha) \log p(\mathbf{v} \mid \alpha) \, d\alpha + \sum_{n=1}^{N} \sum_{k=1}^{K} q(c_n = k) \log p(c_n = k \mid \mathbf{v}) - \log q(\mathbf{v}) \Big] d\mathbf{v}.   (52)

p(c_n = k \mid \mathbf{v}) is rewritten in [9] as

p(c_n = k \mid \mathbf{v}) = \prod_{l=1}^{K} (1 - v_l)^{\mathbf{1}[c_n > l]} \, v_l^{\mathbf{1}[c_n = l]},   (53)

where \mathbf{1}[x] is the indicator function. Substituting (53) into (52) and recalling that p(v_k \mid \alpha) = \mathrm{Beta}(v_k; 1, \alpha)
and q(\alpha) = \mathrm{Ga}(\alpha; \gamma_1, \gamma_2), the optimized q(\mathbf{v}) becomes

q(\mathbf{v}) = \prod_{k=1}^{K-1} q(v_k) = \prod_{k=1}^{K-1} \mathrm{Beta}(v_k; \beta_{1k}, \beta_{2k}),   (54)

where \beta_{1k} = 1 + \sum_{n=1}^{N} \phi_{n,k} and \beta_{2k} = \frac{\gamma_1}{\gamma_2} + \sum_{n=1}^{N} \sum_{i=k+1}^{K} \phi_{n,i}.
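For illustration, the Beta-parameter update in (54) amounts to accumulating the responsibilities and their upper tails; in the sketch below phi, gamma1, and gamma2 are stand-in values of our own choosing.

import numpy as np

N, K = 100, 10
rng = np.random.default_rng(3)
phi = rng.dirichlet(np.ones(K), size=N)         # phi[n, k] = q(c_n = k) (stand-in values)
gamma1, gamma2 = 10.0, 2.0                      # current parameters of q(alpha) (stand-ins)

counts = phi.sum(axis=0)                                           # sum_n phi[n, k]
tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))  # sum_n sum_{i>k} phi[n, i]
beta1 = 1.0 + counts[:K - 1]                    # beta_{1k} = 1 + sum_n phi[n, k]
beta2 = gamma1 / gamma2 + tail[:K - 1]          # beta_{2k} uses E[alpha] = gamma1 / gamma2
print(beta1.shape, beta2.shape)                 # both (K-1,): q(v_k) = Beta(v_k; beta1[k], beta2[k])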
6) Update q(s_{nk}), n = 1, \cdots, N, k = 1, \cdots, K: The \mathcal{L}(q(s_{nk})) is expressed as

\mathcal{L}(q(s_{nk})) = \int q(s_{nk}) \, \phi_{n,k} \Big[ \int q(A_k^*) \sum_{t=1}^{T_n - 1} \log a_{s_{nk,t} s_{nk,t+1}, k} + \int q(B_k^*) \sum_{t=1}^{T_n} \log b_{s_{nk,t} x_{n,t}, k} + \int q(\pi_k^*) \log \pi_{s_{nk,1}, k} - \log q(s_{nk}) \Big] ds_{nk}.   (55)
Using the property again that \int \mathrm{Dir}(p; u) \log p_i \, dp = \psi(u_i) - \psi(\sum_j u_j), we define

a_{ij,k} = \exp\Big[ \psi(u_{ij,k}^A) - \psi\Big( \sum_{j'=1}^{I} u_{ij',k}^A \Big) \Big]   (56)

b_{im,k} = \exp\Big[ \psi(u_{im,k}^B) - \psi\Big( \sum_{m'=1}^{M} u_{im',k}^B \Big) \Big]   (57)

\pi_{i,k} = \exp\Big[ \psi(u_{i,k}^{\pi}) - \psi\Big( \sum_{i'=1}^{I} u_{i',k}^{\pi} \Big) \Big].   (58)
Maximizing (55) is achieved at

q(s_{nk}) \propto \pi_{s_{nk,1}, k} \Big[ \prod_{t=1}^{T_n - 1} a_{s_{nk,t} s_{nk,t+1}, k} \Big] \Big[ \prod_{t=1}^{T_n} b_{s_{nk,t} x_{n,t}, k} \Big].   (59)
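The quantities in (56)-(58) are easily formed from the current hyper-parameters and are then used in place of point estimates of A_k^*, B_k^*, and \pi_k^* inside the standard forward-backward recursions implied by (59). The sketch below, with random stand-in hyper-parameter arrays of the shapes used above, illustrates the computation.

import numpy as np
from scipy.special import digamma

I, M, K = 3, 5, 2
rng = np.random.default_rng(4)
uA = rng.gamma(2.0, 1.0, size=(K, I, I))   # stand-ins for u^A_{ij,k}
uB = rng.gamma(2.0, 1.0, size=(K, I, M))   # stand-ins for u^B_{im,k}
upi = rng.gamma(2.0, 1.0, size=(K, I))     # stand-ins for u^pi_{i,k}

a_sub  = np.exp(digamma(uA)  - digamma(uA.sum(axis=2,  keepdims=True)))   # (56)
b_sub  = np.exp(digamma(uB)  - digamma(uB.sum(axis=2,  keepdims=True)))   # (57)
pi_sub = np.exp(digamma(upi) - digamma(upi.sum(axis=1, keepdims=True)))   # (58)

# Each row sums to slightly less than one; feeding these into forward-backward
# yields q(s_{nk}) and the marginals needed in (43), (47), and (49).
print(a_sub.sum(axis=2).round(3), pi_sub.sum(axis=1).round(3))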
7) Update q(c_n = k), n = 1, \cdots, N, k = 1, \cdots, K: \mathcal{L}(q(c_n = k)) is given as

\mathcal{L}(q(c_n = k)) = \int q(c_n = k) \Big[ \int q(\mathbf{v}) \log p(c_n = k \mid \mathbf{v}) + \int q(s_{nk}) q(A_{c_n}^*) q(B_{c_n}^*) q(\pi_{c_n}^*) \log p(x_n, s_{nk} \mid A^*, B^*, \pi^*, c_n) - \log q(c_n = k) \Big].   (60)
Define

q'(c_n = k) = \exp\Big\{ \sum_{l=1}^{k-1} [\psi(\beta_{2l}) - \psi(\beta_{1l} + \beta_{2l})] + [\psi(\beta_{1k}) - \psi(\beta_{1k} + \beta_{2k})] + \sum_{s_{nk}} q(s_{nk}) \Big( \log \pi_{s_{nk,1}, k} + \sum_{t=1}^{T_n - 1} \log a_{s_{nk,t} s_{nk,t+1}, k} + \sum_{t=1}^{T_n} \log b_{s_{nk,t} x_{n,t}, k} \Big) \Big\};   (61)

then the q(c_n = k) that optimizes (60) is expressed as

q(c_n = k) = \frac{q'(c_n = k)}{\sum_{k'=1}^{K} q'(c_n = k')}.   (62)
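For illustration, (61)-(62) can be evaluated in the log domain and then normalized over k. In the sketch below, loglik stands in for the expectation over q(s_{nk}) of the complete-data log-likelihood appearing in (61), and all values are random stand-ins of our own choosing.

import numpy as np
from scipy.special import digamma, logsumexp

N, K = 4, 3
rng = np.random.default_rng(5)
beta1 = 1.0 + rng.gamma(1.0, 5.0, size=K)        # stand-ins for the Beta parameters of q(v)
beta2 = 1.0 + rng.gamma(1.0, 5.0, size=K)
loglik = -rng.gamma(50.0, 1.0, size=(N, K))      # stand-in for E_{q(s_nk)}[log p(x_n, s_nk | ...)]

Elogv   = digamma(beta1) - digamma(beta1 + beta2)                      # E[log v_k]
Elog1mv = digamma(beta2) - digamma(beta1 + beta2)                      # E[log (1 - v_k)]
prior_term = np.concatenate(([0.0], np.cumsum(Elog1mv[:-1]))) + Elogv  # first two terms of (61)

log_q = prior_term[None, :] + loglik
phi = np.exp(log_q - logsumexp(log_q, axis=1, keepdims=True))  # q(c_n = k); rows sum to 1, as in (62)
print(phi.round(3))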
ACKNOWLEDGEMENT
The authors thank Prof. Scott Lindroth, Chair of the Duke University Music Department, for reviewing
our results and providing guidance with regard to defining “truth” for the music data.
REFERENCES
[1] J.-J. Aucouturier and F. Pachet, “The influence of polyphony on the dynamical modelling of musical timbre.” Pattern
Recognition Letters, (in press, 2007).
[2] ——, “Music similarity measures: What’s the use?” In Proceedings of the International Symposium on Music Information
Retrieval (ISMIR), 2002.
[3] ——, “Improving timbre similarity: How high’s the sky?” Journal of Negative Results in Speech and Audio Sciences,
vol. 1, no. 1, 2004.
[4] J.-J. Aucouturier and M. Sandler, “Segmentation of musical signals using hidden markov models,” In Proceedings of the
110th Convention of the Audio Engineering Society, May 2001.
[5] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 5, no. 2, pp. 179–190, 1983.
[6] M. J. Beal, “Variational algorithms for approximate bayesian inference,” Ph.D. dissertation, Gatsby Computational
Neuroscience Unit, University College London, 2003.
[7] A. Berenzweig, B. Logan, D. P. Ellis, and B. Whitman, “A large-scale evaluation of acoustic and subjective music similarity
measures,” Computer Music Journal, vol. 28, no. 2, pp. 63–76, 2004.
[8] D. Blackwell and J. MacQueen, “Ferguson distributions via polya urn schemes,” Annals of Statistics, vol. 1, pp. 353–355,
1973.
[9] D. Blei and M. Jordan, “Variational methods for the dirichlet process,” In Proceedings of the 21st International Conference
on Machine Learning, 2004.
[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of
Royal Statistical Society B, vol. 39, no. 1, pp. 1–38, 1977.
[11] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” Journal of the American Statistical
Association, vol. 90, no. 430, pp. 577–588, 1995.
[12] T. S. Ferguson, “A bayesian analysis of some nonparametric problems,” Annals of Statistics, vol. 1, pp. 209–230, 1973.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, “Introducing markov chain monte carlo,” in Markov Chain Monte
Carlo in Practice. London, U.K.: Chapman Hall, 1996.
[14] G. Hinton, T. Sejnowski, and D. Ackley, “Boltzmann machines: Constraint satisfaction networks that learn,” 1984.
[15] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” Journal of the American Statistical
Association, Theory and Methods, vol. 96, no. 453, pp. 161–173, 2001.
[16] I. T. Jolliffe, Principal Component Analysis, 2nd ed. Springer, 2002.
[17] Y. Lin, “Learning phonetic features from waveforms,” UCLA Working Papers in Phonetics (WPP), no. 103, pp. 64–70,
September 2004.
[18] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Communications, vol.
COM-28, no. 1, pp. 84–95, 1980.
[19] B. Logan and A. Salomon, “A music similarity function based on signal analysis,” in ICME 2001, 2001.
[20] D. J. C. MacKay, “Ensemble learning for hidden markov models,” Technical report, Cavendish Laboratory, University of
Cambridge, 1997.
[21] M. Mandel, G. Poliner, and D. Ellis, “Support vector machine active learning for music retrieval,” Multimedia Systems,
Special issue on machine learning approaches to multimedia information retrieval, vol. 12, no. 1, pp. 3–13, 2006.
[22] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of
the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[23] C. Raphael, “Automatic segmentation of acoustic musical signals using hidden markov models,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 21, no. 4, pp. 360–370, 1999.
[24] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.
[25] N. Scaringella and G. Zoia, “On the modeling of time information for automatic genre recognition systems in audio
signals,” in Proc. of the 6th Int. Symposium on Music Information Retrieval (ISMIR), pp. 666–671, 2005.
[26] A. Schliep, B. Georgi, W. Rungsarityotin, I. G. Costa, and A. Schonhuth, “The general hidden markov model library:
Analyzing systems with unobservable states.” [Online]. Available: www.billingpreis.mpg.de/hbp04/ghmm.pdf
[27] J. Sethuraman, “A constructive definition of the dirichlet prior,” Statistica Sinica, vol. 2, pp. 639–650, 1994.
[28] X. Shao, C. Xu, and M. Kankanhalli, “Unsupervised classification of musical genre using hidden markov model,” in Proc.
IEEE Int. Conf. Multimedia Explore (ICME), pp. 2023–2026, 2004.
[29] A. Sheh and D. P. W. Ellis, “Chord segmentation and recognition using EM-trained hidden markov models,” In Proceedings
of the 4th Annual International Symposium on Music Information Retrieval (ISMIR), pp. 183–189, October 2003.
[30] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical dirichlet processes,” Journal of the American Statistical
Association, 2006.
[31] M. West, P. Muller, and M. Escobar, “Hierarchical priors and mixture models with applications in regression and density
estimation,” in Aspects of Uncertainty, P. R. Freeman and A. F. Smith, Eds. John Wiley, 1994, pp. 363–386.
[32] B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with minnowmatch,” Proceedings of the 2001 IEEE
Workshop on Neural Networks for Signal Processing, pp. 559–568, 2001.
[33] C. Xu, N. C. Maddage, X. Shao, F. Cao, and Q. Tian, “Musical genre classification using support vector machines,” In
International Conference on Acoustics, Speech, and Signal Processing. IEEE, vol. V, pp. 429–432, 2003.