Music Analysis Using Hidden Markov Mixture Models
Yuting Qi, John William Paisley and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Durham, NC, 27708-0291
Email: {yuting, jwp4, lcarin}@ee.duke.edu
Abstract
We develop a hidden Markov mixture model based on a Dirichlet process (DP) prior, for represen-
tation of the statistics of sequential data for which a single hidden Markov model (HMM) may not be
sufficient. The DP prior has an intrinsic clustering property that encourages parameter sharing, and this
naturally reveals the proper number of mixture components. The evaluation of posterior distributions for
all model parameters is achieved in two ways: (i) via a rigorous Markov chain Monte Carlo method, and
(ii) approximately and efficiently via a variational Bayes formulation. Using DP HMM mixture models
in a Bayesian setting, we propose a novel scheme for music analysis, highlighting the effectiveness of
the DP HMM mixture model. Music is treated as a time-series data sequence and each music piece is
represented as a mixture of HMMs. We approximate the similarity of two music pieces by computing
the distance between the associated HMM mixtures. Experimental results are presented for synthesized
sequential data and for classical music clips. Music similarities computed using DP HMM mixture
modeling are compared to those computed from Gaussian mixture modeling, for which the mixture
modeling is also performed using DP. The results show that the performance of DP HMM mixture
modeling exceeds that of the DP GMM modeling.
Index Terms
Dirichlet Process, HMM mixture, Music, MCMC, Variational Bayes.
I. INTRODUCTION
With the advent of online music, a large quantity and variety of music spanning many eras and genres is now highly accessible. However, this wealth of available music can also pose a problem for
music listeners and researchers alike: how can the listener efficiently find new music he or she would like
from a vast library, and how can this database be organized? This challenge has motivated researchers
working in the field of music recognition to find ways of building robust and efficient classification,
retrieval, browsing, and recommendation systems for listeners.
The use of statistical models has great potential for analyzing music, and for recognition of similarities
and relationships between different musical pieces. Motivated by these goals, ideas from statistical
machine learning have attracted growing interest in the music-analysis community. For example, in the
work of Logan and Salomon [19], the sampled music signal is divided into overlapping frames and Mel-
frequency cepstral coefficients (MFCCs) are computed as a feature vector for each frame. A K-means
method is then applied to cluster frames in the MFCC feature space. Aucouturier and Pachet [3] model
the distribution of the MFCCs over all frames of an individual song with a Gaussian mixture model, and
the distance between two pieces is evaluated based on their corresponding GMMs. Similar work can be
found in [7] as well. Recently support vector machines (SVMs) have been used in the context of genre
classification [32][33], where individual frames are classified based on short-time features, which then
vote for the classification of the entire piece. In [21], SVMs are utilized for genre classification, with
song-level features and a Kullback-Leiber (KL) divergence-based kernel employed to measure distances
between songs.
A common disadvantage of the aforementioned methods is that either frame-level or song-level feature
vectors are treated as i.i.d. samples from a distribution assumed to characterize the song (for frame-
level feature vectors) or the genre (song-level feature vectors), with no dynamic behavior of the music
accounted for. However, “in order to understand and segment acoustic phenomena, the brain dynamically
links a multitude of short events which cannot always be separated” [4]. For the brain to recognize and
appreciate music, temporal cues are critical and contain information that should not be ignored. Therefore,
there have been many attempts at modeling frame-to-frame dynamics in music (notably hidden Markov
models (HMMs)), and complex and multi-faceted music analysis may benefit from considering dynamical
information (see [1] for a review). An HMM can accurately represent the statistics of sequential data,
and such models have been exploited extensively in speech recognition [5][22]; one may utilize HMMs
to learn short-term transitions across a piece of music. For a piece of monophonic music, Raphael [23]
modeled the overall music as an HMM and segmented the sampled music data into a sequence of
contiguous regions that correspond to notes. In [4], a song is considered as a collection of distinct
regions that present steady statistical properties, which are uncovered through HMM modeling. More
recently, Sheh and Ellis [29] transcribe audio recordings into chord sequences for music indexing and
retrieval via HMMs. Shao [28] and Scaringella [25] use HMMs for music genre classification. Building a
single HMM for a song performs well when the music’s “movement pattern”, modeled as a Markov-chain transition pattern, is relatively simple and thus the structure is of modest complexity (e.g., the number of
states is few). However, most real music is a complicated signal, which has a quasi-uniform feature set
and may have more than one “movement pattern” across the entire piece. One may in principle model
the entire piece by a single HMM, with distinct segments of music characterized by an associated set
of HMM states. In such a model there would be limited state transitions between states associated with
distinct segments of music. Semi-parametric techniques such as the iHMM [30] could be used to infer
the appropriate number of HMM states. We alternatively consider an HMM mixture model, because
the distinct mixture components allow analysis of the characteristics of different segments of a given
piece. An important question centers around the proper number of mixture components, and this issue is
addressed using ideas from modern Bayesian statistics.
Mixture models provide an effective means of density estimation. For example, GMMs have been
widely used in modeling uncertain data distributions. However, few researchers have worked on mixtures
of HMMs designed using Bayesian principles. The most related work can be found in [17] and [26],
where an HMM mixture is learned not in a Bayesian setting, but via the EM algorithm with the number
of mixture components preset. The work reported here develops an HMM mixture model in a Bayesian
setting using a non-parametric Dirichlet process (DP) as a common prior distribution on the parameters of
the individual HMMs. Through Bayesian inference, we learn the posterior of the model parameters, which
yields an ensemble of HMM mixture models rather than a point estimation of the model parameters. In
contrast, the maximum-likelihood (ML) Expectation-Maximization (EM) algorithm [10] provides a point
estimate of the model parameters, i.e., a single HMM mixture model.
Research on DP models can be traced to Ferguson [12]. A DP is characterized by a base distribution G0
and a positive scalar “innovation” parameter α. The DP is a distribution on a distribution G, specifically
the distribution G is drawn from the DP, denoted G ∼ DP (α,G0). Ferguson proved that there is positive
probability that a sample G drawn from a DP will be as close as desired to any probability function
defined on the support of G0. Therefore DP is rich enough to model parameters of individual components
with arbitrarily high complexity, and flexible enough to fit them well without any assumptions about the
functional form of the prior distribution.
In a DP formulation we let {xn}n=1,N represent the N segments of data associated with a given
piece, where each xn represents a sequence of time-evolving features. Each xn is assumed to be drawn
from an HMM characterized by parameters Θn, xn ∼ H(Θn). All Θn are assumed to be drawn i.i.d.
from the distribution G, i.e., Θ_n|G ∼ G i.i.d., where G is drawn from DP(α, G_0), i.e., G ∼ DP(α, G_0).
As discussed further below, this hierarchical model naturally yields a clustering of the data {xn}n=1,N ,
from which an HMM mixture model is constituted. Importantly, the DP HMM mixture is essentially an
ensemble of HMM mixtures where each mixture may have a different number of components and different
component parameters, therefore the number of mixture components need not be set a priori. The same
basic framework may be applied to virtually any data model, as shown by West [31], and for comparison
we also consider a Gaussian mixture model.
Two schemes are considered to perform DP-based mixture modeling, Gibbs sampling [15] and varia-
tional Bayes [9]. The variational Bayesian inference algorithm avoids the expensive computation of Gibbs
sampling while retaining much of the rigor of the Bayesian formulation. In this paper we focus on HMM
mixture models based on discrete observations; we have also considered continuous observation HMMs,
which yielded very similar results as the discrete case, and therefore the case of continuous observations
is omitted for brevity. We also note that the discrete HMMs are computationally much faster to learn
than their continuous counterparts. Although music modeling is the motivation and specific application
in this paper, our method is applicable to any discrete sequential data sets containing multiple underlying
patterns, for instance behavior modeling in video.
The remainder of the paper is organized as follows. The proposed HMM mixture model is described
in Section 2. Section 3 provides an introduction to the Dirichlet process and its application to HMM
mixture models. An MCMC-based sampling scheme and variational Bayes inference are developed in
Section 4. Section 5 describes the application of HMM mixture models in music analysis. In Section 6
we present experimental results on synthetic data as well as for real music data. Section 7 concludes the
work and outlines future directions.
II. HIDDEN MARKOV MIXTURE MODEL
A. Hidden Markov Model
For a sequence of observations x = {xt}t=1,T , an HMM assumes that the observation xt at time t
is generated by an underlying, unobservable discrete state st and that the state sequence s = {st}t=1,T
follows a first-order Markov process. In the discrete case considered here, xt ∈ {1, · · · ,M} and st ∈
{1, · · · , I}, where M is the alphabet size and I the number of states. Therefore, an HMM can be modeled
as Θ = {A,B,π}, where A, B and π are defined as
• A = {aij}, aij = P (st+1 = j|st = i): state transition probabilities,
• B = {bim}, bim = P (xt = m|st = i): emission probabilities,
• π = {πi}, πi = P (s1 = i): initial state distribution.
For given model parameters Θ, the joint probability of the observation sequence and the underlying state
sequence is expressed as
p(x, s | \Theta) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t x_t}.   (1)
The likelihood of data x given model parameters Θ results from a summation over all possible hidden
state sequences,
p(x | \Theta) = \sum_{s} \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t x_t}.   (2)
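In practice the sum in (2) is never enumerated directly; it can be evaluated in O(T I^2) operations with the standard forward recursion. The following minimal sketch (Python/NumPy, our own illustration with hypothetical variable names) is one way to compute p(x|Θ) for the discrete HMM defined above; for long sequences one would normally add per-step scaling or work in the log domain.

import numpy as np

def hmm_likelihood(x, A, B, pi):
    """Evaluate p(x | Theta) of (2) by the forward recursion.
    x  : sequence of observation symbols in {0, ..., M-1}
    A  : I x I state transition matrix, A[i, j] = P(s_{t+1}=j | s_t=i)
    B  : I x M emission matrix, B[i, m] = P(x_t=m | s_t=i)
    pi : length-I initial state distribution
    """
    alpha = pi * B[:, x[0]]               # alpha_1(i) = pi_i * b_{i, x_1}
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]  # alpha_t(j) = sum_i alpha_{t-1}(i) a_{ij} * b_{j, x_t}
    return alpha.sum()                    # p(x | Theta) = sum_i alpha_T(i)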
B. Hidden Markov Mixture Model
The hidden Markov mixture model with K^* mixture components may be written as

p(x | \{w_k\}_{k=1}^{K^*}, \{\Theta_k\}_{k=1}^{K^*}) = \sum_{k=1}^{K^*} w_k\, p(x | \Theta_k),   (3)

where p(x | \Theta_k) represents the kth HMM component with associated parameters \Theta_k, and w_k represents the mixing weight for the kth HMM, with \sum_{k=1}^{K^*} w_k = 1.
Most HMM mixture models have been designed using a maximum-likelihood (ML) solution, via the
EM algorithm [10] for which a single mixture model is learned. An important question concerns the
proper number of mixture components K∗, this constituting a problem of model selection. For this
purpose researchers have used such measures as the minimum description length (MDL) [24]. In the
work reported here, rather than a point estimate of an HMM mixture we learn an ensemble of HMM
mixtures. We assume a set X = {xn}n=1,N of N sequences of data. Each data sequence xn is assumed
to be drawn from an associated HMM with parameters Θn = {An,Bn,πn}, i.e., xn ∼ H(Θn), where
H(Θ) represents the HMM. The set of associated parameters {Θn}n=1,N are drawn i.i.d from a shared
prior G, i.e., Θ_n|G ∼ G i.i.d. The distribution G is itself drawn from a distribution, in particular a Dirichlet
process. We discuss below that proper selection of the DP “innovation parameter” yields a framework
by which the parameters {Θn}n=1,N are encouraged to cluster, and each such cluster corresponds to
an HMM mixture component in (3). The algorithm automatically balances the DP-generated desire to
cluster with the likelihood’s desire to choose parameters that match the data X well. The likelihood and
the DP prior are balanced in the posterior density function for parameters {Θn}n=1,N , and the posterior
distribution for {Θn}n=1,N is learned to constitute an ensemble of HMM mixture models.
III. DIRICHLET PROCESS PRIOR
A. Dirichlet Process
The Dirichlet process, denoted as DP (α,G0), is a random measure on measures and is parameterized
by the “innovation parameter” α and a base distribution G0. Assume we have N random variables
{Θn}n=1,N distributed according to G, and G itself is a random measure drawn from a Dirichlet process,
Θn|G ∼ G, n = 1, · · · , N,
G ∼ DP (α,G0),
where G0 is the expectation of G,
E[G] = G0. (4)
Defining \Theta_{-n} = \{\Theta_1, \cdots, \Theta_{n-1}, \Theta_{n+1}, \cdots, \Theta_N\} and integrating out G, the conditional distribution of \Theta_n given \Theta_{-n} follows a Polya urn scheme and has the following form [8],

p(\Theta_n | \Theta_{-n}, \alpha, G_0) = \frac{\alpha}{\alpha + N - 1} G_0 + \frac{1}{\alpha + N - 1} \sum_{i=1, i \neq n}^{N} \delta_{\Theta_i},   (5)
where \delta_{\Theta_i} denotes the distribution concentrated at the single point \Theta_i. Let \{\Theta^*_k\}_{k=1}^{K^*} be the distinct values taken by \{\Theta_n\}_{n=1}^{N} and let n^{-n}_k be the number of values in \Theta_{-n} that equal \Theta^*_k. We can rewrite (5) as

p(\Theta_n | \Theta_{-n}, \alpha, G_0) = \frac{\alpha}{\alpha + N - 1} G_0 + \frac{1}{\alpha + N - 1} \sum_{k=1}^{K^*} n^{-n}_k \delta_{\Theta^*_k}.   (6)
Equation (6) shows that when considering \Theta_n given all other observations \Theta_{-n}, this new sample is either drawn from the base distribution G_0 with probability \alpha/(\alpha + N - 1), or is selected from the existing draws \Theta^*_k according to a multinomial allocation, with probabilities proportional to the existing group sizes n^{-n}_k.
This highlights the valuable sharing property of the Dirichlet process: a new sample prefers to join a
group with a large population, i.e., the more often a parameter is shared, the more likely it will be shared
in the future.
The parameter α plays a balancing role between sampling a new parameter from the base distribution
G0 (“innovating”), or sharing previously sampled parameters. A larger α yields more clusters, and in the
limit α → ∞, G → G0; as α → 0, all {Θn}n=1,N are aggregated into a single cluster and take on the
same value.
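The sharing behavior in (6)-(7) and the role of α are easy to simulate. The short sketch below (our own illustration, not part of the model itself) draws cluster assignments sequentially from the Polya urn; a small α yields a few large clusters, while a large α creates many.

import numpy as np

def polya_urn_assignments(N, alpha, rng=np.random.default_rng(0)):
    """Sequentially assign N items to clusters following the Polya urn of (6)/(7)."""
    assignments = np.zeros(N, dtype=int)
    counts = []                               # n_k, number of items already in cluster k
    for n in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                  # existing clusters prop. to n_k, a new cluster prop. to alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                  # a new cluster is created ("innovation")
            counts.append(1)
        else:
            counts[k] += 1
        assignments[n] = k
    return assignments

# A small alpha yields few clusters; a large alpha yields many.
print(len(set(polya_urn_assignments(100, 0.5))), len(set(polya_urn_assignments(100, 20.0))))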
Prediction of a future sample can be directly extended from (6),

p(\Theta_{N+1} | \{\Theta_n\}_{n=1}^{N}, \alpha, G_0) = \frac{\alpha}{\alpha + N} G_0 + \frac{1}{\alpha + N} \sum_{k=1}^{K^*} n_k \delta_{\Theta^*_k},   (7)
where n_k is the number of \Theta_n that take value \Theta^*_k. It is proven in [31] that the posterior of G is still a DP, with the scaling parameter and the base distribution updated as follows,

p(G | \{\Theta_n\}_{n=1}^{N}, \alpha, G_0) = DP\left(\alpha + N,\ \frac{\alpha}{\alpha + N} G_0 + \frac{1}{\alpha + N} \sum_{k=1}^{K^*} n_k \delta_{\Theta^*_k}\right).   (8)
The above DP representation highlights its sharing property, but without an explicit form for G.
Sethuraman [27] provides an explicit characterization of G in terms of a stick-breaking construction.
Consider two infinite collections of independent random variables vk and Θ∗k, k = 1, 2, · · · ,∞, where
vk is drawn from a Beta distribution, denoted Beta(1, α), and Θ∗k is drawn independently from the base
distribution G_0. The stick-breaking representation of G is then defined as

G = \sum_{k=1}^{\infty} p_k \delta_{\Theta^*_k},   (9)

with

p_k = v_k \prod_{i=1}^{k-1} (1 - v_i),   (10)

where

v_k | \alpha \sim Beta(1, \alpha),
\Theta^*_k | G_0 \sim G_0.
This representation makes explicit that the random measure G is discrete with probability one and the
support of G consists of an infinite set of atoms located at Θ∗k, drawn independently from G0. The mixing
weights pk for atom Θ∗k are given by successively breaking a unit length “stick” into an infinite number
of pieces [27], with 0 \le p_k \le 1 and \sum_{k=1}^{\infty} p_k = 1.
The relationship between the stick-breaking representation and the Polya urn scheme is interpreted as
follows: if α is large, each vk drawn from Beta(1, α) will be very small, which means we will tend to
have many sticks of very short length. Consequently, G will consist of an infinite number of Θ∗k with
very small weights pk and therefore G will approach G0, the base distribution. For a small α, each vk
drawn from Beta(1, α) will be large, which will result in several large sticks with the remaining sticks
very small. This leads to a clustering effect on the parameters \{\Theta_n\}_{n=1}^{N}, as G will only have a large mass on a small subset of \{\Theta^*_k\}_{k=1}^{\infty} (those \Theta^*_k corresponding to the large sticks p_k).
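The truncated form of this construction, used later in the paper, can be sketched directly from (9)-(10); the function below (our own illustration, with an arbitrary truncation level) draws the weights p_k and makes the clustering effect of a small α visible.

import numpy as np

def stick_breaking_weights(alpha, K, rng=np.random.default_rng(0)):
    """Draw truncated stick-breaking weights p_1, ..., p_K as in (10)."""
    v = rng.beta(1.0, alpha, size=K)          # v_k ~ Beta(1, alpha)
    v[-1] = 1.0                               # truncation: the last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                      # p_k = v_k * prod_{i<k} (1 - v_i)

p_small = stick_breaking_weights(alpha=1.0, K=50)
p_large = stick_breaking_weights(alpha=20.0, K=50)
# With alpha = 1 most of the mass sits on a few atoms; with alpha = 20 it is spread thinly.
print(p_small[:5].sum(), p_large[:5].sum())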
B. DP HMM mixture models
Given the observed data X = {xn}n=1,N , each xn is assumed to be drawn from its own HMM H(Θn)
parameterized by Θn with the underlying state sequence sn. The common prior G on all Θn is given
as (9). Since G is discrete, different \Theta_n may share the same value \Theta^*_k, and take the value \Theta^*_k with probability p_k. Introducing an indicator variable c = \{c_n\}_{n=1}^{N} and letting c_n = k indicate that \Theta_n takes the value of \Theta^*_k, the hidden Markov mixture model with DP prior can be expressed as

x_n | c_n, \{\Theta^*_k\}_{k=1}^{\infty} \sim H(\Theta^*_{c_n}),
c_n | p \sim Mult(p),
v_k | \alpha \sim Beta(1, \alpha),
\Theta^*_k | G_0 \sim G_0,   (11)
where p = {pk}k=1,∞ is given by (10) and Mult(p) is the multinomial distribution with parameter p.
Assuming A, B and \pi are independent of each other, the base distribution G_0 is represented as G_0 = p(A)p(B)p(\pi). For computational convenience (use of appropriate conjugate priors), p(A) is specified as a product of Dirichlet distributions

p(A | u^A) = \prod_{i=1}^{I} Dir(\{a_{ij}\}_{j=1}^{I}; u^A),   (12)

where u^A = \{u^A_i\}_{i=1}^{I} are parameters of the Dirichlet distribution. Similarly, for p(B) and p(\pi), we have

p(B | u^B) = \prod_{i=1}^{I} Dir(\{b_{im}\}_{m=1}^{M}; u^B),   (13)

and

p(\pi | u^\pi) = Dir(\{\pi_i\}_{i=1}^{I}; u^\pi),   (14)

where u^B = \{u^B_m\}_{m=1}^{M} and u^\pi = \{u^\pi_i\}_{i=1}^{I}. As discussed in Section IV-A, the use of conjugate distributions yields analytic update equations for the MCMC sampler as well as for the variational Bayes algorithm.
It has been shown in [31] that the innovation parameter \alpha plays a critical role in defining the number of clusters inferred, with the appropriate \alpha depending on the number of data points. Therefore a prior distribution is placed on \alpha and a posterior on \alpha is learned from the data. We choose
p(α) = Ga(α; γ01, γ02), (15)
where Ga(α; γ01, γ02) is the Gamma distribution with selected parameters γ01 and γ02.
The corresponding graphical representation of the model is shown in Figure 1. We note that the stick-
breaking representation in (11) in principle uses an infinite number of sticks. It has been demonstrated [15]
that in practice a finite set of sticks may be employed with minimal error and may reduce computational
complexity; in the work reported here an appropriate truncation level K, i.e., a finite number of sticks,
is employed.
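For concreteness, the full truncated generative process of (11)-(15) may be sketched as follows (the dimensions, hyper-parameters, and uniform Dirichlet base below are illustrative choices of ours, not prescriptions from the paper).

import numpy as np
rng = np.random.default_rng(0)

I, M, K, alpha = 4, 8, 20, 1.0                 # states, alphabet size, truncation level, innovation

# Stick-breaking weights p from (10) and atoms Theta*_k drawn from the base G0 of (12)-(14).
v = rng.beta(1.0, alpha, size=K); v[-1] = 1.0
p = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
A = rng.dirichlet(np.ones(I), size=(K, I))     # A*_k: row i ~ Dir(u^A), here u^A = 1
B = rng.dirichlet(np.ones(M), size=(K, I))     # B*_k: row i ~ Dir(u^B)
pi = rng.dirichlet(np.ones(I), size=K)         # pi*_k ~ Dir(u^pi)

def sample_sequence(T):
    """Draw one observation sequence x_n of length T from the mixture model (11)."""
    k = rng.choice(K, p=p)                     # c_n ~ Mult(p)
    s = rng.choice(I, p=pi[k])                 # initial state from pi*_k
    x = []
    for _ in range(T):
        x.append(rng.choice(M, p=B[k, s]))     # emit symbol from row s of B*_k
        s = rng.choice(I, p=A[k, s])           # transition using row s of A*_k
    return np.array(x), k

x, k = sample_sequence(T=40)
print("sequence from component", k, ":", x[:10])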
IV. INFERENCE FOR HMM MIXTURE MODELS WITH DP PRIOR
A. MCMC Inference
The posterior distribution of the model parameters is expressed as p(A^*, B^*, \pi^*, p, \alpha, S, c | X), where A^* = \{A^*_k\}_{k=1}^{K}, B^* = \{B^*_k\}_{k=1}^{K}, \pi^* = \{\pi^*_k\}_{k=1}^{K}, S = \{\{s_{nk}\}_{n=1}^{N}\}_{k=1}^{K}, s_{nk} is the hidden state sequence corresponding to x_n when assuming x_n is generated from the kth HMM, and c is the indicator variable defined in Section III-B. The truncation level K corresponds to the number of sticks used in the stick-breaking construction. The posterior can be approximated by MCMC methods based on Gibbs sampling,
Fig. 1. Graphical representation of a Dirichlet process mixture model in terms of the stick-breaking construction, where x_n are observed sequences, \gamma_{01}, \gamma_{02} and G_0 are preset, and the other parameters are hidden variables to be learned.
by iteratively drawing samples for each random variable from its full conditional posterior distribution
given the most recent values of all other random variables. We follow Bayes’ rule to derive the full
conditional distributions for each random variable in p(A∗,B∗,π∗,p, α,S, c|X). These distributions are
prerequisites for performing the Gibbs sampling of the posterior. In each iteration of the Gibbs sampling
scheme, we sample a state sequence s_{nk} = \{s_{nk,t}\}_{t=1}^{T_n} for observation x_n. Given the HMM parameters \Theta^*_k and a particular state sequence s_{nk}, the joint likelihood for x_n and s_{nk} is obtained from (1).
Using A^*_{-k} to represent all A^*'s except A^*_k, the conditional posterior for A^*_k is obtained as

p(A^*_k | A^*_{-k}, B^*, \pi^*, p, \alpha, S, c, X) \propto p(A^*_k)\, p(X, S | A^*_k, B^*_k, \pi^*_k, c)
 \propto \prod_{i,j=1}^{I} a_{ij,k}^{u^A_j - 1} \prod_{\{l: c_l = k\}} \left\{ \pi_{s_{lk,1},k} \left[ \prod_{t=1}^{T_l - 1} a_{s_{lk,t} s_{lk,t+1},k} \right] \left[ \prod_{t=1}^{T_l} b_{s_{lk,t} x_{l,t},k} \right] \right\}
 \propto \prod_{i,j=1}^{I} a_{ij,k}^{u^A_j + u^A_{ij,k} - 1},   (16)

where u^A_{ij,k} \equiv \sum_{l: c_l = k} \sum_{t=1}^{T_l - 1} \delta(i = s_{lk,t}, j = s_{lk,t+1}). Equation (16) indicates that the conditional posterior for A^*_k is still a product of Dirichlet distributions,

p(A^*_k | A^*_{-k}, B^*, \pi^*, p, \alpha, S, c, X) = \prod_{i=1}^{I} Dir(\{a_{ij,k}\}_{j=1}^{I}; \{u^A_j + u^A_{ij,k}\}_{j=1}^{I}).   (17)
Similarly, the conditional posterior for B^*_k is given as

p(B^*_k | A^*, B^*_{-k}, \pi^*, p, \alpha, S, c, X) = \prod_{i=1}^{I} Dir(\{b_{im,k}\}_{m=1}^{M}; \{u^B_m + u^B_{im,k}\}_{m=1}^{M}),   (18)

where u^B_{im,k} \equiv \sum_{\{l: c_l = k\}} \sum_{t=1}^{T_l} \delta(i = s_{lk,t}, m = x_{l,t}), and the conditional posterior for \pi^*_k is

p(\pi^*_k | A^*, B^*, \pi^*_{-k}, p, \alpha, S, c, X) = Dir(\{\pi_{i,k}\}_{i=1}^{I}; \{u^\pi_i + u^\pi_{i,k}\}_{i=1}^{I}),   (19)
where u^\pi_{i,k} \equiv \sum_{l: c_l = k} \delta(i = s_{lk,1}).
The conditional posterior for a state sequence s_{nk} is given as

p(s_{nk,1} | S_{-n}, s_{nk,-1}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,1})\, p(s_{nk,2} | s_{nk,1})\, p(x_{n,1} | s_{nk,1}),
p(s_{nk,t} | S_{-n}, s_{nk,-t}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,t} | s_{nk,t-1})\, p(s_{nk,t+1} | s_{nk,t})\, p(x_{n,t} | s_{nk,t}),   2 \le t \le T_n - 1,
p(s_{nk,T_n} | S_{-n}, s_{nk,-T_n}, A^*, B^*, \pi^*, p, \alpha, c, X) \propto p(s_{nk,T_n} | s_{nk,T_n - 1})\, p(x_{n,T_n} | s_{nk,T_n}).   (20)

Equation (20) shows that K state sequences will be sampled for each x_n, corresponding to \{\Theta^*_k\}_{k=1}^{K}.
West proved in [11] that the conditional posterior of \alpha is

p(\alpha | A^*, B^*, \pi^*, p, S, c, X) \propto p(\alpha)\, P(K | \alpha)
 \propto p(\alpha)\, \alpha^{K-1} (\alpha + N) \int_0^1 t^{\alpha} (1 - t)^{N-1} dt.   (21)

Given the prior p(\alpha) = Ga(\alpha; \gamma_{01}, \gamma_{02}), drawing a sample \alpha from the above conditional posterior can be realized in two steps: 1) sample an intermediate variable \eta from a beta distribution p(\eta | \alpha, K) = Beta(\eta; \alpha + 1, N) given the most recent value of \alpha; 2) sample a new \alpha value from

p(\alpha | \eta, K) = \rho\, Ga(\alpha; \gamma_{01} + K, \gamma_{02} - \log(\eta)) + (1 - \rho)\, Ga(\alpha; \gamma_{01} + K - 1, \gamma_{02} - \log(\eta)),   (22)

where \rho is defined by \rho/(1 - \rho) = (\gamma_{01} + K - 1)/\{N[\gamma_{02} - \log(\eta)]\}. The detailed derivation of this sampling scheme is found in [11].
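This two-step auxiliary-variable draw is only a few lines of code; a minimal sketch follows (assuming K is the current number of occupied components and N the number of sequences; the variable names are our own).

import numpy as np

def resample_alpha(alpha, K, N, gamma01, gamma02, rng=np.random.default_rng()):
    """Draw a new alpha from its conditional posterior via the mixture of Gammas in (22)."""
    eta = rng.beta(alpha + 1.0, N)                          # step 1: eta ~ Beta(alpha + 1, N)
    rate = gamma02 - np.log(eta)                            # updated rate gamma02 - log(eta)
    odds = (gamma01 + K - 1.0) / (N * rate)                 # rho / (1 - rho)
    rho = odds / (1.0 + odds)
    if rng.random() < rho:                                  # step 2: pick one of the two Gamma components
        return rng.gamma(shape=gamma01 + K, scale=1.0 / rate)
    return rng.gamma(shape=gamma01 + K - 1.0, scale=1.0 / rate)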
The prior for c_n in (11) can be rewritten as p(c_n) = \sum_{k=1}^{K} p_k \delta_k. Thus the conditional posterior for c_n is

p(c_n | c_{-n}, A^*, B^*, \pi^*, p, \alpha, S, X) \propto \sum_{k=1}^{K} p_{k,n} \delta_k,   n = 1, \cdots, N,   (23)

where p_{k,n} \propto p_k\, p(x_n, s_{nk} | A^*_k, B^*_k, \pi^*_k).
The conditional posterior for p is

p_1 = v^*_1,
p_k = v^*_k \prod_{j=1}^{k-1} (1 - v^*_j),   k \ge 2,
p(v^*_k) = Beta(v^*_k; 1 + n_k, \alpha + \sum_{l=k+1}^{K} n_l),   k = 1, \cdots, K - 1,   (24)

where n_k is the number of x_n associated with the kth component, i.e., the number of c_n = k.
A Gibbs sampling scheme called the Blocked Gibbs Sampler [15] is adopted to implement the MCMC
method. In the Blocked Gibbs Sampler, vK is set to 1. Let c∗ denote the set of current unique values of
c, a set of indices of those components that contain at least one sequence. In each iteration of the Gibbs
sampling scheme, the values are drawn in the following order:
(a) Draw samples for HMM mixture component parameters A^*_k, B^*_k, \pi^*_k, and state sequences s_{nk} given a particular HMM component, k = 1, \cdots, K:
For k \notin c^*, draw A^*_k, B^*_k, and \pi^*_k from (12)-(14) respectively, i.e., for those mixture components that are currently not occupied by any data, new samples are generated from the base distribution. A new state sequence s_{nk} for x_n is generated by the current \Theta^*_k according to a Markov process, which follows the same procedure as in (20), but without the emission probabilities.
For k \in c^*, draw A^*_k, B^*_k, and \pi^*_k from (17)-(19) respectively, i.e., for those mixture components that contain at least one sequence, component parameters are drawn from their conditional posteriors. A state sequence s_{nk} is generated from (20) given the current component parameters.
(b) Draw the membership, cn, for the data according to the density given by (23), where n = 1, · · · , N .
(c) Draw mixing weights, p, from the posterior in (24).
(d) Draw innovation parameter α from (21).
The above steps proceed iteratively, with each variable drawn either from the base distribution or its
conditional posterior given the most recent values of all other samples. According to MCMC theory, after
the Markov chain reaches its stationary distribution, these samples can be viewed as random draws from
the full posteriors p(A∗,B∗,π∗,S, c,p|X), and thus, in practice, the posteriors can be constructed by
collecting a sufficient number of samples after the above iteration stabilizes (the change of each variable
becomes stable) [13].
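Of the steps above, the DP-specific draws (b) and (c) are compact enough to sketch concretely; the functions below (a minimal illustration of ours, with the joint log-likelihoods log p(x_n, s_nk | Θ*_k) from (1) assumed to be precomputed) implement (23) and (24). The component-parameter and state-sequence draws in step (a) are standard Dirichlet and categorical draws and are omitted here.

import numpy as np

def draw_memberships(log_joint, p, rng=np.random.default_rng()):
    """Step (b): draw c_n from (23). log_joint[n, k] = log p(x_n, s_nk | Theta*_k)."""
    logw = np.log(p)[None, :] + log_joint                # log( p_k * p(x_n, s_nk | Theta*_k) )
    logw -= logw.max(axis=1, keepdims=True)              # normalize in the log domain for stability
    w = np.exp(logw); w /= w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=w_n) for w_n in w])

def draw_mixing_weights(c, alpha, K, rng=np.random.default_rng()):
    """Step (c): draw p from the stick-breaking posterior (24)."""
    n = np.bincount(c, minlength=K).astype(float)        # n_k = number of sequences with c_n = k
    tail = np.concatenate((np.cumsum(n[::-1])[::-1][1:], [0.0]))   # tail[k] = sum_{l > k} n_l
    v = rng.beta(1.0 + n[:-1], alpha + tail[:-1])        # v*_k ~ Beta(1 + n_k, alpha + sum_{l>k} n_l)
    v = np.append(v, 1.0)                                # v*_K = 1 in the blocked Gibbs sampler
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))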
An approximate predictive distribution can be obtained by averaging the predictive distributions across
these collected samples,
p(x_{N+1} | X, \gamma_{01}, \gamma_{02}, G_0) = \frac{1}{N_{sam}} \sum_{i=1}^{N_{sam}} \left[ \sum_{k=1}^{K} p^{(i)}_k\, p(x_{N+1} | \Theta^{*(i)}_k) \right],   (25)

where N_{sam} is the number of collected samples and p^{(i)}_k and \Theta^{*(i)}_k are the ith collected sample values of p_k and \Theta^*_k.
It is interesting to compare the traditional HMM mixture model in (3) to the DP-based predictive distribution in (25). In (3) the data are characterized by an HMM mixture model, in which each mixture component has a fixed set of parameters. In this model we must set the number of mixture components K^* and must learn the associated parameters. Equation (25) makes it explicit that the DP HMM mixture model is essentially an ensemble of HMM mixtures of the form in (3). Each mixture, the term in the parentheses in (25), is sampled from the Gibbs sampler. Because of the clustering properties of the DP, most components of each such mixture will have near-zero probability of being used, but the number of utilized mixture components for each mixture is different (and less than K).
B. Variational Inference
As stated above, in MCMC the posterior is estimated from collected samples. Although this yields an
accurate result theoretically (with sufficient samples) [13], it often requires vast computational resources
and the convergence of the algorithm is often difficult to diagnose. Variational Bayes inference is
introduced as an alternative method for approximating likelihoods and posteriors.
From Bayes’ rule, we have
p(\Phi | X, \Psi) = \frac{p(X | \Phi)\, p(\Phi | \Psi)}{\int p(X | \Phi)\, p(\Phi | \Psi)\, d\Phi},   (26)
where Φ = {A∗,B∗,π∗,v, α,S, c} are hidden variables of interest and Ψ = {uA,uB,uπ, γ01, γ02} are
hyper-parameters which determine the distribution of the model parameters. Since p is a function of v (see
(10)), estimating the posterior of v is equivalent to estimating the posterior of p. The integration in the
denominator of (26), called the marginal likelihood, or “evidence”, is generally intractable analytically
except for simple cases and thus estimating the posterior p(Φ|X,Ψ) cannot be achieved analytically.
Instead of directly estimating p(Φ|X,Ψ), variational methods seek a distribution q(Φ) to approximate
the true posterior distribution p(Φ|X,Ψ). Consider the log marginal likelihood
\log p(X | \Psi) = L(q(\Phi)) + D_{KL}(q(\Phi) \| p(\Phi | X, \Psi)),   (27)

where

L(q(\Phi)) = \int q(\Phi) \log \frac{p(X | \Phi)\, p(\Phi | \Psi)}{q(\Phi)}\, d\Phi,   (28)

and

D_{KL}(q(\Phi) \| p(\Phi | X, \Psi)) = \int q(\Phi) \log \frac{q(\Phi)}{p(\Phi | X, \Psi)}\, d\Phi.   (29)
DKL(q(Φ)||p(Φ|X,Ψ)) is the KL divergence between the approximate and true posterior. The approxima-
tion of the true posterior p(Φ|X,Ψ) using q(Φ) can be achieved by minimizing DKL(q(Φ)||p(Φ|X,Ψ)).
Since the KL divergence is nonnegative, from (27) this minimization is equivalent to maximization of
L(q(Φ)), which forms a strict lower bound on log p(X|Ψ),
log p(X|Ψ) ≥ L(q). (30)
For computational convenience, q(Φ) is expressed in a factorized form, with the same functional form
as the priors p(Φ|Ψ) and each parameter represented by its own conjugate prior. For the HMM mixture
model proposed in this paper, we assume
q(\Phi) = q(A^*, B^*, \pi^*, S, c, v, \alpha)
 = q(\alpha)\, q(v) \left\{ \prod_{k=1}^{K} [q(A^*_k)\, q(B^*_k)\, q(\pi^*_k)] \right\} \left\{ \prod_{n=1}^{N} \prod_{c_n=1}^{K} [q(c_n)\, q(s_{nc_n})] \right\},   (31)

where q(A^*_k), q(B^*_k), q(\pi^*_k) have the same form as in (12), (13) and (14) respectively but with different parameters, q(v) = \prod_{k=1}^{K-1} q(v_k) with q(v_k) = Beta(v_k; \beta_{1k}, \beta_{2k}), and q(\alpha) = Ga(\alpha; \gamma_1, \gamma_2). Once we
learn the parameters of these variational distributions from the data, we obtain the approximation of
p(\Phi | X, \Psi) by q(\Phi). The joint distribution of \Phi and the observations X is given as

p(X, \Phi) = p(X, A^*, B^*, \pi^*, v, \alpha, S, c | \Psi)
 = p(\alpha)\, p(v | \alpha) \prod_{k=1}^{K} [p(A^*_k)\, p(B^*_k)\, p(\pi^*_k)] \prod_{n=1}^{N} \prod_{c_n=1}^{K} [p(c_n | v)\, p(x_n, s_{nc_n} | A^*, B^*, \pi^*, c_n)],   (32)

where the priors p(A^*_k), p(B^*_k), p(\pi^*_k), and p(\alpha) are given in (12), (13), (14), and (15) respectively, and p(v | \alpha) = \prod_{k=1}^{K-1} p(v_k | \alpha) with p(v_k | \alpha) = Beta(v_k; 1, \alpha). All parameters \Psi in these prior distributions are assumed to be set.
We substitute (31) and (32) into (28) to yield

L(q) = \int q(\alpha)\, q(v) \left\{ \prod_{k=1}^{K} [q(A^*_k)\, q(B^*_k)\, q(\pi^*_k)] \right\} \left\{ \prod_{n=1}^{N} \prod_{c_n=1}^{K} [q(c_n)\, q(s_{nc_n})] \right\}
 \cdot \Big\{ \log p(\alpha) + \log p(v | \alpha) + \sum_{k=1}^{K} [\log p(A^*_k) + \log p(B^*_k) + \log p(\pi^*_k)]
 + \sum_{n=1}^{N} \sum_{c_n=1}^{K} [\log p(c_n | v) + \log p(x_n, s_{nc_n} | A^*, B^*, \pi^*, c_n)] - \log q(v) - \log q(\alpha)
 - \sum_{k=1}^{K} [\log q(A^*_k) + \log q(B^*_k) + \log q(\pi^*_k)] - \sum_{n=1}^{N} \sum_{c_n=1}^{K} [\log q(c_n) + \log q(s_{nc_n})] \Big\}\, d\Phi.   (33)
The optimization of the lower bound in (33) is realized by taking functional derivatives with respect to
each of the q(·) distributions while fixing the other q distributions and setting ∂L(q)/∂q(·) = 0 to find
the distribution q(·) that increases L [6]. The update equations for the variational posteriors are listed as
follows (their derivation is summarized in the Appendix):

(1) q(A^*_k) = \prod_{i=1}^{I} Dir(\{a_{ij,k}\}_{j=1}^{I}; \{u^A_{ij,k}\}_{j=1}^{I}), where
    u^A_{ij,k} = u^A_j + \sum_{n=1}^{N} \phi_{n,k} \left[ \sum_{t=1}^{T_n - 1} q(s_{nk,t} = i, s_{nk,t+1} = j) \right].
(2) q(B^*_k) = \prod_{i=1}^{I} Dir(\{b_{im,k}\}_{m=1}^{M}; \{u^B_{im,k}\}_{m=1}^{M}), where
    u^B_{im,k} = u^B_m + \sum_{n=1}^{N} \phi_{n,k} \left[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) \right].
(3) q(\pi^*_k) = Dir(\{\pi_{i,k}\}_{i=1}^{I}; \{u^\pi_{i,k}\}_{i=1}^{I}), where
    u^\pi_{i,k} = u^\pi_i + \sum_{n=1}^{N} \phi_{n,k}\, q(s_{nk,1} = i).
(4) q(v) = \prod_{k=1}^{K-1} q(v_k) = \prod_{k=1}^{K-1} Beta(v_k; \beta_{1k}, \beta_{2k}), where
    \beta_{1k} = 1 + \sum_{n=1}^{N} \phi_{n,k} and \beta_{2k} = \frac{\gamma_1}{\gamma_2} + \sum_{n=1}^{N} \sum_{i=k+1}^{K} \phi_{n,i}.
(5) q(\alpha) = Ga(\alpha; \gamma_1, \gamma_2), where
    \gamma_1 = \gamma_{01} + K - 1 and \gamma_2 = \gamma_{02} - \sum_{k=1}^{K} [\psi(\beta_{2k}) - \psi(\beta_{1k} + \beta_{2k})].
(6) q(s_{nk}) \propto \pi_{s_{nk,1},k} \left[ \prod_{t=1}^{T_n - 1} a_{s_{nk,t} s_{nk,t+1},k} \right] \left[ \prod_{t=1}^{T_n} b_{s_{nk,t} x_{n,t},k} \right], where a_{ij,k}, b_{im,k}, and \pi_{i,k} are given in (56)-(58) in the Appendix.
(7) q(c_n = k) = \phi_{n,k} = \frac{q'(c_n = k)}{\sum_{k'=1}^{K} q'(c_n = k')}, where q'(c_n = k) is given as (61) in the Appendix.
The local maximum of the lower bound L(q) is achieved by iteratively updating the parameters of the variational distributions q(·) according to the above equations. Each iteration is guaranteed to either increase the lower bound or leave it unchanged. We terminate the algorithm when the change in L(q) is negligibly small. L(q) can be computed by substituting the updated q(·) and the prior distributions p(\Phi|\Psi) into (33).
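As one concrete piece of this coordinate-ascent loop, updates (4) and (5) depend only on the responsibilities φ_{n,k}; a minimal sketch is given below (assuming phi is the N x K matrix of φ_{n,k} from step (7), and gamma1_old, gamma2_old are the current parameters of q(α); these names are ours).

import numpy as np
from scipy.special import psi            # digamma function, psi(x) = d/dx log Gamma(x)

def update_stick_and_alpha(phi, gamma01, gamma02, gamma1_old, gamma2_old):
    """Variational updates (4) and (5) for q(v_k) = Beta(beta1k, beta2k) and q(alpha) = Ga(gamma1, gamma2)."""
    N, K = phi.shape
    beta1 = 1.0 + phi[:, :-1].sum(axis=0)                          # beta_1k = 1 + sum_n phi_{n,k}
    tail = phi[:, ::-1].cumsum(axis=1)[:, ::-1]                    # tail[:, k] = sum_{i >= k} phi_{n,i}
    beta2 = gamma1_old / gamma2_old + tail[:, 1:].sum(axis=0)      # beta_2k = E[alpha] + sum_n sum_{i>k} phi_{n,i}
    gamma1 = gamma01 + K - 1
    gamma2 = gamma02 - np.sum(psi(beta2) - psi(beta1 + beta2))     # sum runs over the K-1 stick variables
    return beta1, beta2, gamma1, gamma2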
The prediction for a new observation sequence y is

p(y | X, \Psi) = \int \sum_{k=1}^{K} p_k(v)\, p(y | \Theta^*_k)\, dP(\Theta^*_k, v | X, \Psi)
 \approx \sum_{k=1}^{K} E[p_k] \int q(\Theta^*_k)\, p(y | \Theta^*_k)\, d\Theta^*_k.   (34)
Since the true posteriors are unknown, we use the variational posterior q(Θ∗k,v) from the VB optimization
to approximate p(Θ∗k,v|X,Ψ) in (34). However, the above quantity is still intractable because the states
and the model parameters are coupled. There are several possible methods for approximating the predictive
probability. One such method is to sample parameters from the posterior distribution and construct a
Monte Carlo estimate, but this approach is not efficient. An alternative is to construct a lower bound on
the approximation of the predictive quantity in (34) [6]. Another way suggested in [20] assumes that the
states and the model parameters are independent and the model can be evaluated at the mean (or mode)
of the variational posterior. This approach makes the prediction tractable and so is used in our following
experiments. Equation (34) can thus be approximated as
p(y | X, \Psi) \approx \sum_{k=1}^{K} E[p_k] \sum_{s_{yk}} E[\pi_{s_{yk,1},k}] \cdot \prod_{t=1}^{T_y - 1} E[a_{s_{yk,t} s_{yk,t+1},k}] \cdot \prod_{t=1}^{T_y} E[b_{s_{yk,t} y_t,k}],   (35)

where E[p_1] = E[v_1], E[p_k] = E[v_k] \prod_{i=1}^{k-1} (1 - E[v_i]) with E[v_k] = \frac{\beta_{1k}}{\beta_{1k} + \beta_{2k}}, E[a_{ij,k}] = \frac{u^A_{ij,k}}{\sum_{j'} u^A_{ij',k}}, E[b_{im,k}] = \frac{u^B_{im,k}}{\sum_{m'} u^B_{im',k}}, and E[\pi_{i,k}] = \frac{u^\pi_{i,k}}{\sum_{i'} u^\pi_{i',k}}. Equation (35) can be evaluated efficiently using the forward-backward algorithm.
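Concretely, (35) amounts to running a forward recursion for each component with the posterior-mean parameters and weighting the result by E[p_k]. The sketch below assumes uA, uB and upi are arrays of shape (K, I, I), (K, I, M) and (K, I) holding the updated Dirichlet parameters, and beta1, beta2 the Beta parameters of q(v); these array names are our own.

import numpy as np

def forward_likelihood(y, A, B, pi):
    """p(y | A, B, pi) for a discrete HMM via the forward recursion."""
    a = pi * B[:, y[0]]
    for t in range(1, len(y)):
        a = (a @ A) * B[:, y[t]]
    return a.sum()

def vb_predictive_likelihood(y, uA, uB, upi, beta1, beta2):
    """Approximate p(y | X, Psi) as in (35) using posterior-mean HMM parameters."""
    K = uA.shape[0]
    Ev = beta1 / (beta1 + beta2)                                               # E[v_k]
    Ep = np.append(Ev, 1.0) * np.concatenate(([1.0], np.cumprod(1.0 - Ev)))    # E[p_k]
    total = 0.0
    for k in range(K):
        A = uA[k] / uA[k].sum(axis=1, keepdims=True)                           # E[a_{ij,k}]
        B = uB[k] / uB[k].sum(axis=1, keepdims=True)                           # E[b_{im,k}]
        pi = upi[k] / upi[k].sum()                                             # E[pi_{i,k}]
        total += Ep[k] * forward_likelihood(y, A, B, pi)
    return total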
C. Truncation level
In the DP prior without truncation, G is a probability mass function with an infinite number of atoms. In practice, a mixture model estimated from N observed data sequences usually cannot have more than N mixture components (in the limit, each datum is drawn from its own mixture component). We therefore truncate G at a finite level K \le N, which saves computational resources.
Here we need to make clear the relationship between the truncation level K and the utilized number of mixture components K^*. A given prior G consists of K distinct values \{\Theta^*_k\}_{k=1}^{K} with associated probabilities p_k. However, \{\Theta_n\}_{n=1}^{N} may take only a subset of \{\Theta^*_k\}_{k=1}^{K}, which means the true utilized number of mixture components K^* may be less than K (and the clustering properties of the DP almost always yield fewer than K mixture components, unless \alpha is very large). Furthermore, since G is itself random, drawn from a DP, \{\Theta^*_k\}_{k=1}^{K} will have different values rather than a set of fixed values. Consequently, the utilized number of mixture components K^* varies with different \{\Theta^*_k\}_{k=1}^{K} but will typically be less than K. However, in the DP formulation, the unoccupied HMMs are still included as part of the mixture with very small mixing weights p_k, and continue to draw from the base distribution G_0, with the potential of being occupied at some point. Therefore, from this point forward we will refer to K-component DP HMM mixtures, essentially an ensemble of K-component mixture models, though the actual number of occupied HMMs will likely be less than K.
V. MUSIC ANALYSIS VIA HMM MIXTURE MODELS
Considering music to be a set of concurrently played notes (each note defining a location in feature
space) and note transitions (time-evolving features), music can be represented as a time series, and thus
modeled by an HMM. As music often follows a deliberate structure, the underlying, hidden mechanism
of that music should not be viewed as homogenous, but rather originating from a mixture of HMMs.
We are interested in modeling music in this manner to ultimately develop a similarity metric to aid in
music browsing and query. Our experiments are confined to classical music with a highly overlapping
observation space, but multiple “motion patterns”.
We sample the music clips at 22 kHz and divide each clip into 25 ms non-overlapping frames. We
extract 10-dimensional MFCC features for each frame using downloaded software1 with each vector an
observation in feature space. We then quantize our features into discrete symbols using vector quantization
(VQ), in which the codebook is learned on the whole set of tested music [18]. For our experiments, we
use a sequence of 1 second, or 40 observations without overlap; this transforms the music into a collection
of sequences, with each sequence assumed to originate from an HMM.
1http://www.ee.columbia.edu/∼dpwe/resources/matlab/rastamat/
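This front end can be approximated with standard tools; the sketch below substitutes librosa for the MFCC extraction and SciPy's k-means for the VQ codebook (these libraries, parameter names, and settings are our own stand-ins for the software cited above, not the authors' exact configuration).

import numpy as np
import librosa
from scipy.cluster.vq import kmeans2, vq

def music_to_symbol_sequences(wav_path, codebook=None, M=32, seq_len=40):
    """Convert a clip into 1-second sequences of discrete symbols (one symbol per 25 ms frame)."""
    y, sr = librosa.load(wav_path, sr=22050)                       # sample the clip at 22 kHz
    hop = int(0.025 * sr)                                          # 25 ms non-overlapping frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10, n_mels=40,
                                n_fft=hop, hop_length=hop).T       # one 10-D MFCC vector per frame
    if codebook is None:                                           # in the paper the codebook is learned
        codebook, _ = kmeans2(mfcc.astype(float), M, seed=0)       # on the whole set of tested music
    symbols, _ = vq(mfcc.astype(float), codebook)                  # quantize frames to discrete symbols
    n_seq = len(symbols) // seq_len
    return symbols[:n_seq * seq_len].reshape(n_seq, seq_len), codebook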
A. Music Similarity Measure
The proposed method for measuring music similarity computes the distance between the respective
HMM mixture models. Similar to the work in [2], we use Monte Carlo sampling to compare two HMM
mixture models. Let Mg be the learned HMM mixture model for music g, and Mh for music h. We
draw a sample set Sg of size Ng from Mg and a sample set Sh of size Nh from Mh. The distance
between any two HMM mixture models is defined as
D(Mg,Mh) =1
2[log p(Sh|Mg) − log p(Sh|Mh)] +
1
2[log p(Sg|Mh) − log p(Sg|Mg)] , (36)
where the first two terms on the right hand side of (36) are a measure of how well model Mg matches
observations generated by model Mh, relative to how well Mh matches the observations generated by
itself. The last two terms are in the same spirit and make the distance D(Mg,Mh) symmetric. Equation
(36) can be rewritten in terms of individual samples as
D(Mg,Mh) =1
2
Nh∑
n=1
log p(S(n)h |Mg) − log p(S
(n)h |Mh)
Tn+
1
2
Ng∑
m=1
log p(S(m)g |Mh) − log p(S
(m)g |Mg)
Tm,
(37)
where S(n)h is the nth sample in Sh with length Tn, S(m)
g is the mth sample in Sg with length Tm, and the
log-likelihood for each sample given the HMM mixture model can be obtained from (25) in the MCMC
implementation or from (35) in the VB approach. The similarity Sim(g, h) of music g and h is defined
by a kernel function as
Sim(g, h) = exp(−|D(Mg,Mh)|2
σ2), (38)
where σ is a fixed parameter; we notice that the choice of σ will not change the order of similarities.
This approach is well suited to large music databases, as it only requires the storage of the HMM mixture
parameters for each piece rather than the original music data itself.
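Given per-sequence log-likelihoods under each model, (37) and (38) reduce to a few lines; the sketch below (an illustration of ours) takes those log-likelihood functions, e.g. evaluations of (25) or (35), as arguments.

import numpy as np

def mixture_distance(samples_g, samples_h, loglik_g, loglik_h):
    """Distance (37) between two HMM mixture models M_g and M_h.
    samples_g / samples_h : lists of symbol sequences drawn from M_g / M_h
    loglik_g / loglik_h   : callables returning log p(sequence | model), e.g. from (25) or (35)"""
    term_h = np.sum([(loglik_g(s) - loglik_h(s)) / len(s) for s in samples_h])
    term_g = np.sum([(loglik_h(s) - loglik_g(s)) / len(s) for s in samples_g])
    return 0.5 * (term_h + term_g)

def similarity(distance, sigma=1.0):
    """Similarity kernel (38); the choice of sigma does not change the ordering of similarities."""
    return np.exp(-abs(distance) ** 2 / sigma ** 2)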
B. DP GMM modeling
For comparison, we also model each piece of music as a DP Gaussian mixture model (DP GMM),
where the 10-dimensional MFCC feature vector of a frame corresponds to one data point in the feature
space. In the same spirit as the DP HMM mixture model, each datum in the DP GMM is assumed drawn
from a Gaussian and a DP prior is placed on the mean and precision (inverse of variance) parameters
of all Gaussians, encouraging sharing of these parameters. An MCMC solution and variational approach
for the DP GMM can be found in [31] and [9], respectively. The posterior on the number of mixtures
used in DP GMM is learned automatically from the algorithm (as for the DP-based HMM mixture model
discussed above), however the dynamic (time-evolving) information between observations is discarded.
The music similarity under DP GMM modeling is defined similarly to (36), while the log-likelihood is
given by the DP GMM. In our experiments the DP GMM is trained via the VB method for computational
efficiency.
VI. EXPERIMENTS
The effectiveness of the proposed method is demonstrated with both synthetic data and real music
data. Our synthetic problem, for which ground truth is known, exhibits how a data set, including data
with different hidden mechanisms, can be characterized by an HMM mixture. We then explore music
similarity within the classical genre with three experiments.
For the DP HMM mixture modeling in each experiment, a Ga(\alpha; 1, 1) prior is placed on the scaling parameter \alpha, which slightly favors the data over the base distribution once enough data are collected to learn the mixture model. To avoid overfitting problems in the VB approach, we choose a reasonably large K with K < N (K = 50 in our experiments). In the MCMC implementation, we set uniform, or non-informative, priors for A, B, and \pi, i.e., u^A, u^B, and u^\pi are set to the unit vector 1. In the VB algorithm, we set u^A = 1/I, u^B = 1/M, and u^\pi = 1/I to ensure good convergence [6].
For the DP GMM analysis in our music experiments, we also employ a Ga(α; 1, 1) prior on the
scaling parameter α. The base distribution G0 for model parameters (mean µ and precision matrix Γ)
is a Normal-Wishart distribution, G0 = Norm(µ;m, (βΓ)−1)Wish(Γ; ν,W), where m and W are set
to the mean and inverse covariance matrix of all observed data, respectively; β = 0.01, and ν equals the
feature dimension.
A. Synthetic data
The synthetic data are generated from three distinct HMMs, each of which generates 50 sequences of
length 30 (i.e., mixing weights for the three HMM components are 1/3). The alphabet size of the discrete
observations is M = 3 for each HMM, and the numbers of states are 2, 2, and 3, respectively.
The parameters for these three HMMs are:
A_1 = [0.8 0.2; 0.2 0.8], B_1 = [0.85 0.10 0.05; 0.10 0.85 0.05], \pi_1 = [0.5; 0.5],
A_2 = [0.2 0.8; 0.8 0.2], B_2 = [0.05 0.10 0.85; 0.05 0.85 0.10], \pi_2 = [0.5; 0.5],
A_3 = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3], B_3 = [0.90 0.05 0.05; 0.05 0.90 0.05; 0.05 0.05 0.90], \pi_3 = [1/3; 1/3; 1/3].
. In this configuration, HMM 1 tends to generate the
first two symbols with a high probability of maintaining the previous symbol, HMM 2 primarily generates
the last two symbols and has a high tendency to switch, and HMM 3 produces all three observations
with equal probability. We apply both MCMC and VB implementations to model the synthetic data as an
HMM mixture. In our algorithm, we assume that each HMM mixture component has the same number of
states, which we set to I . Note that if a given mixture component has less than I states, the I-state model
may be used with a very small probability of transitioning to un-needed states, i.e., for a superfluous
state, the corresponding row and column in the state transition matrix A will be close to zero. This is
achieved through the Dirichlet priors put on each row of A, which promote sparseness. Therefore, I is
set to a relatively large value, which can be estimated in principle by choosing the upper bound of the
model evidence: the log likelihood in MCMC implementation and the log-marginal likelihood (the lower
bound) in the VB algorithm. For brevity, we neglect this step and set I empirically. We have observed that,
in general, setting different values of I does not influence the final results in our experiments, as long
as I is sufficiently large.
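For reference, the synthetic sequences described above can be generated with a few lines (a sketch reproducing the three HMMs and the 50 sequences of length 30 per component; the fixed random seed is our own choice).

import numpy as np
rng = np.random.default_rng(0)

# The three generating HMMs (state transition A, emission B, initial distribution pi).
hmms = [
    (np.array([[0.8, 0.2], [0.2, 0.8]]),
     np.array([[0.85, 0.10, 0.05], [0.10, 0.85, 0.05]]), np.array([0.5, 0.5])),
    (np.array([[0.2, 0.8], [0.8, 0.2]]),
     np.array([[0.05, 0.10, 0.85], [0.05, 0.85, 0.10]]), np.array([0.5, 0.5])),
    (np.full((3, 3), 1.0 / 3),
     np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]), np.full(3, 1.0 / 3)),
]

def sample_hmm(A, B, pi, T):
    """Draw one length-T observation sequence from a discrete HMM."""
    s = rng.choice(len(pi), p=pi)
    x = np.empty(T, dtype=int)
    for t in range(T):
        x[t] = rng.choice(B.shape[1], p=B[s])    # emit a symbol from state s
        s = rng.choice(len(pi), p=A[s])          # move to the next state
    return x

data = [sample_hmm(A, B, pi, T=30) for (A, B, pi) in hmms for _ in range(50)]   # 150 sequences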
Fig. 2 shows the clustering results of modeling the synthetic data as an HMM mixture, where (a)-(c) employ MCMC and (d)-(f) employ VB. We set the truncation level in the DP prior to K = 50 and the number of states to I = 5. Fig. 2 (a) and (d) represent the estimated distribution of the indicators, i.e., the probability that each sequence belongs to a given HMM component, which is computed by averaging the 200 collected c samples (spaced evenly 25 iterations apart in the collection phase) in MCMC, and is the variational posterior of the indicator matrix \phi_{n,k} in VB. The memberships in Fig. 2 (b) and (e) are obtained by setting the membership of a sequence to the component for which it has the highest probability. Fig. 2 (c) and (f) show the mixing weights of the HMM mixture model computed from the MCMC and VB algorithms respectively, with the mean and standard deviation estimated from 500 collected samples in MCMC, and in VB derived readily from (10) given E(v_k) = \frac{\beta_{1k}}{\beta_{1k} + \beta_{2k}} and var(v_k) = \frac{\beta_{1k}\beta_{2k}}{(\beta_{1k} + \beta_{2k})^2 (\beta_{1k} + \beta_{2k} + 1)}.
The results show that although the true number of HMM mixture components (here 3) is unknown a
priori, both algorithms automatically reduce the superfluous components and give an accurate approxi-
mation of the true component number (3 dominant mixture components in MCMC and 4 in VB). Fig. 2
(a) and (d) show that sequences generated from the same HMM have a high probability of belonging
to the same HMM mixture component. Fig. 2 (b) and (e) clearly indicates that the synthetic data can
be clustered into three major groups, which matches the ground truth. In comparison, MCMC slightly
outperforms VB; MCMC yields a mixture of 3 HMMs to VB’s 4 and less data are indexed incorrectly.
However, the computation of MCMC is expensive; it requires roughly 4 hours of CPU in MatlabTM on a
Pentium IV PC with a 1.73 GHz CPU to compute the results (5000 burn-in and 5000 burn-out iterations),
while VB requires less than two minutes. Considering the high efficiency and acceptable performance of
Fig. 2. DP HMM mixture modeling for synthetic data. (a) The averaged indicator matrix obtained from MCMC, where the element at (i, j) is the probability that the jth sequence is generated from the ith mixture component. (b) The membership of each sequence from MCMC, which is the maximum value along the columns of the averaged indicator matrix in (a). (c) Mixing weights of the DP HMM mixture with K = 50 computed by MCMC, the error bars representing the standard deviation of each weight. (d) The variational posteriors of the indicator matrix, \phi_{n,k}, in VB. (e) The sequence membership from VB, i.e., the maximum along the columns of (d). (f) Mixing weights of the DP HMM mixture with K = 50 by VB.
Fig. 3. Hinton diagram for the similarity matrix for 4 violin music clips (1: Bach, Violin Concerto BWV 1041 Mvt I; 2: Bach, Violin Concerto BWV 1042 Mvt I; 3: Stravinsky, Violin Concerto Mvt I; 4: Stravinsky, Violin Concerto Mvt IV). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

Considering the high efficiency and acceptable performance of the VB algorithm, we adopt it for our following application.
It is important to note that the synthetic data used have overlapping observation spaces, {1,2}, {2,3},
and {1,2,3}, and are generated by HMMs with more than one unique state number. Despite this, the
algorithm correctly clusters the data into three major HMMs, and setting one comprehensive state number
(I = 5) worked well in this problem.
Fig. 4. 2-D PCA features (x1 vs. x2, one panel per clip) for the four violin music clips considered in Fig. 3.
B. Music data
For our first experiment, we choose four 3-minute violin concerto clips from two different composers.
We chose to model the first few minutes rather than the whole of a particular piece in our experiments,
because we felt that in this way we could better control the ground truth for our initial application of these
methods. Clips 1 and 2 are from Bach and are considered similar; clips 3 and 4 are from Stravinsky and are also considered similar. The clips of one composer are considered different from those of the other. All four
music clips are played using essentially the same instruments, but their styles vary, which would indicate
a high overlap in feature space, but significantly different movement. We divided each clip into 180
sequences of length 40 and quantized the MFCC feature vectors into symbols of size M = 32 with the
VQ algorithm. An HMM mixture model is then built for each clip with mixture component number, or
truncation level, set to K = 50 and number of states to I = 8; the truncation level of the DP GMM
was set to 50 as well. Fig. 3 shows the computed similarity between each clip for both HMM mixture
and GMM modeling using a Hinton diagram [14], in which the size of a block is inversely proportional
to the value of the corresponding matrix elements. To compute the distances, we use 100 sequences
of length 50 drawn from the HMM mixture and 200 samples drawn from the GMM. It is difficult to
compare the similarity matrices directly since the distance metrics are not on the same scale, but as our
goal is to suggest music by similarities, we may focus on the relative similarities within each matrix. Our
modeling by HMM mixture produces results that fit with our intuition2. However, our GMM results do
not catch the connection between clips 3 and 4, and, proportionally, do not contrast clips 1 and 2 from
3 and 4 as well. This is because of their high overlap in feature space, which we show by reducing the
10-dimensional MFCC features to a 2-dimensional space through principal component analysis (PCA)
[16] in Fig. 4 to give a sense of their distribution. We observe that the features for all four clips almost
share the same range in the feature space and have a similar distribution. If we model each piece of
music without taking into consideration the motion in feature space, it is clear that the results will be very
similar. The improved similarity recognition can be attributed to the temporal consideration given by the
HMM mixture model. Fig. 5 shows the mixing weights of the VB-learned HMM mixture models for the
four violin clips. Although the number of significant weights is initially high, the algorithm automatically
reduces this number by suppressing the superfluous components to that necessary to model each clip:
the expected mixing weights for these unused HMMs are near zero with high confidence, indicated by
the small variance of the mixing weights. We notice that each clip is represented by a different number
of dominant HMMs. For example, clips 1 and 2 require fewer HMMs, which is understandable, as the
2Prof. Scott Lindroth, Chair of the Duke University Music Department, provided guidance in assessing similarities between
the pieces considered
Fig. 5. Mixing weights of the HMM mixtures for the four violin clips (one panel per clip) when K = 50, computed by VB; the error bars represent the standard deviation of each weight.
music for these particular clips is more homogeneous. We give an example of a posterior membership
for clip 4 in Fig. 6, where those parts having similar styles should be drawn from the same HMM. The
fact that the first 20 seconds of this clip are repeated during the last 20 can be seen in their similar
membership patterns.
Fig. 6. Memberships for violin clip 4 considered in Fig. 3-5.

Fig. 7. Hinton diagram for the similarity matrix for 10 music clips (1: Beethoven, Consecration of the House; 2: Chopin, Etudes Op 10 No 01; 3: Rachmaninov, Preludes Op 23 No 02; 4: Scarlatti, Sonata K. 135; 5: Scarlatti, Sonata K. 380; 6: Debussy, String Quartet Op 10 Mvt II; 7: Ravel, String Quartet in F Mvt II; 8: Shostakovich, String Quartet No 08 Mvt II; 9: Bach, Violin Concerto BWV 1041 Mvt I; 10: Bach, Violin Concerto BWV 1042 Mvt I). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

For our second experiment, we compute the similarities between ten 3-minute clips, this time of works of different formats (instrumentation). These clips were chosen deliberately with the following intended
clustering: 1) clip 1 is unique in style and instrumentation; 2) clips 2 and 3, 4 and 5, 6 and 7, and 9 and 10 are intended to be paired together; 3) clip 8 is also unique, but is of the same format (instrumentation) as clips 6 and 7. The similarities of these clips are computed in the same manner as in the first experiment
via HMM mixture (with M = 32, I = 8, and K = 50) and GMM modeling. The Hinton diagrams of the
corresponding similarity matrices are shown in Fig. 7. Again, our intuition is consistent in this experiment
with HMM mixture modeling, but less accurate with GMM modeling. Though the GMM model does
not contradict our intuition, the similarities are not as stark as in the HMM mixture, especially in the
case of clip 1, which was selected to be unique.
Table I lists the number of dominant and utilized HMMs in the mixture model for each clip. Dominant
HMMs are those for which the corresponding expected mixing weights E[pk] > ε (ε is a small value).
Utilized HMMs are those that contain at least one sequence after the membership is calculated. Given
K = 50, the results show that each clip needs fewer HMMs than the preset truncation level.
TABLE I
NUMBER OF HMM MIXTURE COMPONENTS UTILIZED FOR EACH CLIP CONSIDERED IN FIG. 7.
Clip index:                               1   2   3   4   5   6   7   8   9  10
Number of dominant HMMs (p_k > 0.01):    26  20  22  25  25  26  27  27  29  29
Number of dominant HMMs (p_k > 0.015):   10   6   7  14  11  13  12  10  13  10
Number of utilized HMMs:                 12   6   9  14  11  14  11  11  13  10

Our third experiment posed our most challenging problem. In this experiment, we look at 15 two-minute clips from three different classical formats: solo piano, string quartet and orchestral, with five pieces from each. The feature space covered by these works is significantly larger than that spanned by the other two experiments. Given M = 64 and I = 8, we built an HMM mixture for each of the 15 clips, of which we considered 1 and 2, 3 and 4, 6 and 7, 9 and 10, and 11 and 12 to be the most similar pairs. Clips 8 and 13 were considered to be very different from all other clips and can therefore be considered
anomalies. Clips 5 and 14 were also considered more unique than similar to any other piece. The results
of the HMM modeling as well as GMM modeling are shown in Fig. 8. The HMM mixture better catches
the connection between clips 3 and 4 and deemphasizes the connections between clip 6 and clips 9 and
10. Also, though still not considered unique, the similarities of clip 5 are reduced in proportion to other
similarities within the solo piano format, indicating the beneficial effects of the temporal information. The
results indicate that clip 14 is closer to the other orchestral works than expected from human listening. In
general, the GMM and HMM mixture modeling approaches are more comparable in our third experiment,
with a slight edge given to the HMM mixture resulting from the temporal information.
It is important to note that, theoretically, the GMM should always perform well when comparing two
similar pieces, as their similarity will be manifested in part by an overlapping feature space. However,
in cases where the overlapping features contain significantly different motions, be it in rhythm or note
transitions, the GMM inherently will not perform well and will tend to yield larger similarity than the
“truth”, whereas our HMM mixture, in theory, will detect and account for these variations.
VII. CONCLUSION AND FUTURE WORK
We have developed a discrete HMM mixture model for situations in which a sequential data set may have several different underlying mechanisms that cannot be adequately modeled by a single HMM. This model is built in a Bayesian setting using DP priors, which have the advantage of automatically determining the number of components and the associated memberships by encouraging parameter sharing.
Gibbs sampler, and a VB approach via maximizing a variational lower bound.
The performance of HMM mixture modeling was demonstrated on both synthetic and music data
sets. We compared MCMC and VB implementations with the synthetic data, where we showed that VB
provides an acceptable and highly efficient alternative to MCMC, allowing consideration of large data
sets, such as music. For our music application, we presented three experiments within the classical music
genre, where the MFCC feature spaces for each piece of music was highly overlapped. We compared our
Fig. 8. Hinton diagram for the similarity matrix for 15 music clips (1: Bach, Sinfonia No 02; 2: Bach, Sinfonia No 11; 3: Beethoven, Piano Works/Bagatelle Op 126 No 2; 4: Beethoven, Piano Works/Bagatelle Op 126 No 4; 5: Prokofiev, Romeo and Juliet No 01; 6: Brahms, String Quartet No 1 Mvt I; 7: Brahms, String Quartet No 1 Mvt III; 8: Bartok, String Quartet No 4 Mvt V; 9: Mozart, String Quartet No 16 Mvt III; 10: Mozart, String Quartet No 17 Mvt I; 11: Beethoven, Symphony No 8 Mvt IV; 12: Schubert, Symphony No 9 Mvt IV; 13: Bach, Brandenburg Concerto No 1 Mvt I; 14: Stravinsky, Symphony in Three Movements I; 15: Brahms, Symphony No 1 Mvt I). (a) Similarity matrix learned by DP HMM mixture models; (b) Similarity matrix learned by DP GMMs.

We compared our
We compared our HMM mixture model to the GMM, computing similarities between music pieces as a measure
of performance. The results showed that HMM mixture modeling was better able to distinguish the content
of the given music, generally providing sharper contrasts in similarity than the GMM. The results from
this and the synthetic data are both promising. Where GMMs have been shown to succeed at genre
classification, our HMM mixture model matches this performance and exceeds it within a single genre
by taking the temporal character of the given music into account. However, we have to break the input
sequence into a number of short segments, which creates a trade-off between the accuracy of boundary
placement (which favors short segments) and the accuracy of HMM parameter estimation and recognition
(which favors longer segments). In addition, passages that span the boundary between two segments may
show characteristics of more than one HMM. These issues constitute important subjects for future work.
APPENDIX: DERIVATION OF UPDATING EQUATIONS IN VB APPROACH
1) Update q(A_k^*), k = 1, \cdots, K: Only those terms related to q(A_k^*) in (33) are kept for the derivation,
and then we have

\mathcal{L}(q(A_k^*)) = \int q(A_k^*) \Big[ \log p(A_k^*) + \sum_{n=1}^{N} q(c_n = k) \int q(s_{nk}) \log p(x_n, s_{nk} \mid A^*, B^*, \pi^*, c_n) - \log q(A_k^*) \Big] dA_k^*
= \int q(A_k^*) \big[ Z - \log q(A_k^*) \big] \, dA_k^*,   (39)
where
Z = \sum_{i,j=1}^{I} (u_j^A - 1) \log a_{ij,k} + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{s_{nk}} q(s_{nk}) \sum_{t=1}^{T_n} \log a_{s_{nk,t} s_{nk,t+1}, k} \Big]   (40)
with \phi_{n,k} = q(c_n = k). The quantity Z can be rewritten as

Z = \sum_{i,j=1}^{I} (u_j^A - 1) \log a_{ij,k} + \sum_{n=1}^{N} \phi_{n,k} \sum_{i,j=1}^{I} \sum_{s_{nk}} q(s_{nk}) \sum_{t=1}^{T_n} \log a_{ij,k} \, \delta(s_{nk,t} = i, s_{nk,t+1} = j)
= \sum_{i,j=1}^{I} \Big\{ u_j^A + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) \Big] - 1 \Big\} \log a_{ij,k},   (41)
where \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) can be computed from the forward-backward algorithm [22]. Then
we have

\mathcal{L}(q(A_k^*)) = - \int q(A_k^*) \log \frac{q(A_k^*)}{\prod_{i,j=1}^{I} a_{ij,k}^{\,u_{ij,k}^A - 1}} \, dA_k^*,   (42)
where
u_{ij,k}^A = u_j^A + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, s_{nk,t+1} = j) \Big].   (43)
According to Gibbs' inequality, i.e.,

q^*(x) = \arg\min_{q(x)} \int q(x) \log \frac{q(x)}{p(x)} \, dx = \frac{1}{C} p(x),   (44)

where C is a normalizing constant ensuring \int q^*(x) dx = 1, \mathcal{L}(q(A_k^*)) is maximized with respect to
q(A_k^*) by choosing

q(A_k^*) \propto \prod_{i,j=1}^{I} a_{ij,k}^{\,u_{ij,k}^A - 1}, \quad \text{i.e.,} \quad q(A_k^*) = \prod_{i=1}^{I} \mathrm{Dir}(\{a_{ij,k}\}_{j=1,\ldots,I}\,;\,\{u_{ij,k}^A\}_{j=1,\ldots,I}),   (45)

which is a product of Dirichlet distributions with hyper-parameters u_{ij,k}^A.
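For concreteness, the hyper-parameter update (43) reduces to a weighted accumulation of expected transition counts. The Python sketch below illustrates this; the array names and shapes (phi, xi, uA0) and the random stand-in inputs are our own conventions rather than the outputs of an actual forward-backward pass.

import numpy as np

I, K, N = 3, 2, 4                 # HMM states, mixture components, sequences (toy sizes)
rng = np.random.default_rng(0)
uA0 = np.ones(I)                  # prior Dirichlet parameters u^A_j (assumed symmetric here)
phi = rng.dirichlet(np.ones(K), size=N)          # phi[n, k] = q(c_n = k)

# xi[n, k, i, j] stands in for sum_t q(s_{nk,t} = i, s_{nk,t+1} = j), the expected
# transition counts for sequence n under component k (random stand-ins here).
xi = rng.gamma(1.0, 1.0, size=(N, K, I, I))

# u^A_{ij,k} = u^A_j + sum_n phi[n, k] * xi[n, k, i, j], as in (43)
uA = uA0[None, None, :] + np.einsum('nk,nkij->kij', phi, xi)
print(uA.shape)                   # (K, I, I); each row i of component k parameterizes a Dirichlet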
2) Update q(B_k^*), k = 1, \cdots, K: \mathcal{L}(q) is maximized with respect to q(B_k^*) following a similar procedure,
and the optimal q(B_k^*) is obtained as

q(B_k^*) = \prod_{i=1}^{I} \mathrm{Dir}(\{b_{im,k}\}_{m=1,\ldots,M}\,;\,\{u_{im,k}^B\}_{m=1,\ldots,M}),   (46)

where

u_{im,k}^B = u_m^B + \sum_{n=1}^{N} \phi_{n,k} \Big[ \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) \Big].   (47)
3) Update q(\pi_k^*), k = 1, \cdots, K: Similarly, the optimal q(\pi_k^*) to maximize \mathcal{L}(q) is derived as

q(\pi_k^*) = \mathrm{Dir}(\{\pi_{i,k}\}_{i=1,\ldots,I}\,;\,\{u_{i,k}^{\pi}\}_{i=1,\ldots,I}),   (48)

where

u_{i,k}^{\pi} = u_i^{\pi} + \sum_{n=1}^{N} \phi_{n,k} \, q(s_{nk,1} = i).   (49)

The quantities \sum_{t=1}^{T_n} q(s_{nk,t} = i, x_{n,t} = m) in (47) and q(s_{nk,1} = i) in (49) can also be computed from the forward-backward algorithm.
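The updates (47) and (49) can be accumulated in the same manner. In the sketch below, gamma_marg stands in for the smoothed marginals q(s_{nk,t} = i) returned by the forward-backward algorithm, and all names, shapes, and random inputs are our own illustrative conventions.

import numpy as np

I, M, K, N, T = 3, 5, 2, 4, 20    # states, codebook size, components, sequences, length (toy)
rng = np.random.default_rng(1)
uB0 = np.ones(M)                  # prior parameters u^B_m
upi0 = np.ones(I)                 # prior parameters u^pi_i
phi = rng.dirichlet(np.ones(K), size=N)                  # phi[n, k] = q(c_n = k)
x = rng.integers(0, M, size=(N, T))                      # discrete (VQ-coded) observations
gamma_marg = rng.dirichlet(np.ones(I), size=(N, K, T))   # gamma_marg[n, k, t, i] = q(s_{nk,t} = i)

uB = np.tile(uB0, (K, I, 1))      # will hold u^B_{im,k}, laid out as (K, I, M)
upi = np.tile(upi0, (K, 1))       # will hold u^pi_{i,k}, laid out as (K, I)
for n in range(N):
    for k in range(K):
        for t in range(T):
            # q(s_{nk,t} = i, x_{n,t} = m) is nonzero only at the observed symbol m = x[n, t]
            uB[k, :, x[n, t]] += phi[n, k] * gamma_marg[n, k, t, :]
        upi[k, :] += phi[n, k] * gamma_marg[n, k, 0, :]  # the q(s_{nk,1} = i) term in (49)
print(uB.shape, upi.shape)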
4) Update q(\alpha): The lower bound related to q(\alpha) is given as

\mathcal{L}(q(\alpha)) = \int q(\alpha) \Big[ \log p(\alpha) + \int q(\mathbf{v}) \log p(\mathbf{v} \mid \alpha) \, d\mathbf{v} - \log q(\alpha) \Big] d\alpha.   (50)

Using the property that \int \mathrm{Dir}(p; u) \log p_i \, dp = \psi(u_i) - \psi(\sum_j u_j), where \psi(x) = \frac{\partial}{\partial x} \log \Gamma(x), and
noting that the Beta distribution is a two-parameter Dirichlet distribution, (50) is maximized at

q(\alpha) = \mathrm{Ga}(\alpha; \gamma_1, \gamma_2)   (51)

with \gamma_1 = \gamma_{01} + K - 1 and \gamma_2 = \gamma_{02} - \sum_{k=1}^{K-1} [\psi(\beta_{2k}) - \psi(\beta_{1k} + \beta_{2k})].
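As a small numerical illustration of (51), with arbitrary stand-in values for the prior parameters and for the Beta parameters of q(v):

import numpy as np
from scipy.special import digamma

K = 10
gamma01, gamma02 = 1.0, 1.0                    # prior Ga(alpha; gamma01, gamma02) (stand-ins)
rng = np.random.default_rng(2)
beta1 = 1.0 + rng.gamma(1.0, 5.0, size=K - 1)  # stand-ins for the Beta parameters in (54)
beta2 = 1.0 + rng.gamma(1.0, 5.0, size=K - 1)

gamma1 = gamma01 + K - 1
gamma2 = gamma02 - np.sum(digamma(beta2) - digamma(beta1 + beta2))
print(gamma1, gamma2)                          # q(alpha) = Ga(alpha; gamma1, gamma2)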
5) Update q(\mathbf{v}): Collecting all the quantities related to q(\mathbf{v}), we have

\mathcal{L}(q(\mathbf{v})) = \int q(\mathbf{v}) \Big[ \int q(\alpha) \log p(\mathbf{v} \mid \alpha) \, d\alpha + \sum_{n=1}^{N} \sum_{k=1}^{K} q(c_n = k) \log p(c_n = k \mid \mathbf{v}) - \log q(\mathbf{v}) \Big] d\mathbf{v}.   (52)

p(c_n = k \mid \mathbf{v}) is rewritten in [9] as

p(c_n = k \mid \mathbf{v}) = \prod_{l=1}^{K} (1 - v_l)^{\mathbf{1}[c_n > l]} \, v_l^{\mathbf{1}[c_n = l]},   (53)

where \mathbf{1}[x] is the indicator function. Substituting (53) into (52) and recalling that p(v_k \mid \alpha) = \mathrm{Beta}(v_k; 1, \alpha)
and q(\alpha) = \mathrm{Ga}(\alpha; \gamma_1, \gamma_2), the optimized q(\mathbf{v}) becomes

q(\mathbf{v}) = \prod_{k=1}^{K-1} q(v_k) = \prod_{k=1}^{K-1} \mathrm{Beta}(v_k; \beta_{1k}, \beta_{2k}),   (54)

where \beta_{1k} = 1 + \sum_{n=1}^{N} \phi_{n,k} and \beta_{2k} = \frac{\gamma_1}{\gamma_2} + \sum_{n=1}^{N} \sum_{i=k+1}^{K} \phi_{n,i}.
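For illustration, the Beta-parameter update in (54) amounts to accumulating the responsibilities and their upper tails; in the sketch below phi, gamma1, and gamma2 are stand-in values of our own choosing.

import numpy as np

N, K = 100, 10
rng = np.random.default_rng(3)
phi = rng.dirichlet(np.ones(K), size=N)         # phi[n, k] = q(c_n = k) (stand-in values)
gamma1, gamma2 = 10.0, 2.0                      # current parameters of q(alpha) (stand-ins)

counts = phi.sum(axis=0)                                           # sum_n phi[n, k]
tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))  # sum_n sum_{i>k} phi[n, i]
beta1 = 1.0 + counts[:K - 1]                    # beta_{1k} = 1 + sum_n phi[n, k]
beta2 = gamma1 / gamma2 + tail[:K - 1]          # beta_{2k} uses E[alpha] = gamma1 / gamma2
print(beta1.shape, beta2.shape)                 # both (K-1,): q(v_k) = Beta(v_k; beta1[k], beta2[k])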
6) Update q(s_{nk}), n = 1, \cdots, N, k = 1, \cdots, K: The \mathcal{L}(q(s_{nk})) is expressed as

\mathcal{L}(q(s_{nk})) = \int q(s_{nk}) \, \phi_{n,k} \Big[ \int q(A_k^*) \sum_{t=1}^{T_n - 1} \log a_{s_{nk,t} s_{nk,t+1}, k} + \int q(B_k^*) \sum_{t=1}^{T_n} \log b_{s_{nk,t} x_{n,t}, k} + \int q(\pi_k^*) \log \pi_{s_{nk,1}, k} - \log q(s_{nk}) \Big] ds_{nk}.   (55)
Using the property again that \int \mathrm{Dir}(p; u) \log p_i \, dp = \psi(u_i) - \psi(\sum_j u_j), we define

a_{ij,k} = \exp\Big[ \psi(u_{ij,k}^A) - \psi\Big( \sum_{j'=1}^{I} u_{ij',k}^A \Big) \Big]   (56)

b_{im,k} = \exp\Big[ \psi(u_{im,k}^B) - \psi\Big( \sum_{m'=1}^{M} u_{im',k}^B \Big) \Big]   (57)

\pi_{i,k} = \exp\Big[ \psi(u_{i,k}^{\pi}) - \psi\Big( \sum_{i'=1}^{I} u_{i',k}^{\pi} \Big) \Big].   (58)
Maximizing (55) is achieved at

q(s_{nk}) \propto \pi_{s_{nk,1}, k} \Big[ \prod_{t=1}^{T_n - 1} a_{s_{nk,t} s_{nk,t+1}, k} \Big] \Big[ \prod_{t=1}^{T_n} b_{s_{nk,t} x_{n,t}, k} \Big].   (59)
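The quantities in (56)-(58) are easily formed from the current hyper-parameters and are then used in place of point estimates of A_k^*, B_k^*, and \pi_k^* inside the standard forward-backward recursions implied by (59). The sketch below, with random stand-in hyper-parameter arrays of the shapes used above, illustrates the computation.

import numpy as np
from scipy.special import digamma

I, M, K = 3, 5, 2
rng = np.random.default_rng(4)
uA = rng.gamma(2.0, 1.0, size=(K, I, I))   # stand-ins for u^A_{ij,k}
uB = rng.gamma(2.0, 1.0, size=(K, I, M))   # stand-ins for u^B_{im,k}
upi = rng.gamma(2.0, 1.0, size=(K, I))     # stand-ins for u^pi_{i,k}

a_sub  = np.exp(digamma(uA)  - digamma(uA.sum(axis=2,  keepdims=True)))   # (56)
b_sub  = np.exp(digamma(uB)  - digamma(uB.sum(axis=2,  keepdims=True)))   # (57)
pi_sub = np.exp(digamma(upi) - digamma(upi.sum(axis=1, keepdims=True)))   # (58)

# Each row sums to slightly less than one; feeding these into forward-backward
# yields q(s_{nk}) and the marginals needed in (43), (47), and (49).
print(a_sub.sum(axis=2).round(3), pi_sub.sum(axis=1).round(3))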
7) Update q(c_n = k), n = 1, \cdots, N, k = 1, \cdots, K: \mathcal{L}(q(c_n = k)) is given as

\mathcal{L}(q(c_n = k)) = \int q(c_n = k) \Big[ \int q(\mathbf{v}) \log p(c_n = k \mid \mathbf{v}) + \int q(s_{nk}) q(A_{c_n}^*) q(B_{c_n}^*) q(\pi_{c_n}^*) \log p(x_n, s_{nk} \mid A^*, B^*, \pi^*, c_n) - \log q(c_n = k) \Big].   (60)
Define

q'(c_n = k) = \exp\Big\{ \sum_{l=1}^{k-1} [\psi(\beta_{2l}) - \psi(\beta_{1l} + \beta_{2l})] + [\psi(\beta_{1k}) - \psi(\beta_{1k} + \beta_{2k})] + \sum_{s_{nk}} q(s_{nk}) \Big( \log \pi_{s_{nk,1}, k} + \sum_{t=1}^{T_n - 1} \log a_{s_{nk,t} s_{nk,t+1}, k} + \sum_{t=1}^{T_n} \log b_{s_{nk,t} x_{n,t}, k} \Big) \Big\};   (61)

then the q(c_n = k) that optimizes (60) is expressed as

q(c_n = k) = \frac{q'(c_n = k)}{\sum_{k'=1}^{K} q'(c_n = k')}.   (62)
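For illustration, (61)-(62) can be evaluated in the log domain and then normalized over k. In the sketch below, loglik stands in for the expectation over q(s_{nk}) of the complete-data log-likelihood appearing in (61), and all values are random stand-ins of our own choosing.

import numpy as np
from scipy.special import digamma, logsumexp

N, K = 4, 3
rng = np.random.default_rng(5)
beta1 = 1.0 + rng.gamma(1.0, 5.0, size=K)        # stand-ins for the Beta parameters of q(v)
beta2 = 1.0 + rng.gamma(1.0, 5.0, size=K)
loglik = -rng.gamma(50.0, 1.0, size=(N, K))      # stand-in for E_{q(s_nk)}[log p(x_n, s_nk | ...)]

Elogv   = digamma(beta1) - digamma(beta1 + beta2)                      # E[log v_k]
Elog1mv = digamma(beta2) - digamma(beta1 + beta2)                      # E[log (1 - v_k)]
prior_term = np.concatenate(([0.0], np.cumsum(Elog1mv[:-1]))) + Elogv  # first two terms of (61)

log_q = prior_term[None, :] + loglik
phi = np.exp(log_q - logsumexp(log_q, axis=1, keepdims=True))  # q(c_n = k); rows sum to 1, as in (62)
print(phi.round(3))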
ACKNOWLEDGEMENT
The authors thank Prof. Scott Lindroth, Chair of the Duke University Music Department, for reviewing
our results and providing guidance with regard to defining “truth” for the music data.
REFERENCES
[1] J.-J. Aucouturier and F. Pachet, “The influence of polyphony on the dynamical modelling of musical timbre.” Pattern
Recognition Letters, (in press, 2007).
[2] ——, “Music similarity measures: What’s the use?” In Proceedings of the International Symposium on Music Information
Retrieval (ISMIR), 2002.
[3] ——, “Improving timbre similarity: How high’s the sky?” Journal of Negative Results in Speech and Audio Sciences,
vol. 1, no. 1, 2004.
[4] J.-J. Aucouturier and M. Sandler, “Segmentation of musical signals using hidden markov models,” In Proceedings of the
110th Convention of the Audio Engineering Society, May 2001.
[5] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 5, no. 2, pp. 179–190, 1983.
[6] M. J. Beal, “Variational algorithms for approximate bayesian inference,” Ph.D. dissertation, Gatsby Computational
Neuroscience Unit, University College London, 2003.
[7] A. Berenzweig, B. Logan, D. P. Ellis, and B. Whitman, “A large-scale evaluation of acoustic and subjective music similarity
measures,” Computer Music Journal, vol. 28, no. 2, pp. 63–76, 2004.
[8] D. Blackwell and J. MacQueen, “Ferguson distributions via polya urn schemes,” Annals of Statistics, vol. 1, pp. 353–355,
1973.
[9] D. Blei and M. Jordan, “Variational methods for the dirichlet process,” In Proceedings of the 21st International Conference
on Machine Learning, 2004.
[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of
Royal Statistical Society B, vol. 39, no. 1, pp. 1–38, 1977.
[11] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” Journal of the American Statistical
Association, vol. 90, no. 430, pp. 577–588, 1995.
[12] T. S. Ferguson, “A bayesian analysis of some nonparametric problems,” Annals of Statistics, vol. 1, pp. 209–230, 1973.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, “Introducing markov chain monte carlo,” in Markov Chain Monte
Carlo in Practice. London, U.K.: Chapman Hall, 1996.
[14] G. Hinton, T. Sejnowski, and D. Ackley, “Boltzmann machines: Constraint satisfaction networks that learn,” 1984.
[15] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” Journal of the American Statistical
Association, Theory and Methods, vol. 96, no. 453, pp. 161–173, 2001.
[16] I. T. Jolliffe, Principal Component Analysis, 2nd ed. Springer, 2002.
[17] Y. Lin, “Learning phonetic features from waveforms,” UCLA Working Papers in Phonetics (WPP), no. 103, pp. 64–70,
September 2004.
[18] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Communications, vol.
COM-28, no. 1, pp. 84–95, 1980.
[19] B. Logan and A. Salomon, “A music similarity function based on signal analysis,” in ICME 2001, 2001.
[20] D. J. C. MacKay, “Ensemble learning for hidden markov models,” Technical report, Cavendish Laboratory, University of
Cambridge, 1997.
[21] M. Mandel, G. Poliner, and D. Ellis, “Support vector machine active learning for music retrieval,” Multimedia Systems,
Special issue on machine learning approaches to multimedia information retrieval, vol. 12, no. 1, pp. 3–13, 2006.
[22] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of
the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[23] C. Raphael, “Automatic segmentation of acoustic musical signals using hidden markov models,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 21, no. 4, pp. 360–370, 1999.
[24] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.
[25] N. Scaringella and G. Zoia, “On the modeling of time information for automatic genre recognition systems in audio
signals,” in Proc. of the 6th Int. Symposium on Music Information Retrieval (ISMIR), pp. 666–671, 2005.
[26] A. Schliep, B. Georgi, W. Rungsarityotin, I. G. Costa, and A. Schonhuth, “The general hidden markov model library:
Analyzing systems with unobservable states.” [Online]. Available: www.billingpreis.mpg.de/hbp04/ghmm.pdf
[27] J. Sethuraman, “A constructive definition of the dirichlet prior,” Statistica Sinica, vol. 2, pp. 639–650, 1994.
[28] X. Shao, C. Xu, and M. Kankanhalli, “Unsupervised classification of musical genre using hidden markov model,” in Proc.
IEEE Int. Conf. Multimedia Explore (ICME), pp. 2023–2026, 2004.
[29] A. Sheh and D. P. W. Ellis, “Chord segmentation and recognition using EM-trained hidden markov models,” In Proceedings
of the 4th Annual International Symposium on Music Information Retrieval (ISMIR), pp. 183–189, October 2003.
[30] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical dirichlet processes,” Journal of the American Statistical
Association, 2006.
[31] M. West, P. Muller, and M. Escobar, “Hierarchical priors and mixture models with applications in regression and density
estimation,” in Aspects of Uncertainty, P. R. Freeman and A. F. Smith, Eds. John Wiley, 1994, pp. 363–386.
[32] B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with minnowmatch,” Proceedings of the 2001 IEEE
Workshop on Neural Networks for Signal Processing, pp. 559–568, 2001.
[33] C. Xu, N. C. Maddage, X. Shao, F. Cao, and Q. Tian, “Musical genre classification using support vector machines,” In
International Conference on Acoustics, Speech, and Signal Processing. IEEE, vol. V, pp. 429–432, 2003.