HAL Id: hal-00643951, https://hal.archives-ouvertes.fr/hal-00643951. Submitted on 23 Nov 2011.

Antoine Liutkus, Roland Badeau, Gaël Richard. Gaussian Processes for Underdetermined Source Separation. IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, 2011, 59 (7), pp. 3155-3167. doi:10.1109/TSP.2011.2119315. hal-00643951.



Gaussian Processes for Underdetermined Source Separation

Antoine Liutkus, Roland Badeau, Senior Member, IEEE, Gaël Richard, Senior Member, IEEE

Abstract—Gaussian process (GP) models are very popular for machine learning and regression and they are widely used to account for spatial or temporal relationships between multivariate random variables. In this paper, we propose a general formulation of underdetermined source separation as a problem involving GP regression. The advantage of the proposed unified view is firstly to describe the different underdetermined source separation problems as particular cases of a more general framework. Secondly, it provides a flexible means to include a variety of prior information concerning the sources, such as smoothness, local stationarity or periodicity, through the use of adequate covariance functions. Thirdly, given the model, it provides an optimal solution in the minimum mean squared error (MMSE) sense to the source separation problem. In order to make the GP models tractable for very large signals, we introduce framing as a GP approximation and we show that computations for regularly sampled and locally stationary GPs can be done very efficiently in the frequency domain. These findings establish a deep connection between GPs and Nonnegative Tensor Factorizations with the Itakura-Saito distance and lead to effective methods to learn GP hyperparameters for very large and regularly sampled signals.

Index Terms—Gaussian Processes, NMF, NTF, Source Separation, Probability Theory, Regression, Kriging, Cokriging

I. INTRODUCTION

Gaussian processes [28], [35], [36], [44] are commonly used to model functions whose mean and covariances are known. Given some learning points, they enable us to estimate the values taken by the function at any other points of interest. Their main advantages are to provide a simple and effective probabilistic framework for regression and classification, as well as an effective means to optimize a model's parameters through maximization of the marginal likelihood of the observations. For these reasons, they are widely used in many areas to model dependencies between multivariate random variables, and their use can be traced back at least to works by Wiener in 1941 [43]. They have also been known in geostatistics under the name of kriging for almost 40 years [29]. A great surge of interest in Gaussian Process (GP) models occurred when they were expressed as a general-purpose framework for regression as well as for classification (see [35] for a review). Their relations to other methods commonly used in machine learning, such as multi-layer perceptrons, spline interpolation or support vector machines, are now well understood.

Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

The authors are with Institut Telecom, Telecom ParisTech, CNRS LTCI, France. This work is partly funded by the French National Research Agency (ANR) as a part of the DReaM project (ANR-09-CORD-006-03) and partly supported by the Quaero Programme, funded by OSEO, French State agency for innovation.

Source separation is another very active field of research (see [10] for a review), where the objective is to recover several unknown signals called sources that were mixed together in observable mixtures. Source separation problems arise in many fields such as sound processing, telecommunications and image processing. They differ mainly in the relative number of mixtures per source signal and in the nature of the mixing process. The latter is generally modeled as convolutive, i.e. as a linear filtering of the sources into the mixtures. When the mixing filters reduce to a single amplification gain, the mixing is called instantaneous. When there are more mixtures than sources, the problem is called overdetermined and algorithms may rely on beamforming techniques to perform source separation. When there are fewer mixtures than sources, the problem is said to be underdetermined and is notably known to be very difficult. Indeed, in this case there are fewer observable signals than necessary to solve the underlying mixing equations. Many models have hence been studied to address this problem, and they all either restrict the set of possible source signals or assign prior probabilities to them in a Bayesian setting. Among the most popular approaches, we can mention Independent Component Analysis [6], which focuses both on probabilistic independence between the source signals and on higher-order statistics. We can also cite Nonnegative Matrix Factorization (NMF) source separation, which models the sources as locally stationary with constant normalized power spectra and time-varying energy [16], [27].

In this study, we revisit underdetermined source separation (USS) as a problem involving GP regression. To our knowledge, no unified treatment of the different underdetermined linear source separation problems in terms of classical GP is available to date, and we thus propose here an attempt at providing such a formulation, whose advantages are numerous. Firstly, it provides a unified framework for handling the different USS problems as particular cases, including convolutive or instantaneous mixing as well as single or multiple mixtures. Secondly, when prior information such as smoothness, local stationarity or periodicity is available, it can be taken into account through appropriate covariance functions, thus providing a significant expressive power to the model. Thirdly, it yields an optimal way in the minimum mean squared error (MMSE) sense to proceed to the separation of the sources given the model.

In spite of all their interesting features, GP models come at a high $O(n^3)$ computational cost, where $n$ is the number of training points. For many applications such as audio signal processing, where $n \approx 10^7$ is common, this cost is prohibitive. Hence, the GP framework has to come along with effective methods to simplify the computations in order to be of practical use.

Over the years, many approximation methods have been proposed [31], [34], [37], [38], [40] to address this issue, and we show that the common practice of framing in audio signal processing can precisely be understood, in terms of GP modeling, as a particular choice of GP approximation. In particular, we give its connections with the recently published Partially Independent Conditional (PIC) approximation [37] and Compact Support (CS) covariance functions [31], [40]. For the special case of locally stationary and regularly sampled signals, we furthermore show that computations can be performed extremely efficiently in the frequency domain, and we establish a novel connection between GP models and the emerging techniques of Nonnegative Tensor Factorizations (NTF) [9] using the Itakura-Saito divergence.

The article is organized as follows. First, we present GPs and particularly Gaussian Process Regression (GPR) in section II. Then, we set out the various linear underdetermined source separation problems in terms of GPR in section III. In order to make the GP models tractable for very large signals, we introduce framing as a GP approximation and we show that computations for regularly sampled and locally stationary GPs can be done very efficiently in the frequency domain in section IV. Finally, we illustrate the performance of the methods on synthetic and real data in section V and draw some conclusions in section VI.

II. GAUSSIAN PROCESSES

A. Introduction

A Gaussian process [28], [35], [36], [44] is a possibly infinite set of scalar random variables {f(x)}_{x∈𝒳} indexed by an input space 𝒳, typically 𝒳 = ℝ^D, and taking values in ℝ, such that for any finite set of inputs X = {x₁ ⋯ xₙ} ∈ 𝒳ⁿ, f ≜ [f(x₁) ⋯ f(xₙ)]^⊤ is distributed with respect to a multivariate Gaussian distribution¹. A GP is thus completely determined by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].

More fundamentally, a GP may be understood as a process whose mean m(x) and covariance k(x, x′) between any two inputs are known. Given only this prior information, assigning a multivariate Gaussian distribution to f given any finite set X of inputs from 𝒳 is a sensible choice, since it is the probability distribution that maximizes entropy when only the first two moments are known [23].

It has been shown that the class of valid covariance functions coincides with the class of positive definite functions [1]. Let X be a finite set of elements from 𝒳, possibly randomly drawn as in [19]; the covariance matrix K_{f,XX} is defined as [K_{f,XX}]_{i,j} = k(xᵢ, xⱼ) and the probability of f given X is then given by²:

$$ p(\mathbf{f} \mid X) = \frac{1}{(2\pi)^{n/2} |K_{f,XX}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{f}-\mathbf{m})^{\top} K_{f,XX}^{-1} (\mathbf{f}-\mathbf{m}) \right) \qquad (1) $$

where m ≜ [m(x₁) ⋯ m(xₙ)]^⊤. This is usually written:

f ∼ GP(m(x), k(x, x′))

¹The symbol ≜ denotes a definition.
²Positive semi-definite covariance matrices are possible. In the case of a singular K_{f,XX}, a characterization involving the characteristic function instead of (1) is required.

Most studies in underdetermined source separation focus on the single-sensor scenario 𝒳 = ℝ. Still, there is no difficulty involved in considering the general case 𝒳 = ℝ^D, and we will see examples of GPs defined on a multidimensional input space in sections III-C and IV. This framework thus easily allows modeling multivariate functions defined on arbitrary input spaces, and many studies have used Gaussian processes for regression (f(x) ∈ ℝ) as well as for classification (f(x) ∈ ℕ). Their main advantages are to provide a probabilistic interpretation and a way to compute the variances of the estimates. From now on, we will focus on GPR, since our objective is to highlight the connections between GPs and source separation, which is usually stated in terms of processes taking values in ℝ. For the sake of notational simplicity, we will assume a priori centered signals, i.e. ∀x ∈ 𝒳, m(x) = 0, as is very common for audio signals. Still, there is no particular issue raised when considering arbitrary mean functions.

B. Gaussian process regression

Suppose we observe y(x) = f(x) + ε(x), with f(x) being the signal of interest and ε(x) being some additive signal — usually called noise — that is independent from f(x), for a finite set X of input points from 𝒳: X = {x₁ ⋯ xₙ} ∈ 𝒳ⁿ. We want to estimate the values taken by f on a finite and possibly different set X* = {x*₁ ⋯ x*_{n*}} ∈ 𝒳^{n*} of input points from 𝒳. Let us furthermore assume that f ∼ GP(0, k_f(x, x′)) and ε ∼ GP(0, k_ε(x, x′)), where the covariance functions k_f and k_ε are known. As f and ε are supposed independent, we have:

$$ f + \epsilon \sim \mathcal{GP}\left(0,\; k_f(x,x') + k_\epsilon(x,x')\right) \qquad (2) $$

Let K_{f,XX*} be the covariance matrix defined by [K_{f,XX*}]_{ij} = k_f(xᵢ, x*ⱼ). We define K_{f,X*X}, K_{f,X*X*} and K_{ε,XX} in the same way. Let f ≜ [f(x₁) ⋯ f(xₙ)]^⊤, f* ≜ [f(x*₁) ⋯ f(x*_{n*})]^⊤ and similarly for y. We have:

$$ \begin{bmatrix} \mathbf{y} \\ \mathbf{f}^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K_{f,XX} + K_{\epsilon,XX} & K_{f,XX^*} \\ K_{f,X^*X} & K_{f,X^*X^*} \end{bmatrix} \right) $$

Classical probability results then assert that the conditional distribution of f* given y is (see [35]):

$$ \mathbf{f}^* \mid \mathbf{y} \sim \mathcal{N}\left( \bar{\mathbf{f}}^*, \operatorname{cov} \mathbf{f}^* \right) \qquad (3) $$

with³:

$$ \bar{\mathbf{f}}^* = K_{f,X^*X} \left[ K_{f,XX} + K_{\epsilon,XX} \right]^{-1} \mathbf{y} \qquad (4) $$

and

$$ \operatorname{cov} \mathbf{f}^* = K_{f,X^*X^*} - K_{f,X^*X} \left[ K_{f,XX} + K_{\epsilon,XX} \right]^{-1} K_{f,XX^*} \qquad (5) $$

These expressions show that the maximum likelihood estimate f̂* of f* | y is found by setting f̂* = f̄*, which is also the Minimum Mean Squared Error (MMSE) estimate in the Gaussian case. This result will be fundamental when performing source separation using Gaussian processes. We can furthermore compute the covariance of the estimates.

³In the case of a singular covariance matrix K_{f,XX} + K_{ε,XX}, numerical methods such as Moore-Penrose pseudo-inversion may be used.
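As an illustration, the following is a minimal numerical sketch of the GPR equations (4) and (5), assuming a squared exponential covariance for f (see section II-C) and i.i.d. white Gaussian noise for ε; all data and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def k_se(x, xp, sigma=1.0, lam=1.0):
    """Squared exponential covariance for D = 1 (cf. section II-C)."""
    tau = x[:, None] - xp[None, :]
    return sigma**2 * np.exp(-tau**2 / (2 * lam**2))

# Illustrative data: y = f + eps observed on X, estimation on test inputs Xs.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-5, 5, 30))
y = np.sin(X) + 0.1 * rng.standard_normal(30)
Xs = np.linspace(-5, 5, 200)

Kxx = k_se(X, X) + 0.1**2 * np.eye(len(X))   # K_f,XX + K_eps,XX (white noise)
Ksx = k_se(Xs, X)                            # K_f,X*X
Kss = k_se(Xs, Xs)                           # K_f,X*X*

f_mean = Ksx @ np.linalg.solve(Kxx, y)               # posterior mean, eq. (4)
f_cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)      # posterior cov, eq. (5)
```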


Figure 1. Typical realizations of GPs with SE and periodic covariance functions for different values of the hyperparameters. On the left column, D = 1 and the processes are parameterized by only 3 scalars. On the right, D = 2 and the processes are parameterized by 6 scalars.

C. Covariance functions

Many studies (see [1] for a review) concern valid covariance functions. They belong to the general family of kernels and must be positive definite. Some properties of interest have been demonstrated:

• Sums and products of valid covariance functions are valid covariance functions.

• When a covariance function is stationary, i.e. when it can be expressed as a function of τ = x − x′, it can be parameterized by its Fourier transform.

Two examples of covariance functions for 𝒳 = ℝ^D are:

• The Squared Exponential (SE) covariance function, defined by:
$$ k_{SE}(x, x' \mid \sigma, M) = \sigma^2 \exp\left( -\frac{(x-x')^{\top} M (x-x')}{2} \right) $$
with σ² > 0 and M positive semidefinite. When D = 1, we have $k_{SE}(x, x' \mid \sigma, \lambda) = \sigma^2 \exp\left(-\frac{(x-x')^2}{2\lambda^2}\right)$. λ is called a characteristic length scale, in the sense that |x − x′| ≫ λ is required for two points x and x′ of the process to be independent.

• The less common periodic covariance function of period T, given by:
$$ k_{\text{periodic}}(x, x' \mid T, \lambda) = \sigma^2 \exp\left( -\frac{2 \sin^2 \frac{\pi (x-x')}{T}}{\lambda^2} \right) $$

As can be seen, covariance functions are generally parameterized by a set of scalar values, such as their Fourier transform when they are stationary, their characteristic length scales, their period, etc. These scalars are often called hyperparameters and are usually gathered in a hyperparameter set Θ. Typical realizations of GPs with SE and periodic covariance functions are given for D = 1 and D = 2 in Figure 1.
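For concreteness, here is a minimal sketch of the periodic covariance function above and of how a realization such as those of Figure 1 can be drawn from it (D = 1); the hyperparameter values and the jitter term are illustrative choices, not taken from the paper.

```python
import numpy as np

def k_periodic(x, xp, sigma=1.0, T=1.0, lam=1.0):
    """k_periodic(x, x' | T, lam) = sigma^2 exp(-2 sin^2(pi (x-x')/T) / lam^2)."""
    tau = x[:, None] - xp[None, :]
    return sigma**2 * np.exp(-2 * np.sin(np.pi * tau / T)**2 / lam**2)

# Draw one realization f ~ N(0, K) via a Cholesky factor of the Gram matrix;
# a small diagonal jitter keeps the factorization numerically stable.
x = np.linspace(0, 10, 400)
K = k_periodic(x, x, T=2.5, lam=0.8) + 1e-6 * np.eye(len(x))
f = np.linalg.cholesky(K) @ np.random.default_rng(1).standard_normal(len(x))
```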

D. Optimization of the hyperparameters

In a Bayesian context, we may need to find the hyperparameters that maximize the marginal likelihood of the observations. In other words, we may need to find Θ* such that p(y | X, Θ*) is maximal. Indeed, even if we may well guess the covariance functions that are adequate for the problem at hand, such as stationary covariance functions parameterized by their Fourier transform or SE covariance functions, it is likely that the hyperparameters that best explain the observations are not exactly known.

To this purpose, we can compute the closed-form expression of the marginal log-likelihood of the observations, log p(y | X, Θ), as (see [35]):

$$ \log p(\mathbf{y} \mid X, \Theta) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \mathbf{y}^{\top} \left[ K_{f,XX} + K_{\epsilon,XX} \right]^{-1} \mathbf{y} - \frac{1}{2} \log \left| K_{f,XX} + K_{\epsilon,XX} \right| \qquad (6) $$

where each covariance matrix depends on Θ. Using the opposite of (6) as a cost function, we can proceed to the optimization of the hyperparameters using classical optimization algorithms, in a principled probabilistic framework. Note that depending on the covariance function and the hyperparameter considered, the corresponding optimization problem may or may not be convex.
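As a sketch of this procedure, the following code evaluates the negative of (6) for an SE covariance plus white noise and optimizes log-domain hyperparameters with a generic optimizer; the kernel choice, parameterization and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, y):
    """Negative of eq. (6) for an SE kernel plus white noise, with
    theta = (log sigma_f, log lam, log sigma_noise)."""
    sf, lam, sn = np.exp(theta)
    tau = X[:, None] - X[None, :]
    K = sf**2 * np.exp(-tau**2 / (2 * lam**2)) + sn**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K| = 2 sum(log diag(L)), hence the log-determinant term below.
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-5, 5, 50))
y = np.sin(X) + 0.1 * rng.standard_normal(50)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
```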

III. GAUSSIAN PROCESSES FOR SOURCE SEPARATION

A. Single mixture with instantaneous mixing

The presentation of GPR given in section II-B is actually slightly more general than what is usual in the literature. Indeed, it is often assumed that the covariance function k_ε of the additive signal ε is given by k_ε(x, x′) = σ²δ_{xx′}, where δ_{xx′} = 1 if and only if x = x′ and zero otherwise. This assumption corresponds to additive independent and identically distributed (i.i.d.) white Gaussian noise of variance σ².

In our presentation, the additive signal ε(x) is a GP itself and is potentially very complex. In any case, its covariance function is given by k_ε and the only assumption made is its independence from the signal of interest f(x). A particular example of a model where k_ε non-trivially depends on x and x′ was studied in [20].

The results obtained can very well be generalized to the situation where y is the sum of M independent latent Gaussian processes:

$$ \forall x \in \mathcal{X}, \quad y(x) = \sum_{m=1}^{M} f_m(x) \quad \text{with} \quad f_m \sim \mathcal{GP}\left(0, k_m(x, x')\right) $$

In this case, if our objective is to extract the signal corresponding to source m₀, we only need to replace k_f with k_{m₀} and k_ε with Σ_{m≠m₀} k_m in section II-B. Note that inversion of K_{f,XX} + K_{ε,XX} is needed only once for the extraction of all sources. Similarly, we can also jointly optimize the hyperparameters of all covariance functions using exactly the same framework as in section II-D. We now consider the case of convolutive mixtures of independent GPs.
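A minimal numerical illustration of this substitution (not the paper's experimental protocol) is sketched below: two GP sources with SE covariances of different, illustrative length scales are summed into one mixture, and each is extracted by swapping the roles of k_f and k_ε while reusing a single matrix inverse.

```python
import numpy as np

def k_se(x, lam):
    tau = x[:, None] - x[None, :]
    return np.exp(-tau**2 / (2 * lam**2))

x = np.linspace(0, 10, 300)
K1, K2 = k_se(x, 0.2), k_se(x, 2.0)          # a rough and a smooth source
jitter = 1e-6 * np.eye(len(x))

# Sample both sources and form the single mixture y = f1 + f2.
rng = np.random.default_rng(2)
f1 = np.linalg.cholesky(K1 + jitter) @ rng.standard_normal(len(x))
f2 = np.linalg.cholesky(K2 + jitter) @ rng.standard_normal(len(x))
y = f1 + f2

# MMSE extraction: the inverse of K1 + K2 is shared by all M sources.
w = np.linalg.solve(K1 + K2 + jitter, y)
f1_hat = K1 @ w          # k_f = k_1, k_eps = k_2
f2_hat = K2 @ w          # k_f = k_2, k_eps = k_1
```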

B. Single mixture with convolutive mixing

An important fact, which has already been noticed in the literature [2], [4], is that the convolution of a GP, as a linear combination of Gaussian random variables, remains a GP. Indeed, let us consider some GP f_{0,m} ∼ GP(0, k_{0,m}(x, x′)) and let us define

$$ f_m(x) = \int_{\mathcal{X}} a_m(x - z) f_{0,m}(z) \, dz \triangleq (a_m * f_{0,m})(x) $$

where a_m : 𝒳 → ℝ is a stable mixing filter from f_{0,m} to f_m. If the mean function of f_{0,m} is identically 0, the mean function of f_m is easily seen to also be identically 0. The covariance function of f_m can be computed as k_m(x, x′) = E[f_m(x) f_m(x′)], that is:

$$ k_m(x, x') = \int_{\mathcal{X}} \int_{\mathcal{X}} a_m(x - z) a_m(x' - z') k_{0,m}(z, z') \, dz \, dz' $$

which is the convolution of k_{0,m} by a_m × a_m ≜ (x, x′) ∈ 𝒳² ↦ a_m(x) a_m(x′):

$$ k_m(x, x') = ((a_m \times a_m) * k_{0,m})(x, x') \qquad (7) $$

Moreover, if several convolved GPs {f_m = (a_m ∗ f_{0,m})}_{m=1⋯M} are summed up in a mixture, it can readily be shown that the f_m are independent if the f_{0,m} are independent. We thus get back to the instantaneous mixing model using the modified covariance functions (7).
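On a regularly sampled grid, (7) can be checked numerically: for a stationary k_{0,m} and a circular approximation of the convolution, the spectrum of k_m is |A|² times that of k_{0,m}. The following sketch assumes an illustrative averaging filter a_m and an SE-like covariance; both are placeholders.

```python
import numpy as np

n = 256
a_m = np.zeros(n)
a_m[:8] = 1.0 / 8                                  # illustrative mixing filter
tau = np.minimum(np.arange(n), n - np.arange(n))   # circular lags
k0 = np.exp(-tau**2 / (2 * 3.0**2))                # stationary k_{0,m}(tau)

# Circular version of eq. (7): convolving k_{0,m} with the autocorrelation
# of a_m is a product |A(f)|^2 . K0(f) in the frequency domain.
A = np.fft.fft(a_m)
k_m = np.real(np.fft.ifft(np.abs(A)**2 * np.fft.fft(k0)))
```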

C. Multiple output GP

We have for now only considered GPs whose outputs lie in ℝ. A sizable body of literature focuses on possible extensions of this framework to cases where the processes of interest are multiple-valued, i.e. whose outputs lie in ℝ^C for C ∈ ℕ*. In geostatistics for example, important applications comprise the modeling of co-occurrences of minerals or pollutants in a spatial field. First attempts in this direction [24] include the so-called linear model of coregionalization, which considers each output as a linear combination of some latent processes. The name cokriging has often been used for such systems in the field of geostatistics. If the latent processes are assumed to be GPs, the outputs are also GPs.

In the machine learning community, multiple-output GPs have been introduced [5] and popularized under the name of dependent GPs. Several extensions of such models have been proposed subsequently [2]-[4], [30] and we focus here on the model presented in [2], which is very close to the usual convolutive mixing model commonly used in multi-channel source separation, e.g. in [32].

Let {y_c(x)}_{c=1⋯C} be the C output signals, called the mixtures. The convolutive GP model consists in assuming that each observable signal y_c is the sum of convolved versions of M latent GPs of interest {f_{0,m} ∼ GP(0, k_{0,m}(x, x′))}_{m=1⋯M}, which we will call sources, plus one specific additional term ε_c ∼ GP(0, k_{ε_c}(x, x′)) that is often referred to as additive noise. We thus have:

$$ y_c(x) = \sum_{m=1}^{M} (a_{cm} * f_{0,m})(x) + \epsilon_c(x) \qquad (8) $$

Instead of making a fundamental distinction between c and x, the GP framework allows us to consider that {y_c(x)}_{(c,x)∈{1⋯C}×𝒳} is a single signal {y(x′)}_{x′∈{1⋯C}×𝒳} indexed on an extended input space {1⋯C} × 𝒳. If we assume that the different underlying sources {f_{0,m}}_{m=1⋯M} are independent, which is frequent in source separation, and that the different {ε_c}_{c=1⋯C} are also independent, we can express the covariance function k((c,x), (c′,x′)) of y for two extended input points (c,x) and (c′,x′) as:

$$ k_{cc'}(x, x') = \left( \sum_{m=1}^{M} k_{cc',m} + \delta_{cc'} k_{\epsilon_c} \right)(x, x') \qquad (9) $$

where

$$ k_{cc',m}(x, x') \triangleq ((a_{cm} \times a_{c'm}) * k_{0,m})(x, x') \qquad (10) $$

For any given c, the different {f_{cm} ≜ a_{cm} ∗ f_{0,m}}_{m=1⋯M} are independent and are GPs with mean functions 0 and covariance functions k_{cc,m}(x, x′). f_{cm} will be called the contribution of source m to mixture c. We can readily perform source separation on y_c to recover the different {f_{cm}}_{m=1⋯M} using the standard formalism presented in section II-B. Let f̂_{cm₀} be the estimate of f_{cm₀}; we have:

$$ \hat{\mathbf{f}}_{cm_0} = K_{cc,m_0} \left[ \sum_{m=1}^{M} K_{cc,m} + K_{cc,\epsilon} \right]^{-1} \mathbf{y}_c \qquad (11) $$

where K_{cc,ε} is the covariance matrix of the additive signal ε_c and where the covariance matrix K_{cc,m} is defined as [K_{cc,m}]_{x,x′} = k_{cc,m}(x, x′).

It is important to note here that even if the sources are the {f_{0,m}}_{m=1⋯M}, many systems consider the signals of interest to actually be the different {f_{cm}}_{c,m}. For example, in the case of audio source separation, a stereophonic mixture can be composed of several monophonic sources such as voice, piano and drums. It is often considered sufficient to separate the different instruments within the stereo mixture, thus obtaining one stereo signal for each source, rather than trying to recover the original monophonic signals.

Still, for some m, given the estimates {f̂_{cm}}_{c=1⋯C} of all the different {f_{cm}}_{c=1⋯C}, we can for example estimate f_{0,m} using standard beamforming techniques.

D. Parameter optimization

Even in complex situations such as those presented in sections III-B or III-C, we can still use classical optimization methods to maximize the marginal log-likelihood (6) of the observations. Following [2], we will now give a simple way to include multiple-output GPs in this framework.

Given a set X of n input points and the corresponding C column vectors {y_c}_{c=1⋯C}, we can build y ≜ [y₁^⊤, …, y_C^⊤]^⊤ as the Cn column vector containing all stacked outputs and use expression (9) to build its covariance matrix K. We can then proceed to parameter estimation through maximization of the marginal log-likelihood log p(y | X, Θ) of the observations. Once more, depending on the covariance functions considered, this problem may or may not be convex.

E. Conclusion

In this section, we have derived a way to perform underdetermined source separation using GP models when the mixtures are the convolved sums of several independent GPs. Given some covariance functions and mixing filters, we saw that stating the problem in terms of GPs provides a principled way to estimate the source signals that minimizes the mean squared error. In the GP framework, optimization of the hyperparameters is done through maximization of the marginal log-likelihood of the mixtures given the model.

To our knowledge, very few references are available to date on the topic. For example, [33] performs source separation using GPs in the determined case, but the covariance functions are therein applied to the outputs of the source signals rather than to the coordinates themselves (i.e. time or spatial position). A successful application of GPs to a subject close to source separation can also be found in [39] for echo cancellation.

IV. GP APPROXIMATIONS FOR LARGE SIGNALS

A. The need for approximations

The main issue with GP models is the need to invert the n × n covariance matrix of the learning points for inference (4) and for each evaluation of the observation likelihood in (6). In many areas of interest, we cannot afford to handle such a big matrix, since it is not computationally tractable. In audio signal processing for example, values such as n ≈ 10⁷ are common and GP models cannot be used without a significant reduction of the computational cost of the method.

In order to address this issue, many authors have proposed sparse approximation techniques [31], [34], [37], [38], [40] over the years, which all aim at making GP inference possible for large datasets. As highlighted in [34], many methods rely on the choice of a small set of input points, called the inducing inputs, to approximate the posterior distribution at test points X*. Among those methods, we can mention the Fully Independent Conditional (FIC) approximation [34], [38], which considers all the test points and the learning points independent given the inducing inputs. This leads to a very important reduction of the computational burden, but heavily relies on the density of the inducing points [37] to yield good estimates. Another approximation, called Partially Independent Conditional (PIC) [37], no longer makes the assumption that both the training and test cases are independent given the inducing points, but rather that each of them depends not only on the possibly remote inducing points, but also on a limited number of other learning points nearby. This technique has the advantage of producing better estimates than FIC, while maintaining an easy inversion of the n × n covariance matrix, which is now block-diagonal. Its main disadvantage is to lead to discontinuities of the estimates between the blocks, which may be problematic for some applications such as audio processing.

Another very attractive direction of research in the last few years has been the consideration of covariance functions with Compact Support (CS) [31], [40], i.e. covariance functions k(x, x′) such that ‖x − x′‖ > l ⇒ k(x, x′) = 0 for some given scale l. The idea underlying these techniques is to consider that two points will be independent if they are sufficiently far from each other. If such covariance functions are used, the covariance matrix is sparse and inference through Cholesky decompositions is done much faster [40]. The main issue with this approach is to design covariance functions that correspond to some prior knowledge about the sources and that have CS at the same time.

In sections IV-B and IV-C, we introduce a general method for fast inference in GP models that is based on framing and is a direct generalization of the common practice in audio signal processing.

Another important computational simplification is introduced in sections IV-D and IV-E when the signals are regularly sampled. In that case, we show that when the covariance functions are assumed stationary and separable, exact inference can be done extremely efficiently in the frequency domain.

When both approaches are combined into so-called locally-scaled and framewise-independent stationary covariance functions, we show in section IV-F that inference and learning of hyperparameters become equivalent to recent and powerful Nonnegative Tensor Factorization (NTF) techniques [9].

B. Frames

In audio signal processing, it is common to split the signals into overlapping frames and to process the frames separately. Formally, the frames {y_i(x′)}_{i∈ℕ} are defined as small portions of the original signal. The advantage of the technique is that the frames are small and can easily be processed. The original signals can then be recovered through a deterministic overlap-add procedure: each frame is multiplied by a weighting function g : 𝒳′ → ℝ⁺ to ensure smooth transitions between the frames and is added to the reconstructed signal. g is often a Hann or a triangular window.

This idea can very well be generalized to any dimension D. Instead of considering the original signal y, we can split it into overlapping frames of smaller dimension. To this end, we consider a frame input set 𝒳′ ⊂ 𝒳, a summable weighting function g : 𝒳′ → ℝ⁺ and a set of frame positions {t_i ∈ 𝒳}_{i∈ℕ} such that:

$$ \forall x \in \mathcal{X}, \quad I_x \triangleq \{ i \in \mathbb{N} : x - t_i \in \mathcal{X}' \} \neq \emptyset \qquad (12) $$

I_x is thus the set of frame numbers to which the input point x is mapped. Condition (12) ensures that each point of the signal is represented in at least one frame. Finally, given some signal {y(x)}_{x∈𝒳}, we can make the assumption that there is a set of frames {y_i(x)}_{i∈ℕ, x∈𝒳′}, also noted G{y} in the following, such that:

$$ \forall x \in \mathcal{X}, \quad y(x) = \frac{1}{\sum_{i \in I_x} g(x - t_i)} \sum_{i \in I_x} g(x - t_i) \, y_i(x - t_i) \qquad (13) $$

When considering a finite set X of n input points, we only need to consider the frames I ≜ ⋃_{x∈X} I_x that contain at least one input point from X.


Let 𝒳′_i = {x′ ∈ 𝒳′ | ∃x ∈ X : x′ = x − t_i} be the finite set of points from 𝒳′ to which the elements of X in the scope of frame i are mapped, and let⁴ L_i = #(𝒳′_i). Given some signal {y(x)}_{x∈X}, it is always possible to build a set of frames obeying (13). This can be achieved by choosing 𝒳′_i and y_i(x) such that:

$$ \forall (i, x') \in I \times \mathcal{X}', \quad (t_i + x' \in X) \Rightarrow \begin{cases} x' \in \mathcal{X}'_i \\ y_i(x') = y(t_i + x') \end{cases} \qquad (14) $$

When they make use of framing, usual methods focus on the frames G{y} as the signals of interest rather than on y. Indeed, a good model for G{y} is de facto a good model for y, since y can be computed deterministically from G{y}. In such methods based on framing, the set (14) of frames is usually taken as being the observation. From our point of view, the frames are simply another process which is indexed on ℕ × 𝒳′ and from which we can deterministically recover y, which is indexed on 𝒳.

⁴For a countable set X, #(X) denotes the number of elements in X.
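A minimal D = 1 sketch of this framing operator and of the reconstruction (13) follows; the frame length, hop size and (Hamming) weighting function g are illustrative choices. Since the frames are exact copies as in (14), the weighted overlap-add recovers y exactly wherever the signal is covered by at least one frame.

```python
import numpy as np

def frames_of(y, t, L):
    """G{y}: frames y_i(x') = y(t_i + x'), built as exact copies as in (14)."""
    return np.stack([y[ti:ti + L] for ti in t])

def overlap_add(frames, t, L, n):
    """Reconstruction (13): g-weighted sum of frames, normalized by sum of g."""
    g = np.hamming(L)        # strictly positive weights avoid 0/0 at the edges
    num, den = np.zeros(n), np.zeros(n)
    for yi, ti in zip(frames, t):
        num[ti:ti + L] += g * yi
        den[ti:ti + L] += g
    return num / den

n, L, hop = 1024, 128, 64
t = np.arange(0, n - L + 1, hop)                   # frame positions t_i
y = np.random.default_rng(0).standard_normal(n)
y_rec = overlap_add(frames_of(y, t, L), t, L, n)   # equals y on covered points
```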

C. Frame-wise independence assumption

Given a signal {y(x)}_{x∈𝒳} and a corresponding set of frames {y_i}_{i∈ℕ}, a classical assumption consists in writing that the different y_i are independent. As y can be deterministically computed from {y_i}_{i∈ℕ}, this is written:

$$ \log p(\mathbf{y} \mid X, \Theta) = \sum_{i \in I} \log p(\mathbf{y}_i \mid X'_i, \Theta) \qquad (15) $$

If G{y} is modeled as a GP, the frame-wise independence assumption is equivalent to modeling the covariance function k((i,x), (i′,x′)) of G{y} as:

$$ k((i, x), (i', x')) = \delta_{ii'} k_i(x, x') \qquad (16) $$

with k_i being the covariance function of the GP {y_i(x)}_{x∈𝒳′}. Let X be a finite set of input points from 𝒳, y a process indexed on X, and I ≜ ⋃_{x∈X} I_x the corresponding frame indexes for a framing G. Let n_I be the number of frames. If we model G{y} as a GP, we readily see that it is equivalent to a multiple-output GP as seen in section III-C, with n_I outputs whose input set is 𝒳′. We can thus stack its outputs and observe that the corresponding covariance matrix is block-diagonal due to the frame-wise independence assumption. Its inverse is thus easily computed.

The main computational trick involved in framing is hence to split the signal into overlapping frames, with a synthesis scheme that allows perfect reconstruction. Then, the frames are supposed to be independent and the corresponding covariance matrix becomes block-diagonal. The advantage of this method is that when the frames are overlapping, each point estimate is a smoothing of several estimates computed in the different frames that contain this point, thus avoiding systematic discontinuities. In Figure 2, we illustrate this advantage of framing over PIC to produce smooth estimates in a very simple regression problem.


Figure 2. A simple regression example with a full GP model (top), PIC (middle) and framing (bottom). The dotted and red lines are respectively the true and estimated signal. The grey area represents three standard deviations of the estimates around the mean and the circles stand for observed points.

The connections between frame-wise independent correlation functions and other existing models such as PIC or CS covariance functions [31], [40] are numerous. Firstly, we see that framing without overlap between the frames is very similar to PIC, except for the fact that it does not take inducing points outside the frames into account. In framing, such remote inducing points are handled through the use of overlapping frames of different scales. Secondly, when considering (13), which gives the expression of the signal given its constitutive frames, we can straightforwardly compute the covariance function of the signal y itself given the covariance functions of the different frames. This computation is actually very similar to the one carried out in [31], where the basis function used in the computation becomes the weighting function g we considered here. The basis function proposed in [31] is precisely the Hann window, which is a very popular choice for g in audio processing.

Of course, as in practice the frames are built using expression (14), there is a duplication of the samples that belong to overlapping frames, and the independence assumption between the frames may seem unjustified. Nevertheless, the idea underlying framing is that even if the occurrence of one point from X in some frame gets duplicated in another, and even if the corresponding observed values are connected or equal, they are supposed to be produced by two different underlying processes that do not share the same covariance function.

Still, there are interesting conceptual issues raised by framing that should be more thoroughly studied in the future. In particular, contrary to PIC, framing as exposed here suffers from overconfidence. This can be seen by computing the variance of the estimates for y given by (13) and then noticing that the more a point gets duplicated in different frames because of overlap, the smaller the a posteriori variance of this point becomes. This is due to the independence assumption between the frames. If this assumption were strictly legitimate, this diminution of the variance would be justified. However, as the overlap between the frames gets large, independence cannot be a valid assumption anymore and the variances get underestimated. This is illustrated in Figure 2, where the overlap was large.

Even if overlapping the different frames is common practice in audio signal processing, its consequences on statistical models have been largely neglected. In [26], LE ROUX et al. devise practical ways of consistently handling the dependencies between the frames as a postprocessing step. Still, to our knowledge, practical statistical models that fully take the overlap between the frames into account while remaining computationally tractable are yet to be proposed.

D. Stationarity assumption for regularly sampled signals

In this section, we assume that the signals are defined on 𝒳 = ℝ^D for D ≥ 1 and that x ∈ 𝒳 can be written x = (x₁, ⋯, x_D). We will moreover assume that all the covariance functions k that we consider are separable, i.e. there are D covariance functions k^(d) such that:

$$ \forall (x, x') \in \mathcal{X}^2, \quad k(x, x') = \prod_{d=1}^{D} k^{(d)}(x_d, x'_d) \qquad (17) $$

It is readily shown that this assumption implies that all the covariance matrices K considered can be expressed as a Kronecker product⁵ of D covariance matrices K^(d) of lower dimensions:

$$ K = K^{(1)} \otimes K^{(2)} \otimes \cdots \otimes K^{(D)} \triangleq \bigotimes_{d=1}^{D} K^{(d)} \qquad (18) $$

From now on, we suppose that the points are regularly sampled. This is equivalent to assuming that any signal y, f_m or k considered is the vectorization⁶ of a corresponding underlying D-dimensional tensor y, f_m or k. Indeed, we will show in section IV-D1 that computations can be very concisely written using these tensors, which are actually natural to consider. For example, when D = 2, it makes sense to directly think of regularly sampled signals as matrices instead of their vectorized counterparts.

1) GPs for the separation of stationary mixtures: As seen in section II-C, a stationary covariance function k(x, x′) between two input points can be expressed as a function of their difference τ = x − x′. It is noted k(x − x′). If all covariance functions considered are stationary, the computations become particularly simple.

Indeed, let us assume that a mixture {y(x)}_{x∈X} is the sum of several GPs {f_m(x)}_{m=1⋯M, x∈X} whose covariance functions k_m(x − x′) are all stationary, and let us furthermore suppose that we are interested in separating the different sources for all points in X, thus having X* = X. The covariance matrix K_y of y is given by K_y = Σ_{m=1}^{M} K_m, where K_m is the covariance matrix of source m.

⁵See [9] for a concise introduction to tensor algebra.
⁶Vectorization is done recursively. For example, with D = 2 where tensors are matrices, it is done one row after the other.

Considering (18), K_m is given by K_m = ⊗_{d=1}^{D} K_m^{(d)}, where [K_m^{(d)}]_{i,j} = k_m^{(d)}(x_{i,d} − x_{j,d}). K_m^{(d)} can approximately be considered as circulant⁷. It is readily shown that any circulant matrix M can be expressed as M = W_F* Λ W_F, where W_F is the discrete Fourier transform matrix⁸ and where Λ is diagonal. Thus, for all m and d, there is a diagonal positive semidefinite matrix diag S_m^{(d)} such that K_m^{(d)} ≈ W_F* diag S_m^{(d)} W_F, where the vector S_m^{(d)} is the discrete Fourier transform of τ ↦ k_m^{(d)}(τ). We can thus write K_y as:

$$ K_y = \sum_{m=1}^{M} \bigotimes_{d=1}^{D} W_F^* \operatorname{diag} S_m^{(d)} W_F \qquad (19) $$

Using classical results from tensor algebra, (19) can be written:

$$ K_y = \left( \bigotimes_{d=1}^{D} W_F^* \right) \left( \sum_{m=1}^{M} \bigotimes_{d=1}^{D} \operatorname{diag} S_m^{(d)} \right) \left( \bigotimes_{d=1}^{D} W_F \right) \qquad (20) $$

We can use this property to extract a given source m₀ and write⁹ (4) as:

$$ \bar{\mathbf{f}}^*_{m_0} = \left( \bigotimes_{d=1}^{D} W_F^* \right) \left( \frac{\bigotimes_{d=1}^{D} \operatorname{diag} S_{m_0}^{(d)}}{\sum_{m=1}^{M} \bigotimes_{d=1}^{D} \operatorname{diag} S_m^{(d)}} \right) \left( \bigotimes_{d=1}^{D} W_F \right) \mathbf{y} \qquad (21) $$

Introducing the D-dimensional tensor¹⁰

$$ \mathbf{S}_m = S_m^{(1)} \circ S_m^{(2)} \circ \cdots \circ S_m^{(D)} \triangleq \bigcirc_{d=1}^{D} S_m^{(d)} \qquad (22) $$

as the model for source m and F_D{y} as the D-dimensional Fourier transform of y, we can simply write (21) in tensor form as:

$$ \mathcal{F}_D\{\bar{\mathbf{f}}^*_{m_0}\} = \left( \frac{\mathbf{S}_{m_0}}{\sum_{m=1}^{M} \mathbf{S}_m} \right) \cdot \mathcal{F}_D\{\mathbf{y}\} \qquad (23) $$

which is similar to the classical Wiener filter for stationary processes. The differences between this expression and the classical one are firstly that it is valid for any dimension D of the input space and secondly that it is not restricted to the case of only two stationary sources. The sources themselves can be recovered through an inverse D-dimensional Fourier transform. The nonnegative tensor S_m can be understood as the D-dimensional Fourier transform of the stationary covariance function k_m. Note that the complexity of this exact GP inference method relying on stationarity of the covariance functions and on regular sampling is O(n log n), dominated by the computation of Fourier transforms, for which very efficient and specialized algorithms exist. If F_D{y} is known beforehand, the complexity of (23) decreases to O(n), which is remarkable for an exact GP inference technique.

⁷If the signal is regularly sampled, this approximation holds when the number n_d of points along dimension d tends to infinity or when k^(d)(τ) is periodic of period n_d/p with p ∈ ℕ*.
⁸W_F* denotes the complex conjugate of W_F.
⁹A/B and A·B respectively denote the element-wise division and multiplication of A and B.
¹⁰∘ denotes the outer product.
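The following sketch implements (23) for D = 2 with two stationary sources, entirely in the frequency domain; the spectral tensors S_m, the sampling scheme and all values are illustrative assumptions, not the paper's experiment.

```python
import numpy as np

n1, n2 = 128, 128
fx, fy = np.meshgrid(np.fft.fftfreq(n1), np.fft.fftfreq(n2), indexing="ij")
r = np.hypot(fx, fy)

# Illustrative nonnegative spectral tensors S_m: low-pass vs. band-pass.
S1 = 1.0 / (1.0 + (r / 0.02)**2)
S2 = np.exp(-((r - 0.25) / 0.02)**2)

def sample_stationary(S, rng):
    """Approximate realization of a stationary GP with spectrum S (up to scale)."""
    w = rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape)
    return np.real(np.fft.ifft2(np.sqrt(S) * w))

rng = np.random.default_rng(0)
y = sample_stationary(S1, rng) + sample_stationary(S2, rng)

# Eq. (23): element-wise Wiener mask in the 2-D Fourier domain, O(n log n).
Y = np.fft.fft2(y)
f1_hat = np.real(np.fft.ifft2(S1 / (S1 + S2) * Y))
f2_hat = np.real(np.fft.ifft2(S2 / (S1 + S2) * Y))
```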


2) Marginal likelihood for stationary sources: When all the covariance functions considered are stationary and parameterized by some hyperparameter set Θ that consists of their respective D-dimensional Fourier transforms, i.e. Θ = {S₁ ⋯ S_M}, it can readily be shown that the marginal log-likelihood log p(y | X, Θ) of the observations given regularly spaced input points and the hyperparameters simplifies from (6) to:

$$ \log p(\mathbf{y} \mid X, \{\mathbf{S}_1 \cdots \mathbf{S}_M\}) = -\frac{1}{2} \sum_{i_1, \cdots, i_D} \left[ \frac{\left| \left[ \mathcal{F}_D\{\mathbf{y}\} \right]_{i_1 \cdots i_D} \right|^2}{\sum_{m=1}^{M} [\mathbf{S}_m]_{i_1 \cdots i_D}} + \log \sum_{m=1}^{M} [\mathbf{S}_m]_{i_1 \cdots i_D} \right] + C^{te} \qquad (24) $$

Considering (24), we see that it is equivalent, up to an additive constant independent of Θ, to half the opposite of the IS divergence¹¹ between¹² |F_D{y}|^{·2} and Σ_{m=1}^{M} S_m:

$$ \log p(\mathbf{y} \mid X) = -\frac{1}{2} D_{IS}\left( \left| \mathcal{F}_D\{\mathbf{y}\} \right|^{\cdot 2} \,\middle|\, \sum_{m=1}^{M} \mathbf{S}_m \right) + C^{te} \qquad (25) $$

The evaluation of the likelihood can be done in O(n log n) operations when the signals are regularly sampled and the covariance functions are stationary. If the squared D-dimensional Fourier transform |F_D{y}|^{·2} of the signal is known beforehand — it is typically computed only once — the computational complexity is reduced to O(n).
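A sketch of this evaluation strategy: |F_D{y}|^{·2} is computed once with an FFT, after which each evaluation of (25) is a simple O(n) IS divergence. The model spectra below are illustrative placeholders.

```python
import numpy as np

def itakura_saito(p, q, eps=1e-12):
    """D_IS(p | q) = sum(p/q - log(p/q) - 1), element-wise over tensors."""
    r = (p + eps) / (q + eps)
    return np.sum(r - np.log(r) - 1.0)

def log_likelihood(P, S_list):
    """Eq. (25) up to an additive constant, with P = |F_D{y}|^2 precomputed."""
    return -0.5 * itakura_saito(P, sum(S_list))

y = np.random.default_rng(0).standard_normal((64, 64))
P = np.abs(np.fft.fft2(y))**2            # computed once, O(n log n)
S1 = np.full((64, 64), 0.5)              # illustrative flat model spectra
S2 = np.full((64, 64), 0.5)
ll = log_likelihood(P, [S1, S2])         # each such call is O(n)
```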

E. Locally stationary covariance functions

Let {y(x)}_{x∈𝒳} be a particular signal, observed on a finite input set X ∈ 𝒳ⁿ, and let {y_i ∈ ℝ^{𝒳′_i}}_{i∈I} be a set of n_I corresponding frames. As in section IV-C, we can assume that the frames are independent, and we can further suppose that the covariance function k_{im} of source m within frame i is stationary. This means that we model each source as being composed of several locally stationary frames, each of which has its own covariance function. The resulting signal is not supposed stationary with this assumption; only its restrictions to small regions of the input space 𝒳 are assumed stationary.

Let us denote Y the (D+1)-dimensional tensor whose last dimension goes over the frames and whose first D dimensions for a fixed frame contain the D-dimensional Fourier transform of the signal tensor for this frame, as in section IV-D. As this tensor is called the Short Term Fourier Transform (STFT) of the signal when D = 1, it will be called the STFT tensor of the mixture. We define the STFT tensor F_m of the sources and the STFT tensor S_m of the covariance function of source m in the same way. We can use the results from the previous section for each frame and for source m₀: the MMSE estimate F̄*_{m₀} of F_{m₀} is given by:

$$ \bar{\mathbf{F}}^*_{m_0} = \left( \frac{\mathbf{S}_{m_0}}{\sum_{m=1}^{M} \mathbf{S}_m} \right) \cdot \mathbf{Y} \qquad (26) $$

¹¹ $D_{IS}(\mathbf{x} \mid \mathbf{y}) \triangleq \sum_{i_1 \cdots i_D} \left[ \frac{[\mathbf{x}]_{i_1 \cdots i_D}}{[\mathbf{y}]_{i_1 \cdots i_D}} - \log \frac{[\mathbf{x}]_{i_1 \cdots i_D}}{[\mathbf{y}]_{i_1 \cdots i_D}} - 1 \right]$
¹²For a matrix M, $[M^{\cdot 2}]_{ij} \triangleq M_{ij}^2$.

The sources can then be recovered by first applying an inverse D-dimensional Fourier transform to the estimate (26) for each frame, and then using the reconstruction scheme (13) to obtain the estimated sources in the original input space 𝒳.

Let Θ = {S₁, ⋯, S_M} be the models for the sources. The marginal likelihood log p(y | X, Θ) of the observations can similarly be shown to be:

$$ \log p(\mathbf{y} \mid X, \Theta) = -\frac{1}{2} D_{IS}\left( |\mathbf{Y}|^{\cdot 2} \,\middle|\, \sum_{m=1}^{M} \mathbf{S}_m \right) + C^{te} \qquad (27) $$

where the constant is independent of Θ. This very simple expression can be computed in O(n) when |Y|^{·2} is known, and it permits efficient hyperparameter learning, as demonstrated in section IV-F.
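For D = 1, a minimal sketch of the locally stationary estimator (26) follows; the frame parameters and model spectra are illustrative, and a real-valued FFT is used for convenience. Analysis frames are raw copies, as in (14), with the weighting g entering only at reconstruction (13).

```python
import numpy as np

def stft_tensor(y, L=256, hop=128):
    """STFT tensor Y: one Fourier transform per raw frame, cf. (14); the
    weighting g only enters at reconstruction, as in (13)."""
    t = np.arange(0, len(y) - L + 1, hop)
    return np.stack([np.fft.rfft(y[ti:ti + L]) for ti in t])

y = np.random.default_rng(0).standard_normal(4096)
Y = stft_tensor(y)

# Illustrative frame-dependent model spectra S_m (frames x frequency bins).
S1 = np.ones(Y.shape)
S2 = np.full(Y.shape, 0.5)

F1 = S1 / (S1 + S2) * Y                       # eq. (26), element-wise mask
frames1 = np.fft.irfft(F1, n=256, axis=1)     # back to the frame domain
# The time-domain estimate then follows the overlap-add scheme (13).
```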

F. Putting structures over the covariances

Given some regularly sampled signal tensor y and its corresponding STFT tensor Y as defined in section IV-E, we have seen that source separation can be very efficiently performed provided some (D+1)-dimensional model S_m is known for every source. As highlighted by CEMGIL et al. in [7] or [8] for the case of audio processing (D = 1), the important issue raised by this probabilistic framework becomes devising realistic but effective models for the nonnegative source parameters S_m.

In audio signal processing (D = 1), the result (26) is known as adaptive or generalized Wiener filtering, and many methods for source separation such as [8], [32] use this technique in a principled way to recover the sources in the frequency domain. Those studies state their probabilistic model in the frequency domain, where the time-frequency bins are supposed to be distributed with respect to independent Gaussian distributions. In our approach, the model is expressed directly in the original input space. The two points of view are actually equivalent: a stationary GP has an independently distributed Gaussian representation in the frequency domain. S_m can hence be seen either as the STFT tensor of a covariance function or as a tensor containing the variances of the independent components of Y.

Focusing on the second interpretation of S_m, recent studies [8], [12], [13] proposed to model these tensors as Gamma Markov Random Fields (GMRF). This is a sensible choice indeed, because such models guarantee the nonnegativity of all the elements of S_m while implementing the knowledge that, for a given source, the spectrum is likely to exhibit some continuity over time, over the frequencies, or over both. As GMRFs do not provide a closed-form expression for the marginal log-likelihood of the observations, the learning of hyperparameters has to be done using approximate methods. To this end, DIKMEN et al. [12] propose to use contrastive divergence [22] and report good results. To our knowledge, no generalization of GMRFs has yet been published for input spaces of dimension greater than 2, but GP modeling may greatly benefit from such an extension.

Another point of view is to introduce some deterministic structure into the covariance functions of the GPs. A simple assumption to this end is to consider that, for a given source m, the covariance functions of the different independent frames are stationary and locally scaled, i.e. identical up to an amplification gain depending on the frame. The model for source m and frame i can then be written:

$$ S_{im} = H_{im} S_{0,m} \qquad (28) $$

where $S_{0,m} = \bigcirc_{d=1}^{D} S_{0,m}^{(d)}$ is the D-dimensional Fourier transform of some template covariance function k_{0,m} for source m that is independent of the frame index i. We get:

$$ \mathbf{S}_m = \left( \bigcirc_{d=1}^{D} S_{0,m}^{(d)} \right) \circ H_m \qquad (29) $$

where H_m = (H_{1m} ⋯ H_{n_I m}) denotes the amplification gains of the covariance function for source m on the different frames. Considering (29), we readily see that it is equivalent to a classical Nonnegative Tensor Factorization (NTF) model called Canonical Polyadic (CP) decomposition¹³. The different parameters become Θ = {{H_m, S_{0,m}^{(1)} ⋯ S_{0,m}^{(D)}}_{m=1⋯M}} and can be estimated by standard CP algorithms using the IS-divergence function. See [9] for a review of these models and algorithms.
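As a sketch of this estimation procedure in the D = 1 case, where the CP model reduces to an NMF of the power STFT, the following code applies the standard IS-NMF multiplicative updates (as reviewed e.g. in [9]); the dimensions and data are illustrative placeholders.

```python
import numpy as np

def is_nmf(P, M, n_iter=200, seed=0):
    """Multiplicative updates minimizing D_IS(P | WH) (IS case, beta = 0)."""
    rng = np.random.default_rng(seed)
    F, N = P.shape
    W = rng.random((F, M)) + 1e-3
    H = rng.random((M, N)) + 1e-3
    for _ in range(n_iter):
        V = W @ H
        W *= ((P / V**2) @ H.T) / ((1.0 / V) @ H.T)
        V = W @ H
        H *= (W.T @ (P / V**2)) / (W.T @ (1.0 / V))
    return W, H

# P stands for the power STFT |Y|^2 (frequency bins x frames); random here.
P = np.random.default_rng(1).random((257, 200)) + 0.1
W, H = is_nmf(P, M=5)
S_model = W @ H          # the model sum_m S_{0,m} o H_m for |Y|^2, cf. (29)
```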

G. Optimization

We have shown how GP learning can be connected to recent NTF techniques by factorizing the covariance structure of the GP model into a CP decomposition. More generally, and depending on the application, the covariances can be factorized in many other ways to account for some prior knowledge we may have concerning the structure of the sources. For example, if the covariances are considered to be the outer product of some shared dictionaries, the tensor decompositions to be used become particular cases of the Block Component Decompositions introduced in [25]. Many very informative models can be designed this way, which decompose the covariance structure of the sources onto sophisticated dictionaries. In music processing (D = 1) for example, [41] decomposes the covariances into templates of harmonic bases. Other models of this type have also been used to model and extract singing voice signals from polyphonic mixtures, with very promising results [14].

In any case, when an appropriate model has been chosen for {S_m}_{m=1⋯M}, we have seen in section IV-E that hyperparameter learning can be done by minimizing the IS divergence between Σ_m S_m and |Y|^{·2} through tensor factorizations. Efficient algorithms for IS-NTF can be found in the literature, for example in [9].

V. EVALUATION

In this section, we demonstrate the performance of the proposed approach based on GP models for the separation of real-valued mixtures. In section V-A, we first show that GPs can easily be used for the separation of synthetic 2D random fields, or textures (D = 2). Then, we show in section V-B how GPs can be used for the separation of drums signals in real polyphonic stereo recordings.

¹³CP is also called PARAFAC or CANDECOMP [9].

A. Synthetic additive textures

In this section, we set D = 2, which means that we aim at separating additive functions f_m(x₁, x₂) defined on the plane and summed in an observable mixture signal y(x₁, x₂). For this toy example, we will consider the case of one mixture (K = 1) that is the sum of M = 2 stationary sources¹⁴. Following the notations introduced in section IV-D, we will thus suppose that the mixture tensor y is the sum of two source tensors f₁ and f₂. The corresponding vectors y, f₁ and f₂ will denote the vectorization of these tensors, one row after the other. X denotes the corresponding coordinates.

In this experimental setup, the dimensions of the sources and mixture tensors are 500 × 500 each, leading to n = 250000. In the following, we will assume that the covariance function of each source along each dimension is stationary. For the experiment, the covariance functions were arbitrarily set to:

$$ k_m^{(d)}(x_d, x'_d) = \exp\left( -\frac{2 \sin^2 \frac{\pi (x_d - x'_d)}{T_{m,d}}}{l_{m,d}^2} - \frac{(x_d - x'_d)^2}{2 \lambda_{m,d}^2} \right) \qquad (30) $$

where {T_{m,d}, l_{m,d}, λ_{m,d}}_{m,d} are scalar parameters.

This model implements a particular prior knowledge where the sources are known to exhibit some kind of complex structure. More specifically, source m is known to be pseudo-periodic of period (T_{m,1}, T_{m,2}), and (l_{m,1}, l_{m,2}) controls the smoothness within one period. A further length scale (λ_{m,1}, λ_{m,2}) controls the global covariance between two input points. In the particular example shown in Figure 3, the parameters were:

m   λ_{m,1}   λ_{m,2}   T_{m,1}   T_{m,2}   l_{m,1}   l_{m,2}
1   100       100       50        20        0.5       0.7
2   40        4         25        +∞        0.7       N/A

1) Data synthesis: Generating a realization of a GP with some known covariance matrix K is generally addressed through Cholesky or Singular Value Decompositions (SVD) of the covariance matrix [35]. As we have n = 250000, we cannot naively implement this idea here. A simple way to circumvent this problem is to write K as in (18) and then to perform a Cholesky decomposition of each K^(d) to get K^(d) = L^(d) L^(d)⊤. The Cholesky decomposition of K = ⊗_{d=1}^{D} L^(d) L^(d)⊤ is finally obtained as K = (⊗_{d=1}^{D} L^(d))(⊗_{d=1}^{D} L^(d))^⊤ and a realization of this GP can be very easily generated. Indeed, let R be a vector of length n whose entries are i.i.d. Gaussian random variables of unit variance. K is the covariance matrix of (⊗_{d=1}^{D} L^(d)) R.¹⁵

2) Source separation: In this very simple experimental setup, we consider that the 12 hyperparameters of the sources' covariance functions (30) are known beforehand.

¹⁴This use case is common in geostatistics: the observed signal is often modeled as the sum of the signal of interest and a contaminating white Gaussian noise [11]. Estimating the value of the target signal through kriging is hence a special case of source separation with GP priors.

¹⁵We can further speed up this computation by using the fact that, for c = vec(C) and matrices A and B of appropriate size, (A ⊗ B)c = vec(ACB^⊤). This avoids considering such a big matrix as ⊗_{d=1}^{D} L^(d).


We can perform separation through the exact method presented in section IV-D1. To this purpose, we can build the spectral covariance tensor S_m of each source as the outer product of the Fourier transforms of (30) along each dimension, and then perform separation in the frequency domain as in (23). The sources are recovered through an inverse 2-dimensional Fourier transform. It is worth noticing here that the computations are performed extremely rapidly, since they only involve element-wise multiplications of 500 × 500 images instead of the inversion of the 250000 × 250000 covariance matrix required by the basic GP setup. Overall computations for this example — synthesis and source separation — are achieved in less than 3 seconds on a standard laptop computer.

Results for one example are shown in Figure 3. The average Signal to Error Ratio obtained over 50 experiments was 8 dB, which is very encouraging.

B. Separation of drums signal in polyphonic music

In this section, we apply the general framework presented in sections III and IV to the separation of drums signals in polyphonic music. The regularly sampled signals we consider are thus defined on the input space 𝒳 = ℤ of dimension D = 1.

Separation of the drums track from polyphonic music is a challenging task that has already been addressed in several studies such as [18], [21]. Whereas HÉLEN and VIRTANEN [21] perform a Nonnegative Matrix Factorization (NMF) of the mixture and then group the different components obtained through a classification procedure, GILLET and RICHARD [18] decompose the mixture signal with spectral templates learned from a drums database.

In section V-B1, we introduce a GP model for this task and in section V-B2, we compare its performance with the state of the art [18].

1) GP model: The observed mixtures y(x) are supposed to be the sum of two independent GPs s_d(x) and s_r(x), corresponding respectively to the drums and the musical residual tracks. We assume that some framing G{y} with n_I frames of the same length, as defined in section IV-B, is available for the mixtures, and we aim at estimating the framings G{s_d} and G{s_r} of the different sources such that G{y} = G{s_d} + G{s_r}.

We suppose that each of the signals s_d and s_r is itself the sum of several independent processes called components. In our example, the R_d different components of s_d are the five most common sources found in a drums signal, e.g. kick drum, snare drum, hihat, bells and clap sounds. The R_r different components of the musical residual are all the other elements composing the polyphonic mixture. This assumption can be written s_d = Σ_{m=1}^{R_d} f_m and s_r = Σ_{m=R_d+1}^{R_d+R_r} f_m.

In the model we are considering, we will assume that all the components f_m are GPs whose covariance functions k_m are locally scaled, frame-wise independent and stationary, as defined in section IV-F. For some frame i, they can thus be expressed as:

$$ k_{im} = H_{im} k_{0,m} \qquad (31) $$

where k_{0,m} denotes the template stationary covariance function for component m and H_m = (H_{1m} ⋯ H_{n_I m}) are the nonnegative activation gains of this component within the frames. Introducing the Fourier transform S_{0,m} of k_{0,m} and using the method presented in section IV-F, the MMSE estimate F̂_d of the STFT F_d of the drums signal is given by:

$$ \hat{\mathbf{F}}_d = \frac{\sum_{m=1}^{R_d} S_{0,m} \circ H_m}{\sum_{m=1}^{R_d + R_r} S_{0,m} \circ H_m} \cdot \mathbf{Y} \qquad (32) $$

where $Y$ is the STFT of the mixtures. The model $S \triangleq \sum_m S_m$ becomes $S = \sum_{m=1}^{R_d+R_r} S_{0,m} \circ H_m$. Since D = 1, this can be written in matrix form as $S = WH$, where $S_{0,m}$ is the $m$th column of $W$ and $H_m$ is the $m$th row of $H$. As we have shown in section IV-F, the optimization of the hyperparameters $\Theta = \{W, H\}$ through likelihood maximization is thus equivalent to the minimization of the IS distance between the power STFT $|Y|^{\cdot 2}$ and the product $WH$, yielding an NMF model as in [9], [17], [27], [32].
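The following minimal Python sketch illustrates this equivalence: it fits W and H with the standard Itakura-Saito multiplicative updates (as in [16], [27]) and then applies the mask of (32). The STFT Y is a random placeholder here, and the component counts are hypothetical.

import numpy as np

def is_nmf(V, R, n_iter=200, eps=1e-12):
    # Multiplicative-update NMF minimizing the Itakura-Saito distance
    # between the power STFT V (F x N) and the product W @ H.
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, R)) + eps
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat ** 2)) / (W.T @ (1.0 / V_hat))
        V_hat = W @ H + eps
        W *= ((V / V_hat ** 2) @ H.T) / ((1.0 / V_hat) @ H.T)
    return W, H

# Placeholder mixture STFT; in practice Y comes from framing the mixture.
rng = np.random.default_rng(1)
Y = rng.standard_normal((513, 400)) + 1j * rng.standard_normal((513, 400))

Rd, Rr = 5, 10  # hypothetical numbers of drums / residual components
W, H = is_nmf(np.abs(Y) ** 2, Rd + Rr)

# Wiener-like mask of (32): drums model over full model, applied to Y.
drums_model = W[:, :Rd] @ H[:Rd, :]
full_model = W @ H + 1e-12
F_d_hat = drums_model / full_model * Y  # MMSE estimate of the drums STFT

The drums waveform is then recovered by inverting the masked STFT, exactly as in a classical Wiener filtering pipeline.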

Some other kind of knowledge now has to be put into the model for it to be useful in practice, since we have not yet made any distinction between the covariance functions of the drums components and those of the musical residual signal. A very simple and computationally cheap solution to this problem is to appropriately initialize some of the hyperparameters. In this experiment, we focus on the meaning of the activation gains $H_m$ of the components as introduced in (31): $H_{im}$ can be understood as a magnitude parameter for component $m$ in frame $i$. A good way to initialize the parameters $\{H_m\}_{m=1\cdots R_d}$ for the drums signal is simply to use an onset detector such as [15]. Indeed, if an onset-detection feature has a high magnitude in some frame $i$, then some drums component must be active in it. The onset detector of [15] was hence applied in $R_d$ different frequency bands of the mixture STFT, yielding $R_d$ signals. These signals were used to initialize the activation gains $\{H_m\}_{m=1\cdots R_d}$, and all the other hyperparameters of the model were randomly initialized. An NMF was then applied using this initialization, and separation was performed using (32), as sketched below.
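A hedged sketch of this initialization follows; a simple per-band, half-wave rectified spectral flux stands in for the complex-domain onset detector of [15], and all sizes are placeholder values.

import numpy as np

def onset_envelopes(V, n_bands):
    # Per-band onset envelopes from the power STFT V (F x N): a crude
    # spectral-flux stand-in for the detector of [15].
    F, N = V.shape
    env = np.zeros((n_bands, N))
    for b, rows in enumerate(np.array_split(np.arange(F), n_bands)):
        energy = V[rows, :].sum(axis=0)
        flux = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
        env[b] = flux / (flux.max() + 1e-12)
    return env

# Placeholder power STFT and hypothetical component counts.
rng = np.random.default_rng(2)
V = rng.random((513, 400))
Rd, Rr = 5, 10

# Drums activations start from the onset envelopes; all remaining
# hyperparameters are randomly initialized, as described above.
H0 = rng.random((Rd + Rr, V.shape[1]))
H0[:Rd] = onset_envelopes(V, Rd) + 1e-3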

2) Results: The proposed GP separation method was tested on ten 30-second excerpts sampled at 44.1 kHz from the Quaero16 source separation corpus. The excerpts featured many different kinds of music signals, including pop, electro-pop, rock, reggae and bossa. For each of these excerpts, the ground-truth drums and musical residual signals are known for evaluation, but the separation systems can only observe their mixtures. The relative amplitude $20 \log_{10} \frac{\sum_x |s_r(x)|}{\sum_x |s_d(x)|}$ of the musical residual signal was, on average, +6 dB compared to the drums signal.

We applied the method proposed by Gillet and Richard in [18] to the same mixtures, and the quality of the results was quantified through the BSS Eval toolbox [42]. The separation quality was evaluated both on the drums signals and on the musical residual signal.

16 http://www.quaero.org


Figure 3. Separation of two stationary random fields (D = 2) using a Gaussian process model. On the left are the original sources, in the center is the mixture, and on the right are the estimated sources.

Figure 4. Evaluation of the separation quality for the extraction of the drums track (top) and the musical residual signal (bottom). Higher is better.

The metrics obtained through BSS Eval include the Source to Distortion Ratio (SDR), the Source to Artifact Ratio (SAR) and the Source to Interference Ratio (SIR), which are all expressed in dB. Whereas the SDR is a global measure of separation quality, the SAR and SIR respectively measure the amount of separation/reconstruction artifacts and the amount of energy coming from the other sources. Results are given in Figure 4.
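As an aside, these metrics can be computed in Python with the mir_eval package, whose separation module reimplements the BSS Eval measures; the original experiments used the MATLAB toolbox of [42], and the signals below are random placeholders.

import numpy as np
import mir_eval  # reimplements the BSS Eval metrics of [42]

# Placeholder ground-truth and estimated signals (1 second at 44.1 kHz).
rng = np.random.default_rng(3)
s_drums, s_residual = rng.standard_normal((2, 44100))
est_drums = s_drums + 0.1 * rng.standard_normal(44100)
est_residual = s_residual + 0.1 * rng.standard_normal(44100)

reference = np.vstack([s_drums, s_residual])
estimates = np.vstack([est_drums, est_residual])

# SDR, SIR and SAR for each source, in dB.
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimates)
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)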

From Figure 4, we can see that the GP model presented in section V-B1 manages to separate the drums and musical residual signals very well on many different kinds of music, and both signals are well recovered. A further feature of this technique is that it is extremely fast: on average, 30 seconds are needed to process a 30-second excerpt, whereas 300 seconds are needed by [18]. Sound excerpts and a full implementation in Python of this separation technique are freely available on our website17.

VI. CONCLUSION

In this study, we have stated the linear underdetermined, instantaneous, convolutive and multiple-output source separation problems in terms of Gaussian process regression, and have shown that this formulation leads to simple formulas for optimal signal separation in the MMSE sense. The advantages of setting out the source separation problem in terms of GPs are numerous.

Firstly, there is neither notational burden nor any conceptual issue raised when using input spaces X different from R or Z, thus enabling a vast range of source separation problems to be handled within the same framework. Multi-dimensional signal separation may include audio, image or video sensor arrays as well as geostatistics.

Secondly, GP source separation can perfectly well be used for the separation of signals that are not locally stationary. Of course, some important simplifications of the computations presented in sections IV-D and IV-E are lost when using non-stationary covariance functions. Still, the frame-wise independence assumption presented in section IV-C may nonetheless be used to make the estimations computationally tractable.

Thirdly, it provides a coherent probabilistic way to take many sorts of relevant prior information into account. Indeed, prior information is encapsulated in the choice of the covariance functions, and the framework proposed here thus clearly distinguishes between the optimal separation methods and the particular models considered.

17 http://www.telecom-paristech.fr/~liutkus/GPSS/


Finally, we have seen that under appropriate assumptions, the optimization of the hyperparameters of a GP model is equivalent to a classical NTF using the Itakura-Saito divergence on the spectrogram tensor of the mixtures, thus enabling efficient estimation of the hyperparameters.

Setting the source separation problem in such a unified framework allows it to be considered from a larger perspective, where its objective is to separate additive independent functions on arbitrary input spaces that are mathematically characterized by their first and second moments only.

REFERENCES

[1] P. Abrahamsen. A review of Gaussian random fields and correlation functions. Technical Report 878, Norsk Regnesentral, Oslo, Norway, April 1997.
[2] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Neural Information Processing Systems (NIPS), pages 57–64. MIT Press, 2008.
[3] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In Neural Information Processing Systems (NIPS), pages 153–160. MIT Press, 2007.
[4] P. Boyle and M. Frean. Multiple output Gaussian process regression. Technical report, Victoria University of Wellington, April 2005.
[5] P. Boyle and M. R. Frean. Dependent Gaussian processes. In Neural Information Processing Systems (NIPS), pages 217–224. MIT Press, 2004.
[6] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90:2009–2026, October 1998.
[7] A. T. Cemgil, S. J. Godsill, P. H. Peeling, and N. Whiteley. Bayesian statistical methods for audio and music processing. In The Oxford Handbook of Applied Bayesian Analysis. ISBN 978-0-19-954890-3. Oxford University Press, 2010.
[8] A. T. Cemgil, P. Peeling, O. Dikmen, and S. Godsill. Prior structures for time-frequency energy distributions. In Proc. of the 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'07), pages 151–154, NY, USA, October 2007.
[9] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley Publishing, September 2009.
[10] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation: Independent Component Analysis and Blind Deconvolution. Academic Press, 2010.
[11] P. J. Diggle and P. J. Ribeiro. Model-based Geostatistics. Springer Series in Statistics. Springer, 1st edition, March 2007.
[12] O. Dikmen and A. T. Cemgil. Gamma Markov random fields for audio source modelling. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):589–601, March 2010.
[13] O. Dikmen and A. T. Cemgil. Unsupervised single-channel source separation using Bayesian NMF. In Proc. of the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'09), pages 93–96, NY, USA, October 2009.
[14] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David. Main instrument separation from stereophonic audio signals using a source/filter model. In Proc. 17th European Signal Processing Conference (EUSIPCO'09), pages 15–19, Glasgow, UK, August 2009.
[15] C. Duxbury, J. P. Bello, M. Davies, and M. Sandler. Complex domain onset detection for musical signals. In Proc. Digital Audio Effects Workshop (DAFx), London, UK, September 2003.
[16] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3):793–830, March 2009.
[17] D. FitzGerald, M. Cranitch, and E. Coyle. On the use of the beta divergence for musical source separation. In Proc. of the Irish Signals and Systems Conference (ISSC'08), 2008.
[18] O. Gillet and G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):529–540, 2008.
[19] A. Girard, C. E. Rasmussen, J. Quiñonero-Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs - application to multiple-step ahead time series forecasting. In Neural Information Processing Systems (NIPS), pages 529–536. MIT Press, 2002.
[20] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Neural Information Processing Systems (NIPS), pages 493–499. The MIT Press, 1997.
[21] M. Helén and T. Virtanen. Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In Proc. 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, 2005.
[22] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, August 2002.
[23] E. T. Jaynes and G. L. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[24] A. G. Journel and C. J. Huijbregts. Mining Geostatistics. Academic Press, London; New York, 1978.
[25] L. De Lathauwer. Decompositions of a higher-order tensor in block terms—part II: Definitions and uniqueness. SIAM J. Matrix Anal. Appl., 30(3):1033–1066, September 2008.
[26] J. Le Roux, E. Vincent, Y. Mizuno, H. Kameoka, N. Ono, and S. Sagayama. Consistent Wiener filtering: Generalized time-frequency masking respecting spectrogram consistency. In Proc. 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2010), pages 89–96, St. Malo, France, September 2010.
[27] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), volume 13, pages 556–562. The MIT Press, April 2001.
[28] D. MacKay. Gaussian processes - a replacement for supervised neural networks? In Neural Information Processing Systems (NIPS). MIT Press, 1997.
[29] G. Matheron. The intrinsic random functions and their applications. Advances in Applied Probability, 5(3):439–468, 1973.
[30] A. Melkumyan and F. Ramos. Multi-kernel Gaussian processes. In Neural Information Processing Systems (NIPS). MIT Press, 2009.
[31] A. Melkumyan and F. Ramos. A sparse covariance function for exact Gaussian process inference in large datasets. In IJCAI'09: Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1936–1942, San Francisco, CA, USA, July 2009. Morgan Kaufmann Publishers Inc.
[32] A. Ozerov and C. Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):550–563, March 2010.
[33] S. Park and S. Choi. Gaussian processes for source separation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'08), pages 1909–1912, Las Vegas, USA, March 2008.
[34] J. Quiñonero-Candela, C. E. Rasmussen, and R. Herbrich. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, December 2005.
[35] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[36] M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69–106, April 2004.
[37] E. Snelson. Local and global sparse Gaussian process approximations. In Proceedings of Artificial Intelligence and Statistics (AISTATS), volume 2, pages 524–531, San Juan, Puerto Rico, March 2007.
[38] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Neural Information Processing Systems (NIPS), pages 1257–1264. MIT Press, 2006.
[39] J. Ichiro Tomita and Y. Hirai. Acoustic echo cancellation using Gaussian processes. In Proc. 15th Int. Conf. on Neural Information Processing (ICONIP'08), volume 5507 of Lecture Notes in Computer Science, pages 353–360. Springer, 2008.
[40] J. Vanhatalo and A. Vehtari. Modelling local and global phenomena with sparse Gaussian processes. In Proc. of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pages 571–578, Helsinki, Finland, July 2008. AUAI Press.
[41] E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):528–537, March 2010.
[42] E. Vincent, C. Févotte, and R. Gribonval. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
[43] N. Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series with Engineering Applications. MIT Press, 1949.
[44] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1999.


Antoine Liutkus was born in France on February 23rd, 1981. He received the State Engineering degree from Telecom ParisTech, France, in 2005, and the M.Sc. degree in acoustics, computer science and signal processing applied to music (ATIAM) from the Université Pierre et Marie Curie (Paris VI), Paris, in 2005. He worked as a research engineer on source separation at Audionamix from 2007 to 2010 and is currently pursuing the Ph.D. degree in the Department of Signal and Image Processing, Telecom ParisTech.

His research interests include statistical music processing, source separation and machine learning methods applied to signal processing.

Roland Badeau (M'02-SM'10) was born in Marseille, France, on August 28, 1976. He received the State Engineering degree from the École Polytechnique, Palaiseau, France, in 1999, the State Engineering degree from the École Nationale Supérieure des Télécommunications (ENST), Paris, France, in 2001, the M.Sc. degree in applied mathematics from the École Normale Supérieure (ENS), Cachan, France, in 2001, and the Ph.D. degree from the ENST in 2005, in the field of signal processing. He received the ParisTech Ph.D. Award in 2006, and the Habilitation à Diriger des Recherches degree from the Université Pierre et Marie Curie (UPMC), Paris VI, in 2010.

In 2001, he joined the Department of Signal and Image Processing, Telecom ParisTech (ENST), as an Assistant Professor, where he became Associate Professor in 2005. From November 2006 to February 2010, he was the manager of the DESAM project, funded by the French National Research Agency (ANR), whose consortium was composed of four academic partners. His research interests include high resolution methods, adaptive subspace algorithms, nonnegative factorizations, audio signal processing, and musical applications. Roland Badeau is a Senior Member of the IEEE Signal Processing Society and a Chief Engineer of the French Corps of Mines (foremost of the great technical corps of the French state). He is the author of 17 journal papers and 40 international conference papers.

Gaël Richard (SM'06) received the State Engineering degree from Telecom ParisTech, France (formerly ENST), in 1990, the Ph.D. degree from LIMSI-CNRS, University of Paris-XI, in 1994 in speech synthesis, and the Habilitation à Diriger des Recherches degree from the University of Paris XI in September 2001.

After the Ph.D. degree, he spent two years at the CAIP Center, Rutgers University, Piscataway, NJ, in the Speech Processing Group of Prof. J. Flanagan, where he explored innovative approaches for speech production. From 1997 to 2001, he successively worked for Matra, Bois d'Arcy, France, and for Philips, Montrouge, France. In particular, he was the Project Manager of several large-scale European projects in the field of audio and multimodal signal processing. In September 2001, he joined the Department of Signal and Image Processing, Telecom ParisTech, where he is now a Full Professor in audio signal processing and Head of the Audio, Acoustics, and Waves research group. He is a coauthor of over 80 papers and inventor in a number of patents. He is also one of the experts of the European commission in the field of speech and audio signal processing.

Prof. Richard is a member of the EURASIP and an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing.