missing data imputation for spectral audio signals …bhiksha/courses/mlsp.fall... · ponents for...

MISSING DATA IMPUTATION FOR SPECTRAL AUDIO SIGNALS

Paris SmaragdisAdobe Systems

Newton, MA, USA

Bhiksha RajCarnegie Mellon University

Pittsburgh, PA, USA

Madhusudana ShashankaMars Inc.

Hackettstown, NJ, USA

ABSTRACTWith the recent attention to audio processing in the time -frequency domain we increasingly encounter the problem ofmissing data. In this paper we present an approach that allowsfor imputing missing values in the time-frequency domain ofaudio signals. The presented approach is able to deal withreal-world polyphonic signals by performing imputation evenin the presence of complex mixtures. We show that this ap-proach outperforms generic imputation approaches, and wepresent a variety of situations that highlight its utility.

1. INTRODUCTION

In this paper we address the problem of estimating missing re-gions of time-frequency representations of audio signals.Theproblem of missing data in the time-frequency domain occursin several scenarios. The problem is common in speech pro-cessing algorithms that employ computational auditory sceneanalysis or related methods to mask out time-frequency com-ponents for recognition, denoising or signal separation [1, 2].An increasing number of audio processing tools allow inter-active spectral editing of audio signals, which can often resultin the excision of time-frequency regions of sound. In yetother scenarios, such as in signals that have passed throughatelephone or have encountered other linear or non-linear fil-tering operations, excision of time-frequency regions of thesignal can occur naturally.

In a majority of these scenarios the goal is to resynthesizethe audio from the incomplete time-frequency characteriza-tions. To do so, the “missing” regions of the time-frequencyrepresentations must first be “filled in” somehow, in order toeffect the transform from the time-frequency representation toa time-domain signal. In certain cases the missing values canbe set to zero and the resulting reconstructions do not sufferheavily from perceptible artifacts. In most cases however amoderate to severe distortion is easily noticeable. This distor-tion can render speech recordings unintelligible, or severelyreduce the quality of a music recording thereby distractingthe listener.

Although existing generic imputation algorithms [3] canbe used to infer the values of the missing data they are of-ten ill-suited for use with audio signals and result in audibledistortions. Other algorithms such as those in [1, 4] are suit-able for imputation of missing time-frequency components in

speech. However, these algorithms typically exploit continu-ity in spectral structures and are implicitly aided by the factthat speech recordings usually contain signals from a singlesource (the voice) and exhibit spectro-temporal patterns fromthat source alone. General audio or music recordings howeverinclude a variety of sounds, many of which are concurrentlyactive at any time, and each of which has its own typical pat-terns, and are hence much harder to model.

The algorithm proposed in this paper characterizes time-frequency representations as histograms of draws from a mix-ture model as proposed in [5]. Being a mixture, the model isimplicitly capable of representing multiple concurrent spec-tral patterns. The process of imputing missing values thenbecomes one of learning from incomplete data. Experimen-tal evaluations show that the imputed spectral constructionsobtained with our algorithm result in distinctly better resyn-thesized signals than those obtained from other common im-putation methods.

2. MISSING DATA IN THE TIME-FREQUENCYDOMAIN

Before we proceed, we briefly describe the time-frequencyrepresentations used to characterize the signal, and presentsome examples with missing data. We will assume that thetime-frequency representations are derived through short-timeFourier transformation (STFT) of the signal [6]. The short-time Fourier transform converts an audio signal into a se-quence of vectors, each of which represents the Fourier spec-trum of a short (typically 20-60ms wide) segment orframeof the signal. The STFT of a signal can be inverted to theoriginal time-domain signal by an inverse short-time Fouriertransform. Being complex, the STFT of the signal has botha magnitude and a phase. However most sound processingalgorithms for denoising, recognition, synthesis and editingoperate primarily on the magnitude since it is known to rep-resent most of the perceptual information in the signal – thephase contributes mainly to the perceived quality of the sig-nal rather than its intelligibility. Recovery of full or partialmissing phase information can be done with good results us-ing the technique in [7]. Since the phase reconstruction ishighly dependent on the magnitude values we will primarilyexamine the reconstruction of the magnitudes of the missingtime-frequency terms in this paper.

Figure 1a shows an example of the magnitude of the STFTof a speech signal that is a mixture of a spoken utterance anda phone ring. We will refer to matrix-like representationsof the magnitudes of STFTs of a signal, such as the one inFigure 1a as “magnitude spectrograms”, or simply as “spec-trograms” in this paper. Spectral magnitudes have the prop-erty that when multiple sounds co-occur, the spectral magni-tudes of the mixed signal are approximately (but not exactly)equal to the sum of the spectral magnitudes of the compo-nent sounds1. This is apparent in the spectrogram in Figure1a which shows the distinctive spectral patterns of both thespeech and phone ring.

Time

Fre

quen

cy

Mixture spectrogram

Time

Fre

quen

cy

Masked spectrum

Fre

quen

cy

Time

Manually masked spectrum

Fig. 1. a. Magnitude spectrogram of a mixture of a spokenutterance and a phone ring. The spectrogram clearly showsthe spectral patterns for both signals – the wavy lines andthe clouds of data are derived from the spoken word and thephone ring is represented by the three horizontal, parallelandthin lines. b. An incomplete spectrogram with missing re-gions. In this example time-frequency components deemednot to belong to the main speech signal have been erased byan automated algorithm.c. Another example of an incom-plete spectrogram. Here time-frequency regions dominatedby the phone ring have been manually edited out.

A spectrogram with “missing” data is one where sometime-frequency components have been lost or erased due to

1Although in theory this statement holds true for spectralpower when thecomponent signals are uncorrelated, in practice phase cancelations in finiteanalysis windows make this true for spectral magnitudes raised to a powerthat is closer to 1.0 than 2.0.

some reason. Figures 1b and 1c show examples of spectro-grams with missing data. In these examples time-frequencyregions of the spectrogram have been erased to eliminate thephone ring from the signal, automatically in one case and bymanual editing in the other. Missing data may also occur forother reasons, such as systematic filtering etc. In order to re-construct a time-domain signal from these incomplete spec-trograms the values in the missing regions must be imputedsomehow. A simple technique is to simply floor these termsto some threshold value; however time-domain signals recon-structed from such spectrograms will often contain audibleand often unacceptable artefacts. A more principled approachis required to reconstruct the missing regions in an acceptablemanner.

3. MODELING THE SPECTROGRAM

At the outset, we would like to specify the terminology andnotation we will use. We will denote the (magnitude) spec-trogram of any signal asS. The spectrogram consists of asequence of (magnitude) spectral vectorsSt, 0 ≤ t < T ,each of which in turn consists of a number of frequency com-ponentsSt(f), 0 ≤ f < F . All of these terms, being magni-tudes are non-negative.

In our model we viewS as a scaled version of a histogramŜ comprising purely integer valued components, such thatS = C−1Ŝ; however the scaling factorC cancels out of allequations in our formulation and is thus not required to beknown. In the rest of the paper, we therefore treatS itself as ahistogram and do not explicitly invoke the scaling factor, be-sides assuming that it is very large so that for anyŜt(f) suchthatSt(f) = C−1Ŝt(f) the following holds:

C−1Ŝt(f) ≈ C−1(Ŝt(f) + 1). (1)

4. MODEL DEFINITION

We model each spectral vectorSt as a scaled histogram ofdraws from a mixture multinomial distribution. Per our model,St is generated by repeated draws from a distributionPt(f),wheref is a random variable over the frequencies{1, . . . , F},andPt(f) is a mixture multinomial given by

Pt(f) =∑

z

Pt(z)P (f |z) (2)

Herez represents the identity of the multinomial componentsin the mixture. P (f |z) are the multinomial components or“bases” that compose the mixture. Note that the componentmultinomialsP (f |z) are not specific to any givenSt but arethe same at allt. P (f |z) are thus assumed to be characteristicto the entire data set of whichS is representative. The onlyparameter that is specific tot are the mixture weightsPt(z).

The model essentially characterizes the spectral vectorsthemselves as additive combinations of of histograms drawn

from each of the multinomial bases. Consequently, it is in-herently able to model complex sounds such as music that areadditively composed by several component sounds. This is incontrast to conventional models used for data imputatione.g.[1, 2, 4], that model the spectral components as the outcomeof a single draw from a distribution (although the distributionitself might be a mixture) and cannot model the additive na-ture of the data.

The model of Equation 2 is nearly identical to the one-sided model for Probabilistic Latent Semantic Analysis, in-troduced by Hoffman [8], with the distinction that whereasthe original PLSA model characterizes random variables asdocuments and words, we refer instead to time and frequen-cies. Also, while the one-sided PLSA specifies a probabilitydistribution over documents, in our model we do not have asimilar probability distribution over the time variablet.

4.1. Learning the model parameters

We can estimate the parameters of the generative model inEquation 2 forS using the Expectation Maximization algo-rithm [8]. In the expectation step of the EM algorithm at eachtime t we estimate thea posteriori probabilities for the multi-nomial bases conditioned on the observation of a frequencyfas:

Pt(z|f) =Pt(z)P (f |z)∑z′ Pt(z

′)P (f |z′)(3)

In the maximization step we update the spectral-vector-specificmixture weightsPt(z) and data-set characteristic multinomialbasesP (f |z) as:

Pt(z) =

∑f Pt(z|f)St(f)∑

z′

∑f Pt(z

′|f)St(f)(4)

P (f |z) =

∑t Pt(z|f)St(f)∑

f ′

∑t Pt(z|f

′)St(f)(5)

5. ESTIMATION WITH INCOMPLETE DATA

The algorithm of Section 4.1 assumes that the entire spec-trogramS is available to learn model parameters. When thespectrograms are incomplete, several of theSt(f) terms inEquation 4 will be missing or otherwise unknown. Our ob-jective in this paper is to estimate the missing componentson the data. Along the way we will also estimate the modelparameters themselves as necessary.

In the rest of the paper we use the following notation: wewill denote theobserved regions of any spectrogramS asSoand themissing regions asSm. Within any spectral vectorSt of S, we will represent the set of observed components asSot and the missing components asS

mt . S

ot (f) andS

mt (f)

will refer to specific frequency components ofSot and Smt

respectively.Fot will refer to the set of frequencies for whichthe values ofSt are known,i.e. the set of frequencies inSot .

Fmt will similarly refer to the set of frequencies for which thevalues ofSt are missing,i.e. the set of frequencies inSmt .

5.1. The Conditional Distribution of Missing Terms

In the first step we obtain the conditional probability distri-bution of the missing termsSmt (f) given the observed termsSot and the probability distributionPt(f) from whichS

t wasdrawn. LetNot =

∑f∈Fo

t

Sot (f). Not is the total value

of all observed spectral frequencies at timet. Let Po,t =∑f∈Fo

t

Pt(f) be the total probability of all observed frequen-cies att. The probability distribution ofSmt , given that thefrequencies inFot are known to have been drawn exactlyN

ot

times cumulatively is easily shown to be given by thenegativemultinomial distribution [9]:

P (Smt ) =Γ(Not +

∑f∈Fm

t

Smt (f))

Γ(Not )∏

f∈Fmt

Γ(Smt (f) + 1)P

Not

o,t

∏

f∈Fmt

Pt(f)Sm

t(f)

(6)whereFmt is the set of all frequency components inS

mt . The

expected value of any termSmt (f) whose probability is spec-ified by Equation 6 is given by2:

E[Smt (f)] = Not

Pt(f)

Po,t(7)

We now describe the actual learning procedures to estimatemodel parameters and missing spectral components. We iden-tify two situations, stated in order of increasing complexityas (a) where the multinomial bases for the dataP (f |z) areknown a priori and only the mixture weightsPt(z) are un-known, and (b) where none of the model parameters are known.We address them in reverse order below to simplify the pre-sentation.

5.2. Learning the Model Parameters from Incomplete Data

Let Λ be the set of all parametersPt(z) andP (f |z) of themodel defined by Equation 2. We derive a set of likelihood-maximizing rules to estimateΛ fromSo using the ExpectationMaximization algorithm as follows.

We denote the set of draws that resulted in the the gener-ation ofS asz. Thecomplete data specification required byEM is thus given by(So,Sm, z) = (S, z), whereSm andzare unseen. The EM algorithm iteratively estimates the valuesof Λ that maximizes the expected value of the log likelihoodof the complete data with respect to the unseen variables [10],i.e. it optimizes:

Q(Λ, Λ̂) = ESm,z|So,Λ̂logP (S

o,Sm, z|Λ)

= ESm|So,Λ̂Ez|S,Λ̂logP (S

o,Sm, z|Λ) (8)

2To be precise Equations 6 and 7 must actually be specified in terms ofC−1No

t+1; however, given the assumption in Equation 1, Equation 7, which

is the primary equation of interest remains valid.

whereΛ̂ is the current estimate ofΛ. Since the draws thatcompose any spectrumSt are independent of those that com-pose any otherSt′ , Equation 8 simplifies to:.

Q(Λ, Λ̂) =∑

t

ESm

t|So

t,Λ̂Ezt|St,Λ̂logP (S

ot S

mt , zt|Λ) (9)

wherezt is the set of draws that composedSt. OptimizingEquation 9 with respect toΛ, and invoking Equation 7 leadsus to the following update rules:

Pt(z|f) =Pt(z)P (f |z)∑z′ Pt(z

′)P (f |z′)(10)

Nt =∑

f∈Fot

Sot (f)

Pt(f)(11)

S̄t(f) = St(f) if f ∈ Fot

NtPt(f) if f ∈ Fmt (12)

Pt(z) =

∑f Pt(z|f)S̄t(f)∑

z′

∑f Pt(z

′|f)S̄t(f)(13)

P (f |z) =

∑t Pt(z|f)S̄t(f)∑

f ′

∑t Pt(z|f

′)S̄t(f)(14)

Note thatS̄t(f) are also the minimum mean-squared es-timates of the terms inSm. The above update rules thus alsoimplicitly impute the missing values of the data.

In some situations the multinomial basesP (f |z) may beavailable, for instance when they have been learned separatelyfrom regions of the data that have no missing time-frequencycomponents. In such situations, only the mixture weightsPt(z) need to be learned for any incomplete spectral vectorSt in order to estimatēSt(f). This can be achieved simply byiterations of Equations 10, 11, 12 and 13.

0 5 10 15 20 25 30 35 40 45 50−9.8

−9.7

−9.6

−9.5

−9.4

−9.3

−9.2

−9.1

−9

−8.9Model likelihoods over training iterations

Iteration #

Like

lihoo

ds

Likelihood over observed dataLikelihood over all data

Fig. 2. Likelihood of observed dataSo (solid line) and com-plete spectrogramS (dashed line) as a function of iterationfor the data in Figure 6.

The likelihood of the model for the observed data is guar-anteed to converge monotonically, and that has been validatedwith multiple experiments. The model likelihood over both

the observed and the imputed data is got guaranteed to in-crease all the time however it always converges, and does thatmonotonically so after the first few iterations once the im-putation becomes increasingly plausible. Figure 2 shows theconvergence patters for the experiment in the next section.

6. EXPERIMENTAL EVALUATION

In this section we will evaluate the algorithms of Section 5 onseveral examples of spectrograms with missing time-frequencyregions, showing both their convergence and effectivenessatimputation. In our examples the spectrograms are from com-plex musical recordings with multiple, additive concurrentspectral patterns. Such data are particularly difficult to im-pute using conventional algorithms.

6.1. Illustrative example

We first evaluate our proposed approach with an illustrativeexample. The input data in this case consisted of a syntheticpiano recording of some isolated notes and subsequently of amixture of these notes. We removed a triangular time-frequencysection of the part where the multiple notes took place. Wetrained our model using the isolated note sections and ex-tracted 60 multinomial bases which we then used to imputethe missing values. Figure 3 shows both the original spec-trogram (with the missing region marked by the dotted line)and the reconstructed spectrogram. As a comparator we alsoshow the complete spectrograms obtained by imputation ofthe missing regions using SVD and K-nearest neighbors, twoalgorithms which are two commonly used models that of-ten perform very well in imputation problems [3, 11]. SVDin particular also models the data additively and may be ex-pected to capture additive patterns.

K−NN SVD Proposed model Ground Truth

Fig. 3. Comparison of three data imputation approaches ona simple problem. The missing data area is denoted by thedashed black line in all plots. The first plot from the left showsthe results from K-NN imputation, the second from SVD im-putation, the third using our proposed model and the fourth isthe ground truth.

We note that even in this simple problem the K-NN ap-

proach does a poor job of modeling all three notes and in-stead averages out the imputed data thus creating a visiblyand audibly incorrect reconstruction. The SVD imputationis more successful, appropriately borrowing elements to re-construct the coinciding notes. However this approach cannotguarantee that the the imputed output will be non-negative(as required for magnitude spectra) and can potentially returnnegative values which must be either set to zero, or be recti-fied, resulting in musical noise. The proposed method prop-erly layers elements of multiple notes to impute the missingdata and does not suffer from producing negative values. Theresult is almost indistinguishable from the ground truth.

Although our approach is not as fast as the k-nearest neigh-bors approach, it is an order of magnitude faster than theSVD approach since each iteration involves the evaluation ofa small number of inner products as opposed to the computa-tion of an entire SVD. The computation times for the aboveexample were 0.7 seconds for k-nearest neighbors, 90 sec-onds for the SVD and 8 seconds for our proposed approach.Simulations were run on an ordinary laptop computer.

6.2. Real-world examples

In this section we will consider some complex cases derivedout of real-world recordings with challenging missing datacases. We will first examine a more complex case of the aboveexample where we attempt to fill a large continuous gap inthe spectrogram for a complex musical piece. The sectionwith the gap was a five second real piano recording of Bach’sthree-part piano invention #3 in D - BWV 879 [12]. The sizeof the gap was 4.3 seconds by 3 kHz at its widest extent.We also used 10 seconds of data from another piano piece(two-part invention #3 in D -BWV 774) which provided theneeded information to impute the missing data. We extracted60 multinomial bases while training on both the complete andthe incomplete data. As a comparator, we also reconstructedthe spectrogram using a rank-60 SVD. The sampling rate was14700Hz and the spectrum size was 1024 points The imputa-tion results are shown in figure 4. One can find some visualinconsistencies using the SVD most noticeably in the roughertexture of the reconstructed area. In contrast our approachre-sults into visually more plausible results. We also invert thespectrograms back to the time domain in order to perform alistening evaluation. In order to accurately compare the arti-facts caused by the imputation we used the phase values fromthe original signal. The proposed method resulted in virtuallyinaudible reconstruction artifacts whereas the SVD approachintroduced audible noise3.

In the next example we attempt to performbandwidth ex-pansion. Audio signals can often be bandlimited by havingbeen passed through a restrictive channel, such as a telephone.

3The soundfiles for all the examples in the remainder of this section areattached in the original PDF file of this paper. They can be played using anyPDF reading software that allows the viewing of PDF attachments.

Ground Truth Missing Data Input

SVD Imputation Proposed Imputation

Fig. 4. Example reconstruction of a gap filling experiment.The leftmost plot shows the actual data, the second plot showsthe input with the large gap removing about 15% of the data,we have zoomed into the region of the gap by not plot-ting some of the higher frequency content. the third plot isthe SVD reconstruction and the fourth plot is our proposedmethod.

The spectral excision here is very systematic and consistent.We learn a set of 120 multinomial bases from other widebandtraining data, and use these to infer the missing spectral con-tent of the bandlimited signal. This presumes that the trainingdata is similar in nature to the testing data (i.e. if we need toresample piano music we should train on piano music). Anexample of this is shown in figure 5. We removed 80% ofthe upper frequencies of a ten second rock recording of thesong ”Back in Black” by the band AC/DC and we trained onan eight second recording by the same band playing a differ-ent song (”Highway to Hell”). The sampling rate was onceagain 14700Hz and the spectrum size was 1024 points. Asa comparator we have also reconstructed the spectrogram us-ing rank-120 SVD. The SVD clearly underperforms in boththe audible and the visual reconstruction in this experiment,whereas our proposed method results in a plausible, althoughas expected non-exact, reconstruction.

Finally we present a very challenging case where the miss-ing data was evenly and randomly distributed across the in-put. For this test a smoothed random binary mask was ap-plied to an input sound so that about 60% of the data wasremoved. The input sound was sampled as 22050Hz andthe time-frequency values were obtained using a 1024 pointshort-time Fourier transform. We modeled the data with amixture of 60 multinomials. As a comparator, a rank-60 SVDwas also used to reconstruct the spectrogram. The results of

null

6112.669

gap_filling_missing_data.mp3Media File (audio/x-mp3)

null

6112.669

gap_filling_svd_imputation.mp3Media File (audio/x-mp3)

null

6138.7915

gap_filling_proposed_imputation.mp3Media File (audio/x-mp3)



Fig. 5. Example reconstruction of a bandwidth expansion.In the leftmost plot the original signal is shown. The secondplot displays the bandlimited input we used where 80% ofthe top frequencies were removed. The third plot is the SVDreconstruction and the fourth plot is the reconstruction usingour model.

this experiment are shown in figure 6.We note that the SVD model results in grainier recon-

struction in the missing parts, whereas the proposed modelresults in a smoother output. Our proposed model results inreconstructed audio signals of significantly improved quality.The processing artifacts are virtually imperceptible unless thesound is compared to the original input.

7. REFERENCES

[1] B. Raj, Reconstruction of Incomplete Spectrograms for RobustSpeech Recognition, Ph.D. Dissertation, Carnegie Mellon Uni-versity, May 2000.

[2] Sam T. Roweis. One Microphone Source Separation. NIPS2000: 793-799.

[3] M. E. Brand. Incremental Singular Value Decomposition ofUncertain Data with Missing Values, European Conference onComputer Vision (ECCV), Vol 2350, pps 707-720, May 2002.

[4] M.J. Reyes-Gomez,N. Jojic and D.P.W. Ellis. Detailed graphi-cal models for source separation and missing data interpolationin audio. 2004 Snowbird Learning Workshop Snowbird, Utah,March 2004

[5] M. Shashanka, B. Raj, P. Smaragdis. Sparse OvercompleteLa-tent Variable Decomposition of Counts Data. NIPS 2000.



Fig. 6. Example reconstruction of a music signal with a bi-nary mask occluding roughly 60% of the samples. The left-most plot shows the original signal, the second plot shows themasked input we used for the reconstruction, the third plotshows the reconstruction using the SVD and the fourth oneshows the reconstruction using our model.

[6] Smith, J. O. Spectral Audio Signal Processing, March 2007Draft, http://ccrma.stanford.edu/∼jos/sasp/,online book, accessed June 2008.

[7] J. Bouvrie and T. Ezzat. ”An Incremental Algorithm for Sig-nal Reconstruction from Short-Time Fourier Transform Magni-tude”, in Proc. Ninth International Conference on Spoken Lan-guage Processing (Interspeech), Pittsburgh, 2006.

[8] T. Hofmann and J. Puzicha. Unsupervised learning from dyadicdata. TR 98-042, ICSI, Berkeley, CA, 1998.

[9] M. Hazewinkel. Encyclopedia of Mathematics.http://eom.springer.de/

[10] A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum likeli-hood from incomplete data via the EM algorithm. Journal of theRoyal Statistical Society, B, 39, 1-38. 1977.

[11] Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P.and Botstein, D. Imputing Missing Data for Gene ExpressionArrays. Technical report (1999), Stanford Statistics Department.

[12] Gould, G. Bach: The Two and Three Part Inventions - TheGlenn Gould Edition, by SONY classics, ASIN B000GF2YZ8,1994.

null

8907.77

bandwidth_expansion_missing_data.mp3Media File (audio/x-mp3)

null

8933.892

bandwidth_expansion_svd_imputation.mp3Media File (audio/x-mp3)

null

8907.77

bandwidth_expansion_proposed_imputation.mp3Media File (audio/x-mp3)

null

7680.0225

binary_mask_missing_data.mp3Media File (audio/x-mp3)

null

7680.0225

binary_mask_svd_imputation.mp3Media File (audio/x-mp3)

null

7680.0225

binary_mask_proposed_imputation.mp3Media File (audio/x-mp3)

http://ccrma.stanford.edu/~jos/sasp/

Introduction Missing data in the time-frequency domain Modeling the Spectrogram Model Definition Learning the model parameters

Estimation with Incomplete Data The Conditional Distribution of Missing Terms Learning the Model Parameters from Incomplete Data

Experimental Evaluation Illustrative example Real-world examples

References

missing data imputation for spectral audio signals …bhiksha/courses/mlsp.fall... · ponents for...

Documents