Single-Channel Multi-Speaker Separation using Deep Clustering

    Isik, Y.; Le Roux, J.; Chen, Z.; Watanabe, S.; Hershey, J.R.

    TR2016-073 September 2016

Abstract

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal to distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.


This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2016
201 Broadway, Cambridge, Massachusetts 02139

Single-Channel Multi-Speaker Separation using Deep Clustering

Yusuf Isik1,2,3, Jonathan Le Roux1, Zhuo Chen1,4, Shinji Watanabe1, John R. Hershey1

1 Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
2 Sabanci University, Istanbul, Turkey, 3 TUBITAK BILGEM, Kocaeli, Turkey

    4 Columbia University, New York, NY, USA

Index Terms: single-channel speech separation, embedding, deep learning

1. Introduction

The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of party goers. Solving this so-called cocktail party problem [1] has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Previously, no practical method existed that could solve the problem in general conditions, especially in the case of single-channel speech mixtures. This work builds upon recent advances in single-channel separation, using a method known as deep clustering [2]. In deep clustering, a neural network is trained to assign an embedding vector to each element of a multi-dimensional signal, such that clustering the embeddings yields a desired segmentation of the signal. In the cocktail-party problem, the embeddings are assigned to each time-frequency (TF) index of the short-time Fourier transform (STFT) of the mixture of speech signals. Clustering these embeddings yields an assignment of each TF bin to one of the inferred sources. These assignments are used as a masking function to extract the dominant parts of each source. Preliminary work on this method produced remarkable

    This work was done while Y. Isik was an intern at MERL.

performance, improving SNR by 6 dB on the task of separating two unknown speakers from a single-channel mixture [2]. In this paper we present improvements and extensions that enable a leap forward in separation quality, reaching levels of improvement that were previously out of reach (audio examples and scripts to generate the data used here are available at [3]). In addition to improvements to the training procedure, we investigate the three-speaker case, showing generalization between two- and three-speaker networks.

The original deep clustering system was intended to only recover a binary mask for each source, leaving recovery of the missing features to subsequent stages. In this paper, we incorporate enhancement layers to refine the signal estimate. Using soft clustering, we can then train the entire system end-to-end, training jointly through the deep clustering embeddings, the clustering and enhancement stages. This allows us to directly use a signal approximation objective instead of the original mask-based deep clustering objective.

Prior work in this area includes auditory grouping approaches to computational auditory scene analysis (CASA) [4, 5]. These methods used hand-designed features to cluster the parts of the spectrum belonging to the same source. Their success was limited by the lack of a machine learning framework. Such a framework was provided in subsequent work on spectral clustering [6], at the cost of prohibitive complexity. Generative models have also been proposed, beginning with [7]. In constrained tasks, super-human speech separation was first achieved using factorial HMMs [8, 9, 10, 11] and was extended to speaker-independent separation [12]. Variants of non-negative matrix factorization [13, 14] and Bayesian nonparametric models [15, 16] have also been used.
These methods suffer from computational complexity and the difficulty of discriminative training.

In contrast to the above, deep learning approaches have recently provided fast and accurate methods on simpler enhancement tasks [17, 18, 19, 20]. These methods treat mask inference as a classification problem, and hence can be discriminatively trained for accuracy without sacrificing speed. However, they fail to learn in the speaker-independent case, where sources are of the same class [2], despite the work-around of choosing the best permutation of the network outputs during training. We call this the permutation problem: there are multiple valid output masks that differ only by a permutation of the order of the sources, so a global decision is needed to choose a permutation.

Deep clustering solves the permutation problem by framing mask estimation as a clustering problem. To do so, it produces an embedding for each time-frequency element in the spectrogram, such that clustering the embeddings produces the desired segmentation. The representation is thus independent of permutation of the source labels. It can also flexibly represent any number of sources, allowing the number of inferred sources to be decided at test time. Below we present the deep clustering model and further investigate its capabilities. We then present extensions to allow end-to-end training for signal fidelity. The results are evaluated using an automatic speech recognition model trained on clean speech. The end-to-end signal approximation produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
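The permutation-free property can be checked with a small sketch (the 3-bin, 2-source indicator below is hand-made for illustration; it is not from the paper):

```python
import numpy as np

# Hand-made indicator for 3 TF bins and 2 sources, plus the same
# assignment with the two source columns swapped.
y1 = np.array([[1, 0],
               [0, 1],
               [1, 0]])
y2 = y1[:, ::-1]                       # permuted source order

# A naive per-source error depends on which ordering the targets use:
est = y1.astype(float)                 # pretend the network output matched y1

def naive(y):
    return np.mean((est - y) ** 2)

print(naive(y1), naive(y2))            # → 0.0 1.0

# The affinity matrix Y Y^T is unchanged by the column swap, so any
# objective built on affinities is independent of the source labeling.
A1, A2 = y1 @ y1.T, y2 @ y2.T
print(np.array_equal(A1, A2))          # → True
```

The same estimate scores perfectly against one labeling and badly against its permutation, while the affinity representation is identical for both; this is why a clustering-based objective sidesteps the global permutation decision.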

2. Deep Clustering Model

Here we review the deep clustering formalism presented in [2]. We define x as a raw input signal and X_i = g_i(x), i ∈ {1, ..., N}, as a feature vector indexed by an element i. In audio signals, i is typically a TF index (t, f), where t indexes the frame of the signal, f indexes frequency, and X_i = X_{t,f} is the value of the complex spectrogram at the corresponding TF bin. We assume that the TF bins can be partitioned into sets of TF bins in which each source dominates. Once estimated, the partition for each source serves as a TF mask to be applied to X_i, yielding the TF components of each source that are uncorrupted by other sources. The STFT can then be inverted to obtain estimates of each isolated source. The target partition in a given mixture is represented by the indicator Y = {y_{i,c}}, mapping each element i to each of C components of the mixture, so that y_{i,c} = 1 if element i is in cluster c. Then A = Y Y^T is a binary affinity matrix
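The indicator, affinity, and masking constructions above can be made concrete with a toy example. All sizes and assignments here are hypothetical stand-ins: in the real system the labels come from clustering trained embeddings, not from a random generator.

```python
import numpy as np

T, F, C = 4, 3, 2                          # frames, freq bins, sources
N = T * F                                  # number of TF elements
rng = np.random.default_rng(0)
mixture = rng.random((T, F))               # stand-in for |X_{t,f}|
labels = rng.integers(0, C, size=N)        # stand-in cluster assignments

# Indicator Y: y[i, c] = 1 iff element i belongs to cluster c.
Y = np.eye(C, dtype=int)[labels]           # shape (N, C), one-hot rows

# Binary affinity matrix A = Y Y^T: A[i, j] = 1 iff i and j share a cluster.
A = Y @ Y.T                                # shape (N, N)

# Each cluster's indicator column, reshaped to (T, F), is a binary TF mask;
# applying it to the mixture keeps the TF bins that source dominates.
masks = [Y[:, c].reshape(T, F) for c in range(C)]
sources = [mixture * m for m in masks]
```

Because the masks partition the TF plane, the masked estimates sum back to the mixture magnitude, and A is symmetric with a unit diagonal (every element shares a cluster with itself).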

