singing voice separation - university of rochester
Singing Voice Separation
Christos Benetatos, Ge Zhu, Yoon mo Yang
Motivations
● Real-world applications
○ Automatic speech recognition (source separation)
○ Chord recognition and main melody extraction
● Commercial applications
○ Query-by-humming (e.g. Soundhound)
○ Automatic pitch correction (e.g. Autotune)
○ Singing synthesis (e.g. Vocaloid)
Background and Challenges
● Background
○ All pop music around the world uses singing
○ Singing voice is treated as the most expressive instrument
○ It carries two kinds of information: sound and words
● Challenges
○ For monaural recordings, only single-channel information is available
○ Polyphonic cases are difficult for conventional approaches
○ Datasets are important for DNN-based approaches
Low-Rank Approximation Methods for Voice Separation
What does it mean that the music background is a low-rank signal?
A musician's intuitive explanation
● Low-Rank (Context Level)
○ Most of the time, in music that is voice/lyrics-centered, where the music's role is just to support the singing, a trained musician can predict the instrumental part for the whole piece after hearing the first x seconds. The smaller x is, the lower the rank of the musical (non-voice) signal. This does not happen often in pure instrumental or classical music.
● Low-Rank (Note Spectrogram Level)
○ In a piece with a piano background, all the repeated piano notes sound the same. If you hear a C5, all the C5 repetitions will sound the same. No surprises. So we can say that all the piano backing music is a combination of at most 88 spectrum elements (the number of keys).
○ However, if the singer is a virtuoso, there is more freedom and variety in his/her notes. We rarely hear the same passage or the same timbre in the notes. We need far more than 88 elements to describe the voice, so it stops being a low-rank signal.
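The spectrogram-level intuition can be checked numerically. The following is a minimal sketch on synthetic data, not real audio: the matrix sizes, the five "notes", and the random stand-in for the voice are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_notes = 64, 200, 5  # hypothetical sizes

# One fixed spectral template per note, like one spectrum per piano key.
templates = rng.random((n_freq, n_notes))

# Each frame plays exactly one note at a random loudness.
activations = np.zeros((n_notes, n_frames))
played = rng.integers(0, n_notes, size=n_frames)
activations[played, np.arange(n_frames)] = rng.random(n_frames) + 0.1

accompaniment = templates @ activations   # rank is at most n_notes
voice = rng.random((n_freq, n_frames))    # stand-in: every frame differs

rank_accompaniment = np.linalg.matrix_rank(accompaniment)
rank_voice = np.linalg.matrix_rank(voice)
print(rank_accompaniment, rank_voice)
```

However many frames are added, the accompaniment matrix never needs more than `n_notes` templates, while the stand-in voice matrix has no such bound.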
Main categories of Low-Rank Methods
Dictionary Methods
NMF [1]
● Works well when the signal contains only music
● Low rank assumption for both accompaniment and voice
● It is more difficult to summarize the voice in a small number of spectra templates
Lp-Norm NMF [2]
● Voice is a sparse signal
● Voice = NMF reconstruction error
● Implicitly control the voice by controlling the sparsity of the error
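The idea of treating the voice as the NMF reconstruction error can be sketched as below. This is a simplified illustration, not the Lp-norm NMF of [2]: it uses the plain Euclidean multiplicative updates of Lee and Seung, and the function names `nmf` and `split_by_residual` are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x time) as W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost ||V - WH||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def split_by_residual(V, rank):
    """Low-rank part as accompaniment, positive residual as voice."""
    W, H = nmf(V, rank)
    accompaniment = W @ H
    voice = np.maximum(V - accompaniment, 0.0)  # assumption: keep only the positive error
    return accompaniment, voice
```

Here `V` would be a magnitude spectrogram. A small `rank` forces the templates to explain only the repetitive accompaniment, so whatever they cannot explain is left in the residual.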
[1]. P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[2]. T. Nakamura and H. Kameoka, “Lp-norm non-negative matrix factorization and its application to singing voice enhancement,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, Apr. 2015.
Low Rank + Sparse Decomposition
Figure from http://www.ihes.fr/~comdev/liens/Chaire_Schlumberger/candes.pdf
Robust Principal Component Analysis
● Method to separate low-rank background from sparse foreground
● Applications to video background extraction (moving objects are sparse noise)
Assumptions:
● Low-rank background is not sparse
● Sparse foreground is not low-rank
Norms and Convexity Background
∝ : proportional to
≈ : indicator of (or measure of)
● Sparsity
○ Sparsity(A) ≈ # of zero elements of A
○ L0 norm = # of non-zero elements of A (non-convex norm)
○ So, maximizing sparsity = minimizing the L0 norm
○ L1 norm = sum of the absolute values of all elements of A
○ We can use the L1 norm as a convex approximation of L0
● Rank
○ Rank(A) ≈ # of non-zero singular values of A (non-convex function)
○ L* norm = sum of singular values of A = sum of square roots of the eigenvalues of AA* ≈ Rank(AA*) = Rank(A)
○ So, we can use the L* norm as a convex approximation of the rank function
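A small numerical check of these four quantities, as a sketch on a hand-picked 3×3 matrix (the matrix itself is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 0.0,  0.0],
              [0.0, 0.0,  0.0],
              [0.0, 0.0, -4.0]])

l0 = np.count_nonzero(A)                      # number of non-zero entries: 2
l1 = np.abs(A).sum()                          # sum of absolute values: 7.0
rank = np.linalg.matrix_rank(A)               # number of non-zero singular values: 2
nuclear = np.linalg.svd(A, compute_uv=False).sum()  # sum of singular values: 7.0

# The same nuclear norm from the eigenvalues of A A^T (here 9, 0 and 16).
eigvals = np.linalg.eigvalsh(A @ A.T)
nuclear_from_eig = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
```

The L0 norm and the rank are counts and jump when an entry or singular value crosses zero. The L1 and nuclear norms change smoothly, which is what makes them usable as convex surrogates.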
RPCA as an optimization problem using Principal Component Pursuit (PCP)
Non Convex PCP Problem
Convex PCP relaxation
λ is not a tunable parameter. There is a proof that if
Then
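A sketch of the usual statement of these problems, following Candès et al. The symbols M (observed n1 × n2 matrix), L (low-rank part) and S (sparse part) are notation assumed here:

```latex
% Non-convex PCP
\min_{L,S}\ \operatorname{rank}(L) + \lambda \lVert S \rVert_0
\quad \text{subject to} \quad L + S = M

% Convex PCP relaxation
\min_{L,S}\ \lVert L \rVert_* + \lambda \lVert S \rVert_1
\quad \text{subject to} \quad L + S = M

% Parameter choice that comes with a recovery guarantee
\lambda = \frac{1}{\sqrt{\max(n_1, n_2)}}
```

With this λ, under incoherence conditions on L and for a sufficiently sparse S with randomly located support, the convex problem recovers L and S exactly with high probability.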
Why not just regular PCA ?
● Sensitive to outliers; breaks down with heavily corrupted data
● Like NMF, it also uses the low-rank assumption for the total voice-music signal
Figure: PCA vs. RPCA
Po-Sen Huang et al. 2012 [1]
“Singing-voice separation from monaural recordings using robust principal component analysis”
[1]. P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, Mar. 2012.
Overall Architecture and parameters
● Overall Architecture
● Tunable λ
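The RPCA step at the core of this architecture can be sketched with an inexact augmented Lagrange multiplier (ALM) iteration, one common solver for the convex PCP problem. This is an illustrative sketch: the function names, the step-size constants (1.25 and 1.5), and the stopping rule are assumptions, not values taken from the slides.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink every entry of X toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Split M into low-rank L and sparse S with L + S ~= M."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))   # default weight; pass lam to tune it
    norm_M = np.linalg.norm(M, "fro")
    mu = 1.25 / np.linalg.norm(M, 2)       # assumed initial penalty
    rho = 1.5                              # assumed penalty growth factor
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                   # Lagrange multipliers
    for _ in range(max_iter):
        # Low-rank update: soft-threshold the singular values.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        # Sparse update: soft-threshold the entries.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        mu *= rho
        if np.linalg.norm(residual, "fro") <= tol * norm_M:
            break
    return L, S
```

For singing-voice separation, `M` would be the magnitude spectrogram of the mixture, `L` the accompaniment estimate and `S` the voice estimate. A larger `lam` makes `S` sparser and pushes more energy into `L`.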
Time-Frequency Masking
● Time-Frequency Masking:
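A binary time-frequency mask can be built from the two RPCA outputs by comparing their magnitudes bin by bin. The sketch below assumes that `low_rank` and `sparse` are the RPCA outputs for the `mixture` spectrogram; the `gain` parameter and the function name are hypothetical.

```python
import numpy as np

def binary_mask_separation(mixture, low_rank, sparse, gain=1.0):
    """Assign each time-frequency bin of the mixture to voice or music."""
    # A bin belongs to the voice when the sparse part dominates there.
    mask = (np.abs(sparse) > gain * np.abs(low_rank)).astype(float)
    voice = mask * mixture
    music = (1.0 - mask) * mixture
    return voice, music, mask

low_rank = np.array([[1.0, 2.0], [3.0, 0.5]])
sparse = np.array([[2.0, 1.0], [0.0, 1.0]])
mixture = low_rank + sparse
voice, music, mask = binary_mask_separation(mixture, low_rank, sparse)
```

Because the two masks are complementary, `voice + music` always equals `mixture`. Raising `gain` assigns fewer bins to the voice.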
Evaluation and Results
● Evaluation Metrics
Po-Sen Huang et al. 2014 [1]
“Singing-voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks”
● From monaural recordings in a supervised setting
● DRNNs with different temporal connections
● Jointly optimizing the networks for multiple source signals by including the separation step
● Different discriminative training objectives
● Proposed framework:
[1]. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-voice separation from monaural recordings using deep recurrent neural networks,” in Proc. of the International Society for Music Information Retrieval (ISMIR), 2014.
● An RNN is a DNN whose layers introduce memory of the past
● Black: hidden states, White: input frames, Grey: output frames
● Weakness of RNNs: they lack hierarchical processing of the input at the current time step
○ Deep recurrent neural networks address this.
Deep Recurrent Neural Networks Architectures
Proposed Model Architecture
Joint Training via Time-Frequency Masking
● Magnitude spectra as features
● Separating one of the sources from a mixture, not learning one of the sources as the target
● A time-frequency masking technique
● Enforces the constraint that the sum of the prediction results is equal to the original mixture
● Can be viewed as a layer
● Jointly train the network with the time-frequency masking function
● An extra layer added after the original output of the network
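The masking layer can be sketched as a function of the two raw network outputs and the mixture spectrum. This is a NumPy illustration of the forward computation only; the names `y1_hat`, `y2_hat`, `z` and the small `eps` guard are assumptions.

```python
import numpy as np

def soft_mask_layer(y1_hat, y2_hat, z, eps=1e-12):
    """Turn two raw source predictions into masked estimates that sum to z."""
    a, b = np.abs(y1_hat), np.abs(y2_hat)
    mask1 = a / (a + b + eps)   # share of each bin given to source 1
    mask2 = b / (a + b + eps)   # share of each bin given to source 2
    return mask1 * z, mask2 * z

y1_hat = np.array([3.0, 1.0, 0.5])   # raw prediction for source 1
y2_hat = np.array([1.0, 1.0, 1.5])   # raw prediction for source 2
z = np.array([2.0, 4.0, 8.0])        # mixture magnitude spectrum
y1_tilde, y2_tilde = soft_mask_layer(y1_hat, y2_hat, z)
```

Since every operation is differentiable, the layer can sit at the end of the network and be trained jointly with it, and the two outputs add up to the mixture by construction.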
Discriminative Training Objectives
● The Mean Square Error
● Generalized KL divergence
● Discriminative objective functions to obtain a high SIR
○ Increase the similarity between the prediction and its target
○ Decrease the similarity between the prediction and the targets of other sources
● 𝜸 is a constant chosen based on performance
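For the squared-error case, the discriminative objective can be sketched as the usual error to the correct target minus 𝜸 times the error to the other source's target. The function name, the default `gamma`, and the absence of any scaling factor are assumptions of this sketch.

```python
import numpy as np

def discriminative_mse(pred1, pred2, target1, target2, gamma=0.05):
    """Squared error to own target, minus gamma times error to the other target."""
    own = np.sum((pred1 - target1) ** 2) + np.sum((pred2 - target2) ** 2)
    cross = np.sum((pred1 - target2) ** 2) + np.sum((pred2 - target1) ** 2)
    return own - gamma * cross

target1 = np.array([1.0, 0.0])
target2 = np.array([0.0, 1.0])

# Perfect predictions: only the (negative) cross term remains.
loss_perfect = discriminative_mse(target1, target2, target1, target2)
# Swapped predictions: large own-target error, no cross-term reward.
loss_swapped = discriminative_mse(target2, target1, target1, target2)
```

With `gamma=0` this reduces to the plain squared error. A positive `gamma` additionally rewards predictions that stay far from the interfering source, which is what raises the SIR.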
Results and Conclusion
● Results with unsupervised and supervised settings
● 2.30~2.48 dB GNSDR gain and 4.32~5.42 dB GSIR gain compared to RNMF
● To further enhance the results
○ Jointly optimizing a soft mask function with the networks
○ The discriminative training criteria
● Demo: https://sites.google.com/site/deeplearningsourceseparation/
Yi Luo et al. 2017
“Deep Clustering and Conventional Networks for Music Separation: Stronger Together”
Conventional regression-based networks:
● Supervised mask-inference based method
● Increase separation between sources
Deep clustering:
● Unsupervised method to solve the general audio separation problem with multiple sources of the same type and an arbitrary number of sources
● Reduce within-source variance
Demo Page: http://danetapi.com/chimera
Deep Clustering Intuition:
Deep (learning techniques to derive embedding features for performing efficient) Clustering
Traditional spectral clustering (in contrast to central clustering):
● Spectral decomposition of the original feature signal
● Map the feature matrix into a different dimensional space based on the spectrum
● In the mapped dimensional space, perform simple central clustering
Deep clustering:
● Use a neural network to learn embedding features automatically. Then run a central clustering algorithm on the embeddings.
John Hershey et al.: Deep Clustering: Discriminative embeddings for segmentation and separation
Details
Partition based training:
● Reference label indicator: Y = {y(n,c)} (maps element n to class c)
● Then, A = YY’ is an ideal affinity matrix that represents the partition.
Training objective:
● Embeddings enable accurate clustering based on labels
Objective function:
Details
Cost Function:
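A sketch of the deep clustering cost as it is usually stated: the squared Frobenius distance between the estimated affinity VV’ and the ideal affinity YY’. Here V is the N × D embedding matrix and Y the N × C indicator matrix; the expanded form avoids building any N × N matrix. Function names are hypothetical.

```python
import numpy as np

def dc_cost_naive(V, Y):
    """||V V' - Y Y'||_F^2, forming both N x N affinity matrices."""
    return np.linalg.norm(V @ V.T - Y @ Y.T, "fro") ** 2

def dc_cost_low_rank(V, Y):
    """Same value via ||V'V||^2 - 2 ||V'Y||^2 + ||Y'Y||^2 (small matrices only)."""
    return (np.linalg.norm(V.T @ V, "fro") ** 2
            - 2.0 * np.linalg.norm(V.T @ Y, "fro") ** 2
            + np.linalg.norm(Y.T @ Y, "fro") ** 2)

rng = np.random.default_rng(0)
N, D, C = 50, 4, 2                                  # hypothetical sizes
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)       # unit-norm embeddings
Y = np.eye(C)[rng.integers(0, C, size=N)]           # one-hot class labels

cost_naive = dc_cost_naive(V, Y)
cost_fast = dc_cost_low_rank(V, Y)
```

The cost depends on Y only through YY’, which is unchanged when the class columns are permuted, so the order in which sources are labelled does not matter.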
Test:
● After computing V on test signals, cluster rows of V using k-means.
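The test-time step can be sketched with a small k-means written in NumPy. The farthest-point initialisation, the row ordering of V (frequency-major), and the function names are assumptions of this sketch; a library k-means would serve equally well.

```python
import numpy as np

def kmeans(V, k, n_iter=50):
    """Cluster the rows of V into k groups with Lloyd's algorithm."""
    # Farthest-point initialisation (deterministic; an assumption of this sketch).
    centers = [V[0]]
    for _ in range(1, k):
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels, centers

def masks_from_embeddings(V, n_freq, n_frames, k=2):
    """One binary mask per cluster, reshaped to the spectrogram grid."""
    labels, _ = kmeans(V, k)
    return np.stack([(labels == j).reshape(n_freq, n_frames)
                     for j in range(k)]).astype(float)

# Four time-frequency bins with 2-D embeddings forming two clear groups.
V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
masks = masks_from_embeddings(V, n_freq=2, n_frames=2)
```

Each mask is then multiplied with the mixture spectrogram to obtain one source estimate.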
Deep Clustering
Advantages:
● Partition-based (no label) instead of class-based (label required).
● Helps to solve the permutation problem (by using permutation-independent embeddings).
Disadvantages:
● Embedding dimension has to be tuned
● Requires post-processing
Singing Voice Separation Task
Conventional NN:
● Output soft mask
Deep Clustering:
● Output embeddings
● Post-process embeddings to get a soft mask
Singing Voice Separation
Chimera Networks:
● Deep clustering head
● Conventional NN head
● Globally:
Results
● Not only placed 1st in MIREX 2016 but also outperformed the best systems from past years.
(SDRi: improvement in SDR with respect to that of the mixture)
(iKala dataset: not trained on)
● Demo: http://www.merl.com/demos/deep-clustering