music signal separation by supervised nonnegative matrix...

1
Music Signal Separation by Supervised Nonnegative Daichi Kitamura, Hiroshi Saruwatari, Kiyohiro Shikano (Nara Institute of Science and Technology, Nara, Japan) Overview of research Automatic music transcription Sound augmented reality (AR) 3D audio system, etc. Applications To cope with the problem of supervised NMF, we propose advanced supervised NMF algorithm that employs a deformable capability for the trained spectral bases and penalty terms for making the bases fit into the real instrumental sound. Previous research Nonnegative matrix factorization (NMF) [1] Sparse representation and decomposition algorithm NMF attempts to separate instrumental sources using spectral characteristics Many techniques have been proposed using NMF as an unsupervised method [2] ~ [6], It is difficult to cluster the decomposed spectral patterns into a specific target sound Supervised NMF (SNMF) [7], [8] Use some sample sound of the target instrumental signal in a priori training A mismatch between the spectra trained in advance and the target sound reduces the accuracy of source separation Purpose of our research 1. Introduction We address a music signal separation problem, and propose a new supervised algorithm for real instrumental signal separation. Nonnegative matrix factorization (NMF) is a sparse representation algorithm. Supervised NMF can extract the target instrumental sources. The instrumental sounds differ according to, e.g., individual styles of playing and the timbre individuality for each instrument. How do we get the supervision sound of the target instrument recorded in CD? Problem of conventional supervised method We propose a new advanced supervised method that adapts the supervision sound (that we can get easily) to the target instrumental sound as “Supervised NMF with basis deformation.” From experimental results, our proposed method separated the real instrumental sources using supervision sound generated by MIDI synthesizer. Extract! To solve this inherent problem… Recently, music signal separation technologies have received much attention. Mixture of instruments Target signal Supervision sound Slightly different Proposed SNMF adapts supervision sound to the target signal Sample sound of piano Proposed SNMF Extracted piano signal 2. NMF NMF for source separation : Observed spectrogram : Basis matrix that includes frequently-appearing spectral patterns in as column vectors : Activation matrix that involves activation information of each basis of : Number of frequency bins : Number of frames : Number of bases 3. Conventional method Penalized Supervised NMF: PSNMF [8] Update Supervised bases Observed spectrogram that consists multiple instrumental sources Spectrogram of the solo-played target signal for training Training process Separation process Fix trained bases and update : Supervised basis matrix that involves spectral patterns of the target instrumental sound : Basis matrix that involves residual spectral patterns that cannot be expressed by : Activation matrix that corresponds to : Activation matrix that corresponds to Ideally, represents the target instrumental components, and represents other different components from the target sounds after the decomposition. In addition, to prevent the simultaneous formulation of similar spectral patterns in the matrices and , a specific penalty is imposed between and . Force to become mostly different Cost function subject to for all , where is the weighting parameter for penalty term . 60 40 20 0 -20 Amplitude [dB] 3.0 2.5 2.0 1.5 1.0 0.5 0.0 Frequency [kHz] Real sound Artificial sound by MIDI These spectral differences reduce separation performance. 4. Proposed method A mismatch between the bases trained in advance and the target actual sound reduces the accuracy of separation. It is impossible to provide perfect supervision or to predict more realistic supervision in practice. Problem of supervised method In this research, we assume that we can obtain specific solo-played instrumental sounds, which is the target of the separation task. We use the target supervision sound generated by MIDI synthesizer because we can easily generate the training sound via MIDI. Target signal (piano) Supervision sound MIDI data of the same type of target instrument Slightly different Generate In this method, we need not prepare a lot of real sound samples of the target instrument to learn the variance of the instrumental sounds. All entries of these matrix are nonnegative value. Constrains Matrix Factorization with Basis Deformation Kazunobu Kondo, Yu Takahashi (Yamaha Corporate Research & Development Center, Shizuoka, Japan) In proposed method, supervised bases composed in training process can be deformed to adapt to the target spectra. Strategy Supervision sound Target sound Adapt to the target Supervised bases Separation process : Basis matrix for the deformation and shares the activation matrix with Proposed SNMF with basis deformation Decomposition model We employed a deformable term to adapt the supervision bases into the target sound that cannot be represented by . Update Deformable term To deform the supervised bases , deformation term should have both positive and negative value, unlike normal NMF decomposition. However, the target spectral bases after deformation must be nonnegative. The deformation matrix should be constructed under the following constrains: : entry of matrix ,which probably has positive and negative values : Hyper parameter that controls the allowable range of the negative deformation of Cost function subject to for all , where and are the weighting parameters for penalty terms. Force to become mostly different with each other Supervised bases Other bases Deformation bases represents undesired instrumental components. represents spectral difference between supervision sound and the target sound. Deform For the case of It is possible to decrease until 30% of the total. If , the deformed bases has a risk to become . What is the hyper parameter ? 5. Evaluation experiment Update rules The update rules that minimize the cost function are defined as follows: Target instruments Flute, Clarinet, Piano, Trombone Observed signal Mixing two sources selected from four sources with the input SNR of 0dB Supervision sound (MIDI) Artificial MIDI sounds of the target instruments that consists two octave notes, which cover all notes of the target signal Number of bases Supervision bases: 100, Other bases: 50 Number of iterations of NMF Training process: 500, Separation process: 400 Parameters Experimentally determined Evaluation scores [9] Signal to distortion ratio (SDR: quality of extracted signal), Source to interference ratio (SIR: degree of separation), Sources to artifact ratio (SAR: absence of distortion) lari t, Pia Tr bo Experimental condition To confirm the effectiveness of the proposed algorithm, we compared the conventional method (SNMF) and our SNMF with basis deformation. where, Target signals Average scores Experimental results 4 3 2 1 0 Frequency [kHz] 4 3 2 1 0 Time [s] 4 3 2 1 0 Frequency [kHz] 4 3 2 1 0 Time [s] 4 3 2 1 0 Frequency [kHz] 4 3 2 1 0 Time [s] 4 3 2 1 0 Frequency [kHz] 4 3 2 1 0 Time [s] Observed signal consisting of piano and trombone Oracle signal of target piano signal Piano signal extracted by conventional method Piano signal extracted by proposed method The target signals were recorded with actual musical instruments. The supervision sounds were generated by MIDI synthesizer. 4 Example of spectrograms References D. D. Lee, H. S. Seung, “Algorithms for non-negative matrix factorization,” Neural Info. Process. Syst., vol.13, pp.556562, 2001. T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. Audio, Speech and Language Processing, vol.15, pp.10661074, 2007. S. A. Raczynski, N. Ono, S. Sagayama, “Multipitch analysis with harmonic nonnegative matrix approximation,” Proc. ISMIR, pp.381386, 2007. A. Ozerov, C. Fevotte, M. Charbit, “Factorial scaled hidden Markov model for polyphonic audio representation and source separation,” Proc. WASPAA, 2009. W. Wang, A. Cichocki, J. A. Chambers, “A multiplicative algorithm for convolutive non-negative matrix factorization based on squared Euclidean distance,” IEEE Trans. on Signal Processing, vol.57, no.7, pp.28582864, 2009. H. Kameoka, M. Nakano, K. Ochiai, Y. Imoto, K. Kashino, S. Sagayama, “Constrained and regularized variants of non-negative matrix factorization incorporating music-specific constraints,” Proc. ICASSP, pp.53655368, 2012. P. Smaragdis, B. Raj, M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” Proc. LVA/ICA 2010, LNCS 6365, pp.140148, 2010. K. Yagi, Y. Takahashi, H. Saruwatari, K. Shikano, K. Kondo, “Music signal separation by orthogonality and maximum-distance constrained nonnegative matrix factorization with target signal information,” Proc. Audio Engineering Society 45th International Conference, 2012. E. Vincent, R. Gribonval, C. Fevotte. “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech and Language Processing, vol.14, no.4, pp.14621469, 2006. [1] [2] [3] [4] [5] [6] [7] [8] [9] 0 2 4 6 8 10 SNMF Proposed method SDR [dB] 0 5 10 15 20 SNMF Proposed method SIR [dB] 0 2 4 6 8 10 SNMF Proposed method SAR [dB] Target sound Other sound Conventional method Proposed method SDR SIR SAR SDR SIR SAR Piano Clarinet 1.0 6.1 3.5 7.4 13.0 9.1 Piano Trombone 1.6 13.7 2.1 12.3 23.5 12.7 Clarinet Flute 0.1 1.8 7.2 0.6 2.4 7.3 Clarinet Trombone -0.5 12.8 0.0 9.4 21.5 9.7 Flute Piano 4.2 14.2 4.8 6.4 16.6 6.9 Trombone Clarinet 0.7 12.5 1.2 5.4 17.0 5.8

Upload: others

Post on 20-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Music Signal Separation by Supervised Nonnegative Matrix ...d-kitamura.net/pdf/poster/DSP2013_poster_kitamura.pdfMusic Signal Separation by Supervised Nonnegative Daichi Kitamura,

Music Signal Separation by Supervised Nonnegative

Daichi Kitamura, Hiroshi Saruwatari, Kiyohiro Shikano

(Nara Institute of Science and Technology, Nara, Japan)

Overview of research

• Automatic music transcription

• Sound augmented reality (AR)

• 3D audio system, etc.

Applications

To cope with the problem of supervised NMF, we propose advanced

supervised NMF algorithm that employs a deformable capability for

the trained spectral bases and penalty terms for making the bases fit

into the real instrumental sound.

n Previous research

• Nonnegative matrix factorization (NMF) [1] Sparse representation and decomposition algorithm

NMF attempts to separate instrumental sources using spectral characteristics

Many techniques have been proposed using NMF as an unsupervised method [2] ~ [6],

It is difficult to cluster the decomposed spectral patterns into a specific target sound

• Supervised NMF (SNMF) [7], [8] Use some sample sound of the target instrumental signal in a priori training

A mismatch between the spectra trained in advance and the target sound reduces the

accuracy of source separation

Purpose of our research

1. Introduction

n We address a music signal separation problem, and propose a new

supervised algorithm for real instrumental signal separation.

n Nonnegative matrix factorization (NMF) is a sparse representation

algorithm.

n Supervised NMF can extract the target instrumental sources.

• The instrumental sounds differ according to, e.g., individual styles

of playing and the timbre individuality for each instrument.

• How do we get the supervision sound of the target instrument

recorded in CD?

Problem of conventional supervised method

n We propose a new advanced supervised method that adapts the

supervision sound (that we can get easily) to the target instrumental

sound as “Supervised NMF with basis deformation.”

n From experimental results, our proposed method separated the real

instrumental sources using supervision sound generated by MIDI

synthesizer.

Extract!

To solve this inherent problem…

• Recently, music signal separation technologies have received much

attention.

Mixture of instruments

Target signal Supervision sound

Slightly different

Proposed SNMF adapts supervision

sound to the target signal

Sample sound of piano

Proposed SNMF

Extracted piano signal

2. NMF NMF for source separation

: Observed spectrogram

: Basis matrix that includes frequently-appearing

spectral patterns in as column vectors

: Activation matrix that involves activation

information of each basis of

: Number of frequency bins

: Number of frames

: Number of bases

3. Conventional method Penalized Supervised NMF: PSNMF [8]

Update Supervised bases

Observed spectrogram that consists

multiple instrumental sources

Spectrogram of the solo-played

target signal for training

Training process

Separation process

Fix trained bases

and update

: Supervised basis matrix that involves spectral

patterns of the target instrumental sound

: Basis matrix that involves residual spectral

patterns that cannot be expressed by

: Activation matrix that corresponds to

: Activation matrix that corresponds to

• Ideally, represents the target instrumental components, and

represents other different components from the target sounds after the

decomposition.

In addition, to prevent the simultaneous formulation of

similar spectral patterns in the matrices and , a

specific penalty is imposed between and . Force to become

mostly different

Cost function

subject to for all ,

where is the weighting parameter for penalty term .

60

40

20

0

-20

Am

plit

ude [

dB

]

3.02.52.01.51.00.50.0

Frequency [kHz]

Real sound Artificial sound by MIDI

These spectral

differences reduce

separation performance.

4. Proposed method

A mismatch between the bases trained in advance and the target

actual sound reduces the accuracy of separation.

It is impossible to provide perfect supervision or to predict more

realistic supervision in practice.

Problem of supervised method

• In this research, we assume that we can obtain specific solo-played

instrumental sounds, which is the target of the separation task.

• We use the target supervision sound generated by MIDI synthesizer

because we can easily generate the training sound via MIDI.

Target signal

(piano)

Supervision

sound

MIDI data of the same type of target instrument

Slightly different

Generate

• In this method, we need not prepare a lot of real sound samples of the

target instrument to learn the variance of the instrumental sounds.

All entries of these matrix are nonnegative value.

Constrains

Matrix Factorization with Basis Deformation

Kazunobu Kondo, Yu Takahashi

(Yamaha Corporate Research & Development Center, Shizuoka, Japan)

In proposed method, supervised bases composed in training process

can be deformed to adapt to the target spectra.

Strategy

Supervision sound Target sound

Adapt to the target

Supervised bases

Separation process

: Basis matrix for the deformation and

shares the activation matrix with

Proposed SNMF with basis deformation

n Decomposition model

• We employed a deformable term to adapt the supervision bases into the

target sound that cannot be represented by .

Update Deformable term

• To deform the supervised bases , deformation term should have both

positive and negative value, unlike normal NMF decomposition.

• However, the target spectral bases after deformation must be

nonnegative.

• The deformation matrix should be constructed under the following

constrains:

: entry of matrix ,which probably has positive and negative values

: Hyper parameter that controls the allowable range of the negative deformation of

Cost function

subject to for all ,

where and are the weighting parameters for penalty terms.

Force to become mostly

different with each other

Supervised bases

Other bases

Deformation bases

represents undesired

instrumental components.

represents spectral

difference between supervision

sound and the target sound.

Deform

For the case of It is possible to decrease

until 30% of the total.

If , the deformed bases

has a risk to become .

What is the hyper parameter ?

5. Evaluation experiment

Update rules

• The update rules that minimize the cost function are defined as follows:

Target instruments Flute, Clarinet, Piano, Trombone

Observed signal Mixing two sources selected from four sources with the input SNR of 0dB

Supervision sound

(MIDI)

Artificial MIDI sounds of the target instruments that consists two octave

notes, which cover all notes of the target signal

Number of bases Supervision bases: 100, Other bases: 50

Number of iterations

of NMF Training process: 500, Separation process: 400

Parameters Experimentally determined

Evaluation scores [9]

Signal to distortion ratio (SDR: quality of extracted signal),

Source to interference ratio (SIR: degree of separation),

Sources to artifact ratio (SAR: absence of distortion)

Clari t, Pia Tr bo

Experimental condition

• To confirm the effectiveness of the proposed algorithm, we compared the

conventional method (SNMF) and our SNMF with basis deformation.

where,

Target signals

Average scores

Experimental results

4

3

2

1

0

Fre

qu

en

cy [

kH

z]

43210

Time [s]

4

3

2

1

0

Fre

qu

en

cy [

kH

z]

43210

Time [s]

4

3

2

1

0

Fre

qu

en

cy [

kH

z]

43210

Time [s]

4

3

2

1

0

Fre

qu

en

cy [

kH

z]

43210

Time [s]

Observed signal consisting of piano and trombone Oracle signal of target piano signal

Piano signal extracted by conventional method Piano signal extracted by proposed method

The target signals were

recorded with actual

musical instruments.

The supervision sounds

were generated by MIDI

synthesizer.

4

Example of spectrograms

References D. D. Lee, H. S. Seung, “Algorithms for non-negative matrix factorization,” Neural Info. Process. Syst., vol.13, pp.556–562, 2001.

T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,”

IEEE Trans. Audio, Speech and Language Processing, vol.15, pp.1066–1074, 2007.

S. A. Raczynski, N. Ono, S. Sagayama, “Multipitch analysis with harmonic nonnegative matrix approximation,” Proc. ISMIR, pp.381–386,

2007.

A. Ozerov, C. Fevotte, M. Charbit, “Factorial scaled hidden Markov model for polyphonic audio representation and source separation,”

Proc. WASPAA, 2009.

W. Wang, A. Cichocki, J. A. Chambers, “A multiplicative algorithm for convolutive non-negative matrix factorization based on squared

Euclidean distance,” IEEE Trans. on Signal Processing, vol.57, no.7, pp.2858–2864, 2009.

H. Kameoka, M. Nakano, K. Ochiai, Y. Imoto, K. Kashino, S. Sagayama, “Constrained and regularized variants of non-negative matrix

factorization incorporating music-specific constraints,” Proc. ICASSP, pp.5365–5368, 2012.

P. Smaragdis, B. Raj, M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” Proc.

LVA/ICA 2010, LNCS 6365, pp.140–148, 2010.

K. Yagi, Y. Takahashi, H. Saruwatari, K. Shikano, K. Kondo, “Music signal separation by orthogonality and maximum-distance constrained

nonnegative matrix factorization with target signal information,” Proc. Audio Engineering Society 45th International Conference, 2012.

E. Vincent, R. Gribonval, C. Fevotte. “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech and

Language Processing, vol.14, no.4, pp.1462–1469, 2006.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

0

2

4

6

8

10

SNMF Proposedmethod

SD

R [

dB

]

0

5

10

15

20

SNMF Proposedmethod

SIR

[d

B]

0

2

4

6

8

10

SNMF Proposedmethod

SA

R [d

B]

Target

sound

Other

sound

Conventional method Proposed method

SDR SIR SAR SDR SIR SAR

Piano Clarinet 1.0 6.1 3.5 7.4 13.0 9.1

Piano Trombone 1.6 13.7 2.1 12.3 23.5 12.7

Clarinet Flute 0.1 1.8 7.2 0.6 2.4 7.3

Clarinet Trombone -0.5 12.8 0.0 9.4 21.5 9.7

Flute Piano 4.2 14.2 4.8 6.4 16.6 6.9

Trombone Clarinet 0.7 12.5 1.2 5.4 17.0 5.8