-
E4896 Music Signal Processing (Dan Ellis) 2013-04-29
Lecture 14: Source Separation
Dan Ellis, Dept. of Electrical Engineering, Columbia University
[email protected] · http://www.ee.columbia.edu/~dpwe/e4896/

1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation

ELEN E4896 MUSIC SIGNAL PROCESSING
-
1. Sources, Mixtures, & Perception
• Sound is a linear process (superposition)
  no “opacity” (unlike vision)
  sources combine into “auditory scenes” (polyphony)
• Humans perceive discrete sources: a subjective construct
[Figure: spectrogram of “02_m+s-15-evil-goodvoice-fade” (0–12 s, 0–4000 Hz, level in dB) with labeled sources: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]
-
Spatial Hearing
• People perceive sources based on cues
  spatial (binaural): ITD (interaural time difference), ILD (interaural level difference)
[Figure: left/right waveforms of “shatr78m3” showing the interaural time offset; diagram of the path-length difference and head shadow (high frequencies) for an off-center source] (Blauert ’96)
-
Auditory Scene Analysis
• Spatial cues may not be enough, or may be unavailable (single-channel signal)
• Brain uses signal-intrinsic cues to form sources: onset, harmonicity
[Figure: spectrogram of the Reynolds-McAdams oboe (0–4 s, 0–4 kHz, level in dB)] (Bregman ’90)
-
Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90)
• Quite a challenge!
-
Audio Mixing
• Studio recording combines separate tracks into, e.g., 2 channels (stereo)
  different levels
  panning
  other effects
• Stereo Intensity Panning
  manipulating ILD only
  constant power
  more channels: use just the nearest pair?
[Figure: constant-power pan law: left (L) and right (R) gains vs. pan position from −1 to +1]
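The constant-power pan law can be sketched in code; the sin/cos curve below is one common choice (the slide does not specify the exact curve), keeping gL² + gR² = 1 so total power stays constant across pan positions:

```python
import numpy as np

def constant_power_pan(x, pan):
    """Pan a mono signal x into stereo with constant total power.

    pan: -1 (full left) .. +1 (full right).
    Gains follow a sin/cos law so g_left**2 + g_right**2 == 1 everywhere.
    """
    theta = (pan + 1) * np.pi / 4          # map [-1, 1] -> [0, pi/2]
    g_left, g_right = np.cos(theta), np.sin(theta)
    return np.stack([g_left * x, g_right * x])

# Center position: both channels at 1/sqrt(2), power preserved.
stereo = constant_power_pan(np.ones(4), 0.0)
print(np.allclose((stereo ** 2).sum(axis=0), 1.0))  # True
```

At pan = 0 each gain is 1/√2 (about −3 dB), which is why this is often called a “−3 dB pan law”.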
-
2. Spatial Filtering
• N sources detected by M sensors
  need enough degrees of freedom, i.e. M ≥ N (else need other constraints)
• Consider the 2 × 2 case: directional mics
  → mixing matrix:
[Diagram: sources s1, s2 reaching mics m1, m2 via gains a11, a12, a21, a22]

$$\begin{pmatrix} m_1 \\ m_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} s_1 \\ s_2 \end{pmatrix} \qquad \begin{pmatrix} \hat{s}_1 \\ \hat{s}_2 \end{pmatrix} = A^{-1}\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}$$
-
Source Cancelation
• Simple 2 × 2 example:

$$\begin{pmatrix} m_1 \\ m_2 \end{pmatrix} = \begin{pmatrix} 1 & 0.5 \\ 0.8 & 1 \end{pmatrix}\begin{pmatrix} s_1 \\ s_2 \end{pmatrix}$$

$$m_1(t) = s_1(t) + 0.5\,s_2(t) \qquad m_2(t) = 0.8\,s_1(t) + s_2(t)$$

$$\Rightarrow\; m_1(t) - 0.5\,m_2(t) = 0.6\,s_1(t)$$

• If there are no delays and the sums are linearly independent, each linear combination can cancel one source.
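The cancellation above is easy to verify numerically (a minimal sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(1000), rng.standard_normal(1000)

# The slide's 2x2 instantaneous mix.
m1 = s1 + 0.5 * s2
m2 = 0.8 * s1 + s2

# Subtracting 0.5*m2 removes s2 entirely, leaving 0.6*s1.
recovered = m1 - 0.5 * m2
print(np.allclose(recovered, 0.6 * s1))  # True
```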
-
Independent Component Analysis
• Can separate “blind” combinations by maximizing the independence of the outputs
  kurtosis as a measure of independence?
[Diagram: sources s1, s2 mixed via a11, a12, a21, a22 into m1, m2; unmixing weights adapted by gradient descent on mutual information]

[Figure: scatter of mix 1 vs. mix 2, and kurtosis vs. projection direction, peaking along the source directions s1 and s2]

$$\mathrm{kurt}(y) = E\!\left[\left(\frac{y-\mu}{\sigma}\right)^{\!4}\right] - 3$$

(Bell & Sejnowski ’95)
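A minimal numerical sketch of the kurtosis idea (illustrative only; Bell & Sejnowski’s actual algorithm is information-maximization): mixtures of independent super-Gaussian sources are more nearly Gaussian than the sources themselves, so scanning projection angles of the whitened mixtures for maximum |kurtosis| recovers a source direction.

```python
import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

# Super-Gaussian (Laplacian) sources have positive excess kurtosis.
s = rng.laplace(size=(2, 20000))
A = np.array([[1.0, 0.5], [0.8, 1.0]])
m = A @ s                                  # observed mixtures

# Whiten the mixtures, then search rotation angles for max |kurtosis|.
d, E = np.linalg.eigh(np.cov(m))
z = (E / np.sqrt(d)) @ E.T @ m             # whitened mixtures
angles = np.linspace(0, np.pi, 180)
k = [abs(excess_kurtosis(np.cos(a) * z[0] + np.sin(a) * z[1]))
     for a in angles]
best = angles[int(np.argmax(k))]
y = np.cos(best) * z[0] + np.sin(best) * z[1]

# y should line up with one of the original sources.
corr = max(abs(np.corrcoef(y, s[0])[0, 1]),
           abs(np.corrcoef(y, s[1])[0, 1]))
```

This angle-scan is a brute-force stand-in for the gradient step pictured on the slide; FastICA-style methods optimize the same kind of contrast function directly.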
-
Microphone Arrays
• If interference is diffuse, can simply boost energy from the target direction
  e.g. shotgun mic; delay-and-sum
  off-axis spectral coloration
  many variants: filter-and-sum, sidelobe cancelation, ...
[Figure: delay-and-sum line array with element spacing D and per-element delays; beam patterns (dB) for λ = 4D, λ = 2D, λ = D] (Benesty, Chen & Huang ’08)
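Delay-and-sum beamforming can be sketched as follows (integer-sample delays for simplicity; real arrays need fractional delays, and the mic/delay values here are made up for illustration):

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Steer an array by delaying each channel into alignment, then averaging.

    mic_signals: (n_mics, n_samples); delays_samples: integer delays that
    time-align the target direction across channels.
    """
    n_mics, n = mic_signals.shape
    out = np.zeros(n)
    for x, d in zip(mic_signals, delays_samples):
        out += np.roll(x, -d)              # advance each channel into alignment
    return out / n_mics

# Target arrives with 0/2/4-sample delays across a 3-mic line array.
rng = np.random.default_rng(2)
target = rng.standard_normal(1000)
mics = np.stack([np.roll(target, d) + 0.5 * rng.standard_normal(1000)
                 for d in (0, 2, 4)])
y = delay_and_sum(mics, (0, 2, 4))
# The aligned target adds coherently; independent noise adds incoherently,
# so output noise power drops by roughly 1/n_mics.
```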
-
3. Time-Frequency Masking
• What if there is only one channel?
  cannot use a fixed cancellation
  but could use fast time-varying filtering:
• The trick is finding the right mask...
[Figure: spectrograms (0–8 kHz) of a mixture before and after time-frequency masking] (Brown & Cooke ’94; Roweis ’01)
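The “fast time-varying filter” amounts to a per-cell gain on the mixture’s STFT. A self-contained sketch (minimal hand-rolled STFT/ISTFT, and a simple low-frequency binary mask standing in for a real source-derived mask):

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Minimal Hann-windowed STFT with 50% overlap (a sketch, not a library)."""
    w = np.hanning(n_fft)
    frames = [np.fft.rfft(w * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames).T              # (freq_bins, n_frames)

def istft(X, n_fft=512, hop=256):
    """Weighted overlap-add inverse, normalized by the summed window energy."""
    w = np.hanning(n_fft)
    n = hop * (X.shape[1] - 1) + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for t in range(X.shape[1]):
        i = t * hop
        x[i:i + n_fft] += w * np.fft.irfft(X[:, t], n_fft)
        norm[i:i + n_fft] += w ** 2
    return x / np.maximum(norm, 1e-8)

# Time-varying filtering = per-cell gain on the STFT of the mixture.
fs = 8000
t = np.arange(fs) / fs
mix = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)
X = stft(mix)
freqs = np.fft.rfftfreq(512, 1 / fs)
mask = (freqs < 1000)[:, None]             # binary mask: keep cells below 1 kHz
y = istft(X * mask)
# y is (approximately) just the 440 Hz component of the mixture.
```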
-
Time-Frequency Masking
• Works well for overlapping voices
  time-frequency resolution?
[Figure: spectrograms (0–8 kHz, level in dB) of the original male and female voices, the mix with oracle labels, and the oracle-based resyntheses; sound examples: cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav]
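The “oracle labels” above come from comparing the isolated sources, which makes this an upper bound rather than a practical algorithm. A toy sketch of the ideal binary mask:

```python
import numpy as np

def ideal_binary_mask(S_target, S_interf):
    """Oracle mask: keep each T-F cell where the target dominates.

    S_target, S_interf: complex STFTs of the *isolated* sources --
    available only in oracle experiments.
    """
    return (np.abs(S_target) > np.abs(S_interf)).astype(float)

# Toy STFT magnitudes for two "sources" on a 3x4 T-F grid.
S1 = np.array([[3, 0, 2, 0],
               [0, 1, 0, 4],
               [2, 0, 3, 0]], dtype=complex)
S2 = np.array([[1, 2, 0, 1],
               [2, 0, 1, 1],
               [0, 3, 1, 2]], dtype=complex)
mask = ideal_binary_mask(S1, S2)
estimate = mask * (S1 + S2)                # apply the mask to the mixture STFT
print(mask)                                # 1 where S1 dominates, else 0
```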
-
Pan-Based Filtering
• Can use time-frequency masking even for stereo
  e.g. calculate a “panning index” as the ILD of each cell
  mask the cells matching that ILD
(Avendano 2003)
[Figure: spectrogram (0–6 kHz, level in dB), per-cell ILD in dB, and the masked result keeping cells with ILD between −2.5 and +1.0 dB (1024-point window)]
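A sketch of pan-based masking in this spirit, computing the panning index simply as per-cell ILD in dB (one of several possible indices; Avendano’s actual index is defined differently):

```python
import numpy as np

def ild_mask(L, R, lo_db, hi_db, eps=1e-12):
    """Select T-F cells whose interchannel level difference lies in a range.

    L, R: complex STFTs of the left and right channels.
    """
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return ((ild >= lo_db) & (ild <= hi_db)).astype(float)

# A center-panned cell (ILD ~ 0 dB) passes; a hard-left cell is rejected.
L = np.array([[1.0, 4.0]], dtype=complex)
R = np.array([[1.0, 0.1]], dtype=complex)
m = ild_mask(L, R, -2.5, 1.0)
print(m)  # [[1. 0.]]
```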
-
Harmonic-Based Masking
• Time-frequency masking can be used to pick out harmonics
  given a pitch track, we know where to expect the harmonics
(Denbigh & Zhao 1992)
[Figure: spectrograms (0–4 kHz, 0–8 s) before and after harmonic-based masking]
-
Harmonic Filtering
• Given a pitch track, could use a time-varying comb filter to get the harmonics
• Or: isolate each harmonic by heterodyning:

$$\hat{x}(t) = \sum_{k} \hat{a}_k(t)\,\cos\!\big(k\,\hat{\omega}_0(t)\,t\big)$$

$$\hat{a}_k(t) = \mathrm{LPF}\!\left\{\left|x(t)\,e^{-jk\hat{\omega}_0(t)t}\right|\right\}$$

(Avery Wang 1995)
[Figure: spectrograms (0–4 kHz, 0–8 s) of the input and of the heterodyne-extracted harmonics]
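The heterodyne recipe can be sketched with a constant pitch and a moving-average low-pass filter (simple stand-ins for the slide’s pitch track and LPF); here the magnitude is taken after the smoothing:

```python
import numpy as np

def heterodyne_harmonic(x, k, f0, fs, lpf_len=401):
    """Estimate the k-th harmonic's amplitude envelope by heterodyning.

    Multiply by exp(-j*2*pi*k*f0*t) to shift the harmonic down to DC, then
    low-pass (a moving average here -- a stand-in for a proper LPF).
    Constant f0 is assumed for simplicity; the slide allows a time-varying
    pitch track omega_0(t).
    """
    t = np.arange(len(x)) / fs
    baseband = x * np.exp(-2j * np.pi * k * f0 * t)
    kernel = np.ones(lpf_len) / lpf_len
    # Factor 2: a real cosine splits its amplitude between +/- frequencies.
    return 2 * np.abs(np.convolve(baseband, kernel, mode="same"))

# A 200 Hz tone with harmonics of amplitude 1.0, 0.5, 0.25.
fs, f0 = 16000, 200.0
t = np.arange(fs) / fs
x = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t)
        for k, a in enumerate([1.0, 0.5, 0.25]))
env2 = heterodyne_harmonic(x, 2, f0, fs)
# np.median(env2) comes out close to 0.5, the 2nd harmonic's amplitude.
```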
-
Nonnegative Matrix Factorization
• Decomposition of spectrograms into templates + activations
  fast and forgiving gradient-descent algorithm
  fits neatly with time-frequency masking
(Lee & Seung ’99; Smaragdis & Brown ’03; Abdallah & Plumbley ’04; Smaragdis ’04; Virtanen ’07)
[Figure: spectrogram X (DFT index × time, 200 frames) factored into three spectral bases from W and the corresponding rows of H]

$$X = W \cdot H$$

(sound examples: Virtanen ’03)
-
4. Model-Based Separation
• When N (sources) > M (sensors), need additional constraints to solve the problem
  e.g. the assumption of a single dominant pitch
• Can assemble these into a model M of source s_i
  defines the set of “possible” waveforms
  probabilistically: $\Pr(s_i \mid M)$
• Source separation from a mixture as inference:

$$\hat{s} = \{\hat{s}_i\} = \arg\max_{s}\; \Pr(x \mid s, A)\,\Pr(A)\,\prod_i \Pr(s_i \mid M)$$

where $\Pr(x \mid s, A) = \mathcal{N}(x \mid As, \Sigma)$.
-
[Figure: spectrograms (0–4 kHz) of a stereo instantaneous mix and of separated sources 1, 2, and 3]
Source Models
• Can constrain:
  source spectra (e.g. harmonic, noisy, smooth)
  temporal evolution (piecewise-continuous)
  spatial arrangements (point-source, diffuse)
• Factored decomposition:

$$V_{ex,j} = W_{ex,j} U_{ex,j} G_{ex,j} H_{ex,j}$$

$$V_{ex,j} = E_{ex,j} P_{ex,j}, \qquad E_{ex,j} = W_{ex,j} U_{ex,j}, \qquad P_{ex,j} = G_{ex,j} H_{ex,j}$$

(Ozerov, Vincent & Bimbot 2010; http://bass-db.gforge.inria.fr/fasst/)
Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov
-
Summary
• Acoustic Source Mixtures: the normal situation in real-world sounds
• Spatial Filtering: canceling sources by subtracting channels
• Time-Frequency Masking: selecting spectrogram cells
• Model-Based Separation: exploiting regularities in source signals
-
References
- S. Abdallah & M. Plumbley, “Polyphonic transcription by non-negative sparse coding of power spectra,” Proc. Int. Symp. Music Info. Retrieval, 2004.
- C. Avendano, “Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications,” Proc. IEEE WASPAA, Mohonk, pp. 55-58, 2003.
- A. Bell & T. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.
- J. Benesty, J. Chen, & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
- J. Blauert, Spatial Hearing, MIT Press, 1996.
- A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
- G. Brown & M. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994.
- P. Denbigh & J. Zhao, “Pitch extraction and separation of overlapping speech,” Speech Communication, vol. 11, no. 2-3, pp. 119-125, 1992.
- D. Lee & S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788-791, 1999.
- A. Ozerov, E. Vincent, & F. Bimbot, “A general flexible framework for the handling of prior information in audio source separation,” INRIA Tech. Rep. 7453, Nov. 2010.
- S. Roweis, “One microphone source separation,” Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
- P. Smaragdis & J. Brown, “Non-negative matrix factorization for polyphonic music transcription,” Proc. IEEE WASPAA, pp. 177-180, Oct. 2003.
- T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. Audio, Speech, & Lang. Proc., vol. 15, no. 3, pp. 1066-1074, 2007.
- A. Wang, Instantaneous and frequency-warped signal processing techniques for auditory source separation, Ph.D. dissertation, Stanford CCRMA, 1995.