
Page 1: Speech Segregation

Speech Segregation

DeLiang Wang

Perception & Neurodynamics Lab, Ohio State University

http://www.cse.ohio-state.edu/pnl/

Page 2: Speech Segregation

ICASSP'10 tutorial 2

Outline of presentation

I. Introduction: Speech segregation problem
II. Auditory scene analysis (ASA)
III. Speech enhancement
IV. Speech segregation by computational auditory scene analysis (CASA)
V. Segregation as binary classification
VI. Concluding remarks

Page 3: Speech Segregation

ICASSP'10 tutorial 3

Real-world audition

What?
• Speech
  – message
  – speaker: age, gender, linguistic origin, mood, …
• Music
• Car passing by

Where?
• Left, right, up, down
• How close?

Channel characteristics

Environment characteristics
• Room reverberation
• Ambient noise

Page 4: Speech Segregation

ICASSP'10 tutorial 4

Sources of intrusion and distortion

additive noise from other sound sources

reverberation from surface reflections

channel distortion

Page 5: Speech Segregation

ICASSP'10 tutorial 5

Cocktail party problem

• Term coined by Cherry

• "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem'…" (Cherry, 1957)

• “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp, 1992)

Ball-room problem by Helmholtz “Complicated beyond conception” (Helmholtz, 1863)

• Speech segregation problem

Page 6: Speech Segregation

ICASSP'10 tutorial 6

Listener performance

Speech reception threshold (SRT)

• The speech-to-noise ratio needed for 50% intelligibility

• Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility (Miller et al., 1951), dependent upon the speech materials

Source: Steeneken (1992)

Page 7: Speech Segregation

ICASSP'10 tutorial 7

Effects of competing source

Source: Wang and Brown (2006)

SRT difference (23 dB!)

Page 8: Speech Segregation

ICASSP'10 tutorial 8

Some applications of speech segregation

• Robust automatic speech and speaker recognition

• Processor for hearing prosthesis
  • Hearing aids
  • Cochlear implants

• Audio information retrieval

Page 9: Speech Segregation

ICASSP'10 tutorial 9

Approaches to speech segregation

• Monaural approaches
  • Speech enhancement
  • CASA
  • Focus of this tutorial

• Microphone-array approaches
  • Spatial filtering (beamforming)
    – Extract target sound from a specific spatial direction with a sensor array
    – Limitation: Configuration stationarity. What if the target switches or changes location?
  • Independent component analysis
    – Find a demixing matrix from mixtures of sound sources
    – Limitation: Strong assumptions. Chief among them is stationarity of the mixing matrix

Page 10: Speech Segregation

ICASSP'10 tutorial 10

Part II: Auditory scene analysis

• Human auditory system

• How does the human auditory system organize sound?

• Auditory scene analysis account

Page 11: Speech Segregation

ICASSP'10 tutorial 11

Auditory periphery

A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers

Page 12: Speech Segregation

ICASSP'10 tutorial 12

Beyond the periphery

• The auditory system is complex, with four relay stations between the periphery and the cortex, rather than one as in the visual system

• In comparison to the auditory periphery, the central parts of the auditory system are less understood

• Number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex despite the fact that the number of fibers in the auditory nerve is far fewer than that of the optic nerve (thousands vs. millions)

The auditory system (Source: Arbib, 1989)

The auditory nerve

Page 13: Speech Segregation

ICASSP'10 tutorial 13

Auditory scene analysis

• Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source – stream – in the perceptual process of auditory scene analysis (Bregman, 1990)

• From acoustic events to perceptual streams

• Two conceptual processes of ASA:
  • Segmentation. Decompose the acoustic mixture into sensory elements (segments)
  • Grouping. Combine segments into streams, so that segments in the same stream originate from the same source

Page 14: Speech Segregation

ICASSP'10 tutorial 14

Simultaneous organization

Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
• Proximity in frequency (spectral proximity)
• Common periodicity
  • Harmonicity
  • Temporal fine structure
• Common spatial location
• Common onset (and to a lesser degree, common offset)
• Common temporal modulation
  • Amplitude modulation (AM)
  • Frequency modulation (FM)
• Demo:

Page 15: Speech Segregation

ICASSP'10 tutorial 15

Sequential organization

Sequential organization groups sound components across time. ASA cues for sequential organization:
• Proximity in time and frequency
• Temporal and spectral continuity
• Common spatial location; more generally, spatial continuity
• Smooth pitch contour
• Smooth formant transition?
• Rhythmic structure

Demo: streaming in African xylophone music
• Notes in a pentatonic scale

Page 16: Speech Segregation

ICASSP'10 tutorial 16

Primitive versus schema-based organization

Primitive grouping. Innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception – feature-based or bottom-up
  It is domain-general, and exploits the intrinsic structure of environmental sound
  Grouping cues described earlier are primitive in nature

Schema-driven grouping. Learned knowledge about speech, music and other environmental sounds – model-based or top-down
  It is domain-specific, e.g. organization of speech sounds into syllables

Page 17: Speech Segregation

ICASSP'10 tutorial 17

Organization in speech: Spectrogram

[Spectrogram of the utterance "… pure pleasure …", annotated with onset synchrony, offset synchrony, continuity, and harmonicity]

Page 18: Speech Segregation

ICASSP'10 tutorial 18

Interim summary of ASA

• Auditory peripheral processing amounts to a decomposition of the acoustic signal

• ASA cues essentially reflect structural coherence of natural sound sources

• A subset of cues is believed to be strongly involved in ASA
  • Simultaneous organization: Periodicity, temporal modulation, onset
  • Sequential organization: Location, pitch contour, and other source characteristics (e.g. vocal tract)

Page 19: Speech Segregation

ICASSP'10 tutorial 19

Part III. Speech enhancement

Speech enhancement aims to remove or reduce background noise
  Improve signal-to-noise ratio (SNR)
  Assumes stationary noise, or at least that noise is more stationary than speech
  A tradeoff between speech distortion and noise distortion (residual noise)

Types of speech enhancement algorithms
  Spectral subtraction
  Wiener filtering
  Minimum mean square error (MMSE) estimation
  Subspace algorithms

Material in this part is mainly based on Loizou (2007)

Page 20: Speech Segregation

ICASSP'10 tutorial 20

Spectral subtraction

It is based on a simple principle: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum

The noise spectrum can be estimated (and updated) during periods when the speech signal is absent or when only noise is present
  It requires voice activity detection or speech pause detection

Page 21: Speech Segregation

ICASSP'10 tutorial 21

Basic principle

In the signal domain: y(n) = x(n) + d(n)
  x: speech signal; d: noise; y: noisy speech

In the DFT domain: Y(ω) = X(ω) + D(ω)

Hence we have the estimated signal magnitude spectrum

|X̂(ω)| = |Y(ω)| − |D̂(ω)|

Negative magnitude estimates can occur due to noise estimation errors; to ensure nonnegative magnitudes, half-wave rectification is applied:

|X̂(ω)| = |Y(ω)| − |D̂(ω)| if |Y(ω)| − |D̂(ω)| ≥ 0, and 0 otherwise
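A minimal Python sketch of the above (numpy only); the frame length, hop size, Hanning window, and the assumption that the first few frames contain only noise are illustrative choices, not part of the original tutorial:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10):
    """Magnitude spectral subtraction with half-wave rectification."""
    window = np.hanning(frame_len)
    # Estimate the noise magnitude spectrum from leading noise-only frames
    noise_mag = np.zeros(frame_len // 2 + 1)
    for i in range(noise_frames):
        seg = noisy[i * hop : i * hop + frame_len] * window
        noise_mag += np.abs(np.fft.rfft(seg))
    noise_mag /= noise_frames

    enhanced = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len, hop):
        seg = noisy[start : start + frame_len] * window
        spec = np.fft.rfft(seg)
        mag, phase = np.abs(spec), np.angle(spec)
        clean_mag = np.maximum(mag - noise_mag, 0.0)   # half-wave rectification
        # Resynthesize with the (unmodified) noisy phase and overlap-add
        enhanced[start : start + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return enhanced
```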

Page 22: Speech Segregation

ICASSP'10 tutorial 22

Basic principle (cont.)

Assuming that speech and noise are uncorrelated, we have the estimated signal power spectrum

|X̂(ω)|² = |Y(ω)|² − |D̂(ω)|²

In general,

|X̂(ω)|^p = |Y(ω)|^p − |D̂(ω)|^p

Again, half-wave rectification needs to be applied

Page 23: Speech Segregation

ICASSP'10 tutorial 23

Flow diagram

[Flow diagram: noisy speech y(n) → FFT → |Y(ω)|^p; the noise estimate |D̂(ω)|^p (estimated/updated during speech pauses) is subtracted; the result is combined with the phase information of the noisy spectrum, raised to the power 1/p, and passed through the IFFT to produce the enhanced speech]

Page 24: Speech Segregation

ICASSP'10 tutorial 24

Effects of half-wave rectification

[Figure: noisy magnitude spectrum |Y(f)| with the noise spectrum at frame t, and the resulting estimates |X(f)| at frames t, t+1, and t+2 (0-4000 Hz), illustrating the effects of half-wave rectification]

Page 25: Speech Segregation

ICASSP'10 tutorial 25

Musical noise

[Spectrogram of enhanced speech (0-5 kHz, ~1.8 s): isolated time-frequency peaks cause musical noise]

Page 26: Speech Segregation

ICASSP'10 tutorial 26

Over-subtraction to reduce musical noise

• By over-subtracting the noise spectrum, we can reduce the amplitude of isolated peaks and in some cases eliminate them altogether. This by itself, however, is not sufficient because the deep valleys surrounding the peaks still remain in the spectrum

• For that reason, spectral flooring is used to “fill in” the spectral valleys

• α is over-subtraction factor (α > 1), and β is spectral floor parameter (β < 1)

|X̂(ω)|² = |Y(ω)|² − α·|D̂(ω)|² if |Y(ω)|² − α·|D̂(ω)|² > β·|D̂(ω)|², and β·|D̂(ω)|² otherwise

Page 27: Speech Segregation

ICASSP'10 tutorial 27

Effects of parameters: Sound demo

• Half-wave rectification: α =1, β = 0

• α =3, β = 0

• α =8, β = 0

• α =8, β = 0.1

• α =8, β = 1

• α =15, β = 0

• Noisy sentence (+5 dB SNR)

• Original (clean) sentence

Page 28: Speech Segregation

ICASSP'10 tutorial 28

Wiener filter

• Aim: To find the optimal filter that minimizes the mean square error between the desired signal (clean signal) and the estimated output

• Input to this filter: Noisy speech

• Output of this filter: Enhanced speech

Page 29: Speech Segregation

ICASSP'10 tutorial 29

Wiener filter in frequency domain

• Wiener filter for noise reduction:

X̂(ω) = H(ω)·Y(ω)

• H(ω) denotes the filter

• Minimizing the mean square error between the filtered noisy speech and the clean speech leads to, for frequency ωk,

H(ωk) = Pxx(ωk) / [Pxx(ωk) + Pdd(ωk)]

• Pxx(ωk): power spectrum of x(n)

• Pdd(ωk): power spectrum of d(n)

Page 30: Speech Segregation

ICASSP'10 tutorial 30

Wiener filter in terms of a priori SNR

• Define the a priori SNR at frequency ωk:

ξk = Pxx(ωk) / Pdd(ωk)

• The Wiener filter becomes

H(ωk) = ξk / (ξk + 1)

• More attenuation at lower SNR and less attenuation at higher SNR
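A small sketch of this gain as a function of the a priori SNR, with a usage example that prints the attenuation at a few arbitrarily chosen SNRs:

```python
import numpy as np

def wiener_gain(prior_snr):
    """H(wk) = xi_k / (xi_k + 1), with the a priori SNR on a linear scale."""
    return prior_snr / (prior_snr + 1.0)

# More attenuation at low SNR, less at high SNR
for snr_db in (-10.0, 0.0, 10.0):
    xi = 10.0 ** (snr_db / 10.0)
    print(f"a priori SNR {snr_db:+.0f} dB -> gain {20 * np.log10(wiener_gain(xi)):.1f} dB")
```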

Page 31: Speech Segregation

ICASSP'10 tutorial 31

Iterative Wiener filtering

• The optimal Wiener filter depends on the power spectrum of the clean input signal, which is not available. In practice, we can estimate the Wiener filter iteratively

• We can consider the following procedure at iteration i to estimate H(ω):
  • Step 1: Obtain an estimate of the Wiener filter based on the enhanced signal x̂i(n) obtained at iteration i
    – Initialize x̂0(n) with the noisy speech signal
  • Step 2: Filter the noisy signal through the newly obtained Wiener filter according to

    X̂i+1(ωk) = Hi(ωk)·Y(ωk)

    to get the new enhanced signal x̂i+1(n). Repeat the above procedure
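A rough per-frame sketch of the iteration, assuming the noise power spectrum is known; the initialization from the noisy power spectrum and the fixed iteration count are assumptions:

```python
import numpy as np

def iterative_wiener_gain(noisy_power, noise_power, num_iters=3):
    """Iteratively re-estimate the clean-speech power spectrum and the
    corresponding Wiener gain for one frame of power-spectral values."""
    clean_power = noisy_power.astype(float)                 # initialize with the noisy speech
    for _ in range(num_iters):
        gain = clean_power / (clean_power + noise_power)    # H_i(w) from the current estimate
        clean_power = (gain ** 2) * noisy_power             # |X_hat_{i+1}(w)|^2
    return gain
```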

Page 32: Speech Segregation

ICASSP'10 tutorial 32

MMSE estimator

• The Wiener filter is the optimal (in the mean square error sense) complex spectrum estimator, not the optimal magnitude spectrum estimator

• Ephraim and Malah (1984) proposed an MMSE estimator which is the optimal magnitude spectrum estimator

• Unlike the Wiener estimator, the MMSE estimator does not require a linear model between the observed data and the estimator, but assumes probability distributions for the speech and noise DFT coefficients:
  • Fourier transform coefficients (real and imaginary parts) have a Gaussian probability distribution. The mean of the coefficients is zero, and the variances of the coefficients are time-varying due to the nonstationarity of speech
  • Fourier transform coefficients are statistically independent, and hence uncorrelated

Page 33: Speech Segregation

ICASSP'10 tutorial 33

MMSE estimator (cont.)

• In the frequency domain: Y(ωk) = X(ωk) + D(ωk), or equivalently

Yk·e^(jθy(k)) = Xk·e^(jθx(k)) + Dk·e^(jθd(k))

• The MMSE derivation leads to

X̂k = (√π/2)·(√vk/γk)·exp(−vk/2)·[(1 + vk)·I0(vk/2) + vk·I1(vk/2)]·Yk

• In(.) is the modified Bessel function of order n

• a posteriori SNR: γk = Yk²/λd(k), where λd(k) = E{|D(ωk)|²}, and vk = ξk·γk/(1 + ξk)

Page 34: Speech Segregation

ICASSP'10 tutorial 34

MMSE gain function

• MMSE spectral gain function

G(ξk, γk) = X̂k/Yk = (√π/2)·(√vk/γk)·exp(−vk/2)·[(1 + vk)·I0(vk/2) + vk·I1(vk/2)]
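A direct transcription of this gain formula, using scipy's exponentially scaled Bessel functions (i0e, i1e) to avoid overflow; inputs are assumed to be linear-scale a priori and a posteriori SNRs:

```python
import numpy as np
from scipy.special import i0e, i1e  # exp(-|x|)-scaled modified Bessel functions

def mmse_gain(prior_snr, post_snr):
    """Ephraim-Malah MMSE-STSA gain G(xi_k, gamma_k)."""
    v = prior_snr / (1.0 + prior_snr) * post_snr
    # exp(-v/2) * [(1+v)*I0(v/2) + v*I1(v/2)] == (1+v)*i0e(v/2) + v*i1e(v/2)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / post_snr) * \
           ((1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0))
```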

Page 35: Speech Segregation

ICASSP'10 tutorial 35

[Plot: MMSE gain 20·log(Gain) in dB versus instantaneous SNR γk − 1 (dB), for a priori SNR ξk = −15 dB, −10 dB, and 5 dB — gain for a given a priori SNR]

Page 36: Speech Segregation

ICASSP'10 tutorial 36

[Plot: gain 20·log(Gain) in dB versus a priori SNR ξk (dB) for the Wiener filter and for the MMSE estimator at a posteriori SNR γk − 1 = −20 dB, 0 dB, and 20 dB — gain for a given a posteriori SNR]

Page 37: Speech Segregation

ICASSP'10 tutorial 37

Estimating a priori SNR

• The suppression curves suggest that the a posteriori SNR has a small effect and that the a priori SNR is the main factor influencing suppression

• The a priori SNR can be estimated recursively (frame-wise) using the so-called “decision-directed” approach at frame m:

ξ̂k(m) = a·X̂k²(m−1)/λd(k, m−1) + (1 − a)·max[γk(m) − 1, 0]

• 0 < a < 1, and a = 0.98 is found to work well
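A minimal sketch of the decision-directed update for one frame; the variable names (prev_clean_mag for the previous frame's amplitude estimate, noise_var for the noise variance) are placeholders:

```python
import numpy as np

def decision_directed_snr(prev_clean_mag, noise_var, post_snr, a=0.98):
    """xi_hat_k(m) = a * |X_hat_k(m-1)|^2 / lambda_d(k, m-1)
                     + (1 - a) * max(gamma_k(m) - 1, 0)"""
    return a * (prev_clean_mag ** 2) / noise_var + \
           (1.0 - a) * np.maximum(post_snr - 1.0, 0.0)
```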

Page 38: Speech Segregation

ICASSP'10 tutorial 38

Other remarks and sound demo

• It is noted that when the a priori SNR is estimated using the “decision-directed” approach, the enhanced speech has no “musical noise”

• A log-MMSE estimator also exists, which might be perceptually more meaningful

• Sound demo:• Noisy sentence (5 dB SNR):

• MMSE estimator:

• Log-MMSE estimator:

Page 39: Speech Segregation

ICASSP'10 tutorial 39

Subspace-based algorithms

• This class of algorithms is based on singular value decomposition (SVD) or eigenvalue decomposition of either data matrices or covariance matrices

• The basic idea behind the SVD approach is that the singular vectors corresponding to the largest singular values contain speech information, while the remaining singular vectors contain noise information

• Noise reduction is therefore accomplished by discarding the singular vectors corresponding to the smallest singular values
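A toy illustration of the SVD idea applied to a matrix of signal frames; the choice of signal-subspace rank is an assumption (practical subspace methods estimate it and apply optimized gains rather than hard truncation):

```python
import numpy as np

def svd_truncate(frames, rank):
    """Keep the singular vectors with the largest singular values (signal
    subspace) and discard the rest (noise subspace)."""
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    s[rank:] = 0.0
    return U @ np.diag(s) @ Vt
```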

Page 40: Speech Segregation

ICASSP'10 tutorial 40

Subjective evaluations

• In terms of speech quality, a subset of algorithms improves overall quality in a few conditions relative to the unprocessed condition. No algorithm produces an improvement in multitalker babble

• In terms of intelligibility, no algorithm produces significant improvement over unprocessed noisy speech

Page 41: Speech Segregation

ICASSP'10 tutorial 41

Interim summary on speech enhancement

• Algorithms are derived analytically
  • Optimization theory

• Noise estimation is key
  • Accurate noise estimation is particularly needed for highly non-stationary environments

• Speech enhancement algorithms cannot deal with multitalker mixtures

• Inability to improve speech intelligibility

Page 42: Speech Segregation

ICASSP'10 tutorial 42

Part IV. CASA-based speech segregation

Fundamentals of CASA for monaural mixtures

CASA for speech segregation
  Feature-based algorithms
  Model-based algorithms

Page 43: Speech Segregation

ICASSP'10 tutorial 43

Cochleagram: Auditory spectrogram

Spectrogram
• Plot of log energy across time and frequency (linear frequency scale)

Cochleagram
• Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compression operations (log and cube root)
• Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
• A waveform signal can be reconstructed (inverted) from a cochleagram

[Figures: spectrogram; cochleagram]
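A compact cochleagram sketch assuming a gammatone filterbank with ERB-spaced center frequencies, half-wave rectification as a crude hair-cell stage, and cube-root compression of frame energies; all parameter values are illustrative:

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Impulse response of a gammatone filter centered at fc Hz (ERB bandwidth)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    return t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(signal, fs, num_channels=64, frame_len=320, hop=160):
    # Center frequencies quasi-logarithmically spaced (ERB scale) from 80 Hz to 5 kHz
    lo, hi = [21.4 * np.log10(4.37 * f / 1000.0 + 1.0) for f in (80.0, 5000.0)]
    cfs = (10.0 ** (np.linspace(lo, hi, num_channels) / 21.4) - 1.0) * 1000.0 / 4.37
    num_frames = (len(signal) - frame_len) // hop + 1
    cg = np.zeros((num_channels, num_frames))
    for c, fc in enumerate(cfs):
        response = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        rectified = np.maximum(response, 0.0)                  # crude hair-cell transduction
        for m in range(num_frames):
            frame = rectified[m * hop : m * hop + frame_len]
            cg[c, m] = np.sum(frame ** 2) ** (1.0 / 3.0)       # cube-root compression
    return cg
```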

Page 44: Speech Segregation

ICASSP'10 tutorial 44

Neural autocorrelation for pitch perception

Licklider (1951)

Page 45: Speech Segregation

ICASSP'10 tutorial 45

Correlogram

• Short-term autocorrelation of the output of each frequency channel of the cochleagram

• Peaks in summary correlogram indicate pitch periods (F0)

• A standard model of pitch perception

Correlogram & summary correlogram of a vowel with F0 of 100 Hz
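A simple summary-correlogram pitch estimate, assuming the cochlear channel responses for one frame are available as rows of an array; the F0 search range is an assumption:

```python
import numpy as np

def summary_correlogram_pitch(channels, fs, min_f0=80.0, max_f0=400.0):
    """Autocorrelate each channel, sum across channels, and pick the lag of
    the largest peak within the plausible pitch range."""
    num_ch, frame_len = channels.shape
    summary = np.zeros(frame_len)
    for c in range(num_ch):
        acf = np.correlate(channels[c], channels[c], mode="full")
        summary += acf[frame_len - 1:]          # keep non-negative lags
    lo, hi = int(fs / max_f0), int(fs / min_f0)
    best_lag = lo + int(np.argmax(summary[lo:hi]))
    return fs / best_lag                        # F0 estimate in Hz
```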

Page 46: Speech Segregation

ICASSP'10 tutorial 46

Onset and offset detection

• An onset (offset) corresponds to a sudden intensity increase (decrease), which can be detected by taking the time derivative of the intensity

• To reduce intensity fluctuations, Gaussian smoothing (low-pass filtering) is typically applied (as in edge detection for image analysis):

G(t, σ) = 1/(√(2π)·σ) · exp(−t²/(2σ²))

• Note that (s(t) ∗ G(t, σ))′ = s(t) ∗ G′(t, σ), where s(t) denotes the intensity and

G′(t, σ) = −t/(√(2π)·σ³) · exp(−t²/(2σ²))

Page 47: Speech Segregation

ICASSP'10 tutorial 47

Onset and offset detection (cont.)

• Hence onset and offset detection is a three-step procedure
  • Convolve the intensity s(t) with G′ to obtain O(t)
  • Identify the peaks and the valleys of O(t)
  • Onsets are those peaks above a certain threshold, and offsets are those valleys below a certain threshold

[Figure: detected onsets and offsets]
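A sketch of the three-step procedure with a derivative-of-Gaussian kernel; the smoothing scale and the onset/offset thresholds are illustrative values:

```python
import numpy as np

def detect_onsets_offsets(intensity, sigma=4.0, theta_on=0.05, theta_off=-0.05):
    """Convolve the intensity with G', then threshold peaks (onsets) and valleys (offsets)."""
    t = np.arange(-4 * int(sigma), 4 * int(sigma) + 1, dtype=float)
    g_prime = -t / (np.sqrt(2 * np.pi) * sigma ** 3) * np.exp(-t ** 2 / (2 * sigma ** 2))
    o = np.convolve(intensity, g_prime, mode="same")   # O(t) = s(t) * G'(t)
    onsets, offsets = [], []
    for i in range(1, len(o) - 1):
        if o[i - 1] < o[i] > o[i + 1] and o[i] > theta_on:
            onsets.append(i)                            # peak above the onset threshold
        if o[i - 1] > o[i] < o[i + 1] and o[i] < theta_off:
            offsets.append(i)                           # valley below the offset threshold
    return onsets, offsets
```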

Page 48: Speech Segregation

ICASSP'10 tutorial 48

Segmentation versus grouping

• Mirroring Bregman’s two-stage conceptual model, a CASA model generally consists of a segmentation stage and a subsequent grouping stage

• The segmentation stage decomposes an acoustic scene into a collection of segments, each of which is a contiguous region in the cochleagram with energy primarily from one source
  • Based on cross-channel correlation, which encodes the correlated responses (temporal fine structure) of adjacent filter channels, and on temporal continuity

• Based on onset and offset analysis

• Grouping aggregates segments into streams based on various ASA cues

Page 49: Speech Segregation

ICASSP'10 tutorial 49

Cross-channel correlation for segmentation

[Figure: correlogram and cross-channel correlation for a mixture of speech and trill telephone (80-5000 Hz, 0-1.5 s)]

[Figure: segments generated based on cross-channel correlation and temporal continuity]

Page 50: Speech Segregation

ICASSP'10 tutorial 50

Ideal binary mask

• A main CASA goal is to retain the parts of a mixture where the target sound is stronger than the acoustic background (i.e. to mask interference by the target), and discard the other parts (Hu & Wang, 2001; 2004)
  • What a target is depends on intention, attention, etc.

• In other words, the goal is to identify the ideal binary mask (IBM), which is 1 for a time-frequency (T-F) unit if the SNR within the unit exceeds a threshold, and 0 otherwise
  • It does not actually separate the mixture!
  • More discussion on the IBM in Part V

Page 51: Speech Segregation

ICASSP'10 tutorial 51

Ideal binary mask illustration

Page 52: Speech Segregation

ICASSP'10 tutorial 52

CASA for speech segregation

Monaural CASA systems for speech segregation are based on harmonicity, onset/offset, AM/FM, and trained models (Weintraub, 1985; Brown & Cooke, 1994; Ellis, 1996; Hu & Wang, 2004; Barker et al., 2005; Radfar et al., 2007)

The Hu-Wang and Barker et al. models will be explained as representatives of feature-based and model-based methods, respectively

Page 53: Speech Segregation

ICASSP'10 tutorial 53

CASA system architecture

Typical architecture of CASA systems

Page 54: Speech Segregation

ICASSP'10 tutorial 54

Voiced speech segregation

For voiced speech, lower harmonics are resolved while higher harmonics are not

For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech

A voiced speech segregation model by Hu and Wang (2004) applies different grouping mechanisms for low-frequency and high-frequency signals:
  Low-frequency signals are grouped based on periodicity and temporal continuity
  High-frequency signals are grouped based on amplitude modulation and temporal continuity

Page 55: Speech Segregation

ICASSP'10 tutorial 55

Pitch tracking

Pitch periods of target speech are estimated from an initially segregated speech stream based on dominant pitch within each frame

Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:
  Target pitch should agree with the periodicity of the T-F units in the initial speech stream
  Pitch periods change smoothly, thus allowing for verification and interpolation

Page 56: Speech Segregation

ICASSP'10 tutorial 56

Pitch tracking example

(a) Dominant pitch (Line: pitch track of clean speech) for a mixture of target speech and ‘cocktail-party’ intrusion

(b) Estimated target pitch

Page 57: Speech Segregation

ICASSP'10 tutorial 57

T-F unit labeling

In the low-frequency range:
  A T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch

In the high-frequency range:
  Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
  A T-F unit in the high-frequency range is labeled by comparing its AM rate with the estimated target pitch

Page 58: Speech Segregation

ICASSP'10 tutorial 58

Amplitude modulation illustration

(a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech

(b) The corresponding autocorrelation function

Page 59: Speech Segregation

ICASSP'10 tutorial 59

Final segregation

New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). Then they are grouped into the foreground stream according to AM rates

Other units are grouped according to temporal and spectral continuity

Page 60: Speech Segregation

ICASSP'10 tutorial 60

Voiced speech segregation example

Page 61: Speech Segregation

ICASSP'10 tutorial 61

Model-based speech segregation

Basic observations
• Pure primitive CASA may not be sufficient for speech separation, and listeners use top-down information
• Pure schema-based CASA needs models for all sources and ignores organization in the input

Barker, Cooke and Ellis (2005) proposed a model-based approach attempting to integrate primitive and schema-based processes in CASA
  Key idea: Use speech recognition to group segments (fragments) generated in primitive organization

Page 62: Speech Segregation

ICASSP'10 tutorial 62

CASA recognition

• By extending automatic speech recognition (ASR) framework, the goal of CASA recognition is to find the word sequence and speech/background separation which jointly have the maximum a posteriori probability given Y

• Y represents noisy speech and X clean speech

Page 63: Speech Segregation

ICASSP'10 tutorial 63

Introducing speech acoustics

since W is independent of S and Y given X

Here, P(X) cannot be dropped from the integral

Page 64: Speech Segregation

ICASSP'10 tutorial 64

CASA recognition with HMM modeling

For HMM recognition, a hidden state sequence is introduced. This leads to a formulation with the following components:

  Language model: bigrams, dictionary
  Acoustic model: schemas
  Segregation model: primitive grouping
  Search algorithm: modified decoder
  Segregation weight: connection to observations

Page 65: Speech Segregation

ICASSP'10 tutorial 65

Acoustic model and segregation weighting

• Impractical to evaluate the acoustic model and the segregation weighting term over entire search space of X. Simplifying assumptions must be made

• Segregation weighting is determined by primitive organization to make search feasible

Page 66: Speech Segregation

ICASSP'10 tutorial 66

Experimental evaluation

Page 67: Speech Segregation

ICASSP'10 tutorial 67

Results

• Factory noise: stationary background, hammer blows, machine noise, etc.

• MFCC + CMN: standard ASR approach

• Missing data recognition focuses on reliable T-F regions

• The model-based approach provides about a 1.5 dB improvement over missing data recognition at low SNR levels
  • No significant CASA is done

Page 68: Speech Segregation

ICASSP'10 tutorial 68

Interim summary on CASA

• Main progress occurs in voiced speech segregation based on pitch tracking and model-based grouping
  • Recent work starts to address unvoiced speech separation (e.g. Hu & Wang, 2008)

• Integration of feature-based and model-based segregation is promising
  • Substantial effort attempts to incorporate speaker models into segregation (e.g. Shao & Wang, 2006)

Page 69: Speech Segregation

ICASSP'10 tutorial 69

Part V. Segregation as binary classification

What is the goal of speech segregation?
  Ideal binary mask
  Speech intelligibility tests

Classification approach to speech segregation
  An MLP-based algorithm to separate reverberant speech
  A GMM-based algorithm to improve intelligibility

Page 70: Speech Segregation

ICASSP'10 tutorial 70

What is the goal of speech segregation?

From the perspective of perceptual information processing, the analysis of the computational goal is critically important (Marr, 1982)
  Computational theory analysis

What is the goal of audition?
  By analogy to vision (Marr, 1982), the purpose of audition is to produce an auditory description of the environment for the listener

What is the goal of CASA?
  The goal of ASA is to segregate sound mixtures into separate perceptual representations (or auditory streams), each of which corresponds to an acoustic event (Bregman, 1990)
  By extrapolation, the goal of CASA is to develop computational systems that extract individual streams from sound mixtures

Page 71: Speech Segregation

ICASSP'10 tutorial 71

Computational-theory analysis of ASA

To form a stream, a sound must be audible on its own

The number of streams that can be computed at a time is limited
  Magical number 4 for simple sounds such as tones and vowels (Cowan, 2001)?
  1+1, or figure-ground segregation, in a noisy environment such as a cocktail party?

Auditory masking further constrains the ASA output
  Within a critical band, a stronger signal masks a weaker one

Page 72: Speech Segregation

ICASSP'10 tutorial 72

Computational-theory analysis of ASA (cont.)

ASA outcome depends on sound types (overall SNR is 0 dB)
  Noise-Noise: pink, white, pink+white
  Speech-Speech:
  Noise-Speech:
  Tone-Speech:

Page 73: Speech Segregation

ICASSP'10 tutorial 73

Some alternative goals

Extract all underlying sound sources, or the target sound source (the gold standard)
  Implicit in speech enhancement and spatial filtering
  Segregating all sources is implausible, and probably unrealistic with one or two microphones

Enhance ASR
  Close coupling with a primary motivation of speech segregation
  Perceiving is more than recognizing (Treisman, 1999)

Enhance human listening
  Advantage: close coupling with auditory perception
  There are applications that involve no human listening

Page 74: Speech Segregation

ICASSP'10 tutorial 74

Ideal binary mask as CASA goal

• As mentioned in Part IV, the ideal binary mask has been suggested as a main goal of CASA

• Definition

s(t, f): Target energy in unit (t, f)
n(t, f): Noise energy
θ: A local SNR criterion (LC) in dB, which is typically chosen to be 0 dB

IBM(t, f) = 1 if 10·log10[s(t, f)/n(t, f)] > θ, and 0 otherwise
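The definition translates directly into a few lines of array code; the small epsilon guarding against division by zero is an implementation detail, not part of the definition:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """IBM(t, f) = 1 where the local SNR exceeds the local criterion LC (in dB)."""
    local_snr_db = 10.0 * np.log10(target_energy / np.maximum(noise_energy, 1e-12))
    return (local_snr_db > lc_db).astype(np.int8)
```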

Page 75: Speech Segregation

ICASSP'10 tutorial 75

Properties of IBM

The IBM notion is consistent with the computational-theory analysis of ASA
  Audibility and capacity
  Auditory masking
  Effects of target and noise types (spectral overlap)

Optimality: Under certain conditions, the IBM with θ = 0 dB is the optimal binary mask from the perspective of SNR gain (Li & Wang, 2009)

Page 76: Speech Segregation

ICASSP'10 tutorial 76

Subject tests of IBM

• Recent studies found large speech intelligibility improvements by applying ideal binary masking for normal-hearing (Brungart et al., 2006; Li & Loizou, 2008) and hearing-impaired (Anzalone et al., 2006; Wang et al., 2009) listeners
  • Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners, and above 9 dB for hearing-impaired (HI) listeners

• Improvement for modulated noise is significantly larger than for stationary noise

Page 77: Speech Segregation

ICASSP'10 tutorial 77

Test conditions of Wang et al.’09

SSN: Unprocessed monaural mixtures of speech-shaped noise (SSN) and Dantale II sentences (0 dB: -10 dB: )

CAFÉ: Unprocessed monaural mixtures of cafeteria noise (CAFÉ) and Dantale II sentences (0 dB: -10 dB: )

SSN-IBM: IBM applied to SSN (0 dB: -10 dB: -20 dB: )

CAFÉ-IBM: IBM applied to CAFÉ (0 dB: -10 dB: -20 dB: )

Intelligibility results are measured in terms of SRT

Page 78: Speech Segregation

ICASSP'10 tutorial 78

Intelligibility results

12 NH subjects (10 male and 2 female), and 12 HI subjects (9 male and 3 female)
SRT means for the 4 conditions (SSN, CAFÉ, SSN-IBM, CAFÉ-IBM) for NH listeners: (−8.2, −10.3, −15.6, −20.7) dB
SRT means for the 4 conditions for HI listeners: (−5.6, −3.8, −14.8, −19.4) dB

[Bar chart: Dantale II SRT (dB) for NH and HI listeners in the SSN, CAFÉ, SSN-IBM, and CAFÉ-IBM conditions]

Page 79: Speech Segregation

ICASSP'10 tutorial 79

Speech perception of noise with binary gains

Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is −∞ dB (i.e. the mixture contains noise only, with no target speech)

[Figure: cochleagram-style plots (center frequencies 55-7743 Hz, intensity scale 0-96 dB) and the corresponding 32-channel binary gain pattern over time]

Page 80: Speech Segregation

ICASSP'10 tutorial 80

IBM-gated noise results

Despite a great reduction of spectrotemporal information, a pattern of binary gains is apparently sufficient for human speech recognition

Mean numbers for the 4 conditions: (97.1%, 92.9%, 54.3%, 7.6%)

[Plot: percent correct word recognition versus number of channels (4, 8, 16, 32)]

Page 81: Speech Segregation

ICASSP'10 tutorial 81

Main points

Speech intelligibility results support the IBM as an appropriate goal of CASA in general, and speech segregation in particular

Hence solving the speech segregation problem would amount to binary classification
  This is a strong claim with major consequences
  This formulation opens the problem to a variety of pattern classification methods

Page 82: Speech Segregation

ICASSP'10 tutorial 82

Segregation of reverberant speech

Room reverberation poses an additional challenge to the problem of speech segregation
  Segregation performance drops significantly in reverberation
  A common approach to deal with reverberation is inverse filtering, which is sensitive to different room configurations

Jin and Wang (2009) proposed a model to segregate reverberant voiced speech based on classification

Page 83: Speech Segregation

ICASSP'10 tutorial 83

System overview

A multilayer perceptron (MLP) is trained for each frequency channel in order to label T-F units as target or interference dominant

Page 84: Speech Segregation

ICASSP'10 tutorial 84

Feature extraction

A 6-dimensional pitch-based feature vector is constructed for each T-F unit

τm: given pitch period at frame m
A(c, m, τm): autocorrelation function
f̄(c, m): average instantaneous frequency
[.]: nearest integer, indicating harmonic number
|.|: deviation from the nearest harmonic (subscript E denotes the envelope)

yc,m = ( A(c, m, τm), f̄(c, m)·τm/[f̄(c, m)·τm], |f̄(c, m)·τm − [f̄(c, m)·τm]|, AE(c, m, τm), f̄E(c, m)·τm/[f̄E(c, m)·τm], |f̄E(c, m)·τm − [f̄E(c, m)·τm]| )

Page 85: Speech Segregation

ICASSP'10 tutorial 85

MLP learning

The objective function is to maximize SNR:

J = Σc,m [dc(m) − ac(m)]²·Ec(m) / Σc,m Ec(m)

d: desired output; a: actual output; E: energy

Generalized mean-square-error criterion – each squared error is weighted by the normalized energy (i.e., cost-sensitive learning)
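A one-function sketch of the energy-weighted error used as the training objective (the MLP itself is omitted); the arrays hold the desired labels, the network outputs, and the T-F unit energies:

```python
import numpy as np

def energy_weighted_error(desired, actual, energy):
    """J = sum_{c,m} [d_c(m) - a_c(m)]^2 * E_c(m) / sum_{c,m} E_c(m):
    errors on high-energy units cost more (cost-sensitive learning)."""
    weights = energy / np.sum(energy)
    return float(np.sum(weights * (desired - actual) ** 2))
```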

Page 86: Speech Segregation

ICASSP'10 tutorial 86

MLP based unit labeling

MLP encodes the posterior probability of a T-F unit being target dominant given observed features

Labeling criterion: P(H1 | yc,m) > 1/2

H1: Hypothesis that the unit is dominated by the target

Page 87: Speech Segregation

ICASSP'10 tutorial 87

Segmentation and grouping

Segmentation is based on cross-channel correlation at low frequencies, and onset/offset analysis at high frequencies

Label each segment based on the labels of its T-F units

Units not in any segment are grouped according to temporal and spectral continuity

[Figure: segmented T-F regions displayed over frequency (50-8000 Hz) and time]

Page 88: Speech Segregation

ICASSP'10 tutorial 88

Segregation results

Train on different room configurations
  Case 1: Train on one configuration and test on others in the same room (reverberation time)
  Case 2: Train on one configuration from each of the 6 rooms
  Case 3: Train on room 6

Generalizes well to different utterances and different speakers

Page 89: Speech Segregation

ICASSP'10 tutorial 89

Discussion and demo

MLP-based classification leads to substantial SNR gain

It generalizes well to different utterances and different speakers

Limitations: assumes known pitch, and handles only voiced speech

See the Jin and Wang lecture on Tuesday afternoon (SP-L2)

Demo: Two speech utterances (trained just in Room 6 with T60 = 0.6 s)

[Demo table: rows T60 = 0.0, 0.3, 0.6 s; columns: Original, Hu-Wang, Inverse filter, Jin-Wang'09, Ideal binary mask]

Page 90: Speech Segregation

ICASSP'10 tutorial 90

GMM-based classification

Instead of treating voiced speech and unvoiced speech separately, a simpler approach is to perform classification on noisy speech regardless of voicing

A classification model by Kim, Lu, Hu, and Loizou (2009) deals with speech segregation in a speaker- and masker-dependent way:
  AM spectrum (AMS) features are used
  Classification is based on Gaussian mixture models (GMMs)
  Speech intelligibility evaluation is performed with normal-hearing listeners

Page 91: Speech Segregation

ICASSP'10 tutorial 91

Diagram of Kim et al.’s model

Page 92: Speech Segregation

ICASSP'10 tutorial 92

Feature extraction and GMM

Peripheral analysis is done by a 25-channel mel-frequency filter bank

A 15-dimensional AMS feature vector is extracted within each T-F unit
  This vector is then concatenated with the delta vectors over time and frequency to form a 45-dimensional feature vector for each unit

One GMM (λ1) models target-dominant T-F units and another GMM (λ0) models noise-dominant units
  Each GMM has 256 Gaussian components

Page 93: Speech Segregation

ICASSP'10 tutorial 93

Training and classification

To improve efficiency, each GMM is subdivided into two models during training: one for relatively high local SNRs and one for relatively low SNRs

With the 4 trained GMMs, segregation comes down to Bayesian classification with prior probabilities of P(λ0) and P(λ1) estimated from the training data

The training and test data are mixtures of IEEE sentences and 3 masking noises: babble, factory, and speech-shaped noise
  Separate GMMs are trained for each speaker (a male and a female) and each masker
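A minimal sketch of the training and Bayesian classification steps using scikit-learn GMMs on precomputed AMS feature vectors; the component count, diagonal covariances, and estimating priors from class proportions are assumptions of this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_unit_classifier(target_feats, noise_feats, n_components=256):
    """Fit one GMM to target-dominant units and one to noise-dominant units."""
    gmm_target = GaussianMixture(n_components, covariance_type="diag").fit(target_feats)
    gmm_noise = GaussianMixture(n_components, covariance_type="diag").fit(noise_feats)
    prior_target = len(target_feats) / (len(target_feats) + len(noise_feats))
    return gmm_target, gmm_noise, prior_target

def classify_units(feats, gmm_target, gmm_noise, prior_target):
    """Label a T-F unit 1 (target-dominant) if its posterior under the target
    model exceeds its posterior under the noise model."""
    log_t = gmm_target.score_samples(feats) + np.log(prior_target)
    log_n = gmm_noise.score_samples(feats) + np.log(1.0 - prior_target)
    return (log_t > log_n).astype(np.int8)
```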

Page 94: Speech Segregation

ICASSP'10 tutorial 94

A separation example

Target utterance

-5 dB mixture with babble

Estimated mask

Masked mixture

Page 95: Speech Segregation

ICASSP'10 tutorial 95

Intelligibility results and demo

Clean: 0-dB mixture with babble: Segregated:

UN: unprocessed; IdBM: ideal binary masks; sGMM: trained on a single noise; mGMM: trained on multiple noises

Page 96: Speech Segregation

ICASSP'10 tutorial 96

Discussion

The GMM classifier achieves a hit rate (active units correctly classified) higher than 80% in most cases, while keeping the false-alarm (FA) rate relatively low
  As expected, mGMM results are worse than sGMM results
  HIT − FA correlates well with intelligibility

The first monaural separation algorithm to achieve significant speech intelligibility improvements

The main limitation is speaker and masker dependency
  Also, AMS features would not be applicable to multitalker mixtures

Page 97: Speech Segregation

ICASSP'10 tutorial 97

Concluding remarks

Monaural speech segregation is of fundamental importance
  Compared to beamforming, it does not assume configuration stationarity, and is hence more versatile in application
  It may hold the key to robustness to room reverberation

Speech enhancement algorithms are based on statistical analysis of general signal properties
  For example, the uncorrelatedness of speech and noise

CASA-based speech segregation is based on analysis of perceptual and speech properties
  For example, heavy use of pitch and model training

Page 98: Speech Segregation

ICASSP'10 tutorial 98

Concluding remarks (cont.)

Speech enhancement aims to increase the SNR of noisy speech
  Reasonable for improving speech quality, but unsuccessful at improving speech intelligibility

Binary masking is a key concept of CASA, which has led to the new formulation of segregation as binary classification
  Classification-based segregation shows promising results for improving speech intelligibility

Page 99: Speech Segregation

ICASSP'10 tutorial 99

Remarks on SNR and intelligibility

SNR measures signal similarity to clean speech

Recent intelligibility studies provide evidence that lower SNR can result in higher intelligibility!
  LC = 0 dB is best for SNR gain (Li & Wang, 2009)
  However, intelligibility results in Brungart et al. (2006), Wang et al. (2009), and Kim et al. (2009) indicate that negative LC values produce higher intelligibility

Hence the pursuit of increased SNR could be wrongly headed

It is important to identify appropriate computational goals, and to design evaluation metrics accordingly
  This way, one can avoid expending effort in wrong directions

Page 100: Speech Segregation

ICASSP'10 tutorial 100

Review of presentation

I. Introduction: Speech segregation problem
II. Auditory scene analysis (ASA)
III. Speech enhancement
IV. Speech segregation by computational auditory scene analysis (CASA)
V. Segregation as binary classification
VI. Concluding remarks

Page 101: Speech Segregation

ICASSP'10 tutorial 101

Further information

This tutorial is partly drawn from the following two books (Loizou, 2007; Wang & Brown, 2006)

Page 102: Speech Segregation

ICASSP'10 tutorial 102

Bibliography

Anzalone et al. (2006) Ear & Hearing 27: 480-492.
Arbib (1989) The Metaphorical Brain 2. Wiley.
Barker, Cooke, & Ellis (2005) Speech Comm. 45: 5-25.
Bregman (1990) Auditory Scene Analysis. MIT Press.
Bronkhorst & Plomp (1992) JASA 92: 3132-3139.
Brown & Cooke (1994) Comp. Speech & Lang. 8: 297-336.
Brungart et al. (2006) JASA 120: 4007-4018.
Cherry (1957) On Human Communication. Wiley.
Cowan (2001) Beh. & Brain Sci. 24: 87-185.
Ellis (1996) PhD thesis, MIT.
Ephraim & Malah (1984) IEEE T-ASSP 32: 1109-1121.
Helmholtz (1863) On the Sensations of Tone. Dover.
Hu & Wang (2001) WASPAA.
Hu & Wang (2004) IEEE T-NN 15: 1135-1150.
Hu & Wang (2008) JASA 124: 1306-1319.
Jin & Wang (2009) IEEE T-ASLP 17: 625-638.
Kim et al. (2009) JASA 126: 1486-1494.
Li N & Loizou (2008) JASA 123: 1673-1682.
Li Y & Wang (2009) Speech Comm. 51: 230-239.
Licklider (1951) Experientia 7: 128-134.
Loizou (2007) Speech Enhancement: Theory and Practice. CRC.
Marr (1982) Vision. Freeman.
Miller, Heise, & Lichten (1951) JEP 41: 329-335.
Radfar, Dansereau, & Sayadiyan (2007) EURASIP J-ASMP: Article 84186.
Shao & Wang (2006) IEEE T-ASLP 14: 289-298.
Steeneken (1992) PhD thesis, Univ. of Amsterdam.
Treisman (1999) Neuron 24: 105-110.
Wang & Brown, Ed. (2006) Computational Auditory Scene Analysis. Wiley & IEEE Press.
Weintraub (1985) PhD thesis, Stanford.
Wang et al. (2008) JASA 124: 2303-2307.
Wang et al. (2009) JASA 125: 2336-2347.

Page 103: Speech Segregation

ICASSP'10 tutorial 103

Resources and acknowledgments

Source programs for some studies referred to in this tutorial are available at OSU Perception & Neurodynamics Lab’s website

http://www.cse.ohio-state.edu/pnl

Bregman and Ahad (1995) produced a CD that accompanies Bregman’s 1990 book, which can be ordered from MIT Press

Houtsma, Rossing, and Wagenaars (1987) produced a CD that demonstrates basic auditory perception phenomena, which can be ordered from the Acoustical Society of America

Thanks to M. Cooke and P. Loizou for making available their slides, some of which are incorporated in this tutorial