hcs 7367 speech perception - university of texas at dallasassmann/hcs6367/lec9.pdfdiscrimination of...

1

HCS 7367Speech Perception

Dr. Peter AssmannFall 2012

Long-term spectrum of speech

Absolute threshold

Females

Males

Connected speech

Long-term spectrum of speech

Absolute threshold

Females

Males

2000)

VowelsSo

und

pres

sure

leve

l (dB

)

0

120

100

80

60

40

20

62.5 125 500 1k 4k 8k2k 16k250 32k

Frequency (Hz)

Conversational Speech

Audibility of speech

Absolutethreshold(normal listeners)

Types of hearing loss

• Conductive loss• Sensorineural loss

• Audibility/distortion• Effect of noise

Source: www.brainconnection.com

Pure-tone audiogram

250 500 1K 2K 4K-10 0 10 20 30 40 50 60 70 80 90 100 110 120 130

Bone conduction thresholdsAir conduction thresholds

250 500 1K 2K 4K-10 0 10 20 30 40 50 60 70 80 90 100 110 120 130H

eari

ng L

oss i

n dB

(AN

SI-1

996)

Left ear Right ear

Normal Conductive loss

Source: www.bcm.tmc.edu/oto/studs/aud.html

2

Pure-tone audiogram

250 500 1K 2K 4K-10 0 10 20 30 40 50 60 70 80 90 100 110 120 130

250 500 1K 2K 4K-10 0 10 20 30 40 50 60 70 80 90 100 110 120 130H

eari

ng L

oss i

n dB

(AN

SI-1

996)

Left ear Right ear

Sensorineural loss Mixed loss

Bone conduction thresholdsAir conduction thresholds

Source: www.bcm.tmc.edu/oto/studs/aud.html

Speech audiometry

• Nonsense syllables, real words, words in sentences

• threshold for recognizing 50% of test items• percentage of items correctly reported• Speech tests provide a valid measure of hearing

handicap.• Poor speech scores may indicate hearing loss of

retrocochlear origin

Sensorineural hearing loss

• Listeners with cochlear hearing loss have difficulty recognizing speech when background noise is present.

Reduced audibilitySupra-threshold “distortions”• Impaired frequency selectivity• Loudness recruitment

Speech recognition in noise

• Speech reception threshold, SRT(Plomp & Mimpen, 1969)

Speech-to-noise ratio required to achieve a specific level of intelligibility, typically 50%Effects of speech materialsEffects of type of masker (e.g., speech-shaped noise vs. a single competing talker)Effects of spatial separation of target & masker

Speech recognition in noiseMasker type Listening situation Deficit in SRT

Speech-shaped noise

Speech+masker in front, unaided 2.5 - 7.0 dB

Speech-shaped noise

Speech+masker in front, aided 2.5 - 6.0 dB

Single talker Speech+masker in front, unaided 6.0 - 12.0 dB

Single talker Speech+masker in front, aided 4.0 - 10.0 dB

Single talker Speech+masker in front, spatially separated

12.0 – 19.0 dB

Source: Moore, BCJ (2003) Speech Communication

Articulation Index

• How much does audibility contribute to difficulty understanding speech in noise?

• Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility

3

Articulation Index

1. Divides the speech and masker spectrum into a small number of frequency bands

2. Estimates the audibility of speech in each band, weighted by its relative importance for intelligibility

3. Derives overall intelligibility by summing the contributions of each band.

Articulation Index

• Most studies show that speech intelligibility is worse than predicted by the AI for hearing-impaired listeners, especially for moderate or severe hearing loss.

Articulation Index

• Conclusion: factors other than audibility must be responsible for the difficulties experienced by hearing-impaired listeners understanding speech in noise.

• What else?Frequency selectivityTemporal resolution

Frequency Selectivity

• Frequency selectivity is the ability to resolve the spectral components of complex sounds.

• Reduced frequency selectivity may lead to difficulty in understanding speech in noise.

Auditory filters• Fletcher (1940) suggested that the

peripheral auditory system could be modeled as a bank of linear bandpass filters with continuously overlapping center frequencies.

Frequency

Gai

n

Auditory filters

• Each point along the basilar membrane corresponds to a filter with a different center frequency, with center frequencies increasing roughly logarithmically from the apex to the base.

4

Auditory filters

• About half of the length of the human basilar membrane is devoted to the lowest kHz (F1 range of speech) with the majority of neural fibers responding best to low-to-mid-frequencies.

Critical Bandwidth• Fletcher (1940) band-widening experiment

The threshold for detecting a pure tone in the presence of a bandpass noise masker increases as the noise bandwidth increases, until the width of the band exceeds the critical bandwidth of the auditory filter.

Noise masker bandwidth

Tone

det

ectio

nth

resh

old

Critical Bandwidth• Sources of evidence for critical bandwidth:

Band-widening experiments (Fletcher, 1940)Loudness summation (Zwicker et al., 1957)Two-tone masking (Zwicker, 1954)Discrimination of partials within complex tones (Plomp and Mimpen, 1968)

Only the lowest 5-8 partials can be reliably discriminated.

Critical Bandwidth

• Fletcher (1940) made the simplifying assumption that the auditory filter could be modeled as a rectangle, with flat top and vertical slopes.

CB

Frequency

Gai

n

CB

Power spectrum model of masking

• Fletcher suggested that only a narrow band of frequencies in the region of the tone contribute to masking.

• He called this the critical bandwidth (CB).

AuditoryFilter

Frequency

Gai

n


• But threshold changes gradually as the noise bandwidth increases, suggesting auditory filters with sloping rather than rectangular skirts (Patterson, 1976).

Frequency

Gai

n CB

AuditoryFilter

5


• Detection of probe tone in the presence of a noise masker depends of the relative power of probe and noise passed by the auditory filter centered on the tone (Patterson, 1976).

AuditoryFilter

Tone Noise masker

Frequency


• Noise power is often specified as the power in a band of frequencies 1 Hz wide. This is called noise power density, designated N0.

• The total power in a band of noise is calculated as W N0, where W is the noise bandwidth in Hz.

W


• When the noise just masks the tone, the ratio of the power in the tone to the power in the noise is a constant, K.

KNWP )(/ 0

)(/ 0NKPW and


• Assumptions: Only frequencies within the passband of the auditory filter contribute to masking.Detection is based on a single auditory filter, centered on the frequency of the tone.Listeners ignore short-term fluctuations in the noise, and do not rely on phase differences between signal and noise.

Notched noise method

Tone

LP Noise HP NoiseAuditory

Filter

Patterson (1976)

Off-frequency listening

Tone

HP NoiseShiftedFilter

Tone detectioncan be improvedby shifting filtercenter frequencyto maximize SNR

6

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

Rel

ativ

e am

plitu

de (d

B)

Notched noise method

• Patterson (1976) estimated auditory filter shapes from the function relating tone threshold to notch width.

• The derived filters have a rounded top and steep skirts, with bandwidths 10-15% of filter center frequency. Derived auditory filter shape

Simulation of reduced frequency selectivity

Normal

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

()

Impaired (3Normal)

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

Rel

ativ

e am

plitu

de (d

B)

Derived auditory filter shapes

0 1 2 3 4 5-60

-50

-40

-30

-20

-10

0

Fc=1094 HzERB=143 Hz

Frequency (kHz)

Filte

r Gai

n (d

B)

Frequency response of gammatone filter bank

Auditory filter shapes as a function of frequency

Auditory filter shapes as a function of level

0 500 1000 1500 20000

10

20

30

40

50

60

70

80

90

Frequency (Hz)

Out

put l

evel

(dB

)

Equivalent Rectangular Bandwidth• The equivalent rectangular bandwidth (or ERB)

of a filter is the bandwidth of a rectangularfilter which has the same power output as that filter, when the input is white noise.

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

Rel

ativ

e am

plitu

de (d

B) ERB

Equivalent Rectangular Bandwidth• ERB – equivalent rectangular bandwidth of

the estimated auditory filter – about 10-15% of the filter center frequency.

0.05 0.1 0.2 0.5 1 2 5 100.02

0.05

0.1

0.2

0.5

1

2

Center Frequency (Hz)

ER

B (H

z)

7

Cochlear frequency-place map

• Greenwood (1961) developed a function to relate the characteristic frequency (CF) at each place on the cochlea to the distance (x) of that place from the apex.

ERB Scale• One ERB unit corresponds to a distance of

about 0.89 mm along the basilar membrane.

Human data

ERB-rate scale• The ERB-rate scale is a warped frequency scale

modeling changes in the ERB of the auditory filter as a function of frequency.

0 1000 2000 3000 40000

5

10

15

20

25

30

Frequency (Hz)

ER

B-r

ate

(ER

B) ERB-rate as

a function of frequency

Excitation patterns• Auditory excitation patterns show the

composite output of a bank of simulated auditory filters as a function of filter center frequency.

Filter Center Frequency

Filte

r out

put

Excitation patterns• Excitation patterns provide a good model of

auditory frequency selectivity and masking: frequency components that are resolved by the auditory system produce distinct peaks in the excitation pattern.

Excitation patterns

Cochlear FilteringCochlear Filtering

Outer and middle earsOuter and middle ears

EnergyDetector

EnergyDetector

EnergyDetector

CNSCNS

Freq

uenc

y (E

RB

-rat

e)

8

Excitation patterns

0.2 0.5 1.0 2.0 4.0-60

-50

-40

-30

-20

-10

0

Frequency (kHz)

Ampl

itude

(dB)

500 Hz pure tone

0.2 0.5 1.0 2.0 4.0-60

-50

-40

-30

-20

-10

0

Frequency (kHz)

Ampl

itude

(dB

)

Excitation patternsComplex tone, equal amplitude harmonics

0.2 0.5 1.0 2.0 4.0-60

-50

-40

-30

-20

-10

0

Frequency (kHz)

Ampl

itude

(dB

)

Excitation patternsVowel: / æ / F0 = 200 Hz

200Hz 400

600

800 F21450 Hz F3

2450 Hz

Time (ms)

Freq

uenc

y (H

z)

0 100 200 300 400 500 600 7000.1

0.2

0.5

1.0

2.0

Auditory filterbank spectrogram

Simulation studies

• Simulation of reduced frequency selectivity (spectral smearing of the short-term speech spectrum) results in lowered intelligibility for listeners with normal hearing, particularly in noise (ter Keurs et al., 1993; Baer & Moore, 1994)

Simulation of reduced frequency selectivity

Normal

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

()

Impaired (3Normal)

400 600 800 1000 1200 1400 1600-50

-40

-30

-20

-10

0

Frequency (kHz)

Rel

ativ

e am

plitu

de (d

B)

Derived auditory filter shapes

9

Effects of reduced frequency selectivity on vowel / ӕ /

0.2 0.5 1.0 2.0 4.0-60

-50

-40

-30

-20

-10

0

Frequency (kHz)

Ampl

itude

(dB)

F0 = 200 Hz

200Hz 400

600

800 F21450 Hz F3

2450 Hz

3 x normal

2 x normal

1 x normal

Distortion of spectral shape

• Broader auditory filters produce a “smeared” excitation pattern: reduced prominence of peaks, smaller peak-to-valley ratios.

• Introduction of noise fills up the valleys between the spectral peaks and reduces the distinctiveness of the spectral profile.

Distortion of temporal structure

• Broader auditory filters alter the temporal fine structure of the output.• Increased contribution of adjacent

components• Increase in within-channel modulation• Diminished differences between adjacent

channels

Effects of reduced frequency selectivity on temporal structure

0 100 200 300

4515

4214

3913

3612

3311

3010

2709

2408

2107

1806

1505

1204

903

602

301

0

Normal x 1

0 100 200 300

4515

4214

3913

3612

3311

3010

2709

2408

2107

1806

1505

1204

903

602

301

0

Normal x 3Normal x 1 Normal x 3

Time Time

Filte

r cen

ter f

requ

ency

(Hz)

Loudness Recruitment

• When a sound is increased in level above absolute threshold, the rate of growth of loudness is greater than normal.

• At levels >90-100 dB SPL, loudness returns to normal (sound appears equally loud to hearing-impaired and normal listeners).

Loudness Recruitment

• Loudness recruitment is associated with reduced dynamic range (range between absolute threshold and highest comfortable level).

• Recruitment may reduce the ability to “listen in the dips” in a fluctuating masker, such as a competing voice.

• Recruitment distorts loudness relationships among components of speech sounds.

10

A glimpsing model of speech perception in noise

Martin Cooke

Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562–1573, March 2006

“Glimpsing” speech in noise

• “speech is a highly modulated signal in time and frequency, regions of high energy are typically sparsely distributed.”

Time (ms)

Freq

uenc

y (H

z)

0 100 200 300 400 500 600 7000.1

0.2

0.5

1.0

2.0

“Glimpsing” speech in noise• “The information conveyed by the spectro-

temporal energy spectrum of clean speech is redundant… Redundancy allows speech to be identified based on relatively sparse evidence.”

Time (ms)

Freq

uenc

y (H

z)

0 100 200 300 400 500 600 7000.1

0.2

0.5

1.0

2.0

Glimpsing speech in noise

• Can listeners take advantage of “glimpses”?direct attention to spectrotemporal regions where the S+N mixture is dominated by the target speech ASR system trained to recognize consonants in noiseMaskers differed in “glimpse size”ASR model developed to exploit non-uniform distribution of SNR in different time-frequency bandsConclusion: model + listeners benefit from glimpsing.

Speech + noise mixtures• Some regions dominated by target voice• Local SNR varies across time and frequency• Where the target voice dominates, the problem

of source segregation is solved because the signal is effectively “clean” speech.

• Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering

STEP model

• Auditory excitation pattern (Moore, 2003)Spectrogram-like representationReflects non-uniform frequency selectivity in different frequency bandsIncorporates a sliding time window reflecting temporal analysis by the auditory systemRelative audibility at different frequenciesLoudness model

11

Missing data ASR

• HMM-based speech recognizer• “Missing-data” models

Glimpses only• Ignore missing information (in masked regions)

Glimpses-plus-background• Try to fill in missing information (based on masked

regions)

Sparseness and redundancy• Glimpses = spectrotemporal regions where

signal exceeds masker by ~3 dB.

single talkermasker

eight-talkermasker

speech-shapednoise

target

glimpses

Syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95dB. (After Miller 1947).

Results Results

FIG. 4. The correlation between intelligibility and proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.

12

Conclusions

• Best model:Uses information in glimpses and counterevidence in the masked regionsGlimpses constrained to a minimum areaTreats all regions with local SNR > -5 dB as potential glimpses

Conclusions• A higher “glimpse threshold” (e.g. local SNR

> 0 dB) produces fewer glimpses, but this provides less distorted information than a lower threshold (e.g. -5 dB).

Conclusions• Limitation: local SNR must be known in

advance. Is there a way to estimate the local SNR directly from the mixture?

• Tracking problem: how to integrate glimpses over time?

Brungart et al. (2001)

–12 –9 –6 –3 0 3 6 9 12 Target-to-Masker Ratio (dB)

2-ta

lker

cor

rect

resp

onse

s (%

)

Same talkerSame talkerDifferent talker, same sexModulated noiseDifferent talker, different sex

Brungart et al. (2001)

–12 –9 –6 –3 0 3 6 9 12 Target-to-Masker Ratio (dB)

2-ta

lker

cor

rect

res

pons

es (%

)

hcs 7367 speech perception - university of texas at dallasassmann/hcs6367/lec9.pdfdiscrimination of...

Documents