HCS 7367 Speech Perception
University of Texas at Dallas
Dr. Peter Assmann, Fall 2012
Long-term spectrum of speech
[Figure: long-term spectrum of speech. Labels: Females, Males, Connected speech, Vowels, Conversational speech, Absolute threshold (normal listeners). Axes: Frequency (Hz), 62.5 to 32k; Sound pressure level (dB), 0 to 120.]
Audibility of speech
Types of hearing loss
• Conductive loss
• Sensorineural loss
• Audibility/distortion
• Effect of noise
Source: www.brainconnection.com
Pure-tone audiogram
[Figure: pure-tone audiograms for the left ear and right ear, showing bone conduction thresholds and air conduction thresholds. Panels: Normal; Conductive loss. Axes: frequency, 250 Hz to 4 kHz; Hearing loss in dB (ANSI-1996), -10 to 130.]
Source: www.bcm.tmc.edu/oto/studs/aud.html
Pure-tone audiogram
[Figure: pure-tone audiograms for the left ear and right ear, showing bone conduction thresholds and air conduction thresholds. Panels: Sensorineural loss; Mixed loss. Axes: frequency, 250 Hz to 4 kHz; Hearing loss in dB (ANSI-1996), -10 to 130.]
Source: www.bcm.tmc.edu/oto/studs/aud.html
Speech audiometry
• Nonsense syllables, real words, words in sentences
• Threshold for recognizing 50% of test items
• Percentage of items correctly reported
• Speech tests provide a valid measure of hearing handicap.
• Poor speech scores may indicate hearing loss of retrocochlear origin.
Sensorineural hearing loss
• Listeners with cochlear hearing loss have difficulty recognizing speech when background noise is present.
Reduced audibility
Supra-threshold “distortions”
• Impaired frequency selectivity
• Loudness recruitment
Speech recognition in noise
• Speech reception threshold, SRT (Plomp & Mimpen, 1969)
Speech-to-noise ratio required to achieve a specific level of intelligibility, typically 50%
Effects of speech materials
Effects of type of masker (e.g., speech-shaped noise vs. a single competing talker)
Effects of spatial separation of target & masker
Speech recognition in noise
Masker type         | Listening situation                                | Deficit in SRT
Speech-shaped noise | Speech+masker in front, unaided                    | 2.5 - 7.0 dB
Speech-shaped noise | Speech+masker in front, aided                      | 2.5 - 6.0 dB
Single talker       | Speech+masker in front, unaided                    | 6.0 - 12.0 dB
Single talker       | Speech+masker in front, aided                      | 4.0 - 10.0 dB
Single talker       | Speech+masker in front, spatially separated        | 12.0 - 19.0 dB
Source: Moore, BCJ (2003) Speech Communication
Articulation Index
• How much does audibility contribute to difficulty understanding speech in noise?
• Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility
Articulation Index
1. Divides the speech and masker spectrum into a small number of frequency bands
2. Estimates the audibility of speech in each band, weighted by its relative importance for intelligibility
3. Derives overall intelligibility by summing the contributions of each band.
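The three steps above can be sketched in a few lines. This is only an illustration of the band-weighting idea, not the standardized AI procedure: the band levels, the importance weights, and the linear mapping of band SNR from -12 dB to +18 dB onto 0..1 are assumptions chosen for the example.

```python
def band_audibility(speech_db, noise_db, floor_db=-12.0, range_db=30.0):
    """Map a band's speech-to-noise ratio (dB) linearly onto 0..1.

    The -12 dB floor and 30 dB range are assumed values for illustration.
    """
    snr_db = speech_db - noise_db
    return min(1.0, max(0.0, (snr_db - floor_db) / range_db))


def articulation_index(speech_db, noise_db, weights):
    """Sum band audibilities, each weighted by its band importance."""
    total_weight = sum(weights)
    weighted = sum(
        w * band_audibility(s, n)
        for s, n, w in zip(speech_db, noise_db, weights)
    )
    return weighted / total_weight


# Hypothetical four-band example (levels in dB, weights sum to 1).
speech_levels = [60.0, 58.0, 52.0, 45.0]
noise_levels = [50.0, 50.0, 50.0, 50.0]
band_weights = [0.2, 0.3, 0.3, 0.2]

ai = articulation_index(speech_levels, noise_levels, band_weights)
print(f"AI = {ai:.3f}")
```

An AI near 1 means the speech is audible in every important band; an AI near 0 means it is masked in all of them.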
Articulation Index
• Most studies show that speech intelligibility is worse than predicted by the AI for hearing-impaired listeners, especially for moderate or severe hearing loss.
Articulation Index
• Conclusion: factors other than audibility must be responsible for the difficulties experienced by hearing-impaired listeners understanding speech in noise.
• What else?
Frequency selectivity
Temporal resolution
Frequency Selectivity
• Frequency selectivity is the ability to resolve the spectral components of complex sounds.
• Reduced frequency selectivity may lead to difficulty in understanding speech in noise.
Auditory filters
• Fletcher (1940) suggested that the peripheral auditory system could be modeled as a bank of linear bandpass filters with continuously overlapping center frequencies.
[Figure: bank of overlapping bandpass filters. Axes: Frequency; Gain.]
Auditory filters
• Each point along the basilar membrane corresponds to a filter with a different center frequency, with center frequencies increasing roughly logarithmically from the apex to the base.
Auditory filters
• About half of the length of the human basilar membrane is devoted to the lowest kHz (F1 range of speech) with the majority of neural fibers responding best to low-to-mid-frequencies.
Critical Bandwidth
• Fletcher (1940) band-widening experiment
The threshold for detecting a pure tone in the presence of a bandpass noise masker increases as the noise bandwidth increases, until the width of the band exceeds the critical bandwidth of the auditory filter.
[Figure: tone detection threshold as a function of noise masker bandwidth.]
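A minimal sketch of the band-widening result, assuming an idealized rectangular auditory filter and a tone that is just detected when its power equals a constant K times the noise power passed by the filter. The 160 Hz critical bandwidth, the 40 dB noise density, and K = 1 are hypothetical values for illustration.

```python
import math


def tone_threshold_db(noise_bw_hz, critical_bw_hz, n0_db=40.0, k=1.0):
    """Masked threshold of a tone centered in a band of noise.

    Only noise inside the (assumed rectangular) auditory filter masks the
    tone, so the effective bandwidth stops growing at the critical bandwidth.
    """
    effective_bw_hz = min(noise_bw_hz, critical_bw_hz)
    return n0_db + 10.0 * math.log10(k * effective_bw_hz)


CRITICAL_BW_HZ = 160.0  # hypothetical value
for bw in (20, 40, 80, 160, 320, 640):
    print(bw, round(tone_threshold_db(bw, CRITICAL_BW_HZ), 1))
```

In this idealized model the threshold rises by about 3 dB per doubling of noise bandwidth and then stays flat once the noise is wider than the filter.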
Critical Bandwidth
• Sources of evidence for critical bandwidth:
Band-widening experiments (Fletcher, 1940)
Loudness summation (Zwicker et al., 1957)
Two-tone masking (Zwicker, 1954)
Discrimination of partials within complex tones (Plomp and Mimpen, 1968): only the lowest 5-8 partials can be reliably discriminated.
Critical Bandwidth
• Fletcher (1940) made the simplifying assumption that the auditory filter could be modeled as a rectangle, with flat top and vertical slopes.
[Figure: rectangular auditory filter of width CB. Axes: Frequency; Gain.]
Power spectrum model of masking
• Fletcher suggested that only a narrow band of frequencies in the region of the tone contributes to masking.
• He called this the critical bandwidth (CB).
[Figure: auditory filter. Axes: Frequency; Gain.]
Power spectrum model of masking
• But threshold changes gradually as the noise bandwidth increases, suggesting auditory filters with sloping rather than rectangular skirts (Patterson, 1976).
[Figure: auditory filter with sloping skirts, with CB marked. Axes: Frequency; Gain.]
Power spectrum model of masking
• Detection of a probe tone in the presence of a noise masker depends on the relative power of the probe and the noise passed by the auditory filter centered on the tone (Patterson, 1976).
[Figure: auditory filter centered on the tone, with the noise masker. Axis: Frequency.]
Power spectrum model of masking
• Noise power is often specified as the power in a band of frequencies 1 Hz wide. This is called noise power density, designated $N_0$.
• The total power in a band of noise is calculated as $W N_0$, where $W$ is the noise bandwidth in Hz.
Power spectrum model of masking
• When the noise just masks the tone, the ratio of the power in the tone to the power in the noise is a constant, K.
$$\frac{P}{W N_0} = K \quad \text{and} \quad W = \frac{P}{K N_0}$$
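A short worked example of the relation W = P / (K N0), where P is the tone power at masked threshold and N0 is the noise power density. Working in decibels, the ratio P / N0 is the difference between the tone threshold and the noise density level. The threshold and noise levels below are hypothetical, and K = 1 is an assumption made for the first estimate.

```python
def critical_band_estimate_hz(tone_threshold_db, noise_density_db, k=1.0):
    """Estimate W = P / (K * N0) from levels given in dB.

    tone_threshold_db: level of the tone when it is just masked (P, in dB)
    noise_density_db:  noise level in a 1 Hz band (N0, in dB)
    k:                 assumed tone-to-noise power ratio at threshold
    """
    ratio_db = tone_threshold_db - noise_density_db   # 10*log10(P / N0)
    return 10.0 ** (ratio_db / 10.0) / k


# Hypothetical measurement: tone threshold 58 dB, noise density 40 dB.
w_k1 = critical_band_estimate_hz(58.0, 40.0)           # assumes K = 1
w_k04 = critical_band_estimate_hz(58.0, 40.0, k=0.4)   # assumes K = 0.4
print(round(w_k1, 1), round(w_k04, 1))
```

The estimate scales inversely with K, so the assumed value of K matters as much as the measured threshold.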
Power spectrum model of masking
• Assumptions:
Only frequencies within the passband of the auditory filter contribute to masking.
Detection is based on a single auditory filter, centered on the frequency of the tone.
Listeners ignore short-term fluctuations in the noise, and do not rely on phase differences between signal and noise.
Notched noise method
[Figure: tone centered in the auditory filter, flanked by low-pass (LP) noise and high-pass (HP) noise. Patterson (1976).]
Off-frequency listening
[Figure: tone, HP noise, and shifted filter.]
Tone detection can be improved by shifting the filter center frequency to maximize SNR.
Notched noise method
• Patterson (1976) estimated auditory filter shapes from the function relating tone threshold to notch width.
• The derived filters have a rounded top and steep skirts, with bandwidths 10-15% of filter center frequency.
[Figure: derived auditory filter shape. Axes: Frequency (Hz), 400 to 1600; Relative amplitude (dB), -50 to 0.]
Simulation of reduced frequency selectivity
[Figure: derived auditory filter shapes. Panels: Normal; Impaired (3 × Normal). Axes: Frequency (Hz), 400 to 1600; Relative amplitude (dB), -50 to 0.]
Frequency response of gammatone filter bank
[Figure: gammatone filter bank frequency responses; example filter with Fc = 1094 Hz, ERB = 143 Hz. Axes: Frequency (kHz), 0 to 5; Filter gain (dB), -60 to 0.]
Auditory filter shapes as a function of frequency
Auditory filter shapes as a function of level
[Figure: auditory filter shapes at different levels. Axes: Frequency (Hz), 0 to 2000; Output level (dB), 0 to 90.]
Equivalent Rectangular Bandwidth
• The equivalent rectangular bandwidth (or ERB) of a filter is the bandwidth of a rectangular filter which has the same peak gain and the same power output as that filter, when the input is white noise.
[Figure: auditory filter shape with the ERB marked. Axes: Frequency (Hz), 400 to 1600; Relative amplitude (dB), -50 to 0.]
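The ERB definition can be checked numerically: integrate the filter's power gain over frequency and divide by its peak power gain. The sketch below assumes a rounded-exponential, roex(p), filter shape purely as a convenient test case (the slides do not specify this form); for that shape the ERB is known to equal 4 fc / p. The center frequency and p value are hypothetical.

```python
import math


def roex_power_gain(f_hz, fc_hz, p):
    """Assumed roex(p) power weighting: (1 + p*g) * exp(-p*g)."""
    g = abs(f_hz - fc_hz) / fc_hz      # normalized distance from center
    return (1.0 + p * g) * math.exp(-p * g)


def erb_hz(power_gain, f_lo_hz, f_hi_hz, step_hz=0.5):
    """ERB = (area under the power-gain curve) / (peak power gain)."""
    n = int(round((f_hi_hz - f_lo_hz) / step_hz))
    freqs = [f_lo_hz + i * step_hz for i in range(n + 1)]
    gains = [power_gain(f) for f in freqs]
    # Trapezoidal integration of the power gain over frequency.
    area = sum(0.5 * (gains[i] + gains[i + 1]) * step_hz for i in range(n))
    return area / max(gains)


FC_HZ = 1000.0   # hypothetical center frequency
P = 30.0         # hypothetical slope parameter

erb = erb_hz(lambda f: roex_power_gain(f, FC_HZ, P), 0.0, 4000.0)
print(round(erb, 2), round(4.0 * FC_HZ / P, 2))
```

The same `erb_hz` function works for any power-gain curve, for example one measured with the notched-noise method.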
Equivalent Rectangular Bandwidth
• ERB (equivalent rectangular bandwidth of the estimated auditory filter): about 10-15% of the filter center frequency.
[Figure: ERB as a function of center frequency, logarithmic axes. Center frequency, 0.05 to 10 kHz; ERB, 0.02 to 2 kHz.]
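One widely used approximation for the ERB of normal-hearing listeners is the Glasberg and Moore (1990) formula, ERB = 24.7 (4.37 F + 1) with F in kHz. It is not given on the slides and is brought in here only to illustrate the 10-15% figure; it also agrees with the gammatone example value of ERB = 143 Hz at Fc = 1094 Hz.

```python
def erb_glasberg_moore_hz(fc_hz):
    """ERB (Hz) at center frequency fc_hz, Glasberg & Moore (1990) formula."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


for fc in (250.0, 500.0, 1000.0, 1094.0, 2000.0, 4000.0):
    erb = erb_glasberg_moore_hz(fc)
    print(f"fc = {fc:6.0f} Hz  ERB = {erb:6.1f} Hz  ({100.0 * erb / fc:4.1f}% of fc)")
```

With this formula the ERB falls within 10-15% of the center frequency from about 1 kHz upward and is proportionally wider at low frequencies.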
Cochlear frequency-place map
• Greenwood (1961) developed a function to relate the characteristic frequency (CF) at each place on the cochlea to the distance (x) of that place from the apex.
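A sketch of the Greenwood frequency-place function in its common form, CF = A (10^(a x) - k). The constants below (A = 165.4 Hz, a = 2.1 with x expressed as a proportion of basilar membrane length from the apex, k = 0.88) are the human values usually quoted from Greenwood's later work and are assumptions here, since the slides do not list them.

```python
import math

A_HZ = 165.4   # assumed human scaling constant
ALPHA = 2.1    # assumed slope, for x as a proportion (0 = apex, 1 = base)
K = 0.88       # assumed low-frequency correction


def greenwood_cf_hz(x):
    """Characteristic frequency at relative distance x from the apex."""
    return A_HZ * (10.0 ** (ALPHA * x) - K)


def greenwood_place(cf_hz):
    """Inverse map: relative distance from the apex for a given CF."""
    return math.log10(cf_hz / A_HZ + K) / ALPHA


for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:4.2f}  CF = {greenwood_cf_hz(x):8.1f} Hz")
```

With these constants the map runs from about 20 Hz at the apex to about 20.7 kHz at the base.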
ERB Scale
• One ERB unit corresponds to a distance of about 0.89 mm along the basilar membrane.
[Figure: human data.]
ERB-rate scale
• The ERB-rate scale is a warped frequency scale modeling changes in the ERB of the auditory filter as a function of frequency.
[Figure: ERB-rate as a function of frequency. Axes: Frequency (Hz), 0 to 4000; ERB-rate (ERB), 0 to 30.]
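A sketch of one common ERB-rate formula, ERB-rate = 21.4 log10(4.37 F + 1) with F in kHz (Glasberg and Moore, 1990). The slides do not state which formula underlies the plot, so this particular form is an assumption; it does give values in the 0 to 30 range over 0 to 4000 Hz.

```python
import math


def hz_to_erb_rate(f_hz):
    """Number of ERBs below f_hz (assumed Glasberg & Moore form)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37 * 1000.0


for f in (0.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f"{f:6.0f} Hz -> {hz_to_erb_rate(f):5.2f} ERB")

# Center frequencies equally spaced on the ERB-rate scale, as a filter bank
# model might use them (the choice of 10 channels is arbitrary).
top = hz_to_erb_rate(4000.0)
centers_hz = [erb_rate_to_hz(top * i / 9) for i in range(10)]
```

Equal steps on this scale correspond to narrow frequency steps at low frequencies and wide steps at high frequencies.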
Excitation patterns
• Auditory excitation patterns show the composite output of a bank of simulated auditory filters as a function of filter center frequency.
[Figure: filter output as a function of filter center frequency.]
Excitation patterns
• Excitation patterns provide a good model of auditory frequency selectivity and masking: frequency components that are resolved by the auditory system produce distinct peaks in the excitation pattern.
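A minimal excitation-pattern sketch for a complex tone with equal-amplitude harmonics of 200 Hz. It assumes symmetric roex(p) filters whose bandwidths follow the Glasberg and Moore ERB formula; both are simplifying assumptions, not the model used to produce the figures in these slides. The `broadening` parameter is a hypothetical knob that widens every filter, in the spirit of the "3 × normal" simulation of reduced frequency selectivity.

```python
import math


def erb_hz(fc_hz):
    """Assumed normal-hearing ERB (Glasberg & Moore form)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def roex_weight(f_hz, fc_hz, broadening=1.0):
    """Power weighting of a component at f_hz by the filter centered at fc_hz."""
    p = 4.0 * fc_hz / (broadening * erb_hz(fc_hz))
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)


def excitation_db(fc_hz, components, broadening=1.0):
    """Output level of one filter; components is a list of (freq_hz, power)."""
    power = sum(pw * roex_weight(f, fc_hz, broadening) for f, pw in components)
    return 10.0 * math.log10(power)


harmonics = [(200.0 * n, 1.0) for n in range(1, 26)]   # 200 Hz to 5 kHz

# Peak-to-valley difference (dB): a filter centered on a harmonic versus a
# filter centered halfway between two harmonics.
low_pv = excitation_db(200.0, harmonics) - excitation_db(300.0, harmonics)
high_pv = excitation_db(3000.0, harmonics) - excitation_db(3100.0, harmonics)
low_pv_broad = (excitation_db(200.0, harmonics, 3.0)
                - excitation_db(300.0, harmonics, 3.0))
print(round(low_pv, 1), round(high_pv, 1), round(low_pv_broad, 1))
```

Under these assumptions the lowest harmonics stand out as distinct peaks, high harmonics merge into a nearly flat pattern, and tripling the filter bandwidths largely removes the low-frequency peaks as well.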
Excitation patterns
[Figure: block diagram with stages Outer and middle ears, Cochlear filtering, Energy detectors, CNS; channel axis: Frequency (ERB-rate).]
Excitation patterns
[Figure: two panels for a 500 Hz pure tone. Axes: Frequency (kHz), 0.2 to 4.0; Amplitude (dB), -60 to 0.]
Excitation patterns
Complex tone, equal amplitude harmonics
[Figure: Axes: Frequency (kHz), 0.2 to 4.0; Amplitude (dB), -60 to 0.]
Excitation patterns
Vowel: /æ/, F0 = 200 Hz
[Figure: excitation pattern of the vowel with labeled peaks at 200 Hz, 400, 600, and 800 (harmonics), F2 at 1450 Hz, and F3 at 2450 Hz.]
Auditory filterbank spectrogram
[Figure: Axes: Time (ms), 0 to 700; Frequency, 0.1 to 2.0 kHz.]
Simulation studies
• Simulation of reduced frequency selectivity (spectral smearing of the short-term speech spectrum) results in lowered intelligibility for listeners with normal hearing, particularly in noise (ter Keurs et al., 1993; Baer & Moore, 1994)
Simulation of reduced frequency selectivity
[Figure: derived auditory filter shapes. Panels: Normal; Impaired (3 × Normal). Axes: Frequency (Hz), 400 to 1600; Relative amplitude (dB), -50 to 0.]
Effects of reduced frequency selectivity on vowel /æ/
F0 = 200 Hz
[Figure: excitation patterns for filter bandwidths of 1 x normal, 2 x normal, and 3 x normal, with labeled peaks at 200 Hz, 400, 600, and 800 (harmonics), F2 at 1450 Hz, and F3 at 2450 Hz. Axes: Frequency (kHz), 0.2 to 4.0; Amplitude (dB), -60 to 0.]
Distortion of spectral shape
• Broader auditory filters produce a “smeared” excitation pattern: reduced prominence of peaks, smaller peak-to-valley ratios.
• Introduction of noise fills up the valleys between the spectral peaks and reduces the distinctiveness of the spectral profile.
Distortion of temporal structure
• Broader auditory filters alter the temporal fine structure of the output.
• Increased contribution of adjacent components
• Increase in within-channel modulation
• Diminished differences between adjacent channels
Effects of reduced frequency selectivity on temporal structure
[Figure: filter outputs over time. Panels: Normal x 1; Normal x 3. Axes: Time, 0 to 300; Filter center frequency (Hz), 0 to 4515 in steps of 301.]
Loudness Recruitment
• When a sound is increased in level above absolute threshold, the rate of growth of loudness is greater than normal.
• At levels >90-100 dB SPL, loudness returns to normal (sound appears equally loud to hearing-impaired and normal listeners).
Loudness Recruitment
• Loudness recruitment is associated with reduced dynamic range (range between absolute threshold and highest comfortable level).
• Recruitment may reduce the ability to “listen in the dips” in a fluctuating masker, such as a competing voice.
• Recruitment distorts loudness relationships among components of speech sounds.
A glimpsing model of speech perception in noise
Martin Cooke
Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562–1573, March 2006
“Glimpsing” speech in noise
• “speech is a highly modulated signal in time and frequency, regions of high energy are typically sparsely distributed.”
[Figure: auditory spectrogram of speech. Axes: Time (ms), 0 to 700; Frequency, 0.1 to 2.0 kHz.]
“Glimpsing” speech in noise
• “The information conveyed by the spectro-temporal energy spectrum of clean speech is redundant… Redundancy allows speech to be identified based on relatively sparse evidence.”
[Figure: auditory spectrogram of speech. Axes: Time (ms), 0 to 700; Frequency, 0.1 to 2.0 kHz.]
Glimpsing speech in noise
• Can listeners take advantage of “glimpses”?
Direct attention to spectrotemporal regions where the S+N mixture is dominated by the target speech
ASR system trained to recognize consonants in noise
Maskers differed in “glimpse size”
ASR model developed to exploit non-uniform distribution of SNR in different time-frequency bands
Conclusion: model + listeners benefit from glimpsing.
Speech + noise mixtures
• Some regions dominated by target voice
• Local SNR varies across time and frequency
• Where the target voice dominates, the problem of source segregation is solved because the signal is effectively “clean” speech.
• Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering
STEP model
• Auditory excitation pattern (Moore, 2003)
Spectrogram-like representation
Reflects non-uniform frequency selectivity in different frequency bands
Incorporates a sliding time window reflecting temporal analysis by the auditory system
Relative audibility at different frequencies
Loudness model
Missing data ASR
• HMM-based speech recognizer
• “Missing-data” models
Glimpses only
• Ignore missing information (in masked regions)
Glimpses-plus-background
• Try to fill in missing information (based on masked regions)
Sparseness and redundancy
• Glimpses = spectrotemporal regions where signal exceeds masker by ~3 dB.
[Figure: target and glimpses for three maskers: single-talker masker, eight-talker masker, speech-shaped noise.]
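The glimpse definition can be sketched as a mask over time-frequency cells: a cell counts as a glimpse when the target level exceeds the masker level by more than the threshold. The toy 2 x 4 grids of levels below are hypothetical; a real analysis would use auditory spectrograms of the target and masker computed separately before mixing.

```python
def glimpse_mask(target_db, masker_db, threshold_db=3.0):
    """True where the local SNR (target minus masker, in dB) exceeds threshold."""
    return [
        [t - m > threshold_db for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(target_db, masker_db)
    ]


def glimpse_proportion(mask):
    """Fraction of time-frequency cells that qualify as glimpses."""
    cells = [cell for row in mask for cell in row]
    return sum(cells) / len(cells)


# Hypothetical levels (dB): rows are frequency channels, columns are frames.
target = [[60.0, 40.0, 55.0, 30.0],
          [50.0, 65.0, 35.0, 45.0]]
masker = [[50.0, 50.0, 50.0, 50.0],
          [50.0, 50.0, 50.0, 50.0]]

mask = glimpse_mask(target, masker)
proportion = glimpse_proportion(mask)
print(proportion)
```

Changing `threshold_db` trades the number of glimpses against how cleanly each one reflects the target.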
[Figure: syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95 dB. (After Miller, 1947).]
Results
FIG. 4. The correlation between intelligibility and proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.
Conclusions
• Best model:
Uses information in glimpses and counterevidence in the masked regions
Glimpses constrained to a minimum area
Treats all regions with local SNR > -5 dB as potential glimpses
Conclusions
• A higher “glimpse threshold” (e.g. local SNR > 0 dB) produces fewer glimpses, but this provides less distorted information than a lower threshold (e.g. -5 dB).
Conclusions
• Limitation: local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture?
• Tracking problem: how to integrate glimpses over time?
Brungart et al. (2001)
[Figure: 2-talker correct responses (%) as a function of Target-to-Masker Ratio (dB), -12 to 12. Conditions: same talker; different talker, same sex; different talker, different sex; modulated noise.]
Brungart et al. (2001)
[Figure: 2-talker correct responses (%) as a function of Target-to-Masker Ratio (dB), -12 to 12.]