introduction to speech processing - iit bombay · p. c. pandey / introduction to speech processing...
TRANSCRIPT
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
1
Introduction to
SPEECH PROCESSING
P. C. Pandey
EE Dept., IIT Bombay
References
1. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, 2004 2. B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley, 2002 3. D. O'Shaughnessy, Speech Communications: Human and Machine, Universities Press, 2001
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
2
Speech communication
Speech production Propagation of the acoustic wave Speech perception by the listener
Applications of speech processing
Generation Transmission & storage (coding, speech enhancement) Reception (speaker & speech recognition, quality assessment) Aids for the disabled
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
3
Applications of speech processing
Generation (speech synthesis) Machine voice output Reading aid for the blind Testing of speech communication systems Study of speech production & perception system Speech production aids
Transmission / storage Coding reduction in channel capacity requirements encryption Reduction in susceptibility to noise Speech enhancement Voice transformation: change of accent, speaker identity
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
4
Reception Analysis for studying speech signal Diagnostic aid for speech disorders Speaker identification & quality assessment Speech recognition Aids for the hearing impaired Variable rate playback Visual / tactile displays Effective use of residual hearing Electrical stimulation of auditory nerve
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
5
Speech units
Words Syllables Phonemes Sub-phonemic events
Speech transcription (scripts and alphabet systems)
Pictographic Alphabetic Phonemic Syllabic
International Phonetic Alphabet (IPA)
/^/, /a/, /k/, /kh/, / /, etc.
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
6
Efficiency of phonemic transcription
Measurement of information
Information: mean logarithmic probability (MLP).
For a set of messages , ....1 2x x xn with probabilities ,1 2...p p pn ,
information
( ) ( )MLP log2
1
1/ bitsn
Ii
i
x p pi i= ==
If the messages occur at the rate of M messages/s, then the rate of information is
C M I= bits/s
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
7
Phonemic transcription of speech
Consider speech generated from phonemic transcription (written text). No of phonemes in English = 42. Considering them equiprobable, p = 1/42
( )
42
log21
1/ 5.39 bitsIi
p pii==
=
For conversational speech, with10 phonemes/s. C M I= = 53.9 bits/s.
Considering actual phoneme probabilities, I = 4.9 bits and C= 49 bits/s
Considering relatedness of sequence of phonemes, C 45 bits/s
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
8
Channel capacity requirements
Channel capacity of an analog channel, with signal power S , noise power N , and bandwidth B Hz ,
log 12 bits/s
SC B
N
= +
For conventional analog telephony, B = 3 kHz, SNR = 30 dB S/N = 1000
( )3
3 10 log 1 1000 30 k bits/s2C = +
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
9
Now consider good quality speech Digitization (without any data compression techniques)
12 bit quantization (212 = 4096 levels) Sampling rate = 10 k samples/s
Channel capacity required for digital transmission
C = (10103)log2 2
12 = 120 k bits/s
Thus we have three different estimates for channel capacity requirements for speech
Analog telephony (3 kHz bandwidth, 30 dB SNR): 30 k bits/s
Digital transmission (12-bit, 10 k samples/s, no encoding): 120 k bits/s
Phonemic transcription: 45 bits/s
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
10
Auditory system
Peripheral auditory system
External ear : pinna and auditory canal (sound collection)
Middle ear : ear drum and middle ear bones (impedance matching)
Inner ear : cochlea (analysis and transduction)
Auditory nerve (transmission of neural impulses)
Central auditory system (information interpretation)
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
11
Outer ear Inner earMiddle ear
External canal
Pinna
Nasal cavity
Cochlea
Cochlear nerve
Vestibular nerve
Oval windowRound window
Eustachian tube
Eardrum (Tympanicmembrane)
Vestibular apparatus with semicircular canals
MalleusIncusStapes
Peripheral auditory system (outer, middle, and inner ear). (Adapted from Flanagan (1972a), Fig. 4.1.)
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
12
A longitudinal section through the uncoiled cochlea (Source: Pickles (1982), Fig. 3.1.(C))
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
13
The mechanism of
speech production
A schematic diagram of the human vocal mechanism
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
14
A schematic representation of the vocal apparatus. (Adapted from Rabiner & Schafer (1978), Fig. 3.2)
Pitch : F0 (rate of vibration of vocal cords)
Formant frequencies: F1, F2, .. (resonance frequencies of vocal tract filter)
( )u t ( )s t
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
15
Phonemic features
Modes of excitation Glottal: voiced unvoiced (aspiration) Frication (constriction in vocal tract) : voiced unvoiced
Movement of articulators Continuant (steady vocal tract): vowels, nasal stops, fricatives
Non-continuant (changing vocal tract): diphthongs, semivowels, plosives (oral stops)
Place of articulation bilabial, labio-dental, linguo-dental, alveolar, palatal, velar, gluttoral
Changes in Fo
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
16
Classification
of English
phonemes
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
17
Classification
of Hindi
phonemes
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
18
Suprasegmental features Intonation Rhythm (syllable stressing) Carriers Changes in intensity Syllable duration Changes in voice pitch
Visual features Partial information on place of articulation
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
19
Short-time speech analysis
( ) ( ) ( )s n s n w n m
windowing
Storage/display/transmission
decimation ( )Q n
k
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
20
Speech analysis techniques Time domain analysis Intensity Zero crossing rate Autocorrelation AMDF Frequency domain analysis Filter bank analysis Short-time Fourier analysis Spectrographic displays Formant estimation & tracking F0 estimation and tracking Separation of excitation and vocal tract filter effects Cepstral analysis Linear predictive coding (LPC)
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
21
Multi-resolution spectrographic analysis
Analog analysis Spectral analysis using bandpass filter & demodulator
( ) ( )t dBs t S f
f = 300 Hz (wideband), 45 Hz (narrow band)
Digital analysis
Change in resolution, implemented by padding L-sample segment with N-L zero valued
samples, and using N-point DFT. Pre-emphasis for boosting high-frequency components.
( )
1 2
0
Short-time Fourier transform: ( , ) ( ) ( )
2Hamming window : ( ) 0.54 0.46cos 2 ,1 2 2
( , ) 20log ( , ) /
For 10 k Sa/s, 43 300 Hz
289 45 Hz
mkN jN
m
dB ref
s
X n k w m x n m e
Lnt
L Lw n n
L
S n k X n k X
f L f
L f
=
=
=
=
= =
= =
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
22
Vowel
triangle
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
23
Spectrograms for English phonemes (vowels or /aCa/ syllable)
Place of articulation Manner Voicing Back Mid Front
unvoiced k t p Oral stop voiced g d b unvoiced Nasal stop voiced ng n m unvoiced (sh) s f Oral fricative voiced (zh) z v
unvoiced Nasal fricative voiced
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
24
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
25
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
26
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
27
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
28
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
29
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
30
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
31
Cepstral analysis
( )v n( )e n ( ) ( ) ( )s n e n v n=
( ) ( ) ( )S z E z V z=
We can use "log" transformation for deconvolution and separation of excitation & vocal tract filter parameters.
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
32
{ } { }
{ }
1 1
1
Complex cepstrum ( ) log ( ) log ( )
Cepstrum ( ) = log ( )
j
j
x
x n Z X z F X e
c n F X e
= =
s
[ ]
( ) ( )* ( )
( ) ( ) ( )
log ( ) log ( ) log ( )
s n e n v n
S z E z V z
S z E z V z
=
=
= +
1Z
( ) ( ) ( )s n e n v n
= +
( )x n ( )X zZ log
( )X z
1Z
( )x n
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
33
Linear predictive coding (LPC)
Excitation e(n) : modeled as an impulse or random noise. Vocal tract filter transfer function V(z): modeled as an
all-pole filter (AR model).
Inverse filter A(z): all-zero filter (FIR or MA filter), with parameters
estimated for minimizing ( )n in LMS sense. Result
Error ( )n acquires the characteristics of the excitation function
Inverse filter coefficients, obtained by solving a set of linear equations, approximate the vocal tract filter coefficients.
( )V z( )e n( )s n
( )n+
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
34
Speech production
( )V z( )e n ( )s n
1
1 1( )
1 ( )1
pk
k
k
V zA z
a z
=
= =
1
( ) ( ) ( )p
k
k
e n s n a s n k=
=
-
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
35
Linear prediction model
( )A z( )s n ( )s n
( )n
1
( ) ( )p
k
k
s n a s n k
=
=
1
( ) ( ) ( )
( ) ( )p
k
k
n s n s n
s n a s n k
=
=
=