  • P. C. Pandey / Introduction to Speech Processing / Jan. 2008

    Introduction to Speech Processing

    P. C. Pandey

    EE Dept., IIT Bombay

    References

    1. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, 2004
    2. B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley, 2002
    3. D. O'Shaughnessy, Speech Communications: Human and Machine, Universities Press, 2001

    Speech communication

      • Speech production
      • Propagation of the acoustic wave
      • Speech perception by the listener

    Applications of speech processing

      • Generation
      • Transmission & storage (coding, speech enhancement)
      • Reception (speaker & speech recognition, quality assessment)
      • Aids for the disabled

    Applications of speech processing

    Generation (speech synthesis)

      • Machine voice output
      • Reading aid for the blind
      • Testing of speech communication systems
      • Study of speech production & perception system
      • Speech production aids

    Transmission / storage

      • Coding: reduction in channel capacity requirements, encryption
      • Reduction in susceptibility to noise
      • Speech enhancement
      • Voice transformation: change of accent, speaker identity

    Reception

      • Analysis for studying speech signal
      • Diagnostic aid for speech disorders
      • Speaker identification & quality assessment
      • Speech recognition
      • Aids for the hearing impaired
          • Variable rate playback
          • Visual / tactile displays
          • Effective use of residual hearing
          • Electrical stimulation of auditory nerve

    Speech units

      • Words
      • Syllables
      • Phonemes
      • Sub-phonemic events

    Speech transcription (scripts and alphabet systems)

      • Pictographic
      • Alphabetic
      • Phonemic
      • Syllabic

    International Phonetic Alphabet (IPA)

    /^/, /a/, /k/, /kh/, / /, etc.

    Efficiency of phonemic transcription

    Measurement of information

    Information: mean logarithmic probability (MLP).

    For a set of messages $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, the information is

    $$I = \mathrm{MLP}(x_i) = \sum_{i=1}^{n} p_i \log_2 (1/p_i) \ \text{bits}$$

    If the messages occur at the rate of $M$ messages/s, then the rate of information is

    $$C = M I \ \text{bits/s}$$

    Phonemic transcription of speech

    Consider speech generated from phonemic transcription (written text). Number of phonemes in English = 42. Considering them equiprobable, $p_i = 1/42$ and

    $$I = \sum_{i=1}^{42} p_i \log_2 (1/p_i) = \log_2 42 = 5.39 \ \text{bits}$$

    For conversational speech, with 10 phonemes/s, $C = M I = 53.9$ bits/s.

    Considering actual phoneme probabilities, $I = 4.9$ bits and $C = 49$ bits/s.

    Considering relatedness of sequence of phonemes, $C \approx 45$ bits/s.
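    The equiprobable case can be checked numerically. The sketch below is only an illustration: the helper name `information_bits` is hypothetical, and the 42 phonemes and 10 phonemes/s are the values quoted on the slide. It does not reproduce the 4.9-bit figure, which needs the actual phoneme probabilities.

```python
import math

def information_bits(probs):
    """Mean logarithmic probability: sum of p_i * log2(1/p_i), in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

n_phonemes = 42                  # number of English phonemes (from the slide)
phoneme_rate = 10                # phonemes per second, conversational speech

uniform = [1.0 / n_phonemes] * n_phonemes
I = information_bits(uniform)    # bits per phoneme
C = phoneme_rate * I             # bits per second

print(f"I = {I:.2f} bits, C = {C:.1f} bits/s")
```

    Passing a non-uniform probability list to the same helper gives a smaller value of I, which is the direction of the correction from 5.39 to 4.9 bits on the slide.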

    Channel capacity requirements

    Channel capacity of an analog channel with signal power $S$, noise power $N$, and bandwidth $B$ Hz:

    $$C = B \log_2 \left( 1 + \frac{S}{N} \right) \ \text{bits/s}$$

    For conventional analog telephony, $B = 3$ kHz and SNR = 30 dB, i.e. $S/N = 1000$:

    $$C = 3 \times 10^{3} \log_2 (1 + 1000) \approx 30 \ \text{kbits/s}$$
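    A minimal numerical check of the capacity formula, assuming the telephony values given above (3 kHz, 30 dB). The function name `channel_capacity` is a hypothetical helper introduced for this sketch.

```python
import math

def channel_capacity(bandwidth_hz, snr_db):
    """Capacity C = B * log2(1 + S/N) in bits/s, with the SNR given in dB."""
    snr_linear = 10 ** (snr_db / 10)          # 30 dB corresponds to S/N = 1000
    return bandwidth_hz * math.log2(1 + snr_linear)

C_analog = channel_capacity(3000, 30)
print(f"C = {C_analog / 1000:.1f} kbits/s")   # close to the 30 kbits/s on the slide
```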

    Now consider good quality speech. Digitization (without any data compression techniques):

      • 12-bit quantization ($2^{12} = 4096$ levels)
      • Sampling rate = 10 k samples/s

    Channel capacity required for digital transmission:

    $$C = (10 \times 10^{3}) \log_2 2^{12} = 120 \ \text{kbits/s}$$

    Thus we have three different estimates of the channel capacity requirement for speech:

      • Analog telephony (3 kHz bandwidth, 30 dB SNR): 30 kbits/s
      • Digital transmission (12-bit, 10 k samples/s, no encoding): 120 kbits/s
      • Phonemic transcription: 45 bits/s
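    To make the gap between these estimates concrete, the following sketch recomputes the uncompressed digital rate and compares it with the other two figures. The 30 kbits/s and 45 bits/s values are taken from the slide as given; the variable names are illustrative.

```python
import math

sample_rate = 10_000                      # samples/s (from the slide)
levels = 2 ** 12                          # 12-bit quantization, 4096 levels
bits_per_sample = math.log2(levels)       # 12 bits
pcm_rate = sample_rate * bits_per_sample  # bits/s for uncompressed digital speech

analog_rate = 30_000.0                    # bits/s, analog telephony estimate (slide)
phonemic_rate = 45.0                      # bits/s, phonemic transcription estimate (slide)

print(f"PCM rate: {pcm_rate / 1000:.0f} kbits/s")
print(f"PCM / phonemic ratio: {pcm_rate / phonemic_rate:.0f}")
print(f"analog / phonemic ratio: {analog_rate / phonemic_rate:.0f}")
```

    The ratios come out at roughly 2700 and 670, which shows how much of the signal bandwidth carries information other than the phoneme sequence.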

    Auditory system

    Peripheral auditory system

    External ear : pinna and auditory canal (sound collection)

    Middle ear : ear drum and middle ear bones (impedance matching)

    Inner ear : cochlea (analysis and transduction)

    Auditory nerve (transmission of neural impulses)

    Central auditory system (information interpretation)

    Figure: Peripheral auditory system (outer, middle, and inner ear). (Adapted from Flanagan (1972a), Fig. 4.1.)

    Labels in the figure: outer ear (pinna, external canal); middle ear (eardrum (tympanic membrane), malleus, incus, stapes, Eustachian tube); inner ear (oval window, round window, cochlea, vestibular apparatus with semicircular canals, cochlear nerve, vestibular nerve); nasal cavity.

    A longitudinal section through the uncoiled cochlea (Source: Pickles (1982), Fig. 3.1.(C))

    The mechanism of speech production

    A schematic diagram of the human vocal mechanism

    A schematic representation of the vocal apparatus. (Adapted from Rabiner & Schafer (1978), Fig. 3.2)

    Pitch: F0 (rate of vibration of vocal cords)

    Formant frequencies: F1, F2, ... (resonance frequencies of vocal tract filter)

    Signal labels in the figure: $u(t)$, $s(t)$

    Phonemic features

    Modes of excitation

      • Glottal: voiced, unvoiced (aspiration)
      • Frication (constriction in vocal tract): voiced, unvoiced

    Movement of articulators

      • Continuant (steady vocal tract): vowels, nasal stops, fricatives
      • Non-continuant (changing vocal tract): diphthongs, semivowels, plosives (oral stops)

    Place of articulation

      • Bilabial, labio-dental, linguo-dental, alveolar, palatal, velar, glottal

    Changes in F0

    Classification of English phonemes

    Classification of Hindi phonemes

    Suprasegmental features

      • Intonation
      • Rhythm (syllable stressing)

    Carriers

      • Changes in intensity
      • Syllable duration
      • Changes in voice pitch

    Visual features

      • Partial information on place of articulation

    Short-time speech analysis

    Block diagram: $s(n)$ → windowing, giving $s(n)\, w(n-m)$ → decimation → storage / display / transmission. The diagram also carries the labels $Q(n)$ and $k$.
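    As an illustration of windowing followed by decimation, the sketch below computes a short-time energy contour, one value per frame. The test signal, the 20 ms frame length, and the 10 ms hop are assumptions made for the example; `short_time_energy` is a hypothetical helper, not a function from the lecture.

```python
import numpy as np

fs = 10_000                               # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                    # 1 s of samples
# Hypothetical test signal: 0.5 s of silence followed by a 200 Hz tone.
s = np.where(t < 0.5, 0.0, np.sin(2 * np.pi * 200 * t))

def short_time_energy(s, frame_len, hop):
    """Energy of Hamming-windowed frames; one output value every `hop` samples."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[k * hop : k * hop + frame_len] for k in range(n_frames)])
    return np.sum((frames * w) ** 2, axis=1)

# 200-sample (20 ms) window, advanced by 100 samples (10 ms) per output value.
E = short_time_energy(s, frame_len=200, hop=100)
print(len(E), E[0], E[-1])
```

    Advancing the window by `hop` samples is what reduces the output rate relative to the input sampling rate: here 10 000 input samples yield 99 energy values.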

    Speech analysis techniques

    Time domain analysis

      • Intensity
      • Zero crossing rate
      • Autocorrelation
      • AMDF (average magnitude difference function)

    Frequency domain analysis

      • Filter bank analysis
      • Short-time Fourier analysis
      • Spectrographic displays

    Formant estimation & tracking

    F0 estimation and tracking

    Separation of excitation and vocal tract filter effects

      • Cepstral analysis
      • Linear predictive coding (LPC)
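    Three of the time-domain measures listed above can be sketched on a single frame as follows. This is a minimal illustration under stated assumptions: the synthetic 125 Hz harmonic frame, the 80 to 400 Hz search range, and all function names are choices made for the example. The search range is deliberately kept below two pitch periods, because for a perfectly periodic frame the AMDF is near zero at every multiple of the period.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                       # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def f0_autocorrelation(frame, fs, f0_min=80.0, f0_max=400.0):
    """F0 from the lag of the largest autocorrelation value in the search range."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..L-1
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

def f0_amdf(frame, fs, f0_min=80.0, f0_max=400.0):
    """F0 from the lag that minimizes the average magnitude difference function."""
    lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)
    d = [np.mean(np.abs(frame[lag:] - frame[:-lag])) for lag in lags]
    return fs / lags[int(np.argmin(d))]

fs = 10_000
n = np.arange(400)                              # one 40 ms frame
theta = 2 * np.pi * 125 * n / fs                # hypothetical F0 of 125 Hz
voiced = np.sin(theta) + 0.5 * np.sin(2 * theta) + 0.3 * np.sin(3 * theta)
noise = np.random.default_rng(0).standard_normal(400)   # stand-in for unvoiced speech

print("ZCR voiced:", zero_crossing_rate(voiced), "ZCR noise:", zero_crossing_rate(noise))
print("F0 (autocorrelation):", f0_autocorrelation(voiced, fs))
print("F0 (AMDF):", f0_amdf(voiced, fs))
```

    The noise frame shows a much higher zero crossing rate than the periodic frame, which is the usual basis for using this measure in voiced/unvoiced decisions.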

    Multi-resolution spectrographic analysis

    Analog analysis

    Spectral analysis using bandpass filter & demodulator: input $s(t)$, output spectrum in dB, $S_{dB}(f)$, at time $t$.

    $\Delta f$ = 300 Hz (wideband), 45 Hz (narrowband)

    Digital analysis

    Change in resolution is implemented by padding the $L$-sample segment with $N - L$ zero-valued samples and using an $N$-point DFT. Pre-emphasis is used for boosting high-frequency components.

    Short-time Fourier transform:

    $$X(n, k) = \sum_{m=0}^{N-1} w(m)\, x(n-m)\, e^{-j 2 \pi m k / N}$$

    Hamming window:

    $$w(n) = 0.54 - 0.46 \cos \left( \frac{2 \pi n}{L-1} \right), \quad 0 \le n \le L-1$$

    Spectrum in dB:

    $$S_{dB}(n, k) = 20 \log_{10} \left( |X(n, k)| / X_{ref} \right)$$

    For $f_s$ = 10 k Sa/s: $L = 43$ gives $\Delta f \approx 300$ Hz; $L = 289$ gives $\Delta f \approx 45$ Hz.
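    The digital analysis can be sketched with NumPy as below. This is an illustration, not the lecture's implementation: the pre-emphasis coefficient 0.95, the hop of 20 samples, the DFT length N = 512, and the helper names are assumptions. The window lengths 43 and 289 and the 10 k Sa/s rate are the values quoted on the slide.

```python
import numpy as np

def spectrogram_db(x, L, N, hop, preemph=0.95, ref=1.0):
    """Magnitude spectrogram in dB: Hamming window of L samples, N-point DFT."""
    # Pre-emphasis y(n) = x(n) - a*x(n-1); the coefficient a is an assumed value.
    y = np.append(x[0], x[1:] - preemph * x[:-1])
    w = np.hamming(L)
    n_frames = 1 + (len(y) - L) // hop
    S = np.empty((n_frames, N // 2 + 1))
    for i in range(n_frames):
        seg = y[i * hop : i * hop + L] * w
        X = np.fft.rfft(seg, n=N)               # zero-pads the L samples to N points
        S[i] = 20 * np.log10(np.abs(X) / ref + 1e-12)
    return S

def bins_within_3db(frame_db):
    """Number of DFT bins within 3 dB of the frame maximum."""
    return int(np.sum(frame_db >= frame_db.max() - 3.0))

fs = 10_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)                # hypothetical 1 kHz test tone

wide = spectrogram_db(x, L=43, N=512, hop=20)     # wideband analysis
narrow = spectrogram_db(x, L=289, N=512, hop=20)  # narrowband analysis

peak_hz = np.argmax(narrow[10]) * fs / 512
print("peak near", peak_hz, "Hz")
print("3 dB width in bins:", bins_within_3db(wide[10]), "vs", bins_within_3db(narrow[10]))
```

    Both analyses use the same DFT length, so the frequency grid is identical; only the window length changes, and the shorter window gives the broader spectral peak.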

    Vowel triangle

    Spectrograms for English phonemes (vowels or /aCa/ syllable)

                                      Place of articulation
    Manner            Voicing     Back     Mid     Front
    Oral stop         unvoiced    k        t       p
                      voiced      g        d       b
    Nasal stop        unvoiced    -        -       -
                      voiced      ng       n       m
    Oral fricative    unvoiced    (sh)     s       f
                      voiced      (zh)     z       v
    Nasal fricative   unvoiced    -        -       -
                      voiced      -        -       -

    Cepstral analysis

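    The excitation $e(n)$ passes through the vocal tract filter with impulse response $v(n)$ to give the speech signal:

    $$s(n) = e(n) * v(n)$$

    $$S(z) = E(z)\, V(z)$$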

    We can use "log" transformation for deconvolution and separation of excitation & vocal tract filter parameters.

    Complex cepstrum:

    $$\hat{x}(n) = Z^{-1} \{ \log X(z) \} = F^{-1} \{ \log X(e^{j\omega}) \}$$

    Cepstrum:

    $$c_x(n) = F^{-1} \{ \log |X(e^{j\omega})| \}$$

    Block diagram: $x(n)$ → $Z$ → $X(z)$ → log → $\hat{X}(z)$ → $Z^{-1}$ → $\hat{x}(n)$

    Applied to speech:

    $$s(n) = e(n) * v(n)$$

    $$S(z) = E(z)\, V(z)$$

    $$\log S(z) = \log E(z) + \log V(z)$$

    Applying $Z^{-1}$:

    $$\hat{s}(n) = \hat{e}(n) + \hat{v}(n)$$
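    A minimal sketch of the (real) cepstrum used for pitch estimation, assuming a synthetic signal: an impulse-train excitation with a period of 80 samples convolved with a toy decaying resonance standing in for the vocal tract. All signal parameters and the quefrency search range are hypothetical choices for the example.

```python
import numpy as np

fs = 10_000
period = 80                                    # samples; hypothetical F0 = 125 Hz
e = np.zeros(1024)
e[::period] = 1.0                              # impulse-train excitation e(n)
n = np.arange(200)
v = np.exp(-n / 20.0) * np.cos(2 * np.pi * 700 * n / fs)   # toy vocal tract response v(n)
s = np.convolve(e, v)[:1024]                   # s(n) = e(n) * v(n)

frame = s * np.hamming(len(s))
log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)   # small offset avoids log(0)
c = np.fft.ifft(log_mag).real                  # cepstrum c(n) = F^-1{ log|X| }

# The vocal tract contribution sits at low quefrency; the excitation shows up
# as a peak at the pitch period. Search 2.5 ms to 20 ms (assumed range).
lo, hi = 25, 200
peak = lo + int(np.argmax(c[lo:hi]))
print("cepstral peak at", peak, "samples, F0 estimate", fs / peak, "Hz")
```

    Because the log turns the product of spectra into a sum, the slowly varying envelope and the fine harmonic structure separate along the quefrency axis, which is what makes the simple peak search possible.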

    Linear predictive coding (LPC)

      • Excitation $e(n)$: modeled as an impulse or random noise.
      • Vocal tract filter transfer function $V(z)$: modeled as an all-pole filter (AR model).
      • Inverse filter $A(z)$: all-zero filter (FIR or MA filter), with parameters estimated by minimizing the error $\varepsilon(n)$ in the least-mean-squares (LMS) sense.

    Result

      • The error $\varepsilon(n)$ acquires the characteristics of the excitation function.
      • The inverse filter coefficients, obtained by solving a set of linear equations, approximate the vocal tract filter coefficients.

    Block diagram: $e(n)$ → $V(z)$ → $s(n)$, followed by a summing junction that produces the error $\varepsilon(n)$.

    Speech production

    Block diagram: $e(n)$ → $V(z)$ → $s(n)$

    $$V(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

    $$e(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$$

    Linear prediction model

    Block diagram labels: input $s(n)$, filter $A(z)$, predicted signal $\hat{s}(n)$, error $\varepsilon(n)$

    $$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$

    $$\varepsilon(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$$
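    One common way to obtain the predictor coefficients is the autocorrelation method, in which minimizing the mean squared prediction error leads to a set of linear equations in the $a_k$. The sketch below is an illustration under that assumption; the slides do not say which formulation is used. The AR(2) test signal and the function names are hypothetical.

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Predictor coefficients a_1..a_p from the normal equations R a = r."""
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
    return np.linalg.solve(R, r[1 : p + 1])

def prediction_error(s, a):
    """Error eps(n) = s(n) - sum_k a_k * s(n-k)."""
    pred = np.zeros_like(s)
    for k in range(1, len(a) + 1):
        pred[k:] += a[k - 1] * s[:-k]
    return s - pred

# Hypothetical test signal: white noise through a known all-pole filter,
# s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n).
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
s = np.zeros(5000)
for i in range(2, 5000):
    s[i] = 1.3 * s[i - 1] - 0.6 * s[i - 2] + e[i]

a = lpc_autocorrelation(s, p=2)
eps = prediction_error(s, a)
print("estimated coefficients:", a)
print("signal variance:", np.var(s), "error variance:", np.var(eps))
```

    The estimated coefficients should come out close to the values used to generate the signal, and the error variance close to that of the driving noise, in line with the statement that the error acquires the characteristics of the excitation.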