speech production mechanisms

Y(J) Stein VoP2 1

V

O

P

YJS

Speech production mechanisms

Speech Production Organs

[Figure: speech production organs, with labels for the brain, nasal cavity, hard palate, velum, uvula, mouth cavity, teeth, lips, tongue, pharynx, larynx, esophagus, trachea, and lungs]

Speech Production Organs - cont.

Air from lungs is exhaled into trachea (windpipe)

Vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing the gap between them (the glottis)

Throat (pharynx), mouth, tongue and nasal cavity modify air flow

Teeth and lips can introduce turbulence

Epiglottis separates esophagus (food pipe) from trachea

Voiced vs. Unvoiced Speech

When the vocal cords are held open, air flows unimpeded
When laryngeal muscles stretch them, glottal flow is in bursts

When glottal flow is periodic, the speech is called voiced
The basic interval/frequency is called the pitch
Pitch period is usually between 2.5 and 20 milliseconds
Pitch frequency is between 50 and 400 Hz

You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs, for example: B/P, G/K, D/T, V/F, J/CH, TH/th, W/WH, Z/S, ZH/SH

Excitation spectra

Voiced speech: the pulse train is not sinusoidal, so it is harmonic rich

Unvoiced speech: the common assumption is white noise

[Figures: excitation spectra plotted against frequency f for voiced and unvoiced speech]
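The two excitation types can be compared numerically. The following is a minimal sketch, assuming an 8 kHz sampling rate and a 100 Hz pitch (neither value is taken from the slides), with NumPy's FFT standing in for the spectrum plots.

```python
import numpy as np

fs = 8000            # sampling rate in Hz (assumed for illustration)
pitch_period = 80    # samples: 10 ms, i.e. a 100 Hz pitch (assumed)
n = 800              # analysis length, a whole number of pitch periods

# Voiced excitation: one unit pulse per pitch period.
pulses = np.zeros(n)
pulses[::pitch_period] = 1.0

# Unvoiced excitation: white Gaussian noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(n)

pulse_mag = np.abs(np.fft.rfft(pulses))
noise_mag = np.abs(np.fft.rfft(noise))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# The pulse train has energy only at multiples of the pitch frequency,
# and every harmonic is equally strong.
harmonic_bins = np.flatnonzero(pulse_mag > 1e-9)
harmonic_freqs = freqs[harmonic_bins]
```

Here `harmonic_freqs` lists the frequencies at which the pulse train has energy, while `noise_mag` is spread over all bins with no such structure.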

Effect of vocal tract

Mouth and nasal cavities have resonances

Resonant frequencies depend on geometry

Effect of vocal tract - cont.

Sound energy at these resonant frequencies is amplified
Frequencies of peak amplification are called formants

[Figure: frequency response plotted against frequency, with formant peaks F1 to F4; spectra of voiced speech (showing the pitch F0) and unvoiced speech]

Formant frequencies

Peterson - Barney data (note the “vowel triangle”)

Sonograms

Cylinder model(s)

Rough model of throat and mouth cavity

With nasal cavity

[Figures: cylinder models driven by voice excitation; tube ends labeled open, open, and open/closed]

Phonemes

The smallest acoustic unit that can change meaning
Different languages have different phoneme sets
Types (notations: phonetic, CVC, ARPABET):

– Vowels
  • front (heed, hid, head, hat)
  • mid (hot, heard, hut, thought)
  • back (boot, book, boat)
  • diphthongs (buy, boy, down, date)

– Semivowels
  • liquids (w, l)
  • glides (r, y)

Phonemes - cont.

– Consonants
  • nasals (murmurs) (n, m, ng)
  • stops (plosives)
    – voiced (b, d, g)
    – unvoiced (p, t, k)
  • fricatives
    – voiced (v, that, z, zh)
    – unvoiced (f, think, s, sh)
  • affricates (j, ch)
  • whispers (h, what)
  • gutturals (ח, ע)
  • clicks, etc.

Basic LPC Model

[Block diagram: a pulse generator and a white noise generator feed a U/V switch, whose output drives the LPC synthesis filter]

Basic LPC Model - cont.

Pulse generator produces a harmonic rich periodic impulse train (with pitch period and gain)

White noise generator produces a random signal (with gain)

U/V switch chooses between voiced and unvoiced speech

LPC filter amplifies formant frequencies (all-pole or AR IIR filter)

The output will resemble true speech to within residual error
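The four blocks described above can be sketched in a few lines. This is an illustration under assumptions, not the author's implementation: the function name `lpc_synthesize`, the coefficient values, and the 80-sample pitch period are all hypothetical, and the sign convention s_n = G e_n + Σ a_m s_{n-m} is the one used later in these slides.

```python
import numpy as np

def lpc_synthesize(a, n_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """Sketch of the LPC production model: excitation -> all-pole filter.

    a            : predictor coefficients a_1..a_p
    voiced       : the U/V switch; True selects the pulse generator
    pitch_period : pulse spacing in samples (used only when voiced)
    """
    if voiced:
        excitation = np.zeros(n_samples)
        excitation[::pitch_period] = 1.0          # periodic impulse train
    else:
        excitation = np.random.default_rng(seed).standard_normal(n_samples)

    s = np.zeros(n_samples)
    for n in range(n_samples):
        acc = gain * excitation[n]
        for m, a_m in enumerate(a, start=1):      # all-pole (AR) recursion
            if n - m >= 0:
                acc += a_m * s[n - m]
        s[n] = acc
    return s

coeffs = [0.5, -0.3, 0.1]                         # hypothetical stable filter
voiced_frame = lpc_synthesize(coeffs, 240, voiced=True)
unvoiced_frame = lpc_synthesize(coeffs, 240, voiced=False)
```

The explicit double loop is kept for readability; a production version would normally use a library IIR filter routine instead.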

Cepstrum

Another way of thinking about the LPC model

The speech spectrum is obtained by multiplication:
spectrum of the (pitch) pulse train times the vocal tract (formant) frequency response

So the log of this spectrum is obtained by addition:
log spectrum of the pitch train plus log of the vocal tract frequency response

Consider this log spectrum to be the spectrum of some new signal called the cepstrum

The cepstrum is the sum of two components: excitation plus vocal tract
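The additivity can be checked numerically. The sketch below uses the real cepstrum (inverse FFT of the log magnitude spectrum) and circular convolution so that the spectra multiply exactly; the pulse period, the decaying resonance used as a stand-in vocal tract, and the small noise term are all assumptions made for the example.

```python
import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(x)))
    return np.fft.ifft(log_mag).real

n = 512
rng = np.random.default_rng(0)

# Excitation: pulse train (period 64) plus a little noise so that no
# spectral bin is exactly zero (the log of zero is undefined).
excitation = 1e-3 * rng.standard_normal(n)
excitation[::64] += 1.0

# Stand-in for the vocal tract: a short decaying resonance.
t = np.arange(n)
tract = (0.9 ** t) * np.cos(0.3 * np.pi * t)

# Circular convolution in time is multiplication of the spectra.
speech = np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(tract)).real

c_speech = real_cepstrum(speech)
c_sum = real_cepstrum(excitation) + real_cepstrum(tract)
```

Up to rounding error, `c_speech` and `c_sum` agree, which is the "sum of two components" statement in numerical form.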

Cepstrum - cont.

Cepstral processing has its own language:
– Cepstrum (note that this is really a signal in the time domain)
– Quefrency (its units are seconds)
– Liftering (filtering)
– Alanysis
– Saphe

Several variants:
– complex cepstrum
– power cepstrum
– LPC cepstrum

Do we know enough?

The standard speech model (LPC), used by most speech processing/compression/recognition systems, is a model of speech production

Unfortunately, the speech production and speech perception systems are not matched

So next we’ll look at the biology of the hearing (auditory) system and some psychophysics (perception)

Speech Hearing & Perception Mechanisms

Hearing Organs

Hearing Organs - cont.

Sound waves impinge on the outer ear and enter the auditory canal
Amplified waves cause the eardrum to vibrate
The eardrum separates the outer ear from the middle ear
The Eustachian tube equalizes the air pressure of the middle ear
Ossicles (hammer, anvil, stirrup) amplify vibrations
The oval window separates the middle ear from the inner ear
The stirrup excites the oval window, which excites the liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along the middle of the cochlea
The organ of Corti transduces vibrations to electric pulses
Pulses are carried by the auditory nerve to the brain

Function of Cochlea

The cochlea has 2 1/2 to 3 turns; were it straightened out, it would be 3 cm in length
The basilar membrane runs down the center of the cochlea, as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons
The cochlea is wide (1/2 cm) near the oval window and tapers towards the apex
It is stiff near the oval window and flexible near the apex
Hence high frequencies cause the section near the oval window to vibrate, and low frequencies cause the section near the apex to vibrate
The result is a frequency decomposition by a bank of overlapping filters

Psychophysics - Weber’s law

Ernst Weber: professor of physiology at Leipzig in the early 1800s

Just Noticeable Difference (JND): the minimal stimulus change that can be detected by the senses

Discovery: $\Delta I = K I$

Example, tactile sense: place coins in each hand. The subject could discriminate between 10 coins and 11, but not between 20 and 21, yet could between 20 and 22!

Similarly for vision (lengths of lines), taste (saltiness), and sound (frequency)

Weber’s law - cont.

This makes a lot of sense

Bill Gates

Psychophysics - Fechner’s law

Weber’s law is not a true psychophysical law:
it relates stimulus threshold to stimulus (both physical entities),
not internal representation (feelings) to physical entity

Gustav Theodor Fechner: student of Weber; medicine, physics, philosophy

Simplest assumption: the JND is a single internal unit
Using Weber’s law we find:

$Y = A \log I + B$

Fechner Day (October 22, 1850)
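The step from Weber's law to the logarithm can be written out as follows. This is a sketch of the standard argument; the symbol $u$ for the size of one internal unit is introduced here for illustration and does not appear in the slides.

```latex
\begin{align*}
\Delta I &= K I && \text{Weber's law: the JND grows with the stimulus} \\
\Delta Y &= u && \text{assumption: every JND is one internal unit } u \\
\frac{dY}{dI} &\approx \frac{\Delta Y}{\Delta I} = \frac{u}{K I} \\
Y &= \frac{u}{K} \ln I + B = A \log I + B
\end{align*}
```

The constant $A$ absorbs both $u/K$ and the choice of logarithm base, and $B$ is the integration constant.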

Fechner’s law - cont.

Log is very compressive

Fechner’s law explains the fantastic ranges of our senses
Sight: single photon to direct sunlight, $10^{15}$
Hearing: eardrum moving by 1 H atom to jet plane, $10^{12}$

Bel: defined to be $\log_{10}$ of a power ratio
decibel (dB): one tenth of a Bel

$d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$

Fechner’s law - sound amplitudes

Companding: adaptation of the logarithm to positive/negative signals

μ-law and A-law are piecewise linear approximations

Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly more noisy)
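The "logarithm adapted to positive/negative signals" can be illustrated with the continuous μ-law curve. Note the assumptions: this sketch uses the smooth formula with μ = 255, not the piecewise linear approximation that the slide mentions, and the function names are hypothetical.

```python
import numpy as np

MU = 255.0  # value commonly used for 8-bit mu-law telephony

def mu_compress(x, mu=MU):
    """Continuous mu-law curve for x in [-1, 1]; the sign is handled separately."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 9)
y = mu_compress(x)          # small amplitudes are stretched, large ones squeezed
x_back = mu_expand(y)       # expanding recovers the original values
```

Quantizing `y` uniformly to 8 bits and then expanding gives fine steps near zero and coarse steps near full scale, which is the source of the 12-14 bit equivalence claimed above.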

Fechner’s law - sound frequencies

Octaves, well tempered scale: $f \rightarrow \sqrt[12]{2}\, f$

Critical bands

Frequency warping
– Mel (from melody): 1 kHz = 1000 mel, JND afterwards: $M \approx 1000 \log_2 (1 + f_{\mathrm{kHz}})$
– Bark (from Barkhausen): can be simultaneously heard: $B \approx 25 + 75\,(1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$

excite different basilar membrane regions
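The warping formulas are easy to evaluate directly. The sketch below transcribes them as written on the slide; the function names are hypothetical, and reading B as a critical bandwidth in Hz is an interpretation of the formula rather than something the slide states.

```python
import math

def mel(f_hz):
    """Mel value from M = 1000 log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def critical_bandwidth(f_hz):
    """B = 25 + 75 (1 + 1.4 f_kHz^2)^0.69 (assumed to be in Hz)."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Well tempered scale: twelve equal frequency ratios per octave.
SEMITONE = 2.0 ** (1.0 / 12.0)

mel_at_1khz = mel(1000.0)
mel_at_3khz = mel(3000.0)
bandwidth_at_dc = critical_bandwidth(0.0)
```

With this formula 1 kHz maps to 1000 mel, and the higher octaves are compressed: 3 kHz maps to only 2000 mel.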

Psychophysics - changes

Our senses respond to changes

[Figure: block labeled “Inverse E Filter”]

Psychophysics - masking

Masking: strong tones block weaker ones at nearby frequencies

Narrowband noise blocks tones (up to the critical band)

[Figure: masking plotted against frequency f]

Speech DSP

Some Speech DSP

Simplest processing
– Gain
– AGC
– VAD

More complex processing
– pitch tracking
– U/V decision
– computing LPC
– other features

Simple Speech DSP

Gain (volume) Control

In analog processing (electronics) gain requires an amplifier

Great care must be taken to ensure linearity!

In digital processing (DSP) gain requires only multiplication

$y = G x$

Need enough bits!

Automatic Gain Control (AGC)

Can we set the gain automatically?

Yes, based on the signal’s energy!

$E = \int x^2(t)\,dt = \sum_n x_n^2$

All we have to do is apply gain until we attain the desired energy

Assume we want the energy to be Y

Then

$y = \sqrt{Y/E}\; x = G x$

has exactly this energy
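For a finite block of samples the rule above is one line of arithmetic. A minimal sketch, with a hypothetical function name and an arbitrary test signal:

```python
import numpy as np

def normalize_energy(x, target_energy):
    """Scale x so that the sum of its squared samples equals target_energy."""
    energy = np.sum(x ** 2)
    if energy == 0.0:
        return x.copy()                      # silence: nothing to scale
    gain = np.sqrt(target_energy / energy)   # G = sqrt(Y / E)
    return gain * x

x = np.array([0.5, -1.0, 2.0, 0.0])
y = normalize_energy(x, target_energy=1.0)
output_energy = np.sum(y ** 2)
```

The square root appears because energy scales with the square of the gain.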

AGC - cont.

What if the input isn’t stationary (gets stronger and weaker over time)?

The energy is defined over all times $-\infty < t < \infty$, so it can’t help!

So we define the “energy in a window” E(t) and continuously vary the gain G(t)

This is Adaptive Gain Control

We don’t want the gain to jump from window to window, so we smooth the instantaneous gain (an IIR filter)

$G(t) \leftarrow \alpha\, G(t) + (1-\alpha) \sqrt{Y/E(t)}$

AGC - cont.

The coefficient $\alpha$ determines how fast G(t) can change

In more complex implementations we may separately control integration time, attack time, and release time

What is involved in the computation of G(t)?
– Squaring of input value
– Accumulation
– Square root (or Pythagorean sum)
– Inversion (division)

Square root and inversion are hard for a DSP processor, but algorithmic improvements are possible (and often needed)
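The operations listed above map onto a short frame-by-frame loop. This is a sketch under assumptions: the window length, the smoothing coefficient, and the energy floor are illustrative values, and a real implementation would add the separate attack and release behaviour mentioned above.

```python
import numpy as np

def adaptive_gain(x, target_energy, window=160, alpha=0.9, floor=1e-12):
    """Frame-by-frame AGC with a smoothed gain.

    For each window: G <- alpha * G + (1 - alpha) * sqrt(Y / E).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    gain = 1.0
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = max(np.sum(frame ** 2), floor)              # squaring + accumulation
        instantaneous = np.sqrt(target_energy / energy)      # square root + division
        gain = alpha * gain + (1.0 - alpha) * instantaneous  # one-pole IIR smoothing
        y[start:start + window] = gain * frame
    return y

# A constant-level input: the gain should settle so that each window
# carries the target energy (here an amplitude of 1.0).
x = 0.1 * np.ones(160 * 200)
y = adaptive_gain(x, target_energy=160.0)
```

The energy floor only prevents division by zero; silent windows would still drive the gain very high, which is one reason practical systems combine AGC with activity detection.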

Simple VAD

Sometimes it is useful to know whether someone is talking (or not)
– Save bandwidth
– Suppress echo
– Segment utterances

We might be able to get away with “energy VOX”
Normally need Noise Riding Threshold / Signal Riding Threshold

However, there are problems with energy VOX, since it doesn’t differentiate between speech and noise

What we really want is a speech-specific activity detector: a Voice Activity Detector

Simple VAD - cont.

VADs operate by recognizing that speech is different from noise
– Speech is low-pass while noise is white
– Speech is mostly voiced and so has pitch in a given range
– Average noise amplitude is relatively constant

A simple VAD may use:
– zero crossings
– zero crossing “derivative”
– spectral tilt filter
– energy contours
– combinations of the above
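Two of the listed features, energy and zero crossings, can be combined into a toy detector. The thresholds and function names below are hypothetical, and a usable VAD would track the noise level over time rather than use fixed values.

```python
import numpy as np

def frame_features(frame):
    """Return (energy, zero-crossing rate) for one frame of samples."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.sum(frame ** 2))
    signs = np.sign(frame)
    signs[signs == 0] = 1.0                      # treat exact zeros as positive
    zcr = float(np.mean(signs[1:] != signs[:-1]))
    return energy, zcr

def simple_vad(frame, energy_threshold, zcr_threshold):
    """Flag a frame as speech when it is loud and has few zero crossings."""
    energy, zcr = frame_features(frame)
    return energy > energy_threshold and zcr < zcr_threshold

fs = 8000
tone = np.sin(2 * np.pi * 100 * np.arange(160) / fs)       # low-pass, voiced-like
hiss = 0.01 * np.random.default_rng(0).standard_normal(160)  # weak white noise

tone_is_speech = simple_vad(tone, energy_threshold=1.0, zcr_threshold=0.1)
hiss_is_speech = simple_vad(hiss, energy_threshold=1.0, zcr_threshold=0.1)
```

The zero-crossing rate is a cheap proxy for spectral tilt: a low-pass signal changes sign rarely, while white noise changes sign on roughly half of its samples.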

Other “simple” processes

Simple = not significantly dependent on details of speech signal

– Speed change of recorded signal
– Speed change with pitch compensation
– Pitch change with speed compensation
– Sample rate conversion
– Tone generation
– Tone detection
– Dual tone generation
– Dual tone detection (need high reliability)

Complex Speech DSP

Correlation

One major difference between simple and complex processing is the computation of correlations (related to the LPC model)

Correlation is a measure of similarity

Shouldn’t we use squared difference to measure similarity?

$\sum_t \left( x(t) - y(t) \right)^2$

No, since squared difference is sensitive to
– gain
– time shifts

Correlation - cont.

$\sum_t \left( x(t) - y(t) \right)^2 = \sum_t x^2(t) + \sum_t y^2(t) - 2 \sum_t x(t)\, y(t)$

So when the squared difference is minimal, $C(0) = \sum_t x(t)\, y(t)$ is maximal, and arbitrary gains don’t change this

To take time shifts into account, use

$C(\tau) = \sum_t x(t)\, y(t+\tau)$

and look for the lag $\tau$ at which it is maximal

We can even find out how much a signal resembles itself

Autocorrelation

Crosscorrelation: $C_{xy}(\tau) = \sum_t x(t)\, y(t+\tau)$
Autocorrelation: $C_x(\tau) = \sum_t x(t)\, x(t+\tau)$

$C_x(0)$ is the energy!

Autocorrelation helps find hidden periodicities!
Much stronger than looking in the time representation

Wiener-Khintchine: the autocorrelation $C(\tau)$ and the power spectrum S(f) are an FT pair

So autocorrelation contains the same information as the power spectrum… and can itself be computed by FFT
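The FFT route can be checked against the defining sum. The sketch below uses the circular (periodic) form of the autocorrelation, which is what a plain FFT computes; zero-padding would be needed to get the linear version. The function names and the random test signal are illustrative.

```python
import numpy as np

def autocorr_direct(x):
    """Circular autocorrelation C(tau) = sum_t x(t) x(t + tau), computed directly."""
    n = len(x)
    return np.array([np.sum(x * np.roll(x, -tau)) for tau in range(n)])

def autocorr_fft(x):
    """The same quantity obtained from the power spectrum (Wiener-Khintchine)."""
    power_spectrum = np.abs(np.fft.fft(x)) ** 2
    return np.fft.ifft(power_spectrum).real

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
c_direct = autocorr_direct(x)
c_fft = autocorr_fft(x)
energy = np.sum(x ** 2)
```

The direct sum costs on the order of N² operations and the FFT route on the order of N log N, which is why the latter is preferred when many lags are needed.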

Pitch tracking

How can we measure (and track) the pitch?

We can look for it in the spectrum
– but it may be very weak
– may not even be there (filtered out)
– need high resolution spectral estimation

Correlation based methods
The pitch periodicity should be seen in the autocorrelation!

Sometimes computationally simpler is the Average Magnitude Difference Function (AMDF)

$\mathrm{AMDF}(\tau) = \sum_t \left| x(t) - x(t+\tau) \right|$

Pitch tracking - cont.

Sondhi’s algorithm for autocorrelation-based pitch tracking:
– obtain window of speech
– determine if the segment is voiced (see U/V decision below)
– low-pass filter and center-clip to reduce formant induced correlations
– compute autocorrelation lags corresponding to valid pitch intervals
  • find lag with maximum correlation OR
  • find lag with maximal accumulated correlation in all multiples

Post processing
Pitch trackers rarely make small errors (usually double pitch)
So correct outliers based on neighboring values
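The core of the autocorrelation approach can be sketched as follows. This is a simplified illustration, not Sondhi's full algorithm: the low-pass filter, the voicing decision, and the post-processing are omitted, the 30% clipping level is an assumption, and the frame is assumed to be voiced and longer than the longest pitch period.

```python
import numpy as np

def center_clip(x, fraction=0.3):
    """Zero samples below a fraction of the peak and shift the rest toward zero."""
    level = fraction * np.max(np.abs(x))
    return np.where(x > level, x - level, np.where(x < -level, x + level, 0.0))

def estimate_pitch(frame, fs, f_min=50.0, f_max=400.0):
    """Return the pitch in Hz of a voiced frame from its autocorrelation peak."""
    clipped = center_clip(np.asarray(frame, dtype=float))
    lag_min = int(fs / f_max)                 # shortest valid pitch period
    lag_max = int(fs / f_min)                 # longest valid pitch period
    lags = np.arange(lag_min, lag_max + 1)
    corr = np.array([np.sum(clipped[:-lag] * clipped[lag:]) for lag in lags])
    best_lag = lags[np.argmax(corr)]          # "find lag with maximum correlation"
    return fs / best_lag

# Synthetic voiced frame: 100 Hz fundamental plus its second harmonic.
fs = 8000
t = np.arange(800) / fs
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
pitch_hz = estimate_pitch(frame, fs)
```

Restricting the search to lags between `fs / f_max` and `fs / f_min` is what the slide calls the valid pitch intervals; it corresponds to the 50 to 400 Hz range quoted earlier.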

Other Pitch Trackers

Miller’s data-reduction & Gold and Rabiner’s parallel processing methods
– zero-crossings, energy, extrema of waveform

Noll’s cepstrum based pitch tracker
– works since the pitch and formant contributions are separated in the cepstral domain
– most accurate for clean speech, but not robust in noise

Methods based on LPC error signal
– LPC technique breaks down at pitch pulse onset
– find periodicity of error by autocorrelation

Inverse filtering method
– remove formant filtering by low-order LPC analysis
– find periodicity of excitation by autocorrelation

Sondhi-like methods are the best for noisy speech

U/V decision

Between VAD and pitch tracking
Simplest U/V decision is based on energy and zero crossings
More complex methods are combined with pitch tracking
Methods based on pattern recognition

Is voicing well defined?
– Degree of voicing (buzz)
– Voicing per frequency band (interference)
– Degree of voicing per frequency band

LPC Coefficients

How do we find the vocal tract filter coefficients?

System identification problem

[Figure: known input → unknown filter → known output]

All-pole (AR) filter
Connection to prediction

$s_n = G e_n + \sum_m a_m s_{n-m}$

Can find G from energy (so let’s ignore it)

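LPC Coefficients

For simplicity let’s assume three a coefficients

$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$

Need three equations!

$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$

$s_{n+1} = e_{n+1} + a_1 s_n + a_2 s_{n-1} + a_3 s_{n-2}$

$s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_n + a_3 s_{n-1}$

In matrix form

$$
\begin{pmatrix} s_n \\ s_{n+1} \\ s_{n+2} \end{pmatrix}
=
\begin{pmatrix} e_n \\ e_{n+1} \\ e_{n+2} \end{pmatrix}
+
\begin{pmatrix}
s_{n-1} & s_{n-2} & s_{n-3} \\
s_n & s_{n-1} & s_{n-2} \\
s_{n+1} & s_n & s_{n-1}
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
$$

$\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$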

LPC Coefficients - cont.

$\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$

so by simple algebra

$\mathbf{a} = S^{-1} (\mathbf{s} - \mathbf{e})$

and we have reduced the problem to matrix inversion

S is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm)

Unfortunately noise makes this attempt break down!
Move to the next time and the answer will be different
Need to somehow average the answers
The proper averaging is before the equation solving (correlation vs. autocovariance)
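In the noise-free case the three-equation system does recover the coefficients exactly, which the following sketch demonstrates. The filter coefficients are hypothetical, the known input is taken to be a single impulse, and a general-purpose solver is used in place of a Toeplitz-specific one.

```python
import numpy as np

true_a = np.array([0.5, -0.3, 0.1])   # hypothetical filter to be identified

# Known input: e_0 = 1 and e_n = 0 afterwards, so s is the impulse response.
s = np.zeros(8)
for n in range(len(s)):
    s[n] = 1.0 if n == 0 else 0.0
    for m, a_m in enumerate(true_a, start=1):
        if n - m >= 0:
            s[n] += a_m * s[n - m]

n = 3                                  # first index with three past samples
S = np.array([[s[n - 1], s[n - 2], s[n - 3]],
              [s[n],     s[n - 1], s[n - 2]],
              [s[n + 1], s[n],     s[n - 1]]])
s_vec = s[n:n + 3]
e_vec = np.zeros(3)                    # the excitation is zero for n >= 1

a_est = np.linalg.solve(S, s_vec - e_vec)   # a = S^{-1} (s - e)
```

Adding even a little noise to `s` and repeating the solve at a different `n` gives a different answer each time, which is the breakdown described above.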

LPC Coefficients - cont.

Can’t just average over time - all equations would be the same!
Let’s take the input to be zero

$s_n = \sum_m a_m s_{n-m}$

multiply by $s_{n-q}$ and sum over n

$\sum_n s_n s_{n-q} = \sum_m a_m \sum_n s_{n-m} s_{n-q}$

we recognize the autocorrelations

$C_s(q) = \sum_m C_s(|m-q|)\, a_m$

Yule-Walker equations
– autocorrelation method: $s_n$ outside window are zero (Toeplitz)
– autocovariance method: use all needed $s_n$ (no window)

Also - pre-emphasis!
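The autocorrelation method can be sketched with a textbook Levinson-Durbin recursion and checked against a general linear solver. The test signal (white noise coloured by a hypothetical AR(3) filter) and the function names are assumptions made for the example; no window function or pre-emphasis is applied.

```python
import numpy as np

def autocorrelation(s, max_lag):
    """C_s(q) for q = 0..max_lag, taking samples outside the window as zero."""
    return np.array([np.dot(s[:len(s) - q], s[q:]) for q in range(max_lag + 1)])

def levinson_durbin(c, order):
    """Solve C_s(q) = sum_m C_s(|m-q|) a_m, q = 1..order, for a_1..a_order."""
    a = np.zeros(order + 1)            # a[0] is unused
    error = c[0]
    for i in range(1, order + 1):
        k = (c[i] - np.dot(a[1:i], c[i - 1:0:-1])) / error   # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = new_a
        error *= 1.0 - k * k           # remaining prediction error energy
    return a[1:]

rng = np.random.default_rng(0)
s = rng.standard_normal(400)
for n in range(3, len(s)):             # colour the noise with an AR(3) filter
    s[n] += 0.5 * s[n - 1] - 0.3 * s[n - 2] + 0.1 * s[n - 3]

order = 3
c = autocorrelation(s, order)
a_ld = levinson_durbin(c, order)

# Reference: solve the same Toeplitz system with a general solver.
R = np.array([[c[abs(i - j)] for j in range(order)] for i in range(order)])
a_ref = np.linalg.solve(R, c[1:])
```

The recursion needs on the order of p² operations for p coefficients instead of p³, and its intermediate values `k` are the reflection coefficients.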

Alternative features

The a coefficients aren’t the only set of features
– Reflection coefficients (cylinder model)
– log-area coefficients (cylinder model)
– pole locations
– LPC cepstrum coefficients
– Line Spectral Pair frequencies

All theoretically contain the same information (algebraic transformations)

Euclidean distance in LPC cepstrum space ~ Itakura Saito measure, so these are popular in speech recognition

LPC (a) coefficients don’t quantize or interpolate well, so these aren’t good for speech compression

LSP frequencies are best for compression

LSP coefficients

a coefficients are not statistically equally weighted
pole positions are better (geometric), but the radius is sensitive near the unit circle
Is there an all-angle representation?

Theorem 1: Every real polynomial with all roots on the unit circle is palindromic (e.g. $1 + 2t + t^2$) or antipalindromic (e.g. $1 + t - t^2 - t^3$)

Theorem 2: Every polynomial can be written as the sum of palindromic and antipalindromic polynomials

Consequence: Every polynomial can be represented by roots on the unit circle, that is, by angles
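The decomposition can be tried on a small example. The sketch assumes the usual LSP construction, in which the inverse filter polynomial A(z) = 1 - Σ a_m z^{-m} is padded by one degree and split into its palindromic and antipalindromic parts; the coefficient values are hypothetical and chosen so that the filter is stable, which is the case in which the roots land on the unit circle.

```python
import numpy as np

a = np.array([0.5, -0.3, 0.1])                  # hypothetical stable predictor
A = np.concatenate(([1.0], -a, [0.0]))          # A(z), padded to degree p + 1
P = A + A[::-1]                                 # palindromic part
Q = A - A[::-1]                                 # antipalindromic part

roots_P = np.roots(P)
roots_Q = np.roots(Q)

# Every root has magnitude one, so only its angle carries information.
lsp_angles = np.sort(np.angle(np.concatenate((roots_P, roots_Q))))

# The original polynomial is recovered from the two parts.
A_rebuilt = (P + Q) / 2.0
```

Because the roots come in conjugate pairs, only the angles between 0 and π are needed; these are the line spectral frequencies.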

LPC-based Compression

We learned that from
– gain
– pitch
– a small number of LPC coefficients
we could synthesize speech

It is easy to find the energy of a speech signal
We have seen methods to find pitch
We saw how to extract LPC coefficients from speech

So do we know how to compress speech?
