speech production mechanisms
Y(J) Stein VoP2 1
V
O
P
YJS
Speech production mechanisms
Speech Production Organs
[Figure: diagram of the speech production organs, labeled: brain, nasal cavity, hard palate, velum, uvula, teeth, lips, mouth cavity, tongue, pharynx, larynx, esophagus, trachea, lungs]
SpchProd
Speech Production Organs - cont.
Air from lungs is exhaled into trachea (windpipe)
Vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing the gap between them (the glottis)
Throat (pharynx), mouth, tongue and nasal cavity modify air flow
Teeth and lips can introduce turbulence
Epiglottis separates esophagus (food pipe) from trachea
Voiced vs. Unvoiced Speech
When the vocal cords are held open, air flows unimpeded
When the laryngeal muscles stretch them, glottal flow comes in bursts
When the glottal flow is periodic, the speech is called voiced
– the basic interval/frequency is called the pitch
– pitch period is usually between 2.5 and 20 milliseconds
– pitch frequency is between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs, for example: B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
Excitation spectra
Voiced speech: the pulse train is not sinusoidal, so it is rich in harmonics
Unvoiced speech: common assumption is white noise
[Figures: excitation spectra versus frequency f for voiced and unvoiced speech]
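The "harmonic rich" claim can be checked numerically. The sketch below is only an illustration under assumed values (a 64-sample analysis length and an 8-sample pitch period, neither taken from the slides): it compares the DFT of a pulse train with that of a pure sinusoid of the same fundamental, using a direct DFT sum so that it stays self-contained.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of DFT bins 0 .. len(x)//2, computed by the direct O(N^2) sum."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

N = 64       # analysis length in samples (assumed for illustration)
PERIOD = 8   # pitch period in samples (assumed for illustration)

# Voiced-style excitation: one unit pulse every PERIOD samples.
pulse_train = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N)]
# A pure sinusoid with the same fundamental frequency, for comparison.
sinusoid = [math.cos(2 * math.pi * t / PERIOD) for t in range(N)]

# Bins that carry energy: many harmonics for the pulse train, one line for the sinusoid.
pulse_lines = [k for k, m in enumerate(dft_magnitudes(pulse_train)) if m > 1e-6]
sine_lines = [k for k, m in enumerate(dft_magnitudes(sinusoid)) if m > 1e-6]
print(pulse_lines, sine_lines)
```

The pulse train puts equal energy at every multiple of the fundamental bin, while the sinusoid occupies a single bin.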
Effect of vocal tract
Mouth and nasal cavities have resonances
Resonant frequencies depend on geometry
Effect of vocal tract - cont.
Sound energy at these resonant frequencies is amplified
Frequencies of peak amplification are called formants
[Figure: frequency response versus frequency, with formant peaks F1, F2, F3, F4; panels for voiced speech (pitch F0) and unvoiced speech]
Formant frequencies
Peterson - Barney data (note the “vowel triangle”)
Sonograms
Cylinder model(s)
Rough model of throat and mouth cavity
With nasal cavity
[Figures: cylinder models driven by a voice excitation; tube ends labeled open, open, and open/closed]
Phonemes
The smallest acoustic unit that can change meaning
Different languages have different phoneme sets
Types: (notations: phonetic, CVC, ARPABET)
– Vowels
• front (heed, hid, head, hat)
• mid (hot, heard, hut, thought)
• back (boot, book, boat)
• diphthongs (buy, boy, down, date)
– Semivowels
• liquids (w, l)
• glides (r, y)
Phonemes - cont.
– Consonants
• nasals (murmurs) (n, m, ng)
• stops (plosives)
– voiced (b, d, g)
– unvoiced (p, t, k)
• fricatives
– voiced (v, that, z, zh)
– unvoiced (f, think, s, sh)
• affricates (j, ch)
• whispers (h, what)
• gutturals (ע, ח)
• clicks, etc. etc. etc.
Basic LPC Model
[Figure: block diagram. A pulse generator and a white noise generator feed a U/V switch, whose output drives the LPC synthesis filter]
Basic LPC Model - cont.
Pulse generator produces a harmonic rich periodic impulse train (with pitch period and gain)
White noise generator produces a random signal (with gain)
U/V switch chooses between voiced and unvoiced speech
LPC filter amplifies formant frequencies (all-pole or AR IIR filter)
The output will resemble true speech to within residual error
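The four blocks of the basic LPC model can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the function names, the 80-sample pitch period, the Gaussian noise source, and the two filter coefficients are all assumptions chosen only so that the all-pole filter is stable.

```python
import random

def excitation(n, voiced, pitch_period=80, gain=1.0, seed=0):
    """U/V switch: a periodic impulse train if voiced, white noise otherwise."""
    if voiced:
        return [gain if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def all_pole_filter(e, a):
    """LPC synthesis filter: s[n] = e[n] + sum_m a[m] * s[n-m], with a = [a_1, ..., a_M]."""
    s = []
    for n, e_n in enumerate(e):
        acc = e_n
        for m, a_m in enumerate(a, start=1):
            if n - m >= 0:
                acc += a_m * s[n - m]
        s.append(acc)
    return s

# Hypothetical coefficients; the poles have radius sqrt(0.8) < 1, so the filter is stable.
a = [1.3, -0.8]
voiced_speech = all_pole_filter(excitation(400, voiced=True), a)
unvoiced_speech = all_pole_filter(excitation(400, voiced=False), a)
```

With the voiced excitation, the first output samples are simply the filter's impulse response, which repeats every pitch period.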
Cepstrum
Another way of thinking about the LPC model
The speech spectrum is obtained by multiplication:
– spectrum of the (pitch) pulse train, times
– vocal tract (formant) frequency response
So the log of this spectrum is obtained by addition:
– log spectrum of the pitch train, plus
– log of the vocal tract frequency response
Consider this log spectrum to be the spectrum of some new signal, called the cepstrum
The cepstrum is the sum of two components: excitation plus vocal tract
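The additivity stated above can be verified on a toy example. The sketch assumes the real cepstrum (inverse DFT of the log magnitude spectrum), circular convolution, and two short made-up sequences standing in for the excitation and the vocal tract response; all of these are illustrative choices rather than anything specified in the slides.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, adequate for a short illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of x."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n) for k in range(n)).real / n
            for q in range(n)]

def circular_convolve(x, h):
    """Circular convolution, so that the DFTs multiply exactly."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

# Hypothetical stand-ins with no spectral zeros (so the log is defined everywhere).
e = [1.0, 0.5] + [0.0] * 14    # "excitation"
h = [1.0, -0.3] + [0.0] * 14   # "vocal tract" impulse response
s = circular_convolve(e, h)    # "speech"

c_e, c_h, c_s = real_cepstrum(e), real_cepstrum(h), real_cepstrum(s)
max_gap = max(abs(cs - (ce + ch)) for cs, ce, ch in zip(c_s, c_e, c_h))
```

Here `max_gap` is at rounding-error level: convolution in time becomes addition in the cepstral domain.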
Cepstrum - cont.
Cepstral processing has its own language
– Cepstrum (note that this is really a signal in the time domain)
– Quefrency (its units are seconds)
– Liftering (filtering)
– Alanysis (analysis)
– Saphe (phase)
Several variants:
– complex cepstrum
– power cepstrum
– LPC cepstrum
Do we know enough?
The standard speech model (LPC), used by most speech processing/compression/recognition systems, is a model of speech production
Unfortunately, the speech production and speech perception systems are not matched
So next we’ll look at the biology of the hearing (auditory) system and some psychophysics (perception)
Speech Hearing & Perception Mechanisms
Hearing Organs
SpchPerc
Hearing Organs - cont.
Sound waves impinge on the outer ear and enter the auditory canal
Amplified waves cause the eardrum to vibrate
The eardrum separates the outer ear from the middle ear
The Eustachian tube equalizes the air pressure of the middle ear
Ossicles (hammer, anvil, stirrup) amplify the vibrations
The oval window separates the middle ear from the inner ear
The stirrup excites the oval window, which excites the liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along the middle of the cochlea
The organ of Corti transduces vibrations to electric pulses
Pulses are carried by the auditory nerve to the brain
Function of Cochlea
The cochlea has 2 1/2 to 3 turns; were it straightened out it would be 3 cm in length
The basilar membrane runs down the center of the cochlea, as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons
The cochlea is wide (1/2 cm) near the oval window and tapers towards the apex; it is stiff near the oval window and flexible near the apex
Hence high frequencies cause the section near the oval window to vibrate, and low frequencies cause the section near the apex to vibrate
Overlapping bank of filters: frequency decomposition
Psychophysics - Weber’s law
Ernst Weber: professor of physiology at Leipzig in the early 1800s
Just Noticeable Difference (JND): minimal stimulus change that can be detected by the senses
Discovery: $\Delta I = K I$
Example, tactile sense: place coins in each hand; a subject could discriminate between 10 coins and 11, but not between 20 and 21, yet could between 20 and 22!
Similarly for vision (lengths of lines), taste (saltiness), and sound (frequency)
Weber’s law - cont.
This makes a lot of sense
Bill Gates
Psychophysics - Fechner’s law
Weber’s law is not a true psychophysical law: it relates stimulus threshold to stimulus (both physical entities), not internal representation (feelings) to physical entity
Gustav Theodor Fechner: student of Weber; medicine, physics, philosophy
Simplest assumption: the JND is a single internal unit
Using Weber’s law we find:
$Y = A \log I + B$
Fechner Day (October 22 1850)
Fechner’s law - cont.
Log is very compressive
Fechner’s law explains the fantastic ranges of our senses
Sight: single photon - direct sunlight, $10^{15}$
Hearing: eardrum moves 1 H atom - jet plane, $10^{12}$
Bel defined to be $\log_{10}$ of power ratio; decibel (dB) is one tenth of a Bel
$d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
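As a quick worked example of the decibel formula, the sketch below converts the two ranges quoted above, treating them as power ratios (an assumption; the slide does not say which quantity the ratios refer to). The function name is hypothetical.

```python
import math

def power_ratio_db(p1, p2):
    """Decibel value of the power ratio p1 / p2."""
    return 10.0 * math.log10(p1 / p2)

hearing_range_db = power_ratio_db(1e12, 1.0)  # range quoted for hearing
sight_range_db = power_ratio_db(1e15, 1.0)    # range quoted for sight
print(hearing_range_db, sight_range_db)
```

Under that reading, twelve orders of magnitude compress to about 120 dB and fifteen to about 150 dB, which is the compressive behavior of the logarithm in action.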
Fechner’s law - sound amplitudes
Companding: adaptation of the logarithm to positive/negative signals
μ-law and A-law are piecewise linear approximations
Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly more noisy)
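To see why companding helps, here is a small sketch using the continuous μ-law curve with μ = 255. Note that this is the smooth logarithmic formula, not the piecewise linear approximation used in practice, and the uniform quantizer with 127 levels per polarity is a simplification assumed for illustration.

```python
import math

MU = 255.0  # common mu-law parameter (assumed here)

def mu_compress(x):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(v, levels=127):
    """Uniform quantizer on [-1, 1] with `levels` steps per polarity (roughly 8 bits)."""
    return round(v * levels) / levels

x = 0.01  # a quiet sample
linear_error = abs(quantize(x) - x)
companded_error = abs(mu_expand(quantize(mu_compress(x))) - x)
print(linear_error, companded_error)
```

For this quiet sample the companded path reproduces the value far more accurately than direct 8 bit linear quantization, at the price of coarser steps for loud samples.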
Fechner’s law - sound frequencies
octaves, well tempered scale (semitone ratio $\sqrt[12]{2}$)
Critical bands
Frequency warping
Mel(ody): 1 kHz = 1000, JND afterwards: $M \approx 1000 \log_2 (1 + f_{\mathrm{kHz}})$
Bark(hausen): can be simultaneously heard, excite different basilar membrane regions: $B \approx 25 + 75\,(1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
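The two warping formulas are easy to evaluate directly. In this sketch the function names are hypothetical, and B is read as a critical bandwidth in Hz for a center frequency given in kHz, which is an interpretation of the slide rather than something it states.

```python
import math

def mel(f_khz):
    """Mel value for a frequency in kHz: M ~ 1000 log2(1 + f)."""
    return 1000.0 * math.log2(1.0 + f_khz)

def critical_bandwidth_hz(f_khz):
    """B ~ 25 + 75 (1 + 1.4 f^2)^0.69, read here as a bandwidth in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(f, round(mel(f)), round(critical_bandwidth_hz(f)))
```

By construction 1 kHz maps to 1000 on the mel scale, and the bandwidth stays near 100 Hz at low frequencies before widening as frequency grows.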
Psychophysics - changes
Our senses respond to changes
[Figure: block labeled Inverse E Filter]
Psychophysics - masking
Masking: strong tones block weaker ones at nearby frequencies
narrowband noise blocks tones (up to critical band)
[Figure: masking illustrated along the frequency axis f]
Speech DSP
Some Speech DSP
Simplest processing
– Gain
– AGC
– VAD
More complex processing
– pitch tracking
– U/V decision
– computing LPC
– other features
Simple Speech DSP
Gain (volume) Control
In analog processing (electronics) gain requires an amplifier
Great care must be taken to ensure linearity!
In digital processing (DSP) gain requires only multiplication
y = G x
Need enough bits!
SpchDSP
Automatic Gain Control (AGC)
Can we set the gain automatically?
Yes, based on the signal’s energy!
$E = \int x^2(t)\, dt = \sum_n x_n^2$
All we have to do is apply gain until we attain the desired energy
Assume we want the energy to be Y
Then
$y = \sqrt{Y/E}\; x = G x$
has exactly this energy
AGC - cont.
What if the input isn’t stationary (gets stronger and weaker over time)?
The energy is defined for all times $-\infty < t < \infty$, so it can’t help!
So we define “energy in window” E(t) and continuously vary the gain G(t)
This is Adaptive Gain Control
We don’t want the gain to jump from window to window, so we smooth the instantaneous gain
$G(t) \leftarrow \alpha\, G(t) + (1-\alpha) \sqrt{Y/E(t)}$ (IIR filter)
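A minimal sketch of adaptive gain control along these lines is given below. It assumes non-overlapping windows, a square-root gain so that the output energy per window approaches the target, and a small energy floor to avoid division by zero; the function name, window length, and smoothing coefficient are illustrative choices, not values from the slides.

```python
import math

def adaptive_gain(x, target, window=4, alpha=0.9, floor=1e-12):
    """Scale x window by window so the energy per window approaches `target`.

    The gain is smoothed by a one-pole IIR filter:
        G <- alpha * G + (1 - alpha) * sqrt(target / E_window)
    """
    gain, out = 1.0, []
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = max(sum(v * v for v in frame), floor)  # floor guards against silence
        gain = alpha * gain + (1.0 - alpha) * math.sqrt(target / energy)
        out.extend(gain * v for v in frame)
    return out

# A constant quiet input: each window of four samples has energy 0.04.
quiet = [0.1] * 4000
leveled = adaptive_gain(quiet, target=4.0)
```

The gain starts at 1 and drifts toward 10, so later samples approach amplitude 1.0; a larger `alpha` makes that drift slower and smoother.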
AGC - cont.
The smoothing coefficient $\alpha$ determines how fast G(t) can change
In more complex implementations we may separately control integration time, attack time, release time
What is involved in the computation of G(t)?
– Squaring of input value
– Accumulation
– Square root (or Pythagorean sum)
– Inversion (division)
Square root and inversion are hard for a DSP processor, but algorithmic improvements are possible (and often needed)
Simple VAD
Sometimes it is useful to know whether someone is talking (or not)
– Save bandwidth
– Suppress echo
– Segment utterances
We might be able to get away with “energy VOX”
Normally need Noise Riding Threshold / Signal Riding Threshold
However, there are problems with energy VOX, since it doesn’t differentiate between speech and noise
What we really want is a speech-specific activity detector: a Voice Activity Detector
Simple VAD - cont.
VADs operate by recognizing that speech is different from noise
– Speech is low-pass while noise is white
– Speech is mostly voiced and so has pitch in a given range
– Average noise amplitude is relatively constant
A simple VAD may use:
– zero crossings
– zero crossing “derivative”
– spectral tilt filter
– energy contours
– combinations of the above
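Two of the listed cues, energy and zero crossings, can be combined into a toy detector. The sketch below is only an illustration: the thresholds are arbitrary, fixed rather than noise-riding, and the three test frames are synthetic stand-ins for voiced speech, a rapidly alternating noise-like signal, and near silence.

```python
import math

def frame_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(v * v for v in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def simple_vad(frame, energy_threshold=1.0, max_zcr=0.25):
    """Declare speech when the frame is energetic and mostly low-pass (few crossings).

    Both thresholds are hypothetical and would need tuning on real data.
    """
    return frame_energy(frame) > energy_threshold and zero_crossing_rate(frame) < max_zcr

voiced_like = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]  # low-pass, strong
noise_like = [0.5, -0.5] * 100                                           # strong, but crosses every sample
silence = [0.001 * v for v in voiced_like]                               # low-pass, but very weak

decisions = [simple_vad(f) for f in (voiced_like, noise_like, silence)]
```

Only the first frame passes both tests, which is the point of combining cues: energy alone would also accept the noise-like frame.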
Other “simple” processes
Simple = not significantly dependent on details of speech signal
Speed change of recorded signal
Speed change with pitch compensation
Pitch change with speed compensation
Sample rate conversion
Tone generation
Tone detection
Dual tone generation
Dual tone detection (need high reliability)
Complex Speech DSP
Correlation
One major difference between simple and complex processing is the computation of correlations (related to LPC model)
Correlation is a measure of similarity
Shouldn’t we use squared difference to measure similarity?
$D^2 = \langle (x(t) - y(t))^2 \rangle$
No, since squared difference is sensitive to
– gain
– time shifts
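Correlation - cont.
$D^2 = \langle (x(t) - y(t))^2 \rangle = \langle x^2 \rangle + \langle y^2 \rangle - 2 \langle x(t)\, y(t) \rangle$
So when the squared difference $D^2$ is minimal, $C(0) = \langle x(t)\, y(t) \rangle$ is maximal, and arbitrary gains don’t change this
To take time shifts into account, use
$C(\tau) = \langle x(t)\, y(t+\tau) \rangle$
and look for the maximal $C(\tau)$
We can even find out how much a signal resembles itself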
Autocorrelation
Crosscorrelation: $C_{xy}(\tau) = \langle x(t)\, y(t+\tau) \rangle$
Autocorrelation: $C_x(\tau) = \langle x(t)\, x(t+\tau) \rangle$
$C_x(0)$ is the energy!
Autocorrelation helps find hidden periodicities!
Much stronger than looking in the time representation
Wiener-Khintchine: autocorrelation $C(\tau)$ and power spectrum S(f) are an FT pair
So autocorrelation contains the same information as the power spectrum … and can itself be computed by FFT
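The two autocorrelation properties mentioned here, energy at lag zero and a peak at the hidden period, can be seen on a short synthetic signal. This sketch uses a direct sum over the available samples rather than the FFT route, and the 10-sample repeating pattern is made up for the purpose.

```python
def autocorrelation(x, lag):
    """Unnormalized autocorrelation: sum over t of x[t] * x[t + lag]."""
    return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

# A hypothetical signal that repeats every 10 samples.
pattern = [1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
signal = pattern * 6

energy = sum(v * v for v in signal)
# Search a range of candidate lags for the strongest self-similarity.
best_lag = max(range(3, 26), key=lambda lag: autocorrelation(signal, lag))
print(autocorrelation(signal, 0), energy, best_lag)
```

The lag-zero value equals the energy, and the largest value among the searched lags falls at the period of the pattern; the multiple at twice the period is weaker only because fewer samples overlap.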
Pitch tracking
How can we measure (and track) the pitch?
We can look for it in the spectrum
– but it may be very weak
– may not even be there (filtered out)
– need high resolution spectral estimation
Correlation based methods
The pitch periodicity should be seen in the autocorrelation!
Sometimes computationally simpler is the Absolute Magnitude Difference Function (AMDF)
$\mathrm{AMDF}(\tau) = \langle\, | x(t) - x(t+\tau) | \,\rangle$
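The magnitude difference function needs only subtractions and absolute values, and it dips toward zero at the pitch period instead of peaking there. A minimal sketch, with a made-up triangular waveform of period 8 and an assumed search range of lags:

```python
def amdf(x, lag):
    """Average of |x[t] - x[t + lag]| over the samples where both terms exist."""
    n = len(x) - lag
    return sum(abs(x[t] - x[t + lag]) for t in range(n)) / n

# A hypothetical waveform that repeats every 8 samples.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0] * 6

# The pitch lag is the candidate with the smallest difference (a minimum, not a maximum).
pitch_lag = min(range(3, 13), key=lambda lag: amdf(signal, lag))
print(pitch_lag, amdf(signal, pitch_lag))
```

For an exactly periodic signal the function reaches zero at the period; with real speech one would look for the deepest dip instead.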
Pitch tracking - cont.
Sondhi’s algorithm for autocorrelation-based pitch tracking:
– obtain window of speech
– determine if the segment is voiced (see U/V decision below)
– low-pass filter and center-clip to reduce formant induced correlations
– compute autocorrelation lags corresponding to valid pitch intervals
• find lag with maximum correlation OR
• find lag with maximal accumulated correlation in all multiples
Post processing
Pitch trackers rarely make small errors (usually double pitch)
So correct outliers based on neighboring values
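The core of the procedure above (center-clip, then pick the autocorrelation peak over the valid lags) can be sketched as follows. This is a simplified illustration, not Sondhi's full algorithm: the voicing decision, the low-pass filter, and the post processing are omitted, and the clipping fraction, lag range, and test frame are assumptions.

```python
def center_clip(x, fraction=0.3):
    """Zero out samples below a fraction of the peak and shift the rest toward zero."""
    c = fraction * max(abs(v) for v in x)
    return [v - c if v > c else v + c if v < -c else 0.0 for v in x]

def autocorrelation(x, lag):
    """Unnormalized autocorrelation at the given lag."""
    return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

def estimate_pitch_lag(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with maximum autocorrelation of the clipped frame."""
    clipped = center_clip(frame)
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(clipped, lag))

# Hypothetical voiced frame: a short decaying burst repeated every 50 samples.
frame = ([1.0, 0.6, -0.5, 0.2] + [0.0] * 46) * 8
lag = estimate_pitch_lag(frame, min_lag=20, max_lag=100)
print(lag)
```

At an assumed 8 kHz sampling rate the searched lags of 20 to 100 samples would correspond to pitch frequencies of 400 Hz down to 80 Hz.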
Other Pitch Trackers
Miller’s data-reduction & Gold and Rabiner’s parallel processing methods
– zero-crossings, energy, extrema of waveform
Noll’s cepstrum based pitch tracker
– since the pitch and formant contributions are separated in the cepstral domain
– most accurate for clean speech, but not robust in noise
Methods based on LPC error signal
– LPC technique breaks down at pitch pulse onset
– find periodicity of error by autocorrelation
Inverse filtering method
– remove formant filtering by low-order LPC analysis
– find periodicity of excitation by autocorrelation
Sondhi-like methods are the best for noisy speech
U/V decision
Between VAD and pitch tracking
Simplest U/V decision is based on energy and zero crossings
More complex methods are combined with pitch tracking
Methods based on pattern recognition
Is voicing well defined?
– Degree of voicing (buzz)
– Voicing per frequency band (interference)
– Degree of voicing per frequency band
LPC Coefficients
How do we find the vocal tract filter coefficients?
System identification problem
[Figure: known input → unknown filter → known output]
All-pole (AR) filter
Connection to prediction
$s_n = G e_n + \sum_m a_m s_{n-m}$
Can find G from energy (so let’s ignore it)
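LPC Coefficients
For simplicity let’s assume three a coefficients
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
Need three equations!
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
$s_{n+1} = e_{n+1} + a_1 s_n + a_2 s_{n-1} + a_3 s_{n-2}$
$s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_n + a_3 s_{n-1}$
In matrix form
$\begin{pmatrix} s_n \\ s_{n+1} \\ s_{n+2} \end{pmatrix} = \begin{pmatrix} e_n \\ e_{n+1} \\ e_{n+2} \end{pmatrix} + \begin{pmatrix} s_{n-1} & s_{n-2} & s_{n-3} \\ s_n & s_{n-1} & s_{n-2} \\ s_{n+1} & s_n & s_{n-1} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}$
$\mathbf{s} = \mathbf{e} + \mathbf{S}\,\mathbf{a}$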
LPC Coefficients - cont.
$\mathbf{s} = \mathbf{e} + \mathbf{S}\,\mathbf{a}$
so by simple algebra
$\mathbf{a} = \mathbf{S}^{-1} (\mathbf{s} - \mathbf{e})$
and we have reduced the problem to matrix inversion
Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm)
Unfortunately noise makes this attempt break down!
Move to the next time and the answer will be different.
Need to somehow average the answers
The proper averaging is before the equation solving
correlation vs autocovariance
LPC Coefficients - cont.
Can’t just average over time - all equations would be the same!
Let’s take the input to be zero
$s_n = \sum_m a_m s_{n-m}$
multiply by $s_{n-q}$ and sum over n
$\sum_n s_n s_{n-q} = \sum_m a_m \sum_n s_{n-m} s_{n-q}$
we recognize the autocorrelations
$C_s(q) = \sum_m C_s(|m-q|)\, a_m$
Yule-Walker equations
– autocorrelation method: $s_n$ outside window are zero (Toeplitz)
– autocovariance method: use all needed $s_n$ (no window)
Also - pre-emphasis!
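For the Toeplitz (autocorrelation method) case, the Yule-Walker equations can be solved by the Levinson-Durbin recursion mentioned earlier. The sketch below is a compact textbook-style version written for the sign convention used in these slides (prediction $s_n = \sum_m a_m s_{n-m}$); the function name and the example autocorrelation values are assumptions, the latter constructed to be consistent with $a_1 = 0.9$, $a_2 = -0.2$.

```python
def levinson_durbin(r, order):
    """Solve C(q) = sum_m C(|m-q|) a_m for q = 1..order.

    r[q] is the autocorrelation at lag q (r[0] is the energy).
    Returns ([a_1, ..., a_order], final prediction error).
    """
    a = []
    err = r[0]
    for i in range(1, order + 1):
        # Correlation not yet explained by the current lower-order predictor.
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / err  # reflection coefficient for this order
        # Update the lower-order coefficients and append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= 1.0 - k * k
    return a, err

# Hypothetical autocorrelation values consistent with a_1 = 0.9, a_2 = -0.2.
r = [1.0, 0.75, 0.475]
coeffs, error = levinson_durbin(r, order=2)
print(coeffs, error)
```

Each pass raises the predictor order by one, so the cost grows with the square of the order rather than the cube needed for general matrix inversion; the intermediate `k` values are the reflection coefficients of the cylinder model.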
Alternative features
The a coefficients aren’t the only set of features
– Reflection coefficients (cylinder model)
– log-area coefficients (cylinder model)
– pole locations
– LPC cepstrum coefficients
– Line Spectral Pair frequencies
All theoretically contain the same information (algebraic transformations)
– Euclidean distance in LPC cepstrum space ~ Itakura Saito measure, so these are popular in speech recognition
– LPC (a) coefficients don’t quantize or interpolate well, so these aren’t good for speech compression
– LSP frequencies are best for compression
LSP coefficients
a coefficients are not statistically equally weighted
pole positions are better (geometric), but radius is sensitive near unit circle
Is there an all-angle representation?
Theorem 1: Every real polynomial with all roots on the unit circle is palindromic (e.g. $1 + 2t + t^2$) or antipalindromic (e.g. $1 + t - t^2 - t^3$)
Theorem 2: Every polynomial can be written as the sum of palindromic and antipalindromic polynomials
Consequence: Every polynomial can be represented by roots on the unit circle, that is, by angles
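Theorem 2 is constructive, and the usual LSP construction applies it to the prediction polynomial. The sketch below assumes the convention $A(z) = 1 - \sum_m a_m z^{-m}$ padded by one degree, and forms the palindromic and antipalindromic parts by adding and subtracting the mirrored coefficient list; the function name and example coefficients are hypothetical, and finding the root angles is left out.

```python
def lsp_polynomials(a):
    """Split A(z) = 1 - sum_m a_m z^-m into palindromic P and antipalindromic Q.

    Returns (coeffs, p, q) as coefficient lists of length M + 2,
    with coeffs = (p + q) / 2 term by term.
    """
    coeffs = [1.0] + [-a_m for a_m in a] + [0.0]  # A(z), padded to degree M + 1
    mirrored = coeffs[::-1]                       # coefficient list read backwards
    p = [c + m for c, m in zip(coeffs, mirrored)]
    q = [c - m for c, m in zip(coeffs, mirrored)]
    return coeffs, p, q

coeffs, p, q = lsp_polynomials([0.9, -0.2])  # hypothetical LPC coefficients
print(p)  # reads the same in both directions
print(q)  # changes sign when read backwards
```

For a stable synthesis filter the roots of P and Q are known to lie on the unit circle, so their angles (the line spectral frequencies) carry the same information as the a coefficients.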
LPC - based Compression
We learned that from
– gain
– pitch
– a small number of LPC coefficients
we could synthesize speech
It is easy to find the energy of a speech signal
We have seen methods to find pitch
We saw how to extract LPC coefficients from speech
So do we know how to compress speech?