Text Independent Speaker Recognition
TRANSCRIPT
-
8/2/2019 Text Independent Speaker Recognition
1/31
Speaker Recognition
Prepared by: Pravin Gondaliya [08BEC029], Surendra Jalu [08BEC034]
Guided by: Dr. Tanish H. Zave
-
Our Goal:
To understand digital speech signal processing and implement it on the Spartan-3A DSP kit.
-
Today's Agenda:
Basics of speech processing
What is speech enhancement?
Speech enhancement algorithm
Spartan 3A DSP kit
ISE tool for Designing
-
Introduction to Speech Processing
Speech processing is the application of digital signal processing (DSP) techniques to the processing and/or analysis of speech signals.
Applications of speech processing include:
- Speech coding
- Speech recognition
- Speaker verification/identification
- Speech enhancement
- Speech synthesis (text-to-speech conversion)
-
The figure shows a schematic diagram of the speech production/speech perception process in human beings.
The speech production process begins when the talker formulates a message in his/her mind to transmit to the listener via speech.
The next step in the process is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody markers denoting the duration, loudness, and pitch associated with the sounds.
-
Information Rate of the Speech Signal
The discrete symbol information rate in the raw message text is rather low (about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols).
After the language code conversion, with the inclusion of prosody information, the information rate rises to about 200 bps.
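The slide's 50 bps figure can be sanity-checked with a quick calculation: 8 sounds per second, each drawn from about 50 distinct symbols, carries roughly 8 × log2(50) bits per second.

```python
import math

# ~8 sounds per second, each one of ~50 distinct symbols (slide figures).
sounds_per_second = 8
num_symbols = 50

bits_per_symbol = math.log2(num_symbols)          # ~5.6 bits per sound
raw_rate = sounds_per_second * bits_per_symbol    # raw text information rate
print(round(raw_rate))  # 45, i.e. on the order of 50 bps
```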
-
The Mechanism of Speech Production
In order to apply DSP techniques to speech processing problems, it is important to understand the fundamentals of the speech production process.
Speech signals are composed of a sequence of sounds, and these sounds are produced as a result of acoustical excitation of the vocal tract when air is expelled from the lungs.
-
Speech Production Mechanism
The vocal tract begins at the opening between the vocal cords and ends at the lips.
In the average male, the total length of the vocal tract is about 17 cm.
The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw and velum, varies from zero (complete closure) to about 20 cm².
-
Classification of Speech Sounds
In speech processing, speech sounds are divided into TWO broad classes, which depend on the role of the vocal cords in the speech production mechanism.
- VOICED speech is produced when the vocal cords play an active role (i.e. vibrate) in the production of a sound. Examples: /a/, /e/, /i/
- UNVOICED speech is produced when the vocal cords are inactive. Examples: /s/, /f/
-
Voiced Speech
Voiced speech occurs when air flows through the vocal cords into the vocal tract in discrete puffs rather than as a continuous flow.
The vocal cords vibrate at a particular frequency, which is called the fundamental frequency of the sound:
- 50-200 Hz for male speakers
- 150-300 Hz for female speakers
- 200-400 Hz for child speakers
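As a rough illustration of the fundamental frequency, here is a minimal autocorrelation-based pitch estimator. The method, sampling rate, and frame length are illustrative assumptions, not from the slides; the lag search range covers the combined 50-400 Hz span above.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate the fundamental frequency of a voiced frame by
    autocorrelation, searching lags corresponding to 50-400 Hz."""
    frame = frame - np.mean(frame)
    # Autocorrelation at non-negative lags 0 .. len(frame)-1.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)          # shortest period of interest
    lag_max = int(fs / f0_min)          # longest period of interest
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / lag

# Synthetic 120 Hz "voiced" tone as a quick check.
fs = 8000
t = np.arange(0, 0.05, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t)
print(estimate_f0(frame, fs))  # close to 120 Hz
```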
-
Unvoiced Speech
For unvoiced speech, the vocal cords are held open and air flows continuously through them.
The vocal tract, however, is narrowed, resulting in a turbulent flow of air along the tract.
Examples include the unvoiced fricatives /f/ and /s/.
Unvoiced speech is characterized by high-frequency components.
-
Other Sound Classes
Nasal Sounds
- Vocal tract coupled acoustically with the nasal cavity through the velar opening
- Sound radiated from the nostrils as well as the lips
- Examples include m, n, ing
Plosive Sounds
- Characterized by a complete closure/constriction towards the front of the vocal tract
- Build-up of pressure behind the closure, then sudden release
- Examples include p, t, k
-
Speech Enhancement
Speech enhancement is concerned with improving some perceptual aspect of speech that has been degraded by additive noise.
Different kinds of noise affect the quality of speech. Different speech enhancement techniques are used to improve the quality of speech and reduce the specific noise coming from different sources at different SNRs.
-
Block Diagram of MFCC Algorithm
-
Preprocessing & Frame Blocking
Continuous human speech is recorded and preprocessed. In preprocessing, silence detection and amplification take place.
The preprocessed output is then fed to the frame blocking section.
In frame blocking, the continuous speech signal is blocked into frames of some number of samples. This process continues until all the speech is accounted for within one or more frames.
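A minimal sketch of the frame blocking step. The frame length and shift are illustrative values (e.g. 256-sample frames with a 100-sample hop); the slide does not fix specific numbers.

```python
import numpy as np

def frame_blocking(signal, frame_len=256, frame_shift=100):
    """Split a 1-D speech signal into overlapping frames.

    Assumes len(signal) >= frame_len; trailing samples that do not
    fill a whole frame are dropped.
    """
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len]
    return frames

x = np.arange(1000.0)            # stand-in for a recorded signal
frames = frame_blocking(x)
print(frames.shape)  # (8, 256)
```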
-
Windowing
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame.
If we define the window as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the resulting signal is
y(n) = x(n) w(n), 0 ≤ n ≤ N-1
Typically the Hamming window is used, which has the form
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
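The Hamming window can be generated and applied directly; this sketch follows the formula above (which is also what NumPy's built-in np.hamming computes).

```python
import numpy as np

# w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1
def hamming(N):
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

N = 256
w = hamming(N)
frame = np.random.randn(N)      # stand-in for one speech frame x(n)
y = frame * w                   # windowed frame y(n) = x(n) w(n)
print(w[0], w[N // 2])  # endpoints taper to about 0.08; center is near 1.0
```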
-
Mel Frequency Cepstrum
The power cepstrum (of a signal) is the squared magnitude of the Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal.
Mathematically: power cepstrum of signal = |F{log(|F{y(t)}|²)}|²
Algorithmically:
signal → FT → abs() → square → log → FT → abs() → square → power cepstrum
The cepstrum can be seen as information about the rate of change in the different spectrum bands. It was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions. It has also been used to determine the fundamental frequency of human speech and to analyze radar signal returns. Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.
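The chain above can be written out directly. This sketch uses an illustrative 441 Hz three-harmonic test tone (not from the slides) and a small epsilon to keep the log finite; the pitch period then shows up as the dominant quefrency peak.

```python
import numpy as np

# signal -> FT -> abs() -> square -> log -> FT -> abs() -> square
def power_cepstrum(y):
    power_spectrum = np.abs(np.fft.fft(y)) ** 2
    # Epsilon guards against log(0) in near-empty bins (an
    # implementation detail, not part of the definition).
    return np.abs(np.fft.fft(np.log(power_spectrum + 1e-12))) ** 2

fs = 44100
t = np.arange(0, 0.05, 1 / fs)
# Harmonic-rich tone: 441 Hz fundamental plus two harmonics.
y = sum(np.sin(2 * np.pi * 441 * k * t) for k in (1, 2, 3))
c = power_cepstrum(y)
# Skip the low-quefrency region, which holds the spectral envelope.
q = 50 + np.argmax(c[50:len(c) // 2])
print(fs / q)  # pitch estimate near 441 Hz (period ~ 100 samples)
```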
-
The independent variable of a cepstral graph is called the quefrency. The quefrency is a measure of time, though not in the sense of a signal in the time domain. For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch that is 44100/100 = 441 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.
Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
So our next step is the FFT (Fast Fourier Transform) of the speech signal, which is then fed to the mel-scale filtering stage.
-
Difference Between Normal and Mel Cepstrum
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example in audio compression.
-
Why the MEL Scale?
Psychophysical studies have shown that human perception of the frequency contents of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale.
The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.
-
MEL Scale
The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The name mel comes from the word melody, to indicate that the scale is based on pitch comparisons.
A popular formula to convert f hertz into m mel is:
m = 2595 log10(1 + f/700)
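The conversion in code, together with its inverse (obtained by solving the formula above for f):

```python
import math

# m = 2595 * log10(1 + f/700)
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# By construction of the scale, 1000 Hz maps to ~1000 mel.
print(round(hz_to_mel(1000)))  # 1000
```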
-
MFCC
MFCCs are commonly derived as follows:
- Take the Fourier transform of (a windowed excerpt of) a signal.
- Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
- Take the logs of the powers at each of the mel frequencies.
- Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
- The MFCCs are the amplitudes of the resulting spectrum.
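The steps above can be sketched end-to-end in NumPy. The frame size, filter count (26), and number of coefficients (13) are common but illustrative choices not specified on the slide.

```python
import numpy as np

def mel_filterbank(num_filters, nfft, fs, fmin=0.0, fmax=None):
    """Triangular overlapping windows equally spaced on the mel scale."""
    fmax = fmax or fs / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(fmin), mel(fmax), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_inv(mel_pts) / fs).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(num_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                    # rising edge
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge
            fbank[i, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(frame, fs, num_filters=26, num_ceps=13):
    # 1) Fourier transform of the (already windowed) frame.
    power = np.abs(np.fft.rfft(frame)) ** 2
    # 2) Map the spectral powers onto the mel scale.
    mel_energies = mel_filterbank(num_filters, len(frame), fs) @ power
    # 3) Log of the mel-band powers (epsilon avoids log(0)).
    log_energies = np.log(mel_energies + 1e-10)
    # 4) DCT-II of the log powers; keep the first amplitudes as MFCCs.
    n = np.arange(num_filters)
    dct = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1)
                 / (2 * num_filters))
    return dct @ log_energies

fs = 8000
frame = np.hamming(256) * np.random.randn(256)   # windowed test frame
coeffs = mfcc(frame, fs)
print(coeffs.shape)  # (13,)
```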
-
Implementation
So, most of the work has been done. Now for each speaker we record 5 samples of speech. Each speech sample undergoes mel-frequency cepstral analysis, and MFCCs are calculated for each of the samples. The computed values are then stored in the DB.mat database.
Then pattern matching takes place. The system asks the user to enter his/her speech for testing and compares the computed MFCCs of this test speech with those in the DB.mat database. If it matches, then the user will be identified.
-
Pattern Matching
In this process, the centroid of the values for the five samples is computed, as shown in the figure.
Then, for each speaker, the test speech is compared with each of the samples, including the centroid. The best match is selected on the basis of the maximum number of values matched in the particular sample.
So if, for any speaker, any one out of the five is matched with the test speech, then that user will be identified.
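A minimal sketch of this matching scheme, assuming a Euclidean nearest-template rule (the slides say only "best match" and "maximum values matched"), with hypothetical speaker names and toy 2-D "MFCC" vectors standing in for real coefficients.

```python
import numpy as np

def best_speaker(test_mfcc, database):
    """database maps speaker name -> list of five enrollment MFCC vectors.
    Compares the test vector against every sample plus the centroid."""
    best, best_dist = None, np.inf
    for speaker, samples in database.items():
        templates = samples + [np.mean(samples, axis=0)]  # include centroid
        for tmpl in templates:
            d = np.linalg.norm(test_mfcc - tmpl)
            if d < best_dist:
                best, best_dist = speaker, d
    return best

# Toy database: two speakers, five noisy 2-D feature vectors each.
rng = np.random.default_rng(0)
db = {
    "pravin": [np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(2)
               for _ in range(5)],
    "surendra": [np.array([5.0, 5.0]) + 0.1 * rng.standard_normal(2)
                 for _ in range(5)],
}
print(best_speaker(np.array([1.05, 2.02]), db))  # pravin
```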
-
Waiting for Your Valuable Suggestions
Thank You
-
Resonant Frequencies of Vocal Tract
The vocal tract is a non-uniform acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.
The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw and velum.
The spectrum of the vocal tract response consists of a number of resonant frequencies of the vocal tract; these frequencies are called formants.
Three to four formants are present below 4 kHz in speech.
-
Formant Frequencies
Speech normally exhibits one formant frequency in every 1 kHz.
For VOICED speech, the magnitudes of the lower formant frequencies are successively larger than the magnitudes of the higher formant frequencies.
For UNVOICED speech, the magnitudes of the higher formant frequencies are successively larger than the magnitudes of the lower formant frequencies.
-