7/23/2019 TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION - Project Report - GANESH TIWARI - IOE - TU
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION
By:
GANESH TIWARI (063/BCT/510)
MADHAV PANDEY (063/BCT/514)
MANOJ SHRESTHA (063/BCT/518)
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE BACHELOR'S DEGREE IN COMPUTER
ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
January, 2011
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
The undersigned certify that they have read, and recommended to the Institute of Engineering for
acceptance, a project report entitled Text-Prompted Remote Speaker Authentication submitted
by Ganesh Tiwari, Madhav Pandey and Manoj Shrestha in partial fulfillment of the requirements
for the Bachelor's degree in Computer Engineering.
__________________________________
Supervisor, Dr. Subarna Shakya
Associate Professor
Department of Electronics and Computer Engineering
__________________________________
Internal Examiner,
_________________________________
External Examiner,
DATE OF APPROVAL:
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering may make this report freely available for inspection.
Moreover, the author has agreed that permission for extensive copying of this project report for
scholarly purpose may be granted by the supervisors who supervised the project work recorded
herein or, in their absence, by the Head of the Department wherein the project report was done. It
is understood that the recognition will be given to the author of this report and to the Department
of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use
of the material of this project report. Copying, publication, or any other use of this report for
financial gain without the approval of the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and the author's written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in whole or
in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
We are very thankful to the Institute of Engineering (IOE), Pulchowk Campus for offering the
course on the major project. We also thank all the teachers and staff of the Department of
Electronics and Computer Engineering who assisted during the project period by giving suitable
suggestions and lectures on different subject matters relating to the conduct and achievement
of the project goals.
We are much obliged to Dr. Subarna Shakya, Department of Electronics and Computer
Engineering, IOE Pulchowk Campus, for the inspiration and valuable suggestions we received
throughout the working period.
We would like to thank the forum members of askmeflash.com, stackoverflow.com, and
dsprelated.com for their quick responses and valuable opinions on our queries.
We also express our gratitude to all the friends and juniors who helped a lot with training data
collection.
Members of Project
Ganesh Tiwari (063BCT510)
Madhav Pandey (063BCT514)
Manoj Shrestha (063BCT518)
IOE, PULCHOWK CAMPUS
ABSTRACT
A biometric is a physical characteristic unique to each individual. It has very useful applications
in authentication and access control.
The designed system is a text-prompted version of voice biometrics which incorporates a
text-independent speaker verification system and a speaker-independent speech verification
system, implemented independently. The foundation for this joint system is that the speech signal
conveys both the speech content and the speaker identity. Such systems are more secure against
playback attacks, since the word to speak during authentication is not previously set.
During the course of the project, various digital signal processing and pattern classification
algorithms were studied. Short-time spectral analysis was performed to obtain MFCC, energy,
and their deltas as features. The feature extraction module is the same for both systems. Speaker
modeling was done by GMM, and a Left-to-Right Discrete HMM with VQ was used for isolated
word modeling. The results of both systems were combined to authenticate the user.
The speech model for each word was pre-trained using utterances of 45 English words. The
speaker model was trained with about 2 minutes of speech from each of 15 speakers. For
individual words, the recognition rate of the speech recognition system is 92% and that of the
speaker recognition system is 66%. For longer utterances (>5 s), the recognition rate of the
speaker recognition system improves to 78%.
TABLE OF CONTENTS
PAGE OF APPROVAL ........................................................................................................ I
COPYRIGHT ................................................................................................................... 2
ACKNOWLEDGEMENT ................................................................................................. 3
ABSTRACT ..................................................................................................................... 4
TABLE OF CONTENTS .............................................................................................. V
LIST OF FIGURES .......................................................................................................... 1
LIST OF SYMBOLS AND ABBREVIATIONS ............................................................. IX
1. INTRODUCTION......................................................................................................... 1
1.2 Objectives ................................................................................................................... 2
2. LITERATURE REVIEW .............................................................................................. 3
2.1 Pattern Recognition..................................................................................................... 3
2.2 Generation of Voice ................................................................................................... 4
2.3 Voice as Biometric ..................................................................................................... 6
2.4 Speech Recognition .................................................................................................... 7
2.5 Speaker Recognition ................................................................................................... 7
2.5.1. Types of Speaker Recognition ............................................................................. 8
2.5.2. Modes of Speaker Recognition ............................................................................ 9
2.6 Feature Extraction for Speech/Speaker Recognition System ................................ ...... 10
2.6.1. Short Time Analysis .......................................................................................... 10
2.6.2. MFCC Feature ................................................................................................... 11
2.7 Speaker/Speech Modeling ......................................................................................... 12
2.7.1. Gaussian Mixture Model ................................................................................... 12
2.7.2. Hidden Markov Model ...................................................................................... 15
2.7.3. K-Means Clustering .......................................................................................... 19
3. IMPLEMENTATION DETAILS ................................................................................ 20
3.1 Pre-Processing and Feature Extraction ...................................................................... 20
3.1.1. Capture .............................................................................................................. 20
3.1.2. End point Detection and Silence Removal ......................................................... 21
3.1.3. PCM Normalization .......................................................................................... 22
3.1.4. Pre-emphasis ..................................................................................................... 22
3.1.5. Framing and Windowing ................................................................................... 23
3.1.6. Discrete Fourier Transform ............................................................................... 25
3.1.7. Mel Filter .......................................................................................................... 25
3.1.8. Cepstrum by Inverse Discrete Fourier Transform .............................................. 27
3.2 GMM Implementation .............................................................................................. 30
3.2.1. Block Diagram of GMM Based Speaker Recognition System ........................... 30
3.2.2. GMM Training .................................................................................................. 31
3.2.3. Verification ....................................................................................................... 34
3.2.4. Performance Measure of Speaker Verification System....................................... 34
3.3 Implementation of HMM for Speech Recognition ..................................................... 36
3.3.1. Isolated Word Recognition ................................................................................ 39
3.3.2. Application of HMM ......................................................................................... 40
3.3.3. Scaling .............................................................................................................. 47
4. UML CLASS DIAGRAMS OF THE SYSTEMS ........................................................ 48
5. DATA COLLECTION AND TRAINING ................................................................... 50
6. RESULTS ................................................................................................................... 51
7. APPLICATION AREA ............................................................................................... 52
8. CONCLUSION ........................................................................................................... 52
REFERENCES ............................................................................................................... 53
APPENDIX A: BlazeDS Configuration for Remoting Service ........................................ 54
APPENDIX B: Words Used for HMM Training.............................................................. 55
APPENDIX C: Development Tools and Environment ..................................................... 56
APPENDIX D: Snapshots of Output GUI ....................................................................... 57
LIST OF FIGURES
Figure 1.1: System Architecture..1
Figure 1.2: Block Diagram of Text Prompted Speaker Verification System .2
Figure 2.1: General block diagram of pattern recognition system ..3
Figure 2.2: Vocal Schematic....4
Figure 2.3: Audio Sample for /i:/ phoneme ....5
Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme ..6
Figure 2.5: GMM with four Gaussian components and their equivalent model ..13
Figure 2.6: Ergodic Model of HMM ....17
Figure 2.7: Left to Right HMM ...18
Figure 3.1: Pre-Processing and Feature Extraction..20
Figure 3.2: Input signal to End-point detection system ...22
Figure 3.3: Output signal from End point Detection System .. 22
Figure 3.4: Signal before Pre-Emphasis . .23
Figure 3.5: Signal after Pre-Emphasis 23
Figure 3.6: Frame Blocking of the Signal 23
Figure 3.7: Hamming window . 24
Figure 3.8: A single frame before and after windowing .. 24
Figure 3.9: Equally spaced Mel values 26
Figure 3.10: Mel Scale Filter Bank .. 26
Figure 3.11: Block diagram of GMM based Speaker Recognition System . 30
Figure 3.12: Equal Error Rate (EER) ... 35
Figure 3.13: Speech Recognition algorithm flow .36
Figure 3.14: Pronunciation model of word TOMATO .37
Figure 3.15: Vector Quantization ..38
Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of
states N in the HMM .39
Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the
size of the VQ, M (shown on a log scale) . 40
Figure 3.18: Forward Procedure - Induction Step .. 42
Figure 3.19: Backward Procedures - Induction Step .. 43
Figure 3.20: Viterbi Search . ... 45
Figure 3.21: Computation of ξt(i, j) . 46
Figure 4.1: UML diagram of Client System .48
Figure 4.2: UML Diagram of Server System ....49
LIST OF SYMBOLS AND ABBREVIATIONS
λ GMM/HMM Model
T Threshold
σ² Variance
Λ(X) Likelihood Ratio
μ Mean
π Initial State Distribution
A State Transition Probability Distribution
B Observation Symbol Probability Distribution
Cm Covariance Matrix for mth Component
qt State at time t
Wm Weighting Factor for mth Gaussian Component
x Feature Vector
AIR Adobe Integrated Runtime
DC Direct Current
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DHMM Discrete Hidden Markov Model
DTW Dynamic Time Warping
EM Expectation-Maximization
FAR False Acceptance Rate
FRR False Rejection Rate
GMM Gaussian Mixture Model
HMM Hidden Markov Model
LPC Linear Prediction Coding
MFCC Mel Frequency Cepstral Coefficient
ML Maximum Likelihood
PDF Probability Distribution Function
PLP Perceptual Linear Prediction
RIA Rich Internet Application
RPC Remote Procedure Call
SID Speaker IDentification
TER Total Error Rate
UBM Universal Background Model
UML Unified Modeling Language
VQ Vector Quantization
WTP Web Tool Platform
1. INTRODUCTION
Biometrics is, in the simplest definition, something you are. It is a physical characteristic
unique to each individual, such as a fingerprint, retina, iris, or speech. Biometrics has a very
useful application in security; it can be used to authenticate a person's identity and control
access to a restricted area, based on the premise that the set of these physical characteristics
can be used to uniquely identify individuals.
The speech signal conveys two important types of information: primarily the speech content,
and at a secondary level, the speaker identity. Speech recognizers aim to extract the lexical
information from the speech signal independently of the speaker by reducing the inter-speaker
variability. Speaker recognition, on the other hand, is concerned with extracting the identity of
the person speaking the utterance. Thus both speech recognition and speaker recognition are
possible from the same voice input.
Text-Prompted Remote Speaker Authentication is a voice biometric system that authenticates
a user before permitting the user to log into a system on the basis of the user's input voice. It
is a web application. Voice signal acquisition and feature extraction are done on the client.
Training and authentication based on the voice features obtained from the client side are done
on the server. The authentication task is based on a text-prompted version of speaker
recognition, which incorporates both speaker recognition and speech recognition. This joint
implementation includes text-independent speaker recognition and speaker-independent
speech recognition. Speaker recognition verifies whether the speaker is the claimed one,
while speech recognition verifies whether or not the spoken word matches the prompted word.
The client side is realized in Adobe Flex whereas the server side is realized in Java. The
communication between these two platforms is made possible with the help of BlazeDS's
RPC remote object.
Figure 1.1: System Architecture
Mel Frequency Cepstral Coefficients (MFCC) are used as features for both the speech and
speaker recognition tasks. We also combined energy features and the delta and delta-delta
features of energy and MFCC.
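As a rough illustration of the delta computation mentioned above, the sketch below differences adjacent MFCC frames. The simple two-frame symmetric difference and all class and method names are assumptions for illustration; the report does not state its exact delta formula.

```java
// Illustrative sketch of delta-feature computation over a sequence of
// feature vectors (e.g. MFCC frames). The two-frame symmetric
// difference, clamped at the sequence edges, is an assumption.
public class DeltaFeatures {
    // frames[t][d]: coefficient d of frame t. Returns one delta vector
    // per input frame, computed as (next - previous) / 2.
    public static double[][] delta(double[][] frames) {
        int T = frames.length, D = frames[0].length;
        double[][] out = new double[T][D];
        for (int t = 0; t < T; t++) {
            int prev = Math.max(t - 1, 0);     // clamp at the start
            int next = Math.min(t + 1, T - 1); // clamp at the end
            for (int d = 0; d < D; d++) {
                out[t][d] = (frames[next][d] - frames[prev][d]) / 2.0;
            }
        }
        return out;
    }
}
```

Delta-delta features would be obtained by applying the same operation to the delta sequence.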
After calculating the features, a Gaussian Mixture Model (GMM) is used for speaker modeling
and a Left-to-Right Discrete Hidden Markov Model with Vector Quantization (DHMM/VQ)
for speech modeling.
Based on the speech model, the system decides whether or not the uttered speech matches
what the user was prompted to utter. Similarly, based on the speaker model, the system decides
whether or not the speaker is the claimed one. The speaker is then authenticated with the help
of the combined result of these two tests.
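The combined decision described above can be sketched as a conjunction of the two tests. The class name, the score convention, and combining the results with a simple AND are illustrative assumptions, not the project's actual code.

```java
// Hypothetical sketch: authentication passes only when both the
// speaker test and the speech (prompted-word) test pass.
public class JointAuthenticator {
    private final double speakerThreshold;

    public JointAuthenticator(double speakerThreshold) {
        this.speakerThreshold = speakerThreshold;
    }

    // speakerScore: score from the speaker model (e.g. a GMM
    //               log-likelihood ratio, assumed here).
    // wordMatches:  whether the recognizer decoded the prompted word.
    public boolean authenticate(double speakerScore, boolean wordMatches) {
        boolean speakerAccepted = speakerScore >= speakerThreshold;
        return speakerAccepted && wordMatches; // both tests must pass
    }
}
```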
Referring to figure 1.2, the feature extraction module is the same for both speech and speaker
recognition, and these recognition systems are implemented independently of each other.
Figure 1.2: Block Diagram of Text Prompted Speaker Verification System
1.2 Objectives
The objectives of this project are:
To design and build a speaker verification system
To design and build a speech verification system
To implement these systems jointly to control remote access to secret area
2. LITERATURE REVIEW
2.1 Pattern Recognition
Pattern recognition, a branch of artificial intelligence and a sub-field of machine
learning, is the study of how machines can observe the environment, learn to distinguish
patterns of interest from their background, and make sound and reasonable decisions about
the categories of the patterns. A pattern can be a fingerprint image, a handwritten cursive
word, a human face, a speech signal, or a sales pattern.
The applications of pattern recognition include data mining, document classification,
financial forecasting, organization and retrieval of multimedia databases, and biometrics
(personal identification based on various physical attributes such as face, retina, speech, ear
and fingerprints).
The essential steps of pattern recognition are: Data Acquisition, Preprocessing, Feature
Extraction, Training and Classification.
Figure 2.1: General block diagram of pattern recognition system
Features are used to denote the descriptors. Features must be selected so that they are
discriminative and invariant. They can be represented as a vector, matrix, tree, graph, or
string. They are ideally similar for objects in the same class and very different for objects in
different classes.
Pattern class is a family of patterns that share some common properties. Pattern recognition
by machine involves techniques for assigning patterns to their respective classes
automatically and with as little human intervention as possible.
Learning and Classification usually use one of the following approaches: Statistical Pattern
Recognition is based on statistical characterizations of patterns, assuming that the patterns are
generated by a probabilistic system. Syntactical (or Structural) Pattern Recognition is based
on the structural interrelationships of features.
Given a pattern, its recognition/classification may consist of one of the following two tasks
according to the type of learning procedure: 1) Supervised classification (e.g., discriminant
analysis), in which the input pattern is identified as a member of a predefined class.
2) Unsupervised classification (e.g., clustering), in which the pattern is assigned to a
previously unknown class.
2.2 Generation of Voice
Speech begins with the generation of an airstream, usually by the lungs and diaphragm, a
process called initiation. This air then passes through the larynx, where it is modulated
by the glottis (vocal cords). This step is called phonation or voicing, and is responsible for
the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and
throat, a process called articulation, and the resultant pressure wave excites the air.
Figure 2.2: Vocal Schematic
Depending upon the positions of the various articulators, different sounds are produced. The
configuration of the articulators can be modeled by a linear time-invariant system whose
frequency response is characterized by several peaks called formants. The change in the
frequency of the formants characterizes the phoneme being articulated.
As a consequence of this physiology, we can notice several characteristics of the frequency-
domain spectrum of speech. First of all, the oscillation of the glottis results in an underlying
fundamental frequency and a series of harmonics at multiples of this fundamental.
This is shown in the figure below, where we have plotted a brief audio waveform for the
phoneme /i:/ and its magnitude spectrum. The fundamental frequency (180 Hz) and its
harmonics appear as spikes in the spectrum. The location of the fundamental frequency is
speaker dependent, and is a function of the dimensions and tension of the vocal cords. For
adults it usually falls between 100 Hz and 250 Hz, and female voices average significantly
higher than male voices.
Figure 2.3: Audio Sample for /i:/ phoneme showing stationary property of phonemes for a short period
The sound comes out in phonemes, which are the building blocks of speech. Each phoneme
resonates at a fundamental frequency and its harmonics and thus has high energy at those
frequencies; in other words, different phonemes have different formants. This is the feature
that enables the identification of each phoneme at the recognition stage.
The inter-speaker variations in the features of the speech signal during the utterance of a word
are modeled in word training for speech recognition. For speaker recognition, the intra-speaker
variations in features over long speech content are modeled.
Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics
Besides the configuration of the articulators, the acoustic manifestation of a phoneme is
affected by:
Physiology and emotional state of the speaker
Phonetic context
Accent
2.3 Voice as Biometric
The underlying premise for voice authentication is that each person's voice differs in pitch,
tone, and volume enough to make it uniquely distinguishable. Several factors contribute to
this uniqueness: the size and shape of the mouth, throat, nose, and teeth (articulators) and the
size, shape, and tension of the vocal cords. The chance that all of these are exactly the same
in any two people is very low.
Voice biometrics has the following advantages over other forms of biometrics:
A natural signal to produce
Low implementation cost, since it doesn't require a specialized input device
Readily accepted by users
Easily combined with other forms of authentication for multifactor authentication
The only biometric that allows users to authenticate remotely
2.4 Speech Recognition
Speech is the dominant means of communication between humans, and promises to be
important for communication between humans and machines, if it can just be made a little
more reliable.
Speech recognition is the process of converting an acoustic signal to a set of words. Its
applications include voice command and control, data entry, voice user interfaces, automating
the telephone operator's job in telephony, etc. Speech recognition can also serve as the input
to natural language processing.
There are two variants of speech recognition based on the duration of the speech signal:
isolated word recognition, in which each word is surrounded by some sort of pause, is much
easier than recognizing continuous speech, in which words run into each other and have to be
segmented.
Speech recognition is a difficult task because of the many sources of variability associated
with the signal. First, the acoustic realizations of phonemes, the smallest sound units of
which words are composed, are highly dependent on context. Second, acoustic variability can
result from changes in the environment as well as in the position and characteristics of the
transducer. Third, within-speaker variability can result from changes in the speaker's physical
and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic
background, dialect, and vocal tract size and shape can contribute to cross-speaker variability.
Such variability is modeled in various ways. At the level of signal representation,
representations that emphasize speaker-independent features are developed.
2.5 Speaker Recognition
Speaker recognition is the process of automatically recognizing who is speaking on the basis
of individual information included in speech waves. Speaker recognition can be classified
into identification and verification. It has been applied most often as a means of biometric
authentication.
2.5.1. Types of Speaker Recognition
2.5.1.1 Speaker Identification
Speaker identification is the process of determining which registered speaker provides a
given utterance. In Speaker IDentification (SID) system, no identity claim is provided, the
test utterance is scored against a set of known (registered) references for each potential
speaker and the one whose model best matches the test utterance is selected.
There are two types of speaker identification tasks: closed-set and open-set speaker
identification.
In closed-set identification, the test utterance belongs to one of the registered speakers.
During testing, a matching score is estimated for each registered speaker, and the speaker
corresponding to the model with the best matching score is selected. This requires N
comparisons for a population of N speakers.
In open-set identification, any speaker can access the system; those who are not registered
should be rejected. This requires another model, referred to as a garbage model, imposter
model, or background model, which is trained with data provided by speakers other than the
registered ones. During testing, the matching score corresponding to the best speaker model
is compared with the matching score estimated using the garbage model in order to accept or
reject the speaker, making the total number of comparisons N + 1. Speaker identification
performance tends to decrease as the population size increases.
2.5.1.2 Speaker Verification
Speaker verification, on the other hand, is the process of accepting or rejecting the identity
claim of a speaker. That is, the goal is to automatically accept or reject an identity that is
claimed by the speaker. During testing, a verification score is estimated using the claimed
speaker model and the anti-speaker model. This verification score is then compared to a
threshold. If the score is higher than the threshold, the speaker is accepted; otherwise, the
speaker is rejected. Thus, speaker verification involves a hypothesis test requiring a simple
binary decision: accept or reject the claimed identity, regardless of the population size. Hence,
the performance is quite independent of the population size, but it depends on the number of
test utterances used to evaluate the performance of the system.
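The verification rule in the paragraph above can be sketched as a log-likelihood ratio test between the claimed-speaker model and the anti-speaker (background) model; the class and parameter names below are illustrative assumptions.

```java
// Minimal sketch of the accept/reject decision: accept when the
// claimed-speaker model explains the utterance better than the
// background model by at least a threshold.
public class SpeakerVerifier {
    // logPClaimed, logPBackground: log-likelihoods of the test
    // utterance under the claimed-speaker and background models.
    public static boolean accept(double logPClaimed, double logPBackground,
                                 double threshold) {
        double llr = logPClaimed - logPBackground; // verification score
        return llr >= threshold;
    }
}
```

Raising the threshold lowers the false acceptance rate (FAR) at the cost of a higher false rejection rate (FRR), the tradeoff behind the equal error rate discussed later in the report.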
2.5.2. Modes of Speaker Recognition
There are three modes in which speaker verification/identification can be done.
2.5.2.1 Text Independent
In text-independent mode, the system relies only on the voice characteristics of the speaker;
the lexical content of the utterance is not used. The system models the characteristics of the
speech which show up irrespective of what the speaker is saying. This mode is used in
surveillance or forensic applications where there is no control over the speakers accessing the
system. The test utterances can be different from those used for enrollment; hence, text-
independent speaker verification needs a large and rich training data set to model the
characteristics of the speaker's voice and to cover the phonetic space.
A larger training set and longer test segments are required than for text-dependent systems, in
order to appropriately model the feature variations of the current user in uttering different
phonemes.
2.5.2.2 Text Dependent
In the text-dependent mode of verification, the user is expected to say a pre-determined text,
a voice password. Since recognition is based on the speaker characteristics as well as the
lexical content of the password, text-dependent speaker recognition systems are generally
more robust and achieve good performance. However, such systems are not yet used on a
large scale due to the risk of playback attack, since the system has a priori knowledge about
the password, i.e., the training and test texts are the same. The speaker model encodes the
speaker's voice characteristics associated with the phonemic or syllabic content of the
password.
2.5.2.3 Text-prompted
Both text-dependent and text-independent systems are susceptible to fraud, since for typical
applications the voice of a speaker could be captured, recorded, and reproduced. To limit this
risk, a particular kind of text-dependent speaker verification system based on prompted text
has been developed. The password, i.e., the text to speak, is not pre-determined; rather, the
user is asked to speak a prompted text (digits, a word, or a phrase). If the number of distinct
random passwords is large, a playback attack is not feasible. Hence the text-prompted system
is more secure.
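Prompting a fresh random word at each attempt, as described above, can be sketched as follows; the word list and selection scheme are assumptions for illustration, not taken from the project code.

```java
// Illustrative sketch: draw a random prompt from the enrolled word
// vocabulary for each authentication attempt, so a recording made in
// one session is unlikely to match the prompt in the next.
import java.util.List;
import java.util.Random;

public class PromptGenerator {
    private final List<String> vocabulary;
    private final Random rng = new Random();

    public PromptGenerator(List<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    public String nextPrompt() {
        return vocabulary.get(rng.nextInt(vocabulary.size()));
    }
}
```

With a vocabulary of 45 words (as used in this project) and multi-word prompts, the number of distinct prompts grows combinatorially, which is what defeats playback.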
As in the case of text-independent systems, text-prompted systems also need a large and
rich training data set for each registered speaker to create robust speaker-dependent models.
For this reason, we have chosen the text-prompted approach.
2.6 Feature Extraction for Speech/Speaker Recognition System
Signal representation, or coding from the short-term spectrum into feature vectors, is one of
the most important steps in automatic speaker recognition and continues to be a subject of
research. Many different techniques have been proposed in the literature, and generally they
are based on speech production models or speech perception models.
The goal of feature extraction is to transform the input waveform into a sequence of acoustic
feature vectors, each vector representing the information in a small time window of the
signal. Feature extraction transforms the high-dimensional input signal into lower-dimensional
vectors. For speaker recognition purposes, an optimal feature has the following properties:
1. High inter-speaker variation,
2. Low intra-speaker variation,
3. Easy to measure,
4. Robust against disguise and mimicry,
5. Robust against distortion and noise,
6. Maximally independent of the other features.
2.6.1. Short time analysis
Spectral analysis of the speech signal is based on classic Fourier analysis of the whole
signal. However, the Fourier transform cannot be applied directly, because the speech signal
cannot be considered stationary: the articulatory system changes constantly within each
utterance. To solve this problem, the speech signal is split into a sequence of short segments,
each short enough to be considered pseudo-stationary. The length of each segment, also
called a window or frame, ranges between 10 and 40 ms (in such a short period the
articulatory system cannot change significantly). Finally, a feature vector is extracted from
the short-time spectrum of each window. The whole process is known as short-term spectral
analysis.
2.6.2. MFCC Feature
The commonly used feature extraction methods for speech/speaker recognition are LPC
(Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients) and PLP
(Perceptual Linear Prediction). LPC is based on the assumption that a speech sample can be
approximated by a linearly weighted summation of a determined number of preceding
samples. PLP is calculated in a similar way to LPC coefficients, but transformations are first
carried out on the spectrum of each window to incorporate aspects of human hearing
behavior.
The most popular method, MFCC, mimics human hearing behavior by emphasizing lower
frequencies and penalizing higher frequencies.
The Mel scale, proposed by Stevens, Volkmann and Newman in 1937, is a perceptual scale of
pitches judged by listeners to be equal in distance from one another. It is based on an
empirical study of human perceived pitch. Human hearing is not equally sensitive at all
frequency bands: it is less sensitive at higher frequencies, roughly above 1000 Hz. A Mel is a
unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are
separated by an equal number of Mels. The mapping between frequency in Hertz and the Mel
scale is linear below 1000 Hz and logarithmic above 1000 Hz.
mel(f) = 2595 · log10(1 + f/700)
Modeling this property of human hearing during feature extraction improves speech
recognition performance. The model used in MFCCs warps the frequencies output by the
DFT onto the Mel scale. During MFCC computation, this insight is implemented by creating
a bank of filters, each collecting energy from one frequency band.
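As an illustration, the warping and its inverse follow directly from the formula above. This is a minimal Python sketch; the function names are ours, not part of the project code:

```python
import math

def hz_to_mel(f_hz):
    """Map a linear frequency in Hz onto the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Invert the warping: a Mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Below ~1000 Hz the mapping is roughly linear; above, it is logarithmic,
# so 1000 Hz maps to (approximately) 1000 Mel.
```

Note that by construction mel(1000 Hz) is approximately 1000 Mel, which is the calibration point of the scale.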
recognition systems due to their capability of representing a large class of sample
distributions. Like K-Means, Gaussian Mixture Models (GMMs) can be regarded as a type of
unsupervised learning or clustering method: the entire set of experimental data is modeled by
a mixture of Gaussians. But unlike K-Means, GMMs build soft clustering boundaries, i.e.,
points in space can belong to any class with a given probability.
The density function of a Gaussian mixture distribution is a convex combination (a linear
combination in which all coefficients or weights sum to one) of Gaussian probability density
functions:
Figure 2.5: GMM with four Gaussian components and their equivalent model
Mathematically, a GMM is a weighted sum of M Gaussian component densities given by the
equation

p(x|λ) = Σ_{m=1}^{M} w_m · p_m(x | μ_m, C_m)

where x is a k-dimensional random vector and the w_m are the mixture weights, which
express the relative importance of each component and satisfy the constraint
Σ_{m=1}^{M} w_m = 1. The component densities p_m(x | μ_m, C_m), m = 1, 2, ..., M, are
k-dimensional Gaussian functions (pdfs) of the form

p_m(x | μ_m, C_m) = 1 / ((2π)^{k/2} |C_m|^{1/2}) · exp{ −(1/2) (x − μ_m)ᵀ C_m⁻¹ (x − μ_m) }
where μ_m is the mean vector of length k of the m-th Gaussian pdf and C_m is its k×k
covariance matrix.
Thus the complete Gaussian mixture model is parameterized by the mixture weights, mean
vectors and covariance matrices of all component densities. The parameters are collectively
represented by the notation

λ = {w_m, μ_m, C_m}, m = 1, 2, ..., M

These parameters are estimated during training. In a speaker recognition system, each
speaker is represented by a GMM and is referred to by his/her model λ.
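The weighted-sum density above can be evaluated directly. The following Python sketch (names are illustrative; diagonal covariances are assumed, as adopted later in this report) computes p(x|λ):

```python
import math

def gauss_diag(x, mean, var):
    """k-dimensional Gaussian pdf with a diagonal covariance matrix,
    given as a list of per-dimension variances."""
    k = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * (k * math.log(2 * math.pi) + log_det + maha))

def gmm_density(x, weights, means, variances):
    """p(x|lambda): a convex combination of the component densities."""
    return sum(w * gauss_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

With a single component of zero mean and unit variance this reduces to the standard normal pdf, which is a convenient sanity check.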
GMM is widely used in speaker modeling and classification for two main reasons. First, the
individual Gaussian components of a speaker-dependent GMM can be interpreted as
representing broad acoustic classes. A speaker's voice can be characterized by a set of
acoustic classes representing broad phonetic events such as vowels, nasals and fricatives;
these classes reflect general speaker-dependent vocal tract configurations that are useful for
characterizing speaker identity. The spectral shape of the i-th acoustic class is represented by
the mean of the i-th component density, and variations of the average spectral shape by its
covariance matrix. These acoustic classes are hidden before training. Second, a Gaussian
mixture density provides a smooth approximation to the long-term sample distribution of
training utterances by a given speaker. The unimodal Gaussian speaker model represents a
speaker's feature distribution by a single mean vector and covariance matrix, while the VQ
model represents it by a discrete set of characteristic templates. The GMM acts as a hybrid
between these two models, using a discrete set of Gaussian functions, each with its own
mean and covariance matrix, to allow better modeling capability.
2.7.2. Hidden Markov Model
In general, a Markov model is a way of describing a process that goes through a series of
states. The model describes all the possible paths through the state space and assigns a
probability to each one. The probability of transitioning from the current state to another one
depends only on the current state, not on any prior part of the path.
HMMs can be applied in many fields where the goal is to recover a data sequence that is not
immediately observable. Common applications include cryptanalysis, speech recognition,
part-of-speech tagging, machine translation, partial discharge analysis, gene prediction,
alignment of bio-sequences, and activity recognition.
2.7.2.1 Discrete Markov Processes
For a first-order Markov chain with N distinct states S_1, S_2, ..., S_N, the transition
probability is given by:

a_ij = P(q_t = S_j | q_{t−1} = S_i), 1 ≤ i, j ≤ N

where q_t is the state at time t. The state transition coefficients obey the standard stochastic
constraints:

a_ij ≥ 0 and Σ_{j=1}^{N} a_ij = 1

The transition probabilities for all states in a model can be described by a transition
probability matrix A = {a_ij}. The initial state distribution vector is:

π = {π_i} = [P(q_1 = S_1), P(q_1 = S_2), ..., P(q_1 = S_N)]
The stochastic property of the initial state distribution vector is Σ_{i=1}^{N} π_i = 1, where
π_i is defined as π_i = P(q_1 = S_i), 1 ≤ i ≤ N. The Markov model can thus be described by
λ = (A, π). This stochastic process may be called an observable Markov model, since the
output of the process is the set of states at each instant of time, where each state corresponds
to a physical (observable) event.
2.7.2.3 Hidden Markov Model
The Markov model is too restrictive to be applicable to many problems of interest, so the
concept is extended to the Hidden Markov Model (HMM), which covers the case where the
observation is a probabilistic function of the state. The resulting model is a doubly embedded
stochastic process: an underlying stochastic process that is not observable (i.e. hidden) can
only be observed through another set of stochastic processes that produce the sequence of
observations. The difference is that in a Markov chain the output state is completely
determined at each time t, whereas in an HMM the state at each time t must be inferred from
the observations; an observation is a probabilistic function of a state.
Elements of HMM
The HMM is characterized by the following:
1) Set of hidden states
S = {S1,S2,,SN} and
state at time t, qt S
2) Set of observation symbols per state
V = {v1,v2,,vM}
observation at time t, Ot V
3) The initial state distribution π = {π_i}, π_i = P[q_1 = S_i], 1 ≤ i ≤ N.
4) State transition probability distribution A = {a_ij}, a_ij = P[q_{t+1} = S_j | q_t = S_i],
1 ≤ i, j ≤ N.
5) Observation symbol probability distribution in state j: B = {b_j(k)},
b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M.
An HMM is thus compactly written as λ = (A, B, π).
2.7.2.4 Types of HMMs
An ergodic, or fully connected, HMM has the property that every state can be reached from
every other state in a finite number of steps; every coefficient a_ij is positive. For some
applications, other types of HMMs have been found to account for the observed properties of
the signal being modeled better than the standard ergodic model.
Figure 2.6: Ergodic Model of HMM
One such model is the left-right, or Bakis, model, in which the underlying state sequence has
the property that as time increases the state index increases (or stays the same), i.e. the states
proceed from left to right. Clearly, the left-right type of HMM
has the desirable property that it can readily model signals whose properties change over
time, e.g., speech.
Figure 2.7: Left to Right HMM (four states; transitions a_11, a_12, a_13, a_22, a_23, a_24,
a_33, a_34, a_44)
The properties of left-right HMMs are:
1) The state transition coefficients satisfy a_ij = 0 for j < i, i.e., no transition is allowed to
states whose indices are lower than the current state.
2) The initial state probabilities satisfy π_1 = 1 and π_i = 0 for i ≠ 1, since the state sequence
must begin in state 1 (and end in state N).
3) The state transition coefficient for the last state in a left-right model is specified as
a_NN = 1.
With left-right models, additional constraints are placed on the state transition coefficients to
ensure that large jumps in state index do not occur; a constraint of the form a_ij = 0 for
j > i + Δ is often used. The value of Δ is 2 in this speech recognition system, i.e., no jumps of
more than 2 states are allowed. For Δ = 2 and N = 4, the state transition matrix has the form:

A = | a_11 a_12 a_13  0   |
    |  0   a_22 a_23 a_24 |
    |  0    0   a_33 a_34 |
    |  0    0    0   a_44 |
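The left-right constraints above can be encoded mechanically. The following Python sketch builds such a matrix; sharing probability uniformly over the allowed transitions is an illustrative choice, not the trained values:

```python
def left_right_transitions(n_states, max_jump=2):
    """Build a left-right transition matrix: a_ij = 0 for j < i (no backward
    moves) and for j > i + max_jump; the allowed transitions in each row share
    probability uniformly, so each row sums to 1 and the last state is
    absorbing (a_NN = 1)."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        allowed = range(i, min(i + max_jump + 1, n_states))
        p = 1.0 / len(allowed)
        for j in allowed:
            A[i][j] = p
    return A

# The Delta = 2, N = 4 case discussed in the text:
A = left_right_transitions(4, max_jump=2)
```

In training, these structural zeros stay zero under Baum-Welch re-estimation, which is what makes the left-right topology stable.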
2.7.3. K-Means Clustering
Clustering can be considered the most important unsupervised learning problem; as with
every problem of this kind, it deals with finding structure in a collection of unlabeled data.
A loose definition of clustering is the process of organizing objects into groups whose
members are similar in some way. A cluster is therefore a collection of objects which are
similar to each other and dissimilar to the objects belonging to other clusters.
In statistics and machine learning, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean.
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation
of the objects into groups from which the metric to be minimized can be calculated.
Both the clustering process and the decoding process require a distance metric or distortion
metric that specifies how similar two acoustic feature vectors are. The distance metric is used
to build clusters, to find a prototype vector for each cluster, and to compare incoming vectors
to the prototypes. The simplest distance metric for acoustic feature vectors is the Euclidean
distance: the distance in N-dimensional space between the two points defined by the two
vectors.
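Steps 1-4 above, with the Euclidean metric, can be sketched as follows. This is a toy Python implementation with names of our own choosing; real codebook training would run on MFCC feature vectors:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points in N-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, iters=100, seed=0):
    """Step 1: seed K centroids; Step 2: assign each point to the nearest
    centroid; Step 3: recompute centroids; Step 4: repeat until they stop
    moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:   # centroids no longer move
            break
        centroids = new_centroids
    return centroids

# Two well-separated toy clusters:
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
cents = k_means(pts, 2)
```

The resulting centroids serve as the codebook prototypes; an incoming vector is quantized to the index of its nearest centroid.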
3. IMPLEMENTATION DETAILS
The implementation of the joint speaker/speech recognition system includes a common
pre-processing and feature extraction module, text-independent speaker modeling and
classification by GMM, and speaker-independent speech modeling and classification by
HMM/VQ.
3.1 Pre-Processing and Feature Extraction
Starting from the capture of the audio signal, feature extraction consists of the following
steps, shown in the block diagram below: silence removal, pre-emphasis, framing,
windowing, DFT, Mel filter bank, log, IDFT, CMS, and computation of energy and delta
features, yielding 12 MFCC coefficients and 1 energy feature per frame (plus their deltas).
Figure 3.1: Pre-Processing and Feature Extraction
3.1.1. Capture
The first step in processing speech is to convert the analog representation (first air pressure,
and then an analog electric signal in a microphone) into a digital signal x[n], where n is an
index over time. Analysis of the audio spectrum shows that nearly all energy resides in the
band between DC and 4 kHz, and beyond 10 kHz there is virtually no energy whatsoever.
The sound format used is:
22050 Hz sampling rate
16-bit, signed
Little endian
Mono channel
Uncompressed PCM
3.1.2. End point detection and Silence removal
The captured audio signal may contain silence at different positions: at the beginning of the
signal, between the words of a sentence, at the end of the signal, etc. If silent frames are
included, modeling resources are spent on parts of the signal which do not contribute to
identification, so the silence must be removed before further processing.
There are several ways of doing this; the most popular are Short-Time Energy and Zero
Crossing Rate, but these have limitations since they set thresholds on an ad hoc basis. The
algorithm we used [Ref.4] uses the statistical properties of background noise as well as
physiological aspects of speech production, and does not assume any ad hoc threshold. It
assumes that the background noise present in the utterances is Gaussian in nature.
Usually the first 200 ms or more (we used 4410 samples at the sampling rate of 22050
samples/sec) of a speech recording corresponds to silence (or background noise), because the
speaker takes some time to start reading when recording begins.
Endpoint Detection Algorithm
Step 1: Calculate the mean (μ) and standard deviation (σ) of the first 200 ms of samples of
the given utterance. The background noise is characterized by this μ and σ.
Step 2: Go from the first sample to the last sample of the recording. For each sample, check
whether the one-dimensional Mahalanobis distance |x − μ|/σ is greater than 3. If it is, the
sample is treated as voiced; otherwise it is unvoiced/silence. Since P[|x − μ| ≤ 3σ] = 0.997
for a Gaussian distribution, this threshold rejects 99.7% of the background-noise samples,
accepting only the voiced samples.
Step 3: Mark each voiced sample as 1 and each unvoiced sample as 0, and divide the whole
speech signal into 10 ms non-overlapping windows, so that the complete speech is
represented by only zeros and ones.
Step 4: Suppose there are M zeros and N ones in a window. If M > N, convert the ones to
zeros, and vice versa. This step reflects the fact that the speech production system (vocal
cords, tongue, vocal tract, etc.) cannot change abruptly within the short 10 ms window used
here.
Step 5: Collect the voiced part according to the samples labeled 1 in the windowed array and
dump it into a new array, i.e., retrieve the voiced part of the original speech signal from the
samples labeled 1.
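The five steps can be sketched as follows. The constants are illustrative (4410 samples is about 200 ms and 220 samples about 10 ms at 22050 Hz), and the function name and padding choices are ours:

```python
import math

def remove_silence(samples, noise_len=4410, win=220):
    """Endpoint detection per the steps above (illustrative sketch)."""
    # Step 1: background-noise statistics from the leading silence.
    noise = samples[:noise_len]
    mu = sum(noise) / len(noise)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in noise) / len(noise)) or 1.0
    # Step 2/3: label each sample voiced (1) when |x - mu| / sigma > 3.
    marks = [1 if abs(x - mu) / sigma > 3.0 else 0 for x in samples]
    voiced = []
    # Step 4/5: majority vote inside each ~10 ms window, keep voiced windows.
    for start in range(0, len(samples), win):
        block = marks[start:start + win]
        if 2 * sum(block) > len(block):      # more ones than zeros
            voiced.extend(samples[start:start + win])
    return voiced

# Leading ~200 ms of near-silence followed by a loud voiced burst:
voiced = remove_silence([0.0] * 4410 + [10.0] * 440)
```

The `or 1.0` guards against a degenerate all-zero noise segment, which would otherwise divide by zero.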
Figure 3.2: Input signal to End-point detection system
Figure 3.3: Output signal from End point Detection System
3.1.3. PCM Normalization
The extracted pulse-code-modulated amplitude values are normalized, to compensate for
amplitude variation during capture.
3.1.4. Pre-emphasis
The speech signal is usually pre-emphasized before any further processing: if we look at the
spectrum of voiced segments like vowels, there is more energy at lower frequencies than at
higher frequencies. This drop in energy across frequencies is caused by the nature of the
glottal pulse. Boosting the high-frequency energy makes information from the higher
formants more available to the acoustic model and improves phone detection accuracy.
The pre-emphasis filter is a first-order high-pass filter. In the time domain, with input x[n]
and 0.9 ≤ α ≤ 1.0, the filter equation is:

y[n] = x[n] − α · x[n−1]

We used α = 0.95.
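The filter equation translates directly into code. A short Python sketch (passing the first sample through unchanged is our choice; the function name is illustrative):

```python
def pre_emphasize(x, alpha=0.95):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    The first sample has no predecessor and is passed through unchanged."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
```

On a constant (DC) input the output after the first sample is only 5% of the input level, which shows the high-pass behavior.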
Figure 3.4: Signal before Pre-Emphasis
Figure 3.5: Signal after Pre-Emphasis
3.1.5. Framing and windowing
Speech is a non-stationary signal, meaning that its statistical properties are not constant
across time. Instead, we want to extract spectral features from a small window of speech that
characterizes a particular subphone and for which we can make the (rough) assumption that
the signal is stationary (i.e. its statistical properties are constant within the region).
We used frame blocks of 23.22 ms with 50% overlap, i.e., 512 samples per frame.
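Frame blocking with 50% overlap can be sketched as follows (the function name is ours; 512 samples at 22050 Hz corresponds to the 23.22 ms frame mentioned above, with a 256-sample hop):

```python
def frame_signal(x, frame_len=512, overlap=0.5):
    """Split samples into fixed-length frames with the given overlap;
    a trailing partial frame is discarded."""
    hop = int(frame_len * (1.0 - overlap))
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(x[start:start + frame_len])
        start += hop
    return frames
```

Each frame then goes through windowing and the DFT independently, which is what makes the pseudo-stationarity assumption usable.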
Figure 3.6: Frame Blocking of the Signal
The rectangular window (i.e., no window) can cause problems in Fourier analysis because it
abruptly cuts off the signal at its boundaries. A good window function has a narrow main
lobe and low side-lobe levels in its transfer function; it shrinks the values of the signal
toward zero at the window boundaries, avoiding discontinuities. The most commonly used
window function in speech processing is the Hamming window, defined as:

w(n) = 0.54 − 0.46 · cos(2πn/(N−1)), 0 ≤ n ≤ N−1

Figure 3.7: Hamming window
The windowed frame is obtained by multiplying the value of the signal at time n,
s_frame[n], by the value of the window at time n, w[n]:

y[n] = w[n] · s_frame[n]
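The window definition and the pointwise multiplication can be sketched directly (function names are ours):

```python
import math

def hamming(N):
    """w(n) = 0.54 - 0.46 * cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Pointwise product of the frame with a Hamming window of equal length."""
    w = hamming(len(frame))
    return [wi * si for wi, si in zip(w, frame)]
```

The window is symmetric, equals 1.0 at its center, and tapers to 0.08 at the edges, which is exactly the "shrink toward zero at the boundaries" behavior described above.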
Figure 3.8: A single frame before and after windowing
3.1.6. Discrete Fourier Transform
A Discrete Fourier Transform (DFT) of the windowed signal is used to extract the frequency
content (the spectrum) of the current frame, i.e., how much energy the signal contains at
discrete frequency bands. The input to the DFT is a windowed signal x[0]...x[N−1], and the
output, for each of N discrete frequency bands, is a complex number X[k] representing the
magnitude and phase of that frequency component in the original signal:

X[k] = Σ_{n=0}^{N−1} x[n] · e^{−j2πkn/N}, k = 0, 1, 2, ..., N−1

The commonly used algorithm for computing the DFT is the Fast Fourier Transform, or FFT
for short.
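The DFT sum above can be computed naively for small frames; in practice the FFT computes the same result in O(N log N). A self-contained Python sketch:

```python
import cmath
import math

def dft(x):
    """X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N) for k = 0..N-1.
    O(N^2) reference implementation; the FFT gives identical output."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]
```

A pure cosine at the frequency of bin 1 concentrates its energy in that bin (magnitude N/2), which is a quick way to check the implementation.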
3.1.7. Mel Filter
To calculate the MFCCs, a transformation is first applied according to the formula:

mel(f) = 2595 · log10(1 + f/700)

where f is the linear frequency. Then a filter bank is applied to the amplitude of the
mel-scaled spectrum.
The Mel frequency warping is most conveniently done with a filter bank whose filters are
centered according to Mel frequencies. The width of the triangular filters varies according to
the Mel scale, so that the log total energy in a critical band around the center frequency is
included. The centers of the filters are uniformly spaced on the Mel scale.
Figure 3.9: Equally spaced Mel values
The result of the Mel filter bank is information about the distribution of energy in each
Mel-scale band: one output per filter for each frame.
Figure 3.10: Triangular filter bank in frequency scale
We have used 30 filters in the filter bank.
The triangular filters H_m(k) are built from center frequencies f(m) that are uniformly
spaced on the Mel scale:

H_m(k) = 0                                    for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))       for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))       for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                    for k > f(m+1)

for m = 1, 2, ..., M, where the forward and inverse Mel mappings are

mel(f) = 2595 · log10(1 + f/700)
mel⁻¹(m) = 700 · (10^{m/2595} − 1)

3.1.8. Cepstrum by Inverse Discrete Fourier Transform
A cepstral transform is applied to the filter outputs to obtain the MFCC features of each
frame. The triangular filter outputs Y(i), i = 1, 2, ..., M, are compressed using the logarithm,
and the discrete cosine transform (DCT) is applied. Here M is the number of filters in the
filter bank, i.e., 30.

C[n] = Σ_{i=1}^{M} log(Y(i)) · cos(πn(i − 1/2)/M)

where C[n] is the MFCC vector for each frame.
The resulting vector is called the Mel-frequency cepstrum (MFC), and the individual
components are the Mel-frequency cepstral coefficients (MFCCs). We extracted 12 features
from each speech frame.
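The log-compression and DCT step can be sketched as follows; the function name is ours, and the 0-based loop index i corresponds to the 1-based (i − 1/2) term of the formula above:

```python
import math

def mfcc_from_filterbank(filter_energies, n_coeffs=12):
    """C[n] = sum_{i=1}^{M} log(Y(i)) * cos(pi*n*(i - 1/2)/M): a DCT of the
    log filter-bank energies, keeping the first n_coeffs coefficients."""
    M = len(filter_energies)
    logs = [math.log(y) for y in filter_energies]
    return [sum(logs[i] * math.cos(math.pi * n * (i + 0.5) / M)
                for i in range(M))
            for n in range(1, n_coeffs + 1)]
```

A flat filter-bank output (all bands equal) yields all-zero cepstral coefficients, since the DCT basis vectors for n ≥ 1 are orthogonal to a constant.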
3.1.9. Post Processing
3.1.9.1 Cepstral Mean Subtraction (CMS)
A speech signal may be subject to channel noise when recorded, also referred to as the
channel effect. A problem arises if the channel effect when recording training data for a
given person differs from the channel effect in later recordings when the person uses the
system: a false distance between the training data and the newly recorded data is introduced
by the differing channel effects. The channel effect is eliminated by subtracting the mean
Mel-cepstrum coefficients from the Mel-cepstrum coefficients of each frame:

ĉ_t(n) = c_t(n) − (1/T) Σ_{τ=1}^{T} c_τ(n), n = 1, 2, ..., 12

3.1.9.2 The energy feature
The energy in a frame is the sum over time of the power of the samples in the frame; thus for
a signal x in a window from time sample t1 to time sample t2, the energy is:

E = Σ_{t=t1}^{t2} x[t]²

3.1.9.3 Delta feature
Another important fact about the speech signal is that it is not constant from frame to frame.
Co-articulation (the influence of one speech sound on an adjacent or nearby speech sound)
can provide a useful cue for phone identity, and it can be preserved by using delta features.
Velocity (delta) and acceleration (delta-delta) coefficients are usually obtained from the
static window-based information; they model the speed and acceleration of the variation of
the cepstral feature vectors across adjacent windows.
A simple way to compute deltas is just to take the difference between frames; thus the delta
value d(t) for a particular cepstral value c(t) at time t can be estimated as:

d(t) = (c(t+1) − c(t−1)) / 2
This differencing method is simple, but since it acts as a high-pass filtering operation on the
parameter domain, it tends to amplify noise. The solution is linear regression, i.e., a
first-order polynomial fit; the least-squares solution is easily shown to be of the form:

d(t) = Σ_{m=1}^{M} m · (c(t+m) − c(t−m)) / (2 · Σ_{m=1}^{M} m²)

where M is the regression window size. We used M = 4.
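The regression formula can be applied per coefficient track as follows; edge handling by repeating the boundary frames is our choice, and the function name is illustrative:

```python
def delta(coeffs, M=4):
    """Least-squares slope over a +/-M regression window:
    d[t] = sum_{m=1..M} m*(c[t+m] - c[t-m]) / (2 * sum_{m=1..M} m^2).
    The sequence is padded by repeating the edge frames."""
    denom = 2.0 * sum(m * m for m in range(1, M + 1))
    padded = [coeffs[0]] * M + list(coeffs) + [coeffs[-1]] * M
    return [sum(m * (padded[t + M + m] - padded[t + M - m])
                for m in range(1, M + 1)) / denom
            for t in range(len(coeffs))]
```

On a linear ramp the interior deltas equal the true slope exactly, and on a constant track they are zero, which is the expected behavior of a least-squares slope estimate.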
3.1.9.4 Composition of Feature Vector
We calculated 39 features from each frame:
12 MFCC Features
12 Delta MFCC
12 Delta-Delta MFCC
1 Energy Feature
1 Delta Energy Feature
1 Delta-Delta Energy Feature
3.2 GMM Implementation
It is important to note that because the component Gaussians act together to model the
overall pdf, full covariance matrices are not necessary even if the features are not statistically
independent: a linear combination of diagonal-covariance Gaussians is capable of modeling
the correlations between feature vector elements. In addition, the use of diagonal covariance
matrices greatly reduces the computational complexity. Hence in our project the m-th
covariance matrix is

C_m = diag(a_m1, a_m2, ..., a_mK)

where a_mj, j = 1, 2, ..., K are the diagonal elements (variances) and K is the number of
features in each feature vector.
The effect of a set of M full-covariance Gaussians can be compensated by using a larger set
of diagonal-covariance Gaussians (M = 16 in our case); according to the research literature,
M = 16 works best for speaker modeling.
The component pdfs can now be expressed as

p_m(x | μ_m, C_m) = 1 / ((2π)^{K/2} (Π_{j=1}^{K} a_mj)^{1/2}) · exp{ −(1/2) Σ_{j=1}^{K} (x_j − μ_mj)² / a_mj }

where μ_mj are the elements of the m-th mean vector.
3.2.1. Block diagram of GMM based Speaker Recognition System
The enrollment path extracts features from the speech and trains a model that is stored in the
model database; the verification path extracts features, matches them against the claimed
speaker's model, and makes an accept/reject decision.
Figure 3.11: Block diagram of GMM based Speaker Recognition System
3.2.2. GMM Training
Given training speech from a speaker, the goal of speaker model training is to estimate the
parameters of the GMM that best match the distribution of the training feature vectors, and
hence to develop a robust model for the speaker. Of the several techniques available for
estimating the parameters of a GMM, the most popular is Maximum Likelihood (ML)
estimation via Expectation-Maximization (EM).
EM is a well-established maximum likelihood algorithm for fitting a mixture model to a set
of training data. It requires an a priori selection of the model order (the number M of
components to be incorporated into the model) and an initial estimate of the parameters
before iterating through the training.
The aim of ML estimation is to maximize the likelihood of the GMM given the training
data. Under the assumption of independent feature vectors, the likelihood of the GMM for
the sequence of T training vectors X = {x_1, x_2, ..., x_T} can be written as

P(X|λ) = Π_{t=1}^{T} p(x_t|λ)

In practice, this computation is done in the log domain to avoid underflow: instead of
multiplying many very small probabilities, we add their logarithms. Thus the log-likelihood
of a model λ for a sequence of feature vectors X = {x_1, x_2, ..., x_T} is computed as

log P(X|λ) = (1/T) Σ_{t=1}^{T} log p(x_t|λ)

Note that the average log-likelihood is used so as to normalize out duration effects. Also,
since the incorrect independence assumption underestimates the actual likelihood value,
scaling by T can be considered a rough compensation factor.
Direct maximization of this likelihood function is not possible, as it is a non-linear function
of the parameters, so the likelihood is maximized using the Expectation-Maximization
algorithm.
The basic idea of the EM algorithm is, beginning with an initial model λ, to estimate a new
model λ̄ such that P(X|λ̄) ≥ P(X|λ). The new model then becomes the initial
model for the next iteration, and the process is repeated until some convergence threshold is
reached, i.e., until P(X|λ̄) − P(X|λ) ≤ ε.
μ_data = (1/T) Σ_{t=1}^{T} x_t

The j-th diagonal element of C_data can be estimated as

C_data,j = (1/T) Σ_{t=1}^{T} (x_{t,j} − μ_data,j)²

A measure of the volume that the training data occupies can be given by

v = (Π_{j=1}^{K} C_data,j)^{1/K}

Finally, the minimum covariance (threshold) value used to avoid NaN (Not a Number) errors
during EM iterations is set as a fraction of this volume. Covariance limiting was done as
calculated above for each mixture; for simplicity we initialized the covariance values to be
the same for all Gaussian components.
For training the GMM parameters we used the following constants:
MINIMUM_ITERATION = 100
MAXIMUM_ITERATION = 500
and the minimum log-likelihood change for convergence:
LOGLIKELIHOOD_CHANGE = 0.000001
3.2.3. Verification
After the training stage we have a complete model (GMM) for each speaker. The speaker
verification task is a hypothesis-testing problem: based on the input speech observations, it
must be decided whether the claimed identity of the speaker is correct or not. The hypothesis
test is set up as:
H0: the speaker is the claimed speaker
H1: the speaker is an imposter
The decision between these two hypotheses is based on the likelihood ratio

P(X|λ_C) / P(X|λ_UBM)

where P(X|λ_C) is the likelihood that the utterance was produced by the claimed speaker's
model λ_C, and P(X|λ_UBM) is the likelihood that the utterance was produced by the
imposter model λ_UBM. The imposter model, also called the Universal Background Model
(UBM), is obtained by training on a collection of speech samples from a large number of
speakers, representative of the speaker population.
The likelihood ratio is often expressed in the log domain:

Λ(X) = log(P(X|λ_C)/P(X|λ_UBM)) = log P(X|λ_C) − log P(X|λ_UBM)

The decision is made as follows: if Λ(X) < T, reject the null hypothesis, i.e., the speaker is
an imposter; if Λ(X) > T, accept the null hypothesis, i.e., the speaker is the claimed one. The
threshold value T is set in such a way that the error of the system is minimized, so that true
claimants are accepted and false claimants are rejected as often as possible.
3.2.4. Performance measure of Speaker Verification System
In general, the performance of the speaker verification system is determined by False
Rejection Rate (FRR) and False Acceptance Rate (FAR).
1) False Rejection Rate (FRR)
FRR is the measure of the likelihood that the system will incorrectly reject an access attempt
by an authorized user. A system's FRR is typically the ratio of the number of false rejections
to the number of verification tests.
2) False Acceptance Rate (FAR)
FAR is the measure of the likelihood that the system will incorrectly accept an access
attempt by an unauthorized user. A system's FAR is usually stated as the ratio of the number
of false acceptances to the number of verification tests.
The Total Error Rate (TER) is the combination of the false rejection and false acceptance
rates, and the requirement of the system is to minimize it. Both errors depend on the
threshold value used during verification: at lower threshold values FAR is predominant,
while at higher threshold values FRR is predominant. This dependency can be seen in the
figure below. At a certain threshold value the two errors are equal and the TER is minimum.
Figure 3.12: Equal Error Rate (EER)
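The two error rates can be estimated from trial scores at a given threshold (a minimal sketch; the function name and score values are made up for illustration):

```python
def error_rates(genuine_scores, imposter_scores, threshold):
    """FRR and FAR at a given decision threshold.

    genuine_scores:  scores from true-claimant trials (should be accepted)
    imposter_scores: scores from imposter trials (should be rejected)
    """
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    return frr, far

genuine = [7.1, 8.4, 6.2, 9.0, 5.5]
imposter = [1.2, 3.8, 2.5, 6.0, 0.7]
print(error_rates(genuine, imposter, threshold=6.0))  # (0.2, 0.2)
```

Sweeping the threshold over such scores and finding the point where FRR equals FAR gives the Equal Error Rate shown in the figure.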
3.3 Implementation of HMM for Speech Recognition
The basic block diagram for isolated word recognition is given below:
[Block diagram: Speech Signal → Pre-process → MFCC Features → Vector Quantization (VQ) → discrete observation sequence → HMM Recognition → Recognition Result. The VQ stage uses a Codebook built by K-means Clustering; the HMM model λ = (A, B, π) is trained with the Baum-Welch Algorithm and evaluated with the Viterbi Algorithm.]
Figure 3.13: Speech Recognition algorithm flow
In order to do isolated word speech recognition, we must perform the following:
1) The codebook is generated using the feature vectors of the training data, and vector quantization uses the codebook to map each feature vector to a discrete observation symbol.
2) For each word v in the vocabulary, an HMM λ_v is built, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors for the vth word. To make reliable estimates of all model parameters, multiple observation sequences must be used. The Baum-Welch algorithm is used for estimation of the HMM parameters.
3) For each unknown word to be recognized, we measure the observation sequence O = {O1, O2, ..., OT} via feature analysis of the speech corresponding to the word, then calculate the model likelihoods for all possible models, P(O|λ_v), 1 ≤ v ≤ V, and finally select the word whose model likelihood is highest:

v* = argmax_{1 ≤ v ≤ V} [P(O|λ_v)]

The probability computation step is performed using the Viterbi algorithm and requires on the order of V·N²·T computations.
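Step 3 amounts to an argmax over per-word model likelihoods. A minimal sketch (the scorer interface and log-likelihood values are hypothetical stand-ins for trained HMMs):

```python
def recognize(observations, word_models):
    """Pick the vocabulary word whose model gives the highest likelihood.

    word_models: dict mapping each word to a callable returning
    log P(O|lambda_v) for the given observation sequence.
    """
    return max(word_models, key=lambda w: word_models[w](observations))

# Toy scorers returning fixed log-likelihoods, for illustration only:
models = {"yes": lambda o: -120.0, "no": lambda o: -95.5, "stop": lambda o: -130.2}
print(recognize([3, 17, 42], models))  # "no" has the highest log-likelihood
```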
Figure 3.14: Pronunciation model of word TOMATO
The above figure shows the pronunciation model of the word tomato. The circles represent the states and the numbers above the arrows represent transition probabilities. The pronunciation of the same word may differ from person to person; the figure reflects two pronunciation styles for the same word tomato. So, in order to best model each word, we need to train the word on as large a set of speakers as possible, so that the model captures all the variation in pronunciation for that word.
Vector Quantization:
HMM is used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal: over a short time window, speech can be approximated as a stationary process. Each acoustic feature vector represents information such as the amount of energy in different frequency bands at a particular point in time. The observation sequence for speech recognition is a sequence of acoustic feature vectors (MFCC vectors), and the phonemes are the hidden states. One way to make MFCC vectors look like symbols that we can count is to build a mapping function that maps each input vector into one of a small number of symbols. This idea of mapping input vectors to discrete quantized symbols is called vector quantization, or VQ.
The type of HMM that models speech signals using the VQ technique to produce the observations is called a Discrete Hidden Markov Model (DHMM). However, VQ loses some information from the speech signal; this loss is due to the quantization error (distortion). The distortion can be reduced by increasing the number of codewords in the codebook, but it cannot be eliminated. A long sequence of speech samples is then represented by a stream of indices representing frames of different window lengths. Hence, VQ can be considered a process of redundancy removal, which minimizes the number of bits required to identify each frame of the speech
signal. In vector quantization, we create the small symbol set by mapping each training feature vector into a small number of classes, and then we represent each class by a discrete symbol. More formally, a vector quantization system is characterized by a codebook, a clustering algorithm, and a distance metric.
A codebook is a list of possible classes, a set of symbols constituting features F = {f1, f2, ..., fn}. All feature vectors from the training speech data are clustered into 256 classes, thereby generating a codebook with 256 centroids, with the help of the K-means clustering technique. Vector Quantization (VQ) is then used to obtain a discrete observation sequence from the input feature vectors by applying the distance metric against the codebook.
Figure 3.15: Vector Quantization
As shown in the above figure, to make the feature vectors discrete, each incoming feature vector is compared with each of the 256 prototype vectors in the codebook. The closest one (by Euclidean distance) is selected, and the input vector is replaced by the index of the corresponding centroid in the codebook. In this way, all continuous input feature vectors are quantized to a discrete set of symbols.
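The quantization step can be sketched as follows (a toy three-entry codebook stands in for the 256-centroid codebook used in the system):

```python
import math

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook centroid
    (Euclidean distance), producing a discrete observation sequence."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return [min(range(len(codebook)), key=lambda i: dist(f, codebook[i]))
            for f in frames]

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # toy 3-entry codebook
frames = [[0.2, -0.1], [4.8, 5.1], [1.1, 0.9]]    # toy 2-D "feature vectors"
print(quantize(frames, codebook))  # [0, 2, 1]
```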
3.3.1. Isolated Word Recognition
For isolated word recognition with a distinct HMM designed for each word in the vocabulary,
a left-right model is more appropriate than an ergodic model, since we can then associate
time with model states in a fairly straightforward manner. Furthermore we can envision the
physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled.
The issue of the number of states to use in each word model leads to two schools of thought.
One idea is to let the number of states correspond roughly to the number of sounds (phonemes) within the word; hence models with 2 to 10 states would be appropriate.
The other idea is to let the number of states correspond roughly to the average number of
observations in a spoken version of the word. In this manner each state corresponds to an
observation interval i.e., about 15 ms for the analysis we use. The former approach is used
in our speech recognition system. Furthermore we restrict each word model to have the same
number of states; this implies that the models will work best when they represent words with
the same number of sounds.
Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM
Above figure shows a plot of average word error rate versus N, for the case of recognition of
isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat
insensitive to N, achieving a local minimum at N=6; however, differences in error rate for
values of N close to 6 are small.
The next issue is the choice of observation vector and the way it is represented. Since we are
representing an entire region of the vector space by a single vector, distortion penalty is
associated with VQ. It is advantageous to keep the distortion penalty as small as possible.
However, this implies a large size codebook, and that leads to problems in implementing
HMMs with a large number of parameters. Although the distortion steadily decreases as M
increases, only small decreases in distortion accrue beyond a value of M=32. Hence, HMMs with codebook sizes from M=32 to 256 vectors have been used in speech recognition experiments. For our discrete symbol models, we have used a codebook with M=256 codewords to generate the discrete symbols.
Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)
Another main issue is the initialization of the HMM parameters. The parameters that constitute any model λ are π, A, and B. The value of π is given by π = [1 0 0 0 0 ... 0], because the left-right HMM used in our speech recognition system always starts in the first state and ends in the last state. Random values between 0 and 1 are assigned as initial values to the elements of the A and B parameters.
3.3.2. Application of HMM
Given the form of HMM, there are three basic problems of interest that must be solved for the
model to be useful in real-world applications. These problems are the following:
3.3.2.1 Evaluation Problem: Calculating P(O|λ)
Given the observation sequence O = O1 O2 ... OT and the Markov model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
Solution:
The aim of this problem is to find the probability of the observation sequence O = (O1, O2, ..., OT) given the model λ, i.e. P(O|λ). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q = (q1, q2, ..., qT) can be calculated as a product:

P(O|q, λ) = b_q1(O1) · b_q2(O2) ··· b_qT(OT)

And the probability of the state sequence q can be found as:

P(q|λ) = π_q1 · a_q1q2 · a_q2q3 ··· a_q(T−1)qT

The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by summing the joint probability over all possible state sequences q, giving:

P(O|λ) = Σ_q P(O|q, λ) · P(q|λ)

This direct computation has one major drawback: it is infeasible due to the exponential growth of computations as a function of the sequence length T. To be precise, it needs (2T−1)·N^T multiplications and N^T − 1 additions. An excellent tool which cuts the computational requirements to linear in T is the well-known forward algorithm, which needs only N(N+1)(T−1)+N multiplications and N(N−1)(T−1) additions.
Forward Algorithm
First consider a forward probability variable α_t(i), for instant t and state i, with the following definition:

α_t(i) = P(O1, O2, O3, ..., Ot, q_t = S_i | λ)

This probability function can be solved for N states and T observations iteratively:

Step 1: Initialization

α_1(i) = π_i · b_i(O1),   1 ≤ i ≤ N
Figure 3.18: Forward Procedure - Induction Step
Step 2: Induction

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_ij ] · b_j(O_{t+1}),   1 ≤ t ≤ T−1, 1 ≤ j ≤ N

Step 3: Termination

P(O|λ) = Σ_{i=1}^{N} α_T(i)

This stage is just the sum of the values of the probability function α_T(i) over all states at instant T. This sum represents the probability of the given observations being generated by the given model, i.e. how likely the given model is to produce the given observations.
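The three steps above can be sketched in code (a minimal unscaled version; the toy model values are made up for illustration):

```python
def forward(A, B, pi, obs):
    """Forward algorithm: P(O|lambda) for a discrete HMM.

    A:   N x N transition matrix, A[i][j] = a_ij
    B:   N x M emission matrix, B[j][k] = b_j(k)
    pi:  initial state distribution
    obs: observation sequence as symbol indices
    """
    N = len(pi)
    # Step 1, initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Step 2, induction over t = 2..T
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Step 3, termination: sum of the final alphas
    return sum(alpha)

A = [[0.7, 0.3], [0.4, 0.6]]   # toy 2-state model
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward(A, B, pi, [0, 1, 0]))  # 0.099375
```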
Backward Algorithm
This procedure is similar to the forward procedure, but it considers the state flow in the backward direction, from the last observation, instant T, to the first one,
instant 1. That means that any state is reached from the states that come just after it in time.
To formulate this approach, consider the backward probability function β_t(i), which can be defined as:

β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_T | q_t = S_i, λ)

Figure 3.19: Backward Procedure - Induction Step

In analogy to the forward procedure, we can solve for β_t(i) in the following two steps:

Step 1: Initialization

β_T(i) = 1,   1 ≤ i ≤ N

These initial values for the β's of all states at instant T are arbitrarily selected.

Step 2: Induction

β_t(i) = Σ_{j=1}^{N} a_ij · b_j(O_{t+1}) · β_{t+1}(j),   t = T−1, T−2, ..., 1,  1 ≤ i ≤ N
3.3.2.2 Decoding Problem: Finding the best path
Given the observation sequence O = O1 O2 ... OT and the Markov model λ = (A, B, π), find the optimal state sequence q = q1 q2 ... qT.
Solution:
The problem is to find the optimal sequence of states, given the observation sequence and the model. This means that we have to find the optimal state sequence Q = (q1, q2, q3, ..., qT−1, qT) associated with the given observation sequence O = (O1, O2, O3, ..., OT−1, OT) presented to the model λ = (A, B, π). The criterion of optimality here is to search for the single best state sequence through a modified dynamic programming technique called the Viterbi Algorithm.
To explain the Viterbi Algorithm, a probability quantity δ_t(i) is defined, which represents the maximum probability along the best state-sequence path for the given observation sequence after t instants, ending in state i. This quantity can be defined mathematically by:

δ_t(i) = max_{q1, q2, ..., q(t−1)} P[q1 q2 ... q(t−1), q_t = i, O1 O2 ... Ot | λ]

The best state sequence is backtracked by another function ψ_t(j). The complete algorithm can be described by the following steps:

Step 1: Initialization

δ_1(i) = π_i · b_i(O1),   ψ_1(i) = 0,   1 ≤ i ≤ N

Step 2: Recursion

δ_t(j) = max_{1≤i≤N} [δ_{t−1}(i) · a_ij] · b_j(O_t),   2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N} [δ_{t−1}(i) · a_ij],   2 ≤ t ≤ T, 1 ≤ j ≤ N

Step 3: Termination

P* = max_{1≤i≤N} [δ_T(i)],   q*_T = argmax_{1≤i≤N} [δ_T(i)]

Step 4: Path (state sequence) backtracking

q*_t = ψ_{t+1}(q*_{t+1}),   t = T−1, T−2, ..., 1
The Viterbi Algorithm can also be used to calculate P(O|λ) approximately, by considering the use of P* instead.
Figure 3.20: Viterbi Search
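The four steps can be sketched as follows (a minimal unscaled version with made-up model values; a practical implementation would work in the log domain, as noted in the Scaling section):

```python
def viterbi(A, B, pi, obs):
    """Viterbi algorithm: best state path and its probability P*."""
    N = len(pi)
    # Step 1: initialization
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    psi = []
    # Step 2: recursion, keeping backpointers
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(step)
        delta = new_delta
    # Step 3: termination
    last = max(range(N), key=lambda i: delta[i])
    p_star = delta[last]
    # Step 4: path backtracking
    path = [last]
    for step in reversed(psi):
        path.append(step[path[-1]])
    path.reverse()
    return p_star, path

A = [[0.7, 0.3], [0.4, 0.6]]   # toy 2-state model
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
p_star, path = viterbi(A, B, pi, [0, 1, 0])
print(path)  # [0, 1, 0]
```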
3.3.2.3 Training Problem: Estimating the Model Parameters
Given the observation sequence O = O1 O2 ... OT, estimate the model parameters λ = (A, B, π) that maximize P(O|λ).
Solution:
This problem deals with the training issue, which is the most difficult of the three cases. The task is to adjust the model parameters (A, B, π) according to a certain optimality criterion. The Baum-Welch Algorithm (Forward-Backward Algorithm) is one of the well-known techniques to solve this problem. It is an iterative method that estimates new values for the model parameters. To explain the training procedure, first an a-posteriori probability function γ_t(i) is defined: the probability of being in state i at instant t, given the observation sequence O and the model λ:

γ_t(i) = P(q_t = S_i | O, λ) = P(O, q_t = S_i | λ) / P(O|λ) = α_t(i) · β_t(i) / Σ_{i=1}^{N} α_t(i) · β_t(i)
Then another probability function ξ_t(i, j) is defined: the probability of being in state i at instant t and going to state j at instant t+1, given the model λ and the observation sequence O. ξ_t(i, j) can be defined mathematically as:

ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)

Figure 3.21: Computation of ξ_t(i, j)

From the definitions of the forward and backward variables, we can write ξ_t(i, j) in the form:

ξ_t(i, j) = α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) / P(O|λ)
          = α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j)

The relation between γ_t(i) and ξ_t(i, j) can be easily deduced from their definitions:

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

Now, if γ_t(i) is summed over all instants (excluding instant T), we get the expected number of times that state S_i has been left, or the number of times this state has been visited over all instants.
On the other hand, if we sum ξ_t(i, j) over all instants (excluding T), we get the expected number of transitions that have been made from state S_i to state S_j.
From the behavior of γ_t(i) and ξ_t(i, j), the following re-estimations of the model parameters can be deduced:

Initial state distribution (expected number of times in state S_i at instant t = 1):

π̄_i = γ_1(i)

Transition probabilities (expected number of transitions from state S_i to state S_j, divided by the expected number of transitions from state S_i):

ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)

Emission probabilities (expected number of times in state S_j observing symbol v_k, divided by the expected number of times in state S_j):

b̄_j(k) = Σ_{t: O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
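One re-estimation pass can be sketched directly from these formulas (a minimal unscaled single-sequence version; real training iterates until the likelihood converges and uses the scaling described in Section 3.3.3):

```python
def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation pass for a discrete HMM (unscaled).

    Returns updated (pi, A, B) computed from the gamma/xi statistics.
    """
    N, M, T = len(pi), len(B[0]), len(obs)
    # Forward variables alpha_t(i)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Backward variables beta_t(i), initialized to 1 at instant T
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    p_obs = sum(alpha[T-1][i] for i in range(N))   # P(O|lambda)
    # gamma_t(i) and xi_t(i, j) from the definitions above
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimation of pi, A, B
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B
```

A useful sanity check on any implementation is that the re-estimated π, each row of A, and each row of B remain valid probability distributions (sum to 1).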
3.3.3. Scaling
α_t(i) consists of the sum of a large number of terms. Since the transition matrix elements (a_ij) and emission matrix elements (b_j(k)) are less than 1, each term of α_t(i) heads exponentially to zero as t grows. For large t, the computation of α_t(i) will exceed the precision range of the computer (even in double precision). The remedy is to multiply α_t(i) and β_t(i) by a scaling factor that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 ≤ t ≤ T. At the end of the computation, the scaling coefficients cancel out exactly.
When using the Viterbi Algorithm, if logarithms are used to obtain the maximum-likelihood state sequence, no scaling is required.
[Class diagram fragment:
interface Algorithm: +execute()
+doPreprocessing(), +doPCMNormalization()
-capturedSignal, -processedSi