CHAPTER 5
ENHANCEMENT OF RECOGNITION ACCURACY OF
AUTOMATIC SPEAKER RECOGNITION SYSTEM IN
NOISY ENVIRONMENTS
5.1 INTRODUCTION
In this thesis, two wavelet-based methods (TADWT and SSWPT)
for the enhancement of noisy speech are developed and reported. The
effectiveness of the reported speech enhancement techniques is evaluated by
means of speaker recognition under noisy conditions. Automatic speaker
recognition has developed into an increasingly important technology required
by many speech-aided applications, but its practical implementation on
handheld devices or over the internet faces many challenges, one of which is
environmental noise. A new ASR system is developed and presented which
aims to provide robustness at the signal level by using the TADWT and
SSWPT speech enhancement methods as a front-end processing stage,
improving the speaker-specific features and hence the speaker recognition
performance in noisy environments. A comparative analysis of feature
extraction and speaker modeling techniques is made to obtain an improvement
in recognition accuracy.
In this chapter, the speaker recognition system is first developed for
open-set applications under the clean speech condition and its performance is
analyzed. The experimental analysis of the proposed speaker recognition
system is then extended to the noisy environment using various speech
enhancement algorithms in the front-end processing stage. In order to evaluate
and compare some of the existing enhancement algorithms with the proposed
TADWT and SSWPT, the test signals were corrupted by WGN, pink and
other real-life environmental noises at -5 dB, 0 dB, 5 dB, 10 dB and 15 dB
SNRs. The noisy test signals are then enhanced by the SS, IWF, EMF, BWT,
TADWT and SSWPT algorithms, and the enhanced signals are evaluated by
the recognition system. The evaluation is based on the recognition accuracy.
5.2 AUTOMATIC SPEAKER RECOGNITION (ASR) SYSTEM
Speaker recognition is the process of automatically recognizing
speakers on the basis of individuality information in the speech signal.
Applications of speaker recognition systems include access to buildings, cars
and highly secured campuses, site security monitoring, telecommunications,
law enforcement (e.g. forensics), e-transactions, information retrieval from
large databases, personal identification numbers (PIN) and so on
(Uludag et al 2004). Speaker recognition can be divided into speaker
identification and speaker verification. In speaker identification, the task is to
identify the speaker from the speech signal. The task of a verification system
is to authenticate the claim of a speaker based on the test speech
(Campbell 1997). According to the constraints placed on the speech used to
train and test the system, automatic speaker recognition can be further
classified into text-dependent and text-independent tasks. In text-dependent
recognition, the user must speak a given phrase known to the system, which
can be fixed or prompted; knowledge of the spoken phrase can provide better
recognition results. In text-independent recognition, the system does not know
the phrase spoken by the user (Bimbot et al 2004). Hence, this work focuses
on the text-independent speaker recognition task.
A speaker recognition system typically involves two phases: training
and testing. In the training phase, a user enrolls by providing voice samples to
the system. The system extracts speaker-specific information from the voice
samples to build a voice model of the enrolling speaker. In the testing phase, a
user provides a voice sample (also referred to as test sample) that is used by
the system to measure the similarity of the user’s voice to the model(s) of the
previously enrolled user(s) and subsequently, to make a decision. The speaker
associated with the model that is being tested is referred to as target speaker
or claimant. The proposed system is designed for the open-set situation. As
mentioned in the previous section, speaker identification can be divided into two
categories: closed-set speaker identification and open-set speaker
identification. Given a set of enrolled speakers and a test utterance, open-set
speaker identification is defined as a twofold problem. Firstly, it is required to
identify the speaker model in the set, which best matches the test utterance.
Secondly, it must be determined whether the test utterance has actually been
produced by the speaker associated with the best-matched model, or by some
unknown speaker outside the enrolled set. That is, in a speaker identification
task, the system measures the similarity of the test sample to all stored voice
models and in speaker verification task, the similarity is measured only to the
model of the claimed identity (Aronowitz et al 2005).
Figure 5.1 Generic speaker recognition system (speech → feature extraction → classification: pattern matching against stored speaker models, followed by a decision → accept/reject)
Like most pattern recognition problems, a speaker recognition
system can be partitioned into two modules: feature extraction and
classification. The classification module has two components: pattern
matching and decision. Figure 5.1 depicts a generic speaker recognition
system. The feature extraction module estimates a set of features from the
speech signal that represent some speaker-specific information. The set of
features should be consistent for each speaker and have variability between
speakers. The pattern matching module is responsible for comparing the
estimated features to the speaker models. There are many types of pattern
matching methods and corresponding models used in speaker recognition
(Campbell 1997). Some of the methods include hidden Markov models
(HMM), dynamic time warping (DTW) and vector quantization (VQ). In
open-set applications (open-set speaker identification and speaker
verification), the estimated features can also be compared to a model that
represents the unknown speakers (Furui 1997). In an identification task, this
module outputs similarity scores for all stored voice models; in a verification
task, it outputs a similarity score between the test sample and the model of the
claimed identity. The decision module then analyzes the similarity score(s),
using a statistical or deterministic rule, to make a decision; the decision
process depends on the system task.
The effectiveness of a speaker recognition system is measured
differently for different tasks. Since the output of a closed-set speaker
identification system is a speaker identity from a set of known speakers, the
identification accuracy is used to measure the performance. For the speaker
detection/verification systems, there are two types of error: false acceptance
of an impostor and false rejection of a target speaker. The performance
measure can also incorporate the cost associated with each error, which
depends on the application. For example, in a telephone credit card purchase
system, a false acceptance is very costly; in a toll fraud prevention system,
false rejection can alienate customers (Kim and Rose 2003).
Aside from biometric commercial applications, the forensic domain
is another important field where speaker recognition is used. To meet the
challenges of recognizing speakers in real time, several groups are active in
this area. The Acoustic Research Institute of the Australian Academy of
Sciences is currently working on automatic speaker identification. A research
group in the Language Technologies Institute at Carnegie Mellon University,
Pittsburgh, is currently working on robust speaker recognition. Lincoln
Laboratory at the Massachusetts Institute of Technology is presently carrying
out research on speaker recognition, focused on the application of recognition
techniques to multi-speaker speech and on improving the performance of
recognition systems under mismatched channel conditions. Speech synthesis
and recognition, speaker recognition, speech recognition for Indian languages
and spoken language identification are some of the research areas at IIT,
Chennai, India. Researchers at the Tokyo Institute of Technology (Japan), the
Air Force Research Laboratory/IFEC and the Speech Processing Lab,
Philadelphia (USA) are working to improve the performance of speaker
recognition systems in real environments.
5.2.1 Introduction to Speaker Recognition in Real Environments
The main challenge for automatic speaker recognition is to deal
with the variability of the environments and channels from where the speech
was obtained. Existing speaker recognition systems achieve good results, such
as high accuracy of speaker identification and verification, for clean
high-quality speech with matched training and test acoustic conditions.
However, under mismatched conditions and noisy environments, as often
expected in real-world conditions, the performance of the systems degrades
significantly, far below the satisfactory level (Varchol et al 2005). The
presence of irrelevant information (such as environment information) may
actually degrade the system accuracy. Therefore, robustness has become a
crucial research issue in the speaker recognition field. Ji Ming et al (2007)
investigated the problem of speaker identification and verification in noisy
conditions and suggested a method which combines multi-condition model
training and missing-feature theory to model noise with unknown
temporal-spectral characteristics. Missing-feature theory is applied to refine
the compensation by ignoring noise variation outside the given training
conditions, thereby reducing the training and testing mismatch. In this thesis,
the main focus is to improve the robustness of the speaker recognition system.
Some of the sources that degrade the performance of speaker recognition are:
i. Variation in the speaker characteristics (speaking rate, style
and stress)
ii. Different microphones used in the training and testing
environment
iii. Distortion introduced by the channel such as telephone
network
iv. Distortion due to background noise
This work concentrates on distortion due to background noise.
Speaker recognition systems are generally trained using data obtained in
controlled conditions. This data is acquired in a noise-free environment using
high-quality microphones. Practically, in any speaker recognition application
the input speech signal may not always be clean, in particular during testing,
and may be corrupted in many ways that degrade the quality of the speech
signal and therefore reduce the performance of the speaker recognition system
(Reynolds 1995). Many strategies have been adopted to deal with background
noise degradation and to provide robustness to the recognizer. The process of
providing robustness to the recognizer can be accomplished at three different
levels (Krishnamurthy and Prasanna 2009):
i. Robustness at the signal level (speech enhancement)
ii. Robustness at the feature level (feature enhancement)
iii. Robustness at the classifier level (model enhancement)
The presented work aims to provide robustness at the signal
level by using the TADWT and SSWPT methods as a preprocessing stage.
The text-independent speaker verification task is taken for demonstration.
5.3 PROPOSED NOISE ROBUST SPEAKER RECOGNITION
SYSTEM
This work focuses on several issues relating to the
implementation of the new model for real-world applications. It aims to
provide robustness at the signal level by using the developed and reported
TADWT and SSWPT speech enhancement methods as a front-end processing
stage: speech signals are enhanced before the feature extraction stage, so
that before being transformed into feature vectors, the degraded speech
undergoes an enhancement step which tries to filter out the degradation.
The proposed speaker recognition system is schematically
illustrated in Figure 5.2, in which the front-end processing block refers to
speech enhancement using the TADWT and SSWPT methods. MFCC
features are used for speaker identification (they possess high inter-speaker
variability) and MMFCC features are used for speaker verification (they
possess low intra-speaker variability) to increase robustness and achieve
higher recognition accuracy. The system involves the usual training and
testing phases. In the training phase, a user enrolls by providing voice samples
to the system under a controlled environment, and the system extracts
speaker-specific information from the voice samples to build a voice model of
the enrolling speaker. During the testing phase, degraded speech samples are
passed through the front-end processor (speech enhancement) for the
reduction of noise outside the given training conditions, thereby reducing the
training and testing mismatch. The enhanced speech is then used by the
system to measure the similarity of the user's voice to the model(s) of the
previously enrolled user(s) and subsequently to make a decision.
Figure 5.2 Block diagram of the proposed noise robust ASR system (training speech → models for speakers 1, 2, ..., n; test speech of a user claiming authority → front-end processing (speech enhancement) → pre-processing → feature extraction (MFCC for identification, MMFCC for verification) → pattern matching against the speaker models → similarity measure: > T accept, < T reject → speaker identity)
5.3.1 Front End Processing (Speech Enhancement)
Many solutions developed to deal with the noisy speech
enhancement problem were discussed in the previous chapters. Generally,
these solutions can be classified into two main areas: temporal processing and
spectral processing based speech enhancement techniques. Temporal
processing involves identification and enhancement of high-SNR regions in
the time domain representation of the degraded speech signal. Spectral
processing involves estimation and elimination of the degradation component
and identification and enhancement of speech-specific spectral features in the
frequency domain representation of the degraded speech (Krishnamoorthy
and Prasanna 2011). The front-end speech enhancement methods used in this
work, the proposed and reported TADWT and SSWPT, come under the
second category.
5.3.2 Feature Extraction
Feature extraction is the key to the front-end process in speaker
recognition systems. Feature extraction aims at giving a useful representation
of the speech signal by capturing the important information from it. It
transforms the speech signal into a compact but effective representation that is
more stable and discriminative than the original signal. The performance of a
speaker verification system is highly dependent on the quality of the selected
speech features. Several feature extraction methods have already been studied
for the speaker recognition task.
The most commonly used acoustic vectors are Mel Frequency
Cepstral Coefficients (MFCC) (Yegnanarayana et al 2005), Bark Scale Filter
bank Cepstrum Coefficients (BFCC), Linear Prediction Cepstral Coefficients
(LPCC) (Reynolds 1995) and Perceptual Linear Prediction Cepstral (PLPC)
Coefficients. All of these features are based on the spectral information
derived from a short time windowed segment of speech. They differ mainly in
the detail of the power spectrum representation. MFCC features are derived
directly from the FFT power spectrum whereas the LPCC and PLPC use an
all-pole model to represent the smoothed spectrum. In all of the above
methods, the cepstral coefficient with index zero is discarded for energy
normalization.
PLPC and MFCC features are used in most state-of-the-art automatic speech
recognition systems. A new modification of Mel-Frequency Cepstral
Coefficient (MMFCC) feature has been proposed for extraction of speech
features for speaker verification application (Goutam Saha et al 2000).
However, MFCC features are used in more and more speaker recognition
applications. For example, most of the participating systems in NIST speaker
recognition evaluations in 1998 used MFCC features and some systems used
LPCC features (Doddington et al 2000).
Revised perceptual linear prediction was proposed by Pawan
Kumar et al (2010) for the purpose of identifying the spoken language; the
Revised Perceptual Linear Prediction Coefficients (RPLP) were obtained from
a combination of MFCC and PLP. Speech recognizers use a parametric form
of the signal to obtain the most important distinguishable features of the
speech signal for the recognition task.
5.3.3 Selection of Feature Extraction Technique
A speaker's speech features are selected based on the acoustic
vector distributions of different samples of the same speaker and of different
speakers' samples. Figure 5.3 to Figure 5.7 show the acoustic vector
distributions of different samples from the same speaker using MFCC,
MMFCC, BFCC, LPCC and RPLP features respectively. The feature vector
space for MMFCC has lower variability among the different samples of the
same speaker as compared with the other feature distributions; thus MMFCC
features are used for verification in the proposed system. Figure 5.8 to
Figure 5.12 show the acoustic vector distributions of three different speakers'
samples using MFCC, MMFCC, BFCC, LPCC and RPLP features
respectively. The feature vector space for MFCC has higher variability among
the different speakers as compared with the other feature vector distributions;
thus MFCC features are selected for identification in the proposed system.
Figure 5.3 2D acoustic vector distribution of the same speaker's different samples using MFCC features (1st vs. 2nd dimension; Signals 1-3)
Figure 5.4 2D acoustic vector distribution of the same speaker's different samples using MMFCC features (1st vs. 2nd dimension; Signals 1-3)
Figure 5.5 2D acoustic vector distribution of the same speaker's different samples using BFCC features (first vs. second dimension; Signals 1-3)
Figure 5.6 2D acoustic vector distribution of the same speaker's different samples using LPCC features (first vs. second dimension; Signals 1-3)
Figure 5.7 2D acoustic vector distribution of the same speaker's different samples using RPLP features (first vs. second dimension; Signals 1-3)
Figure 5.8 2D acoustic vector distribution of different speakers' samples using MFCC features (1st vs. 2nd dimension; Users 1-3)
Figure 5.9 2D acoustic vector distribution of different speakers' samples using MMFCC features (1st vs. 2nd dimension; Users 1-3)
Figure 5.10 2D acoustic vector distribution of different speakers' samples using BFCC features (first vs. second dimension; Users 1-3)
Figure 5.11 2D acoustic vector distribution of different speakers' samples using LPCC features (first vs. second dimension; Users 1-3)
Figure 5.12 2D acoustic vector distribution of different speakers' samples using RPLP features (first vs. second dimension; Users 1-3)
5.3.3.1 MFCC feature extraction
MFCCs, developed by Davis and Mermelstein (1980), are one of the
most popular features extracted from the speech signal. MFCCs are shown to
be less susceptible to variations of the speaker's voice and surrounding
environment. The method is based on the known variation of the human ear's
critical bandwidths with frequency: the neighbouring frequency bins are
grouped into overlapping triangular bands of equal bandwidth on the mel
scale, with filters spaced linearly at low frequencies and logarithmically at
high frequencies, to capture the phonetically important characteristics of
speech. The MFCC feature extraction technique basically includes windowing
the signal, applying the DFT, taking the log of the magnitude and warping the
frequencies on a mel scale, followed by applying the discrete cosine transform
(DCT). The various steps involved in MFCC feature extraction are
pre-emphasis, frame blocking and windowing, mel-spectrum computation and
the DCT.
i Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the
higher frequencies. Its purpose is to balance the spectrum of voiced sounds
that have a steep roll-off in the high frequency region; it also removes some of
the glottal effects from the vocal tract parameters. The most commonly used
pre-emphasis filter is given by the transfer function

H(z) = 1 - a z^{-1}    (5.1)

where the value of a controls the slope of the filter and usually lies between
0.4 and 1.0 (Picone 1993). In this work it is set to 0.9375. This first-order
filter compensates for the fact that the lower formants contain more energy
than the higher ones.
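As an illustration, the following is a minimal pre-emphasis sketch in Python/NumPy, using the value a = 0.9375 adopted in this work (the function name is illustrative):

```python
import numpy as np

def pre_emphasize(signal, a=0.9375):
    """First-order pre-emphasis filter H(z) = 1 - a*z^-1 (equation 5.1).

    y[n] = x[n] - a*x[n-1]; it boosts the high frequencies to balance
    the steep spectral roll-off of voiced speech.
    """
    x = np.asarray(signal, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])
```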
ii Frame blocking and windowing: The speech signal is a slowly
time-varying, or quasi-stationary, signal. Therefore, speech analysis must
always be carried out on short segments across which the speech signal is
assumed to be stationary. Short-term spectral measurements are typically
carried out over 20 ms windows advanced every 10 ms. Advancing the time
window every 10 ms enables the temporal characteristics of individual speech
sounds to be tracked, while the 20 ms analysis window is usually sufficient to
provide good spectral resolution of these sounds and at the same time short
enough to resolve significant temporal characteristics. Generally Hamming or
Hanning windows are used; this is done to enhance the harmonics, smooth the
edges and reduce the edge effect when taking the DFT of the signal. The
Hamming window is used in this system because it has less spectral leakage.
Each windowed frame is converted into a magnitude spectrum by applying
the DFT.
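A short framing-and-windowing sketch under the assumptions stated above (20 ms frames advanced every 10 ms, 8 kHz sampling, a 512-point DFT); the function name and defaults are illustrative:

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=20.0, shift_ms=10.0, n_fft=512):
    """Split speech into overlapping Hamming-windowed frames and return
    the magnitude spectrum of each frame (assumes len(signal) >= one frame)."""
    frame_len = int(fs * frame_ms / 1000)        # 160 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)            # 80 samples at 8 kHz
    window = np.hamming(frame_len)               # less spectral leakage than rectangular
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
```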
iii Mel-spectrum: The mel-spectrum is computed by passing the
Fourier-transformed signal through a set of band-pass filters known as the mel
filter bank. A mel is a unit of measure of the human ear's perceived pitch or
frequency of a tone. The characteristic is expressed on the mel frequency
scale, which has linear frequency spacing below 1 kHz and logarithmic
spacing above 1 kHz (Deller et al 1994). The following function transforms
the real (linear) frequency to the mel frequency:

Mel(f) = 2595 \log_{10}(1 + f/700)    (5.2)
Figure 5.13 Mel-spaced filter banks (triangular filter amplitudes versus frequency [Hz])
Following the mel scaling, the remaining signal values are
compressed with a nonlinear function, such as the logarithm operation. This
results in a signal in the cepstral domain with a frequency peak corresponding
to the pitch of the signal and a number of formants representing low-frequency
peaks. The triangular filter banks with mel-frequency warping are shown in
Figure 5.13.
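A sketch of the mel filter bank construction based on equation (5.2); the 24-filter default mirrors the experiments of Section 5.3.7, while the helper names and bin-edge handling are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # equation (5.2)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse of equation (5.2)

def mel_filterbank(n_filters=24, n_fft=512, fs=8000):
    """Triangular filters equally spaced on the mel scale, returned as an
    (n_filters, n_fft//2 + 1) weight matrix."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):            # rising edge of the triangle
            fbank[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fbank[j - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# log mel-spectrum of one magnitude spectrum `spec`:
#   log_mel = np.log(mel_filterbank() @ spec + 1e-10)
```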
iv DCT: Since the vocal tract is smooth, the energy levels in adjacent
bands tend to be correlated. The DCT is applied to the transformed
mel-frequency coefficients to produce a set of cepstral coefficients; its purpose
is to reduce the dependency between the feature vector components within a
single frame. Mel-frequency cepstral coefficients (MFCCs) are calculated
from the log filter bank amplitudes m_j using equation (5.3):

C_n = \sqrt{2/N} \sum_{j=1}^{N} m_j \cos(\pi n (j - 0.5)/N)    (5.3)

where C_n is the n-th cepstral coefficient and N is the number of filter bank
channels. Since most of the signal information is represented by the first few
MFCC coefficients, the system can be made robust by extracting only those
coefficients and ignoring the higher-order DCT components (Deller et al
1994). In this work the first 13 coefficients are extracted. The coefficient with
index zero is excluded, since it represents the average log energy of the input
signal, which carries only little speaker-specific information.
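The DCT step of equation (5.3) then reduces to a few lines; the sketch below keeps the first 13 coefficients and skips C_0, as described above:

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_coeffs=13):
    """MFCCs from the log filter bank amplitudes m_j (equation 5.3):
    C_n = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi*n*(j - 0.5)/N)."""
    N = len(log_mel)
    n = np.arange(1, n_coeffs + 1)               # n = 1..13, skipping C_0
    j = np.arange(1, N + 1)
    basis = np.cos(np.pi * np.outer(n, j - 0.5) / N)
    return np.sqrt(2.0 / N) * basis @ log_mel
```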
5.3.3.2 Modified MFCC feature extraction
A modified mel-frequency cepstral coefficient (MMFCC) is an
improved version of the conventional MFCC for the speaker verification
application. A weighting function is introduced which is unique for each
frame of an utterance for each speaker (Goutam Saha et al 2000). From the
filter bank outputs, the average value is calculated using equation (5.4):

avg = \frac{1}{20} \sum_{i=1}^{20} S_k[i]    (5.4)

where S_k[i] are the filter bank outputs. The city block distance M_k[i] of
each filter bank output from the average is calculated using equation (5.5):

M_k[i] = |S_k[i] - avg|    (5.5)
Sweep is unique for each frame (window), since it is calculated
from the filter bank outputs of that frame using equation (5.6). It represents
the total variation in magnitude of the filter outputs for each frame and gives a
measure of the magnitude spread of the coefficients, equivalent to the
variance in the Euclidean distance measure:

Sweep = \log \sum_{i=1}^{20} M_k[i]    (5.6)

Finally, the weighting function is derived as

W[i] = \log(S_k[i]) / Sweep    (5.7)

Then the MMFCCs are obtained as

C_n = \sum_{i=1}^{20} \log(S_k[i]) \, W[i] \cos(\pi n (i - 0.5)/20)    (5.8)
where i = 1, 2, ..., 20. MMFCC uses compensation based on the magnitude of
spread, through a frame-based weighting function, to preserve the
speaker-dependent information in different frames. The variation of
intensity/loudness at different segments of a spoken word may influence the
magnitude of the coefficients, affecting cluster formation in the parameter
space of a speaker. MMFCC is a frame-based technique that reduces these
effects through normalization of the coefficients in each frame by their total
spread, so that the coefficients of all the frames are brought to the same level
of spread. This also minimizes the effect of changes in the background noise
level in SR applications where the speaker is moving from one environment
to another while speaking. MMFCC features show enhanced discriminative
ability, which is important in speaker verification applications.
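A per-frame sketch of the MMFCC computation of equations (5.4)-(5.8), assuming 20 filter bank outputs; reading equation (5.6) as the log of the summed city-block distances is an interpretation of the text above, and the epsilon guards are illustrative:

```python
import numpy as np

def mmfcc(fbank_out, n_coeffs=13, eps=1e-10):
    """MMFCCs for one frame from the 20 filter bank outputs S_k[i]."""
    S = np.asarray(fbank_out, dtype=float)
    avg = S.mean()                               # equation (5.4)
    M = np.abs(S - avg)                          # equation (5.5): city block distance
    sweep = np.log(M.sum() + eps)                # equation (5.6): magnitude spread
    W = np.log(S + eps) / sweep                  # equation (5.7): frame weighting
    n = np.arange(1, n_coeffs + 1)
    i = np.arange(1, S.size + 1)
    basis = np.cos(np.pi * np.outer(n, i - 0.5) / S.size)
    return basis @ (np.log(S + eps) * W)         # equation (5.8)
```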
5.3.4 Speaker Modeling
From the extracted features, a speaker model is created for each and
every speaker in the database. The created speaker model should be unique
and should maximally contain the speaker-specific information characterizing
the vocal tract. Several methods for speaker modeling are reported in the
literature. In text-dependent speaker recognition, the most popular methods
are dynamic time warping (DTW) (Furui 1981) and hidden Markov models
(HMM) (Tisby 1991). In text-independent speaker recognition, the most
popular methods used for speaker modeling are vector quantization (VQ)
(Poonam Bansal et al 2007), artificial neural networks (ANN)
(Yegnanarayana and Kishore 2002), support vector machines (SVM) (Gish
and Schmidt 2002) and Gaussian mixture models (GMM) (Reynolds and Rose
(1995), Xiang and Berger (2003)). These models can be classified into
parametric and non-parametric models. Parametric models like the GMM
have a particular model structure characterized by certain parameters for the
distribution of feature vectors. In non-parametric models like VQ and ANN,
no such assumptions are made.
Furui (1981) introduced the concept of dynamic time warping
(DTW) for text-dependent speaker recognition; however, it was originally
developed for speech recognition. The disadvantage of template matching is
that it becomes time consuming as the number of feature vectors increases.
For this reason, it is common to reduce the number of training feature vectors
by some modeling technique like clustering. The cluster centers are known as
code vectors, and the set of code vectors is known as a codebook. The most
well known codebook generation algorithm is the K-means algorithm (Linde
et al (1980), Gray (1984)). Soong et al (1987) used the LBG algorithm for
generating speaker-based vector quantization (VQ) codebooks for speaker
recognition; it was demonstrated that a larger codebook and larger test data
give good recognition performance. In order to model statistical variations,
the hidden Markov model (HMM) for text-dependent speaker recognition was
studied and reported (Rosenberg and Parthasarathy (1996), Matsui and Furui
(1994)).
In 1995, Reynolds proposed the Gaussian mixture model (GMM)
classifier for the speaker recognition task (Reynolds 1995). This is the most
widely used probabilistic technique in speaker recognition. In the GMM
modeling technique, the distribution of feature vectors is modeled by the
parameters mean, covariance and weight. The disadvantage of the GMM is
that it requires sufficient data to model the speaker well.
To overcome this problem, Reynolds et al (2000) introduced the GMM
universal background model (UBM) for the speaker recognition task. In this
system, speech data collected from a large number of speakers is pooled
and the UBM is trained, which acts as a speaker-independent model.
The speaker-dependent model is then created from the UBM by performing
maximum a posteriori (MAP) adaptation using speaker-specific training
speech. As a result, the GMM-UBM gives better results than the GMM
(Liu et al 2006). The disadvantage of this approach is the requirement of a
gender-balanced large speaker set for UBM training.
Campbell et al (2006) proposed a generalized linear discriminant
sequence kernel for speaker recognition and language identification tasks.
Kosaka et al (2007) proposed speaker vector-based speaker identification
with phonetic modeling using SVM. The SVM has many desirable properties,
including the ability to classify sparse data without over-training. Among the
developed techniques, state-of-the-art speaker verification systems widely use
MFCC and its derivatives as features and GMM or VQ as the modeling
technique (Jayanna and Mahadeva Prasanna 2009). In this thesis, the
traditional and successful vector quantization method is chosen for speaker
modeling and experimented with using the LBG and K-means algorithms.
5.3.4.1 Vector Quantization
A speaker recognition system must be able to estimate the
probability distributions of the computed feature vectors. Storing every single
vector generated in the training mode is impossible, since these distributions
are defined over a high-dimensional space. It is therefore easier to quantize
each feature vector to one of a relatively small number of template vectors, a
process called vector quantization. VQ is a process of taking a large set of
feature vectors and producing a smaller set of vectors that represent the
centroids of the distribution. The technique of VQ consists of extracting a
small number of representative feature vectors as an efficient means of
characterizing the speaker-specific features. A speaker-specific VQ codebook
is then generated for each known speaker by clustering his/her training
acoustic vectors. The codebook is a set of cells in a multidimensional space.
Each cell defines a small part of the total space and contains a point centered
within the cell, the centroid (Soong et al 1987). The cepstral coefficients
derived from each frame are regarded as a vector (a point) in the space and
thereby belong to one of the cells; a vector always belongs to the cell
containing the closest centroid. A two-dimensional representation of VQ is
shown in Figure 5.14. In the recognition stage, the data from the tested
speaker is compared to the codebook of each speaker and the difference is
measured. These differences are then used to make the recognition decision.
Figure 5.14 Two Dimensional Representation of VQ
5.3.4.2 Linde buzo gray (LBG) algorithm
The LBG algorithm is an iterative algorithm which alternately solves
the nearest-neighbour condition and the centroid condition. It is used for
clustering the feature vectors in the vector quantization process. After the
enrolment session, the acoustic vectors extracted from the input speech of a
speaker provide a set of training vectors. As described above, the next
important step is to build a speaker-specific VQ codebook for this speaker
using those training vectors; the well-known LBG algorithm clusters a set of
L training vectors into a set of M codebook vectors. The flowchart of the
LBG algorithm is given in Figure 5.15.
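A minimal sketch of the LBG codebook construction described above, assuming the desired codebook size M is a power of two; the splitting factor, stopping threshold and empty-cluster handling are illustrative choices:

```python
import numpy as np

def lbg_codebook(vectors, M=16, eps=0.01, split=0.01):
    """Grow a codebook from the global centroid by splitting (m = 2*m) and
    refining with the nearest-neighbour and centroid conditions until the
    relative drop in average distortion D falls below the threshold."""
    codebook = vectors.mean(axis=0, keepdims=True)         # m = 1
    while codebook.shape[0] < M:
        codebook = np.vstack([codebook * (1 + split),      # split each centroid
                              codebook * (1 - split)])
        prev_D = np.inf
        while True:
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)                      # nearest-neighbour condition
            D = d.min(axis=1).mean()                       # average distortion
            for j in range(codebook.shape[0]):             # centroid condition
                if np.any(labels == j):
                    codebook[j] = vectors[labels == j].mean(axis=0)
            if (prev_D - D) / max(D, 1e-12) < eps:         # (D' - D)/D < threshold
                break
            prev_D = D
    return codebook
```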
5.3.4.3 K-means algorithm
The K-means algorithm is another way to cluster the training vectors
into a codebook. In this algorithm the vectors are partitioned into k clusters
based on their attributes, with each cluster represented by the mean (centroid)
of the vectors assigned to it.
Figure 5.15 Flow chart of the LBG algorithm (m: current size of the codebook; M: desired size of the codebook; D: average distance; ε: threshold distance). Starting from the extracted feature vectors, the algorithm finds the initial centroid, splits each centroid (m = 2m), clusters the vectors, finds the new centroids, computes D, and iterates while (D' - D)/D ≥ ε; it stops once m reaches M.
The objective of the k-means algorithm is to minimize the total
intra-cluster variance V:

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2    (5.9)

where there are k clusters S_i, i = 1, 2, ..., k, and \mu_i is the centroid or
mean point of all the points x_j \in S_i (Furui 1989). The k-means algorithm
uses a least-squares partitioning method to divide the input vectors into k
initial sets. It then calculates the mean point, or centroid, of each set and
constructs a new partition by associating each point with the closest centroid.
The centroids are then recalculated for the new clusters, and the algorithm is
repeated until the vectors no longer switch clusters or, alternatively, the
centroids no longer change.
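For reference, the same clustering can be performed with SciPy's k-means routine; the whitening step and the codebook size of 16 below are illustrative choices:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# feature_vectors: (n_frames, n_coeffs) matrix of one speaker's training features
feature_vectors = np.random.randn(500, 13)       # placeholder training data
obs = whiten(feature_vectors)                    # scale each dimension to unit variance
codebook, distortion = kmeans(obs, 16)           # k = 16 code vectors
labels, dists = vq(obs, codebook)                # assign each vector to its cluster
print(codebook.shape, round(float(distortion), 3))
```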
5.3.5 Feature Matching
Earlier studies on speaker recognition (Atal (1976), Furui (1981)
and Campbell (1997)) used direct template matching between training and
testing data. In direct template matching, the training and testing feature
vectors are directly compared using a similarity measure such as a spectral
distance, the Euclidean distance (ED) or the Mahalanobis distance. From the
literature it is found that better results were obtained with the ED on the
text-independent ASR task; thus the ED measure is preferred here for feature
matching.
In the speaker recognition phase, an unknown speaker's voice,
represented by a sequence of feature vectors, is compared with the codebooks
in the database. For each codebook a distortion measure is computed, and the
speaker with the lowest distortion is chosen. The unknown speaker is thus
identified by measuring the distortion between the two vector sets based on
minimizing the Euclidean distance. The Euclidean distance between two
points P = (p_1, p_2, ..., p_n), representing a feature vector of the trained
speaker, and Q = (q_1, q_2, ..., q_n), representing a feature vector of the test
speaker, is defined as

d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2}    (5.10)

Thus, each feature vector in the sequence of the test speaker is compared with
all the codebooks of the trained speakers, and the codebook with the
minimum ED is declared to be the "accepted" or "recognized" speaker.
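A sketch of the distortion-based matching and decision step described above, assuming the trained codebooks are held in a dictionary keyed by speaker; the names and the threshold are illustrative:

```python
import numpy as np

def average_distortion(test_vectors, codebook):
    """Mean Euclidean distance (equation 5.10) from each test vector to its
    nearest code vector."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """Identification: return the speaker whose codebook gives minimum distortion."""
    return min(codebooks, key=lambda s: average_distortion(test_vectors, codebooks[s]))

def verify(test_vectors, codebooks, claimed_id, threshold):
    """Verification: accept the claim if the distortion against the claimed
    speaker's codebook is below the decision threshold T."""
    return average_distortion(test_vectors, codebooks[claimed_id]) < threshold
```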
5.3.6 Performance Metric
The performance of a speaker verification system is measured in
terms of the false acceptance rate (FA %) and the false rejection rate (FR %).
A false acceptance error consists of accepting an identity claim from an
impostor; a false rejection error happens when a valid identity claim is
rejected. These are mathematically represented as

F_A = (I_A / I_T) \times 100    (5.11)

F_R = (C_A / C_T) \times 100    (5.12)

E_T = F_A + F_R    (5.13)

where I_A is the number of impostors classified as true speakers, I_T is the
total number of speakers, F_A is the false acceptance rate, F_R is the false
rejection rate, C_A is the number of true speakers classified as impostors,
C_T is the total number of speakers and E_T is the total error of the
recognition system. The overall recognition accuracy (RA) is calculated by
subtracting the total error of recognition from the maximum RA of 100%
(the ideal case).
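The metrics of equations (5.11)-(5.13) reduce to a few lines; the argument names below are illustrative:

```python
def recognition_metrics(impostors_accepted, total_impostor_trials,
                        clients_rejected, total_client_trials):
    """False acceptance, false rejection, total error and recognition accuracy
    (equations 5.11-5.13)."""
    fa = 100.0 * impostors_accepted / total_impostor_trials   # F_A (%)
    fr = 100.0 * clients_rejected / total_client_trials       # F_R (%)
    total_error = fa + fr                                     # E_T
    ra = 100.0 - total_error                                  # recognition accuracy
    return fa, fr, total_error, ra
```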
5.3.7 Experimental Results and Discussion
Various experiments are conducted to analyze the performance of
the proposed ASR in controlled and uncontrolled environments. Initially, the
performance of the proposed ASR system is analyzed under the clean speech
condition. The experimental analysis is then extended to the noisy
environment using various speech enhancement algorithms in the front-end
processing stage. The speaker recognition studies are carried out using the
standard TIMIT database, consisting of samples from 630 speakers, and the
King database, consisting of samples from 370 speakers. In total, samples
from 1000 speakers are used in this work, of which 570 are male and 430 are
female. The speech files are stored in .wav format with a sampling frequency
of 16 kHz and a quantization resolution of 16 bits per sample. Since most of
the speech information is present up to 4 kHz, the samples are down-sampled
to 8 kHz before use in this work. Each speaker has contributed 5 utterances of
approximately 5 s each. Out of the 1000 speakers, 750 are randomly selected
to form a subset for the study; the selection of the subset was arbitrary.
Among the 5 utterances, 3 utterances of each speaker are used for training and
the remaining two are used for testing.
5.3.7.1 Speaker recognition system under clean condition
Various experiments are conducted to analyze the performance of
the ASR system for different feature extraction methods such as MFCC,
BFCC, MMFCC, RPLP and LPCC using vector quantization for modeling.
The extracted features are modeled using the K-means and LBG (vector
quantization) algorithms. The analysis gives a better understanding of each
feature extraction method, the speaker modeling and their performance under
clean speech conditions. A snapshot of the simulation results of the proposed
system under the clean speech condition is shown in Figure 5.16. A maximum
accuracy of 96.67% is obtained for the proposed scheme under the controlled
environment. Further recognition studies are carried out on test speech
(degraded speech) data with and without front-end processing to evaluate the
performance of the proposed method under noisy conditions (the uncontrolled
environment).
Figure 5.16 Snapshot of simulation results
5.3.7.2 Performance comparison of feature extraction techniques
The performance of the proposed ASR system is analyzed using
various feature extraction methods, namely MFCC, modified MFCC, LPCC,
RPLP, BFCC and the proposed combined one, by varying the filter bank size,
the frame length and the percentage of frame overlap. Figure 5.17 shows the
recognition accuracy while using 24 filter banks to extract both the MFCC
and MMFCC features. The filter banks are designed by placing 12 filters
below 1 kHz and 12 filters above 1 kHz. When analyzing the system
performance with 24 filter banks using the K-means algorithm as the speaker
modeling technique, the system performs well on intermediate-length frames
with a 60% shift over the frames, with an average recognition accuracy of
91%. Even though the 256- and 512-sample frames performed equally well in
extracting features, the system performance relies on the speaker modeling
module. The system performance is improved by 2% for 60% overlapped
shorter frames, because a 60% overlapped frame captures the
speaker-specific information better than 50% or 70% overlap. Therefore, a
60% overlap is preferred for feature extraction.
Figure 5.17 Performance comparison of speaker modeling techniques
for 24 filter banks
Further, the speaker verification system is analyzed for 40 filter
banks, with 24 filters below 1 kHz and 16 filters above 1 kHz. By increasing
the number of filters, the accuracy of the extracted features is improved,
which further increases the RA. From Figure 5.18, it is observed that
increasing the length of the frame loses some of the features, which makes
creating a speaker-specific model difficult. At the same time, increasing the
shift over the frame increases the redundancy, resulting in a reduction of
efficiency by 2% (for 70% overlap). By increasing the number of filter banks
to 40, the recognition accuracy is improved by 2% for the intermediate-length
frame, where the detailed features of the speech signal can be extracted. More
speaker-specific information is held by the lower frequency components than
by the higher frequency components; hence, out of the 40 filter bank outputs,
only the lower 13 outputs are taken for generating the speaker model.
Figure 5.18 Performance comparison of speaker modeling techniques
for 40 filter banks
From Figure 5.18, it is observed that the ASR performs well with a
60% shift over the segment for both LBG and K-means modeling, with
recognition accuracies of 96.67% and 95.6% respectively for an intermediate
frame size of 512. As compared to K-means, the LBG algorithm clusters the
speakers efficiently even though the database contains a large population.
From the above analysis, the proposed scheme using combined MFCC and
MMFCC features for 40 filter banks, with a 512 frame length and an overlap
of 60%, through LBG speaker modeling yielded the highest accuracy of
96.67% over the others. Hence this configuration is preferred for the further
analysis under the noisy condition.
Table 5.1 summarizes the speaker recognition accuracy (%) for the
two speaker modeling techniques with different frame sizes using the various
feature extraction techniques, for 60% frame overlap.
Table 5.1 Performance comparison of speaker recognition accuracy (%) for different feature extraction techniques

Speaker modeling  Frame   MFCC   MMFCC  LPCC   RPLP   BFCC   Combined MFCC
technique         size                                       and MMFCC
K-means           256     89.02  73.59  43.09  64.77  39.67  94.82
                  512     90.79  68.97  39.77  65.77  35.82  95.6
                  1024    83.52  72.28  30.68  46.17  23.73  86.48
                  2048    79.06  65.97  23.47  45.87  19.69  81.31
                  4096    74.15  53.98  20.78  29.78  16.26  76.22
LBG               256     89.67  75.37  47.82  67.57  40.64  95.76
                  512     92.78  74.27  40.87  65.49  37.82  96.67
                  1024    86.42  72.08  30.98  58.23  25.27  88.49
                  2048    80.86  66.89  28.97  47.39  23.34  84.06
                  4096    75.85  38.98  25.65  38.86  20.07  78.32
From Table 5.1 it is observed that the combined MFCC and
MMFCC features show higher performance than the other feature extraction
methods for both speaker modeling techniques. Relatively higher results are
obtained for the system with the combined feature extraction technique, for
both LBG and K-means speaker modeling, for 256- and 512-length frames
with 60% overlap. This indicates that the smaller frames capture the
speaker-specific information in a better manner, because this mode effectively
masks the intra-speaker variability. RAs of 95.76% and 94.82% are obtained
for the frame length 256 while using LBG and K-means respectively. For the
frame length of 512, LBG and K-means yielded maximum accuracies of
96.67% and 95.6% respectively.
Based on the results shown above, a paper was published in the
Journal of Computer Science entitled "A New Speaker Recognition System
with Combined Feature Extraction Techniques", vol. 7, issue 4, pp. 459-465,
2011.
5.3.8 Speaker Recognition in Noisy Environment
The noisy speech is created by adding noise from the NOISEX-92
database to the test utterances. The noise waveform added to the speech is
scaled to give the desired global SNR; the SNR is varied over -5 to 15 dB.
During testing, the degraded speech signal is enhanced using the individual
and proposed methods described in the earlier chapters. To illustrate the merit
of the reported TADWT and SSWPT enhancement methods, Figures
5.19 (a)-(h) show the LPC spectra of clean, degraded, and spectral
subtraction, iterative Wiener filter, Ephraim-Malah filter, bionic wavelet
transform, time adaptive discrete wavelet transform and combined spectral
subtraction with wavelet packet transform processed speech respectively.
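A sketch of the noise-scaling step used to set the desired global SNR before mixing; looping short noise recordings to the utterance length is an illustrative detail:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested global SNR."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]    # loop the noise to full length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. noisy = add_noise_at_snr(test_utterance, babble, snr_db=-5)
```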
Table 5.2 summarizes the recognition results for the various noise
types at different SNR levels, varying from -5 dB to +15 dB. In the table, the
abbreviations DEG, SS, IWF, EMF, BWT, TADWT and SSWPT refer to
degraded speech, spectral subtraction, iterative Wiener filtering,
Ephraim-Malah filtering, the bionic wavelet transform, the time adaptive
discrete wavelet transform and combined spectral subtraction with the
wavelet packet transform respectively. Avg. RA denotes the average
percentage of recognition accuracy, i.e. the average performance of the
different approaches for the corrupted speech across the noise levels.
Figure 5.19 LPC spectra of (a) clean speech, (b) degraded speech (factory noise), and speech enhanced by (c) SS, (d) IWF, (e) EMF, (f) BWT, (g) TADWT, (h) SSWPT
Table 5.2 Speaker recognition performance under noisy environment:
recognition accuracy (RA, %) at each input SNR (dB) and average RA

Babble noise                                    Street noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   4.2   13    19.5  24.6  12.4     0.5   4.2   13    19.5  24.6  12.4
SS       5.8   21    41    62    70.5  40.1     5.1   20.7  37.3  57    66.5  37.3
IWF      5.1   18.7  37.5  58.2  62.5  36.4     4.8   17.9  37.5  58.5  67.5  37.2
EMF      10.9  32    57    75.5  78.5  50.8     10.7  31.2  57    69.7  77.5  49.2
BWT      11.5  28.5  54.5  74.2  76.4  49.0     11.1  32.5  54.5  70.4  77.2  49.1
TADWT    22.5  40.9  60.5  76.5  84    56.9     21.5  39.9  59.6  76.5  83.4  56.2
SSWPT    21.9  41.5  61.3  78.5  82.5  57.1     20.3  40.1  60.4  78.5  82.8  56.4

HF channel noise                                Factory noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      1.1   3     12.5  17    26.5  12.0     0.5   5.2   12.7  20.5  24.8  12.7
SS       5.4   31    44    55.5  65    40.2     6.2   18.9  38.2  58.7  67.4  37.9
IWF      4.9   28    33.5  37    57.5  32.2     5.6   16.3  36.9  59.8  67.5  37.2
EMF      12.7  40.5  49    60.7  80.5  48.7     9.8   32.6  56.2  71.7  78.5  49.8
BWT      11.1  40    50    60.5  78    47.9     13.1  31.4  53.4  72.3  79.2  49.9
TADWT    26.9  41.5  63.5  78.6  84.1  58.9     22.6  42.5  58.6  78.7  82.9  57.1
SSWPT    27.4  43    64.1  79.7  83.6  59.6     21.9  43.2  61.4  79.2  81.6  57.5

Train noise                                     Exhibition noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.6   2.8   12.6  19    28.5  12.7     0.9   7     20.4  24.6  38.5  18.3
SS       6.6   23    56.8  65.5  78    46.0     5.7   29.8  57    66    74    46.5
IWF      5.8   19.5  43.2  58    70    39.3     6.8   24.7  54    65    72.5  44.6
EMF      12.8  24.5  57.8  71.5  78.6  49.0     7.4   32    58.7  73    79.2  50.1
BWT      11.2  24    54    68    77.9  47.0     8.1   30.4  61.2  70.4  79.5  49.9
TADWT    25.4  38.5  66.5  78.7  84    58.6     22.8  43.7  62.7  76.7  82.5  57.7
SSWPT    25.6  39.2  67.5  77    83    58.5     20.7  42.4  63    77.5  81.3  57.0

Airport noise                                   Station noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   2     8.5   14    18.8  8.8      0.8   8.5   14    17.6  39.2  16.0
SS       9     20.3  48.7  56    68.5  40.5     6.8   25.1  49    57.8  67    41.1
IWF      10.8  19    45.8  55.3  69.5  40.1     3.5   22.9  47.5  56    65.8  39.1
EMF      17.6  25    59.6  57.5  77    47.3     11.6  28.5  54.6  60.6  72    45.5
BWT      16.5  24.6  59.2  66.5  76.7  48.7     9.8   29    53    59.3  71    44.4
TADWT    21.3  47.5  61.5  73.3  83.5  57.4     20.4  42.6  63.5  73.8  83.5  56.8
SSWPT    21.9  46    60.7  71.9  82.8  56.7     21.8  41.4  64.4  75.4  84.3  57.5

Car noise                                       Restaurant noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   1.5   5.8   13.6  20.5  8.4      1.6   7     15.9  20.5  24.7  13.9
SS       7.5   18.8  32.5  38.6  62    31.9     7.4   18.6  42.8  64.6  69.4  40.6
IWF      6.8   17.9  32.1  37    58.7  30.5     6     13.2  41.7  61.5  63.2  37.1
EMF      12.5  22.6  36.4  42.6  69    36.6     11    26.8  53.6  69.9  73.8  47.0
BWT      13    21.7  38.6  44    71.4  37.7     8.5   27.7  52.4  68    71.5  45.6
TADWT    20.8  45.7  59.5  72.7  82.8  56.3     21.4  44.7  64.4  74.9  84.6  58.0
SSWPT    21.1  45.4  60    73.1  83.6  56.6     24.6  43.8  63.5  72.5  84.1  57.7
Figure 5.20 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at -5 dB input
SNR
Figure 5.20 shows the performance of the speaker recognition
system with and without the speech enhancement techniques for the various
noises at -5 dB. It is observed from the analysis that the system's RA
increases by 20% to 25% when the TADWT and SSWPT SE techniques are
included as a front-end processor, with the highest average recognition
accuracies of 22.6% and 22.8% as compared to the others. From Figure 5.21,
it is observed that the system efficiency improves by an average of 18.6% and
14% while using SS and IWF respectively, whereas the proposed TADWT
and SSWPT methods give average improvements of 44.7% and 43.8%
respectively. The overall performance of TADWT and SSWPT is 13% higher
than EMF and 14% higher than BWT. The TADWT based front-end
processor provides the highest improvement in recognition accuracy (47.5%)
for the airport noise corrupted speech. Even though the TADWT based
speaker recognition system performs well under the different noisy
conditions, its performance is slightly lower than SSWPT for train (0.7%) and
street (0.2%) noise corrupted speech.
Figure 5.21 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 0 dB input
SNR
From Figure 5.22, it is inferred that the average recognition
accuracy improvement varies from 25% to 44% when using any one of the
speech enhancement techniques, as compared to the recognition accuracy for
the degraded speech. Higher average RA improvements of 79.5% for
TADWT and 80% for SSWPT are shown with reference to the degraded
speech. The lowest average recognition accuracy of 36% is shown by IWF
for the 5 dB noises, while the recognition results obtained using SS, EMF and
BWT are 42.6%, 49.7% and 48% respectively. With TADWT and SSWPT
front-end processing, the highest average recognition accuracies of 54.6%
and 55.6% are achieved respectively. Figure 5.23 shows the performance of
the speaker recognition system for the different techniques and corrupted
speech under the different noisy environments (10 dB). For the degraded
speech at the 10 dB SNR level, the system yields average recognition
accuracies of 56.3%, 50.5%, 63.6%, 62.9%, 69.1% and 69.28% while using
the SS, IWF, EMF, BWT, TADWT and SSWPT techniques respectively to
enhance the efficiency of the system under real-life noise conditions.
Figure 5.22 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 5 dB input
SNR
Figure 5.23 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 10 dB input
SNR
From Figure 5.24, it is observed that even for the noise-degraded
speech an average recognition accuracy of 27% is obtained, because of the
high input SNR. The speech enhancement techniques improve the accuracy of
the system by a minimum of 39% and a maximum of 56% compared to the
degraded speech at the 15 dB level. While using the SS, IWF, EMF, BWT,
TADWT and SSWPT techniques as the preprocessing stage to enhance the
efficiency of the system under real-life noise conditions, average recognition
accuracies of 68.7%, 63.9%, 76.8%, 75.7%, 80% and 80.2% are attained
respectively.
Figure 5.24 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 15 dB input
SNR
From Figure 5.25, it is observed that the proposed ASR with the
reported TADWT and SSWPT shows better performance than the other SE
methods under the noisy environment. The recognition performance of the
TADWT method is higher than that of SSWPT at 0 dB (by 0.2%) and 15 dB
(by 0.5%), whereas at -5 dB, 5 dB and 10 dB SSWPT gives improvements of
0.2%, 0.6% and 0.3% over TADWT. For higher SNR values, in particular
15 dB, the results of the proposed methods are closer to those of the clean
speech condition, because at higher SNR values the noise level is very low.
Figure 5.25 Comparison of average recognition accuracy of the system
for the degraded speech at various input SNR levels
From the overall analysis at the different dB levels for the different
kinds of noise, the TADWT and SSWPT methods perform almost equally
well, with average recognition accuracies of 57.4% and 57.65% respectively.
The improvement in recognition accuracy with these methods is notable
compared with the other speech enhancement techniques on this task,
specifically for the degraded speech at the lower input SNR levels of -5 dB,
0 dB and 5 dB.
Based on the results shown above, a paper was published in the
International Journal of Information Analysis and Processing entitled
"Robust Speaker Recognition System using Combined Feature Extraction
Techniques", vol. 4, no. 1, pp. 27-33, 2011.
5.4 CONCLUSION
The performance of the reported TADWT and SSWPT SE methods
is evaluated in the speaker recognition task in the noisy environment by using
them as a front-end processor. An experimental analysis of the speaker
recognition system's accuracy is carried out by extracting different features
and by using the LBG and K-means speaker modeling techniques. Based on
the obtained results, a new speaker recognition system is developed by
combining the MFCC and MMFCC features to achieve higher recognition
accuracy. Its performance is studied with both clean speech and
noise-corrupted speech. The proposed ASR system yielded an RA of 96.67%
under the clean speech condition. To assess the performance of the suggested
ASR system in the noisy environment, synthetic degraded speech is generated
for each type of degradation at different input SNR levels. The degraded test
speech is enhanced by the SS, IWF, EMF, BWT, TADWT and SSWPT
speech enhancement methods before the feature extraction step to improve
the perceptual quality of the speech. The average recognition accuracy of the
system is improved by incorporating the SE methods, as compared to the
degraded speech. Irrespective of the noise types and SNR levels considered in
this task, the average recognition accuracy of the system is 12.8%, 40.2%,
37.4%, 47.4%, 46.9%, 57.4% and 57.65% for the degraded speech, SS, IWF,
EMF, BWT, TADWT and SSWPT respectively. The recognition results show
that the TADWT and SSWPT methods give relatively higher performance
than the other methods; however, the higher performance of SSWPT comes at
the cost of increased computational complexity.