CHAPTER 5
ENHANCEMENT OF RECOGNITION ACCURACY OF
AUTOMATIC SPEAKER RECOGNITION SYSTEM IN
NOISY ENVIRONMENTS
5.1 INTRODUCTION
In this thesis, two wavelet-based methods (TADWT and SSWPT)
for the enhancement of noisy speech are developed and reported. The
effectiveness of the reported speech enhancement techniques is evaluated by
means of speaker recognition under noisy conditions. Automatic speaker
recognition has developed into an increasingly important technology required
by many speech-aided applications, but its practical implementation on
handheld devices or over the internet faces many challenges, one of which is
environmental noise. A new ASR system is developed and presented which
aims to provide robustness at the signal level by using the TADWT and
SSWPT speech enhancement methods as a front-end processing stage,
improving the speaker-specific features and hence the speaker recognition
performance in noisy environments. A comparative analysis of feature
extraction and speaker modeling techniques is made to obtain an improvement
in recognition accuracy.
In this chapter, the speaker recognition system is first developed for
open-set applications under the clean speech condition and its performance is
analyzed. The experimental analysis of the proposed speaker recognition
system is then extended to the noisy environment using various speech
enhancement algorithms in the front-end processing stage. In order to evaluate
and compare some of the existing enhancement algorithms with the proposed
TADWT and SSWPT, the test signals were corrupted by WGN, pink and
other real-life environmental noises at -5 dB, 0 dB, 5 dB, 10 dB and 15 dB
SNRs. The noisy test signals are then enhanced by the SS, IWF, EMF, BWT,
TADWT and SSWPT algorithms, and the enhanced signals are evaluated by
the recognition system. The evaluation is based on the recognition accuracy.
5.2 AUTOMATIC SPEAKER RECOGNITION (ASR) SYSTEM
Speaker recognition is the process of automatically recognizing
speakers on the basis of individuality information in the speech signal.
Applications of speaker recognition systems include access to buildings, cars
and highly secured campuses, site security monitoring, telecommunications,
law enforcement (e.g. forensics), e-transactions, information retrieval from
large databases, personal identification numbers (PIN) and so on
(Uludag et al 2004). Speaker recognition can be divided into speaker
identification and speaker verification. In speaker identification, the task is to
identify the speaker from the speech signal. The task of a verification system
is to authenticate the claim of a speaker based on the test speech
(Campbell 1997). According to the constraints placed on the speech used to
train and test the system, automatic speaker recognition can be further
classified into text-dependent and text-independent tasks. In text-dependent
recognition, the user must speak a given phrase known to the system, which
can be fixed or prompted; knowledge of the spoken phrase can provide better
recognition results. In text-independent recognition, the system does not know
the phrase spoken by the user (Bimbot et al 2004). Hence, this work focuses
on the text-independent speaker recognition task.
A speaker recognition system typically involves two phases: training
and testing. In the training phase, a user enrolls by providing voice samples to
the system. The system extracts speaker-specific information from the voice
samples to build a voice model of the enrolling speaker. In the testing phase, a
user provides a voice sample (also referred to as test sample) that is used by
the system to measure the similarity of the user’s voice to the model(s) of the
previously enrolled user(s) and subsequently, to make a decision. The speaker
associated with the model that is being tested is referred to as target speaker
or claimant. The proposed system is designed for the open-set situation. As
mentioned in the previous section, speaker identification can be divided into two
categories: closed-set speaker identification and open-set speaker
identification. Given a set of enrolled speakers and a test utterance, open-set
speaker identification is defined as a twofold problem. Firstly, it is required to
identify the speaker model in the set, which best matches the test utterance.
Secondly, it must be determined whether the test utterance has actually been
produced by the speaker associated with the best-matched model, or by some
unknown speaker outside the enrolled set. That is, in a speaker identification
task, the system measures the similarity of the test sample to all stored voice
models and in speaker verification task, the similarity is measured only to the
model of the claimed identity (Aronowitz et al 2005).
Figure 5.1 Generic speaker recognition system (speech → feature extraction → classification: pattern matching against stored speaker models, followed by a decision → accept/reject)
Like most pattern recognition problems, a speaker recognition
system can be partitioned into two modules: feature extraction and
classification. The classification module has two components: pattern
matching and decision. Figure 5.1 depicts a generic speaker recognition
system. The feature extraction module estimates a set of features from the
speech signal that represent some speaker-specific information. The set of
features should be consistent for each speaker and have variability between
speakers. The pattern matching module is responsible for comparing the
estimated features to the speaker models. There are many types of pattern
matching methods and corresponding models used in speaker recognition
(Campbell 1997). Some of the methods include hidden Markov models
(HMM), dynamic time warping (DTW) and vector quantization (VQ). In
open-set applications (open-set speaker identification and speaker
verification), the estimated features can also be compared to a model that
represents the unknown speakers (Furui 1997). In an identification task, this
module outputs similarity scores for all stored voice models; in a verification
task, it outputs a similarity score between the test sample and the model of the
claimed identity. The decision module then analyzes the similarity score(s),
using a statistical or deterministic rule, to make a decision; the decision
process depends on the system task.
The effectiveness of a speaker recognition system is measured
differently for different tasks. Since the output of a closed-set speaker
identification system is a speaker identity from a set of known speakers, the
identification accuracy is used to measure the performance. For the speaker
detection/verification systems, there are two types of error: false acceptance
of an impostor and false rejection of a target speaker. The performance
measure can also incorporate the cost associated with each error, which
depends on the application. For example, in a telephone credit card purchase
system, a false acceptance is very costly; in a toll fraud prevention system,
false rejection can alienate customers (Kim and Rose 2003).
Aside from biometric commercial applications, the forensic domain
is another important field where speaker recognition is used. To meet the
challenges of recognizing speakers in real time, several groups are active in
this area. The Acoustic Research Institute of the Australian Academy of
Sciences is currently working on automatic speaker identification. A research
group in the Language Technologies Institute at Carnegie Mellon University,
Pittsburgh, is currently working on robust speaker recognition. Lincoln
Laboratory at the Massachusetts Institute of Technology is presently carrying
out research on speaker recognition, focused on the application of recognition
techniques to multi-speaker speech and on improving the performance of
recognition systems under mismatched channel conditions. Speech synthesis
and recognition, speaker recognition, speech recognition for Indian languages
and spoken language identification are some of the research areas at IIT,
Chennai, India. Researchers at the Tokyo Institute of Technology (Japan), the
Air Force Research Laboratory/IFEC and the Speech Processing Lab,
Philadelphia (USA) are working to improve the performance of speaker
recognition systems in real environments.
5.2.1 Introduction to Speaker Recognition in Real Environments
The main challenge for automatic speaker recognition is to deal
with the variability of the environments and channels from where the speech
was obtained. Existing speaker recognition systems achieve good results, such
as high accuracy of speaker identification and verification, for clean
high-quality speech with matched training and test acoustic conditions.
However, under mismatched conditions and noisy environments, as often
expected in real-world conditions, the performance of the systems degrades
significantly, far below the satisfactory level (Varchol et al 2005). The
presence of irrelevant information (such as environment information) may
actually degrade the system accuracy. Therefore, robustness has become a
crucial research issue in the speaker recognition field. Ji Ming et al (2007)
investigated the problem of speaker identification and verification in noisy
conditions and suggested a method which combines multi-condition model
training and missing-feature theory to model noise with unknown
temporal-spectral characteristics. Missing-feature theory is applied to refine
the compensation by ignoring noise variation outside the given training
conditions, thereby reducing the training and testing mismatch. In this thesis,
the main focus is to improve the robustness of the speaker recognition system.
Some of the sources that degrade the performance of speaker recognition are:
i. Variation in the speaker characteristics (speaking rate, style
and stress)
ii. Different microphones used in the training and testing
environment
iii. Distortion introduced by the channel such as telephone
network
iv. Distortion due to background noise
This work concentrates on distortion due to background noise.
Speaker recognition systems are generally trained using data obtained in
controlled conditions. This data is acquired in a noise-free environment using
high-quality microphones. Practically, in any speaker recognition application
the input speech signal may not always be clean, in particular during testing,
and may be corrupted in many ways that degrade the quality of the speech
signal and therefore reduce the performance of the speaker recognition system
(Reynolds 1995). Many strategies have been adopted to deal with background
noise degradation and to provide robustness to the recognizer. The process of
providing robustness to the recognizer can be accomplished at three different
levels (Krishnamurthy and Prasanna 2009):
i. Robustness at the signal level (speech enhancement)
ii. Robustness at the feature level (feature enhancement)
iii. Robustness at the classifier level (model enhancement)
The presented work aims to provide robustness at the signal
level by using the TADWT and SSWPT methods as a preprocessing stage.
The text-independent speaker verification task is taken for demonstration.
5.3 PROPOSED NOISE ROBUST SPEAKER RECOGNITION
SYSTEM
This work focuses on several issues relating to the
implementation of the new model for real-world applications. It aims to
provide robustness at the signal level by using the developed and reported
TADWT and SSWPT speech enhancement methods as a front-end processing
stage: speech signals are enhanced before the feature extraction stage, so
that before being transformed into feature vectors, the degraded speech
undergoes an enhancement step which tries to filter out the degradation.
The proposed speaker recognition system is schematically
illustrated in Figure 5.2, in which the front-end processing block refers to
speech enhancement using the TADWT and SSWPT methods. MFCC
features are used for speaker identification (they possess high inter-speaker
variability) and MMFCC features are used for speaker verification (they
possess low intra-speaker variability) to increase robustness and achieve
higher recognition accuracy. The system involves the usual training and
testing phases. In the training phase, a user enrolls by providing voice samples
to the system under a controlled environment, and the system extracts
speaker-specific information from the voice samples to build a voice model of
the enrolling speaker. During the testing phase, degraded speech samples are
passed through the front-end processor (speech enhancement) for the
reduction of noise outside the given training conditions, thereby reducing the
training and testing mismatch. The enhanced speech is then used by the
system to measure the similarity of the user's voice to the model(s) of the
previously enrolled user(s) and subsequently to make a decision.
Figure 5.2 Block diagram of the proposed noise robust ASR system (training speech → models for speakers 1, 2, ..., n; test speech of a user claiming authority → front-end processing (speech enhancement) → pre-processing → feature extraction (MFCC for identification, MMFCC for verification) → pattern matching against the speaker models → similarity measure: > T accept, < T reject → speaker identity)
5.3.1 Front End Processing (Speech Enhancement)
Many solutions developed to deal with the noisy speech
enhancement problem were discussed in the previous chapters. Generally,
these solutions can be classified into two main areas: temporal processing and
spectral processing based speech enhancement techniques. Temporal
processing involves identification and enhancement of high-SNR regions in
the time domain representation of the degraded speech signal. Spectral
processing involves estimation and elimination of the degradation component
and identification and enhancement of speech-specific spectral features in the
frequency domain representation of the degraded speech (Krishnamoorthy
and Prasanna 2011). The front-end speech enhancement methods used in this
work, the proposed and reported TADWT and SSWPT, come under the
second category.
5.3.2 Feature Extraction
Feature extraction is the key to the front-end process in speaker
recognition systems. Feature extraction aims at giving a useful representation
of the speech signal by capturing the important information from it. It
transforms the speech signal into a compact but effective representation that is
more stable and discriminative than the original signal. The performance of a
speaker verification system is highly dependent on the quality of the selected
speech features. Several feature extraction methods have already been studied
for the speaker recognition task.
The most commonly used acoustic vectors are Mel Frequency
Cepstral Coefficients (MFCC) (Yegnanarayana et al 2005), Bark Scale Filter
bank Cepstrum Coefficients (BFCC), Linear Prediction Cepstral Coefficients
(LPCC) (Reynolds 1995) and Perceptual Linear Prediction Cepstral (PLPC)
Coefficients. All of these features are based on the spectral information
derived from a short time windowed segment of speech. They differ mainly in
the detail of the power spectrum representation. MFCC features are derived
directly from the FFT power spectrum whereas the LPCC and PLPC use an
all-pole model to represent the smoothed spectrum. In all of the above
methods, the cepstral coefficient with index zero is discarded for energy
normalization.
PLPC and MFCC features are used in most state-of-the-art automatic speech
recognition systems. A new modification of Mel-Frequency Cepstral
Coefficient (MMFCC) feature has been proposed for extraction of speech
features for speaker verification application (Goutam Saha et al 2000).
However, MFCC features are used in more and more speaker recognition
applications. For example, most of the participating systems in NIST speaker
recognition evaluations in 1998 used MFCC features and some systems used
LPCC features (Doddington et al 2000).
Revised perceptual linear prediction was proposed by Pawan
Kumar et al (2010) for the purpose of identifying the spoken language; the
Revised Perceptual Linear Prediction Coefficients (RPLP) were obtained from
a combination of MFCC and PLP. Speech recognizers use a parametric form
of the signal to obtain the most important distinguishable features of the
speech signal for the recognition task.
5.3.3 Selection of Feature Extraction Technique
A speaker's speech features are selected based on the acoustic
vector distributions of different samples of the same speaker and of different
speakers' samples. Figure 5.3 to Figure 5.7 show the acoustic vector
distributions of different samples from the same speaker using MFCC,
MMFCC, BFCC, LPCC and RPLP features respectively. The feature vector
space for MMFCC has lower variability among the different samples of the
same speaker as compared with the other feature distributions; thus MMFCC
features are used for verification in the proposed system. Figure 5.8 to
Figure 5.12 show the acoustic vector distributions of three different speakers'
samples using MFCC, MMFCC, BFCC, LPCC and RPLP features
respectively. The feature vector space for MFCC has higher variability among
the different speakers as compared with the other feature vector distributions;
thus MFCC features are selected for identification in the proposed system.
Figure 5.3 2D acoustic vector distribution of the same speaker's different samples using MFCC features (1st vs. 2nd dimension; Signals 1-3)
Figure 5.4 2D acoustic vector distribution of the same speaker's different samples using MMFCC features (1st vs. 2nd dimension; Signals 1-3)
Figure 5.5 2D acoustic vector distribution of the same speaker's different samples using BFCC features (first vs. second dimension; Signals 1-3)
Figure 5.6 2D acoustic vector distribution of the same speaker's different samples using LPCC features (first vs. second dimension; Signals 1-3)
Figure 5.7 2D acoustic vector distribution of the same speaker's different samples using RPLP features (first vs. second dimension; Signals 1-3)
Figure 5.8 2D acoustic vector distribution of different speakers' samples using MFCC features (1st vs. 2nd dimension; Users 1-3)
Figure 5.9 2D acoustic vector distribution of different speakers' samples using MMFCC features (1st vs. 2nd dimension; Users 1-3)
Figure 5.10 2D acoustic vector distribution of different speakers' samples using BFCC features (first vs. second dimension; Users 1-3)
Figure 5.11 2D acoustic vector distribution of different speakers' samples using LPCC features (first vs. second dimension; Users 1-3)
Figure 5.12 2D acoustic vector distribution of different speakers' samples using RPLP features (first vs. second dimension; Users 1-3)
5.3.3.1 MFCC feature extraction
MFCCs, developed by Davis and Mermelstein (1980), are one of the
most popular features extracted from the speech signal. MFCCs are shown to
be less susceptible to variations of the speaker's voice and surrounding
environment. The method is based on the known variation of the human ear's
critical bandwidths with frequency: the neighbouring frequency bins are
grouped into overlapping triangular bands of equal bandwidth on the mel
scale, with filters spaced linearly at low frequencies and logarithmically at
high frequencies, to capture the phonetically important characteristics of
speech. The MFCC feature extraction technique basically includes windowing
the signal, applying the DFT, taking the log of the magnitude and warping the
frequencies on a mel scale, followed by applying the discrete cosine transform
(DCT). The various steps involved in MFCC feature extraction are
pre-emphasis, frame blocking and windowing, mel-spectrum computation and
the DCT.
i Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the
higher frequencies. Its purpose is to balance the spectrum of voiced sounds
that have a steep roll-off in the high frequency region; it also removes some of
the glottal effects from the vocal tract parameters. The most commonly used
pre-emphasis filter is given by the transfer function

H(z) = 1 - a z^{-1}    (5.1)

where the value of a controls the slope of the filter and usually lies between
0.4 and 1.0 (Picone 1993). In this work it is set to 0.9375. This first-order
filter compensates for the fact that the lower formants contain more energy
than the higher ones.
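As an illustration, the following is a minimal pre-emphasis sketch in Python/NumPy, using the value a = 0.9375 adopted in this work (the function name is illustrative):

```python
import numpy as np

def pre_emphasize(signal, a=0.9375):
    """First-order pre-emphasis filter H(z) = 1 - a*z^-1 (equation 5.1).

    y[n] = x[n] - a*x[n-1]; it boosts the high frequencies to balance
    the steep spectral roll-off of voiced speech.
    """
    x = np.asarray(signal, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])
```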
ii Frame blocking and windowing: The speech signal is a slowly
time-varying, or quasi-stationary, signal. Therefore, speech analysis must
always be carried out on short segments across which the speech signal is
assumed to be stationary. Short-term spectral measurements are typically
carried out over 20 ms windows advanced every 10 ms. Advancing the time
window every 10 ms enables the temporal characteristics of individual speech
sounds to be tracked, while the 20 ms analysis window is usually sufficient to
provide good spectral resolution of these sounds and at the same time short
enough to resolve significant temporal characteristics. Generally Hamming or
Hanning windows are used; this is done to enhance the harmonics, smooth the
edges and reduce the edge effect when taking the DFT of the signal. The
Hamming window is used in this system because it has less spectral leakage.
Each windowed frame is converted into a magnitude spectrum by applying
the DFT.
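A short framing-and-windowing sketch under the assumptions stated above (20 ms frames advanced every 10 ms, 8 kHz sampling, a 512-point DFT); the function name and defaults are illustrative:

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=20.0, shift_ms=10.0, n_fft=512):
    """Split speech into overlapping Hamming-windowed frames and return
    the magnitude spectrum of each frame (assumes len(signal) >= one frame)."""
    frame_len = int(fs * frame_ms / 1000)        # 160 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)            # 80 samples at 8 kHz
    window = np.hamming(frame_len)               # less spectral leakage than rectangular
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
```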
iii Mel-spectrum: The mel-spectrum is computed by passing the
Fourier-transformed signal through a set of band-pass filters known as the mel
filter bank. A mel is a unit of measure of the human ear's perceived pitch or
frequency of a tone. The characteristic is expressed on the mel frequency
scale, which has linear frequency spacing below 1 kHz and logarithmic
spacing above 1 kHz (Deller et al 1994). The following function transforms
the real (linear) frequency to the mel frequency:

Mel(f) = 2595 \log_{10}(1 + f/700)    (5.2)
Figure 5.13 Mel-spaced filter banks (triangular filter amplitudes versus frequency [Hz])
Following the mel scaling, the remaining signal values are
compressed with a nonlinear function, such as the logarithm operation. This
results in a signal in the cepstral domain with a frequency peak corresponding
to the pitch of the signal and a number of formants representing low-frequency
peaks. The triangular filter banks with mel-frequency warping are shown in
Figure 5.13.
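A sketch of the mel filter bank construction based on equation (5.2); the 24-filter default mirrors the experiments of Section 5.3.7, while the helper names and bin-edge handling are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # equation (5.2)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse of equation (5.2)

def mel_filterbank(n_filters=24, n_fft=512, fs=8000):
    """Triangular filters equally spaced on the mel scale, returned as an
    (n_filters, n_fft//2 + 1) weight matrix."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):            # rising edge of the triangle
            fbank[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fbank[j - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# log mel-spectrum of one magnitude spectrum `spec`:
#   log_mel = np.log(mel_filterbank() @ spec + 1e-10)
```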
iv DCT: Since the vocal tract is smooth, the energy levels in adjacent
bands tend to be correlated. The DCT is applied to the transformed
mel-frequency coefficients to produce a set of cepstral coefficients; its purpose
is to reduce the dependency between the feature vector components within a
single frame. Mel-frequency cepstral coefficients (MFCCs) are calculated
from the log filter bank amplitudes m_j using equation (5.3):

C_n = \sqrt{2/N} \sum_{j=1}^{N} m_j \cos(\pi n (j - 0.5)/N)    (5.3)

where C_n is the n-th cepstral coefficient and N is the number of filter bank
channels. Since most of the signal information is represented by the first few
MFCC coefficients, the system can be made robust by extracting only those
coefficients and ignoring the higher-order DCT components (Deller et al
1994). In this work the first 13 coefficients are extracted. The coefficient with
index zero is excluded, since it represents the average log energy of the input
signal, which carries only little speaker-specific information.
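The DCT step of equation (5.3) then reduces to a few lines; the sketch below keeps the first 13 coefficients and skips C_0, as described above:

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_coeffs=13):
    """MFCCs from the log filter bank amplitudes m_j (equation 5.3):
    C_n = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi*n*(j - 0.5)/N)."""
    N = len(log_mel)
    n = np.arange(1, n_coeffs + 1)               # n = 1..13, skipping C_0
    j = np.arange(1, N + 1)
    basis = np.cos(np.pi * np.outer(n, j - 0.5) / N)
    return np.sqrt(2.0 / N) * basis @ log_mel
```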
5.3.3.2 Modified MFCC feature extraction
A modified mel-frequency cepstral coefficient (MMFCC) is an
improved version of the conventional MFCC for the speaker verification
application. A weighting function is introduced which is unique for each
frame of an utterance for each speaker (Goutam Saha et al 2000). From the
filter bank outputs, the average value is calculated using equation (5.4):

avg = \frac{1}{20} \sum_{i=1}^{20} S_k[i]    (5.4)

where S_k[i] are the filter bank outputs. The city block distance M_k[i] of
each filter bank output from the average is calculated using equation (5.5):

M_k[i] = |S_k[i] - avg|    (5.5)
Sweep is unique for each frame (window), since it is calculated
from the filter bank outputs of that frame using equation (5.6). It represents
the total variation in magnitude of the filter outputs for each frame and gives a
measure of the magnitude spread of the coefficients, equivalent to the
variance in the Euclidean distance measure:

Sweep = \log \sum_{i=1}^{20} M_k[i]    (5.6)

Finally, the weighting function is derived as

W[i] = \log(S_k[i]) / Sweep    (5.7)

Then the MMFCCs are obtained as

C_n = \sum_{i=1}^{20} \log(S_k[i]) \, W[i] \cos(\pi n (i - 0.5)/20)    (5.8)
where i = 1, 2, ..., 20. MMFCC uses compensation based on the magnitude of
spread, through a frame-based weighting function, to preserve the
speaker-dependent information in different frames. The variation of
intensity/loudness at different segments of a spoken word may influence the
magnitude of the coefficients, affecting cluster formation in the parameter
space of a speaker. MMFCC is a frame-based technique that reduces these
effects through normalization of the coefficients in each frame by their total
spread, so that the coefficients of all the frames are brought to the same level
of spread. This also minimizes the effect of changes in the background noise
level in SR applications where the speaker is moving from one environment
to another while speaking. MMFCC features show enhanced discriminative
ability, which is important in speaker verification applications.
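A per-frame sketch of the MMFCC computation of equations (5.4)-(5.8), assuming 20 filter bank outputs; reading equation (5.6) as the log of the summed city-block distances is an interpretation of the text above, and the epsilon guards are illustrative:

```python
import numpy as np

def mmfcc(fbank_out, n_coeffs=13, eps=1e-10):
    """MMFCCs for one frame from the 20 filter bank outputs S_k[i]."""
    S = np.asarray(fbank_out, dtype=float)
    avg = S.mean()                               # equation (5.4)
    M = np.abs(S - avg)                          # equation (5.5): city block distance
    sweep = np.log(M.sum() + eps)                # equation (5.6): magnitude spread
    W = np.log(S + eps) / sweep                  # equation (5.7): frame weighting
    n = np.arange(1, n_coeffs + 1)
    i = np.arange(1, S.size + 1)
    basis = np.cos(np.pi * np.outer(n, i - 0.5) / S.size)
    return basis @ (np.log(S + eps) * W)         # equation (5.8)
```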
5.3.4 Speaker Modeling
From the extracted features, a speaker model is created for each and
every speaker in the database. The created speaker model should be unique
and should maximally contain the speaker-specific information characterizing
the vocal tract. Several methods for speaker modeling are reported in the
literature. In text-dependent speaker recognition, the most popular methods
are dynamic time warping (DTW) (Furui 1981) and hidden Markov models
(HMM) (Tisby 1991). In text-independent speaker recognition, the most
popular methods used for speaker modeling are vector quantization (VQ)
(Poonam Bansal et al 2007), artificial neural networks (ANN)
(Yegnanarayana and Kishore 2002), support vector machines (SVM) (Gish
and Schmidt 2002) and Gaussian mixture models (GMM) (Reynolds and Rose
(1995), Xiang and Berger (2003)). These models can be classified into
parametric and non-parametric models. Parametric models like the GMM
have a particular model structure characterized by certain parameters for the
distribution of feature vectors. In non-parametric models like VQ and ANN,
no such assumptions are made.
Furui (1981) introduced the concept of dynamic time warping
(DTW) for text-dependent speaker recognition; however, it was originally
developed for speech recognition. The disadvantage of template matching is
that it becomes time consuming as the number of feature vectors increases.
For this reason, it is common to reduce the number of training feature vectors
by some modeling technique like clustering. The cluster centers are known as
code vectors, and the set of code vectors is known as a codebook. The most
well known codebook generation algorithm is the K-means algorithm (Linde
et al (1980), Gray (1984)). Soong et al (1987) used the LBG algorithm for
generating speaker-based vector quantization (VQ) codebooks for speaker
recognition; it was demonstrated that a larger codebook and larger test data
give good recognition performance. In order to model statistical variations,
the hidden Markov model (HMM) for text-dependent speaker recognition was
studied and reported (Rosenberg and Parthasarathy (1996), Matsui and Furui
(1994)).
In 1995, Reynolds proposed the Gaussian mixture model (GMM)
classifier for the speaker recognition task (Reynolds 1995). This is the most
widely used probabilistic technique in speaker recognition. In the GMM
modeling technique, the distribution of feature vectors is modeled by the
parameters mean, covariance and weight. The disadvantage of the GMM is
that it requires sufficient data to model the speaker well.
To overcome this problem, Reynolds et al (2000) introduced the GMM
universal background model (UBM) for the speaker recognition task. In this
system, speech data collected from a large number of speakers is pooled
and the UBM is trained, which acts as a speaker-independent model.
The speaker-dependent model is then created from the UBM by performing
maximum a posteriori (MAP) adaptation using speaker-specific training
speech. As a result, the GMM-UBM gives better results than the GMM
(Liu et al 2006). The disadvantage of this approach is the requirement of a
gender-balanced large speaker set for UBM training.
Campbell et al (2006) proposed a generalized linear discriminant
sequence kernel for speaker recognition and language identification tasks.
Kosaka et al (2007) proposed speaker vector-based speaker identification
with phonetic modeling using SVM. The SVM has many desirable properties,
including the ability to classify sparse data without over-training. Among the
developed techniques, state-of-the-art speaker verification systems widely use
MFCC and its derivatives as features and GMM or VQ as the modeling
technique (Jayanna and Mahadeva Prasanna 2009). In this thesis, the
traditional and successful vector quantization method is chosen for speaker
modeling and experimented with using the LBG and K-means algorithms.
5.3.4.1 Vector Quantization
A speaker recognition system must be able to estimate the
probability distributions of the computed feature vectors. Storing every single
vector generated in the training mode is impossible, since these distributions
are defined over a high-dimensional space. It is therefore easier to quantize
each feature vector to one of a relatively small number of template vectors, a
process called vector quantization. VQ is a process of taking a large set of
feature vectors and producing a smaller set of vectors that represent the
centroids of the distribution. The technique of VQ consists of extracting a
small number of representative feature vectors as an efficient means of
characterizing the speaker-specific features. A speaker-specific VQ codebook
is then generated for each known speaker by clustering his/her training
acoustic vectors. The codebook is a set of cells in a multidimensional space.
Each cell defines a small part of the total space and contains a point centered
within the cell, the centroid (Soong et al 1987). The cepstral coefficients
derived from each frame are regarded as a vector (a point) in the space and
thereby belong to one of the cells; a vector always belongs to the cell
containing the closest centroid. A two-dimensional representation of VQ is
shown in Figure 5.14. In the recognition stage, the data from the tested
speaker is compared to the codebook of each speaker and the difference is
measured. These differences are then used to make the recognition decision.
Figure 5.14 Two Dimensional Representation of VQ
5.3.4.2 Linde buzo gray (LBG) algorithm
The LBG algorithm is an iterative algorithm which alternately solves
the nearest-neighbour condition and the centroid condition. It is used for
clustering the feature vectors in the vector quantization process. After the
enrolment session, the acoustic vectors extracted from the input speech of a
speaker provide a set of training vectors. As described above, the next
important step is to build a speaker-specific VQ codebook for this speaker
using those training vectors; the well-known LBG algorithm clusters a set of
L training vectors into a set of M codebook vectors. The flowchart of the
LBG algorithm is given in Figure 5.15.
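A minimal sketch of the LBG codebook construction described above, assuming the desired codebook size M is a power of two; the splitting factor, stopping threshold and empty-cluster handling are illustrative choices:

```python
import numpy as np

def lbg_codebook(vectors, M=16, eps=0.01, split=0.01):
    """Grow a codebook from the global centroid by splitting (m = 2*m) and
    refining with the nearest-neighbour and centroid conditions until the
    relative drop in average distortion D falls below the threshold."""
    codebook = vectors.mean(axis=0, keepdims=True)         # m = 1
    while codebook.shape[0] < M:
        codebook = np.vstack([codebook * (1 + split),      # split each centroid
                              codebook * (1 - split)])
        prev_D = np.inf
        while True:
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)                      # nearest-neighbour condition
            D = d.min(axis=1).mean()                       # average distortion
            for j in range(codebook.shape[0]):             # centroid condition
                if np.any(labels == j):
                    codebook[j] = vectors[labels == j].mean(axis=0)
            if (prev_D - D) / max(D, 1e-12) < eps:         # (D' - D)/D < threshold
                break
            prev_D = D
    return codebook
```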
5.3.4.3 K-means algorithm
The K-means algorithm is another way to cluster the training vectors
into a codebook. In this algorithm the vectors are partitioned into k clusters
based on their attributes, with each cluster represented by the mean (centroid)
of the vectors assigned to it.
Figure 5.15 Flow chart of the LBG algorithm (m: current size of the codebook; M: desired size of the codebook; D: average distance; ε: threshold distance). Starting from the extracted feature vectors, the algorithm finds the initial centroid, splits each centroid (m = 2m), clusters the vectors, finds the new centroids, computes D, and iterates while (D' - D)/D ≥ ε; it stops once m reaches M.
The objective of the k-means algorithm is to minimize the total
intra-cluster variance V:

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2    (5.9)

where there are k clusters S_i, i = 1, 2, ..., k, and \mu_i is the centroid or
mean point of all the points x_j \in S_i (Furui 1989). The k-means algorithm
uses a least-squares partitioning method to divide the input vectors into k
initial sets. It then calculates the mean point, or centroid, of each set and
constructs a new partition by associating each point with the closest centroid.
The centroids are then recalculated for the new clusters, and the algorithm is
repeated until the vectors no longer switch clusters or, alternatively, the
centroids no longer change.
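For reference, the same clustering can be performed with SciPy's k-means routine; the whitening step and the codebook size of 16 below are illustrative choices:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# feature_vectors: (n_frames, n_coeffs) matrix of one speaker's training features
feature_vectors = np.random.randn(500, 13)       # placeholder training data
obs = whiten(feature_vectors)                    # scale each dimension to unit variance
codebook, distortion = kmeans(obs, 16)           # k = 16 code vectors
labels, dists = vq(obs, codebook)                # assign each vector to its cluster
print(codebook.shape, round(float(distortion), 3))
```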
5.3.5 Feature Matching
Earlier studies on speaker recognition (Atal (1976), Furui (1981)
and Campbell (1997)) used direct template matching between training and
testing data. In direct template matching, the training and testing feature
vectors are directly compared using a similarity measure such as a spectral
distance, the Euclidean distance (ED) or the Mahalanobis distance. From the
literature it is found that better results were obtained with the ED on the
text-independent ASR task; thus the ED measure is preferred here for feature
matching.
In the speaker recognition phase, an unknown speaker's voice,
represented by a sequence of feature vectors, is compared with the codebooks
in the database. For each codebook a distortion measure is computed, and the
speaker with the lowest distortion is chosen. The unknown speaker is thus
identified by measuring the distortion between the two vector sets based on
minimizing the Euclidean distance. The Euclidean distance between two
points P = (p_1, p_2, ..., p_n), representing a feature vector of the trained
speaker, and Q = (q_1, q_2, ..., q_n), representing a feature vector of the test
speaker, is defined as

d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2}    (5.10)

Thus, each feature vector in the sequence of the test speaker is compared with
all the codebooks of the trained speakers, and the codebook with the
minimum ED is declared to be the "accepted" or "recognized" speaker.
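A sketch of the distortion-based matching and decision step described above, assuming the trained codebooks are held in a dictionary keyed by speaker; the names and the threshold are illustrative:

```python
import numpy as np

def average_distortion(test_vectors, codebook):
    """Mean Euclidean distance (equation 5.10) from each test vector to its
    nearest code vector."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """Identification: return the speaker whose codebook gives minimum distortion."""
    return min(codebooks, key=lambda s: average_distortion(test_vectors, codebooks[s]))

def verify(test_vectors, codebooks, claimed_id, threshold):
    """Verification: accept the claim if the distortion against the claimed
    speaker's codebook is below the decision threshold T."""
    return average_distortion(test_vectors, codebooks[claimed_id]) < threshold
```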
5.3.6 Performance Metric
The performance of a speaker verification system is measured in
terms of the false acceptance rate (FA %) and the false rejection rate (FR %).
A false acceptance error consists of accepting an identity claim from an
impostor; a false rejection error happens when a valid identity claim is
rejected. These are mathematically represented as

F_A = (I_A / I_T) \times 100    (5.11)

F_R = (C_A / C_T) \times 100    (5.12)

E_T = F_A + F_R    (5.13)

where I_A is the number of impostors classified as true speakers, I_T is the
total number of speakers, F_A is the false acceptance rate, F_R is the false
rejection rate, C_A is the number of true speakers classified as impostors,
C_T is the total number of speakers and E_T is the total error of the
recognition system. The overall recognition accuracy (RA) is calculated by
subtracting the total error of recognition from the maximum RA of 100%
(the ideal case).
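The metrics of equations (5.11)-(5.13) reduce to a few lines; the argument names below are illustrative:

```python
def recognition_metrics(impostors_accepted, total_impostor_trials,
                        clients_rejected, total_client_trials):
    """False acceptance, false rejection, total error and recognition accuracy
    (equations 5.11-5.13)."""
    fa = 100.0 * impostors_accepted / total_impostor_trials   # F_A (%)
    fr = 100.0 * clients_rejected / total_client_trials       # F_R (%)
    total_error = fa + fr                                     # E_T
    ra = 100.0 - total_error                                  # recognition accuracy
    return fa, fr, total_error, ra
```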
5.3.7 Experimental Results and Discussion
Various experiments are conducted to analyze the performance of
the proposed ASR in controlled and uncontrolled environments. Initially, the
performance of the proposed ASR system is analyzed under the clean speech
condition. The experimental analysis is then extended to the noisy
environment using various speech enhancement algorithms in the front-end
processing stage. The speaker recognition studies are carried out using the
standard TIMIT database, consisting of samples from 630 speakers, and the
King database, consisting of samples from 370 speakers. In total, samples
from 1000 speakers are used in this work, of which 570 are male and 430 are
female. The speech files are stored in .wav format with a sampling frequency
of 16 kHz and a quantization resolution of 16 bits per sample. Since most of
the speech information is present up to 4 kHz, the samples are down-sampled
to 8 kHz before use in this work. Each speaker has contributed 5 utterances of
approximately 5 s each. Out of the 1000 speakers, 750 are randomly selected
to form a subset for the study; the selection of the subset was arbitrary.
Among the 5 utterances, 3 utterances of each speaker are used for training and
the remaining two are used for testing.
5.3.7.1 Speaker recognition system under clean condition
Various experiments are conducted to analyze the performance of
the ASR system for different feature extraction methods such as MFCC,
BFCC, MMFCC, RPLP and LPCC using vector quantization for modeling.
The extracted features are modeled using the K-means and LBG (vector
quantization) algorithms. The analysis gives a better understanding of each
feature extraction method, the speaker modeling and their performance under
clean speech conditions. A snapshot of the simulation results of the proposed
system under the clean speech condition is shown in Figure 5.16. A maximum
accuracy of 96.67% is obtained for the proposed scheme under the controlled
environment. Further recognition studies are carried out on test speech
(degraded speech) data with and without front-end processing to evaluate the
performance of the proposed method under noisy conditions (the uncontrolled
environment).
Figure 5.16 Snapshot of simulation results
5.3.7.2 Performance comparison of feature extraction techniques
The performance of the proposed ASR system is analyzed using
various feature extraction methods, namely MFCC, modified MFCC, LPCC,
RPLP, BFCC and the proposed combined one, by varying the filter bank size,
the frame length and the percentage of frame overlap. Figure 5.17 shows the
recognition accuracy while using 24 filter banks to extract both the MFCC
and MMFCC features. The filter banks are designed by placing 12 filters
below 1 kHz and 12 filters above 1 kHz. When analyzing the system
performance with 24 filter banks using the K-means algorithm as the speaker
modeling technique, the system performs well on intermediate-length frames
with a 60% shift over the frames, with an average recognition accuracy of
91%. Even though the 256- and 512-sample frames performed equally well in
extracting features, the system performance relies on the speaker modeling
module. The system performance is improved by 2% for 60% overlapped
shorter frames, because a 60% overlapped frame captures the
speaker-specific information better than 50% or 70% overlap. Therefore, a
60% overlap is preferred for feature extraction.
Figure 5.17 Performance comparison of speaker modeling techniques
for 24 filter banks
Further, the speaker verification system is analyzed for 40 filter
banks, with 24 filters below 1 kHz and 16 filters above 1 kHz. By increasing
the number of filters, the accuracy of the extracted features is improved,
which further increases the RA. From Figure 5.18, it is observed that
increasing the length of the frame loses some of the features, which makes
creating a speaker-specific model difficult. At the same time, increasing the
shift over the frame increases the redundancy, resulting in a reduction of
efficiency by 2% (for 70% overlap). By increasing the number of filter banks
to 40, the recognition accuracy is improved by 2% for the intermediate-length
frame, where the detailed features of the speech signal can be extracted. More
speaker-specific information is held by the lower frequency components than
by the higher frequency components; hence, out of the 40 filter bank outputs,
only the lower 13 outputs are taken for generating the speaker model.
Figure 5.18 Performance comparison of speaker modeling techniques
for 40 filter banks
From Figure 5.18, it is observed that the ASR performs well with a
60% shift over the segment for both LBG and K-means modeling, with
recognition accuracies of 96.67% and 95.6% respectively for an intermediate
frame size of 512. As compared to K-means, the LBG algorithm clusters the
speakers efficiently even though the database contains a large population.
From the above analysis, the proposed scheme using combined MFCC and
MMFCC features for 40 filter banks, with a 512 frame length and an overlap
of 60%, through LBG speaker modeling yielded the highest accuracy of
96.67% over the others. Hence this configuration is preferred for the further
analysis under the noisy condition.
Table 5.1 summarizes the speaker recognition accuracy (%) for the
two speaker modeling techniques with different frame sizes using the various
feature extraction techniques, for 60% frame overlap.
Table 5.1 Performance comparison of speaker recognition accuracy (%) for different feature extraction techniques

Speaker modeling  Frame   MFCC   MMFCC  LPCC   RPLP   BFCC   Combined MFCC
technique         size                                       and MMFCC
K-means           256     89.02  73.59  43.09  64.77  39.67  94.82
                  512     90.79  68.97  39.77  65.77  35.82  95.6
                  1024    83.52  72.28  30.68  46.17  23.73  86.48
                  2048    79.06  65.97  23.47  45.87  19.69  81.31
                  4096    74.15  53.98  20.78  29.78  16.26  76.22
LBG               256     89.67  75.37  47.82  67.57  40.64  95.76
                  512     92.78  74.27  40.87  65.49  37.82  96.67
                  1024    86.42  72.08  30.98  58.23  25.27  88.49
                  2048    80.86  66.89  28.97  47.39  23.34  84.06
                  4096    75.85  38.98  25.65  38.86  20.07  78.32
From Table 5.1 it is observed that the combined MFCC and
MMFCC features show higher performance than the other feature extraction
methods for both speaker modeling techniques. Relatively higher results are
obtained for the system with the combined feature extraction technique, for
both LBG and K-means speaker modeling, for 256- and 512-length frames
with 60% overlap. This indicates that the smaller frames capture the
speaker-specific information in a better manner, because this mode effectively
masks the intra-speaker variability. RAs of 95.76% and 94.82% are obtained
for the frame length 256 while using LBG and K-means respectively. For the
frame length of 512, LBG and K-means yielded maximum accuracies of
96.67% and 95.6% respectively.
Based on the results shown above, a paper was published in the
Journal of Computer Science entitled "A New Speaker Recognition System
with Combined Feature Extraction Techniques", vol. 7, issue 4, pp. 459-465,
2011.
5.3.8 Speaker Recognition in Noisy Environment
The noisy speech is created by adding noise from the NOISEX-92
database to the test utterances. The noise waveform added to the speech is
scaled to give the desired global SNR; the SNR is varied over -5 to 15 dB.
During testing, the degraded speech signal is enhanced using the individual
and proposed methods described in the earlier chapters. To illustrate the merit
of the reported TADWT and SSWPT enhancement methods, Figures
5.19 (a)-(h) show the LPC spectra of clean, degraded, and spectral
subtraction, iterative Wiener filter, Ephraim-Malah filter, bionic wavelet
transform, time adaptive discrete wavelet transform and combined spectral
subtraction with wavelet packet transform processed speech respectively.
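A sketch of the noise-scaling step used to set the desired global SNR before mixing; looping short noise recordings to the utterance length is an illustrative detail:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested global SNR."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]    # loop the noise to full length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. noisy = add_noise_at_snr(test_utterance, babble, snr_db=-5)
```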
Table 5.2 summarizes the recognition results for the various noise
types at different SNR levels, varying from -5 dB to +15 dB. In the table, the
abbreviations DEG, SS, IWF, EMF, BWT, TADWT and SSWPT refer to
degraded speech, spectral subtraction, iterative Wiener filtering,
Ephraim-Malah filtering, the bionic wavelet transform, the time adaptive
discrete wavelet transform and combined spectral subtraction with the
wavelet packet transform respectively. Avg. RA denotes the average
percentage of recognition accuracy, i.e. the average performance of the
different approaches for the corrupted speech across the noise levels.
Figure 5.19 LPC spectra of (a) clean speech, (b) degraded speech (factory noise), and speech enhanced by (c) SS, (d) IWF, (e) EMF, (f) BWT, (g) TADWT, (h) SSWPT
Table 5.2 Speaker recognition performance under noisy environment:
recognition accuracy (RA, %) at each input SNR (dB) and average RA

Babble noise                                    Street noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   4.2   13    19.5  24.6  12.4     0.5   4.2   13    19.5  24.6  12.4
SS       5.8   21    41    62    70.5  40.1     5.1   20.7  37.3  57    66.5  37.3
IWF      5.1   18.7  37.5  58.2  62.5  36.4     4.8   17.9  37.5  58.5  67.5  37.2
EMF      10.9  32    57    75.5  78.5  50.8     10.7  31.2  57    69.7  77.5  49.2
BWT      11.5  28.5  54.5  74.2  76.4  49.0     11.1  32.5  54.5  70.4  77.2  49.1
TADWT    22.5  40.9  60.5  76.5  84    56.9     21.5  39.9  59.6  76.5  83.4  56.2
SSWPT    21.9  41.5  61.3  78.5  82.5  57.1     20.3  40.1  60.4  78.5  82.8  56.4

HF channel noise                                Factory noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      1.1   3     12.5  17    26.5  12.0     0.5   5.2   12.7  20.5  24.8  12.7
SS       5.4   31    44    55.5  65    40.2     6.2   18.9  38.2  58.7  67.4  37.9
IWF      4.9   28    33.5  37    57.5  32.2     5.6   16.3  36.9  59.8  67.5  37.2
EMF      12.7  40.5  49    60.7  80.5  48.7     9.8   32.6  56.2  71.7  78.5  49.8
BWT      11.1  40    50    60.5  78    47.9     13.1  31.4  53.4  72.3  79.2  49.9
TADWT    26.9  41.5  63.5  78.6  84.1  58.9     22.6  42.5  58.6  78.7  82.9  57.1
SSWPT    27.4  43    64.1  79.7  83.6  59.6     21.9  43.2  61.4  79.2  81.6  57.5

Train noise                                     Exhibition noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.6   2.8   12.6  19    28.5  12.7     0.9   7     20.4  24.6  38.5  18.3
SS       6.6   23    56.8  65.5  78    46.0     5.7   29.8  57    66    74    46.5
IWF      5.8   19.5  43.2  58    70    39.3     6.8   24.7  54    65    72.5  44.6
EMF      12.8  24.5  57.8  71.5  78.6  49.0     7.4   32    58.7  73    79.2  50.1
BWT      11.2  24    54    68    77.9  47.0     8.1   30.4  61.2  70.4  79.5  49.9
TADWT    25.4  38.5  66.5  78.7  84    58.6     22.8  43.7  62.7  76.7  82.5  57.7
SSWPT    25.6  39.2  67.5  77    83    58.5     20.7  42.4  63    77.5  81.3  57.0

Airport noise                                   Station noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   2     8.5   14    18.8  8.8      0.8   8.5   14    17.6  39.2  16.0
SS       9     20.3  48.7  56    68.5  40.5     6.8   25.1  49    57.8  67    41.1
IWF      10.8  19    45.8  55.3  69.5  40.1     3.5   22.9  47.5  56    65.8  39.1
EMF      17.6  25    59.6  57.5  77    47.3     11.6  28.5  54.6  60.6  72    45.5
BWT      16.5  24.6  59.2  66.5  76.7  48.7     9.8   29    53    59.3  71    44.4
TADWT    21.3  47.5  61.5  73.3  83.5  57.4     20.4  42.6  63.5  73.8  83.5  56.8
SSWPT    21.9  46    60.7  71.9  82.8  56.7     21.8  41.4  64.4  75.4  84.3  57.5

Car noise                                       Restaurant noise
Method   -5    0     5     10    15    Avg.RA   -5    0     5     10    15    Avg.RA
DEG      0.5   1.5   5.8   13.6  20.5  8.4      1.6   7     15.9  20.5  24.7  13.9
SS       7.5   18.8  32.5  38.6  62    31.9     7.4   18.6  42.8  64.6  69.4  40.6
IWF      6.8   17.9  32.1  37    58.7  30.5     6     13.2  41.7  61.5  63.2  37.1
EMF      12.5  22.6  36.4  42.6  69    36.6     11    26.8  53.6  69.9  73.8  47.0
BWT      13    21.7  38.6  44    71.4  37.7     8.5   27.7  52.4  68    71.5  45.6
TADWT    20.8  45.7  59.5  72.7  82.8  56.3     21.4  44.7  64.4  74.9  84.6  58.0
SSWPT    21.1  45.4  60    73.1  83.6  56.6     24.6  43.8  63.5  72.5  84.1  57.7
Figure 5.20 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at -5 dB input
SNR
Figure 5.20 shows the performance of the speaker recognition
system with and without the speech enhancement techniques for the various
noises at -5 dB. It is observed from the analysis that the system's RA
increases by 20% to 25% when the TADWT and SSWPT SE techniques are
included as a front-end processor, with the highest average recognition
accuracies of 22.6% and 22.8% as compared to the others. From Figure 5.21,
it is observed that the system efficiency improves by an average of 18.6% and
14% while using SS and IWF respectively, whereas the proposed TADWT
and SSWPT methods give average improvements of 44.7% and 43.8%
respectively. The overall performance of TADWT and SSWPT is 13% higher
than EMF and 14% higher than BWT. The TADWT based front-end
processor provides the highest improvement in recognition accuracy (47.5%)
for the airport noise corrupted speech. Even though the TADWT based
speaker recognition system performs well under the different noisy
conditions, its performance is slightly lower than SSWPT for train (0.7%) and
street (0.2%) noise corrupted speech.
Figure 5.21 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 0 dB input
SNR
From Figure 5.22, it is inferred that the average recognition
accuracy improvement varies from 25% to 44% when using any one of the
speech enhancement techniques, as compared to the recognition accuracy for
the degraded speech. Higher average RA improvements of 79.5% for
TADWT and 80% for SSWPT are shown with reference to the degraded
speech. The lowest average recognition accuracy of 36% is shown by IWF
for the 5 dB noises, while the recognition results obtained using SS, EMF and
BWT are 42.6%, 49.7% and 48% respectively. With TADWT and SSWPT
front-end processing, the highest average recognition accuracies of 54.6%
and 55.6% are achieved respectively. Figure 5.23 shows the performance of
the speaker recognition system for the different techniques and corrupted
speech under the different noisy environments (10 dB). For the degraded
speech at the 10 dB SNR level, the system yields average recognition
accuracies of 56.3%, 50.5%, 63.6%, 62.9%, 69.1% and 69.28% while using
the SS, IWF, EMF, BWT, TADWT and SSWPT techniques respectively to
enhance the efficiency of the system under real-life noise conditions.
Figure 5.22 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 5 dB input
SNR
Figure 5.23 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 10 dB input
SNR
From Figure 5.24, it is observed that even for the noise-degraded
speech an average recognition accuracy of 27% is obtained, because of the
high input SNR. The speech enhancement techniques improve the accuracy of
the system by a minimum of 39% and a maximum of 56% compared to the
degraded speech at the 15 dB level. While using the SS, IWF, EMF, BWT,
TADWT and SSWPT techniques as the preprocessing stage to enhance the
efficiency of the system under real-life noise conditions, average recognition
accuracies of 68.7%, 63.9%, 76.8%, 75.7%, 80% and 80.2% are attained
respectively.
Figure 5.24 Performance of the proposed ASR system using different
enhancement techniques for noisy test speech at 15 dB input
SNR
From Figure 5.25, it is observed that the proposed ASR with the
reported TADWT and SSWPT shows better performance than the other SE
methods under the noisy environment. The recognition performance of the
TADWT method is higher than that of SSWPT at 0 dB (by 0.2%) and 15 dB
(by 0.5%), whereas at -5 dB, 5 dB and 10 dB SSWPT gives improvements of
0.2%, 0.6% and 0.3% over TADWT. For higher SNR values, in particular
15 dB, the results of the proposed methods are closer to those of the clean
speech condition, because at higher SNR values the noise level is very low.
Figure 5.25 Comparison of average recognition accuracy of the system
for the degraded speech at various input SNR levels
From the overall analysis at the different dB levels for the different
kinds of noise, the TADWT and SSWPT methods perform almost equally
well, with average recognition accuracies of 57.4% and 57.65% respectively.
The improvement in recognition accuracy with these methods is notable
compared with the other speech enhancement techniques on this task,
specifically for the degraded speech at the lower input SNR levels of -5 dB,
0 dB and 5 dB.
Based on the results shown above, a paper was published in the
International Journal of Information Analysis and Processing entitled
"Robust Speaker Recognition System using Combined Feature Extraction
Techniques", vol. 4, no. 1, pp. 27-33, 2011.
5.4 CONCLUSION
The performance of the reported TADWT and SSWPT SE methods
is evaluated in the speaker recognition task in the noisy environment by using
them as a front-end processor. An experimental analysis of the speaker
recognition system's accuracy is carried out by extracting different features
and by using the LBG and K-means speaker modeling techniques. Based on
the obtained results, a new speaker recognition system is developed by
combining the MFCC and MMFCC features to achieve higher recognition
accuracy. Its performance is studied with both clean speech and
noise-corrupted speech. The proposed ASR system yielded an RA of 96.67%
under the clean speech condition. To assess the performance of the suggested
ASR system in the noisy environment, synthetic degraded speech is generated
for each type of degradation at different input SNR levels. The degraded test
speech is enhanced by the SS, IWF, EMF, BWT, TADWT and SSWPT
speech enhancement methods before the feature extraction step to improve
the perceptual quality of the speech. The average recognition accuracy of the
system is improved by incorporating the SE methods, as compared to the
degraded speech. Irrespective of the noise types and SNR levels considered in
this task, the average recognition accuracy of the system is 12.8%, 40.2%,
37.4%, 47.4%, 46.9%, 57.4% and 57.65% for the degraded speech, SS, IWF,
EMF, BWT, TADWT and SSWPT respectively. The recognition results show
that the TADWT and SSWPT methods give relatively higher performance
than the other methods; however, the higher performance of SSWPT comes at
the cost of increased computational complexity.