FORENSIC SPEAKER RECOGNITION UNDER
ADVERSE CONDITIONS
Ahmed Kamil Hasan AL-ALI
MSc in Electronic and Communication Engineering
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology
2019
Keywords
Forensic Speaker Recognition, Gaussian Mixture Modeling, I-vectors, Probabilistic
Linear Discriminant Analysis, Environmental Noise, Reverberation Conditions,
DWT, Feature-Warped MFCC, Fusion of Feature Warping with DWT-MFCC and
MFCC, Speech Enhancement Algorithms, Independent Component Analysis,
Multi-Run Independent Component Analysis, Independent Component Analysis
by Entropy Bound Minimization.
Abstract
Speaker recognition is the process of identifying or verifying the identity of a
speaker by analyzing acoustic speech information. Speaker recognition has be-
come more common in recent years and can be used in many applications such
as banking security, national security, building access, and forensic applications.
Judges and law enforcement agencies have used speaker recognition in forensic
applications when investigating a suspect or confirming the judgment of guilt
or innocence. Forensic speaker recognition compares the speech samples from
criminals with speech signals from the suspect to prepare legal evidence for the
court.
Forensic speaker verification systems face numerous challenges in real-world ap-
plications. For example, the police often record speech from a suspect in a police
office, where reverberation is usually present. A criminal may use a mobile phone
in public places to commit criminal offenses. Recorded speech from covert po-
lice recordings is often corrupted by various types of environmental noise. It is
difficult to use these types of recorded speech directly in police investigations
and courts. Forensic speaker recognition performance degrades significantly in
the presence of high levels of environmental noise and reverberation conditions.
Therefore, the objective of this thesis is to improve forensic speaker verification
performance under high levels of noise and reverberation environments by using
robust feature extraction techniques and multiple channel speech enhancement
algorithms.
Three main contributions are presented in this thesis. The first contribution is to
design noisy and reverberant frameworks from the Australian Forensic Voice Com-
parison (AFVC) and QUT-NOISE databases. The aim of designing noisy and
reverberant frameworks is to use them to simulate forensic speaker
verification performance based on robust feature extraction techniques and
the independent component analysis (ICA) algorithms under real-world scenarios.
The second contribution is a detailed investigation of the robustness of fusion fea-
tures, mel frequency cepstral coefficients (MFCC) and MFCC extracted from the
discrete wavelet transform (DWT), with and without applying feature warping
to improve forensic speaker verification performance in the presence of environ-
mental noise only. The performance of the forensic speaker verification system is
evaluated using different feature extraction techniques: MFCC, feature-warped
MFCC, DWT-MFCC, DWT-MFCC feature warping, fusion of DWT-MFCC and
MFCC, and fusion of feature warping with DWT-MFCC and MFCC. This study
found that the proposed fusion of feature warping with DWT-MFCC and MFCC
improves forensic speaker recognition performance compared with other feature
extraction techniques in the presence of different types of environmental noise
over the majority of signal to noise ratios (SNRs). Experimental results also
demonstrated that the proposed fusion of feature warping with DWT-MFCC
and MFCC improves forensic speaker verification performance under
reverberation-only conditions and under noisy and reverberant environments.
The proposed fusion of feature warping with DWT-MFCC and MFCC can be used as
the feature extraction technique in forensic speaker verification based on the
ICA algorithms.
The third contribution is to develop forensic speaker verification based on the
multi-run ICA or independent component analysis by entropy bound minimiza-
tion (ICA-EBM) algorithm for improving forensic speaker verification perfor-
mance under high levels of environmental noise and reverberation conditions.
The results demonstrated that forensic speaker verification based on the multi-
run ICA or the ICA-EBM algorithm significantly improved speaker verification
performance compared with the baselines of Noise-without-Reverberation and
Reverberation-with-Noise speaker verification systems under reverberation con-
ditions and high levels of environmental noise at SNRs ranging from -10 dB to 0
dB.
Contents
Abstract i
List of Tables xiv
List of Figures xvi
Acronyms & Abbreviations xxvii
Certification of Thesis xxxi
Acknowledgments xxxii
Chapter 1 Introduction 1
1.1 Motivation and overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of the PhD . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Original contributions . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2 Overview of speaker recognition systems 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Overview of speaker verification . . . . . . . . . . . . . . . . . . . 12
2.3 Front-end processing . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Voice activity detection . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Speaker verification based on GMM . . . . . . . . . . . . . . . . 18
2.4.1 Overview of Gaussian mixture modeling . . . . . . . . . . 18
2.4.2 Universal background model . . . . . . . . . . . . . . . . . 20
2.4.3 Adaptation of UBM model . . . . . . . . . . . . . . . . . . 20
2.4.4 Speaker verification based on GMM-UBM . . . . . . . . . 21
2.5 GMM super-vector . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Channel compensation approaches . . . . . . . . . . . . . . . . . . 23
2.6.1 Feature domain approaches . . . . . . . . . . . . . . . . . 24
2.6.2 Model-domain techniques (factor analysis) . . . . . . . . . 26
2.6.2.1 Joint factor analysis . . . . . . . . . . . . . . . . 27
2.7 I-vector based speaker verification system . . . . . . . . . . . . . . 28
2.7.1 I-vector feature extraction . . . . . . . . . . . . . . . . . . 28
2.8 Standard channel compensation techniques . . . . . . . . . . . . . 30
2.9 PLDA speaker verification system . . . . . . . . . . . . . . . . . . 33
2.9.1 GPLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.2 HTPLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9.3 Length-normalized GPLDA . . . . . . . . . . . . . . . . . 35
2.10 PLDA scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.11 Performance evaluations . . . . . . . . . . . . . . . . . . . . . . . 37
2.11.1 Type of error . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.11.2 Performance metrics . . . . . . . . . . . . . . . . . . . . . 37
2.12 Speaker recognition in adverse conditions . . . . . . . . . . . . . 39
2.12.1 Feature extraction based on multiband techniques . . . . 39
2.12.2 Feature warping . . . . . . . . . . . . . . . . . . . . . . . . 44
2.12.3 Independent component analysis . . . . . . . . . . . . . . 46
2.12.4 I-vector PLDA speaker recognition . . . . . . . . . . . . . 50
2.12.5 Deep neural network . . . . . . . . . . . . . . . . . . . . . 51
2.13 Limitation of the existing techniques . . . . . . . . . . . . . . . . 53
2.14 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 3 Noisy and reverberant speech frameworks 57
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 AFVC speech database . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 QUT-NOISE database . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Construction of noisy speech signals . . . . . . . . . . . . . . . . . 61
3.4.1 Noisy speech in single channel speech enhancement . . . . 61
3.4.2 Noisy speech in multi-channel speech enhancement . . . . 61
3.5 Noisy and reverberant frameworks . . . . . . . . . . . . . . . . . 64
3.5.1 Single microphone adverse framework . . . . . . . . . . . . 65
3.5.1.1 Adding noise . . . . . . . . . . . . . . . . . . . . 65
3.5.1.2 Adding reverberation . . . . . . . . . . . . . . . . 66
3.5.1.3 Adding reverberation and noise . . . . . . . . . . 69
3.5.2 Multiple microphones adverse framework . . . . . . . . . . 70
3.5.2.1 Adding noise . . . . . . . . . . . . . . . . . . . . 71
3.5.2.2 Adding reverberation and noise . . . . . . . . . . 74
3.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 4 Forensic speech enhancement algorithms 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Independent component analysis . . . . . . . . . . . . . . . . . . . 79
4.2.1 Statistical independence . . . . . . . . . . . . . . . . . . . 81
4.2.1.1 Independence . . . . . . . . . . . . . . . . . . . . 81
4.2.1.2 Uncorrelatedness and independence . . . . . . . 82
4.2.1.3 Non-Gaussianity and independence . . . . . . . . 82
4.3 ICA assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 ICA ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.1 Magnitude and scaling ambiguity . . . . . . . . . . . . . . 86
4.4.2 Permutation ambiguity . . . . . . . . . . . . . . . . . . . . 87
4.5 Pre-processing for ICA . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.1 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Fast ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.6.1 Fast ICA for one unit . . . . . . . . . . . . . . . . . . . . . 91
4.6.2 Fast ICA for several units . . . . . . . . . . . . . . . . . . 91
4.7 Simple illustration of ICA . . . . . . . . . . . . . . . . . . . . . . 92
4.7.1 Separation of speech from street noise . . . . . . . . . . . 93
4.7.2 Illustrative example of statistical independence in ICA . . 95
4.8 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.8.1 Noisy speech signals . . . . . . . . . . . . . . . . . . . . . 99
4.8.2 Speech enhancement algorithms . . . . . . . . . . . . . . . 99
4.8.2.1 Discrete wavelet transform . . . . . . . . . . . . . 99
4.8.2.1.1 Wavelet threshold technique . . . . . . . . . . . 102
4.8.2.2 Spectral Subtraction . . . . . . . . . . . . . . . . 103
4.8.2.3 Independent component analysis . . . . . . . . . 105
4.8.3 Evaluation of performance . . . . . . . . . . . . . . . . . . 105
4.8.4 Comparison of speech enhancement algorithms . . . . . . . 106
4.9 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 5 Robust feature extraction techniques 111
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Combination of DWT and MFCC FW . . . . . . . . . . . . . . . 113
5.3 Experimental methodology . . . . . . . . . . . . . . . . . . . . . . 116
5.4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4.1 Noisy conditions . . . . . . . . . . . . . . . . . . . . . . . 117
5.4.1.1 Effect of decomposition level . . . . . . . . . . . 117
5.4.1.2 Effect of wavelet family . . . . . . . . . . . . . . 118
5.4.1.3 Comparison of feature extraction techniques . . 120
5.4.1.4 Fusion of feature warping with DWT-MFCC and
MFCC . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.1.5 Performance comparison to related work . . . . 124
5.4.1.6 Effect of utterance length . . . . . . . . . . . . . 126
5.4.2 Reverberation conditions . . . . . . . . . . . . . . . . . . 130
5.4.2.1 Effect of decomposition level . . . . . . . . . . . 130
5.4.2.2 Effect of wavelet family . . . . . . . . . . . . . . 132
5.4.2.3 Comparison of feature extraction techniques . . . 133
5.4.2.4 Effect of reverberation time . . . . . . . . . . . . 135
5.4.2.5 Effect of utterance duration . . . . . . . . . . . . 138
5.4.2.6 Effect of suspect and microphone position . . . 141
5.4.3 Noisy and reverberant conditions . . . . . . . . . . . . . . 143
5.4.3.1 Effect of decomposition level . . . . . . . . . . . . 144
5.4.3.2 Effect of wavelet family . . . . . . . . . . . . . . 145
5.4.3.3 Comparison of feature extraction techniques . . 146
5.4.3.4 Effect of utterance length . . . . . . . . . . . . . 151
5.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Chapter 6 ICA for speaker verification 157
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2 Multi-run ICA algorithm . . . . . . . . . . . . . . . . . . . . . . . 160
6.2.1 SIR computation . . . . . . . . . . . . . . . . . . . . . . . 161
6.3 ICA-EBM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.1 Multi-run ICA or ICA-EBM speech enhancement . . . . 165
6.4.2 Fusion of feature warping with DWT-MFCC and MFCC . 166
6.4.3 Length-normalized GPLDA classifier . . . . . . . . . . . . 167
6.5 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.1 Noisy and reverberant speaker verification
baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.1.1 Noise-without-Reverberation speaker verification
baseline . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.1.2 Reverberation-with-Noise speaker verification
baseline . . . . . . . . . . . . . . . . . . . . . . . 170
6.5.2 Noise-without-Reverberation conditions . . . . . . . . . . . 171
6.5.2.1 Speaker verification performance based on multi-run
ICA (SIRact) . . . . . . . . . . . . . . . . . . . . 172
6.5.2.2 Speaker verification performance based on multi-run
ICA (SIRest) . . . . . . . . . . . . . . . . . . . . 175
6.5.2.3 Effect of utterance length . . . . . . . . . . . . . 180
6.5.3 Reverberation-with-Noise conditions . . . . . . . . . . . . 181
6.5.3.1 Speaker verification performance based on ICA
algorithms . . . . . . . . . . . . . . . . . . . . . . 183
6.5.3.2 Effect of utterance length . . . . . . . . . . . . . 188
6.5.3.3 Effect of reverberation time . . . . . . . . . . . . 191
6.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Chapter 7 Conclusions and future directions 199
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.2 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . 199
7.2.1 Designing noisy and reverberant frameworks . . . . . . . . 200
7.2.2 DWT-MFCC feature warping . . . . . . . . . . . . . . . 200
7.2.3 Multi-run ICA and ICA-EBM algorithms . . . . . . . . . 202
7.3 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Bibliography 207
List of Tables
3.1 Reverberation test room parameter. . . . . . . . . . . . . . . . . 67
5.1 Summary of feature extraction labels. . . . . . . . . . . . . . . . . 115
5.2 Description of the number of features extracted corresponding to
each feature extraction label. . . . . . . . . . . . . . . . . . . . . . 115
5.3 Comparison of speaker verification performance based on mDCF
using different feature extraction techniques in the presence of re-
verberation at 0.15 sec. . . . . . . . . . . . . . . . . . . . . . . . . 135
5.4 Reverberation test room parameter. . . . . . . . . . . . . . . . . 142
6.1 Reverberation test room parameter. . . . . . . . . . . . . . . . . 171
6.2 Comparison of mDCFs for speaker verification based on multi-run ICA
(highest SIRest) and the baseline Noise-without-Reverberation speaker
verification system. . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.3 Comparison of mDCFs for speaker verification based on the ICA-EBM
algorithm and Reverberation-with-Noise speaker verification base-
line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
List of Figures
2.1 Text-independent speaker verification system. . . . . . . . . . . . 13
2.2 Block diagram of front-end processing. . . . . . . . . . . . . . . . 15
2.3 Block diagram of MFCC extraction. . . . . . . . . . . . . . . . . . 17
2.4 Speaker verification based on GMM-UBM. . . . . . . . . . . . . . 22
2.5 A block diagram of extraction GMM super-vectors. . . . . . . . . 23
2.6 A block diagram of the feature warping process. . . . . . . . . . . 26
2.7 Length-normalized GPLDA based speaker verification system. . . 30
2.8 An example of DET plot. . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Configuration of sources and microphones in instantaneous ICA
mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Position of speech and noise signals to the microphones. . . . . . . 63
3.3 Design of a single microphone adverse framework based on adding
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Position of suspect and microphones in a room. All microphones
and suspect are at 1.3 m height and the height of the room is 2.5 m. 67
3.5 Design of a single microphone adverse framework based on adding
reverberation conditions. . . . . . . . . . . . . . . . . . . . . . . 68
3.6 Design of a single microphone adverse framework based on adding
reverberation and noise conditions. . . . . . . . . . . . . . . . . . 70
3.7 Position of speech and noise signals to the microphones. . . . . . . 72
3.8 Design of a multiple microphones adverse framework based on
adding noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.9 Design of a multiple microphones adverse framework based on
adding reverberation and noise conditions. . . . . . . . . . . . . . 75
4.1 Histogram of clean speech from the AFVC database. . . . . . . . 79
4.2 Histogram of STREET noise from the QUT-NOISE database. . . 79
4.3 Mixing and unmixing processes in blind source separation. s are
the source signals, x are the observation signals, ŝ are the estimated
source signals, A is the mixing matrix and W is the unmixing matrix. 81
4.4 Original speech and street noise. . . . . . . . . . . . . . . . . . . . 93
4.5 Mixed speech with street noise. . . . . . . . . . . . . . . . . . . . 94
4.6 Estimated speech and street noise using the fast ICA algorithm. . 94
4.7 Original sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.8 Mixed source signals. . . . . . . . . . . . . . . . . . . . . . . . 96
4.9 Joint density of whitened signals obtained from whitening the
mixed sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.10 Estimation of source signals using the ICA algorithm. . . . . . . . 97
4.11 Comparison of speech enhancement algorithms. . . . . . . . . . . 98
4.12 Daubechies 8 wavelet function. . . . . . . . . . . . . . . . . . . . . 100
4.13 Block schematic of the dyadic wavelet transform. . . . . . . . . . . 101
4.14 Block schematic of the dyadic inverse discrete wavelet transform. . 102
4.15 Comparison of average SNR enhancement when STREET noise is
added to the speech signals from the AFVC database. . . . . . . 107
4.16 Comparison of average SNR enhancement when CAR noise is
added to the speech signals from the AFVC database. . . . . . . 107
4.17 Comparison of average SNR enhancement when HOME noise is
added to the speech signals from the AFVC database. . . . . . . 108
5.1 Extraction and fusion of DWT-MFCC and MFCC features with
and without feature warping (FW). . . . . . . . . . . . . . . . . . 114
5.2 Effect of the decomposition levels on the performance of fusion of
feature warping with DWT-MFCC and MFCC in the presence of
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 Average EER for fusion of feature warping with DWT-MFCC and
MFCC using different wavelet families in the presence of different
types of environmental noise. . . . . . . . . . . . . . . . . . . . . 119
5.4 Comparison of speaker verification performance using different fea-
ture extraction techniques in the presence of various types of en-
vironmental noise. . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC in the presence of
various types of environmental noise. Higher average reduction in
EER indicates better performance. . . . . . . . . . . . . . . . . . 122
5.6 mDCF for feature-warped MFCC and fusion of feature warping
with DWT-MFCC and MFCC features in the presence of various
types of environmental noise. . . . . . . . . . . . . . . . . . . . . 123
5.7 Fusion of feature warping with DWT-MFCC and MFCC. . . . . . 125
5.8 Comparison of average EER for the proposed fusion of feature
warping with DWT-MFCC features with other feature extraction
techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.9 Effect of the utterance length on the performance of feature-warped
MFCC in the presence of environmental noise. . . . . . . . . . . 127
5.10 Effect of the utterance length on the performance of fusion of fea-
ture warping with DWT-MFCC and MFCC in the presence of
various types of environmental noise. . . . . . . . . . . . . . . . 128
5.11 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC when 40 sec of the
surveillance recordings were mixed with various types of environ-
mental noise at SNRs ranging from -10 dB to 10 dB. . . . . . . . 129
5.12 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC when the duration of the surveillance recordings
increased from 10 sec to 40 sec. . . . . . . . . . . . . . . . . . . . 130
5.13 Effect of decomposition level on the performance of fusion of fea-
ture warping with DWT-MFCC and MFCC under reverberation
conditions only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.14 EER for fusion of feature warping with DWT-MFCC and MFCC
using different wavelet families in the presence of reverberation
conditions only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.15 Comparison of fusion of feature warping with DWT-MFCC and
MFCC with different feature extraction techniques in the presence
of reverberation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.16 Effect of reverberation time on the performance of feature-warped
MFCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.17 Effect of reverberation time on the performance of fusion of feature
warping with DWT-MFCC and MFCC. . . . . . . . . . . . . . . . 137
5.18 Reduction in EER for fusion of feature warping with DWT-MFCC
and MFCC over feature-warped MFCC when interview recordings
reverberated at different reverberation times. . . . . . . . . . . . . 138
5.19 Effect of changing the utterance length of the surveillance record-
ings on the performance of feature-warped MFCC under reverber-
ation conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.20 Effect of changing the surveillance utterance duration on the per-
formance of fusion of feature warping with DWT-MFCC and
MFCC under reverberation conditions. . . . . . . . . . . . . . . . 140
5.21 Position of suspect and microphones in a room. All microphones
and suspect are at 1.3 m height and the height of the room is 2.5 m. 141
5.22 Effect of the microphone and suspect position configuration on the
performance of feature-warped MFCC. . . . . . . . . . . . . . . . 143
5.23 Effect of the microphone and suspect position configuration on the
performance of fusion of feature warping with DWT-MFCC and
MFCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.24 Effect of the decomposition levels on the performance of fusion of
feature warping with DWT-MFCC and MFCC in the presence of
reverberation and various types of environmental noises. . . . . . 144
5.25 Average EER for fusion of feature warping with DWT-MFCC and
MFCC using different wavelet families in the presence of reverber-
ation and various types of environmental noises. . . . . . . . . . 146
5.26 Comparison of speaker verification performance using different fea-
ture extraction techniques in the presence of environmental noise
and reverberation conditions. . . . . . . . . . . . . . . . . . . . . 147
5.27 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over traditional MFCC in the presence of var-
ious types of environmental noise and reverberation conditions . . 148
5.28 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC in the presence of
various types of environmental noise and reverberation conditions. 150
5.29 mDCF for feature-warped MFCC and fusion of feature warping
with DWT-MFCC and MFCC features under noisy and reverber-
ant conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.30 Effect of utterance length on the performance of feature-warped
MFCC in the presence of noise and reverberation conditions. . . . 151
5.31 Effect of utterance length on the performance of fusion of feature
warping with MFCC and DWT-MFCC in the presence of noise
and reverberation conditions. . . . . . . . . . . . . . . . . . . . . 152
5.32 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC when interview
recordings reverberated at 0.15 sec and 40 seconds of the surveil-
lance recordings were corrupted by various types of noise at SNRs
ranging from -10 dB to 10 dB. . . . . . . . . . . . . . . . . . . . . 153
5.33 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC when interview recordings reverberated at 0.15
sec and the duration of the surveillance recordings changed from
10 sec to 40 sec in the presence of various types of noise at SNRs
ranging from -10 dB to 10 dB. . . . . . . . . . . . . . . . . . . . . 154
6.1 Flowchart of the multi-run ICA algorithm. . . . . . . . . . . . . . 161
6.2 Flowchart of speaker verification based on the multi-run ICA or
ICA-EBM algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.3 Flowchart of fusion of feature warping with DWT-MFCC and
MFCC techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.4 Flowchart of the Noise-without-Reverberation speaker verification
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5 Flowchart of the Reverberation-with-Noise speaker verification
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.6 Comparison of forensic speaker verification based on the multi-
run ICA (highest SIRact) algorithm with other speaker verification
techniques under Noise-without-Reverberation conditions. . . . . . 173
6.7 Comparison of forensic speaker verification based on the multi-
run ICA (highest SIRest) algorithm with other speaker verification
techniques under noisy conditions. . . . . . . . . . . . . . . . . . 176
6.8 Average reduction in EER for the multi-run ICA (highest SIRest)
algorithm over traditional ICA in the presence of various types of
environmental noise at SNRs ranging from -10 dB to 10 dB. . . . 178
6.9 Average EER reduction for the ICA-EBM algorithm compared
with the traditional ICA under different levels and types of en-
vironmental noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.10 Effect of utterance duration on the performance of speaker ver-
ification based on the multi-run ICA (highest SIRest) algorithm in the
presence of various types of environmental noise. . . . . . . . . . 181
6.11 Effect of utterance duration on speaker verification performance
based on the ICA-EBM algorithm under different levels and types
of noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.12 Experimental results for PLDA speaker verification when interview
recordings reverberated at 0.15 sec reverberation time and surveil-
lance recordings were mixed with various types of environmental
noise at SNRs ranging from -10 dB to 10 dB. . . . . . . . . . . . . 184
6.13 Average reduction in EER for multi-run ICA (highest SIRest) algo-
rithm over traditional ICA when interview recordings reverberated
at 0.15 sec and the surveillance recordings were mixed with differ-
ent types of environmental noise at SNRs ranging from -10 dB to
10 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.14 Average EER reduction for ICA-EBM algorithm compared with
the traditional ICA for different types of noise and reverberation. 187
6.15 Average reduction in EER for the ICA-EBM algorithm over the
multi-run ICA (highest SIRest) in the presence of various types of
environmental noise at SNRs ranging from -10 dB to 0 dB and
reverberation conditions. . . . . . . . . . . . . . . . . . . . . . . . 187
6.16 Effect of utterance duration on the performance of speaker veri-
fication based on multi-run ICA (highest SIRest) in the presence of
various environmental noise and reverberation conditions. . . . . . 189
6.17 Effect of utterance duration on speaker verification performance
when using the ICA-EBM under different types of noise and rever-
beration environments. . . . . . . . . . . . . . . . . . . . . . . . 190
6.18 Effect of the reverberation time on the performance of forensic
speaker verification based on multi-run ICA (highest SIRact) when
interview recordings reverberated at different reverberation times
and surveillance recordings mixed with various types of environ-
mental noises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.19 Effect of the reverberation time on the performance of forensic
speaker verification based on multi-run ICA (highest SIRest) when
interview recordings reverberated at different reverberation times
and surveillance recordings were mixed with various types of envi-
ronmental noises. . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.20 Effect of the reverberation time on speaker verification perfor-
mance when using the ICA-EBM under conditions of noise and
reverberation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
List of Abbreviations
Australian Forensic Voice Comparison AFVC
Actual signal to interference ratio SIRact
Additive white Gaussian noise AWGN
Baum-Welch BW
Cepstral mean subtraction CMS
Cepstral mean variance normalization CMVN
Computational auditory scene analysis CASA
Cumulative distribution function CDF
Daubechies db
Decision cost function DCF
Detection error trade-off DET
Deep neural network DNN
Deep recurrent neural network DRNN
Discrete cosine transform DCT
Discrete wavelet transform DWT
Efficient fast ICA EFICA
Equal error rate EER
Expectation maximization EM
Estimated signal to interference ratio SIRest
Fast Fourier transform FFT
False acceptance rate FAR
False rejection rate FRR
Feature warping FW
Finite impulse response FIR
Gaussian mixture model GMM
Gaussian probabilistic linear discriminant analysis GPLDA
Heavy-tailed probabilistic linear discriminant analysis HTPLDA
Identity vector i-vector
Independent component analysis ICA
Independent component analysis by entropy bound minimization ICA-EBM
Input SNR SNRi
Ideal ratio mask IRM
Joint factor analysis JFA
Kurtosis Kurt
Linear discriminant analysis LDA
Linear prediction cepstral coefficients LPCC
Long short-term memory LSTM
Maximum a posteriori MAP
Maximum likelihood ML
Mel frequency cepstral coefficients MFCC
Mel frequency discrete wavelet coefficients MFDWCs
Mean square error MSE
Minimum decision cost function mDCF
Microsoft research MSR
National Institute of Standards and Technology NIST
Nuisance attribute projection NAP
Non-negative matrix factorization NMF
Output signal to noise ratio SNRo
Perceptual linear prediction cepstral coefficients PLPCC
Principal component analysis PCA
Probability density function PDF
Probabilistic linear discriminant analysis PLDA
Relative spectral RASTA
Relative spectral transform-perceptual linear prediction RASTA-PLP
Short time Fourier transform STFT
Short-time spectral amplitude minimum mean square error STSA-MMSE
Speaker recognition evaluation SRE
Support vector machine SVM
Signal to noise ratio SNR
SNR enhancement SNRe
Signal to interference ratio SIR
Time-frequency T-F
Universal background model UBM
Voice activity detection VAD
Voice over Internet protocol VoIP
Within-class covariance normalization WCCN
Certification of Thesis
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my
knowledge and belief, the thesis contains no material previously published or
written by another person except where due reference is made.
Signed:
Date:
QUT Verified Signature
Acknowledgments
I would like to thank my supervisory team, Associate Professor Bouchra Senadji,
Dr Mahsa Baktashmotlagh, Dr David Dean, Dr Ganesh Naik, and adjunct Pro-
fessor Vinod Chandran for providing their support and their valuable guidance
throughout my PhD. My respect and gratitude to my principal supervisor Asso-
ciate Professor Bouchra Senadji, who supported me during my PhD journey. I
would especially like to thank my associate supervisor Dr Mahsa Baktashmotlagh
for her consistent help and support during the writing of my PhD thesis. Also
special thanks to my associate external supervisor Dr David Dean, who provided
valuable advice and research direction during my PhD journey. I would also like
to thank my associate external supervisor Professor Vinod Chandran. He has
been very helpful in providing me with valuable advice and feedback, which
has always improved the quality of my publications and thesis.
I would like to thank my associate external supervisor Dr Ganesh Naik who sup-
ported me and provided advice throughout the course of my PhD. I would also
like to thank my colleagues and friends at the Queensland University of Tech-
nology, who shared their friendship and made the laboratory enjoyable. I would
also like to thank Dr Christian Long for providing copyediting and proofreading
services according to the University-endorsed guidelines. I also wish to
gratefully thank the Ministry of Higher Education
and Scientific Research (MoHESR) in Iraq for supporting my PhD scholarship.
Finally, I would like to thank my wife, Tahreer Aljabiri, who supported me during
my PhD journey. It would have been impossible to complete my PhD journey
without her support.
Ahmed Kamil Hasan AL-ALI
2019
Chapter 1
Introduction
1.1 Motivation and overview
Automatic speaker recognition is the process of designing algorithms that recog-
nize the identity of speakers by their voices. Speaker recognition can be divided
into two applications, speaker identification and speaker verification. Speaker
identification is the process of determining the identity of an unknown speaker
from a group of known speakers, while speaker verification is the process of con-
firming or rejecting the claimed identity.
Speaker verification is more widely used than speaker identification due to its
importance in security and forensic applications [59]. For security applications,
speaker verification can be used to protect personal information. People want
companies and banks to ensure that the best possible preventative methods are
used to protect personal details and prevent fraud [59]. Speaker verification can
also be used in national security measures to combat terrorists, by determining
the location of a known terrorist by analyzing the voice from localized audio
recordings. Lawyers, judges, and law enforcement agencies have used speaker
verification in forensic applications to prepare legal evidence for court by com-
paring surveillance speech samples from the criminal with a database of interview
speech samples from the suspect [19, 89].
The performance of forensic speaker verification degrades significantly in the
presence of adverse conditions in real-world applications. The adverse condi-
tions come from various sources, such as environmental noise, reverberation, and
channel mismatch between interview and surveillance recordings. Police often
record speech from a suspect in a room where reverberation is often present. In
reverberation environments, the interview speech signal is often combined with
multiple reflected versions of the speech, caused by reflections from the
surrounding room surfaces. The presence of reverberation dis-
torts feature vectors and degrades speaker verification performance because of
the mismatched conditions between enrolment models and verification speech
signals [122]. Surveillance forensic recordings are often recorded over phone lines
with limited channel bandwidth and low quality [8, 107], and are usually recorded
using hidden microphones in public places. These forensic recordings are often
corrupted by various types of environmental noise [32], and the performance of
speaker verification systems reduces dramatically in the presence of high levels of
noise [73, 94].
This research aims to improve forensic speaker verification performance in the
presence of adverse conditions using a robust feature extraction technique and
multiple channel speech enhancement algorithms.
1.2 Scope of the PhD
The broad scope of this PhD research is to address the challenges that face forensic
speaker verification systems in the presence of adverse conditions. Modern
forensic speaker verification systems suffer from the following problems:
1. In real forensic scenarios, the interview speech signals are often recorded in a
police interview room where reverberation is usually present. The surveil-
lance speech signals from the criminal are usually corrupted by different
types and levels of environmental noise. Therefore, it is necessary to in-
vestigate a robust feature extraction technique to improve forensic speaker
verification performance in the presence of adverse conditions.
2. In real forensic situations, environmental noises are often mixed with
surveillance forensic audio recordings. It is difficult to use these record-
ings as part of the legal evidence in court because their quality is poor.
To reduce the effect of high levels of environmental noise from the noisy
speech signals and improve noisy forensic speaker verification performance,
it is necessary to apply a speech enhancement algorithm as front-end pro-
cessing for speaker verification. Forensic speaker verification performance
under different types of environmental noise could be improved by using a
speech enhancement algorithm.
3. The performance of forensic speaker verification reduces significantly in the
presence of high levels of environmental noise and reverberation conditions.
Thus, it is necessary to combine the robust feature extraction technique
and speech enhancement algorithm described in 1 and 2 to achieve high
improvement in forensic speaker verification performance in the presence of
high levels of environmental noise and reverberation conditions.
This PhD research addresses these problems and the outcome of this research can
be used in police investigations to recognize the identity of criminals by their
voices and to prepare legal evidence for the court.
1.3 Thesis structure
The remaining chapters of the thesis are organized as follows:
• Chapter 2: Overview of speaker recognition systems
This chapter provides the background of speaker verification systems.
A summary of existing front-end processing and feature extraction is
provided. Some of the most common speaker modelling methods are
presented in this chapter, including Gaussian mixture model (GMM)
based speaker verification system, universal background model (UBM),
GMM super-vector, factor analysis, and joint factor analysis (JFA).
Channel compensation techniques based on feature extraction approach
are also presented. State-of-the-art identity vector (i-vector) probabilistic
linear discriminant analysis (PLDA) speaker verification is also presented.
Some techniques to improve speaker recognition performance in adverse
conditions are described. The limitations of the existing techniques are
also identified.
• Chapter 3: Noisy and reverberant speech frameworks
This chapter presents an overview of speech and noise corpora that are used
in this thesis. The construction of noisy speech signals based on single
and multiple channels is designed to evaluate forensic speech enhancement
algorithms. The noisy and reverberant frameworks based on the single
and multiple microphones are described in this chapter for evaluating the
effectiveness of forensic speaker verification based on robust feature
extraction techniques and the ICA algorithms under real-world situations.
• Chapter 4: Forensic speech enhancement algorithms
This chapter presents independent component analysis (ICA) as a multiple
channel speech enhancement algorithm for forensic applications. It begins
with a simple example to introduce the problem of source separation and
the ICA model. Some mathematical and statistical concepts related to ICA
are presented in this chapter. The fast ICA algorithm is also discussed.
Finally, simulation results for single and multiple channel speech enhancement
algorithms are also given.
• Chapter 5: Robust feature extraction techniques
This chapter investigates the effectiveness of combining features, mel
frequency cepstral coefficients (MFCC) and MFCC extracted from the
discrete wavelet transform (DWT) of the speech signals, with and without
feature warping in the presence of various types of environmental noise
only. Subsequently, a new fusion of feature warping with DWT-MFCC and
MFCC is proposed to improve modern i-vector forensic speaker verification
performance under noisy and reverberant conditions.
• Chapter 6: ICA for speaker verification
This chapter introduces the multi-run ICA and independent component
analysis by entropy bound minimization (ICA-EBM) as multiple channel
speech enhancement algorithms. New forensic speaker verification systems
based on the multi-run ICA and ICA-EBM algorithm are developed in this
chapter to improve the performance of forensic speaker verification in the
presence of high levels of environmental noise and reverberation conditions.
• Chapter 7: Conclusions and future directions
This chapter concludes the thesis with a summary of the contributions and
makes some suggestions for future directions.
1.4 Original contributions
The research programme contributes to the field of speaker verification for real
forensic applications by addressing challenges faced by forensic speaker verifica-
tion systems in the presence of adverse conditions.
1. In Chapter 3, noisy and reverberant frameworks are designed from
the Australian Forensic Voice Comparison (AFVC) and QUT-
NOISE databases. The goal of designing noisy and reverberant frameworks
is to use these frameworks to simulate forensic speaker verification per-
formance based on robust feature extraction techniques and the ICA
algorithms under real-world situations in the presence of various types of
environmental noise and reverberation conditions. The noisy and rever-
berant frameworks based on robust feature extraction and the ICA
algorithms were presented in “Enhanced forensic speaker verification us-
ing a combination of DWT and MFCC feature warping in the presence of
noise and reverberation conditions” [5] in the IEEE Access journal, “Hybrid
DWT and MFCC feature warping for noisy forensic speaker verification in
room reverberation” [6] in the IEEE International Conference on Signal and
Image Processing Applications, “Speaker verification with multi-run ICA
based speech enhancement” [3] in the 11th International Conference on Signal
Processing and Communication Systems, and “Enhanced forensic speaker
verification using multi-run ICA in the presence of noise and reverbera-
tion conditions” [7] in the IEEE International Conference on Signal and Image
Processing Applications.
2. The effectiveness of combining features, MFCC and MFCC extracted from
the DWT of the speech with and without using feature warping is inves-
tigated (Chapter 5) to improve i-vector forensic speaker verification per-
formance in the presence of various types of environmental noise only. In
Chapter 5, a new fusion of feature warping with DWT-MFCC and MFCC
is proposed to improve the performance of forensic speaker verification sys-
tems in the presence of reverberation and noisy and reverberant conditions.
The proposed fusion of feature warping with DWT-MFCC and MFCC is
also compared with various feature extraction techniques to investigate the
robustness of the proposed fusion feature extraction technique to reverbera-
tion as well as noisy and reverberant conditions. The results demonstrated
that the proposed technique improved forensic speaker verification perfor-
mance compared with different feature extraction techniques in the presence
of various types of environmental noise and reverberant conditions. These
research outcomes were published in IEEE Access [5]. The IEEE Access
Journal has a Q1 ranking and a 2017 impact factor of 3.557 according to
Clarivate Analytics Journal Citation Reports. Some findings of this contri-
bution were also published in 2017 IEEE International Conference on Signal
and Image Processing Applications [6].
3. In Chapter 6, the effectiveness of the multi-run ICA and ICA-EBM is in-
vestigated for the first time as multiple channel speech enhancement algo-
rithms. Forensic speaker verification systems based on the multi-run ICA
and ICA-EBM algorithm are developed to improve forensic speaker verifi-
cation performance where data are often corrupted by various types of en-
vironmental noise and reverberation conditions. It was found that speaker
verification based on the multi-run ICA or ICA-EBM algorithms signifi-
cantly improved the performance under reverberation conditions and high
levels of environmental noise at SNRs ranging from -10 dB to 0 dB compared
with the two baselines (Noise-without-Reverberation and Reverberation-
with-Noise speaker verification systems). The results also demonstrated
that forensic speaker verification based on multi-run ICA or ICA-EBM al-
gorithm improved speaker verification performance compared with tradi-
tional ICA under noisy and reverberant conditions. Part of this work was
published in 2017 IEEE International Conference on Signal and Image Pro-
cessing Applications [7] and the 11th International Conference on Signal
Processing and Communication Systems [3].
1.5 Publications
Listed below are the peer-reviewed publications and under-review submissions
resulting from this thesis.
Peer-reviewed international journals
1. A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran and G. R. Naik
‘Enhanced forensic speaker verification using a combination of DWT and
MFCC feature warping in the presence of noise and reverberation condi-
tions,' IEEE Access, vol. 5, no. 99, pp. 15400-15413, 2017.
Peer-reviewed international conferences
1. A. K. H. Al-Ali, D. Dean, B. Senadji, and V. Chandran, ‘Comparison
of speech enhancement algorithms for forensic applications,’ Proceedings of
the 16th Australian International Conference on Speech Science and Tech-
nology, December 2016.
2. A. K. H. Al-Ali, B. Senadji and V. Chandran, ‘Hybrid DWT and MFCC
feature warping for noisy forensic speaker verification in room reverbera-
tion,’ IEEE International Conference on Signal and Image Processing Ap-
plications, 2017.
3. A. K. H. Al-Ali, D. Dean, B. Senadji, M. Baktashmotlagh, and V. Chan-
dran, ‘Speaker verification with multi-run ICA based speech enhancement,'
11th International Conference on Signal Processing and Communication
Systems, 2017.
4. A. K. H. Al-Ali, B. Senadji and G. R. Naik, ‘Enhanced forensic speaker
verification using multi-run ICA in the presence of environmental noise
and reverberation conditions,’ IEEE International Conference on Signal and
Image Processing Applications, 2017.
Chapter 2
Overview of speaker recognition
systems
2.1 Introduction
Speech signals convey many levels of information to the listener. Speech signals
carry linguistic information such as a message via words at the primary level, while
at other levels speech signals convey information related to the emotion, stress
level, language, gender and the identity of the speaker [18, 114, 128]. Speaker
recognition is the task of identifying a speaker by their voice. Speaker recognition
can be classified into two parts: speaker identification and speaker verification.
Speaker identification is the process of determining which voice in a group of
known voices best matches the speaker’s [111]. Speaker verification is the task of
accepting or rejecting the identity claim of a speaker by analyzing their acoustic
samples [38, 39]. Speaker verification is more common than speaker identification
and it is used widely in real applications such as security, forensics, and telephone
banking [57].
Speaker verification systems are computationally less complex than speaker iden-
tification systems. They require a comparison between only one or two models,
while speaker identification requires comparison of one model to N speaker mod-
els. Thus, improvements in speaker verification performance can be carried into
speaker identification systems [114].
Speaker verification can be divided into text-dependent and text-independent.
In text-dependent, the speaker verification system has prior knowledge about
the text to be spoken and the user is expected to cooperatively speak this text
[114, 125]. However, in a text-independent scenario, the system has no prior
knowledge about the text to be spoken and the user is not expected to be co-
operative [39, 135]. Text-dependent systems achieve high speaker verification
performance from relatively short utterances, while text-independent systems re-
quire long utterances to train reliable models and achieve good performance. The
performance of text-independent speaker verification systems is also affected by
factors such as environmental noise and reverberation conditions. An example of
text-independent speaker verification would be a forensic application with covert
recordings of speech.
2.2 Overview of speaker verification
Speaker verification systems consist of three stages: development, enrolment and
verification, as shown in Figure 2.1. In the development phase, a large number
of speech signals are used to learn the speaker-independent parameters of the
speech signals. In the enrolment phase, feature extraction is used to extract the
features from the target speaker and the target speaker model is generated by
adapting parameters from the background models. The target speaker model is
a statistical model which characterizes the speaker's vocal tract. Once a model is
[Figure 2.1: Text-independent speaker verification system, showing front-end processing followed by background training (development phase), model adaptation (enrolment phase), and scoring of the claimed speaker against the target speaker model (verification phase) to produce an accepted/rejected decision.]
created and associated with a speaker, it is used to represent that speaker during
the verification phase. In the verification phase, a model of the claimed
speaker is generated and scored against the target speaker model to determine
whether the claimed identity is accepted or rejected by the speaker verification
system, based on the decision threshold [50, 83, 112].
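As a minimal sketch of this decision step (using a log-likelihood-ratio score, as in the GMM-UBM systems described later in this chapter), the fragment below applies the accept/reject threshold. The function name and the numeric scores are illustrative assumptions, not values from the thesis.

```python
def verification_decision(log_p_target: float, log_p_background: float,
                          threshold: float) -> bool:
    """Accept the claimed identity if the log-likelihood ratio between
    the target model and the background model exceeds the threshold."""
    llr = log_p_target - log_p_background
    return llr > threshold

# Hypothetical scores for one test utterance (illustrative values only).
accepted = verification_decision(log_p_target=-41.2,
                                 log_p_background=-44.8,
                                 threshold=2.0)
print("accepted" if accepted else "rejected")
```

The threshold trades off false acceptances against false rejections, which is why the evaluation metrics in Section 2.11 are defined over a range of operating points.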
Speaker verification performance is often degraded in the presence of chan-
nel/session variability between enrolment and verification speech signals. Various
factors affect channel/session variability:
1. Channel mismatch between enrolment and verification speech signals (e.g.
using different microphones in enrolment and verification speech signals).
2. Environmental noise and reverberation conditions.
3. The differences in speaker voice (e.g. ageing, health, speaking style and
emotional state).
4. Transmission channel (e.g. landline, mobile phone, microphone and voice
over Internet protocol (VoIP)).
Channel compensation techniques can be used to reduce the effect of channel
mismatch and environmental noise. Channel compensation can be used in dif-
ferent stages of speaker verification such as feature and model domains. Various
channel compensation techniques such as cepstral mean subtraction (CMS) [37],
feature warping [106], cepstral mean variance normalization (CMVN) [139], and
relative spectral (RASTA) processing [49] were used in previous studies to reduce
the effect of channel mismatch during the feature extraction phase. In the model
domain, JFA [63] and i-vector [31] are used to combat enrolment and verification
mismatch. These techniques are briefly described in the following sections.
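As a minimal sketch of the simplest of these feature-domain techniques, CMS can be written as a one-line operation on a matrix of cepstral features (frames by coefficients). The function below is an illustrative implementation under that assumed layout, not the exact one used in the cited studies.

```python
import numpy as np

def cepstral_mean_subtraction(features: np.ndarray) -> np.ndarray:
    """Cepstral mean subtraction (CMS): remove the per-utterance mean of
    each cepstral coefficient. A stationary convolutive channel is additive
    in the cepstral domain, so subtracting the mean suppresses it.
    `features` has shape (n_frames, n_coefficients)."""
    return features - features.mean(axis=0, keepdims=True)
```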
2.3 Front-end processing
The front-end processing is used to process the speech signals and extract the
features used by speaker verification systems. The front-end processing consists
of voice activity detection (VAD), feature extraction and channel compensation
techniques, as shown in Figure 2.2 [14, 113].
2.3.1 Voice activity detection
The VAD algorithm is an important step in the front-end processing. The main
goal of the VAD is to detect the speech and non-speech segments from the speech
signal before the feature extraction stage. A robust VAD algorithm can im-
prove the performance of a speaker verification system. Therefore, it is necessary
to review the VAD algorithm to overcome the problems in designing a robust
[Figure 2.2: Block diagram of front-end processing — voice activity detection, feature extraction, and feature-domain channel compensation applied to the speech signal.]
speaker verification system.
Traditionally, VAD algorithms are designed using heuristic models such as energy
and zero-crossing rate [12, 109]. Recently, VAD algorithms have used many
features, with discrimination based on statistical models. Typically, the
statistical model uses Gaussian distributions to describe the features of the speech
and noise frames, develop the likelihood ratio from the comparison with statistical
parameters fitted in the model, and make speech/noise decisions based on the
hypothesis test [74]. Sohn et al. proposed a robust VAD algorithm based on the
statistical likelihood ratio model consisting of a single observed vector [130]. This
algorithm achieved significant improvements over the VAD specified in the ITU
standard G.729 Annex B [12] at low signal to noise ratio (SNR) [130]. The VAD
algorithm uses the Decision Directed method to estimate a priori SNR in the
speech signal. The statistical likelihood ratio is calculated using the current SNR
frame and the estimated a priori SNR. The likelihood is then compared with the
threshold determined by the distribution model to make the speech/non-speech
decision [74, 130].
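To make the heuristic end of this spectrum concrete, the sketch below is a minimal energy-based detector, not Sohn's statistical likelihood-ratio method. The frame length, sampling-rate assumption, and threshold factor are illustrative choices.

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 320,
               threshold_factor: float = 2.0) -> np.ndarray:
    """Label each frame as speech (True) or non-speech (False) by comparing
    its short-time energy against a rough noise-floor estimate.
    frame_len = 320 samples is 20 ms at an assumed 16 kHz sampling rate."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)
    # Use the quietest 10% of frames as a crude noise-floor estimate.
    noise_floor = np.mean(np.sort(energies)[: max(1, n_frames // 10)])
    return energies > threshold_factor * noise_floor
```

Such a detector degrades quickly at low SNR, which is precisely the weakness that motivated the statistical likelihood-ratio approach described above.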
2.3.2 Feature extraction
Feature extraction techniques are used to transform the speech signals into acous-
tic feature vectors, carrying the essential characteristics of the speech signal
that identify the speaker by their voice. The aim of feature extrac-
tion is to reduce the dimension of acoustic feature vectors by removing unwanted
information and emphasizing the speaker-specific information [22].
Unwanted information is any characteristic of the speech signal that is indepen-
dent of the speaker, such as environmental noise and channel transmission. The
speaker-specific information represents the speech characteristics that remain con-
sistent in the speech of a given speaker and it helps to distinguish the speech of
the speaker from that of other speakers [62]. The ideal feature extraction
technique can be described by the following points:
• Efficient representation of speaker-specific information (i.e. large inter-
speaker variability between speakers and small intra-speaker variability
within a given speaker).
� Easy to compute the features from the speech signals.
� Not affected by attempts to disguise the voice.
� Robust to noise and distortion.
� Occurs naturally and frequently in normal speech.
� Stable over time and robust to intra-speaker variability due to aging and
health.
A number of feature extraction approaches have been proposed in previous stud-
ies of speaker verification systems such as linear prediction cepstral coefficients
(LPCC) [37], perceptual linear prediction cepstral coefficients (PLPC) [48] and
MFCCs [27]. The MFCCs are commonly used as the feature extraction technique for modern speaker verification systems and provide better performance than other feature extraction techniques [41, 110].
Figure 2.3: Block diagram of MFCC extraction (frame blocking, windowing, FFT, mel-frequency warping, and cepstrum computation).
The MFCCs are extracted from acoustic signals through cepstral analysis. The
production of human speech consists of the excitation source and the vocal tract.
The goal of the cepstral analysis is to separate the excitation source from the
vocal tract, so that the speaker dependent information can be separated from the
redundant pitch information [123].
The process of extracting MFCC features is illustrated in Figure 2.3. The first stage involves pre-emphasis of the speech signal using a high-pass finite impulse response (FIR) filter; the aim of pre-emphasis is to boost the high-frequency part of the speech signal. The speech signal is then divided into frames, and each frame is multiplied by a window function (usually a Hamming window). The duration of a frame is typically between 20 msec and 30 msec [147]. The fast Fourier transform (FFT) is then used to transform the acoustic signal from the time domain into the frequency domain. A set of triangular
mel band pass filters is applied. This filtering gives a stronger weighting and
higher resolution to the lower frequencies in the signal [27, 62]. A logarithmic
compression is then applied, followed by a decorrelation with the discrete cosine
transform (DCT). The MFCCs can be represented as [62]
c_n = \sum_{m=1}^{M} \left[\log Y(m)\right] \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right],  (2.1)
where M is the number of mel filter bank channels, Y(m) is the output of the m-th channel of the filter bank, and n is the index of the cepstral coefficient. The first 10-20 cepstral
coefficients are typically extracted from each frame. In order to capture the
dynamic characteristics of the speech signals, the first and second time derivatives
of the MFCC are usually appended to each feature. The feature-domain channel
compensation approaches are the final stage in the front-end process, and will be
described briefly in Section 2.6.1.
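To make the pipeline concrete, the following is a minimal sketch of MFCC extraction with appended derivatives using the librosa library; the file name, sampling rate, and analysis parameters are illustrative assumptions rather than the configuration used in this thesis.

# A minimal sketch of MFCC extraction with appended delta features,
# assuming librosa; file name and parameter values are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)   # hypothetical utterance
y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis (high-pass FIR)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms frames
                            hop_length=int(0.010 * sr),  # 10 ms shift
                            window="hamming")
# Append first and second time derivatives to capture dynamic characteristics.
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])  # 39 x n_frames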
2.4 Speaker verification based GMM
2.4.1 Overview of Gaussian mixture modeling
Reynolds et al. [113, 115, 116] proposed the Gaussian mixture model (GMM) and
used it to model speaker verification systems. A GMM is a weighted sum of M
multivariate Gaussian components and can be described as
P(x|\lambda) = \sum_{k=1}^{M} P_k b_k(x),  (2.2)

where x is a D-dimensional feature vector and P_k is the mixture weight of the k-th Gaussian component b_k(x), given by

b_k(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\},  (2.3)
where µk and Σk are the mean vector and the covariance matrix, respectively.
The component weights must satisfy \sum_{k=1}^{M} P_k = 1. The GMM can be parameterized by the mean vectors, covariance matrices and mixture weights of all component densities, represented as \lambda = \{P_k, \mu_k, \Sigma_k\}, k = 1, 2, \cdots, M.
There are several methods to estimate the statistical parameters of the GMM
model. The most popular method is the maximum likelihood (ML) or maximum a
posteriori (MAP) estimator [116]. The ML estimator is used to find the model parameters that maximize the likelihood of the GMM given the training data. Let the training vectors be x = \{x_1, x_2, \cdots, x_T\}; the GMM likelihood can then be calculated as

P(x|\lambda) = \prod_{t=1}^{T} P(x_t|\lambda).  (2.4)
Direct maximization in the ML estimator is not possible because the likelihood is a non-linear function of the parameters \lambda. Therefore, expectation maximization (EM) is used as an iterative method to estimate the maximum likelihood parameters of the statistical model. The main idea of the EM algorithm is to estimate a new model \lambda^{new} from the current model \lambda^{old} using the training utterance features x, such that P(x|\lambda^{new}) > P(x|\lambda^{old}). The new model then becomes the initial model and the iteration is repeated until the convergence threshold is reached. The initial parameters of the GMM can be defined using the k-means algorithm often used in vector quantization [46].
In the E step, the posterior probability of acoustic class k is calculated as

P(k|x_t, \lambda^{old}) = \frac{P_k b_k(x_t)}{P(x_t|\lambda^{old})}.  (2.5)
In the M step, the statistical parameters are re-estimated to guarantee a monotonic increase in the model likelihood. These parameters are given by the following equations:

P_k = \frac{n_k}{T} = \frac{1}{T} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}),  (2.6)

\mu_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}) x_t,  (2.7)

\Sigma_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}) x_t^2 - \mu_k^2,  (2.8)

where n_k is the soft count of component k over the training utterances,

n_k = \sum_{t=1}^{T} P(k|x_t, \lambda^{old}),  (2.9)

and P(k|x_t, \lambda^{old}) is the posterior probability of the k-th component given the sample x_t.
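As an illustration of Equations (2.5)-(2.9), the following sketch implements a single EM iteration for a diagonal-covariance GMM in NumPy; the data layout and the assumption of a k-means initialization are choices of this example, not a prescription from the thesis.

# A sketch of one EM iteration for a diagonal-covariance GMM (Eqs. 2.5-2.9).
# X: (T, D) training frames; weights: (M,); means, variances: (M, D).
import numpy as np
from scipy.special import logsumexp

def em_step(X, weights, means, variances):
    T, D = X.shape
    # log N(x_t; mu_k, diag(var_k)) for every frame and component
    log_b = np.stack([
        -0.5 * np.sum(np.log(2 * np.pi * variances[k]))
        - 0.5 * np.sum((X - means[k]) ** 2 / variances[k], axis=1)
        for k in range(len(weights))], axis=1)                  # (T, M)
    # E step (Eq. 2.5): posterior probability of each component.
    log_post = np.log(weights) + log_b
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                     # P(k | x_t)
    # M step (Eqs. 2.6-2.9): re-estimate parameters from soft counts.
    nk = post.sum(axis=0)                                       # Eq. 2.9
    new_weights = nk / T                                        # Eq. 2.6
    new_means = (post.T @ X) / nk[:, None]                      # Eq. 2.7
    new_vars = (post.T @ X ** 2) / nk[:, None] - new_means ** 2 # Eq. 2.8
    return new_weights, new_means, new_vars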
2.4.2 Universal background model
The EM algorithm cannot be used directly to estimate the statistical parameters of the speaker model because, in typical speaker verification systems, the amount of data available to train an individual speaker model is often limited. Therefore, MAP adaptation is used to train the parameters of individual models from
a large background model. This approach estimates the speaker model from
the universal background model (UBM) [115]. The UBM is a high order GMM
trained from a large number of speech samples collected from speaker populations
of interest to represent the speaker-independent distribution of the features. The
EM algorithm described in the previous section can be used to estimate the
parameters of the UBM.
2.4.3 Adaptation of UBM model
The speaker model is derived by adapting the statistical parameters of the UBM
using the MAP estimator. The adaptation consists of a two step estimation
process. The first step is similar to the expectation step of the EM algorithm.
This step estimates the statistical parameters (weight, mean and covariance) of
the speaker population, which are calculated from each mixture in the UBM. In
the second step, the new statistical parameters are combined with old statistical
parameters from the UBM using the adaptation mixing coefficient. It was found by Reynolds et al. [115] that adapting the covariances of the mixture components of the UBM does not improve the performance of speaker verification, and it is more common in practice to adapt only the means of the UBM mixtures. The new MAP-adapted mean (\mu_k^{MAP}) for Gaussian mixture k is updated from the prior distribution mean (\mu_k), and can be represented as [115]
\mu_k^{MAP} = \alpha_k \mu_k^{ML} + (1 - \alpha_k)\mu_k,  (2.10)

where \mu_k^{ML} is the mean estimated from the adaptation data using the maximum likelihood estimation described in the previous section, and \alpha_k is the mean adaptation mixing coefficient, which can be calculated as

\alpha_k = \frac{n_k}{n_k + \tau_k},  (2.11)

where n_k is the component occupancy count for the adaptation data, and \tau_k is the relevance factor, with values typically ranging between 8 and 32.
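A minimal sketch of mean-only MAP adaptation (Eqs. 2.10-2.11) follows; the posterior matrix `post` is computed against the UBM exactly as in the E step above, and the relevance factor value is an illustrative choice from the 8-32 range.

# A sketch of mean-only MAP adaptation of the UBM means (Eqs. 2.10-2.11).
import numpy as np

def map_adapt_means(X, post, ubm_means, relevance=16.0):
    nk = post.sum(axis=0)                                      # occupancy counts
    ml_means = (post.T @ X) / np.maximum(nk, 1e-10)[:, None]   # ML means
    alpha = nk / (nk + relevance)                              # Eq. 2.11
    # Eq. 2.10: components with high occupancy move towards the data.
    return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * ubm_means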
2.4.4 Speaker verification based GMM-UBM
Figure 2.4 shows speaker verification based on the GMM-UBM. In the development stage, the parameters of the UBM are estimated using a large number of speakers, and they represent the speaker-independent parameters. The speaker-dependent parameters are estimated using MAP adaptation in the enrolment phase. In the verification phase, the score is computed using a likelihood ratio to verify whether or not the test speech signals belong to the claimed speaker.
The aim of speaker verification is to determine whether or not the test speech segment X belongs to the claimed speaker s. This amounts to testing the two hypotheses [115]

H0: X is from the hypothesized speaker s
H1: X is not from the hypothesized speaker s
Figure 2.4: Speaker verification based on the GMM-UBM (development phase: UBM training; enrolment phase: MAP adaptation of speaker models; verification phase: likelihood ratio scoring and decision).
The score can be calculated according to the following equation [115]
S(X) = \frac{P(X|H_0)}{P(X|H_1)} \geq \theta \Rightarrow H_0,  (2.12)

where P(X|H_i), i = 0, 1, is the likelihood of hypothesis H_i evaluated on the test speech signals X, and \theta is the decision threshold. We conclude that the test speech is uttered by the speaker s if the score is greater than or equal to the decision threshold; otherwise, the test speech is not uttered by the speaker.
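In the log domain the ratio in Equation 2.12 becomes a difference of average log-likelihoods; a minimal sketch is given below, using scikit-learn GaussianMixture models as stand-ins for the adapted speaker model and the UBM.

# A sketch of GMM-UBM verification scoring (log form of Eq. 2.12).
import numpy as np
from sklearn.mixture import GaussianMixture

def verify(X, speaker_gmm, ubm, theta=0.0):
    # score_samples gives per-frame log-likelihoods; average over frames.
    llr = np.mean(speaker_gmm.score_samples(X) - ubm.score_samples(X))
    return llr >= theta        # accept H0 if the score reaches the threshold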
2.5 GMM super-vector
Super-vector is a robust way to represent the characteristics of the speech signals
using a single vector. Figure 2.5 shows a block diagram of extracting the GMM
super-vector. The GMM super-vector is the combination of GMM mean vectors.
Figure 2.5: A block diagram of GMM super-vector extraction: the MAP-adapted individual Gaussian means (e.g. 39-dimensional means from 256 mixture components) are concatenated into a single 9984 x 1 super-vector.

The GMM super-vector can be defined as a column vector of dimension CF, containing the mean of each mixture in the speaker GMM, where F is the dimension of the feature vector used in the model and C is the total number of mixtures used to represent the GMM model. The GMM super-vector
can be used as the input feature vector for a support vector machine (SVM) [26]
and JFA [63] speaker verification systems.
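The construction itself is just a concatenation; a minimal sketch, assuming 256 components and 39-dimensional features as in Figure 2.5:

# A sketch of stacking MAP-adapted component means into a super-vector.
import numpy as np

def gmm_supervector(adapted_means):
    # adapted_means: (C, F) matrix of MAP-adapted Gaussian mean vectors.
    return adapted_means.reshape(-1, 1)        # (C*F, 1) column vector

sv = gmm_supervector(np.zeros((256, 39)))      # e.g. C = 256, F = 39
assert sv.shape == (9984, 1)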
2.6 Channel compensation approaches
A speech signal is produced by a complex process which includes laryngeal, res-
piratory, and vocal tract movements. This process leads to changes in the voice of a given speaker along dimensions such as pitch, voice quality, and loudness.
Further, the characteristics of the speech signal are also altered by various fac-
tors such as ageing, health, content, speaking-style and emotional state. Thus, a
speaker cannot produce an utterance in the exact same way twice. The variation
in the characteristics of speech signals for a given speaker is called intra-speaker
variability [42].
The characteristics of a certain speech signal can be affected by extrinsic vari-
ability. The extrinsic variability can occur in various situations, such as channel
distortion introduced by a device (telephones and microphones) and speech sig-
nals being corrupted by noise and reverberation environments. The combination
of intra-speaker and extrinsic variability is called inter-speaker variability [42].
The success of speaker recognition depends on reducing the inter-speaker variability between two utterances of a given speaker, in particular the effect of the extrinsic variability. Channel compensation techniques can be used in the feature and model domains to decrease the effect of inter-speaker variability and improve speaker verification performance. A brief description of channel compensation techniques in the feature and model domains is given in the next sections.
2.6.1 Feature domain approaches
Some of the channel compensation techniques that are widely used in speaker verification systems include:
1. CMS
Furui [37] proposed CMS to reduce the effect of channel distortion by subtracting the mean from each cepstral coefficient. It was found that cepstral mean subtraction removes some speaker-specific information along with the channel information. Thus, an alternative method called cepstral mean variance normalization was proposed by Viikki and Laurila [139].
2. RASTA
Hermansky and Morgan [49] proposed RASTA to remove very slowly chang-
ing components (convolutive noise) or very rapidly changing components
(additive noise) by using a bandpass filter bank in the feature extraction.
It was found that RASTA suppresses speaker-specific information in the
low-frequency bands.
3. Feature warping
Pelecanos and Sridharan [106] proposed the feature warping technique to decrease the effect of channel distortion and noise during the feature extraction phase. The goal of feature warping is to map the distribution of a cepstral feature stream over a specified time interval to a standard normal distribution. The warping is performed via the cumulative distribution function (CDF), as described in [143], and can be represented as a non-linear transformation T which converts the original feature vector X to a warped feature \hat{X},

\hat{X} = T(X).  (2.13)
This is performed by CDF matching, which maps a given cepstral feature so that its CDF matches the standard normal distribution. The feature warping technique treats each dimension of the MFCC features as a separate stream. The CDF matching is performed over short time intervals using a sliding window (typically three seconds), and only the central frame of the window is warped each time. The warping process executes as follows:

(a) For each dimension i = 1, 2, \cdots, D, where D is the number of feature dimensions.
(b) Rank the features in dimension i in ascending order over the given sliding window.
(c) Warp the cepstral feature value x in dimension i of the central frame to its warped value \hat{X}, which satisfies

\phi = \int_{-\infty}^{\hat{X}} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx,  (2.14)

where \phi is the corresponding CDF value. Supposing the raw cepstral feature x has rank R within the sliding window of size N, the CDF value can be estimated as

\phi = \frac{R - 1/2}{N}.  (2.15)

(d) The warped value \hat{X} can be found quickly by lookup in a standard normal CDF table.
Feature warping achieves robustness to additive noise and channel mismatch while retaining the speaker-specific information that is lost by CMS, CMVN, and RASTA processing. Thus, feature warping will be used in this thesis. Figure 2.6 shows a block diagram of the feature warping process.

Figure 2.6: A block diagram of the feature warping process: extract a single cepstral stream, perform cepstral warping over a sliding window, and append the corresponding delta coefficients.
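The following sketch implements steps (a)-(d) for a cepstral stream, assuming a 301-frame (approximately 3 s at a 10 ms frame shift) sliding window; the inverse standard normal CDF replaces the lookup table of step (d).

# A sketch of feature warping by CDF matching (Eqs. 2.14-2.15).
import numpy as np
from scipy.stats import norm

def feature_warp(features, win=301):
    # features: (n_frames, D) cepstral stream; only the central frame
    # of each sliding window is warped, one dimension at a time.
    n, D = features.shape
    half = win // 2
    warped = features.copy()
    for t in range(half, n - half):
        window = features[t - half:t + half + 1]
        for i in range(D):
            R = 1 + np.sum(window[:, i] < features[t, i])  # rank in window
            phi = (R - 0.5) / win                          # Eq. 2.15
            warped[t, i] = norm.ppf(phi)                   # inverse normal CDF
    return warped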
2.6.2 Model-domain techniques (factor analysis)
Factor analysis techniques transform the high-dimensional GMM supervectors into a low-dimensional space using a small number of hidden factors. The per-
formance of GMM-UBM based speaker modeling degrades due to the presence
of channel mismatch in the GMM supervector. Kenny et al. first introduced modeling of speaker and channel variability in the GMM space [70]. Various factor analysis approaches have since been proposed, such as joint factor analysis and the i-vector approach. This section describes these factor analysis speaker modeling approaches in detail.
2.6.2.1 Joint factor analysis
The JFA has been used to compensate for channel mismatch between training and
test speech signals [63, 66, 67, 68, 69, 142]. The main idea behind the JFA is based
on decomposing the GMM mean supervectors into speaker and channel variability
[63, 69]. The JFA is a combination of mean-only MAP [115] and eigenvoice
modeling [65]. In mean-only MAP, the GMM supervector M is decomposed into
speaker-dependent and independent components as follows,
M = s + Dz_s,  (2.16)

where s is the speaker-independent component, D is a diagonal matrix, and z_s is a vector of hidden factors with standard normal distribution.
Eigenvoice modeling is an extension of mean-only MAP adaptation, where
speaker variability is assumed to lie in a low dimensional subspace of the GMM
supervector and it can be represented by
M = s + Vy_s,  (2.17)

where s is the speaker-independent UBM mean supervector, V is a low-dimensional transformation matrix, and y_s is a vector of hidden speaker factors with standard normal distribution.
Kenny introduced the JFA by combining the mean-only MAP and eigenvoice modeling techniques, modeling the speaker and channel variability spanned by low-dimension matrices V and U in the GMM supervector space [63]. The GMM supervector in this approach is represented by

M = s + Vy_s + Ux_q + Dz_{sq},  (2.18)

where s is the speaker- and channel-independent mean of the GMM supervector, V and y_s are the speaker subspace and the speaker factors with standard normal distribution, respectively, U and x_q are the channel subspace and the channel factors with standard normal distribution, respectively, and Dz_{sq} is the residual component not captured by the speaker subspace.
2.7 I-vector based speaker verification system
Dehak et al. proposed the i-vector [31], which is widely used for text-independent speaker verification systems. In recent years, the i-vector with length-normalized Gaussian probabilistic linear discriminant analysis (GPLDA) has become the modern speaker verification system [60]. The system consists of three stages: i-vector feature extraction, channel compensation, and scoring. These stages are described in detail in the sections below. Figure 2.7 shows a length-normalized GPLDA based speaker verification system.

Figure 2.7: Length-normalized GPLDA based speaker verification system (development phase: UBM, total variability and PLDA training; enrolment phase: target i-vector extraction; verification phase: batch likelihood scoring and decision).
2.7.1 I-vector feature extraction
The i-vectors represent the speaker and session GMM super-vector using a single low-dimensional total variability space. This single variability space was inspired by the discovery that the speaker information in the channel space of JFA could be used to recognize speakers more efficiently [30, 31]. A speaker- and channel-dependent
GMM super-vector in the i-vector approach, s, can be represented as [31]
s = m + Tw, (2.19)
where m is the speaker- and channel-independent super-vector, T is the low-rank total variability matrix representing the major variability across a large number of development speech signals, and w is the i-vector, which has a standard normal distribution N(0, I). The i-vector extraction is based on calculating the zero-order, N, and centralized first-order, F, Baum-Welch (BW) statistics. For a given speech utterance, the BW statistics are computed with respect to the number of UBM components (C) and the dimension of the features (F). The i-vector can be described as [31]

w = (I + T^T \Sigma^{-1} N T)^{-1} T^T \Sigma^{-1} F,  (2.20)

where I is the identity matrix, N is the CF \times CF diagonal matrix of zero-order BW statistics, F is the CF \times 1 supervector obtained by concatenating the centralized first-order BW statistics, and \Sigma is the covariance matrix, which represents the residual variability not captured by T.
An efficient method to train a total variability matrix is described in [31, 71].
McLaren and van Leeuwen [91] investigated the effect of using pooled and concatenated total variability matrices on i-vector speaker verification systems. In the pooled approach, the speech signals from the microphone and telephone are combined and a single total variability matrix is trained on them, while in the concatenated total-variability technique two total-variability matrices are trained separately on the microphone and telephone utterances and then fused to produce a single total-variability matrix. McLaren and van Leeuwen [91] found that the pooled approach represents
i-vector speaker verification more efficiently compared with the concatenated to-
tal variability approach. Thus, the pooled total variability approach will be used
in this thesis.
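Given a trained total variability matrix, Equation 2.20 reduces to a single linear solve per utterance; a minimal sketch follows, with the layout of the statistics an assumption of this example.

# A sketch of i-vector extraction (Eq. 2.20) for one utterance.
import numpy as np

def extract_ivector(T_mat, sigma, n_stats, f_stats):
    # T_mat: (CF, R) total variability matrix; sigma: (CF,) diagonal of Sigma;
    # n_stats: (CF,) zero-order stats repeated per feature dimension;
    # f_stats: (CF,) centralized first-order Baum-Welch statistics.
    R = T_mat.shape[1]
    TtSinv = T_mat.T / sigma                    # T^T Sigma^{-1}
    A = np.eye(R) + (TtSinv * n_stats) @ T_mat  # I + T^T Sigma^{-1} N T
    b = TtSinv @ f_stats                        # T^T Sigma^{-1} F
    return np.linalg.solve(A, b)                # the i-vector w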
2.8 Standard channel compensation techniques
Channel compensation techniques are necessary to decrease the channel variabil-
ity in the i-vector based speaker verification because the i-vectors are based on
one variability space that contains speaker and session information. The chan-
nel compensation techniques are estimated based on within-speaker variability
and between-speaker variability. Within-speaker variability depends on the mi-
crophone used, transmission channel, and environmental distortion such as noise
and reverberation, while between-speaker variability depends on the character-
istics of the speakers such as speaking style, phonetic content, emotional state
and health of the speaker. The goal of the channel compensation technique is
to reduce the effect of within-speaker variability and maximize the effects of
between-speaker variability [59]. Channel compensation techniques that are used
widely for i-vector speaker verification are:
1. Nuisance attribute projection
The nuisance attribute projection (NAP) is used to compensate for session variations [131]. This technique finds an orthogonal projection in the i-vector subspace that removes the unwanted within-speaker variation. The projection is found by maximizing the following equation,

J(v) = v^T S_w v,  (2.21)

where S_w is the within-speaker variability matrix, which can be calculated as

S_w = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T,  (2.22)

where S is the total number of speakers, n_s is the number of utterances of speaker s, and the mean i-vector of the speaker, \bar{w}_s, can be calculated as

\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_i^s.  (2.23)

The NAP projection matrix is found by

P = I - VV^T,  (2.24)

where I is the identity matrix and V is obtained using an eigendecomposition of the within-speaker variability covariance matrix. Finally, the NAP-compensated i-vector can be calculated as

w_{NAP} = P^T w.  (2.25)
2. Within-class covariance normalization
The within-class covariance normalization (WCCN) was first applied to SVM-based speaker recognition to attenuate the high-variance dimensions of the within-class variability [47]. Dehak et al. [31] then used WCCN as a channel compensation technique for i-vector speaker verification. The within-class covariance matrix for i-vector speaker verification can be calculated as

W = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T.  (2.26)

The inverse of matrix W is used to weight the dimensions of the within-class matrix, and WCCN can be combined with principal component analysis (PCA) to address the problem of estimating the inverse of W [47]. The WCCN-compensated i-vector can be defined as

w_{WCCN} = B_1^T w,  (2.27)

where B_1 is obtained through the Cholesky decomposition B_1 B_1^T = W^{-1}.
3. Linear discriminant analysis
The linear discriminant analysis (LDA) seeks orthogonal axes to reduce
the dimension while retaining as much of the speaker-specific information
as possible [31]. The axes A must satisfy the requirement of maximiz-
ing between-speaker variability and minimizing within-speaker variability
through the eigenvalue decomposition of
S_b v = \lambda S_w v,  (2.28)

where \lambda is the diagonal matrix of eigenvalues, and S_b and S_w are the between-speaker and within-speaker variability matrices, respectively. These variabilities can be calculated by the following equations,

S_b = \sum_{s=1}^{S} n_s (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^T,  (2.29)

S_w = \sum_{s=1}^{S} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T,  (2.30)

where the mean i-vector for all speakers, \bar{w}, is calculated by

\bar{w} = \frac{1}{N} \sum_{s=1}^{S} \sum_{i=1}^{n_s} w_i^s,  (2.31)

where N is the total number of sessions. The LDA-compensated i-vector (w_{LDA}) can be calculated by

w_{LDA} = A^T w.  (2.32)
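A sketch of estimating LDA and WCCN transforms from development i-vectors grouped by speaker (Eqs. 2.26-2.32) is given below; the number of retained LDA axes is an illustrative choice.

# A sketch of LDA followed by WCCN for i-vector channel compensation.
import numpy as np
import scipy.linalg

def train_lda_wccn(ivecs_by_speaker, n_axes=150):
    all_iv = np.vstack(ivecs_by_speaker)
    w_bar = all_iv.mean(axis=0)                           # Eq. 2.31
    D = all_iv.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for iv in ivecs_by_speaker:                           # iv: (n_s, D)
        ws = iv.mean(axis=0)
        Sb += len(iv) * np.outer(ws - w_bar, ws - w_bar)  # Eq. 2.29
        Sw += (iv - ws).T @ (iv - ws)                     # Eq. 2.30
    # Eq. 2.28: generalized eigenproblem (Sw assumed positive definite).
    vals, vecs = scipy.linalg.eigh(Sb, Sw)
    A = vecs[:, np.argsort(vals)[::-1][:n_axes]]          # LDA axes
    # WCCN (Eq. 2.26) estimated in the LDA-projected space.
    centered = [(iv - iv.mean(axis=0)) @ A for iv in ivecs_by_speaker]
    W = sum(c.T @ c for c in centered) / len(ivecs_by_speaker)
    B1 = np.linalg.cholesky(np.linalg.inv(W))             # B1 B1^T = W^{-1}
    return A, B1  # compensated i-vector: B1^T (A^T w), Eqs. 2.27 and 2.32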
2.9 PLDA speaker verification system
Prince et al. [108] proposed PLDA for face recognition. Then, the PLDA was
adapted to the task of i-vector speaker verification systems [17, 64, 120]. Kenny
[64] introduced the Gaussian probabilistic linear discriminant analysis (GPLDA)
for i-vector speaker verification. There are two main assumptions for GPLDA.
Firstly, the statistics of speaker and channel components are independent. Sec-
ondly, the distribution of speaker and channel components is a Gaussian distribu-
tion. The main benefit of these assumptions is that the speaker likelihood ratio
can be obtained in closed-form [42]. Kenny [64] also introduced heavy-tailed prob-
abilistic linear discriminant analysis (HTPLDA) by using student’s t-distribution
instead of Gaussian distribution which is used in the GPLDA approach. There
are two main advantages of using HTPLDA as a classifier for i-vector speaker verification systems. Firstly, HTPLDA provides a better representation of outliers in the i-vector space than GPLDA [64]. Secondly, it allows for larger deviations from the mean (e.g. severe channel distortion and speaker effects). It was found that the HTPLDA approach achieved significant improvements compared with the GPLDA
approach [64]. Recently, the length-normalized i-vector technique was proposed
by Garcia-Romero et al. [43], and this technique can be used to transform the
heavy-tailed distribution to Gaussian distribution. A brief description of GPLDA,
HTPLDA and length-normalized GPLDA will be presented in the sections below.
2.9.1 GPLDA
The speaker and channel independent i-vector can be represented in the GPLDA
model as follows,
w_r = m + U_1 x_1 + U_2 y_r + \epsilon_r,  (2.33)

where r indexes the utterances of a speaker, m is the global offset, U_1 is the eigenvoice matrix, U_2 is the eigenchannel matrix, x_1 and y_r are the latent identity vectors of the speaker and channel, each with standard normal distribution, and \epsilon_r is the residual term, assumed to have a Gaussian distribution with zero mean and diagonal covariance matrix \Lambda^{-1}. The between-speaker variability in the PLDA model can be represented as m + U_1 x_1 with covariance matrix U_1 U_1^T. The within-speaker variability is represented as U_2 y_r + \epsilon_r with covariance matrix \Lambda^{-1} + U_2 U_2^T.
In this thesis, the precision matrix (Λ) is assumed to be full rank and the eigen-
channel matrix (U2) is removed from Equation 2.33. It was found that the per-
formance of GPLDA speaker verification does not improve significantly when the
eigenchannels are included in the GPLDA system, and removing them provides a
useful decrease in the computational complexity [43, 64]. The modified GPLDA
can be represented as
w_r = m + U_1 x_1 + \epsilon_r.  (2.34)
The details of the estimation model parameters of GPLDA are given in [64].
2.9.2 HTPLDA
For the HTPLDA model, the speaker factor (x_1), channel factor (y_r) and residual (\epsilon_r) are assumed to have Student's t-distributions instead of the Gaussian distributions assumed in GPLDA. The x_1, y_r and \epsilon_r can be represented as
x_1 \sim N(0, u_1^{-1} I) \text{ where } u_1 \sim G(n_1/2, n_1/2),  (2.35)

y_r \sim N(0, u_{2r}^{-1} I) \text{ where } u_{2r} \sim G(n_2/2, n_2/2),  (2.36)

\epsilon_r \sim N(0, u_r^{-1} \Lambda^{-1}) \text{ where } u_r \sim G(v/2, v/2),  (2.37)

where n_1, n_2 and v are the degrees of freedom, u_1, u_{2r} and u_r are scalar-valued hidden variables, N(\mu, \Sigma) denotes a Gaussian distribution with mean \mu and covariance matrix \Sigma, and G(a, b) denotes a Gamma distribution with parameters a and b [64].
2.9.3 Length-normalized GPLDA
Garcia-Romero et al. [43] introduced a technique to convert the i-vector behavior from heavy-tailed to Gaussian. They proposed a length-normalized GPLDA technique, and found that the length-normalized GPLDA performs similarly to HTPLDA with lower computational complexity [43]. Therefore, the length-normalized GPLDA technique will be used in this thesis. The length-normalized
i-vector follows two steps: whitening and length normalization of i-vectors. The
whitened i-vector, w_{wht}, can be calculated as

w_{wht} = d^{-1/2} U^T w,  (2.38)

where U is an orthogonal matrix containing the eigenvectors of the covariance matrix \Sigma estimated from the development i-vectors, and d is the diagonal matrix of the corresponding eigenvalues. The normalized i-vector, w_{norm}, can be computed as

w_{norm} = \frac{w_{wht}}{\|w_{wht}\|}.  (2.39)
If the i-vectors were Gaussian distributed, the distribution of the norm of the whitened i-vectors should be a Chi distribution with degrees of freedom equal to the i-vector dimension. Garcia-Romero et al. found that the i-vector norm does not match the Chi distribution, and this mismatch led to the conclusion that the i-vectors have a heavy-tailed distribution. The authors [43] found that the length-normalization technique can be used to convert the non-Gaussian i-vector distribution to a Gaussian feature distribution.
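A minimal sketch of the two steps follows, with the whitening transform estimated from development i-vectors; the mean subtraction is added here as a common practical assumption rather than part of Equation 2.38.

# A sketch of i-vector whitening and length normalization (Eqs. 2.38-2.39).
import numpy as np

def train_whitener(dev_ivecs):
    mu = dev_ivecs.mean(axis=0)
    sigma = np.cov(dev_ivecs, rowvar=False)    # development covariance Sigma
    d, U = np.linalg.eigh(sigma)               # Sigma = U diag(d) U^T
    return mu, U, d

def length_normalize(w, mu, U, d):
    w_wht = (U.T @ (w - mu)) / np.sqrt(d)      # Eq. 2.38 (mean-centered)
    return w_wht / np.linalg.norm(w_wht)       # Eq. 2.39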
2.10 PLDA scoring
GPLDA, HTPLDA and length-normalized GPLDA based i-vector scoring is cal-
culated using a batch likelihood ratio [64]. Given two i-vectors, wtarget and wtest,
the batch likelihood ratio can be computed as:
\text{score} = \ln \frac{P(w_{target}, w_{test}|H_1)}{P(w_{target}|H_0) P(w_{test}|H_0)},  (2.40)
where H1 and H0 are, respectively, the hypotheses that the i-vectors come from
the same speaker or another speaker.
2.11 Performance evaluations
The performance of the speaker verification can be measured in terms of errors.
This section discusses in detail the types of error and evaluation metrics commonly
used in speaker verification systems.
2.11.1 Type of error
Speaker verification systems are characterized by two types of errors: false acceptance and false rejection [59].
False acceptance: A false acceptance occurs when the speech segments from an impostor speaker are falsely accepted as a target speaker by the system. The false acceptance rate, FAR, can be defined as

FAR = \frac{\text{Total number of false acceptance errors}}{\text{Total number of impostor speaker attempts}}.  (2.41)
False rejection: A false rejection occurs when the target speaker is rejected by the verification system. The false rejection rate, FRR, can be defined as

FRR = \frac{\text{Total number of false rejection errors}}{\text{Total number of enrolled speaker attempts}}.  (2.42)
2.11.2 Performance metrics
The performance metrics of speaker verification systems can be measured us-
ing the equal error rate (EER) and minimum decision cost function (mDCF) [114]. These measures represent different performance characteristics of the system, although the accuracy of the measurements depends on the number of trials evaluated in order to robustly compute the relevant statistics [59]. Speaker verification performance can also be represented graphically by using the detection error trade-off (DET) plot [59]. Figure 2.8 shows an example of a DET plot.

Figure 2.8: An example of a DET plot, showing the false acceptance rate (FAR) [%] against the false rejection rate (FRR) [%] on logarithmic axes.
The EER is obtained by adjusting the threshold until the false acceptance rate and false rejection rate are equal. A lower EER indicates better system performance, because the total of the false acceptance and false rejection errors at the EER operating point decreases [93].
The decision cost function (DCF) is defined by assigning a cost to each type of error and taking into account the prior probabilities of target and impostor trials. The decision cost function can be defined as

DCF = C_{miss} P_{miss} P_{target} + C_{fa} P_{fa} P_{impostor},  (2.43)

where C_{miss} and C_{fa} are the costs of a missed detection and a false alarm, respectively, the prior probabilities of target and impostor trials are given by P_{target} and P_{impostor}, respectively, and the rates of missed targets and falsely accepted impostor trials are represented by P_{miss} and P_{fa}, respectively. The mDCF can be used to evaluate speaker verification by selecting the minimum value of the DCF as the threshold is varied:

mDCF = \min\left[C_{miss} P_{miss} P_{target} + C_{fa} P_{fa} P_{impostor}\right],  (2.44)
where Pmiss and Pfa are the miss and false alarm rates recorded from the trials,
and the other parameters are adjusted to suit the evaluation of application-specific
requirements.
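Both metrics can be computed by sweeping the decision threshold over the observed trial scores; a sketch follows, with NIST-style cost and prior values used purely as illustrative defaults.

# A sketch of computing the EER and minimum DCF (Eqs. 2.43-2.44).
import numpy as np

def eer_mindcf(target_scores, impostor_scores,
               c_miss=10.0, c_fa=1.0, p_target=0.01):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))         # FAR == FRR operating point
    eer = (p_miss[idx] + p_fa[idx]) / 2.0
    # Eq. 2.44; P_impostor assumed equal to 1 - P_target in this sketch.
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()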
2.12 Speaker recognition in adverse conditions
The algorithms of speaker recognition under adverse conditions can be divided
into five types according to the techniques used to improve speaker recognition
performance. These techniques are further explored below.
2.12.1 Feature extraction based on multiband techniques
Multiband feature extraction techniques are based on decomposing the full frequency range of the speech signals into various frequency sub bands by using the DWT.
Traditional feature extraction techniques such as MFCC, LPCC, and PLPC can
be used to extract feature vectors from each sub band of the DWT. The feature
recombination combines sub band cepstral features to obtain a single feature vec-
tor that is used to train classifier models. Various multiband feature extraction
techniques have been proposed in previous studies.
Mirghafori and Morgan [95] used a combination of multiband and full band fea-
ture extraction techniques in speech recognition under clean and reverberation
conditions. The multiband feature technique is based on decomposing the speech
signals into various frequency sub bands. Relative spectral transform-perceptual
linear prediction (RASTA-PLP) cepstral coefficients were used to extract the features from the
sub band and full band of the speech signals. Then, the features extracted from
the sub band and full band of the speech signals were combined into a single
feature vector. The results demonstrated that the proposed method reduced the word error rate under clean and reverberation conditions compared with using multiband feature techniques only. The proposed method could also be used
in speaker recognition systems. The performance of the proposed method un-
der reverberation conditions could be improved by using channel compensation
techniques such as feature warping during the feature extraction phase [58].
Tufekci and Gurbuz [134] proposed a robust speaker verification system for noisy
speech signals. This system is based on using parallel model compensation and
mel frequency discrete wavelet coefficients (MFDWCs) as the feature extraction
technique. The performance of MFDWCs was evaluated using National Institute
of Standards and Technology (NIST) and NOISEX-92 databases. The experi-
mental results demonstrated that in the presence of various types of noise at -6
dB and 0 dB, the proposed speaker verification system achieved an average improvement in equal error rate over MFCC of 26.44% and 23.73%, respectively.
Malik and Afsar [87] presented an effective feature extraction technique for
speaker recognition systems. The feature extraction is based on decomposing
the speech signals into approximation and detail coefficients by using DWT. The
MFCCs were used to extract the features from the approximation coefficients.
Vector quantization techniques were used as a classifier. Experimental results
demonstrated that the proposed feature technique achieved a recognition rate of
96.25% and 86.77% for non telephonic and telephonic speech data from PIEAS
database. Although the performance of the speaker recognition system proposed
in this research improved, it has two main gaps. Firstly, some important fea-
tures extracted from the detail coefficients, such as unvoiced speech signal, are
lost. Secondly, this research ignored the effect of extracting some features from
the full band of the speech signals to improve speaker recognition performance.
Speaker recognition performance could improve by extracting the features from
the detail coefficients and the full band of the speech signals.
Shafik et al. [123] proposed a robust feature extraction technique to enhance
speaker identification performance in the presence of additive white Gaussian
noise (AWGN) and telephone degradation with colored noise. The feature ex-
traction technique is based on combining the features of the MFCC and DWT
of the noisy speech signals into a single feature vector. To reduce the effect of
noise, a wavelet denoising technique was also used prior to feature extraction.
Experimental results show that extracting the features from the enhanced speech
signals improved speaker recognition performance in the presence of high lev-
els of AWGN compared with extracting the features without using the wavelet
denoising technique. However, the wavelet threshold technique did not achieve
improvement in the speaker recognition performance under colored noise. The
wavelet threshold technique fails to suppress the colored noise because the colored
noises are concentrated at certain frequency sub bands when the colored noise
is mixed with clean speech signals. Thresholding all wavelet coefficients in high
frequency sub bands could distort the enhanced speech signals. Thus, using a
suitable speech enhancement algorithm is necessary to improve speaker recogni-
tion performance when the speech signals are mixed with various types and levels
of colored noise.
Maged et al. [86] presented a speaker identification system in the presence of
noise. This system used DWT to decompose the noisy speech signals. The
approximation coefficients and detail coefficients were combined into a single vec-
tor. Then, the MFCCs were used to extract the features from the noisy speech
signals. The vector quantization model was used as a classifier. Experimental re-
sults demonstrated that the system improved speaker identification performance
at high SNR for 13 speakers compared to traditional MFCC features. How-
ever, speaker identification performance based on the wavelet transform does not
improve at low SNR because the system ignored the effect of using important
features extracted from the full band of the noisy speech to improve speaker
identification performance. It was found that combining the features from the
full band and sub band would improve speaker recognition performance [95].
Lei et al. [80] presented a new forensic speaker recognition system based on using
the wavelet cepstral coefficient as the feature extraction technique to train the
i-vector speaker recognition system. Cosine distance scoring was used to compare
the i-vectors of the suspect and verification speech signals. LDA and WCCN were
added to cosine distance scoring to solve the problem related to channel variabil-
ity. Experimental results showed that using wavelet cepstral coefficients under
different levels of noise significantly improved speaker recognition performance
compared with applying features extracted from the MFCC to GMM classifier.
The advantages of combining full band MFCC and multiband feature extraction
techniques for improving speaker recognition performance in the presence of noise
and reverberation conditions are:
1. DWT extracts the important features from the noisy speech signals in the
time and frequency domains. However, the features extracted from the
time domain of the noisy speech signals are lost by assuming a stationary
time frame in traditional cepstral features [9, 123]. Therefore, the DWT
adds some features to the full band features extracted from the MFCC of
the noisy speech signals, thereby assisting in improving speaker recognition
performance in the presence of noise [123].
2. Since the boundary materials used in most rooms are less absorptive at low-
frequency sub bands, natural reverberation affects low-frequency sub bands
more than high-frequency sub bands and leads to distortion of the spectral
information at low-frequency sub bands [95]. DWT can be used to extract
more features from the low-frequency sub bands. These features add some
important features to the full band of the MFCC. Thus, a combination of
full band MFCC and multiband features extracted from the reverberated
signals may achieve better recognition performance than full band MFCC
features extracted in the presence of reverberation conditions.
The disadvantages of using multiband feature extraction techniques include:
1. Speaker recognition performance is affected by the number of decomposition
levels used to extract the features from the speech signals in the presence
of noise and reverberation conditions. Speaker recognition performance
under noisy and reverberant conditions could decrease when the number of
decomposition levels in DWT increases because the number of samples at
low frequency is too small to represent the spectral characteristics of the
speech signals under these conditions [21]. Speaker recognition performance
may also be decreased by using two levels of DWT, because DWT may not
separate the noise that concatenates at high-frequency sub bands from the
speech signals. Thus, it is necessary to choose a suitable number of levels
in DWT to extract the features from the speech signals under noisy and
reverberant conditions.
2. Information about the correlation between sub band features and full band features is lost when multiband feature extraction techniques are used for feature extraction. To tackle this problem, it is important to combine the full band MFCC and the sub band MFCC features extracted from the DWT of the speech signals into a single feature vector [21, 90], as sketched below.
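As a sketch of the fusion just described, the following combines full band MFCCs with MFCCs extracted from the DWT approximation and detail coefficients; the wavelet family, the decomposition depth, and the crude frame alignment are illustrative assumptions that simplify the effective sampling rates of the sub bands.

# A simplified sketch of fusing full band MFCC with DWT sub band MFCC.
import numpy as np
import pywt
import librosa

def dwt_mfcc_fusion(y, sr, n_mfcc=13, levels=3):
    full = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # full band
    streams = [full]
    # wavedec returns [approximation, detail_L, ..., detail_1] sub bands.
    for band in pywt.wavedec(y, "db4", level=levels):
        m = librosa.feature.mfcc(y=band.astype(np.float32), sr=sr,
                                 n_mfcc=n_mfcc)
        # crude time alignment of the sub band stream to the full band
        idx = np.linspace(0, m.shape[1] - 1, full.shape[1]).astype(int)
        streams.append(m[:, idx])
    return np.vstack(streams)       # single concatenated fusion feature vector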
2.12.2 Feature warping
Feature warping has also been used as a channel compensation technique during
the feature extraction phase for improving speaker verification performance in
the presence of noise and channel mismatch. Some studies used feature warp-
ing for improving speaker verification performance under noisy and reverberant
conditions.
Pelecanos and Sridharan [106] proposed a feature warping technique that is robust to noise and channel mismatch. The feature warping technique maps the
distribution of the cepstral feature to standard normal distribution over a partic-
ular time interval. The experimental results demonstrated that feature warping
improved speaker verification performance compared with other channel compen-
sation techniques such as CMS, modulation spectrum processing, and short-term
windowed CMS and variance normalization.
Jin et al. [58] investigated two approaches to improve speaker recognition per-
formance in far-field microphone situations. The first approach introduced re-
verberation compensation and feature warping to improve speaker recognition
performance under mismatched conditions. The second approach used high-level
linguistic features. The results showed that both approaches significantly im-
proved speaker recognition performance under mismatched conditions between
enrolment and verification data.
The advantages of using feature warping for improving speaker verification per-
formance under noisy and reverberant conditions are:
1. Feature-warped MFCC is more robust to noise and reverberation conditions
compared with traditional feature extraction techniques [106, 58].
2. Feature-warped MFCC improves speaker verification performance com-
pared with applying MFCC to other channel compensation techniques such
as CMS, CMVN, RASTA, and modulation spectrum processing in the pres-
ence of noise and channel mismatch between training and testing conditions. Feature warping improves the performance of speaker verification systems compared with these techniques because it preserves the speaker information that is lost by other channel compensation techniques [106].
Although feature warping and a combination of DWT-MFCC and MFCC were
proposed in previous studies to improve speaker recognition performance under
noisy and reverberant conditions, the robustness of fusing feature warping with MFCC and DWT-MFCC features, individually or in concatenative combination, has not yet been investigated for improving the modern i-vector PLDA speaker verification framework in the presence of different types of noise, reverberation, and combined noisy and reverberant conditions.
The advantages of combining feature warping with DWT-MFCC and MFCC to
extract the features from the speech signals and improve speaker verification per-
formance in the presence of various types of environmental noise and reverberation
conditions are:
1. The feature-warped MFCC extracted from the DWT of the noisy speech
signals adds more features from the approximation and detail coefficients
to the full band feature-warped MFCC of the noisy speech signals. Thus,
the combination of feature warping with DWT-MFCC and MFCC could
improve speaker verification performance in the presence of different types
of environmental noise.
2. Since reverberation affects the low-frequency sub band of the speech signals
more than high-frequency sub bands [95], the DWT-MFCC feature warping
can be used to extract some important information from the low-frequency
sub band of the speech signals. These features add some important features
to the full band feature-warped MFCC. Therefore, the combination of fea-
ture warping with DWT-MFCC and MFCC could improve the performance
of speaker verification systems under reverberation conditions.
2.12.3 Independent component analysis
Independent component analysis is a technique for a linear transformation of the
observed signal into components that are statistically independent. The principle
of estimating independent components is based on maximizing the non-Gaussian
distribution of one independent component. This can be achieved by maximizing
the difference between the distribution of the non-Gaussian component and the
Gaussian distribution of the other components. Various contrast functions were
used to measure this difference and estimate the source signals, such as kurtosis,
negentropy and approximate negentropy [55]. The studies which used indepen-
dent component analysis as a multiple channel speech enhancement algorithm or
in speaker recognition systems under noisy conditions are discussed below.
Li et al. [84] proposed a combination of wavelet threshold technique and fast ICA
to reduce the effect of AWGN from the noisy speech signals. The clean speech
signal was corrupted by AWGN at different values of SNR. Wavelet threshold
technique was used to reduce the noise from the noisy speech signals. Then, a
fast ICA algorithm was used to separate the denoised mixed speech signals. The
results showed that the fast ICA algorithm achieved higher SNR than wavelet
denoising speech signals at different input SNR values.
Li et al. [85] used ICA as a speech enhancement algorithm. The clean speech
signal was mixed with different levels of AWGN and the fast ICA algorithm was
used to separate the clean speech from the noisy speech signals. Experimental
results demonstrated that the fast ICA algorithm significantly improved SNR
compared with the spectral subtraction technique. Multiple channel speech en-
hancement algorithms improved the quality of noisy speech signals compared
with single channel speech enhancement algorithms [118]. Therefore, multiple
channel speech enhancement based on the ICA could be used as front-end pro-
cessing in speaker recognition systems and it might improve speaker recognition
performance under different types and levels of noise.
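For reference, a minimal sketch of the separation step with scikit-learn's FastICA is given below, assuming as many microphones as sources; this is an illustration of the general technique, not the exact configuration used in the cited studies.

# A sketch of separating multi-microphone mixtures with FastICA.
from sklearn.decomposition import FastICA

def separate(mixtures):
    # mixtures: (n_samples, n_channels) observed microphone signals.
    ica = FastICA(n_components=mixtures.shape[1],
                  max_iter=1000, random_state=0)
    sources = ica.fit_transform(mixtures)   # estimated independent sources
    return sources, ica.mixing_             # estimated mixing matrix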
There is randomness in the separation of the mixed signals in the ICA algorithm, due to the random initialization of the unmixing matrix, and this randomness could decrease recognition performance. Naik [100] evaluated the use of ICA in surface
Electromyography and bio signal applications. Naik examined the randomness of
estimating unmixing matrix and proposed multi-run ICA algorithm to improve
the quality of the unmixing matrix. The multi-run ICA is based on computing
the fast ICA algorithm several times to estimate all unmixing matrices. Signal to
interference ratio (SIR) measurement was used to evaluate the quality of the un-
mixing matrices. The best of the unmixing matrices can be chosen by computing
the highest value of the SIR of the unmixing matrix, and this unmixing matrix
was used to estimate all of the sources signals. The multi-run ICA algorithm
was used in the classification of hand gestures. Experimental results showed that
a multi-run ICA algorithm achieved a higher average classification of hand ges-
tures (99%) compared with a traditional fast ICA, which achieved only average
classification (65%).
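A hedged sketch of the multi-run idea follows: FastICA is run with several random initializations and the run with the best quality score is kept. Because computing the true SIR requires reference source signals, a kurtosis-based non-Gaussianity score is used here as a stand-in for the SIR criterion of [100].

# A sketch of multi-run ICA with a proxy quality measure (not the exact
# SIR criterion of [100], which requires reference source signals).
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

def multi_run_ica(mixtures, n_runs=10):
    best_sources, best_score = None, -np.inf
    for seed in range(n_runs):
        ica = FastICA(n_components=mixtures.shape[1],
                      max_iter=1000, random_state=seed)
        sources = ica.fit_transform(mixtures)
        # non-Gaussianity of the estimates as a separation quality proxy
        score = np.mean(np.abs(kurtosis(sources, axis=0)))
        if score > best_score:
            best_sources, best_score = sources, score
    return best_sources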
Denk et al. [32] proposed a technique based on convolutive ICA and spectral
subtraction to improve the performance of speaker recognition in forensic appli-
cations. The mixture of the speech signals can be generated by combining two
or three speakers obtained by using an array of microphones. The mixed sig-
nals were corrupted by coloured noise in each microphone, with different noise
correlation levels. Convolutive ICA was used to separate the individual speech
signals. The essential point of the ICA used in this research was that the number
of microphones was greater than the number of speakers. Therefore, ICA is able
to separate individual speakers from each other and the major part of the noise.
Spectral subtraction techniques were applied to the separated speech signals to remove the noise from them. MFCC was used as the feature extraction technique for the enhanced test speech signals. Finally, a GMM was used to deter-
mine the similarity between enrolment and verification speech signals. The max-
imum likelihood was used to determine the identity of the speaker. Experimental
results showed that this technique is able to detect all speakers simultaneously
and achieves a higher success rate in identifying the speakers compared with the
method used in [127].
Ribas et al. [117] proposed a full multicondition training technique to reduce
the effect of noise and reverberation conditions on an i-vector PLDA speaker
verification system. Full multicondition is based on pooling clean, noisy and re-
verberated speech signals in the development, enrolment and verification phases
of speaker verification systems. Speech enhancement based on the flexible audio
source separation was used to decrease the effect of noise from the enrolment and
verification speech signals. Experimental results showed that using the speech en-
hancement algorithm improved the EER compared to the i-vector PLDA speaker
verification without using speech enhancement. The performance of the i-vector
speaker verification system under noisy and reverberant environments could be
improved by proposing a robust feature extraction to these conditions instead of
using traditional MFCC which is not robust to environmental noise and rever-
beration conditions.
Li and Adali [82] proposed independent component analysis by entropy bound
minimization (ICA-EBM) algorithm. This algorithm is based on computing sev-
eral entropy bounds from four contrast measuring functions (two even and two
odd functions) and using the tightest entropy as the final entropy estimate, rather
than using a single entropy bound in traditional parametric ICA approaches. The
main advantage of using four contrast functions in this algorithm is that the ICA-EBM can approximate the entropy of a wide range of distributions, such as sub- or super-
Gaussian, unimodal, symmetric, or skewed. The ICA-EBM algorithm was used
to separate the mixed speech signals under clean conditions and the algorithm
improved the separation performance compared with other traditional ICA algo-
rithms due to its superior convergence properties and high flexibility of density
matching.
The advantages of using the ICA algorithm in speaker verification systems are:
1. The ICA algorithm can be used to separate individual speech from the
mixed speech signals and it improves speaker recognition performance [32].
2. The ICA algorithm can be used as a multiple channel speech enhancement
algorithm to separate the speech from the noisy speech signals. The perfor-
mance of speech enhancement based on the ICA algorithm improves com-
pared with single channel speech enhancement algorithms [84, 85]. Thus,
the enhanced speech signals from the ICA algorithm could be used to im-
prove forensic speaker verification performance under noisy conditions.
3. The ICA-EBM algorithm can be used to separate sources that come from different distributions, and it separates mixed speech signals under clean conditions more efficiently than traditional ICA algorithms due to its superior separation performance and convergence properties [82]. Thus, the ICA-EBM algorithm could be used as a speech enhancement algorithm which might improve speaker verification performance more efficiently than the fast ICA algorithm.
2.12.4 I-vector PLDA speaker recognition
The i-vector PLDA technique has been widely used for text-independent speaker
recognition systems. Various studies used the state-of-the art i-vector PLDA to
evaluate speaker recognition systems under noisy and reverberant conditions.
Mandasari et al. [89] studied the effect of car and babble noise on the performance
of a traditional GMM UBM (dot scoring), i-vector with cosine distance scoring
and i-vector PLDA speaker verification systems. A Wiener filter was used as front-
end to decrease the effect of noise and improve speaker verification performance.
Experimental results demonstrated that the modern i-vector PLDA improved
speaker verification performance compared with the dot scoring approach. The
i-vector PLDA was also found to offer more robust performance for mixed car
rather than babble noise. The applications of the Wiener filter achieved a small
improvement in speaker verification performance. It is believed that some single
channel speech enhancement algorithms may distort the enhanced speech signals
[117]. However, a multiple channel speech enhancement algorithm improved the
speech quality compared with single channel speech enhancement algorithms.
Thus, using multiple channel speech enhancement algorithms could improve noisy
i-vector speaker verification performance.
Dean et al. [28] designed the QUT-NOISE-SRE protocol for speaker recognition
evaluation (SRE) under noisy conditions. The enrolment and verification speech
signals from the NIST database were mixed with different random sessions of
STREET, CAR, CAFE, REVERB, and HOME noises from the QUT-NOISE
database [29] at SNRs ranging from -10 dB to 15 dB. The protocol of QUT-
NOISE-SRE was used to evaluate the modern PLDA i-vector speaker recognition
system, demonstrating the importance of designing a speaker recognition system
focused on the VAD techniques. The performance of the QUT-NOISE-SRE protocol could be improved by using a multiple channel speech enhancement algorithm to
reduce the effect of environmental noise and improve speaker verification perfor-
mance under noisy conditions.
2.12.5 Deep neural network
In recent years, deep neural networks (DNN) have also been widely used in speaker verification systems. Some studies have used DNN for speaker verification under noisy
and reverberant conditions.
Zhao et al. [146] proposed a robust speaker recognition system in the presence
of noise and reverberation conditions. The proposed system reduced the effect
of reverberation by training the speaker model in reverberation conditions. The
background noise was removed by using the computational auditory scene analysis
(CASA) technique, which is based on binary time-frequency (T-F) masking to
separate speech from the noise. The binary masking can be performed using DNN.
Experimental results showed that in the presence of a wide range of reverberation
and SNRs, the proposed algorithm improved speaker recognition performance
over related systems.
Du et al. [34] used a DNN as a front-end to predict clean speech features and improve noisy speaker verification systems. The DNN was used to enhance speech features before applying the i-vector PLDA speaker verification system. The DNN was trained using parallel features extracted from the UBM training speech signals under clean and noisy conditions, aligned at the frame level, by minimizing the mean square error (MSE). The proposed system was evaluated using the NIST SRE 2010 and NOISEX-92 databases. Experimental results showed that DNN-based feature compensation improved the EER compared with uncompensated features at SNRs ranging from 0 dB to 20 dB.
Zolbaek et al. [75] used a modern deep recurrent neural network (DRNN) as a speech enhancement technique to improve text-dependent speaker verification performance under noisy conditions. The proposed method uses a long short-term memory (LSTM) based DRNN as the speech enhancement front-end of an i-vector speaker verification system. The performance of the proposed method was compared with non-negative matrix factorization (NMF) and short-time spectral amplitude minimum mean square error (STSA-MMSE) techniques. Experimental results on the RSR2015 speech database demonstrated that male-speaker, text-independent DRNN-based speech enhancement, applied without prior knowledge of the noise type, achieved large performance improvements compared with NMF and STSA-MMSE at different SNR values.
Chang and Wang [20] proposed a front-end speech separation technique for an
i-vector speaker recognition system to deal with the speech signals mixed with
background noise. The DNN was used to estimate the ideal ratio mask (IRM).
The separated speech was used to extract the enhanced features for the GMM/i-
vector and DNN/i-vector in speaker identification and speaker verification sys-
tems. The proposed algorithm was compared with the multi-condition trained
baseline and a traditional GMM-UBM i-vector system. The results showed that
the performance of using speech separation achieved an average improvement of
8% in identification accuracy and 1.2% in EER.
Although DNN-based speaker recognition performance improved compared with the UBM/i-vector framework, the DNN requires a large amount of data (1300 hours) with proper transcription to achieve a significant improvement in speaker recognition performance [81]. In real forensic applications, it is hard to collect large amounts of speech with proper transcriptions for training a DNN. It was found in [145] that a DNN-based i-vector did not show much improvement in forensic speaker verification performance in the presence of various types of environmental noise compared with a UBM-based i-vector framework when a limited amount of data was used to train the DNN. Therefore, the DNN framework will not be used in this thesis to improve forensic speaker verification performance.
2.13 Limitation of the existing techniques
We identify two gaps from the previous studies of speaker recognition in adverse
conditions and propose research directions to address them:
1. The effectiveness of combining feature warping with DWT-MFCC and
MFCC features individually or the concatenative fusion of these features
has not been investigated yet for state-of-the-art i-vector forensic speaker
verification in the presence of environmental noise only, reverberation, and
noisy and reverberant conditions. The research question in this thesis seeks
to answer whether fusion of feature warping with MFCC and DWT-MFCC
can improve modern i-vector forensic speaker verification performance un-
der different levels and types of environmental noise and reverberation con-
ditions.
2. A multi-run ICA or ICA-EBM algorithm has not previously been applied to separate speech from noisy speech signals in the presence of high levels of environmental noise and reverberation conditions, although multi-run ICA has been applied to biological signals and the ICA-EBM algorithm has been used to separate mixed speech signals under clean conditions. The multi-run ICA or ICA-EBM algorithm can be applied to noisy speech signals to reduce the effect of environmental noise. The enhanced speech signals could then be used to improve forensic i-vector PLDA speaker verification performance under noisy conditions.
2.14 Chapter summary
This chapter presented an overview of speaker verification systems, and also de-
scribed the GMM UBM-based speaker verification systems. As a mismatch be-
tween enrolment and verification speech signals has significantly affected speaker
verification performance, several channel compensation techniques in feature and
model domains to compensate channel variation and additive noise were de-
scribed. This chapter also presented i-vector feature extraction techniques, and
standard channel compensation techniques, including NAP, WCCN, and LDA.
Subsequently, GPLDA, HTPLDA and the modern length-normalized GPLDA
techniques were also described in this chapter. The review of speaker recognition in adverse conditions identified five techniques to improve speaker recognition performance in the presence of noise and reverberation conditions: feature extraction based on multiband techniques, feature warping, the ICA algorithm, the i-vector PLDA speaker recognition system, and the deep neural network. Each technique has its
strengths and limitations, particularly in terms of the purpose of my research,
which is to improve the performance of speaker recognition systems in the pres-
ence of high levels of noise and reverberation conditions in forensic applications.
This research identifies two gaps and proposes research directions that will be
described in detail in Chapters 5 and 6:
1. Combining feature warping with MFCC and DWT-MFCC features to im-
prove forensic speaker verification performance in the presence of various
types of environmental noise, reverberation, as well as noisy and reverberant
conditions.
2. Introducing the multi-run ICA or ICA-EBM algorithm as front-end pro-
cessing of speech enhancement to separate the noise from the noisy speech
signals and improve forensic speaker verification performance in the pres-
ence of high levels of environmental noise and reverberant conditions.
Chapter 3
Noisy and reverberant speech
frameworks
3.1 Introduction
This chapter gives an overview of the AFVC and QUT-NOISE databases that
are used in the evaluation of speech enhancement algorithms and noisy speaker
verification systems under real forensic applications. The forensic audio record-
ings available from the AFVC database cannot be used to evaluate speech en-
hancement algorithms under real forensic scenarios because the AFVC database
contains clean speech signals only. Thus, the construction of noisy speech signals based on single and multiple microphones is designed in this chapter to simulate speech enhancement algorithms under real forensic scenarios. The
noisy speech signals are obtained by mixing clean speech signals from the AFVC
database with different levels and types of environmental noise from the QUT-
NOISE database using single and multiple microphones.
In most real forensic situations, the interview recordings from a suspect are often
recorded in a police interview office where reverberation is present. The surveil-
lance recordings from the criminal are usually recorded in an outdoor environment
with a single or multiple microphones in the presence of various types of envi-
ronmental noise. In order to simulate forensic speaker verification performance
under real-world situations, the noisy and reverberant frameworks based on the
single and multiple microphones are designed in this chapter.
The interview and surveillance recordings are similar to enrolment and verification
data in traditional speaker recognition systems: the interview recordings are treated as the enrolment because the suspect's identity has been confirmed, while the surveillance recordings are treated as the verification because the identity of the speaker is not known. Throughout the remainder of this thesis, the recordings used for
experiments will be referred to as interview/surveillance to avoid confusion caused
by the more typical enrolment/verification terminology which is not perfectly
applicable in this scenario.
This chapter is divided into several sections. The AFVC speech and QUT-NOISE
databases are presented in Sections 3.2 and 3.3, respectively. Section 3.4 describes
the construction of noisy speech signals. Noisy and reverberant frameworks are
presented in Section 3.5.
3.2 AFVC speech database
The AFVC database [99] contains 552 speakers recorded in three speaking
styles: informal telephone conversation, information exchange over telephone,
and pseudo-police. The telephone channel was used to record informal telephone
conversations and information exchange over the telephone. The microphone
channel was used to record the pseudo-police style. The sampling frequency of
the clean speech signals was 44.1 kHz with 16 bit/sample resolution [98]. This
database was designed for speaker recognition in forensic applications and was collected by the Forensic Voice Comparison Laboratory at the University of New South Wales in Australia. The AFVC database will be used in this thesis be-
cause it contains various speaking style recordings for each speaker, which are
usually found in forensic scenarios [98].
3.3 QUT-NOISE database
The QUT-NOISE database consists of 20 noise sessions [29]. Each session has at
least a 30-minute duration. QUT-NOISE was recorded in five popular noise sce-
narios (CAFE, HOME, CAR, STREET and REVERB). The sampling frequency
of the noise was 48 kHz with 16 bit/sample resolution. A brief description of each
noise scenario is given below:
1. CAFE
The CAFE noise was recorded in an outdoor cafe and indoors in a cafe
food court. The CAFE noise was recorded in the presence of high levels of
babble speech and kitchen noise from cafe environments.
2. HOME
This noise was recorded in two home locations: the kitchen and the living room. The kitchen recordings consist of silence interrupted by typical kitchen sounds. The living room noise was recorded in the presence of children singing and playing alongside a television.
3. STREET
The STREET noise was recorded in two locations: an inner-city and outer-
city. The inner-city recordings consist of pedestrian traffic and bird noise.
The outer-city recordings consist mainly of cycles of traffic noise and traffic
light changes.
4. CAR
The CAR noise was recorded in driving window-down and window-up condi-
tions. These recordings consist of car-interior noise such as bag-movement,
keys, and indicator, as well as the characteristics of the wind for car window-
down.
5. REVERB
The REVERB noise was recorded in two locations: an enclosed indoor
pool and a partially enclosed car park. The indoor pool is characterized by
splashing and running noise, while the car park environment is characterized
by nearby road noise.
For real forensic scenarios, the clean speech signals from existing speech databases
are often mixed with environmental noise at certain noise levels [5]. The limited duration of existing noise databases (typically less than five minutes), such as NOISEX-92 [138], freesound.org [35], and AURORA-2 [51], limits the ability to evaluate speaker verification systems over a wide range of environmental noise conditions in forensic situations. In contrast, the duration of each session in the QUT-NOISE database is at least 30 minutes [28]. The clean speech signals can be mixed with random sessions of environmental noise from the QUT-NOISE database to achieve a closer approximation to real forensic situations [5]. Therefore, the QUT-NOISE database will be used to evaluate forensic speaker verification under noisy conditions in this thesis.
3.4 Construction of noisy speech signals
This section describes the construction of noisy speech signals in single-channel
speech enhancement algorithms (wavelet threshold techniques and spectral sub-
traction) and in the multi-channel speech enhancement algorithm (ICA).
3.4.1 Noisy speech in single channel speech enhancement
In order to reduce the computational complexity of speech enhancement algo-
rithms, we used one sentence from 100 speakers of the AFVC database [99] to
evaluate the performance of single channel speech enhancement algorithms. CAR,
STREET and HOME noises from the QUT-NOISE database [29] were used in
the construction of noisy speech signals because these noises were more likely to occur in forensic applications. The CAR, STREET and HOME noises from the QUT-NOISE corpus were down-sampled from 48 kHz to 44.1 kHz before mixing with the clean speech signals from the AFVC database to match the sampling frequency of the clean speech. The noisy speech signals were obtained by sample summing of the clean speech signals and the scaled noise at signal-to-noise ratios (SNRs) ranging from -10 dB to 10 dB.
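As an illustration of this construction, the following minimal Python sketch (an assumption-laden illustration, not the exact scripts used in this thesis) down-samples a 48 kHz noise segment to 44.1 kHz with a polyphase resampler, scales the noise power to a desired SNR, and sums it with the clean speech; the two signal arrays are random stand-ins for an AFVC sentence and a QUT-NOISE segment.

```python
import numpy as np
from scipy.signal import resample_poly

def mix_at_snr(clean, noise_48k, snr_db):
    """Corrupt 44.1 kHz clean speech with a 48 kHz noise segment at snr_db."""
    noise = resample_poly(noise_48k, 147, 160)     # 48 kHz -> 44.1 kHz
    noise = noise[: len(clean)]
    p_speech = np.mean(clean ** 2)                 # average speech power
    p_noise = np.mean(noise ** 2)                  # average noise power
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise                    # sample summing

rng = np.random.default_rng(0)
clean = rng.standard_normal(44100)       # stand-in for one clean AFVC sentence
noise_48k = rng.standard_normal(48000)   # stand-in for a QUT-NOISE segment
noisy = mix_at_snr(clean, noise_48k, snr_db=-10)
```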
3.4.2 Noisy speech in multi-channel speech enhancement
In real forensic scenarios, the police agencies usually record speech from the crim-
inal using hidden microphones. Such forensic audio recordings are often mixed
with various types of environmental noise. The goal of construction of noisy
speech in multiple channels is to simulate multi-channel speech enhancement un-
der real-world situations.

Figure 3.1: Configuration of sources and microphones in instantaneous ICA mixtures.

In the construction of noisy speech in multiple channels,
one sentence from 100 speakers of the AFVC database was used to evaluate the
performance of speech enhancement based on the ICA algorithm. Each of the
clean speech signals from the AFVC database was corrupted by one session of
environmental noise (CAR, STREET and HOME noises) from the QUT-NOISE
database, resulting in a two-channel noisy speech signal at SNRs ranging from
-10 dB to 10 dB. The sampling frequency of the noise was down sampled from
48 kHz to 44.1 kHz before mixing with the clean speech signals from the AFVC
database to match the sampling frequency of the clean speech signals.
Figure 3.1 shows the configuration of sources (z(n) and e(n)) and microphones
(x1 and x2) in instantaneous ICA mixtures. The noisy speech signals recorded
by the microphones, x, can be modeled as follows:
$$x = As(n), \qquad (3.1)$$
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (3.2)$$
where z(n) is the clean speech signal, e(n) is the environmental noise, A is the
mixing matrix, and a11, a12, a21 and a22 are the parameters of the mixing matrix.
These parameters depend on the distance between the microphone and the source
Figure 3.2: Position of speech and noise signals to the microphones.
signals and the amplitude of the source signals is proportional to the inverse of the
distance from the source to the microphone. Thus, the inverse of each parameter
of the mixing matrix is proportional to the distance between each source and
microphone [132].
The first and second microphones (x1 and x2) have the same distance to the
clean speech source signal (d11 = d21), but the noise (e(n)) has half the distance
to the second microphone (x2) compared to the distance of the noise to the first
microphone (d22 = 0.5 d12), as shown in Figure 3.2. The relationship between dij
and aij can be expressed by the following equation:
$$d_{ij} = \frac{1}{a_{ij}}. \qquad (3.3)$$
The mixing matrix, A, used in the construction of noisy speech in multi-channel
speech enhancement is:
$$A = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 2.0 \end{bmatrix}. \qquad (3.4)$$
We chose the value of the mixing parameter a22 equal to 2 in order to compare the performance of speech enhancement based on the ICA algorithm under worst-case conditions with single channel speech enhancement algorithms.
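For illustration, the hedged Python sketch below builds the two-channel instantaneous mixture of Equations 3.1 and 3.2 using the worst-case mixing matrix of Equation 3.4; the arrays z and e are random stand-ins for a clean utterance and an SNR-scaled noise segment.

```python
import numpy as np

# Worst-case mixing matrix of Equation 3.4 (the noise is closest to microphone 2).
A = np.array([[1.0, 1.0],
              [1.0, 2.0]])

def instantaneous_mixture(z, e, A):
    """Return the two microphone signals x = A s with s = [z(n); e(n)] (Eq. 3.2)."""
    s = np.vstack([z, e])          # 2 x n source matrix
    return A @ s                   # 2 x n observations (rows: x1, x2)

rng = np.random.default_rng(0)
z = rng.standard_normal(44100)     # stand-in for one clean AFVC sentence
e = rng.standard_normal(44100)     # stand-in for an SNR-scaled noise segment
x = instantaneous_mixture(z, e, A)
```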
3.5 Noisy and reverberant frameworks
The forensic audio recordings available from the AFVC database [99] cannot be
used to evaluate the robustness of forensic speaker verification based on robust
feature extraction techniques and the ICA algorithms under noisy and reverber-
ant conditions, because this database contains only clean speech signals. In order
to evaluate the performance of forensic speaker verification systems in the pres-
ence of environmental noise and reverberation conditions, we designed noisy and
reverberant frameworks based on the single and multiple microphones.
In real forensic applications, long speech samples from a suspected speaker are
recorded in an interview scenario, while the surveillance recordings from the crim-
inal may be of very short duration. In order to design noisy and reverberant frameworks for simulating forensic speaker verification under real-world situations, the interview recordings for these frameworks were obtained from full-length utterances of 200 speakers using the pseudo-police style. The surveillance recordings were obtained from short-duration utterances (10 sec, 20 sec, and 40 sec) from
200 speakers using the informal telephone conversation style. The VAD algo-
rithm [130] was used to remove silent regions from the interview and surveillance
recordings. It was necessary to remove silent portions from the clean surveillance
recordings before adding the noise because the silence would artificially increase
the true short-term active speech SNR compared to that of the desired SNR [28].
The VAD was applied to the clean speech rather than the noisy speech signals because, in a forensic scenario, manual segmentation of speech activity or speech labelling can be performed when required [89].
single and multiple microphone adverse frameworks is provided in the following
sections.
3.5.1 Single microphone adverse framework
In order to evaluate the performance of forensic speaker verification based on the
robust feature extraction techniques in the presence of environmental noise and
reverberation conditions, a single microphone adverse framework is designed. This framework consists of three configurations, which are briefly described in the following sections.
3.5.1.1 Adding noise
The goal of adding noise from the QUT-NOISE database [29] to the clean speech
signals from the AFVC database [99] was to evaluate the robustness of single
microphone forensic speaker verification based on the length normalized GPLDA
under environmental noise conditions only. The interview recordings were kept
under clean conditions, while the surveillance recordings from the criminal were
corrupted by different types of environmental noise. The surveillance recordings
were mixed with environmental noise only because in most real forensic situations,
the surveillance data were often recorded in open areas in the presence of various
types of environmental noise, and the effect of reverberation conditions on the noisy surveillance recordings can be neglected [3]. A random session of STREET, CAR and HOME noises from the QUT-NOISE database was chosen and down-sampled from 48 kHz to 44.1 kHz to match the sampling frequency of the surveillance speech signals. These noises were used in this thesis because they were more likely to occur in real forensic situations.

Figure 3.3: Design of a single microphone adverse framework based on adding noise.

The average noise power was
scaled in relation to the reference surveillance speech signal after removing silent
regions according to the desired SNR. The noisy surveillance speech signals were
obtained by sample summing of the surveillance speech signal and the scaled environmental noise at SNRs ranging from -10 dB to 10 dB. The design of a single microphone adverse framework based on adding noise is shown in Figure 3.3.
3.5.1.2 Adding reverberation
The aim of adding reverberation to interview recordings was to investigate the
effect of different reverberation conditions on the forensic speaker verification
performance based on the robust feature extraction techniques. Training room impulse responses were computed for a fixed room of dimensions 3 × 4 × 2.5 m using the image source algorithm proposed by Lehmann and Johansson [78].
Table 3.1: Reverberation test room parameters.

Configuration   Suspect position (xs, ys, zs)   Microphone position (xm, ym, zm)
1               (2, 1, 1.3)                     (1.5, 1, 1.3)
2               (2, 1, 1.3)                     (2.4, 1, 1.3)
3               (2, 1, 1.3)                     (2.8, 1, 1.3)
4               (2, 1, 1.3)                     (2.8, 2.5, 1.3)
Figure 3.4: Position of suspect and microphones in a room. All microphones and the suspect are at 1.3 m height and the height of the room is 2.5 m.
Table 3.1 and Figure 3.4 show the reverberation room parameters and the positions of the suspect and microphones in the room. The symbols (xs, ys, zs) and (xm, ym, zm) in Table 3.1 represent the positions of the suspect and microphone in a virtual room, respectively. When adding reverberation, the position of the microphone was changed horizontally in most configurations, as shown in Table 3.1 and Figure 3.4. In most forensic scenarios, the police often put the microphone on a table in a police office to record speech from the suspect, and the position of the microphone could be changed horizontally on the table; varying the microphone position therefore allows the effect of the suspect/microphone distance on forensic speaker verification performance to be investigated under real-world situations. Reverberation is often
characterized by reverberation time (T20 or T60), which describes the amount of
time for the direct sound to decay by 20 dB or 60 dB, respectively [40, 79]. The
reverberation time (T20) was measured from the room impulse response.

Figure 3.5: Design of a single microphone adverse framework based on adding reverberation conditions.

The reason for using T20 instead of the more popular T60 is to reduce the computational time when computing the reverberation time in a series of simulated room im-
pulse responses [79]. Each of the interview recordings was convolved with the room impulse response to generate reverberated speech with the same duration as the interview recordings. The interview data are often recorded in the pres-
ence of reverberation conditions because the police usually record speech from
the suspect in an interview room where reverberation is present. The surveil-
lance speech signals were kept without reverberation because in most forensic
situations surveillance speech signals are often recorded in open areas [7].
The reverberation from the QUT-NOISE database is not used for the evaluation
of length-normalized GPLDA based speaker verification in forensic applications
for two main reasons. Firstly, the reverberation used in the QUT-NOISE database
was recorded in places which are impractical for most forensic situations, such as
a pool and car park, and is characterized by splashing, running and road noises.
However, in most forensic applications, the police record the interview data in
a police interview room where reverberation often occurs. Secondly, the effects
of some factors such as the position of the microphone relative to a suspect in
the real police interview room and reverberation time cannot be evaluated by
using the reverberation from the QUT-NOISE database. Thus, it is important
to generate reverberated speech signals using a virtual room to achieve a closer
approximation to real forensic situations. The design of a single microphone
adverse framework based on adding reverberation is shown in Figure 3.5.
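The two operations described in this section can be sketched in a few lines of Python. In the hedged fragment below, the exponentially decaying rir is a synthetic stand-in for an image-source response from [78] (not the actual responses used in the experiments); reverberation is applied by convolution, and T20 is estimated by Schroeder backward integration following the definition above (the time taken for a 20 dB decay).

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return fftconvolve(speech, rir)[: len(speech)]

def t20_from_rir(rir, fs):
    """Time for the Schroeder energy-decay curve to fall by 20 dB (the T20 above)."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]     # Schroeder backward integration
    decay_db = 10 * np.log10(energy / energy[0])
    return np.argmax(decay_db <= -20) / fs       # first sample 20 dB down

fs = 44100
rng = np.random.default_rng(0)
# Synthetic stand-in for a simulated response: exponentially decaying noise.
rir = np.exp(-np.arange(fs) / (0.05 * fs)) * rng.standard_normal(fs)
speech = rng.standard_normal(3 * fs)             # stand-in interview recording
reverberated = reverberate(speech, rir)
print(t20_from_rir(rir, fs))                     # ~0.1 s for this synthetic RIR
```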
3.5.1.3 Adding reverberation and noise
The goal of adding reverberation and noise to interview and surveillance record-
ings was to simulate the performance of single microphone length normalized
GPLDA-based forensic speaker verification systems under real-world situations.
Training room impulse responses were computed from the first configuration of
the room, as described in Table 3.1 using the image source algorithm [78]. Each of
the interview recordings was convolved with the room impulse response to generate reverberated speech with the same duration as the clean interview recordings.
In order to investigate the effect of the utterance duration on noisy speaker ver-
ification, the surveillance speech signals were extracted from random sessions of
10 sec, 20 sec, and 40 sec duration from 200 speakers, using the informal tele-
phone conversation style, after removing silent portions using the VAD algorithm
[130]. The surveillance recordings were corrupted with different segments of CAR,
STREET and HOME noises from the QUT-NOISE database [29] at various SNR
values ranging from -10 dB to 10 dB. The design of a single microphone adverse
framework based on adding reverberation and noise conditions is shown in Figure
3.6.
Figure 3.6: Design of a single microphone adverse framework based on adding reverberation and noise conditions.
3.5.2 Multiple microphones adverse framework
In order to evaluate the performance of forensic speaker verification based on the
ICA algorithms in the presence of environmental noise and reverberation con-
ditions, a multiple microphones adverse framework is designed comprising two
configurations. The first was a multiple microphone adverse framework based
on adding noise which combined noise from the QUT-NOISE database [29] with
surveillance recordings from the AFVC database using instantaneous ICA mix-
tures. The second was a multiple microphones adverse framework based on adding
reverberation and noise which combined noise from the QUT-NOISE database
with surveillance recordings from the AFVC database and interview recordings
from the AFVC database convolved with the impulse response of the room to gen-
erate reverberated speech signals. The goal of designing a multiple microphones adverse framework based on adding reverberation and noise conditions is to simulate forensic speaker verification based on the ICA algorithm under real-world situations: in most forensic scenarios, the police often record speech from the suspect in a room under reverberation conditions, and they often use hidden microphones to record the surveillance recordings from the criminal in public places in the presence of environmental noise. A brief description of the two configurations is provided in the following sections.
3.5.2.1 Adding noise
In most real forensic situations, the criminal may use a mobile phone to commit
criminal offences in public places. The police often use hidden microphones to
record the speech from the criminal in a public place and the forensic audio
recordings are often corrupted by various types of environmental noise. The effect of reverberation on noisy surveillance speech signals is not investigated because in most forensic situations noisy speech signals are often recorded in open areas. Thus, instantaneous ICA mixtures will be used in this thesis to model the noisy speech signals.
The objective of designing the multiple microphones adverse framework based
on adding noise was to evaluate the robustness of forensic speaker verification
based on the ICA algorithms under environmental noise conditions only. The full
duration of interview recordings was obtained from 200 speakers using the pseudo-
police style and the interview recordings were kept under clean conditions. The
surveillance recordings were obtained from short segments (10 sec, 20 sec, and 40
sec) from 200 speakers using the informal telephone conversation style. The CAR, STREET and HOME noises from the QUT-NOISE database [29] were down-sampled from 48 kHz to 44.1 kHz before mixing with the clean surveillance speech signal. The noisy speech signal in each microphone was obtained by sample summing of the surveillance signal and the scaled environmental noise at SNRs ranging from -10 dB to 10 dB.

Figure 3.7: Position of speech and noise signals to the microphones.

The noisy speech signals recorded by the microphones, x, can be modeled as follows:
$$x = As(n), \qquad (3.5)$$
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 0.6 \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (3.6)$$
where z(n) is the clean speech signal, e(n) is the environmental noise, and A is
the mixing matrix. As the parameters of the mixing matrix are based on the con-
figuration of the sources and the microphones, the amplitude of the source signal
is proportional to the inverse of the distance from the source to the microphone.
Thus, the inverse of each parameter of the mixing matrix is proportional to the
distance between each source and microphone [132].
In the design of the multiple microphone adverse framework based on adding
noise, the first and second microphones (x1 and x2) are at the same distance from the clean speech source signal (d11 = d21), but the distance from the noise to the second microphone is 1.66 times its distance to the first microphone (d22 = 1.66 d12), as shown in Figure 3.7.

Figure 3.8: Design of a multiple microphones adverse framework based on adding noise.
In most real forensic scenarios, the police often record surveillance speech signal
from the suspect using hidden microphones. The distance between the microphones and the surveillance speech source should be less than or equal to the distance between the microphones and the noise source, so that the recorded noisy speech is less affected by the environmental noise. These noisy surveillance speech signals can be
used as the input signals to speaker verification systems based on the ICA algo-
rithm in real forensic applications. Therefore, the values of the mixing matrix
are chosen according to Equation 3.6.
Figure 3.8 shows the design of the multiple microphones adverse framework based on adding noise.
3.5.2.2 Adding reverberation and noise
In real forensic applications, the interview data are often recorded in a police
room under controlled conditions where reverberation often occurs, while the
surveillance speech signals from the criminals may be recorded using hidden mi-
crophones in a public place in the presence of environmental noise. Thus, the aim
of designing the multiple microphones adverse framework based on adding rever-
beration and noise was to evaluate the performance of length-normalized GPLDA
forensic speaker verification based on the ICA algorithms in closer approximation
to real forensic situations.
Training room impulse responses were computed from the first configuration of
the room, as described in Table 3.1, using the image source algorithm proposed
by Lehmann and Johansson [78]. Each interview recording was convolved with
the impulse response of the room to generate the reverberated speech with the
same duration as the clean interview speech signal.
In order to investigate the effect of the duration of utterance on noisy speaker ver-
ification, the surveillance recordings were extracted as short segments (10 sec, 20 sec, and 40 sec) from 200 speakers using the informal telephone conversation style after removing silent portions using the VAD algorithm [130]. The
surveillance recordings were corrupted with different segments of CAR, STREET
and HOME noises from the QUT-NOISE database [29], resulting in a two-channel
noisy speech signal at SNRs ranging from -10 dB to 10 dB, according to Equation
3.6. The design of a multiple microphones adverse framework based on adding
reverberation and noise conditions is shown in Figure 3.9.
Figure 3.9: Design of a multiple microphones adverse framework based on adding reverberation and noise conditions.
3.6 Chapter summary
This chapter presented the AFVC and QUT-NOISE databases, which are used to evaluate speech enhancement algorithms and forensic speaker verification performance under noisy and reverberant conditions. As the AFVC database contains
clean speech signals only, the forensic audio recordings available from the AFVC
database cannot be used to evaluate the speech enhancement algorithms. There-
fore, the construction of noisy speech signals based on single and multiple channels
was designed. The construction of noisy speech signals can be used to simulate
speech enhancement algorithms under real forensic scenarios in Chapter 4. In
real forensic applications, the police usually record speech from the suspect in a
room where reverberation is present and surveillance recordings from the crim-
inal could be recorded using single or multiple microphones by police in open
areas in the presence of various types of environmental noise. Thus, the noisy and reverberant frameworks developed in this chapter will be used to simulate
forensic speaker verification performance based on the robust feature extraction
techniques and the ICA algorithms under real-world situations in Chapters 5 and
6.
Chapter 4
Forensic speech enhancement
algorithms
4.1 Introduction
In real forensic situations, a criminal may use a mobile phone in connection with a criminal offence in a car, street or public place [89]. Such forensic audio
recordings are often mixed with different types of noise such as car, street and
home noises. It is hard to use these audio recordings directly as part of preparing
legal evidence for court because the quality of these recordings is often poor
[4]. Therefore, speech enhancement approaches play an essential role in such
real forensic situations [129]. Forensic speaker verification performance can be improved with some form of speech enhancement [36].
Speech enhancement approaches can be classified into single channel and multi-
channel approaches depending on the number of microphones that are used to
record the noisy speech signal. A number of single channel speech enhancement
approaches have been proposed in previous studies, such as spectral subtraction
[13, 15] and wavelet threshold techniques [33].
Multi-channel speech enhancement approaches can be used to remove the noise
from the noisy speech signals [104] and achieve better performance compared to
single channel speech enhancement approaches [118]. ICA is widely used as a multi-channel speech enhancement technique [84, 85, 124]. The ICA algorithm is based
on transforming the noisy speech signals into components that are statistically
independent to separate the clean speech from the noisy speech signals. The
estimation of the source signal in the ICA algorithm is based on maximizing the
non-Gaussian distribution of one independent component. The difference between
a Gaussian distribution and the distribution of the independent component is
measured using different contrast functions, such as kurtosis, negentropy, and
approximation of negentropy, which is maximized by the ICA algorithm [55].
The objective of using the ICA algorithm is to separate the clean speech from
the noisy speech signals. The enhanced speech signals from the ICA algorithm
can be used to improve forensic speaker verification performance in the presence
of various types and levels of environmental noise. This chapter focuses on using ICA as a multi-channel speech enhancement algorithm because the distributions of environmental noise and speech signals are non-Gaussian, and the ICA algorithm is a robust technique for separating source signals with non-Gaussian distributions. Figures 4.1 and 4.2 show the histograms of the
clean speech from the AFVC database and STREET noise from the QUT-NOISE
database, respectively.
This chapter presents a brief description of the fundamental concept of ICA. Sim-
ulation results for single-channel and multi-channel forensic speech enhancement
algorithms are also presented in this chapter.
Figure 4.1: Histogram of clean speech from the AFVC database.
Figure 4.2: Histogram of STREET noise from the QUT-NOISE database.
4.2 Independent component analysis
ICA is a statistical approach that is used widely to solve the problem of blind
source separation [55]. In the general model of ICA, the source signals are mixed
through a linear basis transformation. Suppose there are N independent source
signals s(t) = {s1(t), s2(t), · · · , sN(t)} and these source signals are observed by
M microphones x(t) = {x1(t), x2(t), · · · , xM(t)}. ICA assumes the number of the
observed signals recorded by the microphones equals the number of the source
signals (M = N) [25]. The fundamental aspect of the mixing process is that each
microphone records a different mixture of the source signals, and the parameters
of the mixing matrix A are unknown. The observed signals, x, can be represented
as:
x = As. (4.1)
The aim of the ICA algorithm is to estimate the original source signals from the
observed signals when both source signals and mixing matrix are unknown [55].
The whole problem resembles the task a human listener solves at a cocktail party, where, using two ears, the brain focuses on a specific sound and suppresses all other sources in the room [54, 77]. The estimate of the source signals in the
ICA algorithm, $\hat{s}$, can be obtained by
$$\hat{s} = Wx, \qquad (4.2)$$
where W is the unmixing matrix. The unmixing matrix can be defined as:
$$W = A^{-1}, \qquad (4.3)$$
where $A^{-1}$ is the inverse of the mixing matrix.
Figure 4.3 shows the processes of mixing and unmixing in blind source separation.
The source signals s are mixed by the mixing matrix A. The estimated source signals ŝ can be obtained by the unmixing matrix W.
Figure 4.3: Mixing and unmixing processes in blind source separation. s are the source signals, x are the observation signals, ŝ are the estimated source signals, A is the mixing matrix and W is the unmixing matrix.
4.2.1 Statistical independence
The principal concept of ICA is based on statistical independence. To simplify
the concept of independence, the two random variables s1 and s2 are said to be
independent if information on the value of s1 does not provide any information
on the value of s2, and vice versa.
4.2.1.1 Independence
Independence can be defined in terms of the probability density function (PDF)
of the signals. Let the joint PDF of s1 and s2 be denoted by p(s1, s2) and the
marginal PDF of s1 and s2 be represented as p(s1) and p(s2), respectively. The
two random variables of s1 and s2 can be defined as independent if the joint
probability density function is factorized by the following equation:
p(s1, s2) = p(s1)p(s2). (4.4)
Independence can equivalently be characterized in terms of expectations: for any two functions h1 and h2,
E{h1(s1)h2(s2)} = E{h1(s1)}E{h2(s2)}, (4.5)
where E{.} is the expectation operator and Equation 4.5 is used to describe the
relationship between uncorrelatedness and independence [55, 101].
4.2.1.2 Uncorrelatedness and independence
Two random variables are defined as uncorrelated if their covariance c(s1, s2) is
zero
c(s1, s2) = E{(s1s2)} − E{s1}E{s2} = 0. (4.6)
Equation 4.6 can be seen to be identical to Equation 4.5 when h1(s1) = s1 and h2(s2) = s2. Thus, independent variables are always uncorrelated, although uncorrelatedness does not imply independence. It is clear from the above discussion that independence is a stronger condition than uncorrelatedness, and many ICA algorithms exploit independence in the source estimation procedure. Nevertheless, enforcing uncorrelatedness is useful to reduce the number of free parameters and simplify the computation of the ICA algorithm [54].
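This distinction can be checked numerically. In the short Python sketch below (an illustration with synthetic data, not part of the thesis experiments), s2 = s1^2 is uncorrelated with a symmetric s1 in the sense of Equation 4.6, yet the factorization of Equation 4.5 fails for h1(s) = h2(s) = s^2, revealing the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 100000)   # zero-mean, symmetric random variable
s2 = s1 ** 2                      # fully determined by s1, hence dependent

# Equation 4.6: the covariance is ~0, so s1 and s2 are uncorrelated.
cov = np.mean(s1 * s2) - np.mean(s1) * np.mean(s2)
print(round(cov, 4))

# Equation 4.5 with h1(s) = h2(s) = s^2 fails, exposing the dependence:
lhs = np.mean(s1 ** 2 * s2 ** 2)              # E{h1(s1) h2(s2)} ~ 1/7
rhs = np.mean(s1 ** 2) * np.mean(s2 ** 2)     # E{h1(s1)} E{h2(s2)} ~ 1/15
print(round(lhs, 3), round(rhs, 3))
```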
4.2.1.3 Non-Gaussianity and independence
The classical central limit theorem states that the distribution of the sum of inde-
pendent signals tends towards a Gaussian distribution under certain conditions.
Therefore, the sum of two independent source signals usually has a distribution closer to Gaussian than either individual source signal. Thus, a distribution close to Gaussian can be obtained by a linear combination of many independent source signals. This also illustrates that the separation of independent signals from the mixture can be obtained by finding a transformation that yields a non-Gaussian distribution [55].
Non-Gaussianity is an essential principle in the estimation of the source signals
in the ICA algorithm. There are various measurements of non-Gaussianity of a
signal used in the ICA algorithm, such as kurtosis and negentropy [55]. A brief
description of these measurements will be given in the next sections.
1. Kurtosis
The classical measure of non-Gaussianity in the ICA algorithm is the kurtosis, or fourth order cumulant [55]. The kurtosis of the random variable s, kurt(s), can be defined as:
$$\mathrm{kurt}(s) = E\{s^4\} - 3(E\{s^2\})^2. \qquad (4.7)$$
The computation of the kurtosis can be simplified by assuming the random variable has zero mean and unit variance ($E\{s^2\} = 1$). Hence Equation 4.7 simplifies to:
$$\mathrm{kurt}(s) = E\{s^4\} - 3. \qquad (4.8)$$
Equation 4.8 clarifies that the kurtosis is a normalized version of the fourth order moment $E\{s^4\}$. The value of kurtosis is zero for Gaussian signals and nonzero for most non-Gaussian signals. Random variables that have positive kurtosis are called super-Gaussian or leptokurtic, and those with negative kurtosis are called sub-Gaussian or platykurtic. The non-Gaussianity of the signal can be measured by using the absolute value of the kurtosis or the square of the kurtosis [55].
Kurtosis is used widely to measure the non-Gaussian distribution in the ICA
algorithm because of its simplicity, both computationally and theoretically.
Computationally, the kurtosis can be estimated simply by computing the
fourth moment of the sample data. Theoretically, the kurtosis has a linear
property. If s1 and s2 are two independent random variables, the kurtosis satisfies:
$$\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2), \qquad (4.9)$$
and
$$\mathrm{kurt}(\alpha s_1) = \alpha^4\, \mathrm{kurt}(s_1), \qquad (4.10)$$
where $\alpha$ is a constant.
The main drawback of using kurtosis to measure the non-Gaussianity of signals is that it is very sensitive to outliers: its value can be dominated by a few observations in the tails of the distribution, which means that its statistical significance is poor [52]. Thus, kurtosis is not a robust measure of non-Gaussianity in ICA, and it is necessary to use better measures than kurtosis (a numerical sketch of the kurtosis measure is given after this list).
2. Negentropy
Negentropy is also used to measure the non-Gaussian distribution of the
independent component and is based on the information theoretic quantity
of differential entropy [55]. Entropy is a measurement of the randomness of
the signals and can be defined as:
$$H(s) = -\sum_i P(s = a_i)\log P(s = a_i), \qquad (4.11)$$
where $a_i$ are the possible values of s. The entropy can be generalized to a continuous-valued random variable s, giving the differential entropy, which is defined as:
$$H(s) = -\int P(s)\log P(s)\, ds. \qquad (4.12)$$
A fundamental result of information theory is that the Gaussian random variable has the greatest entropy among all random variables of equal variance [92, 105]. The value of entropy is small for distributions which are concentrated on certain values or have a "spiky" PDF.
A modified version of differential entropy is called negentropy J , and it can
be defined as:
$$J(s) = H(s_{\mathrm{gauss}}) - H(s), \qquad (4.13)$$
where $s_{\mathrm{gauss}}$ is a Gaussian random variable with the same covariance matrix as s. According to Equation 4.13, the value of negentropy is always non-negative and is zero only if the distribution of the random variable is Gaussian [25]. The advantage of negentropy is that it is, in a well-defined sense, an optimal measure of non-Gaussianity. However, the computation of negentropy is very difficult. Thus, a simple approximation of negentropy must be used in practice [55].
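The numerical sketch of the kurtosis measure referred to above is given here (synthetic samples, not thesis data): the measure of Equation 4.8 is near zero for Gaussian samples, positive for Laplacian samples (a common model for speech amplitudes), and negative for uniform samples.

```python
import numpy as np

def kurt(s):
    """Kurtosis of Equation 4.8 after enforcing zero mean and unit variance."""
    s = (s - s.mean()) / s.std()
    return np.mean(s ** 4) - 3.0

rng = np.random.default_rng(0)
n = 100000
print(round(kurt(rng.standard_normal(n)), 2))   # ~ 0    (Gaussian)
print(round(kurt(rng.laplace(size=n)), 2))      # ~ +3   (super-Gaussian)
print(round(kurt(rng.uniform(-1, 1, n)), 2))    # ~ -1.2 (sub-Gaussian)
```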
4.3 ICA assumption
The ICA algorithm requires some assumptions on the source signals and mixing
matrix [54]. The assumptions and signal processing properties are described
below
1. The source signals are statistically independent
Statistical independence is the fundamental key assumption to ICA, and
it enables estimation of the source signals s from the mixed signals, x, as
discussed in Section 4.2.1.
2. The independent components have non-Gaussian distribution
This assumption is important in an ICA algorithm because of the close relationship between Gaussianity and independence. In the ICA algorithm, it is impossible to separate Gaussian source signals, because the sum of two or more Gaussian source signals also has a Gaussian distribution; the sum of two Gaussian sources therefore cannot be distinguished from a single Gaussian source. Consequently, Gaussian sources are not permitted (at most one source may be Gaussian).
3. The mixing matrix is invertible
The mixing process is assumed to be a linear transformation. The linear
assumption is important to simplify the estimation of the source signals. It
is necessary to assume the mixing matrix is a square matrix and the number
of the source signals equals the number of the mixed signals. If the mixing
matrix is not invertible, the unmixing matrix does not exist to separate the
source from the mixed signals.
4.4 ICA ambiguity
There are two main ambiguities in the ICA algorithm: the magnitude and scaling
ambiguity, and the permutation ambiguity [55, 100].
4.4.1 Magnitude and scaling ambiguity
The true energy (variance) of the independent components cannot be determined, because both the mixing matrix A and the source signals s are unknown. To explain this ambiguity,
the mixing process in Equation 4.1 can be rewritten as:
$$x = \sum_{j=1}^{N} a_j s_j, \qquad (4.14)$$
where $a_j$ represents the jth column vector of the mixing matrix. Any scalar multiplication $\alpha_j$ of one of the source signals can be cancelled by dividing the corresponding column $a_j$ by the same scalar. Equation 4.14 can thus be rewritten as:
$$x = \sum_{j=1}^{N} \left(\frac{1}{\alpha_j} a_j\right)(\alpha_j s_j). \qquad (4.15)$$
This ambiguity is not important for most applications, and the solution for this
ambiguity is to assume that each source signal has unit variance. Furthermore,
the sign of the estimated source signals cannot be determined and the estimated
source signal could be multiplied by -1 to solve this issue without affecting the
ICA model.
4.4.2 Permutation ambiguity
The order of the estimated source signals cannot be determined in the ICA algo-
rithm. Formally, the permutation matrix P and its inverse can be introduced in
Equation 4.1
$$x = AP^{-1}Ps, \qquad (4.16)$$
where the elements of $Ps$ are the original source signals in permuted order and $AP^{-1}$ is the new unknown mixing matrix to be solved by the ICA algorithm.
In most applications, the permutation ambiguity is not a serious problem; in the cocktail party setting, for example, a listener is able to identify the different estimated source signals and judge the quality of the separation by listening to the sounds [100].
4.5 Pre-processing for ICA
Pre-processing is a very useful technique to reduce the complexity of computation
for ICA. This method is applied before using the fast ICA algorithm and can be
divided into two stages: centering and whitening [55].
4.5.1 Centering
The observed signal x is centered by removing the mean $m = E\{x\}$ from the signal. Centering can be represented by:
$$x_c = x - m, \qquad (4.17)$$
where $x_c$ is the centered observed signal.
This step simplifies the ICA algorithm by removing the mean from the observed signals. The unmixing matrix can be estimated using the centered data. The ICA estimate is not affected by removing the mean, because the mean is added back to the centered observation signal after the mixing matrix has been computed, as shown in the following equation:
$$\hat{s} = A^{-1}(x_c + m). \qquad (4.18)$$
4.5.2 Whitening
Whitening is the process of transforming the observed signal (x) linearly into
components which are uncorrelated and the covariance matrix of the whitened
signal, $E\{x_w x_w^T\}$, equals the identity matrix:
$$E\{x_w x_w^T\} = I, \qquad (4.19)$$
where $x_w$ is the whitened observed signal.
The whitening transformation can be performed using the eigenvalue decomposition. The covariance matrix of the observed signal, $E\{xx^T\}$, can be decomposed as:
$$E\{xx^T\} = VDV^T, \qquad (4.20)$$
where V is the eigenvector matrix of the covariance matrix and D is the diagonal matrix of the eigenvalues, $D = \mathrm{diag}\{\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$. The observed signal can be whitened by
$$x_w = VD^{-1/2}V^T x, \qquad (4.21)$$
where $D^{-1/2} = \mathrm{diag}\{\lambda_1^{-1/2}, \lambda_2^{-1/2}, \cdots, \lambda_n^{-1/2}\}$. The whitening approach transforms the mixing matrix into a new mixing matrix, which is orthogonal:
$$x_w = VD^{-1/2}V^T A s = A_w s, \qquad (4.22)$$
hence,
$$E\{x_w x_w^T\} = A_w E\{ss^T\} A_w^T = A_w A_w^T = I. \qquad (4.23)$$
The whitening transformation reduces the cost of the ICA computation by reducing the free parameters of the mixing matrix from the $n^2$ elements of an arbitrary matrix to the $n(n-1)/2$ degrees of freedom of an orthogonal matrix.
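A minimal Python sketch of the centering and whitening steps of Equations 4.17 to 4.21 is given below; the two-channel mixture is a synthetic stand-in, and the covariance of the whitened output can be checked against Equation 4.19.

```python
import numpy as np

def center(x):
    """Remove the mean of each observed signal (Eq. 4.17)."""
    m = x.mean(axis=1, keepdims=True)
    return x - m, m

def whiten(xc):
    """Whiten the centered data via eigenvalue decomposition (Eqs. 4.20-4.21)."""
    cov = (xc @ xc.T) / xc.shape[1]            # sample estimate of E{x x^T}
    d, V = np.linalg.eigh(cov)                 # cov = V D V^T
    return V @ np.diag(d ** -0.5) @ V.T @ xc   # x_w = V D^{-1/2} V^T x

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, (2, 50000))             # two independent stand-in sources
x = np.array([[1.0, 1.0], [1.0, 2.0]]) @ s     # hypothetical mixed observations
xc, m = center(x)
xw = whiten(xc)
print(np.round((xw @ xw.T) / xw.shape[1], 3))  # ~ identity matrix (Eq. 4.19)
```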
4.6 Fast ICA
Fast ICA is a fixed-point ICA algorithm that uses high order statistics to estimate
the source signals. Fast ICA can estimate the source signals one by one (deflation approach) or simultaneously (symmetric approach) [53, 55]. Fast ICA is based on maximizing the approximation of negentropy J(w) to estimate one independent component:
$$J(w) = [E\{G(w^T x)\} - E\{G(v)\}]^2, \qquad (4.24)$$
where v is a standard Gaussian random variable and G is a contrast function.
In practical terms, it is necessary to choose a contrast function that does not grow
too fast, is simple to compute and more robust to outliers than kurtosis. The
following choice of contrast functions has proven very useful in the ICA algorithm
[53]:
$$G_1(s) = \frac{1}{a_1} \log \cosh(a_1 s), \qquad (4.25)$$
$$G_2(s) = -\frac{1}{a_2} \exp\left(-\frac{a_2 s^2}{2}\right), \qquad (4.26)$$
$$G_3(s) = \frac{1}{4} s^4, \qquad (4.27)$$
where $1 \leq a_1 \leq 2$ and $a_2 \approx 1$ are constants.
Fast ICA for one and several units will be described briefly in the next section.
4.6.1 Fast ICA for one unit
Fast ICA for one unit is a simple algorithm to estimate one row vector of the unmixing matrix W by finding the maximum non-Gaussianity of one independent component [53]. There are four steps in the one-unit ICA algorithm:
1. Select an initial guess for w.
2. Estimate $w^+ = E\{x\, g(w^T x)\} - E\{g'(w^T x)\}\, w$, where $w^+$ is the new row vector of the unmixing matrix, $E\{\cdot\}$ is the expectation operator, and g and g′ are the first and second derivatives of the contrast function G(·), respectively.
3. Normalize the row vector $w^+$:
$$w^* = \frac{w^+}{\|w^+\|}. \qquad (4.28)$$
4. Go back to step 2 if not converged.
The criterion of convergence is that the previous and new values of w must point in the same direction, i.e. the absolute value of their dot product is almost equal to one.
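The four steps above translate directly into the hedged Python sketch below, which applies one-unit fast ICA to whitened data xw using the log-cosh contrast of Equation 4.25 with a1 = 1 (so that g = tanh and g' = 1 - tanh^2); it is an illustration, not the implementation used in the experiments of this thesis.

```python
import numpy as np

def fastica_one_unit(xw, max_iter=200, tol=1e-6, seed=0):
    """One-unit fast ICA on whitened data xw (shape: n_channels x n_samples)."""
    n = xw.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)                      # step 1: initial guess
    for _ in range(max_iter):
        wx = w @ xw                             # projections w^T x
        g = np.tanh(wx)                         # g for the log-cosh contrast
        g_prime = 1.0 - g ** 2                  # g' = 1 - tanh^2
        w_new = (xw * g).mean(axis=1) - g_prime.mean() * w   # step 2
        w_new /= np.linalg.norm(w_new)          # step 3: normalization
        if abs(w_new @ w) > 1.0 - tol:          # step 4: sign-invariant convergence
            return w_new
        w = w_new
    return w
```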
4.6.2 Fast ICA for several units
The one-unit fast ICA algorithm described in the previous section is used to
estimate one independent component only. It is necessary to run the fast ICA
algorithm n times to estimate all source signals [53]. To prevent different row vectors of the unmixing matrix from converging to the same maxima, decorrelation of the outputs $w_1^T x, w_2^T x, w_3^T x, \cdots, w_n^T x$ must be performed after every iteration.
The deflation method [53] is a simple method to achieve decorrelation. This method is based on a Gram-Schmidt-like decorrelation and estimates the independent components one by one. When p independent components have been estimated, the fast ICA algorithm is run for $w_{p+1}$; after each iteration step, the "projections" $(w_{p+1}^T w_j) w_j$, $j = 1, \cdots, p$, onto the previously estimated p vectors are subtracted from $w_{p+1}$, and $w_{p+1}$ is then renormalized:
$$w_{p+1} = w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T w_j) w_j, \qquad (4.29)$$
$$w_{p+1} = \frac{w_{p+1}}{\sqrt{w_{p+1}^T w_{p+1}}}. \qquad (4.30)$$
The symmetric approach is used to estimate all independent source signals at the same time. Every row vector of the unmixing matrix is decorrelated and normalized simultaneously according to the following equation:
$$W = (WW^T)^{-\frac{1}{2}} W, \qquad (4.31)$$
where $W = \{w_1, w_2, \cdots, w_n\}$ and $(WW^T)^{-\frac{1}{2}}$ can be obtained from the eigenvalue decomposition $WW^T = FDF^T$ as $(WW^T)^{-\frac{1}{2}} = FD^{-\frac{1}{2}}F^T$.
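A corresponding hedged sketch of the deflation approach of Equations 4.29 and 4.30 is shown below: each new row vector is estimated with the same tanh-based one-unit update and is orthogonalized against the rows already found before renormalization.

```python
import numpy as np

def fastica_deflation(xw, n_components, n_iter=200, seed=0):
    """Estimate n_components rows of the unmixing matrix W from whitened data xw."""
    rng = np.random.default_rng(seed)
    n = xw.shape[0]
    W = np.zeros((n_components, n))
    for p in range(n_components):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            g = np.tanh(w @ xw)
            w = (xw * g).mean(axis=1) - (1.0 - g ** 2).mean() * w  # one-unit update
            w -= W[:p].T @ (W[:p] @ w)     # Eq. 4.29: subtract projections
            w /= np.linalg.norm(w)         # Eq. 4.30: renormalize
        W[p] = w
    return W @ xw                          # estimated (whitened-domain) sources
```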
4.7 Simple illustration of ICA
The concept of ICA can be clarified with a simple example of the separation
of the speech signal from street noise. Statistical independence in ICA is also
illustrated in this section. The results presented below were obtained using the
fast ICA algorithm [55].
Figure 4.4: Original speech and street noise.
4.7.1 Separation of speech from street noise
A clean speech signal from the AFVC database, sampled at 44.1 kHz with 16 bit/sample resolution, and street noise from the QUT-NOISE database are shown in Figure 4.4. The street noise was down-sampled from 48 kHz to 44.1 kHz to match the sampling frequency of the clean speech signal. The speech signal was mixed with street noise at 0 dB input SNR at the first microphone x1. The mixing process can be represented by the following
equation:
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}, \qquad (4.32)$$
where s1 and s2 are the speech and street noise, respectively.
The resulting signals from this mixing are shown in Figure 4.5. Finally, the mixed speech signals were separated using the fast ICA algorithm.

Figure 4.5: Mixed speech with street noise.

Figure 4.6: Estimated speech and street noise using the fast ICA algorithm.

The contrast function used in fast ICA is the Gaussian function, which can be represented by
$$G(s) = -\exp(-s^2/2). \qquad (4.33)$$
The estimation of speech and street noise signals using the ICA algorithm is
shown in Figure 4.6. By comparing Figure 4.6 to Figure 4.4, it is clear that
4.7 Simple illustration of ICA 95
the clean speech and street noise have been estimated accurately without any
knowledge of the source signals and the mixing matrix.
This example also illustrates the scaling ambiguity discussed in Section 4.4.1: the amplitudes of the original and estimated speech signals differ because of this ambiguity.
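A minimal sketch of this separation experiment is given below, assuming scikit-learn's FastICA implementation and synthetic stand-in sources; in the thesis, the actual speech and noise signals come from the AFVC and QUT-NOISE databases.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 100_000
s1 = np.sin(2 * np.pi * 0.01 * np.arange(n))   # stand-in for clean speech
s2 = rng.laplace(size=n)                       # stand-in for street noise
S = np.c_[s1, s2]

A = np.array([[1.0, 1.0],                      # mixing matrix of Eq. (4.32)
              [1.0, 2.0]])
X = S @ A.T                                    # observed microphone mixtures

# fun='exp' selects the Gaussian contrast function G(s) = -exp(-s^2/2)
ica = FastICA(n_components=2, fun='exp', whiten='unit-variance',
              random_state=0)
S_hat = ica.fit_transform(X)                   # estimated sources
```

The recovered columns of S_hat match s1 and s2 only up to permutation, scale and sign, which is exactly the ambiguity visible when comparing Figures 4.4 and 4.6.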
4.7.2 Illustrative example of statistical independence in ICA
The previous example provided a simple demonstration of how ICA is used to separate the speech from the street noise signals. However, this example did not
give any insight into the mechanism of the ICA algorithm and its close relationship
with statistical independence. In this example, the statistics of the ICA algorithm
are described more clearly. Let two uniform random variables be mixed using the
following mixing process:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}. \qquad (4.34)$$
Figures 4.7 and 4.8 show the scatter-plot for the original source signals s1 and s2
and the scatter-plot of the mixture, respectively. It is clear from Figure 4.8 that
the two random variables are statistically dependent. For example, if x2 = 200,
then x1 can be determined. Whitening is a preprocessing step that is generally
performed before the ICA algorithm. The joint probability distribution obtained
from the whitened signals is shown in Figure 4.9, and it is observed that the distributions of the two random variables are uniform and independent. The statistical independence can be confirmed, as the value of each random variable is not determined by the other random variable.
Figure 4.7: Original sources.
Figure 4.8: Mixture of the source signals.
The uniform distribution of the two random variables in Figure 4.10 takes values ranging from 0 to 3.5. However, the range of the original source signals is not known because of the scaling ambiguity of the ICA algorithm. Comparing the whitened signals in Figure 4.9 with Figure 4.10 shows that pre-whitening reduces the dimensionality of the ICA search by finding
Figure 4.9: Joint density of the whitened signals obtained from whitening the mixed sources.
Figure 4.10: Estimation of the source signals using the ICA algorithm.
a suitable rotation to yield independence, and it simplifies the estimation to an orthogonal transformation, which in two dimensions needs only one parameter (the rotation angle).
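The whitening step itself can be sketched as follows (a numpy illustration under the same mixing as Equation 4.34; not the thesis code):

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean observations X (channels x samples): z = V x with
    V = D^(-1/2) E^T from the eigendecomposition of the covariance of X."""
    Xc = X - X.mean(axis=1, keepdims=True)   # centring
    d, E = np.linalg.eigh(np.cov(Xc))        # covariance eigendecomposition
    return np.diag(d ** -0.5) @ E.T @ Xc     # output has identity covariance

# Two uniform sources mixed as in Eq. (4.34)
rng = np.random.default_rng(0)
S = rng.uniform(0, 100, size=(2, 10_000))
A = np.array([[1.0, -1.0],
              [1.0,  2.0]])
Z = whiten(A @ S)   # a scatter-plot of Z is a rotated square, as in Figure 4.9
```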
Figure 4.11: Comparison of speech enhancement algorithms.
4.8 Methodology
Experiments were conducted to evaluate the performance of speech enhancement
based on the ICA algorithm with single channel speech enhancement algorithms
(wavelet threshold technique and spectral subtraction). The comparison of speech
enhancement algorithms consists of the following steps, which are shown in Figure
4.11 and described in the next sections.
4.8.1 Noisy speech signals
The construction of noisy speech signals for the single and multiple channel speech enhancement algorithms was described in Section 3.4 in Chapter 3.
4.8.2 Speech enhancement algorithms
This section presents a brief description of the single channel speech enhancement algorithms (wavelet threshold techniques and spectral subtraction) and the multi-channel speech enhancement algorithm (ICA) that are used in the simulations to remove various types of environmental noise from the noisy forensic speech signals.
4.8.2.1 Discrete wavelet transform
The wavelet transform is a technique for analyzing speech signals. It was introduced to overcome the fixed time-frequency resolution of the short time Fourier transform (STFT) [136]. The wavelet transform uses an adaptive window that provides low frequency resolution in high-frequency bands and high frequency resolution in low-frequency bands, whereas the STFT uses a fixed window size for all frequency sub bands. In that respect, the wavelet transform is similar to the human auditory system, which exhibits similar time-frequency resolution properties [136].
The DWT is a type of wavelet transform that can be defined by

$$W(j,k) = \sum_{n} x(n)\, 2^{-j/2}\, \psi\!\left(2^{-j}n - k\right), \qquad (4.35)$$

where $\psi$ is a time function with fast decay and finite energy called the mother wavelet, $j$ is the decomposition level, $x(n)$ is the speech sample, and $n$ and $k$ are integer time and translation indices.
Figure 4.12: Daubechies 8 wavelet function.
The DWT can be performed using a pyramidal algorithm [88].
Various families of the wavelet transform have been used to decompose signals
such as biorthogonal, coiflets, symmlets and Daubechies [96]. The Daubechies
wavelet is one of the most common wavelets used to analyze the speech signals.
The name of the Daubechies family wavelet can be written as dbN, where N is the
order of the filter. The Daubechies wavelets are a family of orthogonal wavelets
defining a discrete wavelet transform and they are characterized by a maximum
number of vanishing moments p for a given support (filter) length [140]. The number of vanishing moments is a criterion of how fast the wavelet function decays towards infinity [23]. Having p vanishing moments means that the wavelet coefficients of any polynomial of order up to p − 1 are zero [140]. Daubechies 8 is widely used for the decomposition of speech signals because it requires the minimum support filter length for a given number of vanishing moments [133]. Thus, we used Daubechies 8
to decompose the noisy speech signals in the wavelet threshold techniques. Figure
4.12 shows the Daubechies 8 wavelet function.
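For illustration, a multi-level db8 decomposition can be carried out with the PyWavelets package (an assumption; the thesis does not name its implementation):

```python
import numpy as np
import pywt

# Hypothetical noisy speech segment; in the thesis the signals come from the
# AFVC and QUT-NOISE databases.
x = np.random.default_rng(0).standard_normal(1024)

# Four-level decomposition with Daubechies 8, as used later in the wavelet
# threshold technique: coeffs = [CA4, CD4, CD3, CD2, CD1]
coeffs = pywt.wavedec(x, 'db8', level=4)

# The inverse transform reconstructs the input (up to padding at the end)
x_rec = pywt.waverec(coeffs, 'db8')
print(np.allclose(x, x_rec[:len(x)]))   # True
```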
Figure 4.13 shows a block schematic of the dyadic wavelet transform. The speech
signal (x) is decomposed into different frequency bands by using a pair of FIR
filters, h(n) and g(n), which are a low-pass and a high-pass filter, respectively.
Figure 4.13: Block schematic of the dyadic wavelet transform.
The (↓ 2) symbol is a down-sampling operator used to discard half of the speech samples after the filter is applied. The approximation coefficients (CA1) are obtained by convolving the low-pass filter h(n) with the speech signal and applying the down-sampling operator to the filter output. The detail coefficients (CD1) are obtained by convolving the high-pass filter g(n) with the speech signal and down-sampling the filter output. The decomposition of the speech signal can be repeated by applying the DWT to the approximation coefficients (CA1).
Figure 4.14 shows the implementation of a two-level inverse discrete wavelet transform based on two filter banks, where $\tilde{h}(n)$ and $\tilde{g}(n)$ are the low-pass and high-pass reconstruction filters, respectively, and the symbol (↑ 2) represents up-sampling of the wavelet coefficients by a factor of 2. The four FIR filters satisfy the following relationships:

$$g(n) = (-1)^{n}\, h(L + 1 - n), \qquad (4.36)$$

$$\tilde{h}(n) = h(L + 1 - n), \qquad (4.37)$$

$$\tilde{g}(n) = (-1)^{(n-1)}\, h(L + 1 - n), \qquad (4.38)$$

where L is the length of the FIR filters and $n = 1, 2, \cdots, L$.
Figure 4.14: Block schematic of the dyadic inverse discrete wavelet transform.
The output of the inverse DWT is identical to the input speech signal [45, 88].
4.8.2.1.1 Wavelet threshold technique
The wavelet threshold technique is used to reduce the effect of noise by thresh-
olding the detailed wavelet coefficients. It is based on the assumption that the
energy of the speech signal is mostly concentrated in a small number of wavelet
coefficients [126]. These coefficients have larger energy than the remaining coefficients (especially the noise coefficients), whose energy is spread over a large number of wavelet coefficients. Thresholding the small wavelet coefficients to zero can therefore eliminate the noise components from the noisy speech signal [44].
Level dependent wavelet threshold techniques are used widely to suppress noise
from the noisy speech signal. These techniques are based on thresholding the
detail coefficients of the noisy speech signals [44]. Level dependent threshold
(λth) can be defined as:

$$\lambda_{th} = \sigma_j \sqrt{2 \log N_j}, \qquad (4.39)$$

$$\sigma_j = \frac{\mathrm{MAD}(D_j)}{0.6745}, \qquad (4.40)$$

where MAD represents the median absolute deviation estimated on the scale j, $D_j$ is the detail coefficients at each scale, and $N_j$ is the number of samples of the noisy speech signal at each scale j.
The method of level dependent wavelet thresholding can be described by the following four steps (a code sketch follows the list).
1. Frame the noisy speech signal into several segments by using a Hamming
window. The duration of the frame used in the simulation results for the
speech enhancement algorithm is 25 msec.
2. Decompose the noisy speech signal into four levels by using Daubechies 8
DWT.
3. Threshold the detail coefficients of the noisy speech signal by using a hard or a soft level dependent threshold. The hard (Thard) and soft (Tsoft) thresholds can be expressed as:

$$T_{hard}(D_j) = \begin{cases} D_j, & |D_j| > \lambda_{th} \\ 0, & |D_j| \le \lambda_{th} \end{cases} \qquad (4.41)$$

$$T_{soft}(D_j) = \begin{cases} \mathrm{sign}(D_j)\,(|D_j| - \lambda_{th}), & |D_j| > \lambda_{th} \\ 0, & |D_j| \le \lambda_{th} \end{cases} \qquad (4.42)$$
4. Apply the inverse DWT to the thresholded detail wavelet coefficients to
obtain the enhanced speech signal.
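The four steps can be sketched in Python as follows, assuming the PyWavelets package and approximating MAD(Dj) by the median of |Dj| (reasonable because the median of the detail coefficients is close to zero); this is an illustrative sketch rather than the thesis code:

```python
import numpy as np
import pywt

def wavelet_denoise(frame, wavelet='db8', level=4, mode='soft'):
    """Level-dependent wavelet thresholding (Eqs. 4.39-4.42) of one
    Hamming-windowed 25 msec frame."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    out = [coeffs[0]]                               # approximation kept as-is
    for Dj in coeffs[1:]:                           # detail coefficients
        sigma_j = np.median(np.abs(Dj)) / 0.6745    # Eq. (4.40), MAD approx.
        lam = sigma_j * np.sqrt(2 * np.log(len(Dj)))   # Eq. (4.39)
        if mode == 'hard':
            Dj = np.where(np.abs(Dj) > lam, Dj, 0.0)   # Eq. (4.41)
        else:
            Dj = np.sign(Dj) * np.maximum(np.abs(Dj) - lam, 0.0)  # Eq. (4.42)
        out.append(Dj)
    return pywt.waverec(out, wavelet)[:len(frame)]  # step 4: inverse DWT
```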
4.8.2.2 Spectral Subtraction
Spectral subtraction is based on subtracting the estimated power spectrum of the
noise from the power spectrum of the noisy speech signal, without prior knowledge
of the power spectral density of the clean speech and noise signals. Spectral subtraction can be used to suppress background noise under the assumption that the noise is stationary or changes slowly during the non-speech and speech activity periods [13].
The procedure of spectral subtraction can be summarized by the following steps. Firstly, the noisy speech signal is framed into several overlapping segments by using a Hamming window; the frame duration is 25 msec and the overlap between two successive windows is 12.5 msec. Secondly, the noise is estimated by computing the average power spectrum of the noise over several silent frames (noise only), where a spectral distance voice activity detector is used to determine the noise frames. Then, the Fourier transform is applied to the windowed frames of the noisy speech signal [137]. Spectral subtraction can be computed as:

$$|S(k)|^2 = \begin{cases} |X(k)|^2 - \delta |D(k)|^2, & \text{if } |X(k)|^2 - \delta |D(k)|^2 > \beta |D(k)|^2 \\ \beta |D(k)|^2, & \text{otherwise,} \end{cases} \qquad (4.43)$$
where X(k), S(k) and D(k) are the magnitudes of the power spectra of the corrupted speech segment, the estimated speech and the estimated noise, respectively, δ is the over-subtraction factor, which depends on the a posteriori segmental SNR, and β is the spectral floor factor with values between 0 and 1. For a large value of β, the spectral floor is high and the remaining noise is audible, while for a small value of β, the noise is significantly reduced but the remaining noise becomes annoying. The value of β used in these experimental results is 0.03. Finally, the enhanced speech signal is obtained by applying an inverse Fourier transform to the estimated speech magnitude spectrum |S(k)| combined with the phase of the discrete Fourier transform of the noisy input speech signal.
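A simplified sketch of this procedure is shown below; it assumes a fixed over-subtraction factor δ and estimates the noise spectrum from the first few frames instead of a spectral distance voice activity detector, so it is an approximation of the procedure described above rather than the thesis implementation:

```python
import numpy as np

def spectral_subtraction(noisy, fs, delta=4.0, beta=0.03,
                         frame_ms=25, hop_ms=12.5, n_noise_frames=10):
    """Power spectral subtraction (Eq. 4.43) with noisy-phase resynthesis.
    Window normalization in the overlap-add is omitted for brevity."""
    n = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(n)
    frames = [noisy[i:i + n] * win for i in range(0, len(noisy) - n, hop)]
    X = np.fft.rfft(np.array(frames), axis=1)
    noise_pow = np.mean(np.abs(X[:n_noise_frames]) ** 2, axis=0)  # |D(k)|^2
    pow_sub = np.abs(X) ** 2 - delta * noise_pow                  # subtraction
    S_pow = np.maximum(pow_sub, beta * noise_pow)                 # floor (4.43)
    S = np.sqrt(S_pow) * np.exp(1j * np.angle(X))                 # noisy phase
    out = np.zeros(len(noisy))                                    # overlap-add
    for i, frame in enumerate(np.fft.irfft(S, n=n, axis=1)):
        out[i * hop:i * hop + n] += frame
    return out
```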
4.8.2.3 Independent component analysis
The fast ICA algorithm is used to separate the clean speech from the noisy speech signals. It separates one non-Gaussian component at a time, exploiting the fact that a sum of independent source signals is closer to a Gaussian distribution than the individual source signals are. The Gaussian contrast function is used in the fast ICA and can be defined as:

$$G_1(s) = -\exp\!\left(-\frac{s^2}{2}\right). \qquad (4.44)$$
In order to estimate all the independent components of the source signals (speech and environmental noise), we ran the one-unit fast ICA N times, and the deflation decorrelation algorithm was applied to the estimated source signals after each iteration to prevent the row vectors of the unmixing matrix W from converging to the same maxima.
4.8.3 Evaluation of performance
Speech enhancement performance is typically measured using the output SNR (SNRo) in dB, which can be defined as:

$$\mathrm{SNR}_o = 10 \log_{10}\!\left(\frac{\sum_{n} s^2(n)}{\sum_{n} \left(s(n) - \hat{s}(n)\right)^2}\right), \qquad (4.45)$$

where $s(n)$ and $\hat{s}(n)$ are the original speech and the estimated speech signals, respectively.
The SNR enhancement (SNRe) in dB can be defined as:

$$\mathrm{SNR}_e = \mathrm{SNR}_o - \mathrm{SNR}_i, \qquad (4.46)$$
where SNRi is the input SNR; it can be computed as the ratio of the sum of squares of the clean speech to that of the noise at the first microphone (x1).
To evaluate the speech enhancement performance, the average improvement in SNR was used in the simulations, where one sentence from each of 100 speakers from the AFVC database was mixed with different types of environmental noise from the QUT-NOISE database. The average SNR improvement in dB, $\overline{\mathrm{SNR}}_e$, can be computed as:

$$\overline{\mathrm{SNR}}_e = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathrm{SNR}_e(i), \qquad (4.47)$$

where $N_s$ is the number of speakers (equal to 100) and $\mathrm{SNR}_e(i)$ is the SNR enhancement for speaker i.
The standard deviation of the SNR enhancement can be computed as

$$\sigma = \sqrt{\frac{1}{N_s} \sum_{i=1}^{N_s} \left(\mathrm{SNR}_e(i) - \overline{\mathrm{SNR}}_e\right)^2}. \qquad (4.48)$$
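Equations 4.45-4.48 can be computed directly, as in the following numpy sketch (not the thesis code):

```python
import numpy as np

def output_snr(s, s_hat):
    """Output SNR in dB (Eq. 4.45): clean speech power over error power."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

def input_snr(s, noise):
    """Input SNR in dB: clean speech power over the power of the noise
    observed at the first microphone x1."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))

def snr_enhancement(s, s_hat, noise):
    """SNR enhancement in dB (Eq. 4.46)."""
    return output_snr(s, s_hat) - input_snr(s, noise)

def average_snr_enhancement(snr_e_per_speaker):
    """Average (Eq. 4.47) and standard deviation (Eq. 4.48) of the SNR
    enhancement over the Ns speakers."""
    snr_e = np.asarray(snr_e_per_speaker)
    mean = snr_e.mean()
    std = np.sqrt(np.mean((snr_e - mean) ** 2))
    return mean, std
```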
4.8.4 Comparison of speech enhancement algorithms
In this section, the performance of speech enhancement based on the ICA algo-
rithm is compared with the performance of single channel speech enhancement
algorithms (wavelet threshold techniques and spectral subtraction). The goal of
comparing single channel speech enhancement algorithms with speech enhance-
ment based on the ICA algorithm is to choose the most reliable baseline speech
enhancement algorithm for real forensic applications under noisy conditions. The
baseline speech enhancement will be used in Chapter 6 to improve forensic speaker
verification performance in the presence of various types and levels of environ-
mental noise.
Figures 4.15, 4.16, and 4.17 show comparisons of the average and standard deviation of SNR improvement for different speech enhancement algorithms when one sentence from each of 100 speakers of the AFVC database was corrupted with STREET, CAR and HOME noises, respectively. Standard deviations from the Monte Carlo simulation are shown in the bars. The SNRs on the x-axis of these figures were computed from the first microphone (x1) in the ICA algorithm.

Figure 4.15: Comparison of average SNR enhancement when STREET noise is added to the speech signals from the AFVC database.

Figure 4.16: Comparison of average SNR enhancement when CAR noise is added to the speech signals from the AFVC database.

Figure 4.17: Comparison of average SNR enhancement when HOME noise is added to the speech signals from the AFVC database.
There is an arbitrariness in the sign of the estimated sources upon inversion. The sign change between samples of the estimated speech and samples of the original speech in an ICA algorithm is resolved by multiplying all samples of the estimated speech signal by -1 if the maximum correlation coefficient between the estimated and original speech has a negative sign. Results of some experiments comparing speech enhancement based on the ICA with single channel speech enhancement algorithms were published at the 16th Australian International Conference on Speech Science and Technology, in the conference paper titled “Comparison of speech enhancement algorithms for forensic applications” [4]. From these figures, the following points can be concluded:
1. When the speech signals were mixed with CAR, STREET and HOME noises at SNRs ranging from -10 to 10 dB, the ICA algorithm significantly improved the average SNR compared with spectral subtraction and wavelet threshold techniques. The ICA algorithm relies on statistical independence to separate the speech from the noisy speech signals. In contrast, the single channel speech enhancement algorithms (wavelet threshold techniques and spectral subtraction) are based on thresholding the noisy speech in each high-frequency sub band or on subtracting a fixed amount of the noise spectrum in spectral subtraction. The enhanced speech signals may be distorted by a single channel speech enhancement algorithm because some speech information, such as unvoiced speech, is treated as noise and removed.
2. Level dependent wavelet threshold techniques give higher average SNR enhancement than the spectral subtraction algorithm at low SNRs ranging from -10 dB to 0 dB, because environmental noise (CAR, STREET and HOME noises) is not uniformly distributed over the entire frequency range. Thresholding the detail coefficients in each high frequency sub band by using a level dependent wavelet threshold therefore improves the average SNR enhancement at low SNRs.
3. Level dependent thresholding and spectral subtraction fail to remove environmental noise at SNRs ranging from 5 to 10 dB because the power spectral densities of environmental noise are concentrated at certain frequencies. Using a fixed over-subtraction parameter in spectral subtraction, or thresholding all high frequency sub bands of the noisy speech signal at low levels of noise, leads to distortion of the enhanced speech signal.
4.9 Chapter Summary
This chapter presented ICA as a multiple channel speech enhancement algorithm.
A brief description of the fundamental and mathematical frameworks of ICA was
presented. As a part of this discussion, the two important preprocessing steps,
centring and whitening, were thoroughly described. The fast ICA algorithm was
also discussed in detail.
In real forensic scenarios, forensic audio recordings are often corrupted by various
types of environmental noise. Since the quality of these recordings is poor, it is
difficult to use these recordings directly in forensic speaker verification to recog-
nize the identity of the criminal by their voice. Thus, speech enhancement may
reduce the effect of environmental noise and improve forensic speaker verification
performance.
Simulation results for a single channel (spectral subtraction and wavelet threshold
techniques) and speech enhancement based on the ICA algorithm were also given
in this chapter. Part of the comparison results of speech enhancement based on the ICA with single channel speech enhancement algorithms was published in a conference paper titled “Comparison of speech enhancement algorithms for forensic applications” [4]. The results demonstrated that speech enhancement
based on the ICA algorithm significantly improved average SNR compared with
single channel speech enhancement algorithms for different levels and types of
environmental noise. The enhanced speech signals from the ICA algorithm could
improve forensic speaker verification performance under real-world situations in
the presence of various types of environmental noise. Thus, the ICA algorithm
will be used as the front-end of speech enhancement in forensic speaker verification
systems in Chapter 6.
Chapter 5
Robust feature extraction
techniques
5.1 Introduction
The performance of speaker verification systems is often degraded in real forensic
applications for two main reasons: noise and reverberation conditions. Forensic
speech samples are often corrupted by various types of environmental noise in
real forensic scenarios [89]. For instance, a criminal may use a mobile phone to
commit a criminal offence, and the surveillance recordings are often corrupted
by various types of environmental noise. The performance of speaker verification
drops dramatically in the presence of high levels of noise [73, 94]. In conditions
of reverberation, the speech signal is often combined with a multiple reflection
version of the original speech due to the surrounding room environment [40].
The reverberated speech can be represented by convolving the impulse response
of the room with the original speech signal. The presence of reverberation dis-
torts feature vectors and decreases speaker verification performance because of
the mismatched conditions between the enrolment model and verification speech
signals [121, 122].
A number of feature channel compensation techniques, such as CMS [37], CMVN
[139], and RASTA processing [49] have been used to improve the performance of
speaker verification systems. However, these techniques are less effective for non-
stationary additive distortion and reverberation environments [40, 56]. Pelecanos
et al. [106] introduced a feature warping technique to speaker verification to
compensate for the effect of additive noise and linear channel mismatch in the
feature domain. This technique maps the distribution of the cepstral features
into a standard normal distribution. Feature warping provides a robustness to
noise while retaining the speaker-specific information that is lost when using other
channel compensation techniques such as CMS, CMVN, and RASTA processing
[59].
Multiband feature extraction techniques have been used for feature extraction in noisy speaker recognition systems [1, 21, 72, 86, 123], achieving better performance than traditional MFCC features. Multiband feature techniques are based on combining the MFCC features of the noisy speech signals and the MFCC extracted from the DWT in a single feature vector.
A combination of feature warping with DWT-MFCC and MFCC features of the
speech signal improves forensic speaker verification performance under noisy and
reverberant conditions for three main reasons. Firstly, feature warping is more
robust to additive noise and reverberation conditions compared with traditional
MFCC and other feature compensation techniques [58, 106]. Secondly, rever-
beration affects low frequency more than high-frequency sub bands, since the
boundary materials used in most rooms are less absorptive at low frequency sub
bands. The DWT can be used to extract more features from the low frequency
sub bands [95]. These features add some important features to the full band of the
feature-warped MFCC. Thus, fusion of feature warping with DWT-MFCC and
MFCC features of the reverberated signals may achieve better forensic speaker
verification performance than full band feature-warped MFCC in the presence of
reverberation conditions. Finally, DWT can be a useful tool to decrease the effect
of noise. The feature-warped MFCC extracted from the DWT can be combined
with the feature-warped MFCC extracted from the full band of the noisy speech
signal itself to obtain a large feature vector suitable for speaker recognition in the
presence of noise.
In this chapter, we investigate the effectiveness of combining the features of DWT-
MFCC and MFCC of speech signals with and without feature warping to improve
i-vector speaker verification performance under noisy conditions only. Different
individual and concatenative feature extraction techniques are used to evaluate
the modern i-vector forensic speaker verification performance in the presence of
various types of environmental noise. A new fusion of feature warping with DWT-
MFCC and MFCC is proposed in this chapter for improving forensic speaker
verification performance in the presence of noise, reverberation, as well as noisy
and reverberant conditions.
This chapter is divided into several sections. Section 5.2 presents the combina-
tion of DWT-MFCC and MFCC feature warping. Experimental methodology is
described in Section 5.3. The results and discussion are presented in Section 5.4.
5.2 Combination of DWT and MFCC FW
The technique for extracting the features is based on the multi-resolution property
of the DWT. The MFCC features were computed over Hamming windowed frames
of 30 msec size and 10 msec shift to discard the discontinuities at the edges of the frame.

Figure 5.1: Extraction and fusion of DWT-MFCC and MFCC features with and without feature warping (FW).

The MFCC was obtained using a mel-filterbank of 32 channels followed
by a transformation to the cepstral domain. The 13-dimensional MFCC features,
with appended delta (∆) and double delta (∆∆) coefficients, were extracted
from the full band of the noisy speech signals. Feature warping with a 301 frame
window was applied to the features extracted from the MFCC. The DWT was
applied to decompose the noisy speech signals into two frequency sub bands: the
approximation (low-frequency sub band) and the detail (high frequency sub band)
coefficients. The decomposition process can be repeated by applying the DWT
to the approximation coefficients. The approximation and detail coefficients were
combined into a single vector. The feature-warped MFCC was then used to
extract features from the single feature vector of the DWT.
Table 5.1: Summary of feature extraction labels.

Sub-branch label        | Feature extraction label
1 A                     | MFCC
1 B                     | MFCC (FW)
2 A                     | DWT-MFCC
2 B                     | DWT-MFCC (FW)
Fusion of 1 A and 2 A   | Fusion (no FW)
Fusion of 1 B and 2 B   | Fusion (both FW)
Table 5.2: Description of the number of features extracted corresponding to each feature extraction label.

Feature extraction label | Number of features
MFCC                     | 39
MFCC (FW)                | 39
DWT-MFCC                 | 39
DWT-MFCC (FW)            | 39
Fusion (no FW)           | 78
Fusion (both FW)         | 78
In this thesis, the effect of feature warping on DWT-MFCC and MFCC features is investigated, both individually and in a concatenative fusion of these features, in the presence of various types of environmental noise, reverberation, and noisy and reverberant conditions, as shown in Figure 5.1. The feature extraction techniques can be used to train the state-of-the-art i-vector PLDA speaker verification systems and are described in Table 5.1.

Tables 5.1 and 5.2 summarize the feature extraction labels and the number of features extracted for each label. The symbol (FW) in Tables 5.1 and 5.2 is the acronym for feature warping.
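A sketch of the proposed fusion front-end is given below, assuming the librosa, PyWavelets and scipy packages and a simple rank-based implementation of feature warping; the exact thesis implementation is not published here, so details such as frame alignment between the two branches are illustrative assumptions.

```python
import numpy as np
import pywt
import librosa
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Map each coefficient's rank inside a sliding window of `win` frames
    to the corresponding standard-normal quantile (feature warping)."""
    T = feats.shape[0]
    half = win // 2
    warped = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        ctx = feats[lo:hi]                          # local context window
        rank = (ctx < feats[t]).sum(axis=0) + 1     # rank of centre frame
        warped[t] = norm.ppf((rank - 0.5) / ctx.shape[0])
    return warped

def mfcc_fw(sig, sr):
    """39-dimensional feature-warped MFCC (13 MFCC + delta + delta-delta)."""
    m = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13, n_mels=32,
                             win_length=int(0.03 * sr),
                             hop_length=int(0.01 * sr))
    f = np.vstack([m, librosa.feature.delta(m),
                   librosa.feature.delta(m, order=2)])
    return feature_warp(f.T)                        # frames x 39

def fusion_both_fw(sig, sr, level=3):
    """Fusion (both FW): feature-warped MFCC of the full band concatenated
    with feature-warped MFCC of the concatenated DWT coefficients."""
    coeffs = pywt.wavedec(sig, 'db8', level=level)      # [CA3, CD3, CD2, CD1]
    dwt_sig = np.concatenate(coeffs).astype(np.float32) # single vector
    a = mfcc_fw(sig, sr)
    b = mfcc_fw(dwt_sig, sr)
    T = min(len(a), len(b))                             # align frame counts
    return np.hstack([a[:T], b[:T]])                    # frames x 78
```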
5.3 Experimental methodology
The i-vector based experiments were evaluated using the clean speech signals
from the AFVC database. A UBM with 256 Gaussian components was used in
our experimental results. The UBMs were trained on telephone and microphone recordings from 348 speakers from the AFVC database. These UBMs were used to compute the Baum-Welch statistics before training a total-variability subspace of dimension 400. The total-variability subspace was used to compute the i-vector speaker representation, and the i-vector dimension was reduced to 200 using LDA. I-vector length normalization, implemented by centering and whitening of the i-vectors, was applied before GPLDA modelling [43].
based forensic speaker verification was used in these experimental results instead
of HTPLDA because it was found that the length normalized GPLDA gives a sim-
ilar performance with less computational complexity than HTPLDA [43]. The
length normalized GPLDA technique is computationally efficient compared with
HTPLDA because the length normalized technique is used to transform the dis-
tribution of the i-vectors from heavy-tailed to Gaussian [61]. The performance of
the i-vector PLDA speaker verification systems was evaluated using the Microsoft
Research (MSR) identity toolbox [119].
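The centering and whitening steps of i-vector length normalization can be sketched as follows (a numpy illustration of the standard recipe, not the MSR toolbox code):

```python
import numpy as np

def length_normalize(ivectors):
    """Centering, whitening and unit-length projection of i-vectors prior
    to GPLDA modelling; ivectors has shape (N, dim), here dim = 200."""
    X = ivectors - ivectors.mean(axis=0)            # centering
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))  # covariance eigendecomposition
    Z = X @ (E @ np.diag(d ** -0.5) @ E.T)          # whitening
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # length normalization
```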
5.4 Results and discussion
This section describes the effectiveness of the fusion of features of DWT-MFCC
and MFCC with and without feature warping on the forensic speaker verifica-
tion performance under noisy conditions. A new fusion of feature warping with
DWT-MFCC and MFCC is proposed for improving forensic speaker verification
performance in the presence of reverberation and noisy and reverberant condi-
tions. The modern length-normalized GPLDA was used as a classifier in all
experimental results. The performance of speaker verification systems was eval-
uated using the EER and the mDCF, calculated using $C_{miss} = 10$, $C_{fa} = 1$, and $P_{target} = 0.01$.
5.4.1 Noisy conditions
This section describes the performance of the fusion of features of DWT-MFCC
and MFCC with and without feature warping in the presence of STREET, CAR
and HOME noises only. The effect of decomposition level, wavelet families and
utterance duration on the performance of fusion of feature warping with DWT-
MFCC and MFCC-based speaker verification systems will also be described in
the next sections.
5.4.1.1 Effect of decomposition level
This experiment evaluated the effect of decomposition level on the performance
of fusion of feature warping with DWT-MFCC and MFCC features. The full
duration of interview recordings from 200 speakers was kept under clean condi-
tions using pseudo-police style, while 10 sec of the surveillance recordings from
200 speakers using informal telephone conversation style were corrupted with a
random session of STREET, CAR and HOME noises at SNRs ranging from -10
dB to 10 dB. The interview and noisy surveillance recordings were decomposed
into 2, 3, and 4 levels using db8 DWT.
Figure 5.2 shows the effect of the decomposition levels on the performance of
fusion of feature warping with DWT-MFCC and MFCC features in the presence
of various types of environmental noise. It was found that increasing the number
Figure 5.2: Effect of the decomposition levels on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of noise.
of levels to more than three over the majority of SNR values in the presence
of various types of environmental noise degraded the speaker verification perfor-
mance. In this case, the number of samples in the lowest frequency sub bands was
so low that the essential characteristics of the speech signals could not be esti-
mated accurately by the classifier [21]. The results of effect of decomposition level
on the performance of fusion of feature warping with DWT-MFCC and MFCC
were published in IEEE Access journal in a paper entitled “ Enhanced forensic
speaker verification using a combination of DWT and MFCC feature warping in
the presence of noise and reverberation conditions” [5].
5.4.1.2 Effect of wavelet family
This experiment evaluated the effect of wavelet family on the performance of fu-
sion of feature warping with DWT-MFCC and MFCC features. The full duration
of interview recordings from the pseudo-police style was kept under clean conditions, while 10 sec of the surveillance speech signals were mixed with STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB. The interview and surveillance recordings were decomposed into three levels using different Daubechies wavelet families (db2, db4, and db8).

Figure 5.3: Average EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families in the presence of different types of environmental noise.
The average EER for fusion of feature warping with DWT-MFCC and MFCC can be calculated by computing the mean of the EER over the different types of environmental noise at each noise level, as shown in Figure 5.3. It is clear from this
figure that the performance of fusion of feature warping with DWT-MFCC and
MFCC using db8 achieved slight improvements in average EER compared with
other wavelet families in the presence of various types of environmental noise at
SNRs ranging from -10 dB to 10 dB. Since level 3 achieved better noisy forensic
speaker verification performance over the majority of SNR values, as described in
Section 5.4.1.1, and db8 achieved slight improvements in noisy forensic speaker
verification performance compared with other wavelet families, level 3 and db8
will be used in the feature extraction based on DWT in the presence of noise in
the next sections.
5.4.1.3 Comparison of feature extraction techniques
This experiment evaluated the performance of combining DWT-MFCC and
MFCC features with and without feature warping in the presence of various
levels of environmental noise only. The full length of interview recordings from
200 speakers using pseudo-police style was kept under clean conditions, while 10
sec of the surveillance recordings from 200 speakers using informal telephone con-
versation style were mixed with random sessions of STREET, CAR and HOME
noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB.
Figure 5.4 compares speaker verification performance using different feature ex-
traction techniques in the presence of environmental noise at various SNR values.
The following points were concluded from this figure:
• The performance of forensic speaker verification systems achieves significant improvements in EER over the majority of SNR values when applying feature warping to the MFCC features in the presence of various types of environmental noise (blue solid vs blue dash).
• Feature warping did not improve forensic speaker verification performance when DWT-MFCC was used as the feature extraction, whereas speaker verification performance was improved by applying feature warping to the traditional MFCC features (red solid vs blue solid). The major drawback of using DWT-MFCC (FW) as the feature extraction is that it loses some important correlation information between sub band features. The lack of correlation information between sub band features decreased the performance of speaker recognition systems [90].
Figure 5.4: Comparison of speaker verification performance using different feature extraction techniques in the presence of various types of environmental noise.
• Fusion of feature warping with DWT-MFCC and MFCC features achieves greater improvements in EER than fusion without any feature warping in the presence of various levels of environmental noise (green solid vs green dash).
• Fusion of feature warping with DWT-MFCC and MFCC significantly improved EER over traditional MFCC features in the presence of various types and levels of environmental noise (green solid vs blue dash). The reduction
Figure 5.5: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC in the presence of various types of environmental noise. Higher average reduction in EER indicates better performance.
in EER for fusion of feature warping with DWT-MFCC and MFCC at 0
dB SNR is 48.28%, 30.27%, and 41.17%, respectively, over MFCC features
in the presence of random sessions of CAR, STREET and HOME noises.
The reduction in EER between two systems A and B can be defined as:

$$\mathrm{EER}_{red} = \frac{\mathrm{EER}_A - \mathrm{EER}_B}{\mathrm{EER}_A}, \qquad (5.1)$$
where EERA is the equal error rate for system A and EERB is the equal error
rate for system B. The average reduction in EER between feature-warped MFCC
and fusion of feature warping with DWT-MFCC and MFCC can be computed by
calculating the mean of EERred for various types of environmental noise at each
noise level.
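For reference, the EER and the relative reduction of Equation 5.1 can be computed from genuine and impostor score arrays as in the following sketch (a simple threshold sweep; not the thesis code):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal error rate: sweep the decision threshold over all scores and
    return the point where false-accept and false-reject rates coincide."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def eer_reduction(eer_a, eer_b):
    """Relative reduction in EER of system B over system A (Eq. 5.1)."""
    return (eer_a - eer_b) / eer_a
```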
Figure 5.5 shows average reduction in EER for fusion of feature warping with
DWT-MFCC and MFCC over feature-warped MFCC features in the presence of
Figure 5.6: mDCF for feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC features in the presence of various types of environmental noise.
various types of environmental noise for each noise level. The results show that
fusion of feature warping with DWT-MFCC and MFCC achieves a reduction in
average EER over feature-warped MFCC features in the presence of various types
of environmental noise at SNRs ranging from -10 dB to 10 dB. At 0 dB SNR, the
average reduction in EER for fusion of feature warping with DWT-MFCC and
MFCC was 21.33% over feature-warped MFCC. The results comparing forensic
speaker verification performance using different feature extraction techniques in
the presence of various types of environmental noise were published in IEEE
Access in the journal paper titled “ Enhanced forensic speaker verification using
a combination of DWT and MFCC feature warping in the presence of noise and
reverberation conditions” [5].
We compare the mDCFs between feature-warped MFCC and fusion of feature
warping with DWT-MFCC and MFCC because these feature extraction tech-
niques achieved significant improvements in the performance of forensic speaker
verification compared with other feature extraction techniques. Figure 5.6 shows
the mDCFs of feature-warped MFCC and fusion of feature warping with DWT-
MFCC and MFCC in the presence of various types of environmental noise. It
is clear from this figure that fusion of feature warping with DWT-MFCC and
MFCC features achieves improvement in mDCF over feature-warped MFCC in
the presence of CAR, STREET and HOME noises at SNRs ranging from -10 dB
to 10 dB.
5.4.1.4 Fusion of feature warping with DWT-MFCC and MFCC
Based on the results obtained from the effectiveness of fusion of features of DWT-
MFCC and MFCC with and without using feature warping in the presence of var-
ious types of environmental noise, the new technique of fusion of feature warping
with DWT-MFCC and MFCC is proposed. This technique is based on decom-
posing the noisy speech signals into approximation and detail coefficients using
the DWT. The approximation and detail coefficients are fused into a single vector. Feature-
warped MFCC is used to extract the essential characteristics of the noisy speech
signals from the DWT. Then, feature-warped MFCC is used to extract the fea-
tures from the full band of the noisy speech signals. Fusion of feature warping
with DWT-MFCC and MFCC can be obtained by concatenating the feature-
warped MFCC extracted from the DWT and feature-warped MFCC extracted
from the full band of the noisy speech signals into a single feature vector, as
shown in Figure 5.7.
5.4.1.5 Performance comparison to related work
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC features with other feature extraction techniques used in
Figure 5.7: Fusion of feature warping with DWT-MFCC and MFCC.
[86, 87, 106]. In order to evaluate the proposed technique against the feature extraction techniques of [86, 87, 106], the interview recordings were obtained at full length from 200 speakers using pseudo-police style, and these data were kept under clean conditions. The surveillance recordings were obtained as 10 sec segments from 200 speakers using informal telephone conversation style. The surveillance recordings were corrupted by a random session of CAR, STREET and HOME noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB. The noisy speech signals were decomposed into three levels using db8 of the DWT, and the decomposed speech signals were used to extract the features of the techniques in [86, 87]. These feature extraction techniques were used to train the
Figure 5.8: Comparison of average EER for the proposed fusion of feature warping with DWT-MFCC features with other feature extraction techniques.
modern length-normalized GPLDA framework.
The average EER can be calculated by computing the mean of EER for various
types of environmental noise at each noise level for each feature extraction tech-
nique. Figure 5.8 shows a comparison of average EER for the proposed fusion of
feature warping with DWT-MFCC and MFCC features with other feature extrac-
tion techniques. It shows that the fusion of feature warping with DWT-MFCC
and MFCC features achieves improvements in average EER over other feature
extraction techniques in the presence of various types of environmental noise at
SNRs ranging from -10 dB to 10 dB.
5.4.1.6 Effect of utterance length
In these simulation results, the full duration of the interview recordings was kept
under clean conditions, while the duration of the surveillance recordings was
Figure 5.9: Effect of the utterance length on the performance of feature-warped MFCC in the presence of environmental noise.
changed from 10 sec to 40 sec. The surveillance recordings were corrupted with
random segments of STREET, CAR and HOME noises at SNRs ranging from
-10 dB to 10 dB. Since the performances of forensic speaker verification based
on fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC under noisy conditions achieved improvements in EER compared with
other feature extraction techniques, as described in Section 5.4.1.3, the effect of
utterance length was evaluated on the performance of fusion of feature warping
with DWT-MFCC and MFCC and feature-warped MFCC in this section.
Figure 5.9 shows the effect of utterance length on the performance of feature-
warped MFCC under environmental noise. It is clear that as the utterance length
increased, the performance of forensic speaker verification based on the feature-
warped MFCC improved in the presence of different levels of STREET, CAR and
HOME noises.
Figure 5.10 shows the effect of utterance length on the performance of the fusion of
Figure 5.10: Effect of the utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise.
feature warping with DWT-MFCC and MFCC features in the presence of various
types of environmental noise. It is clear that increasing the utterance duration
improved forensic speaker verification performance in the presence of STREET,
CAR and HOME noises at SNRs ranging from -10 dB to 10 dB.
The reduction in EER between feature-warped MFCC and fusion of feature warp-
ing with DWT-MFCC and MFCC can be calculated using Equation 5.1 when 40
seconds of the surveillance recordings were mixed with different types and lev-
els of environmental noise. The average reduction in EER can be computed by
calculating the mean of EER for different types of environmental noise at each
noise level. Figure 5.11 shows the average reduction in EER for fusion of fea-
ture warping with DWT-MFCC and MFCC over feature-warped MFCC when 40
seconds of the surveillance recordings were mixed with various types of environ-
mental noise at SNRs ranging from -10 dB to 10 dB. The results show that the
performance of fusion of feature warping with DWT-MFCC and MFCC improved
Figure 5.11: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when 40 sec of the surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
average EER by 1.02% to 52.72% compared with feature-warped MFCC in the
presence of various types of environmental noise at SNRs ranging from -10 dB to
10 dB.
Figure 5.12 shows the average reduction in EER for fusion of feature warping
with DWT-MFCC and MFCC features when the duration of the surveillance
recordings increased from 10 sec to 40 sec. At 0 dB SNR, the performance of fusion of feature warping with DWT-MFCC and MFCC features achieved an average reduction in EER of 17.92% when the duration of the surveillance recordings increased from 10 sec to 40 sec. Some results of the effect of utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise were published in the IEEE Access journal in a paper entitled “Enhanced forensic speaker verification using
a combination of DWT and MFCC feature warping in the presence of noise and
reverberation conditions” [5].
Figure 5.12: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when the duration of the surveillance recordings increased from 10 sec to 40 sec.
5.4.2 Reverberation conditions
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC with different feature extraction techniques under reverbera-
tion conditions only. The effect of decomposition level, wavelet family, utterance
length, reverberation time, and suspect and microphone position on the perfor-
mance of forensic speaker verification will also be investigated.
5.4.2.1 Effect of decomposition level
The effect of the decomposition level on the performance of fusion of feature warp-
ing with DWT-MFCC and MFCC was evaluated using different decomposition
levels. The impulse response of a room was computed by using reverberation time
(T20= 0.15 sec). Full duration of the interview recordings was obtained from 200
speakers using pseudo-police style. The VAD algorithm [130] was used to remove
Figure 5.13: Effect of decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only.
silent regions from the interview recordings. The surveillance recordings were ob-
tained from 10 sec duration from 200 speakers using informal telephone conver-
sation style after applying the VAD algorithm. Each of the interview recordings
was convolved with the impulse room response to generate reverberated speech,
while a 10 sec duration of the surveillance recordings was kept under clean con-
ditions. The first configuration of the room is used in this experiment, as shown
in Table 3.1.
In this experiment, db8 of the DWT and different decomposition levels (2, 3,
and 4) were used to investigate the effect of the decomposition levels on the
performance of fusion of feature warping with DWT-MFCC and MFCC under
reverberation conditions only. Figure 5.13 shows the effect of decomposition level
on the performance of fusion of feature warping with DWT-MFCC and MFCC
under reverberation conditions only.
It was found, as illustrated in Figure 5.13, that level 2 achieves better improve-
ment in performance than other decomposition levels. Reverberation often affects
low frequencies more than high frequencies, since the materials used in the most
common design of rooms are less absorptive at low frequencies, leading to longer
reverberation times and more distortion of spectral information at those frequen-
cies [95]. Thus, the performance of speaker verification in reverberation envi-
ronments improved by increasing the number of coefficients at a low frequency
using two levels of decomposition. The results of the effect of decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in the IEEE Access journal in a paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.4.2.2 Effect of wavelet family
This section describes the effect of wavelet family on the performance of fusion
of feature warping with DWT-MFCC and MFCC features under reverberation
conditions. Each of the interview recordings was convolved with the impulse
response of the room to generate reverberated speech, while 10 sec duration of the
surveillance recordings remained under clean conditions. The first configuration
of the room is used in this experiment, as shown in Table 3.1.
In this experiment, level 2 of DWT and different Daubechies families (db2, db4,
and db8) were used to investigate the effect of wavelet family on the performance
of fusion of feature warping with DWT-MFCC and MFCC under reverberation
conditions, as shown in Figure 5.14. It was found, as illustrated in Figure 5.14,
that db8 achieved slight improvements in EER compared with other wavelet fam-
ilies. Since level 2 achieved better forensic speaker verification performance under
reverberation conditions only, as described in Section 5.4.2.1, and db8 achieved
Figure 5.14: EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families in the presence of reverberation conditions only.
slight improvements in forensic speaker verification performance compared with
other wavelet families, level 2 and db8 will be used in the feature extraction based
on DWT in the presence of reverberation in the next sections.
5.4.2.3 Comparison of feature extraction techniques
The performance of fusion of feature warping with DWT-MFCC and MFCC
was evaluated and compared with various feature extraction techniques in the
presence of reverberation. The full duration of interview recordings from 200
speakers using pseudo-police style was reverberated at 0.15 sec reverberation
time, while a 10 sec portion of the surveillance recordings from 200 speakers
using informal telephone conversation style was kept under clean conditions. The
first configuration of the room is used in this experiment, as shown in Table 3.1.
Figure 5.15 shows a comparison of fusion of feature warping with DWT-MFCC and
MFCC with different feature extraction techniques in the presence of reverbera-
tion conditions. It is clear from this figure that the performance of forensic speaker
verification under reverberation conditions significantly improved EER when fea-
ture warping was applied to MFCC features. The performance of speaker veri-
fication based on the sub band features (DWT-MFCC and DWT-MFCC (FW))
degraded in the presence of reverberation compared with traditional MFCC be-
cause some important information was lost between sub band features. The
results also demonstrated that a fusion of feature warping with DWT-MFCC and
MFCC features improves speaker verification performance over other feature ex-
traction techniques. The performance of speaker verification based on fusion of
feature warping with DWT-MFCC and MFCC significantly improved EER over
traditional MFCC features. The reduction in EER for fusion of feature warping
with DWT-MFCC and MFCC was 47.00% over traditional MFCC features. Fu-
sion of feature warping with DWT-MFCC and MFCC reduced EER by 20.00%
over feature-warped MFCC. The results of comparing fusion of feature warping with DWT-MFCC and MFCC with different feature extraction techniques under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Table 5.3 shows a comparison of speaker verification performance based on mDCF
using different feature extraction techniques in the presence of reverberation at
0.15 sec. It shows that a fusion of feature warping with DWT-MFCC and MFCC
features improves the performance of mDCF over other feature extraction tech-
niques in the presence of reverberation. The mDCF for fusion of feature warping
with DWT-MFCC and MFCC features was reduced by 23.73% compared to feature-
warped MFCC.
Figure 5.15: Comparison of fusion of feature warping with DWT-MFCC and MFCC with different feature extraction techniques in the presence of reverberation.
Table 5.3: Comparison of speaker verification performance based on mDCF using different feature extraction techniques in the presence of reverberation at 0.15 sec.

Feature extraction technique | mDCF
MFCC                         | 0.0600
MFCC (FW)                    | 0.0434
DWT-MFCC                     | 0.0690
DWT-MFCC (FW)                | 0.070
Fusion (no FW)               | 0.0480
Fusion (both FW)             | 0.0331
5.4.2.4 Effect of reverberation time
This experiment evaluated the effect of reverberation time on the performance
of fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC because these feature extraction techniques achieved high improvements
in EER over other feature extraction techniques under reverberation conditions,
as described in the previous section. The impulse response of the room was
computed by using the following reverberation times: T20= 0.15 sec, 0.20 sec,
Figure 5.16: Effect of reverberation time on the performance of feature-warped MFCC.
and 0.25 sec. Each full duration of interview recordings was convolved with
impulse response of the room to generate reverberated interview data at different
reverberation times, while 10 sec surveillance recordings were maintained under
clean conditions. The first configuration of the room is used in this experiment,
as shown in Table 3.1.
Figure 5.16 shows the effect of reverberation time on the performance of feature-
warped MFCC. It is clear from this figure that the performance of forensic speaker
verification based on feature-warped MFCC degraded when the reverberation
time increased from 0.15 sec to 0.25 sec.
Figure 5.17 shows the effect of reverberation time on the performance of fusion
of feature warping with DWT-MFCC and MFCC. It shows that an increase in
reverberation time led to degraded speaker verification performance. When the
reverberation time increased from 0.15 sec to 0.25 sec, there was a degradation
of 34.42% in the performance of fusion of feature warping with DWT-MFCC
Figure 5.17: Effect of reverberation time on the performance of fusion of feature warping with DWT-MFCC and MFCC.
and MFCC. Reverberation adds more inter-frame distortion to the cepstral features as the reverberation time is increased; therefore, increasing the reverberation time leads to decreased speaker verification performance [121]. The results of the effect of reverberation time on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.18 shows the reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when the interview recordings were reverberated at different reverberation times. The results showed that speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC reduced EER by 20% to 21.27% compared with feature-warped MFCC when the interview recordings were reverberated at reverberation times ranging from 0.15 sec to 0.25 sec. Reverberation affects low frequency more than high
Figure 5.18: Reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when interview recordings were reverberated at different reverberation times.
frequency sub bands, and it smears spectral information of the speech signal at
low frequency sub bands [95]. The DWT-MFCC (FW) can be used to extract
more features from the low frequency sub bands. These features add some im-
portant features to the full band of the feature-warped MFCC. Thus, fusion of
feature warping with DWT-MFCC and MFCC improves forensic speaker verifi-
cation performance more than feature-warped MFCC in the presence of different
reverberation times.
5.4.2.5 Effect of utterance duration
This section describes the effect of varying utterance duration on the performance
of fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC techniques in the presence of reverberation conditions only. The full duration of the interview recordings from 200 speakers using the pseudo-police style was reverberated at 0.15 sec using the first configuration of the room, as shown in Table 3.1, while the duration of the surveillance recordings from 200 speakers using the informal telephone conversation style was changed from 10 sec to 40 sec.

Figure 5.19: Effect of changing the utterance length of the surveillance recordings on the performance of feature-warped MFCC under reverberation conditions (EER % vs. surveillance duration in sec).
Figure 5.19 shows the effect of changing the utterance length of the surveillance
recordings on the performance of feature-warped MFCC in the presence of rever-
beration conditions only. The results demonstrated that forensic speaker verifica-
tion based on the feature-warped MFCC improved performance when the length
of the surveillance recordings increased from 10 sec to 40 sec.
Figure 5.20 shows the effect of changing the surveillance utterance duration on
the performance of fusion of feature warping with DWT-MFCC and MFCC in the
presence of reverberation conditions only. The results showed that as the surveil-
lance utterance length increased, the performance of fusion of feature warping
with DWT-MFCC and MFCC improved. The EER was reduced by 46.04% when the duration of the surveillance recordings increased from 10 sec to 40 sec.

Figure 5.20: Effect of changing the surveillance utterance duration on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions (EER % vs. surveillance duration in sec).

The results of the effect of utterance length on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5]. A comparison of Figures 5.19 and 5.20 shows that fusion of feature warping with DWT-MFCC and MFCC improved EER compared with feature-warped MFCC when the length of the surveillance recordings increased from 10 sec to 40 sec under reverberation conditions only.
Figure 5.21: Position of suspect and microphones in a room. All microphones and the suspect are at 1.3 m height, and the height of the room is 2.5 m.
5.4.2.6 Effect of suspect and microphone position
In this experiment, the full duration of interview recordings from 200 speakers
using pseudo-police style was reverberated at 0.15 sec, while 10 sec of surveillance
recordings from 200 speakers using informal telephone conversation style was
kept under clean conditions. The position of the suspect was not changed and
four different positions of the microphone were used to investigate the effect of
suspect/microphone position on the performance of feature-warped MFCC and
fusion of feature warping with DWT-MFCC and MFCC. The configuration of
suspect/microphone used in these experimental results is shown in Figure 5.21
and Table 5.4.
Figures 5.22 and 5.23 show the effect of suspect/microphone positions on the
performance of feature-warped MFCC and fusion of feature warping with DWT-
MFCC and MFCC. The results demonstrate that changing the distance between
the suspect and microphone affects the performance of feature-warped MFCC
and fusion of feature warping with DWT-MFCC and MFCC.

Table 5.4: Reverberation test room parameters.

Configuration | Suspect position (xs, ys, zs) | Microphone position (xm, ym, zm)
1 | (2, 1, 1.3) | (1.5, 1, 1.3)
2 | (2, 1, 1.3) | (2.4, 1, 1.3)
3 | (2, 1, 1.3) | (2.8, 1, 1.3)
4 | (2, 1, 1.3) | (2.8, 2.5, 1.3)

Configuration 2, which has the shortest distance between the suspect and microphone, achieved the greatest improvement in EER compared with the other configurations. The performance of feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC decreased when the distance between the suspect and
microphone increased. The impulse response of the room consists of early and late reflections. The characteristics of the early reflections, typically within 50 msec after the arrival of the direct sound, depend strongly on the suspect/microphone positions [144]. As the distance between the suspect and microphone increases, the duration of the early reflections can increase, leading to greater spectral alteration of the original speech signal. Thus, the performance of forensic speaker verification based on feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC degrades as the distance between the suspect and microphone increases.
A comparison of the results in Figures 5.22 and 5.23 shows that fusion of feature warping with DWT-MFCC and MFCC achieves improvements in EER compared with feature-warped MFCC when the position of the microphone changes relative to the suspect. Some results of the effect of suspect/microphone position on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.22: Effect of microphone and suspect position configuration on the performance of feature-warped MFCC (EER % for Configurations 1-4).

Figure 5.23: Effect of microphone and suspect position configuration on the performance of fusion of feature warping with DWT-MFCC and MFCC.
5.4.3 Noisy and reverberant conditions
The performance of fusion of feature warping with DWT-MFCC and MFCC was
evaluated and compared with speaker verification based on traditional MFCC and
feature-warped MFCC under noisy and reverberant conditions. As described in
Sections 5.4.1.3 and 5.4.2.3, forensic speaker verification based on DWT-MFCC
and DWT-MFCC (FW) degraded compared with other feature extraction tech-
niques in the presence of various types of environmental noise only and rever-
beration conditions. Therefore, forensic speaker verification based on the DWT-
MFCC and DWT-MFCC (FW) will not be investigated in these experimental
results. The effect of decomposition level, wavelet family, and utterance length will also be discussed in this section.

Figure 5.24: Effect of the decomposition levels (2, 3, 4, and 5) on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of reverberation and various types of environmental noise (EER % vs. SNR for STREET, CAR, and HOME noise).
5.4.3.1 Effect of decomposition level
The effect of the decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC was evaluated using the db8 wavelet and different decomposition levels (2, 3, 4, and 5). The full duration of the interview recordings from 200 speakers using the pseudo-police style was reverberated at 0.15 sec using the first configuration of the room, as shown in Table 5.4. The 10 sec surveillance recordings from 200 speakers using the informal telephone conversation style were corrupted with different segments of CAR, STREET and HOME noises from the QUT-NOISE database [29] at SNRs ranging from -10 dB to 10 dB.
Figure 5.24 shows the effect of the decomposition levels on the performance of
fusion of feature warping with DWT-MFCC and MFCC in the presence of rever-
beration and various types of environmental noises. It is clear from Figure 5.24
that level 4 achieves better performance in EER over the majority of SNR values
and different types of environmental noise. Some results of the effect of decomposition level on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.4.3.2 Effect of wavelet family
This section describes the effect of wavelet family on the performance of fusion
of feature warping with DWT-MFCC and MFCC under noisy and reverberant
conditions. The full duration of the interview recordings was reverberated at a 0.15 sec reverberation time. The 10 sec surveillance recordings, obtained using the informal telephone conversation style, were mixed with different sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB. The interview and surveillance recordings were decomposed into four levels using different Daubechies wavelet families (db2, db4, and db8). The first configuration of the room, as shown in Table 5.4, was used in this experiment.
The average EER for fusion of feature warping with DWT-MFCC and MFCC was computed by calculating the mean EER across the various types of environmental noise at each noise level, as shown in Figure 5.25. The performance of fusion of feature warping with DWT-MFCC and MFCC using db8 achieved slight improvements in average EER compared with the other wavelet families when the interview recordings were reverberated and the surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
Figure 5.25: Average EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families (db2, db4, db8) in the presence of reverberation and various types of environmental noise (average EER % vs. SNR).
In the following sections, level 4 and db8 are used for feature extraction based on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions, because this combination improved performance compared with the other decomposition levels and wavelet families.
5.4.3.3 Comparison of feature extraction techniques
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC with traditional MFCC and feature-warped MFCC in the
presence of reverberation and different types of environmental noise. In these
experimental results, the interview recordings were reverberated at 0.15 sec, and 10 sec of the surveillance recordings was mixed with different sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB. The first configuration of the room, shown in Table 5.4, was used in this experiment.

Figure 5.26: Comparison of speaker verification performance using different feature extraction techniques (MFCC, feature-warped MFCC, and fusion of feature warping with DWT-MFCC and MFCC) in the presence of environmental noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise).
Figure 5.26 shows a comparison of speaker verification performance using dif-
ferent feature extraction techniques in the presence of environmental noise and
reverberant conditions. Overall, the results show that fusion of feature warping with DWT-MFCC and MFCC achieves improvements in EER over feature-warped MFCC when the surveillance recordings were corrupted with random segments of STREET, CAR and HOME noises at various SNR values and the interview recordings were reverberated at 0.15 sec. The results also demonstrate that feature-warped MFCC achieves significant improvements in EER compared with traditional MFCC. The results of comparing speaker verification performance using different feature extraction techniques under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.27: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC in the presence of various types of environmental noise and reverberation conditions (average reduction in EER % vs. SNR).
Figure 5.27 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC features. The fusion of feature warping with DWT-MFCC and MFCC achieved significant improvements in average EER over traditional MFCC features when the surveillance recordings were corrupted with random sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB and the interview recordings were reverberated at 0.15 sec. The average reduction in EER for this technique ranged from 17.10% to 51.86% over traditional MFCC features in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The results of the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC under noisy and reverberant conditions were published in the IEEE International Conference on Signal and Image Processing Applications in the paper entitled “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6].
The average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC was computed by calculating the mean of the EER reduction between feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC for various types of environmental noise at each noise level in the presence of reverberation, as shown in Figure
5.28. The results demonstrate that fusion of feature warping with DWT-MFCC and MFCC outperforms feature-warped MFCC in average reduction of EER at SNRs ranging from -10 dB to 10 dB. At 0 dB SNR, the average reduction in EER of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise and reverberation conditions was 13.28% over feature-warped MFCC. Forensic speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC achieved a reduction in average EER compared with feature-warped MFCC under noisy and reverberant conditions for two main reasons. Firstly, the DWT has good time and frequency resolution for representing speech signals because it uses an adaptive window, and hence it can capture localized events in the speech signals [123]. In contrast, the feature-warped MFCC loses some information in the time domain by assuming that the speech signal is stationary within a given speech frame. Secondly, reverberation affects low-frequency sub bands more than high-frequency sub bands, since the boundary materials used in most rooms are less absorptive at low frequencies [95]. The DWT extracts more features from the low-frequency sub bands, and these features add important information to the features extracted from the feature-warped MFCC of the reverberated speech signals. Thus, fusion of feature warping with DWT-MFCC and MFCC improves forensic speaker verification performance in the presence of reverberation conditions.
Figure 5.28: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC in the presence of various types of environmental noise and reverberation conditions (average reduction in EER % vs. SNR).

Figure 5.29: mDCF for feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC features under noisy and reverberant conditions.

Figure 5.29 shows the mDCF of fusion of feature warping with DWT-MFCC and MFCC features and feature-warped MFCC in the presence of various types of environmental noise and reverberation conditions. It is clear from this figure that the fusion of feature warping with DWT-MFCC and MFCC features achieves improvements in mDCF over feature-warped MFCC in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
Figure 5.30: Effect of utterance length on the performance of feature-warped MFCC in the presence of noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise; surveillance durations of 10, 20, and 40 sec).
5.4.3.4 Effect of utterance length
In real forensic applications, long speech samples from a suspect are recorded in an interview room under reverberation conditions, while the surveillance recording is corrupted by environmental noise and its duration is uncontrolled. Thus, the full duration of the interview recordings was reverberated at a 0.15 sec reverberation time, while the duration of the surveillance recordings was changed from 10 sec to 40 sec. The surveillance recordings were mixed with random sessions of STREET, CAR and HOME noise at SNRs ranging from -10 dB to 10 dB. The first configuration of the room, as shown in Table 5.4, was used in this experiment. The effect of utterance length was evaluated on the performance of forensic speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC and on feature-warped MFCC, because these feature extraction techniques achieved greater improvements in performance than traditional MFCC features, as described in Section 5.4.3.3.

Figure 5.31: Effect of utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise; surveillance durations of 10, 20, and 40 sec).
Figure 5.30 shows the effect of utterance length on the performance of feature-
warped MFCC under noisy and reverberant environments. It is clear that the
performance of forensic speaker verification based on feature-warped MFCC im-
proved when the utterance length of the surveillance recordings increased from 10
sec to 40 sec at different levels and types of environmental noise and reverberation
conditions.
Figure 5.31 shows the effect of utterance length on the performance of fusion of
feature warping with DWT-MFCC and MFCC features in the presence of noise
and reverberation environments. It was found, as illustrated in Figure 5.31, that
the performance of speaker verification based on the fusion of feature warping
with DWT-MFCC and MFCC under noisy and reverberant conditions improved
when the duration of the surveillance recordings increased from 10 sec to 40 sec
at various types and levels of environmental noise.
Figure 5.32: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when interview recordings were reverberated at 0.15 sec and 40 sec of the surveillance recordings was corrupted by various types of noise at SNRs ranging from -10 dB to 10 dB (average reduction in EER % vs. SNR).
The reduction in EER between feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC can be calculated according to Equation 5.1 when the interview recordings were reverberated at 0.15 sec and 40 sec of the surveillance recordings was corrupted by different types of noise. The average reduction in EER is the mean across the various types of environmental noise at each noise level. Figure 5.32 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC under these conditions. The results demonstrate that fusion of feature warping with DWT-MFCC and MFCC improved average EER by 7.23% to 22.20% over feature-warped MFCC in the presence of environmental noise at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
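For reference, the following sketch shows how such figures can be computed, assuming Equation 5.1 defines the relative reduction in EER between a baseline and a proposed technique; the EER arrays are illustrative placeholders, not the thesis results.

```python
# Illustrative computation of relative EER reduction and its average across
# noise types, assuming Equation 5.1 is the relative reduction
# (EER_baseline - EER_proposed) / EER_baseline * 100. Values are placeholders.
import numpy as np

# Rows: STREET, CAR, HOME noise; columns: -10, -5, 0 dB SNR.
eer_baseline = np.array([[30.1, 22.4, 15.0],
                         [28.3, 20.9, 14.2],
                         [26.7, 19.5, 13.1]])
eer_proposed = np.array([[24.0, 17.8, 12.4],
                         [22.5, 16.4, 11.7],
                         [20.9, 15.2, 10.6]])

reduction = (eer_baseline - eer_proposed) / eer_baseline * 100   # assumed Eq. 5.1
avg_reduction_per_snr = reduction.mean(axis=0)   # mean over noise types per SNR
print(avg_reduction_per_snr)
```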
Figure 5.33: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when interview recordings were reverberated at 0.15 sec and the duration of the surveillance recordings was changed from 10 sec to 40 sec, in the presence of various types of noise at SNRs ranging from -10 dB to 10 dB (average reduction in EER % vs. SNR).
Figure 5.33 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when the interview recordings were reverberated at a 0.15 sec reverberation time and the duration of the surveillance recordings was changed from 10 sec to 40 sec in the presence of different types of environmental noise at SNRs ranging from -10 dB to 10 dB. Forensic speaker verification performance improved by an average EER reduction of 26.51% when the duration of the surveillance recordings increased from 10 sec to 40 sec in the presence of various types of environmental noise at 0 dB SNR and reverberated interview recordings. Some results of the effect of utterance length on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.5 Chapter summary
In this chapter, the effectiveness of combining features, MFCC and MFCC ex-
tracted from the DWT of the speech signals, with and without feature warping
was investigated for improving the state-of-the-art length-normalized GPLDA
based speaker verification under noisy conditions only. The performance of
speaker verification was evaluated using different feature extraction techniques:
MFCC, feature-warped MFCC, DWT-MFCC, feature-warped DWT-MFCC, a
fusion of DWT-MFCC and MFCC features, and fusion of feature warping with
DWT-MFCC and MFCC. The performance of forensic speaker verification was
evaluated using the AFVC and QUT-NOISE databases. The experimental results showed that the proposed fusion of feature warping with DWT-MFCC and MFCC is superior to the other feature extraction techniques in the presence of environmental noise over the majority of SNRs. At 0 dB SNR, the fusion of feature warping with DWT-MFCC and MFCC reduced average EER by 21.33% over feature-warped MFCC in the presence of various types of environmental noise only.
Experimental results also demonstrated that the proposed fusion of feature warping with DWT-MFCC and MFCC improved forensic speaker verification performance compared with other feature extraction techniques in the presence of reverberation only, as well as under noisy and reverberant conditions. The fusion of feature warping with DWT-MFCC and MFCC reduced average EER by 20.00% and 13.28% over feature-warped MFCC in the presence of reverberation only and in noisy and reverberant environments at 0 dB SNR, respectively. The results of the effectiveness of fusion of feature warping with DWT-MFCC and MFCC for improving forensic speaker verification performance under noisy and reverberant conditions were published in the IEEE Access journal and the IEEE International Conference on Signal and Image Processing Applications. The journal and conference papers are titled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] and “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6], respectively.
Forensic speaker verification based on the fusion of feature warping with DWT-
MFCC and MFCC improved average EER under noisy and reverberant condi-
tions compared with feature-warped MFCC for two main reasons. Firstly, rever-
beration affects low frequency sub bands more than high frequency sub bands,
since the boundary materials used in most rooms are less absorptive at low fre-
quency sub bands. The feature-warped MFCC extracted from the DWT can be
used to extract more features from the low frequency sub band of the reverber-
ated interview recordings. These features add some important information to
feature-warped MFCC. Therefore, fusion of feature warping with DWT-MFCC
and MFCC improves forensic speaker verification performance compared with
feature-warped MFCC under reverberation conditions. Secondly, the feature-warped MFCC extracted from the DWT adds more features to those extracted from the feature-warped MFCC of the full band of the noisy speech signals, thereby helping to improve forensic speaker verification performance in the presence of various types of environmental noise. Thus, the fusion of feature warping with DWT-MFCC and MFCC approach proposed in this chapter can be used to extract the features from the interview and enhanced surveillance recordings, as described in Chapter 6.
Chapter 6
ICA for speaker verification
6.1 Introduction
Forensic speaker verification systems face many challenges in real-world appli-
cations. Forensic audio recordings from the criminal are often recorded using
hidden microphones in public places [4]. Such forensic audio recordings are of-
ten corrupted with various types of environmental noises [4, 89]. Reverberation
conditions often occur when interview recordings from the suspect are recorded
in a police interview room. The distortion of speech by environmental noise
and reverberation severely degrades speaker verification performance [117]. In this chapter, the ICA algorithm is used to separate speech from the noisy speech signals and improve forensic speaker verification performance, because the source signals (speech and environmental noise) have non-Gaussian distributions, and ICA is an effective technique for separating source signals drawn from non-Gaussian distributions.
Independent component analysis was used as a multiple channel speech enhance-
ment algorithm to separate the clean speech from the noisy speech signals and it
can be used to improve forensic speaker verification performance. The fast ICA
is one of the popular ICA methods used in the literature and it is based on the
iterative blind source separation technique [10]. The original source signals in
the fast ICA algorithm are estimated from the mixed data at each instance. The
quality of estimation of the source signals is based on the unmixing matrix [101].
Due to the uncertainty associated with initialization of the fast ICA parameters
and the iterative process, there is a randomness in the separation of the source
signals in the fast ICA algorithm [102]. The performance of the mixed signals
may degrade due to the randomness of the separation of the source signals in the
fast ICA algorithm. Therefore, the multi-run based on the fast ICA algorithm
[103] was proposed to solve the problem of randomness in the separation of the
source signals and it achieved better performance for biosignal applications under
clean conditions.
The multi-run ICA algorithm iterates the fast ICA algorithm [55] several times, obtaining a different unmixing matrix each time. The source signals can then be estimated by choosing a suitable unmixing matrix, which gives a clear separation of the source signals. The suitable unmixing matrix can be selected by computing the signal to interference ratio (SIR) of each candidate matrix and choosing the one with the highest SIR [103]. The disadvantage of the multi-run ICA based on the fast ICA algorithm is that it is more computationally expensive than the traditional fast ICA algorithm, because it iterates the fast ICA algorithm several times and chooses the unmixing matrix with the largest SIR.
Various algorithms of ICA have been proposed in previous studies such as Fast
ICA [53], information maximization [11] and efficient fast ICA (EFICA) [76].
These algorithms use a fixed nonlinear contrast function or model, which makes them computationally attractive for estimating source signals. However, the quality of separation degrades when the density of the source signals deviates from the assumed underlying model. Recently, the ICA by entropy bound minimization (ICA-EBM) algorithm has been used as an effective technique for source separation [82]. It calculates entropy bounds from four contrast measuring functions (two even and two odd functions). Using four contrast functions allows the ICA-EBM algorithm to provide satisfactory separation performance for a wide variety of distributions without prior knowledge about the source signals. The tightest maximum entropy bound, the one closest to the entropy of the true source, is then chosen as the entropy estimate of the source. The ICA-EBM algorithm can separate source signals drawn from different distributions and achieves separation performance superior to other ICA algorithms [82]. Thus, we use ICA-EBM as the speech enhancement algorithm for separating the noise from the noisy speech signals.
The multi-run ICA based on the fast ICA algorithm and the ICA-EBM algorithm have not previously been investigated in the literature for improving forensic speaker verification performance in the presence of various types of environmental noise and reverberation conditions. In this chapter, the multi-run ICA based on the fast ICA algorithm or the ICA-EBM algorithm is used to reduce the effect of environmental noise. The fusion of feature warping with DWT-MFCC and MFCC is used to extract the features from the enhanced speech signals, and these features are used to train the state-of-the-art length-normalized GPLDA based speaker verification framework.
This chapter is divided into several sections. The multi-run ICA algorithm is presented in Section 6.2. Section 6.3 describes the ICA-EBM algorithm. The methodology of forensic speaker verification based on the multi-run ICA or ICA-EBM algorithm is described in Section 6.4. Section 6.5 presents the results and discussion.
6.2 Multi-run ICA algorithm
Fast ICA is an iterative blind source separation approach [103]. The original source signals are estimated from the mixing matrix at each run. The mixing and unmixing matrices are estimated using the fast ICA algorithm described in Sections 4.6.1 and 4.6.2. The quality of estimation of the source signals depends on the unmixing matrix W. Since the unmixing matrix is estimated from a random initial matrix, there is randomness in the quality of the separation of the estimated source signals.
The multi-run ICA based on the fast ICA algorithm [103] deals with this prob-
lem of randomness. This algorithm is based on computing the mixing matrices
several times and choosing the unmixing matrix that has the maximum quality
of separation. To estimate the enhanced speech from the noisy speech signals, we
require just one mixing matrix. Thus, the best mixing matrix has to be chosen
from among different mixing matrices to obtain better speaker verification perfor-
mance, as shown in Figure 6.1. There are various methods available to compute
the quality of separation of the unmixing matrix. For audio applications, the
SIR was found to be the most popular method to compute the quality of the
separation [24], and this method is used in this chapter. The SIR computation
process will be described in the next section.
Figure 6.1: Flowchart of the multi-run ICA algorithm.
6.2.1 SIR computation
The SIR is the ratio of the power of the wanted signal to the power of the
unwanted signal. The actual SIR can be defined as [141]
$$\mathrm{SIR}_{\mathrm{act}} = \frac{\|S_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2}, \qquad (6.1)$$
where $S_{\mathrm{target}}$ is the modified version of the clean speech signal with an allowed distortion and $e_{\mathrm{interf}}$ is the interference error.
For real forensic applications, the original source signals (speech and noise) and the mixing matrix are unknown. In this scenario, the actual SIR cannot be computed directly, so an estimated SIR is used in place of the actual SIR for real-world data. One independent component can be estimated as:

$$y_i = w_i^T X = (w_i^T A) S = g_i S = g_{ij} s_j, \qquad (6.2)$$
where $y_i$ and $s_j$ are the estimated independent component and the $j$th source signal, respectively, $w_i$ is a row vector of the unmixing matrix $W$, and $g_i$ is a normalized row vector of the global matrix $G = WA$. As $y_i$ is an estimate of the source signal $s_j$, the ideal normalized row vector of the global matrix, $g_i = [0, \cdots, 0, g_{ij}, 0, \cdots, 0]$, is the unit vector $u_j = [0, \cdots, 0, 1, 0, \cdots, 0]$. Thus, one independent component is separated successfully if $g_i$ is close to a unit vector $u_j$.
The quality of separation of each independent component depends on one row vector of the global matrix $G$, and the quality decreases as the difference between each row vector of $G$ and the corresponding unit vector $u_j$ increases [24]. The estimated SIR of each unmixing matrix was calculated using the following expression, which evaluates how successfully one independent component is separated:

$$\mathrm{SIR}_{\mathrm{est}} = -10 \log_{10} \left( \| g_i - u_j \|_2^2 \right). \qquad (6.3)$$
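To make the selection step concrete, here is a minimal sketch of multi-run ICA with SIR-based selection, under stated assumptions: a simulation setting where the mixing matrix A is known (as in Equation 6.11), scikit-learn's FastICA standing in for the fast ICA implementation, and sir_est and multi_run_ica as hypothetical helper names.

```python
# Minimal sketch of multi-run ICA: run fast ICA several times and keep the
# unmixing matrix with the highest estimated SIR (Equation 6.3). Assumes a
# simulation setting where the mixing matrix A is known, so G = W A can be
# formed; helper names are hypothetical.
import numpy as np
from sklearn.decomposition import FastICA

def sir_est(W, A):
    """Average estimated SIR over the rows of the global matrix G = W A (Eq. 6.3)."""
    G = W @ A
    sirs = []
    for g in G:
        g = g / np.linalg.norm(g)            # normalized row vector g_i
        u = np.zeros_like(g)
        j = np.argmax(np.abs(g))
        u[j] = np.sign(g[j])                 # nearest unit vector u_j (sign-matched)
        sirs.append(-10.0 * np.log10(np.linalg.norm(g - u) ** 2 + 1e-12))
    return float(np.mean(sirs))

def multi_run_ica(X, A, n_runs=10):
    """Iterate fast ICA with different random seeds; keep the best unmixing matrix."""
    best_W, best_sir = None, -np.inf
    for seed in range(n_runs):
        ica = FastICA(n_components=X.shape[0], random_state=seed)
        ica.fit(X.T)                         # sklearn expects (n_samples, n_channels)
        W = ica.components_                  # unmixing matrix for this run
        s = sir_est(W, A)
        if s > best_sir:
            best_W, best_sir = W, s
    return best_W, best_sir
```

The enhanced speech would then be recovered as the ICA output channel corresponding to the speech source.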
6.3 ICA-EBM algorithm
By using a small class of nonlinear contrast functions, the ICA-EBM algorithm [82] performs separation of sources that are sub- or super-Gaussian, unimodal or multimodal, and symmetric or skewed. The algorithm uses the entropy bound estimator to approximate the entropies of different types of distribution. The ICA-EBM algorithm minimizes the mutual information of the estimated source signals to estimate the unmixing matrix, and uses a line search procedure that forces the unmixing matrix to be orthogonal for better convergence behavior. The mutual information cost function can be defined as:
$$I(y_1, y_2, \cdots, y_N) = \sum_{n=1}^{N} H(y_n) - \log \left| \det(W) \right| - H(x), \qquad (6.4)$$
where $H(y_n)$ is the differential entropy of the $n$th separated source, and the entropy of the observations $H(x)$ is independent of the unmixing matrix $W$, so it can be treated as a constant $C$. Minimizing the mutual information among the estimated source signals is equivalent to maximizing the log-likelihood cost function as long as the assumed PDF model matches the PDF of the true latent source signal [2]. A bias is introduced into the estimate of the unmixing matrix when the model deviates from the true PDF of the source signal. This bias can be removed by integrating a flexible density model for each source signal into the ICA algorithm, which minimizes the bias of the unmixing matrix and provides accurate separation of source signals from a wide range of PDFs [16].
To achieve this, the cost function and its gradient can be rewritten with respect to each row vector of the unmixing matrix $w_n$, $n = 1, 2, \cdots, N$. This can be done by expressing the volume of the $N$-dimensional parallelepiped spanned by the row vectors of $W$ as the inner product of the $n$th row vector and a unit Euclidean length vector $h_n$ that is perpendicular to all row vectors of the unmixing matrix except $w_n$ [82]. Therefore, the mutual information cost function in Equation 6.4 can be rewritten as a function of only $w_n$:

$$J_n(w_n) = \sum_{n=1}^{N} H(y_n) - \log \left| h_n^T w_n \right| + C. \qquad (6.5)$$
The gradient of Equation 6.5 can be computed as:

$$\frac{\partial J_n(w_n)}{\partial w_n} = -V'_{k(n)}\left\{ E\left[ G_{k(n)}(y_n) \right] \right\} E\left[ g_{k(n)}(y_n)\, x \right] - \frac{h_n}{h_n^T w_n}, \qquad (6.6)$$
where $V'(\cdot)$ and $g_{k(n)}(\cdot)$ are the first-order derivatives of the negentropy $V(\cdot)$ and the $k$th contrast function $G_{k(n)}(\cdot)$, respectively, and $E$ is the expectation operator.
The line search algorithm for the orthogonal ICA-EBM can be obtained from the following equations:

$$w_n^{+} = \frac{E\left[ x\, g_{k(n)}(y_n) \right] - E\left[ g'_{k(n)}(y_n) \right] w_n}{E\left[ g_{k(n)}(y_n)\, y_n \right] - E\left[ g'_{k(n)}(y_n) \right]}, \qquad (6.7)$$

$$w_n^{\mathrm{new}} = \frac{w_n^{+}}{\left\| w_n^{+} \right\|}, \qquad (6.8)$$
where $g'_{k(n)}(\cdot)$ is the second derivative of the $k$th contrast function $G_{k(n)}(\cdot)$.
The line search algorithm for ICA-EBM given in Equations 6.7 and 6.8 iterates over the row vectors of the unmixing matrix until convergence. The convergence criterion is:

$$1 - \max\left( \mathrm{abs}\left[ \mathrm{diag}\left( W^{\mathrm{new}} W^{T} \right) \right] \right) \leq \varepsilon, \qquad (6.9)$$
with a typical value of $\varepsilon = 0.0001$. After each row vector of the unmixing matrix $W$ has been updated, symmetric decorrelation is performed to keep the unmixing matrix orthogonal:

$$W^{\mathrm{new}} = \left( W W^{T} \right)^{-\frac{1}{2}} W. \qquad (6.10)$$
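As a concrete illustration of these last two steps, the following minimal sketch implements the symmetric decorrelation of Equation 6.10 and the convergence test of Equation 6.9; the entropy-bound line-search update of Equation 6.7 is not reproduced here, and the helper names are ours.

```python
# Minimal sketch of two steps of the ICA-EBM line search: the convergence
# test (Eq. 6.9) and symmetric decorrelation (Eq. 6.10). The entropy-bound
# update itself (Eq. 6.7) is omitted.
import numpy as np
from scipy.linalg import inv, sqrtm

def symmetric_decorrelation(W):
    """Eq. 6.10: W_new = (W W^T)^(-1/2) W, keeping the unmixing matrix orthogonal."""
    return np.real(inv(sqrtm(W @ W.T))) @ W

def converged(W_new, W_old, eps=1e-4):
    """Eq. 6.9: stop when 1 - max |diag(W_new W_old^T)| <= eps."""
    return 1.0 - np.max(np.abs(np.diag(W_new @ W_old.T))) <= eps
```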
6.4 Methodology
Experiments were conducted to evaluate the performance of the forensic speaker
verification based on the multi-run ICA or ICA-EBM algorithm in the presence
of various types of environmental noise and reverberant conditions. The system
of speaker verification based on the multi-run ICA or ICA-EBM consists of the
steps shown in Figure 6.2 and described in the following sections.
Figure 6.2: Flowchart of speaker verification based on the multi-run ICA or ICA-EBM algorithm: the noisy surveillance data are enhanced by the multi-run ICA or ICA-EBM algorithm, fusion of feature warping with DWT-MFCC and MFCC is applied to the interview (clean or reverberated) and enhanced surveillance data, i-vectors are extracted, and the length-normalized GPLDA classifier produces the decision.
6.4.1 Multi-run ICA or ICA-EBM speech enhancement
In real forensic situations, the police often record forensic surveillance record-
ings from the criminal using hidden microphones. These recordings are often
mixed with different types of environmental noise. The performance of foren-
sic speaker verification degrades significantly in the presence of different types
of noise. Therefore, it is necessary to use multiple channel speech enhancement
techniques to reduce the effect of noise and improve forensic speaker verification
performance. This thesis uses the multi-run ICA based on the fast ICA algorithm or the ICA-EBM algorithm as a multiple channel speech enhancement algorithm. The enhanced surveillance speech signals from the multi-run ICA and ICA-EBM algorithms were applied to the feature extraction and classifier model, because a forensic expert is able to distinguish the enhanced speech signal from the estimated noise by listening to the output.
6.4.2 Fusion of feature warping with DWT-MFCC and
MFCC
The approach to extracting the features from the interview and enhanced
surveillance recordings is based on the multi-resolution property of the DWT.
The MFCCs were computed over Hamming windowed frames of the inter-
view/enhanced surveillance recordings, with 30 ms size and 10 ms shift. The
MFCCs were obtained using a mel-filterbank of 32 channels, followed by a trans-
formation to the cepstral domain, keeping 13 coefficients. The first and second
derivatives of the cepstral coefficients were appended to the MFCCs. Feature
warping with a 301 frame window was applied to the features extracted from the
MFCCs. The frame of the interview/enhanced surveillance speech signal was de-
composed into two frequency sub bands: low-frequency sub band (approximation
coefficients) and high frequency sub band (detail coefficients). The approxima-
tion and detail coefficients were combined into a single vector. The decomposition
process can be repeated by applying the DWT to the low-frequency sub band.
To capture the essential characteristics of the vocal tract, the feature-warped
MFCCs were applied to the concatenated vector from the DWT. Finally, the
fusion of feature warping with DWT-MFCC and MFCC approach can be per-
formed by combining the feature-warped MFCCs extracted from the full band
of the interview/enhanced surveillance recording with similar features extracted
from the DWT in a single feature vector, as shown in Figure 6.3.
Figure 6.3: Flowchart of fusion of feature warping with DWT-MFCC and MFCC techniques.
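As an illustration of this front-end, the sketch below combines librosa MFCCs, a PyWavelets DWT, and a simplified feature warping. It is a minimal sketch under stated assumptions: the warping maps each feature dimension to a standard normal through its empirical ranks over the whole utterance, rather than the 301-frame sliding window used here, and mfcc_deltas, feature_warp, and fused_features are hypothetical helper names.

```python
# Minimal sketch of the fusion front-end: feature-warped MFCCs from the full
# band concatenated with the same features computed on the DWT coefficients.
# The feature warping here is a simplified whole-utterance variant.
import numpy as np
import librosa
import pywt
from scipy.stats import norm

def mfcc_deltas(y, sr):
    """13 MFCCs with first and second derivatives (39 features per frame)."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=32,
                             n_fft=int(0.03 * sr), hop_length=int(0.01 * sr))
    return np.vstack([m, librosa.feature.delta(m), librosa.feature.delta(m, order=2)])

def feature_warp(F):
    """Map each feature dimension to a standard normal via its empirical ranks."""
    ranks = F.argsort(axis=1).argsort(axis=1) + 1
    return norm.ppf(ranks / (F.shape[1] + 1))

def fused_features(y, sr, level=4, wavelet="db8"):
    """Fuse feature-warped MFCCs from the full band and from the DWT coefficients."""
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    y_dwt = np.concatenate(coeffs)                   # approximation + detail coefficients
    full = feature_warp(mfcc_deltas(y, sr))
    sub = feature_warp(mfcc_deltas(y_dwt, sr))
    n = min(full.shape[1], sub.shape[1])
    return np.vstack([full[:, :n], sub[:, :n]])      # 78 features per frame
```

Stacking the two 39-dimensional streams yields the 78-dimensional feature vectors used in the classifier described next.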
6.4.3 Length-normalized GPLDA classifier
The speaker verification system was conducted with the state-of-the-art length-
normalized GPLDA based speaker verification framework. In the development
phase, the 78-dimensional fusion of feature warping with DWT-MFCC and MFCC features and a UBM containing 256 Gaussian components were used in the experiments.
The UBM was trained on telephone and microphone data using 348 speakers
from the AFVC database [99]. The UBM was used to calculate the Baum-Welch
statistics before training the total variability subspace of dimension 400. The
total variability [31] was used to compute the i-vectors. The i-vectors were pro-
jected into LDA space to reduce the dimension of the i-vectors to 200. The
i-vectors length-normalization was used before GPLDA modeling using centering
and whitening of the i-vectors [43]. In the enrolment and verification phases,
the interview and surveillance speaker models were created from the interview
and enhanced surveillance speech signals to represent them in i-vector subspace.
The hidden parameters of the PLDA were estimated using variational posterior
distribution. Scoring in the length-normalized GPLDA was conducted using the
batch likelihood ratio to calculate the similarity score between the interview and
surveillance speaker models [64].
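To make the length-normalization step concrete, here is a minimal sketch under stated assumptions: the i-vectors have already been projected by LDA, the whitening transform is estimated from a development set, and normalization is centering, whitening, then scaling to unit length; the names are hypothetical.

```python
# Minimal sketch of i-vector length normalization before GPLDA:
# center, whiten, then project onto the unit sphere.
import numpy as np

def length_normalize(ivectors, mean, whitener):
    """Center, whiten, and scale each i-vector to unit length."""
    w = (ivectors - mean) @ whitener.T
    return w / np.linalg.norm(w, axis=1, keepdims=True)

# Example: ZCA-style whitener from a development-set covariance.
dev = np.random.randn(1000, 200)          # stand-in for LDA-projected i-vectors
mean = dev.mean(axis=0)
cov = np.cov(dev - mean, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
normalized = length_normalize(dev, mean, whitener)
```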
6.5 Results and discussion
This section describes the performance of the speaker verification systems based
on multi-run ICA or ICA-EBM algorithm under noisy and reverberant conditions.
The performance of speaker verification systems was evaluated using the EER and
the mDCF, calculated using $C_{\mathrm{miss}} = 10$, $C_{\mathrm{fa}} = 1$, and $P_{\mathrm{target}} = 0.01$.
6.5.1 Noisy and reverberant speaker verification
baselines
To investigate the effectiveness of speaker verification based on the multi-run ICA
or the ICA-EBM algorithm in the presence of environmental noise only and noisy
and reverberant conditions, we compared the performance of speaker verifica-
tion based on the multi-run ICA or the ICA-EBM algorithm with two baselines
(Noise-without-Reverberation and Reverberation-with-Noise speaker verification
systems). The interview speech signals for these baselines were extracted from
full duration utterances from 200 speakers using the pseudo-police style. The
VAD algorithm [130] was used to remove silent regions from the interview speech
signals. The surveillance recordings were obtained from random sessions of 10 sec duration from 200 speakers using the informal telephone conversation style, after removing silent regions using the VAD algorithm. The two baselines are briefly described below.

Figure 6.4: Flowchart of the Noise-without-Reverberation speaker verification baseline.
6.5.1.1 Noise-without-Reverberation speaker verification baseline
The Noise-without-Reverberation speaker verification baseline is obtained by keeping the interview data under clean conditions, while the surveillance recordings were mixed with random sessions of CAR, STREET and HOME noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB. Fusion of feature
warping with DWT-MFCC and MFCC was used to extract the features from the
interview and surveillance recordings. According to our experimental results in
Section 5.4.1.3, the fusion of feature warping with DWT-MFCC and MFCC im-
proved forensic speaker verification performance under noisy conditions compared with other feature extraction techniques. Level 3 and db8 were used in the fusion of feature warping with DWT-MFCC and MFCC, because they achieved better forensic speaker verification performance in the presence of various types of environmental noise over the majority of SNR values compared with the other decomposition levels and wavelet families, as described in Sections 5.4.1.1 and 5.4.1.2.
The interview and surveillance speaker models were created from the speech sig-
nals to represent them in i-vector subspace. Then, the length-normalized GPLDA
and batch likelihood ratio were used to compute the similarity score between those
speaker models, as shown in Figure 6.4.
6.5.1.2 Reverberation-with-Noise speaker verification baseline
The interview recordings were convolved with an impulse response of the first con-
figuration of the room, as described in Table 6.1 using the image source algorithm
[78] at 0.15 sec reverberation time. The surveillance recordings were corrupted
with random segments of STREET, CAR and HOME noises at SNRs ranging
from -10 dB to 10 dB. The fusion of feature warping with DWT-MFCC and MFCC approach was used to extract the features from the interview and surveillance recordings, because this approach improved forensic speaker verification performance in the presence of noise and reverberation conditions compared with other feature extraction techniques, as described in Section 5.4.3.3. Level 4 and db8 were used in the fusion of feature warping with DWT-MFCC and MFCC, because they achieved better forensic speaker verification performance in the presence of various types of environmental noise and reverberant conditions compared with the other levels and wavelet families, as described in Sections 5.4.3.1 and 5.4.3.2. The length-normalized GPLDA and batch likelihood ratio were then used to calculate the similarity score between the i-vectors of the interview and surveillance recordings, as shown in Figure 6.5.
Table 6.1: Reverberation test room parameters.

Configuration | Suspect position (xs, ys, zs) | Microphone position (xm, ym, zm)
1 | (2, 1, 1.3) | (1.5, 1, 1.3)
2 | (2, 1, 1.3) | (2.4, 1, 1.3)
3 | (2, 1, 1.3) | (2.8, 1, 1.3)
4 | (2, 1, 1.3) | (2.8, 2.5, 1.3)
Figure 6.5: Flowchart of the Reverberation-with-Noise speaker verification baseline.
6.5.2 Noise-without-Reverberation conditions
This section describes speaker verification performance based on the multi-run
ICA or ICA-EBM algorithm under Noise-without-Reverberation conditions. We
compared the actual signal to interference ratio (SIRact) and estimated signal
to interference ratio (SIRest) to investigate the effectiveness of estimated SIR
measurement to evaluate forensic speaker verification performance based on the
multi-run ICA algorithm under Noise-without-Reverberation conditions. The
performance of forensic speaker verification based on the multi-run ICA or the
ICA-EBM algorithm is compared with the Noise-without-Reverberation speaker
verification baseline, and traditional ICA. The effect of the utterance length on
the performance of speaker verification based on the multi-run ICA or the ICA-
EBM algorithm is also investigated in this section. The fusion of feature warping with DWT-MFCC and MFCC used level 3 decomposition and db8 to extract the features from the interview and enhanced surveillance recordings, in order to provide a fair comparison with the Noise-without-Reverberation speaker verification baseline.
6.5.2.1 Speaker verification performance based on multi-run ICA (SIRact)
In these simulation results, the full duration of the interview recordings was ob-
tained from 200 speakers using the pseudo-police style. The VAD algorithm was
used to remove silent regions. The interview recordings were kept under clean
conditions. The surveillance recordings were obtained from 10 sec duration from
200 speakers using the informal telephone conversation style after removing silent
regions using the VAD algorithm. The surveillance recordings were mixed with
random sessions of CAR, STREET and HOME noises from the QUT-NOISE at
SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to the following equation:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 0.6 \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (6.11)$$
where $z(n)$ is the clean speech signal and $e(n)$ is the environmental noise.
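As a concrete illustration, the sketch below forms the two observed channels of Equation 6.11; it assumes the clean speech z and the noise e are equal-length arrays already scaled to the desired SNR, and mix_two_channel is a hypothetical helper name.

```python
# Minimal sketch of the two-channel mixing in Equation 6.11. Assumes z and e
# are equal-length numpy arrays already scaled to the target SNR.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.6]])       # mixing matrix from Equation 6.11

def mix_two_channel(z, e):
    """Return the two observed channels x1, x2 for sources z (speech) and e (noise)."""
    S = np.vstack([z, e])        # 2 x n source matrix
    X = A @ S                    # 2 x n observations
    return X[0], X[1]
```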
The performance of forensic speaker verification based on the multi-run ICA (highest SIRact) was compared with the Noise-without-Reverberation speaker verification baseline, traditional ICA, and multi-run ICA (lowest SIRact) in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB, as shown in Figure 6.6. The SNRs on the x-axis of Figure 6.6 were computed from the first microphone (x1).

Figure 6.6: Comparison of forensic speaker verification based on the multi-run ICA (highest SIRact) algorithm with other speaker verification techniques under Noise-without-Reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise).
It is clear from Figure 6.6 that the performance of forensic speaker verification
based on the multi-run ICA (highest SIRact) algorithm achieved significant im-
provements in EER over the Noise-without-Reverberation speaker verification
baseline at low SNRs ranging from -10 dB to 0 dB. The reduction in EER for
speaker verification with multi-run ICA (highest SIRact) was 66.68%, 69.24% and
70.78% over the Noise-without-Reverberation speaker verification baseline when
the surveillance recordings were corrupted by CAR, STREET and HOME noises, respectively, at -10 dB SNR. The multi-run ICA (highest SIRact) algorithm significantly
improved EER at -10 dB SNR, because this algorithm is based on choosing the
suitable unmixing matrix that has the maximum SIR. The suitable unmixing
matrix gives a clear separation of noise from the noisy speech signals.
The results also demonstrated that forensic speaker verification performance
based on the multi-run ICA (highest SIRact) improved EER compared with tra-
ditional ICA in the presence of various types of environmental noise at SNRs
ranging from -10 dB to 10 dB. The improvement in EER of speaker verification
based on the multi-run ICA (highest SIRact) decreased when SNR increased and
there was a degradation in the speaker verification performance compared with
the Noise-without-Reverberation speaker verification baseline in the presence of
CAR, STREET and HOME noises at SNRs ranging from 5 dB and 10 dB.
The multi-run ICA (lowest SIRact) degraded the forensic speaker verification per-
formance compared with traditional ICA in the presence of various types of en-
vironmental noise at SNRs ranging from -10 dB to 10 dB. The estimation of the
speech signals from the worst unmixing matrix (lowest SIRact) may occur at any
instance in the traditional ICA algorithm due to the randomness associated with
choosing the unmixing matrix. Thus, the traditional ICA algorithm may fail to
separate the speech from the noisy speech signals in this case, and this leads to
decreased noisy forensic speaker verification performance. The results of compar-
ison of forensic speaker verification based on the multi-run ICA (highest SIRact)
with other techniques in the presence of various types of environmental noise
were published in the 11th International Conference on Signal Processing and Communication Systems. The conference paper is titled “Speaker verification with multi-run ICA based speech enhancement” [3].
6.5.2.2 Speaker verification performance based on multi-run ICA (SIRest)
In order to investigate the effectiveness of forensic speaker verification based on
the multi-run ICA (estimated SIR) under Noise-without-Reverberation condi-
tions, we compared the performance of forensic speaker verification based on the
multi-run ICA (highest SIRest) with the Noise-without-Reverberation speaker ver-
ification baseline, traditional ICA, multi-run ICA (highest SIRact), and the ICA-
EBM algorithm. As described in the previous section, forensic speaker verification
performance based on the multi-run ICA (lowest SIRact) degraded compared with
traditional ICA algorithm in the presence of various types of environmental noises
at SNRs ranging from -10 dB to 10 dB. Therefore, forensic speaker verification
based on the multi-run ICA algorithm (lowest SIRest) will not be investigated in
these simulation results.
The full duration of the interview recordings was obtained from 200 speakers using
the pseudo-police style. The VAD algorithm was used to remove silent regions
from the interview recordings. These data were kept under clean conditions. The
surveillance recordings were obtained from 10 sec duration from 200 speakers
using the informal telephone conversation style after removing silent regions using
the VAD algorithm. The surveillance recordings were corrupted with random
segments of CAR, STREET and HOME noises at SNRs ranging from -10 dB to
10 dB, resulting in two-channel noisy speech signals according to Equation 6.11.
Figure 6.7 shows a comparison of forensic speaker verification based on the multi-run ICA (highest SIRest) with other techniques under noisy conditions. The SNRs on the x-axis of Figure 6.7 were calculated from the first microphone (x1).

Figure 6.7: Comparison of forensic speaker verification based on the multi-run ICA (highest SIRest) algorithm with other speaker verification techniques under noisy conditions (EER % vs. SNR for STREET, CAR, and HOME noise).

It is
clear from this figure that forensic speaker verification based on the multi-run ICA
(highest SIRest) improved EER over the Noise-without-Reverberation speaker
verification baseline in the presence of different types of environmental noise at
low SNRs ranging from -10 dB to 0 dB. The reduction in EER for forensic speaker
verification based on the multi-run ICA (highest SIRest) algorithm was 66.76%,
68.42% and 71.19% over the Noise-without-Reverberation speaker verification
baseline when the surveillance recordings were mixed with CAR, STREET and
HOME noises, respectively at -10 dB SNR. The results also demonstrated that
in the presence of various types of environmental noise over the majority of SNR
values, forensic verification based on the multi-run ICA (highest SIRest) performs
an approximately similar performance to the multi-run ICA (highest SIRact).
Thus, the estimated SIR can be used instead of the actual SIR to evaluate forensic
speaker verification based on the multi-run ICA algorithm under Noise-without-
Reverberation conditions. The performance of forensic speaker verification based
on the multi-run ICA (highest SIRest) degraded compared with the Noise-without-
Reverberation speaker verification baseline at SNRs ranging from 5 dB to 10
dB.
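For reference, the EER used throughout these results is the operating point at which the false acceptance and false rejection rates are equal. A minimal sketch of computing the EER from genuine and impostor PLDA scores is given below; the score distributions are synthetic stand-ins, not scores from this system.

    import numpy as np

    def equal_error_rate(genuine, impostor):
        # Sweep thresholds over all scores; EER is where FAR and FRR cross.
        scores = np.concatenate([genuine, impostor])
        labels = np.concatenate([np.ones(genuine.size), np.zeros(impostor.size)])
        order = np.argsort(scores)
        labels = labels[order]
        frr = np.cumsum(labels) / genuine.size             # genuine rejected at/below t
        far = 1.0 - np.cumsum(1 - labels) / impostor.size  # impostors accepted above t
        idx = np.argmin(np.abs(far - frr))
        return 100.0 * (far[idx] + frr[idx]) / 2.0         # EER in percent

    rng = np.random.default_rng(1)
    eer = equal_error_rate(rng.normal(2.0, 1.0, 2000),     # genuine trial scores
                           rng.normal(0.0, 1.0, 20000))    # impostor trial scores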
When the surveillance recordings were corrupted with different sessions of CAR,
STREET and HOME noises at SNRs ranging from -10 dB to 10 dB, the multi-
run ICA (highest SIRest) algorithm improved the forensic speaker verification
performance over the traditional ICA algorithm. The reduction in EER between
traditional ICA and multi-run ICA (highest SIRest) can be calculated accord-
ing to Equation 5.1. The average reduction in EER between traditional ICA and multi-run ICA (highest SIRest) is computed as the mean EER reduction across the various types of environmental noise at each noise level, as shown in Figure 6.8. The results demonstrated that forensic speaker verification based on the multi-run ICA (highest SIRest) achieved a consistent average reduction in EER over traditional ICA under different types of environmental noise at SNRs ranging from -10 dB to 10 dB. The average reduction in EER for multi-run ICA (highest SIRest) ranged from 15.80% to 6.83% compared with traditional ICA when the surveillance recordings were mixed with different sessions of environmental noise at SNRs ranging from -10 dB to 0 dB.
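Equation 5.1 is not restated here; the sketch below assumes it takes the usual relative form EERred = (EERtraditional − EERproposed)/EERtraditional × 100%, averaged over the noise types at each SNR as plotted in Figure 6.8. The EER values in the example are illustrative placeholders, not the measured results.

    import numpy as np

    def eer_reduction(eer_baseline, eer_proposed):
        # Relative reduction in EER, in percent (assumed form of Equation 5.1).
        return 100.0 * (eer_baseline - eer_proposed) / eer_baseline

    # Rows: CAR, STREET, HOME; columns: SNRs of -10, -5, 0, 5, 10 dB.
    eer_traditional = np.array([[30.1, 25.4, 20.2, 15.8, 12.1],
                                [32.5, 27.0, 21.9, 16.4, 13.0],
                                [33.2, 28.1, 22.5, 17.1, 13.6]])  # placeholder values
    eer_multi_run = np.array([[25.3, 22.1, 18.4, 15.0, 11.8],
                              [27.4, 23.6, 20.0, 15.7, 12.7],
                              [28.0, 24.5, 20.6, 16.3, 13.2]])    # placeholder values

    # Mean EER reduction across noise types at each SNR.
    avg_reduction = eer_reduction(eer_traditional, eer_multi_run).mean(axis=0)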
Figure 6.8: Average reduction in EER for the multi-run ICA (highest SIRest) algorithm over traditional ICA in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
The results also demonstrated that forensic speaker verification based on the ICA-EBM algorithm achieved significant improvements over the Noise-without-Reverberation speaker verification baseline at low SNRs ranging from -10 dB to 0 dB. The reduction in EER for forensic speaker verification based on the ICA-EBM algorithm was 67.21%, 67.16% and 68.44% over
the Noise-without-Reverberation speaker verification baseline when surveillance
recordings were corrupted by CAR, STREET and HOME noises, respectively
at -10 dB SNR. The improvement in EER of speaker verification based on the
ICA-EBM decreased when SNR increased. The performance of speaker veri-
fication based on the ICA-EBM degraded compared with the Noise-without-
Reverberation speaker verification baseline under different types of noise at SNRs
ranging from 5 dB to 10 dB.
Figure 6.9: Average EER reduction for the ICA-EBM algorithm compared with traditional ICA under different levels and types of environmental noise.

Figure 6.9 shows the average reduction in EER for forensic speaker verification based on the ICA-EBM algorithm over traditional ICA in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB. It is clear from this figure that the average EER reduction for the ICA-EBM algorithm ranges from 12.68% to 7.42% compared with conventional ICA under different types of
noise at SNR values from -10 dB to 0 dB. From the above results, it is clear that
ICA-EBM is more desirable than traditional ICA due to its superior separation
performance and more reliable convergence behaviour.
Table 6.2 shows a comparison of mDCFs for speaker verification based on the multi-run ICA (highest SIRest) algorithm and the Noise-without-Reverberation speaker verification baseline under different types of environmental noise at SNRs ranging from -10 dB to 10 dB. It is clear from Table 6.2 that speaker verification based on the multi-run ICA (highest SIRest) significantly improved the mDCF at low SNRs (-10 dB to 0 dB) in the presence of STREET, CAR and HOME noise. The performance of speaker verification based on the multi-run ICA (highest SIRest) degraded compared with the baseline Noise-without-Reverberation speaker verification system over the majority of SNRs ranging from 5 dB to 10 dB.
Table 6.2: Comparison of mDCFs for speaker verification based on the multi-run ICA (highest SIRest) algorithm and the baseline Noise-without-Reverberation speaker verification system, at SNRs of -10, -5, 0, 5 and 10 dB.

Noise Type          Baseline                                      Multi-run ICA
             -10     -5      0       5       10      -10     -5      0       5       10
STREET     0.0996  0.0974  0.0801  0.0590  0.0339  0.0571  0.0586  0.0573  0.0569  0.0551
HOME       0.0991  0.0924  0.0803  0.0511  0.0325  0.0582  0.0577  0.0605  0.0581  0.0596
CAR        0.0995  0.0884  0.0669  0.0389  0.0240  0.0555  0.0550  0.0495  0.0570  0.0551
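The mDCF reported in Table 6.2 is the minimum of the weighted detection cost over all decision thresholds. A minimal sketch is shown below; the cost parameters are assumed NIST SRE-style values (Ptarget = 0.01, Cmiss = 10, Cfa = 1) rather than the exact settings used in this thesis.

    import numpy as np

    def min_dcf(genuine, impostor, p_target=0.01, c_miss=10.0, c_fa=1.0):
        # Evaluate the detection cost at every candidate threshold; keep the minimum.
        thresholds = np.unique(np.concatenate([genuine, impostor]))
        # P(miss): genuine scores below the threshold; P(fa): impostors at or above it.
        p_miss = np.searchsorted(np.sort(genuine), thresholds) / genuine.size
        p_fa = 1.0 - np.searchsorted(np.sort(impostor), thresholds) / impostor.size
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        return dcf.min()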
6.5.2.3 Effect of utterance length
In these simulation results, the full duration of the interview recordings was kept
under clean conditions, while the duration of the surveillance recordings was
changed from 10 sec to 40 sec. The surveillance recordings were corrupted with
random segments of STREET, CAR and HOME noises at SNRs ranging from
-10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11. Since forensic speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms improved EER over the traditional ICA algorithm, as described in Section 6.5.2.2, the effect of utterance length was evaluated only for these two algorithms in this section.
Figures 6.10 and 6.11 show the effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms in the presence of various types of environmental noise. The SNRs on the x-axis of Figures 6.10 and 6.11 were calculated from the first microphone (x1). It is clear that increasing the surveillance utterance duration improved the
performance of forensic speaker verification based on the multi-run ICA (highest
SIRest) and the ICA-EBM algorithms in the presence of STREET, CAR and
HOME noises at SNRs ranging from -10 dB to 10 dB.
Figure 6.10: Effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) algorithm in the presence of various types of environmental noise (surveillance durations of 10, 20 and 40 sec).
6.5.3 Reverberation-with-Noise conditions
This section describes the performance of forensic speaker verification based on
the multi-run ICA or the ICA-EBM algorithm under Reverberation-with-Noise
conditions. The performance of forensic speaker verification compares with the
Reverberation-with-Noise speaker verification baseline and the traditional ICA algorithm.

Figure 6.11: Effect of utterance duration on speaker verification performance based on the ICA-EBM algorithm under different levels and types of noise (surveillance durations of 10, 20 and 40 sec).

The effect of reverberation time and utterance duration on the forensic
speaker verification based on the multi-run ICA or the ICA-EBM algorithm is
also described in this section. The EER and mDCF were used in experimental
results to evaluate the performance of forensic speaker verification systems. The
fusion of feature warping with DWT-MFCC and MFCC used four levels and db8
of DWT to extract the features from the interview and enhanced surveillance
speech signals in order to provide a fair comparison with the Reverberation-with-
Noise speaker verification baseline.
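As a rough illustration of this front end, the sketch below fuses feature-warped MFCC of the full-band signal with feature-warped MFCC computed from the db8 DWT at four levels. It assumes the pywt and librosa packages, assumes the DWT stream takes MFCCs from the level-4 approximation coefficients, and uses a Pelecanos-style sliding-window warping with an assumed 301-frame window; the exact subband handling and warping window follow Chapter 5 and may differ.

    import numpy as np
    import pywt
    import librosa
    from scipy.stats import norm

    def feature_warp(feats, win=301):
        # Map each coefficient's rank within a sliding window to a standard normal.
        n_coeff, n_frames = feats.shape
        warped = np.empty_like(feats)
        half = win // 2
        for t in range(n_frames):
            lo, hi = max(0, t - half), min(n_frames, t + half + 1)
            for c in range(n_coeff):
                window = feats[c, lo:hi]
                rank = np.sum(window < feats[c, t]) + 0.5
                warped[c, t] = norm.ppf(rank / window.size)
        return warped

    def fused_features(signal, sr=8000, n_mfcc=13):
        # Full-band MFCC stream.
        mfcc_full = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        # DWT-MFCC stream: level-4 db8 approximation (assumed subband choice);
        # its effective rate is roughly sr / 2**4.
        approx = pywt.wavedec(signal, "db8", level=4)[0]
        mfcc_dwt = librosa.feature.mfcc(y=approx, sr=sr // 16, n_mfcc=n_mfcc)
        # Warp each stream, align frame counts, and concatenate.
        n = min(mfcc_full.shape[1], mfcc_dwt.shape[1])
        return np.vstack([feature_warp(mfcc_full[:, :n]),
                          feature_warp(mfcc_dwt[:, :n])])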
6.5.3.1 Speaker verification performance based on ICA algorithms
This section describes the performance of forensic speaker verification based on
the ICA algorithms in Reverberation-with-Noise conditions. The full duration of
the interview recordings was obtained from 200 speakers using the pseudo-police
style. Silent regions from the interview recordings were removed using the VAD
algorithm. The interview recordings were convolved with the impulse response of
the room at 0.15 sec reverberation time to generate the reverberated speech. The
first configuration of the room is used in these experimental results, as shown in
Table 6.1.
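A minimal sketch of this reverberation step is given below. It assumes the room impulse response has already been computed (in the thesis, from an image-source simulation of the Table 6.1 room); here a crude exponentially decaying noise sequence stands in for the true impulse response.

    import numpy as np
    from scipy.signal import fftconvolve

    def reverberate(speech, rir):
        # Convolve clean speech with the room impulse response and trim to length.
        rev = fftconvolve(speech, rir, mode="full")[: len(speech)]
        # Keep the reverberated signal at the original RMS level.
        return rev * np.sqrt(np.mean(speech ** 2) / np.mean(rev ** 2))

    fs = 8000                                # assumed sampling rate
    t60 = 0.15                               # reverberation time in seconds
    rng = np.random.default_rng(2)
    t = np.arange(int(t60 * fs)) / fs
    # Stand-in RIR: noise with a -60 dB energy decay at t60 (not an image-source RIR).
    rir = rng.standard_normal(t.size) * np.exp(-6.9 * t / t60)
    speech = rng.standard_normal(5 * fs)     # stand-in interview segment
    reverberated = reverberate(speech, rir)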
The surveillance recordings were 10 sec segments obtained from 200 speakers using the informal telephone conversation style after removing silent regions using the VAD algorithm. The surveillance recordings were mixed with one random session of the environmental noises (CAR, STREET and HOME) from the QUT-NOISE database [29] at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11. The noisy surveillance recordings were kept without reverberation because, in most real forensic situations, noisy surveillance recordings are recorded in open areas [3].
Figure 6.12 shows the experimental results for PLDA speaker verification when
interview recordings reverberated at 0.15 sec reverberation time and the surveil-
lance recordings were mixed with various types of environmental noise at SNRs
ranging from -10 dB to 10 dB. The SNRs on the x-axis of Figure 6.12 were computed from the first microphone (x1).

Figure 6.12: Experimental results for PLDA speaker verification when interview recordings were reverberated at 0.15 sec reverberation time and surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.

The results showed that the performance of speaker verification based on the multi-run ICA (highest SIRact
and highest SIRest) improved forensic speaker verification performance over the
Reverberation-with-Noise speaker verification baseline in the presence of different
types of environmental noise at low SNRs ranging from -10 dB to 0 dB and rever-
beration conditions. At -10 dB, the performance of speaker verification based on
the multi-run ICA (highest SIRact) significantly reduced EER by 60.88%, 51.84%,
66.15% over the Reverberation-with-Noise speaker verification baseline, respec-
tively, in the presence of STREET, CAR and HOME noises. The improvement
in EER for the speaker verification based on the multi-run ICA decreased over
the Reverberation-with-Noise speaker verification baseline when SNR increased.
Forensic speaker verification performance based on the multi-run ICA degraded compared with the Reverberation-with-Noise speaker veri-
fication baseline at high SNRs ranging from 5 dB to 10 dB for most types of envi-
ronmental noise. The results of comparison of forensic speaker verification based
on the multi-run ICA (highest SIRact) with Reverberation-with-Noise speaker
verification baseline in the presence of various types of environmental noise and
reverberation conditions were published in the IEEE International Conference
on Signal and Image Processing Applications. The conference paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
The multi-run ICA (highest SIRact and highest SIRest) algorithm improved the
forensic speaker verification performance over traditional ICA when the surveil-
lance recordings were mixed with various sessions of CAR, STREET and HOME
noises at SNRs ranging from -10 dB to 10 dB and interview recordings reverber-
ated at 0.15 sec. The reduction in EER between traditional ICA and multi-run
ICA (highest SIRest) can be calculated using Equation 5.1. The average reduction in EER is the mean EER reduction across the different types of environmental noise at each noise level, as shown in Figure 6.13. The results showed that forensic speaker verification based on the multi-
run ICA (highest SIRest) achieved a 2.20% to 6.39% improvement in average EER
over traditional ICA when the surveillance recordings were corrupted by different
types of environmental noise at SNRs ranging from -10 dB to 0 dB and interview
recordings reverberated at 0.15 sec reverberation time.
Figure 6.13: Average reduction in EER for the multi-run ICA (highest SIRest) algorithm over traditional ICA when interview recordings were reverberated at 0.15 sec and the surveillance recordings were mixed with different types of environmental noise at SNRs ranging from -10 dB to 10 dB.

The results also demonstrated that the performance of speaker verification based
on the ICA-EBM algorithm improved EER over the Reverberation-with-Noise
speaker verification baseline at low SNR values (-10 dB to 0 dB). At -10 dB SNR, the reduction in EER for forensic speaker verification based on the ICA-EBM algorithm was 64.56%, 54.07% and 62.63% over the baseline Reverberation-with-Noise speaker verification system, respectively, in the presence of STREET, CAR and
HOME noises. The improvement in the performance decreased when SNR in-
creased. The performance of speaker verification based on the ICA-EBM algo-
rithm degraded compared with the Reverberation-with-Noise speaker verification
baseline at SNRs ranging from 5 dB to 10 dB.
Figure 6.14 shows average EER reduction for the ICA-EBM algorithm compared
with the traditional ICA algorithm for different types of noise and reverberation.
When interview recordings reverberated at 0.15 sec and the surveillance record-
ings were corrupted with different types of noise at SNRs ranging from -10 dB to
0 dB, the performance of speaker verification based on the ICA-EBM achieved av-
erage EER reduction ranging from 7.25% to 8.31% compared with the traditional
ICA.
Figure 6.14: Average EER reduction for the ICA-EBM algorithm compared with traditional ICA for different types of noise and reverberation.

Figure 6.15: Average reduction in EER for the ICA-EBM algorithm over the multi-run ICA (highest SIRest) in the presence of various types of environmental noise at SNRs ranging from -10 dB to 0 dB and reverberation conditions.
Figure 6.15 shows average reduction in EER for forensic speaker verification based
on the ICA-EBM algorithm over multi-run ICA (highest SIRest). The results
demonstrated that forensic speaker verification based on the ICA-EBM algorithm reduced the average EER by 5.11% to 2.04% when interview recordings were reverberated at 0.15 sec reverberation time and surveillance recordings were mixed
with various types of environmental noise at SNRs ranging from -10 dB to 0 dB.

Table 6.3: Comparison of mDCFs for speaker verification based on the ICA-EBM algorithm and the Reverberation-with-Noise speaker verification baseline, at SNRs of -10, -5, 0, 5 and 10 dB.

Noise Type          Baseline                                      ICA-EBM
             -10     -5      0       5       10      -10     -5      0       5       10
STREET     0.0994  0.0951  0.0852  0.0685  0.0497  0.0691  0.0677  0.0680  0.0686  0.0682
HOME       0.1000  0.0953  0.0844  0.0665  0.0527  0.0706  0.0705  0.0704  0.0701  0.0705
CAR        0.0989  0.0904  0.0697  0.0545  0.0425  0.0666  0.0672  0.0679  0.0680  0.0689
Since forensic speaker verification based on the ICA-EBM algorithm achieved the highest improvement in performance compared with the multi-run ICA (highest SIRest) over the majority of SNRs ranging from -10 dB to 0 dB, as shown in Figure 6.12, we compared the mDCF of speaker
verification based on the ICA-EBM algorithm with the Reverberation-with-Noise
speaker verification baseline, as shown in Table 6.3. It is clear that speaker verifi-
cation performance based on the ICA-EBM algorithm improved mDCF over the
Reverberation-with-Noise speaker verification baseline when interview recordings
reverberated at 0.15 sec reverberation time and the surveillance recordings were
mixed with various types of environmental noise at low SNR values (-10 dB to
0 dB). Forensic speaker verification performance based on the ICA-EBM algo-
rithm degraded compared with the Reverberation-with-Noise speaker verification
baseline at SNRs ranging from 5 dB to 10 dB.
6.5.3.2 Effect of utterance length
In real forensic scenarios, the full-length utterance of the speech signals from a
suspect is often recorded in a police interview room where reverberation is present.
However, the police often record the speech from the criminal using hidden mi-
crophones in the presence of different types of noise and the utterance length of
the speech signals is uncontrolled.

Figure 6.16: Effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) in the presence of various types of environmental noise and reverberation conditions.

Therefore, in these experimental results, the
full length of the interview speech signal was convolved with the impulse response
of the room at 0.15 sec reverberation time to generate the reverberated speech
signals. The first room configuration shown in Table 6.1 was used in these simulation results. The duration of the surveillance recordings was changed
from 10 sec to 40 sec. The surveillance recordings were mixed with one random
segment of STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11.

Figure 6.17: Effect of utterance duration on speaker verification performance when using the ICA-EBM algorithm under different types of noise and reverberation environments.
We investigated the effect of utterance length on the performance of forensic speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms because these algorithms achieved better performance than the traditional ICA algorithm, as described in Section 6.5.3.1. Figures 6.16 and 6.17 show the effect of utterance length on the performance of speaker verifi-
cation based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms
in the presence of various types of environmental noise and reverberation condi-
tions. The SNRs on the x-axis of Figures 6.16 and 6.17 were computed from the
first microphone (x1). It is evident that the performance of speaker verification
based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms im-
proved when the interview recordings reverberated at 0.15 sec and the duration
of the surveillance recordings increased from 10 sec to 40 sec in the presence of
various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
6.5.3.3 Effect of reverberation time
This section describes the effect of reverberation time on the performance of foren-
sic speaker verification based on the multi-run ICA (highest SIRact and highest
SIRest) and the ICA-EBM algorithms. To validate the effect of reverberation
time on the performance of the multi-run ICA and the ICA-EBM algorithms, we
computed the impulse response of a room by using different reverberation times
(T20 = 0.15 sec, 0.20 sec and 0.25 sec). The first room configuration shown in Table 6.1 was used in these experimental results. The full duration of the in-
terview recordings was obtained from 200 speakers using the pseudo-police style.
The interview recordings were convolved with the impulse response of the room to
produce reverberated interview recordings at different reverberation times. The
surveillance recordings were 10 sec segments obtained from 200 speakers using the informal telephone conversation style. The surveillance recordings were mixed with one random segment of STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals
according to Equation 6.11.
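To illustrate how impulse responses at several reverberation times can be generated, the sketch below uses the pyroomacoustics image-source implementation. This is not the simulator used in the thesis: the room geometry and source/microphone positions are illustrative rather than the Table 6.1 configuration, and the library targets T60 via Sabine's formula, whereas the experiments above quote T20.

    import numpy as np
    import pyroomacoustics as pra

    fs = 8000
    room_dim = [5.0, 4.0, 3.0]                 # illustrative room geometry
    rirs = {}
    for rt60 in [0.15, 0.20, 0.25]:            # target reverberation times (sec)
        # Invert Sabine's formula to get wall absorption and image-source order.
        e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
        room = pra.ShoeBox(room_dim, fs=fs,
                           materials=pra.Material(e_absorption),
                           max_order=max_order)
        room.add_source([2.0, 2.0, 1.5])       # assumed suspect position
        room.add_microphone([3.5, 1.5, 1.2])   # assumed microphone position
        room.compute_rir()
        rirs[rt60] = np.asarray(room.rir[0][0])  # RIR from source 0 to mic 0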
Figure 6.18: Effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRact) when interview recordings were reverberated at different reverberation times (T = 0.15, 0.20 and 0.25 sec) and surveillance recordings were mixed with various types of environmental noise.

Figure 6.18 shows the effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRact) when
interview recordings were reverberated at different reverberation times and surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The SNRs on the x-axis of Figure 6.18 were computed
from the first microphone (x1). It is clear from this figure that increasing the re-
verberation time degrades the performance of noisy forensic speaker verification
based on the multi-run ICA (highest SIRact).

Figure 6.19: Effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRest) when interview recordings were reverberated at different reverberation times and surveillance recordings were mixed with various types of environmental noise.

The performance of the multi-run
ICA algorithm degraded by 12.96%, 16.96% and 16.64% when the reverberation
time increased from 0.15 sec to 0.25 sec and surveillance recordings were mixed
with STREET, CAR and HOME noises at 0 dB SNR. The reverberation time parameter reflects the length of the room impulse response, and a high reverberation time leads to increased distortion in the feature vectors [121]. Accordingly, increasing the reverberation time decreases the performance of foren-
sic speaker verification systems. The results of the effect of reverberation time
on the performance of forensic speaker verification based on the multi-run ICA
(highest SIRact) were published in the IEEE International Conference on Signal and Image Processing Applications. The conference paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
Figure 6.19 shows the effect of reverberation time on the performance of forensic
speaker verification based on the multi-run ICA (highest SIRest) when the in-
terview recordings were reverberated at different reverberation times and surveillance recordings were mixed with different types of environmental noise at SNRs ranging from
-10 dB to 10 dB. It is clear from this figure that as the reverberation time in-
creased from 0.15 sec to 0.25 sec, forensic speaker verification performance based
on the multi-run ICA (highest SIRest) decreased.
Figure 6.20 shows the effect of reverberation time on speaker verification per-
formance when using the ICA-EBM algorithm under conditions of noise and
reverberation. The SNRs on the x-axis of Figure 6.20 were computed from the first microphone (x1). It is clear that the noisy forensic speaker verification per-
formance based on the ICA-EBM degrades as the reverberation time increases.
At -10 dB SNR, the ICA-EBM performance degraded by 19.17%, 17.07% and 16.40% when the reverberation time increased from 0.15 sec to 0.25 sec and
surveillance recordings were mixed with STREET, CAR and HOME noises, re-
spectively.
Figure 6.20: Effect of reverberation time on speaker verification performance when using the ICA-EBM under conditions of noise and reverberation.
6.6 Chapter summary
In this chapter, forensic speaker verification systems based on the multi-run ICA and ICA-EBM approaches were developed for improving forensic speaker verification
performance in the presence of high levels of environmental noise and reverberant
conditions. The performance of speaker verification based on the multi-run ICA
or the ICA-EBM algorithm was compared with the baselines (Noise-without-
Reverberation and Reverberation-with-Noise speaker verification systems), and
traditional ICA algorithm. Forensic speaker verification was evaluated in the
presence of various types of environmental noise (STREET, CAR and HOME)
at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
Experimental results demonstrated that the performance of forensic speaker ver-
ification based on the multi-run ICA or the ICA-EBM algorithm improved EER
over the baseline Noise-without-Reverberation speaker verification system at low
SNRs ranging from -10 dB to 0 dB. The results also showed that forensic speaker
verification based on the multi-run ICA algorithm improved EER compared with
the traditional ICA algorithm because the multi-run ICA algorithm solved the
problem of randomness in the separation of the speech from the noisy speech
signals in the traditional ICA algorithm by choosing the most suitable unmixing
matrix that had the highest SIR. The selection of the suitable unmixing matrix
gives a clear separation of the clean speech from the noisy speech signal and leads
to improved forensic speaker verification performance under different types of en-
vironmental noise. Some experimental results of comparison of forensic speaker
verification based on the multi-run ICA algorithm with other techniques were
published in the 11th International Conference on Signal Processing and Com-
munication Systems. The paper is titled “Speaker verification with multi-run ICA based speech enhancement” [3].
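A minimal sketch of this selection strategy is given below, using scikit-learn's FastICA as a stand-in for the ICA implementation in this thesis. Each run starts from a different random initialization, and the run whose estimated source attains the highest actual SIR against a known clean reference (SIRact) is kept; the estimated-SIR variant (SIRest) would replace sir_db with a blind estimator.

    import numpy as np
    from sklearn.decomposition import FastICA

    def sir_db(estimate, reference):
        # Actual SIR: energy of the projection onto the clean reference vs. the rest.
        ref = reference / np.linalg.norm(reference)
        target = np.dot(estimate, ref) * ref
        interference = estimate - target
        return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))

    def multi_run_ica(x, clean_speech, n_runs=10):
        # x: (2, T) two-channel noisy mixture; clean_speech: reference for SIRact.
        best_sir, best_speech = -np.inf, None
        for seed in range(n_runs):
            ica = FastICA(n_components=2, random_state=seed, max_iter=500)
            sources = ica.fit_transform(x.T).T      # each row is an estimated source
            for s in sources:                       # keep the most speech-like output
                sir = sir_db(s, clean_speech)
                if sir > best_sir:
                    best_sir, best_speech = sir, s
        return best_speech, best_sir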
The results also demonstrated that forensic speaker verification based on the ICA-EBM algorithm improved performance over the traditional ICA algorithm when surveillance recordings were corrupted with various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The ICA-EBM algorithm is well suited to noisy speech separation applications due to its good convergence behavior. Speech/audio signals are usually either super-Gaussian or slightly skewed in nature and, hence, are separated well by ICA-EBM compared with traditional ICA methods, owing to its tighter entropy bounds and superior convergence properties.
The results also demonstrated that forensic speaker verification performance
based on the multi-run ICA or ICA-EBM algorithm degraded compared with the
baselines of Noise-without-Reverberation and Reverberation-with-Noise speaker
verification under various types of environmental noise at high SNRs ranging from 5 dB to 10 dB, in both noise-only and noisy reverberant conditions. This degradation in the performance of the speaker verification systems was due to the poor separation of the enhanced speech signals from the noise signals at high SNRs.
The performance of speaker verification based on the multi-run ICA or ICA-
EBM algorithm was also evaluated in the Reverberation-with-Noise conditions.
The algorithms also improved EER over the Reverberation-with-Noise speaker
verification baseline and traditional ICA at low SNRs ranging from -10 dB to 0
dB. Thus, forensic speaker verification based on the multi-run ICA or the ICA-
EBM algorithm can be used for improving forensic speaker verification perfor-
mance, especially when surveillance recordings are corrupted by different types
of environmental noise at low SNRs. Some results of comparison of forensic
speaker verification based on the multi-run ICA algorithm with the baseline of
Reverberation-with-Noise were published in the IEEE International Conference
on Signal and Image Processing Applications. The paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
Chapter 7
Conclusions and future directions
7.1 Introduction
This chapter gives a summary of the work presented in this thesis and the con-
clusions drawn from it. The summary describes the three research contributions
identified in Chapter 1: designing noisy and reverberant frameworks for simulating forensic speaker verification performance under real-world situations, proposing a new fusion of feature warping with DWT-MFCC and MFCC, and introducing the multi-run ICA and ICA-EBM algorithms as multiple channel speech enhancement algorithms to improve forensic speaker verification performance in the presence of high levels of environmental noise and reverberation conditions.
7.2 Original Contributions
The main contributions resulting from this work are as follows:
7.2.1 Designing noisy and reverberant frameworks
As the forensic audio recordings from the AFVC database contain clean speech
signals, forensic speaker verification performance cannot be evaluated in the pres-
ence of various types of environmental noise and reverberant conditions. There-
fore, it is necessary to design noisy and reverberant frameworks from the AFVC
and QUT-NOISE databases for evaluating the robustness of forensic speaker ver-
ification performance under noisy and reverberant conditions. The noisy and
reverberant frameworks based on the single and multiple microphones were de-
signed in Chapter 3 and were used to simulate the performance of forensic speaker
verification based on robust feature extraction techniques and the ICA algorithms
under real-world situations in the presence of different levels and types of environ-
mental noise and reverberation conditions. The noisy and reverberant frameworks
based on the single and multiple microphones were presented in “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] in the IEEE Access journal, “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6] in the IEEE International Conference on Signal and Image Processing Applications, “Speaker verification with multi-run ICA based speech enhancement” [3] in the 11th International Conference on Signal Processing and Communication Systems, and “Enhanced forensic speaker verification using multi-run ICA in the presence of noise and reverberation conditions” [7] in the IEEE International Conference on Signal and Image Processing Applications.
7.2.2 DWT-MFCC feature warping
Chapter 5 investigated the effect of feature warping on DWT-MFCC and MFCC
features, demonstrating that while feature warping has no or detrimental effect
on DWT-MFCC alone, it increases the complementary features, allowing for bet-
ter performance in fusion with feature-warped MFCC features under noisy and
reverberant conditions. These findings were published as “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] in the IEEE Access journal and “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6] in the IEEE International Conference on Signal and Image Processing Applications.
The new fusion technique of feature warping with DWT-MFCC and MFCC was
also proposed to improve forensic speaker verification performance in the presence
of various types of environmental noise only. The proposed technique is based
on combining the feature-warped MFCC of the full band of the speech signals
with the same features extracted from the DWT. The performance of forensic
speaker verification based on the fusion of feature warping with DWT-MFCC
and MFCC was compared with different feature extraction techniques: MFCC,
feature-warped MFCC, DWT-MFCC, feature-warped DWT-MFCC, and fusion
of DWT-MFCC and MFCC. The results demonstrated that fusion of feature
warping with DWT-MFCC and MFCC improved forensic speaker verification
performance in the presence of various types of environmental noise under the
majority of SNRs compared with other feature extraction techniques.
The robustness of fusion of feature warping with DWT-MFCC and MFCC was
also investigated and compared with other feature extraction techniques to im-
prove forensic speaker verification performance under reverberation conditions
only. Experimental results showed that fusion of feature warping with DWT-
MFCC and MFCC improved EER compared with other feature extraction tech-
niques. The effect of reverberation time and the position of suspect/microphone
on forensic speaker verification performance based on the fusion of feature warp-
ing with DWT-MFCC and MFCC was also studied.
Forensic speaker verification performance based on the fusion of feature warping
with DWT-MFCC and MFCC was also investigated and compared with the tradi-
tional MFCC and feature-warped MFCC under noisy and reverberant conditions.
Simulation results demonstrated that fusion of feature warping with DWT-MFCC
and MFCC improved EER and mDCF compared with feature-warped MFCC in
the presence of reverberation and different types and levels of environmental
noise. The proposed fusion of feature warping with DWT-MFCC and MFCC can
be used as the feature extraction technique in forensic speaker verification based
on the multi-run ICA and ICA-EBM algorithms, which are described in the next section.
7.2.3 Multi-run ICA and ICA-EBM algorithms
The performance of forensic speaker verification systems degrades significantly in
the presence of high levels of environmental noise and reverberant conditions. It
is difficult to use noisy forensic recordings as part of legal evidence in real forensic
applications because the quality of these recordings is often poor. Multi-run ICA
and ICA-EBM algorithms were used as multiple channel speech enhancement
algorithms in Chapter 6 to reduce the effect of environmental noise and improve
forensic speaker verification performance.
Although the multi-run ICA algorithm was used in biosignal applications and the
ICA-EBM algorithm was used to separate the mixed speech signals under clean
conditions, the effectiveness of the multi-run ICA and ICA-EBM algorithms was
investigated for the first time as speech enhancement algorithms in this thesis
to improve forensic speaker verification performance under noisy and reverberant
conditions.
Investigations were also performed to study the effectiveness of the developed
forensic speaker verification based on the multi-run ICA or the ICA-EBM algo-
rithm for improving forensic speaker verification performance under noisy con-
ditions. The developed forensic speaker verification systems were also compared
with the baseline Noise-without-Reverberation speaker verification system and the
traditional ICA algorithm. Simulation results demonstrated that forensic speaker
verification based on the multi-run ICA or ICA-EBM algorithm improved forensic
speaker verification performance compared with the baseline of Noise-without-
Reverberation speaker verification system in the presence of different types of
environmental noise at SNRs ranging from -10 dB to 0 dB. The results also
demonstrated that forensic speaker verification based on the multi-run ICA or
ICA-EBM algorithm improved the performance of speaker verification over tra-
ditional ICA in the presence of environmental noise at SNRs ranging from -10 dB
to 10 dB. The outcomes of forensic speaker verification performance based on the
multi-run ICA in the presence of various types of environmental noise were pub-
lished as “Speaker verification with multi-run ICA based speech enhancement”
[3] in the 11th International Conference on Signal Processing and Communication
Systems.
The performance of forensic speaker verification based on the multi-run ICA and ICA-EBM algorithms was also evaluated in the presence of various types of envi-
ronmental noise and reverberation conditions. Simulation results demonstrated
that forensic speaker verification based on the multi-run ICA or ICA-EBM al-
gorithm improved forensic speaker verification performance compared with the
baseline of Reverberation-with-Noise speaker verification system in the presence
of reverberation and different types of environmental noise at SNRs ranging from
-10 dB to 0 dB. The results also demonstrated that forensic speaker verification
based on the multi-run ICA or ICA-EBM algorithm improved the performance of
speaker verification over traditional ICA in the presence of noise at SNRs rang-
ing from -10 dB to 10 dB and reverberation conditions. Some results of forensic
speaker verification performance based on the multi-run ICA algorithm in the
presence of environmental noise and reverberation conditions were published as
“Enhanced forensic speaker verification using multi-run ICA in the presence of
noise and reverberation conditions” [7] in the IEEE International Conference on
Signal and Image Processing Applications.
7.3 Future directions
This study has developed techniques to improve forensic speaker verification per-
formance in the presence of high levels of environmental noise and reverberation
conditions. Some potential areas of future research include:
• This research proposed a speaker verification system based on a feature extraction technique that is robust to noise and reverberation conditions. It also investigated the effectiveness of a speaker verification system based on the multi-run ICA or ICA-EBM algorithm for improving forensic speaker verification performance under noisy and reverberant conditions. These systems were evaluated using the EER and mDCF. It would also be useful to evaluate forensic speaker verification performance using the forensic-eval-01 evaluation described in [97].
• Although the DNN based i-vector did not show significant improvement in forensic speaker verification performance compared with the UBM based i-vector when a limited amount of forensic data is available for training the DNN, it would be interesting to investigate the effectiveness of applying the proposed fusion of feature warping with DWT-MFCC and MFCC to the DNN posterior mapping approach described in [145] for improving forensic speaker verification performance in the presence of various types of environmental noise and reverberation conditions.
• Since the proposed speaker verification based on the multi-run ICA or ICA-EBM algorithm degraded compared with the baselines of Noise-without-Reverberation and Reverberation-with-Noise speaker verification systems at SNRs ranging from 5 dB to 10 dB, an SNR estimate could be computed before the speech enhancement stage to decide whether the multi-run ICA or ICA-EBM algorithm should be applied to reduce the effect of noise on the noisy speech signals (a minimal gating sketch is given after this list).
• The performance of forensic speaker verification based on the multi-run ICA or ICA-EBM algorithm was evaluated when the interview data were reverberated and the surveillance speech signals were corrupted by various types of environmental noise without reverberation. It would also be interesting to evaluate the performance of forensic speaker verification based on a convolutive ICA algorithm when the surveillance data are corrupted by environmental noise in the presence of reverberation conditions.
• The computational complexity of the fusion of feature warping with DWT-MFCC and MFCC features could be calculated and compared with the computational complexity of the other fusion feature extraction techniques described in Chapter 5.
• The performance of forensic speaker verification systems based on a robust feature extraction technique and the multi-run ICA or ICA-EBM algorithm could be evaluated using an adaptive linear energy detector for voice activity detection, and the results compared with those obtained from the statistical voice activity detection algorithm used in this thesis.
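A minimal sketch of the SNR-gating idea raised above is given below. The energy-quantile SNR estimator and the 5 dB switching threshold are assumptions for illustration; the thesis does not prescribe a particular estimator.

    import numpy as np

    def estimate_snr_db(x, frame=256, noise_quantile=0.2):
        # Crude estimate: treat the lowest-energy frames as noise-only.
        n_frames = len(x) // frame
        energies = (x[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
        noise_p = max(np.quantile(energies, noise_quantile), 1e-12)
        speech_p = max(energies.mean() - noise_p, 1e-12)
        return 10.0 * np.log10(speech_p / noise_p)

    def gated_enhancement(x, enhance, snr_threshold_db=5.0):
        # Apply multi-run ICA / ICA-EBM enhancement only when the estimated SNR is low.
        if estimate_snr_db(x[0]) < snr_threshold_db:
            return enhance(x)        # x: (2, T) two-channel noisy recording
        return x[0]                  # high SNR: use the first microphone directly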
Bibliography
[1] M. I. Abdalla, H. M. Abobakr, and T. S. Gaafar, “DWT and MFCCs
based feature extraction methods for isolated word recognition,” Interna-
tional Journal of Computer Applications, vol. 69, no. 20, pp. 21–26, 2013.
[2] T. Adali, M. Anderson, and G.-S. Fu, “Diversity in independent component
and vector analyses: Identifiability, algorithms, and applications in medical
imaging,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 18–33, 2014.
[3] A. K. H. Al-Ali, D. Dean, B. Senadji, M. Baktashmotlagh, and V. Chan-
dran, “Speaker verification with multi-run ICA based speech enhancement,”
in 11th International Conference on Signal Processing and Communication
Systems, pp. 1–7, 2017.
[4] A. K. H. Al-Ali, D. Dean, B. Senadji, and V. Chandran, “Comparison of
speech enhancement algorithms for forensic applications,” Proceedings of the
16th Australian International Conference on Speech Science and Technology,
pp. 169–172, December, 2016.
[5] A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran, and G. R. Naik, “En-
hanced forensic speaker verification using a combination of DWT and MFCC
feature warping in the presence of noise and reverberation conditions,” IEEE
Access, vol. 5, no. 99, pp. 15400–15413, 2017.
[6] A. K. H. Al-Ali, B. Senadji, and V. Chandran, “Hybrid DWT and MFCC
feature warping for noisy forensic speaker verification in room reverbera-
tion,” in IEEE International Conference on Signal and Image Processing
Applications, pp. 434–439, 2017.
[7] A. K. H. Al-Ali, B. Senadji, and G. R. Naik, “Enhanced forensic speaker
verification using multi-run ICA in the presence of environmental noise and
reverberation conditions,” in IEEE International Conference on Signal and
Image Processing Applications, pp. 174–179, 2017.
[8] S. S. Alamri, “Text-independent, automatic speaker recognition system eval-
uation with males speaking both Arabic and English,” M.S. thesis, Univer-
sity of Colorado Denver, USA, 2015.
[9] H. Ali, N. Ahmad, X. Zhou, K. Iqbal, and S. M. Ali, “DWT features per-
formance analysis for automatic speech recognition of Urdu,” SpringerPlus,
vol. 3, pp. 1–10, 2014.
[10] T. Awasthy and A. Kumar, “Analysis of Fast-ICA algorithm for separation of
mixed images,” International Journal of Electronics and Computer Science
Engineering, pp. 1252–1256.
[11] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to
blind separation and blind deconvolution,” Neural computation, vol. 7, no. 6,
pp. 1129–1159, 1995.
[12] A. Benyassine, E. Shlomot, H.-Y. Su, D. Massaloux, C. Lamblin, and J.-
P. Petit, “ITU-T recommendation G. 729 Annex B: a silence compression
scheme for use with G. 729 optimized for V. 70 digital simultaneous voice
and data applications,” IEEE Communications Magazine, vol. 35, no. 9,
pp. 64–73, 1997.
[13] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech cor-
rupted by acoustic noise,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 208–211, 1979.
[14] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau,
S. Meignier, T. Merlin, J. Ortega-Garcıa, D. Petrovska-Delacretaz, and D. A.
Reynolds, “A tutorial on text-independent speaker verification,” EURASIP
Journal on Applied Signal Processing, vol. 2004, no. 4, pp. 430–451, 2004.
[15] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27,
no. 2, pp. 113–120, 1979.
[16] Z. Boukouvalas, R. Mowakeaa, G.-S. Fu, and T. Adali, “Independent com-
ponent analysis by entropy maximization with kernels,” arXiv preprint
arXiv:1610.07104, pp. 1–6, 2016.
[17] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brummer,
“Discriminatively trained probabilistic linear discriminant analysis for
speaker verification,” in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4832–4835, 2011.
[18] J. P. Campbell, “Speaker recognition: A tutorial,” Proceedings of the IEEE,
vol. 85, no. 9, pp. 1437–1462, 1997.
[19] J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J.-F. Bonastre,
and D. Matrouf, “Forensic speaker recognition,” IEEE Signal Processing
Magazine, vol. 26, no. 2, pp. 95–103, 2009.
[20] J. Chang and D. Wang, “Robust speaker recognition based on DNN/i-vectors
and speech separation,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 5415–5419, 2017.
[21] W.-C. Chen, C.-T. Hsieh, and E. Lai, “Multiband approach to robust text-
independent speaker identification,” Journal of Computational Linguistics
and Chinese Language Processing, vol. 9, no. 2, pp. 63–76, 2004.
[22] S.-H. Chen and Y.-R. Luo, “Speaker verification using MFCC and support
vector machine,” in Proceedings of the International Multi conference of En-
gineers and Computer Scientists, vol. 1, 2009.
[23] L. Chun-Lin, “A tutorial of the wavelet transform,” Department of Electrical
Engineering, National Taiwan University, Taiwan, pp. 1–71, 2010.
[24] A. Cichocki and S. Amari, Adaptive blind signal and image processing: learn-
ing algorithms and applications. John Wiley & Sons, 2002.
[25] P. Comon, “Independent component analysis, a new concept?,” Signal pro-
cessing, vol. 36, no. 3, pp. 287–314, 1994.
[26] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning,
vol. 20, no. 3, pp. 273–297, 1995.
[27] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,” IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4,
pp. 357–366, 1980.
[28] D. B. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. H. Rahman, and
S. Sridharan, “The QUT-NOISE-SRE protocol for the evaluation of noisy
speaker recognition,” in Proceedings of the 16th Annual Conference of the
International Speech Communication Association (Interspeech), pp. 3456–
3460, 2015.
[29] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-
TIMIT corpus for the evaluation of voice activity detection algorithms,” in
Proceedings of Interspeech, pp. 1–4, 2010.
[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel,
“Support vector machines versus fast scoring in the low-dimensional to-
tal variability space for speaker verification.,” in Proceedings of Interspeech,
pp. 1559–1562, 2009.
[31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-
end factor analysis for speaker verification,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[32] F. Denk, J. P. C. da Costa, and M. A. Silveira, “Enhanced forensic multiple
speaker recognition in the presence of coloured noise,” in 8th International
Conference on Signal Processing and Communication Systems (ICSPCS),
pp. 1–7, 2014.
[33] D. L. Donoho, “De-noising by soft-thresholding,” IEEE Transactions on
Information Theory, vol. 41, no. 3, pp. 613–627, 1995.
[34] S. Du, X. Xiao, and E. S. Chng, “DNN feature compensation for noise robust
speaker verification,” in IEEE China Summit and International Conference
on Signal and Information Processing (ChinaSIP), pp. 871–875, 2015.
[35] L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena,
A. Lawson, Y. Lei, P. Matejka, O. Plchot, and N. Scheffer, “Promoting
robustness for speaker modeling in the community: the PRISM evaluation
set,” in Proceedings of NIST 2011 workshop, pp. 1–7, Citeseer, 2011.
[36] A. J. Fisher and S. Sridharan, “Speech enhancement for forensic and telecom-
munication applications,” in Fifth International conference on Speech Sci-
ence and Technology, pp. 40–45, 1994.
[37] S. Furui, “Cepstral analysis technique for automatic speaker verification,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-
29, no. 2, pp. 254–272, 1981.
[38] S. Furui, “An overview of speaker recognition technology,” in Proceedings
ESCA Workshop on Automatic Speaker Recognition, Identification, and Ver-
ification, pp. 1–9, 1994.
[39] S. Furui, “Recent advances in speaker recognition,” Pattern Recognition Let-
ters, vol. 18, no. 9, pp. 859–872, 1997.
[40] S. Ganapathy, J. Pelecanos, and M. K. Omar, “Feature normalization for
speaker verification in room reverberation,” in IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pp. 4836–4839,
2011.
[41] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative evaluation of
various MFCC implementations on the speaker verification task,” in Pro-
ceedings of the SPECOM, pp. 191–194, 2005.
[42] D. Garcia-Romero, Robust Speaker Recognition Based on Latent Variable
Models. PhD thesis, University of Maryland at College Park, 2012.
[43] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length nor-
malization in speaker recognition systems,” in Proceedings of Interspeech,
pp. 249–252, 2011.
[44] Y. Ghanbari and M. R. Karami-Mollaei, “A new approach for speech en-
hancement based on the adaptive thresholding of the wavelet packets,”
Speech Communication, vol. 48, no. 8, pp. 927–940, 2006.
[45] R. Haddadi, E. Abdelmounim, M. El Hanine, and A. Belaguid, “Discrete
wavelet transform based algorithm for recognition of QRS complexes,” in
International Conference on Multimedia Computing and Systems, pp. 375–
379, 2014.
[46] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means cluster-
ing algorithm,” Journal of the Royal Statistical Society. Series C (Applied
Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[47] A. O. Hatch, S. S. Kajarekar, and A. Stolcke, “Within-class covariance nor-
malization for SVM-based speaker recognition,” in 9th International Con-
ference on Spoken Language Processing, pp. 1471–1474, 2006.
[48] H. Hermansky, B. Hanson, and H. Wakita, “Perceptually based linear pre-
dictive analysis of speech,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 509–512, 1985.
[49] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans-
actions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[50] A. Higgins, L. Bahler, and J. Porter, “Speaker verification using randomized
phrase prompting,” Digital Signal Processing, vol. 1, no. 2, pp. 89–106, 1991.
[51] H.-G. Hirsch and D. Pearce, “The Aurora experimental framework for the
performance evaluation of speech recognition systems under noisy condi-
tions,” in ISCA Tutorial Research workshop ASR2000, pp. 181–188, 2000.
[52] P. J. Huber, “Projection pursuit,” The annals of Statistics, vol. 13, no. 2,
pp. 435–475, 1985.
[53] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent com-
ponent analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3,
pp. 626–634, 1999.
[54] A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis.
John Wiley & Sons, 2004.
[55] A. Hyvarinen and E. Oja, “Independent component analysis: algorithms and
applications,” Neural Networks, vol. 13, no. 4, pp. 411–430, 2000.
[56] M. A. Islam, W. A. Jassim, N. S. Cheok, and M. S. A. Zilany, “A robust
speaker identification system using the responses from a model of the audi-
tory periphery,” PlOS one, vol. 11, no. 7, pp. 1–21, 2016.
[57] Q. Jin, Robust speaker recognition. PhD thesis, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA, 2007.
[58] Q. Jin, T. Schultz, and A. Waibel, “Far-field speaker recognition,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7,
pp. 2023–2032, 2007.
[59] A. Kanagasundaram, Speaker verification using I-vector features. PhD thesis,
Queensland University of Technology, Australia, Brisbane, 2014.
[60] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez,
J. Gonzalez-Rodriguez, and D. Ramos, “Improving short utterance i-vector
speaker verification using utterance variance modelling and compensation
techniques,” Speech Communication, vol. 59, pp. 69–82, 2014.
[61] A. Kanagasundaram, D. Dean, S. Sridharan, M. McLaren, and R. Vogt,
“I-vector based speaker recognition using advanced channel compensation
techniques,” Computer Speech and Language, vol. 28, no. 1, pp. 121–140,
2014.
[62] F. Kelly, Automatic recognition of ageing speakers. PhD thesis, Trinity Col-
lege Dublin, Ireland, 2014.
[63] P. Kenny, “Joint factor analysis of speaker and session variability: Theory
and algorithms,” CRIM, Montreal,(Report) CRIM-06/08-13, pp. 1–17, 2005.
[64] P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” in
Odyssey Speaker and Language Recogntion Workshop, 2010.
[65] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with
sparse training data,” IEEE Transactions on Speech and Audio Processing,
vol. 13, no. 3, pp. 345–354, 2005.
[66] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Factor analysis
simplified,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pp. I– 637– I– 640, 2005.
[67] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Improvements in
factor analysis based speaker verification,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. I– 113– I– 116,
2006.
[68] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis
versus eigenchannels in speaker recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[69] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and ses-
sion variability in GMM-based speaker verification,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1448–1460, 2007.
[70] P. Kenny and P. Dumouchel, “Disentangling speaker and channel effects
in speaker verification,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 1, pp. 37–40, 2004.
[71] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study
of interspeaker variability in speaker verification,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[72] S. Kim, M. Ji, and H. Kim, “Noise-robust speaker recognition using sub-
band likelihoods and reliable-feature selection,” ETRI Journal, vol. 30, no. 1,
pp. 89–100, 2008.
[73] S. Kim, M. Ji, and H. Kim, “Robust speaker recognition based on filtering
in autocorrelation domain and sub-band feature recombination,” Pattern
Recognition Letters, vol. 31, no. 7, pp. 593–599, 2010.
[74] J. Kola, C. Espy-Wilson, and T. Pruthi, “Voice activity detection,” Merit
Bien, pp. 1–6, 2011.
[75] M. Kolbœk, Z.-H. Tan, and J. Jensen, “Speech enhancement using long
short-term memory based recurrent neural networks for noise robust speaker
verification,” in IEEE Spoken Language Technology Workshop (SLT),
pp. 305–311, 2016.
[76] Z. Koldovsky, J. Malek, P. Tichavsky, Y. Deville, and S. Hosseini, “Blind
separation of piecewise stationary non-Gaussian sources,” Signal Processing,
vol. 89, no. 12, pp. 2570–2584, 2009.
[77] T.-W. Lee, Independent component analysis. Springer, 1998.
[78] E. A. Lehmann and A. M. Johansson, “Prediction of energy decay in room
impulse responses simulated with an image-source model,” Journal of the
Acoustical Society of America, vol. 124, no. 1, pp. 269–277, 2008.
[79] E. A. Lehmann, A. M. Johansson, and S. Nordholm, “Reverberation-time
prediction method for room impulse responses simulated with the image-
source model,” in IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, pp. 159–162, 2007.
[80] L. Lei and S. Kun, “Speaker recognition using wavelet cepstral coefficient, i-
vector, and cosine distance scoring and its application for forensics,” Journal
of Electrical and Computer Engineering, vol. 2016, pp. 1–11, 2016.
[81] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker
recognition using a phonetically-aware deep neural network,” in IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 1695–1699, 2014.
[82] X.-L. Li and T. Adali, “Independent component analysis by entropy bound
minimization,” IEEE Transactions on Signal Processing, vol. 58, no. 10,
pp. 5151–5164, 2010.
[83] K.-P. Li and J. E. Porter, “Normalizations and selection of speech seg-
ments for speaker recognition scoring,” in IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), pp. 595–598, 1988.
[84] H. Li, H. Wang, and B. Xiao, “Blind separation of noisy mixed speech signals
based on wavelet transform and independent component analysis,” in 8th
International Conference on Signal Processing, pp. 1–4, 2006.
[85] H.-Y. Li, Q.-H. Zhao, G.-L. Ren, and B.-J. Xiao, “Speech enhancement al-
gorithm based on independent component analysis,” in 5th International
Conference on Natural Computation, pp. 598–602, 2009.
[86] H. Maged, A. AbouEl-Farag, and S. Mesbah, “Improving speaker identifica-
tion system using discrete wavelet transform and AWGN,” in 5th IEEE Inter-
national Conference on Software Engineering and Service Science, pp. 1171–
1176, 2014.
[87] S. Malik and F. A. Afsar, “Wavelet transform based automatic speaker recog-
nition,” in 13th IEEE International Multitopic Conference, pp. 1–4, 2009.
[88] S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet
representation,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 11, no. 7, pp. 674–693, 1989.
[89] M. I. Mandasari, M. McLaren, and D. A. van Leeuwen, “The effect of noise
on modern automatic speaker recognition systems,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4249–
4252, 2012.
[90] J. McAuley, J. Ming, D. Stewart, and P. Hanna, “Subband correlation and
robust speech recognition,” IEEE Transactions on Speech and Audio Pro-
cessing, vol. 13, no. 5, pp. 956–964, 2005.
[91] M. McLaren and D. van Leeuwen, “Improved speaker recognition when using
i-vectors from multiple speech sources,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5460–5463, 2011.
[92] T. M. Cover and J. A. Thomas, Elements of Information Theory. John
Wiley & Sons Ltd, 1991.
[93] H. Melin, Automatic speaker verification on site and by telephone: meth-
ods, applications and assessment. PhD thesis, KTH Computer Science and
Communication, Stockholm, Sweden, 2006.
[94] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, “Robust speaker
recognition in noisy conditions,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 15, no. 5, pp. 1711–1723, 2007.
[95] N. Mirghafori and N. Morgan, “Combining connectionist multi-band and
full-band probability streams for speech recognition of natural numbers,” in
5th International Conference on Spoken Language Processing, pp. 1–4, 1998.
[96] M. Misiti, Y. Misiti, G. Oppenheim, and J.-M. Poggi, Wavelet Toolbox
User's Guide. The MathWorks Inc., 1996.
[97] G. S. Morrison and E. Enzinger, “Multi-laboratory evaluation of forensic
voice comparison systems under conditions reflecting those of a real foren-
sic case (forensic_eval_01) – Introduction,” Speech Communication, vol. 85,
pp. 119–126, 2016.
[98] G. S. Morrison, P. Rose, and C. Zhang, “Protocol for the collection of
databases of recordings for forensic-voice-comparison research and practice,”
Australian Journal of Forensic Sciences, vol. 44, no. 2, pp. 155–167, 2012.
[99] G. Morrison, C. Zhang, E. Enzinger, F. Ochoa, D. Bleach, M. John-
son, B. Folkes, S. De Souza, N. Cummins, and D. Chow, “Forensic
database of voice recordings of 500+ Australian English speakers,” URL:
http://databases.forensic-voice-comparison.net, 2015.
[100] G. R. Naik, Iterative issues of ICA, quality of separation and number of
sources: a study for biosignal applications. PhD thesis, RMIT University,
Melbourne, Australia, 2008.
[101] G. R. Naik and D. K. Kumar, “An overview of independent component
analysis and its applications,” Informatica, vol. 35, no. 1, pp. 63–81, 2011.
[102] G. R. Naik and D. K. Kumar, “Identification of hand and finger movements
using multi run ICA of surface electromyogram,” Journal of Medical Systems,
vol. 36, no. 2, pp. 841–851, 2012.
[103] G. R. Naik, D. K. Kumar, and M. Palaniswami, “Multi run ICA and surface
EMG based signal processing system for recognising hand gestures,” in 8th
IEEE International Conference on Computer and Information Technology,
pp. 700–705, 2008.
[104] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhance-
ment techniques for automatic speaker recognition,” in 4th International
Conference on Spoken Language Processing, pp. 929–932, 1996.
[105] A. Papoulis, Probability, Random Variables and Stochastic Processes.
McGraw-Hill, 1991.
[106] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verifica-
tion,” in Proceedings of the Odyssey Speaker Recognition Workshop,
pp. 213–218,
2001.
[107] M. Phythian, Speaker identification for forensic applications. PhD thesis,
Queensland University of Technology, Brisbane, Australia, 1998.
[108] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis
for inferences about identity,” in 11th IEEE International Conference on
Computer Vision, pp. 1–8, 2007.
[109] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the end-
points of isolated utterances,” Bell System Technical Journal, vol. 54, no. 2,
pp. 297–315, 1975.
[110] D. A. Reynolds, “Experimental evaluation of features for robust speaker
identification,” IEEE Transactions on Speech and Audio Processing, vol. 2,
no. 4, pp. 639–643, 1994.
[111] D. A. Reynolds, “Automatic speaker recognition using Gaussian mixture
speaker models,” The Lincoln Laboratory Journal, vol. 8, pp. 173–192,
1995.
[112] D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91–108,
1995.
[113] D. A. Reynolds, “Automatic speaker recognition: Current approaches and
future trends,” Speaker Verification: From Research to Reality, 2001.
[114] D. A. Reynolds, “An overview of automatic speaker recognition technol-
ogy,” in IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pp. IV-4072–IV-4075, 2002.
[115] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10,
pp. 19–41, 2000.
[116] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker iden-
tification using Gaussian mixture speaker models,” IEEE Transactions on
Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[117] D. Ribas, E. Vincent, and J. R. Calvo, “Full multicondition training for
robust i-vector based speaker recognition,” in Proceedings of Interspeech,
pp. 1–5, 2015.
[118] J. Rosca, R. Balan, and C. Beaugeant, “Multi-channel psychoacoustically
motivated speech enhancement,” in International Conference on Multimedia
and Expo, pp. III-217–III-220, 2003.
[119] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR Identity Toolbox v1.0: A
MATLAB toolbox for speaker-recognition research,” Speech and Language
Processing Technical Committee Newsletter, vol. 1, no. 4, 2013.
[120] M. Senoussaoui, P. Kenny, N. Brummer, E. De Villiers, and P. Dumouchel,
“Mixture of PLDA models in i-vector space for gender-independent speaker
recognition,” in Proceedings of Interspeech, pp. 25–28, 2011.
[121] N. R. Shabtai, Y. Zigel, and B. Rafaely, “The effect of GMM order and
CMS on speaker recognition with reverberant speech,” in Hands-Free Speech
Communication and Microphone Arrays, pp. 144–147, 2008.
[122] N. R. Shabtai, Y. Zigel, and B. Rafaely, “The effect of room parameters on
speaker verification using reverberant speech,” in 25th IEEE Convention of
Electrical and Electronics Engineers, pp. 231–235, 2008.
[123] A. Shafik, S. M. Elhalafawy, S. Diab, B. M. Sallam, and F. A. El-Samie, “A
wavelet based approach for speaker identification from degraded speech,” In-
ternational Journal of Communication Networks and Information Security,
vol. 1, no. 3, pp. 52–58, 2009.
[124] N. Shanmugapriya and E. Chandra, “Evaluation of sound classification us-
ing modified classifier and speech enhancement using ICA algorithm for hear-
ing aid application,” ICTACT Journal on Communication Technology, vol. 7,
no. 1, pp. 1279–1288, 2016.
[125] M. Sharma and R. Mammone, “Subword-based text-dependent speaker ver-
ification system with user-selectable passwords,” in IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 93–96,
1996.
[126] H. Sheikhzadeh and H. R. Abutalebi, “An improved wavelet-based speech
enhancement system,” in 7th European Conference on Speech Communica-
tion and Technology, pp. 1855–1858, 2001.
[127] M. A. Silveira, C. P. Schroeder, J. Da Costa, C. G. de Oliveira, J. A. A.
Junior, and S. Junior, “Convolutive ICA-based forensic speaker identification
using mel frequency cepstral coefficients and Gaussian mixture models,” The
International Journal of Forensic Computer Science, vol. 1, pp. 27–34, 2013.
[128] N. Singh, R. Khan, and R. Shree, “Applications of speaker recognition,”
Procedia Engineering, vol. 38, pp. 3122–3126, 2012.
[129] L. Singh and S. Sridharan, “Speech enhancement for forensic applications
using dynamic time warping and wavelet packet analysis,” in IEEE Region 10
Annual Conference (TENCON): Speech and Image Technologies for Computing
and Telecommunications, pp. 475–478, 1997.
[130] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity
detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
[131] A. Solomonoff, C. Quillen, and W. M. Campbell, “Channel compensation
for SVM speaker recognition,” in Odyssey Speaker and Language Recognition
Workshop, pp. 57–62, 2004.
[132] B. T. Taddese, “Sound source localization and separation,” Mathematics
and Computer Science, Macalester College, 2006.
[133] N. Trivedi, V. Kumar, S. Singh, S. Ahuja, and R. Chadha, “Speech recogni-
tion by wavelet analysis,” International Journal of Computer Applications,
vol. 15, no. 8, pp. 27–32, 2011.
[134] Z. Tufekci and S. Gurbuz, “Noise robust speaker verification using mel-
frequency discrete wavelet coefficients and parallel model compensation,” in
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), vol. 1, pp. I-657–I-660, 2005.
[135] B. Tydlitat, J. Navratil, J. W. Pelecanos, and G. N. Ramaswamy, “Text-
independent speaker verification in embedded environments,” in IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. IV-293–IV-296, 2007.
[136] G. Tzanetakis, G. Essl, and P. Cook, “Audio analysis using the discrete
wavelet transform,” in Proceedings of the Conference in Acoustics and Music
Theory Applications, pp. 1–6, 2001.
[137] N. Upadhyay and A. Karmakar, “Speech enhancement using spectral
subtraction-type algorithms: A comparison and simulation study,” Proce-
dia Computer Science, vol. 54, pp. 574–584, 2015.
[138] A. Varga and H. J. Steeneken, “Assessment for automatic speech recog-
nition: II. NOISEX-92: A database and an experiment to study the effect
of additive noise on speech recognition systems,” Speech Communication,
vol. 12, no. 3, pp. 247–251, 1993.
[139] O. Viikki and K. Laurila, “Cepstral domain segmental feature vector nor-
malization for noise robust speech recognition,” Speech Communication,
vol. 25, pp. 133–147, 1998.
[140] A. E. Villanueva-Luna, A. Jaramillo-Nunez, D. Sanchez-Lucero, C. M.
Ortiz-Lima, J. G. Aguilar-Soto, A. Flores-Gil, and M. May-Alarcon, “De-
noising audio signals using MATLAB wavelets toolbox,” pp. 25–54, INTECH
Open Access Publisher, 2011.
[141] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in
blind audio source separation,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[142] R. J. Vogt, B. J. Baker, and S. Sridharan, “Factor analysis subspace esti-
mation for speaker verification with short utterances,” in Proceedings of
Interspeech, pp. 853–856, 2008.
[143] B. Xiang, U. V. Chaudhari, J. Navratil, G. N. Ramaswamy, and R. A.
Gopinath, “Short-time Gaussianization for robust speaker verification,” in
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), pp. I-681–I-684, 2002.
[144] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani,
and W. Kellermann, “Making machines understand us in reverberant rooms:
Robustness against reverberation for automatic speech recognition,” IEEE
Signal Processing Magazine, vol. 29, no. 6, pp. 114–126, 2012.
[145] C. Yu, C. Zhang, F. Kelly, A. Sangwan, and J. H. Hansen, “Text-available
speaker recognition system for forensic applications,” in Annual Confer-
ence of the International Speech Communication Association (Interspeech),
pp. 1844–1847, 2016.
[146] X. Zhao, Y. Wang, and D. Wang, “Robust speaker identification in noisy
and reverberant conditions,” IEEE/ACM Transactions on Audio, Speech
and Language Processing (TASLP), vol. 22, no. 4, pp. 836–845, 2014.
[147] Q. Zhu and A. Alwan, “On the use of variable frame rate analysis in speech
recognition,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pp. 1783–1786, 2000.