FORENSIC SPEAKER RECOGNITION UNDER
ADVERSE CONDITIONS
Ahmed Kamil Hasan AL-ALI
MSc in Electronic and Communication Engineering
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology
2019
Keywords
Forensic Speaker Recognition, Gaussian Mixture Modeling, I-vectors, Probabilistic
Linear Discriminant Analysis, Environmental Noise, Reverberation Conditions,
DWT, Feature-Warped MFCC, Fusion of Feature Warping with DWT-MFCC and
MFCC, Speech Enhancement Algorithms, Independent Component Analysis,
Multi-Run Independent Component Analysis, Independent Component Analysis
by Entropy Bound Minimization.
Abstract
Speaker recognition is the process of identifying or verifying the identity of a
speaker by analyzing acoustic speech information. Speaker recognition has be-
come more common in recent years and can be used in many applications such
as banking security, national security, building access, and forensic applications.
Judges and law enforcement agencies have used speaker recognition in forensic
applications when investigating a suspect or confirming the judgment of guilt
or innocence. Forensic speaker recognition compares the speech samples from
criminals with speech signals from the suspect to prepare legal evidence for the
court.
Forensic speaker verification systems face numerous challenges in real-world ap-
plications. For example, the police often record speech from a suspect in a police
office, where reverberation is usually present. A criminal may use a mobile phone
in public places to commit criminal offenses. Recorded speech from covert po-
lice recordings is often corrupted by various types of environmental noise. It is
difficult to use these types of recorded speech directly in police investigations
and courts. Forensic speaker recognition performance degrades significantly in
the presence of high levels of environmental noise and reverberation conditions.
Therefore, the objective of this thesis is to improve forensic speaker verification
performance under high levels of noise and reverberation environments by using
robust feature extraction techniques and multiple channel speech enhancement
algorithms.
Three main contributions are presented in this thesis. The first contribution is to
design noisy and reverberant frameworks from the Australian Forensic Voice Com-
parison (AFVC) and QUT-NOISE databases. The aim of designing noisy and
reverberant frameworks is to use them to simulate forensic speaker
verification performance based on robust feature extraction techniques and
the independent component analysis (ICA) algorithms under real-world scenarios.
The second contribution is a detailed investigation of the robustness of fusion fea-
tures, mel frequency cepstral coefficients (MFCC) and MFCC extracted from the
discrete wavelet transform (DWT), with and without applying feature warping
to improve forensic speaker verification performance in the presence of environ-
mental noise only. The performance of the forensic speaker verification system is
evaluated using different feature extraction techniques: MFCC, feature-warped
MFCC, DWT-MFCC, DWT-MFCC feature warping, fusion of DWT-MFCC and
MFCC, and fusion of feature warping with DWT-MFCC and MFCC. This study
found that the proposed fusion of feature warping with DWT-MFCC and MFCC
improves forensic speaker recognition performance compared with other feature
extraction techniques in the presence of different types of environmental noise
over the majority of signal to noise ratios (SNRs). Experimental results also
demonstrated that the proposed fusion of feature warping with DWT-MFCC
and MFCC improves forensic speaker verification performance under
reverberation-only conditions and under noisy and reverberant environments.
The proposed fusion of feature warping with DWT-MFCC and MFCC can be used as
the feature extraction technique in forensic speaker verification based on the
ICA algorithms.
The third contribution is to develop forensic speaker verification based on the
multi-run ICA or independent component analysis by entropy bound minimiza-
tion (ICA-EBM) algorithm for improving forensic speaker verification perfor-
mance under high levels of environmental noise and reverberation conditions.
The results demonstrated that forensic speaker verification based on the multi-
run ICA or the ICA-EBM algorithm significantly improved speaker verification
performance compared with the baselines of Noise-without-Reverberation and
Reverberation-with-Noise speaker verification systems under reverberation con-
ditions and high levels of environmental noise at SNRs ranging from -10 dB to 0
dB.
Contents
Abstract i
List of Tables xiv
List of Figures xvi
Acronyms & Abbreviations xxvii
Certification of Thesis xxxi
Acknowledgments xxxii
Chapter 1 Introduction 1
1.1 Motivation and overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of the PhD . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Original contributions . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2 Overview of speaker recognition systems 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Overview of speaker verification . . . . . . . . . . . . . . . . . . . 12
2.3 Front-end processing . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Voice activity detection . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Speaker verification based on GMM . . . . . . . . . . . . . . . . 18
2.4.1 Overview of Gaussian mixture modeling . . . . . . . . . . 18
2.4.2 Universal background model . . . . . . . . . . . . . . . . . 20
2.4.3 Adaptation of UBM model . . . . . . . . . . . . . . . . . . 20
2.4.4 Speaker verification based on GMM-UBM . . . . . . . . . 21
2.5 GMM super-vector . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Channel compensation approaches . . . . . . . . . . . . . . . . . . 23
2.6.1 Feature domain approaches . . . . . . . . . . . . . . . . . 24
2.6.2 Model-domain techniques (factor analysis) . . . . . . . . . 26
2.6.2.1 Joint factor analysis . . . . . . . . . . . . . . . . 27
2.7 I-vector based speaker verification system . . . . . . . . . . . . . . 28
2.7.1 I-vector feature extraction . . . . . . . . . . . . . . . . . . 28
2.8 Standard channel compensation techniques . . . . . . . . . . . . . 30
2.9 PLDA speaker verification system . . . . . . . . . . . . . . . . . . 33
2.9.1 GPLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.2 HTPLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9.3 Length-normalized GPLDA . . . . . . . . . . . . . . . . . 35
2.10 PLDA scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.11 Performance evaluations . . . . . . . . . . . . . . . . . . . . . . . 37
2.11.1 Type of error . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.11.2 Performance metrics . . . . . . . . . . . . . . . . . . . . . 37
2.12 Speaker recognition in adverse conditions . . . . . . . . . . . . . 39
2.12.1 Feature extraction based on multiband techniques . . . . 39
2.12.2 Feature warping . . . . . . . . . . . . . . . . . . . . . . . . 44
2.12.3 Independent component analysis . . . . . . . . . . . . . . 46
2.12.4 I-vector PLDA speaker recognition . . . . . . . . . . . . . 50
2.12.5 Deep neural network . . . . . . . . . . . . . . . . . . . . . 51
2.13 Limitation of the existing techniques . . . . . . . . . . . . . . . . 53
2.14 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 3 Noisy and reverberant speech frameworks 57
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 AFVC speech database . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 QUT-NOISE database . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Construction of noisy speech signals . . . . . . . . . . . . . . . . . 61
3.4.1 Noisy speech in single channel speech enhancement . . . . 61
3.4.2 Noisy speech in multi-channel speech enhancement . . . . 61
3.5 Noisy and reverberant frameworks . . . . . . . . . . . . . . . . . 64
3.5.1 Single microphone adverse framework . . . . . . . . . . . . 65
3.5.1.1 Adding noise . . . . . . . . . . . . . . . . . . . . 65
3.5.1.2 Adding reverberation . . . . . . . . . . . . . . . . 66
3.5.1.3 Adding reverberation and noise . . . . . . . . . . 69
3.5.2 Multiple microphones adverse framework . . . . . . . . . . 70
3.5.2.1 Adding noise . . . . . . . . . . . . . . . . . . . . 71
3.5.2.2 Adding reverberation and noise . . . . . . . . . . 74
3.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 4 Forensic speech enhancement algorithms 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Independent component analysis . . . . . . . . . . . . . . . . . . . 79
4.2.1 Statistical independence . . . . . . . . . . . . . . . . . . . 81
4.2.1.1 Independence . . . . . . . . . . . . . . . . . . . . 81
4.2.1.2 Uncorrelatedness and independence . . . . . . . 82
4.2.1.3 Non-Gaussianity and independence . . . . . . . . 82
4.3 ICA assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 ICA ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.1 Magnitude and scaling ambiguity . . . . . . . . . . . . . . 86
4.4.2 Permutation ambiguity . . . . . . . . . . . . . . . . . . . . 87
4.5 Pre-processing for ICA . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.1 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Fast ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.6.1 Fast ICA for one unit . . . . . . . . . . . . . . . . . . . . . 91
4.6.2 Fast ICA for several units . . . . . . . . . . . . . . . . . . 91
4.7 Simple illustration of ICA . . . . . . . . . . . . . . . . . . . . . . 92
4.7.1 Separation of speech from street noise . . . . . . . . . . . 93
4.7.2 Illustrative example of statistical independence in ICA . . 95
4.8 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.8.1 Noisy speech signals . . . . . . . . . . . . . . . . . . . . . 99
4.8.2 Speech enhancement algorithms . . . . . . . . . . . . . . . 99
4.8.2.1 Discrete wavelet transform . . . . . . . . . . . . . 99
4.8.2.1.1 Wavelet threshold technique . . . . . . . . . . . 102
4.8.2.2 Spectral Subtraction . . . . . . . . . . . . . . . . 103
4.8.2.3 Independent component analysis . . . . . . . . . 105
4.8.3 Evaluation of performance . . . . . . . . . . . . . . . . . . 105
4.8.4 Comparison of speech enhancement algorithms . . . . . . . 106
4.9 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 5 Robust feature extraction techniques 111
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Combination of DWT and MFCC FW . . . . . . . . . . . . . . . 113
5.3 Experimental methodology . . . . . . . . . . . . . . . . . . . . . . 116
5.4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4.1 Noisy conditions . . . . . . . . . . . . . . . . . . . . . . . 117
5.4.1.1 Effect of decomposition level . . . . . . . . . . . 117
5.4.1.2 Effect of wavelet family . . . . . . . . . . . . . . 118
5.4.1.3 Comparison of feature extraction techniques . . 120
5.4.1.4 Fusion of feature warping with DWT-MFCC and
MFCC . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.1.5 Performance comparison to related work . . . . 124
5.4.1.6 Effect of utterance length . . . . . . . . . . . . . 126
5.4.2 Reverberation conditions . . . . . . . . . . . . . . . . . . 130
5.4.2.1 Effect of decomposition level . . . . . . . . . . . 130
5.4.2.2 Effect of wavelet family . . . . . . . . . . . . . . 132
5.4.2.3 Comparison of feature extraction techniques . . . 133
5.4.2.4 Effect of reverberation time . . . . . . . . . . . . 135
5.4.2.5 Effect of utterance duration . . . . . . . . . . . . 138
5.4.2.6 Effect of suspect and microphone position . . . 141
5.4.3 Noisy and reverberant conditions . . . . . . . . . . . . . . 143
5.4.3.1 Effect of decomposition level . . . . . . . . . . . . 144
5.4.3.2 Effect of wavelet family . . . . . . . . . . . . . . 145
5.4.3.3 Comparison of feature extraction techniques . . 146
5.4.3.4 Effect of utterance length . . . . . . . . . . . . . 151
5.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Chapter 6 ICA for speaker verification 157
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2 Multi-run ICA algorithm . . . . . . . . . . . . . . . . . . . . . . . 160
6.2.1 SIR computation . . . . . . . . . . . . . . . . . . . . . . . 161
6.3 ICA-EBM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.1 Multi-run ICA or ICA-EBM speech enhancement . . . . 165
6.4.2 Fusion of feature warping with DWT-MFCC and MFCC . 166
6.4.3 Length-normalized GPLDA classifier . . . . . . . . . . . . 167
6.5 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.1 Noisy and reverberant speaker verification
baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.1.1 Noise-without-Reverberation speaker verification
baseline . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.1.2 Reverberation-with-Noise speaker verification
baseline . . . . . . . . . . . . . . . . . . . . . . . 170
6.5.2 Noise-without-Reverberation conditions . . . . . . . . . . . 171
6.5.2.1 Speaker verification performance based on multi-run
ICA (SIRact) . . . . . . . . . . . . . . . . . . . . 172
6.5.2.2 Speaker verification performance based on multi-run
ICA (SIRest) . . . . . . . . . . . . . . . . . . . . 175
6.5.2.3 Effect of utterance length . . . . . . . . . . . . . 180
6.5.3 Reverberation-with-Noise conditions . . . . . . . . . . . . 181
6.5.3.1 Speaker verification performance based on ICA
algorithms . . . . . . . . . . . . . . . . . . . . . . 183
6.5.3.2 Effect of utterance length . . . . . . . . . . . . . 188
6.5.3.3 Effect of reverberation time . . . . . . . . . . . . 191
6.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Chapter 7 Conclusions and future directions 199
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.2 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . 199
7.2.1 Designing noisy and reverberant frameworks . . . . . . . . 200
7.2.2 DWT-MFCC feature warping . . . . . . . . . . . . . . . 200
7.2.3 Multi-run ICA and ICA-EBM algorithms . . . . . . . . . 202
7.3 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Bibliography 207
List of Tables
3.1 Reverberation test room parameter. . . . . . . . . . . . . . . . . 67
5.1 Summary of feature extraction labels. . . . . . . . . . . . . . . . . 115
5.2 Description of the number of features extracted corresponding to
each feature extraction label. . . . . . . . . . . . . . . . . . . . . . 115
5.3 Comparison of speaker verification performance based on mDCF
using different feature extraction techniques in the presence of re-
verberation at 0.15 sec. . . . . . . . . . . . . . . . . . . . . . . . . 135
5.4 Reverberation test room parameter. . . . . . . . . . . . . . . . . 142
6.1 Reverberation test room parameter. . . . . . . . . . . . . . . . . 171
6.2 Comparison of mDCFs for speaker verification based on multi-run ICA
(highest SIRest) and the baseline Noise-without-Reverberation speaker
verification system. . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.3 Comparison of mDCFs for speaker verification based on the ICA-EBM
algorithm and Reverberation-with-Noise speaker verification base-
line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
List of Figures
2.1 Text-independent speaker verification system. . . . . . . . . . . . 13
2.2 Block diagram of front-end processing. . . . . . . . . . . . . . . . 15
2.3 Block diagram of MFCC extraction. . . . . . . . . . . . . . . . . . 17
2.4 Speaker verification based on GMM-UBM. . . . . . . . . . . . . . 22
2.5 A block diagram of extraction GMM super-vectors. . . . . . . . . 23
2.6 A block diagram of the feature warping process. . . . . . . . . . . 26
2.7 Length-normalized GPLDA based speaker verification system. . . 30
2.8 An example of DET plot. . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Configuration of sources and microphones in instantaneous ICA
mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Position of speech and noise signals to the microphones. . . . . . . 63
3.3 Design of a single microphone adverse framework based on adding
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Position of suspect and microphones in a room. All microphones
and suspect are at 1.3 m height and the height of the room is 2.5 m. 67
3.5 Design of a single microphone adverse framework based on adding
reverberation conditions. . . . . . . . . . . . . . . . . . . . . . . 68
3.6 Design of a single microphone adverse framework based on adding
reverberation and noise conditions. . . . . . . . . . . . . . . . . . 70
3.7 Position of speech and noise signals to the microphones. . . . . . . 72
3.8 Design of a multiple microphones adverse framework based on
adding noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.9 Design of a multiple microphones adverse framework based on
adding reverberation and noise conditions. . . . . . . . . . . . . . 75
4.1 Histogram of clean speech from the AFVC database. . . . . . . . 79
4.2 Histogram of STREET noise from the QUT-NOISE database. . . 79
4.3 Mixing and unmixing processes in blind source separation. s are
the source signals, x are the observation signals, ŝ are the estimated
source signals, A is the mixing matrix and W is the unmixing matrix. 81
4.4 Original speech and street noise. . . . . . . . . . . . . . . . . . . . 93
4.5 Mixed speech with street noise. . . . . . . . . . . . . . . . . . . . 94
4.6 Estimated speech and street noise using the fast ICA algorithm. . 94
4.7 Original sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.8 Mixed source signals. . . . . . . . . . . . . . . . . . . . . . . . 96
4.9 Joint density of whitened signals obtained from whitening the
mixed sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.10 Estimation of source signals using the ICA algorithm. . . . . . . . 97
4.11 Comparison of speech enhancement algorithms. . . . . . . . . . . 98
4.12 Daubechies 8 wavelet function. . . . . . . . . . . . . . . . . . . . . 100
4.13 Block schematic of the dyadic wavelet transform. . . . . . . . . . . 101
4.14 Block schematic of the dyadic inverse discrete wavelet transform. . 102
4.15 Comparison of average SNR enhancement when STREET noise is
added to the speech signals from the AFVC database. . . . . . . 107
4.16 Comparison of average SNR enhancement when CAR noise is
added to the speech signals from the AFVC database. . . . . . . 107
4.17 Comparison of average SNR enhancement when HOME noise is
added to the speech signals from the AFVC database. . . . . . . 108
5.1 Extraction and fusion of DWT-MFCC and MFCC features with
and without feature warping (FW). . . . . . . . . . . . . . . . . . 114
5.2 Effect of the decomposition levels on the performance of fusion of
feature warping with DWT-MFCC and MFCC in the presence of
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 Average EER for fusion of feature warping with DWT-MFCC and
MFCC using different wavelet families in the presence of different
types of environmental noise. . . . . . . . . . . . . . . . . . . . . 119
5.4 Comparison of speaker verification performance using different fea-
ture extraction techniques in the presence of various types of en-
vironmental noise. . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC in the presence of
various types of environmental noise. Higher average reduction in
EER indicates better performance. . . . . . . . . . . . . . . . . . 122
5.6 mDCF for feature-warped MFCC and fusion of feature warping
with DWT-MFCC and MFCC features in the presence of various
types of environmental noise. . . . . . . . . . . . . . . . . . . . . 123
5.7 Fusion of feature warping with DWT-MFCC and MFCC. . . . . . 125
5.8 Comparison of average EER for the proposed fusion of feature
warping with DWT-MFCC features with other feature extraction
techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.9 Effect of the utterance length on the performance of feature-warped
MFCC in the presence of environmental noise. . . . . . . . . . . 127
5.10 Effect of the utterance length on the performance of fusion of fea-
ture warping with DWT-MFCC and MFCC in the presence of
various types of environmental noise. . . . . . . . . . . . . . . . 128
5.11 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC when 40 sec of the
surveillance recordings were mixed with various types of environ-
mental noise at SNRs ranging from -10 dB to 10 dB. . . . . . . . 129
5.12 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC when the duration of the surveillance recordings
increased from 10 sec to 40 sec. . . . . . . . . . . . . . . . . . . . 130
5.13 Effect of decomposition level on the performance of fusion of fea-
ture warping with DWT-MFCC and MFCC under reverberation
conditions only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.14 EER for fusion of feature warping with DWT-MFCC and MFCC
using different wavelet families in the presence of reverberation
conditions only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.15 Comparison of fusion of feature warping with DWT-MFCC and
MFCC with different feature extraction techniques in the presence
of reverberation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.16 Effect of reverberation time on the performance of feature-warped
MFCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.17 Effect of reverberation time on the performance of fusion of feature
warping with DWT-MFCC and MFCC. . . . . . . . . . . . . . . . 137
5.18 Reduction in EER for fusion of feature warping with DWT-MFCC
and MFCC over feature-warped MFCC when interview recordings
reverberated at different reverberation times. . . . . . . . . . . . . 138
5.19 Effect of changing the utterance length of the surveillance record-
ings on the performance of feature-warped MFCC under reverber-
ation conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.20 Effect of changing the surveillance utterance duration on the per-
formance of fusion of feature warping with DWT-MFCC and
MFCC under reverberation conditions. . . . . . . . . . . . . . . . 140
5.21 Position of suspect and microphones in a room. All microphones
and suspect are at 1.3 m height and the height of the room is 2.5 m. 141
5.22 Effect of the microphone and suspect position configuration on the
performance of feature-warped MFCC. . . . . . . . . . . . . . . . 143
5.23 Effect of the microphone and suspect position configuration on the
performance of fusion of feature warping with DWT-MFCC and
MFCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.24 Effect of the decomposition levels on the performance of fusion of
feature warping with DWT-MFCC and MFCC in the presence of
reverberation and various types of environmental noises. . . . . . 144
5.25 Average EER for fusion of feature warping with DWT-MFCC and
MFCC using different wavelet families in the presence of reverber-
ation and various types of environmental noises. . . . . . . . . . 146
5.26 Comparison of speaker verification performance using different fea-
ture extraction techniques in the presence of environmental noise
and reverberation conditions. . . . . . . . . . . . . . . . . . . . . 147
5.27 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over traditional MFCC in the presence of var-
ious types of environmental noise and reverberation conditions . . 148
5.28 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC in the presence of
various types of environmental noise and reverberation conditions. 150
5.29 mDCF for feature-warped MFCC and fusion of feature warping
with DWT-MFCC and MFCC features under noisy and reverber-
ant conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.30 Effect of utterance length on the performance of feature-warped
MFCC in the presence of noise and reverberation conditions. . . . 151
5.31 Effect of utterance length on the performance of fusion of feature
warping with MFCC and DWT-MFCC in the presence of noise
and reverberation conditions. . . . . . . . . . . . . . . . . . . . . 152
5.32 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC over feature-warped MFCC when interview
recordings reverberated at 0.15 sec and 40 seconds of the surveil-
lance recordings were corrupted by various types of noise at SNRs
ranging from -10 dB to 10 dB. . . . . . . . . . . . . . . . . . . . . 153
5.33 Average reduction in EER for fusion of feature warping with DWT-
MFCC and MFCC when interview recordings reverberated at 0.15
sec and the duration of the surveillance recordings changed from
10 sec to 40 sec in the presence of various types of noise at SNRs
ranging from -10 dB to 10 dB. . . . . . . . . . . . . . . . . . . . . 154
6.1 Flowchart of the multi-run ICA algorithm. . . . . . . . . . . . . . 161
6.2 Flowchart of speaker verification based on the multi-run ICA or
ICA-EBM algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.3 Flowchart of fusion of feature warping with DWT-MFCC and
MFCC techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.4 Flowchart of the Noise-without-Reverberation speaker verification
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5 Flowchart of the Reverberation-with-Noise speaker verification
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.6 Comparison of forensic speaker verification based on the multi-
run ICA (highest SIRact) algorithm with other speaker verification
techniques under Noise-without-Reverberation conditions. . . . . . 173
6.7 Comparison of forensic speaker verification based on the multi-
run ICA (highest SIRest) algorithm with other speaker verification
techniques under noisy conditions. . . . . . . . . . . . . . . . . . 176
6.8 Average reduction in EER for the multi-run ICA (highest SIRest)
algorithm over traditional ICA in the presence of various types of
environmental noise at SNRs ranging from -10 dB to 10 dB. . . . 178
6.9 Average EER reduction for the ICA-EBM algorithm compared
with the traditional ICA under different levels and types of en-
vironmental noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.10 Effect of utterance duration on the performance of speaker ver-
ification based on the multi-run ICA (highest SIRest) algorithm in the
presence of various types of environmental noise. . . . . . . . . . 181
6.11 Effect of utterance duration on speaker verification performance
based on the ICA-EBM algorithm under different levels and types
of noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.12 Experimental results for PLDA speaker verification when interview
recordings reverberated at 0.15 sec reverberation time and surveil-
lance recordings were mixed with various types of environmental
noise at SNRs ranging from -10 dB to 10 dB. . . . . . . . . . . . . 184
6.13 Average reduction in EER for multi-run ICA (highest SIRest) algo-
rithm over traditional ICA when interview recordings reverberated
at 0.15 sec and the surveillance recordings were mixed with differ-
ent types of environmental noise at SNRs ranging from -10 dB to
10 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.14 Average EER reduction for ICA-EBM algorithm compared with
the traditional ICA for different types of noise and reverberation. 187
6.15 Average reduction in EER for the ICA-EBM algorithm over the
multi-run ICA (highest SIRest) in the presence of various types of
environmental noise at SNRs ranging from -10 dB to 0 dB and
reverberation conditions. . . . . . . . . . . . . . . . . . . . . . . . 187
6.16 Effect of utterance duration on the performance of speaker veri-
fication based on multi-run ICA (highest SIRest) in the presence of
various environmental noise and reverberation conditions. . . . . . 189
6.17 Effect of utterance duration on speaker verification performance
when using the ICA-EBM under different types of noise and rever-
beration environments. . . . . . . . . . . . . . . . . . . . . . . . 190
6.18 Effect of the reverberation time on the performance of forensic
speaker verification based on multi-run ICA (highest SIRact) when
interview recordings reverberated at different reverberation times
and surveillance recordings mixed with various types of environ-
mental noises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.19 Effect of the reverberation time on the performance of forensic
speaker verification based on multi-run ICA (highest SIRest) when
interview recordings reverberated at different reverberation times
and surveillance recordings were mixed with various types of envi-
ronmental noises. . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.20 Effect of the reverberation time on speaker verification perfor-
mance when using the ICA-EBM under conditions of noise and
reverberation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
List of Abbreviations
Australian Forensic Voice Comparison AFVC
Actual signal to interference ratio SIRact
Additive white Gaussian noise AWGN
Baum-Welch BW
Cepstral mean subtraction CMS
Cepstral mean variance normalization CMVN
Computational auditory scene analysis CASA
Cumulative distribution function CDF
Daubechies db
Decision cost function DCF
Detection error trade-off DET
Deep neural network DNN
Deep recurrent neural network DRNN
Discrete cosine transform DCT
Discrete wavelet transform DWT
Efficient fast ICA EFICA
Equal error rate EER
Expectation maximization EM
Estimated signal to interference ratio SIRest
Fast Fourier transform FFT
False acceptance rate FAR
False rejection rate FRR
Feature warping FW
Finite impulse response FIR
Gaussian mixture model GMM
Gaussian probabilistic linear discriminant analysis GPLDA
Heavy-tailed probabilistic linear discriminant analysis HTPLDA
Identity vector i-vector
Independent component analysis ICA
Independent component analysis by entropy bound minimization ICA-EBM
Input SNR SNRi
Ideal ratio mask IRM
Joint factor analysis JFA
Kurtosis Kurt
Linear discriminant analysis LDA
Linear prediction cepstral coefficients LPCC
Long short-term memory LSTM
Maximum a posteriori MAP
Maximum likelihood ML
Mel frequency cepstral coefficients MFCC
Mel frequency discrete wavelet coefficients MFDWCs
Mean square error MSE
Minimum decision cost function mDCF
Microsoft research MSR
National Institute of Standards and Technology NIST
Nuisance attribute projection NAP
Non-negative matrix factorization NMF
Output signal to noise ratio SNRo
Perceptual linear prediction cepstral coefficients PLPCC
Principal component analysis PCA
Probability density function PDF
Probabilistic linear discriminant analysis PLDA
Relative spectral RASTA
Relative spectral transform-perceptual linear prediction RASTA-PLP
Short time Fourier transform STFT
Short-time spectral amplitude minimum mean square error STSA-MMSE
Speaker recognition evaluation SRE
Support vector machine SVM
Signal to noise ratio SNR
SNR enhancement SNRe
Signal to interference ratio SIR
Time-frequency T-F
Universal background model UBM
Voice activity detection VAD
Voice over Internet protocol VoIP
Within-class covariance normalization WCCN
Certification of Thesis
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my
knowledge and belief, the thesis contains no material previously published or
written by another person except where due reference is made.
Signed:
Date:
QUT Verified Signature
Acknowledgments
I would like to thank my supervisory team, Associate Professor Bouchra Senadji,
Dr Mahsa Baktashmotlagh, Dr David Dean, Dr Ganesh Naik, and adjunct Pro-
fessor Vinod Chandran for providing their support and their valuable guidance
throughout my PhD. My respect and gratitude to my principal supervisor Asso-
ciate Professor Bouchra Senadji, who supported me during my PhD journey. I
would especially like to thank my associate supervisor Dr Mahsa Baktashmotlagh
for her consistent help and support during the writing of my PhD thesis. Also
special thanks to my associate external supervisor Dr David Dean, who provided
valuable advice and research direction during my PhD journey. I would also like
to thank my associate external supervisor Professor Vinod Chandran. He has
been very helpful in providing me with valuable advice and feedback, which
has always improved the quality of my publications and thesis.
I would like to thank my associate external supervisor Dr Ganesh Naik who sup-
ported me and provided advice throughout the course of my PhD. I would also
like to thank my colleagues and friends at the Queensland University of Tech-
nology, who shared their friendship and made the laboratory enjoyable. I would
also like to thank Dr Christian Long for providing copyediting and proofreading
services according to the University-endorsed guidelines. I also wish to
gratefully thank the Ministry of Higher Education
and Scientific Research (MoHESR) in Iraq for supporting my PhD scholarship.
Finally, I would like to thank my wife, Tahreer Aljabiri, who supported me during
my PhD journey. It would have been impossible to complete my PhD journey
without her support.
Ahmed Kamil Hasan AL-ALI
2019
Chapter 1
Introduction
1.1 Motivation and overview
Automatic speaker recognition is the process of designing algorithms that recog-
nize the identity of speakers by their voices. Speaker recognition can be divided
into two applications, speaker identification and speaker verification. Speaker
identification is the process of determining the identity of an unknown speaker
from a group of known speakers, while speaker verification is the process of con-
firming or rejecting the claimed identity.
Speaker verification is more widely used than speaker identification due to its
importance in security and forensic applications [59]. For security applications,
speaker verification can be used to protect personal information. People want
companies and banks to ensure that the best possible preventative methods are
used to protect personal details and prevent fraud [59]. Speaker verification can
also be used in national security measures to combat terrorists, by determining
the location of a known terrorist by analyzing the voice from localized audio
recordings. Lawyers, judges, and law enforcement agencies have used speaker
verification in forensic applications to prepare legal evidence for court by com-
paring surveillance speech samples from the criminal with a database of interview
speech samples from the suspect [19, 89].
The performance of forensic speaker verification degrades significantly in the
presence of adverse conditions in real-world applications. The adverse condi-
tions come from various sources, such as environmental noise, reverberation, and
channel mismatch between interview and surveillance recordings. Police often
record speech from a suspect in a room where reverberation is often present. In
reverberation environments, the interview speech signal is often combined with
multiple reflected versions of the speech, caused by reflections from the
surrounding room surfaces. The presence of reverberation dis-
torts feature vectors and degrades speaker verification performance because of
the mismatched conditions between enrolment models and verification speech
signals [122]. Surveillance forensic recordings are often recorded over phone lines
with limited channel bandwidth and low quality [8, 107], and are usually recorded
using hidden microphones in public places. These forensic recordings are often
corrupted by various types of environmental noise [32], and the performance of
speaker verification systems reduces dramatically in the presence of high levels of
noise [73, 94].
This research aims to improve forensic speaker verification performance in the
presence of adverse conditions using a robust feature extraction technique and
multiple channel speech enhancement algorithms.
1.2 Scope of the PhD
The broad scope of this PhD research is to address the challenges that face forensic
speaker verification systems in the presence of adverse conditions. Modern
forensic speaker verification systems suffer from the following problems:
1. In real forensic scenarios, the interview speech signals are often recorded in a
police interview room where reverberation is usually present. The surveil-
lance speech signals from the criminal are usually corrupted by different
types and levels of environmental noise. Therefore, it is necessary to in-
vestigate a robust feature extraction technique to improve forensic speaker
verification performance in the presence of adverse conditions.
2. In real forensic situations, environmental noises are often mixed with
surveillance forensic audio recordings. It is difficult to use these record-
ings as part of the legal evidence in court because their quality is poor.
To reduce the effect of high levels of environmental noise from the noisy
speech signals and improve noisy forensic speaker verification performance,
it is necessary to apply a speech enhancement algorithm as front-end pro-
cessing for speaker verification. Forensic speaker verification performance
under different types of environmental noise could be improved by using a
speech enhancement algorithm.
3. The performance of forensic speaker verification reduces significantly in the
presence of high levels of environmental noise and reverberation conditions.
Thus, it is necessary to combine the robust feature extraction technique
and speech enhancement algorithm described in 1 and 2 to achieve high
improvement in forensic speaker verification performance in the presence of
high levels of environmental noise and reverberation conditions.
This PhD research addresses these problems and the outcome of this research can
be used in police investigations to recognize the identity of criminals by their
voices and to prepare legal evidence for the court.
1.3 Thesis structure
The remaining chapters of the thesis are organized as follows:
• Chapter 2: Overview of speaker recognition systems
This chapter provides the background of speaker verification systems.
A summary of existing front-end processing and feature extraction is
provided. Some of the most common speaker modelling methods are
presented in this chapter, including Gaussian mixture model (GMM)
based speaker verification system, universal background model (UBM),
GMM super-vector, factor analysis, and joint factor analysis (JFA).
Channel compensation techniques based on feature extraction approach
are also presented. State-of-the-art identity vector (i-vector) probabilistic
linear discriminant analysis (PLDA) speaker verification is also presented.
Some techniques to improve speaker recognition performance in adverse
conditions are described. The limitations of the existing techniques are
also identified.
• Chapter 3: Noisy and reverberant speech frameworks
This chapter presents an overview of speech and noise corpora that are used
in this thesis. The construction of noisy speech signals based on single
and multiple channels is designed to evaluate forensic speech enhancement
algorithms. The noisy and reverberant frameworks based on the single
and multiple microphones are described in this chapter for evaluating the
effectiveness of forensic speaker verification based on robust feature
extraction techniques and the ICA algorithms under real-world situations.
• Chapter 4: Forensic speech enhancement algorithms
This chapter presents independent component analysis (ICA) as a multiple
channel speech enhancement algorithm for forensic applications. It begins
with a simple example to introduce the problem of source separation and
the ICA model. Some mathematical and statistical concepts related to ICA
are presented in this chapter. The fast ICA algorithm is also discussed.
Finally, simulation results for single and multiple channel speech enhancement
algorithms are also given.
• Chapter 5: Robust feature extraction techniques
This chapter investigates the effectiveness of combining features, mel
frequency cepstral coefficients (MFCC) and MFCC extracted from the
discrete wavelet transform (DWT) of the speech signals, with and without
feature warping in the presence of various types of environmental noise
only. Subsequently, a new fusion of feature warping with DWT-MFCC and
MFCC is proposed to improve modern i-vector forensic speaker verification
performance under noisy and reverberant conditions.
• Chapter 6: ICA for speaker verification
This chapter introduces the multi-run ICA and independent component
analysis by entropy bound minimization (ICA-EBM) as multiple channel
speech enhancement algorithms. New forensic speaker verification systems
based on the multi-run ICA and ICA-EBM algorithm are developed in this
chapter to improve the performance of forensic speaker verification in the
presence of high levels of environmental noise and reverberation conditions.
• Chapter 7: Conclusions and future directions
This chapter concludes the thesis with a summary of the contributions and
makes some suggestions for future directions.
1.4 Original contributions
The research programme contributes to the field of speaker verification for real
forensic applications by addressing challenges faced by forensic speaker verifica-
tion systems in the presence of adverse conditions.
1. In Chapter 3, noisy and reverberant frameworks are designed from
the Australian Forensic Voice Comparison (AFVC) and QUT-
NOISE databases. The goal of designing noisy and reverberant frameworks
is to use these frameworks to simulate forensic speaker verification per-
formance based on robust feature extraction techniques and the ICA
algorithms under real-world situations in the presence of various types of
environmental noise and reverberation conditions. The noisy and rever-
berant frameworks based on robust feature extraction and the ICA
algorithms were presented in “Enhanced forensic speaker verification us-
ing a combination of DWT and MFCC feature warping in the presence of
noise and reverberation conditions” [5] in the IEEE Access journal, “Hybrid
DWT and MFCC feature warping for noisy forensic speaker verification in
room reverberation” [6] in the IEEE International Conference on Signal and
Image Processing Applications, “Speaker verification with multi-run ICA
based speech enhancement” [3] in the 11th International Conference on Signal
Processing and Communication Systems, and “Enhanced forensic speaker
verification using multi-run ICA in the presence of noise and reverbera-
tion conditions” [7] in the IEEE International Conference on Signal and Image
Processing Applications.
2. The effectiveness of combining features, MFCC and MFCC extracted from
the DWT of the speech with and without using feature warping is inves-
tigated (Chapter 5) to improve i-vector forensic speaker verification per-
formance in the presence of various types of environmental noise only. In
Chapter 5, a new fusion of feature warping with DWT-MFCC and MFCC
is proposed to improve the performance of forensic speaker verification sys-
tems in the presence of reverberation and noisy and reverberant conditions.
The proposed fusion of feature warping with DWT-MFCC and MFCC is
also compared with various feature extraction techniques to investigate the
robustness of the proposed fusion feature extraction technique to reverbera-
tion as well as noisy and reverberant conditions. The results demonstrated
that the proposed technique improved forensic speaker verification perfor-
mance compared with different feature extraction techniques in the presence
of various types of environmental noise and reverberant conditions. These
research outcomes were published in IEEE Access [5]. The IEEE Access
Journal has a Q1 ranking and a 2017 impact factor of 3.557 according to
Clarivate Analytics Journal Citation Reports. Some findings of this contri-
bution were also published in 2017 IEEE International Conference on Signal
and Image Processing Applications [6].
3. In Chapter 6, the effectiveness of the multi-run ICA and ICA-EBM is in-
vestigated for the first time as multiple channel speech enhancement algo-
rithms. Forensic speaker verification systems based on the multi-run ICA
and ICA-EBM algorithm are developed to improve forensic speaker verifi-
cation performance where data are often corrupted by various types of en-
vironmental noise and reverberation conditions. It was found that speaker
verification based on the multi-run ICA or ICA-EBM algorithms signifi-
cantly improved the performance under reverberation conditions and high
levels of environmental noise at SNRs ranging from -10 dB to 0 dB compared
with the two baselines (Noise-without-Reverberation and Reverberation-
with-Noise speaker verification systems). The results also demonstrated
that forensic speaker verification based on multi-run ICA or ICA-EBM al-
gorithm improved speaker verification performance compared with tradi-
tional ICA under noisy and reverberant conditions. Part of this work was
published in 2017 IEEE International Conference on Signal and Image Pro-
cessing Applications [7] and the 11th International Conference on Signal
Processing and Communication Systems [3].
1.5 Publications
Listed below are the peer-reviewed publications and under-review submissions
resulting from this thesis.
Peer-reviewed international journals
1. A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran and G. R. Naik
‘Enhanced forensic speaker verification using a combination of DWT and
MFCC feature warping in the presence of noise and reverberation condi-
tions,' IEEE Access, vol. 5, no. 99, pp. 15400-15413, 2017.
Peer-reviewed international conferences
1. A. K. H. Al-Ali, D. Dean, B. Senadji, and V. Chandran, ‘Comparison
of speech enhancement algorithms for forensic applications,’ Proceedings of
the 16th Australian International Conference on Speech Science and Tech-
nology, December 2016.
2. A. K. H. Al-Ali, B. Senadji and V. Chandran, ‘Hybrid DWT and MFCC
feature warping for noisy forensic speaker verification in room reverbera-
tion,’ IEEE International Conference on Signal and Image Processing Ap-
plications, 2017.
3. A. K. H. Al-Ali, D. Dean, B. Senadji, M. Baktashmotlagh, and V. Chan-
dran, ‘Speaker verification with multi-run ICA based speech enhancement,'
11th International Conference on Signal Processing and Communication
Systems, 2017.
4. A. K. H. Al-Ali, B. Senadji and G. R. Naik, ‘Enhanced forensic speaker
verification using multi-run ICA in the presence of environmental noise
and reverberation conditions,’ IEEE International Conference on Signal and
Image Processing Applications, 2017.
Chapter 2
Overview of speaker recognition
systems
2.1 Introduction
Speech signals convey many levels of information to the listener. Speech signals
carry linguistic information such as a message via words at the primary level, while
at other levels speech signals convey information related to the emotion, stress
level, language, gender and the identity of the speaker [18, 114, 128]. Speaker
recognition is the task of identifying a speaker by their voice. Speaker recognition
can be classified into two parts: speaker identification and speaker verification.
Speaker identification is the process of determining which voice in a group of
known voices best matches the speaker’s [111]. Speaker verification is the task of
accepting or rejecting the identity claim of a speaker by analyzing their acoustic
samples [38, 39]. Speaker verification is more common than speaker identification
and it is used widely in real applications such as security, forensics, and telephone
banking [57].
Speaker verification systems are computationally less complex than speaker iden-
tification systems. They require a comparison between only one or two models,
while speaker identification requires comparison of one model to N speaker mod-
els. Thus, improvements in speaker verification performance can be carried into
speaker identification systems [114].
Speaker verification can be divided into text-dependent and text-independent.
In text-dependent, the speaker verification system has prior knowledge about
the text to be spoken and the user is expected to cooperatively speak this text
[114, 125]. However, in a text-independent scenario, the system has no prior
knowledge about the text to be spoken and the user is not expected to be co-
operative [39, 135]. Text-dependent systems achieve high speaker verification
performance from relatively short utterances, while text-independent systems re-
quire long utterances to train reliable models and achieve good performance. The
performance of text-independent speaker verification systems is also affected by
factors such as environmental noise and reverberation conditions. An example of
text-independent speaker verification would be a forensic application with covert
recordings of speech.
2.2 Overview of speaker verification
Speaker verification systems consist of three stages: development, enrolment and
verification, as shown in Figure 2.1. In the development phase, a large number
of speech signals are used to learn the speaker-independent parameters of the
speech signals. In the enrolment phase, feature extraction is used to extract the
features from the target speaker and the target speaker model is generated by
adapting parameters from the background models. The target speaker model is
a statistical model which characterizes the speaker's vocal tract. Once a model is
[Figure 2.1: Text-independent speaker verification system, showing front-end processing followed by background training (development phase), model adaptation (enrolment phase), and scoring of the claimed speaker against the target speaker model (verification phase) to produce an accepted/rejected decision.]
created and associated with a speaker, it is used to represent that speaker during
the verification phase. In the verification phase, a model of the claimed
speaker is generated and scored against the target speaker model to determine
whether the claimed identity is accepted or rejected by the speaker verification
system, based on the decision threshold [50, 83, 112].
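As a minimal sketch of this decision step (using a log-likelihood-ratio score, as in the GMM-UBM systems described later in this chapter), the fragment below applies the accept/reject threshold. The function name and the numeric scores are illustrative assumptions, not values from the thesis.

```python
def verification_decision(log_p_target: float, log_p_background: float,
                          threshold: float) -> bool:
    """Accept the claimed identity if the log-likelihood ratio between
    the target model and the background model exceeds the threshold."""
    llr = log_p_target - log_p_background
    return llr > threshold

# Hypothetical scores for one test utterance (illustrative values only).
accepted = verification_decision(log_p_target=-41.2,
                                 log_p_background=-44.8,
                                 threshold=2.0)
print("accepted" if accepted else "rejected")
```

The threshold trades off false acceptances against false rejections, which is why the evaluation metrics in Section 2.11 are defined over a range of operating points.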
Speaker verification performance is often degraded in the presence of chan-
nel/session variability between enrolment and verification speech signals. Various
factors affect channel/session variability:
1. Channel mismatch between enrolment and verification speech signals (e.g.
using different microphones in enrolment and verification speech signals).
2. Environmental noise and reverberation conditions.
3. The differences in speaker voice (e.g. ageing, health, speaking style and
emotional state).
4. Transmission channel (e.g. landline, mobile phone, microphone and voice
over Internet protocol (VoIP)).
Channel compensation techniques can be used to reduce the effect of channel
mismatch and environmental noise. Channel compensation can be used in dif-
ferent stages of speaker verification such as feature and model domains. Various
channel compensation techniques such as cepstral mean subtraction (CMS) [37],
feature warping [106], cepstral mean variance normalization (CMVN) [139], and
relative spectral (RASTA) processing [49] were used in previous studies to reduce
the effect of channel mismatch during the feature extraction phase. In the model
domain, JFA [63] and i-vector [31] are used to combat enrolment and verification
mismatch. These techniques are briefly described in the following sections.
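As a minimal sketch of the simplest of these feature-domain techniques, CMS can be written as a one-line operation on a matrix of cepstral features (frames by coefficients). The function below is an illustrative implementation under that assumed layout, not the exact one used in the cited studies.

```python
import numpy as np

def cepstral_mean_subtraction(features: np.ndarray) -> np.ndarray:
    """Cepstral mean subtraction (CMS): remove the per-utterance mean of
    each cepstral coefficient. A stationary convolutive channel is additive
    in the cepstral domain, so subtracting the mean suppresses it.
    `features` has shape (n_frames, n_coefficients)."""
    return features - features.mean(axis=0, keepdims=True)
```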
2.3 Front-end processing
The front-end processing is used to process the speech signals and extract the
features used by speaker verification systems. The front-end processing consists
of voice activity detection (VAD), feature extraction and channel compensation
techniques, as shown in Figure 2.2 [14, 113].
2.3.1 Voice activity detection
The VAD algorithm is an important step in the front-end processing. The main
goal of the VAD is to detect the speech and non-speech segments from the speech
signal before the feature extraction stage. A robust VAD algorithm can im-
prove the performance of a speaker verification system. Therefore, it is necessary
to review the VAD algorithm to overcome the problems in designing a robust
[Figure 2.2: Block diagram of front-end processing — voice activity detection, feature extraction, and feature-domain channel compensation applied to the speech signal.]
speaker verification system.
Traditionally, VAD algorithms are designed using heuristic models such as energy
and zero-crossing rate [12, 109]. Recently, VAD algorithms have used many
features, with discrimination based on statistical models. Typically, the
statistical model uses Gaussian distributions to describe the features of the speech
and noise frames, develop the likelihood ratio from the comparison with statistical
parameters fitted in the model, and make speech/noise decisions based on the
hypothesis test [74]. Sohn et al. proposed a robust VAD algorithm based on the
statistical likelihood ratio model consisting of a single observed vector [130]. This
algorithm achieved significant improvements over the VAD specified in the ITU
standard G.729 Annex B [12] at low signal to noise ratio (SNR) [130]. The VAD
algorithm uses the Decision Directed method to estimate a priori SNR in the
speech signal. The statistical likelihood ratio is calculated using the current SNR
frame and the estimated a priori SNR. The likelihood is then compared with the
threshold determined by the distribution model to make the speech/non-speech
decision [74, 130].
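To make the heuristic end of this spectrum concrete, the sketch below is a minimal energy-based detector, not Sohn's statistical likelihood-ratio method. The frame length, sampling-rate assumption, and threshold factor are illustrative choices.

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 320,
               threshold_factor: float = 2.0) -> np.ndarray:
    """Label each frame as speech (True) or non-speech (False) by comparing
    its short-time energy against a rough noise-floor estimate.
    frame_len = 320 samples is 20 ms at an assumed 16 kHz sampling rate."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)
    # Use the quietest 10% of frames as a crude noise-floor estimate.
    noise_floor = np.mean(np.sort(energies)[: max(1, n_frames // 10)])
    return energies > threshold_factor * noise_floor
```

Such a detector degrades quickly at low SNR, which is precisely the weakness that motivated the statistical likelihood-ratio approach described above.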
2.3.2 Feature extraction
Feature extraction techniques are used to transform the speech signals into acous-
tic feature vectors, carrying the essential characteristics of the speech signal
that identify the speaker by their voice. The aim of feature extrac-
tion is to reduce the dimension of acoustic feature vectors by removing unwanted
information and emphasizing the speaker-specific information [22].
Unwanted information is any characteristic of the speech signal that is indepen-
dent of the speaker, such as environmental noise and channel transmission. The
speaker-specific information represents the speech characteristics that remain con-
sistent in the speech of a given speaker and it helps to distinguish the speech of
the speaker from that of other speakers [62]. The ideal feature extraction
technique can be described by the following points:
• Efficient representation of speaker-specific information (i.e. large inter-
speaker variability between speakers and small intra-speaker variability
within a given speaker).
� Easy to compute the features from the speech signals.
� Not affected by attempts to disguise the voice.
� Robust to noise and distortion.
� Occurs naturally and frequently in normal speech.
� Stable over time and robust to intra-speaker variability due to aging and
health.
A number of feature extraction approaches have been proposed in previous stud-
ies of speaker verification systems such as linear prediction cepstral coefficients
(LPCC) [37], perceptual linear prediction cepstral coefficients (PLPC) [48] and
MFCCs [27]. The MFCCs are commonly used as the feature extraction technique for modern speaker verification systems and provide better performance than other feature extraction techniques [41, 110].
Figure 2.3: Block diagram of MFCC extraction (frame blocking, windowing, FFT, mel-frequency warping, and cepstrum computation).
The MFCCs are extracted from acoustic signals through cepstral analysis. The
production of human speech consists of the excitation source and the vocal tract.
The goal of the cepstral analysis is to separate the excitation source from the
vocal tract, so that the speaker dependent information can be separated from the
redundant pitch information [123].
The process of extracting MFCC features is illustrated in Figure 2.3. The first stage involves pre-emphasis of the speech signal using a high-pass finite impulse response (FIR) filter; the aim of pre-emphasis is to boost the high-frequency part of the speech signal. The speech signal is then divided into frames, and each frame is multiplied by a window function (usually a Hamming window). The duration of a frame is typically between 20 msec and 30 msec [147]. The fast Fourier transform (FFT) is then used to transform the acoustic signal from the time domain into the frequency domain. A set of triangular
mel band pass filters is applied. This filtering gives a stronger weighting and
higher resolution to the lower frequencies in the signal [27, 62]. A logarithmic
compression is then applied, followed by a decorrelation with the discrete cosine
transform (DCT). The MFCCs can be represented as [62]
c_n = \sum_{m=1}^{M} \left[\log Y(m)\right] \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right],  (2.1)
where M is the number of mel filter bank channels, Y(m) is the output of the m-th channel of the filter bank, and n is the index of the cepstral coefficient. The first 10-20 cepstral
coefficients are typically extracted from each frame. In order to capture the
dynamic characteristics of the speech signals, the first and second time derivatives
of the MFCC are usually appended to each feature. The feature-domain channel
compensation approaches are the final stage in the front-end process, and will be
described briefly in Section 2.6.1.
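To make the pipeline concrete, the following is a minimal sketch of MFCC extraction with appended derivatives using the librosa library; the file name, sampling rate, and analysis parameters are illustrative assumptions rather than the configuration used in this thesis.

# A minimal sketch of MFCC extraction with appended delta features,
# assuming librosa; file name and parameter values are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)   # hypothetical utterance
y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis (high-pass FIR)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms frames
                            hop_length=int(0.010 * sr),  # 10 ms shift
                            window="hamming")
# Append first and second time derivatives to capture dynamic characteristics.
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])  # 39 x n_frames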
2.4 Speaker verification based GMM
2.4.1 Overview of Gaussian mixture modeling
Reynolds et al. [113, 115, 116] proposed the Gaussian mixture model (GMM) and
used it to model speaker verification systems. A GMM is a weighted sum of M
multivariate Gaussian components and can be described as
P(x|\lambda) = \sum_{k=1}^{M} P_k b_k(x),  (2.2)

where x is a D-dimensional feature vector and P_k is the mixture weight of the k-th Gaussian component b_k(x), given by

b_k(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\},  (2.3)
where µk and Σk are the mean vector and the covariance matrix, respectively.
The component weights must satisfy \sum_{k=1}^{M} P_k = 1. The GMM can be parameterized by the mean vectors, covariance matrices and mixture weights of all component densities, represented as \lambda = \{P_k, \mu_k, \Sigma_k\}, k = 1, 2, \cdots, M.
There are several methods to estimate the statistical parameters of the GMM
model. The most popular method is the maximum likelihood (ML) or maximum a
posteriori (MAP) estimator [116]. The ML estimator is used to find the model parameters that maximize the likelihood of the GMM given the training data. Let the training vectors be x = \{x_1, x_2, \cdots, x_T\}; the GMM likelihood can then be calculated as

P(x|\lambda) = \prod_{t=1}^{T} P(x_t|\lambda).  (2.4)
Direct maximization in the ML estimator is not possible because the likelihood is a non-linear function of the parameters \lambda. Therefore, expectation maximization (EM) is used as an iterative method to estimate the maximum likelihood parameters of the statistical model. The main idea of the EM algorithm is to estimate a new model \lambda^{new} from the current model \lambda^{old} using the training utterance features x, such that P(x|\lambda^{new}) > P(x|\lambda^{old}). The new model then becomes the initial model and the iteration is repeated until the convergence threshold is reached. The initial parameters of the GMM can be defined using the k-means algorithm often used in vector quantization [46].
In the E step, the posterior probability of acoustic class k is calculated as

P(k|x_t, \lambda^{old}) = \frac{P_k b_k(x_t)}{P(x_t|\lambda^{old})}.  (2.5)
In the M step, the statistical parameters are re-estimated to guarantee a monotonic increase in the model likelihood. These parameters are given by the following equations:

P_k = \frac{n_k}{T} = \frac{1}{T} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}),  (2.6)

\mu_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}) x_t,  (2.7)

\Sigma_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k|x_t, \lambda^{old}) x_t^2 - \mu_k^2,  (2.8)

where n_k is the soft count of component k over the training utterances,

n_k = \sum_{t=1}^{T} P(k|x_t, \lambda^{old}),  (2.9)

and P(k|x_t, \lambda^{old}) is the posterior probability of the k-th component given the sample x_t.
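As an illustration of Equations (2.5)-(2.9), the following sketch implements a single EM iteration for a diagonal-covariance GMM in NumPy; the data layout and the assumption of a k-means initialization are choices of this example, not a prescription from the thesis.

# A sketch of one EM iteration for a diagonal-covariance GMM (Eqs. 2.5-2.9).
# X: (T, D) training frames; weights: (M,); means, variances: (M, D).
import numpy as np
from scipy.special import logsumexp

def em_step(X, weights, means, variances):
    T, D = X.shape
    # log N(x_t; mu_k, diag(var_k)) for every frame and component
    log_b = np.stack([
        -0.5 * np.sum(np.log(2 * np.pi * variances[k]))
        - 0.5 * np.sum((X - means[k]) ** 2 / variances[k], axis=1)
        for k in range(len(weights))], axis=1)                  # (T, M)
    # E step (Eq. 2.5): posterior probability of each component.
    log_post = np.log(weights) + log_b
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                     # P(k | x_t)
    # M step (Eqs. 2.6-2.9): re-estimate parameters from soft counts.
    nk = post.sum(axis=0)                                       # Eq. 2.9
    new_weights = nk / T                                        # Eq. 2.6
    new_means = (post.T @ X) / nk[:, None]                      # Eq. 2.7
    new_vars = (post.T @ X ** 2) / nk[:, None] - new_means ** 2 # Eq. 2.8
    return new_weights, new_means, new_vars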
2.4.2 Universal background model
The EM algorithm cannot be used directly to estimate the statistical parameters of the speaker model because, in typical speaker verification systems, the amount of data available to train an individual speaker model is often limited. Therefore, MAP adaptation is used to train the parameters of individual models from
a large background model. This approach estimates the speaker model from
the universal background model (UBM) [115]. The UBM is a high order GMM
trained from a large number of speech samples collected from speaker populations
of interest to represent the speaker-independent distribution of the features. The
EM algorithm described in the previous section can be used to estimate the
parameters of the UBM.
2.4.3 Adaptation of UBM model
The speaker model is derived by adapting the statistical parameters of the UBM
using the MAP estimator. The adaptation consists of a two step estimation
process. The first step is similar to the expectation step of the EM algorithm.
This step estimates the statistical parameters (weight, mean and covariance) of
the speaker population, which are calculated from each mixture in the UBM. In
the second step, the new statistical parameters are combined with old statistical
parameters from the UBM using the adaptation mixing coefficient. It was found by Reynolds et al. [115] that adapting the covariances of the mixture components of the UBM does not improve the performance of speaker verification, and it is more common in practice to adapt only the means of the UBM mixtures. The new MAP-adapted mean (\mu_k^{MAP}) for Gaussian mixture k is updated from the prior distribution mean (\mu_k), and can be represented as [115]
\mu_k^{MAP} = \alpha_k \mu_k^{ML} + (1 - \alpha_k)\mu_k,  (2.10)

where \mu_k^{ML} is the mean estimated from the adaptation data using the maximum likelihood estimation described in the previous section, and \alpha_k is the mean adaptation mixing coefficient, which can be calculated as

\alpha_k = \frac{n_k}{n_k + \tau_k},  (2.11)

where n_k is the component occupancy count for the adaptation data, and \tau_k is the relevance factor, with values typically ranging between 8 and 32.
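A minimal sketch of mean-only MAP adaptation (Eqs. 2.10-2.11) follows; the posterior matrix `post` is computed against the UBM exactly as in the E step above, and the relevance factor value is an illustrative choice from the 8-32 range.

# A sketch of mean-only MAP adaptation of the UBM means (Eqs. 2.10-2.11).
import numpy as np

def map_adapt_means(X, post, ubm_means, relevance=16.0):
    nk = post.sum(axis=0)                                      # occupancy counts
    ml_means = (post.T @ X) / np.maximum(nk, 1e-10)[:, None]   # ML means
    alpha = nk / (nk + relevance)                              # Eq. 2.11
    # Eq. 2.10: components with high occupancy move towards the data.
    return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * ubm_means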
2.4.4 Speaker verification based GMM-UBM
Figure 2.4 shows speaker verification based on the GMM-UBM. In the development stage, the parameters of the UBM are estimated using a large number of speakers, and they represent the speaker-independent parameters. The speaker-dependent parameters are estimated using MAP adaptation in the enrolment phase. In the verification phase, the score is computed using a likelihood ratio to verify whether or not the test speech signals belong to the claimed speaker.
The aim of speaker verification is to determine whether or not the test speech segment X belongs to the claimed speaker s. This amounts to testing the two hypotheses [115]

H0: X is from the hypothesized speaker s
H1: X is not from the hypothesized speaker s
Figure 2.4: Speaker verification based on the GMM-UBM (development phase: UBM training; enrolment phase: MAP adaptation of speaker models; verification phase: likelihood ratio scoring and decision).
The score can be calculated according to the following equation [115]
S(X) = \frac{P(X|H_0)}{P(X|H_1)} \geq \theta \Rightarrow H_0,  (2.12)

where P(X|H_i), i = 0, 1, is the likelihood of hypothesis H_i evaluated on the test speech signals X, and \theta is the decision threshold. We conclude that the test speech is uttered by the speaker s if the score is greater than or equal to the decision threshold; otherwise, the test speech is not uttered by the speaker.
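In the log domain the ratio in Equation 2.12 becomes a difference of average log-likelihoods; a minimal sketch is given below, using scikit-learn GaussianMixture models as stand-ins for the adapted speaker model and the UBM.

# A sketch of GMM-UBM verification scoring (log form of Eq. 2.12).
import numpy as np
from sklearn.mixture import GaussianMixture

def verify(X, speaker_gmm, ubm, theta=0.0):
    # score_samples gives per-frame log-likelihoods; average over frames.
    llr = np.mean(speaker_gmm.score_samples(X) - ubm.score_samples(X))
    return llr >= theta        # accept H0 if the score reaches the threshold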
2.5 GMM super-vector
Super-vector is a robust way to represent the characteristics of the speech signals
using a single vector. Figure 2.5 shows a block diagram of extracting the GMM
super-vector. The GMM super-vector is the combination of GMM mean vectors.
Figure 2.5: A block diagram of GMM super-vector extraction: the MAP-adapted individual Gaussian means (e.g. 39-dimensional means from 256 mixture components) are concatenated into a single 9984 x 1 super-vector.

The GMM super-vector can be defined as a column vector of dimension CF, containing the mean of each mixture in the speaker GMM, where F is the dimension of the feature vector used in the model and C is the total number of mixtures used to represent the GMM model. The GMM super-vector
can be used as the input feature vector for a support vector machine (SVM) [26]
and JFA [63] speaker verification systems.
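The construction itself is just a concatenation; a minimal sketch, assuming 256 components and 39-dimensional features as in Figure 2.5:

# A sketch of stacking MAP-adapted component means into a super-vector.
import numpy as np

def gmm_supervector(adapted_means):
    # adapted_means: (C, F) matrix of MAP-adapted Gaussian mean vectors.
    return adapted_means.reshape(-1, 1)        # (C*F, 1) column vector

sv = gmm_supervector(np.zeros((256, 39)))      # e.g. C = 256, F = 39
assert sv.shape == (9984, 1)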
2.6 Channel compensation approaches
A speech signal is produced by a complex process which includes laryngeal, res-
piratory, and vocal tract movements. This process leads to changes in the voice of a given speaker along dimensions such as pitch, voice quality, and loudness.
Further, the characteristics of the speech signal are also altered by various fac-
tors such as ageing, health, content, speaking-style and emotional state. Thus, a
speaker cannot produce an utterance in the exact same way twice. The variation
in the characteristics of speech signals for a given speaker is called intra-speaker
variability [42].
The characteristics of a certain speech signal can be affected by extrinsic vari-
ability. The extrinsic variability can occur in various situations, such as channel
distortion introduced by a device (telephones and microphones) and speech sig-
nals being corrupted by noise and reverberation environments. The combination
of intra-speaker and extrinsic variability is called inter-speaker variability [42].
The success of speaker recognition depends on reducing the inter-speaker variability between two utterances of a given speaker, in particular the effect of the extrinsic variability. Channel compensation techniques can be used in the feature and model domains to decrease the effect of inter-speaker variability and improve speaker verification performance. A brief description of channel compensation techniques in the feature and model domains is given in the next sections.
2.6.1 Feature domain approaches
Some of the channel compensation techniques that are widely used in speaker verification systems include:
1. CMS
Furui [37] proposed CMS to reduce the effect of channel distortion by subtracting the mean from each cepstral coefficient. It was found that cepstral mean subtraction removes some speaker-specific information along with the channel information. Thus, an alternative method called cepstral mean variance normalization was proposed by Viikki and Laurila [139].
2. RASTA
Hermansky and Morgan [49] proposed RASTA to remove very slowly chang-
ing components (convolutive noise) or very rapidly changing components
(additive noise) by using a bandpass filter bank in the feature extraction.
It was found that RASTA suppresses speaker-specific information in the
low-frequency bands.
3. Feature warping
Pelecanos and Sridharan [106] proposed the feature warping technique to decrease the effect of channel distortion and noise during the feature extraction phase. The goal of feature warping is to map the distribution of a cepstral feature stream over a specified time interval to a standard normal distribution. The warping is performed via the cumulative distribution function (CDF), as described in [143], and can be represented as a non-linear transformation T which converts the original feature vector X to a warped feature \hat{X},

\hat{X} = T(X).  (2.13)
This is performed by CDF matching, which maps a given cepstral feature so that its CDF matches the standard normal distribution. The feature warping technique treats each dimension of the MFCC features as a separate stream. The CDF matching is performed over short time intervals using a sliding window (typically three seconds), and only the central frame of the window is warped each time. The warping process executes as follows:

(a) For each dimension i = 1, 2, \cdots, D, where D is the number of feature dimensions.
(b) Rank the features in dimension i in ascending order over the given sliding window.
(c) Warp the cepstral feature value x in dimension i of the central frame to its warped value \hat{X}, which satisfies

\phi = \int_{-\infty}^{\hat{X}} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx,  (2.14)

where \phi is the corresponding CDF value. Supposing the raw cepstral feature x has rank R within the sliding window of size N, the CDF value can be estimated as

\phi = \frac{R - 1/2}{N}.  (2.15)

(d) The warped value \hat{X} can be found quickly by lookup in a standard normal CDF table.
Feature warping achieves robustness to additive noise and channel mismatch while retaining the speaker-specific information that is lost by CMS, CMVN, and RASTA processing. Thus, feature warping will be used in this thesis. Figure 2.6 shows a block diagram of the feature warping process.

Figure 2.6: A block diagram of the feature warping process: extract a single cepstral stream, perform cepstral warping over a sliding window, and append the corresponding delta coefficients.
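The following sketch implements steps (a)-(d) for a cepstral stream, assuming a 301-frame (approximately 3 s at a 10 ms frame shift) sliding window; the inverse standard normal CDF replaces the lookup table of step (d).

# A sketch of feature warping by CDF matching (Eqs. 2.14-2.15).
import numpy as np
from scipy.stats import norm

def feature_warp(features, win=301):
    # features: (n_frames, D) cepstral stream; only the central frame
    # of each sliding window is warped, one dimension at a time.
    n, D = features.shape
    half = win // 2
    warped = features.copy()
    for t in range(half, n - half):
        window = features[t - half:t + half + 1]
        for i in range(D):
            R = 1 + np.sum(window[:, i] < features[t, i])  # rank in window
            phi = (R - 0.5) / win                          # Eq. 2.15
            warped[t, i] = norm.ppf(phi)                   # inverse normal CDF
    return warped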
2.6.2 Model-domain techniques (factor analysis)
Factor analysis techniques transform the high-dimensional GMM supervectors into a low-dimensional space using a small number of hidden factors. The per-
formance of GMM-UBM based speaker modeling degrades due to the presence
of channel mismatch in the GMM supervector. Kenny et al. first introduced modeling of speaker and channel variability in the GMM space [70]. Various factor analysis approaches have since been proposed, such as joint factor analysis and the i-vector approach. This section describes these factor analysis speaker modeling approaches in detail.
2.6.2.1 Joint factor analysis
The JFA has been used to compensate for channel mismatch between training and
test speech signals [63, 66, 67, 68, 69, 142]. The main idea behind the JFA is based
on decomposing the GMM mean supervectors into speaker and channel variability
[63, 69]. The JFA is a combination of mean-only MAP [115] and eigenvoice
modeling [65]. In mean-only MAP, the GMM supervector M is decomposed into
speaker-dependent and independent components as follows,
M = s + Dz_s,  (2.16)

where s is the speaker-independent component, D is a diagonal matrix, and z_s is a vector of hidden factors with standard normal distribution.
Eigenvoice modeling is an extension of mean-only MAP adaptation, where
speaker variability is assumed to lie in a low dimensional subspace of the GMM
supervector and it can be represented by
M = s + Vy_s,  (2.17)

where s is the speaker-independent UBM mean supervector, V is a low-dimensional transformation matrix, and y_s is a vector of hidden speaker factors with standard normal distribution.
Kenny introduced the JFA by combining the mean-only MAP and eigenvoice modeling techniques, modeling the speaker and channel variability spanned by low-dimension matrices V and U in the GMM supervector space [63]. The GMM supervector in this approach is represented by

M = s + Vy_s + Ux_q + Dz_{sq},  (2.18)

where s is the speaker- and channel-independent mean of the GMM supervector, V and y_s are the speaker subspace and the speaker factors with standard normal distribution, respectively, U and x_q are the channel subspace and the channel factors with standard normal distribution, respectively, and Dz_{sq} is the residual component not captured by the speaker subspace.
2.7 I-vector based speaker verification system
Dehak et al. proposed the i-vector [31], which is widely used for text-independent speaker verification systems. In recent years, the i-vector with length-normalized Gaussian probabilistic linear discriminant analysis (GPLDA) has become the modern speaker verification system [60]. The system consists of three stages: i-vector feature extraction, channel compensation, and scoring. These stages are described in detail in the sections below. Figure 2.7 shows a length-normalized GPLDA based speaker verification system.

Figure 2.7: Length-normalized GPLDA based speaker verification system (development phase: UBM, total variability and PLDA training; enrolment phase: target i-vector extraction; verification phase: batch likelihood scoring and decision).
2.7.1 I-vector feature extraction
The i-vectors represent the speaker and session GMM super-vector using a single low-dimensional total variability space. This single variability space was inspired by the discovery that the speaker information in the channel space of JFA could be used to recognize speakers more efficiently [30, 31]. A speaker- and channel-dependent
GMM super-vector in the i-vector approach, s, can be represented as [31]
s = m + Tw, (2.19)
where m is the speaker- and channel-independent super-vector, T is the low-rank total variability matrix representing the major variability across a large number of development speech signals, and w is the i-vector, which has a standard normal distribution N(0, I). The i-vector extraction is based on calculating the zero-order, N, and centralized first-order, F, Baum-Welch (BW) statistics. For a given speech utterance, the BW statistics are computed with respect to the number of UBM components (C) and the dimension of the features (F). The i-vector can be described as [31]

w = (I + T^T \Sigma^{-1} N T)^{-1} T^T \Sigma^{-1} F,  (2.20)

where I is the identity matrix, N is the CF \times CF diagonal matrix of zero-order BW statistics, F is the CF \times 1 supervector obtained by concatenating the centralized first-order BW statistics, and \Sigma is the covariance matrix, which represents the residual variability not captured by T.
An efficient method to train a total variability matrix is described in [31, 71].
McLaren and van Leeuwen [91] investigated the effect of using pooled and concatenated total variability matrices on i-vector speaker verification systems. In the pooled approach, the speech signals from the microphone and telephone are combined and a single total variability matrix is trained on them, while in the concatenated total-variability technique two total-variability matrices are trained separately on the microphone and telephone utterances and then fused to produce a single total-variability matrix. McLaren and van Leeuwen [91] found that the pooled approach represents
i-vector speaker verification more efficiently compared with the concatenated to-
tal variability approach. Thus, the pooled total variability approach will be used
in this thesis.
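Given a trained total variability matrix, Equation 2.20 reduces to a single linear solve per utterance; a minimal sketch follows, with the layout of the statistics an assumption of this example.

# A sketch of i-vector extraction (Eq. 2.20) for one utterance.
import numpy as np

def extract_ivector(T_mat, sigma, n_stats, f_stats):
    # T_mat: (CF, R) total variability matrix; sigma: (CF,) diagonal of Sigma;
    # n_stats: (CF,) zero-order stats repeated per feature dimension;
    # f_stats: (CF,) centralized first-order Baum-Welch statistics.
    R = T_mat.shape[1]
    TtSinv = T_mat.T / sigma                    # T^T Sigma^{-1}
    A = np.eye(R) + (TtSinv * n_stats) @ T_mat  # I + T^T Sigma^{-1} N T
    b = TtSinv @ f_stats                        # T^T Sigma^{-1} F
    return np.linalg.solve(A, b)                # the i-vector w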
2.8 Standard channel compensation techniques
Channel compensation techniques are necessary to decrease the channel variabil-
ity in the i-vector based speaker verification because the i-vectors are based on
one variability space that contains speaker and session information. The chan-
nel compensation techniques are estimated based on within-speaker variability
and between-speaker variability. Within-speaker variability depends on the mi-
crophone used, transmission channel, and environmental distortion such as noise
and reverberation, while between-speaker variability depends on the character-
istics of the speakers such as speaking style, phonetic content, emotional state
and health of the speaker. The goal of the channel compensation technique is
to reduce the effect of within-speaker variability and maximize the effects of
between-speaker variability [59]. Channel compensation techniques that are used
widely for i-vector speaker verification are:
1. Nuisance attribute projection
The nuisance attribute projection (NAP) is used to compensate for session variations [131]. This technique finds an orthogonal projection in the i-vector subspace that removes the unwanted within-speaker variation. The projection is found by maximizing the following equation,

J(v) = v^T S_w v,  (2.21)

where S_w is the within-speaker variability matrix, which can be calculated as

S_w = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T,  (2.22)

where S is the total number of speakers, n_s is the number of utterances of speaker s, and the mean i-vector of the speaker, \bar{w}_s, can be calculated as

\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_i^s.  (2.23)

The NAP projection matrix is found by

P = I - VV^T,  (2.24)

where I is the identity matrix and V is obtained using an eigendecomposition of the within-speaker variability covariance matrix. Finally, the NAP-compensated i-vector can be calculated as

w_{NAP} = P^T w.  (2.25)
2. Within-class covariance normalization
The within-class covariance normalization (WCCN) was first applied to SVM-based speaker recognition to attenuate the high-variance dimensions of the within-class variability [47]. Dehak et al. [31] then used WCCN as a channel compensation technique for i-vector speaker verification. The within-class covariance matrix for i-vector speaker verification can be calculated as

W = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T.  (2.26)

The inverse of matrix W is used to weight the dimensions of the within-class matrix, and WCCN can be combined with principal component analysis (PCA) to address the problem of estimating the inverse of W [47]. The WCCN-compensated i-vector can be defined as

w_{WCCN} = B_1^T w,  (2.27)

where B_1 is obtained through the Cholesky decomposition B_1 B_1^T = W^{-1}.
3. Linear discriminant analysis
The linear discriminant analysis (LDA) seeks orthogonal axes to reduce
the dimension while retaining as much of the speaker-specific information
as possible [31]. The axes A must satisfy the requirement of maximiz-
ing between-speaker variability and minimizing within-speaker variability
through the eigenvalue decomposition of
S_b v = \lambda S_w v,  (2.28)

where \lambda is the diagonal matrix of eigenvalues, and S_b and S_w are the between-speaker and within-speaker variability matrices, respectively. These variabilities can be calculated by the following equations,

S_b = \sum_{s=1}^{S} n_s (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^T,  (2.29)

S_w = \sum_{s=1}^{S} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T,  (2.30)

where the mean i-vector for all speakers, \bar{w}, is calculated by

\bar{w} = \frac{1}{N} \sum_{s=1}^{S} \sum_{i=1}^{n_s} w_i^s,  (2.31)

where N is the total number of sessions. The LDA-compensated i-vector (w_{LDA}) can be calculated by

w_{LDA} = A^T w.  (2.32)
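A sketch of estimating LDA and WCCN transforms from development i-vectors grouped by speaker (Eqs. 2.26-2.32) is given below; the number of retained LDA axes is an illustrative choice.

# A sketch of LDA followed by WCCN for i-vector channel compensation.
import numpy as np
import scipy.linalg

def train_lda_wccn(ivecs_by_speaker, n_axes=150):
    all_iv = np.vstack(ivecs_by_speaker)
    w_bar = all_iv.mean(axis=0)                           # Eq. 2.31
    D = all_iv.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for iv in ivecs_by_speaker:                           # iv: (n_s, D)
        ws = iv.mean(axis=0)
        Sb += len(iv) * np.outer(ws - w_bar, ws - w_bar)  # Eq. 2.29
        Sw += (iv - ws).T @ (iv - ws)                     # Eq. 2.30
    # Eq. 2.28: generalized eigenproblem (Sw assumed positive definite).
    vals, vecs = scipy.linalg.eigh(Sb, Sw)
    A = vecs[:, np.argsort(vals)[::-1][:n_axes]]          # LDA axes
    # WCCN (Eq. 2.26) estimated in the LDA-projected space.
    centered = [(iv - iv.mean(axis=0)) @ A for iv in ivecs_by_speaker]
    W = sum(c.T @ c for c in centered) / len(ivecs_by_speaker)
    B1 = np.linalg.cholesky(np.linalg.inv(W))             # B1 B1^T = W^{-1}
    return A, B1  # compensated i-vector: B1^T (A^T w), Eqs. 2.27 and 2.32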
2.9 PLDA speaker verification system
Prince et al. [108] proposed PLDA for face recognition. Then, the PLDA was
adapted to the task of i-vector speaker verification systems [17, 64, 120]. Kenny
[64] introduced the Gaussian probabilistic linear discriminant analysis (GPLDA)
for i-vector speaker verification. There are two main assumptions for GPLDA.
Firstly, the statistics of speaker and channel components are independent. Sec-
ondly, the distribution of speaker and channel components is a Gaussian distribu-
tion. The main benefit of these assumptions is that the speaker likelihood ratio
can be obtained in closed-form [42]. Kenny [64] also introduced heavy-tailed prob-
abilistic linear discriminant analysis (HTPLDA) by using student’s t-distribution
instead of Gaussian distribution which is used in the GPLDA approach. There
are two main advantages of using HTPLDA as a classifier for i-vector speaker verification systems. Firstly, HTPLDA provides a better representation of outliers in the i-vector space than GPLDA [64]. Secondly, it allows for larger deviations from the mean (e.g. severe channel distortion and speaker effects). It was found that the HTPLDA approach achieved significant improvements compared with the GPLDA
approach [64]. Recently, the length-normalized i-vector technique was proposed
by Garcia-Romero et al. [43], and this technique can be used to transform the
heavy-tailed distribution to Gaussian distribution. A brief description of GPLDA,
HTPLDA and length-normalized GPLDA will be presented in the sections below.
2.9.1 GPLDA
The speaker and channel independent i-vector can be represented in the GPLDA
model as follows,
w_r = m + U_1 x_1 + U_2 y_r + \epsilon_r,  (2.33)

where r indexes the utterances of a speaker, m is the global offset, U_1 is the eigenvoice matrix, U_2 is the eigenchannel matrix, x_1 and y_r are the latent identity vectors of the speaker and channel, each with standard normal distribution, and \epsilon_r is the residual term, assumed to have a Gaussian distribution with zero mean and diagonal covariance matrix \Lambda^{-1}. The between-speaker variability in the PLDA model can be represented as m + U_1 x_1 with covariance matrix U_1 U_1^T. The within-speaker variability is represented as U_2 y_r + \epsilon_r with covariance matrix \Lambda^{-1} + U_2 U_2^T.
In this thesis, the precision matrix (Λ) is assumed to be full rank and the eigen-
channel matrix (U2) is removed from Equation 2.33. It was found that the per-
formance of GPLDA speaker verification does not improve significantly when the
eigenchannels are included in the GPLDA system, and removing them provides a
useful decrease in the computational complexity [43, 64]. The modified GPLDA
can be represented as
w_r = m + U_1 x_1 + \epsilon_r.  (2.34)
The details of the estimation model parameters of GPLDA are given in [64].
2.9.2 HTPLDA
For the HTPLDA model, the speaker factor (x_1), channel factor (y_r) and residual (\epsilon_r) are assumed to have Student's t-distributions instead of the Gaussian distributions assumed in GPLDA. The x_1, y_r and \epsilon_r can be represented as
x_1 \sim N(0, u_1^{-1} I) \text{ where } u_1 \sim G(n_1/2, n_1/2),  (2.35)

y_r \sim N(0, u_{2r}^{-1} I) \text{ where } u_{2r} \sim G(n_2/2, n_2/2),  (2.36)

\epsilon_r \sim N(0, u_r^{-1} \Lambda^{-1}) \text{ where } u_r \sim G(v/2, v/2),  (2.37)

where n_1, n_2 and v are the degrees of freedom, u_1, u_{2r} and u_r are scalar-valued hidden variables, N(\mu, \Sigma) denotes a Gaussian distribution with mean \mu and covariance matrix \Sigma, and G(a, b) denotes a Gamma distribution with parameters a and b [64].
2.9.3 Length-normalized GPLDA
Garcia-Romero et al. [43] introduced a technique to convert the i-vector behavior from heavy-tailed to Gaussian. They proposed a length-normalized GPLDA technique, and found that the length-normalized GPLDA performs similarly to HTPLDA with lower computational complexity [43]. Therefore, the length-normalized GPLDA technique will be used in this thesis. The length-normalized
i-vector follows two steps: whitening and length normalization of i-vectors. The
whitened i-vector, w_{wht}, can be calculated as

w_{wht} = d^{-1/2} U^T w,  (2.38)

where U is an orthogonal matrix containing the eigenvectors of the covariance matrix \Sigma estimated from the development i-vectors, and d is the diagonal matrix of the corresponding eigenvalues. The normalized i-vector, w_{norm}, can be computed as

w_{norm} = \frac{w_{wht}}{\|w_{wht}\|}.  (2.39)
If the i-vectors were Gaussian distributed, the distribution of the norm of the whitened i-vectors should be a Chi distribution with degrees of freedom equal to the i-vector dimension. Garcia-Romero et al. found that the i-vector norm does not match the Chi distribution, and this mismatch led to the conclusion that the i-vectors have a heavy-tailed distribution. The authors [43] found that the length-normalization technique can be used to convert the non-Gaussian i-vector distribution to a Gaussian feature distribution.
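A minimal sketch of the two steps follows, with the whitening transform estimated from development i-vectors; the mean subtraction is added here as a common practical assumption rather than part of Equation 2.38.

# A sketch of i-vector whitening and length normalization (Eqs. 2.38-2.39).
import numpy as np

def train_whitener(dev_ivecs):
    mu = dev_ivecs.mean(axis=0)
    sigma = np.cov(dev_ivecs, rowvar=False)    # development covariance Sigma
    d, U = np.linalg.eigh(sigma)               # Sigma = U diag(d) U^T
    return mu, U, d

def length_normalize(w, mu, U, d):
    w_wht = (U.T @ (w - mu)) / np.sqrt(d)      # Eq. 2.38 (mean-centered)
    return w_wht / np.linalg.norm(w_wht)       # Eq. 2.39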
2.10 PLDA scoring
GPLDA, HTPLDA and length-normalized GPLDA based i-vector scoring is cal-
culated using a batch likelihood ratio [64]. Given two i-vectors, wtarget and wtest,
the batch likelihood ratio can be computed as:
\text{score} = \ln \frac{P(w_{target}, w_{test}|H_1)}{P(w_{target}|H_0) P(w_{test}|H_0)},  (2.40)
where H1 and H0 are, respectively, the hypotheses that the i-vectors come from
the same speaker or another speaker.
2.11 Performance evaluations
The performance of the speaker verification can be measured in terms of errors.
This section discusses in detail the types of error and evaluation metrics commonly
used in speaker verification systems.
2.11.1 Type of error
Speaker verification systems are characterized by two types of errors: false acceptance and false rejection [59].
False acceptance: A false acceptance occurs when the speech segments from an impostor speaker are falsely accepted as a target speaker by the system. The false acceptance rate, FAR, can be defined as

FAR = \frac{\text{Total number of false acceptance errors}}{\text{Total number of impostor speaker attempts}}.  (2.41)
False rejection: A false rejection occurs when the target speaker is rejected by the verification system. The false rejection rate, FRR, can be defined as

FRR = \frac{\text{Total number of false rejection errors}}{\text{Total number of enrolled speaker attempts}}.  (2.42)
2.11.2 Performance metrics
The performance metrics of speaker verification systems can be measured us-
ing the equal error rate (EER) and minimum decision cost function (mDCF) [114]. These measures represent different performance characteristics of the system, although the accuracy of the measurements depends on the number of trials evaluated in order to robustly compute the relevant statistics [59]. Speaker verification performance can also be represented graphically by using the detection error trade-off (DET) plot [59]. Figure 2.8 shows an example of a DET plot.

Figure 2.8: An example of a DET plot, showing the false acceptance rate (FAR) [%] against the false rejection rate (FRR) [%] on logarithmic axes.
The EER is obtained by adjusting the threshold until the false acceptance rate and false rejection rate are equal. A lower EER indicates better system performance, because the total of the false acceptance and false rejection errors at the EER operating point decreases [93].
The decision cost function (DCF) is defined by assigning a cost to each type of error and taking into account the prior probabilities of target and impostor trials. The decision cost function can be defined as

DCF = C_{miss} P_{miss} P_{target} + C_{fa} P_{fa} P_{impostor},  (2.43)

where C_{miss} and C_{fa} are the costs of a missed detection and a false alarm, respectively, the prior probabilities of target and impostor trials are given by P_{target} and P_{impostor}, respectively, and the rates of missed targets and falsely accepted impostor trials are represented by P_{miss} and P_{fa}, respectively. The mDCF can be used to evaluate speaker verification by selecting the minimum value of the DCF as the threshold is varied:

mDCF = \min\left[C_{miss} P_{miss} P_{target} + C_{fa} P_{fa} P_{impostor}\right],  (2.44)
where Pmiss and Pfa are the miss and false alarm rates recorded from the trials,
and the other parameters are adjusted to suit the evaluation of application-specific
requirements.
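Both metrics can be computed by sweeping the decision threshold over the observed trial scores; a sketch follows, with NIST-style cost and prior values used purely as illustrative defaults.

# A sketch of computing the EER and minimum DCF (Eqs. 2.43-2.44).
import numpy as np

def eer_mindcf(target_scores, impostor_scores,
               c_miss=10.0, c_fa=1.0, p_target=0.01):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))         # FAR == FRR operating point
    eer = (p_miss[idx] + p_fa[idx]) / 2.0
    # Eq. 2.44; P_impostor assumed equal to 1 - P_target in this sketch.
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()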
2.12 Speaker recognition in adverse conditions
The algorithms of speaker recognition under adverse conditions can be divided
into five types according to the techniques used to improve speaker recognition
performance. These techniques are further explored below.
2.12.1 Feature extraction based on multiband techniques
Multiband feature extraction techniques are based on decomposing the full frequency range of the speech signals into various frequency sub bands by using the DWT.
Traditional feature extraction techniques such as MFCC, LPCC, and PLPC can
be used to extract feature vectors from each sub band of the DWT. The feature
recombination combines sub band cepstral features to obtain a single feature vec-
tor that is used to train classifier models. Various multiband feature extraction
techniques have been proposed in previous studies.
Mirghafori and Morgan [95] used a combination of multiband and full band fea-
ture extraction techniques in speech recognition under clean and reverberation
conditions. The multiband feature technique is based on decomposing the speech
signals into various frequency sub bands. Relative spectral transform-perceptual
linear prediction (RASTA-PLP) cepstral coefficients were used to extract the features from the
sub band and full band of the speech signals. Then, the features extracted from
the sub band and full band of the speech signals were combined into a single
feature vector. The results demonstrated that the proposed method reduced the word error rate under clean and reverberation conditions compared with using multiband feature techniques only. The proposed method could also be used
in speaker recognition systems. The performance of the proposed method un-
der reverberation conditions could be improved by using channel compensation
techniques such as feature warping during the feature extraction phase [58].
Tufekci and Gurbuz [134] proposed a robust speaker verification system for noisy
speech signals. This system is based on using parallel model compensation and
mel frequency discrete wavelet coefficients (MFDWCs) as the feature extraction
technique. The performance of MFDWCs was evaluated using National Institute
of Standards and Technology (NIST) and NOISEX-92 databases. The experi-
mental results demonstrated that in the presence of various types of noise at -6
dB and 0 dB, the proposed speaker verification system achieved an average improvement in equal error rate over MFCC of 26.44% and 23.73%, respectively.
Malik and Afsar [87] presented an effective feature extraction technique for
speaker recognition systems. The feature extraction is based on decomposing
the speech signals into approximation and detail coefficients by using DWT. The
MFCCs were used to extract the features from the approximation coefficients.
Vector quantization techniques were used as a classifier. Experimental results
demonstrated that the proposed feature technique achieved a recognition rate of
96.25% and 86.77% for non telephonic and telephonic speech data from PIEAS
database. Although the performance of the speaker recognition system proposed
in this research improved, it has two main gaps. Firstly, some important fea-
tures extracted from the detail coefficients, such as unvoiced speech signal, are
lost. Secondly, this research ignored the effect of extracting some features from
the full band of the speech signals to improve speaker recognition performance.
Speaker recognition performance could improve by extracting the features from
the detail coefficients and the full band of the speech signals.
Shafik et al. [123] proposed a robust feature extraction technique to enhance
speaker identification performance in the presence of additive white Gaussian
noise (AWGN) and telephone degradation with colored noise. The feature ex-
traction technique is based on combining the features of the MFCC and DWT
of the noisy speech signals into a single feature vector. To reduce the effect of
noise, a wavelet denoising technique was also used prior to feature extraction.
Experimental results show that extracting the features from the enhanced speech
signals improved speaker recognition performance in the presence of high lev-
els of AWGN compared with extracting the features without using the wavelet
denoising technique. However, the wavelet threshold technique did not achieve
improvement in the speaker recognition performance under colored noise. The
wavelet threshold technique fails to suppress the colored noise because the colored
noises are concentrated at certain frequency sub bands when the colored noise
is mixed with clean speech signals. Thresholding all wavelet coefficients in high
frequency sub bands could distort the enhanced speech signals. Thus, using a
suitable speech enhancement algorithm is necessary to improve speaker recogni-
tion performance when the speech signals are mixed with various types and levels
of colored noise.
Maged et al. [86] presented a speaker identification system in the presence of
noise. This system used DWT to decompose the noisy speech signals. The
approximation coefficients and detail coefficients were combined into a single vec-
tor. Then, the MFCCs were used to extract the features from the noisy speech
signals. The vector quantization model was used as a classifier. Experimental re-
sults demonstrated that the system improved speaker identification performance
at high SNR for 13 speakers compared to traditional MFCC features. How-
ever, speaker identification performance based on the wavelet transform does not
improve at low SNR because the system ignored the effect of using important
features extracted from the full band of the noisy speech to improve speaker
identification performance. It was found that combining the features from the
full band and sub band would improve speaker recognition performance [95].
Lei et al. [80] presented a new forensic speaker recognition system based on using
the wavelet cepstral coefficient as the feature extraction technique to train the
i-vector speaker recognition system. Cosine distance scoring was used to compare
the i-vectors of the suspect and verification speech signals. LDA and WCCN were
added to cosine distance scoring to solve the problem related to channel variabil-
ity. Experimental results showed that using wavelet cepstral coefficients under
different levels of noise significantly improved speaker recognition performance
compared with applying features extracted from the MFCC to GMM classifier.
The advantages of combining full band MFCC and multiband feature extraction
techniques for improving speaker recognition performance in the presence of noise
and reverberation conditions are:
1. DWT extracts the important features from the noisy speech signals in the
time and frequency domains. However, the features extracted from the
time domain of the noisy speech signals are lost by assuming a stationary
time frame in traditional cepstral features [9, 123]. Therefore, the DWT
adds some features to the full band features extracted from the MFCC of
the noisy speech signals, thereby assisting in improving speaker recognition
performance in the presence of noise [123].
2. Since the boundary materials used in most rooms are less absorptive at low-
frequency sub bands, natural reverberation affects low-frequency sub bands
more than high-frequency sub bands and leads to distortion of the spectral
information at low-frequency sub bands [95]. DWT can be used to extract
more features from the low-frequency sub bands. These features add some
important features to the full band of the MFCC. Thus, a combination of
full band MFCC and multiband features extracted from the reverberated
signals may achieve better recognition performance than full band MFCC
features extracted in the presence of reverberation conditions.
The disadvantages of using multiband feature extraction techniques include:
1. Speaker recognition performance is affected by the number of decomposition
levels used to extract the features from the speech signals in the presence
of noise and reverberation conditions. Speaker recognition performance
under noisy and reverberant conditions could decrease when the number of
decomposition levels in DWT increases because the number of samples at
low frequency is too small to represent the spectral characteristics of the
speech signals under these conditions [21]. Speaker recognition performance
may also be decreased by using two levels of DWT, because DWT may not
separate the noise that concatenates at high-frequency sub bands from the
speech signals. Thus, it is necessary to choose a suitable number of levels
in DWT to extract the features from the speech signals under noisy and
reverberant conditions.
2. Information about the correlation between sub band features and full band features is lost when multiband feature extraction techniques are used for feature extraction. To tackle this problem, it is important to combine the full band MFCC and the sub band MFCC features extracted from the DWT of the speech signals into a single feature vector [21, 90], as sketched below.
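As a sketch of the fusion just described, the following combines full band MFCCs with MFCCs extracted from the DWT approximation and detail coefficients; the wavelet family, the decomposition depth, and the crude frame alignment are illustrative assumptions that simplify the effective sampling rates of the sub bands.

# A simplified sketch of fusing full band MFCC with DWT sub band MFCC.
import numpy as np
import pywt
import librosa

def dwt_mfcc_fusion(y, sr, n_mfcc=13, levels=3):
    full = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # full band
    streams = [full]
    # wavedec returns [approximation, detail_L, ..., detail_1] sub bands.
    for band in pywt.wavedec(y, "db4", level=levels):
        m = librosa.feature.mfcc(y=band.astype(np.float32), sr=sr,
                                 n_mfcc=n_mfcc)
        # crude time alignment of the sub band stream to the full band
        idx = np.linspace(0, m.shape[1] - 1, full.shape[1]).astype(int)
        streams.append(m[:, idx])
    return np.vstack(streams)       # single concatenated fusion feature vector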
2.12.2 Feature warping
Feature warping has also been used as a channel compensation technique during
the feature extraction phase for improving speaker verification performance in
the presence of noise and channel mismatch. Some studies used feature warp-
ing for improving speaker verification performance under noisy and reverberant
conditions.
Pelecanos and Sridharan [106] proposed a feature warping technique that is robust to noise and channel mismatch. The feature warping technique maps the
distribution of the cepstral feature to standard normal distribution over a partic-
ular time interval. The experimental results demonstrated that feature warping
improved speaker verification performance compared with other channel compen-
sation techniques such as CMS, modulation spectrum processing, and short-term
windowed CMS and variance normalization.
Jin et al. [58] investigated two approaches to improve speaker recognition per-
formance in far-field microphone situations. The first approach introduced re-
verberation compensation and feature warping to improve speaker recognition
performance under mismatched conditions. The second approach used high-level
linguistic features. The results showed that both approaches significantly im-
proved speaker recognition performance under mismatched conditions between
enrolment and verification data.
The advantages of using feature warping for improving speaker verification per-
formance under noisy and reverberant conditions are:
1. Feature-warped MFCC is more robust to noise and reverberation conditions
compared with traditional feature extraction techniques [106, 58].
2. Feature-warped MFCC improves speaker verification performance com-
pared with applying MFCC to other channel compensation techniques such
as CMS, CMVN, RASTA, and modulation spectrum processing in the pres-
ence of noise and channel mismatch between training and testing conditions. Feature warping improves the performance of speaker verification systems compared with these techniques because it preserves the speaker information that is lost by other channel compensation techniques [106].
Although feature warping and a combination of DWT-MFCC and MFCC were
proposed in previous studies to improve speaker recognition performance under
noisy and reverberant conditions, the robustness of fusing feature warping with MFCC and DWT-MFCC features, individually or in concatenative combination, has not yet been investigated for improving the modern i-vector PLDA speaker verification framework in the presence of different types of noise, reverberation, and combined noisy and reverberant conditions.
The advantages of combining feature warping with DWT-MFCC and MFCC to
extract the features from the speech signals and improve speaker verification per-
formance in the presence of various types of environmental noise and reverberation
conditions are:
1. The feature-warped MFCC extracted from the DWT of the noisy speech
signals adds more features from the approximation and detail coefficients
to the full band feature-warped MFCC of the noisy speech signals. Thus,
the combination of feature warping with DWT-MFCC and MFCC could
improve speaker verification performance in the presence of different types
of environmental noise.
2. Since reverberation affects the low-frequency sub band of the speech signals
more than high-frequency sub bands [95], the DWT-MFCC feature warping
can be used to extract some important information from the low-frequency
sub band of the speech signals. These features add some important features
to the full band feature-warped MFCC. Therefore, the combination of fea-
ture warping with DWT-MFCC and MFCC could improve the performance
of speaker verification systems under reverberation conditions.
2.12.3 Independent component analysis
Independent component analysis is a technique for a linear transformation of the
observed signal into components that are statistically independent. The principle
of estimating independent components is based on maximizing the non-Gaussian
distribution of one independent component. This can be achieved by maximizing
the difference between the distribution of the non-Gaussian component and the
Gaussian distribution of the other components. Various contrast functions were
used to measure this difference and estimate the source signals, such as kurtosis,
negentropy and approximate negentropy [55]. The studies which used indepen-
dent component analysis as a multiple channel speech enhancement algorithm or
in speaker recognition systems under noisy conditions are discussed below.
Li et al. [84] proposed a combination of wavelet threshold technique and fast ICA
to reduce the effect of AWGN from the noisy speech signals. The clean speech
signal was corrupted by AWGN at different values of SNR. Wavelet threshold
technique was used to reduce the noise from the noisy speech signals. Then, a
fast ICA algorithm was used to separate the denoised mixed speech signals. The
results showed that the fast ICA algorithm achieved higher SNR than wavelet
denoising speech signals at different input SNR values.
Li et al. [85] used ICA as a speech enhancement algorithm. The clean speech
signal was mixed with different levels of AWGN and the fast ICA algorithm was
used to separate the clean speech from the noisy speech signals. Experimental
results demonstrated that the fast ICA algorithm significantly improved SNR
compared with the spectral subtraction technique. Multiple channel speech en-
hancement algorithms improved the quality of noisy speech signals compared
with single channel speech enhancement algorithms [118]. Therefore, multiple
channel speech enhancement based on the ICA could be used as front-end pro-
cessing in speaker recognition systems and it might improve speaker recognition
performance under different types and levels of noise.
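For reference, a minimal sketch of the separation step with scikit-learn's FastICA is given below, assuming as many microphones as sources; this is an illustration of the general technique, not the exact configuration used in the cited studies.

# A sketch of separating multi-microphone mixtures with FastICA.
from sklearn.decomposition import FastICA

def separate(mixtures):
    # mixtures: (n_samples, n_channels) observed microphone signals.
    ica = FastICA(n_components=mixtures.shape[1],
                  max_iter=1000, random_state=0)
    sources = ica.fit_transform(mixtures)   # estimated independent sources
    return sources, ica.mixing_             # estimated mixing matrix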
There is randomness in the separation of the mixed signals in the ICA algorithm, due to the random initialization of the unmixing matrix, and this randomness could decrease recognition performance. Naik [100] evaluated the use of ICA in surface
Electromyography and bio signal applications. Naik examined the randomness of
estimating unmixing matrix and proposed multi-run ICA algorithm to improve
the quality of the unmixing matrix. The multi-run ICA is based on computing
the fast ICA algorithm several times to estimate all unmixing matrices. Signal to
interference ratio (SIR) measurement was used to evaluate the quality of the un-
mixing matrices. The best of the unmixing matrices can be chosen by computing
the highest value of the SIR of the unmixing matrix, and this unmixing matrix
was used to estimate all of the sources signals. The multi-run ICA algorithm
was used in the classification of hand gestures. Experimental results showed that
a multi-run ICA algorithm achieved a higher average classification of hand ges-
tures (99%) compared with a traditional fast ICA, which achieved only average
classification (65%).
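A hedged sketch of the multi-run idea follows: FastICA is run with several random initializations and the run with the best quality score is kept. Because computing the true SIR requires reference source signals, a kurtosis-based non-Gaussianity score is used here as a stand-in for the SIR criterion of [100].

# A sketch of multi-run ICA with a proxy quality measure (not the exact
# SIR criterion of [100], which requires reference source signals).
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

def multi_run_ica(mixtures, n_runs=10):
    best_sources, best_score = None, -np.inf
    for seed in range(n_runs):
        ica = FastICA(n_components=mixtures.shape[1],
                      max_iter=1000, random_state=seed)
        sources = ica.fit_transform(mixtures)
        # non-Gaussianity of the estimates as a separation quality proxy
        score = np.mean(np.abs(kurtosis(sources, axis=0)))
        if score > best_score:
            best_sources, best_score = sources, score
    return best_sources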
Denk et al. [32] proposed a technique based on convolutive ICA and spectral
subtraction to improve the performance of speaker recognition in forensic appli-
cations. The mixture of the speech signals can be generated by combining two
or three speakers obtained by using an array of microphones. The mixed sig-
nals were corrupted by coloured noise in each microphone, with different noise
correlation levels. Convolutive ICA was used to separate the individual speech
signals. The essential point of the ICA used in this research was that the number
of microphones was greater than the number of speakers. Therefore, ICA is able
to separate individual speakers from each other and the major part of the noise.
Spectral subtraction techniques were applied to the separated speech signals to remove the noise from them. MFCC was used as the feature extraction technique for the enhanced test speech signals. Finally, a GMM was used to deter-
mine the similarity between enrolment and verification speech signals. The max-
imum likelihood was used to determine the identity of the speaker. Experimental
results showed that this technique is able to detect all speakers simultaneously
and achieves a higher success rate in identifying the speakers compared with the
method used in [127].
Ribas et al. [117] proposed a full multicondition training technique to reduce
the effect of noise and reverberation conditions on an i-vector PLDA speaker
verification system. Full multicondition is based on pooling clean, noisy and re-
verberated speech signals in the development, enrolment and verification phases
of speaker verification systems. Speech enhancement based on the flexible audio
source separation was used to decrease the effect of noise from the enrolment and
verification speech signals. Experimental results showed that using the speech en-
hancement algorithm improved the EER compared to the i-vector PLDA speaker
verification without using speech enhancement. The performance of the i-vector
speaker verification system under noisy and reverberant environments could be
improved by proposing a robust feature extraction to these conditions instead of
using traditional MFCC which is not robust to environmental noise and rever-
beration conditions.
Li and Adali [82] proposed independent component analysis by entropy bound
minimization (ICA-EBM) algorithm. This algorithm is based on computing sev-
eral entropy bounds from four contrast measuring functions (two even and two
odd functions) and using the tightest entropy as the final entropy estimate, rather
than using a single entropy bound in traditional parametric ICA approaches. The
main advantage of using four contrast functions in this algorithm is that the ICA-EBM can approximate the entropy of a wide range of distributions, such as sub- or super-
Gaussian, unimodal, symmetric, or skewed. The ICA-EBM algorithm was used
to separate the mixed speech signals under clean conditions and the algorithm
improved the separation performance compared with other traditional ICA algo-
rithms due to its superior convergence properties and high flexibility of density
matching.
The advantages of using the ICA algorithm in speaker verification systems are:
1. The ICA algorithm can be used to separate individual speech from the
mixed speech signals and it improves speaker recognition performance [32].
2. The ICA algorithm can be used as a multiple channel speech enhancement
algorithm to separate the speech from the noisy speech signals. The perfor-
mance of speech enhancement based on the ICA algorithm improves com-
pared with single channel speech enhancement algorithms [84, 85]. Thus,
the enhanced speech signals from the ICA algorithm could be used to im-
prove forensic speaker verification performance under noisy conditions.
3. The ICA-EBM algorithm can be used to separate sources that come from different distributions, and it separates mixed speech signals under clean conditions more efficiently than traditional ICA algorithms due to its superior separation performance and convergence properties [82]. Thus, the ICA-EBM algorithm could be used as a speech enhancement algorithm which might improve speaker verification performance more efficiently than the fast ICA algorithm.
2.12.4 I-vector PLDA speaker recognition
The i-vector PLDA technique has been widely used for text-independent speaker
recognition systems. Various studies used the state-of-the art i-vector PLDA to
evaluate speaker recognition systems under noisy and reverberant conditions.
Mandasari et al. [89] studied the effect of car and babble noise on the performance
of a traditional GMM UBM (dot scoring), i-vector with cosine distance scoring
and i-vector PLDA speaker verification systems. A Wiener filter was used as front-
end to decrease the effect of noise and improve speaker verification performance.
Experimental results demonstrated that the modern i-vector PLDA improved
speaker verification performance compared with the dot scoring approach. The
i-vector PLDA was also found to offer more robust performance for mixed car
rather than babble noise. The applications of the Wiener filter achieved a small
improvement in speaker verification performance. It is believed that some single
channel speech enhancement algorithms may distort the enhanced speech signals
[117]. However, a multiple channel speech enhancement algorithm improved the
speech quality compared with single channel speech enhancement algorithms.
Thus, using multiple channel speech enhancement algorithms could improve noisy
i-vector speaker verification performance.
Dean et al. [28] designed the QUT-NOISE-SRE protocol for speaker recognition
evaluation (SRE) under noisy conditions. The enrolment and verification speech
signals from the NIST database were mixed with different random sessions of
STREET, CAR, CAFE, REVERB, and HOME noises from the QUT-NOISE
database [29] at SNRs ranging from -10 dB to 15 dB. The protocol of QUT-
NOISE-SRE was used to evaluate the modern PLDA i-vector speaker recognition
system, demonstrating the importance of designing a speaker recognition system
focused on the VAD techniques. The performance of the QUT-NOISE-SRE protocol could be improved by using a multiple channel speech enhancement algorithm to
reduce the effect of environmental noise and improve speaker verification perfor-
mance under noisy conditions.
2.12.5 Deep neural network
In recent years, deep neural networks (DNN) have also been widely used in speaker verification systems. Some studies have used DNN for speaker verification under noisy
and reverberant conditions.
Zhao et al. [146] proposed a robust speaker recognition system in the presence
of noise and reverberation conditions. The proposed system reduced the effect
of reverberation by training the speaker model in reverberation conditions. The
background noise was removed by using the computational auditory scene analysis
(CASA) technique, which is based on binary time-frequency (T-F) masking to
separate speech from the noise. The binary masking can be performed using DNN.
Experimental results showed that in the presence of a wide range of reverberation
and SNRs, the proposed algorithm improved speaker recognition performance
over related systems.
Du et al. [34] used a DNN as a front-end to predict clean speech features and improve noisy speaker verification systems. The DNN was used to enhance speech features before applying the i-vector PLDA speaker verification system. The DNN was trained using parallel features extracted from the UBM training speech signals under clean and noisy conditions, aligned at the frame level, by minimizing the mean square error (MSE). The proposed system was evaluated using the NIST SRE 2010 and NOISEX-92 databases. Experimental results showed that DNN-based feature compensation improved the EER compared with uncompensated features at SNRs ranging from 0 dB to 20 dB.
Zolbaek et al. [75] used a modern deep recurrent neural network (DRNN) as a speech enhancement technique to improve text-dependent speaker verification performance under noisy conditions. The proposed method uses a long short-term memory (LSTM) based DRNN as the speech enhancement front-end of an i-vector speaker verification system. The performance of the proposed method was compared with non-negative matrix factorization (NMF) and short-time spectral amplitude minimum mean square error (STSA-MMSE) techniques. Experimental results on the RSR2015 speech database demonstrated that male-speaker, text-independent DRNN-based speech enhancement, applied without prior knowledge of the noise type, achieved large performance improvements compared with NMF and STSA-MMSE at different SNR values.
Chang and Wang [20] proposed a front-end speech separation technique for an
i-vector speaker recognition system to deal with the speech signals mixed with
background noise. The DNN was used to estimate the ideal ratio mask (IRM).
The separated speech was used to extract the enhanced features for the GMM/i-
vector and DNN/i-vector in speaker identification and speaker verification sys-
tems. The proposed algorithm was compared with the multi-condition trained
baseline and a traditional GMM-UBM i-vector system. The results showed that
the performance of using speech separation achieved an average improvement of
8% in identification accuracy and 1.2% in EER.
Although DNN-based speaker recognition performance improved compared with the UBM/i-vector framework, the DNN requires a large amount of data (1300 hours) with proper transcription to achieve a significant improvement in speaker recognition performance [81]. In real forensic applications, it is hard to collect large amounts of speech with proper transcriptions for training a DNN. It was found in [145] that a DNN-based i-vector did not show much improvement in forensic speaker verification performance in the presence of various types of environmental noise compared with a UBM-based i-vector framework when a limited amount of data was used to train the DNN. Therefore, the DNN framework will not be used in this thesis to improve forensic speaker verification performance.
2.13 Limitation of the existing techniques
We identify two gaps from the previous studies of speaker recognition in adverse
conditions and propose research directions to address them:
1. The effectiveness of combining feature warping with DWT-MFCC and
MFCC features individually or the concatenative fusion of these features
has not been investigated yet for state-of-the-art i-vector forensic speaker
verification in the presence of environmental noise only, reverberation, and
noisy and reverberant conditions. The research question in this thesis seeks
to answer whether fusion of feature warping with MFCC and DWT-MFCC
can improve modern i-vector forensic speaker verification performance un-
der different levels and types of environmental noise and reverberation con-
ditions.
2. A multi-run ICA or ICA-EBM algorithm has not previously been applied to separate speech from noisy speech signals in the presence of high levels of environmental noise and reverberation conditions, although multi-run ICA has been applied to biological signals and the ICA-EBM algorithm has been used to separate mixed speech signals under clean conditions. The multi-run ICA or ICA-EBM algorithm can be applied to noisy speech signals to reduce the effect of environmental noise. The enhanced speech signals could then be used to improve forensic i-vector PLDA speaker verification performance under noisy conditions.
2.14 Chapter summary
This chapter presented an overview of speaker verification systems, and also de-
scribed the GMM UBM-based speaker verification systems. As a mismatch be-
tween enrolment and verification speech signals has significantly affected speaker
verification performance, several channel compensation techniques in feature and
model domains to compensate channel variation and additive noise were de-
scribed. This chapter also presented i-vector feature extraction techniques, and
standard channel compensation techniques, including NAP, WCCN, and LDA.
Subsequently, GPLDA, HTPLDA and the modern length-normalized GPLDA
techniques were also described in this chapter. The review of speaker recognition in adverse conditions identified five techniques to improve speaker recognition performance in the presence of noise and reverberation conditions: feature extraction based on multiband techniques, feature warping, the ICA algorithm, the i-vector PLDA speaker recognition system, and the deep neural network. Each technique has its
strengths and limitations, particularly in terms of the purpose of my research,
which is to improve the performance of speaker recognition systems in the pres-
ence of high levels of noise and reverberation conditions in forensic applications.
This research identifies two gaps and proposes research directions that will be
described in detail in Chapters 5 and 6:
1. Combining feature warping with MFCC and DWT-MFCC features to im-
prove forensic speaker verification performance in the presence of various
types of environmental noise, reverberation, as well as noisy and reverberant
conditions.
2. Introducing the multi-run ICA or ICA-EBM algorithm as front-end pro-
cessing of speech enhancement to separate the noise from the noisy speech
signals and improve forensic speaker verification performance in the pres-
ence of high levels of environmental noise and reverberant conditions.
Chapter 3
Noisy and reverberant speech
frameworks
3.1 Introduction
This chapter gives an overview of the AFVC and QUT-NOISE databases that
are used in the evaluation of speech enhancement algorithms and noisy speaker
verification systems under real forensic applications. The forensic audio record-
ings available from the AFVC database cannot be used to evaluate speech en-
hancement algorithms under real forensic scenarios because the AFVC database
contains clean speech signals only. Thus, the construction of noisy speech signals based on single and multiple microphones is designed in this chapter to simulate speech enhancement algorithms under real forensic scenarios. The
noisy speech signals are obtained by mixing clean speech signals from the AFVC
database with different levels and types of environmental noise from the QUT-
NOISE database using single and multiple microphones.
In most real forensic situations, the interview recordings from a suspect are often
recorded in a police interview office where reverberation is present. The surveil-
lance recordings from the criminal are usually recorded in an outdoor environment
with a single or multiple microphones in the presence of various types of envi-
ronmental noise. In order to simulate forensic speaker verification performance
under real-world situations, the noisy and reverberant frameworks based on the
single and multiple microphones are designed in this chapter.
The interview and surveillance recordings are similar to enrolment and verification
data in traditional speaker recognition systems: the interview recordings are treated as the enrolment because the suspect's identity has been confirmed, while the surveillance recordings are treated as the verification because the identity of the speaker is not known. Throughout the remainder of this thesis, the recordings used for
experiments will be referred to as interview/surveillance to avoid confusion caused
by the more typical enrolment/verification terminology which is not perfectly
applicable in this scenario.
This chapter is divided into several sections. The AFVC speech and QUT-NOISE
databases are presented in Sections 3.2 and 3.3, respectively. Section 3.4 describes
the construction of noisy speech signals. Noisy and reverberant frameworks are
presented in Section 3.5.
3.2 AFVC speech database
The AFVC database [99] contains 552 speakers recorded in three speaking
styles: informal telephone conversation, information exchange over telephone,
and pseudo-police. The telephone channel was used to record informal telephone
conversations and information exchange over the telephone. The microphone
channel was used to record the pseudo-police style. The sampling frequency of
the clean speech signals was 44.1 kHz with 16 bit/sample resolution [98]. This
database was designed for speaker recognition in forensic applications and was collected by the Forensic Voice Comparison Laboratory at the University of New South Wales in Australia. The AFVC database will be used in this thesis be-
cause it contains various speaking style recordings for each speaker, which are
usually found in forensic scenarios [98].
3.3 QUT-NOISE database
The QUT-NOISE database consists of 20 noise sessions [29]. Each session has at
least a 30-minute duration. QUT-NOISE was recorded in five popular noise sce-
narios (CAFE, HOME, CAR, STREET and REVERB). The sampling frequency
of the noise was 48 kHz with 16 bit/sample resolution. A brief description of each
noise scenario is given below:
1. CAFE
The CAFE noise was recorded in an outdoor cafe and indoors in a cafe
food court. The CAFE noise was recorded in the presence of high levels of
babble speech and kitchen noise from cafe environments.
2. HOME
This noise was recorded in two home locations: the kitchen and the living room. The kitchen recordings consist of silence interrupted by typical kitchen sounds. The living room noise was recorded in the presence of children singing and playing alongside a television.
3. STREET
The STREET noise was recorded in two locations: an inner-city and outer-
city. The inner-city recordings consist of pedestrian traffic and bird noise.
The outer-city recordings consist mainly of cycles of traffic noise and traffic
light changes.
4. CAR
The CAR noise was recorded in driving window-down and window-up condi-
tions. These recordings consist of car-interior noise such as bag-movement,
keys, and indicator, as well as the characteristics of the wind for car window-
down.
5. REVERB
The REVERB noise was recorded in two locations: an enclosed indoor
pool and a partially enclosed car park. The indoor pool is characterized by
splashing and running noise, while the car park environment is characterized
by nearby road noise.
For real forensic scenarios, the clean speech signals from existing speech databases
are often mixed with environmental noise at certain noise levels [5]. The limited duration of existing noise databases (typically less than five minutes), such as NOISEX-92 [138], freesound.org [35], and AURORA-2 [51], limits the ability to evaluate speaker verification systems over a wide range of environmental noise conditions in forensic situations. In contrast, the duration of each session in the QUT-NOISE database is at least 30 minutes [28]. The clean speech signals can be mixed with random sessions of environmental noise from the QUT-NOISE database to achieve a closer approximation to real forensic situations [5]. Therefore, the QUT-NOISE database will be used to evaluate forensic speaker verification under noisy conditions in this thesis.
3.4 Construction of noisy speech signals
This section describes the construction of noisy speech signals in single-channel
speech enhancement algorithms (wavelet threshold techniques and spectral sub-
traction) and in the multi-channel speech enhancement algorithm (ICA).
3.4.1 Noisy speech in single channel speech enhancement
In order to reduce the computational complexity of speech enhancement algo-
rithms, we used one sentence from 100 speakers of the AFVC database [99] to
evaluate the performance of single channel speech enhancement algorithms. CAR,
STREET and HOME noises from the QUT-NOISE database [29] were used in
the construction of noisy speech signals because these noises were more likely to occur in forensic applications. The CAR, STREET and HOME noises from the QUT-NOISE corpus were down-sampled from 48 kHz to 44.1 kHz before mixing with the clean speech signals from the AFVC database to match the sampling frequency of the clean speech. The noisy speech signals were obtained by sample summing of the clean speech signals and the scaled noise at signal-to-noise ratios (SNRs) ranging from -10 dB to 10 dB.
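As an illustration of this construction, the following minimal Python sketch (an assumption-laden illustration, not the exact scripts used in this thesis) down-samples a 48 kHz noise segment to 44.1 kHz with a polyphase resampler, scales the noise power to a desired SNR, and sums it with the clean speech; the two signal arrays are random stand-ins for an AFVC sentence and a QUT-NOISE segment.

```python
import numpy as np
from scipy.signal import resample_poly

def mix_at_snr(clean, noise_48k, snr_db):
    """Corrupt 44.1 kHz clean speech with a 48 kHz noise segment at snr_db."""
    noise = resample_poly(noise_48k, 147, 160)     # 48 kHz -> 44.1 kHz
    noise = noise[: len(clean)]
    p_speech = np.mean(clean ** 2)                 # average speech power
    p_noise = np.mean(noise ** 2)                  # average noise power
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise                    # sample summing

rng = np.random.default_rng(0)
clean = rng.standard_normal(44100)       # stand-in for one clean AFVC sentence
noise_48k = rng.standard_normal(48000)   # stand-in for a QUT-NOISE segment
noisy = mix_at_snr(clean, noise_48k, snr_db=-10)
```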
3.4.2 Noisy speech in multi-channel speech enhancement
In real forensic scenarios, the police agencies usually record speech from the crim-
inal using hidden microphones. Such forensic audio recordings are often mixed
with various types of environmental noise. The goal of construction of noisy
speech in multiple channels is to simulate multi-channel speech enhancement un-
der real-world situations.

Figure 3.1: Configuration of sources and microphones in instantaneous ICA mixtures.

In the construction of noisy speech in multiple channels,
one sentence from 100 speakers of the AFVC database was used to evaluate the
performance of speech enhancement based on the ICA algorithm. Each of the
clean speech signals from the AFVC database was corrupted by one session of
environmental noise (CAR, STREET and HOME noises) from the QUT-NOISE
database, resulting in a two-channel noisy speech signal at SNRs ranging from
-10 dB to 10 dB. The sampling frequency of the noise was down sampled from
48 kHz to 44.1 kHz before mixing with the clean speech signals from the AFVC
database to match the sampling frequency of the clean speech signals.
Figure 3.1 shows the configuration of sources (z(n) and e(n)) and microphones
(x1 and x2) in instantaneous ICA mixtures. The noisy speech signals recorded
by the microphones, x, can be modeled as follows:
$$x = As(n), \qquad (3.1)$$
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (3.2)$$
where z(n) is the clean speech signal, e(n) is the environmental noise, A is the
mixing matrix, and a11, a12, a21 and a22 are the parameters of the mixing matrix.
These parameters depend on the distance between the microphone and the source
Figure 3.2: Position of speech and noise signals to the microphones.
signals and the amplitude of the source signals is proportional to the inverse of the
distance from the source to the microphone. Thus, the inverse of each parameter
of the mixing matrix is proportional to the distance between each source and
microphone [132].
The first and second microphones (x1 and x2) have the same distance to the
clean speech source signal (d11 = d21), but the noise (e(n)) has half the distance
to the second microphone (x2) compared to the distance of the noise to the first
microphone (d22 = 0.5 d12), as shown in Figure 3.2. The relationship between dij
and aij can be expressed by the following equation:
$$d_{ij} = \frac{1}{a_{ij}}. \qquad (3.3)$$
The mixing matrix, A, used in the construction of noisy speech in multi-channel
speech enhancement is:
$$A = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 2.0 \end{bmatrix}. \qquad (3.4)$$
We chose the value of the mixing parameter a22 equal to 2 in order to compare the performance of speech enhancement based on the ICA algorithm under worst-case conditions with single channel speech enhancement algorithms.
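For illustration, the hedged Python sketch below builds the two-channel instantaneous mixture of Equations 3.1 and 3.2 using the worst-case mixing matrix of Equation 3.4; the arrays z and e are random stand-ins for a clean utterance and an SNR-scaled noise segment.

```python
import numpy as np

# Worst-case mixing matrix of Equation 3.4 (the noise is closest to microphone 2).
A = np.array([[1.0, 1.0],
              [1.0, 2.0]])

def instantaneous_mixture(z, e, A):
    """Return the two microphone signals x = A s with s = [z(n); e(n)] (Eq. 3.2)."""
    s = np.vstack([z, e])          # 2 x n source matrix
    return A @ s                   # 2 x n observations (rows: x1, x2)

rng = np.random.default_rng(0)
z = rng.standard_normal(44100)     # stand-in for one clean AFVC sentence
e = rng.standard_normal(44100)     # stand-in for an SNR-scaled noise segment
x = instantaneous_mixture(z, e, A)
```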
3.5 Noisy and reverberant frameworks
The forensic audio recordings available from the AFVC database [99] cannot be
used to evaluate the robustness of forensic speaker verification based on robust
feature extraction techniques and the ICA algorithms under noisy and reverber-
ant conditions, because this database contains only clean speech signals. In order
to evaluate the performance of forensic speaker verification systems in the pres-
ence of environmental noise and reverberation conditions, we designed noisy and
reverberant frameworks based on the single and multiple microphones.
In real forensic applications, long speech samples from a suspected speaker are
recorded in an interview scenario, while the surveillance recordings from the crim-
inal may be of very short duration. In order to design noisy and reverberant frameworks for simulating forensic speaker verification under real-world situations, the interview recordings for these frameworks were obtained from full-length utterances of 200 speakers using the pseudo-police style. The surveillance recordings were obtained from short-duration utterances (10 sec, 20 sec, and 40 sec) from
200 speakers using the informal telephone conversation style. The VAD algo-
rithm [130] was used to remove silent regions from the interview and surveillance
recordings. It was necessary to remove silent portions from the clean surveillance
recordings before adding the noise because the silence would artificially increase
the true short-term active speech SNR compared to that of the desired SNR [28].
The VAD was applied to the clean speech rather than the noisy speech signals because, in a forensic scenario, manual segmentation of speech activity or speech labelling can be performed when required [89].
single and multiple microphone adverse frameworks is provided in the following
sections.
3.5.1 Single microphone adverse framework
In order to evaluate the performance of forensic speaker verification based on the
robust feature extraction techniques in the presence of environmental noise and
reverberation conditions, a single microphone adverse framework is designed. This framework consists of three configurations, which are briefly described in the following sections.
3.5.1.1 Adding noise
The goal of adding noise from the QUT-NOISE database [29] to the clean speech
signals from the AFVC database [99] was to evaluate the robustness of single
microphone forensic speaker verification based on the length normalized GPLDA
under environmental noise conditions only. The interview recordings were kept
under clean conditions, while the surveillance recordings from the criminal were
corrupted by different types of environmental noise. The surveillance recordings
were mixed with environmental noise only because in most real forensic situations,
the surveillance data were often recorded in open areas in the presence of various
types of environmental noise, and the effect of reverberation conditions on the noisy surveillance recordings can be neglected [3]. A random session of STREET, CAR and HOME noises from the QUT-NOISE database was chosen and down-sampled from 48 kHz to 44.1 kHz to match the sampling frequency of the surveillance speech signals. These noises were used in this thesis because they were more likely to occur in real forensic situations.

Figure 3.3: Design of a single microphone adverse framework based on adding noise.

The average noise power was
scaled in relation to the reference surveillance speech signal after removing silent
regions according to the desired SNR. The noisy surveillance speech signals were
obtained by sample summing of the surveillance speech signal and the scaled environmental noise at SNRs ranging from -10 dB to 10 dB. The design of a single microphone adverse framework based on adding noise is shown in Figure 3.3.
3.5.1.2 Adding reverberation
The aim of adding reverberation to interview recordings was to investigate the
effect of different reverberation conditions on the forensic speaker verification
performance based on the robust feature extraction techniques. Training room impulse responses were computed for a fixed room of dimensions 3 × 4 × 2.5 m using the image source algorithm proposed by Lehmann and Johansson [78].
Table 3.1: Reverberation test room parameters.

Configuration   Suspect position (xs, ys, zs)   Microphone position (xm, ym, zm)
1               (2, 1, 1.3)                     (1.5, 1, 1.3)
2               (2, 1, 1.3)                     (2.4, 1, 1.3)
3               (2, 1, 1.3)                     (2.8, 1, 1.3)
4               (2, 1, 1.3)                     (2.8, 2.5, 1.3)
Figure 3.4: Position of suspect and microphones in a room. All microphones and the suspect are at 1.3 m height and the height of the room is 2.5 m.
Table 3.1 and Figure 3.4 show the reverberation room parameters and the positions of the suspect and microphones in the room. The symbols (xs, ys, zs) and (xm, ym, zm) in Table 3.1 represent the positions of the suspect and microphone in a virtual room, respectively. When adding reverberation, the position of the microphone was changed horizontally in most configurations, as shown in Table 3.1 and Figure 3.4. In most forensic scenarios, the police often put the microphone on a table in a police office to record speech from the suspect, and the position of the microphone could be changed horizontally on the table; varying the microphone position therefore allows the effect of the suspect/microphone distance on forensic speaker verification performance to be investigated under real-world situations. Reverberation is often
characterized by reverberation time (T20 or T60), which describes the amount of
time for the direct sound to decay by 20 dB or 60 dB, respectively [40, 79]. The
reverberation time (T20) was measured from the room impulse response.

Figure 3.5: Design of a single microphone adverse framework based on adding reverberation conditions.

The reason for using T20 instead of the more popular T60 is to reduce the computational time when computing the reverberation time in a series of simulated room im-
pulse responses [79]. Each of the interview recordings was convolved with the room impulse response to generate reverberated speech with the same duration as the interview recordings. The interview data are often recorded in the pres-
ence of reverberation conditions because the police usually record speech from
the suspect in an interview room where reverberation is present. The surveil-
lance speech signals were kept without reverberation because in most forensic
situations surveillance speech signals are often recorded in open areas [7].
The reverberation from the QUT-NOISE database is not used for the evaluation
of length-normalized GPLDA based speaker verification in forensic applications
for two main reasons. Firstly, the reverberation used in the QUT-NOISE database
was recorded in places which are impractical for most forensic situations, such as
a pool and car park, and is characterized by splashing, running and road noises.
However, in most forensic applications, the police record the interview data in
a police interview room where reverberation often occurs. Secondly, the effects
of some factors such as the position of the microphone relative to a suspect in
the real police interview room and reverberation time cannot be evaluated by
using the reverberation from the QUT-NOISE database. Thus, it is important
to generate reverberated speech signals using a virtual room to achieve a closer
approximation to real forensic situations. The design of a single microphone
adverse framework based on adding reverberation is shown in Figure 3.5.
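The two operations described in this section can be sketched in a few lines of Python. In the hedged fragment below, the exponentially decaying rir is a synthetic stand-in for an image-source response from [78] (not the actual responses used in the experiments); reverberation is applied by convolution, and T20 is estimated by Schroeder backward integration following the definition above (the time taken for a 20 dB decay).

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return fftconvolve(speech, rir)[: len(speech)]

def t20_from_rir(rir, fs):
    """Time for the Schroeder energy-decay curve to fall by 20 dB (the T20 above)."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]     # Schroeder backward integration
    decay_db = 10 * np.log10(energy / energy[0])
    return np.argmax(decay_db <= -20) / fs       # first sample 20 dB down

fs = 44100
rng = np.random.default_rng(0)
# Synthetic stand-in for a simulated response: exponentially decaying noise.
rir = np.exp(-np.arange(fs) / (0.05 * fs)) * rng.standard_normal(fs)
speech = rng.standard_normal(3 * fs)             # stand-in interview recording
reverberated = reverberate(speech, rir)
print(t20_from_rir(rir, fs))                     # ~0.1 s for this synthetic RIR
```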
3.5.1.3 Adding reverberation and noise
The goal of adding reverberation and noise to interview and surveillance record-
ings was to simulate the performance of single microphone length normalized
GPLDA-based forensic speaker verification systems under real-world situations.
Training room impulse responses were computed from the first configuration of
the room, as described in Table 3.1 using the image source algorithm [78]. Each of
the interview recordings was convolved with the room impulse response to generate reverberated speech with the same duration as the clean interview recordings.
In order to investigate the effect of the utterance duration on noisy speaker ver-
ification, the surveillance speech signals were extracted from random sessions of
10 sec, 20 sec, and 40 sec duration from 200 speakers, using the informal tele-
phone conversation style, after removing silent portions using the VAD algorithm
[130]. The surveillance recordings were corrupted with different segments of CAR,
STREET and HOME noises from the QUT-NOISE database [29] at various SNR
values ranging from -10 dB to 10 dB. The design of a single microphone adverse
framework based on adding reverberation and noise conditions is shown in Figure
3.6.
Figure 3.6: Design of a single microphone adverse framework based on adding reverberation and noise conditions.
3.5.2 Multiple microphones adverse framework
In order to evaluate the performance of forensic speaker verification based on the
ICA algorithms in the presence of environmental noise and reverberation con-
ditions, a multiple microphones adverse framework is designed comprising two
configurations. The first was a multiple microphone adverse framework based
on adding noise which combined noise from the QUT-NOISE database [29] with
surveillance recordings from the AFVC database using instantaneous ICA mix-
tures. The second was a multiple microphones adverse framework based on adding
reverberation and noise which combined noise from the QUT-NOISE database
with surveillance recordings from the AFVC database and interview recordings
from the AFVC database convolved with the impulse response of the room to gen-
erate reverberated speech signals. The goal of designing a multiple microphones adverse framework based on adding reverberation and noise conditions is to simulate forensic speaker verification based on the ICA algorithm under real-world situations: in most forensic scenarios, the police often record speech from the suspect in a room under reverberation conditions, and they often use hidden microphones to record the surveillance recordings from the criminal in public places in the presence of environmental noise. A brief description of the two configurations is provided in the following sections.
3.5.2.1 Adding noise
In most real forensic situations, the criminal may use a mobile phone to commit
criminal offences in public places. The police often use hidden microphones to
record the speech from the criminal in a public place and the forensic audio
recordings are often corrupted by various types of environmental noise. The effect of reverberation on noisy surveillance speech signals is not investigated because in most forensic situations noisy speech signals are often recorded in open areas. Thus, instantaneous ICA mixtures will be used in this thesis to model the noisy speech signals.
The objective of designing the multiple microphones adverse framework based
on adding noise was to evaluate the robustness of forensic speaker verification
based on the ICA algorithms under environmental noise conditions only. The full
duration of interview recordings was obtained from 200 speakers using the pseudo-
police style and the interview recordings were kept under clean conditions. The
surveillance recordings were obtained from short segments (10 sec, 20 sec, and 40
sec) from 200 speakers using the informal telephone conversation style. The CAR, STREET and HOME noises from the QUT-NOISE database [29] were down-sampled from 48 kHz to 44.1 kHz before mixing with the clean surveillance speech signal. The noisy speech signal in each microphone was obtained by sample summing of the surveillance signal and the scaled environmental noise at SNRs ranging from -10 dB to 10 dB.

Figure 3.7: Position of speech and noise signals to the microphones.

The noisy speech signals recorded by the microphones, x, can be modeled as follows:
$$x = As(n), \qquad (3.5)$$
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 0.6 \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (3.6)$$
where z(n) is the clean speech signal, e(n) is the environmental noise, and A is
the mixing matrix. As the parameters of the mixing matrix are based on the con-
figuration of the sources and the microphones, the amplitude of the source signal
is proportional to the inverse of the distance from the source to the microphone.
Thus, the inverse of each parameter of the mixing matrix is proportional to the
distance between each source and microphone [132].
In the design of the multiple microphone adverse framework based on adding
noise, the first and second microphones (x1 and x2) are at the same distance from the clean speech source signal (d11 = d21), but the distance from the noise to the second microphone is 1.66 times its distance to the first microphone (d22 = 1.66 d12), as shown in Figure 3.7.

Figure 3.8: Design of a multiple microphones adverse framework based on adding noise.
In most real forensic scenarios, the police often record surveillance speech signal
from the suspect using hidden microphones. The distance between the microphones and the surveillance speech source should be less than or equal to the distance between the microphones and the noise source, so that the recorded noisy speech is less affected by the environmental noise. These noisy surveillance speech signals can be
used as the input signals to speaker verification systems based on the ICA algo-
rithm in real forensic applications. Therefore, the values of the mixing matrix
are chosen according to Equation 3.6.
Figure 3.8 shows the design of the multiple microphones adverse framework based on adding noise.
3.5.2.2 Adding reverberation and noise
In real forensic applications, the interview data are often recorded in a police
room under controlled conditions where reverberation often occurs, while the
surveillance speech signals from the criminals may be recorded using hidden mi-
crophones in a public place in the presence of environmental noise. Thus, the aim
of designing the multiple microphones adverse framework based on adding rever-
beration and noise was to evaluate the performance of length-normalized GPLDA
forensic speaker verification based on the ICA algorithms in closer approximation
to real forensic situations.
Training room impulse responses were computed from the first configuration of
the room, as described in Table 3.1, using the image source algorithm proposed
by Lehmann and Johansson [78]. Each interview recording was convolved with
the impulse response of the room to generate the reverberated speech with the
same duration as the clean interview speech signal.
In order to investigate the effect of the duration of utterance on noisy speaker ver-
ification, the surveillance recordings were extracted as short segments (10 sec, 20 sec, and 40 sec) from 200 speakers using the informal telephone conversation style after removing silent portions using the VAD algorithm [130]. The
surveillance recordings were corrupted with different segments of CAR, STREET
and HOME noises from the QUT-NOISE database [29], resulting in a two-channel
noisy speech signal at SNRs ranging from -10 dB to 10 dB, according to Equation
3.6. The design of a multiple microphones adverse framework based on adding
reverberation and noise conditions is shown in Figure 3.9.
Figure 3.9: Design of a multiple microphones adverse framework based on adding reverberation and noise conditions.
3.6 Chapter summary
This chapter presented the AFVC and QUT-NOISE databases, which are used to evaluate speech enhancement algorithms and forensic speaker verification performance under noisy and reverberant conditions. As the AFVC database contains
clean speech signals only, the forensic audio recordings available from the AFVC
database cannot be used to evaluate the speech enhancement algorithms. There-
fore, the construction of noisy speech signals based on single and multiple channels
was designed. The construction of noisy speech signals can be used to simulate
speech enhancement algorithms under real forensic scenarios in Chapter 4. In
real forensic applications, the police usually record speech from the suspect in a
room where reverberation is present and surveillance recordings from the crim-
inal could be recorded using single or multiple microphones by police in open
areas in the presence of various types of environmental noise. Thus, the noisy and reverberant frameworks developed in this chapter will be used to simulate
forensic speaker verification performance based on the robust feature extraction
techniques and the ICA algorithms under real-world situations in Chapters 5 and
6.
Chapter 4
Forensic speech enhancement
algorithms
4.1 Introduction
In real forensic situations, a criminal may use a mobile phone in connection with a criminal offence in a car, street or public place [89]. Such forensic audio
recordings are often mixed with different types of noise such as car, street and
home noises. It is hard to use these audio recordings directly as part of preparing
legal evidence for court because the quality of these recordings is often poor
[4]. Therefore, speech enhancement approaches play an essential role in such
real forensic situations [129]. Forensic speaker verification performance can be improved with some form of speech enhancement [36].
Speech enhancement approaches can be classified into single channel and multi-
channel approaches depending on the number of microphones that are used to
record the noisy speech signal. A number of single channel speech enhancement
approaches have been proposed in previous studies, such as spectral subtraction
[13, 15] and wavelet threshold techniques [33].
Multi-channel speech enhancement approaches can be used to remove the noise
from the noisy speech signals [104] and achieve better performance compared to
single channel speech enhancement approaches [118]. ICA is widely used as a multi-channel speech enhancement technique [84, 85, 124]. The ICA algorithm is based
on transforming the noisy speech signals into components that are statistically
independent to separate the clean speech from the noisy speech signals. The
estimation of the source signal in the ICA algorithm is based on maximizing the
non-Gaussian distribution of one independent component. The difference between
a Gaussian distribution and the distribution of the independent component is
measured using different contrast functions, such as kurtosis, negentropy, and
approximation of negentropy, which is maximized by the ICA algorithm [55].
The objective of using the ICA algorithm is to separate the clean speech from
the noisy speech signals. The enhanced speech signals from the ICA algorithm
can be used to improve forensic speaker verification performance in the presence
of various types and levels of environmental noise. This chapter focuses on using ICA as a multi-channel speech enhancement algorithm because the distributions of environmental noise and speech signals are non-Gaussian, and the ICA algorithm is a robust technique for separating source signals with non-Gaussian distributions. Figures 4.1 and 4.2 show the histograms of the
clean speech from the AFVC database and STREET noise from the QUT-NOISE
database, respectively.
This chapter presents a brief description of the fundamental concept of ICA. Sim-
ulation results for single-channel and multi-channel forensic speech enhancement
algorithms are also presented in this chapter.
Figure 4.1: Histogram of clean speech from the AFVC database.
Figure 4.2: Histogram of STREET noise from the QUT-NOISE database.
4.2 Independent component analysis
ICA is a statistical approach that is used widely to solve the problem of blind
source separation [55]. In the general model of ICA, the source signals are mixed
through a linear basis transformation. Suppose there are N independent source
signals s(t) = {s1(t), s2(t), · · · , sN(t)} and these source signals are observed by
M microphones x(t) = {x1(t), x2(t), · · · , xM(t)}. ICA assumes the number of the
observed signals recorded by the microphones equals the number of the source
signals (M = N) [25]. The fundamental aspect of the mixing process is that each
microphone records a different mixture of the source signals, and the parameters
of the mixing matrix A are unknown. The observed signals, x, can be represented
as:
x = As. (4.1)
The aim of the ICA algorithm is to estimate the original source signals from the
observed signals when both source signals and mixing matrix are unknown [55].
The whole problem resembles the task a human listener solves at a cocktail party, where, using two ears, the brain focuses on a specific sound and suppresses all other sources in the room [54, 77]. The estimate of the source signals in the
ICA algorithm, $\hat{s}$, can be obtained by
$$\hat{s} = Wx, \qquad (4.2)$$
where W is the unmixing matrix. The unmixing matrix can be defined as:
$$W = A^{-1}, \qquad (4.3)$$
where $A^{-1}$ is the inverse of the mixing matrix.
Figure 4.3 shows the processes of mixing and unmixing in blind source separation.
The source signals s are mixed by the mixing matrix A. The estimated source signals ŝ can be obtained by the unmixing matrix W.
Figure 4.3: Mixing and unmixing processes in blind source separation. s are the source signals, x are the observation signals, ŝ are the estimated source signals, A is the mixing matrix and W is the unmixing matrix.
4.2.1 Statistical independence
The principal concept of ICA is based on statistical independence. To simplify
the concept of independence, the two random variables s1 and s2 are said to be
independent if information on the value of s1 does not provide any information
on the value of s2, and vice versa.
4.2.1.1 Independence
Independence can be defined in terms of the probability density function (PDF)
of the signals. Let the joint PDF of s1 and s2 be denoted by p(s1, s2) and the
marginal PDF of s1 and s2 be represented as p(s1) and p(s2), respectively. The
two random variables of s1 and s2 can be defined as independent if the joint
probability density function is factorized by the following equation:
p(s1, s2) = p(s1)p(s2). (4.4)
Independence can equivalently be characterized in terms of expectations: for any two functions h1 and h2,
E{h1(s1)h2(s2)} = E{h1(s1)}E{h2(s2)}, (4.5)
where E{.} is the expectation operator and Equation 4.5 is used to describe the
relationship between uncorrelatedness and independence [55, 101].
4.2.1.2 Uncorrelatedness and independence
Two random variables are defined as uncorrelated if their covariance c(s1, s2) is
zero
c(s1, s2) = E{(s1s2)} − E{s1}E{s2} = 0. (4.6)
Equation 4.6 can be seen to be identical to Equation 4.5 when h1(s1) = s1 and h2(s2) = s2. Thus, independent variables are always uncorrelated, although uncorrelatedness does not imply independence. It is clear from the above discussion that independence is a stronger condition than uncorrelatedness, and many ICA algorithms exploit independence in the source estimation procedure. Nevertheless, enforcing uncorrelatedness is useful to reduce the number of free parameters and simplify the computation of the ICA algorithm [54].
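This distinction can be checked numerically. In the short Python sketch below (an illustration with synthetic data, not part of the thesis experiments), s2 = s1^2 is uncorrelated with a symmetric s1 in the sense of Equation 4.6, yet the factorization of Equation 4.5 fails for h1(s) = h2(s) = s^2, revealing the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 100000)   # zero-mean, symmetric random variable
s2 = s1 ** 2                      # fully determined by s1, hence dependent

# Equation 4.6: the covariance is ~0, so s1 and s2 are uncorrelated.
cov = np.mean(s1 * s2) - np.mean(s1) * np.mean(s2)
print(round(cov, 4))

# Equation 4.5 with h1(s) = h2(s) = s^2 fails, exposing the dependence:
lhs = np.mean(s1 ** 2 * s2 ** 2)              # E{h1(s1) h2(s2)} ~ 1/7
rhs = np.mean(s1 ** 2) * np.mean(s2 ** 2)     # E{h1(s1)} E{h2(s2)} ~ 1/15
print(round(lhs, 3), round(rhs, 3))
```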
4.2.1.3 Non-Gaussianity and independence
The classical central limit theorem states that the distribution of the sum of inde-
pendent signals tends towards a Gaussian distribution under certain conditions.
Therefore, the sum of two independent source signals usually has a distribution closer to Gaussian than either individual source signal. Thus, a distribution close to Gaussian can be obtained by a linear combination of many independent source signals. This also illustrates that the separation of independent signals from the mixture can be obtained by finding a transformation that yields a non-Gaussian distribution [55].
Non-Gaussianity is an essential principle in the estimation of the source signals
in the ICA algorithm. There are various measurements of non-Gaussianity of a
signal used in the ICA algorithm, such as kurtosis and negentropy [55]. A brief
description of these measurements will be given in the next sections.
1. Kurtosis
The classical measure of non-Gaussianity in the ICA algorithm is the kurtosis, or fourth order cumulant [55]. The kurtosis of the random variable s, kurt(s), can be defined as:
$$\mathrm{kurt}(s) = E\{s^4\} - 3(E\{s^2\})^2. \qquad (4.7)$$
The computation of the kurtosis can be simplified by assuming the random variable has zero mean and unit variance ($E\{s^2\} = 1$). Hence Equation 4.7 simplifies to:
$$\mathrm{kurt}(s) = E\{s^4\} - 3. \qquad (4.8)$$
Equation 4.8 clarifies that the kurtosis is a normalized version of the fourth order moment $E\{s^4\}$. The value of kurtosis is zero for Gaussian signals and nonzero for most non-Gaussian signals. Random variables that have positive kurtosis are called super-Gaussian or leptokurtic, and those with negative kurtosis are called sub-Gaussian or platykurtic. The non-Gaussianity of the signal can be measured by using the absolute value of the kurtosis or the square of the kurtosis [55].
Kurtosis is used widely to measure the non-Gaussian distribution in the ICA
algorithm because of its simplicity, both computationally and theoretically.
Computationally, the kurtosis can be estimated simply by computing the
fourth moment of the sample data. Theoretically, the kurtosis has a linear
property. If s1 and s2 are two independent random variables, the kurtosis satisfies:
$$\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2), \qquad (4.9)$$
and
$$\mathrm{kurt}(\alpha s_1) = \alpha^4\, \mathrm{kurt}(s_1), \qquad (4.10)$$
where $\alpha$ is a constant.
The main drawback of using kurtosis to measure the non-Gaussianity of signals is that it is very sensitive to outliers: its value can be dominated by a few observations in the tails of the distribution, which means that its statistical significance is poor [52]. Thus, kurtosis is not a robust measure of non-Gaussianity in ICA, and it is necessary to use better measures than kurtosis (a numerical sketch of the kurtosis measure is given after this list).
2. Negentropy
Negentropy is also used to measure the non-Gaussian distribution of the
independent component and is based on the information theoretic quantity
of differential entropy [55]. Entropy is a measurement of the randomness of
the signals and can be defined as:
$$H(s) = -\sum_i P(s = a_i)\log P(s = a_i), \qquad (4.11)$$
where $a_i$ are the possible values of s. The entropy can be generalized to a continuous-valued random variable s, giving the differential entropy, which is defined as:
$$H(s) = -\int P(s)\log P(s)\, ds. \qquad (4.12)$$
A fundamental result of information theory is that the Gaussian random variable has the greatest entropy among all random variables of equal variance [92, 105]. The value of entropy is small for distributions which are concentrated on certain values or have a "spiky" PDF.
A modified version of differential entropy is called negentropy J , and it can
be defined as:
$$J(s) = H(s_{\mathrm{gauss}}) - H(s), \qquad (4.13)$$
where $s_{\mathrm{gauss}}$ is a Gaussian random variable with the same covariance matrix as s. According to Equation 4.13, the value of negentropy is always non-negative and is zero only if the distribution of the random variable is Gaussian [25]. The advantage of negentropy is that it is, in a well-defined sense, an optimal measure of non-Gaussianity. However, the computation of negentropy is very difficult. Thus, a simple approximation of negentropy must be used in practice [55].
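The numerical sketch of the kurtosis measure referred to above is given here (synthetic samples, not thesis data): the measure of Equation 4.8 is near zero for Gaussian samples, positive for Laplacian samples (a common model for speech amplitudes), and negative for uniform samples.

```python
import numpy as np

def kurt(s):
    """Kurtosis of Equation 4.8 after enforcing zero mean and unit variance."""
    s = (s - s.mean()) / s.std()
    return np.mean(s ** 4) - 3.0

rng = np.random.default_rng(0)
n = 100000
print(round(kurt(rng.standard_normal(n)), 2))   # ~ 0    (Gaussian)
print(round(kurt(rng.laplace(size=n)), 2))      # ~ +3   (super-Gaussian)
print(round(kurt(rng.uniform(-1, 1, n)), 2))    # ~ -1.2 (sub-Gaussian)
```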
4.3 ICA assumption
The ICA algorithm requires some assumptions on the source signals and mixing
matrix [54]. The assumptions and signal processing properties are described
below
1. The source signals are statistically independent
Statistical independence is the fundamental key assumption to ICA, and
it enables estimation of the source signals s from the mixed signals, x, as
discussed in Section 4.2.1.
2. The independent components have non-Gaussian distribution
This assumption is important in an ICA algorithm because of the close relationship between Gaussianity and independence. In the ICA algorithm, it is impossible to separate Gaussian source signals, because the sum of two or more Gaussian source signals also has a Gaussian distribution; the sum of two Gaussian sources therefore cannot be distinguished from a single Gaussian source. Consequently, Gaussian sources are not permitted (at most one source may be Gaussian).
3. The mixing matrix is invertible
The mixing process is assumed to be a linear transformation. The linear
assumption is important to simplify the estimation of the source signals. It
is necessary to assume the mixing matrix is a square matrix and the number
of the source signals equals the number of the mixed signals. If the mixing
matrix is not invertible, the unmixing matrix does not exist to separate the
source from the mixed signals.
4.4 ICA ambiguity
There are two main ambiguities in the ICA algorithm: the magnitude and scaling
ambiguity, and the permutation ambiguity [55, 100].
4.4.1 Magnitude and scaling ambiguity
The true energy (variance) of the independent components cannot be determined, because both the mixing matrix A and the source signals s are unknown. To explain this ambiguity,
the mixing process in Equation 4.1 can be rewritten as:
$$x = \sum_{j=1}^{N} a_j s_j, \qquad (4.14)$$
where $a_j$ represents the jth column vector of the mixing matrix. Any scalar multiplication $\alpha_j$ of one of the source signals can be cancelled by dividing the corresponding column $a_j$ by the same scalar. Equation 4.14 can thus be rewritten as:
$$x = \sum_{j=1}^{N} \left(\frac{1}{\alpha_j} a_j\right)(\alpha_j s_j). \qquad (4.15)$$
This ambiguity is not important for most applications, and the solution for this
ambiguity is to assume that each source signal has unit variance. Furthermore,
the sign of the estimated source signals cannot be determined and the estimated
source signal could be multiplied by -1 to solve this issue without affecting the
ICA model.
4.4.2 Permutation ambiguity
The order of the estimated source signals cannot be determined in the ICA algo-
rithm. Formally, the permutation matrix P and its inverse can be introduced in
Equation 4.1
$$x = AP^{-1}Ps, \qquad (4.16)$$
where the elements of $Ps$ are the original source signals in permuted order and $AP^{-1}$ is the new unknown mixing matrix to be solved by the ICA algorithm.
In most applications, the permutation ambiguity is not a serious problem; in the cocktail party setting, for example, a listener is able to identify the different estimated source signals and judge the quality of the separation by listening to the sounds [100].
4.5 Pre-processing for ICA
Pre-processing is a very useful technique to reduce the complexity of computation
for ICA. This method is applied before using the fast ICA algorithm and can be
divided into two stages: centering and whitening [55].
4.5.1 Centering
The observed signal x is centered by removing the mean $m = E\{x\}$ from the signal. Centering can be represented by:
$$x_c = x - m, \qquad (4.17)$$
where $x_c$ is the centered observed signal.
This step simplifies the ICA algorithm by removing the mean from the observed signals. The unmixing matrix can be estimated using the centered data. The ICA estimate is not affected by removing the mean, because the mean is added back to the centered observation signal after the mixing matrix has been computed, as shown in the following equation:
$$\hat{s} = A^{-1}(x_c + m). \qquad (4.18)$$
4.5.2 Whitening
Whitening is the process of transforming the observed signal (x) linearly into
components which are uncorrelated and the covariance matrix of the whitened
signal, $E\{x_w x_w^T\}$, equals the identity matrix:
$$E\{x_w x_w^T\} = I, \qquad (4.19)$$
where $x_w$ is the whitened observed signal.
The whitening transformation can be performed using the eigenvalue decomposition. The covariance matrix of the observed signal, $E\{xx^T\}$, can be decomposed as:
$$E\{xx^T\} = VDV^T, \qquad (4.20)$$
where V is the eigenvector matrix of the covariance matrix and D is the diagonal matrix of the eigenvalues, $D = \mathrm{diag}\{\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$. The observed signal can be whitened by
$$x_w = VD^{-1/2}V^T x, \qquad (4.21)$$
where $D^{-1/2} = \mathrm{diag}\{\lambda_1^{-1/2}, \lambda_2^{-1/2}, \cdots, \lambda_n^{-1/2}\}$. The whitening approach transforms the mixing matrix into a new mixing matrix, which is orthogonal:
$$x_w = VD^{-1/2}V^T A s = A_w s, \qquad (4.22)$$
hence,
$$E\{x_w x_w^T\} = A_w E\{ss^T\} A_w^T = A_w A_w^T = I. \qquad (4.23)$$
The whitening transformation reduces the cost of the ICA computation by reducing the free parameters of the mixing matrix from the $n^2$ elements of an arbitrary matrix to the $n(n-1)/2$ degrees of freedom of an orthogonal matrix.
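A minimal Python sketch of the centering and whitening steps of Equations 4.17 to 4.21 is given below; the two-channel mixture is a synthetic stand-in, and the covariance of the whitened output can be checked against Equation 4.19.

```python
import numpy as np

def center(x):
    """Remove the mean of each observed signal (Eq. 4.17)."""
    m = x.mean(axis=1, keepdims=True)
    return x - m, m

def whiten(xc):
    """Whiten the centered data via eigenvalue decomposition (Eqs. 4.20-4.21)."""
    cov = (xc @ xc.T) / xc.shape[1]            # sample estimate of E{x x^T}
    d, V = np.linalg.eigh(cov)                 # cov = V D V^T
    return V @ np.diag(d ** -0.5) @ V.T @ xc   # x_w = V D^{-1/2} V^T x

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, (2, 50000))             # two independent stand-in sources
x = np.array([[1.0, 1.0], [1.0, 2.0]]) @ s     # hypothetical mixed observations
xc, m = center(x)
xw = whiten(xc)
print(np.round((xw @ xw.T) / xw.shape[1], 3))  # ~ identity matrix (Eq. 4.19)
```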
4.6 Fast ICA
Fast ICA is a fixed-point ICA algorithm that uses high order statistics to estimate
the source signals. Fast ICA can estimate the source signals one by one (deflation approach) or simultaneously (symmetric approach) [53, 55]. Fast ICA is based on maximizing the approximation of negentropy J(w) to estimate one independent component:
$$J(w) = [E\{G(w^T x)\} - E\{G(v)\}]^2, \qquad (4.24)$$
where v is a standard Gaussian random variable and G is a contrast function.
In practical terms, it is necessary to choose a contrast function that does not grow
too fast, is simple to compute and more robust to outliers than kurtosis. The
following choice of contrast functions has proven very useful in the ICA algorithm
[53]:
$$G_1(s) = \frac{1}{a_1} \log \cosh(a_1 s), \qquad (4.25)$$
$$G_2(s) = -\frac{1}{a_2} \exp\left(-\frac{a_2 s^2}{2}\right), \qquad (4.26)$$
$$G_3(s) = \frac{1}{4} s^4, \qquad (4.27)$$
where $1 \leq a_1 \leq 2$ and $a_2 \approx 1$ are constants.
Fast ICA for one and several units will be described briefly in the next section.
4.6.1 Fast ICA for one unit
Fast ICA for one unit is a simple algorithm to estimate one row vector of the unmixing matrix W by finding the maximum non-Gaussianity of one independent component [53]. There are four steps in the one-unit ICA algorithm:
1. Select an initial guess for w.
2. Estimate $w^+ = E\{x\, g(w^T x)\} - E\{g'(w^T x)\}\, w$, where $w^+$ is the new row vector of the unmixing matrix, $E\{\cdot\}$ is the expectation operator, and g and g′ are the first and second derivatives of the contrast function G(·), respectively.
3. Normalize the row vector $w^+$:
$$w^* = \frac{w^+}{\|w^+\|}. \qquad (4.28)$$
4. Go back to step 2 if not converged.
The criterion of convergence is that the previous and new values of w must point in the same direction, i.e. the absolute value of their dot product is almost equal to one.
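The four steps above translate directly into the hedged Python sketch below, which applies one-unit fast ICA to whitened data xw using the log-cosh contrast of Equation 4.25 with a1 = 1 (so that g = tanh and g' = 1 - tanh^2); it is an illustration, not the implementation used in the experiments of this thesis.

```python
import numpy as np

def fastica_one_unit(xw, max_iter=200, tol=1e-6, seed=0):
    """One-unit fast ICA on whitened data xw (shape: n_channels x n_samples)."""
    n = xw.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)                      # step 1: initial guess
    for _ in range(max_iter):
        wx = w @ xw                             # projections w^T x
        g = np.tanh(wx)                         # g for the log-cosh contrast
        g_prime = 1.0 - g ** 2                  # g' = 1 - tanh^2
        w_new = (xw * g).mean(axis=1) - g_prime.mean() * w   # step 2
        w_new /= np.linalg.norm(w_new)          # step 3: normalization
        if abs(w_new @ w) > 1.0 - tol:          # step 4: sign-invariant convergence
            return w_new
        w = w_new
    return w
```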
4.6.2 Fast ICA for several units
The one-unit fast ICA algorithm described in the previous section is used to
estimate one independent component only. It is necessary to run the fast ICA
algorithm n times to estimate all source signals [53]. To prevent different row vectors of the unmixing matrix from converging to the same maxima, decorrelation of the outputs $w_1^T x, w_2^T x, w_3^T x, \cdots, w_n^T x$ must be performed after every iteration.
The deflation method [53] is a simple method to achieve decorrelation. This method is based on a Gram-Schmidt-like decorrelation and estimates the independent components one by one. When p independent components have been estimated, the fast ICA algorithm is run for $w_{p+1}$; after each iteration step, the "projections" $(w_{p+1}^T w_j) w_j$, $j = 1, \cdots, p$, onto the previously estimated p vectors are subtracted from $w_{p+1}$, and $w_{p+1}$ is then renormalized:
$$w_{p+1} = w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T w_j) w_j, \qquad (4.29)$$
$$w_{p+1} = \frac{w_{p+1}}{\sqrt{w_{p+1}^T w_{p+1}}}. \qquad (4.30)$$
The symmetric approach is used to estimate all independent source signals at the same time. Every row vector of the unmixing matrix is decorrelated and normalized simultaneously according to the following equation:
$$W = (WW^T)^{-\frac{1}{2}} W, \qquad (4.31)$$
where $W = \{w_1, w_2, \cdots, w_n\}$ and $(WW^T)^{-\frac{1}{2}}$ can be obtained from the eigenvalue decomposition $WW^T = FDF^T$ as $(WW^T)^{-\frac{1}{2}} = FD^{-\frac{1}{2}}F^T$.
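A corresponding hedged sketch of the deflation approach of Equations 4.29 and 4.30 is shown below: each new row vector is estimated with the same tanh-based one-unit update and is orthogonalized against the rows already found before renormalization.

```python
import numpy as np

def fastica_deflation(xw, n_components, n_iter=200, seed=0):
    """Estimate n_components rows of the unmixing matrix W from whitened data xw."""
    rng = np.random.default_rng(seed)
    n = xw.shape[0]
    W = np.zeros((n_components, n))
    for p in range(n_components):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            g = np.tanh(w @ xw)
            w = (xw * g).mean(axis=1) - (1.0 - g ** 2).mean() * w  # one-unit update
            w -= W[:p].T @ (W[:p] @ w)     # Eq. 4.29: subtract projections
            w /= np.linalg.norm(w)         # Eq. 4.30: renormalize
        W[p] = w
    return W @ xw                          # estimated (whitened-domain) sources
```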
4.7 Simple illustration of ICA
The concept of ICA can be clarified with a simple example of the separation
of the speech signal from street noise. Statistical independence in ICA is also
illustrated in this section. The results presented below were obtained using the
fast ICA algorithm [55].
Figure 4.4: Original speech and street noise.
4.7.1 Separation of speech from street noise
A clean speech signal from the AFVC database, sampled at 44.1 kHz with 16 bit/sample resolution, and street noise from the QUT-NOISE database are shown in Figure 4.4. The street noise was down-sampled from 48 kHz to 44.1 kHz to match the sampling frequency of the clean speech signal. The speech signal was mixed with street noise at 0 dB input SNR at the first microphone x1. The mixing process can be represented by the following
equation:
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}, \qquad (4.32)$$
where s1 and s2 are the speech and street noise, respectively.
The resulting signals from this mixing are shown in Figure 4.5. Finally, the mixed speech signals were separated using the fast ICA algorithm.

Figure 4.5: Mixed speech with street noise.

Figure 4.6: Estimated speech and street noise using the fast ICA algorithm.

The contrast function used in fast ICA is the Gaussian function, which can be represented by
$$G(s) = -\exp(-s^2/2). \qquad (4.33)$$
The estimation of speech and street noise signals using the ICA algorithm is
shown in Figure 4.6. By comparing Figure 4.6 to Figure 4.4, it is clear that
4.7 Simple illustration of ICA 95
the clean speech and street noise have been estimated accurately without any
knowledge of the source signals and the mixing matrix.
This example also illustrates the scaling ambiguity discussed in Section 4.4.1: the amplitudes of the original and estimated speech signals differ because of this ambiguity.
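A minimal sketch of this separation experiment is given below, assuming scikit-learn's FastICA implementation and synthetic stand-in sources; in the thesis, the actual speech and noise signals come from the AFVC and QUT-NOISE databases.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 100_000
s1 = np.sin(2 * np.pi * 0.01 * np.arange(n))   # stand-in for clean speech
s2 = rng.laplace(size=n)                       # stand-in for street noise
S = np.c_[s1, s2]

A = np.array([[1.0, 1.0],                      # mixing matrix of Eq. (4.32)
              [1.0, 2.0]])
X = S @ A.T                                    # observed microphone mixtures

# fun='exp' selects the Gaussian contrast function G(s) = -exp(-s^2/2)
ica = FastICA(n_components=2, fun='exp', whiten='unit-variance',
              random_state=0)
S_hat = ica.fit_transform(X)                   # estimated sources
```

The recovered columns of S_hat match s1 and s2 only up to permutation, scale and sign, which is exactly the ambiguity visible when comparing Figures 4.4 and 4.6.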
4.7.2 Illustrative example of statistical independence in ICA
The previous example provided a simple demonstration of how ICA is used to separate the speech from the street noise signals. However, this example did not
give any insight into the mechanism of the ICA algorithm and its close relationship
with statistical independence. In this example, the statistics of the ICA algorithm
are described more clearly. Let two uniform random variables be mixed using the
following mixing process:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}. \qquad (4.34)$$
Figures 4.7 and 4.8 show the scatter-plot for the original source signals s1 and s2
and the scatter-plot of the mixture, respectively. It is clear from Figure 4.8 that
the two random variables are statistically dependent. For example, if x2 = 200,
then x1 can be determined. Whitening is a preprocessing step that is generally
performed before the ICA algorithm. The joint probability distribution obtained
from the whitened signals is shown in Figure 4.9, and it is observed that the distributions of the two random variables are uniform and independent. The statistical independence can be confirmed, as the value of each random variable is not determined by the other random variable.
Figure 4.7: Original sources.
Figure 4.8: Mixture of the source signals.
The uniform distribution of the two random variables in Figure 4.10 takes values ranging from 0 to 3.5. However, the range of the original source signals is not known because of the scaling ambiguity of the ICA algorithm. Comparing the whitened signals in Figure 4.9 with Figure 4.10 shows that pre-whitening reduces the dimensionality of the ICA search by finding
Figure 4.9: Joint density of the whitened signals obtained from whitening the mixed sources.
Figure 4.10: Estimation of the source signals using the ICA algorithm.
a suitable rotation to yield independence, and it simplifies the estimation to an orthogonal transformation, which in two dimensions needs only one parameter (the rotation angle).
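The whitening step itself can be sketched as follows (a numpy illustration under the same mixing as Equation 4.34; not the thesis code):

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean observations X (channels x samples): z = V x with
    V = D^(-1/2) E^T from the eigendecomposition of the covariance of X."""
    Xc = X - X.mean(axis=1, keepdims=True)   # centring
    d, E = np.linalg.eigh(np.cov(Xc))        # covariance eigendecomposition
    return np.diag(d ** -0.5) @ E.T @ Xc     # output has identity covariance

# Two uniform sources mixed as in Eq. (4.34)
rng = np.random.default_rng(0)
S = rng.uniform(0, 100, size=(2, 10_000))
A = np.array([[1.0, -1.0],
              [1.0,  2.0]])
Z = whiten(A @ S)   # a scatter-plot of Z is a rotated square, as in Figure 4.9
```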
Figure 4.11: Comparison of speech enhancement algorithms.
4.8 Methodology
Experiments were conducted to evaluate the performance of speech enhancement
based on the ICA algorithm with single channel speech enhancement algorithms
(wavelet threshold technique and spectral subtraction). The comparison of speech
enhancement algorithms consists of the following steps, which are shown in Figure
4.11 and described in the next sections.
4.8.1 Noisy speech signals
The construction of noisy speech signals for the single and multiple channel speech enhancement algorithms was described in Section 3.4 in Chapter 3.
4.8.2 Speech enhancement algorithms
This section presents a brief description of the single channel speech enhancement algorithms (wavelet threshold techniques and spectral subtraction) and the multi-channel speech enhancement algorithm (ICA) that are used in the simulations to remove various types of environmental noise from the noisy forensic speech signals.
4.8.2.1 Discrete wavelet transform
The wavelet transform is a technique for analyzing speech signals. It was introduced to overcome the fixed time-frequency resolution of the short time Fourier transform (STFT) [136]. The wavelet transform uses an adaptive window that provides low frequency resolution in high-frequency bands and high frequency resolution in low-frequency bands, whereas the STFT uses a fixed window size for all frequency sub bands. In that respect, the wavelet transform is similar to the human auditory system, which exhibits similar time-frequency resolution properties [136].
The DWT is a type of wavelet transform that can be defined by

$$W(j,k) = \sum_{n} x(n)\, 2^{-j/2}\, \psi\!\left(2^{-j}n - k\right), \qquad (4.35)$$

where $\psi$ is a time function with fast decay and finite energy called the mother wavelet, $j$ is the decomposition level, $x(n)$ is the speech sample, and $n$ and $k$ are integer time and translation indices.
Figure 4.12: Daubechies 8 wavelet function.
The DWT can be performed using a pyramidal algorithm [88].
Various families of the wavelet transform have been used to decompose signals
such as biorthogonal, coiflets, symmlets and Daubechies [96]. The Daubechies
wavelet is one of the most common wavelets used to analyze the speech signals.
The name of the Daubechies family wavelet can be written as dbN, where N is the
order of the filter. The Daubechies wavelets are a family of orthogonal wavelets
defining a discrete wavelet transform and they are characterized by a maximum
number of vanishing moments p for a given support (filter) length [140]. The number of vanishing moments is a criterion of how fast the wavelet function decays towards infinity [23]. Having p vanishing moments means that the wavelet coefficients of any polynomial of order up to p − 1 are zero [140]. Daubechies 8 is widely used for the decomposition of speech signals because it requires the minimum support filter length for a given number of vanishing moments [133]. Thus, we used Daubechies 8
to decompose the noisy speech signals in the wavelet threshold techniques. Figure
4.12 shows the Daubechies 8 wavelet function.
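For illustration, a multi-level db8 decomposition can be carried out with the PyWavelets package (an assumption; the thesis does not name its implementation):

```python
import numpy as np
import pywt

# Hypothetical noisy speech segment; in the thesis the signals come from the
# AFVC and QUT-NOISE databases.
x = np.random.default_rng(0).standard_normal(1024)

# Four-level decomposition with Daubechies 8, as used later in the wavelet
# threshold technique: coeffs = [CA4, CD4, CD3, CD2, CD1]
coeffs = pywt.wavedec(x, 'db8', level=4)

# The inverse transform reconstructs the input (up to padding at the end)
x_rec = pywt.waverec(coeffs, 'db8')
print(np.allclose(x, x_rec[:len(x)]))   # True
```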
Figure 4.13 shows a block schematic of the dyadic wavelet transform. The speech
signal (x) is decomposed into different frequency bands by using a pair of FIR
filters, h(n) and g(n), which are a low-pass and a high-pass filter, respectively.
Figure 4.13: Block schematic of the dyadic wavelet transform.
The (↓ 2) symbol is a down-sampling operator used to discard half of the speech samples after the filter is applied. The approximation coefficients (CA1) are obtained by convolving the low-pass filter h(n) with the speech signal and applying the down-sampling operator to the filter output. The detail coefficients (CD1) are obtained by convolving the high-pass filter g(n) with the speech signal and down-sampling the filter output. The decomposition of the speech signal can be repeated by applying the DWT to the approximation coefficients (CA1).
Figure 4.14 shows the implementation of a two-level inverse discrete wavelet transform based on two filter banks, where $\tilde{h}(n)$ and $\tilde{g}(n)$ are the low-pass and high-pass reconstruction filters, respectively, and the symbol (↑ 2) represents up-sampling of the wavelet coefficients by a factor of 2. The four FIR filters satisfy the following relationships:

$$g(n) = (-1)^{n}\, h(L + 1 - n), \qquad (4.36)$$

$$\tilde{h}(n) = h(L + 1 - n), \qquad (4.37)$$

$$\tilde{g}(n) = (-1)^{(n-1)}\, h(L + 1 - n), \qquad (4.38)$$

where L is the length of the FIR filters and $n = 1, 2, \cdots, L$.
Figure 4.14: Block schematic of the dyadic inverse discrete wavelet transform.
The output of the inverse DWT is identical to the input speech signal [45, 88].
4.8.2.1.1 Wavelet threshold technique
The wavelet threshold technique is used to reduce the effect of noise by thresh-
olding the detailed wavelet coefficients. It is based on the assumption that the
energy of the speech signal is mostly concentrated in a small number of wavelet
coefficients [126]. These coefficients have larger energy than the remaining coefficients (especially the noise coefficients), whose energy is spread over a large number of wavelet coefficients. Thresholding the small wavelet coefficients to zero can therefore eliminate the noise components from the noisy speech signal [44].
Level dependent wavelet threshold techniques are used widely to suppress noise
from the noisy speech signal. These techniques are based on thresholding the
detail coefficients of the noisy speech signals [44]. Level dependent threshold
(λth) can be defined as:

$$\lambda_{th} = \sigma_j \sqrt{2 \log N_j}, \qquad (4.39)$$

$$\sigma_j = \frac{\mathrm{MAD}(D_j)}{0.6745}, \qquad (4.40)$$

where MAD represents the median absolute deviation estimated on the scale j, $D_j$ is the detail coefficients at each scale, and $N_j$ is the number of samples of the noisy speech signal at each scale j.
The method of level dependent wavelet thresholding can be described by the following four steps (a code sketch follows the list).
1. Frame the noisy speech signal into several segments by using a Hamming
window. The duration of the frame used in the simulation results for the
speech enhancement algorithm is 25 msec.
2. Decompose the noisy speech signal into four levels by using Daubechies 8
DWT.
3. Threshold the detail coefficients of the noisy speech signal by using a hard or a soft level dependent threshold. The hard (Thard) and soft (Tsoft) thresholds can be expressed as:

$$T_{hard}(D_j) = \begin{cases} D_j, & |D_j| > \lambda_{th} \\ 0, & |D_j| \le \lambda_{th} \end{cases} \qquad (4.41)$$

$$T_{soft}(D_j) = \begin{cases} \mathrm{sign}(D_j)\,(|D_j| - \lambda_{th}), & |D_j| > \lambda_{th} \\ 0, & |D_j| \le \lambda_{th} \end{cases} \qquad (4.42)$$
4. Apply the inverse DWT to the thresholded detail wavelet coefficients to
obtain the enhanced speech signal.
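The four steps can be sketched in Python as follows, assuming the PyWavelets package and approximating MAD(Dj) by the median of |Dj| (reasonable because the median of the detail coefficients is close to zero); this is an illustrative sketch rather than the thesis code:

```python
import numpy as np
import pywt

def wavelet_denoise(frame, wavelet='db8', level=4, mode='soft'):
    """Level-dependent wavelet thresholding (Eqs. 4.39-4.42) of one
    Hamming-windowed 25 msec frame."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    out = [coeffs[0]]                               # approximation kept as-is
    for Dj in coeffs[1:]:                           # detail coefficients
        sigma_j = np.median(np.abs(Dj)) / 0.6745    # Eq. (4.40), MAD approx.
        lam = sigma_j * np.sqrt(2 * np.log(len(Dj)))   # Eq. (4.39)
        if mode == 'hard':
            Dj = np.where(np.abs(Dj) > lam, Dj, 0.0)   # Eq. (4.41)
        else:
            Dj = np.sign(Dj) * np.maximum(np.abs(Dj) - lam, 0.0)  # Eq. (4.42)
        out.append(Dj)
    return pywt.waverec(out, wavelet)[:len(frame)]  # step 4: inverse DWT
```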
4.8.2.2 Spectral Subtraction
Spectral subtraction is based on subtracting the estimated power spectrum of the
noise from the power spectrum of the noisy speech signal, without prior knowledge
of the power spectral density of the clean speech and noise signals. Spectral subtraction can be used to suppress background noise under the assumption that the noise is stationary or changes slowly during the non-speech and speech activity periods [13].
The procedure of spectral subtraction can be summarized by the following steps. Firstly, the noisy speech signal is framed into several overlapping segments by using a Hamming window; the frame duration is 25 msec and the overlap between two successive windows is 12.5 msec. Secondly, the noise is estimated by computing the average power spectrum of the noise over several silent frames (noise only), where a spectral distance voice activity detector is used to determine the noise frames. Then, the Fourier transform is applied to the windowed frames of the noisy speech signal [137]. Spectral subtraction can be computed as:

$$|S(k)|^2 = \begin{cases} |X(k)|^2 - \delta |D(k)|^2, & \text{if } |X(k)|^2 - \delta |D(k)|^2 > \beta |D(k)|^2 \\ \beta |D(k)|^2, & \text{otherwise,} \end{cases} \qquad (4.43)$$
where X(k), S(k) and D(k) are the magnitudes of the power spectra of the corrupted speech segment, the estimated speech and the estimated noise, respectively, δ is the over-subtraction factor, which depends on the a posteriori segmental SNR, and β is the spectral floor factor with values between 0 and 1. For a large value of β, the spectral floor is high and the remaining noise is audible, while for a small value of β, the noise is significantly reduced but the remaining noise becomes annoying. The value of β used in these experimental results is 0.03. Finally, the enhanced speech signal is obtained by applying an inverse Fourier transform to the estimated speech magnitude spectrum |S(k)| combined with the phase of the discrete Fourier transform of the noisy input speech signal.
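A simplified sketch of this procedure is shown below; it assumes a fixed over-subtraction factor δ and estimates the noise spectrum from the first few frames instead of a spectral distance voice activity detector, so it is an approximation of the procedure described above rather than the thesis implementation:

```python
import numpy as np

def spectral_subtraction(noisy, fs, delta=4.0, beta=0.03,
                         frame_ms=25, hop_ms=12.5, n_noise_frames=10):
    """Power spectral subtraction (Eq. 4.43) with noisy-phase resynthesis.
    Window normalization in the overlap-add is omitted for brevity."""
    n = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(n)
    frames = [noisy[i:i + n] * win for i in range(0, len(noisy) - n, hop)]
    X = np.fft.rfft(np.array(frames), axis=1)
    noise_pow = np.mean(np.abs(X[:n_noise_frames]) ** 2, axis=0)  # |D(k)|^2
    pow_sub = np.abs(X) ** 2 - delta * noise_pow                  # subtraction
    S_pow = np.maximum(pow_sub, beta * noise_pow)                 # floor (4.43)
    S = np.sqrt(S_pow) * np.exp(1j * np.angle(X))                 # noisy phase
    out = np.zeros(len(noisy))                                    # overlap-add
    for i, frame in enumerate(np.fft.irfft(S, n=n, axis=1)):
        out[i * hop:i * hop + n] += frame
    return out
```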
4.8.2.3 Independent component analysis
The fast ICA algorithm is used to separate the clean speech from the noisy speech signals. It separates one non-Gaussian component at a time, exploiting the fact that a sum of independent source signals is closer to a Gaussian distribution than the individual source signals are. The Gaussian contrast function is used in the fast ICA and can be defined as:

$$G_1(s) = -\exp\!\left(-\frac{s^2}{2}\right). \qquad (4.44)$$
In order to estimate all the independent components of the source signals (speech and environmental noise), we ran the one-unit fast ICA N times, and the deflation decorrelation algorithm was applied to the estimated source signals after each iteration to prevent the row vectors of the unmixing matrix W from converging to the same maxima.
4.8.3 Evaluation of performance
Speech enhancement performance is typically measured using the output SNR (SNRo) in dB, which can be defined as:

$$\mathrm{SNR}_o = 10 \log_{10}\!\left(\frac{\sum_{n} s^2(n)}{\sum_{n} \left(s(n) - \hat{s}(n)\right)^2}\right), \qquad (4.45)$$

where $s(n)$ and $\hat{s}(n)$ are the original speech and the estimated speech signals, respectively.
The SNR enhancement (SNRe) in dB can be defined as:

$$\mathrm{SNR}_e = \mathrm{SNR}_o - \mathrm{SNR}_i, \qquad (4.46)$$
where SNRi is the input SNR; it can be computed as the ratio of the sum of squares of the clean speech to that of the noise at the first microphone (x1).
To evaluate the speech enhancement performance, the average improvement in SNR was used in the simulations, where one sentence from each of 100 speakers from the AFVC database was mixed with different types of environmental noise from the QUT-NOISE database. The average SNR improvement in dB, $\overline{\mathrm{SNR}}_e$, can be computed as:

$$\overline{\mathrm{SNR}}_e = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathrm{SNR}_e(i), \qquad (4.47)$$

where $N_s$ is the number of speakers (equal to 100) and $\mathrm{SNR}_e(i)$ is the SNR enhancement for speaker i.
The standard deviation of the SNR enhancement can be computed as

$$\sigma = \sqrt{\frac{1}{N_s} \sum_{i=1}^{N_s} \left(\mathrm{SNR}_e(i) - \overline{\mathrm{SNR}}_e\right)^2}. \qquad (4.48)$$
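Equations 4.45-4.48 can be computed directly, as in the following numpy sketch (not the thesis code):

```python
import numpy as np

def output_snr(s, s_hat):
    """Output SNR in dB (Eq. 4.45): clean speech power over error power."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

def input_snr(s, noise):
    """Input SNR in dB: clean speech power over the power of the noise
    observed at the first microphone x1."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))

def snr_enhancement(s, s_hat, noise):
    """SNR enhancement in dB (Eq. 4.46)."""
    return output_snr(s, s_hat) - input_snr(s, noise)

def average_snr_enhancement(snr_e_per_speaker):
    """Average (Eq. 4.47) and standard deviation (Eq. 4.48) of the SNR
    enhancement over the Ns speakers."""
    snr_e = np.asarray(snr_e_per_speaker)
    mean = snr_e.mean()
    std = np.sqrt(np.mean((snr_e - mean) ** 2))
    return mean, std
```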
4.8.4 Comparison of speech enhancement algorithms
In this section, the performance of speech enhancement based on the ICA algo-
rithm is compared with the performance of single channel speech enhancement
algorithms (wavelet threshold techniques and spectral subtraction). The goal of
comparing single channel speech enhancement algorithms with speech enhance-
ment based on the ICA algorithm is to choose the most reliable baseline speech
enhancement algorithm for real forensic applications under noisy conditions. The
baseline speech enhancement will be used in Chapter 6 to improve forensic speaker
verification performance in the presence of various types and levels of environ-
mental noise.
Figures 4.15, 4.16, and 4.17 show comparisons of the average and standard deviation of SNR improvement for different speech enhancement algorithms when one sentence from each of 100 speakers of the AFVC database was corrupted with STREET, CAR and HOME noises, respectively. Standard deviations from the Monte Carlo simulation are shown in the bars. The SNRs on the x-axis of these figures were computed from the first microphone (x1) in the ICA algorithm.

Figure 4.15: Comparison of average SNR enhancement when STREET noise is added to the speech signals from the AFVC database.

Figure 4.16: Comparison of average SNR enhancement when CAR noise is added to the speech signals from the AFVC database.

Figure 4.17: Comparison of average SNR enhancement when HOME noise is added to the speech signals from the AFVC database.
There is an arbitrariness in the sign of the estimated sources upon inversion. The sign change between samples of the estimated speech and samples of the original speech in an ICA algorithm is resolved by multiplying all samples of the estimated speech signal by -1 if the maximum correlation coefficient between the estimated and original speech has a negative sign. Results of some experiments comparing speech enhancement based on the ICA with single channel speech enhancement algorithms were published at the 16th Australian International Conference on Speech Science and Technology, in the conference paper titled “Comparison of speech enhancement algorithms for forensic applications” [4]. From these figures, the following points can be concluded:
1. When the speech signals were mixed with CAR, STREET and HOME noises at SNRs ranging from -10 to 10 dB, the ICA algorithm significantly improved the average SNR compared with spectral subtraction and wavelet threshold techniques. The ICA algorithm relies on statistical independence to separate the speech from the noisy speech signals. In contrast, the single channel speech enhancement algorithms (wavelet threshold techniques and spectral subtraction) are based on thresholding the noisy speech in each high-frequency sub band or on subtracting a fixed amount of the noise spectrum in spectral subtraction. The enhanced speech signals may be distorted by a single channel speech enhancement algorithm because some speech information, such as unvoiced speech, is treated as noise and removed.
2. Level dependent wavelet threshold techniques give higher average SNR enhancement than the spectral subtraction algorithm at low SNRs ranging from -10 dB to 0 dB, because environmental noise (CAR, STREET and HOME noises) is not uniformly distributed over the entire frequency range. Thresholding the detail coefficients in each high frequency sub band by using a level dependent wavelet threshold therefore improves the average SNR enhancement at low SNRs.
3. Level dependent thresholding and spectral subtraction fail to remove environmental noise at SNRs ranging from 5 to 10 dB because the power spectral densities of environmental noise are concentrated at certain frequencies. Using a fixed over-subtraction parameter in spectral subtraction, or thresholding all high frequency sub bands of the noisy speech signal at low levels of noise, leads to distortion of the enhanced speech signal.
4.9 Chapter Summary
This chapter presented ICA as a multiple channel speech enhancement algorithm.
A brief description of the fundamental and mathematical frameworks of ICA was
presented. As a part of this discussion, the two important preprocessing steps,
centring and whitening, were thoroughly described. The fast ICA algorithm was
also discussed in detail.
In real forensic scenarios, forensic audio recordings are often corrupted by various
types of environmental noise. Since the quality of these recordings is poor, it is
difficult to use these recordings directly in forensic speaker verification to recog-
nize the identity of the criminal by their voice. Thus, speech enhancement may
reduce the effect of environmental noise and improve forensic speaker verification
performance.
Simulation results for a single channel (spectral subtraction and wavelet threshold
techniques) and speech enhancement based on the ICA algorithm were also given
in this chapter. Part of the comparison results of speech enhancement based on the ICA with single channel speech enhancement algorithms was published in a conference paper titled “Comparison of speech enhancement algorithms for forensic applications” [4]. The results demonstrated that speech enhancement
based on the ICA algorithm significantly improved average SNR compared with
single channel speech enhancement algorithms for different levels and types of
environmental noise. The enhanced speech signals from the ICA algorithm could
improve forensic speaker verification performance under real-world situations in
the presence of various types of environmental noise. Thus, the ICA algorithm
will be used as the front-end of speech enhancement in forensic speaker verification
systems in Chapter 6.
Chapter 5
Robust feature extraction
techniques
5.1 Introduction
The performance of speaker verification systems is often degraded in real forensic
applications for two main reasons: noise and reverberation conditions. Forensic
speech samples are often corrupted by various types of environmental noise in
real forensic scenarios [89]. For instance, a criminal may use a mobile phone to
commit a criminal offence, and the surveillance recordings are often corrupted
by various types of environmental noise. The performance of speaker verification
drops dramatically in the presence of high levels of noise [73, 94]. In conditions
of reverberation, the speech signal is often combined with a multiple reflection
version of the original speech due to the surrounding room environment [40].
The reverberated speech can be represented by convolving the impulse response
of the room with the original speech signal. The presence of reverberation dis-
torts feature vectors and decreases speaker verification performance because of
the mismatched conditions between the enrolment model and verification speech
signals [121, 122].
A number of feature channel compensation techniques, such as CMS [37], CMVN
[139], and RASTA processing [49] have been used to improve the performance of
speaker verification systems. However, these techniques are less effective for non-
stationary additive distortion and reverberation environments [40, 56]. Pelecanos
et al. [106] introduced a feature warping technique to speaker verification to
compensate for the effect of additive noise and linear channel mismatch in the
feature domain. This technique maps the distribution of the cepstral features
into a standard normal distribution. Feature warping provides a robustness to
noise while retaining the speaker-specific information that is lost when using other
channel compensation techniques such as CMS, CMVN, and RASTA processing
[59].
Multiband feature extraction techniques have been used for feature extraction in noisy speaker recognition systems [1, 21, 72, 86, 123], achieving better performance than traditional MFCC features. Multiband feature techniques are based on combining the MFCC features of the noisy speech signals and the MFCC extracted from the DWT in a single feature vector.
A combination of feature warping with DWT-MFCC and MFCC features of the
speech signal improves forensic speaker verification performance under noisy and
reverberant conditions for three main reasons. Firstly, feature warping is more
robust to additive noise and reverberation conditions compared with traditional
MFCC and other feature compensation techniques [58, 106]. Secondly, rever-
beration affects low frequency more than high-frequency sub bands, since the
boundary materials used in most rooms are less absorptive at low frequency sub
bands. The DWT can be used to extract more features from the low frequency
sub bands [95]. These features add some important features to the full band of the
feature-warped MFCC. Thus, fusion of feature warping with DWT-MFCC and
MFCC features of the reverberated signals may achieve better forensic speaker
verification performance than full band feature-warped MFCC in the presence of
reverberation conditions. Finally, DWT can be a useful tool to decrease the effect
of noise. The feature-warped MFCC extracted from the DWT can be combined
with the feature-warped MFCC extracted from the full band of the noisy speech
signal itself to obtain a large feature vector suitable for speaker recognition in the
presence of noise.
In this chapter, we investigate the effectiveness of combining the features of DWT-
MFCC and MFCC of speech signals with and without feature warping to improve
i-vector speaker verification performance under noisy conditions only. Different
individual and concatenative feature extraction techniques are used to evaluate
the modern i-vector forensic speaker verification performance in the presence of
various types of environmental noise. A new fusion of feature warping with DWT-
MFCC and MFCC is proposed in this chapter for improving forensic speaker
verification performance in the presence of noise, reverberation, as well as noisy
and reverberant conditions.
This chapter is divided into several sections. Section 5.2 presents the combina-
tion of DWT-MFCC and MFCC feature warping. Experimental methodology is
described in Section 5.3. The results and discussion are presented in Section 5.4.
5.2 Combination of DWT and MFCC FW
The technique for extracting the features is based on the multi-resolution property
of the DWT. The MFCC features were computed over Hamming windowed frames
of 30 msec size and 10 msec shift to discard the discontinuities at the edges of the frame.

Figure 5.1: Extraction and fusion of DWT-MFCC and MFCC features with and without feature warping (FW).

The MFCC was obtained using a mel-filterbank of 32 channels followed
by a transformation to the cepstral domain. The 13-dimensional MFCC features,
with appended delta (∆) and double delta (∆∆) coefficients, were extracted
from the full band of the noisy speech signals. Feature warping with a 301 frame
window was applied to the features extracted from the MFCC. The DWT was
applied to decompose the noisy speech signals into two frequency sub bands: the
approximation (low-frequency sub band) and the detail (high frequency sub band)
coefficients. The decomposition process can be repeated by applying the DWT
to the approximation coefficients. The approximation and detail coefficients were
combined into a single vector. The feature-warped MFCC was then used to
extract features from the single feature vector of the DWT.
Table 5.1: Summary of feature extraction labels.

Sub-branch label        | Feature extraction label
1 A                     | MFCC
1 B                     | MFCC (FW)
2 A                     | DWT-MFCC
2 B                     | DWT-MFCC (FW)
Fusion of 1 A and 2 A   | Fusion (no FW)
Fusion of 1 B and 2 B   | Fusion (both FW)
Table 5.2: Description of the number of features extracted corresponding to each feature extraction label.

Feature extraction label | Number of features
MFCC                     | 39
MFCC (FW)                | 39
DWT-MFCC                 | 39
DWT-MFCC (FW)            | 39
Fusion (no FW)           | 78
Fusion (both FW)         | 78
In this thesis, the effect of feature warping on DWT-MFCC and MFCC features is investigated, both individually and in a concatenative fusion of these features, in the presence of various types of environmental noise, reverberation, and noisy and reverberant conditions, as shown in Figure 5.1. The feature extraction techniques can be used to train the state-of-the-art i-vector PLDA speaker verification systems and are described in Table 5.1.

Tables 5.1 and 5.2 summarize the feature extraction labels and the number of features extracted for each label. The symbol (FW) in Tables 5.1 and 5.2 is the acronym for feature warping.
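A sketch of the proposed fusion front-end is given below, assuming the librosa, PyWavelets and scipy packages and a simple rank-based implementation of feature warping; the exact thesis implementation is not published here, so details such as frame alignment between the two branches are illustrative assumptions.

```python
import numpy as np
import pywt
import librosa
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Map each coefficient's rank inside a sliding window of `win` frames
    to the corresponding standard-normal quantile (feature warping)."""
    T = feats.shape[0]
    half = win // 2
    warped = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        ctx = feats[lo:hi]                          # local context window
        rank = (ctx < feats[t]).sum(axis=0) + 1     # rank of centre frame
        warped[t] = norm.ppf((rank - 0.5) / ctx.shape[0])
    return warped

def mfcc_fw(sig, sr):
    """39-dimensional feature-warped MFCC (13 MFCC + delta + delta-delta)."""
    m = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13, n_mels=32,
                             win_length=int(0.03 * sr),
                             hop_length=int(0.01 * sr))
    f = np.vstack([m, librosa.feature.delta(m),
                   librosa.feature.delta(m, order=2)])
    return feature_warp(f.T)                        # frames x 39

def fusion_both_fw(sig, sr, level=3):
    """Fusion (both FW): feature-warped MFCC of the full band concatenated
    with feature-warped MFCC of the concatenated DWT coefficients."""
    coeffs = pywt.wavedec(sig, 'db8', level=level)      # [CA3, CD3, CD2, CD1]
    dwt_sig = np.concatenate(coeffs).astype(np.float32) # single vector
    a = mfcc_fw(sig, sr)
    b = mfcc_fw(dwt_sig, sr)
    T = min(len(a), len(b))                             # align frame counts
    return np.hstack([a[:T], b[:T]])                    # frames x 78
```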
5.3 Experimental methodology
The i-vector based experiments were evaluated using the clean speech signals
from the AFVC database. A UBM with 256 Gaussian components was used in
our experimental results. The UBMs were trained on telephone and microphone recordings from 348 speakers from the AFVC database. These UBMs were used to compute the Baum-Welch statistics before training a total-variability subspace of dimension 400. The total-variability subspace was used to compute the i-vector speaker representation, and the i-vector dimension was reduced to 200 using LDA. I-vector length normalization, implemented by centering and whitening of the i-vectors, was applied before GPLDA modelling [43].
based forensic speaker verification was used in these experimental results instead
of HTPLDA because it was found that the length normalized GPLDA gives a sim-
ilar performance with less computational complexity than HTPLDA [43]. The
length normalized GPLDA technique is computationally efficient compared with
HTPLDA because the length normalized technique is used to transform the dis-
tribution of the i-vectors from heavy-tailed to Gaussian [61]. The performance of
the i-vector PLDA speaker verification systems was evaluated using the Microsoft
Research (MSR) identity toolbox [119].
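The centering and whitening steps of i-vector length normalization can be sketched as follows (a numpy illustration of the standard recipe, not the MSR toolbox code):

```python
import numpy as np

def length_normalize(ivectors):
    """Centering, whitening and unit-length projection of i-vectors prior
    to GPLDA modelling; ivectors has shape (N, dim), here dim = 200."""
    X = ivectors - ivectors.mean(axis=0)            # centering
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))  # covariance eigendecomposition
    Z = X @ (E @ np.diag(d ** -0.5) @ E.T)          # whitening
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # length normalization
```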
5.4 Results and discussion
This section describes the effectiveness of the fusion of features of DWT-MFCC
and MFCC with and without feature warping on the forensic speaker verifica-
tion performance under noisy conditions. A new fusion of feature warping with
DWT-MFCC and MFCC is proposed for improving forensic speaker verification
performance in the presence of reverberation and noisy and reverberant condi-
tions. The modern length-normalized GPLDA was used as a classifier in all
experimental results. The performance of speaker verification systems was eval-
uated using the EER and the mDCF, calculated using $C_{miss} = 10$, $C_{fa} = 1$, and $P_{target} = 0.01$.
5.4.1 Noisy conditions
This section describes the performance of the fusion of features of DWT-MFCC
and MFCC with and without feature warping in the presence of STREET, CAR
and HOME noises only. The effect of decomposition level, wavelet families and
utterance duration on the performance of fusion of feature warping with DWT-
MFCC and MFCC-based speaker verification systems will also be described in
the next sections.
5.4.1.1 Effect of decomposition level
This experiment evaluated the effect of decomposition level on the performance
of fusion of feature warping with DWT-MFCC and MFCC features. The full
duration of interview recordings from 200 speakers was kept under clean condi-
tions using pseudo-police style, while 10 sec of the surveillance recordings from
200 speakers using informal telephone conversation style were corrupted with a
random session of STREET, CAR and HOME noises at SNRs ranging from -10
dB to 10 dB. The interview and noisy surveillance recordings were decomposed
into 2, 3, and 4 levels using db8 DWT.
Figure 5.2 shows the effect of the decomposition levels on the performance of
fusion of feature warping with DWT-MFCC and MFCC features in the presence
of various types of environmental noise. It was found that increasing the number
Figure 5.2: Effect of the decomposition levels on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of noise.
of levels to more than three over the majority of SNR values in the presence
of various types of environmental noise degraded the speaker verification perfor-
mance. In this case, the number of samples in the lowest frequency sub bands was
so low that the essential characteristics of the speech signals could not be esti-
mated accurately by the classifier [21]. The results of effect of decomposition level
on the performance of fusion of feature warping with DWT-MFCC and MFCC
were published in IEEE Access journal in a paper entitled “ Enhanced forensic
speaker verification using a combination of DWT and MFCC feature warping in
the presence of noise and reverberation conditions” [5].
5.4.1.2 Effect of wavelet family
This experiment evaluated the effect of wavelet family on the performance of fu-
sion of feature warping with DWT-MFCC and MFCC features. The full duration
of interview recordings from the pseudo-police style was kept under clean conditions, while 10 sec of the surveillance speech signals were mixed with STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB. The interview and surveillance recordings were decomposed into three levels using different Daubechies wavelet families (db2, db4, and db8).

Figure 5.3: Average EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families in the presence of different types of environmental noise.
The average EER for fusion of feature warping with DWT-MFCC and MFCC can be calculated by computing the mean of the EER over the different types of environmental noise at each noise level, as shown in Figure 5.3. It is clear from this
figure that the performance of fusion of feature warping with DWT-MFCC and
MFCC using db8 achieved slight improvements in average EER compared with
other wavelet families in the presence of various types of environmental noise at
SNRs ranging from -10 dB to 10 dB. Since level 3 achieved better noisy forensic
speaker verification performance over the majority of SNR values, as described in
Section 5.4.1.1, and db8 achieved slight improvements in noisy forensic speaker
verification performance compared with other wavelet families, level 3 and db8
will be used in the feature extraction based on DWT in the presence of noise in
the next sections.
5.4.1.3 Comparison of feature extraction techniques
This experiment evaluated the performance of combining DWT-MFCC and
MFCC features with and without feature warping in the presence of various
levels of environmental noise only. The full length of interview recordings from
200 speakers using pseudo-police style was kept under clean conditions, while 10
sec of the surveillance recordings from 200 speakers using informal telephone con-
versation style were mixed with random sessions of STREET, CAR and HOME
noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB.
Figure 5.4 compares speaker verification performance using different feature ex-
traction techniques in the presence of environmental noise at various SNR values.
The following points were concluded from this figure:
• The performance of forensic speaker verification systems achieves significant improvements in EER over the majority of SNR values when applying feature warping to the MFCC features in the presence of various types of environmental noise (blue solid vs blue dash).
• Feature warping did not improve forensic speaker verification performance when DWT-MFCC was used as the feature extraction, whereas speaker verification performance was improved by applying feature warping to the traditional MFCC features (red solid vs blue solid). The major drawback of using DWT-MFCC (FW) as the feature extraction is that it loses some important correlation information between sub band features. The lack of correlation information between sub band features decreased the performance of speaker recognition systems [90].
Figure 5.4: Comparison of speaker verification performance using different feature extraction techniques in the presence of various types of environmental noise.
• Fusion of feature warping with DWT-MFCC and MFCC features achieves greater improvements in EER than fusion without any feature warping in the presence of various levels of environmental noise (green solid vs green dash).
• Fusion of feature warping with DWT-MFCC and MFCC significantly improved EER over traditional MFCC features in the presence of various types and levels of environmental noise (green solid vs blue dash). The reduction
Figure 5.5: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC in the presence of various types of environmental noise. Higher average reduction in EER indicates better performance.
in EER for fusion of feature warping with DWT-MFCC and MFCC at 0
dB SNR is 48.28%, 30.27%, and 41.17%, respectively, over MFCC features
in the presence of random sessions of CAR, STREET and HOME noises.
The reduction in EER between two systems A and B can be defined as:

$$\mathrm{EER}_{red} = \frac{\mathrm{EER}_A - \mathrm{EER}_B}{\mathrm{EER}_A}, \qquad (5.1)$$
where EERA is the equal error rate for system A and EERB is the equal error
rate for system B. The average reduction in EER between feature-warped MFCC
and fusion of feature warping with DWT-MFCC and MFCC can be computed by
calculating the mean of EERred for various types of environmental noise at each
noise level.
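For reference, the EER and the relative reduction of Equation 5.1 can be computed from genuine and impostor score arrays as in the following sketch (a simple threshold sweep; not the thesis code):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal error rate: sweep the decision threshold over all scores and
    return the point where false-accept and false-reject rates coincide."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def eer_reduction(eer_a, eer_b):
    """Relative reduction in EER of system B over system A (Eq. 5.1)."""
    return (eer_a - eer_b) / eer_a
```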
Figure 5.5 shows average reduction in EER for fusion of feature warping with
DWT-MFCC and MFCC over feature-warped MFCC features in the presence of
Figure 5.6: mDCF for feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC features in the presence of various types of environmental noise.
various types of environmental noise for each noise level. The results show that
fusion of feature warping with DWT-MFCC and MFCC achieves a reduction in
average EER over feature-warped MFCC features in the presence of various types
of environmental noise at SNRs ranging from -10 dB to 10 dB. At 0 dB SNR, the
average reduction in EER for fusion of feature warping with DWT-MFCC and
MFCC was 21.33% over feature-warped MFCC. The results comparing forensic
speaker verification performance using different feature extraction techniques in
the presence of various types of environmental noise were published in IEEE
Access in the journal paper titled “ Enhanced forensic speaker verification using
a combination of DWT and MFCC feature warping in the presence of noise and
reverberation conditions” [5].
We compare the mDCFs between feature-warped MFCC and fusion of feature
warping with DWT-MFCC and MFCC because these feature extraction tech-
niques achieved significant improvements in the performance of forensic speaker
verification compared with other feature extraction techniques. Figure 5.6 shows
the mDCFs of feature-warped MFCC and fusion of feature warping with DWT-
MFCC and MFCC in the presence of various types of environmental noise. It
is clear from this figure that fusion of feature warping with DWT-MFCC and
MFCC features achieves improvement in mDCF over feature-warped MFCC in
the presence of CAR, STREET and HOME noises at SNRs ranging from -10 dB
to 10 dB.
5.4.1.4 Fusion of feature warping with DWT-MFCC and MFCC
Based on the results obtained from the effectiveness of fusion of features of DWT-
MFCC and MFCC with and without using feature warping in the presence of var-
ious types of environmental noise, the new technique of fusion of feature warping
with DWT-MFCC and MFCC is proposed. This technique is based on decom-
posing the noisy speech signals into approximation and detail coefficients using
the DWT. The approximation and detail coefficients are fused into a single vector. Feature-
warped MFCC is used to extract the essential characteristics of the noisy speech
signals from the DWT. Then, feature-warped MFCC is used to extract the fea-
tures from the full band of the noisy speech signals. Fusion of feature warping
with DWT-MFCC and MFCC can be obtained by concatenating the feature-
warped MFCC extracted from the DWT and feature-warped MFCC extracted
from the full band of the noisy speech signals into a single feature vector, as
shown in Figure 5.7.
5.4.1.5 Performance comparison to related work
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC features with other feature extraction techniques used in
Figure 5.7: Fusion of feature warping with DWT-MFCC and MFCC.
[86, 87, 106]. In order to evaluate the proposed technique against the feature extraction techniques of [86, 87, 106], the interview recordings were obtained at full length from 200 speakers using pseudo-police style, and these data were kept under clean conditions. The surveillance recordings were obtained as 10 sec segments from 200 speakers using informal telephone conversation style. The surveillance recordings were corrupted by a random session of CAR, STREET and HOME noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB. The noisy speech signals were decomposed into three levels using db8 of the DWT, and the decomposed speech signals were used to extract the features of the techniques in [86, 87]. These feature extraction techniques were used to train the
Figure 5.8: Comparison of average EER for the proposed fusion of feature warping with DWT-MFCC features with other feature extraction techniques.
modern length-normalized GPLDA framework.
The average EER can be calculated by computing the mean of EER for various
types of environmental noise at each noise level for each feature extraction tech-
nique. Figure 5.8 shows a comparison of average EER for the proposed fusion of
feature warping with DWT-MFCC and MFCC features with other feature extrac-
tion techniques. It shows that the fusion of feature warping with DWT-MFCC
and MFCC features achieves improvements in average EER over other feature
extraction techniques in the presence of various types of environmental noise at
SNRs ranging from -10 dB to 10 dB.
5.4.1.6 Effect of utterance length
In these simulation results, the full duration of the interview recordings was kept
under clean conditions, while the duration of the surveillance recordings was
Figure 5.9: Effect of the utterance length on the performance of feature-warped MFCC in the presence of environmental noise.
changed from 10 sec to 40 sec. The surveillance recordings were corrupted with
random segments of STREET, CAR and HOME noises at SNRs ranging from
-10 dB to 10 dB. Since the performances of forensic speaker verification based
on fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC under noisy conditions achieved improvements in EER compared with
other feature extraction techniques, as described in Section 5.4.1.3, the effect of
utterance length was evaluated on the performance of fusion of feature warping
with DWT-MFCC and MFCC and feature-warped MFCC in this section.
Figure 5.9 shows the effect of utterance length on the performance of feature-
warped MFCC under environmental noise. It is clear that as the utterance length
increased, the performance of forensic speaker verification based on the feature-
warped MFCC improved in the presence of different levels of STREET, CAR and
HOME noises.
Figure 5.10 shows the effect of utterance length on the performance of the fusion of
Figure 5.10: Effect of the utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise.
feature warping with DWT-MFCC and MFCC features in the presence of various
types of environmental noise. It is clear that increasing the utterance duration
improved forensic speaker verification performance in the presence of STREET,
CAR and HOME noises at SNRs ranging from -10 dB to 10 dB.
The reduction in EER between feature-warped MFCC and fusion of feature warp-
ing with DWT-MFCC and MFCC can be calculated using Equation 5.1 when 40
seconds of the surveillance recordings were mixed with different types and lev-
els of environmental noise. The average reduction in EER can be computed by
calculating the mean of EER for different types of environmental noise at each
noise level. Figure 5.11 shows the average reduction in EER for fusion of fea-
ture warping with DWT-MFCC and MFCC over feature-warped MFCC when 40
seconds of the surveillance recordings were mixed with various types of environ-
mental noise at SNRs ranging from -10 dB to 10 dB. The results show that the
performance of fusion of feature warping with DWT-MFCC and MFCC improved
Figure 5.11: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when 40 sec of the surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
average EER by 1.02% to 52.72% compared with feature-warped MFCC in the
presence of various types of environmental noise at SNRs ranging from -10 dB to
10 dB.
Figure 5.12 shows the average reduction in EER for fusion of feature warping
with DWT-MFCC and MFCC features when the duration of the surveillance
recordings increased from 10 sec to 40 sec. At 0 dB SNR, the performance of fusion of feature warping with DWT-MFCC and MFCC features achieved an average reduction in EER of 17.92% when the duration of the surveillance recordings increased from 10 sec to 40 sec. Some results of the effect of utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise were published in the IEEE Access journal in a paper entitled “Enhanced forensic speaker verification using
a combination of DWT and MFCC feature warping in the presence of noise and
reverberation conditions” [5].
Figure 5.12: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when the duration of the surveillance recordings increased from 10 sec to 40 sec.
5.4.2 Reverberation conditions
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC with different feature extraction techniques under reverbera-
tion conditions only. The effect of decomposition level, wavelet family, utterance
length, reverberation time, and suspect and microphone position on the perfor-
mance of forensic speaker verification will also be investigated.
5.4.2.1 Effect of decomposition level
The effect of the decomposition level on the performance of fusion of feature warp-
ing with DWT-MFCC and MFCC was evaluated using different decomposition
levels. The impulse response of a room was computed by using reverberation time
(T20= 0.15 sec). Full duration of the interview recordings was obtained from 200
speakers using pseudo-police style. The VAD algorithm [130] was used to remove
Figure 5.13: Effect of decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only.
silent regions from the interview recordings. The surveillance recordings were ob-
tained from 10 sec duration from 200 speakers using informal telephone conver-
sation style after applying the VAD algorithm. Each of the interview recordings
was convolved with the impulse room response to generate reverberated speech,
while a 10 sec duration of the surveillance recordings was kept under clean con-
ditions. The first configuration of the room is used in this experiment, as shown
in Table 3.1.
In this experiment, db8 of the DWT and different decomposition levels (2, 3,
and 4) were used to investigate the effect of the decomposition levels on the
performance of fusion of feature warping with DWT-MFCC and MFCC under
reverberation conditions only. Figure 5.13 shows the effect of decomposition level
on the performance of fusion of feature warping with DWT-MFCC and MFCC
under reverberation conditions only.
It was found, as illustrated in Figure 5.13, that level 2 achieves better improve-
ment in performance than other decomposition levels. Reverberation often affects
low frequencies more than high frequencies, since the materials used in the most
common design of rooms are less absorptive at low frequencies, leading to longer
reverberation times and more distortion of spectral information at those frequen-
cies [95]. Thus, the performance of speaker verification in reverberation envi-
ronments improved by increasing the number of coefficients at a low frequency
using two levels of decomposition. The results of the effect of decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in the IEEE Access journal in a paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.4.2.2 Effect of wavelet family
This section describes the effect of wavelet family on the performance of fusion
of feature warping with DWT-MFCC and MFCC features under reverberation
conditions. Each of the interview recordings was convolved with the impulse
response of the room to generate reverberated speech, while 10 sec duration of the
surveillance recordings remained under clean conditions. The first configuration
of the room is used in this experiment, as shown in Table 3.1.
In this experiment, level 2 of DWT and different Daubechies families (db2, db4,
and db8) were used to investigate the effect of wavelet family on the performance
of fusion of feature warping with DWT-MFCC and MFCC under reverberation
conditions, as shown in Figure 5.14. It was found, as illustrated in Figure 5.14,
that db8 achieved slight improvements in EER compared with other wavelet fam-
ilies. Since level 2 achieved better forensic speaker verification performance under
reverberation conditions only, as described in Section 5.4.2.1, and db8 achieved
Figure 5.14: EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families in the presence of reverberation conditions only.
slight improvements in forensic speaker verification performance compared with
other wavelet families, level 2 and db8 will be used in the feature extraction based
on DWT in the presence of reverberation in the next sections.
5.4.2.3 Comparison of feature extraction techniques
The performance of fusion of feature warping with DWT-MFCC and MFCC
was evaluated and compared with various feature extraction techniques in the
presence of reverberation. The full duration of interview recordings from 200
speakers using pseudo-police style was reverberated at 0.15 sec reverberation
time, while a 10 sec portion of the surveillance recordings from 200 speakers
using informal telephone conversation style was kept under clean conditions. The
first configuration of the room is used in this experiment, as shown in Table 3.1.
Figure 5.15 shows a comparison of fusion of feature warping with DWT-MFCC and
MFCC with different feature extraction techniques in the presence of reverbera-
tion conditions. It is clear from this figure that the performance of forensic speaker
verification under reverberation conditions significantly improved EER when fea-
ture warping was applied to MFCC features. The performance of speaker veri-
fication based on the sub band features (DWT-MFCC and DWT-MFCC (FW))
degraded in the presence of reverberation compared with traditional MFCC be-
cause some important information was lost between sub band features. The
results also demonstrated that a fusion of feature warping with DWT-MFCC and
MFCC features improves speaker verification performance over other feature ex-
traction techniques. The performance of speaker verification based on fusion of
feature warping with DWT-MFCC and MFCC significantly improved EER over
traditional MFCC features. The reduction in EER for fusion of feature warping
with DWT-MFCC and MFCC was 47.00% over traditional MFCC features. Fu-
sion of feature warping with DWT-MFCC and MFCC reduced EER by 20.00%
over feature-warped MFCC. The results of comparing fusion of feature warping with DWT-MFCC and MFCC with different feature extraction techniques under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Table 5.3 shows a comparison of speaker verification performance based on mDCF
using different feature extraction techniques in the presence of reverberation at
0.15 sec. It shows that a fusion of feature warping with DWT-MFCC and MFCC
features improves the performance of mDCF over other feature extraction tech-
niques in the presence of reverberation. The mDCF for fusion of feature warping
with DWT-MFCC and MFCC features was reduced by 23.73% compared to feature-
warped MFCC.
Figure 5.15: Comparison of fusion of feature warping with DWT-MFCC and MFCC with different feature extraction techniques in the presence of reverberation.
Table 5.3: Comparison of speaker verification performance based on mDCF using different feature extraction techniques in the presence of reverberation at 0.15 sec.

Feature extraction technique | mDCF
MFCC                         | 0.0600
MFCC (FW)                    | 0.0434
DWT-MFCC                     | 0.0690
DWT-MFCC (FW)                | 0.070
Fusion (no FW)               | 0.0480
Fusion (both FW)             | 0.0331
5.4.2.4 Effect of reverberation time
This experiment evaluated the effect of reverberation time on the performance
of fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC because these feature extraction techniques achieved high improvements
in EER over other feature extraction techniques under reverberation conditions,
as described in the previous section. The impulse response of the room was
computed by using the following reverberation times: T20= 0.15 sec, 0.20 sec,
Figure 5.16: Effect of reverberation time on the performance of feature-warped MFCC.
and 0.25 sec. Each full duration of interview recordings was convolved with
impulse response of the room to generate reverberated interview data at different
reverberation times, while 10 sec surveillance recordings were maintained under
clean conditions. The first configuration of the room is used in this experiment,
as shown in Table 3.1.
Figure 5.16 shows the effect of reverberation time on the performance of feature-
warped MFCC. It is clear from this figure that the performance of forensic speaker
verification based on feature-warped MFCC degraded when the reverberation
time increased from 0.15 sec to 0.25 sec.
Figure 5.17 shows the effect of reverberation time on the performance of fusion
of feature warping with DWT-MFCC and MFCC. It shows that an increase in
reverberation time led to degraded speaker verification performance. When the
reverberation time increased from 0.15 sec to 0.25 sec, there was a degradation
of 34.42% in the performance of fusion of feature warping with DWT-MFCC
Figure 5.17: Effect of reverberation time on the performance of fusion of feature warping with DWT-MFCC and MFCC.
and MFCC. Reverberation adds more inter-frame distortion to the cepstral features as the reverberation time is increased; therefore, increasing the reverberation time leads to decreased speaker verification performance [121]. The results of the effect of reverberation time on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.18 shows the reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when the interview recordings were reverberated at different reverberation times. The results showed that speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC reduced EER by 20% to 21.27% compared with feature-warped MFCC when the interview recordings were reverberated at reverberation times ranging from 0.15 sec to 0.25 sec. Reverberation affects low frequency more than high
Figure 5.18: Reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when interview recordings were reverberated at different reverberation times.
frequency sub bands, and it smears spectral information of the speech signal at
low frequency sub bands [95]. The DWT-MFCC (FW) can be used to extract
more features from the low frequency sub bands. These features add some im-
portant features to the full band of the feature-warped MFCC. Thus, fusion of
feature warping with DWT-MFCC and MFCC improves forensic speaker verifi-
cation performance more than feature-warped MFCC in the presence of different
reverberation times.
5.4.2.5 Effect of utterance duration
This section describes the effect of varying utterance duration on the performance
of fusion of feature warping with DWT-MFCC and MFCC and feature-warped
MFCC techniques in the presence of reverberation conditions only. The full duration of the interview recordings from 200 speakers using the pseudo-police style was reverberated at 0.15 sec using the first configuration of the room, as shown in Table 3.1, while the duration of the surveillance recordings from 200 speakers using the informal telephone conversation style was changed from 10 sec to 40 sec.

Figure 5.19: Effect of changing the utterance length of the surveillance recordings on the performance of feature-warped MFCC under reverberation conditions (EER % vs. surveillance duration in sec).
Figure 5.19 shows the effect of changing the utterance length of the surveillance
recordings on the performance of feature-warped MFCC in the presence of rever-
beration conditions only. The results demonstrated that forensic speaker verifica-
tion based on the feature-warped MFCC improved performance when the length
of the surveillance recordings increased from 10 sec to 40 sec.
Figure 5.20 shows the effect of changing the surveillance utterance duration on
the performance of fusion of feature warping with DWT-MFCC and MFCC in the
presence of reverberation conditions only. The results showed that as the surveil-
lance utterance length increased, the performance of fusion of feature warping
with DWT-MFCC and MFCC improved. The EER was reduced by 46.04% when the duration of the surveillance recordings increased from 10 sec to 40 sec.

Figure 5.20: Effect of changing the surveillance utterance duration on the performance of fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions (EER % vs. surveillance duration in sec).

The results of the effect of utterance length on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5]. A comparison of Figures 5.19 and 5.20 shows that fusion of feature warping with DWT-MFCC and MFCC improved EER compared with feature-warped MFCC when the length of the surveillance recordings increased from 10 sec to 40 sec under reverberation conditions only.
Figure 5.21: Position of suspect and microphones in a room. All microphones and the suspect are at 1.3 m height, and the height of the room is 2.5 m.
5.4.2.6 Effect of suspect and microphone position
In this experiment, the full duration of interview recordings from 200 speakers
using pseudo-police style was reverberated at 0.15 sec, while 10 sec of surveillance
recordings from 200 speakers using informal telephone conversation style was
kept under clean conditions. The position of the suspect was not changed and
four different positions of the microphone were used to investigate the effect of
suspect/microphone position on the performance of feature-warped MFCC and
fusion of feature warping with DWT-MFCC and MFCC. The configuration of
suspect/microphone used in these experimental results is shown in Figure 5.21
and Table 5.4.
Figures 5.22 and 5.23 show the effect of suspect/microphone positions on the
performance of feature-warped MFCC and fusion of feature warping with DWT-
MFCC and MFCC. The results demonstrate that changing the distance between
the suspect and microphone affects the performance of feature-warped MFCC
and fusion of feature warping with DWT-MFCC and MFCC.

Table 5.4: Reverberation test room parameters.

Configuration | Suspect position (xs, ys, zs) | Microphone position (xm, ym, zm)
1 | (2, 1, 1.3) | (1.5, 1, 1.3)
2 | (2, 1, 1.3) | (2.4, 1, 1.3)
3 | (2, 1, 1.3) | (2.8, 1, 1.3)
4 | (2, 1, 1.3) | (2.8, 2.5, 1.3)

Configuration 2, which has the shortest distance between the suspect and microphone, achieved the greatest improvement in EER compared with the other configurations. The performance of feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC decreased when the distance between the suspect and
microphone increased. The impulse response of the room consists of early and late reflections. The characteristics of the early reflections, typically within 50 msec after the arrival of the direct sound, depend strongly on the suspect/microphone positions [144]. As the distance between the suspect and microphone increases, the duration of the early reflections can increase, leading to greater spectral alteration of the original speech signal. Thus, the performance of forensic speaker verification based on feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC degrades as the distance between the suspect and microphone increases.
A comparison of the results in Figures 5.22 and 5.23 shows that fusion of feature warping with DWT-MFCC and MFCC achieves improvements in EER compared with feature-warped MFCC when the position of the microphone changes relative to the suspect. Some results of the effect of suspect/microphone position on the fusion of feature warping with DWT-MFCC and MFCC under reverberation conditions only were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.22: Effect of microphone and suspect position configuration on the performance of feature-warped MFCC (EER % for Configurations 1-4).

Figure 5.23: Effect of microphone and suspect position configuration on the performance of fusion of feature warping with DWT-MFCC and MFCC.
5.4.3 Noisy and reverberant conditions
The performance of fusion of feature warping with DWT-MFCC and MFCC was
evaluated and compared with speaker verification based on traditional MFCC and
feature-warped MFCC under noisy and reverberant conditions. As described in
Sections 5.4.1.3 and 5.4.2.3, forensic speaker verification based on DWT-MFCC
and DWT-MFCC (FW) degraded compared with other feature extraction tech-
niques in the presence of various types of environmental noise only and rever-
beration conditions. Therefore, forensic speaker verification based on the DWT-
MFCC and DWT-MFCC (FW) will not be investigated in these experimental
results. The effect of decomposition level, wavelet family, and utterance length will also be discussed in this section.

Figure 5.24: Effect of the decomposition levels (2, 3, 4, and 5) on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of reverberation and various types of environmental noise (EER % vs. SNR for STREET, CAR, and HOME noise).
5.4.3.1 Effect of decomposition level
The effect of the decomposition level on the performance of fusion of feature warping with DWT-MFCC and MFCC was evaluated using the db8 wavelet and different decomposition levels (2, 3, 4, and 5). The full duration of the interview recordings from 200 speakers using the pseudo-police style was reverberated at 0.15 sec using the first configuration of the room, as shown in Table 5.4. The 10 sec surveillance recordings from 200 speakers using the informal telephone conversation style were corrupted with different segments of CAR, STREET and HOME noises from the QUT-NOISE database [29] at SNRs ranging from -10 dB to 10 dB.
Figure 5.24 shows the effect of the decomposition levels on the performance of
fusion of feature warping with DWT-MFCC and MFCC in the presence of rever-
beration and various types of environmental noises. It is clear from Figure 5.24
that level 4 achieves better performance in EER over the majority of SNR values
and different types of environmental noise. Some results of the effect of decomposition level on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.4.3.2 Effect of wavelet family
This section describes the effect of wavelet family on the performance of fusion
of feature warping with DWT-MFCC and MFCC under noisy and reverberant
conditions. The full duration of the interview recordings was reverberated at a 0.15 sec reverberation time. The 10 sec surveillance recordings, obtained using the informal telephone conversation style, were mixed with different sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB. The interview and surveillance recordings were decomposed into four levels using different Daubechies wavelet families (db2, db4, and db8). The first configuration of the room, as shown in Table 5.4, was used in this experiment.
The average EER for fusion of feature warping with DWT-MFCC and MFCC was computed by calculating the mean EER across the various types of environmental noise at each noise level, as shown in Figure 5.25. The performance of fusion of feature warping with DWT-MFCC and MFCC using db8 achieved slight improvements in average EER compared with the other wavelet families when the interview recordings were reverberated and the surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
Figure 5.25: Average EER for fusion of feature warping with DWT-MFCC and MFCC using different wavelet families (db2, db4, db8) in the presence of reverberation and various types of environmental noise (average EER % vs. SNR).
In the following sections, level 4 and db8 are used for feature extraction based on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions, because this combination improved performance compared with the other decomposition levels and wavelet families.
5.4.3.3 Comparison of feature extraction techniques
This section compares the performance of fusion of feature warping with DWT-
MFCC and MFCC with traditional MFCC and feature-warped MFCC in the
presence of reverberation and different types of environmental noise. In these
experimental results, the interview recordings were reverberated at 0.15 sec, and 10 sec of the surveillance recordings was mixed with different sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB. The first configuration of the room, shown in Table 5.4, was used in this experiment.

Figure 5.26: Comparison of speaker verification performance using different feature extraction techniques (MFCC, feature-warped MFCC, and fusion of feature warping with DWT-MFCC and MFCC) in the presence of environmental noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise).
Figure 5.26 shows a comparison of speaker verification performance using dif-
ferent feature extraction techniques in the presence of environmental noise and
reverberant conditions. Overall, the results show that fusion of feature warping with DWT-MFCC and MFCC achieves improvements in EER over feature-warped MFCC when the surveillance recordings were corrupted with random segments of STREET, CAR and HOME noises at various SNR values and the interview recordings were reverberated at 0.15 sec. The results also demonstrate that feature-warped MFCC achieves significant improvements in EER compared with traditional MFCC. The results of comparing speaker verification performance using different feature extraction techniques under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
Figure 5.27: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC in the presence of various types of environmental noise and reverberation conditions (average reduction in EER % vs. SNR).
Figure 5.27 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC features. The fusion of feature warping with DWT-MFCC and MFCC achieved significant improvements in average EER over traditional MFCC features when the surveillance recordings were corrupted with random sessions of CAR, STREET and HOME noises at SNRs ranging from -10 dB to 10 dB and the interview recordings were reverberated at 0.15 sec. The average reduction in EER for this technique ranged from 17.10% to 51.86% over traditional MFCC features in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The results of the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over traditional MFCC under noisy and reverberant conditions were published in the IEEE International Conference on Signal and Image Processing Applications in the paper entitled “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6].
The average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC was computed by calculating the mean of the EER reduction between feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC for various types of environmental noise at each noise level in the presence of reverberation, as shown in Figure
5.28. The results demonstrate that fusion of feature warping with DWT-MFCC and MFCC outperforms feature-warped MFCC in average reduction of EER at SNRs ranging from -10 dB to 10 dB. At 0 dB SNR, the average reduction in EER of fusion of feature warping with DWT-MFCC and MFCC in the presence of various types of environmental noise and reverberation conditions was 13.28% over feature-warped MFCC. Forensic speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC achieved a reduction in average EER compared with feature-warped MFCC under noisy and reverberant conditions for two main reasons. Firstly, the DWT has good time and frequency resolution for representing speech signals because it uses an adaptive window, and hence it can capture localized events in the speech signals [123]. In contrast, the feature-warped MFCC loses some information in the time domain by assuming that the speech signal is stationary within a given speech frame. Secondly, reverberation affects low-frequency sub bands more than high-frequency sub bands, since the boundary materials used in most rooms are less absorptive at low frequencies [95]. The DWT extracts more features from the low-frequency sub bands, and these features add important information to the features extracted from the feature-warped MFCC of the reverberated speech signals. Thus, fusion of feature warping with DWT-MFCC and MFCC improves forensic speaker verification performance in the presence of reverberation conditions.
Figure 5.28: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC in the presence of various types of environmental noise and reverberation conditions (average reduction in EER % vs. SNR).

Figure 5.29: mDCF for feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC features under noisy and reverberant conditions.

Figure 5.29 shows the mDCF of fusion of feature warping with DWT-MFCC and MFCC features and feature-warped MFCC in the presence of various types of environmental noise and reverberation conditions. It is clear from this figure that the fusion of feature warping with DWT-MFCC and MFCC features achieves improvements in mDCF over feature-warped MFCC in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
Figure 5.30: Effect of utterance length on the performance of feature-warped MFCC in the presence of noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise; surveillance durations of 10, 20, and 40 sec).
5.4.3.4 Effect of utterance length
In real forensic applications, long speech samples from a suspect are recorded in an interview room under reverberation conditions, while the surveillance recording is corrupted by environmental noise and its duration is uncontrolled. Thus, the full duration of the interview recordings was reverberated at a 0.15 sec reverberation time, while the duration of the surveillance recordings was changed from 10 sec to 40 sec. The surveillance recordings were mixed with random sessions of STREET, CAR and HOME noise at SNRs ranging from -10 dB to 10 dB. The first configuration of the room, as shown in Table 5.4, was used in this experiment. The effect of utterance length was evaluated on the performance of forensic speaker verification based on the fusion of feature warping with DWT-MFCC and MFCC and on feature-warped MFCC, because these feature extraction techniques achieved greater improvements in performance than traditional MFCC features, as described in Section 5.4.3.3.

Figure 5.31: Effect of utterance length on the performance of fusion of feature warping with DWT-MFCC and MFCC in the presence of noise and reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise; surveillance durations of 10, 20, and 40 sec).
Figure 5.30 shows the effect of utterance length on the performance of feature-
warped MFCC under noisy and reverberant environments. It is clear that the
performance of forensic speaker verification based on feature-warped MFCC im-
proved when the utterance length of the surveillance recordings increased from 10
sec to 40 sec at different levels and types of environmental noise and reverberation
conditions.
Figure 5.31 shows the effect of utterance length on the performance of fusion of
feature warping with DWT-MFCC and MFCC features in the presence of noise
and reverberation environments. It was found, as illustrated in Figure 5.31, that
the performance of speaker verification based on the fusion of feature warping
with DWT-MFCC and MFCC under noisy and reverberant conditions improved
when the duration of the surveillance recordings increased from 10 sec to 40 sec
at various types and levels of environmental noise.
Figure 5.32: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC when interview recordings were reverberated at 0.15 sec and 40 sec of the surveillance recordings was corrupted by various types of noise at SNRs ranging from -10 dB to 10 dB (average reduction in EER % vs. SNR).
The reduction in EER between feature-warped MFCC and fusion of feature warping with DWT-MFCC and MFCC can be calculated according to Equation 5.1 when the interview recordings were reverberated at 0.15 sec and 40 sec of the surveillance recordings was corrupted by different types of noise. The average reduction in EER is the mean across the various types of environmental noise at each noise level. Figure 5.32 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC over feature-warped MFCC under these conditions. The results demonstrate that fusion of feature warping with DWT-MFCC and MFCC improved average EER by 7.23% to 22.20% over feature-warped MFCC in the presence of environmental noise at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
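For reference, the following sketch shows how such figures can be computed, assuming Equation 5.1 defines the relative reduction in EER between a baseline and a proposed technique; the EER arrays are illustrative placeholders, not the thesis results.

```python
# Illustrative computation of relative EER reduction and its average across
# noise types, assuming Equation 5.1 is the relative reduction
# (EER_baseline - EER_proposed) / EER_baseline * 100. Values are placeholders.
import numpy as np

# Rows: STREET, CAR, HOME noise; columns: -10, -5, 0 dB SNR.
eer_baseline = np.array([[30.1, 22.4, 15.0],
                         [28.3, 20.9, 14.2],
                         [26.7, 19.5, 13.1]])
eer_proposed = np.array([[24.0, 17.8, 12.4],
                         [22.5, 16.4, 11.7],
                         [20.9, 15.2, 10.6]])

reduction = (eer_baseline - eer_proposed) / eer_baseline * 100   # assumed Eq. 5.1
avg_reduction_per_snr = reduction.mean(axis=0)   # mean over noise types per SNR
print(avg_reduction_per_snr)
```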
Figure 5.33: Average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when interview recordings were reverberated at 0.15 sec and the duration of the surveillance recordings was changed from 10 sec to 40 sec, in the presence of various types of noise at SNRs ranging from -10 dB to 10 dB (average reduction in EER % vs. SNR).
Figure 5.33 shows the average reduction in EER for fusion of feature warping with DWT-MFCC and MFCC when the interview recordings were reverberated at a 0.15 sec reverberation time and the duration of the surveillance recordings was changed from 10 sec to 40 sec in the presence of different types of environmental noise at SNRs ranging from -10 dB to 10 dB. Forensic speaker verification performance improved by an average EER reduction of 26.51% when the duration of the surveillance recordings increased from 10 sec to 40 sec in the presence of various types of environmental noise at 0 dB SNR and reverberated interview recordings. Some results of the effect of utterance length on the fusion of feature warping with DWT-MFCC and MFCC under noisy and reverberant conditions were published in IEEE Access in the journal paper entitled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5].
5.5 Chapter summary
In this chapter, the effectiveness of combining features, MFCC and MFCC ex-
tracted from the DWT of the speech signals, with and without feature warping
was investigated for improving the state-of-the-art length-normalized GPLDA
based speaker verification under noisy conditions only. The performance of
speaker verification was evaluated using different feature extraction techniques:
MFCC, feature-warped MFCC, DWT-MFCC, feature-warped DWT-MFCC, a
fusion of DWT-MFCC and MFCC features, and fusion of feature warping with
DWT-MFCC and MFCC. The performance of forensic speaker verification was
evaluated using the AFVC and QUT-NOISE databases. The experimental results showed that the proposed fusion of feature warping with DWT-MFCC and MFCC is superior to the other feature extraction techniques in the presence of environmental noise over the majority of SNRs. At 0 dB SNR, the fusion of feature warping with DWT-MFCC and MFCC reduced average EER by 21.33% over feature-warped MFCC in the presence of various types of environmental noise only.
Experimental results also demonstrated that the proposed fusion of feature warping with DWT-MFCC and MFCC improved forensic speaker verification performance compared with other feature extraction techniques in the presence of reverberation only, as well as under noisy and reverberant conditions. The fusion of feature warping with DWT-MFCC and MFCC reduced average EER by 20.00% and 13.28% over feature-warped MFCC in the presence of reverberation only and in noisy and reverberant environments at 0 dB SNR, respectively. The results of the effectiveness of fusion of feature warping with DWT-MFCC and MFCC for improving forensic speaker verification performance under noisy and reverberant conditions were published in the IEEE Access journal and the IEEE International Conference on Signal and Image Processing Applications. The journal and conference papers are titled “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] and “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6], respectively.
Forensic speaker verification based on the fusion of feature warping with DWT-
MFCC and MFCC improved average EER under noisy and reverberant condi-
tions compared with feature-warped MFCC for two main reasons. Firstly, rever-
beration affects low frequency sub bands more than high frequency sub bands,
since the boundary materials used in most rooms are less absorptive at low fre-
quency sub bands. The feature-warped MFCC extracted from the DWT can be
used to extract more features from the low frequency sub band of the reverber-
ated interview recordings. These features add some important information to
feature-warped MFCC. Therefore, fusion of feature warping with DWT-MFCC
and MFCC improves forensic speaker verification performance compared with
feature-warped MFCC under reverberation conditions. Secondly, the feature-warped MFCC extracted from the DWT adds more features to those extracted from the feature-warped MFCC of the full band of the noisy speech signals, thereby helping to improve forensic speaker verification performance in the presence of various types of environmental noise. Thus, the fusion of feature warping with DWT-MFCC and MFCC approach proposed in this chapter can be used to extract the features from the interview and enhanced surveillance recordings, as described in Chapter 6.
Chapter 6
ICA for speaker verification
6.1 Introduction
Forensic speaker verification systems face many challenges in real-world appli-
cations. Forensic audio recordings from the criminal are often recorded using
hidden microphones in public places [4]. Such forensic audio recordings are of-
ten corrupted with various types of environmental noises [4, 89]. Reverberation
conditions often occur when interview recordings from the suspect are recorded
in a police interview room. The distortion of speech by environmental noise
and reverberation severely degrades speaker verification performance [117]. In this chapter, the ICA algorithm is used to separate speech from the noisy speech signals and improve forensic speaker verification performance, because the source signals (speech and environmental noise) have non-Gaussian distributions, and ICA is an effective technique for separating source signals drawn from non-Gaussian distributions.
Independent component analysis was used as a multiple channel speech enhance-
ment algorithm to separate the clean speech from the noisy speech signals and it
can be used to improve forensic speaker verification performance. The fast ICA
is one of the popular ICA methods used in the literature and it is based on the
iterative blind source separation technique [10]. The original source signals in
the fast ICA algorithm are estimated from the mixed data at each instance. The
quality of estimation of the source signals is based on the unmixing matrix [101].
Due to the uncertainty associated with initialization of the fast ICA parameters
and the iterative process, there is a randomness in the separation of the source
signals in the fast ICA algorithm [102]. The performance of the mixed signals
may degrade due to the randomness of the separation of the source signals in the
fast ICA algorithm. Therefore, the multi-run based on the fast ICA algorithm
[103] was proposed to solve the problem of randomness in the separation of the
source signals and it achieved better performance for biosignal applications under
clean conditions.
The multi-run ICA algorithm iterates the fast ICA algorithm [55] several times, obtaining a different unmixing matrix each time. The source signals can then be estimated by choosing a suitable unmixing matrix, which gives a clear separation of the source signals. The suitable unmixing matrix can be selected by computing the signal to interference ratio (SIR) of each candidate matrix and choosing the one with the highest SIR [103]. The disadvantage of the multi-run ICA based on the fast ICA algorithm is that it is more computationally expensive than the traditional fast ICA algorithm, because it iterates the fast ICA algorithm several times and chooses the unmixing matrix with the largest SIR.
Various algorithms of ICA have been proposed in previous studies such as Fast
ICA [53], information maximization [11] and efficient fast ICA (EFICA) [76].
These algorithms use a fixed nonlinear contrast function or model, which makes them computationally attractive for estimating source signals. However, the quality of separation degrades when the density of the source signals deviates from the assumed underlying model. Recently, the ICA by entropy bound minimization (ICA-EBM) algorithm has been used as an effective technique for source separation [82]. It calculates entropy bounds from four contrast measuring functions (two even and two odd functions). Using four contrast functions allows the ICA-EBM algorithm to provide satisfactory separation performance for a wide variety of distributions without prior knowledge about the source signals. The tightest maximum entropy bound, the one closest to the entropy of the true source, is then chosen as the entropy estimate of the source. The ICA-EBM algorithm can separate source signals drawn from different distributions and achieves separation performance superior to other ICA algorithms [82]. Thus, we use ICA-EBM as the speech enhancement algorithm for separating the noise from the noisy speech signals.
The multi-run ICA based on the fast ICA algorithm and the ICA-EBM algorithm have not previously been investigated in the literature for improving forensic speaker verification performance in the presence of various types of environmental noise and reverberation conditions. In this chapter, the multi-run ICA based on the fast ICA algorithm or the ICA-EBM algorithm is used to reduce the effect of environmental noise. The fusion of feature warping with DWT-MFCC and MFCC is used to extract the features from the enhanced speech signals, and these features are used to train the state-of-the-art length-normalized GPLDA based speaker verification framework.
This chapter is divided into several sections. The multi-run ICA algorithm is presented in Section 6.2. Section 6.3 describes the ICA-EBM algorithm. The methodology of forensic speaker verification based on the multi-run ICA or ICA-EBM algorithm is described in Section 6.4. Section 6.5 presents the results and discussion.
6.2 Multi-run ICA algorithm
Fast ICA is an iterative blind source separation approach [103]. The original source signals are estimated from the mixing matrix at each run. The mixing and unmixing matrices are estimated using the fast ICA algorithm described in Sections 4.6.1 and 4.6.2. The quality of estimation of the source signals depends on the unmixing matrix W. Since the unmixing matrix is estimated from a random initial matrix, there is randomness in the quality of the separation of the estimated source signals.
The multi-run ICA based on the fast ICA algorithm [103] deals with this prob-
lem of randomness. This algorithm is based on computing the mixing matrices
several times and choosing the unmixing matrix that has the maximum quality
of separation. To estimate the enhanced speech from the noisy speech signals, we
require just one mixing matrix. Thus, the best mixing matrix has to be chosen
from among different mixing matrices to obtain better speaker verification perfor-
mance, as shown in Figure 6.1. There are various methods available to compute
the quality of separation of the unmixing matrix. For audio applications, the
SIR was found to be the most popular method to compute the quality of the
separation [24], and this method is used in this chapter. The SIR computation
process will be described in the next section.
Figure 6.1: Flowchart of the multi-run ICA algorithm.
6.2.1 SIR computation
The SIR is the ratio of the power of the wanted signal to the power of the
unwanted signal. The actual SIR can be defined as [141]
$$\mathrm{SIR}_{\mathrm{act}} = \frac{\|S_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2}, \qquad (6.1)$$
where $S_{\mathrm{target}}$ is the modified version of the clean speech signal with an allowed distortion and $e_{\mathrm{interf}}$ is the interference error.
For real forensic applications, the original source signals (speech and noise) and the mixing matrix are unknown. In this scenario, the actual SIR cannot be computed directly, so an estimated SIR is used in place of the actual SIR for real-world data. One independent component can be estimated as:

$$y_i = w_i^T X = (w_i^T A) S = g_i S = g_{ij} s_j, \qquad (6.2)$$
where $y_i$ and $s_j$ are the estimated independent component and the $j$th source signal, respectively, $w_i$ is a row vector of the unmixing matrix $W$, and $g_i$ is a normalized row vector of the global matrix $G = WA$. As $y_i$ is an estimate of the source signal $s_j$, the ideal normalized row vector of the global matrix, $g_i = [0, \cdots, 0, g_{ij}, 0, \cdots, 0]$, is the unit vector $u_j = [0, \cdots, 0, 1, 0, \cdots, 0]$. Thus, one independent component is separated successfully if $g_i$ is close to a unit vector $u_j$.
The quality of separation of each independent component depends on one row vector of the global matrix $G$, and the quality decreases as the difference between each row vector of $G$ and the corresponding unit vector $u_j$ increases [24]. The estimated SIR of each unmixing matrix was calculated using the following expression, which evaluates how successfully one independent component is separated:

$$\mathrm{SIR}_{\mathrm{est}} = -10 \log_{10} \left( \| g_i - u_j \|_2^2 \right). \qquad (6.3)$$
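To make the selection step concrete, here is a minimal sketch of multi-run ICA with SIR-based selection, under stated assumptions: a simulation setting where the mixing matrix A is known (as in Equation 6.11), scikit-learn's FastICA standing in for the fast ICA implementation, and sir_est and multi_run_ica as hypothetical helper names.

```python
# Minimal sketch of multi-run ICA: run fast ICA several times and keep the
# unmixing matrix with the highest estimated SIR (Equation 6.3). Assumes a
# simulation setting where the mixing matrix A is known, so G = W A can be
# formed; helper names are hypothetical.
import numpy as np
from sklearn.decomposition import FastICA

def sir_est(W, A):
    """Average estimated SIR over the rows of the global matrix G = W A (Eq. 6.3)."""
    G = W @ A
    sirs = []
    for g in G:
        g = g / np.linalg.norm(g)            # normalized row vector g_i
        u = np.zeros_like(g)
        j = np.argmax(np.abs(g))
        u[j] = np.sign(g[j])                 # nearest unit vector u_j (sign-matched)
        sirs.append(-10.0 * np.log10(np.linalg.norm(g - u) ** 2 + 1e-12))
    return float(np.mean(sirs))

def multi_run_ica(X, A, n_runs=10):
    """Iterate fast ICA with different random seeds; keep the best unmixing matrix."""
    best_W, best_sir = None, -np.inf
    for seed in range(n_runs):
        ica = FastICA(n_components=X.shape[0], random_state=seed)
        ica.fit(X.T)                         # sklearn expects (n_samples, n_channels)
        W = ica.components_                  # unmixing matrix for this run
        s = sir_est(W, A)
        if s > best_sir:
            best_W, best_sir = W, s
    return best_W, best_sir
```

The enhanced speech would then be recovered as the ICA output channel corresponding to the speech source.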
6.3 ICA-EBM algorithm
By using a small class of nonlinear contrast functions, the ICA-EBM algorithm [82] performs separation of sources that are sub- or super-Gaussian, unimodal or multimodal, and symmetric or skewed. The algorithm uses the entropy bound estimator to approximate the entropies of different types of distribution. The ICA-EBM algorithm minimizes the mutual information of the estimated source signals to estimate the unmixing matrix, and uses a line search procedure that forces the unmixing matrix to be orthogonal for better convergence behavior. The mutual information cost function can be defined as:
$$I(y_1, y_2, \cdots, y_N) = \sum_{n=1}^{N} H(y_n) - \log \left| \det(W) \right| - H(x), \qquad (6.4)$$
where $H(y_n)$ is the differential entropy of the $n$th separated source, and the entropy of the observations $H(x)$ is independent of the unmixing matrix $W$, so it can be treated as a constant $C$. Minimizing the mutual information among the estimated source signals is equivalent to maximizing the log-likelihood cost function as long as the assumed PDF model matches the PDF of the true latent source signal [2]. A bias is introduced into the estimate of the unmixing matrix when the model deviates from the true PDF of the source signal. This bias can be removed by integrating a flexible density model for each source signal into the ICA algorithm, which minimizes the bias of the unmixing matrix and provides accurate separation of source signals from a wide range of PDFs [16].
To achieve this, the cost function and its gradient can be rewritten with respect to each row vector of the unmixing matrix $w_n$, $n = 1, 2, \cdots, N$. This can be done by expressing the volume of the $N$-dimensional parallelepiped spanned by the row vectors of $W$ as the inner product of the $n$th row vector and a unit Euclidean length vector $h_n$ that is perpendicular to all row vectors of the unmixing matrix except $w_n$ [82]. Therefore, the mutual information cost function in Equation 6.4 can be rewritten as a function of only $w_n$:

$$J_n(w_n) = \sum_{n=1}^{N} H(y_n) - \log \left| h_n^T w_n \right| + C. \qquad (6.5)$$
The gradient of Equation 6.5 can be computed as:

$$\frac{\partial J_n(w_n)}{\partial w_n} = -V'_{k(n)}\left\{ E\left[ G_{k(n)}(y_n) \right] \right\} E\left[ g_{k(n)}(y_n)\, x \right] - \frac{h_n}{h_n^T w_n}, \qquad (6.6)$$
where $V'(\cdot)$ and $g_{k(n)}(\cdot)$ are the first-order derivatives of the negentropy $V(\cdot)$ and the $k$th contrast function $G_{k(n)}(\cdot)$, respectively, and $E$ is the expectation operator.
The line search algorithm for the orthogonal ICA-EBM can be obtained from the following equations:

$$w_n^{+} = \frac{E\left[ x\, g_{k(n)}(y_n) \right] - E\left[ g'_{k(n)}(y_n) \right] w_n}{E\left[ g_{k(n)}(y_n)\, y_n \right] - E\left[ g'_{k(n)}(y_n) \right]}, \qquad (6.7)$$

$$w_n^{\mathrm{new}} = \frac{w_n^{+}}{\left\| w_n^{+} \right\|}, \qquad (6.8)$$
where $g'_{k(n)}(\cdot)$ is the second derivative of the $k$th contrast function $G_{k(n)}(\cdot)$.
The line search algorithm for ICA-EBM given in Equations 6.7 and 6.8 iterates over the row vectors of the unmixing matrix until convergence. The convergence criterion is:

$$1 - \max\left( \mathrm{abs}\left[ \mathrm{diag}\left( W^{\mathrm{new}} W^{T} \right) \right] \right) \leq \varepsilon, \qquad (6.9)$$
with a typical value of $\varepsilon = 0.0001$. After each row vector of the unmixing matrix $W$ has been updated, symmetric decorrelation is performed to keep the unmixing matrix orthogonal:

$$W^{\mathrm{new}} = \left( W W^{T} \right)^{-\frac{1}{2}} W. \qquad (6.10)$$
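As a concrete illustration of these last two steps, the following minimal sketch implements the symmetric decorrelation of Equation 6.10 and the convergence test of Equation 6.9; the entropy-bound line-search update of Equation 6.7 is not reproduced here, and the helper names are ours.

```python
# Minimal sketch of two steps of the ICA-EBM line search: the convergence
# test (Eq. 6.9) and symmetric decorrelation (Eq. 6.10). The entropy-bound
# update itself (Eq. 6.7) is omitted.
import numpy as np
from scipy.linalg import inv, sqrtm

def symmetric_decorrelation(W):
    """Eq. 6.10: W_new = (W W^T)^(-1/2) W, keeping the unmixing matrix orthogonal."""
    return np.real(inv(sqrtm(W @ W.T))) @ W

def converged(W_new, W_old, eps=1e-4):
    """Eq. 6.9: stop when 1 - max |diag(W_new W_old^T)| <= eps."""
    return 1.0 - np.max(np.abs(np.diag(W_new @ W_old.T))) <= eps
```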
6.4 Methodology
Experiments were conducted to evaluate the performance of the forensic speaker
verification based on the multi-run ICA or ICA-EBM algorithm in the presence
of various types of environmental noise and reverberant conditions. The system
of speaker verification based on the multi-run ICA or ICA-EBM consists of the
steps shown in Figure 6.2 and described in the following sections.
Figure 6.2: Flowchart of speaker verification based on the multi-run ICA or ICA-EBM algorithm: the noisy surveillance data are enhanced by the multi-run ICA or ICA-EBM algorithm, fusion of feature warping with DWT-MFCC and MFCC is applied to the interview (clean or reverberated) and enhanced surveillance data, i-vectors are extracted, and the length-normalized GPLDA classifier produces the decision.
6.4.1 Multi-run ICA or ICA-EBM speech enhancement
In real forensic situations, the police often record forensic surveillance record-
ings from the criminal using hidden microphones. These recordings are often
mixed with different types of environmental noise. The performance of foren-
sic speaker verification degrades significantly in the presence of different types
of noise. Therefore, it is necessary to use multiple channel speech enhancement
techniques to reduce the effect of noise and improve forensic speaker verification
performance. This thesis uses the multi-run ICA based on the fast ICA algorithm or the ICA-EBM algorithm as a multiple channel speech enhancement algorithm. The enhanced surveillance speech signals from the multi-run ICA and ICA-EBM algorithms were applied to the feature extraction and classifier model, because a forensic expert is able to distinguish the enhanced speech signal from the estimated noise by listening to the output.
6.4.2 Fusion of feature warping with DWT-MFCC and
MFCC
The approach to extracting the features from the interview and enhanced
surveillance recordings is based on the multi-resolution property of the DWT.
The MFCCs were computed over Hamming windowed frames of the inter-
view/enhanced surveillance recordings, with 30 ms size and 10 ms shift. The
MFCCs were obtained using a mel-filterbank of 32 channels, followed by a trans-
formation to the cepstral domain, keeping 13 coefficients. The first and second
derivatives of the cepstral coefficients were appended to the MFCCs. Feature
warping with a 301 frame window was applied to the features extracted from the
MFCCs. The frame of the interview/enhanced surveillance speech signal was de-
composed into two frequency sub bands: low-frequency sub band (approximation
coefficients) and high frequency sub band (detail coefficients). The approxima-
tion and detail coefficients were combined into a single vector. The decomposition
process can be repeated by applying the DWT to the low-frequency sub band.
To capture the essential characteristics of the vocal tract, the feature-warped
MFCCs were applied to the concatenated vector from the DWT. Finally, the
fusion of feature warping with DWT-MFCC and MFCC approach can be per-
formed by combining the feature-warped MFCCs extracted from the full band
of the interview/enhanced surveillance recording with similar features extracted
from the DWT in a single feature vector, as shown in Figure 6.3.
Figure 6.3: Flowchart of fusion of feature warping with DWT-MFCC and MFCC techniques.
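As an illustration of this front-end, the sketch below combines librosa MFCCs, a PyWavelets DWT, and a simplified feature warping. It is a minimal sketch under stated assumptions: the warping maps each feature dimension to a standard normal through its empirical ranks over the whole utterance, rather than the 301-frame sliding window used here, and mfcc_deltas, feature_warp, and fused_features are hypothetical helper names.

```python
# Minimal sketch of the fusion front-end: feature-warped MFCCs from the full
# band concatenated with the same features computed on the DWT coefficients.
# The feature warping here is a simplified whole-utterance variant.
import numpy as np
import librosa
import pywt
from scipy.stats import norm

def mfcc_deltas(y, sr):
    """13 MFCCs with first and second derivatives (39 features per frame)."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=32,
                             n_fft=int(0.03 * sr), hop_length=int(0.01 * sr))
    return np.vstack([m, librosa.feature.delta(m), librosa.feature.delta(m, order=2)])

def feature_warp(F):
    """Map each feature dimension to a standard normal via its empirical ranks."""
    ranks = F.argsort(axis=1).argsort(axis=1) + 1
    return norm.ppf(ranks / (F.shape[1] + 1))

def fused_features(y, sr, level=4, wavelet="db8"):
    """Fuse feature-warped MFCCs from the full band and from the DWT coefficients."""
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    y_dwt = np.concatenate(coeffs)                   # approximation + detail coefficients
    full = feature_warp(mfcc_deltas(y, sr))
    sub = feature_warp(mfcc_deltas(y_dwt, sr))
    n = min(full.shape[1], sub.shape[1])
    return np.vstack([full[:, :n], sub[:, :n]])      # 78 features per frame
```

Stacking the two 39-dimensional streams yields the 78-dimensional feature vectors used in the classifier described next.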
6.4.3 Length-normalized GPLDA classifier
The speaker verification system was conducted with the state-of-the-art length-
normalized GPLDA based speaker verification framework. In the development
phase, the 78-dimensional fusion of feature warping with DWT-MFCC and MFCC features and a UBM containing 256 Gaussian components were used in the experiments.
The UBM was trained on telephone and microphone data using 348 speakers
from the AFVC database [99]. The UBM was used to calculate the Baum-Welch
statistics before training the total variability subspace of dimension 400. The
total variability [31] was used to compute the i-vectors. The i-vectors were pro-
jected into LDA space to reduce the dimension of the i-vectors to 200. The
i-vectors length-normalization was used before GPLDA modeling using centering
and whitening of the i-vectors [43]. In the enrolment and verification phases,
the interview and surveillance speaker models were created from the interview
and enhanced surveillance speech signals to represent them in i-vector subspace.
The hidden parameters of the PLDA were estimated using variational posterior
distribution. Scoring in the length-normalized GPLDA was conducted using the
batch likelihood ratio to calculate the similarity score between the interview and
surveillance speaker models [64].
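To make the length-normalization step concrete, here is a minimal sketch under stated assumptions: the i-vectors have already been projected by LDA, the whitening transform is estimated from a development set, and normalization is centering, whitening, then scaling to unit length; the names are hypothetical.

```python
# Minimal sketch of i-vector length normalization before GPLDA:
# center, whiten, then project onto the unit sphere.
import numpy as np

def length_normalize(ivectors, mean, whitener):
    """Center, whiten, and scale each i-vector to unit length."""
    w = (ivectors - mean) @ whitener.T
    return w / np.linalg.norm(w, axis=1, keepdims=True)

# Example: ZCA-style whitener from a development-set covariance.
dev = np.random.randn(1000, 200)          # stand-in for LDA-projected i-vectors
mean = dev.mean(axis=0)
cov = np.cov(dev - mean, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
normalized = length_normalize(dev, mean, whitener)
```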
6.5 Results and discussion
This section describes the performance of the speaker verification systems based
on multi-run ICA or ICA-EBM algorithm under noisy and reverberant conditions.
The performance of speaker verification systems was evaluated using the EER and
the mDCF, calculated using $C_{\mathrm{miss}} = 10$, $C_{\mathrm{fa}} = 1$, and $P_{\mathrm{target}} = 0.01$.
6.5.1 Noisy and reverberant speaker verification
baselines
To investigate the effectiveness of speaker verification based on the multi-run ICA
or the ICA-EBM algorithm in the presence of environmental noise only and noisy
and reverberant conditions, we compared the performance of speaker verifica-
tion based on the multi-run ICA or the ICA-EBM algorithm with two baselines
(Noise-without-Reverberation and Reverberation-with-Noise speaker verification
systems). The interview speech signals for these baselines were extracted from
full duration utterances from 200 speakers using the pseudo-police style. The
VAD algorithm [130] was used to remove silent regions from the interview speech
signals. The surveillance recordings were obtained from random sessions of 10 sec duration from 200 speakers using the informal telephone conversation style, after removing silent regions using the VAD algorithm. The two baselines are briefly described below.

Figure 6.4: Flowchart of the Noise-without-Reverberation speaker verification baseline.
6.5.1.1 Noise-without-Reverberation speaker verification baseline
The Noise-without-Reverberation speaker verification baseline is obtained by keeping the interview data under clean conditions, while the surveillance recordings were mixed with random sessions of CAR, STREET and HOME noises from the QUT-NOISE database at SNRs ranging from -10 dB to 10 dB. Fusion of feature
warping with DWT-MFCC and MFCC was used to extract the features from the
interview and surveillance recordings. According to our experimental results in
Section 5.4.1.3, the fusion of feature warping with DWT-MFCC and MFCC im-
proved forensic speaker verification performance under noisy conditions compared with other feature extraction techniques. Level 3 and db8 were used in the fusion of feature warping with DWT-MFCC and MFCC, because they achieved better forensic speaker verification performance in the presence of various types of environmental noise over the majority of SNR values compared with the other decomposition levels and wavelet families, as described in Sections 5.4.1.1 and 5.4.1.2.
The interview and surveillance speaker models were created from the speech sig-
nals to represent them in i-vector subspace. Then, the length-normalized GPLDA
and batch likelihood ratio were used to compute the similarity score between those
speaker models, as shown in Figure 6.4.
6.5.1.2 Reverberation-with-Noise speaker verification baseline
The interview recordings were convolved with an impulse response of the first con-
figuration of the room, as described in Table 6.1 using the image source algorithm
[78] at 0.15 sec reverberation time. The surveillance recordings were corrupted
with random segments of STREET, CAR and HOME noises at SNRs ranging
from -10 dB to 10 dB. The fusion of feature warping with DWT-MFCC and MFCC approach was used to extract the features from the interview and surveillance recordings, because this approach improved forensic speaker verification performance in the presence of noise and reverberation conditions compared with other feature extraction techniques, as described in Section 5.4.3.3. Level 4 and db8 were used in the fusion of feature warping with DWT-MFCC and MFCC, because they achieved better forensic speaker verification performance in the presence of various types of environmental noise and reverberant conditions compared with the other levels and wavelet families, as described in Sections 5.4.3.1 and 5.4.3.2. The length-normalized GPLDA and batch likelihood ratio were then used to calculate the similarity score between the i-vectors of the interview and surveillance recordings, as shown in Figure 6.5.
Table 6.1: Reverberation test room parameters.

Configuration | Suspect position (xs, ys, zs) | Microphone position (xm, ym, zm)
1 | (2, 1, 1.3) | (1.5, 1, 1.3)
2 | (2, 1, 1.3) | (2.4, 1, 1.3)
3 | (2, 1, 1.3) | (2.8, 1, 1.3)
4 | (2, 1, 1.3) | (2.8, 2.5, 1.3)
Figure 6.5: Flowchart of the Reverberation-with-Noise speaker verification baseline.
6.5.2 Noise-without-Reverberation conditions
This section describes speaker verification performance based on the multi-run
ICA or ICA-EBM algorithm under Noise-without-Reverberation conditions. We
compared the actual signal to interference ratio (SIRact) and estimated signal
to interference ratio (SIRest) to investigate the effectiveness of estimated SIR
measurement to evaluate forensic speaker verification performance based on the
multi-run ICA algorithm under Noise-without-Reverberation conditions. The
performance of forensic speaker verification based on the multi-run ICA or the
ICA-EBM algorithm is compared with the Noise-without-Reverberation speaker
verification baseline, and traditional ICA. The effect of the utterance length on
the performance of speaker verification based on the multi-run ICA or the ICA-
EBM algorithm is also investigated in this section. The fusion of feature warping with DWT-MFCC and MFCC used level 3 decomposition and db8 to extract the features from the interview and enhanced surveillance recordings, in order to provide a fair comparison with the Noise-without-Reverberation speaker verification baseline.
6.5.2.1 Speaker verification performance based on multi-run ICA (SIRact)
In these simulation results, the full duration of the interview recordings was ob-
tained from 200 speakers using the pseudo-police style. The VAD algorithm was
used to remove silent regions. The interview recordings were kept under clean
conditions. The surveillance recordings were obtained from 10 sec duration from
200 speakers using the informal telephone conversation style after removing silent
regions using the VAD algorithm. The surveillance recordings were mixed with
random sessions of CAR, STREET and HOME noises from the QUT-NOISE at
SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to the following equation:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \\ 1.0 & 0.6 \end{bmatrix} \begin{bmatrix} z(n) \\ e(n) \end{bmatrix}, \qquad (6.11)$$
where $z(n)$ is the clean speech signal and $e(n)$ is the environmental noise.
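As a concrete illustration, the sketch below forms the two observed channels of Equation 6.11; it assumes the clean speech z and the noise e are equal-length arrays already scaled to the desired SNR, and mix_two_channel is a hypothetical helper name.

```python
# Minimal sketch of the two-channel mixing in Equation 6.11. Assumes z and e
# are equal-length numpy arrays already scaled to the target SNR.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.6]])       # mixing matrix from Equation 6.11

def mix_two_channel(z, e):
    """Return the two observed channels x1, x2 for sources z (speech) and e (noise)."""
    S = np.vstack([z, e])        # 2 x n source matrix
    X = A @ S                    # 2 x n observations
    return X[0], X[1]
```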
The performance of forensic speaker verification based on the multi-run ICA (highest SIRact) was compared with the Noise-without-Reverberation speaker verification baseline, traditional ICA, and multi-run ICA (lowest SIRact) in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB, as shown in Figure 6.6. The SNRs on the x-axis of Figure 6.6 were computed from the first microphone (x1).

Figure 6.6: Comparison of forensic speaker verification based on the multi-run ICA (highest SIRact) algorithm with other speaker verification techniques under Noise-without-Reverberation conditions (EER % vs. SNR for STREET, CAR, and HOME noise).
It is clear from Figure 6.6 that the performance of forensic speaker verification
based on the multi-run ICA (highest SIRact) algorithm achieved significant im-
provements in EER over the Noise-without-Reverberation speaker verification
baseline at low SNRs ranging from -10 dB to 0 dB. The reduction in EER for
speaker verification with multi-run ICA (highest SIRact) was 66.68%, 69.24% and
70.78% over the Noise-without-Reverberation speaker verification baseline when
the surveillance recordings were corrupted by CAR, STREET and HOME noises, respectively, at -10 dB SNR. The multi-run ICA (highest SIRact) algorithm significantly
improved EER at -10 dB SNR, because this algorithm is based on choosing the
suitable unmixing matrix that has the maximum SIR. The suitable unmixing
matrix gives a clear separation of noise from the noisy speech signals.
The results also demonstrated that forensic speaker verification performance
based on the multi-run ICA (highest SIRact) improved EER compared with tra-
ditional ICA in the presence of various types of environmental noise at SNRs
ranging from -10 dB to 10 dB. The improvement in EER of speaker verification
based on the multi-run ICA (highest SIRact) decreased when SNR increased and
there was a degradation in the speaker verification performance compared with
the Noise-without-Reverberation speaker verification baseline in the presence of
CAR, STREET and HOME noises at SNRs ranging from 5 dB and 10 dB.
The multi-run ICA (lowest SIRact) degraded the forensic speaker verification per-
formance compared with traditional ICA in the presence of various types of en-
vironmental noise at SNRs ranging from -10 dB to 10 dB. The estimation of the
speech signals from the worst unmixing matrix (lowest SIRact) may occur at any
instance in the traditional ICA algorithm due to the randomness associated with
choosing the unmixing matrix. Thus, the traditional ICA algorithm may fail to
separate the speech from the noisy speech signals in this case, and this leads to
decreased noisy forensic speaker verification performance. The results of compar-
ison of forensic speaker verification based on the multi-run ICA (highest SIRact)
with other techniques in the presence of various types of environmental noise
were published in the 11th International Conference on Signal Processing and Communication Systems. The conference paper is titled “Speaker verification with multi-run ICA based speech enhancement” [3].
6.5.2.2 Speaker verification performance based on multi-run ICA (SIRest)
In order to investigate the effectiveness of forensic speaker verification based on
the multi-run ICA (estimated SIR) under Noise-without-Reverberation condi-
tions, we compared the performance of forensic speaker verification based on the
multi-run ICA (highest SIRest) with the Noise-without-Reverberation speaker ver-
ification baseline, traditional ICA, multi-run ICA (highest SIRact), and the ICA-
EBM algorithm. As described in the previous section, forensic speaker verification
performance based on the multi-run ICA (lowest SIRact) degraded compared with
traditional ICA algorithm in the presence of various types of environmental noises
at SNRs ranging from -10 dB to 10 dB. Therefore, forensic speaker verification
based on the multi-run ICA algorithm (lowest SIRest) will not be investigated in
these simulation results.
The full duration of the interview recordings was obtained from 200 speakers using
the pseudo-police style. The VAD algorithm was used to remove silent regions
from the interview recordings. These data were kept under clean conditions. The
surveillance recordings were obtained from 10 sec duration from 200 speakers
using the informal telephone conversation style after removing silent regions using
the VAD algorithm. The surveillance recordings were corrupted with random
segments of CAR, STREET and HOME noises at SNRs ranging from -10 dB to
10 dB, resulting in two-channel noisy speech signals according to Equation 6.11.
Figure 6.7 shows a comparison of forensic speaker verification based on the multi-run ICA (highest SIRest) with other techniques under noisy conditions. The SNRs on the x-axis of Figure 6.7 were calculated from the first microphone (x1).

Figure 6.7: Comparison of forensic speaker verification based on the multi-run ICA (highest SIRest) algorithm with other speaker verification techniques under noisy conditions (EER % vs. SNR for STREET, CAR, and HOME noise).

It is
clear from this figure that forensic speaker verification based on the multi-run ICA
(highest SIRest) improved EER over the Noise-without-Reverberation speaker
verification baseline in the presence of different types of environmental noise at
low SNRs ranging from -10 dB to 0 dB. The reduction in EER for forensic speaker
verification based on the multi-run ICA (highest SIRest) algorithm was 66.76%,
68.42% and 71.19% over the Noise-without-Reverberation speaker verification
baseline when the surveillance recordings were mixed with CAR, STREET and
HOME noises, respectively at -10 dB SNR. The results also demonstrated that
in the presence of various types of environmental noise over the majority of SNR
values, forensic verification based on the multi-run ICA (highest SIRest) performs
an approximately similar performance to the multi-run ICA (highest SIRact).
Thus, the estimated SIR can be used instead of the actual SIR to evaluate forensic
speaker verification based on the multi-run ICA algorithm under Noise-without-
Reverberation conditions. The performance of forensic speaker verification based
on the multi-run ICA (highest SIRest) degraded compared with the Noise-without-
Reverberation speaker verification baseline at SNRs ranging from 5 dB to 10
dB.
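For reference, the EER used throughout these results is the operating point at which the false acceptance and false rejection rates are equal. A minimal sketch of computing the EER from genuine and impostor PLDA scores is given below; the score distributions are synthetic stand-ins, not scores from this system.

    import numpy as np

    def equal_error_rate(genuine, impostor):
        # Sweep thresholds over all scores; EER is where FAR and FRR cross.
        scores = np.concatenate([genuine, impostor])
        labels = np.concatenate([np.ones(genuine.size), np.zeros(impostor.size)])
        order = np.argsort(scores)
        labels = labels[order]
        frr = np.cumsum(labels) / genuine.size             # genuine rejected at/below t
        far = 1.0 - np.cumsum(1 - labels) / impostor.size  # impostors accepted above t
        idx = np.argmin(np.abs(far - frr))
        return 100.0 * (far[idx] + frr[idx]) / 2.0         # EER in percent

    rng = np.random.default_rng(1)
    eer = equal_error_rate(rng.normal(2.0, 1.0, 2000),     # genuine trial scores
                           rng.normal(0.0, 1.0, 20000))    # impostor trial scores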
When the surveillance recordings were corrupted with different sessions of CAR,
STREET and HOME noises at SNRs ranging from -10 dB to 10 dB, the multi-
run ICA (highest SIRest) algorithm improved the forensic speaker verification
performance over the traditional ICA algorithm. The reduction in EER between
traditional ICA and multi-run ICA (highest SIRest) can be calculated accord-
ing to Equation 5.1. The average reduction in EER between traditional ICA and multi-run ICA (highest SIRest) is computed as the mean EER reduction across the various types of environmental noise at each noise level, as shown in Figure 6.8. The results demonstrated that forensic speaker verification based on the multi-run ICA (highest SIRest) achieved a consistent average reduction in EER over traditional ICA under different types of environmental noise at SNRs ranging from -10 dB to 10 dB. The average reduction in EER for multi-run ICA (highest SIRest) ranged from 15.80% to 6.83% compared with traditional ICA when the surveillance recordings were mixed with different sessions of environmental noise at SNRs ranging from -10 dB to 0 dB.
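Equation 5.1 is not restated here; the sketch below assumes it takes the usual relative form EERred = (EERtraditional − EERproposed)/EERtraditional × 100%, averaged over the noise types at each SNR as plotted in Figure 6.8. The EER values in the example are illustrative placeholders, not the measured results.

    import numpy as np

    def eer_reduction(eer_baseline, eer_proposed):
        # Relative reduction in EER, in percent (assumed form of Equation 5.1).
        return 100.0 * (eer_baseline - eer_proposed) / eer_baseline

    # Rows: CAR, STREET, HOME; columns: SNRs of -10, -5, 0, 5, 10 dB.
    eer_traditional = np.array([[30.1, 25.4, 20.2, 15.8, 12.1],
                                [32.5, 27.0, 21.9, 16.4, 13.0],
                                [33.2, 28.1, 22.5, 17.1, 13.6]])  # placeholder values
    eer_multi_run = np.array([[25.3, 22.1, 18.4, 15.0, 11.8],
                              [27.4, 23.6, 20.0, 15.7, 12.7],
                              [28.0, 24.5, 20.6, 16.3, 13.2]])    # placeholder values

    # Mean EER reduction across noise types at each SNR.
    avg_reduction = eer_reduction(eer_traditional, eer_multi_run).mean(axis=0)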
Figure 6.8: Average reduction in EER for the multi-run ICA (highest SIRest) algorithm over traditional ICA in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
The results also demonstrated that forensic speaker verification based on the ICA-EBM algorithm achieved significant improvements over the Noise-without-Reverberation speaker verification baseline at low SNRs ranging from -10 dB to 0 dB. The reduction in EER for forensic speaker verification based on the ICA-EBM algorithm was 67.21%, 67.16% and 68.44% over
the Noise-without-Reverberation speaker verification baseline when surveillance
recordings were corrupted by CAR, STREET and HOME noises, respectively
at -10 dB SNR. The improvement in EER of speaker verification based on the
ICA-EBM decreased when SNR increased. The performance of speaker veri-
fication based on the ICA-EBM degraded compared with the Noise-without-
Reverberation speaker verification baseline under different types of noise at SNRs
ranging from 5 dB to 10 dB.
Figure 6.9: Average EER reduction for the ICA-EBM algorithm compared with traditional ICA under different levels and types of environmental noise.

Figure 6.9 shows the average reduction in EER for forensic speaker verification based on the ICA-EBM algorithm over traditional ICA in the presence of various types of environmental noise at SNRs ranging from -10 dB to 10 dB. It is clear from this figure that the average EER reduction for the ICA-EBM algorithm ranges from 12.68% to 7.42% compared with conventional ICA under different types of
noise at SNR values from -10 dB to 0 dB. From the above results, it is clear that
ICA-EBM is more desirable than traditional ICA due to its superior separation
performance and more reliable convergence behaviour.
Table 6.2 shows a comparison of mDCFs for speaker verification based on the multi-run ICA (highest SIRest) algorithm and the Noise-without-Reverberation speaker verification baseline under different types of environmental noise at SNRs ranging from -10 dB to 10 dB. It is clear from Table 6.2 that speaker verification based on the multi-run ICA (highest SIRest) significantly improved the mDCF at low SNRs (-10 dB to 0 dB) in the presence of STREET, CAR and HOME noise. The performance of speaker verification based on the multi-run ICA (highest SIRest) degraded compared with the baseline Noise-without-Reverberation speaker verification system over the majority of SNRs ranging from 5 dB to 10 dB.
Table 6.2: Comparison of mDCFs for speaker verification based on the multi-run ICA (highest SIRest) algorithm and the baseline Noise-without-Reverberation speaker verification system, at SNRs of -10, -5, 0, 5 and 10 dB.

Noise Type          Baseline                                      Multi-run ICA
             -10     -5      0       5       10      -10     -5      0       5       10
STREET     0.0996  0.0974  0.0801  0.0590  0.0339  0.0571  0.0586  0.0573  0.0569  0.0551
HOME       0.0991  0.0924  0.0803  0.0511  0.0325  0.0582  0.0577  0.0605  0.0581  0.0596
CAR        0.0995  0.0884  0.0669  0.0389  0.0240  0.0555  0.0550  0.0495  0.0570  0.0551
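The mDCF reported in Table 6.2 is the minimum of the weighted detection cost over all decision thresholds. A minimal sketch is shown below; the cost parameters are assumed NIST SRE-style values (Ptarget = 0.01, Cmiss = 10, Cfa = 1) rather than the exact settings used in this thesis.

    import numpy as np

    def min_dcf(genuine, impostor, p_target=0.01, c_miss=10.0, c_fa=1.0):
        # Evaluate the detection cost at every candidate threshold; keep the minimum.
        thresholds = np.unique(np.concatenate([genuine, impostor]))
        # P(miss): genuine scores below the threshold; P(fa): impostors at or above it.
        p_miss = np.searchsorted(np.sort(genuine), thresholds) / genuine.size
        p_fa = 1.0 - np.searchsorted(np.sort(impostor), thresholds) / impostor.size
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        return dcf.min()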
6.5.2.3 Effect of utterance length
In these simulation results, the full duration of the interview recordings was kept
under clean conditions, while the duration of the surveillance recordings was
changed from 10 sec to 40 sec. The surveillance recordings were corrupted with
random segments of STREET, CAR and HOME noises at SNRs ranging from
-10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11. Since forensic speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms improved EER over the traditional ICA algorithm, as described in Section 6.5.2.2, the effect of utterance length was evaluated only for these two algorithms in this section.
Figures 6.10 and 6.11 show the effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms in the presence of various types of environmental noise. The SNRs on the x-axis of Figures 6.10 and 6.11 were calculated from the first microphone (x1). It is clear that increasing the surveillance utterance duration improved the
performance of forensic speaker verification based on the multi-run ICA (highest
SIRest) and the ICA-EBM algorithms in the presence of STREET, CAR and
HOME noises at SNRs ranging from -10 dB to 10 dB.
Figure 6.10: Effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) algorithm in the presence of various types of environmental noise (surveillance durations of 10, 20 and 40 sec).
6.5.3 Reverberation-with-Noise conditions
This section describes the performance of forensic speaker verification based on
the multi-run ICA or the ICA-EBM algorithm under Reverberation-with-Noise
conditions. The performance of forensic speaker verification compares with the
Reverberation-with-Noise speaker verification baseline and the traditional ICA algorithm.

Figure 6.11: Effect of utterance duration on speaker verification performance based on the ICA-EBM algorithm under different levels and types of noise (surveillance durations of 10, 20 and 40 sec).

The effect of reverberation time and utterance duration on the forensic
speaker verification based on the multi-run ICA or the ICA-EBM algorithm is
also described in this section. The EER and mDCF were used in experimental
results to evaluate the performance of forensic speaker verification systems. The
fusion of feature warping with DWT-MFCC and MFCC used four levels and db8
of DWT to extract the features from the interview and enhanced surveillance
speech signals in order to provide a fair comparison with the Reverberation-with-
Noise speaker verification baseline.
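As a rough illustration of this front end, the sketch below fuses feature-warped MFCC of the full-band signal with feature-warped MFCC computed from the db8 DWT at four levels. It assumes the pywt and librosa packages, assumes the DWT stream takes MFCCs from the level-4 approximation coefficients, and uses a Pelecanos-style sliding-window warping with an assumed 301-frame window; the exact subband handling and warping window follow Chapter 5 and may differ.

    import numpy as np
    import pywt
    import librosa
    from scipy.stats import norm

    def feature_warp(feats, win=301):
        # Map each coefficient's rank within a sliding window to a standard normal.
        n_coeff, n_frames = feats.shape
        warped = np.empty_like(feats)
        half = win // 2
        for t in range(n_frames):
            lo, hi = max(0, t - half), min(n_frames, t + half + 1)
            for c in range(n_coeff):
                window = feats[c, lo:hi]
                rank = np.sum(window < feats[c, t]) + 0.5
                warped[c, t] = norm.ppf(rank / window.size)
        return warped

    def fused_features(signal, sr=8000, n_mfcc=13):
        # Full-band MFCC stream.
        mfcc_full = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        # DWT-MFCC stream: level-4 db8 approximation (assumed subband choice);
        # its effective rate is roughly sr / 2**4.
        approx = pywt.wavedec(signal, "db8", level=4)[0]
        mfcc_dwt = librosa.feature.mfcc(y=approx, sr=sr // 16, n_mfcc=n_mfcc)
        # Warp each stream, align frame counts, and concatenate.
        n = min(mfcc_full.shape[1], mfcc_dwt.shape[1])
        return np.vstack([feature_warp(mfcc_full[:, :n]),
                          feature_warp(mfcc_dwt[:, :n])])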
6.5.3.1 Speaker verification performance based on ICA algorithms
This section describes the performance of forensic speaker verification based on
the ICA algorithms in Reverberation-with-Noise conditions. The full duration of
the interview recordings was obtained from 200 speakers using the pseudo-police
style. Silent regions from the interview recordings were removed using the VAD
algorithm. The interview recordings were convolved with the impulse response of
the room at 0.15 sec reverberation time to generate the reverberated speech. The
first configuration of the room is used in these experimental results, as shown in
Table 6.1.
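A minimal sketch of this reverberation step is given below. It assumes the room impulse response has already been computed (in the thesis, from an image-source simulation of the Table 6.1 room); here a crude exponentially decaying noise sequence stands in for the true impulse response.

    import numpy as np
    from scipy.signal import fftconvolve

    def reverberate(speech, rir):
        # Convolve clean speech with the room impulse response and trim to length.
        rev = fftconvolve(speech, rir, mode="full")[: len(speech)]
        # Keep the reverberated signal at the original RMS level.
        return rev * np.sqrt(np.mean(speech ** 2) / np.mean(rev ** 2))

    fs = 8000                                # assumed sampling rate
    t60 = 0.15                               # reverberation time in seconds
    rng = np.random.default_rng(2)
    t = np.arange(int(t60 * fs)) / fs
    # Stand-in RIR: noise with a -60 dB energy decay at t60 (not an image-source RIR).
    rir = rng.standard_normal(t.size) * np.exp(-6.9 * t / t60)
    speech = rng.standard_normal(5 * fs)     # stand-in interview segment
    reverberated = reverberate(speech, rir)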
The surveillance recordings were 10 sec segments obtained from 200 speakers using the informal telephone conversation style after removing silent regions using the VAD algorithm. The surveillance recordings were mixed with one random session of the environmental noises (CAR, STREET and HOME) from the QUT-NOISE database [29] at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11. The noisy surveillance recordings were kept without reverberation because, in most real forensic situations, noisy surveillance recordings are recorded in open areas [3].
Figure 6.12 shows the experimental results for PLDA speaker verification when
interview recordings reverberated at 0.15 sec reverberation time and the surveil-
lance recordings were mixed with various types of environmental noise at SNRs
ranging from -10 dB to 10 dB. The SNRs on the x-axis of Figure 6.12 were computed from the first microphone (x1).

Figure 6.12: Experimental results for PLDA speaker verification when interview recordings were reverberated at 0.15 sec reverberation time and surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB.

The results showed that the performance of speaker verification based on the multi-run ICA (highest SIRact
and highest SIRest) improved forensic speaker verification performance over the
Reverberation-with-Noise speaker verification baseline in the presence of different
types of environmental noise at low SNRs ranging from -10 dB to 0 dB and rever-
beration conditions. At -10 dB, the performance of speaker verification based on
the multi-run ICA (highest SIRact) significantly reduced EER by 60.88%, 51.84%,
66.15% over the Reverberation-with-Noise speaker verification baseline, respec-
tively, in the presence of STREET, CAR and HOME noises. The improvement
in EER for the speaker verification based on the multi-run ICA decreased over
the Reverberation-with-Noise speaker verification baseline when SNR increased.
Forensic speaker verification performance based on the multi-run ICA degraded compared with the Reverberation-with-Noise speaker veri-
fication baseline at high SNRs ranging from 5 dB to 10 dB for most types of envi-
ronmental noise. The results of comparison of forensic speaker verification based
on the multi-run ICA (highest SIRact) with Reverberation-with-Noise speaker
verification baseline in the presence of various types of environmental noise and
reverberation conditions were published in the IEEE International Conference
on Signal and Image Processing Applications. The conference paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
The multi-run ICA (highest SIRact and highest SIRest) algorithm improved the
forensic speaker verification performance over traditional ICA when the surveil-
lance recordings were mixed with various sessions of CAR, STREET and HOME
noises at SNRs ranging from -10 dB to 10 dB and interview recordings reverber-
ated at 0.15 sec. The reduction in EER between traditional ICA and multi-run
ICA (highest SIRest) can be calculated using Equation 5.1. The average reduction in EER is the mean EER reduction across the different types of environmental noise at each noise level, as shown in Figure 6.13. The results showed that forensic speaker verification based on the multi-
run ICA (highest SIRest) achieved a 2.20% to 6.39% improvement in average EER
over traditional ICA when the surveillance recordings were corrupted by different
types of environmental noise at SNRs ranging from -10 dB to 0 dB and interview
recordings reverberated at 0.15 sec reverberation time.
Figure 6.13: Average reduction in EER for the multi-run ICA (highest SIRest) algorithm over traditional ICA when interview recordings were reverberated at 0.15 sec and the surveillance recordings were mixed with different types of environmental noise at SNRs ranging from -10 dB to 10 dB.

The results also demonstrated that the performance of speaker verification based
on the ICA-EBM algorithm improved EER over the Reverberation-with-Noise
speaker verification baseline at low SNR values (-10 dB to 0 dB). At -10 dB SNR, the reduction in EER for forensic speaker verification based on the ICA-EBM algorithm was 64.56%, 54.07% and 62.63% over the baseline Reverberation-with-Noise speaker verification system, respectively, in the presence of STREET, CAR and
HOME noises. The improvement in the performance decreased when SNR in-
creased. The performance of speaker verification based on the ICA-EBM algo-
rithm degraded compared with the Reverberation-with-Noise speaker verification
baseline at SNRs ranging from 5 dB to 10 dB.
Figure 6.14 shows average EER reduction for the ICA-EBM algorithm compared
with the traditional ICA algorithm for different types of noise and reverberation.
When interview recordings reverberated at 0.15 sec and the surveillance record-
ings were corrupted with different types of noise at SNRs ranging from -10 dB to
0 dB, the performance of speaker verification based on the ICA-EBM achieved av-
erage EER reduction ranging from 7.25% to 8.31% compared with the traditional
ICA.
Figure 6.14: Average EER reduction for the ICA-EBM algorithm compared with traditional ICA for different types of noise and reverberation.

Figure 6.15: Average reduction in EER for the ICA-EBM algorithm over the multi-run ICA (highest SIRest) in the presence of various types of environmental noise at SNRs ranging from -10 dB to 0 dB and reverberation conditions.
Figure 6.15 shows average reduction in EER for forensic speaker verification based
on the ICA-EBM algorithm over multi-run ICA (highest SIRest). The results
demonstrated that forensic speaker verification based on the ICA-EBM algorithm reduced the average EER by 5.11% to 2.04% when interview recordings were reverberated at 0.15 sec reverberation time and surveillance recordings were mixed
with various types of environmental noise at SNRs ranging from -10 dB to 0 dB.

Table 6.3: Comparison of mDCFs for speaker verification based on the ICA-EBM algorithm and the Reverberation-with-Noise speaker verification baseline, at SNRs of -10, -5, 0, 5 and 10 dB.

Noise Type          Baseline                                      ICA-EBM
             -10     -5      0       5       10      -10     -5      0       5       10
STREET     0.0994  0.0951  0.0852  0.0685  0.0497  0.0691  0.0677  0.0680  0.0686  0.0682
HOME       0.1000  0.0953  0.0844  0.0665  0.0527  0.0706  0.0705  0.0704  0.0701  0.0705
CAR        0.0989  0.0904  0.0697  0.0545  0.0425  0.0666  0.0672  0.0679  0.0680  0.0689
Since forensic speaker verification based on the ICA-EBM algorithm achieved the highest improvement in performance compared with the multi-run ICA (highest SIRest) over the majority of SNRs ranging from -10 dB to 0 dB, as shown in Figure 6.12, we compared the mDCF of speaker
verification based on the ICA-EBM algorithm with the Reverberation-with-Noise
speaker verification baseline, as shown in Table 6.3. It is clear that speaker verifi-
cation performance based on the ICA-EBM algorithm improved mDCF over the
Reverberation-with-Noise speaker verification baseline when interview recordings
reverberated at 0.15 sec reverberation time and the surveillance recordings were
mixed with various types of environmental noise at low SNR values (-10 dB to
0 dB). Forensic speaker verification performance based on the ICA-EBM algo-
rithm degraded compared with the Reverberation-with-Noise speaker verification
baseline at SNRs ranging from 5 dB to 10 dB.
6.5.3.2 Effect of utterance length
In real forensic scenarios, the full-length utterance of the speech signals from a
suspect is often recorded in a police interview room where reverberation is present.
However, the police often record the speech from the criminal using hidden mi-
crophones in the presence of different types of noise and the utterance length of
the speech signals is uncontrolled.

Figure 6.16: Effect of utterance duration on the performance of speaker verification based on the multi-run ICA (highest SIRest) in the presence of various types of environmental noise and reverberation conditions.

Therefore, in these experimental results, the
full length of the interview speech signal was convolved with the impulse response
of the room at 0.15 sec reverberation time to generate the reverberated speech
signals. The first room configuration shown in Table 6.1 was used in these simulation results. The duration of the surveillance recordings was changed
from 10 sec to 40 sec. The surveillance recordings were mixed with one random
segment of STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals according to Equation 6.11.

Figure 6.17: Effect of utterance duration on speaker verification performance when using the ICA-EBM algorithm under different types of noise and reverberation environments.
We investigated the effect of utterance length on the performance of forensic speaker verification based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms because these algorithms achieved better performance than the traditional ICA algorithm, as described in Section 6.5.3.1. Figures 6.16 and 6.17 show the effect of utterance length on the performance of speaker verifi-
cation based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms
in the presence of various types of environmental noise and reverberation condi-
tions. The SNRs on the x-axis of Figures 6.16 and 6.17 were computed from the
first microphone (x1). It is evident that the performance of speaker verification
based on the multi-run ICA (highest SIRest) and the ICA-EBM algorithms im-
proved when the interview recordings reverberated at 0.15 sec and the duration
of the surveillance recordings increased from 10 sec to 40 sec in the presence of
various types of environmental noise at SNRs ranging from -10 dB to 10 dB.
6.5.3.3 Effect of reverberation time
This section describes the effect of reverberation time on the performance of foren-
sic speaker verification based on the multi-run ICA (highest SIRact and highest
SIRest) and the ICA-EBM algorithms. To validate the effect of reverberation
time on the performance of the multi-run ICA and the ICA-EBM algorithms, we
computed the impulse response of a room by using different reverberation times
(T20 = 0.15 sec, 0.20 sec and 0.25 sec). The first room configuration shown in Table 6.1 was used in these experimental results. The full duration of the in-
terview recordings was obtained from 200 speakers using the pseudo-police style.
The interview recordings were convolved with the impulse response of the room to
produce reverberated interview recordings at different reverberation times. The
surveillance recordings were 10 sec segments obtained from 200 speakers using the informal telephone conversation style. The surveillance recordings were mixed with one random segment of STREET, CAR and HOME noises at SNRs ranging from -10 dB to 10 dB, resulting in two-channel noisy speech signals
according to Equation 6.11.
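To illustrate how impulse responses at several reverberation times can be generated, the sketch below uses the pyroomacoustics image-source implementation. This is not the simulator used in the thesis: the room geometry and source/microphone positions are illustrative rather than the Table 6.1 configuration, and the library targets T60 via Sabine's formula, whereas the experiments above quote T20.

    import numpy as np
    import pyroomacoustics as pra

    fs = 8000
    room_dim = [5.0, 4.0, 3.0]                 # illustrative room geometry
    rirs = {}
    for rt60 in [0.15, 0.20, 0.25]:            # target reverberation times (sec)
        # Invert Sabine's formula to get wall absorption and image-source order.
        e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
        room = pra.ShoeBox(room_dim, fs=fs,
                           materials=pra.Material(e_absorption),
                           max_order=max_order)
        room.add_source([2.0, 2.0, 1.5])       # assumed suspect position
        room.add_microphone([3.5, 1.5, 1.2])   # assumed microphone position
        room.compute_rir()
        rirs[rt60] = np.asarray(room.rir[0][0])  # RIR from source 0 to mic 0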
Figure 6.18: Effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRact) when interview recordings were reverberated at different reverberation times (T = 0.15, 0.20 and 0.25 sec) and surveillance recordings were mixed with various types of environmental noise.

Figure 6.18 shows the effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRact) when
interview recordings were reverberated at different reverberation times and surveillance recordings were mixed with various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The SNRs on the x-axis of Figure 6.18 were computed
from the first microphone (x1). It is clear from this figure that increasing the re-
verberation time degrades the performance of noisy forensic speaker verification
based on the multi-run ICA (highest SIRact).

Figure 6.19: Effect of reverberation time on the performance of forensic speaker verification based on the multi-run ICA (highest SIRest) when interview recordings were reverberated at different reverberation times and surveillance recordings were mixed with various types of environmental noise.

The performance of the multi-run
ICA algorithm degraded by 12.96%, 16.96% and 16.64% when the reverberation
time increased from 0.15 sec to 0.25 sec and surveillance recordings were mixed
with STREET, CAR and HOME noises at 0 dB SNR. The reverberation time parameter reflects the length of the room impulse response, and a high reverberation time leads to increased distortion in the feature vectors [121]. Accordingly, increasing the reverberation time decreases the performance of foren-
sic speaker verification systems. The results of the effect of reverberation time
on the performance of forensic speaker verification based on the multi-run ICA
(highest SIRact) were published in the IEEE International Conference on Signal and Image Processing Applications. The conference paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
Figure 6.19 shows the effect of reverberation time on the performance of forensic
speaker verification based on the multi-run ICA (highest SIRest) when the in-
terview recordings were reverberated at different reverberation times and surveillance recordings were mixed with different types of environmental noise at SNRs ranging from
-10 dB to 10 dB. It is clear from this figure that as the reverberation time in-
creased from 0.15 sec to 0.25 sec, forensic speaker verification performance based
on the multi-run ICA (highest SIRest) decreased.
Figure 6.20 shows the effect of reverberation time on speaker verification per-
formance when using the ICA-EBM algorithm under conditions of noise and
reverberation. The SNRs on the x-axis of Figure 6.20 were computed from the first microphone (x1). It is clear that the noisy forensic speaker verification per-
formance based on the ICA-EBM degrades as the reverberation time increases.
At -10 dB SNR, the ICA-EBM performance degraded by 19.17%, 17.07% and 16.40% when the reverberation time increased from 0.15 sec to 0.25 sec and
surveillance recordings were mixed with STREET, CAR and HOME noises, re-
spectively.
Figure 6.20: Effect of reverberation time on speaker verification performance when using the ICA-EBM under conditions of noise and reverberation.
6.6 Chapter summary
In this chapter, forensic speaker verification systems based on the multi-run ICA and ICA-EBM approaches were developed for improving forensic speaker verification
performance in the presence of high levels of environmental noise and reverberant
conditions. The performance of speaker verification based on the multi-run ICA
or the ICA-EBM algorithm was compared with the baselines (Noise-without-
Reverberation and Reverberation-with-Noise speaker verification systems), and
traditional ICA algorithm. Forensic speaker verification was evaluated in the
presence of various types of environmental noise (STREET, CAR and HOME)
at SNRs ranging from -10 dB to 10 dB and reverberation conditions.
Experimental results demonstrated that the performance of forensic speaker ver-
ification based on the multi-run ICA or the ICA-EBM algorithm improved EER
over the baseline Noise-without-Reverberation speaker verification system at low
SNRs ranging from -10 dB to 0 dB. The results also showed that forensic speaker
verification based on the multi-run ICA algorithm improved EER compared with
the traditional ICA algorithm because the multi-run ICA algorithm solved the
problem of randomness in the separation of the speech from the noisy speech
signals in the traditional ICA algorithm by choosing the most suitable unmixing
matrix that had the highest SIR. The selection of the suitable unmixing matrix
gives a clear separation of the clean speech from the noisy speech signal and leads
to improved forensic speaker verification performance under different types of en-
vironmental noise. Some experimental results of comparison of forensic speaker
verification based on the multi-run ICA algorithm with other techniques were
published in the 11th International Conference on Signal Processing and Com-
munication Systems. The paper is titled “Speaker verification with multi-run ICA based speech enhancement” [3].
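A minimal sketch of this selection strategy is given below, using scikit-learn's FastICA as a stand-in for the ICA implementation in this thesis. Each run starts from a different random initialization, and the run whose estimated source attains the highest actual SIR against a known clean reference (SIRact) is kept; the estimated-SIR variant (SIRest) would replace sir_db with a blind estimator.

    import numpy as np
    from sklearn.decomposition import FastICA

    def sir_db(estimate, reference):
        # Actual SIR: energy of the projection onto the clean reference vs. the rest.
        ref = reference / np.linalg.norm(reference)
        target = np.dot(estimate, ref) * ref
        interference = estimate - target
        return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))

    def multi_run_ica(x, clean_speech, n_runs=10):
        # x: (2, T) two-channel noisy mixture; clean_speech: reference for SIRact.
        best_sir, best_speech = -np.inf, None
        for seed in range(n_runs):
            ica = FastICA(n_components=2, random_state=seed, max_iter=500)
            sources = ica.fit_transform(x.T).T      # each row is an estimated source
            for s in sources:                       # keep the most speech-like output
                sir = sir_db(s, clean_speech)
                if sir > best_sir:
                    best_sir, best_speech = sir, s
        return best_speech, best_sir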
The results also demonstrated that forensic speaker verification based on the ICA-EBM algorithm improved performance over the traditional ICA algorithm when surveillance recordings were corrupted with various types of environmental noise at SNRs ranging from -10 dB to 10 dB. The ICA-EBM algorithm is well suited to noisy speech separation applications due to its good convergence behavior. Speech/audio signals are usually either super-Gaussian or slightly skewed in nature and, hence, are separated well by ICA-EBM compared with traditional ICA methods, owing to its tighter entropy bounds and superior convergence properties.
The results also demonstrated that forensic speaker verification performance
based on the multi-run ICA or ICA-EBM algorithm degraded compared with the
baselines of Noise-without-Reverberation and Reverberation-with-Noise speaker
verification under various types of environmental noise at high SNRs ranging from 5 dB to 10 dB, in both noise-only and noisy reverberant conditions. This degradation in the performance of the speaker verification systems was due to the poor separation of the enhanced speech signals from the noise signals at high SNRs.
The performance of speaker verification based on the multi-run ICA or ICA-
EBM algorithm was also evaluated in the Reverberation-with-Noise conditions.
The algorithms also improved EER over the Reverberation-with-Noise speaker
verification baseline and traditional ICA at low SNRs ranging from -10 dB to 0
dB. Thus, forensic speaker verification based on the multi-run ICA or the ICA-
EBM algorithm can be used for improving forensic speaker verification perfor-
mance, especially when surveillance recordings are corrupted by different types
of environmental noise at low SNRs. Some results of comparison of forensic
speaker verification based on the multi-run ICA algorithm with the baseline of
Reverberation-with-Noise were published in the IEEE International Conference
on Signal and Image Processing Applications. The paper is titled “Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions” [7].
Chapter 7
Conclusions and future directions
7.1 Introduction
This chapter gives a summary of the work presented in this thesis and the con-
clusions drawn from it. The summary describes the three research contributions
identified in Chapter 1: designing noisy and reverberant frameworks for simulating forensic speaker verification performance under real-world situations, proposing a new fusion of feature warping with DWT-MFCC and MFCC, and introducing the multi-run ICA and ICA-EBM algorithms as multiple channel speech enhancement algorithms to improve forensic speaker verification performance in the presence of high levels of environmental noise and reverberation conditions.
7.2 Original Contributions
The main contributions resulting from this work are as follows:
7.2.1 Designing noisy and reverberant frameworks
As the forensic audio recordings from the AFVC database contain clean speech
signals, forensic speaker verification performance cannot be evaluated in the pres-
ence of various types of environmental noise and reverberant conditions. There-
fore, it is necessary to design noisy and reverberant frameworks from the AFVC
and QUT-NOISE databases for evaluating the robustness of forensic speaker ver-
ification performance under noisy and reverberant conditions. The noisy and
reverberant frameworks based on the single and multiple microphones were de-
signed in Chapter 3 and were used to simulate the performance of forensic speaker
verification based on robust feature extraction techniques and the ICA algorithms
under real-world situations in the presence of different levels and types of environ-
mental noise and reverberation conditions. The noisy and reverberant frameworks
based on the single and multiple microphones were presented in “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] in the IEEE Access journal, “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6] in the IEEE International Conference on Signal and Image Processing Applications, “Speaker verification with multi-run ICA based speech enhancement” [3] in the 11th International Conference on Signal Processing and Communication Systems, and “Enhanced forensic speaker verification using multi-run ICA in the presence of noise and reverberation conditions” [7] in the IEEE International Conference on Signal and Image Processing Applications.
7.2.2 DWT-MFCC feature warping
Chapter 5 investigated the effect of feature warping on DWT-MFCC and MFCC
features, demonstrating that while feature warping has no or detrimental effect
on DWT-MFCC alone, it increases the complementary features, allowing for bet-
ter performance in fusion with feature-warped MFCC features under noisy and
reverberant conditions. These findings were published as “Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions” [5] in the IEEE Access journal and “Hybrid DWT and MFCC feature warping for noisy forensic speaker verification in room reverberation” [6] in the IEEE International Conference on Signal and Image Processing Applications.
The new fusion technique of feature warping with DWT-MFCC and MFCC was
also proposed to improve forensic speaker verification performance in the presence
of various types of environmental noise only. The proposed technique is based
on combining the feature-warped MFCC of the full band of the speech signals
with the same features extracted from the DWT. The performance of forensic
speaker verification based on the fusion of feature warping with DWT-MFCC
and MFCC was compared with different feature extraction techniques: MFCC,
feature-warped MFCC, DWT-MFCC, feature-warped DWT-MFCC, and fusion
of DWT-MFCC and MFCC. The results demonstrated that fusion of feature
warping with DWT-MFCC and MFCC improved forensic speaker verification
performance in the presence of various types of environmental noise under the
majority of SNRs compared with other feature extraction techniques.
The robustness of fusion of feature warping with DWT-MFCC and MFCC was
also investigated and compared with other feature extraction techniques to im-
prove forensic speaker verification performance under reverberation conditions
only. Experimental results showed that fusion of feature warping with DWT-
MFCC and MFCC improved EER compared with other feature extraction tech-
niques. The effect of reverberation time and the position of suspect/microphone
on forensic speaker verification performance based on the fusion of feature warp-
ing with DWT-MFCC and MFCC was also studied.
Forensic speaker verification performance based on the fusion of feature warping
with DWT-MFCC and MFCC was also investigated and compared with the tradi-
tional MFCC and feature-warped MFCC under noisy and reverberant conditions.
Simulation results demonstrated that fusion of feature warping with DWT-MFCC
and MFCC improved EER and mDCF compared with feature-warped MFCC in
the presence of reverberation and different types and levels of environmental
noise. The proposed fusion of feature warping with DWT-MFCC and MFCC can
be used as the feature extraction technique in forensic speaker verification based
on the multi-run ICA and ICA-EBM algorithms, which are described in the next section.
7.2.3 Multi-run ICA and ICA-EBM algorithms
The performance of forensic speaker verification systems degrades significantly in
the presence of high levels of environmental noise and reverberant conditions. It
is difficult to use noisy forensic recordings as part of legal evidence in real forensic
applications because the quality of these recordings is often poor. Multi-run ICA
and ICA-EBM algorithms were used as multiple channel speech enhancement
algorithms in Chapter 6 to reduce the effect of environmental noise and improve
forensic speaker verification performance.
Although the multi-run ICA algorithm was used in biosignal applications and the
ICA-EBM algorithm was used to separate the mixed speech signals under clean
conditions, the effectiveness of the multi-run ICA and ICA-EBM algorithms was
investigated for the first time as speech enhancement algorithms in this thesis
to improve forensic speaker verification performance under noisy and reverberant
conditions.
Investigations were also performed to study the effectiveness of the developed
forensic speaker verification based on the multi-run ICA or the ICA-EBM algo-
rithm for improving forensic speaker verification performance under noisy con-
ditions. The developed forensic speaker verification systems were also compared
with the baseline Noise-without-Reverberation speaker verification system and the
traditional ICA algorithm. Simulation results demonstrated that forensic speaker
verification based on the multi-run ICA or ICA-EBM algorithm improved forensic
speaker verification performance compared with the baseline of Noise-without-
Reverberation speaker verification system in the presence of different types of
environmental noise at SNRs ranging from -10 dB to 0 dB. The results also
demonstrated that forensic speaker verification based on the multi-run ICA or
ICA-EBM algorithm improved the performance of speaker verification over tra-
ditional ICA in the presence of environmental noise at SNRs ranging from -10 dB
to 10 dB. The outcomes of forensic speaker verification performance based on the
multi-run ICA in the presence of various types of environmental noise were pub-
lished as “Speaker verification with multi-run ICA based speech enhancement”
[3] in the 11th International Conference on Signal Processing and Communication
Systems.
The performance of forensic speaker verification based on the multi-run ICA and ICA-EBM algorithms was also evaluated in the presence of various types of envi-
ronmental noise and reverberation conditions. Simulation results demonstrated
that forensic speaker verification based on the multi-run ICA or ICA-EBM al-
gorithm improved forensic speaker verification performance compared with the
baseline of Reverberation-with-Noise speaker verification system in the presence
of reverberation and different types of environmental noise at SNRs ranging from
-10 dB to 0 dB. The results also demonstrated that forensic speaker verification
based on the multi-run ICA or ICA-EBM algorithm improved the performance of
speaker verification over traditional ICA in the presence of noise at SNRs rang-
ing from -10 dB to 10 dB and reverberation conditions. Some results of forensic
speaker verification performance based on the multi-run ICA algorithm in the
presence of environmental noise and reverberation conditions were published as
“Enhanced forensic speaker verification using multi-run ICA in the presence of
noise and reverberation conditions” [7] in the IEEE International Conference on
Signal and Image Processing Applications.
7.3 Future directions
This study has developed techniques to improve forensic speaker verification per-
formance in the presence of high levels of environmental noise and reverberation
conditions. Some potential areas of future research include:
• This research proposed a speaker verification system based on a feature extraction technique that is robust to noise and reverberation conditions. It also investigated the effectiveness of a speaker verification system based on the multi-run ICA or ICA-EBM algorithm for improving forensic speaker verification performance under noisy and reverberant conditions. These systems were evaluated using the EER and mDCF. It would also be useful to evaluate forensic speaker verification performance using the forensic-eval-01 evaluation described in [97].
• Although the DNN based i-vector did not show significant improvement in forensic speaker verification performance compared with the UBM based i-vector when a limited amount of forensic data is available for training the DNN, it would be interesting to investigate the effectiveness of applying the proposed fusion of feature warping with DWT-MFCC and MFCC to the DNN posterior mapping approach described in [145] for improving forensic speaker verification performance in the presence of various types of environmental noise and reverberation conditions.
• Since the proposed speaker verification based on the multi-run ICA or ICA-EBM algorithm degraded compared with the baselines of Noise-without-Reverberation and Reverberation-with-Noise speaker verification systems at SNRs ranging from 5 dB to 10 dB, an SNR estimate could be computed before the speech enhancement stage to decide whether the multi-run ICA or ICA-EBM algorithm should be applied to reduce the effect of noise on the noisy speech signals (a minimal gating sketch is given after this list).
• The performance of forensic speaker verification based on the multi-run ICA or ICA-EBM algorithm was evaluated when the interview data were reverberated and the surveillance speech signals were corrupted by various types of environmental noise without reverberation. It would also be interesting to evaluate the performance of forensic speaker verification based on a convolutive ICA algorithm when the surveillance data are corrupted by environmental noise in the presence of reverberation conditions.
• The computational complexity of the fusion of feature warping with DWT-MFCC and MFCC features could be calculated and compared with the computational complexity of the other fusion feature extraction techniques described in Chapter 5.
• The performance of forensic speaker verification systems based on a robust feature extraction technique and the multi-run ICA or ICA-EBM algorithm could be evaluated using an adaptive linear energy detector for voice activity detection, and the results compared with those obtained from the statistical voice activity detection algorithm used in this thesis.
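A minimal sketch of the SNR-gating idea raised above is given below. The energy-quantile SNR estimator and the 5 dB switching threshold are assumptions for illustration; the thesis does not prescribe a particular estimator.

    import numpy as np

    def estimate_snr_db(x, frame=256, noise_quantile=0.2):
        # Crude estimate: treat the lowest-energy frames as noise-only.
        n_frames = len(x) // frame
        energies = (x[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
        noise_p = max(np.quantile(energies, noise_quantile), 1e-12)
        speech_p = max(energies.mean() - noise_p, 1e-12)
        return 10.0 * np.log10(speech_p / noise_p)

    def gated_enhancement(x, enhance, snr_threshold_db=5.0):
        # Apply multi-run ICA / ICA-EBM enhancement only when the estimated SNR is low.
        if estimate_snr_db(x[0]) < snr_threshold_db:
            return enhance(x)        # x: (2, T) two-channel noisy recording
        return x[0]                  # high SNR: use the first microphone directly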
Bibliography
[1] M. I. Abdalla, H. M. Abobakr, and T. S. Gaafar, “DWT and MFCCs
based feature extraction methods for isolated word recognition,” Interna-
tional Journal of Computer Applications, vol. 69, no. 20, pp. 21–26, 2013.
[2] T. Adali, M. Anderson, and G.-S. Fu, “Diversity in independent component
and vector analyses: Identifiability, algorithms, and applications in medical
imaging,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 18–33, 2014.
[3] A. K. H. Al-Ali, D. Dean, B. Senadji, M. Baktashmotlagh, and V. Chan-
dran, “Speaker verification with multi-run ICA based speech enhancement,”
in 11th International Conference on Signal Processing and Communication
Systems, pp. 1–7, 2017.
[4] A. K. H. Al-Ali, D. Dean, B. Senadji, and V. Chandran, “Comparison of
speech enhancement algorithms for forensic applications,” Proceedings of the
16th Australian International Conference on Speech Science and Technology,
pp. 169–172, December, 2016.
[5] A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran, and G. R. Naik, “En-
hanced forensic speaker verification using a combination of DWT and MFCC
feature warping in the presence of noise and reverberation conditions,” IEEE
Access, vol. 5, no. 99, pp. 15400–15413, 2017.
[6] A. K. H. Al-Ali, B. Senadji, and V. Chandran, “Hybrid DWT and MFCC
feature warping for noisy forensic speaker verification in room reverbera-
tion,” in IEEE International Conference on Signal and Image Processing
Applications, pp. 434–439, 2017.
[7] A. K. H. Al-Ali, B. Senadji, and G. R. Naik, “Enhanced forensic speaker
verification using multi-run ICA in the presence of environmental noise and
reverberation conditions,” in IEEE International Conference on Signal and
Image Processing Applications, pp. 174–179, 2017.
[8] S. S. Alamri, “Text-independent, automatic speaker recognition system eval-
uation with males speaking both Arabic and English,” M.S. thesis, Univer-
sity of Colorado Denver, USA, 2015.
[9] H. Ali, N. Ahmad, X. Zhou, K. Iqbal, and S. M. Ali, “DWT features per-
formance analysis for automatic speech recognition of Urdu,” SpringerPlus,
vol. 3, pp. 1–10, 2014.
[10] T. Awasthy and A. Kumar, “Analysis of Fast-ICA algorithm for separation of
mixed images,” International Journal of Electronics and Computer Science
Engineering, pp. 1252–1256.
[11] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to
blind separation and blind deconvolution,” Neural computation, vol. 7, no. 6,
pp. 1129–1159, 1995.
[12] A. Benyassine, E. Shlomot, H.-Y. Su, D. Massaloux, C. Lamblin, and J.-
P. Petit, “ITU-T recommendation G. 729 Annex B: a silence compression
scheme for use with G. 729 optimized for V. 70 digital simultaneous voice
and data applications,” IEEE Communications Magazine, vol. 35, no. 9,
pp. 64–73, 1997.
[13] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech cor-
rupted by acoustic noise,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 208–211, 1979.
[14] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau,
S. Meignier, T. Merlin, J. Ortega-Garcıa, D. Petrovska-Delacretaz, and D. A.
Reynolds, “A tutorial on text-independent speaker verification,” EURASIP
Journal on Applied Signal Processing, vol. 2004, no. 4, pp. 430–451, 2004.
[15] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27,
no. 2, pp. 113–120, 1979.
[16] Z. Boukouvalas, R. Mowakeaa, G.-S. Fu, and T. Adali, “Independent com-
ponent analysis by entropy maximization with kernels,” arXiv preprint
arXiv:1610.07104, pp. 1–6, 2016.
[17] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brummer,
“Discriminatively trained probabilistic linear discriminant analysis for
speaker verification,” in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4832–4835, 2011.
[18] J. P. Campbell, “Speaker recognition: A tutorial,” Proceedings of the IEEE,
vol. 85, no. 9, pp. 1437–1462, 1997.
[19] J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J.-F. Bonastre,
and D. Matrouf, “Forensic speaker recognition,” IEEE Signal Processing
Magazine, vol. 26, no. 2, pp. 95–103, 2009.
[20] J. Chang and D. Wang, “Robust speaker recognition based on DNN/i-vectors
and speech separation,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 5415–5419, 2017.
[21] W.-C. Chen, C.-T. Hsieh, and E. Lai, “Multiband approach to robust text-
independent speaker identification,” Journal of Computational Linguistics
and Chinese Language Processing, vol. 9, no. 2, pp. 63–76, 2004.
[22] S.-H. Chen and Y.-R. Luo, “Speaker verification using MFCC and support
vector machine,” in Proceedings of the International Multi conference of En-
gineers and Computer Scientists, vol. 1, 2009.
[23] L. Chun-Lin, “A tutorial of the wavelet transform,” Department of Electrical
Engineering, National Taiwan University, Taiwan, pp. 1–71, 2010.
[24] A. Cichocki and S. Amari, Adaptive blind signal and image processing: learn-
ing algorithms and applications. John Wiley & Sons, 2002.
[25] P. Comon, “Independent component analysis, a new concept?,” Signal pro-
cessing, vol. 36, no. 3, pp. 287–314, 1994.
[26] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning,
vol. 20, no. 3, pp. 273–297, 1995.
[27] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,” IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4,
pp. 357–366, 1980.
[28] D. B. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. H. Rahman, and
S. Sridharan, “The QUT-NOISE-SRE protocol for the evaluation of noisy
speaker recognition,” in Proceedings of the 16th Annual Conference of the
International Speech Communication Association (Interspeech), pp. 3456–
3460, 2015.
[29] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-
TIMIT corpus for the evaluation of voice activity detection algorithms,” in
Proceedings of Interspeech, pp. 1–4, 2010.
[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel,
“Support vector machines versus fast scoring in the low-dimensional to-
tal variability space for speaker verification.,” in Proceedings of Interspeech,
pp. 1559–1562, 2009.
[31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-
end factor analysis for speaker verification,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[32] F. Denk, J. P. C. da Costa, and M. A. Silveira, “Enhanced forensic multiple
speaker recognition in the presence of coloured noise,” in 8th International
Conference on Signal Processing and Communication Systems (ICSPCS),
pp. 1–7, 2014.
[33] D. L. Donoho, “De-noising by soft-thresholding,” IEEE Transactions on
Information Theory, vol. 41, no. 3, pp. 613–627, 1995.
[34] S. Du, X. Xiao, and E. S. Chng, “DNN feature compensation for noise robust
speaker verification,” in IEEE China Summit and International Conference
on Signal and Information Processing (ChinaSIP), pp. 871–875, 2015.
[35] L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena,
A. Lawson, Y. Lei, P. Matejka, O. Plchot, and N. Scheffer, “Promoting
robustness for speaker modeling in the community: the PRISM evaluation
set,” in Proceedings of NIST 2011 workshop, pp. 1–7, Citeseer, 2011.
[36] A. J. Fisher and S. Sridharan, “Speech enhancement for forensic and telecom-
munication applications,” in Fifth International conference on Speech Sci-
ence and Technology, pp. 40–45, 1994.
[37] S. Furui, “Cepstral analysis technique for automatic speaker verification,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-
29, no. 2, pp. 254–272, 1981.
[38] S. Furui, “An overview of speaker recognition technology,” in Proceedings
ESCA Workshop on Automatic Speaker Recognition, Identification, and Ver-
ification, pp. 1–9, 1994.
[39] S. Furui, “Recent advances in speaker recognition,” Pattern Recognition Let-
ters, vol. 18, no. 9, pp. 859–872, 1997.
[40] S. Ganapathy, J. Pelecanos, and M. K. Omar, “Feature normalization for
speaker verification in room reverberation,” in IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pp. 4836–4839,
2011.
[41] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative evaluation of
various MFCC implementations on the speaker verification task,” in Pro-
ceedings of the SPECOM, pp. 191–194, 2005.
[42] D. Garcia-Romero, Robust Speaker Recognition Based on Latent Variable
Models. PhD thesis, University of Maryland at College Park, 2012.
[43] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length nor-
malization in speaker recognition systems,” in Proceedings of Interspeech,
pp. 249–252, 2011.
[44] Y. Ghanbari and M. R. Karami-Mollaei, “A new approach for speech en-
hancement based on the adaptive thresholding of the wavelet packets,”
Speech Communication, vol. 48, no. 8, pp. 927–940, 2006.
[45] R. Haddadi, E. Abdelmounim, M. El Hanine, and A. Belaguid, “Discrete
wavelet transform based algorithm for recognition of QRS complexes,” in
International Conference on Multimedia Computing and Systems, pp. 375–
379, 2014.
[46] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means cluster-
ing algorithm,” Journal of the Royal Statistical Society. Series C (Applied
Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[47] A. O. Hatch, S. S. Kajarekar, and A. Stolcke, “Within-class covariance nor-
malization for SVM-based speaker recognition,” in 9th International Con-
ference on Spoken Language Processing, pp. 1471–1474, 2006.
[48] H. Hermansky, B. Hanson, and H. Wakita, “Perceptually based linear pre-
dictive analysis of speech,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 509–512, 1985.
[49] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans-
actions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[50] A. Higgins, L. Bahler, and J. Porter, “Speaker verification using randomized
phrase prompting,” Digital Signal Processing, vol. 1, no. 2, pp. 89–106, 1991.
[51] H.-G. Hirsch and D. Pearce, “The Aurora experimental framework for the
performance evaluation of speech recognition systems under noisy condi-
tions,” in ISCA Tutorial Research workshop ASR2000, pp. 181–188, 2000.
[52] P. J. Huber, “Projection pursuit,” The annals of Statistics, vol. 13, no. 2,
pp. 435–475, 1985.
[53] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent com-
ponent analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3,
pp. 626–634, 1999.
[54] A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis.
John Wiley & Sons, 2004.
[55] A. Hyvarinen and E. Oja, “Independent component analysis: algorithms and
applications,” Neural Networks, vol. 13, no. 4, pp. 411–430, 2000.
[56] M. A. Islam, W. A. Jassim, N. S. Cheok, and M. S. A. Zilany, “A robust
speaker identification system using the responses from a model of the audi-
tory periphery,” PlOS one, vol. 11, no. 7, pp. 1–21, 2016.
[57] Q. Jin, Robust speaker recognition. PhD thesis, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA, 2007.
[58] Q. Jin, T. Schultz, and A. Waibel, “Far-field speaker recognition,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7,
pp. 2023–2032, 2007.
[59] A. Kanagasundaram, Speaker verification using I-vector features. PhD thesis,
Queensland University of Technology, Australia, Brisbane, 2014.
[60] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez,
J. Gonzalez-Rodriguez, and D. Ramos, “Improving short utterance i-vector
speaker verification using utterance variance modelling and compensation
techniques,” Speech Communication, vol. 59, pp. 69–82, 2014.
[61] A. Kanagasundaram, D. Dean, S. Sridharan, M. McLaren, and R. Vogt,
“I-vector based speaker recognition using advanced channel compensation
techniques,” Computer Speech and Language, vol. 28, no. 1, pp. 121–140,
2014.
[62] F. Kelly, Automatic recognition of ageing speakers. PhD thesis, Trinity Col-
lege Dublin, Ireland, 2014.
[63] P. Kenny, “Joint factor analysis of speaker and session variability: Theory
and algorithms,” CRIM, Montreal,(Report) CRIM-06/08-13, pp. 1–17, 2005.
[64] P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” in
Odyssey Speaker and Language Recogntion Workshop, 2010.
[65] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with
sparse training data,” IEEE Transactions on Speech and Audio Processing,
vol. 13, no. 3, pp. 345–354, 2005.
[66] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Factor analysis
simplified,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pp. I– 637– I– 640, 2005.
[67] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Improvements in
factor analysis based speaker verification,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. I– 113– I– 116,
2006.
[68] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis
versus eigenchannels in speaker recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[69] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and ses-
sion variability in GMM-based speaker verification,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1448–1460, 2007.
[70] P. Kenny and P. Dumouchel, “Disentangling speaker and channel effects
in speaker verification,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 1, pp. 37–40, 2004.
[71] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study
of interspeaker variability in speaker verification,” IEEE Transactions on
Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[72] S. Kim, M. Ji, and H. Kim, “Noise-robust speaker recognition using sub-
band likelihoods and reliable-feature selection,” ETRI Journal, vol. 30, no. 1,
pp. 89–100, 2008.
[73] S. Kim, M. Ji, and H. Kim, “Robust speaker recognition based on filtering
in autocorrelation domain and sub-band feature recombination,” Pattern
Recognition Letters, vol. 31, no. 7, pp. 593–599, 2010.
[74] J. Kola, C. Espy-Wilson, and T. Pruthi, “Voice activity detection,” Merit
Bien, pp. 1–6, 2011.
[75] M. Kolbœk, Z.-H. Tan, and J. Jensen, “Speech enhancement using long
short-term memory based recurrent neural networks for noise robust speaker
verification,” in IEEE Spoken Language Technology Workshop (SLT),
pp. 305–311, 2016.
[76] Z. Koldovsky, J. Malek, P. Tichavsky, Y. Deville, and S. Hosseini, “Blind
separation of piecewise stationary non-Gaussian sources,” Signal Processing,
vol. 89, no. 12, pp. 2570–2584, 2009.
[77] T.-W. Lee, Independent component analysis. Springer, 1998.
[78] E. A. Lehmann and A. M. Johansson, “Prediction of energy decay in room
impulse responses simulated with an image-source model,” Journal of the
Acoustical Society of America, vol. 124, no. 1, pp. 269–277, 2008.
[79] E. A. Lehmann, A. M. Johansson, and S. Nordholm, “Reverberation-time
prediction method for room impulse responses simulated with the image-
source model,” in IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, pp. 159–162, 2007.
[80] L. Lei and S. Kun, “Speaker recognition using wavelet cepstral coefficient, i-
vector, and cosine distance scoring and its application for forensics,” Journal
of Electrical and Computer Engineering, vol. 2016, pp. 1–11, 2016.
[81] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker
recognition using a phonetically-aware deep neural network,” in IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 1695–1699, 2014.
[82] X.-L. Li and T. Adali, “Independent component analysis by entropy bound
minimization,” IEEE Transactions on Signal Processing, vol. 58, no. 10,
pp. 5151–5164, 2010.
[83] K.-P. Li and J. E. Porter, “Normalizations and selection of speech seg-
ments for speaker recognition scoring,” in IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), pp. 595–598, 1988.
[84] H. Li, H. Wang, and B. Xiao, “Blind separation of noisy mixed speech signals
based on wavelet transform and independent component analysis,” in 8th
International Conference on Signal Processing, pp. 1–4, 2006.
[85] H.-Y. Li, Q.-H. Zhao, G.-L. Ren, and B.-J. Xiao, “Speech enhancement al-
gorithm based on independent component analysis,” in 5th International
Conference on Natural Computation, pp. 598–602, 2009.
[86] H. Maged, A. AbouEl-Farag, and S. Mesbah, “Improving speaker identifica-
tion system using discrete wavelet transform and AWGN,” in 5th IEEE Inter-
national Conference on Software Engineering and Service Science, pp. 1171–
1176, 2014.
[87] S. Malik and F. A. Afsar, “Wavelet transform based automatic speaker recog-
nition,” in 13th IEEE International Multitopic Conference, pp. 1–4, 2009.
[88] S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet
representation,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 11, no. 7, pp. 674–693, 1989.
[89] M. I. Mandasari, M. McLaren, and D. A. van Leeuwen, “The effect of noise
on modern automatic speaker recognition systems,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4249–
4252, 2012.
[90] J. McAuley, J. Ming, D. Stewart, and P. Hanna, “Subband correlation and
robust speech recognition,” IEEE Transactions on Speech and Audio Pro-
cessing, vol. 13, no. 5, pp. 956–964, 2005.
[91] M. McLaren and D. van Leeuwen, “Improved speaker recognition when using
i-vectors from multiple speech sources,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5460–5463, 2011.
[92] T. M. Cover and J. A. Thomas, Elements of Information Theory. John
Wiley & Sons Ltd, 1991.
[93] H. Melin, Automatic speaker verification on site and by telephone: meth-
ods, applications and assessment. PhD thesis, KTH Computer Science and
Communication, Stockholm, Sweden, 2006.
[94] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, “Robust speaker
recognition in noisy conditions,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 15, no. 5, pp. 1711–1723, 2007.
[95] N. Mirghafori and N. Morgan, “Combining connectionist multi-band and
full-band probability streams for speech recognition of natural numbers,” in
5th International Conference on Spoken Language Processing, pp. 1–4, 1998.
[96] M. Misiti, Y. Misiti, G. Oppenheim, and J.-M. Poggi, Wavelet Toolbox
User's Guide. The MathWorks Inc., 1996.
[97] G. S. Morrison and E. Enzinger, “Multi-laboratory evaluation of forensic
voice comparison systems under conditions reflecting those of a real foren-
sic case (forensic_eval_01) – Introduction,” Speech Communication, vol. 85,
pp. 119–126, 2016.
[98] G. S. Morrison, P. Rose, and C. Zhang, “Protocol for the collection of
databases of recordings for forensic-voice-comparison research and practice,”
Australian Journal of Forensic Sciences, vol. 44, no. 2, pp. 155–167, 2012.
[99] G. Morrison, C. Zhang, E. Enzinger, F. Ochoa, D. Bleach, M. John-
son, B. Folkes, S. De Souza, N. Cummins, and D. Chow, “Forensic
database of voice recordings of 500+ Australian English speakers,” URL:
http://databases.forensic-voice-comparison.net, 2015.
[100] G. R. Naik, Iterative issues of ICA, quality of separation and number of
sources: a study for biosignal applications. PhD thesis, RMIT University,
Melbourne, Australia, 2008.
[101] G. R. Naik and D. K. Kumar, “An overview of independent component
analysis and its applications,” Informatica, vol. 35, no. 1, pp. 63–81, 2011.
[102] G. R. Naik and D. K. Kumar, “Identification of hand and finger movements
using multi run ICA of surface electromyogram,” Journal of Medical Systems,
vol. 36, no. 2, pp. 841–851, 2012.
[103] G. R. Naik, D. K. Kumar, and M. Palaniswami, “Multi run ICA and surface
EMG based signal processing system for recognising hand gestures,” in 8th
IEEE International Conference on Computer and Information Technology,
pp. 700–705, 2008.
[104] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhance-
ment techniques for automatic speaker recognition,” in 4th International
Conference on Spoken Language Processing, pp. 929–932, 1996.
[105] A. Papoulis, Probability, Random Variables and Stochastic Processes.
McGraw-Hill, 1991.
[106] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verifica-
tion,” in Proceedings of the Odyssey Speaker Recognition Workshop,
pp. 213–218,
2001.
[107] M. Phythian, Speaker identification for forensic applications. PhD thesis,
Queensland University of Technology, Brisbane, Australia, 1998.
[108] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis
for inferences about identity,” in 11th IEEE International Conference on
Computer Vision, pp. 1–8, 2007.
[109] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the end-
points of isolated utterances,” Bell System Technical Journal, vol. 54, no. 2,
pp. 297–315, 1975.
[110] D. A. Reynolds, “Experimental evaluation of features for robust speaker
identification,” IEEE Transactions on Speech and Audio Processing, vol. 2,
no. 4, pp. 639–643, 1994.
[111] D. A. Reynolds, “Automatic speaker recognition using Gaussian mixture
speaker models,” The Lincoln Laboratory Journal, vol. 8, pp. 173–192,
1995.
[112] D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91–108,
1995.
[113] D. A. Reynolds, “Automatic speaker recognition: Current approaches and
future trends,” Speaker Verification: From Research to Reality, 2001.
[114] D. A. Reynolds, “An overview of automatic speaker recognition technol-
ogy,” in IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pp. IV-4072–IV-4075, 2002.
[115] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10,
pp. 19–41, 2000.
[116] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker iden-
tification using Gaussian mixture speaker models,” IEEE Transactions on
Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[117] D. Ribas, E. Vincent, and J. R. Calvo, “Full multicondition training for
robust i-vector based speaker recognition,” in Proceedings of Interspeech,
pp. 1–5, 2015.
[118] J. Rosca, R. Balan, and C. Beaugeant, “Multi-channel psychoacoustically
motivated speech enhancement,” in International Conference on Multimedia
and Expo, pp. III-217–III-220, 2003.
[119] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR Identity Toolbox v1.0: A
MATLAB toolbox for speaker-recognition research,” Speech and Language
Processing Technical Committee Newsletter, vol. 1, no. 4, 2013.
[120] M. Senoussaoui, P. Kenny, N. Brummer, E. De Villiers, and P. Dumouchel,
“Mixture of PLDA models in i-vector space for gender-independent speaker
recognition,” in Proceedings of Interspeech, pp. 25–28, 2011.
[121] N. R. Shabtai, Y. Zigel, and B. Rafaely, “The effect of GMM order and
CMS on speaker recognition with reverberant speech,” in Hands-Free Speech
Communication and Microphone Arrays, pp. 144–147, 2008.
[122] N. R. Shabtai, Y. Zigel, and B. Rafaely, “The effect of room parameters on
speaker verification using reverberant speech,” in 25th IEEE Convention of
Electrical and Electronics Engineers, pp. 231–235, 2008.
[123] A. Shafik, S. M. Elhalafawy, S. Diab, B. M. Sallam, and F. A. El-Samie, “A
wavelet based approach for speaker identification from degraded speech,” In-
ternational Journal of Communication Networks and Information Security,
vol. 1, no. 3, pp. 52–58, 2009.
[124] N. Shanmugapriya and E. Chandra, “Evaluation of sound classification us-
ing modified classifier and speech enhancement using ICA algorithm for hear-
ing aid application,” ICTACT Journal on Communication Technology, vol. 7,
no. 1, pp. 1279–1288, 2016.
[125] M. Sharma and R. Mammone, “Subword-based text-dependent speaker ver-
ification system with user-selectable passwords,” in IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 93–96,
1996.
[126] H. Sheikhzadeh and H. R. Abutalebi, “An improved wavelet-based speech
enhancement system,” in 7th European Conference on Speech Communica-
tion and Technology, pp. 1855–1858, 2001.
[127] M. A. Silveira, C. P. Schroeder, J. Da Costa, C. G. de Oliveira, J. A. A.
Junior, and S. Junior, “Convolutive ICA-based forensic speaker identification
using mel frequency cepstral coefficients and Gaussian mixture models,” The
International Journal of Forensic Computer Science, vol. 1, pp. 27–34, 2013.
[128] N. Singh, R. Khan, and R. Shree, “Applications of speaker recognition,”
Procedia Engineering, vol. 38, pp. 3122–3126, 2012.
[129] L. Singh and S. Sridharan, “Speech enhancement for forensic applications
using dynamic time warping and wavelet packet analysis,” in IEEE Region 10
Annual Conference (TENCON): Speech and Image Technologies for Computing
and Telecommunications, pp. 475–478, 1997.
[130] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity
detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
[131] A. Solomonoff, C. Quillen, and W. M. Campbell, “Channel compensation
for SVM speaker recognition,” in Odyssey Speaker and Language Recognition
Workshop, pp. 57–62, 2004.
[132] B. T. Taddese, “Sound source localization and separation,” Mathematics
and Computer Science, Macalester College, 2006.
[133] N. Trivedi, V. Kumar, S. Singh, S. Ahuja, and R. Chadha, “Speech recogni-
tion by wavelet analysis,” International Journal of Computer Applications,
vol. 15, no. 8, pp. 27–32, 2011.
[134] Z. Tufekci and S. Gurbuz, “Noise robust speaker verification using mel-
frequency discrete wavelet coefficients and parallel model compensation,” in
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), vol. 1, pp. I-657–I-660, 2005.
[135] B. Tydlitat, J. Navratil, J. W. Pelecanos, and G. N. Ramaswamy, “Text-
independent speaker verification in embedded environments,” in IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. IV-293–IV-296, 2007.
[136] G. Tzanetakis, G. Essl, and P. Cook, “Audio analysis using the discrete
wavelet transform,” in Proceedings of the Conference in Acoustics and Music
Theory Applications, pp. 1–6, 2001.
[137] N. Upadhyay and A. Karmakar, “Speech enhancement using spectral
subtraction-type algorithms: A comparison and simulation study,” Proce-
dia Computer Science, vol. 54, pp. 574–584, 2015.
[138] A. Varga and H. J. Steeneken, “Assessment for automatic speech recog-
nition: II. NOISEX-92: A database and an experiment to study the effect
of additive noise on speech recognition systems,” Speech Communication,
vol. 12, no. 3, pp. 247–251, 1993.
[139] O. Viikki and K. Laurila, “Cepstral domain segmental feature vector nor-
malization for noise robust speech recognition,” Speech Communication,
vol. 25, pp. 133–147, 1998.
[140] A. E. Villanueva-Luna, A. Jaramillo-Nunez, D. Sanchez-Lucero, C. M.
Ortiz-Lima, J. G. Aguilar-Soto, A. Flores-Gil, and M. May-Alarcon, “De-
noising audio signals using MATLAB wavelets toolbox,” pp. 25–54, INTECH
Open Access Publisher, 2011.
[141] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in
blind audio source separation,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[142] R. J. Vogt, B. J. Baker, and S. Sridharan, “Factor analysis subspace esti-
mation for speaker verification with short utterances,” in Proceedings of
Interspeech, pp. 853–856, 2008.
[143] B. Xiang, U. V. Chaudhari, J. Navratil, G. N. Ramaswamy, and R. A.
Gopinath, “Short-time Gaussianization for robust speaker verification,” in
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), pp. I-681–I-684, 2002.
[144] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani,
and W. Kellermann, “Making machines understand us in reverberant rooms:
Robustness against reverberation for automatic speech recognition,” IEEE
Signal Processing Magazine, vol. 29, no. 6, pp. 114–126, 2012.
[145] C. Yu, C. Zhang, F. Kelly, A. Sangwan, and J. H. Hansen, “Text-available
speaker recognition system for forensic applications,” in Annual Confer-
ence of the International Speech Communication Association (Interspeech),
pp. 1844–1847, 2016.
[146] X. Zhao, Y. Wang, and D. Wang, “Robust speaker identification in noisy
and reverberant conditions,” IEEE/ACM Transactions on Audio, Speech
and Language Processing (TASLP), vol. 22, no. 4, pp. 836–845, 2014.
[147] Q. Zhu and A. Alwan, “On the use of variable frame rate analysis in speech
recognition,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pp. 1783–1786, 2000.