Abstract
This article investigates the importance of vocal source information for speaker recognition. We propose a novel feature extraction scheme that exploits the time-frequency properties of the LP residual signal. The new feature, named Wavelet Octave Coefficients of Residues (WOCOR), provides additional speaker discriminative power and is shown to improve the overall performance of a speaker recognition system when combined with the conventional vocal tract features, MFCCs.
Speaker Specific Vocal Source Signal
Acknowledgement
This effort was partially supported by a research grant awarded by the Hong Kong Research Grant Council. The authors wish to acknowledge Dr. Frank Soong for instructive discussions and suggestions during this work.
Time-Frequency Analysis of Vocal Source Signal for Speaker Recognition
Nengheng Zheng, P.C. Ching and Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR of China
Conclusion
The time-frequency information of the vocal source provides additional speaker discriminative power and improves the overall performance of a speaker recognition system.
$$S(z) = U(z)\,V(z)\,R(z)$$
Source-tract separation by LP inverse filtering
Estimating the AR coefficients of V(z) by linear prediction analysis
Inverse filtering s(n) for the output e(n)
e(n) is highly related to u(n) and is speaker dependent.
Speech production
Feature Extraction With Time-Frequency Analysis on the Residual Signal
Voice activity detection and pitch tracking: only voiced segments are of interest; energy and zero-crossing detection for VAD; cepstrum analysis for pitch tracking.
LP inverse filtering: inverse filter the voiced speech frames.
Pitch synchronous wavelet transform: exact pitch tracking by detecting the residual bursts; wavelet transform on every two pitch cycles of the residual signal, with one pitch cycle of overlap.
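The first step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the thresholds, the Hamming window, and the 60 to 400 Hz search range are all assumptions made for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def is_voiced(frame, energy_thresh, zcr_thresh):
    """Crude VAD: voiced frames tend to have high energy and a low
    zero-crossing rate. Both thresholds are hypothetical tuning values."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return bool(energy > energy_thresh and zcr < zcr_thresh)

def cepstral_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) from the peak of the real cepstrum inside the
    quefrency range that corresponds to [f0_min, f0_max]."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    q_lo, q_hi = int(fs / f0_max), int(fs / f0_min)
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    return fs / peak
```

In practice the thresholds would be tuned on held-out data, and the frame must be long enough to contain at least two pitch periods at `f0_min`.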
$$V(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
Vocal tract features such as MFCC and LPCC are widely used in speaker recognition systems. The vibrating mechanism of the vocal cords is also speaker dependent. We aim to capture the time-frequency properties of the glottal source.
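$$W_e(a, b) = \frac{1}{\sqrt{|a|}} \sum_{n} e(n)\, \psi^{*}\!\left(\frac{n - b}{a}\right)$$
where $\psi$ is the mother wavelet, $a$ is the scale and $b$ is the translation.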
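A direct, unoptimized evaluation of the wavelet transform of the residual at dyadic scales might look like the sketch below. The poster does not name the mother wavelet, so the Haar wavelet here is purely an assumption for illustration, as are the function names. `pitch_synchronous_segments` assumes that pitch marks (sample indices of the residual bursts) are already available.

```python
import numpy as np

def haar(t):
    """Haar mother wavelet (an assumption; the poster does not name one)."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def wavelet_transform(e, n_octaves=6, psi=haar):
    """Return W with W[k-1, b-1] = (1/sqrt(a)) * sum_n e(n) * conj(psi((n-b)/a)),
    for scales a = 2**k, k = 1..n_octaves, and translations b = 1..N."""
    N = len(e)
    n = np.arange(1, N + 1)
    W = np.zeros((n_octaves, N))
    for k in range(1, n_octaves + 1):
        a = 2.0 ** k
        for b in range(1, N + 1):
            W[k - 1, b - 1] = np.sum(e * np.conj(psi((n - b) / a))) / np.sqrt(a)
    return W

def pitch_synchronous_segments(e, pitch_marks):
    """Cut the residual into segments of two pitch cycles each,
    advancing by one cycle so consecutive segments overlap by one cycle."""
    return [e[pitch_marks[i]:pitch_marks[i + 2]]
            for i in range(len(pitch_marks) - 2)]
```

The double loop costs O(N^2) per octave, which is acceptable for segments of two pitch cycles; a convolution-based version would be the natural optimization.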
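Time-frequency feature generation
Firstly, divide the wavelet coefficients into octave groups
$$a \in \{2^k \mid k = 1, 2, \ldots, 6\}, \quad b = 1, 2, \ldots, N$$
$$W_k = \{ W_e(2^k, b) \mid b = 1, 2, \ldots, N \}, \quad k = 1, 2, \ldots, 6$$
$$\mathrm{WOCOR}_1 = \{ \|W_k\| \mid k = 1, 2, \ldots, 6 \}$$
$$W_k(j) = \left\{ W_e(2^k, b) \;\middle|\; b \in \big( (j-1)\,\mathrm{Round}(N/\alpha),\; j\,\mathrm{Round}(N/\alpha) \big] \right\}, \quad j = 1, 2, \ldots, \alpha$$
$$\mathrm{WOCOR}_\alpha = \{ \|W_k(j)\| \mid j = 1, 2, \ldots, \alpha;\; k = 1, 2, \ldots, 6 \}$$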
[Figure: speech production model. Glottal pulses u(n) excite the vocal tract H(z) to produce the speech signal s(n).]
$$e(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n - k)$$
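As a rough sketch of LP inverse filtering, the AR coefficients can be estimated with the autocorrelation method and the residual obtained by applying the inverse filter A(z) = 1 - sum_k a_k z^{-k} to the frame. The Hamming window, the direct solve of the normal equations, and the function names are assumptions made for this example; the poster does not specify the LP order or the estimation details.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate a_1..a_p by solving the autocorrelation normal equations R a = r."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    r = full[len(windowed) - 1:len(windowed) + order]   # lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(frame, a):
    """e(n) = s(n) - sum_k a_k s(n-k); samples before the frame are taken as zero."""
    inverse_filter = np.concatenate(([1.0], -np.asarray(a)))
    return np.convolve(frame, inverse_filter)[:len(frame)]
```

For example, `e = lp_residual(frame, lpc_coefficients(frame, 12))` would give the residual of one voiced frame, where the order 12 is only an illustrative choice.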
Experiments
Corpus: read Cantonese HK ID numbers; 40 male speakers; 4 enrollment and 6 testing sessions; microphone and telephone speech.
Baseline system: MFCC_D_A (36 dimensions); 128-component GMM.
Recognition results
Recognition error rate with WOCORα
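Recognition error rate with fused source-tract information

| Features | Channel | IDER (%) | EER (%) |
| --- | --- | --- | --- |
| MFCC_D_A | MIC | 1.92 | 2.48 |
| MFCC_D_A | TEL | 2.1 | 2.83 |
| MFCC_D_A + WOCOR4 | MIC | 1.50 | 2.21 |
| MFCC_D_A + WOCOR4 | TEL | 1.6 | 2.44 |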
Information fusion at the score level
$$s = w_t\, s_t + (1 - w_t)\, s_s$$
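A minimal sketch of score-level fusion follows. It assumes that s_t and s_s are the scores from the vocal tract (MFCC) and vocal source (WOCOR) models respectively, that w_t lies in [0, 1], and that a higher score means a better match. The function names are hypothetical.

```python
import numpy as np

def fuse_scores(tract_score, source_score, w_t):
    """Weighted sum s = w_t * s_t + (1 - w_t) * s_s (scalars or arrays)."""
    if not 0.0 <= w_t <= 1.0:
        raise ValueError("w_t is assumed to lie in [0, 1]")
    return w_t * np.asarray(tract_score) + (1.0 - w_t) * np.asarray(source_score)

def identify(tract_scores, source_scores, w_t):
    """Return the index of the enrolled speaker with the highest fused score."""
    return int(np.argmax(fuse_scores(tract_scores, source_scores, w_t)))
```

One simple way to choose `w_t` would be to sweep it over a grid on development data and keep the value with the lowest error rate.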
Secondly, generate the feature vector, named the first order Wavelet Octave Coefficients of Residues (WOCOR1)
Furthermore, to obtain more temporal details, divide each octave into sub-groups and generate high order WOCORα
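The feature generation step can be sketched as below, assuming each feature element is the 2-norm of one group of wavelet coefficients and that `W` is an array of shape (number of octaves, N) holding W_e(2^k, b). Letting the last sub-group absorb any rounding remainder is also an assumption made for this example.

```python
import numpy as np

def wocor(W, alpha=1):
    """Build WOCOR_alpha: split each octave row of W into alpha sub-groups
    along b and take the 2-norm of each sub-group."""
    n_octaves, N = W.shape
    step = int(round(N / alpha))
    features = []
    for k in range(n_octaves):
        for j in range(1, alpha + 1):
            start = (j - 1) * step
            # The last sub-group runs to the end of the row (assumption).
            stop = N if j == alpha else j * step
            features.append(np.linalg.norm(W[k, start:stop]))
    return np.array(features)
```

With six octaves, `wocor(W, 1)` yields a 6-dimensional vector and `wocor(W, 4)` a 24-dimensional one.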
[Figure: block diagram of feature extraction, with blocks VAD and Pitch Tracking, LP Inverse Filtering, Pitch Synchronous Wavelet Transform, and Time-Frequency Feature Generation; signals s(n), F0, e(n), We(a,b), and WOCOR.]
[Figure: residual signal e(n) and wavelet coefficient groups W1, W2 and W3, plotted over sample indices 0 to 100.]
Comments
Temporal details of the vocal source signal are useful for speaker recognition.
In telephone speech, the relative improvement from source information is 24% for identification and 14% for verification; in microphone speech, it is 22% and 11%, respectively.
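The relative improvements quoted here follow from the error rates reported in the results table, computed as (baseline - fused) / baseline. The short check below reproduces them; the dictionary layout and function name are just for illustration.

```python
def relative_reduction(baseline, fused):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - fused) / baseline

# (baseline MFCC_D_A, fused MFCC_D_A + WOCOR4) error rates in percent
results = {
    ("TEL", "identification (IDER)"): (2.1, 1.6),
    ("TEL", "verification (EER)"): (2.83, 2.44),
    ("MIC", "identification (IDER)"): (1.92, 1.50),
    ("MIC", "verification (EER)"): (2.48, 2.21),
}

for (channel, task), (baseline, fused) in results.items():
    print(f"{channel} {task}: {relative_reduction(baseline, fused):.1f}%")
```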
The fusion weight w_t is determined experimentally.
[Figure: error rate (%) curves for MIC verification, MIC identification, TEL identification and TEL verification; horizontal axis from 1 to 8.]
[Figure: waveforms and spectra (amplitude in dB versus frequency in Hz, 0 to 4000 Hz) of the speech signal s(n) and the residual signal e(n).]