
Upload: amanda-perry

Post on 17-Dec-2015


Abstract

This article investigates the importance of vocal source information for speaker recognition. We propose a novel feature extraction scheme that exploits the time-frequency properties of the LP residual signal. The new feature, named Wavelet Octave Coefficients of Residues (WOCOR), provides additional speaker discriminative power and is demonstrated to improve the overall performance of a speaker recognition system when combined with the conventional vocal tract feature, the MFCCs.

Speaker Specific Vocal Source Signal

Acknowledgement

This effort was partially supported by a research grant awarded by the Hong Kong Research Grant Council. The authors wish to acknowledge Dr. Frank Soong for instructive discussions and suggestions during this work.

Time-Frequency Analysis of Vocal Source Signal for Speaker Recognition

Nengheng Zheng, P.C. Ching and Tan Lee

Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR of China

Conclusion

The vocal source time-frequency information provides additional speaker discriminative power and improves the overall performance of the speaker recognition system.

$$S(z) = U(z)\,V(z)\,R(z)$$

Source-tract separation by LP inverse filtering

Estimating the AR coefficients of V(z) by linear prediction analysis

Inverse filtering s(n) for the output e(n)

e(n) is highly related to u(n) and is speaker dependent.
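The separation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the autocorrelation method with a Levinson-Durbin recursion and the sign convention e(n) = s(n) - sum_k a_k s(n-k). The function names `lp_coefficients` and `lp_residual` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r(0..max_lag) of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lp_coefficients(frame, order):
    """Estimate a_1..a_p by Levinson-Durbin (assumed method, not from the poster).

    The predictor is s_hat(n) = sum_k a_k * s(n - k).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame (e.g. silence): keep remaining a_k at 0
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:]


def lp_residual(frame, coeffs):
    """Inverse filter the frame: e(n) = s(n) - sum_k a_k * s(n - k)."""
    residual = []
    for n in range(len(frame)):
        prediction = sum(a_k * frame[n - k]
                         for k, a_k in enumerate(coeffs, start=1)
                         if n - k >= 0)
        residual.append(frame[n] - prediction)
    return residual


# Toy check: a one-pole "vocal tract" excited by a single pulse.
frame = [0.9 ** n for n in range(200)]
coeffs = lp_coefficients(frame, order=2)
residual = lp_residual(frame, coeffs)
```

For this toy signal the analysis recovers a_1 close to 0.9 and a_2 close to 0, and the residual collapses to a single pulse at n = 0, which is the excitation that produced the frame.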

Speech production

Feature Extraction With Time-Frequency Analysis on the Residual Signal

Voice activity detection and pitch tracking: only voiced segments are of interest. Energy and zero-crossing detection is used for VAD, and cepstrum analysis for pitch tracking.
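A voiced/unvoiced decision based on energy and zero crossings might look like the sketch below. The thresholds (`energy_threshold`, `zcr_threshold`) and the function names are hypothetical values chosen for illustration; the poster does not give them. Cepstral pitch tracking is not shown.

```python
import math


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for prev, cur in zip(frame, frame[1:])
                    if (prev >= 0) != (cur >= 0))
    return crossings / (len(frame) - 1)


def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.25):
    """Voiced speech tends to have high energy and few zero crossings.

    Both thresholds are illustrative assumptions, not values from the poster.
    """
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)


# A 100 Hz tone sampled at 8 kHz stands in for a voiced frame.
voiced_like = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(240)]
# A rapidly alternating signal stands in for an unvoiced (noisy) frame.
unvoiced_like = [0.3 if n % 2 == 0 else -0.3 for n in range(240)]
```

In practice the thresholds would be tuned per channel, since telephone and microphone speech differ in level and noise floor.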

LP inverse filtering: inverse filter each voiced speech frame.

Pitch synchronous wavelet transform: exact pitch tracking by detecting the residual bursts; the wavelet transform is applied to every two pitch cycles of the residual signal, with one pitch cycle of overlap.
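The windowing described here (two pitch cycles per analysis window, one cycle of overlap) can be sketched as follows. It assumes the pitch marks, that is the sample indices of the residual bursts, are already available; `pitch_synchronous_segments` is a hypothetical helper name.

```python
def pitch_synchronous_segments(residual, pitch_marks):
    """Cut the residual into windows spanning two consecutive pitch cycles.

    pitch_marks are sample indices of successive residual bursts (assumed
    given). Window i runs from mark i to mark i + 2, so neighbouring
    windows share exactly one pitch cycle.
    """
    segments = []
    for i in range(len(pitch_marks) - 2):
        start, end = pitch_marks[i], pitch_marks[i + 2]
        segments.append(residual[start:end])
    return segments


# Toy residual with a constant pitch period of 80 samples.
toy_residual = list(range(300))
toy_marks = [0, 80, 160, 240]
toy_segments = pitch_synchronous_segments(toy_residual, toy_marks)
```

With four pitch marks the sketch yields two windows, covering samples 0 to 159 and 80 to 239.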

$$V(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

Vocal tract features, e.g., MFCC and LPCC, are widely used in speaker recognition systems. The vibrating mechanism of the vocal cords is also speaker dependent. We aim to capture the time-frequency properties of the glottal source.

$$W_e(a, b) = \frac{1}{\sqrt{|a|}} \sum_{n} e(n)\, \psi^{*}\!\left(\frac{n - b}{a}\right)$$
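The wavelet transform of the residual can be evaluated directly from its definition: each coefficient W_e(a, b) is the inner product of e(n) with a mother wavelet scaled by a and shifted by b, normalised by the square root of |a|. The sketch below assumes a Haar mother wavelet purely for illustration, since the poster does not name the wavelet used; `haar` and `wavelet_coefficient` are hypothetical names.

```python
import math


def haar(t):
    """Haar mother wavelet (an assumed choice, not taken from the poster)."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def wavelet_coefficient(e, a, b, psi=haar):
    """W_e(a, b) = |a|^(-1/2) * sum_n e(n) * psi((n - b) / a).

    psi is real here, so the complex conjugate is a no-op.
    """
    total = sum(e[n] * psi((n - b) / a) for n in range(len(e)))
    return total / math.sqrt(abs(a))


def octave_coefficients(e, num_octaves=6):
    """Coefficients at scales a = 2^k, k = 1..num_octaves, for every shift b."""
    return [[wavelet_coefficient(e, 2 ** k, b) for b in range(len(e))]
            for k in range(1, num_octaves + 1)]
```

A direct sum like this costs O(N) per coefficient; a real system would more likely use a filter-bank implementation of the transform.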

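Time-frequency feature generation

Firstly, divide the wavelet coefficients into octave groups:

$$a \in \{2^k \mid k = 1, 2, \ldots, 6\}, \quad b = 1, 2, \ldots, N$$

$$W_k = \{ W_e(2^k, b) \mid b = 1, 2, \ldots, N \}, \quad k = 1, 2, \ldots, 6$$

$$\mathrm{WOCOR}_1 = \{ \|W_k\| \mid k = 1, 2, \ldots, 6 \}$$

$$W_k(j) = \{ W_e(2^k, b) \mid b \in ((j-1) : j] \times \mathrm{Round}(N/\alpha) \}, \quad j = 1, 2, \ldots, \alpha$$

$$\mathrm{WOCOR}_\alpha = \{ \|W_k(j)\| \mid j = 1, 2, \ldots, \alpha;\ k = 1, 2, \ldots, 6 \}$$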
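The grouping into octaves and sub-groups can be sketched as below. It assumes the wavelet coefficients are already arranged as one list per octave and that each group is summarised by its 2-norm, which is an assumption about the norm intended in the formulas; `wocor` is a hypothetical function name.

```python
import math


def wocor(octave_coeffs, alpha=1):
    """Build a WOCOR feature vector of order alpha.

    octave_coeffs[k - 1] holds W_e(2^k, b) for b = 1..N. Each octave is cut
    into alpha consecutive sub-groups of Round(N / alpha) coefficients, and
    every sub-group contributes its 2-norm (assumed norm) as one feature.
    """
    features = []
    for w_k in octave_coeffs:
        n = len(w_k)
        size = max(1, int(n / alpha + 0.5))  # round half up
        for j in range(1, alpha + 1):
            group = w_k[(j - 1) * size: j * size]
            features.append(math.sqrt(sum(c * c for c in group)))
    return features


# Two toy octaves with four coefficients each.
toy_octaves = [[3.0, 4.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
first_order = wocor(toy_octaves, alpha=1)
second_order = wocor(toy_octaves, alpha=2)
```

With six octaves the vector has 6 x alpha entries, so an order of 4 gives 24 features per analysis window. Raising alpha keeps more of the temporal detail inside each window at the cost of a longer vector.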

Speech production model: glottal pulses u(n) → vocal tract H(z) → speech signal s(n)

$$e(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n - k)$$

Experiments

Corpus: Hong Kong ID numbers read in Cantonese; 40 male speakers; 4 enrollment and 6 testing sessions; microphone and telephone speech.

Baseline system: MFCC_D_A (36 dimensions) with a 128-component GMM.
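How a GMM speaker model scores an utterance can be sketched as follows. The sketch assumes diagonal covariance matrices and an average per-frame log-likelihood as the utterance score; neither detail is stated on the poster, and all function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed covariance type)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_m w_m N(x; mu_m, var_m), computed with log-sum-exp."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def utterance_score(frames, weights, means, variances):
    """Average frame log-likelihood of an utterance under one speaker model."""
    total = sum(gmm_log_likelihood(f, weights, means, variances) for f in frames)
    return total / len(frames)
```

In the baseline each feature vector would have 36 dimensions and each speaker model 128 components; the same scoring applies unchanged to a model trained on WOCOR vectors.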

Recognition results

Recognition error rate with WOCORα

Recognition error rate with fused source-tract information (IDER: identification error rate; EER: equal error rate)

Feature              Channel   IDER (%)   EER (%)
MFCC_D_A             MIC       1.92       2.48
MFCC_D_A             TEL       2.1        2.83
MFCC_D_A + WOCOR4    MIC       1.50       2.21
MFCC_D_A + WOCOR4    TEL       1.6        2.44

Information fusion at the score level:

$$s = w_t s_t + (1 - w_t) s_s$$
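Score-level fusion is a weighted sum of the vocal tract score s_t and the vocal source score s_s, with weight w_t on the tract score. A minimal sketch, with hypothetical function names and made-up scores used only to show the mechanics:

```python
def fuse_scores(tract_score, source_score, w_t):
    """s = w_t * s_t + (1 - w_t) * s_s."""
    if not 0.0 <= w_t <= 1.0:
        raise ValueError("w_t must lie in [0, 1]")
    return w_t * tract_score + (1.0 - w_t) * source_score


def identify(tract_scores, source_scores, w_t):
    """Return the speaker whose fused score is highest.

    Both arguments map speaker id -> score for the same test utterance.
    """
    fused = {speaker: fuse_scores(tract_scores[speaker], source_scores[speaker], w_t)
             for speaker in tract_scores}
    return max(fused, key=fused.get)


# Illustrative scores only; they are not results from the poster.
tract = {"A": -1.0, "B": -2.0}
source = {"A": -5.0, "B": -1.0}
```

Setting w_t to 1 reduces the system to the MFCC baseline, so sweeping w_t over a development set is one straightforward way to choose it.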

Secondly, generate the feature vector, named the first order Wavelet Octave Coefficients of Residues (WOCOR1)

Furthermore, to obtain more temporal details, divide each octave into sub-groups and generate the higher-order WOCORα

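[Figure: feature extraction flow. The speech signal s(n) passes through VAD and pitch tracking (giving F0) and LP inverse filtering (giving e(n)), then the pitch synchronous wavelet transform (giving We(a,b)), then time-frequency feature generation, which outputs WOCOR.]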

[Figure: the residual signal e(n) over n and the wavelet coefficient groups W1, W2 and W3 over b.]

Comments

Temporal details of the vocal source signal are useful for speaker recognition.

In telephone speech, the relative improvement from source information is 24% for identification and 14% for verification; in microphone speech, it is 22% and 11%, respectively.
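These percentages follow from the error rates in the results table as (baseline - fused) / baseline. A short check, using the table values and a hypothetical helper name:

```python
def relative_improvement(baseline_error, fused_error):
    """Relative error reduction in percent."""
    return 100.0 * (baseline_error - fused_error) / baseline_error


# (baseline MFCC_D_A, fused MFCC_D_A + WOCOR4) error rates in percent.
error_rates = {
    ("TEL", "identification"): (2.1, 1.6),
    ("TEL", "verification"): (2.83, 2.44),
    ("MIC", "identification"): (1.92, 1.50),
    ("MIC", "verification"): (2.48, 2.21),
}

improvements = {key: round(relative_improvement(base, fused))
                for key, (base, fused) in error_rates.items()}
```

Rounded to whole percentages, the four values come out as 24, 14, 22 and 11, matching the figures quoted.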

where the fusion weight $w_t$ is determined experimentally

[Figure: error rate (%) over the values 1 to 8 on the horizontal axis; curves: MIC, Veri.; MIC, Iden.; TEL, Iden.; TEL, Veri.]

[Figure: waveforms of s(n) and e(n) over 250 samples, with their spectra, amplitude (dB) against frequency (Hz) from 0 to 4000 Hz.]