Post on 22-Dec-2015

Pitch Prediction for Glottal Spectrum Estimation with Applications in Speaker Recognition

Nengheng Zheng

Supervised by Professor P.C. Ching

Nov. 26, 2004

Outline

• Speech production and glottal pulse excitation in detail

• Linear prediction: short-term and long-term

• Glottal spectrum estimation with long-term prediction, and acoustic features

• Application to speaker recognition

Speech Production

Discrete-time model for speech production

[Block diagram]

• Impulse train generator → glottal pulse model G(z) → × gain A_V (glottal pulses)

• Random noise generator → × gain A_N

• Excitation u(n) → vocal tract model V(z) → radiation model R(z) → speech signal s(n)

A combined transfer function:

$$H(z) = G(z)\,V(z)\,R(z)$$
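As a rough illustration of the discrete-time model, the cascade of glottal pulse model, vocal tract model and radiation model can be simulated with simple digital filters. The sketch below uses Python with NumPy. The sampling rate, F0, pole positions and the single formant are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form difference equation:
    a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

fs = 8000          # sampling rate in Hz (assumed)
f0 = 100           # fundamental frequency in Hz (assumed)
n_samples = 800

# Impulse train generator scaled by the voiced gain A_V
A_V = 1.0
excitation = np.zeros(n_samples)
excitation[:: fs // f0] = A_V

# G(z): two real poles, giving roughly -12 dB/octave tilt (illustrative choice)
g_den = np.convolve([1.0, -0.95], [1.0, -0.95])
u = iir_filter([1.0], g_den, excitation)          # glottal pulses u(n)

# V(z): a single resonance near 500 Hz with about 80 Hz bandwidth (illustrative)
r = np.exp(-np.pi * 80 / fs)
theta = 2 * np.pi * 500 / fs
v_den = [1.0, -2 * r * np.cos(theta), r * r]

# R(z): first-order differentiator as a simple lip radiation model
s = iir_filter([1.0, -1.0], [1.0], iir_filter([1.0], v_den, u))   # speech s(n)
```

Because the three filters are linear and time-invariant, applying them one after another is equivalent to filtering the impulse train once with the product H(z) = G(z)V(z)R(z).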

Acoustic Features of Glottal Pulse

• Time domain

– pitch period

– pitch period perturbation (jitter)

– pulse amplitude perturbation (shimmer)

– glottal pulse width

– abruptness of closure of the glottal flow

– aspiration noise

• Frequency domain

– fundamental frequency (F0)

– spectral tilt (slope)

– harmonic richness
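Two of the time-domain features, jitter and shimmer, are simple to compute once pitch periods and pulse amplitudes have been measured. The sketch below assumes one common definition (mean absolute difference between consecutive cycles, relative to the mean); the slides do not state which definition was used, and the numbers are hypothetical.

```python
import numpy as np

def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

# Hypothetical measurements for five consecutive glottal cycles
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]        # pitch periods in milliseconds
amplitudes = [1.00, 0.96, 1.02, 0.98, 1.00]   # pulse peak amplitudes

jitter = relative_perturbation(periods_ms)    # pitch period perturbation
shimmer = relative_perturbation(amplitudes)   # pulse amplitude perturbation
f0_hz = 1000.0 / np.mean(periods_ms)          # mean fundamental frequency in Hz
```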

Glottal Pulse and Voice Quality

• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]

– The shape and periodicity of vocal cord excitation are subject to large variation

– Such variations are significant for preserving the naturalness of speech

– A typical glottal pulse is asymmetric, with a shorter falling phase; its spectrum decays at -12 dB/octave

• There is more variation among different speakers than among different utterances of the same speaker [Mathews 1963]

• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]

Various Glottal Pulses

• Some other vocal types: breathy, falsetto, vocal fry

• Temporal and spectral characteristics

Some Comments

• Studying glottal pulse characteristics generally requires reconstructing the glottal pulse waveform by inverse filtering

• Automatic, exact reconstruction of the glottal waveform from real speech is almost impossible, especially during transient phases of articulation or for high-pitched speakers

• Fortunately, the glottal spectrum can be estimated from the residual signal using pitch prediction

Linear Prediction

• Speech waveform: current samples are correlated with past samples and are therefore predictable

• Short-term correlation

– Occurs within one pitch period

– Formant modulation

– Classical linear prediction analysis (short-term prediction)

$$s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$$

• Long-term correlation

– Occurs across consecutive pitch periods

– Vocal cord vibration

– Long-term (pitch) prediction

$$u(n) \approx b\, u(n-p)$$

Here P is the short-term prediction order and p is the pitch period in samples.
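The short-term predictor coefficients can be estimated by least squares. The sketch below uses the autocorrelation method and solves the normal equations directly with NumPy; this is one standard approach, and the slides do not say which estimation method was used. The test signal is a synthetic second-order autoregressive process with hypothetical coefficients, chosen so that the estimate can be checked against known values.

```python
import numpy as np

def lpc_autocorrelation(s, order):
    """Return a_1..a_P minimizing sum_n (s(n) - sum_k a_k s(n-k))^2
    using the autocorrelation method."""
    s = np.asarray(s, dtype=float)
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def short_term_residual(s, a):
    """u(n) = s(n) - sum_k a_k s(n-k), with samples before n = 0 taken as zero."""
    s = np.asarray(s, dtype=float)
    u = s.copy()
    for k, a_k in enumerate(a, start=1):
        u[k:] -= a_k * s[:-k]
    return u

# Synthetic test signal: s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n)  (hypothetical values)
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(4000):
    s[n] = e[n]
    if n >= 1:
        s[n] += 1.3 * s[n - 1]
    if n >= 2:
        s[n] -= 0.6 * s[n - 2]

a = lpc_autocorrelation(s, order=2)   # should be close to [1.3, -0.6]
u = short_term_residual(s, a)         # residual with the short-term correlation removed
```

For real speech the order is typically larger (on the order of 10 to 16 at an 8 kHz sampling rate), and the analysis is done frame by frame on windowed segments.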

Linear Prediction

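• Short-term predictor <classical linear prediction>

– Removes the short-term correlation, leaving a glottal excitation signal

$$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$$

$$u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$$

• Long-term predictor <pitch prediction>

– Removes the correlation across consecutive pitch periods

$$P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}$$

$$v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$$

[Block diagram: s(n) → short-term predictor $\sum_{i=1}^{M} a_i z^{-i}$, whose output is subtracted from s(n) to give u(n) → long-term predictor $\sum_{k=-1}^{1} b_k z^{-(p+k)}$, whose output is subtracted from u(n) to give v(n)]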
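A three-tap long-term predictor can be estimated in two steps: find the pitch lag p, then fit the taps b_{-1}, b_0, b_1 by least squares so that v(n) = u(n) - sum over k of b_k u(n-p-k) has minimum energy. The sketch below is one plausible way to do this. The lag search by normalized correlation, the search range and the synthetic residual are assumptions, not details given in the slides.

```python
import numpy as np

def pitch_lag(u, min_lag, max_lag):
    """Lag p in [min_lag, max_lag] maximizing the normalized correlation
    between u(n) and u(n-p)."""
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        x, y = u[lag:], u[:-lag]
        score = np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def three_tap_predictor(u, p):
    """Least-squares taps (b_-1, b_0, b_1) and residual v(n) for lag p.
    The first p+1 samples of v are left equal to u (no full history yet)."""
    n = np.arange(p + 1, len(u))
    X = np.stack([u[n - p - k] for k in (-1, 0, 1)], axis=1)
    b, *_ = np.linalg.lstsq(X, u[n], rcond=None)
    v = u.copy()
    v[n] = u[n] - X @ b
    return b, v

# Synthetic residual: a two-sample pulse repeated every 80 samples plus weak noise
rng = np.random.default_rng(0)
u = np.zeros(800)
u[::80] = 1.0
u[1::80] = -0.5
u += 0.01 * rng.standard_normal(800)

p = pitch_lag(u, min_lag=20, max_lag=120)   # search range is an assumption
b, v = three_tap_predictor(u, p)            # b is close to [0, 1, 0] for this signal
```

With three taps instead of one, the predictor can follow a pitch period that falls between integer sample lags, which is why b_{-1} and b_1 are generally nonzero on real speech.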

Linear Prediction: An example

[Figure: waveforms of s(n), u(n) and v(n) (samples 0 to 800), followed by spectral panels with axes Frequency (Hz), 0 to 4000, vs. Intensity (dB), and normalized Frequency, 0 to 1, vs. Power Spectrum Magnitude (dB)]

Examples of glottal spectra estimated by pitch prediction

[Figure: four example spectra, horizontal axis 0 to 500]

Harmonic Structure of Glottal Spectrum

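• Two parameters describe the harmonic structure: the harmonic richness factor and the noise-to-harmonic ratio

• Harmonic richness factor (HRF)

$$\mathrm{HRF}_n = 10 \log \frac{\sum_{H_i \in B_n} H_i}{H_1}$$

• Noise-to-harmonic ratio (NHR)

$$\mathrm{NHR}_n = 10 \log \frac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$$

[Figure: glottal spectrum, horizontal axis 0 to 2000, with the harmonic components H_i and the noise components N_i (marked o) indicated]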
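The two ratios can be computed per frequency band once the harmonic peaks H_i and the noise components N_i have been picked from the glottal spectrum. The sketch below reads the formulas as HRF_n = 10 log10(sum of H_i in band n / H_1) and NHR_n = 10 log10(sum of N_i in band n / sum of H_i in band n). The base-10 logarithm, the use of magnitudes rather than powers, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def band_hrf_nhr(harm_freqs, harm_mags, noise_freqs, noise_mags, band_edges):
    """Per-band HRF_n and NHR_n in dB.

    Band n covers [band_edges[n], band_edges[n+1]). H_1 is taken to be the
    first entry of harm_mags (the fundamental)."""
    harm_freqs = np.asarray(harm_freqs, dtype=float)
    harm_mags = np.asarray(harm_mags, dtype=float)
    noise_freqs = np.asarray(noise_freqs, dtype=float)
    noise_mags = np.asarray(noise_mags, dtype=float)
    h1 = harm_mags[0]
    hrf, nhr = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        h_sum = harm_mags[(harm_freqs >= lo) & (harm_freqs < hi)].sum()
        n_sum = noise_mags[(noise_freqs >= lo) & (noise_freqs < hi)].sum()
        # Bands without harmonics (or without noise) have no defined ratio
        hrf.append(10 * np.log10(h_sum / h1) if h_sum > 0 else -np.inf)
        nhr.append(10 * np.log10(n_sum / h_sum) if h_sum > 0 and n_sum > 0 else np.nan)
    return np.array(hrf), np.array(nhr)

# Hypothetical harmonic and noise components (frequencies in Hz)
hrf, nhr = band_hrf_nhr(
    harm_freqs=[100, 200, 300, 400], harm_mags=[8.0, 4.0, 2.0, 1.0],
    noise_freqs=[150, 250, 350], noise_mags=[0.8, 0.4, 0.2],
    band_edges=[0, 250, 500],
)
```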

Feature Generation

• Feature generation process

[Block diagram: s(n) → S-T prediction → u(n) → L-T prediction on every pitch period → G(z) → G(f) → Mel-scale band-pass filtering → HRF_n, NHR_n, n = 1, 2, …; further outputs: p, g, b_i]

• Acoustic features include the following:

– Fundamental frequency F0

– Pitch prediction gain g

$$g = 10 \log \frac{\sum u^2(n)}{\sum v^2(n)}$$

– Pitch prediction coefficients b_{-1}, b_0, b_1

– HRF_n and NHR_n, n = 1, ..., 10, computed over a 10-band Mel-scale filter bank
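The pitch prediction gain compares the energy of the residual before and after long-term prediction, and the Mel-scale bands determine where HRF_n and NHR_n are measured. The sketch below shows both. The base-10 logarithm, the mel formula 2595 log10(1 + f/700) and the 4000 Hz upper limit are assumptions; the slides specify only that 10 Mel-scale bands are used.

```python
import numpy as np

def prediction_gain_db(u, v):
    """g = 10 log10( sum u(n)^2 / sum v(n)^2 ), in dB."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 10.0 * np.log10(np.sum(u ** 2) / np.sum(v ** 2))

def mel_band_edges(n_bands, f_max_hz):
    """Edges (Hz) of n_bands bands equally spaced on the mel scale from 0 to f_max_hz.
    Uses mel = 2595 * log10(1 + f / 700), a common convention (assumed here)."""
    mel_max = 2595.0 * np.log10(1.0 + f_max_hz / 700.0)
    mel = np.linspace(0.0, mel_max, n_bands + 1)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Toy residuals: v has half the amplitude of u, so the gain is about 6.02 dB
g = prediction_gain_db([2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])

# 10 bands up to 4000 Hz (upper limit assumed for 8 kHz telephone speech)
edges = mel_band_edges(10, 4000.0)
```

A larger g means the long-term predictor removed more energy, which happens for strongly periodic (voiced) frames.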

Experimental Conditions

• Speech quality: telephone speech

• Subjects: 49 male speakers

• Training conditions:

– 3 training sessions, about 90 s of speech in total, recorded over 3 to 6 weeks

– GMM with 128 mixture components

• Testing conditions:

– 12 testing sessions, recorded over 4 to 6 months
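In a GMM-based identification system, each speaker is represented by one mixture model and a test utterance is assigned to the speaker whose model gives the highest likelihood. The sketch below shows only the scoring step, assuming diagonal covariances and a maximum-likelihood decision; the slides do not state these details. The two tiny models and the test frames are hypothetical, and training (for example by EM) is omitted.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (T x D) under a
    diagonal-covariance GMM with M components (weights: M, means/variances: M x D)."""
    X = np.asarray(X, dtype=float)[:, None, :]                        # T x 1 x D
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # M
    log_exp = -0.5 * np.sum((X - means) ** 2 / variances, axis=2)     # T x M
    log_comp = np.log(weights) + log_norm + log_exp                   # T x M
    m = log_comp.max(axis=1, keepdims=True)                           # log-sum-exp trick
    return float(np.mean(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def identify(X, speaker_models):
    """Name of the speaker model with the highest average log-likelihood."""
    scores = {name: gmm_avg_loglik(X, *params) for name, params in speaker_models.items()}
    return max(scores, key=scores.get)

# Hypothetical 2-component, 2-dimensional models: (weights, means, variances)
models = {
    "spk_a": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "spk_b": (np.array([0.5, 0.5]), np.array([[4.0, 4.0], [5.0, 5.0]]), np.ones((2, 2))),
}
X = np.array([[4.1, 3.9], [5.0, 5.2], [4.5, 4.4]])   # hypothetical test frames
best = identify(X, models)
```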

Speaker recognition experiments

• Identification results with long-term prediction related features

Feature        F0     g      [b_{-1} b_0 b_1]    HRF    NHR
Iden. rate     18%    11%    14%                 32%    17%

• Comparison of glottal source features with classical features

Features                     Identification error rate (%)
Fgs: F0_g_HRF_NHR (25)       52%
LPCC_D_A (36)                2.84
LPCC_D_A+Fgs                 2.26
MFCC_D_A                     2.1
MFCC_D_A+Fgs                 1.9

Summary

• Glottal source excitation is important for the perceived naturalness of voice quality and helps distinguish one speaker from another.

• Linear prediction is a powerful tool for speech analysis. Short-term prediction estimates the spectral properties of the supraglottal vocal tract system, while long-term prediction estimates the spectrum of the glottal excitation system.

• Recognition results show that glottal source related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker-discriminative power: adding them lowers the identification error rate from 2.84% to 2.26% with LPCC features and from 2.1% to 1.9% with MFCC features.

Other Applications

• Speech coding

• Speech recognition?

• Emotion recognition from speech!

Thank You!
