Pitch Prediction for Glottal Spectrum Estimation with Applications in Speaker Recognition

Nengheng Zheng

Supervised under Professor P.C. Ching

Nov. 26, 2004

Outline

• Speech production and glottal pulse excitation in detail

• Linear prediction: short-term and long-term

• Glottal spectrum estimation with long-term prediction, and acoustic features

• Application to speaker recognition

Speech Production

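[Figure: Discrete-time model for speech production. Voiced branch: impulse train generator → glottal pulse model G(z), scaled by the gain AV. Unvoiced branch: random noise generator, scaled by the gain AN. The excitation u(n) drives the vocal tract model V(z) followed by the radiation model R(z), producing the speech signal s(n).]

A combined transfer function:

$$H(z) = G(z)\,V(z)\,R(z)$$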
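The cascade H(z) = G(z) V(z) R(z) can be illustrated with a small synthesis sketch for the voiced branch. The concrete filters here are assumptions chosen for illustration, not taken from the slides: a double real pole for G(z) (roughly -12 dB/octave), a single 500 Hz resonator for V(z), and a first difference for R(z). The names `iir_filter` and `synthesize_voiced` are hypothetical.

```python
import math

def iir_filter(b, a, x):
    """Direct-form filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def synthesize_voiced(num_samples=800, pitch_period=80, fs=8000.0, gain_av=1.0):
    # Impulse train generator scaled by the voicing gain AV.
    excitation = [gain_av if n % pitch_period == 0 else 0.0
                  for n in range(num_samples)]
    # G(z): assumed two-pole low-pass with a double real pole at c.
    c = 0.9
    glottal = iir_filter([1.0], [1.0, -2.0 * c, c * c], excitation)
    # V(z): assumed single formant resonator (500 Hz centre, 100 Hz bandwidth).
    r = math.exp(-math.pi * 100.0 / fs)
    theta = 2.0 * math.pi * 500.0 / fs
    tract = iir_filter([1.0], [1.0, -2.0 * r * math.cos(theta), r * r], glottal)
    # R(z): assumed first-difference radiation model, 1 - z^-1.
    speech = iir_filter([1.0, -1.0], [1.0], tract)
    return excitation, speech

excitation, speech = synthesize_voiced()
```

Once the start-up transient has decayed, the output repeats with the pitch period, which is the long-term correlation exploited later by pitch prediction.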

Acoustic Features of Glottal Pulse

• Time domain

– pitch period

– pitch period perturbation (jitter)

– pulse amplitude perturbation (shimmer)

– glottal pulse width

– abruptness of closure of the glottal flow

– aspiration noise

• Frequency domain

– fundamental frequency (F0)

– spectral tilt (slope)

– harmonic richness
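As a small illustration of the two perturbation measures, the sketch below computes jitter and shimmer as the mean absolute cycle-to-cycle difference divided by the mean value. This is one common definition and an assumption here; the slides do not say which definition is used. The period and amplitude values are made up for the example.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two glottal cycles")
    diffs = [abs(values[i + 1] - values[i]) for i in range(len(values) - 1)]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical per-cycle measurements (not from the slides).
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]        # pitch periods
amplitudes = [1.00, 0.95, 1.02, 0.98, 1.00]   # pulse peak amplitudes

jitter = relative_perturbation(periods_ms)    # pitch period perturbation
shimmer = relative_perturbation(amplitudes)   # pulse amplitude perturbation
mean_f0_hz = 1000.0 / (sum(periods_ms) / len(periods_ms))
```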

Glottal Pulse and Voice Quality

• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]

– The shape and periodicity of vocal cord excitation are subject to large variation

– Such variations are significant for preserving the naturalness of speech

– A typical glottal pulse: asymmetric with a shorter falling phase; spectrum with -12 dB/octave decay

• More variation among different speakers than among different utterances of the same speaker [Mathews 1963]

• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]

Various Glottal Pulses

• Some other vocal types: breathy, falsetto, vocal fry

• Temporal and spectral characteristics

Some Comments

• Generally, studying the glottal pulse characteristics requires reconstructing the glottal pulse waveform by inverse filtering

• Automatic and exact reconstruction of the glottal waveform from real speech is almost impossible, especially during transient phases of articulation or for high-pitched speakers

• Fortunately, it is possible to estimate the glottal spectrum from the residual signal with pitch prediction

Linear Prediction

• Speech waveform: the current sample is correlated with past samples and is therefore predictable

• Short-term correlation

– Occurs within one pitch period

– Formant modulation

– Classical linear prediction analysis (short-term prediction)

$$s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$$

• Long-term correlation

– Occurs across consecutive pitch periods

– Vocal cord vibration

– Long-term (pitch) prediction

$$u(n) \approx b\, u(n-p)$$
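A minimal sketch of short-term prediction, assuming the autocorrelation method with the Levinson-Durbin recursion (the slides do not say which estimation method was used). It returns coefficients a_k in the convention s(n) ≈ Σ a_k s(n-k) and the residual u(n) = s(n) - Σ a_k s(n-k). All function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n-lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns ([a_1..a_order], error energy).

    Divides by the running error energy, so r[0] must be positive
    (the frame must not be silent).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

def short_term_residual(s, coeffs):
    """u(n) = s(n) - sum_k a_k s(n-k); samples before n = 0 are taken as zero."""
    u = []
    for n in range(len(s)):
        pred = sum(coeffs[k] * s[n - 1 - k]
                   for k in range(len(coeffs)) if n - 1 - k >= 0)
        u.append(s[n] - pred)
    return u
```

A typical call would be `coeffs, err = levinson_durbin(autocorrelation(frame, P), P)` on a windowed speech frame, followed by `short_term_residual(frame, coeffs)`.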

Linear Prediction

• Short-term predictor (classical linear prediction)

– Removes the short-term correlation, leaving a glottal excitation signal

• Long-term predictor (pitch prediction)

– Removes the correlation across consecutive pitch periods

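$$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}, \qquad u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$$

$$P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}, \qquad v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$$

[Figure: cascade of the two predictors. s(n) minus the short-term predictor output $\sum_{i=1}^{M} a_i z^{-i}$ gives u(n); u(n) minus the long-term predictor output $\sum_{k=-1}^{1} b_k z^{-(p+k)}$ gives v(n).]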
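The three-tap long-term predictor v(n) = u(n) - Σ_{k=-1}^{1} b_k u(n-p-k) can be sketched as below. Two things are assumptions rather than the author's stated method: the pitch lag p is picked by an open-loop search that maximises the normalised correlation of the residual, and the taps are found by least squares over the frame. The helper names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system.

    Raises ZeroDivisionError if the system is singular (e.g. a silent frame).
    """
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def find_pitch_lag(u, min_lag, max_lag):
    """Lag p in [min_lag, max_lag] maximising corr(u(n), u(n-p))^2 / energy."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = sum(u[n] * u[n - lag] for n in range(lag, len(u)))
        den = sum(u[n - lag] ** 2 for n in range(lag, len(u)))
        if num > 0.0 and den > 0.0 and num * num / den > best_score:
            best_lag, best_score = lag, num * num / den
    return best_lag

def long_term_predict(u, p):
    """Least-squares taps [b_-1, b_0, b_1] and residual v(n); requires p >= 2."""
    taps = (-1, 0, 1)
    start = p + 1                       # first n for which u(n-p-1) exists
    idx = range(start, len(u))
    phi = [[sum(u[n - p - i] * u[n - p - j] for n in idx) for j in taps]
           for i in taps]
    rhs = [sum(u[n] * u[n - p - i] for n in idx) for i in taps]
    b = solve_linear(phi, rhs)
    v = list(u[:start])                 # no previous period yet: pass through
    for n in idx:
        v.append(u[n] - sum(bk * u[n - p - k] for bk, k in zip(b, taps)))
    return b, v
```

For a perfectly periodic residual the centre tap approaches 1 and v(n) vanishes; for real speech the side taps absorb the effect of a pitch period that is not an integer number of samples.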

Linear Prediction: An example

[Figure: waveforms of the speech signal s(n), the short-term residual u(n), and the long-term residual v(n) over samples 0 to 800, followed by two further panels with a horizontal axis from 0 to 500.]

[Figure: Examples of glottal spectra estimated by pitch prediction. Panels show intensity (dB) versus frequency (0 to 4000 Hz) and power spectrum magnitude (dB) versus normalized frequency (0 to 1).]

Harmonic Structure of Glottal Spectrum

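• Two parameters describe the harmonic structure: the harmonic richness factor and the noise-to-harmonic ratio

• Harmonic richness factor (HRF)

$$\mathrm{HRF}_n = 10\log\frac{\sum_{H_i \in B_n} H_i}{H_1}$$

• Noise-to-harmonic ratio (NHR)

$$\mathrm{NHR}_n = 10\log\frac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$$

[Figure: glottal spectrum from 0 to 2000 Hz with the harmonic components H_i and the noise components N_i (circles) marked.]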
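A minimal sketch of the two band-wise measures, HRF_n = 10 log(Σ_{H_i∈B_n} H_i / H_1) and NHR_n = 10 log(Σ_{N_i∈B_n} N_i / Σ_{H_i∈B_n} H_i). Several details are assumptions: the logarithm is taken as base 10, a band B_n is treated as a half-open frequency interval, and the harmonic and noise components are assumed to have been picked from the glottal spectrum beforehand. The function name and the sample values are hypothetical.

```python
import math

def band_hrf_nhr(harmonics, noise, band, fundamental_amp):
    """HRF and NHR (in dB) for one frequency band.

    harmonics, noise: lists of (frequency_hz, magnitude) pairs.
    band: (low_hz, high_hz), treated as low <= f < high.
    fundamental_amp: magnitude H_1 of the first harmonic.
    """
    low, high = band
    h_sum = sum(m for f, m in harmonics if low <= f < high)
    n_sum = sum(m for f, m in noise if low <= f < high)
    if h_sum <= 0.0 or n_sum <= 0.0:
        raise ValueError("band contains no harmonic or no noise component")
    hrf = 10.0 * math.log10(h_sum / fundamental_amp)
    nhr = 10.0 * math.log10(n_sum / h_sum)
    return hrf, nhr

# Hypothetical spectrum: harmonics of 100 Hz with noise midway between them.
harmonics = [(100.0, 8.0), (200.0, 4.0), (300.0, 2.0), (400.0, 1.0)]
noise = [(150.0, 0.4), (250.0, 0.2), (350.0, 0.1)]
hrf, nhr = band_hrf_nhr(harmonics, noise, (150.0, 450.0), fundamental_amp=8.0)
```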

Feature Generation

• Feature generation process

[Figure: s(n) → short-term (S-T) prediction → u(n) → long-term (L-T) prediction on every pitch period → G(z), G(f) → Mel-scale band-pass filtering → HRFn, NHRn, n=1,2,…, ; the long-term prediction stage also outputs p, g, bi.]

• Acoustic features include the following:

– Fundamental frequency F0

– Pitch prediction gain g

– Pitch prediction coefficients b-1, b0, b1

– HRFn and NHRn (n = 1, ..., 10)

• 10 Mel-scale frequency bands

$$g = 10\log\frac{\sum_n u^2(n)}{\sum_n v^2(n)}$$
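Two of the feature-generation steps are easy to sketch: the pitch prediction gain g = 10 log(Σ u²(n) / Σ v²(n)) and the edges of the 10 Mel-scale bands. The Mel formula below (2595 log10(1 + f/700)) is one widely used variant, and the 0 to 4000 Hz range and contiguous, non-overlapping bands are assumptions based on telephone speech; the slides do not give these details.

```python
import math

def prediction_gain_db(u, v):
    """Energy ratio of the short-term residual u(n) to the long-term residual v(n)."""
    num = sum(x * x for x in u)
    den = sum(x * x for x in v)
    return 10.0 * math.log10(num / den)

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(num_bands=10, f_low=0.0, f_high=4000.0):
    """Edges (Hz) of num_bands contiguous bands of equal width on the Mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / num_bands
    return [mel_to_hz(m_low + i * step) for i in range(num_bands + 1)]

edges = mel_band_edges()
bands = list(zip(edges[:-1], edges[1:]))   # (low_hz, high_hz) for n = 1..10
```

A large g indicates strongly periodic excitation, since the pitch predictor then removes most of the energy of u(n).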

Experimental Conditions

• Speech quality: telephone speech

• Subjects: 49 male speakers

• Training condition:

– 3 training sessions, about 90 s of speech in total, over 3~6 weeks

– 128-component GMM

• Testing condition:

– 12 testing sessions, over 4~6 months
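For orientation, GMM-based speaker identification is commonly done by scoring the test frames against each speaker's model and choosing the highest average log-likelihood. The sketch below assumes that decision rule and diagonal covariances; the slides do not state how the models were scored. The model layout `(weights, means, variances)` and both function names are hypothetical.

```python
import math

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_det = sum(math.log(2.0 * math.pi * s) for s in var)
        dist = sum((x - m) ** 2 / s for x, m, s in zip(frame, mu, var))
        log_terms.append(math.log(w) - 0.5 * (log_det + dist))
    peak = max(log_terms)                  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def identify_speaker(frames, speaker_models):
    """Return (best speaker, scores): highest average frame log-likelihood wins."""
    scores = {
        name: sum(gmm_log_likelihood(f, *model) for f in frames) / len(frames)
        for name, model in speaker_models.items()
    }
    return max(scores, key=scores.get), scores

# Toy one-dimensional, single-component models (illustration only).
model_a = ([1.0], [[0.0]], [[1.0]])
model_b = ([1.0], [[5.0]], [[1.0]])
best, scores = identify_speaker([[4.8], [5.1], [5.3]],
                                {"A": model_a, "B": model_b})
```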

Speaker Recognition Experiments

Feature       F0    g     [b-1 b0 b1]   HRF   NHR
Iden. rate    18%   11%   14%           32%   17%

• Identification results with long-term prediction related features

Features                  Identification error rate (%)
Fgs: F0_g_HRF_NHR (25)    52
LPCC_D_A (36)             2.84
LPCC_D_A+Fgs              2.26
MFCC_D_A                  2.1
MFCC_D_A+Fgs              1.9

• Comparison of glottal source feature with classical features
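To put the comparison in relative terms, the short calculation below derives the relative reduction in identification error rate when Fgs is appended to each classical feature set, using only the error rates reported above. The function name is hypothetical.

```python
def relative_reduction(baseline, combined):
    """Relative reduction (%) of the error rate when going from baseline to combined."""
    return 100.0 * (baseline - combined) / baseline

lpcc_gain = relative_reduction(2.84, 2.26)   # LPCC_D_A -> LPCC_D_A+Fgs, about 20.4
mfcc_gain = relative_reduction(2.1, 1.9)     # MFCC_D_A -> MFCC_D_A+Fgs, about 9.5
```

On these figures the glottal source features help the LPCC system proportionally more than the MFCC system.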

Summary

• Glottal source excitation is important for the perceptual naturalness of voice quality and helps distinguish one speaker from another.

• Linear prediction is a powerful tool for speech analysis. The spectral properties of the supraglottal vocal tract system can be estimated by short-term prediction, while long-term prediction estimates the spectrum of the glottal excitation.

• Recognition results show that the glottal source related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker-discriminative power.

Other Applications

• Speech coding

• Speech recognition?

• Speech emotion recognition!

Thank You!
