Post on 22-Dec-2015
Pitch Prediction for Glottal Spectrum Estimation with Applications in Speaker Recognition
Nengheng Zheng
Supervised by Professor P.C. Ching
Nov. 26, 2004
Outline
• Speech production and glottal pulse excitation in detail
• Linear prediction: short-term and long-term
• Glottal spectrum estimation with long-term prediction, and the resulting acoustic features
• Implementation for speaker recognition
Speech Production
[Figure: Discrete-time model for speech production. Blocks: impulse train generator, glottal pulse model G(z), random noise generator, gains A_V and A_N, vocal tract model V(z), radiation model R(z). Signals: glottal pulses, excitation u(n), speech signal s(n).]
A combined transfer function:
$H(z) = G(z)\,V(z)\,R(z)$
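The cascade H(z) = G(z)V(z)R(z) can be illustrated with a small synthesis sketch. Everything numeric here is an assumption made for illustration and is not taken from the slides: the 8 kHz sampling rate, the 80-sample pitch period, a two-pole G(z), a single-resonance V(z), and a first-difference R(z). The helper `iir_filter` is likewise hypothetical.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form IIR filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

fs = 8000            # sampling rate in Hz (assumed)
pitch_period = 80    # samples, i.e. F0 = 100 Hz (assumed)
n_samples = 800

# Impulse train excitation for a voiced sound.
excitation = np.zeros(n_samples)
excitation[::pitch_period] = 1.0

# G(z): two real poles, which gives roughly -12 dB/octave decay (assumed form).
g_b, g_a = [1.0], np.convolve([1.0, -0.9], [1.0, -0.9])
# V(z): one resonance near 500 Hz standing in for the vocal tract (assumed).
r, f1 = 0.97, 500.0
v_b, v_a = [1.0], [1.0, -2.0 * r * np.cos(2.0 * np.pi * f1 / fs), r * r]
# R(z): first difference as a simple radiation model (assumed).
r_b, r_a = [1.0, -1.0], [1.0]

# Stage-by-stage filtering: excitation -> G(z) -> V(z) -> R(z).
glottal = iir_filter(g_b, g_a, excitation)
speech = iir_filter(r_b, r_a, iir_filter(v_b, v_a, glottal))

# Combined H(z): multiply the numerator and denominator polynomials.
h_b = np.convolve(np.convolve(g_b, v_b), r_b)
h_a = np.convolve(np.convolve(g_a, v_a), r_a)
speech_combined = iir_filter(h_b, h_a, excitation)
```

Filtering stage by stage and filtering once with the combined polynomials should give the same signal up to rounding, which is what the single transfer function H(z) expresses.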
Acoustic Features of Glottal Pulse
• Time domain
– pitch period
– pitch period perturbation (jitter)
– pulse amplitude perturbation (shimmer)
– glottal pulse width
– abruptness of closure of the glottal flow
– aspiration noise
• Frequency domain
– fundamental frequency (F0)
– spectral tilt (slope)
– harmonic richness
Glottal Pulse and Voice Quality
• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]
– The shape and periodicity of vocal cord excitation are subject to large variation
– Such variations are significant for preserving the naturalness of speech
– A typical glottal pulse: asymmetric, with a shorter falling phase; spectrum with -12 dB/octave decay
• More variation among different speakers than among different utterances of the same speaker [Mathews 1963]
• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]
Various Glottal Pulses
• Some other vocal types: breathy, falsetto, vocal fry
• Temporal and spectral characteristics
Some Comments
• Generally, studying glottal pulse characteristics requires rebuilding the glottal pulse waveform by inverse filtering
• Rebuilding the glottal waveform automatically and exactly from real speech is almost impossible, especially during transient phases of articulation or for high-pitched speakers
• Fortunately, it is possible to estimate the glottal spectrum from the residual signal with pitch prediction
Linear Prediction
• Speech waveform: the current sample is correlated with past samples and is therefore predictable
• Short-term correlation
– Occurs within one pitch period
– Formant modulation
– Classical linear prediction analysis (short-term prediction): $s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$
• Long-term correlation
– Occurs across consecutive pitch periods
– Vocal cord vibration
– Long-term (pitch) prediction: $u(n) \approx b\, u(n-p)$
(P: short-term predictor order; p: pitch period in samples)
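A minimal sketch of short-term prediction, assuming the autocorrelation method with the Levinson-Durbin recursion (the slides do not say which estimation method was used). The function name `lpc_autocorr` and the synthetic second-order test signal are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def lpc_autocorr(s, order):
    """Estimate a_1..a_P for s(n) ~ sum_k a_k s(n-k) via Levinson-Durbin."""
    # Autocorrelation values r[0..order].
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Synthetic test signal: a stable AR(2) process with known coefficients (assumed).
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(len(s)):
    s[n] = e[n] + sum(true_a[k] * s[n - 1 - k] for k in range(2) if n - 1 - k >= 0)

a_hat = lpc_autocorr(s, order=2)

# Residual u(n) = s(n) - sum_k a_k s(n-k), i.e. s(n) filtered by A(z).
u = np.convolve(s, np.concatenate(([1.0], -a_hat)))[:len(s)]
```

On this synthetic signal the estimated coefficients should land close to the true ones, and the residual should carry much less energy than the input.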
Linear Prediction
• Short-term predictor <classical linear prediction>
– Removes the short-term correlation, leaving a glottal excitation signal u(n)
$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$
$u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$
• Long-term predictor <pitch prediction>
– Removes the correlation across consecutive pitch periods, leaving the residual v(n)
$P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}$
$v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$
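[Figure: block diagram of the two cascaded predictors. The short-term predictor $\sum_{i=1}^{M} a_i z^{-i}$ is subtracted from s(n) to give u(n); the long-term predictor $\sum_{k=-1}^{1} b_k z^{-(p+k)}$ is subtracted from u(n) to give v(n).]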
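A minimal sketch of a three-tap long-term predictor, assuming the pitch lag p is picked by maximizing a normalized correlation and the taps b_{-1}, b_0, b_1 are fitted by least squares. Both choices are assumptions, since the slides do not state the estimation procedure. The function `pitch_predict`, the lag range, and the synthetic pulse-train residual are hypothetical.

```python
import numpy as np

def pitch_predict(u, min_lag, max_lag):
    """Return pitch lag p, taps (b_-1, b_0, b_1) and the residual v(n)."""
    # Choose the lag that maximizes the normalized correlation of u(n) with u(n - lag).
    best_p, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        x, y = u[lag:], u[:-lag]
        score = np.dot(x, y) / np.sqrt(np.dot(y, y) + 1e-12)
        if score > best_score:
            best_p, best_score = lag, score
    p = best_p

    # Least-squares fit of u(n) ~ sum_{k=-1}^{1} b_k u(n - p - k) for n >= p + 1.
    start = p + 1
    target = u[start:]
    X = np.stack([u[start - p - k: len(u) - p - k] for k in (-1, 0, 1)], axis=1)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)

    # v(n) = u(n) - sum_k b_k u(n - p - k); the first p + 1 samples are left as u(n).
    v = u.copy()
    v[start:] = target - X @ b
    return p, b, v

# Synthetic "residual": decaying pulses every 80 samples plus weak noise (assumed).
rng = np.random.default_rng(1)
true_p = 80
train = np.zeros(800)
train[::true_p] = 1.0
pulse = np.exp(-np.arange(20) / 4.0)
u = np.convolve(train, pulse)[:800] + 0.01 * rng.standard_normal(800)

# Lag range 20..147 samples, roughly 54 to 400 Hz at an assumed 8 kHz sampling rate.
p, b, v = pitch_predict(u, min_lag=20, max_lag=147)
```

Limiting the maximum lag to less than twice the expected period keeps this simple search from locking onto a pitch multiple; a practical system would need a more careful pitch tracker.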
Linear Prediction: An Example
[Figure: signals s(n), u(n) and v(n) plotted over 800 samples, with two further panels on a 0 to 500 axis.]
[Figure: spectrum panels. Axes: Frequency (Hz), 0 to 4000, against Intensity (dB); Frequency, 0 to 1, against Power Spectrum Magnitude (dB).]
Examples of Glottal Spectra Estimated by Pitch Prediction
[Figure: four panels plotted over a 0 to 500 axis; axis labels not recoverable.]
Harmonic Structure of Glottal Spectrum
• Two parameters describe the harmonic structure
– Harmonic richness factor and noise-to-harmonic ratio
• Harmonic richness factor (HRF)
$\mathrm{HRF}_n = 10 \log_{10} \dfrac{\sum_{H_i \in B_n} H_i}{H_1}$
• Noise-to-harmonic ratio (NHR)
$\mathrm{NHR}_n = 10 \log_{10} \dfrac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$
[Figure: glottal spectrum over a 0 to 2000 frequency axis, with harmonic components H_i and noise components N_i (marked o).]
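A rough sketch of how band-wise HRF and NHR values could be computed from a magnitude spectrum. Several points are assumptions rather than the author's procedure: H_i is read at the i-th multiple of F0, N_i is read halfway below each harmonic, and the band edges are arbitrary. The function `hrf_nhr` and the synthetic spectrum are hypothetical.

```python
import numpy as np

def hrf_nhr(spectrum, freqs, f0, band_edges):
    """Per-band HRF_n and NHR_n in dB; each band is assumed to contain harmonics."""
    n_harm = int(freqs[-1] // f0)
    harm_f = f0 * np.arange(1, n_harm + 1)   # harmonic frequencies
    noise_f = harm_f - f0 / 2.0              # assumed noise sampling points
    H = np.interp(harm_f, freqs, spectrum)   # harmonic magnitudes H_i
    N = np.interp(noise_f, freqs, spectrum)  # noise magnitudes N_i
    hrf, nhr = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        h_sum = H[(harm_f >= lo) & (harm_f < hi)].sum()
        n_sum = N[(noise_f >= lo) & (noise_f < hi)].sum()
        hrf.append(10.0 * np.log10(h_sum / H[0]))
        nhr.append(10.0 * np.log10(n_sum / h_sum))
    return np.array(hrf), np.array(nhr)

# Synthetic spectrum on a 1 Hz grid: flat noise floor plus harmonics decaying as 1/i.
freqs = np.arange(2001.0)
f0 = 100.0
spectrum = np.full(len(freqs), 0.1)
for i in range(1, 21):
    spectrum[int(i * f0)] += 10.0 / i

band_edges = [0.0, 500.0, 1000.0, 2000.0]    # illustrative bands, not Mel-spaced
hrf, nhr = hrf_nhr(spectrum, freqs, f0, band_edges)
```

With harmonics that decay with frequency, the lowest band should show the largest HRF, and NHR stays negative wherever harmonics dominate the noise floor.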
Feature Generation
• Acoustic features include the following:
– Fundamental frequency F0
– Pitch prediction gain $g = 10 \log_{10} \dfrac{\sum_n u^2(n)}{\sum_n v^2(n)}$
– Pitch prediction coefficients b_{-1}, b_0, b_1
– HRF_n and NHR_n, n = 1, ..., 10
• A bank of 10 Mel-scale band-pass filters
• Feature generation process
[Figure: s(n) → short-term (S-T) prediction → u(n) → long-term (L-T) prediction on every pitch period → G(z), G(f) → Mel-scale band-pass filtering → HRF_n, NHR_n, n = 1, 2, …; the L-T prediction stage also outputs p, g, b_i.]
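A small sketch of two of the quantities above: the pitch prediction gain g and the edges of 10 Mel-spaced bands. The Mel mapping 2595·log10(1 + f/700) is one common convention and the 4000 Hz upper limit is a guess based on telephone speech; neither is stated in the slides. The function names are hypothetical.

```python
import numpy as np

def prediction_gain_db(u, v):
    """g = 10 log10( sum u^2(n) / sum v^2(n) ), in dB."""
    return 10.0 * np.log10(np.sum(u ** 2) / np.sum(v ** 2))

def mel_band_edges(n_bands, f_max):
    """Edges in Hz of n_bands bands equally spaced on the Mel scale from 0 to f_max."""
    mel_max = 2595.0 * np.log10(1.0 + f_max / 700.0)
    mels = np.linspace(0.0, mel_max, n_bands + 1)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)

# Toy signals: the residual v has one tenth of the amplitude of u.
u = np.array([1.0, -2.0, 3.0, -4.0])
v = 0.1 * u
g = prediction_gain_db(u, v)        # approximately 20 dB

edges = mel_band_edges(10, 4000.0)  # 11 edges delimiting 10 bands
```

A larger g means the long-term predictor removed more of the excitation energy, which is what one would expect for strongly periodic (voiced) frames.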
Experimental Conditions
• Speech quality: telephone speech
• Subjects: 49 male speakers
• Training condition:
– 3 training sessions, about 90 s of speech in total, over 3~6 weeks
– GMM with 128 mixture components
• Testing condition:
– 12 testing sessions over 4~6 months
Speaker Recognition Experiments
• Identification results with long-term prediction related features

Feature      F0    g     [b-1 b0 b1]   HRF   NHR
Iden. rate   18%   11%   14%           32%   17%

• Comparison of glottal source features with classical features

Features                  Identification error rate (%)
Fgs: F0_g_HRF_NHR_{25}    52
LPCC_D_A_{36}             2.84
LPCC_D_A + Fgs            2.26
MFCC_D_A                  2.1
MFCC_D_A + Fgs            1.9
Summary
• Glottal source excitation is important for the perceived naturalness of a voice and helps distinguish one speaker from another.
• Linear prediction is a powerful tool for speech analysis. The spectral properties of the supraglottal vocal tract system can be estimated by short-term prediction, while long-term prediction estimates the spectrum of the glottal excitation.
• Recognition results show that the glottal source related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker discriminative power.
Other Applications
• Speech coding
• Speech recognition?
• Speech emotion recognition!
Thank You!