Chapter II
Speech Feature Extraction
II.1. The Speech Signal
Speech is a natural form of human communication, and it is a kind of sound
wave. For human-machine communication, this wave must first be converted into
an electronic analog signal and then into a digital signal, because a machine or
computer can only process and analyze digital signals.
Figure II.1. Converting speech to a digital signal: acquisition produces the analog
signal s(t), and the ADC (sampling frequency Fs, sampling period Ts) produces s(n).
Figure II.2. (a) An analog signal s(t). (b) The digital signal s(n) after the ADC process.
Speech is converted to an analog signal by the acquisition stage, which in this case
consists of a microphone and an amplifier-filter circuit. The analog signal s(t) is
sampled at frequency Fs = 1/Ts and converted into the digital signal s(n), as shown
in Figures II.1 and II.2.
II.2. Feature Extraction from the Speech Signal [1]
The feature vector sequence {y1, y2, …, yT} is obtained from spectral analysis of the
speech samples. Many features can be used; in this project, I choose LPC-derived
cepstral parameters because they are very prevalent. A block diagram of the steps
carried out in the extraction is given in Figure II.3.
Figure II.3. Block diagram of the computation required in feature analysis of the HMM
recognizer: preprocessing of S(n) into S'(n); blocking into M frames S(n;m) of length L
with separation ∆L; windowing with w(n) to give Sw(n;m); correlation analysis to order P
giving r(p;m); LPC analysis of order P giving a(p;m); cepstral analysis of order Q giving
γ(q;m); cepstral weighting by wγ(q) giving γw(q;m); and the delta cepstrum ∆γw(q;m).
II.2.1. Preprocessing
The preprocessing, consisting of preemphasis and voice activation detection, is
shown in Figure II.4.
Figure II.4. Preprocessing operation: S(n) → preemphasis → S1(n) → voice activation detection → S'(n)
II.2.1.1. Preemphasis
Preemphasis is commonly a highpass filter used to spectrally flatten the speech
signal. The filter is the FIR one:
H(z) = 1 − 0.95z⁻¹ (II.1)
Filtering in the time domain gives the preemphasis output S1(n):
S1(n) = S(n) − 0.95·S(n − 1) (II.2)
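The preemphasis filter of equations (II.1)-(II.2) can be sketched in Python as
follows (NumPy assumed; the function name is illustrative, and the first sample is
passed through unchanged since S(−1) is undefined):

```python
import numpy as np

def preemphasis(s, alpha=0.95):
    """First-order FIR highpass: S1(n) = S(n) - alpha * S(n-1)."""
    s = np.asarray(s, dtype=float)
    # Keep s[0] as-is; apply the difference from the second sample on.
    return np.append(s[0], s[1:] - alpha * s[:-1])
```

A constant (flat) input is strongly attenuated, which is the intended spectral
flattening of the low frequencies.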
II.2.1.2. Voice Activation Detection (VAD)
The VAD removes the silences at the starting point and endpoint of an utterance
in a speech signal, because these regions decrease the performance of the speech
recognizer. Figures II.5 and II.6 illustrate the VAD operation.
Figure II.5. The speech before VAD:
the Vietnamese word “một” with silences
Figure II.6. The speech after VAD:
the Vietnamese word “một” without silences
The speech signal is divided into M blocks of L samples each. In this project, I
choose L = 80 with Fs = 8000 Hz, that is, 10 ms per block.
The short-term energy ES is commonly used for finding speech:
ES(m) = Σ_{n=L·m+1}^{L·(m+1)} s1²(n) (II.3)
The VAD removes block m if ES(m) < TH. In this project, I choose TH = 0.05.
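The energy-based VAD described above can be sketched as follows (a minimal
version, assuming 0-based indexing instead of the 1-based indexing of equation
(II.3); any trailing partial block is dropped):

```python
import numpy as np

def vad(s1, L=80, threshold=0.05):
    """Keep only the blocks of L samples whose short-term energy >= threshold."""
    s1 = np.asarray(s1, dtype=float)
    blocks = [s1[m * L:(m + 1) * L] for m in range(len(s1) // L)]
    kept = [b for b in blocks if np.sum(b ** 2) >= threshold]
    return np.concatenate(kept) if kept else np.array([])
```

In practice only leading and trailing low-energy blocks need to be dropped; this
sketch removes every silent block for simplicity.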
II.2.2. Block into M Frames
In this step, the signal S'(n) is divided into overlapping frames. Each frame is L
samples long, and consecutive frames are separated by ∆L samples.
Figure II.7. Frame blocking: frames S(n,m) of length L, separated by ∆L, taken from S'(n)
II.2.3. Window Frame
After blocking, each frame is weighted by a window function. A commonly used
window is the Hamming window:
w(i) = 0.54 − 0.46·cos(2πi/(L − 1)), i = 0, 1, …, L − 1 (II.4)
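The frame blocking and Hamming windowing of the two steps above can be
sketched together (NumPy assumed; function name and the choice ∆L = 40 are
illustrative):

```python
import numpy as np

def frame_and_window(s, L=80, dL=40):
    """Split s into overlapping frames of L samples with hop dL, Hamming-weighted."""
    s = np.asarray(s, dtype=float)
    n_frames = 1 + (len(s) - L) // dL
    # Hamming window, equation (II.4)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / (L - 1))
    return np.stack([s[m * dL: m * dL + L] * w for m in range(n_frames)])
```

With ∆L = L/2, each sample contributes to two frames, which smooths the
frame-to-frame spectral estimates.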
II.2.4. Correlation and LPC Analysis
The autocorrelation function is computed up to lag P, where P is the order of the
desired LPC analysis. The LPC coefficients a(i,m) are computed using the
Levinson-Durbin recursion [1].
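The two steps can be sketched as follows (a minimal version of the standard
autocorrelation method and Levinson-Durbin recursion; the source defers the
derivation to [1], so this is an assumption about the exact formulation used):

```python
import numpy as np

def autocorr(frame, P):
    """Autocorrelation r(p) of one frame, for lags p = 0..P."""
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[:len(frame) - p], frame[p:]) for p in range(P + 1)])

def levinson_durbin(r, P):
    """Solve the Toeplitz normal equations; return LPC coefficients a(1..P)
    and the final prediction error energy."""
    a = np.zeros(P + 1)   # a[0] unused; a[i] is the i-th predictor coefficient
    E = r[0]
    for i in range(1, P + 1):
        # Reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, E * (1 - k ** 2)
    return a[1:], E
```

For an AR(1)-like autocorrelation r(p) = 0.9^p, the recursion recovers a(1) = 0.9
and a(2) = 0, as expected.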
II.2.5. Cepstral Analysis and Cepstral Weighting
To get the cepstral coefficients γ(q,m), I use the recursion (with a(q,m) = 0 for
q > P):
γ(q,m) = a(q,m) + (1/q) Σ_{k=1}^{q−1} k·γ(k,m)·a(q−k,m), q = 1, 2, …, Q (II.5)
The cepstral sequence is weighted by a window function wγ(q) of the form:
wγ(q) = 1 + (Q/2)·sin(πq/Q), q = 1, 2, …, Q (II.6)
II.2.6. Delta Cepstrum and the Extraction Output
The delta cepstrum consists of the Q differenced cepstral coefficients:
∆γw(q,m) = ½[γw(q,m+1) − γw(q,m−1)], q = 1, 2, …, Q (II.7)
Finally, the observation vector for the extraction is:
Ot = (γw(1,t), …, γw(Q,t), ∆γw(1,t), …, ∆γw(Q,t)), t = 1, …, M (II.8)
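The delta computation (II.7) and the assembly of the 2Q-dimensional observation
vectors (II.8) can be sketched as follows (edge frames are handled by repeating the
first and last frame, an assumption the source does not specify):

```python
import numpy as np

def delta_cepstrum(gw):
    """gw: array of shape (M, Q), one weighted cepstral vector per frame.
    Central difference over the frame index m, equation (II.7)."""
    padded = np.vstack([gw[:1], gw, gw[-1:]])  # repeat edge frames
    return 0.5 * (padded[2:] - padded[:-2])

def observation_vectors(gw):
    """Concatenate gamma_w and delta gamma_w per frame, equation (II.8)."""
    return np.hstack([gw, delta_cepstrum(gw)])
```

Each row of the result is one observation vector Ot, ready to be fed to the
HMM recognizer.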