Chapter II
Speech Feature Extraction
II.1. The Speech Signal
Speech is a natural form of human communication, and it is a kind of sound
wave. For human-machine communication, this wave must first be converted into
an electronic analog signal and then into a digital signal, because a machine or
computer can only process and analyze digital signals.
Figure II.1. Converting speech to a digital signal: acquisition produces the analog
signal s(t), and the ADC (sampling frequency Fs, sampling period Ts) produces s(n).
Figure II.2. (a) An analog signal s(t). (b) The digital signal s(n) after the ADC process.
Speech is converted to an analog signal by the acquisition stage, which in this case
consists of a microphone and an amplifier-filter circuit. The analog signal s(t) is
sampled at frequency Fs = 1/Ts and converted into the digital signal s(n), as shown
in Figures II.1 and II.2.
II.2. Feature Extraction from the Speech Signal [1]
The feature vector sequence {y1, y2, …, yT} is obtained from spectral analysis of the
speech samples. Many features can be used; in this project, I choose LPC-derived
cepstral parameters because they are very prevalent. A block diagram of the steps
carried out in the extraction is given in Figure II.3.
Figure II.3. Block diagram of the computation required in feature analysis of the HMM
recognizer: preprocessing of S(n) into S'(n); blocking into M frames S(n;m) of length L
with separation ∆L; windowing with w(n) to give Sw(n;m); correlation analysis to order P
giving r(p;m); LPC analysis of order P giving a(p;m); cepstral analysis of order Q giving
γ(q;m); cepstral weighting by wγ(q) giving γw(q;m); and the delta cepstrum ∆γw(q;m).
II.2.1. Preprocessing
The preprocessing, consisting of preemphasis and voice activation detection, is
shown in Figure II.4.
Figure II.4. Preprocessing operation: S(n) → preemphasis → S1(n) → voice activation detection → S'(n)
II.2.1.1. Preemphasis
Preemphasis is commonly a highpass filter used to spectrally flatten the speech
signal. The filter is the FIR one:
H(z) = 1 − 0.95z⁻¹ (II.1)
Filtering in the time domain gives the preemphasis output S1(n):
S1(n) = S(n) − 0.95·S(n − 1) (II.2)
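The preemphasis filter of equations (II.1)-(II.2) can be sketched in Python as
follows (NumPy assumed; the function name is illustrative, and the first sample is
passed through unchanged since S(−1) is undefined):

```python
import numpy as np

def preemphasis(s, alpha=0.95):
    """First-order FIR highpass: S1(n) = S(n) - alpha * S(n-1)."""
    s = np.asarray(s, dtype=float)
    # Keep s[0] as-is; apply the difference from the second sample on.
    return np.append(s[0], s[1:] - alpha * s[:-1])
```

A constant (flat) input is strongly attenuated, which is the intended spectral
flattening of the low frequencies.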
II.2.1.2. Voice Activation Detection (VAD)
The VAD removes the silences at the starting point and endpoint of an utterance
in a speech signal, because these regions decrease the performance of the speech
recognizer. Figures II.5 and II.6 illustrate the VAD operation.
Figure II.5. The speech before VAD:
the Vietnamese word “một” with silences
Figure II.6. The speech after VAD:
the Vietnamese word “một” without silences
The speech signal is divided into M blocks of L samples each. In this project, I
choose L = 80 with Fs = 8000 Hz, that is, 10 ms per block.
The short-term energy ES is commonly used for finding speech:
ES(m) = Σ_{n=L·m+1}^{L·(m+1)} s1²(n) (II.3)
The VAD removes block m if ES(m) < TH. In this project, I choose TH = 0.05.
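The energy-based VAD described above can be sketched as follows (a minimal
version, assuming 0-based indexing instead of the 1-based indexing of equation
(II.3); any trailing partial block is dropped):

```python
import numpy as np

def vad(s1, L=80, threshold=0.05):
    """Keep only the blocks of L samples whose short-term energy >= threshold."""
    s1 = np.asarray(s1, dtype=float)
    blocks = [s1[m * L:(m + 1) * L] for m in range(len(s1) // L)]
    kept = [b for b in blocks if np.sum(b ** 2) >= threshold]
    return np.concatenate(kept) if kept else np.array([])
```

In practice only leading and trailing low-energy blocks need to be dropped; this
sketch removes every silent block for simplicity.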
II.2.2. Block into M Frames
In this step, the signal S'(n) is divided into overlapping frames. Each frame is L
samples long, and consecutive frames are separated by ∆L samples.
Figure II.7. Frame blocking: frames S(n,m) of length L, separated by ∆L, taken from S'(n)
II.2.3. Window Frame
After blocking, each frame is weighted by a window function. A commonly used
window is the Hamming window:
w(i) = 0.54 − 0.46·cos(2πi/(L − 1)), i = 0, 1, …, L − 1 (II.4)
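The frame blocking and Hamming windowing of the two steps above can be
sketched together (NumPy assumed; function name and the choice ∆L = 40 are
illustrative):

```python
import numpy as np

def frame_and_window(s, L=80, dL=40):
    """Split s into overlapping frames of L samples with hop dL, Hamming-weighted."""
    s = np.asarray(s, dtype=float)
    n_frames = 1 + (len(s) - L) // dL
    # Hamming window, equation (II.4)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / (L - 1))
    return np.stack([s[m * dL: m * dL + L] * w for m in range(n_frames)])
```

With ∆L = L/2, each sample contributes to two frames, which smooths the
frame-to-frame spectral estimates.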
II.2.4. Correlation and LPC Analysis
The autocorrelation function is computed up to lag P, where P is the order of the
desired LPC analysis. The LPC coefficients a(i,m) are computed using the
Levinson-Durbin recursion [1].
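The two steps can be sketched as follows (a minimal version of the standard
autocorrelation method and Levinson-Durbin recursion; the source defers the
derivation to [1], so this is an assumption about the exact formulation used):

```python
import numpy as np

def autocorr(frame, P):
    """Autocorrelation r(p) of one frame, for lags p = 0..P."""
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[:len(frame) - p], frame[p:]) for p in range(P + 1)])

def levinson_durbin(r, P):
    """Solve the Toeplitz normal equations; return LPC coefficients a(1..P)
    and the final prediction error energy."""
    a = np.zeros(P + 1)   # a[0] unused; a[i] is the i-th predictor coefficient
    E = r[0]
    for i in range(1, P + 1):
        # Reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, E * (1 - k ** 2)
    return a[1:], E
```

For an AR(1)-like autocorrelation r(p) = 0.9^p, the recursion recovers a(1) = 0.9
and a(2) = 0, as expected.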
II.2.5. Cepstral Analysis and Cepstral Weighting
To get the cepstral coefficients γ(q,m), I use the recursion (with a(q,m) = 0 for
q > P):
γ(q,m) = a(q,m) + (1/q) Σ_{k=1}^{q−1} k·γ(k,m)·a(q−k,m), q = 1, 2, …, Q (II.5)
The cepstral sequence is weighted by a window function wγ(q) of the form:
wγ(q) = 1 + (Q/2)·sin(πq/Q), q = 1, 2, …, Q (II.6)
II.2.6. Delta Cepstrum and the Extraction Output
The delta cepstrum consists of the Q differenced cepstral coefficients:
∆γw(q,m) = ½[γw(q,m+1) − γw(q,m−1)], q = 1, 2, …, Q (II.7)
Finally, the observation vector for the extraction is:
Ot = (γw(1,t), …, γw(Q,t), ∆γw(1,t), …, ∆γw(Q,t)), t = 1, …, M (II.8)
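The delta computation (II.7) and the assembly of the 2Q-dimensional observation
vectors (II.8) can be sketched as follows (edge frames are handled by repeating the
first and last frame, an assumption the source does not specify):

```python
import numpy as np

def delta_cepstrum(gw):
    """gw: array of shape (M, Q), one weighted cepstral vector per frame.
    Central difference over the frame index m, equation (II.7)."""
    padded = np.vstack([gw[:1], gw, gw[-1:]])  # repeat edge frames
    return 0.5 * (padded[2:] - padded[:-2])

def observation_vectors(gw):
    """Concatenate gamma_w and delta gamma_w per frame, equation (II.8)."""
    return np.hstack([gw, delta_cepstrum(gw)])
```

Each row of the result is one observation vector Ot, ready to be fed to the
HMM recognizer.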