
Page 1: CS 552/652 Speech Recognition with Hidden Markov Models Winter 2011

CS 552/652
Speech Recognition with Hidden Markov Models

Winter 2011

Oregon Health & Science University
Center for Spoken Language Understanding

John-Paul Hosom

Lecture 7
January 31

Features of the Speech Signal

Page 2

Features: How to Represent the Speech Signal

Features must (a) provide a good representation of phonemes and (b) be robust to non-phonetic changes in the signal.

[Figure: the word “Markov” spoken by a male speaker and by a female speaker, shown in the time domain (waveform) and in the frequency domain (spectrogram).]

Page 3

Features: Windowing

In many cases, the math assumes that the signal is periodic. We always assume that the data is zero outside the window.

When we apply a rectangular window, there are usually discontinuities in the signal at the ends. So we can window the signal with other shapes, making the signal closer to zero at the ends. This attenuates discontinuities.

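Hamming window:

$$h(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

[Figure: Hamming window shape; horizontal axis from 0 to N-1, vertical axis from 0.0 to 1.0.]

Typical window size is 16 msec, which equals 256 samples for a 16-kHz (microphone) signal and 128 samples for an 8-kHz (telephone) signal. Window size does not have to equal frame size!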
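As a rough illustration of the windowing step, the sketch below builds the Hamming window from the formula above and applies it to overlapping frames. The helper names (`hamming_window`, `frame_signal`), the random placeholder signal, and the 10-msec frame step are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def hamming_window(N):
    """Return h(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0..N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

def frame_signal(x, window_size, frame_step):
    """Cut x into overlapping frames and multiply each by the Hamming window."""
    h = hamming_window(window_size)
    starts = range(0, len(x) - window_size + 1, frame_step)
    return np.array([x[s:s + window_size] * h for s in starts])

fs = 16000                            # 16-kHz (microphone) signal
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)           # one second of placeholder data, not real speech
# 16-msec window (256 samples) with an assumed 10-msec frame step (160 samples)
frames = frame_signal(x, window_size=256, frame_step=160)
```

Because the window (256 samples) is longer than the frame step (160 samples), consecutive frames overlap, which is one way the window size can differ from the frame size.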

Page 4

Features: Spectrum and Cepstrum

(log power) spectrum:

1. Hamming window
2. Fast Fourier Transform (FFT)
3. Compute $10 \log_{10}(r^2 + i^2)$

where r is the real component and i is the imaginary component.

[Figure: a windowed waveform (amplitude vs. time) and its log power spectrum (energy in dB vs. frequency).]
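The three steps above can be sketched in a few lines of NumPy. This is a minimal example under stated assumptions: the function name, the small `eps` added to avoid log(0), and the 1-kHz test tone are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

def log_power_spectrum(frame, eps=1e-10):
    """Log power spectrum (dB) of one frame, for non-negative frequencies."""
    windowed = frame * np.hamming(len(frame))         # step 1: Hamming window
    spectrum = np.fft.rfft(windowed)                  # step 2: FFT
    power = spectrum.real ** 2 + spectrum.imag ** 2   # r^2 + i^2
    return 10.0 * np.log10(power + eps)               # step 3: 10 log10(.)

fs = 16000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 1000 * t)                  # 1-kHz test tone (assumed input)
spec_db = log_power_spectrum(frame)
peak_hz = np.argmax(spec_db) * fs / len(frame)        # frequency of the strongest bin
```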

Page 5

Features: Spectrum and Cepstrum

cepstrum: treat the spectrum as a signal subject to frequency analysis…

1. Compute log power spectrum
2. Compute FFT of log power spectrum
3. Use only the lower 13 values (cepstral coefficients)
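A minimal sketch of these steps follows. The function name, the `eps` guard, and the division by the frame length are assumptions for illustration (many implementations use an inverse FFT or a DCT in step 2, which differs only in scaling for a real, symmetric log spectrum).

```python
import numpy as np

def cepstrum(frame, num_coeffs=13, eps=1e-10):
    """Lower cepstral coefficients of one frame, following the three steps."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.fft(windowed)
    log_power = 10.0 * np.log10(np.abs(spectrum) ** 2 + eps)   # step 1
    # step 2: the log power spectrum of a real frame is real and symmetric,
    # so its FFT is real up to rounding; the 1/N scaling is an assumed convention
    ceps = np.fft.fft(log_power).real / len(frame)
    return ceps[:num_coeffs]                                    # step 3

rng = np.random.default_rng(0)
c = cepstrum(rng.standard_normal(256))    # placeholder frame, not real speech
```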

Page 6

Features: Spectrum and Cepstrum

Why Use Cepstral Features?

• number of features is small (13 vs. 64 or 128 for spectrum)

• models spectral envelope (relevant to phoneme identity), not (irrelevant) pitch

• coefficients tend to not be correlated with each other (useful to assume that non-diagonal elements of covariance matrix are zero… see Lecture 5, slide 29)

• (relatively) easy to compute

Cepstral features are very commonly used. Another type of feature that is commonly used is called Linear Predictive Coding (LPC).

Page 7

Features: Autocorrelation

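Autocorrelation: a measure of periodicity in the signal

$$R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\,x(m+k)$$

$$R_n(k) = \sum_{m=0}^{N-1-k} x_n(m)\,w(m)\,x_n(m+k)\,w(m+k)$$

where n = start sample of analysis, and m = sample within the analysis window, 0…N-1.

[Figure: speech waveform, amplitude vs. time.]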

Page 8

Features: Autocorrelation

Autocorrelation: a measure of periodicity in the signal

From the previous slide:

$$R_n(k) = \sum_{m=0}^{N-1-k} x_n(m)\,w(m)\,x_n(m+k)\,w(m+k)$$

and if we set $y_n(m) = x_n(m)\,w(m)$, so that y is the windowed signal of x where the window is zero for m<0 and m>N-1, then:

$$R_n(k) = \sum_{m=0}^{N-1-k} y_n(m)\,y_n(m+k), \qquad 0 \le k \le K$$

where K is the maximum autocorrelation index desired.

Note that $R_n(k) = R_n(-k)$, because when we sum over all values of m that have a non-zero y value (or just change the limits in the summation to m=k to N-1 and use negative k), then

$$y_n(m)\,y_n(m+k) \;\rightarrow\; y_n(m-k)\,y_n(m) = y_n(m)\,y_n(m-k)$$

The shift is the same in both cases; the limits of summation change to m = k…N-1.
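The windowed autocorrelation can be sketched as below. The function name, the 100-Hz test tone, and the search range for the peak are assumptions made for illustration; the point is only that the autocorrelation of a periodic signal peaks near its period (80 samples here).

```python
import numpy as np

def short_time_autocorr(x, n, N, K):
    """R_n(k) for k = 0..K, using a Hamming window of length N starting at sample n."""
    y = x[n:n + N] * np.hamming(N)            # y_n(m) = x_n(m) w(m)
    # sum over m = 0..N-1-k of y(m) * y(m+k)
    return np.array([np.dot(y[:N - k], y[k:]) for k in range(K + 1)])

fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(1024) / fs)   # 100-Hz tone: period = 80 samples
R = short_time_autocorr(x, n=0, N=256, K=120)
lag = 20 + np.argmax(R[20:])                  # strongest peak away from lag 0
```

The peak falls slightly below 80 samples because the window tapers the products at larger lags; this is the "fall-off" that the modified autocorrelation is meant to remove.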

Page 9

Features: Autocorrelation

Autocorrelation of speech signals: (from Rabiner & Schafer, p. 143)

Page 10

Features: Autocorrelation

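Eliminate “fall-off” by including samples in $w_2$ not in $w_1$:

$$w_1(m) = \begin{cases} 1 & 0 \le m \le N-1 \\ 0 & \text{otherwise} \end{cases} \qquad w_2(m) = \begin{cases} 1 & 0 \le m \le N-1+k \\ 0 & \text{otherwise} \end{cases}$$

$$\hat{R}_n(k) = \sum_{m=0}^{N-1} x_n(m)\,w_1(m)\,x_n(m+k)\,w_2(m+k), \qquad 0 \le k \le K$$

$$\hat{R}_n(k) = \sum_{m=0}^{N-1} x_n(m)\,x_n(m+k), \qquad 0 \le k \le K$$

= modified autocorrelation function
= cross-correlation function

Note: requires k·N multiplications; can be slow.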

Page 11

Features: LPC

Linear Predictive Coding (LPC) provides
• low-dimension representation of speech signal at one frame
• representation of spectral envelope, not harmonics
• “analytically tractable” method
• some ability to identify formants

LPC models the speech signal at time point n as an approximate linear combination of the previous p samples:

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \qquad (1)$$

where $a_1, a_2, \ldots, a_p$ are constant for each frame of speech.

We can make the approximation exact by including a “difference” or “residual” term:

$$s(n) = \sum_{k=1}^{p} a_k\,s(n-k) + G\,u(n) \qquad (2)$$

where G is a scalar gain factor, and u(n) is the (normalized) error signal (residual).

Page 12

Features: LPC

LPC can be used to generate speech from either the error signal (residual) or a sequence of impulses as input:

$$\hat{s}(m) = e(m) + a_1 s(m-1) + a_2 s(m-2) + \cdots + a_p s(m-p)$$

where ŝ is the generated speech, and e(m) is the error signal or a sequence of impulses. However, we use LPC here as a representation of the signal.

The values $a_1 \ldots a_p$ (where p is typically 10 to 15) describe the signal over the range of one window of data (typically 128 to 256 samples).

While it’s true that 10-15 values are needed to predict (model) only one data point (estimating the value at time m from the previous p points), the same 10-15 values are used to represent all data points in the analysis window. When one frame of speech has more than p values, there is data reduction. For speech, the amount of data reduction is about 10:1. In addition, LPC values model the spectral envelope, not pitch information.

Page 13

Features: LPC

If the error over a segment of speech is defined as

$$E_n = \sum_{m=M_1}^{M_2} e_n^2(m) \qquad (3)$$

$$E_n = \sum_{m=M_1}^{M_2} \left( s_n(m) - \sum_{k=1}^{p} a_k\,s_n(m-k) \right)^2 \qquad (4)$$

then we can find $a_k$ by setting $\partial E_n / \partial a_k = 0$ for k = 1,2,…p, obtaining p equations and p unknowns:

$$\sum_{k=1}^{p} \hat{a}_k \sum_{m=M_1}^{M_2} s_n(m-i)\,s_n(m-k) = \sum_{m=M_1}^{M_2} s_n(m-i)\,s_n(m), \qquad 1 \le i \le p \qquad (5)$$

(as shown on next slide…)

Error is minimum (not maximum) when the derivative is zero, because as any $a_k$ changes away from its optimum value, the error will increase.

Page 14

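Features: LPC

$$E_n = \sum_{m=M_1}^{M_2} \left( s(m) - \sum_{k=1}^{p} a_k\,s(m-k) \right)^2 \qquad (5\text{-}1)$$

$$E_n = \sum_{m=M_1}^{M_2} \left( s^2(m) - 2 s(m) \sum_{k=1}^{p} a_k\,s(m-k) + \sum_{k=1}^{p} a_k\,s(m-k) \sum_{r=1}^{p} a_r\,s(m-r) \right) \qquad (5\text{-}2)$$

$$\begin{aligned} E_n = \sum_{m=M_1}^{M_2} \Big( s^2(m) &- 2 s(m)\,a_1 s(m-1) + a_1 s(m-1) \sum_{r=1}^{p} a_r\,s(m-r) \\ &- 2 s(m)\,a_2 s(m-2) + a_2 s(m-2) \sum_{r=1}^{p} a_r\,s(m-r) \\ &\;\;\vdots \\ &- 2 s(m)\,a_p s(m-p) + a_p s(m-p) \sum_{r=1}^{p} a_r\,s(m-r) \Big) \end{aligned} \qquad (5\text{-}3)$$

$$\begin{aligned} \frac{\partial E_n}{\partial a_1} = 0 = \frac{\partial}{\partial a_1} \sum_{m=M_1}^{M_2} \Big( s^2(m) &- 2 s(m)\,a_1 s(m-1) + a_1 s(m-1)\,a_1 s(m-1) + \cdots + a_1 s(m-1)\,a_p s(m-p) \\ &- 2 s(m)\,a_2 s(m-2) + a_2 s(m-2)\,a_1 s(m-1) + \cdots + a_2 s(m-2)\,a_p s(m-p) + \cdots \Big) \end{aligned} \qquad (5\text{-}4)$$

$$\begin{aligned} \sum_{m=M_1}^{M_2} \Big( &-2 s(m) s(m-1) + 2 a_1 s(m-1) s(m-1) + a_2 s(m-1) s(m-2) + \cdots + a_p s(m-1) s(m-p) \\ &+ a_2 s(m-2) s(m-1) + a_3 s(m-3) s(m-1) + \cdots + a_p s(m-p) s(m-1) \Big) = 0 \end{aligned} \qquad (5\text{-}5)$$

$$\sum_{m=M_1}^{M_2} \Big( -2 s(m) s(m-1) + 2 a_1 s(m-1) s(m-1) + 2 a_2 s(m-1) s(m-2) + \cdots + 2 a_p s(m-1) s(m-p) \Big) = 0 \qquad (5\text{-}6)$$

repeat (5-4) to (5-6) for $a_2, a_3, \ldots, a_p$

$$\sum_{m=M_1}^{M_2} \left( -2 s(m) s(m-i) + 2 \sum_{k=1}^{p} a_k\,s(m-i)\,s(m-k) \right) = 0, \qquad 1 \le i \le p \qquad (5\text{-}7)$$

$$-2 \sum_{m=M_1}^{M_2} s(m) s(m-i) + 2 \sum_{m=M_1}^{M_2} \sum_{k=1}^{p} a_k\,s(m-i)\,s(m-k) = 0, \qquad 1 \le i \le p \qquad (5\text{-}8)$$

$$\sum_{k=1}^{p} a_k \sum_{m=M_1}^{M_2} s(m-i)\,s(m-k) = \sum_{m=M_1}^{M_2} s(m-i)\,s(m), \qquad 1 \le i \le p \qquad (5\text{-}9)$$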

Page 15

Features: LPC Autocorrelation Method

Then, defining

$$\phi_n(i,k) = \sum_{m=M_1}^{M_2} s_n(m-i)\,s_n(m-k) \qquad (6)$$

we can re-write equation (5) as:

$$\sum_{k=1}^{p} \hat{a}_k\,\phi_n(i,k) = \phi_n(i,0), \qquad 1 \le i \le p \qquad (7)$$

We can solve for $a_k$ using several methods. The most common method in speech processing is the “autocorrelation” method:

Force the signal to be zero outside of the interval 0 ≤ m ≤ N-1:

$$\hat{s}_n(m) = s_n(m)\,w(m) \qquad (8)$$

where w(m) is a finite-length window (e.g. Hamming) of length N that is zero when m is less than 0 or greater than N-1. ŝ is the windowed signal. As a result,

$$E_n = \sum_{m=0}^{N+p-1} e_n^2(m) \qquad (9)$$

Page 16

Features: LPC Autocorrelation Method

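How did we get from

$$E_n = \sum_{m=M_1}^{M_2} e_n^2(m) \qquad \text{(equation (3))}$$

to

$$E_n = \sum_{m=0}^{N+p-1} e_n^2(m) \qquad \text{(equation (9))}$$

with window from 0 to N-1? Why not

$$E_n = \sum_{m=0}^{N-1} e_n^2(m) \quad ??$$

Because the value for $e_n(m)$ may not be zero when m > N-1… for example, when m = N+p-1, then

$$e_n(N+p-1) = \hat{s}_n(N+p-1) - \sum_{k=1}^{p} a_k\,\hat{s}_n(N+p-1-k)$$

$$e_n(N+p-1) = \underbrace{\hat{s}_n(N+p-1)}_{0} - \underbrace{a_1 \hat{s}_n(N+p-1-1)}_{0} - \cdots - a_p\,\hat{s}_n(N+p-1-p)$$

$\hat{s}_n(N-1)$ is not zero!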

Page 17

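Features: LPC Autocorrelation Method

Because of setting the signal to zero outside the window, eqn (6) becomes:

$$\phi_n(i,k) = \sum_{m=0}^{N-1+p} \hat{s}_n(m-i)\,\hat{s}_n(m-k), \qquad 1 \le i \le p,\; 0 \le k \le p \qquad (10)$$

and this can be expressed as

$$\phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} \hat{s}_n(m)\,\hat{s}_n(m+(i-k)), \qquad 1 \le i \le p,\; 0 \le k \le p \qquad (11)$$

and this is identical to the autocorrelation function for |i-k|, because the autocorrelation function is symmetric, $R_n(-x) = R_n(x)$:

$$\phi_n(i,k) = R_n(|i-k|) \qquad (12)$$

where

$$R_n(x) = \sum_{m=0}^{N-1-x} \hat{s}_n(m)\,\hat{s}_n(m+x) \qquad (13)$$

so the set of equations for $a_k$ (eqn (7)) can be written as a combination of (7) and (12):

$$\sum_{k=1}^{p} \hat{a}_k\,R_n(|i-k|) = R_n(i), \qquad 1 \le i \le p \qquad (14)$$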

Page 18

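Features: LPC Autocorrelation Method

Why can equation (10):

$$\phi_n(i,k) = \sum_{m=0}^{N-1+p} \hat{s}_n(m-i)\,\hat{s}_n(m-k), \qquad 1 \le i \le p,\; 0 \le k \le p$$

be expressed as (11):

$$\phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} \hat{s}_n(m)\,\hat{s}_n(m+(i-k)), \qquad 1 \le i \le p,\; 0 \le k \le p \quad ???$$

Original equation:

$$\phi_n(i,k) = \sum_{m=0}^{N-1+p} \hat{s}_n(m-i)\,\hat{s}_n(m-k), \qquad 1 \le i \le p,\; 0 \le k \le p$$

Add i to the $\hat{s}_n()$ offset and subtract i from the summation limits. If m < 0, $\hat{s}_n(m)$ is zero, so still start the sum at 0:

$$\phi_n(i,k) = \sum_{m=0}^{N-1+p-i} \hat{s}_n(m)\,\hat{s}_n(m+i-k), \qquad 1 \le i \le p,\; 0 \le k \le p$$

Replace p in the sum limit by k, because when m > N+k-1-i, $\hat{s}_n(m+i-k) = 0$ and k is always ≤ p:

$$\phi_n(i,k) = \sum_{m=0}^{N-1+k-i} \hat{s}_n(m)\,\hat{s}_n(m+i-k), \qquad 1 \le i \le p,\; 0 \le k \le p$$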

Page 19

Features: LPC Autocorrelation Method

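In matrix form, equation (14) looks like this:

$$\begin{bmatrix} R_n(0) & R_n(1) & R_n(2) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & R_n(1) & \cdots & R_n(p-2) \\ R_n(2) & R_n(1) & R_n(0) & \cdots & R_n(p-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & R_n(p-3) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \\ \vdots \\ \hat{a}_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ R_n(3) \\ \vdots \\ R_n(p) \end{bmatrix}$$

There is a recursive algorithm to solve this: Durbin’s solution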

Page 20

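Features: LPC Durbin’s Solution

Solve a Toeplitz (symmetric, diagonal elements equal) matrix for values of $\hat{a}$:

$$\sum_{k=1}^{p} \hat{a}_k\,R_n(|i-k|) = R_n(i), \qquad 1 \le i \le p$$

$$E^{(0)} = R(0)$$

$$k_i = \frac{R(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} R(i-j)}{E^{(i-1)}}, \qquad 1 \le i \le p$$

$$\alpha_i^{(i)} = k_i$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1$$

$$E^{(i)} = (1 - k_i^2)\,E^{(i-1)}$$

$$\hat{a}_j = \alpha_j^{(p)}, \qquad 1 \le j \le p$$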
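The recursion above can be sketched directly in Python. The function name `durbin` and its return layout are assumptions made for this example; the autocorrelation values are the R(0), R(1), R(2) of the 2nd-order worked example in this lecture.

```python
import numpy as np

def durbin(R, p):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..p, with Durbin's recursion.

    R holds R(0)..R(p). Returns (a, k, E): predictor coefficients a_1..a_p,
    the values k_1..k_p, and the final prediction error E^(p).
    """
    alpha = np.zeros(p + 1)            # alpha[j] holds alpha_j^(i); index 0 unused
    k = np.zeros(p + 1)
    E = R[0]                           # E^(0) = R(0)
    for i in range(1, p + 1):
        acc = sum(alpha[j] * R[i - j] for j in range(1, i))
        k[i] = (R[i] - acc) / E
        prev = alpha.copy()            # alpha^(i-1), needed while updating in place
        alpha[i] = k[i]
        for j in range(1, i):
            alpha[j] = prev[j] - k[i] * prev[i - j]
        E = (1.0 - k[i] ** 2) * E
    return alpha[1:], k[1:], E

R = np.array([197442.0, 117319.0, -946.0])
a, k, E = durbin(R, p=2)
print(a, k, E)
```

For small p the result can be cross-checked against a general linear solver applied to the Toeplitz matrix; the recursion is preferred because it exploits the Toeplitz structure and yields the error at every order along the way.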

Page 21

Features: LPC Example

For 2nd-order LPC, with waveform samples

{462 16 -294 -374 -178 98 40 -82}

If we apply a Hamming window (because we assume the signal is zero outside of the window; with a rectangular window, there is a large prediction error at the edges of the window), which is

{0.080 0.253 0.642 0.954 0.954 0.642 0.253 0.080}

then we get

{36.96 4.05 -188.85 -356.96 -169.89 62.95 10.13 -6.56}

and so R(0) = 197442, R(1) = 117319, R(2) = -946.

$$E^{(0)} = R(0) = 197442$$

$$k_1 = \frac{R(1) - 0}{E^{(0)}} = \frac{R(1)}{R(0)} = 0.59420$$

$$\alpha_1^{(1)} = k_1 = 0.59420$$

Page 22

Features: LPC Example

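$$E^{(1)} = (1 - k_1^2)\,E^{(0)} = \frac{R^2(0) - R^2(1)}{R(0)} = 127731$$

$$k_2 = \frac{R(2) - \alpha_1^{(1)} R(1)}{E^{(1)}} = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)} = -0.55317$$

$$\alpha_2^{(2)} = k_2 = -0.55317$$

$$\alpha_1^{(2)} = \alpha_1^{(1)} - k_2\,\alpha_1^{(1)} = \frac{R(1)R(0) - R(1)R(2)}{R^2(0) - R^2(1)} = 0.92289$$

$$\hat{a}_1 = 0.92289, \qquad \hat{a}_2 = -0.55317$$

Note: if we divide all R(·) values by R(0), the solution is unchanged, but the error $E^{(i)}$ is now the “normalized error”.
Also: $-1 \le k_r \le 1$ for r = 1,2,…,p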

Page 23

Features: LPC Example

We can go back and check our results by using these coefficients to “predict” the windowed waveform:

{36.96 4.05 -188.85 -356.96 -169.89 62.95 10.13 -6.56}

and compute the error from time 0 to N+p-1 (Eqn (9)):

time 0: 0 × 0.92289 + 0 × -0.55317 = 0 vs. 36.96, diff = 36.96
time 1: 36.96 × 0.92289 + 0 × -0.55317 = 34.1 vs. 4.05, diff = -30.05
time 2: 4.05 × 0.92289 + 36.96 × -0.55317 = -16.7 vs. -188.85, diff = -172.15
time 3: -188.9 × 0.92289 + 4.05 × -0.55317 = -176.5 vs. -356.96, diff = -180.43
time 4: -357.0 × 0.92289 + -188.9 × -0.55317 = -225.0 vs. -169.89, diff = 55.07
time 5: -169.9 × 0.92289 + -357.0 × -0.55317 = 40.7 vs. 62.95, diff = 22.28
time 6: 62.95 × 0.92289 + -169.89 × -0.55317 = 152.1 vs. 10.13, diff = -141.95
time 7: 10.13 × 0.92289 + 62.95 × -0.55317 = -25.5 vs. -6.56, diff = 18.92
time 8: -6.56 × 0.92289 + 10.13 × -0.55317 = -11.6 vs. 0, diff = 11.65
time 9: 0 × 0.92289 + -6.56 × -0.55317 = 3.63 vs. 0, diff = -3.63

This gives a total squared error of 88,645, or an error normalized by R(0) of 0.449.

(If p=0, then we predict nothing, and the total error equals R(0), so we can normalize all error values by dividing by R(0).)
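The check above can be reproduced with a short script. This is a sketch under stated assumptions: the coefficients are the rounded values 0.92289 and -0.55317 from the Durbin example, and the variable names and zero-padding approach are choices made for this illustration.

```python
import numpy as np

samples = np.array([462, 16, -294, -374, -178, 98, 40, -82], dtype=float)
s_hat = samples * np.hamming(len(samples))     # windowed signal
a = np.array([0.92289, -0.55317])              # a_1, a_2 from the Durbin example
p, N = len(a), len(s_hat)

# Zero-pad so that samples outside 0..N-1 are zero, then predict for m = 0..N+p-1.
padded = np.concatenate([np.zeros(p), s_hat, np.zeros(p)])
errors = np.array([
    padded[m + p] - sum(a[k] * padded[m + p - (k + 1)] for k in range(p))
    for m in range(N + p)
])                                             # actual value minus predicted value
total_error = np.sum(errors ** 2)
normalized_error = total_error / np.sum(s_hat ** 2)   # divide by R(0)
print(round(total_error), round(normalized_error, 3))
```

Running the prediction past the end of the window (up to m = N+p-1) matters: the last p errors are non-zero because the predictor still sees the final windowed samples.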

Page 24

Features: LPC Example

If we look at a longer speech sample of the vowel /iy/, do pre-emphasis of 0.97 (see following slides), and perform LPC of various orders, we get:

[Figure: normalized prediction error (total squared error / R(0)) vs. LPC order; horizontal axis: LPC order from 0 to 10; vertical axis: 0.00 to 0.20.]

which implies that order 4 captures most of the important information in the signal (probably corresponding to 2 formants).

Page 25

Features: LPC and Linear Regression

• LPC models the speech at time n as a linear combination of the previous p samples. The term “linear” does not imply that the result involves a straight line, e.g. s = ax + b.

• Speech is then modeled as a linear but time-varying system (piecewise linear).

• LPC is a form of linear regression, called multiple linear regression, in which there is more than one parameter. In other words, instead of an equation with one parameter of the form $s = a_1 x + a_2 x^2$, an equation of the form $s = a_1 x + a_2 y + \ldots$

• Because the function is linear in its parameters, the solution reduces to a system of linear equations, and other techniques for linear regression (e.g. gradient descent) are not necessary.

Page 26

Features: LPC Spectrum

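We can compute the spectral envelope magnitude from LPC parameters by evaluating the transfer function S(z) for $z = e^{j\omega}$:

$$S(e^{j\omega}) = \frac{G}{A(e^{j\omega})} = \frac{G}{1 - \sum_{k=1}^{p} a_k\,e^{-j\omega k}}$$

Because $e^{-j\omega} = \cos(\omega) - j\sin(\omega)$, the log power spectrum is:

$$\mathrm{Re}\{A\} = 1 - \sum_{k=1}^{p} a_k \cos\left(\frac{2\pi n k}{N}\right), \qquad \mathrm{Im}\{A\} = \sum_{k=1}^{p} a_k \sin\left(\frac{2\pi n k}{N}\right), \qquad 0 \le n \le N$$

$$10\log_{10}|S(n)|^2 = 10\log_{10}\left(\frac{G}{\sqrt{\mathrm{Re}\{A\}^2 + \mathrm{Im}\{A\}^2}}\right)^2 = 10\log_{10}\frac{G^2}{\mathrm{Re}\{A\}^2 + \mathrm{Im}\{A\}^2}$$

Each resonance (complex pole) in the spectrum requires two LPC coefficients; each spectral slope factor (frequency=0 or Nyquist frequency) requires one LPC coefficient.

For 8 kHz speech, 4 formants ⇒ LPC order of 9 or 10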
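As a hedged sketch of evaluating the LPC envelope from the real and imaginary parts of A, the function below computes the log power spectrum at frequency indices 0..N/2 (the remaining indices mirror these). The function name, the gain G = 1, N = 256, and the use of the example coefficients 0.92289 and -0.55317 are assumptions for illustration.

```python
import numpy as np

def lpc_log_spectrum(a, G=1.0, N=256):
    """10*log10(G^2 / (Re{A}^2 + Im{A}^2)) at frequency indices n = 0..N/2."""
    k = np.arange(1, len(a) + 1)                  # k = 1..p
    n = np.arange(N // 2 + 1)[:, None]            # frequency index as a column
    angle = 2.0 * np.pi * n * k / N
    re_A = 1.0 - np.sum(a * np.cos(angle), axis=1)
    im_A = np.sum(a * np.sin(angle), axis=1)
    return 10.0 * np.log10(G ** 2 / (re_A ** 2 + im_A ** 2))

a = np.array([0.92289, -0.55317])                 # assumed 2nd-order example values
envelope_db = lpc_log_spectrum(a)
```

The same values can be obtained by zero-padding the sequence {1, -a_1, …, -a_p} to length N and taking an FFT, which is usually faster for large N.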

Page 27

Features: LPC Representations

Page 28

Features: LPC Cepstral Features

The LPC values are more correlated than cepstral coefficients. But, for a GMM with a diagonal covariance matrix, we want the values to be uncorrelated.

So, we can convert the LPC coefficients into cepstral values:

$$c_n = a_n + \frac{1}{n} \sum_{j=1}^{n-1} (n-j)\,a_j\,c_{n-j}$$
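The recursion can be sketched as follows. The function name is hypothetical, and treating $a_n$ as zero for n greater than the LPC order p is an assumption of this example (it lets the recursion produce more cepstral values than there are LPC coefficients).

```python
import numpy as np

def lpc_to_cepstrum(a, num_ceps):
    """Convert LPC coefficients a_1..a_p into cepstral values c_1..c_num_ceps."""
    p = len(a)
    c = np.zeros(num_ceps + 1)                    # c[0] is unused
    for n in range(1, num_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0         # assumption: a_n = 0 for n > p
        acc = 0.0
        for j in range(1, min(n - 1, p) + 1):
            acc += (n - j) * a[j - 1] * c[n - j]
        c[n] = a_n + acc / n
    return c[1:]

# Single-coefficient check: for one coefficient a, the recursion gives c_n = a^n / n.
c = lpc_to_cepstrum(np.array([0.9]), num_ceps=3)
```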

Page 29

Features: LPC History

Wikipedia has an interesting article on the history of LPC:

… The first ideas leading to LPC started in 1966 when S. Saito and F. Itakura of NTT described an approach to automatic phoneme discrimination that involved the first maximum likelihood approach to speech coding. In 1967, John Burg outlined the maximum entropy approach. In 1969 Itakura and Saito introduced partial correlation, May Glen Culler proposed real-time speech encoding, and B. S. Atal presented an LPC speech coder at the Annual Meeting of the Acoustical Society of America.

In 1972 Bob Kahn of ARPA, with Jim Forgie (Lincoln Laboratory) and Dave Walden (BBN Technologies), started the first developments in packetized speech, which would eventually lead to Voice over IP. In 1976 the first LPC conference took place over the ARPANET using the Network Voice Protocol.

It is [currently] used as a form of voice compression by phone companies, for example in the GSM standard. It is also used for secure wireless, where voice must be digitized, encrypted and sent over a narrow voice channel.

[from http://en.wikipedia.org/wiki/Linear_predictive_coding]

Page 30

Features: Pre-emphasis

The source signal for voiced sounds has a slope of -6 dB/octave:

[Figure: source spectrum, energy (dB) vs. frequency from 0 to 4k.]

We want to model only the resonant energies, not the source. But LPC will model both source and resonances.

If we pre-emphasize the signal for voiced sounds, we flatten it in the spectral domain, and the source of speech more closely approximates impulses. LPC can then model only resonances (important information) rather than resonances + source.

Pre-emphasis:

$$s'_n(m) = s_n(m) - k\,s_n(m-1), \qquad k = 0.97$$

Page 31

Features: Pre-emphasis

Adaptive pre-emphasis: a better way to flatten the speech signal

1. LPC of order 1
   = value of spectral slope in dB/octave
   = R(1)/R(0) = first value of normalized autocorrelation

2. Result = pre-emphasis factor

$$s'_n(m) = s_n(m) - \frac{R(1)}{R(0)}\,s_n(m-1)$$
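Both the fixed (0.97) and the adaptive pre-emphasis can be sketched in a few lines. The function names and the choice to pass the first sample through unchanged are assumptions of this example; the four-sample input is only a toy signal.

```python
import numpy as np

def preemphasize(s, k=0.97):
    """s'(m) = s(m) - k * s(m-1); the first sample is passed through unchanged."""
    out = np.array(s, dtype=float)
    out[1:] = s[1:] - k * s[:-1]
    return out

def adaptive_preemphasize(s):
    """Use k = R(1)/R(0), the first value of the normalized autocorrelation."""
    R0 = np.dot(s, s)
    R1 = np.dot(s[:-1], s[1:])
    k = R1 / R0
    return preemphasize(s, k=k), k

s = np.array([1.0, 2.0, 3.0, 4.0])     # toy signal for illustration
fixed = preemphasize(s)
adaptive, k = adaptive_preemphasize(s)
```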

Page 32

Features: Frequency Scales

The human ear has different responses at different frequencies.

Two scales are common:

Mel scale:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Bark scale (from Traunmüller 1990):

$$\mathrm{Bark}(f) = \frac{26.81\,f}{1960 + f} - 0.53$$

[Figure: two plots of energy (dB) vs. frequency.]
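The two conversions are simple enough to write directly; the function names below are assumptions for this sketch.

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark(f) = 26.81 * f / (1960 + f) - 0.53 (Traunmüller 1990)."""
    return 26.81 * f / (1960.0 + f) - 0.53

for f in (100, 1000, 4000):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

Both scales are roughly linear at low frequencies and compress high frequencies, which is why filter banks spaced evenly on these scales become wider as frequency increases.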

Page 33

Features: Perceptual Linear Prediction (PLP)

Perceptual Linear Prediction (PLP) is composed of the following steps:

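1. Hamming window

$$h(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$

2. power spectrum (not dB scale) (frequency analysis)

$$S = X_r^2 + X_i^2$$

3. Bark scale filter banks (trapezoidal filters) (freq. resolution)

$$\mathrm{Bark}(f) = \frac{26.81\,f}{1960 + f} - 0.53$$

4. equal-loudness weighting (frequency sensitivity)

$$E(f) = \left(\frac{f^2}{f^2 + 1.6 \times 10^5}\right)^2 \cdot \frac{f^2 + 1.44 \times 10^6}{f^2 + 9.61 \times 10^6}$$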

Page 34

Features: PLP

PLP is composed of the following steps:

5. cubic compression (relationship between intensity and loudness)

$$\Phi(f) = \Xi(f)^{0.33}$$

6. LPC analysis (compute autocorrelation from freq. domain)

$$s(n) = \sum_{k=1}^{p} a_k\,s(n-k) + G\,u(n) \qquad (p = 12)$$

7. compute cepstral coefficients

$$c_n = a_n + \frac{1}{n} \sum_{i=1}^{n-1} (n-i)\,a_i\,c_{n-i}$$

8. weight cepstral coefficients

$$c'_n = \exp(n \cdot k)\,c_n, \qquad k = 0.6$$

Page 35

Features: Mel-Frequency Cepstral Coefficients (MFCC)

Mel-Frequency Cepstral Coefficients (MFCC) is composed of the following steps:

1. pre-emphasis

$$s'_n(m) = s_n(m) - 0.97\,s_n(m-1)$$

2. Hamming window

$$h(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$

3. power spectrum (not dB scale)

$$S = X_r^2 + X_i^2$$

4. Mel scale filter banks (triangular filters)

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Page 36

Features: MFCC

MFCC is composed of the following steps:

5. compute log spectrum from filter banks: $10 \log_{10}(S)$

6. convert log energies from filter banks to cepstral coefficients

$$c_i = \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right)$$

where $m_j$ = log energy values and N = number of filter banks

7. weight cepstral coefficients

$$c'_n = \exp(n \cdot k)\,c_n, \qquad k = 0.6$$
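Step 6 is a cosine transform of the filter-bank log energies and can be sketched as below. The function name, starting the index i at 0, the choice of 13 output coefficients, and the 24-filter example input are assumptions for illustration.

```python
import numpy as np

def filterbank_to_cepstrum(m, num_ceps=13):
    """c_i = sum_{j=1..N} m_j * cos(pi*i/N * (j - 0.5)) for i = 0..num_ceps-1."""
    N = len(m)
    j = np.arange(1, N + 1)
    i = np.arange(num_ceps)[:, None]
    return np.sum(m * np.cos(np.pi * i / N * (j - 0.5)), axis=1)

log_energies = np.full(24, 5.0)       # a flat log spectrum from 24 assumed filters
c = filterbank_to_cepstrum(log_energies)
```

For a flat log spectrum only the first value is non-zero, which illustrates how the lower coefficients describe the overall level and broad shape of the spectrum.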

Page 37

Features: Delta Values

The PLP and MFCC features, as presented, analyze the speech signal at one time frame. However, speech changes over time. To capture the dynamics of speech, use “delta” features.

Using this formula for the delta of the nth cepstral coefficient c, at time t:

$$d_{n,t} = c_{n,t} - c_{n,t-1}$$

is too noisy!

Use this regression formula (Furui, 1986, IEEE Trans ASSP, 34, pp 52-59):

$$d_{n,t} = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{n,t+\theta} - c_{n,t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}$$

where Θ = window size = 2 frames (50 msec window).

The “acceleration” or “delta-delta” coefficients may also be used, and are computed by applying the same formula to the delta features.
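The regression formula can be sketched as follows. The function name and the edge handling (repeating the first and last frames so every frame has Θ neighbours on each side) are assumptions of this example; the slides do not say how the edges are treated.

```python
import numpy as np

def delta(c, theta_max=2):
    """Regression deltas for c with shape (num_frames, num_coeffs)."""
    T = len(c)
    # assumed edge handling: repeat the first and last frames
    padded = np.pad(c, ((theta_max, theta_max), (0, 0)), mode="edge")
    denom = 2.0 * sum(th ** 2 for th in range(1, theta_max + 1))
    d = np.zeros_like(c, dtype=float)
    for th in range(1, theta_max + 1):
        ahead = padded[theta_max + th: theta_max + th + T]     # c at t + theta
        behind = padded[theta_max - th: theta_max - th + T]    # c at t - theta
        d += th * (ahead - behind)
    return d / denom

c = np.arange(10, dtype=float)[:, None] * 3.0   # one coefficient rising by 3 per frame
d = delta(c)          # delta features
dd = delta(d)         # delta-delta: the same formula applied to the deltas
```

For a coefficient that rises linearly, the interior deltas equal the slope, as expected from a regression estimate.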

Page 38

Features: Delta Values

Derivation of delta formula:

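Linear regression formula for the slope of n points $(x_i, y_i)$:

$$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$

With $x_i$ = frame index from -Θ to Θ and $y_i = c_{n,t+i}$ (the n in the regression formula is the number of points, not the cepstral index):

$$d_{n,t} = \frac{n \sum_{\theta=-\Theta}^{\Theta} \theta\,c_{n,t+\theta} - \sum_{\theta=-\Theta}^{\Theta} \theta \sum_{\theta=-\Theta}^{\Theta} c_{n,t+\theta}}{n \sum_{\theta=-\Theta}^{\Theta} \theta^2 - \left(\sum_{\theta=-\Theta}^{\Theta} \theta\right)^2}$$

Remove factors that cancel out (the sum of θ from -Θ to Θ is zero):

$$d_{n,t} = \frac{\sum_{\theta=-\Theta}^{\Theta} \theta\,c_{n,t+\theta}}{\sum_{\theta=-\Theta}^{\Theta} \theta^2}$$

Change the limits on the sum from (-Θ … Θ) to (1 … Θ):

$$d_{n,t} = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{n,t+\theta} - c_{n,t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}$$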

Page 39

Removing Noise: CMS

Convolutional noise (from type of channel) is
• convolutional in the time domain
• multiplicative in the spectral domain
• additive in the log-spectral domain

So, we can remove constant convolutional effects by removing constant values from the log spectrum, which is called spectral mean subtraction.

Cepstral Mean Subtraction (CMS) removes the mean value from cepstral parameters to reduce convolutional noise, in the cepstral domain.

CMS assumes that there is enough of a signal that the mean is not significantly influenced by the speech component of the signal.
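A minimal sketch of CMS over one utterance is shown below. The function name, the random placeholder features, and the constant per-coefficient offset standing in for a channel effect are assumptions made for this example.

```python
import numpy as np

def cepstral_mean_subtraction(ceps):
    """Subtract the per-coefficient mean over all frames; ceps is (num_frames, num_coeffs)."""
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 13))     # placeholder cepstral features
channel = rng.standard_normal(13)          # constant offset per coefficient (assumed channel)
normalized = cepstral_mean_subtraction(clean + channel)
```

Because a fixed channel adds the same offset to every frame in the cepstral domain, subtracting the mean gives the same result with or without the offset.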

Page 40

Removing Noise: RASTA

2 types of noise:
• additive: noise values added to time-domain signal
• convolutional: noise values added to log-domain spectrum

In RASTA, the time trajectory of the log power spectrum (or cepstral coefficients) is filtered with a band-pass filter:

The high-pass portion of the filter alleviates channel characteristics; the low-pass portion smooths small frame-to-frame changes.

If, instead of log compression, a linear-log compression is done (linear for small spectral values), both additive and convolutional noise can be suppressed.

Page 41

Features: Summary

Typical features represent the speech signal using a small analysis window (e.g. 16 msec) with a medium-size frame rate (e.g. 10 msec).

Dynamics of speech and removing channel noise are addressed, but current solutions may not be optimal solutions.

PLP and MFCC features are advantageous because they mimic some of the human processing of the signal, emphasizing the perceptually-important aspects.

The use of a small number of cepstral coefficients approximates the spectral envelope, removing (unwanted) information about pitch.

Usually one set of generic features is used; features are not “targeted” to any specific phonemes.