icassp 20041 a voice activity detector using the chi-square test beena ahmed1 and w. harvey holmes2...

ICASSP 2004 1

A VOICE ACTIVITY DETECTOR USING THE CHI-SQUARE TEST

Beena Ahmed1 and W. Harvey Holmes2

1RMIT University, Melbourne, VIC 3001, Australia

2University of New South Wales, Sydney, NSW 2052, Australia

Reporter : Chen, Hung-Bin

ICASSP 2004 2

Outline

• Introduction– This paper proposes

• Testing data– speech plus noise

• The Chi-Square test

• Noise estimation using the Chi-Square test

• Block diagram of the proposed VAD

• Experiments and results

• Summary and Conclusions

ICASSP 2004 3

Introduction - VAD

• significance– Speech recognition systems designed for real world conditions, a

robust discrimination of speech from other sounds is a crucial step.

• advantage– Speech discrimination can also be used for coding or

telecommunication applications.

• proposed system– a feature set inspired by investigations of various stages of the auditory

system

ICASSP 2004 4

This paper proposes

• This paper proposes a voice activity detector (VAD) that makes the speech/noise classification by applying the statistical chi-square test to each frame.

• It also uses a continuous update of the background noise estimate.

• If the chi-square test determines that they are close, the frame is declared to be noise, otherwise speech.

( but not mean that is a real speech data )

ICASSP 2004 5

Testing data

• There are two ways of interpreting noise model:– 1. The ‘additive noise’ point of view;

i.e. the speech signal is corrupted by additive noise– 2. The ‘additive signal’ point of view;

i.e. the residual noise signal has speech added to it.

• Noise model in this paper, the speech signal s(t) is corrupted by uncorrelated additive noise w(t), giving the degraded composite signal y(t):– y(t) = s (t) + w(t) .

ICASSP 2004 6

The Chi-Square test

• In this paper, the chi-square test has been applied to the two problems of noise estimation and voice activity detection.

• The chi-square test seeks to determine if there is a good fit between the frequencies of the noise observed data and the frequencies of the expected or theoretical data.

hypothesis is proposed for each frame:

1,,0 ˆ: pkpk wxH

pkpkpk swxH ,1,,1 ˆ:

:noise-only frame

:noise-pluse-speech frame

ICASSP 2004 7

Noise estimation using the Chi-Square test

• To estimate the noise with improved spectral sensitivity, the input signal is first passed through a bandpass filterbank H1, …, HM, as shown in Figure.

• Each of these sub-banded signals is then divided into time frames of size L, where M<<L.

ICASSP 2004 8


• Let , be the signal in sub-band k and frame p. • The samples in the noise vector , will be given by

• The noise estimates , from the previous frame, p-1, are first grouped into N bins.

• The vector o of observation is obtained in the same way from the current signal , , using the same N bins.

pkx ,

)(),...,1( ,,, Lwwx pkpkpk pkx ,

1, pkx

Ni eeee ,...,,...,1

Ni oooo ,...,,...,1

pkx ,

1i

K(i) (L) Wk,p

ei

ICASSP 2004 9


• The chi-square test is then applied to these bins, where the chi-square statistic is given by

• The hypothesis test can thus be written as

• effectively– Background noise is usually assumed to have a Gaussian distribution.

– Speech is (relatively) non-stationary and is non-Gaussian.

– The chi-square test can be used to effectively identify the noise-only segments of the signal.

N

i i

ii

e

eo

1

22

1

02

H

Hthresthold

ICASSP 2004 10

simple one-pole smoothing filter λ

• if the current frame is determined to be a noise-only frame similar to the current noise estimate, the pdf of the noise is updated using a simple one-pole smoothing filter with smoothing coefficientλ.

• Testing found that a smoothing coefficient of 0.95 gave the best results.

ICASSP 2004 11

Block diagram of the proposed VAD

• The proposed VAD consists of three main components:– 1. Chi-square based noise estimator (as in Section 3),– 2. Noise reduction system, and– 3. Chi-square based decision module.– A block diagram of the complete system is shown in Figure

ICASSP 2004 12

Implementation Details

• An analysis frame size of 256 samples was used, with a step size of 64 samples.

• The optimum band gains were estimated from the spectrally decomposed noisy signal using the Ephraim and Malah gain function with a smoothing parameter of 0.98.

• Noise estimates were obtained from the chi-square noise estimator described in Section 3.

ICASSP 2004 13


• The VAD decision was made using another chisquare detector on the synthesised noise-reduced signal.

• In it a frame size of 15 ms, i.e. 125 samples, was used with an overlap of 25 samples at a sampling frequency of 8192 Hz.

• The signal was divided into 8 equal sub-bands using a digital IIR filterbank of elliptic bandpass filters of order 10, with bandwidths of 487.5 Hz.

• The histogram of the current frame was calculated and divided into 7 classes (or bins)(ei).

• The first three frames were recursively averaged to provide the starting value of the noise vector.

ICASSP 2004 14


• The current input frame was combined with seven previous frames, giving frame sizes of 122 ms for noise estimation.

• The observed and expected distributions in each sub-band were divided into 7 classes (ei) and tested using the chi-square test.

• Only those frames in which the null hypothesis was accepted for all 8 sub-bands were declared to be noise.

• The noise estimate was then updated using a single-sided one-pole recursive filter.

• Testing found that a smoothing coefficient of 0.95 gave the best results.

ICASSP 2004 15

Testing data

• The VAD was applied to 12 sentences with added babble, car, pink, and white noises at SNRs of 0, 5, and 10 dB.

• The chi-square detector manages to pick up short bursts of speech that are only a few hundred samples (20 ms) long.

ICASSP 2004 16

Experimental Results

ICASSP 2004 17

Experimental Results

ICASSP 2004 18

ReferenceVAD proposed by Sohn et al. [5]

• The developed VAD employ the decision directed parameter estimation method for the likelihood ratio test.

ICASSP 2004 19

Summary and Conclusions

• The proposed VAD shows better performances in various environmental conditions, while requires only a few parameters to optimize when compared with the others.

• Applications such as – automatic classification

– segmentation of animal sounds

– an efficient encoding of speech and music

icassp 20041 a voice activity detector using the chi-square test beena ahmed1 and w. harvey holmes2...

Documents