ekapol chuangsuwanich and james glass

Robust Voice Activity Detector for Real World Applications Using Harmonicity and

Modulation frequencyEkapol Chuangsuwanich and James Glass

MIT Computer Science and Artificial Intelligence Laboratory,Cambridge, Massachusetts 02139,USA

2012/07/2 汪逸婷

Introduction Harmonicity Modulation Frequency Experiments and Discussions◦ Speech/Non-speech detection capabilities◦ Effect on ASR performance

Conclusion

Outline

Voice Activity Detection(VAD) is the process of identifying segments of speech in a continuous audio stream.

First stage of a speech processing application. Used to both reduce computation by eliminating

unnecessary transmission and processing of non-speech segments, as well as reduce potential mis-recognition errors in such segments.(binary)

In this paper, we consider the task of giving commands to an autonomous forklift.

Introduction

In high quality recording conditions, energy-based methods perform well.

In noisy conditions, energy-based measures ofter produce a considerable number of false alarms.

Large variety of other features have been investigated for use in noisy environments.

Tuning parameters. Difficulties dealing with non-stationary or

instantaneous types of noises that are frequent in our work.

Introduction

Harmonicity is a basic property of any periodic signal, it is not useful by itself.

Some works shows good results even at very low SNR conditions.

Modulation frequencies which measure the temporal rate of change of energy across different frequency band.

Some studies about purely MF-inspired set of features for discriminating between speech and non-speech which gave good generalization to noise types not included in training.

Introduction

Figure 1: Example of distant speech from the forklift database.

Harmonicity

Autocorrelation ： as a function of the lag

Periodicity measure will yield a high value for pure tones.(use bandpass cepstral liftering)

Harmonicity

Harmonicity

Modulation Frequency

Database consisting of speech commands from 26 subjects with added noise to simulate a variety of SNR values ranging from -5 to 15 dB.

Recorded with an array microphone. Classification was without any additional post-

processing. Transition between speech and non-speech were

excluded. Training 4 min of speech, 22 min of non-speech.

Experiments-Speech/Non-speech detection capabilities

For comparison, include results based on Relative Spectral Entropy(RSE), Long-Term Signal Variability(LTSV), and statistical model-based VADs using MFCCs and MF as features to Gaussian mixtrue models(GMMs).

EER: Equal error rate FAR: False alarm rate


Performance varied specific kinds of noise.


EER and FAR for each noise type are usually obtained at different thresholds.


The SNR values ranged from 5 to 25 dB. Corpus: four microphone channels, 10 hours long, 400

command words. Commands are sparse, forklift is mostly idle waiting for

commands or taking the time to executed commands. ASR was performed using PocketSUMMIT, had a

vocabulary size of 57 words, that out of domain(OOD) command was modeled by a single GMM.

ASR model was trained on over 3600 utterances of commands from 18 talkers.

Experiments-Effect on ASR performance

Examined the influence of different VAD systems on the ASR results in the task of commanding an autonomous forklift in real world environments.

Experiments-Effect on ASR performance

Described the task of VAD on distant speech in low SNR environments for an autonomous robotic forklift.

Designed a two-stage approach for speech /non-speech classification.

Parallel SVM outperformed classification based on the whole MF spectrum. Combination MF and simple harmonicity measure helped reduce false alarm rate by another 9% at low miss rates.

In ASR, VAD outperformed standard VADs and achieved a WER very close to that of hand labeled end point.

Conclusion

ekapol chuangsuwanich and james glass

Documents