
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 8, 133–146, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

A Robust, Real-Time Voice Activity Detection Algorithm for Embedded Mobile Devices

BIAN WU∗

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, People's Republic of China

wu [email protected]

XIAOLIN REN

Motorola Labs China Research Center, 38F, CITIC Square, 1168 Nanjing Rd. W. Shanghai 200041, People's Republic of China

[email protected]

CHONGQING LIU

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, People's Republic of China

YAXIN ZHANG

Motorola Labs China Research Center, 38F, CITIC Square, 1168 Nanjing Rd. W. Shanghai 200041, People's Republic of China

Abstract. When an Automatic Speech Recognition (ASR) system is applied in noisy environments, Voice Activity Detection (VAD) is crucial to the performance of the overall system. The employment of VAD for ASR on embedded mobile systems will minimize physical distractions and make the system convenient to use. Conventional VAD algorithms are either of high complexity, which makes them unsuitable for embedded mobile devices, or of low robustness, which holds back their application in noisy mobile environments. In this paper, we propose a robust VAD algorithm specifically designed for ASR on embedded mobile devices. The architecture of the proposed algorithm is based on a two-level decision-making strategy, where there is an interaction between a lower, feature-based level and subsequent decision logic based on a finite-state machine. Several discriminating features are employed in the lower level to improve the robustness of the VAD. The two-level decision strategy allows different features to be used in different states and reduces the cost of the algorithm, which makes the proposed algorithm suitable for embedded mobile devices. The evaluation experiments show that the proposed VAD algorithm is robust and contributes to the overall performance gain of the ASR system in various acoustic environments.

Keywords: voice activity detection, noisy robust, real-time, embedded, automatic speech recognition

∗To whom all correspondence should be addressed.


1. Introduction

Voice Activity Detection (VAD) is a very important front-end processing step in practical Automatic Speech Recognition (ASR) systems. A well-designed voice activity detector can greatly improve the performance of an ASR system in terms of accuracy and speed. Moreover, the employment of VAD for ASR on embedded mobile systems, such as PDAs, smart phones and wireless car kits, will minimize physical distractions and make the system convenient to use.

According to Savoji's evaluation (Savoji, 1989), the required characteristics of an ideal voice activity detector are: reliability, robustness, accuracy, adaptation, simplicity, real-time processing and no prior knowledge of the noise. Among them, robustness against noisy environments has been the most difficult to accomplish. In high SNR conditions, the VAD algorithms proposed to date show satisfactory performance, while in low SNR environments, all of them degrade to a certain extent. Robust VAD in noisy environments remains an unsolved problem. An ASR system on embedded mobile devices may work in various environments with various noise types at various SNRs, so the voice activity detector should be robust in all kinds of conditions. At the same time, the VAD algorithm should be of low complexity, which is mandatory for real-time embedded systems. Therefore simplicity and robustness against noise are two essential characteristics of a practicable voice activity detector for an embedded ASR system.

There have been numerous VAD algorithms proposed to date. In these algorithms, different discriminating features are incorporated to cope with various environmental noises. Among them, the energy-based features (Benassine, 1997; Junqua et al., 1991; Rabiner, 1993) are the most popular because of their simplicity and effectiveness. Spectrum-related features are also widely used in newly proposed VAD algorithms, such as spectral entropy, cepstral features, etc. (Chengalvaryan, 2001; De Wet, 2001; Ganapathiraju, 1996; Martens, 2000; Nemer, 2001; Picone, 1993; Renevey, 2001; Shieh, 1999; Tanyer, 2000; Tucker, 1992). Some features, such as HOS (Nemer, 2001), show satisfactory noise robustness, but they cannot be employed in an embedded system because of their high computational complexity. Others are of low complexity, but they only show good performance in certain noisy environments. Recent experiments show that a feature may effectively deal with certain environments but fail to work in others. Thus a voice activity detector relying on a single feature will not meet the demands of embedded ASR systems. To make the voice detection algorithm applicable in various noise environments, researchers have proposed to use multiple features coupled with complex combination algorithms such as CART (Shin, 2000), ANN (Wu, 2000), etc. However, the combination algorithms add to the complexity of the VAD itself, which makes them unsuitable for real-time embedded systems. Moreover, it is difficult to construct a good combination algorithm suitable for all environments.

Since simplicity and robustness are the two main challenges for VAD algorithms for embedded systems, we set out to design a VAD algorithm deliberately tailored to embedded ASR systems. The algorithm is not only robust in various noisy environments but also of low computational complexity.

In the proposed algorithm, a noise model is introduced to describe environmental noise signals. The noise model is a simple multi-dimensional Gaussian distribution model of the average spectral character of the noise. A score, which is one of the discriminating features and is generated from the noise model, is used to reflect the difference in spectrum between noise and speech signals. To improve robustness and accuracy in various noisy environments, short-time energy and spectral distance are also considered. Some conventional features, such as zero-crossing rate, are not used because they are not immune to noise. The computation costs of these discriminating features are low in themselves, which makes them suitable for embedded devices. A two-level decision-making strategy is employed to obtain an accurate detection. First, we carry out a feature-based decision, which is also named the lower level decision. In this stage, the discriminating features are extracted from each frame, and a preliminary decision is obtained only according to these features. Then a four-state automaton is introduced for the final decision, which is named the upper level decision. The two decision levels are not separated but interact with each other. Different features are employed in different states in order to lower the computation cost. In other words, the features used for the lower decision-making level are selected according to the automaton state of the upper level. On the other hand, the automaton state changes according to the lower level decision.

A series of experiments were carried out to evaluate the performance of the proposed VAD algorithm. The evaluation database includes not only data collected in natural noisy mobile environments but also data containing artificially added noise. The results show that the proposed VAD algorithm is effective in all the tested acoustic environments.

2. The Proposed Algorithm

Speech signals can be assumed to be stationary in a short interval, say 10 ms (80 samples at an 8 kHz sampling rate). In practice the frame window has an overlap so that there is a smooth transition of the features from frame to frame. We also assume that there are at least 100–250 ms of pure noise signal preceding the actual speech signal, which is easy to implement in practical systems since it is only a trade-off between the time delay of the system and the sufficiency of data to model the noise signal. By using these beginning frames the noise model can easily be seeded. All the discriminating features and their thresholds are then obtained based on the noise model. A preliminary decision is then made according to these features. The preliminary decision is sent to a four-state logic to get the final VAD output by careful consideration.
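As a concrete illustration, the framing step above can be sketched as follows (a minimal sketch: the 80-sample frame length follows the text, while the 40-sample hop, i.e. a 50% overlap, is an assumed, illustrative value, since the paper only states that the windows overlap):

```python
import numpy as np

def frame_signal(x, frame_len=80, hop=40):
    """Split a signal into overlapping frames.

    frame_len=80 corresponds to 10 ms at an 8 kHz sampling rate; the
    hop of 40 samples (50% overlap) is an illustrative choice.
    """
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# 400 samples = 50 ms at 8 kHz -> 9 overlapping 10 ms frames
frames = frame_signal(np.arange(400, dtype=float))
```

Each row of `frames` would then be passed to the FFT unit described in Section 2.1.1.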

2.1. The Discriminating Features for VAD

Due to the importance of VAD in ASR, various algorithms have been proposed in recent years. The algorithms differ in many aspects such as complexity, the features of the speech signal employed, and even the applications at which they are targeted. The discriminating features determine the basic performance of a voice activity detector to a great extent. In early VAD algorithms, short-time energy, zero-crossing rate and LPC coefficients were among the common features used for speech detection, but they fail to work in low SNR conditions. To improve the robustness of the voice activity detector, many new discriminating features have been developed, such as cepstral features, entropy, etc., most of which are spectrum-related features, given the observation that speech signals often show different characteristics from noise signals in the spectral domain.

Embedded ASR systems are used in various noisy environments at various SNRs. So the VAD for the system must present good performance not only in clean environments but also in low SNR environments. On the other hand, another major concern of embedded implementation is the computational complexity. It is prohibitive to use computationally expensive features in a real-time embedded system. Therefore the features used for VAD must meet the following requirements:

(1) capability to discriminate speech from noise;
(2) suitability for real-time implementation;
(3) robustness to noise;
(4) low computation cost.

2.1.1. The Noise Model. In our algorithm, an incoming frame is first inputted into an FFT unit. The noise character is described on the assumption of two facts: the noise is stationary, and the value of the power density spectrum at each FFT point follows a Gaussian distribution. Thus the noise can be modeled by a multidimensional Gaussian model N(µ, Σ), in which µ is the mean vector of the power density at the FFT points and Σ is the covariance matrix. Σ is assumed to be a diagonal matrix for the sake of simplicity, so the model can be expressed as N(µ, σ²). If we have J FFT points,

µ = (µ1 µ2 µ3 . . . µJ)′    (1)

σ² = (σ1² σ2² σ3² . . . σJ²)′    (2)

After initializing the noise model with the first 100–250 ms of signal, a likelihood score is computed for each incoming frame. If the spectral character of the inputted frame is similar to that of the noise model, the score will be high, and vice versa. The score of each frame is given by:

Score(Oi) = N(Oi; µ, σ²) = (1/√(2π)σ) · e^(−(Oi − µ)²/(2σ²))    (3)

where Oi is the power density spectrum of the frame. The score is an important discriminating factor in distinguishing the speech frames from the noise frames, since it is in fact a similarity measure.
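The score of Eq. (3) is evaluated per FFT point with the diagonal covariance; a minimal sketch in log-likelihood form (a common way to evaluate such products without underflow; the function name and return convention are our own) is:

```python
import numpy as np

def noise_score(o, mu, var):
    """Log-likelihood of a power-spectrum vector `o` under the
    diagonal-Gaussian noise model N(mu, var) of Eq. (3), summed over
    the J FFT points.  Higher values mean the frame looks more like
    the modeled noise."""
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var)
                        - (o - mu) ** 2 / (2.0 * var)))

mu = np.full(8, 1.0)            # mean power density per FFT point
var = np.full(8, 0.25)          # variance per FFT point
s_like = noise_score(mu, mu, var)          # frame identical to the mean
s_unlike = noise_score(mu + 3.0, mu, var)  # frame six sigma away
```

A noise-like frame scores higher than a frame far from the model, matching the text's description of the score as a spectral similarity measure.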

During the course of detecting speech, if the current frame is considered to be noise, the model is updated with the power density spectrum of this frame. This iterative procedure makes the model follow the variation of the environmental noise. The update also makes the noise model a more sufficient statistic for the character of the environmental noise. The update formulas are:

µn+1 = (µn · n + Nn+1) / (n + 1)    (4)


σ²n+1 = [(n − 1) · σ²n + (Nn+1 − µn)²] / n − (µn+1 − µn)²    (5)

In Eqs. (4) and (5), µn+1, σ²n+1 and µn, σ²n are the mean vector and variance vector after and before the update respectively, n is the number of noise frames before the update, and Nn+1 is the current noise frame inputted to update the model. In real mobile environments the background noise varies with time. Therefore it is reasonable to fix n when it grows beyond a certain number, which we choose as 32, so that the update procedure needs only a short-period memory rather than remembering the whole utterance. µn+1 and σ²n+1 are then in fact the maximum likelihood estimates over a sliding window of noise frames.

In practical applications, impulsive noises sometimes occur, such as the sound of a door opening or closing, a cough, etc. Obviously, this kind of noise should not be used to update the model. In order to minimize its effect on the noise model update, a threshold can be set to restrict the energy of the frames used to update the model. In other words, if the energy of a noise frame is far greater than the average energy of the noise signal, the frame is discarded in updating the model.
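The update procedure of Eqs. (4) and (5), the cap on n at 32, and the impulse rejection can be sketched as follows (the rejection factor of 4× the modeled average energy is an assumed, illustrative value, not from the paper):

```python
import numpy as np

def update_noise_model(mu, var, n, frame, n_max=32, impulse_factor=4.0):
    """One recursive update of the noise model (Eqs. (4)-(5)).

    `frame` is the power-spectrum vector N_{n+1} of a frame the
    detector classified as noise; `n` is the number of noise frames
    seen so far, capped at n_max=32 so the estimate acts as a sliding
    window.  Frames whose total energy exceeds `impulse_factor` times
    the modeled average are treated as impulsive noise and skipped
    (the factor is illustrative).  Returns the updated (mu, var, n)."""
    if frame.sum() > impulse_factor * mu.sum():
        return mu, var, n                           # impulsive: discard
    n = min(n, n_max)
    mu_new = (mu * n + frame) / (n + 1)             # Eq. (4)
    var_new = ((n - 1) * var + (frame - mu) ** 2) / n \
              - (mu_new - mu) ** 2                  # Eq. (5)
    return mu_new, var_new, n + 1

mu = np.full(4, 2.0)
var = np.full(4, 0.5)
mu, var, n = update_noise_model(mu, var, 10, np.full(4, 2.0))
```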

2.1.2. Short-Time Energy. Short-time energy has been used as a conventional feature for detecting voice in speech utterances (Huang, 2001; Rabiner, 1993). The popularity of energy-based algorithms can be attributed to the fact that computing energy from the speech signal is a simple operation compared to extracting other features. The energy contour of the speech signal has been found to be a good and obvious indicator of the presence of speech in signals with a moderate SNR. Generally speaking, the energy is always effective, because the speech signal energy plus the noise energy is certain to be greater than the pure noise energy. So if the actual boundaries that distinguish speech frames from noise frames were known, the beginning point and the ending point of the inputted signal could be detected easily and accurately. But the boundary information is usually not available in real-world applications, especially when the noise energy level varies. In the proposed algorithm, the preceding pure noise frames are utilized to initialize the short-time energy threshold, which is revised thereafter.

In practice, the short-time energy fails to achieve good performance in low SNR environments, so we cannot rely on the energy alone to make the decision. The noise model score is then employed according to (3). In fact, this feature is a spectrum-related feature. In a noisy environment, the pure noise frames clearly differ from the speech frames in the spectrum, and the score of the noise model can reflect the differences.

Figure 1 shows the waveform and the contour curves of short-time energy and noise model score for the digit string "6654599". Figure 1(a) shows the waveform corrupted by noise (SNR < 10 dB), Fig. 1(b) the short-time energy contour of the waveform, and Fig. 1(c) the trajectory of the noise model score. It can be seen in the figure that the noise model score outperforms the short-time energy in pattern classification: the score shows a much greater distance between noise frames and speech frames. Experiments show that the score can also achieve a good discrimination between fricative frames and noise frames, while the short-time energy shows poor performance in this respect, especially in low SNR conditions.

But the short-time energy remains a feature in our algorithm, because it not only ensures accuracy but also contributes to the performance in some special environments. For example, we rely on the contribution of the short-time energy in babble noise, whose spectral character is similar to that of the foreground speaker, so that simple spectrum-related features cannot be used in this situation. For a frontend used in an embedded system, the short-time energy is the simplest and most effective feature in the babble noise condition.

2.1.3. The Spectral Distance. The spectral distance is computed between the inputted frame and the noise model, which contains the average spectral character of the noise frames. The full spectral space is divided into several sub-bands, and the spectral distances are computed in these sub-bands. If we divide the whole spectrum space into N bands, then for sub-band k the spectral distance is

Dk(Oi) = (Oi − µ)′ · Fk · (Oi − µ)    (6)

where Oi is the power density spectrum of the frame and Fk is a simple band-pass filter for band k. If band k ranges from FFT point j1 to j2,


Figure 1. Contour curves of short-time energy and noise model score.

then

Fk = diag(0, . . . , 0, 1, . . . , 1, 0, . . . , 0)    (7)

That is, there is an identity block on the diagonal, covering positions j1 through j2, and zeros in all other parts. Here the filter is set very simply: the spectral space is divided linearly and there is no overlap between the sub-bands, which makes the algorithm simple and suitable for embedded systems. If desired, each FFT point can be given a weight, and the sub-bands can be made to overlap, to form a more elaborate filter.

The spectral distance will be

D(Oi) = Σ_{k=1}^{N} Dk(Oi)    (8)

Please note that D(Oi) is a static feature which only takes into account the difference between the inputted frame and the noise model. It is also a similarity measure between the inputted frame and the noise model. Therefore there is no intrinsic difference between the static spectral distance and the noise model score.
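Since Fk in Eq. (7) merely selects the FFT points of band k, the static distances of Eqs. (6)–(8) can be computed without any matrix arithmetic, in line with the paper's concern for cost. A minimal sketch (function name is our own), splitting the spectrum linearly into non-overlapping bands as in the text:

```python
import numpy as np

def band_distances(o, mu, n_bands):
    """Static spectral distances of Eqs. (6)-(8).

    Multiplying by the diagonal selector matrix F_k is equivalent to
    summing the squared deviations (o - mu)^2 over band k's FFT
    points, so no matrix multiplication is needed.  The spectrum is
    split linearly into `n_bands` non-overlapping sub-bands.
    Returns (per-band distances D_k, total distance D)."""
    d = (o - mu) ** 2
    dk = np.array([b.sum() for b in np.array_split(d, n_bands)])  # Eq. (6)
    return dk, float(dk.sum())                                    # Eq. (8)

o = np.array([1.0, 2.0, 3.0, 4.0])   # toy power spectrum, J = 4
mu = np.zeros(4)                     # toy noise-model mean
dk, d = band_distances(o, mu, 2)
```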

In a speech signal, a frame is often correlated to some extent with the preceding ones. Therefore, in a speech recognition system, dynamic features are incorporated into the feature vector. These features capture the correlation between the preceding and succeeding frames, especially when the spectral character changes dramatically. The dynamic spectral distances are calculated by formulas (9) and (10):

ΔDk(Oi) = (1/n²) · [Σ_{m=0}^{n} (Oi+m − Oi−1−m)]′ · Fk · [Σ_{m=0}^{n} (Oi+m − Oi−1−m)]    (9)

ΔD(Oi) = Σ_{k=1}^{N} ΔDk(Oi)    (10)

where ΔD(Oi) refers to the total spectral distance between consecutive frames, and n is the number of adjacent frames. Because of the variation of the noise, ΔD(Oi) may sometimes show the same character as that of the speech frames. On the basis of the fact that formants exist in the spectrum of voiced speech frames and that the energy distribution in the spectrum is uneven, there is a great difference between the speech frames and the noise model in each sub-band spectrum. Therefore, if a great spectral distance is attributed to only one sub-band, the inputted frame is likely a noise frame: a great spectral distance in a narrow band is due to some variation of the environmental noise. On the other hand, if several sub-bands contribute to the great spectral distance, the current frame can to a great extent be considered a speech frame. Of course this decision is not watertight, and it may lead to false decisions. For example, an impulsive colored noise, relative to the average background noise, may produce the same result as the speech frames do. Other correlative information between frames, such as duration, is then employed to modify the decision.

By using the dynamic spectral distance, the accuracy of the voice activity detector is significantly improved. The experimental results show that the dynamic spectral feature can also detect some speech signals whose spectral characters are very similar to those of the noise.
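The dynamic distances of Eqs. (9) and (10) can be sketched in the same spirit, again avoiding the explicit Fk matrices (function name is our own):

```python
import numpy as np

def dynamic_distances(frames, i, n, n_bands):
    """Dynamic spectral distances of Eqs. (9)-(10).

    `frames` holds one power-spectrum vector per row.  The n+1 frames
    starting at index i are compared with the n+1 frames ending at
    index i-1; the accumulated difference is squared per FFT point,
    scaled by 1/n^2, and summed within each of `n_bands` linear
    sub-bands.  Returns (per-band distances, total distance)."""
    diff = sum(frames[i + m] - frames[i - 1 - m] for m in range(n + 1))
    d = diff ** 2 / n ** 2
    dk = np.array([b.sum() for b in np.array_split(d, n_bands)])  # Eq. (9)
    return dk, float(dk.sum())                                    # Eq. (10)

# toy data: two noise-like frames followed by two speech-like frames
frames = np.vstack([np.zeros((2, 4)), np.ones((2, 4))])
dk, d = dynamic_distances(frames, 2, 1, 2)
```

In this toy example the spectral change at the boundary is spread across both sub-bands, which by the heuristic above points to a speech onset rather than a narrow-band noise variation.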

As mentioned above, the discriminating features must meet the four requirements. The capability to discriminate speech from noise means that the classification boundary between speech and noise can be found according to the discriminating features. The suitability for real-time implementation is also essential: though some features meet the first requirement, they can only be computed on the basis of the information of the whole utterance, which is infeasible for a real-time system. Against these first two requirements a discriminating feature can be rated simply as "yes" or "no"; the last two requirements cannot be rated as such, since robustness can be improved and cost decreased to some extent. The costs of all the selected discriminating features are low enough as far as their computations are concerned, because the computation involves only simple multiplications and additions. Moreover, the noise model score and the spectral distances all show robustness in noisy environments.

Figure 2. The system block diagram.

2.2. The Two-Level Decision Making Strategy

Early algorithms usually obtained the VAD results immediately according to the discriminating features. But this is not always reliable, since there are inevitably false alarms and false rejections for any discriminating-feature-based method. Therefore a smoothing process should be used to make a more reliable decision.

In the algorithm, a two-level decision-making approach is employed to obtain smooth and accurate VAD results. The first level decision is made according to the discriminating features, and subsequently the final VAD result is determined based on a finite-state machine. The first-level, feature-based decision aims at obtaining the character of the individual frame. In this stage we do not take into account much correlation between the adjacent frames, which is useful for detecting the existence of speech frames. The decision logic is therefore introduced to smooth, and sometimes revise, the output of the first level decision.

However, the two decision-making levels interact with each other. Figure 2 is the block diagram of the proposed algorithm. The first level decision affects the transition of the automaton, while the automaton states determine the selection of the features and their thresholds in the lower level decision stage. The transition from one state to another is controlled by the first level decision output and some duration constraints.

The decision logic, which is a four-state machine, comprises four states: (1) stationary state of noise, (2) nonstationary state of noise to speech, (3) stationary state of speech, and (4) nonstationary state of speech to noise.

Figure 3. The four-state automaton.

Figure 3 shows the four-state automaton. The transitions of the automaton are controlled by three conditions, which are shown in the figure. The first condition (C1) is the output of the lower level decision, and the other two conditions (C2 and C3) utilize duration information, which contributes to the performance improvement. The values of the Speech Duration and Silence Duration vary according to the lower level decision.

As mentioned above, the automaton state determines the selection of the discriminating features. Considering the character of each state, we make use of different discriminating features in different states. In states (1) and (3), which are stationary states, the energy and the noise model score are used to make the decision, while in the nonstationary states (2) and (4), the spectral distances (the overall dynamic spectral distance and the dynamic spectral distance in each sub-band) are the foundation of the decision. When the automaton moves from one stationary state to another via a nonstationary state, for example from state (1) to state (3) via state (2), we trace back to find the accurate boundary between noise and speech. In this way, the detection accuracy is improved.
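A sketch of the four-state automaton follows. The exact transition conditions C1–C3 are given in Fig. 3, which we cannot reproduce here, so the duration constraints below (`min_speech`, `min_silence` frame counts) are illustrative stand-ins for the Speech Duration and Silence Duration conditions:

```python
# States of the upper-level decision logic (Section 2.2).
NOISE, NOISE_TO_SPEECH, SPEECH, SPEECH_TO_NOISE = range(4)

def step(state, is_speech, count, min_speech=5, min_silence=10):
    """One transition of the four-state automaton.

    `is_speech` is the lower-level, feature-based decision (C1) for
    the current frame; `count` is how many frames the current
    nonstationary state has lasted.  min_speech and min_silence are
    illustrative duration constraints standing in for C2 and C3.
    Returns (new_state, new_count)."""
    if state == NOISE:
        return (NOISE_TO_SPEECH, 1) if is_speech else (NOISE, 0)
    if state == NOISE_TO_SPEECH:
        if not is_speech:
            return NOISE, 0              # too short: treat as false alarm
        return (SPEECH, 0) if count + 1 >= min_speech \
            else (NOISE_TO_SPEECH, count + 1)
    if state == SPEECH:
        return (SPEECH_TO_NOISE, 1) if not is_speech else (SPEECH, 0)
    # state == SPEECH_TO_NOISE
    if is_speech:
        return SPEECH, 0                 # the pause was not an ending
    return (NOISE, 0) if count + 1 >= min_silence \
        else (SPEECH_TO_NOISE, count + 1)
```

Feeding a run of speech decisions moves the automaton from state (1) through state (2) to state (3); when such a transition completes, the implementation would trace back through the buffered frames to refine the boundary, as described above.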

The first-stage decision in the two-level decision strategy depends heavily on the noise model. All the thresholds used in making the preliminary decision are set according to the model. For example, by using the model, two short-time energy thresholds are set as:

ETH1 = Σj µj    for all FFT points j    (11)

ETH2 = Σj (µj + σj)    for all FFT points j    (12)

and the score threshold is STH = Score(µ + σ). As described before, different thresholds are used in different states: ETH1 and ETH2 are set as the speech energy threshold in state (1) and state (3) respectively.

The threshold of the spectral distance is

DTHk = σ′ · Fk · σ    (13)

DTH = Σk DTHk    for all sub-bands k    (14)

In practice, the risks of the two kinds of misclassification between noise and speech are quite different. It is a fatal error if a speech frame is wrongly determined to be a noise frame, since it will inevitably cause a deletion error in recognition. On the other hand, even if we mistake a noise frame for a speech frame, we still have a chance to correct it in the later recognition process. Therefore misclassifying noise as speech is preferable to misclassifying speech as noise. In the first-level decision, our detector should then satisfy the following formula (15):

p(S | Oi ∈ N) > p(N | Oi ∈ S)    (15)

where p(S | Oi ∈ N) and p(N | Oi ∈ S) are the probabilities of false alarm and false rejection respectively. Therefore the thresholds of the features are deliberately set relatively low. For example, in state (1) the energy threshold is set by formula (11), which is only the average energy of the noise frames. The spectral distance thresholds (see formulas (13) and (14)) are only the average spectral difference of the noise frames, because in fact, for each sub-band k:

DTHk = σ′ · Fk · σ = E[(N − µ)′ · Fk · (N − µ)]    (16)

In a multi-feature-based algorithm, combining the features to make a decision is a difficult problem. In the proposed algorithm, the short-time energy is employed in the lower level decision in all states. Since the short-time energy is only a necessary condition, it can be combined with the other features by a simple logical "AND". In other words, in the stationary states the noise model score is combined with the short-time energy by a simple "AND", and in the nonstationary states the short-time energy is checked before the dynamic spectral distance.
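The combination rule can be sketched as follows. The score comparison direction follows Eq. (3), where a high score means the frame resembles the modeled noise; the threshold names are illustrative:

```python
# States, numbered as in Section 2.2.
NOISE, NOISE_TO_SPEECH, SPEECH, SPEECH_TO_NOISE = range(4)

def lower_level_decision(state, energy, score, dyn_dist, eth, sth, dth):
    """Lower-level (feature-based) speech/noise decision.

    Short-time energy is a necessary condition in every state, so it
    is AND-ed with the state-dependent spectral feature: the noise
    model score in the stationary states (a frame whose score falls
    below STH does not resemble the noise model), and the dynamic
    spectral distance in the nonstationary states."""
    if state in (NOISE, SPEECH):                    # stationary states
        return energy > eth and score < sth
    return energy > eth and dyn_dist > dth          # nonstationary states
```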


2.3. Computational Complexity

The proposed algorithm is designed for embedded mobile devices, so the complexity is a major concern: the performance cannot be improved at the cost of greatly increased computational complexity. All the selected discriminating features are cheap in terms of system complexity. To further cut the computation cost, the logarithmic form of the noise model score is used in practice instead of the original form:

Score(Oi) = (Oi − µ)²/σ² + ln(σ²) = Σj [(Oi,j − µj)²/σj² + ln(σj²)]    (17)

where Oi,j is the jth point of the power density spectrum vector of the ith frame, and µj and σj² are the jth points of the µ vector and σ² vector respectively. The procedure of tracing back may add considerably to the computational cost, because for each frame the spectral distance is computed between the frames preceding it and those following it. But this procedure only takes place in the nonstationary states. The two-level strategy thus enables us to use methods with different complexity and accuracy to discriminate the speech signals from the noise signals. The four-state automaton uses only logical decisions, which involve no complex mathematical computation.

The spatial complexity is also very low, since the algorithm does not employ matrix storage. Furthermore, the computations between vectors have been reduced to scalar computations (see formula (17)). Compared with the overall memory requirement of 300 K for the embedded ASR system, the memory used by the algorithm is less than 30 K.

The algorithm has been implemented with an embedded open-vocabulary SI ASR system on a Compaq iPAQ. The computation cost of the VAD (fixed-point implementation) does not exceed 10% of the overall cost of the ASR system, which makes real-time performance feasible on embedded mobile devices.

3. Experiments

We carried out a series of experiments to evaluate the effectiveness of the VAD algorithm. The proposed algorithm was compared with a baseline VAD algorithm on a number of databases. The baseline algorithm, which is also designed for embedded mobile devices, is a real-time adaptive algorithm derived from the algorithms introduced in Benassine et al. (1997) and Rabiner (1993). In the baseline algorithm, short-term energy and sub-band energies work as the features to discriminate the speech signals from the noise signals, chosen because they are suitable for an embedded real-time system. Duration information is then employed to smooth the classification results. The thresholds in the baseline algorithm were set according to the preceding pure noise signals and also modified based on the detection results. The experiment scheme is shown in Fig. 4, in which all the HMM-based recognizers are identical in structure and observation feature vectors. The testing data and training data (if available) are described in each experiment.

Experiment I: Detection Evaluation with Real-World Mobile Data

In this experiment, we intended to show the intuitive performance of the proposed VAD algorithm. The result of the VAD algorithm was compared with hand-labeled results. The evaluation data is an English digit database collected in real mobile environments, including office, park and car. In each environment, the data was recorded in stereo with two different microphone settings. One, referred to as high SNR, was close to the speaker's mouth, and the other, referred to as low SNR, was about 60 cm from the speaker's mouth. Since the noise in the data is natural noise, the SNR of the data is not stable and varies over a certain range. The right channel data are of high SNR (above 15 dB), and the left channel data are of low SNR (below 10 dB). So in the experiment we have 6 sets of data in total. In each set, there are about 500 to 1000 digit strings collected from 28 persons (16 males and 12 females), and the lengths of the strings vary from 1 to 10 digits.

The VAD algorithm result was compared to a reference segmentation of speech and noise periods. The reference segmentation was obtained by first performing forced-alignment segmentation and then revising the segmentation manually. When the reference segmentation and the VAD result were compared, differences within 100 ms before the reference point and 50 ms after the reference point were regarded as correct results.
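The 100 ms / 50 ms tolerance above amounts to an asymmetric window around each reference boundary. The following is a hypothetical scoring helper, assuming detected and reference boundaries come pre-aligned one-to-one (the paper does not specify its alignment procedure):

```python
def boundary_correct(detected_ms, reference_ms, before_ms=100, after_ms=50):
    """True if the detected boundary lies no more than `before_ms`
    before and no more than `after_ms` after the reference boundary."""
    diff = detected_ms - reference_ms
    return -before_ms <= diff <= after_ms

def accuracy_rate(detected, reference):
    """Percentage of reference boundaries matched within tolerance."""
    hits = sum(boundary_correct(d, r) for d, r in zip(detected, reference))
    return 100.0 * hits / len(reference)
```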

Table 1 shows the accuracy rates of the proposed algorithm in different environments. The results show that


Table 1. VAD accuracy rate (%).

                 Right channel (above 15 dB)    Left channel (below 10 dB)
Environment      Energy-based   Proposed        Energy-based   Proposed

Park             89.2           93.0            85.2           92.5
Car              71.1           94.8            70.6           96.6
Office           80.1           97.6            78.1           96.2

the proposed algorithm outperforms the baseline algorithm remarkably. The VAD accuracy rates are relatively stable and satisfactory across the different conditions. It is obvious that the proposed VAD algorithm is more robust than the conventional energy-based baseline in noisy environments. In the park environment, the noises are often wind noise and babble noise. In the car environment, the noises are usually caused by

Figure 4. The experiment scheme.

the engine. The office noises include fan noise and impulsive noise from footsteps, the keyboard and the door.

Experiment II: Recognition Evaluation with Real-World Mobile Data

The proposed algorithm was further evaluated on a second English digit database collected from a cellular phone in a car at an 8 kHz sampling rate under four speed conditions. The four conditions, denoted 01 to 04, are idle, 30 mph, 55 mph and varying speed. The database contains pure digit strings, with string lengths varying from one to eight digits. Each set of the database includes more than 5000 strings and more than 20000 digits. The SNRs of the data also vary over a certain range, because the data was collected from real-world environments. An HMM-based speech recognizer with 26-dimensional features was employed to model each of the digits. MFCC features and short-time energy were used in the recognizer. The HMM models were


Table 2. Experiment II results (%).

                            String error rate                    String error reduction
Dataset (strings, digits)   No VAD   Energy-based   Proposed     Energy-based   Proposed

01 (5305, 21584)            8.47     8.05           6.71         4.96           20.78
02 (5404, 21987)            14.26    14.12          12.15        0.98           14.80
03 (5465, 22336)            15.90    15.06          14.06        5.28           11.57
04 (5391, 21975)            24.39    25.86          23.37        −6.02          4.18

well trained on vast data collected from cellular phones in various noisy environments. In all evaluations, the proposed VAD algorithm was performed in real-time mode, and only the speech part was sent to the backend recognizer. The HMM-based recognizer results were compared across the following approaches: no VAD, energy-based VAD and the proposed VAD algorithm. The experimental results are shown in Table 2.

From the results, we can see that the string error rates are reduced to a certain extent after employing the VAD algorithms. Compared to the other two methods, the proposed VAD algorithm significantly reduces the string error rate. The energy-based VAD algorithm fails in very low SNR conditions and increases the string error rate on evaluation dataset 04. The proposed algorithm, however, is more robust in such conditions.
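The reduction figures in Table 2 follow from the string error rates by the usual relative-reduction formula; a quick check in Python, reproducing the dataset 01 and 03 entries:

```python
def error_reduction(ser_no_vad, ser_vad):
    """Relative string-error-rate reduction in percent:
    positive when the VAD helps, negative when it hurts."""
    return 100.0 * (ser_no_vad - ser_vad) / ser_no_vad
```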

We can also see from the results that the error reduction of the proposed VAD decreases as the SNR falls. It is not surprising that the VAD generates the least error reduction in condition 04, because the varying speed includes acceleration and braking noises, and the background noise is not stable.

Experiment III: Recognition Evaluation for Data with Artificially Added Noise Using the Same HMM Sets

A third evaluation experiment was carried out in order to evaluate the VAD algorithm's performance more precisely in terms of SNR levels. We used an English digit database which was collected in a clean environment. In our experiments, noises were added artificially to the data at several SNR levels (20 dB, 15 dB, 10 dB, 5 dB, 0 dB and −5 dB). There are altogether eight kinds of added noise: subway noise, babble noise, car noise, exhibition hall noise, restaurant noise, street noise, airport noise and railway station noise. We modeled each digit with a 26-dimensional feature

HMM model. The MFCC features and short-term energy were used. The model training data was also of various SNRs in the above-mentioned acoustic environments.
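The artificial mixing step can be sketched as follows: scale the noise so that the ratio of speech power to noise power hits the target SNR, then add sample-by-sample. This is a generic recipe, not the authors' exact procedure (which might, for instance, measure power over speech regions only):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals the
    target SNR (in dB), then mix the two signals sample-by-sample."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```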

Figure 5 shows the recognition results in various acoustic environments. The "inf dB" label refers to clean speech. We used the string error rate to score the performance of the automatic speech recognizer. The results are given in a comparable way, alongside the no-VAD and energy-based VAD approaches. The results show that VAD can improve the overall performance of the recognizer in most SNR conditions. The proposed VAD algorithm works better than the energy-based VAD algorithm at every SNR in all acoustic environments. In most situations, the error reductions obtained by using the proposed VAD algorithm exceed 50%, double those of the energy-based one. Compared with the trajectories of both the no-VAD approach and the energy-based VAD approach, the proposed algorithm shows its robustness as the SNR falls.

Experiments II and III use the same experiment strategy on different databases. The database used in Experiment II was collected from real noisy environments, while the database used in Experiment III contains only artificially added noise. The proposed algorithm improves the performance of the ASR system, and is more robust to noise than the energy-based VAD.

Experiment IV: Recognition Evaluation for Data with Artificially Added Noise Using Different HMM Sets

In this experiment, three different HMM sets were trained using three different training datasets. All three training sets originated from one database. The original database contains data in various acoustic environments at various SNRs, namely inf dB, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB and −5 dB; the noises were added artificially. Then the three VAD approaches


Figure 5. Recognition tests in various acoustic environments using the same HMM set.

(the energy-based algorithm, the proposed algorithm, and hand labeling) were applied to the original database, and three different training datasets were obtained: one from the energy-based VAD algorithm, one from the proposed VAD algorithm and the other from the hand labels. The three different HMM sets were trained using their respective datasets. In the same way, three different testing databases were also obtained.


Figure 6. Recognition tests in various acoustic environments using different HMM sets.

The evaluation was performed on the respective testing datasets using the corresponding HMM sets.

Figure 6 shows the recognition results in the eight acoustic environments. From the figure we can see that the proposed VAD algorithm generates performance similar to the hand-labeled condition and better performance than the energy-based VAD algorithm. In most situations, the trajectory of the proposed algorithm is similar or identical to that of the hand-labeled data.


It is obvious that the employment of VAD processing improves the overall performance of the HMM-based ASR system. The conventional energy-based VAD algorithm, however, only shows satisfactory performance in high SNR conditions and fails to deliver stable performance across various acoustic environments. In the evaluation experiments, the proposed VAD algorithm not only produces better performance than the conventional one in low SNR conditions, but also shows robustness across different acoustic environments.

4. Conclusions

In this paper, we describe a real-time VAD algorithm designed for embedded application in noisy mobile environments. The architecture of the proposed algorithm is based on a two-level decision strategy. The upper-level decision is a four-state automaton, and the lower level is a feature-based decision. The two levels are not separate; they interact with each other to make the final decision. The states of the automaton in the upper level can be divided into two classes: stationary states and nonstationary states. The features and thresholds in the lower level are determined according to the upper-level state, and the lower-level decision affects the upper-level state transitions. The two-level decision strategy allows the use of low-cost features in the stationary states and the performing of delicate analysis in the nonstationary states. The strategy can greatly reduce the computational cost of the algorithm, and also the overall ASR system cost, so the algorithm can be used in embedded systems.
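The two-level interaction described above can be sketched as a state machine whose stationary states consult only a cheap lower-level decision and whose nonstationary states consult a richer one. The state names, their mapping onto the stationary/nonstationary classes, and the transition conditions below are illustrative assumptions; the paper's actual automaton and features are defined in its earlier sections.

```python
from enum import Enum

class State(Enum):
    SILENCE = 0   # stationary: cheap, low-cost feature only
    ONSET = 1     # nonstationary: delicate multi-feature analysis
    SPEECH = 2    # stationary: cheap feature only
    OFFSET = 3    # nonstationary: delicate multi-feature analysis

def step(state, cheap_is_speech, detailed_is_speech):
    """One upper-level transition per frame.

    `cheap_is_speech` is the lower-level decision from a low-cost
    feature; `detailed_is_speech` comes from richer features and is
    only consulted in the nonstationary states, so the expensive
    analysis runs only around speech boundaries.
    """
    if state is State.SILENCE:
        return State.ONSET if cheap_is_speech else State.SILENCE
    if state is State.ONSET:
        return State.SPEECH if detailed_is_speech else State.SILENCE
    if state is State.SPEECH:
        return State.SPEECH if cheap_is_speech else State.OFFSET
    # OFFSET: confirm the end of speech or fall back to SPEECH
    return State.SILENCE if not detailed_is_speech else State.SPEECH
```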

Four sets of experiments were carried out to evaluate the proposed VAD algorithm. In all the experiments, the proposed algorithm was compared with the no-VAD case, the conventional energy-based VAD case and the hand-labeled case. From the results we can conclude that the proposed VAD algorithm has relatively consistent performance in all environments. When used in the frontend signal processing of a speech recognizer, it reduces recognition errors at all SNR levels.

It is well known that in very noisy environments insertion errors are the major problem in a connected digit recognition task. We observe that insertion errors are significantly reduced by applying the proposed VAD algorithm. Experiment I shows the intuitive performance of the proposed algorithm: the detection accuracy rates in all testing environments (park, car and office) are above 92%, varying only a little with the SNR. Experiments II and III show, firstly, that the implementation of VAD improves the performance of the ASR system, and also that the proposed algorithm outperforms the baseline energy-based algorithm in terms of robustness in noisy environments. Experiment IV shows that the performance of the ASR incorporating the proposed algorithm is very close to the performance of the ASR using hand-labeled training data. The experiments show that frontend VAD processing plays an important role in ASR.

From the experimental results we realize that the new algorithm generates less gain in non-stable SNR situations. For example, in Experiment II it generates an 11.57% string error reduction in the car environment at a speed of 55 mph, but the error reduction drops to 4.18% in the varying-speed case, which in fact has a higher average SNR than the 55 mph case. This indicates that we may need a more robust noise model update scheme in future work.

Acknowledgment

Motorola Labs China Research Center (MCRC) provided the proprietary speech databases and the embedded recognizer.

References

Benassine, A., Shlomot, E., and Su, H. (1997). ITU-T recommendation G.729, annex B, a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications. IEEE Commun. Mag., pp. 64–97.

Chengalvaryan, R. (2001). Evaluation of front-end features and noise compensation methods for robust mandarin speech recognition. In Proceedings of Eurospeech.

De Wet, F. (2001). A comparison of LPC and FFT-based acoustic features for noise robust ASR. In Proceedings of Eurospeech.

Ganapathiraju, A. (1996). Comparison of energy-based endpoint detection for speech signal processing. In Proceedings of the IEEE Southeastcon, Tampa, Florida, USA, pp. 500–503.

Huang, X.D. and Acero, A. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.

Junqua, J.C., Reaves, B., and Mak, B. (1991). A study of endpoint detection algorithms in adverse conditions: Incidence on a DTW and HMM recognizer. In Proceedings of Eurospeech, pp. 1371–1374.

Martens, J.P. (2000). Continuous speech recognition over the telephone. Final Report of COST Action 249.

Nemer, E. (2001). Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans. on Speech and Audio Processing, 9(3).


Picone, J. (1993). Signal modeling techniques in speech recognition. Proc. IEEE, 79(4):1215–1247.

Rabiner, L. and Juang, B.H. (1993). Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.

Renevey, P. (2001). Entropy based voice activity detection in very noisy conditions. In Proceedings of Eurospeech.

Savoji, M.H. (1989). A robust algorithm for accurate endpointing of speech. Speech Communication, 8:45–60.

Shieh, W.C. (1999). The dependence of feature vectors under adverse noise. In Proceedings of Eurospeech.

Shin, W.H. (2000). Speech/non-speech classification using multiple features for robust endpoint detection. In Proceedings of ICASSP.

Tanyer, S.G. (2000). Voice activity detection in nonstationary noise. IEEE Trans. on Speech and Audio Processing, 8(4).

Tucker, R. (1992). Voice activity detection using a periodicity measure. Proc. Inst. Elect. Eng., 139:377–380.

Wu, G.D. and Lin, C.T. (2000). Word boundary detection with mel-scale frequency bank in noisy environment. IEEE Trans. Speech and Audio Processing, 8(5).