
REDUCING WAITING TIME IN AUTOMATIC CAPTIONED RELAY

SERVICE USING SHORT PAUSE IN VOICE ACTIVITY DETECTION

BY

MR. KIETTIPHONG MANOVISUT

.

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF MASTER OF SCIENCE (COMPUTER SCIENCE)

DEPARTMENT OF COMPUTER SCIENCE

FACULTY OF SCIENCE AND TECHNOLOGY

THAMMASAT UNIVERSITY

ACADEMIC YEAR 2017

COPYRIGHT OF THAMMASAT UNIVERSITY

Ref. code: 25605809035149VUW



Thesis Title REDUCING WAITING TIME IN AUTOMATIC CAPTIONED RELAY SERVICE USING SHORT PAUSE IN VOICE ACTIVITY DETECTION

Author Mr. Kiettiphong Manovisut

Degree Master of Science (Computer Science)

Department/Faculty/University Computer Science, Faculty of Science and Technology, Thammasat University

Thesis Advisor Pokpong Songmuang, Ph.D.

Thesis Co-Advisor Nattanun Thatphithakkul, Ph.D.

Academic Year 2017

ABSTRACT

The automatic captioned relay service is crucial for people who are deaf or hard of hearing to communicate with others in real life. The service uses Automatic Speech Recognition (ASR) to transcribe speech into captions. Reducing the waiting time in the automatic captioned relay service allows the service to support more users. Moreover, the ASR results become more continuous and constant, which directly affects the user experience. In this thesis, we divide the proposed work into four steps. In the first step, we propose a method for improving voice activity detection (VAD) based on the dual-threshold method. The idea of this research is to reduce the waiting time for ASR results by using short pauses in speech as endpoints instead of using only silence. This step reduces the average waiting time. In the second and third steps, we propose methods to maintain the accuracy of the ASR results. Furthermore, we show that a fixed energy threshold, in both the proposed work and traditional VAD, cannot cope with a noisy environment; this problem directly affects the ability to detect short pauses and silence. Finally, we propose a pause-model classifier trained with an LSTM-RNN. This last step overcomes the weakness in short-pause and silence detection. The experimental results show that the proposed work reduces the average waiting time for ASR results by up to 17.1% compared with traditional VAD.


Keywords: voice activity detection, captioned relay service, recurrent neural network,

dual-threshold method


ACKNOWLEDGEMENTS

First of all, this research would not have been possible without the support of my advisor, Dr. Pokpong Songmuang, and my co-advisor, Dr. Nattanun Thatphithakkul. I would like to express my sincere gratitude to them for their invaluable advice and patient proofreading towards the completion of my research. Their guidance helped me through the hard times while I was doing this research.

Besides my advisor and co-advisor, I sincerely thank Dr. Ananlada Chotimongkol and Phongphan Phienphanich, Ph.D. student, for their valuable and constructive suggestions during the planning and development of this research work. Their willingness to give their time so generously has been very much appreciated.

Furthermore, I would also like to thank the rest of my thesis committee, Asst. Prof. Nuttanont Hongwarittorrn and Asst. Prof. Rachada Kongkachandra, for their encouragement, insightful comments, and hard questions.

Finally, I most gratefully acknowledge my family, who have always been beside me and encouraged me to study in the master's program. I could reach my goal and accomplish it thanks to all their kindness.

Parts of this research are supported by a grant from the National Electronics and Computer Technology Center (NECTEC), Thailand.

Mr. Kiettiphong Manovisut


TABLE OF CONTENTS

Page

ABSTRACT (1)

ACKNOWLEDGEMENTS (3)

TABLE OF CONTENTS (4)

LIST OF TABLES (7)

LIST OF FIGURES (8)

LIST OF ABBREVIATIONS (10)

CHAPTER 1 INTRODUCTION 1

1.1. Objective of the Thesis 3

1.2. Structure of the Thesis 4

CHAPTER 2 LITERATURE REVIEW 5

2.1. The automatic captioned relay service 5

2.2. Voice activity detection (VAD) 6

2.3. The difference of VAD analysis 7

2.3.1 Time-domain analysis 7

2.3.1.1 The dual-threshold method 7

(1) Pre-processing 9

(2) Framing 9

(3) Windowing 9

(4) Feature extraction 10

(5) Decision Algorithm 13


2.3.2 Frequency-domain analysis 15

2.3.3 Pattern recognition 16

2.3.3.1 Recurrent Neural Network 16

(1) Long short-term memory networks 18

2.3.3.2 Feature extraction for neural network 19

(1) Mel Frequency Cepstral Coefficient – MFCC 19

CHAPTER 3 SHORT PAUSE BASED VOICE ACTIVITY DETECTION 22

3.1. Dataset 22

3.1.1 Speech and non-speech label 22

3.1.2 Short pause in speech 24

3.2. Research Question 24

3.2.1 How many short pauses can be found in a sentence? 24

3.2.2 Is it possible to use the short pause to reduce the waiting time and

maintain the accuracy? 26

3.3. Short Pause based VAD 27

3.3.1 Short Pause algorithm (Determine silence and short pause) 28

3.4. Short Pause based VAD with endpoint decision 30

3.4.1 Endpoint decision 31

3.5. Short Pause based VAD with padding silence 31

3.5.1 Padding Silence 32

3.6. Short Pause based VAD with LSTM-RNN 33

3.6.1 Pause model from LSTM-RNN 36

CHAPTER 4 EXPERIMENT 39

4.1. Experiments 39

4.1.1 The first step 39

4.1.2 The second step 39

4.1.3 The third step 39


4.1.4 The fourth step 40

4.2. Experiment settings 40

4.3. Experiment dataset 40

4.4. Measurement 41

4.4.1 Average waiting time 41

4.4.2 Word error rate (WER) 42

CHAPTER 5 EXPERIMENTAL RESULTS AND DISCUSSIONS 44

CHAPTER 6 CONCLUSION 58

REFERENCES 60

APPENDICES 64

APPENDIX A IMPLEMENTATION OF SHORT PAUSE BASED VAD 65

APPENDIX B IMPLEMENTATION OF SHORT PAUSE BASED VAD WITH LSTM-RNN 69

BIOGRAPHY 71


LIST OF TABLES

Tables Page

3.1 The sample labeling of speech sentence. 23

3.2. The number of the short pause and the silence found in 12 hours of the

continuous speech sentences 25

3.3. LSTM-RNN specification 38

4.1. Alignment and types of error for test phrase 43

5.1. The average waiting time and WER of ASR result by short pause based VAD

compared with traditional VAD 45

5.2. The average waiting time and WER of ASR result by short pause based VAD

with endpoint decision compared with the previous step and traditional VAD 47

5.3. The average waiting time and WER of ASR result by short pause based VAD

with padding silence compared with the previous steps and traditional VAD 48

5.4. The average waiting time and WER of ASR result by short pause based VAD

with LSTM-RNN compared with all previous steps and traditional VAD 52


LIST OF FIGURES

Figures Page

1.1. The infrastructure of automatic captioned relay service. 1

2.1. The captioned phone. 5

2.2. The flow diagram of dual-threshold method. 8

2.3. The original speech signals 10

2.4. The speech signal after applying a window function 10

2.5. The original speech signals 11

2.6. Energy waveform after short-time energy extraction 11

2.7. Zero-crossing rate of speech signal in noisy environment 12

2.8. An illustration of roughly search and smoothing search on the dual-threshold

method 13

2.9. The original waveform, labeled sum, and its component frequencies 15

2.10. A recurrent neural network and the unfolding in time of the computation

involved in its forward computation. 17

2.11. The RNNs and variety of output 18

2.12. Block diagram of the MFCC algorithm 20

3.1. Short pauses and silence found in sentence. 26

3.2. The flow diagram of traditional VAD 27

3.3. The flow diagram of the first step on short pause based VAD 28

3.4. State transition diagram for determining short pause and silence 29

3.5. The flow diagram of short pause based VAD with endpoint decision. 31

3.6. The flow diagram of short pause based VAD with padding silence 32

3.7. Increasing the energy threshold, the unvoiced portion might not be included

in the speech segment 33

3.8. The flow diagram of short pause based VAD with LSTM-RNN 35

3.9. The creation of pause model 36

3.10. The result of MFCCs 37

3.11. The LSTM-RNN structure of pause model 37


4.1. The description of the waiting time equation 41

5.1. An example of a sentence in which no silence or short pause is found by

short pause VAD and traditional VAD 49

5.2. The labeling of silence and short pause in a noisy speech sentence 50

5.3. The speech segments in office noise that is detected by short pause based

VAD with LSTM-RNN 51

5.4. The average waiting time reduced compared with the minimum pause setting

in short pause based VAD 53

5.5. The comparison of WER and minimum pause setting in short pause based

VAD 54

5.6 The flow diagram of short pause based VAD in final step. 55

5.7. The frequency and waiting time of Short Pause based VAD with LSTM-RNN

compared with traditional VAD 56


LIST OF ABBREVIATIONS

Symbols/Abbreviations Terms

VAD Voice activity detection

ASR Automatic speech recognition

MFCC Mel-frequency cepstral coefficients

RNN Recurrent neural network

LSTM-RNN Long short-term memory recurrent neural network

STE Short-time energy

TE Short-time energy threshold

ZCR Zero-crossing rate

TZ Zero-crossing rate threshold

WER Word error rate

WPM Words per minute


CHAPTER 1

INTRODUCTION

An automatic captioned relay service allows people who are deaf or hard of hearing to use a special mobile application that enables the user to speak and simultaneously read captions of what others say. This service uses Automatic Speech Recognition (ASR) to transcribe speech into captions. The service then transmits the captions directly to the mobile application. The infrastructure of an automatic captioned relay service is shown in Figure 1.1.

Figure 1.1. The infrastructure of automatic captioned relay service.

To segment continuous speech in real time, the automatic captioned relay service uses Voice Activity Detection (VAD) to separate speech and non-speech into segments. The speech segments are then transcribed into captions by the ASR. VAD is the first important step for ASR, discriminating a speech signal into speech and non-speech segments. It is applied in many speech-based applications to reduce bandwidth, word error rate (WER), and computation time, which improves the overall performance of ASR. VADs can be divided by feature extraction into three categories as follows:

(1) Time-domain features such as log-energy (Wang & Qu, 2014), short-time energy, and zero-crossing rate (Guo & Li, 2010; Pal & Phadikar, 2015).

(2) Frequency-domain features such as energy spectral entropy (Pang, 2017). The frequency domain refers to an analysis of signals with respect to frequency rather than time.


(3) Pattern recognition such as a linear classifier, a Naive Bayes classifier, or a neural network (NN). Many recent works (Ryant et al., 2013; Tashev & Mirsamadi, 2016) show that neural networks significantly improve the accuracy of VADs.

VADs in each category are characterized by their complexity, precision, and computation cost. A more complex feature extraction requires more computation. The dual-threshold method is a low-complexity time-domain VAD and the one most widely used in real-time speech-based applications, where it achieves the lowest computation time and cost (Zhang & Junqin, 2015).

A high speaking rate is common in continuous speech. According to the National Center for Voice and Speech, the average rate of speech is approximately 150 words per minute (wpm) (NCVS, 2007). Additionally, a high speaking rate leaves less silence. Silence is a non-speech portion longer than 100 milliseconds (Moattar & Homayounpour, 2009). When the traditional VAD, the dual-threshold method, is applied to the automatic captioned relay service, it separates the continuous speech into long speech segments, which the ASR then takes a long time to transcribe. Since the traditional VAD commonly uses silence as the endpoint of speech, users face a long waiting time for ASR results.

As mentioned above, none of the previous works proposes a method to reduce the waiting time for ASR results. Therefore, to reduce the delay of ASR results in the automatic captioned relay service, we propose a new VAD algorithm, called short pause based VAD, built on the traditional dual-threshold method. The idea of this research is to reduce the waiting time for ASR results by using short pauses in speech as endpoints. A short pause is a pause in speech that ranges from 20 to 99 milliseconds. To confirm this idea, we investigate the possibility of using short pauses to reduce the waiting time of the real-time captioning process. We note that short pauses are usually found within sentences, so it is possible to use them as endpoints to reduce the waiting time. Furthermore, the traditional dual-threshold method is among the easiest to implement, making it ideal for a proof of concept of using the short pause in VAD.
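To make the duration thresholds concrete, the endpoint change amounts to a pause-length rule like the following sketch (the function name is ours; the thresholds come from the text: 20-99 ms for a short pause, 100 ms or more for silence):

```python
def classify_non_speech(duration_ms):
    """Label a non-speech stretch by its duration.

    Silence (>= 100 ms) is the traditional endpoint; a short pause
    (20-99 ms) is the earlier endpoint this thesis proposes to use.
    """
    if duration_ms >= 100:
        return "silence"
    if duration_ms >= 20:
        return "short pause"
    return "too short to use as an endpoint"
```

Cutting the segment at the first "short pause" instead of waiting for full "silence" is the core of the proposed reduction in waiting time.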


To examine the use of the short pause in VAD and the research problems step by step, we divide the experiment into four steps. In the first step, we show that the short pause can be used to reduce the waiting time for ASR results. However, using the short pause as an endpoint rapidly increases the word error rate (WER). Next, an endpoint decision module and a padding silence module are applied to reduce the WER in the second and third steps, respectively. The experimental results show that the endpoint decision and padding silence modules reduce the WER significantly. Moreover, we show that a fixed energy threshold, in both the short pause based VAD and the traditional VAD, cannot cope with an office noise environment. This problem directly affects the ability to detect short pauses and silence.

In 2016, J. Kim (Kim, Kim, Lee, Park, & Hahn, 2016) showed that a long short-term memory recurrent neural network (LSTM-RNN) can learn complex features and patterns in data; their work achieves state-of-the-art performance. Hence, in the fourth step we borrow the VAD structure from the third step, and the short-time energy and zero-crossing rate are replaced by a pause model. The pause model is a speech/non-speech classifier trained with an LSTM-RNN to address the weakness in short-pause and silence detection. Finally, all steps are compared, in average waiting time and WER, with the traditional VAD used in the automatic captioned relay service.

1.1. Objective of the Thesis

The aims of this thesis are:

- To reduce the waiting time of the caption result by using the short pause as the endpoint of speech.

- To propose a method for maintaining the efficiency and accuracy of captions affected by the short pause.

- To propose a method for accurate short-pause detection in a noisy environment.


1.2. Structure of the Thesis

The rest of the thesis is organized as follows. Chapter 2 presents a literature review of the different VAD algorithms and the background of this research. Chapters 3 and 4 present the experiments on the different proposed algorithms; there, we discuss a problem in the first proposed VAD algorithm and possible solutions to overcome it. Chapter 5 presents the experimental results of the proposed works and analyzes the trade-offs between the proposed algorithms compared with the traditional work. Lastly, the conclusion of the proposed works is presented in Chapter 6.


CHAPTER 2

LITERATURE REVIEW

In this chapter, we discuss the fundamentals of the subject, including the automatic captioned relay service and the voice activity detection problem. Finally, recurrent neural networks and feature extraction for voice activity detection are briefly presented.

2.1. The automatic captioned relay service

The automatic captioned relay service is crucial for people who are deaf or hard of hearing to communicate with others in real life. The first captioned phone was invented in the 1960s by a scientist named Robert Weitbrecht (Jones, 2017).

Previously, users communicated with the captioned relay service through the captioned phone. The captioned phone has a built-in screen that displays captions of the conversation during the call in near real time. When making a call, the captioned phone automatically connects to the captioned relay service, so people who are deaf or hard of hearing can speak and simultaneously read captions of what the other party is saying.

Figure 2.1. The captioned phone (Jones, 2017).


Nowadays, the captioned phone, which can only be used at home or at work, is giving way to the smartphone: several mobile apps now let people who are deaf or hard of hearing use the captioned relay service anywhere they need, making it convenient and accessible.

2.2. Voice activity detection (VAD)

Voice activity detection (VAD) dates back to 1975 (Rabiner & Sambur, 1975). It is also known as speech activity detection, speech detection, or endpoint detection. It is a speech processing technique that detects the presence or absence of human speech and is generally used to discriminate a speech signal into speech and non-speech. VAD takes the speech signal from an input source and uses feature extraction to retrieve the important features from the audio. The feature values are then used by a decision algorithm to determine the start point and endpoint of speech.

The main uses of VAD are in speech coding, speech recognition (Li, Zheng, Tsai, & Zhou, 2002), and speech enhancement. In speech coding, VAD determines the parts of silence that can be switched off in speech transmission. This technique reduces the amount of transmitted data from non-speech packets in Voice over Internet Protocol (VoIP) applications, saving computation and network bandwidth. In speech recognition, VAD finds the parts of the speech signal that should be fed to the recognition engine. Since the recognition engine is computationally complex, ignoring non-speech parts improves the overall performance of ASR. In speech enhancement, VAD is used to reduce or remove noise in a speech signal: we estimate noise characteristics from the non-speech parts and remove noise in the speech parts.

As mentioned above, VAD is an important technology for a variety of speech-based applications. Therefore, the various VAD algorithms provide different features and make different compromises between latency, sensitivity, accuracy, and computational cost.

VAD is usually language independent, and it can be implemented with a variety of techniques. Each technique is suitable for different speech applications in


terms of the accuracy and computation time of its results. These factors often affect the user experience directly. For the automatic captioned relay service, the VAD with the lowest delay time is the most appropriate for the real-time speech-based application. The different kinds of VAD can be divided into several categories, which are described in the next section.

2.3. The difference of VAD analysis

The traditional VADs are roughly divided into several categories as follows.

2.3.1 Time-domain analysis

Generally, a VAD that analyzes in the time domain studies mathematical functions, physical signals, or time series with respect to time. In time-domain analysis, the signal is known for all real numbers in the case of continuous time, or at various separate instants in the case of discrete time.

In the literature, many approaches (Guo & Li, 2010; Zhang & Junqin, 2015) show how time-domain VADs achieve high computation speed. Time-domain analysis is the simplest category, with high speed and low computation cost compared with the other categories. It is suitable for real-time applications and works well in clean speech or with a little background noise. However, time-domain analysis has some limitations, such as robustness against background noise. The dual-threshold method is the one most commonly used in real-world applications. This algorithm uses short-time energy and zero-crossing rate with simple feature thresholds to classify each signal frame as speech or non-speech, and it performs well in clean speech or under a little background noise (Yali, Dongsheng, Shuo, & Xuefen, 2014).

2.3.1.1 The dual-threshold method

The dual-threshold method is the most popular time-domain analysis. It is the easiest to implement, which makes it ideal for a proof of concept of using the short pause in VAD.


The dual-threshold method is based on two features: short-time energy and zero-crossing rate. To process the signal, we start by making it stationary using a framing operation. After that, a window is applied, and then the STE is calculated immediately.

The first judgment uses short-time energy (STE) as a feature, which is a summation over the audio signal of each frame. It has very low computational time and complexity and can be used in clean speech environments. The secondary judgment uses the zero-crossing rate (ZCR). The ZCR is the rate of sign changes along a signal, from positive to negative or back. It is another basic acoustic feature that can be computed easily. ZCR is commonly used in both speech recognition and music information retrieval, being a key feature for classifying percussive sounds, and it is often used in conjunction with energy (or volume) for speech detection. In particular, ZCR is used for detecting the start point and endpoint of unvoiced sounds: if the ZCR is high, the speech signal is unvoiced, while if it is low, the speech signal is voiced.

Figure 2.2. The flow diagram of dual-threshold method.

Figure 2.2 shows the flow diagram of the dual-threshold algorithm. When the speech signal is received from an input device, the signal


will be processed before being used in feature extraction and the other processes. Each process is described below.

(1) Pre-processing

The speech signal is pre-processed to make it suitable for feature extraction. Pre-emphasis (Pal & Phadikar, 2015) is a very simple signal processing method which increases the amplitude of the high-frequency bands and decreases the amplitude of the lower bands; the expression is shown in equation 2.1.

y(n) = a·x(n) − (1 − a)·x(n − 1) (2.1)

where a = 0.95 and x(n) is the n-th sample of the speech frame; y(n) is the speech frame after pre-emphasis. We note that the higher frequencies are more important for signal disambiguation than the lower frequencies. Thus, applying pre-emphasis in energy-based VAD gives a slightly better result, which makes pre-emphasis popular.
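As a sketch of equation 2.1 with a = 0.95 (NumPy-based; the function name and the handling of the first sample, which has no predecessor, are our assumptions, not from the thesis):

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """Apply equation 2.1, y(n) = a*x(n) - (1 - a)*x(n - 1), to a frame.

    This mildly boosts high frequencies relative to low ones.  The first
    sample has no predecessor, so we simply scale it by a.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = a * x[0]
    y[1:] = a * x[1:] - (1 - a) * x[:-1]
    return y
```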

(2) Framing

By nature, the speech signal is non-stationary, as it normally changes quite rapidly over time. However, it can be assumed that the speech signal is stationary over a short time range (10 ms-30 ms). Thus, framing the speech signal into small frames fulfills the assumption that the speech signal is stationary.

(3) Windowing

When we frame the speech signal into small frames directly, the end of one frame often does not mesh smoothly with the start of the next, which means the signal may be discontinuous between frames. A window function is therefore used for "tapering" the edges of the frame to zero. This is the basic idea of windowing. The expression of the Hamming window (Podder, Khan, Zaman, & Haque Khan, 2014) is shown in equation 2.2.


w(n) = α − β·cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 (2.2)

where α = 0.54 and β = 1 − α = 0.46. N represents the frame length and w(n) is the window value at the n-th sample.
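Framing and windowing together can be sketched as follows; the 16 kHz sample rate, 25 ms frame length, and 10 ms hop are illustrative choices of ours, not settings taken from the thesis:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def hamming_window(N, alpha=0.54):
    """Equation 2.2: w(n) = alpha - beta*cos(2*pi*n / (N - 1)), beta = 1 - alpha."""
    n = np.arange(N)
    return alpha - (1 - alpha) * np.cos(2 * np.pi * n / (N - 1))

# 16 kHz audio: 25 ms frames (400 samples) advanced every 10 ms (160 samples).
signal = np.random.randn(16000)
frames = frame_signal(signal, 400, 160)
windowed = frames * hamming_window(400)  # taper every frame's edges toward zero
```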

Figure 2.3. The original speech signals

Figure 2.4. The speech signal after applying a window function

(4) Feature extraction

Short-time energy

Short-time energy is the most common feature for speech and non-speech detection. It represents changes in the amplitude of the speech signal, and the amplitude can be used to separate voice from silence by energy. However, the accuracy of short-time energy decreases rapidly in a noisy environment. Figure 2.5 shows the original speech signal. The energy of the original signal after short-time energy extraction is illustrated in Figure 2.6.

Figure 2.5. The original speech signals

Figure 2.6. Energy waveform after short-time energy extraction

To obtain the short-time energy E of each frame, let the speech signal be y(n), let N represent the frame length, and let the Hamming window w(n) be adopted. X(n) is the n-th windowed speech sample; the expression of the windowed speech frame is shown in equation 2.3.

X(n) = w(n) × y(n), 0 ≤ n ≤ N − 1 (2.3)


Finally, the short-time energy of the frame can be defined by equation 2.4.

E = ∑_{n=0}^{N−1} [X(n)]² (2.4)
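Equations 2.3 and 2.4 amount to windowing a frame and summing its squared samples; a minimal sketch (NumPy; names are ours):

```python
import numpy as np

def short_time_energy(frame, window):
    """Equations 2.3-2.4: X(n) = w(n) * y(n), then E = sum of X(n)^2."""
    windowed = np.asarray(window, dtype=float) * np.asarray(frame, dtype=float)
    return float(np.sum(windowed ** 2))
```

A silent frame yields an energy near zero and a voiced frame a large value, which is what the decision algorithm later thresholds against.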

Zero-crossing rate

The zero-crossing rate (ZCR) is another popular characteristic of a speech signal. It represents the number of times the sampling points change sign; the expression of the ZCR (Pal & Phadikar, 2015) is shown as equation 2.5.

Z = (1/2) ∑_{n=1}^{N−1} |sign[X(n)] − sign[X(n − 1)]| (2.5)

where sign[x] = 1 if x ≥ 0, and −1 if x < 0.

Generally, ZCR is used as a secondary parameter to improve the accuracy of VAD when speech occurs in a noisy environment. The ZCR is higher when an unvoiced or silence segment occurs and lower when a voiced segment occurs. The ZCR extracted from an original speech signal is illustrated in Figure 2.7.

Figure 2.7. Zero-crossing rate of speech signal in noisy environment
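Equation 2.5 simply counts sign changes; as a sketch (NumPy; names are ours):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Equation 2.5: half the summed |sign[X(n)] - sign[X(n-1)]| over a frame."""
    s = np.where(np.asarray(frame, dtype=float) >= 0, 1, -1)  # sign[x]: +1 if x >= 0 else -1
    return 0.5 * float(np.sum(np.abs(s[1:] - s[:-1])))
```

Each sign flip contributes |1 − (−1)| = 2 to the sum, so the half-sum is exactly the number of crossings in the frame.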


(5) Decision Algorithm

Let the threshold value of the short-time energy be TE, and the threshold value of the zero-crossing rate be TZ.

Figure 2.8. An illustration of roughly search and smoothing search on the dual-threshold method

The first detection process is called "roughly search". This process uses short-time energy as the feature, which yields a frame-by-frame energy value. The decision algorithm determines the beginning and endpoint of speech using the energy value of each frame. If the frame energy value is greater than the threshold TE for at least P consecutive frames, we assign the first of those P frames as the Beginning; if this condition is not met, we search for the Beginning point again. The endpoint of speech is found in the same way, except that the frame energy value must be less than the threshold TE for at least P consecutive frames; we then assign the last frame before that run as the Endpoint. These are shown as A and B, respectively, in Figure 2.8.

When the first process above is finished, we make the second-level judgment, which is called "smoothing search". The second-level judgment uses the zero-crossing rate as the feature. We search backward from the Beginning found in the first detection process until we first find a frame that is below the threshold TZ. The result of the second-level judgment is shown as (A) and (C) in Figure 2.8. We then assign this frame as the new Beginning. Finally, we complete the process again by searching forward from the Endpoint. The result is shown as (B) and (D) in Figure 2.8.

As described above, this is how the dual-threshold method works on a speech signal. Generally, a time-domain VAD like the dual-threshold method is the simplest, with high speed and low computation cost. Hence, it is used to achieve fast computation while maintaining accuracy in real-time VAD.
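Our reading of the two searches can be sketched end to end as follows (pure Python; the run-length parameter P, the tie-breaking details, and all names are our assumptions, not the thesis's exact procedure):

```python
def dual_threshold_endpoints(energy, zcr, TE, TZ, P=3):
    """Roughly search on energy, then smoothing search on zero-crossing rate.

    Roughly search: the first run of P consecutive frames with energy above
    TE marks the Beginning; the next run of P consecutive frames below TE
    marks the Endpoint.  Smoothing search: extend both ends outward across
    frames whose ZCR is above TZ, to pull in unvoiced edges of the speech.
    Returns (begin, end) frame indices, or None if no speech is found.
    """
    begin = None
    run = 0
    for i, e in enumerate(energy):            # roughly search: Beginning
        run = run + 1 if e > TE else 0
        if run == P:
            begin = i - P + 1
            break
    if begin is None:
        return None
    end = len(energy) - 1
    run = 0
    for i in range(begin, len(energy)):       # roughly search: Endpoint
        run = run + 1 if energy[i] < TE else 0
        if run == P:
            end = i - P
            break
    while begin > 0 and zcr[begin - 1] > TZ:  # smoothing: backward from Beginning
        begin -= 1
    while end < len(energy) - 1 and zcr[end + 1] > TZ:  # smoothing: forward
        end += 1
    return begin, end
```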

In 2010, Qiuyu Guo and Nan Li (Guo & Li, 2010) showed that different VAD algorithms perform differently in terms of precision, computational complexity, and robustness to noise. To improve the performance of VAD, they chose a dual-threshold method using short-time energy and zero-crossing rate to conquer a noisy environment. The experimental results show that their proposed work is robust in a noisy environment while maintaining the computational cost.

Later, in 2015, Provat Kumar Pal and Santanu Phadikar (Pal & Phadikar, 2015) proposed a method to reduce the computational cost and improve the performance of the dual-threshold algorithm. A pre-processing step, a Wiener filter, is used to enhance the speech signal before short-time energy and zero-crossing rate extraction. The work also introduces an adaptive threshold to improve the VAD decision process in real-world environments. Their experimental results show that the proposed method reduces the computational cost by up to 27.7% and improves ASR accuracy by up to 5.9%. However, this work has some limitations: the researchers assume that no speech is present in the first 100-millisecond interval, and they use a small dataset to measure VAD performance, which may make the results inaccurate.

These are the works closest to the proposed method. All of them use the dual-threshold method to balance computational cost and accuracy, but none of them reduces the waiting time for the result. We therefore aim to reduce the waiting time of the ASR result in an automatic captioned relay service, which requires improving the traditional VAD by exploiting short pauses in speech.

2.3.2 Frequency-domain analysis

In electronics, control systems engineering, and statistics, the frequency domain refers to the analysis of mathematical functions or signals with respect to frequency rather than time. A time-domain graph shows how a signal changes over time, while a frequency-domain graph shows how much of the signal's energy lies within each frequency band over a range of frequencies. A frequency-domain representation can also include the phase shift applied to each sinusoid, which makes it possible to recombine the frequency components and recover the original time signal.

A function or signal can be converted between the time and frequency domains with a pair of mathematical operators called a transform; an example is the Fourier transform. The Fourier transform decomposes a time function into a sum of sine waves of different frequencies, each representing one frequency component. The spectrum of these frequency components is the frequency-domain representation of the signal, and the inverse Fourier transform converts the frequency-domain function back to a time function.
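This time/frequency round trip can be demonstrated with NumPy's FFT routines; the signal and frequencies below are arbitrary examples, not data from the thesis:

```python
import numpy as np

# A 1-second signal sampled at 8 kHz: a 440 Hz and a 1000 Hz sine wave.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Forward transform: time domain -> frequency domain.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# The two strongest frequency components are exactly the two sine waves.
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]

# Inverse transform recovers the original time signal.
x_back = np.fft.irfft(spectrum, n=len(x))
```

Because the window is exactly one second, the frequency bins fall on whole hertz and the two component frequencies appear as two sharp spectral lines.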

Figure 2.9. The original waveform, labeled sum, and its component frequencies


Many experiments in the literature (Jia & Xu, 2002; Misra, 2012; Moattar & Homayounpour, 2009; Shen, Hung, & Lee, 1998) have shown that time-domain analysis usually fails under low-SNR conditions and that its accuracy cannot be improved much further. Nowadays, pattern recognition techniques are widely used to achieve higher accuracy in VAD; they are described in the next section.

2.3.3 Pattern recognition

In the last few years, building VAD with pattern recognition has become very popular, because pattern recognition models such as the Gaussian mixture model (GMM) (Misra, 2012) and neural networks (NN) (Kim et al., 2016; Ryant et al., 2013; Tashev & Mirsamadi, 2016) have been widely used in speech recognition. VADs built with pattern recognition give satisfactory results in noisy environments and are also more accurate on clean speech.

2.3.3.1 Recurrent Neural Network

A neural network is a mathematical or computational model that processes information through a connection-oriented (connectionist) computation. The original concept was derived from the study of bioelectric networks in the brain, which consist of neurons and synapses working together as a network. For speech/non-speech classification, these techniques commonly rely on feature extraction, one example being Mel-frequency cepstral coefficients (MFCC). Many studies (Eyben, Weninger, Squartini, & Schuller, 2013; Hughes & Mierle, 2013) show that a recurrent neural network (RNN) improves noise robustness by using the context of previous speech frames.

Traditional neural networks assume that all inputs (and outputs) are independent of one another. A recurrent neural network (RNN), in contrast, has a memory containing at least one feedback loop, ingesting its own outputs moment after moment as inputs. This enables the network to perform temporal processing and learn sequences, such as sequence recognition or temporal prediction. These techniques are often used for video, audio, natural language processing, and images.

Figure 2.10. A recurrent neural network and the unfolding in time of the computation

involved in its forward computation (Olah, 2015).

A recurrent neural network has two sources of input, the present and the recent past, which combine to determine how it responds to new data. That sequential information is preserved in the network's hidden state, which spans many time steps as it cascades forward to affect the processing of each new example. The hidden state update can be written as follows:

h_t = 𝜙(W x_t + U h_(t-1)) (2.6)

From equation 2.6 (Olah, 2015), h_t is the hidden state at time step t. It is a function of the input at the same time step, x_t, modified by a weight matrix W, added to the hidden state of the previous time step, h_(t-1), multiplied by its own hidden-state-to-hidden-state matrix U, also known as the transition matrix. The weight matrices are filters that determine how much importance to accord to the present input and to the past hidden state. The error from the loss function is returned via back-propagation and used to adjust the weights until the error is as low as possible.

The sum of the weighted input and hidden state is squashed by the function 𝜙, either a sigmoid, tanh, or rectified linear unit (ReLU). These functions are a standard tool for condensing very large or very small values into a bounded space and for keeping gradients workable for backpropagation.
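Equation 2.6 can be written directly as code. The following is a minimal sketch with arbitrary sizes and random weights; the names `rnn_step`, `W`, and `U` are illustrative, and a trained network would learn `W` and `U` by back-propagation rather than draw them at random:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, phi=np.tanh):
    """One step of equation 2.6: h_t = phi(W x_t + U h_{t-1})."""
    return phi(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
n_in, n_hidden = 13, 8                            # e.g. 13 MFCCs per frame
W = 0.1 * rng.normal(size=(n_hidden, n_in))       # input-to-hidden weights
U = 0.1 * rng.normal(size=(n_hidden, n_hidden))   # transition (hidden-to-hidden) matrix

h = np.zeros(n_hidden)                            # initial hidden state
for x_t in rng.normal(size=(5, n_in)):            # a short sequence of frames
    h = rnn_step(x_t, h, W, U)                    # hidden state carries context forward
```

Because tanh squashes its input, every component of the hidden state stays within (-1, 1) no matter how long the sequence runs.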

Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state but also of all those that preceded it, for as long as the memory can persist.

Figure 2.11. RNNs and their variety of outputs (Olah, 2015)

One of the appeals of RNNs is that they might be able to connect previous information to the present task, such as a language model trying to predict the next word based on the previous ones.

(1) Long short-term memory networks

The long short-term memory network (LSTM) is a special kind of RNN capable of learning long-term dependencies. LSTMs were introduced by Hochreiter and Schmidhuber (Hochreiter & Schmidhuber, 1997) and were refined and popularized by many people in later work. They work remarkably well on a large variety of problems and are now widely used.

The LSTM is explicitly designed to avoid the long-term dependency problem; remembering information for long periods of time is practically its default behavior. All recurrent neural networks have the form of a chain of repeating modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.


The LSTM also has this chain-like structure, but its repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way. LSTMs can remove or add information to the cell state, carefully regulated by structures called gates.

The LSTM has also become very popular in natural language processing (Palangi et al., 2016; Yao et al., 2014). Unlike earlier models based on HMMs and similar concepts, LSTMs can learn to recognize context-sensitive languages, and they improve machine translation, language modeling, and multilingual language processing. Moreover, combining LSTMs with convolutional neural networks (CNNs) improves automatic image captioning and a plethora of other applications.

2.3.3.2 Feature extraction for neural network

Feature extraction is the process of finding parameters that represent the characteristics of an audio signal. It is very important: it reduces the complexity of the neural network, the amount of data needed for learning, and the computation time. Much research has shown that feature extraction can improve accuracy, reduce learning time by discarding unnecessary information, and improve the overall performance of neural networks (Misra, 2012).

Nowadays, there are various ways to extract different features. The most commonly used features are MFCC, LPC, PLP, and RASTA-PLP (Dave, 2013), which are popular in speech recognition systems.

(1) Mel Frequency Cepstral Coefficient – MFCC

The Mel-frequency cepstral coefficient (MFCC) is a feature widely used in automatic speech and speaker recognition. MFCCs were introduced by Davis and Mermelstein (Mermelstein, 1976) and have been state of the art ever since. The technique is based on experiments on human perception of words. Prior to the introduction of MFCC, the linear prediction coefficient (LPC) and linear prediction cepstral coefficient (LPCC) were the main feature types for ASR, especially with HMM classifiers.


To extract a feature vector containing all information about the linguistic message, MFCC mimics some parts of human speech production and speech perception. It mimics the logarithmic perception of loudness and pitch in the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics. To represent the dynamic nature of speech, the MFCC also includes the change of the feature vector over time as part of the feature vector.

The standard implementation of computing the Mel-Frequency Cepstral

Coefficients is shown in Figure 2.12.

Figure 2.12. Block diagram of the MFCC algorithm

MFCC is commonly derived as follows:

- Take the Fourier transform of a signal.

- Map the powers of the spectrum obtained above onto the Mel scale,

using triangular overlapping windows.

- Take the logs of the powers at each of the Mel frequencies.

- Take the discrete cosine transform of the list of Mel log powers, as if it

were a signal.

- The MFCCs are the amplitudes of the resulting spectrum.

The input for the computation of the MFCCs is a speech signal in the time-domain representation with a duration on the order of 10-30 milliseconds.

(Figure 2.12 blocks: Speech → Fast Fourier transform → Spectrum → Mel-scale filtering → Mel-frequency spectrum → Log → Discrete cosine transform → Cepstrum coefficients → Derivatives → MFCC.)
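The derivation steps above can be sketched end to end. This is a simplified, NumPy-only illustration; the filter-bank construction, filter count, and coefficient count follow common conventions and are not necessarily the exact implementation used in this thesis:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """Minimal sketch of the MFCC steps for one short (10-30 ms) frame."""
    n = len(frame)
    # 1. Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frame)) ** 2
    # 2. Triangular overlapping filters spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    # 3. Logs of the powers in each Mel band.
    log_mel = np.log(fbank @ power + 1e-10)
    # 4./5. DCT-II of the log powers; the amplitudes are the MFCCs.
    k = np.arange(n_filters)
    return np.array([np.sum(log_mel * np.cos(np.pi * c * (2 * k + 1) / (2 * n_filters)))
                     for c in range(n_coeffs)])
```

A production system would normally use an optimized library routine, but the sketch makes the flow of the block diagram explicit.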


To build a short pause based VAD with more accurate short pause detection, we need to compare the accuracy, detection time, and caption efficiency of a VAD built with a neural network against one built with time-domain analysis. Past research on building VAD with pattern recognition is detailed below.

In 2012, Ananya Misra (Misra, 2012) presented research on speech/non-speech segmentation for speech transcription. The researcher compared feature extractions and classifiers to find the combination that makes speech/non-speech segmentation most effective in noisy environments. The features compared include the low short-time energy ratio, high zero-crossing rate ratio, line spectral pairs (LSP), spectral flux, spectral centroid, spectral roll-off, ratio of magnitudes in the speech band, top peaks, and ratio of magnitudes under top peaks, all measured against the most commonly used Mel-frequency cepstral coefficients (MFCCs). For the classifier comparison, a maximum entropy classifier (Maxent) was compared with Gaussian mixture models (GMM).

The researcher used 95 hours of data from YouTube web videos to obtain a variety of noise. The experimental results showed that the Maxent classifier performed better than the GMM, and that a good choice of feature extraction affects the accuracy of speech recognition.

However, according to earlier research, although pattern recognition has been popular for its efficiency and robustness to noise, this approach needs to estimate the model parameters of the speech and noise signals and requires a large amount of training data.


SHORT PAUSE BASED VOICE ACTIVITY DETECTION

This chapter describes the dataset used to investigate the possibility of using the short pause, along with the experiments. Next, the creation of a short pause based VAD built on the traditional VAD is presented, followed by a problem with short pause and silence detection in both the traditional VAD and the proposed work. Finally, the creation of a pause model based on LSTM-RNN is presented in the last section to solve that problem.

3.1. Dataset

To investigate the frequency of short pauses in speech sentences, the process begins with a dataset. We use LOTUS (Kasuriya, Sornlertlamvanich, Cotsomrong, Kanokphara, & Thatphithakkul, 2003), a large-vocabulary continuous speech recognition (LVCSR) corpus provided by the National Electronics and Computer Technology Center (NECTEC), Thailand. The dataset contains equal numbers of male and female voices, recorded with two types of microphone: a high-quality close-talk microphone and a medium-quality unidirectional one. In addition, the dataset covers two environments, a silent environment and an office-noise environment, the latter for testing under noisy conditions.

3.1.1 Speech and non-speech label

Speech and non-speech segments in the LOTUS dataset are labeled by hand. Silence, or "sil", is a pause longer than 100 milliseconds; a short pause, or "sp", is a pause ranging from 20 to 99 milliseconds. The dataset uses "sil", "sp", and other phonetic symbols as labels. We use these labels to investigate the frequency of short pauses and to measure accuracy while training a recurrent neural network. An example of the labeling in the dataset is illustrated in Table 2.1.


Table 2.1. Sample labeling of a speech sentence.

Start Time (millisecond)   End Time (millisecond)   Phonetic

0 2990323 sil

2990323 3213509 p

3213509 3681851 e

3681851 4610073 n^

4610073 5620114 w

5620114 6850219 a

6850219 8145215 j^

8145215 9268110 kh

9268110 10532071 @@

10532071 11186622 ng^

11186622 12374408 khw

12374408 13635548 aa

13635548 15142145 m^

15142145 15785411 j

15785411 16916769 u

16916769 17768814 ng^

17768814 18691393 j

18691393 20079494 aa

20079494 20367271 k^

20367271 22043148 sp

22043148 22474814 c

22474814 24421540 a

24421540 27169528 j^

27169528 30174257 sil


The use of the dataset is discussed in the next topics.

3.1.2 Short pause in speech

In previous research, silence between words (from 100 milliseconds) is commonly used as the endpoint of speech for better accuracy (Moattar & Homayounpour, 2009; Wu, Kingsbury, Morgan, & Greenberg, 1998). However, using silence as the endpoint in continuous speech may increase the waiting time accordingly. In this work, we focus on using short pauses between words to reduce the waiting time of the ASR result. A pause ranging from 20 to 99 milliseconds is defined as a short pause.

3.2. Research Question

Here are the two main questions of this work:

- How many short pauses can be found in a sentence?

- Is it possible to use the short pause to reduce the waiting time while maintaining accuracy?

3.2.1 How many short pauses can be found in a sentence?

To use the short pause as an endpoint that reduces the waiting time of the ASR result, it is important to investigate the frequency of short pauses occurring during continuous speech.

Twelve hours of continuous speech sentences from the LOTUS dataset are used in the preliminary investigation, which counts the short pauses and silences occurring in the speech sentences. The speech in LOTUS is relatively slow continuous speech (approximately 123 wpm), so the numbers of silences and short pauses differ from those at a higher speech rate. Nevertheless, short pauses and silences in LOTUS are labeled by hand, which is why we use LOTUS to gather preliminary statistics on short pauses in sentences. We therefore write a program that determines the length of each pause and assigns a label as follows: (1) a pause ranging from 20 to 99 milliseconds is defined as a short pause; (2) a pause between words over 100 milliseconds is automatically defined as silence. The number of short pauses and silences in LOTUS is shown in Table 2.2.

Table 2.2. The number of short pauses and silences found in 12 hours of continuous speech sentences

Types         Minimum pause (ms)   Frequency (s)   Sd.     Total    Endpoints (Silence + Short Pause)
Silence       100                  1.14            0.918   36,975   36,975
Short Pause   80                   0.98            0.8      6,125   43,100
              60                   0.82            0.7     13,957   50,932
              40                   0.71            0.61    22,107   59,082
              20                   0.64            0.55    28,944   65,919

Table 2.2 shows the frequency of short pauses in the 12 hours of continuous speech sentences. We found 28,944 short pauses in total (covering the 20-99 millisecond range) and 36,975 silences. In other words, silence can be used as an endpoint every 1.14 seconds on average, whereas a short pause at the 20-millisecond minimum can be used every 0.64 seconds. We note that short pauses are found in sentences no less often than silences.
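The labeling rule used by the counting program can be sketched as follows; the function and variable names are hypothetical, while the thresholds are the 20 ms and 100 ms boundaries defined above:

```python
def label_pauses(pause_lengths_ms, sp_min=20, sil_min=100):
    """Label each pause as in the text: 20-99 ms -> 'sp', >= 100 ms -> 'sil'.

    Pauses shorter than sp_min are ignored (not usable as endpoints).
    """
    labels = []
    for ms in pause_lengths_ms:
        if ms >= sil_min:
            labels.append('sil')
        elif ms >= sp_min:
            labels.append('sp')
    return labels

# Every labeled pause is a candidate endpoint, so the endpoint count is
# #sil + #sp, mirroring the "Endpoints" column of Table 2.2.
labels = label_pauses([10, 25, 60, 150, 99, 100])
endpoints = len(labels)
```

Raising `sp_min` from 20 ms toward 100 ms reproduces the rows of Table 2.2: fewer short pauses qualify, so fewer endpoints are available.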


Figure 2.13. Short pauses and silence found in sentence.

In Figure 2.13, dashed lines represent short pauses and solid lines represent silences. As the investigation of short pause frequency shows, a sentence normally contains several short pauses.

3.2.2 Is it possible to use the short pause to reduce the waiting time and maintain the accuracy?

Considering the frequency of short pauses during speech, the investigation shows that short pauses occur during speech about as often as silences. Hence, it is possible to reduce the waiting time of the traditional VAD by using the short pause as an endpoint instead of only silence. We therefore redesign and improve the traditional VAD algorithm to detect short pauses and determine an endpoint as quickly as possible. The separated speech segments are then transmitted to the ASR and transcribed in near real time.

The proposed work is called short pause based VAD, and its improvement is separated into four steps. In the first step, we improve the traditional VAD by applying a short pause algorithm that uses the short pause as an endpoint instead of only silence; this step builds on the traditional dual-threshold VAD, which has low complexity and computational cost and is simple to implement. In the second step, we apply the endpoint decision to the short pause based VAD; this step aims to reduce WER by monitoring and minimizing the number of short pauses used by the VAD. In the third step, we add a padding silence module to the short pause based VAD to minimize WER further. In the fourth step, we borrow the VAD structure from the third step and replace the two feature extractions of the traditional VAD with the pause model, a speech/non-speech classifier trained with a long short-term memory recurrent neural network (LSTM-RNN). The fourth step is more accurate than the first three at short pause detection and reduces the WER of the ASR result in the near real-time captioning process. Each step is described in the following topics.

3.3. Short Pause based VAD

Figure 2.14. The flow diagram of traditional VAD (Guo & Li, 2010)

(Figure 2.14 blocks: Speech Signals → Pre-processing & Framing → Short-time energy (roughly search) → Zero-crossing rate (smoothing search) → Decision Algorithm → Begin, End → Divide into chunks → Speech Segments.)


Figure 2.15. The flow diagram of the first step on short pause based VAD

Generally, the decision algorithm in the traditional VAD is designed to detect silence longer than 100 milliseconds, and the traditional VAD then uses that silence as an endpoint immediately. The first step improves the traditional VAD (shown in Figure 2.14) to detect short pauses between words instead of only silence. The flow diagram of the proposed work is illustrated in Figure 2.15: the decision algorithm of the traditional VAD is replaced with the short pause algorithm (dashed block in Figure 2.15), which is able to detect short pauses in speech. The details of the short pause algorithm are explained in the topic below.

3.3.1 Short Pause algorithm (Determine silence and short pause)

As discussed in the first question above, short pauses usually occur during speech, and a short pause may be used as an endpoint to reduce the waiting time of the ASR result. To detect short pauses and silences in speech simultaneously, the short pause algorithm in the proposed VAD consists of two main processes:

- The first process determines silence from 100 milliseconds. This process gives the highest probability of an accurate ASR result.

(Figure 2.15 blocks: Speech Signals → Pre-processing & Framing → Short-time energy (roughly search) → Zero-crossing rate (smoothing search) → Short pause algorithm → Begin, End → Divide into chunks → Speech Segments.)


- The second process validates pauses between words in a sentence. If a pause is longer than 20 milliseconds, it is defined as a short pause. This process uses the short pause to find an appropriate endpoint and reduce the waiting time.

To locate short pauses and silences, we use a four-state transition diagram, illustrated in Figure 2.16.

Figure 2.16. State transition diagram for determining short pause and silence


As shown in Figure 2.16, four states are included: silence, maybe-speech, speech, and leaving-speech. We assume the silence state is the start state, and any state can be a final state. The transition conditions are written along the lines between states, and the actions taken on each condition are in brackets. As discussed previously, 𝐸 is the output value of the short-time energy feature and 𝑇𝐸 is the energy threshold. Start points, silences, and the list of short pauses are reported as detected frame numbers. "count" is the number of speech frames detected, and "pause" is the number of pause frames found during speech. "sp" is the minimum pause length, "speech" is the minimum speech threshold, and "silence" is the minimum pause that can be defined as a silence; the latter two parameters are both set to 100 milliseconds.

To illustrate the two main processes in determining silence and short pause, we focus on the leaving-speech state, in which the short pause algorithm detects a weak signal during speech, called a pause. If the pause length is greater than the "silence" threshold, the algorithm classifies the detected pause as silence and moves to the silence state. If the pause length is below the "silence" threshold but greater than the "sp" threshold, we classify the detected pause as a short pause, which is used to find an appropriate endpoint, and the state moves back to the speech state.
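Since Figure 2.16 itself is not reproduced here, the following is a plausible sketch of the four-state machine based on the description above; the exact transition conditions in the figure may differ. Thresholds are counted in frames (for example, with 10 ms frames, silence=10 corresponds to 100 ms):

```python
def detect_pauses(energies, t_e, sp=2, silence=10, speech=10):
    """Sketch of the four-state short pause algorithm.

    energies: per-frame short-time energy E, compared against threshold t_e.
    Returns (start_frames, short_pause_frames, silence_frames).
    """
    state = 'silence'
    count = pause = start = 0
    starts, short_pauses, silences = [], [], []
    for i, e in enumerate(energies):
        if state == 'silence':
            if e > t_e:
                state, start, count = 'maybe-speech', i, 1
        elif state == 'maybe-speech':
            if e > t_e:
                count += 1
                if count >= speech:          # enough frames: confirmed speech
                    state = 'speech'
                    starts.append(start)
            else:
                state = 'silence'            # false alarm, back to silence
        elif state == 'speech':
            if e <= t_e:
                state, pause = 'leaving-speech', 1
        elif state == 'leaving-speech':
            if e <= t_e:
                pause += 1
                if pause >= silence:         # long pause: classified as silence
                    state = 'silence'
                    silences.append(i - pause + 1)
            else:
                if pause >= sp:              # brief weak stretch: a short pause
                    short_pauses.append(i - pause)
                state = 'speech'             # resume speech either way
    return starts, short_pauses, silences
```

Each detected short pause or silence frame number is a candidate endpoint for the endpoint decision described in the next section.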

3.4. Short Pause based VAD with endpoint decision

In preliminary experiments, we separated speech sentences by hand using the short pause as an endpoint. We found that using short pauses increases WER significantly, and that using very short pause lengths increases WER rapidly in the real-time captioning process. We therefore propose the endpoint decision, which monitors and minimizes the use of short pauses, thereby minimizing captioning errors.

In the first step, the short pause algorithm detects pauses during speech. In this step, the endpoint decision then decides whether or not to use a detected pause, based on the delay time of the VAD. The delay time is the actual processing time compared with the average chunk duration. The improved short pause based VAD with endpoint decision is presented in Figure 2.17, and the endpoint decision rules are described below.

Figure 2.17. The flow diagram of short pause based VAD with endpoint decision.

3.4.1 Endpoint decision

Let 𝜃 be the average chunk duration and ∆𝑑 the actual processing time. The algorithm then chooses between silence and short pause as the endpoint with the following rules:

- If silence can be found within 𝜃, the algorithm uses silence as the endpoint.

- If ∆𝑑 > 𝜃 but the VAD cannot find a silence portion, the algorithm uses the latest short pause as the endpoint.
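The two rules can be captured in a small helper. This is a sketch; the argument names and the convention of returning `None` when no endpoint is usable yet are assumptions:

```python
def choose_endpoint(silence_frame, last_short_pause, delta_d, theta):
    """Endpoint decision rules (sketch).

    silence_frame / last_short_pause: detected frame indices, or None.
    delta_d: actual processing time; theta: average chunk duration.
    """
    if silence_frame is not None:
        return silence_frame        # rule 1: silence is always preferred
    if delta_d > theta and last_short_pause is not None:
        return last_short_pause     # rule 2: fall back to the latest short pause
    return None                     # otherwise keep waiting for more audio
```

Because rule 1 takes priority, the less accurate short-pause endpoint is only used when the processing delay has already exceeded the average chunk duration.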

3.5. Short Pause based VAD with padding silence

Generally, speech recognition systems learn from a large dataset that commonly contains some silence before and after the speech. Using a short pause as an endpoint reduces this silence in the speech segment. The idea of padding silence comes from Pheraniti's work (Pheraniti, 2008): the researcher reports an experiment that adds silence before and after the speech segment, and the results show that an appropriate length of silence significantly affects ASR accuracy.

(Figure 2.17 blocks: Speech Signals → Pre-processing & Framing → Short-time energy (roughly search) → Zero-crossing rate (smoothing search) → Short pause algorithm → Endpoint decision → Begin, End → Divide into chunks → Speech Segments.)

In addition, an ASR engine relies largely on a language model that helps predict the words in a sentence. The language model behaves like a state machine looking for the final state of a speech sentence, and that final state is the silence seen during training. Using a short pause as an endpoint removes some silence and unvoiced speech, so the ASR engine cannot reach the final state. Adding a certain amount of silence after separating with a short pause may therefore allow the ASR to recover the missing silence or the unvoiced segment of speech. This technique might reduce WER.

3.5.1 Padding Silence

Figure 2.18. The flow diagram of short pause based VAD with padding silence

(Figure 2.18 blocks: Speech Signals → Pre-processing & Framing → Short-time energy (roughly search) → Zero-crossing rate (smoothing search) → Short pause algorithm → Endpoint decision → Begin, End → Divide into chunks → Speech Segments; if a short pause was used as the endpoint → Padding silence → Speech Segments.)


We add 100 milliseconds of generated silence before or after the short pause to improve accuracy. The second step is improved by adding this padding silence module, and we call the result, the third step, the short pause based VAD with padding silence. Figure 2.18 is the flow diagram of the third step, and its performance is shown in the experimental results.
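A minimal sketch of the padding silence module, assuming zero-valued samples serve as the generated silence; the thesis pads in front or back, so both sides are made optional here:

```python
import numpy as np

def pad_silence(segment, fs, pad_ms=100, front=True, back=True):
    """Prepend/append pad_ms of generated (zero-valued) silence to a
    speech segment that was cut at a short pause."""
    pad = np.zeros(int(fs * pad_ms / 1000), dtype=segment.dtype)
    parts = ([pad] if front else []) + [segment] + ([pad] if back else [])
    return np.concatenate(parts)

fs = 16000
segment = np.ones(8000, dtype=np.float32)   # a 0.5 s dummy speech segment
padded = pad_silence(segment, fs)           # 100 ms of silence on each side
```

At 16 kHz, 100 ms of padding adds 1,600 samples on each side, restoring the silent context the ASR's language model expects at segment boundaries.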

3.6. Short Pause based VAD with LSTM-RNN

Considering the flow diagrams of the traditional VAD and the previous steps, we note that the energy threshold is very important for both the traditional VAD and the proposed algorithm: short pauses and silences cannot be detected if the first energy-threshold stage fails. Wang and Qu's research (Wang & Qu, 2014) shows that a fixed threshold cannot work in a noisy environment. This problem directly affects the ability to detect short pauses and may lead to failure of the algorithm.

Figure 2.19. When the energy threshold is increased, the unvoiced portion might not be included in the speech segment


Raising the threshold is one option for solving the fixed-energy-threshold problem, but a higher fixed threshold introduces a problem of its own, which concerns the search for the start point and endpoint of a speech segment. First, the traditional VAD searches for a start point whose energy continuously increases for at least 100 milliseconds (A). Then the zero-crossing-rate stage scans backward for the first frame with the lowest ZCR and defines that frame as the start point. This result carries some risk: the first lowest-ZCR frame found (B) is probably not the true beginning of speech (C), so unvoiced speech parts are ignored by the zero-crossing-rate process. This problem directly affects the accuracy of the ASR result, as illustrated in Figure 2.19.

Much research (Guo & Li, 2010; Zhang & Junqin, 2015) has shown that short-time energy works well in a clean speech environment. In the proposed work, however, short-time energy cannot capture a complex feature like the short pause, because the characteristics of a short pause are similar to those of speech or of noise occurring within speech. Using short-time energy to search for a detail as small as the short pause is difficult and inaccurate.

It is well known that a long short-term memory recurrent neural network (LSTM-RNN) can understand long-range data sequences better than a vanilla neural network. The LSTM-RNN has memory cells in its internal structure, and each memory cell can write, read, and reset the incoming feature context. We note that a large amount of data is crucial for the LSTM-RNN, as for all neural networks. In recent research, Juntae Kim (Kim et al., 2016) showed that an LSTM-RNN can capture a complex feature such as vowel sounds rather than whole speech; vowel sounds are a complex feature much like the short pause in speech. This research inspired the idea of improving the short pause algorithm in the proposed work, so we use an LSTM-RNN to improve the accuracy of the short pause algorithm.

To apply the LSTM-RNN in the short pause based VAD, we borrow the VAD structure from the third step (short pause based VAD with padding silence) and apply a pause model trained with the LSTM-RNN. The pause model finely detects pauses in a noisy environment, and each detected pause is then classified as a short pause or a silence depending on its length. This step improves the accuracy of the short pause algorithm. The flow diagram of the short pause based VAD with LSTM-RNN is illustrated in Figure 2.20.

Figure 2.20. The flow diagram of short pause based VAD with LSTM-RNN

Figure 2.20 shows the new diagram of short pause based VAD with LSTM-RNN. In all previous steps, the traditional VAD features were short-time energy and zero-crossing rate, which have weaknesses in short pause and silence detection. We replace these traditional VAD features with the pause model, which is learned from short pause, silence, and speech. The creation of the pause model used in the fourth step is described in the next section.

[Figure 2.20 blocks: speech signals → divide into chunks → pre-processing & framing → short pause algorithm (pause model) → endpoint decision (begin, end; short pause used as endpoint) → padding silence → speech segments]
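The pause-length decision described above (a detected pause of at least 100 milliseconds is treated as silence, a shorter one as a short pause) can be sketched as follows. The function name and the rule's form are illustrative, with the 20 ms lower bound coming from the 20-millisecond framing:

```python
def classify_pause(duration_ms):
    """Label a detected pause by its duration in milliseconds.

    A pause of 100 ms or longer is silence; one of 20-99 ms is a short
    pause; anything shorter than one 20 ms frame cannot be detected as
    a pause at all.
    """
    if duration_ms >= 100:
        return "silence"
    if duration_ms >= 20:
        return "short pause"
    return "speech"  # below one 20 ms frame, not detectable as a pause
```

Either label can then be used as a segment endpoint, which is what allows the VAD to cut speech earlier than a silence-only rule would.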


3.6.1 Pause model from LSTM-RNN

We use 12 hours of speech sentences from LOTUS, in which short pauses and silences were labeled by hand. The hand labels and the speech audio in the LOTUS dataset are used to create the pause model with the LSTM-RNN. The dataset is split into 70% for training, with the remaining 20% and 10% used for testing and evaluation, respectively.

The creation of the pause model is shown in Figure 2.21. First, we slice the speech audio into 20-millisecond chunks.
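The slicing step can be sketched as follows, assuming 8 kHz mono samples as used in the experiments (so one 20 ms frame holds 160 samples); the function name is illustrative:

```python
import numpy as np

def frame_signal(samples, sample_rate=8000, frame_ms=20):
    """Cut a 1-D sample array into non-overlapping 20 ms frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 160 samples at 8 kHz
    n_frames = len(samples) // frame_len             # drop the trailing partial frame
    return np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
```

Each row of the returned array is one chunk that is then passed to MFCC feature extraction.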

Figure 2.21. The creation of pause model

Then 13-MFCC feature extraction is used to break the complex sound wave apart into frequency bands (from low to high). This makes an audio chunk easier for the LSTM-RNN to process: trying to recognize speech patterns from raw audio is difficult, and the LSTM-RNN would take a long time to learn. The output of the MFCCs is illustrated in Figure 2.22.

[Figure 2.21 components: speech signals, framing, spectrogram, MFCCs feature extraction, mapping with labels (one-hot vector), recurrent neural network, weight/graph output]


Figure 2.22. The result of MFCCs

Figure 2.22 presents the feature data from the 13 MFCCs. Each color represents how much energy each MFCC coefficient index carries in the speech frames that we feed into the LSTM-RNN. We map each audio chunk to the label active during that chunk. The labels contain three different tags: silence, short pause, and speech.

Figure 2.23. The LSTM-RNN structure of pause model

The LSTM-RNN fits the model by approximating the gradient of the loss function with respect to the neuron weights. The approximation looks at only a small subset of the data at a time, known as a "mini-batch"; each mini-batch contains 200 MFCC feature frames. We feed one 20-millisecond MFCC frame to the LSTM-RNN at a time, a technique called a "time-step": there is a notion of time instances corresponding to a time series or sequence. Once each mini-batch is completed, we immediately calculate the gradient of the loss function, and so on.
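The mini-batch arrangement described above can be sketched as follows. The generator below simply groups consecutive 13-dimensional MFCC frames into batches of 200; the names are chosen for illustration, and each batch would then be fed to the LSTM one frame (time-step) at a time before the gradient step:

```python
import numpy as np

def minibatches(mfcc_frames, batch_size=200):
    """Yield consecutive mini-batches of MFCC feature frames.

    `mfcc_frames` is an (N, 13) array with one 13-dimensional MFCC
    vector per 20 ms frame; the final batch may be smaller than
    `batch_size` when N is not a multiple of it.
    """
    for start in range(0, len(mfcc_frames), batch_size):
        yield mfcc_frames[start:start + batch_size]
```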

[Figure 2.23 layers: input layer → 13 LSTM cells → output node]


The output of the LSTM-RNN is a binary classification in [0, 1] that represents the probability of pause or speech for each MFCC frame: an output of zero (0) indicates a pause, while an output of one (1) indicates speech. The LSTM-RNN structure is presented in Figure 2.23, and the specification of the LSTM-RNN is shown in Table 2.3.

Table 2.3. LSTM-RNN specification

Hidden layers                 | 1
Memory cells                  | 13
Initial learning rate         | 0.001
Weight initialization range   | [-0.1, 0.1]
Decision threshold            | 0.5
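A minimal sketch of the per-frame output decision, assuming the network emits one sigmoid probability per MFCC frame and using the 0.5 decision threshold from Table 2.3; the function is illustrative, not the thesis implementation:

```python
def frame_decision(prob, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a frame label.

    Values at or above the decision threshold are read as speech (output 1),
    values below as pause (output 0), matching the convention above.
    """
    return "speech" if prob >= threshold else "pause"
```

Runs of consecutive "pause" frames are then measured and classified as short pause or silence by their total duration.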


EXPERIMENT

In this work, we propose a new VAD algorithm called Short Pause based VAD. We divide the experiment into four steps to examine, step by step, the use of the short pause in VAD and the research problems it raises.

4.1. Experiments

4.1.1 The first step

In the first step, we apply the short pause algorithm to the traditional VAD and look for the minimum pause length suitable for the algorithm. The length of the short pause varies from 20 to 99 ms, in the ranges 20-39, 40-59, 60-79, and 80-99 milliseconds, respectively (the ranges follow from framing at 20 milliseconds). Finally, we include the traditional VAD setting, which uses only silence of at least 100 milliseconds, in the short pause based VAD. This experiment determines how the length of the short pause affects the waiting time and the WER of the ASR differently.

4.1.2 The second step

This step is called short pause based VAD with endpoint decision. We apply an endpoint decision module to the short pause based VAD from the first step. The endpoint decision monitors and minimizes the use of short pauses in order to minimize WER. This step mainly aims to maintain the accuracy of the ASR result.

4.1.3 The third step

This experiment shows a further improvement of the short pause based VAD. We apply the padding silence module to the short pause based VAD, which reduces the WER caused by short pauses. This step also aims to maintain the accuracy of the ASR result in short pause based VAD.


4.1.4 The fourth step

Lastly, the experiment of short pause based VAD with LSTM-RNN is the final step. We replace the feature extraction, namely short-time energy and zero-crossing rate, with a pause model. The pause model is a speech and non-speech classifier trained with the LSTM-RNN; we propose it to overcome the weakness of short pause detection in a noisy environment. Finally, we compare all previous steps with the traditional VAD on average waiting time and WER.

4.2. Experiment settings

To provide a standard experimental setting, we set the thresholds for short-time energy (𝑇𝐸) and zero-crossing rate (𝑇𝑍) to 𝑇𝐸 = 50000 and 𝑇𝑍 = 20, respectively. These thresholds are easily determined by observation and from the literature (Guo & Li, 2010).
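A hedged sketch of the two traditional features with these fixed thresholds follows; the exact formulas in the thesis implementation may differ (for example in normalization), so this is an illustration only:

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples of one frame."""
    return float(np.sum(np.square(frame, dtype=np.float64)))

def zero_crossing_rate(frame):
    """Count of sign changes between consecutive samples."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat zeros as positive to avoid spurious crossings
    return int(np.sum(signs[1:] != signs[:-1]))

def is_speech_frame(frame, t_e=50000, t_z=20):
    # A frame is taken as speech when its energy exceeds T_E, or when its
    # zero-crossing rate exceeds T_Z (capturing low-energy unvoiced sounds).
    return short_time_energy(frame) > t_e or zero_crossing_rate(frame) > t_z
```

The fixed-threshold nature of this decision is exactly what the fourth step later replaces with the learned pause model.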

The experimental dataset is described in the next section. All experiments are measured by the average waiting time and the WER of the ASR result; the measurements are also described in the next section.

4.3. Experiment dataset

In the experiment, 20 minutes of the LOTUS-BN dataset (Chotimongkol, Saykhum, Chootrakool, Thatphithakkul, & Wutiwiwatchai, 2009) are used to measure the average waiting time of the ASR result. LOTUS-BN is a Thai television broadcast news corpus and a good resource for investigating the waiting time of the ASR result, since its rate of speech is higher than that of natural speech (approximately 196 words per minute). This makes it ideal for analyzing the waiting time caused by VAD. The audio has a 16 kHz sampling rate and is encoded in 16-bit Microsoft PCM format; note that the sample rate is reduced to 8 kHz in this experiment.


4.4. Measurement

4.4.1 Average waiting time

We measure the average waiting time for the ASR results using the detection time of the VAD, the upload time, and the recognition time. The equation of the average waiting time is presented below.

\[
WT_{avg} = \frac{1}{N} \sum_{i=1}^{N} \left( D_i + U_i + C_i \times RTF \right) \tag{4.1}
\]

where D_i is the detection time of the VAD, U_i is the upload time of the speech segment, C_i is the duration of the speech segment, N is the number of speech segments, and RTF is the processing-time factor of the automatic speech recognition (also known as the real-time factor). Each variable in the equation is explained in the illustration below.

Figure 4.1. The description of the waiting time equation

The RTF value in this experiment is fixed at 1 to control for external factors: there are many speech recognition service providers, and each provider has a different real-time factor.
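Equation 4.1 can be evaluated as follows, with hypothetical timing values; RTF is fixed at 1 as in the experiment:

```python
def average_waiting_time(detections, uploads, durations, rtf=1.0):
    """Average waiting time per Equation 4.1.

    The waiting time of each speech segment i is its VAD detection
    time D_i, plus upload time U_i, plus recognition time C_i * RTF
    (segment duration times the real-time factor).
    """
    n = len(durations)
    total = sum(d + u + c * rtf
                for d, u, c in zip(detections, uploads, durations))
    return total / n

# Example (hypothetical values, in seconds):
# average_waiting_time([0.1, 0.1], [0.2, 0.2], [2.0, 3.0]) -> (2.3 + 3.3) / 2 = 2.8
```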

[Figure 4.1 labels: voice streaming → VAD (detection time) → upload time → ASR (recognition time, RTF) → response → automatic captioned relay service]


4.4.2 Word error rate (WER)

To measure the effectiveness of each step of the proposed algorithm, we study two factors: the average waiting time and the WER. For accuracy, we apply the Word Error Rate (WER) to the ASR results. The WER is the number of substitution, deletion, and insertion errors over the number of correct words, as in Equation 4.2.

\[
WER = \frac{N_{SUB} + N_{DEL} + N_{INS}}{N_{SUB} + N_{DEL} + N_{COR}} \tag{4.2}
\]

The WER was developed by Jelinek (1998) and is used to check the accuracy of an ASR engine. It works by calculating the distance between the ASR result (the hypothesis) and the answer text (the reference). In the alignment process, three types of errors can be distinguished:

- substitutions (SUB)

- deletions (DEL)

- insertions (INS)

Table 4.1 shows an example of an alignment produced from example data. The reference is marked REF and the output produced by the ASR is marked HYP. The symbols SUB, INS, and DEL are the types of error used in the WER calculation. The example comparison yields a WER of 38.46%.
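The WER computation described above can be sketched with a standard Levenshtein alignment between the reference and hypothesis token lists; this is a generic implementation, not the thesis code:

```python
def wer(reference, hypothesis):
    """Word error rate as (SUB + DEL + INS) / len(reference).

    The denominator equals SUB + DEL + COR, the number of words in
    the reference, matching Equation 4.2.
    """
    r, h = reference, hypothesis
    # d[i][j] = minimum edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # match / substitution
    return d[len(r)][len(h)] / len(r)
```

For example, a three-word reference with one substituted word yields a WER of 1/3.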


Table 4.1. Alignment and types of error for test phrase


EXPERIMENTAL RESULTS AND DISCUSSIONS

In this chapter, we describe the results of the proposed work step by step. In the first step, the short pause algorithm is applied in the traditional VAD; this step demonstrates the possibility of using the short pause to reduce the waiting time of the ASR result. Afterward, endpoint decision and padding silence are added in the second and third steps to reduce the WER caused by short pauses. Next, we address the weakness of the short pause algorithm in a noisy environment by applying the pause model, a speech and non-speech classifier trained with the LSTM-RNN. Finally, the result of the fourth step is compared with all previous steps and with the traditional VAD.

Our hypothesis is that when we use a short pause instead of silence as the endpoint of speech, there is a chance of dismissing unvoiced speech, such as the sounds of the letters Ch, F, K, P, S, Sh, T, and Th, because these types of speech are statistically similar to background noise. Therefore, studying the use of short pauses of different lengths (described in section 4.1.1) is necessary to understand the trade-off between average waiting time and WER. The result of the first step is shown below.

From Table 5.1, the experimental result shows that various lengths of short pause affect the WER differently (the ranges of short pause follow from framing at 20 milliseconds). The average chunk duration and the average waiting time are both related to the minimum pause. Using the short pause as an endpoint in the short pause algorithm decreases the average waiting time of the ASR result significantly: with a minimum pause of 20 milliseconds, the average waiting time is reduced by up to 71.7%. Unfortunately, the short pause also increases the WER by 35.4% compared with the traditional VAD.


Table 5.1. The average waiting time and WER of ASR result by short pause based VAD compared with traditional VAD

Algorithm                          | Min. pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
-----------------------------------|-----------------|-------------------------|-----------------------|--------------------------|--------
Traditional VAD                    | 100             | 2.14                    | 4.70                  | -                        | 26.8
Short Pause based VAD (1st step)   | 20              | 0.46                    | 1.33                  | 71.7                     | 62.2
                                   | 40              | 0.62                    | 1.65                  | 64.9                     | 49.8
                                   | 60              | 0.91                    | 2.23                  | 52.6                     | 38.9
                                   | 80              | 1.44                    | 3.28                  | 30.2                     | 31.0
                                   | 100             | 2.14                    | 4.67                  | 0.6                      | 26.4

However, a nearly acceptable WER is obtained with a minimum pause of 80 milliseconds, which reduces the average waiting time by 30.2% while increasing the WER by 4.2% compared with the traditional VAD. Note that the endpoint decision and padding silence modules are not included in this step. As mentioned above, using the short pause in VAD can reduce the average waiting time of the ASR result nicely; however, the first step's results show that short pauses clearly increase the WER, because they can dismiss unvoiced speech or split the middle of a word, as in "ab | solutely". The vertical bar (|) marks a false-alarm position, where a low energy value resembles a short pause or silence. As a result, the ASR fails to recognize the speech correctly in real-time captioning.

Moreover, small chunks of speech are another factor that affects the accuracy of the ASR. In general, ASR engines use nearby words to improve the probability of a weak speech signal within a sentence; this technique improves the overall performance of speech recognition. When speech is separated using the short pause as an endpoint, the VAD divides the speech frequently and the speech becomes small chunks, so the ASR cannot use the nearby-word technique to improve the probability of a weak speech signal. This problem directly affects the accuracy of speech recognition; hence, we need to maintain the efficiency and accuracy of the ASR result.

We therefore propose a method to minimize the number of short pauses used as endpoints: the endpoint decision is applied in the short pause based VAD algorithm (described in section 4.1.2). This method considers the delay caused by searching for silence and minimizes the number of short pauses used as endpoints, aiming to minimize the WER of the ASR. The experimental result after adding the endpoint decision is presented in Table 5.2.

Table 5.2 presents the performance of the second step, which applies the endpoint decision to the short pause based VAD. Using the short pause as an endpoint only as needed reduces the WER well. The experimental result shows that the endpoint decision reduces the waiting time by at least 22.8% (using a minimum pause of 80 milliseconds), while the WER increases by only 2.9%, the lowest acceptable WER of this experiment compared with the traditional VAD. The first and second steps together show that excessive use of short pauses degrades the WER rapidly; applying the endpoint decision to minimize the number of short pauses reduces the WER significantly. However, we still need to maintain the accuracy of the ASR result as much as possible.


Table 5.2. The average waiting time and WER of ASR result by short pause based VAD with endpoint decision compared with the previous step and traditional VAD

Algorithm                                                | Min. pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
---------------------------------------------------------|-----------------|-------------------------|-----------------------|--------------------------|--------
Traditional VAD                                          | 100             | 2.14                    | 4.70                  | -                        | 26.7
Short Pause based VAD (1st step)                         | 20              | 0.46                    | 1.33                  | 71.7                     | 62.2
                                                         | 40              | 0.62                    | 1.65                  | 64.9                     | 49.8
                                                         | 60              | 0.91                    | 2.23                  | 52.6                     | 38.9
                                                         | 80              | 1.44                    | 3.28                  | 30.2                     | 31.0
                                                         | 100             | 2.14                    | 4.67                  | 0.6                      | 26.4
Short Pause based VAD with endpoint decision (2nd step)  | 20              | 0.65                    | 1.72                  | 63.5                     | 55.0
                                                         | 40              | 0.84                    | 2.08                  | 55.9                     | 45.6
                                                         | 60              | 1.19                    | 2.79                  | 40.7                     | 35.7
                                                         | 80              | 1.65                    | 3.63                  | 22.8                     | 29.7
                                                         | 100             | 2.14                    | 4.67                  | 0.7                      | 26.7


Table 5.3. The average waiting time and WER of ASR result by short pause based VAD with padding silence compared with the previous steps and traditional VAD

Algorithm                                                | Min. pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
---------------------------------------------------------|-----------------|-------------------------|-----------------------|--------------------------|--------
Traditional VAD                                          | 100             | 2.14                    | 4.70                  | -                        | 26.7
Short Pause based VAD (1st step)                         | 20              | 0.46                    | 1.33                  | 71.7                     | 62.2
                                                         | 40              | 0.62                    | 1.65                  | 64.9                     | 49.8
                                                         | 60              | 0.91                    | 2.23                  | 52.6                     | 38.9
                                                         | 80              | 1.44                    | 3.28                  | 30.2                     | 31.0
                                                         | 100             | 2.14                    | 4.67                  | 0.6                      | 26.4
Short Pause based VAD with endpoint decision (2nd step)  | 20              | 0.65                    | 1.72                  | 63.5                     | 55.0
                                                         | 40              | 0.84                    | 2.08                  | 55.9                     | 45.6
                                                         | 60              | 1.19                    | 2.79                  | 40.7                     | 35.7
                                                         | 80              | 1.65                    | 3.63                  | 22.8                     | 29.7
                                                         | 100             | 2.14                    | 4.67                  | 0.7                      | 26.7
Short Pause based VAD with padding silence (3rd step)    | 20              | 0.68                    | 1.75                  | 62.8                     | 47.7
                                                         | 40              | 0.86                    | 2.09                  | 55.6                     | 39.2
                                                         | 60              | 1.21                    | 2.81                  | 40.2                     | 31.7
                                                         | 80              | 1.68                    | 3.71                  | 21.2                     | 27.6
                                                         | 100             | 2.14                    | 4.70                  | 0.2                      | 26.3


Afterward, in the third step (described in section 4.1.3), we additionally add the padding silence module to the short pause based VAD. This step mainly aims to reduce the WER of the ASR result. The comparisons of WER and average waiting time are presented in Table 5.3.

The result of the third step shows that the average waiting time increases slightly with the average chunk duration, because adding 100 milliseconds of silence lengthens the speech and thus the recognition time of the ASR. The short pause based VAD with padding silence reduces the average waiting time by 21.2%, which is less than the first and second steps, while the WER increases by only 0.9% compared with the traditional VAD. We note that by adding an appropriate length of silence at the front or back of the short pause, the ASR can recover missing unvoiced speech, improving the accuracy of the ASR result.
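The padding step can be sketched as follows, assuming 8 kHz samples so that 100 ms of silence is 800 zero samples. Padding both ends is shown here for illustration, whereas the thesis adds the silence at the front or back of the short pause:

```python
import numpy as np

def pad_silence(chunk, sample_rate=8000, pad_ms=100):
    """Surround a speech chunk with `pad_ms` of digital silence.

    The padding gives the ASR engine a clean boundary, so unvoiced
    sounds truncated by the short pause cut are less damaging.
    """
    pad = np.zeros(int(sample_rate * pad_ms / 1000), dtype=chunk.dtype)
    return np.concatenate([pad, chunk, pad])
```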

However, in the previous steps we found a limitation of both the traditional VAD and the short pause based VAD in a noisy environment. When we tested both VADs on an office-noise recording contained in the testing set, the short pause based VAD delayed the ASR result by up to 35 seconds, and the traditional VAD by 43 seconds. The problem stems from the fixed threshold used by the short-time energy feature, which computes the energy of each frame and decides between speech and non-speech using that fixed threshold.

Figure 5.1. An example sentence in which neither short pause based VAD nor traditional VAD can find silence or a short pause


Figure 5.2. The labeled silence and short pause segments in a noisy speech sentence

As illustrated in Figure 5.1, the straight line represents the start point and the dashed line represents the endpoint detected by the short pause based VAD. The result shows that neither the proposed work nor the traditional VAD can find an endpoint of speech, since the noise energy is higher than the fixed threshold value; deciding on short pauses and silence using the fixed threshold is therefore imprecise. This problem degrades the short pause detection in the short pause based VAD and directly delays the ASR result to the user.

Moreover, we investigated the fixed-threshold problem against the labels. The sentence actually contains 2 segments of silence and 3 segments of short pause that cannot be used as endpoints because of the noise, as shown in Figure 5.2.

Therefore, we propose the fourth step (described in section 4.1.4): short pause based VAD with LSTM-RNN, created to solve the weakness of short pause and silence detection mentioned above. The feature extractions in the short pause based VAD are replaced with a pause model trained with the LSTM-RNN. The pause model is trained on a noisy dataset to enhance the short pause algorithm and the overall efficiency of the short pause based VAD.


Figure 5.3. The speech segments in office noise detected by short pause based VAD with LSTM-RNN

Figure 5.3 shows the performance of the pause model in detecting the short pause in a sentence, with a minimum pause of 60 milliseconds. The straight lines represent start points and the dashed lines represent endpoints detected by the short pause based VAD with LSTM-RNN. The figure shows that this VAD can detect a short pause even in an office-noise environment (A); hence, it is better than using short-time energy with a fixed threshold, which is sensitive to a noisy environment.

The result of the fourth step is shown in Table 5.4. The short pause based VAD with LSTM-RNN reduces the average waiting time by up to 17.1%. In addition, the fourth step reduces the WER of the ASR result by up to 1.5% and 2.29% compared with the traditional VAD and the third step, respectively.

In the first step, using short pauses of only a few lengths reduced the average waiting time well. The fourth step can go down to a minimum pause of 60 milliseconds, whereas all previous steps could only use a short pause of 80 milliseconds. However, the average waiting time in the fourth step is not reduced as much as one might expect: when the pause model replaces short-time energy, it can detect the smallest characteristics in a speech frame, such as unvoiced speech or speech in a noisy environment. As a result, the speech segments become longer and the silences shorter compared with all previous steps and the traditional VAD. This is why the fourth step achieves a smaller reduction in average waiting time but a lower WER than the previous steps.


Table 5.4. The average waiting time and WER of ASR result by short pause based VAD with LSTM-RNN compared with all previous steps and traditional VAD

Algorithm                                                | Min. pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
---------------------------------------------------------|-----------------|-------------------------|-----------------------|--------------------------|--------
Traditional VAD                                          | 100             | 2.14                    | 4.70                  | -                        | 26.7
Short Pause based VAD (1st step)                         | 20              | 0.46                    | 1.33                  | 71.7                     | 62.2
                                                         | 40              | 0.62                    | 1.65                  | 64.9                     | 49.8
                                                         | 60              | 0.91                    | 2.23                  | 52.6                     | 38.9
                                                         | 80              | 1.44                    | 3.28                  | 30.2                     | 31.0
                                                         | 100             | 2.14                    | 4.67                  | 0.6                      | 26.4
Short Pause based VAD with endpoint decision (2nd step)  | 20              | 0.65                    | 1.72                  | 63.5                     | 55.0
                                                         | 40              | 0.84                    | 2.08                  | 55.9                     | 45.6
                                                         | 60              | 1.19                    | 2.79                  | 40.7                     | 35.7
                                                         | 80              | 1.65                    | 3.63                  | 22.8                     | 29.7
                                                         | 100             | 2.14                    | 4.67                  | 0.7                      | 26.7
Short Pause based VAD with padding silence (3rd step)    | 20              | 0.68                    | 1.75                  | 62.8                     | 47.7
                                                         | 40              | 0.86                    | 2.09                  | 55.6                     | 39.2
                                                         | 60              | 1.21                    | 2.81                  | 40.2                     | 31.7
                                                         | 80              | 1.68                    | 3.71                  | 21.2                     | 27.6
                                                         | 100             | 2.14                    | 4.70                  | 0.2                      | 26.3
Short Pause based VAD with LSTM-RNN (4th step)           | 20              | 1.04                    | 2.39                  | 49.3                     | 37.7
                                                         | 40              | 1.32                    | 2.96                  | 37.2                     | 31.8
                                                         | 60              | 1.77                    | 3.90                  | 17.1                     | 25.3
                                                         | 80              | 2.47                    | 5.26                  | -11.8                    | 22.6
                                                         | 100             | 3.12                    | 6.55                  | -39.3                    | 19.8


Figure 5.4. The reduction in average waiting time versus the minimum pause setting in short pause based VAD


Figure 5.5. The comparison of WER and minimum pause setting in short pause based VAD

Figures 5.4 and 5.5 show the average waiting time and the WER for all of the proposed steps. Dotted lines represent the first step (short pause based VAD); dashed lines the second step (short pause based VAD with endpoint decision); dot-dashed lines the third step (short pause based VAD with padding silence); and straight lines the fourth step (short pause based VAD with LSTM-RNN).


We note that the first step achieves a good average waiting time; however, its WER degrades rapidly compared with the other steps. Although the endpoint decision in the second step increases the average waiting time, the WER is reduced compared with the first step. In the third step, padding silence slightly increases the average waiting time but significantly reduces the WER caused by short pauses. Using a minimum pause of 80 milliseconds in the fourth step increases the average waiting time considerably, because the average chunk duration grows with unvoiced speech or speech in noise while the silences shrink; however, this makes the VAD more accurate in short pause and silence detection. The processes of all the proposed steps are illustrated in Figure 5.6.

Figure 5.6. The flow diagram of short pause based VAD in the final step


Figure 5.4 shows that using a minimum pause of 60 milliseconds in the fourth step (Figure 5.6) reduces the average waiting time less than the other steps: from 4.7 to 3.9 seconds, or only 17.1%, compared with reductions of 30.2%, 22.8%, and 21.2% in the first, second, and third steps, respectively. However, the fourth step with a 60-millisecond minimum pause achieves a WER of 25.3%, which is better than the traditional VAD and all previous steps. Hence, the fourth step reduces the waiting time effectively while maintaining the WER of the ASR result, making it suitable for use in an automatic captioned relay service.

Figure 5.7. The frequency and waiting time of Short Pause based VAD with LSTM-RNN compared with traditional VAD

In addition, Figure 5.7 shows the distribution of waiting times for ASR results. The straight line and the dashed line represent the frequency of waiting times for short pause based VAD with LSTM-RNN (the fourth step) and for the traditional VAD, respectively. We use a minimum pause of 60 milliseconds, the best result of the fourth step.

The graph shows that short pause based VAD with LSTM-RNN delivers an ASR result every 3.9 seconds on average, with a narrower distribution of caption delays than the traditional VAD. With the reduced average waiting time, the ASR results are more continuous and regular, which directly benefits the user experience.


CONCLUSION

A high speaking rate is common in continuous speech; the average rate of speech is approximately 150 wpm (NCVS, 2007), and a high speaking rate contains less silence. Silence is a non-speech portion longer than 100 milliseconds. When applying the traditional VAD to the automatic captioned relay service, it separates continuous speech into long speech segments, which the ASR then takes a long time to transcribe: the experimental result shows an average waiting time of 4.7 seconds for the ASR result, since the traditional VAD commonly uses silence as the endpoint of speech. This delay directly harms the user experience and prevents a real-time captioning service.

In this work, we reduce the waiting time of the ASR result by using the short pause as an endpoint. We investigated the frequency of short pauses in speech sentences: short pauses occur in a sentence every 0.64 seconds on average, while silences occur every 1.14 seconds. Based on this, we divided the experiment into four steps. In the first step, we improve the traditional VAD with the proposed algorithm, called short pause based VAD, which detects short pauses and silences simultaneously and uses either as an endpoint to reduce the waiting time. In this step, we aim to reduce the waiting time and study the trade-off between average waiting time and WER. The first step shows that using the short pause as an endpoint greatly reduces the average waiting time; however, short pauses in the lower length ranges increase the WER rapidly.

In the second step, we maintain the accuracy of the ASR result from the first step by applying the endpoint decision module to the short pause based VAD. The endpoint decision monitors and minimizes the use of short pauses in order to minimize the WER. The experimental result shows that the endpoint decision reduces the average waiting time by at least 22.8%, while the WER increases by only 2.9%, the lowest acceptable WER of this work compared with the traditional VAD. The result of the second step also shows that only a minimum pause of 80 milliseconds can be used to reduce the average waiting time.

Afterward, in the third step, we apply the padding silence module to the short pause based VAD. This step also mainly aims to reduce the WER of the ASR result. The result shows that the average waiting time increases slightly with the average chunk duration. The short pause based VAD with padding silence reduces the average waiting time by 21.2%, less than the first and second steps, while the WER increases by only 0.9% compared with the traditional VAD. By adding an appropriate length of silence at the front or back of the short pause, the ASR can recover missing unvoiced speech, improving accuracy and reducing the WER of the ASR result.

We then addressed the weakness of short pause and silence detection in an office-noise environment by replacing the feature extractions in the short pause based VAD with the pause model trained with the LSTM-RNN. The result of the fourth step shows that the pause model solves this weakness: the fourth step reduces the average waiting time by at least 17.1%, while the WER drops to 25.3%, the lowest WER in the experiments.

Short pause based VAD with LSTM-RNN delivers an ASR result in 3.9 seconds on average, compared with 4.7 seconds for the traditional VAD. With the reduced average waiting time, the captions are more continuous and regular, which directly benefits the user experience.


REFERENCES

Books

Frederick Jelinek. (1998). Statistical methods for speech recognition. MIT Press

Cambridge.

Articles

Chotimongkol, A., Saykhum, K., Chootrakool, P., Thatphithakkul, N., & Wutiwiwatchai,

C. (2009). LOTUS-BN: A Thai broadcast news corpus and its research

applications. 2009 Oriental COCOSDA International Conference on Speech

Database and Assessments, ICSDA 2009, 44–50.

https://doi.org/10.1109/ICSDA.2009.5278377

Dave, N. (2013). Feature Extraction Methods LPC , PLP and MFCC In Speech

Recognition. International Journal for Advance Research in Engineering and

Technology, 1(Vi), 1–5.

Eyben, F., Weninger, F., Squartini, S., & Schuller, B. (2013). Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 483–487). https://doi.org/10.1109/ICASSP.2013.6637694

Guo, Q., & Li, N. (2010). An improved dual-threshold speech endpoint detection algorithm. In Computer and Automation Engineering (ICCAE) (pp. 123–126). Singapore.

Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural

Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735


Hughes, T., & Mierle, K. (2013). Recurrent Neural Networks for Voice Activity

Detection. Acoustics, Speech and Signal Processing (ICASSP), 7378–7382.

https://doi.org/10.1109/ICASSP.2013.6639096

Jia, C., & Xu, B. (2002). An improved entropy-based endpoint detection algorithm.

International Symposium on Chinese Spoken Language Processing, 1(1), 1–4.

Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Kanokphara, S., & Thatphithakkul,

N. (2003). Thai speech corpus for Thai speech recognition. Proceedings of

Oriental COCOSDA, (January), 54–61.

Kim, J., Kim, J., Lee, S., Park, J., & Hahn, M. (2016). Vowel based Voice Activity

Detection with LSTM Recurrent Neural Network. Proceedings of the 8th

International Conference on Signal Processing Systems - ICSPS 2016, 134–137.

https://doi.org/10.1145/3015166.3015207

Li, Q., Zheng, J., Tsai, A., & Zhou, Q. (2002). Robust endpoint detection and energy

normalization for real-time speech and speaker recognition. IEEE Transactions

on Speech and Audio Processing, 10(3), 146–157.

https://doi.org/10.1109/TSA.2002.1001979

Mermelstein, P. (1976). Distance measures for speech recognition, psychological and

instrumental. Pattern Recognition and Artificial Intelligence.

Misra, A. (2012). Speech / Nonspeech Segmentation in Web Videos. Proceedings of

InterSpeech 2012.

Moattar, M., & Homayounpour, M. (2009). A simple but efficient real-time voice

activity detection algorithm. European Signal Processing Conference

(EUSIPCO), 2549–2553. https://doi.org/10.1007/978-1-4419-1754-6

Pal, P. K., & Phadikar, S. (2015). Modified energy based method for word endpoints

detection of continuous speech signal in real world environment. In Research

in Computational Intelligence and Communication Networks (ICRCICN) (pp.

381–385). https://doi.org/10.1109/ICRCICN.2015.7434268

Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., … Ward, R. (2016). Deep

Sentence embedding using long short-term memory networks: Analysis and

application to information retrieval. IEEE/ACM Transactions on Audio Speech


and Language Processing, 24(4), 694–707.

https://doi.org/10.1109/TASLP.2016.2520371

Pang, J. (2017). Spectrum energy based voice activity detection. In Computing and

Communication Workshop and Conference (CCWC) (pp. 1–5).

Pheraniti, T. (2008). A Speech/Non-Speech Detection System for Automatic Speech

Recognition. In National Computer Science and Engineering Conference

(NCSEC) (pp. 755–759).

Podder, P., Khan, T. Z., & Haque Khan, M. (2014). Comparative Performance

Analysis of Hamming, Hanning and Blackman Window. International Journal

of Computer Applications, 96(18), 1–7.

Rabiner, L. R., & Sambur, M. R. (1975). An Algorithm for Determining the Endpoints of

Isolated Utterances. Bell System Technical Journal, 54(2), 297–315.

https://doi.org/10.1002/j.1538-7305.1975.tb02840.x

Ryant, N., Liberman, M. Y., & Yuan, J. (2013).

Speech Activity Detection on YouTube Using Deep Neural Networks.

Proceedings of Interspeech, 728–731.

Shen, J., Hung, J., & Lee, L. (1998). Robust Entropy-based Endpoint Detection for

Speech Recognition in Noisy Environments. In 5th International conference

ICSLP ’98 (p. 4).

Tashev, I., & Mirsamadi, S. (2016). DNN-based Causal Voice Activity Detector.

Information Theory and Applications Workshop.

Wang, X., & Qu, L. (2014). The Self-adaptive Voice Activity Detection Algorithm based

on time-frequency Parameters. In The Open Automation and Control Systems

Journal (pp. 1661–1668).

Wu, S. L., Kingsbury, B. E. D., Morgan, N., & Greenberg, S. (1998). Incorporating

information from syllable-length time scales into automatic speech

recognition. In International Conference on Acoustics, Speech and Signal

Processing (ICASSP) (Vol. 2, pp. 721–724).

https://doi.org/10.1109/ICASSP.1998.675366


Yali, C., Dongsheng, L., Shuo, J., & Xuefen, N. (2014). A Speech Endpoint Detection

Algorithm Based on Wavelet Transforms. In Control and Decision Conference

(CCDC) (pp. 3010–3012).

Yao, K., Peng, B., Zhang, Y., Yu, D., Zweig, G., & Shi, Y. (2014). Spoken language

understanding using long short-term memory neural networks. In Spoken

Language Technology Workshop (SLT) (pp. 189–194).

https://doi.org/10.1109/SLT.2014.7078572

Zhang, Z., & Junqin, H. (2015). An adaptive voice activity detection algorithm.

International Journal on Smart Sensing and Intelligent Systems, 8(4), 2175–

2194.

Electronic Media

NCVS. (2007). Voice Qualities. Retrieved November 1, 2017, from

http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/quality.html

Olah, C. (2015). Understanding LSTM Networks. Retrieved September 23, 2017, from

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Jones, S. (2017). Captioned phones. Retrieved August 29, 2017, from

http://www.healthyhearing.com/help/assistive-listening-devices/captioned-

phones


APPENDICES


APPENDIX A

IMPLEMENTATION OF SHORT PAUSE BASED VAD

We implemented Short Pause based VAD using the Visual Studio Code editor and

Python 3. The following third-party Python libraries are required: NumPy,

TensorFlow, python_speech_features, SciPy, Keras, and PycURL.

The ShortPause_VAD class below classifies speech/non-speech

segments. Its state diagram is presented in Figure 2.16.

class ShortPause_VAD(Base_VAD):
    def __init__(self, maxsp=5, **kwargs):
        Base_VAD.__init__(self, **kwargs)
        self.feature = FeatureExtractor()
        self.ste_threshold = 50000
        self.zcr_threshold = 20
        self.maxsp = maxsp
        self.margin = 0
        self.latest_sp = None
        self.latest_end = 0

    # Override
    def reset_variables(self):
        self.begin = None
        self.latest_sp = None
        self.state = 0
        self.count = 0
        self.sil = 0

    # Override
    def decision(self, i, X):
        Base_VAD.decision(self, i, X)
        en = self.feature.short_term_energy(X)
        if self.state == 0:
            if en > self.ste_threshold:
                self.state = 1
                self.count += 1
            else:  # loop in silence
                self.reset_variables()


        elif self.state == 1:
            if en > self.ste_threshold:
                self.count += 1
                if self.count >= self.ensure_speech:
                    self.state = 2
                    position = i - self.count
                    self.begin = self.zcr_beginning_search(position, self.latest_end)
                    self.begin = self.begin - self.margin
            else:
                self.reset_variables()

        elif self.state == 2:
            if en > self.ste_threshold:
                self.count += 1
                self.sil = 0
            else:
                self.state = 3
                self.sil += 1

        elif self.state == 3:
            if en >= self.ste_threshold:
                if self.sil >= self.maxsp:
                    self.consider_sp((i, self.sil))
                self.state = 2
                self.count = self.count + self.sil + 1
                self.sil = 0
            else:
                self.sil += 1
                if (self.sil >= self.maxsil) and self.count >= self.ensure_speech:
                    position = i
                    position = self.zcr_ending_search(position)
                    position = position + self.margin
                    self.cut(self.begin, position)
                    self.latest_end = position
                    self.reset_variables()
                    return

        self.endpoint_decision()

    def cut_at_latest(self):
        position = self.latest_sp[0]
        self.padding_silence_end = True
        self.cut(self.begin, position)
        new_begin = self.latest_sp[0] - self.latest_sp[1]
        self.reset_variables()
        self.latest_sp = None
        self.begin = new_begin
        self.state = 2
        self.padding_silence_begin = True

    def consider_sp(self, sp_tuple):
        self.latest_sp = sp_tuple
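The FeatureExtractor used by the class is not listed in this appendix. A minimal sketch of the two frame features the thresholds refer to (short-term energy and zero-crossing rate) could look as follows; the exact windowing and normalization used in the thesis are assumptions here:

```python
import numpy

class FeatureExtractor:
    """Minimal sketch; the thesis' actual implementation may differ."""

    def short_term_energy(self, frame):
        # mean of squared amplitudes over the frame
        frame = numpy.asarray(frame, dtype=numpy.float64)
        return float(numpy.sum(frame ** 2) / len(frame))

    def zero_crossing_rate(self, frame):
        # number of sign changes, normalized by frame length
        signs = numpy.sign(numpy.asarray(frame, dtype=numpy.float64))
        return float(numpy.sum(numpy.abs(numpy.diff(signs))) / (2 * len(frame)))
```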

The endpoint decision function (proposed in Section 3.4.1) monitors the use of

short pauses and minimizes real-time captioning errors.

    def endpoint_decision(self):
        # self.i is assumed to hold the current frame index,
        # updated by Base_VAD.decision(self, i, X)
        if self.begin and self.state != 3:
            if self.latest_sp and self.avg_chunk and self.i - self.begin >= self.avg_chunk:
                self.cut_at_latest()

The following function generates the 100 milliseconds of silence used by the

padding silence module.

    def get_100ms_silence(self):
        # generate 100 ms of silence at the current sampling rate
        target = int(self.fs * 0.1)
        shape = (target,)
        return numpy.zeros(shape)

The following function pads 100 milliseconds of silence before and/or after a

short pause to improve accuracy (proposed in Section 3.5.1).

    def padding_silence(self, signal):
        if self.plus:
            if self.padding_silence_begin:
                signal = numpy.array(
                    numpy.hstack((self.get_100ms_silence(), signal)),
                    dtype=numpy.int16)
            if self.padding_silence_end:
                signal = numpy.array(
                    numpy.hstack((signal, self.get_100ms_silence())),
                    dtype=numpy.int16)
        self.padding_silence_begin = False
        self.padding_silence_end = False
        return signal
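The effect of the padding can be illustrated standalone. Assuming a 16 kHz sampling rate (the actual rate is configured elsewhere in the system), padding both sides adds 1600 samples on each end:

```python
import numpy

FS = 16000  # assumed sampling rate

def get_100ms_silence(fs=FS):
    # 100 ms of int16 silence, mirroring get_100ms_silence above
    return numpy.zeros(int(fs * 0.1), dtype=numpy.int16)

speech = numpy.ones(3200, dtype=numpy.int16)  # 200 ms of dummy "speech"
padded = numpy.hstack((get_100ms_silence(), speech, get_100ms_silence()))
print(len(padded))  # 3200 + 2 * 1600 = 6400 samples, i.e. 400 ms
```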


APPENDIX B

IMPLEMENTATION OF SHORT PAUSE BASED VAD WITH LSTM-RNN

The ShortPauseRNN_VAD class below uses the pause model to classify silence

and short pause in each speech frame. Its state diagram is presented in

Figure 2.16.
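As background for the pause model, a single LSTM forward step (Hochreiter & Schmidhuber, 1997) can be sketched in NumPy. The gate ordering and stacked-weight layout here are illustrative assumptions, not the Keras internals actually used:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM forward step. W (4h x d), U (4h x h) and b (4h,)
    stack the input, forget, candidate and output gate parameters."""
    h = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:h])            # input gate
    f = sigmoid(z[h:2 * h])       # forget gate
    g = np.tanh(z[2 * h:3 * h])   # candidate cell state
    o = sigmoid(z[3 * h:])        # output gate
    c = f * c_prev + i * g        # new cell state
    h_new = o * np.tanh(c)        # new hidden state
    return h_new, c
```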

class ShortPauseRNN_VAD(Base_VAD):
    def __init__(self, maxsp=5, **kwargs):
        Base_VAD.__init__(self, **kwargs)
        self.maxsp = maxsp
        self.latest_sp = None
        self.plus = True

    def set_model(self, model):
        self.model = model

    def predict(self, X):
        predict = self.model.predict([X])[0]
        speech = round(predict[1])
        return speech

    # Override
    def reset_variables(self):
        self.begin = None
        self.latest_sp = None
        self.state = 0
        self.count = 0
        self.sil = 0

    # Override
    def decision(self, i, X):
        Base_VAD.decision(self, i, X)
        if self.state == 0:
            if self.predict(X):
                self.state = 1
                self.count += 1
            else:
                self.state = 0
                self.count = 0
                self.begin = None


        elif self.state == 1:
            if self.predict(X):
                self.count += 1
                if self.count >= self.ensure_speech:
                    self.state = 2
                    self.begin = i - (self.count + self.margin)
            else:
                self.state = 0
                self.count = 0
                self.begin = None

        elif self.state == 2:
            if self.predict(X):
                self.count += 1
                self.sil = 0
            else:
                self.state = 3
                self.sil += 1

        elif self.state == 3:
            if self.predict(X):
                if self.sil >= self.maxsp:
                    self.consider_sp((i - self.sil, self.sil))
                self.state = 2
                self.count = self.count + self.sil + 1
                self.sil = 0
            else:
                self.sil += 1
                if (self.sil >= self.maxsil) and self.count >= self.ensure_speech:
                    position = (i - self.sil) + self.margin
                    self.cut(self.begin, position)
                    self.reset_variables()
                    return

        self.endpoint_decision()
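The predict method above only requires that the object passed to set_model returns per-class probabilities from a one-frame batch. For offline testing, a hypothetical stub can stand in for the trained Keras pause model:

```python
class StubPauseModel:
    """Hypothetical stand-in for the trained LSTM pause model.
    Returns [P(non-speech), P(speech)] for each frame in the batch."""

    def __init__(self, p_speech=0.9):
        self.p_speech = p_speech

    def predict(self, batch):
        return [[1.0 - self.p_speech, self.p_speech] for _ in batch]

model = StubPauseModel(p_speech=0.8)
probs = model.predict([[0.0] * 13])[0]  # same call shape as self.model.predict([X])[0]
speech = round(probs[1])                # 1 -> speech frame, 0 -> silence
print(speech)
```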


BIOGRAPHY

Name Mr. Kiettiphong Manovisut

Date of Birth May 31, 1991

Educational Attainment

Academic Year 2012: Computer Science, Faculty of

Informatics, Mahasarakham University (MSU),

Thailand

Work Position Software Engineer

Spinsoft Co., Ltd., Thailand

Publications

Manovisut, K., Thatphithakkul, N., & Pokpong, S. (2017). Reducing waiting time in

automatic captioned relay service using short pause in voice activity

detection. In 2017 9th International Conference on Knowledge and Smart

Technology (KST) (pp. 216–219). IEEE.
