introduction: - my.fit.edumy.fit.edu/~vkepuska/thesis/chih-ti shih/chih-ti...  · web viewthe word...


Upload: hoanghanh

Post on 31-Jan-2018


1. INTRODUCTION:

The objective of this thesis is to research and develop prosodic features for

discriminating proper names used in alerting (e.g., “John, can I have that book?”)

from referential context (e.g., “I saw John yesterday”). Prosodic measurements

based on pitch and energy are analyzed to introduce new prosodic-based features

to the Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). During

the process of finding and analyzing the prosodic features, an innovative data

collection method was designed and developed.

In a conventional automatic speech recognition system, the users are required to

physically activate the recognition system by clicking a button or by manually

starting the application. Using the Wake-Up-Word Speech Recognition System, a

person can activate a system by using their voice only. The Wake-Up-Word

Speech Recognition System will eventually further improve the way people use

speech recognition by enabling speech-only interfaces.

In the Wake-Up-Word Speech Recognition System, a word or phrase is used as a

“Wake-Up-Word” (WUW) indicating to the system that the user requires its

attention (i.e., an alerting context). Any user can activate the system by uttering a

WUW (e.g., “Operator”), which enables the application to accept the command

that follows (e.g., “Next slide please”). The non-Wake-Up-Words (non-WUWs)

include WUWs uttered in a referential context, other words, sounds, and noise.


Since the WUW may also occur within a referential context, in which case the

user does not need attention from the system, it is important for the system to

be able to discriminate accurately between the two contexts. The

following examples further demonstrate the use of the word “Operator” in those

two contexts:

Example sentence 1: “Operator, please go to the next slide.” (alerting context)

Example sentence 2: “We are using the word operator as the WUW.” (referential context)

The above cases indicate different user intentions. In the first example, the word “operator” is used to alert the system and get its attention. In the second example, the same word is used in a referential context. The current Wake-Up-Word Speech Recognition System uses only the pre- and post-WUW silence as a prosodic feature to differentiate the alerting and referential contexts.

In this thesis, pitch and energy-based prosodic features are investigated. The

problem of general prosodic analysis is introduced in Section 1.1.

In Chapter 2, the use of pitch as a prosodic feature is described. In general, pitch represents the intonation of speech, and intonation is used to convey linguistic and paralinguistic information (Lehiste, 1970). The definition and characteristics of pitch are covered in Section 2.1. In Section 2.2, a pitch estimation method known as the Enhanced Super Resolution Fundamental Frequency Determinator, or eSRFD (Bagshaw, 1994), is introduced. Finally, in Section 2.3, multiple pitch-based features are derived from the pitch measurements and evaluated to find the feature that best discriminates the WUW used in alerting contexts from the WUW used in referential contexts.

In Chapter 3, an additional prosodic feature based on energy measurement is

described. The definition of prominence, an important prosodic feature based on

energy and pitch, and its characteristics will be covered in Section 3.1. In Section

3.2, a description of energy computation is presented. Finally, in Section 3.3, a

derivation of multiple energy features from the energy measurement is presented

and analyzed.

In Chapter 4, an innovative approach to speech data collection is presented. After a number of prosodic analysis experiments had been conducted using the WUWII Corpus (Tudor, 2007), it was deemed necessary to validate the results on a different data set. Since, to our knowledge, no specialized speech database is available, an idea from Dr. R. Wallace was adopted: collect the data from movies. We designed a system that extracts speech from the audio channel and, if necessary, video information from recorded media (e.g., DVD) of movies and/or TV series. This system is currently under development by Dr. Këpuska’s VoiceKey Group.

The problem definition and system introduction will be explained in Section 4.1,

followed by the system design in Section 4.2.


1.1 PROSODIC ANALYSIS

The word prosody refers to the intonation and rhythmic aspect of a language

(Merriam-Webster Dictionary). Its etymology comes from ancient Greek, where it

was used in singing with instrumental music. In later times, the word was used for

the “science of versification” and the “laws of meter” (William J. Hardcastle,

1997), governing the modulation of the human voice in reading poetry aloud. In

modern phonetics, the word prosody most often refers to those properties of

speech that cannot be derived from the segmental sequence of phonemes

underlying human utterances.

Human speech cannot be fully characterized as the expression of phonemes,

syllables, or words. For example, we can notice that segments or

syllables are shortened or lengthened in normal speech, apparently in accordance

with some pattern. We can also hear that pitch moves up and down in some non-

random way, providing speech with recognizable melody. In addition, one can

hear that some syllables or words are made to sound more prominent than

others.

Based on the phonological aspect, prosody can be classified into prosodic structure, tune, and prominence, which are described as follows:

1. Prosodic structure refers to the noticeable breaks or disjunctures between words in sentences, which can also be interpreted as the duration of the silence between words as a person speaks. This factor is already considered in the current Wake-Up-Word Speech Recognition System, which requires a minimal silence period both before and after the WUW. The silence period just before the WUW is usually longer than the average silence period before non-WUWs or in other parts of the sentence.

2. Tune refers to the intonational melody of an utterance (Jurafsky & Martin)

which can be quantified by pitch measurement also known as fundamental

frequency of the speech. The details on the pitch characteristic, pitch

estimation algorithm, and the usage of pitch features are presented and

explained in Chapter 2.

3. Prominence includes the measurement of stress and accent in

speech. Prominence is measured in our experiments using the energy of

the sound signal. The details of energy computation, feature derivation

based on energy, and the experimental results are presented in Chapter 3.


2. PITCH FEATURES

In this chapter, the intonational melody of an utterance, computed using pitch measurements, is described. The pitch characteristics and a comparison of various pitch estimation algorithms (Bagshaw, 1994) are covered in Section 2.1. Based on the comparison of multiple fundamental frequency determination algorithms (FDAs), the Enhanced Super Resolution Fundamental Frequency Determinator (eSRFD) (Bagshaw, 1994) is selected to perform the pitch estimation. The details of the eSRFD algorithm are covered in Section 2.2. The derivation of multiple pitch-based features and their performance evaluation are covered in Section 2.3.

2.1 PITCH AND PITCH ESTIMATION METHODS

Intonation is one of the prosodic features that may hold the key to discriminating between the referential context and the alerting context. The intonation of speech is strictly interpreted as “the ensemble of pitch variations in the course of an utterance” (Hart, 1975). In tonal languages such as Mandarin Chinese, lexical forms are characterized by different levels or patterns of pitch on a particular phoneme. In contrast, in intonational languages such as English, German, the Romance languages, and Japanese, pitch is used syntactically. In addition, intonation patterns in the intonational languages span groups of words, which are called intonation groups. The words of an intonation group are usually uttered in a single breath. The pitch measurement in the intonational languages reveals the emotion of a person and/or the intention of his/her speech. For example, consider the following sentence:

Can you pass me the phone?

The pattern of continuously rising pitch in the last three words in the above

sentence indicates a request.

Strictly speaking, pitch is defined as the fundamental frequency, or fundamental repetition rate, of a sound. The typical pitch range is 60-200 Hz for adult males and 200-400 Hz for adult females and children. Contraction of the vocal folds produces a relatively high pitch and, conversely, expanded vocal folds produce a lower pitch. This explains why a person’s voice rises in pitch when he/she gets nervous or surprised. The fact that males usually have a lower voice pitch than females and children can likewise be explained by males usually having longer and larger vocal folds.

After years of development of pitch estimation algorithms, pitch estimation

methods can be classified into the following three categories:


1. Frequency-domain methods, such as CFD (Cepstrum-based FØ determinator) and HPS (Harmonic product spectrum), use a frequency-domain representation of the speech signal to find the fundamental frequency.

2. Time-domain methods, such as FBFT (Feature-based FØ tracker) (Phillips, 1985), which uses perceptually motivated features, and PP (Parallel Processing), produce fundamental frequency estimates by analyzing the waveform in the time domain.

3. Cross-correlation methods, such as IFTA (Integrated FØ tracking algorithm) and SRFD (Super resolution FØ determinator), use a waveform similarity metric based on a normalized cross-correlation coefficient.

The eSRFD (Enhanced Super Resolution Fundamental Frequency Determinator) method (Bagshaw, 1994) was chosen to extract the pitch measurements for the Wake-Up-Word because of its high overall accuracy. According to Bagshaw’s experiments, the eSRFD algorithm achieves a combined voiced and unvoiced error rate below 17% and low-gross fundamental frequency error rates of 2.1% and 4.2% for males and females, respectively. Figure 2-1 and Figure 2-2 compare the error rates of eSRFD and other FDAs for male and female voices, respectively.


[Figure 2-1: bar chart. Horizontal axis: CFD, HPS, FBFT, PP, IFTA, SRFD, eSRFD. Vertical axis ticks: 0 to 60. Series: Gross Error Low, Gross Error High, Voiced, Unvoiced.]

Figure 2-1 FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)

In Figure 2-1 and Figure 2-2, the purple bars indicate the low-gross FØ error, which refers to the halving error, where the estimated pitch is about half of the actual pitch. The green bars represent the high-gross FØ error, which refers to the doubling error, where the estimated pitch is about twice the actual pitch. The voiced error, represented by red bars, refers to unvoiced frames that have been misidentified as voiced by the FDA. Finally, the blue bars show the unvoiced error, which refers to voiced frames that have been misidentified as unvoiced.


[Figure 2-2: bar chart. Horizontal axis: CFD, HPS, FBFT, PP, IFTA, SRFD, eSRFD. Vertical axis ticks: 0 to 70. Series: Gross Error Low, Gross Error High, Voiced, Unvoiced.]

Figure 2-2 FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)

Figure 2-1 and Figure 2-2 refer to male and female fundamental frequency

evaluation charts. They show that the eSRFD algorithm achieves the lowest overall

error rate. This result was confirmed in a more recent study (Veprek & Scordilis,

2002). Consequently, eSRFD was chosen to be the FDA used in the present

project.


2.2 ESRFD PITCH ESTIMATION ALGORITHM

The eSRFD (Bagshaw, 1994) is the advanced version of SRFD (Medan, 1991). The

program flow chart of the eSRFD FDA is illustrated in Figure 2-3.

The theory behind the SRFD (Medan, 1991) algorithm is to use a normalized cross-

correlation coefficient to quantify the degree of similarity between two adjacent,

non-overlapping sections of speech. In eSRFD, a frame is divided into three

consecutive sections instead of two as in the original SRFD algorithm.

At the beginning, the sample waveform is passed through a low-pass filter to remove signal noise. The sample utterance is then divided into non-overlapping frames of 6.5 ms length ($t_{interval} = 6.5$ ms). Each frame contains a set of samples, $s_N = \{ s(i) \mid i \in -N_{max}, \ldots, N - N_{max} \}$, which is divided into three consecutive segments, each containing the same number of samples, $n$, where $n$ is varied. The segmentation is defined by Equation 2-1 and is further illustrated in Figure 2-4 below.

$$
\begin{aligned}
x_n &= \{ x(i) = s(i - n) \mid i \in 1, \ldots, n \} \\
y_n &= \{ y(i) = s(i) \mid i \in 1, \ldots, n \} \\
z_n &= \{ z(i) = s(i + n) \mid i \in 1, \ldots, n \}
\end{aligned}
$$

Equation 2-1
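As an illustration of Equation 2-1, the following minimal Python sketch extracts the three segments around a reference sample. The function name `split_segments`, the `center` argument (the list index that plays the role of i = 0), and the choice to return `None` when a segment is undefined are assumptions made for illustration; they are not taken from the eSRFD implementation.

```python
def split_segments(s, center, n):
    """Return (x_n, y_n, z_n) around index `center`, following Equation 2-1.

    s      : list of samples
    center : list index that plays the role of i = 0 (assumed convention)
    n      : number of samples per segment
    Returns None when any of the three segments falls outside the signal.
    """
    first = center + 1 - n   # index of s(1 - n), the first sample of x_n
    last = center + 2 * n    # index of s(2n), the last sample of z_n
    if first < 0 or last >= len(s):
        return None
    x = [s[center + i - n] for i in range(1, n + 1)]  # x(i) = s(i - n)
    y = [s[center + i] for i in range(1, n + 1)]      # y(i) = s(i)
    z = [s[center + i + n] for i in range(1, n + 1)]  # z(i) = s(i + n)
    return x, y, z
```

For example, with `s = list(range(10))`, `center = 3`, and `n = 2`, the three segments are `[2, 3]`, `[4, 5]`, and `[6, 7]`. Near the start or end of the signal the function returns `None`, which corresponds to the undefined-segment case.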


Figure 2-3 eSRFD Flow chart


Figure 2-4 Analysis segments of eSRFD FDA

In eSRFD, each frame is processed by a silence detector, which labels the frame as unvoiced or silent if the sum of the absolute values of $x_{min}$, $x_{max}$, $y_{min}$, $y_{max}$, $z_{min}$, and $z_{max}$ is smaller than a preset value (e.g., a 50 dB signal-to-noise level); conversely, the frame is treated as voiced if this sum is equal to or larger than the preset value. No fundamental frequency is searched for if the frame is marked as unvoiced. If at least one of the segments $x_n$, $y_n$, or $z_n$ is not defined, which usually happens at the beginning and the end of the speech file, the frame is labeled as unvoiced and the FDA is not applied to it.
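The silence test described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_silent` is hypothetical, and `threshold` is assumed to be expressed in the same amplitude units as the samples, since the text gives the preset value only as an example signal-to-noise level.

```python
def is_silent(x, y, z, threshold):
    """Return True if the frame should be labeled unvoiced or silent.

    x, y, z   : the three segments of the frame (lists of samples)
    threshold : preset value, assumed to be in sample amplitude units
    """
    # Sum of |min| and |max| over the three segments.
    total = sum(abs(min(seg)) + abs(max(seg)) for seg in (x, y, z))
    return total < threshold
```

A frame whose segments stay close to zero yields a small sum and is labeled silent; a frame with large excursions is passed on to the candidate search.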

If the frame is not labeled as unvoiced, then candidate values for the fundamental period are searched for among values of $n$ within the range $N_{min}$ to $N_{max}$ by using the normalized cross-correlation coefficient $P_{x,y}(n)$ described by Equation 2-2.

$$
P_{x,y}(n) = \frac{\sum_{j=1}^{[n/L]} x(jL)\, y(jL)}{\sqrt{\sum_{j=1}^{[n/L]} x(jL)^2 \cdot \sum_{j=1}^{[n/L]} y(jL)^2}}
$$

Equation 2-2

In Equation 2-2, the decimation factor $L$ is used to lower the computational load of the algorithm. Smaller values of $L$ allow a higher-resolution search but increase the computational load of the FDA. Larger values of $L$ give faster computation at a lower search resolution. Here $L$ is set to 1, since the purpose of this research is to characterize as accurately as possible the pitch measurements in WUWs; computational speed is therefore considered secondary and is not taken into account. However, $L$ will need to be considered when this algorithm is integrated into the WUW Speech Recognition System.
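The normalized cross-correlation of Equation 2-2 and the effect of the decimation factor can be sketched as follows. The function name `normalized_xcorr` is hypothetical, [n/L] is assumed to denote the integer part of n/L, and returning 0.0 for an all-zero segment is a choice made here to avoid division by zero, not something stated in the text.

```python
import math

def normalized_xcorr(a, b, n, L=1):
    """Normalized cross-correlation of two segments, as in Equation 2-2.

    a, b : segments of at least n samples; a(j) is stored at a[j - 1]
    n    : segment length (candidate fundamental period in samples)
    L    : decimation factor; only every L-th sample is used
    """
    count = n // L  # assumption: [n/L] is the integer part of n/L
    num = energy_a = energy_b = 0.0
    for j in range(1, count + 1):
        u = a[j * L - 1]
        v = b[j * L - 1]
        num += u * v
        energy_a += u * u
        energy_b += v * v
    denom = math.sqrt(energy_a * energy_b)
    return num / denom if denom > 0 else 0.0
```

With `L = 1` every sample contributes, which matches the setting used in this research. With `L = 2` only half of the products are computed, which is where the saving in computational load comes from.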

Figure 2-5 Analysis segments for Px,y(n) in the eSRFD

The candidate values of the fundamental period of a frame are found by locating peaks in the normalized cross-correlation $P_{x,y}(n)$. If a peak value exceeds a specified threshold, $T_{srfd}$, then the frame is further considered to be a voiced candidate. This threshold is adaptive and depends on the voicing classification of the previous frame and three preset parameters. The definition of $T_{srfd}$ is given in Equation 2-3. If the previous frame is unvoiced or silent, then $T_{srfd}$ is set to 0.88. If the previous frame is voiced, then $T_{srfd}$ is the larger of 0.75 and 0.85 times $P'_{x,y}$, the value of $P_{x,y}$ for the previous frame. The threshold is adjusted in this way because the present frame is more likely to be voiced if the previous frame is also voiced.

$$
T_{srfd} =
\begin{cases}
0.88 & \text{if the previous frame is unvoiced or silent,} \\
\max\left[ 0.75,\; 0.85\, P'_{x,y}(n'_0) \right] & \text{if the previous frame is voiced.}
\end{cases}
$$

Equation 2-3
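Equation 2-3 translates directly into a small function. The sketch below uses the three preset parameters quoted in the text (0.88, 0.75, and 0.85); the function and argument names are hypothetical.

```python
def srfd_threshold(prev_voiced, prev_peak=None):
    """Adaptive threshold T_srfd of Equation 2-3.

    prev_voiced : True if the previous frame was classified as voiced
    prev_peak   : P'_{x,y}(n'_0) of the previous frame (needed only
                  when prev_voiced is True)
    """
    if not prev_voiced:
        return 0.88
    return max(0.75, 0.85 * prev_peak)
```

After a strongly periodic voiced frame (a previous peak near 1.0) the threshold relaxes from 0.88 to about 0.85, and it never drops below 0.75.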

In case no candidates for the fundamental period are found in the frame, then the

frame is reclassified as ‘unvoiced’ and no further processing will be applied to the

unvoiced frame. On the other hand, if the frame is classified as voiced, then the

following process will be used to find the optimal candidate as described next.

After the first normalized cross-correlation coefficient $P_{x,y}$ is obtained, the second normalized cross-correlation coefficient, $P_{y,z}$, is calculated for the voiced frame. $P_{y,z}$ is described by Equation 2-4.


$$
P_{y,z}(n) = \frac{\sum_{j=1}^{[n/L]} y(jL)\, z(jL)}{\sqrt{\sum_{j=1}^{[n/L]} y(jL)^2 \cdot \sum_{j=1}^{[n/L]} z(jL)^2}}
$$

Equation 2-4

After the second normalized cross-correlation, a score is assigned to each candidate. If a candidate pitch value of a frame has both $P_{x,y}$ and $P_{y,z}$ larger than $T_{srfd}$, the candidate is given a score of 2. If only $P_{x,y}$ is above $T_{srfd}$, the candidate is given a score of 1. A higher score indicates a higher likelihood that the candidate represents the fundamental period of the frame. Once the scores are assigned, if there are one or more candidates with a score of 2, then all candidates with a score of 1 in that frame are removed from the candidate list. If there is only one candidate with a score of 2, then that candidate is assumed to be the best estimate of the fundamental period of that frame. If there are multiple candidates with a score of 1 but none with a score of 2, then an optimal fundamental period is sought from the remaining candidates.
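The scoring and pruning rules above can be sketched as follows. The data layout (a list of `(n, p_xy, p_yz)` tuples) and the function name `score_candidates` are assumptions made for illustration.

```python
def score_candidates(candidates, threshold):
    """Score candidate periods and drop score-1 entries if any score 2 exists.

    candidates : list of (n, p_xy, p_yz) tuples, one per candidate period n
    threshold  : T_srfd for the current frame
    Returns a list of (n, score) tuples.
    """
    scored = []
    for n, p_xy, p_yz in candidates:
        if p_xy > threshold:
            # Score 2 when both coefficients exceed the threshold, else 1.
            scored.append((n, 2 if p_yz > threshold else 1))
    if any(score == 2 for _, score in scored):
        scored = [(n, score) for n, score in scored if score == 2]
    return scored
```

An empty result corresponds to the case in which no candidate is found and the frame is reclassified as unvoiced.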

In the case of multiple candidates with a score of 1 and none with a score of 2, the candidates are sorted in ascending order of fundamental period. The last candidate in the list, which has the largest fundamental period, has a fundamental period of $n_M$, and the m-th candidate has a fundamental period of $n_m$.

Figure 2-6 Analysis segments for q(nm) in the eSRFD

Then the third normalized cross-correlation coefficient, $q(n_m)$, between two sections of length $n_M$ spaced $n_m$ apart, is calculated for each candidate. The sections of length $n_M$ in a frame are illustrated in Figure 2-6, and Equation 2-5 defines the normalized cross-correlation coefficient $q(n_m)$ used in this case.

$$
q(n_m) = \frac{\sum_{j=1}^{n_M} s(j)\, s(j + n_M + n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j)^2 \cdot \sum_{j=1}^{n_M} s(j + n_M + n_m)^2}}
$$

Equation 2-5

After the third normalized cross-correlation coefficient is computed, the $q(n_m)$ value of the first candidate in the list is taken as the current optimal value. If the $q(n_m)$ of a subsequent candidate, multiplied by 0.77, is larger than the current optimal value, then that candidate's $q(n_m)$ is taken as the new optimal value. The same rule is applied through the entire list of candidates, resulting in the optimal candidate.
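The selection rule can be sketched as a single pass over the sorted candidate list. The function name `pick_optimal` is hypothetical, and the sketch assumes that the stored optimal value is the unscaled $q(n_m)$ of the winning candidate, which is one reading of the description above.

```python
def pick_optimal(periods, q_values, bias=0.77):
    """Pick the optimal candidate period from a list sorted in ascending order.

    periods  : candidate fundamental periods n_m, ascending
    q_values : q(n_m) for each candidate, in the same order
    bias     : factor applied to later candidates (0.77 in the text)
    """
    best = 0  # the first candidate is the initial optimum
    for m in range(1, len(periods)):
        # A later (longer-period) candidate must win even after scaling.
        if q_values[m] * bias > q_values[best]:
            best = m
    return periods[best]
```

Because later candidates are scaled by 0.77 before the comparison, a longer period replaces a shorter one only when its correlation is clearly higher, which favors the shorter fundamental period when the two are comparable.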

In the case where only one candidate has a score of 1 and there are no candidates with a score of 2, the likelihood that the candidate is the true fundamental period of the frame is low. In such a case, if both the previous frame and the next frame are silent, then the current frame is an isolated frame and is reclassified as silent. If either the previous or the next frame is voiced, then the candidate of the current frame is assumed to be optimal and defines the fundamental period of the current frame.

The above algorithm is prone to misidentifying a voiced frame as an unvoiced or silent frame. To counteract this imbalance, biasing is applied when all of the following three conditions are satisfied:

1. The two previous frames were voiced frames.

2. The fundamental period of the previous frame is not temporarily on hold.

3. The fundamental frequency of the previous frame is less than 7/4 times the fundamental frequency of its next voiced frame and is greater than 5/8 of that of the next frame.


After obtaining the fundamental frequency, and in order to further minimize the

occurrence of doubling or halving errors, the pitch contour is passed through a

median filter.

The median filter has a default length of 7, but the length decreases to 5 or 3 when there are fewer than 7 consecutive voiced frames. Figure 2-7 shows an example of doubling errors being corrected by the median filter. In Figure 2-7, the top row shows the pitch measurement generated by the eSRFD FDA and the bottom row shows the corrected measurement after it has passed through the median filter. As the figure shows, the two points marked as doubling errors were corrected by the median filter.

Figure 2-7 Median filter example
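The length-adaptive median filter can be sketched for one run of consecutive voiced frames as shown below. Several details are assumptions made for this sketch and are not specified in the text: the window is truncated at the edges of the run, runs shorter than 3 frames are returned unchanged, and the largest of the lengths 7, 5, and 3 that fits in the run is used. The function name `median_smooth_run` is hypothetical.

```python
from statistics import median

def median_smooth_run(run):
    """Median-filter the F0 values of one run of consecutive voiced frames."""
    if len(run) >= 7:
        size = 7
    elif len(run) >= 5:
        size = 5
    elif len(run) >= 3:
        size = 3
    else:
        return list(run)  # assumption: too short to filter
    half = size // 2
    smoothed = []
    for i in range(len(run)):
        lo = max(0, i - half)             # window truncated at run edges
        hi = min(len(run), i + half + 1)
        smoothed.append(median(run[lo:hi]))
    return smoothed
```

For a run such as `[100, 100, 200, 100, 100, 100, 100]`, in which the third value is a doubling error, every output value is 100, so the isolated doubled estimate is replaced by the pitch of its neighbors.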

We applied the above pitch estimation method to the WUWII (Wake-Up-Word II)

corpus. The WUWII corpus contains 3410 sample utterances and each utterance

sentence contains at least one of the five different WUWs. The five WUWs are


‘Wildfire’, ‘Operator’, ‘ThinkEngine’, ‘Onword’ and ‘Voyager’. Figure 2-8 displays a

sample utterance containing the following sentence where the word “Wildfire” is

the WUW of the sentence.

“Hi. You know, I have this cool wildfire service and, you

know, I'm gonna try to invoke it right now. Wildfire”

Figure 2-8 Example, WUWII00073_009.ulaw

In Figure 2-8, the first row shows the waveform of the speech, the second row

shows the pitch estimation from eSRFD FDA, the third shows the pitch estimation

after the median filter, and the last row shows the audio spectrogram of the


speech. The WUW of this sentence is ‘Wildfire’ which is the section delineated

between two red lines.


2.3 PITCH-BASED FEATURES

The pattern of the fundamental frequency contour of an utterance waveform represents the intonation of the speech. To the best of our knowledge, the problem of discriminating between words used in an alerting context and the same words used in a referential context has not been addressed before. To study it, a specialized speech corpus containing WUWs is necessary. In this project, the corpus named WUWII (Këpuska V.) was chosen. The WUWII corpus contains 3410 sample utterances, and each utterance contains at least one of the five different WUWs. The five WUWs are ‘Wildfire’, ‘Operator’, ‘ThinkEngine’, ‘Onword’ and ‘Voyager’.

Our hypothesis is that intonation rises when the WUW is spoken, so that the average pitch and/or maximum pitch of the WUW sections should increase relative to the non-WUW sections of the utterance. Based on this hypothesis, the average pitch and maximum pitch of the WUWs are considered, and twelve pitch-based features are derived and listed in Table 2-1. Each feature is expressed as the relative change between A and B, which is defined in Equation 2-6 as:

$$
\text{Relative change between } A \text{ and } B = \frac{A - B}{B}
$$

Equation 2-6 Relative Change
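Equation 2-6 can be written as a one-line function. In the sketch below, the function name `relative_change` and the decision to return `None` when B is zero (for instance, when the reference section has no valid pitch) are assumptions made for illustration.

```python
def relative_change(a, b):
    """Relative change between A and B, as in Equation 2-6: (A - B) / B."""
    if b == 0:
        return None  # assumption: no valid reference value
    return (a - b) / b
```

For example, if the average pitch of the WUW is 220 Hz and the average pitch of the section just before it is 200 Hz, the relative change is 0.1, a positive value, which would support the hypothesis for that sample. A WUW at 150 Hz after a 200 Hz section gives -0.25.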


| Feature Name | Feature Definition |
|---|---|
| APW_AP1SBW | The relative change of the average pitch of the WUW to the average pitch of the section just before the WUW. |
| AP1sSW_AP1SBW | The relative change of the average pitch of the first section of the WUW to the average pitch of the section just before the WUW. |
| APW_APAll | The relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections. |
| AP1sSW_APAll | The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections. |
| APW_APAllBW | The relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW. |
| AP1sSW_APAllBW | The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW. |
| MaxPW_MaxP1SBW | The relative change of the maximum pitch in the WUW sections to the maximum pitch in the section just before the WUW. |
| MaxP1sSW_MaxP1SBW | The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the section just before the WUW. |
| MaxPW_MaxPAll | The relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections. |
| MaxP1sSW_MaxPAll | The relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections. |
| MaxP1sSW_MaxPAllBW | The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW. |
| MaxPW_MaxPAllBW | The relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW. |

Table 2-1 Pitch Features definition


The pitch-based feature values have been calculated for the combination of all five WUWs and for each of the five WUWs individually. The detailed performance results are shown in Appendix A. In this section, the results of the pitch-based features are shown and explained using the combination of all five WUWs. These results are presented in Table 2-2 below.

Pitch-Based Features, WUW: All WUWs

| Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0 |
|---|---|---|---|---|---|---|---|
| APW_AP1SBW | 1415 | 726 | 51 | 0 | 0 | 689 | 49 |
| AP1sSW_AP1SBW | 1415 | 735 | 52 | 0 | 0 | 680 | 48 |
| APW_APALL | 2282 | 947 | 41 | 0 | 0 | 1335 | 59 |
| AP1sSW_APALL | 2282 | 996 | 44 | 2 | 0 | 1284 | 56 |
| APW_APALLBW | 2188 | 962 | 44 | 0 | 0 | 1226 | 56 |
| AP1sSW_APALLBW | 2188 | 1003 | 46 | 2 | 0 | 1183 | 54 |
| MaxP_MaxP1SBW | 1415 | 948 | 67 | 53 | 4 | 414 | 29 |
| MaxP1sSW_MaxP1SBW | 1415 | 719 | 51 | 54 | 4 | 642 | 45 |
| MaxPW_MaxPAll | 2282 | 1020 | 45 | 109 | 5 | 1153 | 51 |
| MaxP1sSW_MaxPAll | 2282 | 716 | 31 | 213 | 9 | 1353 | 59 |
| MaxP1sSW_MaxPAllBW | 2188 | 1069 | 49 | 111 | 5 | 1008 | 46 |
| MaxPW_MaxPAllBW | 2188 | 1003 | 35 | 2 | 10 | 1183 | 55 |

Table 2-2 Pitch-Based Features Experimental Results of All WUWs

In Table 2-2, the first column gives the name of the feature. The second column shows the number of samples with valid data; only these samples are processed for the particular feature. The third and fourth columns show the number and percentage of samples, respectively, for which the feature is above zero. Similarly, the fifth and sixth columns show the number and percentage of samples for which the feature equals zero. Finally, the seventh and eighth columns show the number and percentage of samples for which the feature is below zero.
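The column layout described above can be reproduced from a list of per-sample feature values with a short tally. In this sketch the function name `tally_signs`, the dictionary keys, and the rounding of percentages to the nearest integer are assumptions; the thesis does not state how its percentages were rounded.

```python
def tally_signs(values):
    """Count feature values above, equal to, and below zero.

    values : relative-change feature values of the samples with valid data
    Returns counts and integer percentages for each of the three cases.
    """
    valid = len(values)
    pos = sum(1 for v in values if v > 0)
    zero = sum(1 for v in values if v == 0)
    neg = valid - pos - zero

    def pct(count):
        return round(100 * count / valid) if valid else 0

    return {
        "valid": valid,
        "pos": pos, "pos_pct": pct(pos),
        "zero": zero, "zero_pct": pct(zero),
        "neg": neg, "neg_pct": pct(neg),
    }
```

As a consistency check against the first row of the table, 726 positive samples out of 1415 valid samples gives `round(100 * 726 / 1415)`, which is 51.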


From an examination of Table 2-2, we see that the feature with the highest percentage of positive relative change is MaxP_MaxP1SBW, at only 67%. This means that only 67% of the samples have a positive relative change between the maximum pitch of the WUW sections and the maximum pitch of the section just before the WUW. In other words, only 67% of the samples have a maximum pitch in the WUW sections that is higher than the maximum pitch in the section preceding the WUW. The results for the five individual WUWs used in this study are summarized in Table 2-3 below. The full pitch-based experimental results can be found in Appendix A.

WUW | Best Performance Feature Name | Percentage of Positive Relative Change of the Feature
All WUWs | MaxP_MaxP1SBW | 67%
Operator | MaxP_MaxP1SBW | 58%
Onword | MaxP_MaxP1SBW | 58%
ThinkEngine | APW_AP1SBW | 65%
Wildfire | MaxP_MaxP1SBW | 66%
Voyager | MaxP_MaxP1SBW | 79%

Table 2-3 Summarized Pitch-Based Features Experimental Results

Although there appears to be no prior research that has established a definite standard by which performance can be rated, in this project we somewhat arbitrarily set a minimum of 80% as the criterion for a feature to be considered reliable. From the summarized results in Table 2-3, the feature with the best performance is MaxP_MaxP1SBW, whose percentage of positive relative change ranges from 58% to 79% depending on the WUW. This “best performance” of the pitch-based feature analysis is below our 80% minimum, which makes it too low to be considered reliable in discriminating between WUWs and non-WUWs.
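The 80% criterion can be applied mechanically to the summarized results. The sketch below is illustrative only: the percentages are copied from Table 2-3, while the names `RELIABILITY_THRESHOLD`, `table_2_3`, and `reliable_entries` are hypothetical.

```python
RELIABILITY_THRESHOLD = 80  # percent; the minimum chosen in this project

# Best feature and its percentage of positive relative change, per Table 2-3
table_2_3 = {
    "All WUWs": ("MaxP_MaxP1SBW", 67),
    "Operator": ("MaxP_MaxP1SBW", 58),
    "Onword": ("MaxP_MaxP1SBW", 58),
    "ThinkEngine": ("APW_AP1SBW", 65),
    "Wildfire": ("MaxP_MaxP1SBW", 66),
    "Voyager": ("MaxP_MaxP1SBW", 79),
}


def reliable_entries(results, threshold=RELIABILITY_THRESHOLD):
    """Keep only the entries whose percentage meets the threshold."""
    return {
        wuw: (feature, pct)
        for wuw, (feature, pct) in results.items()
        if pct >= threshold
    }


passing = reliable_entries(table_2_3)
print(passing)  # empty: no pitch-based entry reaches the 80% minimum
```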

In the pitch-based features experiment, no significant discriminating pattern could be found in the results obtained. These results could be improved if it were possible to define clear syllabic boundaries. However, syllabic boundaries in English are not clearly defined, and there is no common agreement among linguists on where they fall.

Based on the above experimental results, none of the pitch-based features can be used for discriminating WUWs from non-WUWs. Other approaches, such as pitch measurement patterns, are therefore under consideration, and Raymond Sastraputera, a graduate student working with Dr. Këpuska, will continue research on these new approaches. Other possible approaches to pitch-based features are covered in Chapter 5.

3. ENERGY FEATURES

As mentioned in Section A.1, the prosodic feature known as prominence can be

measured using the energy of the utterance. If pitch represents the intonation of

speech then the energy represents the stress of the speech. In this chapter, the

same approach that was applied to pitch in Chapter 2 was used with energy to

generate a similar feature set.

3.1 ENERGY CHARACTERISTIC

In an English sentence, certain syllables are more prominent than others and

these syllables are called accented syllables. Accented syllables are usually either

louder or longer compared to the other syllables in the same word. In the English

language, different positions of the accented syllables on the same word are used

to differentiate the meaning of the word. For example, the word object (noun

[‘ab.dzekt]) compared to the same word object used as a verb ([ab.’dzekt]) (Cutler,

1986) has accented syllables in different locations. The position of the accented

syllables is indicated by “ ’ ” in the phonetic transcription. If this idea of accented

speech is applied to the entire sentence instead of to a single word, then it may

provide additional clues about the use of a word of interest and its meaning within

the sentence. Our hypothesis here is that the prominence of WUWs should be

more significant compared to the prominence of the non-WUWs in the sentence.

Classifying the factors that model speakers’ speech and how they choose to

accentuate a particular syllable within the whole sentence is a very complex

problem. However, the measurement of the accented syllables can be simply

done by using the energy of the speech signal and its pitch change.

3.2 ENERGY EXTRACTION

The energy of a speech signal can be expressed by Parseval’s Theorem as given in Equation 3-7.

$$\sum_{n=-\infty}^{\infty} |x[n]|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2 \, d\omega$$

Equation 3-7

In Equation 3-7, the energy of a signal is defined in both the time and the frequency domain. Both $|x[n]|^2$ and $|X(\omega)|^2$ represent the energy density, which can be thought of as energy per unit of time and energy per unit of frequency, respectively.

The energy is computed over frames of a fixed size (6.5 ms), the same frame size that was used in the earlier pitch computations. After the energy is calculated for all samples of each utterance in the WUWII corpus, the energy features can be computed in a manner similar to the way the pitch-based features were calculated in Section 2.3. This is done in the next section.
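As an illustration of the frame-level computation, the sketch below sums the squared samples over consecutive 6.5 ms frames and numerically checks the discrete form of Parseval’s relation on a short signal. It is a minimal sketch under stated assumptions, not the thesis implementation: the frames are taken as non-overlapping, the sampling rate of 8000 Hz in the usage example is hypothetical, and the function names are invented for this example.

```python
import cmath
import math


def frame_energies(samples, sample_rate, frame_ms=6.5):
    """Sum of squared samples per consecutive, non-overlapping frame (assumed framing)."""
    frame_len = max(1, int(round(sample_rate * frame_ms / 1000.0)))
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples), frame_len)
    ]


def dft(samples):
    """Naive discrete Fourier transform; adequate for short illustrative signals."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def parseval_sides(samples):
    """Return (time-domain energy, frequency-domain energy) for a finite signal.

    For an N-point DFT the relation reads sum |x[n]|^2 = (1/N) sum |X[k]|^2.
    """
    time_energy = sum(abs(x) ** 2 for x in samples)
    freq_energy = sum(abs(c) ** 2 for c in dft(samples)) / len(samples)
    return time_energy, freq_energy


# Hypothetical usage: 104 samples at an assumed 8000 Hz give two 52-sample frames
energies = frame_energies([1.0] * 104, sample_rate=8000)
time_e, freq_e = parseval_sides([1.0, -2.0, 0.5, 3.0])
print(energies, time_e, freq_e)
```

Because both sides of the relation agree, the energy can be accumulated directly from the time-domain samples without computing a spectrum.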

3.3 ENERGY-BASED FEATURES

Using the same technique as in the previous experiments with pitch-based features, 12 energy-based features were computed and tested. They are derived in the same way as the pitch-based features (Equation 2-6) and are listed and defined in Table 3-4 below:

Feature Name Feature Definition

AEW_AE1SBW The relative change of the average energy of the WUW to the average energy of the previous section just before the WUW.

AE1sSW_AE1SBW The relative change of the average energy of the first section of the WUW to the average energy of previous section just before the WUW.

AEW_AEAll The relative change of the average energy of the WUW to the average energy of the entire sample speech excluding the WUW sections.

AE1sSW_AEAll The relative change of the average energy of the first section in the WUW to the average energy of the entire utterance excluding the WUW sections.

AEW_AEAllBW The relative change of the average energy of the WUW to the average energy of all speech before the WUW.

AE1sSW_AEAllBW The relative change of the average energy of the first section in the WUW to the average energy of the entire sample speech before the WUW.

MaxEW_MaxE1SBW The relative change of the maximum energy in the WUW sections to the maximum energy in the previous section of the WUW.

MaxE1sSW_MaxEAllBW The relative change of the maximum energy in the first section of WUW to the maximum energy in the entire speech before the WUW.

MaxEW_MaxEAll The relative change of the maximum energy in the WUW to the maximum energy of the entire speech sample excluding the WUW section.

MaxE1sSW_MaxEAll The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample excluding the WUW section.

MaxE1sSW_MaxEAllBW The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech before the WUW.

MaxEW_MaxEAllBW The relative change of the maximum energy in the WUW sections to the maximum energy of the entire speech sample before the WUW.

Table 3-4 Energy-Based Features Definition
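To make the definitions in Table 3-4 concrete, the sketch below computes three of the features from per-section frame energies, using the relative-change form (WUW value - reference value) / reference value that labels the plots later in this chapter. The data layout (each section as a list of frame energies), the function names, and the use of `None` for samples without valid data are assumptions made for illustration.

```python
def relative_change(wuw_value, reference_value):
    """(wuw - reference) / reference; None when no valid reference exists."""
    if reference_value is None or reference_value == 0:
        return None
    return (wuw_value - reference_value) / reference_value


def mean(values):
    return sum(values) / len(values)


def energy_features(sections_before, wuw_sections):
    """Compute three Table 3-4 features from lists of per-section frame energies.

    `sections_before` and `wuw_sections` are lists of sections, each section
    being a non-empty list of frame energies (an assumed layout).
    """
    if not sections_before or not wuw_sections:
        return {"AEW_AE1SBW": None, "AE1sSW_AE1SBW": None, "MaxEW_MaxE1SBW": None}

    last_before = sections_before[-1]          # the section just before the WUW
    all_wuw = [e for section in wuw_sections for e in section]

    return {
        "AEW_AE1SBW": relative_change(mean(all_wuw), mean(last_before)),
        "AE1sSW_AE1SBW": relative_change(mean(wuw_sections[0]), mean(last_before)),
        "MaxEW_MaxE1SBW": relative_change(max(all_wuw), max(last_before)),
    }


# Hypothetical frame energies: two sections before the WUW, two WUW sections
features = energy_features(
    sections_before=[[1.0, 1.0], [2.0, 2.0]],
    wuw_sections=[[4.0, 2.0], [1.0, 1.0]],
)
print(features)
```

A positive value means the WUW measurement exceeds the reference measurement, which is the case counted in the "Pt > 0" column of the result tables.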

In this experiment, some of the features may not be implementable in real-time applications, since they rely on measurements taken after the WUW of interest. Nevertheless, even those features may lead to interesting conclusions. For real-time speech recognition systems, the features that do not rely on measurements after the WUW of interest are the most useful. Table 3-5 below shows the results for the energy features based on all five WUWs of the WUWII corpus, namely the words “Operator”, “ThinkEngine”, “Onword”, “Wildfire” and “Voyager”. The detailed results for each of the five WUWs can be found in Appendix B.

Energy-Based Features, WUW: All WUWs

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 1479 | 1164 | 79 | 0 | 0 | 315 | 21
AE1sSW_AE1SBW | 1479 | 1283 | 84 | 1 | 0 | 240 | 16
AEW_AEAll | 2175 | 1059 | 49 | 9 | 9 | 1116 | 51
AE1sSW_AEAll | 2175 | 1155 | 53 | 2 | 0 | 1018 | 47
AEW_AEAllBW | 1969 | 1427 | 72 | 0 | 0 | 542 | 28
AE1sSW_AEAllBW | 1969 | 1562 | 79 | 3 | 0 | 404 | 21
MaxEW_MaxE1SBW | 1479 | 1244 | 84 | 20 | 1 | 215 | 15
MaxE1sSW_MaxEAllBW | 1479 | 1221 | 83 | 13 | 1 | 245 | 17
MaxEW_MaxEAll | 2175 | 1373 | 63 | 13 | 1 | 245 | 17
MaxE1sSW_MaxEAll | 2175 | 1336 | 61 | 25 | 1 | 814 | 37
MaxE1sSW_MaxEAllBW | 1969 | 1209 | 61 | 16 | 1 | 744 | 38
MaxEW_MaxEAllBW | 1969 | 1562 | 60 | 3 | 1 | 404 | 39

Table 3-5 Energy-Based Feature Experimental Results of All WUWs

In Table 3-5, the first column gives the name of the feature and the second column gives the number of samples with valid data; only these samples are processed for that feature. The third and fourth columns show the number and percentage of samples for which the feature is above zero. The fifth and sixth columns show the number and percentage of samples for which the feature equals zero. The seventh and eighth columns show the number and percentage of samples for which the feature is below zero.

Based on the results shown in Table 3-5, the following three features performed

the best in discriminating WUW from other word tokens:

AE1sSW_AE1SBW: The relative change of the average energy of the first section in the WUW compared to the average energy of the last section before the WUW. Using this feature, 84% of the data show that the average energy of the first section of the WUW is higher than the average energy of the previous section. This result is illustrated in Figure 3-9 below, which depicts the distribution of this feature in blue and its cumulative distribution in red. The cumulative plot is a continuous curve that approaches a value of 100%; the distribution plot is discrete. Both plots are presented in black in Appendices A and B.

[Plot not reproduced. Series: (WUW1stAE-LSAE)/LSAE and (WUW1stAE-LSAE)/LSAE cumulative plot; axis ticks from 0 to 100; axis unit: %.]

Figure 3-9 Distribution and Cumulative plots of energy-based feature AE1sSW_AE1SBW of AllWUWs.

MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW sections compared to the maximum energy of the last section before the WUW. Using this feature, 84% of the samples show that the maximum energy in the WUW sections is higher than the maximum energy of the previous section. The distribution and cumulative plots of this feature are shown in Figure 3-10 below.

[Plot not reproduced. Series: (WUWMAXE-LSMAXE)/LSMAXE and (WUWMAXE-LSMAXE)/LSMAXE cumulative plot; axis ticks from 0 to 100; axis unit: %.]

Figure 3-10 Distribution and Cumulative plots of energy-based feature MaxEW_MaxE1SBW of AllWUWs.

MaxE1sSW_MaxEAllBW: The relative change of the maximum energy of the first section of the WUW compared to the maximum energy from the last section before the WUW. Using this feature, 83% of the samples show a higher maximum energy in the first section of the WUW than in the previous section. The cumulative and distribution plots of this feature are shown in Figure 3-11.

[Plot not reproduced. Series: (WUW1stMAXE-LSMAXE)/LSMAXE and (WUW1stMAXE-LSMAXE)/LSMAXE cumulative plot; axis ticks from 0 to 100; axis unit: %.]

Figure 3-11 Distribution and Cumulative plots of energy-based feature MaxE1sSW_MaxEAllBW of AllWUWs.
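Distribution and cumulative curves of the kind shown in Figures 3-9 to 3-11 can be derived from the per-sample feature values as sketched below. This is only an illustration: the bin range (0 to 100), the number of bins, and the clamping of out-of-range values into the end bins are assumptions, since the binning used for the original plots is not stated here.

```python
def distribution_and_cumulative(values, lo=0.0, hi=100.0, n_bins=10):
    """Return (distribution, cumulative), both as percentages of all samples.

    Bins are equal-width over [lo, hi]; values outside the range are
    clamped into the first or last bin (an assumption of this sketch).
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) // width)
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1

    total = len(values)
    distribution = [100.0 * c / total if total else 0.0 for c in counts]

    cumulative = []
    running = 0.0
    for share in distribution:
        running += share
        cumulative.append(running)
    return distribution, cumulative


# Hypothetical relative changes, expressed in percent
dist, cum = distribution_and_cumulative([5, 15, 15, 95])
print(dist)
print(cum)
```

The cumulative list is non-decreasing and ends at 100 when at least one sample is present, matching the description of a curve that approaches 100%.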

The above results are based on all the data, covering all five WUWs together. Investigating each word independently may therefore be more appropriate. The detailed performance results for each of the five WUWs are given in Appendix B.

Linguistically, one of the more common and useful WUWs is the word “Operator”. This word is also used in the current Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). Based on the results presented in Table 3-6 below, two features show that over 90% of the WUW cases have an average or maximum energy higher than the other sections of speech. These two features are:

AE1sSW_AE1SBW: The relative change of the average energy of the first section in the WUW compared to the average energy of the last section before the WUW. Using this feature, 94% of the samples have a first WUW section with higher average energy than the previous section.

AE1sSW_AEAllBW: The relative change of the average energy of the first section in the WUW compared to the average energy of the entire speech before the WUW sections. Using this feature, 91% of the samples show that the first section of the WUW has higher average energy.

Energy-Based Features, WUW: Operator

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 275 | 228 | 83 | 0 | 0 | 47 | 17
AE1sSW_AE1SBW | 275 | 258 | 94 | 0 | 0 | 17 | 6
AEW_AEAll | 418 | 248 | 59 | 0 | 0 | 170 | 41
AE1sSW_AEAll | 418 | 290 | 69 | 1 | 0 | 127 | 30
AEW_AEAllBW | 394 | 303 | 77 | 0 | 0 | 91 | 23
AE1sSW_AEAllBW | 394 | 359 | 91 | 1 | 0 | 34 | 9
MaxEW_MaxE1SBW | 275 | 240 | 87 | 1 | 0 | 34 | 12
MaxE1sSW_MaxEAllBW | 275 | 243 | 88 | 0 | 0 | 32 | 12
MaxEW_MaxEAll | 418 | 290 | 69 | 4 | 1 | 124 | 30
MaxE1sSW_MaxEAll | 418 | 285 | 68 | 6 | 1 | 127 | 30
MaxE1sSW_MaxEAllBW | 394 | 272 | 69 | 4 | 1 | 118 | 30
MaxEW_MaxEAllBW | 394 | 359 | 68 | 1 | 1 | 34 | 30

Table 3-6 Energy-Based Feature Experimental Results of the WUW “Operator”

Based on the experiments performed, the WUW “Wildfire” achieved the best overall result. For this word, four features scored 90% or higher. These results are shown in Table 3-7.

Energy-Based Features, WUW: Wildfire

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 282 | 253 | 90 | 0 | 0 | 29 | 10
AE1sSW_AE1SBW | 282 | 261 | 93 | 0 | 0 | 21 | 7
AEW_AEAll | 340 | 173 | 51 | 0 | 0 | 167 | 49
AE1sSW_AEAll | 340 | 185 | 54 | 0 | 0 | 155 | 46
AEW_AEAllBW | 298 | 252 | 85 | 0 | 0 | 46 | 15
AE1sSW_AEAllBW | 298 | 265 | 89 | 0 | 0 | 33 | 11
MaxEW_MaxE1SBW | 282 | 258 | 91 | 8 | 3 | 16 | 6
MaxE1sSW_MaxEAllBW | 282 | 253 | 90 | 2 | 1 | 27 | 10
MaxEW_MaxEAll | 340 | 230 | 68 | 4 | 1 | 106 | 31
MaxE1sSW_MaxEAll | 340 | 219 | 64 | 4 | 1 | 117 | 34
MaxE1sSW_MaxEAllBW | 298 | 195 | 65 | 4 | 1 | 99 | 33
MaxEW_MaxEAllBW | 298 | 265 | 62 | 0 | 1 | 33 | 36

Table 3-7 Energy-Based Feature Experimental Results of WUW “Wildfire”

The four best features were found to be the following:

AEW_AE1SBW: The relative change of the average energy of the entire WUW compared to the average energy of the last section just before the WUW. Using this feature, 90% of the samples show that the average energy of the WUW is higher than that of the previous section.

AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW compared to the average energy of the last section before the WUW. Using this feature, 93% of the samples show that the first section of the WUW has higher average energy.

MaxEW_MaxE1SBW: The relative change of the maximum energy of the WUW sections compared to the maximum energy in the last section before the WUW. Using this feature, 91% of the samples show that the WUW has higher maximum energy.

MaxE1sSW_MaxEAllBW: The relative change of the maximum energy of the first section of the WUW compared to the maximum energy of all sections before the WUW. Using this feature, 90% of the samples show that the first section in the WUW has higher maximum energy.

Overall, the experimental results for the energy-based features are summarized in Table 3-8 below. The best two features are AE1sSW_AE1SBW, with scores ranging from 71% to 94%, and MaxEW_MaxE1SBW, with scores between 66% and 87%. For both features, the lowest scores occur for the WUW “ThinkEngine”. The word “ThinkEngine” has relatively low energy-based feature values because the sound “th” is an unvoiced fricative and has the lowest relative intensity of all English sounds (Fry, 1979). In addition, the nasal sound “eg” also has a low relative intensity among English sounds. Despite the lower results for the WUW “ThinkEngine”, the performance of these two features is between 84% and 94%.

WUW | Best Performance Feature Name | Percentage of Positive Relative Change of the Feature
All WUWs | AE1sSW_AE1SBW, MaxEW_MaxE1SBW | 84%
Operator | AE1sSW_AE1SBW | 94%
Onword | MaxEW_MaxE1SBW | 87%
ThinkEngine | MaxEW_MaxE1SBW | 71%
Wildfire | AE1sSW_AE1SBW | 93%
Voyager | MaxEW_MaxE1SBW | 83%

Table 3-8 Summarized Energy-Based Features Experimental Results

From the above results, it can be concluded that the WUW is frequently

emphasized or accentuated compared to the rest of the words in the utterance.

Thus the hypothesis that the prominence feature of the WUWs is more significant

than the prominence feature of the non-WUWs is verified. In addition, the energy-

based features of AE1sSW_AE1SBW and MaxEW_MaxE1SBW can be used reliably

for discriminating WUWs and non-WUWs with properly selected WUWs.

4. DATA COLLECTION

In this chapter, we introduce a revolutionary way to collect speech samples. The

preliminary design of this data collection system will also be presented in this

chapter.

4.1 INTRODUCTION TO THE DATA COLLECTION

After developing the WUW discriminating features based on two prosodic

measurements of pitch and energy, described in Chapter 2 and 3, it was realized

that the data used to generate those features may not be the most suitable. The

corpus used in this project is the WUWII corpus which was collected by Dr

Këpuska in 2002. It provides data on a number of WUWs under alerting situations

but doesn’t contain data for the same words used under referential situations. As

a result, it was only possible to perform an analysis based on the changes between

alerting types of WUWs against the overall sentence and not on information in

which the same word appears in a referential situation. Another drawback of the

current WUWII corpus is that it contains speech that is not spontaneous. The

speakers whose voices were used to develop this data set were given each WUW

and asked to make up a sentence using it as a WUW in an alerting situation. Under

these laboratory conditions, some speakers may change the way they normally

speak.

In order to perform a more complete analysis, we will need a corpus which

includes both alerting and referential WUWs in context with natural spoken

utterances. Therefore, it was decided to use a suggestion made by Dr. R. Wallace,

to extract audio samples from a movie or a TV program.

Extracting speech samples from movies and TV programs has the following

advantages compared to the data collection method used in developing WUWII

corpus:

1. The speech examples are more natural. The speech from professional

actors is more natural since they tend to think and speak like a particular

character and act the situation of the character that they are depicting.

2. The data collection process will cost much less since we are not

compensating individuals to record their voices. There are no copyright

problems since the data is not being sold or used for commercial purposes.

3. A large amount of data can be collected in a short period of time once the

process is fully automated.

4. The voice channel data is of CD quality. In this project, speech data was

extracted from recorded videos rather than over conventional telephone

lines as was done in developing the WUWII corpus.

5. No manual labeling is required. We plan to use the transcripts obtained from the video channel (see section 4.2), which provides time stamps for all spoken sentences.

In view of the above listed advantages, it was decided to design an automatic data

collection system to collect specific speech data suitable for prosodic analysis of

proper first names used in referential contexts vs. alerting (or WUW) contexts.

4.2 SYSTEM DESIGN

The data collection project is a part of the prosodic features analysis project which

is illustrated by the program flow chart shown in Figure 4-12. The prosodic

features analysis project can be divided into three sub projects. In Figure 4-12, the

green boxes represent the project of prosodic features extraction which are

described in Chapters 2 and 3 of this thesis.

Figure 4-12 Program Flow Chart

[Flow chart not reproduced. Recoverable block labels: Movie Clips; Audio Channel Extraction; Video Channel Extraction; Closed Captioning; Extraction of Transcription from CC, Time Markers, and Sentence Parsing; Forced Alignment; RelEx: Language Analysis Tool; Prosodic Feature Extraction; Processing of Prosodic Features; Image Sequence Processing; Image Segmentation; Feature Extraction; Processing of Image Features; Analysis, Prosodic & Image WUW Modeling; Corpus Building.]

The blue boxes depict the WUW data collection project. Finally, the purple boxes represent a future project on video analysis.

In the prosodic feature analysis project, we use the prosodic features generated from acoustic measurement to differentiate the context of the words. In a part of

the WUW data collection project, language analysis tools will be used to automatically classify the words of interest as referential or alerting. At the moment, the capabilities of the tool used for this purpose, RelEx (Novamente LLC), must be augmented in order to achieve this goal. The outcome of the WUW Speech Data Collection project will not only be a specialized corpus for the prosodic analysis project, but will also provide confirmation of the results from the prosodic analysis. The detailed program flow chart of the WUW Speech Data Collection System is shown in Figure 4-13 below.

Figure 4-13 WUW Audio Data Collection System Program Flow Diagram

[Flow diagram not reproduced. Recoverable block labels: Video Sample; Subtitle Extractor; Video Transcription File (.srt); English Name List; English Name Dictionary; Sentence Parser; Sentence Transcription & Time Marker; Audio Extractor; Audio Parser; Name Sentence Time Marker; Name Sentence Transcription; Name Audio Sample; RelEx Language Analysis Tool; Sentence Transcription with Syntactic Labels; HTK Forced Alignment; WUW or nonWUW Marker; Corpus Building.]

The input of the system will be (1) the video file of a movie or TV program, (2) a video transcription file, which is used if provided and is otherwise extracted from the video stream by the subtitle extractor SubRip (Zuggy), and (3) an English

dictionary of proper American first names (Campbell). If there is no video transcription file and the subtitles are encoded in the video stream, the subtitle extractor SubRip (Zuggy) will extract the subtitles and the time stamps of each sentence from the video stream. An example of a transcription file extracted in this manner is presented in Figure 4-14 below.

# Date | Time | Index | Start Time | End Time | Transcription
11-17-08 | 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - It was fun.- Oh yeah bet it was fun
11-17-08 | 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar.- Martinez.
11-17-08 | 12:14:41 | 40 | 00:03:47,444 | 00:03:50,848 | - See, I didn't even know, first thing basis.- We're all set.
11-17-08 | 12:14:41 | 41 | 00:03:50,905 | 00:03:54,041 | Oh hey, diversity everybody let's do it.
11-17-08 | 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here.
11-17-08 | 12:14:41 | 43 | 00:03:56,465 | 00:03:59,686 | - Jim can you rapid up please?- Yeah.
11-17-08 | 12:14:41 | 44 | 00:04:00,494 | 00:04:02,325 | It's diversity day Jim,
11-17-08 | 12:14:41 | 45 | 00:04:02,390 | 00:04:04,231 | wish everyday was diversity day.

Figure 4-14 Example of Video Transcription File

The transcription files provide the following information: date and time when the files were created, the subtitle index number, the start time and end time of each subtitle, and the subtitle transcription.
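One line of the transcription file in Figure 4-14 can be split into its fields as sketched below. This is a minimal illustration, assuming the pipe-separated layout shown in the figure; the `SubtitleRecord` class and the function names are hypothetical and are not part of SubRip or of the programs developed for this project.

```python
from dataclasses import dataclass


@dataclass
class SubtitleRecord:
    date: str
    time: str
    index: int
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds
    text: str


def timestamp_to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' into seconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0


def parse_transcription_line(line):
    """Split one pipe-separated line into a SubtitleRecord."""
    # Split at most five times so the transcription text stays in one piece
    parts = [part.strip() for part in line.split("|", 5)]
    if len(parts) != 6:
        raise ValueError("expected six pipe-separated fields: " + line)
    date, time, index, start, end, text = parts
    return SubtitleRecord(
        date=date,
        time=time,
        index=int(index),
        start_s=timestamp_to_seconds(start),
        end_s=timestamp_to_seconds(end),
        text=text,
    )


record = parse_transcription_line(
    "11-17-08 | 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here."
)
print(record)
```

Converting the time stamps to seconds makes them directly usable as offsets when the corresponding audio section is cut out later.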

The media audio extractor (AOAMedia.com) extracts the audio channel from the video file. Then, using an English dictionary of first names and the sentence transcription with time markers, an application program called the sentence parser, developed by VoiceKey team members (Pattarapong, Ronald, & Xerxes, Sentence Parser Program, 2009), selects the sentences that include proper English first names. Figure 4-15 shows an example of the output of the sentence parser program.
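The selection step performed by the sentence parser can be approximated as follows. This sketch is not the Sentence Parser Program cited above; it only illustrates the idea of keeping transcription lines whose text contains a word from a first-name list. The function name, the three-name list, and the simple word matching are assumptions.

```python
import re


def sentences_with_names(lines, first_names):
    """Return (line, matched_names) for every line whose text contains a listed name."""
    names = {name.lower() for name in first_names}
    selected = []
    for line in lines:
        # The transcription text is the last of the six pipe-separated fields
        text = line.split("|", 5)[-1]
        words = re.findall(r"[A-Za-z']+", text)
        hits = [word for word in words if word.lower() in names]
        if hits:
            selected.append((line, hits))
    return selected


# Hypothetical name list; the project uses a dictionary of American first names
name_list = ["Oscar", "Jim", "Dwight"]
lines = [
    "11-17-08 | 12:14:41 | 41 | 00:03:50,905 | 00:03:54,041 | Oh hey, diversity everybody let's do it.",
    "11-17-08 | 12:14:41 | 42 | 00:03:54,670 | 00:03:56,407 | Oscar works in here.",
    "11-17-08 | 12:14:41 | 44 | 00:04:00,494 | 00:04:02,325 | It's diversity day Jim,",
]
selected = sentences_with_names(lines, name_list)
for line, hits in selected:
    print(hits, line)
```

A plain word match of this kind cannot tell whether the name is used in an alerting or a referential context; that distinction is left to the language analysis step described next.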

---------------------------------------
11-17-08 | 12:14:41 | 23 | 00:02:19,475 | 00:02:23,539 | - Thanks Dwight.
---------------------------------------
11-17-08 | 12:14:41 | 36 | 00:03:32,794 | 00:03:36,601 | - Hey! Oscar, how you doing man?
---------------------------------------
11-17-08 | 12:14:41 | 38 | 00:03:39,025 | 00:03:42,659 | - Oh yeah bet it was fun
11-17-08 | 12:14:41 | 39 | 00:03:43,822 | 00:03:47,297 | - Oh hey! This is Oscar.

Figure 4-15 An example of the output of the sentence parser application program

In the next step, the audio parser (Pattarapong, Ronald, & Xerxes, Audio Parser Program, 2009) uses the information from the sentence parser to extract the corresponding audio sections from the audio file produced by the media audio extractor.

After the extraction of a sentence that contains an English first name, RelEx (Novamente LLC) is used to analyze the selected sentence. RelEx is an English-language semantic relationship extractor based on the Carnegie Mellon Link Parser (Temperley, Lafferty, & Sleator, 2000). RelEx is able to provide sentence information on various parts of speech such as subjects, direct objects, indirect

objects, as well as word tags such as verb, gender, and noun. The WUW data collection project is currently developing either a rule-based process or a statistical pattern recognition process based on the relationship information produced by RelEx. Ultimately, the system should be able to accurately determine whether the name in the sentence is used in a WUW or nonWUW context.

A necessary step in the automation process is to obtain precise time markers indicating the words of interest. To achieve this, one could use the Hidden Markov Model Toolkit (HTK) (Machine Intelligence Laboratory of the Cambridge University Engineering Department) to perform forced alignment of the audio stream. HTK was initially developed by the Machine Intelligence Laboratory (formerly known as the Speech, Vision, and Robotics Group) of the Cambridge University Engineering Department (CUED). HTK uses Hidden Markov Models (HMMs) to compare the acoustic features of the incoming audio with the known acoustic features of all 41 English phonemes and to predict the most likely combination of phonemes associated with the incoming audio. It then maps the phonemes to words from the lexicon dictionary. In our case, since the transcription of the sentences is known, HTK is used to map the phonemes of known words to the corresponding time intervals. The phoneme time labels, or equivalently the word boundaries of the spoken sentence, are used to locate the WUWs or nonWUWs in the time domain. Note that this step can also be performed by Microsoft’s Speech Development Kit (SDK), a speech recognition system that is fully integrated in Microsoft’s Vista OS. The advantage of Microsoft’s system is that we do not need to train it, since the acoustic models are built in. However, an application incorporating Microsoft’s SDK features would have to be developed. HTK, by contrast, does not require any significant integration coding, but it does require acoustically accurate models. Automation of the described data collection process will be made possible by integrating the outputs from RelEx with the forced alignment.

With time-segmented sentence labels of the audio stream indicating the WUW or

non-WUW context, a new corpus can be generated, just like the WUWII corpus. This

data will then be used to perform prosodic analysis and develop new features or

refine existing prosodic features. It is expected that further study with the new

data will not only validate the current prosodic analysis results but will also

provide information useful for developing new prosodic features. The ultimate

goal of this speech data collection project is to build a suitable specialized corpus

for the research on finding reliable features to discriminate between WUWs and

non-WUWs in both alerting and referential contexts.
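The word-boundary lookup described above can be sketched as follows. This is a minimal illustration, not the tool chain used in this work: it assumes the forced alignment has been written as HTK-style label lines of the form "start end label", with times in units of 100 ns. The utterance name, the words, and the time values in the example are hypothetical.

```python
from typing import List, NamedTuple

HTK_TIME_UNIT = 1e-7  # HTK label times are expressed in units of 100 ns


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def parse_htk_labels(lines: List[str]) -> List[Segment]:
    """Parse 'start end label' lines from an HTK-style label file."""
    segments = []
    for line in lines:
        parts = line.split()
        # Skip the MLF header, file-name lines, and the terminating "."
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        start, end = int(parts[0]), int(parts[1])
        segments.append(
            Segment(start * HTK_TIME_UNIT, end * HTK_TIME_UNIT, parts[2])
        )
    return segments


def find_word(segments: List[Segment], word: str) -> List[Segment]:
    """Return every segment whose label matches the word of interest."""
    return [s for s in segments if s.label.lower() == word.lower()]


# Hypothetical word-level alignment of "I saw John yesterday"
example = [
    "#!MLF!#",
    '"*/utt001.lab"',
    "0 2500000 I",
    "2500000 6000000 saw",
    "6000000 11000000 John",
    "11000000 19000000 yesterday",
    ".",
]
segments = parse_htk_labels(example)
john = find_word(segments, "john")
for seg in john:
    print(f"{seg.label}: {seg.start:.2f} s to {seg.end:.2f} s")
```

The resulting time interval could then be combined with a WUW or non-WUW context label for the sentence to cut the corresponding audio segment for prosodic analysis.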

5. Conclusions

This thesis investigated two types of prosodic features, pitch-based and

energy-based, and designed an innovative speech data collection system. The

experimental results for the pitch-based features across all five WUWs are shown

in Table 5-9, and those for the energy-based features are

shown in Table 5-11. In this study we arbitrarily decided that the relative change of

any feature should be 80% or higher before we would consider it a reliable

discriminator between WUWs and non-WUWs. In addition, it was found that no

single feature works best on all five WUWs used in this study. Each WUW

may require a different feature to achieve the best performance. It can

be concluded that the same feature will not discriminate all WUWs

equally well between their use in alerting contexts and referential contexts.

Pitch Features, WUW: All WUWs

Feature               Valid Data   Pt > 0    % > 0   Pt = 0    % = 0   Pt < 0    % < 0
APW_AP1SBW                  1415      726       51        0        0      689       49
AP1sSW_AP1SBW               1415      735       52        0        0      680       48
APW_APALL                   2282      947       41        0        0     1335       59
AP1sSW_APALL                2282      996       44        2        0     1284       56
APW_APALLBW                 2188      962       44        0        0     1226       56
AP1sSW_APALLBW              2188     1003       46        2        0     1183       54
MaxPW_MaxP1SBW              1415      948       67       53        4      414       29
MaxP1sSW_MaxP1SBW           1415      719       51       54        4      642       45
MaxPW_MaxPAll               2282     1020       45      109        5     1153       51
MaxP1sSW_MaxPAll            2282      716       31      213        9     1353       59
MaxP1sSW_MaxPAllBW          2188     1069       49      111        5     1008       46
MaxPW_MaxPAllBW             2188     1003       35        2       10     1183       55

Table 5-9 Pitch Features Result of All WUWs

Table 5-9 shows the performance of the pitch-based features when all five

WUWs are included. As the table shows, the best-performing feature,

MaxPW_MaxP1SBW, reaches only 67%, which is not high enough to allow reliable

discrimination between WUWs and non-WUWs.
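As a worked check of how a row of Table 5-9 relates to the 80% criterion, the short sketch below converts the counts of positive, zero, and negative relative changes into percentages and compares the positive share with the threshold. The function names are illustrative only, and rounding to whole percent is an assumption that happens to reproduce the tabulated values for this row.

```python
THRESHOLD_PERCENT = 80  # reliability threshold chosen in this study


def summarize(valid, positive, zero, negative):
    """Convert counts of relative-change signs into rounded percentages."""
    if valid <= 0:
        raise ValueError("valid must be positive")
    return tuple(round(100 * n / valid) for n in (positive, zero, negative))


def is_reliable(percent_positive, threshold=THRESHOLD_PERCENT):
    """True when the share of positive relative changes meets the threshold."""
    return percent_positive >= threshold


# Counts for MaxPW_MaxP1SBW taken from Table 5-9
pct_pos, pct_zero, pct_neg = summarize(1415, 948, 53, 414)
print(pct_pos, pct_zero, pct_neg, is_reliable(pct_pos))
```

For this row the positive share is 67%, so the feature falls short of the 80% criterion.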

Table 5-11 below shows the performance of the energy-based features for all

five WUWs used in the present study.

Energy Features, WUW: All WUWs

Feature               Valid Data   Pt > 0    % > 0   Pt = 0    % = 0   Pt < 0    % < 0
AEW_AE1SBW                  1479     1164       79        0        0      315       21
AE1sSW_AE1SBW               1479     1283       84        1        0      240       16
AEW_AEAll                   2175     1059       49        9        9     1116       51
AE1sSW_AEAll                2175     1155       53        2        0     1018       47
AEW_AEAllBW                 1969     1427       72        0        0      542       28
AE1sSW_AEAllBW              1969     1562       79        3        0      404       21
MaxEW_MaxE1SBW              1479     1244       84       20        1      215       15
MaxE1sSW_MaxEAllBW          1479     1221       83       13        1      245       17
MaxEW_MaxEAll               2175     1373       63       13        1      245       17
MaxE1sSW_MaxEAll            2175     1336       61       25        1      814       37
MaxE1sSW_MaxEAllBW          1969     1209       61       16        1      744       38
MaxEW_MaxEAllBW             1969     1562       60        3        1      404       39

Table 5-11 Energy Features Result of All WUWs

As Table 5-11 shows, several energy-based features have

positive relative changes above 80%. In addition, some individual WUWs achieve

multiple energy-based features with a positive relative change of 90% or more,

as covered in Section 3.3 and detailed in Appendix B. These results provide

firm evidence that there are significant increases in the energy measurements

when WUWs are spoken. These results confirm that the prominence of WUWs is

greater than the prominence of non-WUWs. Therefore, we can conclude

that energy-based features can be used to discriminate between WUWs and

non-WUWs. A future improvement would be to quantify the level of change when

comparing WUWs to non-WUWs.
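As a rough illustration of what quantifying the level of change might involve, the sketch below computes the relative change between the average energy of a word segment and that of a reference segment. The definitions used here are assumptions made for the example, not the feature definitions of this thesis: average energy is taken as the mean squared amplitude, and relative change as (word - reference) / reference. The signal, sample rate, and time intervals are synthetic.

```python
def average_energy(samples):
    """Mean squared amplitude of a sequence of samples (assumed definition)."""
    if not samples:
        raise ValueError("empty segment")
    return sum(x * x for x in samples) / len(samples)


def relative_change(word_value, reference_value):
    """(word - reference) / reference; positive means the word is more prominent."""
    if reference_value == 0:
        raise ValueError("reference value must be non-zero")
    return (word_value - reference_value) / reference_value


def slice_by_time(samples, sample_rate, start_s, end_s):
    """Return the samples between start_s and end_s (seconds)."""
    return samples[int(start_s * sample_rate):int(end_s * sample_rate)]


# Synthetic signal: 1.0 s of amplitude 1 followed by 0.5 s of amplitude 2
sample_rate = 10  # toy rate, chosen only to keep the example small
signal = [1.0] * 10 + [2.0] * 5

reference = slice_by_time(signal, sample_rate, 0.0, 1.0)  # section before the word
word = slice_by_time(signal, sample_rate, 1.0, 1.5)       # the word itself

change = relative_change(average_energy(word), average_energy(reference))
print(change)
```

Reporting the magnitude of such a value, rather than only its sign, is one way the size of the difference between WUWs and non-WUWs could be compared.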

6. Future Work

Two potential solutions are being considered to address the

insufficient accuracy reported in this work for pitch-based features. They will be

implemented by the VoiceKey team of Dr. Këpuska and are outlined as follows:

1. Build a specialized corpus that contains the same words in both WUW

and non-WUW contexts. The speech sentences in the current corpus, WUWII,

contain only WUWs and no non-WUWs. A new speech data collection

system, presented in Chapter 4, will allow the creation

of a database from the collected data that includes both WUWs and

non-WUWs.

2. Use different approaches in defining pitch-based features. For example,

when using average and maximum pitch measurements of the WUW,

the way the pitch pattern changes should also be considered.
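One possible way to capture how the pitch pattern changes, offered only as an illustration of the second item and not as a feature defined in this thesis, is the least-squares slope of the pitch contour over the word. The sketch assumes that the pitch tracker marks unvoiced frames with a value of 0, and the contour values are hypothetical.

```python
def pitch_slope(times, f0_values):
    """Least-squares slope (Hz per second) of the voiced pitch points."""
    # Assumption: unvoiced frames are marked with F0 = 0 and are dropped
    points = [(t, f) for t, f in zip(times, f0_values) if f > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two voiced frames")
    mean_t = sum(t for t, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    if den == 0:
        raise ValueError("all frames share the same time stamp")
    return num / den


# Hypothetical contour over a word: rising pitch with one unvoiced frame
times = [0.0, 0.1, 0.2, 0.3]
f0 = [100.0, 0.0, 120.0, 130.0]
slope = pitch_slope(times, f0)
print(round(slope, 3))
```

A positive slope indicates rising pitch over the word and a negative slope indicates falling pitch, which could complement the average and maximum pitch measurements.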

Finally, the new data collection system, which collects both WUWs and non-WUWs,

has been designed and partially implemented. Work on this data collection system

will be continued by the VoiceKey group at Florida Institute of Technology. The

ultimate goal of this speech data collection project is to build a suitable specialized

corpus of data samples in order to find suitable prosodic features to reliably

discriminate between WUWs and non-WUWs.
