
Speech Analyser in an ICAI System for TESOL

by

Huayang Xie

A thesis submitted to the Victoria University of Wellington

in fulfilment of the requirements for the degree of

Master of Science in Computer Science.

Victoria University of Wellington
2004

Abstract

There is an increasing demand for computer software which can provide useful personalised feedback to English as a Second Language (ESL) speakers on prosodic aspects of their speech, to supplement the shortage of ESL teachers and reduce the cost of learning.

This thesis concentrates on constructing such an Intelligent Computer Aided Instruction (ICAI) prototype system, particularly focusing on one component — the Speech Analyser. The speech analyser recognises a user's speech, identifies the rhythmic stress pattern in the speech, discovers stress and rhythm errors in the speech, and provides reports for the other component, which generates personalised feedback to the user on ways of effectively improving the prosodic aspects of the speech.

We build a Hidden Markov Model (HMM) based speech recogniser to recognise a user's speech. A set of parameters for constructing the recogniser is investigated by an exhaustive experiment implemented in a client/server computing network. The exploration suggests that the choice of parameters is very important. We build stress detectors to detect the rhythmic stress pattern in the user's speech by using both Support Vector Machine (SVM) and Decision Tree (DT) techniques. The detector using SVM outperforms the one using DT, which suggests that SVM is more suitable than DT for a relatively large data set in which all features are numeric. We build an error identifier to automatically identify stress and rhythm errors in the user's speech. A two-layer phoneme alignment algorithm using the Needleman/Wunsch technique is developed to facilitate the prosodic error identification problem. Our study also suggests that the foot comparison method is better than the Vowel Onset Point comparison method for automatically identifying the main rhythm errors in the user's speech.

Acknowledgments

I would like to give special thanks to my two great supervisors, Peter Andreae and Mengjie Zhang, for their time, patience, enlightening instructions, eye for detail, and everything they did for me, especially the financial support arrangement. During my research, not only did they give their time to discuss the progress of my thesis and remain available to solve problems, but they also encouraged me and helped me to convert part of my work into two published papers ([82, 83]). Without their generous assistance, my research might have been a painful experience. I will never forget the great experience of working with them.

Thanks also to other members of this research group, including David Crabbe and Paul Warren of the School of Linguistics and Applied Language Studies, Irina Elgort of the University Teaching Development Centre, Neil Leslie of the School of Mathematical and Computing Sciences, and Mike Doig of VicLink.

Most importantly, I thank my parents for their constant encouragement. My wife Yi, although very busy with her own work, always gave my research a higher priority and provided substantive support. Especially during the last phase of completing this thesis, my son Tianyi and my coming baby Tianying gave me great emotional support.

I also thank all the people not mentioned here who have contributed to my research, including other AI group and technical group members in the School of Mathematical and Computing Sciences, and people from the virtual communities whom I have never met.


Thanks also to Simon Doherty for helpful advice and comments.

I was mainly financially supported by the New Economy Research Fund, New Zealand. I was also financially supported by the School of Mathematical and Computing Sciences Graduate Scholarship and the Computer Science Studentship, Victoria University of Wellington, New Zealand.

Contents

1 Introduction
  1.1 Motivation
  1.2 ICAI Outline
  1.3 Span Overview
  1.4 Issues Addressed
    1.4.1 Constructing the Speech Recogniser
    1.4.2 Building the Stress Detector
    1.4.3 Creating the Error Identifier
    1.4.4 Summary
  1.5 Contributions
  1.6 Thesis Outline

2 Background
  2.1 Automatic Speech Recognition (ASR)
    2.1.1 History
    2.1.2 Recognition Categories
    2.1.3 Digital Signal Processing
    2.1.4 Hidden Markov Models
  2.2 Suprasegmental Properties
    2.2.1 Stress
    2.2.2 Rhythm
    2.2.3 Perceptual Studies of Features for Stress
    2.2.4 Vowel Quality
  2.3 Machine Learning
    2.3.1 Definitions
    2.3.2 Terminology
    2.3.3 Learning Paradigms
  2.4 Related Work
    2.4.1 Speech Recognition
    2.4.2 Stress Detection
    2.4.3 Prosodic Error Identification
  2.5 Chapter Summary

3 Speech Recognition Stage
  3.1 Overview
    3.1.1 The Training Element
    3.1.2 The Performance Element
  3.2 HMM Design
    3.2.1 Architecture of the HMMs
    3.2.2 Stochastic Models: Mixture-of-Gaussians
  3.3 Speech Encoding
    3.3.1 Window Size and Frame Period
    3.3.2 MFCC Feature Extraction and Selection
  3.4 Parameter Identification
    3.4.1 Data Set
    3.4.2 Design of Case Combinations
    3.4.3 Training and Testing Process
    3.4.4 Experiment Configuration
    3.4.5 Performance Evaluation
  3.5 Results and Discussion
    3.5.1 Best Parameter Combinations
    3.5.2 Recognition Accuracy
  3.6 Chapter Summary

4 Stress Detection Stage
  4.1 Overview
    4.1.1 The Training Element
    4.1.2 The Performance Element
  4.2 Feature Extraction and Normalisation
    4.2.1 Duration Features
    4.2.2 Amplitude Features
    4.2.3 Pitch Features
    4.2.4 Vowel Quality Features
  4.3 Classifier Learning and Feature Evaluation
    4.3.1 Data Set
    4.3.2 Performance Evaluation
    4.3.3 Experiment Design
  4.4 Results and Discussion
    4.4.1 Experiment 1: Prosodic Features
    4.4.2 Experiment 2: Vowel Quality Features
    4.4.3 Experiment 3: All Features
  4.5 Chapter Summary

5 Error Identification Stage
  5.1 Overview
  5.2 Vowel Sequence Alignment
    5.2.1 Fundamentals of the Alignment Algorithm
    5.2.2 Additional Issue
    5.2.3 Two-layer Alignment Algorithm
  5.3 Stress Error Identification
  5.4 Rhythm Error Identification
    5.4.1 VOP Comparison
    5.4.2 Foot Comparison
  5.5 Visualisation Tools
    5.5.1 Stress Error Visualisation
    5.5.2 Rhythm Error Visualisation
  5.6 Chapter Summary

6 Conclusions
  6.1 Conclusions
    6.1.1 Speech Recognition
    6.1.2 Stress Detection
    6.1.3 Error Identification
  6.2 Future Work
    6.2.1 Improving the Forced Alignment System
    6.2.2 Constructing a New Phoneme HMM Architecture
    6.2.3 Adding a New Breathing HMM
    6.2.4 Others

Bibliography

A IPA Symbols and NZSED Labels

List of Tables

3.1 Nine feature combinations and the number of features in each set.
3.2 Best and worst parameter choices by boundary timing accuracy.
3.3 Best and worst parameter choices by boundary timing accuracy of vowels only.
3.4 Summary of the mixture-of-Gaussians for the top 50 cases.
3.5 Recognition errors for 10-12.5-MFCC-O-D-A-4.
4.1 Results for prosodic features.
4.2 Results for vowel quality features.
4.3 Results for prosodic and vowel quality features.

List of Figures

1.1 Outline of the ICAI system.
1.2 Elements and procedures with and in Span.
1.3 Overview of Span.
2.1 Mel-scale filter bank (adapted from [85]).
2.2 A 3D model of the vowel space (adapted from [42]).
2.3 A simple 2D vowel space (adapted from [43]).
2.4 Vowels in F1/F2 space (from [78]).
3.1 Overview of Span: recognising speech.
3.2 The training element.
3.3 The performance element.
3.4 A chained connection mode HMM (adapted from [85]).
3.5 HMMs for silences and short pauses (adapted from [85]).
3.6 Speech encoding process (adapted from [85]).
3.7 Experiment configuration.
4.1 Overview of Span: detecting stress.
4.2 The training element.
4.3 The performance element.
4.4 Vowel quality features processing.
5.1 Overview of Span: identifying stress and rhythm errors.
5.2 Overview of the error identifier.
5.3 Alignment algorithm: initialisation.
5.4 Alignment algorithm: scoring.
5.5 Alignment algorithm: traceback.
5.6 Two-layer vowel sequence alignment.
5.7 Rhythm pattern comparison – VOP.
5.8 Stress pattern visualisation.
5.9 Timing feedback – target vowel to vowel.
5.10 Timing feedback – target stressed to stressed.
5.11 Timing feedback – exactly matched stress pattern.
6.1 Auto-labelling errors of the forced alignment system.
6.2 A new three-state HMM.
6.3 Auto-labelling errors with a breath and an inserted phoneme.

Chapter 1

Introduction

This thesis is about constructing an Intelligent Computer Aided Instruction (ICAI) system for Teaching English to Speakers of Other Languages (TESOL).

The ICAI system first analyses an English as a Second Language (ESL) learner's speech in order to identify errors in the speech. It then provides useful personalised feedback to the learner based on the errors.

This thesis mainly studies how to recognise an ESL learner's speech, how to detect the stress pattern in the speech, and how to automatically identify the stress and rhythm errors in the speech.

1.1 Motivation

Today, English is the most widely used language in the world. Many people from non-native English speaking countries learn English in order to communicate effectively with the world. Speaking is one important aspect of communication skills. Good speech involves many issues, such as vocabulary, grammar, and pronunciation. Among these issues, pronunciation, especially the prosodic aspects of spontaneous speech, remains the most serious problem.

Learning spoken English well requires lots of practice. However, it is really hard for ESL speakers to compare their speech with native speakers' speech and identify prosodic problems by themselves, given the large differences in sound features during practice. A great deal of personalised feedback that identifies and corrects prosodic errors in ESL speakers' speech must be provided in order to make learning more effective and productive. Providing personalised feedback from human teachers is very expensive, and the shortage of ESL teachers is growing. As more and more ESL students come to New Zealand, there is an increasing demand for computer software that can provide useful personalised feedback to ESL speakers on prosodic aspects of their speech, to supplement the shortage of ESL teachers and reduce the cost of learning.

1.2 ICAI Outline

An outline of the ICAI system, which consists of two sub-systems — a Pedagogic Component (Peco) and a Speech Analyser (Span) — is shown in Figure 1.1.

Figure 1.1: Outline of the ICAI system.

First of all, Peco displays a sentence on the screen for a user to read aloud. The recorded user's sound is then fed into Span. Figure 1.2 briefly outlines the elements and procedures involved with and in Span; more details will be covered in later chapters. Inside Span, a speech recognition process is applied to the sound and annotates/labels the sound at a phoneme level. A sequence of vowel segments is then extracted from the phoneme-labelled sound and is fed into a stress detection process to generate a stress pattern and a rhythm pattern. An error identification process is applied to the stress pattern and the rhythm pattern in order to ascertain stress and rhythm errors. Finally, the identified errors are fed back to Peco, which provides personalised feedback to the user.

Figure 1.2: Elements and procedures with and in Span. [The figure traces the example text "... come along on the barge ...": the sound is labelled at the phoneme level (k V m @ l O N @ n D @ b a: dZ); the extracted vowel sequence (V @ O @ @ a:) is classified as stressed or unstressed; stressed vowels are paired with their start and end time stamps; and example feedback is produced, such as "You stressed /@/ in the word 'on' that should be unstressed" and "You unstressed /a:/ in the word 'barge' that should be stressed."]


1.3 Span Overview

This section provides a detailed overview of Span, as illustrated in Figure 1.3.

Figure 1.3: Overview of Span.

There are three inputs to Span: text, sound, and target patterns. The text is the sequence of words of a sentence that a user reads aloud. The sound is uttered by the user and is encoded into the standard Pulse Code Modulation (PCM) format [12]. These two inputs are used immediately in Span. The target patterns include a stress pattern and a rhythm pattern in a native English speaker's speech. The stress pattern consists of a sequence of vowels and their stress statuses. The rhythm pattern consists of a sequence of stressed vowels and their timing information, including the start and end time stamps of each vowel (see Figure 1.2). This input is used later in Span.

There is one output from Span — a list of identified stress or rhythm errors. The identified errors are descriptions of misplaced stressed vowels or unmatched time intervals in the speech. This output is conveyed to Peco, which uses the reported information to provide useful individualised feedback to the user.

Inside Span there are three key components. The first component is a speech recogniser, which takes the text and the sound, performs phoneme-level forced alignment using a set of pre-trained phoneme HMMs [85], and produces a phoneme-labelled sound. The second component is a stress detector, which analyses the phoneme-labelled sound, focusing on vowel segments, to identify which vowel phonemes would be perceived as stressed by using a pre-trained classifier, and produces a stress pattern of the speech. The third component is an error identifier, which examines the generated stress pattern to identify stress errors by comparing it with a target stress pattern. It then further identifies rhythm errors if no serious errors are found in the user's stress pattern.
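To make the data flow between these three components concrete, here is a minimal sketch of one possible set of interfaces, written in Python; the names, types, and the one-to-one stress comparison are illustrative assumptions, not the thesis's actual implementation (which is built on HTK, trained classifiers, and the alignment algorithm of Chapter 5).

```python
# Illustrative sketch of the Span pipeline (hypothetical names and types).
from dataclasses import dataclass
from typing import List

@dataclass
class Vowel:
    phoneme: str      # e.g. "V", "@", "a:"
    word: str         # the word the vowel belongs to
    start: float      # start time in seconds
    end: float        # end time in seconds
    stressed: bool

def speech_recogniser(text: str, sound: bytes) -> List[Vowel]:
    """Forced alignment plus vowel extraction (HTK-based in the thesis)."""
    raise NotImplementedError

def stress_detector(vowels: List[Vowel]) -> List[Vowel]:
    """Fill in the stressed flag using a pre-trained classifier."""
    raise NotImplementedError

def error_identifier(detected: List[Vowel], target: List[Vowel]) -> List[str]:
    """Report stress errors, assuming the two vowel sequences are already
    aligned one-to-one (the general case needs the alignment of Chapter 5)."""
    errors = []
    for d, t in zip(detected, target):
        if d.stressed and not t.stressed:
            errors.append(f'You stressed /{d.phoneme}/ in the word "{d.word}" '
                          'that should be unstressed.')
        elif t.stressed and not d.stressed:
            errors.append(f'You unstressed /{d.phoneme}/ in the word "{d.word}" '
                          'that should be stressed.')
    return errors
```

With these interfaces, the whole analyser reduces to error_identifier(stress_detector(speech_recogniser(text, sound)), target).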

1.4 Issues Addressed

1.4.1 Constructing the Speech Recogniser

In order to analyse an ESL learner's speech, we first need to make our system able to recognise the speech. We use a toolkit named the Hidden Markov Model Toolkit (HTK) [85] to build our speech recogniser. HTK is a statistical-based speech recognition system. Speech is encoded using a frame-based digital signal processing method [85] and is modelled statistically using HMMs.

There are many parameters used to configure the HMMs and the speech encoding process, and it is important to set them to optimal values.


HTK provides default settings for a standard speech recogniser that takes the sound waveform as the input and produces the word transcription as the output. However, this is not the kind of speech recogniser we need, because we already know the exact sentence a speaker is trying to read aloud. The task of our speech recogniser is to identify the phonemic elements in a given ESL learner's utterance so that later we can determine the stress pattern of the utterance and identify prosodic errors in the learner's speech. These later tasks — stress detection and error identification — require accurate start and end time information for each phoneme segment in an ESL learner's speech. Since the default settings for configuring phoneme HMMs and speech encoding provided by HTK may not be appropriate for our task, we must determine optimal values of the parameters in order to construct our speech recogniser so that it produces phoneme-level labelled sound with start and end time stamps that are as accurate as possible.

1.4.2 Building the Stress Detector

In order to identify the prosodic errors in an ESL learner's speech, we must first identify the stress pattern of the speech. There are two types of stress in English — lexical stress and rhythmic stress [45]. We must determine which kind of stress pattern we should detect and analyse.

There are several prosodic features relevant to stress, including duration, amplitude, and pitch (also known as fundamental frequency). These features are easily extracted, but the issue is how we should normalise them in order to reduce the effects of variation due to differences such as those between speakers, recording situations, and utterance contexts. Vowel quality is known to be another important feature for identifying the stress status of a vowel [18]. The measurement of vowel quality is not trivial since there is no standard way to calculate it, so another issue is how we can measure vowel quality properly. Furthermore, among these features, we need to identify those that are most useful for classifying the stress status of a vowel as stressed or unstressed.

It is sensible to use machine learning technology to construct a classifier for detecting the stress status of a vowel, since the stress detection task can be considered a binary classification problem. However, there are many machine learning algorithms, and we need to find one that does a good job on this task.

1.4.3 Creating the Error Identifier

Once we have completed the speech recognition and stress detection procedures, we will have a stress pattern and a derived rhythm pattern of the ESL learner's speech. We need to compare the generated stress and rhythm patterns with the given target patterns. However, this comparison is not trivial. To compare either the stress status or the rhythm property of vowels in the ESL learner's speech with the target speech, we need to ensure the vowel pairs are properly aligned. Due to many variations and errors in an ESL learner's speech, such as adding extra phonemes, mispronouncing phonemes, or replacing phonemes with others, the vowel phoneme sequence in the ESL learner's speech is unlikely to be the same as the target. In such a situation, the issue is how we can make the system overcome ambiguities and difficulties to properly align the vowel sequences.

Once we have aligned the vowel sequences, the stress and rhythm differences can be explored. However, we need to determine what kinds of differences represent errors that need to be fixed by users. In order to produce the most useful personalised feedback to users, it is important to ignore consequential or minor errors and report only critical errors to Peco.


1.4.4 Summary

The main issues covered by this thesis can be summarised as follows:

• how to configure phoneme HMMs for constructing our speech recogniser;

• how to configure the speech encoding process for constructing our speech recogniser;

• which type of stress we should work on;

• how to normalise prosodic features for detecting stress;

• how to calculate vowel quality features for detecting stress;

• which features are the most useful for detecting stress;

• which machine learning algorithm we should use to build the stress detector;

• how to overcome ambiguities and difficulties to properly align vowel sequences;

• how to determine the critical stress and rhythm errors in ESL learners' speech.

1.5 Contributions

This thesis has the following major contributions:

1. This thesis shows how to choose a set of parameters for constructing a forced alignment speech recogniser. An exhaustive experiment that covers 3,555 combinations of parameters provides empirical evidence on choosing optimal parameter values. A client/server computing system was developed to make this experiment feasible.

Part of this work was published in [83].


2. This thesis presents a novel method for calculating and measuring vowel quality features. Our vowel quality features are shown to achieve performance comparable to other prosodic features in automatic rhythmic stress detection.

Part of this work was published in [82].

3. This thesis shows how to automatically identify prosodic errors, in particular stress and rhythm errors, in ESL learners' speech. To our knowledge, very little research has been done on automatic stress and rhythm error detection. Our study provides a useful example for further research in this field.

1.6 Thesis Outline

The rest of this thesis is structured as follows:

Chapter 2 starts by reviewing Automatic Speech Recognition (ASR), including techniques in the digital signal processing area, then describes suprasegmental properties of speech, including explanations of stress and rhythm. Discussions of features believed to be relevant to the perception of stress and vowel quality are also presented. It also surveys work closely related to this thesis.

Chapter 3 describes how to recognise a user's speech, which involves designing phoneme HMMs and configuring speech encoding processes, and presents an exhaustive search experiment and its results.

Chapter 4 describes how to classify vowel segments as stressed or unstressed. It also introduces a novel method for measuring vowel quality features for classifying stress. Two machine learning algorithms for constructing the stress classifier are also discussed.

Chapter 5 describes the automatic stress and rhythm error identification approach. It also presents visualisation tools for helping users understand their prosodic problems.


Chapter 6 summarises the achievements and presents possible future research work.

Note that International Phonetic Alphabet (IPA) symbols are used throughout the text of this thesis to represent phonemes, and the New Zealand Spoken English Database (NZSED)¹ phonemic labels are used in the figures. A list of IPA symbols, NZSED labels, and example words is given in Appendix A.

¹ The NZSED is a database created by the School of Linguistics and Applied Language Studies at Victoria University of Wellington. It contains a representative sample of spoken New Zealand English from the 1990s. For more information, see http://www.vuw.ac.nz/lals/nzsed/.

Chapter 2

Background

This chapter reviews aspects of automatic speech recognition, suprasegmental properties of speech, and machine learning technology that are relevant to the goal of this thesis.

2.1 Automatic Speech Recognition (ASR)

“Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.” [34]

2.1.1 History

ASR has been investigated for many years. In the 1870s, Alexander Graham Bell worked on a machine that was expected to be able to transcribe spoken words into written text, but failed. In 1952, at Bell Laboratories, Davis, Biddulph and Balashek developed an ASR system that could recognise the digits 0 to 9 [21]; its accuracy ranged from 50% to 100%. In 1959, at MIT Lincoln Laboratories, Forgie and Forgie's ASR system could recognise 10 vowels in a small class of mono-syllabic words [27], with an accuracy of 93%. Even more impressive, the system was speaker- and sex-independent. In the early 1970s, the HMM approach was developed by Lenny Baum of Princeton University and shared with contractors of the Advanced Research Projects Agency of the United States Department of Defense (ARPA) [19]. In the 1970s, researchers around the world made several significant contributions to the speech recognition area [37, 63, 67, 74]. After that, researchers gradually moved towards end-user products, signalled by the foundation of companies such as Dragon Systems in 1982 and SpeechWorks in 1984 [19].

Today, ASR has left academic labs and entered the homes of end-users. There is now commercially available software for personal computers, such as Dragon Systems' Dragon NaturallySpeaking, Lernout and Hauspie's Voice Xpress, and IBM's ViaVoice® [19]. These systems can perform continuous dictation with large vocabularies and can process context-sensitive spoken commands. These systems also learn on the fly – if you correct the system when it mis-recognises a word, it will "learn". In this way, the training never actually ends, and the more you use and correct the system, the better it becomes at recognising your voice. Even though ASR has made such significant progress, it still requires further research and development to recognise continuous natural speech from multiple speakers [65].

2.1.2 Recognition Categories

Speech recognition techniques can be divided into many categories using different criteria. This section briefly discusses recognition categories based on input and output differences.

In terms of output, speech recognition systems can be divided into word level recognition and sub-word level recognition, including syllable level recognition and phoneme level recognition [85].

In terms of input and the focus of the speech recogniser, recognition can be considered to be of two kinds — full recognition and forced alignment.

With “full recognition”, the recogniser has only the digitised sound signal as input, and the output is the transcription of words, syllables, or phonemes. Full recognition is usually used when a system does not know what is being said and wants to find out, in terms of a given set of words, syllables, or phonemes; the timing information of each recognised unit is not really a concern [64]. If large vocabularies are used, then one way of obtaining good performance is to use language models or artificial grammars to restrict the combination of words and reduce the search space [85].

With “forced alignment”, the recogniser has at least two inputs: one is the digitised sound signal; the other is the text of the sentence that the speaker uttered [64, 68, 85]. The text can be a sequence of words, syllables, or phonemes. Forced alignment is usually used to segment the sound signal and annotate it with the elements in the text. In contrast to full recognition, the exact sequence of spoken words is known; the goal of the recogniser is to identify the start and end time points of each segment unit [64, 68], so researchers can use the timing information for other purposes [25, 58, 71].

If a system focuses on getting high accuracy on the timing information of individual segments, then forced alignment is generally a good choice, provided the content of the sentences is given.
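As an illustration of the timing information a forced aligner produces, the sketch below parses phoneme labels in an HTK-style label file, where each line is assumed to carry a start time, an end time (conventionally in units of 100 nanoseconds), and a phoneme name; the exact file format details are assumptions for illustration rather than a description of our system's output.

```python
# Minimal sketch: parse an HTK-style phoneme label file (assumed format:
# "<start> <end> <phoneme>" per line, times in 100 ns units).
from typing import List, Tuple

def parse_label_file(path: str) -> List[Tuple[str, float, float]]:
    """Return (phoneme, start_seconds, end_seconds) for each labelled segment."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            start, end, phone = int(parts[0]), int(parts[1]), parts[2]
            segments.append((phone, start * 1e-7, end * 1e-7))  # 100 ns -> s
    return segments

# Example (hypothetical file): a vowel timing such as the one in Figure 1.2,
# "1566000 1629000 V", would become ("V", 0.1566, 0.1629).
```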

2.1.3 Digital Signal Processing

According to [48], although in theory the raw digitised sound waveform may be used directly in speech recognition, the unlimited variability of the sound signal makes this infeasible. Thus researchers tend to use a limited set of high-level extracted features to represent the raw digitised sound signal. Feature vectors are typically extracted every 10–20 ms (the frame period) using a window size of 20–30 ms [48], so a raw digitised sound signal can be characterised by a sequence of feature vectors. The most frequently used feature extraction techniques are: linear predictive coding (LPC) [49], Mel-frequency cepstral coefficients (MFCC) [22], and perceptual linear prediction (PLP) [32].

LPC

LPC starts with the assumption that the speech signal is produced by a buzzer at the end of a tube. The glottis produces the buzz, which is characterised by its intensity and frequency. The vocal tract forms the tube, which is characterised by its resonances, called formants. LPC performs a three-step speech signal analysis: estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue [33].

LPC is one of the most powerful techniques used to represent a speech signal. Usually it is implemented as an all-pole model to capture the vocal tract properties, and it is the commonly used method for encoding clean speech at a low bit rate. However, its performance degrades if the speech signal contains distortions [76]. Another drawback is that amplitude seems to be a very important feature for differentiating nasalised consonants from voiced vowels, but it is not included in the standard LPC formant analysis procedure [69].
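To make the three analysis steps concrete, here is a minimal numerical sketch of LPC analysis under the all-pole model: the coefficients are estimated from the autocorrelation of a windowed frame via the Levinson-Durbin recursion, and inverse filtering the frame with those coefficients yields the residue. This is a generic textbook formulation with an assumed order of 12, not the specific LPC configuration used in any of the systems cited later.

```python
import numpy as np

def lpc(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """Levinson-Durbin recursion: all-pole coefficients a[0..order], a[0] = 1."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])   # prediction of r[i]
        k = -acc / err                                # reflection coefficient
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def lpc_residue(frame: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Inverse filtering: pass the frame through A(z) to obtain the residue."""
    return np.convolve(frame, a)[:len(frame)]

# Usage on one Hamming-windowed frame of samples (25 ms at 16 kHz assumed):
# frame = samples[start:start + 400] * np.hamming(400)
# a = lpc(frame); residue = lpc_residue(frame, a)
```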

MFCC

MFCC is based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies capture the phonetically important characteristics of speech. This is expressed in the Mel-frequency scale, which uses linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

Figure 2.1 illustrates the Mel-scale filter bank.

Figure 2.1: Mel-scale filter bank (adapted from [85]).

MFCC is one of the standard representations used by HTK [85], a toolkit involved in our system. A conventional method for extracting the Mel-frequency cepstral features includes the following steps:

• Window the data with a Hamming window

• Take the fast Fourier transform

• Find the magnitudes of the output of the fast Fourier transform

• Convert the data into filter bank outputs

• Calculate the decimal logarithm

• Find the discrete cosine transform

Note that the purpose of performing the cosine transform at the last step is to decorrelate the set of log energies into a set of uncorrelated cepstral coefficients [35]. Thus only about half of the cepstral coefficients — typically 12, from the 1st to the 12th [85] — are kept for use in the subsequent speech recognition process. The 0th cepstral coefficient describes the shape of the log spectrum independent of its overall level, so it can be used to estimate the energy over a short period. The 1st cepstral coefficient represents the balance between the upper and lower halves of the spectrum. The higher order coefficients are related to increasingly finer features in the spectrum [15].

Compared with the large number of Mel-scale bands, the smaller set of cepstral coefficients makes it easier to compute reasonably accurate probability estimates in a subsequent statistical recognition process.
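The listed steps map directly onto a few lines of array code. The following sketch applies them to a single frame; the sampling rate, number of filters, and number of retained coefficients are illustrative assumptions, not the settings investigated in Chapter 3.

```python
import numpy as np

def mel(f):        # Hz -> Mel (common 2595/700 formulation)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):    # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=12):
    n = len(frame)
    # 1. Window the data with a Hamming window
    windowed = frame * np.hamming(n)
    # 2-3. Fast Fourier transform and its magnitudes
    mag = np.abs(np.fft.rfft(windowed))
    # 4. Triangular Mel-spaced filter bank outputs
    edges = mel_inv(np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(mag)))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    energies = fbank @ mag
    # 5. Decimal (base-10) logarithm of each band energy
    log_e = np.log10(np.maximum(energies, 1e-10))
    # 6. Discrete cosine transform; keep coefficients 1..n_ceps
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_filters), k + 0.5) / n_filters)
    return (dct @ log_e)[1:n_ceps + 1]

# e.g. mfcc_frame(samples[0:400]) for one 25 ms frame at 16 kHz.
```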


PLP

The PLP technique may be thought of as a combination of LPC and MFCC. Its features smooth an auditory spectrum by fitting an auto-regressive model. The auto-regressive model can be an all-pole model, which is the model mostly used in LPC, or a linear-prediction model. The features are computed based on the Bark scale [86], which is similar to the Mel scale in modelling the human ear.

According to [24], “PLP has the nice property of modeling spectral peaks more carefully than the less-reliable valleys in between.” It generally outperforms LPC [73] but not MFCC [24].

2.1.4 Hidden Markov Models

There are several kinds of approaches used in ASR, including the template-based approach, the knowledge-based approach, the statistical-based approach, and the connectionist-based approach [65, 70]. We will use one example of the statistical-based approach — Hidden Markov Models — in this research. This section briefly introduces the HMM technique.

From the statistical point of view, the speech recognition problem is the problem of computing the probability that a vocabulary word is spoken within an utterance. This probability is not directly computable, but it can be computed indirectly once the likelihood of each HMM generating the sound of that word has been worked out [85].

The success of HMM-based speech recognition is built on the assumption that the sequence of observed speech feature vectors corresponding to each word is generated by a Markov model. A Markov model can be viewed as a finite state machine. When a state is entered, a speech feature vector is generated from a probability density. When all speech feature vectors have been generated, the likelihood of the model generating the feature vectors can be calculated. The model is called a Hidden Markov Model because we know the sequence of speech feature vectors but do not know the underlying sequence of states [85].

In order to obtain good speech recognition performance, one of the keys is to get a well trained model for each recognised unit, such as a word, syllable, or phoneme. Providing a sufficient amount of training data is essential for obtaining well estimated models. However, we think that the HMM design itself and an accurate speech signal encoding process are more important. We hypothesise that:

Hypothesis 2.1.1 Without careful consideration of the parameters involved in the HMM design and the speech encoding process, the speech recognition performance would be low.
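To illustrate how the likelihood of a model generating a sequence of feature vectors can be calculated, here is a minimal sketch of the standard forward algorithm for a discrete-observation HMM with made-up parameters; real recognisers such as HTK use continuous mixture-of-Gaussians output densities and work with log probabilities, so this is only an illustration of the principle.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(observation sequence | HMM) via the forward algorithm.

    obs : sequence of observation symbol indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]                 # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination

# Toy 3-state left-to-right model with 2 observation symbols (made-up numbers).
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
print(forward_likelihood([0, 1, 1, 0], pi, A, B))
```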

2.2 Suprasegmental Properties

Suprasegmental properties of speech are properties that cannot be derived directly from the underlying sequence of phonemes. Researchers divide suprasegmental properties into several categories, and the most commonly used terms for them are stress, rhythm, tone and intonation [4, 11, 17, 18, 30, 43, 50].

Suprasegmental properties are very important for the comprehensibility of speech. ESL speakers with excellent vocabulary, grammar, and individual phoneme pronunciation may still be very hard to understand if the stress, rhythm, tone, or intonation patterns of their speech are wrong. In this thesis, we only focus on two suprasegmental properties — stress and rhythm.

2.2.1 Stress

Stress is a form of prominence in spoken language. Usually, stress is seen as a property of a syllable or of the vowel nucleus of that syllable. There are two types of stress in English [45]. Lexical stress refers to the relative prominences of syllables in individual words. Rhythmic stress refers to the relative prominences of syllables in longer stretches of speech than isolated words. When words are used in utterances, their lexical stress may be altered to reflect the rhythmic (as well as semantic) structure of the utterance.

Lexical stress is usually used in the discrete speech environment and can be thought of as a syllable's potential to receive prominence. In English, there are many phonetically similar noun-verb word pairs with different lexical stress patterns. For instance, when the word "project" is used as a noun, the lexical stress is placed on the first syllable, but when it is used as a verb, the lexical stress is placed on the second syllable. Incorrect lexical stress placements can often cause grammatical problems, such as a listener expecting a noun but hearing a verb.

Rhythmic stress is usually used in the continuous speech environment and can be thought of as the actual degree of prominence observed when a syllable is uttered as part of a sentence.

Lexical stress and rhythmic stress often coincide, but due to the occurrence of stress shifting, they may differ in continuous speech. For example, in the sentence "The total number of people is thirteen", the lexically and rhythmically stressed syllables of the word "thirteen" are the same — the second syllable. But in the sentence "There are thirteen people in this room", the rhythmically stressed syllable of the word "thirteen" is often the first syllable, which is not the same as the lexical stress.

2.2.2 Rhythm

Speech rhythm is defined as the sequence of the durations of consecutive vowels, consonants, and pauses in speech [14]. As introduced in [1, 60], speech rhythm comes in two types: stress-timed rhythm, where the stresses recur at approximately equal time intervals, and syllable-timed rhythm, where the syllables recur at regular intervals. A widely held theory is that speech rhythm in English is stress-timed [1, 6, 17, 18, 20, 43, 56, 61]: that is, the intervals between pairs of stressed syllables in an English sentence are approximately equal. Although generally accepted, the literature [10, 18, 20, 43, 66] notes that this theory has not yet been experimentally verified.
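As a concrete illustration of the stress-timing idea (not a method used in this thesis, whose rhythm comparison is described in Chapter 5), the following sketch measures the intervals between consecutive stressed-vowel onsets in a labelled utterance and reports how evenly they are spaced; the timings are made up.

```python
import numpy as np

def stress_intervals(vowels):
    """vowels: list of (phoneme, start_sec, end_sec, stressed) tuples.

    Returns the intervals (seconds) between onsets of consecutive stressed
    vowels, plus a simple evenness measure (coefficient of variation)."""
    onsets = [start for _, start, _, stressed in vowels if stressed]
    intervals = np.diff(onsets)
    cv = intervals.std() / intervals.mean() if len(intervals) > 1 else 0.0
    return intervals, cv

# Toy example: perfectly stress-timed speech would give identical intervals
# and a coefficient of variation near zero.
vowels = [("V", 0.15, 0.21, True), ("@", 0.25, 0.29, False),
          ("O", 0.58, 0.66, True), ("@", 0.70, 0.74, False),
          ("a:", 1.02, 1.15, True)]
print(stress_intervals(vowels))
```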

2.2.3 Perceptual Studies of Features for Stress

The perception of a syllable as stressed or unstressed depends on its relative duration, its amplitude or energy, its pitch, and its quality, especially the vowel quality [18]. Many researchers have tried to determine which feature has the most significant effect in signaling stress in speech. However, the current results in the literature are inconclusive.

Fry [29] investigated the effects on stress perception of changing syllable pitch, energy and duration. The data set consisted of noun-verb bi-syllabic words from synthetic utterances. Fry concluded that for lexical stress perception the most important cue was pitch.

Lieberman [46] examined the relationship between perceived syllable stress and acoustic correlates, including peak envelope amplitude, syllable duration, and pitch. The data set consisted of noun-verb bi-syllabic words extracted from natural speech. Lieberman indicated that envelope amplitude and pitch were the most relevant unidimensional cues to stressed syllables, and that amplitude was more important than pitch.

Adams and Munro [2] explored the use of pitch, amplitude, and duration for syllable stress perception in longer utterances. The pitch contour of the syllables, the amplitude envelope of the syllables, and the duration of the whole syllable were extracted from longer utterances. They claimed that a greater degree of pitch change, a greater fall of amplitude from a fairly constant peak level, and longer duration were linked to syllable stress perception. They suggested that syllable duration is the most frequently used cue and syllable amplitude the least used.


2.2.4 Vowel Quality

As mentioned at the beginning of the previous sub-section, a further correlate of stress is the quality of the vowel in a syllable. According to the traditional theory, vowel quality is determined by the configuration of the tongue, jaw, and lips [7, 42, 44, 59], so there are three main parameters describing vowel quality. These three parameters form a three-dimensional vowel space: the position of the highest point of the tongue in the close-open dimension and in the front-back dimension, and the degree of lip rounding in the rounded-spread dimension. Therefore, vowel quality can be represented in an X-Y-Z form, where X refers to the horizontal axis of the tongue position and includes front, central and back; Y refers to the vertical axis of the tongue position and includes high/close, low/open, and mid (which can also be divided into half-close and half-open); and Z refers to the degree of lip rounding and includes rounded and unrounded. A three-dimensional vowel space is shown in Figure 2.2. However, these three parameters cannot be measured from the sound, so other correlates that can be calculated from the sound are needed to describe vowel quality.

Figure 2.2: A 3D model of the vowel space (adapted from [42]).


Ladefoged [43] showed that the height of the tongue correlates with the frequency of the first formant (F1) and the backness of the tongue correlates with the difference between the frequencies of F1 and F2 (the second formant), but that there is no auditory property correlating with the rounding of the lips. Ladefoged [43] also noted that height and backness are the most commonly used features to distinguish one vowel from another in nearly every language, including English. Therefore, as height and backness correlate with formants and formants can be easily calculated directly from the sound signal, the three-dimensional vowel space can be reduced to a two-dimensional vowel space. A simple two-dimensional vowel space is shown in Figure 2.3.

Figure 2.3: A simple 2D vowel space (adapted from [43]).

However, we realised that the first two formants are not sufficient to identify the quality of different vowels, due to flexibility in the formation of a vowel and variation between speakers. Figure 2.4 illustrates the vowels (excluding diphthongs) in F1/F2 space for female New Zealand (NZ) native speakers (left) and male NZ native speakers (right). The centroid of each vowel class is marked by the lexical item, and the perimeter of each ellipse indicates the boundary of the area in which at least 95% of the data for a particular class fall. From Figure 2.4, we can see that overlaps among these vowels cause problems for identifying vowel qualities in multiple speakers' speech using F1 and F2.

Figure 2.4: Vowels in F1/F2 space (from [78]).

Cruttenden [18] further reduced the dimensions describing vowel quality to one dimension. He introduced the notions of full and reduced vowels to describe vowel quality for detecting stressed syllables. However, there is no standard method for determining whether a pronounced vowel is full or reduced. The terms "full vowel" and "reduced vowel" abstractly describe the quality of a vowel. In English phonology, "reduced" means, among other things, central in articulatory vowel space, short, and unstressed [18, 43]. In contrast to "reduced", a "full" vowel is more peripheral in the vowel space. If the quality of a vowel is considered to be "reduced", then the vowel is always unstressed. However, even if a vowel is considered to be "full", the vowel may occur in both stressed and unstressed syllables [43].

We are interested in using the one-dimensional vowel space to detect vowel stress status. Based on the above discussion, we hypothesise and will show that:


Hypothesis 2.2.1 For detecting stress status, knowing that a vowel is reduced is more reliable than knowing that it is full.
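As a purely illustrative sketch of the vowel-space ideas above (this thesis's own vowel quality features are defined in Chapter 4), the code below places a vowel token in the two-dimensional (F1, F2-F1) space of Figure 2.3 and uses its normalised distance from a speaker-specific centre of that space as a crude one-dimensional "centralisation" score, on the reasoning that reduced vowels tend to lie near the centre; all the numbers are made up.

```python
import numpy as np

def vowel_space_point(f1_hz, f2_hz):
    """Map formants to the 2D space of Figure 2.3: (height ~ F1, backness ~ F2-F1)."""
    return np.array([f1_hz, f2_hz - f1_hz])

def centralisation(f1_hz, f2_hz, centre, scale):
    """Crude 1D 'reduced-ness' score: normalised distance from the speaker's
    vowel-space centre (near 0 = central/reduced-like, larger = more peripheral)."""
    return np.linalg.norm((vowel_space_point(f1_hz, f2_hz) - centre) / scale)

# Made-up speaker-specific statistics and vowel tokens (Hz).
centre = np.array([500.0, 1000.0])   # typical centre of this speaker's vowel space
scale = np.array([250.0, 800.0])     # typical spread along each axis
print(centralisation(480, 1550, centre, scale))   # schwa-like token: small score
print(centralisation(300, 2300, centre, scale))   # /i:/-like token: larger score
```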

2.3 Machine Learning

Machine learning is one of the most active research areas. It has been widely adopted in real-world applications, including speech recognition, handwritten character recognition, image classification, and bioinformatics. This section gives a brief overview of the machine learning technology related to this thesis.

2.3.1 Definitions

Researchers give different definitions of machine learning. However, the principle is roughly the same: a computer program processes a given set of examples and tries either to describe the known data in some meaningful way or to develop an appropriate response to unseen cases. Three representative definitions or descriptions are listed as follows:

Mitchell [52] gives the following definition of machine learning:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Witten and Frank [81] state that:

“... things learn when they change their behavior in a way that makes them perform better in the future ...”

Michalski et al. [51] state that:

“Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.”


2.3.2 Terminology

A data set is a collection of knowledge or examples. A single example in a data set is called an instance. One or more attributes or features represent the aspect(s) of an instance. Each attribute or feature can have either a categorical or a numerical value.

In order to train a computer program and evaluate its performance, the data set is usually split into two subsets: a training data set and a test data set. We sometimes split the data set into three subsets; the third data set is usually called the validation data set. The purpose of using a validation data set is to monitor the training progress and prevent the training from overfitting. When no extra data are available to provide a separate validation data set, the n-fold cross validation method is sometimes used to overcome the overfitting problem [52]. The m available examples are randomly partitioned into n disjoint subsets, each of size m/n. The training and validation processes are then run n times. In each run, a different one of these n subsets is used as the validation data set and all the other subsets are merged and used as the training data set. The averaged validation result is used to evaluate the training performance.
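The n-fold procedure just described is short enough to write out directly. The sketch below partitions the m available examples into n folds and averages a caller-supplied training-and-scoring function over the n runs; the function names are illustrative assumptions.

```python
import numpy as np

def n_fold_cross_validation(X, y, n, train_and_score, seed=0):
    """Average score of train_and_score(X_train, y_train, X_val, y_val)
    over n folds of the m available examples."""
    m = len(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)                 # random partition of the examples
    folds = np.array_split(order, n)           # n disjoint subsets of size ~m/n
    scores = []
    for i in range(n):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```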

2.3.3 Learning Paradigms

According to [38], based on the knowledge provided, there are three main learning paradigms: supervised learning, unsupervised learning, and hybrid learning. Supervised learning is sometimes referred to as learning with a teacher: the knowledge provided to a learning system includes a correct answer for each input instance, and the learning process continues until the learning system produces answers as close as possible to the given correct answers. Unsupervised learning is sometimes referred to as learning without a teacher: instances are grouped into appropriate categories by analysis. A typical problem dealt with by unsupervised learning is clustering [26, 31]. “Hybrid learning combines both supervised and unsupervised learning. Part of the solutions (network weights, architecture, or computer programs) are determined through supervised learning, while the others are obtained through unsupervised learning.” [38]

There are many machine learning models in common use. Here we briefly describe some of them:

• Decision Trees. A decision tree consists of leaves and nodes. A leaf records an answer (often called a class) and a node specifies a test condition to be carried out on a single feature value of an instance, with one branch and sub-tree for each possible result of the test. For a given instance, a decision is made by starting from the root of the tree and moving through the tree as determined by the outcome of the condition test at each node until a leaf is encountered. [62]

• Neural Networks. A neural network is usually constructed from nodes, links, weights, biases, and transfer functions. Nodes are often called neurons. There are a variety of ways to connect neurons; usually multiple layers are needed to organise the neurons. Through network training, the internal weights and biases of the neurons are automatically updated in order to produce the target output. [54]

• Support Vector Machines. In a support vector machine, input vectors are mapped into a very high-dimensional feature space through a non-linear mapping. A linear classification decision surface is then constructed in the high-dimensional feature space. This linear decision surface can take a non-linear form when it is mapped back into the original feature space. Special properties of the decision surface ensure good generalisation ability of this learning machine. [16] (A minimal sketch contrasting a support vector machine with a decision tree follows this list.)

• Genetic Programming. Genetic programming has become more and more important in the machine learning research area since the 1990s. The inputs to a genetic programming learning system are a terminal set and a function set. The outputs of the system are evolved computer programs. A fitness function is used to evaluate individual programs during learning in order to select winners; several evolutionary operations are then applied to these winners to form the next generation, and so on, in order to produce the best individual program for a given problem. [41]
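Since the stress detectors in this thesis are built with decision tree and support vector machine learners, the following sketch shows, purely as an illustration, how the two model families can be trained and compared on a small synthetic numeric data set using a modern library (scikit-learn, which is not the software used in this thesis); the random features stand in for the duration, amplitude, pitch, and vowel quality features of Chapter 4.

```python
# Illustrative comparison of a decision tree and an SVM on synthetic numeric data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-vowel numeric features; the "stressed" label
# depends on a noisy combination of the first two features.
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(int)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross validation
    print(f"{name}: {scores.mean():.3f} accuracy")
```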

2.4 Related Work

2.4.1 Speech Recognition

There is a vast literature on ASR. We briefly review some of the articles that are closely related to HMM-based systems and contain information on the parameters used in HMM design and speech encoding.

Bocchieri et al. [8] constructed a speech recogniser using triphone HMMs. The speech signal was encoded using a 20 ms window shifted by a 10 ms interval. The encoded speech frame feature vector consisted of 12 MFCCs and the frame energy measured in dB, with their first and second derivatives, making 39 components in total. However, there was no clear description of how many states the HMMs had or how these states were linked. Gaussian mixtures [85] were used to define the continuous state observation densities, but the number of Gaussian mixtures was not stated. The recogniser had a 3.6% word error rate.

Rapp [64] constructed a forced alignment system for German. The system used a 25.6 ms window size with a 10 ms interval to calculate sound feature vectors, which consisted of 12 MFCC features and overall energy, and their first and second derivatives. A single-mixture Gaussian output probability density function was used in the phoneme HMMs. The author explored three different phoneme HMM topologies and reported that, using a 3-state left-to-right HMM topology, the percentage of auto-labelled phoneme boundaries within a 20 ms threshold of the manual labelling was 84.4%.


Wightman and Talkin [79] developed an HMM-based forced alignment system with acoustic model training and Viterbi search [85] implemented using HTK. In their system, the frame period was 10 ms and each HMM state contained five Gaussian mixtures. The TIMIT database¹ was used and approximately 80% accuracy within the 20 ms threshold was reached.

¹ One of the Linguistic Data Consortium (LDC) top ten corpora. See http://www.ldc.upenn.edu/Catalog/topten.jsp

Pellom and Hansen [57] developed a gender-dependent forced alignment system using 5-state left-to-right HMMs. For each HMM state, 16 Gaussian mixtures were used. The speech waveform was parameterised every 5 ms by a vector consisting of 12 MFCCs, 12 delta MFCCs and normalised log-frame energy. However, no clear statement about the window size was found. The TIMIT database was used and the phoneme segmentation accuracy was 85.9% within the 20 ms threshold.

Pellom [58] developed a 5-state left-to-right HMM-based forced alignment system with an output distribution of 16 Gaussian mixture densities per state. Using a 25 ms window size with a 5 ms frame interval, 12 MFCC features and the normalised log frame energy, plus their first derivatives, were calculated. The phoneme segmentation accuracy within the 20 ms threshold was 86.2%.

Irino et al. [36] presented a new speech recognition/generation system. The system consisted of STRAIGHT [40], a warped-frequency discrete cosine transform, and an HMM engine. Each HMM had 12 states with 2 Gaussian distributions per state. MFCC features were computed over a 40 ms window with a 2 ms interval. The orders of the MFCC features adopted in the experiments were 12, 20 and 30. No first or second derivatives were used since the authors intended to compare the potential of the MFCC baseline features with others. Their experiments reported that the speaker-dependent word recognition rates were between about 95% and 97%.

Lindgren et al. [47] introduced a novel method for speech recognition. In order to evaluate the new method, the authors constructed an HMM-based speech recogniser using a set of one-state HMMs with 16 Gaussian mixtures and used it in a performance comparison experiment. In the HMM-based baseline speech recogniser, the features used were 12 MFCCs, log energy, and their deltas and delta-deltas. However, no clear statements about the window size and frame period were found. The accuracy for isolated phoneme recognition was 54.86%.

The review above indicates that there are no standard settings for these parameters when constructing a speech recogniser, a forced alignment system, or another similar system. It seems that these parameters need to be adjusted for different purposes, using different speech data sets, in different situations. In order to obtain optimal performance, careful analysis and further investigation of these parameters are essential.

2.4.2 Stress Detection

There have been a number of reports on stress detection. Most reports focused on lexical stress detection based on isolated words, but a few have addressed rhythmic stress detection in complete utterances.

Lieberman [46] used duration, energy and pitch to identify lexical stressin bisyllabic noun-verb stress pairs (e.g. PREsent vs preSENT). These fea-tures were extracted from the hand-labelled syllables. The database con-sisted of isolated words pronounced individually by 16 native speakersand was used for both training and testing. A decision tree approach wasused to build the stress detector and 99.2% accuracy was achieved.

Aull and Zue [5] used duration, pitch, energy, and changes in the spec-tral envelope (a measure of vowel quality) to identify lexical stress in poly-syllabic words. The features were extracted from the sonorant portion ofautomatically labelled syllables. The pitch parameter used was the max-imum value in the syllable. Energy was normalised using the logarithmof the average energy value. The database consisted of isolated words ex-

CHAPTER 2. BACKGROUND 29

tracted from continuous speech pronounced by 11 speakers. A template-based algorithm was used to build the stress detector and 87% accuracywas achieved.

Ferij et al. [28] used pitch, energy and the spectral envelope to identifylexical stress in bi-syllabic words pairs. The first and second derivativesof pitch and the first derivative of energy were also used. The spectralenvelope was represented by four LPC features. These nine features wereextracted from the hand-labelled syllables at 10 ms intervals. The databaseconsisted of isolated words extracted from continuous speech pronouncedby three male speakers. HMMs were used to build the stress classifier andan overall accuracy of 94% was achieved. Vowel quality, especially thedistinction between reduced and full unstressed syllables was suggestedas a direction for future work.

Ying et al. [84] used energy and duration to identify stress in bi-syllabicword pairs. Energy and duration features were extracted from automati-cally labelled syllables and were normalised using several methods. Thedatabase consisted of isolated words extracted from continuous speechpronounced by five speakers. A Bayesian classifier assuming multivari-ate Gaussian distributions was adopted and the highest performance was97.7% accuracy.

A few studies have investigated stress detection in longer utterances.Waibel [75] used amplitude, duration, pitch, and spectral change to iden-tify rhythmically stressed syllables. The features were extracted from auto-matically labelled syllables. Peak-to-peak amplitude was normalised overthe sonorant portion of the syllable. Duration was calculated as the in-terval between the onsets of the nuclei of adjacent syllables. The pitchparameter used was the maximum value of each syllable nucleus. Spec-tral change was normalised over the sonorant portion of the syllable. Thedatabase consisted of 50 sentences read by 10 speakers. A Bayesian classi-fier assuming multivariate Gaussian distributions was adopted and 85.6%accuracy was reached.


Jenkin and Scordilis [39] used duration, energy, amplitude, and pitch to classify vowels into three levels of stress: primary, secondary, and unstressed. The features were extracted from hand-labelled vowel segments. Peak-to-peak amplitude, energy and pitch were normalised over the vowel segments of the syllable. In addition, syllable duration, vowel duration and the maximum pitch in the vowel were used without normalisation. The database consisted of 288 utterances (8 sentences spoken by 12 female and 24 male speakers) from dialect 1 of the TIMIT speech database. Neural networks, Markov chains, and rule-based approaches were adopted. The best overall performances, from 81% to 84%, were obtained using neural networks. Rule-based systems performed more poorly, with scores from 67% to 75%.

Van Kuijk and Boves [72] used duration, energy, and spectral tilt to identify rhythmically stressed vowels in Dutch, a language with stress patterns similar to those of English. The features were extracted from manually checked, automatically labelled vowel segments. Duration was normalised using the average phoneme duration in the utterance, to reduce speaking rate effects. A more complex duration normalisation method introduced in [80] was also adopted. Energy was normalised using several procedures, such as comparing the energy of a vowel to that of its left neighbour and its right neighbour, to the average energy of all vowels to its left, and to the average energy of all vowels in the utterance. Spectral tilt was calculated using spectral energy in various frequency sub-bands. The database consisted of 5000 training utterances and 5000 test utterances from the Dutch POLYPHONE corpus [23]. A simple Bayesian classifier was adopted, on the argument that the features can be jointly modelled by an N-dimensional normal distribution. The best overall performance achieved was 68%.

The summary above shows that stress classification achieves higher accuracy on limited tasks, such as identifying the stressed syllable in isolated bi- or polysyllabic words, but performance levels are noticeably lower in the few studies using longer utterances. Vowel quality was addressed in few studies. A variety of machine learning techniques were used in the studies, but no particular classification procedure appears to be any more successful than the others.

2.4.3 Prosodic Error Identification

We have not yet found any literature relevant to the automatic prosodic error identification issue.

2.5 Chapter Summary

In this chapter, we have briefly reviewed ASR and related technologies, including signal processing and HMMs. We have discussed the stress and rhythm properties of speech and explained vowel quality in terms of stress detection. We have also presented work closely related to our study. From the next chapter, we present our work in each of the three stages of Span.

Chapter 3

Speech Recognition Stage

This chapter discusses the speech recognition stage as highlighted in Figure 3.1. Section 3.1 gives a more detailed overview of the speech recognition stage, including the procedure for training phoneme HMMs. Sections 3.2 and 3.3 discuss issues in phoneme level HMM design and speech encoding. Section 3.4 describes an experiment for exploring sets of parameters for building an effective speech recogniser. Section 3.5 presents the experimental results. Section 3.6 summarises this chapter. Note that part of this chapter is taken from [83].

3.1 Overview

The speech recognition stage is the first stage of Span. It is implemented using the procedures in HTK. The performance element of this stage is an HMM-based speech recogniser. We trained phoneme HMMs from a speech data set, hand labelled at the phoneme level. As introduced in section 1.4.1 (page 5), our speech recogniser is not a standard full speech recogniser as it knows the sentence that a user is trying to say. Rather, it is a forced alignment system. That is, it represents the sentence at the phoneme level and aligns the sentence with the speech signal to identify the start and end times of all the segmental units. Therefore, the input is the text of a sentence and the sound from a user, and the output is the user's sound labelled at the phoneme level.

Figure 3.1: Overview of Span: recognising speech

This stage is very important because the accuracy of the phoneme boundaries is critical to the later stages.

3.1.1 The Training Element

The training procedure of the phoneme level HMMs is illustrated in Figure 3.2.

The input to the training element consists of two sets of related data. One is a sound data set and the other is a phoneme label data set. These data sets are collected from NZSED.


Figure 3.2: The training element.

The sound data set contains 1119 utterances of 200 distinct English sentences produced by six female native speakers. The sounds were recorded at a 16 kHz sampling rate, which allows accurate analysis of all frequencies up to 8 kHz (the Nyquist frequency for this sampling rate). The range up to 8 kHz includes all perceptually relevant information in human speech.


These sound files are then encoded into parameter vectors.

The phoneme label data set contains phoneme files corresponding to the sounds, specifying the name and the start and end times of each phoneme. The labelling was at a coarse level, with 44 different English phonemes and two labels for silences and short pauses. The first half of the 1119 utterances were hand-labelled at the phoneme level by trained linguists; the second half were automatically labelled by a computer program (based on the labelling of the first half) and then checked by a trained linguist.

There are two stages in the training. The first stage is called Viterbi training. It takes phoneme HMM prototypes, which are just descriptions of the structure of phoneme HMMs, and constructs a set of initial phoneme HMMs. The second stage is called Baum-Welch training. By applying several iterations of Baum-Welch training, the initialised HMMs are gradually refined to produce trained phoneme level HMMs. More information about the Viterbi and Baum-Welch training procedures can be found in [85].

3.1.2 The Performance Element

The phoneme level forced alignment performance procedure is illustrated in Figure 3.3.

A user reads a sentence prompted on a computer screen. When the user finishes speaking, the recorded digitised sound and the text of the sentence are fed into the speech recogniser.

The speech recogniser converts the words in the text to a phoneme network. The phoneme network could be just a simple phoneme sequence if the word pronunciation dictionary includes only one pronunciation for each word. The speech recogniser also windows the sound into a sequence of short frames and encodes each sound frame into a parameter vector. It then uses a Viterbi search algorithm to calculate the acoustic likelihood of each encoded sound frame using the pre-trained phoneme HMMs and the phoneme network, and maps the sound signal onto the best phoneme sequence chosen from the phoneme network. The sound labelled at the phoneme level is then output by the speech recogniser.

Figure 3.3: The performance element.

3.2 HMM Design

There are three key parameters required to specify the phoneme HMM prototype in the training element: the number of states needed in each phoneme level HMM, the connections between the states, and the size of the mixture-of-Gaussian models in each state of each HMM. The first two issues belong to the phoneme HMM architecture design. The last one belongs to the phoneme HMM stochastic model design.


3.2.1 Architecture of the HMMs

A phoneme-level HMM is a description of segments of input speech signals that correspond to a particular phoneme. The HMM consists of a network of states where each state describes subsegments of input speech signals that correspond to a different section of a phoneme (for example, the initial component of the phoneme).

With fewer states, an HMM will have a less precise model because each state must describe a larger section of a phoneme. It will also be less accurate at identifying the start and end times of the phoneme. With more states, the HMM may have greater accuracy, but the computational cost will be higher when recognising the speech input. It will also require more training data in order to determine all the values in the state descriptions.

To balance the need for accuracy against the computation time and the size of the training data, we chose to follow standard practice and use a three-state model for each phoneme HMM. The first state describes the transition from the previous phoneme into the current phoneme, the second state describes the middle section of the phoneme, and the third state describes the transition out of the current phoneme to the next phoneme.

The states can be connected in many ways. We chose the commonly used mode of a chained connection [85], where each state is connected to itself (since each state may cover a number of samples from the input signal) and to the next state. This mode does not allow connections that skip over a state or connect back to a previous state. An example of this connection mode is shown in Figure 3.4.

Figure 3.4: A chained connection mode HMM (adapted from [85]).
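To make the chained topology concrete, the sketch below (not taken from the thesis) writes the transition structure as an HTK-style matrix with non-emitting entry and exit states around the three emitting states; the particular probability values are illustrative assumptions only.

import numpy as np

# Row i holds the transition probabilities out of state i.  States 0 and 4 are
# the non-emitting entry and exit states; states 1-3 are the emitting states.
# Only self-loops and moves to the next state are allowed (no skips, no back links).
trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # state 1: stay or advance
    [0.0, 0.0, 0.6, 0.4, 0.0],   # state 2: stay or advance
    [0.0, 0.0, 0.0, 0.7, 0.3],   # state 3: stay or leave the phoneme
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit (non-emitting)
])
assert np.allclose(trans[:4].sum(axis=1), 1.0)  # each non-final row sums to one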

In addition to the phonemes, there are also a number of silences and short pauses in each speech signal. Because silences and pauses do not have the same regular structure as phonemes, we allowed more flexible structures for the silence and short pause HMMs: we used a modified three-state HMM with backward and forward skipping connections to model the silences and a tied one-state connection to model the short pauses, as shown in Figure 3.5.

Figure 3.5: HMMs for silences and short pauses (adapted from [85]).

3.2.2 Stochastic Models: Mixture-of-Gaussians

An HMM state describes segments of speech signals using Gaussian models of each of the features used to encode the signal. If all speakers always pronounced a given phoneme in very similar ways, then there would be little variability in the signal and simple Gaussian models (mean and variance) would be sufficient. However, in the usual case there is considerable variability, and a mixture-of-Gaussians model may provide a more accurate description. The design issue is to determine an appropriate number of Gaussians in the mixture-of-Gaussian models. In general, the greater the size of the mixture-of-Gaussian model, the more training data is required to learn the parameters of the model. For rarely occurring or unevenly distributed phonemes in particular, providing more speech data is essential.

For our data set, we explored a range of possible model sizes from 1 to 16.


3.3 Speech Encoding

The HMM-based speech recogniser requires digitised sound signals to be encoded into a sequence of feature vectors, where each feature vector encodes the essential features of a short "frame" or "window" of the input signal. There are three key parameters required to configure the encoding process: the size of each window, the interval ("frame period") between two adjacent frames, and the set of features to be extracted from each frame. An encoding process is shown in Figure 3.6.

Figure 3.6: Speech encoding process (adapted from [85]).


3.3.1 Window Size and Frame Period

The window size and frame period are important parameters for the speech recogniser.

If the window size is too small, the window will not contain enough of the signal to be able to measure the desired features; if the window size is too large, the feature values will be averaged over too much of the input signal and will lose precision. We explored a range of window sizes, from a lower limit of 10 ms to an upper limit of 30 ms, with the lower limit chosen to be large enough to include at least two complete cycles of the fundamental frequency of the speech signal, and the upper limit chosen to ensure that a window seldom spanned more than a single phoneme.

If the frame period is too long, there will be insufficient feature vectors for each phoneme, which will prevent the HMMs from recognising the speech. If the frame period is longer than the window size, then some parts of the speech signal will not be encoded at all. If the frame period is too short, then there will be too many feature vectors, which will increase the computational cost. The absolute lower limit on the frame period is governed by the sample rate of the raw speech signal. We explored a range of frame periods, from 4 ms to 12 ms, subject to the constraint that the frame period was not larger than the window size.

3.3.2 MFCC Feature Extraction and Selection

There are many possible classes of features that could be used to encode a speech signal. We have followed common practice in using Mel-Frequency Cepstrum Coefficients (MFCCs) on a Hamming window. MFCCs use a mathematical transformation called the cepstrum, which computes the inverse Fourier transform of the log-spectrum of the speech signal [85]. Within this class of features, there is still considerable choice about which MFCC features to use. In addition to the 12 basic MFCC transformation coefficients, there are also the energy (E), the 0th cepstral coefficient (O), and the first and second order derivatives (D and A) of those coefficients. Not all combinations of these features are sensible. For example, it makes little sense to use the second order derivatives without the first order derivatives, and the HTK toolkit [85] will not use both the energy and the 0th cepstral coefficient simultaneously. We have identified nine combinations to explore in our experiments, as shown in Table 3.1.

Table 3.1: Nine feature combinations and the number of features in each set.

No   Combination    No. of Features
1    MFCC           12
2    MFCC-D         24
3    MFCC-D-A       36
4    MFCC-E         13
5    MFCC-E-D       26
6    MFCC-E-D-A     39
7    MFCC-O         13
8    MFCC-O-D       26
9    MFCC-O-D-A     39
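As an illustration of how these encoding choices are expressed in practice, the sketch below (not taken from the thesis) writes an HTK-style encoding configuration for one parameter combination: 12 MFCCs plus the 0th cepstral coefficient with deltas and accelerations, a 15 ms Hamming window and an 11 ms frame period. HTK expects WINDOWSIZE and TARGETRATE in units of 100 ns; the file name and exact parameter set are assumptions.

htk_config = {
    "TARGETKIND": "MFCC_0_D_A",   # 12 MFCCs + 0th cepstral coefficient, deltas, accelerations
    "TARGETRATE": 110000.0,       # 11 ms frame period (100 ns units)
    "WINDOWSIZE": 150000.0,       # 15 ms window (100 ns units)
    "USEHAMMING": "T",            # Hamming window
    "NUMCEPS": 12,                # number of cepstral coefficients
}

with open("encode.cfg", "w") as f:
    for key, value in htk_config.items():
        f.write(f"{key} = {value}\n")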

3.4 Parameter Identification

We decided that the architecture of the phoneme HMMs (the number of states and their connections) would follow the commonly used one. We also decided to use MFCC features as the main sound signal encoding representation. This section describes an exhaustive search experiment that we used to investigate the other parameters.

For our experiments, we trained a collection of phoneme level HMM models on a training set of annotated speech samples with each combination of parameters. We then evaluated the quality of the HMM models by using them to recognise and label speech samples in a separate test set. The following sections describe the data sets used in the experiment, the parameter combinations we explored, the training and testing process, the experiment configuration, and the performance evaluation.

3.4.1 Data Set

The experiments used the same speech data set described in section 3.1.1 on page 33. We split the speech data set into a training set with 544 utterances and their labels and a test set of the remaining 575 utterances. The split preserved an even distribution of speakers in both sets, but was otherwise random.

3.4.2 Design of Case Combinations

The goal of the experiment was to explore the performance of different choices of feature sets, window sizes, frame periods, and sizes of the mixture-of-Gaussian models. As described in the previous section (page 41), we have nine combinations of feature sets to explore.

Since the average fundamental frequency of female speech is around 220 Hz, the period of the voiced sounds is approximately 4.5 ms. We therefore chose a set of 9 frame periods from 4 ms to 12 ms in steps of 1 ms. The smallest frame period was chosen to be only marginally smaller than the average period of the fundamental frequency. The largest frame period was chosen to be above the default frame period (10 ms) suggested by the HTK toolkit.

We chose a set of 9 possible window sizes from 10 ms to 30 ms in steps of 2.5 ms. With the constraint that the window size should be at least as long as the frame period, there are 79 (= 9 × 9 − 2) possible combinations of window size and frame period and therefore 711 (= 79 × 9) combinations of the speech encoding parameters.


We also explored 5 different sizes for the mixture-of-Gaussian models: 1, 2, 4, 8, and 16. There were therefore a total of 3,555 (= 711 × 5) different possible combinations of parameters to evaluate.

We refer to the different sets of parameters by hyphenated codes such as "9-10-MFCC-E-D-A-4", where the first number is the frame period, the second number is the window size, the middle letters specify which of the MFCC features were used, and the final number is the size of the mixture-of-Gaussian models.
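A small sketch (not part of the thesis) that enumerates this parameter grid and confirms the combination counts quoted above:

frame_periods = list(range(4, 13))                       # 4..12 ms
window_sizes = [10 + 2.5 * i for i in range(9)]          # 10..30 ms in 2.5 ms steps
feature_sets = ["MFCC", "MFCC-D", "MFCC-D-A", "MFCC-E", "MFCC-E-D",
                "MFCC-E-D-A", "MFCC-O", "MFCC-O-D", "MFCC-O-D-A"]
mixture_sizes = [1, 2, 4, 8, 16]

cases = [f"{fp}-{ws:g}-{feat}-{mix}"
         for fp in frame_periods
         for ws in window_sizes
         if ws >= fp                                      # window at least as long as the frame period
         for feat in feature_sets
         for mix in mixture_sizes]

print(len(cases))    # 3555 = 79 window/period pairs x 9 feature sets x 5 mixture sizes
print(cases[0])      # e.g. "4-10-MFCC-1"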

3.4.3 Training and Testing Process

For each combination of parameters, a set of phoneme level HMMs was trained on the utterances (and their labels) in the training set. During the training process, each utterance was encoded and the relevant features were extracted based on the choice of features, window size, and frame period. Each HMM state was modelled initially by a mixture-of-Gaussians of size 1 and trained using four cycles of the Baum-Welch training algorithm [85]. The maximum size of the mixture-of-Gaussians was then doubled and two cycles of Baum-Welch re-estimation were applied. This was repeated until the maximum size of 16 was reached.

After obtaining the phoneme level HMMs, the testing process was conducted by applying these HMMs to the utterances in the test set using forced alignment and the Viterbi algorithm [85]. The testing process generated a set of auto-labelled phonemes (phoneme name, start, and end time) for each utterance. These auto-labelled phonemes were then compared to the hand-labelled phonemes to measure the accuracy of the HMMs.
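The training schedule can be pictured as the following sketch. It is not the thesis' actual script: it drives the standard HTK tools (HERest for Baum-Welch re-estimation, HHEd's MU command to increase the number of mixture components), and all file names (config, train.scp, labels.mlf, hmmdefs, hmmlist) are placeholders.

import subprocess

def herest(cycles):
    # One call per Baum-Welch re-estimation cycle over the training set.
    for _ in range(cycles):
        subprocess.run(["HERest", "-C", "config", "-S", "train.scp",
                        "-I", "labels.mlf", "-H", "hmmdefs", "hmmlist"], check=True)

herest(4)                                   # initial training with single Gaussians
for mixes in (2, 4, 8, 16):                 # double the mixture-of-Gaussians size each round
    with open("mix.hed", "w") as f:
        f.write(f"MU {mixes} {{*.state[2-4].mix}}\n")
    subprocess.run(["HHEd", "-H", "hmmdefs", "mix.hed", "hmmlist"], check=True)
    herest(2)                               # two Baum-Welch re-estimation cycles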

3.4.4 Experiment Configuration

Since training and testing a set of HMM models is a computationally intensive task, it would not have been feasible to run this exhaustive experiment on a single computer. Instead, we constructed a distributed server-client computing system to run the experiment. The system consisted of one Sun Fire 280R server and 22 1.8 GHz Pentium 4 workstations with 128 MB RAM running NetBSD, as shown in Figure 3.7. Even with these 22 computers, the experiment took more than 48 hours to complete.

Figure 3.7: Experiment configuration.

There are two central synchronised lists on the server, one containing all the training cases (one case for each combination of speech encoding parameters) and the other containing the testing cases. To reduce the network traffic, the utterances in the training and test data sets were pre-distributed to each client (workstation). The training time varies between training cases, so instead of pre-assigning a fixed number of training cases to each client, we created connections between the server and the 22 clients so that any client can train or test any case. Clients repeatedly request a training case (a combination of parameters) from the server, perform the training process, then send the trained HMMs back to the server, where they are ready to be applied to the testing cases. Once the list of training cases is empty, each client starts requesting testing cases from the server, performing the testing and sending the resulting set of auto-labelled utterances back to the server. The effect of this distributed process is that no clients are idle until the very end of the experiment.
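The server side of this scheme can be pictured as two shared work queues. The sketch below is an illustration only, not the thesis' implementation; the helper name generate_all_case_codes, the port and the authentication key are invented for the example.

from multiprocessing.managers import BaseManager
from queue import Queue

training_cases = Queue()   # one entry per parameter combination, e.g. "10-12.5-MFCC-O-D-A-4"
testing_cases = Queue()    # filled as trained HMM sets come back from the clients

class CaseManager(BaseManager):
    """Shares the two synchronised case lists with the client workstations."""

CaseManager.register("training_queue", callable=lambda: training_cases)
CaseManager.register("testing_queue", callable=lambda: testing_cases)

if __name__ == "__main__":
    for code in generate_all_case_codes():          # hypothetical helper producing the 3,555 codes
        training_cases.put(code)
    manager = CaseManager(address=("", 50000), authkey=b"span")
    manager.get_server().serve_forever()            # clients call training_queue().get() until empty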


3.4.5 Performance Evaluation

To measure the recognition performance of a case, the system compares the auto-labelled phonemes in each utterance of the test set against the hand-labelled phonemes.

In the context of our speech analyser system, the most important requirement on the recognition system is the accuracy of the time boundaries of the auto-labelled phonemes. The simplest way of measuring this accuracy would be to measure the average error, where the error is the time difference between the boundary of the auto-labelled phoneme and the boundary of the hand-labelled phoneme. However, we suspect that large errors will be much more significant for the rest of the speech analyser than small errors. Also, the nature of continuous speech is such that determining hand-labelled phoneme boundaries necessarily involves a subjective judgement, which means that small errors should not affect the accuracy measure. We therefore set a threshold and define the boundary accuracy to be the percentage of phonemes for which the difference between the auto-labelled boundary and the hand-labelled boundary is less than the threshold.

Obviously, the actual accuracy measure will vary with different thresholds: with a sufficiently high threshold, all the cases would have 100% accuracy. However, the purpose of the accuracy measure is to compare the performance of different cases, so only the relative value of the accuracy measure is important and the threshold value is not too critical. As is common among other studies [58, 64, 79], we chose 20 ms as the threshold for the performance evaluation. We also looked at the effect on the results of changing this threshold in either direction.
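For concreteness, the boundary accuracy measure can be written as the following small function (a sketch, not the thesis' evaluation code), taking the paired auto-labelled and hand-labelled boundary times in milliseconds:

def boundary_accuracy(auto_boundaries, hand_boundaries, threshold_ms=20.0):
    """Percentage of phoneme boundaries whose auto-labelled time falls within
    threshold_ms of the corresponding hand-labelled time."""
    assert len(auto_boundaries) == len(hand_boundaries)
    hits = sum(abs(a - h) < threshold_ms
               for a, h in zip(auto_boundaries, hand_boundaries))
    return 100.0 * hits / len(auto_boundaries)

# Example: three of these four boundaries are within 20 ms of the hand labels.
print(boundary_accuracy([110, 245, 398, 520], [100, 240, 430, 515]))   # 75.0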

We were also interested in the sources of boundary time errors. We hypothesised that:

Hypothesis 3.4.1 Some of the boundary time errors might be due to the recogniser misclassifying phonemes during the recognition process.

We therefore performed a more standard recognition accuracy evaluation on the best performing HMM, measuring the fraction of phonemes in the test set that were misclassified by the HMM.

3.5 Results and Discussion

This section presents the relative recognition performance of the cases and gives some further analysis of the results.

3.5.1 Best Parameter Combinations

Using the threshold measure described in the previous section, we calculated the relative phoneme boundary accuracy of the HMM speech recogniser with the 3,555 different combinations of parameters. Since the vowels play a more important role in the later stages of the speech analyser, we also calculated the relative accuracy results for vowels only (measuring just the end boundary of the vowel). The best 50 results and the worst result are given in Tables 3.2 and 3.3.

The best performance (87.07% accuracy for 10-12.5-MFCC-O-D-A-4) is considerably better than the worst performance (69.06% accuracy for 4-30-MFCC-8) in the same training and testing environment. So the choice of parameters, for the HMM design and the speech encoding process, is very important. Therefore, Hypothesis 2.1.1 (page 17) is supported by our results.

There are also several other observations that can be made from the results.

• There is a clear advantage in using the derivative (D) and acceleration (A) features. For all phonemes, all of the top 7.06% (251/3,555) of the cases have the acceleration features, and all of the top 43.76% (1,556/3,555) of the cases have the derivative features. For vowels alone, all of the top 10.44% (371/3,555) of the cases have the acceleration features, and all of the top 26.86% (955/3,555) of the cases have the derivative features.


Table 3.2: Best and worst parameter choices by boundary timing accuracy.

rank  Case                    Boundary Accuracy    rank   Case                    Boundary Accuracy
1     10-12.5-MFCC-O-D-A-4    87.07%               27     10-17.5-MFCC-O-D-A-16   86.41%
2     10-15-MFCC-O-D-A-8      87.00%               28     10-17.5-MFCC-E-D-A-4    86.40%
3     10-17.5-MFCC-O-D-A-4    86.99%               29     10-22.5-MFCC-O-D-A-2    86.38%
4     10-12.5-MFCC-O-D-A-8    86.99%               30     10-10-MFCC-E-D-A-16     86.38%
5     10-15-MFCC-O-D-A-4      86.97%               31     10-22.5-MFCC-O-D-A-4    86.37%
6     10-17.5-MFCC-O-D-A-8    86.87%               32     10-15-MFCC-E-D-A-16     86.37%
7     10-20-MFCC-O-D-A-4      86.84%               33     11-17.5-MFCC-O-D-A-4    86.37%
8     10-12.5-MFCC-O-D-A-2    86.83%               34     10-17.5-MFCC-O-D-A-1    86.34%
9     10-15-MFCC-O-D-A-2      86.79%               35     10-12.5-MFCC-E-D-A-8    86.31%
10    10-12.5-MFCC-O-D-A-1    86.70%               36     9-15-MFCC-O-D-A-4       86.30%
11    10-20-MFCC-O-D-A-8      86.69%               37     10-20-MFCC-E-D-A-8      86.30%
12    10-12.5-MFCC-O-D-A-16   86.63%               38     10-20-MFCC-O-D-A-1      86.29%
13    10-20-MFCC-O-D-A-2      86.61%               39     9-17.5-MFCC-O-D-A-4     86.29%
14    10-17.5-MFCC-O-D-A-2    86.60%               40     9-17.5-MFCC-O-D-A-8     86.26%
15    10-15-MFCC-E-D-A-8      86.56%               41     10-22.5-MFCC-E-D-A-8    86.26%
16    11-15-MFCC-O-D-A-4      86.51%               42     10-12.5-MFCC-E-D-A-16   86.23%
17    10-10-MFCC-O-D-A-8      86.50%               43     10-22.5-MFCC-E-D-A-4    86.23%
18    10-15-MFCC-E-D-A-4      86.49%               44     11-15-MFCC-O-D-A-8      86.22%
19    10-10-MFCC-O-D-A-4      86.49%               45     11-20-MFCC-O-D-A-4      86.20%
20    10-15-MFCC-O-D-A-1      86.47%               46     10-10-MFCC-O-D-A-2      86.19%
21    10-15-MFCC-O-D-A-16     86.44%               47     11-17.5-MFCC-O-D-A-8    86.17%
22    10-20-MFCC-E-D-A-4      86.44%               48     10-10-MFCC-E-D-A-8      86.17%
23    9-15-MFCC-O-D-A-8       86.44%               49     10-17.5-MFCC-E-D-A-16   86.17%
24    10-17.5-MFCC-E-D-A-8    86.43%               50     10-25-MFCC-O-D-A-4      86.14%
25    10-22.5-MFCC-O-D-A-8    86.43%               ...    ...                     ...
26    10-12.5-MFCC-E-D-A-4    86.42%               3,555  4-30-MFCC-8             69.06%


• A frame period around 10 ms and a window size around 15 ms appear to give the best performance over all phonemes, but a larger frame period around 11 ms to 12 ms and a larger window size around 17.5 ms is better for performance on vowels alone.

• The 0th Cepstral (O) feature is clearly better than the Energy (E) feature: only 15 of the top 50 cases for all phonemes, and only one of the top 50 cases for vowels, use Energy.


Table 3.3: Best and worst parameter choices by boundary timing accuracy of vowels only.

rank  Case                    Boundary Accuracy    rank   Case                    Boundary Accuracy
1     11-15-MFCC-O-D-A-4      86.98%               27     10-12.5-MFCC-O-D-A-2    86.40%
2     12-15-MFCC-O-D-A-4      86.94%               28     10-15-MFCC-O-D-A-2      86.40%
3     12-17.5-MFCC-O-D-A-4    86.86%               29     11-20-MFCC-O-D-A-8      86.37%
4     11-17.5-MFCC-O-D-A-8    86.81%               30     11-17.5-MFCC-O-D-A-2    86.37%
5     12-12.5-MFCC-O-D-A-4    86.81%               31     11-15-MFCC-O-D-A-2      86.33%
6     10-17.5-MFCC-O-D-A-4    86.73%               32     12-20-MFCC-O-D-A-16     86.32%
7     12-20-MFCC-O-D-A-4      86.72%               33     12-15-MFCC-O-D-A-1      86.32%
8     11-15-MFCC-O-D-A-8      86.71%               34     12-12.5-MFCC-O-D-A-2    86.32%
9     11-17.5-MFCC-O-D-A-4    86.66%               35     11-12.5-MFCC-O-D-A-8    86.31%
10    11-12.5-MFCC-O-D-A-4    86.65%               36     12-20-MFCC-O-D-A-8      86.31%
11    12-20-MFCC-O-D-A-2      86.64%               37     12-25-MFCC-O-D-A-2      86.29%
12    12-12.5-MFCC-O-D-A-8    86.64%               38     12-15-MFCC-O-D-A-16     86.29%
13    12-22.5-MFCC-O-D-A-4    86.61%               39     12-25-MFCC-O-D-A-4      86.28%
14    12-17.5-MFCC-O-D-A-8    86.59%               40     12-17.5-MFCC-O-D-A-1    86.26%
15    12-17.5-MFCC-O-D-A-2    86.55%               41     11-12.5-MFCC-O-D-A-1    86.25%
16    12-15-MFCC-O-D-A-2      86.55%               42     10-12.5-MFCC-O-D-A-1    86.24%
17    10-17.5-MFCC-O-D-A-8    86.51%               43     12-17.5-MFCC-E-D-A-4    86.24%
18    12-17.5-MFCC-O-D-A-16   86.47%               44     12-20-MFCC-O-D-A-1      86.23%
19    10-15-MFCC-O-D-A-4      86.46%               45     10-12.5-MFCC-O-D-A-8    86.22%
20    12-15-MFCC-O-D-A-8      86.45%               46     12-12.5-MFCC-O-D-A-1    86.21%
21    10-12.5-MFCC-O-D-A-4    86.44%               47     11-15-MFCC-O-D-A-16     86.20%
22    12-22.5-MFCC-O-D-A-2    86.42%               48     11-17.5-MFCC-O-D-A-16   86.20%
23    11-20-MFCC-O-D-A-4      86.42%               49     11-22.5-MFCC-O-D-A-4    86.15%
24    11-12.5-MFCC-O-D-A-2    86.40%               50     10-20-MFCC-O-D-A-4      86.13%
25    11-20-MFCC-O-D-A-2      86.40%               ...    ...                     ...
26    10-15-MFCC-O-D-A-8      86.40%               3,555  4-30-MFCC-16            68.10%


• As shown in Table 3.4, there seems to be a preference towards a size of 4 for the mixture-of-Gaussians for vowels. Although using a size of 8 is much better for all phonemes than for vowels, the advantage of using a size of 4 still exists.

• The case (10-12.5-MFCC-O-D-A-4), which resulted in the best performance for all phonemes, is ranked 21st for vowels only. The case (11-15-MFCC-O-D-A-4), which achieved the highest accuracy for vowels only, is ranked 16th for all phonemes. However, the difference in relative accuracy over the top 50 cases is less than 1%. So the exact choice of E vs O, frame period, window size and number of Gaussians, within the ranges above, does not appear to be very critical.


Table 3.4: Summary of the mixture-of-Gaussians for the top 50 cases.

Size   For All Phonemes     For Vowels Only
1      8% (4 out of 50)     12% (6 out of 50)
2      12% (6 out of 50)    24% (12 out of 50)
4      34% (17 out of 50)   32% (16 out of 50)
8      32% (16 out of 50)   22% (11 out of 50)
16     14% (7 out of 50)    10% (5 out of 50)


• Changing the threshold to 8 ms or 32 ms makes no difference to the strong advantage of the derivative and acceleration features. However, the preferred frame period and window size are slightly smaller for the 8 ms threshold, and slightly larger for the 32 ms threshold. Also, for the 8 ms threshold, there is a preference for the Energy feature rather than the 0th cepstral feature for vowels.

The outcome of this experiment provides a clear recommendation for the parameters we should use for the speech analyser system: using the 0th cepstral, derivative and acceleration features, along with a frame period of 11 ms, a window size of 15 ms, and a mixture of 4 Gaussians, should minimise the boundary timing errors on the vowels. If, in later work, boundary timing differences of less than 20 ms prove significant, we would then need to use the Energy feature, and a smaller frame period and window size.


3.5.2 Recognition Accuracy

The results above focused on the accuracy of the boundaries of the auto-labelled phonemes. The second evaluation attempted to identify some possible sources of the boundary errors by counting the kinds of phoneme recognition errors made by the recogniser using the best performing HMMs. There are three kinds of phoneme recognition errors:

• substitution, where the auto-labelled phoneme is different from the hand-labelled phoneme.

• insertion, where the auto-labelling includes an additional phoneme that is not present in the hand-labelling.

• deletion, where the auto-labelling does not contain a phoneme that is present in the hand-labelling.

Table 3.5 shows the recognition errors of each category for the highest ranked HMM (10-12.5-MFCC-O-D-A-4) applied to the test set. Since insertion and deletion errors will almost certainly result in boundary time errors for the phonemes adjacent to the error, in addition to the phoneme inserted or deleted, the nearly 6.45% of inserted or deleted phonemes is a non-trivial cause of the approximately 13% boundary timing error rate in Table 3.2. The substitution errors may or may not result in boundary timing errors. Therefore, Hypothesis 3.4.1 (page 45) is supported.

Table 3.5: Recognition errors for 10-12.5-MFCC-O-D-A-4.

Kind of Error          Fraction of phonemes
Insertion errors       5.33% (1372 out of 25723)
Deletion errors        1.12% (288 out of 25723)
Substitution errors    4.32% (1111 out of 25723)

The recognition system uses forced alignment recognition, in which the system knows the target sentence and uses a dictionary of alternative word pronunciations to determine the expected phonemes. Insertion and deletion errors will generally occur when the actual pronunciation by the speaker does not match any of the pronunciations in the dictionary: the speaker drops a phoneme or includes an extra phoneme, and the system is forced to align the dictionary pronunciation with the actual pronunciation. These errors are due primarily to inadequacies in the dictionary, rather than to the HMM models. Substitution errors will result from pronunciations by the speaker that are not in the dictionary, but may also result from poor HMM phoneme models if the dictionary gives two alternative pronunciations and the HMM for the wrong phoneme matches the speech signal better than the HMM for the correct phoneme.

3.6 Chapter Summary

In this chapter, we have described an HMM-based speech recogniser using the forced alignment technique. The central requirement on the recogniser is that it can accurately identify the boundaries of the phonemes, especially vowels, in a speech signal.

We have reported on an experiment that exhaustively explored a space of parameter values for the recogniser. This included the parameters of the encoding of the speech signal and the size of the statistical models in the states of the phoneme HMMs. The results of the experiment provide clear recommendations for the choice of frame period, window size, MFCC features, and the statistical model in order to minimise the significant phoneme boundary errors. Hypotheses 2.1.1 (page 17) and 3.4.1 (page 45) have been supported.

The results of the experiment also expose the limitations of the dictionary used for forced alignment.

Chapter 4

Stress Detection Stage

This chapter discusses the stress detection stage as highlighted in Figure 4.1. Section 4.1 gives a more detailed overview of the stress detection stage, including the procedure for training a stress detector. Section 4.2 discusses the issues, including the calculation and normalisation methods for the features, especially the vowel quality features. Section 4.3 explores the performance of two different classification techniques, Decision Trees and Support Vector Machines. Section 4.4 presents the experimental results. Section 4.5 summarises this chapter. Note that part of this chapter is taken from [82].

4.1 Overview

The stress detection stage is the second stage of Span. We decided to detect rhythmic stress in users' speech instead of lexical stress because rhythmic stress is more important than lexical stress in good speech [77]. This stage performs rhythmic stress detection using a binary classifier. The rhythmic stress detection task is to classify the vowel segments in the user's speech, labelled at the phoneme level, as stressed or unstressed. The classifier is trained from a speech data set, hand labelled at the phoneme level with the stress status marked.



Figure 4.1: Overview of Span: detecting stress.

Note that the classification accuracy is critical to the final stage of Span, which identifies the stress or rhythm errors in the user's speech. If vowel segments are classified incorrectly at the second stage, then the generated stress pattern for the user's speech will be inaccurate. Consequently, the stress or rhythm errors reported by the final stage will not correctly indicate the user's prosodic problems.

4.1.1 The Training Element

The classifier training procedure is illustrated in Figure 4.2.

There are three related data sets used as inputs to the training element. Two of them are the same data sets used in the training element of the speech recogniser. The third data set is a stress label data set.

The stress label data set contains stress label files corresponding to the phoneme files. Each sound file is now associated with a phoneme label file and a stress label file. We use -1, 1, and 0 to represent the stress status of each phoneme, where -1 means unstressed, 1 means stressed, and 0 means the phoneme is not a vowel and a stress mark is not applicable.

Figure 4.2: The training element.

As shown in Figure 4.2, a feature extraction and normalisation process takes the training sound files and the phoneme label files, and produces sets of feature vectors for each vowel segment in each sound. A classifier learning procedure is then applied to construct a stress classifier, which can be represented by either decision trees or support vectors (or other representations used by other machine learning techniques), from the sets of feature vectors and the corresponding stress labels.

4.1.2 The Performance Element

The classification performance element is illustrated in Figure 4.3.

The phoneme labelled sound output from the speech recognition stage becomes the input to the performance element. The same feature extraction and normalisation process as used in the training element produces feature vectors representing the vowel segments in the user's speech. Using the trained classifier, each feature vector is analysed and the corresponding vowel segment is classified as stressed or unstressed. Finally, a stress pattern for the user's speech is constructed.

4.2 Feature Extraction and Normalisation

Although stress is generally seen as a property of the syllables in an utterance rather than of just the vowels, we hypothesise that:

Hypothesis 4.2.1 The features we used in our study are largely carried by the vowel as the nucleus of the syllable. The features extracted from vowels can significantly contribute to rhythmic stress detection.

Each vowel is analysed in several different ways to extract a set of features that can be passed to the stress classifier. Since duration, amplitude, pitch and vowel quality are the parameters that have been shown to cue the perception of stress differences in English [2, 18, 29, 46], the features we need to extract are related to these parameters.


Figure 4.3: The performance element.

There are many alternative measurements of prosodic features that can be extracted, and also many ways of normalising these features in order to reduce variation due to differences between speakers, recording situations or utterance contexts. However, there is no standard way to extract and normalise vowel quality features.

4.2.1 Duration Features

Vowel durations can be measured directly from the output of the forced alignment recogniser, since the recogniser identifies the start and end points of the vowels. The measurements are not completely reliable, since it is hard for the recogniser to precisely determine the transition point between two phonemes that flow smoothly into each other. Furthermore, some short vowels may be inaccurately reported if they are shorter than the minimum number of frames specified for a phoneme in the system.

The absolute value of the duration of a vowel segment is influenced by many factors other than stress, such as the intrinsic durational properties of the vowel, the speech rate of the speaker, and local fluctuations in speech rate within the utterance. Therefore the absolute duration of the vowel segment is not a useful feature. We require a normalised duration that measures the length of the vowel segment relative to what would "normally" be spoken by an "average" speaker. To reduce the impact of these contextual properties, we applied three different levels of normalisation to the raw duration values.

The first level of normalisation reduces the effect of speech rate variation between speakers. To normalise, we need to compare the length of an utterance to the "expected" length of that utterance. To compute the latter, we first use the training speech data set to calculate the average duration of each of the 20 vowel phonemes of NZ English. We then compute the expected utterance length by summing the average durations of the vowel types in the utterance, and the actual utterance length by summing the actual durations of the vowel segments in the utterance. We can then normalise the duration of each vowel segment by multiplying by the expected utterance length divided by the actual utterance length.

The second level of normalisation removes the effects of variation in the durations of the different vowel phonemes. Each phoneme has an intrinsic duration: long vowels and diphthongs normally have longer durations than short vowels. There are several possible ways to normalise for intrinsic vowel duration. One method is to normalise the vowel segment duration by the average duration for that vowel phoneme, as measured in the training data set. Another method is to cluster the 20 vowel phonemes into three categories (short vowel, long vowel and diphthong) and normalise vowel segment durations by the average duration of all vowels in the relevant category. We consider both methods.

The third level of normalisation removes the effect of the variation in speech rate at different parts of a single utterance. To remove this influence, the result of the second level normalisation is further normalised by a weighted average duration of the immediately surrounding vowel segments.

Based on the three levels of normalisation, we computed five duration features for each vowel segment (a sketch of the first level computation is given after the list below):

• Utterance normalised duration: the absolute duration normalised by the length of the utterance;

• Phoneme normalised duration: the duration normalised by the length of the utterance and the average duration of the vowel type;

• Category normalised duration: the duration normalised by the length of the utterance and the average duration of the vowel category;

• Phoneme neighbourhood normalised duration: the vowel type normalised duration further normalised by the durations of neighbouring vowels;

• Category neighbourhood normalised duration: the vowel category normalised duration further normalised by the durations of neighbouring vowels.
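The sketch below is illustrative only (the helper argument avg_dur and its contents are assumptions); it shows the first, utterance-level normalisation, scaling each vowel's raw duration by the ratio of the expected utterance length to the actual utterance length.

def utterance_normalised_durations(vowel_durs, vowel_types, avg_dur):
    """vowel_durs: raw durations of the vowel segments in one utterance (ms);
    vowel_types: the corresponding vowel phoneme labels;
    avg_dur: average duration of each vowel phoneme in the training set (ms)."""
    expected_length = sum(avg_dur[v] for v in vowel_types)
    actual_length = sum(vowel_durs)
    scale = expected_length / actual_length
    return [d * scale for d in vowel_durs]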

4.2.2 Amplitude Features

The amplitude of a vowel segment can be measured from the speech signal, but since the amplitude changes during the vowel, there are a number of possible measurements that could be made: maximum amplitude, initial amplitude, change in amplitude, and so on. A measure commonly understood to be a close correlate of the perception of amplitude differences between vowels is the root mean square (RMS) of the amplitude values across the entire vowel. This is the measure chosen as the basis of our amplitude features. As with the duration features, amplitude is influenced by a variety of factors other than stress, including speaker differences and differences in recording conditions, as well as changes in amplitude across the utterance. We therefore need to normalise the measured amplitude to reduce the variability introduced by these effects. We apply two levels of normalisation to obtain two amplitude features.

Our first level of normalisation of amplitude takes into account global influences such as speaker differences and the recording situation, by normalising the RMS amplitude of each vowel segment against the overall RMS amplitude of the entire utterance.

Our second level of normalisation considers local effects at different parts of the utterance and normalises the vowel amplitude against a weighted average amplitude of the immediately surrounding vowel segments.
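A minimal sketch of the first-level amplitude normalisation, assuming the raw samples of the vowel segment and of the whole utterance are available as arrays:

import numpy as np

def normalised_rms_amplitude(vowel_samples, utterance_samples):
    """RMS amplitude of the vowel segment divided by the RMS amplitude
    of the entire utterance (first-level amplitude normalisation)."""
    rms = lambda x: float(np.sqrt(np.mean(np.square(np.asarray(x, dtype=float)))))
    return rms(vowel_samples) / rms(utterance_samples)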

4.2.3 Pitch Features

Like amplitude, pitch can vary over the course of the vowel segment and is influenced by a variety of different factors, including the basic pitch of the speaker's voice. To reduce the effects of speaker differences, we normalise the pitch measurements of a vowel segment by the average pitch of the entire utterance. A pitch calculation algorithm introduced in [9] was used in our system.

The change in pitch over the vowel segment is at least as important as the pitch level of the vowel, but it is not clear exactly which properties of pitch are most significant for determining stress. Therefore, we extracted 10 different pitch features, including not only the average normalised mean pitch value of a vowel segment, but also other features intended to capture changes in pitch. The 10 pitch features of a vowel segment are summarised as follows:

• Normalised mean pitch: the mean pitch value of the vowel normalised by the mean pitch of the entire utterance.

• Normalised pitch value at the start point: the pitch value at the start point of the vowel divided by the mean pitch of the utterance.

• Normalised pitch value at the end point: the pitch value at the end point of the vowel divided by the mean pitch of the utterance.

• Normalised maximum pitch value: the maximum pitch value of the vowel divided by the mean pitch of the utterance.

• Normalised minimum pitch value: the minimum pitch value of the vowel divided by the mean pitch of the utterance.

• Relative pitch difference: the difference between the normalised maximum and minimum pitch values. A negative value indicates a falling pitch and a positive value indicates a rising pitch.

• Absolute difference: the magnitude of the Relative difference, which is always positive.

• Pitch trend: the sign of the Relative difference: +1 if the pitch "rises" over the vowel segment, -1 if it "falls", and 0 if it is "flat".

• Boundary Problem: a boolean attribute, true if the pitch value at either the start point or the end point of the vowel segment cannot be detected.

• Length Problem: a boolean attribute, true if the vowel segment is too short to compute meaningful minimum, maximum, or difference values.

4.2.4 Vowel Quality Features

As mentioned earlier, vowel quality features are more difficult to extract since there is no standard approach that can be used. We developed a novel method that uses the vowel HMM models, which were obtained in the first stage of Span, to re-recognise the vowel segments of the utterance and extract vowel quality features.


Centralised vowels (see page 22) are associated with unstressed syllables, particularly the /@/ vowel, which is the primary reduced vowel. In NZ English, /I/ is also pronounced very centrally and often acts as a reduced vowel. Among the 20 English vowel types, excepting /@/ and /I/, there are 18 full vowels. Full vowels tend to be more peripheral, but can be associated with both stressed and unstressed syllables.

To determine whether a vowel segment represents a reduced vowel, we need to recognise the intended vowel phoneme, and also to determine whether it is pronounced more centrally than the norm for that vowel. Since our speech recogniser uses forced alignment, it only identifies the segments of the utterance that best match each expected vowel and does not identify how the speaker pronounced the vowel. For the prosodic features above, this is all that is needed, and using a full recogniser on the entire sentence would reduce the accuracy of the recognition. However, for measuring vowel quality, we need to know what vowel the speaker actually said, and how they pronounced it.

To determine the actual vowel quality of the vowels, we apply a very constrained form of full recognition to each of the vowel segments previously identified by forced alignment, and use the probability scores of the individual HMM phoneme models to compute several features that indicate whether the vowel is reduced or not. The algorithm is illustrated in Figure 4.4 and outlined below.

Step 1 Extract vowel segments from the utterance using forced alignment. Label each segment with the expected vowel phoneme label, based on the target sentence and the pronunciation dictionary.

Step 2 Encode each vowel segment into a sequence of acoustic parameter vectors, using a 15 ms Hamming window with a step size (frame period) of 11 ms. These parameters consist of 12 MFCC features and the 0th cepstral coefficient with their first and second order derivatives. The values of these parameters were suggested by the experiment described in section 3.5.1 (page 46).


Figure 4.4: Vowel quality features processing.

Step 3 Feed the parameter vector sequence into the 20 pre-trained HMM vowel recognisers to obtain 20 normalised acoustic likelihood scores. Each score is the geometric mean of the acoustic likelihoods of all frames in the segment, as computed by the HMM recogniser. The scores are likelihoods that reflect how well the segment matches the vowel type of the HMM.

Step 4 Find the score of the expected vowel type Se, the maximum score of any full vowel phoneme Sf and the maximum score of any reduced vowel phoneme Sr from the above 20 scores.

Step 5 Compare the scores of the best matching full vowel and the best matching reduced vowel to the score of the expected vowel in order to determine whether the expected vowel is pronounced as a full vowel or a reduced vowel. We compute four features, two of which measure the difference between the likelihoods, and two of which measure the ratio of the likelihoods. In each case, we take logarithms to reduce the spread of values.

R_d =
\begin{cases}
-\log(S_r - S_e) & \text{if } S_e < S_r \\
0                & \text{if } S_e = S_r \\
\log(S_e - S_r)  & \text{if } S_e > S_r
\end{cases}
\qquad (4.1)

F_d =
\begin{cases}
-\log(S_f - S_e) & \text{if } S_e < S_f \\
0                & \text{if } S_e = S_f \\
\log(S_e - S_f)  & \text{if } S_e > S_f
\end{cases}
\qquad (4.2)

R_r = \log(S_e / S_r) = \log S_e - \log S_r \qquad (4.3)

F_r = \log(S_e / S_f) = \log S_e - \log S_f \qquad (4.4)

Both the difference and ratio measures have advantages and disadvantages. We explore which of the two approaches is better for the detection of rhythmic stress.

Step 6 Compute a boolean vowel quality feature, T, to deal with cases where the vowel segment is so short that F or R cannot be calculated. If the vowel segment is shorter than 33 ms, which is the minimum segment duration requirement of our HMM system, then the value of this attribute is 1; otherwise, it is -1. If this value is 1, we set F and R to 0. A code sketch covering Steps 4 to 6 follows.
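The following sketch is not the thesis' code; the reduced-vowel label set {/@/, /I/} and the dictionary of scores are assumptions made for illustration. It computes the four features of equations (4.1)-(4.4) from the 20 per-vowel likelihood scores.

import math

REDUCED_VOWELS = ("@", "I")        # assumed reduced-vowel labels (see page 22)

def vowel_quality_features(scores, expected):
    """scores: dict mapping each of the 20 vowel labels to its normalised
    acoustic likelihood; expected: the vowel label from forced alignment."""
    Se = scores[expected]
    Sr = max(s for v, s in scores.items() if v in REDUCED_VOWELS)
    Sf = max(s for v, s in scores.items() if v not in REDUCED_VOWELS)

    def signed_log_difference(a, b):            # equations (4.1) and (4.2)
        if a == b:
            return 0.0
        return math.log(a - b) if a > b else -math.log(b - a)

    Rd = signed_log_difference(Se, Sr)
    Fd = signed_log_difference(Se, Sf)
    Rr = math.log(Se) - math.log(Sr)            # equation (4.3)
    Fr = math.log(Se) - math.log(Sf)            # equation (4.4)
    return Rd, Fd, Rr, Fr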

4.3 Classifier Learning and Feature Evaluation

The goals of our experiments described below are to investigate whether it is feasible to build an effective automated stress detector for English utterances, and to evaluate the different sets of features that we have extracted.


Our approach is to use two examples of standard machine learning tools, a decision tree constructor (C4.5) [62] and a support vector machine (LIBSVM) [13], to construct stress detectors using these features, and to measure the performance of the resulting stress detectors. One reason for considering a decision tree constructor is that it can generate explicit rules that might help us identify which features were most significant for the stress detector. One reason for choosing a support vector machine constructor is that support vector machine technology is relatively new. Support vector machines were originally designed for solving binary classification problems [16], which is exactly the task the stress detector performs. Margin classifiers such as boosted decision trees, which are also now considered to be state of the art, are left for future work.

4.3.1 Data Set

The experiments used a subset of the speech data that was described in the training element on page 53. The subset contains 60 utterances of ten distinct English sentences produced by six adult female NZ speakers, as part of the data set used in the first stage of Span (see page 33). The utterances were hand labelled at the phoneme level, and each vowel was labelled as stressed or unstressed. There are 703 vowels in the utterances: 340 are stressed and 363 unstressed. The prosodic and vowel quality features were extracted for each of these vowels.

4.3.2 Performance Evaluation

The task of our stress detector is to classify vowels as stressed or unstressed. Neither stress category is weighted over the other, so we use classification accuracy to measure the performance of each classifier.

Since the data set is relatively small, we applied the 10-fold cross validation method for training and testing the stress detectors. In addition, we repeat this training and testing process ten times. Our results below are the averages over the ten repetitions.

4.3.3 Experiment Design

As discussed earlier, we computed several sets of features and selected two learning algorithms for the construction of stress detectors. We designed three experiments to investigate a sequence of research questions.

To explore which subset of prosodic features is most useful for learning stress detectors for our data, the first experiment uses the two learning algorithms in conjunction with all seven different combinations of the prosodic features (D, A, P, D+A, D+P, A+P, and D+A+P, where D, A, and P are the sets of duration, amplitude, and pitch features, respectively).

To assess the contribution of vowel quality features to stress detection, the second experiment uses the two learning algorithms in conjunction with seven different combinations of the vowel quality features (Fd+T, Rd+T, Rd+Fd+T, Fr+T, Rr+T, Rr+Fr+T, and Rd+Fd+Rr+Fr+T).

The third experiment investigates whether combining the prosodic features and the vowel quality features improves performance.

In these experiments, we also investigate whether scaling the feature values to the range [-1, 1] improves performance.

For LIBSVM, we used a radial basis function kernel and a C parameter of 1.0. More information about kernels and parameters used in LIBSVM can be found in [13].
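A rough sketch of this protocol is shown below using scikit-learn, whose SVC class wraps LIBSVM. The feature matrix X and label vector y stand for one of the feature combinations above; fitting the scaler inside each cross-validation fold is a choice made for this sketch rather than a detail documented here.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def evaluate_feature_set(X, y, scale=True):
    """Mean accuracy of an RBF SVM (C = 1.0) over ten repetitions of
    10-fold cross validation, optionally scaling features to [-1, 1]."""
    steps = []
    if scale:
        # The scaler is fitted on each training fold to avoid leakage.
        steps.append(MinMaxScaler(feature_range=(-1, 1)))
    steps.append(SVC(kernel="rbf", C=1.0))
    model = make_pipeline(*steps)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean()

# Example usage (illustrative names): X_da is a (703, k) array of
# duration + amplitude features, y the stressed/unstressed labels.
# print(evaluate_feature_set(X_da, y))
```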

4.4 Results and Discussion

4.4.1 Experiment 1: Prosodic Features

The results for the prosodic features in the first experiment are shown in Table 4.1. Overall, the best results obtained by LIBSVM are almost always better than those obtained by C4.5 for all feature combinations.

Table 4.1: Results for prosodic features.

Features      C4.5                LIBSVM
              Unscaled  Scaled    Unscaled  Scaled
D             80.66     80.22     81.00     82.55
A             68.18     68.26     70.18     69.08
P             55.12     56.00     57.82     58.45
D + A         81.34     81.06     83.88     84.72
D + P         80.84     80.10     79.27     81.55
A + P         66.96     66.36     70.00     70.28
D + A + P     80.40     80.58     79.72     83.23

Scaled data led to better performance than unscaled data for LIBSVM in most cases, but this is not true for C4.5. For both LIBSVM and C4.5, the combination of duration and amplitude features (D + A) produced the best results, which are 84.72% and 81.34% respectively. Adding the pitch features to this subset did not improve performance in any case. These results suggest that the set of features (D+A) is the best combination for our data set. While both decision trees and support vector machines are supposed to be able to deal with redundant features, neither of them performed well at ignoring the less useful features in this experiment.

4.4.2 Experiment 2: Vowel Quality Features

The second experiment investigated the performance of the vowel quality features. The results are shown in Table 4.2. The vowel quality features alone achieved results that were very comparable to the performance of the prosodic features. The best result was 82.50%, which was achieved by LIBSVM using the vowel quality features (Rr + T). This result was only 2.22% lower than the best result achieved by LIBSVM using prosodic features (D+A). Furthermore, the performance achieved by C4.5 using (Rr + T) (82.15%) is better than that using (D + A) (81.34%). In addition, the following points can be noted.

Table 4.2: Results for vowel quality features.

Features                   C4.5                LIBSVM
                           Unscaled  Scaled    Unscaled  Scaled
Fd + T                     65.50     66.17     66.57     68.27
Rd + T                     80.74     80.87     81.36     81.51
Fd + Rd + T                79.88     79.73     79.12     81.51
Fr + T                     67.80     68.38     62.56     63.44
Rr + T                     82.14     82.15     82.50     78.37
Fr + Rr + T                80.64     80.48     81.29     78.37
Rd + Fd + Rr + Fr + T      79.68     78.90     79.05     81.51

• The reduced vowel quality features Rd and Rr are more reliable than the full vowel quality features Fd and Fr, which supports Hypothesis 2.2.1 (page 22).

• For LIBSVM, scaling is recommended when using likelihood differences for vowel quality features; if likelihood ratios are used, scaling is not needed.

• For LIBSVM, with a smaller number of scaled features (five features in total), the ability to handle redundant features is demonstrated in this experiment: the performance remains at 81.51% whenever the scaled features Rd + T are included in the feature set.

• For C4.5, using the likelihood ratios is better than using the likelihood differences.

• For C4.5, in most cases, scaling produced slightly better results than non-scaling, regardless of whether differences or ratios were used, but the difference in performance between scaling and non-scaling was always very small.

4.4.3 Experiment 3: All Features

Table 4.3: Results for prosodic and vowel quality features.

Features       C4.5                LIBSVM
               Unscaled  Scaled    Unscaled  Scaled
C + Vd         80.26     81.38     81.04     82.23
C + Vr         80.30     80.42     81.14     82.40
C + Vd + Vr    79.97     80.06     81.28     82.01

The third experiment was performed using the combination of all the prosodic features (C) and the vowel quality features using either the difference (Vd = Fd + Rd + T) or the ratio measure of vowel quality (Vr = Fr + Rr + T). As can be seen from Table 4.3, combining all features from the two sets did not improve the best performance on our data set over using either prosodic or vowel quality features alone. However, the result did demonstrate that LIBSVM achieved better performance than C4.5 on the data set, suggesting that SVM is more suitable for a relatively large data set with all numeric data.

For all three experiments, C4.5 produced rules that were far more complex and much harder to interpret than expected. Given that most of the features are numeric, this was not too surprising.

4.5 Chapter Summary

In this chapter, we have explored rhythmic stress detection methods using SVM and DT techniques. We have also studied a range of prosodic features and vowel quality features, including feature extraction, normalisation and scaling. We have developed a novel method for measuring vowel quality features.

The results of the experiments suggest that we should use SVM to build the automatic rhythmic stress detector, and use duration and amplitude features to detect rhythmic stress in our data set.

The results of the experiments also indicated that SVM is more suitable than DT (C4.5) for a relatively large data set with all numeric data.

Hypotheses 4.2.1 (page 55) and 2.2.1 (page 22) have been supported.

Chapter 5

Error Identification Stage

This chapter discusses the stress and rhythm error identification stage as highlighted in Figure 5.1. Section 5.1 gives a brief overview of the stress and rhythm error identification stage. Section 5.2 discusses the vowel sequence alignment problem. Section 5.3 discusses error identification in the user's stress pattern. Section 5.4 explores error identification in the user's rhythm pattern. Section 5.5 presents visualisation tools for users. Section 5.6 summarises this chapter.

5.1 Overview

The stress and rhythm error identification stage is the final stage of Span. As shown in Figure 5.2, it first aligns the vowel sequences in the target stress pattern and the user's stress pattern. It then performs stress error identification by comparing the stress status of a vowel in the user's speech with the corresponding status in the target. If no error is found, it then derives the user's rhythm pattern and performs a rhythm error identification by comparing the user's rhythm pattern with the target rhythm pattern. The reported errors from the error identifier are then fed into Peco, which uses the given information to provide useful personalised feedback to the user. Therefore, the reported errors must include the most critical problems in the user's speech but must not include redundant, consequential, or minor errors, which may prevent Peco from providing useful feedback.

Figure 5.1: Overview of Span: identifying stress and rhythm errors.

5.2 Vowel Sequence Alignment

As described in the previous chapter, a stress pattern in an utterance consists of a sequence of vowels and their stress status. To identify whether the user stressed or unstressed vowels correctly, we need to check whether the status of each vowel in a user's speech matches the status of the corresponding vowel in the target speech. In order to do this, we first need to properly align the two sequences of vowels.

Figure 5.2: Overview of the error identifier.

As our speech recogniser uses the forced alignment technique, the best matched phoneme sequence aligned to the user's speech is taken from the word pronunciation dictionary regardless of what the user actually utters. Therefore, if the dictionary does not include any alternatives (either acceptable pronunciations or mistakes), then the speech recogniser will produce the same phoneme sequence as the target one despite the errors the user may make. In this case the vowel sequence alignment will be trivial, but the stress pattern in the user's speech may not be correctly identified, since pronunciation errors such as insertions, deletions and substitutions may not be noticed by the speech recogniser. If the dictionary does include alternatives, the speech recogniser may produce a phoneme sequence different from the target one. In this case, the vowel sequence alignment will be non-trivial and it could be very difficult to achieve the ideal alignment result.

5.2.1 Fundamentals of the Alignment Algorithm

We adapted a dynamic programming algorithm called the Needleman/Wunsch algorithm [55] to perform the vowel sequence alignment. The algorithm has three steps: initialisation, scoring, and traceback.

Assume that: (1) there are two sequences A and B, which have M and N elements, respectively; (2) the sequence A is used as the target.

Initialisation Create a matrix X with M + 1 columns and N + 1 rows so there are (M + 1) × (N + 1) cells in total. Each cell has a value and three backpointers. For each cell in the first row and first column of the matrix, values are initially filled with 0. All backpointers in all cells initially point to NULL.

For example, we have two sequences A = {V, @, O, @, a:, @} and B = {V, O, @, I, a:}. The size of A is six and the size of B is five. We then create a matrix X with 6 × 7 cells as shown in Figure 5.3. Note that backpointers pointing to NULL are not shown in this figure (nor are they shown in Figures 5.4 and 5.5).

Scoring Fill the value in a cell (i, j) with $X_{i,j}$, the maximum global alignment score ending at the cell (i, j), where 1 ≤ i ≤ N and 1 ≤ j ≤ M.

$$X_{i,j} = \max\left[\,X_{i-1,j-1} + S_{i,j},\; X_{i,j-1},\; X_{i-1,j}\,\right] \tag{5.1}$$


Figure 5.3: Alignment algorithm: initialisation.

$S_{i,j}$ is defined as the local alignment score. $S_{i,j}$ is set to 1 if the vowel at position i of sequence B is the same as the vowel at position j of sequence A, otherwise it is 0.

If $X_{i,j} = X_{i-1,j-1} + S_{i,j}$, then set a backpointer to the cell (i−1, j−1) in the cell (i, j). If $X_{i,j} = X_{i,j-1}$, then set a backpointer to the cell (i, j−1) in the cell (i, j). If $X_{i,j} = X_{i-1,j}$, then set a backpointer to the cell (i−1, j) in the cell (i, j). It is possible for a cell to have more than one non-null backpointer by the end of the scoring. That means there exists more than one possible path leading to the cell (i, j).

For example, Figure 5.4 illustrates the above matrix X after completing the scoring.

Traceback Following the backpointers in each cell, starting from the cell at the bottom right corner of the matrix X, move back to find a path and generate alignment results.

Figure 5.4: Alignment algorithm: scoring.

From a cell (i, j), if there exists a backpointer pointing to a cell (i, j−1), it means that the jth element in the sequence A is an extra element or the sequence B has a deletion error. A "–" is inserted into the sequence B to represent the missing element. If there exists a backpointer pointing to a cell (i−1, j), it means that the ith element in the sequence B is an insertion error. A "–" is then inserted into the sequence A to expand its size. If there exists a backpointer pointing to a cell (i−1, j−1), it means that the elements in sequences A and B are matched or are substitutions.

If, from a cell (i, j), there is more than one backpointer pointing back to its predecessors and these include the cell (i−1, j−1), then the backpointer pointing to the cell (i−1, j−1) will be chosen to form the traceback path. Otherwise one of the backpointers is arbitrarily chosen.
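The following Python sketch is one way to implement the algorithm exactly as described (match score 1, mismatch 0, no gap penalty, diagonal moves preferred during traceback); the function name and gap marker are illustrative.

```python
def align(A, B, gap="--"):
    """Globally align target sequence A and user sequence B with the
    Needleman/Wunsch scheme described above."""
    M, N = len(A), len(B)
    # Score matrix with an extra row and column of zeros (initialisation).
    X = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            s = 1 if B[i - 1] == A[j - 1] else 0
            X[i][j] = max(X[i - 1][j - 1] + s, X[i][j - 1], X[i - 1][j])
    # Traceback from the bottom-right cell, preferring the diagonal move.
    aligned_A, aligned_B = [], []
    i, j = N, M
    while i > 0 or j > 0:
        s = 1 if i > 0 and j > 0 and B[i - 1] == A[j - 1] else 0
        if i > 0 and j > 0 and X[i][j] == X[i - 1][j - 1] + s:
            aligned_A.append(A[j - 1]); aligned_B.append(B[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and X[i][j] == X[i][j - 1]:
            aligned_A.append(A[j - 1]); aligned_B.append(gap)  # deletion in B
            j -= 1
        else:
            aligned_A.append(gap); aligned_B.append(B[i - 1])  # insertion in B
            i -= 1
    return list(reversed(aligned_A)), list(reversed(aligned_B))

# The worked example from the text:
# align(["V", "@", "O", "@", "a:", "@"], ["V", "O", "@", "I", "a:"])
# -> (['V', '@', 'O', '@', '--', 'a:', '@'], ['V', '--', 'O', '@', 'I', 'a:', '--'])
```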

For the example used above, the traceback path and the alignment result are shown in Figure 5.5:

Sequence A: V  @  O  @  --  a:  @
Sequence B: V  --  O  @  I   a:  --

Figure 5.5: Alignment algorithm: traceback.

5.2.2 Additional Issue

Since one of the tasks in this research is to automatically identify the stress pattern in the user's speech, the speech recogniser should identify most of the pronunciation differences in users' speech, especially the insertion and deletion errors, because these two types of errors directly affect the stress pattern and the rhythm pattern by changing the number of vowels. In order to identify most of the errors in an ESL student's speech, particularly for Mandarin speakers, we conducted a small experiment with the following changes made in the speech recogniser:

• Amend the word pronunciation dictionary by adding extra schwas at the end of some words or in some consonant clusters;

• Change the phoneme network construction by setting every phoneme as optional instead of the standard compulsory mode.

After applying these changes, the results of the small experiment show that the recognised phoneme sequence reflects the actual pronunciation more precisely, which means more inserted or deleted phonemes can be identified in the user's incorrect speech. However, it makes the vowel sequence alignment much more difficult since the discovered insertions, deletions, and substitutions of phonemes cause more ambiguities.

For example, for a sentence containing the word "vitamins", with a target vowel sequence of "... I @ I ..." for the word and a user's vowel sequence of "... @ ...", obviously there are errors in the user's speech. But by just examining the two vowel sequences, we cannot determine which of the following four possibilities is the true cause of the errors:

• The @ in the user's speech is a substitution of the first I in the target speech and the rest of the vowels are missing;

• The @ in the user's speech matches the middle @ but the two I vowels are missing;

• The @ in the user's speech is a substitution of the second I in the target speech and the previous two vowels are missing;

• The @ in the user's speech is an insertion problem at the end of the previous word in the sentence, which commonly happens with Mandarin speakers, and the target word "vitamins" is not pronounced at all.

One approach for reducing the ambiguity and difficulty of the alignment is to align the two phoneme sequences of the user's speech and the target speech instead of just the two vowel sequences, because the additional consonant phonemes can help to reduce the ambiguity.

However, our preliminary experimental results show that our standard algorithm can still find more than one possible alignment result. It cannot handle this issue properly.

5.2.3 Two-layer Alignment Algorithm

To address this additional problem, we perform the sequence alignment at two layers — the word level as well as the phoneme level. In order to do so, we first need to know which word each chunk of phonemes belongs to. There are at least two ways to obtain the word-phoneme mapping information. One way is to modify the speech recognition process so that the output contains both phoneme and word annotations. The other way is to perform an additional word level forced alignment and then map the words to phonemes by checking their time stamps. We chose the second method because changes to the output of the first speech recognition stage would cause many significant changes in other later stages.

After obtaining the word-phoneme mapping, as shown in Figure 5.6, the recognised word sequence of the user's speech is aligned with the target word sequence. This is the first layer sequence alignment. We then perform the second layer alignment that aligns the phoneme sequences of each aligned word pair. After that, we filter out the non-vowel phonemes and obtain the aligned vowel sequences. This two-layer sequence alignment approach greatly reduces the ambiguity of the phoneme alignment process.
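A sketch of the two-layer procedure is given below, reusing the align function from the sketch in Section 5.2.1. It assumes the word-phoneme mapping has already been converted into lists of (word, phoneme list) pairs; these names and the vowel-filtering helper are illustrative only.

```python
def two_layer_align(target_words, user_words, gap="--"):
    """Two-layer alignment: align the word sequences first, then the
    phonemes of each aligned word pair.  Each argument is a list of
    (word, [phonemes]) pairs from the word-phoneme mapping."""
    aligned_t, aligned_u = align([w for w, _ in target_words],
                                 [w for w, _ in user_words], gap)
    t_phones, u_phones = [], []
    ti = ui = 0
    for wt, wu in zip(aligned_t, aligned_u):
        pt = target_words[ti][1] if wt != gap else []
        pu = user_words[ui][1] if wu != gap else []
        pa, pb = align(pt, pu, gap)   # second layer: phonemes of this word pair
        t_phones += pa
        u_phones += pb
        ti += wt != gap               # advance only past real (non-gap) words
        ui += wu != gap
    return t_phones, u_phones


def aligned_vowel_pairs(t_phones, u_phones, vowels, gap="--"):
    """Filter the aligned phoneme sequences down to the aligned vowel pairs."""
    return [(t, u) for t, u in zip(t_phones, u_phones)
            if t in vowels or u in vowels]
```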

5.3 Stress Error Identification

After properly obtaining the two aligned vowel sequences, we are able to identify the stress errors in the user's speech by comparing the stress statuses of each aligned vowel pair.

Figure 5.6: Two-layer vowel sequence alignment (illustrated with the phrase "is an inapposite ...").

When comparing the stress status of a vowel in the user's speech with the corresponding vowel in the target speech, we can obtain the following kinds of situations:

1. Stressed – Stressed and Unstressed – Unstressed

These are the ideal results. The stress status of a vowel in the user's speech matches the stress status of the aligned vowel in the target speech.

2. Stressed – Unstressed and Unstressed – Stressed

These are the worst cases. The stress statuses are just opposite.

3. Stressed – Not Applicable and Not Applicable – Stressed

When a user inserts an extra stressed vowel that is not present in the target speech, "Stressed – Not Applicable" occurs. When a user misses a vowel that is stressed in the target speech, "Not Applicable – Stressed" occurs.

4. Unstressed – Not Applicable and Not Applicable – Unstressed

When a user inserts an extra unstressed vowel, which is not present in the target speech, "Unstressed – Not Applicable" occurs. When a user misses a vowel, which is unstressed in the target speech, "Not Applicable – Unstressed" occurs.

Ideally, we should report any unmatched results as stress errors. However, as our next step is to identify rhythm errors in a user's speech, we should mainly consider stress errors that could cause incorrect speech rhythm.

As discussed in sub-section 2.2.2 (page 18), speech in English is widely recognised as having stress-timed rhythm. Stressed vowels construct the main frame of speech rhythm. This means that, while reading a sentence in a given context, any missing or additional stressed vowel would destroy the expected rhythm. In contrast, unstressed vowels play a less important role in speech rhythm. Native speakers sometimes skip unstressed vowels but increase the length of neighbouring consonants or short pauses to retain the rhythm. Therefore, of the above four kinds of situations, we must report the second kind because such cases are clear stress errors. We also need to treat the third kind seriously, since a stressed vowel should never be left unpronounced and an additional stressed vowel is not allowed if good speech rhythm is to be produced. We tend to ignore the fourth kind of situation since it has less negative impact on speech rhythm.
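In code, this reporting policy reduces to a small decision over each aligned pair of stress statuses. The sketch below is a paraphrase of the rules above, with 'S', 'U' and None standing for stressed, unstressed and no aligned vowel respectively.

```python
def stress_error_severity(user_status, target_status):
    """Decide how an aligned (user, target) stress-status pair is treated:
    'S' = stressed, 'U' = unstressed, None = no aligned vowel (a gap)."""
    if user_status == target_status:
        return "match"    # situation 1: statuses agree, no error
    if "S" in (user_status, target_status):
        return "report"   # situations 2 and 3: a stressed vowel is wrong, extra or missing
    return "ignore"       # situation 4: only an unstressed vowel is extra or missing
```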


5.4 Rhythm Error Identification

If no stress error is found, the error identifier then derives a rhythm pattern by extracting the stressed vowels from the stress pattern. Since a rhythm pattern consists of a sequence of stressed vowels and the intervals between each pair of them, the system must retrieve additional timing information from the output of the speech recogniser to construct the rhythm pattern.

In order to identify rhythm errors in a user's speech, we developed a rhythm error identification approach. The approach is to compare the user's rhythm pattern with a target rhythm pattern to identify the differences. The differences are then further analysed in order to eliminate insignificant, minor, or redundant errors. For this approach, we need to carefully choose a native speaker who can produce the right speech rhythm pattern, so that the comparison results based on the target rhythm pattern will not lead to suggestions that might negatively affect the user. Because the user's speech may have a different length from the target speech, directly comparing the rhythm patterns of the two utterances may not always be practical. Therefore, we normalised the length of the user's speech to the length of the target speech.

We explored two kinds of comparison approaches for identifying the differences in the user's speech: VOP comparison and foot comparison.

5.4.1 VOP Comparison

When speaking rhythmically, English speakers adjust the overall timing so that the vowel onsets of the stressed vowels occur near certain privileged temporal locations. Thus Vowel Onset Points (VOPs) are believed to be the most important elements forming the English speech rhythm pattern [3, 53].

The VOP comparison approach compares only the VOPs of each pair of corresponding stressed vowels in the user's speech and the target speech. The VOP differences can indicate whether the user starts a stressed vowel earlier or later than the target. The advantage of using this method is that the comparison is fairly simple, but the disadvantage is that any consequential "error" caused by previous errors will also be reported. For example, Figure 5.7 illustrates the rhythm pattern comparison between the target speech and the user's speech for the sentence "Amongst her friends she was considered beautiful". The stars represent the stressed vowels in each utterance and the positions of those stars indicate the VOPs of those stressed vowels. From this figure, we can see that none of the VOPs in the user's speech match the corresponding VOPs in the target speech, especially from the third to the fifth, although it is clear that the fourth and the fifth VOP differences are caused by the late occurrence of the third VOP.


Figure 5.7: Rhythm pattern comparison – VOP.

By using this approach, all five of these VOP differences are reported to Peco. Obviously this is not ideal: Peco may be disturbed by those consequential errors and may give unnecessary, incorrect, or useless feedback to the user.

5.4.2 Foot Comparison

A foot is calculated by measuring the distance between the VOPs of a pair of adjacent stressed vowels. The foot comparison approach compares a foot between two stressed vowels in the user's speech with the corresponding foot in the target speech. The foot differences can indicate whether the user holds a foot too long or too short. The advantage of this method is that, unlike the VOP method, foot comparison can ignore the consequential errors and report only the main errors. Thus Peco can focus on the main errors and provide the user with accurate and useful feedback.

The foot difference can be measured in two forms. One is the absolute foot difference, which is directly calculated by subtracting the foot in the user's speech from the corresponding foot in the target speech. The other is the relative foot difference, which is the ratio of the absolute foot difference over the foot in the target speech. We use both absolute and relative foot differences to identify rhythm errors because in some circumstances the relative foot difference can compensate for a limitation of the absolute foot difference in identifying the causes of incorrect rhythm. For example, an absolute foot difference in a user's speech may not be significant, but the foot may still be over 200% longer or shorter than the corresponding target foot if the target foot itself is small.

The foot differences can be either positive or negative. A positive difference means that the foot in the user's speech is shorter than the target foot; a negative difference means that it is longer.

Clearly, in acceptable utterances of a sentence spoken by different speakers, or by the same speaker multiple times, no corresponding feet will be exactly the same in different utterances. So we need an absolute difference threshold and a relative difference threshold to measure whether the foot differences are acceptable or should be considered as errors. In our prototype system, we adopt an interactive method: the thresholds are not set to fixed numbers but are adjustable by users. Not only does this interactive method enable researchers to find appropriate thresholds during the use of the prototype software, but it also gives users an opportunity to see how closely their rhythm matches the target.

In order to let users concentrate on their main rhythm errors, our prototype system only reports the largest absolute foot difference and the largest relative foot difference to Peco. For example, for the case illustrated in Figure 5.7, with the threshold of the absolute foot difference set to 5 units and the threshold of the relative foot difference set to 20%, our system reports that in the user's speech the first foot is absolutely too short, the second foot is absolutely too long, and the second foot is also relatively too long.
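The sketch below shows one possible implementation of the foot comparison. It assumes the stressed-vowel VOPs are already aligned one to one, normalises the user's timing by the span from the first to the last VOP (an assumption made for the sketch; the system normalises by the overall speech length), and reports only the largest absolute and the largest relative foot differences that exceed their thresholds.

```python
def foot_comparison(target_vops, user_vops, abs_threshold, rel_threshold):
    """Compare feet (intervals between the VOPs of consecutive stressed
    vowels).  Positive differences mean the user's foot is too short,
    negative that it is too long."""
    # Normalise the user's timing onto the target's time scale.
    scale = (target_vops[-1] - target_vops[0]) / (user_vops[-1] - user_vops[0])
    user_scaled = [(v - user_vops[0]) * scale + target_vops[0] for v in user_vops]

    feet_t = [b - a for a, b in zip(target_vops, target_vops[1:])]
    feet_u = [b - a for a, b in zip(user_scaled, user_scaled[1:])]

    abs_diffs = [t - u for t, u in zip(feet_t, feet_u)]          # target - user
    rel_diffs = [d / t for d, t in zip(abs_diffs, feet_t)]       # relative to target foot

    report = {}
    # Only the largest absolute and the largest relative difference are reported.
    i = max(range(len(abs_diffs)), key=lambda k: abs(abs_diffs[k]))
    if abs(abs_diffs[i]) > abs_threshold:
        report["absolute"] = (i, abs_diffs[i])
    j = max(range(len(rel_diffs)), key=lambda k: abs(rel_diffs[k]))
    if abs(rel_diffs[j]) > rel_threshold:
        report["relative"] = (j, rel_diffs[j])
    return report
```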

5.5 Visualisation Tools

In our prototype system, we provided users with some visualisation tools. These visualisation tools can be thought of as the preliminary design of Peco. Currently they are designed for experimentation by ESL researchers and for helping users get some information about their prosodic problems.

5.5.1 Stress Error Visualisation

The stress error visualisation tool shows users the result of their stress pattern comparison.

For example, in Figure 5.8, the target stress pattern is shown in the upper panel with the title "Target" and the user's stress pattern is shown in the lower panel with the title "Yours". From the figure, we can see that the stress statuses of two vowels in the user's speech are opposite to the ones in the target speech. In addition, there is one vowel missing in the user's speech.

Figure 5.8: Stress pattern visualisation. (The "Target" and "Yours" panels plot stress level against time for each vowel; the markers indicate a vowel missing in the user's speech, a substituted vowel, and unmatched stress statuses.)

5.5.2 Rhythm Error Visualisation

The rhythm error visualisation tool shows users the results of their rhythm pattern analysis. Note that the feedback in the figures in this section is based on the absolute foot difference with an arbitrary threshold value.

As discussed in section 5.4, the rhythm error identification procedure is constrained by the results of the stress error identification. If either the user makes serious stress errors or our stress detector reports wrongly, so that some of the feet in the user's speech do not match the target feet, then the foot comparison approach used for identifying the rhythm errors will not work properly. This implies that our originally designed rhythm error identification procedure may not be able to give users constructive feedback on rhythm as we would hope. Therefore, we provide two additional timing measurement options in order to help users identify the timing problems in their speech. Thus we have three options in total to provide users with three different kinds of feedback on timing and rhythm errors.

The first option is Target Vowel to Vowel, where the intervals between adjacent vowels in the target speech are compared with the intervals between the corresponding vowels in the user's speech, regardless of the stress status in the target speech. This option essentially checks whether the overall speech rate in the user's speech matches that of the target speech.

For example, Figure 5.9 shows the rhythm errors identified in the user's speech by the target vowel to vowel option for an ESL user's utterance of the sentence "Amongst her friends she was considered beautiful". The speech rate over the words "Amongst her" is acceptable but over the words "friends she was" it is faster; the shortest interval is between the words "friends" and "she". The overall speech rate over the words "was considered beautiful" is slower than the target, and the longest interval occurs between the last vowel of the word "considered" and the first vowel of the word "beautiful". According to this, the user should slow down somewhat at the beginning and possibly needs to add a short pause after the word "friends". The user also needs to speed up a bit during the interval towards the end of the sentence.


Figure 5.9: Timing feedback – target vowel to vowel.

The second option is Target Stressed to Stressed, where the intervals between stressed vowels in the target speech are compared with the intervals between the corresponding vowels in the user's speech. Note that it does not count the unstressed vowels in the target speech and ignores the stress status of the user's speech. This option can roughly discover whether the user has presented a kind of "rhythm" despite there being serious stress errors in the user's speech.

For the same example used above, Figure 5.10 shows the rhythm error in the user's speech identified by the target stressed to stressed option. It suggests that the user does have a kind of timing control when starting and finishing the reading but needs to adjust the timing by slowing down the middle part of the sentence.

Figure 5.10: Timing feedback – target stressed to stressed.

Note that a vowel in the target speech will be skipped if there is no corresponding vowel in the user's speech when the above two options are used.

The third option is Exactly Matched Stress Pattern, where the intervals between the stressed vowels in the user's speech are compared with the intervals between the corresponding stressed vowels in the target speech. This applies if and only if the stressed vowels in the user's speech exactly match the stressed vowels in the target speech.

For the same example used above, by using the exactly matched stress pattern option, the system will display a warning and only give "not applicable" feedback, as shown in Figure 5.11, due to the existing stress errors in the user's speech.

Figure 5.11: Timing feedback – exactly matched stress pattern.

5.6 Chapter Summary

In this chapter, we have described the final component of Span — the error identifier, including the vowel sequence alignment algorithm, the stress error identification, and the rhythm error identification.

A two-layer vowel sequence alignment algorithm based on the Needleman/Wunsch technique has been introduced to handle the increased difficulty and ambiguity encountered when dealing with ESL speakers' speech.


Several kinds of mismatch between the stress status of vowels from a user's speech and the target speech were studied. Two rhythm error identification methods were explored.

We also presented the preliminary work on Peco, which provides users with visual feedback on their prosodic problems.

Chapter 6

Conclusions

The objective of this chapter is to summarise the research findings and present future work. Following an overview of this research, findings in each of the three main components of Span are explored. The thesis then concludes with a discussion of future research work.

6.1 Conclusions

The work in this thesis involved technologies from multiple areas, including computer science, linguistics, statistics, and physics. The purpose of this thesis was to use these technologies to develop a prototype sub-system — Span — of an ICAI system for TESOL. The goal was successfully achieved by constructing a speech recogniser, building a stress detector, and creating an error identifier.

The performance of the speech recogniser and the stress detection components of Span was reasonable (about 87% accuracy for the speech recogniser, and 85% accuracy for the stress classifier). This performance is comparable with the reported performance of other systems in related areas. The performance of the error identifier component could not be evaluated since there remains an unsolved linguistic research problem: the foot difference thresholds.


It is not clear whether the performance of each of the components of Span is adequate for the effective use of the ICAI system, because the overall performance of Span cannot be evaluated until the foot difference thresholds have been determined and the other sub-system — Peco — has been completed, both of which are beyond the scope of this thesis.

6.1.1 Speech Recognition

We studied an HMM-based forced alignment speech recogniser. The speech recogniser is a key component of Span. The central requirement on the speech recogniser is that it can accurately identify the boundaries of the phonemes in a speech signal.

We explored a range of parameters for constructing the speech recogniser, especially in the phoneme HMM stochastic design and the speech encoding process. Our findings supported Hypothesis 2.1.1 (page 17). We found that a stochastic model with 4 Gaussian mixtures per HMM state contributed to the highest vowel phoneme boundary accuracy on our speech data set. We found that 12 MFCCs plus the 0th cepstral coefficient, together with their derivative and acceleration features, computed for a 15 ms wide speech signal window with an 11 ms interval, provided sufficient information for the speech recogniser to minimise the boundary timing errors on the vowels in our speech data set. The maximum accuracy within the 20 ms boundary timing difference threshold was 87.07%, which is higher than that of other related speech recognition systems mentioned in section 2.4.1 (page 26). However, we must point out that a precise performance comparison is not possible due to variations in the speech data used and in the details of the implementations.
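As a rough illustration only, the snippet below computes a comparable encoding (12 MFCCs plus a 0th coefficient, with delta and acceleration coefficients, from 15 ms windows at an assumed 11 ms frame shift) using the librosa library; librosa's filterbank and its definition of the 0th coefficient differ from the front end actually used, so the numbers it produces are not interchangeable with ours.

```python
import numpy as np
import librosa


def encode(wav_path):
    """Roughly approximate the chosen encoding: 13 cepstral coefficients
    (c0..c12) plus deltas and accelerations, 39 features per frame."""
    y, sr = librosa.load(wav_path, sr=None)
    win = int(0.015 * sr)   # 15 ms analysis window
    hop = int(0.011 * sr)   # 11 ms frame shift (assumed interpretation)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)             # derivative features
    accel = librosa.feature.delta(mfcc, order=2)    # acceleration features
    return np.vstack([mfcc, delta, accel])
```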

We also looked at the standard three-state, left-to-right phoneme HMM architecture. We found that this architecture is a possible source of some of the errors in the recognised phoneme boundaries. The left-to-right, no-skipping state connection model requires that every recognised phoneme must be at least three frames wide. If the actual phoneme is missing, or its duration is less than the width of three frames, then sound signals belonging to its neighbours are "stolen" to make up the phoneme in order to meet the requirement. This "stealing" problem generates bad phoneme boundaries.

6.1.2 Stress Detection

On advice from the linguistic researchers [77], we decided that rhythmic stress is the more important factor in good speech production. We developed an approach to rhythmic stress detection in NZ English.

Vowel segments were identified from the speech data, and a range of prosodic features and vowel quality features were extracted from the vowel segments instead of from the more commonly used syllables. We explored several methods for normalising the prosodic features. We developed an innovative method for calculating vowel quality features.

We normalised and/or scaled different combinations of these features and then fed them into the C4.5 and LIBSVM algorithms to learn the stress detectors. We found that a combination of duration and amplitude features achieved the best performance (84.72%) and that the vowel quality features also achieved good results (82.50%). It is interesting to note that the prosodic features and the vowel quality features are comparable at detecting stress, but that their combination did not appear to enhance performance.

The experimental results supported Hypothesis 4.2.1 on page 55: the features we used in our study are largely carried by the vowel as the nucleus of the syllable, and the features extracted from vowels can significantly contribute to rhythmic stress detection.

The results using vowel quality features supported Hypothesis 2.2.1 (page 22): for detecting stress status, knowing a vowel is reduced is more reliable than knowing it is full.

We found that, on our data set, support vector machines achieved better results than decision trees, so we decided to use support vector machines to construct the stress detector.

We found that with the whole set of features, either scaled or unscaled, neither support vector machines nor decision trees could perform well at handling less useful features. However, with the smaller set of scaled vowel quality features, the support vector machines were able to deal with redundant features but decision trees still could not.

While the maximum accuracy is not yet good enough to be very useful for a commercial system, these results are quite comparable to (even slightly better than) other stress detection systems in this area [39, 72], reflecting the fact that automatic rhythmic stress detection from continuous speech remains a difficult problem in the current state of the art of speech recognition.

6.1.3 Error Identification

We explored the Needleman/Wunsch algorithm to align the vowel sequences in the target speech and the user's speech. We found that, due to the many variations and errors in ESL speakers' speech, the vowel sequence alignment process encountered many ambiguities and difficulties and produced more than one possible alignment result even though the program used additional consonant information. We designed and implemented a two-layer alignment strategy that greatly reduced the ambiguity in the vowel sequence alignment process.

We explored the kinds of comparison results between the stress status of pairs of vowels from the target speech and the user's speech, and determined which kinds of mismatch are the most serious errors. Based on this exploration, our system is able to distinguish errors and identify the most serious ones.

We also explored two rhythm error identification methods. We found that the VOP comparison approach was easily implemented but reported consequential errors to Peco in addition to the key errors. We found that the foot comparison approach was much better at reporting only the key errors. However, due to the variation in human speech, setting up the thresholds that are used to determine whether the foot differences should be considered errors or acceptable variations is still under investigation in the linguistic research area.

We found that, in practice, only giving feedback on rhythm errors was not very useful, because the prerequisite is that all the stressed vowels in a user's speech and in the target speech must be correctly matched. It was too hard to achieve this for two reasons: first, it is not easy for ESL speakers to produce matched stress status on all corresponding vowels without enough practice; and second, our speech recogniser and stress detector did not provide 100% accuracy, which means that even though a user's stress pattern actually matches the target, it may still be reported as having stress errors. Therefore, in addition to the rhythm error identification based on a strictly matched stress pattern, we also provided two other timing measurements — target vowel to vowel and target stressed to stressed — in our visualisation tools in order to help users get a better sense of their stress and rhythm problems. The usefulness of the two additional timing measurements needs to be further investigated by linguistic researchers.

6.2 Future Work

Possible improvements to the performance of Span are discussed in this section.

6.2.1 Improving the Forced Alignment System

The word pronunciation dictionary we use currently contains some alternative pronunciations of some words. These alternative pronunciations reflect variations in pronunciation that are acceptable to native speakers. Clearly, we need to continue augmenting the dictionary with other alternative pronunciations that are acceptable to native speakers. Although we augmented a small number of words in the dictionary for Mandarin speakers in a small experiment (see page 76), the dictionary does not include the common pronunciation mistakes of native speakers, nor most of the mispronunciations of non-native speakers who are currently learning English. These mispronunciations cause auto-labelling errors in our system.

Figure 6.1 shows an example of a non-native speaker mispronouncing the word pretend. The dictionary pronunciation is /p r I t E n d/, but the speaker mispronounced it as /p t E n @ t/: she missed the /r I/ and mispronounced the final phoneme /d/ as /t/. The top viewport is the spectrogram of the sound of the word pretend as mispronounced by the speaker; the middle viewport shows the hand-labelling of the sound waveform, and the bottom viewport shows the auto-labelling of the same sound waveform. Clearly, the boundaries of the auto-labelled phonemes have been badly affected by the missing and mispronounced phonemes.

Figure 6.1: Auto-labelling errors of the forced alignment system. (Hand-labelling: p t E n @ t; auto-labelling: p r I t E n d.)

Missing certain phonemes (such as the /r/ above) and adding phonemes such as /@/ at word boundaries and within consonant sequences are common mistakes among ESL students, and our system needs to be able to deal with them more effectively. However, adding all the possible alternative pronunciations directly to the dictionary would make the dictionary very large, and would also greatly increase the complexity of the HMMs built by the forced alignment speech recogniser. This increased complexity would have unacceptable consequences for the computation speed of the recogniser.

We intend to build a model of the kinds of deletions, insertions, and substitutions that are common in the speech of Mandarin ESL speakers, and use this model to dynamically construct a better phoneme network that allows the recogniser to deal with insertion and deletion errors more gracefully.

6.2.2 Constructing A New Phoneme HMM Architecture

As pointed out in section 6.1.1, the standard three-state left-to-right phoneme HMM is also a possible source of phoneme boundary errors. We believe that the "stealing" problem can be addressed by a more robust HMM architecture in which some or all of the states can be skipped. This enhanced HMM is shown in Figure 6.2. With this enhanced HMM, we expect that the forced alignment based speech recogniser could handle missing phonemes by matching the phoneme against a zero length segment of the speech signal.
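As a loose illustration with made-up probabilities, the transition matrix below describes such a topology: the entry state can jump directly into any emitting state or straight to the exit, which is what would let a missing phoneme be matched against a zero-length segment.

```python
import numpy as np

# Rows/columns: entry, state 1, state 2, state 3, exit.  The non-zero entries
# beyond the first upper off-diagonal allow states (or the whole phoneme) to
# be skipped; a standard left-to-right model would not have them.
trans = np.array([
    # entry  s1    s2    s3    exit
    [0.00,  0.70, 0.15, 0.10, 0.05],   # entry: may skip into any state or straight out
    [0.00,  0.60, 0.25, 0.10, 0.05],   # state 1
    [0.00,  0.00, 0.60, 0.30, 0.10],   # state 2
    [0.00,  0.00, 0.00, 0.70, 0.30],   # state 3
    [0.00,  0.00, 0.00, 0.00, 0.00],   # exit (non-emitting)
])
# Each emitting (and entry) row is a probability distribution.
assert np.allclose(trans[:-1].sum(axis=1), 1.0)
```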

For long phonemes, particularly diphthongs, there may be more variation than can be captured well by just three states. We will consider HMMs with more than three states for such vowels.

Figure 6.2: A new three-state HMM.

6.2.3 Adding A New Breathing HMM

In addition, the current HMM design deals with the short pause /sp/ and the silence /sil/ but does not model breathing properly: an audible breath has energy, and we do not have an HMM for breathing. Short periods of breathing currently confuse the recogniser and cause it to label the phoneme boundaries badly.

Figure 6.3 shows an example of auto-labelling errors resulting from a short breath and an inserted phoneme. The first panel presents the spectrogram of a speech signal consisting of the word but preceded by a short breath and followed by an inserted /@/, produced by a female ESL student. The second panel shows the sound waveform and the hand-labelling. The third panel presents the auto-labelling generated by the forced alignment system. As can be seen from the figure, the forced alignment system placed most of the boundaries quite inappropriately.

Figure 6.3: Auto-labelling errors with a breath and an inserted phoneme. (Hand-labelling: Breath b V t @; auto-labelling: sp b V t.)

Audible breathing has less of an impact on the native speaker data, as these speakers were largely able to read the sentences fluently on a single intake of breath. Non-native speakers usually hesitate considerably more, and therefore periods of breathing will be much more common. We will add an HMM model for breathing to our system to improve the system performance.

6.2.4 Others

There are a number of limitations in this work that need to be addressed in the future.


• We were surprised that the pitch features were not particularly useful for automatic rhythmic stress detection. This is contrary to our discussions with linguistic researchers [77]. It is possible that our pitch features were not sufficiently good, and/or that the pitch normalisation was not quite adequate for our data. In the future, we will examine better pitch detection and calculation algorithms, and investigate better normalisation methods.

• We suspect that the current duration and amplitude feature normalisation methods were not sufficient. We will explore other, better normalisation methods for rhythmic stress detection.

• Although we looked at several combinations of prosodic feature sets and vowel quality feature sets, we did not evaluate each individual feature. Future work needs to study each of these features more exhaustively to identify which individual features would contribute to better performance in rhythmic stress detection. If an exhaustive search is too expensive, we will explore the use of appropriate feature selection algorithms.

• We used C4.5 and LIBSVM to construct the automatic rhythmic stress detector. We need to explore other techniques, including genetic programming and neural networks, as well as other DT constructors and other varieties of SVM.

• Once Peco has been completed, we will be able to evaluate the error identifier and further develop the stress and rhythm error identification methods to improve the performance of the whole system.

Bibliography

[1] ABERCROMBIE, D. Elements of general phonetics. Aldine Pub. Co., Chicago, 1967.

[2] ADAMS, C., AND MUNRO, R. R. In search of the acoustic correlates of stress: Fundamental frequency, amplitude, and duration in the connected utterance of some native and non-native speakers of English. Phonetica 35 (1978), 125–156.

[3] ALLEN, G. The location of rhythmic stress beats in English: An experimental study. Part II. Language and Speech 15 (1972), 179–195.

[4] ALTENBERG, B. Predicting text segmentation into tone-units. In The London-Lund corpus of spoken English: description and research, J. Svartvik, Ed. Lund University Press, 1990, pp. 275–286.

[5] AULL, A. M., AND ZUE, V. W. Lexical stress determination and its application to speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (1985), pp. 1549–1552.

[6] BEEFERMAN, S. The rhythm of lexical stress in prose. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (San Francisco, 1996), A. Joshi and M. Palmer, Eds., Morgan Kaufmann Publishers, pp. 302–309.

[7] BERNTHAL, J. E., AND BANKSON, N. W. Articulation and phonological disorders. Prentice Hall, New Jersey, 1988.

[8] BOCCHIERI, E., RICCARDI, G., AND ANANTHARAMAN, J. The 1994 AT&T ATIS chronus recognizer. Freely available on the Web at the page http://www.research.att.com/˜dsp3/slt95.ps, 1994.

[9] BOERSMA, P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences (Amsterdam, 1993), vol. 17, pp. 97–110.

[10] BOLINGER, D. Pitch accent and sentence rhythm. In Forms of English: Accent, morpheme, order, I. Abe and T. Kanekiyo, Eds. Harvard University Press, Cambridge, MA., 1965, pp. 139–180.

[11] BOLINGER, D. L. Two kinds of vowels, two kinds of rhythm. Indiana University Linguistics Club, Bloomington, 1981.

[12] CCITT. Recommendation G.711: Pulse code modulation (PCM) of voice frequencies, 1988.

[13] CHANG, C.-C., AND LIN, C.-J. LIBSVM: a library for support vector machines. Freely available on the Web at the page http://www.csie.ntu.edu.tw/˜cjlin/papers/libsvm.pdf, 2003.

[14] CHISTOVICH, L., VENTZOV, A., AND M., G. The speech perception by human. Nauka, Leningrad, 1976.

[15] COLE, R. A., MARIANI, J., USZKOREIT, H., ZAENEN, A., AND ZUE, V., Eds. Survey of the State of the Art in Human Language Technology. Cambridge University Press, 1996.

[16] CORTES, C., AND VAPNIK, V. Support-vector network. Machine learning 20 (1995), 273–297.

[17] COUPER-KUHLEN, E. An introduction to English prosody. Edward Arnold, London, 1986.

[18] CRUTTENDEN, A. Intonation, second ed. Cambridge University Press, Cambridge, 1997.

[19] DALE, R. Tech858: Building spoken language dialog systems using VoiceXML. Freely available on the Web at the page http://www.comp.mq.edu.au/units/tech858/Lectures/tech858-2002-03-12.pdf, 2002.

[20] DAUER, R. Stress-timing and syllable-timing reanalyzed. Journal of Phonetics 11 (1983), 51–62.

[21] DAVIS, K. H., BIDDULPH, R., AND BALASHEK, S. Automatic recognition of spoken digits. Journal of the Acoustical Society of America 24(6) (1952), 637–642.

[22] DAVIS, S. B., AND MERMELSTEIN, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In IEEE Transactions on Acoustics, Speech and Signal Processing (1980), vol. 28, pp. 357–366.

[23] DEN OS, E. A., BOOGAART, T. I., BOVES, L., AND KLABBERS, E. The Dutch Polyphone Corpus. In Proceedings of Eurospeech '95 (Madrid, 1995), pp. 825–828.

[24] ELLIS, D. ICSI Speech FAQ. Freely available on the Web at the page http://www.icsi.berkeley.edu/speech/faq, 2000.

[25] ESKENAZI, M. Detection of foreign speakers' pronunciation errors for second language training – preliminary results. In International Conference on Spoken Language Processing '96 (Philadelphia, PA, 1996), vol. 3, pp. 1465–1468.

[26] FISHER, D. H., PAZZANI, M., AND LANGLEY, P., Eds. Concept Formation: Knowledge and Experience in Unsupervised Learning. Morgan Kaufmann, 1991, ch. 1.

[27] FORGIE, J. W., AND FORGIE, C. D. Results Obtained From a Vowel Recognition Computer Program. Journal of the Acoustical Society of America 31(11) (1959), 1480–1489.

[28] FREIJ, G., FALLSIDE, F., HOEQUIST, C., AND NOLAN, F. Lexical stress estimation and phonological knowledge. Computer Speech and Language 4, 1 (1990), 1–15.

[29] FRY, D. B. Experiments in the perception of stress. Language and Speech 1 (1958), 126–152.

[30] HARDCASTLE, W. J., AND LAVER, J., Eds. The handbook of phonetic sciences, first ed. Blackwell Publishers Ltd, Oxford, 1997.

[31] HENERY, R. Classification. In Machine Learning, Neural and Statistical Classification, D. Michie, D. Spiegelhalter, and C. Taylor, Eds. Ellis Horwood, 1994, ch. 2.

[32] HERMANSKY, H. Perceptual linear predictive (PLP) analysis for speech. In Journal of the Acoustical Society of America (1990), vol. 87(4), pp. 1738–1752.

[33] HOWITT, W. Linear Predictive Coding (LPC). Freely available on the Web at the page http://www.otolith.com/otolith/olt/lpc.html, 1995.

[34] HUNT, A. Comp Speech FAQ. Freely available on the Web at the page http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.1.html, 1997.

[35] HUNT, M. J. Signal Representation. In Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Cambridge University Press, 1996, ch. 1, p. 13.

[36] IRINO, T., MINAMI, Y., NAKATANI, T., TSUZAKI, M., AND TAGAWA, H. Evaluation of a speech recognition/generation method based on HMM and STRAIGHT. In Proceedings of the International Conference on Spoken Language Processing (Denver, 2002), vol. 4, pp. 2545–2548.

[37] ITAKURA, F. Minimum prediction residual applied to speech recognition. In IEEE Transactions on Acoustics, Speech and Signal Processing (1975), vol. 23(1), pp. 67–72.

[38] JAIN, A. K., MAO, J., AND MOHIUDDIN, K. M. Artificial neural networks: A tutorial. IEEE Computer 29, 3 (1996), 31–44.

[39] JENKIN, K. L., AND SCORDILIS, M. S. Development and comparison of three syllable stress classifiers. In Proceedings of the International Conference on Spoken Language Processing (Philadelphia, USA, 1996), pp. 733–736.

[40] KAWAHARA, H., MASUDA-KATSUSE, I., AND DE CHEVEIGNE, A. Restructuring speech representations using a pitch adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication 27 (1999), 187–207.

[41] KOZA, J. R. Genetic Programming — On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, 1992.

[42] LADEFOGED, P. Three Areas of experimental phonetics. Oxford University Press, London, 1967.

[43] LADEFOGED, P. A Course in Phonetics, third ed. Harcourt Brace Jovanovich, New York, 1993.

[44] LADEFOGED, P., AND MADDIESON, I. Vowels of the world's languages. Journal of Phonetics 18 (1990), 93–122.

[45] LIBERMAN, M., AND PRINCE, A. On stress and linguistic rhythm. Linguistic Inquiry 8 (1977), 249–336.

[46] LIEBERMAN, P. Some acoustic correlates of word stress in American English. Journal of the Acoustical Society of America 32 (1960), 451–454.

[47] LINDGREN, A. C., JOHNSON, M. T., AND POVINELLI, R. J. Speech recognition using reconstructed phase space features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2003).

[48] MAKHOUL, J. DSP Techniques. In Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Cambridge University Press, 1996, ch. 11, p. 402.

[49] MARKEL, J., AND GRAY, A. Linear Prediction of Speech. Springer-Verlag, Berlin, 1976.

[50] MATEESCU, D. English phonetics and phonological theory. Freely available on the Web at the page http://www.unibuc.ro/eBooks/filologie/mateescu, 2003.

[51] MICHALSKI, R. S., CARBONELL, J. G., AND MITCHELL, T. M. Machine Learning, An Artificial Intelligence Approach. Tioga Publishing Company, California, 1983.

[52] MITCHELL, T. M. Machine learning. McGraw Hill, 1997.

[53] MORTON, J., MARCUS, S., AND FRANKISH, C. Perceptual centers (P-Centers). Journal of Pragmatics 83 (1976), 405–408.

[54] MULLER, B., REINHARDT, J., AND STRICKLAND, M. T. Neural Networks: An Introduction, 2nd ed. Springer-Verlag, Berlin Heidelberg, Germany, 1995.

[55] NEEDLEMAN, S. B., AND WUNSCH, C. B. A general method appli-cable to the search for similarities in the amino acid sequences of twoproteins. Journal of Molecular Biology 48 (1970), 443–453.

[56] O’DELL, M. L., AND NIEMINEN, T. Coupled oscillator model of speech rhythm. In Proceedings of the XIVth International Congress of Phonetic Sciences (1999), vol. 2, pp. 1075–1078.

[57] PELLOM, B., AND HANSEN, J. Automatic segmentation of speech recorded in unknown noisy channel characteristics. Speech Communication 25 (August 1998), 97–116.

[58] PELLOM, B. L. Enhancement, Segmentation, and Synthesis of Speech with Application to Robust Speaker Recognition. PhD thesis, Duke University, Durham, North Carolina, 1998.

[59] PENNINGTON, M. C. Phonology in English language teaching: An international approach. Longman, London, 1996.

[60] PIKE, K. L. Phonetics: A critical analysis of phonetic theory and a technic for the practical description of sounds. University of Michigan Press, Ann Arbor, MI, 1943.

[61] PIKE, K. L. The intonation of American English. University of Michigan Press, Ann Arbor, MI, 1945.

[62] QUINLAN, J. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.

[63] RABINER, L. R., LEVINSON, S. E., ROSENBERG, A. E., AND WILPON, J. G. Speaker independent recognition of isolated words using clustering techniques. In IEEE Transactions on Acoustics, Speech and Signal Processing (1979), vol. 27, pp. 336–349.

[64] RAPP, S. Automatic phonemic transcription and linguistic annotation from known text with hidden Markov models / an aligner for German. In Proceedings of ELSNET Goes East and IMACS Workshop (Moscow, Russia, 1995), pp. 152–163.

[65] RIIS, S. K. Hidden Markov Models and Neural Networks for Speech Recognition. PhD thesis, Technical University of Denmark, 1998.

[66] ROACH, P. On the distinction between “stress-timed” and “syllable-timed” languages. In Linguistic controversies, D. Crystal, Ed. Edward Arnold, London, 1982.

[67] SAKOE, H., AND CHIBA, S. Dynamic programming algorithm optimization for spoken word recognition. In IEEE Transactions on Acoustics, Speech and Signal Processing (1978), vol. 26(1), pp. 43–49.

[68] SJÖLANDER, K. An HMM-based system for automatic segmentation and alignment of speech. In Proceedings of Fonetik (2003), pp. 93–96.

[69] STUTTLE, M., AND GALES, M. A mixture of Gaussians front end for speech recognition. In Proceedings of Eurospeech (2001).

[70] TEBELSKIS, J. Speech Recognition using Neural Networks. PhD thesis, Carnegie Mellon University, 1995.

[71] TSAI, M.-Y., CHOU, F.-C., AND LEE, L.-S. Improved pronunciation modeling by properly integrating better approaches for baseform generation, ranking and pruning. In International Speech Communication Association Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation (Estes Park, Colorado, USA, September 2002).

[72] VAN KUIJK, D., AND BOVES, L. Acoustic characteristics of lexical stress in continuous speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (Munich, Germany, 1999), vol. 3, pp. 1655–1658.

[73] VAN VUUREN, S. Comparison of text-independent speaker recognition methods on telephone speech with acoustic mismatch. In International Conference on Spoken Language Processing (1996), vol. 3, p. 1788.

[74] VELICHKO, V. M., AND ZAGORUYKO, N. G. Automatic recognition of 200 words. International Journal of Man-Machine Studies 2 (1970), 223.

[75] WAIBEL, A. Recognition of lexical stress in a continuous speech system - a pattern recognition approach. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (Tokyo, Japan, 1986), pp. 2287–2290.

[76] WANG, X., AND POLS, L. C. A preliminary study about robust speech recognition for a robotics application. In Proceedings of the Institute of Phonetic Sciences (Amsterdam, 1997), pp. 11–20.

[77] WARREN, P., CRABBE, D., AND ELGORT, I. Personal communication, 2003.

[78] WATSON, C. I., HARRINGTON, J., AND EVANS, Z. An Acoustic Comparison between New Zealand and Australian English Vowels. Australian Journal of Linguistics 18(2) (1998), 185–207.

[79] WIGHTMAN, C., AND TALKIN, D. The aligner: Text-to-speech alignment using Markov models. In Progress in Speech Synthesis, J. P. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, Eds. Springer-Verlag, New York, 1997, pp. 313–320.

[80] WIGHTMAN, C. W. Automatic detection of prosodic constituents for parsing. PhD thesis, Boston University, 1992.

[81] WITTEN, I. H., AND FRANK, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Diego, CA, 2000.

[82] XIE, H., ANDREAE, P., ZHANG, M., AND WARREN, P. Detecting stress in spoken English using decision trees and support vector machines. Australian Computer Science Communications (Data Mining, CRPIT 32) 26 (January 2004), 145–150.

[83] XIE, H., ANDREAE, P., ZHANG, M., AND WARREN, P. Learning models for English speech recognition. Australian Computer Science Communications (Computer Science, CRPIT 26) 26 (January 2004), 323–330.

[84] YING, G. S., JAMIESON, L. H., CHEN, R., MITCHELL, C. D., AND LIU, H. Lexical stress detection on stress-minimal word pairs. In Proceedings of the International Conference on Spoken Language Processing (1996), pp. 1612–1615.

[85] YOUNG, S., EVERMANN, G., KERSHAW, D., MOORE, G., ODELL, J., OLLASON, D., VALTCHEV, V., AND WOODLAND, P. The HTK Book (for HTK version 3.2). http://htk.eng.cam.ac.uk/prot-docs/htk_book.shtml, 2002.

[86] ZWICKER, E., FLOTTORP, G., AND STEVENS, S. Critical bandwidth in loudness summation. Journal of the Acoustical Society of America 29 (1957), 548–557.

Appendix A

IPA Symbols and NZSED Labels

The following tables list the NZSED labels used in this thesis, together with sample words illustrating each sound; consonants and vowels are shown separately. (The IPA symbols of the original table are not reproducible in this plain-text rendering.)

Consonants

NZSED Label   Sample Words
p             pen, copy, happen
b             back, bubble, job
t             tea, tight, button
d             day, ladder, odd
k             key, cock, school
g             get, giggle, ghost
tS            church, match, nature
dZ            judge, age, soldier
f             fat, coffee, rough, physics
v             view, heavy, move
T             thing, author, path
D             this, other, smooth
s             soon, cease, sister
z             zero, zone, roses, buzz
S             ship, sure, station
Z             pleasure, vision
h             hot, whole, behind
m             more, hammer, sum
n             nice, know, funny, sun
N             ring, long
l             light, valley, feel
r             right, arrange
j             yet, use
w             wet, one, when, queen

Vowels

NZSED Label   Sample Words
I             kit, bid, hymn
E             dress, bed
A             trap, bad
O             lot, odd, wash
V             strut, bud, love
U             foot, good, put
i:            fleece, sea, machine
ei            face, day, steak
ai            price, high, try
oi            choice, boy
u:            goose, two, blue
@u            goat, show, no
au            mouth, now
i@            here, serious
e:            fair, various
a:            start, father
o:            thought, law, north, war
u@            cure, jury
@:            nurse, stir
@             about, coma, common
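As an illustration only (not part of the original system), the label set above can be held in a simple lookup table. The sketch below is a minimal Python example, assuming a dictionary representation of a subset of the NZSED labels and their sample words; the names NZSED_EXAMPLES and example_words are hypothetical.

    # Minimal sketch: mapping a few NZSED phone labels (from the table above)
    # to their sample words. Labels and words are copied from the table; the
    # dictionary and function names are illustrative only.
    NZSED_EXAMPLES = {
        # consonants (subset)
        "p": "pen, copy, happen",
        "tS": "church, match, nature",
        "dZ": "judge, age, soldier",
        "N": "ring, long",
        # vowels (subset)
        "I": "kit, bid, hymn",
        "i:": "fleece, sea, machine",
        "@u": "goat, show, no",
        "@": "about, coma, common",
    }

    def example_words(label):
        """Return the sample words for an NZSED phone label, or None if unknown."""
        return NZSED_EXAMPLES.get(label)

    if __name__ == "__main__":
        for lab in ("tS", "i:", "@"):
            print(lab, "->", example_words(lab))

Such a table could, for example, be used when feedback refers a learner to familiar example words for a given phone label.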
