
Representing Time in Automated Speech Recognition

David Richard Llewellyn Davies

March 2002

A thesis submitted for the degree of Doctor of Philosophy of The Australian National University


Unless otherwise indicated the work presented in this thesis is entirely my own.

D. Davies

_____________

Acknowledgements

I would like to thank all those friends and colleagues who have helped in so many ways to make the completion of this thesis possible. To the one who made the greatest sacrifice, my daughter Annwn, I can only offer my love and wish her well in her own endeavours. To my supervisor Bruce Millar my thanks for providing the initial opportunity for me to develop my interests in speech, for developing the resources that made attempting this work possible, for his insights into the complex interdisciplinary nature of speech research and for his consistent guidance and encouragement during the development of this thesis and in helping to develop coherence from chaos in its presentation. I am indebted to the Computer Sciences Laboratory for access to its excellent facilities and in particular to Arthur McGuffin, Dennis Andriolo, Joe Elso for their technical support, Michelle Moravec for her friendship and patient assistance with administrative matters and to Markus Hegland for his valuable discussions on Fourier Transforms. Finally, I would like to gratefully acknowledge the financial assistance of an Australian Postgraduate Award.


Contents

1 Introduction

2 Time in Speech and ASR
2.1 Introduction
2.1.1 The Potential of ASR
2.1.2 Motivations
2.1.3 Temporal perspectives
2.2 Time in speech production and perception
2.2.1 General considerations
2.2.2 Source Synchronous Timing
2.2.3 Phoneme Duration
2.2.4 Dynamical Cues in the Frequency Domain
2.3 Time and Hidden Markov Models
2.4 Time and Artificial Neural Networks
2.5 Coding temporal information
2.6 Serial/parallel processing
2.7 Summary of time in speech and ASR

3 Objectives and Rationales
3.1 Introduction
3.2 Spectral Analysis
3.3 Source Synchronous Framing
3.4 The parameter similarity length
3.5 Acoustic-Phonetic Association Tables
3.5.1 The choice
3.5.2 Table input bandwidth
3.5.3 Acoustic-phonetic association rankings
3.6 Experimental Objectives
3.6.1 Smoothing and outlier constraint
3.6.2 Similarity lengths and differentials
3.6.3 Similarity lengths and duration
3.6.4 Voice Onset Transitions

4 Systems and Methods
4.1 Introduction
4.2 Development Platform
4.3 SLab Structure and Operational Overview
4.4 The Speech Data
4.5 Source Synchronous Framing
4.6 The Fourier Transforms
4.7 The Formant Picker Algorithm


4.8 Base Acoustic Parameters
4.9 Acoustic Parameter Scaling
4.10 Temporal Parameter Modifiers
4.10.1 Time Differentials
4.10.2 The Parameter Similarity Length
4.10.3 Smoothing and outlier constraint
4.10.4 Other Modifiers
4.11 The Association Table
4.12 Parameter Histograms and Temporal Slices
4.13 Discrimination Potentials

5 Results and Analysis
5.1 Introduction
5.2 Impact of smoothing and outlier constraint
5.3 Similarity lengths and differentials
5.3.1 The temporal range of PSLs and differentials
5.3.2 Comparison of PSLs and Differentials
5.4 Similarity lengths and duration
5.4.1 Duration-PSL comparison for all phonemes
5.4.2 Duration in phonetic contexts
5.5 CV contexts and F2 transitions

6 Conclusions and Projections
6.1 Summary of results
6.1.1 Temporal resolution and smoothing
6.1.2 PSLs and time differentials
6.1.3 PSLs and duration
6.1.4 PSLs and duration in context
6.1.5 PSLs and F2 transitions
6.2 Methodological objectives
6.3 Future exploration of the acoustic space
6.4 Conclusions


Appendices

A1 SLab system
A2 Data specifications
A2.1 Phonetic labels
A2.2 Speaker information
A3 Algorithms
A4 Frequency domain smoothing
A5 Formant histograms
A6 Tx and Tx.Sim time streams
A7 Parameter smoothing and constraint
A8 Differential-PSL comparisons

Bibliography


The leading idea which is present in all our researches, and which accompanies every fresh observation, the sound which to the ear of the student of Nature seems continually echoed from every part of her works, is - Time! Time! Time!

George P. Scrope, 1827

Abstract

This thesis explores the treatment of temporal information in Automated Speech Recognition. It reviews the study of time in speech perception and concludes that, while some temporal information in the speech signal is of crucial value in the speech decoding process, not all temporal information is relevant to decoding. We then review the representation of temporal information in the main automated recognition techniques: Hidden Markov Models and Artificial Neural Networks. We find that both techniques have difficulty representing the type of temporal information that is phonetically or phonologically significant in the speech signal.

In an attempt to improve this situation we explore the problem of representation of temporal information in the acoustic vectors commonly used to encode the speech acoustic signal in the front-ends of speech recognition systems. We attempt, where possible, to let the signal provide the temporal structure rather than imposing a fixed, clock-based timing framework. We develop a novel acoustic temporal parameter (the Parameter Similarity Length), a measure of temporal stability, that is tested against the time derivatives of acoustic parameters conventionally used in acoustic vectors.

While the Similarity Length is directly applicable to conventional recognition techniques, our analysis of these techniques points to the development of an approach to recognition that might prove more flexible and efficient in its ability to deal with temporal issues. We outline the requirements of such a system and, in conformity with these requirements, develop an approach to evaluating the Similarity Lengths. We find that the Similarity Lengths generally provide a stronger association between the acoustic vector space and the phonetic space than do conventional parameter derivatives.


Glossary of Abbreviations

A-P Acoustic-Phonetic
ANN Artificial Neural Networks
ASR Automated Speech Recognition
AV Acoustic Vector
CV Consonant-Vowel
DFT Discrete Fourier Transform
DSP Digital Signal Processing
DTW Dynamic Time Warping
EBP Error Back Propagation
FB Filter Bank
F0 Glottal frequency or pitch
F1, F2, F3 The first three formants
FHE Fundamental Harmonic Extraction
FFT Fast Fourier Transform
FIR Finite Impulse Response
GUI Graphical User Interface
HMM Hidden Markov Models
IGC Instant of Glottal Closure
IIR Infinite Impulse Response
k-NN k Nearest Neighbour
LA Log Area
LAR Log Area Ratio
LVQ Linear Vector Quantisation
MARS Multivariate Adaptive Regression Splines
NLP Natural Language Processing
PSL Parameter Similarity Length
RC Reflection Coefficients
SS Source Synchronous
SSF Source Synchronous Framing
SV Stop-Vowel
SVD Singular Value Decomposition
TD Temporal Decomposition
TDNN Time Delay Neural Network
TRLI Time Reversed Leaky Integrator
VCV Vowel-Consonant-Vowel
VOT Voice Onset Time


Chapter 1

Introduction

The process of Automated Speech Recognition (ASR) can be characterised, at its simplest, as consisting of two stages. In the first stage the speech signal is partitioned into frames, typically 10 to 20ms long. Spectral and energy parameters are generated for each frame and combined in an acoustic vector (AV) that represents the frame. In the second stage a time stream of acoustic vectors is presented to a matching algorithm that attempts to generate a symbol stream at the phonemic to lexical level. Acoustic parameters may include: spectral parameters such as formant frequencies, energies or bandwidths; Cepstral coefficients; spectral band energy or energy ratio parameters; and non-spectral parameters such as frame energy and glottal period.
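
As a point of reference for the discussion that follows, the sketch below illustrates this first stage in minimal form: fixed-length, clock-based framing followed by a very simple spectral parameterisation. The frame length, frame shift, window and band-energy parameters are illustrative assumptions only; they are not the configuration used in the experimental work of this thesis.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=20.0, shift_ms=10.0):
    """Partition a signal into fixed-length, overlapping frames (clock-based framing)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len] for i in range(n_frames)])

def acoustic_vectors(frames, fs, n_bands=8):
    """Build a simple acoustic vector per frame: log frame energy plus log band energies."""
    avs = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        bands = np.array_split(spectrum, n_bands)          # crude filter bank
        band_energy = np.log(np.array([b.sum() for b in bands]) + 1e-12)
        frame_energy = np.log(spectrum.sum() + 1e-12)
        avs.append(np.concatenate(([frame_energy], band_energy)))
    return np.stack(avs)                                    # one AV per frame

if __name__ == "__main__":
    fs = 20000                                              # 50 microsecond sampling period
    t = np.arange(fs) / fs
    speech_like = np.sin(2 * np.pi * 120 * t) * np.sin(2 * np.pi * 700 * t)
    avs = acoustic_vectors(frame_signal(speech_like, fs), fs)
    print(avs.shape)                                        # (n_frames, 1 + n_bands)
```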

The primary focus of this thesis is the first stage - specifically the representation of temporal contextual information in the acoustic vector stream. Spectral information from an individual frame may tell us little about the phonetic context of the frame without some additional information on the acoustic context derived from neighbouring frames. Conventionally, the matching stage develops temporal context by looking at more than one AV or by combining temporal information into the AV in the form of parameter time derivatives. These derivatives provide a very limited and generalised representation of context and take up as much as two thirds of the information bandwidth of the AV.

Here we look at replacing this representation with temporal encodings that can provide a broader and more flexible information framework: ones that can be tailored to specific Acoustic-Phonetic (A-P) contexts. The motivation comes from the hypothesis that the second stage matching process can be simplified if whatever relevant acoustic contextual information can be extracted from the signal is pre-determined and coded into the AV stream, leaving the matching stage to focus on other domains of contextual information, which can include phonetic context generated from neighbouring AVs or the phonological, semantic and syntactic information generated from language models.

Freed of the task of acoustic processing the matching algorithm can be made ‘smarter’, using the phonetic context to request relevant acoustic contextual information from the acoustic processing stage in an iterative refinement of its phonetic or lexical hypotheses.

From the literature reviews of Chapter 2 we see that segmental duration is both important perceptually and a source of problems in conventional ASR systems. The task set for the experimental work proposed in Chapter 3 and described in Chapter 4 is to evaluate a proxy for segmental duration based on acoustic continuity.


Acoustic parameter time differentials, in their simplest form just the differences between acoustic parameter values in temporally adjacent acoustic vectors, can be seen as local measures of discontinuity, measuring the degree of parameter change over some given time interval. The temporal parameter chosen for evaluation in this thesis, the Parameter Similarity Length (PSL), is defined as the temporal range over which a parameter's value remains within some given value range. While on the surface this might appear to be just the reciprocal of the time derivative, it is shown to have qualitatively different properties that can both surpass and complement information provided by differentials. Related approaches are reviewed in Chapter 2.
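
The following sketch contrasts the two measures on a single parameter track: a first difference between adjacent frames, and a PSL computed as the number of frames (searched backwards and forwards) over which the parameter stays within a tolerance band around its current value. The symmetric search and the tolerance value are illustrative assumptions; the PSL variants actually used are defined in Chapter 4.

```python
import numpy as np

def differentials(track):
    """First-order time differentials: parameter change between adjacent frames."""
    return np.diff(track, prepend=track[0])

def similarity_lengths(track, tolerance):
    """Parameter Similarity Length per frame: span of neighbouring frames whose
    values stay within +/- tolerance of the current frame's value (a sketch only)."""
    n = len(track)
    psl = np.zeros(n, dtype=int)
    for i in range(n):
        left = i
        while left > 0 and abs(track[left - 1] - track[i]) <= tolerance:
            left -= 1
        right = i
        while right < n - 1 and abs(track[right + 1] - track[i]) <= tolerance:
            right += 1
        psl[i] = right - left + 1          # length of the similar region, in frames
    return psl

if __name__ == "__main__":
    f2 = np.array([1500, 1520, 1510, 1800, 2100, 2120, 2110, 2105, 2100], dtype=float)
    print(differentials(f2))               # large values flag local discontinuity
    print(similarity_lengths(f2, 50.0))    # long values flag sustained stability
```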

The PSLs are evaluated in five sets of experiments: three provide general evaluations of the PSL and two are based in acoustic-phonetic contexts that are known from human speech perception experiments to contain phonetically relevant temporal information. Specifically, we look at the role of segmental duration in the vowel length decision, the compensatory lengthening of the vowel in pre-stop contexts, the role of voice-onset time in the stop voicing and place decisions and, finally, the role of short-term, source synchronous formant transitions in the stop place decision.

In the present work the second (matching) stage is implemented using a simple, fast Bayesian approach that generates A-P association strengths for individual acoustic vectors but does not attempt a full recognition task. The high speed or low computational complexity of this approach allows rapid re-estimation for system optimisation and data exploration. The generality of the association strengths produced allows the evaluation of temporal codings in a form that is applicable to conventional Hidden Markov Model (HMM) or Artificial Neural Network (ANN) matching algorithms.
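
A minimal sketch of such a Bayesian association stage, assuming quantised acoustic parameters and phonemically labelled frames: counts are accumulated in a table and converted into posterior association strengths P(phoneme | cell). The quantisation scheme, the smoothing constant and the example values are assumptions for illustration; the actual table construction is described in Chapter 4.

```python
from collections import Counter, defaultdict

class AssociationTable:
    """Counts co-occurrences of quantised acoustic-vector cells and phoneme labels,
    and returns simple Bayesian association strengths P(phoneme | cell)."""

    def __init__(self, smoothing=1.0):
        self.cell_counts = defaultdict(Counter)   # cell -> Counter of phoneme labels
        self.smoothing = smoothing
        self.phonemes = set()

    def add(self, cell, phoneme):
        self.cell_counts[cell][phoneme] += 1
        self.phonemes.add(phoneme)

    def association(self, cell):
        counts = self.cell_counts[cell]
        total = sum(counts.values()) + self.smoothing * len(self.phonemes)
        return {p: (counts[p] + self.smoothing) / total for p in sorted(self.phonemes)}

def quantise(av, step=100.0):
    """Map a continuous acoustic vector to a coarse table cell (illustrative only)."""
    return tuple(int(x // step) for x in av)

if __name__ == "__main__":
    table = AssociationTable()
    table.add(quantise((2100.0, 650.0)), "i:")     # hypothetical (F2, F1) values
    table.add(quantise((2100.0, 660.0)), "i:")
    table.add(quantise((1100.0, 750.0)), "a:")
    print(table.association(quantise((2100.0, 655.0))))
```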

The analysis in Chapter 2 of temporal processing in HMM and ANN systems leads to the conclusion that both these approaches have fundamental problems in dealing with temporal issues. In some respects the work reported here assumes a more flexible ASR system that can invoke specialised HMM, ANN or other matching algorithms to suit the A-P context.

Chapter 4 outlines the experimental approaches taken. Strong emphasis is placed on allowing the signal to define time. To this end we use source synchronous timing markers to partition the signal into frames and, as our central goal, explore approaches to generating segment level continuity measures from the signal to act as acoustic proxies for segment duration. We also look at the value of combinations of parameter derivatives and PSLs in representing regions of sustained slope or curvature.

The results presented in Chapter 5 and summarised in Chapter 6 generally support the approach taken. Specifically, they validate the use of PSLs as a method of encoding temporal information, either alone or in combination with parameter time derivatives.


Chapter 2

Review of Time in Speech and ASR

2.1 Introduction and motivations

This chapter provides the background for the work described in this thesis. Following a general introduction to ASR and a rationale for its continued development, the literature on speech perception and ASR techniques is reviewed with an emphasis on temporal issues. It is often difficult to choose which of many possible papers best represents work done in an area. Occasionally recent articles have been chosen because they provide a detailed or interesting historical perspective. Some early articles have been included to highlight turning points or major developments.

2.1.1 The Potential of ASR

The decoding of spoken language is one of the grand challenges of computing. Not only do computers provide an opportunity for expanding our theoretical knowledge of speech communication but the construction of effective Automated Speech Recognition systems opens the way for speech to be used as a means of interacting with computer systems, a development that has the potential to transform the way we use computers.

Some applications of ASR, such as dictation of text or extensions of existing computer interfaces for situations where hands-free operation is required, are widely accepted. Not so widely recognised is the role that ASR will play as a gateway or enabling technology for Natural Language Processing (NLP) systems. Speech driven NLP systems will open the way to a major transformation of the human-computer interface, break down barriers between the spoken and written word and greatly enhance the utility of natural language translation systems.

While NLP research still struggles to come to grips with the vagaries of idiolect and the fundamental irregularity of unconstrained human language, it does provide a broad and strong foundation for processing constrained speech. Most people have had experience interacting within constrained language contexts, with young children or with adults learning a new language. We usually adapt readily to these constraints and can assist the other party with simple explanations of new linguistic concepts or structures. Such is the likely future for human machine interaction.

Our understanding of natural language dialogue is at a level where sophisticated natural language agent technologies are possible. These can form the basis of new models for the human computer interface. Simple NLP has been used as an extension of the Graphical User Interface (GUI) to access menu driven command structures. Dialogue systems can potentially replace the GUI in many contexts.

The proliferation of communication technologies such as the mobile telephone and local-area wireless communications, including lightweight wireless screens, can eventually remove the need to sit at a desk cluttered with a keyboard, boxes of hardware and tangles of cables - at least for most computer usage. Along with the cables we can start to replace much of the logic now hard-coded in computer operating systems and applications with automatically verified subsets of natural language. By combining procedural and declarative language constructs, as is the case in most natural languages, we can build natural language rule-based systems that encode knowledge in languages that will still be in use in centuries to come. We can extend the concept of a database to a knowledgebase embedded in free-form text or 'found data'.

With this perspective in mind we can readily foresee transformations in the way we interact with computers and other people. Perhaps less obvious is the enhanced ability to communicate with ourselves - over time. Being able to record verbal comments at any time or place and to have these notes available in a readily accessible transcription opens the way to radical changes in the way we organise our lives. It is not unrealistic to assume that many people will eventually keep records of most of what they do or say. This has profound implications for almost every facet of human interaction.

2.1.2 Motivations

If the preceding analysis and speculation is realistic then it provides a strong motivation for developing high performance ASR not just to extend the way we interact with computers but to expand their usefulness. However, for our purposes here we need to clarify the motivation of two specific goals: the need to broaden the basic algorithmic structure of ASR systems beyond the HMM/ANN models generally used in contemporary systems and the need to address the issue of extraction and representation of phonetically significant temporal information in the speech acoustic signal.

Motivations for these goals are as follows:

1. The issue of new directions for ASR has been the subject of a recent public debate among some of the world's leading ASR researchers1. Bourlard et al. (1996), in initiating the debate, urged researchers to look at alternatives to incremental improvements in current state-of-the-art techniques, but they stay close to the conventional paradigms in their examples. Contributors such as Flanagan (1996) and Seneff (1996) suggested more adventurous exploration of new techniques. The tenor of this debate indicated considerable support within the speech research community for developing new approaches to ASR.

1 Bourlard et al.; Atal; De Mori; Flanagan; Fouri; Hatton; Hunt; Jelinek; Lippmann; Mariani et al.; Seneff, all 1996.

2. The development of a rationale for looking at temporal issues in ASR is the focus of the remainder of this chapter. We will see that there is much general evidence in the literature to support the idea that temporal coding/decoding issues are critical for the advancement of our theoretical understanding of speech and the development of high performance ASR.

The fundamental nature of time in ASR is evidenced by the fact that the techniques of Dynamic Time Warping (DTW) algorithms and HMMs are both explicitly temporal, though in different ways. We can plausibly summarise DTW as having a primary focus on duration with event sequencing as a secondary function. Conversely HMMs can be seen as having temporal sequencing as their primary function while the modelling of duration (as we shall see in Section 2.3) is a major weakness. While basic ANNs are not fundamentally temporal, connectionist methods are very flexible. Section 2.4 looks at efforts made to build temporal processing capability into ANNs.

Various approaches have been taken to encode temporal information in the AV stream. Most commonly, parameter time differentials have been used. These and other perspectives such as Temporal Decomposition and our novel approach, the PSL, are discussed in Section 2.5.

2.1.3 Temporal perspectives

Before moving on to more detailed analysis it may be helpful to establish a basic framework for considering temporal issues in speech. The following list of temporal perspectives is not intended to suggest priorities or relative importance. It is also not intended to be an exhaustive analysis since the issues relevant to this work are discussed in more detail through the remainder of this thesis.

• Time scales: We can view speech on several different scales. At the shortest scale we have the sampling period of the signal digitisation which for the speech data used here is 50µs. Arguably the fundamental time scale of voiced speech is the glottal epoch at around 5 to 10ms. This corresponds approximately to the common frame sampling intervals used in ASR systems of 5 to 20ms. Next we have a range of time scales associated with articulator movement ranging from around 20 to 50ms up to periods of several hundred milliseconds. Beyond this level we move into suprasegmental or prosodic scales where we consider factors such as sustained articulator settings or coarticulation, speech rhythm and speaking rate, taking us from hundreds of milliseconds up to typical utterance lengths of 1 to 10 seconds. Beyond this we can define another scale of global evaluation in which we consider averages, histograms or parametric models of acoustic parameter values taken over many utterances.

• Duration: Segmental duration is always likely to be a major consideration in the design of speech decoding methods though it may not always be phonemically or phonologically significant. Acoustic measures (or proxies) for phoneme duration are a central theme of the present work.

• Resolution: Temporal resolution is significant at the sampling level since the sampling rate determines an upper limit for frequency domain information. Less trivially, at a segmental level resolution involves factors such as the framing rate and acoustic parameter smoothing that are also investigated here. In this thesis we argue for the glottal epoch as the minimum unit of temporal resolution.

• Synchronisation: The present work uses source synchronous framing, breaking away from absolute time and operating in what can be seen as a natural sampling framework for voiced speech signals. Other synchronisation issues occur. Most notably, the speech decoding process itself can be viewed as attempting to align or synchronise the symbol stream of an utterance hypothesis with the parallel streams of acoustic parameters generated from the signal. Importantly, the human brain is known to process information in a highly parallel manner so issues of synchronisation are likely to be significant in the recombination of complex memory activations.

• Continuity: Just as source synchronous analysis provides a new time scale that has a monotonic but nonlinear relationship with conventional time, we can also view variations in the information content of the speech signal as defining other time scales. Spectral and temporal acoustic cues of value in decoding speech appear to be highly uneven in their temporal distribution. With any data stream displaying low information continuity, an analysis strategy should consider abandoning unidirectional analysis and instead locating information peaks and exploring forward and backward from these.

Continuity is also a factor in the analysis of symbol streams such as time streams of phoneme hypotheses. In symbol streams we have categorical change along the stream rather than the quantitative change of parameter streams. Of significance to ASR is the need to develop techniques for effectively dealing with mixtures of symbolic and parametric streams. As in the work described in this thesis, ASR typically starts with a stream of acoustic parameters and matches them to a symbol stream.

Thirdly, continuity is an issue when ASR systems are confronted with missing or corrupted data.

• Sequences or sets: Disregarding coarticulatory effects, phonemes and syllables are ordered. With a small lexicon the order is not always critical to word-level recognition and even with a large lexicon an unordered set of phoneme candidates can provide a rapid means of reducing the lexical search space. Likewise, at a sub-segmental or acoustic frame level order can be significant but, importantly in some contexts, detailed sequencing information is not always relevant to the decoding task. It may be valid to treat some sequences of frames as unordered sets, in which case effort expended in modelling them as an ordered sequence may be counter-productive.

• Temporal logics: Complex temporal concepts can be encountered in speech decoding. For example we might wish to specify an event as occurring at “the mid-point of the first glottal period following the plosive burst of a word-initial stop”. Temporal logics have been developed in recent years that allow such specifications to be handled in a rigorous manner2. They are not considered further in this thesis.

2.2 Time in speech production and perception

2.2.1 General considerations

The following review is an attempt to outline what can be deduced from studies of speech perception about how listeners deal with time and how that might apply to automated speech recognition. It is not unreasonable to say that time is the principal variable in speech. Time and its reciprocal, frequency, provide most of the information required for a lexical decoding of speech. Intensity is phonetically distinctive but primarily in its short term dynamics and, as we shall see later, plays a relatively minor role in the perception of segmental stress.

There are a number of assumptions or hypotheses that underlie the analysis of speech that can be drawn from our general knowledge of human perceptual development and that of speech in particular. Individual speakers and listeners each have a unique introduction to speech and a unique physiology. It is tempting to assume, and often implicitly assumed, that speech consists of some ideal utterance that is imperfectly instantiated in a particular speech act.

2 eg. Allen, 1984; Long, 1989; Galton, 1990; Allen, 1991; Loganantharaj, 1991; Freksa, 1992; Lee et al., 1993; Vila, 1994


This is a normative view where variation is seen as deviance from the norm. Alternatively, we can view individual speaker characteristics and the variety that they display as having a primary importance in their own right with a validity that is independent of the norm. If everybody spoke with the same voice we would often have to rely on visual cues for speaker identity and the process of following one speaker among many in “cocktail party” contexts would be made more difficult, if not impossible. Statistical ASR techniques generally penalise deviance. While it may not be possible to totally avoid a normative approach it could be advantageous to address this issue as a primary design criterion for ASR systems.

Developmental factors are likely to play a significant role: for example, the acoustics of the womb suppress high frequencies and may lead to an emphasis on intonation and rhythm in the foundations of speech perception. Evidence exists that listeners can interpret speech with greatly reduced frequency domain information (eg. Shannon et al., 1995). Everyone has a unique introduction to language, so we should not be surprised to find large variations in the decoding techniques of individual listeners.

Speculation about human auditory and lexical processing does not impact directly on the experimental work described here but does form part of the motivation for a flexible recognition strategy that provides a broad setting for the experiments. This debate is typified by the lack of agreed conceptual frameworks apparent in the interchange generated by Marslen-Wilson and Tyler (1980, 1983) and Norris (1982).

A more pragmatic approach has been discussed and tested that focusses on the role of context sensitive allophones (eg. Wickelgren, 1969; Marcus, 1979). The issue of temporal granularity, or the level of basic speech unit (phone, diphone, triphone, syllable or word etc.) that is most appropriate for ASR has been discussed at great length in the literature. This debate, and the associated body of experimental results, is not discussed in detail here. In this thesis we are looking towards future systems that can make decisions concerning appropriate temporal scale on the basis of the specific context. The experiments reported in this thesis are restricted to either single phonemes or look at behaviour in specific diphone contexts such as consonant-vowel (CV) boundaries.

2.2.2 Source Synchronous Framing

It is usual in ASR to sample digitised acoustic signals with fixed-length frames, typically 10-25ms wide, shifted by 5-10ms or more between frames. This approach can be shown, at least for voiced speech, to be sub-optimal in that voiced speech has a natural framing rate generated by the glottal source impulse stream.

It is clear from simple theoretical considerations that improved frequency domain transforms can be constructed by synchronising the signal framing to the glottal source (SS framing) so that no more than one cycle of the glottal impedance variation is included in each frame. The source discontinuities at the instant of glottal closure (and to a lesser extent, opening) break the stationarity assumptions of frequency domain analysis and introduce the well known source harmonics that confound formant analysis.

While the advantages of SS framing have long been recognised (eg. Hess, 1983) there appears to have been no consensus developed in the speech recognition community on how to achieve it reliably. Davies and Millar (1996) described and evaluated a computationally efficient and reliable technique for generating a stream of source synchronous timing markers.

There are two potential practical obstacles to the use of SS framing. Firstly, glottal epochs are not of uniform length, so resampling may be required for some conventional HMM or ANN systems. Secondly, SS techniques are prone to disruption in noise-corrupted signals. This can be circumvented, to a point, by basing the SS analysis on low-pass filtered frequency band energies.
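
As an illustration of the framing step only, the sketch below cuts a voiced signal into frames whose boundaries are a given stream of glottal closure markers, falling back to fixed frames in unmarked regions. The marker detection itself (the subject of Davies and Millar, 1996) is not reproduced here; the markers and their positions are simply assumed as input.

```python
import numpy as np

def source_synchronous_frames(signal, fs, glottal_marks, fallback_ms=10.0):
    """Cut the signal into frames bounded by successive glottal closure instants
    (sample indices in `glottal_marks`); unmarked trailing regions fall back to
    fixed-length frames. A sketch only: no marker detection is performed."""
    frames = []
    marks = sorted(glottal_marks)
    for start, end in zip(marks[:-1], marks[1:]):
        frames.append(signal[start:end])               # one glottal epoch per frame
    tail_start = marks[-1] if marks else 0
    step = int(fs * fallback_ms / 1000)
    for start in range(tail_start, len(signal), step):  # clock-based fallback
        frames.append(signal[start:start + step])
    return frames

if __name__ == "__main__":
    fs = 20000
    signal = np.random.randn(2000)
    marks = [0, 160, 330, 495, 660]                     # assumed glottal closure instants
    frames = source_synchronous_frames(signal, fs, marks)
    print([len(f) for f in frames[:6]])                 # epoch-length frames, then fixed frames
```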

2.2.3 Phoneme duration

Since the phoneme and its temporal span, the segment, are abstractions, we cannot expect segmental durations to have direct acoustic correlates. The syllable is, arguably, a less abstract entity. Bagshaw (1994) defines five syllable-based durations: nucleus; nucleus and coda; onset and nucleus; onset, nucleus and coda; nucleus-to-nucleus. As noted by the author, none of these has a clear acoustic representation either. Ultimately we rely on the partly subjective judgement of phoneticians to determine segmental boundaries. Automated segmenters use acoustic cues in combinations that have been 'verified' against the phoneticians' judgements.

Nonetheless the segment is widely recognised as a valuable concept and a broad consensus exists for the validity of the concept of segmental duration, although strong qualifications emerge when it comes to coarticulation or the temporally complex phonemes such as the stops (eg. Ainsworth and Millar, 1976).

The phonetics literature shows that duration plays a complex and important role in English speech. In an early review of the subject Klatt (1976) lists seven phonologically significant durational cues:


a. Differences in inherent phonological duration for vowels
b. Duration as a cue for voicing in fricatives
c. Lengthening at the ends of syntactic units
d. Stress
e. Vowel duration as a cue to the voicing feature of a post-vocalic consonant
f. Emphasis
g. Shortening in consonant clusters

All of these factors are mixed together in combinations that require sophisticated disentanglement in the auditory response of listeners and ASR systems. Several later studies (eg. Pols et al., 1996a, 1996b; Riedi, 1997) have investigated these factors and, following Klatt, have built empirical models for predicting vowel duration based on the statistical analysis of large speech data sets.

The following is a summary of the main influences on phoneme duration.

Intrinsic Duration

As much as 50% of the durational variation of stressed vowels in continuous speech can be attributed to the intrinsic duration of individual vowels (Klatt, 1976). This variation decreases with the general phonological reduction and shortening associated with low stress or fast speaking rate (van Santen and Olive, 1990). In English the phonologically distinct short and long vowels can differ in duration by as much as a factor of two.

This distinction is weakened but largely maintained across stress categories (Crystal and House, 1988a). While the distinction is not directly attributable to articulatory constraints at normal speaking rates, it tends to reflect the intuitively expected limits of fast speech, where phonological reduction is associated with reduced articulation and duration. This effect can be seen in the data of Umeda (1975, figs. 24 and 25) where, in duration histograms for unstressed vowels, those instances where vowel identity is retained (not reduced to schwa) constitute the long-duration tails of the distributions.

Duration and rate of speech

Variation of segmental durations is not a simple function of phrasal durations. Lehiste (1970) varied speaking rate by a factor of two and recorded a factor of 1.5 in durational change for stressed syllables and observed that unstressed syllables absorbed most of the change. Goldman-Eisler (Lehiste, 1970 p52) looked at syllable rate variations of between 4.4 and 5.9 syllables per second finding that corresponding variations in pausal durations were five times this range.

Crystal and House (1982) discussed the impact of the differing natural speaking rates of 14 speakers on segmental duration. Their SLOW group on average took 33% longer on an utterance than their FAST group. The majority (54%) of the increase in duration was associated with an increase in the number of pauses, 27% from the increase in durations of existing pauses and 19% from increases in segment durations. The latter showed equal variation for major phonetic categories. They contrast their results with other researchers reporting differential variation between categories based on durational data for individual speakers asked to perform at rates varying from their natural rate. Millar and Wagner (1983) reported positive correlations between segmental and sentence duration for diphthongs, nasal-vowel transitions, vowel-fricative transitions, vowel-stop transitions and vowels.

Nooteboom and Slis (1969), using nonsense words (variants of ‘mamamam’), tested the hypothesis that rate variations do not vary the relative durations of vowel and consonant within the syllable. Comparing fast and normal rates, the only exception they found was the case of a very short unstressed second syllable. With slow speech they found that long vowels lengthened relatively more than short vowels. Relative lengthening of vowels and consonants varied between speakers. Kuwabara (1996) performed similar measurements on Japanese CV syllables, also finding relatively fixed C/V duration ratios for normal and fast speech, with vowels lengthening more than consonants for slow speech. Adaptation to rate shows some association with place of articulation. Least influenced by rate was the stop /t/.

The above studies have looked at the relationships between segmental duration and rate change. Perceptual boundaries can also be adaptive. Port (1979) and Miller et al. (1986) show adaptation in the Voice Onset Time (VOT) perceptual boundary for the stop consonant voicing decision. The magnitude of the shift (10% for a 30% rate change; Port, 1979) is consistent with the overall change in consonant duration with rate. More recent results and a detailed analysis by Ultman (1998) suggest that perception of the VOT voicing boundary may be confounded by definitions of local rate being sensitive to phrasal rate but difficult to define in terms of syllabic rate because of the complex interactions between consonant duration and the duration of adjacent vowels.

Perceptual adjustment to rate changes is, according to Francis and Nusbaum (1996), an active process requiring cognitive effort that can be disrupted by a secondary cognitive task manipulated by the experimenters. They contrast their conclusions with the common assumption (their claim) that rate adjustment is a passive task based on filtering.

Verhasselt and Martens (1996) and Ohno et al. (1996) have described techniques for estimating the local rate of speech. This is clearly an important issue since most of the work on rate discussed so far uses an utterance-level estimate of rate whereas it is generally accepted that rate can vary within an utterance. In some languages local rate is varied between metric feet with a tendency to produce isochronous word boundaries regardless of the number of phonemes in the word. The estimation of local rate is likely to be central to any attempt to model duration since durational contrasts are sustained over rate changes. An example of this is discussed by Lehiste (1970, p40) reviewing work by Tarnoczy on Hungarian vowel durations.

The data of Verhasselt and Martens (1996, eg. Fig. 4) and others suggest that the shortening of mean durations with increasing rate is accompanied by a reduction in the asymmetry of the durational histograms. Specifically, the shortening is promoted by reduction of the long durational tails of the distributions more than changes in durations shorter than the mode.

In summary we can, perhaps, view timing in normal speech as being constrained by the articulatory demands of very fast speech, but at normal rates there is freedom to provide phonological cues that may duplicate or substitute for other cues such as the glottal frequency (F0) in the stress decision or vowel quality in vowel distinctions. In addition, we should probably consider normal speech as highly stylised, with prosodic embellishments that can be progressively reduced as rate increases without having a major impact on intelligibility.

Duration and stress

While intuitively we might think of the stress or prominence of a syllable as a function of intensity, studies have shown that both F0 and duration are more significant acoustic correlates.

Turk and Sawusch (1996), considering duration and intensity, found that listeners have difficulty with intensity judgements, particularly if duration is varying. They concluded that duration or, perceptually, length was a significant cue to prominence and that listeners did not use intensity in the judgement. Stressed vowels can be, on average, 50% longer than unstressed (Lehiste, 1970). Physical constraints on rapid changes in sub-glottal pressure may be a significant factor in leading speakers to substitute durational cues that demand less effort. Lyberg (1979) linked this limit on F0 change to final lengthening. The minimum period for glottal change under articulatory control (as opposed to physiological effects such as glottal jitter) has been estimated as around 50ms, or 100ms for complementary paired events (eg. Lyberg, 1979; Sundberg, 1979). This association may provide a natural basis for lengthening substituting for F0 inflection.

The primary cue to prominence is an inflection in F0 above its longer term trajectory. Lehiste (1970) cites F0 as the primary stress cue with the perception of change in F0 being enough to signal stress rather than depending on absolute magnitude of the change. She claims an effect for F1 and F2 inflections but that this effect is small, possibly less than intensity. She quotes Bolinger as rejecting intensity as a cue. A more detailed view of the F0-stress relation is given by Hermes (1996) who concluded that the moment of change of pitch is the cue and its effect is dependent on the timing of the change relative to either the vowel onset or the power centre of the vowel. This interpretation may undermine the conjecture of a physical link between F0 change and duration.

In summary we have F0 as the primary stress cue with duration acting as a secondary cue with both possibly linked through physical limits to the rate of change of F0.

Pre-pausal lengthening

It has been argued by Marek (1978) that the pause be elevated to a status comparable to that of the segment in phonology since its position is predictable and governed by rules much as segments are. Others have modelled the duration of pauses as a function of this position with considerable success.

Perception of pauses appears highly dependent on the position within the clausal structure of a sentence. Butcher (1978) quotes Boomer as showing a threshold of 200ms for the perception of pauses with intraclausal position, and 500 to 1000ms for interclausal pauses. Butcher’s own work with German showed a logarithmic relationship between pausal duration and a measure of local syntactic complexity as defined by Miller and Chomsky.

Grosjean (1978) showed that the pause hierarchy of sentences, based on pause length, was closely related to, but distinct from, the parse tree. He ascribes the difference, at least in part, to the inclination of speakers to break surface syntactic units into constituents of approximately equal length. He claims to predict 72% of the variance in pausal data compared with 56% achieved using the complexity index alone.

The segmental impact of the pause has been studied extensively with pre-pausal lengthening being a major contributor to variations in segment duration. Cooper and Danley (1981) have looked at this effect for fricatives and vowels finding that duration is dependent on the segment and its distance from the pause. They conclude that variation does not appear to be accounted for by constraints on production or perception. At least part of the explanation may lie in the tendency toward isochrony observed by Grosjean.

The pause may be secondary to intonation and duration as a cue to sentence structure. Henderson (1978) finds a lack of support for the hypothesis that pauses signal structural boundaries. Rather he suggests, in accordance with the results he quotes of Buller, that the salience of a pause is a consequence of its position rather than a cue to position. He suggests that a fall in intonation substitutes for the pause and summarises evidence for this in the speech of mothers to their children showing a decrease in the occurrence of boundary pauses as the age of the children increased.

These results indicate that a prosodic pausal model may be both necessary and available for inclusion in a broad and detailed model of segmental duration for application to ASR systems.

Positional effects on duration

Several factors relating to the position of a segment within a word, and the phonemic structure of the word, can affect the segment's duration in addition to the utterance-level effects of rate, stress and pausal location discussed above. The most commonly cited is the voicing decision of post-vocalic stops, where vowels preceding voiced stops are as much as 50% longer than those preceding unvoiced stops.

Luce and Charles-Luce (1985) concluded that the voicing decision for stops in VCV context was most consistently related to initial vowel duration (p=0.02) whereas closure duration was unreliable in 83% of cases and exceeded 0.05 significance only for velar stops. The C/V duration ratio also fared poorly as a voicing cue being more strongly dependent on other contextual factors than was vowel duration.

Duration and Place

Differences in the mean durations of consonants have been discussed by many researchers (eg. Carlson and Granstrom, 1986). Place of articulation plays a significant role in this variation with frontal consonants generally being shorter than those produced by the back articulators. Bell-Berti and Harris (1981) discuss this effect in detail and postulate an intrinsic articulatory period that extends the actual acoustically realised consonant duration by preliminary and final durations that account for the whole articulatory act.

Crystal and House (1988b) show a lengthening trend for the release duration of a labial-alveolar-velar sequence for complete stops that is large (velar approximately 220% of labial) and consistent between slow and fast speakers that differed in rate by 15% on average. For incomplete stops they show a much smaller effect having minimum duration for alveolars and with labials longer (but possibly not significantly longer) than velars. They downplay these effects because of the low statistical significance. They found duration to have a dominating contrastive influence in at least one situation. Very short stops (less than 20ms or so) are perceived as /t/ by listeners.

Port (1979) wrote more generally of the apical flaps in rapid speech ranging in duration from 5 to 40ms. The bilabial /b/, on the other hand, was usually longer than 30ms. Neagu and Bailly (1997) shed light on the relative importance of information content in the noise portion of French voiced stops and that in the subsequent voiced signal to the end of the syllable. They show large stop place variations and strong dependence on the following vowel.

Modelling durational variance

Here we look at work done to build parametric models of the variance of phoneme duration based on models of the major influencing factors. This process should not be confused with the work done to model the duration densities (smoothed envelope of durational histograms) in HMM design discussed in Section 2.3 below.

Early modelling efforts such as those of Umeda (1975) and Klatt (1976) used linear regression coefficients for one or two factors at a time. Klatt provided sets of four rules each for both vowels and consonants. In similar work Carlson and Granstrom (1986) compared Swedish and American English durations using similar regression equations to those used by Klatt. They confirm and extend Klatt’s findings and provide five rules for identifying and classifying the duration of vowels. The cross-language comparison shows higher duration means and variance for Swedish stops.

The work of van Santen and Olive (1990) advances the models by looking at covariance between factors of durational variance. They point to the importance of looking at the relationships between factors in coming to a theoretical understanding of duration. They found that log transformed durations produced better fitting models than untransformed durations. Their model residuals were 8.3% of duration compared with 9.3% for Klatt’s model.
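
The general form of these regression models can be illustrated with a small least-squares fit of log-transformed durations against a handful of binary context factors. Both the factor set and the duration values below are invented for illustration only; they are not the factors or coefficients reported by Klatt or by van Santen and Olive.

```python
import numpy as np

# Each row: [1 (intercept), stressed, phrase-final, before voiced stop]; durations in ms.
# Both the factor set and the durations are invented for illustration only.
X = np.array([
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0], [1, 0, 0, 1],
    [1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 1],
], dtype=float)
durations_ms = np.array([110, 150, 70, 95, 160, 210, 100, 140], dtype=float)

# Fit the model log(duration) = X @ beta by ordinary least squares.
beta, *_ = np.linalg.lstsq(X, np.log(durations_ms), rcond=None)

predicted = np.exp(X @ beta)
residual_pct = 100.0 * np.mean(np.abs(predicted - durations_ms) / durations_ms)
print("multiplicative factor per feature:", np.round(np.exp(beta), 2))
print("mean absolute residual: %.1f%%" % residual_pct)
```

Because the fit is performed on log durations, each coefficient acts as a multiplicative factor on duration, which is one reason log-transformed models of this kind can fit better than additive ones.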

Pols et al. (1996a) used a hierarchical analysis of variance approach with eleven factors. Riedi (1997) also uses a multivariate approach (Multivariate Adaptive Regression Splines or MARS). Up to nineteen parameters were used to specify the segmental context and model up to 80% of the durational variance. The parameters represent segmental type and position and the degree of accent of the current segment. Segmental type and characteristics (F0 for vowels; manner, voice and place for consonants) are provided to the model for a sequence of five segments centred on the current segment along with position in the syllable and the number of segments in the syllable, foot and sentence.

The incorporation of such models in ASR systems has the potential to provide a significant boost to reliability and adaptability. Lexical constants such as position can be pre-recorded. Dynamic parameters such as accent and local speaking rate can be provided by a parallel prosodic model that can include a speaker model. An appropriately designed recognition system can use the durational model to reduce the lexical search space and ignore the residual durational variance. For such a model to be possible we need something like the PSL tested in the present work as a signal-derived duration proxy.


2.2.4 Dynamical cues in the frequency domain

In this section we look at transient formant behaviour and its phonological significance. The movement of F1 has been studied in relation to place decisions for stops (eg. Fischer et al., 1990) but here we look at F2 transitions in stop-vowel contexts which have been described by Sussman et al. (1991) as a ‘litmus test’ for the problem of explaining perceptual invariance in the face of acoustic variance.

Using a speech synthesiser Liberman (1952) showed that a strong upward trend in the formant transition at voicing onset following a stop was perceived by listeners as /b/ and strong downward trends as /g/. Intermediate values were perceived as /d/.

To describe this phenomenon Sussman et al. use a ‘virtual locus’ perspective, which is defined as the frequency intercept of a projection in the time-frequency plane from F2 of the vowel nucleus back through the F2 at voice onset. They ascribe the original locus definition (just the F2 value of the first glottal pulse at vowel onset) to Lindblom (1963).

In this work we focus on what is perhaps a third ‘transient’ model based on the short-term slope of F2 at the onset of voicing. From a transient perspective, over the first glottal cycles of the onset of voicing the articulators and the (possibly complex) vocal resonances are still moving rapidly under the influence of the stop release, and the place of release is evident in the short-term formant dynamics.
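
A small numerical illustration of the two descriptions is sketched below: the short-term F2 slope across the first few glottal cycles after voicing onset, and a virtual locus obtained by projecting the line through the nucleus and onset F2 values back in time. The F2 track, the number of cycles used for the slope, and the choice of the plosive release as the projection's time origin are all assumptions made for illustration, not values or definitions taken from the cited studies.

```python
import numpy as np

def f2_onset_slope(times_ms, f2_hz, n_cycles=3):
    """Short-term F2 slope (Hz/ms) over the first few glottal cycles after voicing onset."""
    t, f = times_ms[:n_cycles + 1], f2_hz[:n_cycles + 1]
    slope, _ = np.polyfit(t, f, 1)
    return slope

def virtual_locus(t_onset_ms, f2_onset, t_nucleus_ms, f2_nucleus, t_release_ms=0.0):
    """Frequency intercept of the line through the nucleus and onset F2 values,
    projected back to an assumed release time (here taken as the time origin)."""
    slope = (f2_nucleus - f2_onset) / (t_nucleus_ms - t_onset_ms)
    return f2_onset + slope * (t_release_ms - t_onset_ms)

if __name__ == "__main__":
    # Invented F2 track (Hz) at successive glottal epochs after a stop release.
    times = np.array([10.0, 18.0, 26.0, 34.0, 60.0])   # ms after release
    f2 = np.array([1100.0, 1250.0, 1380.0, 1480.0, 1700.0])
    print("onset slope %.1f Hz/ms" % f2_onset_slope(times, f2))
    print("virtual locus %.0f Hz" % virtual_locus(times[0], f2[0], times[-1], f2[-1]))
```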

Perceptual experiments (Blumstein and Stevens, 1979; Blumstein, 1980) have narrowed the temporal range of interest, showing that human listeners can detect place of articulation of consonants with as little as 20ms of the signal following a release burst, though recognition improved as more signal was given. This tentatively suggests that the relevant information is concentrated near or beyond the voicing onset of short stops rather than in burst frequency information.

Blumstein et al. (1982) investigated this process further, looking at the relative significance of onset dynamics of the second and third formants and the overall spectral shape of the onset signal. They found that while overall spectral shape was used as a cue to place it did not override the information gained from formant dynamics.


2.3 Time and Hidden Markov Models

Early attempts at ASR based on nonlinear time warping (DTW) and syntactic pattern recognition suffered from an unconstrained explosion in computational complexity with increased match distance and relied on ad-hoc distance measures. A significant development coming out of work in the 1960s on probability nets was the introduction of Markov models attributed to Baker (1975) although in application to phoneme strings Neuburg (1971) had previously described a Markov approach based on an underlying (hidden) state sequence.

Viewing strings as Markov chains was a valuable move away from DTW systems in that the computational complexity was reduced by as much as an order of magnitude. This was largely achieved by the use of empirical likelihoods for state transition and emission probabilities rather than the less informative distance measures of DTW techniques.

A number of substantial papers on HMMs appeared in the early eighties from Rabiner, Juang, Levinson, Sondhi and Wilpon3. In their 1983 study Levinson, Rabiner and Sondhi compare HMM models with their own previously developed DTW methods. They found an order of magnitude reduction in time for the HMMs with slightly poorer performance than the DTW. They claimed this work as the first such comparison.

A fundamental problem with the Markov chain approach is that it models duration with a geometric probability distribution, which is generally accepted as invalid. As we shall see below, some of the HMM’s complexity advantage was soon lost in the attempt to regain a valid representation of duration. In this section we look at attempts to remedy this deficiency, but most of them lose the theoretical integrity of the basic HMM. They require alternative methods for parameter estimation and can be more computationally intensive than DTW.

Duration Density Distributions in HMMs

According to Levinson (1986), Ferguson (1980, unpublished; 1986) extended the standard HMM to allow for a discrete set of state dependent duration probabilities. Levinson has shown that the Ferguson model can be transformed into an equivalent conventional HMM by replication of states: the duration of a state was modelled through transition to one of several sub-states, each of which has a different (fixed) duration with no self-loops. Levinson extended Ferguson's discrete duration model to continuously variable duration using gamma distributions and gave a parameter re-estimation method.

³ Rabiner et al., 1985a; Rabiner et al., 1985b; Rabiner, 1986; Rabiner et al., 1993a; Rabiner et al., 1993b; Rabiner et al., 1995; Rabiner et al., 1986; Levinson et al., 1983; Levinson, 1986.
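One way to realise the replication idea (a minimal sketch, not the construction used by Ferguson or Levinson): a state with duration probabilities p(1)..p(D) is expanded into a chain of D sub-states with no self-loops; entry jumps to position D-d+1 with probability p(d) and then advances deterministically, so exactly d sub-states, each sharing the original emission distribution, are visited.

    import numpy as np

    def expand_state(duration_probs):
        """Build a sub-state chain equivalent to one HMM state with an
        explicit discrete duration distribution.
        duration_probs[d-1] = probability of staying exactly d frames.
        Returns (entry_probs, A): entry_probs[k] is the probability of
        entering the chain at sub-state k; A is the (D+1)x(D+1) transition
        matrix over the D sub-states plus an exit node, with no self-loops."""
        p = np.asarray(duration_probs, dtype=float)
        D = len(p)
        entry_probs = np.zeros(D)
        for d in range(1, D + 1):
            entry_probs[D - d] = p[d - 1]      # enter earlier in the chain for longer stays
        A = np.zeros((D + 1, D + 1))
        for k in range(D):
            A[k, k + 1] = 1.0                  # advance one sub-state per frame
        return entry_probs, A

    # a duration law peaking at 3 frames, impossible for a single geometric state
    entry, A = expand_state([0.1, 0.2, 0.4, 0.2, 0.1])
    print(entry)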

Russell and Moore (1985) used a single parameter Poisson distribution to model state duration density. They used a modified version of Liporace's (1982, unsighted) forward-backward re-estimation method. They applied their model to time critical recognition problems such as the ‘leek-league’ distinction.

Reporting on an extensive experimental study, Crystal and House (1988a), in addition to providing a considerable amount of empirical duration data, made some observations on duration modelling. They discussed the process of fitting geometric distributions and related it to the design of a digital filter with a given impulse response, the difference being that HMMs place the additional constraints on the coefficients (probabilities) that they be non-negative and sum to 1.

Picone (1989) concluded from a detailed analysis of HMM errors that they were generally the result of corresponding errors in time alignment. He adapted context dependent hierarchical clustering techniques from DTW to compute the optimal number of states for HMM models and claimed that the seed models so generated provided good performance in a digit recognition task.

A number of tests have been conducted to determine the best parametric distribution for duration probabilities. Gotch et al. (1994) compared four distributions, finding that the Poisson (12% error), Gaussian (12.3%) and multinomial (13.4%) distributions outperform a conventional HMM with a geometric distribution (17.4%). Their finding of a drop in performance with increased numbers of model parameters suggests that this family of distributions might not be optimal.

An exception to the preceding evidence on optimal distributions is seen in the data published by Burshtein (1996), who showed, contrary to the rest of his own data, that in the third (presumably final) state of the word 'oh' the duration distribution followed a geometric distribution very well. This suggests that we should be wary of basing ASR design on broad generalisations.

Nonstationary States

Simple HMMs with phone level states assume stationarity in the emission probabilities, while speech signals are well known to be highly non-stationary at a sub-segmental level. The path followed by an individual utterance through the region of acoustic vector (AV) space associated with a phoneme can vary greatly within and between utterances. While some short term (~20-50ms) dynamical characteristics, particularly the formant transitions associated with stops, are significant in the identification of phonemes, much of the variation at this scale is not and can act as noise if it is not ignored.

An early extension to the original HMM approach was to use Gaussian mixtures for the emission probabilities. If adequate training data is available this allows a flattening of the distributions and a degree of accommodation of inter-utterance variation that is not phonemically significant, but there is a training cost and we are still faced with the problem that temporal regions of low information content form weak links in the Markov chain.

A simple solution within the standard HMM model involves splitting states into regions of greater temporal stability. Alternatively, Deng et al. (1992a) and Deng (1992b; 1994; 1997) have introduced parametric models for nonstationary states. They optimised an interpolation between the acoustic vectors representing a vowel nucleus and a consonant target, and were able to use a single set of parameters for all vowels with little loss of performance over modelling each vowel separately. They pointed to the major savings in required training data thus achieved.

In comparable work Saerens (1993) described a model for the time evolution of acoustic vectors within a state using continuous-time linear differential equations. He did not provide any experimental results to support his model but pointed to similar work in LPC analysis. Significantly, for the present work, he suggested that the need for derivatives of the acoustic vector was a weakness in the method.

Park et al. (1996) discuss a Modified Continuous Density HMM with both state duration probability distributions and duration dependent observation probabilities. They include results comparing acoustic vectors (cepstral coefficients) with and without time derivatives (delta-cepstral values). This work provides a comparison between modelling duration explicitly within the HMM and implicitly by incorporating timing information in the acoustic vectors, and so has a direct bearing on the present work, which aims to facilitate the latter approach.

Modelling Prosody

A two level approach to modelling prosody in HMMs was described by Saudeau and Andre-Obrecht (1993), in which a conventional HMM is overlaid with a prosodic layer that models duration at the word level. They found that the performance of a standard HMM (94.6% recognition rate) was improved to 95.2% with their model using Gaussian distributions and to 95.35% with inverse Gaussian distributions.

In modelling duration, Abe and Nakajima (1993) evaluated a Bounded Transition HMM that restricts state changes to regions around estimated phone boundaries emitted by statistical boundary detectors. This is a significantly different approach in that it permits the use of word global information specific to phone boundary detection at recognition time. The authors compare a conventional HMM (32% error), one using Gaussian distributions (47.2% error) and one using Gamma distributions (46.5% error) with their own model at 14.2% error, but do not attempt a detailed explanation of these comparisons. The use of a phone boundary detector could provide a means for HMMs to fully model durational variance.

Summary of Duration modelling in HMMs

Efforts to incorporate models of duration in HMMs fall into two classes. The main focus in the literature has been on optimising representations of durational probability distributions with no reference to contextual dependencies.

Modelling distributions within the HMM suffers from two obvious limitations. Firstly, the important phonological information embedded in durations is generally ignored. Secondly, attempts to incorporate such models involve such a radical move away from the basic Markov approach that they raise serious doubts as to the suitability of Markov models for ASR, or at least undermine their role as the central technique.

The second class of durational models, discussed in detail in Section 2.2.3, attempts to model durational variance using parametric models of contextual factors such as rate and stress. Combining prosodic models in a HMM framework can be seen as a valuable move towards full durational variance modelling but the attempts published to date have been incomplete and the results have been mixed.

2.4 Time and Artificial Neural Networks

Introduction

Artificial Neural Networks (ANNs) have been applied to ASR in many different ways: mapping acoustic vectors to phonemes (with and without timing information), mapping acoustic vectors to words, prediction (including prediction of the next acoustic vector) and augmenting HMMs. Among the applications that explicitly deal with time, the most common uses for ANNs are the detection of critical local timing information, such as recognising plosives, or, at the word level, time warping to compensate for varying duration.

Basic ANNs are static input devices. The system has neither specific memory of immediately prior inputs nor knowledge of the order of presentation of prior inputs. More generally, the dynamic of the hidden and output layers of the net can follow a simple feed-forward path to a single fixed output in one forward sweep, or iterate toward a stable solution over several sweeps to allow self, lateral or feedback connections to take effect. In the nonlinear extreme they can follow a chaotic dynamic, spending time in multiple attractors and settling into the strongest attractor as the system energy decays, in the manner of a Boltzmann machine (eg. Freeman, 1979).

Dynamic input nets maintain a memory of immediately prior inputs. Each successive input moves the system state from the state reached by the previous input in a direction determined by the current input. This produces a form of soft hashing where the final state is a product of all the intermediate states but, in contrast to conventional hashing, small deviations in the intermediate states do not randomise the final state; instead, a trajectory is learned for each word as a valley in an energy surface (eg. Tank and Hopfield, 1987).

The digital signal processing (DSP) classification of IIR and FIR (Infinite and Finite Impulse Response) systems is useful here in that it gives another perspective on the time dynamics of nets. FIR type nets reach a decision in a finite number of iterations of the net for each input. Such systems can be unfolded in time and trained by conventional Error Back-Propagation (EBP) methods. IIR type systems appear to require individual training techniques. This perspective is discussed further below.
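A minimal sketch of the 'unfolding in time' idea for an FIR-style net (the recurrence depth T, the layer sizes and the tanh nonlinearity are illustrative assumptions, not details taken from the works cited): the same weight matrices are reused at every unrolled step, so ordinary back-propagation over the unrolled graph trains a finite-memory recurrent system.

    import numpy as np

    def unfolded_forward(x_seq, W_in, W_rec, W_out, T=5):
        """Forward pass of a simple recurrent layer unrolled for T steps.
        Every unrolled copy shares the same W_in and W_rec (weight tying),
        which is what allows conventional EBP training over the unrolled graph."""
        h = np.zeros(W_rec.shape[0])
        for t in range(min(T, len(x_seq))):
            h = np.tanh(W_in @ x_seq[t] + W_rec @ h)   # shared weights at each step
        return W_out @ h                               # decision after T steps

    rng = np.random.default_rng(0)
    x_seq = [rng.standard_normal(12) for _ in range(5)]   # 5 frames of 12-dim AVs
    W_in = rng.standard_normal((8, 12))
    W_rec = rng.standard_normal((8, 8))
    W_out = rng.standard_normal((3, 8))
    print(unfolded_forward(x_seq, W_in, W_rec, W_out))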

Static Input ANNs

Short segments of speech at the phoneme to short word level can be presented as static input to an ANN as a single input vector of fixed width, zero padded for short segments. A net with this multi-frame input architecture can also be seen as a temporal window scanning a longer utterance (eg. Sawai, 1991). A static input net can thus deal with sequential input if the network input is wider than the longest acoustic feature duration to be detected. The difficulty with this approach is that the cost of training and operation is high and grows more than linearly with input width. It is practical at the phone level but not usually at the whole word level.
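The fixed-width presentation amounts to concatenating a window of acoustic vectors and zero padding the remainder; a small sketch (the frame and window sizes are arbitrary choices for illustration):

    import numpy as np

    def static_input(frames, window_len):
        """Concatenate up to window_len acoustic vectors into one fixed-width
        input vector, zero padding when the segment is shorter than the window."""
        dim = frames.shape[1]
        out = np.zeros(window_len * dim)
        n = min(window_len, len(frames))
        out[:n * dim] = frames[:n].ravel()
        return out

    frames = np.random.rand(7, 12)           # 7 frames of 12 parameters each
    x = static_input(frames, window_len=10)  # 120-dimensional net input, padded
    print(x.shape)                           # (120,)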

Time Delay ANNs

Rather than extending the net's input to incorporate time, delays can be introduced in the hidden layer. In the approach used by Tank and Hopfield (1987) an analogue network had a small number (<10) of delay lines feeding the output layer. The aim of the technique was to focus information in time onto the output layer at the estimated completion time of the word. Dispersion of delay was made greatest for the longer delay lines to allow for accumulated drift from the mean rate of speech over the word.

Unnikrishnan et al. (1992) applied this approach to ASR and claimed a potential 97% accuracy on a multi-speaker task (cf. 97% for an HMM). The Time Delay Neural Network (TDNN) is capable of performance comparable to HMM, Linear Vector Quantisation and Nearest Neighbour methods (eg. Lang et al., 1990).

Recurrent ANNs

If the hidden or output layers of a net have self-loops, lateral connections or feed back to previous layers then the simple feed-forward architecture is broken. At each consecutive input step the activation state of the net will represent a trajectory developed over prior inputs in the current recognition sequence. A distinction is made here between training data presented in the distant past that is reflected in connection weights and a trajectory of network activations accumulated since a network activation reset event at the start of a word or particular recognition attempt.

Apollone et al. (1993) achieved good performance with an ANN on isolated Italian digits. The approach suffered from the disadvantage of requiring a separately trained net for each word in the lexicon. They give no indication of how their approach is likely to perform with larger lexicons: increasing the lexicon not only requires additional word nets but also an increased ability to select the appropriate net for the current input.

Sato (1990a, 1990b) and others have looked at the problem of learning in RNNs for continuous time systems. Discrete time RNN systems with constrained duration can be unfolded to approximate a TDNN architecture allowing EBP training. Each unfolded copy of the net (or sub-net) can be constrained to have the same set of weights as the others. This approach is akin to the IIR to FIR approximation in digital filter design discussed again later in this section.

Generally speaking the training of RNNs is slow and can require 100,000 or more training epochs. Techniques have been presented for reducing this problem such as the replication approach mentioned above or the use of Radial Basis Functions as described by Bourlard and Wellekens (1989).

Attractor ANNs

Generalised ANNs with time delays in connections have been shown to develop complex multi-attractor dynamics (Freeman, 1979; Luzyanina, 1995), and questions of stability arise that are not readily answered. Bauer and Geisel (1990) presented a simplified approach to training feedback networks that they call the open loop method: during training, rather than feeding the actual output back to the input, they use the desired output, and then use the actual output in normal operation. They demonstrate the stability of training on an XOR example and on a sequence of two bit input vectors, and demonstrate a capacity for time warping.

ANNs and Digital Filters

A potentially valuable approach to dealing with time in ANNs is available through the application of the extensive body of knowledge accumulated in the field of DSP on the theory and application of digital filters.

Gong et al. (1991) have used an IIR filter to encapsulate time information into the input stream of a neural network. Bourlard and Wellekens (1989) used the analogy of IIR and FIR filters to classify ANNs. Time Delay Neural Networks with limited delay can be compared with FIR filters and fully recurrent nets with IIR filters. Unfortunately the theory of nonlinear filters is not well developed and most ANNs are nonlinear devices. The question of nonlinear time warping does not appear to have been addressed in the DSP literature.

Summary of Duration modelling in ANNs

Artificial Neural Networks have demonstrated an ability to model phone duration though, as discussed elsewhere in this thesis, this may not always be necessary or even desirable. They have demonstrated an ability to both learn and use critical timing information such as in the detection and discrimination of plosives.

There is scope for improving the presentation of timing and duration information at the input of the network. Little work appears to have been done with ANNs in optimising the representation of information in acoustic vectors as addressed in this thesis. This is likely to be due, at least in part, to the heavy training demands of ANNs, particularly those with more complex time dynamics.

2.5 Coding temporal information

In this thesis the focus is on acoustic duration measures, derived from the signal, that may act as proxies for segmental durations. For a given phoneme in context we set out to develop representations of acoustic durations that provide a basis for discrimination in situations such as phonological length and consonantal voicing decisions.

Acoustic duration or stability can be defined in absolute or relative terms. Absolute values can be defined as distances between landmarks that are in turn defined by the dynamics of one or more acoustic parameters. Relative duration measures can be generated by looking at the behaviour of acoustic parameters relative to reference values derived from clusters of one or more acoustic vectors. Multi-parameter landmarks can be generated either by using more than one acoustic parameter in estimating the local continuity of the signal, as is done in the Temporal Decomposition (TD) approaches discussed here, or, alternatively, by making rule-based selections from a set of landmarks each generated from an individual acoustic parameter (eg. Salomon and Espy-Wilson, 1999).

The Parameter Similarity Length (PSL) that forms the main focus of the experimental work reported in this thesis is an acoustic duration of the relative kind, generated from a single acoustic parameter. Duration boundaries are defined for the PSL as the points at which the acoustic parameter value deviates from a reference value (the value at the current vector) by more than a pre-determined fraction of the parameter variance.

Having established acoustic measures for duration or other acoustic timing information, they can be coded into each AV to define the temporal context of the AV. The motivation is to separate the temporal processing and pattern matching from the lexical pattern matching stage of ASR and so, hopefully, lead to a reduction in complexity for both processes and an overall improvement in system performance.

Established approaches to dimensional reduction and abstracted signal representations that might provide the basis for proxy duration measures were investigated. Those examined were techniques based on matrix rotations of short AV streams or on temporal smoothing (eg. Atal, 1983; Hermansky, 1994; Krishnan and Rao, 1996 and subsequently Malayath et al., 1997; Shen and Hwang, 1999; Kajarekar et al., 2001, among many others). The Temporal Decomposition approach described by Atal was considered in some detail because it had the specific focus of determining a single best eigenvector to represent a speech segment in a particular A-P context, and subsequent discussion of it had a segmental duration focus (eg. Marcus and van Lieshout, 1984).

Temporal Decomposition attempts to provide a coded reconstruction of the signal from basis functions generated by applying Singular Value Decomposition (SVD) techniques to a windowed segment of the AV stream, as in Atal’s original approach. The label has also been applied to loosely related techniques based on a clustering of temporally local AVs using a distance measure defined on the AV space.

In introducing Temporal Decomposition Atal (1983) addressed the problem of low bit rate coding rather than signal representation for ASR. The basic steps (as described in Marcus and van Lieshout, 1984) are:

• Take AVs of Log Area (LA) parameters at 10ms intervals over a 250 to 300ms speech segment. This is expected to pick up 4 to 5 phonemes.

• Perform a SVD on the AVs and select the eigenvectors that contribute more than 5% of the variance. This is usually 4 or 5 eigenvectors.

• Form linear combinations of the eigenvectors that minimise spread about the centre of the window to get compact interpolant functions with minimum overlap.

• Take the most compact combination as the most salient in the window.

• Shift the window incrementally across the incoming signal; the location of the function in the signal is taken as the point where the centre of gravity of the function passes the middle of the window.

• Having a first estimate of the functions, use one new window for each function, extending to the location of the neighbouring function.
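A minimal sketch of the first two steps (the 10ms frame rate, the window length and the 5% variance criterion follow the description above; the example data and the parameter dimension are invented for illustration):

    import numpy as np

    def td_basis(av_window, var_threshold=0.05):
        """Select the SVD basis vectors of a windowed AV stream that each
        account for more than var_threshold of the total variance."""
        X = av_window - av_window.mean(axis=0)         # centre the window
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        var_fraction = s**2 / np.sum(s**2)
        keep = var_fraction > var_threshold
        return Vt[keep], var_fraction[keep]            # basis vectors and their weights

    # e.g. a 280ms window of 10ms Log Area vectors (28 frames, 12 parameters)
    window = np.random.rand(28, 12)
    basis, weights = td_basis(window)
    print(basis.shape)   # for speech LA data typically 4 or 5 vectors survive
                         # (random data, as here, will keep more)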

Atal's TD method was taken up by Marcus and van Lieshout (1984), who discussed the association between the target functions and the phonetics of the utterance. They reported Zuk (IPO, 1984) as showing Atal's method to be highly sensitive to changes in window duration, the instability being, at least partly, a consequence of the windowing algorithm. They commented that this implies a weak link with the phonetics. They discussed further work on improving the window location algorithm, but any such improvement is not likely to strengthen the phonetic link, merely to make a somewhat arbitrary representation more consistent. This is fine for the speech coding applications originally intended by Atal but of limited value to ASR.

Ahlbom et al. (1987) modified the approach taken by Atal. They dropped the SVD step and used AVs themselves as basis functions by clustering similar vectors. They achieved a reconstruction error comparable to Atal's SVD based method, although their functions have considerable overlap even after being modified, by rather ad-hoc methods, to make them more compact. They claimed a significant improvement in naturalness for the reconstructed speech over diphone synthesis. They also claimed that TD reveals the acoustic-phonetic structure of speech but made no comment on the overall stability of this relationship.

Dijk-Kappers and Marcus (1987) modified Atal's method by optimising the choice of window position and width to suit the particular basis function being determined. They used SVD, recalculating it for each change in window width, and did some testing of phonetic relevance. Looking at 9 phones, they showed that all had more than one target function except for bursts, which were usually missing altogether, i.e. they generated no target function. They pointed out that it was not surprising that the bursts were missed since no amplitude information was used. They concluded that the method cannot be considered a simple phoneme segmenter. It may still be of value as one tool in a toolbox of analytical techniques supplying incomplete but valuable information for a more complex ASR system.

Dijk-Kappers (1988) presented an attempt to determine the most appropriate acoustic input parameters for TD analysis. Atal originally used a set of LA parameters. Dijk-Kappers tested six parameter sets, including Log Area, Log Area Ratio (LAR), Reflection Coefficients, Area and filterbank outputs, and found that the number and position of target functions varied greatly between the different parameter sets. Tests were performed on a set of 48 CVC combinations from one male speaker.

Phonetic labelling of utterances was done by hand and used only for the analysis of target-phoneme associations and not in the determination of the target functions. As was noted in Dijk-Kappers and Marcus (1987) the bursts were missed.

In the reconstruction tests the filterbank parameters more often gave larger error signals, though they gave better target-phoneme associations; LA and LAR parameters seemed to give the best reconstruction. Here acoustic-phonetic mapping and signal reconstruction appear to be in opposition. Dijk-Kappers suggested from F1-F2 plots that LA based target functions may still be of some value in recognition.

Since in the present work we are interested in clustering AVs it is worth noting that from Figure 1 in Dijk-Kappers et al. (1987) it is apparent that the AVs selected from the centre of vowels were more tightly clustered than the TD values which had more outliers.

Montacie et al. (1989) used a TD based technique on surnames spelt aloud using the French alphabet. They achieved 76% phoneme recognition, where recognition was defined as the correct phoneme being present in the list of alternates generated for each AV.

The TD technique is clearly an effective method for speech coding with a single speaker, if somewhat expensive computationally. However stability of the eigenvectors and speaker adaptability have not been strongly demonstrated. The distinct objectives of speech coding and speech recognition may be, to some degree, mutually exclusive in this context.

From Dijk-Kappers (1988) and Ahlbom and Chollet (1987) we can see that ad-hoc clusters of local AVs can perform as well as those generated at great computational expense using SVD. This provides support for the PSL approach used in this thesis, which is effectively a frame clustering technique.

2.6 Serial/parallel processing

The serial/parallel processing debate initiated by Marslen-Wilson and Tyler (1980; 1983) and Norris (1982) opened up a discussion on temporal issues in human speech processing and suggested a model that may prove to be of value in ASR. The main point of interest here is not whether information is actually processed simultaneously, but whether it is processed in a temporally linear manner or viewed as ‘chunks’ within which temporal order is only one criterion for guiding processing algorithms. Specifically, it is possible to first process the information that is most reliable and then, where appropriate, making use of the information already gained, move on to other information sources in the chunk that are less reliable.

The approach can be extended through to the lexical matching level. We can take all the information detected for a word, assess which components (phones, syllables, etc.) are most reliably known and build an initial cohort of possible words based on the most reliable information at hand. This initial cohort can then be progressively reduced by introducing the less reliable information in a manner that maximises the chance of the correct word being in the initial cohort and remaining there to the end. Information is thus used in order of reliability rather than in its sequential order. Specialised techniques can then be used to test individual phoneme or word hypotheses.
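A toy sketch of this reliability-ordered cohort reduction (the data structures, confidence scores and tiny lexicon are invented for illustration; no such implementation is described in this chapter):

    def cohort_match(hypotheses, lexicon):
        """hypotheses: list of (position, phone, confidence) for one word.
        Process the most reliable hypotheses first, keeping every lexicon
        entry consistent with the evidence admitted so far."""
        cohort = set(lexicon)
        for pos, phone, conf in sorted(hypotheses, key=lambda h: -h[2]):
            survivors = {w for w in cohort if len(w) > pos and w[pos] == phone}
            if survivors:          # only commit to evidence that leaves candidates
                cohort = survivors
        return cohort

    lexicon = ["cat", "cap", "cut", "mat"]
    hypotheses = [(0, "c", 0.9), (2, "t", 0.8), (1, "a", 0.4)]
    print(cohort_match(hypotheses, lexicon))   # {'cat'}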

An underlying theme of this thesis is the establishment and evaluation of temporal frames of reference that are data driven rather than imposed by constraints of the analytical system. In Section 2.5 we looked at possibilities for generating segment scale timing markers from the signal in a bottom-up approach based on stability of acoustic parameters. Another possible approach is to base a tentative segmentation of the signal on information generated in the lexical matching algorithm in a top-down process.

2.7 Summary of time in speech and ASR

The analysis presented in this chapter is not intended to be exhaustive, rather it has covered issues considered to be central to the theme of time in ASR and relevant to the present work. We have seen that there are specific temporal issues that have arisen from perceptual studies that have direct relevance to the recognition problem. We can reasonably expect to discover more through the use of large and varied speech databases and the application of automated techniques capable of flexible representations of complex temporal cues.

It is also clear that not all temporal information is directly relevant for recognition. An example is the diphthong, where the start and end states are the primary phonological elements and the details of the transitional trajectory are of little or no significance. A cursory inspection of appropriate speech data shows that the diphthongal transition is highly varied even within the utterances of one speaker and between different instances of the same utterance.

Segmental duration, likewise, is not always critical for recognition. Most phonemes must be expressed for some minimal time due to physical constraints of both production and perception. Otherwise, moving outside the normal durational ranges is, with a few exceptions, quite acceptable to our ears though it might produce an awkwardness and lack of rhythm that is capable of causing confusion.

With this in mind it is important to note that all of the ASR techniques reviewed (with the possible exception of Montacie et al., 1989) expend effort on detailed modelling of duration and diphthongal transitions, and pay a penalty if they don't, or if they are not successful. The approach characterised as ‘parallel’ in the last section is capable of avoiding this problem: the signal, or its derived parameters, can be searched for just that temporal information needed for the current activity, ignoring information that is not relevant.

We have seen that for normal speech it is possible to create models of segmental duration that can play a valuable part in the recognition process. However, none of the recognition techniques reviewed appears to be capable of readily incorporating such models without major structural change (and associated loss of theoretical support) and significantly increased computational effort.

Chapter 3

Objectives and Rationales

3.1 Introduction

In this chapter we define objectives for the experiments performed and the rationale for the techniques used. Specific experimental details are left for Chapter 4.

In Chapter 2 we looked at the potential and motivation for developing new approaches to ASR. To begin this chapter we take a brief look at what directions this might take. This analysis provides a framework within which we can design new ways of coding and using temporal information in ASR.

System framework

The system design criteria that were chosen as a computational framework for the experiments reported in this thesis were based on the goal of dealing rapidly and flexibly with the full complexity of the ASR task from signal processing through to syntactic and semantic analysis. Complexity stems from factors such as variations in acoustic conditions, phonetic context, speaker, mood and prosody, and from choices in recognition strategies and signal processing requirements to suit these differing contexts. As complexity increases, reliability and transparency become increasingly important issues that can form a barrier to further development. The key elements of the design were high-level, soft-coded, rule-based control logic operating on raw (unmodelled) statistical tables.

A rule encoding using a high level rule language and/or templates can provide greater transparency than rules written in a low level programming language or embedded in the design and training strategies of statistical machines such as HMMs and ANNs. Complex systems can be more reliably and flexibly constructed if the application logic is separated from the ‘hard-wired’ implementation software as it generally is in rule-based systems. The use of raw data tables simplifies the internal data representation providing both speed and transparency that is only reduced in the approach used in this thesis by the need to compress the range of acoustic parameters to achieve high data densities (or low bandwidth) in the acoustic vector. The combination of separate control logic and simple data representation is expected to facilitate the automated generation of recognition rules from the data.

We have chosen to focus on the encoding of the acoustic vector in this work, but this cannot be done in complete isolation from the matching process; the two should work together. The AV stream generated from a series of acoustic frames forms an intermediate data representation between the raw signal and the internal data representation of the matching stage. Its primary function can be seen as storing the output of a filter that has winnowed the input signal and retained just that information required by the matcher. To further reduce the algorithmic and computational burdens of the matcher the AVs can be processed to present the information in the most efficient form for the particular task at hand.

In Section 3.5.2 we will look in some detail at the issue of AV bandwidth. We will see that the conventional 12 cepstral coefficients with their first and second time derivatives form an extremely sparsely populated input space. The temporal information takes up two thirds of this bandwidth, yet we saw in Section 2.2 that the requirements for temporal information may be relatively small (though quite specific). Many successful attempts have been made to confront the problem of AV bandwidth using techniques such as Vector Quantisation and Principal Component Analysis, but recognisers still require computationally expensive algorithms to model sparse data as parametric probability distributions or connection weights.

For this work we have attempted to avoid the use of computationally expensive mathematical models (specifically matrix or iterative methods), employing instead computational logic and statistical tables (Section 3.5.1). For the matching stage, or in our case the acoustic-phonetic evaluation stage, we used tables of dense population distributions that do not require modelling (or smoothing), though efficient techniques such as Parzen estimators (eg. Parzen, 1962; Jarre and Pieraccini, 1987; Babich and Camps, 1996) are available if smoothing is required. Raw incidence statistics are computationally cheap to accumulate and, with the flexibility of a full ASR system in mind, have the valuable property of allowing simple addition and subtraction of data without remodelling.

Rule-based or knowledge-based systems have been applied to ASR, primarily encoding human acoustic-phonetic expertise, with some success (eg. Lamel, 1993). Here we consider a system that additionally uses rules that it generates automatically from explorations of the data. Rules that may be too numerous and complex in the specification of their application for human compilation.

In this thesis the PSL is evaluated in a manner that assumes its varied application to differing contexts as part of an acoustic probe that can be used to test the speech signal. There is no attempt to define an ‘ideal’ PSL that can be applied to all acoustic-phonetic contexts.

The twin issues of AV content (what specific items of information are included for a particular A-P context) and form (how this information is represented) are addressed both explicitly and implicitly through the objectives and methods of the experiments performed.

Experimental objectives

It was shown in the previous chapter that temporal issues play an important and varied role in the decoding of speech. Studies of the human perception of speech have shown that phoneme duration plays a role in phonological contrasts such as the length decision for vowels and the voicing and place decision for stops (Section 2.2.3). At a shorter time scale the behaviour of F2 at the onset of voicing following a consonant can give a cue to the place of articulation of the consonant (Section 2.2.4).

The investigation of the relationship between the acoustic information (or the acoustic vector) generated from a frame, its temporal context and the strength of the resulting acoustic-phonetic associations forms the basic experimental setting for this thesis. The acoustic-phonetic information value of acoustic parameters depends critically on the acoustic-phonetic context. Both segmentally local information such as duration and utterance global information, such as rate, play an important role in defining this context. In our investigation of the PSL we confine our scope to local, segment scale information.

At a local level the spectral analysis of one isolated frame of a speech signal may tell little about the phonetic context until we look at surrounding frames and provide some information on the acoustic context. Conversely, not all segmental information is phonemically or phonologically useful. The duration of a phoneme may only be significant in so far as it falls within a known range. Using the actual duration may amount to an over-specification that places an unnecessary burden on the recogniser. This information may still be of value to ASR but can be kept separate from the local acoustic-phonetic analysis and, as with the broader temporal context, beyond the scope of this thesis.

If an appropriate distance measure is defined it can be used to estimate the temporal continuity or stability of spectral parameters in the temporal neighbourhood of each frame. If one or more temporal parameters are found that help to define segmental contextual information they can be used to supplement spectral information in the process of evaluating acoustic-phonetic associations for the frame and its neighbours.

Representation of acoustic temporal information is conventionally done with acoustic parameter time differentials. The acoustic parameters can be formant parameters (frequency, energy or bandwidth), cepstral coefficients, frequency band energies (or ratios), frame energy or glottal period. In this work we introduce and evaluate an alternative representation, the PSL, which is defined in Section 3.4 as a measure of temporal stability that can be applied to any acoustic parameter.

The central experimental objective of this thesis is to evaluate the ability of PSLs to encode phonemically significant temporal context and to compare them with conventional acoustic parameter derivatives and the phonemic alignment derived duration.

This evaluation is performed through five series of experiments which have as their specific aims:

• to investigate the impact of temporal smoothing of acoustic parameters and the requirements, if any, for maximum possible temporal resolution;

• to compare and contrast PSLs with parameter time differentials in a manner that can lead to the appropriate choice of temporal representation for any specific acoustic-phonetic context;

• to test the ability of the PSLs to act as proxies for segmental durations by comparing them with the best measure available of segmental duration - the duration derived from the phonemic alignment data associated with the speech signals;

• to look at the value of the PSL in modelling duration in four specific phonologically contrastive contexts:
  1. the duration of long and short vowels in the vowel length decision;
  2. the duration of pre-stop vowels in the stop voicing decision;
  3. the voice-onset time of stops in the stop voicing decision;
  4. the voice-onset time of stops in the stop place decision;

• to investigate the behaviour of the second formant in the vicinity of the segmental boundaries between stops and vowels in an effort to detect behaviour that is thought to provide information in the stop place decision.

Secondary objectives were the development and evaluation of:

• techniques for rapid generation of acoustic parameters and acoustic-phonetic associations;
• source synchronous framing of the speech signal;
• likelihood ranking based evaluation of acoustic-phonetic associations;
• the use of low dimensional formant and band energy analysis in the representation of temporal context;
• methods of parameter value compression or AV bandwidth reduction.

Secondary objectives have a methodological focus. They reflect the requirements of speed and flexibility and AV bandwidth reduction described in Chapter 2 as desirable in a speech recognition system.

3.2 Spectral Analysis

The use of formants and band energies in this work rather than cepstral coefficients was motivated largely by a need to minimise the bandwidth of the acoustic vector stream that serves as input to the tables used in evaluating acoustic-phonetic association strengths described in Section 3.5. Formant analysis has long been a tool for acoustic-phonetic research but has generally been abandoned in ASR in favour of cepstrally derived parameters.

Rabiner and Juang (1993, p20) dismissed formant analysis as being “... more of theoretical than of practical interest”. Their reasons were that formants are too difficult to estimate for weakly voiced phonemes and difficult to define for unvoiced regions. The authors may be correct in the context of a full ASR system where most published systems use sets of cepstral coefficients. However, encoding temporal information is just a subtask of the full ASR problem and for the purposes of the present work we have evaluated a wide range of single acoustic parameters for their potential temporal information content.

Each PSL is based on a single acoustic parameter for several reasons. Using more than one parameter requires the development of multi-dimensional distance measures which generally prove difficult and no obvious candidates were apparent. Intuitively we can expect multi-dimensional approaches to be limited in temporal extent by the most rapidly varying parameter whereas with a single parameter we had more flexibility in choosing a parameter according to performance in generating strong acoustic-phonetic associations in a particular acoustic-phonetic context without the confounding effects of other parameters. General parsimony and bandwidth considerations also pointed to the one dimensional problem as an appropriate starting point at least.

From our knowledge of vowel quality distinctions using formant frequencies (particularly F1 and F2 and their well known relationship with the vowel triangle/trapezoid) we can be reasonably assured of getting phonetically useful information from individual formant frequencies if the frequency domain transform and formant picker are implemented with care. Using source synchronous framing greatly simplifies the task of formant analysis. Initial testing of the Fourier transforms on synchronous frames showed that they provide relatively clean and reliable spectral data to the input of the formant picker algorithm even for weakly voiced speech.

A wide range of acoustic parameters were tested including sets of spectral moments that can be seen as closely related to Cepstral coefficients. A full list of the acoustic parameters selected for testing and results of the tests is given in Chapter 4.

3.3 Source Synchronous Framing

The glottal epoch is, arguably, the fundamental timing framework for voiced speech. The use of Source Synchronous Framing (SSF) has many ramifications, both theoretical and practical. From a theoretical point of view it samples the signal at an optimal level, maximising the temporal resolution of the AV streams generated from sequences of frames. While there are detectable variations in the short term signal spectrum within a glottal cycle, they are largely associated with glottal movement, which is not directly related to the slower movements of the other articulators or to the voicing changes of the glottis.

SSF also minimises the F0 related spectral transform artefacts that are generated when the framing is not synchronised to the glottal pulse stream. To contain F0 information the frame must contain information that is repeated at the glottal cycle rate. If less than one full glottal cycle of the signal is presented to the DFT, and the phase of the synchronous framing is set so that the frame starts near the instant of glottal closure, then the characteristic burst-decay pattern of voiced speech ensures that the frame contains little F0 information other than the (possibly related) length of the sampling window. This reduces the need for, and the impact of, the cosine or other temporal windowing function commonly applied to the speech signal prior to a Fourier Transform.

From a practical point of view SSF introduces two main problems - deriving a reliable time stream of synchronous markers and relating this new time frame back to absolute time where necessary. Dealing with irregular framing intervals is a relatively minor problem when we are looking at phoneme durations. The average glottal period is of the same order as the minimum detectable difference in duration (eg. 10-40ms, Lehiste, 1970) so when looking at vowel duration the framing length, fixed or variable, is assumed insignificant if it is constrained to be of the same order as the glottal period. We could resample acoustic parameter streams to regain a regular time stream but it was not considered to be necessary and, as argued later, probably not desirable because resampling implies a degree of signal smoothing which we treat as a separate temporal issue.

To generate the marker stream we conducted an extensive evaluation of the Fundamental Harmonic Extraction (FHE) technique discussed in detail in Section 4.6 and in Davies and Millar (1996). The FHE technique, in addition to providing glottal epoch markers, gives a very direct measure of voicing strength which can be used to mark unvoiced or weakly voiced intervals.

3.4 The Parameter Similarity Length

The approach taken to encoding temporal information could be described as minimalist in the sense of attempting to capture simple descriptions of the speech signal that contain just the desired information and no more. This approach has parallels with the Speech Sketch perspective described by Green (1990): a sketch constructed from a set of standardised elements that contains just enough information to get the picture across. Rather than starting with a large set of temporal parameters, such as the 24 delta and delta-delta (first and second time derivative) cepstral parameters, and attempting to reduce the information bandwidth, we take the approach of starting with nothing and incrementally building AV structures by adding information that is found to be useful in context.

Time differentials measure the change in parameter value over a fixed time interval. Conversely, the PSL is defined as the length of the time interval over which the acoustic parameter remains within some predetermined range constrained by two thresholds. A full definition is given in Section 4.10.2.

The rationale for looking at the PSL is that most of the phonemically significant temporal information is durational in nature, and differentials are not a direct or reliable representation of acoustic duration. Long durations are associated with small differentials, which are highly susceptible to parameter noise. The PSL, on the other hand, can be totally immune to a small noise signal over most of its range and is only moderately susceptible near its extremes.
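A minimal sketch of the idea (the full definition is deferred to Section 4.10.2; the symmetric thresholds expressed as a fraction of the parameter's standard deviation and the treatment of the track ends are assumptions made for illustration): for each frame, the PSL is the number of consecutive frames, backwards and forwards, over which the parameter stays within the threshold band around its value at the current frame.

    import numpy as np

    def psl(x, i, frac=0.25):
        """Parameter Similarity Length of parameter track x at frame i, in frames.
        Counts how far the track stays within +/- frac * std(x) of x[i]."""
        band = frac * np.std(x)
        left = right = 0
        while i - left - 1 >= 0 and abs(x[i - left - 1] - x[i]) < band:
            left += 1                      # extend backwards while still similar
        while i + right + 1 < len(x) and abs(x[i + right + 1] - x[i]) < band:
            right += 1                     # extend forwards while still similar
        return left + right + 1            # interval length including frame i

    track = (np.concatenate([np.full(20, 500.0), np.full(30, 1500.0)])
             + np.random.randn(50) * 5.0)  # a noisy F2-like step
    print(psl(track, 10), psl(track, 35))  # roughly the two plateau lengths (20 and 30)

Unlike a local differential, a small noise term leaves these values essentially unchanged, while the PSL at frames on either side of the step shrinks sharply.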

3.5 Acoustic-Phonetic Association Tables

3.5.1 The choice

The strength of the association between an acoustic vector and its associated phoneme can be assessed in many ways. The most obvious are the two principal techniques used in ASR systems: HMMs and ANNs. In Chapter 2 we looked at both techniques from the perspective of a complete ASR system. Here they are evaluated from a narrower perspective: fast assessment of acoustic-phonetic associations.

The Markov model approach to the analysis of time stream data is inherently temporal in nature. It makes assumptions about temporal continuity and builds its own temporal models as a primary function. It is therefore not an appropriate tool for evaluating upstream temporal analysis: there is no simple way of differentiating between the temporal model being tested and the temporal model of the HMM, which makes the results difficult to interpret.

A simple ANN looking at one AV at a time and generating some measure of acoustic-phonetic association strength at its output would suit present purposes and retain a reasonable degree of generality. However this approach has serious practical limitations. Retraining an ANN for new data is highly computationally intensive and we wish to repeatedly re-evaluate the AV stream as we optimise contextual representations. Also, the connection weights do not readily provide insight into the underlying processes of matching. After careful consideration of the many ANN design options it was realised that a comparable result could be achieved using simple statistical look-up tables acting as content addressable memory if the input bandwidth of the AV stream could be constrained.

Statistical tables counting associations between AVs and phonemes can be accumulated with optimum speed since all that is required is one addition. The results can be trivially and rapidly converted to A-P association likelihoods with one division operation. These are, arguably, the most general and transparent measure of association between the signal and its phonetic alignment data and as such are applicable to any ASR system.
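A minimal sketch of the accumulation and conversion (the table shape, the integer AV coding and the phoneme inventory size are assumptions made for illustration; the tables actually used are described in Chapter 4):

    import numpy as np

    N_AV_CODES = 2**12           # e.g. a 12 bit quantised acoustic vector
    N_PHONEMES = 44

    counts = np.zeros((N_AV_CODES, N_PHONEMES), dtype=np.uint32)

    def train(av_code, phoneme_id):
        counts[av_code, phoneme_id] += 1        # one addition per labelled frame

    def likelihoods(av_code):
        row = counts[av_code].astype(float)
        total = row.sum()
        return row / total if total else row    # one division per lookup

    train(1234, 7); train(1234, 7); train(1234, 12)
    print(likelihoods(1234))                    # ~0.67 for phoneme 7, ~0.33 for phoneme 12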

3.5.2 Table input bandwidth

In general, any ASR system that encodes the speech signal as an acoustic vector stream is likely to benefit from a reduction in AV information bandwidth, provided no useful information is lost in the process, since an optimal bandwidth can reduce training data requirements. The tabular approach used in this thesis also has a maximum bandwidth dictated by limits to table size.

The typical 38 parameter AVs used with HMMs (12 cepstral coefficients plus the first and second order time derivatives of these and the frame energy) can be reduced by a factor of two without significantly reducing performance (eg. Bocchieri and Wilpon, 1993; also Paliwal, 1992). With a 19 parameter AV, and assuming a minimum of 4 bits per parameter, we have at least 76 bits per AV. To put this in perspective, at 200 frames per second we have 15,200 bits per second, whereas the speech has a minimum information content in the order of 50 bits/s (~26 phones at ~8 per second). Alternatively stated, it would take around 10^13 years of speech to populate the AV space with an average of one AV instance per AV value.
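The arithmetic behind these figures (a rough check only, using the 4 bits per parameter, 200 frames per second and ~50 bits/s assumptions stated above):

    bits_per_av = 19 * 4                  # 76 bits
    bits_per_sec = 200 * bits_per_av      # 15,200 bits/s against ~50 bits/s of phonemic content
    av_values = 2 ** bits_per_av          # ~7.6e22 distinct AV codes
    avs_per_year = 200 * 3600 * 24 * 365  # ~6.3e9 frames per year of continuous speech
    print(av_values / avs_per_year)       # ~1.2e13 years to see each code once on average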

HMM techniques overcome the problem of sparse training data, to some degree, by fitting Gaussian distributions to the data. In the tabular method used in this thesis the raw, unsmoothed data is used so the requirements for data density are more acute.

Consider the table size constraint for a table with a column for each phoneme and a row for each AV value. The number of phonemes is given, so we need to constrain the number of possible AV values to keep the table size within the limitations of available computer memory. The exponential growth in available memory capacity has dramatically expanded the scope of tabular methods. If we restrict the table size to 360 Mb and assume that only one table need be in memory at any one time, the maximum size of the acoustic vector is 22 bits. Using 12 cepstral coefficients would restrict each coefficient to less than 2 bits, or less than one bit if we included the full delta cepstral parameters.
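A rough check of the 22 bit budget (the roughly 40 phoneme columns and 2 bytes per cell are assumptions made for illustration; the thesis does not state them here):

    phoneme_columns = 40
    bytes_per_cell = 2
    rows_22_bit = 2 ** 22                          # 4,194,304 possible AV values
    table_bytes = rows_22_bit * phoneme_columns * bytes_per_cell
    print(table_bytes / 2**20)                     # ~320 Mb, inside the 360 Mb limit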

Ultimately, however, the main restriction on AV bandwidth was the amount of data available. An initial estimate (see Section 4.9) was that data availability would limit AV bandwidth to around 10 to 12 bits and so reduce the number of parameters per AV to 2 or 3 if plausible granularity was to be retained in the quantisation of each AV component. The impact of parameter granularity on acoustic-phonetic association strengths is tested in Section 4.9, where we look at bandwidth reduction through compression of acoustic parameter value ranges.

One possible approach to reducing the number of parameters is to use formant frequencies, energies and bandwidths. For four formants we have 12 parameters, or 24 if we include the full set of time derivatives, which is no improvement on the bandwidth of the cepstral coefficients. However, key formant parameters such as F1 and F2 can be expected to be more independent than cepstral coefficients, which require a minimum number of coefficients (typically 12) to represent the frequency domain information to an adequate degree of detail. Using formants we expected to be able to use individual parameters as a basis for representing temporal context and so greatly reduce the bandwidth of the AV stream while increasing the flexibility of temporal representations.

Holmes et al. (1997) tested a mixture of formant and cepstral coefficients with some success, but only when confidence levels for the formant choices were included. Eight cepstral coefficients outperformed five cepstral coefficients plus three formant frequencies; this may have been due to the lower cepstral coefficients duplicating the information in the formant frequencies. In the preparations for the experiments reported in this thesis the information content of acoustic parameters, both alone and in low dimensional combinations, was extensively tested (see Chapter 4).

3.5.3 Acoustic-phonetic association rankings

The likelihood function l(q|X) for phoneme q given acoustic vector X is central to Bayesian analysis in “representing the information about q coming from the data” (Box and Tiao, 1973, p11).

Likelihoods provide a valid expression of acoustic-phonetic association strengths, but to be more directly applicable to the ASR problem we look at relative likelihoods, not in the usual form of likelihood ratios but by generating a phoneme likelihood ranking for each AV value. For each cell of the table we count the number of cells in the row (AV value) that have an association likelihood equal to or higher than that of the labelled phoneme. The strength of this approach comes from summarising the key information of a vector (table row) in a scalar.
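A minimal sketch of the ranking computation for one table row (the phoneme identifiers and likelihood values are arbitrary illustrations):

    import numpy as np

    def rank(likelihood_row, labelled_phoneme):
        """Rank of the labelled phoneme within one table row: the number of
        phonemes with likelihood equal to or higher than the labelled one.
        A rank of 1 means the labelled phoneme is the top candidate."""
        return int(np.sum(likelihood_row >= likelihood_row[labelled_phoneme]))

    row = np.array([0.05, 0.40, 0.30, 0.20, 0.05])
    print(rank(row, 2))   # 2: only phoneme 1 is ranked above the labelled phoneme 2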

The ranking can be directly related to the computational complexity of a form of word recognition task. The problem of matching a sequence of N phonemes, each of rank R, to a member of the system lexicon can be posed in many forms. Here we will assume a method involving the generation of a hash key from each hypothesised phoneme sequence, with dictionary lookup involving a single table access for each key generated. We can give a simple estimate of the computational complexity of the task as C = aPNR^N, where aN is the cost of generating each key, P is the number of keys matched per second and R^N is the number of key combinations. Assuming a practical computational limit of 10^8 instructions per second and a = P = 10, Figure 3.2 shows that rankings up to 5 can be usable for a 7 phoneme word. This analysis is, of course, only indicative since in general the phoneme rankings will vary widely across a word and the complexity estimate is nonlinear.
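
The figures quoted above can be checked with a short calculation. The following sketch simply evaluates C = aPNR^N for a 7 phoneme word with the assumed costs a = P = 10; it is a worked example only, not part of SLab.

    // Worked example of the complexity estimate C = a*P*N*R^N for the values used in the text.
    public class ComplexityExample {
        public static void main(String[] args) {
            double a = 10, P = 10;          // key generation and matching costs assumed in the text
            int N = 7;                      // word length in phonemes
            for (int R = 1; R <= 12; R++) {
                double C = a * P * N * Math.pow(R, N);
                System.out.printf("R=%2d  C=%.2e%s%n", R, C, C <= 1e8 ? "  (within the 1e8 limit)" : "");
            }
        }
    }

For a 7 phoneme word only ranks up to 5 stay within the assumed 10^8 limit, in agreement with Figure 3.2.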

[Figure 3.2 is a log-scale plot of complexity C = aPNR^N (1e+2 to 1e+8) against rank R (1 to 12) for word lengths N = 1 to 8.]

Figure 3.2: Computational complexity calculations for phoneme ranking R and a word or syllable length of N phonemes (N=1..8).

3.6 Experimental objectives

Five series of experiments were performed to explore temporal issues and demonstrate the performance of the PSLs in different contexts. These were:

1. Impact of smoothing and outlier constraint
2. Comparison of PSLs and differentials
3. Duration-PSL comparison for all phonemes
4. Duration and the vowel length decision and stop voicing and place decisions
5. CV boundaries and F2 transitions

The first experiment addresses the issue of temporal resolution for a selection of acoustic parameters. Experiments 2 and 3 look at the general behaviour of the PSLs in comparisons with parameter time differentials and an alignment derived duration. The remaining experiments look at the performance of PSLs in three specific phonological or acoustic-phonetic contexts.


3.6.1 Smoothing and outlier constraint

Smoothing and outlier constraint applied to acoustic parameters are treated in this thesis as a primary temporal issue and subjected to detailed investigation. They can be viewed as similar processes and are treated together.

Smoothing consists of running the acoustic parameter data through a low pass filter or, equivalently, using an expanded temporal neighbourhood in the evaluation of parameters such as time differentials. Outlier constraint consists of detecting parameter values that differ greatly from their neighbours and replacing them with a local mean. Both are temporal issues in that they imply models of temporal continuity and are likely to significantly affect the information content of the acoustic parameters, their derivatives and the PSL.
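
As a rough illustration of the two operations (not the SLab implementation; the window size and the outlier threshold used here are arbitrary assumptions):

    // Illustrative smoothing and outlier constraint on a parameter stream (one value per frame).
    public class TemporalConditioning {
        // Simple moving-average low-pass smoothing over a window of 2*half+1 frames.
        static double[] smooth(double[] x, int half) {
            double[] y = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                double sum = 0; int n = 0;
                for (int j = Math.max(0, i - half); j <= Math.min(x.length - 1, i + half); j++) {
                    sum += x[j]; n++;
                }
                y[i] = sum / n;
            }
            return y;
        }

        // Replace values differing from the mean of their two neighbours by more than 'threshold'.
        static double[] constrainOutliers(double[] x, double threshold) {
            double[] y = x.clone();
            for (int i = 1; i < x.length - 1; i++) {
                double localMean = (x[i - 1] + x[i + 1]) / 2.0;
                if (Math.abs(x[i] - localMean) > threshold) {
                    y[i] = localMean;
                }
            }
            return y;
        }

        public static void main(String[] args) {
            double[] f1 = {510, 515, 980, 520, 525, 530};   // one spurious F1 value (Hz)
            double[] constrained = constrainOutliers(f1, 150);
            double[] smoothed = smooth(constrained, 1);
            System.out.println(java.util.Arrays.toString(smoothed));
        }
    }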

It is common practice in formant analysis to use continuity constraints to aid the process of picking formants from the raw spectral data that often presents ambiguous options. Implicit in this practice is the assumption that irregularities in spectral peaks are in some sense errors - either generated in the spectral transform, the product of exogenous noise or deriving from speaker factors such as poor articulation or weak vocalisation.

It is possible, however, that in some instances we are dealing with signal behaviour that contains significant useful information. Since we do not have a comprehensive understanding of how information is transmitted in speech signals it would appear to be unwise to throw away potentially useful information without first evaluating it.

A series of experiments were conducted to determine the A-P significance of smoothing and constraint across a variety of A-P contexts. An underlying hypothesis was that short range or high resolution analysis would be advantageous in abbreviated and temporally complex contexts such as plosives whereas the analysis of continuants would benefit from a broader temporal analysis.

3.6.2 Similarity lengths and differentials

Parameter differentials are commonly represented in ASR by ‘delta parameters’, simple differences between temporally adjacent parameter values. Alternatively, the mean differential is taken over an extended temporal range. Ultimately, in theory at least, it would be possible to vary the temporal range of a differential to suit the context, using a short range estimate for rapidly varying signals and a longer range for slowly varying signals.
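
For concreteness, the two forms of differential can be sketched as follows (illustrative only; the frame index is assumed to advance one frame per step and the extended range is taken symmetrically about the current frame):

    // Illustrative delta parameters: a simple adjacent difference and a mean differential
    // taken over an extended temporal range of +/- 'range' frames.
    public class DeltaExample {
        static double simpleDelta(double[] x, int i) {
            return x[i] - x[i - 1];
        }

        static double meanDelta(double[] x, int i, int range) {
            int lo = Math.max(0, i - range), hi = Math.min(x.length - 1, i + range);
            return (x[hi] - x[lo]) / (hi - lo);   // average slope over the extended neighbourhood
        }

        public static void main(String[] args) {
            double[] f2 = {1500, 1520, 1560, 1620, 1700};
            System.out.println(simpleDelta(f2, 2));      // 40.0
            System.out.println(meanDelta(f2, 2, 2));     // (1700 - 1500) / 4 = 50.0
        }
    }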

The PSL also has an internal variable that controls its temporal range. This is the parameter range R, defined in Section 4.10.2 as the constraint on parameter values over the temporal range of the Similarity Length - our measure of similarity. The temporal range is, in general, a highly nonlinear but monotonically increasing function of R. Small values of R allow sensitive assessment of slowly varying parameters with sensitivity decreasing for larger values.
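
As a rough sketch of the general idea only (the definition actually used in this work is given in Section 4.10.2), one might count the contiguous frames whose parameter value remains within ±R of the current frame's value:

    // Hedged sketch only: assumes the similarity length counts contiguous frames (forward and
    // backward) whose parameter value stays within +/-R of the current frame's value.
    // The actual PSL definition used in this thesis is given in Section 4.10.2.
    public class PslSketch {
        static int similarityLength(double[] x, int i, double R) {
            int len = 1;
            for (int j = i + 1; j < x.length && Math.abs(x[j] - x[i]) <= R; j++) len++;
            for (int j = i - 1; j >= 0 && Math.abs(x[j] - x[i]) <= R; j--) len++;
            return len;
        }

        public static void main(String[] args) {
            double[] f1 = {300, 500, 510, 505, 512, 700};
            System.out.println(similarityLength(f1, 2, 20));   // 4: frames 1 to 4 stay within +/-20 Hz of frame 2
        }
    }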

We can see from this that not only do differentials and PSLs measure different temporal properties of the signal, they each have a variety of forms and parameter settings that complicate any comparisons. Rather than look for a single definitive test to determine which is the better approach, the strengths and weaknesses of each were evaluated for an individual parameter in a particular acoustic-phonetic context. A flexible ASR system could thus optimise the information content of AVs by choosing appropriate temporal information for a particular context.

3.6.3 Similarity lengths and duration

Phoneme duration is an abstract concept. Though it does not have a universally applicable definition it is central to most phoneme level analysis of speech. Its validity rests largely on the ability of skilled phoneticians to judge where the influence on the speech signal of the present phoneme wanes and the signal becomes dominated by the characteristics of the next phoneme. These assignments can then be used to evaluate machine derived alignments. It is well known that coarticulatory effects commonly confound this picture.

A single acoustic parameter, or even a combination of them, cannot be expected to give a reliable absolute measure of phoneme duration in all contexts, since this implies parameter stability through the phoneme with consistently detectable changes at the phoneme boundaries, which is not generally the case. We proceed with the attempt nonetheless, in the hope that information related to duration can be extracted directly from the signal in a form that can aid a recogniser in some instances. The task is aided by the fact that we are more interested in relative durational contrasts than absolute values. It suffices that a measure maintain a quasi-linear, or at least monotonic, relationship with duration.

We saw in Section 2.2.3 that duration was potentially contrastive but is subject to strong prosodic and speaker dependent influences which can dominate its variation. While the development of models for duration is clearly a major temporal issue for ASR we do not attempt any prosodic analysis here. It is not necessary for present purposes since we evaluate the ability of the PSL to act as a proxy for the actual phoneme duration.

Examples of contrastive duration are the long-short vowel distinction and the voicing and place decisions for stops. For the vowel length decision we evaluate the proxy duration of the PSL in a discriminant test. The PSL is also compared with the duration derived from the alignment data of the ANDOSL speech database.


With the stop decisions three decision boundaries are investigated. We look at the PSL as a proxy for voice-onset time in both the voicing and place decisions. We also look at the associated compensatory lengthening in the preceding vowel. The alignment duration through stops is likely to be less reliable than for continuants since the structure of stops is complex and, particularly in normal speech, is less readily determined. The rules used to position the manual alignment boundaries for the ANDOSL data are given in Kroot and Taylor (1995). The “arbitrary” 60ms duration assignments for initial and final stops and the release phase label (/H/) do not appear in the data used in these experiments which look only at intervocalic or pre-vocalic stops.

In both the hand labelled (possibly machine labelled and hand corrected) and fully machine labelled speech used here the ‘duration’ of intervocalic stops included both the occlusion phase and the release phase - the full intervocalic interval. These durations are used as a comparison for the PSL for the stops but none of the conclusions drawn in Chapter 6 are based on comparisons with stop alignment data.

Across vocalic segments we usually have a high degree of acoustic stability or at least smoothness of parameter values so the PSL was expected to perform better as a duration proxy for vowels. All the acoustic parameters used in this work are highly variable through stops so the development of an effective duration measure using temporal parameters such as the PSLs was likely to be difficult.

In a study that aligns quite closely with the aims of the present work, Salomon and Espy-Wilson (1999) generate signal defined landmarks at energy onsets and offsets and use the length of the intervening interval as a proxy for voice-onset-time in pre-vocalic stops.

3.6.4 Voice Onset Transitions

As we saw in Section 2.2.4, it has been demonstrated in perceptual studies that F2 movement during the onset of voicing provides information on the place of articulation of stops. The objective chosen for the final experiment was not to build a complete decision system for stop place but rather to focus on what might constitute one element of such a system. We want to see if it is possible to detect and categorise short time scale F2 transitions (or transients) using the PSL. More specifically we want to see if the PSL of the F2 differential (SLab parameter F2.Dif.Sim) was capable of tracking a steady rise or fall of F2 in the vowel onset.

The broader alternative perspectives on this perceptual cue discussed in Section 2.2.4, the value of F2 at onset or the ‘locus’ of a projection from the vowel nucleus, were considered to be primarily frequency domain rather than temporal issues. They are not pursued in this thesis.


Chapter 4

Software System and Methods

4.1 Introduction

This chapter looks at the software system that was developed, the experimental techniques used and the tests performed to validate them. A Java software package, referred to in this thesis as SLab, was developed as part of this project to allow a flexible and interactive approach to be taken in the tests conducted on data from the ANDOSL speech database.

The general approach taken in the experimental work reported in this thesis has been to view examples of the data at each stage of processing and to use automated accumulation of statistics, usually in the form of histograms, to verify that all the data conforms, in some sense at least, to the visually checked data. Many hundreds of spectral and time frequency plots were examined, primarily to verify algorithms and also to develop an intuitive grasp of the behaviour of the speech signal. To pursue this approach required an interactive and flexible software system for processing the data, displaying it and producing intermediate statistical tables in addition to generating the final results.

4.2 Development Platform

The requirements for the software system were:

• Automatic selection of multiple files for sequential batch processing or manual selection of single files for visual inspection and testing.

• Selection of the signal component of each file rejecting lead-in and trailing non-speech regions.

• Pre-processing speech data for low frequency noise reduction and pre-emphasis of high frequency components of the signal if required.

• Presentation of time domain speech signals as an interactive graphical display.
• Generation of a source synchronous marker stream aligned to the instant of glottal closure where there is detectable glottal activity.
• Temporal partitioning of the voiced signal into source synchronous frames and the unvoiced signal into fixed length frames where the length of an unvoiced segment exceeds a given time interval.

• Generation of a frequency domain representation of each frame using a form of DFT suited to the varying length frames.

• Presentation of frequency domain information as a graphical display.


• Generation of secondary frequency domain representations or parameters such as band energies and formant frequencies, energies and bandwidths.

• Generation, accumulation and processing of utterance length streams of frame based data including the temporal parameters to be tested.

• Accumulation of associations between frame based parameter streams and the phonemic alignment label in the form of association tables.

• Generation of parameter histograms over full phoneme durations or a sequence of sub-segmental and trans-segmental time slices.

• Generation of statistical summaries of the association tables either as formatted text files or as graphical displays.

• Aggregation of association statistics over the range of each phonemic label.
• The ability to automatically vary analysis system parameters over a specified range for optimisation purposes.

Of the many speech analysis software systems available, none was considered capable of providing all the functionality required for these experiments without the development of additional code. Several of the requirements, such as file manipulation, frequency domain transforms, graphical display of the signal and frequency domain, and SS analysis, were available with the Xwaves speech system that was employed in early trials, but it lacked flexibility and the addition of the remaining functions would have been restricted to unix C routines running on the licensed Xwaves server.

It was considered highly desirable that the software be developed on a PC, allowing the use of state-of-the-art development environments not then available on unix machines. Initial code was written as C routines that could eventually have been combined with Xwaves. Later consideration of system requirements, development speed, display requirements and package reliability led to the choice of Java as the final development environment. Java has several advantages over C or C++ such as tight compilation, cross platform operation (unix, Mac, Windows) and extensive object libraries for data structures (jgl) and user interface development (awt, Swing). The development of the JavaSpeech environment has considerably enhanced Java as a platform for speech analysis.

The SLab system was developed for this project using the MetroWerks Java development environment on Macintosh microcomputers and ported at source level to a four processor SUN 450 unix system for some of the production batch runs.

4.3 SLab Structure and Operational Overview

A detailed list of SLab objects (or Classes) is given in Appendix 1. The overall system structure of SLab is shown as a simplified data flow diagram in Figure 4.1. Starting from the bottom left, speech files are selected from a list generated from a list file or from a user GUI selection. The Utterance object provides pre-processing and partitioning of the signal into source synchronous frames (SLab partitions). Each Utterance is passed to a StreamSet which converts it into a set of parameter streams - one stream for each acoustic parameter used, with one value per frame. The StreamSets for all the utterances from a single speaker are added to a StreamSetCollection which manages the generation of parameter histograms and parameter scaling transforms. The AssociationMatrixSet takes a StreamSetCollection object and produces an AssociationMatrix for each acoustic vector structure specified in a preferences file.

[Figure 4.1 is a data flow diagram: WAV, list, AV structure, alignment and phone files feed SLabWav, Utterance, Partition, FrequencyDomain and StreamSet objects; StreamSetCollection, ParameterStream, ParameterHistogram, HistogramSet, AVSet, AssociationMatrix, AssociationMatrixSet, PhoneSet, Alignment, SLabFilter, phone/diphone/triphone counters and the GUI complete the flow.]

Figure 4.1: SLab data flow and main Java classes

In addition to the ‘wav’ data files and a general system preferences file that specifies the type of run (e.g. manual or batch) and specific data processing options, SLab reads information from four other files: a file list of wav files to process (unless a directory is specified, in which case the whole directory is used); an AV structures file that lists each of the acoustic vector structures to be tested in a batch run - e.g. the specification ‘F1&F1.Sim’ describes a two dimensional AV containing the F1 frequency combined with its similarity length; an alignment file associated with each wav file containing the phoneme labelling data; and a phoneme data file containing a list of all the phonemes used in the labelling and their phonemic class descriptors.

Each SLab run uses data for one or more speakers with association tables accumulated for individual speakers and one for aggregated data for all speakers. A spreadsheet was used to post-process output data, to compare and combine data from different speakers and different runs and to generate graphs.


4.4 The Speech Data

Initial testing of the Source Synchronous framing was performed with data from the ACRB database (Millar et al., 1989) since it has associated Electroglottograph data. All later work was done with data from the Australian National Database of Spoken Language (ANDOSL) which consists of high quality recordings performed in an anechoic chamber. Further information on the ANDOSL collection can be found at the ANDOSL web site (Kroot et al., 1995).

Although the focus of this work is on individual speaker characteristics, six speakers were used to provide a basis for comparison. Speakers were taken from the ‘Cultivated’ speaker class which was considered to most closely reflect international English pronunciation. Of the speakers in this class, four (1 male, 3 female) had manually generated or corrected phonemic alignments and were included for this reason. Two male speakers were chosen at random (from the Cultivated speakers whose data was automatically aligned) to give an equal number of males and females. Speaker details are tabulated in Appendix 2. Three of the speakers reported as having abnormal characteristics (gravelly, hesitant and slurred) were retained in the set to add to speaker diversity (see Table A2.3, Appendix 2). All of the phonemically balanced sentences in the ANDOSL 200 sentence set were used for each speaker providing around 12 minutes per speaker or a total of over 74 minutes of speech and over 620,000 frames.

Much of the original system testing and early optimisation were carried out using male speaker S106. Later optimisation of the formant picking algorithm also used speaker S065 and, to a lesser degree, the remaining four speakers that were included to provide a degree of diversity. For parameter histograms and association matrices, aggregation of data from all speakers was performed to test the generality of individual speaker results. Aggregated data for the six speakers is referred to in this thesis as ‘Spk6’. Of the individual speakers S106, S083, S081 were male and S065, S058, S052 were female. The fully automatically aligned speakers were S083 and S081. Further information on the speakers is given in Appendix 2.2.

SLab Utterance methods provide a range of signal pre-processing options such as: a rumble filter to remove very low frequency and DC components from the signal; a pre-emphasis filter to lift high frequency components; a trimming algorithm that removes the lead-in and trailing non-speech signal; generation of utterance level time domain statistics such as mean frame energy, number of frames and mean glottal period.

Pre-emphasis was provided by the same filter as used in the fundamental harmonic extraction stage described in Section 4.5 - a time reversed leaky integrator. The optimum value for the memory constant a was determined from aggregated acoustic-phonetic association performance of formant based acoustic parameters and temporal parameters based on them. For pre-emphasis the setting was a=0.95 with each iteration including the time reversal pass. This gives a degree of pre-emphasis double that of the single pass (1 - 0.95z^-1) filter commonly used for pre-emphasis in speech signal processing. Increasing a from 0.95 to 0.97 in a time reversed double pass, moving toward a more normal degree of pre-emphasis, degrades mean association strengths for all phonemes and AVs by 0.32 ranking points but does not significantly alter the qualitative results reported here.

4.5 Source Synchronous Framing

The techniques of source-synchronous framing have been extensively described by Hess (1974, 1983, 1984, 1987) and many others. The method used in this thesis for generation of a stream of source synchronous timing markers was referred to by Hess (1983) as Fundamental Harmonic Extraction (FHE). The details of this technique and the experiments conducted to test its effectiveness are reported in (Davies and Millar, 1996). The technique consists of passing the speech signal through a low-pass or band-pass filter, extracting the low frequency signal component at the glottal frequency (F0). A peak detection algorithm was then applied to the near-sinusoidal filter output to give estimates for the Instant of Glottal Closure (IGC).

Dologlou and Caratannis (1989) have investigated this approach using a second order FIR low pass filter. The low order of their filter required a large number of iterations (>400) which makes their approach computationally intensive and led to a focus on the determination of an effective halting condition for the iterations in the subsequent debate (Dologlou and Caratannis, 1991). Reducing the number of iterations to a more practical level requires a higher order FIR filter, preferably with a recursive IIR form.

High order filters are still relatively computationally intensive and would not readily be made adaptive if it became necessary to track changes in F0 for optimal performance. A rectangular windowed moving average of any order can be implemented with the computational complexity of a second order filter but suffers from strong side-lobes in its frequency response. A symmetric, exponentially windowed filter of any order can also be implemented with second order complexity and with side-lobes that reduce in significance with increased filter order.

An approximation to a symmetric exponential filter with maximal order was implemented in recursive IIR form using a leaky integrator filter. Time reversal was used to produce time symmetry and, thus, zero phase characteristics. The difference equation for a simple leaky integrator, for a signal x(n) and filtered signal y(n) at time n, is

y(n) = a·y(n-1) + b·x(n);   b = 1 - a.   (E4.4)

The Time Reversed Leaky Integrator (TRLI) used here has two functional parameters, the memory constant (a) and the number of iterations (I). Its frequency response is

H(ω) = b^(2I) (1 + a^2 - 2a·cos ω)^(-I)   (E4.5)
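
A minimal sketch of the TRLI is given below. It is illustrative rather than the SLab code; it simply applies the first order recursion of E4.4 forward and then time reversed, repeating the double pass for I iterations. The parameter values are example settings only (the text reports a = 0.88-0.93 and I = 10 for the FHE stage).

    // Sketch of a time reversed leaky integrator (TRLI): the first-order leaky integrator
    // of E4.4 applied forward and then backward over the frame, repeated for I iterations.
    public class TrliExample {
        static double[] leakyIntegrate(double[] x, double a) {
            double b = 1.0 - a;
            double[] y = new double[x.length];
            double prev = 0;
            for (int n = 0; n < x.length; n++) {
                prev = a * prev + b * x[n];      // y(n) = a*y(n-1) + b*x(n)
                y[n] = prev;
            }
            return y;
        }

        static double[] reverse(double[] x) {
            double[] y = new double[x.length];
            for (int n = 0; n < x.length; n++) y[n] = x[x.length - 1 - n];
            return y;
        }

        // One iteration = forward pass + time reversed pass, giving zero phase overall.
        static double[] trli(double[] x, double a, int iterations) {
            double[] y = x.clone();
            for (int i = 0; i < iterations; i++) {
                y = reverse(leakyIntegrate(reverse(leakyIntegrate(y, a)), a));
            }
            return y;
        }

        public static void main(String[] args) {
            double[] signal = new double[200];
            signal[50] = 1.0;                    // an impulse shows the symmetric (zero phase) smoothing
            double[] filtered = trli(signal, 0.93, 10);
            System.out.println(filtered[45] + " " + filtered[50] + " " + filtered[55]);
        }
    }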

The technique was tested extensively for accuracy and possible failure modes using Electroglottograph (EG) data associated with the ACRB (Millar et al., 1989) speech database. Approximately 20 seconds of voiced speech was used from two speakers. The IGC was estimated from the EG signal by detecting the largest negative peaks of its second derivative. Visually, the 'knee' of the EG signal (Figure 4.2) conventionally associated with the IGC was better approximated by a minor peak of the third derivative but the second derivative proved more robust to varying signal conditions. The procedure used to optimise the FHE algorithm was to generate a histogram of the time differences between the EG signal and filtered speech and to maximise the height/area ratio of the histograms (Figure 4.3) by varying the parameters a and I.

Figure 4.2: Three approaches to detecting the ‘knee’ in the Electroglottograph signal using its time derivatives.

The signal derived markers were found to align closely with the EG derived IGC markers for all levels of voiced speech. The fundamental harmonic (FH) signal was found to persist, at a low level, through many speech segments that would normally be classed as unvoiced in a judgement based on the dominant frequency domain energy. Where the FH signal dropped below a fixed threshold, timing markers were placed at regular intervals of 500 samples - equivalent to a minimum F0 of 40Hz.

Figure 4.3 shows a histogram of the phase difference between the FH and EG signals for approximately 37s of voiced speech from speaker 6 of the ACRB data set. It has a standard deviation of 7.7% of the source period. Over the 17 speakers the standard deviation ranged from 4.0% to 19.2% with a mean of 7.4%. The broadest histogram came from the female speaker with the highest mean F0 (226Hz). Since the broader histograms clearly deviated from a normal distribution we also estimated an inter-quartile range spanning 50% of the points, which varied from 1.6% to 5.2% of the source period with a mean of 3.4%.

This result can be compared with the variance results reported by Dologlou and Caratannis (1989) of 2.4%. Using their FIR filter with 400 iterations on our data gave a variance of 5.0%. The difference can be attributed to the fact that they used a variable, and possibly higher, number of iterations but also to differences in speaker characteristics.

Figure 4.3: Histogram of phase differences between the EGG signal and the FHE markers. Abscissa is time (ms). The ordinate is frame counts.

In addition to the phase variance about the histogram peak for each speaker, there was an inter-speaker variation in histogram peak positions, or absolute phase, of over 1 ms. This absolute phase difference showed little variation with the degree of filtering (I).

The source of this inter-speaker variation is unclear. The phase lags for male and female speakers differed, on average, by 0.45 ms or 2 standard deviations. Assuming an average vocal tract length of 17.5 cm (Fant, 1960) and a mouth to microphone distance of 27cm, the total time delay from source to microphone is 1.4 ms. The assumption of inter-speaker variations of up to 10 cm (including variations of speaker size and head position) can only account for approximately 0.3 ms of the 1ms inter-speaker variation.


[Figure 4.4 plot (run R6328): frame counts (0-4000) against glottal period (2.5-15 ms) for speakers S106, S083, S081, S065, S058 and S052.]

Figure 4.4: Histograms of glottal period for six speakers.

The adequacy of the alignment accuracy for the current purpose of positioning a source synchronous framing marker was confirmed in further tests of the sensitivity of the frequency domain transform to variations in frame starting position (Section 4.7). Starting and ending frames on the nearest zero crossing of the signal prior to the IGC timing markers gave best results in acoustic-phonetic association tests.

Figure 4.4 provides a summary of the output of the framing algorithm in the form of histograms of the glottal period (Tx). Framing errors can show up in such histograms as multimodal structures. Only a minor irregularity for speakers S052 and S058 (a distortion at 5 ms possibly associated with the low nasal resonance N1) is apparent with the current configuration.

Because of the large differences in mean glottal Tx between males and females the cut-off frequency of the fundamental harmonic filter was raised for the female speakers. Based on the results of optimisation runs, the memory parameter of the filter (a in Equation E4.4) was set to 0.88 for females and 0.93 for males. The number of iterations was 10 in both cases.

4.6 The Fourier Transforms

Performing a frequency domain transform on a segment of speech signal is, under the best of circumstances, an ugly compromise. In general, the oral articulators and the glottis are moving, so the signal is highly nonstationary, while the definition of frequency assumes a stable signal. To reduce the risk of Fourier transform artefacts resulting from signal instability influencing the results, two quite distinct Fourier transform algorithms were used in initial testing - the FFT and the Goertzel DFT algorithm (e.g. Oppenheim and Schafer, 1989). For the final results the Goertzel algorithm was used.

The typical burst and decay pattern of a glottal period of voiced speech suggested two possible alternative transforms: wavelet transforms and the Z transform. Both of these options were considered and preliminary tests performed in the Z domain.

A chirp-Z algorithm was implemented by applying an exponential multiplier to the signal frame to shift the signal radially across the unit circle in the Z-domain where it was sampled using a Fourier transform. Using a formant bandwidth as a measure of transform performance it was found that the signal cycled radially in the Z domain over each glottal cycle as expected. Although apparent formant bandwidth could be reduced slightly by this radial shifting of the signal there was no significant shift in the formant frequency so the Fourier transform alone was used in all subsequent work. This conclusion forms the basis of our claim that the glottal epoch defines the shortest practical time unit for generating time streams of acoustic parameters in voiced speech.

The Goertzel algorithm was the main algorithm used, largely because the FFT in its standard form suffered from two practical limitations - constraint of both the number of input signal samples and output frequency samples to a power of 2. The radix 2 input range constraint of the standard FFT dictated at least 256 samples per frame to cover long glottal periods with zero padding to fill the unused input range. This reduces the FFT’s efficiency for short glottal periods relative to the Goertzel algorithm which can have a variable, and usually lower, number of input samples. The FFT sacrifices flexibility for speed. In our case, with low sample numbers, the speed difference between the two algorithms was less than 30% and the impact on overall processing time was negligible. More complex FFT algorithms allowing freedom from the radix 2 constraint would be likely to lose this small speed advantage.

The Goertzel algorithm has the advantage that the frequency domain samples can be independently specified so that the granularity of the frequency domain information can be adjusted to requirements. The Goertzel transform can be viewed as a set of narrow bandpass filters, the centre frequencies of which can be set to any desired values. A Mel scale option was included and used in initial testing of frequency band energy ratios. In the experiments described in this thesis a linear frequency scale was used because a subsequent nonlinear transform was performed on all acoustic parameters (see Section 4.9).
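
As an illustration of this view of the Goertzel transform, the following sketch evaluates the spectral energy of a frame at an arbitrary set of centre frequencies. It is not the SLab code; the frame contents and the frequency list are arbitrary example values, and only the 20 kHz sampling rate matches the data used in this work.

    // Illustrative Goertzel evaluation of spectral energy at arbitrary centre frequencies.
    public class GoertzelExample {
        // Squared magnitude of the DFT of frame x at frequency freq (Hz), sample rate fs (Hz).
        static double goertzelPower(double[] x, double freq, double fs) {
            double omega = 2.0 * Math.PI * freq / fs;
            double coeff = 2.0 * Math.cos(omega);
            double s1 = 0, s2 = 0;
            for (double sample : x) {
                double s0 = sample + coeff * s1 - s2;
                s2 = s1;
                s1 = s0;
            }
            return s1 * s1 + s2 * s2 - coeff * s1 * s2;
        }

        public static void main(String[] args) {
            double fs = 20000;                        // sampling rate used for the speech data
            double[] frame = new double[100];         // a 5 ms example frame
            for (int n = 0; n < frame.length; n++) {  // synthetic content: a 500 Hz tone
                frame[n] = Math.cos(2 * Math.PI * 500 * n / fs);
            }
            for (double f = 100; f <= 1000; f += 100) {
                System.out.printf("%5.0f Hz : %10.2f%n", f, goertzelPower(frame, f, fs));
            }
        }
    }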

Testing of the transforms was performed using SLab in interactive mode and also automated in batch mode where an aggregated measure of AV-Phoneme associations was optimised. While viewing the time domain representation of the signal a frame was selected either by selecting a source synchronous partition or some larger or smaller time span of the signal. A frequency domain transform for the selected signal interval was then displayed.


Figure 4.5: SLab representation of time domain signal for frame 164 of sentence S001 for speaker S106. The sampling rate was 20 KHz. Tall vertical bars are 5 ms apart. Shorter vertical red bars are frame boundaries with red frame number.

A sample time domain plot is shown in Figure 4.5 and corresponding frequency domain plots in Figures 4.6 and 4.7. The shorter red vertical bars mark the source synchronous frame boundaries. The five short green bars about frame marker 164 show the start points for the examples in Figure 4.6. The frame is towards the end of the phoneme /ei/ in the word ‘range’.

Frame 164 was chosen as a relatively extreme case of the timing marker falling among signal samples that are well away from zero. Since this situation could potentially produce problems for the frequency domain transform it was used for the tests described below. In the tests described in this section no windowing of the frame was used. Elsewhere, a cosine window was applied to the frame in the conventional manner to smoothly reduce the signal to zero at the frame boundaries. It had little impact on the Goertzel transform.

Figure 4.6 illustrates the sensitivity of the frequency domain transform to variations in the starting position of the frame relative to the IGC marker and its nearest zero crossings. A 90 sample (4.5ms) frame was shifted successively from 40 samples prior to the marker to 40 samples past it in steps of 20 samples. These positions are marked in green in Figure 4.5. The centre example (trace 3 of Figure 4.6) is the standard IGC aligned frame.

The short red vertical bars of Figure 4.6 are markers for the main peaks. The longer green extensions to these are the formant selections numbered 1 to 3 (see Section 4.7). In all these examples the low frequency nasal energy can be seen below F1. This appears because the frame occurs just prior to the boundary of /ei/ and /n/ in the word ‘range’ and is thus partly nasalised.


Figure 4.6: SLab spectral plots for frame 164 of sentence S106S001. Vertical axis is signal energy in arbitrary units that differ for each frame. Horizontal axis is frequency (0 to 5 KHz). Frames differ in their starting times marked by green bars in Figure 4.5. Formants are labelled 1 to 3.

In plot 1 of Figure 4.6 the signal starts from a zero crossing (left green line in Figure 4.5). Plots 4 and 5 start in the burst following glottal closure. These examples show that there is little sensitivity to frame positioning where only one glottal impulse is sampled. Batch tests showed marginally improved acoustic-phonetic association strengths when each frame was started and ended at the last signal zero crossing prior to the source markers. For the results reported in Chapter 5 frame boundaries were marked at zero crossings.

At least two factors influence the impact of frame length and the positioning of the frame within the glottal period. Firstly, the initial region of the glottal period can be expected to be more stable because the glottis is closed, though possibly still moving axially. Secondly, high frequency components of the signal appear to decay faster than low frequencies so sampling early in the glottal period was expected to give best results across all frequencies of interest.

Tests were conducted to determine the optimal frame length. Visual inspection of spectral plots such as Figure 4.7 for a wide range of voiced speech suggested that 80 to 100 samples were needed to gain good transform behaviour at the lowest frequencies of interest. These observations were confirmed in batch tests in which the number of samples was varied from 40 to the full glottal period.

Figure 4.7: Testing minimum frame length. Frames were all started on the SS marker. The frame length was varied in 20 sample increments from 20 samples in test 1 to 100 samples in test 5. Test 6 uses the full glottal period of 172 samples.

In plots 5 and 6 of Figure 4.7 it can be seen that 100 samples and the full 172 samples give nearly identical results, which suggests that variable frame length is not significantly affecting spectral resolution below a glottal frequency of 200 Hz with the 20,000 samples/s sampling rate used here. For shorter frames, plot 3 (60 samples) has correct formant positions but, at the low frequency end, the nasal murmur (N1) is merged with F1. In plot 4, with an 80 sample or 4 ms time span, we expect a low frequency roll-off at around 250 Hz and we find that the 200 Hz resonance is just emerging from the low frequency skirt of F1.

The low frequency performance possible with synchronised frames is better than can be achieved using fixed length, unsynchronised frames crossing glottal epochs. Using multiple epochs generates F0 harmonics that require low frequency filtering to reduce them. The enhanced low frequency performance achieved with synchronised frames provides a clear picture of the lowest nasal resonance which, in the data used in this work, is the strongest and most reliable of the nasal resonances and a useful cue to the presence of higher nasal resonances.

A comparison of Goertzel and FFT spectral peak histograms is given in Figure 4.8. The main difference is the occasional low frequency peak in the FFT spectra which causes some problems for the formant picker.

The comparison with LPC techniques is less simple. Although the quality of the results would appear, from LPC examples in the literature, to be similar it should be noted that the Goertzel algorithm is significantly more computationally efficient. This is particularly true with the availability of fast floating point processors since the Goertzel filter uses trigonometric functions to model sinusoidal signal components which gives it a computational advantage over autocorrelation techniques normally used in LPC analysis.

Overall, the combination of source synchronous framing and the Goertzel DFT was found, at least for clean signals, to be robust (in that no significant failure modes were found), flexible (particularly with regard to degree of voicing) and highly computationally efficient. It produced high quality frequency domain transforms for the formant picking stage.

4.7 The Formant Picker Algorithm

The task of the formant picker is to select the main cavity resonances from noise and frequency transform artefacts and to separate the oral and nasal resonances where both coexist (e.g. Schafer and Rabiner, 1970).

For the results reported here only the first three formants were detected and used. These can be seen (marked 1, 2, 3) in Figures 4.6 and 4.7. The same formant picker algorithm and settings were used for all speakers. The low frequency nasal resonance (N1) was monitored and its size included in the acoustic parameter set. The second (assumed) nasal resonance (N2) around 1KHz was detected where possible and rejected as a candidate for F2.

The data input to the formant picker, the raw peak data after the elimination of all but the five strongest peaks, is illustrated in Figure 4.8a. Histograms such as this were used to check the performance of the formant picker. They provide insight into the overall resonance structure and how it varies from speaker to speaker (See Appendix 5 for examples from other speakers).


[Figure 4.8 plots (runs R6348-9 and R6362-3): frame counts (0-4000) against frequency (0-4 kHz) for peaks PkT and Pk0 to Pk4.]

Figure 4.8: Histograms from speaker S106 for the 5 largest spectral peaks prior to formant selection. (a) Goertzel, (b) FFT. Ordinate is frame counts. PkT is total counts. Low frequency peaks are truncated for display purposes.

Initial attempts at formant picking for a single speaker were based on the conventional approach (e.g. Samouelian, 1997) of constraining formant choices to the normal frequency ranges for that speaker. With this approach difficulties arise mainly in the regions of range overlap. If the overlap is reduced by constraining the formant ranges, valid formant outliers can be missed and multi-speaker performance deteriorates due to the reduced ability to cope with variations in individual speaker formant ranges.

The second approach focussed on picking the largest peaks and reflected an attempt to restrict the use of frequency limits to secondary decisions such as the presence of nasals or merged formants. The final algorithm combined both approaches. It represents best efforts for speaker generality rather than optimisation for an individual speaker.

The values for the frequency limits were determined through an optimisation process in which each limit, in turn, was varied and aggregated acoustic-phonetic associations of the formants were maximised. For the initial speaker dependent algorithms several iterations of optimisation were performed though no attempt was made to perform a comprehensive system wide multivariate optimisation. For the speaker independent or peak energy based algorithm only one univariate iteration of optimisation was used.

The formant picker algorithm is given in Appendix 3. It can be summarised as:

a. Generate a frequency domain representation with 250 points spread linearly over a 5 kHz range.
b. Build a list of frequencies and peak sizes for energy peaks and troughs.
c. Remove peaks above the frequency MaxF3 from the list.
d. Remove the smallest peaks until only 5 peaks remain in the list.
e. If there are no peaks below maxF1, set N1=F1=0 (constrains false outliers in F1 histograms during unvoiced speech).
f. Set N1 as the lowest peak (used only if F1=N1).
g. Merge F1 and N1 if N1 < maxN1 and N1Width > maxN1Width. Remove the smallest remaining peak when F1 = N1.
h. Allocate F1.
i. Merge F1 and F2 if ((F1/F2 size ratio < MaxF1F2SizeRatio) and (F1 > minOpenF1) and (F1Width > maxF1Width) or (next peak > maxF2)). Remove the smallest remaining peak when F2 = F1.
j. Allocate F2 and F3 as the remaining peaks in ascending frequency.

The minOpenF1 test of stage (i) primarily eliminates confusion with F1=N1 when F1 is wide so the value is set low and is not critical for any speaker.

The five strongest peaks, as fed to the formant picker, are illustrated in the time frequency plot of Figure 4.9a. The first three formant choices are shown in 4.9b. Many hundreds of formant stream plots such as these were inspected visually in an attempt to detect and characterise possible failure modes of the formant picker. Where the formant structure looked unclear or irregular its validity was checked by viewing a spectral plot (e.g. Figures 4.6 and 4.7) for the frame and its neighbours. In some cases a comparison was made between the Goertzel transform and the FFT. No major qualitative differences were observed between the two transform algorithms though differences in peak height and width occasionally influenced marginal formant picker decisions.

Most questionable formant choices appeared to stem from real features of the signal rather than resulting from picker errors or artefacts of the frequency domain transform. Errors appeared to arise largely from abnormal formant energies and occasional failure to detect formant (or formant-nasal) overlap. In some cases it was not possible to determine a reliable formant structure visually.


Parameter            Frequency (Hz)
MinF2                900
MaxF1F2SizeRatio     10
MaxN1                380
MinOpenF1            450
MaxF1Width           1400
MaxN1Width           500
MaxF3                5000
MaxF2                2600

Table 4.2: Formant Picker Critical Frequencies. Widths are defined as the difference between adjacent minima.

Counting possible misallocation of formants over the vocalic segments of the first two utterances for each speaker showed wide speaker variations, with F1 errors varying from zero per utterance to 5% of frames. For F2, error rates generally varied between 10% and 15%. Note that most F1 errors lead to an F2 error and so on.

We need to question the validity of such visual and intuitive judgements of error. We have found some evidence that formant anomalies or outliers may be providing useful phoneme discriminating information. The outlier constraint algorithm, which corrected some of these suspect frames purely on the basis of continuity with adjacent frames, was found to significantly reduce aggregate acoustic-phonetic association strengths as did smoothing in some cases (reported in Section 5.2). Modification of the formant picker aimed at reducing apparent F2 errors was also found to reduce association strengths in some instances but no obvious pattern was apparent so results for different algorithms are not compared in this thesis.

An outstanding feature of the formant traces obtained from source synchronous analysis is the frequent lack of continuity. In many cases the transitions between nasals or stops and neighbouring vowels take place over a single glottal epoch.


[Figure 4.9 plots (runs R6349 and R6348): frequency (0-5 kHz) against SS frame for utterance S106S001, showing (a) the peaks Pk0 to Pk4 and (b) the formants F1 to F3, with the phoneme alignment labels along the top.]

Figure 4.9: Temporal parameter streams for (a) frequency domain peaks and (b) formants for sentence S001 from male speaker S106. Note that the abscissa is glottal epochs not absolute time.

It is a common assumption behind much discussion of vocal acoustics that formants follow a path that directly and smoothly follows articulatory dynamics, though many (recently Deng, 1998) have pointed to the complexity of vocal tract resonances (and anti-resonances) in conditions of oral constriction. The picture of smooth articulator driven, in a sense steady state, vocal tract resonances is generally valid for the duration of most vowels and diphthongs lacking nasalization. Conditions arise, however, where mode switching between nasal and oral resonances dominates or other rapid fluctuations in formant energy occur. This can happen during the oral closure of consonants, or even in poorly articulated vowels over which a high degree of oral constriction is retained. In extreme cases where the articulatory setting is held near the oral-nasal boundary, energy can alternate between resonance modes for several glottal epochs. An example of this phenomenon can be seen in Figure 4.9a in the first /E/ of ‘expected’ which is weakly voiced and appears to be partially nasalised. The peak that is switching between 1400Hz and 2500Hz is unstable.

Apparently anomalous formant choices seen as the multimodal structures in the formant frequency histograms of Figures 4.10, 4.17 and 4.18 could, in a simple analysis, be assumed to be introducing significant errors into the acoustic-phonetic associations discussed in Section 4.11. A detailed view of the competitive operation of the association tables, however, suggests a mechanism for seeing these secondary modes as providing additional acoustic information. If these secondary modes fall largely outside the normal frequency range for a formant they can still provide a unique descriptor for the acoustic context in the manner of a hash table key. If the anomalous value is the result of extremes in formant energies, formant coincidences or the presence of nasal peaks then the displaced formant energy may be signalling this situation and providing additional information to the association tables which is lost when outlier constraint is applied.

[Figure 4.10 plot (run R6348-9): frame counts (0-5000) against frequency (0-4 kHz) for FT, F1, F2 and F3.]

Figure 4.10: Histograms of formant peaks for speaker S106. FT is total counts. The F1 peak is truncated for display purposes.

For these reasons no attempt was made within the formant picker itself to adjust formant data to correct outliers or smooth the formant tracks using continuity constraints from neighbouring partitions. The outlier constraint in step (e) of the formant picker algorithm is based on the absence of a viable candidate for F1 within one frame rather than continuity constraints involving neighbouring frames. In this case F1 was reset to zero rather than the local F1 mean used in the outlier constraint algorithm. Smoothing and outlier constraint were investigated as temporal modifiers to the base formant data and are discussed in Section 4.10.3.

Investigation of the structure of the spectral peak histograms assisted the design of the formant picker and is relevant to our analysis of formant transitions at CV boundaries to be discussed further in Section 4.12.

In Figure 4.10 we can see accumulated frequencies of the formants chosen from the raw peaks of Figure 4.9a. Some signs of possible peak misallocation for F2 and F3 can be seen in the broad F3 peak below 1700Hz. F3 (mainly around 2.4KHz) also has outlier peaks in the range of F4. Residual instances of N1 can be seen as a small peak in F1 at around 200Hz.

F0 is not likely to be having a significant impact on results given the absence of its harmonics in the raw peak histograms of Figure 4.10 which shows some small peaks at irregular intervals and not at harmonic multiples of F0. Histograms of F1/F0 and F2/F0, which in some early tests proved very sensitive to F0 harmonics (eg. Davies and Millar, 1996, Figure 4), failed to show any trace of them with current configurations. Even so, some ‘leakage’ of F0 information between frames is possible where significant acoustic energy remains at the end of the glottal epoch and has an impact on the energy and phase in the following frame.

To determine the sensitivity of the formant picker to ‘noise’ in the frequency domain, tests were conducted in which the spectral data for each frame was smoothed using the TRLI filter (as used for the FHE stage) with the parameter a of equation E4.4 set to values ranging from 0.6 to 0.85. Typical resultant formant data are shown in Appendix 4. As a result of A-P association tests, spectral data for all results reported in Chapter 5 were produced with spectral smoothing (a = 0.6).

4.8 Base Acoustic Parameters

The base parameters are those generated from an individual signal frame. In the terminology used in this thesis a temporal parameter is generated by applying a SLab temporal modifier, such as ‘Sim’ in the case of the PSL, to the base parameter. Table 4.1 lists the base acoustic parameters that were tested. Frame Energy (Etot) is the total energy of a source synchronous frame. All other energy parameters were normalised with respect to it. Formant energy, or spectral peak size, was taken as the energy integral between adjacent energy minima. Formant bandwidths were evaluated as a peak’s size divided by its height.
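
A minimal sketch of these two peak measures, assuming the indices of a peak and its adjacent minima are already known from the peak/trough list, is:

    // Illustrative peak measures: peak size as the energy integral between adjacent spectral
    // minima, and bandwidth as the peak's size divided by its height.
    public class PeakMeasures {
        // spectrum: energy at each of the 250 linearly spaced frequency points (0-5 kHz)
        static double peakSize(double[] spectrum, int leftMin, int rightMin) {
            double sum = 0;
            for (int i = leftMin; i <= rightMin; i++) sum += spectrum[i];
            return sum;
        }

        static double peakBandwidth(double[] spectrum, int leftMin, int peak, int rightMin) {
            return peakSize(spectrum, leftMin, rightMin) / spectrum[peak];
        }

        public static void main(String[] args) {
            double[] s = {1, 2, 6, 10, 7, 3, 1};              // a toy spectral peak between two minima
            System.out.println(peakSize(s, 0, 6));            // 30.0
            System.out.println(peakBandwidth(s, 0, 3, 6));    // 3.0 (in frequency bins)
        }
    }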

The Excitation energy (Ex) was taken as the local amplitude of the fundamental harmonic generated for SS framing. Nasal energy (En1) was the peak size of the lower nasal resonance at around 200Hz. Parameters 13 to 15 were tested as alternatives to En1. Parameters 16 to 19 were taken as typical band energy examples from the literature. They were intended to capture general formant information. The formant energy ratios of parameters 20 to 22 are measures of the ratio of energy in the lower half of the formant frequency range to the total energy in this range. They measure a combination of formant energy and frequency.

     Parameter                   Code
 1   Frame Energy                Etot
 2   Excitation Energy           Ex
 3   Glottal Period              Tx
 4   1st Formant Frequency       F1
 5   2nd Formant Frequency       F2
 6   3rd Formant Frequency       F3
 7   F0 to F1 Frequency          F0toF1
 8   F1 to F2 Frequency          F1toF2
 9   F1 Energy                   Ef1
10   F2 Energy                   Ef2
11   F3 Energy                   Ef3
12   Nasal Energy                En1
13   0 to 300Hz Energy           E0to3c
14   0 to 400Hz Energy           E0to4c
15   100-300Hz Energy            E1to4c
16   0 to 2kHz Energy            E2k
17   600 to 2800Hz Energy        E6to28c
18   2 to 3kHz Energy            E2to3k
19   2 to 5kHz Energy            E2to5k
20   F1 Energy Ratio             ERf1
21   F2 Energy Ratio             ERf2
22   F3 Energy Ratio             ERf3
23   Bandwidth F1                BWf1
24   Bandwidth F2                BWf2
25   Bandwidth F3                BWf3
26   General Band Energy         Eb
27   General Band Energy Ratio   Er
28   Energy Moments              Em1-Em8

Table 4.1: Full list of acoustic parameters tested.

An approximation to the first eight Cepstral coefficients (excluding the zeroth) was made using frequency moments. The frequency band (5KHz in 250 discrete values) of a frame’s spectrum (S) was divided into 2N bands, where N is the Cepstral coefficient index, then taking the ratio of alternate band integrals over M frequency values we have the frequency moments:

Em_N = [ Σ_{b=0}^{n/2-1} Σ_{a=0}^{m-1} S(2bm + a) ] / [ Σ_{i=0}^{M-1} S(i) ],  where n = 2N and m = M/n   (E4.1)

and the related band energies and ratios:

Eb(b, m) = Σ_{a=0}^{m-1} S(bn + a)   (E4.2)

Er(b, m) = [ Σ_{a=0}^{m-1} S((b+1)n + a) ] / [ Σ_{a=0}^{m-1} S(bn + a) ],  with b and m as in E4.1   (E4.3)

Tests of the energy moment parameters were performed with both linear and Mel frequency scales.

Selection of parameters

Not all the parameters tested were used in the reported results. Evaluation of acoustic parameters for selection of the subset used in the experiments was done through analysis of mean acoustic-phonetic association rankings such as those illustrated in Figure 4.11. The association tables used to generate these diagrams are discussed in detail in Section 4.11. At this stage it is only necessary to note that low rankings represent strongest acoustic-phonetic associations. The stops are shown from three perspectives: all stops (S); disaggregated according to voicing (US, VS) and according to place of articulation (FS, MS, BS or front, middle and back). Vowels are disaggregated into length classes. Class definitions are given in Appendix 2.

All acoustic parameters in Table 4.1 were trialed along with their delta modifiers and PSLs. The parameters Em, Eb and Er are actually parameter sets and are not included in the one dimensional AV results reported in this thesis. Although individual energy moments such as Em4 can show 1D results comparable to those of the formants, interpretation of these results was considered to be beyond the scope of this thesis.

Of the remaining parameters, the ten with the highest mean rankings in each of four contexts (best vowel associations, best stop associations, best Delta parameter associations and best PSL associations) are shown in Figures 4.11a-d. From their prominence in rankings such as those of Figure 4.11, and from the considerations of mutual independence of acoustic parameters discussed in Section 3.2, it was decided to restrict most experimental runs to one or more of the base parameters in the subset [Tx, F1, F2, F3, Etot].

Perceptual conditioning of acoustic parameters such as log energy and frequency scales is not included in the results reported here since the parameters were subjected to a highly nonlinear quantisation process described in the next section.

[Figure 4.11a,b (R6494 S106): mean association rank against phoneme class (S, US, VS, FS, MS, BS, N, V, LV, SV); panel (a), best 10 AVs for vowels: F0toF1, F1, E1to4c, E6to28c, F2, F1toF2, ERf1, En1, E2to5k, E2to3k; panel (b), best 10 AVs for stops: Etot, Ex, E6to28c, E2to5k, F1, F0toF1, E1to4c, F2, E2to3k, En1.]

Figure 4.11a,b: Ten best base parameters for (a) vowels and (b) stops listed in order of increasing mean rank. Abscissa classes (see footnote 4) include broad vowel and stop aggregates along with their subclasses and the nasals.

4 Many of the graphs in this thesis are presented with lines connecting data points even though they have a categorical abscissa. The lines are not intended to imply continuity or order in abscissa values, rather they are included to enhance visual connection between values for individual AVs.

[Figure 4.11c,d: mean association rank against phoneme class (S, US, VS, FS, MS, BS, N, V, LV, SV); panel (c) (R6498 S106): Tx.Del, F2.Del, F1toF2.Del, F0toF1.Del, F3.Del, ERf1.Del, Ex.Del, Etot.Del, E2to3k.Del, E2to5k.Del; panel (d) (R6497 S106): Tx.Sim, F1.Sim, F1toF2.Sim, F2.Sim, F0toF1.Sim, F3.Sim, ERf1.Sim, En1.Sim, E2to3k.Sim, Ex.Sim.]

Figure 4.11c,d: Ten best parameters for (c) Del and (d) Sim. Listed in order of increasing mean rank.

4.9 Acoustic Parameter Scaling

If the data in the tables used for evaluating A-P associations were sparse, some form of table smoothing would be required. To avoid this, the input bandwidth of the tables was minimised, which reduces table size and aggregates the data to the point where smoothing is no longer necessary. Error analysis of the subsequent acoustic-phonetic associations was simplified by the provision of approximately equal populations for all values of the input acoustic vector.

All raw acoustic parameters were initially mapped to an integer value from 0 to 1023. The parameter values were scaled down to a smaller range (eventually 0 to 15) and, using parameter histograms, transformed in a nonlinear manner to give approximately equal populations for each parameter value (or bin). A sequence of optimisation trials showed that 4 bits was adequate granularity while still maintaining significant populations for each bin (as discussed in Section 4.11).

Figure 4.12 shows the impact on acoustic vector association rankings of varying the bandwidth of the AV from 1 bit to 8. At 8 bits the mean rank is still dropping but the difference between 4 and 8 bits was not considered significant for these experiments and 4 bits allows for testing of two-dimensional AVs while still maintaining data density and small table size.

[Figure 4.12 (R6413 S106: Varying AV bandwidth): mean association rank against AV bandwidth (1-8 bits) for Tx, F1, F2, F3, Et and the AV mean M.]

Figure 4.12: Phoneme mean association rankings over varying AV bandwidth for speaker S106. M is the AV mean.

Several algorithms were tested for the equi-population transform. The simplest divided the total population by 16 and then, in a single linear scan, partitioned the parameter histogram into hexadecile units. This approach performed poorly where parameter distributions were peaky, which is often the case.

A more effective two stage algorithm was tested. The first stage consisted of successively merging the two adjacent bins of the 1024 bin input histogram which have the smallest combined population, continuing the process until only 16 bins remain. In the second stage, input domain boundaries developed in the first stage are adjusted to balance populations of output bins on a ‘smallest impact first’ basis. This approach still did not deal perfectly with very peaky distributions since large peaks, where one bin approaches or exceeds 1/16th of the total counts, will not divide deterministically but it proved to be an effective and reliable transform. A summary of the Java code of these algorithms is given in Appendix 3.
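A minimal sketch of the first (merging) stage of this transform is given below, assuming a 1024-bin input histogram and 16 output bins; it is illustrative only and is not the Appendix 3 code. The second, boundary-balancing stage is omitted, and the class and method names are hypothetical.

```java
// Stage one of the equi-population transform: adjacent bins of the input
// histogram are merged, smallest combined population first, until the target
// number of bins remains.  Names are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class EquiPopulationScaler {

    /** Returns the exclusive upper boundaries of the merged output bins. */
    public static int[] mergeBins(long[] histogram, int targetBins) {
        List<Long> counts = new ArrayList<>();
        List<Integer> upperEdge = new ArrayList<>();
        for (int i = 0; i < histogram.length; i++) {
            counts.add(histogram[i]);
            upperEdge.add(i + 1);
        }
        // Repeatedly merge the adjacent pair with the smallest combined count.
        while (counts.size() > targetBins) {
            int best = 0;
            long bestSum = Long.MAX_VALUE;
            for (int i = 0; i + 1 < counts.size(); i++) {
                long sum = counts.get(i) + counts.get(i + 1);
                if (sum < bestSum) { bestSum = sum; best = i; }
            }
            counts.set(best, bestSum);
            upperEdge.set(best, upperEdge.get(best + 1));
            counts.remove(best + 1);
            upperEdge.remove(best + 1);
        }
        int[] edges = new int[counts.size()];
        for (int i = 0; i < edges.length; i++) edges[i] = upperEdge.get(i);
        return edges;
    }

    /** Maps a raw 0-1023 parameter value to its 4-bit bin index. */
    public static int quantise(int rawValue, int[] edges) {
        for (int bin = 0; bin < edges.length; bin++) {
            if (rawValue < edges[bin]) return bin;
        }
        return edges.length - 1;
    }
}
```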

An assessment of the effectiveness of the compression algorithm can be made from the variation in bin populations for 24 acoustic parameters shown in Figure 4.13. The mean value of all parameter value (bin) counts is 6520. The second stage of the algorithm reduces the standard deviation of the counts from 1710 to 1171 or 18% of the mean. Outlier values in Figure 4.13 are for Total Energy (at parameter value 0) and Bandwidth of F1 (at parameter values 6 and 7).

[Figure 4.13: bin population counts against acoustic parameter value (0-15).]

Figure 4.13: Example of bin population range for the first 24 acoustic parameters of table 4.1.

4.10 Temporal Parameter Modifiers

Several temporal transforms were applied to the base acoustic parameters separately and in combination. In the terminology used in this thesis a modifier is a function applied singly or in succession to a base (non temporal) acoustic parameter. The SLab AV P.Dif.Sim, for example, applies the Dif (time differential) and Sim (PSL) modifiers to the base parameter P. In functional notation this would be Sim(Dif(P)).

4.10.1 Time Differentials

Differentiation was performed both as a simple difference between adjacent values and as a symmetrical differential of variable temporal range or order.

Del
Dif2    Dif2.Dif2    Sim2    Dif2.Sim2    Dif2.Dif2.Sim2
Dif4    Dif4.Dif4    Sim4    Dif2.Sim4    Dif2.Dif2.Sim4

Table 4.3: Temporal parameter modifiers used in comparisons. DifN is a differential estimated over a temporal range of N+1 frames. SimN is a PSL with parameter range 10N% of parameter standard deviation.

We use symmetrical, even-ordered estimates, with the exception of the first order difference or ‘delta’ parameter, referred to here as the ‘Del’ modifier. The full set of acoustic parameter modifiers used for the comparisons between PSLs and parameter differentials is listed in Table 4.3.


The range N differential of P at time T is given by:

Dif_N = \frac{dP_T^N}{dt} = \frac{2}{N} \sum_{i=1}^{N/2} \frac{P_{T+i} - P_{T-i}}{t_{T+i} - t_{T-i}}        E4.6

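A minimal sketch of E4.6 as reconstructed above, assuming per-frame parameter values and frame start times (in ms) held in arrays indexed by frame; the method and array names are hypothetical and this is not the SLab code.

```java
// Range-N symmetrical differential of E4.6 at frame T: the mean of the N/2
// symmetric slope estimates around T.  Names are illustrative only.
static double difN(double[] p, double[] t, int T, int N) {
    double sum = 0.0;
    for (int i = 1; i <= N / 2; i++) {
        // each symmetric pair (T-i, T+i) gives an independent slope estimate
        sum += (p[T + i] - p[T - i]) / (t[T + i] - t[T - i]);
    }
    return (2.0 / N) * sum;   // average over the N/2 estimates
}
```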
4.10.2 The Parameter Similarity Length

The PSL was evaluated in the following manner (a code sketch of this procedure follows the list):
• The standard deviation of parameter values was calculated.
• The similarity range (R) was set as a fraction of the standard deviation. Tests were performed for ranges varying over 5% to 90% of the SD.
• For a point m in the time domain parameter stream with parameter value Xm, the stream was searched forward and backward to find the frames, m+f and m-b, where X was outside the range Xm ± R.
• The PSL was given the value Tm+f - Tm-b+1 where Tm is the start time (in msec) of the mth frame. Alternatively, a frame based time scale was tested using f+b+1 as the PSL value.
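The sketch below gives one reading of this search, returning the frame-based PSL; the absolute-time variant substitutes the start times of the corresponding frames. The similarity range is passed in already scaled (a fraction of the parameter’s standard deviation), and the array and method names are hypothetical rather than the SLab implementation.

```java
// Frame-based PSL at frame m for one parameter stream x[]; range is the
// similarity range R (a fraction of the parameter's standard deviation).
// Names are illustrative only.
static int frameBasedPsl(double[] x, int m, double range) {
    int f = 1;   // forward offset to the first frame outside Xm +/- range
    while (m + f < x.length && Math.abs(x[m + f] - x[m]) <= range) f++;
    int b = 1;   // backward offset to the first frame outside Xm +/- range
    while (m - b >= 0 && Math.abs(x[m - b] - x[m]) <= range) b++;
    // the searches stop at the stream ends if no dissimilar frame is found
    return f + b + 1;
}
```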

Preliminary testing summarised in Table 4.4 showed that using absolute time gave the best results over all phonemes, with the frame (or glottal epoch) based time scale giving improved A-P association strengths for stops in some cases. The AV F2.Dif&F2.Dif.Sim, a 2D acoustic vector that gives the best overall association strengths for stops, gave a mean ranking advantage for the frame based timing (aggregated over six speakers and six stops) of 1.7 ranking points. Unless explicitly stated, PSL values used in this work were based on absolute time.

R6445-6: Absolute and Frame based PSL timing (F2.Del&F2.Del.Sim)

Class     Absolute time    Frames based
Vowels    10.1             13.5
Stops     11.3              9.6

Table 4.4: Comparison of class mean rankings for PSLs generated in absolute and frame based time scales for AV F2.Del&F2.Del.Sim.

Figure 4.14 illustrates a short time domain segment of the temporal parameter F1.Sim from the sentence S106S001. The 20% (Sim2) setting gives best short term detail and can be seen to perform a plausible phonetic segmentation of the stream in most cases in this example. The 60% setting better represents the full duration of vowels and diphthongs but misses short term detail. A setting of 40% provides a compromise. Further examples of PSL time streams are given in Appendix 6.


Figure 4.14: An example of the Similarity Length parameter F1.Sim for three values for the similarity range settings: 20%, 40% and 60% of the standard deviation of F1 values. Numbers on the abscissa are in 10 ms units.

The peaks at the temporal extremes of high valued PSL regions are an artefact of the algorithm used. At an F1 peak the algorithm sees only lower values at the temporal extremes so the similarity range in parameter value is effectively half that available on the skirt of a peak where both higher and lower values are found. Alternative, optimising algorithms were investigated and rejected on the grounds of computational complexity, lack of transparency and instability with parameter streams that are not smooth. The algorithm used has the possible advantage of representing, in the current segment, information about parameter continuity through to adjacent segments.

In general the PSL provides an approximate representation of duration in some phonemic contexts, falls short of duration where parameter values are very irregular, and exceeds duration in cases where there is continuity of the acoustic parameter across phoneme boundaries. This continuity, or more precisely its variations, can also be viewed as a function of the degree of local coarticulation which, like duration, has a complex relationship with speaking rate. Through the range of normal to rapid speech we can expect the concomitant reduction of vowels, for instance, to lead to reduced variation in the values of spectral parameters across phoneme boundaries counteracting the rate induced drop in duration. The PSL as a duration measure is thus a complex function of rate, phonetic context, acoustic parameter and speaker characteristics.


4.10.3 Smoothing and Outlier Constraint

Parameter smoothing was performed with a triangular windowed moving average. Several window shapes were tested. The filter chosen had the multipliers a_{t-1} = 0.25, a_t = 0.5, a_{t+1} = 0.25.

Outlier constraint was performed as follows. If X1, X2 and X3 are a sequence of temporally adjacent parameter values and C1 (4) and C2 (32) are constants, then the condition for outlier selection for X2 is

Q = ((X3 - X2) > C1(X3 - X1)) and ((X3 - X2) > (X1 + X2)/C2).        E4.7

Smoothing (the SLab ‘Smo’ modifier), outlier correction (‘Cor’), differentiation (‘Del’ or ‘Dif’) and PSLs (‘Sim’) were tested in combinations such as P.Smo.Sim, P.Smo.Dif, P.Dif.Sim and P.Smo.Dif.Sim. Each temporal modifier was applied in left to right order to one of the base acoustic parameters P.
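Read as stream-to-stream functions, the Smo and Cor modifiers can be sketched as below. The midpoint replacement for a detected outlier is an assumption (the thesis asks only for an appropriate interpolated value), and the method names are hypothetical.

```java
// Minimal sketches of the Smo and Cor modifiers applied to a stream of
// per-frame parameter values x[]; names and details are illustrative only.
static double[] smo(double[] x) {
    double[] y = x.clone();
    for (int t = 1; t < x.length - 1; t++) {
        // triangular window with multipliers 0.25, 0.5, 0.25
        y[t] = 0.25 * x[t - 1] + 0.5 * x[t] + 0.25 * x[t + 1];
    }
    return y;
}

static double[] cor(double[] x) {
    final double C1 = 4.0, C2 = 32.0;
    double[] y = x.clone();
    for (int t = 1; t < x.length - 1; t++) {
        double x1 = x[t - 1], x2 = x[t], x3 = x[t + 1];
        // outlier condition of E4.7 as transcribed above
        if ((x3 - x2) > C1 * (x3 - x1) && (x3 - x2) > (x1 + x2) / C2) {
            y[t] = 0.5 * (x1 + x3);   // assumed linear interpolation
        }
    }
    return y;
}
// Modifiers are applied left to right, so P.Smo.Dif corresponds to dif(smo(p)).
```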

4.10.4 Other Modifiers

Two further temporal modifiers were tested but not used. A contour shape parameter (the ‘Sha’ modifier) consisted of a 2 bit coding of the four possible contour shapes (rising, falling, peak, trough) within the PSL window. This was intended for use in conjunction with the PSL to provide a more detailed parametric model of the local temporal context.

A ‘switch’ parameter (Swi) selected the largest of Sim or Dif.Sim for a frame and combined it with a single bit flag to indicate which value was chosen for a particular AV. This modifier was trialed as an attempt to combine the two temporal parameters and was based on the hypothesis that differentials and PSLs would both provide most useful information in their high values - differentials where the acoustic parameter is changing rapidly and PSLs where the parameter is relatively steady.

Both the Sha and Swi modifiers were put aside after initial trials because the aggregated results were marginal. They could both have a significant contribution to make in specific contexts.


4.11 The Association Tables

The association table was constructed with a column for each phoneme and a row for each AV value. The table accumulated counts of the instances in which a particular AV value was labelled as a particular phoneme. These counts were converted to a probability by dividing each cell by the column total. Each cell contained the probability that a phoneme will be represented by a particular AV value regardless of the frequency of the phoneme in the data set. These probabilities were then ranked for each AV value and this ranking was used as the primary measure of association strength.
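A minimal sketch of this bookkeeping is given below; the class and method names are hypothetical, and the ranking shown is the simple count of phonemes with a higher probability in the row, without the SLab system’s handling of ties or population checks.

```java
// Minimal sketch of an association table: counts of (AV value, phoneme)
// instances, converted to per-phoneme probabilities and ranked within a row.
public class AssociationTable {
    private final long[][] counts;          // [avValue][phoneme]
    private final int numAvValues, numPhonemes;

    public AssociationTable(int numAvValues, int numPhonemes) {
        this.numAvValues = numAvValues;     // e.g. 16 for a 4-bit 1D AV
        this.numPhonemes = numPhonemes;     // one column per phoneme
        this.counts = new long[numAvValues][numPhonemes];
    }

    /** Accumulate one labelled frame. */
    public void add(int avValue, int phoneme) {
        counts[avValue][phoneme]++;
    }

    /** Rank of a phoneme for a given AV value (1 = strongest association). */
    public int rank(int avValue, int phoneme) {
        // column totals give each phoneme's overall frequency in the data
        long[] colTotal = new long[numPhonemes];
        for (int v = 0; v < numAvValues; v++)
            for (int p = 0; p < numPhonemes; p++)
                colTotal[p] += counts[v][p];
        // probability that phoneme p is realised with this AV value
        double[] prob = new double[numPhonemes];
        for (int p = 0; p < numPhonemes; p++)
            prob[p] = colTotal[p] == 0 ? 0.0 : (double) counts[avValue][p] / colTotal[p];
        int rank = 1;
        for (int p = 0; p < numPhonemes; p++)
            if (prob[p] > prob[phoneme]) rank++;
        return rank;
    }
}
```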

A separate association table was built for each AV structure being tested, combining data from all 200 sentences from one speaker. A combined table was also built accumulating data from all speakers (labelled Spk6) to provide an indication of the generality of the individual speaker data rather than as an attempt at speaker independent recognition.

Figure 4.15 shows representative examples of vertical slices through three association matrices for /A/ and /a:/ for speaker S106. Figure 4.16 shows speaker independent F2 vowel associations. Some vowels have multimodal distributions. These arise partly from nasalization or F1-F2 mergers not detected by the formant picker and from anomalous spectral energy fluctuations associated with weak or transitional voicing. As discussed in Sections 3.6.1, 4.7 and 4.12, outliers or secondary modes of the distributions are not eliminated since they may carry useful phonemic information reflecting the abnormal spectral structure of a frame.

Ranks of up to 22 (half the phoneme total or number of columns) were included in the results of Chapter 5, though all conclusions are based on much lower rankings. The statistical significance of the rankings was gauged by the significance of comparative rankings - ie the significance of the change in ranking across all phonemes when a system variable is changed. The AV bandwidth compression (Section 4.9) ensured a mean population of at least 100 frames for each cell used in the association tables for each speaker. Runtime checks were made to ensure populations were maintained for at least 22 cells per row to ensure that rankings up to 22 were based on sufficient data.

Phoneme class associations, in which each column represented a phonemic class, were also tested but had much weaker association strengths than the individual phonemes and are not reported in detail here. Where class associations are given here as summaries of general phoneme attributes they are means of the associations for the constituent phonemes. The phonemic class codes used are given in Appendix 2. Note that stops were classified by either place of articulation or by voicing.

[Figure 4.15 (R6348 S106): association probability P against AV value (0-15) for /A/ and /a:/; three panels for F1, F2 and F3.]

Figure 4.15: Example association probabilities for AVs F1, F2 and F3 from a single speaker (S106) for all 16 bins of their 4 bit range.


Figure 4.16: F2 association probabilities for vowels for all speakers (Spk6). Values below 5% have been removed to enhance presentation clarity.

4.12 Parameter Histograms and Temporal Slices

Parameter histograms such as those in Figures 4.4, 4.8 and 4.10 were used extensively to provide various globally aggregated views of the data. These were also generated at segmental levels either for phoneme classes or individual phonemes forming a hierarchy of phonetic aggregation. The lowest, or least aggregated level, in this hierarchy was data for a particular phoneme in a specified diphone or triphone context.

Because most triphone and even some diphone instances occurred in low numbers in the 200 sentence data set, A-P association strengths based on aggregation of the data at this level were not generally used. In the case of our investigations of F2 transitions at CV boundaries, however, diphone contexts were used and special data representations generated displaying both aggregated and instance data.

We also disaggregate the data on a temporal basis, building time slices at a sub-segmental level. These slices were constructed either by accumulating frame counts over 20 or 40ms real-time intervals or by aggregating frame by frame on a frame based time scale. Slices were positioned relative to phoneme boundaries for selected phoneme and class pairs such as stop-vowel boundaries. The number of slices before and after the boundary could be specified.

Time slices were generated on a phoneme or phoneme class basis. For the stops two class specifications were used based on voicing and place of articulation respectively. Vowels were classed by length only. Consonants were grouped into the broad classes of fricatives, affricates, semivowels and nasals.

The observations in Section 4.8 on formant dynamics, that the use of source synchronous framing shows up a high degree of discontinuity between frames, are central to our investigations of F2 transitions at CV boundaries. It should be noted that much of the discussion in the literature on formant analysis, in particular F2 transitions following stops, has been based on data generated asynchronously and with overlapping signal ranges for adjacent frames which leads to interpolation and smoothing of formant frequencies.

Common terminology such as the ‘path’, ‘track’ or ‘trajectory’ of a formant frequency implies notions of continuity or the ballistics of objects with physical mass. Such intuitive models fail when we look at the full complexity of the combined oral-nasal resonance system. We have looked closely at these resonance systems in an attempt to characterise transitions between consonants and vowels where varying degrees of oral closure produce complex, and possibly chaotic, mixtures of resonances.

To some degree, each glottal epoch can be seen as a discrete acoustic event with the reverberation of each glottal pulse largely fading by the time the next starts. Residual signal energy at the time of glottal closure starting a new epoch will have some influence on the phase of the new pulse and could have a major impact on the dominance of resonant modes where the balance is fine.

Figure 4.17 illustrates typical behaviour of F2 at a VC boundary. Six frame based sequences are shown for /i:/-/n/ boundaries for speaker S106. Note that the time scale is frames, not absolute time. It should also be borne in mind that the formant picker makes its initial peak selection based on peak size and that peak assignments can change with small relative changes in peak size.

[Figure 4.17 (R6277, S106: F2 sequence for /i:n/ boundary): F2 (Hz) against frame number relative to the boundary (-10 to +10) for six instances (1-6) and their mean (M).]

Figure 4.17: Six F2 transition instances for the /i:/-/n/ boundary.

Four temporal regions can be distinguished in this diagram. The first region, starting at frame -10 and ending at frame -7 for sequence 2 and frames -3 to 0 for the remaining sequences, is the end of the vowel segment. Sequence 3 switches modes twice to the ~2400Hz nasal resonance (N3) in an example of anticipatory coarticulation. The remaining sequences follow an articulatory path first with an upward trend then after frame -4 show a brief downward inflection.

The second region, starting at frame -6 for sequence 2 and spanning frames -4 to 1 for the remaining sequences, is a transitional stage where each sequence drops by at least 700Hz in one frame down to the ~1Khz (N2) nasal resonance. Between frames -6 and 3 we can see a third region in which sequences are dominated by N2 with occasional energy bursts in N3 and higher resonances. Following frame 3 the fourth region dominates with the sequences starting to show the articulation of the following phonemes.

This analysis clearly illustrates the modal nature of the transitions. We are going through a transition from a variable oral resonance to a mixture of fixed nasal resonances dominated by N1 (~200Hz). The vowel-nasal transitions can be seen, to some extent, as extreme examples of the behaviour that characterises VC or CV transitions. With nasals, oral closure is not always complete in rapid speech and the transition dynamics is not simple or consistent. With stops, the plosive burst and the voicing distinction add further complexity.

A male speaker was chosen for the example in Figure 4.17 because the nasal resonances of the males used have a very narrow range as can be seen in Figure 4.18. Peak positions of the nasal resonances of the females are spread, possibly by interactions with the higher F0 frequencies of the females. This interaction appears to be taking place largely in the vocal tract since varying the framing position and length, and switching between Goertzel and FFT transforms, while having a greater impact on the female nasal resonances than in other contexts, did not produce the sharpness seen for the males. The same can be said for the widely varying 800-1500Hz resonance of male speaker S083.

The frequency quantisation seen around 2.3 KHz for speaker S065 was assumed to be a transform artefact consisting of a small fixed frequency ripple on the broad, low energy, high frequency formants. Since the separation of the peaks increases with increased frequency we are looking at something more complex than a simple harmonic structure. It may arise from interaction between the glottal period and/or N1 with the signal sampling rate. Note that the N1 histogram peak is lower and broader for this speaker in contrast with the extremely narrow range of N1 for speaker S052. The effect has a minor impact on measured peak frequencies so, following an initial analysis, it was ignored. Further examples of formant histograms are given in Appendix 5.

[Figure 4.18 (R6348-9: Formant Histogram for /NA/): frame counts against frequency (0-3.0 kHz); panel (a) male speakers S106, S083, S081, panel (b) female speakers S065, S058, S052.]

Figure 4.18: Combined formant histograms for nasals. (a) males, (b) females.

4.13 Discrimination potentials

Histograms of duration and PSLs were also used quantitatively in the evaluation of duration based contrasts of vowel length, stop voicing and stop place. The PSL histograms are very noisy since they are accumulated from multiples of varying frame lengths. The irregular nature of the distributions makes any statistical analysis difficult. There is, however, a direct comparison we can make that largely circumvents these difficulties by making no assumptions about the nature of the distributions: the discrimination potential (DP) of the distributions, which we define as the degree of separation, or lack of overlap, of a given pair of raw (pre-scaled) parameter histograms.

Taking the vowel length decision and two vowel histograms (long and short) as an example, a discrimination probability distribution was estimated by setting a decision boundary at some duration b in each histogram and calculating the probability of a correct result if we assumed that every value below b was a short vowel and that higher values were long vowels. The discrimination potential is defined as the maximum value of this distribution.

DP = \max_b \left[ 50 \left( \frac{1}{N_L} \sum_{t=0}^{b} H_L(t) + \frac{1}{N_S} \sum_{t=b+1}^{M} H_S(t) \right) \right]        E4.8

H is the histogram distribution, N is the total number of instances, indices L and S represent long and short length respectively and M is the maximum AV value.
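Read literally, E4.8 as reconstructed above can be evaluated as in the sketch below (hypothetical names, not the SLab code). The assignment of the two histograms to the two sides of the boundary fixes the decision sense; for contrasts whose sense is reversed (see Section 5.4.2) the histograms are simply swapped.

```java
// Minimal sketch of the discrimination potential of E4.8 for two raw
// histograms (e.g. long and short vowel durations).  Names are illustrative.
static double discriminationPotential(long[] hL, long[] hS) {
    long nL = 0, nS = 0;
    for (long c : hL) nL += c;
    for (long c : hS) nS += c;
    int M = Math.min(hL.length, hS.length) - 1;   // maximum AV value
    double best = 0.0;
    for (int b = 0; b <= M; b++) {                // try every decision boundary
        double sumL = 0.0, sumS = 0.0;
        for (int t = 0; t <= b; t++) sumL += hL[t];
        for (int t = b + 1; t <= M; t++) sumS += hS[t];
        double dp = 50.0 * (sumL / nL + sumS / nS);   // percentage correct
        best = Math.max(best, dp);
    }
    return best;
}
```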


Chapter 5

Results and Analysis

5.1 Introduction

In this chapter results are presented for the five sets of temporal experiments described in Section 3.6:

1. Impact of smoothing and outlier constraint
2. Comparison of PSLs and differentials
3. Duration-PSL comparison for all phonemes
4. Duration in phonetic contexts
5. CV boundaries and F2 transitions

For experiments 1 to 4 results are presented as ranking differences generated from the association tables. In the fourth series we also look at discrimination potentials generated from the raw histograms of the acoustic parameter values. Experiment 5 does not provide an explicit numerical result, rather the trends in F2-time plots are discussed.

The results presented in this chapter are a representative selection of the broader result set generated in the course of exploring the five experimental objectives listed above. A complete presentation would be prohibitively lengthy and would add little to the central goal of providing a general evaluation of the PSLs. An exception is experiment 5 where individual speaker examples are drawn from the better examples of smooth F2 transitions. In this case aggregates for all speakers are also included to demonstrate the generality of the conclusions drawn.

When comparing rankings we start with the most highly aggregated data then disaggregate to look at individual speakers, phonemic classes and, in some instances, individual phonemes. Disaggregations are relegated to the appendices to simplify presentation. In some cases, where rank differences are small, we estimate Student's t as a means of hypothesis testing. Significance levels are referred to by the common terminology of ‘marginal’, ‘significant’ and ‘highly significant’ for p < 0.05, 0.01 and 0.005 respectively.

We will see that the PSLs generally out-perform parameter time derivatives in providing stronger acoustic-phonetic associations (lower rankings). They can, in some instances, act as an effective proxy for phoneme duration and even when they fail to do so they can still provide valuable discriminatory information.


5.2 Impact of smoothing and outlier constraint

Outlier constraint

Outlier constraint can be seen as an extreme form of smoothing that allows us to correct parameter values in a time stream that are not just considered to be noisy values but are assumed to be completely invalid. By changing them to an appropriate interpolated value we are attempting to avoid major disturbance of the final results and invalid disturbance of adjacent values if subsequent smoothing is performed. There is no general means of ensuring that all values changed were initially invalid but in this work we use the association strengths to indicate whether the average impact is beneficial or not.

[Figure 5.1 (R6156: Mean association ranks of temporal modifiers, speaker Spk6): rank against phone class (Vo, SV, LV, Di, St, US, VS, Na, Gl, Fr, M) for Tx, F1, F2, F3 and Etot and their .Cor and .Smo variants.]

Figure 5.1: All speaker association ranks summarised by phone class for base parameters with Cor and Smo temporal modifiers. Classes are: Vowel (short, long), Diphthong, Stop (voiced, unvoiced), Nasal, Gl (glides and liquids), Fricatives, M (all phone mean).

Figure 5.1 illustrates results for the constraint (Cor) modifier applied to five base acoustic vectors (Tx, F1, F2, F3, Etot). Overall the impact of constraint can be seen to be small with the main exception being Tx in the context of unvoiced stops where rankings are also very weak (>20).

Rank differences for the main classes of interest are presented in Table 5.1a where significance estimates are also indicated. Values are for rank differences (RX.Mod - RX) where R is rank, X is the AV and Mod is the temporal modifier ‘Cor’. The aggregate rankings for F3 and all phonemes (‘All Ph’) show significance not seen for the classes listed. This results mainly from the impact of fricatives, but the increased phoneme numbers also add strength to the t scores.

(a) R6156 Spk6: Cor rank increase

Class     Tx      F1      F2      F3      Et
All Ph    0.22   -0.37   -0.11   -0.34   -0.10
Vowels    0.07   -0.44   -0.01   -0.22    0.00
Stops     0.85   -0.38   -0.22   -0.64    0.01
Dips      0.01   -0.22    0.18   -0.08    0.07

(b) R6156 Spk6: Smo rank increase

Class     Tx      F1      F2      F3      Et
All Ph    0.24    0.15    0.13   -0.11   -0.07
Vowels   -0.31   -0.33    0.21   -0.34   -0.16
Stops     0.91    0.50    1.00    1.41   -0.20
Dips     -0.07   -0.03    0.38   -0.37   -0.06

Table 5.1: Rank differences RX.Mod - RX for five AVs (X=Tx,...,Et) temporal modifiers (Cor, Smo) and classes of interest. Bold figures are for P<0.01, bold and underlined are P<0.005, just underlined are P<0.05. The histogram change figures are the percentage net change in raw parameter histogram bin counts caused by the modifier.

[Figure 5.2 (R6488: Outlier constraint instances): counts by phoneme for Tx, F1, F2, F3 and Et.]

Figure 5.2: Instances of outlier constraint for speaker S106 and five acoustic vectors (Tx, F1, F2, F3, Etot).

Constraint has an impact on association rankings that is typically less than half a rank point. It reduces ranks (improves associations) for all three formants but not for Tx and Etot. It has an overall beneficial and highly significant impact on the contribution of F1 of 0.37 ranking points. This difference is consistent across all phoneme classes and significant for all but diphthongs. Over all phonemes Tx.Cor significantly degrades rankings by 0.22 points. Correcting Tx for stops produces a highly significant detrimental rank increase of 0.85 points.

Instance counts for outlier constraint are shown in Figure 5.2. Disaggregated counts for /p/ are given in Appendix 7 (Tables A7.2 and A7.3). High counts for stops and fricatives are to be expected given the parameter irregularity of unvoiced or weakly voiced segments. The nasals in Figure 5.2, particularly /n/, have counts comparable to the vowel average for Tx and Etot, but the formants show a higher incidence, reflecting the switch between oral and nasal resonances near nasal boundaries.

Smoothing

Looking at the smoothing data in Figure 5.1 and Table 5.1b we see that there is no significant ranking improvement from applying smoothing. For F3, smoothing degrades stop rankings by 1.4 points (highly significant). Smoothing F2 also degrades stop rankings but with marginal significance.

It should be noted that we are looking at combined speaker data and that some smoothing results are highly speaker dependent with individual speakers showing quite differing behaviour for particular AV-phoneme combinations. Phoneme class data for individual speakers is given in Appendix 7 (Figures A7.1a-f) and examples for individual phonemes for F1 and F2 are seen in Figure A7.2. An extreme example of beneficial smoothing is /D/ where F1.Smo improves the ranking over F1 by more than three rank points.

Those instances where either smoothing or outlier constraint increase the phonetically significant information content of the AV (reduce its rank) can be viewed as potential secondary processing options or fine tuning that can be used in ASR for specific speaker-AV-phoneme contexts.

5.3 Similarity Lengths and Differentials

The primary test for the PSLs was to see how their acoustic-phonetic association strengths compared with those of parameter time differentials but first the issue of selecting appropriate values for the temporal range had to be addressed for both temporal modifiers.

5.3.1 The temporal range of Difs and PSLs

Differentials were calculated for varying temporal ranges as in equation E4.6. Ranges of 1, 2, 4, 6 and 8 glottal epochs were tested initially; the delta modifier and sample ranges 2 and 4 (modifiers Del, Dif2 and Dif4) were then selected for more detailed analysis.

[Figure 5.3: rank against Dif order (2-8) for classes LV, SV, D, US and VS; panel (a) Tx, panel (b) F1.]

Figure 5.3: Impact of Differential Order (temporal range) on A-P association rankings across phoneme classes (vowels, diphthongs and stops). Figures (a) and (b) are Tx and F1 respectively aggregated over six speakers (Spk6).

[Figure 5.4 (R6128 Spk6: Differential modifiers for [Tx+F1+F2+F3]): rank against phoneme class (V, SV, LV, D, S, US, VS, N, G, F, M) for Del, Dif2, Dif4, Sim2 and Sim4.]

Figure 5.4: Dif-Sim comparison aggregated over six speakers for four AVs and broad phoneme classes. M is the all-class mean.

In Figure 5.3 we see the impact of varying differential order on association strengths over the phoneme classes indicated. Values are given for short and long vowels (SV, LV), voiced and unvoiced stops (VS, US) and diphthong (D) phoneme classes. As we would expect intuitively, stops favour a short range differential while vowels and diphthongs are best represented by a long range estimate so selection of a single order to cover all cases is not possible.

[Figure 5.5 (R6128 Spk6: Individual Phoneme Rankings for Temporal Acoustic Vectors): rank plotted by individual phoneme; panel (a) compares Tx.Del, Tx.Dif2 and Tx.Dif4, panel (b) compares Tx.Del, Tx.Sim2 and Tx.Sim4.]

Figure 5.5: Del, Dif and Sim comparisons for Tx using combined speaker data.

An apparent anomaly surfaces, however, when we include the shortest differential estimate, the Del modifier. We can see in Figure 5.4 that Del modifiers outperform Dif2 and Dif4 for all but nasals and semivowels (G). A possible reason for this is that the delta parameters are the most direct measure of base parameter smoothness. The Difs are averages over a temporal range so are inherently smoothed. In accordance with Figure 5.3, Dif4 outperforms Dif2 for all classes but stops and fricatives. The general trends shown here are not universal over all speakers, phonemes and acoustic vectors but a detailed investigation of the exceptions was considered to be beyond the scope of this work.

Figure 5.5a shows individual phoneme disaggregations for Tx. Relative rankings are consistent within phoneme classes, particularly the unvoiced stops and fricatives. Note that the rankings of unvoiced stops and fricatives for Tx based temporal parameters are well below the means (over four AVs) in Figure 5.4.

A selection of disaggregations of Figure 5.4 is shown in Appendix 8 for individual speakers, acoustic vectors and phonemes. Individual speakers show no major qualitative differences and relatively small quantitative differences. Notably, female speakers show much larger differences between Dif2 and Dif4 for stops. This suggests that valuable short duration information is available that is missing for the male speakers. In the phoneme breakdown (eg. Appendix 8 Figure A8.3) Dif4 can provide an advantage over Tx.Del and Tx.Dif2 for /a:/, /e:/ and some diphthongs. Likewise F1.Dif4 with most vowels and F2.Dif4 with /a:/, /i:/, /U/, /w/, /r/ and nasals add support for the conclusion that Dif4 (and possibly longer ranges) provide advantage with vowels (particularly long vowels) and semivowels. For the results described in this thesis a range of 2 can be assumed for Dif if not otherwise specified.

The PSL has one free parameter, the range (Rsim) over which the parameter value is allowed to vary. It is expressed as a fraction of the standard deviation of parameter values. Obviously the PSL is a monotonically increasing function of the range and we need to set the range to suit the task at hand. Values of Rsim from 5% to 50% were tested. The parameter modifiers Sim2 and Sim4 (20% and 40% ranges) were selected for detailed analysis. Figure 5.4 shows Sim2 outperforming Sim4 for stops, the reverse for diphthongs, vowels and nasals. Table 5.3 summarises some of the rank differences for the Sim4 to Sim2 comparison for three phoneme classes and the phoneme total. Negative values show an advantage for Sim4. For stops, Tx and F2 perform better with the short range of Sim2 - likewise for F3.Sim2 for all classes but only significantly for the diphthongs. F1 on the other hand combines best with Sim4 for all classes. Overall we have strong and significant differences between AVs and phoneme classes with patterns remaining qualitatively consistent across speakers for all AVs and within phoneme classes for some AVs.

R6128: Sim4 - Sim2

Class           Tx      F1      F2      F3
All Phonemes   -0.23   -1.23    0.20    0.62
Vowels         -0.46   -2.01   -0.46    0.01
Stops           1.92   -0.99    1.69    0.50
Diphthongs     -2.36   -0.94    0.31    0.88

Table 5.3: Mean rank differences RSim4-RSim2 over all speakers with positive values indicating lower (stronger) ranking for Sim2. Those in bold are significant. Bold and underlined are highly significant. Those just underlined are marginal.

5.3.2 Comparison of Similarity Lengths and Differentials

The best rankings over all classes are provided by PSLs with mean rankings across all phoneme classes nearly one ranking point better than the Del modifiers. Table 5.4 confirms this with the greatest advantage from PSLs being with the stops. For Tx and F1 the ranking differences are large at 3.1 and 6.1 points respectively and significance is very high with t scores of 13 and 20 which give uncertainties well below the P=0.005 level.

R6128: Sim2 - Del

Class        Tx      F1      F2      F3
All Ph      -1.24   -1.08   -0.89   -0.56
Vowels      -1.47    0.43    0.09    0.63
Stops       -3.13   -6.09   -0.81   -1.65
Diphthongs   3.69    0.37   -0.03    2.58

Table 5.4: Rank differences RSim2-RDel over all speakers with negative values indicating lower (better) ranking for Sim2. Statistical significance is indicated as in Table 5.3.

Table 5.4 and Figures 5.4 and 5.6 provide a high level comparison between the PSLs and differentials. Differentials perform best for some long vowels and diphthongs. Figure 5.5b shows the phoneme level perspective for the base parameter Tx chosen for its generally good performance with the temporal modifiers. Strong associations are apparent with Del for /@:/ and /a:/ and weaker associations for /I/, /i:/, /O/, /U/ and /V/ indicating that the aggregated vowel data are hiding major differences between individual vowels.

Irregular but strong differences between front and back vowels can be seen through the individual phoneme data (eg. Figures A8.3 in Appendix 8). As with the outlier correction of the previous section, speaker variations for the AV aggregates tend to be minor quantitative changes rather than qualitative ones. Individual AVs, on the other hand show large qualitative differences. Individual phoneme results are generally consistent across stop voicing classes but vary across vowels for some acoustic parameters.

To complete the comparisons of this section we take a look at some more complex AVs. As with Figure 5.5, Tx is used as the best temporal exemplar. All the other base acoustic parameters were tested in similar fashion but the results did not add significantly to the picture presented by Tx so, for reasons of simplicity, they are not reported here. The parameter Tx.Del.Sim applies the Sim modifier to the time differential of Tx and measures the temporal extent of the rate of change in glottal period or its acceleration in the case of Tx.Del.Del.Sim.

[Figure 5.6 (R6493 S106: Mean ranks for five temporal acoustic vectors): rank by phoneme for Tx.Del, Tx.Sim, Tx.Del.Sim, Tx.Del.Del.Sim and Tx.Del&Tx.Del.Sim.]

Figure 5.6: Rankings for temporal parameters combining Del and Sim modifiers for a subset of phonemes. M is the phoneme set mean.

The AV Tx.Del&Tx.Del.Sim is a 2D vector. These do not generally conform to the row population constraints ensured by the one dimensional parameter range compression algorithm of Section 4.9. A close inspection of the raw table data for this example, however, shows that at least 94% of AV instances fall in table rows that can sustain a valid ranking in the ranges shown in Figure 5.5 but the results should still be treated with caution and are excluded from the conclusions drawn here and in Chapter 6.

The overall picture from Figure 5.6 is that ranks decrease with changing AV in the order that the AVs are listed in the legend. Some patterns are apparent. Unvoiced consonants consistently rank lower than their voiced counterparts. Long back vowels show lower rankings than their short counterparts. The two dimensional AV Tx.Del&Tx.Del.Sim shows similar patterns to the 1D cases but gives a mean rank 3.4 points below the best of the one dimensional AVs. With the qualifications addressed in the previous paragraph, it provides an indication of the ranking improvements expected for higher dimensional AVs.

In summary we can conclude that the PSLs outperform parameter differentials in most combinations of parameter temporal range, phoneme or phonemic class and AV structure. Combinations of differentials and the PSLs generally outperform simpler modifiers.

5.4 Similarity lengths and duration

Three sets of experiments were conducted to evaluate the PSL as a measure of duration. The first was a direct comparison of PSLs and the alignment duration for all phonemes and a subset of the base acoustic parameters. In the second set of experiments PSLs were evaluated in the vowel length decision. The third set looked at durational aspects of the stop voicing and place decisions - specifically, the duration of vowels preceding voiced and unvoiced stops and the duration of the stops themselves, attempting to use PSLs as measures of voice-onset time.

5.4.1 Duration-PSL comparison for all phonemes

Here we compare similarity lengths with phoneme durations as defined by the phonemic alignment data. This information is not, of course, available to a speech recogniser at recognition time but we take advantage of its availability here (as in the training phase of a recogniser) since it is central to any consideration of timing in speech and provides us with a valuable reference for evaluation of the PSL. It is referred to here just as ‘duration’ or the SLab parameter ‘Dur’.

Figure 5.7 shows the spread of durations and Sims between speakers. These means are taken over all phonemes for the full 200 sentences per speaker. For durations we see inter-speaker duration variation of around 20% so we can plausibly attribute around 20% of the relative variation of the Sims to inherent speaker variations. For Tx.Sim this could account for around half the variation but the other Sims have much larger variation introduced by factors such as strong intrasegmental variation of parameters producing shortening, or variations in the degree of coarticulation at segment boundaries leading to a lengthening.

We note that Tx.Sim is generally longer than the duration which agrees with the observation that Tx (or F0) remains relatively stable through boundaries between contiguous orally voiced phonemes. Frame energy (Et), on the other hand, while often sustained over adjacent vocalic segments, can vary over 50% of its full range within the bounds of a segment. Not surprisingly then, Et.Sim global means are just below the duration means with the exception of speaker S065.

[Figure 5.7 (R6131: Durations and Sims by Speaker): mean value (msec) of Dur, Tx.Sim, Et.Sim, F1.Sim, F2.Sim and F3.Sim for speakers S106, S083, S081, S065, S058, S052 and Spk6.]

Figure 5.7: All phoneme means for duration and Sim parameters.

The formant Sim means fall well short of duration means. This also should be expected because formants often vary significantly within segments and are also prone to outliers which can truncate PSLs. This is particularly so for F3 which, as we can see from the time-frequency plots such as Figure 4.11, occasionally takes the stronger F4 peak and is disrupted by most F2 allocation errors.

For F1 and F2 the most notable feature of the PSL means is the difference between the males and females. The male speakers are strongly clustered around 80% of duration while the female speakers are spread over a wide range of lower values. This suggests a less consistent formant structure for the female speakers.

[Figure 5.8 (R6134: Mean Sid ratio for vowels, stop-like phonemes and nasals): Sid (% of duration) by speaker for Etot.Sid, Tx.Sid, F1.Sid, F2.Sid, F3.Sid and the AV mean M.]

Figure 5.8: Sim/Dur ratio or Sid parameter aggregated over vowels, stops, affricates and nasals. M is the AV mean.

[Figure 5.9 (R6151 Spk6: Sid parameter histograms for class LV): counts against approximate percentage of duration (0-240); upper panel Tx and Et, lower panel F1 and F2.]

Figure 5.9: Combined speaker histograms of Sid parameters for long vowels. The point at which Sim equals duration is 102.3 so the abscissa approximates a percentage.

Comparing global means gives a very blurred picture of what is going on at the frame level. For a more detailed and accurate picture we generate a Sim/Dur ratio parameter for each frame, the SLab Sid parameter, and look at its behaviour. Figure 5.8 shows Sid means by speaker and AV. With the exception of Etot and F3 the male Sids (S106, S083, S081) cluster around 100% of duration which means that the PSLs are matching the duration. For the female speakers the picture is less consistent with mean Etot.Sid remaining at or above duration but formant PSLs, while consistent across speakers, dropping to 30-60% of duration.

The data of Figure 5.8 are, again, means and the full distributions of Sid values for long vowels shown in Figure 5.9 show a broad distribution. For Tx and F1 there is a sharp peak for low Sid values representing the Sims that are truncated to one or two frames. The broader peak at higher values shows that the majority of Sims range from approximately 60% to 180% of duration. F2, F3 and Et show even weaker links with duration than are suggested by the mean values.

The final comparison of the Dur and Sim parameters comes from the acoustic-phonetic association strengths shown in Figure 5.10. Figure 5.10a gives rankings aggregated over all phonemes. 5.10b and 5.10c show rankings for the vowels and stops which we look at in greater depth in the following sections.

[Figure 5.10 (R6133): rank means per speaker (Spk6, S106, S083, S081, S065, S058, S052) for the AVs Dur, Tx.Sim, Etot.Sim, F1.Sim, F2.Sim and F3.Sim; panel (a) all phonemes, panel (b) vowels, panel (c) stops.]

Figure 5.10: Association rankings for duration and PSL parameters aggregated over (a) all phonemes, (b) vowels and (c) stops.

The Dur and Tx.Sim AVs give similar rankings for vowels and the full phoneme set. Duration has more inter-speaker variation. The combined speaker (Spk6) rankings are significantly (P<0.01) weaker than the individual speaker rankings. For Tx.Sim the speaker values are more tightly clustered and the Spk6 ranking is only marginally (P>0.05) worse than the individual values and actually betters the ranking for female speaker S058. In a speaker by speaker comparison the Dur and Tx.Sim rankings are not significantly different.


For vowels, formant PSLs show a marginally significant (P<0.025) difference between male and female speakers (S106, S083, S081 male). For males, F1.Sim has rankings comparable to the Dur and Tx.Sim rankings but female speakers have rankings three points weaker.

The stops in Figure 5.10c show a different picture. F1.Sim outperforms both duration and Tx.Sim while the formant results for the female speakers generally have stronger association rankings than the males. An example of strong qualitative differences between speakers can be seen with the rankings of speakers S081 and S052 in Figure 5.10c. Speaker S081 shows low rankings for Dur and Tx.Sim with weak rankings for the formants while S058 displays the opposite behaviour.

An assumption behind much of the discussion of Section 2.2.3 was that phonemically useful durational information is coded primarily in segmental duration. From this viewpoint the Sid results in Figure 5.9 did not provide strong grounds for expecting the PSLs to show useful phonemic information content. In Figure 5.10, however, the glottal period PSL clearly carries information comparable in its utility to that of the alignment duration as does F1.Sim for stops and the male vowels. This can be interpreted as showing that acoustic parameter stability carries phonemically significant durational information that is distinct from, but comparable in its utility to, that of the segmental duration. This issue is discussed further in Section 6.1.3.

The comparison of Figures 5.8 and 5.9 illustrates the difficulties involved in dealing with population means of broad and irregular distributions. This problem is addressed in the next section where we look more closely at specific phonetic contexts and use a measure of the phonemic discrimination potential of the parameter histograms defined in Section 4.13.

5.4.2 Duration in phonetic contexts

Duration and the vowel length decision

In the previous section we were looking at a general comparison of aggregates of phoneme classes. We now look in closer detail at the value of PSLs in the length decision for vowels. We use the measure of phonemic discrimination, the discrimination potential (DP) of the parameter histograms that was defined in Equation 4.8. To produce Figure 5.11 the position of the decision boundary (time b in Equation 4.8) was varied from 0 to 240 ms.

Table 5.5 provides a summary showing the percentage of cases in which the correct decision would be made with the best setting for the decision boundary. All AVs but F2.Sim provide some information for the length decision. The best of the PSLs (Et.Sim) comes close to matching duration in its maximum discrimination potential (63.8% to 67.7% respectively). Note that these results are for combined speaker data with no attempt made to adjust for speaker differences.

[Figure 5.11 (R6152 Spk6: Discrimination ratio for LV-SV): DP % against decision boundary (0-240 msec) for Dur, Tx, Et, F1, F2 and F3.]

Figure 5.11: Discrimination probability distributions for long and short vowels estimated from duration (Dur) and five PSLs (Tx.Sim etc. as in legend).

              Dur    Tx.Sim  Et.Sim  F1.Sim  F2.Sim  F3.Sim
Vowel length  67.7   62.6    63.8    62.2    52.2    59.4
Stop voicing  64.3   69.8    54.6    56.5    54.9    51.7

Table 5.5: Discriminatory Potential (peak probability) for duration and PSLs. Values are the percentage probability of making a correct decision at the optimum decision boundary (eg. around 54ms for Dur and Et) in Figures 5.11 and 5.12.

Duration and the stop voicing decision

The stop voicing decision is expected, from the analysis of Chapter 2, to draw on at least three durational features: the degree or continuity of voicing over the stop, the voice-onset time, and possible compensatory lengthening of the preceding phoneme.

The discrimination potentials for stops are shown in Figure 5.12. It should be noted that the alignment duration can be somewhat arbitrary for non-intervocalic stops, so the context is constrained to stops preceding vowels when we look at voicing onset, and to stops following vowels when we look at the duration of the preceding vowel. As anticipated in Section 3.6.3, a PSL cannot be expected to behave reliably within stops, particularly unvoiced stops, but when this assumption was put to the test an unexpectedly strong result was found for the PSL generated from the glottal period (Tx.Sim). As with the vowel length decision, all PSLs included in Figure 5.12, other than that for F3, provide some discriminatory information.

[Plot R6153 (Spk6): DP (%) against decision boundary position, 0-240 ms, for Dur and the five PSLs in the voiced-unvoiced stop (VS-US) decision.]

Figure 5.12: Speaker independent discrimination probability of duration and PSLs during closure for stop voicing decisions for all stops in SV phonetic contexts. The decision direction (sense) is reversed for the Tx and formant PSLs (see text).

[Bar chart R6153 (Spk6): mean duration estimates (0-120 ms) from Dur and the Sim parameters (Et, Tx, F1, F2, F3) for unvoiced (US) and voiced (VS) stops.]

Figure 5.13: All-speaker mean durations as estimated by the Dur and Sim parameters through unvoiced and voiced stops.

The Tx and formant based decisions have their sense reversed relative to Dur and Et. For the stops, only the energy PSL (Et.Sim) discriminates in the same sense as duration (long for unvoiced). For Tx and the formants stop discrimination appears to be based on continuity of glottal excitation rather than from the PSL acting as a duration measure. This is best seen in the parameter means shown in Figure 5.13 where the mean of Et.Sim closely matches the Dur mean but the other PSLs, primarily Tx.Sim, are longest for the voiced stops. Voicing continuity and duration are opposing influences on the PSL with continuity dominating for all but the energy PSL.

The glottal period provides a peak DP of nearly 69.8% (see Table 5.5) which is larger than the alignment duration based decision (64.3%). This result is consistent across all speakers suggesting that voicing continuity, when measured source synchronously, is the primary durational cue in this context.

All the PSL based DPs have peaks around 20 to 40ms compared with the duration at around 100ms. For the formants, the DP peak degrades as the position of the mode shifts toward shorter time intervals, showing an intuitively plausible link between formant stability (decreasing with increased formant number) and the discriminative value of the PSL.

To further explore the stop voicing decision the length of vowels preceding stops was examined for individual speakers and combined speaker data. Long and short vowels were analysed separately because of the known, large quantitative differences in their durations and to allow for possible qualitative differences in their behaviour. Figure 5.14 shows the probability distributions for Dur, Tx.Sim and Etot.Sim. The formant based PSLs are not included because they showed weak DPs with F1.Sim giving values of 54.0% and 53.1% for short and long vowels respectively. Tx.Sim and Etot.Sim, on the other hand, showed larger DPs that generally peaked higher than the corresponding duration based DPs.

[Plot R6453 (Spk6): DP (%) against decision boundary position, 0-200 ms, for DurLV, DurSV, Et.SimLV, Et.SimSV, Tx.SimLV and Tx.SimSV in the US-VS decision.]

Figure 5.14: Speaker independent (Spk6) discrimination probabilities for the US-VS decision based on the duration of the preceding vowel as given by the alignment duration and PSLs for Et and Tx. Short and long vowel contexts are treated separately.

The speaker independent discrimination potentials of Figure 5.14 are only slightly weaker than the individual speaker results summarised in Figure 5.15 and actually exceed the speaker dependent means for Et.SimLV and Tx.SimSV but not significantly. The maximum DP in the test set was 65.8% for Tx.Sim in short vowels of speaker S083. For the Dur AV the peak DP was 63% for long vowels of speaker S106. For both Dur and Et.Sim the long vowels generally show the highest DPs. The reverse is the case for Tx.Sim which also has the highest DP means. Individual speakers show quite different qualitative patterns in some cases indicating that they may be placing emphasis on different cues for stop voicing.

In Figure 5.15 the discrimination potentials for PSLs are generally greater than for the corresponding duration. This is significantly so for short vowels as summarised in Table 5.6.

[Chart R6453: peak DP (%) for each speaker (S106, S083, S081, S065, S058, S052) and the combined data (Spk6) for DurLV, DurSV, Et.SimLV, Et.SimSV, Tx.SimLV and Tx.SimSV.]

Figure 5.15: Discrimination potentials for the stop voicing decision based on the length of the preceding vowel for duration, energy and glottal period.

R6453: Significance of difference between DPs for PSLs and duration

               Et-Dur (LV)   Et-Dur (SV)   Tx-Dur (LV)   Tx-Dur (SV)
Significance   P > 0.1       P < 0.01      P < 0.1       P < 0.005

Table 5.6: Significance of the differences between PSL and durational DPs of vowels preceding stops in the stop voicing decision.

Duration and the stop place decision

Looking at the discrimination probabilities of the PSLs in the stop place decisions (Figure 5.16 and Table 5.7), the best peak value among the PSLs is 64.0% for the frame energy PSL in the bilabial-alveolar decision, compared with a best of 66.4% for the alignment duration histograms. The glottal period and formant PSLs produce weaker DPs whose sense is reversed relative to the others for the velar-alveolar decision (DP < 50%). A notable weakness of the PSL DPs is their failure to come close to the largest duration DP, that for the velar-alveolar decision.

The probability distribution for the duration based bilabial-velar (FS-BS) decision illustrated in Figure 5.16a is weak relative to the other duration based decisions and to the corresponding distribution for Etot.Sim in Figure 5.16b. It also reverses its sense at around 60ms. This reversal is associated with a broader distribution for the velar stops relative to the bilabials.

[Plots R6700 (Spk6): DP (%) against decision boundary position, 10-100 ms, for the stop place decisions FS-BS, BS-MS and FS-MS; panel (a) duration, (b) Etot.Sim, (c) Tx.Sim.]

Figure 5.16: All speaker DPs for the Stop Place decision based on (a) duration, (b) Etot.Sim and (c) Tx.Sim histograms of those stops followed by vowels or diphthongs. Individual plots are for pairs of the stop place classes FS, MS, BS (bilabial, alveolar and velar).

R6700 (Spk6), DP %:

             FS-MS   BS-MS   FS-BS
Duration     65.7    66.4    53.2
Et.Sim       64.0    57.2    58.4
Tx.Sim       54.3    46.2    58.0
F1.Sim       51.5    47.2    52.7
F2.Sim       54.5    49.7    51.2

Table 5.7: Discrimination potentials of the combined speaker data for the three stop place decisions and five AVs. Values are the percentage discrimination potential.

Summary

We have seen that in the pair-wise contrasts of the vowel length decision the PSL results fall just below those of the alignment duration, with the discrimination potential of the frame energy PSL 3.9% below that of duration.

For the stop voicing decision the frame energy and formant PSLs performed poorly in comparison with the alignment duration but the glottal period (as Tx.Sim) was seen to pick up voicing information that exceeded the discrimination potential of the duration by 5.5% at a level of just under 70%. Tx.Sim for preceding short vowels showed stop voicing discrimination 9.1% higher than the corresponding alignment durations at its peak of 63%.

In the stop place decision the best of the PSLs, Et.Sim, surpasses the very weak duration based discrimination in the bilabial-velar decision by 5.2% and is just 1.7% below the duration DP for the bilabial-alveolar decision.

The question of the utility of the DP in ASR has been largely circumvented in this thesis by basing the analysis of PSL discrimination potentials on comparisons with the alignment duration. The assumption behind this approach is that matching the durational performance is the best we can expect of the PSLs; however, the reliability of the peaks in the probability distributions must also be considered. Visual examination of the distributions in Figure 5.16 shows that the PSL distributions are of comparable smoothness to the duration distributions, and from this perspective they can be expected to have comparable reliability.

5.5 CV transitions and F2

Here we look in detail at the dynamic behaviour of F2 at stop-vowel boundaries seen in Chapter 2 to provide a cue for consonantal place of articulation. The aim is to characterise the movement of F2 at the CV boundary and to determine what, if any, discriminatory information is available to be picked up by the PSL and used in the A-P association rankings.

Figure 5.17: Example time-frequency plots of the first three formants over Stop-Vowel boundaries (vertical grey lines). Fragments from all 6 speakers of sentence S002: “They asked if I wanted to come along on the barge trip” (examples in bold). Blue = F1, green = F2, brown = F3.

Alignment boundaries are taken as given, despite the inherent unreliability of alignment within stops whether manually or automatically determined, because no obvious evidence for misalignment was apparent. In any event, the results discussed here do not rely critically on the boundary position. Rather, we were looking for evidence of regular transient behaviour in the region of voicing onset regardless of its absolute timing.

[Plots R6270: mean F2 (Hz) over frames -10 to +10 relative to the stop to long-vowel boundary for /b/, /d/, /g/, /p/, /t/, /k/; panel (a) speaker S083, panel (b) Spk6.]

Figure 5.18: F2 means across boundaries between stops and long vowels. /@:/, /a:/, /e:/, /i:/, /o:/ and /u:/ where present. (a) S083, (b) Spk6.

The starting point for this analysis was an extensive visual examination of time-frequency plots of formant data. As can be seen in the typical examples shown in Figure 5.17, there is very little sign of a consistent transitional dynamic in F2. During unvoiced stops F2 is highly variable with a tendency to stabilise at the vowel target frequency over the final few frames preceding the boundary. Voiced stops show lower F2 variation but no consistent pattern of behaviour around the boundary.

To aid this visual examination, F2 data in the region of the stop to long vowel segmental boundaries of interest were extracted automatically for all 200 sentences and six speakers. All time scales are glottal epoch based. The vowels for Figures 5.18-5.20 are restricted to the long vowel class to minimise the impact of vowel reduction that appeared to be more prevalent in short vowels.

Male speakers S106 and S083 have been chosen as the primary examples of CV transients because the narrow frequency range of their nasal resonances clarifies the analysis. Specifically, S106 was used because he has served as the example in other results reported in this thesis, and S083 as a possible ‘best case’ because he most closely approaches the smooth transitions described in Section 2.2.4.

[Plots R6255 (S106): individual F2 sequences (Hz) over frames -10 to +10; panel (a) the /p/-/e:/ boundary (mean plus five instances), panel (b) the /g/-/a:/ boundary (mean plus two instances).]

Figure 5.19: Individual F2 sequences about the (a) /p/-/e:/ and (b) /g/-/a:/ boundaries for speaker S106.

Aggregated results are shown in Figure 5.18 over all instances of each stop for speaker S083 (a) and all speakers combined (b). Bilabial stops have an upward trend at the boundary. Alveolar stops have a flat or moderate downward trend and velar stops have a steeper downward trend. In the combined speaker data (5.18b), the difference between the alveolar and velar stops disappears for /k/ over frames -4 to -1 but they are still almost 500 Hz above the bilabial stops at frame -2.

While these aggregates show interesting trends, disaggregation leads to a mixed picture with individual F2 sequences following widely differing paths. Figure 5.19 gives examples of individual F2 sequences. Both contexts show a sharp, consistent rise in frequency of around 1 kHz just prior to the boundary. In the /pe:/ case we see a stronger rise, with all but instance 2 starting from around 800Hz, contrasting with a common start point of around 1100Hz for the /ga:/ examples. Conformity to the locus model is restricted to differences in the relative magnitude of these trends, with the velar stops starting at a higher frequency as expected.

[Plot R6255 (S106): individual F2 sequences (Hz) over frames -10 to +10 at the /b/-/a:/ boundary (mean plus four instances).]

Figure 5.20: Inconsistent examples of SV transitions. The zero value for frame 10 represents the end of the segment.

The examples of Figure 5.19 were chosen from among the better examples of F2 trends. To balance the picture, Figure 5.20 shows less consistent examples. The trend prior to the boundary is irregular or non-existent but at frame -2 three instances are close to the 1200-1300Hz seen for /b/ in the aggregates of Figure 5.19.

Combined speaker data such as Figure 5.18b show that the trends do exist, to some degree, for all the speakers. It is reasonable to expect that the regular, short-term, pre-boundary trends seen in Figure 5.19 can be modelled by the temporal parameter F2.Del.Sim which looks at the duration of the slope of F2. It is clear, however, that the regular F2 transitions are not a universal feature of stop-vowel boundaries so other acoustic features such as the durational information of Section 5.4.2 or frequency domain information in the plosive burst are needed for determination of the place of articulation of stops.

Chapter 6

Conclusions and Projections

6.1 Summary of results

We set out to explore the value and utilisation of temporal information in ASR. Specifically, we chose to evaluate a novel data-driven temporal representation, the Parameter Similarity Length, as a means of encoding temporal information in the AV stream. As listed in Section 3.1 the primary experimental objectives in this evaluation were:

• to investigate the impact of temporal smoothing of acoustic parameters;
• to compare and contrast PSLs with parameter time differentials;
• to test the ability of the PSLs to act as proxies for segmental durations;
• to look at the value of the PSL in modelling duration in specific contexts;
• to investigate the behaviour of F2 in the vicinity of stops.

We also highlighted a set of procedurally oriented objectives that motivated the choices of experimental techniques. These were:
• the use of low dimensional representations of acoustic data;
• methods of parameter value compression or AV bandwidth reduction;
• likelihood ranking based evaluation of acoustic-phonetic associations;
• source synchronous framing of the speech signal.

In this final chapter the results are reviewed against the original objectives and possible directions for future work are outlined.

6.1.1 Temporal resolution and smoothing

The issue of temporal resolution has surfaced in several different forms in the development of this thesis. Initially we were concerned with establishing the highest possible resolution in the AV stream to ensure that valuable short term temporal information was not lost. Temporal resolution, or range, also emerged as an issue in the definition of both the parameter differentials and the PSLs.

The decision to use source synchronous analysis was based partly on the ability to obtain optimum temporal resolution for the acoustic parameters in the frame based AV stream. This was a secondary consideration to the achievement of reliable spectral data, but using source synchronous analysis has allowed this study to start with what we claim to be a maximum feasible resolution, based on the shortest vocal event, the glottal epoch, and, through a process of incrementally extending the temporal range of parameter estimates, to address the issue directly and precisely.

A hypothesis discussed in Section 3.6.1, that maximum temporal resolution would be advantageous in the acoustic-phonetic analysis of stops while lower resolution would give an advantage with continuants, has been generally borne out. A notable exception was the performance of the Del, or simple difference, modifier over long vowels and diphthongs. In Section 5.3.1 it was suggested that the Del modifier may be acting as a measure of parameter temporal smoothness. It is possible that in some contexts smoothness or regularity through contiguous glottal pulses (or perhaps even a consistent irregularity or jitter) is being picked up by the association tables as a phonemically distinctive acoustic feature.

Otherwise, for the smoothed differentials and PSLs, the results related to smoothing follow an intuitively reasonable pattern, with stops, short vowels (particularly schwa) and some other abbreviated consonantal contexts benefiting from high resolution temporal analysis. Conversely, parameter differentials evaluated over long duration vowels, diphthongs and sustained consonants can gain acoustic-phonetic association strength from an extended temporal range. The optimum resolution is a complex function of speaking rate, speaker and phonetic context, but strategies could be learned by an appropriate ASR system allowing it to optimise acoustic analysis ‘on-the-fly’.

6.1.2 PSLs and time differentials

The comparison with differentials is a crucial test for the PSLs because of the widespread use of differentials in ASR. As stated in Section 3.1 we have made no attempt to create a test regime that could provide a definitive comparison of these two temporal measures. It is not possible to say that one or the other was best in all acoustic-phonetic contexts. We have shown that both PSLs and differentials can express phonetically significant changes in behaviour as their temporal range is varied, which means that rather than just two parameters we have two sets of parameters to compare.

However, with some significant exceptions, primarily Del with long vowels and diphthongs as mentioned in the previous section, the PSLs proved to carry more useful information than parameter time differentials. The advantage of the PSLs is particularly apparent with stops which have the most complex temporal cues and, in our tests, showed the strongest dependence on temporal information in their detection.

These two forms of representing temporal context should be seen as complementary and can be used in combination where they can, for instance, represent durations of parameter gradients in two dimensional AV constructs of the form P.Dif&P.Dif.Sim (Figure 5.6). Such combinations can form the basis for specialised acoustic probes (mooted in Section 3.1 and discussed further in Section 6.2) that can be used for characterising acoustic dynamics in specific acoustic-phonetic contexts. Combinations such as P.Dif.Dif.Sim were also tested and showed potential value in specific contexts - again, primarily stops.

6.1.3 PSLs and duration

The literature analysis of Section 2.2.3 indicated duration as a principal form of temporal information in speech. The PSL, as a measure of parameter stability, was primarily intended as a data driven duration measure. As a proxy for segmental duration the performance of the PSL varies greatly with the base acoustic parameter used. Mean values of the glottal period similarity length Tx.Sim (Figure 5.7) gave the closest representation of the alignment duration but, as with the frame energy and formant PSLs, it was found to be weakly associated in the PSL/duration ratio measurements of Figure 5.9. Despite this, association rankings of Tx.Sim were generally comparable to those generated for the alignment duration (Figure 5.10). The F1 similarity length also gives rankings comparable with duration and the glottal period PSL.

The lack of strong correlation between the PSLs and alignment duration indicates that they are representing different acoustic information. In this thesis the expression ‘durational information’ has been used to cover any temporal information of a durational nature, whether based on acoustic parameter values or their time derivatives. This is contrasted with dynamical information represented by acoustic parameter time derivatives. In the association tests, dynamical durational parameters have generally outperformed the actual values of the parameter derivatives they are based on and have matched or improved on the results for the alignment duration.

Given this strength it is significant that the PSLs show strong variations in relative information content in different phonemic or speaker contexts seen, for example, in the strong qualitative differences between the glottal period and F1 PSLs for voiced and unvoiced stops in Appendix 6 (Figures 6.2a,b). The results suggest that a single measure of duration such as that represented by the alignment duration is an inadequate representation of the phonemically or phonologically significant durational information in the speech signal. In this light our original aim of finding a data driven proxy for the segmental duration was too restrictive but the PSL has been demonstrated to be capable of representing a more complex mix of durational cues.

Since the PSLs are representing some useful form of durational information it may be possible to use the PSL in generating and applying models of contextual durational influences (eg. rate, stress) discussed in Section 2.2.3 which have the potential to greatly improve the way ASR systems deal with durational variation. Accurate prosodic models are likely to provide additional benefits to ASR where, for example, stress has phonological, syntactic or semantic significance.

6.1.4 PSLs and duration in context

Useful discrimination was demonstrated for the PSLs in the vowel length decision and the stop place decision with the frame energy PSL (Et.Sim, Figures 5.11 and 5.14) displaying slightly better performance than that of the glottal period. Tx.Sim was shown to be capable of significantly greater discrimination power than the alignment duration and other PSLs in the stop voicing decision though this was apparently through the use of voicing continuity information rather than as a measure of voice-onset time.

The duration and PSL distributions used for the reported discrimination potentials constrain phonetic contexts but come from a mixture of prosodic contexts. The application of a prosodic model as suggested in the previous section, if successful, will reduce the variance of PSL values in a particular phonetic context and enhance the Discrimination Potentials of Section 5.4.2.

6.1.5 PSLs and F2 transitions

The F2 transition tests confirmed the conclusions of the initial visual inspections: that the short-term transient model of formant transitions at post-consonantal voicing onset is, like the locus models, not an invariant of the signal but rather an occasional trend. Temporal parameters such as F2.Dif.Sim are able to represent the slope and temporal range of the short term behaviour that was observed; however, similar formant movements are common in other segmental contexts in normal speech, so further information is needed in the acoustic vector to specify the context more precisely. Such information might take a form such as: ‘this is in the first glottal cycle following an unvoiced region’.

The results indicate that consonantal voicing and place decisions are based on a number of partial cues. This suggests that a strategy of generating a ‘best consensus’ decision from a multiplicity of cues may provide advantages for ASR over techniques that attempt to choose a single best cue or model.

6.2 Methodological objectives

Source synchronous framing

Source Synchronous framing based on Fundamental Harmonic Extraction proved to be computationally inexpensive while remaining reliable even under weak voicing conditions. The computational complexity, or speed, of component operations is generally an important factor in real-time tasks such as ASR. Since ASR is an open-ended problem, there is always scope to make use of time saved in one operation to expand the activities of another.
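
For orientation only, and not the thesis FHE implementation, the epoch-detection step that follows fundamental harmonic extraction can be as simple as the zero-crossing scan sketched below; the use of positive-going crossings as epoch markers is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of reading glottal epochs off an already band-limited fundamental
 * harmonic: positive-going zero crossings mark assumed epochs and their
 * spacing gives the glottal period Tx in samples (20 samples per ms for the
 * data used in this thesis). Illustrative only.
 */
public final class FundamentalEpochs {

    static int[] epochs(double[] fundamental) {
        List<Integer> marks = new ArrayList<>();
        for (int n = 1; n < fundamental.length; n++) {
            if (fundamental[n - 1] <= 0.0 && fundamental[n] > 0.0) {
                marks.add(n); // positive-going zero crossing = assumed epoch marker
            }
        }
        int[] out = new int[marks.size()];
        for (int i = 0; i < out.length; i++) out[i] = marks.get(i);
        return out;
    }

    /** Glottal periods (in samples) between successive epochs. */
    static int[] periods(int[] epochs) {
        int[] tx = new int[Math.max(0, epochs.length - 1)];
        for (int i = 1; i < epochs.length; i++) tx[i - 1] = epochs[i] - epochs[i - 1];
        return tx;
    }
}
```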

The obvious next step would be to test the FHE technique under a variety of additive or multiplicative noise types. This could be done by band-pass filtering the full signal energy at F0 or, if the noise is significantly coloured, using the band energies least affected by the noise. Initial testing using the full energy (on clean signals) has shown results comparable to those of the fundamental harmonic extracted directly from the signal.

The glottal period has shown itself, in the results presented in this thesis, to provide a valuable base parameter for representing temporal information. Specifically, the value of Tx.Sim has been repeatedly demonstrated. The apparent strength of Tx.Sim as a temporal parameter, and its potential as a source of valuable prosodic information, suggests an advantage for the use of source synchronous timing in providing an accurate and reliable measure of Tx, or instantaneous F0, for ASR.

Low dimensional representation of spectral data

The complexity estimates of Section 3.5.2 suggested that rankings of 5 may be of value with a 7 phoneme word or ranks up to 12 for 5 phonemes. Figure 5.5b shows individual phoneme rankings for single parameters (1 dimensional AVs) in this range (see also Figure A5.1 in Appendix 5). Figure 5.6 (Section 5.4) illustrates the impact of adding a second parameter to form a 2D AV. By the criteria of Section 3.5.2 the 1D and 2D vectors are providing potentially useful information that increases in value with increased AV dimension.

AV bandwidth reduction

To keep within the constraints of available data, extension of AV dimension requires further reduction in the bandwidth of individual AV components. The parameter value compression of Section 4.9 was successful for single parameters. It has not been extended to deal with a multi-dimensional AV so, with the exception of the ‘hand checked’ 2D example of Figure 5.6, the results reported in this thesis are based on single parameter AVs.

Alternative quantisation approaches are possible that can produce major reductions in single parameter bandwidth and thus allow reliable multi-dimensional operation. One option is to use decision boundaries as bin boundaries for parameter quantisation. For simple binary decisions, such as the vowel length decision, as few as 1 or 2 bits per parameter would be required. These bits can be coded to divide the acoustic parameter domain into two or more regions based on boundary positions pre-determined for individual acoustic-phonetic and speaker contexts in the manner demonstrated for the binary decisions in the discrimination tests of Section 5.4.2. This would allow 5 to 10 parameters per AV within the data constraints of this thesis (200 utterances per speaker).
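
A minimal sketch of this boundary-based quantisation is given below; the particular boundary values and their selection per context are assumptions about how such a scheme would be configured, not part of the thesis system.

```java
/**
 * Sketch of decision-boundary quantisation: an acoustic parameter value is
 * reduced to a small region code using boundaries chosen for a particular
 * acoustic-phonetic and speaker context. One boundary gives a 1-bit code
 * (e.g. short vs long for the vowel length decision), three boundaries a
 * 2-bit code, and so on. Illustrative only.
 */
public final class BoundaryQuantiser {

    /** boundaries must be sorted ascending; returns the region index. */
    static int quantise(double value, double[] boundaries) {
        int code = 0;
        while (code < boundaries.length && value >= boundaries[code]) {
            code++;
        }
        return code;
    }

    public static void main(String[] args) {
        // e.g. a single boundary near 54 ms for a duration-based vowel length decision
        double[] vowelLengthBoundary = { 54.0 };
        System.out.println(quantise(40.0, vowelLengthBoundary));  // 0 -> "short"
        System.out.println(quantise(120.0, vowelLengthBoundary)); // 1 -> "long"
    }
}
```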

A capability for high dimensional AVs would enable an evaluation of cepstral coefficients or sets of band energy ratios combined with their associated PSLs or with the glottal period PSL.

Acoustic-phonetic associations

The ultimate test for the tabular methods used in this thesis for evaluating acoustic-phonetic associations is application to a full recogniser. The degree of consistency across speakers and the low rankings achieved in many acoustic-phonetic contexts support the expectation that comparable techniques can play an important role in ASR.

6.3 Future exploration of the acoustic space

The discussion of AV bandwidth reduction in the previous section leads to the question of how best to represent high dimensional information in ASR systems. The choice of parameters used to form the acoustic vector is hindered by a combinatorial explosion. Even a one dimensional AV provides a wide range of options when taken in phonetic context. We have at least 25 possible base parameters. Combinations of temporal modifiers include Dif, Sim, Dif.Sim and Dif.Dif.Sim; including Del, Dif4, Sim2 and Sim4 as options gives 16 modified variants, or 17 options per base parameter once the unmodified parameter is included, for a total of 25 × 17 = 425 parameters. Also, design decisions such as the use of absolute time or frame based time in the construction of PSLs were shown (eg. Section 4.10.2) to provide useful alternatives or additions to those investigated in this thesis.

If, in addition to the acoustic contexts derived from the local signal, we envision a system capable of feeding phonemic hypotheses generated in the matching stage back to the acoustic stage, we have 43 phonemic contexts, 1849 (43²) diphone contexts or 602 (43 × 14) phone-class combinations. This leads to between roughly 18,000 (425 × 43) and 785,825 (425 × 1849) combinations. This can, of course, be reduced by the introduction of prior knowledge such as phonotactic constraints, but even without such assistance we can still perform an exhaustive search if tests are kept simple.

Moving to higher dimensions in a systematic manner requires a strategy that leads to the exploration of the most likely candidates first. The approach taken ‘manually’ in the experimental work reported in this thesis was intended to be an exploration in this direction. Highly aggregated data was first assessed for signs of strong acoustic-phonetic association strengths. These were then explored on partially disaggregated data and the process repeated as appropriate looking for low level origins of strong associations. This process can be automated.

Additional approaches include combining small AVs that appear to carry differing and complementary information as indicated by divergent or complementary patterns of ranking across phonemes (or phoneme classes). The introduction of human expertise gained, for example, in perception experiments or visual inspection of time-frequency plots could be used to define more complex acoustic features than are likely to be discovered automatically in a combinatorial search - for example defining the occlusion, burst and voicing onset regions of stops and the expected temporal relationships between them.

The parameter trajectory shape parameter introduced in Section 4.10.4 is not reported in detail in this thesis because of the complexity of its behaviour and marginal advantage in aggregated results. It showed potential in specific acoustic-phonetic contexts that could be further explored with an automated search.

The final output of the acoustic-phonetic analysis system used in this thesis is a frame based time stream of phoneme rankings. An obvious extension to the present work would be to develop a word recognition system. A first step towards a full recogniser could take the form of a phoneme level temporal analysis based on clustering of temporally local frames in the rank stream with the aim of improving on the phoneme hypotheses of individual frames.

Clustering could be performed using a window with a width appropriate to the expected (eg. long term mean) duration of the hypothesised phoneme, or longer to include coarticulatory effects. An appropriate measure for the strength of the combined rankings could be obtained by generating a ranking from the sum of the association table rows corresponding to the AVs within the window. A non-rectangular window can be implemented through weighting the individual AV contributions.
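
A sketch of this clustering step is given below; the association-table layout and the weighting scheme are assumptions for illustration, not part of the thesis system.

```java
/**
 * Sketch of combining frame-level association evidence over a window:
 * per-frame association-table rows (one score per phoneme) are summed with
 * window weights, and the combined scores are then ranked. Equal weights give
 * a rectangular window; unequal weights give the non-rectangular window
 * mentioned above. Illustrative only.
 */
public final class WindowedRanking {

    static double[] combine(double[][] frameScores, int centre, double[] weights) {
        int half = weights.length / 2;
        int phonemes = frameScores[0].length;
        double[] combined = new double[phonemes];
        for (int w = 0; w < weights.length; w++) {
            int frame = centre - half + w;
            if (frame < 0 || frame >= frameScores.length) continue; // utterance edge
            for (int p = 0; p < phonemes; p++) {
                combined[p] += weights[w] * frameScores[frame][p];
            }
        }
        return combined; // rank phonemes by descending combined score
    }
}
```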

6.4 Conclusions

A low dimensional data driven temporal framework has been established and Similarity Lengths, in particular the PSL of the glottal period derived from the fundamental harmonic, have been shown to be an effective means of encoding temporal information. We have demonstrated that information is encoded that assists in the vowel length decision and the stop voicing and place decisions. The potential exists for the Similarity Length to further assist in the stop place decision using formant onset dynamics, provided the relevant acoustic-phonetic context can be established from other contextual information.

We started our analysis of temporal information in the AV stream by noting that in some conventional recognition systems up to two thirds of the information bandwidth of the AVs was allocated to high dimensional parameter derivatives. We have explored an approach that can reduce the bandwidth allocated to temporal information by an order of magnitude or more. Although we have not made a direct comparison between such high dimensional representations and the representations evaluated here we have shown that low dimensional representation can provide valuable temporal information that is comparable to that provided by the alignment derived duration. If, as our analysis in Chapter 2 implies, duration is a principal source of phonologically significant temporal information then we can conclude that, at least in this respect, the results reported here point the way to significant improvements in ASR performance.

There is no obvious impediment to these results being applied to ASR systems based on either Hidden Markov Models or Artificial Neural Networks, significantly enhancing the ability of both systems to deal with temporal issues. The data driven rule-based approach used as a conceptual and methodological framework for this thesis could provide a more flexible approach to ASR system design that could incorporate both HMM and ANN algorithms as specialist techniques.

Demonstrating an ability to model, from the data, key phonologically significant temporal information is a major step toward a proof of principle for the use of data driven rule-based systems in ASR.

Appendix 1: The SLab Speech Laboratory

Figure A1.1: Example of the SLab User Interface desktop showing popup command menus for the Store and View Icons. The primary object, represented on the SLab desktop as a Store icon, is the Java NodeNode that can generate or store other icons/nodes such as a ViewNode. A ViewNode generates and controls a display window or view. Examples of signal and frequency domain views are seen in Chapter 4 (eg. Figures 4.5 and 4.6). Note that this is a reconstruction and not all functions shown are completed. Some menu options should be greyed out as inoperative.

Figure A1.2: Java Class hierarchy: Tree structure represents class inheritance. (Referenced in Section 4.3)

Figure A1.2: SLab Class hierarchy (cont.)

Figure A1.2: SLab Class hierarchy (cont.)

Appendix 2: Speech data information

2.1: Phonetic labels

Information on the ANDOSL speech data used. Further information can be found in the ANDOSL repository at www.andosl.anu.edu.au/andosl. (Referenced in Sections 4.4, 4.8 and 4.11)

Short Vowels (SV): /I/ "hid", /U/ "hood", /E/ "head", /@/ "the" (not "thee"), /O/ "hod", /V/ "bud", /A/ "had"

Long Vowels (LV): /i:/ "heed", /u:/ "who'd", /e:/ "there", /@:/ "heard", /o:/ "hawed", /a:/ "hard"

Diphthongs (DI): /ei/ "hay", /@u/ "hoed", /oi/ "hoy", /ai/ "hide", /au/ "how", /i@/ "here", /u@/ "tour"

Table A2.1a: Labels used for Vowels and Diphthongs.

Appendix 2.1: Phonetic labels (cont.)

Stops (S):
  /p/  "poor"   Bilabial voiceless stop (US)
  /b/  "bore"   Bilabial voiced stop (VS)
  /t/  "tore"   Alveolar voiceless stop (US)
  /d/  "door"   Alveolar voiced stop (VS)
  /k/  "core"   Velar voiceless stop (US)
  /g/  "gore"   Velar voiced stop (VS)

Fricatives (FR):
  /f/  "fan"    Labio-dental voiceless fricative
  /v/  "van"    Labio-dental voiced fricative
  /T/  "thin"   Inter-dental voiceless fricative
  /D/  "than"   Inter-dental voiced fricative
  /s/  "sue"    Alveolar voiceless fricative
  /z/  "zoo"    Alveolar voiced fricative
  /S/  "sure"   Palatal voiceless fricative
  /Z/  "azure"  Palatal voiced fricative
  /h/  "ham"    Glottal voiceless fricative

Affricates (AF):
  /tS/ "chore"  Voiceless Alveolar
  /dZ/ "judge"  Voiced Alveolar

Nasals (NA):
  /m/  "mow"    Bilabial closure
  /n/  "now"    Alveolar closure
  /N/  "sing"   Velar closure

Glides and Liquids (GL):
  /l/  "lull"   Lateral
  /r/  "row"    Rhotic
  /w/  "wow"    Bilabial
  /j/  "you"    Palatal

Table A2.1b: Labels used for Consonants.

Appendix 2.2: Speaker information

Data categories                   S052     S058     S065     S081      S083     S106
SEX:                              f        f        f        m         m        m
YOB: Year of birth                1957     1959     1948     1955      1973     1953
HGT: Height (cm)                  168      163      160      170       187      180
WGT: Weight (Kg)                  60       64       -        54        80       86
COH: Circumference of head (cm)   59       56.8     54.8     58.1      59.1     62.1
ECD: Eye chin distance (cm)       11.3     11.5     10.3     10.3      12.2     12.5
LFR: Lung flow rate (L/s)         460      300      350      450       660      460
ACC: Accent                       n0       n0       n0       n0        n0       n0
INT: Intelligibility              high     high     high     high      high     high
FLU: Fluency                      fluent   fluent   fluent   fluent    fluent   fluent
HER: Hearing                      good (recorded for four of the six speakers)
NAL: Native language              english  english  english  english   english  english
SOC: Speech category              cult.    cult.    cult.    cult.     cult.    cult.
SPQ: Speech quality               normal   normal   quiet    light     normal   normal
POB: Place of birth               Sydney   Sydney   Sydney   Inverell  Sydney   Goulburn
AGE: (yrs)                        36       34       45       38        20       40
AGE_CAT: Age category             m        m        m        m         y        m
SPK_TYPE: Speaker type code       fmc      fmc      fmc      mmc       myc      mmc

Table A2.2: Speaker characteristics as provided in the ANDOSL ancillary data records.

Annotations (speakers S058, S065 and S081):

• Voice had a growling, "gravelly" quality throughout, even at the start of the session.
• Unusually flat, expressionless intonation pattern, almost monotone at times.
• Voice slightly slurred at times.
• Subject seemed nervous of the recording environment, possibly slightly claustrophobic and apprehensive of being able to see space underfoot through the wire floor. Subject commented on her frequent mid-sentence hesitations, which she said she recognised as a normal part of her everyday speech.
• Slightly hoarse during last 20 sentences.

Table A2.3: Annotations made by the attending phonetician at the recording sessions for speakers S058, S065 and S081.

Appendix 3: Algorithms

Java code for the formant picker and AV bandwidth compression algorithms. (Referenced in Sections 4.7 and 4.9)

Figure A3.1: Formant Picker algorithm (Section 4.7).
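
For orientation only, a bare-bones peak-picking routine is sketched below. It is a generic illustration and is not the SLab Formant Picker of Section 4.7, whose details differ.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Generic spectral peak-picking sketch for formant candidates: local maxima of
 * a smoothed magnitude spectrum are taken, lowest frequency first, as F1, F2,
 * F3 candidates. Illustrative only; not the SLab algorithm of Section 4.7.
 */
public final class SimplePeakPicker {

    static double[] pickFormants(double[] magnitude, double binHz, int nFormants) {
        List<Double> peaks = new ArrayList<>();
        for (int k = 1; k < magnitude.length - 1 && peaks.size() < nFormants; k++) {
            if (magnitude[k] > magnitude[k - 1] && magnitude[k] >= magnitude[k + 1]) {
                peaks.add(k * binHz); // local spectral maximum -> candidate formant
            }
        }
        double[] out = new double[peaks.size()];
        for (int i = 0; i < out.length; i++) out[i] = peaks.get(i);
        return out;
    }
}
```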

------------------------------------------------------

Figure A3.2: Compression algorithms for AV bandwidth compression (Section 4.9).

Appendix 4: Frequency domain smoothing

A comparison of the time-frequency formant plots generated from spectral data that has been smoothed prior to formant picking. The smoothing parameter used in Figure A4.1b (a = 0.6) was used for the formant data reported in Chapter 5. (Referenced in Section 4.7)
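
Equation E4.4 is not reproduced in this appendix; a first-order recursive smoother with memory a is assumed below as an illustration of how such a parameter acts on successive spectra.

```java
/**
 * Sketch of first-order recursive smoothing with memory parameter a, applied
 * bin-by-bin to successive short-time spectra before formant picking. The
 * exact form of Equation E4.4 is assumed here: a larger a (e.g. 0.85) smooths
 * more heavily than a smaller a (e.g. 0.6).
 */
public final class SpectralSmoother {

    /** Updates 'smoothed' in place from the current frame's spectrum. */
    static void update(double[] smoothed, double[] current, double a) {
        for (int k = 0; k < current.length; k++) {
            smoothed[k] = a * smoothed[k] + (1.0 - a) * current[k];
        }
    }
}
```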

[Time-frequency plots of F1-F3 (0-5 kHz) for utterance S106 S001 with aligned phoneme labels; panel (a) run R6332, panel (b) run R6337.]

Figure A4.1: Spectral smoothing examples for different smoothing parameters (memory ‘a’ in Equation E4.4) (a) a=0.85, (b) a = 0.6

Appendix 5: Formant histograms

Formant histogram plots representing the accumulated peak positions for the first three formants. Data is presented on a speaker by class basis. (Referenced in Sections 4.7 and 4.12)

[Formant histograms R6348-9 (frames vs kHz) for /LV/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1a-b: Long vowels

[Formant histograms R6348-9 (frames vs kHz) for /SV/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1c-d: Short vowels

[Formant histograms R6348-9 (frames vs kHz) for /DI/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1e-f: Diphthongs

[Formant histograms R6348-9 (frames vs kHz) for /NA/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1g-h: Nasals

[Formant histograms R6348-9 (frames vs kHz) for /FR/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1i-j: Fricatives

[Formant histograms R6348-9 (frames vs kHz) for /AF/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1k-l: Affricates

[Formant histograms R6348-9 (frames vs kHz) for /US/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1m-n: Unvoiced stops

[Formant histograms R6348-9 (frames vs kHz) for /VS/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1o-p: Voiced stops

[Formant histograms R6348-9 (frames vs kHz) for /GL/: panel (a) speakers S106, S083, S081; panel (b) S065, S058, S052.]

Figures A5.1q-r: Glides and Liquids

Appendix 6: Tx and Tx.Sim time streams

Duration-time plots for the glottal period (Tx) and two Similarity Lengths Sim2 (range = 20% Std.Dev.) and Sim4 (range = 40% SD). (Referenced in Section 4.10.2)
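
For orientation, the Sim2/Sim4 construction is sketched below. The precise definition (including whether the length extends backwards, forwards or both from the current frame) is given in Section 4.10.2; the two-sided extension and the use of per-frame durations here are assumptions.

```java
/**
 * Rough sketch of a Parameter Similarity Length of the Sim2/Sim4 kind: the
 * time (ms) over which a parameter stays within a tolerance of its value at
 * the current frame, the tolerance being 20% (Sim2) or 40% (Sim4) of the
 * parameter's standard deviation. Illustrative only; see Section 4.10.2 for
 * the definition actually used in the thesis.
 */
public final class SimilarityLength {

    static double psl(double[] values, double[] frameMs, int frame,
                      double stdDev, double rangeFraction) {
        double tol = rangeFraction * stdDev;
        double centre = values[frame];
        double length = frameMs[frame];
        for (int i = frame - 1; i >= 0 && Math.abs(values[i] - centre) <= tol; i--) {
            length += frameMs[i]; // extend back while the parameter stays 'similar'
        }
        for (int i = frame + 1; i < values.length && Math.abs(values[i] - centre) <= tol; i++) {
            length += frameMs[i]; // and forward
        }
        return length;
    }
}
```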

[Stream plots R6695: Tx, Tx.Sim2 and Tx.Sim4 against time, with aligned phoneme labels, for utterance S001; panels for speakers S106 and S083.]

Figure A6.1a-b: Utterance S001 from Speakers S106 and S083. Time is measured in signal samples (20 per ms) for Tx and ms for the PSLs.

[Stream plots R6695: Tx, Tx.Sim2 and Tx.Sim4 against time, with aligned phoneme labels, for utterance S001; panels for speakers S081 and S065.]

Figure A6.1c-d: Utterance S001 from male speaker S081 and female S065. Time is measured in signal samples (20 per ms) for Tx and ms for the PSLs.

[Stream plots R6695: Tx, Tx.Sim2 and Tx.Sim4 against time, with aligned phoneme labels, for utterance S001; panels for speakers S058 and S052.]

Figure A6.1e-f: Utterance S001 from female speakers S058 and S052. Time is measured in signal samples (20 per ms) for Tx and ms for the PSLs.

Appendix 7: Parameter smoothing and constraint

Disaggregations of Figure 5.1 (Section 5.2) which compares the mean acoustic-phonetic association ranking of base parameters with their smoothed and outlier corrected versions. Two disaggregations by speaker and phoneme are given in Figures A7.1 and A7.2 respectively. Tables A7.1 and A7.2 show individual correction patterns for F1 and F2 through the stop /p/. (Referenced in Section 5.2)

[Plot R6156: mean association ranks by phone class (V, SV, LV, D, S, US, VS, N, G, F, M) for Tx, F1, F2, F3, Etot and their .Cor and .Smo variants, speaker S106.]

Figure A7.1a: S106 association ranks by phone class

[Plot R6156: mean association ranks by phone class for the base, .Cor and .Smo parameters, speaker S083.]

Figure A7.1b: S083 association ranks by phone class

[Plot R6156: mean association ranks by phone class for the base, .Cor and .Smo parameters, speaker S081.]

Figure A7.1c: S081 association ranks by phone class

[Plot R6156: mean association ranks by phone class for the base, .Cor and .Smo parameters, speaker S065.]

Figure A7.1d: S065 association ranks by phone class

[Plot R6156: mean association ranks by phone class for the base, .Cor and .Smo parameters, speaker S058.]

Figure A7.1e: S058 association ranks by phone class

[Plot R6156: mean association ranks by phone class for the base, .Cor and .Smo parameters, speaker S052.]

Figure A7.1f: S052 association ranks by phone class

[Plots R6156 (Spk6): individual phoneme rankings for F1, F1.Cor and F1.Smo (upper panel) and F2, F2.Cor and F2.Smo (lower panel).]

Figure A7.2: Impact of Cor and Smo modifiers for all phonemes. The final four columns are mean values: M is the all phoneme mean; V, S and D are the vowel, stop and diphthong means respectively.


Table A7.1: Constraint counts for F1 and /p/. Vertical axis is original value. Horizontal axis is constrained value.

Table A7.2: Constraint counts for F2 and /p/. Vertical axis is the original value; horizontal axis is the constrained value.

[Table data omitted.]
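
Tables A7.1 and A7.2 cross-tabulate how often each original (quantised) formant value was mapped to each constrained value by the outlier constraint in frames of /p/. A tabulation of this kind can be assembled as in the short Python sketch below; the variable names and example values are hypothetical.

    from collections import Counter

    # Count, for each (original, constrained) pair of quantised values, how many
    # frames were mapped from the first to the second by the outlier constraint.
    def constraint_counts(original, constrained):
        return Counter(zip(original, constrained))

    # Hypothetical quantised F1 values before and after the constraint.
    orig = [5, 5, 18, 6, 7, 7, 19, 6]
    cons = [5, 5,  7, 6, 7, 7,  8, 6]
    for (o, c), n in sorted(constraint_counts(orig, cons).items()):
        print(f"original {o:>2} -> constrained {c:>2}: {n} frame(s)")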

Appendix 8: Differential-PSL comparisons

Disaggregations of Figure 5.4 (Section 5.3.1), which compares the mean acoustic-phonetic association rankings of parameter time differentials (Del, Dif2, Dif4) with parameter similarity lengths (PSLs) of two ranges (20% and 40%). Disaggregations by speaker, by acoustic vector (AV) and by phoneme are given in Figures A8.1, A8.2 and A8.3 respectively. (Referenced in Sections 5.3.1 and 5.3.2)
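
The modifiers compared in these figures fall into two families: frame differentials taken over increasing spans (Del, Dif2, Dif4) and parameter similarity lengths (Sim2 and Sim4, which the figure legends suggest correspond to the 20% and 40% ranges mentioned above). The Python sketch below illustrates plausible forms of both measures; the spans, the symmetric forwards-and-backwards search and the treatment of track ends are assumptions made for illustration and may differ from the definitions used in the body of the thesis.

    import numpy as np

    # Illustrative Del/Dif2/Dif4: difference between the frame 'span' steps
    # ahead and the current frame (span = 1, 2 or 4).
    def differential(track, span=1):
        track = np.asarray(track, dtype=float)
        out = np.zeros_like(track)
        out[:-span] = track[span:] - track[:-span]
        return out

    # Illustrative parameter similarity length: for each frame, count the
    # contiguous neighbouring frames (searching forwards and backwards) whose
    # value stays within rel_range (0.2 for a 20% range) of the current value.
    def similarity_length(track, rel_range=0.2):
        track = np.asarray(track, dtype=float)
        n = len(track)
        psl = np.zeros(n, dtype=int)
        for i in range(n):
            lo, hi = track[i] * (1 - rel_range), track[i] * (1 + rel_range)
            j = i
            while j + 1 < n and lo <= track[j + 1] <= hi:
                j += 1
            k = i
            while k - 1 >= 0 and lo <= track[k - 1] <= hi:
                k -= 1
            psl[i] = j - k          # number of similar neighbouring frames
        return psl

    # Hypothetical F2 track (Hz): a steady region, a jump, and a return.
    f2 = [1500, 1520, 1540, 1900, 1920, 1910, 1500]
    print(differential(f2, span=2))
    print(similarity_length(f2, rel_range=0.2))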

[Charts R6128: Differential modifiers for [Tx+F1+F2+F3], one panel per speaker (S106, S083, S081). Series: Del, Dif2, Dif4, Sim2, Sim4. Horizontal axis: phone class (V, SV, LV, D, S, US, VS, N, G, F, M); vertical axis: rank.]

Figure A8.1a-c: Speaker disaggregations of Figure 5.4.

[Charts R6128: Differential modifiers for [Tx+F1+F2+F3], one panel per speaker (S065, S058, S052). Series: Del, Dif2, Dif4, Sim2, Sim4. Horizontal axis: phone class (V, SV, LV, D, S, US, VS, N, G, F, M); vertical axis: rank.]

Figure A8.1d-f: Speaker disaggregations of Figure 5.4.

[Charts R6128: Spk6 temporal modifier rankings, one panel per acoustic vector ([Tx], [F1], [F2], [F3]). Series: the panel parameter's .Del, .Dif2, .Dif4, .Sim2 and .Sim4 variants. Horizontal axis: phone class (V, SV, LV, D, S, US, VS, N, G, F, M); vertical axis: rank.]

Figure A8.2a-d: Acoustic vector disaggregation of Figure 5.4.

[Charts R6128 Spk6: Individual phoneme rankings for temporal acoustic vectors. Panel a: Tx.Del, Tx.Dif2, Tx.Dif4; panel b: F1.Del, F1.Dif2, F1.Dif4. Horizontal axis: individual phonemes followed by the mean columns M, V, S, D; vertical axis: rank.]

Figure A8.3a,b: Acoustic vector and phoneme disaggregation of Figure 5.4.

[Charts R6128 Spk6: Individual phoneme rankings for temporal acoustic vectors. Panel c: F2.Del, F2.Dif2, F2.Dif4; panel d: F3.Del, F3.Dif2, F3.Dif4. Horizontal axis: individual phonemes followed by the mean columns M, V, S, D; vertical axis: rank.]

Figure A8.3c,d: Acoustic vector and phoneme disaggregation of Figure 5.4.

[Charts R6128 Spk6: Individual phoneme rankings for temporal acoustic vectors. Panel e: Tx.Sim2, Tx.Sim4; panel f: F1.Sim2, F1.Sim4. Horizontal axis: individual phonemes followed by the mean columns M, V, S, D; vertical axis: rank.]

Figure A8.3e,f: Acoustic vector and phoneme disaggregation of Figure 5.4.

[Charts R6128 Spk6: Individual phoneme rankings for temporal acoustic vectors. Panel g: F2.Sim2, F2.Sim4; panel h: F3.Sim2, F3.Sim4. Horizontal axis: individual phonemes followed by the mean columns M, V, S, D; vertical axis: rank.]

Figure A8.3g,h: Acoustic vector and phoneme disaggregation of Figure 5.4.

Bibliography

Abe, Y., Kunio, N. (1993), A Bounded Transition Hidden Markov Model for Continuous Speech Recognition, EUROSPEECH-93, 595-598.

Ahlbom, G., Bimbot, F., Chollet, G. (1987), Modelling Spectral Speech Transitions Using Temporal Decomposition Techniques, ICASSP-87, 13-16.

Ainsworth, W.A., Millar, J.B. (1976), Allophonic variations of stop consonants in a speech synthesis-by-rule program, International Journal Man-Machine Studies, 8, 159-168.

Allen, J.F. (1984), Towards a general theory of action and time, Artificial Intelligence, 23, 123-54.

Allen, J.F. (1991), Time and time again: the many ways to represent time, International Journal of Intelligent Systems, 6, 341-55.

Apolloni, B., Crivelli, D., Amato, M. (1993), Neural Time Warping, EUROSPEECH-93, 139-142.

Atal, B.S. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 233.

Atal, B.S. (1983), Efficient Coding of LPC parameters by Temporal Decomposition, ICASSP-83, 81-84.

Babich, G.A., Camps, O.I. (1996), Weighted Parzen Windows for pattern classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, V18, No.5, 567-570.

Bagshaw, P.C. (1994), Automatic prosodic analysis for computer aided pronunciation teaching, PhD thesis, University of Edinburgh.

Baker, J. (1975), The DRAGON System - An Overview, IEEE Transactions on Acoustics, Speech and Signal Processing, 23(1), 24-29.

Bauer, H.U., Geisel, T. (1990), Nonlinear Dynamics of Feedback Multilayer Perceptrons, Phys. Rev. A, 42(4), 2401-2409.

Bell-Berti, F., Harris, K.S. (1981), A temporal model of speech production, Phonetica, 38, 9-20.

Blumstein, S.E. (1980) Perceptual invariance and onset spectra for stop consonants in different vowel environments, JASA 67(2), 648-662.

Blumstein, S.E., Isaacs E., Mertus J. (1982), The role of the gross spectral shape as a perceptual cue to place of articulation in initial stop consonants, JASA 72(1), 43-50.

Blumstein, S.E., Stevens, K.N. (1979), Acoustic invariance in speech production: Evidence from the spectral characteristics of stop consonants, JASA 66, 1001-1017.

Bocchieri, E.L., Wilpon J.G. (1993), Discriminative feature selection for speech recognition, Computer Speech and Language, 7, 229-246.

Bourlard, H., Hermansky H., Morgan N. (1996), Towards increasing speech recognition error rates, Speech Communication 18, 205-231.

Bourlard, H., Wellekens, C.J. (1989), Speech Dynamics and Recurrent Neural Networks, ICASSP-89, 33-36.

Box, G.E.P., Tiao, G.C. (1973), Bayesian inference in statistical analysis, Addison-Wesley.

Burshtein, D. (1996), Robust Parametric modelling of Duration in Hidden Markov Models, IEEE Transactions on Speech and Audio Processing, 4(3), 240-242.

Butcher, A. (1978), Pause and syntactic structure, in Dechert and Raupach (1978).

Carlson, R., Granstrom, B. (1986), A search for durational rules in a real-speech data base, Phonetica 43, 140-154.

Cooper, W.E., Danley, M. (1981), Segmental and temporal aspects of utterance-final lengthening, Phonetica, 38, 106-115.

Crystal, T.H., House, A.S. (1982), Segmental durations in connected speech signals: Preliminary results, JASA 72(3), 705-716.

Crystal, T.H., House, A.S. (1988a), The duration of American-English vowels: an overview, Journal of Phonetics, 16, 263-284.

Crystal, T.H., House, A.S. (1988b), Segmental durations in connected-speech signals: current results, JASA 83(4), 1553-1573.

Davies, D., Millar, J.B. (1996), A system for source synchronous analysis of speech signals, SST’96, 527-532

De Mori R. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 234-5.

Dechert and Raupach (1978), Temporal variables in speech, Mouton, The Hague.

Deng, L., Kenny, P., Lennig, M., Mermelstein, P. (1992a), Modelling acoustic transitions in speech by state interpolation Hidden Markov Models, IEEE Transactions on Signal Processing, 40(2), 265-271.

Deng, L. (1992b), A generalised Hidden Markov Model with state-conditioned trend functions of time for the speech signal, Signal Processing, 27, 65-78

Deng, L. (1994), Speech recognition using Hidden Markov Models with polynomial regression functions as nonstationary states, IEEE Transactions on Speech and Audio Processing, 2(4), 507-520.

Deng, L. (1997), Speaker-independent phonetic classification using Hidden Markov Models with mixtures of trend functions, IEEE Transactions on Speech and Audio Processing, 5(4), 319-324

Deng, L. (1998), A dynamic feature based approach to the interface between phonology and phonetics for speech modelling and recognition, Speech Communications, 24, 299-323.

Dijk-Kappers, A.M.L. van, Marcus, S.M. (1987), Temporal decomposition of speech, IPO-AR, 22, 41-50.

Dijk-Kappers, A.M.L. van (1988), Comparison of parameter sets for temporal decomposition of speech, IPO-AR, 23, 24-33.

Dologlou, I., Caratannis, G. (1989) Pitch detection based on zero phase filtering, Speech Communication, 8, 309-318.

Dologlou, I., Caratannis, G. (1991) A reply to "Some remarks on the halting criterion for iterative low-pass filtering in a recent proposed pitch detection algorithm" by G. Hult, Speech Communication 10, 227-228.

Fant, G. (1960) Acoustic theory of speech production, Mouton and Co., The Hague.

Ferguson, J.D., (ed.) (1986), Variable duration models for speech, in Proc. Symp. Applic. Hidden Markov Models Text Speech, Princeton NJ, IDA-CRD, pp 143-179 (not sighted).

Fischer, R.M., Ohde, R.N. (1990), Spectral and durational properties of front vowels as cues to final stop-consonant voicing, JASA 88, 1250-1259.

Flanagan, J. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 236-7.

Furui, S. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 238.

Francis, A.L., Nusbaum, H.C. (1996), Paying attention to speaking rate, ICSLP’96, 1537-1540

Freeman, W.J. (1979), Nonlinear dynamics of paleocortex manifested in the olfactory EEG, Biological Cybernetics, 35, 21-37.

Freksa, C. (1992), Temporal reasoning based on semi-intervals, Artificial Intelligence, 54, 199-227.

Galton, A. (1990), A critical examination of Allen’s theory of action and time, Artificial Intelligence, 42, 159-88.

Gong, Y., Cheng, Y., Haton, J.-P. (1991), Neural network coupled with IIR sequential adapter for phoneme recognition in continuous speech, ICASSP-91, 153-156.

Gotoh, Y., Hochberg, M.M., Silverman, H.F. (1994), Using MAP estimated parameters to improve HMM speech recognition performance, ICASSP-94, I-229 - I-232.

Green, P.D., Brown, G.J., Cooke, M.P., Crawford, M.D., Simons, A.J.H. (1990), Bridging the gap between signals and symbols in speech recognition, Advances in Speech, Hearing and Language Processing, 1, 149-192.

Grosjean, F. (1978), Linguistic structures and performance structures: studies in pause distribution, in Dechert and Raupach (1978).

Haton, J.-P. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 239.

Henderson, A.I. (1978), Juncture pause and intonation fall and the perceptual segmentation of speech, in Dechert and Raupach (1978).

Hermansky, H., Morgan, N. (1994), RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, V2, No.4, 578-589.

Hermes, D.J. (1996), Timing of pitch movements and accentuation of syllables, ICSLP’96, 1197-1200.

Hess, W.J. (1974), A pitch-synchronous, digital feature extraction system for phonemic recognition of speech, IEEE Symp. Speech Recognition, 112-121.

Hess, W.J. (1983) Pitch determination of Speech Signals, Springer Verlag, Berlin.

Hess, W.J. (1984), Effective implementation of short-term analysis pitch determination algorithms, Proc. 10th ICPS, 263-269.

Hess, W.J., Indefrey, H. (1987), Accurate time-domain pitch determination of speech signals by means of a laryngograph, Speech Communication, 6, 55-68.

Hochberg, M.M., Silverman, H.F. (1993) Constraining model duration variance in HMM-Based connected speech recognition, EUROSPEECH '93, 323-326.

Holmes, J.N., Holmes, W.J., Garner, P.N. (1997), Using formant frequencies in speech recognition, EUROSPEECH ‘97, V4, p2083-2086.

Hunt, M.J. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 240-1.

Hussain, A. (1997), A new neural network structure for temporal signal processing, ICASSP-97, V4, 3341-3344.

Jarre, A., Pieraccini, R. (1987), Some experiments in speaker adaptation, ICASSP-87, 1273-1276.

Jelinek, F. (1996), Five speculations (and a divertimento) on the themes of H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 242-6.

Kajarekar, S.S., Yegnanarayana, B., Hermansky, H. (2001), A study of two-dimensional linear discriminants for ASR, ICASSP-01, SPEECH-p11, 7.

Klatt, D.H. (1976), Linguistic uses of segmental duration in English: acoustic and perceptual evidence, JASA, 59(5), 1208-1221.

Krishnan, S., Rao, P.V.S. (1996), A comparative study of explicit frequency and conventional signal representations for speech recognition, Digital Signal Processing, 6, 249-264.

Kroon, P., Atal, B.S. (1991) On the use of pitch predictors with high temporal resolution, IEEE Transactions on Signal Processing 39, 733-735.

Kroot, C., Taylor, B. (1995), ANDOSL Annotation, www.andosl.anu.edu.au/andosl.

Kuwabara, H. (1996), Acoustic properties of phonemes in continuous speech for different speaking rate, ICSLP‘96, 2435-2438.

Lamel, L.F. (1993), A knowledge-based system for stop consonant identification based on speech spectrogram reading, Computer Speech and Language, V2, 169-191.

Lang, K.J., Waibel, A.H., Hinton, G.E. (1990), A time-delay neural network architecture for isolated word recognition, Neural Networks, 3, 23-43.

Lee, H., Tannock, J., Williams, J.S. (1993), Logic based reasoning about actions and plans in artificial intelligence, The Knowledge Engineering Review, 8(2), 91-120.

Lehiste, I. (1970), Suprasegmentals, MIT Press, Cambridge, Massachusetts, and London.

Levinson, S.E. (1986), Continuously variable duration Hidden Markov Models for automatic speech recognition, Computer Speech and Language, 1, 29-45.

Levinson, S.E., Rabiner, L.R., Sondhi, M.M. (1983), An introduction to the application of the theory of probabilistic functions of a Markov Process to automatic speech recognition, Bell System Technical Journal, 62(4), 1035-1074.

Liberman, A.M. (1952), The role of selected stimulus-variables in the perception of the unvoiced stop consonants, The American Journal of Psychology, LXV(4), 497-516.

Lindblom, B. (1963), On vowel reduction, Report No.29, The Royal Institute of Technology, Speech Transmission Laboratory, Stockholm, Sweden.

Lippmann, R.P. (1996), Recognition by humans and machines: miles to go before we sleep, Speech Communication 18, 247-8.

Loganantharaj, R. (1991), Representation and compilation of knowledge in point-based temporal system, International Journal of Intelligent Systems, 6, 549-67.

Long, D. (1989), A review of temporal logics, Knowledge Engineering, 4, 141-62.

Luce, P.A., Charles-Luce, J. (1985), Contextual effects on vowel duration, closure duration, and the consonant/vowel ratio in speech production, JASA, 78(6), 1949-1957.

Luzyanina, T.B. (1995), Synchronisation in an oscillator neural network model with time-delayed coupling, Network: Computation in Neural Systems, 6, 43-59.

Lyberg, B. (1979), Final lengthening – partly a consequence of restrictions in the speed of fundamental frequency change?, Journal of Phonetics, 7, 187-196.

Malayath, N., Hermansky, H., Kain, A. (1997), Towards decomposing the sources of variability in speech, EUROSPEECH-97, V1, 497-500.

Marcus, S.M. (1979), Context-sensitive coding in speech perception - a simulation, IPO Annual Progress Report 14.

Marcus, S.M., Lieshout, R.A.J.M. van (1984), Temporal decomposition of speech, IPO Annual Progress Report, 19, 25-31.

Marek, B. (1978), Phonological status of the pause, in Dechert and Raupach (1978).

Mariani, J., Gauvian, J.L., Lamel L. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 249-52.

Marslen-Wilson, W., Tyler, L.K. (1980), The temporal structure of spoken language understanding, Cognition, 8, 1-71.

Marslen-Wilson, W., Tyler, L.K. (1983), Reply to Cowart, Cognition, 15, 227-235.

Millar, J.B., Wagner, M. (1983) The Automatic analysis of acoustic variance in speech, Language and Speech, 6(2), 145-158.

Millar, J.B., O'Kane, M., Bryant, P. (1989) Design, collection and description of a database of spoken Australian English, Australian Journal of Linguistics 9, 165-189.

Miller, J.L., Green, K.P., Reeves, A. (1986), Speaking rates and segments: a look at the relation between speech production and speech perception for the voicing contrast, Phonetica, 43, 106-115.

Montacie, C., Choukri, K., Chollet, G. (1989), Speech recognition using temporal decomposition and multi-layer feed-forward automata, ICASSP-89, 409-412.

Neagu, A., Bailly, G. (1997), Relative contributions of noise burst and vocalic transitions to the perceptual identification of stop consonants, EUROSPEECH-97, 2175-2178.

Neuburg, E.P. (1971), Markov models for phonetic text, National Security Agency report, Abstract in JASA 50(1), p116 (original not sighted).

Nooteboom, S.G., Slis, I.N. (1969), A note on rate of speech, IPO Annual Progress Report, 58-61.

Norris, D. (1982), Autonomous processes in comprehension, Cognition, 11, 97-101.

Ohno, S., Masamichi, F., Fujisaki, H. (1996), Quantitative analysis of the local speech rate and its application to speech synthesis, ICSLP’96, 2254-2257.

Oppenheim, A.V., Schafer, R.W. (1989) Discrete-time signal processing, Prentice Hall, New Jersey.

Paliwal, K.K. (1992), Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, 2, 157-173.

Park, Y.K., Un, C.K., Kwon, O.W. (1996), Modelling acoustic transitions in speech by modified Hidden Markov Models with state duration and state duration-dependent observation probabilities, IEEE Transactions on Speech and Audio Processing 4(5), 389-392.

Parzen, E. (1962), On estimation of a probability density function and mode, Ann. Math. Statistics, V33, 1065-1076.

Picone, J. (1989) On modelling duration in context in speech recognition, ICASSP89, 421-424.

Pols, L.C.W., Wang, X., ten Bosch, L.F.M. (1996a), Modelling of phone duration (using the TIMIT database), and its potential benefit for ASR, Speech Communication, 19, 161-176.

Pols, L.C.W., Wang, X. (1996b), Extracting durational knowledge and applying it for improved ASR, SPECOM’96, 16-21.

Port, R.F. (1979), The influence of tempo on stop closure duration as a cue for voicing and place, Journal of Phonetics, 7, 45-56.

Porter, R.J., Cullen, J.K., Collins, M.J., Jackson, D.F. (1991), Discrimination of formant transition onset frequency: psychoacoustic cues at short, moderate and long durations, JASA 90, 1298-1308.

Rabiner, L.R., Juang, B.-H. (1993), Fundamentals of Speech Recognition, Prentice Hall, New Jersey.

Rabiner, L.R. (1986) A model-based connected-digit recognition system using either hidden Markov models or templates, Computer Speech and Language, 1, 167-197

Rabiner, L.R., Juang, B.-H., Levinson, S.E., Sondhi, M.M. (1993b) Recognition of isolated digits using hidden Markov models with continuous mixture densities, AT&T Technical Journal 64(6) 1211-1235

Rabiner, L.R., Juang, B.-H., Levinson, S.E., Sondhi, M.M. (1995) Some properties of continuous hidden Markov model representations, AT&T Technical Journal 64(6) 1251-.

Rabiner, L.R., Wilpon, J.G., Juang, B.H. (1986) A model-based connected digit recognition system using either hidden Markov models or templates, Computer Speech and Language 1, 167-197

Rabiner, L.R., Levinson, S.E. (1985a), A speaker-independent, syntax-directed, connected word recognition system based on Hidden Markov Models and level building, IEEE Transactions on Acoustics, Speech and Signal Processing, 33(3), 561-565.

Rabiner, L.R., Juang, B.-H., Levinson, S.E., Sondhi, M.M. (1993a), On the application of vector quantisation and Hidden Markov Models to speaker-independent, isolated word recognition, Bell System Technical Journal, 62(4), 1075-1104.

Rabiner, L.R., Juang, B.-H., Levinson, S.E., Sondhi, M.M. (1985b), Some Properties of continuous Hidden Markov Model representations, AT&T Technical Journal 64(6), 1251-

Riedi, M. (1997), Modelling segmental duration with multivariate adaptive regression splines, EUROSPEECH-97, 2627-2630.

Russell, M.J., Moore, R.K. (1985), Explicit modelling of state occupancy in Hidden Markov Models for automatic speech recognition, ICASSP-85, 5-8.

Saerens, M. (1993), Hidden Markov Models assuming a continuous-time dynamic emission of acoustic vectors, EUROSPEECH-93, 587-590.

Salomon, A., Espy-Wilson, C. (1999), Automatic detection of manner events based on temporal parameters, EUROSPEECH-99, V6, 2797-2800.

Samouelian, A. (1997), Frame-level phoneme classification using inductive inference, Computer Speech and Language, 11, 161-186.

Santen, J.P.H. van, Olive J.P. (1990), The analysis of contextual effects on segmental duration, Computer Speech and Language, 4, 359-390.

Sato, M. (1990a), A real time learning algorithm for recurrent analog neural networks, Biological Cybernetics, 62, 237-241.

Sato, M. (1990b), A learning algorithm to teach spatiotemporal patterns to recurrent neural networks, Biological Cybernetics, 62, 259-263.

Saudeau, N., Andre-Obrecht, R. (1993), Sound duration modelling and time-variable speaking rate in a speech recognition system, EUROSPEECH-93, 307-310.

Sawai, H. (1991), TDNN-LR Continuous speech recognition system using adaptive incremental TDNN training, ICASSP, 53-56.

Schafer, R.W., Rabiner, L.R. (1970), System for automatic formant analysis of voiced speech, JASA, 47(2), 634-648.

Seneff S. (1996), Comments on “Towards increasing speech recognition error rates” by H. Bourlard, H. Hermansky and N. Morgan, Speech Communication 18, 253-5.

Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., Ekelid, M. (1995), Speech recognition with primarily temporal cues, Science, 270, 303-4.

Sharma, S., Hermansky, H. (1999), Speech recognition from temporal patterns, Proceedings ICPHS99, p1661.

Shen, J., Hwang, W.L. (1999), New temporal features for robust speech recognition with emphasis on microphone variations, Computer Speech and Language, 13, 65-78.

Sundberg, J. (1979), Maximum speed of pitch changes in singers and untrained subjects, Journal of Phonetics, 7, 71-79.

Sussman, H.M., McCaffery, H.A., Matthews, S.A. (1991), An investigation of locus equations as a source of relational invariance for stop place categorization, JASA 90, 1309-1325.

Tank, D.W., Hopfield, J.J. (1987), Neural computation by concentrating information in time, Proceedings of the NAS USA, 84, 1896-1900.

Turk, A.E., Sawusch, J.R. (1996), The processing of duration and intensity cues to prominence, JASA 99(6), 3782-3791.

Utman, J.A. (1998), Effects of local speaking rate context on the perception of voice-onset time in initial stop consonants, JASA, 103(3), 1640-1653.

Umeda, N. (1975), Vowel duration in American English, JASA, 58(2), 434-445.

Unnikrishnan, K.P., Hopfield, J.J., Tank, D.W. (1992), Speaker-independent digit recognition using a neural network with time-delayed connections, Neural Computations, 4, 108-119.

Verhasselt, J.P., Martens, J.-P. (1996), A fast and reliable rate of speech detector, ICSLP’96, 2258-2261.

Vila, L. (1994), A survey on temporal reasoning in artificial intelligence, AICOM 7(1), 4-28.

Wickelgren, W.A. (1969), Context-sensitive coding, associative memory, and serial order in (speech) behavior, Psychological Review, 76, 1-15.
