
© 2008 Harsh Vardhan Sharma

UNIVERSAL ACCESS: EXPERIMENTS IN AUTOMATIC RECOGNITION OF

DYSARTHRIC SPEECH

BY

HARSH VARDHAN SHARMA

B.Tech., Indian Institute of Technology Roorkee, 2006

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

in the Graduate College of the University of Illinois at Urbana-Champaign, 2008

Urbana, Illinois

Adviser:

Associate Professor Mark Hasegawa-Johnson

ABSTRACT

This thesis describes the results of first experiments in small and medium vocabulary dysarthric speech recognition, using the database being recorded by the Statistical Speech Technology Group under the Universal Access initiative. Speaker-dependent, word- and phone-level speech recognizers utilizing the hidden Markov model architecture were developed and tested; the models were trained exclusively on dysarthric speech produced by individuals diagnosed with cerebral palsy. The experiments indicate that (a) different system configurations (word vs. phone based, number of states per HMM, number of Gaussian components per state-specific observation probability density, etc.) give useful performance (in terms of recognition accuracy) for different speakers and different task vocabularies, and (b) for subjects with very low intelligibility, speech recognition outperforms human listeners on recognizing dysarthric speech.


To Mummy and Papa. For everything.


ACKNOWLEDGMENTS

This work has been supported by NSF grant 05-34106 and NIH grant R21-DC008090-A. I thank the subjects for their participation and Bowon Lee for building and maintaining a microphone array. I also thank Mary Sesto and Kate Vanderheiden at the University of Wisconsin-Madison for their assistance in recruiting subjects in the Madison area and for allowing use of the facilities at the Trace Research and Development Center.

As a novice to research in the field of automatic speech recognition, it was a tremendous honor and privilege for me to be associated with Dr. Mark Hasegawa-Johnson. As my adviser he helped me formulate and tackle the problem addressed in this thesis. At every juncture he was readily available with critiques of the ideas that I threw at him. I cannot thank him enough for the belief he showed in me and for educating me in the process of research.

My parents have been and will always be the strongest source of support and inspiration to me. I wish to acknowledge and thank them for never letting anything affect my education.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1  INTRODUCTION

CHAPTER 2  BACKGROUND
  2.1  Prevalence of Dysarthria
  2.2  Motivation
  2.3  Present State of the Art: ASR for Dysarthria

CHAPTER 3  THE UNIVERSAL ACCESS DATABASE
  3.1  Introduction
  3.2  Method
    3.2.1  Equipment
    3.2.2  Recording materials and procedures
  3.3  Description of the Database
    3.3.1  Speaker
    3.3.2  Sample data

CHAPTER 4  EXPERIMENTS
  4.1  Hypotheses
  4.2  Setup
    4.2.1  Data used
    4.2.2  ASR tasks and task vocabularies
    4.2.3  ASR performance measure
    4.2.4  Architectures for HMM-based ASR
    4.2.5  Acoustic feature extraction

CHAPTER 5  RESULTS AND DISCUSSION
  5.1  ASR task scores
  5.2  Analysis of results
    5.2.1  Whole-word ASR
    5.2.2  Monophone ASR
    5.2.3  Triphone ASR
    5.2.4  Monophone ASR vs. triphone ASR
    5.2.5  Whole-word ASR vs. monophone ASR vs. triphone ASR

CHAPTER 6  CONCLUSIONS

REFERENCES


LIST OF TABLES

3.1  Summary of Speaker Information
4.1  Summary of speaker information for the data used in the experiments (in decreasing order of human listener intelligibility rating)
4.2  ASR recognition tasks and corresponding vocabulary sizes
5.1  %WC scores for various ASR recognition tasks: Whole-word architecture
5.2  %WC scores for various ASR recognition tasks: Monophone architecture
5.3  %WC scores for various ASR recognition tasks: Triphone architecture


LIST OF FIGURES

3.1  Recording Equipment: MOTU; the microphone array on top of a laptop computer; and the video camera
3.2  Waveforms and Spectrograms of the word 'Zero' produced by six different speakers: Speaker Code - Intelligibility (%)
5.1  ASR performance (%WC scores) for whole-word architecture
5.2  ASR performance (%WC scores) for monophone architecture
5.3  ASR performance (%WC scores) for triphone architecture
5.4  Comparison of ASR performance (%WC scores) for monophone and triphone architectures
5.5  Comparison of ASR performance (%WC scores) for whole-word, monophone and triphone architectures


CHAPTER 1

INTRODUCTION

Automatic speech recognition (ASR) software with significantly high word recognition accuracy (> 90%) is now widely available to the general public; accuracy of the newest generation of large vocabulary speech recognizers, after adaptation to a user without speech pathology, typically exceeds 95% (Dragon claims 99% accuracy for Dragon NaturallySpeaking version 9 [1]). To cite one example of an ASR success story, the winner of the 2006 National Book Award for fiction, The Echo Maker, was dictated using ASR software [2]. ASR has therefore been reasonably successful at providing a useful human-computer interface, especially for people who find it difficult to type with a keyboard (e.g., patients with carpal tunnel syndrome).

However, many individuals with gross motor impairment, including some people with cerebral palsy and closed head injuries, have not enjoyed the benefit of these advances in speech technology, mainly because their general motor impairment includes a component of dysarthria: reduced speech intelligibility caused by neuromotor impairment. Such people find their participation in society limited by their inability to use a personal computer, and it is these motor impairments that often preclude normal use of a keyboard. For this reason, case studies have shown that some dysarthric users may find it easier to use a small-vocabulary ASR system instead of a keyboard, with code words representing letters and formatting commands, and with acoustic speech recognition carefully adapted to the speech of the individual user (e.g., see [3], [4], [5], [6]).


Faculty and graduate students at the University of Illinois from such diverse departments as Electrical and Computer Engineering, Speech and Hearing Science, and Disability Resources and Education Services are currently engaged in acquiring and experimenting with a database of dysarthric speech, with an aim to develop dysarthric ASR systems and a corresponding human-computer interface prototype (see [7], [8]) for use by students with dysarthria at the University of Illinois. As that effort is a work in progress, the work described in this thesis investigated the performance, in terms of word error rate (WER), of word-based and phone-based audio speech recognition models employing the hidden Markov model (HMM) architecture on a part of this database, for both small-vocabulary and medium-vocabulary speech recognizers designed to be used for unrestricted text entry on a personal computer.

The remainder of this thesis is organized as follows: Chapter 2 discusses the motivation behind developing and testing ASR systems for dysarthric users, and related work by other researchers. Chapter 3 describes the Universal Access database which was used in the experiments carried out for this study. Chapter 4 discusses the setup of the experiments. Chapter 5 presents their results (recognition accuracies for various configurations) and analyses the performance of the speech recognizers developed. Chapter 6 concludes by presenting the major findings of this study.


CHAPTER 2

BACKGROUND

2.1 Prevalence of Dysarthria

Speech and language disorders are caused by various types of congenital or traumatic disorders of the brain, nerves and muscles [9]. Dysarthria is a collective term referring to a group of motor speech disorders resulting from disturbed muscular control of the speech mechanism due to damage to the peripheral or central nervous system. People suffering from one of the dysarthrias exhibit oral communication problems on account of weakness, incoordination or paralysis of the speech musculature. The physiologic characteristics of dysarthria include abnormal or disturbed strength, speed, range, steadiness, tone and accuracy of muscle movements. The communication characteristics include disturbed pitch, loudness, voice quality, resonance, respiratory support for speech, and articulation. Dysarthria may be congenital or acquired, and it may be chronic, improving, progressive, or exacerbating-remitting. The following lists summarize some of the problems in spoken communication resulting from dysarthria; for more complete details regarding the etiology, assessment and treatment of dysarthria, the reader is referred to [10] and [11]:

1. Respiratory problems

• forced inspirations that interrupt speech

• forced expirations that interrupt speech

• audible or breathy inspiration

• grunt at end of inspiration

2. Phonatory disorders

• abrupt and/or extreme variations in pitch resulting in pitch breaks

• lack of variations in pitch resulting in monopitch and lack of inflection

• diplophonia, simultaneous production of both a lower and a higher pitch

• lack of variation in loudness, resulting in monoloudness

• sudden and/or excessive variation in loudness, resulting in too soft or too loud speech

• continuously/intermittently breathy voice

3. Articulation disorders

• imprecise production of consonants

• prolongation of phonemes

• repetition of phonemes

• irregular breakdowns in articulation

• weak pressure consonants

4. Resonance disorders

• hypernasality

• hyponasality

• nasal emission


The specific type of speech impairment is often an indication of the neuromotor deficit causing it; therefore, speech language pathologists have developed a system of dysarthria categories reflecting both genesis and symptoms of the disorder [12], [13], [14]. The most common category of dysarthria among children and young adults is spastic dysarthria [15]. Symptoms of spastic dysarthria vary from talker to talker, but typical symptoms include strained phonation, imprecise placement of the articulators, incomplete consonant closure resulting in sonorant implementation of many stops and fricatives, and reduced voice onset time distinctions between voiced and unvoiced stops.

2.2 Motivation

Dysarthria, then, is generally characterized by imprecise articulation of phonemes and monotonic or excessive variation of loudness and pitch [13], [14]. Although dysarthric speech can differ notably from normal speech due to imprecise articulation, the articulation errors are generally neither random (unlike, for example, in the case of apraxia) nor unpredictable. In fact, previous studies show that most articulation errors in dysarthria can be described in terms of a small number of substitution error types [16], [17], [18]. Kent et al. [16], for example, suggest that most articulation errors in dysarthric speech are primarily errors in the production of one distinctive feature.

When articulation errors occur in a consistent manner and, as a result, are predictable, there is an advantage to using ASR, even for speech that is highly unintelligible to human listeners. Supporting evidence for better performance of ASR than of human listeners is reported in Carlson and Bernstein [19].


2.3 Present State of the Art: ASR for Dysarthria

Several studies have repeatedly demonstrated that adults with dysarthria are capable of using ASR, and that in some cases, human-computer interaction using speech recognition is faster and less tiring than interaction using a keyboard ([3], [4], [5], [6], [20], [21]). The technology used in these studies is commercial off-the-shelf speech recognition technology. Further, these studies have focused on small-vocabulary applications, with vocabulary sizes ranging from 10 to 70 words.

To our knowledge, there is not currently any commercial or open-source product available that would enable people in this user community to enter unrestricted text into a personal computer via automatic speech recognition.


CHAPTER 3

THE UNIVERSAL ACCESS DATABASE

This chapter describes the Universal Access (UA) database (a part of which was used in the experiments described in this thesis) that is currently being recorded and analyzed (in terms of acoustic and phonetic characteristics) by the Statistical Speech Technology Group [22] at the Beckman Institute [23], in collaboration with faculty from the Departments of Speech & Hearing Science, and Disability Resources & Education Services [7]. From this point on, the people mentioned in these references are collectively referred to as the "UA authors" and the database is referred to as the "UA database".

3.1 Introduction

The need for a large amount of training data is a major concern in ASR development, especially for dysarthric speakers, since speaking can be a tiring task. The only dysarthric speech database currently available is the Whitaker Database [24]. This database consists of 30 repetitions of 46 isolated words (10 digits, 26 alphabet letters, and 10 'control' words) and 35 words from the Grandfather passage, produced by each of six individuals with cerebral palsy. The database also includes one normal speaker's production of 15 repetitions of the same materials. The Whitaker Database, as well as the materials used in most prior research on ASR for dysarthria, consists of a small range of words produced by a small number of subjects.


The UA database was constructed with the aim of developing large-vocabulary dysarthric ASR systems which would allow users to enter unlimited text into a computer. The database consists of recordings of a variety of word categories: digits, computer commands, radio alphabet letters, common words selected from the Brown corpus of written English, and uncommon words selected from children's novels digitized by Project Gutenberg (see Section 3.2.2 for more details). The UA authors focus particularly on dysarthria associated with cerebral palsy. In addition to audio data, they have recorded video data, because evidence suggests that using visual information aids speech perception ([25], [26], [27], [28], [29], [30]). The UA authors expect the database to be a resource that can be used to improve ASR systems for people with neuromotor disabilities.

The UA database is also designed to benefit research in the areas of speech and hearing science and linguistics. As [31] points out, clinicians are becoming more aware of the need to base their treatment decisions on objective data. Phonetic and phonological analysis of dysarthric speech will reveal the characteristics of articulation errors, and can help clinicians maximize the efficiency of clinical treatments. For example, if clinicians know what types of errors occur predominantly for their patients, they can develop specific strategies to modify the errors and improve the level of speech intelligibility. The UA authors are in the process of analyzing the database to determine the speaker-specific and the general speaker-independent characteristics of articulation errors in dysarthria, and they encourage other researchers to do the same. Research along these lines would yield a better understanding of the neuromotor limitations on articulatory movement in dysarthria. Eventually, it will enhance our knowledge of the neuromotor mechanisms that underlie human speech production.


3.2 Method

3.2.1 Equipment

An eight-microphone array previously designed by our lab [32] was used for recording. Eight microphones were arranged in an array, with 1.5 inches of spacing between adjacent microphones. Each microphone was 6 mm in diameter. The total dimension of the array was roughly 11 inches × 1.5 inches. Microphone preamplifiers were attached to the array. Preamplified audio channels were sent to a multi-channel FireWire audio interface (MOTU 828mkII) through cables. The MOTU recorded eight audio channels at a sampling rate of 48 kHz. One channel of the MOTU was reserved for recording DTMF tones, which were used to segment words and save each word as a separate wav file. The array was mounted at the top of the laptop computer screen, as shown in Figure 3.1.

In addition to audio recording, one video camera (Canon ZR500) was used to capture the visual features of speech. Audio data recorded at the eighth microphone and DTMF tones were transferred to the video camera. A laptop computer was used to display speech materials and to run AudioDesk (the audio recording software of the MOTU). To improve the quality of video recording, we used studio lighting (Lowel Tota-light T1-10) at 750 W.

3.2.2 Recording materials and procedures

Most subjects have been recruited through personal contacts established with clients of the Rehabilitation Education Center at the University of Illinois at Urbana-Champaign. Three subjects were recruited in the vicinity of Madison, Wisconsin through a collaboration with the Trace Research and Development Center, University of Wisconsin-Madison. Only subjects who reported a diagnosis of cerebral palsy were recruited.

Figure 3.1: Recording Equipment: MOTU; the microphone array on top of a laptop computer; and the video camera.

Recordings (both audio and video) took place while subjects were seated comfortably in front of a laptop computer. Subjects were asked to read an isolated word displayed on a PowerPoint slide on the computer. An experimenter sat beside the subject and advanced the PowerPoint slides after the subject spoke each word. Subjects read three blocks of words. There was always a break between blocks, and subjects were allowed to have a break anytime they wanted one. Each block contained 255 words: 155 words repeated across blocks, and 100 uncommon words that differed across the three blocks. The 155 words included 10 digits ('zero' to 'nine'), 26 radio alphabet letters (e.g., 'Alpha', 'Bravo', 'Charlie'), 19 computer commands (e.g., 'backspace', 'delete', 'enter') and 100 common words (the most common words in the Brown corpus of written English, such as 'it', 'is', 'you'). The uncommon words (e.g., 'naturalization', 'moonshine', 'exploit') were selected from children's novels digitized by Project Gutenberg, using a greedy algorithm that maximized token counts of infrequent biphones. In other words, each speaker produced a total of 765 isolated words, including 455 distinct words: three repetitions of 155 words, which allow researchers to train and test whole-word speech recognizers, and 300 distinct uncommon words chosen to maximize phone-sequence diversity.

After speech was recorded, the 7 channels of speech for each block were saved as 7 separate wav files. DTMF tones recorded in Channel 1 were then used to automatically segment each wav file into single-word files, using MATLAB code.
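The MATLAB segmentation code itself is not reproduced in this thesis. The following is only a minimal sketch of the general idea, assuming the marker channel and a speech channel are available as separate mono wav files and that the DTMF bursts are much louder than the surrounding silence on their own channel; the frame sizes, threshold and file names are illustrative, not the values actually used.

```python
# Sketch of DTMF-marker-based word segmentation (not the project's MATLAB code).
# Assumes mono wav files and that marker tones dominate the marker channel's energy.
import numpy as np
from scipy.io import wavfile

def frame_energy(x, frame_len, hop):
    """Short-time energy of a 1-D signal x."""
    n_frames = max(0, 1 + (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2.0) for i in range(n_frames)])

def segment_by_markers(speech_wav, marker_wav, out_prefix, thresh_ratio=0.1):
    sr_s, speech = wavfile.read(speech_wav)   # channel carrying the talker's speech
    sr_m, marker = wavfile.read(marker_wav)   # channel carrying the DTMF marker tones
    speech = speech.astype(np.float64)
    marker = marker.astype(np.float64)
    frame_len, hop = int(0.025 * sr_m), int(0.010 * sr_m)
    e = frame_energy(marker, frame_len, hop)
    active = e > thresh_ratio * e.max()       # frames covered by a marker tone
    # Runs of inactive frames (the stretches between two markers) become word segments.
    segments, start = [], None
    for i, a in enumerate(active):
        if not a and start is None:
            start = i
        elif a and start is not None:
            segments.append((start * hop, i * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, len(marker)))
    for k, (s, t) in enumerate(segments):
        wavfile.write(f"{out_prefix}_{k:03d}.wav", sr_s, speech[s:t].astype(np.int16))
```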

3.3 Description of the Database

3.3.1 Speaker

Table 3.1 summarizes the characteristics of the 18 subjects that have been recorded so far. The letter M or F in the speaker code specifies the participant's gender. Kim et al. [33] describe in detail the acquisition of, and intelligibility assessment on, this database. Two hundred distinct words were selected from the recording of the second block: 10 digits, 25 radio alphabet letters, 19 computer commands and 73 words randomly selected from each of the CW and UW categories. Five naive listeners were recruited for each speaker and were instructed to provide orthographic transcriptions of each word that they thought the speaker said. The percentage of correct responses was then averaged across the five listeners to obtain each speaker's intelligibility.

Table 3.1: Summary of Speaker Information.

Speaker  Age  Speech Intelligibility (%)  Dysarthria Diagnosis
F02      30   low (29%)                   Spastic
F03      51   very low (6%)               Spastic
F04      18   currently being obtained    Athetoid
F05      22   currently being obtained    Spastic
M01      >18  very low (10%)              Spastic
M04      >18  very low (2%)               Spastic
M05      21   mid (58%)                   Spastic
M06      18   low (39%)                   Spastic
M07      58   low (28%)                   Spastic
M08      28   currently being obtained    Spastic
M09      18   high (86%)                  Spastic
M10      21   currently being obtained    Mixed
M11      48   currently being obtained    Athetoid
M12      19   currently being obtained    Mixed
M13      44   currently being obtained    Spastic

3.3.2 Sample data

Figure 3.2 gives examples of waveforms and spectrograms for the word 'zero' produced by each of six speakers. Their intelligibility rates are displayed next to the speaker code. Visual inspection of the waveforms and spectrograms suggests that different speakers (or different subject groups in terms of intelligibility categories) may exhibit different articulation patterns. For the fricative /z/, the fricative noise is clearly present for Speakers M05, M06 and F02, and less obvious for Speaker M07. For Speakers F03 and M04, /z/ is deleted. In addition, lowering of the third formant that characterizes /r/ is manifested differently depending upon the speaker's intelligibility: considerable lowering of the third formant in M05 and M06, slight lowering in F02 and M07, and almost none for speakers in the 'very low' category (F03, M04). Finally, the formant patterns of the vowels /i/ and /o/ also suggest a correlation between articulation patterns and intelligibility. The /i/ and /o/ vowels are well distinguished for the speaker in the 'mid' category and some speakers in the 'low' category, as evidenced by a large spacing between the first and second formants for /i/ vs. a much smaller spacing for /o/. As intelligibility decreases, this distinction becomes weaker (e.g., M04's vowel formants rarely change across these two vowels). More detailed phonetic/phonological analysis is in progress at the time of writing this thesis.


Figure 3.2: Waveforms and Spectrograms of the word 'Zero' produced by six different speakers: Speaker Code - Intelligibility (%). (a) M05 - mid (58%); (b) M06 - low (39%); (c) F02 - low (29%); (d) M07 - low (28%); (e) F03 - very low (6%); (f) M04 - very low (2%).


CHAPTER 4

EXPERIMENTS

4.1 Hypotheses

The aim of the study described in this thesis is to develop and test audio-only speaker-dependent HMM-based speech recognizers for subjects recorded under the Universal Access initiative, utilizing architectures that model different levels of an utterance (words and phones). These architectures would be tested on small (< 100 words) and medium (between 100 and 1000 words) vocabulary tasks in which the utterances are a selected subset of the speech material recorded for the UA database.

Since the communication characteristics of dysarthria are very speaker-specific, it is hypothesized that different recognition architectures will perform well (the measure of performance is discussed below) for different subjects. Secondly, as mentioned in Chapter 1, adults with dysarthria are capable of utilizing an ASR-based human-computer interface for text entry. Since those studies were mainly based on commercial off-the-shelf ASR technology, it would be interesting to see whether or not ASR systems trained on dysarthric speech possess the same usability. The third question of interest is whether ASR can "listen" better than humans (especially for vocabularies that are not small), i.e., whether or not speech recognition accuracy is greater than the human-listener intelligibility ratings (see Table 3.1) for the recorded subjects.


4.2 Setup

4.2.1 Data used

Table 4.1 lists those speakers from Table 3.1 whose speech materials from the UA database were used in the ASR experiments. These were the speakers whose data was available for ASR development and testing when these experiments were conducted.

Table 4.1: Summary of speaker information for the data used in the experiments (in decreasing order of human listener intelligibility rating).

Speaker  Age  Speech Intelligibility (%)  Dysarthria Diagnosis
M09      18   high (86%)                  Spastic
M05      21   mid (58%)                   Spastic
M06      18   low (39%)                   Spastic
F02      30   low (29%)                   Spastic
M07      58   low (28%)                   Spastic
F03      51   very low (6%)               Spastic
M04      >18  very low (2%)               Spastic

4.2.2 ASR tasks and task vocabularies

Ten recognition tasks were set up using the recorded data, as summarized in Table 4.2. The reader may refer to Section 3.2.2 for vocabulary details of the words recorded.

Table 4.2: ASR recognition tasks and corresponding vocabulary sizes.

Task Descriptor  Vocabulary              Vocabulary Size
T01              digits (D)              10
T02              computer commands (C)   19
T03              radio alphabet (L)      26
T04              D+L+C                   55
T05              common words (CW)       100
T06              uncommon words (UW)     100
T07              L+C+CW                  145
T08              D+L+C+CW                155
T09              L+C+CW+UW               245
T10              D+L+C+CW+UW             255

4.2.3 ASR performance measure

The measure used for assessing the performance of the developed recognizers is the fraction of task-vocabulary words correctly recognized (in percent), defined in Equation (4.1). This can be compared directly to the human listener intelligibility ratings mentioned in Table 4.1, since the latter were also obtained as the fraction of task-vocabulary words correctly heard by the human listeners [33].

%WC = (# words correctly recognized / # words in test set) × 100    (4.1)
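As a concrete illustration of Equation (4.1), the short sketch below scores a list of recognized words against the corresponding reference words for an isolated-word test set; the function and variable names are illustrative, not part of any toolkit.

```python
def percent_words_correct(references, hypotheses):
    """%WC for an isolated-word task: each test utterance contributes one
    reference word and one recognized word (Equation (4.1))."""
    assert len(references) == len(hypotheses)
    n_correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * n_correct / len(references)

# Example: 3 of 4 test words recognized correctly gives a %WC of 75.0
print(percent_words_correct(["zero", "alpha", "delete", "enter"],
                            ["zero", "alpha", "delete", "under"]))
```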

4.2.4 Architectures for HMM-based ASR

For the theory behind HMM-based speech recognizers, the reader is referred to the article by Rabiner [34], which presents an excellent introduction to the HMM formalism and its application to developing speech recognition systems. In this study, each utterance (which consisted of a single word) of a particular subject was modeled as "silence–word–silence". HMM-based speech recognizers employing the following three architectures were developed and tested:

• whole-word: Each utterance was modeled as an HMM. For a particular task, the number of HMMs was the number of distinct words in the task vocabulary, plus one (silence was also modeled as an HMM).

• monophone: Each utterance was modeled as a sequence of phones (these phone sequences were taken from the ISLE web dictionary [22]). The total number of distinct phones was 40; therefore, there were a total of 41 HMMs, including a silence HMM.

• triphone: Each utterance was modeled as a sequence of triphones ("phone-phone-phone"). This was done to model the effects of context on the acoustic properties of a phone. The number of HMMs in this case equalled one (for silence, which is not modeled as a triphone) plus the number of distinct word-internal triphones in the task vocabulary, plus the number of biphones ("phone-phone") that resulted from blocking context at word boundaries. This number is much less than 40^3 (the number of all possible triphones from 40 phones); such a large number of phone combinations is not only unobserved in day-to-day language use but is also impractical to model (on account of both computational complexity and the finite amount of training data). A sketch of this word-to-triphone expansion is given after this list.
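For concreteness, the sketch below shows how a single word's pronunciation could be expanded into monophone and word-internal triphone model names using the common "left-centre+right" naming convention. The toy pronunciation dictionary is an assumption for illustration only (the thesis used pronunciations from the ISLE web dictionary), and the exact naming scheme used in the experiments may differ.

```python
# Illustrative expansion of a word into monophone and word-internal triphone
# model names ("l-c+r" notation). The pronunciations here are toy examples,
# not entries from the dictionary actually used in the thesis.
TOY_DICT = {"zero": ["z", "ih", "r", "ow"], "enter": ["eh", "n", "t", "er"]}

def monophone_models(word):
    return TOY_DICT[word]

def triphone_models(word):
    phones = TOY_DICT[word]
    models = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if left and right:
            models.append(f"{left}-{p}+{right}")   # full word-internal triphone
        elif right:
            models.append(f"{p}+{right}")          # word-initial biphone (blocked left context)
        elif left:
            models.append(f"{left}-{p}")           # word-final biphone (blocked right context)
        else:
            models.append(p)                       # one-phone word
    return models

print(monophone_models("zero"))   # ['z', 'ih', 'r', 'ow']
print(triphone_models("zero"))    # ['z+ih', 'z-ih+r', 'ih-r+ow', 'r-ow']
```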

Blocks 1 and 3 were used for development and block 2 for testing, for each of the seven speaker-dependent recognizers. Referring to Table 4.2, word-level recognizers were built for tasks T01-T05, T07 and T08. The other tasks had uncommon words as part of their vocabularies, and since each block's uncommon words were unique to it, they could not be modeled at the word level. For tasks T06-T10, phone-level (both monophone and triphone) recognizers were built. Hence, tasks T07 and T08 are the tasks for which all three architectures were tested.

The ASR systems were built using Cambridge University's HTK modeling toolkit [35], following the procedure outlined in the toolkit manual [36].

4.2.5 Acoustic feature extraction

The features extracted from the speech waveform comprised 12 perceptual linear prediction coefficients [37] for 25-ms Hamming-windowed segments obtained every 10 ms, plus the energy of the windowed segment. Velocity and acceleration components were also calculated for this 13-dimensional feature, which finally resulted in a 39-dimensional acoustic feature vector.
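The step from 13 static coefficients (12 PLP coefficients plus energy) to the 39-dimensional vector can be sketched as follows. The sketch assumes the standard regression formula with a window of ±2 frames for both the velocity and acceleration components; the exact window settings used in the experiments are not specified here.

```python
# Sketch: append velocity (delta) and acceleration (delta-delta) coefficients to a
# matrix of static features, turning 13 columns into 39. Uses the common regression
# formula with a +/-2 frame window; the toolkit's exact settings may differ.
import numpy as np

def deltas(feats, window=2):
    """feats: (n_frames, n_dims). Returns regression-based delta features."""
    n = len(feats)
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    num = sum(theta * (padded[window + theta:n + window + theta]
                       - padded[window - theta:n + window - theta])
              for theta in range(1, window + 1))
    return num / (2.0 * sum(theta ** 2 for theta in range(1, window + 1)))

def add_dynamics(static_feats):
    vel = deltas(static_feats)          # velocity coefficients
    acc = deltas(vel)                   # acceleration coefficients
    return np.hstack([static_feats, vel, acc])

static = np.random.randn(200, 13)       # placeholder for 12 PLP coefficients + energy
print(add_dynamics(static).shape)       # (200, 39)
```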


CHAPTER 5

RESULTS AND DISCUSSION

5.1 ASR task scores

For each architecture-specific task, the number of Gaussian components in the state-specific observation probability densities was increased iteratively in powers of 2, starting from 1 and stopping when either (a) the number of components had risen to 32, or (b) the %WC score had decreased on two consecutive iterations. The number of states per HMM was fixed at 3 for the monophone and triphone architectures, but was varied from 3 through 11 for the whole-word recognizers.
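The component-doubling search described above can be summarized by the following sketch; `train_and_score` is a hypothetical placeholder that retrains the HMM set with a given number of mixture components and returns its %WC score on the test block, and is not part of any toolkit.

```python
# Sketch of the model-selection loop: double the number of Gaussian components
# per state, starting from 1, and stop at 32 components or after the %WC score
# has dropped on two consecutive steps.
def search_num_components(train_and_score, max_components=32):
    best_n, best_score = None, float("-inf")
    prev_score, consecutive_drops = float("-inf"), 0
    n = 1
    while True:
        score = train_and_score(n)                 # hypothetical: retrain HMMs, return %WC
        if score > best_score:
            best_n, best_score = n, score
        consecutive_drops = consecutive_drops + 1 if score < prev_score else 0
        prev_score = score
        if n >= max_components or consecutive_drops >= 2:
            break
        n *= 2
    return best_n, best_score
```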

Tables 5.1, 5.2 and 5.3 list the speaker-dependent %WC scores for the whole-word, monophone and triphone architectures respectively on the architecture-specific tasks. The speakers are listed in decreasing order of intelligibility rating. The scores reported are for a particular HMM configuration (in terms of number of states per HMM and number of Gaussian probability density components), because the stopping stage of the above-mentioned iterative process is likely to be different for different task and speaker combinations. Hence, the consideration in selecting the HMM configuration for reporting scores was that it should be the same across all speakers (and if possible across all tasks, especially for whole-word systems where the number of states per HMM was also varied), and that other configurations should have performed as well as or worse than it.


Table 5.1: %WC scores for various ASR recognition tasks: Whole-word architecture.

Speaker  T01    T02    T03    T04    T05    T07    T08
M09      84.29  100    97.25  89.87  63.29  69.26  65.9
M05      90     78.95  77.47  63.12  56.14  54.68  52.9
M06      92.86  81.95  77.47  72.21  52.86  53     51.24
F02      94.29  83.46  69.78  72.99  64.43  58.72  57.24
M07      100    86.47  85.71  80.78  56.14  58.92  58.89
F03      74.29  63.91  41.76  40.26  49.43  36.35  33.82
M04      46     23.16  19.23  14.18  6.4    7.03   6.84

Table 5.2: %WC scores for various ASR recognition tasks: Monophone architecture.

Speaker  T06    T07    T08    T09    T10
M09      31.14  47.49  46.82  46.53  50.36
M05      29.43  48.57  50.05  32.89  39.1
M06      14.57  37.54  36.31  26.82  30.48
F02      17.43  42.76  42.76  26.82  31.2
M07      15.57  44.83  43.78  28.98  34.01
F03      2.14   25.22  22.4   6.94   8.8
M04      1.2    6.07   5.94   2.37   2.59

Table 5.3: %WC scores for various ASR recognition tasks: Triphone architecture.

Speaker  T06    T07    T08    T09    T10
M09      25.29  63.65  63.32  46.47  52.04
M05      13.57  54.48  53.55  30.44  35.52
M06      5.43   58.82  52.53  29.8   34.01
F02      3.43   54.48  56.68  27.81  35.06
M07      7.86   57.14  60.65  32.48  43.87
F03      1      40     38.89  9.97   12.61
M04      1.8    4.83   5.81   1.96   2.82

For the whole-word systems, results are for HMMs with 2 Gaussian components per probability density and 6 states per HMM (except for task T01, for which results are presented for 5-state HMMs). For the monophone and triphone systems, results are for HMMs with 16 and 2 Gaussian components per probability density, respectively. The significant difference in the number of Gaussian components for the monophone and triphone systems is not unexpected: for the same number of Gaussian components, the triphone system will require much more speech data for reliable training of the HMMs, since it will have a much larger number of parameters to learn from the data. In the case of whole-word recognizers, having too few or too many states per HMM will deteriorate performance because (1) for the former, the number of parameters available to model speech variability will be quite small, and (2) for the latter, there will be more parameters than can be reliably trained from the given quantum of speech data.

5.2 Analysis of results

In this section, the reported scores are analysed to determine, first, the overall recognition accuracy of each architecture; second, group differences in the scores due to severity of dysarthria (a measure of which is the human listener intelligibility rating for the subjects studied); and third, the dependence of the scores on the size of the task vocabulary.

5.2.1 Whole-word ASR

Figure 5.1 is a graphical representation of the whole-word ASR scores reported in Table 5.1. For all speakers, recognition accuracy deteriorates with increase in vocabulary size. However, for speakers with low and very low intelligibility (all except M09 and M05), recognition accuracy is higher than their respective intelligibility ratings (the magnitude of the difference is larger for small vocabularies than for medium-sized ones). For M09 and M05, the recognition accuracy is higher than their intelligibility ratings for small vocabularies (tasks T01-T04) but not for medium-sized ones (tasks T05, T07, T08).

Figure 5.1: ASR performance (%WC scores) for whole-word architecture. (a) %WC versus speaker (from left: decreasing order of intelligibility); (b) %WC versus task (from left: increasing size of vocabulary).

5.2.2 Monophone ASR

Figure 5.2 is a graphical representation of the monophone ASR scores reported in Table 5.2. For all speakers, the accuracy on task T06 (uncommon words only) was always worse than their intelligibility rating, indicating that an HMM system trained solely on such a vocabulary is not good enough for use in a human-computer interface for spoken text entry. Secondly, for all speakers, the recognition scores for tasks T07 and T08 were respectively higher than those on T09 and T10 (which are T07 and T08 with the uncommon words added). This again shows that when the task vocabulary contained uncommon words, ASR performance went down. However, comparing the T09 and T10 scores, it appears that incorporating digits into the vocabulary improves the %WC score by 4-7% absolute, for all speakers except F03 and M04 (very low intelligibility); for these two speakers, there is only a slight improvement. Finally, for low and very low intelligibility speakers other than M06 and F02, the monophone system recognizes their speech as well as the human listeners on task T09.

5.2.3 Triphone ASR

Figure 5.3 is a graphical representation of the triphone ASR scores reported in Table 5.3. For all subjects, the variation in scores is similar to that in the monophone case: performance on T06 is worse than the intelligibility rating, and the scores for T07 and T08 are respectively higher than those on T09 and T10, indicating performance deterioration upon adding uncommon words.

As with the monophone recognizers, incorporating digits into the vocabulary improves the %WC score (comparing the T09 and T10 scores) by 5-12% absolute, for all subjects except F03 and M04 (for these two, there is again only a slight improvement). On task T10, the triphone architecture gives a higher score than the intelligibility rating for all subjects with low and very low intelligibility, except M06. In fact, for these subjects, triphone ASR has also been able to achieve a performance at par with or better than the human listener rating on task T09.

Figure 5.2: ASR performance (%WC scores) for monophone architecture. (a) %WC versus speaker (from left: decreasing order of intelligibility); (b) %WC versus task (from left: increasing size of vocabulary).

Figure 5.3: ASR performance (%WC scores) for triphone architecture. (a) %WC versus speaker (from left: decreasing order of intelligibility); (b) %WC versus task (from left: increasing size of vocabulary).

5.2.4 Monophone ASR vs. triphone ASR

Figure 5.4 compares the scores (presented in Tables 5.2 and 5.3) on tasks T06-T10 for the monophone (mP) and triphone (tP) recognizers. For vocabularies not containing the uncommon words (T07, T08), triphone systems outperform monophone systems by 3-17% absolute, for all subjects except M04 (2% intelligibility). For task T06 (uncommon words only), the monophone systems have better %WC scores (by 6-16% absolute) except for subjects F03 and M04; for these very low intelligibility subjects, the performance of the two architectures is comparable. Finally, for tasks T09 and T10, monophone and triphone systems have similar performance in terms of the %WC score.

5.2.5 Whole-word ASR vs. monophone ASR vs. triphone ASR

Figure 5.5 compares the scores (presented in Tables 5.1, 5.2 and 5.3) on tasks T07 and T08 for the whole-word (wW), monophone (mP) and triphone (tP) recognizers. For both tasks and all subjects except M04, the monophone systems have the worst performance among the three architectures. For M04, the whole-word systems give the best performance on both tasks; for all other subjects, either the whole-word (M09, M05, F02) or the triphone (M06, F03) system performs best on both tasks (except for M07, where the whole-word score is higher for task T07 and the triphone score is higher for task T08).

Figure 5.4: Comparison of ASR performance (%WC scores) for monophone and triphone architectures.

Figure 5.5: Comparison of ASR performance (%WC scores) for whole-word, monophone and triphone architectures.

CHAPTER 6

CONCLUSIONS

We see that ASR systems trained specifically on a small amount of dysarthric speech (two training tokens per utterance) have demonstrated recognition accuracies comparable to human listeners. Secondly, for some medium-sized vocabularies, the whole-word systems performed as well as the triphone systems, indicating that simpler architectures are as capable as more complex ones for dysarthric speech recognition. These comparable performances permit the designer to choose between the two architectures: the triphone system is more flexible and scalable; on the other hand, the whole-word system is faster to train.

However, the most interesting outcome of these experiments is that, for subjects with very low intelligibility, ASR has significantly outperformed human listeners as far as recognizing their speech on small vocabularies is concerned. For vocabularies of medium size, it has been as successful as a human listener in understanding dysarthric speech. The listeners in the intelligibility experiments had two important disadvantages: (a) they did not know the speaker, and (b) they did not know the vocabulary. Many of these talkers are able to communicate easily with human listeners who (1) know them, and (2) know from situational context what words the subject might be trying to say. These results show that ASR with knowledge of the talker and vocabulary outperforms human listeners without such knowledge, and that in many cases, the resulting %WC score approaches ranges that may be useful for human-computer interaction.


REFERENCES

[1] D. Durham, "Key steps to high speech recognition accuracy," Sep. 2007. [Online]. Available: http://www.emicrophones.com.

[2] A. Michod, "The Believer – Interview with Richard Powers," Feb. 2007. [Online]. Available: http://www.believermag.com/issues/200702/?read=interviewpowers.

[3] H. P. Chang, "Speech input for dysarthric users," Journal of the Acoustical Society of America, vol. 94, no. 3, p. 1782, Sep. 1993.

[4] H. P. Chang, "Speech recognition for dysarthric computer users," presented at the IVth International Clinical Phonetics and Linguistics Association Symposium, New Orleans, LA, 1994.

[5] P. C. Doyle et al., "Dysarthric speech: A comparison of computerized speech recognition and listener intelligibility," Journal of Rehabilitation Research and Development, vol. 34, pp. 309–316, 1997.

[6] K. Hux, J. Rankin-Erickson, N. Manasse, and E. Lauritzen, "Accuracy of three speech recognition systems: Case study of dysarthric speech," AAC: Augmentative and Alternative Communication, vol. 16, no. 3, pp. 186–196, 2000.

[7] M. Hasegawa-Johnson, J. Gunderson, A. Perlman, and T. S. Huang, "Audiovisual phonologic-feature-based recognition of dysarthric speech," 2006. [Online]. Available: http://www.isle.uiuc.edu/research/phonologyofdysarthria.html.

[8] M. Hasegawa-Johnson, "Universal access automatic speech recognition project," 2006. [Online]. Available: http://cita.disability.uiuc.edu/research/asr/overview.php.

[9] D. Caplan, Language: Structure, Processing and Disorders. Cambridge, MA: MIT Press, 1992.

[10] M. N. Hegde, Hegde's PocketGuide to Assessment in Speech-Language Pathology, 3rd ed. Florence, KY: CENGAGE Delmar Learning, 2007.

[11] M. N. Hegde, Hegde's PocketGuide to Treatment in Speech-Language Pathology, 3rd ed. Florence, KY: CENGAGE Delmar Learning, 2007.

[12] F. Darley and A. Aronson, "Differential diagnostic patterns of dysarthria," Journal of Speech and Hearing Research, vol. 12, pp. 246–269, 1969.

[13] F. Darley, A. Aronson, and J. Brown, Motor Speech Disorders. Philadelphia, PA: W. B. Saunders Inc., 1975.

[14] J. Duffy, Motor Speech Disorders. Boston, MA: Elsevier Mosby, 2005.

[15] R. J. Love, Childhood Motor Speech Disability. Boston, MA: Allyn and Bacon, 1992.

[16] R. D. Kent, G. Weismer, J. F. Kent, J. K. Vorperian, and J. R. Duffy, "Acoustic studies of dysarthric speech: Methods, progress and potential," Journal of Communication Disorders, vol. 32, pp. 141–186, 1999.

[17] L. J. Platt, G. Andrews, M. Young, and P. T. Quinn, "Dysarthria of adult cerebral palsy: I. Intelligibility and articulatory impairment," Journal of Speech and Hearing Research, vol. 23, no. 1, pp. 28–40, 1980.

[18] L. J. Platt, G. Andrews, and P. M. Howie, "Dysarthria of adult cerebral palsy: II. Phonemic analysis of articulation errors," Journal of Speech and Hearing Research, vol. 23, no. 1, pp. 41–55, 1980.

[19] G. S. Carlson and J. Bernstein, "A voice-input communication aid," National Institute of Neurological and Communicative Disorders and Stroke, Menlo Park, CA, Final Report of contract N01-NS-5-2394, April 1988.

[20] H. P. Chang, "Speech input for dysarthric computer users," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1995.

[21] A.-L. Kotler and N. Thomas-Stonell, "Effects of speech training on the accuracy of speech recognition for an individual with speech impairment," AAC: Augmentative and Alternative Communication, vol. 13, no. 2, pp. 71–80, 1997.

[22] "Statistical Speech Technology Group." [Online]. Available: http://www.isle.uiuc.edu/sst.html.

[23] "Beckman Institute, University of Illinois at Urbana-Champaign." [Online]. Available: http://www.beckman.uiuc.edu.

[24] J. R. Deller, M. S. Liu, L. J. Ferrier, and P. Robichaud, "The Whitaker database of dysarthric (cerebral palsy) speech," Journal of the Acoustical Society of America, vol. 93, no. 6, pp. 3516–3518, Sep. 1993.

[25] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in Speechreading by Humans and Machines: Models, Systems and Applications, D. G. Stork and M. E. Hennecke, Eds. New York, NY: Springer, 1996, pp. 461–471.

[26] M. T. Chan, Y. Zhang, and T. S. Huang, "Real-time lip tracking and audio-video continuous speech recognition," presented at the IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, Dec. 1998.

[27] M. E. Hennecke, D. G. Stork, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines: Models, Systems and Applications, D. G. Stork and M. E. Hennecke, Eds. New York, NY: Springer, 1996, pp. 331–350.

[28] Y. Zhang, "Information fusion for robust audio-visual speech recognition," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2000.

[29] S. Chu and T. S. Huang, "Bimodal speech recognition using coupled hidden Markov models," presented at Interspeech, Beijing, China, Oct. 2000.

[30] S. Chu and T. S. Huang, "Multi-modal sensory fusion with application to audio-visual speech recognition," presented at EUROSPEECH, Aalborg, Denmark, Sep. 2001.

[31] M. McHenry, "Review of evidence supporting dysarthria treatment in adults," presented at the American Speech-Language-Hearing Association Convention, Boston, MA, Nov. 2007.

[32] B. Lee et al., "AVICAR: Audio-visual speech corpus in a car environment," presented at Interspeech, Jeju Island, Korea, Oct. 2004.

[33] H. Kim et al., "Dysarthric speech database for universal access research," to be presented at Interspeech, Brisbane, Australia, Sep. 2008.

[34] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[35] Cambridge University Engineering Department, "HTK speech recognition toolkit," Dec. 2006. [Online]. Available: http://htk.eng.cam.ac.uk/.

[36] S. Young et al., "The HTK book," Dec. 2006. [Online]. Available: http://htk.eng.cam.ac.uk/prot-docs/htkbook.shtml.

[37] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
