
DOCUMENT RESUME

ED 366 020    CS 508 427

AUTHOR    Fowler, Carol A., Ed.
TITLE    Speech Research Status Report, January-March 1993.
INSTITUTION    Haskins Labs., New Haven, Conn.
REPORT NO    SR-113
PUB DATE    93
NOTE    214p.; For the previous report, see ED 359 575.
PUB TYPE    Collected Works - General (020); Reports - Research/Technical (143)
EDRS PRICE    MF01/PC09 Plus Postage.
DESCRIPTORS    Adults; *Articulation (Speech); *Beginning Reading; Chinese; Communication Research; Elementary Secondary Education; French; Higher Education; Infants; Language Acquisition; Language Research; Music; Reading Writing Relationship; *Speech Communication; Spelling; Thai
IDENTIFIERS    Speech Research

ABSTRACT
One of a series of quarterly reports, this publication contains 14 articles which report the status and progress of studies on the nature of speech, instruments for its investigation, and practical applications. Articles in the publication are: "Some Assumptions about Speech and How They Changed" (Alvin M. Liberman); "On the Intonation of Sinusoidal Sentences: Contour and Pitch Height" (Robert E. Remez and Philip E. Rubin); "The Acquisition of Prosody: Evidence from French- and English-Learning Infants" (Andrea G. Levitt); "Dynamics and Articulatory Phonology" (Catherine P. Browman and Louis Goldstein); "Some Organizational Characteristics of Speech Movement Control" (Vincent L. Gracco); "The Quasi-Steady Approximation in Speech Production" (Richard S. McGowan); "Implementing a Genetic Algorithm to Recover Task-Dynamic Parameters of an Articulatory Speech Synthesizer" (Richard S. McGowan); "An MRI-Based Study of Pharyngeal Volume Contrasts in Akan" (Mark K. Tiede); "Thai" (M. R. Kalaya Tingsabadh and Arthur S. Abramson); "On the Relations between Learning to Spell and Learning to Read" (Donald Shankweiler and Eric Lundquist); "Word Superiority in Chinese" (Ignatius G. Mattingly and Yi Xu); "Prelexical and Postlexical Strategies in Reading: Evidence from a Deep and a Shallow Orthography" (Ram Frost); "Relational Invariance of Expressive Microstructure across Global Tempo Changes in Music Performance: An Exploratory Study" (Bruno H. Repp); and "A Review of 'Psycholinguistic Implications for Linguistic Relativity: A Case Study of Chinese' by Rumjahn Hoosain" (Yi Xu). (RS)

Reproductions supplied by EDRS are the best that can be made from the original document.


Haskins Laboratories

Status Report on

Speech Research

U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement

EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

SR-113 JANUARY-MARCH 1993


Haskins Laboratories

Status Report on

Speech Research

SR-113 JANUARY-MARCH 1993    NEW HAVEN, CONNECTICUT


Editor
Carol A. Fowler

Production
Yvonne Manning
Fawn Zefang Wang

This publication reports the status and progress of studies on the nature of speech, instrumentation for its investigation, and practical applications.

Distribution Statement

Distribution of this document is unlimited. The information in this document is available to the general public. Haskins Laboratories distributes it primarily for library use. Copies are available from the National Technical Information Service or the ERIC Document Reproduction Service. See the Appendix for order numbers of previous Status Reports.

Correspondence concerning this report should be addressed to the Editor at the address below:

Haskins Laboratories
270 Crown Street
New Haven, Connecticut 06511-6695

Phone: (203) 865-6163
FAX: (203) 865-8963
Bitnet: HASKINS@YALEHASK
Internet: HASKINS%[email protected]

This Report was reproduced on recycled paper


Acknowledgment

The research reported here was made possible in part by support from the following sources:

National Institute of Child Health and Human Development
Grant HD-01994
Grant HD-21888

National Institutes of Health
Biomedical Research Support Grant RR-05596

National Science Foundation
Grant DBS-9112198

National Institute on Deafness and Other Communication Disorders
Grant DC 00121
Grant DC 00183
Grant DC 00403
Grant DC 00016
Grant DC 00594
Grant DC 00865
Grant DC 01147
Grant DC 00044
Grant DC 00825
Grant DC 01247


Investigators

Arthur Abramson*
Peter J. Alfonso*
Eric Bateson*
Fredericka Bell-Berti*
Catherine T. Best*
Susan Brady*
Catherine P. Browman
Claudia Carello
Franklin S. Cooper*
Stephen Crain*
Lois G. Dreyer*
Alice Faber
Laurie B. Feldman*
Janet Fodor*
Anne Fowler*
Carol A. Fowler*
Louis Goldstein*
Carol Gracco
Vincent Gracco
Katherine S. Harris*
John Hogden
Leonard Katz*
Rena Arens Krakow*
Andrea G. Levitt*
Alvin M. Liberman*
Diane Lillo-Martin*
Leigh Lisker*
Anders Lofqvist
Ignatius G. Mattingly*
Nancy S. McGarr*
Richard S. McGowan
Patrick W. Nye
Kiyoshi Oshima†
Kenneth Pugh*
Lawrence J. Raphael*
Bruno H. Repp
Hyla Rubin*
Philip E. Rubin
Elliot Saltzman
Donald Shankweiler*
Jeffrey Shaw
Michael Studdert-Kennedy*
Michael T. Turvey*
Douglas Whalen

*Part-time
†Visiting from University of Tokyo, Japan

Technical Staff

Michael D'Angelo
Vincent Gulisano
Donald Halley
Yvonne Manning
William P. Scully
Fawn Zefang Wang
Edward R. Wiley

Administrative Staff

Philip Chagnon
Alice Dadourian
Betty J. DeLise
Lisa Fresa
Joan Martinez

Students*

Melanie Campbell
Sandra Chiang
Margaret Hall Dunn
Terri Erwin
Joseph Kalinowski
Laura Koenig
Betty Kollia
Simon Levy
Salvatore Miranda
Maria Mody
Weijia Ni
Mira Peter
Christine Romano
Joaquin Romero
Dorothy Ross
Arlyne Russo
Michelle Sander
Sonya Sheffert
Caroline Smith
Brenda Stone
Mark Tiede
Qi Wang
Yi Xu
Elizabeth Zsiga


Contents

Some Assumptions about Speech and How They Changed
Alvin M. Liberman 1

On the Intonation of Sinusoidal Sentences: Contour and Pitch Height
Robert E. Remez and Philip E. Rubin 33

The Acquisition of Prosody: Evidence from French- and English-Learning Infants
Andrea G. Levitt 41

Dynamics and Articulatory Phonology
Catherine P. Browman and Louis Goldstein 51

Some Organizational Characteristics of Speech Movement Control
Vincent L. Gracco 63

The Quasi-Steady Approximation in Speech Production
Richard S. McGowan 91

Implementing a Genetic Algorithm to Recover Task-Dynamic Parameters of an Articulatory Speech Synthesizer
Richard S. McGowan 95

An MRI-Based Study of Pharyngeal Volume Contrasts in Akan
Mark K. Tiede 107

Thai
M. R. Kalaya Tingsabadh and Arthur S. Abramson 131

On the Relations between Learning to Spell and Learning to Read
Donald Shankweiler and Eric Lundquist 135

Word Superiority in Chinese
Ignatius G. Mattingly and Yi Xu 145

Prelexical and Postlexical Strategies in Reading: Evidence from a Deep and a Shallow Orthography
Ram Frost 153

Relational Invariance of Expressive Microstructure across Global Tempo Changes in Music Performance: An Exploratory Study
Bruno H. Repp 171

A Review of Psycholinguistic Implications for Linguistic Relativity: A Case Study of Chinese by Rumjahn Hoosain
Yi Xu 197

Appendix 209


Haskins Laboratories

Status Report on

Speech Research


Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 1-32

Some Assumptions about Speech and How They Changed*

Alvin M. Liberman

My aim is to provide a brief account of the research on speech at Haskins Laboratories as seen from my point of view. In pursuit of that aim, I will scant most experiments and their outcomes, putting the emphasis, rather, on the changing assumptions that, as I understood them, guided the research or were guided by it. I will be particularly concerned to describe the development of those assumptions, hewing as closely as I can to the order in which they were made, and then either abandoned or extended.

My account is necessarily inaccurate, not just because I must rely in part on memory about my state of mind almost 50 years ago when, in the earliest stages of our work, we did not always make our underlying assumptions explicit, but, even more, for reasons that put a proper account beyond the reach of any recall, however true. The chief difficulty is in the relation between the theoretical assumptions and the research they were supposed to rationalize. Thus, it happened only once, as I see it now, that the assumptions changed promptly in response to the results of a particular experiment; in all other cases, they lagged behind, as reinterpretations of data that had accumulated over a considerable period of time. Moreover, theory was influenced, not only by our empirical findings, but equally, if even more belatedly, by general considerations of plausibility that arose when, stimulated by colleagues and by the gradual broadening of my own outlook, I began to give proper consideration to the special requirements of phonological communication and to all that is entailed by the fact that speech is a species-typical product of biological evolution. The consequence for development of theory was that, with the one exception just noted, I never changed my mind abruptly, but rather segued from each earlier position to each later one, leaving behind no clear sign to mark the time, the extent of the change, or the reason for making it. Faced now with having to describe a theory that was in almost constant flux, I can only offer what I must, in retrospect, make to seem a progress through distinct stages.

*The preparation of this paper was aided by the National Institute of Child Health and Human Development under Grant HD 01994. Since the paper is a very personal account of the research that has, since 1965, been supported by that agency, I am moved here to offer special thanks to the agency for its long-continued help, and to Dr. James Kavanagh, a member of its staff, for the kind and wise counsel that he has, for so many years, given me.

My account is also necessarily presumptuous, because I will sometimes say 'we' when I probably should have said 'I', and vice versa. In so doing, I am not trying to avoid blame for bad ideas that were entirely mine, nor to claim credit for good ideas that I had no part in hatching. It is, rather, that everything I have done or thought in research on speech has been profoundly influenced by my colleagues at the Laboratories. In most cases, however, I can't say with certainty just how or by whom. A consequence is that, of all the words I use in this chronicle, the most ambiguous by far are 'I' and 'we'.

'I' began to be confused with 'we' on a day in June, 1944, when I was offered a job at the Laboratories to work with Frank Cooper on the development of a reading machine for the blind, a device that would convert printed letters into intelligible sounds. Of course, we appreciated from the outset that the ideal machine would render the print as speech that the blind user had already mastered. Though that is done quite routinely now, it was, in 1944, far beyond the science and technology that was available to us. There were no optical character readers to identify the letters, and no rules for synthesis that would have enabled us to produce speech from their outputs. But, reassured by our assumptions about the relation of speech to language, of which more later, we did not think it critical that the machine should speak. Rather, we supposed that it had only to produce distinctive sounds of some kind that the blind would then learn to associate with the consonants and vowels of the language, much as we supposed they had done at an earlier stage of their lives with the sounds of speech. Thus conceived, our enterprise lay in the domains of auditory perception and discrimination learning, two subjects I had presumably mastered as a graduate student at Yale. Indeed, my dissertation had been based on experiments about discrimination learning to acoustic stimuli, and I was thoroughly familiar with the neobehaviorist, stimulus-response theory that was thought by my professors and me to account for the results. In the course of my education in psychology, I had learned nothing about speech, but I didn't think that mattered, because the theory I had absorbed was supposed, like most other theories in psychology, to apply to everything. I was, therefore, enthusiastic about the job, confident that I knew exactly how to make the reading machine serve its intended purpose and so put the theory to practical use. In the event, the theory, and, indeed, virtually everything else I had learned, proved to be, at best, irrelevant and, at worst, misleading. I think it unlikely that I would ever have discovered that had it not been for two fortunate, as they proved to be, circumstances. One was the collaboration, from the outset, with Frank Cooper, whose gentle but nonetheless insistent prodding helped me to accept that the theory might be wrong, and to see what might be more nearly right. The other was that speech lay constantly before me, providing an existence proof that language could be conveyed efficiently by sound, and thus setting the high standard by which I was bound to evaluate the performance of the acoustic substitutes for speech that our early assumptions led us to contrive. But for Frank Cooper, on the one hand, and speech on the other, I might still be massaging those substitutes and modifying the conditions of learning, satisfied to achieve trivial improvements in systems that were, by comparison with speech, hopelessly inadequate. As it was, experience in trying to find an alternative set of sounds brought Frank and, ultimately, me to the conclusion that speech is uniquely effective as an acoustic vehicle for language. It remained only to find out why.

But I get ahead of the story. To set out from the proper beginning, I should say why we initially believed there was nothing special about speech or its underlying processes, and where that belief led us, not only in the several stages of the reading machine work, but also in the development of, and early work with, a research synthesizer we called the Pattern Playback.

The assumptions about speech that have been made by us and others differ in many details. As I see it now, however, there is one question that stands above those details, dividing theories neatly into two categories: does speech recruit motor and perceptual processes of a general sort, processes that cut horizontally across a wide variety of behaviors; or does it belong to a special phonetic mode, a distinct kind of action and perception comprising a vertical arrangement of structures and mechanisms specifically and exclusively adapted for linguistic communication? I will use this issue as the basis for organizing my account of the assumptions we made, calling them 'horizontal' or 'vertical' according to the side of the divide on which they fall.

THE HORIZONTAL VIEW OF SPEECH AND THE DESIGN OF READING MACHINES

As it pertains to the perceptual side of the speech process, the horizontal view, which is how we saw speech at the outset, rests on three assumptions: (1) The constituents of speech are sounds. (2) Perception of these sounds is managed by processes of a general auditory sort, processes that evoke percepts no different in kind from those produced in response to other sounds. (3) The percepts evoked by the sounds of speech, being inherently auditory, must be invested with phonetic significance, and that can be done only by means of a cognitive translation. Accordingly, the horizontal view assumes a second stage, beyond perception, where the purely auditory percepts are given phonetic names, measured against phonetic prototypes, or associated with 'distinctive features'. Seen this way, perceiving speech is no different in principle from reading script; listener and reader alike perceive one thing and learn to call it something else.

These assumptions were, and perhaps still are, so much a part of the received view that we could hardly imagine an alternative; it simply did not occur to us to question them, or even to make them explicit. At all events, it was surely these conventional assumptions that gave us confidence in our assumption that nonspeech sounds could be made to work as well as speech, for what they clearly implied was that the processes by which blind users would learn to 'read' our sounds would differ in no important way from those by which they had presumably learned to perceive speech. Our task, then, was simply to contrive sounds of sufficient distinctiveness, and then design the procedures by which the blind would learn to associate them with language.

Auditory discriminability is all it takes

As for distinctiveness, I was, at the outset, firmly in the grip of the notion that it was just so much psychophysical discriminability, almost as if it were to be had simply by separating the stimuli one from another by a sufficient number of just-noticeable differences. This notion fit all too nicely with our ability, even then, to engineer a print-to-sound machine that would meet the discriminability requirement, for such a machine had only to control selected parameters of its acoustic output according to what was seen by a photocell positioned behind a narrow scanning slit. Provided, then, that we chose wisely among the many possible ways in which the sound could be made to vary according to what the photocell saw, the auditory results would be discriminable, hence learnable.

In all these machines the sound would be controlled directly by the amount or vertical position of the print under the scanning slit. We therefore called them 'direct translators' to distinguish them from an imaginable kind we called 'recognition machines', having in mind a device that might one day be available to identify each letter, as present-day optical character readers do, and produce in response a preset sound of any conceivable type. Since we were in no position to build a recognition machine, we set our sights initially on a direct translator.

We were aware that a reading machine of the direct-translation kind had been constructed and tested in England just after World War I. This machine, called the Optophone, scanned the print with five beams of light, each modulated at a different audio frequency. When a beam encountered print, a tone was made to sound at a frequency (hence, pitch) equal to the frequency at which that beam was modulated. Thus, as the scanning slit and its beams were moved across the print, the user would hear a continuously changing musical chord, the composition of which depended, instant by instant, on the modulation frequencies of the beams that struck print. The user's task, of course, was to learn to interpret the changing chords as letters and words. I recall not ever knowing what tests, if any, had been carried out to determine just how useful the machine was, only that a blind woman in Scotland had been able to make her way through simple newspaper text. We did know that the machine was not in use anywhere in 1944, so we could only assume that it had been found wanting. At all events, the lesson I took from the Optophone was not that an acoustic alphabet is no substitute for speech, but that the particular alphabet produced by the Optophone was not a good one. Moreover, it seemed likely, given the early date at which the machine had been made, that the signal was less than ideally clear. I, perhaps more than Frank, was sure we could do better. (Indeed, I think that Frank, even at this early stage, had reservations about any machine that read letter by letter, but he shared my belief that we could design a letter-reading machine good enough to be of some use.)
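
To make the direct-translation scheme concrete, here is a minimal simulation sketch (mine, not from the original report; the five beam frequencies, sample rate, and per-column dwell time are assumed values): each scan row gates a fixed tone, and each scanned column of the print sounds as the chord of its inked rows.

```python
import numpy as np

# Hypothetical simulation of the Optophone's direct-translation scheme.
# `print_image` is assumed to be a binary array (5 rows x N columns),
# one column per scanning instant; the beam frequencies are illustrative.
BEAM_FREQS_HZ = [400, 600, 800, 1000, 1200]  # one tone per scan row (assumed)
RATE = 8000          # audio sample rate, Hz
COLUMN_SEC = 0.02    # dwell time per scanned column, seconds

def optophone(print_image: np.ndarray) -> np.ndarray:
    """For each scanned column, sum the tones of the rows that contain ink."""
    n = int(RATE * COLUMN_SEC)
    t = np.arange(n) / RATE
    chunks = []
    for col in print_image.T:                  # scan left to right
        active = [f for f, ink in zip(BEAM_FREQS_HZ, col) if ink]
        chord = sum(np.sin(2 * np.pi * f * t) for f in active) if active else np.zeros(n)
        chunks.append(chord)
    return np.concatenate(chunks)
```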

Doing better, as we conceived it then, simply meant producing more learnable sounds. Unfortunately, there was nothing in the literature on auditory perception to tell us how to do that, so we set out to explore various possibilities. For that purpose, Frank built a device with which we could simulate the performance of almost any direct translator, with the one limitation that the subjects of our experiments were not free to choose the texts or to control the rate of scan. (This was because the print had to be scanned by passing a film negative of the text across a fixed slit.) Otherwise, the simulator was quite adequate for producing the outputs we would want to screen before settling on those that deserved further testing.

For much of the initial screening, Frank and I made the evaluations ourselves after determining roughly how well we could discriminate the sounds for a selected set of letters and words. On this basis, we quickly came to the conclusion that there was, perhaps, some promise in a system that used the print to modulate the frequency of the output signal. (Certainly, this was very much better than modulating the amplitude, which had been tried first.) Having made this early decision, we experimented with different orientations and widths of the slit, the shape of the sound wave, and the range of modulation, selecting, finally, a vertical orientation of a slit that was somewhat narrower than, say, the vertical bar of a lowercase letter, and a sine wave with a modulation range of 100 to 4000 Hz.

Further informal tests with this frequency-modulation (FM) system showed that it failed to provide a basis for discriminating certain letters (for example, n and u), which led us to conclude that it would likely prove in more thorough tests to be not good enough. This conclusion did not discourage me, however, for I reckoned that the difficulty was only that the signal variation, being one-dimensional, was not sufficiently complex. Make the signals more complexly multi-dimensional, I reasoned, so that somewhere in the complexity the listener would find what he needed as the basis for associating that signal with the letter it represented. I recall that here, too, Frank was skeptical, but, open minded as always, he agreed to try more elaborate arrangements. In one, the Dual FM, the slit was divided into an upper and lower half (each with its own photocell), so that two tones, one in the upper part of the 4000-Hz range, the other in the lower, would be independently modulated. In another (Super FM), the slit was divided into thirds (each with its own photocell), and the difference between the output of the middle cell and the sum of the upper and lower cells controlled the (relatively slow) frequency modulations of a base tone, while the sum of all three cells controlled an audio frequency at which the base tone was itself frequency modulated. The effect of this latter modulation was to create side bands on the base frequency at intervals equal to the modulation frequency, and thus cause frequent, sudden, and usually gross changes in timbre and pitch as a consequence of the amount and position of the print under the slit. At all events, the signals of the Super FM seemed to me to vary in wonderfully complex ways, and so to provide a fair test of my assumption about the necessary condition for distinctiveness and learnability. We also simulated the Optophone, together with a variant on it in which, in a crude attempt to create consonant-like effects, we had the risers and descenders of the print produce various hisses and clicks. We even tried an arrangement in which the print controlled the frequency positions of two formant-like bands, giving an impression of constantly changing vowel color.
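
The side-band effect described here is the standard behavior of audio-rate frequency modulation: modulating a base tone at rate f_mod spreads its energy to f_base +/- k * f_mod. A small numerical check (illustrative parameter values, not the Super FM's actual ones) makes the point:

```python
import numpy as np

# Illustrative check of FM side bands (values assumed, not from the report).
RATE, DUR = 16000, 1.0
f_base, f_mod, beta = 1000, 150, 2.0    # base tone, modulation rate, index
t = np.arange(int(RATE * DUR)) / RATE
x = np.sin(2 * np.pi * f_base * t + beta * np.sin(2 * np.pi * f_mod * t))
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / RATE)
print(np.round(freqs[spectrum > 0.1 * spectrum.max()]))
# energy appears at 1000 +/- k*150 Hz: 550, 700, 850, 1000, 1150, 1300, 1450
```

Because the print simultaneously moved both the base tone and the modulation rate, small changes under the slit could rearrange the whole set of side bands at once, which is consistent with the sudden, gross changes in timbre and pitch reported above.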

We tested all of these systems, together with three or four others of similar type, by finding how readily subjects could learn to associate the sounds with eight of the most common four-letter words. To determine how well these systems did in relation to speech, we applied the same test to an artificial language that we spoke simply by transposing phonological segments. Thus, vowels were converted into vowels, stops into stops, fricatives into fricatives, etc., so the syllable structure remained unchanged, hence easy for our human reading machine to pronounce. We called this new language 'Wuhzi', in honor of one of the transposed words.

The results of the learning test were simple enough: of all the nonspeech signals tested, the original, simple FM and the original Optophone were the most nearly learnable; all others trailed far behind. Worst of all, and by a considerable margin, were the extremely complex signals of the Super FM, the system for which I had entertained such high hopes. As for Wuhzi, the transposed speech, it was in a class of its own. It was mastered (for the purposes of the test) in about 15 trials, in contrast to the nonspeech sounds, on which subjects were, after 24 trials, performing at a level of only 60% with the simple FM and Optophone signals, and at 50% or lower for the others, and all seemed at, or close to, their asymptotes. More troubling was the fact that learnability of the nonspeech signals went down appreciably as the rate of scan was increased, even at modest levels of rate. What should have been most troubling, however, were the indications that learning would be rate specific. Though we did not pursue the point, it seemed clear in my own experience with the FM system that what I had learned to 'read' at a rate of, say, 10 words per minute did not transfer to a rate of 50. My impression, as I remember it now, was not that things sounded much the same, only speeded up, but rather that the percept had changed entirely, with the result that I could no longer recognize whatever I had learned to respond to at the slower speed. To the extent that this observation was correct, users would have had to learn a different nonspeech 'language' for every significantly different rate of reading, and that alone would have made the system impractical. It also became clear that at rates above 50 or so words per minute, listeners could identify the words only on the basis of some overall impression (I am reluctant to call it a pattern) in which the individual components were not readily retrievable. The consequence of this, though we did not take proper account of it at the time, was that perception of the nonspeech signals would have been, at best, logophonic, as it were, not phonologic, so users would have had to learn to identify as many sounds as there were words to be read, and would not have been able to fall back on their phonologic resources in order to cope with new words.

At this point it had become evident that, as Frank had been saying for some time, perception of the acoustic alphabet produced by any letter-reading machine would be severely limited by the temporal resolving power of the ear, that is, by its poor ability to segregate and properly order acoustic events that are presented briefly and in rapid succession. Taking account of the number of discrete frequency peaks in each alphabetic sound and the average number of alphabetic characters per word, we estimated the maximum rate to be around 50 words per minute, which I now believe to have been very much on the high side of what might have been achieved. Still, 50 words per minute would be quite useful for many purposes, and we had a considerable investment in the letter-reading, direct-translator kind of system, so we felt obliged to see how far a subject could go, given control of the scanning rate, a great variety of printed material, and much practice. Frank therefore built a working model of a direct-translating FM reading machine, and we set several subjects to work trying to master it. After 90 hours of practice, our best subject was able to manage a fifth-grade reader (without correction) at an average speed of about four words per minute. Moreover, the subject seemed to have reached her peak performance during the first 45 hours. Comparing performance at the end of that period with performance after 90 hours, we found no increase in rate and no decrease in errors. Tests of the same system conducted by John Flynn at the Naval Medical Research Laboratory yielded results that gave no more grounds for optimism.

I must not omit from this account a reference to our experience with a sighted subject named (dare I say, appropriately?) Eve, who, having first produced a properly shaped learning curve, attained a reading speed of over 100 words per minute, and convincingly demonstrated her skill before the distinguished scientists who were sponsoring our research. In that demonstration, as in all of her work with the machine, she wore blacked-out welder's goggles so she would not suffer the discomfort of having to keep her eyes shut tight. We all remember the day when one of our number, acting on what seemed an ungentlemanly impulse, put an opaque screen between her and the print she was 'reading', at which point she bounded from her chair and fled the lab, confessing later to the young man who had been monitoring her practice sessions that, by leaning her cheek against her fist, she had, from the very beginning, been raising the bottom of the goggles and so seeing the print. In designing the training and testing procedures, I had controlled for every possibility except that one. Obviously, I was the wrong kind of psychologist.

Having concluded at last that, given the properties of the auditory system, letter-by-letter reading machines were destined to be unsatisfactory, we experimented briefly with a system in which the sound was controlled by information that had been integrated across several letters. This was intended to reduce the rate at which discrete acoustic events were delivered to the ear, and so circumvent the limitation imposed by its temporal resolving power. I cannot now imagine what we thought about the consequences of having to associate holistically different sounds with linguistically arbitrary conjunctions of letters. But whatever it was that we did or did not have in mind, this integrating type of reading machine failed every test and was quickly abandoned.

One other scheme we tested deserves mention if only because it now seems incredible that we should ever have considered it. I suppose we felt obliged to look forward to the time when optical character readers would be available, and so to test some presumably appropriate output of the 'recognition' machine that would then be possible. Prerecorded 'phonemes' seemed an obvious choice. After all, the acoustic manifestations of phonemes were known to be distinctive, and people had, on the horizontal view, already learned to connect them to language. As for difficulty with rate, we must have supposed that these sounds carried within them the seeds of their own integration into larger wholes. At all events, we recorded various 'phonemes' on film ('duh' for /d/, 'luh' for /l/, etc.) and then carefully spliced the pieces of film together into words and sentences. The result was wholly unintelligible, even at moderate rates, and gave every indication of forever remaining so.

Perhaps auditory patterning is the answer

From all this experience with nonspeech we drew two conclusions, one right, one wrong. The right conclusion was that an acoustic alphabet is no way to convey language. Of course, it was humbling for me to realize that I might have reached that conclusion without all the work on the reading machines, simply by measuring the known temporal-resolving characteristics of the ear against the rate at which the discrete sounds would have to be perceived. As I have already suggested, Frank must have thought this through, though he might not have been sufficiently pessimistic about the limits, and that would account for his early opinion that an acoustic alphabet might be at least marginally useful. I, on the other hand, took the point only after seeing our early results and having my nose rubbed in them.


The wrong conclusion (wrong because it was still uncompromisingly horizontal in outlook) was that what we needed was not discriminability but proper auditory patterning. Speech succeeds, we decided, because its sounds conform to certain Gestalt-like principles that yield perceptually distinctive patterns, and cause the phonetic elements to be integrated into the larger wholes of syllables and words. Apparently, it did not occur to me that phonological communication would be impossible if the discrete and commutable elements it requires were to be absorbed into indissoluble Gestalten. Nor did I bother to wonder how speakers might have managed their articulators so as to force all of the many covarying but not-independently-controllable aspects of the acoustic signal to collaborate in such a way as to produce just those sounds that would form distinctive wholes.

In any case, nobody knew what the principles of auditory pattern perception were. Research in audition, unlike the corresponding effort in vision, had been quite narrowly psychophysical, having been carried out with experimental stimuli that were not sufficiently complex to reveal such pattern perception as there might be. So if, for the purposes of the reading machine, we were going to make its sounds conform to the principles that underlie good auditory patterns, we had first to find the principles. We therefore decided to turn away from the reading machine work until we should have succeeded in that search. We were the more motivated to undertake it, because the outcome would not only advance the construction of a reading machine, but also count, more generally, as a valuable contribution to the science of auditory perception.

THE PATTERN PLAYBACK: MAKING THE RIGHT MOVE FOR THE WRONG REASON

It is exactly true, and important for understanding how horizontal our attitude remained, to say that our aim at this stage was to study the perception of patterned sounds in general, not speech in particular. As for the reading machine, we did not suppose we would use our results to make it speak, only that we would have it produce sounds that would be as good as speech because they would conform to the same principles of auditory perception. These would be sounds that, except for the accidents of language development, might as well have been speech. Obviously, we would look to speech for examples of demonstrably well-patterned sounds. Indeed, we would be especially attentive to speech, but only because we did not know where else to search for clues. At this stage, however, our interest lay in auditory perception.

Since we barely knew where to begin, we expected to rely, especially at the outset, on trial and error. Accordingly, we had to have some easy way to produce and modify a large number and wide variety of acoustic patterns. The sound spectrograph and the spectrograms it produced were no longer a wartime secret, and, like many others, we were impressed with the extent to which spectrograms made sense to the eye. Beyond a doubt, they provided an excellent basis for examining speech; it seemed to us a small step to suppose that they would serve equally well for the purpose of experimenting with it. The experimenter would have immediately available the patterned essence of the sound, and thus easily grasp the parameters that deserved his attention. These would be changed to suit his aims, and he would then have only to observe the effects of the experimental changes on the sound as heard. But that required a complement to the spectrograph: that is, a device that would convert spectrograms, as well as the changes made on them, into sound.

It was in response to these aims and needs that the Pattern Playback was designed and, in the case of its most important components, actually built by Frank Cooper. My role was simply to offer occasional words of encouragement, and, as the device took shape, to appreciate it as a triumph of design, a successful realization of Frank's determination to provide experimental and conceptual convenience to the researcher. The need for that convenience is, I think, hard to appreciate fully, except as one has lived through the early stages of our speech research, when little was known about acoustic phonetics, and progress depended critically, therefore, on the ability to test guesses at the rate of dozens per day.

Seven years elapsed between the start of our enterprise and the publication of the first paper for which we could claim significance. I would therefore here record that the support of the Haskins Laboratories and the Carnegie Corporation of New York was a notable and, to Frank and me, essential exercise of faith and patience. Few universities or granting agencies would then have been, or would now be, similarly supportive of two young investigators who were trying for such a long time to develop an unproved method to investigate a question (what is special about speech that it works so well?) that was not otherwise being asked. I managed to survive in academe by changing universities frequently, thus confusing the promotion-tenure committees, and also by loudly trumpeting two rat-learning studies that I published as extensions of my thesis research.

There were many reasons it took so long to get from aim to goal. One was the failure, after two years of work, of the first model of the device that was to serve as our primary research tool. It was the second and very different model (hereafter the Pattern Playback or simply the Playback) that was, for our purposes, a success. In the form in which it was used for so many years, this device captured the information on the spectrogram by using it to control the fate of frequency-modulated light beams. Arrayed horizontally so as to correspond to the frequency scale of the spectrogram, there were 50 such modulated beams, comprising the first 50 harmonics of a 120 Hz fundamental. Those beams that were selected by the pattern of the spectrogram were led to a phototube, the resulting variations of which were amplified and transduced to sound. In effect, the device was, as someone once said, an optical player piano.
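
Functionally, that arrangement amounts to additive synthesis gated by the painted pattern, as in the following sketch (a digital analogy under stated assumptions; the real device worked optically and continuously, not by arithmetic):

```python
import numpy as np

# Sketch of the Pattern Playback as additive synthesis. Assumptions:
# `pattern` is a binary mask of 50 harmonic rows x time columns; the
# 120 Hz fundamental and the 50 harmonics follow the text, while the
# frame length and sample rate are arbitrary choices for the sketch.
RATE, F0, FRAME_SEC = 16000, 120, 0.01

def playback(pattern: np.ndarray) -> np.ndarray:
    n = int(RATE * FRAME_SEC)
    t = np.arange(n) / RATE
    # one sinusoid per row: harmonics 1..50 of the 120 Hz fundamental
    harmonics = np.array([np.sin(2 * np.pi * F0 * (k + 1) * t) for k in range(50)])
    # each time column of the mask selects which harmonics sound in that frame
    frames = [pattern[:, j] @ harmonics for j in range(pattern.shape[1])]
    return np.concatenate(frames)
```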

When the Pattern Playback was conceived, we thought we might not be able to work from hand-painted copies of spectrograms, but only from real ones. Therefore, we needed spectrograms on film so that, given photographic negatives, the phototube would see the beams that passed through the film. (When operating from hand-painted patterns, the phototube would receive the beams that were reflected from the paint.) For convenience in manipulating the patterns, we also wanted the frequency and time scales to be of generous dimensions. Moreover, we thought at this stage that we would need spectrograms with a large dynamic range. None of these needs was fully met by the spectrograph that had been built originally at the Bell Telephone Laboratories, and it was not available for sale in any case, so Frank set about to design and build our own. Unfortunately for the progress of our research, that took time. As for the Playback itself, it was, as our insurance inspector remarked when first he saw it, 'homemade'. And, indeed, it was. Only the raw materials were store-bought. Everything else was designed and fashioned at the Laboratories, including, especially, the tone wheel, the huge circular film with 50 concentric rings of variable density used to produce the light beams that played on the spectrograms.

Once the special-purpose spectrograph and the Playback were, at last, in working order, we had to determine that speech could survive the rigors of the transformations wrought first by the one machine and then by the other. I well remember the relief I felt (Frank probably had more faith and therefore experienced correspondingly less relief) when, operating from a negative of a spectrogram, the Playback produced, on its first try, a highly intelligible sentence.

But that was only to pass the first test. For using these 'real' spectrograms as a basis for experimenting with speech would have been awkward in the extreme, since it would have been very hard to make changes on the film. We therefore began immediately to develop the ability to use hand-painted spectrograms to control the sound. For that purpose, we had first to find a paint that would wet the acetate tape (i.e., not bead up), be easily erased (without leaving scars), and dry quickly. (The last requirement became nearly irrelevant when, quite early in the work, we acquired a hair-dryer.) Unable to find a commercially available paint that met our specifications, we became paint manufacturers, making our way by trial and error to a satisfactory product. That much we had to do. But there was time spent in other preliminaries that proved to be quite unnecessary. Thus, overestimating the degree of fidelity to the original spectrogram we would need, we assumed we would have to control the relative intensities of the several parts of the pattern, and so devoted considerable effort to preparing and calibrating paints of various reflectances. We also supposed that we would have to produce gradations of a kind that could best be done with an airbrush, so we fiddled for a time with that.

But then, having decided that the price of fidelity was too high, we simply began, with an artist's brush and our most highly reflecting paint, to make copies of spectrographic representations of sentences, using for this purpose a set of twenty that had been developed at Harvard for work on speech intelligibility. Taking pains with our first copies to preserve as much detail as possible, we succeeded in producing patterns that yielded what seemed to us a level of intelligibility high enough to make the method useful for research. We were then further encouraged about the prospects of the method when, after much trial and error, we discovered that we could achieve even greater intelligibility (about 85% for the twenty sentences) with still simpler and more highly schematized patterns.

Having got this far, we ran a few preliminary experiments no different in principle from those, very common at the time, in which intelligibility was measured as a function of the filtering that the speech signal had been subjected to. But instead of filtering, we presented the formants one at a time and in all combinations. The grossly statistical result seemed of no great interest, even then, so we did not pursue it. There was, however, one unintended and unhappy result of our interest in the relative contributions of the several formants. In the course of giving a talk about the Playback at a meeting of the Acoustical Society of America, Frank played several copy-synthetic sentences, first with each formant separately, and then with the formants in various combinations. It was plain from the reaction of the audience that everyone was greatly surprised each time Frank read out the correct sentence. Apparently, people had formed wrong hypotheses as they tried to make sense of the speech produced by the individual formants, and subsequently had trouble correcting those hypotheses when presented with the full pattern. So the sentences got nowhere near the 85% intelligibility they deserved, and Frank got little recognition on that occasion for a promising research method.

AN EXCURSION INTO NONSPEECH AND THE INFINITELY HORIZONTAL HYPOTHESIS

We were, at this stage, strongly attracted to the notion that spectrograms of speech were highly readable because, as transformations of well-patterned acoustic signals, they managed still to conform to basic principles of pattern perception. Implicit in this notion was the assumption that principles of pattern perception were so general as to hold across modalities, provided proper account was taken of the way coordinates were to be transformed in moving from the one modality to the other. As applied to the relation between vision and audition, this assumption could be tested by the extent to which patterns that looked alike could be so transformed as to make them sound alike. To apply this test to the spectrographic transform, we varied the size and orientation of certain easily identifiable geometric forms, such as circles, squares, triangles, and the like, converted them to sound on the Playback, and then asked whether listeners would categorize the sounds as they had the visual patterns. Under certain tightly constrained circumstances, the answer was yes, they would; otherwise, it was no, they would not. At all events, we were so bold as to publish the idea, together with a description of the Playback, in the Proceedings of the National Academy of Sciences. Fortunately for us, that journal is not widely read in our field, so our brief affair with the mother of all horizontal views has remained till now a well-kept secret.

We also experimented briefly with a phenomenon that would, today, be considered an instance of 'streaming', though I thought of it then as the auditory equivalent of the phi phenomenon that had for so long occupied a prominent place in the study of visual patterns. Using painted lines so thin that each one activated only a single harmonic, we observed that when they alternated between a lower and a higher frequency, the listener heard the alternation only if the resulting sinusoids were sufficiently close in frequency and sufficiently long in duration; otherwise, the impression was of two streams of tones that bore no clear temporal relation to each other. In pursuing this effect, we looked for that threshold of frequency separation and duration at which the subject could not distinguish a physically alternating pattern from one in which the two streams of tones came on and went off in perfect synchrony. We did, in fact, succeed in finding such thresholds and in observing that they were sensitive to frequency separation and duration, as our hypothesis predicted, but, after exhibiting a certain threshold for a while, the typical subject would quite suddenly begin to hear the difference and to settle down, if only temporarily again, at a new threshold. Discouraged by this apparent lability, we abandoned the project and began to put our whole effort on speech.

EARLY (AND STILL HORIZONTALLY ORIENTED) RESEARCH ON SPEECH

It was at about this point that Pierre Delattre came to the lab to visit for a day and, fortunately for us, stayed for ten years. Initially, his interest was in discovering the acoustic basis for nasality in French vowels, but once he found what the Playback was capable of, his ambition broadened to include any and all elements of phonetic structure. So, while continuing to work on nasality, Pierre applied himself (and us) to producing two-formant steady-state approximations to the cardinal vowels of Daniel Jones. (Pierre supplied the criterial ear.) The result was published in Le Maitre Phonetique, written in French and in a narrow phonetic transcription. Thus, in our second paper, we continued the habit that had been established in the first of so publishing as to guarantee that few among our colleagues would read what we wrote.

Meanwhile, we had begun to put our attention once again on the copy-synthetic sentences, but now, instead of inquiring into the contribution of each formant to overall intelligibility, we sought, more narrowly, to find the acoustic bases (the cues) for the perception of individual phones. I recall working first on the word 'kill' as it appeared in simplified copy we had made from a spectrogram of the utterance, 'Never kill a snake with your bare hands.' Looking for the phone [l], we held the pattern stationary at closely spaced temporal intervals, thus creating at each resting point a steady-state signal controlled by the static positions of the formants at that instant. The result, to our ears, was a succession of vowel-like sounds, but nothing we would have called an [l]. Yet, running the tape through at normal speed, and thus capturing the movement of the formant, produced a reasonable approximation to that phone. I don't remember what we made of this (to me) first indication of the importance of the dynamic aspects of the signal, but we did not immediately pursue it, turning instead to the [k] at the beginning of the same word.

Context-conditioned variability and the horizontal version of the motor theory

The advantage of [k] from our point of view as experimenters was that it appeared, in the copy-synthetic word, to be carried by a clearly identifiable and nicely delimited cue: a brief burst of sound that stood apart from the formants. Since we knew that [k] was a voiceless stop, in the same class as [p] and [t], we reckoned that all three stops might depend on such a burst, according to its position on the frequency scale. So, after a little trial and error, we carried out what must be counted our first proper experiment. Bursts (about 360 Hz in height, 15 msec in duration at their widest, and of a shape that caused Pierre to call them 'blimps') were centered at each of 12 frequencies covering the full range of our spectrograms. Each burst was placed in front of seven of Pierre's steady-state cardinal vowels, and the resulting stimuli were randomized for presentation to listeners who would be asked to identify the consonant in each case as [p], [t], or [k].
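
A hypothetical reconstruction of that stimulus design might look like this (the burst frequencies and vowel symbols below are placeholders; the text does not list them in full):

```python
import random
from itertools import product

# Placeholder reconstruction of the burst-plus-vowel stimulus grid:
# 12 burst center frequencies crossed with 7 steady-state cardinal vowels.
BURST_CENTERS_HZ = [300 * k for k in range(1, 13)]   # assumed spacing
VOWELS = ["i", "e", "E", "a", "O", "o", "u"]         # assumed vowel labels
trials = list(product(BURST_CENTERS_HZ, VOWELS))     # 84 burst+vowel 'syllables'
random.shuffle(trials)                               # randomized presentation
print(len(trials))  # 84 trials, each labeled by listeners as [p], [t], or [k]
```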

Of all the synthetic patterns ever used, these burst-plus-steady-state-vowel 'syllables' were undoubtedly the farthest from readily recognizable speech. Indeed, they so grossly offended Pierre's phonetic sensibilities that he agreed only reluctantly to join Frank and me as a subject in the experiment. After discharging my own duty as a subject, I thought, as did Frank, that Pierre's reservations were well taken, and that our responses would reveal little about stop consonants or, indeed, anything else. We were therefore surprised and pleased when, on tabulating the results, we saw a reasonably clear and systematic pattern, and then equally surprised and pleased when a group of undergraduates, known for technical purposes as 'phonetically naive subjects', did much as we had done, if just a little more variably. For us and for them, the modal [k] response was for bursts at or slightly above the second formant of the vowel, wherever that was; [t] was assigned most often to bursts centering at frequencies above the highest of the [k] bursts; and a [p] response was given to most of the bursts not identified as [t] or [k].

This experiment was, for me, an epiphany. It is the one, referred to earlier, that changed my thinking within hours or days after its results were in. What caused the change was the finding that the effect of an acoustic cue depended to such a very large extent on the context in which it appeared, and, more to the point, that perception accorded better with articulatory gesture than with sound.

The effect of context was especially apparent in the fact that the burst most frequently perceived as [k] was the one lying at or slightly above the second formant of the following vowel, even though that formant sampled a range that extended from 3000 Hz (for [i]) at the one end, to 700 Hz (for [u]) at the other. Moreover, this evidence that the same phonetic percept was cued by stimuli that were acoustically very different was but one side of the coin, for the results also showed that, given the right context, different percepts could be evoked by stimuli that were acoustically identical. This was the case with the burst at 1440 Hz, which was perceived predominantly as [p] before [i], as [k] before [a], and then again as [p], though weakly, before [u].

As an empirical matter, then, this first, very crude experiment demonstrated a kind and degree of context-conditioned variability in the acoustic cues that subsequent research has shown to be pervasive in speech. As for its relevance to our earlier work on reading machines, it helped to rationalize one of the conclusions we had been brought to, which was that speech cannot be an acoustic alphabet; for what the context effects showed was that the commutable acoustic unit is not a phone, but rather something more like a syllable.

From a theoretical point of view, the results revealed the need to find, somewhere in the perceptual process, an invariant to correspond to the invariant phonetic unit, and they strongly suggested that the invariant is in the articulation of the phone. Thus, in the case of the [k] burst we noted that it was the articulatory movement (raising and lowering the tongue body at the velum) that remained reasonably constant, regardless of the vowel, and, further, that coarticulation of the constant consonant with the variable vowels accounted for the extreme variability in the acoustic signal. As for the other side of the coin (the very different perception of the burst before different vowels) which resulted, it should be noted, from a fortuitous combination of circumstances probably unique to this experiment, we supposed that, in order to produce something like a burst at 1440 Hz in front of [i] or [u], one had to close and open at the lips, while in front of [a], closing and opening at the velum was required.

Taking all this into account, we adopted a notion (the Early Motor Theory) that I now believe to have been partly right and partly wrong. It was right, I think, in assuming that the object of perception in phonetic communication is to be found in the processes of articulation. It was wrong, or at least short of being thoroughly right, because it implied a continuing adherence to the horizontal view. As earlier indicated, that view assumes a two-stage process: first an auditory representation no different in kind from any other in that modality, followed, then, by linkage to something phonetic, presumably as a result of long experience and associative learning. The phonetic thing the auditory representation becomes associated with is, in the conventional view, a name, prototype, or distinctive feature. As we described the Early Motor Theory in our first papers, we made no such explicit separation into two stages, but only because our concern was rather to emphasize that perception was more closely associated with articulatory processes than with sound, and then to infer that this was because the listener was responding to the sensory feedback from the movements of the articulators. However, we offered no hint that phonetic perception takes place in a distinct modality, thus omitting to make explicit the assumption that is, as will be seen, the heart of the vertical view. Indeed, we implied, to the contrary, that the effects we were concerned with were, potentially at least, perfectly general, and so would presumably occur for any perceptual response to a physical stimulus, given long association between the stimulus and some particular muscular movement, together with its sensory feedback. What was special about speech was only that it provided the par excellence example of the opportunity for precisely that association to be formed. In any case, my own view, as I remember it now, did comprehend two more or less distinct stages: an initial auditory representation that, as a result of associative learning, ultimately gave way to the sensory consequences of the articulatory gesture that had always been coincident with it. I had, I must now suppose, not spent much time wondering exactly what 'gave way' might mean. Had I been challenged on this point, I think I should have said that in the early stages of learning there surely was a proper auditory percept, no different in kind from the response to any other acoustic stimulus and equally available to consciousness, but that later, when the bond with articulation was secure, this representation would simply have ceased to be part of the perceptual process. But however I might have responded, the Early Motor Theory was different from the standard two-stage view in a way that had an important effect on the way we thought about speech, and also on the direction of our research. It mattered greatly that we took the object of perception, and the ultimate constituent of phonetic structure, to be an articulatory gesture, not a sound (or its auditory result), for this began a line of thinking that would in time be seen to eliminate the need for the horizontalists' second stage, and so permit us to exorcise the linguistic ghosts (the phonetic names or other cognitive entities) that haunted it. As for the direction of our research, the Early Motor Theory caused us to turn our attention to speech production, and so to initiate the inquiry into that process that has occupied an ever larger and more important place in our enterprise.

Some mildly interesting assumptions that underlay the methods we used

Given the extreme unnaturalness of the stimuli in many of our experiments, and the difficulty subjects reported in hearing them as speech, we had to assume that there would be no important interaction between the degree of approximation to natural speech quality and the way listeners perceive the phonetic information. I therefore note here that this assumption has proved to be correct: no matter how unconvincing our experimental stimuli as examples of speech, those listeners who were nevertheless able to hear them as speech provided results that have held up remarkably well when the experiments were carried out subsequently with stimuli that were closer approximations to the real thing. This was the first indication we had of the theoretically important fact that accurate information about the relevant articulator movements, as conveyed, however unnaturally, by the acoustic signal, is sufficient for veridical phonetic perception, provided only that the listener is not too put off by the absence of an overall speech-like quality.

We also had to assume that the validity of our results would be little affected by a practice, followed throughout our search for the cues, of investigating, in any one experiment, only a single phonetic feature in a single phonetic class, and instructing the subjects to limit their responses to the phones in that class. Obviously, this procedure left open the possibility that cues sufficient to distinguish phones along, say, the dimension of place in some particular condition of voicing or manner would not work when voicing or manner was changed, or when the set of stimuli and corresponding response choices was enlarged. In fact, the results obtained in the limited contexts of our experiments were not overturned in later research when, for whatever reason, those limits were relaxed. I take this to be testimony to the independence in perception of the standard feature dimensions.

Finally, given early indications that there were several cues for each phonetic element, and given that, to make our experiments manageable, at least in the early stages, we typically investigated one at a time, we had to assume the absence of strong interactions among the cues. In fact, as we later discovered, there are such interactions (specifically, the trading relations that bespeak a perceptual equivalence among the various cues for the same phone and that are, therefore, of some theoretical interest, as I will say later), but these occur only within fairly narrow limits, and they only change the setting of a cue that is optimal for the phone; they do not otherwise affect it. Therefore, working on only one cue at a time did not cause us to be seriously misled.

More context-conditioned variability and the dynamic aspects of the speech signal

The most cursory examination of a spectrogram of running speech reveals nothing so clearly as the almost continuous movement of the formants; even the band-limited noises that characterize the fricatives seem more often than not to be dynamically shaped. Indeed, Potter, Kopp, and Green, in their book, Visible Speech, had remarked these movements, but they considered the effect to proceed from (constant) consonant to (variable) vowel, at least in the case of stop consonant-vowel syllables; and since their interest was primarily in how these transitional influences might help people to 'read' spectrograms, they simply called attention to the direction of the movement, up or down, and did not speculate about the role of these movements in speech perception. Martin Joos, on the other hand, wrote explicitly about the consonant-vowel transitions, as well as their context-conditioned variability, and showed, by cutting out properly selected parts of magnetic-tape recordings, that these transitions conveyed information about the consonants. He also made the important observation that there was, therefore, no direct correspondence between the segmentation of the acoustic signal and the segmentation of the phonetic structure. But Joos could not vary the transition for experimental purposes, so his conclusions could not be further refined. We therefore thought it a reasonable next step to make those variations, and so learn more about the role of the transitions in perception, choosing, first, to study place of production among stop and nasal consonants.

Our research to that point had prepared us to deal only with two-formant patterns, and since inspection of spectrograms indicated that transitions of the first formant did not vary with place, we chose to experiment with transitions of the second. To that end, we varied the starting point, hence the direction and extent, of these transitions by starting them at each of a number of frequencies above and below the steady state of the following vowel. In one condition, the first formant had a fixed transition that rose to the steady state from the very lowest frequency (120 Hz); the resulting patterns were intended to represent voiced stops, and it was our judgment that they did that reasonably well. In a second condition, the first formant had a zero transition; that is, it was straight. We hoped that these patterns would sound voiceless, but, in fact, they did not. They were used nevertheless. The stimuli in each of these conditions were presented for identification as [b], [d], [g] to one group of listeners and as [p], [t], [k] to another. The results showed clearly that the transitions do provide important information about the place dimension of the voiced and voiceless stops, and, also, that this information is independent of voicing. Indeed, it mattered little whether the listeners were identifying a particular set of synthetic stops as voiced or voiceless; the same transitions were associated with the same place, and there was, at most, only slightly less variability in the condition with the rising first formant, where the stimuli sounded voiced to us and the subjects were asked to judge them so. In a third condition, we strove, fairly successfully we thought, for nasal consonants by using a straight first formant, together with what we considered at the time to be an appropriate (fixed) nasal resonance. Here, too, the second-formant transitions provided information about place, and that information was in no way affected by the change in manner, even though we had reversed the patterns so as to make the nasals syllable final, and so to take account of the fact that the velar nasal never appears initially in the syllable in English. As for context effects, they were large and systematic. Thus, the best transition for [d] (or [t] or [n]) fell from a point considerably above the steady state of the vowel with [u], but with [i] it rose from a point below the vowel's steady state, and similar effects were evident with the transitions for other phones. We thought it supportive of a motor theory that the highly variable transitions for the same consonant were produced by a reasonably constant articulatory gesture as it was merged with the gesture appropriate for the following vowel. Equally supportive, in our view, was the fact that mirror-image transitions in syllable-initial and syllable-final positions nevertheless yielded the same consonantal percept, for surely these would sound very different in any well-behaved auditory system. From an articulatory point of view, however, these transitions are seen as the acoustic consequences of the opening and closing phases of the same gesture. As with the bursts, then, perception cued by the transitions accorded better with an articulatory process than with sound.

Despite the demonstration, by us and others, of the extreme context sensitivity of the acoustic cues, some researchers have been concerned for many years to show that there are, nevertheless, invariant acoustic cues, implying, then, that no special theoretical exertions are necessary in order to account for invariant phonetic percepts. My own view of this matter has always been that, whatever the outcome of the seemingly never-ending search for acoustic invariants, the theoretical issue will remain largely untouched; for there is surely no question that the highly context-sensitive transitions do supply important information for phonetic perception (they can, indeed, be shown to be quite sufficient in many circumstances), and that incontrovertible fact must be accounted for.

A brief flirtation with binary decisions

In our first attempt to interpret the significance of the burst and transition results, we took seriously, if only for a short time, the possibility that the two kinds of cues collaborated in such a way that two binary decisions resolved all perceptual ambiguity. Our data had shown that the bursts were identified as [t] if they were high in frequency and as [p] or [k] if low; the second-formant transitions evoked [t] or [k] if they were falling, and [p] if rising. So a low burst and a rising transition would be an unambiguous [p]; a high burst and falling transition would be [t]; and a low burst coupled with a falling transition would be [k]. We were, of course, influenced to this conclusion not just by our data but also, if only indirectly, by the then prevailing fashion for binary arrangements. At all events, we made no attempt to link our notion about binary decisions with the Early Motor Theory, perhaps because that would have been hard to do.
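In present-day terms, the short-lived scheme was just a two-bit decision table. A minimal sketch follows; the function name and the handling of the unused fourth cell (a high burst with a rising transition, which the scheme never assigned) are mine, while the three assignments are those stated above.

    def stop_from_cues(burst_high, transition_falling):
        # Two binary cues jointly name the stop, per the scheme in the text.
        if burst_high and transition_falling:
            return 't'
        if not burst_high and transition_falling:
            return 'k'
        if not burst_high and not transition_falling:
            return 'p'
        return None  # high burst + rising transition: left unresolved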

Acoustic loci: Rationalizing the transitions and their role in perception

It required only a little further reflection, combined with an examination of the results of our experiments on the second-formant transitions, to see that perception was sensitive to something more than whether the transition was rising or falling. Since those transitions reflect the cavity changes that occur as the articulators move from the consonant position to the vowel, and since the place of production for each consonant is more or less fixed, we saw that we should expect to find a correspondingly fixed position (or 'locus', as we chose to call it) for its second formant. More careful examination of the results of our experiments suggested that, for each position on the dimension of place, the transition might, indeed, have originated at some particular frequency, and then made its way to the steady state of the vowel, wherever that was. To refine this notion, we carried out a two-step experiment. In the first, we put steady-state second formants at each of a number of frequency levels, from 3600 Hz to 720 Hz, and paired each with one of a number of first formants in the range 720 Hz to 2400 Hz. The first formants had rising transitions designed to evoke the effect of a voiced stop consonant. Careful listening revealed that [d] was heard when the second formant was at 1800 Hz, [b] at 720 Hz, and [g] at 3000 Hz, so we settled on these frequencies as the loci for the places of production of the three stops.

The second step was to prepare two-formant patterns in which the second formant started at each of these loci, and then rose or fell to the steady state of the vowel, wherever that was. With these patterns, the consonant appropriate to the locus was not evoked clearly. For [d], indeed, starting the transitions at the locus produced, for some steady states, [b] and [g]. To get good consonants, we had in all cases to 'erase' the first half of the transition so as to create a silent interval between the locus and the actual start of the transition. We noted, further, that in the case of [g], this maneuver worked only with second-formant steady states from 3000 Hz to about 1200 Hz, which is approximately where the vowel shifts from spread to rounded; below 1200 Hz, no approximation to [g] could be heard.
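Stated as a rule, the locus idea is simple, and a hedged sketch may make it concrete. In the Python below, the straight-line interpolation, the 50-msec transition, and the durations are my assumptions; the loci and the silencing of the first half of the transition are from the text. The same [d] locus yields a falling transition before a low-F2 vowel and a rising one before a high-F2 vowel.

    LOCI = {'b': 720.0, 'd': 1800.0, 'g': 3000.0}  # Hz, as reported above

    def f2_trajectory(stop, vowel_f2, trans_dur=0.050, steady_dur=0.250,
                      step=0.005):
        # (time, frequency) pairs; None marks the 'erased' first half.
        locus = LOCI[stop]
        points, t = [], 0.0
        while t < trans_dur:
            frac = t / trans_dur
            freq = locus + frac * (vowel_f2 - locus)  # straight line to vowel
            points.append((t, None if frac < 0.5 else freq))
            t += step
        while t < trans_dur + steady_dur:
            points.append((t, vowel_f2))              # vowel steady state
            t += step
        return points

    falling = f2_trajectory('d', 900.0)   # [d] before an [u]-like vowel
    rising = f2_trajectory('d', 2500.0)   # [d] before an [i]-like vowel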

The concept of the locus, together with the experimental results that defined it, made simple sense of the transitions. It also tempted me to temper the emphasis on context-conditioned variability by assuming that the perceptual machinery (by which I might have meant the auditory machinery) 'extrapolated' backward from the start of the transition to the locus, and so arrived at a 'virtual' acoustic invariant. Fortunately, I yielded to this temptation only briefly, and there is, I think, no written record of my lapse. In any case, we began early to take the opposite tack, using the locus data to strengthen our conclusions about the role of context by emphasizing the untoward consequences of actually starting the transitions at the locus, and by pointing to the sudden shift in the [g] locus when the vowel crosses the boundary from spread to rounded.

Stop vs. semivowel, or once more into the auditory breach

It was apparent on the basis of articulatory considerations, and also by inspection of spectrograms, that an acoustic correlate of the stop-semivowel distinction was the duration or rate of the appropriate transitions. To find out how this variable actually affected perception, we carried out the obvious experiment. We particularly wanted to know whether it was rate or duration, and also where on the rate or duration continuum the phonetic boundary was. By varying the positions of the vowel formants, and hence the extent of the transition, we were able to separate the two variables and find that duration seemed to be doing all the work. We also found that the boundary was at about 50 msec.

I recall thinking that 50 msec might be critical for some kind of auditory integration. As I have already said, it seemed reasonable to me at the time to suppose that phonetic distinctions had accommodated themselves to the properties of the auditory system as revealed at the level of psychophysical relations, since the (implicit) motor activity that was ultimately perceived was itself initially evoked by some kind of first-stage auditory representation. I was therefore naturally attracted to the possibility that transition excursions of less than 50 msec duration were, perhaps, so integrated by the auditory system as to produce a unitary impression like that of a stop consonant, while transitions with excursions longer than that would evoke the impression of gradual change that characterizes the semivowels. I well remember the excitement I felt when, having decided that a 50-msec duration might well be the auditory key, I appreciated how easy it would be to find out if, indeed, it was. The experimental test required only that I draw on the Playback a series of rising and falling isolated transitions in which the duration was varied over a wide range. What I expected and hoped to find was that a duration of 50 msec would provide a boundary between perception of a unitary stop-like burst of sound on the one side, and a semivowel kind of glide on the other. So far as I could tell, however, there appeared to be no such boundary at 50 msec, and thus no evidence of an auditory basis for the results of our experiment on the distinction between stop and semivowel. I was disappointed, but not enough to abandon my horizontal attitude about the role of auditory representations in the ontogenetic development of phonetic perception.

Categorical perception: the right prediction from the wrong theories

According to just those aspects of the Early Motor Theory that I now believe to be mistaken, the auditory percept originally evoked by the speech signal was supposed to give way to the sensory consequences of the articulatory gesture, and it was just those consequences that were ultimately perceived. In arriving at this theory, I had, of course, been much influenced by the behaviorist stimulus-response tradition in which I had been reared. It was virtually inevitable, then, that I should take the next step and consider the consequences of two processes ('acquired distinctiveness' and 'acquired similarity') that were part of the same tradition. The point was simple enough: if two stimuli become connected, through long association, to very different responses, then the feedback from those responses, having become the end-states of the perceptual process, will cause the stimuli to be more discriminable than they had originally been; conversely, if these stimuli become connected to the same response, then, for the same reason, they will be less discriminable. In fact, there was not then, and is not now, any evidence that such an effect occurs. But that did not trouble me, for I supposed that investigators had not thought to look in the right place, and I could not imagine a better place than speech perception. Neither was I troubled by what seems to me now the patently implausible assumption, basic to the concepts of acquired distinctiveness and acquired similarity, that the normal auditory representation of a suprathreshold acoustic stimulus could be wholly supplanted, or even significantly affected, by the perceptual consequences of some motor response just because the acoustic stimulus and the motor response had become strongly associated. It was for me compelling that listeners had for many years been making different articulatory responses to stimuli that happened to lie on either side of a phonetic boundary, so, by the terms of the Early Motor Theory and the theory of acquired distinctiveness, that difference should have become more discriminable. On the other hand, those listeners had been making the same articulatory response to equally different stimuli that happened to lie within the phonetic class, so, by the same theories, those stimuli should have been rendered less discriminable.

To test the theories, I thought it necessary only to get appropriate measures of acoustic-cue discriminability. As I know now, one can easily get the effect I sought simply by listening to voiced stops, for example, as the second-formant transition is changed in relatively small and equal steps, for what one hears is, first, several consonants that are almost identical [b]'s, then a rather sudden shift to [d], followed by several almost identical [d]'s, and then, again, a sudden shift, this time to [g]. Though we had the means to make this simple and quite convincing test, we did not think to try it. Instead, I put together two wrong theories and produced what my professors had taught me to strive for as one of the highest forms of scientific achievement: a real prediction.

The test of the prediction was initially undertaken by one of our graduate students, Belver Griffith, who, with my enthusiastic approval, elected to do the critical experiment on steady-state vowels. We know now that the effect we were looking for does not occur to any significant extent with such vowels, so it was fortunate that Griffith, by nature very fussy about the materials of his experiments, was unsuccessful in producing vowels he was willing to use. The happy consequence was that he, together with the rest of us, decided to move ahead, in parallel, with stop consonants.

It was not our purpose to obtain discrimination thresholds, but only to measure discriminability of a constant physical difference at various points on the continuum of second-formant transitions. To that end, we synthesized a series of 14 syllables in which the starting point of the second formant was varied in steps of 120 Hz, from a point 840 Hz below the steady state of the following vowel to a point 720 Hz above it. We then paired stimuli that were one, two, and three steps apart on the continuum, and for each such pair measured discriminability by the ABX method (A and B were members of the pair, X was one or the other, and the subject's task was to match X with A or B). The result was that there were peaks in the discrimination functions at positions on the continuum that corresponded to the phonetic boundaries as earlier determined by the way the subjects had identified the stimuli as [b], [d], or [g] when they were presented singly and in random order. This is to say that, other things equal, discrimination was better between stimuli to which the subjects habitually gave different articulatory responses than it was between stimuli to which the responses were the same. Thus, the prediction was apparently confirmed by this instance of what we chose to call 'categorical perception'.

In the published paper, we included a method, worked out by Katherine Harris, for computing from the absolute identification functions what the discrimination function would have been if, indeed, perception had been perfectly categorical; that is, if listeners had been able to perceive differences only if they had assigned the stimuli to different phonetic categories. Applying this calculation to our results, we found that, in this experiment at least, perception was rather strongly categorical, but not perfectly so.
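The logic of such a computation is easy to reconstruct, though for Harris's exact formulation the published paper should be consulted. On the simplest labeling model (my reconstruction, for the two-category case), suppose listeners can compare only the category labels they assign, and let p_A and p_B be the probabilities that stimuli A and B are identified as, say, [b]. Trials on which A and B happen to receive the same label can only be answered by guessing, and the predicted proportion correct on ABX trials then works out to

    P(correct) = 1/2 + (p_A - p_B)^2 / 2,

which sits at chance wherever the identification functions agree (p_A = p_B, as within a category) and peaks wherever they cross a phonetic boundary.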

To this point in our research, psychologists, including even those interested in language, had paid us little attention. Requests for reprints, and such other tokens of interest as we had received, had come mostly from communication engineers and phoneticians, and there were few references to our work in the already considerable literature of psycholinguistics, a field that had been established, seemingly by fiat, by a committee of psychologists and linguists who had met for a summer work session at Cornell. The result of their deliberations was a briefly famous monograph in which they defined the new discipline, constructed its theoretical framework, and posed the questions that remained to be answered. That done, they officially launched the field at a symposium held during the next national convention of the American Psychological Association. At the end of the symposium, I asked a question that provoked one of the founding fathers to inform me, icily, that speech had nothing to do with psycholinguistics. He did not say why, but then, as one of the inventors of the discipline, he was entitled to speak ex cathedra. It is, however, easy to appreciate that in looking at speech horizontally, as he and the other members of the committee surely did, one sees nothing that is linguistically interesting, only a set of unexceptional noises and equally unexceptional auditory percepts that just happen to convey the more invitingly abstract structures where anyone who would think deeply about language ought properly to put his attention.

The categorical perception paper seemed, however, to touch a psycholinguistic nerve. Perhaps this was because, if taken seriously, it showed that the phonetic units were categorical, not only in their communicative role, but also as immediately perceived; and this nice fit of perceptual form to linguistic function must have seemed at odds with the conventional horizontal assumption that the auditory percepts assume linguistic significance only after a cognitive translation, not before. Our results could be taken to imply that no such translation was necessary, and that there might, therefore, be something psycholinguistically interesting and important about the precognitive (that is, purely perceptual) processes by which listeners apprehend phonetic structures.

Most psychologists seemed unwilling to accept that implication, though not all for the same reason. Some argued that categorical perception, as we had found it, was of no consequence because it was merely an artifact of our method. This criticism boiled down to an assertion, perfectly consistent with the standard horizontal view, that the memory load imposed by the ABX procedure made it impossible for the subject to compare the initial auditory representations, forcing him to rely, instead, on the categorical phonetic names he assigned to the rapidly fading auditory echoes as they were removed from the ever-changing sensory world and elevated, for safer keeping, into short-term memory. It is, of course, true that reducing the time interval between the stimuli to be compared does raise the troughs that appear in the within-category parts of the discrimination function, and thus reduces the approximation to categorical perception. But the Early Motor Theory did not require that the articulatory responses within a phonetic category be identical, so it did not predict that perception had to be perfectly categorical. (The degree to which the articulatory responses within a category are similar presumably varies according to the category and the speaker; it is, therefore, a matter for empirical determination.) Neither did the theory in any way preclude the possibility that perceptual responses would be easier to discriminate when fresh than when stale. In any case, it surely was relevant that the peaks one finds with various measures of discrimination merely confirm the quantal shifts a listener perceives as the stimuli are moved along the physical continuum so rapidly as to make the memory load negligible.

The other criticism, which seemed almost opposite to the one just considered, was that categorical perception is not relevant to psycholinguistics because it is so common, and, more particularly, because those boundaries that the discrimination peaks mark are simply properties of the general auditory system, hence not to be taken as support for the view that speech perception is interesting, except, perhaps, within the domain of auditory psychophysics. As for the criticism that categorical perception is common, it seemed to have been based on the misapprehension that we had claimed categorical perception to be unique to speech, but in fact we had not, having merely observed (correctly) that, given stimuli that lie on some definable physical continuum, observers commonly discriminate many more than they can identify absolutely. Our claim about phonetic perception was only that there is a significant, if nevertheless incomplete, tendency for that commonly observed disparity to be reduced. On the other hand, the claim by our critics that the boundaries are generally auditory, not specifically phonetic, was important and deserved to be taken seriously. It has led to many experiments on perception of nonspeech control stimuli and on perception of speech by nonhuman animals, leaving us and the other interested parties with an issue that is still vexed. In fact, I think the weight of evidence, taken together with arguments of plausibility, overwhelmingly favors the conclusion that the boundaries are specific to the phonetic system, but I reserve the justification for that conclusion to the last section of the paper.

There was yet another seemingly widespread misapprehension about categorical perception, which was that it had served us as the primary basis for the Early Motor Theory. In fact, we (or, at least, I) have long believed that the facts about this phenomenon are consistent with the theory, but they do not by any means provide its most important support; indeed, they were not available until at least five years after we had been persuaded to the theory, as I earlier indicated, by the very first results of our search for the acoustic cues.

Finally, in the matter of categorical perception, I will guess that consonant perception is likely, when properly tested, to prove more nearly categorical than experiments have so far shown it to be. The problem with those experiments is that they have used acoustic synthesis, so it has been prohibitively difficult in any single experiment to make proper variations in more than one of the many aspects of the signal that are perceptually relevant. But when only one cue is varied, as in the experiments so far done, then, as it is changed from the form appropriate for one phone to the form appropriate for the next, it leaves all the other relevant information behind, as it were, creating a situation in which the listener is discriminating, not just the phonetic structure, but also increasingly unnatural departures from it. I suspect that, with proper articulatory synthesis, when the acoustic signal will change in all relevant aspects (at least for the cases that are produced by articulations that can be said to vary continuously), the discrimination functions will come much closer to being perfectly categorical.

The concept of 'cue' as a theoretically relevant entity

At the very least, 'cue' is a term of convenience, useful for the purpose of referring to any piece of signal that has been found by experiment to have an effect on perception. We have used the word in that sense, and continue to do so. But there was a time when cue had, at least in my mind, a more exalted status. I supposed that there was, for each phone, some specifiable number of particulate cues that combined according to a scientifically important principle to evoke the correct response. It was this understanding of cues that was implicit in the 'binary' account of their effects that I referred to earlier. The same understanding was more explicit in a dissertation on cue combination that I had urged on Howard Hoffman. Finally, and perhaps most egregiously, it became the centerpiece of our interpretation of an experiment on the effects of third-formant transitions on perception of place among the stops. Having found there that, in enhancing the perception of one stop, any particular transition does not do so equally at the expense of the other two, we concluded that a cue not only tells a listener what a speech sound is, but also which of the remaining possibilities it is not. It was almost as if we were supposing that the third-formant transition had been designed, by nature or by the speaker, just to resolve an ambiguity that the more important second-formant transition had overlooked. At all events, we went on to speculate that the response alternatives exist in a multidimensional phonetic space, and, though we were not perfectly explicit about this in the published paper, that a cue has a magnitude and a direction, just like a vector, with the result that the final position of the percept in the phonetic space is determined by the sum of the vectors. Such a conception is, of course, at odds with all the data now available that indicate how exquisitely sensitive the listener is to all the acoustic consequences of phonetically significant gestures, for what those data mean is that any definition of an acoustic cue is always to some extent arbitrary. Surely, it makes little sense to wonder about the rules by which arbitrarily defined cues combine to produce a perceptual result.

The voicing distinction: an exercise in not seeing that which is most visible

We discovered very early how to produce stops that were convincingly voiced, but we had been frustrated for five years or more while seeking the key to synthesis of their voiceless counterparts. In our quest, we had examined spectrograms, sought advice from colleagues in other laboratories, and, by trial and error on the Playback, tried every trick we could think of. We varied the strength of the burst relative to the vocalic section of the syllable, drew every conceivable kind of first-formant transition, and substituted various intensities of noise for the harmonics through varying lengths of the initial parts of the formant transitions.

In fact, there was no noise source in the Pattern Playback, only harmonics of a 120 Hz fundamental, but we had been able in research on the fricatives to make do by placing small dots of paint where noise was supposed to be. In isolation, patches of such dots sounded like a twittering of birds, but in syllabic context they produced fricatives so acceptable that, when we used them in experiments, we got results virtually identical to those obtained later when, with a new synthesizer called Voback, we were able to deploy real noise. Before Voback was available, however, we had, in our attempts at voiceless stops, to rely on the trick that had worked for the fricatives. When it did not help, we concluded that, unlike the fricatives, the voiceless stops needed true noise and, accordingly, that our inability to synthesize them was to be attributed to the noise-producing limitations of the Playback.

That we were wrong to blame the Playback became apparent one day as a consequence of a discovery by Andre Malecot, one of Pierre's graduate students. While working to synthesize the syllable-final releases of stops, he omitted the first formant of the short-duration syllable that constituted the release. I believe that he did this inadvertently. But whether by inadvertence or by design, he produced a dramatic effect: we all heard a stop that was quite clearly voiceless. Encouraged by this finding, we adapted it to stops in syllable-initial position, and carried out several related experiments. In each, we varied one potential cue for the voicing contrast for all three stops, paired with each of the vowels [i], [æ], and [u]. Our principal finding was that, with all else equal, simply delaying the onset of the first formant relative to the second and third was sufficient to cause naive listeners to shift their responses smartly between voiced and voiceless. Then, recognizing that in so delaying the onset of the first formant we were, at the same time, starting it at a higher frequency, we reconfirmed the observation we had made in our earlier experiments, which was that starting the first formant at a very low frequency was important in creating the impression of a voiced stop, but that starting it higher, at the steady state of that formant, did not, by itself, make much of a contribution to voicelessness. On the other hand, delaying the onset of the first formant without at the same time raising its starting frequency did prove to be a very potent cue. Indeed, it appeared from the responses of our listeners to be about as potent as the original combination of delay and raised starting point. (For the purpose of this experiment, we varied the delay alone by contriving a synthetic approximation to the vowel [o] in which the first formant was placed as low on the pattern as it could go; it was, then, just this straight formant that was delayed.) Next, we took advantage of the newly available synthesizer, Voback, which, as I earlier said, had a proper noise source, to experiment with the effect of noise in place of harmonics during the transitions. What we found was that substituting noise in all three formants was, by itself, ineffective, but that substituting it for harmonics in the second and third formants for the duration of the delay in first-formant onset did somewhat strengthen the impression of voicelessness. In connection with this last conclusion, we noted that when, in an attempt to produce an initial [h], which is, of course, the essence of aspiration, we replaced the harmonics of all the formants with noise for the first 50 or 100 msec, we did not get [h], but rather the impression of a vowel that changed from whispered to fully voiced; to get [h], we had to omit the first formant. To explain all this, we advanced a suggestion, made to us by Gunnar Fant, that the vocal cords were open during the period of aspiration, and that it was this circumstance that reduced the intensity of the first formant, thus effectively delaying its onset. We emphasized, then, that all the acoustically diverse cues were consequences of the same articulatory event, and therefore led, in accordance with the Motor Theory, to a percept that was perfectly coherent.

This early work on the voicing distinction was subsequently refined and considerably extended by Arthur Abramson and Leigh Lisker. In particular, it was they who established how the acoustic boundaries for the voicing distinction vary with different languages, and thus provided the basis for the great volume of later research by other investigators who exploited these differences in pursuit of their interests in the ontogenesis of speech. And it was Abramson and Lisker who accurately characterized the relevant variable as voice-onset-time (VOT), defined as the duration of the interval between the consonant opening (in the oral part of the tract) and the onset of voicing at the larynx. Unfortunately, some of the researchers who later used the voicing distinction for their own purposes ignored the fact that the VOT variable is articulatory, not acoustic, and therefore failed to take into account in their theoretical interpretations that its acoustic manifestations are complexly heterogeneous.
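The definition itself is just a signed interval, and a few lines make the bookkeeping, and the sign convention, explicit. This is a sketch only: the function names are mine, and the 25-msec boundary is a placeholder rather than a claim about any particular language, since, as just noted, the boundaries differ across languages.

    def vot_ms(release_s, voicing_onset_s):
        # Voice-onset time: onset of voicing minus consonantal release,
        # in milliseconds; negative values mean voicing leads the release.
        return 1000.0 * (voicing_onset_s - release_s)

    def classify(vot, boundary_ms=25.0):
        # Crude two-way split at a language-dependent boundary.
        return 'voiceless' if vot >= boundary_ms else 'voiced'

    # e.g., release at 0.512 s, voicing at 0.567 s -> VOT = +55 ms
    label = classify(vot_ms(0.512, 0.567))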

As for our initial discovery of the acoustic cues for the voicing distinction, I note an irony in the long search that preceded it, for once one knows where to look in the spectrogram, the delay in the first-formant onset can be measured more easily, and with greater precision, than almost any of the other consonant cues we had found. Consider, for example, how important to perception is the frequency at which a formant transition starts, and then how hard it is to specify that frequency precisely from an inspection of its appearance on a spectrogram. Yet our tireless examination of spectrograms had, in the case of the voicing distinction, availed us nothing; we simply had not seen what we now know to be so plainly there.

Synthesis by rule and a reading machine that speaks

In this very personal chronicle of our early research, I have chosen to write of just those experiments that best illustrate certain underlying assumptions I now find interesting. I have said nothing about the many other experiments that were carried out during roughly the same period of time. I would now partly repair that omission by recognizing the existence of those others, and by emphasizing that they provided a collection of data sufficient as a basis for synthesizing speech from a phonetic transcription, without the need to copy from a spectrogram. Unfortunately, all the relevant data had been brought together only in Pierre's head. Relying only on the experience he had gained from participation in our published research, and also from the countless unpublished experiments he had carried out in his unflagging effort to refine and extend, Pierre could 'paint' speech to order, as it were. But the knowledge that Pierre had in his head was, by its nature, not so public as science demands.

We therefore recommended to Frances Ingemann, when she came to spend some time at the Laboratories, that she write a set of rules for synthesis, making everything so explicit that someone totally innocent of knowledge about acoustic phonetics could, simply by following the rules, draw a spectrogram that would produce any desired phonetic structure. Accepting this challenge, she decided to rely entirely on the papers we had published in journals or in lab reports; she did not use the synthesizer to test and improve as she went along, nor did she attempt to formalize what Pierre and other members of the staff might know but had never written down. She nevertheless succeeded very well, I think, in producing what must count as the first rules for synthesis. I don't recall that we ever formally assessed the intelligibility of the speech produced by these rules, but I know that we found it reasonably intelligible. At all events, an outline of the rule system was published in 1959 under the title, 'Minimal Rules for Synthesis'. The word 'minimal' was appropriate because the rules were written at the level of features, the presumed 'atoms' of phonetic structure, not at the level of the more numerous segments or 'molecules'. I should note, too, that we took explicit notice in the paper of our belief that the rules for synthesis had better be written in articulatory terms, for then all the relevant acoustic information would be provided to the listener. There was, however, no alternative to the acoustically based rules we offered, because there was not enough known about articulation, but also because there was, in any case, no satisfactory articulatory synthesizer. Articulatory synthesis would therefore have to wait.

Meanwhile, we had, by 1966, a computing facility, and had constructed a computer-controlled, terminal-analog formant synthesizer. It was then that Ignatius Mattingly joined the staff and undertook to program the computer to produce speech by rule. For this purpose, he drew on the work he had done previously with Holmes and Shearme in England, and also on all that had been learned about the cues and rules for synthesis at the Laboratories. By 1968 the job was done. Accepting an input of a phonetic string, the system would speak. The intelligibility of the speech was tested on several occasions and in several different ways. Thus, it was tested informally by having blind veterans listen, for example, to rather long passages from Dickens and Steinbeck. It was evident that they understood the speech, even at rates of 225 words per minute, but we had no measure of exactly how hard they found it to do so, and they did complain, not without reason we thought, of what they called the machine's 'accent'. In more formal tests, the rule-generated synthetic speech came off quite well by comparison with 'real' speech, but, not unexpectedly, there was evidence of a price exacted by the extra cognitive effort that was required to overcome its evident imperfections, a price that had to be paid, presumably, by the processes of comprehension.

At that point, we had in hand a principal component of a reading machine that would convert text to speech, and thus avoid all the problems we had encountered in our earlier work with nonspeech substitutes. What was needed, in addition, was an optical character reader to convert the letters into machine-readable form, and also, of course, some way of translating spelled English into a phonetic transcription appropriate to the synthesizer. Given our history, it was inevitable that we should have been impatient to acquire these other components and see (or, more properly, hear) what a fully automatic system could do. So, Frank, Patrick Nye, and others cobbled together just such a system, using an optical character reader we bought with money given us for the purpose by the Seeing Eye Foundation, a phonetic dictionary made available by the Speech Communications Research Laboratory of Santa Barbara, and, of course, our own computer-controlled synthesizer. Tests revealed that the speech produced in this fully automatic way was almost as good as that for which the phonetic transcription had been hand-edited. But we were concerned about the evidence we had earlier collected concerning the probable consequences for ease of comprehension that arose out of the shortcomings of the speech. We therefore put together a plan to evaluate the machine with blind college students who would use it to read their assignments; having found its weaknesses, we would then try to correct them. I assumed that various federal agencies would compete to see which one could persuade us to accept their support for this undertaking. We could, after all, show that a reading machine for the blind was not pie-in-the-sky, but a do-able thing that stood in need of just the kinds of improvement that further research would surely bring. Yet, though we tried very hard with several agencies, and for several years, we failed utterly to get support, and were forced finally to abandon our plans. Still, we had the satisfaction of having proved to ourselves that a reading machine for the blind was close to being a reality. The basic research was largely complete; what remained was just the need for proper development.
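The architecture of that system reduces to three stages, sketched schematically below. Everything in the sketch is assumed for illustration: the function names are mine, the dictionary entries are toy inventions, and the two stubs merely stand in for the optical character reader and the rule-driven synthesizer, whose actual interfaces are not described here.

    PHONETIC_DICT = {'never': 'nEvR', 'kill': 'kIl'}   # toy entries only

    def ocr(page_image):
        # Stand-in for the optical character reader component.
        raise NotImplementedError

    def to_phones(text):
        # Spelled English -> phonetic string, by dictionary lookup;
        # unknown words pass through untranscribed in this sketch.
        return ' '.join(PHONETIC_DICT.get(w.lower(), w) for w in text.split())

    def synthesize(phone_string):
        # Stand-in for the computer-controlled formant synthesizer.
        raise NotImplementedError

    def read_aloud(page_image):
        # OCR -> phonetic transcription -> speech: the full chain.
        return synthesize(to_phones(ocr(page_image)))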

ON BECOMING VERTICAL, OR HOW I RIGHTED MYSELF

To this point, my concern has been to describe the various forms of the horizontal view that my colleagues and I held during our work on nonspeech reading machines and in the early stages of the research on speech to which it led. Now I mean to offer a more detailed account of the important differences between that view and the vertical view I now hold. In so doing, I draw freely, and without specific attribution, on a number of theoretically oriented papers that were written in close collaboration with various of my colleagues. Among the most relevant of these are several reviews by Ignatius Mattingly and me in which we hammered out the vertical view as I (we) see it now.

All these theoretically oriented papers deal, at least implicitly, with questions about speech to which the horizontal and vertical views give different, sometimes diametrically opposed, answers. Such questions serve well, therefore, to define the two positions, and to explain how I came to abandon the one for the other; for those reasons, I will organize this section of the paper around them.

The issue that unites the questions pertains to the place of speech in the biological scheme of things. That I should have come to regard that issue as central is odd, given the habits of mind I had brought to the research, for, as I earlier implied, my education in psychology had been unremittingly abiological. I had, to be sure, studied a little physiology, narrowly conceived, and it cannot have escaped my notice that in the physiological domain things were not of a piece, having been formed, rather, into distinct systems for correspondingly distinct functions. At the level of behavior, however, I saw only an overarching sameness, a reflection of my attachment to principles so general as to apply equally to a process as natural as learning to speak and as arbitrary as memorizing a list of nonsense syllables.

I think I was moved first, and most generally, to a different approach by scientists who work, not on speech, but on other forms of species-typical behavior. Thus, it was largely under the influence of people like Peter Marler, Nobuo Suga, Mark Konishi, and Fernando Nottebohm that I came to see myself as an ethologist, very much like them, and to appreciate that I would be well advised to begin to think like one. They helped me to understand that speech is to the human being as echolocation is to the bat or song is to the bird; to see, that is, that all these behaviors depend on biologically coherent faculties that were specifically adapted in evolution to function appropriately in connection with events that are of particular ecological significance to the animal. To the horizontalist that I once was, this was heresy; but to the verticalist I was in process of becoming, it was the beginning of wisdom.

Meanwhile, back at the Laboratories there were biological stirrings on the part of Michael Studdert-Kennedy, who is nevertheless not a committed verticalist, and Ignatius Mattingly, who is. Michael has been a constructive critic in regard to virtually every biologically relevant notion I have dared to entertain. As for the biological slant of the vertical view (including the Revised Motor Theory), that is as much Ignatius's contribution as mine. Indeed, the view itself is the result of a joint effort, though, of course, he bears no responsibility for what I say about it here.

Among the influences of a somewhat different sort, there was the growing realization that my early horizontal view did not sit comfortably even with the results of the early research it was designed to explain. That will have been seen in what I have already said about my attempts to account for those results, and, especially, about the patch on the horizontal view that I have here called the Early Motor Theory. Heavily loaded as it was with untested and wholly implausible assumptions (for example, that auditory percepts could, as a result of learning, be replaced by sensations of movement), it had begun to fall of its own weight.

Contributing further to the collapse of the Early Motor Theory was the research, pioneered by Peter Eimas and his associates, in which it was found that prelinguistic infants had a far greater capacity for phonetic perception than a literal reading of the theory would allow.

My faith was further weakened by the work of Katherine Harris and Peter MacNeilage, who, as the first of the Laboratories' staff to work on speech production, were busily finding a great deal of context-conditioned variability in the peripheral articulatory movements (as reflected in electromyographic measures), and thus disproving one of the assumptions of the Early Motor Theory, which was that the articulatory invariant was in the final-common-path commands to the muscles.

At the same time, Michael Turvey was pointing the way to an appropriate revision by showing how, given context-conditioned variability at the level of movement, it is nevertheless possible, indeed necessary, to find invariance in the more remote motor entities that Michael called 'coordinative structures'. In any case, Michael and Carol Fowler were strongly encouraging me to persevere in the aspect of the Early Motor Theory that took gestures to be the objects of speech perception, while simultaneously heaping scorn on the idea that perception was a second-order translation of a sensory representation, as the horizontal version of the theory required. I began, therefore, to take more seriously the possibility that there is no mediating auditory percept, only the immediately perceived gestures as provided by a system (the phonetic module) that is specialized for the ecologically important function of representing them.

Not that Michael and Carol or, indeed, any of the other 'ecological' psychologists in the Laboratories, are verticalists. They most certainly are not, because they do not accept (yet) that there is a distinct phonetic mode, preferring, rather, to take speech perception as simply one instance of the way all perception is tuned to perceive the distal objects; in the case of speech, these just happen to be the articulatory gestures of the speaker. Thus, I have been in the happy position of taking advantage of the best of what my ecological friends have had to offer, while freely rejecting the rest, and, as an important bonus, being stimulated by our continuing disagreements to correct weaknesses and repair omissions in my own view.

It was also relevant to the development of my thinking that Isabelle Liberman, Donald Shankweiler, and Ignatius Mattingly (followed later by such younger colleagues as Benita Blachman, Susan Brady, Anne Fowler, Hyla Rubin, and Virginia Mann) had begun to see in our research how to account for the fact that speech is so much more natural (hence easier) than reading and writing, and thus to be explicit about what is required of the would-be reader/writer that mastery of speech will not have taught him. As I will say later, their insights and the results of their empirical work illuminated aspects of the vertical view that I would otherwise not have seen.

I was affected, too, by the results of experiments on duplex perception, trading relations, and integration of cues, experiments that went beyond those, referred to earlier, that merely isolated the cues and looked for discontinuities in the discrimination functions. These later experiments, done (variously) in close collaboration with Virginia Mann, Bruno Repp, Douglas Whalen, Hollis Fitch, Brad Rakerd, Joanne Miller, Michael Dorman, and Lawrence Raphael (few of whom are admitted verticalists) provided data that spoke more clearly than the earlier findings to some of the shortcomings of the horizontal position, and therefore inclined me ever more strongly to the vertical alternative.

Finally, I should acknowledge the profound effect of Fodor's provocative monograph, 'The Modularity of Mind', which, in the early stages of my conversion, enlightened and stimulated me by its arguments in favor of the most general aspects of the vertical view.

That I should finally have asked the following questions, and answered them as I do, reflects the influences I have just described, and fairly represents the theoretical position to which they moved me.

In the development of phonological communication, what evolved?

Defined as the production and perception of consonants and vowels, speech, as well as the phonological communication it underlies, is plainly a species-typical product of biological evolution. All neurologically normal human beings communicate phonologically; no other creatures do. The biologically important function of phonologic communication derives from the way it exploits the combinatorial principle to generate vocabularies that are large and open, in contrast to the vocabularies of nonhuman, nonphonologic systems, which are small and closed. Thus, phonological processes are unique to language and to the human beings who command them. It follows that anyone who would understand how speech works must answer the question: what evolved? Not when, or why, or how, or by what progression from earlier-appearing stages. The first question is simply: what?

The answer given by the horizontal view is clear: at the level of action and perception, nothing evolved; language simply appropriated for its own purposes the plain-vanilla ways of acting and perceiving that had developed independently of any linguistic function. Thus, those horizontalists who put their attention rather narrowly on the perceptual side of the process argue that the categories of phonetic perception simply reflect the way speech articulation has accommodated itself to the production of sounds that conform to the properties of the auditory system, a claim that I will evaluate in some detail later. A recent and broader, but still horizontal, take on the same issue distributes the emphasis more evenly between production and perception, arguing that phonetic gestures were selected by language on the basis of constraints that were generally motor, as well as generally auditory. However, the important point in this, as in the narrower view, is that the constraints are independent of a phonetic function, hence in no way specific to speech. Put forth as an explicit challenge to the vertical assumption, the broader view has it that there is no reason to assume a special mode for the production and perception of speech, if, with the proper horizontal orientation, one can see that the units of speech are optimized with respect to motor and perceptual constraints that are biologically general.

But the question is not whether language somehow developed out of the biology that was already there; surely, it could hardly have done otherwise. The question, to put it yet again, asks, rather, what did that development produce as the basis for a unique mode of communication? When the horizontalists say that the development of this mode was accomplished merely by a process of selection from among the possibilities offered by general faculties that are independent of language, they are giving an account that applies as well to the development of, say, a cursive writing system. Was not the selection of the cursive gestures similarly determined by motor and perceptual constraints that are independent of language? Yet, what that selection produced were not the biologically primary units of speech, but only a set of optical artifacts that had then to be connected to speech in a wholly arbitrary way. Of course, this is merely to say the obvious about the relation between speech and a writing system, which is that the evolution of the one was biological, the other, not. That is surely a critical difference, but one that the horizontal view must have difficulty comprehending.

If pressed further to answer the question about the product of evolution, the horizontalists would presumably have to say that, while nothing evolved at the level of perception and action, there must have been relevant developments at a higher cognitive level. Thus, it would have been evolution that produced the phonetic entities of a cognitive type to which the nonphonetic acts and percepts of speech must, on the horizontal view, be associated. Being neither acts nor percepts, these cognitive entities (or ideas, as they might be) would presumably be acceptable within the horizontal framework as genetically determined adaptations for language, hence special in a way that speech is not allowed to be. In itself, this seems an unparsimonious, not to say biologically implausible, assumption. And it can be seen to be the more unparsimonious and implausible once the horizontalist tries to explain how the phonetically neutral acts and percepts got connected to the specialized cognitive entities in the first place. In the case of a script, to wring one more point out of that tired example, the obviously nonphonetic motor and visual representations of the writer and reader were connected to language by agreement among the interested parties. Can we seriously propose a similar account for speech?

If the horizontalists should reject the notion that phonetic ideas were the evolutionary bases for speech, there remains to them the most thoroughly horizontal view of all, which is that what evolved was a large brain. In that case, they might suppose either that phonological communication was an inevitable by-product of the cognitive power that such a brain provides, which seems unlikely, or that phonological communication was an invention, created by large-brained people who were smart enough to have appreciated the immense advantages for communication of the combinatorial principle, which seems absurd.


The vertical view is different on all counts. What evolved, on this view, was the phonetic module, a distinct system that uses its own kind of signal processing and its own primitives to form a specifically phonetic way of acting and perceiving. It is, then, this module that makes possible the phonological mode of communication.

The primitives of the module are gestures of the articulatory organs. These are the ultimate constituents of language, the units that must be exchanged between speaker and listener if linguistic communication is to occur. Standing apart as a class from the nonphonetic activities of the same organs (for example, chewing, swallowing, moving food around in the mouth, and licking the lips), these gestures serve a phonetic function and no other. Hence they are marked by their very nature as exclusively phonetic in character; there is no need to make them so by connecting them to linguistic entities at the cognitive level. As part of the larger specialization for language, they are, moreover, uniquely appropriate to other linguistic processes. Thus, the syntactic component is adapted to operate on the specifically phonetic representations of the gestures, not on representations of an auditory kind. Indeed, it is precisely this harmony among the several components of the language specialization that makes the epithet 'vertical' particularly apposite for the view I am here promoting.

Of course, the gestures constitute only the phonetic structures that the perceptual process extracts from the speech signal. Such aspects of the percept as, for example, those that contribute to the perceived quality of the speaker's voice are not part of the phonetic system. Indeed, these are presumably auditory in the ordinary sense, except as they may figure in speaker identification, for which there may be a separate specialization.

It is not only the gestures themselves that are specifically phonetic, but also, presumably, their control and coordination. Surely, there is in speech production, as in all kinds of action, the need to cope with the many-to-one relations between means and ends, and also to reduce degrees of freedom to manageable proportions. In these respects, then, the management of speech and nonspeech movements should be subject to the same principles. But there is, in addition, something that seems specific to speech: the grossly overlapped and smoothly merged movements at the periphery are controlled by, and must preserve information about, relatively long strings of the invariant, categorical units that speech cares about but other motor systems do not. And, certainly, it is relevant to the claim about a specialized mode of production that speech, in the very narrowest sense, is species-specific: given every incentive and opportunity to learn, chimpanzees are nevertheless unable to manage the production of simple CVC syllables. (The fact that the dimensions of their vocal tracts presumably do not allow a full repertory of vowels should not, in itself, preclude the articulation of syllables with whatever vowels their anatomy permits.)

As for the evolution of the phonetic gestures, I should think an important selection factor was not so much the ease with which they could be articulated, or the auditory salience of the resulting sound, but rather how well they lent themselves to being coarticulated. For it is coarticulation that, as I will have occasion to say later, makes phonological communication possible.

But it is also this very coarticulation that, as we saw earlier, frustrates the attempt to find the phonetic invariant in the acoustic signal or in the peripheral movements of the articulators. Still, such motor invariants must exist, not just for the aspect of the Motor Theory that explains how phonetic segments are perceived, but for just any theory that presumes to explain how they are produced; after all, speech does transmit strings of invariant phonological structures, so these invariants must be represented in some way and at some place in the production process. But how are they to be characterized, and where are they to be found? Having accepted the evidence that they are not in the peripheral movements, as the Early Motor Theory assumed, Mattingly and I proposed in the Revised Motor Theory that attention be paid instead to the configurations of the vocal tract as they change over time and are compared with other configurations produced by the same gesture in different contexts. As for the invariant causes of these configurations, they are presumably to be found in the more remote motor entities (something like Turvey's coordinative structures) that control the various articulator movements so as to accomplish the appropriate vocal-tract configurations. It is, I now think, structures of this kind that represent the phonetic primitives, providing the physiological basis for the phonetic intentions of the speaker and the phonetic percepts of the listener. Unfortunately for the Motor Theory, we do not yet know the exact characteristics of these motor invariants, nor can we adequately describe the processes by which they control the movements of the articulators. My colleagues, including especially Catherine Browman, Louis Goldstein, Elliot Saltzman, and Philip Rubin, are currently in search of those invariants and processes, and I am confident that they will, in time, succeed in finding them. Meanwhile, I will, for all the reasons set forth in this paper, remain confident that motor invariants do exist, and that they are the ultimate constituents of speech, as produced and as perceived.

According to the Revised Motor Theory, then, there is a phonetic module, part of the larger specialization for language, that is biologically adapted for two complementary processes: one controls the overlapping and merging of the gestures that constitute the phonetic primitives; the other processes the resulting acoustic signal so as to recover, in perception, those same primitives. On this view, one sees a distinctly linguistic way of doing things down among the nuts and bolts of action and perception, for it is there, not in the remote recesses of the cognitive machinery, that the specifically linguistic constituents make their first appearance. Thus, the Revised Motor Theory is very different from its early ancestor; the two remain as one only in supposing that the object of perception is the invariant gesture, not the context-sensitive sound.

How is the requirement for parity met?

In all communication, whether linguistic or not, sender and receiver must be bound by a common understanding about what counts: what counts for the sender must count for the receiver, else communication does not occur. In the case of speech, speaker and listener must perceive, or otherwise know, that, out of all possible signals, only a particular few have linguistic significance. Moreover, the processes of production and perception must somehow be linked; their representations must, at some point, be the same. Though basic, this requirement tends to pass unnoticed by those who look at speech horizontally, and, especially, by those whose preoccupation with perception leaves production out of account. However, vertical Motor Theorists like Ignatius Mattingly and me are bound to think the requirement important, so we have given it a name, 'parity', and asked how, in the case of speech communication, it was established and how maintained.

Horizontalists must, I think, find the question very hard. For if, as their view would have it, the acts of the speaker are generally motor and the percepts of the listener generally auditory, then act and percept have in common only that neither has anything to do with language. The horizontalist is therefore required to assume that these representations are linked to language and to each other only insofar as speaker and listener have somehow selected them for linguistic use from the indefinitely large set of similarly nonphonetic alternatives, and then connected them at a cognitive level to the same phonetic name or other linguistic entity. Altogether, a roundabout way for a natural mode of communication to work.

For the verticalists, on the other hand, the question is easy. On their view, it was specifically phonetic gestures that evolved, together with the specialized processes for producing and perceiving them, and it is just these gestures that provide the common currency with which speaker and listener conduct their linguistic business. Parity is thus guaranteed, having been built by evolution into the very bones of the system; there is no need to arrive at agreements about which signals are relevant and how they are to be connected to units of the language.

How is speech related to other natural modes of communication?

I noted earlier that human beings communicate phonologically but other creatures do not, and, further, that this difference is important, because it determines whether the inventory of 'words' is open or closed. Now, in the interest of parsimony, I ask whether either view of speech allows that there is, nevertheless, something common to two modes of communication that are equally natural.

On the horizontal view, the two modes must be seen as different in every important respect. Nonhuman animals, the horizontalists would presumably agree, communicate as they do because of their underlying specializations for producing and perceiving the appropriate signals. I doubt that anyone would seriously claim that these require to be translated before they can take on communicative significance. For the human, however, the horizontal position, as we have seen, is that the specialization, if any, is not at the level of the signal, but only at some cognitive remove. I find it hard to imagine what might have been gained for human beings by this evolutionary leap to an exclusively cognitive representation of the communicative elements, except, perhaps, the smug satisfaction they might take in believing that they communicate phonologically, and the nonhuman animals do not, because they have an intellectual power the other creatures lack, and that even in the most basic aspects of communication they can count themselves broad generalists, while the others must be seen as narrow specialists.


On the vertical view, human and nonhuman communication alike depend on a specialization at the level of the signal. Of course, these specializations differ one from another, as do the vehicles (acoustic, optical, chemical, or electrical) that they use. And, surely, the phonetic specialization differs from all the others in a way that is, as we know, critical to the openness or generativity of language. Still, the vertical view permits us to see that phonetic communication is not completely off the biological scale, since it is, like the other natural forms of communication, a specialization all the way down to its roots.

What are the (special) requirements of phonological communication, and how are they met?

If phonology is to use the combinatorial principle, and so serve its critically important function of building a large and open vocabulary out of a small number of elements, then it must meet at least two requirements. The more obvious is that the phonological segments be commutable, which is to say discrete, invariant, and categorical. The other requirement, which is only slightly less obvious, concerns rate. For if all utterances are to be formed by stringing together an exiguous set of commutable elements, then, inevitably, the strings must run to great lengths. There is, therefore, a high premium on rapid communication of the elements, not only in the interest of getting the job done in good time, but also in order to make things reasonably easy, or even feasible, for those other processes that have got to organize the phonetic segments into words and sentences.

Consider how these requirements would be met if, as the horizontal view would have it, the elements were sounds and the auditory percepts they evoke. If it were these that had to be commutable, then surely it would have been possible to make them so, but only at the expense of rate. For sounds and the corresponding auditory percepts to be discrete, invariant, and categorical would require that the segmentation be apparent at the surface of the signal and in the most peripheral aspects of the articulation. How else, on the horizontal view, could commutability be achieved, except as each discrete sound and associated auditory percept were produced by a correspondingly discrete articulatory maneuver? Of course, the sounds and the percepts might be joined, as are the segments of cursive writing, and that might speed things up a bit, but, exactly as in cursive writing, the segmentation would nevertheless have to be patent. The consequence would be that, to say a monosyllabic word like 'bag', the speaker would have to articulate the segments discretely, and that would produce, not the monosyllable 'bag', but the trisyllable [bə] [æ] [gə]. To articulate the syllable that way is not to speak, but to spell, and spelling would be an impossibly slow and tedious way to communicate language.

One might imagine that if production had been the only problem in the matter of rate, nature might have solved it by abandoning the vocal tract, providing her human creatures, instead, with acoustic devices specifically adapted to producing rapid-fire sequences of sound. That would have taken care of the production problem, while, at the same time, defeating the ear. The problem is that, at normal rates, speech produces from eight to ten segments per second, and, for short stretches, at least double that number. But if each of those were a unit sound, then rates that high would strain the temporal resolving power of the ear, and, of particular importance to phonetic communication, also exceed its ability to perceive the order in which the segments had been laid down. Indeed, the problem would be exactly the one we encountered when, in the early work on reading machines, we presented acoustic alphabets at high rates.
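The arithmetic behind the rate problem can be made concrete in a minimal sketch. The segment rates below are the ones just cited; the per-item ordering limit is an assumed round figure for how much time a listener needs per item to report the order of unrelated sounds, not a measured constant.

```python
# Back-of-envelope arithmetic for the rate problem described above.
# ORDER_LIMIT_MS is an assumed illustrative value, not a measurement.
ORDER_LIMIT_MS = 150.0

for rate in (8, 10, 20):  # segments per second; 20 = brief fast stretches
    per_segment_ms = 1000.0 / rate
    verdict = "within" if per_segment_ms >= ORDER_LIMIT_MS else "beyond"
    print(f"{rate:2d} segments/s -> {per_segment_ms:5.1f} ms each "
          f"({verdict} the assumed ordering limit)")
```

Even at the normal eight-to-ten-segment rate, a sequence of unit sounds would, on this assumed limit, already be too fast for reliable order judgments.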

According to the vertical view, nature solved the rate problem by avoiding the acoustic-auditory (horizontal) strategy that would have caused it. What evolved as the phonetic constituents were the special gestures I spoke of earlier. These serve well as the elements of language, because, if properly chosen and properly controlled, they can be coarticulated, so strings of them can be produced at high rates. In any case, all speakers of all languages do, in fact, coarticulate, and it is only by this means that they are able to communicate phonologically as rapidly as they do.

Coarticulation had happy consequences for perception, too. For coarticulation folds information about several successive segments into the same stretch of sound, thereby achieving a parallel transmission of information that considerably relaxes the constraint imposed by the temporal resolving properties of the ear. But this gain came at the price of a relation between acoustic signal and phonetic message that is complex in a specifically phonetic way. One such complication is the context-conditioned variability in the acoustic signal that I identified as the primary motivation for the Early Motor Theory, presenting it then as if it were an obstacle that the processes postulated by the theory had to overcome. Now, on the Revised Motor Theory, we can see that same variability as a blessing, a rich source of information about phonetic structure, and, especially, about order. Consider, again, the difficulty the auditory system has in perceiving accurately the ordering of discrete and brief sounds that are presented sequentially. Coarticulation effectively gets around that difficulty by permitting the listener to apprehend order in quite another way. For, given coarticulation, the production of any single segment affects the acoustic realization of neighboring segments, thereby providing, in the context-conditioned variation that results, accurate information about which gesture came first, which second, and so on. Hence, the order of the segments is conveyed largely by the shape of the acoustic signal, not by the way pieces of sound are sequenced in it. For example, in otherwise comparable consonant-vowel and vowel-consonant syllables, the listener is not likely to mistake the order, however brief the syllables, because the transitions for prevocalic and postvocalic consonants are mirror images. But these will have the proper perceptual consequence only if the phonetic system is specialized to represent the strongly contrasting acoustic signals, not as similarly contrasting auditory percepts, but as the opening and closing phases of the same phonetic gesture. Accordingly, order is given for free by processes that are specialized to deal with the acoustic consequences of coarticulated phonetic gestures. Thus, we see that a critical function of the phonetic module is not so much to take advantage of the properties of the general motor and auditory systems (a matter that was briefly examined earlier) as it is to find a way around their limitations.
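The mirror-image point lends itself to a small schematic. The sketch below is my own illustration, with invented frequencies and frame counts: it builds piecewise-linear second-formant tracks for a CV syllable and its VC counterpart and confirms that the two are time-reversed images of one another.

```python
import numpy as np

def f2_track(locus_hz, vowel_hz, trans_frames=5, steady_frames=15, cv=True):
    """Schematic piecewise-linear F2 center-frequency track, one value per
    frame. CV: transition into the vowel, then steady state; VC: mirror."""
    transition = np.linspace(locus_hz, vowel_hz, trans_frames)
    steady = np.full(steady_frames, float(vowel_hz))
    return (np.concatenate([transition, steady]) if cv
            else np.concatenate([steady, transition[::-1]]))

cv = f2_track(1000, 1200, cv=True)   # opening transition into the vowel
vc = f2_track(1000, 1200, cv=False)  # closing transition out of it
assert np.allclose(cv, vc[::-1])     # the two tracks mirror each other in time
```

The shape of the track, not the sequencing of discrete pieces of sound, is what distinguishes the two orders.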

Could the assignment of the stimulus information to phonetic categories plausibly be auditory?

Many of the empirically based arguments about the two theories of speech, including, especially, most of those that have been advanced against the vertical position, come from experiments, much like those described in the first section of this essay, that were designed very simply to identify the information that leads to perception of phonetic segments. The results of these experiments have proved to be reliable, so there is quite general agreement about the nature of the relevant information. Disagreement arises only, but nonetheless fundamentally, about the nature of the event that the information is informing about. Is it the sound, as a proper auditory (and horizontal) view would have it, or the articulatory gesture, which is the choice of the vertically oriented Motor Theory?

The multiplicity of acoustic-phonetic boundaries and cues. Research of the kind just referred to has succeeded in isolating many acoustic variables important to perception of the various phonetic segments, and in finding for each the location of the boundary that separates the one segment from some alternative, for example, [ba] from [da]. The horizontalists take satisfaction in further experiments on some of these boundaries in which it has been found that they are exhibited by nonhuman animals, or by human observers when presented with nonspeech analogs of speech, for these findings are, of course, consistent with the assumption that the boundaries are auditory in nature. In response, the verticalists point to experiments in which it has been found that the boundaries differ between human and nonhuman subjects, and, in humans, between speech and nonspeech analogs, arguing, in their turn, that these findings support the view that the boundaries are specifically phonetic. Indeed, for some parties to the debate it has been in the interpretation of these boundaries that the difference between the two views has come into sharpest focus. The issue therefore deserves to be further ventilated.

It is now widely accepted that the location of the acoustic-phonetic boundary on every relevant cue dimension varies greatly as a function of phonetic context, position in the syllable, and vocal-tract dimensions. It is now also known, and accepted, that some vary with differences in language type, linguistic stress, and rate of articulation. For at least one of these, rate of articulation, the variation is possibly continuous. From all this it follows that the number of acoustic-phonetic boundaries is indefinitely large, far too large, surely, to make reasonable the assumption that they are properties of the auditory system. How would these uncountably many boundaries have been selected for as that system evolved? Surely, not just against the possibility that language would come along and find them useful. Indeed, as auditory properties, they would presumably be dysfunctional, since they are perceptual discontinuities of a sort, and would, therefore, cause continuously varying acoustic events to be perceived discontinuously, thereby frustrating veridical perception.
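The force of the boundary-multiplicity argument can be seen in a toy model. In the sketch below, the linear form and the constants are my own assumptions for illustration; only the qualitative dependence of the boundary on rate is taken from the text. A single voice-onset-time boundary that shifts continuously with speaking rate already yields a different boundary for every rate, so the 'same' acoustic value is categorized differently in different contexts.

```python
def vot_boundary_ms(syllable_ms):
    """Assumed linear shift of a [b]-[p] voice-onset-time boundary with
    syllable duration (slower speech -> later boundary); invented constants."""
    return 15.0 + 0.05 * syllable_ms

def classify(vot_ms, syllable_ms):
    return "p" if vot_ms > vot_boundary_ms(syllable_ms) else "b"

# The same 30-ms voice-onset time is heard differently at different rates:
print(classify(30.0, syllable_ms=200))  # fast speech -> 'p'
print(classify(30.0, syllable_ms=400))  # slow speech -> 'b'
```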

The matter is the worse confounded for the auditory theory when proper account is taken of the fact that, for every phonetic segment, there are multiple cues, and that phonetic perception uses all of them. For if, in accounting for the perception of certain consonantal segments, we attribute an auditory basis to all the context-variable boundaries on, say, the second-formant transitions (already a dubious assumption, as we've seen), then what do we do about the third-formant transitions and the bursts (or fricative noises)? These various information-bearing aspects of the signal are not independently controllable in speech production, so one must wonder about the probability that a gesture so managed as to have just the 'right' acoustic consequences for the second-formant transition would happen, also, to have just the right consequences in all cases for the other, acoustically very different cues. On its face, that probability would seem to be vanishingly small.

Nor does it help the horizontal position to suggest, as some have, that the acoustic-phonetic boundaries exhibited by nonhuman animals served merely as the auditory starting points (the protoboundaries, as it were) to which all the others were somehow added. For this is to suppose that, out of the many conditions known to affect the acoustic manifestation of each phonetic segment, some one is canonical. But is it plausible to suppose that there really are canonical forms for vocal-tract size, rate of articulation, condition of stress, language type, and all the other conditions that affect the speech signal? And what of the further implications? What, for example, is the status of the countless other boundaries that had then to be added in order to accommodate the noncanonical forms? Did they infect the general auditory system, or were they set apart in a distinct phonetic mode? If the former, then why does everything not sound very much like speech? If the latter, then are we to suppose that the listener shifts back and forth between auditory and phonetic modes depending on whether or not it is the canonical form that is to be perceived?

None of this is to say that natural boundaries or discontinuities do not exist in the auditory system (I believe there is evidence that they do), but rather to argue that they are irrelevant to phonetic perception.

All of the foregoing considerations are simply grist for the Motor Theory mill. For on that theory, the phonetic module uses the speech stimulus as information about the gestures, which are the true and immediate objects of phonetic perception, and so finds the acoustic-phonetic boundaries where the articulatory apparatus happened, for its own very good reasons, to put them. The auditory system is then free to respond to all other acoustic stimuli in a way that does not inappropriately conform their perception to a phonetic mold.

Integrating cues that are acoustically heterogeneous, widely distributed in time, and shared by disparate segments. Having already noted that there are typically many cues for a phonetic distinction, I take note of the well-known fact that these many cues are, more often than not, acoustically heterogeneous. Yet, in the several cases so far investigated, they can, within limits, be traded, one for the other, without any change in the immediate percept. That is, with other cues neutralized, the phonetic distinction can be produced by any one of the acoustically heterogeneous cues, with perceptual results that are not discriminably different. Since these perceptual equivalences presumably exist among all the cues for each such contrast, the number of equivalences must be very great, indeed. But how are these many equivalences to be explained? From an acoustic or auditory-processing standpoint, what do such acoustically diverse, but perceptually equivalent, cues have in common? Or, in the absence of that commonality, how plausible is it to suppose that they might nevertheless have evolved in the auditory system in connection with its nonspeech functions? In that regard, one asks the same questions I raised about the claim concerning the auditory basis of the boundary positions. What general auditory function would have selected for these equivalences? Would they not, in almost every case, be dysfunctional, since they would make very different acoustic events sound the same? And, finally, what is the probability that speakers could so govern their articulatory gestures as to produce for each particular phonetic segment exactly the right combination of perceptually equivalent cues?

The Revised Motor Theory has no difficulty dealing with the foregoing facts about stimulus equivalence. It notes simply that the acoustically heterogeneous cues have in common that they are products of the same phonetically significant gesture. Since it is the gesture that is perceived, the perceptual equivalence necessarily follows.
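One way to picture a trading relation is as a decision over combined evidence. The sketch below is an invented linear model (the weights and the criterion are arbitrary) of one much-studied contrast, 'say' versus 'stay', in which silence duration and formant-transition cues can trade against each other to yield the same percept.

```python
def percept(silence_ms, transition_strength):
    """Assumed linear combination of two acoustically heterogeneous cues
    with an arbitrary decision criterion (illustrative values only)."""
    evidence = 0.02 * silence_ms + 1.0 * transition_strength
    return "stay" if evidence > 1.0 else "say"

# More silence can substitute for a weaker transition cue, and vice versa:
print(percept(silence_ms=60, transition_strength=0.0))  # -> 'stay'
print(percept(silence_ms=20, transition_strength=0.8))  # -> 'stay'
```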

Also relevant to the argument is the fact that phonetic perception integrates into a coherent phonetic segment a numerous variety of cues that are, because of coarticulation, widely dispersed through the signal and used simultaneously to provide information for other segments in the string, including not only their position, as earlier noted, but also their phonetic identity. The simplest examples of such dispersal were among the very earliest findings of our speech research, as described in the first half of this essay. Since then, the examples have been multiplied to include cases in which the spread of the cues for a single segment is found to be much broader than originally supposed, extending, in some utterances, from one end of a complex syllable to the other; yet, even in these cases, the phonetic system integrates the information appropriately for each of the constituent segments. I have great difficulty imagining what function such integration would serve in a system that is adapted to the perception of nonspeech events. Indeed, I should suppose that it would sometimes distort the representation of events that were discrete and closely sequenced.

Again, the Revised Motor Theory has a ready, if by now expected, explanation: the widely dispersed cues are brought together, as it were, into a single and perceptually coherent segment because they are, again, the common products of the relevant articulatory gesture.

Integrating acoustic and optical information. It is by now well known that, as Harry McGurk demonstrated some years ago, observers form phonetic percepts under conditions in which some of the information is acoustic and some optical, provided the optical information is about articulatory gestures. Thus, when observers are presented with acoustic [ba], but see a face saying [de], they will, under many conditions of intensity and clarity of the signal, perceive [da], having taken the consonant from what they saw and the vowel from what they heard. Though the perceptual effect is normally quite compelling, the result is typically experienced as slightly imperfect by comparison with the normal case in which acoustic and optical stimuli are in agreement. But the observers can't tell what the nature of the imperfection is. That is, they can't say that it is to be attributed to the fact that they heard one thing but saw another. Left standing, therefore, is the conclusion that the McGurk effect provides strong evidence for the equivalence in phonetic perception of two very different kinds of physical information, acoustic and optical.

For those who believe that speech perception is auditory, the explanation of the McGurk effect must be that the unitary percept is the result of a learned association between hearing a phonetic structure and seeing it produced. As an explanation of the phenomenon, however, such an account seems manifestly improbable, since it requires us to believe, contrary to all experience, that a convincing auditory percept can be elicited by an optical stimulus, or that an auditory percept and a visual percept become indistinguishable as a consequence of frequent association. Indeed, we are required to believe, even more implausibly, that the seemingly auditory percept elicited by the optical stimulus is so strong as to prevail over the normal (and different) auditory response to a suprathreshold acoustic stimulus that is presented concurrently. If there were such drastic perceptual consequences of association in the general case, then the world would sometimes be misrepresented to observers as they gained experience with percepts in different modalities that happened often to be contiguous in time. Fortunately for our relation to the world, there is no reason to suppose that such modality shifts, and the consequent distortions of reality, ever occur. As for the implications of the horizontal account for the McGurk effect specifically, we should expect that the phenomenon would be obtained between the sounds of speech and print, given the vast experience that literate adults have had in associating the one with the other. Yet the effect does not occur with print. It also weighs against the same account that prelinguistic infants have been shown to be sensitive to the correspondence between speech sounds and seen articulatory movements, which is, of course, the basis of the McGurk effect.

On the vertical view, the McGurk phenomenon is exactly what one would expect, since the acoustic and optical stimuli are providing information about the same phonetic gesture, and it is, as I have said so relentlessly, precisely the gesture that is perceived.

Just how 'special' is speech perception?

The claim that speech perception is special has been criticized most broadly, perhaps, on the ground that it is manifestly unparsimonious and lacking in generality. Unparsimonious, because a "special" mechanism is necessarily an additional mechanism; and lacking in generality, because that which is special is, by definition, not general.

As for parsimony, I have already suggested that the shoe is on the other foot. For the price of denying a distinctly phonetic mode at the level of perception is having to make the still less parsimonious assumption that such a mode begins at a higher cognitive level, or wherever it is that the auditory percepts of the horizontal view are converted to the postperceptual phonetic shapes they must assume if they are to serve as the vehicles of linguistic communication.

But generality is another matter. Here, the horizontal view might appear to have the advantage, since it sees the perception of speech as a wholly unexceptional example of the workings of an auditory modality that deals with speech just as it does with all the other sounds to which the ear is sensitive. In so doing, however, this view sacrifices what is, I think, a more important kind of generality, since it makes speech perception a mere adjunct to language, having a connection to it no less arbitrary than that which characterizes the relation of language to the visually perceived shapes of an alphabet. The vertical view, on the other hand, shows the connection to language to be truly organic, permitting us to see speech perception as special in much the same way that other components of language perception are special. I have already pointed out in this connection that the output of the specialized speech module is a representation that is, by its nature, specifically appropriate for further processing by the syntactic component. Now I would add that the processes of phonetic and syntactic perception have in common that the distinctly linguistic representations they produce are not given directly by the superficial properties of the signal. Consider, in this connection, how a perceiving system might go about deciding whether or not an acoustic signal contains phonetic information. Though there are, to be sure, certain general acoustic characteristics of natural speech, experience with synthetic speech has shown that none of them necessarily signals the presence of phonetic structure. Having already noted that this was one of the theoretically interesting conclusions of the earliest work with the highly schematized drawings used on the Pattern Playback, I add now that more convincing evidence of the same kind has come from later research that carried the schematization of the synthetic patterns to an extreme by reducing them to three sine waves that merely follow the center frequencies of the first three formants. These bare bones have nevertheless proved sufficient to evoke phonetic percepts, even though they have no common fundamental, no common fate, nor, indeed, any other kind of acoustic commonality that might provide auditory coherence and mark the sinusoids acoustically as speech. What the sinusoids do offer the listener (indeed, all they offer) is information about the trajectories of the formants, which is to say movements of the articulators. If those movements can be seen by the phonetic module as appropriate to linguistically significant gestures, then the module, being properly engaged, integrates them into a coherent phonetic structure; otherwise, not. There are, then, no purely acoustic properties, no acoustic stigmata, on the basis of which the presence of phonetic structure can, under all circumstances, be reliably apprehended. But is it not so with syntax, too? If a perceiving system is to determine whether or not a string of words is a sentence, it surely cannot rely on some list of surface properties; rather it must determine if the string can be parsed, that is, if a grammatical derivation can be found. Thus, the specializations for phonetic and syntactic perception have in common that their products are deeply linguistic, and are arrived at by procedures that are similarly synthetic.
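The sinusoidal reduction is easy to state in signal terms. The sketch below is illustrative only: the formant tracks are made-up placeholders, whereas real sinusoidal replicas use center-frequency tracks measured from an actual utterance. It synthesizes three sine waves that do nothing but follow the tracks, with no fundamental and no harmonic structure.

```python
import numpy as np

fs = 16000                      # sample rate, Hz
t = np.arange(0, 0.3, 1 / fs)   # 300 ms of signal

# Placeholder center-frequency tracks for F1-F3 (Hz), invented for the demo:
f1 = 300 + 400 * t / t[-1]      # rising
f2 = 2200 - 900 * t / t[-1]     # falling
f3 = np.full_like(t, 2700.0)    # flat

# Each sinusoid's phase is the running integral of its frequency track, so
# the only information carried is the trajectory of each formant center.
signal = sum(np.sin(2 * np.pi * np.cumsum(f) / fs) for f in (f1, f2, f3))
signal /= np.abs(signal).max()  # normalize; write to a .wav file to audition
```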

As for specializations that are adapted for functions other than communication, Mattingly and I have claimed for speech perception that, as I earlier hinted, it bears significant resemblances to a number of biologically coherent adaptations. Being specialized for different functions, each of these is necessarily different from every other one, but they nevertheless have certain properties in common. Thus, they all qualify as modules, in Fodor's sense of that term, and therefore share something like the properties he assigns to such devices. I choose not to review those here, but rather to identify, though only briefly, several other common properties that such modules, including the phonetic, seem to have.

To see what some of those properties might be, Mattingly and I have found it useful to distinguish broadly between two classes of modules. One comprises, in the auditory modality, the specializations for pitch, loudness, timbre, and the like. These underlie the common auditory dimensions that, in their various combinations, form the indefinitely numerous percepts by which people identify a correspondingly numerous variety of acoustic events, including, of course, many that are produced by artifacts of one sort or another, and that are, therefore, less than perfectly natural. We have thought it fitting to call this class 'open'. It is appropriate to the all-purpose character of these modules that its representations be commensurate with the relevant dimensions of the physical stimulus. Thus, pitch maps onto frequency, loudness onto amplitude, and timbre onto spectral shape; hence, we have called these representations 'homomorphic'. It is also appropriate to their all-purpose function that these homomorphic representations not be permanently changed as a result of long experience with some acoustic event. Otherwise, acquired skill in using the sound of automobile engines for diagnostic purposes, for example, would render the relevant modules maladapted for every one of the many other events for which they must be used.


Members of the other class, the one that includes speech, are more narrowly specialized for particular acoustic events or stimulus relationships that are, as particular events or relationships, of particular biological importance to the animal. We have, therefore, called this class 'closed'. It includes specializations like sound localization, stereopsis, and echolocation (in the bat) that I mentioned earlier. Unlike the representations of the open class, those produced by the closed modules are incommensurate with the dimensions of the stimulus; we have therefore called them 'heteromorphic'. Thus, the sound-localizing module represents interaural differences of time and intensity, not homomorphically as time or loudness, but heteromorphically as location; the module for stereopsis represents binocular disparities, not homomorphically as double images, but heteromorphically as depth. The echolocating module of the bat presumably represents echo time, not homomorphically as an echoing (bat) cry, but heteromorphically as distance. In a similar way, the phonetic module represents the continuously changing formants, not homomorphically as smoothly varying timbres, but heteromorphically as a sequence of discrete and categorical phonetic segments.
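The open/closed contrast can be rendered schematically as two kinds of mapping. This is my own illustration, not the authors' formalism: an open module returns a percept commensurate with the stimulus dimension, while a closed module returns a percept of a different, domain-specific kind. The 2600-Hz criterion below is an invented placeholder.

```python
def open_module_pitch(frequency_hz):
    """Homomorphic: the percept tracks the stimulus dimension
    (any monotone map of frequency onto pitch would do)."""
    return frequency_hz

def closed_module_phonetic(f3_onset_hz):
    """Heteromorphic: a continuous stimulus dimension is rendered as a
    discrete, categorical segment (criterion value invented)."""
    return "da" if f3_onset_hz > 2600.0 else "ga"

print(open_module_pitch(2500.0))       # 2500.0: commensurate with the input
print(closed_module_phonetic(2500.0))  # 'ga': a percept of a different kind
```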

Unlike the open modules, those of the closed class depend on very particular kinds of environmental stimulation, not only for their development, but for their proper calibration. Moreover, they remain plastic, that is, open to calibration, for some considerable time. Consider, in this connection, how the sound-localizing module must be continuously recalibrated for its response to interaural differences as the distance between the ears increases with the growth of the child's head. The similarly plastic phonetic module is calibrated over a period of years by the particular phonetic environment to which it is exposed. Significantly, the calibration of these modules in no way affects any of the specializations of the open class, even though their representations figure importantly in the final percept, as in the paraphonetic aspects of speech, for example. This is to say that the closed modules must learn by experience, as the phonetic module most surely does, but the learning is of an entirely precognitive sort, requiring only neurological normality and exposure (at the right time) to the particular kinds of stimuli in which the module is exclusively interested.

As implied above, the two classes have their own characteristically different ways of representing the same dimension of the stimulus. Why, then, does the listener not get both representations, heteromorphic and homomorphic, at the same time? Given binocularly disparate stimuli, why does the viewer not see double images in addition to depth? Or, given two syllables [da] and [ga] that are distinguished only by the direction of the third-formant transition, why does the listener not hear, in addition to the discrete consonant and vowel, the continuously changing timbre that the most nearly equivalent nonspeech pattern would produce, and to which the two transitions would presumably make their distinctively different, but equally nonphonetic, contributions?

Mattingly and I have proposed that the competition between the modules is normally resolved in favor of the members of the closed class by virtue of their ability to preempt the information that is ecologically appropriate to computing the heteromorphic percept, and thus, in effect, to remove that information from the flow. As for what is ecologically appropriate, the closed modules have an elasticity that permits them to take a rather broad view. Thus, the module for stereopsis represents depth for binocular disparities considerably greater than would ever be produced by even the most widely separated eyes. The phonetic module will, in its turn, tolerate rather large departures from what is ecologically plausible. Imagine, for example, two synthetic syllables, [da] and [ga], distinguished only by the direction of a third-formant transition that, as I indicated earlier, sounds in isolation like a nonspeech chirp. If the syllables are now divided into two parts (one, the critical transition cue; the other, the remainder of the pattern, the 'base', that, by itself, is ambiguously [da] or [ga]), then the phonetic system will integrate them into a coherent [da] or [ga] syllable even when the two parts have, by various means, been made to come from perceptibly different sources. (Mattingly has a different interpretation of this particular phenomenon, so I must take full responsibility for the one I offer here.) In what is, perhaps, the most dramatic demonstration of this kind of integration, the isolated transition cue is presented at a location opposite one ear, the ambiguous base at a location opposite the other. Under these ecologically implausible circumstances, the listener nevertheless perceives a perfectly coherent [da] or [ga], and, more to the point, confidently localizes it to the side where only the ambiguous base was presented. (This happens, indeed, even when both the base and the critical transition cue are made of frequency-modulated sinusoids, which is a most severe test, since the differently located sinusoids would, as I pointed out earlier, seem to lack any kind of acoustic coherence that might cause them to be integrated on some auditory basis.) But there are limits to this elasticity, and seemingly similar effects occur in phonetic perception and in stereopsis when those limits are exceeded. Thus, in the speech case, as the stimulus is made to provide progressively more evidence for separate sources (for example, by increasing the relative intensity of the isolated transition on the one side), listeners begin to hear both the integrated syllable and the nonspeech chirp. In other words, the heteromorphic and homomorphic percepts are simultaneously represented. In the speech case, this has been called 'duplex perception'. Appropriate (and necessary) tests for the claim that the duplex percepts are representations of truly different types, and not simply a cognitive reinterpretation of a single type, are in the demonstration that listeners cannot accurately identify the chirps as [da] and [ga], and that they hear only the integrated syllable and the chirp, not those and also the ambiguous base. It is also relevant to the claim that the discrimination functions obtained for the percepts on the two sides of the duplex percept are radically different, for that shows that the listeners cannot hear the one in the other, even when, under the conditions of the discrimination procedure, they try. (Unfortunately, these tests have not been applied to the cases of nonspeech auditory perception that some have claimed to be duplex.)
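For concreteness, here is a schematic of the dichotic duplex arrangement just described. It is my own rendering: the structure follows the text, but every value is a placeholder rather than a stimulus parameter from the experiments.

```python
# One syllable's worth of cues, split across the two ears (schematic only):
duplex_stimulus = {
    "left_ear": {
        "base": "F1 + F2 + steady F3",  # ambiguous between [da] and [ga]
        "f3_transition": None,          # the deciding cue is absent here
    },
    "right_ear": {
        "f3_transition": "falling",     # direction alone decides [da]/[ga];
    },                                  # heard alone, it is a nonspeech chirp
}
# Reported result: a coherent syllable localized to the base ear, plus a
# chirp at the other ear; the same cue serves two percepts at once.
```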

In the case of stereopsis, the elasticity of the module is strained by progressively increasing the binocular disparity. Beyond a certain point, the viewer begins to perceive, not only heteromorphic depth, but also homomorphic double images. This seems quite analogous to duplex perception, though it has not been called that.

In both speech and stereopsis, providing further evidence of ecological implausibility causes the heteromorphic percept (phonetic structure or depth) to weaken as its homomorphic counterpart (nonphonetic chirp or double images) strengthens, until, finally, the closed module fails utterly, and only the homomorphic percept of the open modules is represented. Thus the information in the stimulus can seemingly be variously divided between the two kinds of representation, and, since either gains at the expense of the other, it is as if there were a kind of conservation of information.

Putting the matter most generally, then, I should say that speech is special, to be sure, but neither more nor less so than many other biologically coherent adaptations, including, of course, language itself.

How do speaking and listening differ from writing and reading?

Among the most obvious, and obviously consequential, facts about language is the immense difference in biological status, hence naturalness, between speech, on the one hand, and writing/reading, on the other. The phonetic units of speech are the vehicles of every language on earth, and are commanded by every neurologically normal human being. On the other hand, many, perhaps most, languages do not even have a written form, and, among those that do, some competent speakers find it all but impossible to master. Having been thus reminded once again that speech is the biologically primary form of the phonological behavior that typifies our species, we readily appreciate that alphabetic writing is not really the primary behavior itself, but only a fair description of it. Since what is being described is species typical, alphabetic writing is a piece of ethological science, in which case a writer/reader is fairly regarded as an accomplished ethologist. It weighs heavily against the horizontal view, therefore, that, as I have already said and will say again below, it cannot comprehend the difference between speaking/listening and writing/reading, for in that respect it is like a theory of bird song that does not distinguish the behavior of the bird from that of the ethologist who describes it.

To see the problem created by the horizontal view, we need first to appreciate, once again, that writing-reading did not evolve as part of the language faculty, so the relevant acts and percepts cannot be specifically linguistic. The important consequence, of course, is that they require to be made so, and, as I have said so many times, that can be done only by some kind of cognitive translation. Now I would emphasize that it is primarily in respect of this requirement that writing and reading differ biologically from speech. Indeed, it is precisely in the need to meet this requirement that writing-reading are intellectual achievements in a way that speech is not. But the horizontal view of speech does not permit us to see that essential difference. Rather, it misleads us into the belief that the primary processes of the two modes of linguistic communication are equally general, hence equally nonphonetic. That being so, we must suppose that the relevant representations are equally in need of a cognitive connection to the language, and so have the same status from a biological point of view. We are, of course, permitted to see the obvious and superficial differences, but, for each one of those, the horizontal view would seem, paradoxically, to give the advantage to writing-reading, leading us to expect that writing-reading, not speaking-listening, would be the easier and more natural. For, surely, the printed characters offer a much better signal-to-noise ratio than the phonetically relevant parts of the speech sound; the fingers and the hand are vastly more versatile than the tongue; the eye provides far greater possibilities for the transmission of information than the ear; and, for all the vagaries of some spelling systems, the alphabetic characters bear a more nearly transparent relation to the phonological units of the language than the context-variable and elaborately overlapped cues of the acoustic signal.

The vertical view, on the other hand, is appropriately revealing. Given the phonetic module, speakers do not have to know how to spell a word in order to produce it. Indeed, they do not even have to know that it has a spelling. Speakers have only to access the word, however that is done; the module then spells it for them, automatically selecting and coordinating the appropriate gestures. Listeners are in similar case. To perceive the word, they need not puzzle out the complex and peculiarly phonetic relation between signal and the phonological message it conveys; they need only listen, again leaving all the hard work to the phonetic module. Being modular, these processes of production and perception are not available to conscious analysis, so the speakers and listeners cannot be aware of how they do what they do. Though the representations themselves are available to consciousness (indeed, if they were not, use of an alphabetic script would be impossible), they are already phonological in nature, hence appropriate for further linguistic processing, so the reader need not even notice them, as he would have to if, as in the case of alphabetic characters, some arbitrary connection to language had to be formed. Hence, the processes of speech, whether in production or perception, are not calculated to put the speaker's attention on the phonological units that those processes are specialized to manage.

On the basis of considerations very like those, Isabelle Liberman, Donald Shankweiler, and Ignatius Mattingly saw, more than twenty years ago, that, while awareness of phonological structure is obviously necessary for anyone who would make proper use of an alphabetic script, such awareness would not normally be a consequence of having learned to speak. Isabelle and Donald proceeded, then, to test this hypothesis with preliterate children, finding that such children do, indeed, not know how to break a word into its constituent phonemes. That finding has now been replicated many times. Moreover, researchers at the Laboratories and elsewhere have found that the degree to which would-be readers are phonologically aware may be the best single predictor of their success in learning to read, and that training in phonological awareness has generally happy consequences for the progress in reading of those children who receive it. There is also reason to believe that, other things equal, an important cause of reading disability may be a weakness in the phonetic module. That weakness would make the phonological representations less clear than they would otherwise be, hence that much harder to bring to awareness. Indeed, there is at least a little evidence that reading-disabled children do have, in addition to their problems with phonological awareness, some of the other characteristics that a weak phonetic module might be expected to produce. Thus, by comparison with normals, they seem to be poorer in several kinds of phonologically related performances: short-term memory for phonological structures, but not for items of a nonlinguistic kind; perception of speech, but not of nonspeech sounds, in noise; naming of objects (that is, retrieving the appropriate phonological structures) even when they know what the objects are and what they do; and, finally, production of tongue-twisters.

The vertical view was not developed to explain the writing-reading process or the ills that so frequently attend it, but rather for all the reasons given in earlier sections of this paper. That it nevertheless offers a plausible account, while the horizontal view does not, is surely to be counted strongly in its favor.

Are there acoustic substitutes for speech?

When Frank Cooper and I set out to build a reading machine for the blind, we accepted that the answer to that question was not just 'yes', but 'yes, of course'. As I see it now, the reason for our blithe confidence was that, being unable to imagine an alternative, we could only think in what I have here described as horizontal terms. We therefore thought it obvious that speech sounds evoked normal auditory percepts, and that these were then named in honor of the various consonants and vowels so they could be used for linguistic purposes.

On the basis of our early experience with the nonspeech sounds of our reading machines, we learned the hard way that things were different from what we had thought. But it was not until we were well into the research on speech that we began to see just how different and why. Now, drawing on all that research, we would say that the answer to the question about acoustic substitutes is 'no', or, in the more emphatic modern form, 'no way'. The sounds produced by a reading machine for the blind will serve as well as speech only if they are of a kind that might have been made by the organs of articulation as they figure in gestures of a specifically phonetic sort. If the sounds meet that requirement, then they will engage the specialization for speech, and so qualify as proper vehicles for linguistic structures; otherwise, they will encounter all the difficulties that fatally afflicted the nonspeech sounds we worked with so many years ago.

FOOTNOTE
*This essay will appear as the introduction to a collection of papers to be published by MIT Press.


Haskins Laboratories Status Report on Speech Research
1993, SR-113, 33-40

On the Intonation of Sinusoidal Sentences: Contour and Pitch Height*

Robert E. Remez† and Philip E. Rubin

A sinusoidal replica of a sentence evokes a clear impression of intonation despite the absence of the primary acoustic correlate of intonation, the fundamental frequency. Our previous studies employed a test of differential similarity to determine that the tone analog of the first formant is a probable acoustic correlate of sinusoidal sentence intonation. Though the typical acoustic and perceptual effects of the fundamental frequency and the first formant differ greatly, our finding was anticipated by reports that harmonics of the fundamental within the dominance region provide the basis for impressions of pitch more generally. The frequency extent of the dominance region roughly matches the range of variability typical of the first formant. Here, we report two additional tests with sinusoidal replicas to identify the relevant physical attributes of the first formant analog that figure in the perception of intonation. These experiments determined (1) that listeners represent sinusoidal intonation as a pattern of relative pitch changes correlated with the frequency of the tonal replica of the first formant, and (2) that sinusoidal sentence intonation is probably a close match to the pitch height of the first formant tone. These findings show that some aspects of auditory pitch perception apply to the perception of intonation, and that impressions of pitch of a multicomponent nonharmonic signal can be derived from the component within the dominance region.

When the pattern of formant frequency variation of a natural utterance is imparted to several time-varying sinusoids, there are two obvious perceptual consequences. First, the phonetic content of the original natural utterance is preserved, and listeners understand the sinusoidal voice to be saying the same sentence as the natural one on which the sinusoidal sentence is modeled (Remez, Rubin, Pisoni, & Carrell, 1981). Second, the quality of a sinusoidal voice is unnatural both in its timbre and its intonation (Remez et al., 1981; Remez, Rubin, Nygaard, & Howell, 1987; Remez & Rubin, in press). Our studies have attributed the intelligibility of sinusoidal sentences to the availability of time-varying phonetic information which is independent of the specific acoustic elements composing the signal (Remez & Rubin, 1983, 1990). Correspondingly, we have attributed the anomalous voice qualities to the absence of harmonic structure and broadband resonances from the short-term spectrum, and to the peculiar intonation of sinewave sentences (Remez et al., 1981; Remez & Rubin, 1984; in press). Unlike natural and synthetic speech, sinewave replicas of utterances do not present the listener with a fundamental frequency common to their tonal components. In the absence of this typical correlate of intonation, our studies revealed that the tonal analog of the first formant is a probable source of the impression of the odd sentence melody accompanying the phonetic perception of replicated utterances.

The authors gladly acknowledge the advice and editorial care of Stefanie Berns and Jennifer Pardo; and the assiduous scholarship and criticism of Eva Blumenthal. This research was supported by grants from NIDCD (00308) to R. E. Remez, and from NICHHD (01994) to Haskins Laboratories.

The studies of intonation that are reported here address two issues about the method of our earlier research (Remez & Rubin, 1984), producing the evidence that now permits a clear interpretation. These new tests show that impressions of sinusoidal intonation are approximate to the pattern of frequency change of the tone analog of the first formant, and that the intonation of the sinusoidal sentence appears to approximate the specific pitch height of this crucial tone component.

Sinusoidal intonation

In our initial studies of the basis for the impression of intonation in sinewave sentences (Remez & Rubin, 1984), we used a matching task, on each trial of which the subject listened to a sinusoidal sentence followed by two single-tone patterns, and chose from the pair of single tones the better match to the intonation of the sinewave sentence. By examining the likeness judgments across a set of single-tone candidates, we were able to identify a best match, good matches, and poor matches to the intonation of a particular sentence. We generally found that the tone imitating the first formant was preferred to all other alternatives.

First, listeners consistently chose the tone replicating the first formant from a set of tonal candidates that also included the second and the third formant analogs of the sentence replica. The set of alternatives also contained a tone reproducing the greatest common divisor of the three concurrent tones of the sentence pattern; and a tone which presented a linguistically and acoustically plausible fundamental frequency contour for the sentence that we used. The expression of any clear preference among these alternatives suggested that subjects understood the instructions and were capable of the task that we set for them. Second, we found that the preference for the first formant analog did not depend on the fact that it had the greatest energy among the constituents of the sentence, for the first formant tone was the choice for matching the intonation regardless of the ordinal amplitude relations that we imposed among the tones reproducing the formant centers. Here, the alternatives in the matching set again were the tonal components of the sentence replica. Third, when listeners were asked to identify the intonation of a sentence pattern that included formant analogs with an additional tonal component below the frequency of Tone 1 that followed the natural fundamental frequency values of the utterance from which the tonal replica was derived, they did not select this F0 analog as the match to intonation, but instead chose the first formant analog again. Here, the alternatives in the matching set were the tonal components of the sentence replica and the tone reproducing F0 variation. Last, listeners expressed no consistent experience of intonation (performance on the intonation matching task did not differ from chance) when requested to identify the intonation of a sinusoidal sentence replica that lacked the component analogous to the first formant.

Overall, these results suggested that the intonation of a sinusoidal replica of a sentence is correlated with attributes of the analog of the first formant. The selection of the first formant tone (Tone 1) as the match to intonation persisted even when that tone had neither the greatest power among the sinusoidal components of the sentence nor the lowest frequency. In that case, it is not surprising that the intonation of sinewave sentences seems odd. The pattern of first formant frequency variation is better correlated with the opening and closing of the jaw in cycles of syllable production than with the patterns of rise and fall of intonation, which are correlated with the polysyllabic breath group. But, as evidence for a more specific claim (that sinusoidal sentence intonation is based on the frequency variation of the first-formant analog) our prior findings are equivocal in two crucial ways. First, if listeners chose Tone 1 because it reprised the pattern of sentence pitch, this choice may also have occurred if Tone 1 was simply the candidate falling closest in pitch range to the apparent intonation of the sentence. In fact, the report of Grunke and Pisoni (1982) suggests that this alternative hypothesis is plausible. They noted that the apparent similarity of speechlike syllable-length tone patterns was affected by average tone frequency as well as by details within the pattern; none of the tests of Remez and Rubin (1984) evaluated the possibility that similar effects occur in sentence-length patterns. A second point to consider is that the single-tone alternatives used by Remez and Rubin (1984) in the matching test confounded pitch range and pitch pattern. It is necessary to observe matching preferences for Tone 1 in a study controlling pitch range differences among the test alternatives to conclude that listeners hear sinusoidal intonation as the correlate of Tone 1.

The two studies reported here also used a matching task to test the hypothesis that the acoustic correlate of intonation in sinewave sentences is the tone that follows the frequency and amplitude variation of the first formant. In both experiments we employed a set of alternative single-tone frequency patterns for subjects to use in matching their impressions of intonation of a four-tone sentence replica. In the first experiment, we composed the set of candidates to distinguish the perceptual effect of frequency contour from average frequency.


EXPERIMENT 1

While our prior studies had pointed to the tone analog of the first formant as the closest match to the intonation of a sinusoidal sentence, additional tests were needed to conclude that the pattern of this tone, and not merely its average frequency, was the basis for the likeness judgments. The test performed in Experiment 1 offers alternatives that have identical average frequency but different frequency patterns derived from tonal components of the sinusoidal sentence. If subjects fail here to exhibit a clear preference in matching intonation and single-tone pitch patterns, then we would conclude that the average frequency is well represented perceptually in the perception of intonation of a sinusoidal sentence, and the specific pattern of frequency changes less well.

Method

Subjects. Fifteen audiologically normal adults were recruited from the population of Barnard and Columbia Colleges. All were native speakers of English, and none had been tested in other experiments employing sinusoidal signals. The subjects were paid for participating.

Acoustic Test Materials. The acoustic materials used in this test consisted of four sinusoidal patterns: one four-tone sentence pattern, and three single-tone patterns, all of them generated by a sinewave synthesizer (Rubin, 1980). This synthesizer produces sinusoidal patterns defined by parameters of frequency and amplitude for each tone, updated at the rate of 10 ms per parameter frame. The initial synthesis parameters were obtained by analyzing a natural utterance, the sentence "My t.v. has a twelve inch screen," spoken by one of the authors. This utterance was recorded on audiotape in a sound-attenuating chamber and converted to digital records by a VAX-based pulse-code modulation system using a 4.5 kHz low-pass filter on input and a sampling rate of 10 kHz. At 10-ms intervals, center-frequency and amplitude values were determined for each of the three lowest oral formants and the intermittent fricative formant by the analysis technique of linear prediction (Markel & Gray, 1976). In turn, these values were used as sinewave synthesis parameters after correcting the errors typical of linear prediction estimates. A full description of sinusoidal replication of natural speech is provided by Remez et al. (1987).

The resulting sentence pattern comprised four time-varying sinusoids. Tone 1 corresponded to the first formant, Tone 2 to the second, Tone 3 to the third, and Tone 4 was used to replace the fricative formant. A spectrographic representation of this pattern is shown in Figure 1. The three single-tone patterns that were used to compose the pairs of alternatives in the matching trials were Tone 1 and frequency-transposed versions of Tone 2 and Tone 3, each a component of the sentence pattern that the subject heard at the beginning of each trial. Tone 1 was produced with the frequency values it exhibited within the sentence pattern. These three single-tone alternatives were produced with equal average power, and each had roughly the same average frequency of 320 Hz. This was accomplished by dividing the frequency parameters of Tones 2 or 3 by constant divisors throughout the pattern and then synthesizing the transposed time-varying sinusoids.
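The two operations just described (rendering tones from 10-ms parameter frames, and dividing a tone's frequency values by a constant so that its average lands near 320 Hz) can be summarized in a short numerical sketch. The Python fragment below is illustrative only: the frame values and function names are invented, and it stands in for, rather than reproduces, the actual sinewave synthesizer (Rubin, 1980).

```python
# Illustrative sketch, not the Haskins synthesizer. Frame values are invented.
import numpy as np

FRAME_MS = 10        # one parameter frame per 10 ms, as described in the text
FS = 10_000          # 10 kHz sampling rate, as in the digitized records

def synthesize_tone(freqs_hz, amps, fs=FS, frame_ms=FRAME_MS):
    """Render one time-varying sinusoid from per-frame frequency/amplitude values."""
    samples_per_frame = fs * frame_ms // 1000
    # Hold each frame's parameters for 10 ms (a real synthesizer would interpolate).
    f = np.repeat(np.asarray(freqs_hz, dtype=float), samples_per_frame)
    a = np.repeat(np.asarray(amps, dtype=float), samples_per_frame)
    phase = 2 * np.pi * np.cumsum(f) / fs   # integrate frequency to get phase
    return a * np.sin(phase)

def transpose_to_average(freqs_hz, target_avg_hz=320.0):
    """Divide all frame frequencies by one constant so the mean hits the target."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return freqs_hz / (freqs_hz.mean() / target_avg_hz)

# Hypothetical Tone 2 frames, scaled down to share Tone 1's ~320 Hz average:
tone2_f = [1100, 1250, 1400, 1300, 1200]
signal = synthesize_tone(transpose_to_average(tone2_f), amps=[0.5, 0.8, 1.0, 0.8, 0.5])
```

Dividing by a single constant preserves the shape of the frequency contour while shifting its average, which is exactly the property the matching test requires.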


Figure 1. Spectrographic representation of the tone replica of the sentence, "My t.v. has a twelve inch screen." Amplitude variation is represented in the height of each hatch mark placed at the tone frequency.


The synthesized test materials were converted from digital records to analog signals, recorded on audiotape, and were presented to listeners by playback of the audiotape. Average listening levels were set at 72 dB SPL. Test materials were delivered binaurally in an acoustically shielded room over Telephonics TDH-39 headsets.

Procedure. During an initial instruction portion of the test session, listeners were told that the experiment was examining the identifiability of vocal pitch, the tune-like quality, of synthetic sentences. To illustrate the independence of phonetic structure and sentence melody for the test subjects, the experimenter sang the phrases "My Country 'Tis of Thee" and "I Could Have Danced All Night" with the associated melodies and with the melodies interposed. When subjects acknowledged their ability to determine the melody of a sentence regardless of its words, they were instructed to attend on each test trial to the pitch changes of the sinusoidal sentence, to identify the pattern, and then to select the alternative of the two following patterns that more closely resembled the intonation of the sentence. The subjects recorded their choices in specially prepared response booklets.

The format of each trial was identical, consisting of three sinusoidal patterns. First was the sinusoidal sentence "My t.v. has a twelve inch screen," presented once. Then, one of the three single-tone patterns was presented. Last, a second single-tone pattern was presented. There were six different comparisons among the three different single-tone alternatives. Counterbalanced for order, each subject judged each different comparison twenty times. Each sinusoidal pattern was approximately 3.1 s in duration, the interval between items within a trial was 1 s, and the interval between trials was 3 s.
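As a minimal sketch of this trial structure (the original test tapes were assembled as analog recordings, not generated this way), the six ordered comparisons and their twenty repetitions can be enumerated as follows; the labels are placeholders.

```python
# Illustrative sketch of the Experiment 1 trial list; labels are placeholders.
import itertools
import random

alternatives = ["Tone 1", "Tone 2 (transposed)", "Tone 3 (transposed)"]
comparisons = list(itertools.permutations(alternatives, 2))  # 6 ordered pairs
trials = comparisons * 20      # each comparison judged twenty times: 120 trials
random.shuffle(trials)         # randomize order of presentation
for first, second in trials[:3]:
    print("sentence ->", first, "->", second)
```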

Results and Discussion

The results are not difficult to interpret. Figure 2 shows the proportion of trials on which each alternative was chosen relative to the number of trials on which it was presented. Subjects preferred Tone 1 to the other two intonation candidates, and the mean differences were large. An analysis of variance was performed on the differences in preference between candidate tones in the three comparisons, and was highly significant [F(2,40) = 71.62, p < .0001].

Because the single-tone alternatives differed in frequency contour but not in average frequency, the data reveal a strong preference for the first formant analog as the match to intonation, suggesting that subjects used pitch patterns rather than average frequency in this test. This finding adds credibility to our claim that the pattern of frequency variation of the first formant tone is supplying the acoustic basis for pitch impressions. Nevertheless, it remained to be shown that the precise range of frequency variation of this tone is responsible for the listener's impression of intonation in a sinusoidal sentence replica.


Figure 2. Relative preference for single-tone matches to sinusoidal sentence intonation; group performance in Experiment 1. Subjects preferred Tone 1 to other tones of equal average frequency.

EXPERIMENT 2

One way to determine the perceptual effect of the precise frequencies of the first formant analog (as opposed to the contour of its frequency variation) is to see how subjects fare in the matching task if all the alternative pitch candidates have the same contour, differing only in average frequency. If subjects persist in selecting Tone 1 as the best match when other candidates exhibit the same pattern of variation (at different pitch heights), we may conclude that Tone 1 is a fairly direct cause of pitch impressions.

In this test, we used the same sentence replica that we created for Experiment 1, but the tone alternatives for the matching task were all derived from the pattern of the first formant analog, Tone 1. They were made by transposing the synthesis parameters for Tone 1 up or down in frequency, resulting in a set of Tone 1 variants, only one of which had the identical frequency values exhibited by Tone 1 in the sentence pattern. If subjects express precision in their preferences, we can conclude at least that the relative pitch pattern is anchored to a specific impression of pitch height, however else the impression of intonation is moderated by converging influences.

Method

Subjects. Twenty volunteer listeners with normal hearing in both ears were drawn from the student population of Barnard and Columbia Colleges. None had previously participated in studies of sinusoidal synthesis. The subjects received pay for participating.

Acoustic Test Materials. The same sinusoidal sentence pattern, "My t.v. has a twelve inch screen," from Experiment 1 was employed here. The single-tone alternatives consisted of Tone 1 and four variants: the frequency pattern of Tone 1 transposed downward by 20% and 40%, and transposed upward by 20% and 40%.
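Assuming each variant was made by scaling Tone 1's frame-by-frame frequency values by a constant factor, with amplitudes left untouched, the five alternatives can be sketched as follows; the frame values are invented for illustration.

```python
# Illustrative sketch of the Experiment 2 alternatives; tone1_f is invented.
import numpy as np

tone1_f = np.array([300.0, 340.0, 420.0, 380.0, 320.0])  # hypothetical frames (Hz)
factors = {"-40%": 0.6, "-20%": 0.8, "Tone 1": 1.0, "+20%": 1.2, "+40%": 1.4}
# Each variant keeps Tone 1's contour; only its pitch height differs.
variants = {name: tone1_f * k for name, k in factors.items()}
```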

Procedure. Subjects heard the same brief instructions that were used in the first experiment to spotlight the independence of sentence melody and lexical attributes. The test itself comprised 100 trials, each of which began with a single presentation of the sinusoidal sentence, "My t.v. has a twelve inch screen." Next, two of the five single-tone alternatives were presented, and the subject chose the closer match to the pitch pattern of the sinusoidal sentence. There were ten different comparisons of the five alternatives, repeated five times in each order.

Results and Discussion

The outcome of this test was clear, again. An analysis of variance performed on the preference differences across the ten contests was highly significant [F(4,60) = 13.79, p < .0001]. In essence, subjects consistently selected Tone 1 as most like the sentence pitch. These results are shown in Figure 3a.

The intonation matches on trials in which Tone 1 was not a candidate are shown in Figure 3b. Four of these comparisons contained a pair in which one candidate was closer to Tone 1 in frequency than the other was, and two of these comparisons had tones equally similar to Tone 1 in physical frequency. In the two equal-similarity cases, subjects chose the higher tone as the better match.


Figure 3(a). Relative preference for single-tone matches to sinusoidal sentence intonation; group performance in Experiment 2. Conditions in which Tone 1 was compared with four transposed versions are shown. Subjects preferred Tone 1 to other tones of the same frequency contour.


Figure 3(b). Relative preference for single-tone matches to sinusoidal sentence intonation; group performance in Experiment 2. Conditions in which transpositions of Tone 1 were compared with each other are shown.


This is exactly what we would expect on the precedent of classic studies of pitch scaling (cf. Ward, 1970), which roughly confirm the relationship of the 2:1 frequency ratio and the interval of the octave. In the present case, this means that the tone with its frequency increased by 40% relative to Tone 1 should be as different, subjectively, from Tone 1 as the tone with its frequency decreased 20%. This hypothesis is encouraged by the outcome of the condition comparing the case of 20% downward vs. 40% upward transpositions, in which there was no clear preference, suggesting a subjective equality of these two in degree of similarity to sentence pitch.

To summarize the ten comparisons, we derived a likeness index for each of the five single-tone alternatives, by summing the total number of contests over the whole test in which each was selected. Each tone occurred in forty contests, which yields a limiting score of 40 were that tone selected on every opportunity. The group averages are portrayed in Figure 4, along with the best quadratic fit to the five points. Although Tone 1 is clearly the most frequently chosen alternative across the likeness test, the effect of frequency transposition is not symmetrical.
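For concreteness, the likeness index and the quadratic fit can be sketched as below. The win counts are invented placeholders, not the published group averages.

```python
# Illustrative sketch of the likeness index; the win counts are invented.
import numpy as np

factors = np.array([0.6, 0.8, 1.0, 1.2, 1.4])   # transpositions of Tone 1
wins = np.array([8.0, 22.0, 34.0, 28.0, 14.0])  # contests won (40 possible each)

coeffs = np.polyfit(factors, wins, deg=2)        # best quadratic fit
peak = -coeffs[1] / (2 * coeffs[0])              # vertex of the fitted parabola
print("fitted peak near factor %.2f (Tone 1 itself is 1.0)" % peak)
```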


Figure 4. Representation of the integrated performance on the ten contests of Experiment 2 is shown in this plot of overall similarity to the apparent intonation of the sinewave sentence. The likeness index for the five single-tone alternatives in Experiment 2 is graphed with a best quadratic fit to the group averages. (The curve is plotted to emphasize the asymmetry of the effect of transposition; no specific theoretical significance is given to the terms of the best-fitting function.)

This is exactly what we would expect based on the rough match of frequency to pitch in this range: in other words, that a 40% increase in frequency is a subjectively equal change to a 20% decrease in frequency. Overall, this pattern of results suggests that the pitch height of the intonation of a sinusoidal sentence must be quite close to the pitch impression created by Tone 1 presented alone.

GENERAL DISCUSSION

The findings of Experiments 1 and 2 add precision to our account of the odd intonation that accompanies sinusoidal replicas of sentences. Listeners in our tests consistently selected the tonal analog of the first formant as the match to their impressions of intonation, as if that component of the tone complex plays a dual role in the perception of sinewave sentences: First, it acts as if it were a vocal resonance supplying segmental information about the opening and closing of the vocal tract; second, it acts as the periodic stimulation driving the perception of pitch. If the tone analog of the first formant is responsible, the intonation pattern of a sinusoidal sentence ought to sound weird, given that this tonal component (1) lies several octaves above the natural frequency of phonation, and (2) varies in a manner utterly unlike the suprasegmental pattern of F0 variation.

If the perceiver catches on to the trick of treating sinusoids as formant analogs, then it is reasonable to expect the analog of the first formant to contribute to impressions of consonants and vowels. But, why should this constituent of the signal also create an impression of intonation? These two studies of tone contour and frequency bolster the case which we have made relying on the dominance region hypothesis (Remez & Rubin, 1984). Research on the causes of pitch impressions with nonspeech sets of harmonic components had shown that the auditory system is roughly keyed to detect pitch preferentially from excitation in the range of 400-1000 Hz (Plomp, 1967; Ritsma, 1967). In essence, these psychoacoustic studies of fundamentals around 100-200 Hz revealed a predominant influence on pitch impressions of the common periodicity of the third through fifth harmonics, and a lesser influence of higher or lower harmonics. An important proof of the effects of the dominance region using speech sounds was reported by Greenberg (1980). This study assessed human auditory evoked potentials in response to synthetic speech spectra, and showed that the representation of the fundamental frequency in the recordings was strongest when the frequency of the first formant fell within the dominance region. Of course, in ordinary speech, the first formant ranges widely, from roughly 270 Hz to 850 Hz in the speech of adults, to as high as 1 kHz in children. By this notion, in ordinary listening the band extending from 400-1000 Hz is analyzed to supply both periodicity information and the center-frequency of the first formant that traverses it.

The periodicity of a speech signal and the frequency of its first formant typically differ greatly in natural speech, both in pattern as well as in frequency range. In sinewave replicas, which lack harmonic structure, often the sole component falling in the dominance region is the tonal analog of the first formant. In that case, we claimed, despite the absence of harmonics, the first formant analog evidently presents an effective if inadvertent stimulus for pitch perception, and acts as well as an implicit resonant frequency. Here, the periodicity within the dominance region converges on the center frequency of the first (albeit functional) formant, and the apparent intonation is therefore unlike familiar speech in its pattern of variation and its displaced range.

The present tests furnish two missing pieces which clarify this situation. It is the frequency contour of Tone 1 which subjects use to match their impressions of sentence melody, rather than average frequency as such. Our test determined this by forcing subjects to differentiate the particular pitch contour of Tone 1 from its pitch height and pitch range. Nonetheless, subjects reported that the precise pitch height of Tone 1 matched the impression of intonation of the sentence. This was revealed in the second experiment, which required subjects to differentiate frequency-transposed versions of Tone 1. Although subjects consistently chose the true Tone 1 as the best match to sentence intonation, they also confirmed, implicitly, that the familiar relation of frequency to apparent pitch obtains in this instance.

In implicating basic auditory analyzing mechanisms to account for the intonation of sinusoidal sentences, our claim suggests that definite impressions of pitch should typify tonal analogs of spoken syllables (a familiar kind of nonspeech control) whether or not phonetic segmental attributes are apparent. To take a recent instance, listeners who heard three-tone analogs of isolated steady-state vowels chose a rough approximation of the first-formant analog in matching their auditory impressions of these signals (Kuhl, Williams & Meltzoff, 1991); that result departed from the prediction that the vowels of English differed characteristically in their spectral centers of gravity, and that such integrated spectral properties ruled the perception of speech sounds. Because the listeners evidently did not perceive the tone complexes as vowels, Kuhl et al. concluded that speech and nonspeech sounds are accommodated by divergent perceptual analyses. From the perspective of our findings, though, it seems clear that pitch impressions of three-tone steady-state vowels would be derived from the frequency of the first formant analogs in any case, whether or not the spectrum of the tone complex is integrated in establishing a vowel impression. A subject who is instructed to match the vowel quality of a three-tone complex or to match its predominant pitch may attend to very different psychoacoustic attributes in each task. Although we admit that the paths to phonetic perception and auditory form perception diverge early on (Remez, Rubin, Berns, Pardo, & Lang, in press), the present studies show how intricate the interpretation of evidence can be.

While these experiments on sinusoidal sentence pitch have produced a novel instance corroborating the dominance region hypothesis, we have yet to observe a perceptual effect attributable to linguistic or paralinguistic attributes of the sentences. Some accounts of intonation report sentence-level effects based on departures from an underlying downdrift pattern, or from phrase-final fall (Vaissière, 1983). The violation of expectations based on typical properties of sentences may contribute to the apparent oddness of sinusoidal sentence melody, and to difficulties in intelligibility. But, despite the fact that tone analogs of sentences lack the familiar acoustic properties associated with a steady fundamental, we have not observed any effect of these gross departures in the registration of intonation. The pitch patterns of these sentences follow the prediction given by the dominance region hypothesis. Subsequent studies may identify precise points of departure between the impressions of intonation and the periodicity of the tone supporting the percept.

REFERENCES

Greenberg, S. (1980). Temporal neural coding of pitch and vowel quality. Working Papers in Phonetics, 52, 1-183. Los Angeles: U.C.L.A.
Grunke, M. E., & Pisoni, D. B. (1982). Perceptual learning of mirror-image acoustic patterns. Perception & Psychophysics, 31, 210-218.
Kuhl, P. K., Williams, K. A., & Meltzoff, A. N. (1991). Cross-modal speech perception in adults and infants using nonspeech auditory stimuli. Journal of Experimental Psychology: Human Perception and Performance, 17, 829-840.
Markel, J. D., & Gray, A. H., Jr. (1976). Linear prediction of speech. New York: Springer-Verlag.
Plomp, R. (1967). Pitch of complex tones. Journal of the Acoustical Society of America, 41, 1526-1533.
Remez, R. E., & Rubin, P. E. (1983). The stream of speech. Scandinavian Journal of Psychology, 24, 63-66.
Remez, R. E., & Rubin, P. E. (1984). On the perception of intonation from sinusoidal sentences. Perception & Psychophysics, 35, 429-440.
Remez, R. E., & Rubin, P. E. (1990). On the perception of speech from time-varying attributes: Contributions of amplitude variation. Perception & Psychophysics, 48, 313-325.
Remez, R. E., & Rubin, P. E. (in press). Acoustic shards, perceptual glue. In J. Charles-Luce, P. A. Luce, & J. R. Sawusch (Eds.), Theories in spoken language: Perception, production, and development. Norwood, NJ: Ablex Press.
Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (in press). On the perceptual organization of speech. Psychological Review.
Remez, R. E., Rubin, P. E., Nygaard, L. C., & Howell, W. A. (1987). Perceptual normalization of vowels produced by sinusoidal voices. Journal of Experimental Psychology: Human Perception and Performance, 13, 40-61.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947-950.
Ritsma, R. J. (1967). Frequencies dominant in the perception of the pitch of complex sounds. Journal of the Acoustical Society of America, 42, 191-198.
Rubin, P. E. (1980). Sinewave synthesis. Internal memorandum, Haskins Laboratories, New Haven, Connecticut.
Vaissière, J. (1983). Language-independent prosodic features. In A. Cutler & D. R. Ladd (Eds.), Prosody: Models and measurements (pp. 53-66). New York: Springer.
Ward, W. D. (1970). Musical perception. In J. V. Tobias (Ed.), Foundations of modern auditory theory, Vol. 1 (pp. 407-447). New York: Academic Press.

FOOTNOTES
*Journal of the Acoustical Society of America, 94, 1983-1988 (1993).
†Department of Psychology, Barnard College, New York.


Haskins Laboratories Status Report on Speech Research
1993, SR-113, 41-50

The Acquisition of Prosody: Evidence from French- and English-learning Infants*

Andrea G. Levitt†

The reduplicative babbling of five French- and five English-learning infants, recorded when the infants were between the ages of 7;3 months and 11;1 months on average, was examined for evidence of language-specific prosodic patterns. Certain fundamental frequency and syllable-timing patterns in the infants' utterances clearly reflected the influence of the ambient language. The evidence for language-specific influence on syllable amplitudes was less clear. The results are discussed in terms of a possible order of acquisition for the prosodic features of fundamental frequency, timing, and amplitude.

1. INTRODUCTION

Prosody is generally described in terms of three main suprasegmental features that vary in language-specific ways: the fundamental frequency contours, which give a language its characteristic melody; the duration or timing measures, which give a language its characteristic rhythm; and the amplitude patterns, which give a language its characteristic patterns of loud versus soft syllables. When does the prosody of infants' utterances begin to show language-specific effects?

To answer this question it is important first to understand the linguistic environment of the child, which is characterized by a special sociolinguistic register called child-directed speech (CDS). CDS has marked grammatical as well as prosodic characteristics, for which a number of possible uses have been suggested. It is also important to understand what is known about infants' sensitivity to the three prosodic features of speech. Since English and French provide very different prosodic models for young infants, they are excellent choices for investigating the issue of language-specific prosodic influences on infants' utterances. Analyzing the reduplicative babbling of two groups of infants, one learning French and the other English, Doug Whalen, Qi Wang and I have found evidence for the early acquisition of certain language-specific prosodic features. These results can be discussed in terms of a possible order of acquisition for language-specific prosodic features and in terms of evidence for possible regression in children's apparent sensitivity to prosodic information.

This work was supported by NIH grant DC00403 to Catherine Best and Haskins Laboratories. We thank the families of our French and American infants for their participation in this research.

2. CHILD-DIRECTED SPEECH (CDS)

In the last twenty-five years or so, researchers have documented the existence of child-directed speech (CDS), also known as "motherese," a special style of speech or linguistic register used with young first language learners (e.g., Ferguson et al., 1986). Most researchers consider CDS to be universal (e.g., Fernald et al., 1989; but cf. Bernstein Ratner & Pye, 1984; Heath, 1983). Compared to speech between adults, or adult-directed speech (ADS), CDS shows both special grammatical and prosodic features. From a grammatical perspective, CDS consists of shorter, simpler, and more concrete sentences, uses more repetitions, questions, and imperatives and more emphatic stress. From a prosodic perspective, CDS includes high pitch, slow rate, exaggerated pitch contours, long pauses, increased final-syllable lengthening, and whispering.

Some researchers have attributed adults' production of the higher pitch and more variable fundamental frequency of CDS to the preference of very young children for higher pitched sounds (Sachs, 1977), whereas others have focused on these prosodic characteristics as contributing to the expression of affection (Brown, 1977) or for attracting the child's attention (Garnica, 1977).

More recently, however, some investigators have argued for a more linguistically significant role for the prosodic characteristics of CDS. Thus, researchers have variously suggested that the prosodic patterns of CDS may help infants in learning how to identify their native language (Mehler et al., 1988); to identify important linguistic information, such as names for unfamiliar objects (Fernald & Mazzie, 1983); and even to parse the syntactic structures of their native language (Hirsh-Pasek et al., 1987). Some of our own current work suggests that the prosodic features of CDS may also serve to enhance speaker-specific properties of the speech signal.

As it turns out, not all of the linguistic features attributed to CDS are present at once. Indeed, certain features are quite unlikely to co-occur. Other sociolinguistic registers remain relatively stable over time, but CDS does not. In fact, it is characterized by notable systematic changes that appear linked to the developmental stage of the child spoken to (Bernstein Ratner, 1984, 1986; Malsheen, 1980; Stern, Spieker, Barnett, & MacKain, 1983). As do the other features of CDS, the prosodic aspects also appear to change over time. For example, pitch height and the use of whispering are reduced as children grow older (Garnica, 1977). There may even be changes in the types of fundamental frequency contours that a child hears over time. A recent study (Papoušek & Hwang, 1991) has shown that Mandarin CDS prosody, as produced for presyllabic infants, may even distort the fundamental lexical tones, which are each marked by specific fundamental frequency contours in the adult language. But Chinese children do go on to learn the appropriate tones, and indeed our preliminary analyses of Mandarin CDS, produced to an infant between 9 and 11 months of age, suggest that for the older infant there is considerably less distortion. Even if very early CDS has more universal than language-specific prosodic patterns (Papoušek, Papoušek, & Symmes, 1991), the CDS addressed to older infants, as well as all other forms of speech which young infants are likely to hear, provide ample exposure to language-specific prosodic patterns as well. What is known about young infants' sensitivity to the prosodic patterns of language?

3. INFANT RESPONSE TO PROSODY

Bull and his colleagues (Bull, Eilers, & Oller, 1984, 1985; Eilers, Bull, Oller, & Lewis, 1984) have shown that infants in the second half year of life can detect changes in each of the three prosodic parameters under discussion. Researchers have found that infants' response to fundamental frequency variation or intonation is particularly strong. Indeed, infants' strong response to CDS (Fernald, 1985) can be interpreted as a preference on their part for its special fundamental frequency contours (Fernald & Kuhl, 1987). In terms of early pitch production, Kessen, Levine, and Wendrick (1979) found that infants between 3 and 6 months of age were able to match with their voices the pitches of certain notes, and there have also been reports of young infants being able to match the fundamental frequency contours of spoken utterances (Lieberman, 1986). Once children have begun to speak, they can make communicative use of pitch, e.g., to distinguish a request from a label, even at the one-word stage (Galligan, 1987; Marcos, 1987).

Other research has demonstrated that infants show an early perceptual sensitivity to some specific rhythmic properties of language. For example, it has been shown that very young infants can discriminate two bisyllabic utterances when they differ in syllable stress (Jusczyk & Thompson, 1978; Spring & Dale, 1977). Infants could perform this task both when the syllable stress was cued by all three typical prosodic markers and when the stress was cued by duration alone (Spring & Dale, 1977). Furthermore, Fowler, Smith, and Tassinary (1985) found evidence that the basis for infants' perception of speech timing is stress beats, just as it is for adults. Relatively little attention has been paid to infants' sensitivity to amplitude or loudness differences, independent of their role in stress, except for the work of Bull and his colleagues (1984), mentioned above.

4. EARLY LANGUAGE-SPECIFIC PROSODIC INFLUENCES ON PRODUCTION

Although not all attempts to find support for early language-specific effects on infant utterances have been successful (see Locke, 1983, for a review), Boysson-Bardies and her colleagues were able to find such evidence in their cross-linguistic investigations of infant utterances. For example, using acoustic analysis, they found that the vowel formants of 10-month-old infants varied in ways consistent with the formant-frequency patterns in the adult languages (Boysson-Bardies, Halle, Sagart, & Durand, 1989). Some of our own research (Levitt & Utman, 1991), along with results from another study by Boysson-Bardies and her colleagues (Boysson-Bardies, Sagart, & Durand, 1984), suggested that young infants from different linguistic communities might also show early language-specific prosodic differences, which we decided to explore by comparing the utterances of French-learning and English-learning infants.

4.1 Prosodic differences in English and French

French and English differ in a number of ways on each of the three prosodic features. In terms of fundamental frequency contours, there are several differences, including contour shape (Delattre, 1965) and incidence of rising contours (Delattre, 1961). It is the difference in the incidence of rising contours that we investigated. Delattre (1961), who analyzed the speech of Simone de Beauvoir and Margaret Mead, found that the French speaker had many more rising continuation contours (93%) than her American counterpart (11%).

In our investigation of timing differences between French and English we focused on the syllable level, where we found at least three clear differences: the salience of final syllable lengthening, the timing of nonfinal syllables, and the interval between stressed syllables or, in other terms, the typical length of the prosodic word. In Figure 1, we can see the first two rhythmic properties illustrated for French and English. The graphs (taken from Levitt [1991]) show syllable timing measures based on reiterant productions by adult native speakers of French and English of words of two to five syllables in the two languages. The native speakers replaced the individual syllables of each word with the syllable /ma/, while preserving natural intonation and rhythm. To provide these data, ten native speakers of English and ten native speakers of French were asked to produce reiterant versions of a series of 30 words in their own language. The top graph represents timing measures for French words of two to five syllables, the middle graph represents English words of two and three syllables with all possible stress patterns, and the bottom graph represents English words of four and five syllables with a selection of stress patterns.

The first property, final-syllable lengthening, is a more salient feature of French than of English. Although both French and English exhibit final-syllable lengthening (breath-group final lengthening in French), final-syllable lengthening is more salient in French because French nonfinal syllables are not typically lengthened due to word stress, as are nonfinal syllables in English.


Figure 1. Syllable timing measures for words of two to five syllables in French (top panel) and English (bottom two panels).

The second property, nonfinal syllable timing, is also clearly different for the two languages. French has been classified as syllable-timed, with a rhythmic structure known as "isosyllabicity," which is characterized by syllables generally equal in length. However, this description ignores the obvious, important final-syllable lengthening we see in French. On the other hand, aside from the effects of emphatic stress and inherent segmental length differences, nonfinal syllables in French generally are equal in length. In Figure 1, in the top panel, the nonfinal syllables in French words of three to five syllables are quite equal in length, whereas English nonfinal syllables (in the bottom two panels) are not, because of variable word stress.

Finally, the third rhythmic property that we investigated, the length of a prosodic word, here defined as the number of syllables from one stressed syllable to the next, may be expected to differ in English and French, again because of differences in the stress patterns in the two languages. Information about the typical length of the prosodic word in French and English comes from studies by Fletcher (1991) and by Crystal and House (1990). Fletcher analyzed the conversational speech of six native speakers of French. Reanalyzing a portion of her data, we found that 56% of the speakers' polysyllabic "prosodic words," which included all unaccented syllables preceding an accented final syllable, were 4 or more syllables in length, on average. On the other hand, when we examined similar data from Crystal and House, who had analyzed the read speech of six English subjects, we found that prosodic words of 4 or more syllables accounted for only six percent of the total, on average. Thus, there is some evidence that interstress intervals or prosodic words tend to be longer in French.

How do the amplitude patterns of the two languages differ? Figure 2 shows the waveforms of the French word "population" with its reiterant version, spoken by a male native speaker of French on top, and the waveforms of the English word "population" with its reiterant version, spoken by a male native speaker of English on the bottom. The patterns in Figure 2 are very representative. Basically, French words tend to start high in amplitude and generally decline, so that final syllables, which are systematically longer than nonfinal syllables, tend to be lowest in amplitude or loudness. The French reiterant version of "population," on the right, which avoids loudness variations due to inherent amplitude differences in different segments, as on the left, looks rather like a Christmas tree on its side. On the other hand, as can be seen from the waveforms for the English words, nonfinal stressed syllables in English tend to have greater amplitude than surrounding syllables (as well as greater duration), although there is also a tendency for the last syllable in an English word to have lower amplitude if it is not stressed.

What sorts of evidence for language-specific prosodic structure might we find in the vocal productions of young infants themselves?


Figure 2. Waveforms (showing characteristic amplitude patterns) of French population and its reiterant version (top panel) and English population and its reiterant version (bottom panel).

4.2 Reduplicative babbling studies

In order to investigate whether prosodic differences in fundamental frequency contour, rhythm, or amplitude emerged in the vocal productions of French and American infants between the ages of 5 and 12 months, the babbling of five English-learning infants (three male and two female) and five French-learning infants (four male and one female) was recorded weekly by their parents at home. The French-learning infants were recorded in Paris and the English-learning infants were recorded in cities in the northeastern United States. Recordings began when the infants were between 4 and 6 months old and continued until they were between 9 and 17 months old. Each tape was phonetically transcribed, and all infant speechlike vocalizations were digitized for computer analysis. The vocalizations were divided into utterances, or breath groups, which were defined as a sequence of syllables that were separated from adjacent utterances by at least 700 ms of silence and which contained no internal silent periods longer than 450 ms in length. From the transcribed and digitized utterances, we selected all the reduplicative babbles, that is, those which contained the same consonant-like element as well as the same vowel-like element, repeated in an utterance of two or more syllables, according to our transcriptions.
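The utterance-segmentation rule is explicit enough to sketch in code. The fragment below groups hypothetical syllable onset/offset times using the 700-ms and 450-ms criteria; the text does not say how gaps between 450 and 700 ms were treated, so the handling of that case here is an assumption.

```python
# Illustrative sketch of the silence-based segmentation rule; data are invented.

def group_utterances(syllables, between_ms=700, within_ms=450):
    """syllables: list of (onset_ms, offset_ms) tuples, sorted by onset."""
    utterances, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        gap = nxt[0] - prev[1]
        if gap >= between_ms:
            utterances.append(current)   # >= 700 ms of silence: new utterance
            current = [nxt]
        elif gap <= within_ms:
            current.append(nxt)          # <= 450 ms: same utterance
        else:
            utterances.append(current)   # 450-700 ms: assumption, treat as a break
            current = [nxt]
    utterances.append(current)
    return utterances

sylls = [(0, 250), (400, 650), (1500, 1750), (1900, 2150)]  # invented times (ms)
print(len(group_utterances(sylls)))   # -> 2 utterances
```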

Using these criteria, we obtained 208 reduplicative utterances, approximately half (102) from the English-learning children and half (106) from the French-learning infants. (See Table 1, taken from Levitt & Wang, 1991.)

Table 1. Description of the source of the 208 stimuli.

Infant      Ages (in months)      Ages (in months) at which     Number of
            at which recordings   reduplicative babbles         reduplicative
            were made             were detected                 babbles

French Infants
  MB        5-11                  7-11                          24
  EC        6-12                  6-12                          42
  MS        5-16                  7-12                          23
  IZ        4-9                   5-7                            9
  NB        4-14                  8-13                           8

American Infants
  MA        5-16                  8-12                          24
  MM        5-17                  7-12                          35
  CR        5-17                  9-10                           7
  AB        5-17                  8-11                          18
  VB        4-15                  7-12                          18

4.2.1. Fundamental Frequency Contours. For our analysis of the fundamental frequency contours of the reduplicative babbles of the French and American infants, we decided to obtain contour judgments for the reduplicative babbles and to analyze them acoustically as well (Whalen, Levitt, & Wang, 1991). First, we asked a group of experienced listeners to judge whether each infant babble had a falling, a rising, a fall/rise, a rise/fall, or a flat contour. In order to make the perceptual judgments feasible, we limited our data set to those reduplicative babbles that were two or three syllables in length. We found both acoustic and perceptual evidence for language-specific effects in the F0 contours of the reduplicative babbles of the French- and English-learning infants.
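The five contour categories were assigned by trained listeners, not by software. Purely as an illustration of what the categories mean, the crude heuristic below labels a sampled F0 track with the same five categories; the threshold and the F0 values are invented.

```python
# Crude illustrative heuristic, not the perceptual procedure used in the study.

def label_contour(f0, tol=5.0):
    """Label an F0 track (Hz) as rise, fall, fall/rise, rise/fall, or flat."""
    start, mid, end = f0[0], f0[len(f0) // 2], f0[-1]
    if max(f0) - min(f0) < tol:
        return "flat"
    if mid < start - tol and mid < end - tol:
        return "fall/rise"
    if mid > start + tol and mid > end + tol:
        return "rise/fall"
    return "rise" if end > start else "fall"

print(label_contour([220, 240, 265]))   # -> rise
print(label_contour([260, 230, 255]))   # -> fall/rise
```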

Although about 65% of the perceptual judgments made for both the French and the American reduplicative babbles were either rise or fall, these two categories were about equally divided in the judgments of the French babbles, whereas about 75% were labelled fall for the American subjects. Thus, in agreement with the higher incidence of rising intonations in adult French speech (Delattre, 1961), the reduplicative babbles of our French infants showed a significantly higher incidence of rising F0 contours by comparison to those of our American infants.

The results of our acoustic analysis of the reduplicative babbles also supported our perceptual finding. All of the reduplicative babbles were categorized according to the contour opinion of the majority of the listeners and then acoustically analyzed. The contour patterns were then averaged for each language. The mean patterns for each of the contour types revealed an appropriate fundamental frequency curve, and statistical tests of the fundamental frequency values also support the finding that French infants produced more rising contours.

4.2.2. Timing Measures. What about timing measure differences in the infants' reduplicative babbling? We investigated this aspect of the French-learning and English-learning infants' production in another recent study (Levitt & Wang, 1991). Recall that final syllable lengthening is more salient in French, which also has more regularly timed nonfinal syllables, and longer prosodic words. Using the entire corpus of 208 utterances, we measured each syllable. A conservative criterion for measuring syllable length was adopted: the duration as measured included only the visibly voiced portion of each syllable. In order to test for final-syllable lengthening, we compared the length of the final syllables with the penultimate syllables in each utterance. To test for regularity in the timing of nonfinal syllables, we calculated the mean standard deviation of the nonfinal syllables in each utterance of three or more syllables, and to test for length of prosodic word, we looked at the number of syllables per utterance per child.
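The three timing measures lend themselves to a direct sketch. The fragment below computes final-syllable lengthening (final vs. penultimate duration), the spread of nonfinal syllables in utterances of three or more syllables, and syllables per utterance, over invented duration lists; it is illustrative, not the original analysis script.

```python
# Illustrative sketch of the three timing measures; durations (ms) are invented.
import statistics

utterances = [[180, 210, 310], [200, 190, 195, 290], [220, 230]]

def final_lengthening(utt):
    return utt[-1] / utt[-2]           # ratio > 1 means a longer final syllable

def nonfinal_sd(utt):
    return statistics.stdev(utt[:-1])  # spread of the nonfinal syllable durations

ratios = [final_lengthening(u) for u in utterances]
sds = [nonfinal_sd(u) for u in utterances if len(u) >= 3]   # 3+ syllables only
lengths = [len(u) for u in utterances]                      # prosodic-word proxy
print(ratios, sds, lengths)
```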

In Figure 3, the three graphs represent the results of our investigation of the timing properties. The top graph shows that the French infants had a significantly greater proportion of long final syllables (54%) than did the English-learning American infants (29%). In terms of the regularity of nonfinal syllables, as shown in the middle graph, French infants produced more regular nonfinal syllables overall, although that difference was not significant.

However, when we analyzed the nonfinal syllable timing measurements in terms of an early and a late stage of reduplicative babbling production for each of the infants, we found a significant interaction, in that the nonfinal syllables of the French infants tended to become more regular whereas the nonfinal syllables of the English-learning American infants tended to become more irregular. Finally, we also found a significant difference in the length of prosodic words, with the French infants producing considerably more long utterances (of four syllables or more) than the Americans, as shown in the bottom graph of Figure 3.

Figure 3. Comparison of French and English infants' syllable timing patterns for final-syllable lengthening (top panel), regularity of nonfinal syllables (middle panel), and number of syllables per utterance (bottom panel).

4.2.3. Amplitude Measures. What then about the last of the prosodic factors, amplitude or loudness? In order to answer this question, we first analyzed adult amplitude patterns in the two languages from the reiterant speech study mentioned earlier (Levitt, 1991). We chose five speakers of each language, three men and two women, at random. We measured the peak amplitude in each of the reiterant syllables produced by the adults and also of each of the reduplicated syllables produced by our French and American infants. Duration measures for each of the syllables had already been obtained.
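The linkage measure itself can be made concrete with a small sketch; the (duration, amplitude) tuple layout and the values are invented for illustration, not data from the study.

```python
# Sketch of the duration-amplitude linkage measure: does the longest
# final syllable also carry the lowest peak amplitude in the utterance?
# The (duration_ms, peak_amplitude_dB) tuples are invented examples.

def long_final_low_amp(syllables):
    durations = [d for d, _ in syllables]
    amplitudes = [a for _, a in syllables]
    return (durations[-1] == max(durations) and
            amplitudes[-1] == min(amplitudes))

# A French-like utterance: long, soft final syllable
print(long_final_low_amp([(170, 72), (175, 74), (240, 65)]))  # True
```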

Our results are pictured in the two graphs in Figure 4. As indicated in the top graph, we found that, as mentioned earlier, French adults tend to produce long final syllables with lowest amplitude (81%) significantly more often than American adults (45%) in their utterances overall [t(8)=3.2, p=.0061, one-tailed], although Americans did show a similar tendency for long finals with low amplitudes, especially for words without a final-syllable stress. The infants showed a similar pattern of results, with French infants linking long final syllables with lowest amplitude more often (33%) than English-learning American infants (21%), but this difference in the infant populations was not significant [t(8)=1.3, ns].

Figure 4. Comparison of duration-linked amplitude patterns for French- and English-speaking adults and French- and English-learning infants. The top panel shows the typical French pattern and the bottom panel shows the typical English pattern.

As displayed in the bottom graph of Figure 4, when we looked at nonfinal syllables in utterances of three or more syllables, we found that American adults tended to link long nonfinal syllables with highest amplitude or loudness (80%), significantly more often than the French adults (45%) [t(8)=4.1, p=.0016, one-tailed]. Similarly, American infants tended to produce their highest amplitudes on the longest nonfinal syllables (57%), whereas the French infants did so less often (41%). This latter difference between the two groups of infants approached significance [t(8)=1.6, p=.0711].

5. EVIDENCE FOR PROSODIC INFLUENCES IN CHILDREN'S LATER PRODUCTIONS

By the age of two, children have already largely mastered a number of the syllabic timing properties of their language. Thus, Allen (1983) has shown that French children exhibit final-syllable lengthening in polysyllabic words by two years of age. Although the patterns of final-lengthening produced by the children were more variable than those produced by French adults, the children's median ratios of final to nonfinal syllables were very comparable to those of French adults, roughly 1.6:1. Similarly, Smith (1978) has shown that English-speaking children between two and three years of age have mastered final-syllable lengthening as well, with a final to nonfinal ratio of close to 1.4:1 for both the adults and the children. Some research with two-year-old children learning tone languages (Li & Thompson, 1977) suggests that children can reproduce tonal patterns more accurately than speech segments, although certain tone contours appear easier to acquire than others.

6. POSSIBLE ORDER OF ACQUISITION OF PROSODIC FEATURES

Our results, taken together with some of the other research concerning infants' early perception (and occasional production) and young children's production of certain fundamental frequency and rhythmic properties, lend support to the notion that infants begin to imitate some of the prosodic properties of their native languages before they fully master their segmental properties. Specifically, it would appear that the more global properties of fundamental frequency and syllabic timing are acquired before amplitude patterns. Within each prosodic domain, there also appears to be some evidence for a learning sequence. Li and Thompson (1977), as noted above, found that children learning Mandarin acquired some tone contours, which are based on fundamental frequency, earlier than others. Similarly, our results on the acquisition of syllable timing suggest an early beginning for children's development of control of final-syllable lengthening and of utterance length, whereas acquiring the regular timing of nonfinal syllables in French appears more difficult. Children's vocal productions are notably more variable than those of adults and gradually move towards more adult-like stability as they gain increasing motor control (e.g., Kent, 1976). Producing regularly-timed syllables would thus be considerably more difficult for children than for adults.

However, before we relegate the child's control of the amplitude patterns of his/her language to the status of prosody's stepchild, we have to keep in mind that relatively little exploration has been done of the infant's sensitivity to language amplitude patterns, and that the present results dealt with two languages, English and French, for which differences in the amplitude patterns of syllables may be less important in perception than are the other prosodic variables. Until more direct tests are undertaken of infants' sensitivities to all of the prosodic features, and comparisons are made between languages such as English or French, on the one hand, and Ik, a language spoken in eastern Sudan, which contrasts voiceless and presumably low amplitude versus voiced, presumably high amplitude vowels, on the other hand (Maddieson, 1984), our conclusion that amplitude pattern control is acquired later than other prosodic features must be provisional.

7. EVIDENCE FOR REGRESSION IN PROSODIC LEARNING

We would also speculate, based on some of our own findings as well as on suggestions from the literature, that infants show a special sensitivity to prosody beginning perhaps at birth and lasting until about 9 or 10 months of age, when there may be some regression in the child's sensitivity to prosody. It would come, of course, at a time when Werker and her colleagues (e.g., Werker, 1989; Werker & Lalonde, 1988; Werker & Tees, 1984) and Best and her colleagues (Best, in press; Best, McRoberts, & Sithole, 1988) have shown that there is a shift in infants' phonetic perception of some nonnative segmental contrasts as their focus turns to learning the words of their native language.

Our evidence comes from both perception and production studies of prosody. Recently Catherine Best, Gerald McRoberts, and I (1991) investigated the ability of infants who were either 2-4, 6-8, or 10-12 months old to discriminate a prosodic contrast (questions versus statements) in English (their native language) and in Spanish, when there was segmental variation across the tokens representing statement and question types. The finding of interest is the comparison between the 6-8 month olds, who discriminated the prosodic contrast in both languages, and the 10-12 month olds, who failed to discriminate the questions from the statements in both Spanish and English. Another study, by D'Odorico and Franco (1991), which looked at infants' production of specific, prosodically-defined vocalization types in different communicative contexts, also suggested some evidence of decline toward the end of the first year. Apparently, the infants stop using these idiosyncratic, context-determined vocalizations at around 9 months of age. Finally, we also found a tendency, though not significant, for some infants in our production study to produce less consistent final syllable lengthening as they began to produce their first words (Levitt & Wang, 1991).

Although the evidence is very preliminary, the period beginning at 10 months and extending until some time before the second birthday may be marked by some "regression" in young infants' perception and production of prosodic information.

8. CONCLUSION

It is important to remember that children learn much more from prosody than its language-specific characteristics. Prosody has a number of paralinguistic functions, so that it teaches, for example, about turn taking (e.g., Schaffer, 1983) and about the expression of emotion as well (e.g., Scherer, Ladd, & Silverman, 1984). In addition to language-specific and paralinguistic functions, prosody also serves some strictly grammatical linguistic functions, such as distinguishing between questions and statements or between words in a tone language. Mapping out the complete path by which children acquire the paralinguistic, language-specific, and grammatically significant prosodic patterns of their native languages, beginning from what appears to be quite an early start, has just begun.

REFERENCES
Allen, G. (1983). Some suprasegmental contours in French two-year-old children's speech. Phonetica, 40, 269-292.
Bernstein Ratner, N. (1984). Patterns of vowel modification in mother-child speech. Journal of Child Language, 11, 557-578.
Bernstein Ratner, N. (1986). Durational cues which mark clause boundaries in mother-child speech. Journal of Phonetics, 14, 303-309.
Bernstein Ratner, N., & Pye, C. (1984). Higher pitch in BT is not universal: Acoustic evidence from Quiche Mayan. Journal of Child Language, 11, 515-522.
Best, C., Levitt, A., & McRoberts, G. (1991). Examination of language-specific influences in infants' discrimination of prosodic categories. Actes du XIIème Congrès International des Sciences Phonétiques (pp. 162-165). Aix-en-Provence, France: Université de Provence Service des Publications.
Best, C., McRoberts, G., & Sithole, N. (1988). The phonological basis of perceptual loss for non-native contrasts: Maintenance of discrimination among Zulu clicks by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance, 14, 345-360.
Boysson-Bardies, B. de, Sagart, L., & Durand, C. (1984). Discernible differences in the babbling of infants according to target language. Journal of Child Language, 11, 1-15.
Boysson-Bardies, B. de, Hallé, P., Sagart, L., & Durand, C. (1989). A crosslinguistic investigation of vowel formants in babbling. Journal of Child Language, 16, 1-17.
Brown, R. (1977). Introduction. In C. Snow & C. Ferguson (Eds.), Talking to children: Language input and acquisition. Cambridge: Cambridge University Press.
Bull, D., Eilers, E., & Oller, D. (1984). Infants' discrimination of intensity variation in multisyllabic contexts. Journal of the Acoustical Society of America, 76, 1-13.
Bull, D., Eilers, E., & Oller, D. (1985). Infants' discrimination of final syllable fundamental frequency in multisyllabic stimuli. Journal of the Acoustical Society of America, 77, 289-295.
Crystal, T., & House, A. (1990). Articulation rate and the duration of syllables and stress groups in connected speech. Journal of the Acoustical Society of America, 88, 101-112.
Delattre, P. (1961). La leçon d'intonation de Simone de Beauvoir, étude d'intonation déclarative comparée. The French Review, 35, 59-67.
Delattre, P. (1965). Comparing the phonetic features of English, French, German and Spanish: An interim report. Philadelphia: Chilton.
D'Odorico, L., & Franco, F. (1991). Selective production of vocalization types in different communicative contexts. Journal of Child Language, 18, 475-499.
Eilers, E., Bull, D., Oller, D., & Lewis, D. (1984). The discrimination of vowel duration by infants. Journal of the Acoustical Society of America, 75, 213-218.
Ferguson, C. (1964). Baby talk in six languages. American Anthropologist, 66, 103-114.
Ferguson, C. (1978). Talking to children: A search for universals. In J. Greenberg, C. Ferguson, & E. Moravcsik (Eds.), Universals of human language (pp. 203-224). Stanford: Stanford University Press.
Fernald, A. (1985). Four-month-old infants prefer to listen to motherese. Infant Behavior and Development, 8, 181-195.
Fernald, A., & Kuhl, P. (1987). Acoustic determinants of infant preference for motherese speech. Infant Behavior and Development, 8, 279-293.
Fernald, A., & Mazzie, C. (1983, April). Pitch marking of new and old information in mothers' speech. Paper presented at the meeting of the Society for Research in Child Development, Detroit.
Fernald, A., Taeschner, T., Dunn, J., Papoušek, M., Boysson-Bardies, B. de, & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. Journal of Child Language, 16, 477-501.
Fletcher, J. (1991). Rhythm and final lengthening in French. Journal of Phonetics, 19, 193-212.
Fowler, C., Smith, M., & Tassinary, L. (1985). Perception of syllable timing by prebabbling infants. Journal of the Acoustical Society of America, 79, 814-825.
Galligan, R. (1987). Intonation with single words: Purposive and grammatical use. Journal of Child Language, 14, 1-21.
Garnica, O. (1977). On some prosodic and paralinguistic features of speech to young children. In C. Snow & C. Ferguson (Eds.), Talking to children: Language input and acquisition. Cambridge: Cambridge University Press.
Haynes, L., & Cooper, R. (1986). A note on Ferguson's proposed baby-talk universals. In The Fergusonian impact: Papers in honor of Charles A. Ferguson on the occasion of his 65th birthday. Berlin: Mouton de Gruyter.
Heath, S. B. (1983). Ways with words. Cambridge: Cambridge University Press.
Hirsh-Pasek, K., Kemler Nelson, D. G., Jusczyk, P. W., Wright, K., Druss, B., & Kennedy, L. (1987). Clauses are perceptual units for young infants. Cognition, 26, 269-286.
Jusczyk, P., & Thompson, E. (1978). Perception of a phonetic contrast in multisyllabic utterances by 2-month-old infants. Perception and Psychophysics, 23, 105-109.
Kent, R. (1976). Anatomical and neuromuscular maturation of the speech mechanism: Evidence from acoustic studies. Journal of Speech and Hearing Research, 19, 421-445.
Kessen, W., Levine, J., & Wendrick, K. (1979). The imitation of pitch in infants. Infant Behavior and Development, 2, 93-100.
Levitt, A. (1991). Reiterant speech as a test of nonnative speakers' mastery of the timing of French. Journal of the Acoustical Society of America, 90, 3008-3018.
Levitt, A., & Utman, J. (1991). From babbling towards the sound systems of English and French: A longitudinal two-case study. Journal of Child Language, 19, 19-49.
Levitt, A., & Wang, Q. (1991). Evidence for language-specific rhythmic influences in the reduplicative babbling of French- and English-learning infants. Language and Speech, 34, 235-249.
Li, C., & Thompson, S. (1977). The acquisition of tone in Mandarin-speaking children. Journal of Child Language, 4, 185-199.
Locke, J. L. (1983). Phonological acquisition and change. New York: Academic Press.
Maddieson, I. (1984). Patterns of sounds. New York: Cambridge University Press.
Malsheen, B. (1980). Two hypotheses for phonetic clarification in the speech of mothers to children. In G. Yeni-Komshian, J. F. Kavanagh, & C. A. Ferguson (Eds.), Child phonology: Vol. 1. Perception. New York: Academic Press.
Marcos, H. (1987). Communicative functions of pitch range and pitch direction in infants. Journal of Child Language, 14, 255-268.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29, 143-178.
Papoušek, M., & Hwang, S.-F. (1991). Tone and intonation in Mandarin babytalk to presyllabic infants: Comparison with registers of adult conversation and foreign language instruction. Applied Psycholinguistics, 12, 481-504.
Papoušek, M., Papoušek, H., & Symmes, D. (1991). The meaning of melodies in motherese in tone and stress languages. Infant Behavior and Development, 14, 415-440.
Sachs, J. (1977). The adaptive significance of linguistic input to prelinguistic infants. In C. Snow & C. Ferguson (Eds.), Talking to children: Language input and acquisition. Cambridge: Cambridge University Press.
Schaffer, D. (1983). The role of intonation as a cue to turn taking in conversation. Journal of Phonetics, 11, 243-257.
Scherer, K., Ladd, D., & Silverman, K. (1984). Vocal cues to speaker affect: Testing two models. Journal of the Acoustical Society of America, 76, 1346-1356.
Smith, B. (1978). Temporal aspects of English speech production: A developmental perspective. Journal of Phonetics, 6, 37-67.
Spring, D., & Dale, P. (1977). The discrimination of linguistic stress in early infancy. Journal of Speech and Hearing Research, 20, 224-231.
Stern, D. N., Spieker, S., Barnett, R. K., & MacKain, K. (1983). The prosody of maternal speech: Infant age and context-related changes. Journal of Child Language, 10, 1-15.
Werker, J. F. (1989). Becoming a native listener. American Scientist, 77, 54-59.
Werker, J. F., & Lalonde, C. E. (1988). Cross-language speech perception: Initial capabilities and developmental change. Developmental Psychology, 24, 672-683.
Werker, J., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7, 49-63.
Whalen, D. H., Levitt, A., & Wang, Q. (1991). Intonational differences between the reduplicative babbling of French- and English-learning infants. Journal of Child Language, 18, 501-516.

FOOTNOTES
*Appears in Proceedings of the NATO Advanced Research Workshop on Changes in Speech and Face Processing in Infancy: A Glimpse at Developmental Mechanisms of Cognition, Carry-le-Rouet, France, June 29-July 3, 1992. Dordrecht, The Netherlands: Kluwer Academic Publishers.
†Also Wellesley College, Wellesley.


Haskins Laboratories Status Report on Speech Research
1993, SR-113, 51-62

Dynamics and Articulatory Phonology*

Catherine P. Browman and Louis Goldstein†

1. INTRODUCTION

Traditionally, the study of human speech and its patterning has been approached in two different ways. One way has been to consider it as mechanical or bio-mechanical activity (e.g., of articulators or air molecules or cochlear hair cells) that changes continuously in time. The other way has been to consider it as a linguistic (or cognitive) structure consisting of a sequence of elements chosen from a closed inventory. Development of the tools required to describe speech in one or the other of these approaches has proceeded largely in parallel, with one hardly informing the other at all (some notable exceptions are discussed below). As a result, speech has been seen as having two structures, one considered physical, and the other cognitive, where the relation between the two structures is generally not an intrinsic part of either description. From this perspective, a complete picture requires 'translating' between the intrinsically incommensurate domains (as argued by Fowler, Rubin, Remez, & Turvey, 1980).

The research we have been pursuing (Browman & Goldstein, 1986; 1989; 1990a,b; 1992) ('articulatory phonology') begins with the very different assumption that these apparently different domains are, in fact, the low and high dimensional descriptions of a single (complex) system. Crucial to this approach is the identification of phonological units with dynamically specified units of articulatory action, called gestures. Thus, an utterance is described as an act that can be decomposed into a small number of primitive units (a low dimensional description), in a particular spatio-temporal configuration.

This work was supported by NSF grant DBS-9112198 and NIH grants HD-01994 and DC-00121 to Haskins Laboratories. Thanks to Alice Faber and Jeff Shaw for comments on an earlier version.



The same description also provides an intrinsic specification of the high dimensional properties of the act (its various mechanical and bio-mechanical consequences).

In this chapter, we will briefly examine the nature of the low and high dimensional descriptions of speech, and contrast the dynamical perspective that unifies these with other approaches in which they are separated as properties of mind and body. We will then review some of the basic assumptions and results of developing a specific model incorporating dynamical units, and illustrate how it provides both low and high dimensional descriptions.

2. DIMENSIONALITY OF DESCRIPTION

Human speech events can be seen as quite complex, in the sense that an individual utterance follows a continuous trajectory through a space defined by a large number of potential degrees of freedom, or dimensions. This is true whether the dimensions are neural, articulatory, acoustic, aerodynamic, auditory, or other ways of describing the event. The fundamental insight of phonology, however, is that the pronunciation of the words in a given language may differ from (that is, contrast with) each other in only a restricted number of ways: the number of degrees of freedom actually employed in this contrastive behavior is far fewer than the number that is mechanically available. This insight has taken the form of the hypothesis that words can be decomposed into a small number of primitive units (usually far fewer than one hundred in a given language) which can combine in different ways to form the large number of words required in human lexicons. Thus, as argued by Kelso, Saltzman, and Tuller (1986), human speech is characterized not only by a high number of potential (microscopic) degrees of freedom, but also by a low dimensional (macroscopic) form. This macroscopic form is usually called the 'phonological' form. As will be suggested below, this collapse of degrees of freedom can possibly be understood as an instance of the kind of self-organization found in other complex systems in nature (Haken, 1977; Kauffman, 1991; Kugler & Turvey, 1987; Madore & Freedman, 1987; Schöner & Kelso, 1988).

Historically, however, the gross differences between the macroscopic and microscopic scales of description have led researchers to ignore one or the other description, or to assert its irrelevance, and hence to generally separate the cognitive and the physical. Anderson (1974) describes how the development of tools in the nineteenth and early twentieth centuries led to the quantification of more and more details of the speech signal, but "with such increasingly precise description, however, came the realization that much of it was irrelevant to the central tasks of linguistic science" (p. 4). Indeed, the development of many early phonological theories (e.g., those of Saussure, Trubetzkoy, Sapir, Bloomfield) proceeded largely without any substantive investigation of the measurable properties of the speech event at all (although Anderson notes Bloomfield's insistence that the smallest phonological units must ultimately be defined in terms of some measurable properties of the speech signal). In general, what was seen as important about phonological units was their function, their ability to distinguish utterances.

A particularly telling insight into this view of the lack of relation between the phonological and physical descriptions can be seen in Hockett's (1955) familiar Easter egg analogy. The structure serving to distinguish utterances (for Hockett, a sequence of letter-sized phonological units called phonemes) was viewed as a row of colored, but unboiled, Easter eggs on a moving belt. The physical structure (for Hockett, the acoustic signal) was imagined to be the result of running the belt through a wringer, effectively smashing the eggs and intermixing them. It is quite striking that, in this analogy, the cognitive structure of the speech event cannot be seen in the gooey mess itself. For Hockett, the only way the hearer can respond to the event is to infer (on the basis of obscured evidence, and knowledge of possible egg sequences) what sequence of eggs might have been responsible for the mess. It is clear that in this view, the relation between cognitive and physical descriptions is neither systematic nor particularly interesting. The descriptions share color as an important attribute, but beyond that there is little relation.

A major approach that did take seriously the goal of unifying the cognitive and physical aspects of speech description was that in the Sound Pattern of English (Chomsky & Halle, 1968), including the associated work on the development of the theory of distinctive features (Jakobson, Fant, & Halle, 1951) and the quantal relations that underlie them (Stevens, 1972, 1989). In this approach, an utterance is assigned two representations: a 'phonological' one, whose goal is to describe how the utterance functions with respect to contrast and patterns of alternation, and a 'phonetic' one, whose goal is to account for the grammatically determined physical properties of the utterance. Crucially, however, the relation between the representations is quite constrained: both descriptions employ exactly the same set of dimensions (the features). The phonological representation is coarser in that features may take on only binary values, while the phonetic representation is more fine-grained, with the features having scalar values. However, a principled relation between the binary values and the scales is also provided: Stevens' quantal theory attempts to show how the potential continuum of scalar feature values can be intrinsically partitioned into categorical regions, when the mapping from articulatory dimensions to auditory properties is considered. Further, the existence of such quantal relations is used to explain why languages employ these particular features in the first place.

Problems raised with this approach to speech description soon led to its abandonment, however. One problem is that its phonetic representations were shown to be inadequate to capture certain systematic physical differences between utterances in different languages (Keating, 1985; Ladefoged, 1980; Port, 1981). The scales used in the phonetic representations are themselves of reduced dimensionality, when compared to a complete physical description of utterances. Chomsky and Halle hypothesized that such further details could be supplied by universal rules. However, the above authors (also Browman & Goldstein, 1986) argued that this would not work: the same phonetic representation (in the Chomsky and Halle sense) can have different physical properties in different languages. Thus, more of the physical detail (and particularly details having to do with timing) would have to be specified as part of the description of a particular language. Ladefoged's (1980) argument cut even deeper. He argued that there is a system of scales that is useful for characterizing the measurable articulatory and acoustic properties of utterances, but that these scales are very different from the features proposed by Chomsky and Halle.

One response to these failings has been to hypothesize that descriptions of speech should include, in addition to phonological rules of the usual sort, rules that take (cognitive) phonological representations as input and convert them to physical parameterizations of various sorts. These rules have been described as rules of 'phonetic implementation' (e.g., Keating, 1985; Keating, 1990; Klatt, 1976; Liberman & Pierrehumbert, 1984; Pierrehumbert, 1990; Port, 1981). Note that in this view, the description of speech is divided into two separate domains, involving distinct types of representations: the phonological or cognitive structure and the phonetic or physical structure. This explicit partitioning of the speech side of linguistic structure into separate phonetic and phonological components, which employ distinct data types that are related to one another only through rules of phonetic implementation (or 'interpretation'), has stimulated a good deal of research (e.g., Cohn, 1990; Coleman, 1992; Fourakis & Port, 1986; Keating, 1988; Liberman & Pierrehumbert, 1984). However, there is a major price to be paid for drawing such a strict separation: it becomes very easy to view phonetic and phonological (physical and cognitive) structures as essentially independent of one another, with no interaction or mutual constraint. As Clements (1992) describes the problem: "The result is that the relation between the phonological and phonetic components is quite unconstrained. Since there is little resemblance between them, it does not matter very much for the purposes of phonetic interpretation what the form of the phonological input is; virtually any phonological description can serve its purposes equally well" (p. 192). Yet, there is a constrained relation between the cognitive and physical structures of speech, which is what drove the development of feature theory in the first place.

In our view, the relation between the physical and cognitive, i.e., phonetic and phonological, aspects of speech is inherently constrained by their being simply two levels of description, the microscopic and macroscopic, of the same system. Moreover, we have argued that the relation between microscopic and macroscopic properties of speech is one of mutual or reciprocal constraint (Browman & Goldstein, 1990b). As we elaborated there, the existence of such reciprocity is supported by two different lines of research. One line has attempted to show how the macroscopic properties of contrast and combination of phonological units arise from, or are constrained by, the microscopic, i.e., the detailed properties of speech articulation and the relations among speech articulation, aerodynamics, acoustics, and audition (e.g., Lindblom, MacNeilage, & Studdert-Kennedy, 1983; Ohala, 1983; Stevens, 1972; 1989). A second line has shown that there are constraints running in the opposite direction, such that the (microscopic) detailed articulatory or acoustic properties of particular phonological units are determined, in part, by the macroscopic system of contrast and combination found in a particular language (e.g., Keating, 1990; Ladefoged, 1982; Manuel & Krakow, 1984; Wood, 1982). The apparent existence of this bi-directionality is of considerable interest, because recent studies of the generic properties of complex physical systems have demonstrated that reciprocal constraint between macroscopic and microscopic scales is a hallmark of systems displaying 'self-organization' (Kugler & Turvey, 1987; see also discussions by Langton in Lewin, 1992 [pp. 12-14; 188-191], and work on the emergent properties of "co-evolving" complex systems: Hogeweg, 1989; Kauffman, 1989; Kauffman & Johnsen, 1991; Packard, 1989).

Such self-organizing systems (hypothesized as underlying such diverse phenomena as the construction of insect nests and evolutionary and ecological dynamics) display the property that the 'local' interactions among a large number of microscopic system components can lead to emergent patterns of 'global' organization and order. The emergent global organization also places constraints on the components and their local interactions. Thus, self-organization provides a principled linkage between descriptions of different dimensionality of the same system: the high-dimensional description (with many degrees of freedom) of the local interactions and the low-dimensional description (with few degrees of freedom) of the emergent global patterns. From this point of view, then, speech can be viewed as a single complex system (with low-dimensional macroscopic and high-dimensional microscopic properties) rather than as two distinct components.

A different recent attempt to articulate the nature of the constraints holding between the cognitive and physical structures can be found in Pierrehumbert (1990), in which the relation between the structures is argued to be a 'semantic' one, parallel to the relation that obtains between concepts and their real world denotations. In this view, macroscopic structure is constrained by the microscopic properties of speech and by the principles guiding human cognitive category-formation. However, the view fails to account for the apparent bi-directionality of the constraints. That is, there is no possibility of constraining the microscopic properties of speech by its macroscopic properties in this view. (For a discussion of possible limitations to a dynamic approach to phonology, see Pierrehumbert & Pierrehumbert, 1990.)

The 'articulatory phonology' that we have been developing (e.g., Browman & Goldstein, 1986, 1989, 1992) attempts to understand phonology (the cognitive) as the low-dimensional macroscopic description of a physical system. In this work, rather than rejecting Chomsky and Halle's constrained relation between the physical and cognitive, as the phonetic implementation approaches have done, we have, if anything, increased the hypothesized tightness of that relation by using the concept of different dimensionality. We have surmised that the problem with the program proposed by Chomsky and Halle was instead in their choice of the elementary units of the system. In particular, we have argued that it is wrong to assume that the elementary units are (1) static, (2) neutral between articulation and acoustics, and (3) arranged in non-overlapping chunks. Assumptions (1) and (3) have been argued against by Fowler et al. (1980), and (3) has also been rejected by most of the work in 'nonlinear' phonology over the past 15 years. Assumption (2) has been, at least partially, rejected in the 'active articulator' version of 'feature geometry' (Halle, 1982; Sagey, 1986; McCarthy, 1988).

3. GESTURES

Articulatory phonology takes seriously the view that the units of speech production are actions, and therefore that (1) they are dynamic, not static. Further, since articulatory phonology considers phonological functions such as contrast to be low-dimensional, macroscopic descriptions of such actions, the basic units are (2) not neutral between articulation and acoustics, but rather are articulatory in nature. Thus, in articulatory phonology, the basic phonological unit is the articulatory gesture, which is defined as a dynamical system specified with a characteristic set of parameter values (see Saltzman, in press). Finally, because the tasks are distributed across the various articulator sets of the vocal tract (the lips, tongue, glottis, velum, etc.), an utterance is modeled as an ensemble, or constellation, of a small number of (3) potentially overlapping gestural units.

As will be elaborated below, contrast among utterances can be defined in terms of these gestural constellations. Thus, these structures can capture the low-dimensional properties of utterances. In addition, because each gesture is defined as a dynamical system, no rules of implementation are required to characterize the high-dimensional properties of the utterance. A time-varying pattern of articulator motion (and its resulting acoustic consequences) is lawfully entailed by the dynamical systems themselves: they are self-implementing. Moreover, these time-varying patterns automatically display the property of context dependence (which is ubiquitous in the high dimensional description of speech) even though the gestures are defined in a context-independent fashion. The nature of the articulatory dimensions along which the individual dynamical units are defined allows this context dependence to emerge lawfully.

The articulatory phonology approach has been incorporated into a computational system being developed at Haskins Laboratories (Browman & Goldstein, 1990a,c; Browman, Goldstein, Kelso, Rubin, & Saltzman, 1984; Saltzman, 1986; Saltzman & Munhall, 1989). In this system, illustrated in Figure 1, utterances are organized ensembles (or constellations) of units of articulatory action called gestures. Each gesture is modeled as a dynamical system that characterizes the formation (and release) of a local constriction within the vocal tract (the gesture's functional goal or 'task'). For example, the word "ban" begins with a gesture whose task is lip closure.

Figure 1. Computational system for generating speech using dynamically-defined articulatory gestures. (The intended utterance is input to the linguistic gestural model, whose output drives the task dynamic model; the resulting articulatory trajectories are input to the vocal tract model, which produces the output speech.)

The formation of this constriction entails a change in the distance between upper and lower lips (or Lip Aperture) over time. This change is modeled using a second order system (a 'point attractor,' Abraham & Shaw, 1982), specified with particular values for the equilibrium position and stiffness parameters. (Damping is, for the most part, assumed to be critical, so that the system approaches its equilibrium position and doesn't overshoot it.) During the activation interval for this gesture, the equilibrium position for Lip Aperture is set to the goal value for lip closure; the stiffness setting, combined with the damping, determines the amount of time it will take for the system to get close to the goal of lip closure.
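A minimal numerical sketch of such a critically damped point attractor follows; the stiffness value, time step, and simple Euler integration are illustrative assumptions, not the actual task-dynamic implementation.

```python
import math

# One critically damped gesture as a second-order point attractor:
# x'' = -k(x - goal) - b*x', with b = 2*sqrt(k) for critical damping.

def simulate_gesture(x_start, goal, stiffness, dt=0.001, t_max=0.3):
    damping = 2.0 * math.sqrt(stiffness)   # critical damping coefficient
    x, v, track = x_start, 0.0, []
    for _ in range(int(t_max / dt)):
        accel = -stiffness * (x - goal) - damping * v
        v += accel * dt
        x += v * dt
        track.append(x)
    return track

# Lip Aperture driven from 10 mm toward closure (0 mm); the stiffness
# setting determines how quickly the goal is approached.
aperture = simulate_gesture(x_start=10.0, goal=0.0, stiffness=900.0)
print(round(aperture[-1], 2))  # ~0.01 mm: near the goal, with no overshoot
```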

The set of task or tract variables currently implemented in the computational model are listed at the top left of Figure 2, and the sagittal vocal tract shape below illustrates their geometric definitions. This set of tract variables is hypothesized to be sufficient for characterizing most of the gestures of English (exceptions involve the details of characteristic shaping of constrictions; see Browman & Goldstein, 1989). For oral gestures, two paired tract variable regimes are specified, one controlling the constriction degree of a particular structure, the other its constriction location (a tract variable regime consists of a set of values for the dynamic parameters of stiffness, equilibrium position, and damping ratio). Thus, the specification for an oral gesture includes an equilibrium position, or goal, for each of two tract variables, as well as a stiffness (which is currently yoked across the two tract variables). Each functional goal for a gesture is achieved by the coordinated action of a set of articulators, that is, a coordinative structure (Fowler et al., 1980; Kelso, Saltzman, & Tuller, 1986; Saltzman, 1986; Turvey, 1977); the sets of articulators used for each of the tract variables are shown on the top right of Figure 2, with the articulators indicated on the outline of the vocal tract model below. Note that the same articulators are shared by both of the paired oral tract variables, so that altogether there are five distinct articulator sets, or coordinative structure types, in the system.

articulators involved

LP lip protrusionLA lip aperture

TTCL tongue tip constrict locationTTCD tongue tip constrict degree

TBCL tongue body constrict locationTBCD tongue body constrict degree

VEL velic aperture

GLO glottal aperture

VEL

*-TBCL?-+TTCL

TBCD

Figure 2. Tract variables and their associated articulators.

upper & lower lips, jawupper & lower lips, jaw

tongue tip, tongue body, jawtongue tip, tongue body, jaw

tongue body, jawtongue body, jaw

velum

glottis

velumtongue t p

tonguebody

center

glottis

upper lip

lower lip

law



In the computational system, the articulators are those of a vocal tract model (Rubin, Baer, & Mermelstein, 1981) that can generate speech waveforms from a specification of the positions of individual articulators. When a dynamical system (or pair of them) corresponding to a particular gesture is imposed on the vocal tract, the task-dynamic model (Saltzman, in press; Saltzman, 1986; Saltzman & Kelso, 1987; Saltzman & Munhall, 1989) calculates the time-varying trajectories of the individual articulators comprising that coordinative structure, based on the information about values of the dynamic parameters, etc., contained in its input. These articulator trajectories are input to the vocal tract model, which then calculates the resulting global vocal tract shape, area function, transfer function, and speech waveform (see Figure 1).

Defining gestures dynamically can provide a principled link between macroscopic and microscopic properties of speech. To illustrate some of the ways in which this is true, consider the example of lip closure. The values of the dynamic parameters associated with a lip closure gesture are macroscopic properties that define it as a phonological unit and allow it to contrast with other gestures such as the narrowing gesture for [w]. These values are definitional, and remain invariant as long as the gesture is active. At the same time, however, the gesture intrinsically specifies the (microscopic) patterns of continuous change that the lips can exhibit over time. These changes emerge as the lawful consequences of the dynamical system, its parameters, and the initial conditions. Thus, dynamically defined gestures provide a lawful link between macroscopic and microscopic properties.

While tract variable goals are specified numerically, and in principle could take on any real value, the actual values used to specify the gestures of English in the model cluster in narrow ranges that correspond to contrastive categories: for example, in the case of constriction degree, different ranges are found for gestures that correspond to what are usually referred to as stops, fricatives, and approximants. Thus, paradigmatic comparison (or a density distribution) of the numerical specifications of all English gestures would reveal a macroscopic structure of contrastive categories. The existence of such narrow ranges is predicted by approaches such as the quantal theory (e.g., Stevens, 1989) and the theory of adaptive dispersion (e.g., Lindblom, MacNeilage, & Studdert-Kennedy, 1983), although the dimensions investigated in those approaches are not identical to the tract variable dimensions. These approaches can be seen as accounting for how microscopic continua are partitioned into a small number of macroscopic categories.

The physical properties of a given phonological unit vary considerably depending on its context (e.g., Kent & Minifie, 1977; Öhman, 1966; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). Much of this context dependence emerges lawfully from the use of task dynamics. An example of this kind of context dependence in lip closure gestures can be seen in the fact that the three independent articulators that can contribute to closing the lips (upper lip, lower lip, and jaw) do so to different extents as a function of the vowel environment in which the lip closure is produced (Macchi, 1988; Sussman, MacNeilage, & Hanson, 1973). The value of lip aperture achieved, however, remains relatively invariant no matter what the vowel context. In the task-dynamic model, the articulator variation results automatically from the fact that the lip closure gesture is modeled as a coordinative structure that links the movements of the three articulators in achieving the lip closure task. The gesture is specified invariantly in terms of the tract variable of lip aperture, but the closing action is distributed across component articulators in a context-dependent way. For example, in an utterance like [ibi], the lip closure is produced concurrently with the tongue gesture for a high front vowel. This vowel gesture will tend to raise the jaw, and thus less activity of the upper and lower lips will be required to effect the lip closure goal than in an utterance like [aba]. These microscopic variations emerge lawfully from the task dynamic specification of the gestures, combined with the fact of overlap (Kelso, Saltzman, & Tuller, 1986; Saltzman & Munhall, 1989).
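This division of labor can be caricatured with a toy sketch: the tract-variable goal is met in every vowel context, while the articulator contributions vary with jaw height. The linear geometry and fixed sharing weights below are invented for illustration; the actual task-dynamic model resolves this redundancy dynamically rather than with fixed weights.

```python
# Toy coordinative structure for lip closure: the lip-aperture goal is
# reached regardless of context, but the work is shared among jaw,
# upper lip, and lower lip, so articulator motions are context-dependent.

def close_lips(jaw_height, aperture_goal=0.0):
    current_aperture = 12.0 - jaw_height   # toy vocal-tract geometry
    required_closing = current_aperture - aperture_goal
    return {"upper_lip": 0.3 * required_closing,
            "lower_lip": 0.5 * required_closing,
            "jaw":       0.2 * required_closing}

print(close_lips(jaw_height=6.0))  # [ibi]-like: high jaw, small lip movements
print(close_lips(jaw_height=2.0))  # [aba]-like: low jaw, larger lip movements
```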

4. GESTURAL STRUCTURES

During the act of talking, more than one gesture is activated, sometimes sequentially and sometimes in an overlapping fashion. Recurrent patterns of gestures are considered to be organized into gestural constellations. In the computational model (see Figure 1), the linguistic gestural model determines the relevant constellations for any arbitrary input utterance, including the phasing of the gestures. That is, a constellation of gestures is a set of gestures that are coordinated with one another by means of phasing, where for this purpose (and this purpose only), the dynamical regime for each gesture is treated as if it were a cycle of an undamped system with the same stiffness as the actual regime. In this way, any characteristic point in the motion of the system can be identified with a phase of this virtual cycle. For example, the movement onset of a gesture is at phase 0 degrees, while the achievement of the constriction goal (the point at which the critically damped system gets sufficiently close to the equilibrium position) occurs at phase 240 degrees. Pairs of gestures are coordinated by specifying the phases of the two gestures that are synchronous. For example, two gestures could be phased so that their movement onsets are synchronous (0 degrees phased to 0 degrees), or so that the movement onset of one is phased to the goal achievement of another (0 degrees phased to 240 degrees), etc. Generalizations that characterize some phase relations in the gestural constellations of English words are proposed in Browman and Goldstein (1990c). As is the case for the values of the dynamic parameters, values of the synchronized phases also appear to cluster in narrow ranges, with onset of movement (0 degrees) and achievement of goal (240 degrees) being the most common (Browman & Goldstein, 1990a).
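Since the virtual cycle is one period of an undamped oscillator with the gesture's stiffness, a phase specification translates directly into a time lag. A sketch, with an illustrative stiffness value:

```python
import math

# Phase-based coordination: a phase (degrees) of the virtual cycle
# maps onto a time within the gesture, given the gesture's stiffness.

def phase_to_time(phase_deg, stiffness):
    period = 2.0 * math.pi / math.sqrt(stiffness)  # virtual cycle (s)
    return (phase_deg / 360.0) * period

# Synchronizing gesture B's onset (0 degrees) with gesture A's goal
# achievement (240 degrees) delays B's onset by this much after A's:
lag_s = phase_to_time(240.0, stiffness=900.0)
print(round(lag_s * 1000))  # ~140 ms
```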

An example of a gestural constellation (for the word "pawn" as pronounced with the back unrounded vowel characteristic of much of the U.S.) is shown in Figure 3a, which gives an idea of the kind of information contained in the gestural dictionary. Each row, or tier, shows the gestures that control the distinct articulator sets: velum, tongue tip, tongue body, lips, and glottis. The gestures are represented here by descriptors, each of which stands for a numerical equilibrium position value assigned to a tract variable. In the case of the oral gestures, there are two descriptors, one for each of the paired tract variables. For example, for the tongue tip gesture labelled {clo alv}, {clo} stands for -3.5 mm (negative value indicates compression of the surfaces), and {alv} stands for 56 degrees (where 90 degrees is vertical and would correspond to a midpalatal constriction). The association lines connect gestures that are phased with respect to one another. For example, the tongue tip {clo alv} gesture and the velum {wide} gesture (for nasalization) are phased such that the point indicating 0 degrees (onset of movement) of the tongue tip closure gesture is synchronized with the point indicating 240 degrees (achievement of goal) of the velic gesture.

Each gesture is assumed to be active for a fixed proportion of its virtual cycle (the proportion is different for consonant and vowel gestures). The linguistic gestural model uses this proportion, along with the stiffness of each gesture and the phase relations among the gestures, to calculate a gestural score that specifies the temporal activation intervals for each gesture in an utterance. One form of this gestural score for "pawn" is shown in Figure 3b, with the horizontal extent of each box indicating its activation interval, and the lines between boxes indicating which gesture is phased with respect to which other gesture(s), as before. Note that there is substantial overlap among the gestures. This kind of overlap can result in certain types of context dependence in the articulatory trajectories of the invariantly specified gestures. In addition, overlap can cause the kinds of acoustic variation that have been traditionally described as allophonic variation. For example, in this case, note the substantial overlap between the velic lowering gesture (velum {wide}) and the gesture for the vowel (tongue body {narrow phar}). This will result in an interval of time during which the velo-pharyngeal port is open and the vocal tract is in position for the vowel: that is, a nasalized vowel. Traditionally, the fact of nasalization has been represented by a rule that changes an oral vowel into a nasalized one before a (final) nasal consonant. But viewed in terms of gestural constellations, this nasalization is just the lawful consequence of how the individual gestures are coordinated. The vowel gesture itself hasn't changed in any way: it has the same specification in this word and in the word "pawed" (which is not nasalized).
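For this purpose a gestural score reduces to a set of labeled activation intervals, and the nasalized stretch of the vowel falls out as the overlap between two of them. The interval values below are invented for illustration, loosely following the "pawn" example of Figure 3.

```python
# A gestural score as labeled activation intervals (ms); values invented.

score = {
    "lips {clo lab}":            (0, 80),
    "glottis {wide}":            (0, 90),
    "tongue body {narrow phar}": (60, 260),
    "velum {wide}":              (180, 320),
    "tongue tip {clo alv}":      (240, 340),
}

def overlap_ms(g1, g2):
    (s1, e1), (s2, e2) = score[g1], score[g2]
    return max(0, min(e1, e2) - max(s1, s2))

# Interval with the velo-pharyngeal port open during the vowel gesture,
# i.e., the nasalized portion of the vowel:
print(overlap_ms("velum {wide}", "tongue body {narrow phar}"))  # 80
```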

The parameter value specifications and activation intervals from the gestural score are input to the task dynamic model (Figure 1), which calculates the time-varying response of the tract variables and component articulators to the imposition of the dynamical regimes defined by the gestural score. Some of the time-varying responses are shown in Figure 3c, along with the same boxes indicating the activation intervals for the gestures. Note that the movement curves change over time even when a tract variable is not under the active control of some gesture. Such motion can be seen, for example, in the LIPS panel, after the end of the box for the lip closure gesture. This motion results from one or both of two sources. (1) When an articulator is not part of any active gesture, the articulator returns to a neutral position. In the example, the upper lip and the lower lip articulators both are returning to a neutral position after the end of the lip closure gesture. (2) One of the articulators linked to the inactive tract variable may also be linked to some active tract variable, and thus cause passive changes in the inactive tract variable.



Figure 3. Various displays from the computational model for "pawn." (a) Gestural descriptors and association lines. (b) Gestural descriptors and association lines plus activation boxes. (c) Gestural descriptors and activation boxes plus generated movements of (from top to bottom): Velic Aperture, vertical position of the Tongue Tip (with respect to the fixed palate/teeth), vertical position of the Tongue Body (with respect to the fixed palate/teeth), Lip Aperture, Glottal Aperture.

In the example, the jaw is part of the coordinative structure for the tongue body vowel gesture, as well as part of the coordinative structure for the lip closure gesture. Therefore, even after the lip closure gesture becomes inactive, the jaw is affected by the vowel gesture, and its lowering for the vowel causes the lower lip to also passively lower.

The gestural constellations not only characterize the microscopic properties of the utterances, as discussed above, but systematic differences among the constellations also define the macroscopic property of phonological contrast in a language. Given the nature of gestural constellations, the possible ways in which they may differ from one another are, in fact, quite constrained. In other papers (e.g., Browman & Goldstein, 1986; 1989; 1992) we have begun to show that gestural structures are suitable for characterizing phonological functions such as contrast, and what the relation is between the view of phonological structure implicit in gestural constellations and that found in other contemporary views of phonology (see also Clements, 1992, for a discussion of these relations). Here we will simply give some examples of how the notion of contrast is defined in a system based on gestures, using the schematic gestural scores in Figure 4.

One way in which constellations may differ is in the presence vs. absence of a gesture. This kind of difference is illustrated by two pairs of subfigures in Figure 4: (4a) vs. (4b), and (4b) vs. (4d). (4a) "pan" differs from (4b) "ban" in having a glottis {wide} gesture (for voicelessness), while (4b) "ban" differs from (4d) "Ann" in having a labial closure gesture (for the initial consonant). Constellations may also differ in the particular tract variable/articulator set controlled by a gesture within the constellation, as illustrated by (4a) "pan" vs. (4c) "tan," which differ in terms of whether it is the lips or the tongue tip that performs the initial closure. A further way in which constellations may differ is illustrated by comparing (4e) "sad" to (4f) "shad," in which the value of the constriction location tract variable for the initial tongue tip constriction is the only difference between the two utterances. Finally, two constellations may contain the same gestures and differ simply in how they are coordinated, as can be seen in (4g) "dab" vs. (4h) "bad."
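Treating each constellation as a set of (articulator set, descriptor) pairs makes the first kind of contrast easy to state. The representations below are deliberately simplified versions of the scores in Figure 4 (phasing information is omitted).

```python
# Contrast by presence vs. absence of a gesture ("pan" vs. "ban").

pan = {("glottis", "wide"), ("lips", "clo lab"), ("tongue tip", "clo alv"),
       ("tongue body", "wide phar"), ("velum", "wide")}
ban = {("lips", "clo lab"), ("tongue tip", "clo alv"),
       ("tongue body", "wide phar"), ("velum", "wide")}

print(pan - ban)  # {('glottis', 'wide')}: the voicelessness gesture
```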

Page 67: CS 508 427 Fowler, Carol A., Ed. -REPORT NO 93 · AUTHOR Fowler, Carol A., Ed. TITLE Spee:11 Research Status Report, January-March 1993. ... A Case Study of Chinese' by Rumjahn Hoosain"


Figure 4. Schematic gestural scores exemplifying contrast. (a) "pan" (b) "ban" (c) "tan" (d) "Ann" (e) "sad" (f) "shad" (g) "dab" (h) "bad."




This chapter described an approach to the description of speech in which both the cognitive and physical aspects of speech are captured by viewing speech as a set of actions, or dynamic tasks, that can be described using different dimensionalities: low dimensional or macroscopic for the cognitive, and high dimensional or microscopic for the physical. A computational model that instantiates this approach to speech was briefly outlined. It was argued that this approach to speech, which is based on dynamical description, has several advantages. First, it captures both the phonological (cognitive) and physical regularities that minimally must be captured in any description of speech. Second, it does so in a way that unifies the two descriptions as descriptions with different dimensionality of a single complex system. The latter attribute means that this approach provides a principled view of the reciprocal constraints that the physical and phonological aspects of speech exhibit.

REFERENCES

Abraham, R. H., & Shaw, C. D. (1982). Dynamics-The geometry of behavior. Santa Cruz, CA: Aerial Press.
Anderson, S. R. (1974). The organization of phonology. New York, NY: Academic Press.
Browman, C. P., & Goldstein, L. (1986). Towards an articulatory phonology. Phonology Yearbook, 3, 219-252.
Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology, 6, 201-251.
Browman, C. P., & Goldstein, L. (1990a). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299-320.
Browman, C. P., & Goldstein, L. (1990b). Representation and reality: Physical systems and phonological structure. Journal of Phonetics, 18, 411-424.
Browman, C. P., & Goldstein, L. (1990c). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 341-376). Cambridge: Cambridge University Press.
Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49, 155-180.
Browman, C. P., Goldstein, L., Kelso, J. A. S., Rubin, P., & Saltzman, E. (1984). Articulatory synthesis from underlying dynamics. Journal of the Acoustical Society of America, 75, S22-S23(A).
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.
Clements, G. N. (1992). Phonological primes: Features or gestures? Phonetica, 49, 181-193.
Cohn, A. C. (1990). Phonetic and phonological rules of nasalization. UCLA WPP, 76.
Coleman, J. (1992). The phonetic interpretation of headed phonological structures containing overlapping constituents. Phonology, 9, 1-44.
Fourakis, M., & Port, R. (1986). Stop epenthesis in English. Journal of Phonetics, 14, 197-221.
Fowler, C. A., Rubin, P., Remez, R. E., & Turvey, M. T. (1980). Implications for speech production of a general theory of action. In B. Butterworth (Ed.), Language production (pp. 373-420). New York, NY: Academic Press.
Haken, H. (1977). Synergetics: An introduction. Heidelberg: Springer-Verlag.
Halle, M. (1982). On distinctive features and their articulatory implementation. Natural Language and Linguistic Theory, 1, 91-105.
Hockett, C. (1955). A manual of phonology. Chicago: University of Chicago.
Hogeweg, P. (1989). MIRROR beyond MIRROR: Puddles of LIFE. In C. Langton (Ed.), Artificial life (pp. 297-316). New York: Addison-Wesley.
Jakobson, R., Fant, C. G. M., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, MA: MIT.
Kauffman, S. (1989). Principles of adaptation in complex systems. In D. Stein (Ed.), Sciences of complexity (pp. 619-711). New York: Addison-Wesley.
Kauffman, S. (1991). Antichaos and adaptation. Scientific American, 265, 78-84.
Kauffman, S., & Johnsen, S. (1991). Co-evolution to the edge of chaos: Coupled fitness landscapes, poised states, and co-evolutionary avalanches. In C. Langton, C. Taylor, J. D. Farmer, & S. Rasmussen (Eds.), Artificial life II (pp. 325-369). New York: Addison-Wesley.
Keating, P. A. (1985). CV phonology, experimental phonetics, and coarticulation. UCLA WPP, 62, 1-13.
Keating, P. A. (1988). Underspecification in phonetics. Phonology, 5, 275-292.
Keating, P. A. (1990). Phonetic representations in a generative grammar. Journal of Phonetics, 18, 321-334.
Kelso, J. A. S., Saltzman, E. L., & Tuller, B. (1986). The dynamical perspective on speech production: Data and theory. Journal of Phonetics, 14, 29-59.
Kent, R. D., & Minifie, F. D. (1977). Coarticulation in recent speech production models. Journal of Phonetics, 5, 115-133.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1221.
Kugler, P. N., & Turvey, M. T. (1987). Information, natural law, and the self-assembly of rhythmic movement. Hillsdale, NJ: Lawrence Erlbaum Associates.
Ladefoged, P. (1980). What are linguistic sounds made of? Language, 56, 485-502.
Ladefoged, P. (1982). A course in phonetics (2nd ed.). New York, NY: Harcourt Brace Jovanovich.
Lewin, R. (1992). Complexity. New York, NY: Macmillan.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff, R. T. Oehrle, F. Kelley, & B. Wilker Stephens (Eds.), Language sound structure (pp. 157-233). Cambridge, MA: MIT Press.
Lindblom, B., MacNeilage, P., & Studdert-Kennedy, M. (1983). Self-organizing processes and the explanation of phonological universals. In B. Butterworth, B. Comrie, & Ö. Dahl (Eds.), Explanations of linguistic universals (pp. 181-203). The Hague: Mouton.
Macchi, M. (1988). Labial articulation patterns associated with segmental features and syllable structure in English. Phonetica, 45, 109-121.
Madore, B. F., & Freedman, W. L. (1987). Self-organizing structures. American Scientist, 75, 252-259.
Manuel, S. Y., & Krakow, R. A. (1984). Universal and language particular aspects of vowel-to-vowel coarticulation. Haskins Laboratories Status Report on Speech Research, SR-77/78, 69-78.
McCarthy, J. J. (1988). Feature geometry and dependency: A review. Phonetica, 45, 84-108.
Ohala, J. (1983). The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (Ed.), The production of speech (pp. 189-216). New York, NY: Springer-Verlag.
Öhman, S. E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.
Packard, N. (1989). Intrinsic adaptation in a simple model for evolution. In C. Langton (Ed.), Artificial life (pp. 141-155). New York: Addison-Wesley.
Pierrehumbert, J. (1990). Phonological and phonetic representation. Journal of Phonetics, 18, 375-394.
Pierrehumbert, J. B., & Pierrehumbert, R. T. (1990). On attributing grammars to dynamical systems. Journal of Phonetics, 18.
Port, R. F. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of America, 69, 262-274.
Rubin, P. E., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70, 321-328.
Sagey, E. C. (1986). The representation of features and relations in non-linear phonology. Doctoral dissertation, MIT.
Saltzman, E. (1986). Task dynamic coordination of the speech articulators: A preliminary model. In H. Heuer & C. Fromm (Eds.), Generation and modulation of action patterns (pp. 129-144). Berlin/Heidelberg: Springer-Verlag.
Saltzman, E. (in press). Dynamics and coordinate systems in skilled sensorimotor activity. In T. van Gelder & B. Port (Eds.), Mind as motion. Cambridge, MA: MIT Press.
Saltzman, E., & Kelso, J. A. S. (1987). Skilled actions: A task dynamic approach. Psychological Review, 94, 84-106.
Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333-382.
Schöner, G., & Kelso, J. A. S. (1988). Dynamic pattern generation in behavioral and neural systems. Science, 239, 1513-1520.
Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David & P. B. Denes (Eds.), Human communication: A unified view (pp. 51-66). New York, NY: McGraw-Hill.
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3-45.
Sussman, H. M., MacNeilage, P. F., & Hanson, R. J. (1973). Labial and mandibular dynamics during the production of bilabial consonants: Preliminary observations. Journal of Speech and Hearing Research, 16, 397-420.
Turvey, M. T. (1977). Preliminaries to a theory of action with reference to vision. In R. Shaw & J. Bransford (Eds.), Perceiving, acting and knowing: Toward an ecological psychology. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wood, S. (1982). X-ray and model studies of vowel articulation. Working Papers, Lund University, 23.

FOOTNOTES

T. van Gelder & B. Port (Eds.), Mind as motion. Cambridge, MA: MIT Press (in press).
*Also Department of Linguistics, Yale University.


Haskins Laboratories Status Report on Speech Research
1993, SR-113, 63-90

Some Organizational Characteristics of Speech Movement Control*

Vincent L. Gracco

The neuromotor organization for a class of speech sounds (bilabials) was examined to evaluate the control principles underlying speech as a sensorimotor process. Oral opening and closing actions for the consonants /p/, /b/, and /m/ (C1) in /s V1 C1 V2 C2/ context, where V1 was either /ae/ or /i/, V2 was /ae/, and C2 was /p/, were analyzed from four subjects. The timing of oral opening and closing action was found to be a significant variable differentiating bilabial consonants. Additionally, opening and closing actions were found to covary along a number of dimensions, implicating the movement cycle as the minimal unit of speech motor programming. The sequential adjustments of the lips and jaw varied systematically with phonetic context, reflecting the different functional roles of these articulators in the production of consonants and vowels. The implication of these findings for speech production is discussed.

INTRODUCTION

As a motor process, speaking is an intricate orchestration of multiple effectors (articulators) coordinated in time and space to produce sound sequences. From a functional perspective, the speech production mechanism is a special purpose device modulating the aerodynamic and resonance properties of the vocal tract by rapidly creating constrictions, occlusions, or overall shape changes. These events provide the foundation for the categorically distinct sounds of the language. The functional task, driven by cognitive considerations, involves a number of sensorimotor processes that generate segmental units, specify the coordination among contributing articulators, scale articulator actions to phonetic and pragmatic contexts, and subsequently sequence these actions into meaningful units for communication.

This research was supported by grants DC-00121 and DC-00594 from the National Institute on Deafness and Other Communication Disorders. The author thanks Carol Fowler, Anders Lofqvist, Ignatius Mattingly and Rudolph Sock for comments on earlier versions of this manuscript. The author also thanks John Folkins for a thorough and constructively critical review of this manuscript.


Assuming that the individual phonetic segments of a language are stored in some manner in the nervous system, a number of important issues can be identified. What are the production units, how are they differentiated for sounds, and how are they modified according to context? The present investigation is an attempt to answer some of these questions and identify some principles of speech motor organization used to scale, coordinate, and sequence multiple articulatory actions.

Movement characteristics

Stop consonants

One focus of the present investigation is to examine, in detail, the kinematic adjustments of lip and jaw motion associated with a class of English speech sounds, the bilabial stop consonants /p/, /b/, /m/, in different vowel contexts. These consonants are the same in their place of articulation (the lips), employing the same set of articulators to create an obstruction at the oral end of the vocal tract. One property distinguishing these sounds is the presence or absence of voicing; /p/ is a voiceless consonant because laryngeal vibration is briefly (100-200 msec) arrested during the oral closing. As the lips release the oral closure the vocal folds move back together and voicing for the next vowel is initiated (Lisker & Abramson, 1964). There are two other classes of stop consonants in English which create temporary obstructions in different regions of the vocal tract, and use different articulators: alveolar and velar stop consonants. In addition, each of the alveolar and velar voiced consonants can be produced with the velum lowered, creating the nasal resonances for the /n/ and /ŋ/ sounds, respectively. Thus, English contains three characteristic sets of stop consonants that differ in the primary articulators used to create the obstruction and the location of the obstruction in the vocal tract.

It is not clear from previous investigations how articulatory movements within a class are modified by concomitant articulatory actions. That is, are the lip and jaw movements for all bilabial consonants similar, with the acoustic distinctions between sounds in the class attributed solely to the laryngeal and velar actions? Examination of electromyographic (EMG), kinematic, and acoustic data reveals conflicting findings. Electromyographic activity from lip muscles for bilabial production has generally failed to reveal significant differences in measured variables such as peak EMG amplitude and EMG burst duration associated with the oral closing action (Fromkin, 1966; Harris, Lysaught, & Schvey, 1965; Lubker & Parris, 1970; Tatham & Morton, 1968). The one exception is the study by Sussman, MacNeilage, and Hanson (1973) in which the activity from one upper lip depressor muscle (depressor anguli oris) from one subject was found to be greater for /p/ than for /b/ and /m/. Kinematic studies have revealed a few movement differences for /p/ and /b/, most consistently a tendency for the oral closing movement velocity to be higher for /p/ than /b/ or /m/ (Chen, 1970; Summers, 1987; Sussman et al., 1973). Two of the above mentioned investigations (Sussman et al., 1973; Summers, 1987) have also reported kinematic and EMG differences in movements surrounding the oral closing action. Jaw opening for the vowel /a/ or /ae/ before /p/ was found to be higher than before /b/ (Summers, 1987) and the lower lip opening velocity and associated EMG activity (following the oral occlusion) was greater for /m/ than /p/ or /b/ (Sussman et al., 1973).

Acoustic studies have provided the most reliable findings associated with voicing differences. Closure durations for voiceless consonants are generally longer than closure durations for their voiced cognates, and the preceding vowel is shorter in the voiceless context (Denes, 1955; House & Fairbanks, 1953; Luce & Charles-Luce, 1985).

Integrating results from different empirical observations allows for some potential speculation on the voiced/voiceless differences. EMG results suggest that the form of the oral closing motor commands is not different in magnitude or duration due to the presence or absence of voicing. In contrast, the movement and acoustic studies suggest that differences may be focused on movement timing, as evidenced by changes in movement velocity and vowel duration. While the available data suggest that one difference between sounds within an equivalence class such as bilabials may be reflected in aspects of their timing, no detailed analysis has been conducted.

Another possibility is that articulatory motion may be governed by control principles that operate on combinations of kinematic variables reflecting important system parameters. For example, consonant related movement differences may be specified by articulator stiffness, a construct with some potential as a motor control variable (Cooke, 1980; Ostry & Munhall, 1985; Kelso et al., 1985). If an effector such as the lower lip is modeled as a linear second order system, changing the static stiffness of the system results in predictable changes in the velocity/displacement relationship; the velocity/displacement ratio varies directly with stiffness. Moreover, if resting stiffness is a variable being controlled by the nervous system, then changes in an effector's kinematics can be brought about by a single system parameter. It has been suggested that mass normalized stiffness, estimated from the ratio of a movement's peak velocity and displacement, may be an important control parameter for skilled actions including speech (Kelso et al., 1985; Kelso, 1986). While a number of studies have demonstrated consistent and systematic velocity/displacement relations for a range of speech movements (Kuehn & Moll, 1976; Munhall, Ostry, & Parush, 1985; Ostry, Keller, & Parush, 1983), to date it has not been demonstrated unequivocally that stiffness is an important control parameter, except as one of many possible alternatives (see Nelson, 1983 for example). One possibility is that /p/, with higher closing velocity, might be differentiated from /b/ and /m/ by its stiffness specification, suggesting that speech sounds may be coded neurophysiologically by such a construct.
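The logic behind treating this ratio as a stiffness index can be spelled out for the idealized undamped case (a sketch of the standard second-order argument, with symbols chosen here for illustration; it is not reproduced from the studies cited above):

    m\ddot{x}(t) + k\,x(t) = 0, \qquad x(t) = A\sin(\omega t), \qquad \omega = \sqrt{k/m}

    \frac{\max_t |\dot{x}(t)|}{\max_t |x(t)|} = \frac{A\omega}{A} = \omega = \sqrt{k/m}

That is, the peak-velocity/peak-displacement ratio varies directly with mass-normalized stiffness and carries units of frequency (1/sec); damping alters the constant of proportionality but not this direct dependence.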

Articulator interactions and sequencing

It is becoming increasingly clear that individual articulators and their respective actions are not independent but functionally related by task. The empirical support for this view comes from two main sources. When the motions of the lips and jaw for bilabial closure are examined in detail, it appears that the individual articulators are not adjusted independently either in magnitude or relative timing (Gracco, 1988; Gracco & Abbs, 1986; Hughes & Abbs, 1976). Spatiotemporal adjustments to suprasegmental manipulations also suggest that stress and rate mechanisms act on all active regions of the vocal tract (Fowler, Gracco, V.-Bateson, & Romero, submitted). Changes in articulator movement patterns following mechanical perturbation are also consistent with the suggestion that the actions of individual articulators are not adjusted independently. Rather, disruptions to articulator movement during speech result in adjustments in the perturbed articulator as well as those (unperturbed) articulators that are actively involved in the production of the specific sound (Folkins & Abbs, 1975; Kelso et al., 1984; Kollia, Gracco, & Harris, 1992; Munhall, Lofqvist, & Kelso, in press; Shaiman, 1989; Abbs & Gracco, 1984; Gracco & Abbs, 1985; 1988). While many previous studies can be criticized on statistical grounds (see Folkins & Brown, 1987), it is not unreasonable to suggest that interactions among articulators are not random but systematic and integral to the overall speech production process. Together these observations suggest that speech movements are organized according to higher level principles that involve articulatory aggregates (similar to the principle of synergy defined by Bernstein, 1967).

Speaking is also a continuous process involving articulatory motion generating acoustic and aerodynamic events. Lip and jaw motion can be classified as either opening or closing depending on the direction of the motion relative to an occlusion/constriction or characteristic vocal tract shape. Oral opening is most often associated with vowel production, while oral closing is most often associated with consonant production. Different relative timing patterns for oral articulatory motions depending on direction have been interpreted as manifestations of separate and distinct synergistic actions fundamental to the speech production process (Gracco, 1988). Moreover, the opening phase of an articulatory action shows greater variation in spatiotemporal adjustment due to stress (Kelso, V.-Bateson, Saltzman, & Kay, 1985; Ostry, Keller, & Parush, 1983) than the corresponding closing phase. Similarly, oral opening and closing movement duration show differential effects to changes in speaking rate (Adams, Weismer, & Kent, 1993). These differential patterns associated with movement direction are also influenced by phonetic context. The context in which a sound is produced is known to affect the acoustic and kinematic properties associated with its production (Daniloff & Moll, 1968; Sussman, MacNeilage, & Hanson, 1973; Parush, Ostry, & Munhall, 1983; Perkell, 1969). It seems that articulatory adjustments for context may have differential effects on the different movement phases. In order to understand the organizational principles and control processes that govern speech production, evaluation of speech movement differences must include the articulatory adjustments within as well as across articulators and movement phases.

Examining kinematic changes across movement phases has additional implications for understanding speech as a serial process. As pointed out by Lashley (1951), serial actions, such as those found in speech, locomotion, typing, and the playing of musical instruments, cannot be explained in terms of successions of external stimuli or reflex chaining. Rather, the apparent rhythmicity found in all but the simplest motor activities suggests that some sort of temporal patterning or temporal integration may form the foundation for motor as well as perceptual activities (Lashley, 1951). Lashley further suggested that skilled action involves the advanced planning of entire sequences of action. More recently, Sternberg and colleagues have developed a model of speech timing that incorporates the concept of an advanced plan of action, an utterance program, which is used to control the execution of the elements in sequence (Sternberg, Knoll, Monsell, & Wright, 1988). The elements of the utterance program are action units defined abstractly on the basis of requiring a single selection process but may contain multiple distinguishable actions (Sternberg et al., 1988). That is, speech production is considered a hierarchical process with advanced planning of what is to be said (utterance program) followed by the execution of the program using smaller sequenced elements (action units) on the order of stress groups (Fowler, 1983; Sternberg et al., 1988).

While this model provides a framework for the speech motor process, a number of unresolved issues remain. Previous work has focused on evaluating the latency and duration of words or syllables rapidly produced by subjects. Examination of articulatory movement has not been undertaken, and thus identification of the articulatory correlates of the action unit is lacking. If an action unit or unit of speech production is to have any theoretical significance, identification of the physiological (articulatory) instantiation of the construct is crucial. Further, in order to fully understand the speech production process, the mechanism by which elemental units are sequenced and modified is also of central import. In the present investigation, serial speech movements within and across movement sequences were examined for evidence of articulatory cohesiveness that may reflect the production unit as well as the manner in which phonetic context may modulate the underlying temporal patterning.

Articulatory coordination

A final focus of the present investigation is the coordination patterns of the lips and jaw as they cooperate to occlude and open the oral end of the vocal tract for the different bilabial consonants. A number of previous investigations suggest that task-related articulatory movements, such as the lip and jaw motion for bilabial closing, are interdependent in their timing (Gracco, 1988; Gracco & Abbs, 1986). Results presented by Lofqvist and Yoshioka (1981; 1984) similarly suggest a number of consistent timing patterns for different laryngeal-oral interactions following stress, rate, and consonantal manipulations. Observations of timing coherence among articulatory actions have been extended to include the lip, jaw, and larynx associated with the initiation of voicing and devoicing in a variety of phonetic contexts (Gracco, 1990; Gracco & Lofqvist, 1989; Gracco & Lofqvist, in preparation). Such observations reflect a potential principle of speech movement organization. For speech movement coordination, as for motor coordination in general (see Bernstein, 1967), the timing or patterning of the potential degrees of freedom for task-related actions are constrained, thereby reducing the overall motor control complexity (Gracco, 1988).

While the available evidence suggests that constraining the degrees of freedom may be a general principle underlying the coordination of task-related speech articulators, it has recently been suggested that lip and jaw coordination is not invariant (De Nil & Abbs, 1991). Rather, variations in the temporal sequencing of lip and jaw closing movements for /b/ at fast and slow speaking rates have been interpreted to reflect fundamentally different patterns requiring different coordinative patterns. With the exception of the De Nil and Abbs (1991) investigation, previous studies examining the relative timing of the lips and jaw during oral closing have focused on voiceless /p/ (Caruso, Abbs, & Gracco, 1988; Gracco & Abbs, 1986; 1988; Gracco, 1988; McClean, Kroll, & Loftus, 1990). It is possible that lip and jaw actions for voiced and voiceless sounds differ in their relative timing, reflecting different coordinative relations. Further, one previous study has examined the relative timing of the lips and jaw for oral opening, and reported fundamentally different timing patterns among the articulators for oral opening than for oral closing (Gracco, 1988). As such, extended examination of interarticulatory timing across movement phases may be informative regarding the extent and generality of any coordinative principles governing speech motor actions.

Methods

Subjects and movement task. Four females between the ages of 21 and 28 years were subjects in the present study. None had a history of neurological disorder and all were native speakers of American English. Subjects were asked to repeat one of six utterances following the onset of an experimenter-controlled tone. The utterances were of the form /s V1 C V2/ where V1 was either /ae/ or /i/, C was /p/, /b/, or /m/, and V2 was /ae/. The specific utterances were:

1) sapapple
2) seepapple
3) sabapple
4) seebapple
5) samapple
6) seemapple

Each word was repeated, in isolation, a total of 40 times, 10 or 20 consecutive times for each word on the list in the presented order, followed by subsequent repetition(s) of the entire list. A total of 240 tokens were obtained for each subject with the exception of subject four, whose total was 238 (one sapapple and one seemapple missing). The stimulus to respond (auditory tone) was presented at approximately three second intervals. Subjects were instructed to produce each word with equal stress at a comfortable speaking rate following the tone and to use an effort level appropriate for speaking to someone 10-15 feet away.

Instrumentation and data acquisition. Single dimensional midsagittal movements of the upper lip (UL), lower lip (LL), and jaw (J) were transduced using a head-mounted strain gage system previously described (Barlow, Cole, & Abbs, 1983). Briefly, movements of the upper lip, lower lip, and jaw were transduced using ultralight weight cantilever beams instrumented with strain gages attached to a lightweight head-mounted frame. Transducers were attached midsagittally at the vermilion border of the upper and lower lips and on a region of the chin which yielded minimal skin movement artifact (see Barlow et al., 1983; Folkins, 1981; Kuehn, Reich, & Jordan, 1980; Muller & Abbs, 1979). Transducers were visually aligned to capture the major movement axis for oral opening/closing for each subject, with the resultant orientation rotated approximately 100 to 110 degrees posterior to the vertical. Movement signals were digitized at a sampling rate of 500 Hz with 12 bit resolution, except for subject four in which movement signals were sampled at 400 Hz. The acoustic signal was sampled at 500 Hz to indicate the onset and offset of the stimuli only.

Data analysis. Prior to data analysis, all movement signals were digitally low pass filtered (Butterworth implementation, two-pole zero phase lag, 20 Hz cutoff). As transduced, the lower lip signal reflects the combined movement of the lower lip and jaw. In order to obtain lower lip movement only, the jaw signal was software subtracted from the lower lip signal, yielding net lower lip movement. First derivatives were obtained using a three-point numerical differentiation routine (central difference) and stored as part of the data file. From these instantaneous velocity signals, movement onsets and offsets were determined for the individual closing movements. Computer software was used to identify zero crossings in the velocity traces of the respective upper lip, lower lip, and jaw signals, and movement onsets and offsets were marked. In addition, the movements of the upper lip, lower lip, and jaw were combined to yield a measure of the overall lip aperture (LA).
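For readers who want this processing chain in concrete form, the fragment below is a minimal sketch in Python; it is not the software actually used in the study. The synthetic signals, variable names, and the sign convention for combining channels into lip aperture are assumptions; only the published parameters (two-pole zero-phase-lag 20 Hz Butterworth filter, 500 Hz sampling, central-difference differentiation, zero-crossing parsing, jaw subtraction) come from the text.

    import numpy as np
    from scipy.signal import butter, filtfilt

    FS = 500.0  # sampling rate in Hz (400 Hz for subject four)

    def lowpass(x, fs=FS, cutoff=20.0):
        # Two-pole Butterworth; filtfilt runs forward and backward,
        # giving the zero phase lag described in the text.
        b, a = butter(2, cutoff / (fs / 2.0))
        return filtfilt(b, a, x)

    def velocity(x, fs=FS):
        # Three-point (central difference) numerical differentiation.
        return np.gradient(x, 1.0 / fs)

    def zero_crossings(v):
        # Candidate movement onsets/offsets: sign changes in velocity.
        return np.flatnonzero(np.diff(np.sign(v)) != 0)

    # Synthetic stand-ins for the transduced channels (positions in mm).
    t = np.arange(0.0, 1.0, 1.0 / FS)
    jaw = 4.0 * np.sin(2.0 * np.pi * 3.0 * t)
    ll_raw = 2.0 * np.sin(2.0 * np.pi * 3.0 * t) + jaw  # lower lip rides on the jaw
    ul = -1.0 * np.sin(2.0 * np.pi * 3.0 * t)

    ul_f, ll_f, jaw_f = lowpass(ul), lowpass(ll_raw), lowpass(jaw)
    ll_net = ll_f - jaw_f             # jaw subtracted: net lower lip movement
    la = ul_f - (ll_net + jaw_f)      # lip aperture; sign convention assumed
    landmarks = zero_crossings(velocity(la))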

For the present experiment, subjects had been instructed to start each utterance from a position of rest with the lips in contact and the upper and lower teeth lightly touching. Using this relatively consistent starting point it was possible to evaluate the relative position of the different articulators for the /s/ and the following vowel.

Each acquired token consisted of eight signals, illustrated in Figure 1. Two complete movement cycles or sequences were identified from the lip aperture signal (LA) and indicated as sequence 1 and 2 (S_1, S_2) in the figure. Each sequence consists of two phases, an opening and closing phase. Sequences were defined as the time interval between the onset of the opening movement and the offset of the closing movement in the LA signal using a zero velocity criterion. Onsets of the opening and closing movements for the lips and jaw were also determined from zero crossings in the first derivatives of the respective articulator motions. Maximum displacement and duration were defined as the distance and time, respectively, between the onset and offset of each movement; peak instantaneous velocity was also obtained for each movement. As with previous studies (Gracco, 1988; Gracco & Abbs, 1986) a measure of upper lip, lower lip, and jaw coordination during oral closing was obtained from the timing of each articulator's peak velocity relative to the same preceding articulatory event (jaw opening for the vowel). From the identified events of interest, the following measures were made (see Figure 1):

Figure 1. A representative example of upper lip (UL), lower lip (LL), jaw (J), and lip aperture (LA) displacement (above) and velocity (below) signals for one of the stimulus words, "sapapple." All trials were initiated from a rest position with the lips together and teeth lightly touching. Each production included two opening/closing sequences identified from the LA signal as S_1 and S_2. Within each sequence or cycle, opening and closing phases were also identified (Op_1, Cl_1, Op_2, Cl_2). Measurements are indicated by the numbers in the figure (see text for details).

1) extent of oral opening and associated jaw position for /s/, relative to the subjects' rest position,

2) extent of oral opening for the first vowel (Op_1) and the associated jaw opening movement characteristics,

3) lip and jaw closing movement characteristics for the first closing,

4) oral opening (Op_2) and closing (Cl_2) movement characteristics for the second cycle (S_2) for the /aep/ in "apple,"

5) time of upper lip, lower lip, and jaw peak closing velocity relative to the peak jaw opening velocity for the first vowel opening,

6) time of upper lip, lower lip, and jaw peak opening velocity for the second opening relative to the peak upper lip closing velocity for the first oral closing.

Statistical analysis. The data were analyzed using a two factor repeated measures ANOVA with vowels and consonants as factors. When there were significant interactions, post hoc analyses were conducted using Scheffe's F-test with the alpha level set to .01 for all comparisons.
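As an illustration only (the column names and synthetic data below are invented, and Scheffé post hoc tests are not built into statsmodels, so they would have to be computed separately from the cell means and error term), a two-factor repeated measures ANOVA of this design might be run as follows:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # One row per token: subject, vowel (/ae/ vs. /i/), consonant
    # (/p/, /b/, /m/), and a dependent measure such as cycle duration (ms).
    rng = np.random.default_rng(0)
    rows = [(s, v, c, 200.0 + rng.normal(0.0, 10.0))
            for s in ('S1', 'S2', 'S3', 'S4')
            for v in ('ae', 'i')
            for c in ('p', 'b', 'm')
            for _ in range(10)]  # 10 tokens per cell, for illustration
    df = pd.DataFrame(rows, columns=['subject', 'vowel', 'consonant', 'dur'])

    # AnovaRM needs one observation per subject and cell, so repeated
    # tokens are averaged via aggregate_func before fitting.
    result = AnovaRM(df, depvar='dur', subject='subject',
                     within=['vowel', 'consonant'],
                     aggregate_func='mean').fit()
    print(result)  # F and p for vowel, consonant, and their interaction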

Results

Context-dependent changes in the opening and closing movements will be examined for the first movement sequence. Next, movement parameters of the second sequence will be examined for carryover effects from the phonetic context of the first sequence. Finally, the coordination of lip and jaw motion will be examined for closing and opening effects and phonetic context. Individual articulator adjustments will be concurrently examined to evaluate their functional role in the speech production process.

Movement characteristics

Sequence One

Cycle duration. One of the most robust findings in the present study was a change in the duration of the first movement cycle. For the group, the cycle duration obtained from the lip aperture signal was, on average, shorter for the /i C/ than the /ae C/ context (191.8 msec vs. 227.3 msec). In addition, the cycle duration was shorter when the consonant was /p/ than /b/ or /m/ (203.3 vs. 210.2 vs. 215.1 msec, respectively). Shown in Figure 2 are three examples of the lower lip and jaw movements for "sapapple," "sabapple," and "samapple" from a single subject. All movements in the figure were aligned to the same articulatory event: jaw opening peak velocity for the first vowel /ae/ (dotted line). As illustrated by the arrows in Figure 2, the interval between the J opening peak velocity for /ae/ and the time of maximum lip and jaw closing velocity and displacement is shortest when the consonant is /p/. Because the jaw opening peak velocities are aligned in these examples, it can also be seen that the adjustment within the cycle was localized to the interval spanning the terminal phase of the opening action and the time of the peak velocity for the closing action. This interval for all phonetic contexts and subjects for the first movement cycle is presented in Figure 3. Since there were no articulator specific differences, only the LL will be considered.

Figure 2. Three representative lower lip (LL) and jaw movements (displacement and velocity) from S2 for "sapapple," "sabapple," and "samapple." All signals were aligned to the jaw opening peak velocity (dotted line). Opening is toward the bottom; closing is toward the top. Arrows in the jaw opening displacement panel (lower left) indicate the shortening of the jaw opening movement duration for the different consonants. Similarly, the LL closing movement for /p/ achieves both peak velocity (arrows, upper right panel) and peak displacement (upper left panel) earlier for /p/ than /b/ and /m/. Vertical calibration is 6 mm (displacement) and 150 mm/sec (velocity).




Figure 3. Average interval (in milliseconds) between the jaw opening peak velocity for the first vowel (Op_1) and the time of the LL peak velocity for the first oral closing (Cl_1) for the different phonetic contexts for all subjects (S1-S4). A single asterisk (*) reflects a significant /p/ - /b/ or /p/ - /m/ difference; a double asterisk (**) indicates that both comparisons were reliable (p < .01).

Table 1. Mean difference (xd) and associated F-value (Scheffe's F-test) for within-vowel (/aep-aeb/, /aep-aem/, /ip-ib/, /ip-im/) consonant comparisons and across-vowel (/aep-ip/, /aeb-ib/, /aem-im/) consonant comparisons for the Op_1-Cl_1 interval and lower lip closing movement duration for all subjects. Degrees of freedom for S1-3 (5, 234), S4 (5, 232); asterisk indicates a significant difference at p < .01.

Op_1-Cl_1 interval

         /aep-aeb/  /aep-aem/  /ip-ib/   /ip-im/   /aep-ip/  /aeb-ib/  /aem-im/
S1 (xd)     -6.75       3.25     -5.4      -5.75      35.55     36.9      26.55
    F        1.15        .27      .73        .83      31.8*     34.26*    17.73*
S2 (xd)     -43.5     -53.55     -4.4      -12.1      13.5      52.6      54.95
    F       75.99*    115.16*     .78       5.88*      7.32*   111.11*   121.26*
S3 (xd)    -22.95      -17.5   -10.05      -9.75      14.6      27.5      22.35
    F        18.9*     10.99*    3.62*      3.41*      7.65*    27.14*    17.93*
S4 (xd)     -22.1     -31.29   -39.69     -38.99      44.71     27.12     37.01
    F       12.88*     25.81*   42.06*     40.09*     52.71*    19.65*    36.11*

Movement duration

         /aep-aeb/  /aep-aem/  /ip-ib/   /ip-im/   /aep-ip/  /aeb-ib/  /aem-im/
S1 (xd)     -5.05      -3.35      .3        -5.1       -2.8       2.55     -4.55
    F        1.93        .85      .01        1.97       .59        .49      1.57
S2 (xd)     -19.2      -11.8     -.15      -13.2     -18.25        .8     -19.65
    F       11.74*      4.44*     .00        5.55*    10.61*       .02     12.3*
S3 (xd)     -20.2     -22.45    -3.65       -6.35     -13.1       3.45      3
    F        9.98*     12.32*     .33         .99       4.2*       .29       .22
S4 (xd)     -4.66      -5.16     -.94      -15.38      -3.6        .12    -13.81
    F         .51        .61      .02        5.59*      .31        .0003    4.51*



Within vowel and consonant comparisons revealed a significantly shorter interval when the consonant was /p/ than /b/ and /m/ in both vowel contexts, and a longer interval when the vowel was /ae/ compared to /i/. These effects were reliable for all subjects except for S1, whose consonant effect did not reach significance. Inspection of the data from S1 indicated that for the first block of ten repetitions the same trend as seen in the other subjects was present. However, for the subsequent blocks, the overall speech rate was increased, resulting in a reduction in the /p/ - /b/ - /m/ interval differences. In contrast to the interval results, the LL closing movement durations were not consistently affected by the consonant or vowel identity. Table 1 presents the average Op_1-Cl_1 interval and movement duration differences for the LL for each subject and the associated F statistic.

Oral opening

Position/Displacement/Velocity

Since the subjects began each experimental trial with their lips together and teeth lightly touching, the opening position for the first sound (/s/) in the movement sequence could be examined for anticipatory adjustments associated with the different phonetic context (upcoming vowel and/or consonant). The vertical oral opening for /s/, obtained from the lip aperture signal, averaged 6.78 mm (S1) to 9.38 mm (S2) across subjects and conditions and was not affected by subsequent phonetic context (p > .2). In contrast, the maximum oral opening for the vowel was highly dependent on vowel identity. Figure 4 presents the average oral opening for the group and the individual lip and jaw contributions to the opening for the different vowel-consonant contexts. The oral opening was significantly larger for /ae/ compared to /i/ for all subjects (S1 [F(1,238) = 230.81, p = .0001]; S2 [F(1,238) = 2188.6, p = .0001]; S3 [F(1,238) = 796.51, p = .0001]; S4 [F(1,236) = 343.96, p = .0001]). Oral opening for /ae/ averaged 15.2, 17.6, 17.9, and 13.3 mm, while oral opening for /i/ averaged 12.2, 9.8, 11.5, and 10.0 mm for S1-4 respectively. As can be seen from the consistent UL plus LL contribution to the oral opening in Figure 4, the jaw is the articulator solely responsible for the oral opening changes. Jaw position was consistently higher for /i/ than /ae/ for all subjects (S1 [F(1,238) = 535.41, p = .0001]; S2 [F(1,238) = 2067.8, p = .0001]; S3 [F(1,238) = 610.89, p = .0001]; S4 [F(1,236) = 505.12, p = .0001]). Collapsed across consonants, jaw opening position averaged 3.4, 7.0, 6.1, and 3.5 mm higher for /i/ than /ae/ (S1-4 respectively). The voicing character of the consonant had no consistent effect on the jaw opening position. No significant vowel or consonant effects were noted for the UL+LL.

Figure 4. Group lip aperture (LA) opening, in millimeters (mm) relative to the rest position, for the first vowel (V1) and consonant contexts. Below the LA data are the contributions of the UL plus LL (UL+LL) and jaw to the oral opening. Vertical bars indicate one standard deviation.



In the present phonetic context, it was often not possible to unambiguously identify the UL or LL opening movement associated with the vowel from movements associated with /s/. Thus, the opening velocity analysis for the first movement sequence focused exclusively on the jaw. Because of the large vowel-related jaw opening displacement effect, jaw opening characteristics were examined for consonant-related effects only. Figure 5 presents the jaw opening velocity results for the four subjects. Post hoc testing revealed significant consonant effects in the /ae/ context only, with jaw opening velocity higher before /p/ than /m/ for all subjects (S1 [F(2,117) = 8.11, p < .01]; S2 [F = 30.97, p < .01]; S3 [F = 29.88, p < .01]; S4 [F(2,116) = 9.0, p < .01]); /p/ - /b/ differences were only significant for S2 (F = 9.0, p < .01).

Oral closing

Displacement

As a consequence of changes in oral opening displacement for the different vowels, oral closing movements were modified in their extent. As shown in Figure 6 for the group, movement extent for each articulator is greater for the more open vowel /ae/. Presented in Table 2 are the individual subject comparisons for the upper lip, lower lip, and jaw closing displacements. There was a highly significant vowel effect for the J for all subjects, with larger closing movement displacements in the /ae/ context compared to /i/. Results for the UL were in general agreement with those of the J but to a lesser degree; S2, S3, and S4 demonstrated significantly larger closing displacements in the /ae/ context compared to /i/. The LL results were less consistent, with fewer comparisons reaching significance. No significant UL or LL differences related to vowel identity were noted for S1. Consonant related movement adjustments were much less consistent for all articulators. The J displacement for /p/ was larger than /m/ for three subjects (S1, S3, S4) in the /ae/ context.


Figure 5. Average jaw opening velocity, in millimeters per second (mm/sec), for the different phonetic contexts for each of the four subjects. Vertical bars indicate one standard error.




Figure 6. Average upper lip (UL), lower lip (LL), and jaw (J) closing displacement (in millimeters) for the phonetic contexts for the group. Vertical bars indicate one standard deviation.

Oral closing

Velocity/Stiffness

Lower lip closing velocity has previously been shown to be higher for /p/ than /b/ (Sussman et al., 1973; Summers, 1987). A similar result for the LL was found in the present investigation, although the differences, with few exceptions, were small. Shown in Figure 7 is a summary of the LL closing movement velocities for all subjects for the different phonetic contexts. The closing velocity for /p/ was higher than for /b/ for three subjects (S2, S3, S4) and higher than for /m/ for two subjects (S3, S4) in the /ae/ context. For S1 the LL closing velocity was significantly reduced for /p/ compared to /m/ in the /ae/ context (p < .01). Table 3 is a summary of the post hoc comparisons for each articulator and subject. As shown, the UL results are generally similar to the LL, with S2, S3, and S4 showing small increases in closing velocity for certain comparisons in the /ae/ context. Only two subjects (S3, S4) showed higher J closing velocities for /p/ compared to /b/ or /m/ in the /ae/ context. Jaw closing velocity for all subjects was significantly higher in the /ae/ context. A similar tendency was noted for the UL and LL although the results were not as robust as those for the J.

It appears that the magnitude of the LL closing velocity provides a fairly robust metric differentiating voiceless from voiced consonants. However, it was of interest to determine if a combination of kinematic variables, especially those for the UL and J, might provide a more reliable measure. One construct that has been used to describe speech movements and suggested as an important control variable for speech is mass-normalized stiffness, expressed as the ratio of peak velocity to peak displacement (Kelso et al., 1985; Ostry & Munhall, 1985). Changes in derived stiffness have been shown to co-occur with changes in movement duration, speaking rate, and emphasis. The correlation of velocity to displacement was generally high for all subjects and articulators, with the correlation magnitudes ordering J > LL > UL. Velocity/displacement correlations were generally high across subjects for the LL and J, ranging from r = .49 to r = .79 for the LL and from r = .91 to r = .96 for the jaw; the upper lip was less robust, with correlations ranging from r = .16 to r = .86. Presented in Figure 8 are the UL, LL, and J stiffness values for the different consonants for the group. The LL values are generally higher for /p/, a finding consistent with the peak velocity measures presented in Figure 7. For the UL and J, the ratios of peak velocity to peak displacement suggest that in the /ae/ context, /p/ is less stiff than /m/, a result opposite to that for the LL. Vowel related differences were also inconsistent across subjects, with the exception that J stiffness was always higher for /i/ than /ae/. The data from the individual subjects reflected the group trend and are presented in Table 4.
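In code, the derived-stiffness measure is a one-line ratio; the sketch below (the function and argument names are hypothetical) computes it for a single closing movement delimited by its onset and offset samples:

    import numpy as np

    def derived_stiffness(position, velocity, onset, offset):
        # Ratio of peak speed (mm/sec) to peak displacement (mm) over
        # one movement, giving mass-normalized stiffness in 1/sec.
        pos = position[onset:offset]
        vel = velocity[onset:offset]
        peak_displacement = np.max(np.abs(pos - pos[0]))
        peak_speed = np.max(np.abs(vel))
        return peak_speed / peak_displacement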




Table 2. Mean difference (xd) and associated F-value (Scheffe's F-test) for within-vowel (/aep-aeb/, /aep-aem/, /ip-ib/, /ip-im/) consonant comparisons and across-vowel (/aep-ip/, /aeb-ib/, /aem-im/) consonant comparisons for the upper lip, lower lip, and jaw closing displacement (mm) for all subjects. Degrees of freedom for S1-3 (5, 234), S4 (5, 232); asterisk indicates a significant difference at p < .01.

Upper Lip

         /aep-aeb/  /aep-aem/  /ip-ib/  /ip-im/  /aep-ip/  /aeb-ib/  /aem-im/
S1 (xd)      .75        .56      .2       .26       .4       -.14       .1
    F       8.32*      4.61*     .61      .99      2.4        .31       .16
S2 (xd)      .58        .18      .32      .4        .77       .51       .99
    F       6.86*       .69     2.06     3.26*    12.2*      5.33*    19.95*
S3 (xd)     -.55       -.07     -.01     -.07       .51      1.04       .5
    F       7.26*       .11      .003     .14      6.24*    26.41*     6.09*
S4 (xd)      .58       -.03      .34     -.06      1.06       .82      1.04
    F       1.95        .01      .67      .02      6.59*     4.0*      6.3*

Lower Lip

S1 (xd)     -.24       -.36     -.03     -.34       .26       .47       .27
    F        .66       1.42      .01     1.33       .77      2.49       .84
S2 (xd)      .06       -.28    -1.17    -1.45      2.0        .77       .82
    F        .02        .59    10.05*   15.49*    29.17*     4.33*     4.98*
S3 (xd)      .33        .97      .08     -.21      1.76      1.51       .58
    F        .7        6.0*      .04      .28     19.81*    14.59*     2.17
S4 (xd)     1.1         .48      .08     -.75      1.04       .03      -.18
    F       7.09*      1.34      .04     3.26*     6.4*       .01       .19

Jaw

S1 (xd)      .48        .87      .46      .2       2.47      2.44      1.79
    F       1.06       3.49*     .96      .18     27.81*    27.31*    14.68*
S2 (xd)     -.46        .46      .99      .28      3.93      5.38      3.75
    F       1.25       1.26     5.82*     .47     91.51*   171.59*    83.32*
S3 (xd)     -.21       1.97      .09      .18      4.58      4.88      2.79
    F        .38      31*        .06      .26    167.02*   190.02*    61.92*
S4 (xd)      .77       1.84      .47      .57      2.25      1.95       .98
    F      11.93*     68.34*    4.61*    6.61*   102.3*     78.32*    19.53*

Figure 7. Mean lower lip peak closing velocity (in millimeters per second) for the different phonetic contexts for the four subjects. Vertical bars indicate one standard error; asterisks indicate significantly higher closing velocity (p < .01) for /p/ compared to /b/ and /m/.



Table 3. Mean difference (xd) and associated F-value (Scheffe's F-test) for within-vowel (/aep-aeb/, /aep-aem/, /ip-ib/, /ip-im/) consonant comparisons and across-vowel (/aep-ip/, /aeb-ib/, /aem-im/) consonant comparisons for the upper lip, lower lip, and jaw closing velocity (mm/sec) for all subjects. Degrees of freedom for S1-3 (5, 234), S4 (5, 232); asterisk indicates a significant difference at p < .01.

Upper Lip

         /aep-aeb/  /aep-aem/  /ip-ib/   /ip-im/   /aep-ip/  /aeb-ib/  /aem-im/
S1 (xd)     4.52      -1.16      3.11      3.43      -2.18     -3.59      2.42
    F       1.36        .09       .64       .78        .32       .86       .39
S2 (xd)     2.26       8.6       4.34     -5.64      10.46      8.38     24.7
    F        .4        5.87*     1.49      2.53       8.68*     5.57*    48.43*
S3 (xd)     5.96       5.92      -.27     -2.24        .24      6.47      8.4
    F       4.26*      4.2*       .01       .6         .01      5.02*     8.45*
S4 (xd)    13.76       2.86      8.47     10.66       9.49     -4.27     17.29
    F       4.64*       .2       1.78      2.79       2.21       .45      7.34*

Lower Lip

S1 (xd)     -4.9     -10.14      -.92     -4.34       2.99      6.97      8.79
    F        .79       3.38*      .03       .62        .29      1.6       2.54
S2 (xd)    12.45       8.62    -13.88    -11.08      47.79     21.46     28.09
    F       3.43*      1.64      4.26*     2.71      50.49     10.18*    17.45*
S3 (xd)    22.22      35.05     17.33     15.06      20.26     25.37       .27
    F       6.26*     15.57*     3.81*     2.87       5.2*      2.99       .0001
S4 (xd)    23.87      13.6      15.46      4.93      14.5       1.59      5.84
    F      14.8*       3.41*     4.47*      .45       3.88*      .05       .63

Jaw

S1 (xd)     2.53       4.3       6.32      4.52      34.16     37.95     34.37
    F        .11        .32       .69       .35      19.99*    24.68*    20.24*
S2 (xd)    -1.33      10.76     14.1       1.82      68.36     83.81     59.44
    F        .03       2.26      3.87*      .06      91.11*   136.87*    68.85*
S3 (xd)    -8.13      32.21      2.11      4.13      80.43     90.67     52.35
    F       1.43      22.53*      .1        .37     140.5*    178.56*    59.53*
S4 (xd)    14.62      27.33      6.23      8.93      35.67     27.27     17.26
    F      18.67*     65.2*      3.43*     6.96*    111.03*    65.75*    26.02*

Table 4. Derived stiffness values, defined as the ratio of peak velocity to peak displacement for the first oral closing movement for the upper lip (UL), lower lip (LL), and jaw (J) for the four subjects. Results are presented according to vowel and consonant context for the first movement sequence. A single asterisk indicates a significant /p/ - /b/ or /p/ - /m/ difference at p < .01; double asterisks indicate that /p/ was significantly different than both /b/ and /m/. All comparisons were within vowel.

          /aep/          /aeb/        /aem/        /ip/           /ib/         /im/
S1
UL        13.4(.24)**    14.8(.25)    15.5(.25)    15.1(.16)      15.1(.20)    15.2(.18)
LL        18.2(.28)      18.3(.22)    18.7(.19)    18.5(.20)      18.6(.22)    18.1(.22)
J         17.7(.20)*     19.0(.28)    20.2(.26)    20.8(.28)      22.2(.34)    20.7(.36)
S2
UL        12.7(.20)**    15.1(.32)    15.0(.17)    12.6(.27)*     14.7(.21)    12.4(.18)
LL        21.0(.27)*     19.4(.34)    19.0(.24)    19.8(.22)**    18.3(.21)    17.1(.25)
J         18.9(.19)      17.7(.15)    18.4(.17)    21.8(.42)**    29.4(.79)    24.2(.46)
S3
UL        17.0(.32)      16.6(.63)    18.7(.37)    20.4(.31)      20.3(.36)    19.0(.45)
LL        20.1(.21)**    17.9(.28)    17.6(.42)    23.2(.49)**    20.3(.42)    19.5(.42)
J         19.6(.14)      20.3(.20)    21.3(.24)    28.0(.85)      27.8(.82)    27.1(.63)
S4
UL        15.6(.22)      14.6(.24)    14.9(.21)    16.9(.24)*     16.3(.27)    14.7(.25)
LL        20.0(.28)      18.7(.48)    19.2(.35)    21.5(.34)*     18.4(.34)    17.5(.30)
J         16.8(.20)      16.2(.27)    18.3(.26)    17.8(.23)*     19.9(.43)    19.2(.40)



Figure 8. Average mass-normalized stiffness for the upper lip (UL), lower lip (LL) and jaw (J) oral closing for the different vowel-consonant combinations for the group. Mass normalized stiffness was derived by dividing the peak velocity, in mm/sec, by the peak closing displacement, in mm, resulting in units of frequency (1/sec). Vertical bars indicate one standard deviation for the group.

Opening/closing interactions

Inspection of the oral opening and closing movement relations collapsed across consonants revealed some interactions that were not phonetically related but apparently reflect more general characteristics of speech movement organization. As indicated above, the positions of the lips and jaw for the first vowel were not reliably related to phonetic context. However, vowel-related positions of the different articulators did vary, and those variations influenced the extent of articulator displacement for the closing movement. Shown in Figure 9 are results from the LL (left side) and J (right side) opening position-closing displacement relations for the individual subjects. As shown, the relative articulatory positions for the vowel produced systematic variations in the magnitude of the oral closing movement. The LL data (left panel) reflect all consonant and vowel combinations; due to the large movement difference for the J, only the data for the /ae/ context are included (right panel). Including /i/ in the /ae/ data created a range effect that artificially inflated the correlation.

As shown in Figure 9, correlation coefficients for the LL range from r = .37 to r = .66. The J closing displacement was even more strongly related to the preceding J opening position, with correlation coefficients ranging from r = .50 to r = .73; the UL (not shown) was less consistent (correlation coefficients ranging from r = .05 to r = .55). As indicated, the lower the position of the lips and jaw from the subjects' defined rest position prior to closure, the greater the resulting displacement.

In addition to the apparent dependence of articulator displacement on articulator position, other characteristics of oral opening and oral closing were found to covary. For example, the jaw opening velocity for /ae/ and the subsequent closing velocity for the consonant demonstrated a covariation, with correlations ranging from r = .47 to r = .67 for the four subjects (Figure 10). It should also be noted that there was a trend for jaw opening and closing velocity for /p/ and /b/ to be faster than for /m/. It appears that there are systematic spatiotemporal variations in oral opening and closing, and such interactions are not necessarily phoneme specific.

Articulator interactions. In previous studies it has been suggested that the movements of individual articulators are subordinate to the combined movement of the contributing parts (Hughes & Abbs, 1976; Gracco & Abbs, 1986; Saltzman, 1986). That is, there appears to be a higher level motor plan in which the action of individual articulators is partially dependent on the action of the other articulators contributing to a multiarticulator goal. In the present study, separate stepwise regressions were done on the individual UL, LL, and J displacements for oral closing, using the positions of the three articulators for the preceding vowel as independent variables, to examine for evidence of articulatory interactions.
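A forward-selection variant of such an analysis is sketched below; the entry criterion, data, and names are assumptions for illustration, since the text does not specify the stepwise procedure used:

    import numpy as np
    import statsmodels.api as sm

    def forward_stepwise(y, X, names, enter_p=0.01):
        # y: closing displacements of one articulator (n tokens).
        # X: n x 3 matrix of UL, LL, and jaw positions for the vowel.
        # Repeatedly add the predictor that most raises R-squared,
        # provided its coefficient is significant at enter_p.
        chosen, remaining = [], list(range(X.shape[1]))
        while remaining:
            best = None
            for j in remaining:
                fit = sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit()
                if fit.pvalues[-1] < enter_p and (
                        best is None or fit.rsquared > best[0]):
                    best = (fit.rsquared, j)
            if best is None:
                break
            chosen.append(best[1])
            remaining.remove(best[1])
        return [names[j] for j in chosen]

    # Example with synthetic data: jaw closing displacement driven mostly
    # by jaw position, with a smaller lower-lip contribution.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(240, 3))                     # UL, LL, J positions
    y = 0.9 * X[:, 2] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=240)
    print(forward_stepwise(y, X, ['UL', 'LL', 'J']))  # typically ['J', 'LL']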


Figure 9. Scatter plots of the lower lip (LL; left side) and jaw (Jaw; right side) closing displacements as a function of the relative positions of the respective articulators for the preceding vowel. Product-moment correlation coefficients (r) are presented at the bottom right-hand corner of each plot. Only the jaw opening positions/jaw closing displacements are presented for /ae/ (see text for explanation).


Figure 10. Jaw opening peak velocity for /ae/ and the corresponding jaw closing peak velocity for the three consonants. Product-moment correlation coefficients (r) are presented at the bottom right-hand corner of each plot. As shown, as the jaw opens faster there is a tendency for the jaw to close faster as well.

In all cases, the position of a particular articulator for the preceding vowel was found to have the strongest influence on the movement displacement for that articulator. However, in all cases significant increases in the amount of explained variance were noted when the position of at least one other articulator was included in the regression model. A summary of the regression results is presented in Table 5.

Second sequence

Cycle duration. In the present investigation, the stimuli required two complete movement sequences: two openings and two closings. Following the first sequence, in which the vowel and consonant varied, the remaining sequence for "apple" was similar in phonetic context: an opening for the vowel /ae/ and a closing for /p/. Thus the data could be examined for carryover effects of the first syllabic context on the second movement sequence. Examination of the duration of the second movement cycle presented in Figure 11 revealed significant differences as a function of the preceding consonant. For subjects S1, S3, and S4, the duration of the second movement sequence, obtained from the lip aperture signal, was longer when the preceding oral closing consonant was /p/ than when it was /b/ or /m/ (S1 [F = 52.06, p = .0001]; S3 [F = 42.23, p = .0001]; S4 [F = 49.13, p = .0001]). There were no consistent vowel-related effects for these subjects (p > .05). For S2, the duration of the second movement sequence was only longer for /p/ when the preceding vowel was /i/ ([F(5,234) = 6.92, p < .01]); a significant vowel effect was noted in the /p/ context (F = 14.06, p < .01).
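The consonant effect on cycle duration is a one-way comparison across the three consonant contexts. A minimal sketch of such an F test with scipy follows; the durations are simulated for illustration, and the report's actual ANOVA design may have differed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical second-cycle durations (ms) for one subject, grouped by the
# consonant that closed the first syllable.
dur_after_p = rng.normal(loc=330.0, scale=15.0, size=40)
dur_after_b = rng.normal(loc=310.0, scale=15.0, size=40)
dur_after_m = rng.normal(loc=308.0, scale=15.0, size=40)

# One-way ANOVA on consonant context, analogous to the F tests reported above.
F, p = stats.f_oneway(dur_after_p, dur_after_b, dur_after_m)
print(f"F = {F:.2f}, p = {p:.4f}")
```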

To determine whether the longer cycle duration was localized to a single phase of the movement sequence and a single articulator, or distributed across the opening and closing phases and contributing articulators, the movement characteristics of the respective phases for all three articulators were examined. Shown in Figure 12 are the group-averaged upper lip, lower lip, and jaw opening and closing movement durations for the second movement sequence. As can be seen, the opening movements account for the changes in cycle duration associated with the preceding phonetic context noted in Figure 11. For the group, the opening movement duration was longer following /p/ than /b/ or /m/ for the UL ([F(2,956) = 47.07, p = .0001]), the LL ([F(2,956) = 95.9, p = .0001]), and the J ([F(2,956) = 31.53, p = .0001]). There were no


significant closing movement duration changes for any articulator or context (p > .05). In addition, vowel-related differences were noted that were obscured by examining the sequence duration from the lip aperture signal. For the UL and LL, the opening movement was shorter for /i/ compared to /ae/ ([F(1,957) = 212.02, 125.01, p = .0001] for the UL and LL respectively). For the J, just the opposite was found; jaw opening duration was longer for /i/ compared to /ae/ ([F(1,957) = 160.45, p = .0001]). Opposing vowel-related effects for the lips and jaw apparently offset one another in the combined lip aperture signal, resulting in no net change to the second movement cycle.
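The lip aperture signal referred to above is simply the vertical separation of the two lips, so articulator-specific effects of opposite sign can cancel in it. A small sketch of the composite signal (the trajectories and names are hypothetical):

```python
import numpy as np

fs = 500.0
t = np.arange(0.0, 0.3, 1.0 / fs)

# Hypothetical vertical positions (mm): a small upper-lip excursion and a
# larger lower-lip/jaw excursion moving in the same direction.
upper_lip = 2.0 - 0.5 * np.sin(2.0 * np.pi * 3.0 * t)
lower_lip = -6.0 - 3.0 * np.sin(2.0 * np.pi * 3.0 * t)

# Lip aperture: the UL-LL separation. Because both components shift
# together here, part of each articulator's motion is offset in the
# composite signal, as described in the text.
aperture = upper_lip - lower_lip
print(round(aperture.max() - aperture.min(), 2))
```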

Table 5. Stepwise regression of the upper lip (UL), lower lip (LL), and jaw (J) oral closing displacement (disp) for the first opening/closing sequence. The displacement of each individual articulator was regressed on the relative position of each of the three articulators for the preceding vowel.

Upper Lip                                           R     adj. R-squared
S1   UL disp = 3.45 + .56(ul) + .15(j) + .27(ll)   .66        .42
S2   UL disp = 3.06 + .38(ul) + .08(j)             .37        .12
S3   UL disp = 1.45 + .64(ul) + .13(j) + .09(ll)   .67        .43
S4   UL disp = 2.02 + .84(ul) .21(j) + .2(ll)      .67        .44

Lower Lip                                           R     adj. R-squared
S1   LL disp = 2.63 + .37(ll) + .26(j) + .26(ul)   .67        .44
S2   LL disp = 2.66 + .71(ll) + .18(j)             .72        .50
S3   LL disp = 3.99 + .32(ll) + .14(j)             .36        .11
S4   LL disp = .1 + .6(ll) + .28(j) + .46(ul)      .78        .60

Jaw                                                 R     adj. R-squared
S1   J disp = .85 + .58(j) + .66(ul)               .74        .53
S2   J disp = 3.1 + .61(j) + .32(ul) + .22(ll)     .76        .57
S3   J disp = .07 + .37(j) - .49(ul)               .64        .40
S4   J disp = .01 + .48(j)                         .50        .24

Figure 11. Duration (in milliseconds) of the second movement cycle (S-2) for the oral opening for /ae/ and subsequent closing for /p/ following the different vowel-consonant combinations in the first cycle (S-1) for each of the four subjects. Vertical bars indicate one standard error; asterisks indicate significant /p/ - /b/ or /m/ differences (p < .01).


Figure 12. Group means of the duration of opening (Op_2) and closing (Cl_2) movements for the second sequence for the upper lip (UL), lower lip (LL), and jaw. The opening movement durations for the UL and LL were reliably longer for /p/ compared to /b/ and /m/. Vertical bars indicate one standard error.

Velocity/Displacement

Other characteristics of the opening phase of the second sequence revealed articulator-specific contextual differences in movement extent and movement velocity consistent with these results. In the opening phase, the most significant effects were vowel related for the J and consonant related for the LL. As summarized in Figure 13 for the four subjects, the J opening movement for the /ae/ went farther (S1-S3 [F(1,238) = 179.72, 65.1, 709.07 respectively, p = .0001]; S4 [F(1,237) = 298.2, p = .0001]) and faster when following /i/ (S1-S3 [F(1,238) = 78.06, 12.01, 419.24 respectively, p = .0001]; S4 [F(1,237) = 306.88, p = .0001]). The effect for the LL was less robust (displacement: S3 F = 290.46, p = .0001; S4 F = 52.05, p = .0001; velocity: S2 F = 6.93, p = .009; S3 F = 21.7, p = .0001) and, when significant, went in the opposite direction to that observed for the J. The greater J opening displacement, duration, and velocity for /ae/ following /i/ appears to be due to the higher jaw position prior to the second opening movement following /i/. The jaw position for /i/ is higher than when the first vowel is /ae/ and, as a consequence, the jaw opening displacement starts from a significantly higher position. The higher J position results in larger opening movement displacements, longer opening movements, and higher opening velocities for the subsequent vowel opening.

In contrast to the vowel-related opening differences, consonant-related differences were noted, but only for the LL. The magnitude of the opening velocity following closure was found to order according to consonant identity, with /p/ < /b/ < /m/ for all subjects. As can be seen in Figure 14, this result was reliable for all subjects for the LL (S1-S4 F = 52.95, 237.76, 54.44, 160.85 respectively, p = .0001). For the J, post hoc testing revealed only two significant differences: /b/ - /m/ for S1 (F = 6.01, p < .01) and /p/ - /m/ for S2 (F = 31.52, p < .01).

Lip-jaw Coordination

Oral closing. Starting from a relatively steady state for the /s/ in the six stimulus words, the upper lip, lower lip, and jaw attained some open posture for the vowel and subsequently closed the oral opening cooperatively. Using the jaw opening peak velocity for the vowel as a reference point


(see Figure 2), the consistency of the relative timing of the three articulators was examined. Consistent with previous studies, the timing of the UL, LL, and J covaries during oral closing (Gracco, 1988; Gracco & Abbs, 1986). The UL-LL


relative timing for the four subjects is presented in Figure 15, with the corresponding correlations shown at the bottom. Similar results were obtained for the LL-J, with all within-vowel correlations ranging from r = .86 to r = .99.
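The timing measures in Figures 15 and 16 are times of velocity peaks referenced to a common kinematic event. A hedged sketch of extracting such peak times with scipy follows; the synthetic trajectories, durations, and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 500.0  # samples/sec

def velocity_peak_times_ms(position_mm, fs_hz):
    """Times (ms) of local maxima in movement speed, i.e., velocity peaks."""
    speed = np.abs(np.gradient(position_mm) * fs_hz)
    peaks, _ = find_peaks(speed)
    return 1000.0 * peaks / fs_hz

# Hypothetical open-close cycles for the jaw and lower lip; the lip lags
# the jaw by 10 ms.
t = np.arange(0.0, 0.3, 1.0 / fs)
jaw = -5.0 * (1.0 - np.cos(2.0 * np.pi * t / 0.3))
ll = -3.0 * (1.0 - np.cos(2.0 * np.pi * (t - 0.01) / 0.3))

jaw_peaks = velocity_peak_times_ms(jaw, fs)  # [opening peak, closing peak]
ll_peaks = velocity_peak_times_ms(ll, fs)

# LL closing peak time referenced to the jaw opening peak, as in Figure 15.
print(ll_peaks[1] - jaw_peaks[0])
```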

Figure 13. Peak opening velocity and displacement for the jaw (Jaw) and lower lip (LL) for the second movement sequence for the four subjects. Vertical bars indicate one standard error; asterisks indicate significant vowel differences (p < .01). As shown, the jaw opening is faster and farther when the preceding vowel is /i/ compared to /ae/. The same trend is not seen in the lower lip.

Figure 14. Jaw and lower lip opening velocity for the second movement sequence as a function of preceding consonant identity for the four subjects. For all subjects, the LL opening velocity was highest for /m/ compared to /p/ or /b/; the same trend was not found for the jaw. Vertical bars indicate one standard error.


Oral opening. In contrast to the UL, LL, and J timing relations for the oral closing, the timing of the three articulators for oral opening was less consistent. Figure 16 presents the UL-LL relative timing for oral opening for the four subjects. For these data, the UL-LL opening velocity timing is referenced to the time of the preceding UL peak closing velocity. While it appears that the two lips are certainly related in their timing, they do not display the same consistency seen for the oral closing (Figure 15). Similar results were obtained for the LL-J, with correlations ranging from r = .57 to r = .80.

Context effects. A final issue related to the lip-jaw coordination for oral closing focused on whether the coordinative relations among the articulators are fundamentally the same or different for different contexts. To examine the articulatory relations in detail, the intervals between the UL-LL and LL-J peak velocities were examined.

Figure 15. Scatter plots of the relative timing of the upper lip (UL) and lower lip (LL) peak closing velocity for the different vowel-consonant contexts for the four subjects. Time of peak velocity was referenced to the time of the jaw opening peak velocity for the first vowel. Product-moment correlation coefficients (r) are presented at the bottom right-hand corner of each plot.


The interpeak interval is a measure of the absolute time difference between articulator pairs, and the sign of the difference (positive or negative) indicates whether one articulator leads or lags another. Figure 17 presents the averages of the UL-LL (right side) and LL-J (left side) interpeak intervals for oral closing for the four subjects. The most significant effects were vowel-related. For the UL-LL, the tendency was for the interpeak interval to decrease in the /i/ context, with four of the twelve possible within-vowel comparisons (3 comparisons x 4 subjects) reaching significance (S2 F = 6.88 for /b/; S3 F = 4.25 for /p/; S4 F = 5.32 and F = 9.91 for /p/ and /b/ respectively, p < .01). The LL-J interval also displayed vowel-related effects for all subjects; however, the trend was opposite of that seen for the UL-LL. Of the twelve possible differences, five were found to be reliable (S2 F = 3.82 and 3.82 for /b/ and /m/ respectively; S3 F = 6.12 for /m/; S4 F = 20.66 and 4.52 for /p/ and /m/ respectively, p < .01). For the /i/ context, the LL-J interval tended to increase, indicating that the J timing was not being adjusted to the vowel context. Consonant-related effects for the UL-LL were essentially absent, with only two /p/ comparisons reaching significance (S3 F = 8.51 and S4 F = 5.5, p < .01). For the LL-J, there was a trend for the interval for /p/ closure to be longer than /b/ and/or /m/, with reliable differences found for three subjects (S2 F = 18.99 for /p/ - /b/ in the /ae/ context and F = 45.27 in the /i/ context; S3 F = 7.48 for /p/ - /m/; S4 F = 14.36 for /p/ - /b/ in the /ae/ context and F = 48.29 for /p/ - /b/ in the /i/ context; p < .01). Again it appears that the J timing is less related to the phonetic context than are the lips.
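Given per-trial peak-velocity times, the interpeak interval defined above is a signed difference between articulator pairs. A minimal sketch (the peak times are hypothetical):

```python
import numpy as np

# Hypothetical closing peak-velocity times (ms) across four repetitions.
t_ul = np.array([148.0, 152.0, 145.0, 150.0])  # upper lip
t_ll = np.array([153.0, 158.0, 150.0, 154.0])  # lower lip
t_j = np.array([160.0, 163.0, 158.0, 161.0])   # jaw

# Signed interpeak intervals: positive values mean the first articulator of
# the pair peaked earlier, i.e., a UL-LL-J sequence when both are positive.
ul_ll = t_ll - t_ul
ll_j = t_j - t_ll
print(ul_ll.mean(), ll_j.mean())
```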

Figure 16. Scatter plots of the relative timing of the upper lip (UL) and lower lip (LL) peak velocity for the second oral opening (Op_2) for the four subjects. Time of peak velocity was referenced to the time of the upper lip peak velocity for the oral closing. Product-moment correlation coefficients (r) are presented at the bottom right-hand corner of each plot.


Figure 17. Upper lip-lower lip (UL-LL; left side) and lower lip-jaw (LL-J; right side) closing velocity interpeak intervals for the four subjects. The interpeak interval was defined as the time, in milliseconds, between the occurrence of the peak velocities for the different articulator pairs. A single asterisk centered over either /p/ bar indicates a single significant within-vowel consonant comparison; a double asterisk indicates that both comparisons (/p/ - /b/ and /p/ - /m/) reached significance (p < .01). An asterisk centered over any consonant pair indicates a significant within-consonant vowel comparison (p < .01).


It can also be seen that the sequence of velocity events varied across subjects and articulator pairs. Positive values for the two interpeak intervals indicate that the average sequence of articulator velocity peaks was UL-LL-J. For S2 and S4, negative values were obtained for /b/ and /m/ for the LL-J and UL-LL respectively, indicating that the sequence of events reversed from UL-LL-J to either UL-J-LL or LL-UL-J. In general, the UL-LL timing for the different phonetic contexts suggests that the coordinative relations among these articulators remain similar for different consonants. When the movement characteristics are substantially modified, as occurred for the different vowels, there is a tendency for the relative timing of the articulators to change, with differences noted for the lips.

DISCUSSION

The present study investigated characteristics of the lip and jaw movements associated with a class of consonants (bilabials). A number of observations were made that reflect on certain speech movement principles and the manner in which such principles are modulated for phonetic context. The most consistent finding of the present investigation is that the timing or phasing of contiguous movement phases is an important mechanism for modifying speech movements for certain phonetic differences (see also Browman & Goldstein, 1989; 1990; Edwards, Beckman, & Fletcher, 1991; Saltzman & Munhall, 1989). Speech movements appear to be organized, and hence controlled, at a level in which multiple articulatory actions and contiguous movement phases are the functional units of control and coordination. These conclusions will be discussed below.

Movement adjustments. The two vowels used in the present study resulted in articulatory changes in position, extent, and speed due to greater movement requirements for the jaw (Lindblom, 1967; Lindblom, Lubker, & Gay, 1979). As such, the size of the oral opening, determined predominantly by jaw position, is a significant factor in determining movement adjustments for different vowel contexts. In contrast, the upper and lower lips were not affected by the identity of the two vowels. Similar findings have been reported by Macchi (1988), in which lip position for /p/ was not affected by the vowel environment while the jaw was, due to tongue height. It should be noted that transducing lip motion in only one dimension (inferior-superior) limited any observations of anterior-posterior lip spreading, often assumed to accompany /i/. However, while there may have been some differences in the anterior-posterior dimension, it seems clear that lip motion is not the major factor in differentiating the vowels /ae/ and /i/.

The vowel-related jaw opening movements resulted in systematic and distributed differences in oral closing movement characteristics. As jaw opening was modified for the high or low vowel, the extent and speed of lip and jaw oral closing characteristics were similarly adjusted; vowel-related oral closing movement adjustments were distributed to all the contributing articulators. These vowel-related adjustments suggest that oral opening and oral closing are not independent events. Rather, oral closing actions are dependent on characteristics of the preceding oral opening action. Other evidence for systematic movement relations between oral opening and oral closing can be found in results reported by Folkins and Canty (1986), Folkins and Linville (1983), Hughes and Abbs (1976), and Kozhevnikov and Chistovich (1965).

Some of the opening/closing interactions observed were apparently consonant-related, such as higher opening and closing velocities for the voiceless consonant /p/. Jaw opening velocity for /ae/ was faster when the subsequent closing movement was the voiceless /p/ compared to /b/ or /m/ (Summers, 1987). In addition, LL closing velocity was generally higher for /p/ than /b/ or /m/ independent of the preceding vowel, and LL opening velocity following closure was always lowest for /p/ and highest for /m/. Previous investigations have demonstrated faster lower lip or jaw closing velocity for voiceless compared to voiced consonants (Chen, 1970; Fujimura & Miller, 1979; Sussman et al., 1973; Summers, 1987), and slower lip opening velocity for the voiceless consonant (Sussman et al., 1973). It has been suggested that the higher LL closing velocity is to accommodate the higher intraoral pressures for /p/ compared to /b/ (Chen, 1970; Sussman et al., 1973). That is, higher intraoral air pressure may require greater lip force, manifest as higher lip closing velocity, to maintain lip contact during the closure interval. While it is generally observed that oral pressure for a voiceless sound is higher than for its voiced counterpart, which is higher than its voiced nasal counterpart (Arkebauer, Hixon, & Hardy, 1967; Black, 1950), empirical evidence suggests that higher lip closing velocity does not directly translate into greater lip contact force. Using a miniature pressure sensor, Lubker and Parris (1970) were unable to find reliable differences in lip


contact pressure for /p/ and /b/. An additional consideration is the higher lower lip opening velocity for /m/. If oral pressure and lip velocity covary, then the lowest oral pressure sounds, like /m/, should consistently have the lowest closing and opening velocity. However, in the same vowel context, lower lip opening velocity following /m/ closure was higher than /p/. It seems premature at best to conclude that lip closing velocity is modulated according to oral pressure characteristics, especially when a direct measure of lip contact does not support such an interpretation.

Similarly, mass-normalized stiffness, defined as the ratio of the peak closing velocity and peak displacement, was not found to consistently differentiate the different vowel and consonant combinations. While the LL stiffness estimates for the closing movements were found to be consistently higher for /p/ than /b/ or /m/, the upper lip and jaw estimates were not. For the upper lip, estimates of mass-normalized stiffness often varied reciprocally with the LL values, while the jaw values showed no consistent patterns. It is hard to reconcile a differential modulation of stiffness for three articulators contributing to the same movement or phonetic goal. While stiffness may be a convenient means of describing the motion or rate of an articulator, it is not clear that it qualifies as a controlled variable. Rather than a single physical control variable, it appears that speech movements are more likely organized according to principles that take into account the function of the action related to the overall goal of communication (Gracco, 1990; 1991).

A potential explanation for voiced/voiceless differences can be presented from examining the relative timing of the opening and closing actions. The relative timing or phasing of articulatory motion was found to be a consistent variable differentiating the voiced and voiceless consonants. For the voiceless /p/, the closing action was initiated earlier than for /b/ or /m/, while for oral opening, /p/ was initiated later than either /b/ or /m/. These two timing results are consistent with previous acoustic results related to durational differences for voiced and voiceless consonants. In many languages, two related durational phenomena have been reported for voiced and voiceless sounds. Vowel length is shorter, and duration of the consonant closure, defined acoustically, is longer for voiceless than for voiced consonants. In the present phonetic context, the interval between jaw opening velocity for /ae/ and oral closing velocity is essentially identical to the duration of the vowel (McClean, Kroll, & Loftus, 1990). As such, the earlier onset of the closing action for the /p/ is a kinematic manifestation of the differential vowel length effect (Denes, 1955; House & Fairbanks, 1953). Similarly, the longer opening movement duration following /p/ closure is consistent with the longer acoustic closure duration reported for voiceless sounds. It seems, from the acoustic durational effects reported previously and the kinematic effects reported here and elsewhere, that voiced and voiceless consonants differ primarily in their timing, with other kinematic variations, such as velocity differences, secondary. The movement velocity changes may be considered a natural consequence of the timing requirements for the different consonants.

However, an explanation of articulatory timing as a means to differentiate sounds within a class leaves open the question as to why articulatory timing differs. Recently it has been suggested that /p/ - /b/ differences in vowel duration reflect a perceptually salient feature of speech production which enhances the acoustic differences between voiced and voiceless consonant sounds (Kluender, Diehl, & Wright, 1988). That is, voiced and voiceless consonants differ in their timing because of the perceptual enhancement the durational differences provide to the listener. A recent investigation by Fowler (1992), however, suggests that there is no auditory-perceptual enhancement to be obtained from the shorter vowel durations and longer consonant closure durations for /p/ versus /b/. A simpler, and less speculative, explanation is that timing differences among the bilabial consonants reflect modifications to accommodate concomitant articulatory actions. In the case of /p/, the longer closure duration compared to /b/ is to accommodate the laryngeal adjustment associated with devoicing. As such, from a functional perspective, /p/ is an inherently longer sound than /b/. The shortening of the vowel reflects an intrusion of the phonetic gesture for /p/, including the laryngeal devoicing, on the vowel. The closing must be faster to integrate the lip and larynx actions, while the opening is slower to provide the appropriate voice onset time. It is suggested that all voiced and voiceless sound pairs (cognates) are fundamentally differentiated by their relative timing or phasing, which is a direct consequence of the supralaryngeal-laryngeal interaction and consistent with aerodynamic requirements (Harris et al., 1965; Lisker & Abramson, 1964; Lubker & Parris, 1970).

Speech movement coordination. Another robust finding of the present investigation was the


consistency of the upper lip-lower lip timing relations associated with the different bilabial consonants. For all subjects, the temporal relations of the velocity peaks for the upper and lower lips were essentially constant across phonetic contexts, indicating a constraint on their coordination. While the UL-LL timing was relatively consistent, the sequence of events (UL-LL-J) did not remain constant as had been observed in more limited phonetic contexts (Caruso et al., 1988; Gracco, 1988; Gracco & Abbs, 1986; McClean et al., 1990). However, based on a number of investigations that have reported different sequences, it is not clear that the actual pattern or sequence of articulatory events is of importance (De Nil & Abbs, 1991). Rather, it is the general trend for covariation that is indicative of an underlying invariant control parameter (Gracco, 1990; 1991). That is, while the sequence of events may change with rate, stress, or phonetic context, such variation does not indicate a lack of invariance as suggested recently by De Nil and Abbs (1991). As pointed out by von Neumann (1958), the nervous system is not a digital device in which extremely precise manipulations are possible but an analog device in which precision is subordinate to reliability. Establishing the presence of invariance on fixed stereotypic patterns, similar to those observed in a mechanical system, is inconsistent with principles of biological systems. Rather, constraints on coordination are most likely specified stochastically, with specific limits on variation contingent on the overall goal of the action (see also Lindblom, 1987, for arguments for the adaptive nature of variability). As suggested by the quantal nature of speech (Stevens, 1972; 1989), there are vocal tract actions that may require relatively more precise control than others because of the potential acoustic consequences. However, it is doubtful that the timing of bilabial closure is a context in which absolute precision is required (see also Weismer, 1991).

More importantly, in the present study, as in previous investigations of speech movement timing, the consistent relative timing relations among the UL-LL-J indicate that the timing of these articulators is not controlled independently. Rather, timing adjustments covary across functionally related actions, and contributing articulators are effectively constrained to minimize the degrees of freedom to control (Gracco, 1988). While performance variables such as phonetic context, speaking rate, and the production of emphasis will change the surface kinematics, the underlying principles governing their coordination remain unchanged. Moreover, the articulator interactions, such as the lip adjustments to changes in jaw position and the opening and closing velocity covariation, are additional manifestations of the interdependency of speech movement coordination.

Opening/Closing differences

The present results also suggest that opening and closing movements are fundamentally different actions operating under different constraints (Gracco, 1988). Oral opening was found to be the locus of the most significant consonant- and vowel-related articulatory adjustments. Oral closing adjustments, on the other hand, varied primarily as a function of oral opening. Only the LL peak closing velocity for the different consonants demonstrated any consistent change across subjects. In contrast to the tightly coupled lip and jaw timing during closing, oral opening coordination was found to be more variable. The different spatiotemporal patterns observed suggest that opening and closing phases of sequential speech motor actions are differentially modifiable for context. Oral closing is generally faster, involving relatively abrupt constrictions or occlusions in various portions of the vocal tract. Oral opening is generally slower, involving resonance-producing events associated with vowel sounds. Hence these two classes of movements reflect fundamentally different speech actions with distinct functional (aerodynamic and acoustic) consequences (see also Fowler, 1983; Perkell, 1969).

Why this might be so may, in part, reflect the different biomechanical influences on the closing and opening actions. Oral opening involves temporarily moving the lips and jaw from some rest or neutral position to a position that requires stretching the associated tissues (skin, muscle, ligament). For closing, the lip and jaw motion is assisted by the elastic recoil from the opening stretch. A consequence of the different biomechanical influences on opening and closing would be that opening movements could be controlled directly by agonistic muscle actions. That is, changes in lip and jaw opening may result from direct modification of jaw and lip opening muscle activity. In contrast, to counteract the elastic recoil of the lip and jaw tissue, oral closing adjustments would require some combination of reduction in agonistic activation and/or agonist-antagonist co-contraction. It is suggested that the opening action would be an easier task to control than the closing action and that the biomechanical

Why this might be so may, in part, reflect thedifferent biomechanical influences on the closingand opening actions. Oral opening involves tem-porarily moving the lips and jaw from some rest orneutral position to a position that requiresstretching the associated tissues (skin, muscle,ligament). For closing, the lip and jaw motion isassisted by the elastic recoil from the openingstretch. A consequence of the different biomechan-ical influences on opening and closing would bethat opening movements could be controlled di-rectly by agonistic muscle actions. That is,changes in lip and jaw opening may result fromdirect modification of jaw and lip opening muscleactivity. In contrast, to counteract the elastic re-coil of the lip and jaw tissue, oral closing adjust-ments would require some combination of reduc-tion in agonistic activation and/or agonist-antago-nist co-contracti,-.: It is suggested that the open-ing action would be an easier task to control thanthe closing action and that the biomechanical in-


influences would differentially affect the two actions. The closing action would be more rapid, due to the contribution of tissue elasticity, and less variable. Such considerations are possible factors influencing the patterns observed in the present investigation, reflecting biomechanical optimizations, and may account for certain aspects of the ontogeny of the phonological system.

Articulator differences: Consonants and vowels

In addition to the different functions of the opening and closing actions, it can also be suggested that the lips and jaw contribute to sound production in different ways. Examination of the lip and jaw movement characteristics suggests that while all three articulators contribute to the general opening and closing, their roles are not identical. The lower lip closing and opening velocity varied as a function of the different consonant sounds. The upper lip, in contrast, did not. In addition, the lower lip and jaw closing displacements were significantly and systematically related to the opening position for the preceding vowel; the upper lip closing movement displacement was only weakly related. It is plausible that the upper lip contributes in a more stereotypic manner to consonant actions while the lower lip and jaw provide more of the details related to phonetic manipulations. Each lip contributes in an interdependent but not redundant manner to the achievement of oral closing.

From the lip and jaw differences noted in the present investigation, it can be suggested that consonant and vowel adjustments are articulator specific. The jaw, while assisting in the oral closing, was more significantly involved in the production of the vowels than was either of the lips. This was noted not only for the first syllable, in which the vowel varied between a phonetically high /i/ and low /ae/, but in the second syllable, in which jaw motion for the same syllable varied dependent on the identity of the preceding vowel. The same pattern was not observed for the LL motion. In contrast, consonant-dependent opening movement differences were noted for the LL but were not found for the J. These results suggest that boundaries between phonetic segments and/or phonemic classes, such as consonants and vowels, may be identified by differential articulatory actions that overlap in time. The speech mechanism can be thought of as a special purpose device in which individual components contribute in unique and complementary ways to the overall aerodynamic/acoustic events.

Speech motor programming/Serial timing

Based on the opening/closing interactions outlined above, it is not unreasonable to suggest that speech movements are organized minimally across movement phases, with movement extent and speed specified for units larger than individual opening and closing actions. The size of the organizational unit is at least on the order of a syllable and perhaps larger, encompassing something on the order of a stress group (Fowler, 1983; Sternberg et al., 1988). It can further be suggested that modulation of the relative timing of movement phases is an organizational principle used to differentiate many sounds of the language. As such, serial timing and the associated mechanism is an important concept in speech production, but one that has not received much empirical attention. It has been suggested previously that the serial order for speech is a consequence of an underlying rhythm generating mechanism (Kozhevnikov & Chistovich, 1965; Lashley, 1951; Kelso & Tuller, 1984; Gracco, 1988; 1990). For Lashley (1951), the simplest of timing mechanisms are those that control rhythmic activity, and he drew parallels between the rhythmic movements of respiration, fish swimming, musical rhythms, and speech (see also Stetson, 1951). For speech, however, the mechanism cannot be simple or stereotypic since, as shown in this investigation, the different sounds in the language have different temporal requirements. Moreover, such a rhythmic mechanism most likely reflects a network property (Martin, 1972), as opposed to a property of a localized group of neurons, since dysprosody results from damage to many different regions of the nervous system (e.g., Kent & Rosenbek, 1982). More likely, any centrally generated rhythmic mechanism for speech and any other flexible behavior must be modifiable by internal and external inputs (cf. Getting, 1989; Glass & Mackey, 1988; Harris-Warrick, 1988; Rossignol, Lund, & Drew, 1988).

Recent investigations on the discrete effects of mechanical perturbation on sequential speech movements are beginning to identify aspects of the underlying serial timing mechanism. Mechanical perturbations to the lower lip during sequential movement result in an increase or decrease in the movement cycle frequency depending on the phase of the movement in which the load is applied (Gracco & Abbs, 1988; 1989; Saltzman, Kay, Rubin, & Kinsella-Shaw, 1991). One interpretation of these results is that a rhythmic mechanism is the foundation for the serial timing of speech movements and that this mechanism is modifiable, not


stereotypic (Gracco, 1990; 1991). The present results, demonstrating changes in opening and closing phasing with phonetic context, reflect another aspect of the serial timing for speech. The consonants acted to differentially modify the ongoing speech rhythm. As such, consonants can be considered to act as perturbations on an underlying vowel-dependent rhythm. Consistent with this speculation is evidence from two apparently contradictory results obtained by Ohala (1975) and Kelso et al. (1985). Searching for the underlying preferred speech movement frequency, Ohala (1975) was unable to find an isochronous frequency associated with jaw movements during oral reading. In contrast, Kelso et al. (1985) provided evidence suggesting that jaw movements during reiterant speaking were produced with very little variation around a characteristic frequency. The difference in these two investigations is in the phonetic content used by the different investigators. The reiterant speech contained the syllable "ba" produced with different stress levels, whereas Ohala used unconstrained oral reading passages. Sounds of the language may have an inherent frequency (intrinsic timing; Fowler, 1980) which interacts with a central rhythm, resulting in a modal frequency with significant dispersion. Similarly, the degree to which a characteristic rhythm or frequency is modulated during speaking is no doubt a result of an interaction of a number of functional considerations, such as the articulatory adjustments for the specific phoneme, and other suprasegmental considerations such as stress and rate (Gracco, 1990; 1991). While speculative, further investigations of the serial timing characteristics or rhythm generation for speech (Gracco & Abbs, 1989; Saltzman et al., 1991), combined with computational models to evaluate the potential oscillatory network or serial dynamics (Laboissiere et al., 1991; Saltzman & Munhall, 1989) associated with the temporal patterning for speech, are important areas of future research.

Speech production: A functional neuromotor perspective

The observations above suggest that the neuromotor representation for speech is more complicated than can be captured by a single control variable or sensorimotor mechanism. Rather, speech production is organized according to principles that incorporate the sequential nature of the process, the functional character of the production units, and the articulatory details that shape the acoustic output. It appears that speech motor control is organized at a functional level according to sound-producing vocal tract actions (Gracco, 1990; 1991; Kent, Carney, & Severeid, 1974). These functional groupings are stored in memory and map categorically onto the sounds of the language. The apparent constraint on coordinating functionally related articulators observed in this and other investigations (Gracco, 1988; 1990; Gracco & Löfqvist, 1989; in preparation) suggests that the temporal patterning among articulators is a component of the neuromotor representation. Observable speech movements (or vocal tract area functions) reflect a combination of stored representations with flexible sensorimotor processes interacting to form complex vocal tract patterns from relatively simple operations (Gracco & Abbs, 1987; Gracco, 1990; 1991). Stored neuromotor representations and sensorimotor interactions simplify the overall motor control process by minimizing much of the computational load. Important areas of future research are to determine the precise role and contribution of biomechanical properties to the observable patterns, to evaluate the relative strength of articulator interactions, and to identify how articulator movements are modified by linguistic and cognitive considerations. The major contributions to understanding speech motor control and underlying nervous system organization will come from a better understanding of the neural, physical, and cognitive factors that govern this uniquely human behavior.

REFERENCES
Abbs, J. H., & Gracco, V. L. (1984). Control of complex motor gestures: Orofacial muscle responses to load perturbations of the lip during speech. Journal of Neurophysiology, 51(4), 705-723.
Adams, S. G., Weismer, G., & Kent, R. D. (1993). Speaking rate and speech movement velocity profiles. Journal of Speech and Hearing Research, 36, 41-54.
Arkebauer, H. J., Hixon, T. J., & Hardy, J. C. (1967). Peak intraoral air pressures during speech. Journal of Speech and Hearing Research, 10, 196-208.
Barlow, S. M., Cole, K. J., & Abbs, J. H. (1983). A new head-mounted lip-jaw movement transduction system for the study of motor speech disorders. Journal of Speech and Hearing Research, 26, 283-288.
Bernstein, N. (1967). The co-ordination and regulation of movements. New York: Pergamon Press.
Black, J. W. (1950). The pressure component in the production of consonants. Journal of Speech and Hearing Disorders, 15, 207-210.
Browman, C. P., & Goldstein, L. M. (1989). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 341-376). Cambridge, England: Cambridge University Press.
Browman, C. P., & Goldstein, L. M. (1990). Articulatory gestures as phonological units. Phonology, 6(2).
Caruso, A. J., Abbs, J. H., & Gracco, V. L. (1988). Kinematic analysis of multiple movement coordination during speech in stutterers. Brain, 111, 439-455.
Chen, M. (1970). Vowel length variation as a function of the voicing of the consonant environment. Phonetica, 22, 129-159.
Cooke, J. D. (1980). The organization of simple, skilled movements. In G. E. Stelmach & J. Requin (Eds.), Tutorials in motor behavior (pp. 199-212). Amsterdam: North-Holland.
Daniloff, R., & Moll, K. (1968). Coarticulation of lip rounding. Journal of Speech and Hearing Research, 11, 707-721.
Denes, P. (1955). Effect of duration on the perception of voicing. Journal of the Acoustical Society of America, 27, 761-764.
De Nil, L., & Abbs, J. (1991). Influence of speaking rate on the upper lip, lower lip, and jaw peak velocity sequencing during bilabial closing movements. Journal of the Acoustical Society of America, 89, 845-849.
Edwards, J., Beckman, M. E., & Fletcher, J. (1991). The articulatory kinematics of final lengthening. Journal of the Acoustical Society of America, 89, 369-382.
Folkins, J. W. (1981). Muscle activity for jaw closing during speech. Journal of Speech and Hearing Research, 24, 601-615.
Folkins, J. W., & Abbs, J. H. (1975). Lip and jaw motor control during speech: Responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 18, 207-220.
Folkins, J. W., & Brown, C. K. (1987). Upper lip, lower lip, and jaw interactions during speech: Comments on evidence from repetition-to-repetition variability. Journal of the Acoustical Society of America, 82, 1919-1924.
Folkins, J. W., & Canty, J. (1986). Movements of the upper and lower lips during speech: Interactions between lips with the jaw fixed at different positions. Journal of Speech and Hearing Research, 29, 348-356.
Folkins, J. W., & Linville, R. N. (1983). The effects of varying lower-lip displacement on upper-lip movements: Implications for the coordination of speech movements. Journal of Speech and Hearing Research, 26, 209-217.
Fowler, C. A. (1980). Coarticulation and theories of intrinsic timing. Journal of Phonetics, 8, 113-133.
Fowler, C. A. (1983). Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in monosyllabic stress feet. Journal of Experimental Psychology: General, 112, 386-412.
Fowler, C. A. (1992). Vowel duration and closure duration in voiced and unvoiced stops: There are no contrast effects here. Journal of Phonetics, 20, 143-165.
Fowler, C. A., Gracco, V. L., V.-Bateson, E. A., & Romero, J. (submitted). Global correlates of stress accent in spoken sentences. Journal of the Acoustical Society of America.
Fromkin, V. A. (1966). Neuro-muscular specification of linguistic units. Language and Speech, 9, 170-199.
Fujimura, O., & Miller, J. (1979). Mandible height and syllable-final tenseness. Phonetica, 36, 263-272.
Getting, P. A. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185-204.
Glass, L., & Mackey, M. C. (1988). From clocks to chaos. Princeton: Princeton University Press.
Gracco, V. L. (1987). A multilevel control model for speech motor activity. In H. Peters & W. Hulstijn (Eds.), Speech motor dynamics in stuttering (pp. 57-76). Wien: Springer-Verlag.
Gracco, V. L. (1988). Timing factors in the coordination of speech movements. Journal of Neuroscience, 8, 4628-4634.
Gracco, V. L. (1990). Characteristics of speech as a motor control system. In G. Hammond (Ed.), Cerebral control of speech and limb movements (pp. 3-28). North Holland: Elsevier.
Gracco, V. L. (1991). Sensorimotor mechanisms in speech motor control. In H. Peters, W. Hulstijn, & C. W. Starkweather (Eds.), Speech motor control and stuttering (pp. 53-78). North Holland: Elsevier.
Gracco, V. L., & Abbs, J. H. (1985). Dynamic control of the perioral system during speech: Kinematic analyses of autogenic and nonautogenic sensorimotor processes. Journal of Neurophysiology, 54, 418-432.
Gracco, V. L., & Abbs, J. H. (1986). Variant and invariant characteristics of speech movements. Experimental Brain Research, 65, 156-166.
Gracco, V. L., & Abbs, J. H. (1987). Programming and execution processes of speech motor control: Potential neural correlates. In E. Keller & M. Gopnick (Eds.), Motor and sensory processes of language (pp. 163-202). Hillsdale, NJ: Lawrence Erlbaum.
Gracco, V. L., & Abbs, J. H. (1988). Central patterning of speech movements. Experimental Brain Research, 71, 515-526.
Gracco, V. L., & Abbs, J. H. (1989). Sensorimotor characteristics of speech motor sequences. Experimental Brain Research, 75, 586-598.
Gracco, V. L., & Löfqvist, A. (1989). Speech movement coordination: Oral-laryngeal interactions. Journal of the Acoustical Society of America, 86, S114.
Harris, K. S., Lysaught, G. F., & Schvey, M. M. (1965). Some aspects of the production of oral and nasal labial stops. Language and Speech, 8, 135-147.
Harris-Warrick, R. M. (1988). Chemical modulation of central pattern generators. In A. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates (pp. 285-332). New York: John Wiley & Sons.
House, A. S., & Fairbanks, G. (1953). The influence of consonant environment upon the secondary acoustic characteristics of vowels. Journal of the Acoustical Society of America, 25, 105-113.
Hughes, O., & Abbs, J. (1976). Labial-mandibular coordination in the production of speech: Implications for the operation of motor equivalence. Phonetica, 33, 199-221.
Kelso, J. A. S. (1986). Pattern formation in speech and limb movements involving many degrees of freedom. In H. Heuer & C. Fromm (Eds.), Generation and modulation of action patterns (pp. 105-128). Berlin: Springer-Verlag.
Kelso, J. A. S., & Tuller, B. (1984). Converging evidence in support of common dynamical principles for speech and movement coordination. American Journal of Physiology, 246, R928-R935.
Kelso, J. A. S., Tuller, B., V.-Bateson, E., & Fowler, C. (1984). Functionally specific articulatory cooperation following jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance, 10, 812-832.
Kelso, J. A. S., V.-Bateson, E., Saltzman, E. L., & Kay, B. (1985). A qualitative dynamic analysis of reiterant speech production: Phase portraits, kinematics, and dynamic modeling. Journal of the Acoustical Society of America, 77, 266-280.
Kent, R. D. (1983). The segmental organization of speech. In P. MacNeilage (Ed.), The production of speech. New York: Springer-Verlag.
Kent, R. D., Carney, P. J., & Severeid, L. R. (1974). Velar movement and timing: Evaluation of a model for binary control. Journal of Speech and Hearing Research, 17, 470-488.
Kent, R. D., & Rosenbek, J. C. (1982). Prosodic disturbances and neurologic lesion. Brain and Language, 15, 259-291.
Kluender, K. R., Diehl, R. L., & Wright, B. A. (1988). Vowel-length differences before voiced and voiceless consonants: An auditory explanation. Journal of Phonetics, 16, 153-169.
Kollia, Gracco, & Harris (1992). Functional organization of velar movements following jaw perturbation. Journal of the Acoustical Society of America, 91(2), 2474.
Kozhevnikov, V., & Chistovich, L. (1965). Speech: Articulation and perception. Joint Publications Research Service, 30,453; U.S. Department of Commerce.
Kuehn, D. P., & Moll, K. (1976). A cineradiographic study of VC and CV articulatory velocities. Journal of Phonetics, 4, 303-320.
Kuehn, D. P., Reich, A. R., & Jordan, J. E. (1980). A cineradiographic study of chin marker positioning: Implications for the strain gauge transduction of jaw movement. Journal of the Acoustical Society of America, 67, 1825-1827.
Laboissiere, R., Schwartz, J.-L., & Bailly, G. (1991). Motor control for speech skills: A connectionist approach. In D. S. Touretzky & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 319-327). San Mateo, CA: Morgan Kaufmann.
Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior: The Hixon symposium. New York: Wiley.
Lindblom, B. E. F. (1987). Adaptive variability and absolute constancy in speech signals: Two themes in the quest for phonetic invariance. Proceedings of the Eleventh International Congress of Phonetic Sciences, 3, 9-18.
Lindblom, B. E. F., Lubker, J., & Gay, T. (1979). Formant frequencies of some fixed-mandible vowels and a model of speech motor programming by predictive simulation. Journal of Phonetics, 7, 147-161.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384-422.
Löfqvist, A., & Yoshioka, H. (1981). Interarticulator programming in obstruent production. Phonetica, 38, 21-34.
Löfqvist, A., & Yoshioka, H. (1984). Intrasegmental timing: Laryngeal-oral coordination in voiceless consonant production. Speech Communication, 3, 279-289.
Lubker, J. F., & Parris, P. J. (1970). Simultaneous measurements of intraoral pressure, force of labial contact, and labial electromyographic activity during production of the stop consonant cognates /p/ and /b/. Journal of the Acoustical Society of America, 47, 625-633.
Luce, P. A., & Charles-Luce, J. (1985). Contextual effects on vowel duration, closure duration and the consonant/vowel ratio in speech production. Journal of the Acoustical Society of America, 78, 1949-1957.
Macchi, M. (1988). Labial articulation patterns associated with segmental features and syllable structure in English. Phonetica, 45, 109-121.
Martin, J. G. (1972). Rhythmic (hierarchical) versus serial structure in speech and other behavior. Psychological Review, 79, 487-509.
McClean, M. D., Kroll, R. M., & Loftus, N. S. (1990). Kinematic analysis of lip closure in stutterers' fluent speech. Journal of Speech and Hearing Research, 33, 755-760.
Moore, C. A., Smith, A., & Ringel, R. L. (1988). Task-specific organization of activity in human jaw muscles. Journal of Speech and Hearing Research, 31, 670-680.
Muller, E. M., & Abbs, J. H. (1979). Strain gage transduction of lip and jaw motion in the midsagittal plane: Refinement of a prototype system. Journal of the Acoustical Society of America, 65, 481-486.
Munhall, K., Löfqvist, A., & Kelso, J. A. S. (in press). Lip-larynx coordination in speech: Effects of mechanical perturbations to the lower lip. Journal of the Acoustical Society of America.
Munhall, K., Ostry, D. J., & Parush, A. (1985). Characteristics of velocity profiles of speech movements. Journal of Experimental Psychology: Human Perception and Performance, 11, 457-474.
Nelson, W. L. (1983). Physical principles for economies of skilled movements. Biological Cybernetics, 46, 135-147.
Ohala, J. J. (1975). The temporal regulation of speech. In G. Fant & M. A. A. Tatham (Eds.), Auditory analysis and perception of speech (pp. 431-454). London: Academic.
Ostry, D. J., Keller, E., & Parush, A. (1983). Similarities in the control of the speech articulators and the limbs: Kinematics of tongue dorsum movement in speech. Journal of Experimental Psychology: Human Perception and Performance, 9, 622-636.
Ostry, D. J., & Munhall, K. G. (1985). Control of rate and duration of speech movements. Journal of the Acoustical Society of America, 77, 640-648.
Parush, A., Ostry, D. J., & Munhall, K. G. (1983). A kinematic study of lingual coarticulation in VCV sequences. Journal of the Acoustical Society of America, 74, 1115-1125.
Perkell, J. S. (1969). Physiology of speech production. Research Monograph 53. Cambridge, MA: MIT Press.
Rossignol, S., Lund, J. P., & Drew, T. (1988). The role of sensory inputs in regulating patterns of rhythmical movements in higher vertebrates: A comparison between locomotion, respiration, and mastication. In A. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates (pp. 201-284). New York: John Wiley & Sons.
Saltzman, E. L. (1986). Task dynamic coordination of the speech articulators: A preliminary model. In H. Heuer & C. Fromm (Eds.), Generation and modulation of action patterns (pp. 129-144). Berlin: Springer-Verlag.
Saltzman, E. L., Kay, B., Rubin, P., & Kinsella-Shaw, J. (1991). Dynamics of intergestural timing. PERILUS, 16, 47-55.
Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4), 333-382.
Sternberg, S., Knoll, R. L., Monsell, S., & Wright, C. E. (1988). Motor programs and hierarchical organization in the control of rapid speech. Phonetica, 45, 175-197.
Stetson, R. H. (1951). Motor phonetics: A study of speech movements in action. Amsterdam: North Holland Publishing Co.
Stevens, K. N. (1972). On the quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David & P. B. Denes (Eds.), Human communication: A unified view (pp. 51-66). New York: McGraw-Hill.
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3-45.
Summers, W. V. (1987). Effects of stress and final-consonant voicing on vowel production: Articulatory and acoustic analyses. Journal of the Acoustical Society of America, 82, 847-863.
Sussman, H. M., MacNeilage, P. F., & Hanson, R. J. (1973). Labial and mandibular dynamics during the production of bilabial consonants: Preliminary observations. Journal of Speech and Hearing Research, 16, 397-420.
Tatham, M. A. A., & Morton, K. (1968). Some electromyography data towards a model of speech production. University of Essex, Language Centre, Occasional Papers 1.
von Neumann, J. (1958). The computer and the brain. New Haven: Yale University Press.
Weismer, G. (1991). Assessment of articulatory timing. In J. Cooper (Ed.), Assessment of speech and voice production: Research and clinical applications (pp. 84-95). NIDCD Monograph.

FOOTNOTE
*Journal of Speech and Hearing Research, in press.

.97BEST COPY AVAILABLE

Page 98: CS 508 427 Fowler, Carol A., Ed. -REPORT NO 93 · AUTHOR Fowler, Carol A., Ed. TITLE Spee:11 Research Status Report, January-March 1993. ... A Case Study of Chinese' by Rumjahn Hoosain"

Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 91-94

The Quasi-steady Approximation in Speech Production*

Richard S. McGowan

Because boundary-layer separation is important in determining force between flowing air and solid bodies, and separation can be sensitive to unsteadiness, the quasi-steady approximation needs to be examined for the flow-induced oscillations of speech (e.g., phonation and trills). A review of the literature shows that vibratory phenomena, such as phonation and tongue-tip trills, may need to be modeled without the quasi-steady approximation.

This work was supported by NIH grants DC-00121 and DC-00865 to Haskins Laboratories. Thanks to Anders Löfqvist and Philip Rubin for comments.

The quasi-steady approximation is commonly made in modeling air flow and vibration in speech production. Experimentally, the quasi-steady approximation means that time-varying situations, such as vocal fold oscillation or tongue-tip trills, can be studied with a series of static configurations simulating the air flow in a vocal tract. For mathematical modeling purposes, the quasi-steady approximation means that acceleration terms involving partial derivatives with respect to time in the equations of motion of air can be neglected. This allows the modeling of air flow as a sequence of static flow configurations. Although an inductive term representing the effect of acceleration of air in the constriction is often included in mathematical modeling, other unsteady effects are ignored. When the quasi-steady approximation is made, vorticity and turbulence distributions, important in force and energy balance considerations, are assumed to be unaffected by unsteady air acceleration. In particular, the quasi-steady approximation is applied to boundary-layer separation, which is an important determinant of vorticity distribution, and, hence, of the energy exchange between the air and solid in flow-induced oscillations, such as phonation and tongue-tip trills. However, a review of recent literature (e.g., Bertram & Pedley, 1983; Cancelli & Pedley, 1985; Pedley & Stephanoff, 1985; Sobey, 1983; Sobey, 1985) shows that the quasi-steady approximation can only be used with great care in the fluid mechanics regimes involving boundary-layer separation. Because boundary-layer separation occurs in the vocal tract, these recent works bear consideration for understanding speech production. Some of the recent literature that brings the quasi-steady approximation into question will be reviewed here, after some of its relevance to speech production modeling has been discussed.

The fact that characteristic Strouhal numbers are often small in speech has been used to justify the quasi-steady approximation. For periodic motion the Strouhal number is the product of a characteristic frequency and characteristic length scale divided by a characteristic velocity, and it is the coefficient of the time derivative terms in nondimensional versions of the mass and momentum conservation equations. If the Strouhal number is small compared to one, it is presumed that these terms can be neglected, that is, it is possible to make the quasi-steady approximation. (This assumes that other terms are of order one, which is generally the case.) The quasi-steady approximation is often used in aerodynamic considerations for the modeling of phonation, stop release, and fricatives. For phonation, with maximum air speeds in excess of 2000 cm/sec, frequencies on the order of 100 Hz, and length scales at most on the order of 1 cm, producing a Strouhal number less than .05, this may appear to be a valid approximation to use for simplifying a model (Catford, 1977, p. 98). Before reviewing the reasons that this may be faulty, it is necessary to consider boundary-layer separation and its importance for vibratory phenomena in the vocal tract.
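Written out, with f, L, and U standing for the characteristic frequency, length, and velocity (symbols chosen here for convenience, not the paper's notation), the phonation estimate above is:

\[
\mathrm{St} = \frac{f L}{U} = \frac{(100\ \mathrm{Hz})(1\ \mathrm{cm})}{2000\ \mathrm{cm/s}} = 0.05 .
\]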

When air flows over a solid surface, a layer of air with a high concentration of vorticity is formed next to the solid surface, and this layer is a boundary layer. Boundary-layer separation occurs at places on the surface of a solid where the vorticity of the boundary layer abruptly leaves the region close to the solid boundary and is subsequently convected by the flow. The places where separation occurs are called separation points. (A separation point in two-dimensional modeling represents a line of separation in the third dimension.) Separation occurs where there is a sufficient adverse pressure gradient, as occurs when flow is decelerated (Lighthill, 1963). An adverse pressure gradient occurs when there is flow along solids from regions of low pressure to regions of high pressure, and if the spatial rate of change of pressure is sufficiently large, the boundary layer will separate. For example, the flow from the constriction for an /s/ separates because of the abrupt, adverse pressure change caused by the sudden area expansion after the constriction. During phonation, the flow separates from the folds, thus providing the time-varying flow resistances that are essential for the production of modal voice (Ishizaka & Flanagan, 1972).

The locations of separation points are important for the study of flow-induced oscillations in the vocal tract because they help to determine the forces between the air and the tissue. Because vorticity is transported from the boundary layer into the main portion of the flow field at a separation point, there can be a change in pressure head in traversing the region near a separation point. A net force on a blunt object in the direction of flow can result from such a pressure head difference. There is likely to be a separation point near the boundary between the windward and the leeward sides of such an object because of the adverse pressure gradient. (There is an adverse pressure gradient because the flow slows down on the leeward side.) The windward side has a higher pressure head than the leeward side, so there is a higher static pressure on the windward side than on the leeward side. These considerations are important for the energy exchange between moving objects and the air, because energy is the integral of force against distance moved. Thus, the change in boundary-layer separation behavior with the removal of the quasi-steady approximation may have important consequences in modeling flow-induced oscillations of speech, such as phonation and tongue-tip trills.

In the last decade the quasi-steady approximation for internal flows (e.g., flows inside the vocal tract), even at very small Strouhal number, has been questioned. The Reynolds number, equal to the product of a characteristic velocity and length scale divided by the kinematic viscosity, has been shown to be an important parameter for questions of unsteadiness. Pedley (1983), in a review article on physiological fluid flows, quotes results on channel flows driven sinusoidally from a side wall with amplitudes of between .28 and .57 of the channel width. For Reynolds numbers of between 300 and 700, based on the cross-sectionally averaged, steady fluid particle velocity and the channel width, quasi-steady behavior disappears for a Strouhal number of about .008. A vortex wave is observed to travel downstream from the oscillating portion of the wall (Pedley & Stephanoff, 1985; Sobey, 1985), which is not predicted by quasi-steady theory. There is also other behavior that marks the inadequacy of the quasi-steady approximation in unsteady flow. Bertram and Pedley (1983) have shown experimentally that impulsively started flow over an indentation of the channel wall can create separated flow on the lee side of the indentation, with the separation point moving upstream as the steady state is approached. Sobey (1983) has used numerical simulations of unsteady flow in channels with wavy walls to show the moving separation point on the lee slopes of the wavy walls. For a Reynolds number based on peak velocity and minimum channel half-width of only 75 and a Strouhal number of .01, there are qualitative differences in the separation behavior from that expected from the quasi-steady approximation. For one thing, the flow, once separated during acceleration, does not reattach to the walls after deceleration to zero flow. Requiring that separation vorticity disappear when the flow reverses in oscillatory flow, Sobey derived a very restrictive relation between Strouhal number and Reynolds number for the quasi-steady approximation to be valid: the Strouhal number must be less than 0.2 times the inverse square of the Reynolds number. For a Reynolds number of 100 the Strouhal number would need to be less than .00002 to meet Sobey's criterion for quasi-steady behavior. Thus, a small Strouhal number is not sufficient to ensure quasi-steady behavior.
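In symbols, Sobey's criterion and the worked example just quoted are:

\[
\mathrm{St} < 0.2\,\mathrm{Re}^{-2};
\qquad \mathrm{Re} = 100 \;\Rightarrow\; \mathrm{St} < \frac{0.2}{100^{2}} = 2\times 10^{-5}.
\]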

Sobey (1983) furthermore gives an argument as to why unsteady separation phenomena are different from steady separation phenomena, even at very small Strouhal numbers. In unsteady flow, the fluid particle velocity at a point in space can be considered both a function of time and a function of a time-variable Reynolds number. To obtain the total time derivative of the fluid particle velocity, one needs to include a term that is the derivative of the fluid particle velocity with respect to Reynolds number times the time derivative of the Reynolds number. Because flow conditions are singular near a separation point, the derivative of flow velocity with respect to the Reynolds number in a region of a separation point can be quite large. Thus, the total time derivative does not scale exclusively with Strouhal number near a separation point, but also depends on the Reynolds number.

It is easily seen, using Sobey's criterion, that flow-induced oscillations in speech may not be quasi-steady when flow separation is involved. Based on a maximum velocity of 2000 cm/sec and a characteristic dimension of 1 cm, the Reynolds number during phonation is about 13000. The Strouhal number for phonation is .05, based on a 100 Hz oscillation frequency. Based on an oscillation frequency of 30 Hz, the tongue-tip trill Strouhal number is about 1/3 of that for phonation, and the Reynolds number is in the same range as that for phonation (McGowan, 1992). Thus, both these vibratory phenomena should be considered to be truly unsteady, and the quasi-steady assumption should be seriously questioned.
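Applying the criterion to the phonation estimates above makes the point numerically. (The kinematic viscosity of air, taken here as roughly 0.15 cm^2/s, is an assumption used to reproduce the quoted Reynolds number; it is not given in the text.)

\[
\mathrm{Re} \approx \frac{(2000\ \mathrm{cm/s})(1\ \mathrm{cm})}{0.15\ \mathrm{cm^{2}/s}} \approx 13000,
\qquad 0.2\,\mathrm{Re}^{-2} \approx 1.2\times 10^{-9} \ll \mathrm{St} = 0.05 .
\]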

One possible consequence of unsteadiness may be a mechanism for energy exchange from air to solid during vibration. For instance, for essentially one-degree-of-freedom solid motion during falsetto voice, there could be a hysteresis in the separation point position between opening and closing phases of the motion. If the separation point tends to be further forward during the opening phase than during the closing phase, the pressure on the upstream portion of the folds could be greater during the opening than during the closing phase. This mechanism could supplement others, including a glottal air induction that depends on vocal fold position (Wegel, 1930) and an inductive loading of the supraglottal vocal tract (Flanagan & Landgraf, 1968; Ishizaka & Flanagan, 1972), in accounting for energy exchange.

In this note, it has been shown that the quasi-steady approximation may not be a valid approximation in speech, particularly when flow separation is involved. Unsteady effects may have to be included to account for some aspects of phonation and tongue-tip trills. For instance, mechanisms proposed for the energy exchange from air to the vocal folds when each vocal fold has essentially one degree of freedom may be supplemented with a moving separation point.

REFERENCES

Bertram, C. D., & Pedley, T. J. (1983). Steady and unsteady separation in an approximately two-dimensional indented channel. Journal of Fluid Mechanics, 130, 315-345.

Cancelli, C., & Pedley, T. J. (1985). A separated-flow model for collapsible tube oscillations. Journal of Fluid Mechanics, 157, 375-404.

Catford, J. C. (1977). Fundamental problems in phonetics. Bloomington: Indiana University Press.

Flanagan, J. L., & Landgraf, L. L. (1968). Self-oscillating source for vocal-tract synthesizers. IEEE Transactions on Audio and Electroacoustics, AU-16, 57-64.

Ishizaka, K., & Flanagan, J. L. (1972). Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Technical Journal, 51, 1233-1268.

Lighthill, M. J. (1963). Introduction. Boundary layer theory. In L. Rosenhead (Ed.), Laminar boundary layers. Oxford: The Clarendon Press.

McGowan, R. S. (1992). Tongue-tip trills and vocal-tract wall compliance. Journal of the Acoustical Society of America, 91, 2903-2910.

Pedley, T. J. (1983). Wave phenomena in physiological flows. Journal of Applied Mathematics, 32, 267-287.


Pedley, T. J., & Stephanoff, K. D. (1985). Flow along a channel with a time-dependent indentation in one wall: The generation of vorticity waves. Journal of Fluid Mechanics, 160, 337-367.

Sobey, I. J. (1983). The occurrence of separation in oscillatory flow. Journal of Fluid Mechanics, 134, 247-257.

Sobey, I. J. (1985). Observation of waves during oscillatory channel flow. Journal of Fluid Mechanics, 151, 395-426.

Wegel, R. L. (1930). Theory of vibration of the larynx. Journal of the Acoustical Society of America, 1 (No. 3, Part 2), 1-21.

FOOTNOTE
*Journal of the Acoustical Society of America, 94, 3011-3013 (1993).


Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 93-106

Implementing a Genetic Algorithm to Recover Task-dynamic Parameters of an Articulatory Speech Synthesizer

Richard S. McGowan

A genetic algorithm was used to recover the dynamics of a model vocal tract from the speech pressure wave that it produced. The algorithm was generally successful in performing the optimization necessary to do this inverse problem: a problem with significance in the psychology, biology and technology of speech. A natural extension of this work is to study speech learning using a classifier system that employs a genetic algorithm.

This work was supported by NIH grant DC-01247 to Haskins Laboratories. Thanks go to Carol Fowler, Philip Rubin and Elliot Saltzman for making very helpful comments on this work.

INTRODUCTION

This paper reports work on the recovery of dynamic parameters used to drive the articulators in a model vocal tract of an articulatory synthesizer from the spectral properties of the model's speech output. A genetic algorithm was used to study this inverse problem. Descriptions of the articulatory synthesizer and its dynamics will be provided as needed in this introduction.

Articulatory synthesizer

The Haskins Laboratories articulatory synthesizer, ASY (Rubin, Baer, & Mermelstein, 1981) (Figure 1), uses a model vocal tract developed by Mermelstein (1973). The shape of the vocal tract is controlled by the positions of various articulators, examples of which are the jaw, tongue body, tongue tip and the lips. The coordinates of the jaw are specified by the angle of a fixed-length vector, JA, with origin at the condyle and end at the point marked jaw in Figure 1 (Mermelstein, 1973). The position of the tongue body center is specified by a vector relative to the jaw vector, with its origin at the condyle and end at the tongue-body center articulator. This vector has a length, CL, and an angle, CA, relative to the jaw. The tongue tip is specified by a vector with origin on the outline of the tongue body, whose angle, TA, is relative to the jaw and tongue body, and whose length is TL. The lips are specified in Cartesian coordinates, with the vertical dimension specifying the dimension of lip closure/opening, and the horizontal specifying protrusion. The upper lip's vertical position, ULV, is in relation to the fixed skull, and the lower lip's vertical position, LLV, is in relation to the jaw. The lips are yoked in the horizontal dimension, so this coordinate is specified for both lips as lip horizontal, LH.
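For reference, the eight articulator coordinates just described can be collected into a single structure. This is only an illustrative sketch; the class name and the unit conventions are assumptions, not part of ASY:

```python
from dataclasses import dataclass

@dataclass
class ASYArticulators:
    """The eight ASY articulator coordinates (angles in radians, lengths in cm)."""
    JA: float   # jaw angle: fixed-length vector from the condyle to the jaw point
    CL: float   # tongue-body center vector length (origin at the condyle)
    CA: float   # tongue-body center vector angle, relative to the jaw
    TL: float   # tongue-tip vector length (origin on the tongue-body outline)
    TA: float   # tongue-tip vector angle, relative to jaw and tongue body
    ULV: float  # upper-lip vertical position, relative to the fixed skull
    LLV: float  # lower-lip vertical position, relative to the jaw
    LH: float   # lip horizontal (protrusion), yoked for both lips
```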

The vocal tract formed by the placement of the articulators is assumed to be a variable-area tube supporting acoustic wave propagation. It can be modeled as a linear filter with a rational transfer function, with the poles corresponding to resonances, or formants. The first three formant frequencies provided the data of the speech acoustics used to map back into the dynamics of the vocal tract articulators in the experiments reported here.

Task dynamics and the one-to-many problem

What does it mean to map into a dynamics of the articulators? Why not use an optimization procedure to find the positions of the ASY articulators at any time given the first three formant frequencies? Previous methods for performing the mapping from acoustics to vocal-tract shape have relied on mapping from static frames of acoustic data to static shapes of the vocal tract (e.g., Atal, Chang, Mathews, & Tukey, 1978).


Figure 1. ASY model vocal tract.

One problem encountered in these static maps is that of mapping one acoustic signal onto many vocal tract shapes when there is limited acoustic data. In practical terms this indicates that an optimum solution will be nonrobust, so that any perturbation in the data may lead to wildly different results. The constraint that vocal tract shape be continuous in its movement helps to alleviate the one-to-many problem when an impoverished set of acoustic data, such as the first three formant frequencies, is used to map onto a vocal tract shape (e.g., Shirai & Kobayashi, 1986). Alternatively, it has been proposed to map entire trajectories of acoustic data, corresponding to consonant-vowel or vowel-consonant intervals, onto vocal-tract articulator trajectories described as parameterized functions of time (McGowan, 1991). This procedure should be more efficient than mapping several frames of acoustic data onto several vocal tract shapes, optimizing the relation each time, while enforcing continuity.

In experiments on recovering the articulatory movement in the vocal tract, it became apparent that the optimization should be done on a mapping from the acoustic data to task dynamics (Saltzman & Kelso, 1987; Saltzman & Munhall, 1989) instead of the articulator trajectories. The reasons for using task dynamics will be given after its description. Task dynamics applied to the vocal tract models the coupled motion of the vocal tract articulators in performing speech tasks. For instance, instead of recovering the individual contributions of the jaw and lips in a bilabial closure, the dynamics of the lip aperture reduction would be recovered without particular regard to the individual contributions from the jaw and lip articulators. Figure 2 illustrates the tract variables and lists the articulators associated with each. The focus is on the tract variables tongue body constriction location and degree, TBCL and TBCD, tongue tip constriction location and degree, TTCL and TTCD, lip aperture, LA, and lip protrusion, LP. TBCL is the location along the vocal tract of the minimum cross-sectional area formed with the tongue body, and TBCD is a measure of the size of that area. Similar interpretations apply to TTCL and TTCD. LA is a measure of the lip opening area, and LP is a measure of the extent to which the vocal tract is lengthened by the lips. The task dynamics of the tract variables is described by a set of uncoupled, second-order, mass-spring systems. Not only are the parameters of natural frequencies and dampings specified, but target positions, and activation intervals (the time during which the task dynamics is active) for target positions other than the default, are also specified. Given this set of parameters, the current implementation of task dynamics of the vocal tract describes the formation and releasing of constrictions by the coordinated activity of the component articulators.
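Each tract variable z therefore obeys, during its activation, a damped second-order mass-spring equation driven toward its target z0; the notation here is generic rather than the paper's:

\[
m\,\ddot{z} + b\,\dot{z} + k\,(z - z_{0}) = 0,
\qquad \omega_{0} = \sqrt{k/m}.
\]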


Figure 2. Tract variables. (The original figure lists the ASY articulator coordinates, j = 1, ..., 8 (JA, ULV, LLV, CL, CA, TL, TA, LH), and the tract variables, i = 1, ..., 6 (LP, LA, TBCL, TBCD, TTCL, TTCD), showing which articulators each tract variable recruits in a head-centered coordinate system; GLO and LTH also appear.)

A reason for optimizing for the transformation from acoustics to tract variables rather than for the transformation from acoustics to articulators is that the tract variables specify the acoustically salient features of vocal tract shape, constriction location and degree, more directly than do the articulators. Boë, Perrier, and Bailly (1992) have noted that acoustic output, in terms of formant frequencies, seems to be most sensitive to place and degree of constriction, and emphasized that this fact should be used in acoustic-to-articulatory mapping. While it may be true that there is strictly only one articulatory specification for a given vocal tract shape, there may be many and disparate sets of articulatory coordinates that are nearby, in the sense of producing similar acoustic output. For instance, an alveolar constriction (the tongue tip close to the ridge behind the upper teeth) can be specified with varying amounts of jaw, tongue body and tongue tip displacement, because these articulators may compensate for each other to attain the prescribed constriction degree and location. While there is not complete compensation throughout the vocal tract, the most acoustically salient parts of the vocal tract shape, constriction degree and location, can be preserved using compensation. Thus, by mapping to task dynamics, ambiguity caused by articulatory compensation may be removed.

As indicated in Figure 2, tract variables recruit various articulators as their dynamics are instantiated in the vocal tract. The articulators are recruited with various weightings to achieve the target as specified by task dynamics at the tract-variable level. Therefore, in the implementation of task dynamics a mapping between the tract variables and articulators, in this case ASY articulators, is specified. While each tract variable appears to act as an independent dynamical system, some may control the same articulator coordinate simultaneously, as in the way tongue body constriction degree, TBCD, and lip aperture, LA, both use the jaw angle, JA. Thus, the set of task-dynamic equations transformed into the articulatory space becomes a coupled set of equations. Once the task-dynamic parameters are specified, the articulator position trajectories are determined by this set of coupled differential equations. A fourth-order Runge-Kutta routine is used to obtain the articulators' movements, and the resulting acoustic output is obtained from ASY.
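A minimal sketch of the integration step involved is given below. The task-dynamic equations themselves, with their tract-variable-to-articulator weightings, are more elaborate, so the vector field f here is a stand-in, and all names and values are illustrative:

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt * k1 / 2)
    k3 = f(t + dt / 2, y + dt * k2 / 2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

# Stand-in dynamics: one critically damped tract variable driven toward a
# target, written in first-order form y = [z, dz/dt].
omega, target = 2 * np.pi / 0.1, 1.0   # natural period of 100 ms

def f(t, y):
    z, v = y
    return np.array([v, -2 * omega * v - omega ** 2 * (z - target)])

y = np.array([0.0, 0.0])
for n in range(400):                    # 400 steps of 1 ms = a 400 ms utterance
    y = rk4_step(f, n * 1e-3, y, 1e-3)
```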

In the experiments here, and in previous experiments (McGowan, in press), the task dynamics of test utterances resembling /abæ/ and /adæ/ were specified in modified gestural scores generated from the linguistic gestural model of Browman and Goldstein (1990). A gestural score is a file containing the task-dynamic specifications for tract variable activation times, stiffnesses, and other information necessary to run the task-dynamic simulation. For each utterance the first three formant frequencies were extracted at 10 ms intervals, thus creating the formant frequency trajectories for the acoustic data. The cost function, or inverse fitness function, was the sum over time of the squares of the differences between each of the three lowest formant frequencies produced by a proposed dynamics and those of the data.
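A sketch of that cost function, assuming the formant trajectories are stored as arrays of shape (frames, 3) sampled every 10 ms; the array and function names are illustrative:

```python
import numpy as np

def cost(proposed, data):
    """Sum over frames of the squared F1-F3 differences (Hz^2)."""
    return float(np.sum((proposed - data) ** 2))

def fitness(proposed, data):
    """Inverse of the cost; the small constant guards a perfect match."""
    return 1.0 / (cost(proposed, data) + 1e-12)
```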

The genetic algorithm

Since there was no analytic expression relating the task-dynamic parameters to the cost function (a numerical integration of nonlinear differential equations was necessary to evaluate the cost function), it was difficult to find the necessary derivatives of the cost function with respect to the task-dynamic parameters. Attempting to calculate derivatives numerically would have been very time consuming, because the function evaluations were so complicated. A genetic algorithm was chosen as a nonderivative-based algorithm for optimization of the task-dynamic parameters in gestural scores. A genetic algorithm was not the only possibility for a nonderivative-based algorithm, but it had several advantages. These advantages will be explained after the algorithm is described. Genetic optimization procedures are stochastic procedures, and there is some precedent for stochastic procedures in research on the inverse problem: Schroeter, Meyer, and Parthasarathy (1990) have used random access of codebooks in their approach to the inverse problem in speech.

The specific genetic algorithm employed for this study was a modified version of an algorithm described by Goldberg (1989a). In a genetic algorithm, the individuals of a population are assigned randomly chosen parameter sets that are coded into binary strings called "chromosomes," and each is assigned a fitness. The fitness used in this study was the inverse of the sum, taken over 10 ms intervals, of the squares of the differences between the formant frequencies of a proposed solution and those of the speech data, for the first three formant frequencies. Based on fitness, individuals are chosen to breed with others to form a new population of chromosomes. When two individuals mate, their chromosomes split at a randomly chosen location, with each of two progeny obtaining one part of its chromosome from each parent. The children's fitnesses are evaluated based on their parameter sets as coded in their chromosomes. A small probability of mutation is allowed.

It should be noted that the use of genetic algorithms requires that the parameters for optimization be coded into finite-length strings (chromosomes), which limits the range of any parameter and essentially discretizes the parameter space. The degree of discretization of each parameter can be varied to tune the optimization. This was controlled by varying the range of allowed parameters and the number of bits given to code a specific parameter. Because the ranges of starting and ending activation times were both limited to discrete steps and a finite range, the potentially infinite set of parameters was made finite.
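A sketch of the fixed-point coding this implies; the 6-bit lip-aperture range is taken from Table 1 below, and everything else (names, the guard on rounding) is illustrative:

```python
def encode(value, lo, hi, bits=6):
    """Quantize a real parameter onto 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    code = round((value - lo) / (hi - lo) * levels)
    return min(levels, max(0, code))

def decode(code, lo, hi, bits=6):
    """Map an integer code back into the bounded range [lo, hi]."""
    return lo + code * (hi - lo) / (2 ** bits - 1)

# Lip aperture target, range -0.30 to 1.80 cm (Table 1): the resolution is
# (1.80 - (-0.30)) / 63, i.e. about 0.033 cm per step.
code = encode(0.5, -0.30, 1.80)    # -> an integer in 0..63
value = decode(code, -0.30, 1.80)  # -> 0.5, to within one step

# The on-off genes discussed below are simply one extra bit per tract
# variable; when the bit is 0, the decoder ignores that substring.
```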

The coding that was used here was a simple binary code for the real parameters, such as starting activation time. The coded parameters were concatenated to form a complete chromosome. A better way of forming the chromosomes might be to split the binary representations of the parameters so that the most significant bits of all the parameters are grouped together, then the next most significant bits are grouped together, and so on. This may be better because it is more likely that shorter lengths of chromosome stay together through the mating process, and the fitness of any individual depends on how the parameters interact.

There were some particular advantages in employing a genetic algorithm for the optimization procedure in this work. The basic genetic algorithm was relatively easy to implement, and this was an important consideration because of the exploratory nature of this research: not much time would have been spent if things had not worked out. Also, it was easy to watch the optimization procedure evolve, and, thus, to tune the parameters of the genetic algorithm. A very important consideration was that genetic algorithms provide a natural way to bound the ranges of the parameters that are to be optimized, because the parameters are coded in finite-length strings. This meant that the task-dynamic model would not be driven beyond certain bounds, and numerical overflow during optimization could be avoided. Another important factor was a genetic algorithm's ability to incorporate an on-off gene. For example, the optimal solution may or may not involve the activation of the tongue tip (TTCL and TTCD). It was easy to include a bit in each individual's chromosome that determines whether the tongue tip was to be activated or not. If the tongue tip was to be activated, then the part of the chromosome containing tongue-tip task parameters could be used, and if not, then that part of the chromosome would be ignored. Finally, the genetic algorithm can be used for purposes other than straightforward optimization. Speech learning is an obvious extension of the work presented here, so the fact that genetic algorithms can be used in classifier systems is an important consideration.

The power of the genetic algorithm comes from what John Holland, the originator of genetic algorithms, called implicit parallelism. Although the population of chromosomes is finite, say of number N, the number of patterns of chromosomes, called schemata, being processed is on the order of N^3 (Goldberg, 1989a, pp. 40-41). This implicit parallelism, as well as the probability of mutation, makes the algorithm much less likely to become stuck in local maxima. Also, the implicit parallel processing property makes this algorithm more efficient than exhaustive search of the parameter space. Parallel processing in the usual sense of using many processors is also possible. Given that children chromosomes have been produced as the result of mating a generation of individuals, the children's fitnesses can be evaluated on physically distinct processors. This capability was not used in this work.

Significance

Recovery of articulation by optimization is reminiscent of the analysis-by-synthesis model of speech perception (Stevens, 1960; Stevens & Halle, 1967; Stevens & House, 1970). In analysis by synthesis, as originally conceived, the content of a speech signal (the data) is recovered when hypothesized phonetic segments and features are used to provide instructions to an articulatory mechanism which, in turn, produces speech to be compared with the data. When the comparison is good enough, the hypothesized phonetic segments and features are the perceived phonetic segments and features. Rival theories of speech perception also take recovery of the vocal tract articulatory movement, or the intended gesture over time, as central to perceiving the spoken message (e.g., Liberman & Mattingly, 1985). Whatever the psychological importance of articulatory movement or the motor system for speech perception in adults, children develop their speech by listening and by trying to be understood, and thus this learning process may be considered an analysis by synthesis over an extended time scale. Also, because task dynamics and articulatory movement are slow relative to the variations in the acoustic pressure wave, there are potential technological advantages in recovering vocal-tract movement from the speech signal to achieve bit-rate reduction for transmission (Schroeter & Sondhi, 1992; Sondhi, 1990). In the work presented here, selected task-dynamic parameters were recovered without regard to phonetic categories: a model of speech perception is not offered. However, there are future directions that this work might take for speech recognition and for modeling speech development.

Extensions to previous work

In previous work using the methods described above, the focus has been on recovering articulatory movement rather than task dynamics (McGowan, in press). Based on limited testing, the method seems to be successful at doing this. While this method may help solve the one-to-many problem in the transformation from acoustics to articulation, it was not known whether the acoustics maps onto the task dynamics of the tract variables uniquely. In the previous work, the task-dynamic parameters were either assumed to be on or they were assumed to be off before the optimization was run. The only task parameters allowed to be activated were the ones known to have synthesized the speech data. It may have been possible to obtain nearly identical articulatory movement from very different task dynamics. That is, there may be a one-to-many problem in mapping from articulation to task dynamics, or from acoustics to task dynamics. Thus, for the present study, on-off genes were added so that selected tract variables could either be activated or inactivated. This allowed the optimization itself to select which tract variables to use.

PROCEDURE

Constraining relations between the parameters were used to keep the number of unknown parameters to a minimum. It was assumed that all movements were critically damped, and that the activation intervals were equal to the period based on the natural frequency parameter. Also, the movement periods, and hence the activation intervals, were assumed to be at least 100 ms long, to avoid movements that were too stiff for the task-dynamic simulation. With these constraints, the unknowns for a given tract-variable activation were the beginning and ending activation times and the target position. There could have been more than one activation of any tract variable, and these could have overlapped in time. Also, the actions of certain tract variables were grouped according to common constriction goals, so that they would have identical activation intervals. The tongue-body tract variables TBCL and TBCD were in one such group, the tongue-tip variables TTCL and TTCD were in another group, and the lip variables LP and LA were in a third group. These groups are known as constriction or task spaces.
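In the same generic notation as before, the constraints just listed amount to critical damping plus an activation interval tied to the natural period:

\[
b = 2\sqrt{mk},
\qquad T_{\mathrm{act}} = \frac{2\pi}{\omega_{0}} \geq 100\ \mathrm{ms}.
\]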

Gestural scores were designed to produce articulatory movement for utterances that resembled /abæ/ and /adæ/, within the constraints mentioned above. The trajectories of the tract variables resulting from the scores for /abæ/ and /adæ/ are illustrated in Figures 3 and 4. The first utterance involves moving from the neutral position into a bilabial closure. Subsequently, as the lips open, the tongue lowers for the final vowel. The second utterance involves the tongue making an alveolar closure, and then the release of that closure with a lowering of the tongue body for the final vowel. The heights of the hatched boxes for activation denote the target positions. In recovering the task-dynamic parameters, some parts of the gestural score were taken as known and fixed. In the case of /abæ/, for the initial movement to lip closure, involving the tract variables LA and LP, the activation interval and targets were taken as known. However, the subsequent activation interval for LA and LP was taken as unknown, as was the target position for LA. Only the target position for LP was assumed known in its second activation. Tongue body movement was taken as unknown, so that the activation interval and the target positions for TBCL and TBCD were varied for the optimum fit, although it was assumed that TBCL and TBCD were activated for at least 100 ms during the utterance. The fact that TTCL and TTCD were not activated for /abæ/ was assumed to be unknown. The test was to see whether the genetic algorithm would leave these tract variables deactivated in its optimal solution. This can be compared with /adæ/, where there was activation both of the tongue-tip tract variables, TTCL and TTCD, and the tongue-body tract variables, TBCL and TBCD. The lip movement for the final vowel, which occurs after the tongue-tip closure, was taken as completely known. The parameters for the tongue tip were taken as unknown, and the algorithm was allowed to turn them on or off. The parameters of the tongue-body tract variables, TBCL and TBCD, in the transition to the final vowel were taken as unknown, but they were presumed activated for at least 100 ms some time during the utterance, as for /abæ/.

Given the synthetic acoustic data, a genetic algorithm (Goldberg, 1989a) was used to recover task-dynamic parameters when random noise with a flat distribution between -10 and +10 Hz was added to each formant frequency datum at each time frame. The noise was used to test the robustness of the procedure. For each test an initial population of 60 individuals was generated using a random number generator. Their fitnesses were evaluated by decoding the chromosome of each individual into the task-dynamic parameters of a gestural score, which enabled the task-dynamic simulation to drive the articulatory synthesizer, ASY, which, in turn, produced formant trajectories. The coding was done so that task-dynamic parameters were placed contiguously in the binary strings, with all parameters in each task space grouped together. These parameters included the beginning of activation, the activation interval, and the targets of the tract variables within a given task space. Table 1 indicates the range of the target values for each tract variable, the number of bits used in coding that parameter into the chromosomes, and the resulting resolution of that target.


Figure 3. Original gestural score for /abæ/. (Hatched boxes mark activation intervals along a 0-400 ms time axis; box heights denote target positions.)

Figure 4. Original gestural score for /adæ/.


Table 1. Target value specifications.

Tract Variable    Maximum/Minimum Target Value    Number of Bits in Chromosome    Resolution
LA                1.80 / -0.30 cm                 6                               0.033 cm
TBCL              3.16 /  0.51 rad                6                               0.042 rad
TBCD (/abæ/)      1.80 / -0.30 cm                 6                               0.033 cm
TBCD (/adæ/)      1.63 / -0.13 cm                 6                               0.028 cm
TTCL              1.16 /  0.40 rad                6                               0.012 rad
TTCD              2.15 / -0.65 cm                 6                               0.044 cm

Beginning and ending times for the activation intervals were resolved within the data frame rate of 10 ms. Recall that the fitness of an individual was the inverse of the sum of the squares of the differences between each of the three lowest formant frequencies produced by the individual and those of the data, in 10 ms steps. Thirty pairs of individuals were chosen with probability proportional to their fitnesses. There was a .6 chance that each of these pairs would mate, where mating consisted of randomly selecting a cut point for the pair and then exchanging substrings on each side of the cut point to form two children strings. There was also a .001 chance of mutation at a single position in any of the children. If mating did occur, the fitness of the children strings would then be evaluated. In a variation of the standard genetic algorithm, the best individual of a given generation was always retained into the succeeding generation.
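A compact sketch of one generation as just described (60 individuals, 30 fitness-proportional pair selections, crossover probability .6 at a single random cut, mutation probability .001, retention of the best individual). The helper fitness_of stands for decoding a chromosome and scoring its formant trajectories as above; the details are illustrative rather than the code actually used:

```python
import random

def next_generation(population, fitness_of, p_cross=0.6, p_mut=0.001):
    """population: list of (chromosome, fitness); a chromosome is a list of bits."""
    weights = [fit for _, fit in population]
    new_pop = []
    for _ in range(len(population) // 2):      # 30 pairs for a population of 60
        (a, fa), (b, fb) = random.choices(population, weights=weights, k=2)
        if random.random() < p_cross:          # the pair mates with probability .6
            cut = random.randrange(1, len(a))  # single random cut point
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            fa = fb = None                     # children must be re-scored
        for child, fit in ((a, fa), (b, fb)):
            child = list(child)
            if random.random() < p_mut:        # rare mutation at a single position
                i = random.randrange(len(child))
                child[i] = 1 - child[i]
                fit = None
            new_pop.append((child, fit if fit is not None else fitness_of(child)))
    # Elitism: the best individual is always retained into the next generation.
    new_pop[0] = max(population, key=lambda ind: ind[1])
    return new_pop
```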

The choice of pairs and possible mating was allowed to go on for 60 generations. At the end of 60 generations, the individual with the greatest fitness had its string decoded into the task-dynamic parameters of a gestural score, which were saved for later comparison. This procedure was repeated 8 times, each time with a new initial random population of 60 individuals. The saved gestural scores were used to drive ASY, and the gestural score and articulator trajectories of the best individual of each run were compared to those that generated the original acoustic data.

RESULTS

The fittest individual of the 8 runs for each utterance is termed the optimal solution, and the individuals having the highest fitness in each of the other runs are termed suboptimal solutions. Figures 5 and 6 show the optimal recovered gestural scores for the utterances /abæ/ and /adæ/, respectively. A comparison between Figures 3 and 5 shows that the gestural score for /abæ/ was well recovered. There were some discrepancies: the activation interval for the tongue body (TBCL and TBCD) was delayed, while the activation interval for the lips (LA and LP) was slightly delayed, as well as shortened. The recovered paths of the tract variables matched closely those of the original. It is significant here that the tongue-tip variables (TTCL and TTCD) were not activated in the optimal solution even though they showed movement because of their dependence on the tongue body. Thus, the procedure was able to correctly decide whether the tongue tip was under active control or not in this instance. The optimal recovery of /adæ/ seems to be very good (Figures 4 and 6), except that the activation interval for the tongue body (TBCL and TBCD) started late and was too short. In this utterance, not only did the genetic algorithm have the opportunity to shut off the tongue tip (TTCL and TTCD), but there was a sequence of gestures involving the tongue tip and tongue body for which dynamic parameters were unknown. In this instance, the procedure found that the tongue tip was activated.

Quantitative measures of the goodness of fit are indicated by the mean square error in the formant frequency fits. In the optimal fit for /abæ/ the mean square error was 1099, which corresponds to an average error of 19 Hz per formant per frame of data. For /adæ/ the mean square error for the optimal fit was 4444, which corresponds to an average error of 38 Hz per formant per frame of data. Using the recovered gestural scores it was possible to generate the articulatory trajectories themselves, such as that for jaw angle, JA (see Figure 1). Tables 2 and 3 show that both the RMS error and the maximum error in the optimal recovery of important articulatory trajectories are small in absolute terms.
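The per-formant figures follow from the reported mean square errors on the reading that each 10 ms frame contributes the squared error summed over the three formants; all four error values reported in this section are consistent with this:

\[
\sqrt{1099/3} \approx 19\ \mathrm{Hz},
\qquad \sqrt{4444/3} \approx 38\ \mathrm{Hz}.
\]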

Perhaps as important as evaluating the performance of the best, or optimal, individual is evaluating the performances of other, suboptimal individuals. For /abæ/, the second fittest suboptimal individual was a gestural score that activated the tongue tip (Figure 7). As can be seen, the activation interval of the tongue tip was just over the minimum 100 ms, and the targets were close to neutral position, which was the direction of movement for TTCL and TTCD during that interval anyway. Therefore, this activation had little effect on the tongue tip. The mean square error of this fit was 2004, which corresponds to an average error of 26 Hz per formant per time frame of data. The resulting error in the articulation is shown in Table 2, where the absolute errors were almost as small as those for the optimal solution.


Figure 5. Optimal gestural score for /abæ/.

Figure 6. Optimal gestural score for /adæ/.


Table 2. Comparison of original and recovered trajectories for utterance /abæ/.

Articulator    RMS difference,      RMS difference,         Max. absolute difference,    Max. absolute difference,
coordinate     optimal solution     suboptimal (example)    optimal solution             suboptimal (example)
CL             0.0083 cm            0.0067 cm               0.079 cm                     0.18 cm
CA             0.00023 rad          0.0019 rad              0.0050 rad                   0.019 rad
JA             0.00043 rad          0.0032 rad              0.0070 rad                   0.029 rad

Table 3. Comparison of original and recovered trajectories for utterance /adæ/.

Articulator    RMS difference,      RMS difference,         Max. absolute difference,    Max. absolute difference,
coordinate     optimal solution     suboptimal (example)    optimal solution             suboptimal (example)
CL             0.023 cm             0.10 cm                 0.26 cm                      1.26 cm
CA             0.0037 rad           0.0040 rad              0.035 rad                    0.051 rad
TL             0.012 cm             0.098 cm                0.23 cm                      1.12 cm
TA             0.0034 rad           0.0035 rad              0.067 rad                    0.42 rad
JA             0.0033 rad           0.0049 rad              0.043 rad                    0.059 rad

For the utterance /adæ/, a relatively common feature of the suboptimal solutions was resequencing of the tongue-tip and tongue-body activation intervals. Because these tract variables are coupled through the anatomy of the model articulators, they could compensate for one another, with the tongue body activated to make the initial alveolar constriction for the /d/ and the tongue tip activated to make the vowel /æ/. There were 4 suboptimal solutions that resequenced the tongue-tip and tongue-body activations. The fittest of these individuals (fourth fittest overall) is shown in Figure 8. This fit had a mean square error in the formant fit equal to 21584, corresponding to an average error of 85 Hz per formant per time frame of data. It is apparent from Figure 8 that the tongue body was activated to make the initial tongue-tip closure, and that the tongue tip was activated to make the body move down and back for the vowel /æ/. Table 3 shows that the articulation with this suboptimal solution is not nearly as close to the original utterance as the optimal solution: there are large maximum errors in the vector lengths CL and TL.

CONCLUSION

These results indicate that the genetic algorithm is useful for recovering task dynamics. However, the suboptimal solutions show that the algorithm can be driven into locally optimal solutions with incorrect tract-variable activations. In the case of /abæ/, a suboptimal solution produced articulatory paths close to the original utterance despite activating the tongue tip. This indicates that there is a one-to-many situation in the mapping from articulation or acoustics into the task dynamics of the tongue tip. In the case of /adæ/, the example of a suboptimal solution did not produce an utterance all that close to the original, but the 60 individuals after 60 generations in this particular run were nearly identical, suggesting that the algorithm had found a locally optimal solution far from the optimal solution. In general, it is important to run the genetic algorithm several times to ensure that a proposed solution is not trapped in a locally optimal region far from the optimal region.

There are some improvements in the algorithm that can be tried so that local optima might be avoided. One is to code the parameters in the chromosomes so that the most arithmetically significant bits of all the parameters are close together, the next most significant bits are all close together, and so on. In this way, breaking apart chromosomes with roughly good combinations of target positions and activation intervals might be avoided. Another improvement would be to run several populations in parallel, and then take the best from each to run again (Goldberg, 1989b).

From the experience gained with optimization, a classifier system based on a genetic algorithm could be used to successfully learn dynamic parameters. Extending the research brings a host of related questions. What happens if there is a mismatch between the synthesizer used for the classifier and the vocal tract that produced the data, as would be the case if the speech data were natural speech? How would the fitness function be affected? For a machine to repeat or learn human speech with an articulatory synthesizer, it does not matter whether the correct dynamics is recovered, as long as the articulatory synthesizer has enough degrees of freedom to produce a very good optimal match with the human acoustic data.


Figure 7. Suboptimal gestural score for /abæ/.

Figure 8. Suboptimal gestural score for /adæ/.


A fixed fitness function evaluated in the acoustic parameter space, analogous to the simple one used here, will suffice. If the goal is to model human speech development and the resulting speech perception and production, then it is important to ask whether children learn speech by matching acoustic output. Do children recover the movement of adults' articulators or their dynamics? It is plausible that children neither recover adult articulation nor match acoustic output. What does this mean for the simulation of human learning? What is the fitness function to use? Children, somehow, recover adults' meanings and understand adults according to their own needs and desires. They adapt to adults' speaking, with the added advantage of having a speech apparatus that they can use to be understood (e.g., to get adults to do what they want). A child could construct a dynamics based on his own vocal apparatus and associate others' speech sounds with his own dynamics, even when it is impossible to match another's speech sounds exactly, because he can be understood using these movements. Thus, it is not necessary that the child recover adults' vocal movements by an analysis by synthesis with a comparator. He knows which of his own movements would produce like meaning. The fitness function would not be based on an exact acoustic match with adults' utterances, but would be one that involves both meaning and associations with the child's vocal apparatus. The fitness function, in the case of development, would be a "moving target" depending on the state of the adult listeners and speakers, as well as the child's state.

REFERENCES

Atal, B. S., Chang, J. J., Mathews, M. V., & Tukey, J. W. (1978). Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique. Journal of the Acoustical Society of America, 63, 1535-1555.

Boë, L.-J., Perrier, P., & Bailly, G. (1992). The geometric vocal tract variables controlled for vowel production: Proposals for constraining acoustic-to-articulatory inversion. Journal of Phonetics, 20, 27-38.

Browman, C. P., & Goldstein, L. (1990). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299-320.

Goldberg, D. E. (1989a). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D. E. (1989b). Sizing populations for serial and parallel genetic algorithms. In J. D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 70-79). San Mateo, CA: Morgan Kaufmann.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.

McGowan, R. S. (1991). Recovering tube kinematics using time-varying acoustic information. In Proceedings of the 12th International Congress of Phonetic Sciences, August 19-24, Aix-en-Provence (Vol. 4, pp. 486-489). Aix-en-Provence: Université de Provence, Service des Publications.

McGowan, R. S. (in press). Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests. Speech Communication.

Mermelstein, P. (1973). Articulatory model for the study of speech production. Journal of the Acoustical Society of America, 53, 1070-1082.

Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70, 321-328.

Saltzman, E. L., & Kelso, J. A. S. (1987). Skilled actions: A task-dynamic approach. Psychological Review, 94, 84-106.

Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333-382.

Schroeter, J., Meyer, P., & Parthasarathy, S. (1990). Evaluation of improved articulatory codebooks and codebook access distance measures. In Proceedings of ICASSP 90, Albuquerque, New Mexico.

Schroeter, J., & Sondhi, M. M. (1992). Speech coding based on physiological models of speech production. In S. Furui & M. M. Sondhi (Eds.), Advances in speech signal processing (pp. 231-268). New York: Marcel Dekker.

Shirai, K., & Kobayashi, T. (1986). Estimating articulatory motion from speech wave. Speech Communication, 5, 159-170.

Sondhi, M. M. (1990). Models of speech production for speech analysis and synthesis. Journal of the Acoustical Society of America, 87, S14 (A).

Stevens, K. N. (1960). Toward a model for speech recognition. Journal of the Acoustical Society of America, 32, 47-55.

Stevens, K. N., & Halle, M. (1967). Remarks on analysis by synthesis and distinctive features. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form. Cambridge, MA: MIT Press.

Stevens, K. N., & House, A. S. (1970). Speech perception. In J. Tobias (Ed.), Foundations of modern auditory theory, Vol. II (pp. 3-63). New York: Academic Press.


Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 107-130

An MRI-based Study of Pharyngeal Volume Contrasts in Akan

Mark K. Tiede

Characteristic differences in pharyngeal volume between Akan +/-Advanced Tongue Root (ATR) vowel pairs have been investigated using Magnetic Resonance Imaging (MRI) techniques, and compared to the similar Tense/Lax vowel distinction in English. Two subjects were scanned during steady-state phonation of three pairs of contrasting vowels. Analysis of the resulting images shows that it is the overall difference in pharyngeal volume that is relevant to the Akan vowel contrast, not just the tongue root advancement and laryngeal lowering previously reported from xray studies. The data also show that the ATR contrast is articulatorily distinct from the English Tense/Lax contrast.

I am grateful to the many people who assisted me in this project, particularly Cathe Browman, Alice Faber, Louis Goldstein, Carol Gracco, Robin Greene, Kathy Harris, Pat Nye, Alfred Opoku, Elliot Saltzman, Doug Whalen, and all those who agreed to be magnetized for Science. Goofs and garbles remain of course my responsibility. This work was supported by NIH grant DC-00121.

INTRODUCTION

Several West African languages have a phonological process of Vowel Harmony that splits the distribution of their vowels into two congruent and mutually exclusive sets. The Akan language, a member of the Niger-Congo Kwa group (Stewart 1971) spoken in Ghana, is typical of this pattern. The vowels of Akan may be grouped as follows.1

(1)	Set 1: i  e  æ  o  u
	Set 2: ɪ  ɛ  a  ɔ  ʊ

In general vowels in an Akan word harmonize; that is, all vowels are drawn exclusively from one set or the other. (The low mid vowel /a/ may occur in a word with vowels from either set.2) Affixes have two forms for compatibility with stems having type 1 or type 2 vowels. For example (from Dolphyne 1988:15):


(2) mi + di  -->  midi    [1st Sg] [eat]         "I eat"
    o  + di  -->  odi     [3rd Sg] [eat]         "he eats"
    mɪ + dɪ  -->  mɪdɪ    [1st Sg] [be called]   "I am called"
    ɔ  + dɪ  -->  ɔdɪ     [3rd Sg] [be called]   "he is called"

Drawing on a cineradiographic study of the related language Igbo by Ladefoged (1964), Stewart (1967) proposed an analysis of Akan in which corresponding vowels from the two harmony groups are distinguished articulatorily by differences in tongue root position. Cineradiographic studies of Akan undertaken by Lindau (1975, 1979) confirmed the primacy of tongue root position in the harmony mechanism, but also showed correlated variation in larynx height: advanced tongue root was combined with lowered larynx, and retracted root with raised larynx. In reporting this work Lindau suggested that the relevant contrast was not just the relative positions of these articulators but rather the overall difference in pharyngeal volume produced by their cooperative positioning, and proposed the feature "Expanded" to describe it. A contrast based on differences in pharyngeal volume suggests that vowels from the two groups may also differ in the left-to-right or lateral dimension of the pharynx, assuming that this is subject to voluntary control, but studies based on cineradiography are inherently unable to explore this possibility, since the technique collapses all lateral information into a flat (sagittal) image. The purpose of the study discussed here was to investigate the predicted difference in pharyngeal volume using Magnetic Resonance Imaging, which is not subject to the same limitation.

The magnetic resonance technique has only recently become a viable imaging alternative. Developed primarily for medical diagnostic purposes, MRI exploits the behavior of hydrogen nuclei in a magnetic field to construct an image correlated with the concentration of hydrogen in the scanned tissue.3 Because different tissue types have differing hydrogen densities and bondings, MR images provide soft tissue definition over a range inaccessible to x-ray techniques. Two additional advantages make MRI an especially attractive imaging modality: there are currently no known health risks for the subject associated with the technique, and because the imaging plane may be reoriented without moving the subject, three-dimensional data collection is possible.

Despite these advantages the MR technique has substantial drawbacks in its potential for phonetic research, chief of which is the tradeoff between imaging time and resulting image quality. Although image acquisition rates continue to drop as the technology evolves, they are currently still too slow by an order of magnitude for capturing dynamic speech. In addition, three-dimensional scanning requires multiple passes through the same volume, further increasing acquisition time. This limits the current usefulness of MRI in phonetics to studies involving static vocal tract shapes and sustainable patterns of phonation.

The current study was able to proceed under these constraints. Although vowels sustained for many times their normal speaking duration represent an admittedly artificial source of data, all of the vowels examined (except English lax vowels) can occur in open syllables, and are therefore artificial only in duration, and not in syllable structure. Furthermore, sung vowels of constant pitch and quality and extended duration occur in the musical traditions of Ghanaian and American cultures.4 There is also precedent for use of the MRI technique applied to measurement of static vowel shapes in the work of Baer, Gore, Gracco, & Nye (1991), who successfully used it to obtain vocal tract area functions of four point vowels, and in similar work by Lakshminarayanan, Lee, and McCutcheon (1991).

The English Tense/Lax distinction eludes precise articulatory description, but is similar in many ways to the Akan contrast: tense vowels are generally articulated with an advanced tongue root, and sometimes a lowered larynx. English vowels comparable to those used in Akan may be grouped as in (1) above:

(3) Tense:  i   e   u
    Lax:    ɪ   ɛ   ʊ

Previous attempts to identify the English distinction with the same mechanism used in the Akan ATR contrast include a proposal by Halle and Stevens (1969), and a cineradiographic study by Perkell (1971). However a more extensive cineradiographic study conducted by Ladefoged et al. (1972) showed that tongue root advancement is just one of several complementary strategies used to implement the Tense/Lax contrast, and is not used consistently by all speakers. Similarly electromyographic data reported by Raphael and Bell-Berti (1975) showed differences between subjects in patterns of muscle tension used to distinguish between tense and lax vowels. Factor analysis of x-ray-derived tongue shapes by Harshman, Ladefoged, and Goldstein (1977) showed that tongue position for English tense/lax pairs can be predicted very completely by reference to just two parameters along which each of the tongue root and tongue dorsum positions covary, whereas a similar analysis of Akan by Jackson (1988) found three parameters necessary for tongue shape specification. While it is probably the case that different mechanisms are involved in each language, they are similar enough to make direct comparison feasible and interesting, and so they have been treated in parallel in the current study.

Method

In this experiment the contrast in cross-sectional area was examined at adjacent levels through the pharynx for corresponding expanded and constricted vowels (i.e., +/-ATR; Tense/Lax).5 The experiment involved two subjects, both male, in their early thirties. Subject AO is a native speaker of the Asante dialect of Akan; subject MT is a native speaker of Midwestern American English. The vowels selected for comparison were /i : ɪ/, /e : ɛ/, and /u : ʊ/, chosen because of their reasonable similarity across the two languages.


Prior to the experiment a target stimulus tape was prepared for each subject by twice recording each of the target vowels in a characteristic word, extracting the vowel portion using digital editing techniques, then recording it as a continuous utterance by concatenation. The following target words were used:

(4) Vowel    Akan                       English
    [i]      pi     "many"              hid    "heed"
    [ɪ]      fɪ     "to vomit"          hɪd    "hid"
    [e]      jqe    "empty"             hed    "hayed"
    [ɛ]      oitie  "he looked at"      hɛd    "head"
    [u]      bu     "to break"          hud    "who'd"
    [ʊ]      bʊ     "to be drunk"       hʊd    "hood"

These words were also recorded for acoustic analysis:

    [o]      ako    "parrot"            hod    "hoed"
    [ɔ]      kap    "red"               hɔd    "hawed"
    [a:æ]    daa    "everyday"          hæd    "had"

The experiment was performed on a General Electric Signa machine installed at the Yale New Haven Hospital. The Signa system consists of a toroidal superconducting electromagnet developing a 1.5 Tesla flux density, placed in a scanning room designed to minimize interference from external electromagnetic noise. The magnet is controlled from an operator's console outside the scanning room, and an attached computer is used for image reconstruction, collation, and storage.

After divesting themselves of all ferrous material, subjects were fitted with an earphone, microphone, and neck RF transceiver imaging coil, and positioned on their backs inside the bore of the magnet. The earphone was the terminus of a length of plastic tubing connected to a small speaker placed as far away from the magnet as possible (because the strength of the magnetic field tended to overwhelm the speaker coil). The speaker was driven by an amplifier outside the scanning room, and was used to communicate with the subject and to play the prerecorded target stimuli during scanning. The microphone was used to record subject phonation immediately prior to and following scanning; while phonation during scanning was also recorded, it was unusable for analysis because of the intensity of the noise produced by the machine.

Prior to each run the experimenter verified the target vowel by reminding the subject of the characteristic word containing it. Playback of the target vowel was then initiated through the earphone, and scanning begun shortly after the subject began phonating. Subjects were instructed to produce the target vowel with steady pitch and uniform quality, to take shallow breaths while retaining vocal tract configuration, and to refrain from head movement and swallowing as best they could.

Two scans were made for each vowel. The first took approximately two and a half minutes to complete and produced eight adjacent sagittal images (bisecting the face) at 3mm intervals (see Table 1 for imaging parameters used). The second scan took approximately three minutes; it produced 28 adjacent images at 5mm intervals in the axial orientation perpendicular to the pharynx (see Table 2).

Table 1. Sagittal imaging parameters.

Subject AO (Akan)
  256 x 256 pixels x 2 passes (NEX)
  GRE/30 Multi (GRASS echo)
  28cm field of view
  3mm thickness
  0mm interscan skip
  8 images

  Vowel   TR   TE   Time
  i       32   13   2:25
  ɪ       37   11   2:36
  e       39   13   2:44
  ɛ       39   12   2:44
  u       37   11   2:36
  ʊ       37   11   2:36

Subject MT (English)
  256 x 256 pixels x 2 passes (NEX)
  GRE/30 Multi (GRASS echo)
  28cm field of view
  3mm thickness
  0mm interscan skip
  8 images

  Vowel   TR   TE   Time
  i       36   10   2:32
  ɪ       36   10   2:32
  e       37   11   2:32
  ɛ       37   11   2:32
  u       36   10   2:32
  ʊ       37   11   2:32


Table 2. Axial imaging parameters.

Subject AO (Akan)
  256 x 128 pixels x 1 pass (NEX)
  SPGR/45 Volume (spoiled GRASS)
  28cm field of view
  5mm thickness
  0mm interscan skip
  28 images
  TR 45   TE 5   Time 3:05

Subject MT (English)
  256 x 128 pixels x 1 pass (NEX)
  SPGR/45 Volume (spoiled GRASS)
  30cm field of view (i, ɪ, u, ʊ)
  28cm field of view (e, ɛ)
  5mm thickness
  0mm interscan skip
  28 images
  TR 45   TE 5   Time 3:05

Analysis

The images obtained were converted from 16-bit Signa format to an 8-bit (255 gray level) format compatible with display and analysis software. The axial images were normalized for a standard image density so that a given pixel magnitude represented the same value across all series; this was done so that air-tissue boundary measurements could be made consistently. Image analysis was performed on a Macintosh II computer using the NIH IMAGE program.6
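These conversion and normalization steps are simple array operations; the following is a minimal sketch in Python, not the NIH IMAGE procedure itself, with the standard-density scale factor as an assumed parameter:

    import numpy as np

    def to_8bit(image16, standard_density):
        """Convert a 16-bit Signa image to 8-bit (255 gray levels),
        scaling so that a given pixel magnitude represents the same
        value across all series. `standard_density` stands in for the
        standard image density described above (an assumption)."""
        scaled = image16.astype(np.float64) / standard_density
        return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)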

The sagittal images were used to replicate the cineradiographic results. Measurements of relative height for the tongue dorsum, jaw, and larynx were obtained for each vowel, using the top of the second cervical vertebra as the reference, from the image judged to be closest to sagittal midline for that scan set. Dorsum height was measured from the point on the tongue that maximized the vertical distance from the reference (see Figure 1); jaw height was measured from a point at the base of the root of the lower incisors; and larynx height was measured from its vertical distance to the reference. Relative tongue root advancement was measured as the length of a line drawn from the bottom of the vallecular sinuses to the rear pharyngeal wall. Figure 1 illustrates how measurements were obtained, and shows an overlay of the vocal tract outlines obtained for the comparison between Akan /i/ and /ɪ/.

Figure 1. Sagittal measurements (dorsum height, larynx height, reference), showing Subject AO (Akan) for /i/ overlain by /ɪ/.


The axial images were used to obtain data for the left-to-right lateral dimension inaccessible to the cineradiographic studies. For purposes of comparison 11 sequential images were chosen from each axial scan set. The reference used for comparison was the lowest image in each series in which the epiglottis was visible as a detached entity; this point corresponds to the bottom of the vallecular sinuses used in the sagittal images to identify the tongue root position. This image was used as the pivot for choosing five sequential images below that level in the pharynx, and five more above, for a total of eleven. The approximate locations of the axial scans measured are shown overlain on a sagittal outline in Figure 2. The reason for using the epiglottis to coordinate inter-vowel comparison was to minimize any effects due to laryngeal height differences, so that comparison would be made at corresponding anatomical points for both vowel conditions.

Three measurements were obtained from each axial image: the distance from the anterior to the posterior boundaries of pharyngeal airspace, corresponding to tongue root advancement, which will be referred to as "depth"; the novel left-to-right lateral distance across pharyngeal airspace, which will be referred to as "width"; and a measure of the cross-sectional area of pharyngeal airspace at that level; these are illustrated in Figure 3. The distance measurements were made along a straight line at the point of widest distension for each dimension. The area measurement was computed by determining a perimeter of standard image intensity corresponding to the pharyngeal air-tissue boundary, counting the number of enclosed pixels, and converting to area measure. Except when determining the reference or pivot level, the epiglottis was ignored in all measurements; in particular at levels above the reference (those in which a detached epiglottis was visible), the anterior wall was taken to be the base of the tongue, and area measurements included any visible epiglottal tissue.
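The area computation just described is essentially thresholding plus pixel counting; a minimal sketch under stated assumptions (the boundary intensity and mm-per-pixel scale are inputs; the text reports roughly 1 mm/pixel, and a real measurement would restrict the count to the pharyngeal region rather than the whole slice):

    import numpy as np

    def airspace_area_mm2(img8, boundary_intensity, mm_per_pixel=1.09):
        """Estimate cross-sectional airspace area for one axial slice.

        Pixels darker than the standard boundary intensity are treated
        as air (airspace images dark relative to tissue); the enclosed
        pixels are counted and the count converted to mm^2."""
        air_pixels = int((img8 < boundary_intensity).sum())
        return air_pixels * mm_per_pixel ** 2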

Figure 2. Approximate locations of axial scans (from +25mm to -25mm around the reference).


Figure 3. Axial measurements (width: left side to right side; depth: anterior to posterior pharyngeal wall).

Axial image measurement is complicated by the trifurcation of the airspace below the epiglottis into three tubes by the aryepiglottic folds: the central laryngeal passageway and two dead-end side channels, the piriform sinuses. The approach taken for these images was to obtain an overall area measurement by adding the cross-section measured for the central larynx tube to those obtained for each of the piriform sinuses (i.e., aryepiglottic tissue was not included). The width measurement was made from the left side of the left sinus to the right side of the right sinus, but the depth measurement was made across the point of widest distension of the central larynx tube only.

The acoustic recordings of pre- and post-scan phonation were used to verify that subject vowel quality remained on target during scanning. The resulting utterances were digitized at a sampling frequency of 10 kHz. Formant values for each token, as well as the original vowel target utterances, were obtained by a linear-predictive-coding autocorrelation analysis, using a 20ms Hamming window shifted 10ms to fit a 14th-order polynomial, with averaging of successive frames for which the first three formants were available.
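The analysis described is the standard autocorrelation LPC recipe; a minimal sketch for one frame, under stated assumptions (10 kHz sampling, 200-sample frames, no bandwidth test on candidate roots; the original analysis software is not specified):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_formants(frame, fs=10000, order=14):
        """Estimate F1-F3 for one 20ms frame by autocorrelation LPC."""
        w = frame * np.hamming(len(frame))            # 20ms Hamming window
        r = np.correlate(w, w, mode="full")[len(w) - 1:]
        # Solve the Toeplitz normal equations for the predictor coefficients.
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        roots = np.roots(np.concatenate(([1.0], -a)))
        freqs = np.angle(roots[roots.imag > 0]) * fs / (2 * np.pi)
        return sorted(f for f in freqs if f > 90)[:3]  # drop near-DC roots

Successive frames would be advanced 10ms (100 samples), with F1-F3 averaged over the frames in which all three formants are found, as described above.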

Results

Images. Figure 4 is representative of the resulting images. The left side of the figure shows a midline sagittal view and mid-pharynx view from the expanded (+ATR) variant of the Akan vowel /i/, and on the right the corresponding constricted views (/ɪ/). The orientation of the axial images corresponds to a slice through the neck, perpendicular to the pharynx, presented so that the top of the image shows the front of the face or neck. Notice the difference in pharyngeal airspace: the sagittal view on the left shows a wider opening in the lower pharynx than does the corresponding view for the constricted variant on the right. Similarly the axial view of the expanded vowel on the left shows a larger cross-sectional area than the corresponding constricted vowel on the right.

In subjective terms the images obtained varied from poor to good quality compared to others of this type,7 representing in general a reasonable compromise between image noise and required scanning time. Measurements were based on air/tissue interfaces, which tended to image well even if resolution of tissue-internal anatomical details was nonoptimal. Resolution was approximately one millimeter per image pixel in both orientations.8

To determine whether normalization of measurements for different head sizes was appropriate, subject tract lengths were measured using the midline sagittal image obtained for the vowels /i/ and /ɪ/ in each language. The measurement was made from the level of the vocal folds, around the curve of the tongue, to the midpoint of a line bisecting the narrowest lip approximation, presented here in centimeters:

(5) Vocal Tract Lengths (cm)

    Vowel   Subject AO (Akan)   Subject MT (English)
    /i/     17.4                16.9
    /ɪ/     15.7                15.3

These values were considered to be sufficiently close as to make normalization unnecessary.


Figure 4. Representative images: Subject AO (Akan), vowels /i/ and /ɪ/, each in mid-sagittal and mid-pharynx views.

Errors affecting measurements may have been introduced in two different ways: from distortion of the image due to subject movement during scanning, or from subject head tilt. Both possibilities affected sagittal scanning more than axial. Sagittal scans were assembled individually one after another, whereas the axial scans were obtained using a method that computed all images concurrently. Subject movement in a sagittal scan set resulted in blurring of the single image being acquired at that moment, making it unusable for measurement, while movement in an axial set caused only a loss of definition across all images in the set, leaving them all still viable for measurement.

The effect of any head tilt or rotation was to cause the scanning plane to intersect with the subject at an oblique angle, rather than producing a true bisection of the head. This was not a problem for axial images, since any tilting would have been too slight to cause measurable distortion in that orientation. For the sagittal images, however, even slight tilting was sufficient to make determination of the head midline difficult; for example, one image might show midline at the level of the larynx, yet be off center at the level of the palate, thus affecting tongue dorsum height measurements. Therefore all sagittal tongue measurements were confirmed on images adjacent to the one chosen as midline.

Sagittal Orientation. The sagittal measurements obtained are given in Table 3, and illustrated graphically in Figure 5. Figure 6 shows the (expanded - constricted) difference between results obtained for corresponding vowel pairs.

Table 3. Sagittal measurements (mm).

Subject AO (Akan)

  Vowel   Root Advancement   Dorsum Height   Laryngeal Depth   Jaw Depth
  i       22.97              19.69           68.91             48.13
  ɪ       17.50              19.69           57.97             45.94
  e       21.88              25.16           54.69             41.56
  ɛ       19.69              24.06           51.41             47.03
  u       32.81              21.88           74.38             43.75
  ʊ       17.50              18.59           67.81             41.56

Subject MT (English)

  Vowel   Root Advancement   Dorsum Height   Laryngeal Depth   Jaw Depth
  i       22.97              30.63           55.78             30.63
  ɪ       21.88              28.44           47.03             32.81
  e       21.88              27.34           51.41             36.09
  ɛ       19.69              25.16           49.22             32.81
  u       30.63              30.63           55.78             33.91
  ʊ       18.59              26.25           54.69             33.91

Figure 5. Sagittal results (Dorsum Height, Jaw Height, Laryngeal Height, Tongue Root Advancement; Akan +/-, English +/-).


Figure 6. Expanded - constricted sagittal results (panels: Dorsum Height, Larynx Height, Root Advancement; Akan vs. English).

In each case the Akan results confirm the cineradiographic studies mentioned above: tongue root advancement was larger for the expanded (+ATR) variant of each vowel pair, larynx height was lower for each expanded variant as well, and tongue dorsum height showed either no difference (i:ɪ) or slightly higher values for expanded alternates (e:ɛ, u:ʊ). The results for English were very similar: for each tense variant root advancement was larger, larynx height was lower, and tongue dorsum higher. However, the English measurements show smaller differences in magnitude for both root advancement and laryngeal lowering than Akan, and greater differences in tongue dorsum height. Results obtained for jaw height showed small and inconsistent differences in both languages.

Axial Orientation. The axial measurements obtained are given in Tables 4a and 4b. Figure 7 illustrates the difference in cross-sectional area between Akan vowel pairs. In each graph in this and subsequent figures the height of each bar shows the difference in area between expanded and constricted variants at corresponding levels of the pharynx, increasing in height at 5mm intervals from five measurement levels below the base of the epiglottis to five levels above. Observe that the pharyngeal airspace is larger at all measured levels for the expanded (+ATR) variants of vowels /i/ and /u/, and for all levels of expanded /e/ above the epiglottal pivot. By combining each sequence of measured cross-sections the following approximations to pharyngeal volume were obtained (representing the section delimited by +/-25mm from the base of the epiglottis):9

(6) Derived Pharyngeal Volumes for Subject AO (Akan) (cm³)

            I        E        U
    +ATR    30.58    18.38    42.04
    -ATR    18.90    17.27    25.09
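The derivation in (6) is a slice sum: each 5mm axial slice contributes (cross-sectional area x slice thickness). A minimal worked check in Python, using the /i/ column of Table 4a below:

    # Cross-sectional areas (mm^2) for Akan /i/, levels -25mm..+25mm (Table 4a).
    areas = [309.84, 230.88, 296.68, 419.90, 492.87, 656.76,
             750.07, 787.16, 722.56, 714.18, 734.52]
    thickness_mm = 5.0                               # axial slice spacing
    volume_cm3 = sum(areas) * thickness_mm / 1000    # mm^3 -> cm^3
    print(round(volume_cm3, 2))                      # 30.58, the +ATR /i/ entry in (6)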

The larger volumes observed for the "Expanded" variants of each vowel show that Lindau's term is an apt name for the feature being contrasted. It is interesting that the ratio of expanded to constricted volume is nearly the same for vowels /i/ and /u/ (1.62 vs. 1.68), and the absence of a difference of similar magnitude for /e/ suggests that the data obtained for that vowel should be viewed with caution (although they agree with the sagittal results, obtained during a separate scanning sequence, which also showed the smallest root advancement difference for /e/). The smaller differences observed for /e:ɛ/ may be due to the fact that the inherently lower tongue constriction location characteristic of these vowels leaves less room in the lower pharynx for effecting the contrast.


Table 4a. Axial measurements for Subject AO (Akan).

Area (mm²)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25           309.84    133.98     69.38    107.67    508.42    245.24
  -20           230.88    137.57    137.57    205.76    534.74    290.70
  -15           296.68    230.88    174.66    293.09    807.50    363.67
  -10           419.90    373.24    294.29    327.78    820.65    504.83
  -5            492.87    313.43    328.98    311.04    788.35    520.39
  0             656.76    422.29    476.12    297.88   1233.37    878.08
  5             750.07    451.00    488.09    374.44   1165.19    764.43
  10            787.16    455.79    462.96    380.42    721.36    480.91
  15            722.56    437.84    429.47    418.70    646.00    381.62
  20            714.18    421.09    404.35    374.44    640.01    349.32
  25            734.52    403.15    410.33    362.48    543.12    238.06

Width (mm)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25            35.00     28.44     28.44     28.44     33.91     29.53
  -20            32.81     29.53     29.53     29.53     35.00     30.63
  -15            32.81     32.81     31.72     30.63     39.38     31.72
  -10            35.00     33.91     33.91     29.53     38.28     31.72
  -5             36.09     31.72     35.00     30.63     37.19     36.09
  0              32.81     29.53     33.91     25.16     37.19     38.28
  5              30.63     24.06     29.53     24.06     35.00     32.81
  10             30.63     26.25     22.97     25.16     35.00     30.63
  15             28.44     28.44     28.44     24.06     35.00     27.34
  20             31.72     25.16     24.06     22.97     35.00     27.34
  25             30.63     22.97     25.16     17.50     33.91     20.78

Depth (mm)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25            19.69     18.59     20.78     12.03     37.19     26.25
  -20            25.16     15.31     20.78     14.22     31.72     21.88
  -15            27.34     18.59     19.69     15.31     31.72     24.06
  -10            25.16     17.50     24.06     15.31     29.53     22.97
  -5             17.50     14.22     14.22     14.22     30.63     20.78
  0              33.91     24.06     27.34     17.50     43.75     30.63
  5              33.91     18.59     27.34     20.78     42.66     28.44
  10             31.72     18.59     18.59     22.97     32.81     21.88
  15             29.53     24.06     20.78     21.88     27.34     18.59
  20             32.81     22.97     24.06     24.06     27.34     15.31
  25             31.72     24.06     24.06     25.16     25.16     12.03


Table 4b. Axial measurements for Subject MT (English).

Area (mm²)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25           438.08    508.12    470.14    486.89    753.94    618.48
  -20           517.73    519.10    411.52    514.40    699.01    617.29
  -15           527.34    616.61    391.19    614.89    806.12    608.91
  -10           661.93    649.57    576.61    590.97    862.43    752.47
  -5            606.99    585.02    523.97    644.80    896.76    718.97
  0             914.61    659.18    608.91    532.35   1031.34    848.17
  5             951.69    630.34    653.17    368.46    958.56    802.71
  10            942.08    611.11    575.42    295.48    944.82    724.95
  15            948.94    597.38    517.99    324.19    855.56    628.05
  20            940.70    649.57    526.37    337.35    811.61    541.92
  25            961.30    708.62    525.17    394.78    752.56    485.69

Width (mm)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25            32.81     33.98     33.91     33.91     33.98     32.81
  -20            37.50     36.33     36.09     36.09     36.33     33.91
  -15            39.84     39.84     38.28     38.28     39.84     36.09
  -10            41.02     39.84     39.38     41.56     41.02     39.38
  -5             41.02     41.02     39.38     40.47     41.02     40.47
  0              38.67     38.67     40.47     38.28     39.84     39.38
  5              37.50     38.67     39.38     37.19     38.67     39.38
  10             42.19     36.33     36.09     36.09     39.84     39.38
  15             43.36     41.02     35.00     33.91     41.02     31.72
  20             44.53     38.67     36.09     35.00     41.02     36.09
  25             45.70     42.19     33.91     33.91     39.84     38.28

Depth (mm)

  Level (mm)      i         ɪ         e         ɛ         u         ʊ
  -25            26.95     29.30     29.53     26.25     30.47     30.63
  -20            21.09     22.27     21.88     22.97     28.13     28.44
  -15            19.92     19.92     21.88     24.06     29.30     26.25
  -10            21.09     19.92     17.50     19.69     24.61     29.53
  -5             17.58     17.58     17.50     17.50     24.61     24.06
  0              31.64     23.44     24.06     17.50     31.64     26.25
  5              30.47     22.27     22.97     15.31     30.47     27.34
  10             30.47     21.09     20.78     14.22     29.30     19.69
  15             30.47     18.75     20.78     12.03     25.78     18.59
  20             31.64     23.44     18.59     12.03     24.61     16.41
  25             29.30     23.44     20.78     17.50     21.09     15.31


Figure 7. Delta cross-sectional area for Subject AO (Akan). Panels: /i/ - /ɪ/, /e/ - /ɛ/, /u/ - /ʊ/; x-axis: mm (increasing height through pharynx), with the epiglottis at 0.

The observed difference in pharyngeal volume is consistent with the tongue root advancement observed in the sagittal images and cineradiographic studies, but leaves open the question of whether the difference is due entirely to root position, or whether pharyngeal lateral width contributes as well. The relative contributions of width and depth to the area values obtained for Akan are illustrated in Figure 8.

Recall that width is a measure of side-to-side or lateral distance, and depth is a measure of anterior/posterior distance corresponding to root advancement. Notice that while somewhat noisier than the area data, the overall trend for both width and depth is for consistently larger values for the expanded variant of each vowel.10 Notice also that the observed differences in lateral width are almost as large as those observed for anterior/posterior depth, showing that for this speaker at least control of the lateral dimension is an important part of the mechanism used to produce the contrast, thereby confirming Lindau's position that speakers of Akan seek to maximize the difference in overall pharyngeal volume in effecting the contrast.

Figures 9, 10, and 11 provide a comparison of the English area, width, and depth results with those of Akan. As in the previous figures, the height of each bar shows the difference between expanded and constricted variants at corresponding levels through the pharynx. Combining area cross-sections as for Akan, the following approximations to pharyngeal volume were obtained:

(7) Derived Pharyngeal Volumes for Subject MT (English) (cm³)

             I        E        U
    Tense    37.25    26.28    43.10
    Lax      30.13    23.55    34.31

As in Akan, the expanded (tense) variants show consistently larger overall pharyngeal volume than their lax counterparts. Again, it is interesting to note that the ratio of expanded to constricted volume is nearly the same for vowels /i/ and /u/ (1.24 vs. 1.26), although the magnitude of the difference is considerably less than that observed for Akan.

Acoustics. Figure 12 shows a combined vowel space chart using the averaged stimulus target vowel formants from each language. The primary acoustic effect of the ATR contrast on Akan vowel pairs appears to be a raised F1 for (-ATR) variants, with F2 relatively unaffected. Lax vowels in English on the other hand show both a raised F1 and lowered F2 compared to their tense counterparts, for a net centralizing effect. Values measured for F3 were lower for constricted variants in both languages. Two analyses of variance were performed to quantify the significance of these observed effects: one grouped recorded tokens into source groupings of target (the original stimulus words), sagittal run, and axial run; the second analysis grouped them into target, pre-scan, and post-scan source groupings.


Figure 8. Akan pharyngeal width and depth. Panels: width and depth for /i/ - /ɪ/, /e/ - /ɛ/, /u/ - /ʊ/; x-axis: mm (increasing height through pharynx), epiglottis at 0.

Figure 9. Comparison of cross-sectional area (Akan vs. English): /i/ - /ɪ/, /e/ - /ɛ/, /u/ - /ʊ/ differences by pharynx level.

Figure 10. Comparison of pharyngeal width (Akan vs. English) by pharynx level.

Figure 11. Comparison of pharyngeal depth (Akan vs. English) by pharynx level.

Figure 12. Combined vowel space (F1 x F2, Akan and English targets).

The results of both analyses confirmed the significance of the ATR and Tense/Lax effects on F1 (p < .0001) and F3 (p < .05), showed strong significance for the effect of the Tense/Lax contrast on F2 (p < .0001), and showed no significance for ATR effects on F2.

Figures 13 and 14 show vowel space plots for subjects AO (Akan) and MT (English) respectively, incorporating both stimulus and run-time formant values. Missing values for certain configurations resulted from subjects initiating phonation after scanning had commenced, or respiring as it ended (recall that machine noise precluded analysis of phonation recorded during scanning). Ideally the formant values measured for a given vowel should be independent of its source category; in practice this expectation was unrealistic, given that the target vowels were separately recorded under acoustically ideal conditions, while pre- and post-scan tokens were recorded with subjects prone and encased by several tons of intimidating high technology.11 Effects due to token source on measured formant values were assessed by computing Wilks' Λ (a measure of multivariate correlation) from the separate formant analyses. The values obtained (listed in Table 5) show that token source category was in fact a significant source of variance for both subjects, both as a main effect and in interactions with vowel and ATR factors, but the results were mixed depending on how the run-time tokens were grouped. When analyzed by type (i.e., target, pre-scan, post-scan), the Akan subject showed a marginally significant (p < .05) interaction of source with ATR (the feature of primary interest), but no significance for this interaction when analyzed by run (i.e., target, sagittal, axial). This suggests that while subject AO drifted slightly from the initial target over time, the way in which he did so was consistent across runs, at least with respect to ATR effects. The English subject MT showed the reverse pattern: the interaction of source with the Tense/Lax parameter was marginally significant (p < .05) for the analysis of tokens grouped by run, but not significant grouped by type, indicating that while this subject produced slightly different vowels across runs, he did not drift significantly during scanning.
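Wilks' Λ per effect can be obtained from any standard MANOVA routine; a minimal sketch using statsmodels, with the file name and column layout as assumptions (this is not the software used for the original analysis):

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    # Hypothetical layout: one row per token, with its measured formants
    # and the factors discussed in the text (vowel, ATR, token source).
    df = pd.read_csv("formants.csv")  # columns: F1, F2, F3, vowel, ATR, source

    # Wilks' lambda (with its F approximation) for each effect and
    # interaction, analogous to the entries reported in Table 5.
    fit = MANOVA.from_formula("F1 + F2 + F3 ~ vowel * ATR * source", data=df)
    print(fit.mv_test())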


Figure 13. Subject AO (Akan) vowel space (F1 x F2; target and run tokens).

Figure 14. Subject MT (English) vowel space (F1 x F2; target and run tokens).


Table 5. Wilks' Lambda values from MANOVA of formant data.

Tokens grouped as Target : Sagittal : Axial

Akan
  Effect                 Value   F-Value   Num DF   Den DF   P-Value
  Vowel                  0.001    94.142    6.000   18.000   0.0001
  ATR                    0.112    23.867    3.000    9.000   0.0001
  Source                 0.223     3.348    6.000   18.000   0.0215
  Vowel * ATR            0.063     8.951    6.000   18.000   0.0001
  Vowel * Source         0.153     2.847    9.000   22.054   0.0206
  ATR * Source           0.414     1.665    6.000   18.000   0.1871
  Vowel * ATR * Source   0.415     1.656    6.000   18.000   0.1893

English
  Effect                 Value   F-Value   Num DF   Den DF   P-Value
  Vowel                  0.001   136.776    6.000   28.000   0.0001
  ATR                    0.055    80.075    3.000   14.000   0.0001
  Source                 0.410     2.619    6.000   28.000   0.0383
  Vowel * ATR            0.067    13.420    6.000   28.000   0.0001
  Vowel * Source         0.149     3.281   12.000   37.332   0.0025
  ATR * Source           0.369     3.020    6.000   28.000   0.0210
  Vowel * ATR * Source   0.456     1.074   12.000   37.332   0.4078

Tokens grouped as Target : Pre-scan : Post-scan

Akan
  Effect                 Value   F-Value   Num DF   Den DF   P-Value
  Vowel                  0.001   140.397    6.000   20.000   0.0001
  ATR                    0.054    58.871    3.000   10.000   0.0001
  Source                 0.086     8.057    6.000   20.000   0.0002
  Vowel * ATR            0.080     8.483    6.000   20.000   0.0001
  Vowel * Source         0.172     2.889    9.000   24.488   0.0174
  ATR * Source           0.268     3.106    6.000   20.000   0.0257
  Vowel * ATR * Source   0.684     1.542    3.000   10.000   0.2639

English
  Effect                 Value   F-Value   Num DF   Den DF   P-Value
  Vowel                  0.000   205.738    6.000   28.000   0.0001
  ATR                    0.060    72.659    3.000   14.000   0.0001
  Source                 0.167     6.748    6.000   28.000   0.0002
  Vowel * ATR            0.016    32.756    6.000   28.000   0.0001
  Vowel * Source         0.058     6.016   12.000   37.332   0.0001
  ATR * Source           0.520     1.805    6.000   28.000   0.1343
  Vowel * ATR * Source   0.304     1.765   12.000   37.332   0.0907

Recall that the purpose of collecting acoustic data was to verify that subjects were in fact articulating the desired vowel configuration throughout the scanning sequence. The above results indicate that, due possibly to nervousness or fatigue, subjects were not in fact entirely consistent in vowel production. However, if the F-values derived using Wilks' Λ are regarded as reflecting the relative importance of the individual factors, observe that vowel type and ATR and their interaction show larger values than any term involving token source category, for both languages and both analyses; in other words, while the source factor is significant, it is less important relative to the other parameters determining vowel configuration, and does not indicate subject phonation drifted so much as to seriously skew the imaged tract configurations.

Discussion

The fact that the English sagittal measurements show smaller differences in magnitude for both tongue root advancement and laryngeal lowering than Akan, and greater differences in tongue dorsum height, suggests a relatively more significant role for tongue height in maintaining the English contrast. It should be noted however that root advancement and dorsum height may be intrinsically linked. An electromyographic study by Baer, Alfonso, and Honda (1988) claimed that the same contraction of the posterior genioglossus (GGP) muscle effecting tongue root advancement also forces the tongue dorsum upwards, while contraction of the separately controlled anterior genioglossus (GGA) pulls the dorsum forward and down. The results obtained here suggest that GGA activity is greater in Akan than in English: active control of the GGA in Akan works against the dorsum-raising effect of the GGP, resulting in smaller net dorsum height differences than those observed for English. This predicts that the correlation of dorsum height with root advancement should be stronger in English than in Akan, which is confirmed by a regression analysis outlined in the table below:

(8) Dorsum Height regressed against Root Advancement (all vowels)

                   Subject AO (Akan)   Subject MT (English)
    d.f.           4                   4
    Pearson's r    0.226               0.782
    r²             0.051               0.611
    t-ratio        0.465               2.51
    p              n.s.                <0.05

Assuming the pattern of muscle activity mentioned above, one apparent difference therefore between the ATR and Tense/Lax mechanisms lies in how each language treats the inherent dorsum-raising effect of root advancement: Akan seeks to neutralize its effect, whereas English (at least in this subject's dialect) appears to exploit it.

Another difference is apparent from the patterning of axial data at measured levels below the epiglottis. With one exception (area measured at the three lowest levels of /e/), the area, width, and depth measurements obtained for Akan show consistently larger values for expanded (+ATR) variants at all measured levels, above and below the epiglottis. But while the English data also show consistently larger values above the epiglottal pivot, at levels below that point differences between tense and lax variants are inconsistent in sign and considerably smaller in magnitude. The change occurs abruptly, and is evident to some degree in all three parameters measured.

An attempt was made to quantify the significance of this divergence by performing an analysis of variance on each measurement parameter, treating the individual levels as repeated measures nested within an upper or lower grouping factor.12 Under this design, inter-language differences in behavior above and below the epiglottis between expanded and constricted configurations are encompassed in the Language x Group x ATR interaction. Results of the analysis for the width parameter showed no significance for this interaction (F=1.12), but moderate significance for depth (F=9.48, p<.05), and strong significance for area (F=15.84, p<.005), reflecting the difference in subepiglottal behavior between the two subjects.

Further evidence of divergent behavior is provided by a regression analysis of width against depth for corresponding measurement levels, summarized in Table 6, and illustrated in Figure 15. The results show a significant (p < .01) positive correlation between Akan width and depth across all measurement levels, but no correlation for the same analysis on English. When the English values are analyzed separately by upper/lower grouping, however, the results show a significant (p < .01) positive correlation between width and depth for measurement levels above the epiglottal pivot, and a significant (p < .01) negative correlation at levels below it; in other words, below the level of the epiglottis in English an increase in anterior/posterior depth is accompanied by a decrease in lateral width. Separate analysis of the Akan values by group shows significant (p < .01) positive correlations between width and depth for measurement levels above and below the epiglottal pivot.

Table 6. Regression of Axial Width and Depth.

All Vowels
                   Subject AO (Akan)   Subject MT (English)
    d.f.           58                  58
    Pearson's r    0.443               -0.090
    r²             0.196               0.008
    t-ratio        3.76                -0.687
    p              <0.01               n.s.

Akan Vowels (Subject AO)
                   below epiglottis    above epiglottis
    d.f.           28                  28
    Pearson's r    0.545               0.744
    r²             0.297               0.553
    t-ratio        3.44                5.89
    p              <0.01               <0.01

English Vowels (Subject MT)
                   below epiglottis    above epiglottis
    d.f.           28                  28
    Pearson's r    -0.671              0.680
    r²             0.450               0.462
    t-ratio        -4.79               4.91
    p              <0.01               <0.01
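The statistics in Tables 6 and 8 follow the usual correlation recipe, with t = r * sqrt(df / (1 - r²)) and df = n - 2; a minimal sketch (function name mine):

    import numpy as np

    def correlate_levels(width_mm, depth_mm):
        """Pearson's r, r^2, t-ratio, and df for paired measurements.

        Worked check against Table 6 (Akan, all vowels): r = 0.443 with
        df = 58 gives t = 0.443 * sqrt(58 / (1 - 0.196)) = 3.76."""
        w, d = np.asarray(width_mm), np.asarray(depth_mm)
        r = np.corrcoef(w, d)[0, 1]
        df = len(w) - 2
        t = r * np.sqrt(df / (1.0 - r ** 2))
        return r, r ** 2, t, df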


Figure 15. Regression of axial width x depth. Panel fits: Akan, all vowels: y = 0.751x + 0.571, r² = 0.196; English, all vowels: y = 0.167x + 17.009, r² = 0.008; English, below epiglottis: y = -1.024x + 62.433, r² = 0.45; English, above epiglottis: y = 1.475x - 33.627, r² = 0.462.

The fact that this divergence appears to occur exactly at the level of the base of the epiglottis suggests an anatomical explanation. It can be seen from the sagittal images that the base of the epiglottis is at approximately the same height as the hyoid, which serves as an anchor for the medial pharyngeal constrictor. The primary function of this muscle is to control pharyngeal aperture during swallowing in the region of the pharynx between the level of the hyoid and the upper larynx. Since this is where the Akan and English patterns diverge, the source of the cross-language difference in subepiglottal behavior (for these two subjects at least) may be the active involvement of this muscle in effecting the ATR contrast, and its lack of involvement in the Tense/Lax distinction: active control of pharyngeal aperture would account for the consistently larger (expanded - constricted) difference values observed for Akan, and its presumed absence would explain the small and inconsistent differences observed for English. The trading relation observed between English width and depth also supports the hypothesized non-involvement of the medial constrictor in the Tense/Lax mechanism: when GGP action on the tongue pulls it up and forward in Tense configurations, the relaxed constrictor does not oppose it, and so the lateral pharyngeal walls collapse slightly below the epiglottis as a consequence of the advanced tongue root. In Akan however active control of the constrictor results in the pharyngeal walls being tensed in +ATR configurations, preventing similar deformation of the lateral sides. Hardcastle (1976) refers to two modes of pharyngeal muscle contraction: an "isotonic" mode inducing sphincteral narrowing of pharyngeal aperture, and an "isometric" mode serving to tense pharyngeal walls without reducing aperture. Under this account of the differences in subepiglottal behavior between English and Akan, English does not actively control the pharyngeal constrictor, whereas Akan uses isotonic contraction of the pharyngeal constrictor to minimize pharyngeal aperture in [-ATR] configurations, and isometric contraction in [+ATR] configurations to prevent deformation of pharyngeal lateral walls. Pharyngeal isometric tension may be significant in a different context: Hardcastle (1973) has suggested that it may be important in the production of tensed initial stops in Korean.

Assuming that the patterns observed for these two subjects are in fact representative of Akan and English, it is evident that the ATR and Tense/Lax distinctions are only superficially similar. In producing +ATR vowels Akan speakers enlarge the pharyngeal cavity by advancing the root of the tongue, lowering the larynx, and maintaining tension in the pharyngeal walls. In -ATR configurations the root is retracted, pharyngeal aperture is constricted, and the larynx is raised, minimizing pharyngeal volume. Because speakers appear to make adjustments to maintain relatively constant dorsum height across the two configurations, the ATR distinction is essentially a contrast in pharyngeal volume. Tense vowels in English are also articulated with an advanced tongue root, enlarging the pharyngeal cavity as a consequence; but below the level of the epiglottis this enlargement is counteracted by the deformation of the lateral walls from lack of constrictor tension. In lax configurations pharyngeal volume is smaller, but it is unclear whether this is due to actual constriction by the medial constrictor, or simply the relaxed position of the tongue root. The dorsum-raising effect of root advancement is not adjusted for in English, and instead constitutes an integral part of the contrast.

Generalizations of this sort are somewhat presumptuous given the limited scope of this study, and those made for English in particular should be viewed in the context of the Ladefoged et al. (1972) and Raphael & Bell-Berti (1975) findings mentioned above, showing that different English speakers produce the Tense/Lax contrast differently. Separate studies by Lindau in 1975 and 1979 involving four subjects each found consistent articulatory implementation of the ATR mechanism, so the generalizations made here for Akan are perhaps on firmer ground, especially as the sagittal results of this study replicated her findings. In any case, while different speakers of English may approximate more or less closely the Akan ATR articulatory mechanism, the point is that the Tense/Lax contrast is not identical to it, and must be represented by a different feature. Furthermore, because the Akan distinction appears to involve two muscles that the English contrast does not exploit (the medial pharyngeal constrictor and possibly the anterior genioglossus), from an articulatory standpoint it appears to be more complex with respect to the active control and coordination needed to produce it.

Conclusions

Although exploratory in scope, this study did succeed in achieving certain objectives. It demonstrated that despite its inherent drawbacks Magnetic Resonance Imaging can be a useful technique for obtaining information about static vocal tract configurations, one that will undoubtedly increase in importance as the technology is further refined. Measurements from the sagittal images collected here successfully replicated previous results from cineradiographic studies showing the significance of tongue root advancement and larynx height in effecting the Akan ATR contrast, and those obtained from axial images extended these results for the first time into the dimension of lateral width. The axial measurements confirm Lindau's position that it is the overall difference in pharyngeal volume that is relevant to the Akan vowel contrast, not just relative larynx or tongue root positions. Finally the parallel analysis of the English Tense/Lax contrast gave results showing that despite superficial similarities with the Akan ATR distinction, the two contrasts are not the same.

REFERENCES

Baer, T., Alfonso, P., & Honda, K. (1988). Electromyography of the tongue muscles during vowels in /əpVp/ environment. Annual Bulletin of the Research Institute for Logopedics and Phoniatrics (University of Tokyo), 22, 7-19.

Baer, T., Gore, J. C., Gracco, L. C., & Nye, P. W. (1991). Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels. Journal of the Acoustical Society of America, 90 (2), 799-828.

Bradley, W. G., Newton, T. H., & Crooks, L. E. (1983). Physical principles of nuclear magnetic resonance. In T. H. Newton & D. G. Potts (Eds.), Modern neuroradiology: Advanced imaging techniques (pp. 15-62). San Francisco: Clavadel Press.

Clements, G. N. (1980). Vowel harmony in a nonlinear generative phonology: An autosegmental model. Indiana University Linguistics Club.

Dolphyne, F. A. (1988). The Akan (Twi-Fante) language. Accra: Ghana Universities Press.

Halle, M., & Stevens, K. N. (1969). On the feature Advanced Tongue Root. Quarterly Progress Report, 94, Research Laboratory of Electronics, Massachusetts Institute of Technology, 209-215.

Hardcastle, W. J. (1973). Some observations on the tense-lax distinction in initial stops in Korean. Journal of Phonetics, 1, 263-272.

Hardcastle, W. J. (1976). Physiology of speech production. New York: Academic Press.

Harshman, R., Ladefoged, P., & Goldstein, L. (1977). Factor analysis of tongue shapes. Journal of the Acoustical Society of America, 62, 693-707.


Jackson, M. T. (1988). Phonetic theory and cross-linguistic variation in vowel articulation. Working Papers in Phonetics, 71. Los Angeles: University of California.

Ladefoged, P. (1964). A phonetic study of West African languages. Cambridge: University Press.

Ladefoged, P., DeClerk, J., Lindau, M., & Papcun, G. (1972). An auditory-motor theory of speech production. Working Papers in Phonetics, 22, 48-75. Los Angeles: University of California.

Lakshminarayanan, A. V., Lee, S., & McCutcheon, M. J. (1991). MR imaging of the vocal tract during vowel production. Journal of Magnetic Resonance Imaging, 1, 71-76.

Lindau, M. (1975). Features for vowels. Working Papers in Phonetics, 30. Los Angeles: University of California.

Lindau, M. (1979). The feature expanded. Journal of Phonetics, 7, 163-176.

Perkell, J. (1971). Physiology of speech production: A preliminary study of two suggested revisions of the features specifying vowels. Quarterly Progress Report, 102, Research Laboratory of Electronics, Massachusetts Institute of Technology, 123-139.

Raphael, L. J., & Bell-Berti, F. (1975). Tongue musculature and the feature of tension in English vowels. Phonetica, 32, 61-73.

Stewart, J. M. (1967). Tongue root position in Akan vowel harmony. Phonetica, 16, 185-204.

Stewart, J. M. (1971). Niger-Congo, Kwa. In T. Sebeok (Ed.), Current trends in linguistics (pp. 179-212). The Hague: Mouton.

FOOTNOTEShligh vowels from both harmony groups (and low /a/) alsohave nasalized counterparts which were not investigated inthis study.2In Clements' (1980) analysis of Akan vowel harmony /a/ istreated as an opaque vowel that induces the [-ATRI feature onany subsequent vowels. Some dialects of Akan have anadditional (1-FATRI) front vowel /3/ that harmonizes with I a/;however the exact distribution of this vowel is unclear, and it isignored here.'The magnetic resonance technique relies on strong magneticfields to align the magnetic moments of hydrogen nuclei, andpulsed radio-frequency energy to set them into resonance. Thesignal intensity in a MR image depends on the density of thehydrogen nuclei in the scanned tissue and its resonance decaycharacteristics, determined by the atomic environment of theprotons within each molecule. See Bradley et al. (1983) forfurther information.

4Personal communication from my Akan informant.

5To facilitate cross-language comparison in the followingdiscussion I freely apply the terms "expanded" and"constricted" to both Akan and English, but do not mean toimply by this that the same feature mechanism is involved inboth languages; nor am I referring to Lindau's term forthe Akan feature. The terms are simply meant as shorthandphysical descriptions of the respective contrasts,so that "expanded" for example should be understood as[+ATR] in the context of Akan, and [-ETensej in the context ofEnglish.

6Image version 1.29q by W. Rasband,7Image quality was inferior to that obtained by Baer et al. (1991);

however those researchers used custom-made cephalostats andlonger scanning times in obtaining their best results.

8The scanning process resolved a given volume into a flat 256 x256 pixel image. Sagittal scanning used a 280mm width x280mm depth x 3mm thickness giving a resolution of1.09rnm/pix. Half the axial scans were done using a 280mm x280mm x 5mm volume resulting in the same 1.09mm /pixresolution, and half were done using a 300mm x 300mm x 5mmvolume giving a resolution of 1.17mm/pix. See Tables 1 and 2for full scanning specifications used.

9Each volume element represents (scan area x scan thickness), where scanned image thickness was a constant 5 mm (see preceding footnote).

10The uppermost axial levels measured for Akan /e, ɛ/ show (slightly) larger depth values for the constricted variant, reversing the pattern found everywhere else. One possible explanation is that at these levels scanning has moved out of the region manipulated for the ATR contrast, and into the relatively invariant region of the tongue constriction characterizing the vowel.

11Both subjects reported feelings of nervousness when first placed inside the magnet; as the diameter of the central bore is only approximately two feet, claustrophobia is definitely a potential problem in studies of this type.

12For purposes of symmetry the uppermost level was dropped from the analysis, so that the five subepiglottal levels constituted the 'lower' group, and the epiglottal pivot and its four succeeding levels the 'upper.' ANOVA design (levels as repeated measures nested in group): level (measurement level, 1-5); group (below/above epiglottis); vowel (I/E/U); ATR (expanded/constricted); language (Akan/English). The level x vowel x ATR x language interaction was used as the error term (16 df).
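The arithmetic in footnotes 8 and 9 is easy to verify. The following minimal sketch (ours, not part of the original report; the function names are invented) computes the in-plane resolution and the per-voxel volume from the stated field of view, image matrix, and slice thickness:

    # Resolution and voxel-volume arithmetic from footnotes 8 and 9.
    # The field-of-view, matrix, and thickness values come from the text;
    # everything else is illustrative.

    def mm_per_pixel(field_of_view_mm, matrix_size=256):
        """In-plane resolution: field of view divided by image matrix size."""
        return field_of_view_mm / matrix_size

    def voxel_volume_mm3(field_of_view_mm, thickness_mm, matrix_size=256):
        """Volume represented by one pixel: pixel area x slice thickness."""
        pix = mm_per_pixel(field_of_view_mm, matrix_size)
        return pix * pix * thickness_mm

    print(mm_per_pixel(280))           # 1.09375  -> the reported 1.09 mm/pixel
    print(mm_per_pixel(300))           # 1.171875 -> the reported 1.17 mm/pixel
    print(voxel_volume_mm3(280, 5.0))  # about 5.98 mm^3 per axial voxel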



Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 131-134

Thai*

M. R. Kalaya Tingsabadh† and Arthur S. Abramson‡

Standard Thai is spoken by educated speakers in every part of Thailand, used in news broadcasts on radio and television, taught in school, and described in grammar books and dictionaries. It has evolved historically from Central Thai, the regional dialect of Bangkok and the surrounding provinces.

The transcription of a Thai translation of The North Wind and the Sun is based on recordings made by three cultivated speakers of the language, who were asked to read the passage in a relaxed way. In fact, we find them all to have used a fairly formal colloquial style, apparently equivalent to Eugenie J. A. Henderson's "combinative style" (Henderson, 1949). In a more deliberate reading of the text many words in the passage would be transcribed differently. The main features subject to such stylistic variation are vowel quantity, tone, and glottal stop. Thus, for example, /tɛ̀ː/ 'but' is likely under weak stress to be /tɛ̀/ with a short vowel; the modal auxiliary /tɕàʔ/ 'about to' becomes /tɕa/, with change of tone from low to mid and loss of final glottal stop. The prosodic and syntactic factors that seem to be at work here remain to be thoroughly explored.

Consonants

             Bilabial   Labiodental   Alveolar   Postalveolar   Palatal   Velar   Glottal
Plosive      p pʰ b                   t tʰ d                              k kʰ    ʔ
Nasal        m                        n                                   ŋ
Fricative               f             s                                                 h
Affricate                                        tɕ tɕʰ
Trill                                 r
Approximant                                                     j         w
Lateral
approximant                           l

p    pâː    'elder aunt'
pʰ   pʰâː   'cloth'
b    bâː    'insane'
m    maː    'to come'
f    fǐː    'pimple'
t    tam    'to pound'
tʰ   tʰam   'to do'
d    dam    'black'
n    naː    'ricefield'
s    sǎj    'clear'
r    rák    'to love'
l    lák    'to steal'
tɕ   tɕaːm  'to sneeze'
tɕʰ  tɕʰaːm 'dish'
k    kâːŋ   'fish bone'
kʰ   kʰâːŋ  'side'
ŋ    ŋaː    'tusk'
w    wan    'day'
j    jaːm   'watchman'
ʔ    ʔûan   'fat'
h    hǎj    'earthen jar'





Vowels

There are nine vowels. Length is distinctive for all the vowels. (In some phonological treatments, /Vː/ is analyzed as /VV/.) Although small spectral differences between short and long counterparts are psychoacoustically detectable and have some effect on vowel identification (Abramson & Ren, 1990), we find the differences too subtle to place with confidence in the vowel quadrilateral. The vowel /a/ in unstressed position, including the endings of the diphthongs /ia ɯa ua/, is likely to be somewhat raised in quality. The final segments of the other two sets of phonetic diphthongs, (1) [iw, ew, ɛw, aw, aːw, iaw] and (2) [aj, aːj, ɔj, ɔːj, uj, ɤːj, uaj, ɯaj], are analyzed as /w/ and /j/ respectively.

i   krìt 'dagger'           iː  krìːt 'to cut'             ia  rian 'to study'
e   ʔen 'ligament'          eː  ʔeːn 'to recline'          ɯa  rɯan 'house'
ɛ   pʰɛ́ʔ 'goat'             ɛː  pʰɛ̂ː 'to be defeated'      ua  ruan 'to be provocative'
a   fǎn 'to dream'          aː  fǎːn 'to slice'
ɔ   klɔ̀ŋ 'box'              ɔː  klɔːŋ 'drum'
o   kʰôn 'thick (soup)'     oː  kʰôːn 'to fell (a tree)'
u   sùt 'last, rearmost'    uː  sùːt 'to inhale'
ɤ   ŋɤn 'silver'            ɤː  dɤːn 'to walk'
ɯ   kʰɯ̂n 'to go up'         ɯː  kʰlɯ̂ːn 'wave'




Tones

There are five tones in Standard Thai: mid, low, high, rising, and falling, exemplified here with the syllable /kʰaː/:

kʰaː (mid) 'to get stuck'
kʰàː (low) 'galangal'
kʰáː (high) 'to engage in trade'
kʰǎː (rising) 'leg'
kʰâː (falling) 'to kill'

Stress

Primary stress falls on the final syllable of a word. The last primary stress before the end of a major prosodic group commonly takes extra stress.

Conventions

The feature of aspiration is manifested in the expected fashion for the simple prevocalic oral stops /pʰ, tʰ, kʰ/. The fairly long lag between the release of the stop and the onset of voicing is filled with turbulence, i.e., noise-excitation of the relatively unimpeded supraglottal vocal tract. In the special case of the "aspirated" affricate /tɕʰ/, however, the noise during the voicing lag excites a narrow postalveolar constriction, thus giving rise to local turbulence. It is necessarily the case, then, that the constriction of the aspirated affricate lasts longer than that of the inaspirate (Abramson, 1989). Not surprisingly, it follows from these considerations that the aspiration of initial stops as the first element in clusters occurs during the articulation of the second element, which must be a member of the set /l, r, w/.

Only /p, t, k, ʔ, m, n, ŋ, w, j/ occur in syllable-final position. Final /p, t, k, ʔ/ have no audible release. The final oral plosives are said to be accompanied by simultaneous glottal closure (Henderson, 1964; Harris, 1972). Final /ʔ/ is omitted in unstressed positions. Initial /t/ and /f/ are velarized before close front vowels.

The consonant /r/ is realized most frequently as [ɾ] but also as [r]. Perceptual experiments (Abramson, 1962: 6-9) have shown that the distinction between /r/ and /l/ is not very robust; nevertheless, the normative attitude among speakers of Standard Thai is that they are separate phonemes, as given in Thai script. This distinction is rather well maintained by some cultivated speakers, especially in formal speech; however, many show much vacillation, with a tendency to favor the lateral phone [l] in the position of a single initial consonant. As the second element of initial consonant clusters, both /l/ and /r/ tend to be deleted altogether.

In plurisyllabic words, the low tone and the high tone on syllables containing the short vowel /a/ followed by the glottal stop in deliberate speech become the mid tone when unstressed, with loss of the glottal stop.
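The stylistic rule just described can be restated compactly. The sketch below is our own toy formulation, not the authors'; it simply encodes the statement above (an unstressed /Caʔ/ syllable with low or high tone surfaces with mid tone and no glottal stop):

    # Toy restatement of the casual-speech rule for short /a?/ syllables.
    def casual_form(syllable, tone, stressed):
        """Return the (syllable, tone) realized in unstressed position."""
        if not stressed and syllable.endswith("aʔ") and tone in ("low", "high"):
            return syllable[:-1], "mid"   # drop the glottal stop, mid tone
        return syllable, tone

    print(casual_form("tɕaʔ", "low", False))   # ('tɕa', 'mid'), as in 'about to'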

Transcription of recorded passage

[The tone-marked phonemic transcription of the recorded passage is too badly degraded in the scanned original to be reproduced here.]

The passage in Thai script

[The Thai-script version of the passage is likewise illegible in the scanned original.]




REFERENCES

Abramson, A. S. (1962). The vowels and tones of Standard Thai: Acoustical measurements and experiments. Bloomington: Indiana University Research Center in Anthropology, Folklore, and Linguistics, Publication 20.

Abramson, A. S. (1989). Laryngeal control in the plosives of Standard Thai. Pasaa, 19, 85-93.

Abramson, A. S., & Ren, N. (1990). Distinctive vowel length: Duration vs. spectrum in Thai. Journal of Phonetics, 18, 79-92.

Harris, J. G. (1972). Phonetic notes on some Siamese consonants. In J. G. Harris & R. B. Noss (Eds.), Tai phonetics and phonology (pp. 8-22). Bangkok: Central Institute of English Language.

Henderson, E. J. A. (1949). Prosodies in Siamese: A study in synthesis. Asia Major New Series, 1, 189-215.

Henderson, E. J. A. (1964). Marginalia to Siamese phonetic studies. In D. Abercrombie, D. B. Fry, P. A. D. MacCarthy, N. C. Scott, & J. L. M. Trim (Eds.), In honour of Daniel Jones: Papers contributed on the occasion of his eightieth birthday 12 September 1961 (pp. 415-424). London: Longmans.

FOOTNOTES

*Journal of the International Phonetic Association, in press.

†Department of Linguistics, Faculty of Arts, Chulalongkorn University, Bangkok.

‡Also Department of Linguistics, The University of Connecticut, Storrs.

We thank Dr. Chalida Rojanawathanavuthi, Dr. Kingkarn Thepkanjana, and Miss Surangkana Kaewnamdee, whose readings of the passage underlie our transcription. Part of the work of the second author was supported by Grant HD-01994 from the U.S. National Institutes of Health to Haskins Laboratories.


Haskins Laboratories Status Report on Speech Research, 1993, SR-113, 135-144

On the Relations between Learning to Spell and Learning to Read*

Donald Shankweiler† and Eric Lundquist‡

The study of spelling is oddly neglected by researchers in the cognitive sciences who devote themselves to reading. Experimentation and theories concerning printed word recognition continue to proliferate. Spelling, by contrast, has received short shrift, at least until fairly recently. It is apparent that in our preoccupation with reading, we have tended to downgrade spelling, passing it by as though it were a low-level skill learned chiefly by rote. However, a look beneath the surface at children's spellings quickly convinces one that the common assumption is false. The ability to spell is an achievement no less deserving of well-directed study than the ability to read. Yet spelling and reading are not quite opposite sides of a coin. Though each is party to a common code, the two skills are not identical. In view of this, it is important to discover how development of the ability to spell words is phased with development of skill in reading them, and to discover how each activity may influence the other. Thus, this chapter is concerned with the relationship between reading and writing.

It is appropriate to begin by asking what information an alphabetic orthography provides for a writer and reader, and to briefly review the possible reasons why beginners often find it difficult to understand the principle of alphabetic writing and to grasp how spellings represent linguistic structure. In this connection, would an orthography best suited for learning to spell differ from an orthography best suited for learning to read?

We are indebted to Professor Peter Bryant of the University of Oxford for making it possible for one of us (DS) to study a group of children whose data are discussed in this chapter, and for help with the analysis and much valuable discussion. Thanks are due to Stephen Crain, Anne Fowler, Ram Frost, Leonard Katz, Alvin Liberman, and Hyla Rubin for their comments on earlier drafts. This work was supported in part by Grant HD-01994 to Haskins Laboratories from the National Institute of Child Health and Human Development.

The second section discusses how spelling and reading are interleaved in a child newly introduced to the orthography of English. Here, one central question is precedence: Does the ability to read words precede the ability to spell them, or, alternatively, might some children be ready to apply the alphabetic principle in writing before they can do so in reading? A related question is strategy. Do children sometimes approach the two tasks in very different ways? Finally, the last section discusses how analysis of children's spellings may illuminate aspects of orthographic learning that are not readily accessible in the study of reading.


HOW WRITERS AND READERS ARE EQUIPPED TO COPE WITH THE INFORMATION PROVIDED BY AN ALPHABETIC SYSTEM

Writing differs from natural and conventional signs in that it represents linguistic units, not meanings directly (DeFrancis, 1989; Mattingly, 1992). The question of how the orthography maps the language is centrally relevant to the course of acquisition of reading and spelling. All forms of writing permit the reader to recover the individual words of a linguistic message. Given that representation of words is the essence of writing, it is important to appreciate that words are phonological structures. To apprehend a word, whether in speech or in print, is thus to apprehend (among other things) its phonology. But in the manner of doing this, A. M. Liberman (1989; 1992) notes that there is a fundamental difference between speech on the one hand and reading and writing on the other. For a speaker or listener who knows a language, the language apparatus produces and retrieves phonological structures by means of processes that function automatically below the conscious level. Thus,



Liberman notes that to utter a word one does not need to know how the word is spelled, or even that it can be spelled. The speech apparatus that forms part of the species-specific biological specialization for language "spells" the word for the speaker (that is, it identifies and orders the segments). In contrast, writing a word, or reading one, brings to the fore the need for some explicit understanding of the word's internal structure. Since in an alphabetic system it is primarily phonemes that are mapped, those who succeed in mastering the system would therefore need to grasp the phonemic principle and be able to analyze words as sequences of phonemes.

The need that alphabetic orthographies present for conscious apprehension of phonemic structure poses special difficulties for a beginner (see Gleitman & Rozin, 1977; I. Y. Liberman, 1973; Liberman, Shankweiler, Fischer, & Carter, 1974). The nub of the problem is this: phonemes are an abstraction from speech; they are not speech sounds as such. Hence, the nature of the relation between alphabetic writing and speech is necessarily indirect and, as we now know, often proves difficult for a child or a beginner of any age to apprehend. In order to understand why this is so it will pay us to dwell for a moment on the ways in which it is misleading to suppose that an alphabetic orthography represents speech sounds (see Liberman, Rubin, Duques, & Carlisle, 1985; Liberman et al., 1974).

First, the letters do not stand for segments that are acoustically isolable in the speech signal. So, for example, one does not find consonants and vowels neatly segmented in a spectrogram in correspondence with the way they are represented in print. Instead phonemes are co-articulated, thus overlappingly produced, in syllable-sized bundles. Accordingly, apprehension of the separate existences of phonemes and their serial order requires that one adopt an analytic stance that differs from the stance we ordinarily adopt in speech communication, in which the attention is directed to the content of an utterance, not to its phonological form. In view of this, it is not surprising to discover that preschool children have difficulty in segmenting spoken words by phoneme (see Liberman et al., 1989; Morais, 1991 for reviews).

Without some awareness of phonemic segmentation, it would be impossible for a beginning reader or writer to make sense of the match between the structure of the printed word and the structure of the spoken word. So, for example, writers and readers can take advantage

of the fact that the printed word CLAP has four segments only if they are aware that the spoken word "clap" has four (phonemic) segments. Accordingly, in order to master an alphabetic system it is not enough to know the phonetic values of the letters. That knowledge, necessary though it is, is not sufficient. In order to fully grasp the alphabetic principle, it is necessary, in addition, to have the ability to decompose spoken words phonemically. Indeed, experience shows that there are many children who know letter-phoneme correspondences yet have poor word decoding skills (Liberman, 1971).

Considerable evidence now exists that children's skill in segmenting words phonemically and their progress in reading are, in fact, causally linked (e.g., Adams, 1990; Ball & Blachman, 1988; 1991; Bradley & Bryant, 1983; Byrne & Fielding-Barnsley, 1991; Goswami & Bryant, 1990; Lundberg, Frost, & Petersen, 1988). One would also expect to find that the same kind of relationship prevails between phoneme segmentation abilities and spelling. And, indeed, the data are consistent with that expectation. Studies by Zifcak (1981) and Liberman et al. (1985) have shown substantial correlations between performance on tests of phoneme segmentation of spoken words and the degree to which all the phonemes are represented in children's spellings. The findings of Rohl and Tunmer (1988) confirm this association. They compared matched groups of older poor spellers with younger normal ones and found that the poor spellers did significantly less well on a test of phoneme segmentation. (See also Bruck & Treiman, 1990; Juel, Griffith, & Gough, 1986, and Perin, 1983.)

The complex relation between phonemic segments and the physical topography of speech is one sense in which alphabetic writing represents speech sounds only remotely. This, we have supposed, constitutes an obstacle for the beginning reader/writer to the extent that it makes the alphabetic principle difficult to grasp and difficult to apply. Two further sources of the abstractness of the orthography should also be mentioned, which may be especially relevant to the later stages of learning to read and to spell.

First, alphabetic orthographies are selective in regard to those aspects of phonological structure that receive explicit representation in the spellings of words (Klima, 1972; Liberman et al., 1985). No natural writing system incorporates the kind of phonetic detail that is captured in the special-purpose phonetic writing that linguists



use. Much context-conditioned phonetic variation is ignored in conventional alphabetic writing,1 in addition to the variation associated with dialect and idiolect. Hence, conventional writing does not aim to capture the phonetic surface of speech, but aims instead to create a more generally useful abstraction. It is enlightening to note, in this connection, that young children's "invented spellings"2 often differ from the standard system in treating English writing as though it were more nearly phonetic than it is (Read, 1971; 1986).

A second source of abstractness stems from the fact that the spelling of English is more nearly morphophonemic than phonemic. English orthography gives greater weight to the morphological structure of words than is the case with some other alphabetic orthographies, for example, Italian (see Cossu, Shankweiler, Liberman, Katz, & Tola, 1988) and Serbo-Croatian (see Ognjenović, Lukatela, Feldman, & Turvey, 1983). Examples of morphological penetration in the writing of English words are easy to find. A ubiquitous phenomenon is the consistent use of s to mark the plural morpheme, even in those words, like DOGS, in which the suffix is pronounced not [s], but [z]. The morphemic aspect of English writing appears also in spellings that distinguish words that are homophones, for example, CITE, SITE; RIGHT, WRITE.
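The plural example conceals a regularity that is easy to state as a rule: the letter S is constant, while its pronunciation varies with the final sound of the stem. The sketch below is our own illustration of that familiar rule, with deliberately simplified phoneme labels; it is not drawn from the chapter:

    # English plural allomorphy: uniform spelling S, rule-governed pronunciation.
    SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
    VOICELESS = {"p", "t", "k", "f", "th"}

    def plural_pronunciation(final_phoneme):
        """Spoken form of the written plural suffix S."""
        if final_phoneme in SIBILANTS:
            return "[iz]"   # BUSES
        if final_phoneme in VOICELESS:
            return "[s]"    # CATS
        return "[z]"        # DOGS: voiced final consonant, so the suffix is [z]

    print(plural_pronunciation("g"))   # [z]
    print(plural_pronunciation("t"))   # [s]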

The knowledge that spellings of some English words may sacrifice phonological transparency to capture morphological relationships brings into perspective certain seeming irregularities, as several writers have noted (Chomsky & Halle, 1968; Klima, 1972; Venezky, 1970). Homophone spellings are instances in which the two modes of representation, the phonemic and the morphemic, are partially in conflict (DeFrancis, 1989). In these spellings the principle of alphabetic writing is compromised to a degree, but it is not abandoned, since most letters are typically shared between words that have a common pronunciation. A lexical distinction in homophone pairs is ordinarily indicated by the change of only a letter or two. Thus, homophone spellings in English present an irregularity from a narrowly phonological standpoint, while nonetheless keeping the irregularity within circumscribed limits.

Such examples are telling. They led DeFrancis (1989) to make a novel and stimulating suggestion: that the needs of readers and writers may actually conflict to some degree. The convention of distinct spellings for homophones would benefit readers by removing lexical

ambiguity in cases in which context does not immediately resolve the matter. Writers, on the other hand, would perhaps be better served by a system that minimizes inconsistencies in mapping the surface phonology. For writers, the presence of homophones which are distinguished by their spellings increases the arbitrariness of the orthography, and hence the burden on memory. Because it has to serve for both purposes, the standard system can be regarded as a compromise, in some instances favoring readers and in other instances favoring writers.

Scrutiny of the words that users of English find difficult to spell confirms that morphologically complex words are among those most often misspelled (Carlisle, 1987; Fischer, Shankweiler, & Liberman, 1985). Carlisle (1988) notes that in derived words the attachment of a suffix to the base may involve a simple addition resulting in no change in either pronunciation or spelling of the base (ENJOY, ENJOYMENT). Alternatively, the addition may result in a pronunciation change in the base (HEAL, HEALTH), a spelling change but not a pronunciation change (GLORY, GLORIOUS) or a case in which both spelling and pronunciation change (DEEP, DEPTH). Difficulties in spelling morphologically complex words appear to stem in part from their phonological complexity and irregular spellings. But they may also stem from failure to recognize and accurately partition derivationally related words. Carlisle (1988) tested school children aged 8 to 13 for morphological awareness. They were asked to respond orally with the appropriate derived form, given the base followed by a cueing sentence designed to prompt a derivative word (e.g., "Magic. The show was performed by a ___."). It was found that awareness of derivative relationships was very limited in the youngest children, especially in cases in which the base undergoes phonological change in the derived form (as in the above example). Moreover, the ability to produce derived forms has proven deficient in children and adults who are poor spellers (Carlisle, 1987; Rubin, 1988). All in all, the evidence supports the expectation that both phonologic and morphologic aspects of linguistic awareness are relevant to success in spelling and reading.

So far we have discussed the common basis of reading and writing, pointing first to the great divide that separates speech processes on the one hand from orthographic processes on the other. Then we proceeded to identify the factors that make learning an alphabetic system difficult. The




idea was also introduced that reading and spelling may tax orthographic knowledge in somewhat different ways. It is to these differences that we turn next.

CAN CHILDREN APPLY THE ALPHABETIC PRINCIPLE IN SPELLING BEFORE THEY ARE ABLE TO APPLY IT IN READING?

The possibility that the needs of readers and writers may differ with respect to the kind of orthographic mapping that is easiest to learn raises the broader issue of the relation between learning to write and learning to read. Does one precede the other? Do children adopt different strategies for the one than for the other? To answer these questions we will want to examine what is known about how spelling articulates with reading in new learners.

As to the first question, one may wonder whether precedence is really an issue. Just as in primary language development, where it is often noted that children's perceptual skills run ahead of their skills in production, so in written language, too, it would seem commonsensical to suppose that a new learner's ability to read words would exceed the ability to spell them. Most users of English orthography have probably had the experience of being unsure how to spell some words that they recognize reliably in reading. Contributing to the difficulty is the fact that there is usually more than one way for a word to be spelled that would equivalently represent its phonological structure. (Consider, for example, "clene" and "cleen" as equivalent transcriptions of the word clean.) The reader's task is to recognize the correspondence between a letter string that stands for a word (i.e., its morphophonological structure) and the corresponding word in the lexicon. It is not required that the reader know exactly how to spell a word in order to read it, only that the printed form (together with the context) should provide sufficient cues to prompt recognition of the represented word and not some other word. In contrast, the writer must generate the one (and ordinarily only one) spelling that corresponds to the conventional standard. So it is natural to assume that spelling words requires greater orthographic knowledge than reading them. We therefore might expect that a beginner would have the ability to read many words before necessarily being able to spell them correctly.

Nonetheless, questions about precedence in the development of reading and writing have arisen

repeatedly. Some writers have suggested that, contrary to the view that reading is easier, children may indeed be ready to write words, in some fashion, before they are able to use the alphabetic principle productively in reading. Montessori (1964) expressed this view, and it has more recently been articulated by several prominent researchers. In part, these claims are based on experiences with preschool children who were already writing using their own invented spellings. Carol Chomsky (1971; 1979) stressed that many young writers do this at a time when they cannot read, and, indeed, may show little interest in reading what they have written. Others who have proposed a lack of coordination between spelling and reading in children's acquisition of literacy are Bradley and Bryant (1979), Frith (1980), and Goswami and Bryant (1990).

In order to discuss the question of precedence we must first consider how we are going to define spelling and reading. By spelling, do we mean spelling a word according to conventional spelling? To adopt that criterion would ignore the phenomenon of children's invented spelling. That would seem unwise since it is well-established that some children are able to write more or less phonologically before they know standard spellings (Read, 1971; 1986). It would be appropriate for some purposes to credit a child for spelling a word if the spelling the child produces approximates the word closely enough that it can be read as the intended word.

The criterion of reading is in one sense less problematical, but in another sense it is more so. For someone to be said to have read a word, that word, and not some other word (or nonword), must have been produced in response to the printed form. It is also relevant to ask how the response was arrived at. Words written in an alphabetic system can be approached in a phonologically analytic fashion or, alternatively, they can be learned and remembered holistically (i.e., as though they were logographs). As Gough and Hillinger (1980) stress, the difficulty with the logographic strategy is that it is self-limiting because it does not enable a reader to read new words. Moreover, as the vocabulary grows and the number of visually similar words increases, the memory burden becomes severe and the logographic strategy becomes progressively more inaccurate. Should we therefore consider someone a reader if she can identify high frequency words, but cannot read low frequency words or nonwords? There is some consensus that we should not (e.g.,




Adams, 1990; Gleitman & Rozin, 1977; Gough & Hillinger, 1980; Liberman & Shankweiler, 1979). The possibility of reading new words, not previously encountered in print, is a special advantage conferred by an alphabetic system. It is reasonable to suppose that someone who has mastered the system will possess that ability.

However, in the view of some students of reading, most children when they begin to read, and perhaps for a considerable time afterward, read logographically, and only later learn to exploit the alphabetic principle (Bradley & Bryant, 1979; Byrne, 1992; Gough & Hillinger, 1980). Given the absence of agreement as to what is to be taken as sufficient evidence of reading ability, the question of whether spelling or reading comes first is less the issue than whether children initially employ discrepant strategies for reading and writing.

The strategy question is brought into focus by Goswami and Bryant (1990). As noted above, they suppose that the child's initial strategy in reading (the default strategy) is to approach alphabetically written words as though they were logographs. They contend that children tend to do this even when they have had instruction designed to promote phonemic awareness. Reading analytically might require more advanced word analysis skills than are available to most beginning readers. Writing, on the other hand, forces the child to think in terms of segments. The process of alphabetic writing is by its nature segmental and sequential: The writer forms one letter at a time and must order the letters according to some plan. Thus, Goswami and Bryant suppose that children's initial approaches to writing would tend to be phonologically analytic. Goswami and Bryant (1990) find it paradoxical that children's newly found phonological awareness, which most often is introduced in the context of instruction in reading, has an immediate effect on their spelling, but not on their reading. "So at first there is a discrepancy and a separation between children's reading and spelling. It is still not clear why children are so willing to break up words into phonemes when they write, and yet are so reluctant to think in terms of phonemes when they read" (p. 148).

Bryant and his colleagues (see especially Bradley and Bryant, 1979) deserve much credit for grasping the need for a coordinated approach to the study of reading and spelling. They recognized that this undertaking would require testing children on reading and spelling the same words. It is well known that performance on reading and

spelling tests is highly correlated, at least in older children and adults (Perfetti, 1985; Shankweiler & Liberman, 1972). Bradley and Bryant stressed that the correlation between reading and spelling scores depends on the words chosen. They proposed that the words that children at the beginning stages find difficult to read are not always the words that are difficult to spell, and vice versa. Words that tended to be read correctly but misspelled were words whose spellings presented some irregularity, like EGG or LIGHT, whereas words spelled and not read tended to be regular words, like MAT and BUN (Bradley & Bryant, 1979).

The finding that the spell-only words and the read-only words did not overlap very much in the beginning would lend support to the hypothesis that children at this stage use different strategies for spelling and reading. The greater difficulty in spelling irregular words is what one would expect if the children were attempting to spell according to regular letter-to-phoneme correspondences. They would tend to regularize the irregular words and thus get them wrong. Moreover, the failure to read regular words suggests that the children were using some nonanalytic strategy for reading, responding perhaps to visual similarity. That would make them prone to miss easy words whenever their appearance is confusable with other words that look similar. If they were reading analytically they would read these words correctly. Thus, Bryant and his colleagues cite findings that seem to underscore the differences between early reading and spelling.

Should we, then, accept Goswami and Bryant's paradox and suppose that reading and writing are cognitively disjunct at the early stages, even in children who have received training in phonological awareness? We think not. First, as the succeeding section shows, some data (to which we turn next) point to concurrent development of reading and spelling skills. Secondly, it is too early to assess fully the impact on children's reading and spelling of the several experimental approaches to instruction in phonological awareness (e.g., Ball & Blachman, 1988; 1991; Blachman, 1991; Byrne & Fielding-Barnsley, 1991; in press). Therefore, we believe that the question must remain open.

A new research study, which coordinated the investigation of spelling and reading in six year olds (the subjects were selected only for age), does not find evidence that incompatible strategies are employed by beginners (Shankweiler, 1992). Unlike the Bradley and Bryant study, the test




words in this experiment included no words with irregular spellings. The test words did contain phonological complexities, however. Each contained a consonant cluster at the beginning or the end.

There was a wide range in level of achievement within this group of six year olds. Nine of the 26 children were unable to read and spell more than one word correctly. The remaining 17 were able to read a mean of 70 percent of the words correctly but were able to correctly spell only 39 percent. These findings show that the spelling difficulties of beginners are not confined to irregular words.3 Regularly spelled words can cause difficulty if they are phonologically complex, as when they contain consonant clusters. With the exception of one child, all read more words correctly than they were able to spell. Finally, analytic skill in reading, as indexed by ability to read nonwords, was almost perfectly correlated (r = .93) with spelling performance (on a variety of real words).4 These data do not sit well with the conclusion that early reading and spelling are cognitively dissociated. On the contrary, the findings lend support to the idea that skill in reading and spelling tend to develop concurrently over a wide range of individual differences in attainment.

It is notable that spelling accuracy consistently lagged somewhat behind reading. Only 6 percent of the words were spelled correctly and read incorrectly, whereas 37 percent were read and not spelled. Thus the children showed what might be expected to be true generally: that spelling the words would prove to be more difficult than reading them, if by reading we mean correct identification of individual words, and by spelling we mean spelling these words according to standard conventions.

INTERPRETING ERROR PATTERNS IN SPELLING AND READING

So far we have been comparing spelling and reading at a coarse level of analysis. To address more rigorously the question of whether new learners use similar or dissimilar strategies for spelling and reading we would wish to make a detailed comparison between the error pattern in spelling words and reading them. But, as it happens, this turns out to be a difficult thing to do.

Problems of comparability

Most of the published information on the correlations between reading and spelling scores is based simply on right/wrong scoring. This approach has the disadvantage of throwing away much of the potential information in the incorrect responses. It fails to distinguish reading errors that are near misses from errors that are wild guesses, and it does not distinguish misspellings that capture much of a word's phonological structure from those that capture little of it. If we give partial credit for wrong responses, we must create a scheme to evaluate the many possible ways of misspelling a word and assign relative weights to each.

As an illustration of how we might proceed, we turn again to the research study last described (Shankweiler, 1992). In this study, reading was assessed by the Decoding Skills Test (DST; Richardson & DiBenedetto, 1986). The test consists of 60 real words, chosen to give representation to the major spelling patterns of English, and, importantly, it also includes an equal number of matching nonwords, the latter formed by changing one to three letters in each of the corresponding words. For the purposes at hand, phonotactically legal nonwords constitute the best measure of reading for assessing the skills of the beginning reader, because only these can provide a true measure of decoding skill. Because they are truly unfamiliar entities, nonwords test whether a reader's knowledge of the orthography is productive. As noted earlier, only that kind of knowledge enables someone to read new words not previously encountered in print (see Shankweiler, Crain, Brady, & Macaruso, 1992). Responses to the Decoding Skills Test were recorded on audiotape and transcribed in IPA phonemic symbols for later comparison with the spelling measures.

To gain a fine-grained measure of spelling for comparison with the reading error measures, the children's written spellings were scored phoneme by phoneme, using the following categories (a brief illustrative sketch of this scoring follows the list):

Correct spelling
Phonologically acceptable substitute (e.g., k for ck)
Phonologically unacceptable substitute (e.g., c for ch)
Phoneme not represented
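As a concrete rendering of this scheme, the sketch below is our own illustration, not the study's actual procedure: it assumes the child's letters have already been aligned with the target phonemes, and its table of acceptable substitutes is a toy stand-in for the one actually used.

    # Minimal sketch of the four-category, phoneme-by-phoneme scoring scheme.
    ACCEPTABLE = {
        "k": {"k", "c", "ck"},    # e.g., K for CK is phonologically acceptable
        "ch": {"ch", "tch"},
        "s": {"s", "ss", "c"},
    }

    def score_phoneme(phoneme, target, produced):
        """Classify one phoneme slot of a child's written spelling."""
        if not produced:
            return "phoneme not represented"
        if produced == target:
            return "correct spelling"
        if produced in ACCEPTABLE.get(phoneme, set()):
            return "phonologically acceptable substitute"
        return "phonologically unacceptable substitute"

    # TRIK for 'trick': the final /k/ is written K where CK is conventional.
    print(score_phoneme("k", "ck", "k"))   # phonologically acceptable substitute
    print(score_phoneme("ch", "ch", "c"))  # phonologically unacceptable substitute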

When we try to compare the error pattern in reading and spelling, we encounter a further difficulty: Reading is a covert process that is assessed only by its effects. One cannot directly infer what goes on in the head when someone attempts to read a word. When we ask the child to read aloud unconnected words in list form, we




encounter an obstacle: children are often unwilling to make their guesses public. Of course, a beginning reader who is stuck on a particular word may be entertaining a specific hypothesis about the word's identity, but in the absence of an overt response, we cannot discover the hypothesis and use it as a basis for inferring the source of the difficulty.

Writing, on the other hand, leaves a visible record of the writer's hypothesis about how to spell a word. The findings of the study we have been discussing bear this out. Many of the children declined the experimenter's invitation to guess at the words they were having difficulty in reading. Yet the same children produced a spelling for nearly every word they were asked to write. The upshot is that we have nearly a complete set of responses to the spelling test, but many gaps in the record occur on the corresponding items on the reading test. This yields an unsatisfactory data base for comparing the error pattern in spelling and reading. Thus, the kind of word-by-word comparison we would like to make may be unattainable.

Nonetheless, there is much to be gained by a linguistic analysis of children's spellings. Indeed, it is chiefly through their writing, and not through their reading, that children reveal their hypotheses about the infrastructure of words.

Children's conceptions of the infrastructure of words as revealed in their spellings

When encouraged to invent spellings for words, young children invent a system that is more compatible with their linguistic intuitions than the standard system. Whether the result corresponds to standard form is simply not a question that would occur to the child at this stage. In Carol Chomsky's words, creative spellers "appear to be more interested in the activity than the product" (1979, p. 46). There is evidence that children's invented spellings tend to be closer to the phonetic surface than the spellings of the standard system (Read, 1986). The standard system of English, as we noted, maps lexical items at a level that is highly abstract, both because the conventional system is morphophonemic, and because it tends not to transcribe phonetic detail that is predictable from general phonological rules.

In the comparative study of reading and writing in six year olds which we have discussed (Shankweiler, 1992), even the least-advanced beginners, who wrote only a single letter to represent an entire word, usually chose a

consonant that could represent the first phoneme in the word. A child who does this is apparently aware that letters represent phonological entities even though she is not yet able to analyze the internal structure of the syllable. Altogether, first consonants were represented in 95% of cases. There was a strong tendency to omit the second segment of a consonant cluster: that is, the L in CL, the T in ST, the M in SM, the R in CR, and so forth. These were omitted in 56% of occurrences, yet when these consonants occurred alone in initial position, they were rarely omitted. Bruck and Treiman (1990) report the same trends, both in normal children and dyslexics. The tendency to omit the second segment from an initial cluster fits with Treiman's idea (1992) that children may initially use letters to represent syllable onsets and rimes rather than phonemes.5

The ability to represent the second segment of initial consonant clusters was a very good predictor of overall spelling achievement. It was also a good predictor of the accuracy of word reading. Regression analysis showed that this part score accounted for 45 percent of the variance in either spelling or reading when a different set of words is tested, after age, vocabulary (Dunn, Dunn, & Whetton, 1982) and a measure of phonemic segmentation skill (Kirtley, 1989) had already been entered. Representation of the interior segment in final clusters does almost as well when entered in the regression. The results of fine scoring give further support to the view that reading and spelling skill are closely linked even in beginners.
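The logic of that hierarchical regression can be shown in a short sketch. Everything below is our own illustration on simulated data (the variable names and generated numbers are invented, not the study's): the control variables are entered first, and the increment in R-squared attributable to the cluster part score is then read off.

    # Hierarchical regression: increment in R^2 after entering controls.
    import numpy as np

    def r_squared(predictors, y):
        """R^2 of an ordinary least-squares fit with an intercept."""
        X = np.column_stack([np.ones(len(y))] + list(predictors))
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    rng = np.random.default_rng(0)
    n = 26                                    # group size in the study
    age = rng.normal(6.5, 0.3, n)
    vocabulary = rng.normal(100.0, 10.0, n)
    segmentation = rng.normal(0.0, 1.0, n)
    part_score = 0.5 * segmentation + rng.normal(0.0, 1.0, n)
    reading = 0.8 * part_score + 0.3 * segmentation + rng.normal(0.0, 1.0, n)

    base = r_squared([age, vocabulary, segmentation], reading)
    full = r_squared([age, vocabulary, segmentation, part_score], reading)
    print(f"increment in R^2 after controls: {full - base:.2f}")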

Why are consonant clusters a special source of difficulty? Two possibilities might be considered, each related to the phonetic complexity of clusters. First, it is well known that clusters cause pronunciation difficulties for young children. Perhaps the spelling error signals a general tendency to simplify these consonant clusters, a failure to perceive and produce them as two phonemes. But there was no indication that this was the case. All the children could pronounce the cluster words without difficulty.

An alternative possibility is that the children had difficulty in conceptually breaking clusters apart and representing them as two phonemes. In that case, the difficulty in spelling could be seen as a problem in phonological awareness. So, also, could the problems in reading the cluster words. Reading analytically would require the reader to decompose the word into its constituent segments, and the presence of clusters would increase the difficulty of making this analysis.




Research conducted during the past two decades has shown that phonological awareness is not all of a piece. Full phoneme awareness is a late stage in a process of maturation and learning that takes years to complete (Bradley & Bryant, 1983; Liberman et al., 1974; Morais, Cary, Alegria, & Bertelson, 1979; Treiman & Zukowski, 1991). Although the order of acquisition is not completely settled, there is evidence that before they can segment by phoneme children are able to segment spoken words using larger sublexical units: onsets and rimes, and syllables, particularly stressed syllables that rhyme (Brady, Gipstein, & Fowler, 1992; Liberman et al., 1974; Treiman, 1992).

The role of literacy instruction in fostering the development of phonological awareness has been much discussed in the research literature (see chapters in Brady & Shankweiler, 1991, and in Gough, Ehri, & Treiman, 1992). In this connection, Treiman (1991) urges that an analysis of spelling is the best route by which to study those aspects of phonological awareness that depend on experience with reading and writing. We would tend to agree. This is not to say, however, that writing, but not reading, would feed this development in young children. It is to be expected that a child's interest and curiosity about the one activity would encourage and nourish an interest in the other.6

To sum up, because reading and writing are secondary language functions derived from spoken language, they display a very different course of acquisition than speech itself: unlike speech, mastery of alphabetic writing requires facility in decomposing words into phonemes and morphemes. Since both reading and writing depend upon grasp of the alphabetic principle, it could be expected that both would develop concurrently, though spelling, being the more difficult, would progress more slowly. Several researchers, however, have raised challenging questions about the order of precedence, suggesting that spelling, due to the inherently segmental nature of writing words alphabetically, emerges earlier than the ability to decode in reading. At present, the evidence is mixed. It is significant that recent research comparing children's reading and spelling errors indicates that in both spelling and reading, regularly spelled words present difficulties to beginners when the words contain phonologically complex consonant clusters. Thus, beginners' difficulties in reading and spelling do not necessarily involve different kinds of words, as had been suggested earlier. This undercuts the claim of incompatible strategies.

Whether a child initially adopts a logographic or an analytic strategy for reading may depend in large part on the kind of pre-reading instruction the child was provided with. There is evidence that both phonological awareness and knowledge of letter-phoneme correspondences are important to promote grasp of the alphabetic principle, and are thus important to skill in spelling and decoding (Ball & Blachman, 1988; 1991; Bradley & Bryant, 1983; Byrne & Fielding-Barnsley, 1991; Gough, Juel, & Griffith, 1992). Neither is sufficient alone. The phasing of these two necessary components of instruction may turn out to be critical in determining the child's initial approach to the orthography.

REFERENCES

Adams, M. J. (1990). Beginning to read: Thinking and learning about print. Cambridge, MA: MIT Press.

Ball, E. W., & Blachman, B. A. (1988). Phoneme segmentation training: Effect on reading readiness. Annals of Dyslexia, 38, 208-225.

Ball, E. W., & Blachman, B. A. (1991). Does phoneme awareness training in kindergarten make a difference in early word recognition and developmental spelling? Reading Research Quarterly, 26, 49-66.

Blachman, B. A. (1991). Phonological awareness: Implications for prereading and early reading instruction. In S. A. Brady & D. P. Shankweiler (Eds.), Phonological processes in literacy: A tribute to Isabelle Y. Liberman. Hillsdale, NJ: Lawrence Erlbaum Associates.

Bradley, L., & Bryant, P. E. (1979). The independence of reading and spelling in backward and normal readers. Developmental Medicine and Child Neurology, 21, 504-514.

Bradley, L., & Bryant, P. E. (1983). Categorizing sounds and learning to read: A causal connection. Nature, 301, 419-421.

Brady, S. A., Gipstein, M., & Fowler, A. E. (1992). The development of phonological awareness in preschoolers. Presentation at AERA annual meeting, April 1992.

Brady, S. A., & Shankweiler, D. P. (Eds.). (1991). Phonological processes in literacy: A tribute to Isabelle Y. Liberman. Hillsdale, NJ: Lawrence Erlbaum Associates.

Bruck, M., & Treiman, R. (1990). Phonological awareness and spelling in normal children and dyslexics: The case of initial consonant clusters. Journal of Experimental Child Psychology, 50, 156-178.

Byrne, B. (1992). Studies in the acquisition procedure for reading: Rationale, hypotheses, and data. In P. B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 1-34). Hillsdale, NJ: Lawrence Erlbaum Associates.

Byrne, B., & Fielding-Barnsley, R. (1990). Acquiring the alphabetic principle: A case for teaching recognition of phoneme identity. Journal of Educational Psychology, 82, 805-812.

Byrne, B., & Fielding-Barnsley, R. (1991). Evaluation of a program to teach phonemic awareness to young children. Journal of Educational Psychology, 83, 451-455.

Carlisle, J. F. (1987). The use of morphological knowledge in spelling derived forms by learning disabled and normal students. Annals of Dyslexia, 37, 90-108.

Carlisle, J. F. (1988). Knowledge of derivational morphology and spelling ability in fourth, sixth, and eighth graders. Applied Psycholinguistics, 9, 247-266.




Chomsky, C. (1971). Write first, read later. Childhood Education, 47, 296-299.

Chomsky, C. (1979). Approaching reading through invented spelling. In L. B. Resnick & P. A. Weaver (Eds.), Theory and practice of early reading, vol. 2 (pp. 43-65). Hillsdale, NJ: Lawrence Erlbaum Associates.

Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper and Row.

Cossu, G., Shankweiler, D., Liberman, I. Y., Tola, G., & Katz, L. (1988). Phoneme and syllable awareness in Italian children. Applied Psycholinguistics, 9, 1-16.

DeFrancis, J. (1989). Visible speech: The diverse oneness of writing systems. Honolulu: University of Hawaii Press.

Dunn, L. M., Dunn, L. M., & Whetton, C. (1982). British picture vocabulary scale. Berks., England: NFER-Nelson.

Ehri, L. C. (1989). The development of spelling knowledge and its role in reading acquisition and reading disability. Journal of Learning Disabilities, 22, 356-365.

Ehri, L. C., & Wilce, L. S. (1987). Does learning to spell help beginners learn to read words? Reading Research Quarterly, 22, 47-64.

Fischer, F. W., Shankweiler, D., & Liberman, I. Y. (1985). Spelling proficiency and sensitivity to word structure. Journal of Memory and Language, 24, 423-441.

Frith, U. (1980). Unexpected spelling problems. In U. Frith (Ed.), Cognitive processes in spelling. London: Academic Press.

Gleitman, L. R., & Rozin, P. (1977). The structure and acquisition of reading I: Relations between orthographies and the structure of the language. In A. S. Reber & D. L. Scarborough (Eds.), Toward a psychology of reading. Hillsdale, NJ: Lawrence Erlbaum Associates.

Goswami, U., & Bryant, P. (1990). Phonological skills and learning to read. Hillsdale, NJ: Lawrence Erlbaum Associates.

Gough, P. B., Ehri, L. C., & Treiman, R. (Eds.). (1992). Reading acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates.

Gough, P. B., & Hillinger, M. L. (1980). Learning to read: An unnatural act. Bulletin of the Orton Dyslexia Society, 30, 179-196.

Gough, P. B., Juel, C., & Griffith, P. L. (1992). Reading, spelling, and the orthographic cipher. In P. B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 35-48). Hillsdale, NJ: Lawrence Erlbaum Associates.

Juel, C., Griffith, P. L., & Gough, P. B. (1986). Acquisition of literacy: A longitudinal study of children in first and second grade. Journal of Educational Psychology, 78, 243-255.

Kirtley, C. L. M. (1989). Onset and rime in children's phonological development. Unpublished doctoral dissertation, University of Oxford.

Klima, E. S. (1972). How alphabets might reflect language. In J. F. Kavanagh & I. G. Mattingly (Eds.), Language by ear and by eye. Cambridge, MA: MIT Press.

Liberman, A. M. (1989). Reading is hard just because listening is easy. In C. von Euler, I. Lundberg, & G. Lennerstrand (Eds.), Wenner-Gren Symposium Series 54, Brain and Reading. London: Macmillan.

Liberman, A. M. (1992). The relation of speech to reading and writing. In L. Katz & R. Frost (Eds.), Orthography, phonology, morphology, and meaning (pp. 167-178). Amsterdam: Elsevier Science Publishers.

Liberman, I. Y. (1971). Basic research in speech and lateralization of language: Some implications for reading disability. Bulletin of the Orton Society, 21, 71-87.

Liberman, I. Y. (1973). Segmentation of the spoken word. Bulletin of the Orton Society, 23, 65-77.

Liberman, I. Y., Rubin, H., Duques, S., & Carlisle, J. (1985). Linguistic abilities and spelling proficiency in kindergartners and adult poor spellers. In D. B. Gray & J. F. Kavanagh (Eds.), Biobehavioral measures of dyslexia (pp. 163-176). Parkton, MD: York Press.

Liberman, I. Y., & Shankweiler, D. (1979). Speech, the alphabet, and teaching to read. In L. B. Resnick & P. A. Weaver (Eds.), Theory and practice of early reading, vol. 2 (pp. 109-134). Hillsdale, NJ: Lawrence Erlbaum Associates.

Liberman, I. Y., Shankweiler, D., Fischer, F. W., & Carter, B. (1974). Explicit syllable and phoneme segmentation in the young child. Journal of Experimental Child Psychology, 18, 201-212.

Liberman, I. Y., Shankweiler, D., & Liberman, A. M. (1989). The alphabetic principle and learning to read. In D. Shankweiler & I. Y. Liberman (Eds.), Phonology and reading disability: Solving the reading puzzle. IARLD Monograph Series. Ann Arbor, MI: University of Michigan Press.

Lundberg, I., Frost, J., & Petersen, O.-P. (1988). Effects of an extensive program for stimulating phonological awareness in preschool children. Reading Research Quarterly, 23, 263-284.

Mattingly, I. G. (1992). Linguistic awareness and orthographic form. In L. Katz & R. Frost (Eds.), Orthography, phonology, morphology, and meaning (pp. 11-26). Amsterdam: Elsevier Science Publishers.

Montessori, M. (1964). The Montessori method. New York: Schocken Books.

Morais, J. (1991). Constraints on the development of phonemic awareness. In S. A. Brady & D. P. Shankweiler (Eds.), Phonological processes in literacy: A tribute to Isabelle Y. Liberman. Hillsdale, NJ: Lawrence Erlbaum Associates.

Morais, J., Cary, L., Alegria, J., & Bertelson, P. (1979). Does awareness of speech as a sequence of phonemes arise spontaneously? Cognition, 7, 323-331.

Ognjenović, V., Lukatela, G., Feldman, L. B., & Turvey, M. T. (1983). Misreadings by beginning readers of Serbo-Croatian. Quarterly Journal of Experimental Psychology, 35A, 97-109.

Perin, D. (1983). Phonemic segmentation and spelling. British Journal of Psychology, 74, 129-144.

Perfetti, C. A. (1985). Reading ability. New York: Oxford University Press.

Read, C. (1971). Pre-school children's knowledge of English phonology. Harvard Educational Review, 41, 1-34.

Read, C. (1986). Children's creative spelling. London: Routledge andKegan Paul.

Richardson, E., & DiBenedetto, B. (1986). Decoding skills test.Parkton, MD: York Press.

Rohl, M., & Tunmer, W. E. (1988). Phonemic segmentation skilland spelling acquisition. Applied Psyclwlinguistics, 9, 335-350.

Rubin, H. (1988). Morphological knowledge and early writingability. Language and Speech, 31, 337-355.

Shankweiler, D. (1992). Surmounting the Consonant Cluster inBeginning Reading and Writing. Presentation at AERA annualmeeting, Apri11992.

Shankweiler, D., Crain, S., Brady, S., & Macaruso, P. (1992). Identifying the causes of reading disability. In P. B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 275-305). Hillsdale, NJ: Lawrence Erlbaum Associates.

Shankweiler, D., & Liberman, I. Y. (1972). Misreading: A search for causes. In J. F. Kavanagh & I. G. Mattingly (Eds.), Language by ear and by eye: The relationships between speech and reading (pp. 293-317). Cambridge, MA: MIT Press.

Treiman, R. (1985). Onsets and rimes as units of spoken syllables: Evidence from children. Journal of Experimental Child Psychology, 39, 161-181.

Treiman, R., & Zukowski, A. (1991). Levels of phonological awareness. In S. A. Brady & D. P. Shankweiler (Eds.), Phonological processes in literacy: A tribute to Isabelle Y. Liberman. Hillsdale, NJ: Lawrence Erlbaum Associates.


Treiman, R. (1991). Children's spelling errors on syllable-initial consonant clusters. Journal of Educational Psychology, 83, 346-360.

Treiman, R. (1992). The role of intrasyllabic units in learning to read and spell. In P. B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 65-106). Hillsdale, NJ: Lawrence Erlbaum Associates.

Treiman, R. (1993). Beginning to spell: A study of first-grade children. New York: Oxford University Press.

Venezky, R. L. (1970). The structure of English orthography. The Hague: Mouton.

Zifcak, M. (1981). Phonological awareness and reading acquisition. Contemporary Educational Psychology, 6, 117-126.

FOOTNOTES

*L. Katz & R. Frost (Eds.), Orthography, phonology, morphology, and meaning (pp. 179-192). Amsterdam: Elsevier Science Publishers (1992).

†Also University of Connecticut, Storrs.
‡University of Connecticut, Storrs.

1For example, it has often been noted that aspirated and unaspirated /p/, /t/, and /k/ are not distinguished in English spelling. In the word COCOA, for example, both the initial and medial consonants are spelled alike although phonetically and acoustically they are different.

2Often children who have had little or no formal instruction attempt to write words using the letters that they know, together with their conceptions of the phonetic values of the letters and the segmental composition of the words they wish to write. This phenomenon has been studied extensively by Read (1986). The question of whether invented spellings can regularly be elicited from children with varied educational and family backgrounds was addressed by Zifcak (1981). In a study of 23 inner-city six-year-olds from blue-collar families, it was found that nearly all the children were willing to make up spellings for words, though most had little knowledge of the standard orthography.

3These results are in full agreement in this respect with those of Treiman (1993), who carried out a comprehensive study of spelling in six-year-olds. The findings of both studies support the caveat that one should not be too quick to attribute children's spelling errors to the irregularities of English orthography.

4Spelling was correlated with reading real words, .91 and .81, respectively, based on two independent measures of reading.

5The onset consists of the string of consonants preceding the vowel nucleus. When the onset consists of a single consonant, as in the example of CAR, Treiman (1985) showed that children may treat it as a segment distinct from the remainder of the syllable, which corresponds to the rime. At the same time, they are unable to decompose the rime into separable components. An invented spelling like CR for CAR or BL for BELL is consistent with such partial knowledge of the internal structure of the syllable.

6Adams (1990), Ehri (1989; Ehri & Wilce, 1987), and Treiman (in press) reach a similar conclusion.


Haskins Laboratories Status Report on Speech Research 1993, SR-113, 145-152

Word Superiority in Chinese*

Ignatius G. Mattingly† and Yi Xu†

In a line of research that began with Cattell (1886), it has been demonstrated that words play a special role in the recognition of text. A letter string that forms a word is recognized faster and more accurately than a nonword string; a letter is recognized faster and more accurately if it is presented as part of a word than if it is presented alone or as part of a nonword (e.g., Reicher, 1969; for a review, see Henderson, 1982). This constellation of findings, often referred to as "word superiority," suggests that the reader does not simply process the text letter by letter, and that words are crucial. The letters in a word are processed so automatically that the reader is unaware of recognizing them, and when he is required to report a letter presented in a word, he finds it most efficient to infer the identity of the letter from that of the word.

Most of these experiments, however, have been carried out for writing systems like that of English, in which the "frame" units (W. S.-Y. Wang, 1981) explicitly correspond to linguistic words. However, some writing systems lack this property, notably that of Chinese. The frames of the Chinese system, the characters, correspond to monomorphemic syllables rather than to words, even though it is true that because there are many monosyllabic words in Chinese, there are many characters that stand for words as well as for syllables. But most Chinese words are polymorphemic and are therefore written with strings of two or more characters, and these strings are not specially demarcated in text. Moreover, there are many bound morphemes in Chinese, and a character for such a morpheme occurs only as an element of a character string.

It is of some importance to establish whether word superiority is observable in writing systems in which the frames are not words. If word superiority is not found in these systems, we would have to view the word superiority found for word-frame systems as merely orthographic, to be attributed, perhaps, to the reader's long experience in dealing with this particular kind of frame. In the case of Chinese, we would then expect to find evidence for the superiority of morphemic syllables. On the other hand, if word superiority is found even when the frames of the writing system do not correspond to words, we would have to say that the superiority of the word must depend essentially on its linguistic rather than its orthographic status.

This work was supported by NICHD Grant HD-01994 to Haskins Laboratories.

At least two other investigators have examined word superiority in Chinese. C.-M. Cheng (1981) compared the accuracy with which a briefly presented target character could be identified in real-word and in nonword two-character strings. (A nonword consisted of two valid characters that did not constitute a real word.) A string containing the target character was presented either preceding or following a "distractor" string, and the subject had to report whether the target character was in the first or the second string. Performance was better for words than for nonwords, and better for high-frequency words than for low-frequency words.

Cheng also carried out a second experiment, parallel to the first, in which the targets were radicals presented as components of real characters, of pseudocharacters, or of noncharacters. Pseudocharacters were created by interchanging radicals in two real characters, keeping their position within the character constant; noncharacters were created by interchanging radicals and locating them in orthographically impossible positions. The character containing the target was presented either preceding or following a distractor character, and subjects had to report whether the target was in the first or the second character. Performance was better for real characters than for pseudocharacters, and better for pseudocharacters than for noncharacters. This result is consistent with a word-superiority effect for monomorphemic words; however, as Hoosain (1991) suggests, it is not conclusive. Because the real characters stood for morphemic syllables as well as for words, the result is equally consistent with morphemic-syllable superiority. It would be desirable to repeat the experiment, comparing characters standing for bound morphemes with characters for free morphemes.

I.-M. Liu (1988) asked subjects to pronounce the character occurring in first, second, or third position in word strings, in pseudoword strings, or in isolation, and measured reaction time. (The position of a character in isolation apparently refers to the position of the same character when presented in a word or pseudoword.) Characters in real words were pronounced faster than characters in pseudowords in some though not in all positions. However, characters in isolation were pronounced as fast as, and in some positions faster than, characters in words. Thus, if an advantage for the real-word context over isolation is considered essential for word superiority, Chinese may not properly be said to exhibit this phenomenon, and Cheng's first experiment may, as Liu argues, merely demonstrate "compound-word superiority."

But perhaps it is not reasonable to expect the real-word context to be superior to isolation if the target characters may themselves be words. We would not be surprised to find the advantage of words over single letters for English breaking down if I or a, which happen to be words as well as letters, were the targets. Analysis of Liu's results for target characters standing for bound morphemes might clarify the issue.

The experiment reported here explores further the phenomenon of word superiority in Chinese. We asked whether a character is recognized faster when part of a two-character word than when part of a two-character pseudoword. Our experiment thus complements Cheng's (1981) first experiment, in which accuracy of identification was the dependent variable. The paradigm we used was adapted from Meyer, Schvaneveldt, and Ruddy (1974) and has already been used for Chinese by C.-M. Cheng and S.-I. Shih (1988). In each trial in our experiment, a Chinese subject saw a sequence of two characters on a monitor screen. This sequence might consist of two genuine Chinese characters, which might form either a real word or a pseudoword. In either case the subject was to respond, "Yes." Alternatively, the sequence might consist of a genuine character preceded or followed by a pseudocharacter, in which case the subject was to respond, "No." Reaction time was measured for all responses.
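[Editorial sketch.] To make the trial structure concrete, the following is a minimal sketch of the timing logic in Python. It is illustrative only: the actual software was a Macintosh program (see Procedure below), and the display routine, response routine, and stimulus strings here are invented stand-ins supplied by the caller.

    # Illustrative trial loop: present(), collect_key(), and the stimuli
    # are stand-ins, not the original experimental software.
    import time

    def run_trial(sequence, both_real, present, collect_key, timeout=2.0):
        onset = time.perf_counter()            # timing starts at draw onset
        present(sequence)                      # draw the two characters
        key = collect_key(timeout)             # 'yes', 'no', or None
        rt_ms = (time.perf_counter() - onset) * 1000.0
        correct = key == ("yes" if both_real else "no")
        return rt_ms, correct

    # Example with stand-in display and response functions:
    rt, ok = run_trial("AB", both_real=True,
                       present=lambda s: print("Stimulus:", s),
                       collect_key=lambda t: "yes")
    print("RT = %.1f ms, correct = %s" % (rt, ok))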

Methods

Design. To make possible the various critical comparisons in which we were interested, a rather complex design was used; see Table 1. There were three main types of two-character sequences: real bimorphemic Chinese words; pseudowords, each, like Cheng's (1981) nonwords, consisting of two real characters that did not form a real word; and pseudocharacter sequences, each consisting of a real character preceded or followed by a pseudocharacter. The real characters were drawn from the inventory used in the People's Republic of China, and thus included some "simplified" characters not used elsewhere. The pseudocharacters, like Cheng's, each consisted of a genuine, appropriately located semantic radical and a genuine, appropriately located phonetic radical that do not actually occur together in the Chinese character inventory.

Table 1. Experimental design.

Sequence type:   Real words (128)               Pseudowords (128)   Pseudocharacter seqs. (256)
Word F:          High F (64)     Low F (64)
Char F:          HH HL LH LL     HH HL LH LL    HH HL LH LL         HN LN NH NL
Items:           16 16 16 16     16 16 16 16    32 32 32 32         64 64 64 64
Subj. group A:    4  4  4  4      4  4  4  4     8  8  8  8         16 16 16 16
Subj. group B:    4  4  4  4      4  4  4  4     8  8  8  8         16 16 16 16
Subj. group C:    4  4  4  4      4  4  4  4     8  8  8  8         16 16 16 16
Subj. group D:    4  4  4  4      4  4  4  4     8  8  8  8         16 16 16 16

(H = high-frequency character, L = low-frequency character, N = pseudocharacter.)


There were 128 real words, half of which were of relatively high frequency and half of relatively low frequency. Each of these two word-frequency sets was divided into four secondary sets with the four possible character-frequency patterns: high-high, high-low, low-high, low-low. We relied on H. Wang et al. (1986) for information about word frequencies and character frequencies. Finally, each of these eight secondary sets was divided arbitrarily into four tertiary four-word sets A, B, C, D. There were in all 32 such tertiary sets. The average number of strokes per character was kept approximately equal across these tertiary sets.

There were 128 pseudowords. For each of the tertiary sets, four pseudowords were formed by swapping initial characters within the set, discarding any resulting real words. Thus, pseudowords were perfectly matched with real words with respect to character frequency.

There were 256 pseudocharacter sequences, each consisting of a real character and a pseudocharacter. Four pseudocharacter sequences were derived by swapping the semantic radicals of the word-initial characters within each of the original tertiary sets, discarding any resulting real characters. Four more pseudocharacter sequences were formed by repeating this operation with word-final characters.
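[Editorial sketch.] The pseudoword swap can be schematized as follows, under stated assumptions: single letters stand in for characters, the lexicon is a toy set, and the particular pairing of donor and recipient words is one arbitrary choice among those the description above allows. The radical-swapping operation for pseudocharacters is analogous, applied within characters rather than within words.

    # Pseudoword construction by swapping initial characters within a
    # tertiary set; strings that happen to be real words are discarded.
    def make_pseudowords(tertiary_set, lexicon):
        pseudo = []
        n = len(tertiary_set)
        for i in range(n):
            donor = tertiary_set[(i + 1) % n]      # initial from another word
            candidate = donor[0] + tertiary_set[i][1]
            if candidate not in lexicon:           # discard accidental words
                pseudo.append(candidate)
        return pseudo

    words = ["AB", "CD", "EF", "GH"]               # one toy tertiary set
    print(make_pseudowords(words, set(words)))     # ['CB', 'ED', 'GF', 'AH']

Because each pseudoword keeps one character in its original position and takes the other from the same tertiary set, character frequency is matched exactly between real words and pseudowords.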

From these materials, four different but equivalent tests were compiled. Each test included 32 real words, 32 pseudowords, and 64 pseudocharacter sequences. In a particular test, the real words, the pseudowords, and the pseudocharacter sequences were composed of unrelated tertiary sets. For example, one test consisted of set A real words, set B pseudowords, pseudocharacter sequences with word-initial pseudocharacters based on set C, and pseudocharacter sequences with word-final pseudocharacters based on set D. None of the 256 characters occurred more than once within a test, and each test was balanced with respect to word frequency, character-frequency pattern within a sequence, ordinal position of pseudocharacters, and number of strokes per character. Across the four tests, each of the original 256 real characters occurred equally often in a real word and in a pseudoword, and equally often in a real-character sequence and in a pseudocharacter sequence.

Subjects. There were 53 subjects. All were speakers of Mandarin and graduate students or spouses of graduate students at the University of Connecticut. All had been born and educated in the People's Republic of China, and were thus familiar with the simplified characters. Subjects were paid for their participation in the experiment.

Procedure. Subjects were divided arbitrarily into four equal groups, and each group received a different test. Each subject was tested separately. A subject was told that on each trial in the test, he would see a sequence of two characters on the Macintosh computer monitor. If he was sure that both were genuine Chinese characters, he was to press the key designated as "Yes." Otherwise, he was to press the "No" key. The next trial began two seconds after the subject's response, or, if he failed to respond, two seconds after the sequence had appeared. Before the experiment began, the subject was given a 24-trial practice session with feedback.

Responses and reaction times were automatically recorded by the computer, using a program written by Leonard Katz and slightly modified by us. Reaction time for a trial was measured from the instant the two characters began to be written on the monitor screen. This measurement was subject to an error of ±8.33 msec because the write-time could not be known with any greater accuracy.

Results

The data for 13 subjects who responded with less than 90% accuracy were excluded from further analysis. For the remaining 40 subjects, the accuracy rates were 98.5% for real words, 92.2% for pseudowords, and 92.3% for pseudocharacter sequences.

Reaction-time data for the pseudocharacter sequences, for which the correct response was "No," are shown in Figure 1. Reaction times were shorter for pseudocharacter-initial than for pseudocharacter-final sequences. They were also shorter when the real character in the sequence was of high frequency than when it was of low frequency. However, real-character frequency had a much smaller effect for pseudocharacter-initial sequences than for pseudocharacter-final ones.

An analysis of variance was carried out on the pseudocharacter sequence data, for which the factors were: subject group (A/B/C/D), pseudocharacter position (initial/final), and real-character frequency (high/low). There was no effect of subject group: F(3,36) = .729. The effect of pseudocharacter position was highly significant: F(1,36) = 39.58, p < .0001. The effect of real-character frequency was mildly significant: F(1,36) = 6.67, p < .05. There was a highly significant interaction between pseudocharacter position and real-character frequency: F(1,36) = 14.40, p = .0005.
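[Editorial sketch.] For readers who want the factor structure spelled out, here is a sketch of such an analysis using present-day tools (statsmodels in Python) on synthetic data. It is not the original analysis: the column names, effect sizes, and noise level are invented, and the plain OLS ANOVA below ignores the repeated-measures structure of the real design, so it is shown only to make the factors concrete.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(0)
    rows = []
    for subj in range(40):                            # 40 retained subjects
        group = "ABCD"[subj % 4]
        for pos in ("initial", "final"):
            for freq in ("high", "low"):
                rt = 700 if pos == "initial" else 800     # position effect
                if freq == "low":                         # frequency effect,
                    rt += 10 if pos == "initial" else 60  # larger when final
                rows.append(dict(group=group, position=pos,
                                 freq=freq, rt=rt + rng.normal(0, 40)))
    df = pd.DataFrame(rows)

    model = smf.ols("rt ~ C(group) + C(position) * C(freq)", data=df).fit()
    print(anova_lm(model, typ=2))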


Figure 1. Reaction times for pseudocharacter sequences. [Plot: reaction time (600-1000 msec) as a function of real-character frequency (high/low), with separate functions for initial and final pseudocharacter position.]

Figure 2. Reaction times for real word and pseudoword sequences as a function of initial- and final-character frequency. [Two panels: (a) initial-character frequency and (b) final-character frequency, each plotting reaction time (600-1000 msec) against frequency (high/low), with separate functions for three sequence types: pseudowords, low-frequency real words, and high-frequency real words.]


The data for the real words and the pseudowords, for which the correct response was "Yes," are plotted in Figures 2a and 2b. Figure 2a shows the effect of varying initial-character frequency; Figure 2b, the effect of varying final-character frequency. From both figures, it is apparent that reaction times are shorter for real words than for pseudowords, and shorter for high-frequency words than for low-frequency words. For both real words and pseudowords, reaction times are shorter for initial high-frequency characters than for initial low-frequency characters (Figure 2a) and, similarly, shorter for final high-frequency characters than for final low-frequency characters (Figure 2b). However, reaction times for pseudowords are more affected by character frequency than are reaction times for real words.

An analysis of variance was carried out on the real-word and pseudoword data. The factors were: subject group, sequence type (real word/pseudoword), initial-character frequency (high/low), and final-character frequency (high/low). There was no effect of subject group: F(3,36) = .17. The effect of sequence type was highly significant: F(1,36) = 438.61, p < .0001. The effects of both character-frequency factors were highly significant: initial, F(1,36) = 41.47, p < .0001; final, F(1,36) = 66.22, p < .0001. There were significant interactions between sequence type and each of the character-frequency factors: initial, F(1,36) = 7.29, p < .05; final, F(1,36) = 19.31, p < .0001.


An analysis of variance was carried out on the real-word data alone to determine the effects of word frequency. The factors were: subject group, word frequency (high/low), initial-character frequency, and final-character frequency. There was no effect of subject group: F(3,36) = .32. The effect of word frequency was highly significant: F(1,36) = 19.79, p < .0001. The effects of both character-frequency factors were highly significant: initial, F(1,36) = 15.03, p < .0005; final, F(1,36) = 18.31, p < .0001. There was no interaction between word frequency and either of the character-frequency factors: initial, F(1,36) = .24; final, F(1,36) = .65. There was a significant interaction among subject group, word frequency, initial-character frequency, and final-character frequency: F(3,36) = 6.56, p < .005. We believe this interaction to be artifactual, reflecting an unfortunate choice of items in one of the tertiary subsets.

The pseudoword function in Figure 2b is steeper than the pseudoword function in Figure 2a, suggesting that character frequency has more of an effect in final position than in initial position. To explore further the effect of character-frequency order, reaction-time data for high-low and low-high character-frequency patterns are plotted in Figure 3. For pseudowords, reaction times for the low-high pattern are shorter than for the high-low pattern. For real words, there is no comparable effect of character-frequency pattern.

Figure 3. Reaction times for real and pseudoword sequences as a function of character-frequency order. [Plot: reaction time (600-1000 msec) against character-frequency order (high-low/low-high), with separate functions for pseudowords, low-frequency real words, and high-frequency real words.]


An analysis of variance was carried out on the real-word and pseudoword data for the high-low and low-high patterns alone. The factors were: subject group, sequence type, and character-frequency pattern (high-low/low-high). There was no effect of subject group: F(3,36) = .23. The effect of sequence type was significant: F(1,36) = 209.35, p < .0001. The effect of character-frequency pattern fell just short of significance: F(1,36) = 3.86, p = .0571. There was a significant interaction between sequence type and character-frequency pattern: F(1,36) = 4.39, p < .05.

An analysis of variance was also carried out for the high-low and low-high patterns in the real-word data alone. The factors were: subject group, word frequency, and character-frequency pattern. There was no effect of subject group: F(3,36) = .24. The effect of word frequency was significant: F(1,36) = 21.45, p < .0001. There was no effect of character-frequency pattern: F(1,36) = .15. There was no interaction between word frequency and character-frequency pattern: F(1,36) = .06. There was a significant interaction between word frequency and subject group: F(3,36) = 6.06, p < .005. We believe this interaction has the same source as the artifactual interaction mentioned above.

Discussion

These results complement those of Cheng (1981) and provide further evidence of a word-superiority effect for Chinese. The key finding is that a character is evaluated more rapidly when part of a real-word sequence than when part of a pseudoword sequence (Figures 2a, 2b). The advantage for real words is consistent across differences in character-frequency pattern. The magnitude of the advantage, around 200 msec, is very large.

The results also reveal something about the basis of word superiority. Let us compare the way subjects deal with sequences that are real words and sequences that are not: the pseudocharacter sequences and the pseudowords. Common to all three sequence types is the effect of character frequency (Figures 1, 2a, 2b). We can conclude from this simply that word superiority is not magical: word recognition in Chinese, whatever its other properties, is mediated by character identification. The recognition of a word is evidently facilitated by previous encounters with its characters in other words. On the other hand, the fact that we found a word-frequency effect (Figures 2a, 2b), independent of the character-frequency effect, for a task ostensibly requiring only character recognition, suggests that word recognition cannot be simply a matter of recognizing one character at a time, then deciding that a character string is a word.

This proposal is supported by the fact that evidence of serial processing is found for pseudocharacter sequences and pseudowords, but not for words. In the case of the pseudocharacter sequences, it was found that sequences beginning with a pseudocharacter were rejected faster than those beginning with a real character, and that real-character frequency affected the latter but not the former (Figure 1). The obvious interpretation is that if the first character was genuine, the subject evaluated it, the amount of time required depending on the frequency of the character. Then he had to evaluate the pseudocharacter, which required more time, before he could reject the sequence. On the other hand, if the first character was a pseudocharacter, he could reject the sequence as soon as he could evaluate this character; there was no need to consider the second character at all. What is of interest here is simply that the subject is processing the characters serially.
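[Editorial sketch.] This serial account can be written down as a toy model. In the sketch below, the per-character evaluation costs are invented parameters chosen only to reproduce the qualitative pattern in Figure 1; nothing about the actual latencies is implied.

    # Toy serial-evaluation model of "No" responses. Costs are made up.
    COST = {"high": 100, "low": 140, "pseudo": 220}   # hypothetical msec

    def reject_time(first, second, base=500):
        t = base + COST[first]          # evaluate the first character
        if first != "pseudo":           # genuine first character: must also
            t += COST[second]           # evaluate the pseudocharacter
        return t

    # Pseudocharacter-initial: real-character frequency is irrelevant.
    print(reject_time("pseudo", "high"), reject_time("pseudo", "low"))
    # Pseudocharacter-final: real-character frequency matters.
    print(reject_time("high", "pseudo"), reject_time("low", "pseudo"))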

As for the pseudowords, there was an effect of character-frequency order (Figure 3): the effect of character frequency was greater for the initial character than for the final character. Thus, a low-frequency character in initial position inhibited the response more than a high-frequency character in final position facilitated it; conversely, a high-frequency character in initial position facilitated the response more than a low-frequency character in final position inhibited it. We are not able to offer a conclusive explanation for this phenomenon without further experimentation, but it is plausible that when there was a low-frequency character in final position, the subject was apt to delay his evaluation, whereas he was less apt to do so when there was a low-frequency character in initial position because he had to hurry on to evaluate the final character. What is clear, however, is that the phenomenon is an order effect. It suggests that the characters in pseudowords, like those in pseudocharacter sequences, are being processed serially.

This is not the case for real words. There is no effect of character-frequency order (Figure 3). Given the order effects observed for the other two sequence types, this finding has two implications. It suggests first that the characters in real words are processed in parallel. Assume that the "logogens" (Morton, 1969) for Chinese are strings of characters corresponding to words. When a string matching a particular logogen appears in text, all of its constituent characters will be activated at the same or nearly the same time, just like the letters of an English word (Sperling, 1969). This means that in the case of a real word, a subject in our experiment would have had only one decision to make instead of two. Having recognized the word, he would have known immediately that both characters must be genuine. Much of the advantage of real words over pseudowords may be due to this fact.

The second implication is that if logogens for two or more character strings of unequal length are activated, the logogen for the longest string is preferred. For the present experiment, this means that a subject was not free to choose whether to treat a sequence as two separate monomorphemic characters or as a single bimorphemic word. Rather, bimorphemic logogens automatically took precedence if activated. Without this "Longest String Principle," the subject would have been free to waste time by processing the real words serially, just as if they were pseudowords. This principle also explains why readers of Chinese can read rapidly despite the absence of explicit word boundary markers in the text.
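[Editorial sketch.] Computationally, the Longest String Principle amounts to greedy longest-match segmentation. A minimal sketch, with single letters standing in for characters and a toy lexicon:

    # Greedy longest-match segmentation: at each point the longest string
    # matching a lexical entry wins. Lexicon and "characters" are toys.
    def segment(text, lexicon, max_len=4):
        words, i = [], 0
        while i < len(text):
            for k in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + k] in lexicon or k == 1:
                    words.append(text[i:i + k])
                    i += k
                    break
        return words

    lexicon = {"AB", "A", "B", "CD"}      # bimorphemic "AB" outranks A, B
    print(segment("ABCD", lexicon))       # ['AB', 'CD'], not ['A','B','CD']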

Conclusion

Word superiority has been demonstrated for Chinese, and it has been argued that it depends in great part on the reader's ability to process the characters of a word in parallel. Of course, it may be that Liu (1988) is right to insist that experiments like this one and the first experiment in Cheng (1981) demonstrate merely "compound-word superiority"; the superiority of monomorphemic words remains in question. But even compound-word superiority is of great theoretical importance. Because there are no word boundaries in Chinese writing, our results are evidence that word superiority generally depends not on orthographic experience, but on linguistic experience.

REFERENCES

Cattell, J. M. (1886). The time taken up by cerebral operations. Mind, 11, 220-242.

Cheng, C.-M. (1981). Perception of Chinese characters. Acta Psychologica Taiwanica, 23, 137-153.

Cheng, C.-M., & Shih, S.-I. (1988). The nature of lexical access in Chinese: Evidence from experiments on visual and phonological priming in lexical judgment. In I.-M. Liu, H.-C. Chen, & M. J. Chen (Eds.), Cognitive aspects of the Chinese language, Volume 1 (pp. 1-13). Hong Kong: Asian Research Service.

Henderson, L. (1982). Orthography and word recognition in reading. London: Academic Press.

Hoosain, R. (1991). Psycholinguistic implications for linguistic relativity: A case study of Chinese. Hillsdale, NJ: Lawrence Erlbaum.

Liu, I.-M. (1988). Context effects on word/character naming: Alphabetic versus logographic languages. In I.-M. Liu, H.-C. Chen, & M. J. Chen (Eds.), Cognitive aspects of the Chinese language, Volume 1 (pp. 81-92). Hong Kong: Asian Research Service.

Meyer, D. E., Schvaneveldt, R. W., & Ruddy, M. (1974). Functions of graphemic and phonemic codes in visual word recognition. Memory & Cognition, 2, 309-321.

Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76, 165-178.

Reicher, G. M. (1969). Perceptual recognition as a function of the meaningfulness of the stimulus material. Journal of Experimental Psychology, 81, 275-280.

Sperling, G. (1969). Successive approximations to a model for short-term memory. In R. N. Haber (Ed.), Information processing approaches to visual perception (pp. 32-37). New York: Holt, Rinehart & Winston.

Wang, H., Chang, B., Li, Y., Lin, L., Lin, J., Sun, Y., Wang, Z., Yu, Y., Zhang, J., & Li, D. (1986). Frequency dictionary of contemporary Chinese. Beijing: Beijing Language Institute Press.

Wang, W. S.-Y. (1981). Language structure and optimal orthography. In O. L. Tzeng & H. Singer (Eds.), Perception of print: Reading research in experimental psychology (pp. 223-236). Hillsdale, NJ: Lawrence Erlbaum.

FOOTNOTES

*Presented at the Sixth International Symposium on Cognitive Aspects of the Chinese Language, Taipei, Taiwan, September 14, 1993.

†Also Department of Linguistics, University of Connecticut, Storrs.


Haskins Laboratories Status Report on Speech Research 1993, SR-113, 153-170

Prelexical and Postlexical Strategies in Reading: Evidence from a Deep and a Shallow Orthography*

Ram Frost†

The validity of the orthographic depth hypothesis (ODH) was examined in Hebrew by employing pointed (shallow) and unpointed (deep) print. Experiments 1 and 2 revealed larger frequency effects and larger semantic priming effects in naming with unpointed print than with pointed print. In Experiments 3 and 4, subjects were presented with Hebrew consonantal strings that were followed by vowel marks appearing at stimulus onset asynchronies ranging from 0 ms (simultaneous presentation) to 300 ms from the onset of consonant presentation. Subjects were inclined to wait for the vowel marks to appear even though the words could be named unequivocally using lexical phonology. These results suggest that prelexical phonology is the default strategy for readers in shallow orthographies, providing strong support for the ODH.

Most early studies of visual word recognition were carried out in the English language. This state of affairs was partly due to an underlying belief that reading processes (as well as other cognitive processes) are universal, and that studies in English are therefore sufficient to provide a complete account of the processes involved in recognizing printed words. In the last decade, however, studies in orthographies other than English have become more and more prevalent. These studies have in common the view that reading processes cannot be explained without considering the reader's linguistic environment in general, and the characteristics of his writing system in particular.

Various writing systems have evolved over time in different cultures. These writing systems, whether logographic, syllabic, or alphabetic, typically reflect the language's unique phonology and morphology, representing them in an effective way (Mattingly, 1992; Scheerer, 1986). The match between writing system and language insures a degree of efficiency for the reading and the writing process (for a discussion, see Katz & Frost, 1992). Theories of visual recognition differ in their account of how the different characteristics of writing systems affect reading performance. One such characteristic that has been widely investigated is the way the orthography represents the language's surface phonology.

*This work was supported in part by a grant awarded to the author by the Basic Research Foundation administered by the Israel Academy of Science and Humanities, and in part by the National Institute of Child Health and Human Development Grant HD-01994 to Haskins Laboratories. I am indebted to Len Katz for many of the ideas that were put forward in this article, to Orna Moshel and Itchak Mendelbaum for their help in conducting the experiments, and to Ken Pugh, Bruno Repp, Charles Perfetti, Albrecht Inhoff, and Sandy Pollatsek for their comments on earlier drafts of this paper.

The transparency of the relation between spelling and phonology varies widely between orthographies. This variance can often be attributed to morphological factors. In some languages (e.g., in English), morphological variations are captured by phonologic variations. The orthography, however, was designed to preserve primarily morphologic information. Consequently, in many cases, similar spellings denote the same morpheme but different phonologic forms: the same letter can represent different phonemes when it is in different contexts, and the same phoneme can be represented by different letters. The words "steal" and "stealth," for example, are similarly spelled because they are morphologically related. Since in this case, however, a morphologic derivation resulted in a phonologic variation, the cluster "ea" represents both the sounds [i] and [ɛ]. Thus, alphabetic orthographies can be classified according to the transparency of their letter-to-phonology correspondence. This factor is usually referred to as "orthographic depth" (Katz & Feldman, 1981; Klima, 1972; Liberman, Liberman, Mattingly, & Shankweiler, 1980; Lukatela, Popadić, Ognjenović, & Turvey, 1980). An orthography that represents its phonology unequivocally, following simple grapheme-phoneme correspondences, is considered shallow, while in a deep orthography the relation of orthography to phonology is more opaque.

The effect of orthographic depth on reading strategies has been the focus of recent and current controversies (e.g., Besner & Smith, 1992; Frost, Katz, & Bentin, 1987). In general, the argument revolves around the question of whether differences in orthographic depth lead to differences in processing printed words. What is called the orthographic depth hypothesis (ODH) suggests that they do. The ODH suggests that shallow orthographies can easily support a word recognition process that involves the printed word's phonology. This is because the phonologic structure of the printed word can be easily recovered from the print by applying a simple process of grapheme-to-phoneme conversion (GPC). For example, in the Serbo-Croatian writing system the letter-to-phoneme correspondence is consistent and the spoken language itself is not phonologically complex. The correspondence between spelling and pronunciation is so simple and direct that a reader of this orthography can expect that phonological recoding will always result in an accurate representation of the word intended by the writer. A considerable amount of evidence now supports the claim that phonological recoding is extensively used by readers of Serbo-Croatian (see Carello, Turvey, & Lukatela, 1992, for a review). In contrast to shallow orthographies, deep orthographies like English or Hebrew encourage readers to process printed words by referring to their morphology via the printed word's visual-orthographic structure. In deep orthographies lexical access is based mainly on the word's orthographic structure, and the word's phonology is retrieved from the mental lexicon. This is because the relation between the printed word and its phonology is more opaque, and prelexical phonologic information cannot be easily generated (e.g., Frost et al., 1987; Frost & Bentin, 1992b; Katz & Feldman, 1983).
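[Editorial sketch.] The contrast can be made concrete with a toy GPC table. In the sketch below, the letter-to-phoneme mapping is a made-up fragment rather than an actual Serbo-Croatian inventory; the point is only that in a shallow orthography a pronunciation can be assembled with no lexical lookup at all.

    # Toy grapheme-to-phoneme conversion for a shallow orthography:
    # one letter, one phoneme, no lexicon consulted. The table is invented.
    GPC_TABLE = {"m": "m", "l": "l", "e": "e", "k": "k", "o": "o"}

    def assemble_phonology(word):
        return "".join(GPC_TABLE[letter] for letter in word)

    print(assemble_phonology("mleko"))   # pronunciation assembled from print

In a deep orthography no such table suffices, since the same letter string maps onto different phonologic forms depending on the word it belongs to; the phonology must then be addressed from the lexicon.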

The ODH's specific predictions mainly refer to the way a printed word's phonology is generated in the reading process. Because readers of shallow orthographies have simple, consistent, and relatively complete connections between graphemes and subword pronunciation, they can recover most of a word's phonological structure prelexically, by assembling it directly from the printed letters. In contrast, the opaque relation of subword segments and phonemes in deep orthographies prevents readers from using prelexical conversion rules. For these readers, the more efficient process of generating the word's phonologic structure is to rely on a fast visual access of the lexicon and to retrieve the word's phonology from it. Phonology in this case is lexically addressed, not prelexically assembled. According to the ODH, then, the difference between deep and shallow orthographies is the amount of lexical involvement in pronunciation. This does not necessarily entail specific predictions concerning lexical decisions and lexical access, or how meaning is accessed from print.

The specific predictions of the ODH must be discussed with reference to the tools of investigation employed by the experimenter. The issue to be clarified here is what serves as a valid demonstration that phonology is mainly prelexical (assembled) or postlexical (addressed). In general, measures of latencies and error rates for lexical decisions or naming are monitored. The idea is that lexical involvement in pronunciation leaves characteristic traces. The first question to be examined is, therefore, whether the lexical status of a word affects naming latencies. Lexical search results in frequency effects and in lexicality effects: frequent words are named faster than nonfrequent words, and words are named faster than nonwords (but see Balota & Chumbley, 1984, for a discussion of this point). Thus, one trace of postlexical phonology is that naming latencies and lexical decision latencies are similarly affected by the lexical status of the printed stimulus (e.g., Katz & Feldman, 1983). If phonology is assembled from print prelexically, smaller effects related to the word's lexical status should be expected, and consequently lexical status should affect naming and lexical decisions differently. A second method of investigation involves the monitoring of semantic priming effects in naming (see Lupker, 1984; Neely, 1991, for a review). If pronunciation involves postlexical phonology, strong semantic priming effects will be revealed in naming. In contrast, if pronunciation is carried out using mainly prelexical phonology, naming of target words will be less facilitated by semantically related primes.

Keeping the above tools of investigation in mind, the exact predictions of the ODH can now be fully described. Two versions of the hypothesis exist in the current literature. What can be called the strong ODH claims that in shallow orthographies, complete phonological representations can be derived exclusively from the prelexical translation of subword spelling units (letters or letter clusters) into phonological units (phonemes, phoneme clusters, and syllables). According to this view, readers of shallow orthographies perform a phonological analysis of the word based only on a knowledge of these correspondences. Rapid naming, then, is a result of this analytic process only, and does not involve any lexical information (see Katz & Frost, 1992, for a discussion). In contrast to the strong ODH, a weaker version of the ODH can be proposed. According to the weak version of the ODH, the phonology needed for the pronunciation of printed words comes both from prelexical letter-phonology correspondences and from stored lexical phonology. The latter is the result either of an orthographic addressing of the lexicon (i.e., from a whole-word or whole-morpheme spelling pattern to its stored phonology), or of a partial phonologic representation that was assembled from the print and was unequivocal enough to allow lexical access. The degree to which the prelexical process is active is a function of the depth of the orthography; prelexical analytic processes are more functional in shallow orthographies. Whether or not prelexical processes actually dominate orthographic processing for any particular orthography is a question of the demands the two processes make on the reader's processing resources (Katz & Frost, 1992).

It is easy to show that the strong form of the ODH is untenable. It is patently insufficient to account for pronunciation even in shallow orthographies like Spanish, Italian, or Serbo-Croatian. This is so because these orthographies do not represent syllable stress and, even though stress is often predictable, this is not always the case. For example, in Serbo-Croatian, stress for two-syllable words always occurs on the first syllable, but not always for words of more than two syllables. These words can be pronounced correctly only by reference to lexically stored information. The issue of stress assignment is even more problematic in Italian, where stress patterns are much less predictable. In Italian, many pairs of words differ only in stress, which determines the intended meaning (Colombo & Tabossi, 1992; Laudanna & Caramazza, 1992).

Several studies have argued for the obligatory involvement of prelexical phonology in Serbo-Croatian (e.g., Turvey, Feldman, & Lukatela, 1984) or in English (Perfetti, Bell, & Delaney, 1988; Perfetti, Zhang, & Berent, 1992; Van Orden, 1987). Note that the weaker version of the ODH is not inconsistent with these claims as long as it is not claimed that naming is achieved exclusively by prelexical analysis. All alphabetic orthographies may make some use of prelexically derived phonology for word recognition. According to the weak form of the ODH, shallow orthographies should make more use of it than deep orthographies, because prelexical phonology is more readily available in these orthographies. Substantial prelexical phonology may be generated inevitably when reading Serbo-Croatian, for example. This phonological representation may be sufficient for lexical access, but not necessarily for pronunciation. The complete analysis of the word's phonologic and phonetic structure may involve lexical information as well (Carello et al., 1992).

Evidence concerning the validity of the ODH comes from within- and cross-language studies. But note that single-language experiments are adequate only for testing the strongest form of the ODH. The strong ODH requires lexical access and pronunciation to be accomplished entirely on the basis of phonological information derived from correspondences between subword spelling and phonology. That is, the reader's phonological analysis of the printed word, based on his or her knowledge of subword letter-sound relationships, is the only kind of information that is allowed. Thus, merely showing lexical involvement in pronunciation in a shallow orthography would provide a valid falsification of the strong ODH (Seidenberg & Vidanović, 1985; Sebastian-Galles, 1991). However, the demonstration of lexical effects in shallow orthographies cannot in itself be considered evidence against the weak ODH, and a different methodology is necessary to test it. Because it refers to the relative use of prelexical phonologic information in word recognition in different orthographies, experimental evidence bearing on the weak ODH should come mainly from cross-language studies. More important, because the weak ODH claims that readers in shallow orthographies do not use the assembled routine exclusively, the evidence against it cannot come from single-language studies merely showing the use of the addressed routine in shallow orthographies (e.g., Sebastian-Galles, 1991).

The experimental evidence supporting the weak ODH is abundant. Katz and Feldman (1983) compared semantic priming effects in naming and lexical decision in English and Serbo-Croatian, and demonstrated that while semantic facilitation was obtained in English for both lexical decision and naming, in Serbo-Croatian semantic priming facilitated only lexical decision. Similarly, a comparison of semantic priming effects in naming in English and Italian showed greater effects in the deeper English than in the shallower Italian orthography (Tabossi & Laghi, 1992). A study by Frost et al. (1987) involved a simultaneous comparison of three languages, Hebrew, English, and Serbo-Croatian, and confirmed the hypothesis that the use of prelexical phonology in naming varies as a function of orthographic depth. Frost et al. showed that the lexical status of the stimulus (its being a high- or a low-frequency word or a nonword) affected naming latencies in Hebrew more than in English, and in English more than in Serbo-Croatian. In a second experiment, Frost et al. showed a relatively strong effect of semantic facilitation in Hebrew (21 ms), a smaller but significant effect in English (16 ms), and no facilitation (0 ms) in Serbo-Croatian.

Frost and Katz (1989) examined the effects of visual and auditory degradation on the ability of subjects to match printed to spoken words in English and Serbo-Croatian. They showed that both visual and auditory degradation had a much stronger effect in English than in Serbo-Croatian, regardless of word frequency. These results were explained by an extension of an interactive model which rationalized the relationship between the orthographic and phonologic systems in terms of lateral connections between the systems at all of their levels. The structure of these lateral connections was determined by the relationship between spelling and phonology in the language: simple isomorphic connections between graphemes and phonemes in the shallower Serbo-Croatian, but more complex, many-to-one, connections in the deeper English. Frost and Katz argued that the simple isomorphic connections between the orthographic and the phonologic systems in the shallower orthography enabled subjects to restore with ease both the degraded phonemes from the print and the degraded graphemes from the phonemic information. In contrast, in the deeper orthography, because the degraded information in one system was usually consistent with several alternatives in the other system, the buildup of sufficient information for a unique solution to the matching judgment was delayed, so the matching between print and degraded speech, or between speech and degraded print, was slowed.

The psychological reality of orthographic depth is not unanimously accepted. Although it is generally agreed that the relation between spelling and phonology in different orthographies might affect reading processes (especially reading acquisition) to a certain extent, there is disagreement as to the relative importance of this factor. What I will call here "the alternative view" argues that the primary factor determining whether the word's phonology is assembled prelexically or postlexically is not orthographic depth but word frequency. The alternative view suggests that in any orthography, frequent words are very familiar as visual patterns. Therefore, these words can easily be recognized through a fast, visually based lexical access which occurs before a phonologic representation has time to be generated prelexically from the print. For these words, phonologic information is eventually obtained, but only postlexically, from memory storage. According to this view, the relation of spelling to phonology should not affect recognition of frequent words. Since the orthographic structure is not converted into a phonologic structure by use of grapheme-to-phoneme conversion rules, the depth of the orthography does not play a role in the processing of these words. Orthographic depth exerts some influence, but only on the processing of low-frequency words and nonwords. Since such verbal stimuli are less familiar, their visual lexical access is slower, and their phonology has enough time to be generated prelexically (Baluch & Besner, 1991; Tabossi & Laghi, 1992; Seidenberg, 1985).

A few studies involving cross-language research support the alternative view. Seidenberg (1985) demonstrated that in both English and Chinese, naming frequent printed words was not affected by phonologic regularity. This outcome was interpreted to mean that, in logographic as in alphabetic orthographies, the phonology of frequent words was derived postlexically, after the word had been recognized on a visual basis. Moreover, in another study, Seidenberg and Vidanović (1985) found similar semantic priming effects in naming frequent words in English and Serbo-Croatian, suggesting again that the addressed routine plays a major role even in the shallow Serbo-Croatian.

Several studies in Japanese were interpreted as providing support for the alternative view. Although these studies were carried out within one language, they could in principle furnish evidence for or against the weak ODH because they examined reading performance in two different writing systems that are commonly used in Japanese: the deep logographic Kanji and the shallower syllabic Hiragana and Katakana. However, the relevance of these studies to the debate concerning the weak ODH is questionable, as I will point out below. Besner and Hildebrandt (1987) showed that words that were normally written in Katakana were named faster than words written in Katakana that were transcribed from Kanji. They argued that these results suggest that readers of the shallow Katakana did not name the transcribed words using the assembled routine, as otherwise no difference should have emerged in naming the two types of stimuli. In another experiment, Besner, Patterson, Lee, and Hildebrandt (1992) showed that words that are normally seen in Katakana were named more slowly if they were written in Hiragana, and vice versa. This outcome suggests that orthographic familiarity plays a role even in reading the shallower syllabic Japanese orthographies. Taken together, the results from Japanese provide strong evidence against the strong ODH, but they do not contradict the weak ODH. Although these studies compare reading performance in two writing systems, they merely show that readers of the shallower Japanese syllabary do not use the assembled routine exclusively, but use the addressed routine as well. These conclusions, however, are actually the basic tenets of the weak ODH.

More damaging to the weak ODH is a recent study by Baluch and Besner (1991), who employed a within-language, between-orthography design in Persian, similar to the one in Japanese. In this study the authors took advantage of the fact that some words in Persian are phonologically transparent whereas other words are phonologically opaque. This is because three of the six vowels of written Persian are represented in print as diacritics and three are represented as letters. Because (as in Hebrew) fluent readers do not use the pointed script, words that contain vowels represented by letters are phonologically transparent, whereas words that contain vowels represented by diacritics are phonologically opaque. Baluch and Besner demonstrated similar semantic priming effects in naming phonologically transparent and phonologically opaque words of Persian, provided that nonwords were omitted from the stimulus list. These results were interpreted to suggest that the addressed routine was used in naming both types of words.

Thus, when the weak ODH and the traditional alternative view are contrasted, it appears that they differ in one major claim concerning the preferred route of the cognitive system for obtaining the printed word's phonology. The alternative view suggests that in general visual access of the lexicon is "direct" and requires fewer cognitive resources. Moreover, the extraction of phonological information from the mental lexicon is more or less effortless relative to the prelexical extraction of phonological information. Hence, the default of the cognitive system in reading is the use of addressed phonology. The weak ODH denies that the extraction of phonological information from the mental lexicon is effortless. On the contrary, based on findings showing extensive phonologic recoding in shallow orthographies, its working hypothesis is that if the reader can successfully employ prelexical phonological information, then it will be used first; the easier it is, the more often it will be used. Thus, the "default" of the cognitive system in word recognition is the use of prelexical rather than addressed phonology. If the reader's orthography is a shallow one, with uncomplicated, direct, and consistent correspondences between letters and phonology, then the reader will be able to use such information with minimal resources for lexical access. The logic of the ODH lies in a simple argument: the so-called "fast," "effortless" visual route is available to readers in all orthographies, including the shallow ones. If in spite of that it can be demonstrated that readers of shallow orthographies do prefer prelexical phonological mediation over visual access, it is because it is more efficient and faster for these readers.

The present study addresses this controversy by looking at a deeper orthography, namely Hebrew. In Hebrew, letters represent mostly consonants, while vowels can optionally be superimposed on the consonants as diacritical marks. The diacritical marks, however, are omitted from most reading material, and are found only in poetry, children's literature, and religious texts. In addition, like other Semitic languages, Hebrew is based on word families derived from triconsonantal roots. Therefore, many words (whether morphologically related or not) share a similar or identical letter configuration. If the vowel marks are absent, a single printed consonantal string usually represents several different spoken words. Thus, in its unpointed form, the Hebrew orthography does not convey to the reader the full phonemic structure of the printed word, and the reader is often faced with phonological ambiguity. In contrast to the unpointed orthography, the pointed orthography is a very shallow writing system. The vowel marks convey the missing phonemic information, making the printed word phonemically unequivocal. Although the diacritical marks carry mainly vowel information, they also differentiate in some instances between fricative and stop variants of consonants. Thus the presentation of vowels considerably reduces several aspects of phonemic ambiguity (see Frost & Bentin, 1992b, for a discussion).

Several studies have shown that lexical decisions to Hebrew phonologically ambiguous words are given prior to any phonological disambiguation, suggesting that lexical access in unpointed Hebrew is accomplished more often via orthographic representations than via phonological recodings of the orthography (Bentin & Frost, 1987; Bentin, Bargai, & Katz, 1984; Frost & Bentin, 1992a; Frost, 1992; and see Frost & Bentin, 1992b, for a review). The purpose of the present study was to show that even the reader of Hebrew, who is exposed almost exclusively to unpointed print and generally uses the lexical addressed routine, prefers to use the prelexical assembled routine when possible. Note that the probability of extending the cross-language findings (e.g., Frost et al., 1987) to a within-Hebrew experimental design suffers from the fact that readers in general are used to regularly employing strategies of reading that arise from the characteristics of their orthography. Obviously, if it can be shown that even in spite of this factor Hebrew readers are willing to alter their reading strategy and adopt a prelexical routine, it will certainly provide very significant evidence in favor of the ODH.

EXPERIMENT 1

In Experiment 1 lexical decision and naming performance with pointed and unpointed print were compared. The aim of the experiment was to assess the effects of lexical factors on naming performance relative to lexical decision in the deep and shallow forms of the Hebrew orthography. This experimental design is very similar to that employed by Frost et al. (1987) in the first experiment of their multilingual study. Frost et al. found that in unpointed Hebrew the lexical status of the stimuli affected lexical decision latencies in the same way that it affected naming latencies. In contrast, in English and in Serbo-Croatian the pronunciation task was much less affected by the lexical status of the stimuli. The shallower the orthography, the more naming performance deviated from lexical decision performance. This outcome confirmed that, in unpointed Hebrew, phonology was generated mainly using the addressed routine, whereas in the shallower English orthography the assembled routine had a more significant role. Finally, in the shallowest orthography, Serbo-Croatian, the assembled routine dominated the addressed routine, resulting in a nonsignificant frequency effect. The purpose of Experiment 1 was to investigate whether a similar gradual deviation of naming from lexical decision performance would be observed in pointed relative to unpointed Hebrew. If the weak ODH is correct, then the reader of Hebrew should use the assembled routine to a greater extent when naming pointed print than when naming unpointed print.

Method

Subjects. One hundred and sixty undergraduate students from the Hebrew University, all native speakers of Hebrew, participated in the experiment for course credit or for payment.

Stimuli and Design. The stimuli were 40 high-frequency words, 40 low-frequency words, and 80 nonwords. All stimuli were three to five letters long, and contained two syllables with four to six phonemes. The average number of letters and phonemes was similar for the three types of stimuli. All words could be pronounced as only one meaningful word. Nonwords were created by randomly altering one or two letters of high- or low-frequency real words that were not employed in the experiment. The nonwords were all pronounceable and did not violate the phonotactic rules of Hebrew. In the absence of a reliable frequency count in Hebrew, the subjective frequency of each word was estimated using the following procedure: A list of 250 words was presented to 50 undergraduate students, who rated the frequency of each word on a 7-point scale from very infrequent (1) to very frequent (7). The rated frequencies were averaged across all 50 judges, and all words in the present study were selected from this pool. The average frequency of the high-frequency words was 4.0, whereas the average frequency of the low-frequency words was 2.0. Examples of pointed and unpointed Hebrew words are presented in Figure 1.
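For concreteness, the rating procedure can be sketched as follows (a minimal Python illustration with hypothetical variable names; the original report does not describe any software, and drawing the two 40-item sets from the extremes of the pool is an assumption consistent with the reported means):

from statistics import mean

def average_ratings(ratings_per_judge):
    """ratings_per_judge: one dict per judge mapping word -> 1..7 rating."""
    words = ratings_per_judge[0].keys()
    return {w: mean(judge[w] for judge in ratings_per_judge) for w in words}

def select_items(avg_rating, n=40):
    """Take the n highest- and n lowest-rated words from the 250-word pool."""
    ranked = sorted(avg_rating, key=avg_rating.get, reverse=True)
    return ranked[:n], ranked[-n:]   # high-frequency set, low-frequency set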

Figure 1. Example of the unpointed and the pointed forms of a Hebrew printed word (phonologic structure: /psanter/; semantic meaning: piano).

There were four experimental conditions: the stimuli could be presented pointed or unpointed, for naming or for lexical decision. Forty different subjects were tested in each experimental condition. This blocked design was identical to that of the Frost et al. (1987) study.

Procedure and apparatus. The stimuli were presented on a Macintosh SE computer screen, and a bold Hebrew font, size 24, was used. Subjects were tested individually in a dimly lighted room. They sat 70 cm from the screen so that the stimuli subtended a horizontal visual angle of 4 degrees on the average. The subjects communicated lexical decisions by pressing a "yes" or a "no" key. The dominant hand was always used for the "yes" response. In the naming task, response latencies were monitored by a Mura DX-118 microphone connected to a voice key. Each experiment started with 16 practice trials, which were followed by the 160 experimental trials presented in one block. The trials were presented at a 2.5 sec intertrial interval.
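As a worked check on this viewing geometry (an illustration added here, not part of the original report), a stimulus of horizontal extent w viewed at distance d subtends the angle theta = 2 arctan(w/2d), so

\[ w \;=\; 2d\tan(\theta/2) \;=\; 2 \times 70\ \mathrm{cm} \times \tan(2^\circ) \;\approx\; 4.9\ \mathrm{cm}, \]

that is, the reported 4-degree average corresponds to letter strings roughly 5 cm wide on the screen.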

Results

Means and standard deviations of RTs for correct responses were calculated for each subject in each of the four experimental conditions. Within each subject/condition combination, RTs that were outside a range of 2 SDs from the respective mean were excluded, and the mean was recalculated. Outliers accounted for less than 5% of all responses. This procedure was repeated in all the experiments of the present study.
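The trimming rule can be made explicit with a short sketch (Python, hypothetical data layout; the analysis software actually used is not specified in the report):

from statistics import mean, stdev

def trimmed_mean(rts, n_sd=2):
    """Exclude RTs beyond n_sd standard deviations from the cell mean,
    then recompute the mean, as described above."""
    m, sd = mean(rts), stdev(rts)
    kept = [rt for rt in rts if abs(rt - m) <= n_sd * sd]
    return mean(kept), len(rts) - len(kept)   # trimmed mean, outliers removed

# Applied once per subject/condition cell; across the whole study
# fewer than 5% of responses were excluded this way.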

Mean RTs and error rates for high-frequency words, low-frequency words, and nonwords in the different experimental conditions are presented in Table 1.1 Point presentation had very little effect in the lexical decision task. In contrast, in the naming task the effects of frequency and lexical status of the stimulus were more pronounced in the unpointed presentation than in the pointed presentation. Moreover, naming was more similar to lexical decision performance in unpointed than in pointed print.

The statistical significance of these differences was assessed by an analysis of variance (ANOVA) across subjects (F1) and across stimuli (F2), with the main factors of stimulus type (high-frequency words, low-frequency words, nonwords), point presentation (pointed, unpointed), and task (naming, lexical decision). The main effect of stimulus type was significant (F1(2,312) = 267.0, MSe = 1550, p<0.001; F2(2,157) = 119.0, MSe = 4705, p<0.001). The main effects of point presentation and task were not significant in the subject analysis (F1(1,156) = 1.7, MSe = 21,506, p<0.2; F1(1,156) = 2.1, MSe = 21,506, p<0.2, respectively), but were significant in the stimulus analysis (F2(1,157) = 63.7, MSe = 3258, p<0.001; F2(1,157) = 23.1, MSe = 3258, p<0.001, respectively). Point presentation interacted with task (F1(1,156) = 4.3, MSe = 21,506, p<0.04; F2(1,157) = 97.9, MSe = 1877, p<0.001). Stimulus type interacted both with point presentation and with task (F1(2,312) = 3.0, MSe = 1550, p<0.05; F2(2,157) = 11.9, MSe = 832, p<0.001; F1(2,312) = 30.0, MSe = 1550, p<0.001; F2(2,157) = 17.4, MSe = 3258, p<0.001, respectively). The three-way interaction did not reach significance. This was probably due to a similar slowing of nonword latencies relative to the low-frequency words in the two prints. The difference between naming pointed and unpointed presentations was most conspicuous when examining the effect of word frequency. A Tukey-A post-hoc analysis (p<0.05) revealed that while there was a significant frequency effect in naming unpointed print (28 ms), there was no significant frequency effect in naming pointed print (9 ms). Finally, the correlation of RTs in the naming and the lexical decision tasks was calculated for both types of print presentation. RTs in the two tasks were highly correlated in the unpointed presentation (r=0.56), but much less so in the pointed presentation (r=0.28). This suggests that the lexical status of the stimuli affected lexical decision and naming similarly in unpointed print but not in pointed print.
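The two aggregations behind the F1/F2 analyses, and the naming/lexical decision correlation, can be sketched as follows (Python; the trial records and field names are hypothetical, not the author's code):

from collections import defaultdict
from statistics import mean

def cell_means(trials, unit):
    """Mean RT per analysis unit ('subject' for F1, 'item' for F2)
    in each stimulus type x point presentation x task cell."""
    cells = defaultdict(list)
    for t in trials:
        cells[(t[unit], t["stim_type"], t["points"], t["task"])].append(t["rt"])
    return {cell: mean(rts) for cell, rts in cells.items()}

def pearson_r(xs, ys):
    """Correlation of per-item naming and lexical decision means,
    computed separately for pointed and for unpointed print."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5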

Table 1. RTs and percent errors in the lexical decision and naming tasks for high-frequency words, low-frequency words, and nonwords in unpointed and pointed print.

                            UNPOINTED PRINT                      POINTED PRINT
                   High-freq.  Low-freq.  Nonwords      High-freq.  Low-freq.  Nonwords

LEXICAL DECISION      529         617        659           545         627        664
                      (4%)        (7%)       (5%)          (5%)        (5%)       (5%)

NAMING                569         597        664           541         550        604
                      (0%)        (6%)       (2%)          (0%)        (2%)       (9%)


Discussion

The results of Experiment 1 replicate to a certain extent the findings of Frost et al. (1987), with a within-language, between-orthography design. The pattern of naming and lexical decision latencies in unpointed Hebrew was almost identical to the pattern obtained in the Frost et al. study. However, the relations of lexical decision to naming latencies in pointed Hebrew were similar to those obtained by Frost et al. in the shallower orthographies of English and Serbo-Croatian. The patterns of response times in naming and lexical decisions were fairly similar in unpointed print. This suggests a strong reliance on the addressed routine in naming when the vowel marks were not present. In contrast, naming deviated from the pattern obtained in the lexical decision task when the vowel marks were presented. The frequency effect almost disappeared, and the overall difference between naming high-frequency words and nonwords was considerably reduced. This outcome suggests that the presentation of vowel marks encouraged the readers to adopt a strategy of assembling the printed word's phonology by using prelexical conversion rules. Note, however, that although the overall difference between high-frequency words and nonwords was smaller in pointed print than in unpointed print, the difference between low-frequency words and nonwords was fairly similar in the two conditions (54 ms in pointed print vs. 67 ms in unpointed print). This outcome is not fully consistent with a prelexical computation of printed stimuli in the shallow pointed print. A possible explanation for the unexpectedly slow naming latencies for pointed nonwords is that although phonology could be easily computed prelexically in pointed print, subjects were also sensitive to the familiarity of the printed stimuli. Readers of Hebrew read mostly unpointed print but have extensive experience in reading pointed print as well. Because they are obviously more familiar with reading words than nonwords, the slower latencies for nonwords could be attributed to response factors rather than to factors related to the computation of phonology.

EXPERIMENT 2

The aim of Experiment 2 was to examine the effects of semantic facilitation in the naming task when pointed and unpointed words are presented. Semantic priming effects in the naming task have been used in several studies to monitor the extent of lexical involvement in pronunciation (e.g., Baluch & Besner, 1991; Frost et al., 1987; Tabossi & Laghi, 1992). In general, it has been shown that in languages with deep orthographies such as English, semantic facilitation in naming can easily be obtained, although the effects are usually smaller than the effect obtained in the lexical decision task (Lupker, 1984; Neely, 1991). In contrast, in shallow orthographies such as Serbo-Croatian, semantic priming effects were not obtained in some studies (e.g., Katz & Feldman, 1983; Frost et al., 1987), but have been shown in other studies (e.g., Carello, Lukatela, & Turvey, 1988; Lukatela, Feldman, Turvey, Carello, & Katz, 1989).

An examination of the weak version of the ODH entails, however, a comparison of semantic priming effects in a deep and a shallow orthography. Katz and Feldman (1983) showed greater semantic facilitation in English than in Serbo-Croatian. Similarly, Frost et al. (1987) demonstrated a gradual decrease in semantic facilitation from Hebrew (deepest orthography) to English (shallower) to Serbo-Croatian (shallowest). Similar results were reported by Tabossi and Laghi (1992), who compared English to Italian. Recently, Baluch and Besner (1991) have challenged these findings, suggesting that all those cross-language differences were obtained because nonwords were included in the stimulus lists. According to their proposal, the inclusion of nonwords encouraged the use of a prelexical naming strategy in the shallower orthographies. Indeed, when nonwords were not included in the stimulus lists, no differences in semantic facilitation were found in naming phonologically opaque and phonologically transparent words in Persian.

Experiment 2 was designed to examine the weak version of the ODH using a semantic priming paradigm, while considering the hypothesis that the effects of orthographic depth are caused by the mere inclusion of nonwords in the stimulus lists. For this purpose, semantic facilitation in naming target words was examined in the pointed and unpointed Hebrew orthography, when only words were employed.

Method

Subjects. Ninety-six undergraduate students from the Hebrew University, all native speakers of Hebrew, participated in the experiment for course credit or for payment.

Stimuli and design. The stimuli were 48 target words that were paired with semantically related and semantically unrelated primes. Related targets and primes were two instances of a semantic category. In order to avoid repetition effects, two lists of stimuli were constructed. Targets presented with semantically related primes in one list were unrelated in the other list, and vice versa. Each subject was tested on one list only. There were two experimental conditions: in one condition all stimuli were pointed, and in the other they were unpointed. Forty-eight subjects were tested in each condition, 24 on each list. The targets were three- to five-letter words and had two or three syllables. Both primes and targets were unambiguous, and each could be read as a meaningful word in only one way.
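This counterbalancing can be sketched as follows (Python; the pair structure and alternating assignment are hypothetical conveniences, not taken from the original materials):

def build_priming_lists(pairs):
    """pairs: (target, related_prime, unrelated_prime) triples.
    Returns two lists such that a target primed by its related word
    on one list gets its unrelated prime on the other, and vice versa."""
    list_a, list_b = [], []
    for i, (target, related, unrelated) in enumerate(pairs):
        if i % 2 == 0:
            list_a.append((related, target))      # related pairing on list A
            list_b.append((unrelated, target))    # unrelated pairing on list B
        else:
            list_a.append((unrelated, target))
            list_b.append((related, target))
    return list_a, list_b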

Procedure and apparatus. An experimental session consisted of 16 practice trials followed by the 48 test trials. Each trial consisted of a presentation of the prime for 750 ms followed by the presentation of the target. Subjects were instructed to read the primes silently and to name the targets aloud. The exposure of the target was terminated by the subject's vocal response, and the intertrial interval was 2 seconds. The apparatus was identical to the one used in the naming task of Experiment 1.

Results

RTs in the different experimental conditions are presented in Table 2. Naming of related targets was faster than naming of unrelated targets in both pointed and unpointed print. However, semantic facilitation was twice as large with unpointed print as with pointed print.

Table 2. Reaction times and percent errors to related and unrelated targets with pointed and unpointed print.

                  Unpointed Print    Pointed Print

Unrelated              531                512
                       (1%)               (2%)

Related                509                503
                       (4%)               (1%)

Priming Effect          22                  9

The statistical significance of these effects was assessed by an ANOVA across subjects (F1) and across stimuli (F2), with the main factors of semantic relatedness (related, unrelated) and print type (pointed, unpointed). The effect of semantic relatedness was significant (F1(1,47) = 20.9, MSe = 656, p<0.001; F2(1,94) = 33.2, MSe = 363, p<0.001). The effect of print type was significant in the subject analysis (F1(1,47) = 30.8, MSe = 232, p<0.001), but not in the stimulus analysis (F2(1,94) = 1.0). More important to our hypothesis, the two-way interaction was significant in both analyses (F1(1,47) = 6.6, MSe = 207, p<0.01; F2(1,94) = 4.7, MSe = 363, p<0.03). A Tukey-A post-hoc analysis of the interaction (p<0.05) revealed that the semantic facilitation was significant with unpointed print, but not with pointed print.

Discussion

As in Experiment 1, naming in pointed print was found to be faster than naming in unpointed print, as revealed by the difference in RTs in the unrelated condition. However, whereas semantic relatedness substantially accelerated naming in the related condition in unpointed print, it had a much smaller effect on naming latencies in pointed print. Thus the results of Experiment 2 suggest that semantic facilitation is stronger in the deeper than in the shallower Hebrew orthography. This outcome is in complete agreement with the findings of Frost et al. (1987) and Tabossi and Laghi (1992). Note, however, that the greater effect of semantic facilitation in unpointed print than in pointed print was obtained even though nonwords were not included in the stimulus set. These results conflict with the findings of Baluch and Besner (1991). We will refer to this in the general discussion.

EXPERIMENT 3

The major claim of the weak ODH is that in all orthographies both prelexical and lexical phonology are involved in naming. Thus, the use of lexical or prelexical phonology is not an all-or-none process but a quantitative continuum. The degree to which a prelexical process of assembling phonology predominates over the lexical routine depends on the costs involved in assembling phonology directly from the print. The ODH suggests that unless the costs are too high in terms of processing resources (as is usually the case in deep orthographies), the default strategy of the cognitive system is to assemble a phonologic code for lexical access, not to retrieve it from the lexicon following visual access. The following two experiments examined this aspect of the hypothesis by manipulating the processing costs for generating a prelexical phonological representation.

The manipulation of cost consisted of delaying the presentation of the vowel marks relative to the presentation of the consonant letters. In Experiment 3 subjects were presented with unambiguous letter strings that could be read as one meaningful word only (i.e., only one vowel combination created a meaningful word). The letters were followed by the vowel marks, which were superimposed on the consonants, but at different time intervals ranging from 0 ms (in fact, a regular pointed presentation) to 300 ms from the onset of consonant presentation. Subjects were instructed to make lexical decisions or to name the words as soon as possible.

The vowel marks allow the easy prelexical assembly of the word's phonology using spelling-to-phonology conversion rules. However, since the letter strings were unambiguous and could be read as a meaningful word in only one way, they could be easily named using the addressed routine, by accessing the lexicon visually. In fact, as we have shown in numerous studies, this naming strategy is characteristic of reading unpointed Hebrew (see Frost & Bentin, 1992b, for a review). The question was, therefore, whether subjects are inclined to delay their response and wait for the vowel marks, which are not indispensable either for correct lexical decisions or for unequivocal pronunciation. If they are willing to wait for the vowel marks to appear, then it must be because they prefer the option of prelexical assembly of phonology, just as the ODH would predict. The relative use of the assembled and addressed routines was also verified by using both high-frequency and low-frequency words in the stimulus lists. It was assumed that the more subjects rely on the vowel marks for naming using GPC (grapheme-phoneme correspondence) conversion rules, the smaller would be the frequency effect in this task. Nonwords were introduced as well to allow a baseline assessment of the lagging costs. Note that nonwords cannot be unequivocally pronounced before the vowel marks are presented. In order to pronounce them correctly, subjects have to wait the full lag period. However, since at least some of the articulation program can be launched immediately after the consonants are presented, the absolute difference in nonword RTs between the presentation at lag 0 and at lag 300 would reflect the actual cost in response time for lagging the vowel marks by 300 ms. Any lag effect smaller than this difference would suggest that subjects did not wait the full lag period but combined both prelexical and lexical routines to formulate their response.

Method

Subjects. Ninety-six undergraduate students from the Hebrew University, all native speakers of Hebrew, participated in the experiments for course credit or for payment. Forty-eight participated in the lexical decision task and 48 in the naming task.

Stimuli and Design. The stimuli were 40 high-frequency words (mean frequency 5.0 on the previously described 1-7 scale), 40 low-frequency words (mean frequency 3.6), and 80 nonwords. All words were unambiguous; their pronunciation was unequivocal, that is, they could be read as a meaningful word in only one way. Nonwords were created by altering one letter of a real word, and could not be read as a meaningful word with any vowel configuration. All stimuli were three to five letters long, and contained two syllables with four to six phonemes. The average number of letters and phonemes was similar for the three types of stimuli.

The words were presented to the subjects for lexical decision or naming. Forty-eight different subjects were tested in each task with the same stimuli. The stimuli were presented in four lag conditions: 0, 100, 200, and 300 ms. Each lag was defined by the SOA (stimulus onset asynchrony) between the presentation of the consonants and the vowel marks. Thus, at lag 0 the consonants and the vowel marks were presented simultaneously, at lag 100 the consonants were presented first and the vowel marks were superimposed 100 ms later, etc. Four lists of words were formed; each list contained 160 stimuli that were composed of 10 high-frequency words, 10 low-frequency words, and 20 nonwords in each of the four lag conditions. The stimuli were rotated across lists by a Latin Square design so that words that appeared at one lag in one list appeared at another lag in another list, etc. The purpose of this rotation was to test each subject in all lagging conditions while avoiding repetitions within a list. The subjects were randomly assigned to each list and to each experimental condition (lexical decision or naming).
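A minimal sketch of such a rotation (Python; the grouping of items is hypothetical, but the assignment scheme is the standard Latin Square described above):

LAGS = [0, 100, 200, 300]   # SOA in ms between consonants and vowel marks

def rotate_lists(item_groups):
    """item_groups: four disjoint groups of stimuli. Returns four lists,
    each assigning every group a different lag, so that across lists
    every item appears at every lag without repeating within a list."""
    lists = []
    for rotation in range(len(LAGS)):
        assignment = {}
        for g, group in enumerate(item_groups):
            for item in group:
                assignment[item] = LAGS[(g + rotation) % len(LAGS)]
        lists.append(assignment)
    return lists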

Procedure and apparatus. The procedure and apparatus were identical to those used in the previous experiments. The only difference was that the letters and the vowel marks appeared with the different SOAs. The clock for measuring response times was initiated on each trial with the presentation of the letters, regardless of vowel marks. The subjects were informed of the SOA manipulation but were requested to communicate their lexical decisions or vocal responses as soon as possible, that is, without necessarily waiting for the vowel marks to appear. Each session started with 16 practice trials. The 160 test trials were presented in one block.


Results

Mean RTs for high-frequency words, low-frequency words, and nonwords in the different lag conditions for both the lexical decision and the naming tasks are presented in Table 3. Although the response pattern at each of the different lags is important, for the purpose of simplicity the "lag effect" specified in Table 3 reflects the difference between the simultaneous presentation (lag 0) and the longest SOA (lag 300).
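As a worked example of this measure (using the lexical decision means from Table 3 below; the computation is simply the difference between the two endpoint lags):

# Mean lexical decision RTs (ms) at lag 0 and lag 300, from Table 3.
lexical_decision = {"HF": {0: 546, 300: 571},
                    "LF": {0: 626, 300: 663},
                    "NW": {0: 641, 300: 710}}

lag_effect = {stim: rts[300] - rts[0] for stim, rts in lexical_decision.items()}
print(lag_effect)   # {'HF': 25, 'LF': 37, 'NW': 69}, matching the table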

The lagging of the vowel information had relatively little effect in the lexical decision task. The presentation of vowel marks 300 ms after the letters delayed subjects' responses by only 25 ms for high-frequency words, and 37 ms for low-frequency words. The lagging of vowels had a somewhat greater effect on the nonwords. Across all lags, the frequency effect was stable, about 85 ms on the average. In contrast to lexical decisions, the lagging of the vowel information had a much greater influence on naming. However, the frequency effect was very small at lags 0 and 100.

The statistical significance of these differences was assessed in a three-way ANOVA with the factors of task (lexical decision, naming), stimulus type (high-frequency, low-frequency, nonwords), and lag (0, 100, 200, 300), across subjects and across stimuli. The main effect of task was significant (F1(1,94) = 4.2, MSe = 70,811, p<0.04; F2(1,157) = 35, MSe = 8320, p<0.001), as was the main effect of stimulus type (F1(2,188) = 252, MSe = 4914, p<0.001; F2(2,157) = 106, MSe = 13,354, p<0.001), and the main effect of lag (F1(3,282) = 98, MSe = 2495, p<0.001; F2(3,471) = 94, MSe = 2587, p<0.001). Task interacted with stimulus type (F1(2,188) = 23.1, MSe = 4914, p<0.001; F2(2,157) = 13, MSe = 8320, p<0.001) and with lag (F1(3,282) = 13.4, MSe = 2495, p<0.001; F2(3,471) = 22, MSe = 1387, p<0.001). Lag interacted with stimulus type (F1(6,564) = 9.3, MSe = 2076, p<0.001; F2(6,471) = 9, MSe = 2587, p<0.001). The three-way interaction did not reach significance in the subject analysis (F1 = 1.3) but was marginally significant in the stimulus analysis (F2(6,471) = 2.1, MSe = 1387, p<0.05). A Tukey-A post-hoc test revealed that the frequency effect in naming was significant only at the longer lags (200, 300 ms SOA), and not at the shorter lags (0, 100 ms SOA), whereas in lexical decisions the frequency effect was significant at all lags (p<0.05).

Discussion

The results of Experiment 3 suggest that the effective cost of delaying vowel marks in lexical decisions in Hebrew is relatively low. A lag of 300 ms in vowel mark presentation resulted in a 25 ms difference in response time for high-frequency words and 37 ms for low-frequency words. This outcome suggests that subjects were not inclined to wait for the vowel marks to formulate their lexical decisions; rather, their responses were based on the recognition of the letter cluster. This interpretation converges with previous studies showing that lexical decisions in Hebrew are not based on a detailed phonological analysis of the printed word, but rely on a fast judgment of the printed word's orthographic familiarity based on visual access to the lexicon (Bentin & Frost, 1987; Frost & Bentin, 1992b). The slightly higher cost of lagging the vowel marks with low-frequency words, and the even higher one for nonwords, suggests that stimulus familiarity played a role in the decision strategy. Thus, for words that were less familiar, and for nonwords, subjects were more conservative in their decisions and were inclined to wait longer for the vowel marks, possibly in order to obtain a prelexical phonological code as well. In addition to the lag effect, a strong frequency effect was obtained at all lags. This provides further confirmation of lexical involvement in the task, regardless of the vowel mark presentation.

A very different pattern of results emerged in the naming task. The effective cost of delaying the vowel marks in naming was much higher. The lag effect obtained in naming words was about 70 ms on the average, twice as large as the effect found for lexical decisions. Thus, although the phonologic structure of the unambiguous words could be unequivocally retrieved from the lexicon following visual access (addressed phonology), subjects were more inclined to wait for the vowels to appear in the naming task than in the lexical decision task, presumably in order to generate a prelexical phonologic code. The lag effect for words suggests that subjects combined both prelexical and lexical routines to name words, but the relative use of the prelexical and lexical routines varied as a function of the delay in vowel mark presentation. This lag effect should first be compared to the effect obtained for naming nonwords. Because nonwords cannot be named correctly without the vowel marks, the lag effect for nonwords reflects the overall cost in response time due to a 300 ms delay of vowel mark presentation. This cost is smaller than the lag itself because the articulation program can be initiated as the letters appear, previous to the vowel mark presentation. The results suggest that the cost of lagging the vowel marks by 300 ms is about 150 ms in response time.


Table 3. Lexical decision and naming RTs and percent errors for high-frequency words (HF), low-frequency words (LF), and nonwords (NW) when vowel marks appear at different lags after letter presentation. Words are unambiguous.

                       Lexical Decision                          Naming
                                             Lag                                     Lag
LAG            0      100    200    300     Effect       0      100    200    300   Effect

HF            546    551    560    571       25         587    608    610    648     62
             (6%)   (6%)   (7%)   (6%)                 (4%)   (3%)   (4%)   (3%)

LF            626    642    644    663       37         598    624    648    678     80
            (12%)  (10%)  (11%)  (10%)                 (5%)   (3%)   (4%)   (5%)

Frequency
effect         80     91     84     92                   11     16     38     30

NW            641    664    684    710       69         655    700    736    799    144
             (9%)   (8%)   (8%)   (9%)                 (5%)   (7%)   (5%)   (7%)

Thus, the adoption of a pure prelexical strategy for naming words should have resulted in a similar lag effect. The results suggest that this was not the strategy employed in naming words: The lag effect for words was only half the effect obtained for nonwords. However, the lag effect for naming words should also be compared to the lag effect obtained in the lexical decision task. Note that, as in the lexical decision task, subjects did not need the vowel marks (and a prelexical phonologic code) to name the words correctly. Nevertheless, in contrast to lexical decision, they preferred to wait longer for the vowel information. It could be argued that the inclusion of nonwords induced the prelexical strategy in naming. However, since the lag effect for words was much smaller than the lag effect for nonwords, this suggests that a combination of the assembled and addressed routines was used in pronunciation.

Another possible interpretation could be suggested, however, to account for the lag effect in naming. Because in the naming task subjects are required to produce the correct pronunciation of the printed word, they might have adopted a strategy of waiting for the vowels in order to verify the phonologic structure of the printed word which they generated lexically. By this interpretation, phonology was lexically addressed but verified at a second stage against the delayed vowel information to confirm the lexically retrieved pronunciation. The verification interpretation would regard the difference in naming words and nonwords in this paradigm as the difference between having words verified by the subsequently presented vowels, which is short, and having to construct a pronunciation from the vowel information, which is longer.

Although this interpretation is plausible, it is not entirely supported by the effect of word frequency on response latencies. The suggestion that the combined use of prelexical and lexical phonology varied as a function of the cost of generating a prelexical phonological code is reinforced by examining the frequency effect in naming. In contrast to lexical decision, the frequency effect in naming was overall much smaller, especially at the shorter SOAs. At the 0 ms SOA the frequency effect decreased from 80 ms in lexical decision to 11 ms in naming. This supports the conclusion that naming with the simultaneous presentation of vowels was mainly prelexical and scarcely involved the mental lexicon. The longer the SOA, the larger the cost of generating a prelexical phonologic code and, consequently, the greater the use of addressed lexical phonology. This is reflected in the increase in the frequency effect at the longer SOAs (38 and 30 ms for the 200 and 300 ms SOAs, respectively). This pattern stands in sharp contrast to the lexical decision task, which yielded very similar frequency effects at all lags.2

EXPERIMENT 4

The aim of Experiment 4 was to examine the effect of lagging the vowel marks when phonologically ambiguous words (heterophonic homographs) are presented for lexical decision or naming. In contrast to unambiguous words, the correct phonological structure of Hebrew heterophonic homographs can be determined unequivocally only by referring to the vowel marks, which specify the correct phonological alternative and consequently the correct meaning. Thus, if the analysis provided in the previous experiment concerning the cost of lagging the vowel marks is correct, the effect of presenting ambiguous words on naming should be similar to the effect of presenting nonwords. In both cases, subjects would have to rely on the vowel marks for generating a phonological representation necessary for pronunciation. In contrast, lagging the vowel marks of ambiguous words should affect lexical decisions to a much lesser extent. This is because lexical decisions for ambiguous words have previously been shown to be based on the abstract orthographic structure, and to occur prior to the process of phonological disambiguation (Bentin & Frost, 1987; Frost & Bentin, 1992a).

In Experiment 4 subjects were presented with letter strings that could be read as two meaningful words, depending on the vowel configuration assigned to the letters. Nonwords were presented as well. The vowel marks were superimposed on the letters at different lags, the same as those employed in Experiment 3. The relative use of orthographic and phonologic coding in the two tasks was again assessed by measuring the effect of lagging the vowel information on lexical decision and naming.

Method

Subjects. One hundred and sixty undergraduate students from the Hebrew University, all native speakers of Hebrew, participated in the experiment for course credit or for payment. Eighty participated in the lexical decision task and 80 in the naming task. None of the subjects had participated in the previous experiment.

Stimuli and design. The words were 40 ambiguous consonant strings, each of which represented both a high-frequency and a low-frequency word. The two phonological alternatives were mostly nouns or adjectives that were not semantically or morphologically related. The procedure for assessing the subjective frequencies of the words was similar to the one employed in the previous experiments: Fifty undergraduate students were presented with lists that contained the pointed disambiguated words related to the 40 ambiguous letter strings, and rated the frequency of each word on a 7-point scale. The rated frequencies were averaged across all 50 judges. Each of the 40 homographs that were selected for this study represented two words that differed in their rated frequency by at least 1 point on that scale. As in the previous experiments, the nonwords were constructed by altering one or two letters of meaningful words. The design was similar to that of Experiment 3. However, in order to avoid repetition of the same letter string with different vowel marks, eight lists of words were presented to the subjects instead of four (hence the larger number of subjects). Each list contained only one form (the dominant or the subordinate) of each homograph, at one of the four possible lags, so that each subject was presented with 40 words and 40 nonwords in a list. The stimuli were rotated across lists by a Latin Square design, and consequently each letter string was presented with both vowel mark configurations at all possible lags.

Procedure and apparatus. The procedure and apparatus were identical to those of Experiment 3. Each session started with 16 practice trials, followed by the 80 test trials, which were presented in one block.

Results

Mean RTs for the dominant alternatives, the subordinate alternatives, and the nonwords in the different lag conditions for both the lexical decision and the naming tasks are presented in Table 4. As in Experiment 3, the lagging of the vowel information had little influence on lexical decision latencies. In fact, the use of ambiguous words reduced the difference between the simultaneous presentation of vowels and their presentation 300 ms after the letters to 21 ms for words on the average, and to only 6 ms for nonwords. A different pattern emerged in the naming task. The effects of lagging the vowel marks on RTs were twice as large as the effects found for unambiguous words in Experiment 3, where there was only one meaningful pronunciation. In fact, the lag effect on ambiguous words was virtually identical to the lag effect on nonwords.

The statistical significance of these differences was assessed in a three-way ANOVA with the factors of task (lexical decision, naming), stimulus type (dominant, subordinate, nonwords), and lag (0, 100, 200, 300), across subjects and across stimuli. The main effect of task was significant (F1(1,158) = 39, MSe = 114,281, p<0.001; F2(1,117) = 314, MSe = 7379, p<0.001), as was the main effect of stimulus type (F1(2,316) = 42, MSe = 7541, p<0.001; F2(2,117) = 10.5, MSe = 11,737, p<0.001), and the main effect of lag (F1(3,474) = 69, MSe = 8459, p<0.001; F2(3,351) = 113, MSe = 2528, p<0.001). Task interacted with stimulus type (F1(2,316) = 6.6, MSe = 7541, p<0.001; F2(2,117) = 5.8, MSe = 7379, p<0.001) and with lag (F1(3,474) = 46, MSe = 8459, p<0.001; F2(3,351) = 76, MSe = 2616, p<0.001). Lag did not interact with stimulus type (F1, F2 < 1.0). The three-way interaction did not reach significance (F1 = 1.3, F2 = 1.4).


Table 4. Lexical decision and naming RTs and percent errors for dominant alternatives, subordinate alternatives, and nonwords (NW) when vowel marks appear at different lags after letter presentation. Words are ambiguous.

                       Lexical Decision                          Naming
                                             Lag                                     Lag
LAG              0      100    200    300   Effect       0      100    200    300   Effect

Dominant        585    608    608    611     26         619    650    692    763    142
               (4%)   (5%)   (3%)   (3%)                (3%)   (2%)   (4%)   (3%)

Subordinate     611    620    626    627     16         641    699    739    790    150
               (7%)   (5%)   (4%)   (2%)                (5%)   (4%)   (6%)   (4%)

NW              624    629    635    630      6         677    713    755    828    151
               (9%)   (9%)  (10%)  (10%)                (5%)   (4%)   (8%)   (7%)

Discussion

The results of Experiment 4 suggest that the cost of delaying the vowel marks for phonologically ambiguous letter strings in the lexical decision task was very low. This outcome supports the conclusions put forward by Frost and Bentin (1992a), suggesting that lexical decisions in Hebrew are based on the recognition of the abstract root or orthographic cluster and do not involve access to a specific word in the phonologic lexicon.

The longest delays of response times due to the lagging of vowel information occurred in the naming task. This was indeed expected. Because the correct pronunciation of phonologically ambiguous words was unequivocally determined only after the presentation of the vowel marks, subjects had to wait for the vowels to appear in order to name those words correctly. Thus, these stimuli provide a baseline for assessing the effect of lagging the vowel marks on naming latencies. This baseline confirms the previous assessment, which was based on the responses to nonwords in Experiment 3, and suggests that, overall, a 300 ms delay of the vowel marks costs about 150 ms in response time for articulation if the vowels are necessary for correct pronunciation. Note that, in contrast to the unambiguous words of Experiment 3, the difference in RTs between dominant and subordinate alternatives of ambiguous words cannot reveal the extent of lexical involvement. First, this difference cannot be accurately labeled as a frequency effect. The two phonological alternatives of each homograph differed only in their dominance, and thus could have been both frequent or both infrequent. Moreover, there is a qualitative difference between frequency effects that arise from lexical search and the dominance effect that results from the conflict between two phonological alternatives of heterophonic homographs. Thus, the slower RTs of subordinate alternatives, at least in the naming task, were probably due not to a longer lexical search for low-frequency words, but to a change in the articulation program. If subjects first considered pronouncing the dominant alternative after the ambiguous consonants were presented, the later appearance of the vowel marks that specified the subordinate alternative would have caused a change in their articulation plan, resulting in slower RTs.

General Discussion

The present study investigated the relative use of assembled and addressed phonology in naming unpointed and pointed printed Hebrew words. Experiment 1 demonstrated that the lexical status of the stimulus had a greater effect in unpointed than in pointed print. Experiment 2 confirmed that semantic priming facilitated naming to a lesser extent in the shallow pointed orthography than in the deeper unpointed orthography, even though nonwords were not included in the stimulus list. Experiments 3 and 4 examined the effect of delaying the vowel mark presentation on lexical decision and naming, in order to assess the contribution and importance of the vowel marks in the two tasks. Because the vowel marks allow fast conversion of the graphemic structure into a phonologic representation using prelexical conversion rules, their delay constitutes an experimental manipulation that reflects the cost of assembling phonology from print. The two experiments showed that, although both naming and lexical decision could be performed without considering the vowel marks, subjects were inclined to use them to a greater extent in the naming task when the cost in response delay was low.

These results provide strong support for the weaker version of the ODH. The disagreement between the alternative view and the ODH revolves around the extent of using assembled phonology in shallow orthographies. It is more or less unanimously accepted that in deep orthographies readers prefer to use the addressed routine in naming. This is because the opacity of the relationship between orthographic structure and phonologic form, which is characteristic of deep orthographies, prevents readers from assembling a prelexical phonological representation through the use of simple GPC rules. Thus, the major debate concerning the validity of the ODH often takes place on the territory of shallow orthographies, in order to show their extensive use of addressed phonology in naming. What lies behind the alternative view, therefore, is the assumption (even axiom) that the use of the addressed routine for naming constitutes the least cognitive effort for any reader in any alphabetic orthography.

The advantage of examining the validity of the ODH by investigating reading in pointed and unpointed Hebrew is therefore multiple. First, it is well established that Hebrew readers are used to accessing the lexicon by recognizing the word's orthographic structure. The phonologic information needed for pronunciation is then addressed from the lexicon (Frost & Bentin, 1992b). The question of interest, then, is: what reading strategies are adopted by Hebrew readers when they are exposed to a shallower orthography? Do they adopt a prelexical strategy even though it is not the natural strategy for processing most Hebrew reading material? A demonstration of the extensive use of the assembled routine in pointed Hebrew therefore provides strong support for the ODH. If readers of Hebrew prefer the use of the assembled routine, surely habitual readers of shallower orthographies would have a similar preference.

The second advantage in using the two orthographies of Hebrew is that it allows the manipulation of a within-language between-orthography design. This design has a methodological advantage over studies that compare different languages. The interpretation of differences in reading performance between two languages as reflecting subjects' use of pre- vs. post-lexical phonology can be criticized on methodological grounds. The correspondence between orthography and phonology is only one dimension on which two languages differ. English and Serbo-Croatian, for example, differ in grammatical structure and in the size and organization of the lexicon. These confounding factors, it can be argued, may also affect subjects' performance. The comparison of pointed and unpointed orthography in one language, Hebrew, allows these pitfalls to be circumvented.

Taken together, the four experiments suggest that the presentation of vowel marks in Hebrew encourages the reader to generate a prelexical phonologic representation for naming. The use of the assembled routine was detected by examining both frequency and semantic priming effects in naming. Experiment 3 provides an important insight concerning the preference for prelexical phonology. When the vowels appeared simultaneously with the consonants, the frequency effect in naming was small and nonsignificant, again suggesting minimal lexical involvement. Thus the results of this condition replicate the findings of Experiment 1.

Two methodological factors should be considered in interpreting the results of Experiments 3 and 4. The first relates to the experimental demand characteristics of the lagged pointed presentation. It could be argued that this unnatural presentation of printed information encouraged subjects to wait for the vowels to appear and consequently to use them. In fact, why not adopt a strategy of waiting for all possible information to be provided? Although this could be a reasonable solution for efficient performance in these experiments, the results of the lexical decision task make it very unlikely that this simple strategy was adopted by our subjects. In this task subjects were not inclined to wait for the vowel marks. This result was most conspicuous in Experiment 4, in which the lag effect was minimal. This outcome suggests that the lagged presentation did not induce a uniform strategy of vowel mark processing, but that subjects adopted a flexible strategy characterized by a gradual pattern of relying more and more on the vowel marks (prelexical phonology) as a function of the task and of the ambiguity of the stimuli.

The differential effect of lag on ambiguous and unambiguous words in the two tasks suggests that both the assembled and the addressed routines were used for generating phonology from print. On the one hand, subjects used explicit vowel information, employing prelexical transformation rules. This is reflected in the greater effect of lag on naming relative to lexical decision latencies, and in the smaller frequency effect of Experiment 3 at the shorter SOAs. On the other hand, it is clear that the phonologic structure of unambiguous words was generated using the addressed routine as well. This is reflected in the differential effect of lagging the vowel information on naming ambiguous and unambiguous words: There were smaller effects of lag on naming unambiguous than ambiguous words. This confirms that subjects did not wait for the vowel marks in order to pronounce the unambiguous words as long as they waited in order to pronounce the ambiguous words. The flexibility in using the two routines for naming unambiguous words was affected, however, by the "cost" of the vowel information. When no cost was involved in obtaining the vowel information (simultaneous presentation), the prelexical routine was preferred, as reflected by the small frequency effect. In contrast, when the use of vowel information involved a higher cost, given the delayed presentation of the vowel marks, a gradually greater reliance on the addressed routine was observed. This is well reflected by the larger frequency effect at the longer lags. This pattern stands in sharp contrast to the lexical decision task, in which the frequency effect was stable and very similar across lags.

The second methodological issue to be considered in interpreting the results of Experiments 3 and 4 is the presence of nonwords in the experiments. One factor that has recently been proposed to account for the conflicting results concerning the effect of orthographic depth is the inclusion of nonwords in the stimulus list (Baluch & Besner, 1991). In their study, Baluch and Besner presented native speakers of Persian with opaque and transparent Persian words and showed that differences in semantic facilitation in the naming task appeared only when nonwords were included in the stimulus list. When the nonwords were omitted, no differences in semantic facilitation were found. Baluch and Besner concluded that the inclusion of nonwords in the list encourages subjects to adopt a prelexical strategy of generating phonology from print, because the phonologic structure of nonwords cannot be lexically addressed but only prelexically assembled.

The results of Experiments 3 and 4 do not support the view that the nonwords induced a pure prelexical strategy. If this were so, the effects of lag would have been similar for words and for nonwords. Subjects would have waited for the vowels of every stimulus to appear, and the lag effect for unambiguous words, ambiguous words, and nonwords would have been similar, about 150 ms. This clearly was not the outcome we obtained. Subjects waited the full 150 ms to pronounce the nonwords and the ambiguous words but waited only half as much time to pronounce the unambiguous words. This confirms a mixed strategy in reading.

However, the argument proposed by Baluch and Besner deserves serious consideration. Not only is its logic compelling, but the effect of nonwords on reading strategies is also well documented in several studies. The crux of the debate is, however, whether this argument can account for the observed effects of orthographic depth found in the various studies described above (e.g., Frost et al., 1987; Katz & Feldman, 1983). The evidence for that claim seems to be much less compelling. There is no argument that nonwords induce a prelexical strategy of reading in all orthographies (e.g., Hawkins, Reicher, & Rogers, 1976; and see McCusker, Hillinger, & Bias, 1981, for a review). However, note that the weak ODH proposes that whatever the effect of nonword inclusion is, it would be different in deep and in shallow orthographies. Thus, evidence for or against the weak ODH could only be provided by studies directly examining the differential effect of nonwords in various orthographies. Such a design was indeed employed by Baluch and Besner (1991), but their conclusions hinge on the non-rejection of the null hypothesis. In contrast to their results, several studies have provided clear evidence supporting the weak ODH. For example, Frost et al. (1987) showed that the ratio of nonwords had a dramatically different effect on naming in Hebrew, in English, and in Serbo-Croatian. Different effects of nonword inclusion on semantic facilitation in naming in the deep English and the shallow Italian orthographies were also reported by Tabossi and Laghi (1992).

The results of Experiment 2 also stand in sharp contrast to Baluch and Besner's conclusions. In Experiment 2 different semantic priming effects were found between pointed and unpointed Hebrew, even though nonwords were not included in the stimulus list. How can this difference be accounted for? A possible explanation for this discrepancy could be related to the strength of the experimental manipulation employed in the two studies. Baluch and Besner used opaque and transparent words within the unpointed, deep Persian writing system. The phonological transparency of their words was due to the inclusion of letters that convey vowels. However, because these letters can also convey consonants in a different context, they may have introduced some of the ambiguity in the print that is characteristic of deep orthographies, thereby encouraging the use of the addressed routine. In contrast, the experimental manipulation of using pointed and unpointed Hebrew print is much stronger. Hence, differential effects of semantic facilitation in naming emerged in Hebrew, providing strong support for the weak ODH.

The debate concerning the ODH, however, cannot simply revolve around the interpretation of various findings without adopting a theoretical framework for reading in general. Interestingly, when contrasting the ODH with the alternative view, one might find very similar definitions that capture these conflicting approaches. Proponents of the alternative view would argue that, regardless of the characteristics of their orthography, readers in different languages seem to adopt remarkably similar mechanisms in reading (e.g., Besner & Smith, 1992; Seidenberg, 1992). Thus, the alternative view attempts to offer a universal mechanism that portrays a vast commonality in the reading process in different languages. Surprisingly, proponents of the ODH take a very similar approach. They suggest a basic mechanism for processing print in all languages, which is finely tuned to the particular structure of every language (Carello et al., 1992). The major discussion is, therefore, what exactly this mechanism is.

The basic assumption of the ODH concerns the role of phonology in reading. It postulates as a basic tenet that all writing systems are phonological in nature and that their primary aim is to convey phonologic structures, i.e., words, regardless of the graphemic structure adopted by each system (see De Francis, 1989; Mattingly, 1992, for a discussion). Thus, the extraction of phonologic information from print is the primary goal of the reader, whether skilled or beginner. It is the emphasis upon the role of phonology in the reading process that bears upon the importance of prelexical phonology in print processing. The results of the present study furnish additional support for an increasingly wide corpus of research that provides evidence confirming the role of prelexical phonology in reading. This evidence comes not only from shallow orthographies like Serbo-Croatian (e.g., Feldman & Turvey, 1983; Lukatela & Turvey, 1990), but also from deeper orthographies like English, using a backward masking paradigm (e.g., Perfetti et al., 1988) or a semantic categorization task with pseudohomophonic foils (Van Orden, 1987; Van Orden, Johnston, & Hale, 1988). Recently, Frost and Bentin (1992a) showed that phonologic analysis of printed heterophonic homographs in the even deeper unpointed Hebrew orthography precedes semantic disambiguation. In another study, Frost and Kampf (1993) showed that the two phonologic alternatives of Hebrew heterophonic homographs are automatically activated following the presentation of the ambiguous letter string.

The results of the present study converge with these findings. They provide an opportunity to examine the ODH and the alternative view not in the linguistic environment of shallow orthographies, but in that of deep orthographies. If prelexical phonology plays a significant role in the reading of pointed Hebrew by readers who are trained to use mainly the addressed routine for phonological analysis, then the plausible conclusion is that, in any orthography, assembled phonology plays a much greater role in reading than the alternative view would assume.

REFERENCES

Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10, 340-357.

Baluch, B., & Besner, D. (1991). Strategic use of lexical and nonlexical routines in visual word recognition: Evidence from oral reading in Persian. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 644-652.

Bentin, S., Bargai, N., & Katz, L. (1984). Orthographic and phonemic coding for lexical access: Evidence from Hebrew. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 353-368.

Bentin, S., & Frost, R. (1987). Processing lexical ambiguity and visual word recognition in a deep orthography. Memory & Cognition, 15, 13-23.

Besner, D., & Hildebrandt, N. (1987). Orthographic and phonological codes in the oral reading of Japanese Kana. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 335-343.

Besner, D., & Smith, M. C. (1992). Basic processes in reading: Is the orthographic depth hypothesis sinking? In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 45-66). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Besner, D., Patterson, K., Lee, L., & Hildebrandt, N. (1992). Two forms of Japanese Kana: Phonologically but NOT orthographically interchangeable. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Carello, C., Lukatela, G., & Turvey, M. T. (1988). Rapid naming is affected by association but not syntax. Memory & Cognition, 16, 187-195.

Carello, C., Turvey, M. T., & Lukatela, G. (1992). Can theories of word recognition remain stubbornly nonphonological? In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 211-226). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Colombo, L., & Tabossi, P. (1992). Strategies and stress assignment: Evidence from a shallow orthography. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 319-340). Amsterdam: Advances in Psychology, Elsevier Science Publishers.


DeFrancis, J. (1989). Visible speech: The diverse oneness of writing systems. Honolulu: University of Hawaii Press.

Feldman, L. B., & Turvey, M. T. (1983). Word recognition in Serbo-Croatian is phonologically analytic. Journal of Experimental Psychology: Human Perception and Performance, 9, 288-298.

Frost, R. (1992). Orthography and phonology: The psychological reality of orthographic depth. In M. Noonan, P. Downing, & S. Lima (Eds.), The linguistics of literacy (pp. 255-274). Amsterdam/Philadelphia: John Benjamins Publishing Co.

Frost, R., Katz, L., & Bentin, S. (1987). Strategies for visual word recognition and orthographical depth: A multilingual comparison. Journal of Experimental Psychology: Human Perception and Performance, 13, 104-115.

Frost, R., & Katz, L. (1989). Orthographic depth and the interaction of visual and auditory processing in word recognition. Memory & Cognition, 17, 302-311.

Frost, R., & Bentin, S. (1992a). Processing phonological and semantic ambiguity: Evidence from semantic priming at different SOAs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 58-68.

Frost, R., & Bentin, S. (1992b). Reading consonants and guessing vowels: Visual word recognition in Hebrew orthography. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 27-44). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Frost, R., & Kampf, M. (1993). Phonetic recoding of phonologically ambiguous printed words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 1-11.

Hawkins, H. L., Reicher, G. M., & Rogers, M. (1976). Flexible coding in word recognition. Journal of Experimental Psychology: Human Perception and Performance, 2, 233-242.

Katz, L., & Feldman, L. B. (1981). Linguistic coding in word recognition. In A. M. Lesgold & C. A. Perfetti (Eds.), Interactive processes in reading. Hillsdale, NJ: Erlbaum.

Katz, L., & Feldman, L. B. (1983). Relation between pronunciation and recognition of printed words in deep and shallow orthographies. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9, 157-166.

Katz, L., & Frost, R. (1992). Reading in different orthographies: The orthographic depth hypothesis. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 67-84). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Klima, E. S. (1972). How alphabets might reflect language. In J. F. Kavanagh & I. G. Mattingly (Eds.), Language by ear and by eye. Cambridge, MA: The MIT Press.

Laudanna, A., & Caramazza, A. (1992). Morpho-lexical representations and reading. Paper presented at the Fifth Conference of the European Society for Cognitive Psychology, Paris, France.

Liberman, I. Y., Liberman, A. M., Mattingly, I. G., & Shankweiler, D. (1980). Orthography and the beginning reader. In J. F. Kavanagh & R. L. Venezky (Eds.), Orthography, reading, and dyslexia. Austin, TX: Pro-Ed.

Lukatela, G., Popadić, D., Ognjenović, P., & Turvey, M. T. (1980). Lexical decision in a phonologically shallow orthography. Memory & Cognition, 8, 415-423.

Lukatela, G., Feldman, L. B., Turvey, M. T., Carello, C., & Katz, L. (1989). Context effects in bi-alphabetical word perception. Journal of Memory and Language, 28, 214-236.

Lukatela, G., & Turvey, M. T. (1990). Automatic and prelexical computation of phonology in visual word identification. European Journal of Cognitive Psychology, 2, 325-343.

Lupker, S. J. (1984). Semantic priming without association: A second look. Journal of Verbal Learning and Verbal Behavior, 23, 709-733.

McCusker, L. X., Hillinger, M. L., & Bias, R. G. (1981). Phonological recoding and reading. Psychological Bulletin, 89, 217-245.

Mattingly, I. G. (1992). Linguistic awareness and orthographic form. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 11-26). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Neely, J. H. (1991). Semantic priming effects in visual word recognition. In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading: Visual word recognition. Hillsdale, NJ: Erlbaum.

Perfetti, C. A., Bell, L. C., & Delaney, S. M. (1988). Automatic (prelexical) phonetic activation in silent word reading: Evidence from backward masking. Journal of Memory and Language, 27, 59-70.

Perfetti, C. A., Zhang, S., & Berent, I. (1992). Reading in English and Chinese: Evidence for a universal phonological principle. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 227-248). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Scheerer, E. (1986). Orthography and lexical access. In G. Augst (Ed.), New trends in graphemics and orthography (pp. 262-286). Berlin: De Gruyter.

Sebastian-Galles, N. (1991). Reading by analogy in a shallow orthography. Journal of Experimental Psychology: Human Perception and Performance, 17, 471-477.

Seidenberg, M. S. (1985). The time course of phonological code activation in two writing systems. Cognition, 19, 1-30.

Seidenberg, M. S. (1992). Beyond orthographic depth in reading: Equitable division of labor. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 85-118). Amsterdam: Advances in Psychology, Elsevier Science Publishers.

Seidenberg, M. S., & Vidanović, S. (1985). Word recognition in Serbo-Croatian and English: Do they differ? Paper presented at the Twenty-fifth Annual Meeting of the Psychonomic Society, Boston.

Tabossi, P., & Laghi, L. (1992). Semantic priming in the pronunciation of words in two writing systems: Italian and English. Memory & Cognition, 20, 303-313.

Turvey, M. T., Feldman, L. B., & Lukatela, G. (1984). The Serbo-Croatian orthography constrains the reader to a phonologically analytic strategy. In L. Henderson (Ed.), Orthographies and reading: Perspectives from cognitive psychology, neuropsychology, and linguistics (pp. 81-89). Hillsdale, NJ: Erlbaum.

Van Orden, G. C. (1987). A ROWS is a ROSE: Spelling, sound, and reading. Memory & Cognition, 15, 181-198.

Van Orden, G. C., Johnston, J. C., & Hale, B. L. (1988). Word identification in reading proceeds from spelling to sound to meaning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 371-386.

FOOTNOTES

*Journal of Experimental Psychology: Learning, Memory, and Cognition, in press.

1. The error rates presented in this study should be evaluated with care because the criteria for assessing an error in pointed and unpointed print are different. While in unpointed print any pronunciation consistent with the consonants only is considered correct, in pointed print any pronunciation that is inconsistent with the explicit vowel information is considered an error. This is of special significance when evaluating the nonwords data. In unpointed print subjects have much greater flexibility in naming nonwords than in pointed print, hence the larger error rates in the pointed condition.

2. But note that the effect of delaying vowels on naming latencies does not linearly decrease with lag, as should be predicted by the increase of the frequency effect. In order to obtain a more ordered relationship, more lag conditions should have been used.


Haskins Laboratories Status Report on Speech Research

1993, SR-113, 171-196

Relational Invariance of Expressive Microstructure across Global Tempo Changes in Music Performance:

An Exploratory Study*

Bruno H. Repp

This study addressed the question of whether the expressive microstructure of a music performance remains relationally invariant across moderate (musically acceptable) changes in tempo. Two pianists played Schumann's "Träumerei" three times at each of three tempi on a digital piano, and the performance data were recorded in MIDI format. In a perceptual test, musically trained listeners attempted to distinguish the original performances from performances that had been artificially speeded up or slowed down to the same overall duration. Accuracy in this task was barely above chance, suggesting that relational invariance was largely preserved. Subsequent analysis of the MIDI data confirmed that each pianist's characteristic timing patterns were highly similar across the three tempi, although there were statistically significant deviations from perfect relational invariance. The timing of (relatively slow) grace notes seemed relationally invariant, but selective examination of other detailed temporal features (chord asynchrony, tone overlap, pedal timing) revealed no systematic scaling with tempo. Finally, although the intensity profile seemed unaffected by tempo, a slight overall increase in intensity with tempo was observed. Effects of musical structure on expressive microstructure were large and pervasive at all levels, as were individual differences between the two pianists. For the specific composition and range of tempi considered here, these results suggest that major (cognitively controlled) temporal and dynamic features of a performance change roughly in proportion with tempo, whereas minor features tend to be governed by tempo-independent motoric constraints.

INTRODUCTION

When different artists perform the same musical composition, they often choose very different tempi. Even though each artist may be convinced that his or her tempo is "right," and even though the composer may have prescribed a specific tempo in the score, there is in fact a range of acceptable tempi for any composition played on conventional instruments. The American composer Ned Rorem has expressed this well:

This research was made possible by the generosity of Haskins Laboratories (Carol A. Fowler, president). Additional support came from NIH BRSG Grant RR05596 to the Laboratories. I am grateful to LPH for her patient participation in this study, and to Peter Desain and Henkjan Honing for helpful comments on an earlier version of the manuscript.


Tempos vary with generations like the rapidity of language. Music's velocity has less organic import than its phraseology and rhythmic qualities; what counts in performance is the artistry of phrase and beat within a tempo. ... [The composer's] Tempo indication is not creation, but an afterthought related to performance. Naturally an inherently fast piece must be played fast, a slow one slow, but just to what extent is a decision for players. (Rorem, 1983, p. 326.)

Guttmann (1932) measured the durations of a large number of orchestral performances and found substantial tempo variation across different conductors for the same composition. In extreme cases, the fastest observed performance was about 30% shorter than the slowest one. In two recent studies, Repp (1990, 1992) compared performances by famous pianists of solo pieces by Beethoven and Schumann and in each case found a considerable range of tempi. The ratio of the extreme tempi was approximately 1:1.6 in each case, corresponding to a 37% difference in performance duration. Tempo differences among repeated performances by the same artist tend to be smaller but may also be considerable (Guttmann, 1932; Repp, 1992), notwithstanding the observation of remarkable constancy for some artists or ensembles (Clynes and Walker, 1986).

Tempo varies not only between but also within performances. Even though no tempo changes may have been prescribed by the composer in the score, a sensitive performer will continuously modulate the timing according to the structural and expressive requirements of the music. This temporal modulation forms part of the "expressive microstructure" of a performance. Naturally, it is more pronounced in slow, "expressive" than in fast, "motoric" pieces. The problem of determining the global or baseline tempo of a highly modulated performance is addressed elsewhere (Repp, in press). In this article the question of interest is whether overall tempo interacts with the pattern of expressive modulations.

The null hypothesis to be tested is that expressive microstructure (which on the piano includes not only successive tone onset timing but also simultaneous tone onset asynchronies, successive tone overlap or separation, pedal timing, and successive as well as simultaneous intensity relationships) is relationally invariant across changes in tempo. What this implies is that a change in tempo amounts to multiplying all temporal intervals by a constant, so that all their relationships (ratios) remain intact. Nontemporal properties of performance (i.e., tone intensities) are likewise predicted to remain relationally invariant.
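To make this concrete, here is a minimal sketch (not from the original report; the interval values are invented) of what a constant rate parameter implies for a timing profile:

    # Minimal illustration of relational invariance (hypothetical IOIs, in ms).
    iois_medium = [450, 500, 430, 610, 480]   # timing profile at one tempo
    rate = 1.2                                # constant rate factor (20% slower)

    iois_slow = [rate * ioi for ioi in iois_medium]

    # Every ratio between corresponding intervals equals the rate factor,
    # so the shape of the profile (all interval ratios) is preserved.
    ratios = [s / m for s, m in zip(iois_slow, iois_medium)]
    assert all(abs(r - rate) < 1e-9 for r in ratios)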

Relational invariance (also called "proportional duration") is a key concept in studies of motor behavior (see Gentner, 1987; Heuer, 1991; Viviani and Laissard, 1991). It has been used as an indicator of the existence of a "generalized motor program" (Schmidt, 1975) having a variable rate parameter. Many activities have been examined from that perspective, and relational invariance has commonly been observed, even though stringent statistical tests may show significant deviations from strict proportionality (see Heuer, 1991). The deviations may result from the combination of more central and more peripheral sources of variability. Heuer (1991) conjectures that relational invariance is most likely to obtain when the motor behavior is "natural" (i.e., conforms to peripheral constraints) and "self-selected" (i.e., not imposed by the experimenter). These conditions certainly apply to artistic music performance, especially when (as in the present study) it is slow and expressive, so that peripheral factors (i.e., technical difficulties) are minimized. Nevertheless, even such a performance has aspects (e.g., the relative asynchrony of chord tones or the relative overlap of legato tones on the piano) that are subject to peripheral constraints imposed by fingering, the spatial distance between keys, and the limits of fine motor control. The question of relational invariance thus may be asked at several levels of detail, and the answers may differ accordingly. In sheer complexity and degree of precision, expert musical performance exceeds almost any other task that may be investigated from a motor control viewpoint (see Shaffer, 1980, 1981, 1984).

Few previous studies have looked into the question of relational invariance in music performance. Deviations from relational invariance seem likely when the tempo changes are large. Handel (1986) has pointed out that changes in tempo may cause rhythmic reorganization: A doubling or halving of tempo often results in a change of the level at which the primary beat (the "tactus") is perceived; this changes the rhythmic character of the piece, with likely consequences for performance microstructure. Very fast tempi may lead to technical problems whereas very slow tempi may lead to a loss of coherence. These kinds of issues may be profitably investigated with isolated rhythmic patterns, scales, and the like. When it comes to the artistic interpretation of serious compositions, however, such dramatic differences in tempo are rare. The tempi chosen cannot stray too far from established norms, or else the performance will be perceived as deviant and stylistically inappropriate. In this study, therefore, we will be concerned only with moderate tempo variations that are within the range of aesthetic acceptability for the composition chosen. Still, as indicated above, this range is wide enough to make the question of relational invariance nontrivial.

Since different artists' performances of the same music differ substantially in their expressive microstructure ("interpretation") as well as in overall tempo, it is virtually impossible to determine the relationship between these two variables by comparing recordings of different artists. Although many studies have demonstrated that individual performers are remarkably consistent in reproducing their expressive timing patterns across repeated performances of the same music, these performances are generally also very similar in tempo (e.g., Seashore, 1938; Shaffer, 1984; Shaffer, Clarke, and Todd, 1985; Repp, 1990, 1992). Where substantial differences in tempo do occur, underlying changes in "interpretation" (i.e., in the cognitive structure underlying the performance) cannot be ruled out (e.g., Shaffer, 1992; Repp, 1992). In other words, the change in tempo may have been caused by a change in interpretation, rather than the other way around. Clearly, artists can change their interpretation at will, and with it the tempo of a performance. The question posed in the present study was whether a change in tempo necessitates a change in expressive microstructure. Therefore, an experimental approach was taken in which the same artist was asked to play a piece at different tempi without deliberately changing the interpretation. The resulting performance data were analyzed and compared. This approach was supplemented by a perceptual test in which the tempo of recorded performances was artificially modified and the result was judged by musically trained listeners.

One previous study that directly addressed the hypothesis tested here was conducted by Clarke (1982), following preliminary but inconclusive observations by Michon (1974). Clarke examined selected sections of performances of the highly repetitive piano piece, "Vexations," by Erik Satie, played by two pianists who had been instructed to vary the tempo. The timing patterns (tone interonset intervals, or IOIs) of the same music at six different tempi were compared, and a significant interaction with tempo was found. The interaction was in part due to a narrowing of the tempo range in the course of the excerpt, but a significant effect remained after the tempo drift was removed statistically. Subsequent inspection of the data suggested to Clarke that the pianists had changed their interpretation of the grouping structure across tempi. At least one pianist produced more groups at the slower tempo, where group boundaries were identified by a temporary slowing down. Clarke also argued that group boundaries that were maintained tended to be emphasized more at the slower tempi.

While Clarke's interpretation is intuitively plausible, the changes in the timing profiles across tempi were quite small and do not suggest different structural interpretations to this reader. Although there was more temporal modulation at the slower tempi, this is consistent with the hypothesis of relational invariance and may have accounted for the interaction with tempo. Such an interaction might not have appeared in log-transformed data.1 Finally, even if it were true that the pianists changed their interpretations along with tempo, this may have reflected the extremely boring and repetitive nature of the music which, as the title says, is a deliberate harassment of the performer. Therefore, the generality of Clarke's conclusions is not clear. A very casual report of a follow-up study with a movement from a Clementi Sonatina (Clarke, 1985) does not change the picture significantly.

In a study of piano performance at a simpler level, MacKenzie and Van Eerd (1990) found highly significant changes in tone IOIs as a function of tempo. Their pianists' task was to play scales as evenly as possible, and the range of tempi was very large. The IOI profiles observed were not due to musical structure or expressive intent but to fingering; they became more pronounced as tempo increased. These kinds of motor constraints, though they become important in fast and technically difficult pieces, are likely to play only a minor role in slow, expressive performance, as studied here.

The immediate stimulus for the present study was provided by some informal observations reported by Desain and Honing (1992a). They recorded a pianist playing a tune by Paisiello (the theme of Beethoven's variations on "Nel cor più non mi sento") at two different tempi, M.M.60 and 90, and observed that the temporal microstructure differed considerably between the two performances. They also used a MIDI sequencer to speed up the slow performance and to slow down the fast performance, and in each case they found that the altered performance sounded unnatural compared to the original performance having the same tempo. On closer examination, the major differences between the two performances seemed to lie in the timing of grace notes, in the articulation of staccato notes, and in the "spread" (onset asynchrony) of chords, though the overall timing profiles also looked quite different.

In a subsequent paper, Desain and Honing (1992b) developed a computer model for implementing changes in expressive timing. They distinguished four types of "musical objects": sequences of tones, chords, appoggiature, and acciaccature, the last two referring to types of grace notes. Each type of musical object is subject to different temporal transformation rules, which are applied successively within a hierarchical representation of the musical structure. In the first instance, these transformations represent changes along a continuum of degrees of expressiveness (i.e., deviations from mechanical exactitude) within a fixed temporal frame. However, the temporal changes within a larger unit propagate down to smaller units, effectively altering their tempo. Judging from their Figure 5, Desain and Honing propose that tone onsets within melodic sequences are stretched or shrunk proportionally as tempo changes, whereas the other three types of structures remain temporally constant or get "truncated" (in the case of chord asynchrony). Thus, these authors seem to suggest that relational invariance across changes in tempo does hold in a sequence of melody tones, but not in chords or ornaments. As to articulation (i.e., the overlap or separation of successive tones), they discuss three alternative transformations, without committing themselves. Their model is a system for implementing different types of rules, not a theory of what these rules might be. Nevertheless, their examples reflect some of their earlier informal observations.

The present study investigated not only timing patterns but also tone intensities, another important dimension of expressive performance that has received much less attention in research so far. Todd (1992) has pointed out that an increase of tempo often goes along with an increase in intensity, which leads to the hypothesis that the overall dynamic level of a performance may be affected by a tempo change. MacKenzie and Van Eerd (1990) found such an increase in key press velocity with tempo in their study of pianists playing scales. It is not clear whether any changes in intensity relationships should be expected across tempi, however. MacKenzie and Van Eerd found a subtle interaction, but the differences were not expressively motivated. The hypothesis investigated here was that intensity microstructure would remain relationally invariant across tempo changes, but that pianists might play louder overall at faster tempi. (If this latter effect were absent, the intensity pattern would be absolutely invariant.)

The present study aimed at providing some preliminary data bearing on these issues. The data were limited in so far as they came from a single musical composition played by two pianists, but they included a number of different measurable aspects of performance. The composition, Robert Schumann's "Träumerei," was selected because its expressive timing characteristics and acceptable tempo range were known from Repp's (1992) detailed analysis of 28 expert performances, and also because it is a slow, interpretively demanding, but technically not very challenging piece. It contains no staccato notes or ornaments, apart from a few expressive grace notes. Thus the main question was whether the expressive timing of the melody tones would or would not scale proportionally with changes in tempo. However, the tempo scaling of other performance aspects (grace notes, chord asynchrony, tone overlap, pedaling, dynamics) was of nearly equal interest. Based on the limited observations summarized above, it was expected that the overall timing and intensity profiles would exhibit relational invariance across tempo changes, whereas this might not be true for the more detailed temporal features. In the course of the detailed analyses, there were opportunities to make new (and confirm old) observations about the relation of musical structure and performance microstructure, some of which will be discussed briefly in order to characterize the nature of the patterns whose relational (non)invariance was investigated. A detailed investigation of structural factors is beyond the scope of this paper.

In addition to these performance measurements, a perceptual test was carried out along the lines explored by Desain and Honing (1992a), to determine whether artificially speeded-up or slowed-down performances would sound odd compared to original performances having the same overall tempo. The results of that test will be described first, followed by the performance analyses.

Methods

The music

The score of Schumann's "Träumerei" is shown in Figure 1; its layout highlights the hierarchical structure of the piece. The composition comprises three 8-bar sections, each of which contains two 4-bar phrases which in turn are composed of shorter melodic gestures. The first section (bars 1-8) is repeated in performance. Each of the six phrases starts with an upbeat that (except for the one at the very beginning) overlaps with the end of the preceding phrase. Phrases 1 and 5 are identical and similar to phrase 6, whereas phrases 2, 3, and 4 are more similar to each other than to phrases 1, 5, and 6. Thus there are two phrase types here. Vertically, the composition has a four-voice polyphonic structure. For a more detailed analysis, see Repp (1992).


Figure 1. Score of Schumann's "Träumerei," prepared with MusicProse software following the Schumann edition (Breitkopf & Härtel). Minor discrepancies are due to software limitations. The layout of the computer score highlights parallel musical structures.


Performers

The two performers were LPH, a professional pianist in her mid-thirties, and BHR (the author), a serious amateur in his late forties. Both were thoroughly familiar with Schumann's "Träumerei" and had played it many times in the past.

Recording procedure

The instrument was a Roland RD250S digital piano with weighted keys and sustain pedal switch, connected to an IBM-compatible microcomputer running the Forte sequencing program. Synthetic "Piano 1" sound was used. The pianists played from the score and monitored the sound over earphones. The computer recorded each performance in MIDI format, including the onset time, offset time, and velocity of each key depression, as well as pedal on and off times. The temporal resolution was 5 ms.

Each pianist was recorded in a separate session. After some practice on the instrument, (s)he played the full piece (including the repeat of the first 8 bars) at her/his preferred tempo. Afterwards, (s)he listened to the beginning of the recorded performance and set a Franz LM-FB-4 metronome to what (s)he believed to be the basic tempo of the performance. The settings chosen by LPH and BHR were M.M.63 and 66, respectively. Subsequently, each pianist performed the piece at a slower and at a faster tempo. These tempi were chosen by the experimenter to be M.M.54 and 72 for LPH and M.M.56 and 76 for BHR. All these tempi were within the range observed in commercially recorded performances by famous pianists (Repp, 1992; in press). The desired tempo was indicated by the metronome before each performance; the metronome was turned off before playing started. The pianists' intention was to play as naturally and expressively as possible at each tempo; no conscious effort was made either to maintain or to change the interpretation across tempi.

Each pianist then repeated the cycle of three performances until three good recordings had been obtained at each tempo.2 In these repeats, the medium (preferred) tempo was also cued by metronome. LPH's first performance was excluded because it showed signs of her still getting accustomed to the instrument, and an extra medium-tempo performance was added at the end of the session. BHR actually recorded five performances at each tempo, two in one session and three in another. Only the performances from the second session were used. Thus neither pianist's first performance was included in her/his final set of 9 performances.

The MIDI data files were converted into text files and imported into a data analysis and graphics program (DeltaGraph Professional) for further processing. Tone interonset intervals (IOIs) in milliseconds were obtained by computing differences among successive tone onset times. Tone intensities were expressed in terms of raw MIDI velocities ranging from 0 to 127. In the middle range (which accommodated virtually all melody tones), a difference of 4 MIDI velocity units corresponds to a 1 dB difference in peak rms sound level (Repp, 1993).
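A sketch of this preprocessing step may be helpful (this is an illustration, not the DeltaGraph procedure used in the study; the note events are invented, and the dB conversion applies the calibration just cited):

    # Hypothetical note events: (onset time in ms, MIDI velocity 0-127).
    events = [(0, 60), (480, 64), (1010, 58), (1495, 68)]

    onsets = [t for t, _ in events]
    velocities = [v for _, v in events]

    # Interonset intervals (IOIs): differences between successive onsets.
    iois = [b - a for a, b in zip(onsets, onsets[1:])]

    # Approximate intensity of each tone relative to the first, using the
    # calibration of about 4 MIDI velocity units per 1 dB (middle range only).
    db_re_first = [(v - velocities[0]) / 4.0 for v in velocities]

    print(iois)          # [480, 530, 485]
    print(db_re_first)   # [0.0, 1.0, -0.5, 2.0]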

Perceptual test

Stimuli. The purpose of the perceptual test was to determine whether performances whose tempo has been artificially altered sound noticeably more awkward than unaltered performances. To reduce the length of the test, each performance was truncated after the third beat of bar 8; the cadential chord at that point was extended in duration to convey finality. Moreover, only the first two of each pianist's three performances at each tempo were used. From each of these 12 truncated original performances, two modified performances were generated by artificially speeding up or slowing down its tempo, such that its total duration matched that of an original performance at a different tempo by the same pianist. Thus, for example, LPH's first slow performance was speeded up to match the duration of the first medium performance, and was speeded up further to match the first fast performance. The tempo modification was achieved by changing the "metronome" setting in the Forte sequencing program, relative to the baseline setting of M.M.100 during recording. (This setting was unrelated to the tempo of the recorded performance but determined the temporal resolution.)
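The effect of such a sequencer-level tempo change can be modeled as a uniform rescaling of all event times, as in this sketch (a simplified stand-in for the Forte metronome manipulation; the event list and target duration are invented):

    # Hypothetical performance: (onset_ms, offset_ms, velocity) per note.
    performance = [(0, 400, 60), (450, 900, 62), (980, 1500, 58)]

    def rescale(events, target_duration):
        """Uniformly scale all event times so the excerpt spans target_duration."""
        end = max(off for _, off, _ in events)
        k = target_duration / end
        return [(round(on * k), round(off * k), vel) for on, off, vel in events]

    # Speed the excerpt up from 1500 ms to 1200 ms; velocities (intensities)
    # are left untouched, as in the perceptual-test stimuli.
    faster = rescale(performance, 1200)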

Each of the 12 original performances was then paired with each of two duration-matched modified performances, resulting in 24 pairs. The original performance was first in half the pairs and second in the other half, in a counterbalanced fashion. The pairs were arranged in a random order and were recorded electronically onto digital cassette tape. Performances within a pair were separated by about 2 s of silence, whereas about 5 s of silence intervened between pairs. The test lasted about half an hour.

Subjects and procedure. Nine professional-level pianists served as paid subjects. All but one were graduate students at the Yale School of Music, seven of them for a master's degree in piano performance and one for a Ph.D. degree in composition. They were tested individually in a quiet room. The test tape was played back at a comfortable intensity over Sennheiser HD420SL earphones. The subjects were fully informed about the nature and manipulation of the stimuli, and they were asked to identify the original performance in each pair by writing down "1" or "2," relying basically on which performance sounded "better" to them.

Results and Discussion

Perceptual test

Average performance in the perceptual test was 55.6% correct. This was not significantly above chance across subjects [F(1,8) = 2.91, p < .13], though it was marginally significant across pairs of items [F(1,11) = 5.08, p < .05].3 There was no significant difference between the scores for the two pianists' performances (though BHR's were slightly easier to discriminate) or for different types of comparisons. These results indicate that the tempo transformations did little damage to the expressive quality of the performances, and hence that there were probably no gross deviations from relational invariance, at least in the first 8 bars.

Timing patterns at the preferred tempo

The performance aspect of greatest interest was the pattern of IOIs (the timing pattern or profile). Before examining the effects of changes in tempo on this aspect of temporal microstructure, however, it seemed wise to look at the timing profiles of the performances at the pianists' preferred (medium) tempi, to confirm that they were meaningful, representative, and reliable. This seemed particularly important in view of the fact that an electronic instrument had been used.

Figure 2 plots the onset timing patterns averaged across the three medium-tempo performances of each pianist. The format of this figure is identical to that of Figure 3 in Repp (1992), which shows the Geometric Average timing pattern of 28 different performances by 24 Famous Pianists (GAFP for short). The timing patterns of three related phrases are superimposed in each panel. (The two renditions of bars 1-8 were first averaged.) Interonset intervals longer than a nominal eighth-note are shown as "plateaus" of equal eighth-note intervals.

The two pianists' average medium-tempo performances resembled both each other and the GAFP timing pattern. The correlations of the complete log-transformed IOI profiles (N=254) were 0.85 between LPH and BHR, 0.88 between LPH and GAFP, and 0.94 between BHR and GAFP. Of course, these overall correlations were dominated by the major excursions in the timing profiles. A measure of similarity at a more detailed level was obtained by computing correlations only over the initial 8 bars, which did not contain any extreme ritardandi. These correlations (N=130) were 0.64 (p < .0001) between LPH and BHR, 0.77 between LPH and GAFP, and 0.83 between BHR and GAFP.4 Clearly, both pianists' performances were reasonably representative in the sense that they resembled the average expert profile, even though they were only moderately similar to each other at a detailed quantitative level.

Furthermore, as can be seen in Figure 2, bothLPH and BHR produced very similar timingprofiles for the identical phrases in bars 1-4 and17-20 (left-hand panels); LPH played the wholephrase slower in bars 17-20 whereas BHR sloweddown only in bar 17. Both pianists also showedsimilar timing patterns for the analogous phrasesin bars 9-12 and 13-16 (right-hand panels), whichdiverged substantially only at the end, due to themore pronounced ritardando in bar 16, the end ofthe middle section. The structurally similarphrase in bars 5-8 (right-hand panels) againshowed a similar pattern; apart from slightdifferences in overall tempo, major deviationsfrom the pattern of the other two phrases occurredonly in the second bar, where the music in factdiverges (cf. Figure 1), and also in the fourth barfor BHR, who emphasized the cadence in thesoprano voice and treated the phrase-final bassvoice as more transitional than did LPH. Finally,bars 21-24 (left-hand panels) retained thequalitative pattern of bars 1-4 but includedsubstantial lengthening due to the fermata andthe final grand ritardando.5

Many additional comments could be made aboutthe detailed pattern of temporal variations andtheir relation to the musical structure, but such adiscussion may be found in Repp (1992) and neednot be repeated here. Instead, we will focus nowon the comparison among the three tempoconditions.


Figure 2. Eighth-note IOIs as a function of metric distance, averaged across the three medium-tempo performances of each pianist (top: LPH; bottom: BHR). Structurally identical or similar phrases are overlaid in left- and right-hand panels.


Effects of tempo on global timing patterns

A first question was whether the pianists were as reliable in their timing patterns in the slow and fast conditions as at the medium tempo. After all, they had to play at tempi they would not have chosen themselves. The average overall between-performance correlations within tempo categories for LPH were 0.88 (slow) and 0.91 (fast), as compared to 0.91 (medium); and for BHR they were 0.95 (slow) and 0.94 (fast), as compared to 0.95 (medium). Thus the pianists were just about as reliable in their gross timing patterns at unfamiliar tempi as at the familiar tempo. Computed over bars 1-8 only, the correlations were 0.83 (slow) and 0.86 (fast) versus 0.90 (medium) for LPH, and 0.89 (slow) and 0.86 (fast) versus 0.90 (medium) for BHR. At this more detailed level, then, there is an indication that the pianists were slightly more consistent when they played at their preferred tempo. Clearly, however, they produced highly replicable timing patterns even at novel tempi.

We come now to the crucial question: Were the timing profiles at the slow and fast tempi similar to those at the medium tempo? First, the correlational evidence: The average of the 27 between-tempo correlations (3 x 3 for each of 3 pairs of tempo categories) across entire performances for LPH was 0.90, which was the same as her average within-tempo correlation. For BHR, the analogous correlations were 0.94 and 0.95, respectively. At the more detailed level of bars 1-8, the correlations were both 0.86 for LPH, and 0.87 versus 0.88 for BHR. Thus, between-tempo correlations were essentially as high as within-tempo correlations.

If, as predicted by the relational invariance hypothesis, the intervals at different tempi are proportional, then the average timing profiles should be parallel on a logarithmic scale. This is shown for the first rendition of bars 0-8 in Figure 3. There is indeed a high degree of parallelism between these functions; the few local deviations do not seem to follow an interpretable pattern. These data thus seem consistent with the hypothesis of a multiplicative rate parameter (which becomes additive on a logarithmic scale).
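In symbols (a restatement of the prediction, not an additional result): if every interval at the new tempo is a constant multiple of the corresponding interval at the old tempo, IOI'(i) = k * IOI(i), then log IOI'(i) = log k + log IOI(i), so the log-transformed profiles differ only by the constant vertical offset log k and are therefore parallel.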

Figure 3. Eighth-note timing profiles for the initial 8 bars (first renditions) at three different tempi, averaged across the three performances within each tempo category.


Nevertheless, further statistical analysis revealed that there were some reliable changes in temporal profiles across tempi. Analyses of variance were first conducted on each pianist's complete log-transformed IOI data, with IOI (214 levels; see footnote 4) and tempo (3 levels) as fixed factors, and individual performances nested within tempi (3 levels) as the random factor. If the log-transformed timing profiles are strictly parallel, then the IOI by tempo interaction should be nonsignificant. In other words, the between-tempo variation in profile shape should not significantly exceed the within-tempo variation. For LPH, this was indeed the case [F(426,1278) = 1.03]. For BHR, however, there was a small but significant interaction [F(426,1278) = 1.55, p < .0001]. Moreover, when similar analyses were conducted on bars 0-8, corresponding to the excerpts used in the perceptual test (though with all three performances at each tempo included), the IOI by tempo interaction was significant for both LPH [F(96,288) = 1.57, p < .003] and BHR [F(96,288) = 1.77, p < .0003]. Additional analyses confirmed that BHR showed subtle changes in timing pattern with tempo throughout most of the music, whereas LPH showed no statistically reliable changes with tempo except at the very beginning of the piece.6
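The logic of this test can be sketched in Python (a simplified fixed-effects approximation using statsmodels; it ignores the nesting of performances within tempi that the report's full design includes, and the file name and column layout are assumptions):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Assumed long-format layout: one row per measured interval, with columns
    #   log_ioi  - log-transformed IOI
    #   position - which IOI in the piece (categorical)
    #   tempo    - slow / medium / fast
    df = pd.read_csv("iois_long_format.csv")  # hypothetical file

    model = ols("log_ioi ~ C(position) * C(tempo)", data=df).fit()
    anova = sm.stats.anova_lm(model, typ=2)

    # Strictly parallel log profiles predict a nonsignificant
    # position-by-tempo interaction.
    print(anova.loc["C(position):C(tempo)"])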

Yet another way of analyzing the data revealed, however, that LPH did show some systematic deviations from relational invariance, after all. For pairs of tempi, the log-transformed average IOIs at the faster tempo were subtracted from the corresponding IOIs at the slower tempo, and the correlation between these differences and the IOIs at the slower tempo was examined. The relational invariance hypothesis predicts that the log-difference (i.e., the ratio) between corresponding IOIs should be constant, and hence the correlation with IOI magnitude should be zero. However, LPH showed a highly significant positive correlation for medium versus fast tempo [r(252) = 0.50, p < .0001], which indicates that long IOIs changed disproportionately more than short IOIs. BHR showed a similar but smaller tendency, which was nevertheless significant [r(252) = 0.18, p < .01]. Between the slow and medium tempi, LPH showed a weak tendency in the opposite direction, whereas BHR showed no significant correlation. Very similar results were obtained for bars 1-8 only; thus the correlations did not derive solely from the very long IOIs. Relational invariance evidently held to a greater degree between the slow and medium tempi than between the medium and fast tempi.
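This test is easy to state in code (a sketch with invented profiles in which the long intervals stretch disproportionately, as was found for LPH between the medium and fast tempi):

    import math
    import statistics

    def log_difference_correlation(iois_slow, iois_fast):
        """Correlate the log-IOI difference (slow minus fast) with the slow
        log-IOIs. Under strict relational invariance the difference is a
        constant (the log of the rate factor), so the correlation should
        be near zero."""
        log_slow = [math.log(x) for x in iois_slow]
        log_fast = [math.log(x) for x in iois_fast]
        diffs = [s - f for s, f in zip(log_slow, log_fast)]
        return statistics.correlation(diffs, log_slow)  # Python 3.10+

    fast = [400, 450, 500, 800, 1200]
    slow = [480, 545, 615, 1030, 1600]
    print(log_difference_correlation(slow, fast))  # positive correlation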

Thus there is statistical evidence that relational invariance did not hold strictly, as observed also by Heuer (1991) in his discussion of simpler motor tasks. Still, the timing patterns were highly similar across the different tempi, and the differences that did occur do not suggest different structural interpretations. Moreover, they were not sufficiently salient perceptually to enable listeners to discriminate original performances from performances whose tempo had been modified artificially. (Since there was no evidence that changes with tempo were larger later in the piece than at the beginning, the results of the perceptual test for bars 0-8 probably can be generalized to the whole performance.) It seems fair to conclude, then, that relational invariance of timing profiles held approximately in these performances.

Grace note timing patterns

The analysis so far has focused on eighth-note (and longer) IOIs only. However, "Träumerei" also contains grace notes. The question of interest was whether the timing of the corresponding tones also changed proportionally with overall tempo.

There are two types of grace notes in the piece. Those of the first type, notated as small eighth-notes, are important melody notes occurring during major ritardandi. The slowdown in tempo lets them "slip in" without disturbing the rhythm. One of these notes occurs in bar 8 (between the fourth and fifth eighth-notes); the other is the upbeat to the recapitulation in bar 17 (following the last eighth-note in bar 16). The other type, notated as pairs of small sixteenth-notes, represents written-out left-hand arpeggi (bars 2, 6, and 18) and is played correspondingly faster. While proportional scaling of the first, slow type was to be expected because they are essentially part of the expressive timing profile, the prediction for the second type was less clear. However, neither type is readily classifiable in terms of Desain and Honing's (1992b) appoggiatura-acciaccatura distinction.

Several ANOVAs were conducted on the log-transformed IOIs defined by these grace notes. Separate analyses of bars 8 and 16 included three IOIs: the preceding eighth-note IOI and the two parts of the eighth-note IOI bisected by the onset of the grace note. In neither case was there a significant IOI by tempo interaction.7 Thus the timing of these grace notes scaled proportionally with changes in tempo. LPH and BHR differed in their timing patterns: In bar 8, the grace note occupied 68% (LPH) versus 57% (BHR) of the fourth eighth-note IOI; in bar 16, it occupied 55% (LPH) versus 37% (BHR) of the last eighth-note IOI. BHR thus took the grace-note notation somewhat more literally than did LPH.
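The proportional measure reported above can be sketched as follows (the onset times are invented, and the convention that the grace note "occupies" the interval from its own onset to the next eighth-note onset is an assumption about how the measure is defined):

    # Hypothetical onsets (ms): an eighth-note IOI bisected by a grace note.
    eighth_onset = 2000   # onset of the eighth-note that is bisected
    grace_onset = 2340    # onset of the grace note
    next_onset = 2500     # onset of the following eighth-note

    ioi = next_onset - eighth_onset
    grace_share = (next_onset - grace_onset) / ioi
    print(f"{grace_share:.0%}")   # 32% of the eighth-note IOI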

Bars 2 and 18 were analyzed separately from bar 6, which involved different pitches (cf. Figure 1), though the timing patterns were found to be very similar. In each case there were three IOIs, defined by the onset of the second eighth-note in the right hand, the onsets of the two grace notes in the left hand, and the onset of the highest note of the chord in the right hand. In no case was there a significant IOI by tempo interaction.8 Thus, even for these arpeggio grace notes relational invariance seemed to hold. There were again individual differences between the two pianists: In bars 2, 6, and 18, the two grace notes together took up 68% (LPH) versus 79% (BHR) of the second eighth-note IOI, while the second grace note occupied 63% (LPH) versus 69% (BHR) of the grace note interval. Both pianists maintained their individual patterns across bars 2, 6, and 18. Both individual timing patterns are within the range of typical values observed in Repp's (1992) analyses of famous pianists' performances.9

Chord asynchronies

One of the factors that might enable listeners to discriminate a slowed-down fast performance from an originally slow one is that the asynchronies among nominally simultaneous tones are enlarged in the former. It was surprising, therefore, that there was no indication in the perceptual data that slowed-down fast performances sounded worse than speeded-up slow ones. There is no obvious reason why unintended vertical asynchronies (i.e., purely technical inaccuracies) should scale with tempo. However, asynchronies that are intended for expressive purposes (cf. Vernon, 1937; Palmer, 1989) might possibly increase as tempo decreases. Rasch (1988) reports such a tendency for asynchronies in ensemble playing.

A complete analysis of asynchronies was beyond the scope of this paper. Instead, the analysis focused on selected instances only. One good candidate was the four-tone chord near the beginning of each phrase (see Figure 1). This chord occurs eight times during the piece, seven times in identical form in F major and once (bar 13) transposed up a fourth to B-flat major. It does not contain any melody tones, which eliminates one major motivation for expressive asynchrony (cf. Palmer, 1989). However, it is technically tricky because it involves both hands in interleaved fashion: The first and third tones (numbering from top to bottom) are taken by the right hand, while the second and fourth tones are taken by the left hand. In addition to unintended asynchronies that may be caused by this physical constellation, planned asynchronies may be involved in the proper "voicing" of the chord, which is a sonority of expressive importance.

For each of the eight instances of the chord in each performance, the IOIs for the three lower tones (Tones 2-4) were calculated relative to the highest tone (Tone 1).10 Lag times were positive whereas lead times were negative. The untransformed IOIs were subjected to separate ANOVAs for each pianist, with the factors rendition (i.e., the 8 occurrences of the chord), tone (3), and tempo (3); performances (3, nested within tempo) were the random variable. The question here was whether the IOIs changed at all with tempo.
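Computationally this measure is simple (a sketch with invented onset times):

    # Hypothetical onset times (ms) for one four-tone chord, ordered from
    # the highest (tone 1) to the lowest (tone 4) pitch.
    chord_onsets = [1000, 1012, 998, 1025]

    reference = chord_onsets[0]           # the highest tone
    asynchronies = [t - reference for t in chord_onsets[1:]]

    # Positive values are lags behind the highest tone; negative values
    # are leads.
    print(asynchronies)                   # [12, -2, 25]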

The ANOVA results were similar for the two pianists. Both LPH [F(1,6) = 70.94, p < .0003] and BHR [F(1,6) = 84.71, p < .0002] showed a highly significant grand mean effect, indicating that the average IOI was different from zero: The three lower tones lagged behind the highest tone by an average of 16.9 ms (LPH) and 9.6 ms (BHR), respectively. In addition, each pianist showed a rendition main effect [LPH: F(7,42) = 3.58, p < .005; BHR: F(7,42) = 12.40, p < .0001], a tone main effect [LPH: F(2,12) = 12.80, p < .002; BHR: F(2,12) = 49.18, p < .0001], and a rendition by tone interaction [LPH: F(14,84) = 2.56, p < .005; BHR: F(14,84) = 4.34, p < .0001]. However, no effects involving tempo were significant. In particular, there was no indication that lags were larger at the slow tempo. The high significance levels of the other effects show that the data were not simply too variable for systematic effects of tempo to be found.

The systematic differences in patterns of asynchrony across renditions are of theoretical interest and warrant a brief discussion. They are shown in Figure 4, averaged across all performances. The mean lags for individual tones indicated that, for both pianists, the lowest (fourth) tone lagged behind more than the other two. For BHR, moreover, the third tone did not lag behind at all, on the average. This was clearly a reflection of the fact that it was played with the same hand as the highest (first) tone.


Figure 4. Chord asynchronies relative to the highest tone, averaged across all tempi. The bars for each chord show lag times for tones 2-4 (in order of decreasing pitch) relative to tone 1. "R" after a bar number stands for "repeat."


Both pianists showed the largest lag times in bar 1, which is perhaps a start-up effect of "getting the feel" of the instrument. Again for both pianists, the only rendition in which any lower tones preceded the highest tone was the one in bar 13, which is in a different key. The precedence of the third tone here may be due to the fact that it falls on a black key, which is more elevated than the white keys and hence is reached earlier by the thumb of the right hand. Apart from this exceptional instance, BHR maintained his characteristic 1-3-2-4 pattern across all renditions, though the magnitude of the lags varied somewhat. LPH was more variable, showing an average 1-2-3-4 pattern in four renditions, a 1-3-2-4 pattern in two, and a 1-2-4-3 pattern in one.

It could well be that the chord asynchronies just analyzed did not vary with tempo because they were not governed by expressive intent, only by fingering and hand position. The chord examined did not contain any melody tones and thus did not motivate any highlighting of voices through deliberate timing offsets. Therefore, a second set of chords was examined. The tone clusters on the second and third eighth-notes of bars 10 and 14 (see Figure 1) also contain four tones each (three in one instance), but all of them have melodic function. The soprano voice completes the primary melodic gesture; the alto voice "splits off" into a counterpoint; the tenor voice enters with an imitation of the primary melody, accompanied by the bass voice, which has mainly harmonic function. Thus the relative salience of the voices may be estimated as 1-3-2-4, which may well be reflected in a sensitive pianist's relative timing of the tone onsets. Unlike the chords examined previously, tones 1 and 2 are played by the right hand, whereas tones 3 and 4 are played by the left hand. Since the highlighting of the tenor voice entry (left hand) may be the performer's primary goal, the temporal relationship between the two hands may be the primary variable here.

This indeed turned out to be the case, though with striking differences between the two pianists. LPH changed her pattern of asynchronies radically (and somewhat inconsistently) with tempo in bar 10; otherwise, however, the patterns did not vary significantly with tempo. The main effects of the tone factor, on the other hand, were generally significant, indicating consistency in patterning for each chord.

The results are shown in Figure 5. As can be seen, LPH showed enormous asynchronies which were almost certainly intended for expressive purposes. They alternated between two patterns: the left hand (tones 3 and 4) either lagged behind or led the right hand (tones 1 and 2). The right-before-left pattern occurred in positions 10-2 (slow tempo) and 14-2. The left-before-right pattern occurred in positions 10-3 (at both slow and fast tempo) and 14-3, as well as in position 10-2 at the fast tempo. The medium-tempo patterns in positions 10-2 and 10-3 (not shown in the figure) were split among the two alternatives. The two tones in each hand were roughly simultaneous, except in position 14-3, where the lowest note occurred especially early. The net effect in bar 14 and, less consistently, in bar 10 was that the tenor and bass voices entered late but continued early, so that the first IOI in these voices was considerably shorter than the corresponding IOI in the soprano voice. This makes good sense: while the soprano voice slows down towards the end of a gesture, the tenor voice initiates the imitation with a different temporal regime. The discrepancy thus highlights the independent melodic strains.

The statistical analysis was conducted on the left-hand lag times only, because they showed the major variation. Due to LPH's strategic changes with tempo in bar 10, there were highly significant interactions with tempo in the overall analysis of her data. A separate analysis on bar 14, however, revealed no effects involving tempo. The difference between positions 14-2 and 14-3 was obviously significant, as was the effect of tone [F(1,6) = 19.57, p < .005] and the position by tone interaction [F(1,6) = 28.34, p < .002].
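For readers checking such results, the quoted p values follow directly from the survival function of the F distribution. A minimal sketch, assuming SciPy is available; the F ratios are the ones reported above.

    # Hedged sketch: recover p values from the reported F statistics.
    from scipy.stats import f

    for F_val, df1, df2 in [(19.57, 1, 6), (28.34, 1, 6)]:
        p = f.sf(F_val, df1, df2)  # upper-tail probability
        print(f"F({df1},{df2}) = {F_val}: p = {p:.4f}")
    # Prints p of roughly .0045 and .0018, consistent with the
    # reported p < .005 and p < .002.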

BHR, on the other hand, showed a very different pattern. His asynchronies were much smaller and always showed a slight lag of the left hand, more so in bar 10 than in bar 14. His patterns were highly consistent but revealed little expressive intent, as they are comparable in magnitude to the asynchronies in the nonmelodic chord analyzed earlier. The statistical analysis on the left-hand lag times showed the difference between bars 10 and 14 to be highly reliable [F(1,6) = 24.51, p < .003]. The tendency for the bass voice to lag behind the tenor voice in bar 10, but to lead it in bar 14, was reflected in a significant interaction [F(1,6) = 12.66, p < .02]. No effect involving tempo was significant, however. Thus, neither pianist's patterns scaled with tempo. Relational invariance does not seem to hold at this level; instead, there is absolute invariance across tempi, except for the qualitative changes noted in some of LPH's data.



[Figure 5 appears here: two panels (LPH, BHR); y-axis: lag time (ms); x-axis: tones 1-4 at positions 10-2, 10-3, 14-2, and 14-3 (bar/eighth-note), with tempo conditions marked.]

Figure 5. Chord asynchronies in bars 10 and 14, averaged across all tempi except as noted. The second tone is absent in position 14-2.




Tone overlap

Another detailed feature that was examined, again in a selective manner, was tone overlap. MacKenzie and Van Eerd (1990), in their study of piano scale playing, found that the absolute overlap of successive tones in legato scales decreased as tempo increased. At the slowest tempo (nominally similar to the present slow tempi, but really twice as fast because the scales moved along in sixteenth-notes) successive tones overlapped by about 15 ms, which decreased to near zero as tempo increased. Because of the expressive legato required in "Träumerei," and the slower rate of tones (eighth-notes), larger overlaps were to be expected. The question was whether they got smaller as tempo increased. The measure of overlap was the difference between the offset time of one tone and the onset time of the following tone; a negative value indicates a gap between the tones (which may not be audible because of pedalling).
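The overlap measure is simple enough to state as code. A minimal sketch, not the study's own software; the note list format is an illustrative assumption.

    # Hedged sketch: tone overlap as defined above, i.e. the offset time
    # of one tone minus the onset time of the next. Positive values mean
    # legato overlap; negative values mean a gap (possibly masked by pedal).
    def overlaps(notes):
        """notes: list of (onset_ms, offset_ms) tuples in melodic order."""
        return [prev_off - next_on
                for (_, prev_off), (next_on, _) in zip(notes, notes[1:])]

    melody = [(0, 620), (600, 1210), (1200, 1790), (1800, 2400)]  # hypothetical
    print(overlaps(melody))  # [20, 10, -10]: two overlaps, then a small gap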

The passage selected for analysis was the ascending eighth-note melody in the soprano voice which occurs first in bars 1/2 and recurs in identical or similar form 7 times during the piece (see Figure 1). Because of different fingering and an additional voice in the right hand, the instances in bars 9/10 and 13/14 were not included. Moreover, because overlap times varied wildly in the last interval of bar 22 (preceding the fermata) for both pianists, the data were restricted to the first four instances only (i.e., bars 1/2 and 5/6, and their repeats). Each of these consists of 5 consecutive eighth-notes, yielding 4 overlap times. The last two notes span the interval of a fourth in two instances (bars 1/2) and a major sixth in the other two (bars 5/6); otherwise, the notes are identical. The fingering is likely to be the same. The statistical analyses (on untransformed data) included bars as a factor, as well as renditions (here, strict repeats), tones, and tempo. Again, the question was whether the overlap times varied with tempo at all.

LPH's data were quite a surprise because her overlap times were an order of magnitude larger than expected. Not only did she play legatissimo throughout, but she often held on to the second and fourth tones through the whole duration of the following tone, resulting in overlap times in excess of 1 s. Since these overlap times thus became linked to the regular eighth-note IOIs, it is not surprising that effects of tempo were found. The effects were not simple, however. The ANOVA showed all main effects and interactions involving bar, tone, and tempo (except for the bar by tone interaction) to be significant. The data corresponding to the triple interaction [F(6,18) = 4.04, p < .01] are shown in Figure 6 below. The data are averaged over renditions, which had only a main effect [F(1,6) = 10.30, p < .02], representing an increase in overlap with repetition.

Perhaps the most interesting fact about LPH's data is that the overlap pattern differed between bars 1/2 and 5/6 from the very beginning, even though the two tone sequences were identical up to the last tone. There was slightly more overlap between the first two tones in bars 1/2 than in bars 5/6. While this may not be a significant difference, there is no question that the overlap between the following two tones was much greater in bars 1/2 than in bars 5/6. The same is true for the next two tones, although there is much less overlap in absolute terms. Finally, the situation is reversed for the last two tones, which show enormous overlap in bars 5/6, but much less in bars 1/2. The tempo effects within this striking interaction are fairly consistent (overlap decreasing with increasing tempo), except for the third and fourth tones in bars 1/2, whose overlap increases with tempo, an unexplained anomaly.

BHR's overlap times, in contrast to LPH's, were of the expected magnitude, and there was no significant effect involving tempo. However, the main effects of bar and tone, and the bar by tone interaction [F(3,18) = 24.90, p < .0001] were all highly significant. As for LPH, the difference in final interval size affected the overlap pattern of the whole melodic gesture. BHR showed negative overlap (i.e., a gap, camouflaged by pedalling) between the first two tones, which was larger in bars 5/6 than in bars 1/2. The overlap between the second and third tones was larger in bars 5/6, but that between the third and fourth tones was larger in bars 1/2. Finally, there was overlap between the last two tones in bars 1/2, but a small gap in bars 5/6, probably reflecting the stretch to the larger interval of a sixth. BHR's playing style thus can be seen to be only imperfectly legato, perhaps due to his less developed technique. Nevertheless, he was highly consistent in his imperfections, and his data suggest independence of tempo.

Pedal timing

As the last temporal facet of these performances, we consider the timing of the sustain pedal. Although variations in pedal timing within tone IOIs probably had little if any effect on the acoustic output, they are of interest from the perspective of motor organization and control.


[Figure 6 appears here: two panels (LPH, BHR); y-axis: overlap time (ms); x-axis: tone pairs 1-2 through 4-5 in bars 1/2 and 5/6 at each tempo.]

Figure 6. Tone overlap times in bars 1/2 and 5/6 at s(low), m(edium), and f(ast) tempo.



The pianist must coordinate the foot movements with the hand movements, and this is usually done subconsciously, while attention is directed towards the keyboard.

A complete analysis of the pedalling patterns would lead too far here, as the pedal was used continuously throughout each performance.11 The analysis focused on the opening motive of each phrase, specifically on the interval between the downbeat melody tone and the following 4-note chord, whose highest tone was used as the temporal reference (see Figure 1). This quarter-note IOI, which occurred eight times in the course of each performance (in bars 1, 5, 1R, 5R, 9, 13, 17, and 21, where "R" stands for "repeat"), usually contained a pedal change (i.e., a pedal offset followed by a pedal onset) in both pianists' performances, though BHR's pedal offsets generally fell very close to the onset of the downbeat tone, often preceding it slightly. Exceptions were bars 1, 1R, and 9, which usually contained only a pedal onset, the previous offset being far outside the IOI or absent (at the beginning of the piece). Only the pedal onsets are marked in the score. The pedal serves here to sustain the bass note struck on the downbeat, which a small left hand needs to relinquish in order to reach its portion of the following chord. In bar 13, in fact, the bass note can be sustained only by pedalling, as it occurs in a lower octave.

The question of interest was whether, across the variations in tempo, the pedal actions occurred at a fixed time after the onset of the quarter-note IOI or whether their timing varied in proportion with tempo, so that they occurred at a fixed percentage of the time interval. Figure 7 shows the average quarter-note IOI durations (squares) at the three tempi in the different bars. As expected, there was generally a systematic decrease in IOI duration from slow to medium to fast tempo [LPH: F(2,6) = 12.85, p < .007; BHR: F(2,6) = 88.25, p < .0001]. Both pianists also showed a highly significant main effect of bar on IOI duration [LPH: F(7,42) = 21.36, p < .0001; BHR: F(7,42) = 47.38, p < .0001]: the longest IOIs occurred in bar 17 (the beginning of the recapitulation); LPH also had relatively long IOIs in bar 21 and short ones in bar 1, whereas BHR had relatively long IOIs in bar 13 (the passage in a different key). There was a small tempo by bar interaction for BHR only [F(14,42) = 2.14, p < .03].

Figure 7 also shows average pedal offset and onset times, as well as bass tone offset times. The two pianists differed substantially in their pedalling strategies. LPH's pedal offsets (when present) occurred well after IOI onset, and her pedal onsets occurred around the middle of the IOI. BHR's pedal offsets, on the other hand, occurred at the beginning of the IOI and his pedal onsets were relatively early also. Another striking difference between the two pianists is that LPH held on to the bass tone (as indicated by the absence of dashes in the figure), except in bar 13 where this was physically impossible. BHR, on the other hand, despite his larger hands, always released the bass tone following the pedal depression.

For LPH, pedalling did not seem to vary systematically with tempo.12 This was confirmed in ANOVAs on pedal offset and onset times, as well as on their difference (pedal off time). None of these three variables showed a significant tempo effect or a tempo by bar interaction. However, all three varied significantly across bars [F(4,24) = 8.57, p < .0003; F(7,42) = 15.43, p < .0001; F(4,24) = 10.00, p < .0002]. For BHR, on the other hand, both pedal offset times [F(2,6) = 10.06, p < .02] and pedal onset times [F(2,6) = 6.57, p < .04] decreased as tempo increased, whereas pedal off times did not vary significantly. All three varied significantly across bars [F(4,24) = 6.09, p < .002; F(6,36) = 12.58, p < .0001; F(4,24) = 22.03, p < .0001], even though the differences were much smaller than for LPH. Even more clearly than the pedalling times, BHR's bass tone offset times varied with tempo [F(2,6) = 27.26, p < .001], and also across bars [F(7,42) = 14.14, p < .0001].

To determine whether BHR's timings exhibited relational invariance, they were expressed as percentages of the total IOI and the ANOVAs were repeated. Of course, the pedal offset times, which varied around the IOI onset, could not be effectively relativized in this way, so that the tempo effect persisted [F(2,6) = 9.36, p < .02]. The tempo effect on pedal onset times, on the other hand, disappeared [F(2,6) = 0.37], whereas a tempo effect on pedal off times emerged [F(2,6) = 12.62, p < .008]. Thus BHR's data could be interpreted as either exhibiting relational invariance for pedal onset times, or absolute invariance of pedal off times. There was no question, however, that BHR's bass tone offset times scaled proportionally with tempo, as there was no tempo effect on the percentage scores [F(2,6) = 0.71]. All relativized measures, on the other hand, continued to vary significantly across bars [F(4,24) = 4.62, p < .007; F(6,36) = 17.81, p < .0001; F(4,24) = 23.86, p < .0001; F(7,42) = 18.90, p < .0001]. This was equally true for LPH's data.
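The relativization step itself is a one-line computation. A minimal sketch with hypothetical numbers; the procedure (expressing each event time as a percentage of its quarter-note IOI) is the one described above.

    # Hedged sketch of the relational-invariance test for pedal timing:
    # absolute event times (ms from IOI onset) are re-expressed as
    # percentages of the quarter-note IOI before re-running the ANOVA.
    def relativize(event_time_ms, ioi_ms):
        """Event time as a percentage of the inter-onset interval."""
        return 100.0 * event_time_ms / ioi_ms

    # Hypothetical pedal onsets at a slow and a fast tempo:
    print(relativize(600.0, 1500.0))  # 40.0 (% of a 1500-ms IOI)
    print(relativize(400.0, 1000.0))  # 40.0 (% of a 1000-ms IOI)
    # Equal percentages despite different absolute times indicate
    # relational invariance; equal absolute times with unequal
    # percentages would indicate absolute invariance.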


[Figure 7 appears here: two panels (LPH, BHR); y-axis: time (ms); x-axis: bar number (1, 5, 1R, 5R, 9, 13, 17, 21) at s(low), m(edium), and f(ast) tempo; symbols mark chord onset, bass tone offset, pedal onset, and pedal offset.]

Figure 7. Absolute pedal offset and onset times and bass tone offset times within the first (quarter-note) IOI in each of eight bars, at s(low), m(edium), and f(ast) tempo. Each IOI starts at 0 and ends with the onset of (the highest tone of) the chord. Missing symbols reflect incomplete or absent data.




In summary, these data do not offer much support for the hypothesis of relational invariance in pedal timing across tempo changes. LPH's data were highly variable. BHR's data are consistent with an alternative hypothesis: that the pedal off-on action (a rapid up-down movement with the foot) was independent of tempo but was initiated earlier at the faster tempo. This earlier start may be a reflection of a general increase in muscular tension.

The results are much clearer with regard to the differences in pedalling times across bars. Clearly, these effects are not relationally invariant and thus must be caused by structural or expressive factors that vary from bar to bar. This may be the first time that such systematic variation has been demonstrated in pedalling behavior, an interesting subject for future studies.

Intensity microstructure

A detailed analysis of the intensity microstructure of the performances would lead too far in the present context. To verify, however, that both pianists showed meaningful patterns of intensity variation, Figure 8 shows the intensity profiles (strictly speaking, MIDI velocity profiles) of the melody notes (soprano voice) averaged across the three medium-tempo performances of each pianist (and averaged across the two renditions of bars 1-8). The format is similar to that of Figure 2, with corresponding phrases superimposed.

[Figure 8 appears here: four panels, two per pianist; y-axis: MIDI velocity; x-axis: bar/eighth-note; legend (left-hand panels): bars 0-4, bars 17-20, bars 21-24.]

Figure 8. Average MIDI velocity as a function of metric distance, averaged across three medium-tempo performances. Structurally identical or similar phrases are overlaid. Only adjacent eighth-notes are connected.




The intensity patterns were slightly more variable than the timing patterns. Nevertheless, both pianists showed structurally meaningful patterns that were similar across phrases of similar structure.13 For example, the salient melodic ascent in bars 1-2, 5-6, and so on, always showed a steep rise in intensity which leveled off or fell somewhat as the peak was reached. The motivic chain during the second part of each phrase showed modulations related to the motivic structure, especially in the phrase type shown in the left-hand panels, and a steady decrescendo in the phrase type shown in the right-hand panels. Both pianists played bars 17-20 somewhat more softly than the identical bars 0-4, and ended the final phrase (bars 21-24) even more softly (left-hand panels). LPH, but not BHR, played bars 9-12 more strongly than bars 5-8 (right-hand panels). Both pianists played bars 13-16 more softly than bars 9-12; BHR especially observed the pp marking in bar 12 in the Clara Schumann edition (cf. Figure 1).

Given these fairly orderly patterns, the question of interest was: Did they vary systematically with tempo? Before addressing this issue, however, it may be asked whether the average dynamic level was equal across tempi. According to Todd (1992), a faster tempo may be coupled with a higher intensity.

Figure 9 shows that this was indeed the case in the present performances: for both LPH and BHR, the average intensity of the soprano voice increased from slow to fast and from medium to fast tempo; there is no clear difference between slow and medium for LPH. The magnitude of the change is not large, about 2 MIDI units, which translates into about 0.5 dB (Repp, 1993). However, there is a second systematic trend in the data for both pianists: in the course of the recording session, the average intensity dropped by an amount equal to (BHR) or larger (LPH) than that connected with a tempo change. Interestingly, there was also a tendency for the tempi to slow down across repeated performances (see Repp, in press), but that decrease was much smaller than the difference among the principal tempo categories. Both trends may reflect the pianists' progressive relaxation and adaptation to the instrument; possibly also fatigue. Finally, it is clear from Figure 9 that BHR played more softly, on the average, than LPH.14
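As a rough worked example of that conversion: assuming the locally linear mapping implied by the figures just quoted (about 0.25 dB per MIDI velocity unit in this region; the measured mapping in Repp, 1993, is instrument-dependent and nonlinear over larger ranges), a 2-unit difference comes to about 0.5 dB.

    # Hedged sketch: convert a small MIDI velocity difference to an
    # approximate level difference, assuming ~0.25 dB per velocity unit
    # locally (consistent with the "2 MIDI units ~ 0.5 dB" figure above;
    # the true mapping is instrument-dependent and nonlinear).
    DB_PER_VELOCITY_UNIT = 0.25  # assumption, valid only locally

    def velocity_diff_to_db(delta_velocity):
        return delta_velocity * DB_PER_VELOCITY_UNIT

    print(velocity_diff_to_db(2))  # 0.5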

Because of the within-tempo variation in average intensity, the between-tempo differences were nonsignificant, though they appeared to be systematic. For the same reason, the statistical assessment of variations in intensity patterns across tempi was somewhat problematic, for if that variation was contingent on overall intensity ("level of tension"), there was no reason to expect it to be larger between than within tempi. ANOVAs were nevertheless carried out. Across all melody tones (N=166), the tone by tempo interaction fell short of significance for LPH [F(330,990) = 1.13, p < .08] and was clearly nonsignificant for BHR. Across the melody tones of the excerpt used in the perceptual test (bars 0-8, N=42), the tone by tempo interaction reached significance for LPH [F(82,246) = 1.49, p < .01] but remained clearly nonsignificant for BHR.15

[Figure 9 appears here: y-axis: grand average MIDI velocity; x-axis: tempo (slow, medium, fast); legend: LPH-1, LPH-2, LPH-3, BHR-1, BHR-2, BHR-3.]

Figure 9. Grand average MIDI velocity as a function of tempo, separately for each pianist's three performances at each tempo.



Figure 10 shows the average intensity profiles for the melody tones in bars 0-8 at the three tempi for each pianist. The similarities among the three profiles are far more striking than the differences. For LPH, one source of the significant tone by tempo interaction is evident at the very beginning: the dynamic change from the upbeat to the downbeat was far larger at the fast tempo than at the slow tempo. (Curiously, BHR showed the opposite.) On the whole, however, the intensity profiles remained invariant across tempi.

[Figure 10 appears here: two panels (LPH, BHR); y-axis: MIDI velocity; x-axis: bar/eighth-note for bars 0-8; legend: slow, medium, fast.]

Figure 10. Average MIDI velocity profiles for bars 0-8 at three different tempi.



GENERAL DISCUSSION

The present study explored in a preliminary way the question of whether the expressive microstructure of a music performance remains relationally invariant across moderate changes in tempo. The results suggest that the onset timing and intensity profiles essentially retain their shapes (the latter being very nearly constant), whereas other temporal features are largely independent of tempo and hence exhibit local constancy rather than global relational invariance.

After a perceptual demonstration that a uniform tempo transformation does no striking damage to the expressive quality of performances of Schumann's "Träumerei," six specific aspects of the expressive microstructure were examined. The first and most important one was the pattern of tone onset times. That pattern was highly similar across tempi, yet deviated significantly from proportionality. Although relational invariance did not hold perfectly, the deviations seemed small and not readily interpretable in terms of structural reorganization. The second aspect examined was the timing of grace notes. Somewhat surprisingly, in view of the observations by Desain and Honing (1992a, 1992b) that stimulated the present study, grace notes were timed in a relationally invariant manner across tempi. This may have been due to their relatively slow tempo and expressive function; the result may not generalize to ornaments that are executed more rapidly. The third aspect concerned selected instances of onset asynchrony among the tones of chords. These asynchronies did not change significantly with tempo, even when they were rather large and served an expressive function; apparently, they were "anchored" to a reference tone. The fourth aspect was the overlap among successive legato tones. When this overlap was of the expected magnitude (pianist BHR), it was independent of tempo. For one pianist (LPH), the overlaps were unusually large and did vary with tempo. The fifth aspect examined was pedal timing. It did not seem to depend on tempo in a very systematic way and had no audible effect in any case. Finally, the intensity microstructure was studied. Although overall intensity increased somewhat with tempo, the detailed intensity pattern remained nearly invariant.

In no case were there striking deviations from relational invariance. Those features that were found not to scale proportionally with tempo were generally on a smaller time scale, so that artificially introduced proportionality (in the perceptual test) was difficult to detect. Because of the limited contribution of these details to the overall impression of the performances, it may be concluded that relational invariance of expressive microstructure held approximately across the tempi investigated here. This is consistent with the notion that musical performance is governed by a "generalized motor program" (Schmidt, 1975) or "interpretation" that includes a variable rate parameter. Although there are good reasons for expecting the interpretation to change when tempo changes are large (Clarke, 1985; Handel, 1986), the tempo range investigated here apparently could accommodate a single structural interpretation.
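The rate-parameter notion can be made concrete with a toy model. A minimal sketch with hypothetical numbers, not the author's model: under strict relational invariance, one rate factor scales every IOI in a stored timing profile, leaving the profile's internal ratios untouched.

    # Hedged toy model of a "generalized motor program" with a variable
    # rate parameter: a stored relative timing profile is scaled by a
    # single factor to yield performances at different tempi.
    profile = [1.00, 0.95, 1.05, 1.20, 0.90]  # hypothetical relative IOIs

    def render(profile, base_ioi_ms, rate):
        """Scale the whole profile by one rate parameter (rate < 1 = faster)."""
        return [base_ioi_ms * rate * r for r in profile]

    slow = render(profile, 600.0, 1.0)
    fast = render(profile, 600.0, 0.8)
    print([round(f / s, 3) for f, s in zip(fast, slow)])  # all 0.8
    # Every fast IOI is exactly 80% of its slow counterpart, so within-
    # performance timing ratios are preserved: relational invariance.
    # The deviations reported above mean real performances obey this
    # only approximately.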

All microstructural patterns examined exhibited large and highly systematic variation as a function of musical structure, as well as clear individual differences between the two pianists. While the data on timing microstructure (including grace notes) are consistent with the detailed analyses in Repp (1992), the observations on chord asynchrony, tone overlap, pedalling, and intensity microstructure go well beyond these earlier analyses. However, since the focus of the present study was on effects of tempo, the structural interpretation of microstructural variation was not pursued in great detail here. Suffice it to note once again the pervasive influence of musical structure on performance parameters and the astounding variety of individual implementations of what appear to be qualitatively similar structural representations.

Despite certain methodological parallels, the present study is diametrically opposed to the scale-playing experiment of MacKenzie and Van Eerd (1990), which essentially focused on pianists' inability to achieve a technical goal (viz., mechanical exactness). By contrast, the primary concern here was pianists' success in realizing the artistic goal of expressive performance. The relative invariances observed in timing and intensity patterns were cognitively governed, not due to kinematic constraints. However, both types of studies have their justification and provide complementary information. At some of the detailed levels examined here (chord asynchronies, tone overlap) motoric constraints such as fingering patterns clearly had an influence. Thus, even within a slow, highly expressive performance, there are aspects that are governed to some extent by "raw technique," and ultimately the technical skills that help an artist to achieve subtlety of expression may be the very same that help her/him to play scales smoothly.



However, the technique serves its true purpose only in the context of an expressive performance and therefore is perhaps better studied in that context also.

The present study has limitations, the most obvious of which is that only a single piece of music was examined. Clearly, the character of "Träumerei" is radically different from that of Salieri's "Nel cor più non mi sento," whose timing microstructure Desain and Honing (1992a) found to change considerably with tempo. "Träumerei" is slow, expressive, entirely legato, rhythmically vague, and contains few fast notes; even the grace notes are timed deliberately. The Salieri ditty, on the other hand, is moderately fast, requires a detached articulation and rhythmical precision, and contains a number of very fast grace notes (acciaccature) that may resist further reduction in duration. Clearly, it would be wrong to conclude from the present findings that expressive microstructure always remains invariant across tempo changes, even to a first approximation. Desain and Honing's results, in spite of their informality, already seem to provide a counterexample.16 So do perhaps Clarke's (1982, 1985) findings. The conclusion seems justified, however, that in some types of music relational invariance is maintained to a large extent. This music is probably the kind that carries expressive gestures on an even rhythmic flow. Rhythmically differentiated music containing many short note values, rests, and varied articulation (such as examined by Desain and Honing, 1992a) is probably much more susceptible to changes with tempo.

In fact, it would be both more precise and more cautious to say that relational invariance can be maintained in certain kinds of music. The present pianists, although they did not try very consciously to maintain their "interpretation" across tempi but instead attempted to play as naturally and expressively as possible, nevertheless refrained from "changing their interpretation" in the course of the recording session, whatever that may imply in an objective sense. In real-life music performance, on the other hand, changes in tempo are probably more often a consequence of a change in an artist's conception of the music than a deliberate change in the tempo as such. Tempo differences observed among different artists are often to be understood in this way. What the present findings imply, however, is that if different artists were forced to play at the same overall tempo, without giving them time to reconsider their "interpretation" of the piece, their expressive microstructure would probably be just as different as it was before the tempi were equalized. In other words, tempo as such probably accounts only for a very small proportion of the variance among different performers' interpretations. Or, to put it yet another way, interpretation usually influences tempo, but tempo does not necessarily influence interpretation.

It will of course be desirable to employ a larger number of musicians in future studies, as well as a better instrument. The present two pianists' performances, however, seemed adequate for the purpose of the study. The fact that one pianist (BHR) was not professionally trained is not perceived to be a problem in view of the fact that his data were generally more consistent and less variable than those of the professional pianist, LPH. LPH's greater variability may be attributed to three sources: First, she had less time to adapt to the instrument; second, she had not played "Träumerei" recently; and third, as a professional artist she was not obliged to obey the constraints of a quasi-experimental situation. BHR, on the other hand, was more familiar with the Roland keyboard, had practiced Schumann's "Kinderszenen" less than a year ago, and, as an experimental psychologist, was used to exhibiting consistent behavior in the laboratory.17

In summary, the present findings suggest that for certain types of music it is possible to change the tempo within reasonable limits, either naturally in performance or artificially in a computer, without changing the quality and expression of a performance significantly. This suggests that the very complex motor plan that underlies artistic music performance contains something like a variable rate parameter that permits it to be executed at a variety of tempi while serving the same underlying cognitive structure. Various small-scale performance details, however, may be locally controlled and unaffected by the rate parameter. Thus local constancy appears to be nested within global flexibility in music performance.

REFERENCES

Clarke, E. F. (1982). Timing in the performance of Erik Satie's 'Vexations'. Acta Psychologica, 50, 1-19.

Clarke, E. F. (1985). Structure and expression in rhythmic performance. In P. Howell, I. Cross, & R. West (Eds.), Musical structure and cognition (pp. 209-236). London: Academic Press.

Clynes, M., & Walker, J. (1986). Music as time's measure. Music Perception, 4, 85-120.

Desain, P., & Honing, H. (1992a). Tempo curves considered harmful. In P. Desain & H. Honing (Eds.), Music, mind, and machine (pp. 25-40). Amsterdam: Thesis Publishers.

Desain, P., & Honing, H. (1992b). Towards a calculus for expressive timing in music. In P. Desain & H. Honing (Eds.), Music, mind, and machine (pp. 175-214). Amsterdam: Thesis Publishers.

Desain, P., & Honing, H. (in press). Does expressive timing in music performance indeed scale proportionally with tempo? Psychological Research.

Gentner, D. R. (1987). Timing of skilled motor performance: Tests of the proportional duration model. Psychological Review, 94, 255-276.

Guttmann, A. (1932). Das Tempo und seine Variationsbreite. Archiv für die gesamte Psychologie, 85, 331-350.

Handel, S. (1986). Tempo in rhythm: Comments on Sidnell. Psychomusicology, 6, 19-23.

Heuer, H. (1991). Invariant relative timing in motor-program theory. In J. Fagard & P. H. Wolff (Eds.), The development of timing control and temporal organization in coordinated action (pp. 37-68). Amsterdam: Elsevier.

MacKenzie, C. L., & Van Eerd, D. L. (1990). Rhythmic precision in the performance of piano scales: Motor psychophysics and motor programming. In M. Jeannerod (Ed.), Attention and performance XIII (pp. 375-408). Hillsdale, NJ: Erlbaum.

Michon, J. A. (1974). Programs and "programs" for sequential patterns in motor behavior. Brain Research, 71, 413-424.

Palmer, C. (1989). Mapping musical thought to musical performance. Journal of Experimental Psychology: Human Perception and Performance, 15, 331-346.

Rasch, R. A. (1988). Timing and synchronization in ensemble performance. In J. A. Sloboda (Ed.), Generative processes in music (pp. 70-90). Oxford, UK: Clarendon Press.

Repp, B. H. (1990). Patterns of expressive timing in performances of a Beethoven minuet by nineteen famous pianists. Journal of the Acoustical Society of America, 88, 622-641.

Repp, B. H. (1992). Diversity and commonality in music performance: An analysis of timing microstructure in Schumann's "Träumerei." Journal of the Acoustical Society of America, 92, 2546-2568.

Repp, B. H. (1993). Some empirical observations on sound level properties of recorded piano tones. Journal of the Acoustical Society of America, 93, 1136-1144.

Repp, B. H. (in press). On determining the global tempo of a temporally modulated music performance. Psychology of Music.

Rorem, N. (1983). Composer and performance. (Originally published in 1967.) In N. Rorem, Setting the tone (pp. 324-333). New York: Coward-McCann, Inc.

Schmidt, R. A. (1975). A schema theory of discrete motor skill learning. Psychological Review, 82, 225-260.

Seashore, C. E. (1938). Psychology of music. New York: McGraw-Hill. (Reprinted by Dover Publications, 1967.)

Shaffer, L. H. (1980). Analysing piano performance: A study of concert pianists. In G. E. Stelmach & J. Requin (Eds.), Tutorials in motor behavior (pp. 443-455). Amsterdam: North-Holland.

Shaffer, L. H. (1981). Performances of Chopin, Bach, and Bartok: Studies in motor programming. Cognitive Psychology, 13, 326-376.

Shaffer, L. H. (1984). Timing in solo and duet piano performances. Quarterly Journal of Experimental Psychology, 36A, 577-595.

Shaffer, L. H. (1992). How to interpret music. In M. R. Jones & S. Holleran (Eds.), Cognitive bases of musical communication (pp. 263-278). Washington, DC: American Psychological Association.

Shaffer, L. H., Clarke, E. F., & Todd, N. P. (1985). Metre and rhythm in piano playing. Cognition, 20, 61-77.

Todd, N. P. McA. (1992). The dynamics of dynamics: A model of musical expression. Journal of the Acoustical Society of America, 91, 3540-3550.

Vernon, L. N. (1937). Synchronization of chords in artistic piano music. In C. E. Seashore (Ed.), Objective analysis of musical performance (pp. 306-345). University of Iowa Studies in the Psychology of Music, Vol. 4. Iowa City: University of Iowa Press.

Viviani, P., & Laissard, G. (1991). Timing control in motor sequences. In J. Fagard & P. H. Wolff (Eds.), The development of timing control and temporal organization in coordinated action (pp. 1-36). Amsterdam: Elsevier.

FOOTNOTES

*Psychological Research, in press.

1For temporal intervals to be proportional at different tempi, their absolute differences must be larger at slow than at fast tempi. Clarke mentions that an initial analysis yielded "identical" results for raw and log-transformed data, but it is not clear whether that was also true in the later analysis just referred to.

2The pianists' accuracy in implementing the desired tempi will not concern us here; this issue is the subject of a separate paper (Repp, in press). Naturally, the three performances within each tempo category were not exactly equal in tempo.

3These values represent the grand mean effects in one-way ANOVAs on the average scores minus 50%, equivalent to one-sample t-tests. The second test was conducted on the average scores of pairs of pairs containing the same type of comparison in opposite order (e.g., LPH's first medium-tempo performance followed by the speeded-up version of her first slow performance, and the speeded-up version of her second slow performance followed by her second medium-tempo performance).

4All correlations reported include multiple data points for IOIs longer than one eighth-note (cf. Figure 2), which is equivalent to weighting longer IOIs in proportion to their duration. This seems a defensible procedure in comparing profiles, and omission of these extra points had a negligible effect on the magnitude of the correlations. However, in the ANOVAs reported further below, only a single data point was used for each long IOI, so as not to inflate the degrees of freedom for significance tests.

5Two differences between LPH and BHR are worth noting here: LPH prolonged the interval preceding the fermata chord much more than did BHR, and she did not implement the gestural boundary (i.e., the "comma" in the score) in bar 24, executing a continuous ritardando instead. Contrary to the score, LPH generally arpeggiated the chord under the fermata; hence the long IOI, which was measured to the last (highest) note of the chord.

6One possible reason why LPH showed no significant interactions is that her tempo variation within tempo categories was larger than that of BHR (see Repp, in press), so that the separation of within- and between-tempo variation was less clear in her data.

7In bar 8, BHR showed a triple interaction of rendition (first vs. second playing), IOI, and tempo [F(4,12) = 7.72, p < .003], due to more even timing of the grace note in the first than in the second rendition, at the slow tempo only.

8BHR showed a marginally significant (but not easily characterizable) triple interaction with rendition in bar 6 [F(4,12) = 3.42, p < .05].

9In terms of the IOI ratios plotted in Repp's (1992) Figure 11, the respective coordinate values are approximately 0.5 and 0.6 for LPH, and 0.25 and 0.45 for BHR.




10The measurements were coarse because the 5-ms temporal resolution was poor relative to the small size of these intervals. However, the availability of repeated measurements within and across performances attenuated this problem considerably.

"It is common in piano performance for the pedal to be usedmore frequently than indicated in the score. It also should beduly noted that the pedal in this instance was a simple footswitch that did not permit the subtlety of pedalling possible ona real piano.

12LPH's pedalling times were much more variable than BHR's; this was also true within each tempo. The only substantial deviation for BHR occurred in bar 1, where pedal onset occurred at the beginning of the IOI in the absence of a preceding pedal offset. (In two instances at the medium tempo, not shown in the figure, there was a preceding pedal depression, and offset and onset times were similar to those in other bars.) Because of this deviation, BHR's bar 1 was omitted from further analyses of pedalling times. BHR never had a pedal offset near or within the IOI in bars 1R or 9, whereas LPH did some of the time; however, her following pedal onset times did not seem to depend on whether there was a pedal offset nearby.

13The intensity profiles of the expert performances studied by Repp (1992) are not available for comparison, as they are very difficult to determine accurately from acoustic recordings.

14This may be true only for the melody notes analyzed here, which indeed seemed to "stand out" more from among the four voices in LPH's performance.

15Surprisingly, however, there was a significant tempo main effect for BHR [F(2,6) = 12.78, p < .007], suggesting that the within-tempo changes in overall intensity evident in Figure 9 developed only later during his performances. For LPH, on the contrary, the between-tempo variation was conspicuously smaller than the within-tempo variation [F(2,6) = 0.05, p > .95] during the initial 8 bars.

16Their data have been augmented in the meantime and are presented in Desain and Honing (in press).

17For those who are suspicious of authors serving as their own subjects, it might be added that BHR played with music, not hypotheses, in his mind.



Haskins Laboratories Status Report on Speech Research
1993, SR-113, 197-208

A Review of Psycholinguistic Implications for Linguistic Relativity: A Case Study of Chinese

by Rumjahn Hoosain*

The content, and indeed the title, of this book brings back to us an old and well-known theory, the Sapir-Whorf hypothesis: that language influences thought and determines the world view of the speaker (Whorf, 1956). Rumjahn Hoosain proposes that "unique linguistic properties of the Chinese language, particularly its orthography, have implications for cognitive processing."

Although it does not seem to be his intention to revive the Sapir-Whorf hypothesis in its original form, Hoosain clearly displays his desire to salvage the theory through extension or revision. This is demonstrated by his remark in Chapter 1 that "the influence of orthography on cognition, particularly the influence of the manner in which the script represents sound and meaning, was not originally recognized in the work of Sapir and Whorf." Thus, Hoosain seems to be trying to keep the spirit of the Sapir-Whorf hypothesis alive while applying it more specifically to the effect of the orthography and particularly to the case of Chinese. This effect is, as he describes it, "more in terms of the facility with which language users process information, and the manner of information processing." Therefore, "such language effects matter more in data-driven or bottom-up processes rather than in conceptually driven or top-down processes."

Yi Xu

I am grateful to Ignatius G. Mattingly, Alvin M. Liberman, and Michael T. Turvey for their comments on an earlier version of this review.

Though not a large volume (198 pages), the book covers an impressively large number of psycholinguistic studies related to the Chinese language1 and orthography, especially those presented at five conferences held in Taiwan and Hong Kong in the last dozen years. The reviews of those studies are spread over six chapters in the book, including an introduction in Chapter 1 and a conclusion in Chapter 6. The second chapter provides an introduction to general aspects of the Chinese language; and the third, fourth, and fifth chapters focus on, respectively, perceptual, memory-related, and neurolinguistic aspects.

Chapter 2 provides background information about the Chinese language, especially about its orthography. This chapter is necessary because there have been many myths and misconceptions about Chinese and its orthography that need to be clarified. Hoosain's description of Chinese, in general, is reasonable. However, there are certain remarks that are misleading and are still characteristic of common misunderstandings.

In 2.3, for example, Hoosain states that "a small percentage of the (Chinese) characters convey meaning by pictographic representation, either iconic or abstract," though he qualifies this by noting that "whereas these pictographic characters have an etymology related to pictures, this relation is unlikely to have psychological reality in present day usage." The critical issue here is the use of the phrase "convey meaning by pictographic representation." What a character in Chinese represents is a linguistic unit, in the case of the character for niao3 2 ("bird," used by Hoosain as an illustration of pictographic characters), a monosyllabic word. The word is a unit in the language, an entry in the lexicon. A lexical entry carries meaning, whereas a character is nothing but a visual symbol created and used to refer to a lexical item. The character by itself carries no meaning whatsoever. It conveys meaning only through its reference to the linguistic unit it represents. Thus, the character for niao3 conveys the concept of "bird," not because it has the shape of a bird (otherwise, any picture of a bird would do), or because it is associated with the idea "bird," but because that shape is agreed by convention of the Chinese literate community to refer to a spoken word with the pronunciation /niao3/ (in present day Mandarin), which in turn refers to the concept of a feathered animal usually with flying ability.3 On the other hand, real pictures of birds, whether vivid or sketchy, although they may well convey the meaning of bird directly, are not part of the orthography, and thus are completely different entities from the bird-shaped character for the word niao3. The character for niao3 is (or has been) "pictographic" only in the sense that it was created by depicting the shape of a bird, and the pictographic structure of the character may have served as a mnemonic for the word. So, what can never be overemphasized is the fact that the bird-shaped character represents a LINGUISTIC unit, the spoken word for "bird" in Chinese. Therefore, a more accurate definition for so-called pictographic characters would be: they are characters created to represent spoken names of objects by depicting the visual pattern of the object the spoken names refer to. The shapes of the characters, when they are still reminiscent of the object the corresponding spoken words refer to, can be used as mnemonics for the spoken words.

In 2.5, while rejecting as a misconception the claim that characters in Chinese are logograms in the sense that each of them represents a whole word, Hoosain insists that "the meaning of a character is represented by the character directly, not mediated by sound symbols, and a syllable is indicated as a whole rather than spelled out, as in alphabetic languages." There are two confusions in this statement. First, as discussed above, Hoosain takes orthographic symbols as the meaning-carrying units by themselves. Second, by stating that "a syllable is indicated as a whole rather than spelled out," Hoosain sounds as if phonemes were the only possible meaningless linguistic units that can be represented by orthographic symbols. Apparently, Hoosain does not realize (or often forgets) that syllables are also meaningless units that can be used to construct (i.e., to spell) meaningful morphemes and words, and it is the syllable sign that is the basic graphemic unit in the Chinese orthography (DeFrancis, 1984, 1989; Mattingly, 1992). Instead, Hoosain takes the characters as the smallest graphemic unit in the Chinese writing system. As pointed out by DeFrancis (1989), the basic graphemes in Chinese are characters that singly represent whole syllables. They may themselves constitute frames, i.e., graphic units that are separated by white spaces in writing, or combine with other nonphonetic elements to form more complex characters constituting frames. Those graphemes, i.e., characters that provide phonological values when combining with other graphic elements to form other characters, are usually called phonetics, or phonetic radicals.

In his discussion in 2.4, Hoosain reports that there are about 800 phonetics in Chinese, and about 90% of Chinese characters are phonetic compounds. But in the discussion that follows, he belittles the function of the phonetics in Chinese orthography by emphasizing the irregularities and unreliabilities of their phonetic values, and by citing studies that show the semantic cueing function of the radical to be greater than the phonemic cueing function of the phonetic. It is not made clear by Hoosain what exactly is meant by "cueing function" in the unpublished study he cites. The other two studies (Yau, 1982; Peng, 1982) Hoosain cites show that semantic radicals tend to be located at the left or the top of the characters. "Because," Hoosain argues, "character strokes tend to be written left to right and top down, the beginning strokes therefore are more discriminative." Being more discriminative, however, is different from being more crucial. As a matter of fact, the common use of the word 'determinative' to refer to the semantic radicals in Chinese characters implies that their function is to help distinguish between characters that are otherwise homophones because they share the same phonetic radical. The premise here is that the phonological values of the phonetic radicals are already being used to the fullest possible extent. Of course, there do exist many irregularities and unpredictabilities in the phonological values of the phonetic radicals in Chinese. What is wrong in Hoosain's account of the Chinese orthography is that he is using those irregularities to deny the basic organization principles in the construction of the characters. This is like denying the phonemic principle in the construction of English words merely because there are tremendous irregularities in the letter-phoneme correspondence.

Psychological studies of languages and reading cannot and should not be totally independent of the knowledge gained over the years about the basic principles of language and speech and of writing systems. Hoosain's failure to fully appreciate those principles is reflected in his view that Chinese characters represent both meaning and sound directly, which is best illustrated in the diagram on page 12 of his book.

English: signs (script) --> sound --> meaning

Chinese: sign (script) --> sound
                       \--> meaning

Notice here his ambiguous use of the words 'sign' and 'sound.' Apparently, by 'sign,' he means letters for English but characters for Chinese. By 'sound,' he seems to mean phonemes for English but syllables for Chinese. As discussed above, while letters are the smallest graphemic units in English, characters are not the smallest graphemic units in Chinese. The smallest graphemic units in Chinese are the phonetic radicals in characters.4 So, it is the phonetic radicals that map onto syllables directly (although no more accurately than letters map onto phonemes in English), not the characters. Likewise, it is the morphemes that characters map onto, not sound.

As we will see, the confusion in the understanding of the linguistic and orthographic aspects of the Chinese language is bound to affect Hoosain's interpretation of the studies he reviews.

Chapter 3 reviews studies concerned with perceptual aspects of the Chinese language. The conclusion reached by Hoosain in this chapter is that "the distinctive configuration of the Chinese script as well as its script-sound and script-meaning relation can differentially affect perceptual processes, compared with similar processes in English." The effects include: "(a) lack of acuity difference for characters arranged in different orientations; (b) preference of direction of scan according to language experience; (c) the individuality of constituent characters of disyllabic words in some situations; (d) the more direct access to meaning of individual words, although access to sound could require a different effort due to the lack of grapheme-phoneme conversion rules; and (e) the variation in eye movements in reading connected with different manners of reading." Due to lack of space, and because they are not essential concerns in the study of reading, the first two effects and the last effect will not be discussed in this review. The only thing that needs to be pointed out is that letters in English and characters in Chinese are at different levels of linguistic representation in the two orthographies, and, therefore, cannot be equated. In other words, differences found between the two may not always unambiguously point to critical differences between the reading mechanisms of the two languages.

When discussing the size of perceptual units in reading, it is essential to make sure the units being discussed are unambiguous. That is to say, when characters are referred to, we should not sometimes mean words and sometimes morphemes, and we should not equate them with other units, say, letters. This is not always easy, because in Chinese morphemes and words often coincide. This difficulty, however, makes it all the more important for a reviewer of reading studies to watch out for dubious hidden assumptions ignoring this distinction, and for flawed designs as well as improper interpretations based on those assumptions. This is, however, exactly what Hoosain often fails to do as a reviewer.

For example, in his discussion of the study by C. M. Cheng (1981) on word superiority, Hoosain concludes that characters in disyllabic Chinese words have more perceptual salience than letters in English, and he calls this phenomenon a "contrast between the two languages." As discussed above, characters in Chinese correspond to morphemes, and morphemes are the smallest meaning carrying units in languages. Letters in English, however, correspond to phonemes, i.e., the smallest distinctive segments in speech, though the correspondence is often inconsistent. Phonemes do not carry meaning by themselves. They have to combine with other phonemes to form meaning-carrying morphemes. So, perceptual differences found between characters in Chinese and letters in English should be regarded as a reflection of the differences between the two types of linguistic units tested, rather than as an indication of any true difference between the psycholinguistic processing of the two languages.

Liu (1988) found that a Chinese character standing alone was named as fast as the same character occurring as a constituent in a two-character word. In contrast, a simple word in English, such as air, is named slower in isolation than as a constituent of a compound word, such as airways. Hoosain interprets these findings as an indication that "constituent characters of Chinese words are handled somewhat independently, rather than as integral parts of words." However, there are several problems with Liu's study. First, Liu did not mention whether the constituent characters of his Chinese stimuli could also function as words by themselves. (No sample list of the Chinese words used was provided.) Second, neither word frequency in English and Chinese nor character frequency in Chinese was controlled in the study, which makes any differences found between naming latencies of compound words and their constituents suspect. Third, native speakers of Chinese who had also learned to speak English were used as subjects for both Chinese and English. This use of non-native speakers of English for reading English in comparison with native speakers of Chinese reading Chinese is problematic. Fourth, as Hoosain points out himself, the English compounds like baseball were all presented with a hyphen in the middle, no matter whether they are conventionally printed that way or not. There is no telling what effect this arbitrary manipulation may have had. So, the results of this study as evidence of "another difference between Chinese and English" are, to say the least, inadequate.

In summarizing the section on perceptual units, Hoosain concludes that "particularly in cases where component characters are simple and highly familiar, or when individual characters have to be pronounced, there is some degree of individual salience for these characters that is not likely to be found in the case of bound morphemes in English. This individual salience has to do with the character as sensory unit." While the statement about the salience of individual characters may be true to a certain extent, the comparison between characters in Chinese and bound morphemes in English is inappropriate. In Chinese, most of the high frequency characters can function as monosyllabic words by themselves. Of the 1185 characters with a frequency of occurrence of 100 or higher per million characters, only 44, i.e., 3.7%, cannot function as monosyllabic words (Wang et al., 1986). The only comparable situation in English is in cases where words are also used as constituents in compound words. But the only study Hoosain cites that made this comparison is flawed. Without better evidence, therefore, the best conclusion that can be drawn about this matter is that further studies are needed.
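
The arithmetic behind the 3.7% figure, and the kind of screening it implies, can be made explicit with a minimal sketch in Python. The toy entries below are hypothetical stand-ins for the Wang et al. (1986) frequency dictionary, not actual data from it; only the final computation reflects the counts reported above.

# Illustrative only: toy stand-ins for entries in a character frequency
# dictionary such as Wang et al. (1986). Fields: character (romanized here),
# frequency per million characters, and whether it can stand alone as a
# monosyllabic word.
entries = [
    ("de", 4000, True),    # hypothetical free character
    ("mei", 120, True),    # hypothetical free character
    ("men", 2500, False),  # hypothetical bound-only character
]

high_freq = [e for e in entries if e[1] >= 100]
bound_only = [e for e in high_freq if not e[2]]
print(f"{len(bound_only)} of {len(high_freq)} high-frequency characters are bound-only")

# With the counts actually reported in the review (44 of 1185):
print(f"44/1185 = {44/1185:.1%}")  # prints 3.7%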

In the discussion in 3.3 on getting at the sound of words in Chinese, Hoosain insists that "the only way to get at the sound of Chinese characters is through lexical information, whether directly or indirectly." His major argument is that in Chinese characters, "the phonetics are characters on their own, and their own pronunciations have to be learned on a character-as-a-whole-to-sound-as-a-whole basis and not through grapheme-phoneme conversion rules." Here Hoosain seems to confuse the difference in level of representation of speech sound with the presence or absence of grapheme-speech-sound correspondence. The lack of grapheme-phoneme correspondence in Chinese does not mean the lack of grapheme-speech-sound correspondence altogether. Graphemes in Chinese correspond to speech sound at the level of the syllable (DeFrancis, 1989; Mattingly, 1992). This correspondence is not one to one, but neither is the correspondence between letters and phonemes one to one in English. This confusion between levels of representation and the existence of grapheme-speech-sound correspondence determines Hoosain's bias when reviewing studies on phonological access in reading Chinese.

In examining studies on naming latency for characters, Hoosain focuses on Seidenberg (1985). In that study, it was found that naming times were faster for phonetic compounds than for nonphonetic compounds, but only for low frequency characters and not for high frequency characters. Seidenberg also did a parallel study with English, in which he found that words with exceptional pronunciations were named more slowly than those with regular pronunciations. But again, this was found to be true only for low frequency words. Seidenberg took these results to mean that, for both Chinese and English, access to the pronunciation is through the lexicon for high frequency words and through grapheme-to-sound conversion for low frequency words.

Judging from the example given in Seidenberg's paper, there are problems in the material he used in his Chinese experiment. Both of the so-called nonphonetic compound characters can be used as phonetic radicals in many other characters. So, they are in principle also phonetic compounds, except that their semantic radicals are nil. Besides, Seidenberg did not control the frequency or the phonological consistency of the phonetic radicals used in his study. Those problems make Seidenberg's finding questionable.

Hoosain also notices some of the above-mentioned problems in Seidenberg's study, but he questions the results from a different direction. He argues that access to whatever phonological information is provided by the phonetic radicals in Chinese cannot be compared to grapheme-phoneme conversion in alphabetic languages. And he raises an intriguing question: "How can it be argued that access to the sound of such nonphonetic compounds (when they stand alone) is through lexical information, but access to the sound of phonetic compounds (which are actually made up of these same nonphonetic compounds now acting as phonetics) is through grapheme-to-phoneme conversion? After all, how is the sound of these phonetics arrived at in the first place?"

While this question is well posed, Hoosain's answer is problematic. He insists that access to the pronunciations of all the characters in Chinese is through the lexicon, either directly or indirectly. This would imply that the naming latency should be the same whether the phonetic radicals have consistent or inconsistent phonological values in the characters that they are part of. However, this prediction does not agree with the findings of Fang, Horng, and Tzeng (1986), which are also mentioned by Hoosain. They demonstrated that high consistency in pronunciation shared among characters with the same phonetic radical facilitated naming latency for simple characters as well as for compound characters and pseudocharacters. This consistency effect was shown to be comparable to a similar effect found in naming latency for English words (Glushko, 1979).
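
One simple way to quantify the consistency variable at issue is the proportion of characters sharing a phonetic radical that also share the family's dominant pronunciation. The sketch below is in the spirit of Fang et al. (1986) and Glushko (1979), not their actual procedure; the mini-lexicon and its toneless pinyin values are illustrative only.

from collections import Counter

# Hypothetical mini-lexicon: characters containing the phonetic radical
# "qing," mapped to toneless pinyin pronunciations (illustrative values;
# the keys are mnemonic labels, not real orthographic forms).
qing_family = {
    "qing_clear": "qing",
    "qing_fine_weather": "qing",
    "qing_request": "qing",
    "qing_feeling": "qing",
    "jing_essence": "jing",
    "cai_guess": "cai",
}

def consistency(family):
    """Proportion of a phonetic-radical family sharing the family's most
    common pronunciation; higher values should predict faster naming if
    readers exploit the radical's phonological value."""
    counts = Counter(family.values())
    return counts.most_common(1)[0][1] / len(family)

print(f"consistency = {consistency(qing_family):.2f}")  # 4/6 = 0.67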

Another phenomenon taken seriously by Hoosain is the minor finding by Seidenberg (1985) that the overall naming latency was much slower for his Chinese subjects than for his English subjects. Hoosain takes this as an indication that phonological access in Chinese is slower than in English. Hoosain does not notice, however, that the English subjects Seidenberg used were college students, whereas the Chinese subjects were described only as native Cantonese speakers. This makes one wonder what their educational backgrounds were and even how much reading experience they had had. These factors could well have influenced their naming latencies.

To summarize the discussion of getting at the sound of words: there is so far no solid evidence for Hoosain's view that phonological access for characters in Chinese is solely through the lexicon. Indeed, the studies by Seidenberg (1985) and by Fang et al. (1986) suggest that more evidence for the use of phonetic components in getting at the sounds of characters is very likely to be found.

In 3.4, Getting at the Meaning of Words, Hoosain claims to have found evidence that getting at the meaning of Chinese characters is more direct than is the case with printed words in alphabetic writing systems. At the outset, Hoosain rightly warns the reader to bear in mind that "spoken language is prior to written language... Therefore sound-meaning relation is prior to script-meaning relation, giving rise to the question of the need for phonological recoding before access to meaning." By "prior," however, Hoosain means the order in historical development, both for human languages in general and for individual speakers of a language in particular, leaving plenty of room for the actual reading process either to involve or not to involve so-called phonological mediation. This leads to the next question he asks, namely, "whether phonological recoding is involved to a different extent depending on the nature of the orthography." So, it follows, according to Hoosain, that "if the script primarily represents sounds of the language, script-meaning relation may not be so direct, and getting at the meaning more likely may be mediated by phonological recoding."

With these questions in perspective, Hoosain presents several studies he thinks show a more direct connection between script and meaning in Chinese than in alphabetic orthographies. The first series of studies involves the famous Stroop phenomenon. Stroop (1935) found that when the name of a color was written in ink of another color, it took subjects longer to name the color of the ink than it took them to name the colors of ink patches. This phenomenon was taken to mean that it is somehow unavoidable to process the meaning of printed words even when the task is to attend to the color of the ink. This explanation, however, is questionable, since the phenomenon can also be understood as an indication that it is somehow unavoidable to process the pronunciation of the printed words, thus slowing down the naming process when the pronunciations of the printed color name and the color of the ink disagree. With this alternative explanation, the even greater Stroop effect for Chinese color names found by Biederman and Tsao (1979) could, contrary to Hoosain's interpretation, be interpreted as an indication that it is even more unavoidable to process the pronunciation of Chinese characters. Without further evidence, though, the implication of the greater Stroop effect for Chinese should at least be considered undetermined.
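
For readers unfamiliar with the paradigm, the interference measure in question reduces to a difference of mean naming latencies between conflicting and neutral conditions. The Python sketch below uses invented reaction times purely to show the computation; a cross-language comparison like Biederman and Tsao's would compare such interference scores computed separately per language.

from statistics import mean

# Hypothetical naming latencies in milliseconds (invented for illustration).
rt_patches = [620, 640, 610, 650, 630]   # naming colors of plain ink patches
rt_conflict = [820, 860, 790, 880, 840]  # naming ink color of a conflicting color word

interference = mean(rt_conflict) - mean(rt_patches)
print(f"Stroop interference: {interference:.0f} ms")  # 208 ms with these values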

In trying to find more evidence for a direct character-meaning connection in Chinese, Hoosain cites the study by Tzeng and Wang (1983), but misunderstands the procedure used in that study.5 In the study, subjects were asked to choose, from two numbers written in characters, the one that is numerically greater than the other. It was found that it took the subjects longer to make the decision when the characters for the larger number had a smaller physical size on the screen. Hoosain, however, mistakenly describes the task as that of deciding on the physical size of the written numbers. He thus interprets the results as indicating that it is unavoidable to process the meaning of the Chinese numbers when the task is only about the physical size of the printed numbers. This interpretation is inappropriate because the actual task in the experiment required subjects to focus on the meaning of the numbers to begin with.

As for the finding that Chinese numbers are translated into English faster than the other way around by native Cantonese speakers (Hoosain, 1986), which Hoosain uses as another piece of evidence for a direct connection between characters and meaning, I find it inconclusive because no comparable results were obtained from native speakers of English. Similarly, the finding that cross-language priming effects for native speakers of Chinese are greater from Chinese to English than in the other direction (Keatley, 1988) is not conclusive evidence for a direct connection between meaning and characters in Chinese.

Another study cited by Hoosain as showing evidence for a direct character-meaning connection was conducted by Treiman, Baron, and Luk (1981). In that study, native speakers of English and Cantonese were required to make truth/falsity judgments of sentences in their own languages. A sentence could be a "homophone" sentence, such that it sounded true when read out loud (e.g., A pair is a fruit). It was found that it took longer to reject these homophone sentences than truly false sentences like "A pier is a fruit." For Chinese, in contrast, this kind of delay due to homophony was relatively small, and this difference was taken as evidence that phonological recoding was less involved in Chinese sentence processing. However, there is a problem in that study that Hoosain does not notice. In the sample Chinese sentence presented in the paper, "Mei2 shi4 yi1 zhong3 zhi2 wu4"6 ("Mei2 is a plant"), the character mei2, meaning "coal," is said to be homophonous with the name of a plant. But the character for mei2 as the name of that plant is not a meaningful word in itself in the spoken language (neither in Cantonese nor in Mandarin). In the spoken language, it always combines with some other syllables to form words. The syllable mei2 by itself corresponds to only one word, namely, "coal." So, the smaller impairment for the Cantonese speakers in that experiment may well be a consequence of using improper test material rather than an indication of less phonological recoding in Chinese sentence processing.

The only study that offers a somewhat stronger case for possible fast access to the meanings of Chinese characters is the one by Hoosain and Osgood (1983) on affective meaning response times. They asked subjects to report the affective meaning of a word by saying either "positive" or "negative" upon seeing the word displayed on a screen. Native speakers of Cantonese could make this judgment faster than native speakers of English. Hoosain and Osgood interpreted this result as evidence that the processing of at least some aspects of the meaning of Chinese words was faster than that of English words, and suggested that this was because phonology was bypassed when accessing the meaning of printed Chinese words. Fascinating as the finding may be, it is somewhat in conflict with a recent study by Perfetti and Zhang (1991; also see Perfetti, Zhang, and Berent, in press), in which it was found that phonemic information was immediately available as part of character identification in Chinese. Furthermore, these recent experimental results suggest that whatever the semantic value of a character, it is activated no earlier than its phonological value. If the findings of both studies (Hoosain and Osgood, 1983; Perfetti and Zhang, 1991) are true, then the faster affective meaning response time for Chinese words should only mean faster decision after the activation of the phonology as well as the semantics of those words, rather than faster activation of the semantics, assuming, of course, that the phonological activation of the Chinese words is no faster than that of the English words.

In conclusion, I do not find solid evidence for a more direct connection between meaning and characters in Chinese than between meaning and printed words in alphabetic orthographies. Nor do I find solid evidence for a lack of phonological mediation in accessing meaning from characters in Chinese. Rather, the studies by Seidenberg (1985) and Fang et al. (1986), which are also cited by Hoosain, show evidence of similarity between the processing of Chinese orthography and the processing of alphabetic orthographies.

Chapter 4 of Hoosain's book focuses on memory-related aspects of the Chinese language. In general, Hoosain shows in this chapter that phonological recoding is used in memorizing printed Chinese, but, at the same time, he cites some studies that he thinks show a greater role of visual coding for Chinese than for other languages.

In 4.1.1, Hoosain cites quite a few studies that found a greater digit span for Chinese than for some other languages. It is shown that this difference can be well accounted for by the working memory model proposed by Baddeley and Hitch (1974) and Baddeley (1983, 1986). This model assumes an "articulatory loop mechanism" for short-term memory which has a span of approximately two seconds' worth of speech sound. Hoosain shows that number names in Chinese have shorter spoken durations than in some other languages. Because of this, Hoosain argues, more numbers can be kept in the articulatory loop by Chinese speakers.
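
The quantitative logic of this account can be made explicit: the predicted digit span is roughly the loop capacity divided by the mean spoken duration of a digit name. In the sketch below, the two-second capacity is the model's approximation; the per-language durations are hypothetical round numbers, not measured values.

# A minimal sketch of the articulatory-loop prediction (Baddeley & Hitch, 1974):
# predicted digit span ~ loop capacity / mean spoken duration of a digit name.
LOOP_CAPACITY_S = 2.0  # approximate capacity assumed by the model, in seconds

# Hypothetical mean durations (seconds) of spoken digit names.
mean_digit_duration = {
    "Chinese": 0.25,  # hypothetical: short, uniformly monosyllabic names
    "English": 0.33,  # hypothetical: longer on average (e.g., "seven")
}

for language, duration in mean_digit_duration.items():
    predicted_span = LOOP_CAPACITY_S / duration
    print(f"{language}: predicted digit span = {predicted_span:.1f}")
# Chinese: 8.0; English: 6.1 -- shorter names yield a larger predicted span.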

The discussion in this section is interesting, except that there is no emphasis on the fact that the phenomenon is purely linguistic and has almost nothing to do with orthography. Also, Hoosain does not emphasize that the finding is interesting because it confirms one of the hypothetical principles under which all human languages function (but see Ren and Mattingly, 1990, and Mattingly, 1991, for criticism of the working memory hypothesis). Instead, Hoosain seems to be thrilled by the finding itself, that digit span in Chinese is larger than in some other languages, and uses this as an indication that differences among languages do affect the way information is processed.

In his discussion of the visuo-spatial scratch-pad and perceptual abilities, Hoosain cites a study by Woo and Hoosain (1984). They conducted a short-term memory test in which subjects had to identify, among a list of seven characters, those that had appeared among the six characters shown to them a few seconds before. It was found that "weak readers made much more visual distracter errors (when target characters were mixed with similar looking distracters) but not more phonological distracter errors (when targets were mixed with similar sounding distracters)." Hoosain contrasts this finding with that of Liberman, Mann, Shankweiler, and Werfelman (1982), who found that, for American subjects, good beginning readers of English did not do better than weak beginning readers in memory for abstract visual designs or faces. However, they did perform better in memorizing nonsense syllables. In comparing this result with the results of Woo and Hoosain (1984), Hoosain comes to the conclusion that poor readers of Chinese have more visual problems than poor readers of English. Apparently, Hoosain misinterprets the nature and the implication of the results obtained by Woo and Hoosain. The finding that poor readers made more visual distracter errors indicates that they were relying more heavily than good readers on the visual characteristics of the characters when trying to encode them for immediate recall. As a result, poor readers were penalized more by the graphic similarities introduced by the visual distracters. The finding does not, as Hoosain interprets it, simply indicate that poor readers of Chinese have more visual problems. In a comparable study by Ren and Mattingly (1990), it was found that good readers of Chinese were relatively more affected by phonologically similar series than poor readers when performing immediate recall of series of Chinese characters. These results are comparable to those found for English readers (Shankweiler, Liberman, Mark, Fowler, and Fischer, 1979). In neither of those two studies were the results interpreted by the authors as evidence that good readers had more phonological problems than poor readers. Rather, they were taken to suggest that good readers rely more heavily than poorer readers on phonological encoding, and hence were penalized more by the phonological similarity introduced in the experimental stimuli.7 So, the results of Woo and Hoosain (1984) and Ren and Mattingly (1990) both show that one of the crucial differences between good and poor readers of Chinese is in the manner in which they recode reading material: good readers use more phonological encoding, while poor readers rely more on visual encoding. This agrees with, rather than contrasts with, the findings about the differences between good and poor readers of English (Shankweiler et al., 1979).

In a study by Chen and Juola (1982), native Chinese- and English-speaking subjects had to decide which of the items on separate test lists was graphemically, phonemically, or semantically similar to the item just presented visually. Chinese subjects were found to be fastest in making graphic similarity decisions about characters, while American subjects were found to be fastest in phonemic similarity judgments. Chen and Juola then concluded, as reported by Hoosain without reservation, that "Chinese words produce more distinctive visual information, but English words result in a more integrated code." Thus the two different "scripts activate different coding and memory mechanisms." Hoosain, however, does not notice at least two problems with the study. First, the study claimed to use words as stimuli for both English and Chinese. However, of the 144 characters used, 26 could function only as bound morphemes according to the frequency dictionary of Wang et al. (1986), and another 11 could not even be found in that dictionary. Using bound morphemes in Chinese necessarily involves more homophones, making graphemic discrimination more indispensable. Second, the task the subjects had to accomplish requires the kind of phonological awareness that most Chinese readers do not need in ordinary reading. Yet lacking conscious knowledge of the phonemic structure of characters does not necessarily mean the absence, or even lesser use, of phonological recoding in actual reading. So, here Hoosain is confusing phonological recoding in actual reading with conscious phonological manipulation in performing certain cognitive experimental tasks.

Another study Hoosain cites as evidence for the existence of a visual coding strategy for Chinese materials is that of Mou and Anderson (1981). These authors found predominant use of phonological recoding in STM for Chinese. Yet, at the same time, they found evidence of visual recoding. In that study, the presentation of stimulus lists was simultaneous rather than serial. As pointed out by Yu, Jing, and Sima (1984) and Yu, Zhang, Jing, Peng, Zhang, and Simon (1985), measured STM capacity varies depending upon the visual presentation method; i.e., visual aspects of the stimuli have a greater effect upon the results with simultaneous presentation than with serial presentation. Yu et al. (1985) also concluded, based on the results of 14 experiments they carried out in China and the USA, that "no effects were detected that are peculiar to ideographic or logographic languages in contrast to alphabetic languages."

In 4.4 Hoosain looks at memory in relation to reading problems in Chinese. He reports, in particular, an extensive study conducted by Stevenson, Stigler, Lucker, Lee, Hsu, and Kitamura (1982) that compared reading disability among Chinese, American, and Japanese children. The major finding was that the proportion of children with reading problems is very similar among the three different languages, thus disconfirming the previously popular belief that there are fewer reading problems among Chinese and Japanese children. The study did find, however, an interesting difference between Chinese children and American as well as Japanese children. Of those Chinese children who failed the fifth grade test, most failed because of a poor comprehension score, whereas most of the fifth grade American or Japanese children who failed the test did so because of their poor vocabulary score. Hoosain suggests this is because pronunciation tests involve more straightforward answers than word meaning tests, for which a more open-ended effort is required. He further suggests, in 4.5, that because of the nature of the Chinese orthography, Chinese children have to do more rote learning. And, because this kind of learning calls for a large measure of sustained discipline, authoritarianism is thus tied to this system of learning, which in turn restrains students from giving more open-ended answers. This kind of reasoning, although interesting, is only one of the possible ways of accounting for the data. An alternative explanation is that the poor comprehension scores of the Chinese children may be due to improper testing design. The teaching of reading and writing in Chinese is in general character-oriented. However, characters correspond to morphemes, and most morphemes need to combine with other morphemes to form a word. So, knowing the pronunciation of a character and its meaning in one or several of the words that contain it does not necessarily lead to knowing its meaning in other words it is part of. This discrepancy between the target of teaching and the units of comprehension makes it difficult to design tests for measuring progress in comprehension, especially when it is assumed that knowing the pronunciation of all the characters in a text and their meanings in isolation should be sufficient for comprehension of the meaning of the text.

To summarize the discussion of Chapter 4: some of the differences alleged by Hoosain between the memory aspects of Chinese and other languages are found to be inappropriate interpretations of the experimental findings, while others derive from studies with defective designs. In conclusion, so-called bottom-up differences, or differences between the processing of Chinese and other languages, may well be just a reflection of differences among the languages themselves, or merely of the difficulty of finding parallel ways of measuring performance in different languages.

In Chapter 5, Neurolinguistic Aspects of the Chinese Language, Hoosain presents convincing evidence that the neurolinguistic aspects of Chinese are not essentially different from those of other languages, as had been suggested by some of the early studies. At the same time, however, there is a persisting confusion in Hoosain's review between language and orthography. Though this should not be entirely blamed on him, since most of the studies he reviews did not make this distinction in the first place, as a reviewer he should have been more aware of this confusion.

In 5.1, Hoosain shows that despite a claim (e.g., Hatta, 1977) that characters are processed in the right hemisphere, there are plenty of other studies that demonstrated a left hemisphere advantage for printed Chinese words as well as English words. It is further shown that a right hemisphere advantage for Chinese was found only with single characters, whereas two-character words always elicited a left hemisphere advantage. Also, with single characters the findings are mixed: both right-hemisphere and left-hemisphere advantages are found, and which hemispheric advantage is obtained seems to depend heavily on the test conditions. Right-hemisphere advantage seems to be related to short exposure time, low illumination, and high structural complexity of the characters. This is further confirmed by the finding (Ho and Hoosain, 1989) that, with short exposure time, low luminance, and high stroke number, a right-hemisphere advantage could be observed even for low frequency two-character words.

In 5.3 and 5.4, Hoosain reviews twenty-one studies on Chinese aphasics. His general conclusion is that "there is little evidence to suggest any overwhelming neurolinguistic effects of Chinese language uniqueness." Although it was suggested by some earlier studies that Chinese language functions are more lateralized in the right hemisphere, the results of more extensive investigations indicate no such tendency. In general, the proportion of right-handed patients with aphasia after right hemisphere damage does not exceed the overall proportion of right-handed people who have their speech functions lateralized in the right hemisphere (about 4%). So, Hoosain concludes that "it is quite clear from these studies that specialization for language tends to be in the left hemisphere for the Chinese, as for speakers of alphabetic languages." (Notice, however, his use of the word 'language' when most of the studies were actually about reading.)

There is an indication from some studies that Chinese reading functions may be localized more posteriorly, involving the parietal and occipital lobes, and that psycho-motor schemas may play a greater role in memory for Chinese words. However, as Hoosain points out, it is premature to conclude that there is more involvement of the occipito-parietal region for Chinese. As for reports that finger tracing has been used by some patients to help recognize Chinese characters, which Hoosain regards as essentially different from spelling out individual letters to help read words in an alphabetic orthography, it seems that those patients' reading ability has been impaired so much that they are doing only what beginning learners of an orthography do. It is very likely that both finger tracing and oral spelling reflect only the way word composition is learned in different orthographies,8 and this process could be quite peripheral to the essential mechanisms involved in fluent reading.

The finding that the processing of tones in Chinese is located in the left hemisphere is interesting, although again this is an aspect of language and is thus not directly related to orthography. A study by Packard (1986) showed that patients with left hemisphere damage had defective production of lexical tones in Mandarin. On the other hand, Hughes, Chan, and Su (1983) found that patients with right hemisphere damage had defects in both the production and the comprehension of affective intonation, but at the same time had kept their ability to process lexical tones almost intact. The finding that the processing of tones is located in the left hemisphere, together with the finding that the processing of sign language (e.g., Damasio, Bellugi, Damasio, Poizner, and Van Gilder, 1986) is also located in the left hemisphere, reveals the universality shared by all human languages not only in terms of their ability to convey information, but also in terms of the central neural processing mechanism they all utilize.

To summarize this discussion of Chapter 5: the evidence Hoosain provides for similarity between Chinese and other languages in the lateralization of script processing as well as of general linguistic processing in the brain is convincing. The tendency toward a right hemisphere advantage in processing isolated characters shown in some of the studies reviewed by Hoosain, however, does need further exploration.

In final conclusion, Hoosain's major attempt in this book is to introduce a new version of the linguistic relativity hypothesis. The original version is the famous Sapir-Whorf hypothesis that language influences thought. The new version is concerned less with the structure of language or world view than with the manner in which linguistic information is processed, and in particular, with the script-sound-meaning aspects of the Chinese orthography and their effects on information processing. Hoosain's final conclusion in his book is that "there are definite correlations between language characteristics and cognitive performance." However, I find that this conclusion is based mainly on misconceptions about the Chinese language and its orthography, on results obtained under defective experimental designs, and on misinterpretations of the results due to those misconceptions. Besides, the difficulty in separating language differences and orthographic differences from processing differences also affects Hoosain's judgments. More specifically, the difficulty lies in finding truly consistent measurements when comparing different languages and orthographies. Unless this difficulty is overcome, there is always the danger of mistaking differences due to inconsistent measurements for true contrasts in the processing of different languages and orthographies.

So, as a final comment on the title of Hoosain's book: to parallel Hoosain's borrowing from physics of the notion of 'relativity' to underscore his vision of the linguistic processing of different languages, I would like to borrow, also from physics, the notion of 'uncertainty' or 'indeterminacy' to emphasize the caution we should exercise when comparing different languages and orthographies. It is often difficult to determine at the same time both the difference between two languages or writing systems and the differences in their processing. Comparing the processing of Chinese characters directly with the processing of letters, words, or bound or unbound morphemes in an alphabetic writing system is often risky; and attributing any differences found in this kind of comparison to cognitive or behavioral differences will often prove to be a pitfall.

REFERENCES

Baddeley, A. D. (1983). Working memory. Philosophical Transactions of the Royal Society, London, B302, 311-324.
Baddeley, A. D. (1986). Working memory. Oxford: Clarendon.
Baddeley, A. D., & Hitch, G. J. (1974). Working memory. In G. A. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory, Vol. 8 (pp. 47-90). New York: Academic Press.
Biederman, I., & Tsao, Y. C. (1979). On processing Chinese ideographs and English words: Some implications from Stroop-test results. Cognitive Psychology, 11, 125-132.
Chen, H. C., & Juola, J. F. (1982). Dimensions of lexical coding in Chinese and English. Memory & Cognition, 10, 216-224.
Cheng, C. M. (1981). Perception of Chinese characters. Acta Psychologica Taiwanica, 23, 281-358.
Damasio, A., Bellugi, U., Damasio, H., Poizner, H., & Van Gilder, J. (1986). Sign language aphasia during left-hemisphere amytal injection. Nature, 322, 363-365.
DeFrancis, J. F. (1984). The Chinese language: Fact and fantasy. Honolulu: University of Hawaii Press.
DeFrancis, J. F. (1989). Visible speech: The diverse oneness of writing systems. Honolulu: University of Hawaii Press.
Fang, S. P., Horng, R. Y., & Tzeng, O. J. L. (1986). Consistency effects in the Chinese character and pseudo-character naming tasks. In H. S. R. Kao & R. Hoosain (Eds.), Linguistics, psychology, and the Chinese language (pp. 11-21). Hong Kong: University of Hong Kong Centre of Asian Studies.
Glushko, R. J. (1979). The organization and activation of orthographic knowledge in reading aloud. Journal of Experimental Psychology: Human Perception and Performance, 5, 674-691.
Hatta, T. (1977). Recognition of Japanese kanji in the left and right visual fields. Neuropsychologia, 15, 685-688.
Ho, S. K., & Hoosain, R. (1989). Right hemisphere advantage in lexical decision with two-character Chinese words. Brain and Language, 27, 606-615.
Hoosain, R. (1986). Psychological and orthographic variables for translation asymmetry. In H. S. R. Kao & R. Hoosain (Eds.), Linguistics, psychology, and the Chinese language (pp. 203-216). Hong Kong: University of Hong Kong Centre of Asian Studies.
Hoosain, R., & Osgood, C. E. (1983). Information processing times for English and Chinese words. Perception & Psychophysics, 34, 573-577.
Hughes, C. P., Chan, J. L., & Su, M. S. (1983). Aprosodia in Chinese patients with right cerebral hemisphere lesions. Archives of Neurology, 40, 732-736.
Keatley, C. W. (1988). Facilitation effects in the primed lexical decision task within and across languages. Unpublished doctoral dissertation, University of Hong Kong.
Liberman, I. Y., Mann, V. A., Shankweiler, D., & Werfelman, M. (1982). Children's memory for recurring linguistic and non-linguistic material in relation to reading ability. Cortex, 18, 367-375.
Liu, I. M. (1988). Context effects on word/character naming: Alphabetic versus logographic languages. In I. M. Liu, H. C. Chen, & M. J. Chen (Eds.), Cognitive aspects of the Chinese language (pp. 81-92). Hong Kong: Asian Research Service.
Mattingly, I. G. (1991). Modularity, working memory, and reading disability. In S. A. Brady & D. P. Shankweiler (Eds.), Phonological processes in literacy: A tribute to Isabelle Y. Liberman (pp. 163-171). Hillsdale, NJ: Lawrence Erlbaum Associates.
Mattingly, I. G. (1992). Linguistic awareness and orthographic form. In L. Katz & R. Frost (Eds.), Orthography, phonology, morphology, and meaning (pp. 11-26). Amsterdam: Elsevier Science Publishers.
Mou, L. C., & Anderson, N. S. (1981). Graphemic and phonemic codings of Chinese characters in short-term retention. Bulletin of the Psychonomic Society, 17, 255-258.
Packard, J. L. (1986). Tone production deficits in nonfluent aphasic Chinese speech. Brain and Language, 29, 212-223.
Peng, R. X. (1982). A preliminary report on statistical analysis of the structure of Chinese characters. Acta Psychologica Sinica, 14, 385-390.
Perfetti, C. A., & Zhang, S. (1991). Phonological processes in reading Chinese words. Journal of Experimental Psychology: Learning, Memory and Cognition, 17, 633-643.
Perfetti, C. A., Zhang, S., & Berent, I. (1992). Reading in English and Chinese: Evidence for a "universal" phonological principle. In L. Katz & R. Frost (Eds.), Orthography, phonology, morphology, and meaning. Amsterdam: Elsevier Science Publishers.
Ren, N., & Mattingly, I. G. (1990). Short-term serial recall performance by good and poor readers of Chinese. Haskins Laboratories Status Report on Speech Research, SR-103/104, 153-164.
Seidenberg, M. S. (1985). The time course of phonological code activation in two writing systems. Cognition, 19, 1-30.
Shankweiler, D., Liberman, I. Y., Mark, L. S., Fowler, C. A., & Fischer, F. W. (1979). The speech code and learning to read. Journal of Experimental Psychology: Human Learning and Memory, 5, 531-545.
Stevenson, H. W., Stigler, J. W., Lucker, G. W., Lee, S. Y., Hsu, C. C., & Kitamura, S. (1982). Reading disabilities: The case of Chinese, Japanese, and English. Child Development, 53, 1164-1181.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643-662.
Treiman, R. A., Baron, J., & Luk, K. (1981). Speech recoding in silent reading: A comparison of Chinese and English. Journal of Chinese Linguistics, 9, 116-125.


Tzeng, O. J. L., Hung, D. L., & Wang, W. S.-Y. (1977). Speech recoding in reading Chinese characters. Journal of Experimental Psychology: Human Learning and Memory, 3, 621-630.
Tzeng, O. J. L., & Wang, W. S.-Y. (1983). The first two R's. American Scientist, 71, 238-243.
Wang, H., Chang, B. R., Li, Y. S., Lin, L. H., Liu, J., Sun, Y. L., Wang, Z. W., Yu, Y. X., Zhang, J. W., & Li, D. P. (1986). A frequency dictionary of current Chinese. Beijing: Beijing Language Institute Press.
Whorf, B. L. (1956). Language, thought, and reality: Selected writings of Benjamin Lee Whorf. Cambridge, MA: MIT Press.
Woo, E. Y. C., & Hoosain, R. (1984). Visual and auditory functions of Chinese dyslexics. Psychologia, 27, 164-170.
Yau, S. C. (1982). Linguistic analysis of archaic Chinese ideograms. Paper presented at the XV International Conference on Sino-Tibetan Languages and Linguistics, Beijing.
Yu, B., Jing, Q., & Sima, H. (1984). STM capacity for Chinese words and idioms: Chunking and acoustical loop hypotheses. Memory & Cognition, 13, 193-201.
Yu, B., Zhang, W., Jing, Q., Peng, R., Zhang, G., & Simon, H. A. (1985). STM capacity for Chinese and English materials. Memory & Cognition, 13, 202-207.

FOOTNOTES
* Language and Speech, 35, 325-340 (1992).
† Also University of Connecticut, Storrs.
1. Linguistically, Chinese is actually a large language family consisting of many mutually unintelligible languages. However, because the same writing system is used by the whole Chinese-speaking community in China, and because of the close cultural ties among the Chinese people as well as the centralized political traditions, the Chinese languages are often considered dialects of a single language.

2. The number 3 at the end of the spelling indicates the tone of the syllable.

3. Incidentally, but perhaps more importantly, there are many different shades of meaning attributed to this word, as is the case with almost all words in any language. This quality is apparently lacking in any true picture of a bird.

4. A Chinese character is typically composed of a phonetic radical and a semantic radical. Sometimes there is more than one semantic radical in a character, and sometimes there is no semantic radical at all. In the latter case, however, it is usually said that the character is a nonphonetic compound. This name is misleading, since the whole character by itself represents a syllable, and is thus no less phonetic than a phonetic radical in a compound character.

5. The misunderstanding was probably caused by the ambiguity of the description in the original paper by Tzeng and Wang (1983). I found an unambiguous description only in the caption of Figure 3 in that paper.

6. The spellings used here are in pinyin. They are used only to represent the Chinese characters referred to; they do not reflect the phonological values of these morphemes in Cantonese.

7. The difference between the two studies (that is, in Woo and Hoosain, poor readers were more affected by visual similarity, whereas in Ren and Mattingly, good readers were more affected by phonological similarity) is probably due to the difference in the presentation of the stimulus lists: in the former, display was simultaneous, whereas in the latter, it was serial. See the discussion of the study by Mou and Anderson (1981) later in this review.

8. Incidentally, when I was at the earliest stage of learning English, I did not have the privilege of access to a teacher or even another fellow student, so I learned my English words by writing them over and over again. As a result, I still have great difficulty spelling words orally or understanding other people's oral spelling, but I have no difficulty reading English. If I became aphasic, I do not think I would be able to use spelling to help recognize English words. On the other hand, many of my fellow Chinese can do oral spelling fluently, because they started learning English with a teacher.


Appendix

SR # Report Date NTIS # ERIC #

SR-21/22 January-June 1970 AD 719382 ED 044-679
SR-23 July-September 1970 AD 723586 ED 052-654
SR-24 October-December 1970 AD 727616 ED 052-653
SR-25/26 January-June 1971 AD 730013 ED 056-560
SR-27 July-September 1971 AD 749339 ED 071-533
SR-28 October-December 1971 AD 742140 ED 061-837
SR-29/30 January-June 1972 AD 750001 ED 071-484
SR-31/32 July-December 1972 AD 757954 ED 077-285
SR-33 January-March 1973 AD 762373 ED 081-263
SR-34 April-June 1973 AD 766178 ED 081-295
SR-35/36 July-December 1973 AD 774799 ED 094-444
SR-37/38 January-June 1974 AD 783548 ED 094-445
SR-39/40 July-December 1974 AD A007342 ED 102-633
SR-41 January-March 1975 AD A013325 ED 109-722
SR-42/43 April-September 1975 AD A018369 ED 117-770
SR-44 October-December 1975 AD A023059 ED 119-273
SR-45/46 January-June 1976 AD A026196 ED 123-678
SR-47 July-September 1976 AD A031789 ED 128-870
SR-48 October-December 1976 AD A036735 ED 135-028
SR-49 January-March 1977 AD A041460 ED 141-864
SR-50 April-June 1977 AD A044820 ED 144-138
SR-51/52 July-December 1977 AD A049215 ED 147-892
SR-53 January-March 1978 AD A055853 ED 155-760
SR-54 April-June 1978 AD A067070 ED 161-096
SR-55/56 July-December 1978 AD A065575 ED 166-757
SR-57 January-March 1979 AD A083179 ED 170-823
SR-58 April-June 1979 AD A077663 ED 178-967
SR-59/60 July-December 1979 AD A082034 ED 181-525
SR-61 January-March 1980 AD A085320 ED 185-636
SR-62 April-June 1980 AD A095062 ED 196-099
SR-63/64 July-December 1980 AD A095860 ED 197-416
SR-65 January-March 1981 AD A099958 ED 201-022
SR-66 April-June 1981 AD A105090 ED 206-038
SR-67/68 July-December 1981 AD A111385 ED 212-010
SR-69 January-March 1982 AD A120819 ED 214-226
SR-70 April-June 1982 AD A119426 ED 219-834
SR-71/72 July-December 1982 AD A124596 ED 225-212
SR-73 January-March 1983 AD A129713 ED 229-816
SR-74/75 April-September 1983 AD A136416 ED 236-753
SR-76 October-December 1983 AD A140176 ED 241-973
SR-77/78 January-June 1984 AD A145585 ED 247-626
SR-79/80 July-December 1984 AD A151035 ED 252-907
SR-81 January-March 1985 AD A156294 ED 257-159
SR-82/83 April-September 1985 AD A165084 ED 266-508
SR-84 October-December 1985 AD A168819 ED 270-831
SR-85 January-March 1986 AD A173677 ED 274-022
SR-86/87 April-September 1986 AD A176816 ED 278-066
SR-88 October-December 1986 PB 88-244256 ED 282-278


SR-89/90 January-June 1987 PB 88-244314 ED 285-228
SR-91 July-September 1987 AD A192081 **
SR-92 October-December 1987 PB 88-246798 **
SR-93/94 January-June 1988 PB 89-108765 **
SR-95/96 July-December 1988 PB 89-155329 **
SR-97/98 January-July 1989 PB 90-121161 ED 321-317
SR-99/100 July-December 1989 PB 90-226143 ED 321-318
SR-101/102 January-June 1990 PB 91-138479 ED 325-897
SR-103/104 July-December 1990 PB 91-172924 ED 331-100
SR-105/106 January-June 1991 PB 92-105204 ED 340-053
SR-107/108 July-December 1991 PB 92-160522 ED 344-259
SR-109/110 January-June 1992 PB 93-142099 ED 352-594
SR-111/112 July-December 1992 PB 93-216018 ED 359-575
SR-113 January-March 1993

AD numbers may be ordered from:

U.S. Department of Commerce
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22151

ED numbers may be ordered from:

ERIC Document Reproduction Service
Computer Microfilm Corporation (CMC)
3900 Wheeler Avenue
Alexandria, VA 22304-5110

In addition, the Haskins Laboratories Status Report on Speech Research is abstracted in Language and Language Behavior Abstracts, P.O. Box 22206, San Diego, CA 92122.

**Accession number not yet assigned


SR-113 JANUARY-MARCH 1993

Haskins Laboratories Status Report on Speech Research

Contents

Some Assumptions about Speech and How They Changed
Alvin M. Liberman 1

On the Intonation of Sinusoidal Sentences: Contour and Pitch Height
Robert E. Remez and Philip E. Rubin 33

The Acquisition of Prosody: Evidence from French- and English-Learning Infants
Andrea G. Levitt 41

Dynamics and Articulatory Phonology
Catherine P. Browman and Louis Goldstein 51

Some Organizational Characteristics of Speech Movement Control
Vincent L. Gracco 63

The Quasi-steady Approximation in Speech Production
Richard S. McGowan 91

Implementing a Genetic Algorithm to Recover Task-dynamic Parameters of an Articulatory Speech Synthesizer
Richard S. McGowan 95

An MRI-based Study of Pharyngeal Volume Contrasts in Akan
Mark K. Tiede 107

Thai
M. R. Kalaya Tingsabadh and Arthur S. Abramson 131

On the Relations between Learning to Spell and Learning to Read
Donald Shankweiler and Eric Lundquist 135

Word Superiority in Chinese
Ignatius G. Mattingly and Yi Xu 145

Prelexical and Postlexical Strategies in Reading: Evidence from a Deep and a Shallow Orthography
Ram Frost 153

Relational Invariance of Expressive Microstructure across Global Tempo Changes in Music Performance: An Exploratory Study
Bruno H. Repp 171

A Review of Psycholinguistic Implications for Linguistic Relativity: A Case Study of Chinese by Rumjahn Hoosain
Yi Xu 197

Appendix 209
