
SIGCSE Bulletin, 1976, 8(1), 200-211.

SPEECH SYNTHESIS FOR COMPUTER ASSISTED INSTRUCTION:

THE MISS SYSTEM AND ITS APPLICATIONS

by

William R. Sanders, Gerard V. Benbassat and Robert L. Smith

Institute for Mathematical Studies in the Social Sciences

Stanford University

Introduction

The Institute for Mathematical Studies in the Social Sciences at Stanford (IMSSS) has developed a synthesis system, MISS (Microprogrammed Intoned Speech Synthesizer), designed to test the effectiveness of computer-generated speech in the context of complex CAI programs. No one method of computer controlled speech production is completely satisfactory for all the uses of computer-assisted instruction (CAI). The choice of synthesis method is strongly related to the kinds of curriculums and instructional designs that will use speech. We chose to use acoustic modelling by linear predictive coding as the method of synthesis for MISS. (1)

In Section 2 we describe criteria appropriate for organizing the comparison of voice response systems for use with instructional computers. Then we describe the particular requirements imposed by curriculums at IMSSS, review general voice synthesis techniques, and finally discuss our actual choice. In Sections 3 and 4 we outline the hardware and software that have been created to support MISS in operational CAI at Stanford.

In Section 5 we discuss the applications of audio to CAI. We mention previous uses of various synthesis methods, with a view to determining the actual instructional use of the audio in these applications. The permanent viability of audio in CAI will depend, we believe, on showing that an audio component adds significantly to the kind of instruction that is possible. Thus, it is important that audio not merely take the place of a hard-copy manual or of printed text on a computer terminal.

We are attempting to use audio to contribute to the solution of the most elusive problem of classical computer-assisted instruction: how to have a dialog with the student about what he is undertaking to do. Solving this problem involves much more as well: namely, having a good semantic representation of the student's work in the curriculum at hand, and having a facility for processing natural language on the computer.

Previous work with natural language applications to CAI convinces us that the output problem -- how to say something reasonable and informative -- is more crucial to good dialog than being able to understand complex "natural language" input typed by the student. This is particularly true in the context of programs like the set theory instructional program (see [20]). Moreover, it is important to separate for the student what he is doing (constructing a proof, writing a program) from the program's commentary about it. With the exception of foreign language instruction, all of our current applications will use audio as a "meta-language" of informal dialog.

(1) We thank Professor Patrick Suppes for his direction of this project. We also thank the many people who talked with us about their research and our project, including B. Atal, J. Olive and L. Rabiner of Bell Laboratories, R. Morris of Carleton University, R. Reddy of Carnegie-Mellon University, J. Allen of MIT, and C. Bush, A. Peterson and B. Widrow of Stanford University. Many IMSSS personnel contributed to this work in various ways, particularly T. Dienstbier, R. Schulz, K. Powell, G. Black, D. Webb, E. Andrew, R. Roberts, and L. Blaine. And most of all we express our sincere gratitude to D. Duck for arousing the authors' interest in manipulated speech. The research reported here was supported by the National Science Foundation under NSF Grant EPP 74-15016, AD1.

2 Choice of a Speech Synthesis Technique

2.1 Criteria for Comparing CAI Speech Systems

There are many methods for providing computer controlled speech to a student using a CAI system. In this section we describe criteria appropriate for comparing voice response systems used with instructional computers, then discuss the particular requirements imposed by curriculums at IMSSS, then briefly review general techniques used by voice response systems, and explain IMSSS's choice in terms of our criteria of comparison.

We use ten criteria in organizing our comparison of voice response systems for application to CAI. The first three measure what the speech sounds like, the next four measure the system's ability to compose and manipulate speech, and the last three measure system implementation parameters that have a major effect on cost.


The intelligibility of a speech system is the ability of that system to elicit the correct identification of utterances from listeners. The intelligibility of a system is measured empirically by playing phonetically balanced lists of words to native listeners. (2) Since it is well known that listeners can often correctly identify highly distorted words, high scores on such tests do not necessarily mean that the system is capable of producing all the typical sounds of the language involved.

Quality is a measure of the fidelity of the speech, that is, the precision with which natural, humanlike speech is produced. This is primarily a subjective measure related to whether or not listeners perceive the speech as typical of that produced by a native speaker. However, it is also strongly related to the absence or presence of unnatural noises, such as hisses or clicks, produced with the speech. Such noises might not distract from a short intelligibility test but would affect the willingness of a student to listen attentively to the speech.

Coverage describes whether a system is capable of producing all the sounds of speech. A system may cover all the sounds of speech in all languages, or may cover only those sounds typical of a particular language, or may fail to produce all typical sounds even in a particular language. Coverage can be measured like intelligibility except that the word list must contain pairs of words that differ by a single distinctive sound feature.

The ability to compose sentences compares those systems with only prerecorded messages to those which may compose sentences out of smaller speech segments, such as words.

The ability to compose words compares those systems with only prerecorded words to those which may compose words out of smaller segments, such as morphemes or phonemes.

The ability to manipulate intonation compares how well, if at all, systems can change the intonation of a sentence. (3) A system with such ability could, for example, change a message that would normally be perceived as a declarative statement into one that would be perceived as a question.

The flexibility of accessing utterances is a description of the limitations placed on the production of dynamically composed messages due to the inability to retrieve the speech representation quickly enough to present it to a student. Linear accessing is constrained to the retrieval of only the next recorded message. Random accessing allows the rapid retrieval of any recorded utterance. Locally random accessing allows the retrieval of utterances physically nearby on the storage device, but the time delay to retrieve a distant utterance would be excessive.

The vocabulary storage size can be a major and direct factor in the cost of a synthesis system. Within a particular synthesis technique, increased intelligibility and quality usually cause increased storage size, while the ability to compose sentences, words, or intonation causes decreased storage size.

The data transmission rate determines the cost of speaking to a student physically remote from the instructional computer. To provide a means of comparing analog and digital systems in terms of transmission rate, we measure the number of voices that can be placed on a conventional telephone line. An analog system can handle just one voice. For digital systems, transmission at 9600 baud will be assumed. Where the data rate exceeds that, analog transmission from a synthesizer near the instructional computer giving one voice per phone line will be assumed.

The complexity of computation required to produce speech varies widely between synthesis techniques. Increased complexity can often increase the intelligibility and quality of the speech at an increase in expense.

Some factors directly affecting the cost of a system have been mentioned, but producing accurate cost models of the various synthesis techniques is well beyond the scope of this paper. The cost is a most important criterion for comparison of CAI synthesis systems, but technology is changing rapidly enough to make evaluation on other criteria more important than on present day costs.

(2) The Fairbanks rhyme test and the modified rhyme test are suitable for testing intelligibility. [5]

(3) We use the words intonation and prosody interchangeably to refer to the speech's fundamental frequency (pitch), duration and volume contours.

2.2 IMSSS Speech Requirements

Although any application would like the best possible speech, the current technology of speech synthesis does not allow for optimizing all the above criteria in a single system. Given this fact, different curriculums want to compromise in different ways. A foreign language course is more willing to give up manipulating intonation and composing messages for increased intelligibility, coverage and quality. Such a course would not want to expose its students to atypical pronunciations of the target language. However, a non-linguistic course using speech as a second source for stimulating and informing the student may well allow degradations in coverage, quality and intelligibility, if by so doing it is able to impart important semantic relationships through the intonation of its messages.

An evaluation of the curriculums at IMSSS (see Section 5 below) indicates that high quality and intelligibility are mandatory, as is the ability to dynamically compose sentences and the random accessing of utterances. The ability to manipulate intonation is desirable. Since the mechanisms of changing intonation are not well understood, it is unrealistic to make this a mandatory requirement. Yet prosody manipulations are so desirable that importance is placed on having some basic capabilities to effect intonation. Since our curriculums have a fairly restricted vocabulary, it seems less important to choose a technique which allows word composition if such a method had substantially lower quality. Less important still were considerations of vocabulary storage size and data transmission rate since our students are all clustered near the computer and can share the same vocabulary storage device. Finally, the computational complexity of the synthesizer was not considered important due to the rapidly decreasing cost of computational hardware.

2.3 Description of Speech Synthesis Techniques

Many methods of reproducing and synthesizing speech have been used. Our division into five general techniques is primarily based on the methods' operational characteristics rather than their internal structure. The interested reader is referred to Flanagan's two books for detailed descriptions of the various synthesis techniques (see [6] and [7]).

The first speech system considered is a computer controlled analog magnetic tape player. It has excellent intelligibility, quality, and coverage, but has only linear accessing or at best locally random accessing. No sentence, word or intonation composition is possible. Tapes must be manually stored and retrieved. However, transmission of the control commands to the tape player requires an insignificant amount of a phone line's data capability and no computation is required.

Analog recordings on disks and drums would solve the tape's accessing limitations, but due to the cost of such storage, most such systems have compromised quality, coverage and intelligibility for putting more messages on the available storage medium. Although sentence composition is possible, no ability to manipulate intonation is available.

Digital recordings on disks and drums have exactly the same description in terms of our criteria of evaluation as their analog counterparts. Pulse Code Modulation (PCM), Delta Modulation, and Adaptive Delta Modulation are all popular digital representations of the analog speech signal, but none of these techniques are functionally different from an analog system except in the amount of magnetic material needed in the storage medium and the relative cost of implementing a particular system.

Acoustic modelling of the characteristics of speech signals provides a technique with significant improvements over simpler recording methods. Substantial reductions in storage size are accomplished by parametrically representing the short term speech spectrum and its voicing frequency. During recording, a sequence of approximately 50 such sets of parameters is recorded for each second of an utterance. For playback, synthesis is accomplished by using each set of parameters in turn to recreate a short period of the speech wave. Furthermore, as the synthesis is accomplished, variations in the speaking rate and voicing frequency can be accomplished without affecting the spectrum or phonetic quality of the speech. Thus, intonation can be manipulated.

Linear Prediction analysis and Formant analysis are two popular methods of performing acoustic modelling (see [2], [8], [15], and [14]). Makhoul [13] has shown these two techniques to be very similar; however, Formant analysis achieves a smaller storage size by assuming that the spectrum has a formant structure. Lack of conformity to formant structure results in lower quality speech than that produced using Linear Prediction. Furthermore, Linear Prediction analysis is computationally more direct than Formant analysis. Good intelligibility, coverage and quality can be practically realized with either method since fewer compromises with storage size constraints are necessary. Digital disks and drums are natural mediums for storing the acoustic speech parameters. Sentences, words, and even parts of words can be recorded and stored as units. Message quality and intelligibility will be better with larger recorded units, but composability will be reduced.
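To make the frame-by-frame scheme above concrete, the following is a minimal sketch, in modern Python rather than anything MISS ran, of resynthesis from linear prediction parameters; the sampling rate, frame layout and excitation model are assumed for the example, and prosody is changed simply by scaling the pitch period and repeating frames while the spectral coefficients are left untouched.

```python
# Illustrative sketch only (not the MISS implementation): frame-by-frame LPC
# resynthesis.  Each frame carries an all-pole spectral model, a gain, and a
# pitch period (0 for unvoiced); intonation and speaking rate are altered
# without touching the spectral coefficients.
import numpy as np
from scipy.signal import lfilter

FS = 8000            # sampling rate in Hz (assumed)
FRAME = FS // 50     # ~50 parameter sets per second, as described above

def synthesize(frames, pitch_scale=1.0, rate_scale=1):
    """frames: list of (lpc_coeffs a1..ap, gain, pitch_period_in_samples)."""
    out, zi, phase = [], None, 0
    for coeffs, gain, period in frames:
        for _ in range(max(1, int(rate_scale))):      # crude duration control
            exc = np.zeros(FRAME)
            if period:                                # voiced: pulse train
                step = max(2, int(period * pitch_scale))
                while phase < FRAME:
                    exc[phase] = 1.0
                    phase += step
                phase -= FRAME                        # carry pulse phase over
            else:                                     # unvoiced: noise
                exc = 0.3 * np.random.randn(FRAME)
            den = np.concatenate(([1.0], coeffs))     # A(z) = 1 + a1 z^-1 + ...
            if zi is None:
                zi = np.zeros(len(den) - 1)
            y, zi = lfilter([gain], den, exc, zi=zi)  # all-pole filter 1/A(z)
            out.append(y)
    return np.concatenate(out)
```

In such a scheme, slowing the speaking rate or raising the pitch only changes how the stored parameters are used at playback time, which is the property exploited for intonation control.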

Speech by rule forms words out of parts of words according to rules formulated for a particular language (see [4], [21]). Speech by rule makes use of the phonetician's concepts of phonemes and morphemes. Phonemes are classes of short speech segments where the class definitions are based on differences in the sounds which are recognizably distinct for the language. Morphemes are those combinations of several phonemes which provide minimal context for the interaction of phonemes (see [9] for a discussion of these last two points). Jon Allen's speech by rule system [1] synthesizes words by decomposing orthographic words into their morphological representation, then into phonological representation and finally into acoustic representation. To accomplish this, a lexicon of all common morphs and four sets of rules are needed. One set of rules is for decomposing the text representation of words into morphs, another set is for decomposing words directly into phones for those words which fail morpheme decomposition, a third set is for correcting the resulting phoneme sequence to take into account junctural effects, and a fourth set is for transforming the phoneme sequence into controls for a synthesizer. Speech by rule systems can have good intelligibility but, due to rule failures, coverage is often not complete even for one language and coverage is certainly incomplete for multiple languages. Rule inadequacies cause quality to be fair. However, random accessing and dynamic composition of sentences, words and intonation are all possible. The vocabulary storage size is quite small in comparison with other synthesis methods, but a large dictionary of morphs is still needed. The transmission rate is low, but the computational complexity is very high.
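As a toy illustration of that kind of rule pipeline (and not a reproduction of Allen's system), the sketch below decomposes a word into morphs from a small invented lexicon, falls back to letter-to-sound conversion when decomposition fails, and applies one junctural correction; every rule and entry here is made up for the example.

```python
# Toy illustration of the rule pipeline described above: morph decomposition,
# letter-to-sound fallback, and a junctural correction.  Lexicon and rules
# are invented; a real system needs thousands of morphs and context rules.
MORPHS = {"power": ["p", "aw", "er"], "set": ["s", "eh", "t"], "s": ["z"]}

def letter_to_sound(text):
    # Naive fallback: one symbol per letter (a real system uses context rules).
    return list(text)

def decompose(word):
    """Greedy longest-match decomposition of a word into lexicon morphs."""
    phones, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in MORPHS:
                phones.extend(MORPHS[word[i:j]])
                i = j
                break
        else:                      # no morph matched: letter-to-sound fallback
            phones.extend(letter_to_sound(word[i:]))
            break
    return phones

def junctural_fix(phones):
    # Example correction rule: devoice a final "z" after a voiceless stop.
    if len(phones) >= 2 and phones[-1] == "z" and phones[-2] in {"t", "p", "k"}:
        phones[-1] = "s"
    return phones

print(junctural_fix(decompose("powersets")))   # ['p','aw','er','s','eh','t','s']
```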

2.4 IMSSS's Choice of a Speech Synthesizer

We decided to use a word based linear prediction synthesis system because of the following considerations: 1) the need for complete coverage by multiple language instruction prohibits using speech by rule systems; 2) the need for dynamic composition of messages by curriculums such as Set Theory prohibits using tape players; 3) the need to modify the intonation of a composed message prohibits using either analog or digital recording techniques. The choice of a word based system does place certain functional restrictions on applications. Intonation modifications are less precise and effective when applied only on word units instead of phoneme units. There is no ability to compose a word which is not in the recorded lexicon; however, some attempts will be made to combine affixes with root words.


[Figure 1 is a block diagram of the instructional computer system. Its principal components: the DECsystem KI10 instructional computer running the TENEX operating system; a Systems Concepts MF10 core memory of 256K 36 bit words with parity bit, 300 nanosecond access time, and four internal interleaved ports; IBM tape drives 0 and 1 (9 track, 800/1600 bpi) on an IBM 3803 tape controller, attached through a Systems Concepts SA10 IBM-to-DEC selector channel adaptor; an IBM 3330 style disk controller with Disk Drive 0 (for audio and others) and Disk Drives 1 and 2 (for the TENEX file system), each drive holding 30 million words; General Instruments head-per-track Drums 0 and 1 of 0.9 million words each (Drum 0 for TENEX swapping, Drum 1 for audio storage) on a Systems Concepts DR10 drum-to-DEC interface; a high speed line terminal interface with 10 synchronous lines to multiplexers, 11 asynchronous lines to graphic displays, and 2 line printer interfaces; and 48 output channels carrying 16 simultaneous independent voices to the MISS terminals.]

Figure 1. Block Diagram of the Instructional Computer System


Since no suitable synthesis system was commercially available, we implemented a linear prediction analysis program, designed and built a hardware synthesizer, and implemented software support suitable for use by curriculum programs.

3 Hardware Implementation of an Operational Synthesis System

A complete technical description of the hardware and software implementation is beyond the scope of this paper. Such a description will be available in a forthcoming technical report [17]. Here we present an overview of the implementation considerations important to the success of a CAI voice response system.

3.1 The Instructional Computer System

The instructional computer system at IMSSS consists of a central computer with its associated random access memory, a drum used for swapping, magnetic disks used for information storage, magnetic tapes used for off-line information storage, and user terminals used by both students and programmers. The computer is a Digital Equipment Corporation KI10 running the TENEX timesharing system. Core memory consists of a quarter of a million 36 bit words with an access time of 300 nanoseconds. The head per track drum provides an extension of memory to over 2 million words by using virtual memory and swapping techniques. On-line disk storage is over one hundred million words. Off-line storage uses conventional nine track tapes.

3.2 Hardware Additions

The implementation of a word based synthesis system involved the addition of a second drum for vocabulary storage, the construction of a special purpose computer to perform the synthesis calculations, and the addition of amplifiers and headsets to the terminals. We decided that up to forty-eight terminals could benefit from speech capabilities. Based on our previous experience with a Delta Modulation speech system, we estimated that no more than sixteen of those terminals would require simultaneous speech. The head per track drum was needed because moving head disks are not fast enough to access the speech data for sixteen independent voices. The system with these additions is shown in Figure 1.

3.3 MISS Hardware

We call both the overall voice response system and the actual synthesizer hardware MISS, which stands for Microprogrammed Intoned Speech Synthesizer. The synthesizer is actually a pair of very fast and closely coupled processors specially designed to efficiently perform the type of calculations needed. One of the processors is a microprogrammable digital filter which does the actual speech synthesis using linear predictive parameters. The output of the digital filter is converted to analog speech signals by any of forty-eight digital to analog converters whose outputs are transmitted to individual terminals. The other processor is a more general purpose computer which interprets commands from the instructional computer, retrieves speech data from the main memory, performs the appropriate intonation computations, and passes the data to the digital filter. A block diagram of MISS can be seen in Figure 2.
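As an illustration of the kind of per-sample work the microprogrammed filter performs, here is a sketch, in Python and with made-up coefficients, of an all-pole synthesis filter evaluated as a cascade of two-pole sections, one section per filter "instruction"; MISS's actual microcode, number formats and filter forms are of course different.

```python
# Sketch (ours, not MISS microcode) of a cascade of two-pole filter sections
# applied to each excitation sample.  Coefficients below are placeholders.
def cascade_two_pole(excitation, sections):
    """sections: list of (b0, a1, a2) for H(z) = b0 / (1 + a1*z^-1 + a2*z^-2)."""
    state = [[0.0, 0.0] for _ in sections]        # (y[n-1], y[n-2]) per section
    out = []
    for x in excitation:
        s = x
        for k, (b0, a1, a2) in enumerate(sections):
            y1, y2 = state[k]
            y = b0 * s - a1 * y1 - a2 * y2        # one two-pole section
            state[k] = [y, y1]
            s = y                                 # output feeds the next section
        out.append(s)
    return out

# Two resonant sections excited by a single pulse.
print(cascade_two_pole([1.0] + [0.0] * 7,
                       [(1.0, -1.2, 0.81), (1.0, -0.9, 0.64)])[:4])
```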

3.4 Terminal Hardware

Each terminal requiring speech has an amplifier and headset. The headset was chosen for its fidelity and isolation characteristics since several students often take lessons in the same room. The terminal headset amplifier is connected to the synthesizer by a pair of wires in much the same way as the terminal itself is connected to the computer's terminal multiplexer. In addition, synthesizer channels may be connected directly to phone line arrangements to service users via the telephone system.

4 Software Implementation of an Operational Synthesis System

We will briefly describe the vocabulary recording process, the support software written for use by the curriculum programs, the programs needed to experiment and develop the MISS system, and the data structures used in the operational system.

4.1 Vocabulary

Of primary importance to a word based synthesizer is the definition of the vocabulary. The English lexicon was composed from our previous audio system's vocabulary of 2700 words and was expanded to 12,000 words for other curriculums which had not previously used audio. A human speaker was chosen for the intelligibility and quality of his voice. He recorded the words directly onto the computer system using pulse code modulation at 240,000 bits per second to achieve a very accurate representation of the words. This recorded vocabulary was stored on digital tape. Later, the words were retrieved from tape, the Linear Predictive analysis was performed and the derived parametric representations were stored on both the drum and archival tape. This analysis reduced the storage requirements to 6000 bits per second of speech. Finally, the synthesized words are being verified to insure that each is typical of the intended word. Native Chinese and Russian speakers have recorded small vocabularies for use in language courses. These vocabularies will be analyzed like the English vocabulary.
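A back-of-the-envelope check of these figures may help; the 0.6 second average word duration below is our assumption, not a number given in the paper.

```python
# Rough check of the storage reduction described above.
PCM_BPS = 240_000        # pulse code modulation recording rate
LPC_BPS = 6_000          # rate after Linear Predictive analysis
AVG_WORD_SEC = 0.6       # assumed average duration of a recorded word
WORDS = 12_000           # size of the expanded English lexicon

print(PCM_BPS / LPC_BPS)                      # 40.0  (a 40:1 reduction)
print(LPC_BPS * AVG_WORD_SEC)                 # 3600.0 bits per average word
print(WORDS * LPC_BPS * AVG_WORD_SEC / 36)    # ~1.2 million 36-bit words in all
```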


[Figure 2, a block diagram of the MISS machine, contains the following blocks:]

System Interface. Memory interface: asynchronous word transfers to and from memory under control of the prosody computer. Control interface: dynamic control of channel activity; debugging control of all internal registers and memories; interacts with the prosody computer via common registers.

Prosody Computer. Microprogrammed; 1K read/write program memory; operates on 12 bit data: adds, multiply, logic, shifts, tests, control; 2K read/write data memory; 3 megacycle execution rate; pipeline two levels deep. Four priority interrupt levels: 1. handling interface control commands, 2. feeding data to the filter, 3. memory transfers, 4. prosody calculations and scheduling.

Digital Filter. Microprogrammed; 64 word read/write program memory; operates on 12 bit data; each instruction computes one pole section of the filter in direct, cascade or lattice forms; special instructions for excitation calculations; 400 nanosecond instruction time; pipeline 3 levels deep; 512 word data memory; directly loads any of 48 output channels; interacts with the prosody computer via common registers.

Output Channels. 48 individual channels; digital to analog converters; sample and hold amplifier; 5 pole lowpass filter; telephone level balanced output. A special output is fed back to the prosody computer for experimental use.

Figure 2. Block Diagram of the MISS Machine

[Figure 3 traces one spoken message through the system. Its annotated components are:]

Curriculum Program. The curriculum program composes a message and calls the library procedure SPEAK: SPEAK("Every set has a power set.").

SPEAK Library Procedure. SPEAK breaks the argument into words. For each word it looks up the word's storage address in the lexicon and places them in a command list. The command list: 1. address of EVERY, 2. address of SET, 3. address of HAS, 4. address of A, 5. address of POWER, 6. address of SET.

System Procedure. The system tells MISS where the command list is, then fetches the data from the drum/disk into a double buffer. As the data for each word is fetched, the corresponding command list entry is changed to reflect where in memory the data is. MISS uses this change to know it is possible to start retrieving the data. When MISS is finished with a buffer, it zeros out the command list entry; this tells the system the buffer is free and available for fetching the next sound.

Lexicon. The lexicon is a hash table of all available words in a language. Entries in the table are addressed by name. Each entry has 11 fields which tell where the sound data is stored, the part of speech, and certain prosodic parameters of the word. This simple example does not show how the part of speech or the prosodic parameters are used to form prosody commands for MISS.

System Buffers. The command list and buffers are shown as the MISS machine is saying "HAS" and the data for "A" is being fetched from the drum. Command list: 0 (completed EVERY); 0 (completed SET); pointer to buffer 1 (HAS); address of A; address of POWER; address of SET. Buffer 1: data for the word HAS; MISS is reading out of this buffer. Buffer 2: this buffer is being filled by the drum with the data for the word A.

Figure 3. Flow Diagram of Speech Data and Commands
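The same flow can be written out as a short sketch. The Python below is only an illustration of Figure 3's bookkeeping (the real library ran under TENEX, and the field layout and addresses here are invented); it also fetches and speaks sequentially, whereas the real system fetches the next word while MISS is still speaking the previous one.

```python
# Illustrative sketch of Figure 3 (not the IMSSS library): the lexicon is a
# hash table keyed by word name, SPEAK builds a command list of storage
# addresses, and each word's data is staged through one of two buffers.
LEXICON = {
    "every": {"device": "drum", "address": 0o1000, "part_of_speech": "det"},
    "set":   {"device": "drum", "address": 0o1040, "part_of_speech": "noun"},
    "has":   {"device": "drum", "address": 0o1100, "part_of_speech": "verb"},
    "a":     {"device": "drum", "address": 0o1140, "part_of_speech": "det"},
    "power": {"device": "drum", "address": 0o1200, "part_of_speech": "noun"},
}

def speak(message):
    """Break the message into words and build a command list of addresses."""
    words = message.lower().strip(" .").split()
    return [LEXICON[w]["address"] for w in words]

def play(command_list, fetch_from_drum, synthesizer):
    """Stage each word's data through a pair of buffers and hand it to MISS."""
    buffers = [None, None]
    for i, address in enumerate(command_list):
        slot = i % 2                               # alternate between buffers
        buffers[slot] = fetch_from_drum(address)   # fill the free buffer
        synthesizer(buffers[slot])                 # MISS reads from this buffer
        buffers[slot] = None                       # zeroed: buffer is free again

# Example use with stand-in fetch and synthesis functions.
play(speak("Every set has a power set."),
     fetch_from_drum=lambda addr: f"<LPC frames at {oct(addr)}>",
     synthesizer=print)
```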



4.2 MISS Software

Software to access and manipulate the words was written as programmed library procedures. These procedures were carefully documented so all instructional programs using audio could access the speech in a uniform and efficient way. The use of library procedures has facilitated making changes in the speech data and synthesizer, since the effects of these changes to the various CAI programs are usually hidden inside the procedures.

Curriculum programs only have to specify the language to use and the level of automatic prosodic manipulations to be performed. Whenever the program wants a particular message spoken, it simply passes the text representation of the message to a library procedure. This procedure is described in the next section.

Acoustic programs are interested in more sophisticated manipulations of the speech. They want to record, edit, move, analyze, and transform the speech data. They also need to edit the lexicon, and graphically display the various representations of speech. Furthermore, these programs want to be able to run both in interactive mode for experimental use and in batch mode for automatically processing large vocabularies. The library procedures provide mechanisms for all these operations in such a way that programs may easily incorporate these features.

We have three main acoustic programs, one for recording and analyzing the vocabulary, one for experimenting with intonation, and one for manipulating and archiving lexicons. The analysis program takes a word, finds where the speech begins and ends, detects the precise locations of all pitch pulses in the speech, performs the Linear Predictive analysis and does further data reductions on these parameters. The analyzed versions of the words are then stored on the drum, the disk, or tape and the text representation of the word is stored in a lexicon. The intonation program allows for manipulating the pitch, duration and volume of sequences of words to study their effect on the perception of sentence prosody. These manipulations will eventually be performed by the MISS processor. The lexicon program is used for defining homonyms, defining special expanded pronunciations of words, and archiving sounds onto magnetic tape.

4.3 Data and Command Structures

A sound may be represented in terms of its Pulse Code Modulation values, its Delta Modulation values, or its Linear Predictor Coefficients. The sound data is stored in a uniform format on all storage devices. The sound data itself is stored, along with additional information naming the actual speech representation used and the methods used to transform the speech into this representation.

The efficient flow of commands and data through the computer system is essential for an operational system. To achieve efficiency, a system must minimize data movements and computations. Since an overly efficient implementation can inhibit the system's flexibility and extensibility, a careful balance must be maintained. Figure 3 shows the flow of information needed for a typical message. A program that dynamically composes a message to be heard at a student terminal passes the text representation of the message to a library procedure. This procedure converts the text into a command list consisting of the drum or disk addresses of the speech data for each word interspersed with intonation commands. The system causes the data to be retrieved from the storage device and placed in a memory buffer. The memory location of the data and the intonation commands are passed to the synthesizer. MISS retrieves the commands and sound data as it needs them, applies the intonation commands to the data, and passes the resulting data to the digital filter. The filter performs its computations which generate a Pulse Code Modulation version of the speech. This is passed to an output channel which converts this data into analog speech. The speech is transmitted to a student wearing a headset at a terminal. Command lists can be prepared by a preprocessor for messages that are not created dynamically, such as standard greetings and error messages, thus improving efficiency.

The lexicon provides a method of addressing by name data on the disk, drum, or tape. Each language has its own lexicon, and each lexicon is associated with a particular region of a storage device. The desired name can be efficiently looked up in the lexicon, and information associated with the name referenced. Each name has associated with it up to ten information fields, three of which are used to specify the storage device and address of the sound data. Other fields are optional and can contain linguistic information needed for parsing and applying intonation contours, or other information needed by a particular application. Lexicons are implemented as shared program segments under TENEX, so all programs using a language share the same physical memory.
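A sketch of what one lexicon entry might look like under this description; only the fact that three fields locate the sound data comes from the paper, and the remaining field names are our guesses at the optional linguistic information.

```python
# Hypothetical lexicon entry; field names beyond device/region/address are
# assumptions made for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LexiconEntry:
    device: str                               # "drum", "disk", or "tape"
    region: int                               # region of the storage device
    address: int                              # address of the sound data
    part_of_speech: Optional[str] = None      # used when parsing a message
    stress: Optional[str] = None              # used for intonation contours
    duration_frames: Optional[int] = None     # nominal duration of the word

# One lexicon per language.  In the real system these are shared TENEX
# program segments, so every program using a language sees the same copy.
english = {"power": LexiconEntry("drum", 3, 0o1200, part_of_speech="noun")}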

5 Applications

The CAI systems currently under development at IMSSS emphasize dialog with the student concerning a semantic structure built up interactively. We refer to this approach to the use of computers in education as the use of complex instructional systems (see [24]). For example, the EXCHECK system teaches set theory by building a high-level model of the student's proof.

In Section 2 we discussed the requirements that we placed on an audio system. We stressed there that our overall need is to compose intelligible messages from randomly accessed words, with sufficient quality for foreign language instruction.


5.1 Previous Uses of Audio in Computer-assisted Instruction

Several of the methods of producing sound discussed in Section 2 have previously been used instructionally, and are still being used. We will not give a complete history of this, but will mention a few examples.

Analog tape has been used frequently. Audio tape was used for a number of years in the teaching of Russian at Stanford. A bank of six high-quality tape machines were put under computer control. Tapes keyed to the curriculum were mounted on the tape decks by an operator. This system gave quality reproduction, but the messages were fixed in both order and content, with the curriculum designed to move in lock-step fashion. This system, without basic changes, was later implemented with cassette tape hardware.

Analog tape was also used in the Stanford-Brentwood Computer-assisted Instruction Laboratory [23], a project that combined film, audio, and CRT output under computer control. There were 17 instructional stations, and a computer-controlled tape drive associated with each station. Lessons in the mathematics curriculum covered such topics as counting, numerals, linear measure, and so on.

Audio messages were coordinated to the CRT display. For example, in one lesson in mathematics, a car and a truck surrounded by set braces appeared on the screen, accompanied by the audio "There are two members in this set." After the message, two more sets appeared on the screen, one empty and the other containing a train and a steamshovel. The student heard the instructions "Find another set with two members". To answer correctly, the student moved his light pen to the area of the screen with the correct set, and the computer responded "Yes, the sets have the same number of members". (See pp. 271-301 of [23] for more details here.) Messages were contingently programmed to depend on the student's response to the problems. The tape drives offered locally random accessing and good quality, but no composition.

Specially designed tape devices have been put to varying instructional uses. For example, Ishida and Fujimura [10] used a hybrid analog-digital device for instruction in foreign language pronunciation. The system provided locally random accessing.

Analog disks and drums have also been used, perhaps less frequently than tapes. For example, a drum-type system was used at IMSSS to teach spelling (see [12]). The particular system, a Westinghouse Prodac-50 controlling 12 drives with six-inch wide tapes, allowed a degree of random access. The quality of sound was good but the equipment required inordinate maintenance and was hence abandoned. A disk system was implemented for use with the PLATO IV system. Fixed messages (for several curriculums) could be randomly accessed. The system has been used experimentally in teaching elementary reading and veterinary medicine. This system was considered too unreliable, but a new version of the system is planned. [19]

Digital PCM audio was used in a system called Dial-a-Drill, a product of the Computer Curriculum Corporation, as reported in [11]. New York City students were called at home and given daily drills in elementary mathematics over the telephone. The students responded to the audio messages by typing the keys of a touch-tone telephone.

Delta modulation was used very successfully at Stanford in teaching elementary reading (see [3]). The reading curriculum was divided into strands, each of which was "designed to provide practice on a particular decoding or communication skill" ([3], p. 5). Typical student dialog consisted of a Teletype terminal printing out a list of words

    STRIKE   BIKE   LIKE

while the audio message told the student to "TYPE STRIKE". Suitable audio and printed reinforcement was given when the student typed the correct answer.

The main instructional innovation of the reading program was the use of a complex statistical model that optimized the student's instruction. This helped produce the "significant and consistent gains in reading achievement over what would be expected from classroom instruction alone" ([3], p. 1).

The use of audio in reading was fundamental to the program. Over 2700 words and short phrases were available to the program, which randomly accessed those messages. The Delta system served well, but the quality was poor and the coverage was incomplete. We should also point out that student dialog was minimal. The messages were either just a word or a short phrase, and the program did not create a semantic data base of any sort about which to have a dialog with the student.

The Delta system was also used for teaching French at Stanford. The curriculum structure was based on eight strands for conceptual categories, such as gender, pronunciation, and idioms. Within each strand, students received drill on patterns, such as:

    AUDIO: Paris est une ville.
    AUDIO: un lion ... un animal

In the above, the student used the first message as a pattern, and the second message as the words for that pattern. The correct response from the student would have been for him to type:

    UN LION EST UN ANIMAL.

The inability to compose messages (other than by concatenating the sounds) made creating patterns like the above quite difficult. The results obtained were achieved only by re-recording the sounds of the individual words until some of the patterns fit together reasonably well. Furthermore, words recorded by the same speaker on different days did not combine as well as words recorded at the same time. The lack of prosodic control extends well beyond this, however, into the semantics and pragmatics of the message.



5.2 Using Audio in Language Instruction

Raugh, Atkinson, and Schupbach [16] used the Delta audio system in controlled experiments to determine the effectiveness of the keyword method for learning foreign language vocabulary. As part of the learning method, the computer spoke the words of the target languages (Spanish and Russian in different experiments), establishing an acoustic link in the student's memory to an English keyword. The Delta system was efficient enough to serve in this kind of application, which did not require composed messages.

MISS will be used for instruction in Mandarin Chinese in a course developed by Mr. E-Shi (Peter) Wu. This course is now being experimentally used with the old Delta Modulation audio system, and will be used with MISS in early 1976. This work is briefly described in [24]. The conversion to MISS will enable us to experiment with using intonation manipulations to create the different tones of Chinese.

We will be able to evaluate the importance of the higher quality and more complete coverage of MISS by using it with curriculums which had previously used the Delta Modulation system. We expect that MISS's better speech will, in itself, only slightly improve these curriculums. The real benefit of MISS will be the ability to say things that previously could not be said.

5.3 Contributions of Speech to Program-Student Interaction

We have identified three separate models of program-student interaction using computer controlled speech as a main component. Each of these models arises from a real instructional problem in the logic and set theory courses at IMSSS. These three models are:

1) The computer gives directions to the student about how to use the computer in some way. Here, the audio component takes the place of a human instructor who might sit beside the student guiding him through an initial dialog.

2) The computer gives an explanation, using graphic display and audio simultaneously.

3) The computer and the student are in full dialog, with some appropriate parts of the conversation using the audio component.

We now give some details of the three audio projects underway with regard to logic and set theory courses at Stanford.

5.4 Using Audio to Give Directions

A common problem for the student is getting started using an instructional program. It is difficult to remedy this using only the student's terminal as the medium of communication, since the assistance gets confused with the object dialog. Common techniques to handle this kind of instruction include the use of manuals and human assistance. We have experienced this problem with the set theory program, for example. The manual provides sample dialogs and complete command specifications, but we have generally made personal appointments with the students on various occasions throughout the course, particularly at the beginning, in order to familiarize them with the operation of the set theory program.

We are designing a program OVERVIEW that will solve this problem in part. It operates by putting the object program (in this case, the set theory program) into a job structure separate from the OVERVIEW program. Thus it can observe the operation of the object program, generating audio messages about the dialog. It is important that OVERVIEW operate somewhat asynchronously from the object program, and that the object program not have to be extensively modified for this purpose. The disadvantage of OVERVIEW is that it will have only limited access to the internal components of the object program. (Some of the data structure will be shared, through features of TENEX.)

5.5 Using Audio in Proof Explication

Another problem we have noted in set theory is the difficulty the student experiences in keeping a clear notion of the proof he is constructing, especially when the argument becomes lengthy. We have developed a procedure that takes a lengthy proof, finds the "important" steps of that proof, and produces a sketch or summary. There are many proof-theoretic and linguistic problems in obtaining this analysis, which are discussed elsewhere.

When the student requests a sketch of his proof, the program will produce a graphic display (see [20] for a sample of such a display). This display shows the main steps of the proof (or subproof) and their relationship. The audio will then further summarize what is on the screen, paraphrasing formulas and giving derivation chains informally.

For example, consider the problem of paraphrasing sentences of set theory. If the (somewhat formal) sentence

    For every x there is a y such that y = powerset of x

is a part of the derivation, we can paraphrase this with the audio message

    Every set has a power set.

The basis of this particular paraphrase is 1) dropping of explicit variables not linguistically necessary, and 2) using the verb "to have" to indicate function application. We know how to generate for output purposes many informal paraphrases that we do not know how to recognize when input.
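Purely as an illustration of those two operations, and not of the authors' natural-language machinery, a single hard-coded template suffices for this one example:

```python
# Toy paraphrase template: drop the explicit variables and use "to have" for
# function application.  Everything here is invented for the illustration.
import re

PATTERN = re.compile(
    r"for every (\w+) there is a (\w+) such that \2 = (\w+) of \1")

def x_noun(var):
    # In the real system the sort of the variable ("set") would come from the
    # semantic representation; here it is hard-coded for the example.
    return "set"

def paraphrase(sentence):
    m = PATTERN.fullmatch(sentence.lower().rstrip("."))
    if not m:
        return sentence                      # no informal paraphrase known
    x, _, func = m.groups()
    return f"Every {x_noun(x)} has a {func.replace('powerset', 'power set')}."

print(paraphrase("For every x there is a y such that y = powerset of x"))
# -> Every set has a power set.
```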



The audio, in this application, will give explanations that are quite informal. For example, consider the formula:

    For all m, n, p
        m times n = p
    if and only if
    there are A, B such that
        (Card(A) = m & Card(B) = n AND Card(A X B) = p)

A reasonable audio message for describing this formula might be:

    Defines cardinal multiplication using cross product.

It is reasonable because it might provide just the intuition that the student needs to understand the formula. Teachers in the classroom tend to write in precise detail on the blackboard while giving a less precise intuition verbally. The use of audio here is analogous to this explanation role of the teacher.

This proof explanation system is being developed and should be operational in early 1976.

IMSSS has for a number of years had CAI programs for elementary logic. The current program uses some of the advanced proof machinery from the set theory program in order to make the instruction more informal and at a logically higher level. (See [24] for a short history and some details of the current program.)

A deficiency of the current curriculum is the lack of instructional material about semantics. A common teaching method in logic texts and courses is to give the semantics of first-order logic in natural language paraphrases and examples. In traditional courses, the student is often called upon to formalize sentences and arguments starting with an English formulation.

Inspection of the fragment of English involved in several texts (for example, [22]) indicates that it is sufficiently uniform for us to handle using our current methods of natural language processing. We are producing a dialog system in which students can formalize sentences and arguments and then deal with these arguments in English and formal logic.

In doing this, it is important to keep the "object language" of first-order logic separate from the "meta-language" of the English sentences. This ability will be provided by the audio system, which will inform the student of the intended interpretation of the formulas and proofs he is constructing. Paraphrases, similar to those described above in the discussion of set theory, will be used.

This application of audio will be ready for students in late 1976.


5.7 Summary of Applications

The above initial applications of the audio system are all directed to current problems in the logic and set theory courses. All of them share the following properties:

1) The vocabulary is relatively small and fixed.

2) The dialog will be based on the constructions that the student is doing, not some fixed set of messages in a branching network. This is particularly true in the latter two examples.

3) It is very important to be able to alter sentence stress patterns. We believe the ability to paraphrase formulas and summarize proofs depends on control of stress.


References W. A CAl programEducational Technology,49.

for the home.December, 1971, p.

Allen, J. Speech synthesis from unrestrictedtext. In J. L. Flanagan & L. R. Rabiner(Eds.), Speech Synthesis, Stroudsburg, Pa.:Dowden, Hutchinson, & Ross, 1913.

12. Lorton, Paul Jr., Computer-based instructionin spelling: An investigation of optimalstrategies ~ presenting instructionalmaterial. Ph. D. dissertation, StanfordUniversity, June, 1973.

2. Atal, B. S. & Hanauer, S. L. Speech analysisand synthesis by linear prediction of thespeech wave. Journal. of ~ Acoustical

_Society of America, 1970, 2Q, 631-655.

13. Makhoul, J. I. Natural communication withcomputers: Speech compression research ~BBN. Bolt Beranek and Newman, Inc.,Decel.llber 1974, Final Report, V. II, ReportNo. 2976.

Coker, C. H., Umeda, N. & Browman, C. P.Automatic synthesis from ordinary Englishtext. IEEE Transactions ..Q!l. > Audio .§!1lQ..§.acoustics, June 1973, AU·21, No.3, 293­298.

Atkinson, R: C., Fletcher, J. D., Lindsay, E.J., Campbell, J. 0., and Barr, A.Computer-assisted instruction in initialreading (Tech. Rep. No. 207). Stanford,Calif.: Institute for Mathematical Studiesin the Social Sciences, StanfordUniversity, 1973.

,

3.

4.

5. Fairbanks,drillbookBrothers,

G. ~(2nd ed.l.

1940.

~ articulationNew York: Harper &

14.

15.

16.

Makhoul, J. Spectral analysis of speech bylinear prediction. IEEE Transactions ~Audio .illl.d. Electroacollstics, June 1973, AU­.?L. B£.".3., 140-148.

Markel, J. D. & Gray, A. H. A linearprediction vocoder simulation based uponthe autocorrelation method. l§g[Transactions .Qn. Acoustics, Speech. andSignal Processing t April 1974, ASSP-22. No.E., 124-134.

Raugh,M. R., Schupbach,R. D., &.Atkinson,R. C. Teaching a large. Russian vocabulary£l~ mnemonic keyword method (Tech. Rep.No. 256). Stanford, Calif.: Institute forMathematical Studies in the SocialSciences, Stanford University, 1975.

11. Jerman, M. ,Clinton, J. P. M., & Sobers, A.

Ishida, H., & Fujimura, 0., A computer-basedpronunciation-hearing test system using ahybrid magnetic tape unit, Proceedings of~ IrIP Congress 1211, V. 2, Amsterdam:

. North Holland, 1971.

Gray, A. H. &Markel, J. D. A spectral­flatness measure for studying theautocorrelation method of linear predictionof speech analysis. ~ Transactions m2Acoustics. Speech and Signal Processing,June 1974, ASSP-22, B£.".3., .207-217.

S. General phonetics. Madison,University of Wisconsin Press,

&Blaine, L. H., A generalizedmathematics instruction, in

Personal communication, .CERL,of Illinois, Champagne-Urbana,

L.for

J. Risken,UniversityIllinois.

Smith, R.systempress.

Smith, R. L., Graves,H., Blaine, L. H., &Marinov, V. G. Computer-assisted axiomaticmathematics: Informal rigor~ In O. Lecarme& R. Lewis (Eds .,.), Computers in. education.part ~ IFIP. Amsterdam: North Holland,1975.

Sanders, W. R~ & Benbassat, G. V. ~ MISSspeech synthesis system, in preparation.Institute for Mathematical Studies in theSocial Sciences, Stanford University.

Stevens, K. N., Kasowski, S. & Fant, C. G. M.An electrical analog of the vocal tract.InJ. L. Flanagan, & L. R. Rabiner, (Eds.),Soeech _ synthesis.. Stroudsburg, Pa.:Dowden, Hutchinson & Ross, 1973.

21.

19.

18.

20.

17.

R. (ed.), SpeechPa. : Dowden,

Speech analysis synthesis andNew York: Springer-Verlag,

Heffner, R';"'M.Wise.: The1969,

Flanagan, J.L. & Rabiner, L.synthesis. S~roudsburg,

Hutchinson & Ross, 1973.

Flanagan, J. L.perception.1972.

10.

7.

6.

9.

8.

210

Page 12: SlGeDE Bulletin, 1976, 200-211. - Stanford University · SlGeDE Bulletin, 1976, ~(l), 200-211. SPEECH SYNTHESIS FOR COMPUTER ASSISTED INSTRUCTION: THE MISS SYSTEM AND ITS APPLICATIONS

22. Suppes, P., Introduction to~, New York:Van Nostrand, 1957.

23. Suppes, P., and Morningstar, M., ComDuter­asssisted instruction ~ Stanford. 1966-68:Data. models .. illll! evaluation £f..t.h..,g.arithmetic programs, New York: AcademicPress, 1972.

24. Suppes, P., Smith, R. L., & Beard, M.University-level. g,l l!1. _ Stanford: 19.12.(Tech. Rep. No. 265). Stanford, Calif.:Institute for Mathematical Studies in theSocial Sciences, 1975.

211