
Speech Interfaces to Virtual Reality

Scott McGlashan
Swedish Institute of Computer Science
Box 1263, S-164 28 Kista, Sweden
Tel: +46 70 583 1507   E-mail: [email protected]

Abstract

In this paper, we consider how speech interfaces can be combined with a direct manipulation interface to virtual reality. We outline the benefits of adding a speech interface and the requirements it imposes on speech recognition, language processing and interaction design. We describe the multimodal DIVERSE system, which provides a speech interface to virtual worlds modelled in DIVE. This system can be extended to provide better management of the interaction as well as innovative functionality which allows users to talk directly to agents in the virtual world.

1 Introduction

Virtual reality has sometimes been thought of as embodying a return to a `natural' way of interacting by direct manipulation of objects in a world. However, in the everyday world we also act through language: speaking is a `natural' way of communicating our goals to others, and effecting changes in the world. A direct manipulation interface to a virtual world can be augmented with a spoken language interface so that users can give spoken commands to select and manipulate objects. In this paper we explore design principles and techniques for providing a speech interface to virtual reality.

This paper is structured as follows. In Section 2 we consider the benefits of a speech interface, and in Section 3 some of the requirements this imposes on speech recognition, language interpretation and the interactional model. In Section 4 we describe a system which provides a speech interface to virtual worlds modelled in the DIVE system. In Section 5 we then consider how this system can be extended to improve performance and provide more advanced functionality.


2 Benefits of a Speech Interface

Conventional interfaces to virtual reality are based on direct manipulation, which has three main characteristics (Scheneiderman 1988):

1. Manipulation is carried out by physical actions.
2. The objects manipulated are persistent.
3. The actions are rapid and reversible, and their effect on objects is immediately visible.

In the case of virtual reality, users perceive objects in the virtual world by means of a 2D, 3D or stereoscopic display. As with Macintosh and Windows95 interfaces, actions are effected on objects by selecting commands from menus, keyboard sequences or mouse operations. For example, the user can navigate in the world by using cursor keys, select objects by clicking on them, and apply actions to the selected object by choosing an operation from a menu. The range of objects and actions available to the user is explicit, and the user has the responsibility for explicitly initiating and monitoring actions to ensure that the desired effect has been achieved.

If a speech interface is combined with a direct manipulation interface then the result is a multimodal interface: the user can act upon the world by issuing physical or speech commands and, conversely, the system can respond by speaking and/or by making changes in the virtual world. In most multimodal systems, the same functions can be carried out in each modality. The user is free to decide which modality to use for their actions; for example, users may use direct manipulation for transportation within the virtual world, but the speech modality for manipulating objects. Although the motivation for the choice of modality is frequently inscrutable, various factors apart from personal preference are important, including: the `naturalness' of an action in a modality (issuing spoken commands to move oneself seems counter-intuitive); and the difficulty or complexity of carrying out the action in the other modality (such as the sequence of menu selection and mouse clicking required to manipulate an object). Furthermore, a multimodal interface can provide more efficient interaction than a single-modality interface: the user can issue more commands with less effort. Users can carry out two actions simultaneously; for example, using direct manipulation to move about, and speech to bring objects to them as they move. Alternatively, the user can combine their actions to achieve synergy effects by, for example, clicking on an object and simultaneously speaking the action to be performed on the object[1].

[1] Unfortunately, there is remarkably little empirical evidence to demonstrate that synergy effects occur in practice (Carbonell and Mignot 1993).
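To make the synergy idea concrete, here is a minimal sketch of how a pointing gesture and a simultaneous utterance might be fused into a single command. The event types, field names and time window are illustrative assumptions, not part of any system described in this paper.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical event records; names are invented for illustration.
@dataclass
class ClickEvent:
    object_id: str
    timestamp: float

@dataclass
class SpeechEvent:
    action: str               # e.g. "paint"
    argument: Optional[str]   # e.g. "red"; object left implicit ("paint it red")
    timestamp: float

FUSION_WINDOW = 1.5  # seconds; a click and an utterance this close are merged

def fuse(click: ClickEvent, speech: SpeechEvent):
    """Combine a pointing gesture with a spoken command into one action.

    If the user clicks an object and speaks an action within the fusion
    window, the clicked object fills the command's object slot.
    """
    if abs(click.timestamp - speech.timestamp) <= FUSION_WINDOW:
        return (speech.action, click.object_id, speech.argument)
    return None

# Example: clicking on the house while saying "paint it red"
now = time.time()
print(fuse(ClickEvent("house-17", now), SpeechEvent("paint", "red", now + 0.4)))
# -> ('paint', 'house-17', 'red')
```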


Before we consider some of the specific benefits of combining a direct manipulation interface with a speech interface, let us first consider some of the areas in which speech interfaces have been used. As we shall see, this is helpful in suggesting how speech interfaces to virtual reality can be further extended.

One of the main areas in which speech interfaces have been used is telecom services. This application area is ideal for speech input: it is the modality users are familiar with; other modalities such as direct manipulation cannot (currently) be supported due to the low bandwidth of the telephone network; and the computer system shares the same modality limitations as the human user. Speech interfaces have been deployed in two classes of services: voice command and control, and interactive information services. For example, in a tele-banking application the system can prompt the customer for their bank account number: customers can just as easily speak the digits in the account number to a computer system as they can to a human operator. In such a simple application, however, speech is not always the preferred choice. Customers equipped with touchtone phones usually have the alternative of entering the digits using their telephone keypad. While the touchtone method may be `unnatural', users can find it more efficient if the speech interface is error-prone, and so are prepared to learn the touchtone conventions (Basson et al. 1995). This situation, however, is less likely to persist when the recognition task goes beyond simple digits (since the touchtone codes are then more complex) and users become more familiar with the conventions of speech interfaces, especially the need to use a restricted language which the system understands. In fact, in these cases users can positively prefer the speech interface. The significance of the results of speech-based telecom services is two-fold: speech interfaces can be beneficial when they are more efficient than the alternatives, independent of how `natural' they are[2]; and, like any other computer interface, users have to learn the conventions governing their use, as these may not be the same conventions used in a comparable human service. Just as virtual reality is governed by conventions other than those applicable in the everyday world, so too is spoken interaction with computer systems.

[2] In fact, familiarity is more important, and familiarity can be an effect of the efficiency of the modality.

Speech interfaces have been most successfully deployed as part of interactive dialogue systems in limited task domains, such as banking and travel services, where users use language to achieve a specific goal. In these applications, systems play the role of a cooperative agent in a dialogue: i.e. they attempt to recognize and understand what the user says, interpret the utterance against their own goals in the interaction, and produce a response which is appropriate in the context of the dialogue. Some of the salient characteristics of spoken dialogue systems are (McGlashan et al. 1992):


Restricted Language: The system is not capable of understanding general natural language, but only language which is appropriate to the particular task. Users are required to use a limited vocabulary and syntax if they are to have successful interactions with the system. Identifying the appropriate restricted language is partly based on studies of human-human dialogues within the same task domain, which have shown that situated language use is highly conventionalized, and on experiments using `Wizard of Oz' or `system-in-the-loop' techniques, which reveal that users naturally use more restricted language when talking to a machine (Giachin and McGlashan 1996).

Incremental Information Transfer: Users are able to provide all the information at once, or incrementally (see the sketch after this list). For example, if a user wishes to obtain information about a flight leaving Stockholm at 1430 on Tuesday, then they can either provide all the information at once (a flight leaving Stockholm at 1430 on Tuesday) or initially provide a single piece of information (a flight from Stockholm) and provide the remaining information when requested by the system. Incremental information transfer is also a technique used by the system to reduce the complexity of the task when a dialogue degrades. For example, if the system is having difficulty obtaining the date, then it can simplify the task by first asking for the year, then the month, and finally the day.

Feedback: The system reports to the user its own states and goals in order to make its behaviour as transparent as possible (Frankish 1989). The system may confirm information it has understood, or give an explanation of why it has not understood the user's utterance (for example, the user spoke too quietly). The precise structure and content of the feedback messages, as well as of other system utterances, is very important, since dialogue participants have a tendency to replicate the language of other participants, be they human or computer (Garrod and Anderson 1987; Zoltan-Ford 1991). Well-designed system messages can reinforce the restricted language which the system can understand.

Dialogue Management: Both the system and the user have a responsibility to ensure that the dialogue proceeds smoothly. The system is responsible since (a) it may know more about the task domain than the user, and (b) it may be the source of problems in the dialogue, such as speech recognition and language interpretation errors. Consequently, the system should follow strategies for navigation, such as taking the initiative and asking questions to obtain information not provided by the user but which is necessary to complete the task, and strategies for detecting and repairing problems in the dialogue. These strategies may be proactive, so as to avoid later problems (such as incrementally confirming information given by the user), or reactive, for dealing with errors when they do arise (such as asking users to spell the names of cities). These dialogue management techniques are essential to ensure an efficient and robust dialogue with users (Eckert and McGlashan 1993).
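As an illustration of incremental information transfer, the following sketch shows a slot-filling loop for the flight example above. The slot names and prompts are invented for the example; a real system would drive this from its dialogue manager.

```python
# Required slots and system prompts are illustrative assumptions.
REQUIRED_SLOTS = ["departure_city", "time", "day"]
PROMPTS = {
    "departure_city": "Which city does the flight leave from?",
    "time": "At what time should the flight leave?",
    "day": "On which day?",
}

def fill_slots(initial: dict, ask) -> dict:
    """Accept whatever the user supplied up front, then request the rest.

    `ask` is a callback (e.g. text prompt or speech synthesis plus
    recognition) returning the user's answer for one slot.
    """
    frame = dict(initial)
    for slot in REQUIRED_SLOTS:
        if slot not in frame:           # system takes the initiative
            frame[slot] = ask(PROMPTS[slot])
    return frame

# All at once: "a flight leaving Stockholm at 1430 on Tuesday"
print(fill_slots({"departure_city": "Stockholm", "time": "1430",
                  "day": "Tuesday"}, input))
# Incrementally: "a flight from Stockholm" -> system asks for time and day
# fill_slots({"departure_city": "Stockholm"}, input)
```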


As part of a spoken dialogue system, a speech interface can be seen as an instance of an indirect management interface: i.e. an interface which engages the user in a cooperative process with a computer-based agent (Kay 1990). Alternatively, a speech interface may be part of a natural language understanding system: i.e. the system will recognize and understand user input so as to yield a semantic representation of the input. In the latter case, the speech interface may share properties of a direct manipulation interface, especially the requirement that the user explicitly initiate and manage the interaction. In both cases, however, the properties of restricted language, feedback and incremental information transfer are crucial for an effective speech interface.

Having discussed speech interfaces in the context of dialogue systems for information services, let us now consider their benefits in the context of a virtual reality system. There are three principal benefits in comparison with a direct manipulation interface: naturalness, hands/eyes freedom, and `beyond the here and now'.

The first benefit is naturalness or, more precisely, familiarity. Users are familiar with using language to act in the world. This benefit, of course, needs to be tempered with the `unnaturalness' of the restricted language which the system can recognize and understand, as well as with the user's familiarity in using direct manipulation to carry out the same action. For example, a user may become very familiar with, or simply prefer, using direct manipulation for actions such as self-transportation. Furthermore, using speech commands for self-transportation is not natural in the everyday world: normally, we simply carry out the appropriate physical actions rather than say legs, move me to the bar!. However, there are situations where this action is `naturally' carried out using language; for example, a physically handicapped person may rely on a helper to move their wheelchair; likewise, a high-ranking general may have a personal driver to move them around a battlefield. Extrapolating from these situations, users may find it more natural to use spoken commands for certain classes of action in the virtual world if they are addressing an agent specialized for these functions. We will return to the idea of populating a virtual world with such specialized agents in Section 5.4 below.

The second advantage is that speech offers a way of issuing commands while allowing hands and eyes to remain free. Thus operations normally carried out through the direct manipulation modality (such as transportation, change of view, object creation and deletion, etc.) can be effected without tying up the other modality (cf. synergy effects, which require that both modalities are used for the same action). Thus different actions can be carried out simultaneously using different modalities. This is particularly useful in cases when hands/eyes are already busy, but other tasks need to be dealt with from time to time; for example, when direct manipulation is used to drive a tank, speech can be used to issue commands for firing rockets.


The third advantage is that users can refer to objects which are not present in their current view of the virtual world. This is not possible in a direct manipulation interface, where actions can only be applied to objects which are visually present. Users can use speech to select and manipulate objects which were in visual focus (the last tent entered), will be in visual focus (the next tent in a series), are simply known-about objects (the general's tent), abstract objects (such as the set of tents which contain major generals), high-level actions, and so on.

As we shall see, further advantages of a speech interface arise when it is embedded in a dialogue system capable of interacting with agents in the virtual world. For whereas direct manipulation requires that the user explicitly initiate and monitor all actions, a dialogue-based speech interface to a virtual world can initiate clarification dialogues when further information is required to carry out a command, or monitor facets of the virtual world and report significant changes to the user.

3 Requirements on Speech, Language and Interaction

In order to provide these benefits, there are costs. Three of the issues that need to be dealt with are speech recognition, language interpretation, and the interactional metaphor.

3.1 Speech Recognition

While speech may be a `natural' modality for acting in the everyday world, it is more difficult to model in a computer system than direct manipulation. Several features characterizing a speech recognition system are shown in Table 1 (Giachin and McGlashan 1996).

Table 1: Features of a Speech Recognizer (difficulty increases from left to right)

    Feature           Easier                Harder
    Training          speaker-dependent     speaker-independent
    Vocabulary size   tens                  thousands
    Utterance mode    isolated              continuous
    Speech mode       read                  spontaneous
    Language          restricted language   natural language
    Environment       favorable             adverse

Isolated speech arises when words are separated by silences; continuous speech, when the acoustic signal lacks natural sub-divisions equivalent to words. Speech is normally continuous; for example, spoken naturally within a sentence, there will not be any significant acoustic break between the words paint it red. The pronunciation of individual words may vary widely in acoustic properties, even when spoken by the same speaker. One important cause of variation is co-articulation, where the articulation of neighbouring sounds affects one another. Articulations are also modified by the context, with the result that the pronunciation of words spoken spontaneously differs from the same words read slowly and deliberately. Spontaneous speech also introduces naturally occurring phenomena including: extra-linguistic phenomena such as coughs, blows, lip `smacks', and filled pauses (for example, paint erhm the house ahhm red); and re-starts, where the speaker reiterates part of an utterance after an interruption (take me to the hou- the red house). The consequence is that the acoustic properties of individual words are not distinct and constant, but highly context dependent, varying according to the preceding and following speech. This is one of the reasons why recognition of speaker-independent, continuous spontaneous speech is so difficult. The use of large vocabularies in adverse environments makes the task extremely difficult.

In order to provide a representation for subsequent grammatical analysis, techniques have been developed for segmenting the acoustic signal, classifying these segments into acoustic-phonetic units, and then comparing these units with a database of known templates to identify those which best match the input.

Firstly, electrical signals from a microphone (or telephone line) are conditioned to remove as much extraneous background noise as possible and smoothed to eliminate major amplitude variations. The analogue signal is then converted into digital form and significant features are extracted. One common method is to pass the signal through a set of frequency filters, where each filter extracts the part of the signal within its own bandwidth over a short time period (10 to 20 milliseconds). The overall output from the whole filter bank is characterized as a feature vector. In order to determine what has been said, the feature vector must be compared with templates stored in the system. In principle, the problem of recognition then reduces to the problem of comparing each feature vector obtained from the input against all the stored templates to find the closest matches. In practice, a more complex method must be used to allow for the effects of co-articulation, to compensate for timing differences arising from different speaking rates, and to deal with differences between speakers.
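The following sketch illustrates the filter-bank front end just described: framing the signal, measuring the energy passed by each frequency band, and emitting one feature vector per frame. It is a deliberately crude stand-in for a real front end (no mel spacing, windowing or pre-emphasis); all parameter values are illustrative.

```python
import numpy as np

def filterbank_features(signal: np.ndarray, rate: int = 8000,
                        frame_ms: int = 20, n_filters: int = 12) -> np.ndarray:
    """Crude filter-bank front end: one feature vector per short frame.

    Each frame's spectrum is divided among `n_filters` equal frequency
    bands; the energy in each band is one component of the feature vector.
    """
    frame_len = rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    features = np.zeros((n_frames, n_filters))
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2    # power spectrum
        bands = np.array_split(spectrum, n_filters)   # equal-width "filters"
        features[i] = [band.sum() for band in bands]
    return features

# One second of a synthetic 440 Hz tone -> 50 feature vectors of size 12
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
print(filterbank_features(tone).shape)  # (50, 12)
```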

Page 8: Sp eec h In e rf ace s t o Virt ual Re alit y...h a us ers can giv e sp ok en comm an ds o s elect d m anipula ob ject s. In t hi s pap er w e exp lore d e s ign pr incip le s an ec

One of the most successful methods for acoustic-phonetic modelling is Hidden Markov Modelling (HMM) (Rabiner and Juang 1993). Each phonetic unit is represented as a sequence of feature vectors, and an acoustic model contains a series of states corresponding to a particular feature vector. In order to recognize a phonetic unit, the states are visited one by one (with the possibility of looping back to the same state, or omitting some states, as shown in Figure 1). Every time a state is visited, a feature vector is generated. Each transition between states is associated with a transition probability which expresses the probability that a particular sequence of states will occur.

[Figure 1: A four-state Hidden Markov Model, with states s1-s4, self-loop transitions t11-t44, forward transitions t12, t23, t34, and skip transitions t13, t24, t14.]

The phonetic units are typically not words, but context-dependent phonetic units which make up words. Approaches based on phonetic units can deal with differences in pronunciation resulting from co-articulation and assimilation. Models can then be linked together (with further transition probabilities) to produce composite models with which whole words can be recognized. The transition probabilities themselves are learnt from training data. An advantage of the HMM approach is that the models can allow for differences between speakers by using training data from many speakers, thus making it possible to develop a recognition system which is speaker-independent.

Finding an optimal word sequence is carried out by a decoding algorithm such as the Viterbi algorithm or extensions of it. This amounts to determining, for every model in turn, the probability that the model will generate the observed sequence of feature vectors, and then choosing the model with the highest probability. The next step is to move from phonetic units to words, by looking up sequences of phonetic units in a lexicon which contains, for each word in the system, a standard pronunciation and possibly some variants. The lookup has to be done for every probable combination of phonetic units. For this reason, the output is not a simple sequence of words, but a `word lattice', as illustrated in Figure 2, which gives the likelihoods of any of a large number of combinations of words being the ones which the user actually spoke.

[Figure 2: Word lattice for I want to paint it red, containing competing hypotheses such as want/went, drain/pain/paint/faint, and bed/red/tit.]

Linguistic knowledge, however, is frequently deployed at the recognition level to reduce the large number of possible combinations. Consequently, in addition to an acoustic-phonetic model, a language model is also integrated into the recognition process. One of the most successful makes use of a probabilistic finite-state automaton, N-grams, which capture relationships between adjacent words (Jelinek 1989). An alternative is to use finite-state networks (FSNs), which model the relevant information in well-defined phrases. Some experiments show that the use of FSNs improves performance over N-grams for restricted language (Giachin 1995).
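The following sketch shows Viterbi decoding over a small left-to-right HMM of the kind in Figure 1. It is a minimal textbook version, not a production decoder; the model sizes and probabilities are toy values, and the per-state observation log-likelihoods would come from the acoustic model described above.

```python
import numpy as np

def viterbi(obs_loglik: np.ndarray, log_trans: np.ndarray,
            log_init: np.ndarray):
    """Find the most probable HMM state sequence for an observation sequence.

    obs_loglik[t, s]: log-likelihood that state s generates the feature
    vector observed at time t; log_trans[s, s']: log P(s -> s').
    """
    T, S = obs_loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        for s in range(S):
            prev = score[t - 1] + log_trans[:, s]
            back[t, s] = prev.argmax()
            score[t, s] = prev.max() + obs_loglik[t, s]
    # Trace back the best path from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(score[-1].max())

# Toy left-to-right model like Figure 1: 4 states, self-loops and skips.
logt = np.log(np.array([[.5, .3, .1, .1],
                        [0., .5, .4, .1],
                        [0., 0., .6, .4],
                        [0., 0., 0., 1.]]) + 1e-12)
obs = np.log(np.random.default_rng(0).random((6, 4)))
print(viterbi(obs, logt, np.log([1., 1e-12, 1e-12, 1e-12])))
```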


Recent progress in speech recognition has resulted in speaker-independent, continuous speech systems which are capable of providing `reasonable' recognition results with spontaneous speech (Meisel 1995). Based on past performance, recognition of microphone-quality speech is likely to improve continuously in the future, especially for large vocabularies. Figure 3 shows the projected improvement for a state-of-the-art system with different vocabulary sizes over the next five years.

[Figure 3: Continuous Speech Recognition Performance (projected improvement by vocabulary size).]

Excellent recognition results can be obtained at the present time for restricted-language input when FSNs are used. For example, the system will not mis-interpret paint it red as pain tit red if the latter phrase is not part of the restricted language. While this may limit the linguistic capability of the system, it does offer the advantage that subsequent levels of analysis can be restricted in scope: syntactic, semantic and pragmatic analysis need only handle phenomena which can be handled by the speech recognizer. Moreover, restricted languages are a natural part of many everyday activities which are focussed on a particular task. Command and control applications, as well as dialogue applications where the user input at different stages of the dialogue can be accurately predicted, are very promising areas for high-accuracy speech recognition using restricted language.
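A minimal sketch of a finite-state network for a restricted command language is shown below. The vocabulary and transitions are invented for the example, but it illustrates how paint it red is accepted while pain tit red falls outside the network.

```python
# Toy FSN: state -> {word: next state}; not DIVERSE's actual grammar.
FSN = {
    0: {"paint": 1, "move": 1},
    1: {"it": 2, "the": 3},
    3: {"house": 2, "cow": 2},
    2: {"red": 4, "brown": 4},
    4: {},            # accepting state
}
ACCEPT = {4}

def in_language(utterance: str) -> bool:
    """Accept only word sequences with a complete path through the network."""
    state = 0
    for word in utterance.split():
        if word not in FSN.get(state, {}):
            return False
        state = FSN[state][word]
    return state in ACCEPT

print(in_language("paint it red"))    # True
print(in_language("pain tit red"))    # False: 'pain' leaves no path
```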


3.2 Language Understanding

Given that the system can recognize user utterances expressed in a restricted language, the user input needs to be interpreted to determine its effect on the virtual world. This requires that the input is analyzed so as to determine which object the user is referring to, and the action they want to apply. Here we shall restrict ourselves to descriptions of objects, although action descriptions can be treated in an analogous manner.

Linguistic expressions are usually characterized as referring to discourse objects (Heisterkamp et al. 1992; Luperfoy 1992). Discourse objects are part of a discourse model which represents a participant's understanding of an evolving discourse. This level of description is intermediate between a linguistic (or graphical) level of description and the level of objects in the everyday (or virtual) world. This three-way distinction is motivated on the grounds that different types of constraints are appropriate to different levels of description. Referential constraints do not apply to linguistic objects since (a) different linguistic descriptions can reference the same object, and (b) we can refer to objects which have no corresponding linguistic description; for example, in the context of talking about a house, we are able to refer to the windows without there being a prior mention in the discourse. Nor do referential constraints apply directly to objects in the world. The discourse and world levels differentiate between objects which have been mentioned, or presupposed, in the discourse and objects in the world which could be mentioned. Applying referential constraints to the discourse model ensures that interpretation is specific to what has been understood by a participant and allows interpretation to be goal-directed. Reference resolution can be guided by the goals of a participant so as to constrain reference to a specific subset of objects in the discourse model. Applying reference resolution directly to objects in the world cannot be constrained in this way: all objects need to be considered as possible candidates for reference resolution.

Linguistic expressions can be total or partial descriptions of discourse objects; a total description is one which describes all properties of a discourse object and so refers to a unique object in the discourse model; a partial description describes some, but not all, of the properties of the object. Partial descriptions, such as anaphoric and elliptical expressions, are significantly more common in spoken dialogue and discourse; participants tend to produce fragmentary utterances rather than full sentences, and participants are required to draw upon shared contextual, domain and linguistic knowledge for their resolution. For example, if the user wants to paint the house red in Figure 4, the user could say paint the house red, paint the building red, paint the large object red, or even just paint it red. In each case, the system is required to have the knowledge that houses are related to buildings, that buildings are large objects, and that each of these expressions, together with it, can refer to the house. The last (deictic) expression not only indicates how partial descriptions can be dependent upon non-linguistic context, but also that descriptions can be ambiguous: without further information, it is impossible to know that it refers to the house.

[Figure 4: Understanding user expressions in context.]


The knowledge required for reference resolution can be divided into ontological, linguistic and contextual knowledge. Ontological knowledge covers both general knowledge and domain-specific knowledge. This knowledge is described in terms of a typed-object inheritance hierarchy: each object is described in terms of properties and roles; objects are principally related to one another by means of specialization; and objects are organized so that the more general are at the top of the hierarchy and the most specific at the bottom, with objects inheriting the roles and properties of objects above them in the hierarchy. For example, the house object is a specialization of building, which in turn is a specialization of thing, etc. Linguistic expressions are characterized as instances of objects in the inheritance hierarchy, together with linguistic-specific information concerning features such as definiteness (definite versus indefinite noun phrases), polarity (positive or negative descriptions), and so on. Linguistic information is used to guide the search for possible referents in the discourse model. For example, a definite expression such as the house indicates reference to a discourse object already instantiated in the current discourse model. Instantiation may arise from a previous explicit linguistic (or graphical) expression, or by presupposition; we can talk about the rooms of the house because houses are presupposed to have rooms (Bos et al. 1994).

Ontological and linguistic knowledge are not always sufficient for reference resolution: reference may depend upon which objects are most salient in the discourse situation. In the discourse model, objects can be ordered in terms of saliency to yield the focus of a discourse (Grosz 1981)[3]. For example, in Figure 4, if the previous utterance had been create a house, then the referent of it in paint it red can be resolved as the discourse object house, assuming that this is the most salient object. Since computer systems lack the detailed ontological and linguistic knowledge of human participants, tracking the focus in discourse is an essential technique for reference resolution in language understanding and dialogue systems.

[3] We return in Section 4.4.1 to the question of how focus can be determined.
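The following sketch illustrates a typed-object inheritance hierarchy of the kind described above, with house as a specialization of building, which is a specialization of thing. Class, type and property names are illustrative assumptions.

```python
# Minimal typed-object hierarchy with property inheritance.
class Ontology:
    def __init__(self):
        self.parent = {}     # type -> more general type (None at the top)
        self.props = {}      # type -> locally declared properties

    def add(self, typ, parent=None, **props):
        self.parent[typ] = parent
        self.props[typ] = props

    def properties(self, typ):
        """Merge properties of all ancestors; specific overrides general."""
        chain = []
        while typ is not None:
            chain.append(typ)
            typ = self.parent[typ]
        merged = {}
        for t in reversed(chain):        # general first, specific last
            merged.update(self.props[t])
        return merged

    def is_a(self, typ, ancestor):
        while typ is not None:
            if typ == ancestor:
                return True
            typ = self.parent[typ]
        return False

onto = Ontology()
onto.add("thing")
onto.add("building", "thing", size="large", has_rooms=True)
onto.add("house", "building")

# 'the building' may refer to a house object; houses inherit 'large'.
print(onto.is_a("house", "building"))      # True
print(onto.properties("house")["size"])    # 'large'
```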


3.3 Interactional Metaphor

The third issue in developing a speech interface to virtual worlds is the nature of the relationship between the system and the user. In spoken dialogue systems for information services, the role of the system is clear: it is a simulation of a human information service agent, and the user can expect similar, albeit more limited, behaviour from the system. In direct manipulation interfaces to virtual reality the basic metaphor is Personal Presence: the user is embodied as an actor in the world and is thus provided with a perspective on the world. However, this metaphor is not so clearly applicable to spoken interaction, since there is no obvious dialogue counterpart. Various metaphors for spoken interaction have been proposed by Karlgren et al. 1995:

Proxy: The user takes control of various agents in the virtual world and thereby interacts with the virtual world through them; for example, painter, paint the house red!

Divinity: The user acts like a god and controls the world directly; for example, Let the house be red!

Telekinesis: Objects and agents in the virtual world can be dialogue partners in their own right; for example, house, paint yourself red!

Communicative Agent: The user communicates with an agent, separate from the virtual world, which carries out their spoken commands.

Selecting the appropriate interactional metaphor is very important for a speech interface since it will affect the language used in addressing the system: i.e. the complexity of user language is partially determined by what users think the communicative competence of the system is. Just as we design our utterances for human dialogue counterparts (compare how you talk to your partner with how you talk to a tax inspector), so too users design their language for talking to computer systems. The appropriate interactional metaphor can play a central role in restricting the language of users.

In sum, there are three important issues to consider for a speech interface to virtual worlds: accurately recognizing what the user said; identifying what the user is talking about in the discourse context; and designing a dialogue counterpart which facilitates constrained language use.


4 The DIVERSE System

In this section, we describe the spoken language interface to virtual worlds developed as part of the DIVERSE system (Karlgren et al. 1995; Bretan et al. 1995a). The focus was not speech recognition per se, but the development of a prototype system where issues of multimodal interaction could be studied independent of a specific application. The DIVERSE system has speech and direct manipulation interfaces which allow users to move about, and to select and manipulate objects in, a virtual world modelled in DIVE (Carlsson and Hagasand 1993). Before looking at the speech and language understanding components in detail, we describe DIVERSE's interactional metaphor.

Users speak to the system via an agent which follows them around the virtual world; it is always in the bottom right corner of the graphical display, as shown in Figure 5.

[Figure 5: The DIVERSE interface, with the agent and focus list in the bottom right corner of the display.]

The agent embodies the communicative competence of the system and carries out the user's commands by appropriately updating the virtual world. The agent is portrayed as a cartoon-like character, rather than a more realistic rendering of a human, in order to facilitate restricted language use. The agent provides gestural and linguistic feedback to the user. Gestures are used to indicate its attention state; it faces the user while waiting for speech input and turns towards the selected object when the user's command has been understood. Linguistic feedback makes explicit part of the system's understanding of the discourse situation. Synthesized speech messages are used to indicate progress in understanding the commands and their execution; for example, ok, doing it, done and so on. In addition, the agent makes explicit its understanding of discourse focus through a list of salient objects.


In Figure 5 the focus list is above the agent's head, and brown cow is the most salient object. The list also facilitates restricted language use by suggesting which objects can be selected and manipulated by speech commands.

One of the distinctive features of the DIVERSE system concerns the principle, outlined in Section 2 above, that a speech interface must make its understanding of the discourse situation explicit to the user. Unlike spoken dialogue systems, the DIVERSE interface does not attempt to manage the interaction for the user: i.e. it does not confirm information, or correct errors when they arise. Rather, the speech interface, like a direct manipulation interface, adopts a non-managed approach: the system always updates the agent or virtual world on the basis of what it can understand of the user input. While user input will then always have some effect (minimally, the focus list will be updated), it may be the wrong effect due to errors in speech recognition or in reference resolution. For example, if the user says paint the house, but the system understands destroy the house, then the world will be updated by removing the house. DIVERSE subscribes to the principle that `errors don't matter' since, when errors do occur, they are visible to the user, who can immediately take the appropriate action to correct them. As we shall see in Section 5.3, a more managed approach may be appropriate for specific applications.

4.1 Architecture

The system is built upon the DIVE system and adds components for speech processing, language understanding and interfacing to DIVE, as shown in Figure 6. The user speaks (or types) commands to the system. Speech recognition is carried out by a speaker-dependent, continuous speech recognizer which is trained for the domain and outputs a text string (Woodland et al. 1994). The text string is then assigned a syntactic and semantic analysis to yield an interpretation appropriate to the domain. Reference resolution identifies which object in the virtual world the user is referring to, and the appropriate action is then executed. In addition to feedback via the graphical interface, the system also provides spoken feedback via an INFOVOX speech synthesizer.

[Figure 6: DIVERSE architecture: speech recognition and syntax analysis feed semantic analysis and reference resolution, which drive perception/execution in DIVE; speech synthesis provides spoken feedback.]

4.2 Syntactic Analysis

The string output by speech recognition is syntactically analyzed in two phases: surface analysis and dependency analysis (Bretan et al. 1995b). Surface analysis is carried out by a commercial parser (ENGCG) which provides parts-of-speech and functional category information (Karlsson et al. 1995). The syntactic representation of paint it red is shown below:

(1) "<paint>"
        "paint" <SVOC/A> <SVO> <SV> V IMP VFIN @+FMAINV
    "<it>"
        "it" <NonMod> PRON ACC SG3 @OBJ
    "<red>"
        "red" <Nominal> A ABS @PCOMPL-O


The parser does this by storing for each word its possible analyses and using grammatical constraints to eliminate analyses which are contextually inappropriate; for example, the reading of paint as a noun is eliminated in the above context (cf. bring me the red paint). The effect is that the component is robust, but produces ambiguous analyses, especially with prepositional phrases, whose functional analysis can frequently only be determined with semantic and pragmatic information. This surface approach differs from approaches which produce deeper analyses but tend to be fragile. In the DIVERSE system, input need only be analyzed to the extent that it is relevant to the command and control task. Deeper analysis is not required.

The second stage of analysis constructs a dependency graph. Like the part-of-speech analysis, dependency is a traditional approach to syntactic structure which predates Chomskian grammar, but has much in common with many modern syntactic theories (Corbett et al. 1993). At its heart is the principle that words are related by dependency relations: e.g. a noun depends on a verb (argument), and an adjective depends on a noun or verb (modification). The effect of the dependency analysis is to extract unambiguous graphs which represent the argument and modification structure of utterances.


[Figure 7: Dependency analysis of paint it red: pred paint (mood: imp), with object it (def: sg) and mod red.]

In the analysis of paint it red shown in Figure 7, it is an argument of paint, and red a (resultative) modifier of paint.

4.3 Semantic Analysis

The goal of semantic analysis is to extract a simple logical representation of the user's input; in particular, a representation referring to an object in DIVE, and a representation of the action applying to the object. With referential expressions, such as it and the red house, the features are divided into those which are used to generate a set of possible DIVE objects (such as `definite') and those which are properties (such as type and colour) used to select between the objects. In (2), the semantic analysis of paint it red is shown:

(2) object: rel(sg:definite, X)
    value:  Y = red
    action: do(paint(X,Y))

The semantic analysis identifies the action as paint, and a representation of it which is specified for definiteness but has no properties (cf. the red house, which would have the property colour red). Before executing the action, the reference of it needs to be identified.
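To illustrate how a dependency graph like the one in Figure 7 can be mapped to a logical representation like (2), here is a small sketch. The data structures are simplified stand-ins for DIVERSE's actual representations.

```python
# Dependency analysis of 'paint it red', loosely following Figure 7.
dependency = {
    "role": "pred", "lex": "paint", "mood": "imp",
    "deps": [
        {"role": "object", "lex": "it", "def": "sg"},
        {"role": "mod", "lex": "red"},
    ],
}

def to_semantics(graph: dict) -> dict:
    """Extract the action, the referential description, and the value."""
    frame = {"action": graph["lex"]}
    for dep in graph["deps"]:
        if dep["role"] == "object":
            # 'it': definite singular, no descriptive properties;
            # reference resolution must supply the object X.
            frame["object"] = {"definiteness": dep.get("def"),
                               "properties": {}}
        elif dep["role"] == "mod":
            frame["value"] = dep["lex"]      # resultative modifier
    return frame

print(to_semantics(dependency))
# {'action': 'paint', 'object': {'definiteness': 'sg', 'properties': {}},
#  'value': 'red'}
```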


4.4 Reference Resolution

The reference resolution component, in conjunction with the DIVE interface, is responsible for matching descriptions to the object in the DIVE environment which the user is referring to[4]. As mentioned in Section 3.2, reference resolution relies upon multiple knowledge sources. DIVERSE uses ontological information about objects in the DIVE world, linguistic information such as definiteness, perceptual properties such as colour, and, most importantly, focus to resolve references.

[4] The DIVERSE system does not distinguish between discourse objects and objects in the world. We return to this issue in Section 5.2 below.

4.4.1 Object Focus

Objects in the virtual world can be in focus to different degrees. What determines their focus level is a combination of different parameters from the visual and discourse situation, and how these parameters change over time. These parameters vary in their priority: an object which is being pointed at is more in focus than one which is merely in the visual field. Both of these have priority over an object which has been mentioned in a user utterance. The parameters also persist and decay at different rates, so that as the interaction progresses their focal status also changes: objects which have been mentioned by the user, or involved in successful actions, stay in focus longer than an object which has only been in visual focus. The relationship between priority and persistence is shown in Figure 8.

[Figure 8: Relationship between the priority and persistence of the parameters in object focus (object is being pointed at, object in visual focus, object referred to by user, successful action on object).]

Using this focus technique, it in paint it red refers to the cow in Figure 5. If the user were then to select the house, focus would change so that the house was in focus. Repetition of the command paint it red would then result in the house being painted red.

4.4.2 Property Perception

Properties in descriptions are important for discriminating between objects. Of particular interest are non-discrete properties, such as colour, which can overlap and vary in how they are perceived and described by users. For example, the colour of an object in a graphical rendering of a virtual world may be described as red or brown due to overlap between the properties, as shown in Figure 9. A prototype approach is used for property matching (Rosch 1975): a property holds of an object if the semantic value is sufficiently close to the property prototype. In Figure 5, the user is able to describe the cow as red even though the system sees it as a brown cow, since these colours can overlap. In this way, a `best fit' is established between the user description and the properties of objects.
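The following sketch brings the two mechanisms together: a focus model in which each parameter contributes a score that decays at its own persistence rate, and prototype-based colour matching. All weights, decay rates and the colour tolerance are invented for illustration; the paper does not specify DIVERSE's actual values.

```python
import math

# (priority weight, persistence per step); values are invented.
PARAMS = {
    "pointed_at": (4.0, 0.50),
    "in_view":    (3.0, 0.70),
    "mentioned":  (2.0, 0.95),   # user mentions persist longer
    "acted_on":   (1.0, 0.95),
}

class FocusModel:
    def __init__(self):
        self.score = {}          # (object id, parameter) -> contribution

    def observe(self, obj, param):
        self.score[(obj, param)] = PARAMS[param][0]

    def step(self):
        """Each contribution decays at its parameter's persistence rate."""
        for (obj, param) in list(self.score):
            self.score[(obj, param)] *= PARAMS[param][1]

    def most_salient(self):
        totals = {}
        for (obj, _), v in self.score.items():
            totals[obj] = totals.get(obj, 0.0) + v
        return max(totals, key=totals.get)

def matches_prototype(value, prototype, tolerance=120.0):
    """Prototype property matching: 'red' can hold of the brown cow if
    the RGB values are close enough (Euclidean distance in colour space)."""
    return math.dist(value, prototype) <= tolerance

focus = FocusModel()
focus.observe("cow-1", "in_view")
focus.observe("cow-1", "mentioned")
focus.observe("house-2", "in_view")
print(focus.most_salient())                              # cow-1
print(matches_prototype((139, 69, 19), (200, 30, 30)))   # brown ~ red: True
```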


[Figure 9: Intersection of colour properties.]

5 Enhancing DIVERSE

The DIVERSE system successfully demonstrates the utility of combining speech and direct manipulation interfaces to a virtual world. In this section, we describe how the system can be extended to provide a more powerful and efficient mechanism for interacting with a virtual world.

5.1 Improving Speech Recognition and Language Understanding

One of the weaknesses of the DIVERSE system is its speech recognizer: recognition accuracy is significantly inferior to state-of-the-art recognizers, and the recognizer needs to be trained for each user's pronunciation of commands. A related problem is that the capabilities of the syntax and semantic analysis components are not aligned with the capability of the recognizer. The recognizer can only accurately recognize a relatively small set of utterances. However, the syntax component, which has been trained on a very large corpus of written language, can robustly analyze any textual input, although it tends to produce incorrect analyses where the conventions of written and spoken language diverge. For example, spoken language tends to be highly elliptical but written text does not; utterances like red paint tend to be mis-analyzed so that paint is treated as a verb, an analysis appropriate to written but not spoken language. The semantic component of the system is the real bottleneck, though: it can only handle a limited range of syntactic structures. The effect of the misalignment between the capabilities of these components is that (a) incorrectly recognized utterances are given a syntactic (and sometimes semantic) analysis, and (b) correctly recognized utterances fail to be assigned a semantic analysis.

This problem has been addressed by using a state-of-the-art speech recognizer, namely the Nuance system (formerly Corona), which is speaker-independent and has very high accuracy for small to medium sized vocabularies.


This recognizer also has the capability to constrain language understanding within a pre-defined grammar (similar to the finite-state networks mentioned in Section 3.1 above), thereby aligning speech recognition and language understanding capabilities[5]. Each grammar specifies a restricted language for user utterances: utterances outside the grammar are either completely rejected or, if they partially match part of the grammar, are given a `best-fit' analysis. For example, paint it red can be specified as part of a painter grammar, and the speech recognizer will return not a string but an analysis identical to that in (2) above. By aligning the speech and language components in this way, we have eliminated the syntax and semantic components of the DIVERSE system. Not only is this more efficient, but it reduces software maintenance: changing applications only requires that the grammar is modified.

[5] Of course, this does not solve the problem of informing the user how to express commands they can carry out via direct manipulation. We are looking at cross-modal paraphrasing, where each direct manipulation command is paraphrased in language, as a technique to train users about the capability of the speech interface.
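The following sketch suggests how a recognition grammar can return an interpretation directly rather than a string, in the spirit of the painter grammar just mentioned. The patterns and frame fields are illustrative assumptions, not the Nuance grammar formalism.

```python
import re

# Each rule pairs an utterance pattern with a frame builder.
GRAMMAR = [
    (re.compile(r"^paint (it|the (?P<obj>\w+)) (?P<col>red|brown|blue)$"),
     lambda m: {"action": "paint",
                "object": m.group("obj") or "it",
                "value": m.group("col")}),
    (re.compile(r"^bring me the (?P<obj>\w+)$"),
     lambda m: {"action": "bring", "object": m.group("obj")}),
]

def interpret(utterance: str):
    """Return the analysis for in-grammar input, None otherwise.

    Out-of-grammar input is rejected outright; a real recognizer could
    also return a best-fit partial match.
    """
    for pattern, build in GRAMMAR:
        m = pattern.match(utterance)
        if m:
            return build(m)
    return None

print(interpret("paint it red"))
# {'action': 'paint', 'object': 'it', 'value': 'red'}
print(interpret("pain tit red"))   # None: outside the restricted language
```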


5.2 Improving Reference Resolution

Another weakness of the DIVERSE system is its approach to reference resolution. Most language understanding and spoken dialogue systems resolve the reference of linguistic expressions within a discourse model. In the DIVERSE system, there is no distinction between the discourse and world levels, so referential constraints need to be applied to all objects in the virtual world. This is not efficient, due to the increased size of the search space; for example, if the user issues the command bring me the tank, then the reference of the tank is resolved with respect to all objects in the world rather than those in a discourse model. Without a discourse model, the DIVERSE system also lacks the ability to resolve references to previous actions. For example, after issuing the command paint it red, the user is not able to ask for this action to be repeated (do it again), applied to another object (do that to the tank), or even undone (undo the last action!), since there is no representation of actions independent of their effects on the virtual world.

The solution is to provide a discourse model which distinguishes between a participant's understanding of the world and the world itself, as discussed in Section 3.2. The discourse model is updated on the basis of user input from both modalities: i.e. utterances and direct manipulation actions are taken as updates of the model and refer to objects within it. Changes in the discourse model arising from multimodal updates are then passed to the virtual world manager for execution. This approach is illustrated in Figure 10.

[Figure 10: Discourse interpretation in a multimodal system.]

This allows reference resolution to take place with respect to objects in the discourse model. Since this model is smaller than the world itself, search and update operations are more efficient. Also, since actions from either modality are reified as objects in the discourse model, users can refer to previous actions. For example, a direct manipulation action will be added to the discourse model as an action object which can be referred to using linguistic expressions such as the last action. Finally, if the user's linguistic description does not refer to an existing object in the discourse model, then a new object of the appropriate type is created, since its existence is being presupposed by the user.
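As a sketch of how reifying actions in a discourse model supports do it again and undo the last action, consider the following; the world manager and all class and method names are invented for illustration.

```python
class ToyWorld:
    """Stands in for the virtual world manager."""
    def __init__(self):
        self.colour = {}

    def apply(self, action, obj, value):
        if action == "paint":
            old = self.colour.get(obj)
            self.colour[obj] = value
            return lambda: self.colour.__setitem__(obj, old)  # undo thunk
        raise ValueError(action)

class DiscourseModel:
    def __init__(self, world):
        self.world = world
        self.actions = []        # reified action objects, most recent last

    def execute(self, action, obj, value):
        """Run an action in the world and record it in the discourse."""
        undo = self.world.apply(action, obj, value)
        self.actions.append((action, obj, value, undo))

    def do_it_again(self, obj=None):
        """'do it again' / 'do that to the tank'."""
        action, last_obj, value, _ = self.actions[-1]
        self.execute(action, obj or last_obj, value)

    def undo_last(self):
        """'undo the last action!'"""
        *_, undo = self.actions.pop()
        undo()

world = ToyWorld()
dm = DiscourseModel(world)
dm.execute("paint", "house", "red")
dm.do_it_again(obj="tank")      # apply the last action to the tank
dm.undo_last()                  # tank reverts to its previous colour
print(world.colour)             # {'house': 'red', 'tank': None}
```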


5.3 Improving Robustness

As part of its agent-based interaction metaphor, the DIVERSE system adopts the principle that `errors don't matter', since users are given feedback about the effects of their utterance and can correct any problem themselves. This principle, however, is problematic. Users may have difficulty in identifying the source of the problem (it may arise from a system error in recognition or reference resolution), or they may have difficulty in reversing the effects of an incorrect action. This may make the task of repairing errors inconvenient for the user. Moreover, in some domains errors do matter and are more than simply inconvenient. In a battlefield simulation for training military personnel, the usefulness of the simulation lies in its faithful reproduction of certain aspects of the battlefield without, of course, its real dangers. If military personnel are being taught specific protocols for command and control of personnel and hardware, then this aspect of the environment should be accurately simulated, otherwise they will not learn that some mistakes can be fatal. In the DIVERSE system, this situation will only arise when users detect the error themselves: the system does not attempt to identify or repair errors for them. Consequently, in such application domains the user is unable to trust the interaction agent: authority cannot be delegated to the agent by spoken commands. While the system will always carry out some action, it may not be the action authorized by the user.

The solution to this lies in augmenting the system with capabilities similar to those of a spoken dialogue system; namely, proactive and reactive strategies for managing the interaction so as to ensure successful communication. These strategies can be configured for particular applications: not all applications will require that commands are confirmed by the user, or that clarification of incomplete commands is sought from the user. In applications which allow clarification, the system can allow partial matching of semantic interpretations with action types. For example, if the system has a paint action which requires that the object to be painted and the colour are specified, but the user's utterance omits specification of the object and colour (through incompleteness on the user's part or partial speech recognition), then the system can clarify the command with the user as shown below:

(3) User:   paint ...
    System: what do you want to paint?
    User:   the house
    System: what color do you want to paint the house?
    User:   red
    System: painting house red ...

This type of dialogue management is also very useful in applications where commands must correspond to a precise protocol: the system has the responsibility to check that the command is valid or, if it is not valid but the system is in training mode, to clarify the command with the user. Alternatively, the system can reject the invalid command. In either case, the user is able to trust the agent to only execute commands which conform to the protocol.
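A minimal sketch of the partial-matching idea: each action type declares its required parameters, and the system asks a clarification question for each one missing from the interpretation, as in dialogue (3). The action signatures and question texts are illustrative assumptions.

```python
ACTION_TYPES = {
    "paint": ["object", "colour"],   # required parameters
    "bring": ["object"],
}
QUESTIONS = {
    "object": "what do you want to {action}?",
    "colour": "what color do you want to paint the {object}?",
}

def clarify(interpretation: dict, ask) -> dict:
    """Ask for each required parameter missing from the interpretation.

    `ask` stands in for speech synthesis plus recognition of the reply.
    """
    for param in ACTION_TYPES[interpretation["action"]]:
        while param not in interpretation:
            answer = ask(QUESTIONS[param].format(**interpretation))
            if answer:                 # ignore empty/unrecognized input
                interpretation[param] = answer
    return interpretation

# clarify({"action": "paint"}, ask=input)
#   System: what do you want to paint?                  User: house
#   System: what color do you want to paint the house?  User: red
```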


painting objects and, in a virtual battle�eld application, agents for tanks, missiles,radars, and so on. The user can not only initiate interaction with these agentsby spoken commands but the agents can be instructed to initiate a dialogue withthe user. For example, the user can instruct a radar agent to immediately reportany incoming enemy aircraft.An illustrative hierarchy of talking agent types is shown in Figure 11. Each*basic Dialogue

*secure protocol

*user authorization

Figure 11: Talking Agent Hierarchyagent has a set of goals, a set of dialogue strategies and a grammar. The goalsspecify the range of actions the agent can carry out. The grammar speci�es therange of utterances understood by the agent and their interpretation with respectto its actions. The dialogue strategies describe how the agent interacts withuser, including strategies for dealing with mismatches between user commandsand agent actions. For example, a driver agent can perform transportationactions and understand commands like take me to headquarters, a gofer agentcan retrieve objects and understand commands like bring me the battle plans.The agents are organized in a hierarchy so that properties of higher agents canbe inherited by agents lower in the hierarchy. For example, agents inherit basicdialogue capabilities de�ned for talking agent, while instances of the militaryagent inherit the property secure protocol. This property requires that usercommands must conform exactly to the actions they are able to execute. This mayinvolve clari�cation of incomplete commands depending on whether the system isin training mode, or not. Once the user addresses an agent, the actions can beexecuted by the appropriate spoken commands, as illustrated below:(4) Launcher: launcher 476 here.User: target red tank.Launcher: specify coordinates of red tank.User: the coordinates are 437 342.Launcher: targeted red tank at coordinates 437 342. Launch mis-sile? 22
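A minimal sketch of such a hierarchy using class inheritance; the agent types follow Figure 11 and the examples above, but the properties and grammars are simplified illustrations.

```python
class TalkingAgent:
    dialogue = "basic"                  # *basic dialogue, inherited by all
    grammar: set = set()

    def understands(self, command: str) -> bool:
        return command in self.grammar

class MilitaryAgent(TalkingAgent):
    protocol = "secure"                 # *secure protocol

class HardwareAgent(MilitaryAgent):
    authorization = "user"              # *user authorization

class DriverAgent(MilitaryAgent):
    grammar = {"take me to headquarters"}

class LauncherAgent(HardwareAgent):
    grammar = {"target red tank", "confirm missile launch"}

launcher = LauncherAgent()
print(launcher.dialogue, launcher.protocol, launcher.authorization)
# -> basic secure user  (properties inherited down the hierarchy)
print(launcher.understands("target red tank"))   # True
```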


User: con�rm missile launch.Launcher: missile launched. Over and out?User: over and out.where the launcher agent clari�es the target command by requesting coordinatesof the target. The agent is capable of �ring the missile but since it is a typeof hardware agent, user authorization is requested before executing the action.The user's authorization is then checked to ensure that it conforms to the protocol| ok or yes is not su�cient | and the missile is launched. The user then termin-ates the dialogue by uttering over and out, the standard utterance for terminatingdialogues with a military agent.Before a dialogue like (4) can be initiated, the user needs to get the atten-tion of the appropriate agent6. To do so, the user must be able to distinguishbetween talking agents and non-talking agents in the virtual world. Ratherthan just have the user select objects and wait until one speaks its identi�ca-tion code, talking agents can be identi�ed by an icon indicating they are speech-enabled, similar to the identi�cation of speech-enabled WWW pages on the in-ternet (Novick and House 1994)7. The user then needs to select a speci�c talkingagent to interact with. One possibility is to use speech | launcher (476) | al-though this is may be problematic for a number of reasons; for example, there maybe more than one agent of the requested type and something like the DIVERSEagent will be required to establish which agent is being requested. A better, andsimpler, solution is to exploit the multimodality of the interface: it is more e�-cient for the user to select the talking agent by clicking on it. Once an agent isselected, it indicates to the user that it awaiting commands by (a) highlightingits speech icon8, and (b) speaking its identi�cation code. The dialogue continuesuntil the user terminates the dialogue.6 ConclusionCombining a speech interface with a direct manipulation interface to virtualworlds can provide real bene�ts such as natural (and faster) mode of interac-tion, allowing commands to be issued without hand control,and allowing users toapply commands to objects which are not visible. These bene�ts were demon-strated in the DIVERSE system. Further bene�ts arise when the speech interfaceis improved and augmented with dialogue capability. Users are able to commu-nicate with, and so control, talking agents in the virtual world, a facility which6Some of the ideas here bene�ted from discussions with Tomas Axling.7This strategy will only be successful when objects are close enough for the user to see ifthey have the icon.8More radical methods of indicating that the agent is attentive could also be used; for ex-ample, the object could `grow' a cartoon-like talking head.23
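To make the authorization step of dialogue (4) concrete, the sketch below checks the user's reply against the exact form the protocol demands; the confirmation utterance and response messages are illustrative assumptions rather than the system's actual protocol definition.

    CONFIRMATION = "confirm missile launch"   # exact form the protocol demands

    def check_authorization(utterance, training_mode=False):
        """Secure protocol check: only the exact confirmation utterance
        authorizes the launch; casual assent such as 'ok' or 'yes' fails.
        In training mode, an invalid command is clarified rather than
        simply rejected."""
        if utterance.strip().lower() == CONFIRMATION:
            return "missile launched."
        if training_mode:
            return "please authorize with 'confirm missile launch'."
        return "command rejected."

    for reply in ("ok", "yes", "confirm missile launch"):
        print("User:", reply)
        print("Launcher:", check_authorization(reply, training_mode=True))

Only the final reply authorizes the launch; in training mode the agent explains the required form instead of silently rejecting the command.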

6 Conclusion

Combining a speech interface with a direct manipulation interface to virtual worlds can provide real benefits: a natural (and faster) mode of interaction, commands issued without hand control, and commands applied to objects which are not visible. These benefits were demonstrated in the DIVERSE system. Further benefits arise when the speech interface is improved and augmented with dialogue capability. Users are able to communicate with, and so control, talking agents in the virtual world, a facility which would be slower and more awkward in a direct manipulation interface. This facility can be developed with existing technology, so today's soldiers can train for tomorrow's technology.

References

Basson, S., A. Kalyananswamy, E. Man, S. Springer, and D. Yashchin (1995). Establishing speech technology requirements: the money talks field trial. Technical report, NYNEX Science and Technology, White Plains, N.Y.

Bos, J., E. Mastenbroek, S. McGlashan, S. Millies, and M. Pinkal (1994). A compositional DRS-based formalism for NLP applications. In Proceedings of the International Workshop on Computational Semantics, Utrecht.

Bretan, I., N. Frost, and J. Karlgren (1995a). Real speech in virtual worlds. Technical report, SICS/Telia Research, Stockholm.

Bretan, I., N. Frost, and J. Karlgren (1995b). Using surface syntax in interactive interfaces. In Proceedings of the 10th Nordic Conference of Computational Linguistics.

Carbonell, N. and C. Mignot (1993). `Natural' multimodal HCI: experimental results on the use of spontaneous speech and hand gestures. In Proceedings of the ERCIM Workshop on Multimodal Human-Computer Interaction.

Carlsson, C. and O. Hagsand (1993). DIVE: a platform for multi-user virtual environments. Computers & Graphics, 17(6).

Corbett, G., N. M. Fraser, and S. McGlashan (1993). Heads in Grammatical Theory. Cambridge: Cambridge University Press.

Eckert, W. and S. McGlashan (1993). Managing spoken dialogues for information services. In Proceedings of the 3rd European Conference on Speech Communication and Technology, pages 1653-1656.

Frankish, C. R. (1989). Conversations with computers: problems of feedback and error correction. In Eurospeech '89, Paris.

Garrod, S. and A. Anderson (1987). Saying what you mean in dialogue: a study in conceptual and semantic co-ordination. Cognition, 27: 181-218.

Giachin, E. (1995). Phrase bigrams for continuous speech recognition. In Proceedings of ICASSP'95.


Giachin, E. and S. McGlashan (1996). Spoken language dialogue systems. In K. Church, S. Young, and G. Bloothooft, editors, Corpus-Based Methods in Language and Speech Processing. Kluwer.

Grosz, B. J. (1981). Focusing and description in natural language dialogues. In A. K. Joshi, B. L. Webber, and I. A. Sag, editors, Elements of Discourse Understanding. Cambridge: Cambridge University Press.

Heisterkamp, P., S. McGlashan, and N. J. Youd (1992). Dialogue semantics for spoken dialogue systems. In Proceedings of the International Conference on Spoken Language Processing, Banff, Canada.

Jelinek, F. (1989). Self-organizing language modeling for speech recognition. In Readings in Speech Recognition. Morgan Kaufmann.

Karlgren, J., I. Bretan, N. Frost, and L. Jonsson (1995). Interaction models, reference, and interactivity in speech interfaces to virtual environments. In Proceedings of Eurographics '95.

Karlsson, F., A. Voutilainen, J. Heikkilä, and A. Anttila, editors (1995). Constraint Grammar. Mouton de Gruyter.

Kay, A. (1990). User interface: a personal view. In B. Laurel, editor, The Art of Human-Computer Interface Design. Reading, Mass.: Addison-Wesley.

Luperfoy, S. (1992). The representation of multimodal user interface dialogues using discourse pegs. In Proceedings of ACL'92, Newark, Delaware, USA, pages 22-31.

McGlashan, S., N. M. Fraser, N. Gilbert, E. Bilange, P. Heisterkamp, and N. J. Youd (1992). Dialogue management for telephone information services. In Proceedings of the International Conference on Applied Language Processing, Trento, Italy.

Meisel, W. S. (1995). ARPA workshop reports on speech-recognition state of the art. In Speech Recognition Update, 20. ISSN 1070-857X.

Negroponte, N. and N. Gershenfeld (1995). Wearable computing. Wired, 3.12: 256.

Novick, D. G. and D. House (1994). Spoken-language access to multimedia (SLAM): a multimodal interface to the World-Wide Web. Technical report, Center for Spoken Language Understanding, Oregon Graduate Institute.

Rabiner, L. R. and B.-H. Juang (1993). Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall.


Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104: 192-233.

Scheneiderman, B. (1988). Direct manipulation: a step beyond programming languages. IEEE Computer, 16(8): 57-69.

Woodland, P. C., J. J. Odell, V. Valtchev, and S. J. Young (1994). Large vocabulary continuous speech recognition using HTK. In Proceedings of ICASSP'94.

Zoltan-Ford, E. (1991). How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34: 527-547.
