

CIMMI - A Contextually Informed Multimodal Integration Method for a Humanoid Robot in an Assistive Technology Context

    Elizabeth Harte

    Bachelor of Science (Hons.), Computer Science (2005)

    University of Auckland

    Submitted for the Degree of

Master of Engineering Science (Research)

    Intelligent Robotics Research Centre

    Department of Electrical & Computer Systems Engineering

    Monash University, Clayton, VIC 3800, Australia

    April 2008


    Declaration

I declare that, to the best of my knowledge, the research described herein is original except where the work of others is indicated and acknowledged, and that the thesis has not, in whole or in part, been submitted for any other degree at this or any other university.

    Elizabeth Harte

Melbourne, Australia
April 2008



    Acknowledgments

My thanks go to Prof. Ray Jarvis for his enthusiasm and support over the past 2 years. He is an inspiring roboticist and researcher. I will be lucky to be even half as creative as him in my life.

I would also like to acknowledge Mitsubishi Heavy Industries Ltd for the unique opportunity to use the Wakamaru robot for this project.

I would like to thank all the members of the Intelligent Robotics Research Centre. I have met some great people and made some great friends. Thank you for welcoming this Kiwi into the circle. In particular, thanks go to Alan Zhang and Jay Chakravarty for their valuable discussions and friendship during the development of this project.

Finally, I thank Rob Davidson for his moral support during this project. His encouragement, support and editorial skills were greatly appreciated during the writing of this thesis. Kia ora.



    Abstract

Robots dealing directly with people need to be able to handle the uncertainties associated with human-centric environments, including human interaction. This field has seen much attention in the past decade, though work using multimodal systems has not been deeply explored. This work describes a Contextually Informed MultiModal Integrator, CIMMI, that fuses speech and symbolic gesture probabilistically and is informed by contextual knowledge in an assistive technology robotic application. Symbolic gestures are gestures that have semantic meaning, such as a wave gesture meaning 'hello'. Very few systems use symbolic gestures in a robotic domain, primarily because symbolic gestures are not as commonly used in day-to-day human conversation as deictic (pointing) gestures. Using symbolic gestures allows the system to resolve ambiguities associated with action words in a command hypothesis, whereas deictic gestures are often used to identify objects or locations. Exploring the use of symbolic gestures is one focus of this work.

Since no deictic gestures are implemented in this work, contextual knowledge is used to resolve ambiguities associated with object and location words in a command. Contextual knowledge is defined as both conversational and situational. Conversational contextual knowledge uses the dialogue history to select the best command from a generated list of possible commands or to resolve ambiguities in the selected command. In human conversation, if a person had been talking about a yellow cup, then the conversational context allows a person to simplify additional references to the same object by saying 'the cup' without having to specify the implied 'yellow' again. Using this idea as an influence, ambiguities where the object or location information is missing in a command could be resolved by referencing the conversational context. These simple ideas are explored and discussed further in this work.

The second kind of implemented contextual knowledge is situational. In this work, situational contextual knowledge consists of the last known locations of the user and the robot, and a list of objects that exist in the environment. Each object in the list has a colour, an object class type (such as cup) and a location. Knowing the location of an object allows the user to simplify their spoken requests, so they can simply ask 'bring me the blue cup' without stating the implied 'from the table'. The same concept applies to specifying the colour of an object. Using these ideas as an influence, this knowledge can also resolve misrecognitions where the colour or location information is missing from a command. The user and robot locations are only used to assist the robotic responses. This situational knowledge is explored further in this thesis. The exploitation of conversational and situational contextual knowledge, as introduced above, is the second focus of this work.

The accuracy of the speech recognition system alone (53%) was compared to using the speech with contextual knowledge, both conversational and situational. This increased the accuracy of the system to 72%. With the addition of gesture recognition as well, the system's accuracy rose by only 4% more, because only 3 cases required the gesture to clarify the ambiguity in the spoken command. CIMMI was implemented on a humanoid robot, Wakamaru, and the speech, gesture and object recognition systems were implemented on a separate laptop.



    Contents

1 Introduction
  1.1 Motivation
  1.2 Multimodal Human-Robot Interaction
    1.2.1 Modes of Input
    1.2.2 Contextual Knowledge
    1.2.3 Multimodal Integration
  1.3 Contributions
  1.4 Thesis Overview

2 Literature Review
  2.1 Service and Field Robotics
  2.2 Healthcare Robotics
  2.3 Social Robotics
  2.4 Multimodal Integration
    2.4.1 Modes of Input
    2.4.2 Contextual Knowledge
    2.4.3 Methods of Integration
  2.5 Multimodal Human-Robot Interaction

3 Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses
  3.1 Functional Scenarios
  3.2 Software Overview
  3.3 Speech Recognition
  3.4 Symbolic Gesture Recognition
    3.4.1 Segmentation
    3.4.2 Region Tracking
    3.4.3 Static Gesture Recognition
  3.5 Contextual Knowledge
    3.5.1 Conversational Context
    3.5.2 Situational Context
  3.6 Integrating Speech and Symbolic Gestures in an Informed Manner
    3.6.1 CIMMI Overview
    3.6.2 The Iterative Approach
    3.6.3 The Sequential Approach
  3.7 Hardware Overview
  3.8 Wakamaru's Responses

4 Experiments
  4.1 Speech Recognition
    4.1.1 Experiment 1: Selecting the Number of Recognition Hypotheses
    4.1.2 Experiment 2: Word Recognition Rate of Unparsed Speech Results
    4.1.3 Experiment 3: Word Recognition Rate of Parsed Speech Results
    4.1.4 Discussion


  4.2 Gesture Recognition
    4.2.1 Experiment 1: Gesture Detection and Recognition by Skin Segmentation Only
    4.2.2 Experiment 2: Gesture Detection and Recognition by Skin and Motion Segmentation
    4.2.3 Experiment 3: Gesture Detection and Recognition by Skin and Motion Segmentation in the New Laboratory Environment
    4.2.4 Discussion
  4.3 Multimodal Interaction Accuracy
    4.3.1 Experiment 1: Accuracy of the Individual Modes
    4.3.2 Experiment 2: Using the Iterative Approach
    4.3.3 Experiment 3: Using the Sequential Approach
  4.4 Object Recognition
    4.4.1 Experiment 1: Object Recognition using Colour Segmentation and ShapeMatcher in the Old Laboratory Environment
    4.4.2 Experiment 2: Object Recognition using Colour Segmentation and ShapeMatcher in the New Laboratory Environment
  4.5 Navigation and Obstacle Avoidance
    4.5.1 Experiment 1: Navigation and Obstacle Avoidance in the Old Laboratory
    4.5.2 Experiment 2: Navigation and Obstacle Avoidance in the New Laboratory
  4.6 Robotic Demonstration
    4.6.1 Natural Interactions
    4.6.2 Requesting Information
    4.6.3 Fetch-and-Carry Requests
    4.6.4 Task Performance

5 Discussions
  5.1 The Speech Recognition
  5.2 Integrating Speech with Contextual Knowledge
  5.3 Addition of Symbolic Gestures
  5.4 Comparing the Iterative and Sequential Multimodal Approaches
  5.5 The Robotic Demonstrations

6 Future Work and Conclusions
  6.1 Future Work
  6.2 Conclusions


    List of Figures

3.1 The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The desk is the foreground table, the kitchen is at the furthest table and the table location is to the right with the blue cup on it. The closest Wakamaru robot is the active one; the other is not used in this system.
3.2 Overview of the Transactional Intelligence, and the Contextual Knowledge which informs it.
3.3 A diagram showing the speech recognition process, from capturing the spoken utterance through to generating a list of parsed recognition hypotheses.
3.4 The non-skin (left) and skin (right) histograms.
3.5 A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter coloured regions are segmented. (b) In the second frame, the user has moved their arm a distance, so the previous motion segmentation and the currently detected motion are added as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion of the person in the bottom right, are detected as shown here.
3.6 A single frame from a wave gesture using skin and motion segmentation.
3.7 A diagram of the gesture capture and recognition process.
3.8 The red circle outlines the face region and the orange circle represents the hand region. The blue dots represent the centre of the bounding circles in the XY plane for each frame of the gesture sequence. The lighter the colour of the dot, the closer that part of the gesture, and vice versa.
3.9 The nine directions that should be differentiable by using the gesture tracking and the three features specified.
3.10 Captures from the SM program of matches between two gesture skeletons. Notice the individual segments of the skeleton representing the silhouette are numbered and differently coloured.
3.11 A diagram illustrating CIMMI's sequential algorithm.
3.12 Calculating the speech weight, wt_s, in the iterative approach.
3.13 Wakamaru with the Bumblebee camera on its head, the Hokuyo URG laser range finder at its middle and a laptop on its back.
3.14 Localization panoramic images on Wakamaru.
3.15 Screen captures from Wakatest, Wakamaru's debugging and testing tool.
3.16 A sample image of each of the recognized object classes.
3.17 A sample object capture which is segmented, made into a skeleton, then matched to the object database.
3.18 A diagram showing the SM matching process, as provided by Macrini.


3.19 Route following and obstacle avoidance on Wakamaru. (a) shows two sequential, but not consecutive, images of the robot's position approximately when the screen captures in (b) and (c) were taken. (b) shows two consecutive captures of the obstacle avoidance using the laser range finder. Obstacles are red and the path is green, with the yellow square as the start point and the blue square as the end point. (c) shows Wakamaru's on-board obstacle avoidance sensor map in green and blue with the path in pink. If the on-board sensors detected an obstacle, parts of the blue sensor map would be red. Notice the faint pink crosses which identify the location of the landmark reflectors.

4.1 A capture of the glass panels along one side of the new laboratory.
4.2 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.3 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.4 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The blue box shows the location of the face in the first frame of the sequence and the coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.


4.5 Graphs showing the gesture training database for the same dataset plotted against the three classification features. The gestures in (a) were only tracked by skin segmentation, and in (b) were tracked using both skin and motion segmentation. In (a), the gesture classes were quite separate from each other and training samples were clustered within each class. By comparison, in (b) the gesture classes were less separate from each other, with some overlap as well. The training samples were less clustered per class, especially the go gesture. This could be because the tracking of go gestures in the previous version was influenced by segmented background regions, as the gesture region could get stuck on these regions, modifying the path tracked.
4.6 A capture of the user not performing a gesture, using skin and motion segmentation in the new laboratory.
4.7 A capture of the user performing a come gesture, using skin and motion segmentation in the new laboratory.
4.8 A capture, and segmentations, of objects on the table in the old laboratory.
4.9 A capture, segmentation and skeleton of an object on the kitchen table in the new laboratory.
4.10 A bowl skeleton matching a cup skeleton due to the rotational invariance of the ShapeMatcher approach.
4.11 A sample capture where the yellow plate is partially out of frame, resulting in a misrecognition of the region.
4.12 A comparison of the camera disparity obstacle detection against the laser range finder obstacle detection.
4.13 A sequence of images of Wakamaru travelling from its charger base toward the kitchen table in the laboratory environment.
4.14 Waving to Wakamaru.
4.15 Wakamaru touching a recognized object.
4.16 An attempted object recognition capture. The robot's head is still moving downwards, resulting in a blurry image. This image contains part of the object and was, in fact, correctly recognized. Other captures would still only be looking at the wall, resulting in a failed recognition.
4.17 Another attempted object recognition capture. The segmented cup silhouette failed to be recognised as a cup because the shadow on the table was segmented as part of the cup, changing the silhouette. This meant the cup had a top width similar to that of its base.


    List of Tables

3.1 Sample Unparsed Speech Recognitions
3.2 Sample Risk Values
3.3 Sample Values from the Associative Map
3.4 Sample Parsed Speech Recognitions
3.5 Table showing some Sample Command Hypotheses from Speech Input. These utterances were spoken in the order listed. Because the Yellow Cup in Ex. 1 is not in the known object database, the robot requested its location, which the user provided in Ex. 2. The combination of these gives an unambiguous command hypothesis "Get Yellow Cup Desk". Ex. 3 illustrates that, using the conversational history, the identity of "the cup" can be resolved to "Yellow Cup" where the user is asking the robot to take the cup away again.
3.6 Table showing Sample Objects from the Object Database
4.1 The First Ten Recognition Hypotheses for a Spoken Utterance
4.2 Sample Unparsed Speech Recognitions
4.3 Word Recognition Rate (WRR) for the Unparsed Speech Results
4.4 Sample Parsed Speech Recognitions
4.5 Word Recognition Rate (WRR) for the Parsed Speech Results
4.6 Two Cases of Deletion Errors for Parsed Speech
4.7 Comparing the Gesture Detection and Recognition Results
4.8 Table showing some Sample Unimodal Hypotheses Selected by the Iterative Approach
4.9 Table showing some Sample Multimodal Hypotheses Selected by the Iterative Approach
4.10 Comparison of the Use of Speech and Contextual Knowledge against the Iterative version of CIMMI
4.11 Table showing some Sample Unimodal Hypotheses Selected by the Sequential Approach
4.12 Comparison of Speech Recognition Accuracy Results
4.13 Table showing some Sample Multimodal Hypotheses Selected by the Sequential Approach
4.14 Comparison of the Accuracies of using Speech with Contextual Knowledge to the two different versions of CIMMI


1 Introduction

    1.1 Motivation

Increasing levels of affordable computational resources and quality sensors have permitted the development of robots that can operate in a widening set of domains, both structured and unstructured environments, at times shared by people.

Sometimes a robot is designed because it can do a task more accurately or efficiently than a human, such as with industrial and surgical robots. Other times robots are designed because a human cannot perform the dangerous or inaccessible task, such as with search and rescue or space exploration. In the case of service robotics, robots are often designed to perform tasks which directly interact with, and assist, a person.

The need for assistive robotics is increasing, as demographic studies have shown there will be an increase in the number of elderly people in the future [1]. According to the Australian Bureau of Statistics, in 2004, 13% of the population was over 65. This figure is expected to increase to between 26% and 28% by 2051 [2]. While assistive technologies may not replace human carers in the future, they could provide viable alternatives. A robot carer could assist an elderly person in a domestic environment, allowing that person to be independent and remain in their home while still having some basic support and security.

Robot assistants dealing directly with people need to be social robots - ones that interact with people by following some expected behavioural norms [3]. They also need to be able to handle the uncertainties associated with human-centric environments, including interactions with people. This field has seen much attention in the past decade; however, work in multimodal systems for robotic platforms has not been deeply explored.


    1.2 Multimodal Human-Robot Interaction

Multimodal integration is where two or more modes of input are combined to create a richer, and possibly more accurate, interpretation than is possible with a single mode alone. Using the partial information from one mode to clarify another is called mutual disambiguation - a concept defined and proved by Sharon Oviatt [4].

    1.2.1 Modes of Input

As with human communication, speech is commonly used as a primary mode of input in multimodal robotic systems. The speech input is combined with other modes, such as gestures or emotion recognition, which complement its interpretation. In human-robot interaction (HRI), deictic gestures (pointing gestures) are most commonly used, as they can replace object or location information missing from the speech. For example, if the user said 'Bring me that cup', then the direction of the pointing gesture could resolve which cup to get.

Symbolic gestures, also called emblems, are gestures that hold their own semantic meaning and can be understood without context from additional speech, such as a 'thumbs up' gesture [5]. These gestures have been explored broadly in human-computer interaction (HCI) as pen-based actions or as GUI interface clicks. This work has been transferred to touch screen interfaces for HRI, but natural hand-performed gestures have not been researched deeply [6]. Symbolic gestures can substitute for, reinforce, or conflict with action words in the speech input.

    1.2.2 Contextual Knowledge

While humans communicate by multiple modes, they also use their knowledge to aid their interpretations. Contextual knowledge is instinctively used by people to resolve what may otherwise be ambiguous communications. Contextual knowledge is broad and varied, though some simple and common kinds are:

• what they know about the environment
• what they have been talking about
• the speaker and listener locations

These different kinds of context can be used by a robot to help it more accurately interpret, or respond to, what is communicated to it by a human user. The environmental knowledge can provide situational context about objects in the environment, so a spoken utterance like 'get the cup in the kitchen' could be stated without specifying which cup. The speaker could assume the listener had a shared knowledge of objects in the environment, such as the green cup in the kitchen.

Similarly, the conversation history can provide context about which specific cup a speaker is talking about by referring to a previously mentioned yellow cup. The conversational context can also be used to resolve a command by combining a user's clarification with the last recognized user command.

The speaker and listener locations, another part of the situational context, can be used to increase the likelihood of one object being requested over another, or to aid responses to a request.

In a robotic system, all this knowledge can be used to help the system interpret, and respond to, human communication more accurately. Whether the speaker provides an incomplete command or some information is misrecognized, contextual knowledge can help resolve the ambiguities in a useful manner.
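To make the idea concrete, the following Python sketch shows one way conversational and situational context might fill in missing parts of a command hypothesis. The data structures and names (object_db, dialogue_history, resolve_command) are hypothetical illustrations, not the implementation described later in this thesis.

    # Illustrative only: contextual gap-filling for a partially recognized command.
    def resolve_command(cmd, dialogue_history, object_db):
        # Conversational context: "the cup" with no colour is assumed to be the
        # object of that type mentioned most recently in the dialogue.
        if cmd.get("colour") is None:
            for past in reversed(dialogue_history):
                if past["type"] == cmd["type"]:
                    cmd["colour"] = past["colour"]
                    break
        # Situational context: a missing location is taken from the last known
        # location of a matching object in the known-object database.
        if cmd.get("location") is None:
            for obj in object_db:
                if obj["type"] == cmd["type"] and obj["colour"] == cmd.get("colour"):
                    cmd["location"] = obj["location"]
                    break
        return cmd

    object_db = [{"colour": "green", "type": "cup", "location": "kitchen"},
                 {"colour": "yellow", "type": "cup", "location": "desk"}]
    history = [{"action": "get", "colour": "yellow", "type": "cup", "location": "desk"}]

    # "Bring me the green cup" -> location resolved to "kitchen" from the object database.
    print(resolve_command({"action": "bring", "type": "cup", "colour": "green"}, history, object_db))
    # "Take the cup away" -> colour resolved to "yellow" from the dialogue history.
    print(resolve_command({"action": "take", "type": "cup", "colour": None}, history, object_db))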

    1.2.3 Multimodal Integration

To combine multiple modes of input, various methods have been proposed. A popular early work, QuickSet, combined two n-best lists of input, one of speech and one of gestures. An n-best list is a list of speech or gesture recognition hypotheses, any one of which could be the correct recognition. QuickSet would statistically combine the n-best lists using the probabilities of each hypothesis [7] [8]. This statistical approach became the driving force for the systems that followed [9] [10] [11] [12]. More recently, learning algorithms [13] [14] [15] and domain-independent approaches [16] [17] [18] have been developed for use with HCI systems.
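As a rough illustration of this kind of statistical combination, the sketch below ranks every pairing of speech and gesture hypotheses by the product of their recognizer probabilities. The hypotheses, probabilities and scoring rule are invented for the example and are not QuickSet's actual algorithm.

    # Illustrative n-best fusion: score each speech/gesture pairing by the
    # product of the individual recognition probabilities (assumed rule).
    speech_nbest = [("get the cup", 0.6), ("get the cap", 0.3), ("wet the cup", 0.1)]
    gesture_nbest = [("go", 0.7), ("wave", 0.3)]

    combined = sorted(
        ((s, g, ps * pg) for s, ps in speech_nbest for g, pg in gesture_nbest),
        key=lambda item: item[2],
        reverse=True,
    )
    for speech, gesture, score in combined[:3]:
        print(f"{speech!r} + {gesture!r}: {score:.2f}")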

In HRI, multimodal systems often process speech with deictic (pointing) gestures integrated at the semantic level [19] [20] [21]. The methods implemented vary from frame-based matching to constraint-based integration to learning algorithms. In these systems, object recognition was used to identify which object a pointing gesture was directed at, to resolve anaphoric ambiguities in the speech. The multimodal fusion in [21] only combined the gesture inputs if there was a need for disambiguation, because their gesture recognition accuracy was low, an idea supported by [13]. Current multimodal systems may use contextual knowledge, but very few assess their practical implementations [19] [20] [21].

    1.3 Contributions

A robot developed to interact with humans in human-centric environments requires both transactional and spatial intelligence. Transactional intelligence comprises the skills to interpret what was communicated and to interact with the human to resolve ambiguities if a difficulty occurs. In this work, the transactional intelligence consists of speech and gesture recognition as well as their probabilistic combination in CIMMI, a Contextually Informed MultiModal Integrator. CIMMI combines the speech and symbolic gestures semantically, where speech is the primary mode of input. The semantic relationships between the speech and gestures are manually defined in an associative map, which indicates whether a word positively, neutrally or negatively reinforces the semantic meaning of the corresponding gesture. The symbolic gestures are also used to substitute for a speech word semantically. In this work, a substitution occurs if an action word is misrecognized in the speech; the recognized gesture then provides the action word in the command hypothesis.
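The following Python sketch illustrates the idea of an associative map and of gesture substitution as described above; the entries, gesture names and return values are invented for illustration and do not reproduce the map actually used in this work.

    # Hypothetical associative map: +1 means the spoken action word reinforces the
    # gesture's meaning, 0 is neutral, -1 means the pair conflicts.
    ASSOCIATIVE_MAP = {
        ("come", "come_gesture"): +1,
        ("stop", "come_gesture"): -1,
        ("get",  "come_gesture"):  0,
        ("go",   "go_gesture"):   +1,
        ("come", "go_gesture"):   -1,
    }
    GESTURE_MEANING = {"come_gesture": "come", "go_gesture": "go"}

    def fuse_action(action_word, gesture):
        """Return (action, relation) after checking the speech/gesture pair."""
        if action_word is None and gesture is not None:
            # Substitution: the gesture supplies a missing/misrecognized action word.
            return GESTURE_MEANING[gesture], "substituted"
        relation = ASSOCIATIVE_MAP.get((action_word, gesture), 0)
        if relation < 0:
            return None, "conflict"  # must be confirmed with the user
        return action_word, "reinforced" if relation > 0 else "neutral"

    print(fuse_action("come", "come_gesture"))  # ('come', 'reinforced')
    print(fuse_action(None, "go_gesture"))      # ('go', 'substituted')
    print(fuse_action("come", "go_gesture"))    # (None, 'conflict')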

Spatial intelligence refers to being able to understand and navigate through a space, and to know the location and function of objects in that space and how to manipulate them. In this work, a simple navigation and obstacle avoidance system and an object recognition system were implemented to demonstrate the successes, and failures, of CIMMI.

The transactional and spatial intelligences can be informed by contextual knowledge, both conversational and situational. Conversational context is built by previous dialogue between the robot and the user. It can be used to resolve the identity of an object: if a yellow cup was requested earlier in the conversation, then when the speaker asks about 'the cup', it could be assumed that it is the same cup as the yellow one. This conversational information is used to clarify object identities and locations in this work.

Situational context includes spatial information about known objects in the environment, the robot and the user. The known object knowledge is used if a speaker requests 'the green cup' without stating the location of the cup. The robotic system can resolve this missing location by referencing the green cup in the known object knowledge and returning its last known location. This situational context is used to clarify objects' identities (i.e. colour) and locations in this work. The robot and user locations are not used to clarify ambiguities in the multimodal commands, but they do dictate how the robot responds to particular requests, i.e. whether the robot is already in the correct location.

The transactional intelligence developed in this thesis primarily involves fusing the speech and gesture modes in a probabilistic multimodal integration method to generate a command hypothesis that can be correctly responded to by the robot. The command hypothesis is selected by a weight-based system, where the weights are calculated using the transactional intelligence and contextual knowledge. A weight signifies CIMMI's confidence in a command hypothesis, where the weight is based on:


• the speech/gesture pair relationship according to the associative map
• whether any word in the speech input has an associated risk value
• whether the object or location mentioned was discussed in the conversation history
• whether the object mentioned exists in the object knowledge

These weights, combined with the individual probabilities of the speech and gesture recognition systems, support the selection of a multimodal command to be responded to by the robot, so long as it has no ambiguities.
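A minimal sketch of how such a weight might be computed from the criteria listed above is given below. The field names, bonus values and combination rule are assumptions made for the example, not the weights defined in chapter 3.

    # Illustrative weighting of one command hypothesis (constants are assumed).
    def hypothesis_weight(hyp, associative_map, risk_values, dialogue_history, object_db):
        w = 0.0
        # Speech/gesture pair relationship (+1/0/-1) from the associative map.
        w += 0.25 * associative_map.get((hyp.get("action"), hyp.get("gesture")), 0)
        # Penalize words with an associated risk value (frequently misrecognized).
        w -= sum(risk_values.get(word, 0.0) for word in hyp.get("words", []))
        # Reward objects or locations already discussed in the conversation history.
        if any(hyp.get("object") == past.get("object") or
               hyp.get("location") == past.get("location") for past in dialogue_history):
            w += 0.25
        # Reward objects that exist in the known-object database.
        if any(hyp.get("object") == obj["type"] for obj in object_db):
            w += 0.25
        # Combine with the recognizers' own probabilities.
        return w + hyp.get("speech_prob", 0.0) + hyp.get("gesture_prob", 0.0)

In such a scheme, the hypothesis with the highest resulting weight would be put forward as the multimodal command, provided it contains no ambiguities of the kinds defined below.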

In this work, ambiguities are defined as:

• missing or uncertain information, including references to non-existent objects
• a conflict between the selected speech/gesture pair
• low individual probabilities of the speech input

Missing information can be resolved by using the symbolic gestures, by using the contextual knowledge, or by requesting more information from the user. A conflict between the speech/gesture pair, or low probabilities, needs to be resolved by requesting confirmation of the spoken input from the user.
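As a sketch of how these kinds of ambiguity could be tested for, the hypothetical function below returns the first problem it finds, or None if the command can be executed. The required fields and the confidence threshold are illustrative assumptions, not values taken from this thesis.

    # Illustrative ambiguity check for a selected command hypothesis.
    def find_ambiguity(hyp, object_db, min_speech_prob=0.4):
        missing = [f for f in ("action", "object", "location") if not hyp.get(f)]
        if missing:
            return "missing information: " + ", ".join(missing)
        if not any(obj["type"] == hyp["object"] for obj in object_db):
            return "reference to an object not in the known-object database"
        if hyp.get("gesture_conflict"):
            return "speech/gesture conflict: confirm with the user"
        if hyp.get("speech_prob", 0.0) < min_speech_prob:
            return "low speech confidence: confirm with the user"
        return None  # unambiguous; the robot can respond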

    1.4 Thesis Overview

In this chapter, some of the concepts associated with multimodal integration, including the modes and methods, were discussed and the motivation for this work explained. In the next chapter, a review of the previous work done in these fields is presented in more technical detail. This review is not a complete survey of the fields involved, but it does discuss the key ideas and issues involved in multimodal human-robot interaction.

With this survey as a base, chapter 3 discusses and justifies the concepts used in this thesis, with particular emphasis on the multimodal integration methodology. The speech and gesture recognition systems, which provide the inputs to the multimodal integration algorithm, are described, as is the demonstration system, as implemented on Wakamaru - a humanoid robot.

Having described the work completed, the experimental testing and results are presented in chapter 4. The speech and gesture recognition systems are assessed as unimodal systems and then compared to the multimodal results of CIMMI. CIMMI is then tested in some robotic demonstrations, where object recognition and navigation are assessed individually first. The successes, and failures, of these experiments are then discussed in chapter 5.

In the final chapter, the future work for this project is outlined, and the conclusions of the work are presented.


2 Literature Review

    2.1 Service and Field Robotics

Since Al-Jazari's mechanical boat of musicians in the 13th century [22], people have created machines to perform both simple and complex tasks [23] [24] [25]. In 1954, George Devol and Joseph Engelberger invented Unimate, the first programmable and teachable robotic arm, which became the basis for the now flourishing industrial robotics industry [26]. While industrial robotics is one major area of modern robotics, the other is service and field robotics. Industrial robots are used because they perform their tasks better than a human in terms of speed and precision, and relieve humans of the tedium of repetition. In comparison, service and field robots are used because a human needs, or prefers, a robot to perform a dangerous, inaccessible or assistive task.

Field robotics can be divided into three key areas: search and rescue robotics, space exploration, and military, police or security robotics. Search and rescue robots are designed to access places where it may be too dangerous for a human to go, such as a bush fire or a collapsed building [27] [28]. Space exploration robots, like the Mars rovers, explore where a human is, so far, unable to go [29] [30]. More recently, space robotic aids, like the Robonaut [31] [32], have been developed to assist astronauts in performing tasks, though at this stage these robots are teleoperated. The functions of military and police robots include bomb detection and defusing, security and surveillance, and transportation [33] [34] [35].

Service robots are designed to interact with humans directly and to perform tasks to assist a human, often in everyday chores of a tedious nature or where assistance is needed by a frail or disabled person. Service robotics can be broken into several key domains:

Entertainment: Entertainment robots, such as Sony's Aibo [36] and WowWee's RoboSapien, perform interesting acts by teleoperation and are quite popular purchases for consumers despite their lack of truly social interaction.

Public guides: Public guides, such as those in a museum or airport, require a much higher degree of interaction to accomplish their more complex tasks, but the interaction can be by touch screen [37] [38] [39] rather than by speech [40].

Health care robotics: Health care robots have been designed to assist the elderly and infirm by guidance and scheduling reminders. This is discussed further in the next section.

Personal assistants: Personal robots have been developed to serve people in human-centric environments, such as offices and homes. This is discussed in more detail in Section 2.3.

    2.2 Healthcare Robotics

The design and development of healthcare robots to assist the elderly and infirm has been broadly researched in recent years, as statistics indicate there will be a greater proportion of elderly people in the future [1] [41]. According to the Australian Bureau of Statistics, in 2004, 13% of the population was over 65. This figure is expected to increase to between 26% and 28% by 2051 [2]. Furthermore, in 2004, 1.5% of the population was over 85 years old, and this is also expected to rise to between 6% and 8% by 2051 [2]. While assistive technologies are unlikely to replace human carers in the future, they could provide viable alternatives. A robot carer could assist an elderly person in a domestic environment, allowing that person to be independent and remain in their home while still having some basic support and security. Research into assistive technology ranges from monitoring systems [42] [43] [44] to scheduling systems [45] [46] [47] to guidance systems [47] [48] [49] and robotic assistants [50] [51] [45] [47].

Healthcare robots include Pearl, which was developed by the NurseBot Project [45] [47] to locate a person, remind them of an appointment and guide them to their destination. The platform was a mobile unit with unmoving walker arms for support and a touch screen. The touch screen and speech synthesis provided the human-robot interaction, which was adequate for the testing of this system in a nursing home.

The Care-O-bot system was developed by Hans, Graf and Schraft from Fraunhofer IPA, Stuttgart, and could perform fetch-and-carry and various other supporting tasks in a domestic environment [50] [51]. It had a gripper arm, allowing it to lift and hold objects, but the robot itself could also be used as a support and walker. The system included a simple scheduler, could make emergency phone calls and, with its touch screen monitor, could provide videophone, TV and stereo interfaces. The human-robot interaction was performed multimodally, but not cooperatively, as the system could either process a speech input or a touch screen input, but did not integrate both.

    2.3 Social Robotics

Personal assistants ten years ago were merely fetch-and-carry systems, where the focus of the systems was the planning, mapping, localization and obstacle avoidance [52] [53]. More recently, there has been an emphasis on making robots more social so that they can perform tasks and provide information more accurately [21] [20] [54] [55]. Bartneck and Forlizzi [3] defined a social robot as:

    ... an autonomous or semi-autonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact.

Breazeal [56] comments that humans have to develop socially and emotionally to survive in our society, so it should not be surprising that we must give robots, or have them learn, these same skills to the same end. This point is reiterated by Forlizzi [57], who introduced Roomba robots into domestic environments and performed an ethnographic study of the human-robot interactions. Roomba, compared to another robot vacuum cleaner, Flair, inspired social interaction, in some cases being nicknamed and spoken to as an entity, even though it could not socially interact itself. While the author does not give a reason for the different treatment, it is assumed that it is because of Roomba's autonomy, competency and cute appearance compared to Flair.

The form of the robot is another interesting area of research, particularly with regard to human-robot interaction (HRI). Many robotic assistants use graphical displays of a face or a user interface on top of a robotic platform, such as RHINO [37], BIRON [58], Care-O-bot [51], Nursebot [47], Valerie the Roboceptionist [54], and Grace and George [59]. These displays were often chosen for their expressiveness and reliability compared to the mechanical alternative. However, it is acknowledged that the lack of a physical embodiment does make HRI more difficult, especially as the human often has trouble knowing where the robot is looking [54].

Most socially interactive robots are anthropomorphic in some regard, with the well-known exceptions being robotic pets, such as PARO - a baby harp seal robot [60]. Some of these robots may have only an expressive face as the human-like part, such as Kismet, an expressive anthropomorphic robot head [61], and Sparky, a small mobile robot with an expressive face [62]. Most, however, are humanoid, with a mechanical embodiment of a face on either a bipedal [63] or wheeled robotic platform [19] [64] [40] [65] [66] [21]. The humanoid form is argued by researchers to be ideal for social HRI because, for robots to interact with humans like a human (using gaze, gestures, etc.), they need to be physically capable of performing such acts [55] [67] [66]. Also, with a humanoid form, a human's expectations of the robot's functions will be natural human interaction, an ideal scenario for HRI [64] [40] [65].

Repliee-Q1 and Repliee-Q2 are an extension of the humanoid-form robots, as these robots are cast from human subjects and have realistic skin and facial features [68]. These robots are androids (or gynoids if female) and have numerous actuators from the waist up controlling all their motion. Because of the realism of these systems, people's interactive expectations are very high.

    2.4 Multimodal Integration

Multimodal integration, or fusion, is the challenge of taking many different modes of input, such as speech, lip tracking, gestures, posture and head pose, and merging all this information, either at the feature level or the semantic level, to create a richer interpretation than any single mode may have provided [4]. It has been proved that this fusion of inputs can resolve errors that may occur in a single input by using partial information from another input - a concept called mutual disambiguation, as defined and proved by Sharon Oviatt in [4].

    2.4.1 Modes of Input

Speech is often the primary mode of communication for people, so it is often used as the primary mode in both human-computer and human-robot interactions [69] [70] [19] [20] [21] [71] [72]. Despite being the primary mode of communication, it is rarely the only one. According to McNeill, 90% of gestures occur in conjunction with some speech [73]. People commonly use different types of gestures in conjunction with speech to communicate their intentions. Gestures, either pen-based or naturally performed by hand, are often classified in four key ways [73] [5]:

Symbolic: Also called emblems, these are gestures that have a semantic meaning, such as a wave to mean hello. Sign language is a vocabulary of symbolic gestures.

Deictic: Pointing gestures referring to people or objects in a space.

Iconic: Iconic gestures resemble what they convey, such as showing the size, shape or motion of an object.

Metaphoric: These gestures involve the abstract manipulation of an object or tool.

Gullberg found that over 50% of observed spontaneous gestures were deictic gestures [69]. These gestures can replace missing object or location information by pointing to an object or location referred to in the spoken speech [70] [19] [20] [21]. For example, if a user said 'put the lamp here', then the pointing gesture would resolve the location of 'here'. The first system to integrate speech and deictic gestures was described in Bolt's seminal paper, 'Put-That-There' [74].
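Although deictic gestures are not implemented in this thesis, the way a pointing direction can stand in for a missing referent in such systems can be sketched briefly. The geometry, object positions and function name below are invented for the example and do not describe any of the cited implementations.

    # Illustrative deictic resolution: pick the known object closest to the pointing ray.
    import math

    def resolve_pointing(origin, direction, objects):
        """Return the object whose position lies closest to the pointing ray."""
        ox, oy = origin
        norm = math.hypot(*direction)
        dx, dy = direction[0] / norm, direction[1] / norm
        best, best_dist = None, float("inf")
        for name, (px, py) in objects.items():
            # Project the object onto the ray; ignore objects behind the speaker.
            t = (px - ox) * dx + (py - oy) * dy
            if t < 0:
                continue
            dist = math.hypot(px - (ox + t * dx), py - (oy + t * dy))
            if dist < best_dist:
                best, best_dist = name, dist
        return best

    objects = {"green cup": (2.0, 1.0), "yellow cup": (-1.5, 2.0)}
    # "Put the lamp here" with the user at the origin pointing roughly toward (1, 0.5).
    print(resolve_pointing((0.0, 0.0), (1.0, 0.5), objects))  # -> 'green cup'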

Alternatively, symbolic gestures have a direct meaning and can be understood without context from additional speech [5]. Because of this independence from speech, they can reinforce, substitute for or conflict with speech [71] [72]. As pen-based gestures, symbolic gestures can be GUI interface clicks or predefined structures, such as arrows, which have a specific meaning in the application [7] [8] [75] [76]. A symbolic hand gesture's form has to adhere to an expected structure to give the gesture its specific meaning; for example, a stop gesture should be indicated by an open palm facing away from the gesturer. In human-robot interaction, symbolic gestures have been broadly explored on touch screen interfaces [45] [47] [50] [51], but not so much as natural hand gestures [6].

More recently, parallel modes of input that complement each other, such as speech and lip tracking, have been fused, where one mode directly reinforces the other [15] [14].

    2.4.2 Contextual Knowledge

Contextual knowledge is instinctively used by people to communicate with each other. However, it is only sometimes used in unimodal or multimodal systems to select a command to be performed. It can have many forms and uses, as context is a broad area of information. Some of the key contexts used in HRI systems are conversational and situational.

Conversational contextual knowledge is where the previously recognised commands contribute to selecting the current command [40] [77] [20] [21] [78] [79] [80] [81]. Storing this dialogue history is also a method of resolving ambiguous commands, by combining information provided by the speaker to form an unambiguous command [40] [20] [21] [77].

An extension of conversational contextual knowledge is the ability to limit a speech recognition vocabulary depending on an expected answer [40] [20]. When a vocabulary consists of 10,000 words or more, the likelihood of misrecognitions increases. To decrease the number of misrecognitions, sub-vocabularies can be used to lessen the range of words recognized by the system. Alternatively, sub-grammars can be used to improve the speech recognition performance, where a grammar defines the form the recognized spoken utterance can take [77].
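A simple sketch of this idea, with dialogue states and sub-vocabularies invented purely for the example, might look as follows.

    # Illustrative sub-vocabulary selection based on the expected answer type.
    SUB_VOCABULARIES = {
        "expect_location": ["kitchen", "desk", "table", "charger"],
        "expect_object":   ["cup", "bowl", "plate", "bottle"],
        "expect_yes_no":   ["yes", "no"],
    }

    def active_vocabulary(dialogue_state, full_vocabulary):
        # Restrict the recognizer when a particular kind of answer is expected,
        # otherwise fall back to the full vocabulary.
        return SUB_VOCABULARIES.get(dialogue_state, full_vocabulary)

    full = ["get", "bring", "take", "cup", "bowl", "kitchen", "desk", "yes", "no"]
    # After the robot asks "Where is the yellow cup?", only location words are expected.
    print(active_vocabulary("expect_location", full))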

Contextual knowledge can also be situational, where the locations of the agents involved contribute to selecting the current command [70] [77] [80]. Situational context is called situatedness in speech recognition applications [82] [77] [83]. Situational context has been defined in a number of ways, including by plan recognition [82] [83] and by information about objects in the environment [21] [82] [77].

    2.4.3 Methods of Integration

In addition to the types of modes and knowledge that are fused multimodally, how they are fused is also an important field of research. Virtual World [75], Finger Pointer [84], VisualMan [76], Jeanie [85] and QuickSet [7] [8] all integrated speech and either pen- or gesture-based motion at a semantic level. QuickSet, though, was the first to use a unification-based system, a logic-based method for combining the meanings from two modes into a single common interpretation. The other systems used frame-based integration, a pattern-matching method for fusing data structures of semantic information into a single common interpretation.

QuickSet would take two n-best lists of speech and gesture semantic information, where an n-best list is a list of n possibly correct interpretations of each mode. It then generated a list of multimodal commands by exhaustively combining the speech and gestures that temporally align, that is, where a gesture precedes or overlaps the speech [86]. The list of multimodal commands was then sorted statistically using the probabilities of each list item from the individual modes. Finally, a multimodal command was selected if the speech/gesture pair defining it could be legally combined into an executable instruction.
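The temporal-alignment rule can be illustrated with a short sketch. The four-second window below is an assumption made for the example, based loosely on the idea that a gesture must precede or overlap the speech; it is not a value taken from [86].

    # Illustrative temporal-alignment test: the gesture must overlap the speech
    # or precede it by at most max_lead seconds (assumed window).
    def temporally_aligned(gesture_start, gesture_end, speech_start, speech_end,
                           max_lead=4.0):
        overlaps = gesture_start <= speech_end and speech_start <= gesture_end
        precedes = gesture_end <= speech_start and (speech_start - gesture_end) <= max_lead
        return overlaps or precedes

    print(temporally_aligned(0.0, 1.0, 0.5, 2.0))   # True: overlap
    print(temporally_aligned(0.0, 1.0, 3.0, 4.0))   # True: gesture precedes by 2 s
    print(temporally_aligned(0.0, 1.0, 9.0, 10.0))  # False: too far apart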

While QuickSet was highly expressive and scalable, it did not account for the correct solution not being recognized by one of the input modalities. Some of the QuickSet developers extended their multimodal system to include an associative map representing the legal semantic combinations between all the defined types of speech and pen-based inputs, as well as formalizing the integration process statistically [86]. Others proposed using a finite state automaton which would parse, understand and interpret inputs from speech and gesture [87]. The statistical approach defined by [86] became the driving force for the multimodal field, leading to the development of SmartKom [9], MUST [10], a speech and gaze multimodal system [11] and a speech and mouse interface for remotely controlling a UAV robot [12].


Despite this statistical strength, a variety of approaches to multimodal fusion have been developed in the last five years. A new method for combining multimodal inputs where one or more are only intermittently used (such as gestures) was defined by [13]. The approach involved learning to weight non-verbal features only when they are relevant to the speech features, i.e. when there is an ambiguity. This technique showed a 73% improvement over using the non-verbal inputs whenever available. A semantic-based fusion technique was developed by [16] to make their system domain independent, similar to the more recent work of [17] and [18]. For the modes of speech and lip tracking, coupled hidden Markov models [15] and a method to learn a collection of meaningful multimodal structures [14] have recently been developed, though this field has been explored much more deeply than this [88].

    2.5 Multimodal Human-Robot Interaction

Early multimodal HRI systems did not combine the multiple modes of input, but rather acted on one or the other [50] [51]. One of the earliest multimodal HRI systems that combined the modes was Hermes, by Bischoff et al. [40]. Hermes was a robust humanoid museum robot that used speech, vision and touch, with context information about the nature of the conversation, to interpret a user's intention. The vision system would perceive objects to help resolve pronouns, and the tactile sense allowed Hermes' attention to be redirected [40]. The speech system had a vocabulary of about 350 words but would only check for a handful of words depending on the conversational context, as triggered by key words.

Iba et al. from CMU described a framework where a robot is programmed interactively through a multimodal interface [70]. The user can provide feedback and interrupt tasks to tell the robot to avoid an obstacle or stop moving. The system uses speech and one- or two-handed gloved gestures as the modes of input, though the accuracy of these systems is not reported. The limited speech vocabulary contains about 50 words, and the gesture system has seven single-handed gestures and one two-handed gesture. The multimodal integration gives the symbolic interpretations of the speech and gestures probabilistic values that are used to decide which task should be performed based on the current situational context. This system was demonstrated, but had not been quantitatively evaluated in the most recent publication.

    Rogalla et al developed an HRI system using the service robot Albert [19]. The

    system integrated speech, gestures and object recognition into an event manage-

    ment architecture, where an event is a single input. The event manager transformed

    a certain event or group of events into an action to be performed. Their gesture


recognition achieved an overall accuracy of 95.6% for 6 static gestures using skin segmenta-

    tion and a Fourier description of the normalized contour. Their speech recognition

    system was domain dependent to increase accuracy, though the accuracy was not

    reported.

    Gorostiza et al [89] described a multimodal HRI framework for a personal robot

    assistant. The framework took into account speech, body gestures and tactile sensors

    as the modes of input and used markup languages to encode these functionalities.

    As this project was still in development, they did not specifically outline how the

    multimodal integration would work in their current publications.

    Huwel and Wrede implemented a multimodal human-robot system which used

    speech, gesture and object recognition as well as environmental information [20].

The focus of their project was to enable a robot to carry out an involved dialogue for

    handling instructions. Their speech system was domain independent, could incor-

    porate results from pointing gestures and object recognition, and could interpret

    speech even if the input is not grammatically correct and the recognition incorrect.

    They used situated semantic units to define speech interpretations, with any miss-

ing information presumed to be available in the contextual knowledge. While

    the contextual knowledge is not described, it is suggested that conversational and

object information is used. As a German research group, their users were non-native English speakers, all of these factors contributing to the average 52% accuracy of their speech recognition. Because of the design of their system, though, 61% of the 1642

    test utterances were interpreted as correct utterances with full semantic meaning,

    illustrating how their speech system recovers from the speech recognition errors.

    They do not report the accuracy of their gesture and object recognition systems.

    The most recent HRI publication is [21] by Stiefelhagen et al from the German

research group Sonderforschungsbereich on humanoid robotics. They have developed

    a robotic system whose internal systems include speech recognition, dialogue pro-

    cessing, localization, tracking and identification of a person, pointing gesture recog-

nition and head orientation recognition. Each of these systems has been developed and tested separately, and they are being integrated into a single system. The multimodal

    fusion uses a constraint based system to semantically fuse inputs from speech, point-

    ing gestures and an environmental model [90]. To resolve some speech ambiguities,

    they use conversation history and user identification - contextual information.

    Because their gesture recognition system only detected 87% of gestures, and had

only 47% recognition accuracy, their system uses speech as the primary modality, and only consults the gesture recognition system when disambiguation is required. The

    direction of a pointing gesture generates a list of objects from the environment model

    that are in that direction within an error cone. By comparing the appropriate speech


    token with the objects on the list, a match can be found, resolving the ambiguity.

    Their multimodal system had a success rate of 74% on 102 multimodal user inputs.

    Of the errors, 52% were caused by failing to detect a gesture, 22% because a gesture

    could not be resolved with a high confidence, 17% due to speech recognition errors

and 9% were because the speech and gesture inputs failed timing constraints.

    Based on these implementations of multimodal technology for robotic platforms,

    there are many areas to explore. The practical use of contextual knowledge to

inform a multimodal system needs to be assessed. In addition, kinds of gestures other than deictic ones that might be used to communicate with a robot should be investigated as well.


3 Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses

For a robot to communicate with a person effectively, it must possess transactional intelligence: the skills to understand what was communicated and to interact with the human to resolve ambiguity so as to negotiate what should be acted upon if a

    difficulty occurs. In this work, speech recognition, gesture recognition and CIMMI

represent these skills. But for a robot to respond to a human's request, spatial intel-

    ligence is required to understand and interact with the environment. In this work,

    navigation, obstacle avoidance and object recognition represent these skills. Both of

    these intelligences are informed by some contextual knowledge, both conversational

    and situational. All of these components are described in this chapter.

    The scenarios that the system is expected to accomplish are initially described

    in Section 3.1. For a person to communicate with the robot, both speech and

    gestures are recognized as described in Sections 3.3 and 3.4 respectively. These

    modes of input are combined by CIMMI using the defined contextual knowledge.

This knowledge is described in Section 3.5. The multimodal algorithm

    itself is comprehensively described in Section 3.6.

    For a service robot to act in human-centric environments, it requires spatial

    intelligence as well. In this work, the navigation, obstacle avoidance and object

    recognition systems represent these skills. They were developed to demonstrate the

    multimodal algorithm on a humanoid robot and are described in Section 3.8.

    3.1 Functional Scenarios

    The developed informed multimodal system was designed to accomplish simple tasks

in a domestic environment, perhaps to assist an elderly or frail person. In the laboratory


Figure 3.1: The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The desk is the foreground table, the kitchen is at the furthest table and the table location is to the right with the blue cup on it. The closest Wakamaru robot with the mounted camera is the active one. The other is not used in this system.

    environment where the system was developed, simple rooms were defined with a hall,

    kitchen and office space using short walls to separate the spaces (See Fig. 3.1). Each

    room had a table with one or two objects on it. The simple tasks to be performed

    by the robot include:

Natural interactions The robot should act and respond to users' responses by natural means, such as speech and gestures. The Mitsubishi engineers

    built in a series of smooth and natural-looking motions for when the robot

    was performing preprogrammed actions, such as leaving the charger, but also

    moving around the room. Simple hand and arm gestures were also defined for

    user responses such as waving.

    Providing information The system should be able to provide information about

    the environment, such as object locations, on demand using its situational

    contextual knowledge - the known object database.

Fetch and carry The robot should be able to retrieve an object from a location at the user's request. To accomplish this, the robot must navigate to the object's location, avoiding obstacles along the way. It must then recognize the


    requested object, reach out and touch it. Wakamaru does not have articulated

    hands, so the object is not retrieved. The possibility of using magnets was

explored but was unsuccessful due to the shape of Wakamaru's hands. The

    robot should then return the object to the user. If no object is recognized,

    then the robot should return to the user and inform them of this.

    With these scenarios in mind, the transactional intelligence was developed to

    accomplish the human interaction components of the system, as described in the

    following sections.

    3.2 Software Overview

    In this section, the transactional intelligence is outlined conceptually, initially by

    implementation and then as a user interaction. This overview should provide a

    guide for how the different components of the transactional intelligence relate, but

    also the reasons for different design decisions.

    The developed system is modular by design. The capturing and processing of

    the speech, vision and laser-depth data are performed in separate processes on the

laptop, as it is a more powerful system than Wakamaru's onboard computer. CIMMI and the robotic response system run as a single-threaded process on Wakamaru. The speech recognizer is a client, triggering actions in the robotic system. The

    vision system is a server that detects gestures while it waits for instructions from

    Wakamaru to return the last found gesture, detect obstacles, or recognize objects.

    The laser system is a server that waits for an instruction to return depth data for

    obstacle avoidance.
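As an illustration of this client/server split, a vision server of this kind could be structured as in the sketch below. This is a minimal, hypothetical sketch: the command strings, port number and plain-text message format are assumptions made for illustration, not the actual Wakamaru interface.

```python
import socket

# Hypothetical command strings; the real Wakamaru protocol is not given in the text.
HOST, PORT = "0.0.0.0", 9000

def handle(command, state):
    """Dispatch a single request from the robot-side process."""
    if command == "LAST_GESTURE":
        return state.get("last_gesture", "NONE")
    if command == "DETECT_OBSTACLES":
        return "OBSTACLES 0"          # placeholder result
    if command == "RECOGNIZE_OBJECTS":
        return "OBJECTS none"         # placeholder result
    return "ERR unknown command"

def serve():
    state = {"last_gesture": "WAVE"}  # updated continuously by the gesture tracker
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                command = conn.recv(1024).decode().strip()
                conn.sendall(handle(command, state).encode())

if __name__ == "__main__":
    serve()
```

The robot-side process would open a short connection, send one of the command strings and block on the reply, which keeps the multimodal integration itself single threaded.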

    A user communicates with Wakamaru by speech and dynamic symbolic gestures

    to complete specific tasks. The speech recognizer detects a single utterance, trigger-

    ing the gesture recognizer to return the last detected gesture. Retrieving the last

    recognized gesture, rather than detecting the current one, is done because previous

    studies showed that when any meaningful gestures were made, 85% of the time it was

    accompanied by a spoken keyword temporally aligned during and after the gesture

    [91] and gestures could precede the deictic word by as much as four seconds [88].

Kettebekov and Sharma also showed that 93.7% of the time, a deictic or iconic gesture

    was temporally aligned with the semantically associated nouns, pronouns, spatial

    adverbials, and spatial prepositions [91]. While this system does not semantically

    recognize deictic or iconic gestures, it assumes this statistic would apply to the re-

    lationship between symbolic gestures and the semantically associated speech action

    word, the first word of a sentence.


Figure 3.2: Overview of the Transactional Intelligence, and the Contextual Knowledge which informs it.

    The speech and gesture recognizers generate n-best lists of recognition hypothe-

    ses, where n = 5 for speech, as selected by experimentation (See Section 4.1), and

    n = 3 for gestures, as there are only 3 gestures implemented. These recognizers

are scalable, as words can be added to the vocabulary file for parsing, and training data added to the gesture recognition file for classification (See Sections 3.3 and 3.4

    for more information). However, in the current implementation additional actions

cannot be clarified by CIMMI or responded to by the robot, as these components

    are hard coded. Implementing a more flexible clarification process is a short term

    goal of this work. The recognition hypotheses in the n-best lists are sent to CIMMI

along with each hypothesis's probability of correctness.
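For concreteness, the hypothesis lists handed to CIMMI can be pictured as simple records like the ones below. The field names and the bundling function are illustrative assumptions; the thesis only specifies the parsed structure [action, colour, object, location, probability] and the list sizes.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeechHypothesis:
    """One parsed entry of the speech n-best list (n = 5 before duplicates merge)."""
    action: Optional[str]       # None marks a slot the utterance did not fill
    colour: Optional[str]
    obj: Optional[str]
    location: Optional[str]
    probability: float          # scaled so that the whole list sums to 1.0

@dataclass
class GestureHypothesis:
    """One entry of the gesture n-best list (n = 3, one per implemented gesture)."""
    gesture: str                # "come", "go" or "wave"
    probability: float

def to_cimmi_message(speech: List[SpeechHypothesis],
                     gestures: List[GestureHypothesis]) -> dict:
    """Bundle both lists into a single structure handed to CIMMI (layout is hypothetical)."""
    return {"speech": speech, "gestures": gestures}
```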

CIMMI selects the best speech/gesture pair, a command hypothesis, based on the transactional intelligence and the contextual knowledge. See Section 3.5 for details about the implemented contextual knowledge. The selected hypothesis is checked for ambiguities according to some predefined rules (See Section 3.6). If an ambiguity is

    not resolved beyond a satisfactory threshold of certainty, then the robot will ask the

    user for more information. If there is no ambiguity or an acceptable level of certainty,

then the hypothesis is passed on to the robotic response finite state machine so the

    robot can appropriately respond to the input. This overview of the transactional

    intelligence is represented by Fig 3.2.
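A rough sketch of this select-then-check loop is given below. The scoring formula, threshold value, gesture probabilities and the ambiguity rule are all assumptions made for illustration; CIMMI's actual integration and rules are described in Section 3.6.

```python
# Each speech hypothesis: (action, colour, object, location, probability); values follow Table 3.4.
speech_nbest = [("get", "yellow", "cup", None, 0.37),
                ("get", "yellow", None, None, 0.31),
                ("get", "yellow", "cut", None, 0.31)]
# Each gesture hypothesis: (gesture, probability); these probabilities are invented.
gesture_nbest = [("go", 0.6), ("come", 0.3), ("wave", 0.1)]
# Associative map weights for (speech action word, gesture); cf. Table 3.3.
assoc = {("get", "go"): 2, ("get", "come"): 0, ("get", "wave"): 1}

def select_command(threshold=0.3):
    """Score every speech/gesture pairing and either dispatch it or ask for clarification."""
    scored = [((s, g), s[4] * g[1] * assoc.get((s[0], g[0]), 1))
              for s in speech_nbest for g in gesture_nbest]
    (speech, gesture), score = max(scored, key=lambda item: item[1])
    if speech[2] is None or score < threshold:      # an illustrative ambiguity rule
        return ("clarify", speech)                  # e.g. ask "Which object do you mean?"
    return ("execute", (speech, gesture))           # hand over to the response state machine

print(select_command())
```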

    Having outlined the transactional intelligence in this work, each component will

    be described in more detail in the following sections.


Figure 3.3: A diagram showing the speech recognition process, from capturing the spoken utterance through to generating a list of parsed recognition hypotheses

    3.3 Speech Recognition

Speech is the primary form of communication between people, so it makes sense for a robot interacting with people to recognize this mode of input. For this reason, speech recognition is implemented as part of the transactional intelligence

    of this human-robot interactive system. As an overview of this system, the speech

    recognition engine processes a spoken utterance and generates a list of recognition

    hypotheses. These hypotheses are then parsed using a defined vocabulary and these

    interpreted hypotheses, and their probabilities of correctness, are sent to Wakamaru.

    This process is described in detail in the following section.

The developed speech recognition system uses the Microsoft Speech SDK v5.1

    because it is well documented and provides an interface to the speech recognizer for

    developing speech recognition applications.

    The Microsoft Speech Recognizer Engine (SRE) v5.1 was based on Whisper

    (Windows Highly Intelligent SPEech Recognizer) speech engine [92] which in turn

    was based on SPHINX-II [93]. A speech recognizer consists of an acoustic model to

    represent the sounds, a decoder to interpret possible words that could be represented

by those models, and language models which define the probabilities of sequences of

    words. The acoustic models in the SRE use hidden Markov models (HMMs) to

represent groups of three phonemes, where a phoneme is a single speech sound [94]. The language models

    use both bigram-class-based and trigram-word structures where bigram-class-based

    structures are pairs of words sorted into context-dependent classes [95] and, by

example, trigram-word structures are "the quick brown" and "quick brown fox" for the utterance "the quick brown fox" [96].
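To make the n-gram terminology concrete, the snippet below extracts word bigrams and trigrams from an utterance. It only illustrates the idea of word n-grams; it is not the SRE's internal language-model format.

```python
def ngrams(utterance, n):
    """Return the overlapping n-word sequences of an utterance."""
    words = utterance.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the quick brown fox"
print(ngrams(sentence, 2))  # ['the quick', 'quick brown', 'brown fox']
print(ngrams(sentence, 3))  # ['the quick brown', 'quick brown fox']
```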

    Training the acoustic models required reading provided passages where the

    sounds were encoded into HMMs. Approximately seven hours of training was needed

    before this American speech recognition engine was recognizing most of the 80 words

    in the desired vocabulary. A training file was also developed containing the specific

1 Microsoft Speech SDK 5.1 is available for download from: http://www.microsoft.com/speech/speech2007/downloads.mspx.


utterances the system should recognize, such as "bring me the blue cup". To recog-

nize a spoken utterance, the SRE generates HMMs representing groups of three phonemes, and these are concatenated together to represent the spoken utterance. The HMMs

    are then decoded into words using the learned language models, selecting the words

    with the highest probabilities. A list of hypotheses is generated by selecting alter-

    native words with the next highest probabilities. This much training was required

to refine the acoustic models so that they could accommodate subtle variations in the user's pronunciation, because words are pronounced slightly differently depending on the neighbouring words in the spoken utterance.

    In the developed speech system, a spoken utterance is captured and a list of five

    recognition hypotheses is generated. The number of recognition hypotheses gener-

    ated, five, was selected by experimentation, as it encompassed 97% of the correct

    recognition results for the tested dataset (See Section 4.1 for more information). A

    recognition hypothesis consists of a list of words that were possibly spoken by the

    user and a confidence value generated by the SRE (See Table 3.1). Generating a list

    of possible recognitions is done because the first recognition according to the speech

    engine is not always the correct one.

The confidence values for each recognition hypothesis are scaled so that the values sum to 1.0, i.e. they form a probability distribution. The hypotheses are also parsed, detecting an action, a colour, an object type and/or a location, similar to [19]. However, for the recognized hypotheses to be parsed, the nickname for Wakamaru, "Waka", must be recognized in at least one of the hypotheses. The reason for requiring "Waka" was that the speech recognizer would otherwise try to recognize ambient sounds in the laboratory, such as other conversations or music, as spoken utterances. To avoid these sounds being passed to CIMMI and possibly interfering with experiments, the name "Waka" needed to be recognized so that only a spoken utterance clearly directed at Wakamaru would be processed.
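A minimal sketch of this normalisation and the "Waka" gate is shown below, using the raw confidence values from Example 1 of Table 3.1; the function names are illustrative.

```python
def normalise(hypotheses):
    """Scale raw SRE confidence values so they sum to 1.0."""
    total = float(sum(conf for _, conf in hypotheses))
    return [(text, conf / total) for text, conf in hypotheses]

def addressed_to_robot(hypotheses, name="waka"):
    """Only process the utterance if at least one hypothesis contains the nickname."""
    return any(name in text.lower() for text, _ in hypotheses)

nbest = [("bring the yellow cup Waka", 111484),
         ("bring the yellow cup water", 111484),
         ("bring the yellow cab Waka", 93467),
         ("bring the yellow tape Waka", 93467),
         ("bring the yellow cut Waka", 93467)]

if addressed_to_robot(nbest):
    for text, prob in normalise(nbest):
        print(f"{prob:.2f}  {text}")
```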

Each hypothesis is parsed by the developed system using a domain-dependent

    vocabulary, where each keyword has a:

Lexical category (such as noun, verb, etc.) represented by a number. If a word is in the vocabulary then the word's unique ID number and its probability of correctness are stored in the parsed recognition structure, [action, colour, object, location], depending on its defined lexical category (a sketch of this parsing is given after this list). This structure is a simple form of case grammar, where a different

    number of slots would be defined depending on the action word recog-

    nized [20]. Because of the simple utterances that were being recognized,

    this was not necessary for the implemented system.


Table 3.1: Sample Unparsed Speech Recognitions

Ex.  Spoken Utterance            Recognition Hypotheses         Confidence
1    bring the yellow cup Waka   bring the yellow cup Waka      111484
                                 bring the yellow cup water     111484
                                 bring the yellow cab Waka      93467
                                 bring the yellow tape Waka     93467
                                 bring the yellow cut Waka      93467
2    fetch the yellow cup Waka   fetch the yellow cut Waka      112738
                                 fetch the yellow cut off       112738
                                 fetch the yellow cup Waka      107971
                                 fetch the yellow cup soccer    107971
                                 fetch the yellow cab Waka      86016

Table 3.2: Sample Risk Values

Words        Time  Help  Come  Go  Stop  Hello  Cup  Desk
Risk Values  0     10    0     0   10    0      0    0

Risk value The risk value represents the semantic importance of certain words, where the risk values are higher for words whose misinterpretation could have seriously negative consequences, such as if the words "help" and "stop"

    were misrecognised (see Table 3.2). Gestures do not have risk values in

    this implementation of the system, as symbolic gestures with high risk

    meanings, such as stop, have not been implemented (See Section 3.4 for

    more information).

    Currently the risk values are binary, as illustrated in Table 3.2, but this

    can easily be extended so risk values can be a greater range of values, or

    fractional values. These values could also be learned by user interaction.

A list of values relating the word semantically to each of the gestures defined in the system. The values relating a word to each gesture are de-

    fined using an associative map, similar to [86]. The map is manually

defined in the vocabulary file based on the context of a domestic environment and the developer's common sense. For example, the gesture "Come" is contradictory to the word "Go", so the corresponding associative map value is 0 (See Table 3.3). Also, the utterance "What's the time?", represented as "Time" in Table 3.3, is not complementary or contradictory to

    any gesture, so its associative map values are always 1. This associative


Table 3.3: Sample Values from the Associative Map

Words         Time  Help  Come  Go  Stop  Hello  Cup  Desk
Come Gesture  1     2     2     0   0     1      2    2
Go Gesture    1     2     0     2   0     1      2    2
Wave Gesture  1     2     1     1   1     2      1    1

Table 3.4: Sample Parsed Speech Recognitions

Example  Spoken Utterance       n-Best List Parsed Input  Prob.
1        Bring the yellow cup   Get Yellow Cup            0.37
                                Get Yellow                0.31
                                Get Yellow Cut            0.31
2        Fetch the yellow cup   Get Yellow Cut            0.37
                                Get Yellow Cup            0.35
                                Get Yellow                0.28

    map could be adapted to individual users with learning algorithms and

    also contain fractional values. This associative map is a key factor of

    mutual disambiguation in our system [97].
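Putting the three parts of a vocabulary entry together, a hypothetical entry table and the slot-filling parse could look like the sketch below. The word IDs, the file layout and the values for words not listed in Tables 3.2 and 3.3 are assumptions; where a word does appear in those tables, its risk and associative values are copied from them.

```python
# Hypothetical vocabulary entries: word -> (lexical category, word ID, risk value,
# associative map row). "bring" and "get" share an ID, mirroring Table 3.4 where
# "bring" parses to the Get action.
VOCAB = {
    "bring":  ("action",    1,  0, {"come": 2, "go": 0, "wave": 1}),
    "get":    ("action",    1,  0, {"come": 0, "go": 2, "wave": 1}),
    "help":   ("action",    2, 10, {"come": 2, "go": 2, "wave": 2}),
    "yellow": ("colour",   10,  0, {"come": 1, "go": 1, "wave": 1}),
    "cup":    ("object",   20,  0, {"come": 2, "go": 2, "wave": 1}),
    "desk":   ("location", 30,  0, {"come": 2, "go": 2, "wave": 1}),
}

def parse_hypothesis(text, probability):
    """Fill the [action, colour, object, location, probability] structure from one
    recognition hypothesis (the real system stores numeric word IDs in the slots)."""
    slots = {"action": None, "colour": None, "object": None, "location": None}
    for word in text.lower().split():
        if word in VOCAB:
            category, word_id, risk, assoc_row = VOCAB[word]
            slots[category] = word
    return [slots["action"], slots["colour"], slots["object"], slots["location"], probability]

print(parse_hypothesis("bring the yellow cup Waka", 0.37))
# -> ['bring', 'yellow', 'cup', None, 0.37]
```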

    The vocabulary can be extended beyond the 80 words defined by adding new

    words to the text file along with the necessary values. However, additional action

    words cannot be added to the vocabulary without extending the robotic response

    system as this is hard-coded for Wakamaru and the implemented action words only.

    Different vocabulary files could also be defined for different contexts, such as the

    kitchen, bathroom or office based on the recognition of keywords as done by [40].

    Once all the recognition hypotheses have been parsed, the parsed recognition

    structures and probabilities, [action, colour, object, location, probability], are sent to

    Wakamaru to be combined with the gesture recognitions and contextual knowledge.

    These two components of the transactional knowledge will be defined in the following

    two sections. The overall speech recognition process is illustrated in Fig. 3.3. This

    style of diagram will be used to describe the different components of the transactional

    intelligence in the following sections.


    3.4 Symbolic Gesture Recognition

    In human communication, while speech may be the primary form of interaction used

    by people, they instinctively gesticulate to express themselves as well. According to

    McNeill, 90% of gestures occur in conjunction with some speech [73]. Based on this

fact, gesture recognition was implemented in this work. This system is only a toy, proof-of-concept system intended to illustrate the use of a multimodal system in robotics.

    Symbolic gestures were used in this system because they can substitute, reinforce

    and conflict with an action word semantically. Deictic gestures, while more common

    in human communication for anaphora resolution (this, here, him words) and

    object identification, is being thoroughly explored by a number of groups [70] [19][89] [20] [21]. The multimodal combination of symbolic gestures and speech has been

    primarily explored in HCI [7] [8] [91] with some work in HRI [6].

The three symbolic dynamic gestures recognized in this work are "wave", "come here" and "go away". The gestures are dynamic because they involve motion, as opposed to static gestures such as a thumbs-up. These gestures are tracked using OpenCV's Camshift tracking and 3D geometry. Each of the implemented gestures has two or more meanings:

    Come - bring or come

    Go - get, go, call or answer

    Wave - hello, goodbye or help

    The gestural meaning is decided during multimodal integration, as dictated by

the speech action word or, in the action word's absence, the speech object or location words. For example, if the action word is "get", then the semantic meaning of the gesture "Go" is get. However, if the speech action word is missing and only a location word is defined, then it is assumed the semantic meaning is go.
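One way to picture this meaning resolution is a lookup keyed on the recognized gesture and the available speech slots, as sketched below. The meaning sets come from the list above; the fallback rule for a missing action word is a hedged reconstruction of the behaviour just described, not CIMMI's exact rule set.

```python
# Possible meanings of each symbolic gesture, as listed above.
GESTURE_MEANINGS = {
    "come": {"bring", "come"},
    "go":   {"get", "go", "call", "answer"},
    "wave": {"hello", "goodbye", "help"},
}

def resolve_gesture_meaning(gesture, action_word, location_word=None):
    """Pick the gesture's meaning from the speech action word, or fall back on the
    location word when no action word was recognized (an assumed fallback rule)."""
    meanings = GESTURE_MEANINGS[gesture]
    if action_word in meanings:
        return action_word              # e.g. gesture "go" + action "get" -> "get"
    if action_word is None and location_word is not None and "go" in meanings:
        return "go"                     # only a location was spoken, assume "go"
    return None                         # no agreement; left for disambiguation

print(resolve_gesture_meaning("go", "get"))          # get
print(resolve_gesture_meaning("go", None, "desk"))   # go
```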

    For gesture performance, the gesturer must be standing approximately 2 meters

    away from the robot so both a frontal face and the gestures are detectable. If the

gesturer is too far away, the face will be too small to detect using OpenCV's Haar frontal face detector. If the gesturer is too close to the camera then the perspective of the gesture is changed, perhaps resulting in a misrecognition. The face and all pixels to the right of it (as seen by the robot) are ignored to simplify image processing, assuming that gestures are performed right-handed. If a user could be identified (perhaps by using face recognition) as a left-handed person, then a left-hand mode

    could be used.
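A minimal sketch of this face-gated masking is given below using the modern OpenCV Python bindings rather than the original implementation; the cascade file location and the detection parameters are assumptions.

```python
import cv2

# Standard Haar cascade shipped with OpenCV; the path may differ per installation.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mask_right_of_face(frame):
    """Zero out the face and everything to its right in the image (as seen by the robot),
    assuming the right-handed gesture appears on the other side of the face."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(grey, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None                      # face too small or absent: skip this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    frame = frame.copy()
    frame[:, x:] = 0                     # discard the face column and all pixels to its right
    return frame
```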


Figure 3.5: A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter coloured regions are segmented. (b) In the second frame, the user has moved their arm a distance, so the previous motion segmentation and the currently detected motion are added as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion of the person in the bottom right, are detected as shown here.

The motion templates use the previous three frames to build a motion history. This means that differences between the current frame and the last three frames are included in the

    segmented motion image, as shown in Fig. 3.5.

These two masks, the skin and motion segmentations, are then combined (the conjunction shown in Fig. 3.6(c)), resulting in a greyscale image showing moving skin-coloured regions in the frame. This

    image is used to select the regions for the Camshift algorithm to track, as shown in

    Fig. 3.6. In the first frame, there is no motion history so static regions are included

    in the segmentation. However, after 3 frames have been processed, the background

    regions are successfully ignored.
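The combination of the two masks can be illustrated as below. This is a simplified stand-in: plain frame differencing over the last three greyscale frames replaces OpenCV's motion templates, and a fixed HSV range replaces the skin model actually used.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Very rough skin segmentation using a fixed HSV range (an assumption,
    not the thesis' skin model)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))

def motion_mask(grey_frames):
    """Accumulate differences between the current frame and the previous three
    greyscale frames (grey_frames holds the current frame last)."""
    current = grey_frames[-1]
    acc = np.zeros(current.shape, dtype=np.float32)
    for previous in grey_frames[-4:-1]:
        acc += cv2.absdiff(current, previous).astype(np.float32)
    return cv2.convertScaleAbs(acc)

def moving_skin(frame_bgr, grey_history):
    """Greyscale image of moving skin-coloured regions: the motion image masked by skin."""
    motion = motion_mask(grey_history)
    return cv2.bitwise_and(motion, motion, mask=skin_mask(frame_bgr))
```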

    3.4.2 Region Tracking

    The initial search windows for the Camshift tracker are manually defined by labeling

    the greyscale image generated by the skin and motion segmentation. In the next

frame, the algorithm uses the greyscale skin and motion segmentation for the frame and the search window for one region, as defined from the previous frame, to track that region. It is assumed that part of the region in this frame will overlap with the search

    window. CamShift finds the center of a region by iteratively searching in the search

window of the greyscale image. It then calculates the region's size and orientation and returns its bounding box. The bounding box is used as the search window for this region in the next frame. The center of the bounding box is considered the

    (x,y) coordinate of a region for the current frame. This is done for each region that

was segmented in the first frame, of which there can sometimes be as many as 15. However,

    it is assumed only one region will be a gesture.
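The per-region tracking step can be sketched with OpenCV's CamShift call, as below. This is a simplified, modern-OpenCV illustration; the termination criteria and the layout of the search windows are chosen arbitrarily.

```python
import cv2

# Termination criteria for CamShift's iterative search (values are illustrative).
TERM_CRIT = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

def track_regions(grey_moving_skin, search_windows):
    """Run one CamShift update per region: each returned window becomes the search
    window for the same region in the next frame, and the window centre is taken
    as the region's (x, y) coordinate for the current frame."""
    centres, new_windows = [], []
    for window in search_windows:              # window = (x, y, w, h) from the previous frame
        _rot_rect, window = cv2.CamShift(grey_moving_skin, window, TERM_CRIT)
        x, y, w, h = window
        centres.append((x + w // 2, y + h // 2))
        new_windows.append(window)
    return centres, new_windows
```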


Figure 3.6: A single frame from a wave gesture using skin and motion segmentation. (a) A binary image of the skin segmentation. (b) A greyscale image of the motion segmentation. (c) The conjunction of (a) and (b). (d) The colour capture with the detected regions identified.

    Figure 3.7: A diagram of the gesture capture and recognition process


    For each tracked region of each frame, the center (x,y) and the average disparity

    value of the tracked region are translated into (X, Y, Z)