

CIMMI - A Contextually Informed Multimodal Integration Method for a Humanoid Robot in an Assistive Technology Context

    Elizabeth Harte

    Bachelor of Science (Hons.), Computer Science (2005)

    University of Auckland

    Submitted for the Degree of

Master of Engineering Science (Research)

    Intelligent Robotics Research Centre

    Department of Electrical & Computer Systems Engineering

    Monash University, Clayton, VIC 3800, Australia

    April 2008


    Declaration

I declare that, to the best of my knowledge, the research described herein is original except where the work of others is indicated and acknowledged, and that the thesis has not, in whole or in part, been submitted for any other degree at this or any other university.

    Elizabeth Harte

Melbourne, Australia
April 2008



    Acknowledgments

My thanks go to Prof. Ray Jarvis for his enthusiasm and support over the past 2 years. He is an inspiring roboticist and researcher. I will be lucky to be even half as creative as him in my life.

I would also like to acknowledge Mitsubishi Heavy Industries Ltd for the unique opportunity to use the Wakamaru robot for this project.

I would like to thank all the members of the Intelligent Robotics Research Centre. I have met some great people and made some great friends. Thank you for welcoming this Kiwi into the circle. In particular, thanks go to Alan Zhang and Jay Chakravarty for their valuable discussions and friendship during the development of this project.

Finally, I thank Rob Davidson for his moral support during this project. His encouragement, support and editorial skills were greatly appreciated during the writing of this thesis. Kia ora.



    Abstract

Robots dealing directly with people need to be able to handle the uncertainties associated with human-centric environments, including human interaction. This field has seen much attention in the past decade, though work using multimodal systems has not been deeply explored. This work describes a Contextually Informed MultiModal Integrator, CIMMI, that fuses speech and symbolic gesture probabilistically and is informed by contextual knowledge in an assistive technology robotic application. Symbolic gestures are gestures that have semantic meaning, such as a wave gesture meaning 'hello'. Very few systems use symbolic gestures in a robotic domain, primarily because symbolic gestures are not as commonly used in day-to-day human conversation as deictic (pointing) gestures. Using symbolic gestures allows the system to resolve ambiguities associated with action words in a command hypothesis, whereas deictic gestures are often used to identify objects or locations. Exploring the use of symbolic gestures is one focus of this work.

Since no deictic gestures are implemented in this work, contextual knowledge is used to resolve ambiguities associated with object and location words in a command. Contextual knowledge is defined as both conversational and situational. Conversational contextual knowledge uses the dialogue history to select the best command from a generated list of possible commands or to resolve ambiguities in the selected command. In human conversation, if a person had been talking about a yellow cup, then the conversational context allows a person to simplify additional references to the same object by saying 'the cup' without having to specify the implied 'yellow' again. Using this idea as an influence, ambiguities where the object or location information is missing in a command could be resolved by referencing the conversational context. These simple ideas are explored and discussed further in this work.

The second kind of implemented contextual knowledge is situational. In this work, situational contextual knowledge consists of the last known locations of the user and the robot, and a list of objects that exist in the environment. Each object in the list has a colour, an object class type (such as cup) and a location. Knowing the location of an object allows the user to simplify their spoken requests, so they can simply ask 'bring me the blue cup' without stating the implied 'from the table'. The same concept applies to specifying the colour of an object. Using these ideas as an influence, this knowledge can also resolve misrecognitions where the colour or location information is missing from a command. The user and robot locations are only used to assist the robotic responses. This situational knowledge is explored further in this thesis. The exploitation of conversational and situational contextual knowledge, as introduced above, is the second focus of this work.

The accuracy of the speech recognition system alone (53%) was compared to using the speech with contextual knowledge, both conversational and situational. This increased the accuracy of the system to 72%. With the addition of gesture recognition as well, the system's accuracy rose by only 4% more, because only 3 cases required the gesture to clarify the ambiguity in the spoken command. CIMMI was implemented on a humanoid robot, Wakamaru, and the speech, gesture and object recognition systems were implemented on a separate laptop.



    Contents

1 Introduction
  1.1 Motivation
  1.2 Multimodal Human-Robot Interaction
    1.2.1 Modes of Input
    1.2.2 Contextual Knowledge
    1.2.3 Multimodal Integration
  1.3 Contributions
  1.4 Thesis Overview

2 Literature Review
  2.1 Service and Field Robotics
  2.2 Healthcare Robotics
  2.3 Social Robotics
  2.4 Multimodal Integration
    2.4.1 Modes of Input
    2.4.2 Contextual Knowledge
    2.4.3 Methods of Integration
  2.5 Multimodal Human-Robot Interaction

3 Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses
  3.1 Functional Scenarios
  3.2 Software Overview
  3.3 Speech Recognition
  3.4 Symbolic Gesture Recognition
    3.4.1 Segmentation
    3.4.2 Region Tracking
    3.4.3 Static Gesture Recognition
  3.5 Contextual Knowledge
    3.5.1 Conversational Context
    3.5.2 Situational Context
  3.6 Integrating Speech and Symbolic Gestures in an Informed Manner
    3.6.1 CIMMI Overview
    3.6.2 The Iterative Approach
    3.6.3 The Sequential Approach
  3.7 Hardware Overview
  3.8 Wakamaru's Responses

4 Experiments
  4.1 Speech Recognition
    4.1.1 Experiment 1: Selecting the Number of Recognition Hypotheses
    4.1.2 Experiment 2: Word Recognition Rate of Unparsed Speech Results
    4.1.3 Experiment 3: Word Recognition Rate of Parsed Speech Results
    4.1.4 Discussion


  4.2 Gesture Recognition
    4.2.1 Experiment 1: Gesture Detection and Recognition by Skin Segmentation Only
    4.2.2 Experiment 2: Gesture Detection and Recognition by Skin and Motion Segmentation
    4.2.3 Experiment 3: Gesture Detection and Recognition by Skin and Motion Segmentation in the New Laboratory Environment
    4.2.4 Discussion
  4.3 Multimodal Interaction Accuracy
    4.3.1 Experiment 1: Accuracy of the Individual Modes
    4.3.2 Experiment 2: Using the Iterative Approach
    4.3.3 Experiment 3: Using the Sequential Approach
  4.4 Object Recognition
    4.4.1 Experiment 1: Object Recognition using Colour Segmentation and ShapeMatcher in the Old Laboratory Environment
    4.4.2 Experiment 2: Object Recognition using Colour Segmentation and ShapeMatcher in the New Laboratory Environment
  4.5 Navigation and Obstacle Avoidance
    4.5.1 Experiment 1: Navigation and Obstacle Avoidance in the Old Laboratory
    4.5.2 Experiment 2: Navigation and Obstacle Avoidance in the New Laboratory
  4.6 Robotic Demonstration
    4.6.1 Natural Interactions
    4.6.2 Requesting Information
    4.6.3 Fetch-and-Carry Requests
    4.6.4 Task Performance

5 Discussions
  5.1 The Speech Recognition
  5.2 Integrating Speech with Contextual Knowledge
  5.3 Addition of Symbolic Gestures
  5.4 Comparing the Iterative and Sequential Multimodal Approaches
  5.5 The Robotic Demonstrations

6 Future Work and Conclusions
  6.1 Future Work
  6.2 Conclusions


    List of Figures

3.1 The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The desk is the foreground table, the kitchen is at the furthest table and the table location is to the right with the blue cup on it. The closest Wakamaru robot is the active one; the other is not used in this system.
3.2 Overview of the Transactional Intelligence, and the Contextual Knowledge which informs it.
3.3 A diagram showing the speech recognition process, from capturing the spoken utterance through to generating a list of parsed recognition hypotheses.
3.4 The non-skin (left) and skin (right) histograms.
3.5 A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter coloured regions are segmented. (b) In the second frame, the user has moved their arm a distance, so the previous motion segmentation and the currently detected motion are added as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion of the person in the bottom right, are detected as shown here.
3.6 A single frame from a wave gesture using skin and motion segmentation.
3.7 A diagram of the gesture capture and recognition process.
3.8 The red circle outlines the face region and the orange circle represents the hand region. The blue dots represent the centre of the bounding circles in the XY plane for each frame of the gesture sequence. The lighter the colour of the dot, the closer that part of the gesture, and vice versa.
3.9 The nine directions that should be differentiable by using the gesture tracking and the three features specified.
3.10 Captures from the SM program of matches between two gesture skeletons. Notice the individual segments of the skeleton representing the silhouette are numbered and differently coloured.
3.11 A diagram illustrating CIMMI's sequential algorithm.
3.12 Calculating the speech weight, wt_s, in the iterative approach.
3.13 Wakamaru with the Bumblebee camera on its head, the Hokuyo URG laser range finder at its middle and a laptop on its back.
3.14 Localization panoramic images on Wakamaru.
3.15 Screen captures from Wakatest, Wakamaru's debugging and testing tool.
3.16 A sample image of each of the recognized object classes.
3.17 A sample object capture which is segmented, made into a skeleton, then matched to the object database.
3.18 A diagram showing the SM matching process, as provided by Macrini.


3.19 Route following and obstacle avoidance on Wakamaru. (a) shows two sequential, but not consecutive, images of the robot's position approximately when the screen captures in (b) and (c) were taken. (b) shows two consecutive captures of the obstacle avoidance using the laser range finder. Obstacles are red and the path is green, with the yellow square as the start point and the blue square as the end point. (c) shows Wakamaru's on-board obstacle avoidance sensor map in green and blue with the path in pink. If the on-board sensors detected an obstacle, parts of the blue sensor map would be red. Notice the faint pink crosses which identify the location of the landmark reflectors.

4.1 A capture of the glass panels along one side of the new laboratory.
4.2 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.3 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.4 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The blue box shows the location of the face in the first frame of the sequence and the coloured boxes bound the detected skin coloured regions. Note that the top two left-hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.


4.5 Graphs showing the gesture training database for the same dataset plotted against the three classification features. The gestures in (a) were only tracked by skin segmentation, and in (b) were tracked using both skin and motion segmentation. In (a), the gesture classes were quite separate from each other and training samples were clustered within each class. By comparison, in (b) the gesture classes were less separate from each other, with some overlap as well. The training samples were less clustered per class, especially the go gesture. This could be because the tracking of go gestures in the previous version was influenced by segmented background regions, as the gesture region could get stuck on these regions, modifying the path tracked.
4.6 A capture of the user not performing a gesture, using skin and motion segmentation in the new laboratory.
4.7 A capture of the user performing a come gesture, using skin and motion segmentation in the new laboratory.
4.8 A capture, and segmentations, of objects on the table in the old laboratory.
4.9 A capture, segmentation and skeleton of an object on the kitchen table in the new laboratory.
4.10 A bowl skeleton matching a cup skeleton due to the rotational invariance of the ShapeMatcher approach.
4.11 A sample capture where the yellow plate is partially out of frame, resulting in a misrecognition of the region.
4.12 A comparison of the camera disparity obstacle detection against the laser range finder obstacle detection.
4.13 A sequence of images of Wakamaru travelling from its charger base toward the kitchen table in the laboratory environment.
4.14 Waving to Wakamaru.
4.15 Wakamaru touching a recognized object.
4.16 An attempted object recognition capture. The robot's head is still moving downwards, resulting in a blurry image. This image contains part of the object and was, in fact, correctly recognized. Other captures would still only be looking at the wall, resulting in a failed recognition.
4.17 Another attempted object recognition capture. The segmented cup silhouette failed to be recognised as a cup because the shadow on the table was segmented as part of the cup, changing the silhouette. This meant the cup had a top width similar to that of its base.


    List of Tables

3.1 Sample Unparsed Speech Recognitions
3.2 Sample Risk Values
3.3 Sample Values from the Associative Map
3.4 Sample Parsed Speech Recognitions
3.5 Table showing some Sample Command Hypotheses from Speech Input. These utterances were spoken in the order listed. Because the Yellow Cup in Ex. 1 is not in the known object database, the robot requested its location, which the user provided in Ex. 2. The combination of these gives an unambiguous command hypothesis "Get Yellow Cup Desk". Ex. 3 illustrates that, using the conversational history, the identity of "the cup" can be resolved to "Yellow Cup" where the user is asking the robot to take the cup away again.
3.6 Table showing Sample Objects from the Object Database
4.1 The First Ten Recognition Hypotheses for a Spoken Utterance
4.2 Sample Unparsed Speech Recognitions
4.3 Word Recognition Rate (WRR) for the Unparsed Speech Results
4.4 Sample Parsed Speech Recognitions
4.5 Word Recognition Rate (WRR) for the Parsed Speech Results
4.6 Two Cases of Deletion Errors for Parsed Speech
4.7 Comparing the Gesture Detection and Recognition Results
4.8 Table showing some Sample Unimodal Hypotheses Selected by the Iterative Approach
4.9 Table showing some Sample Multimodal Hypotheses Selected by the Iterative Approach
4.10 Comparison of the Use of Speech and Contextual Knowledge against the Iterative version of CIMMI
4.11 Table showing some Sample Unimodal Hypotheses Selected by the Sequential Approach
4.12 Comparison of Speech Recognition Accuracy Results
4.13 Table showing some Sample Multimodal Hypotheses Selected by the Sequential Approach
4.14 Comparison of the Accuracies of using Speech with Contextual Knowledge to the two different versions of CIMMI


1 Introduction

    1.1 Motivation

Increasing levels of affordable computational resources and quality sensors have permitted the development of robots that can operate in a widening set of domains, both structured and unstructured environments, at times shared by people.

Sometimes a robot is designed because it can do a task more accurately or efficiently than a human, such as with industrial and surgical robots. Other times robots are designed because a human cannot perform the dangerous or inaccessible task, such as with search and rescue or space exploration. In the case of service robotics, robots are often designed to perform tasks which directly interact with, and assist, a person.

The need for assistive robotics is increasing, as demographic studies have shown there will be an increase in the number of elderly people in the future [1]. According to the Australian Bureau of Statistics, in 2004, 13% of the population was over 65. This figure is expected to increase to between 26% and 28% by 2051 [2]. While assistive technologies may not replace human carers in the future, they could provide viable alternatives. A robot carer could assist an elderly person in a domestic environment, allowing that person to be independent and remain in their home while still having some basic support and security.

Robot assistants dealing directly with people need to be social robots - ones that interact with people by following some expected behavioural norms [3]. They also need to be able to handle the uncertainties associated with human-centric environments, including interactions with people. This field has seen much attention in the past decade; however, work in multimodal systems for robotic platforms has not been deeply explored.


    1.2 Multimodal Human-Robot Interaction

Multimodal integration is where two or more modes of input are combined to create a richer, and possibly more accurate, interpretation than is possible with a single mode alone. Using the partial information from one mode to clarify another is called mutual disambiguation - a concept defined and proved by Sharon Oviatt [4].

    1.2.1 Modes of Input

As with human communication, speech is commonly used as a primary mode of input in multimodal robotic systems. The speech input is combined with other modes, such as gestures or emotion recognition, which complement its interpretation. In human-robot interaction (HRI), deictic gestures (pointing gestures) are most commonly used, as they can replace object or location information missing from the speech. For example, if the user said 'Bring me that cup', then the direction of the pointing gesture could resolve which cup to get.

Symbolic gestures, also called emblems, are gestures that hold their own semantic meaning and can be understood without context from additional speech, such as a 'thumbs up' gesture [5]. These gestures have been explored broadly in human-computer interaction (HCI) as pen-based actions or as GUI interface clicks. This work has been transferred to touch screen interfaces for HRI, but natural hand-performed gestures have not been researched deeply [6]. Symbolic gestures can substitute for, reinforce, or conflict with action words in the speech input.

    1.2.2 Contextual Knowledge

While humans communicate by multiple modes, they also use their knowledge to aid their interpretations. Contextual knowledge is instinctively used by people to resolve what may otherwise be ambiguous communications. Contextual knowledge is broad and varied, though some simple and common kinds are:

• what they know about the environment
• what they have been talking about
• the speaker and listener locations

These different kinds of context can be used by a robot to help it more accurately interpret, or respond to, what is communicated to it by a human user. The environmental knowledge can provide situational context about objects in the environment, so a spoken utterance like 'get the cup in the kitchen' could be stated without specifying which cup. The speaker could assume the listener had a shared knowledge of objects in the environment, such as the green cup in the kitchen.

Similarly, the conversation history can provide context about which specific cup a speaker is talking about by referring to a previously mentioned yellow cup. The conversational context can also be used to resolve a command by combining a user's clarification with the last recognized user command.

The speaker and listener locations, another part of the situational context, can be used to increase the likelihood of one object being requested over another, or to aid responses to a request.

In a robotic system, all this knowledge can be used to help the system interpret, and respond to, human communication more accurately. Whether the speaker provides an incomplete command or some information is misrecognized, contextual knowledge can help resolve the ambiguities in a useful manner.
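To make the idea concrete, the following Python sketch shows one way conversational and situational context might fill in missing parts of a command hypothesis. The data structures and names (object_db, dialogue_history, resolve_command) are hypothetical illustrations, not the implementation described later in this thesis.

    # Illustrative only: contextual gap-filling for a partially recognized command.
    def resolve_command(cmd, dialogue_history, object_db):
        # Conversational context: "the cup" with no colour is assumed to be the
        # object of that type mentioned most recently in the dialogue.
        if cmd.get("colour") is None:
            for past in reversed(dialogue_history):
                if past["type"] == cmd["type"]:
                    cmd["colour"] = past["colour"]
                    break
        # Situational context: a missing location is taken from the last known
        # location of a matching object in the known-object database.
        if cmd.get("location") is None:
            for obj in object_db:
                if obj["type"] == cmd["type"] and obj["colour"] == cmd.get("colour"):
                    cmd["location"] = obj["location"]
                    break
        return cmd

    object_db = [{"colour": "green", "type": "cup", "location": "kitchen"},
                 {"colour": "yellow", "type": "cup", "location": "desk"}]
    history = [{"action": "get", "colour": "yellow", "type": "cup", "location": "desk"}]

    # "Bring me the green cup" -> location resolved to "kitchen" from the object database.
    print(resolve_command({"action": "bring", "type": "cup", "colour": "green"}, history, object_db))
    # "Take the cup away" -> colour resolved to "yellow" from the dialogue history.
    print(resolve_command({"action": "take", "type": "cup", "colour": None}, history, object_db))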

    1.2.3 Multimodal Integration

To combine multiple modes of input, various methods have been proposed. A popular early work, QuickSet, combined two n-best lists of input, one of speech and one of gestures. An n-best list is a list of speech or gesture recognition hypotheses, any one of which could be the correct recognition. QuickSet would statistically combine the n-best lists using the probabilities of each hypothesis [7] [8]. This statistical approach became the driving force for the systems that followed [9] [10] [11] [12]. More recently, learning algorithms [13] [14] [15] and domain-independent approaches [16] [17] [18] have been developed for use with HCI systems.
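As a rough illustration of this kind of statistical combination, the sketch below ranks every pairing of speech and gesture hypotheses by the product of their recognizer probabilities. The hypotheses, probabilities and scoring rule are invented for the example and are not QuickSet's actual algorithm.

    # Illustrative n-best fusion: score each speech/gesture pairing by the
    # product of the individual recognition probabilities (assumed rule).
    speech_nbest = [("get the cup", 0.6), ("get the cap", 0.3), ("wet the cup", 0.1)]
    gesture_nbest = [("go", 0.7), ("wave", 0.3)]

    combined = sorted(
        ((s, g, ps * pg) for s, ps in speech_nbest for g, pg in gesture_nbest),
        key=lambda item: item[2],
        reverse=True,
    )
    for speech, gesture, score in combined[:3]:
        print(f"{speech!r} + {gesture!r}: {score:.2f}")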

In HRI, multimodal systems often process speech with deictic (pointing) gestures integrated at the semantic level [19] [20] [21]. The methods implemented vary from frame-based matching to constraint-based integration to learning algorithms. In these systems, object recognition was used to identify which object a pointing gesture was directed at, to resolve anaphoric ambiguities in the speech. The multimodal fusion in [21] only combined the gesture inputs if there was a need for disambiguation, because their gesture recognition accuracy was low, an idea supported by [13]. Current multimodal systems may use contextual knowledge, but very few assess their practical implementations [19] [20] [21].

    1.3 Contributions

A robot developed to interact with humans in human-centric environments requires both transactional and spatial intelligence. Transactional intelligence comprises the skills to interpret what was communicated and to interact with the human to resolve ambiguities if a difficulty occurs. In this work, the transactional intelligence consists of speech and gesture recognition as well as their probabilistic combination in CIMMI, a Contextually Informed MultiModal Integrator. CIMMI combines the speech and symbolic gestures semantically, where speech is the primary mode of input. The semantic relationships between the speech and gestures are manually defined in an associative map, which indicates whether a word positively, neutrally or negatively reinforces the semantic meaning of the corresponding gesture. The symbolic gestures are also used to substitute for a speech word semantically. In this work, a substitution occurs if an action word is misrecognized in the speech; the recognized gesture then provides the action word in the command hypothesis.
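The following Python sketch illustrates the idea of an associative map and of gesture substitution as described above; the entries, gesture names and return values are invented for illustration and do not reproduce the map actually used in this work.

    # Hypothetical associative map: +1 means the spoken action word reinforces the
    # gesture's meaning, 0 is neutral, -1 means the pair conflicts.
    ASSOCIATIVE_MAP = {
        ("come", "come_gesture"): +1,
        ("stop", "come_gesture"): -1,
        ("get",  "come_gesture"):  0,
        ("go",   "go_gesture"):   +1,
        ("come", "go_gesture"):   -1,
    }
    GESTURE_MEANING = {"come_gesture": "come", "go_gesture": "go"}

    def fuse_action(action_word, gesture):
        """Return (action, relation) after checking the speech/gesture pair."""
        if action_word is None and gesture is not None:
            # Substitution: the gesture supplies a missing/misrecognized action word.
            return GESTURE_MEANING[gesture], "substituted"
        relation = ASSOCIATIVE_MAP.get((action_word, gesture), 0)
        if relation < 0:
            return None, "conflict"  # must be confirmed with the user
        return action_word, "reinforced" if relation > 0 else "neutral"

    print(fuse_action("come", "come_gesture"))  # ('come', 'reinforced')
    print(fuse_action(None, "go_gesture"))      # ('go', 'substituted')
    print(fuse_action("come", "go_gesture"))    # (None, 'conflict')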

Spatial intelligence refers to being able to understand and navigate through a space, and to know the location and function of objects in that space and how to manipulate them. In this work, a simple navigation and obstacle avoidance system and an object recognition system were implemented to demonstrate the successes, and failures, of CIMMI.

The transactional and spatial intelligences can be informed by contextual knowledge, both conversational and situational. Conversational context is built by previous dialogue between the robot and the user. It can be used to resolve the identity of an object: if a yellow cup was requested earlier in the conversation, then when the speaker asks about 'the cup', it could be assumed that it is the same cup as the yellow one. This conversational information is used to clarify object identities and locations in this work.

Situational context includes spatial information about known objects in the environment, the robot and the user. The known object knowledge is used if a speaker requests 'the green cup' without stating the location of the cup. The robotic system can resolve this missing location by referencing the green cup in the known object knowledge and returning its last known location. This situational context is used to clarify objects' identities (i.e. colour) and locations in this work. The robot and user locations are not used to clarify ambiguities in the multimodal commands, but they do dictate how the robot responds to particular requests, i.e. whether the robot is already in the correct location.

The transactional intelligence developed in this thesis primarily involves fusing the speech and gesture modes in a probabilistic multimodal integration method to generate a command hypothesis that can be correctly responded to by the robot. The command hypothesis is selected by a weight-based system, where the weights are calculated using the transactional intelligence and contextual knowledge. A weight signifies CIMMI's confidence in a command hypothesis, where the weight is based on:


• the speech/gesture pair relationship according to the associative map
• whether any word in the speech input has an associated risk value
• whether the object or location mentioned was discussed in the conversation history
• whether the object mentioned exists in the object knowledge

These weights, combined with the individual probabilities of the speech and gesture recognition systems, support the selection of a multimodal command to be responded to by the robot, so long as it has no ambiguities.
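A minimal sketch of how such a weight might be computed from the criteria listed above is given below. The field names, bonus values and combination rule are assumptions made for the example, not the weights defined in chapter 3.

    # Illustrative weighting of one command hypothesis (constants are assumed).
    def hypothesis_weight(hyp, associative_map, risk_values, dialogue_history, object_db):
        w = 0.0
        # Speech/gesture pair relationship (+1/0/-1) from the associative map.
        w += 0.25 * associative_map.get((hyp.get("action"), hyp.get("gesture")), 0)
        # Penalize words with an associated risk value (frequently misrecognized).
        w -= sum(risk_values.get(word, 0.0) for word in hyp.get("words", []))
        # Reward objects or locations already discussed in the conversation history.
        if any(hyp.get("object") == past.get("object") or
               hyp.get("location") == past.get("location") for past in dialogue_history):
            w += 0.25
        # Reward objects that exist in the known-object database.
        if any(hyp.get("object") == obj["type"] for obj in object_db):
            w += 0.25
        # Combine with the recognizers' own probabilities.
        return w + hyp.get("speech_prob", 0.0) + hyp.get("gesture_prob", 0.0)

In such a scheme, the hypothesis with the highest resulting weight would be put forward as the multimodal command, provided it contains no ambiguities of the kinds defined below.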

In this work, ambiguities are defined as:

• missing or uncertain information, including references to non-existent objects
• a conflict between the selected speech/gesture pair
• low individual probabilities of the speech input

Missing information can be resolved by using the symbolic gestures, by using the contextual knowledge, or by requesting more information from the user. A conflict between the speech/gesture pair, or low probabilities, needs to be resolved by requesting confirmation of the spoken input from the user.
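As a sketch of how these kinds of ambiguity could be tested for, the hypothetical function below returns the first problem it finds, or None if the command can be executed. The required fields and the confidence threshold are illustrative assumptions, not values taken from this thesis.

    # Illustrative ambiguity check for a selected command hypothesis.
    def find_ambiguity(hyp, object_db, min_speech_prob=0.4):
        missing = [f for f in ("action", "object", "location") if not hyp.get(f)]
        if missing:
            return "missing information: " + ", ".join(missing)
        if not any(obj["type"] == hyp["object"] for obj in object_db):
            return "reference to an object not in the known-object database"
        if hyp.get("gesture_conflict"):
            return "speech/gesture conflict: confirm with the user"
        if hyp.get("speech_prob", 0.0) < min_speech_prob:
            return "low speech confidence: confirm with the user"
        return None  # unambiguous; the robot can respond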

    1.4 Thesis Overview

In this chapter, some of the concepts associated with multimodal integration, including the modes and methods, were discussed and the motivation for this work explained. In the next chapter, a review of the previous work done in these fields is presented in more technical detail. This review is not a complete survey of the fields involved, but it does discuss the key ideas and issues involved in multimodal human-robot interaction.

With this survey as a base, chapter 3 discusses and justifies the concepts used in this thesis, with particular emphasis on the multimodal integration methodology. The speech and gesture recognition systems, which provide the inputs to the multimodal integration algorithm, are described, as is the demonstration system, as implemented on Wakamaru - a humanoid robot.

Having described the work completed, the experimental testing and results are presented in chapter 4. The speech and gesture recognition systems are assessed as unimodal systems and then compared to the multimodal results of CIMMI. CIMMI is then tested in some robotic demonstrations, where object recognition and navigation are assessed individually first. The successes, and failures, of these experiments are then discussed in chapter 5.

In the final chapter, the future work for this project is outlined, and the conclusions of the work are presented.


2 Literature Review

    2.1 Service and Field Robotics

Since Al-Jazari's mechanical boat of musicians in the 13th century [22], people have created machines to perform both simple and complex tasks [23] [24] [25]. In 1954, George Devol and Joseph Engelberger invented Unimate, the first programmable and teachable robotic arm, which became the basis for the now flourishing industrial robotics industry [26]. While industrial robotics is one major area of modern robotics, the other is service and field robotics. Industrial robots are used because they perform their tasks better than a human in terms of speed and precision, and relieve humans of the tedium of repetition. In comparison, service and field robots are used because a human needs, or prefers, a robot to perform a dangerous, inaccessible or assistive task.

Field robotics can be divided into three key areas: search and rescue robotics, space exploration, and military, police or security robotics. Search and rescue robots are designed to access places where it may be too dangerous for a human to go, such as a bush fire or a collapsed building [27] [28]. Space exploration robots, like the Mars rovers, explore where a human is, so far, unable to go [29] [30]. More recently, space robotic aids, like the Robonaut [31] [32], have been developed to assist astronauts in performing tasks, though at this stage these robots are teleoperated. The functions of military and police robots include bomb detection and defusing, security and surveillance, and transportation [33] [34] [35].

Service robots are designed to interact with humans directly and to perform tasks to assist a human, often in everyday chores of a tedious nature or where assistance is needed by a frail or disabled person. Service robotics can be broken into several key domains:

Entertainment: Entertainment robots, such as Sony's Aibo [36] and WowWee's RoboSapien, perform interesting acts by teleoperation and are quite popular purchases for consumers despite their lack of truly social interaction.

Public guides: Public guides, such as those in a museum or airport, require a much higher degree of interaction to accomplish their more complex tasks, but the interaction can be by touch screen [37] [38] [39] rather than by speech [40].

Health care robotics: Health care robots have been designed to assist the elderly and infirm by guidance and scheduling reminders. This is discussed further in the next section.

Personal assistants: Personal robots have been developed to serve people in human-centric environments, such as offices and homes. This is discussed in more detail in Section 2.3.

    2.2 Healthcare Robotics

The design and development of healthcare robots to assist the elderly and infirm has been broadly researched in recent years, as statistics indicate there will be a greater proportion of elderly people in the future [1] [41]. According to the Australian Bureau of Statistics, in 2004, 13% of the population was over 65. This figure is expected to increase to between 26% and 28% by 2051 [2]. Furthermore, in 2004, 1.5% of the population was over 85 years old, and this is also expected to rise to between 6% and 8% by 2051 [2]. While assistive technologies are unlikely to replace human carers in the future, they could provide viable alternatives. A robot carer could assist an elderly person in a domestic environment, allowing that person to be independent and remain in their home while still having some basic support and security. Research into assistive technology ranges from monitoring systems [42] [43] [44] to scheduling systems [45] [46] [47] to guidance systems [47] [48] [49] and robotic assistants [50] [51] [45] [47].

Healthcare robots include Pearl, which was developed by the NurseBot Project [45] [47] to locate a person, remind them of an appointment and guide them to their destination. The platform was a mobile unit with unmoving walker arms for support and a touch screen. The touch screen and speech synthesis provided the human-robot interaction, which was adequate for the testing of this system in a nursing home.

The Care-O-bot system was developed by Hans, Graf and Schraft from Fraunhofer IPA, Stuttgart, and could perform fetch-and-carry and various other supporting tasks in a domestic environment [50] [51]. It had a gripper arm, allowing it to lift and hold objects, but the robot itself could also be used as a support and walker. The system included a simple scheduler, could make emergency phone calls and, with its touch screen monitor, could provide videophone, TV and stereo interfaces. The human-robot interaction was performed multimodally, but not cooperatively, as the system could either process a speech input or a touch screen input, but did not integrate both.

    2.3 Social Robotics

Personal assistants ten years ago were merely fetch-and-carry systems, where the focus of the systems was the planning, mapping, localization and obstacle avoidance [52] [53]. More recently, there has been an emphasis on making robots more social so that they can perform tasks and provide information more accurately [21] [20] [54] [55]. Bartneck and Forlizzi [3] defined a social robot as:

    ... an autonomous or semi-autonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact.

Breazeal [56] comments that humans have to develop socially and emotionally to survive in our society, so it should not be surprising that we must give robots, or have them learn, these same skills to the same end. This point is reiterated by Forlizzi [57], who introduced Roomba robots into domestic environments and performed an ethnographic study of the human-robot interactions. Roomba, compared to another robot vacuum cleaner, Flair, inspired social interaction, in some cases being nicknamed and spoken to as an entity, even though it could not socially interact itself. While the author does not give a reason for the different treatment, it is assumed that it is because of Roomba's autonomy, competency and cute appearance compared to Flair.

The form of the robot is another interesting area of research, particularly with regard to human-robot interaction (HRI). Many robotic assistants use graphical displays of a face or a user interface on top of a robotic platform, such as RHINO [37], BIRON [58], Care-O-bot [51], Nursebot [47], Valerie the Roboceptionist [54], and Grace and George [59]. These displays were often chosen for their expressiveness and reliability compared to the mechanical alternative. However, it is acknowledged that the lack of a physical embodiment does make HRI more difficult, especially as the human often has trouble knowing where the robot is looking [54].

Most socially interactive robots are anthropomorphic in some regard, with the well-known exceptions being robotic pets, such as PARO - a baby harp seal robot [60]. Some of these robots may have only an expressive face as the human-like part, such as Kismet, an expressive anthropomorphic robot head [61], and Sparky, a small mobile robot with an expressive face [62]. Most, however, are humanoid, with a mechanical embodiment of a face on either a bipedal [63] or wheeled robotic platform [19] [64] [40] [65] [66] [21]. The humanoid form is argued by researchers to be ideal for social HRI because, for robots to interact with humans like a human (using gaze, gestures, etc.), they need to be physically capable of performing such acts [55] [67] [66]. Also, with a humanoid form, a human's expectations of the robot's functions will be natural human interaction, an ideal scenario for HRI [64] [40] [65].

Repliee-Q1 and Repliee-Q2 are an extension of the humanoid-form robots, as these robots are cast from human subjects and have realistic skin and facial features [68]. These robots are androids (or gynoids if female) and have numerous actuators from the waist up controlling all their motion. Because of the realism of these systems, people's interactive expectations are very high.

    2.4 Multimodal Integration

Multimodal integration, or fusion, is the challenge of taking many different modes of input, such as speech, lip tracking, gestures, posture and head pose, and merging all this information, either at the feature level or the semantic level, to create a richer interpretation than any single mode may have provided [4]. It has been proved that this fusion of inputs can resolve errors that may occur in a single input by using partial information from another input - a concept called mutual disambiguation, as defined and proved by Sharon Oviatt in [4].

    2.4.1 Modes of Input

Speech is often the primary mode of communication for people, so it is often used as the primary mode in both human-computer and human-robot interactions [69] [70] [19] [20] [21] [71] [72]. Despite being the primary mode of communication, it is rarely the only one. According to McNeill, 90% of gestures occur in conjunction with some speech [73]. People commonly use different types of gestures in conjunction with speech to communicate their intentions. Gestures, either pen-based or naturally performed by hand, are often classified in four key ways [73] [5]:

Symbolic: Also called emblems, these are gestures that have a semantic meaning, such as a wave to mean hello. Sign language is a vocabulary of symbolic gestures.

Deictic: Pointing gestures referring to people or objects in a space.

Iconic: Iconic gestures resemble what they convey, such as showing the size, shape or motion of an object.

Metaphoric: These gestures involve the abstract manipulation of an object or tool.

Gullberg found that over 50% of observed spontaneous gestures were deictic gestures [69]. These gestures can replace missing object or location information by pointing to an object or location referred to in the spoken speech [70] [19] [20] [21]. For example, if a user said 'put the lamp here', then the pointing gesture would resolve the location of 'here'. The first system to integrate speech and deictic gestures was described in Bolt's seminal paper, 'Put-That-There' [74].
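Although deictic gestures are not implemented in this thesis, the way a pointing direction can stand in for a missing referent in such systems can be sketched briefly. The geometry, object positions and function name below are invented for the example and do not describe any of the cited implementations.

    # Illustrative deictic resolution: pick the known object closest to the pointing ray.
    import math

    def resolve_pointing(origin, direction, objects):
        """Return the object whose position lies closest to the pointing ray."""
        ox, oy = origin
        norm = math.hypot(*direction)
        dx, dy = direction[0] / norm, direction[1] / norm
        best, best_dist = None, float("inf")
        for name, (px, py) in objects.items():
            # Project the object onto the ray; ignore objects behind the speaker.
            t = (px - ox) * dx + (py - oy) * dy
            if t < 0:
                continue
            dist = math.hypot(px - (ox + t * dx), py - (oy + t * dy))
            if dist < best_dist:
                best, best_dist = name, dist
        return best

    objects = {"green cup": (2.0, 1.0), "yellow cup": (-1.5, 2.0)}
    # "Put the lamp here" with the user at the origin pointing roughly toward (1, 0.5).
    print(resolve_pointing((0.0, 0.0), (1.0, 0.5), objects))  # -> 'green cup'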

Alternatively, symbolic gestures have a direct meaning and can be understood without context from additional speech [5]. Because of this independence from speech, they can reinforce, substitute for or conflict with speech [71] [72]. As pen-based gestures, symbolic gestures can be GUI interface clicks or predefined structures, such as arrows, which have a specific meaning in the application [7] [8] [75] [76]. A symbolic hand gesture's form has to adhere to an expected structure to give the gesture its specific meaning; for example, a stop gesture should be indicated by an open palm facing away from the gesturer. In human-robot interaction, symbolic gestures have been broadly explored on touch screen interfaces [45] [47] [50] [51], but not so much as natural hand gestures [6].

More recently, parallel modes of input that complement each other, such as speech and lip tracking, have been fused, where one mode directly reinforces the other [15] [14].

    2.4.2 Contextual Knowledge

Contextual knowledge is instinctively used by people to communicate with each other. However, it is only sometimes used in unimodal or multimodal systems to select a command to be performed. It can have many forms and uses, as context is a broad area of information. Some of the key contexts used in HRI systems are conversational and situational.

Conversational contextual knowledge is where the previously recognised commands contribute to selecting the current command [40] [77] [20] [21] [78] [79] [80] [81]. Storing this dialogue history is also a method of resolving ambiguous commands, by combining information provided by the speaker to form an unambiguous command [40] [20] [21] [77].

An extension of conversational contextual knowledge is the ability to limit a speech recognition vocabulary depending on an expected answer [40] [20]. When a vocabulary consists of 10,000 words or more, the likelihood of misrecognitions increases. To decrease the number of misrecognitions, sub-vocabularies can be used to lessen the range of words recognized by the system. Alternatively, sub-grammars can be used to improve the speech recognition performance, where a grammar defines the form the recognized spoken utterance can take [77].
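A simple sketch of this idea, with dialogue states and sub-vocabularies invented purely for the example, might look as follows.

    # Illustrative sub-vocabulary selection based on the expected answer type.
    SUB_VOCABULARIES = {
        "expect_location": ["kitchen", "desk", "table", "charger"],
        "expect_object":   ["cup", "bowl", "plate", "bottle"],
        "expect_yes_no":   ["yes", "no"],
    }

    def active_vocabulary(dialogue_state, full_vocabulary):
        # Restrict the recognizer when a particular kind of answer is expected,
        # otherwise fall back to the full vocabulary.
        return SUB_VOCABULARIES.get(dialogue_state, full_vocabulary)

    full = ["get", "bring", "take", "cup", "bowl", "kitchen", "desk", "yes", "no"]
    # After the robot asks "Where is the yellow cup?", only location words are expected.
    print(active_vocabulary("expect_location", full))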

Contextual knowledge can also be situational, where the locations of the agents involved contribute to selecting the current command [70] [77] [80]. Situational context is called situatedness in speech recognition applications [82] [77] [83]. Situational context has been defined in a number of ways, including by plan recognition [82] [83] and by information about objects in the environment [21] [82] [77].

    2.4.3 Methods of Integration

In addition to the types of modes and knowledge that are fused multimodally, how they are fused is also an important field of research. Virtual World [75], Finger Pointer [84], VisualMan [76], Jeanie [85] and QuickSet [7] [8] all integrated speech and either pen- or gesture-based motion at a semantic level. QuickSet, though, was the first to use a unification-based system, a logic-based method for combining the meanings from two modes into a single common interpretation. The other systems used frame-based integration, a pattern-matching method for fusing data structures of semantic information into a single common interpretation.

QuickSet would take two n-best lists of speech and gesture semantic information, where an n-best list is a list of n possibly correct interpretations of each mode. It then generated a list of multimodal commands by exhaustively combining the speech and gestures that temporally align, that is, where a gesture precedes or overlaps the speech [86]. The list of multimodal commands was then sorted statistically using the probabilities of each list item from the individual modes. Finally, a multimodal command was selected if the speech/gesture pair defining it could be legally combined into an executable instruction.
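The temporal-alignment rule can be illustrated with a short sketch. The four-second window below is an assumption made for the example, based loosely on the idea that a gesture must precede or overlap the speech; it is not a value taken from [86].

    # Illustrative temporal-alignment test: the gesture must overlap the speech
    # or precede it by at most max_lead seconds (assumed window).
    def temporally_aligned(gesture_start, gesture_end, speech_start, speech_end,
                           max_lead=4.0):
        overlaps = gesture_start <= speech_end and speech_start <= gesture_end
        precedes = gesture_end <= speech_start and (speech_start - gesture_end) <= max_lead
        return overlaps or precedes

    print(temporally_aligned(0.0, 1.0, 0.5, 2.0))   # True: overlap
    print(temporally_aligned(0.0, 1.0, 3.0, 4.0))   # True: gesture precedes by 2 s
    print(temporally_aligned(0.0, 1.0, 9.0, 10.0))  # False: too far apart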

While QuickSet was highly expressive and scalable, it did not account for the correct solution not being recognized by one of the input modalities. Some of the QuickSet developers extended their multimodal system to include an associative map representing the legal semantic combinations between all the defined types of speech and pen-based inputs, as well as formalizing the integration process statistically [86]. Others proposed using a finite state automaton which would parse, understand and interpret inputs from speech and gesture [87]. The statistical approach defined by [86] became the driving force for the multimodal field, leading to the development of SmartKom [9], MUST [10], a speech and gaze multimodal system [11] and a speech and mouse interface for remotely controlling a UAV robot [12].


Despite this statistical strength, a variety of approaches to multimodal fusion have been developed in the last five years. A new method for combining multimodal inputs where one or more are only intermittently used (such as gestures) was defined by [13]. The approach involved learning to weight non-verbal features only when they are relevant to the speech features, i.e. when there is an ambiguity. This technique showed a 73% improvement over using the non-verbal inputs whenever available. A semantic-based fusion technique was developed by [16] to make their system domain independent, similar to the more recent work of [17] and [18]. For the modes of speech and lip tracking, coupled hidden Markov models [15] and a method to learn a collection of meaningful multimodal structures [14] have recently been developed, though this field has been explored much more deeply than this [88].

    2.5 Multimodal Human-Robot Interaction

Early multimodal HRI systems did not combine the multiple modes of input, but rather acted on one or the other [50] [51]. One of the earliest multimodal HRI systems that combined the modes was Hermes, by Bischoff et al. [40]. Hermes was a robust humanoid museum robot that used speech, vision and touch, with context information about the nature of the conversation, to interpret a user's intention. The vision system would perceive objects to help resolve pronouns, and the tactile sense allowed Hermes' attention to be redirected [40]. The speech system had a vocabulary of about 350 words but would only check for a handful of words depending on the conversational context, as triggered by key words.

Iba et al. from CMU described a framework where a robot is programmed interactively through a multimodal interface [70]. The user can provide feedback and interrupt tasks to tell the robot to avoid an obstacle or stop moving. The system uses speech and one- or two-handed gloved gestures as the modes of input, though the accuracy of these systems is not reported. The limited speech vocabulary contains about 50 words, and the gesture system has seven single-handed gestures and one two-handed gesture. The multimodal integration gives the symbolic interpretations of the speech and gestures probabilistic values that are used to decide which task should be performed based on the current situational context. This system was demonstrated, but had not been quantitatively evaluated in the most recent publication.

    Rogalla et al developed an HRI system using the service robot Albert [19]. The

    system integrated speech, gestures and object recognition into an event manage-

    ment architecture, where an event is a single input. The event manager transformed

    a certain event or group of events into an action to be performed. Their gesture


recognition achieved an overall accuracy of 95.6% for 6 static gestures using skin segmenta-

    tion and a Fourier description of the normalized contour. Their speech recognition

    system was domain dependent to increase accuracy, though the accuracy was not

    reported.

    Gorostiza et al [89] described a multimodal HRI framework for a personal robot

    assistant. The framework took into account speech, body gestures and tactile sensors

    as the modes of input and used markup languages to encode these functionalities.

    As this project was still in development, they did not specifically outline how the

    multimodal integration would work in their current publications.

    Huwel and Wrede implemented a multimodal human-robot system which used

    speech, gesture and object recognition as well as environmental information [20].

The focus of their project was to enable a robot to carry out an involved dialogue for

    handling instructions. Their speech system was domain independent, could incor-

    porate results from pointing gestures and object recognition, and could interpret

    speech even if the input is not grammatically correct and the recognition incorrect.

    They used situated semantic units to define speech interpretations, with any miss-

ing information presumed to be available in the contextual knowledge. While

    the contextual knowledge is not described, it is suggested that conversational and

object information is used. As a German research group, their users were non-native English speakers, all of these factors contributing to the average 52% accuracy of their speech recognition. Because of the design of their system, though, 61% of the 1642

    test utterances were interpreted as correct utterances with full semantic meaning,

    illustrating how their speech system recovers from the speech recognition errors.

    They do not report the accuracy of their gesture and object recognition systems.

    The most recent HRI publication is [21] by Stiefelhagen et al from the German

research group Sonderforschungsbereich on humanoid robotics. They have developed

    a robotic system whose internal systems include speech recognition, dialogue pro-

    cessing, localization, tracking and identification of a person, pointing gesture recog-

nition and head orientation recognition. Each of these systems has been developed and tested separately, and they are being integrated into a single system. The multimodal

    fusion uses a constraint based system to semantically fuse inputs from speech, point-

    ing gestures and an environmental model [90]. To resolve some speech ambiguities,

    they use conversation history and user identification - contextual information.

    Because their gesture recognition system only detected 87% of gestures, and had

only 47% recognition accuracy, their system uses speech as the primary modality, and only consults the gesture recognition system when disambiguation is required. The

    direction of a pointing gesture generates a list of objects from the environment model

    that are in that direction within an error cone. By comparing the appropriate speech


    token with the objects on the list, a match can be found, resolving the ambiguity.

    Their multimodal system had a success rate of 74% on 102 multimodal user inputs.

    Of the errors, 52% were caused by failing to detect a gesture, 22% because a gesture

    could not be resolved with a high confidence, 17% due to speech recognition errors

and 9% were because the speech and gesture inputs failed timing constraints.

    Based on these implementations of multimodal technology for robotic platforms,

    there are many areas to explore. The practical use of contextual knowledge to

inform a multimodal system needs to be assessed. In addition, kinds of gestures other than deictic ones that might be used to communicate with a robot should be investigated as well.


3 Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses

For a robot to communicate with a person effectively, it must possess transactional intelligence: the skills to understand what was communicated and to interact with the human to resolve ambiguity so as to negotiate what should be acted upon if a

    difficulty occurs. In this work, speech recognition, gesture recognition and CIMMI

represent these skills. But for a robot to respond to a human's request, spatial intel-

    ligence is required to understand and interact with the environment. In this work,

    navigation, obstacle avoidance and object recognition represent these skills. Both of

    these intelligences are informed by some contextual knowledge, both conversational

    and situational. All of these components are described in this chapter.

    The scenarios that the system is expected to accomplish are initially described

    in Section 3.1. For a person to communicate with the robot, both speech and

    gestures are recognized as described in Sections 3.3 and 3.4 respectively. These

    modes of input are combined by CIMMI using the defined contextual knowledge.

This knowledge is described in Section 3.5. The multimodal algorithm

    itself is comprehensively described in Section 3.6.

    For a service robot to act in human-centric environments, it requires spatial

    intelligence as well. In this work, the navigation, obstacle avoidance and object

    recognition systems represent these skills. They were developed to demonstrate the

    multimodal algorithm on a humanoid robot and are described in Section 3.8.

    3.1 Functional Scenarios

    The developed informed multimodal system was designed to accomplish simple tasks

in a domestic environment, perhaps to assist an elderly or frail person. In the laboratory


Figure 3.1: The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The desk is the foreground table, the kitchen is at the furthest table and the table location is to the right with the blue cup on it. The closest Wakamaru robot with the mounted camera is the active one. The other is not used in this system.

    environment where the system was developed, simple rooms were defined with a hall,

    kitchen and office space using short walls to separate the spaces (See Fig. 3.1). Each

    room had a table with one or two objects on it. The simple tasks to be performed

    by the robot include:

Natural interactions The robot should act and respond to users' responses by natural means, such as speech and gestures. The Mitsubishi engineers

    built in a series of smooth and natural-looking motions for when the robot

    was performing preprogrammed actions, such as leaving the charger, but also

    moving around the room. Simple hand and arm gestures were also defined for

    user responses such as waving.

    Providing information The system should be able to provide information about

    the environment, such as object locations, on demand using its situational

    contextual knowledge - the known object database.

Fetch and carry The robot should be able to retrieve an object from a location at the user's request. To accomplish this, the robot must navigate to the object's location, avoiding obstacles along the way. It must then recognize the


    requested object, reach out and touch it. Wakamaru does not have articulated

    hands, so the object is not retrieved. The possibility of using magnets was

explored but was unsuccessful due to the shape of Wakamaru's hands. The

    robot should then return the object to the user. If no object is recognized,

    then the robot should return to the user and inform them of this.

    With these scenarios in mind, the transactional intelligence was developed to

    accomplish the human interaction components of the system, as described in the

    following sections.

    3.2 Software Overview

    In this section, the transactional intelligence is outlined conceptually, initially by

    implementation and then as a user interaction. This overview should provide a

    guide for how the different components of the transactional intelligence relate, but

    also the reasons for different design decisions.

    The developed system is modular by design. The capturing and processing of

    the speech, vision and laser-depth data are performed in separate processes on the

laptop, as it is a more powerful system than Wakamaru's onboard computer. CIMMI and the robotic response system run as a single-threaded process on Wakamaru. The speech recognizer is a client, triggering actions in the robotic system. The

    vision system is a server that detects gestures while it waits for instructions from

    Wakamaru to return the last found gesture, detect obstacles, or recognize objects.

    The laser system is a server that waits for an instruction to return depth data for

    obstacle avoidance.
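As an illustration of this client/server split, a vision server of this kind could be structured as in the sketch below. This is a minimal, hypothetical sketch: the command strings, port number and plain-text message format are assumptions made for illustration, not the actual Wakamaru interface.

```python
import socket

# Hypothetical command strings; the real Wakamaru protocol is not given in the text.
HOST, PORT = "0.0.0.0", 9000

def handle(command, state):
    """Dispatch a single request from the robot-side process."""
    if command == "LAST_GESTURE":
        return state.get("last_gesture", "NONE")
    if command == "DETECT_OBSTACLES":
        return "OBSTACLES 0"          # placeholder result
    if command == "RECOGNIZE_OBJECTS":
        return "OBJECTS none"         # placeholder result
    return "ERR unknown command"

def serve():
    state = {"last_gesture": "WAVE"}  # updated continuously by the gesture tracker
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                command = conn.recv(1024).decode().strip()
                conn.sendall(handle(command, state).encode())

if __name__ == "__main__":
    serve()
```

The robot-side process would open a short connection, send one of the command strings and block on the reply, which keeps the multimodal integration itself single threaded.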

    A user communicates with Wakamaru by speech and dynamic symbolic gestures

    to complete specific tasks. The speech recognizer detects a single utterance, trigger-

    ing the gesture recognizer to return the last detected gesture. Retrieving the last

    recognized gesture, rather than detecting the current one, is done because previous

    studies showed that when any meaningful gestures were made, 85% of the time it was

    accompanied by a spoken keyword temporally aligned during and after the gesture

    [91] and gestures could precede the deictic word by as much as four seconds [88].

Kettebekov and Sharma also showed that 93.7% of the time, a deictic or iconic gesture

    was temporally aligned with the semantically associated nouns, pronouns, spatial

    adverbials, and spatial prepositions [91]. While this system does not semantically

    recognize deictic or iconic gestures, it assumes this statistic would apply to the re-

    lationship between symbolic gestures and the semantically associated speech action

    word, the first word of a sentence.


Figure 3.2: Overview of the Transactional Intelligence, and the Contextual Knowledge which informs it.

    The speech and gesture recognizers generate n-best lists of recognition hypothe-

    ses, where n = 5 for speech, as selected by experimentation (See Section 4.1), and

    n = 3 for gestures, as there are only 3 gestures implemented. These recognizers

are scalable, as words can be added to the vocabulary file for parsing, and training data added to the gesture recognition file for classification (See Sections 3.3 and 3.4

    for more information). However, in the current implementation additional actions

cannot be clarified by CIMMI or responded to by the robot, as these components

    are hard coded. Implementing a more flexible clarification process is a short term

    goal of this work. The recognition hypotheses in the n-best lists are sent to CIMMI

along with each hypothesis's probability of correctness.
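For concreteness, the hypothesis lists handed to CIMMI can be pictured as simple records like the ones below. The field names and the bundling function are illustrative assumptions; the thesis only specifies the parsed structure [action, colour, object, location, probability] and the list sizes.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeechHypothesis:
    """One parsed entry of the speech n-best list (n = 5 before duplicates merge)."""
    action: Optional[str]       # None marks a slot the utterance did not fill
    colour: Optional[str]
    obj: Optional[str]
    location: Optional[str]
    probability: float          # scaled so that the whole list sums to 1.0

@dataclass
class GestureHypothesis:
    """One entry of the gesture n-best list (n = 3, one per implemented gesture)."""
    gesture: str                # "come", "go" or "wave"
    probability: float

def to_cimmi_message(speech: List[SpeechHypothesis],
                     gestures: List[GestureHypothesis]) -> dict:
    """Bundle both lists into a single structure handed to CIMMI (layout is hypothetical)."""
    return {"speech": speech, "gestures": gestures}
```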

CIMMI selects the best speech/gesture pair, a command hypothesis, based on the transactional intelligence and the contextual knowledge. See Section 3.5 for details about the implemented contextual knowledge. The selected hypothesis is checked for ambiguities according to some predefined rules (See Section 3.6). If an ambiguity is

    not resolved beyond a satisfactory threshold of certainty, then the robot will ask the

    user for more information. If there is no ambiguity or an acceptable level of certainty,

then the hypothesis is passed on to the robotic response finite state machine so the

    robot can appropriately respond to the input. This overview of the transactional

    intelligence is represented by Fig 3.2.
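A rough sketch of this select-then-check loop is given below. The scoring formula, threshold value, gesture probabilities and the ambiguity rule are all assumptions made for illustration; CIMMI's actual integration and rules are described in Section 3.6.

```python
# Each speech hypothesis: (action, colour, object, location, probability); values follow Table 3.4.
speech_nbest = [("get", "yellow", "cup", None, 0.37),
                ("get", "yellow", None, None, 0.31),
                ("get", "yellow", "cut", None, 0.31)]
# Each gesture hypothesis: (gesture, probability); these probabilities are invented.
gesture_nbest = [("go", 0.6), ("come", 0.3), ("wave", 0.1)]
# Associative map weights for (speech action word, gesture); cf. Table 3.3.
assoc = {("get", "go"): 2, ("get", "come"): 0, ("get", "wave"): 1}

def select_command(threshold=0.3):
    """Score every speech/gesture pairing and either dispatch it or ask for clarification."""
    scored = [((s, g), s[4] * g[1] * assoc.get((s[0], g[0]), 1))
              for s in speech_nbest for g in gesture_nbest]
    (speech, gesture), score = max(scored, key=lambda item: item[1])
    if speech[2] is None or score < threshold:      # an illustrative ambiguity rule
        return ("clarify", speech)                  # e.g. ask "Which object do you mean?"
    return ("execute", (speech, gesture))           # hand over to the response state machine

print(select_command())
```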

    Having outlined the transactional intelligence in this work, each component will

    be described in more detail in the following sections.


Figure 3.3: A diagram showing the speech recognition process, from capturing the spoken utterance through to generating a list of parsed recognition hypotheses

    3.3 Speech Recognition

Speech is the primary form of communication between people, so it makes sense for a robot interacting with people to recognize this mode of input. For this reason, speech recognition is implemented as part of the transactional intelligence

    of this human-robot interactive system. As an overview of this system, the speech

    recognition engine processes a spoken utterance and generates a list of recognition

    hypotheses. These hypotheses are then parsed using a defined vocabulary and these

    interpreted hypotheses, and their probabilities of correctness, are sent to Wakamaru.

    This process is described in detail in the following section.

The developed speech recognition system uses the Microsoft Speech SDK v5.1

    because it is well documented and provides an interface to the speech recognizer for

    developing speech recognition applications.

    The Microsoft Speech Recognizer Engine (SRE) v5.1 was based on Whisper

    (Windows Highly Intelligent SPEech Recognizer) speech engine [92] which in turn

    was based on SPHINX-II [93]. A speech recognizer consists of an acoustic model to

    represent the sounds, a decoder to interpret possible words that could be represented

by those models, and language models which define the probabilities of sequences of

    words. The acoustic models in the SRE use hidden Markov models (HMMs) to

represent groups of three phonemes, where a phoneme is a single speech sound [94]. The language models

    use both bigram-class-based and trigram-word structures where bigram-class-based

    structures are pairs of words sorted into context-dependent classes [95] and, by

example, trigram-word structures are "the quick brown" and "quick brown fox" for the utterance "the quick brown fox" [96].
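To make the n-gram terminology concrete, the snippet below extracts word bigrams and trigrams from an utterance. It only illustrates the idea of word n-grams; it is not the SRE's internal language-model format.

```python
def ngrams(utterance, n):
    """Return the overlapping n-word sequences of an utterance."""
    words = utterance.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the quick brown fox"
print(ngrams(sentence, 2))  # ['the quick', 'quick brown', 'brown fox']
print(ngrams(sentence, 3))  # ['the quick brown', 'quick brown fox']
```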

    Training the acoustic models required reading provided passages where the

    sounds were encoded into HMMs. Approximately seven hours of training was needed

    before this American speech recognition engine was recognizing most of the 80 words

    in the desired vocabulary. A training file was also developed containing the specific

1 Microsoft Speech SDK 5.1 is available for download from: http://www.microsoft.com/speech/speech2007/downloads.mspx.


utterances the system should recognize, such as "bring me the blue cup". To recog-

nize a spoken utterance, the SRE generates HMMs representing groups of three phonemes, and these are concatenated together to represent the spoken utterance. The HMMs

    are then decoded into words using the learned language models, selecting the words

    with the highest probabilities. A list of hypotheses is generated by selecting alter-

    native words with the next highest probabilities. This much training was required

to refine the acoustic models so that they could accommodate subtle variations in the user's pronunciation, because words are pronounced slightly differently depending on the neighbouring words in the spoken utterance.

    In the developed speech system, a spoken utterance is captured and a list of five

    recognition hypotheses is generated. The number of recognition hypotheses gener-

    ated, five, was selected by experimentation, as it encompassed 97% of the correct

    recognition results for the tested dataset (See Section 4.1 for more information). A

    recognition hypothesis consists of a list of words that were possibly spoken by the

    user and a confidence value generated by the SRE (See Table 3.1). Generating a list

    of possible recognitions is done because the first recognition according to the speech

    engine is not always the correct one.

The confidence values for each recognition hypothesis are scaled so that the values sum to 1.0, i.e. they form a probability distribution. The hypotheses are also parsed, detecting an action, a colour, an object type and/or a location, similar to [19]. However, for the recognized hypotheses to be parsed, the nickname for Wakamaru, "Waka", must be recognized in at least one of the hypotheses. The reason for requiring "Waka" was that the speech recognizer would otherwise try to recognize ambient sounds in the laboratory, such as other conversations or music, as spoken utterances. To avoid these sounds being passed to CIMMI and possibly interfering with experiments, the name "Waka" needed to be recognized so that only a spoken utterance clearly directed at Wakamaru would be processed.
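A minimal sketch of this normalisation and the "Waka" gate is shown below, using the raw confidence values from Example 1 of Table 3.1; the function names are illustrative.

```python
def normalise(hypotheses):
    """Scale raw SRE confidence values so they sum to 1.0."""
    total = float(sum(conf for _, conf in hypotheses))
    return [(text, conf / total) for text, conf in hypotheses]

def addressed_to_robot(hypotheses, name="waka"):
    """Only process the utterance if at least one hypothesis contains the nickname."""
    return any(name in text.lower() for text, _ in hypotheses)

nbest = [("bring the yellow cup Waka", 111484),
         ("bring the yellow cup water", 111484),
         ("bring the yellow cab Waka", 93467),
         ("bring the yellow tape Waka", 93467),
         ("bring the yellow cut Waka", 93467)]

if addressed_to_robot(nbest):
    for text, prob in normalise(nbest):
        print(f"{prob:.2f}  {text}")
```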

Each hypothesis is parsed by the developed system using a domain-dependent

    vocabulary, where each keyword has a:

Lexical category (such as noun, verb, etc.) represented by a number. If a word is in the vocabulary then the word's unique ID number and its probability of correctness are stored in the parsed recognition structure, [action, colour, object, location], depending on its defined lexical category (a sketch of this parsing is given after this list). This structure is a simple form of case grammar, where a different

    number of slots would be defined depending on the action word recog-

    nized [20]. Because of the simple utterances that were being recognized,

    this was not necessary for the implemented system.


Table 3.1: Sample Unparsed Speech Recognitions

Ex.  Spoken Utterance            Recognition Hypotheses         Confidence
1    bring the yellow cup Waka   bring the yellow cup Waka      111484
                                 bring the yellow cup water     111484
                                 bring the yellow cab Waka      93467
                                 bring the yellow tape Waka     93467
                                 bring the yellow cut Waka      93467
2    fetch the yellow cup Waka   fetch the yellow cut Waka      112738
                                 fetch the yellow cut off       112738
                                 fetch the yellow cup Waka      107971
                                 fetch the yellow cup soccer    107971
                                 fetch the yellow cab Waka      86016

Table 3.2: Sample Risk Values

Words        Time  Help  Come  Go  Stop  Hello  Cup  Desk
Risk Values  0     10    0     0   10    0      0    0

Risk value The risk value represents the semantic importance of certain words, where the risk values are higher for words whose misinterpretation could have seriously negative consequences, such as if the words "help" and "stop"

    were misrecognised (see Table 3.2). Gestures do not have risk values in

    this implementation of the system, as symbolic gestures with high risk

    meanings, such as stop, have not been implemented (See Section 3.4 for

    more information).

    Currently the risk values are binary, as illustrated in Table 3.2, but this

    can easily be extended so risk values can be a greater range of values, or

    fractional values. These values could also be learned by user interaction.

A list of values relating the word semantically to each of the gestures defined in the system. The values relating a word to each gesture are de-

    fined using an associative map, similar to [86]. The map is manually

defined in the vocabulary file based on the context of a domestic environment and the developer's common sense. For example, the gesture "Come" is contradictory to the word "Go", so the corresponding associative map value is 0 (See Table 3.3). Also, the utterance "What's the time?", represented as "Time" in Table 3.3, is not complementary or contradictory to

    any gesture, so its associative map values are always 1. This associative


Table 3.3: Sample Values from the Associative Map

Words         Time  Help  Come  Go  Stop  Hello  Cup  Desk
Come Gesture  1     2     2     0   0     1      2    2
Go Gesture    1     2     0     2   0     1      2    2
Wave Gesture  1     2     1     1   1     2      1    1

Table 3.4: Sample Parsed Speech Recognitions

Example  Spoken Utterance       n-Best List Parsed Input  Prob.
1        Bring the yellow cup   Get Yellow Cup            0.37
                                Get Yellow                0.31
                                Get Yellow Cut            0.31
2        Fetch the yellow cup   Get Yellow Cut            0.37
                                Get Yellow Cup            0.35
                                Get Yellow                0.28

    map could be adapted to individual users with learning algorithms and

    also contain fractional values. This associative map is a key factor of

    mutual disambiguation in our system [97].
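Putting the three parts of a vocabulary entry together, a hypothetical entry table and the slot-filling parse could look like the sketch below. The word IDs, the file layout and the values for words not listed in Tables 3.2 and 3.3 are assumptions; where a word does appear in those tables, its risk and associative values are copied from them.

```python
# Hypothetical vocabulary entries: word -> (lexical category, word ID, risk value,
# associative map row). "bring" and "get" share an ID, mirroring Table 3.4 where
# "bring" parses to the Get action.
VOCAB = {
    "bring":  ("action",    1,  0, {"come": 2, "go": 0, "wave": 1}),
    "get":    ("action",    1,  0, {"come": 0, "go": 2, "wave": 1}),
    "help":   ("action",    2, 10, {"come": 2, "go": 2, "wave": 2}),
    "yellow": ("colour",   10,  0, {"come": 1, "go": 1, "wave": 1}),
    "cup":    ("object",   20,  0, {"come": 2, "go": 2, "wave": 1}),
    "desk":   ("location", 30,  0, {"come": 2, "go": 2, "wave": 1}),
}

def parse_hypothesis(text, probability):
    """Fill the [action, colour, object, location, probability] structure from one
    recognition hypothesis (the real system stores numeric word IDs in the slots)."""
    slots = {"action": None, "colour": None, "object": None, "location": None}
    for word in text.lower().split():
        if word in VOCAB:
            category, word_id, risk, assoc_row = VOCAB[word]
            slots[category] = word
    return [slots["action"], slots["colour"], slots["object"], slots["location"], probability]

print(parse_hypothesis("bring the yellow cup Waka", 0.37))
# -> ['bring', 'yellow', 'cup', None, 0.37]
```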

    The vocabulary can be extended beyond the 80 words defined by adding new

    words to the text file along with the necessary values. However, additional action

    words cannot be added to the vocabulary without extending the robotic response

    system as this is hard-coded for Wakamaru and the implemented action words only.

    Different vocabulary files could also be defined for different contexts, such as the

    kitchen, bathroom or office based on the recognition of keywords as done by [40].

    Once all the recognition hypotheses have been parsed, the parsed recognition

    structures and probabilities, [action, colour, object, location, probability], are sent to

    Wakamaru to be combined with the gesture recognitions and contextual knowledge.

    These two components of the transactional knowledge will be defined in the following

    two sections. The overall speech recognition process is illustrated in Fig. 3.3. This

    style of diagram will be used to describe the different components of the transactional

    intelligence in the following sections.


    3.4 Symbolic Gesture Recognition

    In human communication, while speech may be the primary form of interaction used

    by people, they instinctively gesticulate to express themselves as well. According to

    McNeill, 90% of gestures occur in conjunction with some speech [73]. Based on this

fact, gesture recognition was implemented in this work. This system is only a toy, proof-of-concept system intended to illustrate the use of a multimodal system in robotics.

    Symbolic gestures were used in this system because they can substitute, reinforce

    and conflict with an action word semantically. Deictic gestures, while more common

    in human communication for anaphora resolution (this, here, him words) and

    object identification, is being thoroughly explored by a number of groups [70] [19][89] [20] [21]. The multimodal combination of symbolic gestures and speech has been

    primarily explored in HCI [7] [8] [91] with some work in HRI [6].

The three symbolic dynamic gestures recognized in this work are "wave", "come here" and "go away". The gestures are dynamic because they involve motion, as opposed to static gestures such as a thumbs-up. These gestures are tracked using OpenCV's Camshift tracking and 3D geometry. Each of the implemented gestures has two or more meanings:

    Come - bring or come

    Go - get, go, call or answer

    Wave - hello, goodbye or help

    The gestural meaning is decided during multimodal integration, as dictated by

the speech action word or, in the action word's absence, the speech object or location words. For example, if the action word is "get", then the semantic meaning of the gesture "Go" is get. However, if the speech action word is missing and only a location word is defined, then it is assumed the semantic meaning is go.
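One way to picture this meaning resolution is a lookup keyed on the recognized gesture and the available speech slots, as sketched below. The meaning sets come from the list above; the fallback rule for a missing action word is a hedged reconstruction of the behaviour just described, not CIMMI's exact rule set.

```python
# Possible meanings of each symbolic gesture, as listed above.
GESTURE_MEANINGS = {
    "come": {"bring", "come"},
    "go":   {"get", "go", "call", "answer"},
    "wave": {"hello", "goodbye", "help"},
}

def resolve_gesture_meaning(gesture, action_word, location_word=None):
    """Pick the gesture's meaning from the speech action word, or fall back on the
    location word when no action word was recognized (an assumed fallback rule)."""
    meanings = GESTURE_MEANINGS[gesture]
    if action_word in meanings:
        return action_word              # e.g. gesture "go" + action "get" -> "get"
    if action_word is None and location_word is not None and "go" in meanings:
        return "go"                     # only a location was spoken, assume "go"
    return None                         # no agreement; left for disambiguation

print(resolve_gesture_meaning("go", "get"))          # get
print(resolve_gesture_meaning("go", None, "desk"))   # go
```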

    For gesture performance, the gesturer must be standing approximately 2 meters

    away from the robot so both a frontal face and the gestures are detectable. If the

gesturer is too far away, the face will be too small to detect using OpenCV's Haar frontal face detector. If the gesturer is too close to the camera then the perspective of the gesture is changed, perhaps resulting in a misrecognition. The face and all pixels to the right of it (as seen by the robot) are ignored to simplify image processing, assuming that gestures are performed right-handed. If a user could be identified (perhaps by using face recognition) as a left-handed person, then a left-hand mode

    could be used.
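A minimal sketch of this face-gated masking is given below using the modern OpenCV Python bindings rather than the original implementation; the cascade file location and the detection parameters are assumptions.

```python
import cv2

# Standard Haar cascade shipped with OpenCV; the path may differ per installation.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mask_right_of_face(frame):
    """Zero out the face and everything to its right in the image (as seen by the robot),
    assuming the right-handed gesture appears on the other side of the face."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(grey, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None                      # face too small or absent: skip this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    frame = frame.copy()
    frame[:, x:] = 0                     # discard the face column and all pixels to its right
    return frame
```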


Figure 3.5: A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter coloured regions are segmented. (b) In the second frame, the user has moved their arm a distance, so the previous motion segmentation and the currently detected motion are added as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion of the person in the bottom right, are detected as shown here.

The motion templates use the previous three frames to build a motion history. This means that differences between the current frame and the last three frames are included in the

    segmented motion image, as shown in Fig. 3.5.

These two masks, the skin and motion segmentations, are then combined (the conjunction shown in Fig. 3.6(c)), resulting in a greyscale image showing moving skin-coloured regions in the frame. This

    image is used to select the regions for the Camshift algorithm to track, as shown in

    Fig. 3.6. In the first frame, there is no motion history so static regions are included

    in the segmentation. However, after 3 frames have been processed, the background

    regions are successfully ignored.
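The combination of the two masks can be illustrated as below. This is a simplified stand-in: plain frame differencing over the last three greyscale frames replaces OpenCV's motion templates, and a fixed HSV range replaces the skin model actually used.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Very rough skin segmentation using a fixed HSV range (an assumption,
    not the thesis' skin model)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))

def motion_mask(grey_frames):
    """Accumulate differences between the current frame and the previous three
    greyscale frames (grey_frames holds the current frame last)."""
    current = grey_frames[-1]
    acc = np.zeros(current.shape, dtype=np.float32)
    for previous in grey_frames[-4:-1]:
        acc += cv2.absdiff(current, previous).astype(np.float32)
    return cv2.convertScaleAbs(acc)

def moving_skin(frame_bgr, grey_history):
    """Greyscale image of moving skin-coloured regions: the motion image masked by skin."""
    motion = motion_mask(grey_history)
    return cv2.bitwise_and(motion, motion, mask=skin_mask(frame_bgr))
```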

    3.4.2 Region Tracking

    The initial search windows for the Camshift tracker are manually defined by labeling

    the greyscale image generated by the skin and motion segmentation. In the next

frame, the algorithm uses the greyscale skin and motion segmentation for the frame and the search window for one region, as defined from the previous frame, to track that region. It is assumed that part of the region in this frame will overlap with the search

    window. CamShift finds the center of a region by iteratively searching in the search

window of the greyscale image. It then calculates the region's size and orientation and returns its bounding box. The bounding box is used as the search window for this region in the next frame. The center of the bounding box is considered the

    (x,y) coordinate of a region for the current frame. This is done for each region that

was segmented in the first frame, of which there can sometimes be as many as 15. However,

    it is assumed only one region will be a gesture.
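The per-region tracking step can be sketched with OpenCV's CamShift call, as below. This is a simplified, modern-OpenCV illustration; the termination criteria and the layout of the search windows are chosen arbitrarily.

```python
import cv2

# Termination criteria for CamShift's iterative search (values are illustrative).
TERM_CRIT = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

def track_regions(grey_moving_skin, search_windows):
    """Run one CamShift update per region: each returned window becomes the search
    window for the same region in the next frame, and the window centre is taken
    as the region's (x, y) coordinate for the current frame."""
    centres, new_windows = [], []
    for window in search_windows:              # window = (x, y, w, h) from the previous frame
        _rot_rect, window = cv2.CamShift(grey_moving_skin, window, TERM_CRIT)
        x, y, w, h = window
        centres.append((x + w // 2, y + h // 2))
        new_windows.append(window)
    return centres, new_windows
```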


Figure 3.6: A single frame from a wave gesture using skin and motion segmentation. (a) A binary image of the skin segmentation. (b) A greyscale image of the motion segmentation. (c) The conjunction of (a) and (b). (d) The colour capture with the detected regions identified.

    Figure 3.7: A diagram of the gesture capture and recognition process


    For each tracked region of each frame, the center (x,y) and the average disparity

    value of the tracked region are translated into (X, Y, Z)