
Heriot-Watt University

Master's Thesis

Multimodal Interactions with a Chatbot and study of Interruption Recovery in Conversation.

Author:

Fantine Chabernaud

Supervisor:

Dr. Franck Broz

A thesis submitted in fulfillment of the requirements

for the degree of MSc. Artificial Intelligence with Speech and Multimodal

Interactions

in the

School of Mathematical and Computer Sciences

August 2017

Declaration of Authorship

I, Fantine Chabernaud, declare that this thesis, titled 'Multimodal Interactions with a Chatbot and study of Interruption Recovery in Conversation', and the work presented in it are my own. I confirm that this work submitted for assessment is my own and is expressed in my own words. Any uses made within it of the works of other authors in any form (e.g., ideas, equations, figures, text, tables, programs) are properly acknowledged at the point of their use. A list of the references employed is included.

Signed: Fantine Chabernaud

Date: the 17th of August 2017


“Were we incapable of empathy – of putting ourselves in the position of others and seeing that their suffering is like our own – then ethical reasoning would lead nowhere. If emotion without reason is blind, then reason without emotion is impotent.”

Peter Singer, Writings on an Ethical Life, 2015

Abstract

This project presents a novel approach to enhancing Human-Robot Interaction (HRI). One goal in HRI is to design robots that seem friendly and natural to interact with. Particular attention is paid to interruptions that can occur during a dialogue and to the recovery of the conversation. Interruptions have many causes, such as a misunderstanding or an external factor. For this purpose, the Pepper robot from SoftBank Robotics was equipped with both a dialogue and a gesture system. The robot monitors the user's state, such as gaze direction, facial expression and the distance between them. When the dialogue is interrupted, the system has to find a coherent recovery based on the verbal and non-verbal cues, and the recovery has to feel natural to the user. The main hypothesis is that gesture behavior should help the robot recover the conversation. The evaluation showed the importance of designing the recovery according to the user's preferences. Although inconclusive, the experiment reveals clues for improving the recovery strategy.

Acknowledgements

I would like to warmly thank Doctor Franck BROZ for supervising my dissertation project, for being understanding and for taking the time to answer my questions.

I am very grateful to Christian DONDRUP for his considerable help in understanding the software, for his patience in answering all my questions and for his kind support. Without him, I would probably not have gotten so far in the implementation of my project.

I thank all my friends and the people I had the chance to work alongside, who contributed to an agreeable and efficient working atmosphere.


Contents

Declaration of Authorship i

Abstract iii

Acknowledgements iv

Contents v

List of Figures viii

List of Tables ix

Abbreviations x

1 Introduction 1

2 Literature review 3

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.2 Note on terms used . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Dialogue system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.2 Spoken Language Understanding . . . . . . . . . . . . . . . . . . . 4

2.2.3 Dialogue Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.4 Natural Language Generation . . . . . . . . . . . . . . . . . . . . . 5

2.2.5 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.6 Functional issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.7 Robot appearance . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Proxemics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.1 Proxemics applied to HRI . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.2 Proxemics to engage the user . . . . . . . . . . . . . . . . . . . . . 7

2.4 Turn-taking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4.1 Conversational strategies to engage the user . . . . . . . . . . . . . 10

2.5 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Requirements analysis 12

3.1 Project scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


3.2 Project objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3.1 Python SDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3.2 Pepper robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3.3 Naoqi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3.4 Choregraphe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4.1 Strategy of recovery . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4.2 Design of the interruption scenarios . . . . . . . . . . . . . . . . . 18

3.4.3 Dialog system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Ethical and legal considerations 20

4.1 Ethical approval details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2 Professional issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Legal issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.4 Ethical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.5 Social issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5 Methodology 23

5.1 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.1.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.1.2 Evaluation of the interruptions scenarios . . . . . . . . . . . . . . . 24

5.2 Experimental procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.3 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.3.2 Objective measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.3.3 Subjective measures . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.3.4 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6 Results 28

6.1 Observations of the evaluation . . . . . . . . . . . . . . . . . . . . . . . . 28

6.2 Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.3 Analysis of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.3.1 Analysis of raw data . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.3.2 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.3.3 Comments analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

7 Discussion 34

7.1 Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7.3.1 What could enhance the implementation? . . . . . . . . . . . . . . 35

7.3.1.1 A smarter system . . . . . . . . . . . . . . . . . . . . . . 35

7.3.1.2 Improved dialogue system . . . . . . . . . . . . . . . . . . 35

7.3.1.3 Improvement of dialogue features . . . . . . . . . . . . . 36

7.3.1.4 A chatbot with memory . . . . . . . . . . . . . . . . . . . 36

7.3.1.5 Improvement of the tracking . . . . . . . . . . . . . . . . 36

7.3.2 What could be changed for the evaluation? . . . . . . . . . . . . . 36


8 Conclusion 38

A Appendix A 39

A.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

A.2 Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

A.3 Project tasks and deliverables . . . . . . . . . . . . . . . . . . . . . . . . . 40

A.4 Example of a questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . 41

A.5 Risk assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Bibliography 47

List of Figures

2.1 Spoken dialogue system architecture . . . . . . . . . . . . . . . . . . . . . 4

2.2 Example of a dialogue by database retrieval . . . . . . . . . . . . . . . . . 5

3.1 Pepper robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Theoretical architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Final architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.1 Environment set up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.2 Measures of naturalness in HRI . . . . . . . . . . . . . . . . . . . . . . . . 25

6.1 Repartition of the interaction likability and the topics . . . . . . . . . . . 30

6.2 Measurement of the differences with/out recovery . . . . . . . . . . . . . . 31

6.3 Comparison between answers at first and second interactions . . . . . . . 32

A.1 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 41

A.2 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 42

A.3 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 43

A.4 Questionnaire used at the evaluation: final questions 40 & 41 . . . . . . . 44

A.5 Questionnaire used at the evaluation: final question 42 . . . . . . . . . . . 45

A.6 Questionnaire used at the evaluation . . . . . . . . . . . . . . . . . . . . . 46


List of Tables

3.1 Sensors and actuators of the Pepper robot . . . . . . . . . . . . . . . . 13

6.1 Two-way repeated measures ANOVA . . . . . . . . . . . . . . . . . . . . . 31

6.2 Two-way repeated measures ANOVA . . . . . . . . . . . . . . . . . . . . . 33

A.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

A.2 Details of the risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

A.3 Details of the main deliverables . . . . . . . . . . . . . . . . . . . . . . . . 40


Abbreviations

AI Artificial Intelligence

AIML Artificial Intelligence Markup Language

ALICE Artificial Linguistic Internet Computer Entity

ASR Automatic Speech Recognition

DM Dialogue Management

HRI Human Robot Interaction

NLG Natural Language Generation

SLU Spoken Language Understanding

SSH Secure SHell protocol

TTS Text To Speech


Chapter 1

Introduction

In everyday conversations, a lot of information is produced and expressed. In addition to the words of speech, visual cues such as body position or gestures contribute to the conversation. They can emphasize an idea or express a feeling such as rejection or interest. Many of these actions are unconscious and often stem from our own personality. For instance, introverted people tend to be softer in their voice and movements than extroverts. Two people interacting in a conversation have to deal with the complexity of their personalities and emotions, and feelings can also filter someone's behavior. For instance, anger or passion induce larger gestures, a louder voice and aggressive postures, whereas annoyance and boredom lead someone's attention away from the source of these emotions.

At some point in a conversation, an interruption may happen. Also called a silence, it occurs when, for some reason, the conversation prematurely ends. Interruptions can be due to an external agent, a reaction of one of the participants or a changing environment. This niche is the focus of the present research, and it belongs to a wider issue in human-robot interaction: the engagement of the user.

Nowadays companies are taking advantage of new dialogue systems to replace or improve their assistance by phone or text message. These systems are artificial intelligences able to hold a coherent dialogue with someone, using speech, graphics, gesture or other communication channels. It is possible, for example, to order a pizza by text message or to find the nearest coffee shop in a shopping mall by asking a robot to show you the way. Dialogue systems designed to carry out such tasks are called task-based systems, whereas dialogue systems that aim to be friendly are social conversational agents, also called chatbots. Their goal is to chat and make small talk, and they are usually associated with a task-based system to help the user in a task, such as a cooking helper or the robot in a shopping mall. They can be embodied in a physical robot that can be talked to, or exist only as a text chat on a computer. In order to gain the user's confidence, robots have to look friendly and behave naturally. Thus current research, especially on humanoid robots, is focused on reproducing human behavior, which relies on knowledge of dialogue and proxemics.

Proxemics is the domain of non-verbal language and of how people use space to interact with others. We know that in everyday life we do not stand as close to someone we have just met as to a long-time friend. This distance rule is part of proxemic behavior, which is shaped by unconscious reflexes and cultural upbringing; the unconscious part takes its roots in empathy. Proxemics is a low-level language, as it does not express ideas as complex as speech does; we are able to communicate without sharing the same language, as can easily be experienced by traveling abroad. In fact, human behavior is intricate to understand and difficult to simulate. Nonetheless, it is possible to make a robot interact and be understood by people using basic common knowledge. Given the sensors and knowledge available on a physical robot, what would be the best way for a chatbot to behave? More precisely, how is it possible to identify an interruption in the conversation, taking into account verbal and non-verbal behavior? For this project, it is interesting to focus on both the dialogue and gesture systems: if the dialogue system breaks down, as occurs in an interruption, the gesture system will help to recover the conversation.

In this research project, the aim is to understand the mechanisms of interruptions in order to find solutions for recovering the conversation. The implementation focuses on interruption by an external disruption, meaning that someone or something catches the user's attention. As this project is related to computer science and robotics more than to linguistics, the focus is on the application of dialogue and gesture systems on a robot; there is no ambition to cover every kind of interruption exhaustively. Therefore, for the purpose of this research, only a few types of interruptions are tested. The goal will be reached if participants talk to the robot naturally. The project is conducted in three main parts. Firstly, much effort is put into studying the features of a conversation. Turn-taking is particularly important for the study of interruptions because both share the characteristic of a changing speaker; the point is to gather information about the structure of dialogue and the effect of an interruption. Secondly, the focus is on proxemics in human-robot interaction, which comprises the design of a recovery method according to the semantics of the conversation and the attitude of the user. Thirdly, the system is implemented on the Pepper robot and evaluated.

Chapter 2

Literature review

2.1 Introduction

2.1.1 Overview

Recovering from an interruption of a conversation involves proxemics, dialogue and, more specifically, turn-taking knowledge. How should the next speaker take the speech turn in a way that feels right to a human? Knowing that a speaker wants to be listened to by others, what should decide someone to take the speaker role? Such questions can be partly answered by previous work in the main domains of turn-taking and proxemic behavior. Turn-taking rules govern the switching of speakers in a conversation; a switch happens when a speaker yields the speech role to someone else. Proxemics is interesting for obtaining more than just verbal information: it covers the whole body language someone uses to emphasize an idea. As relatively little research has been done on interruptions in conversation, the present project looks for information in adjacent domains. A first explanation of interruptions will be drawn from the turn-taking review; in a second part, the review of proxemics will help design a recovery method.

2.1.2 Note on terms used

The terms dialogue system and conversational agent refer to a computer system that can hold a coherent dialogue with a human. References to a chatbot, chatter robot, social robot or social conversational agent all point to the same technology: a dialogue system designed to be friendlier by making small talk.


2.2 Dialogue system

2.2.1 Overview

A dialogue system is a computer process involving a chain of modules, and it is supposed to build a coherent conversation with the user. According to [Tsiakoulis et al., 2012], there are five internationally recognized modules: "automatic speech recognition (ASR), semantic decoding or spoken language understanding (SLU), dialogue management (DM), natural language generation (NLG) and speech synthesis or text-to-speech (TTS)". These modules and their connections are summarized in the following figure.

Figure 2.1: Spoken dialogue system architecture

2.2.2 Spoken Language Understanding

The spoken language understanding module is located after the automatic speech recognition. It receives a signal containing the user's voice and some external noise, and processes the signal to retrieve only the utterance. It then converts the recognized utterance into dialogue acts by parsing: the user's utterance is cut into pieces and those pieces are labeled according to a specific grammar. These acts are then processed by the dialogue manager.
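To make the parsing step concrete, here is a minimal sketch that labels the pieces of an utterance against a hand-written keyword grammar. It is an illustration only: the act names and pattern lists are invented, and a real SLU module would use a far richer grammar or a statistical model.

```python
# Illustrative only: a toy dialogue-act labeller. The acts and keyword
# patterns below are invented for this sketch, not taken from a real system.
ACT_PATTERNS = {
    "greeting": ["hello", "hi", "good morning"],
    "request":  ["can you", "could you", "please"],
    "farewell": ["bye", "goodbye", "see you"],
}

def label_acts(utterance):
    """Match the utterance against the keyword grammar and label its pieces."""
    text = utterance.lower()
    acts = [act for act, cues in ACT_PATTERNS.items()
            if any(cue in text for cue in cues)]
    return acts or ["unknown"]

print(label_acts("Hello, can you order a pizza?"))  # ['greeting', 'request']
```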

2.2.3 Dialogue Manager

The dialogue manager is the core of the robot's understanding. As robots do not have consciousness as we do, they do not properly understand the meaning and semantics of a sentence. However, with basic rules of grammar, it is possible for them to produce coherent responses. The dialogue manager thus looks for the closest meaning of the dialogue acts received from the previous module. Depending on the type of dialogue manager, those acts are compared to acts stored in a database, which holds different dialogue acts and their responses. In a simple database-retrieval system, the given response is the stored utterance closest to the given dialogue acts: according to the acts, the system looks for a complete, ready-made answer. In that case, the next module, the natural language generator, is not really used. For instance, if the user says "Hello, how are you?", the act is a greeting; the robot looks into the greeting category of its database and retrieves the corresponding answer.

Figure 2.2: Example of a dialogue by database retrieval
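The retrieval principle can be sketched in a few lines, assuming a toy database of ready-made answers. Here a simple string-similarity match stands in for whatever matching function a real dialogue manager would use.

```python
import difflib

# Toy database of ready-made answers, indexed by a stored utterance.
DATABASE = {
    "hello how are you": "I am fine, thank you. And you?",
    "what is your name": "My name is Pepper.",
    "goodbye": "Goodbye, see you soon!",
}

def retrieve_response(utterance):
    """Return the ready-made answer whose stored key best matches the input."""
    match = difflib.get_close_matches(utterance.lower(), list(DATABASE),
                                      n=1, cutoff=0.3)
    return DATABASE[match[0]] if match else "Sorry, I did not understand."

print(retrieve_response("Hello, how are you?"))  # I am fine, thank you. And you?
```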

2.2.4 Natural Language Generation

The language generator reads the dialogue act output by the dialogue manager and generates a natural response. It performs the same process as spoken language understanding but in reverse: it puts pieces of utterance together, making a sentence from small parts of a response. The difficulty lies in the naturalness and quality of the final output utterance, which depend on the method used to put the pieces together. The utterance then goes to the speech synthesizer to be heard by the user.

There are three main methods to generate an utterance: rule-based templates, grammar-based generation and machine learning. In the rule-based method, each possible sentence is designed by hand; the templates are written schemas that are completed with the answers given by the dialogue manager. It is simple and allows full control of the responses, but it is hard to maintain, repetitive and manually expensive. The grammar-based method can be used both for parsing in the spoken language understanding module and for natural language generation: a set of grammar rules is defined to write the responses. It gives more flexibility than rule-based templates but is still manually expensive. The third method, machine learning, is the most scalable, portable and precise. It requires prior training with in-domain data to be accurate, and numerous implementations exist. Essentially, the system improves itself just by being used.
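The rule-based template method is easy to illustrate: hand-written schemas are completed with slot values supplied by the dialogue manager. The template texts, acts and slot names below are invented for the example.

```python
# Hand-written templates, completed with values from the dialogue manager.
# The acts and slots are invented for this illustration.
TEMPLATES = {
    "confirm_order": "You ordered a {size} {dish}. Is that correct?",
    "inform_price":  "A {size} {dish} costs {price} pounds.",
}

def generate(act, **slots):
    """Fill in the template associated with the given dialogue act."""
    return TEMPLATES[act].format(**slots)

print(generate("confirm_order", size="large", dish="pizza"))
```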

2.2.5 Hardware

The hardware sits at the input and output of the robot. The input is a microphone and an automatic speech recognizer (ASR), which determines the start and end of the user's utterance by recording what the user says. The output is a speech synthesizer (TTS) and a loudspeaker. The speech synthesizer produces a voice aloud; it is a complex system that is supposed to respect the punctuation and tone of natural speech. Some systems use effects to simulate emotions and avoid a neutral, boring voice.

2.2.6 Functional issues

Current systems are not robust enough in noisy environments, where the speech recognition is often faulty. The experience with a chatbot can quickly turn out to be frustrating: when the robot does not understand, it usually repeats the same utterance over and over instead of trying to recover the dialogue by asking a random question or changing the subject. The recovery method depends on the context of the experiment.

2.2.7 Robot appearance

An article on the IEEE Spectrum website [Ackerman, 2016] notes that robots look friendlier when they have features such as curvy forms, light-colored skin and a child-like look with a big head and big curious eyes. Furthermore, even current humanoid robots have flaws that make them distinguishable from real humans, so it is easy to recognize them as machines and distance ourselves from them. According to the study by Złotowski et al. [2016], a "highly human-like robot is perceived as less trustworthy and empathic than a machinelike robot with some human-like features". Their study also shows that machine-like robots with human-like features and a positive behavior create a higher feeling of trustworthiness than human-like robots with the same features. However, in the case of negative behavior, a machine-like robot creates more anxiety in the user than a human-like robot. Appearance is therefore a double-edged design decision for social robots.


2.3 Proxemics

2.3.1 Proxemics applied to HRI

Body language can be very noisy: people turn their head or raise their hands without any intent to emphasize something. Non-verbal language is based on the orientation of the body, the gaze, gestures and how the person occupies space. In fact, all senses can be taken into account: visual, auditory, olfactory, thermal and tactile. From these features we should get signs of engagement or disengagement. According to the article by Mead and Mataric [2016], there are different patterns of proxemic behavior. From a previous study by Mead et al. [2012] we can draw three levels of proxemic representation: the "physical representation, based on inter-agent distance and orientation", the "psychological representation, based on the inter-agent relationship" and the "psychophysical representation, based on the sensory experience of social stimuli". The physical features rely on the position of the agents and their orientation; the psychophysical level concerns all our senses, such as audition and vision; and the psychological level concerns someone's cultural and personality aspects. These three levels combine to create a behavior. It is a simple internal representation.

A robot that does not respect our distances by getting too close to us can be threatening. To avoid that, human proxemic behavior has to be studied and respected. Four models of proxemic mechanisms are described in the article by Jonathan Mumm and Bilge Mutlu [Mumm and Mutlu, 2011]. The compensation model appears when a partner balances an inappropriate behavior: for example, if one partner gets too close, the other will avert their gaze or take a step backward. The reciprocity model appears when both partners match each other's behavior. The attraction-mediation model corresponds to a high level of attraction and therefore closeness; a low level of attraction thus induces a greater distance. The attraction-transformation model appears when the level of attraction, either high or low, affects the behavior of both partners, who adopt either compensation or reciprocation. The authors' hypothesis is that, following the compensation model, participants will keep a greater distance from a robot that maintains straight eye contact with them. In conclusion, a more likable robot should see its distance to the user decrease; reciprocally, a user who steps backward in front of a robot is likely to be disengaging from the conversation.

2.3.2 Proxemics to engage the user

According to the study [Skantze et al., 2014], eye gaze, gesture and verbal expressions are important to engage the user. Acknowledgements such as short utterances or nodding make the user more confident while speaking. The user is especially more engaged in joint task completion with a robot that uses its gaze to show things. We could then think that reproducing natural human behavior makes the robot more engaging in HRI. However, according to an observation of eye blinking in the study [Tritton et al., 2012], blinking does not affect the participants although it is a natural feature. McClave [2000] studied dialogues and annotated the basic non-verbal language signs, concluding that "many head movements co-occurring with speech are patterned. They have been shown to have semantic, discourse, and interactive functions". Non-verbal signs are codified and can be organized as a language.

Yu et al. evaluated the importance of gaze monitoring to keep users engaged with a dialog system. Vision is used as a means to help turn-taking in speech, and it helps to adjust the strategies according to the listener's attention: if this attention is too low compared to what was expected, then "the model triggers a sequence of linguistic devices". The study focuses on two types of attentional demand. The "onset demand" is the level of attention required from the listener by the system when it begins its phrase. The "production demand" concerns parts of the dialog that carry important information; those parts particularly require the listener's attention. The system is then tested outdoors in a place with many distractions for the participants. The "coordinative policy" is composed of a set of interjections that aim to get the participant's attention. The study shows that improper behavior has a negative effect on engagement. For instance, if the system does not detect any attention on itself whereas in reality there is attention, it keeps trying to attract it; the participant is then paying attention to a system that does not recognize this, becomes confused and will most likely withdraw his or her attention. The system should aim to avoid such incoherent behaviors.

2.4 Turn-taking

Turn-taking rules are part of the conversation process. Interruptions are a kind of turn-taking situation that will be highlighted in this project; it is therefore interesting to understand this feature.

What makes a stable group interaction order? According to the study [Okamoto et al., 2002], there are two possible measurements: the syntactic structure and the situation. When the interruption appears more than two syllables away from a possible turn-transition space, it is a syntactic measure; the situation measure is made depending on the context and the cultural view of both speakers. A turn-taking space occurs if one of the following conditions is fulfilled: the speaker ends and points out the next speaker; the speaker continues; another speaker enters the conversation; or the transition is made smoothly with cooperation. Another method is presented in the review part of [Okamoto et al., 2002], where there are four different cases of turn-taking depending on the context and the culture. Firstly, the speaker is cut off before he or she has made the first point of the conversation. Secondly, the speaker has made the first point of a turn. Thirdly, the speaker is cut off in mid-clause after the first point of a turn. Lastly, the speaker begins to speak during a pause or at another's turn-end signal. In conclusion, there is no clear-cut rule for proper turn-taking.

Natural turn-taking is more flexible than turn-taking in current conversations with robots, where the usual way is to press a button or to finish a sentence and stop speaking. The turn-taking operation is much more complicated when solving a problem in a group, as it requires a camera and an analysis of non-verbal language; this is the problem of joint activity described in the article [Skantze, 2016]. Vocal cues of turn-completion are pitch, duration and energy. If the pitch is rising or falling, the turn is being yielded. Duration concerns the length of the phonemes: short ones indicate a turn being yielded, and the same holds for a lower intensity or low loudness. Visually, a speaker breathing in will continue to speak, whereas a speaker looking away or looking back at someone else is probably yielding the turn. However, unexpected interruptions can happen, coming from an external cause such as someone seeking urgent attention. Eventually, to generate a coordination signal, the user most likely to take the turn is the one toward whom the robot is looking.
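These vocal cues can be combined into a crude rule-of-thumb check, sketched below. All thresholds are invented for the illustration; a real detector would learn them from data.

```python
def is_turn_yielding(pitch_slope, phoneme_duration, energy,
                     pitch_thresh=0.5, dur_thresh=0.08, energy_thresh=0.2):
    """Rule-of-thumb check of the vocal turn-completion cues above: a clearly
    rising or falling pitch, shortened phonemes or low loudness all suggest
    the speaker is yielding the turn. All thresholds are invented."""
    return (abs(pitch_slope) > pitch_thresh     # pitch rising or falling
            or phoneme_duration < dur_thresh    # short phonemes
            or energy < energy_thresh)          # lower intensity / loudness

print(is_turn_yielding(pitch_slope=-0.8, phoneme_duration=0.12, energy=0.6))  # True
```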

Both the robot and the user must learn each other's ways for a smoother turn-taking operation. Mitsunaga et al. implemented a system that learns from the user and adapts its behavior. It works with a reward function that improves the behavior according to a few parameters: the distance, the gaze-meeting ratio, the waiting time and the motion speed. The distance between the user and the robot is defined by proxemics, where humans keep specific interaction distances according to their relationship: intimate, personal and social distances. The gaze-meeting ratio corresponds to the number of times the user's gaze meets the robot's gaze over the length of the interaction. The waiting time is the time between the pronounced utterance and the action. Finally, the motion speed governs the speed at which gestures are made. Different configurations of these parameters are tested and the user's impressions are recorded so that the system's behavior can be adapted. However, each configuration of parameters relies on subjective impressions; for instance, some users prefer a wider distance than others.


2.4.1 Conversational strategies to engage the user

In the study [Yu et al., 2015b] the team developed a conversational agent called TickTock. They implemented a dialog system that works with a rule-based method. The database comes from CNN show interviews and keeps in memory the questions and answers of participants. The dialog system thus retrieves the answer whose question is most similar to the one asked of the system. The retrieved utterance is then combined with one of two strategies: active or passive. During the evaluation, the conversation between participants and the dialogue system is recorded in order to annotate the engagement level on a five-level scale. As engagement is very subjective and personal, the participants were asked to annotate their own engagement for each turn.

In a later study by Yu et al., the Multimodal TickTock system compares two strategies based on their engagement level. The physical system has a cartoon face that represents the system talking; with a non-human-like face, the user expects a behavior different from another human's. The TickTock dialogue system is based on a database-retrieval process: it compares utterances with those in the database and retrieves the response. Five strategies are designed to improve user engagement: switch topics, initiate activities, end topics with an open question, tell a joke, and refer back to a previously engaged topic. User engagement is monitored by the system in real time; it evaluates the user's utterances to adopt the best strategy to keep the user engaged. Two versions of the TickTock system are compared: the REL version uses only the five strategies above, whereas REL+ENG uses the five strategies plus a reaction to low user engagement. A first evaluation used recorded interactions annotated by experts. Another, more rigorous evaluation, an Amazon Mechanical Turk study, used an independent jury and only interactions of users with no previous knowledge of dialog systems: pairs of interactions recorded with REL+ENG and REL were annotated by non-expert people. Finally, both evaluations were compared to find which version of the system is best; REL+ENG indeed seems better, as it monitors the user's engagement.

In another study, Yu et al. focused on gaze to get attention and to monitor the listener's level of attention. Gaze is used as a means to make turn-taking in speech more precise, and it helps to adjust the strategies according to the listener's attention. If the attention is too low compared to what was expected, then "the model triggers a sequence of linguistic devices" such as interjections. The study focuses on two types of attentional demand: the "onset demand", the level of attention required from the listener when the system begins its phrase, and the "production demand", which concerns parts of the dialog that carry important information and therefore particularly require the listener's attention. The system is tested outdoors in a place with many distractions for the participants. The "coordinative policy" is composed of a set of interjections that aim to get the participant's attention. The study shows that improper behavior, such as the system detecting no attention on itself and trying to attract attention when the user is in fact attending, has a negative effect on the participant, who will then withdraw their attention. If the listener's attention is insufficient, the system executes a pause, an interjection and again a pause. If attention has been successfully drawn, the system says the utterance; if there is still not enough attention, the system pauses after two words and then says the entire utterance. Each pause lasts 1.5 to 2.5 seconds.
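The pause-interjection-pause policy described in this paragraph can be sketched as plain control flow. This is an illustration of the described behavior, not the authors' code: the interjection list and the attention threshold are invented, and `say` and `attention_level` stand in for the robot's TTS and attention estimator.

```python
import random
import time

INTERJECTIONS = ["Excuse me,", "Hey,", "Listen,"]  # invented examples

def say_with_coordination(say, attention_level, utterance, threshold=0.5):
    """Pause, interject, pause again; if attention is drawn, say the
    utterance, otherwise pause after two words before completing it.
    `say` and `attention_level` stand in for the robot's TTS and
    attention estimator; the 0-1 threshold is invented."""
    if attention_level() < threshold:
        time.sleep(random.uniform(1.5, 2.5))       # first pause
        say(random.choice(INTERJECTIONS))          # interjection
        time.sleep(random.uniform(1.5, 2.5))       # second pause
    if attention_level() >= threshold:
        say(utterance)                             # attention regained
    else:
        words = utterance.split()
        say(" ".join(words[:2]))                   # pause after two words
        time.sleep(random.uniform(1.5, 2.5))
        say(utterance)                             # then the entire utterance

# Dummy stand-ins for a quick dry run:
say_with_coordination(print, lambda: 0.3, "The museum opens at nine.")
```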

2.5 Hypothesis

Among all the types of interruptions, the study focuses on one particular pattern, as the domain of interruptions is broad. For the purpose and simplicity of the evaluation, the choice was made to focus on interruption by external disturbance, since an external interruption is easy to set up with regularity and consistency. The gesture system, in addition to the dialogue system, is expected to help the robot recover the conversation. As shown in many HRI experiments in the literature, participants' feedback favors embodied robots with coherent gesture and gaze behavior. The study will evaluate the effectiveness of recovery using different gesture strategies.

Chapter 3

Requirements analysis

3.1 Project scope

The project provides a multimodal interaction system implemented on the Pepper robot, in which engagement during a conversation is emphasized with proxemic methods. As robots become more and more common in everyday life under multiple forms, they have to be adaptable in order to fit into people's lives. This implies a capacity to interact with humans and to be friendly, so as to be accepted by the general population. Many companies have tried to develop perfect humanoid robots to make them more appealing; to avoid the Uncanny Valley, which denotes the strangeness of an approximate human-like model, a lot of time has been spent researching physical appearance. Machine-like robots can be appreciated on the condition that they behave properly: as summarized by Yang et al. [2013], people get emotionally attached to these kinds of robots even if their shape is not human-like. Thus the recovery of the conversation after an interruption plays a role in how to better engage the user in interaction with robots, and may improve the user's experience with them.

3.2 Project objectives

The main goal is to equip a physical robot with both dialogue and gesture systems, designed according to the robot's features and functions. The challenge lies in coordinating both systems to achieve a successful recovery after an interruption. Ideally, the final system is efficient enough to smoothly recover from an interruption; at the least, it is interesting to better understand human-robot interaction, especially in the case of interruption recovery.


3.3 Materials

3.3.1 Python SDK

The documentation for the Pepper robot from SoftBank covers Python and C++ programming. Python is the language chosen to program the Pepper robot: it is an easy and flexible programming language that is widely used, well documented and supported by large programmer communities.
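A minimal sketch of the Python SDK in use follows: connect to the robot by its address and make it speak. The IP address below is a placeholder for the actual robot.

```python
# Minimal use of the NAOqi Python SDK: connect to the robot by its address
# and make it speak. The IP below is a placeholder for the actual robot.
from naoqi import ALProxy

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559

tts = ALProxy("ALTextToSpeech", ROBOT_IP, ROBOT_PORT)
tts.say("Hello, I am Pepper.")
```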

3.3.2 Pepper robot

The Pepper robot is produced by Aldebaran, part of SoftBank Robotics. This robot was especially designed for social interactions. It has a humanoid shape: a body mounted on wheels, moving arms and a head (see figure 3.1, from the SoftBank Robotics documentation). The robot has two HD cameras, one in each eye, to monitor the user. This feature is used to track the user's head so that the robot keeps the user in its field of view by adjusting its head position. Thanks to pre-implemented functions, the robot can monitor the user's feelings and reactions in real time, which is why it is particularly appreciated for social robotics. It is also able to avoid collisions while moving its head or arms, or while moving around, by using its sonar and lasers. The table below summarizes the robot features that observe proxemic cues.

Sensors and actuators       | Use
Lasers / sonar              | Calculate the distance between the robot and the user
LED eyes (green, red, blue) | Indicate three states: listening, speaking, processing
Loudspeakers                | Synthesize the robot's voice
Microphones                 | Listen to the user
2 HD cameras in the eyes    | Monitor the user's mood
1 3D camera                 | 3D perception of the environment

Table 3.1: Sensors and actuators of the Pepper robot

The connection between the computer and the robot is made over WiFi or an Ethernet cable. For the dialog system using QiChat, described in the following section, we need access to the robot's internal memory. To do that, we connect remotely by Secure Shell ("ssh"), which allows topics and files to be loaded onto the robot.


Figure 3.1: Pepper robot

The implementation gradually revealed some limits of the robot. The accuracy of its deductions depends on the steadiness of the sensors; for instance, a bright environment will disturb the camera and the robot will lose the ability to track a face. Another drawback is the lack of multitasking ability. Each movement takes a certain time to execute, which is normal, but this is an issue because the dialog stops during the movement, losing naturalness. Similarly, when the robot is speaking, it cannot listen at the same time. We observe that timing is very important in a conversation: in everyday life, humans react to each other at a fast pace, whereas in human-robot interaction the robot is usually slower than the human. This was the case in a previous study by Yu et al., where the robot was still trying to draw the user's attention even though the user was already paying attention, due to a delay between the robot processing the cues of the user listening and reality. Such incoherence is very detrimental to the engagement of the user, as it shows a lack of understanding on the robot's part and decreases confidence in the robot. Delays in gestures or reactions sometimes give the same impression of an incoherent robot. Finally, the robot can be misled by parasitic movements or sounds such as "mmmm" or "euuu" from the user.

3.3.3 Naoqi

Naoqi is a system of libraries gathering functions for the Pepper robot, released and updated by SoftBank Robotics. It is a ready-to-use package comprising events, animations and functions, which simplifies programming: it allows a focus on high-level programming rather than on controlling actuators or connections. For instance, to make the robot wave its hand, we simply call a function with the name of the movement to execute as an argument. Of course, depending on the complexity of the functions, and particularly on the sensors needed, the script will be more or less complex. Even though the robot can be run without Naoqi, it was used in the present project in order to focus on programming the recovery strategy.
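The "call a function with the movement's name" idea looks roughly as follows with the ALAnimationPlayer service, assuming the standard animation package is installed on the robot. The tag follows the documented animations/<Posture>/<Category>/<Name> convention; the IP is a placeholder.

```python
# Running a named, ready-made gesture through the ALAnimationPlayer service.
# Hey_1 is one of the standard waving gestures; the IP is a placeholder.
from naoqi import ALProxy

animation = ALProxy("ALAnimationPlayer", "192.168.1.10", 9559)
animation.run("animations/Stand/Gestures/Hey_1")  # wave the hand
```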

3.3.4 Choregraphe

Choregraphe is the software accompanying robots from SoftBank Robotics. It is a graphical programming environment for implementing behaviors on the robot. Its graphical window offers features to monitor the robot's state, display the view from its cameras and show a model of its environment. First the robot has to be connected to by its IP address. To implement a new behavior, we simply scroll through the main menu, look for the desired behavior and drag it to the center of the window, where it forms a box. After connecting this box with start and end arrows, the behavior is ready to be executed on the robot. This software was useful in this project for understanding the robot's range of abilities, and it was used to follow the dialogues, as it displays what the robot understands in real time.


3.4 Implementation

3.4.1 Strategy of recovery

The final design of the system is a combination of all the research presented above and the specific features of the Pepper robot. The overall structure should have looked like figure 3.2.

Figure 3.2: Theoretical architecture

The modular architecture was convenient for achieving efficiency, as each module could be improved independently. In this first draft of the architecture, the turn-taking manager would have been at the center of the system, orchestrating the information coming from the events triggered by the "User's emotion monitoring" module and processing the response. The response would come either from the "Conversational agent" module or, in case of an interruption, from the "Interruption manager" module. This organization highlights the importance of appropriate turn-yielding by the "Turn-taking manager" to avoid silences or cutting the user short. Moreover, quick and coherent reactions are enabled by awareness of the user's state, thanks to the "User's emotion monitoring" module. The "Memory" and machine learning modules are keys to a better system: those elements are part of ordinary human capabilities, play a key role in relationships, and are what make a system smart. For instance, a robot that remembers who you are and your preferences is more likable. In practice, the structure turned out to be much simpler: gesture, dialogue and user monitoring are handled more or less at the same level on the robot. Unfortunately, memory and machine learning were not implemented, as the focus was on a single recovery strategy. The final architecture of the system is shown in figure 3.3.

Figure 3.3: Final architecture

Basically, the main script comprises the monitoring of the user through events. If gazes meet, an interaction begins and the main script subscribes to the QiChat topic. During the dialogue the events remain active, so if something such as an interruption happens, the robot is able to react. Ideally, the dialogue and gesture behaviors run smoothly in parallel; unfortunately, reactions take some time to process, during which the dialogue is deactivated. In this model there is no memory and no machine learning: the system is not smart, as all behaviors had to be implemented by hand. A sketch of this event wiring is given below.
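The sketch below shows the gaze-triggered wiring with the qi framework, as a plausible reconstruction rather than the actual project code. The gaze event name comes from NAOqi's gaze analysis module; the IP address and the "interruption_study" subscriber name are placeholders.

```python
# Sketch of the main script's event wiring using the qi framework. The gaze
# event name is from NAOqi's gaze analysis module; IP and subscriber name
# are placeholders.
import time
import qi

session = qi.Session()
session.connect("tcp://192.168.1.10:9559")

memory = session.service("ALMemory")
dialog = session.service("ALDialog")

def on_gaze(value):
    """Gazes have met: start the QiChat dialogue."""
    dialog.subscribe("interruption_study")

subscriber = memory.subscriber("GazeAnalysis/PersonStartsLookingAtRobot")
subscriber.signal.connect(on_gaze)

while True:        # keep the script alive so events keep being processed
    time.sleep(1)
```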


3.4.2 Design of the interruption scenarios

For the purpose of the evaluation, each participant interacts twice: once with the recovery strategy and once without. The interaction without recovery is a reference interaction in which the robot does nothing in case of an interruption. The recovery strategy is triggered if the user looks away or stops talking. If the user looks away, a counter is incremented up to a certain threshold; this method is inspired by the attention level of the Yu et al. study. When the threshold is exceeded, i.e. the user is not paying attention, the robot recovers attention by waving a hand and making a vocal sound. The hand waved is the one on the side of the user's gaze: the robot detects the orientation of the head and chooses the proper movement to execute. For instance, if the user is looking away to the right, the robot will wave its left arm. These design choices rely on personal experience and common sense. If the user then looks back at the robot, the recovery is successful. If not, the counter keeps incrementing and, at a higher threshold, the robot moves to the side. Ideally, the robot would have a movement that shifts itself in front of the user if the user does not give his or her attention back despite the call; this would attempt to meet the user's gaze again by repositioning the whole robot in front of the user. To detect an interruption, the system uses the camera and an event that detects silences in the dialogue. If there is a silence of 20 seconds, the robot tries to re-engage the user by waving a hand and calling "hey". If the user is looking away, the robot waves its hand; if the user still has not looked back, the robot shifts in front of the user in a last attempt to meet his or her gaze. These strategies use Naoqi libraries with ready-to-use functions called events. Events are triggered by stimuli, each event having its own corresponding stimulus, such as a head appearing in the robot's field of view or a shoulder being touched. When an event is triggered, a flag is raised and a callback function is called; the reaction then depends on what is coded in the callback function. A sketch of this counter-based recovery logic is given below.
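The counter-based escalation can be summarized in a few lines. The helpers passed in (user_is_looking_away, wave_towards_gaze, shift_in_front_of_user, tick) are hypothetical stand-ins for the NAOqi calls, and both thresholds are illustrative, not the values used in the actual system.

```python
# Sketch of the counter-based recovery described above. Helper functions are
# hypothetical stand-ins for the NAOqi calls; thresholds are illustrative.
LOW_THRESHOLD, HIGH_THRESHOLD = 5, 10

def recovery_loop(user_is_looking_away, wave_towards_gaze,
                  shift_in_front_of_user, tick):
    counter = 0
    while True:
        if user_is_looking_away():
            counter += 1
            if counter == LOW_THRESHOLD:
                wave_towards_gaze()       # wave the arm on the gaze side, call "hey"
            elif counter >= HIGH_THRESHOLD:
                shift_in_front_of_user()  # last attempt: move to meet the gaze
                counter = 0
        else:
            counter = 0                   # attention recovered, reset
        tick()                            # wait one monitoring step
```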

3.4.3 Dialog system

The first and ideal plan was to implement a full chatbot, which would have given the user complete freedom to talk to the robot. However, due to the time constraints of this project, it was decided to use the dialog system specific to SoftBank's robots: QiChat. It is composed of topics, which are files loaded into the robot's internal memory and which correspond to the linguistic knowledge of the robot. Each topic contains utterances and words that can be recognized by the robot. The principle is very simple: if the robot recognizes an utterance, it gives the corresponding answer. For instance, a "Hello" will trigger the answer "Hello, how are you?". A human input associated with a relevant robot output is called a rule; when the robot output waits for a specific answer, this goes into a sub-rule. It is possible to design more complex reactions with functions such as the "topicTag" function, which allows switching to another topic when a keyword is said. In this project the dialogues are split in two: a greeting topic first, then a small-talk topic, linked by the "topicTag" function. The major drawback is that the user is restricted to a limited list of words and utterance configurations. When the user says something, the robot computes the utterance and gives a confidence percentage according to what was understood; it often appears that what is understood is not what was said. If the percentage is below the threshold, here 50%, the utterance is not taken into account and the robot goes back to a listening state. If the percentage is over the threshold, the robot has understood the utterance and answers it. The robot knows only the list of words from the topics, which can be recognized and understood. The drawback of this method is that the system neither learns nor memorizes any information about the user.
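Loading and activating a topic can be sketched with the ALDialog service, as below. The file path and subscriber name are placeholders, and the topic file must already be on the robot (for example, copied over ssh).

```python
# Loading and activating a QiChat topic with the ALDialog service. The file
# path and subscriber name are placeholders; the topic file must already be
# on the robot and contains rules such as:
#   u:(Hello) Hello, how are you?
from naoqi import ALProxy

dialog = ALProxy("ALDialog", "192.168.1.10", 9559)  # placeholder IP
dialog.setLanguage("English")

topic = dialog.loadTopic("/home/nao/topics/greetings.top")
dialog.activateTopic(topic)
dialog.subscribe("interruption_study")    # start listening and answering

# ... the interaction runs here ...

dialog.unsubscribe("interruption_study")
dialog.deactivateTopic(topic)
dialog.unloadTopic(topic)
```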

Chapter 4

Ethical and legal considerations

4.1 Ethical approval details

The aim of the study is to create a system that is efficient at recovering a conversation between a chatbot and a human. This system uses speech and proxemics analysis, and it takes place in human-robot interaction. The experimental method involves a person speaking with the system implemented on the Pepper robot. The participant is asked to play out scenarios simulating different conversations, with a second task used to create interruptions. The interaction may be recorded as text, audio and video for post-analysis. The participant is asked to fill in a questionnaire on his or her feelings after the interaction.

The project involves human subjects and personal data identifiable with living people. The project's evaluation consists of evaluating a user interface according to the description: "Interface only approval can be granted to projects that will be evaluating a user interface by observing individuals using the software and performing a system usability scale (SUS) questionnaire. Participants must be staff or students of Heriot-Watt University. The standard consent form must be completed by the participant prior to the study and stored by the student conducting the project. All collected data must be anonymised. No sensitive data will be collected. Only standard computing equipment, i.e. an office PC, laptop, tablet, or mobile phone, will be used." Thus, to protect the data collected during the evaluation in accordance with university policy, the following rules were agreed:

• Identifiable data will be stored on a secure machine with restricted access

• Data will be anonymised before publication unless consent has been given

• Identifiable data will only be retained for the duration of the consent granted by the participant

• External data and systems will be used within the license terms specified

4.2 Professional issues

The professional issues concern the code, which should be well commented and developed incrementally to keep it clear. It is important to think of further experiments and to prepare the code to be read by someone else; this also makes debugging easier.

4.3 Legal issues

For the implementation, legal issues are a matter of respecting the licenses for the software and the robot. All credit for open-source code such as the ALICE bot goes to its authors; ALICE is open software that carries neither copyright restrictions nor other protections, and anyone is free to use and modify it. The IBM SPSS software is used under the university's license.

For the experiments with participants, some personal information such as name, gender and age is asked. This information is confidential and will be protected; no data will be used outside the study. All sensitive information will be kept password-protected on a memory device. Moreover, to analyze the experiments, it may be useful to record the conversation on video and audio, so each participant must agree to the consent form to take part in the experiment. The footage will be protected as virtual data in an encrypted, password-secured folder.

4.4 Ethical issues

Nowadays more and more robots operate in our public and private places, which creates many privacy and security issues. For instance, connected robots can be accessed by remote hackers. In everyday life, a robot around us in the house, at work or in public places gathers a lot of information about our habits that can be stolen. The Pepper robot, for instance, is likely to be placed in public spaces to help people find their way, and it can monitor the flow of people around it. The social robotics field is brand new, and there are many gatherings around ethics rules that should be written into law. For now, it is only a matter of individual responsibility whether to have such a robot at home or not.

However, the confidentiality issue of social robots is not specific to the current project; it is a broader issue that concerns all chatbots and HRI in general. Thus this project in itself raises no unique ethical or legal issues.

4.5 Social issues

Current chatbots are replacing some people for simple tasks such as ordering a pizza or finding the way to the nearest coffee shop. They are meant to enhance our everyday life by finding solutions and making our lives slightly easier. They are part of the connected movement in which smart devices spread everywhere. Although they are not yet popular everywhere in the world, it is a matter of time before the technology becomes better and cheaper and takes its place in our habits. Toward this end, the study of recovery in conversation is a step forward in improving current chatbot technologies: robots used in public places would be annoying if they could not recover from interruptions. Even when the robot fails, a system that can recover from interruptions is more engaging. This would give the public a better overall impression of social robots, which would then be more easily integrated into daily life.

The point of this research is to make robots more friendly so that people can have more empathy for them. There is an open question about the need for more empathy towards robots in places like battlefields or health care. Robots are created and meant to serve humans; would it be fair, then, to enhance robots that are meant to be destroyed? What if people give more value to the robot than to their own life? Numerous studies have been conducted in health care where social animal robots were used as therapy to help people with dementia feel well. Wada et al. [2008] show in their study that people form a real attachment to the animal robot; reprogramming or replacing the animal robot would be detrimental in such a case.

Chapter 5

Methodology

5.1 Experimental design

5.1.1 Environment

The study is conducted at the university. The room used for the interactions has to be quiet, to avoid external pressure or noise that would bias the results. The participants stand in front of the robot at a distance of their choice, within what the room allows. The experimental setup comprises the Pepper robot, a computer to initialize the robot and two cameras to record the interactions. A side view of both participant and robot and a front view of the participant's face are recorded. The recorded videos are used to measure behaviors and reactions in the post-processing analysis.

Figure 5.1: Environment setup


5.1.2 Evaluation of the interruption scenarios

The participant is told to achieve two tasks: speak to the robot and read a small article on a tablet. Three different dialogues are implemented on the robot, along with one recovery strategy. Each participant is invited to interact twice with the robot, once with the recovery turned on and once without, so that everyone experiences a specific combination of dialogues and strategies. The dialogues are short talks in which the robot asks questions and the user answers with a short utterance or a single word. One interaction is about planning a trip abroad: the robot asks simple questions as if the participant were planning a journey. Another talk is a guessing game in which Pepper describes a thing using adjectives and the user has to guess what it is. Finally, the third talk is about ordering a pizza: the user gives his/her food preferences in response to Pepper's questions. The recovery strategy is described in Chapter 3, section 3.4.2, "Design of the interruption scenarios". If the participant looks away for a certain time, the robot detects an interruption and reacts according to the implemented strategy. Once the participant has been briefed, she/he is invited to stand in front of the Pepper robot to start the interaction, and is free to organize his/her time to finish the tasks. As the robot has limited implemented knowledge, the participant is given a sheet with the keywords to use during the conversation.
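As a minimal sketch of the gaze-away trigger described above, the detection loop could look as follows. The threshold value, the polling period and the user_is_looking() helper are hypothetical; the actual implementation relies on the Pepper gaze-tracking modules.

    import time

    GAZE_AWAY_THRESHOLD = 3.0  # seconds; hypothetical value, tuned by trial

    def monitor_gaze(user_is_looking, trigger_recovery):
        """Poll the user's gaze and trigger a recovery behavior when the
        gaze has stayed away from the robot for longer than the threshold."""
        away_since = None
        while True:
            if user_is_looking():
                away_since = None            # gaze is back, reset the timer
            elif away_since is None:
                away_since = time.time()     # gaze just left the robot
            elif time.time() - away_since > GAZE_AWAY_THRESHOLD:
                trigger_recovery()           # interruption detected
                away_since = None
            time.sleep(0.1)                  # polling period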

5.2 Experimental procedure

Participants are invited to fill in a consent form beforehand. It describes the overall goal of the experiment and records whether the footage may be used in a later publication. Participants are then told how the experiment will take place. They receive a guideline sheet with instructions and the words they can use for the first interaction. They receive the tablet with the article to read, and the interaction begins. When the first interaction is finished, the participant is asked to fill in a first questionnaire using three subscales from the Godspeed questionnaire of Bartneck et al. [2009]. The participant is then asked to interact with the robot again with a new configuration and to fill out the questionnaire once more. A final questionnaire asks a couple of questions on the overall impression of the robot. Both interactions are recorded on camera to allow a post-experiment analysis of the data.


5.3 Measures

5.3.1 Overview

The evaluation is a core part of this project, as it represents the culmination of the research: all the theory is put into practice. According to Hung et al. [2009], the measurements can be split into two groups, subjective and objective measures. The following figure, taken from Hung et al. [2009], summarizes the main measures of naturalness in HRI.

Figure 5.2: Measures of naturalness in HRI


5.3.2 Objective measures

Objective measures are based on the recordings and the interaction itself. They involve countable metrics such as the length of the interaction, starting when the user looks at the robot and ending when the conversation is closed by a specific term such as "Good bye". The ratio of successful recoveries to interruptions highlights the chatbot's ability to recover the conversation.
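For instance, assuming the recordings were transcribed into a timestamped event log, the objective measures could be computed along these lines; the event names are hypothetical:

    def objective_measures(events):
        """Compute objective measures from a list of (timestamp, event) pairs.

        Hypothetical event names: 'user_looks_at_robot', 'goodbye',
        'interruption', 'successful_recovery'.
        """
        start = next(t for t, e in events if e == "user_looks_at_robot")
        end = next(t for t, e in events if e == "goodbye")
        interruptions = sum(1 for _, e in events if e == "interruption")
        recoveries = sum(1 for _, e in events if e == "successful_recovery")
        return {
            "interaction_length_s": end - start,
            "recovery_ratio": recoveries / interruptions if interruptions else None,
        }

    # Example: a two-minute interaction with one recovered interruption
    log = [(0.0, "user_looks_at_robot"), (45.0, "interruption"),
           (50.0, "successful_recovery"), (120.0, "goodbye")]
    print(objective_measures(log))  # {'interaction_length_s': 120.0, 'recovery_ratio': 1.0}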

5.3.3 Subjective measures

The subjective measures rely on the participants' feedback after the interaction. The following metrics are subjective because they involve feelings and impressions, which are not universally reliable measures. Nonetheless it is relevant to analyze this feedback from a large group of participants to detect a trend if one exists. There are three questionnaires in the whole evaluation, filled in by the participant after each interaction. Two questionnaires are adapted from the Godspeed questionnaire to evaluate anthropomorphism [Bartneck et al., 2009]. The last questionnaire compares the different strategies and asks for the participant's preferences.

The Godspeed questionnaire was conceived to evaluate robots that are designed not for task performance but for social interactions. The participant's impression of the interaction with the robot is encoded as notes on a scale from 1 to 5; for instance, a 1 would mean "I completely dislike" and a 5 would mean "I completely like". Such a note is given for notions such as "fake" versus "natural" or "dead" versus "alive". These notions are spread over five sections: anthropomorphism, animacy, likability, perceived intelligence and perceived safety. An example questionnaire is provided in Appendix A, section A.4, Figure A.1. The notes can then be used for statistics in post-processing. It could also be interesting to allocate a note to the proxemics of the participant during the interaction.
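As a small sketch of this scoring, assuming one 1-to-5 note per semantic-differential item, the per-item notes can be summed into the three cluster scores used later in the analysis. The item names follow the questionnaire in Appendix A.4; the data layout is an assumption.

    # Hypothetical layout: one 1-5 note per semantic-differential item
    responses = {
        "fake/natural": 4, "machinelike/humanlike": 3, "unconscious/conscious": 2,
        "artificial/lifelike": 3, "rigid/elegant": 4,
        "dislike/like": 4, "unfriendly/friendly": 5, "unkind/kind": 4,
        "unpleasant/pleasant": 4, "awful/nice": 5,
        "incompetent/competent": 3, "ignorant/knowledgeable": 2,
        "irresponsible/responsible": 3, "unintelligent/intelligent": 3,
        "foolish/sensible": 3,
    }

    SUBSCALES = {
        "anthropomorphism": ["fake/natural", "machinelike/humanlike",
                             "unconscious/conscious", "artificial/lifelike",
                             "rigid/elegant"],
        "likability": ["dislike/like", "unfriendly/friendly", "unkind/kind",
                       "unpleasant/pleasant", "awful/nice"],
        "intelligence": ["incompetent/competent", "ignorant/knowledgeable",
                         "irresponsible/responsible", "unintelligent/intelligent",
                         "foolish/sensible"],
    }

    # Sum each cluster of five items into one subscale score
    scores = {name: sum(responses[item] for item in items)
              for name, items in SUBSCALES.items()}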

5.3.4 Post-processing

The data obtained from the online questionnaires are gathered in one file and then processed with the IBM SPSS Statistics software, which is widely used in social science for the statistical analysis of datasets. Some information is added to the dataset, such as the age of the participant and details about the recovery strategy and the topic of each interaction. Moreover, the footage of the interactions is analyzed to take additional measures, listed below (a sketch of a corresponding coding sheet follows the list):


• Number of misunderstandings: number of times a word is repeated by the user because the robot failed to understand or misunderstood a word

• Number of successful recoveries: number of times the user's attention is caught back by the robot following a recovery behavior

• Number of unsuccessful recoveries: number of times the recovery behavior is triggered with no reaction from the user, or the recovery behavior is inappropriate given the situation

• Time spent interacting, from greeting to ending

• Number of silences longer than 20 seconds

• Number of times the "hey" and hand-wave recovery strategy is triggered

• Number of times the robot-shift recovery strategy is triggered

• Number of times a topic is repeated, from a mere re-initiation of the topic to the whole topic.
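As a hedged illustration, these video-coded measures could be captured in a small per-interaction record like the following; the field names are hypothetical and simply mirror the list above:

    from dataclasses import dataclass

    @dataclass
    class InteractionCoding:
        """Per-interaction counts coded from the video footage."""
        participant_id: str
        recovery_enabled: bool
        misunderstandings: int        # words repeated after a failed recognition
        successful_recoveries: int    # attention caught back after a recovery
        unsuccessful_recoveries: int  # recovery triggered with no user reaction
        interaction_time_s: float     # greeting to ending
        long_silences: int            # silences longer than 20 seconds
        hey_wave_recoveries: int      # "hey" + hand-wave strategy triggered
        shift_recoveries: int         # robot-shift strategy triggered
        repeated_topics: int          # topic re-initiated, partially or fully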

Chapter 6

Results

6.1 Observations of the evaluation

First of all, in addition to the consent form with a short presentation of the project, the participant was given some guidelines along with a sheet listing the vocabulary to use for each topic. However, the directions given at the beginning of the evaluation were quite confusing. Here is the additional information given verbally:

• Do both tasks at the same time: speak to Pepper and read a small article

• The Pepper robot leads the conversation; you begin the interaction with a greeting, then answer its questions and finish with an ending word

• Repeat and/or speak louder if the robot does not understand

• Give one word at a time for each answer, either within a sentence or on its own

• Stay in front of the robot

• You are free to stop when both tasks are achieved

• Do not get frustrated

• For the Guess topic, the answers follow the order of the list of words on the sheet.

If the robot did not understand properly, some indications were given during the interaction, such as speaking louder or repeating the word, which might have interfered slightly with the interaction.



Participants were often puzzled when asked to read and speak with the robot at the same time: it is not a common behavior to get distracted while interacting with someone. As reading an article was confusing, participants could instead have been asked to talk to the robot while a video played behind it, so that the participant would be distracted by glancing at the video. Moreover, the articles were too long and had to be shortened, so the first participant had a longer paragraph to read than the following participants. This could have been avoided by running more trial evaluations before starting the real one. The dialogues should also have been longer: each participant had barely enough time to look at the guideline sheet before the dialogue was already over. As a result the interruptions did not really occur, and topics were repeated many times. The dialogues should therefore have been much longer, to give the user time to get bored and look away, creating an interruption. Nonetheless, participants did their best to achieve both tasks.

6.2 Participation

The evaluation gathered 13 participants. As the experiments took place in a university context, they were aged between 20 and 55, with a mean of 29.7. Most of them had a good knowledge of working with robots, with a mean of 3.7 on a five-point scale where 5 equals expert in robotics. A few of them knew about the goal of the research.

6.3 Analysis of the results

6.3.1 Analysis of raw data

First of all, it was interesting to verify whether one topic was more appreciated than another, knowing that each topic appears the same number of times as the others. If one topic were preferred, the evaluation would be biased, as the preferred topic would prime the test. Figure 6.1 relates the notes given to the interaction on the liking scale from 1 to 5, with 5 meaning "I enjoyed a lot", to the topics. We can observe that all topics are almost similar in terms of likability.


Figure 6.1: Distribution of interaction likability across the topics

There is no difference in the likability of the interaction with and without the recovery strategy. However, there is a slight improvement in likability for the second interaction, as shown in Figure 6.2. Each interaction also went quicker than expected in theory: on average, an interaction lasted two minutes. Second interactions tend to be longer than first interactions. Similarly, interactions with a recovery strategy tend to last longer than those without, but the difference is only about a minute, so it might not be meaningful. With the recovery strategy, the number of misunderstandings and the number of repeated topics surge, which may indicate confusion in the communication. On the question of the preferred interaction, the system without recovery is preferred by 70% of participants over the system with recovery. This is explained by the fact that the recovery slowed the robot's answers and presented an incoherent behavior: the numerous recovery attempts annoyed the participants more than they drew their attention back, as shown in Figure 6.2. The topics are used an equal number of times each, so they do not influence the preferences.


Figure 6.2: Measurement of the differences with/without recovery

6.3.2 Statistical analysis

The evaluation assesses the effectiveness of the recovery strategy against no recovery at all. Each participant experiences both strategies, making this a within-subject evaluation. The three different topics are not taken into account, as they are merely supports for the strategies. The Godspeed questionnaire is split into three clusters of five criteria each, and the notes given on each scale are summed within their cluster. We thus get three groups: anthropomorphism, likability and intelligence. The sums of each cluster become numerical data that can be processed with an ANOVA test. A two-way repeated measures ANOVA for each group of the Godspeed questionnaire appears to be the most suitable test. We obtain the following table:

Godspeed section | mean gesture | mean no gesture | p-value
Anthropomorphism | 16.23 | 16.77 | 0.391
Likability | 20.31 | 21.31 | 0.090
Intelligence | 16.23 | 16.77 | 0.391

Table 6.1: Two-way repeated measures ANOVA


We can see that Likability comes closest to a significant result (p = 0.090), whereas Anthropomorphism and Intelligence show no effect (p = 0.391). This can be understood as an even feeling of anthropomorphism and intelligence throughout the whole experiment, whereas some parameter changes the likability between the two interactions. It is plausible that the recovery strategy is at the origin of this difference, since the robot changes neither its appearance nor its fundamental knowledge between conditions.
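Outside SPSS, the same kind of comparison can be sketched with the statsmodels library. The sketch below uses a single within-subject factor (recovery versus no recovery) and placeholder values, not the study data; SPSS's two-way repeated-measures ANOVA additionally models the interaction order.

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Long-format table: one Godspeed subscale sum per participant and condition
    data = pd.DataFrame({
        "participant": [1, 1, 2, 2, 3, 3, 4, 4],
        "recovery":    ["gesture", "none"] * 4,
        "likability":  [20, 22, 19, 21, 21, 20, 22, 23],
    })

    # Repeated-measures ANOVA with one within-subject factor
    result = AnovaRM(data, depvar="likability", subject="participant",
                     within=["recovery"]).fit()
    print(result)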

The isolated questions about appreciation, engagement and appropriateness of the behavior are treated with the Wilcoxon signed-rank test, which compares the answers to the same question asked at different times of the experiment. Most participants enjoyed the second interaction more, as shown in Figure 6.3, although this is contestable given the low participation: the result can hinge on a single participant. The appropriateness of the robot's behavior is also rated higher at the second interaction, with an even distribution of the recovery strategies; this can be explained by the participants having a better understanding of how the interaction proceeds. However, the engagement level is higher for the first interaction, meaning participants felt less engaged in the conversation during the second interaction.

Figure 6.3: Comparison between answers at first and second interactions


Comparison first/second interactions | Z value | p-value
Like | -0.351 | 0.725
Behavior appropriateness | -0.844 | 0.399
Engagement | -1.00 | 0.317

Table 6.2: Wilcoxon signed-rank tests comparing first and second interactions
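As a sketch, the same paired comparison can be reproduced with scipy; the answer vectors below are placeholders, not the study data, and scipy reports the signed-rank sum W rather than SPSS's Z statistic.

    from scipy.stats import wilcoxon

    # Paired 1-5 answers to the same question after the first and second
    # interactions (placeholder values)
    first  = [4, 3, 5, 4, 3, 4, 2, 5, 3, 4, 4, 3, 5]
    second = [4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 5, 3, 5]

    stat, p = wilcoxon(first, second)
    print("W = %s, p = %.3f" % (stat, p))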

6.3.3 Comments analysis

To conclude the evaluation, participants were asked open questions about what they liked and disliked about the interactions. Pepper's built-in features, such as the voice, the appearance and the smooth movements, were quite appreciated. Some comments also singled out the eye tracking as an interesting feature, which matches the literature, where eye contact is an important component of engagement [Skantze et al., 2014]. A few comments showed appreciation of the recovery strategy and described it as "entertaining". In the dislike comments, however, the recovery appears to be a disturbance when it occurs too often. What was most disliked was the misunderstanding of words such as "hi" and "bye" or "ham", and the need to repeat many times before the robot understands. The experience with the robot could be improved by expanding the knowledge and the answers, giving more feedback, and filtering out fillers such as "hemmm". This reflects the analysis of the interaction preferences, where the system without gesture recovery is preferred.

Chapter 7

Discussion

7.1 Participation

The participation is quite low for such an experiment, and this was a choice. It soon became obvious that the evaluation's design was inappropriate for causing interruptions. The reading format of the second task is unsuitable because it is uncommon to read while talking, and it was unclear to the participants what to do, whether to read or to speak. It was therefore decided to stop the evaluation. Moreover, some participants were aware of the general goal of the research, whereas a strict experiment would allow only uninformed participants. This lack of strict rigor was acceptable given the academic context of this project.

7.2 Evaluation

Since a scientific evaluation should measure only the participants, with no external disturbance, the guidance given during the experiment might have interfered. Ideally, the participant knows when to start and end and what to do, which is why the design of the evaluation is so important. The articles given might also prime the participant; in this case they covered diverse subjects, excluding robotics and AI. Another difficulty appears when designing the recovery strategy. As seen in the literature review, interruptions are modeled on human-human interactions: in social humanoid robotics, we often try to copy the best human behavior. One can ask whether recoveries between humans and between human and robot are similar. Robots are still obviously machinelike, so users might behave in a specific way toward them. We observe that naturalness in the movements is important to display a coherent behavior. The smoothness of the movements was much enjoyed by the participants. Likewise, the behavior animations already implemented in the Pepper robot allow a great range of movements, which surely contributed to improving the participants' feeling about the overall experience. Furthermore, different shades of behavior coupled with varied answers can give an impression of naturalness and adaptability. As we humans are complex, we may tend to see complex behaviors as more natural; this could lead to the development of personalities for robots. A criticism can be made of the dialogue, as it was a task-based talk instead of small talk: a task-based talk requires the full attention of the user, whereas small talk would allow the user to get distracted more easily. In addition to this criticism of the type of talk, the poor vocabulary coverage did not give the user the possibility to express himself/herself freely.

7.3 Improvements

7.3.1 What could enhance the implementation?

7.3.1.1 A smarter system

First of all, the design of a subtler recovery strategy would improve the user's experience. To do so, it would be interesting to design different types of recoveries and to run an evaluation of each type to assess its effectiveness, knowing that the design of a recovery draws on many strands, such as proxemics and anthropology. In addition to the current system, it would be interesting to design a smarter system using machine learning, which would improve itself by learning from interactions. The principal advantages are the ease of expanding its knowledge and the variety of words and behaviors. The principal drawback of this method, however, is that it learns everything, bad behavior as well as good. Avoiding this would require supervision, and unfortunately it is costly to supervise all learning. It is also difficult to elicit enough interactions for the robot to build a proper learning database.

7.3.1.2 Improved dialogue system

In a future project, another dialogue system would replace the QiChat system, which was not robust enough. A common issue was the system getting stuck in sub-rules, waiting for one particular answer. The new system would understand a larger choice of sentences. Moreover, the accent influences the recognition of words: a British pronunciation of "potato" will not give the same result as the American one.
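The sub-rule trap can be illustrated with a minimal QiChat fragment, held here in a Python string. This is a sketch only: the syntax follows the NAOqi QiChat documentation, and the topic content is illustrative rather than the implemented dialogue.

    # Once the proposal fires, the engine waits inside u1/u2 for one of
    # the listed answers; anything else the user says leaves the dialogue
    # stuck on this question.
    PIZZA_TOPIC = r"""
    topic: ~pizza()
    language: enu

    proposal: Which topping would you like, ham or cheese?
        u1: (ham) Ham it is.
        u2: (cheese) Cheese it is.
    """
    # Such a topic could be loaded at runtime through the ALDialog module
    # (e.g. dialog.loadTopicContent(PIZZA_TOPIC)); check the NAOqi
    # documentation for the exact call.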


7.3.1.3 Improvement of dialogue features

The design of the dialogue should allow the user to stop the robot. Similarly, the user should have a genuine choice between answering "yes" and "no": in the implemented dialogue, even "no" answers led to a continuation of the conversation. The lack of a command to stop the conversation can frustrate the user. In addition, it is important for the user to get an acknowledgement of the robot's understanding. Typically, the robot saying "thank you" or "okay", or repeating the previous idea, are ways to give feedback to the user, so that the user knows the communication with the robot is working well.
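A hedged sketch of these features in QiChat might look as follows; the wording and rules are illustrative, not the implemented dialogue:

    ORDER_TOPIC = r"""
    topic: ~order()
    language: enu

    # An explicit stop command the user can say at any time.
    u: (stop) Okay, we stop here. Goodbye!

    # A real "no" branch, plus a short acknowledgement before moving on.
    proposal: Would you like a drink with your pizza?
        u1: (yes) Okay, a drink it is. Which size of pizza would you like?
        u2: (no) No problem, no drink then. Which size of pizza would you like?
    """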

7.3.1.4 A chatbot with memory

A memory would be added to remember who spoke, when, and what was said, so that there is no need to repeat everything all over again. It would also allow the user to go back in the conversation to change something said before. A database of all the user's interactions would improve the relation between the human and the robot; indeed, people who know each other are closer than total strangers. This is all the more relevant for social robots that might live in our houses. All in all, the best improvement for the dialogue system would be to design a full chatbot, which comprises all the previous features and is quite flexible.
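A minimal sketch of such a memory, assuming a simple JSON file as storage (the file name and slot names are hypothetical):

    import json
    import os

    class ConversationMemory:
        """Per-user memory: remembers who spoke and what was answered,
        so the robot need not ask again in a later session."""

        def __init__(self, path="memory.json"):
            self.path = path
            if os.path.exists(path):
                with open(path) as f:
                    self.data = json.load(f)
            else:
                self.data = {}

        def remember(self, user, slot, value):
            self.data.setdefault(user, {})[slot] = value
            with open(self.path, "w") as f:
                json.dump(self.data, f)

        def recall(self, user, slot):
            return self.data.get(user, {}).get(slot)

    # memory.remember("alice", "pizza_topping", "ham") would let a later
    # session open with "Ham again?" instead of asking from scratch.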

7.3.1.5 Improvement of the tracking

Another drawback is that at some point the tracking ceases and the robot gets stuck in one position for the rest of the conversation. This problem should be solved in a later project, unless it is inherent to the robot. Similarly, it would be interesting to improve the precision of the movements that do not come from the NAOqi libraries: the robot shifting was not precise enough. The tracking cameras are also not robust enough to track under changing light or with moving people.
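A small watchdog could mitigate the freezing. This is a sketch only: it assumes the NAOqi Python SDK and the ALTracker module, the robot address is a placeholder, and the exact API calls (in particular how a lost target is reported) should be checked against the NAOqi documentation.

    import time
    from naoqi import ALProxy  # NAOqi Python SDK

    ROBOT_IP, PORT = "pepper.local", 9559  # placeholder address

    tracker = ALProxy("ALTracker", ROBOT_IP, PORT)
    tracker.registerTarget("Face", 0.15)  # track faces of about 15 cm width
    tracker.track("Face")

    # Watchdog: periodically re-engage the tracker instead of leaving the
    # robot frozen in one position for the rest of the conversation.
    while True:
        if not tracker.isActive():   # tracking has ceased
            tracker.track("Face")    # restart face tracking
        time.sleep(1.0)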

7.3.2 What could be changed for the evaluation?

The first idea was to make participants play a game, but it proved either too absorbing or not absorbing enough for the user to do both tasks, play and speak. Instead it was decided to have participants read a paragraph of an article. Similarly, the design of Pepper's dialogue was changed: the robot was first supposed to tell a story with blanks that the participant would fill with his/her own words, but the QiChat system caused many problems. In the end, the user read an article while Pepper asked questions. As the questions were too short and too engaging, it would be better to lengthen the dialogue and expand the topics. The interruption would then be triggered by the participant glancing at a video running in the background. To make sure the participant looks at the video, we can either announce that he/she will be asked questions about what was seen in the video, or set a sufficiently long dialogue for the participant to get bored. Moreover, videos tend to catch the eye quite easily.

Chapter 8

Conclusion

A dialogue system and a gesture system were implemented on the Pepper robot. The recovery from interruptions included monitoring of the user and a combination of dialogue and gesture behaviors. Despite all the research, the overall system remained very raw. The best way to improve it would be to pursue the evaluation in order to adjust the numerous variables. Even if the hypothesis has not been fully answered, this work opened many doors in the niche field of interruption management in human-robot interaction. The project was an intricate problem whose solution is still to be found, but it was full of lessons. Indeed, the design of the evaluation is as important as the design of the recovery strategy. To achieve this goal, much research was done in different fields, such as robotics and psychology, and it was always thrilling to bring those fields of study closer together. Moreover, discovering how to lead an evaluation involving people on a complex issue was rich in learning.


Appendix A

A.1 Stakeholders

Stakeholders | Needs
Professionals in relation with the public | Easy use and maintenance; socially appropriate behavior of the robot
Future researchers in the Human-Robot Interaction domain | A new approach to conversational agents
Project supervisor | Fulfilled project
Myself | Implementation of a multimodal interaction system

Table A.1: Stakeholders



A.2 Risks

Risk | Importance | Likelihood | Solution
Delays due to outside circumstances | High | Low | Allow some flexibility in the schedule
Pepper robot breakdown | High | Medium | Implement the system on another available compatible robot
Implementation that proves too difficult | High | High | Adapt the project to make it simpler; try another approach to the project; focus on the research part
Incompatible software | High | Medium | Research beforehand to avoid such incompatibility
Inaccuracy of the sensors | High | High | Simplify the task and prefer simple, unsophisticated proxemics (e.g. big movements)

Table A.2: Details of the risks

A.3 Project tasks and deliverables

Deliverables | Details
Research report | Contains the literature review that grounds the project; handed in on the 6th of April
Implemented system | Comprises the gesture and dialogue programs as well as the interface with the Pepper robot
Final dissertation report | Contains all the research and the experiments, from their design to the results

Table A.3: Details of the main deliverables


A.4 Example of a questionnaire

Figure A.1: Questionnaire used at the evaluation, first interaction (page 1): participant and interaction IDs, then 1-to-5 scales for liking the interaction, appropriateness of Pepper's behavior, engagement, and the Fake-Natural item.

Figure A.2: Questionnaire used at the evaluation, first interaction (page 2): 1-to-5 semantic-differential items Machinelike-Humanlike, Unconscious-Conscious, Artificial-Lifelike, Moving rigidly-Moving elegantly, Dislike-Like, Unfriendly-Friendly and Unkind-Kind.

Figure A.3: Questionnaire used at the evaluation, first interaction (page 3): items Unpleasant-Pleasant, Awful-Nice, Incompetent-Competent, Ignorant-Knowledgeable, Irresponsible-Responsible, Unintelligent-Intelligent and Foolish-Sensible, followed by the second-interaction ID.

Figure A.4: Questionnaire used at the evaluation, final questions 40 & 41: end of the second-interaction scales, the preferred-interaction question and the open question on what was liked when interacting with Pepper.

Figure A.5: Questionnaire used at the evaluation, final question 42: the open question on what was disliked when interacting with Pepper, and the closing thank-you message.


A.5 Risk assessment

MACS Risk Assessment Form (Project)

Student: Fantine CHABERNAUD
Project Title: Gaze Interaction with an Animated Character
Supervisor: Dr Franck BROZ

Risk | Present (give details) | Control Measures and/or Protection
Standard office environment (includes purely software projects) | System implemented on a moving robot | Keeping the robot a few meters from the participant
Unusual peripherals (e.g. robot, VR helmet, haptic device, etc.) | None | None
Unusual output (e.g. laser, loud noises, flashing lights, etc.) | Nothing | Nothing
Other risks | None | None

Figure A.6: MACS risk assessment form of the project

Bibliography

Ackerman, E. (2016). CES 2017: Why every social robot at CES looks alike. IEEE Spectrum.

Bartneck, C., Kulic, D., Croft, E., and Zoghbi, S. (2009). Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics, 1(1):71–81.

Hung, V., Elvir, M., Gonzalez, A., and DeMara, R. (2009). Towards a method for evaluating naturalness in conversational dialog systems. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 1236–1241. IEEE.

McClave, E. Z. (2000). Linguistic functions of head movements in the context of speech. Journal of Pragmatics, 32(7):855–878.

Mead, R., Atrash, A., and Mataric, M. (2012). Representations of proxemic behavior for human-machine interaction. In NordiCHI 2012 Workshop on Proxemics in Human-Computer Interaction, Copenhagen.

Mead, R. and Mataric, M. J. (2016). Autonomous human–robot proxemics: socially aware navigation based on interaction potential. Autonomous Robots, pages 1–13.

Mitsunaga, N., Smith, C., Kanda, T., Ishiguro, H., and Hagita, N. (2008). Adapting robot behavior for human–robot interaction. IEEE Transactions on Robotics, 24(4):911–916.

Mumm, J. and Mutlu, B. (2011). Human-robot proxemics: physical and psychological distancing in human-robot interaction. In Proceedings of the 6th International Conference on Human-Robot Interaction, pages 331–338. ACM.

Okamoto, D. G., Rashotte, L. S., and Smith-Lovin, L. (2002). Measuring interruption: Syntactic and contextual methods of coding conversation. Social Psychology Quarterly, pages 38–55.

Skantze, G. (2016). Real-time coordination in human-robot interaction using face and voice. AI Magazine, 37(4):19–31.

Skantze, G., Hjalmarsson, A., and Oertel, C. (2014). Turn-taking, feedback and joint attention in situated human–robot interaction. Speech Communication, 65:50–66.

Tritton, T., Hall, J., Rowe, A., Valentine, S., Jedrzejewska, A., Pipe, A. G., Melhuish, C., and Leonards, U. (2012). Engaging with robots while giving simple instructions. In Conference Towards Autonomous Robotic Systems, pages 176–184. Springer.

Tsiakoulis, P., Gasic, M., Henderson, M., Planells-Lerma, J., Prombonas, J., Thomson, B., Yu, K., Young, S., and Tzirkel, E. (2012). Statistical methods for building robust spoken dialogue systems in an automobile. Proceedings of the 4th Applied Human Factors and Ergonomics.

Wada, K., Shibata, T., Musha, T., and Kimura, S. (2008). Robot therapy for elders affected by dementia. IEEE Engineering in Medicine and Biology Magazine, 27(4).

Yang, J.-Y., Dinh, B. K., Kim, H.-G., and Kwon, D.-S. (2013). Development of emotional attachment on a cleaning robot for the long-term interactive affective companion. In RO-MAN, 2013 IEEE, pages 288–289. IEEE.

Yu, Z., Bohus, D., and Horvitz, E. (2015a). Incremental coordination: Attention-centric speech production in a physically situated conversational agent. In SIGDIAL Conference, pages 402–406.

Yu, Z., Nicolich-Henkin, L., Black, A. W., and Rudnicky, A. I. (2016). A wizard-of-oz study on a non-task-oriented dialog systems that reacts to user engagement. In SIGDIAL Conference, pages 55–63.

Yu, Z., Papangelis, A., and Rudnicky, A. (2015b). TickTock: A non-goal-oriented multimodal dialog system with engagement awareness. In Proceedings of the AAAI Spring Symposium.

Złotowski, J., Sumioka, H., Nishio, S., Glas, D. F., Bartneck, C., and Ishiguro, H. (2016). Appearance of a robot affects the impact of its behaviour on perceived trustworthiness and empathy. Paladyn, Journal of Behavioral Robotics, 7(1).