Heriot-Watt University
Masters Thesis
Multimodal Interactions with a Chatbot and Study of Interruption Recovery in
Conversation
Author:
Fantine Chabernaud
Supervisor:
Dr. Franck Broz
A thesis submitted in fulfillment of the requirements
for the degree of MSc. Artificial Intelligence with Speech and Multimodal
Interactions
in the
School of Mathematical and Computer Sciences
August 2017
Declaration of Authorship
I, Fantine Chabernaud, declare that this thesis titled, 'Multimodal Interactions with a
Chatbot and Study of Interruption Recovery in Conversation' and the work presented
in it are my own. I confirm that this work submitted for assessment is my own and is
expressed in my own words. Any uses made within it of the works of other authors in any
form (e.g., ideas, equations, figures, text, tables, programs) are properly acknowledged
at the point of their use. A list of the references employed is included.
Signed: Fantine Chabernaud
Date: the 17th of August 2017
“Were we incapable of empathy – of putting ourselves in the position of others and
seeing that their suffering is like our own – then ethical reasoning would lead nowhere.
If emotion without reason is blind, then reason without emotion is impotent.”
Peter Singer, Writings on an Ethical Life, 2015
Abstract
This project presents a novel approach to enhancing Human-Robot Interaction
(HRI). One goal in HRI is to design robots that seem friendly and that feel
natural to interact with. Particular attention is paid to the interruptions
that can occur during a dialogue and to the recovery of the conversation.
Interruptions have many causes, such as a misunderstanding or an external factor.
For this purpose, the Pepper robot from SoftBank Robotics was equipped with both
a dialogue and a gesture system. The robot monitors the user's state, such as
gaze direction, facial expression and the distance between user and robot. In case of
an interruption of the dialogue, the system has to find a coherent recovery based
on verbal and non-verbal cues, and the recovery has to feel natural to the
user. The main hypothesis is that gesture behavior should help the robot
recover the conversation. The evaluation showed the importance of designing a recovery
according to the user's preferences. Even though it is inconclusive, the experiment
reveals clues for improving the recovery strategy.
Acknowledgements
I would like to warmly thank Doctor Franck BROZ for supervising my dissertation
project, for being understanding and for taking the time to answer my questions.
I am very grateful to Christian DONDRUP for his considerable help in under-
standing the software, for his patience in answering all my questions and for his kind
support. Without him, I would probably not have gotten so far in the implementation of
my project.
I thank all my friends and the people I had the chance to work alongside, who
contributed to an agreeable and efficient working atmosphere.
Contents
Declaration of Authorship i
Abstract iii
Acknowledgements iv
Contents v
List of Figures viii
List of Tables ix
Abbreviations x
1 Introduction 1
2 Literature review 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Note on terms used . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Dialogue system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Spoken Language Understanding . . . . . . . . . . . . . . . . . . . 4
2.2.3 Dialogue Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.4 Natural Language Generation . . . . . . . . . . . . . . . . . . . . . 5
2.2.5 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.6 Functional issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.7 Robot appearance . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Proxemics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Proxemics applied to HRI . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Proxemics to engage the user . . . . . . . . . . . . . . . . . . . . . 7
2.4 Turn-taking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Conversational strategies to engage the user . . . . . . . . . . . . . 10
2.5 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Requirements analysis 12
3.1 Project scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Project objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Python SDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2 Pepper robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.3 Naoqi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.4 Choregraphe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.1 Strategy of recovery . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.2 Design of the interruption scenarios . . . . . . . . . . . . . . . . . 18
3.4.3 Dialog system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Ethical and legal considerations 20
4.1 Ethical approval details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Professional issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Legal issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Ethical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Social issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Methodology 23
5.1 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Evaluation of the interruptions scenarios . . . . . . . . . . . . . . . 24
5.2 Experimental procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.2 Objective measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.3 Subjective measures . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.4 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Results 28
6.1 Observations of the evaluation . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Analysis of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3.1 Analysis of raw data . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3.2 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3.3 Comments analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7 Discussion 34
7.1 Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.3.1 What could enhance the implementation? . . . . . . . . . . . . . . 35
7.3.1.1 A smarter system . . . . . . . . . . . . . . . . . . . . . . 35
7.3.1.2 Improved dialogue system . . . . . . . . . . . . . . . . . . 35
7.3.1.3 Improvement of dialogue features . . . . . . . . . . . . . 36
7.3.1.4 A chatbot with memory . . . . . . . . . . . . . . . . . . . 36
7.3.1.5 Improvement of the tracking . . . . . . . . . . . . . . . . 36
7.3.2 What could be changed for the evaluation? . . . . . . . . . . . . . 36
8 Conclusion 38
A Appendix A 39
A.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
A.3 Project tasks and deliverables . . . . . . . . . . . . . . . . . . . . . . . . . 40
A.4 Example of a questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.5 Risk assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 47
List of Figures
2.1 Spoken dialogue system architecture . . . . . . . . . . . . . . . . . . . . . 4
2.2 Example of a dialogue by database retrieval . . . . . . . . . . . . . . . . . 5
3.1 Pepper robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Theoretical architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Final architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.1 Environment set up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Measures of naturalness in HRI . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1 Repartition of the interaction likability and the topics . . . . . . . . . . . 30
6.2 Measurement of the differences with/out recovery . . . . . . . . . . . . . . 31
6.3 Comparison between answers at first and second interactions . . . . . . . 32
A.1 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 41
A.2 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 42
A.3 Questionnaire used at the evaluation: first interaction . . . . . . . . . . . 43
A.4 Questionnaire used at the evaluation: final questions 40 & 41 . . . . . . . 44
A.5 Questionnaire used at the evaluation: final question 42 . . . . . . . . . . . 45
A.6 Questionnaire used at the evaluation . . . . . . . . . . . . . . . . . . . . . 46
List of Tables
3.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.1 Two-way repeated measures ANOVA . . . . . . . . . . . . . . . . . . . . . 31
6.2 Two-way repeated measures ANOVA . . . . . . . . . . . . . . . . . . . . . 33
A.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Details of the risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
A.3 Details of the main deliverables . . . . . . . . . . . . . . . . . . . . . . . . 40
Abbreviations
AI Artificial Intelligence
AIML Artificial Intelligence Markup Language
ALICE Artificial Linguistic Internet Computer Entity
ASR Automatic Speech Recognition
DM Dialogue Management
HRI Human Robot Interaction
NLG Natural Language Generation
SLU Spoken Language Understanding
SSH Secure SHell protocol
TTS Text To Speech
Chapter 1
Introduction
In everyday conversations, a great deal of information is produced and ex-
changed. In addition to the words themselves, visual cues such as body position and
gestures contribute to the conversation. They can emphasize an idea or express a feeling
such as rejection or interest. Many of these actions are unconscious and often
stem from our own personality. For instance, introverted people tend to be softer in
their voice and movements than extroverts. Two people interacting with each other
in a conversation have to deal with the complexity of their personalities and emotions.
Feelings can also filter someone's behavior: anger, ire or passion induce broader
gestures, a louder voice and aggressive postures, whereas annoyance and boredom
lead someone's attention away from the source of these emotions.
At some point in a conversation, an interruption may occur. Also called a
silence, it happens when, for some reason, the conversation prematurely ends.
Interruptions can be due to an external agent, a reaction of one of the participants
or a changing environment. This niche is the focus of the present research. It belongs
to a wider issue in human-robot interaction: the engagement of the user.
Nowadays companies are taking advantage of new dialogue systems to replace or improve
their assistance by phone or text message. These systems are artificial intelligences
able to hold a coherent dialogue with someone, using speech, graphics, gesture
or other communication channels. It is possible, for example, to order a pizza by
text message or to find the nearest coffee shop in a shopping mall by asking a robot to
show you the way. Dialogue systems designed to accomplish such tasks are called
task-based systems. Dialogue systems that aim to be friendly, in contrast, are
social conversational agents, also called chatbots: their goal is chat and
small talk. They are usually associated with a task-based system to help the user in
a task, such as a cooking helper or the robot in a shopping mall. They can be embodied
in a physical robot that can be talked to, or exist only as a text chat on a computer. In
order to gain the user's confidence, robots have to look friendly and behave naturally.
Thus current research, especially on humanoid robots, is focused on reproducing human
behavior. This relies on knowledge of dialogue and proxemics. Proxemics is the domain
of non-verbal language and of how people use space to interact with others.
In everyday life, we will not stand as close to someone we just
met as to a long-time friend. This distance rule is part of proxemic behavior,
which is shaped by unconscious reflexes and cultural education; the
unconscious part takes its roots in empathy. Proxemics is a low-level language, as it does
not express ideas as complex as speech does. We are able to communicate without sharing
the same language, as can easily be experienced by traveling abroad.
In fact, human behavior is intricate to understand and difficult
to simulate. Nonetheless it is possible to make a robot interact with and be understood
by people with basic common knowledge. Given a physical robot's sensors
and knowledge, what would be the best way for a chatbot to behave?
More precisely, how is it possible to identify an interruption in the conversation, taking
into account verbal and non-verbal behavior? For this project, it is interesting to focus
on both the dialogue and gesture systems: if the dialogue system breaks down, as happens
during an interruption, the gesture system will help to recover the conversation.
The aim of this research project is to understand the mechanisms of interruptions in
order to find solutions for recovering the conversation. The implementation focuses on
interruption by an external disruption, meaning that someone or something catches the
user's attention. As this project is related to computer science and robotics more than
to linguistics, the focus is on the application of dialogue and gesture systems on
a robot; there is no intention to cover every kind of interruption in depth.
Therefore, for the purpose of this research, only a few types of interruptions are
studied. The goal will be reached if participants talk to the robot naturally. The
project is conducted in three main parts. Firstly, much effort is put into the study of
the features of a conversation. Turn-taking is particularly important for the
study of interruptions because both involve a change of speaker.
The point is to gather information about the structure of dialogue and the effect of an
interruption. Secondly, the focus is placed on proxemics in human-robot interaction.
This includes the design of a recovery method according to the semantics of
the conversation and the attitude of the user. Thirdly, the system is implemented on
the Pepper robot and evaluated.
Chapter 2
Literature review
2.1 Introduction
2.1.1 Overview
Recovering from an interruption of a conversation involves proxemics, dialogue and,
more specifically, turn-taking knowledge. How should the next speaker take the speech
turn in a way that respects human conventions? Knowing that a speaker wants to be
listened to by others, what should prompt someone to take the speaker role? Such
questions can be partly answered by previous work in the main domains of turn-taking
and proxemic behavior. Turn-taking rules govern the switching of speakers in a
conversation; a switch happens when a speaker yields the speaking role to someone else.
Proxemics is interesting because it provides more than just verbal information: it
covers the whole body language someone uses to emphasize an idea. As relatively little
research has been done on interruptions in conversation, the present
project looks for information in adjacent domains. A first explanation of interruptions
will be drawn from the turn-taking review. In a second part, the review of proxemics
will help in designing a recovery method.
2.1.2 Note on terms used
The terms dialogue system and conversational agent refer to a computer system that can
coherently dialogue with a human. References to a chatbot, a chatter robot, a social
robot or a social conversational agent all denote the same technology: a dialogue system
designed to be friendlier by making small talk.
2.2 Dialogue system
2.2.1 Overview
A dialogue system is a computer process that involves a chain of modules. This process
is supposed to build a coherent conversation with the user. According to
[Tsiakoulis et al., 2012], five modules are internationally recognized: "automatic
speech recognition (ASR), semantic decoding or spoken language understanding (SLU),
dialogue management (DM), natural language generation (NLG) and speech synthesis or
text-to-speech (TTS)". These modules and their connections are summarized in the
following figure.
Figure 2.1: Spoken dialogue system architecture
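As a rough illustration, the five-module chain above could be sketched as a pipeline of functions. All names and placeholder return values below are illustrative assumptions, not part of any real SDK:

```python
# Hypothetical sketch of the five-module pipeline
# (ASR -> SLU -> DM -> NLG -> TTS). Each stage is a stub that
# stands in for a real component.

def asr(audio):
    """Automatic speech recognition: audio signal -> text utterance."""
    return "hello how are you"  # placeholder recognition result

def slu(utterance):
    """Spoken language understanding: text -> dialogue act."""
    return {"act": "greeting", "text": utterance}

def dm(dialogue_act):
    """Dialogue management: choose a response act."""
    return {"act": "greeting_reply"}

def nlg(response_act):
    """Natural language generation: response act -> text."""
    return "Hello! I am fine, thank you."

def tts(text):
    """Text-to-speech: text -> audio (here, just tagged text)."""
    return f"<speech: {text}>"

def dialogue_turn(audio):
    # One full pass through the chain, in the order of Figure 2.1.
    return tts(nlg(dm(slu(asr(audio)))))

print(dialogue_turn(b"..."))
```

In a real system each stub would be replaced by a full module, but the data flow between them stays the same.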
2.2.2 Spoken Language Understanding
The spoken language understanding module is located after automatic speech
recognition. It receives a signal containing the user's voice along with some external
noise, and processes the signal to retrieve only the utterance. It then converts the
recognized utterance into dialogue acts by parsing: the user's utterance is cut into
pieces and those pieces are labeled according to a specific grammar.
These acts are then processed by the dialogue manager.
2.2.3 Dialogue Manager
The dialogue manager is the core of the robot's understanding. As robots do
not have consciousness as we do, they do not properly understand the meaning and
semantics of a sentence. However, with basic rules of grammar, it is possible for them to
produce coherent responses. The dialogue manager thus looks for the nearest meaning of
the dialogue acts received from the previous module. Depending on the type of dialogue
manager, those acts are compared to stored acts from a database. The database
contains different dialogue acts and their responses. In a simple database-retrieval
system, the given response is the utterance that is closest to the given dialogue acts:
according to the acts, the system looks for a ready-made answer, so
the next module, the natural language generator, is hardly used. For instance,
if the user says "Hello, how are you?", the act is a greeting. The robot looks
into the greeting category of its database and retrieves the corresponding answer.
Figure 2.2: Example of a dialogue by database retrieval
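A minimal sketch of this database-retrieval idea, with a toy keyword-based act classifier standing in for a real SLU parser. All act categories, keywords and responses here are invented for illustration:

```python
# Database of pre-written responses, indexed by dialogue act.
DATABASE = {
    "greeting": "Hello! I am fine, thank you. And you?",
    "farewell": "Goodbye, it was nice talking to you.",
}

# Toy act classifier: keyword spotting stands in for a real parser.
KEYWORDS = {"hello": "greeting", "hi": "greeting", "bye": "farewell"}

def classify_act(utterance):
    for word in utterance.lower().split():
        word = word.strip(",.?!")
        if word in KEYWORDS:
            return KEYWORDS[word]
    return "unknown"

def retrieve_response(utterance):
    act = classify_act(utterance)
    # Fall back to a generic prompt when no stored act matches.
    return DATABASE.get(act, "Sorry, could you rephrase that?")

print(retrieve_response("Hello, how are you?"))
```

Because the answers are retrieved whole, generation reduces to a lookup, which is why the NLG module has little to do in this scheme.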
2.2.4 Natural Language Generation
The language generator reads the dialogue act output of the dialogue manager and
generates a natural response. It performs the same process as spoken language
understanding but in reverse: it puts pieces of utterance together, building
a sentence from small parts of a response. The difficulty lies in the naturalness and
quality of the final output utterance, which depends on the method used to
put the pieces together. The utterance then goes to the speech synthesizer to be heard
by the user.
There are three main methods for generating an utterance: rule-based templates,
grammar-based generation and machine learning. In the rule-based approach, each possible
sentence is designed by hand; the templates are written schemas that are completed with
the answers given by the dialogue manager. It is simple and allows full control
of the responses, but this method is hard to maintain, repetitive and manually
expensive. The grammar-based method can be used both for parsing in the spoken
language understanding module and for natural language generation: a set of grammar
rules is defined to write the responses. It gives more flexibility than rule-based tem-
plates, but it is still manually expensive. The third method is the most scalable,
portable and precise. It requires prior training with in-domain data to be accurate,
and numerous implementations exist. Essentially, the system improves itself
just by being used.
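The rule-based template method can be sketched as follows; the template strings and slot names are hypothetical examples, not the templates actually used in this project:

```python
# Written schemas with slots to be completed by values from the
# dialogue manager. Every sentence shape must be designed by hand,
# which is what makes this method expensive to maintain.
TEMPLATES = {
    "inform_weather": "The weather in {city} is {condition} today.",
    "confirm_order": "You ordered a {item}. Is that correct?",
}

def generate(act, slots):
    """Fill the template for the given act with slot values."""
    return TEMPLATES[act].format(**slots)

print(generate("inform_weather", {"city": "Edinburgh", "condition": "rainy"}))
```

A grammar-based or machine-learned generator would replace the fixed dictionary above with production rules or a trained model, trading hand control for flexibility.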
2.2.5 Hardware
The hardware sits at the input and output of the robot. The input is a microphone
together with automatic speech recognition (ASR), which determines the start and end
of the user's utterance by recording what the user says. The output is a speech
synthesizer (TTS) and a speaker. The speech synthesizer produces a voice aloud. It is
a complex system that is supposed to respect the punctuation and tone of natural
speech. Some systems use effects to simulate emotions and avoid a neutral, boring voice.
2.2.6 Functional issues
Current systems are not robust enough in noisy environments, and speech recog-
nition is often faulty. The experience with a chatbot can quickly turn out to be
frustrating: when the robot does not understand, it usually repeats the same utterance
over and over instead of trying to recover the dialogue by asking a random question or
changing the subject. The recovery method depends on the context of the experiment.
2.2.7 Robot appearance
An article on the IEEE Spectrum website [Ackerman, 2016] notes that robots look
friendlier when they have features like curvy forms, light-colored skin, and a child-like
appearance with a big head and big curious eyes. Furthermore, even current humanoid
robots have flaws that make them distinguishable from real humans, so it is easy to
recognize them as machines and distance ourselves from them. According to the study by
Złotowski et al. [2016], a "highly human-like robot is perceived as less trustworthy and
empathic than a machinelike robot with some human-like features". Their study also
shows that machine-like robots with human-like features and positive behavior create a
higher feeling of trustworthiness than human-like robots with the same features. However,
in the case of negative behavior, a machine-like robot will create more anxiety in the
user than a human-like robot. Appearance is thus a double-edged design decision
for social robots.
2.3 Proxemics
2.3.1 Proxemics applied to HRI
Body language can be very noisy: a turned head or a raised hand may carry no intention
to emphasize anything. Non-verbal language is based on body orientation, gaze,
gestures and how the person occupies space. In fact, all senses can be taken into
account: visual, auditory, olfactory, thermal and tactile. From those features,
we should obtain signs of engagement or disengagement. According to Mead
and Mataric [2016], there are different patterns of proxemic behavior. From a previous
study by Mead et al. [2012] we can draw three levels of proxemic representation: the
"physical representation, based on inter-agent distance and orientation", the "psycho-
logical representation, based on the inter-agent relationship" and the "psychophysical
representation, based on the sensory experience of social stimuli". The physical features
rely on the position and orientation of the agents. The psychophysical level con-
cerns the senses, such as hearing or vision. The psychological level concerns cultural
and personality aspects of a person. These three levels combine to create a behavior;
together they form a simple internal representation.
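For illustration, the three levels described above could be held in a simple data structure; the field names and example values are assumptions, not the representation used by Mead et al.:

```python
from dataclasses import dataclass

@dataclass
class PhysicalLevel:
    distance_m: float        # inter-agent distance
    orientation_deg: float   # relative body orientation

@dataclass
class PsychophysicalLevel:
    speech_loudness: float   # auditory stimulus as perceived (0..1)
    visibility: float        # visual stimulus as perceived (0..1)

@dataclass
class PsychologicalLevel:
    relationship: str        # e.g. "stranger", "friend"

@dataclass
class ProxemicState:
    """The three combined levels of proxemic representation."""
    physical: PhysicalLevel
    psychophysical: PsychophysicalLevel
    psychological: PsychologicalLevel

state = ProxemicState(
    PhysicalLevel(distance_m=1.2, orientation_deg=10.0),
    PsychophysicalLevel(speech_loudness=0.6, visibility=0.9),
    PsychologicalLevel(relationship="stranger"),
)
print(state.physical.distance_m)
```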
A robot that does not respect our distances by getting too close to us can be threatening.
To avoid that, we have to study human proxemic behavior in order to respect it. Four
models of proxemic mechanisms are described in the article by Jonathan Mumm
and Bilge Mutlu [Mumm and Mutlu, 2011]. The compensation model appears when one
partner balances an inappropriate behavior: for example, if the partner comes too close,
the other will avert their gaze or take a step backward. The reciprocity model appears
when both partners match each other's behavior. The attraction-mediation
model corresponds to a high level of attraction and therefore closeness, while a low level
of attraction induces a greater distance. The attraction-transformation model appears
when the level of attraction, either high or low, affects the behavior of both partners,
who adopt either compensation or reciprocation. The authors' hypothesis, following
the compensation model, is that participants will keep a greater distance from a
robot that maintains direct eye contact with them. In conclusion, a more likable
robot should see its distance to the user decrease; conversely, a user who steps
backward from a robot is likely disengaging from the conversation.
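The last observation suggests a simple disengagement cue: a user whose distance to the robot keeps increasing is probably withdrawing. The threshold below is an illustrative assumption, not a value from the cited studies:

```python
def is_withdrawing(distances, min_increase=0.3):
    """Flag withdrawal when the user-robot distance has grown by at
    least `min_increase` metres over the observed window.

    distances: chronological list of user-robot distances in metres.
    """
    if len(distances) < 2:
        return False
    return distances[-1] - distances[0] >= min_increase

# A user backing away from 0.8 m to 1.3 m would be flagged:
print(is_withdrawing([0.8, 1.0, 1.3]))   # True
print(is_withdrawing([0.8, 0.8, 0.7]))   # False
```

A real system would smooth the distance readings and combine this cue with gaze and facial-expression signals before deciding that the user has disengaged.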
2.3.2 Proxemics to engage the user
According to Skantze et al. [2014], eye gaze, gesture and verbal expressions are
important to engage the user. Acknowledgements such as short utterances or nodding
make the user more confident while speaking. The user is especially more engaged in
a joint task completion with a robot that uses its gaze to show things. One could then
think that reproducing natural human behavior always makes the robot more engaging in
HRI. However, according to an observation of eye blinking by Tritton et al. [2012],
blinking does not affect the participants even though it is a natural feature.
McClave [2000] studied dialogues and annotated the basic non-verbal language signs.
She concluded that "many head movements co-occurring with speech are patterned.
They have been shown to have semantic, discourse, and interactive functions".
Non-verbal signs are codified and can be organized as a language.
Yu et al. evaluated the importance of gaze monitoring for keeping users engaged with a
dialogue system. Vision is used as a means to help turn-taking in speech; it also
helps to adjust the strategies according to the listener's attention. If this attention
is too low compared to what was expected, then "the model triggers a sequence of
linguistic devices". The study focuses on two types of attentional demand. The "onset
demand" is the level of attention required from the listener when the system begins
its phrase. The "production demand" concerns parts of the dialogue that carry important
information and therefore particularly require the listener's attention. The system was
tested outdoors, in a place with many distractions for the participants. The "coordinative
policy" is composed of a set of interjections that aim to get the participant's attention.
The study shows that improper behavior has a negative effect on engagement. For
instance, if the system does not detect any attention towards itself when in reality there
is some, it keeps trying to attract attention. The participant, paying attention to
a system that does not recognize it, becomes confused and will most likely
withdraw his or her attention. The system should aim to avoid such incoherent behaviors.
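The attention-triggered recovery idea can be sketched as follows; the scores, threshold logic and interjection phrases are illustrative assumptions, not the actual policy of Yu et al.:

```python
# Interjections standing in for a "coordinative policy".
INTERJECTIONS = ["Excuse me,", "Hey,", "Are you still with me?"]

def maybe_interject(attention, demand, turn_index):
    """Return an interjection when measured attention falls below the
    demand level of the current utterance, otherwise None.

    attention: estimated listener attention in [0, 1].
    demand: required attention (onset or production demand) in [0, 1].
    """
    if attention < demand:
        # Cycle through the list on successive low-attention turns.
        return INTERJECTIONS[turn_index % len(INTERJECTIONS)]
    return None

print(maybe_interject(0.2, 0.7, 0))  # attention below "production demand"
print(maybe_interject(0.9, 0.7, 1))  # attentive listener: no interjection
```

The failure mode described in the text corresponds to the attention estimate being wrong: if `attention` is under-measured, the system interjects at an already attentive user and confuses them.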
2.4 Turn-taking
Turn-taking rules are part of the conversation process. Interruptions are a kind of turn-
taking situation that will be highlighted in this project; it is therefore worth
understanding this feature.
What makes a stable group interaction order? According to the study [Okamoto et al.,
2002], there are two possible measures of an interruption: the syntactic structure and
the situation. When the interruption appears more than two syllables away from a pos-
sible turn-transition space, it is measured by syntactic structure. The situation measure
depends on the context and the cultural views of both speakers. A turn-taking space
occurs if one of the following conditions is fulfilled: the speaker ends and designates
the next speaker; the speaker continues; another speaker enters the conversation; or the
transition is smoothly made through cooperation. Another method is presented in the re-
view part of [Okamoto et al., 2002], with four different cases of turn-taking
depending on context and culture. Firstly, the speaker is cut off before he or she
has made the first point of the conversation. Secondly, the speaker has made the
first point of a turn. Thirdly, the speaker is cut off in mid-clause after the first point of
a turn. Lastly, the speaker begins to speak during a pause or at another's turn-end
signal. In conclusion, there is no clear-cut rule for proper turn-taking.
Natural turn-taking is more flexible than current conversations with robots, where the
usual way is to press a button or to finish a sentence and stop speaking. Turn-
taking is much more complicated in group problem-solving, as it requires
a camera and an analysis of non-verbal language; this is the problem of joint
activity described in [Skantze, 2016]. Vocal cues to turn completion are pitch,
duration and energy. If the pitch is clearly rising or falling, the turn is being
yielded. Duration concerns the length of the phonemes: short ones signal a turn being
yielded, as do lower intensity and low loudness. Visually, a speaker breathing
in will continue to speak, whereas a speaker looking away or looking back to someone
else is probably yielding the turn. However, unexpected interruptions can still happen,
coming from an external source such as someone seeking urgent attention. Finally, to
generate a coordination signal, the user most likely to take the turn is the one toward
whom the robot is looking.
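For illustration, the three vocal cues above could be combined into a toy turn-yield detector; the thresholds and the majority-vote rule are assumptions, not values from the cited work:

```python
def turn_is_yielded(pitch_slope, phoneme_dur_ms, energy):
    """Vote over the three vocal cues described in the text: clear pitch
    movement, short final phonemes, and low energy each count as one
    vote for the turn being yielded."""
    votes = 0
    if abs(pitch_slope) > 0.5:      # clearly rising or falling pitch
        votes += 1
    if phoneme_dur_ms < 80:         # short final phonemes
        votes += 1
    if energy < 0.3:                # low loudness / intensity
        votes += 1
    return votes >= 2

print(turn_is_yielded(pitch_slope=-0.8, phoneme_dur_ms=60, energy=0.5))  # True
print(turn_is_yielded(pitch_slope=0.1, phoneme_dur_ms=120, energy=0.6))  # False
```

A full system would also fold in the visual cues (breathing in, gaze direction) before committing to a turn change.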
Both the robot and the user must learn each other's manner in order to achieve smoother
turn-taking. Mitsunaga et al. implemented a system that learns from the user
and adapts its behavior. It works with a reward function that improves behavior according
to a few parameters: the distance, the gaze-meeting ratio, the
waiting time and the motion speed. The distance between the user and the robot is
defined by proxemics, where humans have specific interaction distances according to
their relationship: intimate, personal and social distances. The gaze-meeting ratio
corresponds to the number of times the user's gaze meets the robot's gaze over the
length of the interaction. The waiting time is the time between the pronounced utterance
and the action. Finally, the motion speed governs the speed at which gestures are made.
Different configurations of these parameters were tested and the users' impressions
recorded, so that the system's behavior could be adapted. However, each configuration
of parameters relies on subjective impressions; for instance, some users prefer a
wider distance than others.
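A hedged sketch of such reward-driven adaptation follows; the update rule and parameter values are invented for illustration (the original work uses a more elaborate learning scheme):

```python
# The four behaviour parameters named in the text, with assumed
# starting values.
PARAMS = {
    "distance_m": 1.0,         # interaction distance
    "gaze_meeting_ratio": 0.5,
    "waiting_time_s": 1.0,
    "motion_speed": 0.5,
}

def adapt(params, gradients, reward, lr=0.1):
    """Nudge each parameter along its gradient, scaled by the reward
    derived from the user's recorded impressions."""
    return {k: params[k] + lr * reward * gradients[k] for k in params}

# Example: a positive reward after the user stepped back nudges the
# interaction distance upward, leaving the other parameters untouched.
updated = adapt(PARAMS, {"distance_m": 1.0, "gaze_meeting_ratio": 0.0,
                         "waiting_time_s": 0.0, "motion_speed": 0.0},
                reward=0.8)
print(round(updated["distance_m"], 2))  # 1.08
```

Because the reward comes from subjective impressions, the learned configuration is per-user: the same update applied to two users can converge to quite different preferred distances.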
2.4.1 Conversational strategies to engage the user
In the study [Yu et al., 2015b] the team developed a conversational agent called TickTock.
They implemented a retrieval-based dialog system. Its database comes from interviews on a
CNN show and stores the participants' questions and answers. The dialog system retrieves
the answer whose question is the most similar to the one asked of it. The retrieved
utterance is then combined with one of two strategies: active or passive. During the
evaluation, the conversation between participants and the dialogue system was recorded so
that the engagement level could be annotated on a five-level scale. As engagement is very
subjective and personal, the participants were also asked to annotate their own engagement
for each turn.
In a later study by Yu et al., the Multimodal TickTock system was used to compare two
strategies based on their engagement level. The physical system has a cartoon face that
represents the system talking; with a non-human-like face, the user expects a different
behavior than from another human. The dialogue system of TickTock is based on database
retrieval: it compares the user's utterances with those in the database and retrieves the
response. Five strategies are designed to improve user engagement: switch topics, initiate
activities, end topics with an open question, tell a joke and refer back to a previously
engaged topic. The user's engagement is monitored by the system in real time: it evaluates
the user's utterances to adopt the best strategy to keep the user engaged. Two versions of
the TickTock system are compared. The REL version uses only the five strategies above,
whereas REL+ENG uses the five strategies plus a reaction to low user engagement. In a
first evaluation, recorded interactions were annotated by experts. A second, Amazon
Mechanical Turk study used an independent jury and only interactions of users with no
previous knowledge of dialog systems: pairs of recorded interactions with REL+ENG and REL
were annotated by non-expert people. Both evaluations were then compared to find which
version of the system is best; REL+ENG indeed seems better as it monitors the user's
engagement.
In another study, Yu et al. focused on gaze to get attention and to monitor the listener's
level of attention. Gaze is used as a means of making turn-taking in speech more precise,
and it helps to adjust the strategies according to the listener's attention. If the
attention is too low compared to what was expected, then "the model triggers a sequence of
linguistic devices" such as interjections. The study focuses on two types of attentional
demand. The "onset demand" is the level of attention required from the listener when the
system begins its phrase. The "production demand" concerns parts of the dialog that carry
important information and therefore particularly require the listener's attention. The
system was tested outdoors, in a place with many distractions for the participants. The
"coordinative policy" is composed of a set of interjections that aims to get the
participant's attention. The study shows that improper behavior, such as the system
detecting no attention on itself and trying to attract attention anyway, has a negative
effect on the participant, who will then withdraw attention. If the listener's attention
is insufficient, the system executes a pause, an interjection, and again a pause. If
attention has been successfully drawn, the system then says the utterance. If there is
still not enough attention, the system pauses after two words, then says the entire
utterance. The pauses last 1.5 to 2.5 seconds.
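The pause-interject-pause sequence above can be written as a small policy. This is a sketch under stated assumptions: the `attention_ok` probe, the interjection text and the action-tuple representation are inventions of this example; only the ordering and the pause durations follow the text.

```python
# Sketch of the coordinative policy: pause, interject, pause, then speak; if
# attention is still low, say two words, pause 1.5 s, then the full utterance.
# attention_ok is a hypothetical callback returning True when the listener
# is attending.

def coordinative_policy(utterance, attention_ok, interjection="Excuse me,"):
    actions = []
    if not attention_ok():
        actions += [("pause", 2.0), ("say", interjection), ("pause", 2.0)]
    if attention_ok():
        actions.append(("say", utterance))
    else:
        words = utterance.split()
        actions += [("say", " ".join(words[:2])), ("pause", 1.5),
                    ("say", utterance)]  # repeat the entire utterance
    return actions

# Hypothetical listener who starts attending only after the interjection:
state = {"attending": False}
def probe():
    was = state["attending"]
    state["attending"] = True
    return was

# Four actions: pause, interjection, pause, then the utterance
print(coordinative_policy("Please look at the map", probe))
```

An attentive listener gets the utterance immediately; an inattentive one gets the full escalation.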
2.5 Hypothesis
Among all the types of interruption, the study focuses on one particular pattern, since
the domain of interruptions is broad. For the purpose and simplicity of the evaluation,
the choice was made to focus on interruption by external disturbance, because an external
interruption is easy to set up with regularity and consistency. The gesture system, in
addition to the dialogue system, is expected to help the robot recover the conversation.
As shown in many HRI experiments from the literature, participants' feedback favors
embodied robots with coherent gesture and gaze behavior. The study will evaluate the
effectiveness of recovery using different gesture strategies.
Chapter 3
Requirements analysis
3.1 Project scope
The project provides a multi-modal interaction system implemented on the Pepper robot.
Engagement is emphasized with proxemics methods in a conversation. As robots become more
and more common in everyday life under multiple forms, they have to be adaptable in order
to fit into people's lives. This implies a capacity to interact with humans and to be
friendly enough to be accepted by the overall population. Many companies have tried to
develop perfect humanoid robots to make them more appealing; to avoid the Uncanny Valley,
which describes the strangeness of an approximate human-like model, much time has been
spent researching physical appearance. Machine-like robots can be appreciated on the
condition that they behave properly: people get emotionally attached to these kinds of
robots even if their shape is not human-like, as summarized by Yang et al. [2013]. Thus
the recovery of the conversation after an interruption plays a role in how to better
engage the user in interaction with robots, and may improve the user's experience with
them.
3.2 Project objectives
The main goal is to implement a physical robot with both dialogue and gesture systems.
Both dialogue and gesture systems have to be designed according to the robot’s features
and functions. The challenge lies in the coordination of both systems to get a successful
recovery after the interruption. Ideally, the final system is efficient enough to smoothly
recover from an interruption. At the very least, the work helps to better understand
human-robot interaction, especially in the case of interruption recovery.
3.3 Materials
3.3.1 Python SDK
The documentation for the Pepper robot from SoftBank covers Python and C++ programming.
Python was chosen to program the Pepper robot: it is an easy, flexible and widely used
programming language, well documented and with a large programmer community.
3.3.2 Pepper robot
The Pepper robot is produced by Aldebaran, part of SoftBank Robotics. This robot was
especially designed for social interactions. It has a humanoid shape: a body mounted on
wheels, moving arms and a head (see figure 3.1 from the SoftBank Robotics documentation).
The robot has two HD cameras, one in each eye, to monitor the user. This feature is used
to track the user's head so that the robot keeps the user in its field of view by
adjusting its head position. Thanks to pre-implemented functions, the robot can monitor
the user's feelings and reactions in real time, which is why it is particularly
appreciated for social robotics. It is also able to avoid collisions while moving its head
or arms, or while walking, by using its sonar and lasers. The table below summarizes the
robot features used to observe proxemic cues.
Sensors and actuators          Use
Lasers / sonar                 Calculate the distance between the robot and the user
LED eyes (green, red, blue)    Indicate three states: listening, speaking, processing
Loudspeakers                   Synthesize the robot's voice
Microphones                    Listen to the user
2 HD cameras in the eyes       Monitor the user's mood
1 3D camera                    3D perception of the environment

Table 3.1: Robot sensors and actuators and their uses
The connection between the computer and the robot is made over Wi-Fi or an Ethernet cable.
For the dialog system using QiChat, described in the following section, we need access to
the robot's internal memory. To do so, we connect remotely by Secure Shell, commonly
called "ssh", which allows topics and files to be loaded onto the robot.
Figure 3.1: Pepper robot
The implementation gradually revealed some limits of the robot. The accuracy of its
inferences depends on the reliability of its sensors; for instance, a bright environment
perturbs the camera and the robot loses the ability to track a face. Another drawback is
the lack of multi-tasking ability. Each movement naturally takes a certain time to
execute, but this is an issue because the dialog stops during the movement, which loses
naturalness. Similarly, when the robot is speaking, it cannot listen at the same time. We
observe that timing is very important in a conversation. In everyday life, reactions
between humans happen at a fast pace, whereas in human-robot interaction the robot is
usually slower than the human. This was the case in a previous study by Yu et al., where
the robot was still trying to draw the user's attention even though the user was already
paying attention, due to a delay between the robot processing the cues of the user
listening and reality. This incoherence is very detrimental to the user's engagement, as
it shows a lack of understanding from the robot and contributes to decreasing confidence
in it. Delays in gestures or reactions sometimes give the same feeling of an incoherent
robot. Finally, the robot can be misled by parasitic movements or filler sounds such as
"mmmm" or "euuu" from the user.
3.3.3 Naoqi
Naoqi is the system of libraries gathering functions for the Pepper robot, released and
updated by SoftBank Robotics. It is a ready-to-use pack that comprises events, animations
and functions, which simplifies the programming process: it allows a focus on high-level
programming rather than on controlling actuators or connections. For instance, to make the
hand wave, we simply call a function with the name of the movement to execute as an
argument. Of course, depending on the complexity of the functions, and particularly on the
sensors needed, the script will be more or less complex. Even though the robot can be run
without Naoqi, it was used in the present project in order to focus on programming the
recovery strategy.
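The "call a function with the movement name" pattern can be sketched as follows. The robot IP and the behavior path are assumptions; `ALAnimationPlayer` is part of the NAOqi API, but a stub proxy stands in for the robot so the sketch runs without hardware.

```python
# Sketch of triggering a pre-installed gesture via Naoqi. On a real Pepper,
# proxy_factory would be naoqi.ALProxy; the IP address and the behavior path
# are illustrative assumptions.

def wave(proxy_factory, ip="192.168.1.10", port=9559):
    player = proxy_factory("ALAnimationPlayer", ip, port)
    player.run("animations/Stand/Gestures/Hey_1")  # assumed wave gesture path

# Stub proxy that records calls instead of contacting a robot:
class StubProxy:
    calls = []
    def __init__(self, name, ip, port):
        self.name = name
    def run(self, path):
        StubProxy.calls.append((self.name, path))

wave(StubProxy)
print(StubProxy.calls)
```

Injecting the proxy factory also makes this kind of robot code unit-testable off-robot, which is useful given the hardware limits described above.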
3.3.4 Choregraphe
Choregraphe is the software accompanying robots from SoftBank Robotics. It is a graphical
programming environment for implementing behaviors on the robot. Its graphic window has
features to monitor the robot's state, to display the view from its cameras and to show a
model of its environment. First the robot has to be connected via its IP address. To
implement a new behavior, we simply scroll through the main menu, look for the desired
behavior and drag it to the center of the window, where it forms a box. After connecting
this box with start and end arrows, the behavior is ready to be executed on the robot.
This software was useful in this project for understanding the robot's range of abilities,
and it was used to follow the dialogues, as it displays what the robot understands in
real time.
3.4 Implementation
3.4.1 Strategy of recovery
The final design of the system is a combination of all the research presented before and
the specific features of the Pepper robot. The overall structure was intended to look like
figure 3.2.
Figure 3.2: Theoretical architecture
The modular architecture was convenient for achieving efficiency, as each module could be
improved independently. In this first attempt at an architecture, the turn-taking manager
would have been at the centre of the system, orchestrating the information coming from the
events triggered by the "User's emotion monitoring" module and processing the response.
The response would come either from the "Conversational agent" module or, in case of an
interruption, from the "Interruption manager" module. This organization highlights the
importance of appropriate turn-yielding by the "Turn-taking manager" to avoid silences or
cutting the user short in her/his speech. Moreover, quick and coherent reactions are
enabled by awareness of the user's state thanks to the "User's emotion monitoring". The
"Memory"
and machine learning modules are key to a better system. These elements are standard in
humans and play a key role in relationships; they are what makes a system smart. For
instance, a robot that remembers who you are and your preferences is more likable. In
practice, the structure turned out to be much simpler: gesture, dialogue and user
monitoring are all handled at more or less the same level on the robot. Unfortunately,
memory and machine learning were not implemented, as the focus was on the recovery itself.
The final architecture of the system is shown in figure 3.3.
Figure 3.3: Final architecture
Basically, the main script handles the monitoring of the user through events. If gazes
meet, an interaction begins and the main script subscribes to the QiChat topic. During the
dialogue the events remain active, so if something happens, such as an interruption, the
robot is able to react. Ideally, the dialogue and gesture behaviors run smoothly in
parallel; unfortunately, reactions take some time to process, during which the dialogue is
deactivated. In this model there is no memory and no machine learning: the system is not
smart, as all behaviors had to be implemented by hand.
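The main-script flow of figure 3.3 can be sketched as follows. The method names follow the NAOqi ALDialog API, but the topic path, the subscriber name and the gaze test are illustrative assumptions, and a stub proxy replaces the robot so the sketch runs standalone.

```python
# Sketch of the main-script flow: when gazes meet, load and activate a QiChat
# topic via ALDialog. Paths and names are assumptions of this example.

def start_interaction(dialog, gaze_meets, topic_path="/home/nao/greeting.top"):
    if not gaze_meets:
        return False
    topic = dialog.loadTopic(topic_path)    # load the .top file on the robot
    dialog.activateTopic(topic)
    dialog.subscribe("interruption_study")  # start the dialog engine
    return True

# Stub standing in for ALProxy("ALDialog", ip, port):
class StubDialog:
    def loadTopic(self, path):
        return "greeting"
    def activateTopic(self, topic):
        self.active = topic
    def subscribe(self, name):
        self.subscribed = name

d = StubDialog()
print(start_interaction(d, True))  # True
```

On the real robot, the gaze test would come from the face-tracking events described earlier rather than a boolean argument.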
3.4.2 Design of the interruption scenarios
For the purpose of the evaluation, each participant interacts twice: once with the
recovery strategy and once without. The interaction without recovery is a reference
interaction in which the robot does nothing in case of an interruption. The recovery
strategy is triggered if the user is looking away or not talking. While the user is
looking away, a counter is incremented up to a certain threshold; this method is inspired
by the attention level in the study by Yu et al. When the threshold is exceeded, i.e. the
user is not paying attention, the robot recovers the attention by waving a hand and making
a vocal sound. The hand waved is the one on the side of the user's gaze: the robot detects
the orientation of the head and chooses the proper movement to execute. For instance, if
the user is looking away to the right, the robot waves its left arm. These design choices,
however, rely on personal experience and common sense. If the user then looks back at the
robot, the recovery is successful. If not, the counter keeps incrementing and, at a higher
threshold, the robot moves to the side. Ideally the robot has a movement that shifts
itself in front of the user if the latter does not give his/her attention back despite the
call; the intent is to meet the user's gaze again by placing the whole robot back in front
of the user. To detect an interruption, the system uses the camera and an event that
detects silences in the dialogue. If there is a silence of 20 seconds, the robot tries to
engage the user again by waving a hand and calling "hey". If the user is looking away, the
robot waves its hand; if the user still has not looked back at the robot, the robot shifts
in front of the user in a last attempt to meet his/her gaze. These strategies use the
Naoqi libraries and their ready-to-use mechanisms called events. Events are triggered by
stimuli; each event has its own corresponding stimulus, such as the appearance of a head
in the robot's field of view or a shoulder being touched. When an event is triggered, a
flag is raised and a callback function is called. The reaction then depends on what is
coded in the callback function.
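The counter-and-threshold logic above can be sketched per monitoring tick. The escalation order (wave and "hey", then shift in front of the user) and the arm choice follow the text; the threshold values and the string action labels are assumptions of this sketch.

```python
# Sketch of the counter-based recovery: while the user looks away the counter
# grows; the first threshold triggers a wave plus "hey", the second makes the
# robot shift sideways. Threshold values are illustrative assumptions.

WAVE_THRESHOLD, SHIFT_THRESHOLD = 3, 6

def recovery_step(looking_away, gaze_direction, counter):
    """Return (new_counter, action) for one monitoring tick."""
    if not looking_away:
        return 0, None                      # attention regained: reset
    counter += 1
    if counter == SHIFT_THRESHOLD:
        return counter, "shift_in_front_of_user"
    if counter == WAVE_THRESHOLD:
        # wave the arm on the side the user is looking toward
        arm = "left" if gaze_direction == "right" else "right"
        return counter, "wave_%s_and_say_hey" % arm
    return counter, None

counter, log = 0, []
for tick in [True] * 6:                     # user keeps looking to the right
    counter, action = recovery_step(tick, "right", counter)
    if action:
        log.append(action)
print(log)  # ['wave_left_and_say_hey', 'shift_in_front_of_user']
```

On the robot, each tick would be driven by the gaze-direction events rather than a fixed loop.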
3.4.3 Dialog system
The first and ideal plan was to implement a full chatbot, which would have given the user
complete freedom to talk to the robot. However, due to the time constraints of this
project, it was decided to use the dialog system specific to SoftBank's robots: QiChat. It
is composed of topics, which are files loaded into the robot's internal memory and
correspond to the linguistic knowledge of the robot. Each topic contains the utterances
and words that the robot can recognize. The principle is very simple: if the robot
recognizes an utterance, it gives the corresponding answer. For instance, a "Hello" will
trigger the answer "Hello, how are you?". A human input associated with a relevant robot
output is called a rule; when the robot output waits for a specific answer, this goes into
a sub-rule. More complex reactions can be designed with functions such as "topicTag",
which allows switching to another topic when a keyword is said. In this project the
dialogues are split in two: a greeting topic first, then a small-talk topic, linked by the
"topicTag" function. The major drawback is that the user is restricted to a limited list
of words and utterance configurations. When the user says something, the robot computes
the utterance and gives a confidence percentage for what was understood; it often appears
that what is understood is not what was said. If the percentage is below the threshold,
here 50%, the utterance is not taken into account and the robot goes back to a listening
state. If the percentage is over the threshold, the robot considers the utterance
understood and answers it. The robot only knows the list of words from the topics that can
be recognized and understood. The drawback of this method is that the system neither
learns nor memorizes any information about the user.
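As an illustration, a minimal QiChat topic following the rule and sub-rule pattern above might look as follows. The topic name and the rules are invented for this sketch; the `u:` and `u1:` syntax follows the QiChat documentation.

```
topic: ~greetings()
language: enu

u:(Hello) Hello, how are you?
    u1:(I am fine) Glad to hear it! Shall we plan a trip?
u:(Goodbye) Goodbye, see you soon.
```

Each `u:` line pairs a recognizable human input with the robot's answer, and the indented `u1:` sub-rule is only matched after its parent rule's output.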
Chapter 4
Ethical and legal considerations
4.1 Ethical approval details
The aim of the study is to create a system efficient at recovering conversation between a
chatbot and a human. This system uses speech and proxemics analysis and takes place in
human-robot interaction. The experimental method involves a person speaking with the
system implemented on the Pepper robot. The participant is asked to play some scenarios
simulating different conversations, with a second task designed to create interruptions.
The interaction may be recorded as text, audio and video for post-analysis. The
participant is asked to fill in a questionnaire about his/her feelings after the
interaction.
The project involves human subjects and personal data identifiable with living people.
The project’s evaluation consists of evaluating a user interface according to the descrip-
tion: ”Interface only approval can be granted to projects that will be evaluating a user
interface by observing individuals using the software and performing a system usability
scale (SUS) questionnaire. Participants must be staff or students of Heriot-Watt Uni-
versity. The standard consent form must be completed by the participant prior to the
study and stored by the student conducting the project. All collected data must be
anonymised. No sensitive data will be collected. Only standard computing equipment,
i.e. an office PC, laptop, tablet, or mobile phone, will be used.”. Thus to protect the
data collected during the evaluation according to the university policy, the following
rules were agreed:
• Identifiable data will be stored on a secure machine with restricted access
• Data will be anonymised before publication unless consent has been given
• Identifiable data will only be retained for the duration of the consent granted by
the participant
• External data and systems will be used within the license terms specified
4.2 Professional issues
The professional issues concern the code, which should be well commented and indented to
make it clear. It is important to think of further experiments and to prepare the code to
be read by someone else; this also makes debugging easier.
4.3 Legal issues
For the implementation, legal issues are a matter of respecting the licenses of the
software and the robot. All credit for open-source code such as the ALICE bot goes to its
authors; ALICE is open software that anyone is free to use and modify. The IBM SPSS
software is used under the university's license.
For the experiments with participants, some personal information such as name, gender and
age is collected. This information is confidential and will be protected; no data will be
used outside the study. All sensitive information will be kept password-protected on a
memory device. Moreover, to analyze the experiments, it may be useful to record the
conversation on video and audio, so each participant must agree to the consent form before
taking part in the experiment. The footage will be protected as virtual data in an
encrypted, password-protected folder.
4.4 Ethical issues
Nowadays more and more robots are present in our public and private places, which creates
many privacy and security issues. For instance, connected robots can be accessed by remote
hackers. In everyday life, a robot around us at home, at work or in public places gathers
a lot of information about our habits that could be stolen. The Pepper robot, for
instance, is likely to be deployed in public places to help people find their way, where
it can monitor the flow of people going around it. The field of social robotics is brand
new, and there are many discussions around ethics rules that should eventually be written
into law. For now, it is a matter of each individual's responsibility whether or not to
have such a robot at home.
However, the confidentiality issue of social robots is not specific to the current
project; it is a broader issue that includes all chatbots and HRI in general. Thus this
project in itself raises no unique ethical or legal issues.
4.5 Social issues
Today's chatbots are replacing some people for simple tasks such as ordering a pizza or
finding the way to the nearest coffee shop. They are meant to enhance our everyday life by
finding solutions and making our lives slightly easier. They are part of the connected
movement in which smart devices spread everywhere. Although they are not yet popular
everywhere in the world, it is only a matter of time before the technology becomes better,
cheaper and part of our habits. Toward this end, the study of recovery in conversation is
a step in the improvement of current chatbot technologies. Indeed, robots used in public
places would be annoying if they could not recover from interruptions; even when the robot
fails, a system that can recover from interruptions is more engaging. This would give the
public a better overall impression of social robots, which would then be more easily
integrated into daily life.
The point of this research is to make robots more friendly so that people can have more
empathy for them. There is an open question about whether more empathy toward robots is
desirable in places like battlefields or health care. Robots are created and meant to
serve humans: would it be fair to enhance robots that are meant to be destroyed? What if
people give more value to the robot than to their own life? Numerous studies have been
conducted in health care where social animal robots were used as therapy to help people
with dementia. Wada et al. [2008] show in their study that people form a real attachment
to the animal robot; reprogramming or changing the animal robot would be detrimental in
such a case.
Chapter 5
Methodology
5.1 Experimental design
5.1.1 Environment
The study is conducted at the university. The room used for the interactions has to be
quiet, to avoid external pressure or noise that would skew the results. The participants
stand in front of the robot at a distance of their choice, within what the room offers.
The experimental setup comprises the Pepper robot, a computer to initiate the robot and
two cameras to record the interactions: the side view of both participant and robot and
the participant's face are recorded. The recorded videos are used to measure behaviors and
reactions in the post-processing analysis.
Figure 5.1: Environment set up
5.1.2 Evaluation of the interruptions scenarios
The participant is told to achieve two tasks: speak to the robot and read a small article
on a tablet. Three different dialogues are implemented on the robot, along with one
recovery strategy. Each participant is invited to interact twice with the robot, once with
the recovery turned on and once without, so that everyone gets a specific combination of
dialogues and strategies. The dialogues are short talks in which the robot asks questions
and the user answers with a short utterance or a single word. One interaction is about
planning a trip abroad: the robot asks the participant simple questions as if they were
planning a trip together. Another talk is a guessing game in which Pepper describes a
thing using adjectives and the user has to guess what it is. Finally, the third talk is
about ordering a pizza: the user gives his/her food preferences in response to Pepper's
questions. The recovery strategy is described in chapter 3, section 3.4.2, "Design of the
interruption scenarios". If the participant looks away for a certain time, the robot
detects an interruption and reacts according to the implemented strategy. Once the
participant has been briefed, she/he is invited to stand in front of the Pepper robot to
start the interaction. The participant is free to organize his/her time to finish the
tasks. As the robot has limited implemented knowledge, the participant has a sheet with
the keywords to use during the conversation.
5.2 Experimental procedure
Participants are invited to fill in a consent form beforehand. It describes the overall
goal of the experiment and asks whether the footage may be used in a later publication.
Participants are then told how the experiment will take place. They receive a guideline
sheet with instructions and the words they can use for the first interaction, then the
tablet with the article to read, and the interaction begins. When the first interaction is
finished, the participant is asked to fill in a first questionnaire using three subscales
from the Godspeed questionnaire of Bartneck et al. [2009]. The participant then interacts
with the robot again under a new configuration and fills out the questionnaire again.
Finally, a last questionnaire asks a couple of questions on the overall impression of the
robot. Meanwhile, both interactions are recorded on camera to allow a post-experiment
analysis of the data.
5.3 Measures
5.3.1 Overview
The evaluation is a core part of this project, as it represents the achievement of the
research: all the theory is put into practice. According to Hung et al. [2009], the
measurements can be split into two groups: subjective and objective measures. The
following figure comes from the article of Hung et al. [2009] and summarizes the main
measures of naturalness in HRI.
Figure 5.2: Measures of naturalness in HRI
5.3.2 Objective measures
Objective measures are based on the recordings and the interaction itself. They involve
countable metrics such as the length of the interaction, from the moment the user looks at
the robot until the end of the conversation triggered by a specific term such as "Good
bye". The ratio of successful recoveries to interruptions highlights the chatbot's ability
to recover the conversation.
5.3.3 Subjective measures
The subjective measures rely on the participants' feedback after the interaction. The
following metrics are subjective because they involve feelings and impressions, which are
not universally reliable. Nonetheless, it is relevant to analyze this feedback from a
large group of participants to find a trend if one exists. There are three questionnaires
for the whole evaluation, filled in by the participant after each interaction. Two
questionnaires are inspired by the Godspeed questionnaire to evaluate anthropomorphism
(Bartneck et al. [2009]). The last questionnaire compares the different strategies and
asks for the participant's preferences.
The Godspeed questionnaire was conceived to evaluate robots designed not for performance
but for social interaction. The participant's impression of the interaction with the robot
is encoded as ratings on a scale from 1 to 5: for instance, a 1 would mean "I completely
dislike" and a 5 would mean "I completely like". A rating is given for notions such as
"false" versus "natural" or "dead" versus "alive". These notions are spread over five
sections: anthropomorphism, animacy, likability, perceived intelligence and perceived
safety. A short example questionnaire is provided in Appendix A, section A.4, figure A.1.
These ratings can then be used for statistics in post-processing. It could also be
interesting to assign a rating to the participant's proxemics during the interaction.
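Aggregating the 1-to-5 ratings into per-subscale means before statistical analysis can be sketched as follows; the subscale names are from the Godspeed questionnaire, while the rating values and dictionary layout are hypothetical.

```python
# Sketch of aggregating Godspeed-style 1-to-5 ratings into subscale means
# before loading them into SPSS. The rating values are hypothetical.

def subscale_means(responses):
    """responses: {subscale: [ratings 1-5]} -> {subscale: mean rating}"""
    return {k: sum(v) / float(len(v)) for k, v in responses.items()}

ratings = {"anthropomorphism": [3, 4, 2], "likability": [5, 4, 3]}
print(subscale_means(ratings))  # {'anthropomorphism': 3.0, 'likability': 4.0}
```

The resulting per-participant means are the values that would be entered as variables in the SPSS dataset.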
5.3.4 Post-processing
The data obtained from the online questionnaires are gathered in a file and then processed
with the IBM SPSS Statistics software, which is widely used in social science for
statistical analysis of datasets. Some information is added to the dataset, such as the
participant's age and details about the recovery strategy and the topic of each
interaction. Moreover, the footage of the interactions is analyzed to take measures such
as:
• Number of misunderstandings: number of times a word is repeated by the user because the
robot failed to understand it or misunderstood it
• Number of successful recoveries: number of times the user's attention is caught back by
the robot following a recovery behavior
• Number of unsuccessful recoveries: number of times the recovery behavior is triggered
with no reaction from the user, or the recovery behavior is inappropriate given the
situation
• Time spent interacting, from greeting to ending
• Number of silences longer than 20 seconds
• Number of times the "hey" and hand-wave recovery strategy is triggered
• Number of times the recovery strategy where the robot shifts is triggered
• Number of times a topic is repeated, from just an initiation of the topic to the whole
topic.
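The annotated counts above feed directly into the recovery ratio used in the objective measures. A minimal sketch, with hypothetical counts:

```python
# Sketch of turning annotated recovery counts into the success ratio used in
# the objective measures. The example counts are hypothetical.

def recovery_success_ratio(successful, unsuccessful):
    total = successful + unsuccessful
    return successful / float(total) if total else 0.0

# Hypothetical annotation of one interaction: 3 successes, 1 failure
print(recovery_success_ratio(3, 1))  # 0.75
```

Computing the ratio per interaction keeps it comparable across participants regardless of how many interruptions each one produced.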
Chapter 6
Results
6.1 Observations of the evaluation
First of all, in addition to the consent form with a short presentation of the project,
the participant was given some guidelines along with a sheet listing the vocabulary to use
for each topic. However, the directions given at the beginning of the evaluation were
quite confusing. The following additional information was given verbally:
• Do both tasks at the same time: speak to Pepper and read a small article
• The Pepper robot leads the conversation; begin the interaction with a greeting, then
answer its questions and finish with an ending word
• Repeat and/or speak louder if the robot does not understand
• Use one keyword per answer, either within a sentence or on its own
• Stay in front of the robot
• Feel free to stop when both tasks are achieved
• Do not get frustrated
• For the Guess topic, the answers follow the order of the list of words on the sheet.
If the robot was not understanding properly, some indications, such as speaking louder or
repeating the word, were given during the interaction, which might have interfered
slightly with it.
Participants were often puzzled that they were asked to read and speak with the robot at
the same time: it is not a common behavior to get distracted while interacting with
someone. As reading an article was confusing, participants could instead have been asked
to talk to the robot while a video was projected behind it, so that the participant would
be distracted by looking at the video. Moreover, the articles were too long and had to be
shortened, so the first participant had a longer paragraph to read than the following
participants. This could have been avoided by running more trial evaluations before
starting the real one. The dialogues should also have been much longer: each participant
had barely enough time to look at the guideline sheet before the dialogue was already
over, so the interruptions did not really occur and topics were repeated many times.
Longer dialogues would have given the user time to get bored and look away, creating an
interruption. Nonetheless, participants did their best to achieve both tasks.
6.2 Participation
The evaluation gathered 13 participants. They were aged between 20 and 55 years old with a mean of 29.7, as the experiments took place in a university context. Most of them had good experience of working with robots, with a mean of 3.7 on a five-point scale where 5 equals expert in robotics. A few of them knew about the goal of the research.
6.3 Analysis of the results
6.3.1 Analysis of raw data
First of all, it was interesting to verify whether one topic was more appreciated than another, knowing that each topic appears the same number of times as the others. If one topic were preferred, the evaluation would be biased, as the preferred topic would prime the test. Figure 6.1 shows the link between the ratings of the interaction on the liking scale from 1 to 5, with 5 meaning "I enjoyed it a lot", and the topics. We can observe that all topics are almost similar in terms of likability.
Figure 6.1: Distribution of interaction likability per topic
There is no difference in the likability of the interaction with and without the recovery strategy. However, there is a slight improvement in likability for the second interaction, as shown in Figure 6.2. Finally, each interaction went quicker than expected: on average, each interaction lasted two minutes. Second interactions tend to be longer than first interactions, and, in a similar way, interactions with a recovery strategy tend to last longer than those without. The difference is only about a minute, so it may not be meaningful. With the recovery strategy, the number of misunderstandings and the number of repeated topics surge, which may indicate confusion in the communication. On the question of preferred interaction, the system without recovery is preferred by 70% of participants over the system with recovery. This is explained by the fact that the recovery slowed the robot's answers and presented an incoherent behavior: the numerous recovery attempts annoyed the participants more than they drew their attention back, as shown in Figure 6.2. The topics were used equally often, so they do not influence the preferences.
Figure 6.2: Measurement of the differences with/without recovery
6.3.2 Statistical analysis
The evaluation assesses the effectiveness of the recovery strategy versus no recovery at all. Each participant experienced both strategies, making this a within-subject evaluation. The three different topics are not taken into account, as they are just support for the strategies. The Godspeed questionnaire is split into three clusters of five criteria each, and the rating given to each criterion is summed within its cluster. We then get three groups: anthropomorphism, likability and intelligence. The score of each cluster becomes numerical data that can be processed by an ANOVA. The two-way repeated measures ANOVA for each group of the Godspeed questionnaire appears to be the most suitable test. We obtain the following table:
Godspeed section    mean gesture    mean no gesture    p-value
Anthropomorphism    16.23           16.77              0.391
Likability          20.31           21.31              0.090
Intelligence        16.23           16.77              0.391
Table 6.1: Two-way repeated measures ANOVA
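The clustering of Godspeed ratings described above (five 1-5 scales summed per cluster, matching questions 6-20 of the questionnaire reproduced in Appendix A) can be sketched as follows; the item keys are illustrative labels derived from the questionnaire's scale endpoints, not identifiers from the implementation:

```python
# Godspeed items grouped into the three clusters used in the analysis.
# Each item is rated on a 1-5 scale, so cluster scores range from 5 to 25.
CLUSTERS = {
    "anthropomorphism": ["fake_natural", "machinelike_humanlike",
                         "unconscious_conscious", "artificial_lifelike",
                         "rigid_elegant"],
    "likability": ["dislike_like", "unfriendly_friendly", "unkind_kind",
                   "unpleasant_pleasant", "awful_nice"],
    "intelligence": ["incompetent_competent", "ignorant_knowledgeable",
                     "irresponsible_responsible", "unintelligent_intelligent",
                     "foolish_sensible"],
}

def cluster_scores(ratings: dict) -> dict:
    """Sum the five 1-5 ratings of each cluster for one participant."""
    return {name: sum(ratings[item] for item in items)
            for name, items in CLUSTERS.items()}
```

The three resulting sums per participant and condition are the numerical data fed to the ANOVA.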
We can see that Likability comes closest to a significant result (p = 0.090), whereas Anthropomorphism and Intelligence clearly do not. This can be understood as an even feeling of Anthropomorphism and Intelligence throughout the whole experiment, whereas some parameter changes the Likability between the two interactions. It is very likely that the recovery strategy is at the origin of this result: the robot changes neither its appearance nor its fundamental knowledge.
The isolated questions about appreciation, engagement and appropriateness of the behavior are treated with the Wilcoxon signed-rank test, which compares two similar questions asked at different times of the experiment. Most participants enjoyed the second interaction more, as shown in Figure 6.3, although this is contestable given the low participation: the result can hinge on a single participant. The appropriateness of the robot's behavior is rated higher at the second interaction, and the recovery strategies are evenly distributed across interactions. This can be explained by participants having a better understanding of how the interaction proceeds. However, the engagement level is higher for the first interaction, meaning participants felt less engaged in the conversation during the second interaction.
Figure 6.3: Comparison between answers at first and second interactions
Comparison first/second interactions    Z value    p-value
Like                                    -0.351     0.725
Behavior appropriateness                -0.844     0.399
Engagement                              -1.00      0.317
Table 6.2: Wilcoxon signed-rank tests
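For reference, the Wilcoxon signed-rank z statistic used for these paired questions can be computed as below. This is a minimal pure-Python sketch of the standard normal approximation (in practice a statistics package would be used), and any data passed to it here is illustrative, not the study's:

```python
import math

def wilcoxon_z(first, second):
    """Normal-approximation z for the Wilcoxon signed-rank test on paired
    ratings: zero differences are discarded, ties get averaged ranks."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0
    # Rank |differences|, averaging ranks over ties.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd
```

A z near zero, as for the "Like" comparison above, indicates no systematic shift between the first and second interaction.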
6.3.3 Comments analysis
To conclude the evaluation, participants were asked open questions about what they
liked and disliked about the interactions. Pepper’s inner features such as the voice, the
appearance and the smooth movements were quite appreciated. Moreover some com-
ments noticed the eye tracking as an interesting feature. This was confirmed in the
literature where the eye contact was an important component of engagement [Skantze
et al., 2014]. A few comments show the appreciation of the recovery strategy and saw it
as ”entertaining”. However in the disliked comments, the recovery appears to be a dis-
turbance when it appears too often. What was mostly disliked is the misunderstanding
of words such as ”hi” and ”bye” or ”ham” and the need to repeat many times before
the robot understands. What could improve the experience with the robot would be
to expand the knowledge, the answers, to give more feedbacks and filters for parasitic
words as ”hemmm”. This reflects the analysis of the interaction preferences where the
system without gesture recovery is preferred.
Chapter 7
Discussion
7.1 Participation
The participation is quite low for such an experiment, by choice: it soon became obvious that the evaluation's design was inappropriate for producing interruptions. The reading format of the second task is unsuitable because it is uncommon to read while talking, and participants found it hard to know whether to read or to speak. It was therefore decided to stop the evaluation. Moreover, some participants were aware of the general goal of the research, whereas a strict experiment would allow only uninformed participants. This lack of rigor was permitted by the study context of this project.
7.2 Evaluation
As a scientific evaluation should observe only the participants, with no external disturbance, the guidance given during the experiment might have interfered: ideally the participant knows when to start and end and what to do. This is why the design of the evaluation is very important. Moreover, the articles given might prime the participant; in this case, they were about diverse subjects excluding robotics and AI. Another difficulty appears when designing the recovery strategy. As seen in the literature review, recovery from interruptions is inspired by human-human interactions: in social humanoid robotics, we often try to copy the best human behavior. We can ask whether recoveries between humans and between human and robot are similar; indeed, robots are still obviously machinelike, so users might behave in a specific way. We observe that naturalness in the movements is important to show a coherent behavior. The smoothness of the movements was quite enjoyed by the participants. In a similar way, the behavior animations already implemented in the Pepper robot allow a great range of movements, which surely contributed to improving the participants' feeling about the overall experience. Furthermore, different shades of behavior coupled with various answers can give an impression of naturalness and adaptability: as we humans are complex, we might tend to see complex behaviors as more natural. This could lead to the development of personalities for robots. One criticism can be made about the dialogue: it was task-based talk instead of small talk. A task-based talk requires the full attention of the user, whereas small talk would allow the user to get distracted more easily. In addition to the type of talk, the robot's poor vocabulary did not give the user the possibility to express themselves freely.
7.3 Improvements
7.3.1 What could enhance the implementation?
7.3.1.1 A smarter system
First of all, the design of a subtler recovery strategy would improve the experience of the user. To do that, it would be interesting to design different types of recoveries and to run an evaluation for each type to assess its effectiveness, knowing that the design of a recovery draws upon many fields, such as proxemics and anthropology. In addition to the actual system, it would be interesting to design a smart system using machine learning: the system would then improve itself by learning from interactions. The principal advantages are the ease of expanding its knowledge and a variety of words and behaviors. The principal drawback, however, is that it learns everything, bad and good behavior alike. Avoiding that would require supervision, and unfortunately it is costly to supervise all learning. It is also difficult to generate enough interactions for the robot to build a proper learning database.
7.3.1.2 Improved dialogue system
In a future project, another dialogue system would replace the QiChat system, which was not robust enough. A common issue was the system getting stuck in sub-rules, waiting for a particular answer. The new system should comprehend a larger choice of sentences. Moreover, the accent influences the understanding of words: for instance, a British pronunciation of "potato" will not give the same result as an American one.
7.3.1.3 Improvement of dialogue features
The design of the dialogue should allow the user to stop the robot. Similarly, the user should have a real choice between answering "yes" and "no": in the implemented dialogue, even "no" answers led to the conversation being pursued. The lack of a command to stop the conversation can frustrate the user. In addition, it is important for the user to get an acknowledgement of the robot's understanding. Typically, the robot saying "thank you" or "okay", or repeating the previous idea, are ways to give feedback, so that the user knows that the communication with the robot is working well.
7.3.1.4 A chatbot with memory
A memory would be added to remember who spoke, when, and what was said, so that there is no need to repeat everything over again. It would also allow the user to go back in the conversation to change something said before. A database of all the user's interactions would improve the relation between the human and the robot: people who know each other are closer than total strangers. This is even more relevant for social robots that might live in our houses. All in all, the best improvement for the dialogue system would be to design a chatbot, which encompasses all the previous features and is quite flexible.
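Such a memory could be as simple as a time-stamped turn log that supports looking back and revising earlier answers. A minimal sketch follows; the class and method names are hypothetical, not from the implementation:

```python
from collections import deque

class DialogueMemory:
    """Minimal sketch of a conversation memory: who said what and when,
    so the robot can refer back instead of asking all over again."""

    def __init__(self, maxlen=100):
        self.turns = deque(maxlen=maxlen)

    def remember(self, speaker, utterance, timestamp):
        """Append one turn of the conversation."""
        self.turns.append({"speaker": speaker,
                           "utterance": utterance,
                           "time": timestamp})

    def last_from(self, speaker):
        """Most recent utterance by a given speaker, or None."""
        for turn in reversed(self.turns):
            if turn["speaker"] == speaker:
                return turn["utterance"]
        return None

    def revise_last(self, speaker, new_utterance):
        """Let the user go back and change something said before."""
        for turn in reversed(self.turns):
            if turn["speaker"] == speaker:
                turn["utterance"] = new_utterance
                return True
        return False
```

Persisting such records per user across sessions would provide the interaction database mentioned above.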
7.3.1.5 Improvement of the tracking
Another drawback is that at some point the tracking ceases and the robot gets stuck in one position for the rest of the conversation. This issue should be solved in a later project, unless it is inherent to the robot. Similarly, it would be interesting to improve the precision of the movements that do not come from the NAOqi libraries: the robot shifting was not precise enough. The tracking cameras are also not robust enough to track under changing light or with moving people.
7.3.2 What could be changed for the evaluation?
The first idea was to make participants play a game, but it appeared to be either too immersive or not immersive enough for the user to do both tasks: play and speak. Instead, it was decided to have participants read a paragraph of an article. Similarly, the design of Pepper's dialogue was changed: the robot was first supposed to tell a story with blanks that the participant would fill with his or her own words, but the QiChat system caused many problems. Finally, the user read an article while Pepper asked questions. As the questions were too short and too engaging, it would be better to lengthen the dialogue and expand the topics. The interruption would then be triggered by the participant glancing at a video running in the background. To make sure the participant looks at the video, we can either announce that he or she will be asked questions about what was seen in the video, or set a sufficiently long dialogue for the participant to get bored. Moreover, videos might catch the eye quite easily.
Chapter 8
Conclusion
A dialogue system and a gesture system were implemented on the Pepper robot. The recovery of interruptions included monitoring of the user and a combination of dialogue and gesture behaviors. Despite all the research, the overall system was still very raw; the best way to improve it would be to pursue the evaluation to adjust the numerous variables. Even if the hypothesis has not been completely answered, this work opened many doors in the niche field of interruption management in human-robot interaction. The project was an intricate problem whose solution is still to be found, but it was full of lessons. Indeed, the design of the evaluation is as important as the design of the recovery strategy. To achieve this goal, much research was done in different fields such as robotics and psychology, and it was always thrilling to bring those fields of study closer together. Moreover, discovering how to lead an evaluation involving people on a complex issue was rich in lessons.
Appendix A
A.1 Stakeholders
Stakeholders                                               Needs
Professionals in relation with the public                  Easy use and maintenance; socially appropriate behavior of the robot
Future researchers in the Human-Robot Interaction domain   A new approach to conversational agents
Project supervisor                                         Fulfilled project
Myself                                                     Implementation of a multimodal interaction system
Table A.1: Stakeholders
A.2 Risks
Risk                                   Importance   Likelihood   Solution
Delays due to outside circumstances    High         Low          Allow some flexibility in the schedule
Pepper robot breakdown                 High         Medium       Implement the system on another available compatible robot
Implementation proving too difficult   High         High         Adapt the project to make it simpler; try another approach; focus on the research part
Incompatible software                  High         Medium       Research beforehand to avoid such incompatibility
Inaccuracy of the sensors              High         High         Simplify the task and prefer simple, unsophisticated proxemics (e.g. big movements)
Table A.2: Details of the risks
A.3 Project tasks and deliverables
Deliverables                Details
Research report             Contains the literature review that grounds the project; handed in on the 6th of April.
Implemented system          Comprises the gesture and dialogue programs as well as the interface with the Pepper robot.
Final dissertation report   Contains all the research and the experiments, from their design to the results.
Table A.3: Details of the main deliverables
A.4 Example of a questionnaire
09/08/2017 First interaction
https://docs.google.com/forms/d/18W8l9UUGji24cVQg80_56JSFMZ4EHw12OmMUCrgNimE/edit 1/7
First interaction
* Required
1. ID Participant *
2. ID First interaction *
Please answer the following questions about your impression of the interaction
3. Did you like interacting with the robot? *Mark only one oval.
1 2 3 4 5
No, I disliked Yes, I enjoyed
4. Did you think Pepper’s behavior was appropriate? *Mark only one oval.
1 2 3 4 5
Inappropriate Appropriate
5. How engaged in the interaction did you feel? *Mark only one oval.
1 2 3 4 5
Not at all A lot
Please rate your impressions of the robot on the following scale
This must be answered quickly without second thoughts.
6. *Mark only one oval.
1 2 3 4 5
Fake Natural
Figure A.1: Questionnaire used at the evaluation: first interaction
7. *Mark only one oval.
1 2 3 4 5
Machinelike Humanlike
8. *Mark only one oval.
1 2 3 4 5
Unconscious Conscious
9. *Mark only one oval.
1 2 3 4 5
Artificial Lifelike
10. *Mark only one oval.
1 2 3 4 5
Moving rigidly Moving elegantly
11. *Mark only one oval.
1 2 3 4 5
Dislike Like
12. *Mark only one oval.
1 2 3 4 5
Unfriendly Friendly
13. *Mark only one oval.
1 2 3 4 5
Unkind Kind
Figure A.2: Questionnaire used at the evaluation: first interaction
14. *Mark only one oval.
1 2 3 4 5
Unpleasant Pleasant
15. *Mark only one oval.
1 2 3 4 5
Awful Nice
16. *Mark only one oval.
1 2 3 4 5
Incompetent Competent
17. *Mark only one oval.
1 2 3 4 5
Ignorant Knowledgeable
18. *Mark only one oval.
1 2 3 4 5
Irresponsible Responsible
19. *Mark only one oval.
1 2 3 4 5
Unintelligent Intelligent
20. *Mark only one oval.
1 2 3 4 5
Foolish Sensible
Second interaction
21. ID Second interaction *
Figure A.3: Questionnaire used at the evaluation: first interaction
35. *Mark only one oval.
1 2 3 4 5
Incompetent Competent
36. *Mark only one oval.
1 2 3 4 5
Ignorant Knowledgeable
37. *Mark only one oval.
1 2 3 4 5
Irresponsible Responsible
38. *Mark only one oval.
1 2 3 4 5
Unintelligent Intelligent
39. *Mark only one oval.
1 2 3 4 5
Foolish Sensible
Final questionnaire
Here are a few questions that we would like you to answer to conclude the experiment
40. Which interaction did you prefer? *Mark only one oval.
First interaction
Second interaction
41. What did you like when interacting with Pepper?
Figure A.4: Questionnaire used at the evaluation: final questions 40 & 41
42. What did you dislike when interacting with Pepper?
You completed the evaluation! Thank you very much for your participation!
Figure A.5: Questionnaire used at the evaluation: final question 42
A.5 Risk assessment
MACS Risk Assessment Form (Project)
Student: Fantine CHABERNAUD
Project Title: Gaze Interaction with an Animated Character
Supervisor: Dr Franck BROZ
Risks:
Risk                                                            Present (give details, tick if present)   Control measures and/or protection
Standard office environment (incl. purely software projects)    System implemented on a moving robot      Keeping the robot at a few meters from the participant
Unusual peripherals (e.g. robot, VR helmet, haptic device)      None                                      None
Unusual output (e.g. laser, loud noises, flashing lights)       Nothing                                   Nothing
Other risks                                                     None                                      None
Figure A.6: MACS risk assessment form for the project
Bibliography
Ackerman, E. (2016). CES 2017: Why every social robot at CES looks alike. IEEE Spectrum.
Bartneck, C., Kulic, D., Croft, E., and Zoghbi, S. (2009). Measurement instruments
for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived
safety of robots. International journal of social robotics, 1(1):71–81.
Hung, V., Elvir, M., Gonzalez, A., and DeMara, R. (2009). Towards a method for evalu-
ating naturalness in conversational dialog systems. In Systems, Man and Cybernetics,
2009. SMC 2009. IEEE International Conference on, pages 1236–1241. IEEE.
McClave, E. Z. (2000). Linguistic functions of head movements in the context of speech.
Journal of pragmatics, 32(7):855–878.
Mead, R., Atrash, A., and Mataric, M. (2012). Representations of proxemic behavior for
human-machine interaction. In NordiCHI 2012 Workshop on Proxemics in Human-
Computer Interaction, Copenhagen.
Mead, R. and Mataric, M. J. (2016). Autonomous human–robot proxemics: socially
aware navigation based on interaction potential. Autonomous Robots, pages 1–13.
Mitsunaga, N., Smith, C., Kanda, T., Ishiguro, H., and Hagita, N. (2008). Adapt-
ing robot behavior for human–robot interaction. IEEE Transactions on Robotics,
24(4):911–916.
Mumm, J. and Mutlu, B. (2011). Human-robot proxemics: physical and psycholog-
ical distancing in human-robot interaction. In Proceedings of the 6th international
conference on Human-robot interaction, pages 331–338. ACM.
Okamoto, D. G., Rashotte, L. S., and Smith-Lovin, L. (2002). Measuring interrup-
tion: Syntactic and contextual methods of coding conversation. Social Psychology
Quarterly, pages 38–55.
Skantze, G. (2016). Real-time coordination in human-robot interaction using face and
voice. AI Magazine, 37(4):19–31.
Skantze, G., Hjalmarsson, A., and Oertel, C. (2014). Turn-taking, feedback and joint
attention in situated human–robot interaction. Speech Communication, 65:50–66.
Tritton, T., Hall, J., Rowe, A., Valentine, S., Jedrzejewska, A., Pipe, A. G., Melhuish,
C., and Leonards, U. (2012). Engaging with robots while giving simple instructions.
In Conference Towards Autonomous Robotic Systems, pages 176–184. Springer.
Tsiakoulis, P., Gasic, M., Henderson, M., Planells-Lerma, J., Prombonas, J., Thomson,
B., Yu, K., Young, S., and Tzirkel, E. (2012). Statistical methods for building robust
spoken dialogue systems in an automobile. Proceedings of the 4th applied human
factors and ergonomics.
Wada, K., Shibata, T., Musha, T., and Kimura, S. (2008). Robot therapy for elders
affected by dementia. IEEE Engineering in medicine and biology magazine, 27(4).
Yang, J.-Y., Dinh, B. K., Kim, H.-G., and Kwon, D.-S. (2013). Development of emotional
attachment on a cleaning robot for the long-term interactive affective companion. In
RO-MAN, 2013 IEEE, pages 288–289. IEEE.
Yu, Z., Bohus, D., and Horvitz, E. (2015a). Incremental coordination: Attention-centric
speech production in a physically situated conversational agent. In SIGDIAL Confer-
ence, pages 402–406.
Yu, Z., Nicolich-Henkin, L., Black, A. W., and Rudnicky, A. I. (2016). A wizard-of-
oz study on a non-task-oriented dialog systems that reacts to user engagement. In
SIGDIAL Conference, pages 55–63.
Yu, Z., Papangelis, A., and Rudnicky, A. (2015b). Ticktock: A non-goal-oriented multi-
modal dialog system with engagement awareness. In Proceedings of the AAAI Spring
Symposium.
Złotowski, J., Sumioka, H., Nishio, S., Glas, D. F., Bartneck, C., and Ishiguro, H. (2016). Appearance of a robot affects the impact of its behaviour on perceived trustworthiness and empathy. Paladyn, Journal of Behavioral Robotics, 7(1).