
Embodied Language Learning and Cognitive Bootstrapping: Methods and Design Principles

Regular Paper

⋆ Corresponding author E-mail:

Received D M 2013; Accepted D M 2013

DOI: 10.5772/chapter.doi

© 2013; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract A set of interdisciplinary methods is described which contributes to our understanding of cognitive development, and especially language acquisition, in robots. The approach is inspired by analogous development in humans, and consequently we investigate the parallel co-development of action, conceptualization and social interaction. The integration of different processes is an area which requires more attention, and a number of novel methods are described. There is a particular focus on the integration of action and language acquisition. Extensive experiments with the humanoid robot iCub are reported, as well as work with synthetic agents. Research into human learning relevant to developmental robotics has also yielded useful results.

Keywords Human Robot Interaction, HRI, Developmental Robotics, Robot Language, Cognitive Bootstrapping, Statistical Learning

1. Introduction

In this paper we present a contribution to the field of robot language learning and cognitive bootstrapping: our goals are to develop artificial embodied agents that can acquire behavioral, cognitive, and linguistic skills through individual and social learning. One purpose is to show how language learning needs to bring together many different processes, and to draw attention to the need for an interdisciplinary approach. Thus we include work in developmental robotics, cognitive science, psychology, linguistics and neuroscience, as well as practical computer science and engineering. Much of the research described in this paper was initiated in the EU ITALK project, undertaken in six universities in Europe with collaborators in the US and Japan [1]. In this report we focus on the methods and design principles that have been used, introducing many novel approaches. A number of these were initially investigated separately and later integrated. This paper draws attention to the broad range of methods needed to complement each other to establish prerequisites for language learning.

Our work is inspired by analogous human development, one aspect of which is the key role of social interaction in language learning. Thus we conducted extensive experiments in Human-Robot Interaction (HRI) and also investigated Human-Human Interaction (HHI) in areas applicable to developmental robotics. Following the human analogy, we subscribe to the hypothesis that integration of multiple learning paths promotes cognitive development, and in particular that co-development of action and language enables the enhancement of language capabilities - an area that has received little attention in the past. We report first on new approaches integrating multimodal sensory streams and dependencies between action and language. We then report on ancillary work on individual components of an integrated model. A third section covers research into human social interaction relevant to developmental robotics. See Table 1.

The experiments described in the first two sections use various forms of statistical learning. Much of our work will feed into wider concepts of statistical learning, where computational principles that operate in different modalities contribute to domain-general mechanisms [2].

Figure 1. An experiment with the iCub robot. The participant is asked to teach iCub the words for shapes and colors on the box, speaking as if the robot were a small child. See Sections 2.3, 2.4, and 3.1.

The focus of HRI experimental work was the embodied humanoid robot iCub; see Figure 1. Research was also carried out on simulated robots and through computational modelling. As much of our work was inspired by child development, we investigated how robotic agents might handle objects and tools autonomously, how they might communicate with humans, and how they might adapt to changing internal, environmental and social conditions. We also explored how parallel development - the integration of cognitive processes with sensorimotor experiences, behavioral learning and social interaction - could promote language capabilities.

Our research has been influenced, either explicitly or implicitly, by enactive, sensorimotor theories of perception and cognition [3–6]. We have worked on the hypothesis that embodied active perception in different modalities can be integrated to simulate human cognition, and assume that language learners experience multiple modalities. Some initial research has been with single-mode input, but this leads on to the development of methods dealing with multi-modal input.

Thus the following assumptions underpin the approachadopted in this project:

(i) Agents acquire skills through interaction with the physical environment, given the importance of embodiment, sensory-motor coordination, and action-oriented representation¹ - physical interaction.

(ii) Agents acquire skills through interaction with humans in the social environment - social interaction.

(iii) Behavioral, cognitive and linguistic skills develop together and affect each other - co-development.

Clearly these categories are interrelated and comprise many common challenges. For example, the concept of symbol grounding, where the meaning of language is grounded in sensing and experiencing the world, is fundamental throughout [7, 8]. A constructivist view of language underpins the work of this project [9]. Similarly, the concept of time, and the physical experience of time, is crucial both to sequential actions and to aspects of language learning, such as the order of words and the understanding of linguistic constructions.

¹ By “representation” we refer broadly to particular informational correlations between physical, social, linguistic or internal, and sensorimotor processes.

Key areas of research are related to understanding how agents learn and enact linguistic meaning, and how the dynamics of social interaction are relevant. We investigate how compositional action and language representations are integrated to bootstrap the cognitive system. Separate components are also described: one such component is the work on goal-directed action in a robot, since this provides scaffolding for language learning of actions carried out on objects (Section 3.3).

The paper is structured in three parts, as shown in Table 1. First, multimodal, integrated methodologies are described. We note where work on human-robot interaction is carried out with naïve participants from outside the research team. Secondly, we present single aspects of language and action learning before the different strands are integrated. Thirdly, HHI and HRI work relevant to developmental robotics is covered. The sections within each part describe in detail the methods used in this research. Each approach is described under three headings: Introduction, Experimental work and Outlook. We introduce the method, giving some research background; we then describe the experimental work that has been carried out, explaining the techniques involved and noting advantages and disadvantages; and we conclude with the future outlook. Some of the results of the project as a whole can be found in [10], as well as in individual reports cited below.

2. Embodied, Multimodal Language Learning Methodologies

2.1. Integrating language and action with time-sensitive recurrent neural nets

Introduction: During early phases of development the acquisition of language is strongly influenced by the development of action skills, and vice versa. Dealing with the complex interactions between language and actions, as has been observed in language comprehension [11, 12] and acquisition [13–16], requires the identification of computational means capable of representing time. The ability to deal with temporal sequences is a central feature of language, and indeed of any cognitive system.

Therefore we opted for artificial neural networks for the investigation of grammatical aspects in language and, especially, for the capability of those systems to autonomously capture grammatical rules from examples [17–19]. More recently, several connectionist models have approached the problem of language acquisition, and in particular the co-acquisition of elements of syntax and semantics, by implementing artificial systems that acquire language through the direct behavioral experience of artificial agents [20–23]. This approach has the specific aim of responding to the criticism of the symbol grounding problem [7, 8] on the one hand, which is one of the major challenges for symbolic AI-based systems, and of exploiting the autonomous learning capabilities of neural networks, both in terms of behaviors and elements of syntax, on the other.

Table 1. Overview of the research areas covered in this paper.

Embodied Multimodal Language Learning Methodologies
Section | Research area | Integrated approach | Human interaction | Work with iCub
2.1 | Integrating language and action with time-sensitive recurrent neural nets | multimodal percepts and action | yes | yes, also with models
2.2 | ERA - Epigenetic Robotics Architecture - SOM* neural nets combining sensory and motor data | multimodal percepts and action | yes | yes
2.3 | Meaningful use of words and compositional forms | multimodal percepts | yes, naïve participants | yes
2.4 | Acquisition of linguistic negation in embodied interaction | multimodal, action, affect/motivation | yes, naïve participants | yes

Ingredients of Robot Learning
Section | Research area | Integrated approach | Human interaction | Work with iCub
3.1 | Transition from babbling to word forms in real-time learning | - | yes, naïve participants | yes
3.2 | Language Game paradigm and social learning of word meanings | - | yes | yes
3.3 | Passive Motion Paradigm (PMP) - to generate goal-directed movements in robots | - | - | yes

Scaffolding Social Interaction through HHI* and HRI*
Section | Research area | Integrated approach | Human interaction | Work with iCub
4.1 | HHI and HRI mediated by motor resonance | speech, vision, action | yes, naïve participants | yes
4.2 | Co-development and interaction in tutoring scenarios. HHI and HRI | speech, vision and action | yes, naïve participants | partial
4.3 | Analysing user expectations. HHI and HRI | - | yes, naïve participants | partial
4.4 | Linguistic corpora studies to investigate child language acquisition. HHI | - | yes, naïve participants | -

*SOM: Self-Organizing Map. HHI: Human-Human Interaction. HRI: Human-Robot Interaction

The work described here was influenced by pioneering studies conducted by Jun Tani and collaborators [22, 24, 25], who investigated how a neuro-robot can co-develop action and language comprehension skills.

Experimental work: In the models cited above the representation of time is achieved through the internal organization of a specific type of neural network, namely recurrent neural networks (RNNs), which can learn and recall temporal sequences of inputs, and have been shown to be reliable models of short-term memory circuitry (see [26]). Besides the typical implementation of RNNs, in which certain nodes have re-entrant connections - that is, they are connected to themselves - different variations have been proposed. An interesting variation is the Multiple Timescales RNN (MTRNN) [24, 25]. The MTRNN core is based on a continuous-time recurrent neural network [27] characterized by the ability to preserve its internal state and hence exhibit complex temporal dynamics. The neural activities in an MTRNN are calculated following the classical firing-rate model, where each neuron's activity is given by the average firing rate of the connected neurons. In addition, the MTRNN model implements a leaky integrator, so the state of every neuron is defined not only by the current synaptic inputs but also by its previous activations.

Figure 2. The setup for the iCub experiments described in Section 2.1. The robot is trained through a trial-and-error process to respond to sentences such as “reach the green object”. It then becomes able to generalize to new, previously unheard, sentences with new behaviors.
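In its usual discretized form (a sketch of the standard continuous-time RNN formulation that MTRNNs build on, not the exact equations of [24, 25]), the leaky-integrator update can be written as:

```latex
% Leaky-integrator update: \tau_i is the timescale of neuron i
% (small for "fast" context units, large for "slow" ones); \sigma is
% the activation function mapping internal state to firing rate.
\[
  u_{i,t+1} = \Bigl(1 - \frac{1}{\tau_i}\Bigr)\, u_{i,t}
            + \frac{1}{\tau_i} \sum_{j} w_{ij}\, x_{j,t},
  \qquad
  x_{i,t} = \sigma(u_{i,t})
\]
```

With a large τ_i a neuron's state decays slowly, preserving internal state over long spans; mixing units with different timescales is what gives the MTRNN its multiple-timescale dynamics.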

Neural networks are often trained with variations of the back-propagation method. In particular, RNNs, as well as MTRNNs, are trained with the Back-Propagation Through Time (BPTT) algorithm, which is typically used to train neural networks with recurrent nodes. This algorithm allows a neural network to learn dynamical sequences of input-output patterns as they develop in time; see [28]. The main difference between standard back-propagation and BPTT is that in the latter case the training set consists of a series of input-output sequences, rather than single input-output patterns.
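As an illustration, here is a minimal sketch of BPTT for a vanilla recurrent network with tanh units and a squared-error loss; network sizes, initialization and learning rate are illustrative assumptions, and the actual experiments used MTRNNs:

```python
import numpy as np

# Minimal BPTT sketch for a vanilla RNN (tanh units, squared-error loss).
# Sizes and learning rate are illustrative, not the values used on iCub.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3
W_in = rng.normal(0, 0.1, (n_hid, n_in))
W_rec = rng.normal(0, 0.1, (n_hid, n_hid))
W_out = rng.normal(0, 0.1, (n_out, n_hid))

def bptt_step(inputs, targets, lr=0.01):
    """One update on a single input-output *sequence* - the key difference
    from standard back-propagation, which trains on isolated patterns."""
    global W_in, W_rec, W_out
    T = len(inputs)
    h = [np.zeros(n_hid)]                   # h[t] is the state before step t
    ys = []
    for t in range(T):                      # forward pass, unrolled in time
        h.append(np.tanh(W_in @ inputs[t] + W_rec @ h[-1]))
        ys.append(W_out @ h[-1])
    dW_in, dW_rec, dW_out = map(np.zeros_like, (W_in, W_rec, W_out))
    dh_next = np.zeros(n_hid)
    for t in reversed(range(T)):            # backward pass through time
        dy = ys[t] - targets[t]             # dL/dy for squared error
        dW_out += np.outer(dy, h[t + 1])
        dh = W_out.T @ dy + dh_next         # local + future error signal
        dz = dh * (1 - h[t + 1] ** 2)       # back through tanh
        dW_in += np.outer(dz, inputs[t])
        dW_rec += np.outer(dz, h[t])
        dh_next = W_rec.T @ dz              # propagate one step further back
    W_in -= lr * dW_in
    W_rec -= lr * dW_rec
    W_out -= lr * dW_out
```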

The MTRNN and RNN methods above were applied in experiments with the iCub robot to investigate whether the robot could develop comprehension skills analogous to those developed by children during the very early phase of their language development. More specifically, we trained the robot through a trial-and-error process to concurrently develop and display a set of behavioral skills and an ability to associate phrases such as “reach the green object” or “move the blue object” with the corresponding actions; see Figure 2. A caretaker provides positive or negative feedback on whether the robot achieves the intended results.

This method allowed the perceived sentences and the sensors encoding other (visual, tactile, and proprioceptive) information to influence the robot's actuators without first being transformed into an intermediate representation, that is, a representation of the meaning of the sentence. This method enabled us to study how a robot can generalize at the level of behavior - how it can respond to new, never-experienced utterances with new appropriate behaviors. At the same time, we also studied how it can “comprehend” new sentences by recombining the “meaning” of constituent words in a compositional manner to produce new utterances. Similarly, we studied how the robot can produce new actions by recombining elementary behaviors in a compositional way [25, 29].

BPTT training of medium- to large-scale MTRNNs is computationally expensive, as the algorithm relies heavily on large matrix-vector multiplications. State-of-the-art CPU-based algorithms require a prohibitively large amount of time to train and run the network, ruling out real-time applications of MTRNNs. We therefore relied on Graphical Processing Unit (GPU) computing to speed up the training of the MTRNNs [30].

Outlook: Our approach provides an account of how linguistic information might be grounded in sub-symbolic sensory-motor states, how conceptual information is formed and initially structured, and how agents can acquire compositional behavior and display generalization capabilities. This in turn leads to the emergence of a compositional organization that enables the robot to react appropriately to new utterances, never experienced before, without explicit training.

2.2. Epigenetic Robotics Architecture (ERA) - combining sensory and motor data

Introduction: The Epigenetic Robotics Architecture (ERA) was developed to directly address issues of ongoing development, concept formation, transparency, scalability, and the integration of a wide range of cognitive phenomena [31]. The architecture provides a structure for a model that can learn, from ongoing experience, abstract representations that combine and interact to produce and account for multiple cognitive and behavioral phenomena. It has its roots in early connectionist work on spreading activation and the interactive activation and competition models. In its simplest form the ERA architecture provides structured Hebbian association between multiple self-organizing maps, such that spreading and competing activity between and within these maps provides an analogue of priming and basic schemata.

Once embodied and connected to both sensory and motor data streams, the model has the ability to predict the sensory consequences of actions, and so provides an implementation of theories of sensorimotor perception [4, 32].

Experimental work: ERA provides for structured association between multiple Self-Organizing Maps via special “hub” maps; several “hubs” then themselves interact via a “hub” map at the next level, and so on. Here the structure of the architecture emerges as a consequence of the statistics of the input/output signals. Activity flows up the architecture, driven by sensor and motor activity, and back down the architecture via associations, to prime or predict the activity at the surface layer. See Figure 3.
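A minimal sketch of the core mechanism, under simplifying assumptions: two small 1-D self-organizing maps (one per modality) are linked by direct Hebbian weights rather than through a hub map, so that activity in one map primes activity in the other; map sizes, learning rates and topology are illustrative choices, not the project's values:

```python
import numpy as np

rng = np.random.default_rng(1)

class SOM:
    """A tiny 1-D self-organizing map."""
    def __init__(self, n_nodes, dim, lr=0.2, sigma=1.5):
        self.w = rng.random((n_nodes, dim))
        self.lr, self.sigma = lr, sigma

    def activate(self, x):
        d = np.linalg.norm(self.w - x, axis=1)
        self.act = np.exp(-d ** 2)          # graded activity per node
        return int(np.argmin(d))            # winning node

    def adapt(self, x, winner):
        idx = np.arange(len(self.w))
        nb = np.exp(-((idx - winner) ** 2) / (2 * self.sigma ** 2))
        self.w += self.lr * nb[:, None] * (x - self.w)

vision, motor = SOM(25, 3), SOM(25, 4)
hebb = np.zeros((25, 25))                   # vision-node x motor-node links

def experience(v_input, m_input, lr_hebb=0.1):
    """Co-occurring percepts strengthen cross-modal Hebbian links."""
    wv, wm = vision.activate(v_input), motor.activate(m_input)
    vision.adapt(v_input, wv)
    motor.adapt(m_input, wm)
    hebb[wv, wm] += lr_hebb * vision.act[wv] * motor.act[wm]

def prime_motor(v_input):
    """Seeing an object primes (predicts) the associated motor node."""
    wv = vision.activate(v_input)
    return int(np.argmax(hebb[wv]))
```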

Scalability is addressed in several different ways. Firstly, by constructing hierarchies, large inputs can be accommodated, and the gradual integration of information in higher and higher regions of the hierarchy provides an analogue of abstraction. Secondly, while the model is fundamentally an associative priming model, it is able to produce analogies to a wide variety of psychological phenomena. Thirdly, the homogeneous treatment of different modalities - whether sensor or motor based - provides a method that can easily accommodate new and additional modalities without requiring specialized pre-processing, though we do acknowledge that appropriate pre-processing may be beneficial. Finally, in relation to sensorimotor theories, the gap between sensorimotor prediction and an interaction-based account of affordances is significantly narrowed [33].

The ERA architecture in its simplest form was successfully applied to modelling bodily biases in children's word learning [34], the effect of grouping objects on learning a common feature, and the transformative effect of labelling and spatial arrangement on the computational or cognitive complexity of tasks [35]. Additionally, an extended version of the architecture, utilizing active Hebbian links to directly influence the learning within each self-organizing map, was explored in relation to modelling the “switch” task and, more generally, the so-called “U-shaped performance curves” in development [36, 37].

Figure 3. Top panel: the ERA model in its simplest form, as a structured mapping between self-organizing maps driven by sensory input. Bottom panel: the extended ERA model, in which a hierarchy of self-organizing maps is driven at the sensory level by sensory input, and then at the hub level by the positions of winning nodes in the connected maps at the previous layer. See Section 2.2.

Outlook: While ERA is fundamentally an associative priming model, it is able to produce a wide variety of psychological phenomena, which have been validated against both existing child data and additional child experiments confirming predictions of the model (see also the Conclusion to this paper regarding “Research Loops”). Beyond the integration of cognitive phenomena, ERA also provides a fulcrum for the technical integration of many of the modelling outputs of the project, by developing structures based on simple relationships between inputs, outputs, and anything else provided. The architecture can learn, from ongoing experience, abstract representations that combine and interact to produce and account for multiple cognitive and behavioral phenomena.

In its current form the ERA modelling approach has a number of limitations, including problems learning sequential information and producing complex dynamic and adaptive behavior. While dynamic behavior can be, and has been, generated from the model, this is motor focused and therefore not particularly useful for learning action affordances. In combination with pre-wired action production systems, action words and basic affordances can be learned, but this is unsatisfactory, and more plausible methods for action production need to be found.

2.3. Integrating multimodal perceptions for the meaningful use of words and compositional forms

Introduction: In this section we focus on methods employed for grounding lexical concepts in a robot's sensorimotor activities via Human-Robot Interaction [38]. Language learning is a social and interactive process, as emphasized by Tomasello [9], Kuhl [39] and Bloom [40]. The methods described here concern language learning from embodied interaction, how this is influenced by feedback from the robot, and how this affects the robot's learning experience.

In this and other work (see Sections 2.4 and 3.1) the human speech tutors to the robot were naïve participants, paid a token amount as a gesture of appreciation. Most of them were administrative staff from the university or students from other disciplines. They were asked to speak to the robot as if it were a small child. Note that the robot learnt separately from each participant over multiple sessions, so that in effect learning occurred as if each participant had their own robot which learnt only from them.

Experimental work: The methodologies employed are broken down into three parts: firstly, extracting relevant salient aspects of the human tutor's speech, based on research with human children and aspects of Child Directed Speech (CDS); secondly, the learning mechanisms linking salient human speech with the robot's own perceptions, so that it could produce similar speech during similar sensorimotor experiences; thirdly, attempting to achieve the rudimentary compositionality exhibited in simple two-word utterances made by the robot [41].

(i) The first of these methodologies focuses on extracting salient aspects of the human tutor's speech. This is achieved by considering what a human infant hears in a social situation with a caregiver. Typically utterances are short, often less than five words, with many utterances consisting of a single word. Repetition is common. The caregiver talks more slowly than would be typical with an adult. Most words are mono- or disyllabic. Salient words are lengthened, and prosody is used to give greater emphasis to such words. Often salient words are placed at the end of utterances to young infants. Initially there are usually more nouns than other types of words. (See also Section 3.1.)

Two main methods are used for extracting salient words: firstly, prosodic measures combining energy (volume), pitch (using fundamental frequency) and duration (the length of an uttered word); and secondly, splitting utterances into two sections, focusing on the high-salience final word and the pre-final words. Both of these techniques reflect aspects of CDS mentioned above.
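As an illustration, here is a minimal sketch of the first method, combining per-word energy, pitch and duration into a salience score and privileging the final word; the z-score combination and all parameters are our own illustrative assumptions rather than the exact measure used:

```python
from dataclasses import dataclass
import statistics

@dataclass
class Word:
    text: str
    energy: float    # mean volume
    pitch: float     # mean fundamental frequency (Hz)
    duration: float  # seconds

def salient_words(utterance: list[Word], top_n: int = 2) -> list[str]:
    """Rank words by combined prosodic prominence within the utterance."""
    def z(values):
        mu, sd = statistics.mean(values), statistics.pstdev(values) or 1.0
        return [(v - mu) / sd for v in values]
    scores = [sum(t) for t in zip(z([w.energy for w in utterance]),
                                  z([w.pitch for w in utterance]),
                                  z([w.duration for w in utterance]))]
    ranked = sorted(zip(scores, utterance), key=lambda p: -p[0])
    picks = [w.text for _, w in ranked[:top_n]]
    # Final words get special status, reflecting the CDS tendency to place
    # salient words utterance-finally (the second technique in the text).
    final = utterance[-1].text
    return picks if final in picks else picks[:-1] + [final]
```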

(ii) The second of the methodologies is the learning mechanism itself. In this context, we consider that the meaning of a communicatively successful utterance is grounded in its usage, based on the robot's sensorimotor history - auditory, visual and proprioceptive - derived from acting and interacting in the world. These grounded meanings can then be scaffolded via regularities in the recognized word/sensorimotor stream of the robot. The first step in this process is to merge the speech stream of the human, represented as a sequence of salient words, with the robot's sensorimotor stream. This is achieved by matching the two modalities based on time.

To achieve such associations, we face a number of challenges. The first is that of associating what was said with the appropriate parts of the sensorimotor stream. The human tutor may show the robot a shape (e.g. the “sun”), but only say the word sun within an utterance before or after the shape has appeared or disappeared from the view of the robot (e.g. saying “here's a sun” and then showing the sun, or saying “that was a sun” after having shown it). Secondly, which sensorimotor attributes, and at which points in time, are relevant to the speech act? We make no pre-programmed choices as to what is relevant for the robot. However, in order to manage these issues, we apply two heuristics. The first copes with the association of events by remapping each salient word uttered by the human tutor onto each element of the temporally extended sensorimotor stream of the utterance containing it. In effect, this makes the chosen word potentially relevant to the whole of the robot's sensorimotor experience during that utterance, and thus relevant to any sensorimotor inputs which arose during that time. The second heuristic uses mutual information to weight the sensorimotor dimensions relevant to the classification of that word (effectively using an “information index” [42]). The memory mechanism employed is k-Nearest Neighbor (kNN), typically using a k value of 1 and weighting each sensorimotor feature by the information index. The robot may then later utter such a salient word when it re-experiences a similar sensorimotor context.
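The two heuristics and the weighted 1-NN recall might be sketched as follows; scikit-learn's mutual-information estimator is an illustrative stand-in for the information index of [42], and the data layout is assumed:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# First heuristic: each salient word is remapped onto every sensorimotor
# frame of the utterance containing it. Second heuristic: features are
# weighted by mutual information with the word labels. Recall is kNN, k = 1.
memory_X: list[np.ndarray] = []   # sensorimotor frames (fixed-length vectors)
memory_y: list[str] = []          # salient word heard during each frame

def store_utterance(salient_words: list[str], frames: list[np.ndarray]) -> None:
    """Associate every salient word with every frame of its utterance."""
    for word in salient_words:
        for frame in frames:
            memory_X.append(frame)
            memory_y.append(word)

def speak(current_frame: np.ndarray) -> str:
    """Recall the word whose stored sensorimotor context, under
    information-weighted distance, best matches the current context."""
    X = np.vstack(memory_X)
    labels = np.array(memory_y)
    _, y_codes = np.unique(labels, return_inverse=True)
    weights = mutual_info_classif(X, y_codes)     # per-feature relevance
    dists = np.linalg.norm((X - current_frame) * weights, axis=1)
    return str(labels[int(np.argmin(dists))])     # 1-nearest neighbour
```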

(iii) Thirdly, we investigate robot acquisition and production of two-word (or longer) utterances whose component lexical items have been learnt through experience. This again is based on experiments and analysis dealing with the acquisition of lexical meaning, in which prosodic analysis and extraction of salient words are associated with a robot's sensorimotor perceptions in an attempt to ground these words in the robot's own embodied sensorimotor experience. An in-depth analysis of the relationship between characteristics of the teacher's speech and the robot's sensorimotor perceptions was undertaken.

Following the extraction of salient words we investigate the learning of word order. In English we take an adjective, such as one for size or color, to precede an object name: red star as a modifier rather than star red. That is, an adjective will ordinarily precede a noun it modifies. Note that we temporarily ignore the fact that an adjective predicating a property of a noun can also follow the noun, as in The star is red, and that both kinds of structure occur in child-directed speech and in the robot-directed speech in our studies.

Two kNN memory files are used to capture the combination of salient words occurring in an utterance. The first holds all salient words uttered before the final salient word in the utterance. The second holds the final salient word in the utterance. Note that these are salient words, so the final salient word may not necessarily be the final actual word in an utterance. The robot matches these memory files against its current sensorimotor perceptions, and thus tries to find the most similar experience (if any) from when it “heard” a word previously, compared to what it now experiences. This has the effect of making the robot utter words which reflect both what it was taught (about objects and colors) and the order in which the words originally occurred. That is, by successively uttering the best matching words for each of the two memory files upon seeing a new colored shape, even in a novel combination, the robot should express the correct attribute in a proto-grammatic compositional form, reflecting usage by the human it learned from, possibly in a completely novel utterance.
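A minimal sketch of the two-memory-file scheme, assuming a plain Euclidean 1-NN recall (the information-index weighting sketched above would slot into `nearest`); all names and structures are illustrative:

```python
import numpy as np

# Two memories: pre-final salient words (e.g. color adjectives) and
# utterance-final salient words (e.g. shape nouns).
prefinal: list[tuple[np.ndarray, str]] = []   # (sensorimotor frame, word)
final: list[tuple[np.ndarray, str]] = []

def learn(salient_words: list[str], frame: np.ndarray) -> None:
    """Store all but the last salient word in the pre-final memory and the
    last salient word in the final memory, preserving taught word order."""
    for w in salient_words[:-1]:
        prefinal.append((frame, w))
    final.append((frame, salient_words[-1]))

def nearest(memory, frame):
    dists = [np.linalg.norm(f - frame) for f, _ in memory]
    return memory[int(np.argmin(dists))][1]

def utter(frame: np.ndarray) -> str:
    """Seeing a (possibly novel) colored shape, emit the best pre-final word
    then the best final word, e.g. "red star", echoing the tutor's order."""
    return f"{nearest(prefinal, frame)} {nearest(final, frame)}"
```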

Outlook: The approaches outlined above have advantages and drawbacks. A positive factor is that the human tutor is able to use natural, unconstrained speech. However, the speech topic is limited to the simple environment of the robot - talking about blocks, shapes and colors - and is thus naturally constrained. In terms of prosodic salience, the mapping of sensory embodiment to words automatically allows the robot to associate simple lexical meaning with them, based on its own perceptions. However, the assignment of salient words within the temporal utterance in which they occurred may have competing solutions.

One problem with the method outlined above is the non-real-time nature of the association of words and sensorimotor experiences. In current implementations the limiting factor has been the inability to do phonetic or phonemic word recognition in real time without extensive training.

Extensions to these methods include further analysis of the prosodic nature of the interaction, and investigations into how the robot might use prosodic clues to support the capacity to learn to use words meaningfully, beyond the mere sensorimotor associations attached to particular words. More specifically, how well can the robot attach an attribute to a word, and distinguish between a set of attributes such as “color” and a member of that set such as “red”? This distinction would be a step towards deriving linguistic constructions, combined with perceived word order or inflectional markings. This method could contribute to grammar induction as a way of forming templates for word types, or thematic constructions and their appropriate contexts of use (i.e. meanings in a Wittgensteinian sense of Language Games).

2.4. Acquisition of linguistic negation in embodied interactions

Introduction: Linguistic negation is a fundamental phenomenon of human language, and a simple “no” is often among the first words to be uttered by English-speaking children. Research concerned with the ontogeny of human linguistic negation indicates that the first uses of a rejective “no” are linked to affect [43]. We therefore introduced a minimal motivational model into our cognitive architecture, as proposed by Förster et al. [44], in order to support the grounding of early types of negation. We employed methods as in the acquisition of lexical usage work in an interactive scenario (described in Section 2.3 and in Saunders et al. [38]) to support the enactive acquisition and usage of single lexical items.

Experimental work: The purpose of the experimental work was to investigate a robot's capacity to learn to use negative utterances in an appropriate manner. The resulting architecture, designed to elicit the linguistic interpretation of robot behavior from naïve participants, was used in human-robot interaction (HRI) studies with iCub. This architecture consists of the following parts:

(i) A perceptual system that provides the other parts with high-level percepts of particular objects and human faces (loosely based on the saliency filters described by Ruesch et al. [45]).

(ii) A minimal motivational system that can be triggered from the other sub-systems.

(iii) A behavioral system that controls the robot's physical behavior based on both the output of the perceptual system and the motivational system.

(iv) A speech extraction system which extracts words from a recorded dialogue and which operates offline, rather than in real time.

(v) Sensorimotor-motivational data, originating from the systems described above and recorded during an interaction session, are subsequently associated with the extracted words, using the same heuristics as described in Section 2.3 on learning to use words in iterated language games with naïve participants.

(vi) A languaging system that receives inputs from the systems outlined above. It maps perceptions, motivation and behavioral state to an embodied dictionary that is provided by the speech extraction system. The mapping is performed using a memory-based learning algorithm [46], similar to the one described in Section 2.3, but with a k value of 3. This system controls the speech actions of iCub.

We employ what we call an active vocabulary: in order to enrich the dialogue and to anticipate ties in the mapping algorithm, two consecutively uttered words are forced to differ from each other in the very same experiential situation, i.e. when the sensorimotor-motivational data are exactly the same. This is achieved by only allowing a different word associated with the experience to be uttered next (if any).
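A minimal sketch of this rule, assuming the kNN mapping supplies a ranked list of candidate words and that contexts are comparable values such as tuples; all names are illustrative:

```python
# Active-vocabulary rule: in an unchanged experiential situation the robot
# may not repeat the word it has just uttered; the next-best associated
# word (if any) is chosen instead.
def next_utterance(ranked_words, context, last_context, last_word):
    """ranked_words: words associated with `context`, best match first."""
    for word in ranked_words:
        if context == last_context and word == last_word:
            continue                  # enforce variety in identical contexts
        return word
    return None                       # nothing different left to say
```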

We have been investigating constructively the hypothesis that rejective negation is linked to motivation rather than just to perceptual entities. Affective response to objects is valenced as positive, neutral or negative, and so can shape motivation and volition for actions in response to them.² The constructed motivational system leads to the avoidance of certain objects, i.e. (non-verbal) rejection of these objects via facial expressions and matching body behavior, or the opposite for objects towards which the valence is positive. However, the described architecture has also been constructed for a second purpose: to support or weaken the hypothesis that the very root of negation lies in the prohibitive action of parents. In language, rejective negation is used when one rejects an object or action, while prohibitive negation is used to prohibit another's action. It may be that exposure to prohibitive negation promotes the development of negation in children. To this purpose we performed an HRI study that compares the performance of systems learning in a combined prohibitive-plus-rejective scenario against a purely rejective negation scenario. In the rejective scenario participants are asked to teach the humanoid different shapes printed on small boxes that are placed in front of them. They are also told that the humanoid has different preferences for these objects: it might like, dislike, or be neutral about them. In the prohibitive scenario participants are told to teach the robot the names of the shapes, but also that some of them are forbidden to touch. In neither case are the participants aware of the true purpose of the experiment: to investigate the robot's acquisition of the capacity to use negative utterances appropriately.

² This important psychological insight, going back to the Buddha, appears also in the related enactive model of the embodied mind detailed in Varela et al. [3, Ch. 6]. The cognitive architecture used here is the first to implement this principle on a humanoid, albeit in a simple way, and this serves as an essential ingredient in grounding the robot's language learning in a way that expands beyond mere sensorimotor associations, by including “feeling”, i.e., a valenced stance toward objects.

Outlook: The system described here is the first grounded language learning system to include motivational aspects, in addition to sensorimotor data, in language grounding. In developmental trajectories with different naïve participants, the humanoid acquires in just a few sessions the capability to use negation. Its speech and behavior appear to humans to express an array of types of what functions as, and is construed as, linguistic negation in embodied interactions [5, 47]. The elicitation of linguistic negation in the interactions of humanoid models with naïve participants, and the comparative efficacy of negation acquisition with and without prohibition, can help assess the notion that internal states such as affect and motivation can be as important as sensorimotor experience in the development of language; for detailed results so far see [47].

3. Ingredients of Robot Learning

3.1. The transition from babbling to word forms in real-time learning

Introduction: The experiments described here have the initial purpose of modelling the transition from babbling to salient word-form acquisition, through real-time proto-conversations between human participants and an iCub robot.

The work is analogous to some of the processes in human infants aged about 6-14 months. For full details see Lyon (2012) [48]. The scenario is shown in Figure 1.

The learning of word forms is a prerequisite to learning word meanings [49]. Before a child can begin to understand the meanings of words he or she must be able to represent word forms, which then come to be associated with particular objects or events [50]. The acquisition of word forms also facilitates the segmentation of an acoustic stream: learnt word forms act as anchor points, dividing the stream of sounds into segments and thus supporting segmentation by various other routes.


There is a close connection between perception and production of speech sounds in human infants [51, 52]. Children practice what they hear - there is an auditory-articulatory loop - and children deaf from birth, though they can understand signed and written language, cannot learn to talk³. An underlying assumption is that the robot, like a human infant, is sensitive to the statistical distribution of sounds, as demonstrated by Saffran [53] and subsequent researchers.

Most of the salient words in our scenario are in practice single-syllable words (red, green, black etc.; box, square, ring etc.). The more frequent syllables produced by the teacher are often salient word forms, and iCub's productions are influenced by what it has heard. When iCub produces a proper word form the teacher is asked to make a favorable comment, which acts as reinforcement.

Experimental work: A critical component of early human language learning is contingent interaction with carers [39, 54–57]. Therefore we have conducted experiments in which human participants, using their own spontaneous speech, interact with an iCub robot, aiming to teach it some word forms.

The human tutors were 34 naïve participants, paid a token amount as a gesture of appreciation. Most of them were administrative staff from the university or students from other disciplines. They were asked to speak to the robot as if it were a small child. After the experiment they answered a short questionnaire on their attitude to the iCub. Most had the impression that iCub was acting independently (on a scale of 1 to 5, 16/19 respondents gave a score of 4 or 5).

The following assumptions about iCub's capabilities are made:

(i) It practices turn taking in a proto-conversation.

(ii) It can perceive phonemes, analogous to human infants.

(iii) It is sensitive to the statistical distribution of phonemes, analogous to human infants [53, 58].

(iv) It can produce syllabic babble, but without the articulatory constraints of human infants; so, unlike a human, it can produce consonant clusters.

(v) It has the intention to communicate, so it reacts positively to reinforcement, such as approving comments.

The scenario for the experiments has the teacher sitting at a table opposite iCub, which can change its facial expression and move its hands and arms; the lower body is immobile (Figure 1). There is a set of blocks, and the participant is asked to teach iCub the names of the shapes and colors on the sides of the blocks. Initially iCub produces random syllabic babble, but this changes to quasi-random syllabic babble biased towards speech heard from the teacher. When the teacher hears a proper word form s/he is asked to reinforce this with an approving comment.

The teacher's speech is represented as a stream of phonemes. As no assumption is made on how this phonemic stream might be segmented into words or syllables, iCub perceives the phonemic input as a set of all possible syllables. For instance, using letters as pseudo-phonemes, the string i s a b o x generates i, is, sa, sab, a, ab, bo, box, o, ox. A frequency table for each of these syllables, in iCub's language processor, is incremented as they are perceived.

³ The celebrated Helen Keller did not become deaf and blind until 19 months old.

Influenced by what it has heard, iCub’s initial randomsyllabic babble becomes biased towards the speech of theteacher.

Overview of the babbling-to-word-form process:

Initial state:
    iCub produces random syllabic babble
Repeat until dialogue time ends:
    teacher speaks:
        speech represented as unsegmented stream of phonemes
    process:
        iCub perceives speech as set of all possible syllables;
        frequency table for each syllable is incremented
    iCub speaks:
        produces quasi-random babble, biased to teacher's input
    process:
        teacher listens: is there a real word in the babble?
        if teacher hears any real word form,
        then teacher utters reinforcement
    process:
        if iCub hears reinforcement,
        then previous utterance is analysed,
        a word is selected by heuristic and stored in lexicon
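A minimal sketch of this loop follows. The assumptions are illustrative, not the experiments' exact parameters: phonemes arrive as single symbols, a "syllable" is any contiguous substring of up to three phonemes, and the word selected on reinforcement is the most frequent syllable just babbled:

```python
import random
from collections import Counter

syllable_freq: Counter = Counter()
lexicon: set[str] = set()
last_utterance: list[str] = []

def hear(phonemes: list[str], max_len: int = 3) -> None:
    """Count every possible syllable in the unsegmented phoneme stream."""
    for i in range(len(phonemes)):
        for j in range(i + 1, min(i + max_len, len(phonemes)) + 1):
            syllable_freq["".join(phonemes[i:j])] += 1

def babble(n_syllables: int = 3) -> list[str]:
    """Quasi-random babble, biased towards frequently heard syllables."""
    global last_utterance
    if not syllable_freq:
        return []
    syllables, counts = zip(*syllable_freq.items())
    last_utterance = random.choices(syllables, weights=counts, k=n_syllables)
    return last_utterance

def reinforce() -> None:
    """On an approving comment, store the most frequent syllable of the
    previous utterance as a learnt word form."""
    if last_utterance:
        lexicon.add(max(last_utterance, key=lambda s: syllable_freq[s]))
```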

Each participant had two four-minute proto-conversations with iCub. For the conversion of the teacher's speech to a string of phonemes, an adapted version of the SAPI 5.4 speech recognizer was used. Participants were trained on the speech recognizer for 10 minutes before the experiment. The iCub's output was converted using the eSpeak speech synthesizer. The CMU phonemic alphabet is used. For details see [59, 60]. Drawbacks to our approach included the variable performance of the phoneme recognizer.

Since our participants were asked to talk to iCub as if it were a small child, user expectations were influenced in advance. Participants used their own spontaneous words, and we found child-directed speech used extensively, particularly by those who had experience caring for human infants. One problem was that some participants praised iCub excessively, and thus reinforced inappropriately. There was a wide range of interactive styles: some participants were very talkative, others said little. For more details see Lyon (2012) [48]. A video clip giving an example of a “conversation” is at http://youtu.be/eLQnTrX0hDM (note that ‘0’ is zero).

Outlook: The results indicate that phonetic learning, based on a sensitivity to the frequency of sounds occurring, can contribute to the emergence of salient words. It supports other methods, for instance through prosody and actions, as described in Section 2.3 and in Saunders (2011) [61].

To understand why this method works we need to distinguish between speech sounds and orthographic transcripts of words: there is not a one-to-one correspondence between them. Orthographic transcripts of speech do not represent exactly what the listener actually hears. Salient content words (nouns, verbs, adjectives) are more likely to have a consistent canonical phonemic representation than function words, whose prosody and pronunciation often vary markedly. For instance, in 4 hours of spontaneous speech annotated phonemically, Greenberg reported that the word “and” was recorded in 80 different forms [62]. A consequence of this is that, as perceived phonemically, the frequency of function words is less than their frequency in orthographic transcripts. In contrast, the frequency of salient content words builds up, and so does their influence on the learner.

Our current approach accords with recent neuroscientific research showing that dual streams contribute to speech processing [63, 64]. The experiments described here investigate dorsal stream factors by modeling the transition from babbling to speech.

Future work should investigate other methods of representing speech sounds, as well as, or instead of, phonemes. Advances have been made in using articulatory features, such as place and manner of articulation and voicing; their acoustic manifestations can be captured as a basis for the representation of speech. See for example [65, page 294].

3.2. The Language Game paradigm and social learning of word meanings

Introduction: In de Greeff and Belpaeme [66] we studied how social learning could be used by a robot to acquire the meaning of words. Social learning relies on the interplay between learning strategies, social interaction and the willingness of a tutor and learner to engage in a learning exchange.

Experimental work: We implemented a social learning algorithm to learn the meaning of words, based on the Language Game paradigm of Steels [67, 68] (a concept resonant with Wittgenstein's language games). The algorithm differed from classical machine learning approaches in that it allowed for relatively unstructured data and actively solicited appropriate learning data from a human teacher. As an example of the latter, when the agent noticed a novel stimulus in the environment it would ask the human for the name of that stimulus, or, when its internal knowledge model was ambiguous, it would ask for clarification. The algorithm, after validation in simulation [69], was integrated on a humanoid robot, which displayed appropriate social cues to engage the human teacher (see Figure 4). The robot was placed opposite a human subject, with a touch screen between the robot and the human to display visual stimuli and to allow the human to give input to the robot - thereby avoiding the need for speech recognition and visual perception in the robot, which might have introduced noise into the experiment.
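A minimal sketch of such a learner-driven game, assuming stimuli arrive as feature vectors, one prototype per word, and illustrative novelty/ambiguity thresholds (the actual system is described in [66, 69]):

```python
import numpy as np

prototypes: dict[str, np.ndarray] = {}      # one prototype vector per word

def observe(stimulus: np.ndarray, ask, novelty_thr=1.0, ambiguity_thr=0.1):
    """One game round; `ask(question)` queries the human tutor."""
    if not prototypes:
        name = ask("What is this called?")
        prototypes[name] = stimulus.copy()
        return name
    dists = {w: np.linalg.norm(p - stimulus) for w, p in prototypes.items()}
    ranked = sorted(dists, key=dists.get)
    best = ranked[0]
    if dists[best] > novelty_thr:                     # novel stimulus: ask
        best = ask("What is this called?")
    elif len(ranked) > 1 and dists[ranked[1]] - dists[best] < ambiguity_thr:
        best = ask(f"Is this {best} or {ranked[1]}?")  # ambiguous: clarify
    p = prototypes.setdefault(best, stimulus.copy())
    p += 0.2 * (stimulus - p)                         # adapt the prototype
    return best
```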

Figure 4. Setup for social learning of word-meaning pairs by a humanoid robot. See Section 3.2.

In the experiment two conditions were used: one in which the robot used social learning and the corresponding social cues (social condition), and one in which the robot did not provide social cues (non-social condition). The social condition resulted in both faster and better learning by the robot, which - given that the robot has access to more learning data in the social condition through the additional feedback given by the human tutor - is perhaps not surprising. However, we did notice that people formed a “mental model” of the robot's learning and tailored their tutoring behavior to the needs of the robot. We also noticed a clear gender effect, where female tutors were markedly more responsive to the robot's social bids than male tutors [66].

Outlook: These experiments showed how the design of the learning algorithm and the social behavior of the robot can be leveraged to enhance the learning performance of embodied robots when interacting with people. Further work is demonstrating how additional social cues can result in tutors offering better-quality teaching to artificial agents, leading to improved learning performance.

3.3. The Passive Motion Paradigm (PMP): generating goal-directed movements in robots

Introduction: This section addresses robotic movements,essential to research into the integration of action andlanguage learning.

A movement, per se, is nothing unless it is associated with a goal, and this usually requires recruitment of a number of motor variables (or degrees of freedom) in the context of an action. Even the seemingly simple task of reaching a point B in space, starting from a point A, in a given time T can in principle be carried out in an indefinitely large number of ways, with regard to spatial aspects (hand path), timing aspects (speed profile of the hand), and recruitment patterns of the available joints in the body (final posture achieved). How does the brain choose one pattern from the numerous other possible ones? Recognizing the crucial importance of multi-joint coordination was a real paradigm shift from the classical Sherringtonian viewpoint [70] (typically focused on single-joint movements) to the Bernsteinian [71] quest for principles of coordination or synergy formation. Since then the process by which the Central Nervous System (CNS) coordinates the action of a high-dimensional (redundant) set of motor variables for carrying out the tasks of everyday life - the “degrees of freedom problem” - has been recognized as a central issue in the scientific study of neural control of movement. Techniques that quantify task goals as cost functions and use the sophisticated formal tools of optimization [72] have recently emerged as a leading approach to solve this ill-posed action-generation problem [73, 74].

Figure 5. Top panel: the PMP network for coordinating the upper body of iCub (left arm-torso-right arm chain). The middle panel shows the use of such a network in a bimanual reaching task. PMP networks are assembled on the fly based on the nature of the motor task and other task-relevant constraints involved. As seen in the figure, any PMP network is grouped into different motor spaces (tool space, end-effector space, arm joint space, waist space) based on the nature of the task; since the task shown does not involve any use of a tool, there is no tool space. Each motor space consists of a generalized displacement node (blue) and a generalized force node (pink). Vertical connections (purple) denote impedances (K: stiffness, A: admittance) in the respective motor spaces, and horizontal connections denote the geometric relation between the two motor spaces, represented by the Jacobian (green). The goal induces a force field that causes incremental elastic configurations in the network, analogous to the coordination of a marionette with attached strings. The network also includes a time base generator which endows the system with terminal attractor dynamics: equilibrium is achieved not asymptotically but in finite time. External and internal constraints (represented as other task-dependent force/torque fields) bias the path to equilibrium in order to take into account suitable penalty functions. Bottom panel: iCub performing tasks where the target has to be reached with the additional constraint of a specific hand pose, in order to allow further manipulation. This is a multi-referential system of action representation and synergy formation, which integrates a forward and an inverse internal model. See Section 3.3.

However, questions arise regarding the massive amount of computation that needs to be performed to arrive at an optimal solution [75, 76]. We need to know how distributed neural networks in the brain implement these formal methods [76], how cost functions can be identified and formulated in contexts that cannot be specified a priori, how we can learn to generate optimal motor actions [77], and the related issue of sub-optimality [78, 79]. All are still widely debated [80, 81]. Recent extensions [82] provide novel insights into the issues related to reduction in computational cost and learning.

An alternative theory of synergy formation pursued by this consortium is the Passive Motion Paradigm (PMP) [83, 84], an extension of the equilibrium point hypothesis (EPH) [85–87] based on the theory of impedance control [88]. In PMP the focus of attention is shifted from “cost functions” to “force fields”. In general, the hypothesis was that the “force field” metaphor is closer to the biomechanics and the cybernetics of action than the “cost function” metaphor. We aim at capturing the variability and adaptability of human movement in a continuously changing environment in a way that is computationally “inexpensive”, allowing compositionality and run-time exploitation of redundancy in a task-specific fashion, together with fast learning and robustness.

Experimental work: The hypothesis was investigated by implementing the model on the iCub and conducting a number of experiments related to upper-body coordination and motor skill learning [84]. The basic idea in PMP is that actions are the consequences of an internal simulation process that “animates” the body schema with the attractor dynamics of force fields induced by the goal and task-specific constraints. Instead of explicitly computing cost functions, in PMP the controller just has to switch on task-relevant “force fields” and let the body schema evolve in the resulting attractor dynamics. The force fields which define and feed the PMP network can be modified at run time as a consequence of cognitively relevant events, such as the success or failure of the current action or sub-action [89, 90]. See Figure 5.
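A minimal sketch of PMP-style relaxation for a planar two-link arm, assuming a stiffness K in end-effector space and an admittance A in joint space; link lengths, gains and step size are illustrative, and the time base generator (terminal attractor) of the full model is omitted for brevity:

```python
import numpy as np

L1, L2 = 0.3, 0.25                     # link lengths (m), illustrative

def fk(q):                             # forward kinematics of the hand
    return np.array([L1*np.cos(q[0]) + L2*np.cos(q[0]+q[1]),
                     L1*np.sin(q[0]) + L2*np.sin(q[0]+q[1])])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0]+q[1]), np.cos(q[0]+q[1])
    return np.array([[-L1*s1 - L2*s12, -L2*s12],
                     [ L1*c1 + L2*c12,  L2*c12]])

def pmp_reach(q, goal, K=np.eye(2)*5.0, A=np.eye(2)*1.0, dt=0.01, steps=2000):
    """Relax the body schema in the goal-induced force field: the goal pulls
    the hand, the pull is mapped to joint space (J^T), and admittance turns
    torque into motion. No cost function or explicit inverse is computed."""
    for _ in range(steps):
        force = K @ (goal - fk(q))     # attractive force field at the hand
        q = q + dt * (A @ (jacobian(q).T @ force))
        if np.linalg.norm(goal - fk(q)) < 1e-3:
            break
    return q

q_final = pmp_reach(np.array([0.3, 0.5]), np.array([0.35, 0.25]))
```

Because redundancy is resolved implicitly by the relaxation itself, additional constraints (posture preferences, joint limits) can simply be added as further force fields, which is what makes the scheme compositional.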

Outlook: An important property of PMP networks is that they operate only through well-posed computations. This feature makes PMP a computationally inexpensive technique for synergy formation. The property of always operating through well-posed computations further implies that PMP mechanisms do not suffer from the "curse of dimensionality" [91] and can be scaled up to any number of degrees of freedom [84, 92]. In the framework of PMP, the issue of learning relates to learning the appropriate elastic (impedance), temporal (time base generator) and geometric (Jacobian) parameters related to a specific task. Some work has been done in this direction: for example, [93] deals with the learning of the elastic and temporal parameters and [94] with the learning of the geometric parameters. However, a general and systematic framework that applies to a wide range of scenarios remains an open problem, and work is ongoing in this direction.
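As a toy illustration of what learning the geometric parameters involves, the sketch below estimates a local Jacobian of the body schema from random motor babbling by least squares. It is only meant to convey the structure of the problem, not the method of [93] or [94]; the plant and babbling magnitudes are invented.

```python
import numpy as np

# Toy sketch: estimating the geometric (Jacobian) parameters of the body
# schema around a posture q0 from small random babbling motions.
rng = np.random.default_rng(0)

def true_fwd(q):   # stands in for the real (unknown) plant
    return np.array([0.3 * np.cos(q[0]) + 0.25 * np.cos(q[0] + q[1]),
                     0.3 * np.sin(q[0]) + 0.25 * np.sin(q[0] + q[1])])

q0 = np.array([0.3, 1.2])
dq = 0.01 * rng.standard_normal((200, 2))          # babbled joint displacements
dx = np.array([true_fwd(q0 + d) - true_fwd(q0) for d in dq])

# Least squares: find J such that dx ~= dq @ J.T (i.e. dx = J dq locally)
J_T, *_ = np.linalg.lstsq(dq, dx, rcond=None)
print("estimated Jacobian at q0:\n", J_T.T)
```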

The local and distributed nature of computations in PMP ensures that the model can be implemented using neural networks [94, 95]. At the same time, the brain basis of PMP remains underexplored and requires a more comprehensive investigation. A justification can still be made, one which in fact highlights the central difference between EPH and PMP. In the classical view of EPH, the attractor dynamics that underlies the production of movement is based on the elastic properties of the skeletal neuromuscular system and its ability to store/release mechanical energy [96]. Taking into account results from motor imagery [97, 98], PMP posits that cortical, subcortical and cerebellar circuits may also be characterized by similar attractor dynamics. This could explain the similarity of effects of real and imagined movements: although in the latter case the attractor dynamics associated with the neuromuscular system is not operant, the dynamics due to the interaction among other brain areas are still at play. In other words, considering the mounting evidence from neuroscience in support of common neural substrates being activated during both real and imagined movements, we posit that real, overt actions are also the result of an "internal simulation", as in PMP. Even though there are supporting results from behavioral studies [92], a more comprehensive program to investigate the neurobiological basis of PMP may be needed to substantiate this viewpoint.

On the other hand, it is still an open question whether or not the motor system represents equilibrium trajectories [75]. Many motor adaptation studies demonstrate that equilibrium points or equilibrium trajectories per se are not sufficient to account for adaptive motor behavior, but this is not sufficient to rule out the existence of neural mechanisms or internal models capable of generating equilibrium trajectories. Rather, as suggested by Karniel [75], such findings should induce research to shift from the lower-level analysis of reflex loops and muscle properties to the level of internal representations and the structure of internal models.

Figure 6. Experiments to gather interaction data. Participants (parents, adults) were asked to demonstrate actions, such as stacking cups, to a child (top panels), a virtual robot on a screen (second-row panels), the iCub robot (third-row panels) or another adult (bottom panels). See Section 4.2.

4. The Scaffolding of Social Interaction through HHI and HRI

4.1. Contingent human-human and human-robot interaction mediated by motor resonance

Introduction: A fundamental element of the integration of action and language learning is constituted by the way people perceive other individuals and react contingently to their actions. Indeed, beyond explicit and voluntary verbal exchanges, individuals also share beliefs and emotions in a more automatic way which may not always be mediated by conscious awareness. This is the case for communication based on gaze motion, body posture and movements. The assessment of such implicit communicative cues and the study of the mechanisms at their basis help us understand human-human interaction and investigate how people perceive and relate to non-living agents in human-robot interaction.

The physiological mechanism at the basis of this implicit communication is known as motor resonance [99] and is defined as the activation of the observer's motor control system during action perception. Motor resonance is considered one of the crucial mechanisms of social interaction, as it can provide a description of the unconscious processes which induce humans to perceive another agent (either human or robot) as an interaction partner. The concept of motor resonance can be applied to investigate both human-human interaction (HHI) and human-robot interaction (HRI), and the measure of the resonance evoked by a robotic device could provide quantitative descriptions of the naturalness of the interaction.

In particular, behavioral investigations can describe the tangible consequences of the tight coupling between action and perception described as motor resonance. By recording gaze movement and motion kinematics during or after action observation, we can directly identify which features of the observed human or robot action are used by observers during action understanding and execution. The modification of gaze or bodily movements associated with the observation of someone else's behavior can indeed shed light on motor planning, indicating if and in what terms implicit communication has occurred. In particular, motor resonance can imply facilitation in the execution of an action similar to the one observed - motion priming [100] - or a distortion while performing a different movement - motion interference [101]. Other phenomena that reflect motor resonance, and could therefore become an efficient measure of interaction naturalness, are automatic imitation [102, 103], a special case of translation of sensory information into action [104], and goal anticipation with the gaze [105, 106]. For a review of the methodologies currently used for the study of motor resonance in HHI and HRI see [107, 108].

There are alternative techniques available to measure the naturalness of HRI: for instance, neuroimaging and neurophysiological studies allow the evaluation of the activation of the putative neural correlates of motor resonance (the mirror-neuron system) during action observation [109]. The limitation of these methods is that they are often quite invasive and do not permit the testing of natural interactions. Alternatively, standardized questionnaires have been proposed to measure users' perception of robots and to estimate factors involved in HRI. However, questionnaires only assess conscious evaluations of the robotic devices and do not take into account some cognitive and physical aspects of HRI, falling short of a complete quantification of HRI. To circumvent this issue, physiological measurements such as galvanic skin conductance, muscle and ocular activities have been used to describe participants' responses when interacting with a mobile robot (e.g. [110]). We believe that a comprehensive description of the naturalness of the communication between humans and robots can only be provided by a combination of all the above-mentioned techniques.

Experimental work: With the aim of studying action-mediated implicit communication and of evaluating how HRI evolves in a natural interactive context, we adopted two new behavioral measures of motor resonance: the monitoring of proactive gazing behavior [106] and the measure of automatic imitation [103] (see [111] for a short review).

As the predictivity in someone's gaze pattern is associated with motor resonance [105, 112], the quantification of this anticipatory, unconscious behavior can represent a good estimate of the activation of the resonating mechanism and - in turn - of the naturalness of an interaction. This option presents some advantages with respect to previously adopted methods, as it does not require subjects to perform predetermined movements, but just to look naturally at an action. Moreover, it allows the study of the effect of observing complex, goal-directed actions. This differs from classic behavioral protocols, which usually require simple stereotyped movements. The method we employed was to replicate experiments previously conducted in HHI studies, i.e. to examine anticipatory gaze behavior when subjects were observing someone performing a goal-directed action, such as transporting an object into a container [105]. This was done by replacing the human demonstrator with the robotic platform iCub. In this way, we could contrast directly the natural gaze pattern adopted during the observation of human and robot actions. A comparison between the timing of gazing (the number of predictive saccades) in the two conditions provided an indication of the degree of resonance evoked by the different actors. In particular, the appearance of the same anticipation in gaze behavior during robot as during human observation indicated that a humanoid robotic platform moving as a human actor can activate a motor resonance mechanism in the observer [106], implying its ability to induce pro-social behaviors [108, 113].
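As an illustration of the measure, the sketch below counts a trial as predictive when the gaze reaches the goal area before the transported object does, and compares the proportion of predictive trials across demonstrator conditions. The trial timings are invented for the example; they are not data from the study.

```python
import numpy as np

# A trial counts as "predictive" when the gaze enters the goal (container)
# region before the transported object does. All timings below are made up.
def proportion_predictive(gaze_arrival, object_arrival):
    gaze_arrival, object_arrival = map(np.asarray, (gaze_arrival, object_arrival))
    return np.mean(gaze_arrival < object_arrival)

# times (s) at which gaze / object first enter the container region, per trial
human_gaze = [1.10, 1.25, 1.05, 1.40, 1.18]
human_obj  = [1.50, 1.45, 1.55, 1.38, 1.52]
robot_gaze = [1.30, 1.22, 1.61, 1.28, 1.35]
robot_obj  = [1.55, 1.50, 1.58, 1.49, 1.60]

print("human demonstrator:", proportion_predictive(human_gaze, human_obj))
print("robot demonstrator:", proportion_predictive(robot_gaze, robot_obj))
```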

At the same time, studying the automatic imitation phenomenon allowed us to quantitatively describe if and how human actions adapt in the presence of robotic agents, namely whether motor resonance mechanisms appear. This was done by studying the automatic imitation effect induced by movement observation on movement production [114], when the observed action was performed by a human agent or by the humanoid robot iCub [103]. The modification of the observer's movement velocity as a result of changes in the human or robot actor's velocity is behavioral evidence of the occurrence of motor resonance phenomena. Interestingly, the appearance of the actor and the shape of the motion trajectory had no substantial impact on the amount of automatic imitation, while the adoption by the robot of a non-biologically plausible velocity profile significantly reduced the unconscious movement adaptation by the subject (ibid).
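The notion of a biologically plausible velocity profile can be made concrete with the standard minimum-jerk model of point-to-point movement, whose bell-shaped speed profile contrasts with a constant-velocity movement of the same amplitude and duration. Whether these exact profiles were used on the robot is an assumption made here purely for illustration.

```python
import numpy as np

# Bell-shaped minimum-jerk speed profile (typical of biological reaching)
# versus a constant velocity covering the same distance D in the same time T.
D, T = 0.4, 1.0                                  # amplitude [m], duration [s]
t = np.linspace(0.0, T, 101)
s = t / T
v_minjerk = (D / T) * 30 * s**2 * (1 - s)**2     # derivative of min-jerk position
v_constant = np.full_like(t, D / T)

print("mean speeds:", np.trapz(v_minjerk, t) / T, np.trapz(v_constant, t) / T)
print("peak ratio (min-jerk / constant):", v_minjerk.max() / v_constant.max())
```

Both profiles have the same mean speed; the minimum-jerk profile peaks at roughly 1.875 times the mean, which is the kinematic signature an observer's motor system appears to resonate with.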


Outlook: The behavioral methods proposed here present crucial advantages with respect to other methods of investigating action-mediated communication in HRI contexts. In particular, the evaluation of gazing and automatic imitation behaviors guarantees spontaneity and smoothness in HRI, allowing for ecological testing of natural interaction. However, they also present some drawbacks, including the impossibility of determining exactly the neural activation associated with the interaction, which can be obtained by more invasive techniques such as neurophysiological and neuroimaging investigations. Moreover, beyond the basic, unconscious reactions to human and robot actions measured by these behavioral methods, several other cognitive processes might be involved during action observation and interaction, influencing robot perception, including attention, emotional states, previous experiences and cultural background. From this perspective, the methodologies we propose aim at covering the existing gap between the completely unconscious information obtained by the examination of neural correlates and the conscious evaluation of robotic agents given by questionnaires, by providing a quantitative description of the human motor response during HRI, with a focus on contingent, action-based communication.

4.2. Co-development and interaction in tutoring scenarios

Introduction: In this section we focus on methods which reflect the tutoring behavior of either a parent to a child or a human to a robot (simulated on a screen or actually embodied). The scenarios reflect both the social nature of learning interactions and the necessary co-development where the actions of the learner also affect the actions of the teacher. For our analysis, we used quantitative and qualitative approaches as well as integrative methods, based on a corpus of data from two different experiments within a tutoring scenario. The first was a parent-infant interaction and the second a human-robot interaction [115, 116]. With respect to parent-infant interaction, we conducted a semi-experimental study in which 64 pairs of parents were asked to present a set of 10 manipulative tasks to their infant (8 to 30 months) and to another adult, using both talk and manual actions. During the tasks, a parent and the child sat across a table facing each other while being videotaped with two cameras [117–119]. Parents demonstrated several tasks to their children. Some of these parents were recruited for a second study, where they were asked to demonstrate similar objects and actions to a virtual robot. See Figure 6.

Experimental work:

4.2.1. Quantitative approach

For the quantitative approach, we focused on investigations of child-directed speech, called motherese, and child-directed motion, called motionese [117]. The quantitative analyses pursue two goals: firstly, to provide a multimodal analysis of action demonstrations and speech, to understand how speech and action are modified for children and how the modifications in the different modalities (speech and action) change with the children's age and their motor and linguistic capabilities; secondly, to apply our multimodal analysis methods for the purpose of comparison, in order to characterize the interaction with a robot: to what extent is the interaction (and can it be) similar to a tutoring situation in which a caregiver is scaffolding a child? When we compared the data obtained from a tutor in a parent-child situation to that from a human-robot interaction, we found that, with a simulated robot, actions were modified more than speech. This virtual robot was designed to give the tutor visual feedback in the form of eye gaze directed to the most salient part of the scene. Results suggest that the tutor reacts to this feedback and adapts his/her behavior accordingly. In sum, to benefit from structuring input, the robot's processing modalities should display feedback [120].

4.2.2. Qualitative approach

For the qualitative approach, we use Ethnomethodological Conversation Analysis (EMCA) [121] as an analytical framework providing both a theoretical account of social interaction and a methodology for fine-grained micro-analysis of video-taped interaction data. This perspective invites us to consider "tutoring" as a collaborative achievement between tutor and learner. It aims at understanding the sequential relationship between different actions and at revealing the methods participants deploy to organize their interaction and solve the practical tasks at hand.

We undertook systematic annotation of the corpus with both manual and computational methods. In order to capture the tutor's hand motions, we used 2D computational pattern recognition methods [118, 119]. We developed a 2D motion tracker (as a plugin for the graphical plugin shell iceWing [122]) that allows tracking of specific points in a video file (e.g. right and left hands) over time using an optical-flow-based algorithm [123]. The generated output consists of a time-stamped list of x and y coordinates of the tracked point(s), defining their position in the video frame. We used a 2D motion tracking method instead of 3D body tracking because the existing video data did not contain 3D information. An important part of the tracked action consists of bi-axial movement, and previous analysis suggested no significant differences between 2D and 3D tracking results on this data set (cf. [117]). Central to a qualitative approach is the relation of the data from different annotation sources to each other, so that a close interaction loop can be demonstrated [119].
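A minimal version of such a point tracker can be written with pyramidal Lucas-Kanade optical flow as implemented in OpenCV. The video file name and the initial hand coordinates below are placeholders, and the original tool was an iceWing plugin rather than this script.

```python
import cv2
import numpy as np

# Sketch of a 2D point tracker: pyramidal Lucas-Kanade optical flow,
# emitting a time-stamped list of (t, x, y) positions for one tracked point.
cap = cv2.VideoCapture("video.avi")              # placeholder file name
fps = cap.get(cv2.CAP_PROP_FPS)
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = np.array([[[320.0, 240.0]]], dtype=np.float32)  # manually initialised point

frame_idx, track = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    frame_idx += 1
    if status[0][0] == 1:                        # point still tracked
        x, y = pts[0, 0]
        track.append((frame_idx / fps, float(x), float(y)))
    prev_gray = gray

print(track[:5])                                 # time-stamped x, y coordinates
```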

4.2.3. Integrative approach

By integrative methods we mean computational approaches that allow us to analyse phenomena in developmental studies. More specifically, we assume that we can better understand the function of parental behavioral modifications when we consider the interplay of different modalities. Motherese is not used for its own sake but in interplay with other modifications, such as motionese, in order to convey, for example, the meaning of an action or to introduce a novel word for an object [115, 124]. We need to better understand how specific features of motherese, such as stress and pauses as aspects of intonation on the phonological level, or particular constructions on the syntactic level, are related to specific parts of actions on objects in the real physical world. Then we can begin to build a model of how the multimodal cues observed in tutoring situations help to bootstrap learning. Furthermore, they may help us to better understand how the emergence of meaning may be modeled in artificial systems. Examples of the integrative approach follow.

Models of acoustic packaging: At the current state of research, we consider our model of acoustic packaging [125] to be the most appropriate method to investigate the interplay of action and speech, as this algorithmic solution enables us, firstly, to combine information about language and speech at an early processing level and, secondly, to analyze how parents package their actions acoustically. This relates to our previous findings of more temporal synchrony between the acoustic and visual signals [126], as well as of more acoustic packages, when an adult addressed a child than when interacting with another adult [125]. Such packages, when used in early interactions with a child, relate to the child's language development [127]. Thus, models of acoustic packaging give us insights into the functions of multimodal child-directed modifications and into how multimodal information enables a system to bootstrap, and then continuously refine, a very first concept of the basic structure of the actions being demonstrated.
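The core idea can be sketched as an interval computation: utterances "package" the action segments they overlap with in time. The intervals below are invented, and the actual model [125] operates on live audio-visual streams rather than hand-labelled segments.

```python
# Toy sketch of acoustic packaging: each speech interval collects the motion
# segments it overlaps with in time. All interval data here are invented.
speech = [(0.5, 2.0), (3.0, 4.5)]                          # utterances (start, end) [s]
motion = [(0.4, 1.0), (1.2, 1.9), (2.2, 2.8), (3.1, 4.0)]  # action segments [s]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

packages = [{"speech": s, "motion": [m for m in motion if overlaps(s, m)]}
            for s in speech]
for p in packages:
    print(p)   # note the (2.2, 2.8) motion segment is left unpackaged
```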

Models of cognition: Another example of integrative methods is parallel experiments with humans and artificial cognitive systems, with the aim of building simple but realistic models of cognition. We tested categorization in human-human and machine-machine experiments. In the following we outline the methodology for the human-human side of an experiment into the effects of social interaction on categorization of this kind. For this purpose an alien-world scenario was constructed. Participants in the experiment learn, through interaction with a teacher, the categories for sixteen objects which appear on a computer screen. These objects can be either round or square, red or green, light or heavy, blinking or non-blinking. Some features are relevant for the categorization and others irrelevant. The categories determine what kind of manipulation should be carried out on the objects. These manipulations can be place (e.g. the object needs to be placed in a particular position on the screen) or shake (an upward-and-downward or left-and-right movement). The participants learn about the appropriateness of their manipulation through a score that is shown to them, in a training phase, at the end of each interaction with an object after a given time. This methodology was first used in the study of Morlino et al. [128, 129].
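The structure of the stimulus space can be sketched as follows. Which features are category-relevant, and how categories map to manipulations, are illustrative assumptions here, not the actual experimental design.

```python
from itertools import product
from collections import Counter

# Sketch of the alien-world stimulus space: sixteen objects from four binary
# features. The choice of relevant features and the category-to-manipulation
# mapping below are illustrative assumptions.
features = {"shape": ["round", "square"], "colour": ["red", "green"],
            "weight": ["light", "heavy"], "blink": ["blinking", "static"]}
objects = [dict(zip(features, combo)) for combo in product(*features.values())]
assert len(objects) == 16

def target_manipulation(obj):
    # assume shape and colour are relevant; weight and blink are distractors
    if obj["shape"] == "round":
        return "place-left" if obj["colour"] == "red" else "place-right"
    return "shake-vertical" if obj["colour"] == "red" else "shake-horizontal"

# four objects per category; the learner must discover the relevant features
print(Counter(target_manipulation(o) for o in objects))
```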

Outlook: In our experiments we focused on social interaction. The objective was to see what kind of teaching behavior would improve an agent's learning. Again, human and artificial agents were tested in parallel. The focus of the experiment was on the types of instruction that the tutor gave. Two types of teaching strategy emerged: one centered around negative and positive feedback, whereas the other tried to symbolize the action required from the learner, e.g. alternately pressing two buttons to indicate the shaking movement.

In this work, the feedback given by the tutor via the symbols is quantified so that the different types of feedback can be modelled to build an artificial tutor. The experiment is then run with the artificial cognitive system. The tutor is modelled on a human tutor, whereas the learner is an artificial neural network. The aim is to yield insights into what kind of feedback allows and improves category learning in artificial agents, and into the consequences for cognitive and social robotics.

Future research needs to explore (i) the question of synchrony and (ii) the question of contingent interaction. With respect to (i), we need to investigate correlations between action and speech: for instance, how are attention-keeping functions in motionese, such as slow or exaggerated actions, accompanied by motherese? Similarly, how are verbal attention-getters accompanied by actions?

With respect to (ii), the question of contingent interaction, our qualitative analysis has shown that for successful tutoring it is not sufficient to just look at synchrony between speech and action; one must also consider its interactional dimension. The way in which tutors present an action is characterized not only by synchrony between talk and action, but also by the inter-personal coordination between tutor and learner [130].

4.3. Analysing user expectations in human-robot interaction

Introduction: Interactions do not take place in a void: they are influenced by prior assumptions, preconceptions and expectations about the partner and the interaction. Since people generally have experience of interactions with other people, they usually have a good idea of what to expect; in contrast, in interactions with communication partners that are somewhat different, such as children, pets, foreigners or robots (situations with which they may not have much experience), people may not be sure what to expect.

Methodologically, this is useful because the impact of such assumptions becomes apparent in asymmetric interactions. In human-robot interaction, different preconceptions have been shown to have a considerable influence [131, 132]. So, first, in order to predict people's behavior in interactions with a robot and, second, to guide them into appropriate behaviors that facilitate the interaction as well as the bootstrapping of language, experimental studies are necessary to determine what influences users' expectations and their subsequent behaviors. For instance, in interactions with children, caregivers employ numerous cues that may facilitate language learning. Whether and to what degree users can be made to employ such features when interacting with robots is thus an important question [133]. Understanding the similarities and differences between child-directed and robot-directed speech, and their determining factors, is furthermore crucial to predicting how people will interact with an unfamiliar robot in novel communication situations. Thus, it is desirable to understand what drives the linguistic choices people make when talking to particular artificial communication partners.

Experimental work: In the context of the project, we carried out controlled interaction experiments in which only one aspect of the interaction was varied. Factors that influence users' expectations comprise, for instance, the appearance of the robot and its degrees of freedom, as well as further aspects of robot embodiment [134], its communicative capabilities [135] and its behaviors [136]. These factors were investigated in experimental settings in which all participants were confronted with the same robot and very similar robot behaviors, but where one aspect of the robot was varied at a time. We then analyzed users' behavior in these interactions, especially their linguistic behavior, since the ways in which people design their utterances reveal how they think about their communication partner [134]. Particularly revealing are pronouns, passive constructions, sentence complexity and politeness formulae [137].
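By way of illustration, surface markers of this kind can be counted with simple pattern matching. The marker lists below are placeholders; the actual analyses relied on much richer linguistic coding than keyword counts.

```python
import re

# Crude sketch of surface-marker counting in robot-directed utterances.
# The marker inventories are illustrative, not the project's coding scheme.
PRONOUNS = re.compile(r"\b(you|your|it|its)\b", re.I)
POLITE = re.compile(r"\b(please|thank you|thanks|could you|would you)\b", re.I)

def profile(utterance):
    return {"words": len(utterance.split()),
            "pronouns": len(PRONOUNS.findall(utterance)),
            "politeness": len(POLITE.findall(utterance))}

for u in ["Could you pick up the red cup, please?", "Pick up red cup."]:
    print(profile(u))
```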

However, it is unlikely that people do not update their preconceptions during an interaction, and thus the relationship between users' expectations and processes of alignment [138] and negotiation on the basis of the robot's behavior needs to be taken into account. We investigated the interaction between preconceptions, alignment and feedback in the same way as preconceptions alone, namely by means of controlled studies of human-robot interaction. For instance, we identified preconceptions in greetings [132] and analyzed how users' behavior changed over the course of an interaction or a set of interactions [136]. Furthermore, we had the robot behave differently with respect to one feature, for instance contingent versus non-contingent gaze and pinging behavior [130].

Outlook: The experiments carried out support the view that user expectations continue to play a considerable role over the course of interactions, since they essentially constrain their own revision: if people understand the interaction with the robot as social, they will be willing to update their partner model based on the robot's behavior. If, however, they understand human-robot interaction as tool use, they will not be willing to take the robot's behavior into account to the same extent. Future work will need to identify further means to elicit and possibly change users' preconceptions, and to design interventions that shape users' expectations and subsequent behaviors [133].

4.4. Linguistic corpora studies to investigate child language acquisition

Introduction: In language acquisition research, one of the major empirical methodologies is the study of child language and child-directed speech as documented in linguistic corpora [139]. For an overview of corpus-based studies of child language acquisition see Behrens (2008) [140].

Compared to experimental approaches, the advantages of using corpus data for the study of child language acquisition include their ecological validity (all elements in the dataset are naturally occurring), their principled suitability for longitudinal research (covering longer time spans than is feasible with individual experimental sessions), the fact that they are freely available in large quantities, and the fact that they are machine-readable and, given appropriate annotation, can be conveniently processed in ways that open up unique analytical possibilities.

Disadvantages of corpus studies vis-à-vis experimental approaches are that the context of the productions in the corpus is not controlled, that there is no direct cueing of the specific phenomenon or behavior at issue in a given study, and that many potentially relevant context properties of the transcribed interactions (e.g. participants' gaze behavior or gestures) are often not preserved. Also, depending on the specific corpus chosen for a given study, factors such as corpus size, sample density and, where applicable, longitudinal span may impose additional limitations on the kinds of research questions that can reasonably be investigated with a given resource.

Another limitation of data transcribed from audio recordings is that orthographic transcripts do not entirely represent the actual sounds that are heard. See Section 3.1.

Experimental work: Apart from the immediate theoretical implications of the empirical results of such studies, they can also inform the process of constructing suitable stimuli for later experiments and computational investigations. For instance, in a scenario in which a robot is faced with the challenge of acquiring several different constructional patterns in parallel (comparable to the situation of a child), statistical properties of the input that are assumed to influence the acquisition process in children can be transferred to the robotic scenario, in which they can be systematically manipulated and explored [141]. Experiments can vary such quantitative parameters as the availability or strength of distributional cues to a particular category in the input, the frequency proportions between different variants of a given pattern, the amount of lexical overlap between two or more different patterns in the input, and so on [142].

In this project, corpus studies of naturalistic input patterns were conducted for a number of the most elementary grammatical constructions of English (i.e. basic argument structure constructions such as the simple and complex intransitive, the simple and complex transitive, and the ditransitive construction). Typical uses and functions of these patterns in child-directed speech were investigated in large-scale corpus studies of caregiver utterances in 25 English-language corpora in the CHILDES database [139].
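As a sketch of this kind of corpus query, the snippet below counts candidate prenominal possessives in the caregiver tier of a CHAT-formatted transcript (the format used by CHILDES). The regular expression is a rough approximation of the construction and would need hand-checking in a real study; the example transcript lines are invented.

```python
import re

# Count candidate prenominal possessives (e.g. "Eve's shoe") in the mother
# tier of a CHAT transcript. The pattern is approximate: real studies
# hand-check hits, e.g. to exclude contracted copulas such as "that's ball".
POSS = re.compile(r"\b[A-Za-z]+'s\s+[a-z]+\b")

def count_possessives(cha_lines, tier="*MOT:"):
    return sum(len(POSS.findall(line))
               for line in cha_lines if line.startswith(tier))

transcript = ["*MOT:\tthat's Eve's shoe .",
              "*CHI:\tshoe .",
              "*MOT:\twhere is mommy's cup ?"]
print(count_possessives(transcript))   # -> 2
```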

Outlook: Some of the difficulties described above appeared in our corpus studies. For instance, in a study of the way in which more abstract functions of a grammatical construction are developmentally grounded in uses that are more accessible to child language learners ("constructional grounding"), we investigated both caregivers' and children's use of the prenominal possessive construction (e.g. Eve's shoe) in different English-language corpora. The aim of the study was to assess whether ease of acquisition is better viewed as a function of semantic concreteness/qualitative salience or rather of input frequency/quantitative salience to the child. For the investigated corpora, the results pointed in the direction of quantitative salience, but questions remained as to whether some of the seemingly late-acquired variants only showed up so late in the data because the corpora were not large and dense enough to register possible earlier uses of these variants. However, these are clearly not principled problems, and in fact much current work goes into compiling ever larger, denser and more fully annotated corpora, often aligned with audio and/or video recordings, in order to capture more and more features of the scene [143].

5. Conclusion

The research undertaken in this project was, by design, multifaceted. The initial premise was that the co-development of action, linguistic skills, conceptualization and social interaction jointly contributes to the scaffolding of language capabilities; an overview of the research areas addressed is shown in Table 1. However, though these elements ultimately all come together, they can also be profitably studied in smaller combinations, or separately.

This heterogeneous approach is supported by findings in neuroscientific research. Consider the experiments described in Sections 2 and 3, which employ varied statistical learning processes as they focus on different aspects of language learning. In the wider domain, statistical learning is constrained to operate in specific modalities, which then go on to subserve domain-general mechanisms [2].

Our work included simulations of integrative neural processes analogous to those of humans (Section 2.1) and the development of the Epigenetic Robotics Architecture (ERA), a structure to integrate a wide range of cognitive phenomena (Section 2.2). An example of the result of training iCub in the meaningful use of words and actions can be seen in the video clip on YouTube: youtu.be/5l4LHD2lYJk (note that 'l' is the lower-case letter 'ell').

Experiments with iCub were carried out combining visual, audio and proprioceptive perceptions for learning the meaning of words (Section 2.3), leading on to work on proto-grammatical compositional forms. The acquisition of negation was investigated by adding valenced preferences to the robot's "experience" of objects (Section 2.4).

Prior to integration, components of language learning processes were studied separately. Real-time interactive experiments with iCub demonstrated how the transition from babbling to word-form production might occur (Section 3.1). Another approach, learning the meanings of words through social interaction, was investigated with the Language Games paradigm (Section 3.2).

Since the integration of action and language is central to our hypotheses, the implementation of goal-directed actions in robots is a key factor. Section 3.3 describes the theory and practice of the Passive Motion Paradigm (PMP). This approach avoids the classic problems of optimizing the movements of robot joints with multiple degrees of freedom: the indefinitely large number of possible moves to achieve a goal generates ill-posed problems.

One consequence of the PMP approach is a shift from low-level analysis to the structure of internal models (see the end of Section 3.3). This needs to be reconciled with the enactive, sensorimotor theory underlying the project approach (see Section 1, Introduction), which proposes that raw, uninterpreted perceptual experience scaffolds the acquisition of behaviors [5].

We need to recall that some learning processes are essential to the acquisition of speech, while others facilitate learning but are not absolutely essential. Although they can learn to use signed or written language, children profoundly deaf from birth cannot learn to speak. On the other hand, blind infants can learn to speak, albeit typically at a slower rate than their sighted contemporaries.

A theme running through the project is that language learning and conceptualization in our agents are inspired by their development in the child. Thus our work included research into contingent interaction through speech and gestures (Sections 4.1, 4.2, 4.3) and studies of corpora of child language (Section 4.4). The work on motor resonance, a crucial mechanism in the integration of action and language learning (Section 4.1), brings a new, multi-disciplinary approach to investigating communication between human and robotic agents.

Though the work in this project is inspired by child development, we have not always modelled every human characteristic. The real child is usually immersed in a learning environment all day long, whereas the robotic subjects of our experiments had short, task-based sessions, more like therapeutic scenarios. Other aspects of our work that are not in full accord with the real child model include: neural modelling based on back-propagation learning, which has no biological basis; the use of orthographic transcripts in child speech corpora, which do not altogether represent auditory perceptions; and the articulatory abilities of the iCub in the babbling-to-word-forms experiments, which do not fully match infant productions.

However, overall the project has progressed our understanding of language learning and cognitive bootstrapping, and of how they might be applied in robotics. One important aspect of methods, here applied in the field of language and action learning, is that the methods themselves can lead to new and interesting insights for further theoretical proposals. One example is the application of the methods outlined in Section 2.2, which discuss the use of the Epigenetic Robotics Architecture (ERA), to discover whether effects found in psychological experiments on early language learning with children would also occur in similar experiments with the humanoid robot iCub [34, 35]. The results of these experiments led to a revision of the theoretical ideas supporting such proposals, which was then further analysed using newer variations on these methods. This process, which we label Research Loops, is an important outcome, fusing together work from robotics and physically embodied studies with work on human development. In effect, methods are used to verify theoretical ideas on an experimental robotic platform in a way that would not be possible with human children or adults.

In conclusion, we hope that this article allows researchers in the field of embodied language learning to assess and enhance the methodologies shown, and we look forward to seeing further progress in this field.

Acknowledgement

The work described in this article was conducted within the EU Integrated Project ITALK ("Integration and Transfer of Action and Language in Robots"), funded by the European Commission under contract number FP7-214668.

6. References

[1] A. Cangelosi, G. Metta, G. Sagerer, S. Nolfi, C. L. Nehaniv, et al. Integration of action and language knowledge: A roadmap for developmental robotics. IEEE Transactions on Autonomous Mental Development, 2(3):167–195, 2010.
[2] R. Frost, B. C. Armstrong, N. Siegelman, and M. H. Christiansen. Domain generality versus modality specificity: the paradox of statistical learning. Trends in Cognitive Sciences, 19(3), 2015.
[3] F. Varela, E. Thompson, and E. Rosch. The Embodied Mind. MIT Press, 1991.
[4] J. K. O'Regan and A. Noë. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5):939–973, 2001.
[5] C. L. Nehaniv, F. Förster, J. Saunders, et al. Interaction and experience in enactive intelligence and humanoid robotics. In IEEE Symposium on Artificial Life (IEEE ALIFE), 2013.
[6] C. Lyon. Beyond vision: Extending the scope of a sensorimotor account of perception. In J. M. Bishop and A. O. Martin, editors, Contemporary Sensorimotor Theory. Springer, 2014.
[7] S. Harnad. The symbol grounding problem. Physica D, 42:335–346, 1990.
[8] L. Steels and M. Hild. Language Grounding in Robots. Springer-Verlag, New York, 2012.
[9] M. Tomasello. Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, 2003.
[10] F. Broz, C. L. Nehaniv, A. Cangelosi, et al. The ITALK project: A developmental robotics approach to the study of individual, social and linguistic learning. Topics in Cognitive Science, 6(3), 2014.
[11] F. Pulvermueller, M. Haerle, and F. Hummel. Walking or talking? Behavioral and neurophysiological correlates of action verb processing. Brain and Language, 78:134–168, 2001.
[12] A. M. Glenberg. Language and action: creating sensible combinations of ideas. In G. Gaskell, editor, The Oxford Handbook of Psycholinguistics. Oxford University Press, 2007.
[13] H. H. Clark. Space, time, semantics and the child. In T. E. Moore, editor, Cognitive Development and the Acquisition of Language, pages 27–64. Academic Press, New York, 1973.
[14] L. B. Smith, S. S. Jones, and B. Landau. Naming in young children: A dumb attentional mechanism? Cognition, 60:154–171, 1996.
[15] K. Nelson. Some evidence for the cognitive primacy of categorization and its functional basis. Merrill-Palmer Quarterly, 69:21–39, 1973.
[16] J. M. Mandler. How to build a baby: II. Conceptual primitives. Psychological Review, 99:587–604, 1992.
[17] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[18] M. H. Christiansen and N. Chater. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157–205, 1999.
[19] D. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2:53–62, 1990.
[20] D. Marocco, K. Fischer, T. Belpaeme, and A. Cangelosi. Grounding action words in the sensorimotor interaction with the world: experiments with a simulated iCub humanoid robot. Frontiers in Neurorobotics, 4(7), 2010.
[21] A. Cangelosi and T. Riga. An embodied model for sensorimotor grounding and grounding transfer: Experiments with epigenetic robots. Cognitive Science, 30(4):673–689, 2006.
[22] Y. Sugita and J. Tani. Learning semantic combinatoriality from the interaction between linguistic and behavioral processes. Adaptive Behavior, 13(3):211–225, 2005.
[23] P. Dominey. Emergence of grammatical constructions: Evidence from simulation and grounded agent experiments. Connection Science, 17(3-4):289–306, 2005.
[24] Y. Yamashita and J. Tani. Emergence of functional hierarchy in a multiple timescale neural network model: A humanoid robot experiment. PLoS Computational Biology, 4(11), 2008.
[25] M. Peniak, D. Marocco, J. Tani, Y. Yamashita, K. Fischer, and A. Cangelosi. Multiple time scales recurrent neural network for complex action acquisition. In IEEE ICDL-EPIROB, 2011.
[26] M. M. Botvinick and D. C. Plaut. Short-term memory for serial order: A recurrent neural network model. Psychological Review, 113(2):201–233, 2006.
[27] R. D. Beer. On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3(4):469–509, 1995.
[28] P. J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[29] E. Tuci, T. Ferrauto, G. Massera, and S. Nolfi. Co-development of linguistic and behavioural skills: compositional semantics and behaviour generalisation. In Proc. Conf. on Simulation of Adaptive Behavior (SAB 2010). Springer, 2010.
[30] M. Peniak and A. Cangelosi. Scaling-up action learning neuro-controllers with GPUs. In International Joint Conference on Neural Networks (IJCNN), pages 2519–2524. IEEE, 2014.
[31] A. F. Morse, J. de Greeff, T. Belpaeme, and A. Cangelosi. Epigenetic robotics architecture (ERA). IEEE Transactions on Autonomous Mental Development, 2(4):325–339, 2010.
[32] A. Noë. Action in Perception. MIT Press, 2004.
[33] A. F. Morse. Snapshots of sensorimotor perception. In V. Muller, editor, Philosophy and Theory of Artificial Intelligence. Springer, 2013.
[34] A. F. Morse, T. Belpaeme, A. Cangelosi, and L. B. Smith. Thinking with your body: Modelling spatial biases in categorization using a real humanoid robot. Paper presented at the Cognitive Science Conference, 2010.
[35] A. F. Morse, P. Baxter, T. Belpaeme, L. B. Smith, and A. Cangelosi. The power of words (and space). In Proc. ICDL-EPIROB, 2011.
[36] A. F. Morse, T. Belpaeme, A. Cangelosi, and C. Floccia. Modeling U-shaped performance curves in ongoing development. Paper presented at the Cognitive Science Conference, 2011.
[37] A. Morse, V. L. Benitez, T. Belpaeme, A. Cangelosi, and L. B. Smith. Posture affects word learning in robots and infants. PLoS ONE, 2015. In press.
[38] J. Saunders, C. L. Nehaniv, and C. Lyon. Robot learning of lexical semantics from sensorimotor interaction and the unrestricted speech of human tutors. In Proc. New Frontiers in Human-Robot Interaction, AISB Convention, 2010.

[39] P. K. Kuhl. Is speech learning gated by the social brain? Developmental Science, 10(1):110–120, 2007.
[40] P. Bloom. How Children Learn the Meaning of Words. MIT Press, 2002.
[41] J. Saunders, H. Lehmann, F. Förster, and C. L. Nehaniv. Robot acquisition of lexical meaning: Moving towards the two-word stage. In IEEE ICDL-EPIROB, 2012.
[42] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[43] R. Pea. The development of negation in early child language. In D. R. Olson, editor, The Social Foundations of Language and Thought: Essays in Honor of Jerome S. Bruner. W. W. Norton, 1980.
[44] F. Förster, C. L. Nehaniv, and J. Saunders. Robots that say 'no'. In Proceedings of the 10th European Conference on Artificial Life (ECAL 2009), 2009.
[45] J. Ruesch, M. Lopes, A. Bernardino, J. Hornstein, J. Santos-Victor, and R. Pfeifer. Multimodal saliency-based bottom-up attention: a framework for the humanoid robot iCub. In International Conference on Robotics and Automation (ICRA), pages 962–967. IEEE, 2008.
[46] W. Daelemans and A. van den Bosch. Memory-Based Language Processing. Cambridge University Press, 2005.
[47] F. Förster. Robots that Say 'No': Acquisition of Linguistic Behaviour in Interaction Games with Humans. PhD thesis, Adaptive Systems Research Group, University of Hertfordshire, 2013.
[48] C. Lyon, J. Saunders, and C. L. Nehaniv. Interactive language learning by robots: The transition from babbling to word forms. PLoS ONE, 7(6), 2012.
[49] M. Vihman, R. DePaolis, and T. Keren-Portnoy. A dynamic systems approach to babbling and words. In E. Bavin, editor, Handbook of Child Language, pages 163–182. CUP, 2009.
[50] H. Yeung and J. Werker. Learning words' sounds before learning how words sound: 9-month-old infants use distinct objects as cues to categorize speech information. Cognition, 113(2), 2009.
[51] B. de Boysson-Bardies. How Language Comes to Children. MIT Press, 1999.
[52] A. Fernald and V. A. Marchman. Language learning in infancy. In M. J. Traxler and M. A. Gernsbacher, editors, Handbook of Psycholinguistics, pages 1027–1071. Elsevier, 2nd edition, 2006.
[53] J. Saffran, R. Aslin, and E. Newport. Statistical learning by 8-month-old infants. Science, 274:1926–1928, 1996.
[54] P. K. Kuhl. Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, 5(11):831–843, 2004.
[55] A. Bigelow and C. Decoste. Sensitivity to social contingency from mothers and strangers in 2-, 4-, and 6-month-old infants. Infancy, 4:111–140, 2004.
[56] B. Wrede, S. Kopp, K. Rohlfing, M. Lohse, and C. Muhl. Appropriate feedback in asymmetric interactions. Journal of Pragmatics, 42:2369–2384, 2010.
[57] B. Wrede, K. Rohlfing, M. Hanheide, and G. Sagerer. Towards learning by interacting, pages 139–150. Springer, 2009.
[58] P. Jusczyk and R. Aslin. Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29:1–23, 1995.
[59] A. Rothwell, C. Lyon, C. L. Nehaniv, and J. Saunders. From babbling towards first words: the emergence of speech in a robot in real-time interaction. In IEEE Symposium on Artificial Life (IEEE ALIFE), pages 86–91, 2011.
[60] C. Lyon, C. L. Nehaniv, and J. Saunders. Preparing to talk: Interaction between a linguistically enabled agent and a human teacher. In AAAI Fall Symposium Series, Dialog with Robots, FS-10-05, 2010.
[61] J. Saunders, H. Lehmann, Y. Sato, and C. L. Nehaniv. Towards using prosody to scaffold lexical meaning in robots. In Proc. IEEE ICDL-EPIROB, 2011.
[62] S. Greenberg. Speaking in shorthand: A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29:159–176, 1999.
[63] G. Hickok and D. Poeppel. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition, 92(1-2):67–99, 2004.
[64] D. Saur, B. W. Kreher, S. Schnell, et al. Ventral and dorsal pathways for language. Proc. of the National Academy of Sciences, 105(46), 2008.
[65] S. Chang, M. Wester, and S. Greenberg. An elitist approach to automatic articulatory-acoustic feature classification for phonetic characterization of spoken language. Speech Communication, 2005.
[66] J. de Greeff and T. Belpaeme. Why robots should be social: Enhancing machine learning through social human-robot interaction. PLoS ONE, September 2015.
[67] L. Steels. Self-organizing vocabularies. In C. Langton and T. Shimohara, editors, Proc. Artificial Life V (ALife V). MIT Press, 1996.
[68] L. Steels and T. Belpaeme. Coordinating perceptually grounded categories through language. A case study for colour. Behavioral and Brain Sciences, 24(8):469–529, 2005.
[69] J. de Greeff, F. Delaunay, and T. Belpaeme. Human-robot interaction in concept acquisition: a computational model. In Proc. of IEEE ICDL, 2009.
[70] C. S. Sherrington. The Integrative Action of the Nervous System. Yale University (Silliman Memorial Lectures), 1906.
[71] N. A. Bernstein. The Coordination and Regulation of Movements. Pergamon Press, 1967.
[72] A. E. Bryson and Y. C. Ho. Applied Optimal Control. Hemisphere Publishing Corp., 1975.
[73] E. Todorov, W. Li, and X. Pan. From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators. Journal of Robotic Systems, 22(11):691–710, 2005.
[74] E. Todorov. Optimal control theory. In Bayesian Brain: Probabilistic Approaches to Neural Coding, pages 269–298. MIT Press, 2006.
[75] A. Karniel. Open questions in computational motor control. Journal of Integrative Neuroscience, 10(3):385–411, 2011.
[76] S. H. Scott. Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience, 5(7):532–546, 2004.
[77] K. Doya. How can we learn efficiently to act optimally and flexibly? Proc. National Academy of Sciences, 106(28):11429–11430, 2009.
[78] G. Ganesh, M. Haruno, M. Kawato, and E. Burdet. Motor memory and local minimization of error and effort, not global optimization, determine motor behavior. Journal of Neurophysiology, 104(1):382–390, 2010.
[79] J. Zenzeri and P. Morasso. Expert strategy switching in the control of a bimanual manipulandum with an unstable task. Pages 3115–3118, 2011.
[80] K. Friston. What is optimal about motor control? Neuron, 72(3):488–498, 2011.
[81] V. Mohan and P. Morasso. Passive motion paradigm: an alternative to optimal control. Frontiers in Neurorobotics, 5, 2011.
[82] E. Todorov. Efficient computation of optimal actions. Proc. National Academy of Sciences, 106(28):11478–11483, 2009.
[83] F. A. Mussa-Ivaldi, P. Morasso, and R. Zaccaria. Kinematic networks. A distributed model for representing and regularizing motor redundancy. Biological Cybernetics, 60(1):1–16, 1988.


[84] V Mohan, P Morasso, G Metta, and G Sandini. Abiomimetic, force-field based computational model formotion planning and bimanual coordination in humanoidrobots. Autonomous Robots, 27(3):291–307, 2009.

[85] A. Feldman. Functional tuning of the nervous system withcontrol of movement or maintenance of a steady posture.Biophysics, 11:925–935, 1966.

[86] E Bizzi, F A Mussa-Ivaldi, and S Giszter. Computationsunderlying the execution of movement: a biologicalperspective. Science, 253(5017):287–291, 1991.

[87] E Bizzi, N Hogan, F A Mussa-Ivaldi, and S Giszter. Doesthe nervous system use equilibrium-point control to guidesingle and multiple joint movements? Behavioral and BrainSciences, 15:603–613, 1992.

[88] N. Hogan. Modularity and causality in physical systemmodeling. ASME Journal of Dynamic Systems Measurementand Control, 109:384–391, 1987.

[89] Vishwanathan Mohan, Pietro Morasso, Giorgio Metta, andStathis Kasderidis. Actions and Imagined Actions inCognitive Robots. In Perception-Action Cycle, pages 539–572.Springer, 2011.

[90] Vishwanathan Mohan and Pietro Morasso. Towardsreasoning and coordinating action in the mental space.International Journal of Neural Systems, 17(4):329–341, 2007.

[91] Richard Ernest Bellman. Dynamic Programming. DoverPublications, Incorporated, 2003.

[92] Pietro Morasso, Maura Casadio, Vishwanathan Mohan,and Jacopo Zenzeri. A neural mechanism of synergyformation for whole body reaching. Biological Cybernetics,102(1):45–55, 2010.

[93] Vishwanathan Mohan, Pietro Morasso, Jacopo Zenzeri,Giorgio Metta, V Srinivasa Chakravarthy, and GiulioSandini. Teaching a humanoid robot to draw âAŸShapes.Autonomous Robots, 31(1):21–53, 2011.

[94] V. Mohan and P. Morasso. A forward / inverse motorcontroller for cognitive robotics. Artificial Neural Networks- ICANN 2006, pages 602–611, 2006.

[95] PMorasso and V Sanguineti. Self-organization, computationalmaps, and motor control, volume 119. North Holland, 1997.

[96] A G Feldman. Functional Tuning of the NervousSystem with Control of Movement or Maintenance of aSteady Posture. II. Controllable Parameters of the Muscle.Biophysics, 11(3):498–508, 1966.

[97] J Decety. Do imagined and executed actions share the sameneural substrate? Brain Research, 3(2):87–93, 1996.

[98] M Jeannerod. Neural simulation of action: a unifyingmechanism for motor cognition. NeuroImage, 14(1 Pt2):S103–S109, 2001.

[99] G. Rizzolatti, L. Fadiga, L. Fogassi, and V. Gallese.Resonance behaviors and mirror neurons. Archives Italiennede Biologie, 137(2-3):85–100, 1999.

[100] L. Craighero, L. Fadiga, C. A. Umiltà, and G. Rizzolatti.Evidence for visuomotor priming effect. Neuroreport,8(1):347–349, 1996.

[101] J. M. Kilner, Y. Paulignan, and S. J. Blakemore. Aninterference effect of observed biological movement onaction. Current Biology, 13:522–525, 2003.

[102] C. Heyes. Automatic imitation. Psychological Bulletin,137(3):463–483, 2011.

[103] A Bisio, A Sciutti, F Nori, G Metta, L Fadiga, G Sandini,and T Pozzo. Motor Contagion during Human-Human andHuman-Robot Interaction. PloS one, 9(8), 2014.

[104] A. Wohlschlaeger, M. Gattis, and H. Bekkering. Actiongeneration and action perception in imitation: an instanceof the ideomotor principle. Phil. Trans. Royal Society. SeriesB, Biological sciences, 358(1431):501–515, 2003.

[105] J. R. Flanagan and R. S. Johansson. Action plans used inaction observation. Nature, 424:769–771, 2003.

[106] A Sciutti, A Bisio A, F Nori, G Metta, L Fadiga L., andG Sandini. Robots can be perceived as goal-oriented agents.Interaction Studies, 2013.

[107] A Sciutti, A Bisio, F Nori, et al. Measuring human robotinteraction through motor resonance. International Journalof Social Robotics, 4(3), 2012.

[108] T Chaminade and G Cheng. Social cognitive neuroscienceand humanoid robotics. J Physiol. Paris, 103(3-5), 2009.

[109] G. Rizzolatti and L. Craighero. The mirror-neuron system.Annual Review of Neuroscience, 27:169–192, 2004.

[110] F Dehais, E A Sisbot, R Alami, and M Causse.Physiological and subjective evaluation of a human-robotobject hand-over task. Appl. Ergon., 42(6), 2011.

[111] G Baud-Bovy, P Morasso, F Nori, G Sandini, andA Sciutti. Human machine interaction and communicationin cooperative actions. In R Cingolani, editor, BioinspiredApproaches for Human-Centric Technologies. Springer, 2014.

[112] T. Falck-Ytter, G. Gredebäck, and C. von Hofsten. Infantspredict other people’s action goals. Nature Neuroscience,9(7):878–879, 2006.

[113] R van Baaren, R Holland, K Kawakami, and A vanKnippenberg. Mimicry and prosocial behavior. Psychol Sci,15 (1), 2004.

[114] A Bisio, N Stucchi, M Jacono, L Fadiga, and T Pozzo.Automatic versus voluntary motor imitation: effect ofvisual context and stimulus velocity. PLoS One, 5(10), 2010.

[115] Kerstin Fischer, Kilian Foth, Katharina Rohlfing, and BrittaWrede. Mindful tutors – linguistic choice and actiondemonstration in speech to infants and to a simulated robot.Interaction Studies, 12(1), 2011.

[116] Katrin Lohan, Katharina Rohlfing, Karola Pitsch, et al. Tutorspotter: Proposing a feature set and evaluating it in arobotic system. International Journal of Social Robotics, 4(2),2012.

[117] K. J. Rohlfing, J. Fritsch, B. Wrede, and T. Jungmann.How can multimodal cues from child-directed interactionreduce learning complexity in robots? Advanced Robotics,20(10):1183–1199, 2006.

[118] A. L. Vollmer, K. S. Lohan, K. Fischer, et al. People modifytheir tutoring behavior in robot-directed interaction foraction learning. In Proc. of DEVLRN âAZ09: IEEE ICDL,2009.

[119] K Pitsch, A-L Vollmer, K J Rohlfing, J Fritsch, and B Wrede.Tutoring in adult-child interaction. On the loop of thetutor’s action modification and the recipientâAŸs gaze.Interaction Studies, 15:55-98, 2014.

[120] Kerstin Fischer. Contingency, projection and attentionto common ground as major design principles for robotfeedback. In Proc. ‘Robot Feedback in Human-RobotInteraction: How to Make a Robot ‘Readable’ for a HumanInteraction Partner’, IEEE RoMan’12, 2012.

[121] M. Rapley. Ethnomethodology/conversation analysis,in qualitative research methods in mental health andpsychotherapy: A guide for students and practitioners.John Wiley, 2011.

[122] F. Lömker, A. Hüwel, and I. Savas. iceWing - User andProgramming Guide., 2006. visited 18 Nov 2014.

[123] B. D. Lucas and T. Kanade. An iterative image registrationtechnique with an application to stereo vision. In Proc.Imaging Understanding Workshop, pages 121–130, 1981.

[124] K S Lohan, S S Griffiths, A Sciutti, T C Partmann, and K J Rohlfing. Co-development of manner and path concepts in language, action, and eye-gaze behavior. Topics in Cognitive Science, 2014.

[125] L. Schillingmann, B. Wrede, and K.J. Rohlfing. A computational model of acoustic packaging. IEEE Transactions on Autonomous Mental Development, 1(4):226–237, 2009.

[126] M. Rolf, M. Hanheide, and K.J. Rohlfing. Attention via synchrony: Making use of multimodal cues in social learning. IEEE Transactions on Autonomous Mental Development, 1(1):55–67, 2009.

[127] K J Rohlfing and I Nomikou. Intermodal synchrony as form of maternal responsiveness is associated with language development. Language, Interaction and Acquisition, 5(1), 2014.

[128] G. Morlino, C. Gianelli, A. M. Borghi, and S. Nolfi. Developing the ability to manipulate objects: A comparative study with human and artificial agents. In Proc. Epigenetic Robotics, pages 169–170, 2010.

[129] S Griffiths, S Nolfi, G Morlino, et al. Bottom-up learning of feedback in a categorization task. In IEEE ICDL-EPIROB, 2012.

[130] Kerstin Fischer, Katrin S. Lohan, Joe Saunders, Chrystopher Nehaniv, Britta Wrede, and Katharina Rohlfing. The impact of the contingency of robot feedback on HRI. In Proc. Workshop on Collaborative Robots and Human Robot Interaction (CR-HRI 2013), 2013.

[131] Steffi Paepcke and Leila Takayama. Judging a bot by its cover: An experiment on expectation setting for personal robots. In Proceedings of Human Robot Interaction (HRI), Osaka, Japan, 2010.

[132] Kerstin Fischer. Interpersonal variation in understanding robots as social actors. In Proc. HRI'11, 2011.

[133] Kerstin Fischer. Alignment or collaboration? How implicit views of communication influence robot design. In Proc. Conference on Cooperative Technological Systems, 2014.

[134] Kerstin Fischer, Katrin Lohan, and Kilian Foth. Levels of embodiment: Linguistic analyses of factors influencing HRI. In Proc. HRI'12, 2012.

[135] Kerstin Fischer, Bianca Soto, Caroline Pantofaru, and Leila Takayama. The role of social framing in initiating human-robot interaction. In Proc. IEEE Symposium on Robot and Human Interactive Communication, Ro-Man '14, 2014.

[136] Kerstin Fischer and Joe Saunders. Getting acquainted with a developing robot. In Human Behavior Understanding. Springer LNCS 7559, 2012.

[137] Cindy K. Chung and James W. Pennebaker. The psychological function of function words. In K. Fiedler, editor, Social communication: Frontiers of social psychology. New York: Psychology Press, 2007.

[138] M J Pickering and S Garrod. Towards a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27, 2004.

[139] B. MacWhinney. The CHILDES Project: Tools for Analyzing Talk. Erlbaum, 1995.

[140] H Behrens. Corpora in language acquisition research: History, methods, perspectives. John Benjamins, 2008.

[141] A Goldberg. Constructions at Work. OUP, 2006.

[142] Arielle Borovsky and Jeff Elman. Language input and semantic categories: A relation between cognition and early word learning. Journal of Child Language, 33, 2006.

[143] D Roy, R Patel, P DeCamp, et al. The human speechome project. In Proc. Cognitive Science Conference, 2008.

The ITALK project includes researchers from Plymouth University and the University of Hertfordshire in the UK; the Italian Institute of Technology and the National Research Council, Italy; Bielefeld University, Germany; University of Southern Denmark; Riken Brain Science Institute, Japan; and the Massachusetts Institute of Technology, USA.
