Original Paper

Adaptive Behavior 1–15. © The Author(s) 2015. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1059712315608424. adb.sagepub.com

Integrating learning by experience and demonstration in autonomous robots

Paolo Pagliuca and Stefano Nolfi

Abstract
We propose an integrated learning by experience and demonstration algorithm that operates on the basis of both an objective scalar measure of performance and a demonstrated behaviour. The application of the method to two qualitatively different experimental scenarios involving simulated mobile robots demonstrates its efficacy. Indeed, the analysis of the obtained results shows that the robots trained through this integrated algorithm develop solutions that are functionally better than those obtained by using either a pure learning by demonstration or a pure learning by experience algorithm. This is because the algorithm drives the learning process toward solutions that are qualitatively similar to the demonstration, but leaves the learning agent free to differentiate from the demonstration when this turns out to be necessary to maximize performance.

Keywords
Learning by demonstration, learning by experience, autonomous robot, evolutionary strategies

1. Introduction

Humans frequently learn by using a combination of demonstration and trial and error. 'When learning to play tennis, for instance, an instructor will repeatedly demonstrate the sequence of motions that form an orthodox forehand stroke. Students subsequently imitate this behaviour, but still need hours of practice to successfully return balls to a precise location on the opponent's court' (Kober, Bagnell, & Peters, 2013, p. 1257).

From a machine learning perspective, the combined use of learning by demonstration (Argall, Chernova, Veloso, & Browning, 2009; Billard, Callinon, Dillmann, & Schaal, 2008) and learning by experience (Kober et al., 2013; Nolfi & Floreano, 2000; Sutton & Barto, 1998) presents both advantages and drawbacks. Firstly, it enables the learning agent to exploit a richer training feedback constituted by both a scalar performance objective (a reinforcement signal or fitness measure) and a detailed description of a suitable behaviour (a demonstration). Secondly, it makes it possible to combine the complementary strengths of the two learning modes. Indeed, the possibility to observe proper behaviours frees the agent learning by demonstration from the need to find out the behavioural strategy that can be used to solve the task and, consequently, narrows down the goal of the learning process to the discovery of how the demonstrated behaviour can be reproduced.

Moreover, the possibility to rely on an objective scalar measure evaluating the functionality of the robot's overall behaviour enables the learning robot to select variations that represent real progress and to exploit properties that emerge from the agent/environment interactions.

On the other hand, the two learning modes also have drawbacks. In learning by demonstration, the attempt to solve the task by approximating a demonstrated solution, rather than by directly optimizing performance, exposes learning agents to the problem that minor differences between the demonstrated and reproduced behaviours might cumulate over time, eventually causing large undesired effects. In learning by experience, the selection of variations that represent genuine advances with respect to the current capabilities of the agent might drive the learning process toward local minima that prevent the discovery of better solutions. Furthermore, the combination of these two learning modes has potential shortcomings as well. In effect, channelling the learning process through demonstrated solutions is likely to be beneficial only when the demonstrated behaviour represents a solution that is effective and learnable from the point of view of the learning agent.

Institute of Cognitive Sciences and Technologies, National Research Council (CNR), Roma, Italia

Corresponding author:
Paolo Pagliuca, Institute of Cognitive Sciences and Technologies, National Research Council (CNR), Via S. Martino della Battaglia, 44, 00185 Roma, Italia.
Email: [email protected]

This is not necessarily the case when the learning agent is a robot and the demonstrator is a human, i.e., when the demonstrator and the learning agent have different body structures and different cognitive capabilities. Finally, the combined effects of two different learning methods can drive the learning process toward the synthesis of intermediate solutions, representing a sort of compromise between the changes driven by the two learning modes, which are not necessarily functionally effective.

Over the last few years, several researchers have proposed hybrid learning by demonstration and experience methods for robot training. In particular, Rosenstein and Barto (2004) extended an actor-critic reinforcement learning algorithm (Sutton & Barto, 1998) with a supervisor agent that co-determines, together with the learning agent, the actions to be executed at each time step. These actions are generated by performing a weighted sum of the actions proposed by the supervisor and the learning agent. The parameter that trades off between the two actions is varied by taking into account the estimated efficacy of both the supervisor and the agent in specific circumstances and the need to increase the autonomy of the agent progressively during the learning process. Overall, the results obtained in a series of case studies demonstrate that the combination of the two learning modes can provide advantages. Nonetheless, the strategy of averaging the actions suggested by the supervisor and the agent might limit the efficacy of the algorithm to domains in which the negative effects caused by the summation of incongruent actions are negligible. In a related work, Judah, Roy, Fern, and Dietterich (2010) incorporated user advice into a reinforcement learning algorithm. More specifically, learners alternate between practice (i.e., experience) and end-user critique, during which advice is gathered. The user analyses the behaviour of the learners and marks subsets of actions as good or bad. The policy is optimized through a linear combination of practice and critique data. Results obtained in a series of experiments demonstrate that the approach significantly outperformed pure reinforcement learning. However, the study revealed usability issues related to the amount of practice and advice needed to achieve successful performance.

Kober and Peters (2009) investigated how a special kind of pre-structured parameterized policies called motor primitives (Ijspeert, Nakanishi, & Schaal, 2004; Schaal, Mohajerian, & Ijspeert, 2007) can be subjected to a combined learning by demonstration and experience training (Daniel, Neumann, & Peters, 2012; Paraschos, Daniel, Peters, & Neumann, 2013). Motor primitives consist of two coupled parametric differential equations. The equations are hand-designed, while their parameters are learned.

Learning occurs in two phases: first, the parameters are subjected to a learning by demonstration process; then, they are refined on the basis of a reinforcement learning process. This method has been successfully applied to a series of tasks with distal rewards that could not be solved through learning by demonstration by itself. However, it operates appropriately only when the amount of deviation from the demonstrated trajectories is actively limited (Kober & Peters, 2009; Kober et al., 2013; Valenti, 2013). As a result, the method can only be successful when the learning by demonstration phase enables the learning robot to acquire a close-to-optimal strategy that only requires subsequent fine-tuning.

Argall, Browning, and Veloso (2008) proposed a method for enriching the demonstration data set with advice produced by the human demonstrator during the observation of the behaviour exhibited by the learning robot. Advice consists of required modifications, such as 'turn less/more tightly' or 'use a smoother translational speed', that can be easily identified by a human observer and used to automatically generate additional demonstration data corresponding to a corrected version of the behaviour displayed by the learning robot. The collected results indicate that this technique can outperform standard learning by demonstration methods. Consequently, as hypothesized, personalizing the demonstration depending on the characteristics of the learning robot can improve the outcome of the learning process. Nevertheless, this method only operates by attempting to reduce the discrepancy between the actual and the demonstrated behaviour and does not combine this with a learning by experience method driven by a direct measure of performance. Therefore, it does not provide a way to overcome the problem that minor inevitable differences between the demonstrated and the reproduced behaviour can lead to large negative consequences over time.

Other related works are those of Abbeel and Ng (2005), Cetina (2007), Chernova and Veloso (2007) and Knox and Stone (2009, 2010). Abbeel and Ng (2005) demonstrated the advantage of initializing the state-action pairs of a reinforcement learning algorithm on the basis of a learning by demonstration method. Cetina (2007) implemented an algorithm in which the advice of the user is used to restrict the set of possible actions explored by the reinforcement learning. Chernova and Veloso (2007) proposed a learning by demonstration approach in which the learning agent can progressively increase its autonomy by reducing the number of demonstrations required from the teacher. Knox and Stone (2009) proposed a reinforcement learning method that operates on the basis of short-term rewards provided by human trainers, which predict the expected long-term effect of single actions or short sequences of actions.

A question that remains unanswered by all the presented approaches is how different training feedbacks can be effectively integrated. Indeed, the different learning modes (i.e., learning by experience and learning by demonstration) often drive the learning agents toward different outcomes, whose effects are not necessarily additive. For example, combining the two processes in sequence, by first subjecting the robot to a learning by demonstration phase and then to a learning by experience phase, might cause the capabilities acquired during the first phase to be washed out at the beginning of the second phase (Kober et al., 2013; Valenti, 2013). Similarly, a combination achieved by summing up the effects of the two processes might lead to undesirable consequences.

In this paper, we introduce a new method that integrates learning by demonstration and learning by experience in a single algorithm. This algorithm operates by estimating the local gradient of the current candidate solution with respect to an objective performance measure and by preferentially exploring variations that reduce the differences between the robot's behaviour and the demonstrated behaviour.

The results obtained on two qualitatively different tasks demonstrate that the algorithm is able to synthesize effective solutions. Such solutions are qualitatively similar to the demonstrated behaviour with respect to the characteristics that are functionally appropriate, while deviating from the demonstration with respect to other characteristics.

In the next section, we illustrate the first experimental setting and the learning algorithms. In section 3, the obtained results are described. The second experimental setting and the corresponding results are described in sections 4 and 5. Finally, in section 6 we draw our conclusions.

2. Exploration experiment

In a first experiment, a simulated Khepera robot (Mondada, Franzi, & Ienne, 1993) has been trained for the ability to explore an arena surrounded by walls that contains cylindrical obstacles located in randomly varying positions (Figure 1). With the term explore we mean 'to visit the highest possible number of environmental locations'. To verify the advantages and disadvantages of the different learning modes and of their combination, we carried out three series of experiments in which the robot has been trained through a learning by experience, a learning by demonstration, and an integrated learning by experience and demonstration algorithm. The three algorithms are described in sections 2.1, 2.2 and 2.3.

The robot is provided with six infrared sensors located on its front side, which enable it to detect nearby obstacles, and two motors controlling the desired speed of the two wheels. The robot's controller consists of a three-layer feed-forward neural network with six sensory neurons, which encode the state of the six corresponding infrared sensors, six internal neurons, and two motor neurons, which encode the desired speed of the two wheels. The activation of the infrared sensors is normalized in the range [0.0, 1.0]. The activation of the internal and motor neurons is computed based on the logistic function. The activation of the motor neurons is normalized in the range [−10.4, 10.4] cm/s. Internal and motor neurons are provided with biases.
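As a rough illustration of how compact this controller is, the following sketch evaluates one control step of the 6-6-2 network described above. The parameter layout (two weight matrices and two bias vectors) and the function names are our own assumptions; the original experiments use FARSA rather than Python.

```python
import numpy as np

def controller_step(ir_sensors, W1, b1, W2, b2, v_max=10.4):
    """One control step of the 6-6-2 feed-forward controller.

    ir_sensors: six infrared readings already normalized in [0.0, 1.0].
    W1 (6x6), b1 (6), W2 (6x2), b2 (2): the 48 connection weights and 8 biases.
    Returns the desired speeds of the two wheels in cm/s.
    """
    logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
    hidden = logistic(ir_sensors @ W1 + b1)       # six internal neurons
    motors = logistic(hidden @ W2 + b2)           # two motor neurons, in [0, 1]
    return (2.0 * motors - 1.0) * v_max           # mapped to [-10.4, 10.4] cm/s
```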

The environment consists of a flat arena of 1.0 × 1.0 m, surrounded by walls, containing five cylinders with a diameter of 2.5 cm located in randomly varying positions. The robot and the environment have been simulated by using FARSA (Massera, Ferrauto, Gigliotta, & Nolfi, 2013), an open software tool that has been used to successfully transfer results obtained in simulation to hardware for several similar experimental settings (e.g., De Greef & Nolfi, 2010; Sperati, Trianni, & Nolfi, 2008).

The performance of the robot, i.e., its ability to visit all environmental locations, is computed by dividing the environment into 100 cells of 10 × 10 cm and counting the number of cells that have been visited at least once over a period of 75 s during which the robot is allowed to interact with the environment. To select robots able to carry out the task in variable environmental conditions, the robot's performance is evaluated during 25 trials, each lasting 75 s. The position and the orientation of the robot and the positions of the obstacles are randomly initialized at the beginning of each trial. The overall performance is calculated by averaging the performances obtained during the different trials. It is worth noting that the optimal performance is unknown, since the time for exploration is limited and some of the cells are occupied by obstacles.
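The cell-counting measure can be made concrete with a short sketch. The trajectory representation and the helper name run_trial are hypothetical; the grid discretization follows the description above.

```python
import numpy as np

def exploration_score(trajectory_xy, arena_size=1.0, n_cells=10):
    """Number of distinct 10 x 10 cm cells visited during one 75 s trial.

    trajectory_xy: sequence of (x, y) robot positions in metres, sampled during the trial.
    """
    visited = set()
    for x, y in trajectory_xy:
        col = min(int(x / arena_size * n_cells), n_cells - 1)
        row = min(int(y / arena_size * n_cells), n_cells - 1)
        visited.add((row, col))
    return len(visited)

# Overall performance: the average of the per-trial scores over 25 trials with randomly
# initialized robot pose and obstacle positions (run_trial is a hypothetical helper that
# simulates one trial and returns the recorded trajectory).
# performance = np.mean([exploration_score(run_trial(params)) for _ in range(25)])
```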

The 48 connection weights of the neural network and the eight biases are the free parameters that are initially set randomly and then learned by using the algorithms described in the next sections. The values of these parameters determine the control policy of the robot, i.e., how the robot reacts to the experienced sensory states.

Figure 1. Exploration experiment: the robot and the environment.

2.1 Learning by experience

In learning by experience methods, the learning robot discovers either the behaviour through which the task can be solved, or the control rules that, in interaction with the environment, lead to the production of that behaviour. The learning process is driven by a scalar measure of performance that rates the extent to which the robot is able to accomplish the task. The use of a performance measure that does not specify how the problem should be solved potentially enables the learning robot to find effective solutions that are often simpler and more robust than those identifiable by human designers (Argall, Browning, & Veloso, 2010; Coates, Abbeel, & Ng, 2008; Taylor, Suay, & Chernova, 2011). On the other hand, the use of such an implicit training feedback, which constrains the course of the adaptive process only loosely, can initially drive the learning process toward the synthesis of sub-optimal behavioural solutions that constitute local minima (i.e., solutions that cannot be progressively transformed into better solutions without producing performance drops).

A first way to achieve learning of this kind is to use a stochastic hill climber algorithm (Russell & Norvig, 2009), in which the free parameters of the robot's controller are randomly modified and the variations are retained or discarded depending on whether or not they produce an improvement of the robot's performance.
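A minimal sketch of such a stochastic hill climber, written for illustration only (the perturbation size and the fitness interface are our own assumptions):

```python
import numpy as np

def stochastic_hill_climber(fitness, n_params, iterations=1000, sigma=0.1, seed=0):
    """Randomly perturb the free parameters and keep only non-worsening variations."""
    rng = np.random.default_rng(seed)
    params = rng.normal(0.0, 1.0, n_params)     # random initial controller
    best = fitness(params)
    for _ in range(iterations):
        candidate = params + rng.normal(0.0, sigma, n_params)   # random variation
        score = fitness(candidate)
        if score >= best:                                       # retain improvements only
            params, best = candidate, score
    return params
```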

A second option is to use reinforcement learning (Sutton & Barto, 1998), which operates by calculating the estimated utility of possible alternative actions on the basis of received rewards and by using this acquired information to preferentially select the actions with the highest expected utilities.

A third way entails the use of an evolutionary algorithm (Holland, 1975; Nolfi & Floreano, 2000; Schwefel, 1995) that operates on a population of candidate solutions. These solutions are selected based on their relative performance and varied through the production of offspring, i.e., copies modified through genetic operators such as mutation and crossover.

Finally, a fourth option is provided by natural evolution strategies (NES) (Wierstra, Schaul, Peters, & Schmidhuber, 2008; see also Rechenberg & Eigen, 1973; Schwefel, 1977), which operate by sampling the local gradient of the candidate solution (i.e., the correlation between variations of the free parameters and variations of performance) and by modifying the candidate solution toward the most promising directions of the multi-dimensional parameter space. Samples are generated by creating varied copies of the candidate solution.

In our experiments, we used the xNES algorithm (Glasmachers, Schaul, Yi, Wierstra, & Schmidhuber, 2010), an algorithm of the latter type, since, as we will see, it can easily be extended to also incorporate learning-by-demonstration feedback. The xNES algorithm has already been successfully tested on a series of domains, including autonomous robot learning (Glasmachers et al., 2010).

More specifically, the xNES algorithm works as described in Algorithm 1. The notation differs from that of Glasmachers et al. (2010) in order to make all the steps of the algorithm easy to follow.

ind ∼ p(· | θ)    (1)

where p is a Gaussian distribution with parameters θ = (μ, σ). μ represents the mean of the Gaussian distribution and is set to 0.0; σ is its standard deviation and is set to 1.0.

λ = 4 + floor(3 · log(p))    (2)

where p is the number of free parameters and floor denotes rounding down to the nearest integer.

η_covM = 0.5 · max(0.25, 1/λ)    (3)

where λ is the number of offspring.

u_k = max(0, log(λ/2 + 1) − log(k)) / Σ_{j=1..λ} max(0, log(λ/2 + 1) − log(j))    (4)

where λ is the number of offspring and k is the index of the kth offspring after ranking.

2.2 Learning by demonstration

In learning by demonstration, the control policy is learned from examples or demonstrations provided by a teacher (Argall et al., 2009; Schaal, 1997). The objective of the learning process is therefore to reproduce the demonstrated behaviour. In particular, the learning algorithm should enable the discovery of the control mechanisms that, in interaction with the environment, yield the demonstrated behaviour (Billard, Balinon, & Guenter, 2006). The examples can be defined either as the sequences of sensory-motor pairs that are extracted from the teacher's demonstration (offline demonstration learning) (Hayes & Demiris, 1994), or as the sequences of desired action specifications that the teacher provides to the learning robot while it is acting (online demonstration learning) (Rosenstein & Barto, 2004).

In general terms, the availability of the demonstration reduces the complexity of the learning task and enables the use of potentially more powerful supervised learning techniques. On the other hand, the behaviour demonstrated by the teacher might be impossible to learn from the robot's point of view (i.e., the demonstrator might be unable to identify behaviours learnable by the robot). Besides, the acquisition of the ability to produce a behaviour that is very similar, but not identical, to the demonstration does not guarantee good performance, since small differences might cumulate over time, leading to significant divergences.

In our experiment, we generated the demonstrations automatically through a demonstrator robot provided with a hand-designed controller. This allowed us to generate a large number of demonstrations and to use exactly the same kind of demonstration in all experiments. The hand-designed controller of the demonstrator: (1) moves the robot straight at the highest possible speed when the infrared sensors are not activated, (2) makes the robot turn left or right when the two right or the two left infrared sensors are activated, respectively, and (3) makes the robot turn left when both the two left and the two right infrared sensors are activated.

This is achieved by setting the desired speed of the two wheels to the highest positive value by default and by proportionally reducing the speed of one wheel according to the average activation of the two opposite infrared sensors. It is worth noting that the optimal relation between the activation of the infrared sensors and the speed of the wheels, with respect to the ability of the robot to explore the arena, is not necessarily proportional and homogeneous for the two sensors located on either side. Nevertheless, since there is no clear way to determine this relation, we used a simple proportional and homogeneous relation.
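The rules above can be sketched as follows. The exact tie-breaking used by the demonstrator when both sides are activated is not specified beyond 'turn left', so the handling of that case here is an assumption of ours.

```python
def demonstrator_action(ir_left, ir_right, v_max=10.4):
    """Hand-designed exploration demonstrator (reconstruction of rules (1)-(3) above).

    ir_left, ir_right: average activation, in [0.0, 1.0], of the two left and the two right
    front infrared sensors. Each wheel defaults to the maximum forward speed and is slowed
    proportionally to the activation of the opposite pair of sensors, so the robot turns
    away from the detected obstacle.
    """
    left_wheel = v_max * (1.0 - ir_right)    # obstacle on the right -> slow left wheel -> turn left
    right_wheel = v_max * (1.0 - ir_left)    # obstacle on the left -> slow right wheel -> turn right
    if ir_left > 0.0 and ir_right > 0.0:
        # rule (3): with obstacles on both sides the demonstrator always turns left;
        # forcing the left turn this way is our assumption, not a detail given in the text
        left_wheel = v_max * (1.0 - max(ir_left, ir_right))
        right_wheel = v_max
    return left_wheel, right_wheel
```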

The fact that certain characteristics are hard to optimize from the point of view of an external designer is one of the reasons why combining learning by demonstration and experience can be beneficial. Indeed, as we will see, the integrated learning by experience and demonstration (E&D) algorithm can find solutions that are better optimized in this respect.

In the offline demonstration mode, the sequence of sensory-motor pairs experienced by the demonstrator robot, while it operates in the environment for several trials, is used to train the neural network controller of the learning robot.

Algorithm 1: the xNES algorithm.

initialize:
  the candidate solution (ind) by selecting the best over 100 parameter sets generated by using Equation 1
  the number of offspring (λ), see Equation 2
  the covariance matrix (covMat) with zeroed values. The covariance matrix encodes the correlation between the free parameters of high-utility samples. The utility of a sample is its relative performance advantage with respect to the other generated samples.
  the learning rate used to update the individual (η_ind). We use η_ind = 1.0.
  the update rate of the covariance matrix (η_covM), see Equation 3

repeat
  for k = 1 ... λ do
    generate random variation vectors with a Gaussian distribution
      s_k ∼ p(· | θ)
    generate samples by adding the product of the variation vectors and the exponential of the covariance matrix to the candidate solution
      z_k = ind + e^covMat · s_k
    evaluate the fitness of the samples
      f_k = F(z_k)
  end
  sort the z_k with respect to their performance f_k
    ranking = z_1, ..., z_λ with F(z_1) ≥ ... ≥ F(z_λ)
  calculate the utility u_k of the samples, see Equation 4
  sum up the variations weighted by their utilities
    ∇_ind = Σ u_k · s_k
  calculate the estimated local gradient of the covariance matrix
    ∇_covM = Σ u_k · (s_k · s_k^T − I), where I is the identity matrix and the superscript T denotes matrix transposition
  move the candidate solution along the local gradient with the given learning rate
    ind = ind + η_ind · e^covMat · ∇_ind
  update the covariance matrix based on the estimated local gradient with the given update rate
    covMat = covMat + η_covM · ∇_covM
for 500 iterations
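To make the pseudo-code above concrete, the following Python sketch implements the same loop. It is our own transcription, not the FARSA implementation; the fitness function is assumed to evaluate a parameter vector on the task (e.g., 25 trials of the exploration experiment) and return a scalar score to be maximized.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def xnes(fitness, n_params, iterations=500, eta_ind=1.0, seed=0):
    """Sketch of Algorithm 1 (xNES) following the pseudo-code above."""
    rng = np.random.default_rng(seed)
    # initialization: best of 100 random candidates drawn from N(0, 1) (Equation 1)
    candidates = rng.normal(0.0, 1.0, size=(100, n_params))
    ind = max(candidates, key=fitness)
    lam = 4 + int(np.floor(3 * np.log(n_params)))            # Equation 2
    eta_cov = 0.5 * max(0.25, 1.0 / lam)                     # Equation 3
    cov = np.zeros((n_params, n_params))                     # covariance matrix (log form)
    # rank-based utilities (Equation 4); computed once since they depend only on the rank
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(np.arange(1, lam + 1)))
    util = raw / raw.sum()

    for _ in range(iterations):
        s = rng.normal(0.0, 1.0, size=(lam, n_params))       # variation vectors
        A = expm(cov)                                         # exponential of the covariance matrix
        z = ind + s @ A.T                                     # samples z_k = ind + A s_k
        f = np.array([fitness(zk) for zk in z])
        order = np.argsort(-f)                                # best first
        s_sorted = s[order]
        grad_ind = util @ s_sorted                            # utility-weighted sum of variations
        grad_cov = sum(u * (np.outer(sk, sk) - np.eye(n_params))
                       for u, sk in zip(util, s_sorted))      # estimated local gradient
        ind = ind + eta_ind * A @ grad_ind
        cov = cov + eta_cov * grad_cov
    return ind
```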


The training is performed by using the demonstrations as a training set and learning the weights of the robot's neural controller through back-propagation (Rumelhart, Hinton, & Williams, 1986). More specifically, the sensory states experienced by the demonstrator robot and the motor actions it produces every 50 ms are used to set the sensory state and the teaching input of the learning robot. Conversely, in the online demonstration mode, the learning robot is situated in the environment and operates on the basis of its own motor neurons. To generate the teaching input, the demonstrator robot is virtually placed in the same location (i.e., position and orientation) as the learning robot at each time step. The output actions produced by the demonstrator robot in this situation are used to set the teaching input of the learning robot.

In both cases, a learning rate of 0.2 was used and the learning process was continued for 1.875 × 10^7 time steps of 50 ms. This number was chosen in order to keep the overall period of evaluation constant with respect to the previous algorithm.
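In code, the offline mode reduces to a plain supervised loop over the recorded pairs. The network interface (a backprop(input, target, lr) update) is a hypothetical placeholder for whatever implementation is used.

```python
def train_offline_by_demonstration(net, demo_sensors, demo_actions,
                                   learning_rate=0.2, steps=18_750_000):
    """Offline demonstration learning: cycle through the recorded sensory-motor pairs.

    demo_sensors / demo_actions: sensory states and motor actions recorded from the
    demonstrator robot every 50 ms; each pair provides the input and the teaching input
    of the learning robot's network for one back-propagation update.
    """
    n = len(demo_sensors)
    for step in range(steps):
        i = step % n
        net.backprop(demo_sensors[i], demo_actions[i], lr=learning_rate)
    return net
```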

2.3 Integrated learning by experience and by demonstration

Here we propose a new algorithm that enables the realization of an integrated learning by E&D. The algorithm is designed so that the two learning modes operate simultaneously, driving the learning process toward solutions that are functionally effective and similar to the demonstrated behaviour. The combination of the effects of the two learning modes is not realized at the level of the actions, as in the case of Rosenstein and Barto (2004), but rather at the level of the variations of the free parameters. In other words, the algorithm operates by combining variations that are expected to produce progress in objective performance with variations that are likely to increase the similarity with the demonstrated behaviour. As a whole, this implies that the algorithm tries to steer the learning by experience process toward solutions that resemble the demonstrated behaviour.

To obtain such an integrated algorithm, we designed an extended version of the xNES algorithm described in section 2.1, which operates by iteratively estimating and exploiting both the local performance landscape and the demonstration landscape. This is achieved by training the current candidate solution for a limited time (i.e., 25 trials, as in the case of the learning by experience algorithm) on the basis of the demonstrated behaviour. More precisely, the integrated algorithm works as described in Algorithm 2.

Readers might replicate these experiments and the experiments described in section 4 by downloading and installing FARSA from https://sourceforge.net/projects/farsa/ and using the experimental plugins named KheperaExplorationExperiment and RobotBallApproachExperiment, respectively.

3. Results for the exploration experiment

In this section, we report the results obtained in three series of experiments performed by using the three algorithms described above. For each experiment, we ran 10 replications starting with different randomly generated candidate solutions.

The performance displayed by the best robots at the end of the training process (Figure 2) shows that the robots trained with the integrated learning by E&D algorithm outperform both the robots trained with the learning by demonstration (D) algorithm (Mann–Whitney test, df = 10, p < .01) and the robots trained with the learning by experience (E) algorithm (Mann–Whitney test, df = 10, p < .01).

By analysing the behaviour displayed by the robots trained through the three learning algorithms (Figure 3) and the percentage of times these robots avoid obstacles by turning toward their preferred direction (Table 1), we can see that, as expected, the robots trained in the D and E&D conditions display a behaviour similar to the demonstration. The robots trained in the E condition, instead, exhibit a behaviour that differs qualitatively from the demonstration. In particular, they avoid the obstacles by always, or almost always, turning in the same direction. This qualitative difference plays a functional role with respect to the exploration task: indeed, avoiding obstacles by mostly turning in the same direction increases the probability of ending up in previously explored locations of the environment, compared with avoiding obstacles by turning in both directions.

The fact that the robots trained in the E experimental condition initially develop the ability to avoid obstacles by always turning in the same direction is not surprising. This simple strategy enables the robots to improve their performance with respect to the initial phase in which they are still unable to avoid obstacles. Yet, the acquisition of this strategy and its further refinement then trap the learning process in a local minimum, i.e., a situation in which it is not possible to change the obstacle avoidance strategy without impacting negatively on performance.

The integrated learning by E&D algorithm overcomes this local minimum problem thanks to its ability to drive the learning process toward solutions that are similar to the demonstration, i.e., toward solutions that avoid obstacles by turning either left or right, depending on the relative position of the obstacle.

The behaviour of the robot trained in the E&D condition is similar to the demonstrated behaviour concerning the turning direction, but not with regard to the trajectory followed to avoid obstacles (Figure 3).


This trajectory depends on the relative reduction of the speed of the left or right wheel when the right or left infrared sensors are activated. In the case of the demonstrator robot, this reduction is proportional to the activation of the two left or of the two right infrared sensors. However, as mentioned above, this relation is not optimized. It was chosen simply because we, as the designers of the demonstrator, were unable to identify the best relation. The robots trained in the E&D experimental condition perform better than the demonstrator in this respect (i.e., they avoid the obstacles with trajectories that maximize the exploration ability). Consequently, not only do they outperform the robots trained in the E condition, but they also display a more effective behaviour than the robots trained in the D condition (Figure 2).

Overall, this implies that the E&D algorithm leads the learning robots toward solutions that incorporate the functionally effective characteristics of the demonstrated behaviour, while deviating from the characteristics of the demonstrated behaviour that are not functional.

4. Spatial positioning with heading experiment

In this section we report the results of a series of experiments carried out in a more complex scenario in which a robot should reach a 2D planar target position with a given target orientation (i.e., a position and orientation from which a red cylinder could be pushed toward a blue cylinder by simply moving forward; Figure 4).

Algorithm 2: the integrated learning by experience and demonstration algorithm.

initialize:
  the candidate solution (ind) by selecting the best over 100 parameter sets generated by using Equation 1
  the number of offspring (λ), see Equation 2
  the covariance matrix (covMat) with zeroed values
  the learning rate used to vary the individual (η_ind). We use η_ind = 1.0.
  the update rate of the covariance matrix (η_covM), see Equation 3

repeat
  for k = 1 ... λ do
    generate random variation vectors with a Gaussian distribution
      s_k ∼ p(· | θ)
    generate a special sample <Sd> by training the candidate solution ind for 25 trials
      for 25 trials do
        w = w + backprop(ind)
      end
      <Sd> = ind + w
    where the update of the free parameters (w) is computed by means of the back-propagation algorithm; in particular, the batch mode described in section 2.2 has been used
    generate samples by adding the product of the variation vectors and the exponential of the covariance matrix to the candidate solution
      z_k = ind + e^covMat · s_k
    evaluate the fitness of the samples
      f_k = F(z_k)
  end
  sort the z_k with respect to their performance f_k
    ranking = z_1, ..., z_λ with F(z_1) ≥ ... ≥ F(z_λ)
  calculate the utility u_k of the samples, see Equation 4; <Sd> is always ranked second (the aim is to take into account both the performance and the demonstration feedback)
  sum up the variations weighted by their utilities
    ∇_ind = Σ u_k · s_k
  calculate the estimated local gradient of the covariance matrix
    ∇_covM = Σ u_k · (s_k · s_k^T − I), where I is the identity matrix and the superscript T denotes matrix transposition
  move the candidate solution along the local gradient with the given learning rate
    ind = ind + η_ind · e^covMat · ∇_ind
  update the covariance matrix based on the estimated local gradient with the given update rate
    covMat = covMat + η_covM · ∇_covM
for 500 iterations
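One possible transcription of a single iteration of Algorithm 2 is sketched below, in the same style as the xNES code in section 2.1. The helper backprop_update(ind) is assumed to return the parameter change w produced by the demonstration training; expressing <Sd> as a variation vector (by solving A · s_d = w) so that it can enter the utility-weighted sums, and letting it take the place of one ordinary sample, are our own choices; the paper does not spell out these details.

```python
import numpy as np
from scipy.linalg import expm

def exd_iteration(ind, cov, fitness, backprop_update, util, eta_ind=1.0, eta_cov=None, rng=None):
    """One iteration of the integrated E&D algorithm (sketch of Algorithm 2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, lam = ind.size, len(util)
    if eta_cov is None:
        eta_cov = 0.5 * max(0.25, 1.0 / lam)                 # Equation 3
    A = expm(cov)                                            # exponential of the covariance matrix
    s = rng.normal(0.0, 1.0, size=(lam - 1, n))              # ordinary variation vectors
    w = backprop_update(ind)                                 # demonstration-driven parameter change
    s_d = np.linalg.solve(A, w)                              # variation such that ind + A s_d = <Sd> = ind + w
    s_all = np.vstack([s, s_d])
    z = ind + s_all @ A.T                                    # samples, <Sd> last
    f = np.array([fitness(zk) for zk in z])                  # <Sd> is evaluated but its rank is fixed
    order = list(np.argsort(-f[:-1]))                        # rank ordinary samples by fitness (best first)
    order.insert(1, lam - 1)                                 # ...and force <Sd> into second place
    s_sorted = s_all[np.array(order)]
    grad_ind = util @ s_sorted                               # utility-weighted sum of variations
    grad_cov = sum(u * (np.outer(sk, sk) - np.eye(n)) for u, sk in zip(util, s_sorted))
    ind = ind + eta_ind * A @ grad_ind
    cov = cov + eta_cov * grad_cov
    return ind, cov
```

The rank-based utilities util can be computed exactly as in the xNES sketch (Equation 4); the only difference with respect to plain xNES is the demonstration-derived sample and its fixed second rank.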


Figure 2. Boxplots of the performance obtained in the learning by experience (E), learning by demonstration (D) and integrated learning by experience and demonstration (E&D) experimental conditions. Each boxplot displays the performance obtained by post-evaluating the best 10 robots of each replication for 500 trials in environments including five obstacles. Boxes represent the inter-quartile range of the data, while the horizontal lines inside the boxes mark the median values. The whiskers extend to the most extreme data points within 1.5 times the inter-quartile range from the box. For the D condition, we only report the results obtained in the offline mode, which yielded better results than the online mode.

Figure 3. Trajectories displayed by the best robots of the three experimental conditions and by the demonstrator robot during a typical trial. To ensure that the trials shown are representative, they have been selected among the trials that produced a performance similar to the average performance obtained in each condition. The blue circle indicates the position of the robot at the end of the trial. The red circles indicate the positions of the obstacles.


More specifically, the experiment concerns a simulated marXbot robot (Bonani et al., 2010) placed in an arena containing a red and a blue cylindrical object. We used a different robotic platform in this second set of experiments since the marXbot is provided with an omni-directional colour camera that enables it to detect the relative position of the two cylinders.

The robot is equipped with 24 infrared sensors evenly spaced around the robot body, an omni-directional camera, and two motors controlling the desired speed of the two wheels. The camera image is pre-processed by calculating the fraction of red and blue pixels present in each of the 12 30° sectors of a circular stripe portion of the image.
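A sketch of this pre-processing step; the pixel representation used here is our own assumption (FARSA performs the equivalent operation internally).

```python
import numpy as np

def camera_sectors(pixel_angles_deg, pixel_colors, n_sectors=12):
    """Fraction of red and blue pixels in each 30-degree sector of the circular stripe.

    pixel_angles_deg: angle (0-360 degrees) of each pixel of the stripe around the robot.
    pixel_colors: matching labels, e.g. 'red', 'blue' or anything else.
    Returns two arrays of length 12, feeding the 24 colour-related sensory neurons.
    """
    red = np.zeros(n_sectors)
    blue = np.zeros(n_sectors)
    total = np.zeros(n_sectors)
    for angle, color in zip(pixel_angles_deg, pixel_colors):
        sector = int(angle // (360.0 / n_sectors)) % n_sectors
        total[sector] += 1
        if color == 'red':
            red[sector] += 1
        elif color == 'blue':
            blue[sector] += 1
    total = np.maximum(total, 1)          # avoid division by zero in empty sectors
    return red / total, blue / total
```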

The robot's controller consists of a three-layer feed-forward neural network with 32 sensory neurons, six internal neurons and two motor neurons. In particular, eight sensory neurons encode the average activation state of eight groups of three adjacent infrared sensors, and 24 sensory neurons encode the percentage of red and blue pixels detected in the 12 sectors of the perceived image. The two motor neurons encode the desired speed of the two wheels. The activation of the sensors is normalized in the range [0.0, 1.0]. The activation of internal and motor neurons is computed on the basis of the logistic function. The activation of the motor neurons, i.e., the desired speed of the two wheels, is normalized in the range [−27, 27] cm/s. Internal and motor neurons are provided with biases.

The environment consists of a flat arena of 5.0 × 5.0 m containing a red and a blue cylinder with diameters of 12 and 17 cm, respectively, randomly placed within the 3 × 3 rectangular portions of the arena shown in Figure 4, right. In particular, the robot is initially placed within one of three possible rectangular portions (Figure 4, right: circles in the bottom rectangular areas).

Figure 4. Spatial positioning with heading experiment. Left: the robot and the environment. The black arrow indicates the relative position and orientation that the robot should reach (i.e., the robot should reach the target with a relative position and orientation that would enable it to push the red object toward the blue object). Right: exemplification of how the positions of both the robot and the red cylinder vary among trials. The top circle indicates the blue cylinder. The intermediate and bottom rectangles indicate the areas in which the red cylinder and the robot are placed, respectively. The exact positions of the red cylinder and of the robot are randomly chosen inside the selected rectangular area.

Table 1. Performance and percentage of times the best robots turn in their preferred direction in the three experimental conditions.

                  Condition   Performance   Turns in the preferred direction (%)
Best robot        E           60.65         100.0
                  D           51.36         2.4
                  E&D         61.90         0.8
Best 10 robots    E           55.09         90.94
                  D           44.25         11.72
                  E&D         61.28         0.88

Some trained robots turn preferentially left, others preferentially right. Others turn both left and right, depending on the circumstances; these latter robots do not have a preferred direction and turn left and right with similar probabilities. Top part: data for the best robot of all replications. Bottom part: average results of the 10 best robots of each replication.


Similarly, the red cylinder occupies one of three possible rectangular portions (Figure 4, right: circles in the central rectangular areas). The initial location of the blue cylinder has been kept fixed over all the simulations (Figure 4, right: top circle). As in the case of the previous experiment, the robot and the environment are simulated by using FARSA (Massera et al., 2013).

The performance of the robot is calculated by taking into account the offset between the target position and orientation of the robot and its actual position and orientation when it first touches the cylinder, or at the end of the trial. The target spatial parameters are defined as follows:

- the target position is located on the straight line passing through the centres of the bases of the red and blue objects, outside the segment connecting these two points, on the side of the red object, at a distance equal to the sum of the robot and red cylinder radii;
- the target orientation is given by the relative angle between the red and the blue cylinder.

More precisely, the fitness is calculated by averaging over trials the product of the complement of the position offset and the complement of the angular offset, both normalized in the range [0.0, 1.0]:

Fitness = ( Σ_{t=1..nt} (1 − ΔP_t) · (1 − ΔO_t) ) / nt

where ΔP_t is the position offset normalized in the range [0.0, 1.0], ΔO_t is the angular offset normalized in the range [0.0, 1.0], t is the trial index, and nt is the number of trials.
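In code the fitness reduces to a short average; the offsets are assumed to be already normalized as described above.

```python
def positioning_fitness(position_offsets, angle_offsets):
    """Average over trials of (1 - position offset) * (1 - angular offset).

    Both offsets are measured when the robot first touches the red cylinder (or at the end
    of the trial) and are normalized in [0.0, 1.0].
    """
    per_trial = [(1.0 - dp) * (1.0 - do) for dp, do in zip(position_offsets, angle_offsets)]
    return sum(per_trial) / len(per_trial)
```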

To select robots able to carry out the task in varying environmental conditions, we evaluated each robot for nine trials, each lasting up to 50 s (trials end either when the robot touches the red cylinder, or after 50 s). In the case of the integrated learning by E&D algorithm, the back-propagation training has also been carried out for nine trials during each iteration. At the beginning of each trial, the red cylinder and the robot are placed in random locations selected within one of the 3 × 3 rectangular areas shown in Figure 4 (right), which are at a distance of approximately 0.75 and 1.5 m from the blue cylinder, respectively.

As for the previous experiment, a demonstrator robot that operates on the basis of a hand-designed controller is used to generate the demonstration data. The controller works by calculating, at each time step, a destination point situated along the straight line (l) connecting the red and blue cylinders. The distance between the destination point and the red cylinder is directly proportional to the offset between the current robot position and l:

distance = K · d_l

where K is a constant set to 2.5 and d_l is the distance in metres between the robot and the line l. This is because the time required for the robot to reach the target location depends on the initial angular offset between the robot orientation and the target orientation.

In particular, the coordinates of the destination point on the planar surface are calculated according to the following equation:

destination = (distance · cos(α), distance · sin(α))

where α represents the relative angle between the red and the blue cylinder.

The action to be performed by the demonstrator robot at each time step is selected in order to reduce both the distance between the robot and the destination point, and the offset between the current and the target orientation of the robot. In more detail, the desired speeds of the robot's wheels are set on the basis of the following equations:

speedLeftWheel =
  vmax/2                                          if −180 ≤ θ ≤ 0
  (vmin/2) · (θ/180) + (vmax/2) · (1 − θ/180)     if 0 < θ ≤ 180

speedRightWheel =
  (vmax/2) · (θ/180) − (vmin/2) · (1 + θ/180)     if −180 ≤ θ < 0
  vmax/2                                          if 0 ≤ θ ≤ 180

where θ is the turning angle normalized in the range [−180°, 180°]. θ is positive or negative depending on whether the destination point is located to the right or to the left of the robot's front.
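A direct transcription of these equations, assuming vmax = 27 and vmin = −27 cm/s (i.e., the motor range given above):

```python
def demonstrator_wheel_speeds(theta, v_max=27.0, v_min=-27.0):
    """Wheel speeds of the hand-designed demonstrator for the positioning task.

    theta: turning angle toward the destination point, normalized in [-180, 180] degrees,
    positive when the destination lies to the right of the robot's front.
    """
    if theta <= 0:                                   # destination on the left
        left = v_max / 2.0
        right = (v_max / 2.0) * (theta / 180.0) - (v_min / 2.0) * (1.0 + theta / 180.0)
    else:                                            # destination on the right
        left = (v_min / 2.0) * (theta / 180.0) + (v_max / 2.0) * (1.0 - theta / 180.0)
        right = v_max / 2.0
    return left, right
```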

This strategy enables the demonstrator robot to solve the task with relatively good performance. However, as for the previous experiment, the demonstration strategy is non-optimal. Indeed, in certain cases the robot might reach the destination point with a partially wrong orientation. Furthermore, in other cases it might spend an unnecessary amount of energy and time to reach the target position and orientation. Besides, we should consider that the learning robot might be unable to produce behaviours that match the demonstrations perfectly, or almost perfectly, as a consequence of the limited precision of its sensory system. The control system of the demonstrator robot is not affected by this limitation, since it operates on the basis of precise distance and angular measures extracted from the robot/environmental simulation.


The 212 connection weights and biases of the neural network are initially set randomly and then learned by means of the algorithms described above.

5. Results for the spatial positioning with heading experiment

Here we report the results obtained in three series of experiments in which the robots have been trained by using the algorithms described in sections 2.1–2.3. Each experiment was replicated 10 times.

By analysing the performance displayed at the end of the training process, we can see that, as for the exploration experiment, the robots trained in the integrated learning by E&D condition outperform the robots trained in the E and D conditions (Mann–Whitney test, df = 10, p < .01). The difference between the E and D conditions is also statistically significant (Mann–Whitney test, df = 10, p < .01). To verify the robots' generalization capabilities, we post-evaluated the best robots for 180 trials during which the distance between the robot and the blue cylinder was systematically varied in the range [62.5, 300] cm (Figure 5). Also in this case, the robots trained in the integrated learning by E&D condition outperform the robots trained in the E and D conditions (Mann–Whitney test, df = 10, p < .01). The robots trained in the E condition outperform the robots trained in the D condition (Mann–Whitney test, df = 10, p < .05). The best E&D robots outperform the hand-designed demonstrator robot as well (Mann–Whitney test, df = 10, p < .01). The fact that the difference in performance between the E&D condition and the D and E conditions is more marked in this second and more complex experiment might indicate that the E&D algorithm scales better with the complexity of the task.

To analyse the similarity between the behaviour displayed by the trained robots and by the hand-designed demonstrator robot, we calculated the summed square error between the actions suggested by the demonstrator and the actions actually produced by the best robots trained in the three experimental conditions (Figure 6). As expected, the difference between the behaviour of the robots and the demonstration is larger in the E condition than in the D and E&D conditions. Surprisingly, the differences with respect to the demonstrated behaviour are greater in the D condition, in which the robots are trained only for the ability to reproduce the demonstrated behaviour, than in the E&D condition, in which the robots are also trained for the ability to maximize performance. This could be explained by the fact that the local minima affecting the outcome of the demonstration learning, which operates by minimizing the offset between the demonstrated and the reproduced actions, play a minor role when the learning process is driven primarily by the attempt to maximize the long-term effect of actions (i.e., the ability to reach the target with the appropriate position and orientation).

Figure 5. Boxplots of the performance displayed by the robots at the end of the training process in the learning by experience (E), learning by demonstration (D) and integrated learning by experience and demonstration (E&D) conditions. Boxes represent the inter-quartile range of the data, while the horizontal lines inside the boxes mark the median values. The whiskers extend to the most extreme data points within 1.5 times the inter-quartile range from the box. Data obtained by post-evaluating the best 10 robots of the 10 replications of each experimental condition for 180 trials. For the D condition, we only report the results obtained in the online mode, which yielded better results than the offline mode.


Figure 7 shows the typical trajectories displayed by the three best robots obtained in the three corresponding experimental conditions.

Not surprisingly, the best D robot displays a behaviour that is qualitatively similar to the demonstration: the robot always approaches the red cylinder from the appropriate direction by producing a single curvilinear trajectory (Figure 7, left). Yet, the curvature of the trajectory of the D robot often differs significantly from the curvature produced by the demonstrator robot, even though the learning error at the end of training is very low (0.0042). This difference can be explained by considering that, as mentioned above, small differences between the produced and the demonstrated behaviour tend to cumulate, producing significant differences over time. The robots obtained in the remaining nine replications of the experiment display qualitatively similar behaviours (results not shown). The fact that the performance of the D robots is relatively low (Figure 5) can thus be explained by considering that the significant differences that originate from the cumulative effect of small differences over time affect both the similarity with respect to the demonstrated behaviour and the performance. In some replications the functional effect of the difference is small, while in other cases it is large (Figure 5). The analysis of the relation between learning error and performance shows that these two measures are only weakly correlated. In other words, the robots with the highest performance are not necessarily those with the lowest residual learning by demonstration error. This can be explained by considering that, in learning by demonstration, performance does not play any role. The minimization of the error with respect to the demonstration, therefore, does not guarantee a maximization of performance.

The behaviour produced by the best E robot is better than the one produced by the best D robot and differs widely from the demonstration (Figures 6 and 7). Indeed, it produces a series of curvilinear trajectories followed by sudden changes of direction until the robot manages to reach a relative position and orientation from which it is able to move consistently toward the red cylinder. Since the number of back-and-forth movements is highly variable and dependent on the circumstances, in some cases the robot is unable to reach the target destination in time. The robots of the other replications of the experiment show qualitatively similar behaviours (results not shown).

The best E&D robots approach the target by producing single curvilinear trajectories that are qualitatively similar to those displayed by the demonstrator robot (Figures 6 and 7).

Figure 6. Average difference between the actions produced by the best robots trained in the three experimental conditions—learning by experience (E), learning by demonstration (D) and integrated learning by experience and demonstration (E&D)—and the actions that would be produced by the demonstrator robot in the same robot/environmental conditions. Data are calculated by averaging the summed square difference between the actual and desired speeds of the two wheels, normalized in the range [0.0, 1.0], during 180 trials.


Furthermore, their behaviours are even more similar to the demonstration than those shown by the D robots (Figure 6). Nonetheless, they manage to control the curvature of the trajectory in a way that enables them to achieve high performance and even to outperform the demonstrator robot (the average performance of the demonstrator robot calculated over 500 trials is 0.79). Apparently, this is achieved by producing longer trajectories than the demonstrator, which allow the robot to reach the target destination with an orientation that better matches the target orientation. Overall, this implies that, also in this case, the E&D robots manage to find solutions that are qualitatively similar to the demonstrated behaviour, but deviate from the demonstration with respect to the aspects that are functionally sub-optimal.

6. Discussion and conclusion

Standard learning methods rely on a single type of learning feedback and simply discard all other relevant information. For example, reinforcement learning methods are based on reward signals that indicate how well the agent is performing, but neglect other relevant cues such as: (1) perceived states that provide indications of why the current behaviour had only partial success (e.g., the kicked ball went too far or too much to the left with respect to the target), (2) advice provided by other agents (e.g., 'more slowly'), and (3) demonstrations of effective behaviours. Since the quality of the training feedback strongly affects the outcome of the learning process, the development of new learning methods capable of exploiting richer training feedback can potentially lead to significantly more powerful training techniques.

Although this research direction has started to be explored in recent years, how different forms of training feedback can be integrated effectively remains largely an open question.

As we stated in the introduction, in order to appreciate the complexity of the problem, we should consider that different learning modes (e.g., learning by experience and learning by demonstration) often drive the learning agents toward different outcomes and that their effects are not necessarily additive. This entails that identifying an effective way to combine heterogeneous learning modes is far from trivial.

In this paper we proposed a new method that combines learning by demonstration and learning by experience feedback. The algorithm is shaped in a way that ensures that the learning process is constantly driven by both an objective performance measure, which estimates the overall ability of the robot to perform the task, and a learning by demonstration feedback used to steer the learning process toward solutions similar to the demonstration.
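As an illustration of this kind of combination, the sketch below shows one way, not necessarily the paper's exact formulation, in which both feedbacks could jointly score candidate controllers within a simple evolutionary strategy: each candidate is rated by its task performance minus a weighted imitation error. The two evaluation functions and the weight alpha are placeholders; the point is only that a single ranking can reflect both feedbacks at once.

```python
# Illustrative sketch, not the paper's exact algorithm: candidates in a simple
# evolutionary strategy are scored by task performance minus a weighted
# imitation error, so both feedbacks drive the search. The two evaluation
# functions and the weight `alpha` are placeholders.
import random

def evaluate_performance(weights):
    """Placeholder for the objective scalar measure (e.g., success over a set of trials)."""
    return random.random()

def imitation_error(weights):
    """Placeholder for the mean squared difference from the demonstrated actions."""
    return random.random() * 0.1

def combined_score(weights, alpha=0.5):
    return evaluate_performance(weights) - alpha * imitation_error(weights)

# One generation of a (mu, lambda)-style update on a vector of connection weights.
random.seed(0)
mu, lam, sigma = 5, 20, 0.1
parent = [0.0] * 10
offspring = [[w + random.gauss(0.0, sigma) for w in parent] for _ in range(lam)]
best = sorted(offspring, key=combined_score, reverse=True)[:mu]
parent = [sum(ws) / mu for ws in zip(*best)]   # recombine the mu best candidates
```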

Figure 7. Trajectories displayed by the best robots obtained in the learning by experience (top-right), learning by demonstration (bottom-left) and learning by experience and demonstration (bottom-right) conditions in a trial in which they start from the same initial position and orientation and in which the red cylinder is placed in the same position. The top-left picture shows the trajectory produced by the demonstrator robot. The red and blue circles represent the positions of the red and blue cylinders.



The analysis of the results obtained by testing this method on two qualitatively different problems indicates that our new integrated algorithm is indeed able to synthesize solutions that are (1) functionally better than those obtainable by using a single mode only, and (2) qualitatively similar to the demonstrated behaviour. Moreover, the integrated learning algorithm succeeds in synthesizing solutions that incorporate the aspects of the demonstration that are functionally useful, while deviating with respect to other, non-functional aspects.

In future work we plan to investigate the efficacy of the proposed method in tasks involving significantly larger search spaces (i.e., a greater number of parameters to be set), which are particularly hard to tackle through learning by experience only.


About the Authors

Paolo Pagliuca received an MSc in engineering from the University of Roma Tre, Italy, and is a PhD student at the CNR node of the University of Plymouth in Roma. His main research interests are within the domain of evolutionary robotics, adaptive behaviour and the combination of learning and evolution in autonomous robots.

Stefano Nolfi (http://laral.istc.cnr.it/nolfi/) is a research director of the Institute of Cognitive Sciences and Technologies of the Italian National Research Council and head of the Laboratory of Autonomous Robots and Artificial Life (http://laral.istc.cnr.it/). Stefano conducted pioneering research in artificial life and is one of the founders of Evolutionary Robotics. His main research interest is the study of how embodied and situated agents can develop behavioural and cognitive skills autonomously by adapting to their task/environment. Stefano has authored and co-authored more than 150 peer-reviewed scientific publications.
