Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas (SoloGen)

Investigations on Automatic Behavior-based System Design + [A Survey on] Hierarchical Reinforcement Learning
Amir massoud Farahmand
Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas
www.SoloGen.net
[email protected]



TRANSCRIPT

Page 1: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Investigations on Automatic Behavior-based System Design
+ [A Survey on] Hierarchical Reinforcement Learning

Amir massoud Farahmand
Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas
www.SoloGen.net
[email protected]

Page 2: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[a non-uniform] Outline

• Brief History of AI
• Challenges and Requirements of Robotic Applications
• Behavior-based Approach to AI
• The Problem of Behavior-based System Design
• MDP and Standard Reinforcement Learning Framework
• A Survey on Hierarchical Reinforcement Learning
• Behavior-based System Design
• Learning in BBS
  – Structure Learning
  – Behavior Learning
• Behavior Evolution and Hierarchy Learning in Behavior-based Systems

Page 3: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Happy birthday to Artificial Intelligence

• 1941 Konrad Zuse, Germany, general purpose computer

• 1943 Britain (Turing and others): Colossus, for decoding

• 1945 ENIAC, US. John von Neumann a consultant

• 1956 The Logic Theorist on JOHNNIAC (Newell, Shaw and Simon)

• 1956 Dartmouth Conference organized by John McCarthy (inventor of LISP)

• The term Artificial Intelligence was coined at Dartmouth, intended as a two-month, ten-man study!

Page 4: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Happy birthday to AI (2)

"It is not my aim to surprise or shock you, but the simplest way I can summarize is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until…"

(Herb Simon 1957)

Unfortunately, Simon was too optimistic!

Page 5: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

What has AI done for us?

• Rather good OCR (Optical Character Recognition) and speech recognition software
• Robots make cars in all advanced countries
• Reasonable machine translation is available for a large range of foreign web pages
• Systems land 200-ton jumbo jets unaided every few minutes
• Search systems like Google are not perfect but are very effective information retrieval
• Computer games and auto-generated cartoons are advancing at an astonishing rate and have huge markets
• Deep Blue beat Kasparov in 1997. The world Go champion is a computer.
• Medical expert systems can outperform doctors in many areas of diagnosis (but we aren't allowed to find out easily!)

Page 6: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

AI: What is it?

• What is AI?
• Different definitions:

– The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden)

– The study of intelligence independent of its embodiment in humans, animals or machines (McCarthy)

– AI is the study of how to do things which at the moment people do better (Rich & Knight)

– AI is the science of making machines do things that would require intelligence if done by men. (Minsky) (fast arithmetic?)

• Is it definable?!
• Turing test, Weak and Strong AI, and …

Page 7: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

AI: Basic assumption

• Symbol System Hypothesis: it is possible to construct a universal symbol system that thinks

• Strong Symbol System Hypothesis: the only way a system can think is through symbolic processing

• Happy birthday Symbolic (Traditional – Good old-fashioned) AI

Page 8: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Symbolic AI: Methods

• Knowledge representation (Abstraction)

• Search

• Logic and deduction

• Planning

• Learning

Page 9: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Symbolic AI: Was it efficient?

• Chess [OK!]

• Block-worlds [OK!]

• Daily Life Problems
  – Robots [~OK!]
  – Commonsense [~OK!]
  – … [~OK]

Page 10: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Symbolic AI and Robotics

• Functional decomposition
• Sequential flow
• Correct perception is assumed to be done by vision research, in "a-good-and-happy-will-come-day"!
• Get a logic-based or formal description of percepts
• Apply search operators or logical inference or planning operators

[Figure: the functional (sense-plan-act) pipeline: sensors, perception, world modelling, planning, task execution, motor control, actuators]

Page 11: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Challenges and Requirements of Robotic Systems

Challenges
• Sensor and Effector Uncertainty
• Partial Observability
• Non-Stationarity

Requirements (among many others)
• Multi-goal
• Robustness
• Multiple Sensors
• Scalability
• Automatic design
• [Adaptation (Learning/Evolution)]

Page 12: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior-based approach to AI

• Behavioral (activity) decomposition [against functional decomposition]

• Behavior: Sensor->Action (Direct link between perception and action)

• Situatedness
• Embodiment
• Intelligence as Emergence of …

Page 13: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavioral decomposition

[Figure: behavioral decomposition: sensors and actuators connected through parallel behaviors: build maps, explore, avoid obstacles, locomote, manipulate the world]

Page 14: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Situatedness

• No world modelling and abstraction
• No planning
• No sequence of operations on symbols
• Direct link between sensors and actions
• Motto: The world is its own best model

Page 15: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Embodiment

• Only an embodied agent is validated as one that can deal with the real world.

• Only through a physical grounding can any internal symbolic system be given meaning

Page 16: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Emergence as a Route to Intelligence

• Emergence: interaction of some simple systems which results in something more than the sum of those systems

• Intelligence as emergent outcome of dynamical interaction of behaviors with the world

Page 17: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior-based design

• Robust
  – not sensitive to the failure of a particular part of the system

– no need for precise perception as there is no modelling there

• Reactive: Fast response as there is no long route from perception to action

• No representation

Page 18: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

A Simple problem

• Goal: make a mobile robot controller that collects balls from the field and moves them home
• What we have:
  – Differentially controlled mobile robot
  – 8 sonar sensors
  – Vision system that detects balls and home

Page 19: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Basic design

move toward ball

move toward home

exploration

avoid obstacles

Page 20: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

A Simple Shot

Page 21: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

How should we DESIGN a behavior-based system?!

Page 22: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior-based System Design Methodologies

• Hand Design
  – Common almost everywhere.
  – Complicated: may even be infeasible in complex problems
  – Even if it is possible to find a working system, it is probably not optimal.
• Evolution
  – Good solutions can be found
  – Biologically plausible
  – Time consuming
  – Not fast at producing new solutions
• Learning
  – Biologically plausible
  – Learning is essential for life-time survival of the agent.

Page 23: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

The Importance of Adaptation (Learning/Evolution)

• Unknown environment/body
  – [exact] Model of the environment/body is not known
• Non-stationary environment/body
  – Changing environments (offices, houses, streets, and almost everywhere)
  – Aging
  – [cannot be remedied with evolution very easily]
• The designer may not know how to benefit from every aspect of her agent/environment
  – Let the agent learn it by itself (learning as optimization)
• etc …

Page 24: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Different Learning Methods

[Figure: taxonomy of learning methods. Supervised: Neural Networks (MLFF, RBF), Decision Trees, Bayesian Classifier, SVM. Reinforcement. Unsupervised: Neural Networks (Self-Organizing Feature Map, Associative Memories), Different Clustering Methods.]

Page 25: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Reinforcement Learning

• Agent senses state of the environment

• Agent chooses an action

• Agent receives reward from an internal/external critic

• Agent learns to maximize its received rewards through time.
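This sense-act-reward loop can be made concrete with a minimal tabular Q-learning sketch (my illustration, not from the slides); the `env` object with `reset()`, `step()`, and `actions()` methods is an assumed interface:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning: learn action values from the reward signal alone.
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)           # act and receive reward from the critic
            # one-step temporal-difference update toward the greedy target
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q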

Page 26: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Reinforcement Learning

• Inspired by psychology
  – Thorndike, Skinner, Hull, Pavlov, …
• Very successful applications
  – Games (Backgammon)
  – Control
  – Robotics
  – Elevator Scheduling
  – …
• Well-defined mathematical formulation
  – Markov Decision Problems

Page 27: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Markov Decision Problems

• Markov Process: Formulating a wide range of dynamical systems

• Finding an optimal solution of an objective function

• [Stochastic] Dynamic Programming
• Planning: known environment
• Learning: unknown environment

Page 28: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MDP
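The formal content of this slide did not survive the transcript; what follows is the standard MDP definition it presumably showed, stated here for completeness:

An MDP is a tuple $(S, A, P, R, \gamma)$: states $S$, actions $A$, transition probabilities $P(s' \mid s, a)$, rewards $R(s, a)$, and a discount factor $\gamma \in [0, 1)$. The goal is a policy $\pi$ maximizing the expected discounted return, and the optimal action-value function satisfies the Bellman optimality equation
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a').$$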

Page 29: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Reinforcement Learning Revisited (1)

• A very important Machine Learning method
• An approximate online solution of MDPs
  – Monte Carlo methods
  – Stochastic Approximation
  – [Function Approximation]

Page 30: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Reinforcement Learning Revisited (2)

• Q-Learning and SARSA are among the most important RL algorithms
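The transcript does not preserve the update rules themselves; for reference, the standard one-step forms of the two algorithms (a well-known fact, not taken from the slides) are:
$$\text{Q-learning:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$$
$$\text{SARSA:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$$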

Page 31: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Some Simple Samples

[Figure: 1D grid world example: map of the environment, learned value function, and policy]

Page 32: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Some Simple Samples

[Figure: 2D grid world example: map, policy, value function, and value function (3D view)]

Page 33: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Some Simple Samples

[Figure: a second 2D grid world example: map, value function, value function (3D view), and policy]

Page 34: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Curses of DP

It is not easy to use DP (and RL) in robotic tasks.
• Curse of Modeling
  – RL solves this problem
• Curse of Dimensionality (e.g. robotic tasks have very large state spaces)
  – Approximating the value function
    • Neural Networks
    • Fuzzy Approximation
  – Hierarchical Reinforcement Learning
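One way to fight the curse of dimensionality named above is to approximate the value function rather than store it in a table. A minimal illustrative sketch (my addition, not from the slides), using linear features and a TD(0) update; `env`, `featurize`, and `sample_action` are assumed interfaces:

import numpy as np

def td0_linear(env, featurize, n_features, episodes=200, alpha=0.05, gamma=0.95):
    # Approximate V(s) ~ w . phi(s) with one-step TD updates under a fixed behavior policy.
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action(s)                 # any fixed/behavior policy
            s2, r, done = env.step(a)
            v = w @ featurize(s)
            v2 = 0.0 if done else w @ featurize(s2)
            w += alpha * (r + gamma * v2 - v) * featurize(s)   # TD(0) gradient step
            s = s2
    return w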

Page 35: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

A Sample of Learning in a Robot

Hajime Kimura, Shigenobu Kobayashi, “Reinforcement Learning using Stochastic Gradient Algorithm and its Application to Robots,” The Transaction of the Institute of Electrical Engineers of Japan, Vol.119, No.8 (1999) (in Japanese!)

Page 36: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hierarchical

Reinforcement Learning

Page 37: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

ATTENTION

Hierarchical reinforcement learning methods are not specially designed for behavior-based systems. Covering them in this presentation at this depth should not be interpreted as a claim that they are strongly related to behavior-based system design.

Page 38: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hierarchical RL (1)

• Use some kind of hierarchy in order to …
  – Learn faster
  – Need fewer values to be updated (smaller storage)
  – Incorporate the designer's a priori knowledge
  – Increase reusability
  – Have a more meaningful structure than a mere Q-table

Page 39: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hierarchical RL (2)

• Is there any unified meaning of hierarchy?

NO!

• Different methods:
  – Temporal abstraction
  – State abstraction
  – Behavioral decomposition
  – …

Page 40: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hierarchical RL (3)

• Feudal Q-Learning [Dayan, Hinton]
• Options [Sutton, Precup, Singh]
• MaxQ [Dietterich]
• HAM [Russell, Parr, Andre]
• ALisp [Andre, Russell]
• HexQ [Hengst]
• Weakly-Coupled MDPs [Bernstein, Dean & Lin, …]
• Structure Learning in SSA [Farahmand, Nili]
• Behavior Learning in SSA [Farahmand, Nili]
• …

Page 41: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Feudal Q-Learning
• Divide each task into a few smaller sub-tasks
• State abstraction method
• Different layers of managers
• Each manager gets orders from its super-manager and gives orders to its sub-managers

[Figure: manager hierarchy: Super-Manager over Manager 1 and Manager 2, each over Sub-Manager 1 and Sub-Manager 2]

Page 42: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Feudal Q-Learning

• Principles of Feudal Q-Learning
  – Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up.
  – Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards (sub-managers do not know the task the super-manager has set the manager) and upwards (a super-manager does not know what choices its manager has made to satisfy its command).
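As an illustration of these two principles (a sketch I am adding, not the authors' code), each manager can be modeled as its own Q-learner whose actions are the sub-tasks it may assign, whose state is its own coarse observation, and whose reward comes only from its immediate superior:

import random
from collections import defaultdict

class Manager:
    # One level of a feudal hierarchy: a tabular Q-learner over sub-tasks.
    def __init__(self, sub_tasks, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)
        self.sub_tasks = sub_tasks              # actions = orders this manager can give
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, coarse_state):
        # Information hiding: decisions use only this manager's own coarse state.
        if random.random() < self.epsilon:
            return random.choice(self.sub_tasks)
        return max(self.sub_tasks, key=lambda t: self.Q[(coarse_state, t)])

    def update(self, coarse_state, sub_task, reward_from_boss, next_coarse_state):
        # Reward hiding: the only reward used here is the one given by the superior
        # for obeying its command, not the environment's global reward.
        best_next = max(self.Q[(next_coarse_state, t)] for t in self.sub_tasks)
        td = reward_from_boss + self.gamma * best_next - self.Q[(coarse_state, sub_task)]
        self.Q[(coarse_state, sub_task)] += self.alpha * td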

Page 43: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Feudal Q-Learning

Page 44: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Feudal Q-Learning

Page 45: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Introduction

• People make decisions at different time scales
  – Traveling example
• People perform actions with different time scales
  – Kicking a ball
  – Becoming a soccer player
• It is desirable to have a method that supports these temporally-extended actions over different time scales

Page 46: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Concept

• Macro-actions
• Temporal abstraction method of Hierarchical RL
• Options are temporally extended actions, each of which consists of a set of primitive actions
• Example:
  – Primitive actions: walking N, S, W, E
  – Options: go to {door, corner, table, straight}
• Options can be open-loop or closed-loop
• Semi-Markov Decision Process theory [Puterman]

Page 47: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Formal Definitions
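The formal definitions on this slide are an image in the original; in the standard options framework of Sutton, Precup, and Singh, an option is a triple
$$o = \langle \mathcal{I}, \pi, \beta \rangle,$$
where $\mathcal{I} \subseteq S$ is the initiation set, $\pi$ is the option's (closed-loop) policy over primitive actions, and $\beta : S \to [0, 1]$ gives the probability of the option terminating in each state.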

Page 48: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Rise of SMDP!

• Theorem: MDP + Options = SMDP

Page 49: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Value function
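The value function is likewise an image in the original; the standard option-value definition it corresponds to is
$$Q^\mu(s, o) = E\big\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \,\big|\, o \text{ initiated in } s \text{ at time } t, \text{ then policy-over-options } \mu \big\}.$$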

Page 50: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Bellman-like optimality condition
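The Bellman-like optimality condition over a set of options $\mathcal{O}$ has the standard form
$$Q^*_{\mathcal{O}}(s, o) = E\big\{ r + \gamma^{k} \max_{o' \in \mathcal{O}_{s'}} Q^*_{\mathcal{O}}(s', o') \,\big|\, o \text{ taken in } s \big\},$$
where $r$ is the cumulative discounted reward collected during the $k$ steps the option runs and $s'$ is the state in which it terminates.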

Page 51: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: A simple example

Page 52: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: A simple example

Page 53: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: A simple example

Page 54: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Interrupting Options

• An option's policy is followed until the option terminates.
• This condition is somewhat unnecessary
  – You may change your decision in the middle of executing your previous decision.
• Interruption Theorem: Yes! It is better!
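A sketch of the interruption idea (my illustration, under the same assumed `env` interface as before): while executing an option, terminate it early whenever some other option looks strictly better in the current state.

def run_with_interruption(env, s, options, Q, current):
    # Execute `current` option but interrupt whenever another option dominates.
    # Q[(state, option)] is an assumed action-value table over options.
    done = False
    while not done and not current.terminates(s):
        better = max(options, key=lambda o: Q[(s, o)])
        if Q[(s, better)] > Q[(s, current)]:
            current = better                      # interrupt: switch options mid-execution
        s, _, done = env.step(current.policy(s))  # follow the (possibly new) option's policy
    return s, done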

Page 55: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Interrupting Options:An example

Page 56: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Options: Other issues

• Intra-option {model, value} learning

• Learning each option
  – Defining a sub-goal reward function
• Generating new options
  – Intrinsically Motivated RL

Page 57: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ

• MaxQ Value Function Decomposition

• Somewhat related to Feudal Q-Learning

• Decomposing value function in a hierarchical structure

Page 58: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ

Page 59: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: Value decomposition
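The decomposition itself is an image in the original; Dietterich's standard form splits the value of invoking subtask $a$ inside parent task $i$ into the value of $a$ itself plus a completion term:
$$Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a),$$
$$V^\pi(i, s) = \begin{cases} Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ E\{ r \mid s, i \} & \text{if } i \text{ is primitive,} \end{cases}$$
where $C^\pi(i, s, a)$ is the expected discounted reward for completing task $i$ after subtask $a$ finishes.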

Page 60: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: Existence theorem

• Recursively optimal policies
• There may be many recursively optimal policies with different value functions.
• A recursively optimal policy is not necessarily an optimal policy.
• If H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. M and H have the same value.

Page 61: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: Learning

• Theorem: If M is an MDP, H is a stationary macro hierarchy, the policy is GLIE (Greedy in the Limit with Infinite Exploration), and the common convergence conditions hold (bounded V and C, sum of alpha is …), then with probability 1 the MaxQ-0 algorithm will converge!

Page 62: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ

• Faster learning: all-states updating
  – Similar to the "all-goals updating" of Kaelbling

Page 63: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ

Page 64: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: State abstraction

• Advantages
  – Memory reduction
  – The needed exploration is reduced
  – Increased reusability, as a subtask does not depend on its higher parents
• Is it possible?!

Page 65: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: State abstraction

• Exact preservation of the value function
• Approximate preservation

Page 66: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: State abstraction

• Does it converge?
  – It has not been proved formally yet.
• What can we do if we want to use an abstraction that violates Theorem 3?
  – Reward function decomposition
    • Design a reward function that reinforces the responsible parts of the architecture.

Page 67: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

MaxQ: Other issues

• Undesired Terminal states

• Non-hierarchical execution (polling execution)
  – Better performance
  – Computationally intensive

Page 68: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Return of BBS (Episode II)

Automatic Design

Page 69: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Learning in Behavior-based Systems

• There are a few works on behavior-based learning
  – Mataric, Mahadevan, Maes, and ...
• … but there is no deep investigation of it (especially a mathematical formulation)!
• And most of them use flat architectures.

Page 70: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Learning in Behavior-based Systems

There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning.
– [Agent] Did I perform it correctly?!
– [Tutor] Yes/No! (or 0.3)

Page 71: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Learning in Behavior-based Systems

We have divided learning in BBS into two parts:
• Structure Learning
  – How should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?
• Behavior Learning
  – How should each behavior behave? (we do not have the necessary toolbox)

Page 72: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Assumptions

• Structure Learning in the Subsumption Architecture as a good sample of BBS
• Purely parallel case
• We know B1, B2, … but we do not know how to arrange them in the architecture
  – we know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we do not know which one is superior to the others.

Page 73: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning

[Figure: Behavior Toolbox: manipulate the world, build maps, explore, locomote, avoid obstacles]

The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).

Page 74: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning

[Figure: Behavior Toolbox: manipulate the world, build maps, explore, locomote, avoid obstacles]

Page 75: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning

[Figure: Behavior Toolbox]

1. explore becomes the controlling behavior and suppresses avoid obstacles
2. The agent hits a wall!

Page 76: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning

[Figure: Behavior Toolbox]

The tutor (environment) gives explore a punishment for being in that place of the structure.

Page 77: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning

[Figure: Behavior Toolbox]

"explore" is not a very good behavior for the highest position of the structure, so it is replaced by "avoid obstacles".

Page 78: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Challenging Issues

• Representation: How should the agent represent the knowledge gathered during learning?
  – Sufficient (the concept space should be covered by the hypothesis space)
  – Tractable (small hypothesis space)
  – Well-defined credit assignment
• Hierarchical Credit Assignment: How should the agent assign credit to different behaviors and layers in its architecture?
  – If the agent receives a reward/punishment, how should we reward/punish the structure of the agent?
• Learning: How should the agent update its knowledge when it receives the reinforcement signal?

Page 79: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Overcoming the Challenging Issues

• Decomposing the behavior of a multi-agent system into simpler components may sharpen our view of the problem under investigation: decompose the value function of the agent into simpler elements.

The structure can provide a lot of clues to us.

Page 80: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Value Function Decomposition

Each structure has a value with respect to the reinforcement signal it receives:
$$V_T = E[R \mid \text{agent with the structure } T]$$
• The objective is finding a structure T with a high value.
• We have decomposed the value function into simpler components that enable the agent to benefit from previous interaction with the environment.

Page 81: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Value Function Decomposition

• It is possible to decompose the total system's value into the value of each behavior in each layer.
• We call it the Zero-Order (ZO) method.

Don't read the following equations!
$$V_{ZO}(i, j) = E[R_t \mid B_j \text{ is the controlling behavior in the } i\text{th layer}]$$
$$V_T = \sum_{i=1}^{m} \sum_{j=1}^{n} P(B_{ij} \text{ is controlling} \mid L_i) \, P(L_i) \, V_{ij}$$

Page 82: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Value Function Decomposition (Zero-Order Method)

The ZO method stores the value of a behavior being in a specific layer.

[Figure: ZO value table in the agent's mind, one value per (behavior, layer) pair across a higher and a lower layer: avoid obstacles (0.8, 0.6), explore (0.7, 0.9), locomote (0.4, 0.4)]

Page 83: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Credit Assignment (Zero-Order Method)

• The controlling behavior is the only behavior responsible for the current reinforcement signal.
• An appropriate ZO value table updating method is available.
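As a sketch of this idea (my illustration, not the authors' exact update rule), the ZO table can be kept as a running average of the reinforcement received while each behavior controls each layer:

from collections import defaultdict

class ZeroOrderTable:
    # V_ZO[(layer, behavior)]: value of `behavior` occupying `layer`.
    def __init__(self, alpha=0.1):
        self.V = defaultdict(float)
        self.alpha = alpha

    def update(self, layer, controlling_behavior, reward):
        # Only the controlling behavior is credited/blamed for this reward.
        key = (layer, controlling_behavior)
        self.V[key] += self.alpha * (reward - self.V[key])

    def best_assignment(self, layers, behaviors):
        # Greedy reading of the table: per layer, pick the highest-valued behavior
        # (ignores the constraint of not reusing a behavior in several layers).
        return {l: max(behaviors, key=lambda b: self.V[(l, b)]) for l in layers}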

Page 84: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning: Value Function Decomposition and Credit Assignment, Another Method (First Order)

The First-Order (FO) method stores the value of the relative order of behaviors
– How good/bad is it if "B1 is placed higher than B2"?!
• V(avoid obstacles > explore) = 0.8
• V(explore > avoid obstacles) = -0.3
• Sorry! Not that easy (and informative) to show graphically!!
• Credits are assigned to all (controlling, activated) pairs of behaviors.
  – The agent receives a reward while B1 is controlling and B3 and B5 are activated:
    • (B1>B3): +
    • (B1>B5): +
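A corresponding sketch of the first-order bookkeeping (again my illustration, under the same assumptions): for every reward, each ordered pair (controlling behavior, activated-but-suppressed behavior) is credited.

from collections import defaultdict

class FirstOrderTable:
    # V_FO[(b_hi, b_lo)]: value of placing b_hi above b_lo in the structure.
    def __init__(self, alpha=0.1):
        self.V = defaultdict(float)
        self.alpha = alpha

    def update(self, controlling, activated_others, reward):
        # Credit every (controlling > activated) relative-order pair.
        for other in activated_others:
            key = (controlling, other)
            self.V[key] += self.alpha * (reward - self.V[key])

# usage: a reward received while B1 controls and B3, B5 were also activated
fo = FirstOrderTable()
fo.update("B1", ["B3", "B5"], reward=+1.0)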

Page 85: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning Experiment: Multi-Robot Object Lifting

• A group of three robots wants to lift an object using only their own local sensors
  – No central control
  – No communication
  – Local sensors
• Objectives
  – Reaching a prescribed height
  – Keeping the tilt angle small

Page 86: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning Experiment: Multi-Robot Object Lifting

Behavior Toolbox
• Stop
• Push More
• Hurry Up
• Slow Down
• Don't Go Fast
?!

Page 87: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning Experiment: Multi-Robot Object Lifting

[Figure: reward vs. episode (0-50) for ZO, FO, a hand-designed structure, and a random structure]

Page 88: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning Experiment: Multi-Robot Object Lifting

[Figure: sample shot of the height (z) of the three robots vs. time steps after sufficient learning, with the goal height marked]

Page 89: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Structure Learning Experiment: Multi-Robot Object Lifting

[Figure: sample shot of the tilt angle of the object (in degrees) vs. time steps after sufficient learning]

Page 90: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning

The assumption of having a working behavior repertoire may not be practical in every situation
– Partial knowledge of the designer about the problem: suboptimal solutions

Assumptions:
– The input and output spaces of each behavior are known (S' and A').
– Fixed structure

Page 91: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning

Page 92: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning

[Figure: two behaviors, explore and avoid obstacles, each mapping its own view of the state to an action: a1 = B1(s1'), a2 = B2(s2')]

How should each behavior behave when the system is in state S?!

Page 93: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning: Challenging Issues

• Hierarchical Behavior Credit Assignment: How should the agent assign credit to different behaviors in its architecture?
  – If the agent receives a reward/punishment, how should we reward/punish the behaviors of the agent?
  – Multi-agent Credit Assignment Problem
• Cooperation between Behaviors: How should we design behaviors so that they can cooperate with each other?
• Learning: How should the agent update its knowledge when it receives the reinforcement signal?

Page 94: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning: Value Function Decomposition

• The value function of the agent can be decomposed into simpler behavior-level components:

$$V_T = \sum_{i=1}^{m} \sum_{j=1}^{n} P(B_{ij} \text{ is controlling} \mid L_i) \, P(L_i) \sum_{s_j \in S_j} \sum_{a_j \in A_j} P(s_j \mid B_{ij} \text{ is controlling in } L_i) \, \pi_j(s_j, a_j) \, Q_j(s_j, a_j)$$

Page 95: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning: Hierarchical Behavior Credit Assignment

• Augmenting the action space of behaviors with "No Action"
  – Cooperation between behaviors
  – Each behavior knows whether there exists a better behavior among the lower behaviors:
    • Do not suppress them!
• Developed a multi-agent credit assignment framework for logically expressible teams.

Page 96: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning: Hierarchical Behavior Credit Assignment


Page 97: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Learning: Optimality Condition and Value Updating


Page 98: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Concurrent Behavior and Structure Learning

• We have divided the BBS learning task into two separate processes:
  – Structure Learning
  – Behavior Learning

• Concurrent behavior and structure learning is possible

Page 99: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Concurrent Behavior and Structure Learning

[Figure: the learning loop: 1) initialize learning parameters, 2) interact with the environment and receive the reinforcement signal, 3) update the estimates of the structure and behavior value functions, 4) update the architecture according to the new estimates, then repeat from 2.]
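A skeleton of that loop in code (my sketch; the structure and behavior learners stand in for the ZO/FO tables and per-behavior learners described earlier, and `env`, `build_agent`, and the learner interfaces are assumptions):

def concurrent_learning(env, build_agent, structure_learner, behavior_learners, episodes=100):
    # Interleave structure learning and behavior learning in one loop.
    for _ in range(episodes):
        agent = build_agent(structure_learner, behavior_learners)     # current architecture
        s, done = env.reset(), False
        while not done:
            layer, behavior, action = agent.act(s)                    # the controlling behavior acts
            s2, r, done = env.step(action)
            structure_learner.update(layer, behavior, r)              # credit the structure (e.g. ZO table)
            behavior_learners[behavior].update(s, action, r, s2)      # credit the behavior's own values
            s = s2
    return structure_learner, behavior_learners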

Page 100: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior and Structure Learning Experiment: Multi-Robot Object Lifting

[Figure: average gained reward vs. percentile of the superior results; curves: Str/Beh learning, Str learning, Hand-designed Beh learning.]

Cumulative average gained reward during the testing phase of the object lifting task for different learning methods.

Page 101: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior and Structure Learning Experiment: Multi-Robot Object Lifting

[Figure 17: probability vs. behavioral performance in four panels: Behavior/Structure Learning, Behavior Learning, Structure Learning, and Hand-designed.]

Figure 17. Probability distribution of behavioral performance during the learning phase of the object lifting task for different learning methods.

Page 102: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Austin Villa Robot Soccer Team

N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

Page 103: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Austin Villa Robot Soccer Team

N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

Initial Gait

Page 104: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Austin Villa Robot Soccer Team

N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

During Training Process

Page 105: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Austin Villa Robot Soccer Team

N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

Fastest Final Result

Page 106: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution

• Computational framework inspired by natural evolution
  – Natural Selection (Selection of the Fittest)
  – Reproduction
    • Crossover
    • Mutation

Page 107: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution

• A good (fit) individual survives the various hazards and difficulties of its lifetime and can find a mate and reproduce.

• Its useful genetic information is passed to its offspring.

• If two fit parents mate with each other, their offspring is [probably] better than both of them.

Page 108: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution

• Artificial Evolution is used as a method of optimization
  – Does not need explicit knowledge of the objective function
  – Does not need objective function derivatives
  – Does not get stuck in local minima/maxima
• In contrast with gradient-based searches

Page 109: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution

[Figure: a function with multiple maxima and minima]

Page 110: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution

[Figure: a noisy function with multiple maxima and minima]

Page 111: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution: A General Scheme

[Figure: the evolutionary loop: initialize the population, calculate the fitness of each individual, select the best individuals, mate the best individuals, and repeat.]
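A minimal sketch of that loop (my illustration; the real-valued genome, truncation selection, mutation scale, and the toy fitness function are arbitrary choices, not from the slides):

import random

def evolve(f, genome_len=10, pop_size=20, generations=50, mut_std=0.1):
    # Tiny generational GA: truncation selection + one-point crossover + Gaussian mutation.
    pop = [[random.uniform(-1, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f, reverse=True)               # calculate fitness and rank individuals
        parents = pop[: pop_size // 2]              # select the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0, mut_std) for g in child]   # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=f)

# usage: maximize a toy fitness function (negated sum of squares)
best = evolve(lambda g: -sum(x * x for x in g))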

Page 112: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

• Artificial Evolution as an approach to automatically design the controller of a situated agent.
• Evolving the controller neural network

Page 113: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

• The objective function is not very well-defined in robotic tasks.
• The dynamics of the whole system (agent/environment) are too complex to compute derivatives of the objective function.

Page 114: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

• Evolution is very time consuming.
• Actually, in most cases we do not have a population of robots, so we use a single robot instead of a population (which takes much more time).
• Implementation on a real physical robot may damage the robot before a suitable controller has evolved.

Page 115: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics: Simulated/Physical Robot

• Evolve from the first generation on the physical robot.
  – Too expensive
• Simulate the robot and evolve an appropriate controller in a simulated world, then transfer the final solution to the physical robot.
  – Different dynamics of physical and simulated robots
• After evolving a controller on a simulated robot, continue the evolution on the physical system too.

Page 116: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

Page 117: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

Page 118: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

Floreano, D. and Mondada, F., "Automatic Creation of an Agent: Genetic Evolution of a Neural Network Driven Robot," in D. Cliff, P. Husbands, J.-A. Meyer, and S. Wilson (Eds.), From Animals to Animats III, Cambridge, MA: MIT Press, 1994.

Best individual of generation 45, born after 35 hours

Page 119: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

D. Floreano, S. Nolfi, and F. Mondada, “Co-Evolution and Ontogenetic Change in Competing Robots,” Robotics and Autonomous Systems, To appear, 1999

25 generations (a few days)

Page 120: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

[Artificial] Evolution in Robotics

J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, “Incremental Robot Shaping,” Connection Science, 10, 341-360, 1998.

Page 121: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hybrid Evolution/Learning in Robots

• Evolution is slow
  – but can find very good solutions
• Learning is fast (more flexible during the lifetime)
  – but may get stuck in local maxima of the fitness function.

We may use both evolution and learning.

Page 122: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Hybrid Evolution/Learning in Robots

You may remember that in the structure learning method, we have assumed that there is a set of working behaviors.

To develop behaviors, we have used learning.

Now, we want to use evolution instead.

Page 123: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS

• Behavior Generation
  – Co-evolution
  – Slow
• Structure Organization
  – Learning
  – Memetically biased

Figure 2. Building the agent from different behavior pools: an initial structure, Behavior Pool 1, Behavior Pool 2, …, Behavior Pool n, and a Meme Pool (culture).

Page 124: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS

• Fitness function: How to calculate fitness of each behavior?

• Fitness Sharing:
  – Uniform
  – Value-based
• Genetic Operators
  – Mutation
  – Crossover
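The slides do not spell out the sharing rules; as a rough illustration of the distinction (my assumption of the intent, not the authors' formulas), uniform sharing splits the agent's fitness equally among the behaviors that built it, while value-based sharing weights the split by the learned structure values of the slots those behaviors occupied:

def uniform_sharing(agent_fitness, behaviors):
    # Every behavior that participated in the agent gets an equal share.
    share = agent_fitness / len(behaviors)
    return {b: share for b in behaviors}

def value_based_sharing(agent_fitness, behavior_slot_values):
    # Share in proportion to the learned value of each behavior's slot in the structure.
    # `behavior_slot_values` maps behavior -> non-negative structure value (assumed).
    total = sum(behavior_slot_values.values()) or 1.0
    return {b: agent_fitness * v / total for b, v in behavior_slot_values.items()}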

Page 125: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

Figure 5. (Object Lifting) Averaged last-five-episodes fitness comparison for different design methods: 1) evolution of behaviors (uniform fitness sharing) and learning structure (blue), 2) evolution of behaviors (value-based fitness sharing) and learning structure (black), 3) hand-designed behaviors with learning structure (green), and 4) hand-designed behaviors and structure (red). Dotted lines across the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.

[Figure 5: fitness vs. generations (0-50) for the four cases listed in the caption.]

Page 126: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

[Figure: sample run after learning: height (z) of robots 1-3 and tau vs. time steps.]

Page 127: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

Figure 6. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is much higher.

[Figure 6: fitness and lifetime fitness vs. generations (0-50) for the cases listed in the caption.]

Page 128: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

Figure 9. (Object Lifting) Probability distribution comparison for uniform fitness sharing. The comparison is between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines are distributions of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.

[Figure 9: fitness distributions at generations 1, 5, 20, and 50.]

Page 129: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

Figure 10. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is higher.

[Figure 10: fitness and lifetime fitness vs. generations (0-50) for the cases listed in the caption.]

Page 130: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

Figure 13. (Object Lifting) Probability distribution comparison for value-based fitness sharing. The comparison is between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines are distributions of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.

[Figure 13: fitness distributions at generations 1, 5, 20, and 50.]

Page 131: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Conclusions, Ongoing Research, and Future Work

• A [rather] complete and mathematical investigation of the automatic design of behavior-based systems
• Structure Learning
• Behavior Learning
• Concurrent Behavior and Structure Learning
• Behavior Evolution and Structure Learning
  – Memetic Bias
• Good results in two different domains
  – Multi-robot Object Lifting
  – An Abstract Problem

Page 132: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen

Conclusions, Ongoing Research, and Future Work

• However, many steps remain before fully automated agent design
  – Extending to the multi-step formulation
  – How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
  – Applying structure learning methods to more general architectures, e.g. MaxQ.
  – The problem of reinforcement signal design
    • Designing a good reinforcement signal is not easy at all.

Page 133: Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas SoloGen