
Learning and Evolution in Hierarchical Behavior-based Systems

Amir massoud Farahmand

Advisor:

Majid Nili Ahmadabadi

Co-advisors:

Caro Lucas – Babak N. Araabi

University of Tehran - Dept. of ECE

Motivation

Machines (e.g. robots) are moving from labs to homes, factories, ...

Machines face:
• An unknown environment/body: an [exact] model of the environment/body is not known
• A non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere) and aging

The designer may not know how to benefit from every aspect of her agent/environment


Motivation

• Difficulty of the design process
• Machines see different things
• Machines interact differently
• The designer is not a machine!

I know what I want!

Our goal: Automatic design of intelligent machines


Research Specification

Goal: Automatic design of intelligent robots

Architecture: Hierarchical behavior-based architectures.

Objective performance measure is available (reinforcement signal)
[Agent] Did I perform it correctly?!
[Tutor] Yes/No! (or 0.3)


Behavior-based Approach to AI

Behavior-based approach as a successful alternative to the classical AI approach: no {Abstraction, Planning, Deduction, …}

Behavioral (activity) decomposition, as opposed to functional decomposition

Behavior: Sensor -> Action (direct link between perception and action)


Behavioral Decomposition

build maps

explore

avoid obstacles

locomote

manipulate the world

sensors actuators


Behavior-based Design

• Robust: not sensitive to the failure of a particular part of the system; no need for precise perception, as there is no modelling
• Reactive: fast response, as there is no long route from perception to action
• No explicit representation


How should we DESIGN a behavior-based system?!


Behavior-based System Design Methodologies

Hand Design
• Common almost everywhere
• Complicated: may even be infeasible in complex problems
• Even if a working system can be found, it is probably not optimal

Evolution
• Good solutions can be found
• Biologically feasible
• Time consuming
• Not fast in making new solutions

Learning
• Biologically feasible
• Learning is essential for the life-time survival of the agent


Taxonomy of Design Methods

Behavior-based System Design
• Learning: structure (hierarchy) learning, behavior learning
• Evolution: co-evolution of behaviors
• Hybridization of evolution and learning: memetic algorithm


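Problem Formulation: Behaviors

\[ B_i : S_i \rightarrow A_i, \qquad i = 1, \ldots, n \]

\[ M_i : S \rightarrow S_i, \qquad S_i = \{\, s_i \mid s_i = M_i(s),\ s \in S \,\} \]

\[ A_i \subseteq A \cup \{\text{No Action}\} \]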


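Problem Formulation: Purely Parallel Subsumption Architecture (PPSSA)

\[ T = [\, B_{index(1)} \;\; B_{index(2)} \;\; \cdots \;\; B_{index(m)} \,]^{T} \]

$index(i) = j$ indicates that $B_j$ is in the $i$-th layer ($n$ behaviors, $m$ layers).

Example: \[ T = [\, \text{ObstacleAvoidance} \;\; \text{BallCollection} \;\; \text{Wandering} \,]^{T} \]

• Different behaviors become excited
• Higher behaviors can suppress lower ones
• Controlling behavior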
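The arbitration idea on this slide (several behaviors may be excited at once, and the highest excited one suppresses the rest and becomes the controlling behavior) can be sketched as below. This is a minimal illustration and not the author's implementation: the three behavior functions, the keys of the `state` dictionary, the action names, and the convention that the list is ordered from highest to lowest priority are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed convention: a behavior returns an action when it is excited,
# and None when it is not excited in the given state.
Behavior = Callable[[dict], Optional[str]]


def obstacle_avoidance(state: dict) -> Optional[str]:
    return "turn_away" if state.get("obstacle_near") else None


def ball_collection(state: dict) -> Optional[str]:
    return "approach_ball" if state.get("ball_visible") else None


def wandering(state: dict) -> Optional[str]:
    return "random_walk"  # hypothetical behavior that is always excited


def ppssa_step(structure: Sequence[Behavior], state: dict) -> Tuple[Optional[int], Optional[str]]:
    """Return (layer, action) of the controlling behavior.

    `structure` is ordered from highest to lowest priority (an assumption),
    so the first excited behavior suppresses everything below it.
    """
    for layer, behavior in enumerate(structure):
        action = behavior(state)
        if action is not None:
            return layer, action
    return None, None  # no behavior is excited


T = [obstacle_avoidance, ball_collection, wandering]
print(ppssa_step(T, {"obstacle_near": True, "ball_visible": True}))
```

With this ordering, obstacle avoidance wins whenever an obstacle is near, even if the ball is visible too.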


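Problem Formulation: Reinforcement Signal and the Agent’s Value Function

\[ R = \frac{1}{N} \sum_{i=1}^{N} r_i \]

\[ V_T = E\left[ R \,\middle|\, \text{agent with the structure } T \text{ and set of behaviors } B_i\ (i = 1, \ldots, n) \right] = E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, \text{agent with the structure } T \text{ and set of behaviors } B_i\ (i = 1, \ldots, n) \right] \]

• This function states the value of using a set of behaviors in a specific structure.
• We want to maximize the agent’s value function.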


Problem Formulation: Design as an Optimization

• Structure Learning: finding the best structure given a set of behaviors, using learning: $T^* = \arg\max_{T} V_T$
• Behavior Learning: finding the best behaviors given the structure, using learning: $B_i^* = \arg\max_{B_i} V_T$
• Concurrent Behavior and Structure Learning: $(T^*, B_i^*) = \arg\max_{T, B_i} V_T$
• Behavior Evolution: finding the best behaviors given the structure, using evolution: $B_i^* = \arg\max_{B_i} V_T$
• Behavior Evolution and Structure Learning: $(T^*, B_i^*) = \arg\max_{T, B_i} V_T$
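To make the structure-selection objective concrete (pick the structure with the highest agent value for a fixed set of behaviors), here is a brute-force sketch that scores every ordering of the behaviors. It is only an illustration: `toy_value` is a hypothetical stand-in for the agent's value function, which in practice would be estimated from rewards, and the behavior names are placeholders.

```python
from itertools import permutations


def best_structure(behaviors, value_of):
    """Exhaustively evaluate every ordering and return the best one."""
    return max(permutations(behaviors), key=value_of)


def toy_value(structure):
    # Hypothetical value function: prefers "avoid" above "explore",
    # and "explore" above "locomote" (index 0 is the highest layer).
    score = 0.0
    if structure.index("avoid") < structure.index("explore"):
        score += 1.0
    if structure.index("explore") < structure.index("locomote"):
        score += 0.5
    return score


best = best_structure(["explore", "locomote", "avoid"], toy_value)
print(best)
```

The number of orderings grows factorially with the number of behaviors, which is one reason to learn the structure from reinforcement rather than enumerate it.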


Where?!


Learning Evolution





Memetic Algorithm


Learning in Behavior-based Systems

There is some research on learning in behavior-based systems (Mataric, Mahadevan, Maes, and ...)

… but there has been no deep investigation of it (especially no mathematical formulation)!

And most of that work uses flat architectures.


Learning in Behavior-based Systems

We design:
• Structure (hierarchy)
• Behavior

We learn:
• Structure learning: organizing behaviors in the architecture using a behavior toolbox
• Behavior learning: the correct mapping of each behavior


Where?!


Learning Evolution





Memetic Algorithm


Structure Learning

manipulate the world

build maps

explore

locomote

avoid obstacles

Behavior Toolbox

The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).




Structure Learning

manipulate the world

build maps

explore

locomote

avoid obstacles

Behavior Toolbox

1. explore becomes the controlling behavior and suppresses avoid obstacles.

2. The agent hits a wall!


Structure Learning

manipulate the world

build maps

explore

locomote

avoid obstacles

Behavior Toolbox

The tutor (environment) punishes explore for being in that place in the structure.


Structure Learning

manipulate the world

build maps

explore

locomote

avoid obstacles

Behavior Toolbox

“explore” is not a very good behavior for the highest position of the structure, so it is replaced by “avoid obstacles”.


Structure Learning: Challenging Issues

Representation: How should the agent represent the knowledge gathered during learning?
• Sufficient (the concept space should be covered by the hypothesis space)
• Generalization capability
• Tractable (small hypothesis space)
• Well-defined credit assignment

Hierarchical Credit Assignment: How should the agent assign credit to the different behaviors and layers in its architecture?
• If the agent receives a reward/punishment, how should we reward/punish the structure of the agent?

Learning: How should the agent update its knowledge when it receives a reinforcement signal?


Structure Learning: Overcoming Challenging Issues

Our approach is to define a representation that allows the agent’s value function to be decomposed into simpler components.

Decomposing the behavior of a multi-agent system into simpler components may give us better insight into the problem under investigation.

The structure can provide us with a lot of clues.


Structure Learning

• Zero Order Representation: the value of each behavior in each layer
• First Order Representation: the value of the order (higher/lower) of behaviors in the structure


Structure Learning: Zero Order Representation

ZO value table in the agent’s mind:

Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)

Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)


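Structure Learning: Zero Order Representation - Value Function Decomposition

\[
\begin{aligned}
V_T = E[R] &= E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \right] \\
&= E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, L_1 \text{ is controlling} \right] P(L_1 \text{ is controlling}) + \cdots + E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, L_m \text{ is controlling} \right] P(L_m \text{ is controlling}) \\
&= \sum_{i=1}^{m} E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, L_i \text{ is controlling} \right] P(L_i \text{ is controlling})
\end{aligned}
\]

Layer’s value:

\[ E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, L_i \text{ is controlling} \right] = \sum_{j=1}^{n} E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_j \text{ is the controlling behavior in } L_i \right] P(B_j \mid L_i) = \sum_{j=1}^{n} V_{ij}\, P(B_j \mid L_i), \qquad i = 1, \ldots, m \]

ZO components:

\[ V_{ij} = E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_j \text{ is the controlling behavior in } L_i \right] \]

Agent’s value function:

\[ V_T = \sum_{i=1}^{m} \sum_{j=1}^{n} V_{ij}\, P(B_j \mid L_i)\, P(L_i \text{ is controlling}) \]

\[ T^* = \arg\max_{T} V_T = \arg\max_{T} \sum_{i=1}^{m} \sum_{j=1}^{n} V_{ij}\, P(B_j \mid L_i)\, P(L_i \text{ is controlling}) \]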


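Structure Learning: Zero Order Representation - Credit Assignment and Value Updating

The controlling behavior is the only behavior responsible for the current reinforcement signal.

\[ \tilde{V}_{ij} = V_{ij}\, P(B_j \mid L_i)\, P(L_i \text{ is controlling}) \]

\[ \tilde{V}_{ij,n+1} = (1 - \alpha_{ij,n})\, \tilde{V}_{ij,n} + \alpha_{ij,n}\, r_n \qquad \text{if } B_j \text{ is active at time step } n \text{ and } L_i \text{ is controlling at time step } n \]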
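A zero-order value table with this kind of credit assignment (only the controlling behavior, in the layer it occupies, is credited with the current reinforcement) might be sketched as follows. The constant step size, the zero initial values, and in particular the greedy rule used to read a structure out of the table are assumptions for illustration; the slides do not specify them.

```python
from collections import defaultdict


class ZeroOrderTable:
    """Value estimate for each (layer, behavior) pair."""

    def __init__(self, step_size=0.1):
        self.values = defaultdict(float)  # (layer, behavior) -> estimate
        self.step_size = step_size        # assumed constant step size

    def update(self, layer, behavior, reward):
        # Credit only the controlling behavior in the layer it occupies.
        key = (layer, behavior)
        self.values[key] += self.step_size * (reward - self.values[key])

    def greedy_structure(self, behaviors, n_layers):
        # Assumed extraction rule: fill layers from 0 upward, each time
        # taking the unused behavior with the highest estimate.
        remaining = list(behaviors)
        structure = []
        for layer in range(n_layers):
            best = max(remaining, key=lambda b: self.values[(layer, b)])
            structure.append(best)
            remaining.remove(best)
        return structure


table = ZeroOrderTable(step_size=0.5)
table.update(0, "explore", -1.0)  # explore controlled in layer 0 and was punished
table.update(0, "avoid", 1.0)     # avoid controlled in layer 0 and was rewarded
table.update(1, "explore", 1.0)
print(table.greedy_structure(["explore", "avoid"], n_layers=2))
```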


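Structure Learning: First Order Representation

\[ V_T = E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \right] = \sum_{i=1}^{m} E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_{index(i)} \text{ is controlling} \right] P(B_{index(i)} \text{ is controlling}) \]

\[
\begin{aligned}
E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_k \text{ is controlling} \right]
&= E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_k \text{ is controlling and nobody else is active} \right] \\
&\quad + \sum_{j:\, B_k >_T B_j} E\left[ \frac{1}{N} \sum_{t=1}^{N} r_t \,\middle|\, B_k \text{ is controlling and } B_j \text{ is the next active behavior} \right] \\
&= V_0(k) + \sum_{j:\, B_k >_T B_j} V(k > j)
\end{aligned}
\]

\[ V_T = \sum_{i=1}^{m} \left[ V_0(index(i)) + \sum_{j=1}^{i-1} V(index(i) > index(j)) \right] P(B_{index(i)} \text{ is controlling}) \]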


Structure Learning: First Order Representation - Credit Assignment

If only one behavior becomes active, we update V0(i). If two or more behaviors become active, we update V(i>j), where ‘i’ is the index of the controlling behavior and ‘j’ is the index of the next active behavior.
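The first-order credit assignment rule can be sketched as a small table with two kinds of entries: one for a behavior controlling alone, and one for an ordered pair (controlling behavior, next active behavior). The running-average update with a constant step size and the behavior names are assumptions for illustration; only the choice of which entry to update comes from the rule itself.

```python
class FirstOrderTable:
    """Values of V0(i) and V(i>j) for a first-order representation."""

    def __init__(self, step_size=0.1):
        self.v0 = {}      # i -> value when behavior i is the only active one
        self.v_pair = {}  # (i, j) -> value of i controlling with j next active
        self.step_size = step_size  # assumed constant step size

    def update(self, active, reward):
        """`active` lists the active behaviors, controlling behavior first."""
        if not active:
            return  # nothing was active, so nothing is credited
        controlling = active[0]
        if len(active) == 1:
            table, key = self.v0, controlling
        else:
            table, key = self.v_pair, (controlling, active[1])
        old = table.get(key, 0.0)
        table[key] = old + self.step_size * (reward - old)


fo = FirstOrderTable(step_size=0.5)
fo.update(["avoid"], 1.0)              # avoid was the only active behavior
fo.update(["explore", "avoid"], -1.0)  # explore controlled while avoid was also active
print(fo.v0, fo.v_pair)
```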


A Break!


Introduction to Experiments

• Abstract problem
• Multi-robot object lifting problem (I will only discuss this problem now.)

A group of robots lifts a bulky object.


Experiments: Structure Learning

[Figure: Reward vs. Episode for ZO, FO, hand-designed structure, and random structure.]

Comparison of the average gained reward of two structure learning methods (Zero Order (ZO) and First Order (FO)), a hand-designed structure, and a random structure for the object lifting problem.


Where?!


Learning Evolution





Memetic Algorithm


Behavior Learning

No more behavior repertoire assumption

All we know:
• Sensor/actuator dimensions
• Reinforcement signal


Behavior Learning: Challenging Issues

How should behaviors cooperate with each other to maximize the performance of the agent?

How should we assign credit to behaviors of the architecture?

How should each behavior update its knowledge?


Behavior Learning

1. B2, B3, and B4 excite

2. B4 takes the control

3. Punishment!!!

?!


Behavior Learning

Augmenting the action space with a pseudo-action named NoAction (NA)

NA does nothing and lets lower behaviors take control

1. B2, B3, and B4 become excited

2. B4 proposes NA

3. B3 proposes an action and takes control

4. Reward!
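The effect of the NoAction pseudo-action on arbitration can be sketched like this: an excited behavior that proposes NA yields, and control passes to the next excited behavior below it. The data layout (a list of name/proposal pairs ordered from highest to lowest behavior, with `None` for a behavior that is not excited) and the action names are assumptions for the example.

```python
NA = "NA"  # the NoAction pseudo-action


def arbitrate(proposals):
    """Return (controlling behavior, its action, behaviors that yielded with NA).

    `proposals` is ordered from the highest behavior to the lowest. Each entry
    is (name, proposal), where proposal is None (not excited), NA, or an action.
    """
    yielded = []
    for name, proposal in proposals:
        if proposal is None:
            continue              # not excited: plays no role in this step
        if proposal == NA:
            yielded.append(name)  # excited, but lets lower behaviors act
            continue
        return name, proposal, yielded
    return None, None, yielded    # nobody proposed a real action


# The scenario from the slide: B2, B3 and B4 are excited, B4 proposes NA.
step = arbitrate([("B4", NA), ("B3", "push_more"), ("B2", "slow_down"), ("B1", None)])
print(step)
```

Keeping the list of behaviors that yielded is useful later, because they also took part in the decision and may need credit for it.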


Behavior Learning

NA lets behaviors cooperate

How should we force them to cooperate correctly?!
• Hierarchical Credit Assignment Problem

Boolean-like algebra for logically expressible multi-agent systems

\[ A_1 \cdot (A_2 + A_3) = A_1 \cdot A_2 + A_1 \cdot A_3 \]


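Behavior Learning

For a reinforcement R received under the structure T, the behaviors fall into four groups:
• Upper behaviors $B_{u_1}, \ldots, B_{u_k}$: proposed NA
• Controlling behavior $B^*$
• Lower behaviors $B_l$: unknown
• Not-excited behaviors: unknown

[Equation not recoverable from the extraction: a logical expression for the case R = 1 in terms of $B_{u_1}(NA), \ldots, B_{u_k}(NA)$ and $B^*$.]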


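Behavior Learning: Optimality

$S_i^*$: the region of the state space in which $B_i$ is excited

\[ r_i = E\left[ R(s) \,\middle|\, \text{reward is achieved by the contribution of } B_i \right] = E\left[ R(s) \,\middle|\, B_i,\ s \in S_i^* \right] = \int_{s \in S_i^*} R(s)\, p(B_i \mid s)\, p(s \mid s \in S_i^*)\, ds \]

The internal states of different behaviors become excited in different regions.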


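Behavior Learning: Optimality

\[ Q_i(s_i, a_i) = \int_{s \in S_i^*} R(s)\, p(B_i \text{ is excited in } s,\ B_i \text{ selects } a_i)\, p(s \mid s \in S_i^*)\, ds \]

\[ Q_i(s_i, NA) = \int_{s \in S_i^*} R(s)\, p(B_i \text{ is excited in } s,\ B_i \text{ selects } NA)\, p(s \mid s \in S_i^*)\, ds \]

$B_i$ should propose NA in $s_i$ when

\[ Q_i(s_i, NA) \ge Q_i(s_i, a_i) \qquad \forall a_i \in A_i \]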


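Behavior Learning: Value Updating

For the case of immediate reward:

\[ Q_{i,k+1}(s_i, a_i) = \left(1 - \alpha_{i,k}(s_i, a_i)\right) Q_{i,k}(s_i, a_i) + \alpha_{i,k}(s_i, a_i)\, r(s_i) \]

($B_i$ is the controlling behavior in $s_i$ and selects $a_i$)

\[ Q_{j,k+1}(s_j, NA) = \left(1 - \alpha_{j,k}(s_j, NA)\right) Q_{j,k}(s_j, NA) + \alpha_{j,k}(s_j, NA)\, r(s_i) \]

($B_i$ is the controlling behavior and $B_j >_T B_i$; the $B_j$s are excited in $s_j$ and select NA)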
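For the immediate-reward case, the two updates (one for the controlling behavior's chosen action, one for the NA choice of each excited behavior above it) can be sketched as below. The constant step size, the zero initial Q values, and the state and action labels are assumptions for illustration only.

```python
from collections import defaultdict

NA = "NA"  # the NoAction pseudo-action


class BehaviorQ:
    """Per-behavior action values Q_i(s_i, a_i), including the NA action."""

    def __init__(self, step_size=0.1):
        self.q = defaultdict(float)  # (state, action) -> estimate
        self.step_size = step_size   # assumed constant step size

    def update(self, state, action, reward):
        key = (state, action)
        self.q[key] += self.step_size * (reward - self.q[key])


def assign_credit(controlling, yielded_above, reward):
    """Credit the controlling behavior and every upper behavior that chose NA.

    controlling:   (BehaviorQ, state, action) of the controlling behavior
    yielded_above: list of (BehaviorQ, state) for excited upper behaviors
                   that proposed NA in this step
    """
    learner, state, action = controlling
    learner.update(state, action, reward)
    for upper, upper_state in yielded_above:
        upper.update(upper_state, NA, reward)


b3, b4 = BehaviorQ(step_size=0.5), BehaviorQ(step_size=0.5)
# B4 proposed NA, B3 took control with a hypothetical action and was rewarded.
assign_credit((b3, "s3", "push_more"), [(b4, "s4")], reward=1.0)
print(b3.q[("s3", "push_more")], b4.q[("s4", NA)])
```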


Behavior Learning: Value Updating

For the general return case, we should use Monte Carlo estimation.

Bootstrapping methods are not applicable.


Concurrent Behavior and Structure Learning

Applying:
• Behavior Learning: state-action mappings
• Structure Learning: hierarchy


Experiments: Behavior Learning

[Figure: Average Gained Reward vs. Episodes for Str. Learning, Beh. Learning, and Beh./Str. Learning.]

Reward comparison between structure learning, behavior learning, and concurrent behavior/structure learning methods for the object lifting task.



[Figure: two panels (learning phase, testing phase) of Probability vs. Average Gained Reward for Random, Hand-designed, Str. Learning, Beh. Learning, and Beh./Str. Learning.]



[Figure: two panels (learning phase, testing phase) of Average Gained Reward vs. percentile of the superior results for Hand-designed, Str. Learning, Beh. Learning, and Beh./Str. Learning.]



A sample trajectory showing the position of the robot-object contact points, the tilt angle of the object during lifting, and the controlling behavior of each robot at each time step after sufficient structure/behavior learning. The numbers in the lowest diagram correspond to behaviors as follows: 0 (No Behavior), 1 (Push More), 2 (Don’t Go Fast), 3 (Stop), 4 (Hurry up), 5 (Slow down).

[Figure: three panels vs. Time (sec): Height, Tilt Angle, and Controlling Behaviors for robot 1, robot 2, and robot 3.]


Where?!


Learning Evolution





Memetic Algorithm


Behavior Co-evolution: Motivations

+
• Learning can get trapped in local maxima of the objective function
• Learning is sensitive (POMDP, non-Markov, …)
• Evolutionary methods have more chance of finding the global maximum of the objective function
• The objective function may not be well-defined in robotics

-
• Evolutionary robotics methods are usually slow (a problem when the environment changes fast)
• Non-modular controllers: monolithic, no reusability


Behavior Co-evolution: Motivations

• Use evolution to search the big and difficult part of the parameter space: the behaviors’ parameter space is usually the bigger one
• Use learning for fast responses: the structure’s parameter space is usually the smaller one, and a change in the structure results in different agent behavior
• Evolve behaviors separately (modularity and re-usability)


Behavior Co-evolution

Agent

Behavior Pool 1

Behavior Pool 2

Behavior Pool n

Evolve each kind of behavior in its own genetic pool


Behavior Co-evolution: Fitness Sharing

Fitness of the agent -> fitness of each behavior?!

Fitness sharing:
• Uniform
• Value-based
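The two fitness-sharing options can be sketched as follows. Uniform sharing simply gives every behavior the agent's fitness. The slide does not define the value-based rule, so the version below is an assumption: it splits the agent's fitness in proportion to the magnitude of a per-behavior value (for example, one taken from the learned value table). Function names and the example numbers are hypothetical.

```python
def uniform_sharing(agent_fitness, behaviors):
    """Every behavior receives the full fitness of the agent it was part of."""
    return {b: agent_fitness for b in behaviors}


def value_based_sharing(agent_fitness, behavior_values):
    """Assumed rule: share fitness in proportion to |value| of each behavior."""
    total = sum(abs(v) for v in behavior_values.values())
    if total == 0:
        # No information about contributions: fall back to uniform sharing.
        return uniform_sharing(agent_fitness, behavior_values)
    return {b: agent_fitness * abs(v) / total for b, v in behavior_values.items()}


print(uniform_sharing(10.0, ["push_more", "stop"]))
print(value_based_sharing(10.0, {"push_more": 3.0, "stop": 1.0}))
```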


Behavior Co-evolution

Each behavior’s genetic pool:
• Selection
• Genetic operators
  - Crossover
  - Mutation: hard replacement, soft perturbation

\[ B_i^{j,\text{new}} = X\, B_i^{j,\text{old}} + (1 - X)\, B_i^{k,\text{old}} \]


Where?!


Learning Evolution





Memetic Algorithm


Memetic Algorithm

We waste the learned knowledge after each agent’s lifetime

A meme is a unit of information that reproduces itself as people exchange ideas

Traditional memetic algorithms:
• Evolutionary method: meme exchange
• Local search: meme refinement

They may also be called hybrid evolutionary algorithms


Memetic Algorithm

Two different interpretations of meme:
• The current hybridization of behavior co-evolution and structure learning
  - Similar to the traditional MA
  - Difference from the traditional MA: different parameter spaces are being searched
• Meme as a cultural bias


Memetic Algorithm

Experienced individuals store their experiences in the culture in the form of memes.

Newborn individuals get a new meme from the culture.

Structure as a meme


Memetic Algorithm

Agent

Behavior Pool 1

Behavior Pool 2

Behavior Pool n

Meme Pool(Culture)


Memetic Algorithm

Each meme has its own value

Value of the meme is updated using the fitness of the agent

Valuable memes have more chance to be selected for newborn individuals

\[ \mathcal{M} : (T_i^*, f_{T_i}) \]

\[ f_{T_i, n+1} = (1 - \alpha)\, f_{T_i, n} + \alpha\, f_{A_{T_i}}, \qquad A_{T_i} : (B_i, T_i) \]
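A meme pool along these lines (structures stored with a value, the value updated from the fitness of the agents that used them, and valuable memes more likely to be handed to newborn agents) can be sketched as follows. The running-average update, the softmax selection with a temperature, and all parameter values are assumptions for illustration; the slides only state that each meme has a value, that it is updated with the agent's fitness, and that valuable memes are selected more often.

```python
import math
import random


class MemePool:
    """Culture: a pool of structures (memes), each with a value estimate."""

    def __init__(self, step_size=0.2, temperature=50.0, seed=0):
        self.values = {}                # structure (tuple) -> value
        self.step_size = step_size      # assumed averaging rate
        self.temperature = temperature  # assumed selection temperature
        self.rng = random.Random(seed)

    def store(self, structure, agent_fitness):
        """An experienced agent reports the structure it learned and its fitness."""
        key = tuple(structure)
        old = self.values.get(key, agent_fitness)
        self.values[key] = old + self.step_size * (agent_fitness - old)

    def sample(self):
        """A newborn agent draws a meme; higher-valued memes are more likely."""
        keys = list(self.values)
        top = max(self.values.values())
        weights = [math.exp((self.values[k] - top) / self.temperature) for k in keys]
        return self.rng.choices(keys, weights=weights, k=1)[0]


pool = MemePool()
pool.store(["avoid", "explore"], 200.0)
pool.store(["explore", "avoid"], -100.0)
draws = [pool.sample() for _ in range(200)]
print(draws.count(("avoid", "explore")), "of 200 newborns start from the better meme")
```

A newborn agent would use the sampled structure as the initial bias for its own structure learning and write its result back with `store` at the end of its lifetime. `sample` assumes the pool is not empty.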


Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm

(Object Lifting) Comparison of the fitness averaged over the last five episodes for different design methods: 1) evolution of behaviors (uniform fitness sharing) and learning structure (blue), 2) evolution of behaviors (value-based fitness sharing) and learning structure (black), 3) hand-designed behaviors with learning structure (green), and 4) hand-designed behaviors and structure (red). The dotted lines around the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.

[Figure: Fitness vs. Generations; legend: Structure Learning - Value-based Fitness Sharing, Structure Learning - Uniform Fitness Sharing, Hand-designed Behaviors and Structure, Hand-designed Behavior/Learning Structure.]



(Object Lifting) Comparison of the last-five-episodes fitness and the lifetime fitness for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the fitness over the last five episodes of the agent’s lifetime, and dotted lines indicate the agent’s lifetime fitness. Although the final performance of all cases is much the same, the lifetime fitness of the memetic-based design is much higher.

[Figure: Fitness and Lifetime Fitness vs. Generations; legend: Structure Learning - No Meme Pool, Structure Learning - with Meme Pool, Hand-designed Structure/Behavior Evolution, Hand-designed Behaviors/Structure Learning.]



(Object Lifting) Probability distribution comparison for uniform fitness sharing. The comparison is between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.

[Figure: four panels of Probability vs. Fitness at Generation 1, Generation 5, Generation 20, and Generation 50; curves labelled Meme (M), No Meme (N), and Fixed Str. (F).]



[Figure: Fitness and Lifetime Fitness vs. Generations; legend: Structure Learning - with Meme Pool, Structure Learning - No Meme Pool, Hand-designed Behaviors/Structure Learning, Hand-designed Structure/Behavior Evolution.]

(Object Lifting) Comparison of the last-five-episodes fitness and the lifetime fitness for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the fitness over the last five episodes of the agent’s lifetime, and dotted lines indicate the agent’s lifetime fitness. Although the final performance of all cases is much the same, the lifetime fitness of the memetic-based design is higher.



Figure 13. (Object Lifting) Probability distribution comparison for value-based fitness sharing. The comparison is between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.

[Figure: four panels of Probability vs. Fitness at Generation 1, Generation 5, Generation 20, and Generation 50; curves labelled Meme (M), No Meme (N), and Fixed Str. (F).]


Other Topics

• Probabilistic analysis of PPSSA
  - Change in the excitation probability
  - Change in the controlling probability of each layer
• Some estimate of the learning time
• The effect of reinforcement signal uncertainty on
  - the value function
  - the policy of the agent


Conclusions


Learning Evolution





Memetic Algorithm


Contributions

• Deep and mathematical investigation of behavior-based systems
• Tackling the design process from different approaches: learning, evolution, and culture-based methods
• Structure learning is quite new in hierarchical reinforcement learning


Suggestions for the Future Work

• Extending the proposed methods to more complex architectures
• Automatic extraction of the behaviors’ state spaces (traditional clustering methods are not suitable)
• Convergence proof in learning
• Automatic abstraction of knowledge
• Simultaneous low-level and high-level decision making
• Investigations on the reinforcement signal design


Thanks!
