TRANSCRIPT
Hybrid Behavior Co-evolution and Structure Learning in Behavior-based Systems
Amir massoud Farahmand (a,b,c)
(www.cs.ualberta.ca/~amir)
Majid Nili Ahmadabadi (b,c)
Caro Lucas (b,c)
Babak N. Araabi (b,c)
a) Department of Computing Science, University of Alberta
b) Control and Intelligent Processing Center of Excellence, Department of Electrical and Computer Engineering,
University of Tehran
c) School of Cognitive Sciences, IPM
Motivation
Situated real-world agents face various uncertainties:
– Unknown environment/body
• The exact model of the environment/body is not known
– Non-stationary environment/body
• Changing environment (offices, houses, streets, and almost everywhere)
• Aging
• …
Designing a robust controller for such an agent is not easy.
Research Specification
• Goal: Automatic design of intelligent agents
• Architecture: Hierarchical behavior-based architectures (a version of the Subsumption architecture)
– Behavior-based systems:
• A robust, successful approach for designing situated agents
• Behavioral decomposition
• Behaviors: Sensors → Actions
• Evaluation: An objective performance measure is available (the reinforcement signal)
– [Agent] Did I perform it correctly?!
– [Tutor] Yes/No! (or 0.3)
build maps
explore
avoid obstacles
locomote
manipulate the world
sensors → actuators

How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies
• Hand Design
– Common almost everywhere
– Complicated: may even be infeasible in complex problems
– Even if it is possible to find a working system, it is probably not the best solution
• Evolution
– Good solutions can be found (+)
– Biologically plausible (+)
– Time consuming (−)
– Not fast at producing new solutions (−)
• Learning
– Biologically plausible (+)
– Learning is essential for the life-time survival of the agent (+)
– May get stuck in a local minimum (−)
Taxonomy of Design Methods
Behavior-based System Design
• Learning
– Structure (hierarchy) learning
– Behavior learning
• Evolution
– Co-evolution of behaviors
– Evolution of structure
Hybridization of Evolution and Learning
Problem Formulation
Behaviors
Each behavior B_i, i = 1, …, n, maps part of the state space to an extended action set:

B_i : S_i' → A_i'
S_i' = { s_i' | s_i' = M_i(s_i), ∀ s_i ∈ S },  S_i' ⊂ S,  M_i : S → S_i'
A_i' = A_i ∪ {No Action},  A_i ⊂ A
Problem Formulation
Purely Parallel Subsumption Architecture (PPSSA)
T = [ index(1) index(2) … index(m) ]^T,  m ≤ n
index(i) = j indicates that B_j is in the i-th layer.
Example: T = [ Wandering BallCollection ObstacleAvoidance ]
• Different behaviors can be excited at the same time.
• Higher behaviors can suppress lower ones.
• The highest excited behavior becomes the controlling behavior.
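A minimal sketch of this arbitration scheme (function and behavior names are hypothetical, not from the paper): every behavior proposes an action or "No Action", and the highest layer that proposes an action suppresses all lower ones.

```python
# Sketch of purely parallel subsumption arbitration (hypothetical names).
# Each layer's behavior maps the state to an action or to None ("No Action");
# the highest layer that proposes an action suppresses all lower ones.
NO_ACTION = None

def arbitrate(structure, state):
    """structure: list of behaviors ordered from lowest to highest layer."""
    controlling = NO_ACTION
    for behavior in structure:           # higher layers come later...
        action = behavior(state)
        if action is not NO_ACTION:      # ...and suppress lower ones
            controlling = action
    return controlling

# Toy behaviors for a corridor: wander always moves forward,
# avoid_obstacles overrides only when a wall is close.
def wander(state):
    return "forward"

def avoid_obstacles(state):
    return "turn" if state["wall_distance"] < 1.0 else NO_ACTION

structure = [wander, avoid_obstacles]    # avoid_obstacles is the higher layer
```

With this ordering, avoid_obstacles only takes control when it is excited; otherwise the lower-layer wander behavior drives the actuators.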
Problem Formulation
Reinforcement Signal and the Agent's Value Function

R = (1/N) Σ_{i=1}^{N} r_i

V_T = E^π[ (1/N) Σ_{t=1}^{N} r_t | agent with the structure T and set of behaviors {B_i}, i = 1, …, n ]
    = E^π[ R | agent with the structure T and set of behaviors {B_i}, i = 1, …, n ]
• This function states the value of using a set of behaviors in a specific structure.
• We want to maximize the agent's value function.
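The value function above can be estimated by simply averaging the reinforcement over one run; a minimal Monte-Carlo sketch (the function name is an illustrative assumption):

```python
# Monte-Carlo sketch of the value function above: the empirical value of a
# fixed (structure, behaviors) pair is the average reinforcement per step.
def empirical_value(rewards):
    """R = (1/N) * sum_i r_i for one run of N steps."""
    return sum(rewards) / len(rewards)
```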
Problem Formulation
Design as an Optimization

• Structure Learning: finding the best structure given a set of behaviors, using learning
  T* = argmax_T V_T
• Behavior Learning: finding the best behaviors given the structure, using learning
  {B_i}* = argmax_{B_i} V_T
• Concurrent Behavior and Structure Learning
  (T*, {B_i}*) = argmax_{T, {B_i}} V_T
• Behavior Evolution: finding the best behaviors given the structure, using evolution
  {B_i}* = argmax_{B_i} V_T
• Behavior Evolution and Structure Learning
  (T*, {B_i}*) = argmax_{T, {B_i}} V_T
Structure Learning
manipulate the world
build maps
explore
locomote
avoid obstacles
Behavior Toolbox
The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning
1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!
Structure Learning
The tutor (environment) gives explore a punishment for being in that position of the structure.
Structure Learning
"explore" is not a very good behavior for the highest position of the structure, so it is replaced by "avoid obstacles".
Structure Learning
Challenging Issues

• Representation: How should the agent represent the knowledge gathered during learning?
– Sufficient (the concept space should be covered by the hypothesis space)
– Generalization capability
– Tractable (small hypothesis space)
– Well-defined credit assignment
• Hierarchical Credit Assignment: How should the agent assign credit to the different behaviors and layers in its architecture?
– If the agent receives a reward/punishment, how should we reward/punish the structure of the agent?
• Learning: How should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning
Overcoming Challenging Issues

• Our approach is to define a representation that allows decomposing the agent's value function into simpler components.
• The structure itself can provide a lot of clues.
Structure Learning
Zero Order Representation

ZO Value Table in the agent's mind:
Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
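A sketch of how such a ZO value table could be stored and used, with the values from the example above; the greedy per-layer argmax is an illustrative simplification (the original may impose extra constraints, e.g. not reusing a behavior across layers):

```python
# Sketch: a Zero Order (ZO) value table maps (layer, behavior) -> learned value.
# Values mirror the slide's example: "explore" looks best in the lower layer,
# "avoid obstacles" in the higher one.
zo_table = {
    ("higher", "avoid obstacles"): 0.8,
    ("higher", "explore"): 0.7,
    ("higher", "locomote"): 0.4,
    ("lower", "avoid obstacles"): 0.6,
    ("lower", "explore"): 0.9,
    ("lower", "locomote"): 0.4,
}

def best_structure(table):
    """Pick, for each layer, the behavior with the highest ZO value
    (a greedy sketch of argmax_T V_T under the decomposition)."""
    best = {}
    for (layer, behavior), value in table.items():
        if layer not in best or value > best[layer][1]:
            best[layer] = (behavior, value)
    return {layer: b for layer, (b, _) in best.items()}
```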
Structure Learning
Zero Order Representation - Value Function Decomposition

V_T = E^π[R] = E^π[ (1/N) Σ_{t=1}^{N} r_t ]
    = E^π[ (1/N) Σ_t r_t ∧ ("L_1 is controlling" ∨ "L_2 is controlling" ∨ … ∨ "L_m is controlling") ]
    = E^π[ (1/N) Σ_t r_t ∧ "L_1 is controlling" ] + … + E^π[ (1/N) Σ_t r_t ∧ "L_m is controlling" ]
    = E^π[ (1/N) Σ_t r_t | L_1 is controlling ] · P(L_1 is controlling)
    + E^π[ (1/N) Σ_t r_t | L_2 is controlling ] · P(L_2 is controlling)
    + … + E^π[ (1/N) Σ_t r_t | L_m is controlling ] · P(L_m is controlling)
Structure Learning
Zero Order Representation - Value Function Decomposition

E^π[ (1/N) Σ_t r_t | L_i is controlling ]
    = Σ_{j=1}^{n} P(B_ij | L_i) · E^π[ (1/N) Σ_t r_t | B_ij is the controlling behavior in L_i ]
    = Σ_{j=1}^{n} P(B_ij | L_i) · V_ij,   i = 1, …, m

so that

V_T = Σ_{i=1}^{m} Σ_{j=1}^{n} P(B_ij | L_i is controlling) · V_ij · P(L_i)

Here V_T is the agent's value function, the V_ij are the ZO components, and Σ_j P(B_ij | L_i) V_ij is the value of layer L_i.
Structure Learning
Zero Order Representation - Value Function Decomposition

T* = argmax_T V_T = argmax_T Σ_{i=1}^{m} Σ_{j=1}^{n} P(B_ij | L_i is controlling) · V_ij · P(L_i)
Structure Learning
Zero Order Representation - Credit Assignment and Value Updating

• The controlling behavior is the only behavior responsible for the current reinforcement signal.

Ṽ_ij = P(B_ij | L_i) · V_ij · P(L_i is controlling)

Ṽ_{ij,n+1} = (1 − α_n) Ṽ_{ij,n} + α_n · ["B_ij is active at time step n"] · ["L_i is controlling at time step n"] · r_n
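A minimal sketch of this update rule (names hypothetical): the two indicator factors mean that only the (layer, behavior) pair that was active and controlling moves toward the received reinforcement, so we can update just that table entry.

```python
# Sketch of the ZO value update: only the (layer, behavior) pair that was
# both active AND controlling at step n is updated; for that entry the two
# indicators are 1, giving V_new = (1 - alpha) * V_old + alpha * r.
def update_zo(v_table, controlling_key, reward, alpha=0.1):
    """v_table: dict (layer, behavior) -> value estimate.
    controlling_key: the (layer, behavior) that controlled the agent."""
    old = v_table[controlling_key]
    v_table[controlling_key] = (1 - alpha) * old + alpha * reward
    return v_table

# The agent hits a wall while "explore" controls the higher layer:
v = {("higher", "explore"): 0.9}
update_zo(v, ("higher", "explore"), reward=-1.0, alpha=0.5)
```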
Behavior Co-evolution
Motivations

(+)
• Learning can get trapped in local maxima of the objective function; evolutionary methods have a better chance of finding the global maximum.
• Learning is sensitive (POMDP, non-Markov, …).
• The objective function may not be well-defined in robotics.

(−)
• Evolutionary robotics methods are usually slow.
– Fast changes of the environment
• Non-modular controllers
– Monolithic
– No reusability
Behavior Co-evolution
Ideas

• Use evolution to search the difficult, big part of the parameter space.
– The behaviors' parameter space is usually the bigger one.
• Use learning for fast responses.
– The structure's parameter space is usually the smaller one.
– A change in the structure results in different agent behavior.
• Evolve behaviors separately.
– Modularity
– Reusability
Behavior Co-evolution
Agent
Behavior Pool 1
Behavior Pool 2
Behavior Pool n
We have different behavior (genetic) pools
Behavior Co-evolution
One behavior is selected randomly from each pool; we want to assess its fitness.
The agent interacts with the environment using an architecture built from the selected behaviors, and tries to maximize its reward.
Based on the agent's performance, a fitness is assigned to it.
Behavior Co-evolution
Fitness Sharing

• We can evaluate the fitness of the agent after its interaction with the environment.
• How can we assess the fitness of each behavior based on the fitness of the agent? (Remember that we have separate behavior pools.)
• We approximate it!

V_B = E[f_B] = E[ (1/K) Σ_{t ∈ last K episodes} r_t | agent with B ∈ {B}_{last K episodes} ]

f_{B_ij} = (1/N) Σ_{last K episodes in which B_ij participated} (agent's fitness)
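A sketch of this uniform fitness-sharing scheme (the helper names and the choice of data structures are assumptions): every behavior that participated in an episode receives the whole agent's fitness for that episode, and a behavior's fitness estimate is the average over the last K such episodes.

```python
from collections import defaultdict, deque

# Sketch of uniform fitness sharing: each participating behavior is credited
# with the full agent fitness; its own fitness is the mean over the last K
# episodes in which it took part.
def make_history(K):
    return defaultdict(lambda: deque(maxlen=K))  # behavior -> last K fitnesses

def record_episode(history, participating_behaviors, agent_fitness):
    for b in participating_behaviors:
        history[b].append(agent_fitness)

def behavior_fitness(history, b):
    h = history[b]
    return sum(h) / len(h)

history = make_history(K=3)
record_episode(history, ["wander", "avoid"], agent_fitness=1.0)
record_episode(history, ["wander"], agent_fitness=0.0)
```

The deque with maxlen=K implements the "last K episodes" window: older credits fall off automatically as new episodes are recorded.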
Behavior Co-evolution
Each behavior's genetic pool has the conventional evolutionary operators/phenomena:
– Selection
– Genetic operators
• Crossover
• Mutation
– Hard: replacement
– Soft: perturbation, B_ij^new = B_ij^old + X, with X a small random perturbation
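The two mutation flavors can be sketched as follows, assuming (as an illustration, not the paper's encoding) that a behavior's genome is a flat list of floats:

```python
import random

# Sketch of the two mutation flavors on a behavior's parameter genome
# (flat-list-of-floats encoding is an assumption):
# "hard" mutation replaces one gene outright, "soft" mutation perturbs all.
def hard_mutation(genome, rng):
    mutant = list(genome)
    i = rng.randrange(len(mutant))
    mutant[i] = rng.uniform(-1.0, 1.0)  # replacement with a fresh random value
    return mutant

def soft_mutation(genome, rng, sigma=0.05):
    # B_new = B_old + X, with X a small Gaussian perturbation per gene
    return [g + rng.gauss(0.0, sigma) for g in genome]
```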
Multi-Robot Object Lifting Problem
• Three robots want to lift an object using only their own local sensors
– No central control
– No communication
– Local sensors
• Objectives
– Reaching the prescribed height
– Keeping the tilt angle small
A group of robots lifts a bulky object.
Multi-Robot Object Lifting Problem
[Simulation videos of the object-lifting task]
Conclusion
• Hybridization of evolution and learning
• Evolution and learning search different subspaces of the solution space
• Results competitive with human designs
Important Questions
• Is it possible to benefit from the information gathered during learning?
– Each agent learns an approximately good structure arrangement, yet we do not use it at all!
• Is there any other way of sharing the agent's fitness among behaviors?
– Currently, we share it uniformly among all behaviors.
It seems that the answer to these questions is positive!
Future Research
• Can we decompose other problems (not just hierarchical behavior-based systems) similarly?!
– Learning and evolution
– Fast and deep
– Different subspaces of the solution space
• Other ways of fitness sharing
– Low bias
– Low variance
Questions?!