1
Learning Behavior-Selection by Emotions and Cognition in a Multi-Goal Robot Task
Sandra Clara Gadanho
Presented by Jamie Levy
2
Purpose
Build an autonomous robot controller which can learn to master a complex task when situated in a realistic environment:
Continuous time and space
Noisy sensors
Unreliable actuators
3
Possible problems for the learning algorithm:
Multiple goals may conflict with each other
Situations in which the agent needs to temporarily overlook one goal to accomplish another
Short-term and long-term goals
4
Possible problems for the learning algorithm (cont):
May need a sequence of different behaviors to accomplish one goal
Behaviors are unreliable
A behavior's appropriate duration is undetermined; it depends on the environment and on its success
5
Emotion-based Architecture
Traditional RL adaptive system complemented with an emotion system responsible for behavior switching.
Innate emotions define goals.
The agent learns emotional associations of environment-state and behavior pairs to determine its decisions.
Q-learning is used to learn a behavior-selection policy, which is stored in neural networks.
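The Q-learning update behind this behavior-selection policy can be sketched as follows. This is a minimal tabular illustration; the paper stores the Q-values in neural networks, and the learning-rate and discount values here are assumptions.

```python
# Tabular sketch of the Q-learning update used for behavior selection.
# ALPHA and GAMMA are assumed values, not taken from the paper.

ALPHA = 0.1   # learning rate (assumption)
GAMMA = 0.9   # discount factor (assumption)

def q_update(q, state, behavior, reward, next_state, behaviors):
    """One Q-learning step: move Q(s, b) toward r + gamma * max_b' Q(s', b')."""
    best_next = max(q.get((next_state, b), 0.0) for b in behaviors)
    old = q.get((state, behavior), 0.0)
    q[(state, behavior)] = old + ALPHA * (reward + GAMMA * best_next - old)
    return q[(state, behavior)]
```

In the paper the reward is the well-being signal and the actions are the hand-designed behaviors, so `behaviors` would hold the three behavior names.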
6
ALEC – Asynchronous Learning by Emotion and Cognition
Augments the EB architecture with a cognitive system, which:
Has explicit rule knowledge extracted from environment interactions
Is based on the CLARION model by Sun and Peterson (1998)
Allows learning the decision rules in a bottom-up fashion
7
ALEC architecture (cont)
The cognitive system of ALEC I was directly inspired by the top level of the CLARION model.
ALEC II has some changes (to be discussed later).
In ALEC III, the emotion system learns about goal states exclusively, while the cognitive system learns about goal-state transitions.
LEC (Learning by Emotion and Cognition) is a non-asynchronous variant, used to test the usefulness of behavior switching.
8
EB II
Replaces emotional model with a goal system.
The goal system is based on a set of homeostatic variables that the agent attempts to maintain within certain bounds.
9
The EB II architecture is composed of two parts:
Goal System
Adaptive System
10
Perceptual Values
Light intensity
Obstacle density
Energy availability – indicates whether a nearby source is releasing energy
11
Behavior System
Three hand-designed behaviors to select from:
Avoid obstacles
Seek light
Wall following
These are not designed to be very reliable and may fail (e.g., wall following may lead to a crash).
12
Goal System
Responsible for deciding when behavior switching should occur.
Goals are explicitly identified and associated with homeostatic variables.
13
Three different states:
target
recovery
danger
14
Homeostatic Variables
A variable remains in its target state as long as its value is optimal or acceptable.
A well-being variable is derived from the homeostatic variables.
Each variable has an effect on the well-being.
15
Homeostatic Variables
Energy – reflects the goal of maintaining energy
Welfare – maintains the goal of avoiding collisions
Activity – ensures the agent keeps moving; otherwise its value slowly decreases and the target state is not maintained
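The target/recovery/danger classification of a homeostatic variable can be sketched as a simple threshold function. The bound values used here are illustrative assumptions, not taken from the paper.

```python
# Illustrative classification of a homeostatic variable's state.
# The acceptable range and danger threshold are assumed values.

def homeostatic_state(value, acceptable=(0.4, 1.0), danger_below=0.1):
    """Return 'target' inside the acceptable range, 'danger' when
    critically low, and 'recovery' in between."""
    low, high = acceptable
    if low <= value <= high:
        return "target"
    if value < danger_below:
        return "danger"
    return "recovery"
```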
16
Well-Being
State change – when a homeostatic variable changes from one state to another, the well-being is positively influenced.
Predictions of State Change – when some perceptual cue predicts the state change of a homeostatic variable, influence is similar to above, but lower in value.
These are modeled after emotions and may describe “pain” or “pleasure.”
17
Well-Being (cont)
cs = state coefficient
rs = influence of the state on well-being
18
Well-Being (cont)
ct(sh) = state transition coefficient
wh = weight of homeostatic variable:
1.0 for energy
0.6 for welfare
0.4 for activity
19
Well-Being (cont)
cp = prediction coefficient
rph = value of the prediction
Only considered for the energy and activity variables
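A minimal sketch of how the well-being signal might combine these coefficients over the three homeostatic variables, using the weights given above (1.0, 0.6, 0.4). The exact functional form in the paper is more detailed; the simple additive combination below is an assumption of this sketch.

```python
# Sketch of well-being as a weighted sum over homeostatic variables.
# Weights follow the slides; the additive combination is an assumption.

WEIGHTS = {"energy": 1.0, "welfare": 0.6, "activity": 0.4}

def well_being(state_term, transition_term, prediction_term):
    """Each argument maps a variable name to its contribution
    (state value, state-transition term, prediction term)."""
    total = 0.0
    for var, w in WEIGHTS.items():
        total += w * (state_term.get(var, 0.0)
                      + transition_term.get(var, 0.0)
                      + prediction_term.get(var, 0.0))
    return total
```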
20
21
22
Well-being calculation (cont)
23
Well-being calculation - Prediction
Values of rph depend on the strengths of the current predictions and vary between -1 (for predictions of an undesirable change) and 1.
If there is no prediction rph = 0.
24
Well-being calculation - Prediction
25
Well-being calculation - Prediction
The activity prediction provides a no-progress indicator, given at regular time intervals when the activity of the robot is low for long periods of time:
rp(activity) = -1
There is no prediction for welfare:
rp(welfare) = 0
26
Adaptive System
Uses Q-learning.
State information fed to the neural networks comprises homeostatic variable values and other perceptual values gathered from sensors.
27
Adaptive System (cont)
The developed controller tries to maximize the reinforcement received by selecting among the available hand-designed behaviors.
28
Adaptive System (cont)
The agent may select between performing the behavior proven better in the past or an arbitrary one.
The selection function is based on the Boltzmann-Gibbs distribution (p. 30 in the class textbook).
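Boltzmann-Gibbs selection can be sketched as follows: behaviors with higher Q-values are chosen more often, while lower-valued ones keep a non-zero probability of being explored. The temperature value is an assumption.

```python
import math
import random

# Sketch of Boltzmann-Gibbs (softmax) behavior selection.
# The temperature value is an assumption; lower values make the
# selection greedier, higher values make it more random.

def boltzmann_select(q_values, temperature=0.1, rng=random.random):
    """q_values: one Q-value per behavior. Returns the chosen index."""
    m = max(q_values)  # subtract max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r = rng() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(q_values) - 1
```

Passing a fixed `rng` makes the choice deterministic, which is convenient for testing; in use the default `random.random` gives the stochastic selection the slides describe.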
29
EB II Architecture
30
ALEC Architecture
31
ALEC I
Inspired by the CLARION model.
Each individual rule consists of a condition for activation and a behavior suggestion.
The activation condition is dictated by a set of intervals, one for each dimension of the input space.
6 input dimensions varying between 0 and 1, with intervals of 0.2.
32
ALEC I (cont)
A condition interval may only start or end at pre-defined points of the input space.
Since this may lead to a large number of possible states, rule learning is limited to those few cases with successful behavior selection.
Other cases are left to the emotion system, which uses its generalization abilities to cover the state space.
33
ALEC I (cont)
Successful behaviors for certain states are used to extract a rule corresponding to the decision made, and the rule is added to the agent's rule set.
If the same decision is made again, the agent updates the success rate (SR) for that rule.
34
ALEC I – Success
r = immediate reinforcement
Difference of Q-value between state x, where decision a was made, and the resulting state y
Tsuccess = 0.2 (constant threshold)
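The success test can be sketched as below: a decision counts as successful when the immediate reinforcement plus the resulting Q-value improvement exceeds Tsuccess. Whether and how the successor Q-value is discounted is an assumption of this sketch.

```python
T_SUCCESS = 0.2  # threshold from the slides

# Sketch of the ALEC I success test for a behavior decision.
# The discount factor applied to the successor Q-value is an assumption.

def decision_successful(r, q_x, q_y, gamma=0.9):
    """r: immediate reinforcement; q_x: Q(x, a) where decision a was
    made; q_y: best Q-value in the resulting state y."""
    return r + gamma * q_y - q_x > T_SUCCESS
```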
35
ALEC I (cont) – Rule expansion, shrinkage
If a rule is often successful, the agent tries to generalize it to cover nearby environmental states.
If a rule is very poor, the agent makes it more specific.
If it still does not improve, the rule is deleted.
Maximum of 100 rules.
36
ALEC I (cont) – Rule expansion, shrinkage
Statistics are kept for the success rate of every possible one-state expansion or shrinkage of the rule, in order to select the best option.
The rule is compared to a "match all" rule (rule_all) with the same behavior suggestion, and against itself after the best expansion or shrinkage (rule_exp, rule_shrink).
37
ALEC I (cont) – Rule expansion, shrinkage
A rule is expanded if it is significantly better than the match-all rule and the expanded rule is better than or equal to the original rule.
A rule that is insufficiently better than the match-all rule is shrunk if this results in an improvement, or otherwise is deleted.
38
ALEC I (cont) – Rule expansion, shrinkage
39
Rule expansion, shrinkage (cont)
Constant thresholds:
Tsuccess = 0.2
Texpand = 2.0
Tshrunk = 1.0
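A sketch of how these thresholds might drive the expand/shrink/delete decision. Comparing success rates as simple ratios against the match-all rule is an assumption of this sketch; the constant names follow the slides.

```python
T_EXPAND = 2.0  # Texpand from the slides
T_SHRINK = 1.0  # Tshrunk from the slides

# Sketch of the rule expansion/shrinkage decision. How the success
# rates are compared (simple ratios here) is an assumption.

def rule_action(sr_rule, sr_all, sr_exp, sr_shrink):
    """Return 'expand', 'shrink', 'delete' or 'keep', given success
    rates of the rule, the match-all rule, and the rule's best
    one-state expanded and shrunk variants."""
    if sr_rule >= T_EXPAND * sr_all and sr_exp >= sr_rule:
        return "expand"
    if sr_rule < T_SHRINK * sr_all:
        # insufficiently better than match-all: shrink if that helps
        return "shrink" if sr_shrink > sr_rule else "delete"
    return "keep"
```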
40
ALEC I (cont) – Rule expansion, shrinkage
A rule that performs badly is deleted.
A rule is also deleted if its condition has not been met for a while.
When two rules propose the same behavior selection and their conditions are sufficiently similar, they are merged into a single rule.
The success rate is reset whenever a rule is modified by merging, expansion or shrinkage.
41
Cognitive System (cont)
If the cognitive system has a rule that applies to the current environmental state, then the cognitive system influences the behavior decision:
It adds an arbitrary constant of 1.0 to the respective Q-value before the stochastic behavior selection is made.
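The rule influence can be sketched as a constant bias added to the Q-values before the stochastic (Boltzmann-Gibbs) selection; the function and variable names below are illustrative.

```python
RULE_BONUS = 1.0  # arbitrary constant from the slides

# Sketch of the cognitive system's influence on behavior selection:
# a matching rule's suggested behavior gets its Q-value raised by a
# constant before the stochastic selection is made.

def biased_q_values(q_values, suggested_behavior=None):
    """q_values: list of Q-values per behavior; suggested_behavior:
    index suggested by a matching rule, or None if no rule applies."""
    biased = list(q_values)
    if suggested_behavior is not None:
        biased[suggested_behavior] += RULE_BONUS
    return biased
```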
42
ALEC Architecture
43
44
Example of a Rule – execute avoid obstacles
Six input dimensions, segmented with 0.2 granularity: 0, 0.2, 0.4, 0.6, 0.8, 1
energy = [0.6, 1]
activity = [0, 1]
welfare = [0, 0.6]
light intensity = [0, 1]
obstacle density = [0.8, 1]
energy availability = [0, 1]
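The example rule can be written out as interval conditions over the six input dimensions. The dictionary representation and match test below are illustrative assumptions, not the paper's implementation.

```python
# The example rule as interval conditions; the representation is
# an illustrative assumption.

RULE = {
    "behavior": "avoid obstacles",
    "condition": {
        "energy": (0.6, 1.0),
        "activity": (0.0, 1.0),
        "welfare": (0.0, 0.6),
        "light intensity": (0.0, 1.0),
        "obstacle density": (0.8, 1.0),
        "energy availability": (0.0, 1.0),
    },
}

def rule_matches(rule, inputs):
    """True when every input dimension lies inside the rule's interval."""
    return all(lo <= inputs[dim] <= hi
               for dim, (lo, hi) in rule["condition"].items())
```

Intervals like [0, 1] cover the whole dimension, so the rule effectively fires on high obstacle density, high energy and low welfare, suggesting the avoid-obstacles behavior.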
45
ALEC II
Instead of the above function, the agent considers a behavior successful if there is a positive homeostatic variable transition (for example, if a variable's state changes from the danger state to the target state).
46
ALEC III
Same as ALEC II, except that the well-being does not depend on state transitions or predictions:
ct(sh) = 0
cp = 0
47
Experiments
Goal of ALEC is to allow an agent faced with realistic world conditions to adapt on-line and autonomously to its environment.
Cope with:
Continuous time and space
Limited memory
Time constraints
Noisy sensors
Unreliable actuators
48
Khepera Robot
Left and right wheel motors
8 infrared sensors that allow it to detect object proximity and ambient light:
6 in the front
2 in the rear
49
Experiment (cont)
50
Goals
Maintain energy
Avoid obstacles
Move around in the environment
The last goal is not as important as the first two.
51
Energy Acquisition
Must overlook the goal of avoiding obstacles:
Must bump into the source
Energy is available for a short period:
Must look for new sources
Energy is received when the rear sensors register high light values
52
Procedure
Each experiment consisted of:
100 different robot trials of 3 million simulation steps
A new, fully recharged robot, with all state values reset, placed at a randomly selected starting position in each trial
For evaluation, the trial period was divided into 60 smaller periods of 50,000 steps.
53
Procedure (cont)
For each of these periods, the following were recorded:
Reinforcement – mean of the reinforcement (well-being) value, calculated at each step
Energy – mean energy level of the robot
Distance – mean value of the Euclidean distance d, taken at 100-step intervals (approximately the number of steps needed to move between corners of the environment)
Collisions – percentage of steps involving collisions
54
Results
Pairs of controllers were compared using a randomized analysis of variance (RANOVA), following Piater (1999).
55
Results (cont)
The most important contribution to reinforcement is the state value.
For the successful accomplishment of the task and goals, all homeostatic variables should be taken into consideration in the reinforcement:
Agents with no energy-dependent reinforcement fail in their main task of maintaining energy levels.
Agents with no welfare term have increased collisions.
Agents with no activity term move only as a last resort (to avoid collisions).
56
Results (cont)
Predictions of state transitions proved essential for an agent to accomplish its tasks:
A controller with no energy prediction is unable to acquire energy.
A controller with no activity prediction will eventually stop moving.
57
Results – EB, EBII and Random
The first set of graphs deals with three different agents:
EB – discussed in an earlier paper
EB II
Random – selects randomly among the available behaviors at regular intervals
58
Results – EB, EBII and Random
59
[Slides 59-66: result graphs for the compared controllers]
67
Conclusion
The emotion and cognitive systems can improve learning, but neither can store and consult every single event the agent experiences.
The emotion system gives a "sense" of what is right, while the cognitive system constructs a model of reality and corrects the emotion system when it reaches incorrect conclusions.
68
Future work
Adding more specific knowledge to the cognitive system, which may then be used for planning more complex tasks.