Modified MDPs for Concurrent Execution
AnYuan Guo
Victor Lesser
University of Massachusetts
Concurrent Execution
A set of tasks where each task is relatively easy to solve on its own, but when executed concurrently, new interactions arise that complicate the execution of the composite task.
Single agent executing multiple tasks in parallel (example: office robot)
Multiple agents act in parallel (team)
Cross Product MDP
The problem of concurrent execution can be solved optimally by solving the cross product MDP formed by the separate processes.
Problem: exponential blow-up in the size of the joint state space
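To make the blow-up concrete, here is a minimal sketch (mine, not from the talk): the joint state space of k independent MDPs is the Cartesian product of their state spaces, so the cross-product MDP's size is the product of the individual sizes.

```python
# Illustrative sketch: size of the cross-product MDP's state space.

def cross_product_size(state_space_sizes):
    """Number of states in the cross-product MDP: the product of the sizes."""
    size = 1
    for n in state_space_sizes:
        size *= n
    return size

# Ten tasks of 10 states each already give 10^10 joint states.
print(cross_product_size([10] * 10))  # -> 10000000000
```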
Related Work
- Deterministic planning: situation calculus [Reiter96], extending STRIPS [Boutilier97, Knoblock94]
- Termination schemes for temporally extended actions [Rohanimanesh03]
- Planning in the cross-product MDP [Singh98]
- Learning: W-learning [Humphrys96], MAXQ [Dietterich00]
The Goal
Somehow break apart the interactions, encapsulate them within each agent, so they can again be solved independently.
Algorithm Summary
- Define the types of events and interactions of interest
- Summarize the other agent's effect on self as statistical information about how often the constraining event occurs
- Modify my model to reflect this statistic
Events in an MDP
- State-based events (agent enters s5)
- Action-based events (agent moves north 1 step)
- State-action-based events (agent moves north 1 step from s4)
Events in MDP1 affect events in MDP2, for a total of 9 types of interactions.
Assumptions
- The list of possible interactions between the MDPs is given.
- The constraints are one-way only: the effects do not propagate back to the originator of the constraint.
Directed Acyclic Constraints
Constraints between a set of events that forms a directed acyclic graph.
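As a sketch of how the acyclicity assumption could be checked (my own addition, not part of the talk), Kahn's topological sort verifies that the constraint graph between tasks has no cycles:

```python
# Check that one-way constraints between n tasks form a DAG
# (Kahn's algorithm: repeatedly remove nodes with in-degree zero).
from collections import deque

def is_dag(n, edges):
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:          # edge (u, v): task u constrains task v
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == n            # all nodes removed iff no cycle
```

For example, `is_dag(3, [(0, 1), (1, 2)])` is true, while `is_dag(2, [(0, 1), (1, 0)])` is false because the two constraints feed back into each other.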
Event Frequency & MDP Modification
(Diagram: event 1 constrains event 2.)
1) Calculate frequency
2) Modify MDP
Calculating State Visitation Frequency
Given a policy π, solve the system of simultaneous linear equations:

F(s') = Σ_{s ∈ S} F(s) · T(s, π(s), s')

under the constraint that:

Σ_{s ∈ S} F(s) = 1
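The visitation frequencies can be computed numerically as the fixed point of the equation above. A minimal sketch (names and the iterative method are mine; it assumes the Markov chain induced by π is ergodic so the iteration converges):

```python
# F(s') = sum_s F(s) * T(s, pi(s), s'), normalized so the F's sum to 1,
# computed by fixed-point iteration from a uniform starting distribution.

def visitation_frequencies(T, pi, n_states, iters=1000):
    # T[s][a][s2] = transition probability; pi[s] = action chosen in s.
    F = [1.0 / n_states] * n_states
    for _ in range(iters):
        F = [sum(F[s] * T[s][pi[s]][s2] for s in range(n_states))
             for s2 in range(n_states)]
    return F

# Toy chain: from s0 always go to s1; from s1 stay with prob 0.5.
T = [[[0.0, 1.0]],
     [[0.5, 0.5]]]
pi = [0, 0]
print(visitation_frequencies(T, pi, 2))  # close to [1/3, 2/3]
```

Solving the linear system directly (e.g. by Gaussian elimination) gives the same answer without the ergodicity assumption.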
Calculating Action Frequencies
Given a policy π, the action frequency F(a) is the sum of the visitation frequencies of all the states in which action a is executed:

F(a) = Σ_{s ∈ B} F(s), where B = {s | π(s) = a}
Calculating State-Action Frequencies
Now both the action and the state at which it is executed matter:

F(s, a) = F(s) if π(s) = a, and 0 otherwise

This also generalizes to a set of states and actions.
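The two derived frequencies above reduce to a few lines of code, given F(s) and the policy π as plain Python lists (the function names are my own):

```python
# F(a): sum of F(s) over the states where pi chooses action a.
def action_frequency(F, pi, a):
    return sum(F[s] for s in range(len(F)) if pi[s] == a)

# F(s, a) = F(s) if pi(s) = a, and 0 otherwise.
def state_action_frequency(F, pi, s, a):
    return F[s] if pi[s] == a else 0.0
```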
Account for the Effects of Constraints
Modify the model ⟨S, A, T, R⟩: modify the transition probability table.
Intuition: other agents can change the dynamics of my environment.
(Figure: example with agents A1 and A2.)
Account for State-Based Events
A constraint from another task can affect the current task's ability to enter certain states.
A slice of the TPT under action a1 (rows: from-state, columns: to-state):

from\to       s1             s2             s3
s1        P(s1,a1,s1)   P(s1,a1,s2)   P(s1,a1,s3)
s2        P(s2,a1,s1)   P(s2,a1,s2)   P(s2,a1,s3)
s3        P(s3,a1,s1)   P(s3,a1,s2)   P(s3,a1,s3)
Account for Action-Based Events
A constraint from another task can affect the current task's ability to carry out certain actions.
TPT for the affected action a1 (rows: from-state, columns: to-state):

from\to       s1             s2             s3
s1        P(s1,a1,s1)   P(s1,a1,s2)   P(s1,a1,s3)
s2        P(s2,a1,s1)   P(s2,a1,s2)   P(s2,a1,s3)
s3        P(s3,a1,s1)   P(s3,a1,s2)   P(s3,a1,s3)
Account for State-Action-Based Events
A constraint from another task can affect the current task's ability to carry out certain actions at certain states.
TPT for the affected action a1 (rows: from-state, columns: to-state):

from\to       s1             s2             s3
s1        P(s1,a1,s1)   P(s1,a1,s2)   P(s1,a1,s3)
s2        P(s2,a1,s1)   P(s2,a1,s2)   P(s2,a1,s3)
s3        P(s3,a1,s1)   P(s3,a1,s2)   P(s3,a1,s3)
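One concrete way to fold the constraining event's frequency into the affected TPT entries is a convex mixture of the normal dynamics and the dynamics that hold while the constraint is active. This formulation is my own illustration (the slides only show which entries are affected):

```python
# P'(s, a1, s') = (1 - f) * P_normal + f * P_constrained, entrywise,
# where f is the frequency of the constraining event.

def blend_tpt_slice(P_normal, P_constrained, f):
    return [[(1 - f) * pn + f * pc for pn, pc in zip(rn, rc)]
            for rn, rc in zip(P_normal, P_constrained)]
```

Since each row of the result is a convex combination of two probability distributions, each row still sums to 1.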
Experiments
The mountain climbing scenario:
- States: the location of the agent
- Actions: move up, down, left, right, or any of the 4 diagonal steps (8 total)
- Transitions: 0.05 probability of slipping to an adjacent state rather than the intended one
- Rewards: -1 per step, -3 for a diagonal step, 100 for the goal
- Constraint: agent 1 taking the "up" action prevents agent 2 from doing so
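The action set and reward structure above can be sketched in a few lines (the grid layout and goal handling are my assumptions, not given on the slide):

```python
# 8 moves: the 4 cardinal directions plus the 4 diagonal steps.
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0),
           (1, 1), (1, -1), (-1, 1), (-1, -1)]
SLIP_PROB = 0.05  # chance of slipping to an adjacent state instead

def step_reward(action, reached_goal):
    # 100 for the goal; otherwise -3 for a diagonal step, -1 for cardinal.
    if reached_goal:
        return 100
    dx, dy = action
    return -3 if dx != 0 and dy != 0 else -1
```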
Results: Policies
(Figure: policies when executing independently vs. policies when executed concurrently, after we apply the algorithm.)
Results
(Plot: average value of policy vs. size of state space.)
Improvements
- Explore different ways to modify the MDP (e.g. shrink the action set)
- Relax the directed-acyclic constraint restriction (take an iterative approach)
- Show that it is optimal for summaries that consist of a single random variable
New Directions
- Different types of summaries: steady-state behavior (current work), multi-state summaries, summaries with temporal information
- Dynamic task arrival/departure: given some model of arrival, or without a model (learning)
- Positive interactions (e.g. enable)
The End