TRANSCRIPT
Markov Decision Processes
Probabilistic Graphical Models
L. Enrique Sucar, INAOE
Outline
1 Introduction
2 Markov Decision Processes: Representation
3 Evaluation: Value Iteration, Policy Iteration
4 Factored MDPs: Abstraction, Decomposition
5 POMDPs
6 Applications: Power Plant Operation, Robot Task Coordination
7 References
Introduction
• Sequential decision problems are those that involve a series of decisions over time
• Considering that there is uncertainty in the results of the agent's decisions, these types of problems can be modeled as Markov decision processes (MDPs)
• By solving the MDP model we obtain what is known as a policy, which indicates to the agent which action to select at each time step based on its current state; the optimal policy is the one that selects the actions so that the expected value is maximized
• Finally, we will introduce partially observable MDPs (POMDPs), in which there is uncertainty not only in the results of the actions but also in the state
Markov Decision Processes
• A Markov decision process (MDP) models a sequential decision problem, in which a system evolves over time and is controlled by an agent
• The system dynamics are governed by a probabilistic transition function Φ that maps states S and actions A to new states S′
• At each time step, the agent receives a reward R that depends on the current state s and the applied action a
Markov Decision Processes
Example - robot in the grid world
Markov Decision Processes
Grid World
• The robot's possible actions are to move to the neighboring cells (up, down, left, right)
• There is uncertainty in the result of each action taken by the robot. For example, if the selected action is up, the robot goes to the upper cell with a probability of 0.8, and with probability 0.2 to other cells
• The objective of the robot is to reach the goal cell as fast as possible and avoid the dangers. This is achieved by solving the MDP that represents this problem, maximizing the expected reward
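The slip dynamics just described can be written as a small transition function. A minimal Python sketch; the 3×4 grid size and the even 0.1/0.1 split of the remaining probability between the two perpendicular moves are assumptions for illustration (the slides only state 0.8 versus 0.2):

```python
# Illustrative grid-world transition model: the intended move succeeds
# with probability 0.8; the remaining 0.2 is split evenly between the
# two perpendicular moves (an assumption -- the slides give only totals).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action, rows=3, cols=4):
    """Return {next_state: probability} for a rows x cols grid."""
    def move(s, a):
        r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
        # A move that would leave the grid keeps the robot in place
        return (r, c) if 0 <= r < rows and 0 <= c < cols else s
    probs = {}
    for a, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

Note that outcomes that coincide (e.g., bumping into a wall) accumulate their probability, so the distribution always sums to one.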
Markov Decision Processes
Example - policy
Markov Decision Processes Representation
MDP - formalization
• An MDP is a tuple M = ⟨S, A, Φ, R⟩, where S is a finite set of states {s1, . . . , sn}, A is a finite set of actions {a1, . . . , am}, and Φ : A × S × S → [0, 1] is the state transition function, specified as a probability distribution
• The probability of reaching state s′ by performing action a in state s is written as Φ(a, s, s′)
• R(s, a) is the reward that the agent receives if it takes action a in state s
Markov Decision Processes Representation
Horizon
• Depending on how far into the future (the horizon) we consider, there are two main types of MDPs: (i) finite horizon and (ii) infinite horizon
• Finite horizon problems consider that there exists a fixed, predetermined number of time steps for which we want to maximize the expected reward
• Infinite horizon problems do not have a fixed, predetermined number of time steps; the number could vary and in principle could be infinite – we will focus on this case
Markov Decision Processes Representation
Value
• A policy, π, for an MDP is a function π : S → A that specifies for each state, si, the action to be executed, ai
• Given a certain policy, the expected accumulated reward for a certain state, s, is known as the value of that state under the policy, Vπ(s):

Vπ(s) = R(s, a) + ∑s′∈S Φ(a, s, s′) Vπ(s′)    (1)

where a = π(s)
• R(s, a) represents the immediate reward given action a, and ∑s′∈S Φ(a, s, s′) Vπ(s′) is the expected value of the next states according to the chosen policy
Markov Decision Processes Representation
Discount factor
• For the infinite horizon case, a parameter known as the discount factor, 0 ≤ γ < 1, is included so that the sum converges
• This parameter can be interpreted as giving more value to rewards obtained at the present time than to those obtained in the future
• Including the discount factor, the value function is written as:

Vπ(s) = R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vπ(s′)    (2)
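For a fixed policy, equation (2) can be solved by repeatedly applying the backup until the values stop changing (or, equivalently, by solving a linear system). A minimal sketch; the dictionary encoding of S, R, and Φ is an illustrative assumption, not something fixed by the slides:

```python
# Iterative policy evaluation for Eq. (2): apply the fixed-policy
# Bellman backup until the largest change over states is below eps.
def evaluate_policy(S, policy, R, Phi, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = policy[s]  # the action prescribed by the policy
            v = R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V
```

Because 0 ≤ γ < 1, the backup is a contraction, so this loop is guaranteed to converge to the unique fixed point of (2).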
Markov Decision Processes Representation
Bellman Equation
• What is desired is to find the policy that maximizes the expected reward; that is, the policy that gives the highest value for all states
• For the discounted infinite-horizon case with any given discount factor γ, there is a policy π∗ that is optimal regardless of the starting state and that satisfies what is known as the Bellman equation:

V∗(s) = maxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) V∗(s′)]    (3)

• The policy that maximizes the previous equation is then the optimal policy, π∗:

π∗(s) = argmaxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) V∗(s′)]    (4)
Evaluation
• There are three basic methods for solving an MDP and finding an optimal policy: (a) value iteration, (b) policy iteration, and (c) linear programming
• The first two techniques solve the problem iteratively, improving an initial value function or policy, respectively
• The third one transforms the problem into a linear program, which can then be solved using standard optimization techniques such as the simplex method
Evaluation Value Iteration
Value Iteration
• Value iteration starts by assigning an initial value to each state; usually this value is the immediate reward for that state
• These estimates of the values are then improved in each iteration by maximizing using the Bellman equation
• The process terminates when the values for all states converge
• The actions selected in the last iteration correspond to the optimal policy
Evaluation Value Iteration
Algorithm
• (Initialization) ∀s: V0(s) = R(s, a)
• t = 1
• REPEAT (iterative improvement):
  ∀s: Vt(s) = maxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vt−1(s′)]
• UNTIL ∀s: | Vt(s) − Vt−1(s) | < ε
• (Obtain the optimal policy)
  π∗(s) = argmaxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vt(s′)]
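The algorithm above can be sketched as follows. The dictionary encoding of S, A, R, and Φ is an illustrative assumption; the initialization uses maxa R(s, a) rather than the slide's R(s, a) so that V0 is well defined without fixing a particular action:

```python
# Value iteration: Bellman backups until the max change over states
# is below eps, then extract the greedy (optimal) policy.
def value_iteration(S, A, R, Phi, gamma=0.9, eps=1e-6):
    V = {s: max(R[s][a] for a in A) for s in S}   # initialization
    while True:
        Vn = {s: max(R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
                     for a in A) for s in S}
        if max(abs(Vn[s] - V[s]) for s in S) < eps:
            break
        V = Vn
    # The greedy actions w.r.t. the converged values form the policy
    policy = {s: max(A, key=lambda a: R[s][a] +
                     gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S))
              for s in S}
    return V, policy
```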
Evaluation Value Iteration
Analysis
• The time complexity of the algorithm is quadratic in the number of state–action pairs
• Usually the policy converges before the values converge; that is, the policy stops changing even though the values have not yet converged. This gives rise to the second approach, policy iteration
Evaluation Policy Iteration
Policy Iteration
• Policy iteration starts by selecting a random initial policy
• The policy is then iteratively improved by selecting, for each state, the action that increases the expected value the most
• The algorithm terminates when the policy converges, that is, when the policy does not change from the previous iteration
Evaluation Policy Iteration
Algorithm
• π0 : ∀sa0(s) = ak (Initialize the policy)• t = 1• REPEAT• Calculate values for the current policy:• ∀sVπt−1
t (s) = R(s,a) + γ∑
s∈S Φ(a, s, s′)Vt−1(s′)• Iterative improvement:• ∀sπt (s) = argmaxaR(s,a) + γ
∑s∈S Φ(a, s, s′)Vt (s′)
• UNTIL πt = πt−1
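A sketch of the algorithm above, under the same illustrative dictionary encoding as before; policy evaluation is done here by iterating the fixed-policy backup rather than by solving the linear system directly:

```python
# Policy iteration: evaluate the current policy, improve it greedily,
# and stop when the policy no longer changes.
def policy_iteration(S, A, R, Phi, gamma=0.9, eps=1e-8):
    policy = {s: A[0] for s in S}                 # arbitrary initial policy
    while True:
        V = {s: 0.0 for s in S}                   # policy evaluation
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]
                v = R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < eps:
                break
        new = {s: max(A, key=lambda a: R[s][a] +  # greedy improvement
                      gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S))
               for s in S}
        if new == policy:
            return policy, V
        policy = new
```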
Evaluation Policy Iteration
Analysis
• Policy iteration tends to converge in fewer iterations than value iteration; however, the computational cost of each iteration is higher, as the values have to be updated
• Solving small MDPs with the previous algorithms is very efficient; however, it becomes difficult when the state–action space is very large
• An alternative is to decompose the state space and take advantage of the independence relations to reduce the memory and computation requirements, using a graphical model-based representation of MDPs known as factored MDPs
Factored MDPs
• In a factored MDP, the set of states is described via a set of random variables X = {X1, . . . , Xn}
• The transition model and reward function can become exponentially large if they are explicitly represented as matrices; however, the frameworks of dynamic Bayesian networks and decision trees give us the tools to describe the transition model and the reward function concisely
Factored MDPs
Transition function - DBN
• The transition function for each action, a, is representedas a two–stage dynamic Bayesian network, that is atwo–layer directed acyclic graph GT whose nodes areX1, . . . ,Xn,X ′1, . . . ,X
′n
• Each node X ′i is associated with a conditionalprobability distribution PΦ(X ′i | Parents(X ′i ))
• The transition probability Φ(a, si , s′i ) is then defined tobe ΠiPΦ(x ′i | ui) where ui represents the values of thevariables in Parents(X ′i )
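The product form Φ(a, s, s′) = Πi PΦ(x′i | ui) can be illustrated with two binary variables for a single action; all the CPT numbers below are made up for the example:

```python
# Factored transition for one action with two binary variables:
# X1' depends on X1; X2' depends on X1 and X2. The full next-state
# probability is the product of the per-variable CPT entries.
# All probability values here are hypothetical illustration numbers.
cpt1 = {(0,): 0.9, (1,): 0.2}          # P(X1'=1 | X1)
cpt2 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.7, (1, 1): 0.95}     # P(X2'=1 | X1, X2)

def phi(s, s2):
    """P(s2 | s) = P(x1' | x1) * P(x2' | x1, x2)."""
    p1 = cpt1[(s[0],)] if s2[0] == 1 else 1 - cpt1[(s[0],)]
    p2 = cpt2[(s[0], s[1])] if s2[1] == 1 else 1 - cpt2[(s[0], s[1])]
    return p1 * p2
```

The point of the factorization is storage: here 2 + 4 CPT entries replace the 4 × 4 flat transition matrix, and the saving grows exponentially with the number of variables.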
Factored MDPs
Reward function - DT
• The reward associated with a state often depends only on the values of certain features of the state
• The relationship between rewards and state variables can be represented with value nodes in an influence diagram
• Although in the worst case the conditional reward table (CRT) takes exponential space to store the reward function, in many cases the reward function exhibits structure, allowing it to be represented compactly using decision trees or graphs
Factored MDPs
Example - factored MDP
Factored MDPs
Algebraic Decision Diagrams
• The representation can be compacted even more by representing a CPT as a tree or a graph, such that repeated probability values appear only once in the leaves of these graphs
• A particular representation that is very efficient is the algebraic decision diagram, or ADD
• Based on this compact representation, very efficient versions of the value and policy iteration algorithms have been developed that also reduce the computational time required to solve complex MDP models. An example of this is the SPUDD algorithm
Factored MDPs
ADD
Factored MDPs Abstraction
Abstraction
• The idea of abstraction is to reduce the state space by creating an abstract model where states with similar features are grouped together
• Equivalent states are those that have the same transition and reward functions; these can be grouped together without altering the original model
• Further reductions can be achieved by grouping similar states; this results in approximate models and creates a trade-off between the precision of the model (and the resulting policy) and its complexity
• There are different alternatives: partition the state space into a set of blocks such that each block is stable, or partition the state space into qualitative states that have similar reward functions
Factored MDPs Decomposition
Decomposition
• Decomposition consists of dividing the global problem into smaller subproblems that are solved independently and whose solutions are then combined
• There are two main types of decomposition: (i) serial or hierarchical, and (ii) parallel or concurrent
Factored MDPs Decomposition
Hierarchical MDPs
• Hierarchical MDPs provide a sequential decomposition, in which different subgoals are solved in sequence to reach the final goal
• Hierarchical MDPs accelerate the solution of complex problems by defining different subtasks that correspond to intermediate goals, solving for each subgoal, and then combining these subprocesses to solve the overall problem
• Examples: HAM and MAXQ
Factored MDPs Decomposition
Concurrent MDPs
• In concurrent or parallel MDPs, the subtasks are executed in parallel to solve the global task
• They consider that the task can be divided into several relatively independent subtasks that can be solved independently, with the solutions then combined to solve the global problem
• An alternative is to consider conflicts or restrictions between tasks: initially solve each subtask independently and, when the solutions are combined, take into account potential conflicts between the partial policies, resolving these conflicts to obtain a global, approximately optimal policy
POMDPs
Partially observable MDPs
• In some domains the state cannot be observed completely; there is only partial information about the state of the system – a POMDP
• In this case, there are certain observations from which the state can be estimated probabilistically
• Consider the previous example of the robot in the grid world. It could be that the robot cannot determine precisely the cell where it is (its state), but it can estimate the probability of being in each cell by observing the surrounding environment
POMDPs
Formalization
• A POMDP is a tuple M = ⟨S, A, Φ, R, O, Ω, Π⟩. The first four elements are the same as in an MDP
• O is a finite set of observations {o1, . . . , ol}
• Ω : S × O → [0, 1] is the observation function, specified as a probability distribution, which gives the probability of an observation o given that the process is in state s, P(o | s)
• Π is the initial state distribution, which specifies the probability of being in state s at t = 0
POMDPs
Solving a POMDP
• In a POMDP the current state is not known with certainty; only the probability distribution over states, known as the belief state, is available
• Solving a POMDP thus requires finding a mapping from the belief space to the action space – equivalent to a continuous state space MDP
• Solving a POMDP is much more difficult than solving an MDP, as the belief space is in principle infinite
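Although the slides do not give the formula, the belief state is maintained by the standard Bayes-filter update, b′(s′) ∝ P(o | s′) ∑s Φ(a, s, s′) b(s), after each action a and observation o. A minimal sketch with the same illustrative dictionary models used earlier:

```python
# Bayes-filter belief update for a POMDP: predict with the transition
# model, correct with the observation model, then normalize.
def belief_update(b, a, o, Phi, Omega, S):
    b2 = {}
    for s2 in S:
        pred = sum(Phi[a][s][s2] * b[s] for s in S)   # prediction step
        b2[s2] = Omega[s2][o] * pred                  # correction step
    z = sum(b2.values())                              # normalization
    return {s2: p / z for s2, p in b2.items()}
```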
POMDPs
Approximate Solution Techniques
• Policy tree and DDN techniques: assuming a finite horizon, a POMDP can be represented as a policy tree (similar to a decision tree) or as a dynamic decision network; then, algorithms for solving these types of models can be applied.
• Sampling techniques: the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief states which are not in the set of sampling points.
• Value function approximation techniques: given that the value function is convex and piecewise linear, it can be described as a set of vectors (called α vectors). Thus, it can be approximated by a set of vectors that dominate the others, and in this way an approximate solution is found.
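The α-vector description in the last item means V(b) = maxα ∑s α(s) b(s), so evaluating a belief against a vector set is a maximum of dot products; the vectors and action labels below are illustrative:

```python
# Piecewise-linear convex value function as alpha vectors: the value
# of a belief is the maximum dot product over the vector set, and the
# maximizing vector's associated action is the one to execute.
def pomdp_value(belief, alpha_vectors):
    """belief: list of probabilities; alpha_vectors: list of (action, vector)."""
    best_action, best_value = None, float("-inf")
    for action, vec in alpha_vectors:
        v = sum(p * a for p, a in zip(belief, vec))
        if v > best_value:
            best_action, best_value = action, v
    return best_action, best_value
```

Pruning the set to the vectors that dominate somewhere in the belief simplex is exactly the approximation mentioned above.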
Applications
• Assisting power plant operators in the operation of a power plant under difficult situations
• Coordinating a set of modules to solve a complex task for service robots
Applications Power Plant Operation
Power Plant Operation Assistant
• The steam generation system of a combined-cycle power plant provides superheated steam to a steam turbine
• During normal operation, a three-element feedwater control system commands the feedwater control valve (fwv) to regulate the level (dl) and pressure (pd) in the drum
• However, this traditional controller does not consider the possibility of failures in the control loop, and it ignores whether the outcomes of executing a decision will help to increase the steam drum lifetime, security, and productivity
• This problem was modeled as an MDP – for training and assistance of power plant operators
Applications Power Plant Operation
AsistO
• AsistO is based on a decision-theoretic model – an MDP – that represents the main elements of the steam generation system of a combined-cycle power plant
• The main variables in the steam generation system represent the state in a factored form
• The actions correspond to the control of the main valves in this subsystem of the power plant: the feed-water valve (fwv) and the main steam valve (msv)
• The reward function is defined in terms of a recommended operation curve for the relation between the drum pressure and the steam flow; the control actions should try to maintain the plant within this recommended operation curve
• The transition function is learned by using the power plant simulator and sampling the state and action spaces
Applications Power Plant Operation
Recommended Operation Curve
Applications Power Plant Operation
Experiments
• Model: five state variables, Fms, Ffw, Pd, g, d; and four actions: open/close the feed-water (fwv) and main steam (msv) valves
• The reward function was defined based on the recommended operation curve
• The memory requirements for a flat MDP representation and a factored representation were compared: the flat MDP required 589,824 parameters (probability values), while the factored MDP required only 758
• The optimal solution for the factored MDP was obtained in less than two minutes on a standard personal computer
Applications Robot Task Coordination
Robot Task Coordination
• A service robot should combine several capabilities, such as localization and navigation, obstacle avoidance, people detection and recognition, object recognition and manipulation, etc.
• These different capabilities can be implemented as independent software modules, which can then be combined to solve a particular task
• It is necessary to coordinate the different modules to perform a task, ideally in an optimal way
Applications Robot Task Coordination
Task coordination with MDPs
• Markov decision processes provide an appropriate framework for task coordination for service robots
• Once a task is modeled as an MDP, the MDP can be solved to obtain a policy to perform the task, which is robust with respect to the uncertainty in the results of the different actions
• Under this framework, based on general software modules and an MDP-based coordinator, it is in principle relatively easy for a service robot to solve different tasks: we just need to modify the MDP reward function according to the new task objectives
• Additionally, it is desirable for the robot to perform several actions simultaneously, such as navigating to a certain location, avoiding obstacles, and looking for people, all at the same time – this implies an explosion in the action–state space
Applications Robot Task Coordination
Concurrent MDPs
• Based on functional decomposition, a complex task is partitioned into several subtasks
• Each subtask is represented as an MDP and solved independently, and the policies are executed in parallel
• However, conflicts may arise between the subtasks: (i) resource conflicts, and (ii) behavior conflicts
Applications Robot Task Coordination
Resource Conflicts
• Resource conflicts occur when two actions require the same physical resource (e.g., to control the wheels of a robot) and cannot be executed concurrently
• This type of conflict is solved off-line by a two-phase process
• In the first phase, an optimal policy is obtained for each subtask (MDP); if there is a conflict between the actions selected by each MDP for a certain state, the one with maximum value is chosen, and the state is marked as a conflict state
• This initial solution is improved in a second phase using policy iteration
Applications Robot Task Coordination
Behavior Conflicts
• Behavior conflicts arise in situations in which it is possible to execute two (or more) actions at the same time, but it is not desirable given the application
• Behavior conflicts are solved on-line based on a set of restrictions specified by the user
• If there are no restrictions, all the actions are executed concurrently; otherwise, a constraint satisfaction module selects the set of actions with the highest expected utility
Applications Robot Task Coordination
Experiments
• An experiment was done with Markovito, a service robot, which performed a delivery task
Applications Robot Task Coordination
Delivery Task
• The goal is for the robot to receive and deliver amessage, an object or both, under a user’s request
• Subtasks:1 navigation, the robot navigates safely in different
scenarios;2 vision, for looking and recognizing people and objects;3 interaction, for listening and talking with a user;4 manipulation, to receive and deliver an object safely;
and5 expression, to show emotions using an animated face.
• Each subtask is represented as a factored MDP
Applications Robot Task Coordination
Representation
• If we represent this task as a single, flat MDP, there are 1,179,648 states (considering all the non-duplicated state variables) and 1,536 action combinations, giving a total of nearly two billion state–action pairs
• The MDP model for each subtask was defined using a structured representation
• The transition and reward functions were specified by the user based on task knowledge and intuition
• In this task, conflicts might arise between the different subtasks, so conflict resolution strategies need to be included
Applications Robot Task Coordination
Behavior Conflicts
action(s)                    | restriction | action(s)
get message                  | not during  | turn OR advance
ask user name                | not before  | recognize user
recognize user               | not start   | avoid obstacle
get object OR deliver object | not during  | directed towards OR turn OR moving
Applications Robot Task Coordination
Evaluation
• For comparison, the delivery task was solved under twoconditions: (i) without restrictions and (ii) withrestrictions
• In the case without restrictions, the robot performs some undesirable behaviors; for example, in a typical experiment, the robot takes a long time to identify the person who wants to send a message
• In the case where restrictions were used, these allowed a more fluid and efficient solution
• On average, the version with restrictions takes about 50% of the time steps required by the version without restrictions to complete the task
References
Book
Sucar, L. E.: Probabilistic Graphical Models, Springer (2015) – Chapter 11
References
Additional Reading (1)
Bellman, R.: Dynamic Programming. Princeton University Press, Princeton, New Jersey (1957)
Corona, E., Morales, E.F., Sucar, L.E.: Solving Policy Conflicts in Concurrent Markov Decision Processes. ICAPS Workshop on Planning and Scheduling Under Uncertainty (2010)
Corona, E., Sucar, L.E.: Task Coordination for Service Robots Based on Multiple Markov Decision Processes. In: L. E. Sucar, J. Hoey, E. Morales (eds.) Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, IGI Global, Hershey (2011)
Dean, T., Givan, R.: Model Minimization in Markov Decision Processes. In: Proceedings of the 14th National Conference on Artificial Intelligence (AAAI), pp. 106–111 (1997)
References
Additional Reading (2)
Dietterich, T.: Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13, 227–303 (2000)
Hoey, J., St-Aubin, R., Hu, A., Boutilier, C.: SPUDD: Stochastic Planning Using Decision Diagrams. In: Proceedings UAI, pp. 279–288 (1999)
Li, L., Walsh, T. J., Littman, M. L.: Towards a Unified Theory of State Abstraction for MDPs. In: Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, pp. 21–30 (2006)
Meuleau, N., Hauskrecht, M., Kim, K.E., Peshkin, L., Kaelbling, L.P., Dean, T., Boutilier, C.: Solving Very Large Weakly Coupled Markov Decision Processes. In: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), pp. 165–172 (1998)
References
Additional Reading (3)
Parr, R., Russell, S. J.: Reinforcement Learning with Hierarchies of Machines. In: Proceedings NIPS (1997)
Poupart, P.: An Introduction to Fully and Partially Observable Markov Decision Processes. In: L. E. Sucar, J. Hoey, E. Morales (eds.) Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, IGI Global, Hershey (2011)
Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (1994)
Reyes, A., Sucar, L.E., Morales, E.F.: AsistO: A Qualitative MDP-Based Recommender System for Power Plant Operation. Computación y Sistemas, 13 (1), 5–20 (2009)