TRANSCRIPT
Markov Decision Processes
Probabilistic Graphical Models
L. Enrique Sucar, INAOE
Outline
1 Introduction
2 Markov Decision Processes: Representation
3 Evaluation: Value Iteration, Policy Iteration
4 Factored MDPs: Abstraction, Decomposition
5 POMDPs
6 Applications: Power Plant Operation, Robot Task Coordination
7 References
Introduction
• Sequential decision problems are those that involve a series of decisions over time
• Considering that there is uncertainty in the results of the agent's decisions, these types of problems can be modeled as Markov decision processes (MDPs)
• By solving the MDP model we obtain what is known as a policy, which indicates to the agent which action to select at each time step based on its current state; the optimal policy is the one that selects the actions so that the expected value is maximized
• Finally, we will introduce partially observable MDPs (POMDPs), in which there is uncertainty not only in the results of the actions but also in the state
Markov Decision Processes
• A Markov decision process (MDP) models a sequential decision problem, in which a system evolves over time and is controlled by an agent
• The system dynamics are governed by a probabilistic transition function Φ that maps states S and actions A to new states S′
• At each time step, the agent receives a reward R that depends on the current state s and the applied action a
Markov Decision Processes
Example - robot in the grid world
Markov Decision Processes
Grid World
• The robot's possible actions are to move to the neighboring cells (up, down, left, right)
• There is uncertainty in the result of each action taken by the robot. For example, if the selected action is up, the robot goes to the upper cell with a probability of 0.8, and with probability 0.2 to other cells
• The objective of the robot is to reach the goal cell as fast as possible and avoid the dangers. This is achieved by solving the MDP that represents this problem, maximizing the expected reward
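The slip dynamics just described can be written as a small transition function. A minimal Python sketch; the 3×4 grid size and the even 0.1/0.1 split of the remaining probability between the two perpendicular moves are assumptions for illustration (the slides only state 0.8 versus 0.2):

```python
# Illustrative grid-world transition model: the intended move succeeds
# with probability 0.8; the remaining 0.2 is split evenly between the
# two perpendicular moves (an assumption -- the slides give only totals).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action, rows=3, cols=4):
    """Return {next_state: probability} for a rows x cols grid."""
    def move(s, a):
        r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
        # A move that would leave the grid keeps the robot in place
        return (r, c) if 0 <= r < rows and 0 <= c < cols else s
    probs = {}
    for a, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

Note that outcomes that coincide (e.g., bumping into a wall) accumulate their probability, so the distribution always sums to one.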
Markov Decision Processes
Example - policy
Markov Decision Processes Representation
MDP - formalization
• An MDP is a tuple M = ⟨S, A, Φ, R⟩, where S is a finite set of states {s1, . . . , sn}, A is a finite set of actions {a1, . . . , am}, and Φ : A × S × S → [0, 1] is the state transition function, specified as a probability distribution
• The probability of reaching state s′ by performing action a in state s is written as Φ(a, s, s′)
• R(s, a) is the reward that the agent receives if it takes action a in state s
Markov Decision Processes Representation
Horizon
• Depending on how far into the future (the horizon) we consider, there are two main types of MDPs: (i) finite horizon and (ii) infinite horizon
• Finite horizon problems consider that there exists a fixed, predetermined number of time steps for which we want to maximize the expected reward
• Infinite horizon problems do not have a fixed, predetermined number of time steps; the number could vary and in principle could be infinite – we will focus on this case
Markov Decision Processes Representation
Value
• A policy, π, for an MDP is a function π : S → A that specifies for each state, si, the action to be executed, ai
• Given a certain policy, the expected accumulated reward for a certain state, s, is known as the value of that state under the policy, Vπ(s):

Vπ(s) = R(s, a) + ∑s′∈S Φ(a, s, s′) Vπ(s′)    (1)

where a = π(s)
• R(s, a) represents the immediate reward given action a, and ∑s′∈S Φ(a, s, s′) Vπ(s′) is the expected value of the next states according to the chosen policy
Markov Decision Processes Representation
Discount factor
• For the infinite horizon case, a parameter known as the discount factor, 0 ≤ γ < 1, is included so that the sum converges
• This parameter can be interpreted as giving more value to rewards obtained at the present time than to those obtained in the future
• Including the discount factor, the value function is written as:

Vπ(s) = R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vπ(s′)    (2)
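For a fixed policy, equation (2) can be solved by repeatedly applying the backup until the values stop changing (or, equivalently, by solving a linear system). A minimal sketch; the dictionary encoding of S, R, and Φ is an illustrative assumption, not something fixed by the slides:

```python
# Iterative policy evaluation for Eq. (2): apply the fixed-policy
# Bellman backup until the largest change over states is below eps.
def evaluate_policy(S, policy, R, Phi, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = policy[s]  # the action prescribed by the policy
            v = R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V
```

Because 0 ≤ γ < 1, the backup is a contraction, so this loop is guaranteed to converge to the unique fixed point of (2).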
Markov Decision Processes Representation
Bellman Equation
• What is desired is to find the policy that maximizes the expected reward; that is, the policy that gives the highest value for all states
• For the discounted infinite-horizon case with any given discount factor γ, there is a policy π∗ that is optimal regardless of the starting state and that satisfies what is known as the Bellman equation:

V∗(s) = maxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) V∗(s′)]    (3)

• The policy that maximizes the previous equation is then the optimal policy, π∗:

π∗(s) = argmaxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) V∗(s′)]    (4)
Evaluation
• There are three basic methods for solving an MDP and finding an optimal policy: (a) value iteration, (b) policy iteration, and (c) linear programming
• The first two techniques solve the problem iteratively, improving an initial value function or policy, respectively
• The third one transforms the problem into a linear program, which can then be solved using standard optimization techniques such as the simplex method
Evaluation Value Iteration
Value Iteration
• Value iteration starts by assigning an initial value to each state; usually this value is the immediate reward for that state
• These estimates of the values are then improved in each iteration by maximizing using the Bellman equation
• The process terminates when the values for all states converge
• The actions selected in the last iteration correspond to the optimal policy
Evaluation Value Iteration
Algorithm
• (Initialization) ∀s: V0(s) = R(s, a)
• t = 1
• REPEAT (iterative improvement):
  ∀s: Vt(s) = maxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vt−1(s′)]
• UNTIL ∀s: | Vt(s) − Vt−1(s) | < ε
• (Obtain the optimal policy)
  π∗(s) = argmaxa [R(s, a) + γ ∑s′∈S Φ(a, s, s′) Vt(s′)]
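The algorithm above can be sketched as follows. The dictionary encoding of S, A, R, and Φ is an illustrative assumption; the initialization uses maxa R(s, a) rather than the slide's R(s, a) so that V0 is well defined without fixing a particular action:

```python
# Value iteration: Bellman backups until the max change over states
# is below eps, then extract the greedy (optimal) policy.
def value_iteration(S, A, R, Phi, gamma=0.9, eps=1e-6):
    V = {s: max(R[s][a] for a in A) for s in S}   # initialization
    while True:
        Vn = {s: max(R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
                     for a in A) for s in S}
        if max(abs(Vn[s] - V[s]) for s in S) < eps:
            break
        V = Vn
    # The greedy actions w.r.t. the converged values form the policy
    policy = {s: max(A, key=lambda a: R[s][a] +
                     gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S))
              for s in S}
    return V, policy
```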
Evaluation Value Iteration
Analysis
• The time complexity of the algorithm is quadratic in the number of state–action pairs
• Usually the policy converges before the values converge; that is, the policy stops changing even though the values have not yet converged. This gives rise to the second approach, policy iteration
Evaluation Policy Iteration
Policy Iteration
• Policy iteration starts by selecting a random initial policy
• The policy is then iteratively improved by selecting, for each state, the action that increases the expected value the most
• The algorithm terminates when the policy converges, that is, when the policy does not change from the previous iteration
Evaluation Policy Iteration
Algorithm
• π0 : ∀sa0(s) = ak (Initialize the policy)• t = 1• REPEAT• Calculate values for the current policy:• ∀sVπt−1
t (s) = R(s,a) + γ∑
s∈S Φ(a, s, s′)Vt−1(s′)• Iterative improvement:• ∀sπt (s) = argmaxaR(s,a) + γ
∑s∈S Φ(a, s, s′)Vt (s′)
• UNTIL πt = πt−1
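A sketch of the algorithm above, under the same illustrative dictionary encoding as before; policy evaluation is done here by iterating the fixed-policy backup rather than by solving the linear system directly:

```python
# Policy iteration: evaluate the current policy, improve it greedily,
# and stop when the policy no longer changes.
def policy_iteration(S, A, R, Phi, gamma=0.9, eps=1e-8):
    policy = {s: A[0] for s in S}                 # arbitrary initial policy
    while True:
        V = {s: 0.0 for s in S}                   # policy evaluation
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]
                v = R[s][a] + gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S)
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < eps:
                break
        new = {s: max(A, key=lambda a: R[s][a] +  # greedy improvement
                      gamma * sum(Phi[a][s][s2] * V[s2] for s2 in S))
               for s in S}
        if new == policy:
            return policy, V
        policy = new
```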
Evaluation Policy Iteration
Analysis
• Policy iteration tends to converge in fewer iterations than value iteration; however, the computational cost of each iteration is higher, as the values have to be updated
• Solving small MDPs with the previous algorithms is very efficient; however, it becomes difficult when the state–action space is very large
• An alternative is to decompose the state space and take advantage of the independence relations to reduce the memory and computation requirements, using a graphical model-based representation of MDPs known as factored MDPs
Factored MDPs
• In a factored MDP, the set of states is described via a set of random variables X = {X1, . . . , Xn}
• The transition model and reward function can become exponentially large if they are explicitly represented as matrices; however, the frameworks of dynamic Bayesian networks and decision trees give us the tools to describe the transition model and the reward function concisely
Factored MDPs
Transition function - DBN
• The transition function for each action, a, is representedas a two–stage dynamic Bayesian network, that is atwo–layer directed acyclic graph GT whose nodes areX1, . . . ,Xn,X ′1, . . . ,X
′n
• Each node X ′i is associated with a conditionalprobability distribution PΦ(X ′i | Parents(X ′i ))
• The transition probability Φ(a, si , s′i ) is then defined tobe ΠiPΦ(x ′i | ui) where ui represents the values of thevariables in Parents(X ′i )
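The product form Φ(a, s, s′) = Πi PΦ(x′i | ui) can be illustrated with two binary variables for a single action; all the CPT numbers below are made up for the example:

```python
# Factored transition for one action with two binary variables:
# X1' depends on X1; X2' depends on X1 and X2. The full next-state
# probability is the product of the per-variable CPT entries.
# All probability values here are hypothetical illustration numbers.
cpt1 = {(0,): 0.9, (1,): 0.2}          # P(X1'=1 | X1)
cpt2 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.7, (1, 1): 0.95}     # P(X2'=1 | X1, X2)

def phi(s, s2):
    """P(s2 | s) = P(x1' | x1) * P(x2' | x1, x2)."""
    p1 = cpt1[(s[0],)] if s2[0] == 1 else 1 - cpt1[(s[0],)]
    p2 = cpt2[(s[0], s[1])] if s2[1] == 1 else 1 - cpt2[(s[0], s[1])]
    return p1 * p2
```

The point of the factorization is storage: here 2 + 4 CPT entries replace the 4 × 4 flat transition matrix, and the saving grows exponentially with the number of variables.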
Factored MDPs
Reward function - DT
• The reward associated with a state often depends only on the values of certain features of the state
• The relationship between rewards and state variables can be represented with value nodes in an influence diagram
• Although in the worst case the conditional reward table (CRT) takes exponential space to store the reward function, in many cases the reward function exhibits structure, allowing it to be represented compactly using decision trees or graphs
Factored MDPs
Example - factored MDP
Factored MDPs
Algebraic Decision Diagrams
• The representation can be compacted even more by representing a CPT as a tree or a graph, such that repeated probability values appear only once in the leaves of these graphs
• A particular representation that is very efficient is the algebraic decision diagram, or ADD
• Based on this compact representation, very efficient versions of the value and policy iteration algorithms have been developed that also reduce the computational time required to solve complex MDP models. An example of this is the SPUDD algorithm
Factored MDPs
ADD
Factored MDPs Abstraction
Abstraction
• The idea of abstraction is to reduce the state space by creating an abstract model where states with similar features are grouped together
• Equivalent states are those that have the same transition and reward functions; these can be grouped together without altering the original model
• Further reductions can be achieved by grouping similar states; this results in approximate models and creates a trade-off between the precision of the model (and the resulting policy) and its complexity
• There are different alternatives: partition the state space into a set of blocks such that each block is stable, or partition the state space into qualitative states that have similar reward functions
Factored MDPs Decomposition
Decomposition
• Decomposition consists of dividing the global problem into smaller subproblems that are solved independently and whose solutions are then combined
• There are two main types of decomposition: (i) serial or hierarchical, and (ii) parallel or concurrent
Factored MDPs Decomposition
Hierarchical MDPs
• Hierarchical MDPs provide a sequential decomposition, in which different subgoals are solved in sequence to reach the final goal
• Hierarchical MDPs accelerate the solution of complex problems by defining different subtasks that correspond to intermediate goals, solving for each subgoal, and then combining these subprocesses to solve the overall problem
• Examples: HAM and MAXQ
Factored MDPs Decomposition
Concurrent MDPs
• In concurrent or parallel MDPs, the subtasks are executed in parallel to solve the global task
• They consider that the task can be divided into several relatively independent subtasks that can be solved independently, with the solutions then combined to solve the global problem
• An alternative is to consider conflicts or restrictions between tasks: initially solve each subtask independently and, when the solutions are combined, take into account potential conflicts between the partial policies, resolving these conflicts to obtain a global, approximately optimal policy
POMDPs
Partially observable MDPs
• In some domains the state cannot be observed completely; there is only partial information about the state of the system – a POMDP
• In this case, there are certain observations from which the state can be estimated probabilistically
• Consider the previous example of the robot in the grid world. It could be that the robot cannot determine precisely the cell where it is (its state), but it can estimate the probability of being in each cell by observing the surrounding environment
POMDPs
Formalization
• A POMDP is a tuple M = ⟨S, A, Φ, R, O, Ω, Π⟩. The first four elements are the same as in an MDP
• O is a finite set of observations {o1, . . . , ol}
• Ω : S × O → [0, 1] is the observation function, specified as a probability distribution, which gives the probability of an observation o given that the process is in state s, P(o | s)
• Π is the initial state distribution, which specifies the probability of being in state s at t = 0
POMDPs
Solving a POMDP
• In a POMDP the current state is not known with certainty; only the probability distribution over states, known as the belief state, is available
• Solving a POMDP thus requires finding a mapping from the belief space to the action space – equivalent to a continuous state space MDP
• Solving a POMDP is much more difficult than solving an MDP, as the belief space is in principle infinite
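Although the slides do not give the formula, the belief state is maintained by the standard Bayes-filter update, b′(s′) ∝ P(o | s′) ∑s Φ(a, s, s′) b(s), after each action a and observation o. A minimal sketch with the same illustrative dictionary models used earlier:

```python
# Bayes-filter belief update for a POMDP: predict with the transition
# model, correct with the observation model, then normalize.
def belief_update(b, a, o, Phi, Omega, S):
    b2 = {}
    for s2 in S:
        pred = sum(Phi[a][s][s2] * b[s] for s in S)   # prediction step
        b2[s2] = Omega[s2][o] * pred                  # correction step
    z = sum(b2.values())                              # normalization
    return {s2: p / z for s2, p in b2.items()}
```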
POMDPs
Approximate Solution Techniques
• Policy tree and DDN techniques: assuming a finite horizon, a POMDP can be represented as a policy tree (similar to a decision tree) or as a dynamic decision network; then, algorithms for solving these types of models can be applied.
• Sampling techniques: the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief states which are not in the set of sampling points.
• Value function approximation techniques: given that the value function is convex and piecewise linear, it can be described as a set of vectors (called α vectors). Thus, it can be approximated by a set of vectors that dominate the others, and in this way an approximate solution is found.
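The α-vector description in the last item means V(b) = maxα ∑s α(s) b(s), so evaluating a belief against a vector set is a maximum of dot products; the vectors and action labels below are illustrative:

```python
# Piecewise-linear convex value function as alpha vectors: the value
# of a belief is the maximum dot product over the vector set, and the
# maximizing vector's associated action is the one to execute.
def pomdp_value(belief, alpha_vectors):
    """belief: list of probabilities; alpha_vectors: list of (action, vector)."""
    best_action, best_value = None, float("-inf")
    for action, vec in alpha_vectors:
        v = sum(p * a for p, a in zip(belief, vec))
        if v > best_value:
            best_action, best_value = action, v
    return best_action, best_value
```

Pruning the set to the vectors that dominate somewhere in the belief simplex is exactly the approximation mentioned above.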
Applications
• Assisting power plant operators in the operation of a power plant under difficult situations
• Coordinating a set of modules to solve a complex task for service robots
Applications Power Plant Operation
Power Plant Operation Assistant
• The steam generation system of a combined-cycle power plant provides superheated steam to a steam turbine
• During normal operation, a three-element feedwater control system commands the feedwater control valve (fwv) to regulate the level (dl) and pressure (pd) in the drum
• However, this traditional controller does not consider the possibility of failures in the control loop, and it ignores whether the outcomes of executing a decision will help to increase the steam drum lifetime, security, and productivity
• This problem was modeled as an MDP – for training and assistance of power plant operators
Applications Power Plant Operation
AsistO
• AsistO is based on a decision-theoretic model – an MDP – that represents the main elements of the steam generation system of a combined-cycle power plant
• The main variables in the steam generation system represent the state in a factored form
• The actions correspond to the control of the main valves in this subsystem of the power plant: the feed-water valve (fwv) and the main steam valve (msv)
• The reward function is defined in terms of a recommended operation curve for the relation between the drum pressure and the steam flow; the control actions should try to maintain the plant within this recommended operation curve
• The transition function is learned by using the power plant simulator and sampling the state and action spaces
Applications Power Plant Operation
Recommended Operation Curve
Applications Power Plant Operation
Experiments
• Model: five state variables, Fms, Ffw, Pd, g, d; and four actions: open/close the feed-water (fwv) and main steam (msv) valves
• The reward function was defined based on the recommended operation curve
• The memory requirements for a flat MDP representation and a factored representation were compared: the flat MDP required 589,824 parameters (probability values), while the factored MDP required only 758
• The optimal solution for the factored MDP was obtained in less than two minutes on a standard personal computer
Applications Robot Task Coordination
Robot Task Coordination
• A service robot should combine several capabilities, such as localization and navigation, obstacle avoidance, people detection and recognition, object recognition and manipulation, etc.
• These different capabilities can be implemented as independent software modules, which can then be combined to solve a particular task
• It is necessary to coordinate the different modules to perform a task, ideally in an optimal way
Applications Robot Task Coordination
Task coordination with MDPs
• Markov decision processes provide an appropriate framework for task coordination for service robots
• Once a task is modeled as an MDP, the MDP can be solved to obtain a policy to perform the task, which is robust with respect to the uncertainty in the results of the different actions
• Under this framework, based on general software modules and an MDP-based coordinator, it is in principle relatively easy for a service robot to solve different tasks: we just need to modify the MDP reward function according to the new task objectives
• Additionally, it is desirable for the robot to perform several actions simultaneously, such as navigating to a certain location, avoiding obstacles, and looking for people, all at the same time – this implies an explosion in the action–state space
Applications Robot Task Coordination
Concurrent MDPs
• Based on functional decomposition, a complex task is partitioned into several subtasks
• Each subtask is represented as an MDP and solved independently, and the policies are executed in parallel
• However, conflicts may arise between the subtasks: (i) resource conflicts, and (ii) behavior conflicts
Applications Robot Task Coordination
Resource Conflicts
• Resource conflicts occur when two actions require the same physical resource (e.g., to control the wheels of a robot) and cannot be executed concurrently
• This type of conflict is solved off-line by a two-phase process
• In the first phase, an optimal policy is obtained for each subtask (MDP); if there is a conflict between the actions selected by each MDP for a certain state, the one with maximum value is chosen, and the state is marked as a conflict state
• This initial solution is improved in a second phase using policy iteration
Applications Robot Task Coordination
Behavior Conflicts
• Behavior conflicts arise in situations in which it is possible to execute two (or more) actions at the same time, but it is not desirable given the application
• Behavior conflicts are solved on-line based on a set of restrictions specified by the user
• If there are no restrictions, all the actions are executed concurrently; otherwise, a constraint satisfaction module selects the set of actions with the highest expected utility
Applications Robot Task Coordination
Experiments
• An experiment was done with Markovito, a service robot, which performed a delivery task
Applications Robot Task Coordination
Delivery Task
• The goal is for the robot to receive and deliver amessage, an object or both, under a user’s request
• Subtasks:1 navigation, the robot navigates safely in different
scenarios;2 vision, for looking and recognizing people and objects;3 interaction, for listening and talking with a user;4 manipulation, to receive and deliver an object safely;
and5 expression, to show emotions using an animated face.
• Each subtask is represented as a factored MDP
Applications Robot Task Coordination
Representation
• If we represent this task as a single, flat MDP, there are 1,179,648 states (considering all the non-duplicated state variables) and 1,536 action combinations, giving a total of nearly two billion state–action pairs
• The MDP model for each subtask was defined using a structured representation
• The transition and reward functions were specified by the user based on task knowledge and intuition
• In this task, conflicts might arise between the different subtasks, so conflict resolution strategies need to be included
Applications Robot Task Coordination
Behavior Conflicts
action(s)                    | restriction | action(s)
get message                  | not during  | turn OR advance
ask user name                | not before  | recognize user
recognize user               | not start   | avoid obstacle
get object OR deliver object | not during  | directed towards OR turn OR moving
Applications Robot Task Coordination
Evaluation
• For comparison, the delivery task was solved under twoconditions: (i) without restrictions and (ii) withrestrictions
• In the case without restrictions, the robot performs some undesirable behaviors; for example, in a typical experiment, the robot takes a long time to identify the person who wants to send a message
• In the case where restrictions were used, these allowed a more fluid and efficient solution
• On average, the version with restrictions takes about 50% of the time steps required by the version without restrictions to complete the task
References
Book
Sucar, L. E.: Probabilistic Graphical Models, Springer (2015) – Chapter 11
References
Additional Reading (1)
Bellman, R.: Dynamic Programming. Princeton University Press, Princeton, New Jersey (1957)
Corona, E., Morales, E.F., Sucar, L.E.: Solving Policy Conflicts in Concurrent Markov Decision Processes. ICAPS Workshop on Planning and Scheduling Under Uncertainty (2010)
Corona, E., Sucar, L.E.: Task Coordination for Service Robots Based on Multiple Markov Decision Processes. In: L. E. Sucar, J. Hoey, E. Morales (eds.) Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, IGI Global, Hershey (2011)
Dean, T., Givan, R.: Model Minimization in Markov Decision Processes. In: Proceedings of the 14th National Conference on Artificial Intelligence (AAAI), pp. 106–111 (1997)
References
Additional Reading (2)
Dietterich, T.: Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13, 227–303 (2000)
Hoey, J., St-Aubin, R., Hu, A., Boutilier, C.: SPUDD: Stochastic Planning Using Decision Diagrams. In: Proceedings UAI, pp. 279–288 (1999)
Li, L., Walsh, T. J., Littman, M. L.: Towards a Unified Theory of State Abstraction for MDPs. In: Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, pp. 21–30 (2006)
Meuleau, N., Hauskrecht, M., Kim, K.E., Peshkin, L., Kaelbling, L.P., Dean, T., Boutilier, C.: Solving Very Large Weakly Coupled Markov Decision Processes. In: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), pp. 165–172 (1998)
References
Additional Reading (3)
Parr, R., Russell, S. J.: Reinforcement Learning with Hierarchies of Machines. In: Proceedings NIPS (1997)
Poupart, P.: An Introduction to Fully and Partially Observable Markov Decision Processes. In: L. E. Sucar, J. Hoey, E. Morales (eds.) Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, IGI Global, Hershey (2011)
Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (1994)
Reyes, A., Sucar, L.E., Morales, E.F.: AsistO: A Qualitative MDP-Based Recommender System for Power Plant Operation. Computación y Sistemas, 13 (1), 5–20 (2009)