1 Markov Decision Processes Basics Concepts Alan Fern

Upload: harold-terry

Post on 17-Dec-2015


Page 1: 1 Markov Decision Processes Basics Concepts Alan Fern

Markov Decision Processes: Basic Concepts

Alan Fern

Some AI Planning Problems

• Fire & Rescue Response Planning
• Solitaire
• Real-Time Strategy Games
• Helicopter Control
• Legged Robot Control
• Network Security/Control

Some AI Planning Problems

• Health Care
  - Personalized treatment planning
  - Hospital Logistics/Scheduling
• Transportation
  - Autonomous Vehicles
  - Supply Chain Logistics
  - Air traffic control
• Assistive Technologies
  - Dialog Management
  - Automated assistants for elderly/disabled
  - Household robots
• Sustainability
  - Smart grid
  - Forest fire management
• …

Common Elements

• We have a system that changes state over time
• Can (partially) control the system state transitions by taking actions
• Problem gives an objective that specifies which states (or state sequences) are more/less preferred
• Problem: At each moment must select an action to optimize the overall (long-term) objective
  - Produce most preferred state sequences

Observe-Act Loop of AI Planning Agent

[Figure: an agent (marked "????") receives observations of the state of the world/system and sends actions back to the world/system]

Goal: maximize expected reward over lifetime

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model

[Figure: an agent (marked "????") observes the world state and sends an action from a finite set to the Markov Decision Process]

Goal: maximize expected reward over lifetime

Example MDP

[Figure: a dice game, with the agent marked "????"]

• State describes all visible info about dice
• Actions are the different choices of dice to roll
• Goal: maximize score or score more than opponent

Markov Decision Processes

• An MDP has four components: S, A, R, T
  - finite state set S (|S| = n)
  - finite action set A (|A| = m)
  - transition function T(s,a,s’) = Pr(s’ | s,a)
    - Probability of going to state s’ after taking action a in state s
  - bounded, real-valued reward function R(s,a)
    - Immediate reward we get for being in state s and taking action a
    - Roughly speaking, the objective is to select actions in order to maximize total reward
    - For example, in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 reward for non-goal states)
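To make the four components concrete, here is a minimal Python sketch of an explicit MDP. The class name `MDP`, the dictionary layout, and the two-state `toy` domain are illustrative assumptions, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]                                 # S
    actions: List[Action]                               # A
    T: Dict[Tuple[State, Action], Dict[State, float]]   # T[(s, a)][s'] = Pr(s' | s, a)
    R: Dict[Tuple[State, Action], float]                # R[(s, a)] = immediate reward

    def check(self) -> None:
        """Verify every (s, a) pair has a reward and a proper distribution."""
        for s in self.states:
            for a in self.actions:
                assert (s, a) in self.R
                probs = self.T[(s, a)]
                assert all(p >= 0.0 for p in probs.values())
                assert abs(sum(probs.values()) - 1.0) < 1e-9

# Hypothetical goal-based domain: reward 1 in the goal state, 0 elsewhere.
toy = MDP(
    states=["start", "goal"],
    actions=["stay", "go"],
    T={
        ("start", "stay"): {"start": 1.0},
        ("start", "go"): {"start": 0.2, "goal": 0.8},
        ("goal", "stay"): {"goal": 1.0},
        ("goal", "go"): {"goal": 1.0},
    },
    R={
        ("start", "stay"): 0.0, ("start", "go"): 0.0,
        ("goal", "stay"): 1.0, ("goal", "go"): 1.0,
    },
)
toy.check()
```

Storing T as a table of next-state distributions only works when S is small; the simulator-based view later in the deck drops that requirement.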

Probabilistic state transition

[Figure: from the current state, the action "roll the two 5’s" leads to one of the possible outcomes 1,1; 1,2; …; 6,6]

What is a solution to an MDP?

MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: ????

• Should the solution to an MDP be just a sequence of actions such as (a1, a2, a3, …)?
  - Consider a single-player card game like Blackjack/Solitaire.
• No! In general an action sequence is not sufficient
  - Actions have stochastic effects, so the state we end up in is uncertain
  - This means that we might end up in states where the remainder of the action sequence doesn’t apply or is a bad choice
  - A solution should tell us what the best action is for any possible situation/state that might arise

Policies (“plans” for MDPs)

• A solution to an MDP is a policy
  - Two types of policies: nonstationary and stationary
• Nonstationary policies are used when we are given a finite planning horizon H
  - I.e. we are told how many actions we will be allowed to take
• Nonstationary policies are functions from states and times to actions
  - π: S × T → A, where T is the set of non-negative integers (not the transition function)
  - π(s,t) tells us what action to take at state s when there are t stages-to-go (note that we are using the convention that t represents stages/decisions to go, rather than the time step)

Policies (“plans” for MDPs)

• What if we want to continue taking actions indefinitely?
  - Use stationary policies
• A stationary policy is a mapping from states to actions
  - π: S → A
  - π(s) is the action to do at state s (regardless of time)
  - specifies a continuously reactive controller
• Note that both nonstationary and stationary policies assume or have these properties:
  - full observability of the state
  - history-independence
  - deterministic action choice
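The two policy types differ only in their signature. The sketch below writes both as plain Python functions; the state and action names are hypothetical, and the rule inside `nonstationary` is an arbitrary illustration rather than an optimal policy.

```python
from typing import Callable

State = str
Action = str

StationaryPolicy = Callable[[State], Action]            # pi(s)
NonstationaryPolicy = Callable[[State, int], Action]    # pi(s, t), t = stages-to-go

def stationary(s: State) -> Action:
    """Same action in a given state, regardless of time."""
    return "go" if s == "start" else "stay"

def nonstationary(s: State, t: int) -> Action:
    """Action may depend on how many stages remain (arbitrary example rule)."""
    if s == "start" and t > 1:
        return "go"
    return "stay"

def as_nonstationary(pi: StationaryPolicy) -> NonstationaryPolicy:
    """Any stationary policy can be used as a nonstationary one by ignoring t."""
    return lambda s, t: pi(s)
```

Both are deterministic and depend only on the current state (plus stages-to-go), matching the properties listed above.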

What is a solution to an MDP?

MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: a policy such that ????

• We don’t want to output just any policy
• We want to output a “good” policy
• One that accumulates lots of reward

Value of a Policy

• How good is a policy π?
  - How do we measure reward “accumulated” by π?
• Value function V: S → ℝ associates value with each state (or each state and time for non-stationary π)
• Vπ(s) denotes value of policy at state s
  - Depends on immediate reward, but also what you achieve subsequently by following π
  - An optimal policy is one that is no worse than any other policy at any state
• The goal of MDP planning is to compute an optimal policy

What is a solution to an MDP?

MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: a policy that achieves an “optimal value”

• This depends on how we define the value of a policy
• There are several choices and the solution algorithms depend on the choice
• We will consider finite-horizon value

Finite-Horizon Value Functions

• We first consider maximizing expected total reward over a finite horizon
• Assumes the agent has H time steps to live
• To act optimally, should the agent use a stationary or non-stationary policy?
  - I.e. should the action it takes depend on absolute time?
• Put another way:
  - If you had only one week to live would you act the same way as if you had fifty years to live?

Finite Horizon Problems

• Value (utility) depends on stage-to-go
  - hence use a nonstationary policy
• $V_\pi^k(s)$ is the k-stage-to-go value function for π
  - expected total reward for executing π starting in s for k time steps

$$V_\pi^k(s) = E\left[\sum_{t=0}^{k} R^t \,\middle|\, \pi, s\right] = E\left[\sum_{t=0}^{k} R(s^t) \,\middle|\, a^t = \pi(s^t, k-t),\; s^0 = s\right]$$

• Here $R^t$ and $s^t$ are random variables denoting the reward received and state at time-step t when starting in s
  - These are random variables since the world is stochastic

Computational Problems

• There are two problems that are of typical interest
• Policy evaluation:
  - Given an MDP and a policy π
  - Compute the finite-horizon value function $V_\pi^k(s)$ for any k and s
• Policy optimization:
  - Given an MDP and a horizon H
  - Compute the optimal finite-horizon policy
• How many finite horizon policies are there?
  - $|A|^{Hn}$ (one action choice for each of the n states at each of the H stages)
  - So can’t just enumerate policies for efficient optimization
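For a small, explicitly given MDP, both problems can be solved by recursing on stages-to-go instead of enumerating policies. The sketch below is one hedged way to do that. It assumes rewards of the form R(s,a), and it assumes the convention that the 0-stages-to-go value collects one final immediate reward. The toy tables `S`, `A`, `T`, `R` are hypothetical.

```python
# Hypothetical toy MDP: T[s][a] = {s': Pr(s' | s, a)}, R[s][a] = immediate reward.
S = ["start", "goal"]
A = ["stay", "go"]
T = {"start": {"stay": {"start": 1.0}, "go": {"start": 0.2, "goal": 0.8}},
     "goal": {"stay": {"goal": 1.0}, "go": {"goal": 1.0}}}
R = {"start": {"stay": 0.0, "go": 0.0}, "goal": {"stay": 1.0, "go": 1.0}}

def backup(s, a, V_prev):
    """Immediate reward plus expected value of the successor state."""
    return R[s][a] + sum(p * V_prev[s2] for s2, p in T[s][a].items())

def evaluate(pi, H):
    """Policy evaluation: V[k][s] is the k-stages-to-go value of pi(s, k)."""
    V = [{s: R[s][pi(s, 0)] for s in S}]       # k = 0: one last immediate reward
    for k in range(1, H + 1):
        V.append({s: backup(s, pi(s, k), V[k - 1]) for s in S})
    return V

def optimize(H):
    """Policy optimization: pick the best action for every (stages-to-go, state)."""
    policy = [{s: max(A, key=lambda a, s=s: R[s][a]) for s in S}]
    V = [{s: R[s][policy[0][s]] for s in S}]
    for k in range(1, H + 1):
        best = {s: max(A, key=lambda a, s=s: backup(s, a, V[k - 1])) for s in S}
        V.append({s: backup(s, best[s], V[k - 1]) for s in S})
        policy.append(best)
    return V, policy
```

Each stage touches every (state, action, successor) triple once, so the cost grows polynomially in the number of states and actions rather than with the number of policies.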

Computational Problems

• Dynamic programming techniques can be used for both policy evaluation and optimization
  - Polynomial time in # of states and actions
  - http://web.engr.oregonstate.edu/~afern/classes/cs533/
• Is polytime in # of states and actions good?
• Not when these numbers are enormous!
  - As is the case for most realistic applications
  - Consider Klondike Solitaire, Computer Network Control, etc.

Enter Monte-Carlo Planning

Approaches for Large Worlds: Monte-Carlo Planning

• Often a simulator of a planning domain is available or can be learned from data

[Figure: example domains with simulators: Klondike Solitaire; Fire & Emergency Response]

Large Worlds: Monte-Carlo Approach

• Often a simulator of a planning domain is available or can be learned from data
• Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator

[Figure: the planner sends an action to the world simulator (standing in for the real world) and receives back a state + reward]

Example Domains with Simulators

• Traffic simulators
• Robotics simulators
• Military campaign simulators
• Computer network simulators
• Emergency planning simulators
  - large-scale disaster and municipal
• Sports domains
• Board games / Video games
  - Go / RTS

In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.

MDP: Simulation-Based Representation

• A simulation-based representation gives: S, A, R, T, I
  - finite state set S (|S| = n and is generally very large)
  - finite action set A (|A| = m and is assumed to be of reasonable size)
  - Stochastic, real-valued, bounded reward function R(s,a) = r
    - Stochastically returns a reward r given input s and a
  - Stochastic transition function T(s,a) = s’ (i.e. a simulator)
    - Stochastically returns a state s’ given input s and a
    - Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP
  - Stochastic initial state function I
    - Stochastically returns a state according to an initial state distribution

These stochastic functions can be implemented in any language!
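As one illustration of that last remark, the sketch below implements R, T, and I as sampling functions in Python. The class `ToySimulator`, its two-state domain, and the probabilities 0.8 and 0.9 are hypothetical choices made only for this example.

```python
import random

class ToySimulator:
    """Hypothetical domain exposing only sampling access to R, T and I."""
    states = ["start", "goal"]
    actions = ["stay", "go"]

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def I(self):
        """Sample an initial state (here the distribution is a point mass)."""
        return "start"

    def R(self, s, a):
        """Sample a bounded reward: noisy payoff in the goal state, 0 elsewhere."""
        if s == "goal":
            return 1.0 if self.rng.random() < 0.9 else 0.0
        return 0.0

    def T(self, s, a):
        """Sample a next state s' with probability Pr(s' | s, a)."""
        if s == "start" and a == "go":
            return "goal" if self.rng.random() < 0.8 else "start"
        return s
```

Nothing here enumerates the state set or stores transition probabilities in a table, which is why this representation scales to domains where n is very large.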

Computational Problems

• Policy evaluation:
  - Given an MDP simulator, a policy π, a state s, and horizon h
  - Compute the finite-horizon value function $V_\pi^h(s)$
• Policy optimization:
  - Given an MDP simulator, a horizon H, and a state s
  - Compute the optimal finite-horizon policy at state s

Trajectories

• We can use the simulator to observe trajectories of any policy π from any state s
• Let Traj(s, π, h) be a stochastic function that returns a length h trajectory of π starting at s.
• Traj(s, π, h)
  - s_0 = s
  - For i = 1 to h-1
    - s_i = T(s_{i-1}, π(s_{i-1}))
  - Return s_0, s_1, …, s_{h-1}
• The total reward of a trajectory is given by

$$R(s_0, \ldots, s_{h-1}) = \sum_{i=0}^{h-1} R(s_i)$$
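The Traj procedure translates almost line for line into Python. In this sketch the simulator `T`, the state-based reward `R`, and the policy `pi` are hypothetical stand-ins for a real domain.

```python
import random

def traj(s, pi, h, T):
    """Sample a length-h state sequence s_0 .. s_{h-1} by following pi in simulator T."""
    states = [s]
    for _ in range(1, h):
        prev = states[-1]
        states.append(T(prev, pi(prev)))
    return states

def total_reward(states, R):
    """Total reward of a trajectory: the sum of R(s_i) over its states."""
    return sum(R(s) for s in states)

# Hypothetical toy domain used only to exercise the functions above.
rng = random.Random(0)

def T(s, a):
    if s == "start" and a == "go":
        return "goal" if rng.random() < 0.8 else "start"
    return s

def R(s):
    return 1.0 if s == "goal" else 0.0

pi = lambda s: "go"
sample = traj("start", pi, 5, T)
```

Calling `traj` twice with the same arguments generally returns different sequences, since each call draws fresh samples from the simulator.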

Policy Evaluation

• Given a policy π and a state s, how can we estimate $V_\pi^h(s)$?
• Simply sample w trajectories starting at s and average the total reward received
• Select a sampling width w and horizon h.
• The function EstimateV(s, π, h, w) returns an estimate of $V_\pi^h(s)$

EstimateV(s, π, h, w)
  - V = 0
  - For i = 1 to w
    - V = V + R( Traj(s, π, h) ) ; add total reward of a trajectory
  - Return V / w

• How close to true value will this estimate be?
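A runnable version of EstimateV might look like the following sketch. The toy simulator `T`, reward `R`, and policy `pi` are hypothetical. For this particular domain the true 3-step value from "start" works out to 0.8 + 0.96 = 1.76, which gives something to compare the estimate against.

```python
import random

def estimate_v(s, pi, h, w, T, R):
    """Average the total reward of w sampled length-h trajectories from s."""
    total = 0.0
    for _ in range(w):
        state = s
        ret = R(state)                  # reward of s_0
        for _step in range(h - 1):      # rewards of s_1 .. s_{h-1}
            state = T(state, pi(state))
            ret += R(state)
        total += ret
    return total / w

# Hypothetical toy domain: "go" reaches the absorbing goal with probability 0.8.
rng = random.Random(0)

def T(s, a):
    if s == "start" and a == "go":
        return "goal" if rng.random() < 0.8 else "start"
    return s

def R(s):
    return 1.0 if s == "goal" else 0.0

pi = lambda s: "go"
estimate = estimate_v("start", pi, 3, 20000, T, R)
```

Larger w tightens the estimate at a cost of w·h simulator calls, independent of the number of states.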

Sampling-Error Bound

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \,\middle|\, \pi, s\right] = E\left[R(\mathrm{Traj}(s,\pi,h))\right]$$

approximation due to sampling:

$$\mathrm{EstimateV}(s,\pi,h,w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i = R(\mathrm{Traj}(s,\pi,h))$$

• Note that the $r_i$ are samples of random variable R(Traj(s, π, h))
• We can apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average

Aside: Additive Chernoff Bound

• Let X be a random variable with maximum absolute value Z, and let $x_i$, i = 1,…,w, be i.i.d. samples of X
• The Chernoff bound gives a bound on the probability that the average of the $x_i$ is far from E[X]

Let {$x_i$ | i = 1,…,w} be i.i.d. samples of random variable X, then with probability at least $1-\delta$ we have that

$$\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \le Z\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}$$

equivalently

$$\Pr\left( \left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon \right) \le \exp\left( -\left(\frac{\epsilon}{Z}\right)^2 w \right)$$
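Solving the bound for w gives a rule of thumb for choosing the sampling width. The helper names below are hypothetical, and the constants follow the form of the bound as stated in these slides. Textbook statements of the Hoeffding/Chernoff inequality carry somewhat different constants, so treat the numbers as indicative.

```python
import math

def error_radius(Z, w, delta):
    """Deviation eps = Z * sqrt(ln(1/delta) / w) at confidence 1 - delta."""
    return Z * math.sqrt(math.log(1.0 / delta) / w)

def samples_needed(Z, eps, delta):
    """Smallest w with exp(-(eps/Z)^2 * w) <= delta, i.e. w >= (Z/eps)^2 * ln(1/delta)."""
    return math.ceil((Z / eps) ** 2 * math.log(1.0 / delta))

# Example: values bounded by Z = 1, target accuracy 0.1, failure probability 0.05.
w = samples_needed(1.0, 0.1, 0.05)
```

Note that w grows only logarithmically in 1/δ but quadratically in Z/ε, so halving the target error quadruples the number of samples.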

Aside: Coin Flip Example

• Suppose we have a coin with probability of heads equal to p.
• Let X be a random variable where X=1 if the coin flip gives heads and zero otherwise. (so Z from the bound is 1)

E[X] = 1*p + 0*(1-p) = p

• After flipping a coin w times we can estimate the heads prob. by the average of the $x_i$.
• The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p.

$$\Pr\left( \left| p - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon \right) \le \exp\left( -\epsilon^2 w \right)$$
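The coin example is easy to check empirically. This sketch repeats the w-flip experiment many times and compares how often the estimate misses p by at least ε against the bound in the form stated on the slide. The function names, the seed, and the parameter values in the example call are arbitrary assumptions.

```python
import math
import random

def deviation_frequency(p, w, eps, trials, seed=0):
    """Fraction of trials in which the mean of w coin flips is off by >= eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _flip in range(w))
        if abs(heads / w - p) >= eps:
            bad += 1
    return bad / trials

def slide_bound(w, eps):
    """Right-hand side of the coin-flip bound as written above (Z = 1)."""
    return math.exp(-(eps ** 2) * w)

freq = deviation_frequency(p=0.3, w=400, eps=0.1, trials=2000)
bound = slide_bound(w=400, eps=0.1)
```

With these settings the observed frequency should sit well below the bound, which is expected: the bound holds for any bounded random variable and is usually loose for a specific one.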

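Sampling Error Bound

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \,\middle|\, \pi, s\right] = E\left[R(\mathrm{Traj}(s,\pi,h))\right]$$

approximation due to sampling:

$$\mathrm{EstimateV}(s,\pi,h,w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i = R(\mathrm{Traj}(s,\pi,h))$$

We get that,

$$\left| V_\pi^h(s) - \mathrm{EstimateV}(s,\pi,h,w) \right| \le V_{\max}\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}$$

with probability at least $1-\delta$, where $V_{\max}$ is the maximum absolute total reward of a length-h trajectory (it plays the role of Z in the Chernoff bound).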

Two Player MDP (aka Markov Games)

• So far we have only discussed single-player MDPs/games
• Your labs and competition will be 2-player zero-sum games (zero sum means sum of player rewards is zero)
• We assume players take turns (non-simultaneous moves)

[Figure: Player 1 and Player 2 each send an action to the Markov game and each receive back a state/reward]

Simulators for 2-Player Games

• A simulation-based representation gives: $S, A_1, A_2, R, T$
  - finite state set S (assume state encodes whose turn it is)
  - action sets $A_1$ and $A_2$ for player 1 (called max) and player 2 (called min)
  - Stochastic, real-valued, bounded reward function R(s,a)
    - Stochastically returns a reward r for max given input s and action a (it is assumed here that reward for min is -r)
    - So min maximizes its reward by minimizing the reward of max
  - Stochastic transition function T(s,a) = s’ (i.e. a simulator)
    - Stochastically returns a state s’ given input s and a
    - Probability of returning s’ is dictated by Pr(s’ | s,a) of game
    - Generally s’ will be a turn for the other player

These stochastic functions can be implemented in any language!

Finite Horizon Value of Game

• Given two policies $\pi_1, \pi_2$, one for each player, we can define the finite horizon value
• $V^h_{\pi_1,\pi_2}(s)$ is the h-horizon value function with respect to player 1
  - expected total reward for player 1 after executing $\pi_1$ and $\pi_2$ starting in s for h steps
• For zero-sum games the value with respect to player 2 is just $-V^h_{\pi_1,\pi_2}(s)$
• Given $\pi_1$ and $\pi_2$ we can easily use the simulator to estimate $V^h_{\pi_1,\pi_2}(s)$
  - Just as we did for the single-player MDP setting
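The single-player estimator carries over once the simulator knows whose turn it is. The sketch below assumes a made-up alternating-turn game in which the state is simply the player to move, "safe" earns the mover 1 point, and "risky" earns 3 points with probability 0.5. All names and numbers are hypothetical.

```python
import random

rng = random.Random(0)

def to_move(state):
    """The state encodes whose turn it is: 1 (max) or 2 (min)."""
    return state

def T(state, action):
    """Turns alternate, so the next state belongs to the other player."""
    return 2 if state == 1 else 1

def R(state, action):
    """Sampled reward from max's point of view; min's reward is the negation."""
    points = 1.0 if action == "safe" else (3.0 if rng.random() < 0.5 else 0.0)
    return points if to_move(state) == 1 else -points

def estimate_game_value(s, pi1, pi2, h, w):
    """Average player 1's total reward over w sampled h-step plays from s."""
    total = 0.0
    for _ in range(w):
        state, ret = s, 0.0
        for _step in range(h):
            action = pi1(state) if to_move(state) == 1 else pi2(state)
            ret += R(state, action)
            state = T(state, action)
        total += ret
    return total / w

safe = lambda s: "safe"
risky = lambda s: "risky"
value_p1 = estimate_game_value(1, risky, safe, 4, 20000)
value_p2 = -value_p1    # zero-sum: player 2's value is the negation
```

In this toy game a risky move is worth 1.5 points in expectation, so over four alternating steps the value for player 1 is 1.5 - 1 + 1.5 - 1 = 1.0, and the estimate should land close to that.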