BAMS 517: Introduction to Markov Decision Processes
Eric Cope (mostly!) and Martin L. Puterman
Sauder School of Business
Markov Decision Processes (MDPs)
We've been looking so far at decision problems that require a choice of only one or a few actions. The complexity of these decisions was small enough that we could write down a decision tree and solve it. Many decision problems, however, require that actions be taken repeatedly over time. The more decisions and uncertain events we have to consider, the more tedious it becomes to write down a decision tree for the problem: decision trees are not a very parsimonious way to represent or solve very complex decision problems.
MDPs provide a rich analytical framework for studying complex decision problems:
- a convenient and economical method of representing a decision problem
- can be used to study problems involving infinite sequences of decisions
- can be easily stored and solved on a computer
- allow us to further explore the structure of optimal decisions
General Approach
1. Formulate the problem as an MDP by identifying its decision epochs, states, actions, transition probabilities, and rewards.
2. Determine an optimality criterion – initially, expected total reward.
3. Solve it using backward induction to determine the optimal policy.
The parking problem
Suppose you're driving to a theatre that is at the end of a one-way street. You'll need to park in one of the parking spots along the street. Naturally, you want to park as close as possible to the theatre. If, however, you drive past the theatre, you'll have to park in the pay lot and pay $c.
You are a distance of x from the theatre and you see an open spot. Do you take it, or try to get closer to the theatre?
The parking problem
Some simplifying modeling assumptions:
- You can only see whether the spot you're driving past is occupied or not (it's nighttime)
- Each spot has a probability p (0 < p < 1) of being vacant, and vacancies occur independently
- You assign a cost of $a·x for parking a distance of x from the theatre (measured in parking-spot widths)
- There are N total spots
Suppose N = 100, and try to imagine what a decision tree would look like for this problem. The tree is extremely unwieldy and complex – a long series of similar nodes forming a complex network of branches. Much of the redundancy in the tree can be eliminated using the MDP representation.
The parking problem
You might imagine that the optimal solution to the problem is of the form: drive past the first k spots, and then park in the first open spot after that. As it turns out, for some value of k this is the optimal rule, regardless of the values of the other problem parameters (c, N, etc.). The structure of MDPs often allows you to prove that such general properties hold of the optimal solution.
The parking problem is an instance of an "optimal stopping" problem: once you park, you stop having to decide. The problem is deciding when to park given your current limited information.
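The backward induction for this problem can be sketched in a few lines of Python. The parameter values below (p = 0.3, a = 1, c = 50) are hypothetical, chosen only for illustration; V[x] is the expected cost when approaching the spot at distance x, before seeing whether it is vacant.

```python
def parking_policy(N=100, p=0.3, a=1.0, c=50.0):
    """Backward induction for the parking problem (illustrative parameters).
    V[x] = expected cost on approaching the spot at distance x, before
    observing it; V[0] = c is the pay lot past the theatre."""
    V = [0.0] * (N + 1)
    V[0] = c
    park = [False] * (N + 1)      # park[x]: take the spot at distance x if vacant?
    for x in range(1, N + 1):
        cont = V[x - 1]           # expected cost of driving on
        park[x] = a * x <= cont   # park iff no worse than continuing
        V[x] = p * min(a * x, cont) + (1 - p) * cont
    return V, park

V, park = parking_policy()
```

Running this shows the threshold structure claimed above: park[x] is True for every distance up to some k and False beyond it, i.e. "drive past the far spots, then take the first vacancy once you're close enough."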
An inventory management problem
Suppose you are the manager of a small retail fashion outlet. You sell a popular type of dress (one that remains "in fashion" for a year) at a price $p. You have only limited shelf and storage space for dresses. Customers arrive at random intervals and buy the dress; if you have no dresses in stock, they leave the store and you lose the sale. You can order new dresses from your supplier at any time; ordering costs $K plus $c per dress.
When do you order new dresses, and how many do you order?
An inventory management problem
Some simplifying modeling assumptions:
- Every day, a random number D of customers arrive, where D ∈ {0, 1, 2, …}
- The demand distribution is constant over time, and the number of customers wanting to buy is independent from day to day
- You place orders first thing in the morning, and they arrive immediately
- You can only carry N dresses at any time due to storage limitations
- The dresses will be sold for a year, after which they will be replaced by a new line; unsold dresses will be disposed of at the end of the year at a reduced price
Objective: maximize expected total profit over the year.
What is the key information needed to make a decision?
- Constant information: space limitations, probability distribution of customer arrivals, ordering process
- Changing information: inventory on hand, number of days until the new line arrives
An inventory management problem
Imagine the decision tree for this problem. It will be extremely large, but will include many redundant nodes. For example, consider the following scenarios for day 100:

Day 100 Inventory   Order   Sales   Day 101 Inventory
12                  3       5       10
10                  0       0       10
11                  0       1       10
20                  0       10      10
5                   10      5       10

Each of these scenarios leads to the same situation on day 101. In a decision tree, you would have to write separate branches for each of these scenarios, even though you would face essentially the same decision on day 101 in each case. Decisions only depend on the present state of affairs, and not on past events or decisions.
An inventory management problem
It is better to consider the decision problem that you would face on day 101 with 10 units of inventory only once. In the MDP modeling framework, we talk about the "state" of having 10 units of inventory on day 101, and consider the decision problem faced by someone in this state.
We can fully consider the decision problem by considering all the possible "states" we might find ourselves in; there will be a state for every combination of day and inventory level. Note that the states here correspond to the possible values of the "changing information" we might have about the problem at any time that is relevant to the overall objective of maximizing total profit. Each state incorporates in its description all the problem information needed to make a good decision when in that state.
An inventory management problem
Note that in each possible state, different sets of actions are available to us. If in the current state there is an inventory of n items, then we can only order up to N − n items, due to space limitations.
Our choice of action will lead us into new states with different probabilities. Suppose the demand D realized on each day is such that P(D=0) = P(D=1) = 1/3, P(D=2) = P(D=3) = 1/6, and P(D > 3) = 0. Suppose the current state is 10 items in inventory on day 100. Here are the probabilities that the next state will be 12 items on day 101, for different order quantities:

# ordered   Prob. next state is 12 items, day 101
0           0
3           1/3
5           1/6
7           0
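These transition probabilities follow mechanically from the demand distribution; a minimal sketch (the function name and the use of exact fractions are our own choices):

```python
from fractions import Fraction

# Demand distribution from the example: P(D=0)=P(D=1)=1/3, P(D=2)=P(D=3)=1/6
demand = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 6), 3: Fraction(1, 6)}

def p_next(s, a, s_next):
    """P(next day's inventory = s_next | current inventory s, order a).
    Orders arrive immediately, so the stock available for sale is s + a."""
    stock = s + a
    if s_next > stock:
        return Fraction(0)
    if s_next == 0:               # any demand >= stock empties the shelf
        return sum(pr for d, pr in demand.items() if d >= stock)
    return demand.get(stock - s_next, Fraction(0))
```

For the table above: p_next(10, 0, 12) = 0, p_next(10, 3, 12) = 1/3, p_next(10, 5, 12) = 1/6, p_next(10, 7, 12) = 0.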
An inventory management problem
In addition, different actions will cause us to gain different profits. Daily profit equals min{D, s + a}·p − c·a − K if a > 0, and min{D, s}·p if a = 0, where s is the current inventory and a is the number ordered.
In order to choose the best action in any particular state, we need to understand:
- the possible future states that are attainable through each action, and the probabilities of reaching those states
- the possible future profits that we gain from each action
If we had to consider the evolution of states and total profits gained over the entire future, this could be quite complicated. Instead, we'll only consider, for any given state and action, what the next state could be, and how much profit could be gained before moving to the next state. From these "one-step" state transitions and profits, we can analyze the entire decision problem.
Elements of MDPs
Decision epochs: the times at which decisions may be made. The time between successive decision epochs is referred to as a period. We first consider problems with a finite number N of decision epochs; the Nth decision epoch is a terminal point – no decision is made at it.
States: a state describes all the relevant available information necessary in order to take an optimal action at any given decision epoch. We denote the set of all possible states by S.
Action sets: for each state s ∈ S, the action set A(s) denotes the set of allowable actions that can be taken in state s.
Transition probabilities: for any given state and action, the probabilities of moving (transitioning) to any other state at the next decision epoch. If s is the current state at time t and action a ∈ A(s) is taken, then the probability of transitioning to state s′ is denoted pt(s′ | s, a). We assume Markovian dynamics: transitions depend only on the current state and action.
Elements of MDPs; timeline
Rewards: for any given state and action, the random benefits (or costs) that are incurred before or during the next state transition. The reward received after taking action a in state s at time t and arriving in state s′ is denoted rt(s, a, s′). Note that the random rewards may depend on the next state s′; usually we will only consider the expected reward rt(s, a) = Σ_{s′ ∈ S} rt(s, a, s′) pt(s′ | s, a). There may be terminal rewards rN(s) at the Nth decision epoch.
[Timeline diagram: epochs 1, 2, …, N − 1, N along a time axis. At each epoch n the system is in state sn, action an is taken, and reward rn is received; the system moves to state sn+1 with transition probability pn(sn+1 | sn, an), e.g. p1(s2 | s1, a1), p2(s3 | s2, a2), …, pN−1(sN | sN−1, aN−1). At the final epoch N only the terminal reward rN is received.]
But Who's Counting
http://www.youtube.com/watch?v=KjZJ3TV-MyM
This can be formulated as an MDP:
- States – the unoccupied slots and the number to be placed
- Actions – which unoccupied slot to place the number in
- Rewards – the value of placing the number in the space
- Goal – maximize the expected total reward
MDPs as decision trees
[Diagram, repeated across several slides: an MDP drawn as a decision tree over epochs N − 3, N − 2, N − 1, N, successively highlighting the states (decision nodes), the terminal states, the actions (decision branches), and the rewards/transitions (uncertainty nodes and branches).]
Specifying states
As we mentioned, the state descriptor should provide all the relevant problem information that is necessary for making a decision. Normally, we don't include problem information that doesn't change from epoch to epoch in the state description. For example, in the parking problem, the cost of parking in the pay lot, the total number of parking spaces, etc., are constant at all times, so we don't bother including this information in the state description. We would, however, include information about the status of the current parking space (vacant or occupied).
The number of epochs remaining also changes from epoch to epoch (for finite-horizon problems). However, we often won't include this information in the state description, because it is implicitly present in the specification of the rewards and transition probabilities.
Deterministic dynamic programs
A special type of MDP (MDPs are sometimes also called dynamic programs) is one in which all transition probabilities are either 0 or 1. These are known as deterministic dynamic programs (DDPs). Such problems arise in several applications:
- finding shortest paths in networks
- critical path analysis
- sequential allocation
- inventory problems with known demands
Shortest path through a network
Nodes represent states, arcs represent actions/transitions, and the numbers represent arc lengths/costs.
Goal: find the shortest route from node 1 to node 8.
[Network diagram: nodes 1–8, with arc lengths 1→2 (2), 1→3 (4), 1→4 (3), 2→7 (2), 3→5 (5), 3→6 (6), 3→7 (1), 4→5 (4), 4→6 (5), 5→8 (1), 6→8 (2), 7→8 (6).]
Formulation of shortest path problem
Let u(s) denote the shortest distance from node s to node 8. We compute u(s) just as we did previously for MDPs: for each arc s→s′ out of state s, add the length of the arc to the shortest distance u(s′) from s′ to node 8, and let u(s) be the minimum of these values over all arcs out of s.
u(8) = 0 – "terminal state"
u(7) = 6 + u(8) = 6
u(6) = 2 + u(8) = 2
u(5) = 1 + u(8) = 1
u(4) = min{4 + u(5), 5 + u(6)} = min{4 + 1, 5 + 2} = 5
u(3) = min{5 + u(5), 6 + u(6), 1 + u(7)} = min{5 + 1, 6 + 2, 1 + 6} = 6
u(2) = 2 + u(7) = 2 + 6 = 8
u(1) = min{2 + u(2), 4 + u(3), 3 + u(4)} = min{2 + 8, 4 + 6, 3 + 5} = 8
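The same computation can be written as a short program; the arc list below transcribes the example network.

```python
# arcs[s] lists (successor, length) pairs for the example network.
arcs = {
    1: [(2, 2), (3, 4), (4, 3)],
    2: [(7, 2)],
    3: [(5, 5), (6, 6), (7, 1)],
    4: [(5, 4), (6, 5)],
    5: [(8, 1)],
    6: [(8, 2)],
    7: [(8, 6)],
}

def shortest_paths(arcs, terminal=8):
    """Backward induction: u[s] = shortest distance from s to the terminal node.
    Relies on the nodes being numbered so every arc goes to a higher number."""
    u = {terminal: 0}
    for s in sorted(arcs, reverse=True):
        u[s] = min(length + u[t] for t, length in arcs[s])
    return u

u = shortest_paths(arcs)
```

shortest_paths(arcs) reproduces the values computed above, ending with u[1] = 8.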
Critical path models
A critical path network is a graphical method of analyzing a complex project with many tasks and precedence constraints. In the graph of this network, nodes represent states of completion, and arcs represent tasks to complete. The node from which an arc originates represents the project's state of completion that is needed to begin that task: all other tasks that logically precede the task must be done first. The arcs are labeled with the length of time the tasks require for completion.
The critical path is a list of tasks forming a path through the network from the project start node to the project end node. If the completion of any task on the critical path is delayed, then the overall project must be delayed as well. The critical path is the longest path through the network.
Critical path: launching a new product
It was determined that in order to launch a new product, the following activities needed to be completed:

Activity   Description           Predecessor   Duration
A          Product Design        --            5 mos.
B          Market Research       --            1 mo.
C          Production Analysis   A             2 mos.
D          Product Model         A             3 mos.
E          Sales Brochure        A             2 mos.
F          Cost Analysis         C             3 mos.
G          Product Testing       D             4 mos.
H          Sales Training        B, E          2 mos.
I          Pricing               H             1 mo.
J          Project Report        F, G, I       1 mo.
Critical path: launching a new product
[Network diagram: nodes 1–8, with arcs labeled by activity and duration: A (5), B (1), C (2), D (3), E (2), F (3), G (4), H (2), I (1), J (1).]
Critical path: launching a new product
We use the backward induction algorithm to find the longest path: u(s) = longest path from node s to the project completion node 8. This is not really a decision problem per se, but an illustration of backward induction applied to a network, similar to a DDP.
[Network diagram, with the critical path marked in red and the computed values:
u(8) = 0, u(7) = 1, u(6) = 2, u(5) = 4, u(4) = 5, u(3) = 4, u(2) = max{8, 6, 6} = 8, u(1) = max{13, 5} = 13.]
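The longest-path computation can be coded the same way as the shortest-path DDP, with max in place of min. The arc structure below (activity, successor node, duration in months) is read off the diagram:

```python
# Activity-on-arc network for the product launch; node 8 = completion.
arcs = {
    1: [("A", 2, 5), ("B", 3, 1)],
    2: [("C", 5, 2), ("D", 4, 3), ("E", 3, 2)],
    3: [("H", 6, 2)],
    4: [("G", 7, 4)],
    5: [("F", 7, 3)],
    6: [("I", 7, 1)],
    7: [("J", 8, 1)],
}

def longest_paths(arcs, terminal=8):
    """Backward induction: u[s] = longest path from s to the completion node;
    best[s] = the activity starting the critical path out of s."""
    u, best = {terminal: 0}, {}
    for s in sorted(arcs, reverse=True):
        act, nxt, dur = max(arcs[s], key=lambda arc: arc[2] + u[arc[1]])
        u[s], best[s] = dur + u[nxt], act
    return u, best

u, best = longest_paths(arcs)
path, s = [], 1
while s != 8:                 # walk the critical path from the start node
    path.append(best[s])
    s = next(n for a, n, d in arcs[s] if a == best[s])
```

This gives u[1] = 13 months, with critical path A, D, G, J.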
Backward induction algorithm
1. Set uN(s) = rN(s) for all s ∈ S. Set n = N.
2. Solve, for all s ∈ S:
   un−1(s) = max_{a ∈ A(s)} { rn−1(s, a) + Σ_{j ∈ S} pn−1(j | s, a) un(j) }
3. If n − 1 = 1, stop; otherwise replace n by n − 1 and return to step 2.
Solution to the inventory problem
N = 10, K = 30, c = 20, p = 40; d0 = d1 = d2 = 1/4, d3 = d4 = 1/8, dk = 0 for k > 4 (where dk = P(D = k)).
The Bellman equations to solve for n = 1, …, 365 and s = 0, …, 10 are:

un(s) = max_{a ∈ {0, …, 10 − s}} { −(K·1{a > 0} + c·a) + p·E[min{D, s + a}] + E[un+1(s + a − min{D, s + a})] }
          (ordering costs)           (expected sales revenue)   (expected value of next state)

We set the terminal rewards u366(s) = 0 for all s, and again solve by working backwards for u365(s), …, u1(s).
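A direct implementation of these Bellman equations might look as follows; the model details match the profit formula given earlier, so treat this as a sketch rather than the course's exact code.

```python
K, c, p, N = 30, 20, 40, 10                # fixed order cost, unit cost, price, capacity
d = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.125, 4: 0.125}   # P(D = k)

def solve(days=365):
    """Backward induction for the finite-horizon inventory MDP.
    Returns u_1 and the optimal order quantity order[(day, inventory)]."""
    u_next = [0.0] * (N + 1)               # terminal values u_{days+1}(s) = 0
    order = {}
    for n in range(days, 0, -1):
        u_now = [0.0] * (N + 1)
        for s in range(N + 1):
            best_val, best_a = float("-inf"), 0
            for a in range(N - s + 1):     # can hold at most N dresses
                stock = s + a
                val = -(K * (a > 0) + c * a)        # ordering costs
                for D, prob in d.items():           # sales revenue + next state
                    sales = min(D, stock)
                    val += prob * (p * sales + u_next[stock - sales])
                if val > best_val:
                    best_val, best_a = val, a
            u_now[s], order[(n, s)] = best_val, best_a
        u_next = u_now
    return u_next, order

u1, order = solve()
```

solve() reproduces the order-quantity table that follows: for example, it orders nothing in every state on day 365, and orders 3 in state 0 on day 364.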
Optimal order quantities

                      s
time      0    1    2    3    4   …   10
365       0    0    0    0    0        0
364       3    0    0    0    0        0
363       5    4    0    0    0        0
362       6    5    4    0    0        0
361       8    7    6    0    0        0
360       9    8    7    0    0        0
359      10    9    8    0    0        0
358      10    9    8    0    0        0
357      10    9    8    0    0        0
…
1        10    9    8    0    0        0

The optimal order quantities are listed above. In the last period (day 365) you don't want to order anything. You never order anything if you have 3 or more items in stock. In days 1, …, 359, if you have fewer than 3 items in stock, you "order up" to a full inventory level.
This is known as an (s, S) inventory policy: if your inventory falls below a level s, you order up to level S. This is well known (Scarf, 1959) to be the form of an optimal inventory policy for this problem.
An investment/consumption problem
We consider a (simplified) approach to investment planning for your life:
- You will make M dollars per year until age 65
- Each year, you can choose to spend a portion of this money and invest the rest
- Invested money brings a return of r% per year
- Your utility for spending x dollars per year is log(x/10000)
- You are currently d years old, and you will live to the age of D
- Let cn be the amount of money you consume in year n. We require that cn ≤ wn, your level of wealth in year n (which includes your year-n income)
- Your current level of wealth is wd
- Your lifetime utility is Σ_{n=d}^{D} log(cn/10000)
- The value of any remaining wealth at your death is 0
An investment/consumption problem
We formulate this as a DDP. The equations to solve, working backwards from age D, are of the form

un(w) = max_{0 ≤ c ≤ w} { log(c/10000) + un+1((w − c)(1 + r) + Mn+1) }

where Mn+1 = M if you are still working in year n + 1 (up to age 65) and 0 afterwards.
[Graph: the optimal spending policy, along with total wealth, versus age (40–80) for a problem with the following parameters: d = 40, D = 80, r = 10%, M = $50K, initial wealth w40 = $50K.]
The time value of money
When sequential decisions are made over the course of many months or years, it is often desirable to consider the "time value of money". Receiving a dollar today is worth more to you than receiving a dollar tomorrow, since you have the added option of spending that dollar today. It is customary to "discount" the values of dollars received in the future by an appropriate factor.
Let λ(t) denote the discount factor applied to money received t periods in the future, with 0 < λ(t) < 1. Thus, $x received t periods in the future is worth λ(t)·$x to you now. Typically, we let λ(t) = λ^t for some fixed λ, 0 < λ < 1; the choice of λ depends on the length of the period.
Discount factors in Bellman's equation
The choice of λ(t) = λ^t is convenient because then we can easily include discounting in Bellman's equation:

un(s) = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) un+1(s′) }

We simply apply the discount factor to the expected value of the next state. We regard the expected value of the next state as a "certain equivalent" value of the next decision we will make one period in the future; this certain equivalent value is discounted by λ.
Quick proof (optional): let π(n) = (dn, …, dN−1) denote a policy from epoch n onward. Then

un(s) = max_{π(n)} E^{π(n)} [ Σ_{t=n}^{N} λ^{t−n} rt(st, dt(st)) | sn = s ]
      = max_{π(n)} E^{π(n)} [ rn(s, dn(s)) + λ Σ_{t=n+1}^{N} λ^{t−n−1} rt(st, dt(st)) | sn = s ]
      = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) max_{π(n+1)} E^{π(n+1)} [ Σ_{t=n+1}^{N} λ^{t−n−1} rt(st, dt(st)) | sn+1 = s′ ] }
      = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) un+1(s′) }
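A generic finite-horizon solver with this discounting baked in might look as follows; the function and argument names here are our own, not from the course.

```python
def backward_induction(states, actions, r, p, terminal, N, disc=1.0):
    """Backward induction for u_n(s) = max_a { r_n(s,a) + disc * E[u_{n+1}(s')] }.
    r(n, s, a): expected reward; p(n, s, a): dict {next_state: probability};
    terminal(s): terminal reward r_N(s); disc: per-period discount factor."""
    u = {s: terminal(s) for s in states}     # u_N
    policy = {}
    for n in range(N - 1, 0, -1):            # decision epochs N-1, ..., 1
        u_new = {}
        for s in states:
            def value(a):
                return r(n, s, a) + disc * sum(pr * u[t] for t, pr in p(n, s, a).items())
            best = max(actions(s), key=value)
            u_new[s], policy[(n, s)] = value(best), best
        u = u_new
    return u, policy
```

For instance, a single state earning reward 1 per period over three decision epochs with disc = 0.5 has value 1 + 0.5·(1 + 0.5·1) = 1.75.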
The secretary problem
You are hiring for the position of secretary. You will interview N prospective candidates. After each interview, you can either decide to hire the person, or interview the next candidate. If you don't offer the job to the current interviewee, they go and find work elsewhere, and can no longer be hired.
The goal is to find a decision policy that maximizes the probability of hiring the best person. You don't know the rank order of the candidates; you can only rank the people you've interviewed so far. For example, you know whether the current candidate is better than any of the previous candidates.
The secretary problem
If the current interviewee is not the best one you've seen so far, then the probability that this person is the best is zero. If there are more people to interview, then you might as well – there is at least a chance that the best is yet to come.
If the current interviewee is the best one so far, then you might consider hiring him or her. What is the probability that this person is the best of all? This depends on the number you have interviewed so far: if the nth candidate is the best that you have seen so far, then the probability that this person is the best out of all N candidates is n/N.
The secretary problem
Because the only information relevant to your hiring decision is whether the current person is the best you've seen so far, we let the state be s ∈ {0, 1}, according to whether the current person is the best so far or not.
If you decide to hire the current candidate, the "reward" is the probability that that person is the best of all N. If you decide not to hire the current candidate, there is no immediate reward and you interview the next candidate. The probability that the next (n+1st) person will be the best you've seen so far is 1/(n+1).
Let un(s) be the maximal probability of eventually selecting the best candidate if the current state is s at time n. Then uN+1(s) = 0 for s = 0, 1, and for n = 1, …, N:

un(0) = n·un+1(0)/(n+1) + un+1(1)/(n+1)
un(1) = max{ n·un+1(0)/(n+1) + un+1(1)/(n+1), n/N }
The secretary problem
The optimal policy is of the form "interview the first t candidates, and then hire the first candidate that is better than all the previous ones."
[Graph: the probability of selecting the best candidate, and the optimal proportion t/N of candidates to initially interview, plotted against N from 0 to 50; vertical axis from 0 to 1.]
The graph shows how the probability of selecting the best candidate and the optimal proportion t/N vary with N. Both curves approach 1/e ≈ 0.36788 in the limit.
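The recursion above is easy to run. The sketch below returns the success probability and the earliest interview at which hiring a best-so-far candidate becomes optimal (the cutoff in the threshold policy):

```python
def secretary(N):
    """Backward induction for the secretary problem.
    Returns (probability of selecting the best, first n at which you would hire)."""
    u0, u1 = 0.0, 0.0                            # u_{N+1}(0), u_{N+1}(1)
    cutoff = N
    for n in range(N, 0, -1):
        cont = n * u0 / (n + 1) + u1 / (n + 1)   # value of interviewing on
        if n / N >= cont:
            cutoff = n        # hiring a best-so-far candidate is optimal here
        u0, u1 = cont, max(cont, n / N)
    return u1, cutoff         # the first candidate is always "best so far"

prob, cutoff = secretary(100)
```

For N = 100 this gives a success probability of about 0.371, and both the probability and cutoff/N approach 1/e as N grows, matching the graph.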
Controlled queueing systems
A queue is a model for a congested resource: "jobs" line up in the queue waiting for "service". New jobs arrive at the end of the queue at random times (first-come, first-served), and each job requires a random amount of service. Servers complete job service requirements at a given rate. Example: lining up for a bank teller.
Queueing models are useful for estimating average waiting times, queue lengths, server utilization, etc.
Controlled queueing models
We model N discrete time periods. In each period, at most one job can arrive (with probability λ) and at most one job can complete service (with probability μ); assume μ > λ. Imagine that we can control the rate μ at which the server completes jobs; that is, we can choose μ within the range (λ, 1). There is a cost c(μ) associated with choosing rate μ (c increasing in μ).
- State s = number of jobs in the system
- Total buffer size B; states s ∈ S = {0, …, B+1}
- Reward R for every completed job
- Penalty b for every job blocked due to a full buffer
- Holding cost h(s) depending on the number of jobs in the system (h increasing)
[Diagram: arriving jobs enter the queue buffer; one job at a time is in service at the server; completed jobs depart.]
Controlled queueing models
Timeline: at time n, in state s:
1. Choose rate μn; pay cost c(μn)
2. a new jobs arrive; incur blocking penalty b·max{0, s + a − B − 1}
3. k jobs complete service; receive reward R·k
4. At time n+1 the state is s + min{a, B + 1 − s} − k; incur holding cost h(s + min{a, B + 1 − s} − k)
Controlled queueing models
Optimality equations (arrivals occur with probability λ and completions with probability μ, independently):

0 < s < B + 1:
un(s) = max_μ { −c(μ) + λμ[R − h(s) + un+1(s)] + λ(1−μ)[−h(s+1) + un+1(s+1)] + (1−λ)μ[R − h(s−1) + un+1(s−1)] + (1−λ)(1−μ)[−h(s) + un+1(s)] }

s = 0:
un(0) = max_μ { −c(μ) + λμ[R − h(0) + un+1(0)] + λ(1−μ)[−h(1) + un+1(1)] + (1−λ)[−h(0) + un+1(0)] }

s = B + 1:
un(B+1) = max_μ { −c(μ) − λb + μ[R − h(B) + un+1(B)] + (1−μ)[−h(B+1) + un+1(B+1)] }
'Two-armed bandit' problems
A 'one-armed bandit' is another name for a slot machine: if you play it long enough, it's as good as being robbed. There are two slot machines you can play, and you plan to play the machines N times. Every time you play, you choose which arm to pull. You pay a dollar, and either win $2 or nothing on any given pull. You don't know the probabilities of winning on either machine, but the more you play either machine, the more you learn about its probability of winning.
How do you decide which arms to pull?
'Two-armed bandit' problems
To simplify the problem, suppose you already know that machine 1 has a probability of winning of 0.5; you don't know the probability p of winning on machine 2.
Recall the coin and thumbtack example: you can choose to either flip the coin or the thumbtack in each of N plays. Every time the outcome is heads / pin up, you win $1; otherwise you lose $1. You know the probability of heads is 0.5, but you are unsure of the probability of the tack landing pin up.
The question then becomes if and for how long you should play machine 2. You may suspect that machine 2 has a slightly worse chance of winning. However, it might be worthwhile trying machine 2 for a while to see if it appears to be better than machine 1. If machine 2 doesn't appear to be better, then you can revert to machine 1 and continue playing that until the end. As long as you play machine 1, you learn nothing about p.
'Two-armed bandit' problems
Let the prior probability for p be P(p = x), where x can be any value in {0, 0.01, 0.02, …, 0.99, 1}. Suppose at some point in time you have played machine 2 a total of n times, and you have won k times out of those n. It is possible to show that the posterior probabilities for p can be determined just from knowing n and k: we don't need to know the entire sequence of wins and losses, only the totals.
Denote the posterior as P(p = x | n, k). By Bayes' rule,

P(p = x | n, k) ∝ P(k wins out of n | p = x) · P(p = x) = (n! / k!(n−k)!) · x^k · (1−x)^(n−k) · P(p = x)

Let q(n, k) denote the probability that you assign to winning on the next play of machine 2, after observing k wins out of n plays:

q(n, k) = Σx x · P(p = x | n, k)
'Two-armed bandit' problems
We can formulate the problem as the following MDP:
- States: s = (n, k), where n ∈ {0, 1, …, N} and k ∈ {0, 1, …, n}
- Actions: a ∈ {1, 2} – which machine you play
- Rewards: r((n, k), 1) = 0.5(1) + 0.5(−1) = 0; r((n, k), 2) = q(n, k)(1) + (1 − q(n, k))(−1) = 2q(n, k) − 1; all terminal rewards are 0
- Transitions: p((n, k) | (n, k), 1) = 1; p((n+1, k+1) | (n, k), 2) = q(n, k); p((n+1, k) | (n, k), 2) = 1 − q(n, k)
Optimality equations:

ut(n, k) = max{ ut+1(n, k), 2q(n, k) − 1 + q(n, k)·ut+1(n+1, k+1) + (1 − q(n, k))·ut+1(n+1, k) }
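A small-horizon sketch of this MDP (the grid prior matches the slides; N = 20 here is purely to keep the state space tiny):

```python
from math import comb
from functools import lru_cache

N = 20                                   # horizon, kept small for illustration
xs = [i / 100 for i in range(101)]       # support of the prior {0, 0.01, ..., 1}
prior = [1 / len(xs)] * len(xs)          # uniform prior on p

def q(n, k):
    """Predictive win probability on machine 2 after k wins in n plays."""
    post = [comb(n, k) * x**k * (1 - x)**(n - k) * pr for x, pr in zip(xs, prior)]
    return sum(x * w for x, w in zip(xs, post)) / sum(post)

@lru_cache(maxsize=None)
def u(t, n, k):
    """Optimal expected total reward from play t onward, given record (n, k)."""
    if t > N:
        return 0.0
    qq = q(n, k)
    stay = u(t + 1, n, k)                # machine 1: expected reward 0, no learning
    explore = (2 * qq - 1) + qq * u(t + 1, n + 1, k + 1) + (1 - qq) * u(t + 1, n + 1, k)
    return max(stay, explore)
```

With the symmetric prior, q(0, 0) = 0.5, so a single play of machine 2 breaks even in expectation; u(1, 0, 0) is nevertheless positive for N ≥ 2, which is precisely the value of exploration.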
'Two-armed bandit' problems
A sample optimal policy is pictured at right.
[Figure: the optimal policy at time 100 for N = 1000, a uniform prior, and an applied discount factor λ = 0.98; x axis = n, y axis = k. Red region ⇒ play machine 1; green region ⇒ play machine 2; blue region ⇒ infeasible.]
'Two-armed bandit' problems
Bandit problems are canonical models for sequential choice problems: research project selection, oil exploration, clinical trials, sequential search, etc.
Bandit problems also capture a fundamental dilemma in problems of incomplete information: how best to balance learning with maximizing reward – the "exploration/exploitation" trade-off.
Structured policies
One of the advantages of the MDP framework is that it is often possible to prove that the optimal policy for a given problem type has a special structure, for example:
- threshold policies of the sort we saw in the inventory problem: if s ≤ s*, order up to level S; if s > s*, order 0
- monotone policies such as are optimal for the queueing problem: the larger the value of s (the more jobs in the system), the higher the service rate μ you should choose
Establishing such structure for the optimal policy is desirable because it provides general managerial insight into the problem, a simple decision rule can be easier to implement, and computation of the optimal policy can often be simplified.
Monotone optimal policies
Monotone optimal policies can occur when you have ordered state and action spaces: states and actions correspond to numbers according to a natural ordering. For example, in the queueing problem, the number of jobs in the system and the service rate used are both ordered quantities.
Denote the optimal action to take in state s as d*(s). A policy is monotone (nondecreasing) if s′ > s implies d*(s′) ≥ d*(s) for all s, s′, or monotone (nonincreasing) if s′ > s implies d*(s′) ≤ d*(s) for all s, s′.
Superadditive functions
A function f(x, y) is said to be superadditive if, for all x−, x+, y−, y+ such that x− < x+ and y− < y+,

f(x+, y+) + f(x−, y−) ≥ f(x+, y−) + f(x−, y+)

Reversing the inequality yields the definition of a subadditive function.
[Diagram: the four points (x−, y−), (x+, y−), (x−, y+), (x+, y+) at the corners of a rectangle; f(x−, y−) and f(x+, y+) are joined by a dashed line, and f(x+, y−) and f(x−, y+) by a solid line. f is a superadditive function if the sum of the quantities joined by the dashed line exceeds the sum of the quantities joined by the solid line.]
Superadditivity and monotone policies
Recall Bellman's equation:

un(s) = maxa { rn(s, a) + Σ_{s′} pn(s′ | s, a) un+1(s′) }

Define w(s, a) = rn(s, a) + Σ_{s′} pn(s′ | s, a) un+1(s′), so we may write un(s) = maxa w(s, a).
If w(s, a) is a superadditive (subadditive) function, then the optimal policy d*(s) will be monotone increasing (decreasing) in the state s. The book (Sec. 4.7) provides several tests to establish that w(s, a) is superadditive.
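For a concrete check of the definition, here is a brute-force test over a finite grid (a utility of our own, not one of the book's tests):

```python
def is_superadditive(f, xs, ys):
    """Check f(x+,y+) + f(x-,y-) >= f(x+,y-) + f(x-,y+) for all x- < x+, y- < y+.
    xs and ys must be sorted in increasing order."""
    return all(f(x2, y2) + f(x1, y1) >= f(x2, y1) + f(x1, y2)
               for i, x1 in enumerate(xs) for x2 in xs[i + 1:]
               for j, y1 in enumerate(ys) for y2 in ys[j + 1:])

grid = [0, 1, 2, 3]
# f(x, y) = x*y is superadditive: the cross-difference equals (x+ - x-)(y+ - y-) >= 0,
# while -x*y reverses the inequality and is subadditive.
assert is_superadditive(lambda x, y: x * y, grid, grid)
assert not is_superadditive(lambda x, y: -x * y, grid, grid)
```

The same routine can be pointed at a tabulated w(s, a) to check numerically whether a monotone optimal policy is to be expected.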