BAMS 517: Introduction to Markov Decision Processes
Eric Cope (mostly!) and Martin L. Puterman
Sauder School of Business
Markov Decision Processes (MDPs)
We've been looking so far at decision problems that require a choice of only one or a few actions. The complexity of these decisions was small enough that we could write down a decision tree and solve it. Many decision problems, however, require that actions be taken repeatedly over time. The more decisions and uncertain events we have to consider, the more tedious it becomes to write down a decision tree for the problem: decision trees are not a very parsimonious way to represent or solve very complex decision problems.
MDPs provide a rich analytical framework for studying complex decision problems:
- a convenient and economical method of representing a decision problem
- can be used to study problems involving infinite sequences of decisions
- can be easily stored and solved on a computer
- allow us to further explore the structure of optimal decisions
General Approach
1. Formulate the problem as an MDP by identifying its decision epochs, states, actions, transition probabilities, and rewards.
2. Determine an optimality criterion – initially, expected total reward.
3. Solve it using backward induction to determine the optimal policy.
The parking problem
Suppose you're driving to a theatre that is at the end of a one-way street. You'll need to park in one of the parking spots along the street. Naturally, you want to park as close as possible to the theatre. If, however, you drive past the theatre, you'll have to park in the pay lot and pay $c.
You are a distance of x from the theatre and you see an open spot. Do you take it, or try to get closer to the theatre?
The parking problem
Some simplifying modeling assumptions:
- You can only see whether the spot you're driving past is occupied or not (it's nighttime)
- Each spot has a probability p (0 < p < 1) of being vacant, and vacancies occur independently
- You assign a cost of $a·x for parking a distance of x from the theatre (measured in parking-spot widths)
- There are N total spots
Suppose N = 100, and try to imagine what a decision tree would look like for this problem. The tree is extremely unwieldy and complex – a long series of similar nodes forming a complex network of branches. Much of the redundancy in the tree can be eliminated using the MDP representation.
The parking problem
You might imagine that the optimal solution to the problem is of the form: drive past the first k spots, and then park in the first open spot after that. As it turns out, for some value of k this is the optimal rule, regardless of the values of the other problem parameters (c, N, etc.). The structure of MDPs often allows you to prove that such general properties hold of the optimal solution.
The parking problem is an instance of an "optimal stopping" problem: once you park, you stop having to decide. The problem is deciding when to park given your current limited information.
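The backward induction for this problem can be sketched in a few lines of Python. The parameter values below (p = 0.3, a = 1, c = 50) are hypothetical, chosen only for illustration; V[x] is the expected cost when approaching the spot at distance x, before seeing whether it is vacant.

```python
def parking_policy(N=100, p=0.3, a=1.0, c=50.0):
    """Backward induction for the parking problem (illustrative parameters).
    V[x] = expected cost on approaching the spot at distance x, before
    observing it; V[0] = c is the pay lot past the theatre."""
    V = [0.0] * (N + 1)
    V[0] = c
    park = [False] * (N + 1)      # park[x]: take the spot at distance x if vacant?
    for x in range(1, N + 1):
        cont = V[x - 1]           # expected cost of driving on
        park[x] = a * x <= cont   # park iff no worse than continuing
        V[x] = p * min(a * x, cont) + (1 - p) * cont
    return V, park

V, park = parking_policy()
```

Running this shows the threshold structure claimed above: park[x] is True for every distance up to some k and False beyond it, i.e. "drive past the far spots, then take the first vacancy once you're close enough."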
An inventory management problem
Suppose you are the manager of a small retail fashion outlet. You sell a popular type of dress (one that remains "in fashion" for a year) at a price $p. You have only limited shelf and storage space for dresses. Customers arrive at random intervals and buy the dress; if you have no dresses in stock, they leave the store and you lose the sale. You can order new dresses from your supplier at any time; ordering costs $K plus $c per dress.
When do you order new dresses, and how many do you order?
An inventory management problem
Some simplifying modeling assumptions:
- Every day, a random number D of customers arrive, where D ∈ {0, 1, 2, …}
- The demand distribution is constant over time, and the number of customers wanting to buy is independent from day to day
- You place orders first thing in the morning, and they arrive immediately
- You can only carry N dresses at any time due to storage limitations
- The dresses will be sold for a year, after which they will be replaced by a new line; unsold dresses will be disposed of at the end of the year at a reduced price
Objective: maximize expected total profit over the year.
What is the key information needed to make a decision?
- Constant information: space limitations, probability distribution of customer arrivals, ordering process
- Changing information: inventory on hand, number of days until the new line arrives
An inventory management problem
Imagine the decision tree for this problem. It will be extremely large, but will include many redundant nodes. For example, consider the following scenarios for day 100:

Day 100 Inventory   Order   Sales   Day 101 Inventory
12                  3       5       10
10                  0       0       10
11                  0       1       10
20                  0       10      10
5                   10      5       10

Each of these scenarios leads to the same situation on day 101. In a decision tree, you would have to write separate branches for each of these scenarios, even though you would face essentially the same decision on day 101 in each case. Decisions only depend on the present state of affairs, and not on past events or decisions.
An inventory management problem
It is better to consider the decision problem that you would face on day 101 with 10 units of inventory only once. In the MDP modeling framework, we talk about the "state" of having 10 units of inventory on day 101, and consider the decision problem faced by someone in this state.
We can fully consider the decision problem by considering all the possible "states" we might find ourselves in; there will be a state for every combination of day and inventory level. Note that the states here correspond to the possible values of the "changing information" we might have about the problem at any time that is relevant to the overall objective of maximizing total profit. Each state incorporates in its description all the problem information needed to make a good decision when in that state.
An inventory management problem
Note that in each possible state, different sets of actions are available to us. If in the current state there is an inventory of n items, then we can only order up to N − n items, due to space limitations.
Our choice of action will lead us into new states with different probabilities. Suppose the demand D realized on each day is such that P(D=0) = P(D=1) = 1/3, P(D=2) = P(D=3) = 1/6, and P(D > 3) = 0. Suppose the current state is 10 items in inventory on day 100. Here are the probabilities that the next state will be 12 items on day 101, for different order quantities:

# ordered   Prob. next state is 12 items, day 101
0           0
3           1/3
5           1/6
7           0
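These transition probabilities follow mechanically from the demand distribution; a minimal sketch (the function name and the use of exact fractions are our own choices):

```python
from fractions import Fraction

# Demand distribution from the example: P(D=0)=P(D=1)=1/3, P(D=2)=P(D=3)=1/6
demand = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 6), 3: Fraction(1, 6)}

def p_next(s, a, s_next):
    """P(next day's inventory = s_next | current inventory s, order a).
    Orders arrive immediately, so the stock available for sale is s + a."""
    stock = s + a
    if s_next > stock:
        return Fraction(0)
    if s_next == 0:               # any demand >= stock empties the shelf
        return sum(pr for d, pr in demand.items() if d >= stock)
    return demand.get(stock - s_next, Fraction(0))
```

For the table above: p_next(10, 0, 12) = 0, p_next(10, 3, 12) = 1/3, p_next(10, 5, 12) = 1/6, p_next(10, 7, 12) = 0.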
An inventory management problem
In addition, different actions will cause us to gain different profits. Daily profit equals min{D, s + a}·p − c·a − K if a > 0, and min{D, s}·p if a = 0, where s is the current inventory and a is the number ordered.
In order to choose the best action in any particular state, we need to understand:
- the possible future states that are attainable through each action, and the probabilities of reaching those states
- the possible future profits that we gain from each action
If we had to consider the evolution of states and total profits gained over the entire future, this could be quite complicated. Instead, we'll only consider, for any given state and action, what the next state could be, and how much profit could be gained before moving to the next state. From these "one-step" state transitions and profits, we can analyze the entire decision problem.
Elements of MDPs
Decision epochs: the times at which decisions may be made. The time between successive decision epochs is referred to as a period. We first consider problems with a finite number N of decision epochs; the Nth decision epoch is a terminal point – no decision is made at it.
States: a state describes all the relevant available information necessary in order to take an optimal action at any given decision epoch. We denote the set of all possible states by S.
Action sets: for each state s ∈ S, the action set A(s) denotes the set of allowable actions that can be taken in state s.
Transition probabilities: for any given state and action, the probabilities of moving (transitioning) to any other state at the next decision epoch. If s is the current state at time t and action a ∈ A(s) is taken, then the probability of transitioning to state s′ is denoted pt(s′ | s, a). We assume Markovian dynamics: transitions depend only on the current state and action.
Elements of MDPs; timeline
Rewards: for any given state and action, the random benefits (or costs) that are incurred before or during the next state transition. The reward received after taking action a in state s at time t and arriving in state s′ is denoted rt(s, a, s′). Note that the random rewards may depend on the next state s′; usually we will only consider the expected reward rt(s, a) = Σ_{s′ ∈ S} rt(s, a, s′) pt(s′ | s, a). There may be terminal rewards rN(s) at the Nth decision epoch.
[Timeline diagram: epochs 1, 2, …, N − 1, N along a time axis. At each epoch n the system is in state sn, action an is taken, and reward rn is received; the system moves to state sn+1 with transition probability pn(sn+1 | sn, an), e.g. p1(s2 | s1, a1), p2(s3 | s2, a2), …, pN−1(sN | sN−1, aN−1). At the final epoch N only the terminal reward rN is received.]
But Who's Counting
http://www.youtube.com/watch?v=KjZJ3TV-MyM
This can be formulated as an MDP:
- States – the unoccupied slots and the number to be placed
- Actions – which unoccupied slot to place the number in
- Rewards – the value of placing the number in the space
- Goal – maximize the expected total reward
MDPs as decision trees
[Diagram, repeated across several slides: an MDP drawn as a decision tree over epochs N − 3, N − 2, N − 1, N, successively highlighting the states (decision nodes), the terminal states, the actions (decision branches), and the rewards/transitions (uncertainty nodes and branches).]
Specifying states
As we mentioned, the state descriptor should provide all the relevant problem information that is necessary for making a decision. Normally, we don't include problem information that doesn't change from epoch to epoch in the state description. For example, in the parking problem, the cost of parking in the pay lot, the total number of parking spaces, etc., are constant at all times, so we don't bother including this information in the state description. We would, however, include information about the status of the current parking space (vacant or occupied).
The number of epochs remaining also changes from epoch to epoch (for finite-horizon problems). However, we often won't include this information in the state description, because it is implicitly present in the specification of the rewards and transition probabilities.
Deterministic dynamic programs
A special type of MDP (MDPs are sometimes also called dynamic programs) is one in which all transition probabilities are either 0 or 1. These are known as deterministic dynamic programs (DDPs). Such problems arise in several applications:
- finding shortest paths in networks
- critical path analysis
- sequential allocation
- inventory problems with known demands
Shortest path through a network
Nodes represent states, arcs represent actions/transitions, and the numbers represent arc lengths/costs.
Goal: find the shortest route from node 1 to node 8.
[Network diagram: nodes 1–8, with arc lengths 1→2 (2), 1→3 (4), 1→4 (3), 2→7 (2), 3→5 (5), 3→6 (6), 3→7 (1), 4→5 (4), 4→6 (5), 5→8 (1), 6→8 (2), 7→8 (6).]
Formulation of shortest path problem
Let u(s) denote the shortest distance from node s to node 8. We compute u(s) just as we did previously for MDPs: for each arc s→s′ out of state s, add the length of the arc to the shortest distance u(s′) from s′ to node 8, and let u(s) be the minimum of these values over all arcs out of s.
u(8) = 0 – "terminal state"
u(7) = 6 + u(8) = 6
u(6) = 2 + u(8) = 2
u(5) = 1 + u(8) = 1
u(4) = min{4 + u(5), 5 + u(6)} = min{4 + 1, 5 + 2} = 5
u(3) = min{5 + u(5), 6 + u(6), 1 + u(7)} = min{5 + 1, 6 + 2, 1 + 6} = 6
u(2) = 2 + u(7) = 2 + 6 = 8
u(1) = min{2 + u(2), 4 + u(3), 3 + u(4)} = min{2 + 8, 4 + 6, 3 + 5} = 8
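The same computation can be written as a short program; the arc list below transcribes the example network.

```python
# arcs[s] lists (successor, length) pairs for the example network.
arcs = {
    1: [(2, 2), (3, 4), (4, 3)],
    2: [(7, 2)],
    3: [(5, 5), (6, 6), (7, 1)],
    4: [(5, 4), (6, 5)],
    5: [(8, 1)],
    6: [(8, 2)],
    7: [(8, 6)],
}

def shortest_paths(arcs, terminal=8):
    """Backward induction: u[s] = shortest distance from s to the terminal node.
    Relies on the nodes being numbered so every arc goes to a higher number."""
    u = {terminal: 0}
    for s in sorted(arcs, reverse=True):
        u[s] = min(length + u[t] for t, length in arcs[s])
    return u

u = shortest_paths(arcs)
```

shortest_paths(arcs) reproduces the values computed above, ending with u[1] = 8.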
Critical path models
A critical path network is a graphical method of analyzing a complex project with many tasks and precedence constraints. In the graph of this network, nodes represent states of completion, and arcs represent tasks to complete. The node from which an arc originates represents the project's state of completion that is needed to begin that task: all other tasks that logically precede the task must be done first. The arcs are labeled with the length of time the tasks require for completion.
The critical path is a list of tasks forming a path through the network from the project start node to the project end node. If the completion of any task on the critical path is delayed, then the overall project must be delayed as well. The critical path is the longest path through the network.
Critical path: launching a new product
It was determined that in order to launch a new product, the following activities needed to be completed:

Activity   Description           Predecessor   Duration
A          Product Design        --            5 mos.
B          Market Research       --            1 mo.
C          Production Analysis   A             2 mos.
D          Product Model         A             3 mos.
E          Sales Brochure        A             2 mos.
F          Cost Analysis         C             3 mos.
G          Product Testing       D             4 mos.
H          Sales Training        B, E          2 mos.
I          Pricing               H             1 mo.
J          Project Report        F, G, I       1 mo.
Critical path: launching a new product
[Network diagram: nodes 1–8, with arcs labeled by activity and duration: A (5), B (1), C (2), D (3), E (2), F (3), G (4), H (2), I (1), J (1).]
Critical path: launching a new product
We use the backward induction algorithm to find the longest path: u(s) = longest path from node s to the project completion node 8. This is not really a decision problem per se, but an illustration of backward induction applied to a network, similar to a DDP.
[Network diagram, with the critical path marked in red and the computed values:
u(8) = 0, u(7) = 1, u(6) = 2, u(5) = 4, u(4) = 5, u(3) = 4, u(2) = max{8, 6, 6} = 8, u(1) = max{13, 5} = 13.]
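The longest-path computation can be coded the same way as the shortest-path DDP, with max in place of min. The arc structure below (activity, successor node, duration in months) is read off the diagram:

```python
# Activity-on-arc network for the product launch; node 8 = completion.
arcs = {
    1: [("A", 2, 5), ("B", 3, 1)],
    2: [("C", 5, 2), ("D", 4, 3), ("E", 3, 2)],
    3: [("H", 6, 2)],
    4: [("G", 7, 4)],
    5: [("F", 7, 3)],
    6: [("I", 7, 1)],
    7: [("J", 8, 1)],
}

def longest_paths(arcs, terminal=8):
    """Backward induction: u[s] = longest path from s to the completion node;
    best[s] = the activity starting the critical path out of s."""
    u, best = {terminal: 0}, {}
    for s in sorted(arcs, reverse=True):
        act, nxt, dur = max(arcs[s], key=lambda arc: arc[2] + u[arc[1]])
        u[s], best[s] = dur + u[nxt], act
    return u, best

u, best = longest_paths(arcs)
path, s = [], 1
while s != 8:                 # walk the critical path from the start node
    path.append(best[s])
    s = next(n for a, n, d in arcs[s] if a == best[s])
```

This gives u[1] = 13 months, with critical path A, D, G, J.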
Backward induction algorithm
1. Set uN(s) = rN(s) for all s ∈ S. Set n = N.
2. Solve, for all s ∈ S:
   un−1(s) = max_{a ∈ A(s)} { rn−1(s, a) + Σ_{j ∈ S} pn−1(j | s, a) un(j) }
3. If n − 1 = 1, stop; otherwise replace n by n − 1 and return to step 2.
Solution to the inventory problem
N = 10, K = 30, c = 20, p = 40; d0 = d1 = d2 = 1/4, d3 = d4 = 1/8, dk = 0 for k > 4 (where dk = P(D = k)).
The Bellman equations to solve for n = 1, …, 365 and s = 0, …, 10 are:

un(s) = max_{a ∈ {0, …, 10 − s}} { −(K·1{a > 0} + c·a) + p·E[min{D, s + a}] + E[un+1(s + a − min{D, s + a})] }
          (ordering costs)           (expected sales revenue)   (expected value of next state)

We set the terminal rewards u366(s) = 0 for all s, and again solve by working backwards for u365(s), …, u1(s).
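A direct implementation of these Bellman equations might look as follows; the model details match the profit formula given earlier, so treat this as a sketch rather than the course's exact code.

```python
K, c, p, N = 30, 20, 40, 10                # fixed order cost, unit cost, price, capacity
d = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.125, 4: 0.125}   # P(D = k)

def solve(days=365):
    """Backward induction for the finite-horizon inventory MDP.
    Returns u_1 and the optimal order quantity order[(day, inventory)]."""
    u_next = [0.0] * (N + 1)               # terminal values u_{days+1}(s) = 0
    order = {}
    for n in range(days, 0, -1):
        u_now = [0.0] * (N + 1)
        for s in range(N + 1):
            best_val, best_a = float("-inf"), 0
            for a in range(N - s + 1):     # can hold at most N dresses
                stock = s + a
                val = -(K * (a > 0) + c * a)        # ordering costs
                for D, prob in d.items():           # sales revenue + next state
                    sales = min(D, stock)
                    val += prob * (p * sales + u_next[stock - sales])
                if val > best_val:
                    best_val, best_a = val, a
            u_now[s], order[(n, s)] = best_val, best_a
        u_next = u_now
    return u_next, order

u1, order = solve()
```

solve() reproduces the order-quantity table that follows: for example, it orders nothing in every state on day 365, and orders 3 in state 0 on day 364.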
Optimal order quantities

                      s
time      0    1    2    3    4   …   10
365       0    0    0    0    0        0
364       3    0    0    0    0        0
363       5    4    0    0    0        0
362       6    5    4    0    0        0
361       8    7    6    0    0        0
360       9    8    7    0    0        0
359      10    9    8    0    0        0
358      10    9    8    0    0        0
357      10    9    8    0    0        0
…
1        10    9    8    0    0        0

The optimal order quantities are listed above. In the last period (day 365) you don't want to order anything. You never order anything if you have 3 or more items in stock. In days 1, …, 359, if you have fewer than 3 items in stock, you "order up" to a full inventory level.
This is known as an (s, S) inventory policy: if your inventory falls below a level s, you order up to level S. This is well known (Scarf, 1959) to be the form of an optimal inventory policy for this problem.
An investment/consumption problem
We consider a (simplified) approach to investment planning for your life:
- You will make M dollars per year until age 65
- Each year, you can choose to spend a portion of this money and invest the rest
- Invested money brings a return of r% per year
- Your utility for spending x dollars per year is log(x/10000)
- You are currently d years old, and you will live to the age of D
- Let cn be the amount of money you consume in year n. We require that cn ≤ wn, your level of wealth in year n (which includes your year-n income)
- Your current level of wealth is wd
- Your lifetime utility is Σ_{n=d}^{D} log(cn/10000)
- The value of any remaining wealth at your death is 0
An investment/consumption problem
We formulate this as a DDP. The equations to solve, working backwards from age D, are of the form

un(w) = max_{0 ≤ c ≤ w} { log(c/10000) + un+1((w − c)(1 + r) + Mn+1) }

where Mn+1 = M if you are still working in year n + 1 (up to age 65) and 0 afterwards.
[Graph: the optimal spending policy, along with total wealth, versus age (40–80) for a problem with the following parameters: d = 40, D = 80, r = 10%, M = $50K, initial wealth w40 = $50K.]
The time value of money
When sequential decisions are made over the course of many months or years, it is often desirable to consider the "time value of money". Receiving a dollar today is worth more to you than receiving a dollar tomorrow, since you have the added option of spending that dollar today. It is customary to "discount" the values of dollars received in the future by an appropriate factor.
Let λ(t) denote the discount factor applied to money received t periods in the future, with 0 < λ(t) < 1. Thus, $x received t periods in the future is worth λ(t)·$x to you now. Typically, we let λ(t) = λ^t for some fixed λ, 0 < λ < 1; the choice of λ depends on the length of the period.
Discount factors in Bellman's equation
The choice of λ(t) = λ^t is convenient because then we can easily include discounting in Bellman's equation:

un(s) = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) un+1(s′) }

We simply apply the discount factor to the expected value of the next state. We regard the expected value of the next state as a "certain equivalent" value of the next decision we will make one period in the future; this certain equivalent value is discounted by λ.
Quick proof (optional): let π(n) = (dn, …, dN−1) denote a policy from epoch n onward. Then

un(s) = max_{π(n)} E^{π(n)} [ Σ_{t=n}^{N} λ^{t−n} rt(st, dt(st)) | sn = s ]
      = max_{π(n)} E^{π(n)} [ rn(s, dn(s)) + λ Σ_{t=n+1}^{N} λ^{t−n−1} rt(st, dt(st)) | sn = s ]
      = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) max_{π(n+1)} E^{π(n+1)} [ Σ_{t=n+1}^{N} λ^{t−n−1} rt(st, dt(st)) | sn+1 = s′ ] }
      = maxa { rn(s, a) + λ Σ_{s′} p(s′ | s, a) un+1(s′) }
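A generic finite-horizon solver with this discounting baked in might look as follows; the function and argument names here are our own, not from the course.

```python
def backward_induction(states, actions, r, p, terminal, N, disc=1.0):
    """Backward induction for u_n(s) = max_a { r_n(s,a) + disc * E[u_{n+1}(s')] }.
    r(n, s, a): expected reward; p(n, s, a): dict {next_state: probability};
    terminal(s): terminal reward r_N(s); disc: per-period discount factor."""
    u = {s: terminal(s) for s in states}     # u_N
    policy = {}
    for n in range(N - 1, 0, -1):            # decision epochs N-1, ..., 1
        u_new = {}
        for s in states:
            def value(a):
                return r(n, s, a) + disc * sum(pr * u[t] for t, pr in p(n, s, a).items())
            best = max(actions(s), key=value)
            u_new[s], policy[(n, s)] = value(best), best
        u = u_new
    return u, policy
```

For instance, a single state earning reward 1 per period over three decision epochs with disc = 0.5 has value 1 + 0.5·(1 + 0.5·1) = 1.75.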
The secretary problem
You are hiring for the position of secretary. You will interview N prospective candidates. After each interview, you can either decide to hire the person, or interview the next candidate. If you don't offer the job to the current interviewee, they go and find work elsewhere, and can no longer be hired.
The goal is to find a decision policy that maximizes the probability of hiring the best person. You don't know the rank order of the candidates; you can only rank the people you've interviewed so far. For example, you know whether the current candidate is better than any of the previous candidates.
The secretary problem
If the current interviewee is not the best one you've seen so far, then the probability that this person is the best is zero. If there are more people to interview, then you might as well – there is at least a chance that the best is yet to come.
If the current interviewee is the best one so far, then you might consider hiring him or her. What is the probability that this person is the best of all? This depends on the number you have interviewed so far: if the nth candidate is the best that you have seen so far, then the probability that this person is the best out of all N candidates is n/N.
The secretary problem
Because the only information relevant to your hiring decision is whether the current person is the best you've seen so far, we let the state be s ∈ {0, 1}, according to whether the current person is the best so far or not.
If you decide to hire the current candidate, the "reward" is the probability that that person is the best of all N. If you decide not to hire the current candidate, there is no immediate reward and you interview the next candidate. The probability that the next (n+1st) person will be the best you've seen so far is 1/(n+1).
Let un(s) be the maximal probability of eventually selecting the best candidate if the current state is s at time n. Then uN+1(s) = 0 for s = 0, 1, and for n = 1, …, N:

un(0) = n·un+1(0)/(n+1) + un+1(1)/(n+1)
un(1) = max{ n·un+1(0)/(n+1) + un+1(1)/(n+1), n/N }
The secretary problem
The optimal policy is of the form "interview the first t candidates, and then hire the first candidate that is better than all the previous ones."
[Graph: the probability of selecting the best candidate, and the optimal proportion t/N of candidates to initially interview, plotted against N from 0 to 50; vertical axis from 0 to 1.]
The graph shows how the probability of selecting the best candidate and the optimal proportion t/N vary with N. Both curves approach 1/e ≈ 0.36788 in the limit.
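The recursion above is easy to run. The sketch below returns the success probability and the earliest interview at which hiring a best-so-far candidate becomes optimal (the cutoff in the threshold policy):

```python
def secretary(N):
    """Backward induction for the secretary problem.
    Returns (probability of selecting the best, first n at which you would hire)."""
    u0, u1 = 0.0, 0.0                            # u_{N+1}(0), u_{N+1}(1)
    cutoff = N
    for n in range(N, 0, -1):
        cont = n * u0 / (n + 1) + u1 / (n + 1)   # value of interviewing on
        if n / N >= cont:
            cutoff = n        # hiring a best-so-far candidate is optimal here
        u0, u1 = cont, max(cont, n / N)
    return u1, cutoff         # the first candidate is always "best so far"

prob, cutoff = secretary(100)
```

For N = 100 this gives a success probability of about 0.371, and both the probability and cutoff/N approach 1/e as N grows, matching the graph.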
Controlled queueing systems
A queue is a model for a congested resource: "jobs" line up in the queue waiting for "service". New jobs arrive at the end of the queue at random times (first-come, first-served), and each job requires a random amount of service. Servers complete job service requirements at a given rate. Example: lining up for a bank teller.
Queueing models are useful for estimating average waiting times, queue lengths, server utilization, etc.
Controlled queueing models
We model N discrete time periods. In each period, at most one job can arrive (with probability λ) and at most one job can complete service (with probability μ); assume μ > λ. Imagine that we can control the rate μ at which the server completes jobs; that is, we can choose μ within the range (λ, 1). There is a cost c(μ) associated with choosing rate μ (c increasing in μ).
- State s = number of jobs in the system
- Total buffer size B; states s ∈ S = {0, …, B+1}
- Reward R for every completed job
- Penalty b for every job blocked due to a full buffer
- Holding cost h(s) depending on the number of jobs in the system (h increasing)
[Diagram: arriving jobs enter the queue buffer; one job at a time is in service at the server; completed jobs depart.]
Controlled queueing models
Timeline: at time n, in state s:
1. Choose rate μn; pay cost c(μn)
2. a new jobs arrive; incur blocking penalty b·max{0, s + a − B − 1}
3. k jobs complete service; receive reward R·k
4. At time n+1 the state is s + min{a, B + 1 − s} − k; incur holding cost h(s + min{a, B + 1 − s} − k)
Controlled queueing models
Optimality equations (arrivals occur with probability λ and completions with probability μ, independently):

0 < s < B + 1:
un(s) = max_μ { −c(μ) + λμ[R − h(s) + un+1(s)] + λ(1−μ)[−h(s+1) + un+1(s+1)] + (1−λ)μ[R − h(s−1) + un+1(s−1)] + (1−λ)(1−μ)[−h(s) + un+1(s)] }

s = 0:
un(0) = max_μ { −c(μ) + λμ[R − h(0) + un+1(0)] + λ(1−μ)[−h(1) + un+1(1)] + (1−λ)[−h(0) + un+1(0)] }

s = B + 1:
un(B+1) = max_μ { −c(μ) − λb + μ[R − h(B) + un+1(B)] + (1−μ)[−h(B+1) + un+1(B+1)] }
'Two-armed bandit' problems
A 'one-armed bandit' is another name for a slot machine: if you play it long enough, it's as good as being robbed. There are two slot machines you can play, and you plan to play the machines N times. Every time you play, you choose which arm to pull. You pay a dollar, and either win $2 or nothing on any given pull. You don't know the probabilities of winning on either machine, but the more you play either machine, the more you learn about its probability of winning.
How do you decide which arms to pull?
'Two-armed bandit' problems
To simplify the problem, suppose you already know that machine 1 has a probability of winning of 0.5; you don't know the probability p of winning on machine 2.
Recall the coin and thumbtack example: you can choose to either flip the coin or the thumbtack in each of N plays. Every time the outcome is heads / pin up, you win $1; otherwise you lose $1. You know the probability of heads is 0.5, but you are unsure of the probability of the tack landing pin up.
The question then becomes if and for how long you should play machine 2. You may suspect that machine 2 has a slightly worse chance of winning. However, it might be worthwhile trying machine 2 for a while to see if it appears to be better than machine 1. If machine 2 doesn't appear to be better, then you can revert to machine 1 and continue playing that until the end. As long as you play machine 1, you learn nothing about p.
'Two-armed bandit' problems
Let the prior probability for p be P(p = x), where x can be any value in {0, 0.01, 0.02, …, 0.99, 1}. Suppose at some point in time you have played machine 2 a total of n times, and you have won k times out of those n. It is possible to show that the posterior probabilities for p can be determined just from knowing n and k: we don't need to know the entire sequence of wins and losses, only the totals.
Denote the posterior as P(p = x | n, k). By Bayes' rule,

P(p = x | n, k) ∝ P(k wins out of n | p = x) · P(p = x) = (n! / k!(n−k)!) · x^k · (1−x)^(n−k) · P(p = x)

Let q(n, k) denote the probability that you assign to winning on the next play of machine 2, after observing k wins out of n plays:

q(n, k) = Σx x · P(p = x | n, k)
'Two-armed bandit' problems
We can formulate the problem as the following MDP:
- States: s = (n, k), where n ∈ {0, 1, …, N} and k ∈ {0, 1, …, n}
- Actions: a ∈ {1, 2} – which machine you play
- Rewards: r((n, k), 1) = 0.5(1) + 0.5(−1) = 0; r((n, k), 2) = q(n, k)(1) + (1 − q(n, k))(−1) = 2q(n, k) − 1; all terminal rewards are 0
- Transitions: p((n, k) | (n, k), 1) = 1; p((n+1, k+1) | (n, k), 2) = q(n, k); p((n+1, k) | (n, k), 2) = 1 − q(n, k)
Optimality equations:

ut(n, k) = max{ ut+1(n, k), 2q(n, k) − 1 + q(n, k)·ut+1(n+1, k+1) + (1 − q(n, k))·ut+1(n+1, k) }
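A small-horizon sketch of this MDP (the grid prior matches the slides; N = 20 here is purely to keep the state space tiny):

```python
from math import comb
from functools import lru_cache

N = 20                                   # horizon, kept small for illustration
xs = [i / 100 for i in range(101)]       # support of the prior {0, 0.01, ..., 1}
prior = [1 / len(xs)] * len(xs)          # uniform prior on p

def q(n, k):
    """Predictive win probability on machine 2 after k wins in n plays."""
    post = [comb(n, k) * x**k * (1 - x)**(n - k) * pr for x, pr in zip(xs, prior)]
    return sum(x * w for x, w in zip(xs, post)) / sum(post)

@lru_cache(maxsize=None)
def u(t, n, k):
    """Optimal expected total reward from play t onward, given record (n, k)."""
    if t > N:
        return 0.0
    qq = q(n, k)
    stay = u(t + 1, n, k)                # machine 1: expected reward 0, no learning
    explore = (2 * qq - 1) + qq * u(t + 1, n + 1, k + 1) + (1 - qq) * u(t + 1, n + 1, k)
    return max(stay, explore)
```

With the symmetric prior, q(0, 0) = 0.5, so a single play of machine 2 breaks even in expectation; u(1, 0, 0) is nevertheless positive for N ≥ 2, which is precisely the value of exploration.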
'Two-armed bandit' problems
A sample optimal policy is pictured at right.
[Figure: the optimal policy at time 100 for N = 1000, a uniform prior, and an applied discount factor λ = 0.98; x axis = n, y axis = k. Red region ⇒ play machine 1; green region ⇒ play machine 2; blue region ⇒ infeasible.]
'Two-armed bandit' problems
Bandit problems are canonical models for sequential choice problems: research project selection, oil exploration, clinical trials, sequential search, etc.
Bandit problems also capture a fundamental dilemma in problems of incomplete information: how best to balance learning with maximizing reward – the "exploration/exploitation" trade-off.
Structured policies
One of the advantages of the MDP framework is that it is often possible to prove that the optimal policy for a given problem type has a special structure, for example:
- threshold policies of the sort we saw in the inventory problem: if s ≤ s*, order up to level S; if s > s*, order 0
- monotone policies such as are optimal for the queueing problem: the larger the value of s (the more jobs in the system), the higher the service rate μ you should choose
Establishing such structure for the optimal policy is desirable because it provides general managerial insight into the problem, a simple decision rule can be easier to implement, and computation of the optimal policy can often be simplified.
Monotone optimal policies
Monotone optimal policies can occur when you have ordered state and action spaces: states and actions correspond to numbers according to a natural ordering. For example, in the queueing problem, the number of jobs in the system and the service rate used are both ordered quantities.
Denote the optimal action to take in state s as d*(s). A policy is monotone (nondecreasing) if s′ > s implies d*(s′) ≥ d*(s) for all s, s′, or monotone (nonincreasing) if s′ > s implies d*(s′) ≤ d*(s) for all s, s′.
Superadditive functions
A function f(x, y) is said to be superadditive if, for all x−, x+, y−, y+ such that x− < x+ and y− < y+,

f(x+, y+) + f(x−, y−) ≥ f(x+, y−) + f(x−, y+)

Reversing the inequality yields the definition of a subadditive function.
[Diagram: the four points (x−, y−), (x+, y−), (x−, y+), (x+, y+) at the corners of a rectangle; f(x−, y−) and f(x+, y+) are joined by a dashed line, and f(x+, y−) and f(x−, y+) by a solid line. f is a superadditive function if the sum of the quantities joined by the dashed line exceeds the sum of the quantities joined by the solid line.]
Superadditivity and monotone policies
Recall Bellman's equation:

un(s) = maxa { rn(s, a) + Σ_{s′} pn(s′ | s, a) un+1(s′) }

Define w(s, a) = rn(s, a) + Σ_{s′} pn(s′ | s, a) un+1(s′), so we may write un(s) = maxa w(s, a).
If w(s, a) is a superadditive (subadditive) function, then the optimal policy d*(s) will be monotone increasing (decreasing) in the state s. The book (Sec. 4.7) provides several tests to establish that w(s, a) is superadditive.
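For a concrete check of the definition, here is a brute-force test over a finite grid (a utility of our own, not one of the book's tests):

```python
def is_superadditive(f, xs, ys):
    """Check f(x+,y+) + f(x-,y-) >= f(x+,y-) + f(x-,y+) for all x- < x+, y- < y+.
    xs and ys must be sorted in increasing order."""
    return all(f(x2, y2) + f(x1, y1) >= f(x2, y1) + f(x1, y2)
               for i, x1 in enumerate(xs) for x2 in xs[i + 1:]
               for j, y1 in enumerate(ys) for y2 in ys[j + 1:])

grid = [0, 1, 2, 3]
# f(x, y) = x*y is superadditive: the cross-difference equals (x+ - x-)(y+ - y-) >= 0,
# while -x*y reverses the inequality and is subadditive.
assert is_superadditive(lambda x, y: x * y, grid, grid)
assert not is_superadditive(lambda x, y: -x * y, grid, grid)
```

The same routine can be pointed at a tabulated w(s, a) to check numerically whether a monotone optimal policy is to be expected.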