
Page 1:

Online Sampling

for

Markov Decision Processes

Bob Givan

Joint work w/ E. K. P. Chong, H. Chang, G. Wu

Electrical and Computer Engineering

Purdue University

Page 2:

Bob Givan, Electrical and Computer Engineering, Purdue University
November 4-9, 2001

Markov Decision Process (MDP)

Ingredients:
- System state x in state space X
- Control action a in A(x)
- Reward R(x,a)
- State-transition probability P(x,y,a)

Goal: find a control policy that maximizes the objective function.
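The ingredients above can be written out directly in code. The two-state chain, its rewards, and its transition table below are invented for illustration (they are not from the talk):

```python
import random

# A made-up two-state MDP with the slide's ingredients: state space X,
# allowed actions A(x), reward R(x, a), transitions P(x, y, a).
X = ["low", "high"]
A = {"low": ["wait", "boost"], "high": ["wait"]}

REWARD = {("low", "wait"): 0.0, ("low", "boost"): -1.0,
          ("high", "wait"): 2.0}

TRANSITION = {("low", "wait"):  {"low": 0.9, "high": 0.1},
              ("low", "boost"): {"low": 0.2, "high": 0.8},
              ("high", "wait"): {"low": 0.3, "high": 0.7}}

def R(x, a):
    return REWARD[(x, a)]

def P(x, y, a):
    return TRANSITION[(x, a)].get(y, 0.0)

def step(x, a, rng=random):
    # Sample a next state y ~ P(x, ., a) and collect the reward.
    y = rng.choices(X, weights=[P(x, y2, a) for y2 in X])[0]
    return y, R(x, a)
```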

Page 3:

Optimal Policies
- Policy: a mapping from state and time to actions
- Stationary policy: a mapping from state to actions
- Goal: a policy maximizing the objective function

  VH*(x0) = max Obj[R(x0,a0), ..., R(xH-1,aH-1)]

  where the "max" is over all policies u = u0, ..., uH-1
- For large H, the optimal first action a0 is independent of H (with an ergodicity assumption)
- Stationary optimal action a0 for H = infinity, via receding-horizon control

Page 4:

Q Values

Fix a large H and focus on the finite-horizon reward.

Define Q(x,a) = R(x,a) + E[VH-1*(y)], the "utility" of action a at state x, called the Q-value of action a at state x.

Key identities (Bellman's equations):

VH*(x) = maxa Q(x,a)

u0*(x) = argmaxa Q(x,a)
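These identities can be exercised on a toy MDP (dynamics invented for illustration): back up Bellman's equation H times to get VH*, then take the greedy argmax for the optimal first action.

```python
# Bellman backup on a made-up two-state MDP.  V_0* = 0; each backup
# applies V_H*(x) = max_a Q(x, a), with Q(x, a) = R(x, a) + E[V_{H-1}*(y)].
X = ["low", "high"]
A = {"low": ["wait", "boost"], "high": ["wait"]}
R = {("low", "wait"): 0.0, ("low", "boost"): -1.0, ("high", "wait"): 2.0}
P = {("low", "wait"):  {"low": 0.9, "high": 0.1},
     ("low", "boost"): {"low": 0.2, "high": 0.8},
     ("high", "wait"): {"low": 0.3, "high": 0.7}}

def q_value(x, a, V):
    # Q(x, a) = R(x, a) + E[V(y)]
    return R[(x, a)] + sum(p * V[y] for y, p in P[(x, a)].items())

def optimal_values(H):
    V = {x: 0.0 for x in X}                       # V_0* = 0
    for _ in range(H):
        V = {x: max(q_value(x, a, V) for a in A[x]) for x in X}
    return V

def optimal_first_action(x, H):
    # u_0*(x) = argmax_a Q(x, a), where Q is built on V_{H-1}*
    V = optimal_values(H - 1)
    return max(A[x], key=lambda a: q_value(x, a, V))
```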

Page 5:

Solution Methods

Recall:

u0*(x) = argmaxa Q(x,a)

Q(x,a) = R(x,a) + E[VH-1*(y)]

Problems:
- The Q-value depends on the optimal policy.
- The state space is extremely large (often continuous).

Two-pronged solution approach:
- Apply a receding-horizon method
- Estimate Q-values via simulation/sampling

Page 6:

Methods for Q-value Estimation

Previous work by other authors:
- Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
- Policy rollout (lower bound) [Bertsekas & Castanon, 1999]

Our techniques:
- Hindsight optimization (upper bound)
- Parallel rollout (lower bound)

Page 7:

Expectimax Tree for V*

[Figure: an expectimax tree alternating Max (action-choice) levels and Exp (next-state) levels down to horizon H; with k actions and n possible next states per level, the tree has (kn)^H leaves.]

Page 8:

Unbiased Sampling

[Figure: the same expectimax tree truncated to sampling depth Hs, with each Exp node sampling only C successor states, giving (kC)^Hs leaves instead of the full tree's (kn)^H.]

Page 9:

Unbiased Sampling (Cont'd)

For a given desired accuracy, how large should the sampling width and depth be?

Answered by Kearns, Mansour, and Ng (1999).

Requires prohibitive sampling width and depth: e.g., C on the order of 10^8 and Hs > 60 to distinguish the "best" and "worst" policies in our scheduling domain.

We evaluate with smaller width and depth.
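The Kearns-Mansour-Ng estimator itself is simple to sketch: each Exp node is approximated by an average over C sampled successors, recursing to depth Hs. The MDP accessors and the toy deterministic chain below are stand-ins, not the scheduling domain:

```python
# Sparse-sampling (unbiased) Q estimator in the style of Kearns,
# Mansour, and Ng (1999): average over C sampled successors at each
# Exp node, recursing to the given depth.
def q_estimate(x, a, depth, C, actions, reward, sample_next, rng):
    if depth == 0:
        return reward(x, a)
    total = 0.0
    for _ in range(C):
        y = sample_next(x, a, rng)
        total += max(
            q_estimate(y, b, depth - 1, C, actions, reward, sample_next, rng)
            for b in actions(y)
        )
    return reward(x, a) + total / C

# Toy deterministic chain: actions move the integer state, and the
# reward is the chosen direction, so deeper lookahead favors +1.
actions = lambda x: (+1, -1)
reward = lambda x, a: float(a)
sample_next = lambda x, a, rng: x + a
```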

Page 10:

How to Look Deeper?

[Figure: the truncated expectimax tree again, but now with a tiny sampling depth Hs and tiny sampling width C, giving only (kC)^Hs leaves; the question is how to look deeper than such a small tree allows.]

Page 11:

Policy Roll-out

[Figure: an expectimax tree in which, below the root, every Max node keeps only the action selected by the base policy u and prunes the rest; the action selected at the root is that of the improved policy PI(u).]

Page 12:

Policy Rollout in Equations

Write VHu(y) for the value of following policy u for H steps.

Recall: Q(x,a) = R(x,a) + E[VH-1*(y)]
              = R(x,a) + E[maxu VH-1u(y)]

Given a base policy u, use

R(x,a) + E[VH-1u(y)]

as a lower-bound estimate of the Q-value.

The resulting policy is PI(u), given infinite sampling.
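A sketch of the rollout estimator in code, assuming a generic simulator interface `step(x, a, rng)`; the toy chain at the bottom is invented for illustration:

```python
# Policy-rollout sketch: estimate Q(x, a) by taking a once, then
# following a fixed base policy u for the remaining H-1 steps,
# averaging the return over n_traj simulated trajectories.  In
# expectation this lower-bounds the true Q-value, since u is
# generally suboptimal.
def rollout_q(x, a, base_policy, step, H, n_traj, rng):
    total = 0.0
    for _ in range(n_traj):
        y, r = step(x, a, rng)       # take the candidate action once
        ret = r
        for _ in range(H - 1):       # then follow the base policy
            y, r = step(y, base_policy(y), rng)
            ret += r
        total += ret
    return total / n_traj

# Toy chain: action +1 pays 1, action 0 pays 0 (rng unused here).
def step(x, a, rng):
    return x + a, float(a)

base_policy = lambda x: 1            # always move right
```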

Page 13:

Policy Roll-out (cont'd)

[Figure: evaluating the rolled-out tree by sampling: for each of the k actions at the root, a small number C' of trajectories following the base policy u are simulated over horizon H, and the sampled returns Vu are averaged to estimate each action's Q-value, yielding VPI(u)(x).]

Page 14:

Parallel Policy Rollout

A generalization of policy rollout, due to [Chang, Givan, and Chong, 2000].

Given a set U of base policies, use

R(x,a) + E[maxu∊U VH-1u(y)]

as an estimate of the Q-value.

- More accurate estimate than policy rollout
- Still gives a lower bound on the true Q-value
- Still gives a policy no worse than any in U
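A sketch of the parallel-rollout estimate against the same hypothetical simulator interface. Note one simplification: taking the max over single-trajectory returns only approximates E[maxu Vu], so a real implementation would average several trajectories per base policy:

```python
# Parallel-rollout sketch in the spirit of [Chang, Givan, and Chong,
# 2000]: R(x,a) + E[max over base policies u of V^u(y)].  Each sampled
# next state is rolled out once per base policy and the best return is
# kept.  `step` and `policies` below are toy stand-ins.
def parallel_rollout_q(x, a, policies, step, H, n_samples, rng):
    total = 0.0
    for _ in range(n_samples):
        y0, r0 = step(x, a, rng)          # take the candidate action once
        best = float("-inf")
        for u in policies:                # roll out each base policy
            y, ret = y0, 0.0
            for _ in range(H - 1):
                y, r = step(y, u(y), rng)
                ret += r
            best = max(best, ret)         # keep the best base policy
        total += r0 + best
    return total / n_samples

# Toy chain: moving right (+1) pays 1 per step, staying (0) pays 0.
def step(x, a, rng):
    return x + a, float(a)

policies = [lambda x: 0, lambda x: 1]
```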

Page 15:

Hindsight Optimization – Tree View

[Figure: the expectimax tree is transformed by pulling the Exp nodes up to the root and combining the Max nodes below them, so that each sampled realization of the randomness becomes a single deterministic maximization problem.]

Page 16:

Hindsight Optimization – Equations

Swap Max and Exp in the expectimax tree.

Solve each offline optimization problem: O(kC'f(H)) time, where f(H) is the offline problem complexity.

Jensen's inequality implies an upper bound:

ṼH(x) = E[ max over a0,...,aH-1 of Σi=0..H-1 R(xi, ai) ]

VH*(x) = maxa { R(x,a) + Ey[ VH-1*(y) ] }

Ṽ ≥ V*
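A sketch of the resulting estimator: sample complete randomness traces, solve each resulting deterministic H-step problem exactly, and average. The coin-flip toy problem below is invented to make the Jensen gap visible: in hindsight every trace is worth H, while no online policy can expect more than H/2.

```python
import random

# Hindsight-optimization sketch on an invented toy problem: at each
# step a fair coin decides which of two actions pays 1.  With the coin
# flips fixed (hindsight), the paying action is always known, so the
# deterministic offline optimum of every trace is exactly H.
def sample_trace(H, rng):
    return [rng.choice([0, 1]) for _ in range(H)]

def offline_optimum(trace):
    # Pick the paying action at every step of the fixed trace.
    return float(len(trace))

def hindsight_value(H, n_traces, rng):
    # Average of the per-trace offline optima: an upper bound on V*.
    return sum(offline_optimum(sample_trace(H, rng))
               for _ in range(n_traces)) / n_traces
```

An online policy must act before seeing each coin, so its expected value is at most H/2 here; the gap is exactly the Jensen inequality Ṽ ≥ V* above.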

Page 17:

Hindsight Optimization (Cont'd)

[Figure: after the swap, each of C' sampled futures (C' much smaller than the full sampling requirement) requires selecting the best action sequence over horizon H-1 from k^(H-1) choices: a deterministic, offline optimization problem.]

Page 18:

Application to Example Problems

Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
- Multi-class deadline scheduling
- Random early dropping
- Congestion control

Page 19:

Basic Approach

A traffic model provides a stochastic description of possible future outcomes.

Method:
- Formulate network decision problems as POMDPs by incorporating the traffic model
- Solve the belief-state MDP online using sampling (choose the time-scale to allow for computation time)

Page 20:

Domain 1: Deadline Scheduling

Objective: minimize weighted loss.

[Figure: multiclass traffic with deadlines and class weights w1, w2, w3, ..., w7 arrives at a scheduler; each packet is either served or dropped.]

Page 21:

Domain 2: Random Early Dropping

Objective: minimize delay without sacrificing throughput.

[Figure: traffic sources 1-4 feed a single server queue.]

Page 22:

Domain 3: Congestion Control

Objective: optimize delay, throughput, loss, and fairness.

[Figure: fully controlled sources S1-S3, subject to control delays d1, d2, d3, and a source S0 of high-priority cross traffic share a bottleneck node in the paths to G2.]

Page 23:

Traffic Modeling

A Hidden Markov Model (HMM) for each source. Note: the state is hidden, so the model is partially observed.

[Figure: a 3-state example model with transition probabilities (.07, .08, .02, .01, .02, .06) among states 0, 1, and 2, and per-state traffic-generation probabilities, e.g. one state emits 2 packets w.p. .25, 1 packet w.p. .25, 0 packets w.p. .50; another emits 1 packet w.p. .90, 0 packets w.p. .10; another emits 1 packet w.p. .98, 0 packets w.p. .02.]
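A traffic source of this flavor can be sketched as follows; the two-state model and its numbers are invented for illustration, not the slide's 3-state example:

```python
import random

# Sketch of an HMM traffic source: the hidden state evolves as a
# Markov chain, and the observed packet count in each time slot
# depends only on the current hidden state.  All numbers illustrative.
TRANS = {"idle": {"idle": 0.9, "busy": 0.1},
         "busy": {"idle": 0.2, "busy": 0.8}}
EMIT  = {"idle": {0: 0.9, 1: 0.1},             # mostly silent
         "busy": {0: 0.1, 1: 0.5, 2: 0.4}}     # bursty

def simulate(n_slots, rng, state="idle"):
    """Return the observable packet counts for n_slots time slots."""
    packets = []
    for _ in range(n_slots):
        counts, probs = zip(*EMIT[state].items())
        packets.append(rng.choices(counts, weights=probs)[0])
        nxt, p = zip(*TRANS[state].items())
        state = rng.choices(nxt, weights=p)[0]
    return packets
```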

Page 24:

Deadline Scheduling Results

Non-sampling policies:
- EDF: earliest deadline first. Deadline-sensitive, class-insensitive.
- SP: static priority. Deadline-insensitive, class-sensitive.
- CM: current minloss [Givan et al., 2000]. Deadline- and class-sensitive; minimizes weighted loss for the current packets.

Page 25:

Deadline Scheduling Results

Objective: minimize weighted loss.

Comparison:
- Non-sampling policies
- Unbiased sampling (Kearns et al.)
- Hindsight optimization
- Rollout with CM as base policy
- Parallel rollout

Results due to H. S. Chang.

Page 26:

Deadline Scheduling Results

Page 27:

Deadline Scheduling Results

Page 28:

Deadline Scheduling Results

Page 29:

Random Early Dropping Results

Objective: minimize delay subject to a throughput loss-tolerance.

Comparison:
- Candidate policies: RED and "buffer-k"
- KMN-sampling
- Rollout of buffer-k
- Parallel rollout
- Hindsight optimization

Results due to H. S. Chang.

Page 30:

Random Early Dropping Results

Page 31:

Random Early Dropping Results

Page 32:

Congestion Control Results

MDP objective: minimize a weighted sum of throughput, delay, and loss-rate. Fairness is hard-wired.

Comparisons:
- PD-k (proportional-derivative control with target queue length k)
- Hindsight optimization
- Rollout of PD-k == parallel rollout

Results due to G. Wu; work in progress.

Page 33:

Congestion Control Results

Page 34:

Congestion Control Results

Page 35:

Congestion Control Results

Page 36:

Congestion Control Results

Page 37:

Results Summary
- Unbiased sampling cannot cope.
- Parallel rollout wins in 2 domains; it is not always equal to simple rollout of one base policy.
- Hindsight optimization wins in 1 domain.
- Simple policy rollout is the cheapest method:
  - Poor in domain 1
  - Strong in domain 2 with the best base policy, but how do we find that policy?
  - So-so in domain 3 with any base policy

Page 38:

Talk Summary
- A case study of MDP sampling methods.
- New methods offering practical improvements: parallel policy rollout and hindsight optimization.
- Systematic methods for using traffic models to help make network control decisions.
- Feasibility of real-time implementation depends on the problem timescale.

Page 39:

Ongoing Research

Apply to other control problems (different timescales):
- Admission/access control
- QoS routing
- Link bandwidth allotment
- Multiclass connection management
- Problems arising in proxy-services
- Diagnosis and recovery

Page 40:

Ongoing Research (Cont'd)

Alternative traffic models:
- Multi-timescale models
- Long-range dependent models
- Closed-loop traffic
- Fluid models

Also: learning the traffic model online

Page 41:

Congestion Control (Cont’d)

Page 42:

Congestion Control Results

Page 43:

Hindsight Optimization (Cont'd)

[Figure: block diagram. A traffic simulation turns the state estimate into sampled traffic traces; the hindsight optimizer computes a hindsight-optimal value for each trace; averaging these yields a Q-value estimate for each candidate action; action selection then outputs the selected action.]

Page 44:

Policy Rollout (Cont'd)

[Figure: the same block diagram, with the hindsight optimizer replaced by an action evaluator measuring base-policy performance: the sampled traffic traces are evaluated by simulating the base policy, and the averaged returns give the Q-value estimate used for action selection.]

Page 45:

Receding-horizon Control

For a large horizon H, the optimal policy is approximately stationary.

At each time, if the state is x, apply the action

u*(x) = argmaxa Q(x,a)
      = argmaxa R(x,a) + E[VH-1*(y)]

Compute an estimate of the Q-value at each time step.
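The receding-horizon loop itself is a few lines, given any Q-value estimator (rollout, parallel rollout, hindsight optimization, ...). The estimator and toy dynamics below are hypothetical stand-ins:

```python
# Receding-horizon control loop sketch: at every decision epoch,
# estimate Q(x, a) for each available action, apply the greedy
# argmax, observe the next state, and repeat.
def receding_horizon_control(x0, actions, estimate_q, step, n_steps):
    x, history = x0, []
    for _ in range(n_steps):
        a = max(actions(x), key=lambda a: estimate_q(x, a))
        history.append((x, a))
        x = step(x, a)
    return history

# Toy instance: the estimator prefers whichever action moves the
# integer state toward 0.
actions = lambda x: [-1, +1]
estimate_q = lambda x, a: -abs(x + a)
step = lambda x, a: x + a
```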

Page 46:

Congestion Control (Cont’d)

Page 47:

Domain 3: Congestion Control
- High-priority traffic: open-loop controlled
- Low-priority traffic: closed-loop controlled
- Resources: bandwidth and buffer
- Objective: optimize throughput, delay, loss, and fairness

[Figure: high-priority and best-effort traffic streams share a bottleneck node.]

Page 48:

Congestion Control Results

Page 49:

Congestion Control Results

Page 50:

Congestion Control Results

Page 51:

Congestion Control Results