
Max-norm Projections for Factored MDPs

Carlos Guestrin, Daphne Koller (Stanford University)

Ronald Parr (Duke University)

Motivation

MDPs: plan over atomic system states;

Policy: specifies an action at every state;

Polytime algorithms for finding the optimal policy.

Number of states is exponential in the number of state variables.

Motivation: BNs meet MDPs

Real-world MDPs have: hundreds of variables; googols of states.

Can we exploit problem specific structure?

For representation; For planning.

Goal: merge BNs and MDPs for efficient computation.

Factored MDPs [Boutilier et al.]

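Total reward is a sum of sub-rewards: $R = R_1 + R_2$

[Figure: two-slice dynamic Bayesian network with state variables X, Y, Z at time t, their successors X', Y', Z' at time t+1, and reward nodes R1 and R2.]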

Actions only change small parts of model.

Value function: Value of policy starting at state s.

Exploiting Structure

Structured value function approach [Boutilier et al. '95]: collapse the value function using a tree; works well only when many states have the same value.

[Figure: value-function tree that branches on X and then on Z, with leaf values 3, 5 and 9.]

Model structure may imply structured value function;

Decomposable Value Functions

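Each $h_i$ is the status of some small part(s) of a complex system: status of a machine; inventory of a store.

Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00]:

$$\tilde{V}(s) = \sum_i w_i h_i(s)$$

In matrix form, $\tilde{V} = A w$, where $A$ has one row per state ($2^n$ states) and one column per basis function ($k$ basis functions):

$$A = \begin{pmatrix} h_1(s_1) & h_2(s_1) & \cdots \\ h_1(s_2) & h_2(s_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$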
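As a rough illustration of the matrix form, the sketch below builds A for a tiny system of n = 3 binary variables and two restricted-domain basis functions, then evaluates A w. The variable count, the basis functions `h1` and `h2`, and the weights are all assumptions chosen for the example, not values from the slides.

```python
from itertools import product

# Hypothetical example: n = 3 binary state variables, so 2**3 = 8 states.
n = 3
states = list(product([0, 1], repeat=n))

# Assumed restricted-domain basis functions: each depends on one variable only.
def h1(s):
    return float(s[0])      # status of the first component

def h2(s):
    return float(s[1])      # status of the second component

basis = [h1, h2]

# A has one row per state and one column per basis function.
A = [[h(s) for h in basis] for s in states]

def approx_value(A, w):
    """Return the vector A w, i.e. V~(s) = sum_i w_i * h_i(s) for every state."""
    return [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) for row in A]

w = [2.0, 0.5]              # assumed weights
V = approx_value(A, w)
```

Building A explicitly is only feasible for toy sizes like this one, since the number of rows doubles with every added state variable.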

Our Approach

Embed structure into the value function space a priori:

Project into a structured vector space of factored value functions;

Efficiently find the closest approximation to the "true" value.

$$\tilde{V} = \sum_k w_k h_k$$

Linear combination of structured features

Policy Iteration

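Guess $V$

$\pi$ = greedy($V$)

$V$ = value of acting on $\pi$

Value of acting on $\pi$:

$$V = R_\pi + \gamma P_\pi V$$

Here $V$ (value) and $R_\pi$ (reward) are $2^n \times 1$ vectors, $P_\pi$ is a $2^n \times 2^n$ matrix, and $\gamma P_\pi V$ is the discounted expected value.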
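For a fixed policy, the value equation is a linear system in V. A minimal sketch, assuming a made-up two-state chain, applies the right-hand side repeatedly until it stops changing. Fixed-point iteration is swapped in here for a direct linear solve purely to keep the example dependency-free; the transition matrix, rewards and discount are illustrative assumptions.

```python
def evaluate_policy(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V <- R + gamma * P V until the update is smaller than tol.

    P is the transition matrix under the fixed policy (rows sum to 1),
    R is the reward vector, and 0 <= gamma < 1 is the discount factor.
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(max_iters):
        V_new = [R[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

# Assumed toy problem: state 0 pays reward 1, state 1 pays nothing.
P = [[0.9, 0.1],
     [0.5, 0.5]]
R = [1.0, 0.0]
gamma = 0.9
V = evaluate_policy(P, R, gamma)
```

Either way, the cost grows with the number of explicit states, which is the bottleneck the factored approach targets.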

Approximate Policy Iteration

Guess $w_0$

$\pi_t$ = greedy($A w_t$)

$A w_{t+1} \approx$ value of acting on $\pi_t$

Approximate value determination: $A w \approx R_\pi + \gamma P_\pi A w$

Approximate Value Determination

Need a projection of the value function into the space of the basis functions ($L_d$ projection):

$$A w \approx R_\pi + \gamma P_\pi A w$$

$$w = \arg\min_w \left\| A w - (R_\pi + \gamma P_\pi A w) \right\|_d$$

Previous work uses $L_2$ and weighted-$L_2$ projections [Koller & Parr '99, '00].

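Analysis of Approx. PI

Theorem:

$$\left\| A w_t - V^* \right\|_\infty \le \gamma^t \left\| A w_0 - V^* \right\|_\infty + \frac{2 \gamma \beta_P}{(1 - \gamma)^2};$$

$$\beta_P = \max_{\tau = 1, \ldots, t} \left\| A w_\tau - (R_{\pi_\tau} + \gamma P_{\pi_\tau} A w_\tau) \right\|_\infty .$$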

We should be doing projections in Max-norm!

$$w = \arg\min_w \left\| (A - \gamma P_\pi A) w - R_\pi \right\|_\infty$$

Approximate PI: Revisited

Guess $w_0$

Approximate value determination: $A w \approx R_\pi + \gamma P_\pi A w$

Analysis motivating projections in max-norm;

Efficient algorithm for max-norm projection.

Efficient Max-norm Projection

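Computing max-norm for fixed weights;

Cost networks;

Efficient max-norm projection.

The approximation $A w \approx R_\pi + \gamma P_\pi A w$ becomes

$$w^* = \arg\min_w \left\| H w - b \right\|_\infty ,$$

with $H = A - \gamma P_\pi A$ and $b = R_\pi$.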

Max over Large State Spaces

For fixed weights $w$, compute max-norm:

$$\phi = \left\| H w - b \right\|_\infty = \max_s \left| \sum_i w_i h_i(s) - b(s) \right|$$

However, if basis and target are functions of only a few variables, we can do it efficiently!
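For reference, the naive computation enumerates every state, which is exactly what becomes infeasible as the number of variables grows. A sketch with an assumed toy problem (the basis functions, target `b` and weights are hypothetical):

```python
from itertools import product

def max_norm_error(basis, b, w, n):
    """Return max_s |sum_i w_i h_i(s) - b(s)| by enumerating all 2**n states."""
    return max(abs(sum(w_i * h(s) for w_i, h in zip(w, basis)) - b(s))
               for s in product([0, 1], repeat=n))

# Assumed toy problem over n = 3 binary variables.
n = 3
basis = [lambda s: 1.0, lambda s: float(s[0])]   # constant feature, first variable
b = lambda s: 2.0 * s[0] + 0.5 * s[1]            # hypothetical target function
phi = max_norm_error(basis, b, w=[0.0, 2.0], n=n)
```

Every function here touches at most two variables, which is the kind of structure that lets the maximization be factored instead of enumerated.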

Cost Networks can maximize over large state spaces efficiently when the function is factored:

$$\max_{X_1, \ldots, X_n} \sum_i f_i(C_i), \quad \text{where } C_i \subseteq \{X_1, \ldots, X_n\}$$





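Cost Networks

Can use variable elimination to maximize over state space [Bertele & Brioschi '72]:

$$\begin{aligned}
& \max_{A,B,C,D} \left[ f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D) \right] \\
&\quad = \max_{A,B,C} \left[ f_1(A,B) + f_2(A,C) + \max_D \left[ f_3(C,D) + f_4(B,D) \right] \right] \\
&\quad = \max_{A,B,C} \left[ f_1(A,B) + f_2(A,C) + g_1(B,C) \right]
\end{aligned}$$

[Figure: cost network over the variables A, B, C, D, with edges for f1 (A, B), f2 (A, C), f3 (C, D) and f4 (B, D).]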

As in Bayes nets, maximization is exponential in size of largest factor.

Here we need only 16, instead of 64 sum operations.
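The elimination step can be checked numerically. The sketch below assumes binary variables and fills the four local tables with random numbers (both are assumptions for illustration), then compares full enumeration against eliminating D first.

```python
from itertools import product
import random

random.seed(0)
vals = [0, 1]  # binary variables, an assumption of this example

# Hypothetical local cost tables, one entry per joint assignment of the two arguments.
def random_table():
    return {(x, y): random.random() for x in vals for y in vals}

# f1(A, B), f2(A, C), f3(C, D), f4(B, D)
f1, f2, f3, f4 = (random_table() for _ in range(4))

def brute_force_max():
    """Enumerate all joint assignments to (A, B, C, D)."""
    return max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
               for a, b, c, d in product(vals, repeat=4))

def eliminate_d_max():
    """Eliminate D first: g1(B, C) = max_D [f3(C, D) + f4(B, D)]."""
    g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in vals)
          for b in vals for c in vals}
    return max(f1[a, b] + f2[a, c] + g1[b, c]
               for a, b, c in product(vals, repeat=3))
```

Both functions return the same maximum (up to floating-point rounding); the second never builds a table over all four variables at once.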





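Max-norm Projection

Algorithm for finding $w^* \in \arg\min_w \left\| H w - b \right\|_\infty$.

Solve by Linear Programming [Cheney '82]:

Variables: $w_1, \ldots, w_k, \phi$;

Minimize: $\phi$;

Subject to:

$$\phi \ge \max_s \left( \sum_{i=1}^{k} w_i h_i(s) - b(s) \right) \quad \text{and} \quad \phi \ge \max_s \left( b(s) - \sum_{i=1}^{k} w_i h_i(s) \right).$$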
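A small worked instance may help fix ideas. Assume, purely for illustration, three states, a single constant basis function $h_1(s) = 1$, and target $b = (0, 0, 3)$. Writing out one pair of constraints per state gives:

```latex
\begin{aligned}
\text{minimize }   \quad & \phi \\
\text{subject to } \quad & \phi \ge w_1 - 0, \qquad \phi \ge 0 - w_1, \\
                         & \phi \ge w_1 - 3, \qquad \phi \ge 3 - w_1 .
\end{aligned}
```

The binding constraints are $\phi \ge w_1$ and $\phi \ge 3 - w_1$, so $w_1^* = 1.5$ and $\phi^* = 1.5$. For comparison, the $L_2$ projection of the same target is the mean, $w_1 = 1$, whose max-norm error is 2.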

Representing the Constraints

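Explicit representation is exponential ($|S| = 2^n$):

$$\phi \ge \sum_{i=1}^{k} w_i h_i(s) - b(s), \quad s = 1, \ldots, |S|$$

If basis and target are factored, can use Cost Networks to represent the constraints:

$$\phi \ge \max_{A,B,C} \left[ f_1(A,B) + f_2(A,C) + \max_D \left[ f_3(C,D) + f_4(B,D) \right] \right]$$

$$\phi \ge \max_{A,B,C} \left[ f_1(A,B) + f_2(A,C) + g_1^{(B,C)} \right], \qquad g_1^{(B,C)} \ge f_3(C,D) + f_4(B,D)$$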

Approximate Policy Iteration

Guess w0


How do we represent the policy? How do we update it efficiently?

Policy Improvement

What about the Policy?

Contextual Action Model:

[Figure: three two-slice networks over X, Y, Z and X', Y', Z', one each for the default model, Action 1 and Action 2.]

Theorem [Koller & Parr '00]:

Factored value functions and model ⇒ compact policy description.

Policy forms a decision list:

If [condition] then action 1, else if [condition] then action 2, else if [condition] then action 1

[Conditions are over the state variables x, y, z.]

Factored Policy Iteration: Summary

Guess $V$

$\pi$ = greedy($V$)

$V$ = value of acting on $\pi$

Structure induces decision-list policy

Key operations isomorphic to Bayesian network inference

Time per iteration reduced from $O((2^n)^3)$ to $O(\mathrm{poly}(k, n, C))$

• C = largest factor in cost net (function of structure)

• k = number of basis functions ($k \ll 2^n$)

• poly = complexity of LP solver, in practice close to linear

Network Management Problem

Computers connected in a network;

Each computer can fail with some probability;

If a computer fails, it increases the probability its neighbors will fail;

At every time step, the sys-admin must decide which computer to fix.
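The local dependence described above is what makes the model factored: each computer's next status depends only on its own status, its neighbors, and the action. A hedged sketch for a bidirectional ring follows; the function name `p_working_next`, the failure probabilities, and the rules that a fixed machine always recovers and a failed machine stays down are all assumptions for illustration, not values from the slides.

```python
def p_working_next(state, i, action, base_fail=0.05, neighbor_penalty=0.10):
    """Probability that computer i is working at the next time step.

    state  -- tuple of 0/1 flags (1 = working) for computers on a bidirectional ring
    action -- index of the computer the sys-admin fixes this step
    The numeric parameters are illustrative assumptions.
    """
    n = len(state)
    if action == i:
        return 1.0                      # assumption: a fixed machine always comes back up
    if not state[i]:
        return 0.0                      # assumption: a failed machine stays down until fixed
    dead_neighbors = (1 - state[(i - 1) % n]) + (1 - state[(i + 1) % n])
    return 1.0 - (base_fail + neighbor_penalty * dead_neighbors)

state = (1, 0, 1, 1)                    # computer 1 has failed
probs = [p_working_next(state, i, action=1) for i in range(len(state))]
```

Each call reads at most three entries of `state`, regardless of how many computers are on the ring.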

[Figure: network topologies: Bidirectional Ring, Ring and Star, Star, 3 Legs, Ring of Rings; Server nodes are marked.]

Comparing projections in $L_2$ to $L_\infty$

Max-norm projection also much more efficient:

Single cost network rather than many, many BN inferences;

Use of very efficient LP package (CPLEX).

[Figure: relative error (0 to 0.3) vs. number of variables (3 to 10) for four settings: $L_2$ single basis, $L_\infty$ single basis, $L_\infty$ pair basis, $L_2$ pair basis.]

Results on Larger Problems: Running Time

[Figure: total time in minutes (0 to 500) vs. number of states (1E+00 to 1E+14) for the Ring, 3 Legs and Star topologies.]

Runs in time $O(n^3)$, not $O((2^n)^3)$

Results on Larger Problems: Error Bounds

[Figure: Bellman error / Rmax (0 to 0.4) vs. number of states (1E+00 to 1E+14) for the Ring, 3 Legs and Star topologies.]

Error remains bounded

Conclusions

Max-norm projection directly minimizes error bounds;

Closed-form projection operation provides exponential complexity reduction;

Exploit structure to reduce computation costs!

Solve very large MDPs efficiently.

Future Work

POMDPs (IJCAI’01 workshop paper);

Additional structure: Factored actions; Relational representations; CSI;

Multi-agent systems;

Linear program solution for MDP.
