Generalizing Plans to New Environments in Multiagent Relational MDPs
Carlos Guestrin, Daphne Koller
Stanford University


Page 1: Generalizing Plans to New Environments in Multiagent Relational MDPs

Carlos Guestrin, Daphne Koller
Stanford University

Page 2: Multiagent Coordination Examples

- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Challenges:

- Multiple, simultaneous decisions
- Exponentially-large spaces
- Limited observability
- Limited communication

Page 3: Real-time Strategy Game

[Figure: game screenshot with peasant, footman, and building units labeled.]

- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

Page 4: Scaling up by Generalization

- Exploit similarities between world elements
- Generalize plans from a set of worlds to a new, unseen world
  - Avoid the need to replan
  - Tackle larger problems
- Formalize the notion of "similar" elements
- Compute generalizable plans

Page 5: Relational Models and MDPs

- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ...
- Relations: Collects, Builds, Trains, Attacks, ...
- Instances: Peasant1, Peasant2, Footman1, Enemy1, ...
- Value functions defined at the class level: objects of the same class make the same contribution to the value function
- Factored MDP equivalent of PRMs [Koller, Pfeffer '98]

Page 6: Relational MDPs

- Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
- Class-level reward function
- Instantiation (world): number of objects and their relations
- The result is a well-defined MDP

[Figure: class-level DBN fragment for the Peasant class (state P, action A_P, next state P') and the Gold class (state G, next state G'), linked by the Collects relation.]

Page 7: Planning in a World

- Long-term planning by solving the MDP
  - # states exponential in the number of objects
  - # actions exponential as well
- An RMDP world is a factored MDP
- Efficient approximation by exploiting structure!

Page 8: Roadmap to Generalization

1. Solve 1 world
2. Compute a generalizable value function
3. Tackle a new world

Page 9: World is a Factored MDP

[Figure: two-slice dynamic Bayesian network with state variables P, F, E, G, H, next-state variables P', F', E', G', action variables A_P and A_F, and reward R; the columns show the state, dynamics, decisions, and rewards. For example, the footman's transition model is $P(F' \mid F, G, H, A_F)$.]

Page 10: Long-term Utility = Value of MDP

Value computed by linear programming [Manne '60]:

$$\text{minimize: } \sum_{\mathbf{x}} V(\mathbf{x}) \qquad \text{subject to: } V(\mathbf{x}) \ge Q(\mathbf{x},\mathbf{a}), \;\; \forall\, \mathbf{x}, \mathbf{a}$$

- One variable V(x) for each state x
- One constraint for each state x and action a
- Number of states and actions exponential!
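As a concrete illustration of this LP (a toy sketch, not the authors' code; the two-state, two-action MDP and the discount γ = 0.9 are made up, and Q(x, a) is taken to be the standard one-step backup R(x, a) + γ Σ_x' P(x'|x, a) V(x')), the following enumerates one constraint per state-action pair and solves the LP with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0: P[a][x][x']
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[0.0, 1.0],                  # R[x][a]
              [2.0, 0.0]])
n_states, n_actions = 2, 2

# Objective: minimize sum_x V(x).
c = np.ones(n_states)

# One constraint per (x, a):  V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x'),
# rewritten in linprog's A_ub @ V <= b_ub form as (gamma*P[a][x] - e_x) @ V <= -R[x,a].
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a][x].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("V* =", res.x)   # exact optimal value function of the toy MDP
```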

Page 11: Approximate Value Functions

Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00] [Guestrin et al. '01]:

$$V(\mathbf{x}) \;\approx\; \sum_{o} V_o(\mathbf{x})$$

- Each V_o depends on the state of one object and of related objects (e.g., the state of a footman and the status of the barracks)
- Must find V_o giving a good approximate value function

Page 12: Single LP Solution for Factored MDPs

$$\text{minimize: } \sum_{o} \alpha'_o \cdot V_o \qquad \text{subject to: } \sum_{o} V_o(\mathbf{x}) \ge Q(\mathbf{x},\mathbf{a}), \;\; \forall\, \mathbf{x}, \mathbf{a}$$

(here α'_o denotes the state-relevance weights restricted to the scope of V_o) [Schweitzer and Seidmann '85] [Guestrin, Koller, Parr '01]

- Variables for each V_o, for each object: polynomially many LP variables
- One constraint for every state and action: exponentially many LP constraints
- V_o and Q_o depend on small sets of variables/actions, so structure can be exploited as in variable elimination

Page 13: Representing Exponentially Many Constraints

The constraints

$$\sum_{o} V_o(\mathbf{x}) \;\ge\; Q(\mathbf{x},\mathbf{a}), \quad \forall\, \mathbf{x}, \mathbf{a}$$

can equivalently be written as

$$0 \;\ge\; Q(\mathbf{x},\mathbf{a}) - \sum_{o} V_o(\mathbf{x}), \quad \forall\, \mathbf{x}, \mathbf{a}$$

Exponentially many linear constraints = one nonlinear constraint:

$$0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[\, Q(\mathbf{x},\mathbf{a}) - \sum_{o} V_o(\mathbf{x}) \,\Big]$$

Page 14: Variable Elimination

Can use variable elimination to maximize over the state space [Bertele & Brioschi '72]:

$$\max_{A,B,C,D}\; f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D)$$
$$=\; \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + \max_{D}\big[\, f_3(C,D) + f_4(B,D) \,\big]$$
$$=\; \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + g_1(B,C)$$

[Figure: graph over variables A, B, C, D with the factors f_1, f_2, f_3, f_4 on its edges.]

- As in Bayes nets, maximization is exponential in the tree-width
- Here we need only 23, instead of 63, sum operations
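A small sketch of this maximization (binary variables and random factor tables are illustrative assumptions): eliminate D first, then maximize over the remaining three variables, and check the result against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Binary variables A, B, C, D; pairwise factors as in the example:
# f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D).
f1 = rng.normal(size=(2, 2))   # indexed [A, B]
f2 = rng.normal(size=(2, 2))   # indexed [A, C]
f3 = rng.normal(size=(2, 2))   # indexed [C, D]
f4 = rng.normal(size=(2, 2))   # indexed [B, D]

# Eliminate D: g1(B, C) = max_D [ f3(C, D) + f4(B, D) ].
g1 = np.empty((2, 2))
for b in range(2):
    for c in range(2):
        g1[b, c] = max(f3[c, d] + f4[b, d] for d in range(2))

# Maximize the remaining function over A, B, C.
ve_max = max(f1[a, b] + f2[a, c] + g1[b, c]
             for a, b, c in itertools.product(range(2), repeat=3))

# Brute force over all 16 joint assignments, for comparison.
brute_max = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
                for a, b, c, d in itertools.product(range(2), repeat=4))

assert np.isclose(ve_max, brute_max)
print(ve_max)
```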

Page 15: Representing the Constraints

The functions are factored, so use variable elimination to represent the constraint

$$0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[\, Q(\mathbf{x},\mathbf{a}) - \sum_{o} V_o(\mathbf{x}) \,\Big]$$

In the example above,

$$0 \;\ge\; \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + \max_{D}\big[\, f_3(C,D) + f_4(B,D) \,\big]$$

becomes, with new LP variables $g_1(B,C)$,

$$0 \;\ge\; \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + g_1(B,C), \qquad g_1(B,C) \;\ge\; f_3(C,D) + f_4(B,D), \;\; \forall\, B, C, D$$

Number of constraints exponentially smaller.
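To see why the elimination-based encoding needs exponentially fewer constraints, here is a small illustrative count (the chain of n binary variables with pairwise factors is my own example, not the talk's): the naive encoding writes one constraint per joint assignment, while the elimination-based encoding only writes constraints over each eliminated variable's small scope.

```python
# Count the LP constraints needed to encode 0 >= max_x sum_i f_i(x_i, x_{i+1})
# over n binary variables arranged in a chain, two ways.

def naive_count(n):
    # One linear constraint per joint assignment of all n binary variables.
    return 2 ** n

def factored_count(n):
    # Eliminate X_n, X_{n-1}, ..., X_2 in turn.  Eliminating X_k introduces a
    # new LP function g(x_{k-1}) and constraints
    #   g(x_{k-1}) >= f(x_{k-1}, x_k) + g_prev(x_k)
    # for every assignment of (x_{k-1}, x_k): 4 constraints per elimination.
    # A final pair of constraints enforces 0 >= g(x_1) for x_1 in {0, 1}.
    return 4 * (n - 1) + 2

for n in (4, 10, 20):
    print(n, naive_count(n), factored_count(n))
# e.g. for n = 20: 1,048,576 naive constraints vs. 78 factored constraints.
```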

Page 16: Roadmap to Generalization

1. Solve 1 world
2. Compute a generalizable value function
3. Tackle a new world

Page 17: Generalization

- Sample a set of worlds
- Solve a linear program for these worlds: obtain class value functions
- When faced with a new problem: use the class value function; no re-planning needed

Page 18: Worlds and RMDPs

Meta-level MDP: a world ω is drawn from a distribution P(ω); each world, together with the RMDP, defines its own well-defined MDP with value function V_ω.

Meta-level LP:

$$\text{minimize: } \sum_{\omega} P(\omega) \sum_{\mathbf{x}} V_\omega(\mathbf{x}) \qquad \text{subject to: } V_\omega(\mathbf{x}) \ge Q_\omega(\mathbf{x},\mathbf{a}), \;\; \forall\, \omega, \mathbf{x}, \mathbf{a}$$

Page 19: Class-level Value Functions

- Approximate solution to the meta-level MDP
- Linear approximation
- Value function defined at the class level
- All instances use the same local value function

Page 20: Class-level LP

$$\text{minimize: } \sum_{\omega} P(\omega) \sum_{\mathbf{x}} \sum_{c} \sum_{o \in C_\omega[c]} V_c(\mathbf{x}_o)$$
$$\text{subject to: } \sum_{c} \sum_{o \in C_\omega[c]} V_c(\mathbf{x}_o) \;\ge\; \sum_{c} \sum_{o \in C_\omega[c]} Q_c(\mathbf{x}_o,\mathbf{a}_o), \quad \forall\, \omega, \mathbf{x}, \mathbf{a}$$

- Constraints for each world represented by a factored LP
- Number of worlds exponential or infinite: sample worlds from P(ω)

Page 21: Theorem

Exponentially (infinitely) many worlds! Do we need exponentially many samples? NO!

Only a polynomial number of sampled worlds is needed: the resulting value function is within ε, with probability at least 1 − δ (the sample bound depends on R_max, the maximum class reward, as well as on ε and δ).

Proof method related to [de Farias, Van Roy '02].

Page 22: LP with sampled worlds

Given a set I of sampled worlds:

$$\text{minimize: } \sum_{\omega \in I} \sum_{\mathbf{x}} \sum_{c} \sum_{o \in C_\omega[c]} V_c(\mathbf{x}_o)$$
$$\text{subject to: } \sum_{c} \sum_{o \in C_\omega[c]} V_c(\mathbf{x}_o) \;\ge\; \sum_{c} \sum_{o \in C_\omega[c]} Q_c(\mathbf{x}_o,\mathbf{a}_o), \quad \forall\, \omega \in I, \mathbf{x}, \mathbf{a}$$

- Solve the LP for the sampled worlds, using the factored LP for each world
- Obtain a class-level value function
- In a new world: instantiate the value function and act

Page 23: Learning Classes of Objects

Which classes of objects have the same value function?

- Plan for sampled worlds individually
- Use the resulting value functions as "training data"
- Find objects with similar values, including features of the world
- Used decision tree regression in experiments
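A minimal sketch of this step (the feature names, the synthetic values, and the use of scikit-learn's DecisionTreeRegressor are assumptions for illustration): fit a shallow regression tree from per-object features to each object's local value from the individually solved worlds, and treat objects that fall in the same leaf as one class.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical per-object features from the solved sample worlds
# (e.g. degree in the network, whether the machine is the server),
# paired with each object's local value from the individually solved plans.
features = np.array([
    # [degree, is_server]
    [5, 1],   # server
    [2, 0],   # intermediate
    [2, 0],
    [1, 0],   # leaf
    [1, 0],
])
local_values = np.array([4.4, 3.9, 3.8, 3.5, 3.5])

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1)
tree.fit(features, local_values)

# Objects assigned to the same leaf share one class-level value function.
leaf_ids = tree.apply(features)
print(dict(zip(["Server", "Int1", "Int2", "Leaf1", "Leaf2"], leaf_ids)))
```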

Page 24: Summary of Generalization Algorithm

1. Model the domain as a Relational MDP
2. Pick local object value functions V_o
3. Learn classes by solving some instances
4. Sample a set of worlds
5. Factored LP computes the class-level value function

Page 25: A New World

When faced with a new world ω, the value function is

$$V_\omega(\mathbf{x}) \;=\; \sum_{c} \sum_{o \in C_\omega[c]} V_c(\mathbf{x}_o)$$

and the Q function becomes

$$Q_\omega(\mathbf{x},\mathbf{a}) \;=\; \sum_{c} \sum_{o \in C_\omega[c]} Q_c(\mathbf{x}_o,\mathbf{a}_o)$$

- At each state, choose the action maximizing Q(x, a)
- The number of actions is exponential!
- But each Q_c depends only on a few objects!
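A toy sketch of the instantiation step (object names, class names, and value tables are invented for illustration): the learned class-level local value functions are shared, and a new world's value function is just their sum over that world's objects, so no re-planning is needed.

```python
# Class-level local value functions, learned once (toy numbers):
# each maps the local state of one object to a value.
class_value = {
    "Footman": lambda s: 2.0 * s["health"] - 1.0 * s["enemy_near"],
    "Peasant": lambda s: 1.5 * s["carrying_gold"],
}

# A new, unseen world: just a list of (object, class) pairs.
new_world = [("Footman1", "Footman"), ("Footman2", "Footman"), ("Peasant1", "Peasant")]

def world_value(state):
    """V_world(x) = sum over objects of that object's class-level value."""
    return sum(class_value[cls](state[obj]) for obj, cls in new_world)

state = {
    "Footman1": {"health": 1.0, "enemy_near": 1.0},
    "Footman2": {"health": 0.5, "enemy_near": 0.0},
    "Peasant1": {"carrying_gold": 1.0},
}
print(world_value(state))   # no re-planning: the same class tables are reused
```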

Page 26: Local Q function Approximation

$$Q(A_1,\dots,A_4,\, X_1,\dots,X_4) \;\approx\; Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4)$$

[Figure: four machines M_1, ..., M_4 arranged in a ring; Q_3 is associated with Agent 3.]

- Limited observability: agent i only observes the variables in Q_i (Agent 3 observes only X_2 and X_3)
- Must choose a joint action to maximize Σ_i Q_i

Page 27: Maximizing Σ_i Q_i: Coordination Graph

Use variable elimination for the maximization [Bertele & Brioschi '72]:

$$\max_{A_1,A_2,A_3,A_4}\; Q_1(A_1,A_2) + Q_2(A_1,A_3) + Q_3(A_3,A_4) + Q_4(A_2,A_4)$$
$$=\; \max_{A_1,A_2,A_3}\; Q_1(A_1,A_2) + Q_2(A_1,A_3) + \max_{A_4}\big[\, Q_3(A_3,A_4) + Q_4(A_2,A_4) \,\big]$$
$$=\; \max_{A_1,A_2,A_3}\; Q_1(A_1,A_2) + Q_2(A_1,A_3) + g_1(A_2,A_3)$$

[Figure: coordination graph over agent actions A_1, ..., A_4 with local terms Q_1, ..., Q_4 on its edges.]

- The intermediate result g_1 is a conditional strategy, e.g. "If A_2 attacks and A_3 defends, then A_4 gets $10"
- Limited communication for optimal action choice
- Communication bandwidth = induced width of the coordination graph
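A small sketch of this coordination step (binary actions and random Q tables are illustrative assumptions; the state variables are taken as already fixed at the current observation, so each term depends on two agents' actions): eliminate A_4 while remembering its best response, maximize over the remaining agents, then back-substitute to recover the joint action.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Pairwise Q terms on the edges of the coordination graph in the example:
# Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4); each agent has a binary action.
Q1, Q2, Q3, Q4 = (rng.normal(size=(2, 2)) for _ in range(4))

# Eliminate A4: g1(A2, A3) = max_{A4} [Q3(A3, A4) + Q4(A2, A4)],
# remembering A4's best response for back-substitution.
g1 = np.empty((2, 2))
best_a4 = np.empty((2, 2), dtype=int)
for a2 in range(2):
    for a3 in range(2):
        vals = [Q3[a3, a4] + Q4[a2, a4] for a4 in range(2)]
        best_a4[a2, a3] = int(np.argmax(vals))
        g1[a2, a3] = max(vals)

# Maximize the remaining function over A1, A2, A3 ...
a1, a2, a3 = max(itertools.product(range(2), repeat=3),
                 key=lambda t: Q1[t[0], t[1]] + Q2[t[0], t[2]] + g1[t[1], t[2]])
# ... and back-substitute to recover A4's part of the joint action.
a4 = best_a4[a2, a3]

joint_action = (a1, a2, a3, a4)
value = Q1[a1, a2] + Q2[a1, a3] + Q3[a3, a4] + Q4[a2, a4]
print(joint_action, value)
```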

Page 28: Summary of Algorithm

1. Model the domain as a Relational MDP
2. Factored LP computes the class-level value function
3. Reuse the class-level value function in a new world

Page 29: Experimental Results

SysAdmin problem.

[Figure: example network topologies of machines connected to a server: unidirectional ring, star, and ring of rings.]

Page 30: Generalizing to New Problems

[Bar chart: estimated policy value per agent (roughly 3 to 4.6) on the Ring, Star, and Three legs topologies, comparing the class-based value function, the 'optimal' approximate value function, and the utopic maximum value.]

Page 33: Classes of Objects Discovered

Learned 3 classes: Server, Intermediate, and Leaf.

[Figure: example topology with the server node, intermediate nodes, and leaf nodes labeled by their learned class.]

Page 34: Learning Classes of Objects

[Bar chart: max-norm error of the value function (roughly 0 to 1.4) on the Ring, Star, and Three legs topologies, comparing no class learning against learnt classes.]

Page 36: Results

- 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy
- Reward for dead enemy
- About 1 million state/action pairs
- Solved with the factored LP; some factors are exponential
- Coordination graph for action selection

[with Gearhart and Kanodia]

Page 37: Generalization

- 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy
- Reward for dead enemy
- About 3 trillion state/action pairs
- Instantiate the generalizable value function; at run time, factors are polynomial
- Coordination graph for action selection

Page 38: The 3 aspects of this talk

- Scaling up collaborative multiagent planning: exploiting structure and generalization
- Factored representations and algorithms: Relational MDPs, factored LP, coordination graphs
- Freecraft as a benchmark domain

Page 39: Conclusions

- RMDPs: a compact representation for a set of similar planning problems
- Solve a single instance with factored MDP algorithms
- Tackle sets of problems with class-level value functions: efficient sampling of worlds; learn classes of value functions
- Generalization to new domains: avoid replanning; solve larger, more complex MDPs