probabilistic planning via determinization in hindsight ff-hindsight

Sungwook Yoon – Probabilistic Planning via Determinization

Probabilistic Planning via Determinization in Hindsight

FF-Hindsight

Sungwook Yoon

Joint work with Alan Fern, Bob Givan, and Rao Kambhampati


Probabilistic Planning Competition

Client: participants; send actions

Server: competition host; simulates actions


The Winner was ……

• FF-Replan
  – A replanner that uses FF
  – The probabilistic domain is determinized

• Interesting contrast
  – Many probabilistic planning techniques
    • Work in theory but do not work in practice
  – FF-Replan
    • No theory
    • Works in practice


The Paper’s Objective

Better determinization approach (Determinization in Hindsight)

Theoretical consideration of the new determinization (in Hindsight)

New view on FF-Replan

Experimental studies with determinization in Hindsight (FF-Hindsight)


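Probabilistic Planning (goal-oriented)

[Figure: a lookahead tree rooted at the initial state I. At each state (Time 1, Time 2) an action A1 or A2 is chosen, and each action has probabilistic outcomes; the left outcomes are more likely. Leaves are marked Goal State or Dead End. Objective: maximize goal achievement.]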


All Outcome Replanning (FFRA)

[Figure: a probabilistic action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into two deterministic actions: Action1 with Effect 1 and Action2 with Effect 2.]

ICAPS-07
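The all-outcome split shown on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes `ProbabilisticAction` and `DeterministicAction`, the function `all_outcome_determinize`, and the effect strings are hypothetical names, not part of FF-Replan itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProbabilisticAction:
    """Hypothetical representation: a name plus (probability, effect) pairs."""
    name: str
    outcomes: Tuple[Tuple[float, str], ...]


@dataclass(frozen=True)
class DeterministicAction:
    name: str
    effect: str


def all_outcome_determinize(actions: List[ProbabilisticAction]) -> List[DeterministicAction]:
    """Create one deterministic action per probabilistic outcome.

    The probabilities are dropped on purpose: this static determinization
    ignores how likely each effect is.
    """
    deterministic = []
    for action in actions:
        for index, (_probability, effect) in enumerate(action.outcomes, start=1):
            deterministic.append(DeterministicAction(f"{action.name}-{index}", effect))
    return deterministic


# Made-up probabilities and effects, for illustration only.
actions = [
    ProbabilisticAction("A1", ((0.9, "effect-1"), (0.1, "effect-2"))),
    ProbabilisticAction("A2", ((0.6, "effect-1"), (0.4, "effect-2"))),
]
for det in all_outcome_determinize(actions):
    print(det.name, "->", det.effect)
```

Because the probabilities are discarded, a deterministic planner working on the result treats a 0.1 outcome exactly like a 0.9 outcome.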


Probabilistic Planning: All Outcome Determinization

[Figure: the same lookahead tree from state I after all-outcome determinization. At every state the probabilistic actions A1 and A2 are replaced by the deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome. Leaves are marked Goal State or Dead End. Objective: find the goal.]




Problem of FF-Replan and better alternative sampling

FF-Replan’s static determinizations don’t respect probabilities.

We need “probabilistic and dynamic determinization”.

Sample future outcomes and determinize in hindsight.

Each future sample becomes a known-future deterministic problem.


Probabilistic Planning (goal-oriented)

[Figure: the lookahead tree from the initial state I with actions A1 and A2 at Time 1 and Time 2, and leaves marked Goal State or Dead End.]

Start Sampling

Note: sampling will reveal which action is better at state I, A1 or A2.


Hindsight Sample 1

[Figure: the lookahead tree for one sampled future, in which every action has a single known outcome. Tally shown: A1: 1, A2: 0.]

[Figures: three further sampled futures of the same tree. Tallies shown: A1: 2, A2: 1; then A1: 2, A2: 1; then A1: 3, A2: 1.]


Summary of the Idea: The Decision Process

(Estimating the Q-value, Q(s,a))

1. For each action A, draw future samples

2. Solve the deterministic problems

3. Aggregate the solutions for each action

4. Select the action with the best aggregation

S: current state, A(S) → S’

Each sample is a deterministic planning problem

The solution length is used for goal-oriented problems, Q(s,A)

Max_A Q(s,A)
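The four steps of the decision process can be sketched as follows. This is a minimal illustration, not the FF-Hindsight implementation: the toy domain `MODEL`, the `Future` class, and the functions `solve_deterministic` and `hindsight_action` are hypothetical. A breadth-first search stands in for FF, and the aggregate is the fraction of sampled futures that admit a plan (the slide notes that solution length can be used instead).

```python
import random
from collections import deque

# Hypothetical toy domain: MODEL[state][action] = [(probability, next_state), ...]
MODEL = {
    "I": {"A1": [(0.8, "S1"), (0.2, "dead")], "A2": [(0.3, "S1"), (0.7, "dead")]},
    "S1": {"A1": [(0.9, "goal"), (0.1, "dead")], "A2": [(0.5, "goal"), (0.5, "dead")]},
    "dead": {},
    "goal": {},
}


class Future:
    """One sampled future: a lazily built map (state, action, time) -> next state."""

    def __init__(self, rng):
        self.rng = rng
        self.table = {}

    def outcome(self, state, action, time):
        key = (state, action, time)
        if key not in self.table:  # each (state, action, time) is sampled independently
            probs, nexts = zip(*MODEL[state][action])
            self.table[key] = self.rng.choices(nexts, weights=probs)[0]
        return self.table[key]


def solve_deterministic(state, time, horizon, future):
    """Breadth-first search in a known future (a stand-in for FF).

    Returns the plan length needed to reach 'goal', or None if it is unreachable.
    """
    queue = deque([(state, time, 0)])
    seen = {(state, time)}
    while queue:
        s, t, length = queue.popleft()
        if s == "goal":
            return length
        if t >= horizon:
            continue
        for a in MODEL[s]:
            nxt = future.outcome(s, a, t)
            if (nxt, t + 1) not in seen:
                seen.add((nxt, t + 1))
                queue.append((nxt, t + 1, length + 1))
    return None


def hindsight_action(state, horizon, num_samples, rng):
    """Pick the action whose sampled futures most often admit a plan to the goal."""
    scores = {}
    for action in MODEL[state]:
        solved = 0
        for _ in range(num_samples):
            future = Future(rng)                    # step 1: draw a future
            nxt = future.outcome(state, action, 0)  # apply the action in that future
            if solve_deterministic(nxt, 1, horizon, future) is not None:  # step 2
                solved += 1
        scores[action] = solved / num_samples       # step 3: aggregate
    best = max(scores.values())
    ties = [a for a, v in scores.items() if v == best]
    return rng.choice(ties), scores                 # step 4: select, ties broken randomly


action, scores = hindsight_action("I", horizon=2, num_samples=200, rng=random.Random(0))
print(action, scores)
```

In this toy domain A1 should come out ahead, since it reaches S1 far more often than A2. Note that the estimates are optimistic: in hindsight the solver may use whichever action happens to succeed from S1 in the sampled future.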


Mathematical Summary of the Algorithm

• H-horizon future $F^H$ for $M = [S, A, T, R]$
  – Mapping of state, action, and time ($h < H$) to a state
  – $S \times A \times h \to S$

• Value of a policy $\pi$ for $F^H$
  – $R(s, F^H, \pi)$

• $V_{HS}(s,H) = E_{F^H}[\max_\pi R(s,F^H,\pi)]$

• Compare this with the real value
  – $V^*(s,H) = \max_\pi E_{F^H}[R(s,F^H,\pi)]$
  – $V_{FFRa}(s) = \max_F V(s,F) \ge V_{HS}(s,H) \ge V^*(s,H)$

• $Q(s,a,H) = R(a) + E_{F^{H-1}}[\max_\pi R(a(s),F^{H-1},\pi)]$
  – In our proposal, the computation of $\max_\pi R(s,F^{H-1},\pi)$ is approximately done by FF [Hoffmann and Nebel ’01]
  – Each future is a deterministic problem
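The ordering of the three values on this slide follows from exchanging max and expectation. A short derivation, assuming that $V(s,F)$ denotes the best value in a single known future, $V(s,F) = \max_\pi R(s,F,\pi)$ (the slide does not define it explicitly):

```latex
\begin{align*}
R(s,F^H,\pi) &\le \max_{\pi'} R(s,F^H,\pi')
  && \text{for every } \pi \text{ and every } F^H \\
\Rightarrow\quad
V^*(s,H) = \max_\pi E_{F^H}\big[R(s,F^H,\pi)\big]
  &\le E_{F^H}\big[\max_{\pi'} R(s,F^H,\pi')\big] = V_{HS}(s,H) \\
V_{HS}(s,H) = E_{F^H}\big[V(s,F^H)\big]
  &\le \max_F V(s,F) = V_{FFRa}(s)
\end{align*}
```

The first gap is the price of choosing a different policy for each future; the second is the price of planning for the single best future instead of the average one.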


Key Technical Results

The importance of independent sampling across states, actions, and time

The necessity of random tie breaking in decision making

Theorem 1: When there is a policy that can achieve the goal with probability 1 within the horizon, the hindsight decision-making algorithm will reach the goal with probability 1.

Theorem 2: The number of samples needed is polynomial in the horizon, the number of actions, and the inverse of the minimum Q-value advantage.

We characterize FF-Replan in terms of hindsight decision making: $V_{FFRa}(s) = \max_F V(s,F)$


Empirical Results

Problem          FFRa  FF-Hindsight
Blocksworld       270           158
Boxworld          150           100
Fileworld          29            14
R-Tireworld        30            30
ZenoTravel         30             0
Exploding BW        5            28
G-Tireworld         7            18
Tower of Hanoi     11            17

IPPC-04 problems. Numbers are solved trials.

For ZenoTravel, using importance sampling improved the number of solved trials to 26.


Empirical Results

Planner    Climber  River  Bus-Fare  Tire1  Tire2  Tire3  Tire4  Tire5  Tire6
FFRa           60%    65%        1%    50%     0%     0%     0%     0%     0%
Paragraph     100%    65%      100%   100%   100%   100%     3%     1%     0%
FPG           100%    65%       22%   100%    92%    60%    35%    19%    13%
FF-HS         100%    65%      100%   100%   100%   100%   100%   100%   100%

These domains were developed specifically to beat FF-Replan, and FF-Replan indeed did not do well.

FF-Hindsight, however, did very well, showing probabilistic reasoning ability while retaining scalability.


Conclusion

[Diagram: two groups of techniques connected by Determinization. Deterministic planning: classic planning, machine learning for planning, net-benefit optimization, temporal planning. Probabilistic planning: Markov decision processes, machine learning for MDP, temporal MDP. The label “scalability” appears on the diagram.]


Conclusion

• Devised an algorithm that can take advantage of the significant advances in deterministic planning in the context of probabilistic planning

• Made many of the deterministic planning techniques available to probabilistic planning
  – Most learning-for-planning techniques were developed solely for deterministic planning
    • Now these techniques are relevant to probabilistic planning too
  – Advanced net-benefit-style planners can be used for reward-maximization-style probabilistic planning problems


Discussion

• Mercier and Van Hentenryck analyzed the difference between
  – $V^*(s,H) = \max_\pi E_{F^H}[R(s,F^H,\pi)]$
  – $V_{HS}(s,H) = E_{F^H}[\max_\pi R(s,F^H,\pi)]$

• Ng and Jordan analyzed the difference between
  – $V^*(s,H) = \max_\pi E_{F^H}[R(s,F^H,\pi)]$
  – $\hat{V}(s,H) = \max_\pi \sum [R(s,F^H,\pi)] / m$, where $m$ is the number of samples


IPPC-2004 Results

Domain     NMRC   J1  Classy  NMR  mGPT    C  FFRS  FFRA
BW          252  270     255   30   120   30   210   270
Box         134  150     100    0    30    0   150   150
File          -    -       -    3    30    3    14    29
Zeno          -    -       -   30    30   30     0    30
Tire-r        -    -       -   30    30   30    30    30
Tire-g        -    -       -    9    16   30     7     7
TOH           -    -       -   15     0    0     0    11
Exploding     -    -       -    0     0    0     3     5

Numbers: successful runs

Column group labels on the slide: Human Control Knowledge, Learned Knowledge, 2nd Place Winners

NMR: Non-Markovian Reward Decision Process Planner

Classy: Approximate Policy Iteration with a Policy Language Bias

mGPT: Heuristic Search Probabilistic Planning

C: Symbolic Heuristic Search

Winner of IPPC-04: FFRs


IPPC-2006 Results

Domain      FFRA  FPG  FOALP  sfDP  Paragraph  FFRS
BW            86   63    100    29          0    77
Zenotravel   100   27      0     7          7     7
Random       100   65      0     0          5    73
Elevator      93   76    100     0          0    93
Exploding     52   43     24    31         31    52
Drive         71   56      0     0          9     0
Schedule      51   54      0     0          1     0
PitchCatch    54   23      0     0          0     0
Tire          82   75     82     0         91    69

Numbers: percentage of successful runs

FPG: Factored Policy Gradient Planner

FOALP: First Order Approximate Linear Programming

sfDP: Symbolic Stochastic Focused Dynamic Programming with Decision Diagrams

Paragraph: A Graphplan Based Probabilistic Planner

Unofficial winner of IPPC-06: FFRa



Sampling Problem: Time dependency issue

[Figure: a state graph with Start, S1, S2, S3, Goal, and Dead End. Edge labels: A; B; C (with probability p); C (with probability 1-p); D (with probability 1-p); D (with probability p).]


Sampling Problem: Time dependency issue

[Figure: the same state graph (Start, S1, S2, S3, Goal, Dead End), showing actions A and B.]

S3 is a worse state than S1, but it looks like there is always a path to the Goal. We need to sample independently across actions.


Action Selection Problem: Random tie breaking is essential

[Figure: states Start, S1, and Goal. Edge labels: A: always stays in Start; B: with probability p; B: with probability 1-p; C: with probability p; C: with probability 1-p.]

In the Start state, action C is definitely better, but A can be used to wait until the outcome of C that leads to the Goal is realized.
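The point of this slide is that, in hindsight, waiting with A can look exactly as good as acting with C, so a selection rule that always resolves ties the same way may keep choosing A. A minimal sketch of random tie breaking, where the function `argmax_random_ties` and the Q-estimates are hypothetical illustrations:

```python
import random


def argmax_random_ties(q_values, rng, tolerance=1e-9):
    """Return an action with maximal Q-value, breaking ties uniformly at random."""
    best = max(q_values.values())
    ties = [a for a, q in q_values.items() if best - q <= tolerance]
    return rng.choice(ties)


# Hypothetical estimates: waiting with A looks exactly as good as acting with C.
q_estimates = {"A": 1.0, "B": 0.4, "C": 1.0}
rng = random.Random(0)
picks = [argmax_random_ties(q_estimates, rng) for _ in range(1000)]
print({a: picks.count(a) for a in q_estimates})
```

With random ties, C is chosen on a constant fraction of decisions, so the agent cannot wait in Start forever.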


Sampling Problem: Importance Sampling (IS)

[Figure: states Start, S1, and Goal. Action B has one outcome with extremely low probability and another with very high probability.]

- Sampling uniformly would find the problem unsolvable.
- Use importance sampling.
- Identifying the region that needs importance sampling is left for further study.
- In the benchmark, Zenotravel needs the IS idea.
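The generic importance-sampling idea can be sketched as follows: draw outcomes from a proposal distribution that makes the rare outcome common, and weight each sample by the ratio of true to proposal probability so that estimates stay unbiased. The function `sample_outcome_with_weight`, the outcome names, and all probabilities are assumptions for illustration; the slides do not specify how FF-Hindsight applies IS.

```python
import random


def sample_outcome_with_weight(outcomes, proposal, rng):
    """Sample an outcome from `proposal` and return (outcome, importance weight).

    outcomes: {outcome: true probability p}
    proposal: {outcome: sampling probability q}, with q > 0 wherever p > 0
    """
    names = list(outcomes)
    chosen = rng.choices(names, weights=[proposal[n] for n in names])[0]
    return chosen, outcomes[chosen] / proposal[chosen]


# Hypothetical numbers for action B: reaching the goal is a rare outcome.
true_probs = {"goal": 0.001, "stay": 0.999}
proposal = {"goal": 0.5, "stay": 0.5}

rng = random.Random(0)
n = 10_000
hits = 0
weighted_sum = 0.0
for _ in range(n):
    outcome, weight = sample_outcome_with_weight(true_probs, proposal, rng)
    if outcome == "goal":
        hits += 1
        weighted_sum += weight  # each hit counts only p/q = 0.002
estimate = weighted_sum / n     # weighted estimate of the true goal probability
print(hits, estimate)
```

Roughly half of the sampled outcomes now contain the rare effect, so a deterministic planner gets futures it can solve, while the weights keep the value estimate close to the true 0.001.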


Theoretical Results

• Theorem 1
  – For goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves the problem with probability 1. If there is no such policy, hindsight planning returns a success ratio less than 1.
  – If there is a future in which no plan can achieve the goal, that future can be sampled.

• Theorem 2
  – The number of future samples $w$ needed to correctly identify the best action:
  – $w > 4\Delta_T^{-2} \ln(|A|H/\delta)$
  – $\Delta_T$: the minimum Q-advantage of the best action over the other actions; $\delta$: confidence parameter
  – From the Chernoff bound
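To get a feel for the size of the bound in Theorem 2, the sketch below evaluates it numerically. It assumes the bound reads w > 4 Δ^(-2) ln(|A| H / δ), which is a reading of the garbled slide formula; the function name `samples_needed` and the example numbers are hypothetical.

```python
import math


def samples_needed(q_advantage, num_actions, horizon, delta):
    """Smallest integer w with w > 4 * q_advantage**-2 * ln(num_actions * horizon / delta).

    Assumes the Theorem 2 bound has this form; the slide text is garbled.
    """
    if not (q_advantage > 0 and 0 < delta < 1):
        raise ValueError("need q_advantage > 0 and 0 < delta < 1")
    bound = 4.0 / q_advantage**2 * math.log(num_actions * horizon / delta)
    return math.floor(bound) + 1


print(samples_needed(q_advantage=0.1, num_actions=10, horizon=20, delta=0.05))
```

Under this reading the sample count grows only logarithmically in the number of actions and the horizon, but quadratically as the Q-advantage shrinks.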


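Probabilistic Planning: Expecti-max solution

[Figure: the lookahead tree evaluated by expecti-max. Max nodes are placed where an action is chosen and expectation nodes (Exp, E) where an action's probabilistic outcomes are averaged, over Time 1 and Time 2 down to the Goal State leaves.]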




