
Page 1: Evaluating Planning Algorithms

Evaluating Planning Algorithms

Jörg Hoffmann

INRIA Nancy, France

June 8, 2011

Jörg Hoffmann · Evaluating Planning Algorithms · 1/85

Page 2: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 3: Evaluating Planning Algorithms

Evaluation? What’s that?

What are the advantages – and the disadvantages!!! – of the technique I'm proposing here?

- Empirical: data on examples

- Theoretical: if A then B

- Applied: it is/will be in real-world use at X (and they're earning $$$ with it)

Theory is the only way to ever truly generalize beyond examples!


Page 4: Evaluating Planning Algorithms

Applied Evaluation

Good luck!

Don’t lose sight of the big picture:

- What am I doing and why am I doing it?

- Who would be using this in practice and for what?

- What is the added value of planning here?

- Excellent example: [Ruml et al, JAIR'11]


Page 5: Evaluating Planning Algorithms

Parenthesis: Automatic Planning

Is FF automatic? Yes or No?

Correct answer: No.

- You've got to give it the PDDL first

- It's all a matter of cost-for-input vs. usefulness-of-output!!

- "Applied" Web Service Composition (ca. 1001 papers):

"Services annotated as planning actions, planner composes more complex/useful service automatically."

- Yeah great, but who's gonna write the "annotation"?


Page 6: Evaluating Planning Algorithms

Theoretical Evaluation

From standards . . .

- Is it sound? (What do you mean, "no"?)
- Is it complete?
- Can it sing and dance?

. . . to excitement!

- "The representational power of Merge-and-Shrink strictly dominates that of PDBs" (Helmert et al, ICAPS'07)
- "Our compilation of conformant planning is exponential only in conformant width" (Palacios&Geffner, JAIR'09)
- "Our polynomial-time action-cost partitioning provides the tightest possible lower bound" (Katz&Domshlak, ICAPS'08)

- Often more feasible: look at individual domains [Hoffmann, ICAPS'11; Nissim&Hoffmann&Helmert, IJCAI'11]


Page 7: Evaluating Planning Algorithms

Empirical Evaluation

This is "easy" . . .

- Run technique on examples (well, implement it first . . . )

- Report data

. . . but the devil is in the details!

- How/against whom do I run it?

- How do I analyze and report the results?

- How do I understand what's going on?


Page 8: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 9: Evaluating Planning Algorithms

The Four Commandments

1. Run IPC benchmarks.

2. Unless you run all, run the most recent ones.

3. Time-out is 30 minutes.

4. Compare to the most recent winner.


Page 10: Evaluating Planning Algorithms

Commandment 1: Run IPC benchmarks.

Natural Language Sentence Generation [Koller&Petrick, CompInt'11]:

"While some of the planners did an impressive job of controlling the complexity of the search, we also found that all the planners we tested spent too much time on preprocessing to be useful."

- Pre-processing difficulties are not considered in the IPC
- IPC benchmarks "spoon-feed" existing planner implementations
- Ergo: pre-instantiation etc. has gone completely unquestioned for almost a decade!

- Generally: IPC benchmarks created to suit IPC conditions


Page 11: Evaluating Planning Algorithms

Commandment 1 (continued): Run IPC benchmarks.

A hypothetical conversation: (any resemblance to real conversations is purely coincidental)

Two researchers, X and Y, in front of a whiteboard. The whiteboard is covered with a mixture of haphazard drawings and 1st order logic, all partly crossed out and over-written.

Says X: "Hm, yes, looks interesting."

Says Y: "But will it be useful in practice?"

Says X: "Well, let's look at what it does in a simple transportation domain with fuel usage."

Says Y: "But is that in the IPC benchmarks?"

- The IPC = some interesting challenges, not all of them!!!

- Later: IPC benchmarks not good for counting sheep . . .


Page 12: Evaluating Planning Algorithms

Commandment 2: Unless you run all, run the most recent ones.

Well. Plain nonsense, no?

- In what way are the recent ones "better"?
- What are "good" or "bad" benchmarks anyway?
- Is a benchmark better if it takes more time to solve?
- If so, note that Mystery and Mprime, e.g., are still tough nuts

- Yes, Scanalyzer is better than "Monkey-and-bananas" . . .
- . . . but this doesn't apply to the whole history of the IPC!


Page 13: Evaluating Planning Algorithms

Commandment 3: Time-out is 30 minutes.

Natural Language Sentence Generation: need the plan in a split second.
Creating business processes at SAP [Hoffmann et al, AAAI'10]: need the plan in a split second.
Controlling printers at Xerox [Ruml et al, JAIR'11]: need the plan in a split second.
Video games [Sturtevant, "Dragon Age: Origins"]: need the plan in a split second.
Vacuum cleaners, football, DARPA Grand Challenge, . . .

- Many planning applications take real-time decisions

- In others, planning models are not precise/exhaustive enough to enable exact/full solution . . .

- . . . and hence a human user waits online for the plan!

- Does anybody know an application not falling into these classes?


Page 14: Evaluating Planning Algorithms

Commandment 4: Compare to the most recent winner.

Some example data:

Domain    #instances   LM-cut   M&S-bop
Gripper           20        6        20
Miconic          150      140        55
Σ                170      146        75

- IPC domain = Miconic =⇒ "and the winner is . . . LM-cut!"
- IPC domain = Gripper =⇒ "and the winner is . . . M&S-bop!"
- IPC domain = both? Let's reverse the #instances . . .

- Performance is a function of the benchmarks used!

- IPC organizers make every effort to avoid the detrimental consequences . . .

- . . . still the best planner for your context may be someone else
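The point above can be checked mechanically. A minimal sketch, using the example numbers from the table and assuming each planner's per-domain solve rate stays fixed while we reweight how many instances each domain contributes:

```python
# domain -> (instances, LM-cut solved, M&S-bop solved), from the slide's table
per_domain = {
    "Gripper": (20, 6, 20),
    "Miconic": (150, 140, 55),
}

def totals(counts):
    """Expected total solved per planner if each domain had counts[d]
    instances, assuming per-domain solve fractions stay the same."""
    lm = ms = 0.0
    for d, (n, lm_solved, ms_solved) in per_domain.items():
        lm += counts[d] * lm_solved / n
        ms += counts[d] * ms_solved / n
    return lm, ms

print(totals({"Gripper": 20, "Miconic": 150}))  # original mix: LM-cut "wins"
print(totals({"Gripper": 150, "Miconic": 20}))  # reversed mix: M&S-bop "wins"
```

The "winner" flips purely as a function of the benchmark mix, which is exactly the slide's point.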


Page 15: Evaluating Planning Algorithms

IPC Summary

IPC Pros:

- Standard language (up to the '90s, every planner had its own input . . . )
- Large set of standard benchmarks; standard competitive setting
- Awards and excitement

IPC Con 1: not nearly as important as it's made out to be!

- Setting not representative of (most?) applications
- Many domains, but impossible to cover everything
- "Award" is (a) a very blunt "results summary" and (b) a function of the benchmarks

IPC Con 2: very particular experiment design!

- Spoon-feeds current planners to increase participation and match their performance
- Challenges search, not anything else (pre-processing . . . )
- No controlled scaling (scales everything at once)


Page 16: Evaluating Planning Algorithms

Take-Home Message

- The IPC-style experiment setup is a tradition . . .
- . . . sticking to which is suited as a standard for comparing competitive performance.

- But not for anything else!

- (On top of the usual IPC tests) do whatever is suited for determining advantages/disadvantages in your context!

- . . . and please don't be that reviewer.


Page 17: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 18: Evaluating Planning Algorithms

Homework

Some (simple?) rules to heed in experimentation (with planning systems).

- (Read Malte's papers, do whatever he does.)

- Look at Toby Walsh's web page:
  http://www.cse.unsw.edu.au/~tw/empirical.html

- IJCAI'01 tutorial on empirical methods in AI:
  http://www.cse.unsw.edu.au/~tw/ijcai2001.ppt

- "How Not To Do It":
  http://www.cse.unsw.edu.au/~tw/hownotto.pdf

- Paul Cohen, "Empirical Methods for AI", MIT Press, 1995


Page 19: Evaluating Planning Algorithms

The Four Commandments, Revisited

1. Have a hypothesis.

2. Be careful (with statistics/raw data/cut-offs/summarization).

3. Don’t change two things at once!!!

4. Report negative results!!!


Page 20: Evaluating Planning Algorithms

Commandment 1: Have a hypothesis.

What am I trying to show?

- Trivial? I reviewed lots of papers where this wasn't clear or where the experiment design wasn't suitable.
- No names here . . . does anyone know an example from myself?
- Cohen, survey of 150 AAAI papers: "Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis."

- No issue if all you investigate is competitive performance . . .

"H1: FF is faster than HSP."

- . . . more interesting if you wish to dig deeper!

"H2: FF is faster than HSP because of helpful actions pruning."


Page 21: Evaluating Planning Algorithms

Hypothesis Testing in a Nutshell

From IJCAI’01 tutorial:

- Example: toss a coin ten times, observe 8 heads. Is the coin fair, i.e., what is its long-run behavior? And what is your residual uncertainty?

- You say: "If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn't fair."

- Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (8 heads) has low probability p, reject the assertion with residual uncertainty related to p.

- For a comprehensive overview, please consult the IJCAI'01 tutorial
- For full details, consult a book . . .
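The coin example above is easy to reproduce. A minimal sketch computing the probability of a result at least as extreme as the observed one under the null hypothesis (fair coin):

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): probability of seeing at least
    k successes in n trials under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 or more heads in 10 tosses of a fair coin:
p_value = binom_tail(10, 8)
print(round(p_value, 4))  # 0.0547 -- "pretty unlikely", so reject, with ~5% residual uncertainty
```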


Page 22: Evaluating Planning Algorithms

Commandment 2(a): Be careful with statistics.

Am I using the right statistical test?

- Are the underlying assumptions justified?

- My first exposure to statistics: is A faster than B in a domain?
- Ran the "dependent t-test for paired samples" on the per-instance runtime differences X_D: t = X̄_D / (s_D / √n)

[Figure: two runtime scatter plots of A vs. B, one where the test answers "yes", one where it answers "no"]

- This test has no notion of "scaling" . . .
- . . . and assumes that X_D follows a normal distribution
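The test statistic itself is a one-liner. A sketch on hypothetical per-instance runtimes, with the slide's caveats repeated in the comments:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Dependent t-test statistic t = mean(D) / (s_D / sqrt(n)) on the
    per-instance differences D. Caveats from the slide apply: it assumes
    the differences are (roughly) normal, and it has no notion of scaling."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))

# Hypothetical per-instance runtimes of planners A and B (seconds):
a = [1.2, 3.4, 2.2, 5.0, 4.1]
b = [1.0, 2.9, 2.5, 4.2, 3.8]
print(paired_t(a, b))
```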


Page 23: Evaluating Planning Algorithms

Commandment 2(b): Be careful with raw data.

Look at the raw data, not only at summaries!

- Is there a phenomenon not visible at summary level?

- Example: "exceptionally hard cases" in search – rare cases several orders of magnitude harder than similar instances

- Aka "heavy-tailed behavior" [Carla Gomes et al, CP'97, . . . ]
- Does not appear in the median, may not be evident in the mean!

- Quotes from Gent et al, "How Not To Do It" / the IJCAI'01 tutorial:

"We missed them until they hit us on the head when experiments crashed. Old data on smaller problems showed clear behaviour."

"We thought the program had crashed so we killed the job . . . the next day the same thing happened with new data, and we realized that some problems were remarkably difficult."
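A tiny illustration of why summaries can hide heavy tails, on a hypothetical runtime sample: 99 easy instances plus one "exceptionally hard case":

```python
from statistics import mean, median

# Hypothetical runtimes: 99 easy instances and one case several
# orders of magnitude harder.
runtimes = [1.0] * 99 + [100000.0]

print(median(runtimes))  # 1.0 -- the hard case is invisible here
print(mean(runtimes))    # 1000.99 -- dominated by the single hard case
```

Neither summary describes the data well; only looking at the raw sample reveals the outlier.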


Page 24: Evaluating Planning Algorithms

Commandment 2(c): Be careful with cut-offs.

From the IJCAI'01 tutorial:

Wind speed vs. forest fire containment time (max 150 hours):

3 120 55 79 10 140 26 15 110 126 78 61 58 81 71 57 219 62 48 21 55 101

What's the problem??

Cut-offs may bias the sample!

- A lot of high-wind fires take > 150 hours to contain . . .
- . . . those that don't are similar to low-wind fires

- This kind of thing may happen in search just as well
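The censoring effect is easy to simulate. A sketch under an assumed model (exponentially distributed containment times for high-wind fires, true mean around 200 hours):

```python
import random

random.seed(0)

# Hypothetical model: high-wind fires take long to contain. A cut-off at
# 150 hours censors mostly the long fires, so the surviving sample looks
# deceptively tame.
high_wind = [random.expovariate(1 / 200) for _ in range(10000)]

cutoff = 150
observed = [t for t in high_wind if t <= cutoff]

true_mean = sum(high_wind) / len(high_wind)
observed_mean = sum(observed) / len(observed)
print(round(true_mean), round(observed_mean))  # the censored mean is far lower
```

The same bias hits runtime experiments: instances cut off at the time-out are precisely the ones that would pull the summary upward.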


Page 25: Evaluating Planning Algorithms

Commandment 2(d): Be careful with summarization.

The best summarization method depends on the situation.

- Median: sample point "in the middle of" the distribution
- Is often more robust than the mean
- (Well, can be a mixed blessing – heavy tails)

- Especially funny: mean of ratios, like runtime(A)/runtime(B)

- Arithmetic mean of 2 and 0.5 is 1.25 . . . !
- Thus for data A=2, B=1; A=1, B=2 we get that A is "better" than B since the mean of A/B > 1 . . . and vice versa for B/A . . . !
- [Example due to Malte Helmert]

- Geometric mean: (D_1 · . . . · D_n)^(1/n)
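The ratio paradox above in four lines, together with the geometric mean that avoids it:

```python
from math import prod

ratios_ab = [2.0, 0.5]           # runtime(A)/runtime(B) on Helmert's two instances
ratios_ba = [1 / r for r in ratios_ab]

arith = lambda xs: sum(xs) / len(xs)
geom = lambda xs: prod(xs) ** (1 / len(xs))

print(arith(ratios_ab), arith(ratios_ba))  # 1.25 and 1.25: each side "loses"
print(geom(ratios_ab), geom(ratios_ba))    # 1.0 and 1.0: no spurious winner
```

The geometric mean is symmetric under inverting the ratios, which is exactly the property the arithmetic mean lacks here.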


Page 26: Evaluating Planning Algorithms

Commandment 3: Don’t change two things at once!!!

- You will not know where the new behavior comes from
- Trivial? I've seen various papers proposing search heuristic A vs. old B that then compared planners X and Y, where X used A on search C, and Y used B on search D.

- If you wish to know the effect of options O1, . . . , On, then you need to run experiments on each configuration C ∈ O1 × · · · × On

- Called "ablation studies" or "factorial experiments"
- Simplified: C ∈ {o1} × · · · × {ok−1} × Ok × {ok+1} × · · · × {on}

- However, option interactions are often important!

- Examples: [Hoffmann&Nebel, JAIR'01 Sec 8.3.2; Röger&Helmert, ICAPS'10]

Ablation studies are the ONLY means to evaluate YOUR NEW IDEA, not only whether in sum it "beats" a completely different technique!
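Enumerating a full factorial design C ∈ O1 × · · · × On is mechanical. A sketch with hypothetical option names (mirroring the FF/HSP interpolation example later in the talk):

```python
from itertools import product

# Hypothetical option sets O1, O2, O3:
options = {
    "heuristic": ["hFF", "hadd"],
    "search": ["enforced-hill-climbing", "hill-climbing"],
    "pruning": ["helpful-actions", "none"],
}

# Full factorial design: one experiment per configuration in O1 x O2 x O3.
configs = [dict(zip(options, combo)) for combo in product(*options.values())]
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```

The "simplified" design in the slide corresponds to varying one Ok while pinning every other option to a fixed value, at the cost of missing option interactions.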


Page 27: Evaluating Planning Algorithms

Commandment 4: Report negative results!!!

What are the advantages – and the disadvantages!!! – of the technique I'm proposing here?

- In the good old days, "cherry-picking" was not only a travellers' job in Australia . . .

- (Even better now, no? "4 out of 40" . . . )
- Gold medal for "not hiding bad results" goes to Patrik Haslum

- Negative results can be illuminating . . .
  (e.g. the FF JAIR'01 paper shows uselessness on random SAT formulas)
- . . . and outright exciting!
  (e.g. [Domshlak&Hoffmann&Sabharwal, JAIR'09]: hopeless results spiced up by the observation that "abstraction can never improve the best-case resolution refutation size")


Page 28: Evaluating Planning Algorithms

A Cooking Recipe

1. Define objectives and hypotheses

2. Design the experiment to meet these
   2.1 Avoid biasing the outcome by settings, e.g. cut-offs
   2.2 To distinguish A from B, change nothing but A and B

3. Run limited samples to calibrate parameters

4. Run the experiment

5. Look at raw data to get an intuitive understanding

6. Design the data analysis
   6.1 Be careful to properly use summarization/statistics

7. Understand the analysis outcome

8. if unexpected behavior then goto 1

9. if something fishy then goto 2

10. if conclusions not crystal clear then goto 3

11. Report all results, including negative ones


Page 29: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 30: Evaluating Planning Algorithms

Buried beneath tons of data

- Anybody can generate 7 GB of data . . .
- . . . or much more than that, in case you're doing a factorial experiment . . .

- . . . but how to extract the relevant observations?
- . . . and present them within a 2-page conference paper?

- Yes, of course you need to summarize . . .
- . . . but how to? =⇒ understand first!

- Vicious circle: need to summarize in order to understand in order to decide how to summarize . . .

- Take an evolutionary approach


Page 31: Evaluating Planning Algorithms

Burying the reader beneath tons of data . . .

[Figure: a deliberately unreadable plot – unlabeled axes (0–300 vs. 0–10) and cryptic legend entries such as "rt-A*FG+XYZ-h-12" and "rt-A*FE-dt-B*ER7ZXY-f-17"]


Page 32: Evaluating Planning Algorithms

Burying the reader beneath tons of data . . .


Page 33: Evaluating Planning Algorithms

. . . showing clearly the relevant observations!

[Gomes et al., Constraints’05]


Page 34: Evaluating Planning Algorithms

Coverage

[Figure: coverage plot – % of instances solved within x minutes, x ranging from 0 to 30, with one curve per planner; Planner A reaches 90%, Planner B reaches 95%]
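Computing such a coverage curve from raw data is straightforward. A sketch with hypothetical per-instance runtimes (None marking instances not solved within the time-out):

```python
# % of instances solved within x minutes, given per-instance runtimes.
def coverage(runtimes_min, x):
    solved = sum(1 for t in runtimes_min if t is not None and t <= x)
    return 100.0 * solved / len(runtimes_min)

planner_a = [0.5, 2.0, 7.0, 25.0, None]  # hypothetical runtimes in minutes
for x in (1, 5, 10, 30):
    print(x, coverage(planner_a, x))
```

Plotting this for each planner over a range of x gives exactly the curves sketched in the figure.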


Page 35: Evaluating Planning Algorithms

Factorial Experiments C ∈ O1 × · · · × On

Example 1

- [Hoffmann&Nebel, JAIR'01]

- Interpolating between FF and HSP:
- O1 = {h^FF, h^add}
- O2 = {Enforced hill-climbing, Hill-climbing}
- O3 = {Helpful actions, None}

- 2^3 = 8 combinations

- Not too bad?


Page 36: Evaluating Planning Algorithms

Interpolating between FF and HSP

How Not To Do It:

(From initial JAIR submission; “significantly better” decided by hand)


Page 37: Evaluating Planning Algorithms

Interpolating between FF and HSP

Significant per-domain improvements/deteriorations:


Page 38: Evaluating Planning Algorithms

Factorial Experiments C ∈ O1 × · · · × On

Example 2

- [Röger&Helmert, ICAPS'10]

- How to combine heuristic estimators?
- O1 = 2^{h^FF, h^CG, h^cea}
- O2 = {max, sum, tie-break, pareto, alternation, alternation-TB}

- 4 ∗ 6 + 3 ∗ 1 = 27 combinations . . .

- (Granted, large n is more of a headache than large |Oi|)


Page 39: Evaluating Planning Algorithms

How to combine heuristic estimators?

Cross-domain summary:

- Coverage score: 100 if solved, 0 else
- Quality score: like IPC'08, i.e., 100 ∗ q∗/q
- Speed score: interpolate logarithmically between 1 sec and the time-out of 1800 sec
- Guidance score: interpolate logarithmically between 100 and 1000000 expansions
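A sketch of the logarithmic interpolation behind such a speed score, under the stated assumption: full score at or below 1 second, zero at the 1800-second time-out, log-linear in between (the guidance score works the same way with expansions in place of seconds):

```python
from math import log

def speed_score(t, lo=1.0, hi=1800.0):
    """100 at t <= lo, 0 at t >= hi (or unsolved), log-linear in between."""
    if t is None or t >= hi:
        return 0.0
    if t <= lo:
        return 100.0
    return 100.0 * (1 - log(t / lo) / log(hi / lo))

print(speed_score(1))     # 100.0
print(speed_score(1800))  # 0.0
print(round(speed_score(60), 1))  # somewhere in between, on a log scale
```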


Page 40: Evaluating Planning Algorithms

How to combine heuristic estimators?

Per-domain zoom-in: coverage differences when switching to Alternation ("+" newly solved, "−" now unsolved).


Page 41: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 42: Evaluating Planning Algorithms

LPG vs. FF in “Mystery”

task       LPG      FF
prob-01     0.01    0.00
prob-02     0.22    0.00
prob-03     0.04    0.00
prob-04     –       –
prob-05     –       –
prob-06    86.33    –
prob-08     –       –
prob-09     0.08    0.01
prob-10    14.41    –
prob-11     0.01    0.00
prob-12     –       –
prob-13     –       –
prob-14   990.78    1.72
prob-15     1.39    0.04
prob-16     –       –
prob-17     1.29    0.03
prob-19     0.38    0.73
prob-20     0.27    0.02
prob-21     –       –
prob-22     –       –
prob-23     –       –
prob-24     –       –
prob-25     0.00    0.00
prob-26     0.06    0.04
prob-27    12.05    0.00
prob-28     0.00    0.00
prob-29     0.05    0.00
prob-30     0.95    0.01


Page 43: Evaluating Planning Algorithms

Counting Black Sheep

An astronomer, a physicist and a mathematician are on a train in Scotland. The astronomer looks out of the window, sees a black sheep standing in a field, and remarks:

"How odd. Scottish sheep are black."

"No, no, no!" says the physicist. "Only some Scottish sheep are black."

The mathematician rolls his eyes at his companions' muddled thinking and says, "In Scotland, there is at least one sheep, at least one side of which is black."


Page 44: Evaluating Planning Algorithms

LPG vs. FF in “NoMystery”

[Figure: % solved as a function of the ratio of available fuel vs. minimum fuel (1.0 to 2.0), one curve each for LPG and FF]


Page 45: Evaluating Planning Algorithms

The “Performance Function”

- Performance is a function of algorithm and planning problem: f(A, P)

- Running a test =⇒ one point of that function

- Experiments: "What is the form of f(A, P)?"

[Figure: sample points of a function f(x), once overlaid with an extrapolated "astronomer's hypothesis" curve and once without]


Page 46: Evaluating Planning Algorithms

The “Performance Function”, ctd.

Why is it difficult to determine "the form of f(A, P)"?

(1) The form is a priori completely unknown (unlike f(x) = ax² + bx + c)

(2) "A" is highly complex/structured

(3) "P" is highly complex/structured

(2,3) =⇒ want to know "what kind of" algorithm/task:

p(F^A_1(A), . . . , F^A_n(A), F^P_1(P), . . . , F^P_m(P))

- F^A / F^P: algorithm/problem features

- What features? All relevant ones, ideally

- Which are those? It's a kind of magic . . .


Page 47: Evaluating Planning Algorithms

The Performance Function in NoMystery

What did we do better in NoMystery?

p( F^A_1(A) ∈ {FF, LPG}, F^P_1(P) = size, roadmap, etc., F^P_2(P) = available vs. minimum fuel ratio )

- We changed exactly one problem feature – F^P_2(P)

- In Mystery, we unsystematically changed everything
- Same for the IPC! No notion of "problem features", no good for counting sheep!

"There exists a sheep with a black side" vs. "The more gene X has property Y, the blacker is the sheep"


Page 48: Evaluating Planning Algorithms

Changing a single algorithm feature F^A_i at a time

== ablation studies!


Page 49: Evaluating Planning Algorithms

Changing a single problem feature F^P_i at a time

What are useful problem features?

- A simple one: the domain
- Presenting results per-domain ≡ vary only F^P_1 ∈ {domains}

- More simple ones: instance size parameters
- Scaling a size parameter ≡ vary only F^P_i = number-of-trucks etc.

- More subtle F^P_i relevant to algorithms: an art form!

- Work hard, keep your eyes open, use your intuition, . . .
- . . . copy from others


Page 50: Evaluating Planning Algorithms

F^P_i = amount of uncertainty in the model

[Sarraute&Buffet&Hoffmann, SecArt’11]


Page 51: Evaluating Planning Algorithms

F^P_i = ratio available fuel vs. minimum fuel

[Figure: % solved as a function of the ratio of available fuel vs. minimum fuel (1.0 to 2.0), one curve each for LPG and FF]

[Hoffmann&Kautz&Gomes&Selman, IJCAI’07]


Page 52: Evaluating Planning Algorithms

F^P_i = ratio available freecells vs. minimum freecells

[Figure: runtime as a function of the ratio of available freecells vs. minimum freecells (1 to 2.5), one curve each for LPG and FF]

[Hoffmann, never to be published]


Page 53: Evaluating Planning Algorithms

F^P_i = "AsymRatio": max_{g∈G} cost(g) / cost(⋀_{g∈G} g)

[Hoffmann&Gomes&Selman, LMCS’07]


Page 54: Evaluating Planning Algorithms

F^P_i = "Conformant Width"

[Palacios&Geffner, JAIR’09]


Page 55: Evaluating Planning Algorithms

F^P_i = Constrainedness

[Mitchell&Selman&Levesque, AAAI’92]


Page 56: Evaluating Planning Algorithms

Outline

- Evaluation? What's that?
- On the IPC and other bizarre rituals
- Do your homework!
- Buried beneath tons of data
- The black art of counting black sheep
- Understanding the world


Page 57: Evaluating Planning Algorithms

Empirical CS == Natural Science

"In Lincolnshire, summer 1666, an apple fell straight to the ground."

"Everywhere, always, apples fall straight to the ground."

"It's because of gravity!"


Page 58: Evaluating Planning Algorithms

Your Empirical CS == Natural Science

Observation: "In instance αβγ of domain XYZ, my planner is faster than version ABC of planner foo-bar."

Generalization/Formalization: "If the instance has property X then algorithms of type Y have property Z."

Explanation: "It's because of search space property φ!"


Page 59: Evaluating Planning Algorithms

How Good is Almost Perfect?

[Helmert&Roger, AAAI’08]:

Definition. Let T be a planning task, and let c ∈ ℕ. Define the heuristic function h∗ − c as (h∗ − c)(s) := max(0, h∗(s) − c). Define N_c(T) as the number of states s where g(s) + (h∗ − c)(s) < h∗(T).

N_c(T): the number of states that must be expanded by A* with the almost-perfect heuristic h∗ − c.

Theorem. In Gripper, N_1(T_n) grows exponentially with the number of balls. In Miconic-Simple, there exist scaling families of tasks T_n where N_4(T_n) grows exponentially with n. In Blocksworld, there exist scaling families of tasks T_n where N_1(T_n) grows exponentially with n.
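On a small explicit state space, N_c(T) can be computed by brute force. A sketch assuming unit action costs and a hypothetical diamond-shaped graph (chosen because its transpositions are what inflate N_c):

```python
from collections import deque

def bfs_dist(adj, src):
    """Unit-cost shortest distances from src in the graph adj."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def n_c(adj, radj, init, goal, c):
    """Count states s with g(s) + max(0, h*(s) - c) < h*(init)."""
    g = bfs_dist(adj, init)        # cheapest cost of reaching s
    h_star = bfs_dist(radj, goal)  # perfect heuristic: distance to the goal
    opt = h_star[init]
    return sum(1 for s in g
               if s in h_star and g[s] + max(0, h_star[s] - c) < opt)

# Hypothetical diamond graph 0 -> {1,2} -> 3 -> 4, plus its reverse:
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
radj = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [3]}
print(n_c(adj, radj, init=0, goal=4, c=1))  # 4: the tiny inaccuracy already exposes 4 of 5 states
```

With the perfect heuristic (c = 0) no state satisfies the strict inequality; already at c = 1, the transpositions through states 1 and 2 show up, which is the mechanism behind the exponential growth in the theorem.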


Page 60: Evaluating Planning Algorithms

How Good is Almost Perfect?

Observation:
- A* doesn't scale in the IPC instances of trivial domains like Gripper, with any of the known admissible heuristics

Generalization/Formalization:
- The search space of A* must necessarily grow exponentially in these domains, even with almost-perfect heuristics
- (In contrast to known tractability results for almost-perfect heuristics)

Explanation:
- The goal state can be reached in many different ways (transpositions)
- (Main proof argument)

Best Paper Award at AAAI'08


Page 61: Evaluating Planning Algorithms

Where Ignoring Delete Lists Works

[Hoffmann, AIPS’02, JAIR’05]:

[Figure: taxonomy of planning domains by the topology of h+, with classes "undirected", "harmless", "recognized", and "unrecognized", arranged by bounds on the maximal exit distance from local minima and benches (mlmed <= c, mbed <= c). Domains placed include: Hanoi [0], Blocksworld-no-arm [0], Fridge [0], Briefcaseworld [0], Grid [0], Logistics [0,1], Ferry [0,1], Gripper [0,1], Tireworld [0,6], Satellite [4,4], Zenotravel [2,2], Miconic-SIMPLE [0,1], Miconic-STRIPS [0,1], Movie [0,1], Simple-Tsp [0,0], Driverlog, Depots, Blocksworld-arm, Schedule [5,5], Dining-Phil. [31,31], Airport, Assembly, Freecell, Miconic-ADL, Mprime, Mystery, Optical-Telegraph, Rovers, PSR, Pipesworld. Numbers in brackets: h+ "exit distance" from states on local minima/benches]


Page 62: Evaluating Planning Algorithms

Where Ignoring Delete Lists Works

Observation:
- Relaxed-plan heuristics seem to work well in some domains, but not in others

Generalization/Formalization:
- Taxonomy of domain categories sharing topological properties of the idealized heuristic h+

Explanation:
- Connections between "optimal actions" in real and relaxed versions of the respective domains
- (Main proof argument)

2002 Award for Best European Dissertation in AI


Page 63: Evaluating Planning Algorithms

Final Punchline

It’s about understanding the world

not about “my apple flies faster than yours”


Page 64: Evaluating Planning Algorithms

p.s. Are we solving the right problem here?

Natural Language Generation: [Koller&Hoffmann, ICAPS'10]
- Performance: OK based on a trivial modification of FF
- Why planning? PDDL cheaper to write than code
- Main issue: PDDL modeling (understand planner reaction)

Attack Path Generation: (with Core Security Technologies)
- Performance: OK based on an easy modification of FF
- Why planning? PDDL cheaper to write than code
- Main issue: PDDL modeling (understand planner reaction)

Creating business processes at SAP: [Hoffmann et al, AAAI'10]
- Performance: OK based on an easy adaptation of FF
- Why planning? Flexibility required
- Main issue: "PDDL" modeling (5 years, 200 people, special GUI, design patterns, naming conventions, governance process, review meetings, council supervision, educational training)


Page 65: Evaluating Planning Algorithms

References

- "How Not To Do It": http://www.cse.unsw.edu.au/~tw/hownotto.pdf

- IJCAI'01 tutorial on empirical methods in AI: http://www.cse.unsw.edu.au/~tw/ijcai2001.ppt

- Toby Walsh's web page on empirical methods in CS and AI: http://www.cse.unsw.edu.au/~tw/empirical.html

- P. Cohen, "Empirical Methods for AI", MIT Press, 1995.

- C. Domshlak, J. Hoffmann, and A. Sabharwal, Friends or Foes? On Planning as Satisfiability and Abstract CNF Encodings, Journal of Artificial Intelligence Research 36: 415-469, 2009.

- C. Gomes, C. Fernandez, B. Selman, and C. Bessiere, Statistical Regimes Across Constrainedness Regions, Constraints 10(4): 317-337, 2005.

- C. Gomes, B. Selman, and N. Crato, Heavy-Tailed Distributions in Combinatorial Search, Principles and Practice of Constraint Programming, 3rd International Conference (CP'97).


Page 66: Evaluating Planning Algorithms

References

- M. Helmert, P. Haslum, and J. Hoffmann, Flexible Abstraction Heuristics for Optimal Sequential Planning, Proceedings of the 17th International Conference on Automated Planning and Scheduling (ICAPS'07).

- M. Helmert and G. Röger, How Good is Almost Perfect?, Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI'08).

- J. Hoffmann, Local Search Topology in Planning Benchmarks: A Theoretical Analysis, Proceedings of the 6th International Conference on Artificial Intelligence Planning and Scheduling (AIPS'02).

- J. Hoffmann, Where Ignoring Delete Lists Works: Local Search Topology in Planning Benchmarks, Journal of Artificial Intelligence Research 24: 685-758, 2005.

- J. Hoffmann, Where Ignoring Delete Lists Works, Part II: Causal Graphs, Proceedings of the 21st International Conference on Automated Planning and Scheduling (ICAPS'11).


Page 67: Evaluating Planning Algorithms

References

- J. Hoffmann, C. Gomes, and B. Selman, Structure and Problem Hardness: Goal Asymmetry and DPLL Proofs in SAT-based Planning, Logical Methods in Computer Science 3 (1-6), 2007.

- J. Hoffmann, H. Kautz, C. Gomes, and B. Selman, SAT Encodings of State-Space Reachability Problems in Numeric Domains, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07).

- J. Hoffmann and B. Nebel, The FF Planning System: Fast Plan Generation Through Heuristic Search, Journal of Artificial Intelligence Research 14: 253-302, 2001.

- J. Hoffmann, I. Weber, and F. Kraft, SAP Speaks PDDL, Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI'10).

- M. Katz and C. Domshlak, Optimal Additive Composition of Abstraction-based Admissible Heuristics, Proceedings of the 18th International Conference on Automated Planning and Scheduling (ICAPS'08).


Page 68: Evaluating Planning Algorithms

References

- A. Koller and J. Hoffmann, Waking Up a Sleeping Rabbit: On Natural-Language Sentence Generation with FF, Proceedings of the 20th International Conference on Automated Planning and Scheduling (ICAPS'10).

- A. Koller and R. Petrick, Experiences with planning for natural language generation, Computational Intelligence 27(1): 23-40, 2011.

- D. Mitchell, B. Selman, and H. Levesque, Hard and Easy Distributions of SAT Problems, Proceedings of the 10th National Conference of the American Association for Artificial Intelligence (AAAI'92).

- R. Nissim, J. Hoffmann, and M. Helmert, Computing Perfect Heuristics in Polynomial Time: On Bisimulation and Merge-and-Shrink Abstraction in Optimal Planning, Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11).


Page 69: Evaluating Planning Algorithms

References

- H. Palacios and H. Geffner, Compiling Uncertainty Away in Conformant Planning Problems with Bounded Width, Journal of Artificial Intelligence Research 35: 623-675, 2009.

- G. Röger and M. Helmert, The More, the Merrier: Combining Heuristic Estimators for Satisficing Planning, Proceedings of the 20th International Conference on Automated Planning and Scheduling (ICAPS'10).

- W. Ruml, M. Do, R. Zhou, and M. Fromherz, On-line Planning and Scheduling: An Application to Controlling Modular Printers, Journal of Artificial Intelligence Research 40: 415-468, 2011.

- C. Sarraute, O. Buffet, and J. Hoffmann, Penetration Testing == POMDP Solving?, Proceedings of the 3rd Workshop on Intelligent Security (SecArt'11), at IJCAI'11.
