
Planning to Gather Information

Richard Dearden, University of Birmingham

Joint work with Moritz Göbelbecker (ALU), Charles Gretton, Bramley Merton (NOC), Zeyn Saigol, Mohan Sridharan (Texas Tech), Jeremy Wyatt

Underwater Vent Finding

• AUV used to find vents
• Can detect the vent itself (reliably), and the plume of fresh water it emits
• Problem is where to go to collect data to find the vents as efficiently as possible
• Hard because plume detection is unreliable; can't easily assign 'blame' for the detections we do make

Vision Algorithm Planning

Goal: Answer queries and execute commands.
• Is there a red triangle in the scene?
• Move the mug to the right of the blue circle.

Our operators: colour, shape, SIFT identification, viewpoint change, zoom etc.

Problem: Build a plan to achieve the goal with high confidence

Assumptions

The visual operators are unreliable
• Reliability can be represented by a confusion matrix, computed from data

Speed of response and answering the query correctly are what really matter
• We want to build the fastest plan that is 'reliable enough'
• We should include planning time in our performance estimate too

Shape confusion matrix (rows: actual shape, columns: observed shape)

Actual \ Observed   Square   Circle   Triangle
Square              0.85     0.10     0.05
Circle              0.10     0.80     0.10
Triangle            0.10     0.05     0.85
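As an aside (not from the slides), a minimal sketch of how such a confusion matrix can be read as P(observed | actual) and used to simulate the unreliable shape operator; the numbers are taken from the table above and the function name is illustrative:

```python
# Sketch: treat each confusion-matrix row as P(observed | actual) and sample
# a noisy report from the shape operator. Numbers are from the table above.
import random

SHAPES = ["Square", "Circle", "Triangle"]
CONFUSION = {
    "Square":   {"Square": 0.85, "Circle": 0.10, "Triangle": 0.05},
    "Circle":   {"Square": 0.10, "Circle": 0.80, "Triangle": 0.10},
    "Triangle": {"Square": 0.10, "Circle": 0.05, "Triangle": 0.85},
}

def observe(actual):
    """Sample what the unreliable shape operator reports for a true shape."""
    weights = [CONFUSION[actual][s] for s in SHAPES]
    return random.choices(SHAPES, weights=weights)[0]

print(observe("Circle"))  # 'Circle' about 80% of the time
```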

POMDPs

Partially Observable Markov Decision Problems

Markov Decision Problem:
• (discrete) states, stochastic actions, reward
• Maximise expected (discounted) long-term reward
• Assumption: state is completely observable

POMDPs: MDPs with observations
• Infer state from (sequence of) observations
• Typically maintain a belief state, plan over that
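A minimal sketch of the standard belief-state update used when planning over beliefs (a generic Bayes filter, not the authors' code); T and O are the transition and observation functions defined on the next slide:

```python
# Standard discrete POMDP belief update: b'(s') ∝ O(s', a, z) * Σ_s T(s, a, s') b(s).
def belief_update(belief, action, observation, T, O, states):
    """belief: dict state -> probability; T(s, a, s2) and O(s2, a, z) return probabilities."""
    new_belief = {}
    for s2 in states:
        predicted = sum(T(s, action, s2) * belief[s] for s in states)
        new_belief[s2] = O(s2, action, observation) * predicted
    norm = sum(new_belief.values())
    # If the observation has zero probability under the model, keep the old belief.
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else belief
```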

POMDP Formulation

States: Cartesian product of individual state vectors

Actions: A = {Colour, Shape, SIFT, terminal actions}

Observations: {red, green, blue, circle, triangle, square, empty, unknown}

Transition function

Observation function: given by the confusion matrices

Reward specification: time cost of actions, large +ve/-ve rewards on terminal actions

Maintain belief over states, likelihood of action outcomes

T: S × A × S → [0,1]  (transition function)
O: S × A × Z → [0,1]  (observation function, from the confusion matrices)
R: S × A → ℝ  (reward function)
Z_c = {R_c, G_c, B_c, ∅_c, U_c}  (colour observations: red, green, blue, empty, unknown)
Z_s = {C_s, T_s, S_s, ∅_s, U_s}  (shape observations: circle, triangle, square, empty, unknown)
Z = ∪_{a∈A} Z_a
Information-gathering actions a_I ∈ {Shape, Colour, SIFT}; terminal actions a_T
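A sketch (my reading of the formulation, not the authors' code) of how the observation function can be assembled from the per-operator confusion matrices, assuming each operator's report depends only on the corresponding component of the state; the colour-matrix values here are made up for illustration:

```python
# Observation function O(s, a, z) assembled from per-operator confusion matrices.
# Assumption: the Colour operator's report depends only on the colour component
# of the state, and the Shape operator's only on the shape component.
COLOUR_CM = {  # P(observed colour | actual colour) -- illustrative values
    "Red":   {"Red": 0.90, "Green": 0.05, "Blue": 0.05},
    "Green": {"Red": 0.05, "Green": 0.90, "Blue": 0.05},
    "Blue":  {"Red": 0.05, "Green": 0.05, "Blue": 0.90},
}
SHAPE_CM = {   # P(observed shape | actual shape), from the earlier table
    "Square":   {"Square": 0.85, "Circle": 0.10, "Triangle": 0.05},
    "Circle":   {"Square": 0.10, "Circle": 0.80, "Triangle": 0.10},
    "Triangle": {"Square": 0.10, "Circle": 0.05, "Triangle": 0.85},
}

def O(state, action, z):
    """state = (colour, shape); return P(z | state, action)."""
    colour, shape = state
    if action == "Colour":
        return COLOUR_CM[colour].get(z, 0.0)
    if action == "Shape":
        return SHAPE_CM[shape].get(z, 0.0)
    return 1.0 if z == "unknown" else 0.0  # terminal actions are uninformative

print(O(("Blue", "Circle"), "Shape", "Circle"))  # 0.8
```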

POMDP Formulation

For a broad query: 'what is that?'

For each ROI:
• 26 states (5 colours × 5 shapes + term)
• 12 actions (2 operations, 10 terminal actions: SayBlueSquare, SayRedTriangle, SayUnknown, …)
• 8 observations

For n ROIs:
• 25^n + 1 states (see the quick calculation below)
• Impractical for even a very small number of ROIs
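A quick check of that growth (assuming 25 object states per ROI plus one shared terminal state, which is how I read the bullet above):

```python
# State-space size for n ROIs, assuming 25 object states per ROI + 1 terminal state.
for n in range(1, 6):
    print(n, 25 ** n + 1)  # 1 -> 26, 3 -> 15626, 5 -> 9765626
```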

BUT: There’s lots of structure. How to exploit it?

A Hierarchical POMDP

Proposed solution: Hierarchical Planning in POMDPs – HiPPo
• One LL-POMDP for planning the actions in each ROI
• Higher-level POMDP to choose which LL-POMDP to use at each step

Significantly reduces complexity of the state-action-observation space

Model creation and policy generation are automatic, based on the input query

[Diagram: the HL-POMDP decides which region to process; the selected LL-POMDP decides how to process it]
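A rough sketch of how the two levels could interact (function and action names are placeholders, not the HiPPo implementation): the HL-POMDP chooses an ROI to process or answers the query; the corresponding LL policy runs to completion and reports found / not found, which updates the HL belief:

```python
# Two-level control loop sketch. hl_policy maps an HL belief to an HL action,
# ll_policies[roi].execute(roi) runs that region's LL policy and returns True
# if the object was reported found; hl_update is the HL belief update.
def run_hippo(hl_policy, hl_update, ll_policies, rois, hl_belief):
    while True:
        action = hl_policy(hl_belief)              # e.g. "DoR1", "DoR2" or a Say* action
        if action.startswith("Say"):
            return action                          # terminal action answers the query
        roi = rois[action]                         # region selected by the HL-POMDP
        found = ll_policies[roi].execute(roi)      # run the LL policy, get a definite label
        observation = ("Found" if found else "NotFound", roi)
        hl_belief = hl_update(hl_belief, action, observation)
```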

Low-level POMDP

The LL-POMDP is the same as the flat POMDP
• Only ever operates on a single ROI
• 26 states, 12 actions

Reward combines time-based cost for actions and answer quality

A = {Colour, Shape, …}, with information-gathering actions a_I ∈ {Shape, Colour} and terminal actions a_T
R: S × A → ℝ
O: S × A × Z → [0,1]
Z_c = {R_c, G_c, B_c, ∅_c, U_c}, Z_s = {C_s, T_s, S_s, ∅_s, U_s}, Z = ∪_{a∈A} Z_a

Terminal actions are answering the query for this region

Example

Query: ‘where is the blue circle?’

State space: {RedCircle, RedTriangle, BlueCircle, BlueTriangle, …, Terminal}

Actions: {Colour, Shape, …, SayFound, …}

Observations: {Red, Blue, NoColour, UnknownColour, Triangle, Circle, NoShape, UnknownShape, …}

Observation probabilities given by confusion matrix
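For concreteness, a small illustrative spec of this LL-POMDP (trimmed to three colours and three shapes rather than the five of each used on the earlier slide), together with the uniform prior assumed for the policy tree below:

```python
# Illustrative LL-POMDP spec for the query 'where is the blue circle?'.
from itertools import product

COLOURS = ["Red", "Green", "Blue"]
SHAPES = ["Circle", "Triangle", "Square"]
STATES = [c + s for c, s in product(COLOURS, SHAPES)] + ["Terminal"]
ACTIONS = ["Colour", "Shape", "SayFound", "SayNotFound"]
OBSERVATIONS = ["Red", "Green", "Blue", "NoColour", "UnknownColour",
                "Circle", "Triangle", "Square", "NoShape", "UnknownShape"]

# Uniform prior over the object states (zero mass on Terminal).
object_states = [s for s in STATES if s != "Terminal"]
prior = {s: (1.0 / len(object_states) if s != "Terminal" else 0.0) for s in STATES}
```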

Policy

Policy tree for uniform prior initial state

We limit all LL policies to a fixed maximum number of steps

[Policy tree: the root action is Colour; its observation branches (R, B, …) lead to Shape actions, whose observation branches (C, T, …) lead to the terminal actions sFound / sNotFound]

High-level POMDP

State space consists of the regions the object of interest is in

Actions are regions to process

Observations are whether the object of interest was found in a particular region

We derive the observation function and action costs for the HL-POMDP from the policy tree for the LL-POMDP

A_H = {u_1, u_2, …} ∪ A_s, with processing actions a_I ∈ {u_1, u_2} and terminal (say) actions a_T ∈ A_s
Z = {F_Ri | Ri ∈ ROIs}
O: S × A × Z → [0,1]
R: S × A → ℝ
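A sketch of that derivation (the tree representation is my own, not the authors' data structure): traverse the LL policy tree once per underlying state to get P(the LL policy reports Found | state) and the expected time cost, which become the HL observation probabilities and action costs:

```python
# Each internal node: {"action": ..., "cost": time_cost, "branches": {obs: child}};
# each leaf: {"action": "SayFound"} or {"action": "SayNotFound"}.
# obs_model(state, action, obs) gives P(obs | state, action) from the confusion
# matrices; branches are assumed to cover all observations of that action.
def found_prob_and_cost(node, state, obs_model):
    """Return (P(LL policy ends in SayFound | state), expected time cost | state)."""
    if node["action"].startswith("Say"):              # leaf: definite answer
        return (1.0 if node["action"] == "SayFound" else 0.0), 0.0
    p_found, exp_cost = 0.0, node["cost"]
    for obs, child in node["branches"].items():
        p_obs = obs_model(state, node["action"], obs)
        pf, cost = found_prob_and_cost(child, state, obs_model)
        p_found += p_obs * pf
        exp_cost += p_obs * cost
    return p_found, exp_cost
```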

Treat the LL-POMDP as a black box that returns definite labels (not belief densities)

Example

Query: ‘where is the blue circle?’

State space: {R1∧R2, R1∧¬R2, ¬R1∧R2, ¬R1∧¬R2} (whether the blue circle is in R1 and/or R2)

Actions: {DoR1, DoR2, SayR1, SayR2, SayR1^R2, SayNo}

Observations: {FoundR1, ¬FoundR1, FoundR2, ¬FoundR2}

Observation probabilities are computed from the LL-POMDP

Results (very briefly)

Approach       Reliability (%)
No Planning    76.67
CP             76.67
Hier-P         91.67

Vent Finding Approach

Assume mapping using an occupancy grid
Rewards only for visiting cells with vents in them
State space also too large to solve as a POMDP
• Instead do fixed-length lookahead in belief space
• Reasoning in belief space allows us to account for the value of information gained from observations
• Use P(vent | all observations so far) as the heuristic value at the end of the lookahead
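A generic sketch of that bounded-depth search in belief space (not the AUV planner itself); the belief update, observation likelihood, reward and heuristic are passed in as functions:

```python
# Fixed-depth lookahead in belief space: enumerate action/observation branches
# to a depth bound, score leaf beliefs with a heuristic such as
# P(vent | all observations so far), and return the best first action.
def lookahead(belief, depth, actions, observations, update, reward, p_obs, heuristic):
    if depth == 0:
        return heuristic(belief), None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = reward(belief, a)
        for z in observations:
            pz = p_obs(belief, a, z)              # P(z | belief, a)
            if pz > 0:
                b2 = update(belief, a, z)         # Bayes belief update
                v, _ = lookahead(b2, depth - 1, actions, observations,
                                 update, reward, p_obs, heuristic)
                value += pz * v
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```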

What we’re working on now

Most of these POMDPs are too big to solve

Take a domain and problem description in a very general language and generate a classical planning problem for it
• Assume we can observe any variable we care about

For each such observation, use a POMDP planner to determine the value of the variable with high confidence
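A very rough sketch of that scheme (all names are placeholders, not a real planner interface): plan classically as if hidden variables were observable, then let a POMDP planner produce a sensing sub-plan for each variable the classical plan assumed it could observe:

```python
# classical_planner(problem) returns a list of steps; step.assumed_observations
# (hypothetical attribute) lists the variables that step pretended were observable;
# pomdp_planner(var) returns a sensing sub-plan that establishes the variable's
# value with high confidence.
def plan_with_assumed_observations(classical_planner, pomdp_planner, problem):
    plan = classical_planner(problem)
    executable = []
    for step in plan:
        for var in getattr(step, "assumed_observations", []):
            executable.append(pomdp_planner(var))   # sensing sub-plan for this variable
        executable.append(step)
    return executable
```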