
Page 1:

The Permutable POMDP: Fast Solutions to POMDPs for Preference Elicitation

Finale Doshi (MIT, Cambridge University)
Nick Roy (MIT)

AAMAS 2008

Page 2:

Motivation

Imagine an agent trying to help book a flight...

A successful agent must manage uncertainty in the dialog.

I'd like to fly to Paris

I'd like to buy two pairs

???

Pages 3-6: Toy Example

● The Partially Observable Markov Decision Process (POMDP) planning framework can optimally manage the dialog uncertainty.
● For our toy example, the model consists of:
  – States: the user's desired vacation location (hidden but fixed).
  – Actions:
    - Query: “Where do you want to go?”
    - Confirm: “Did you say Rome?”
    - Submit: “You're booked to Rome.”
  – Observations: the user's utterances; the observation model depends on the state and the agent's action.
● Agent rewarded for submitting the correct location and for shorter dialogs.

[Figure: two-slice graphical model with nodes for the vacation location (hidden, fixed), the user's utterance, and the agent's action.] A concrete sketch of this model in code follows below.
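To make the toy example concrete, here is a minimal sketch (mine, not from the talk) of how the model could be written down in Python/numpy. The action and observation names follow the slides (“ask,” “confirm,” “submit”); the exact noise levels and reward values are illustrative assumptions.

```python
# Minimal sketch of the toy dialog POMDP (illustrative numbers, not from the talk).
import numpy as np

locations = ["Rome", "London", "Paris"]          # hidden, fixed vacation location
n = len(locations)

# Actions: one query, plus one confirm and one submit per location.
actions = ["ask"] + [f"confirm-{l}" for l in locations] + [f"submit-{l}" for l in locations]

# Observations: one utterance per location plus yes/no answers to confirmations.
observations = locations + ["yes", "no"]

# Transition model: the state never changes (hidden but fixed).
T = {a: np.eye(n) for a in actions}              # T[a][s, s'] = Pr(s' | s, a)

# Observation model O[a][s, o] = Pr(o | s, a), with assumed noise levels.
p_hear, p_confirm = 0.8, 0.9                     # assumption: how reliably we hear the user
O = {}
for a in actions:
    O[a] = np.zeros((n, len(observations)))
    for s in range(n):
        if a == "ask":
            O[a][s, :n] = (1 - p_hear) / (n - 1)
            O[a][s, s] = p_hear                  # most likely to hear the true location
        elif a.startswith("confirm-"):
            correct = a.split("-", 1)[1] == locations[s]
            O[a][s, observations.index("yes")] = p_confirm if correct else 1 - p_confirm
            O[a][s, observations.index("no")] = 1 - p_confirm if correct else p_confirm
        else:                                    # submit: observation is uninformative
            O[a][s, :] = 1.0 / len(observations)

# Reward model R[a][s]: small cost per question, big reward/penalty on submit (assumed values).
R = {}
for a in actions:
    if a.startswith("submit-"):
        R[a] = np.array([10.0 if a.split("-", 1)[1] == l else -50.0 for l in locations])
    else:
        R[a] = np.full(n, -1.0)                  # every extra question makes the dialog longer
```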

Pages 7-8: A useful insight

● POMDPs are often hard to solve; the optimal action depends on the agent's belief (a probability distribution over the hidden state).
● However, this POMDP has a special structure: the correct type of action depends only on the 'shape' of the belief, not the belief itself.

[Figures: pairs of belief bar charts over Rome, London, and Paris (probability on the y-axis). In the first pair, the best action in both cases is to book the most likely state; in the second pair, the best action in both cases is to confirm the most likely state.]

Page 9: The More General Case

● We can think of many scenarios (booking systems, appointment systems, etc.) that have a similar structure:
  – Goal: determine the identity of a fixed, hidden state.
  – Actions: divide into classes of types with similar but state-dependent effects.
  – Observations: also divide into classes with similar, state-dependent effects.
● In all of these cases, only the shape of the belief matters!
● Note: other algorithms such as the AMDP and the Summary POMDP have used this idea, but ours is not an approximation!

Page 10: Intuitively, why is this useful?

● When solving, we know all permutations of a belief have the same value.
● This exponentially decreases the belief space we need to consider!

[Figure: value of belief plotted over the belief simplex for states 1 and 2.] A small sketch of this canonicalization follows below.
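A minimal sketch of the idea in code: mapping every belief to its sorted “shape” collapses all permutations of a belief to one representative. The function name `canonical` and the rounding used for deduplication are my own choices, not from the paper.

```python
import numpy as np

def canonical(belief):
    """Return the 'shape' of a belief: its entries sorted in descending order."""
    return np.sort(np.asarray(belief, dtype=float))[::-1]

# All permutations of the same belief share one canonical representative.
b1 = np.array([0.7, 0.2, 0.1])   # e.g. Pr(Rome), Pr(London), Pr(Paris)
b2 = np.array([0.1, 0.7, 0.2])   # same shape, different labelling
assert np.allclose(canonical(b1), canonical(b2))

# Deduplicating a sampled belief set by shape shrinks it dramatically.
rng = np.random.default_rng(0)
beliefs = rng.dirichlet(np.ones(3), size=1000)
shapes = {tuple(np.round(canonical(b), 3)) for b in beliefs}
print(len(beliefs), "sampled beliefs ->", len(shapes), "canonical shapes (rounded)")
```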

Pages 11-14: Formalities: sufficient conditions

Theorem: All permutations of the belief will have the same value if, for every state permutation π(s) and action a, there exists an action a_π and an observation permutation π_{π,a}(o) such that
  – R(s, a) = R(π(s), a_π)
  – O(o | s, a) = O(π_{π,a}(o) | π(s), a_π)
  – T(s' | s, a) = T(π(s') | π(s), a_π)

The proof follows from substituting the sufficient conditions into the Bellman optimality equations for POMDPs.

In words: for any action, no matter how we swap around the states, we can find an action and a way to swap around the observations so that the model parameters remain the same. A small numerical check of these conditions is sketched below.
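The conditions can be checked numerically for a candidate triple (π, a_π, π_{π,a}). The sketch below is my own illustration (the function name `satisfies_conditions` and the tiny two-state example are not from the paper), assuming the model is stored as arrays T[a][s, s'], O[a][s, o], and R[a][s].

```python
import numpy as np

def satisfies_conditions(T, O, R, a, a_pi, state_perm, obs_perm, tol=1e-9):
    """Check the three sufficient conditions for one (a, a_pi, pi, pi_{pi,a}) choice:
       R(s,a) = R(pi(s), a_pi)
       O(o|s,a) = O(pi_{pi,a}(o) | pi(s), a_pi)
       T(s'|s,a) = T(pi(s') | pi(s), a_pi)
    state_perm[s] = pi(s) and obs_perm[o] = pi_{pi,a}(o)."""
    sp = np.asarray(state_perm)
    op = np.asarray(obs_perm)
    reward_ok = np.allclose(R[a], R[a_pi][sp], atol=tol)
    obs_ok = np.allclose(O[a], O[a_pi][np.ix_(sp, op)], atol=tol)
    trans_ok = np.allclose(T[a], T[a_pi][np.ix_(sp, sp)], atol=tol)
    return reward_ok and obs_ok and trans_ok

# Tiny 2-state example: a "confirm-0"/"confirm-1" pair that satisfies the conditions
# under the state swap pi = (1, 0) with the yes/no observations left unpermuted.
T = {"confirm-0": np.eye(2), "confirm-1": np.eye(2)}
O = {"confirm-0": np.array([[0.9, 0.1], [0.1, 0.9]]),   # rows: state, cols: yes/no
     "confirm-1": np.array([[0.1, 0.9], [0.9, 0.1]])}
R = {"confirm-0": np.array([-1.0, -1.0]), "confirm-1": np.array([-1.0, -1.0])}
print(satisfies_conditions(T, O, R, "confirm-0", "confirm-1",
                           state_perm=[1, 0], obs_perm=[0, 1]))   # True
```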

Page 15: Formalities: sufficient conditions

Example: given
● a state permutation π with π(Rome) = London
● the action “Confirm Rome”

Let a_π = “Confirm London” and let the observation permutation be the identity, π_{π,a}(o) = o. Then
  – R(Rome, “Confirm Rome”) = R(π(Rome), “Confirm London”)
  – O(Yes | Rome, “Confirm Rome”) = O(Yes | London, “Confirm London”)
  – T(Rome | Rome, “Confirm Rome”) = T(London | London, “Confirm London”)

[Figure: belief bar charts over Rome, London, and Paris before and after applying the permutation (state probability on the y-axis).]

Page 16: Practicalities: how do we use this?

We'll illustrate the approach for a simple point-based POMDP solution technique, but the idea (which exponentially reduces the size of the belief space) can be applied to more sophisticated POMDP solvers.

Page 17: Practicalities: how do we use this?

Generic point-based POMDP solution scheme (a code sketch follows below):
● Sample a set of beliefs B.
● Loop:
  – Compute the action-observation value vectors:
    Γ_{a,o} = { α_{a,o} : α_{a,o}(s) = γ Σ_{s'∈S} T(s'|s,a) O(o|s',a) α'(s'), for all α' ∈ Γ_n }
  – Compute the action value vectors:
    Γ_a^b = R(·,a) + Σ_{o∈O} argmax_{α ∈ Γ_{a,o}} (α · b)
  – Compute the new value function:
    Γ_{n+1} = { argmax_{Γ_a^b, ∀a} (Γ_a^b · b) : b ∈ B }
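For concreteness, here is a compact numpy sketch of one iteration of this generic backup. The variable names and array layouts are my own conventions; the slides do not include code.

```python
import numpy as np

def point_based_backup(Gamma_n, beliefs, T, O, R, gamma=0.95):
    """One sweep of a generic point-based value-iteration backup.

    Gamma_n : list of alpha vectors (np.ndarray of shape (S,))
    beliefs : iterable of belief vectors of shape (S,)
    T[a]    : (S, S) matrix with T[a][s, s'] = Pr(s' | s, a)
    O[a]    : (S, O) matrix with O[a][s', o] = Pr(o | s', a)
    R[a]    : (S,)  reward vector R(s, a)
    """
    actions = list(T.keys())
    n_obs = next(iter(O.values())).shape[1]

    # Gamma_{a,o}: one projected vector per (a, o, alpha' in Gamma_n).
    Gamma_ao = {
        (a, o): [gamma * T[a] @ (O[a][:, o] * alpha) for alpha in Gamma_n]
        for a in actions for o in range(n_obs)
    }

    Gamma_next = []
    for b in beliefs:
        candidates = []
        for a in actions:
            # Gamma_a^b = R(., a) + sum_o argmax_{alpha in Gamma_{a,o}} (alpha . b)
            vec = R[a] + sum(max(Gamma_ao[a, o], key=lambda al: al @ b)
                             for o in range(n_obs))
            candidates.append(vec)
        # Keep, for this belief, the action vector with the highest value at b.
        Gamma_next.append(max(candidates, key=lambda v: v @ b))
    return Gamma_next
```

Iterating this backup, and reading off the value of a belief b as the maximum of α · b over the resulting vectors, is the generic scheme the slide describes.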

Pages 18-24: Practicalities: how do we use this?

Adaptation to the permutable POMDP (a code sketch follows after the figure note below):
● Sample a set of beliefs.
● Sort the beliefs in descending order and remove nearby points.
● Loop:
  – Compute the action-observation value vectors:
    Γ_{a,o} = { α_{a,o} : α_{a,o}(s) = γ Σ_{s'∈S} T(s'|s,a) O(o|s',a) α'(s'), for all α' ∈ Γ_n }
  – Sort the Γ_{a,o} in descending order.
  – Compute the action value vectors:
    Γ_a^b = R(·,a) + Σ_{o∈O} argmax_{α ∈ Γ_{a,o}} (α · b)
  – Compute the new value function:
    Γ_{n+1} = { argmax_{Γ_a^b, ∀a} (Γ_a^b · b) : b ∈ B }

[Figure, built up across pages 19-24: value of belief over the belief simplex for states 1 and 2.]
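Here is a sketch of the permutable-POMDP adaptation of the same backup, following the bullets above: beliefs are sorted into canonical form and pruned, and each projected Γ_{a,o} vector is also sorted in descending order before the argmax. The helper names and the pruning threshold are mine, not the authors'.

```python
import numpy as np

def canonical(v):
    """Sort a vector's entries in descending order (its permutation-invariant 'shape')."""
    return np.sort(np.asarray(v, dtype=float))[::-1]

def canonical_belief_set(beliefs, min_dist=1e-2):
    """Sort beliefs in descending order and drop points that are nearly identical."""
    kept = []
    for b in beliefs:
        cb = canonical(b)
        if all(np.linalg.norm(cb - k, 1) > min_dist for k in kept):
            kept.append(cb)
    return kept

def permutable_backup(Gamma_n, beliefs, T, O, R, gamma=0.95):
    """Point-based backup over canonical (sorted) beliefs with sorted Gamma_{a,o} vectors."""
    actions = list(T.keys())
    n_obs = next(iter(O.values())).shape[1]
    beliefs = canonical_belief_set(beliefs)

    # Project the alpha vectors, then sort each projected vector in descending order too.
    Gamma_ao = {
        (a, o): [canonical(gamma * T[a] @ (O[a][:, o] * alpha)) for alpha in Gamma_n]
        for a in actions for o in range(n_obs)
    }

    Gamma_next = []
    for b in beliefs:
        candidates = []
        for a in actions:
            vec = R[a] + sum(max(Gamma_ao[a, o], key=lambda al: al @ b)
                             for o in range(n_obs))
            candidates.append(vec)
        Gamma_next.append(max(candidates, key=lambda v: v @ b))
    return Gamma_next
```

Because every stored belief and every stored vector is kept in canonical (sorted) form, the backup only ever works over the sorted region of the simplex, which is where the exponential savings come from.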

Page 25: Results

● Scenario:
  – Three types of actions: “ask,” “confirm,” “submit”
  – Unique observation associated with each state
  – Pr[ hear correct state ] from “ask” = 0.5
  – Pr[ hear correct confirmation ] from “confirm” = 0.8
  – Varied the number of states.
● Two types of tests:
  – let the belief set size grow with the number of states (more fair to the generic algorithm)
  – fix the belief set size (more reasonable if time or memory is limited)

Page 26:

Growing belief set size

Page 27:

Fixed belief set size

Page 28: Conclusion

● A useful trick for solving POMDPs with a particular structure
  – can also be used if a substructure is permutable.
  – useful in learning situations in which a policy must be evaluated whenever parameters change.
● Future work:
  – determine a more general set of necessary conditions.
  – are there approximate ways to apply this trick to POMDPs with not-quite-permutable structure?

Page 29:

Thank-you

Page 30: Formalities: sufficient conditions

Example: given a state permutation π(s) and the action “Give location”,
let a_π = “Give location” and π_{π,a}(o_s) = o_{π(s)}. Then
  – R(Rome, “Give location”) = R(π(Rome), “Give location”)
  – O(o_Rome | Rome, “Give location”) = O(o_London | London, “Give location”)
  – T(Rome | Rome, “Give location”) = T(London | London, “Give location”)

[Figure: the observation distributions Pr(o | Rome, “give location”) and Pr(o | London, “give location”) over the utterances Rome, London, and Paris.]

Page 31:

Application References

● Williams and Young, “Scaling up POMDPs for dialog management: the Summary POMDP method,” IEEE ASRU Workshop, 2005.

● Roy, Pineau, and Thrun, “Spoken dialog management using probabilistic reasoning,” Proceedings of the 38th Meeting of the ACL, 2000.

● Regan, Cohen, and Poupart, “The advisor POMDP: a principled approach to trust through reputation in electronic markets,” Conference on Privacy, Security, and Trust, 2005.

● Doshi and Roy, “Efficient model learning for dialog management,” Human-Robot Interaction, 2007.

● Boutilier, “A POMDP formulation of preference elicitation problems,” Proceedings of the 18th National Conference on Artificial Intelligence, 2002.

Page 32:

Time Spent on Backups

Page 33:

Alpha Set Sizes

Page 34: Interaction Model

[Figure: the agent-world interaction loop. The world has a state; its dynamics generate observations that reach the agent through its sensors. The agent maintains a model of the world and a belief about the world, performs state estimation from observations, and uses planning to choose actions that it sends back to the world.]

Page 35:

Solving a POMDP

Page 36: Dialog Model: Solving the POMDP

We think of the previous recursions as building a policy tree...

[Figure: expected reward (y-axis, from -500 to +100) of the one-step policy tree “Action: Go to elevator” as a function of the belief, from certainty that we are in state one (the user wants to go to the elevator) to certainty that we are in state two (the user wants to go to the cafeteria).]

Page 37: Dialog Model: Solving the POMDP

We think of the previous recursions as building a policy tree...

[Figure: expected reward (y-axis, from -500 to +100) of the one-step policy tree “Action: Ask for location” as a function of the belief, between state one (the user wants to go to the elevator) and state two (the user wants to go to the cafeteria).]

Page 38: Dialog Model: Solving the POMDP

We think of the previous recursions as building a policy tree; planning ahead increases our expected reward.

[Figure: a two-step policy tree with root action “Ask for location”; on observing “elevator” the agent takes “Go to elevator,” and on observing “cafeteria” it takes “Go to cafeteria.” Alongside, the expected reward (from -500 to +100) of this tree as a function of the belief over state one (user wants the elevator) and state two (user wants the cafeteria).]

Page 39: Dialog Model: Solving the POMDP

Given multiple trees, we can determine the most appropriate action for each belief (a small worked example follows below):

[Figure: expected reward (from -500 to +100) of several policy trees as a function of the belief over state one (user wants the elevator) and state two (user wants the cafeteria): the one-step trees “Go to elevator” and “Go to cafeteria,” and the two-step tree that asks for the location and then goes to the observed destination.]
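A small worked example of comparing trees by expected reward. The +100/-500 rewards come from the slides; the question cost, hearing accuracy, and discount factor are assumptions made up for illustration.

```python
import numpy as np

# States: 0 = user wants the elevator, 1 = user wants the cafeteria.
go_elevator = np.array([100.0, -500.0])    # alpha vector of the 1-step tree "Go to elevator"
go_cafeteria = np.array([-500.0, 100.0])   # alpha vector of the 1-step tree "Go to cafeteria"

# Assumed model for "Ask for location": small cost, user heard correctly 80% of the time.
ask_cost, p_hear, gamma = -5.0, 0.8, 0.95
# Rows: state; columns: Pr(observe "elevator" | state), Pr(observe "cafeteria" | state).
O_ask = np.array([[p_hear, 1 - p_hear],
                  [1 - p_hear, p_hear]])

# Two-step tree: ask, then go to whichever destination we heard.
# Since asking does not change the state:
#   alpha_tree(s) = R(s, ask) + gamma * sum_o Pr(o | s, ask) * alpha_subtree(o)(s)
ask_then_go = ask_cost + gamma * (O_ask[:, 0] * go_elevator + O_ask[:, 1] * go_cafeteria)

for b0 in (1.0, 0.8, 0.5):                 # belief that the user wants the elevator
    b = np.array([b0, 1.0 - b0])
    values = {"go elevator": go_elevator @ b,
              "go cafeteria": go_cafeteria @ b,
              "ask, then go": ask_then_go @ b}
    best = max(values, key=values.get)
    print(f"b(elevator)={b0:.1f}: best tree = {best}  {values}")
```

With these assumed numbers, the agent goes straight to the elevator when it is confident, but prefers the two-step “ask, then go” tree when the belief is uncertain, which is exactly the planning-ahead benefit the slides illustrate.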

Page 40: Practicalities: how do we use this?

● We'll illustrate the approach for a simple point-based POMDP solution technique, but the idea (which effectively reduces the size of the belief space) can be applied to more sophisticated POMDP solvers.
● Begin with some notation:
  – Γ: value function; dot(Γ, b) is the value of being in belief b.
  – Γ_a^b: action value function; dot(Γ_a^b, b) is the value of being in b and taking action a.
  – Γ_{a,o}: action-observation value function; dot(Γ_{a,o}, b) is the value of being in b, taking action a, and seeing observation o.

Page 41: Practicalities: how do we use this?

Adaptation to the permutable POMDP:
● Sample a set of beliefs.
● Sort the beliefs in descending order and remove nearby points (this gives a canonical set of beliefs).
● Loop:
  – Compute the action-observation value vectors:
    Γ_{a,o} = { α_{a,o} : α_{a,o}(s) = γ Σ_{s'∈S} T(s'|s,a) O(o|s',a) α'(s'), for all α' ∈ Γ_n }
  – Sort the Γ_{a,o} in descending order.
  – Compute the action value vectors:
    Γ_a^b = R(·,a) + Σ_{o∈O} argmax_{α ∈ Γ_{a,o}} (α · b)
  – Compute the new value function:
    Γ_{n+1} = { argmax_{Γ_a^b, ∀a} (Γ_a^b · b) : b ∈ B }

Page 42: Practicalities: how do we use this?

Why we sort the Γ_{a,o} vectors:
● Previous step: Γ_{a,o} = { α_{a,o} : α_{a,o}(s) = γ Σ_{s'∈S} T(s'|s,a) O(o|s',a) α'(s'), for all α' ∈ Γ_n }
● Next step: Γ_a^b = R(·,a) + Σ_{o∈O} argmax_{α ∈ Γ_{a,o}} (α · b)
● Because every belief in our set is sorted in descending order, the dot products in the next step are maximized by α-vectors sorted the same way (the rearrangement inequality), so it suffices to keep the sorted versions of the Γ_{a,o} vectors; a small numerical check follows below.

[Figure: value of belief over the belief simplex for states 1 and 2.]
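The sorting step can be sanity-checked numerically: for a belief already sorted in descending order, the dot product α · b is maximized, over all permutations of α, by the descending-sorted permutation of α. A tiny check with made-up numbers:

```python
import itertools
import numpy as np

b = np.array([0.6, 0.3, 0.1])                 # a canonical (descending-sorted) belief
alpha = np.array([-2.0, 5.0, 1.0])            # an arbitrary alpha vector

best_perm = max(itertools.permutations(alpha), key=lambda p: np.dot(p, b))
sorted_alpha = np.sort(alpha)[::-1]

print(best_perm, np.dot(best_perm, b))        # the best permutation puts 5.0 on the most likely state
print(sorted_alpha, np.dot(sorted_alpha, b))  # the descending-sorted vector achieves the same value
assert np.isclose(np.dot(best_perm, b), np.dot(sorted_alpha, b))
```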