Utilities and MDP: A Lesson in Multiagent System. Based on Jose Vidal's book, Fundamentals of Multiagent Systems. Henry Hexmoor, SIUC.


Page 1:

Utilities and MDP: A Lesson in Multiagent System

Based on Jose Vidal's book, Fundamentals of Multiagent Systems

Henry Hexmoor, SIUC

Page 2:

Utility

• Preferences are recorded as a utility function $u_i : S \rightarrow \mathbb{R}$, where S is the set of observable states in the world, $u_i$ is agent i's utility function, and $\mathbb{R}$ is the set of real numbers.

• States of the world become ordered by their utility values.
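
A minimal sketch of the idea in Python (the states and utility values here are hypothetical, not from the slides): a utility function over a finite state set can be stored as a table, and its values order the states.

```python
# Hypothetical utility function u_i : S -> R for one agent,
# stored as a dictionary from observable states to real numbers.
u_i = {"s1": 0.0, "s2": 5.0, "s3": 2.5}

# The utility values induce an ordering on the states of the world.
ordered_states = sorted(u_i, key=u_i.get, reverse=True)
print(ordered_states)  # ['s2', 's3', 's1']
```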

Page 3:

Properties of Utilities

Reflexive: $u_i(s) \geq u_i(s)$
Transitive: if $u_i(a) \geq u_i(b)$ and $u_i(b) \geq u_i(c)$, then $u_i(a) \geq u_i(c)$.
Comparable: for all $a, b$, either $u_i(a) \geq u_i(b)$ or $u_i(b) \geq u_i(a)$.

Page 4:

Selfish agents:

• A rational agent is one that wants to maximize its utility but intends no harm.

[Diagram: the set of all agents, with subsets for rational agents, selfish agents, and rational, non-selfish agents.]

Page 5:

Utility is not money:

• While utility represents an agent's preferences, it is not necessarily equated with money. In fact, the utility of money has been found to be roughly logarithmic.

Page 6:

Marginal Utility

• Marginal utility is the utility gained from the next event.

Example: the marginal utility of getting an A is smaller for an A student than for a B student.

Page 7:

Transition function

The transition function is written $T(s, a, s')$ and is defined as the probability of reaching state $s'$ from state $s$ with action $a$.
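
One way to hold T(s, a, s') in code is as nested dictionaries; the states, actions, and probabilities below are made up for illustration.

```python
# Hypothetical transition function T(s, a, s'):
# T[s][a][s2] is the probability of reaching s2 from s with action a.
T = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8},
           "a2": {"s1": 1.0}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5},
           "a2": {"s2": 1.0}},
}

# Each distribution over successor states should sum to 1.
for s, by_action in T.items():
    for a, dist in by_action.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```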

Page 8:

Expected Utility

• Expected utility is defined as the sum of the products of the probability of reaching s' from s with action a and the utility of the final state:

$E[u_i, a, s] = \sum_{s' \in S} T(s, a, s')\, u_i(s')$

where S is the set of all possible states.
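
The expected-utility formula translates into a one-line sum. The tiny T and u below are hypothetical stand-ins, shaped like the dictionaries sketched earlier, just to make the snippet runnable.

```python
# E[u, a, s] = sum over s' of T(s, a, s') * u(s')
def expected_utility(T, u, s, a):
    return sum(p * u[s2] for s2, p in T[s][a].items())

# Hypothetical two-state example.
T = {"s1": {"a1": {"s1": 0.2, "s2": 0.8}}}
u = {"s1": 0.0, "s2": 10.0}
print(expected_utility(T, u, "s1", "a1"))  # 0.2*0 + 0.8*10 = 8.0
```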

Page 9:

Value of Information

• The value of the information that the current state is t and not s:

$E[u_i, \pi^*(t), t] - E[u_i, \pi^*(s), t]$

where the first term uses the updated, new information (acting on t) and the second uses the old value (acting as if the state were still s).
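
A sketch of the same quantity in code, under the assumption that the agent picks the action with the highest expected utility for whatever state it believes it is in (the functions and dictionary shapes are the hypothetical ones used above).

```python
def expected_utility(T, u, s, a):
    return sum(p * u[s2] for s2, p in T[s][a].items())

# Value of learning that the current state is t and not s:
# utility of acting on the new information, minus the utility (in t)
# of the action chosen while still believing the state was s.
def value_of_information(T, u, actions, s, t):
    best_for_t = max(actions, key=lambda a: expected_utility(T, u, t, a))
    best_for_s = max(actions, key=lambda a: expected_utility(T, u, s, a))
    return (expected_utility(T, u, t, best_for_t)
            - expected_utility(T, u, t, best_for_s))
```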

Page 10:

Markov Decision Processes: MDP

• Graphical representation of a sample Markov decision process along with values for the transition and reward functions. We let the start state be s1.

Page 11:

Reward Function: r(s)

• The reward function is represented as $r : S \rightarrow \mathbb{R}$.

Page 12:

Deterministic vs. Nondeterministic

• Deterministic world: predictable effects. Example: only one transition has T = 1; all others are 0.

• Nondeterministic world: outcomes vary, so transition probabilities take values between 0 and 1.

Page 13:

Policy:

• A policy is the behavior of an agent: a mapping from states to actions.

• A policy is represented by $\pi$.

Page 14:

Optimal Policy

• An optimal policy is a policy that maximizes expected utility.

• The optimal policy is represented as

$\pi_i^*(s) = \arg\max_{a \in A} E[u_i, a, s]$

Page 15:

Discounted Rewards:

• Discounted rewards smoothly reduce the impact of rewards that are farther off in the future

• The discounted sum of rewards is

$r(s_0) + \gamma\, r(s_1) + \gamma^2 r(s_2) + \gamma^3 r(s_3) + \ldots$

where $\gamma$ $(0 \leq \gamma \leq 1)$ is the discount factor.

• Using the transition function, the optimal policy can also be written as

$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, u(s')$
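
A short sketch of how the discount factor shrinks later rewards (the reward sequence and γ = 0.9 are made up for the example):

```python
# Discounted return of a reward sequence r(s0), r(s1), r(s2), ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 + 0.729, about 3.44
```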

Page 16:

Bellman Equation

$u(s) = r(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, u(s')$

where $r(s)$ is the immediate reward and the $\gamma \max_a \sum_{s'} T(s, a, s')\, u(s')$ term is the future, discounted reward.
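
A single Bellman backup for one state, written against the hypothetical dictionary layout used in the earlier snippets; the solution methods on the following slides repeat exactly this update.

```python
# u(s) = r(s) + gamma * max over a of sum over s' of T(s, a, s') * u(s')
def bellman_backup(T, r, u, s, gamma):
    return r[s] + gamma * max(
        sum(p * u[s2] for s2, p in dist.items())
        for dist in T[s].values()
    )
```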

Page 17:

Brute Force Solution

• Write n Bellman equations, one for each of the n states, and solve them simultaneously.

• These equations are nonlinear because of the max operator.

Page 18:

Value Iteration Solution

• Set the values of u(s) to random numbers.
• Use the Bellman update equation:

$u_{t+1}(s) = r(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, u_t(s')$

• Converge and stop using this equation when

$\|u_{t+1} - u_t\| < \epsilon (1 - \gamma) / \gamma$

where $\|u_{t+1} - u_t\|$ is the maximum utility change in the iteration.

Page 19:

Value Iteration Algorithm

VALUE-ITERATION(T, r, γ, ε)
  do
    u ← u'; δ ← 0
    for each s ∈ S
      do u'(s) ← r(s) + γ max_a Σ_{s'} T(s, a, s') u(s')
         if |u'(s) − u(s)| > δ then δ ← |u'(s) − u(s)|
  until δ < ε(1 − γ)/γ
  return u
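
A compact Python rendering of the pseudocode above, assuming the nested-dictionary layout sketched earlier for T and r. The states, rewards, γ, and ε are hypothetical, and the stopping test uses the ε(1 − γ)/γ bound from the previous slide.

```python
def value_iteration(T, r, gamma, epsilon):
    """Return utilities u(s) for an MDP given as nested dictionaries.

    T[s][a][s2]: probability of reaching s2 from s with action a.
    r[s]: immediate reward of state s.
    """
    u_new = {s: 0.0 for s in T}            # u'(s), the updated estimates
    while True:
        u = dict(u_new)                    # u <- u'
        delta = 0.0
        for s in T:
            # Bellman update: u'(s) = r(s) + gamma * max_a sum_s' T(s,a,s') u(s')
            u_new[s] = r[s] + gamma * max(
                sum(p * u[s2] for s2, p in dist.items())
                for dist in T[s].values()
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        if delta < epsilon * (1 - gamma) / gamma:
            return u

# Hypothetical two-state MDP.
T = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8}, "a2": {"s1": 1.0}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s2": 1.0}},
}
r = {"s1": 0.0, "s2": 1.0}
print(value_iteration(T, r, gamma=0.5, epsilon=0.15))
```

Once the utilities have converged, the greedy policy is read off by choosing, in each state, the action that maximizes the same discounted expectation.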

Page 20:

The algorithm stops after t = 4, with γ = 0.5 and ε = 0.15.

Page 21:

MDP for one agent

• Multiagent case: if one agent changes while the others remain stationary, the single-agent transition function $T(s, a, s')$ can still be used.

• A better approach is a vector $\vec{a}$ of size n giving each agent's action, where n is the number of agents, so the transition function becomes $T(s, \vec{a}, s')$.

• Rewards:
  – Dole out equally among agents
  – Reward proportional to contribution
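
One way to sketch the joint-action idea is to key the transition table by a tuple with one entry per agent; the states, actions, probabilities, and reward split below are made up.

```python
# Joint transition function T(s, a-vector, s') for n = 2 agents:
# the action key is a tuple of per-agent actions.
T = {
    "s1": {("a1", "a1"): {"s1": 0.9, "s2": 0.1},
           ("a1", "a2"): {"s2": 1.0}},
}

# Two ways to split a joint reward among 3 agents.
joint_reward = 10.0
equal_share = [joint_reward / 3] * 3        # dole out equally
contributions = [0.5, 0.3, 0.2]             # hypothetical contributions
proportional = [joint_reward * c for c in contributions]
```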

Page 22:

Observation model

• With noise, the agent cannot observe the world directly, so it maintains a belief state $b = \langle P_1, P_2, P_3, \ldots, P_n \rangle$, a probability distribution over the states.

• Observation model: $O(s, o)$ = the probability of observing o while being in state s.

• After taking action a and observing o, the belief state is updated:

$b'(s') = \alpha\, O(s', o) \sum_s T(s, a, s')\, b(s)$

where $\alpha$ is a normalization constant.
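
The belief update is only a few lines over the same hypothetical dictionaries, with a made-up observation model O[s][o]; alpha is simply the constant that renormalizes the result so it sums to 1.

```python
# b'(s') = alpha * O(s', o) * sum over s of T(s, a, s') * b(s)
def update_belief(b, T, O, a, o):
    unnormalized = {
        s2: O[s2][o] * sum(T[s][a].get(s2, 0.0) * b[s] for s in b)
        for s2 in b
    }
    alpha = 1.0 / sum(unnormalized.values())   # normalization constant
    return {s2: alpha * p for s2, p in unnormalized.items()}
```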

Page 23:

Partially observable MDP

• The transition function between belief states is

$T(b, a, b') = \sum_{s'} O(s', o) \sum_s T(s, a, s')\, b(s)$ if (*) holds, and 0 otherwise,

where (*) is the condition that b' is the belief that results from b after taking action a and observing o:

$b'(s') = \alpha\, O(s', o) \sum_s T(s, a, s')\, b(s)$

• The new reward function over beliefs is

$\rho(b) = \sum_s b(s)\, r(s)$

• Solving POMDPs is hard.
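
The belief-space reward is just the expectation of the state rewards under the belief; a one-line sketch with the same hypothetical dictionary conventions:

```python
# rho(b) = sum over s of b(s) * r(s)
def belief_reward(b, r):
    return sum(p * r[s] for s, p in b.items())

print(belief_reward({"s1": 0.25, "s2": 0.75}, {"s1": 0.0, "s2": 1.0}))  # 0.75
```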