Planning Under Uncertainty
573 Core Topics
[Concept map: Agency, Problem Spaces, Search, Knowledge Representation (Logic-Based, Probabilistic), Inference, Planning, Supervised Learning, Reinforcement Learning]
Administrivia
• Reading for today's class: Ch. 17 through 17.3
• Reading for Thursday: Ch. 21, Reinforcement Learning
• Problem set: extension until Monday 11/10, 8am; one additional problem
• Tues 11/11 – no class
• Thurs 11/13 – midterm: in class, closed book, may bring 1 sheet of 8.5x11" paper
Semantics
• Syntax: a description of the legal arrangements of symbols (defines "sentences")
• Semantics: what the arrangement of symbols means in the world

[Diagram: sentences (representation) are related to models (world) by semantics; inference operates on sentences]
Propositional Logic: SEMANTICS
• "Interpretation" (or "possible world")
• Specifically, TRUTH TABLES
  Assignment of T or F to each variable
  Assignment of T or F to each connective
  Think "function mapping from P to T (or F)"

  P | Q | P∧Q
  T | T |  T
  T | F |  F
  F | T |  F
  F | F |  F
• Does PQ |= PQ ?
First Order Logic
• Syntax more complex
• Semantics more complex: specifically, the mappings are more complex (and the range of the mappings is more complex)
Models
• Depiction of one possible "real-world" model

Interpretations = Mappings: syntactic tokens → model elements
• Depiction of one possible interpretation, assuming
  Constants: Richard, John   Functions/Relations: Leg(p,l), On(x,y), King(p)
• Another interpretation, same assumptions
Satisfiability, Validity, & Entailment
• S is valid if it is true in all interpretations
• S is satisfiable if it is true in some interpretation
• S is unsatisfiable if it is false in all interpretations
• S1 entails S2 (S1 |= S2) if, for all interpretations where S1 is true, S2 is also true
Simple proof from the definition of conditional probability:

  P(H|E) = P(E|H) P(H) / P(E)

Bayes rules!
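A quick numeric check of the rule; a minimal Python sketch, with the prior and likelihood numbers invented for illustration:

```python
# Bayes rule: P(H|E) = P(E|H) * P(H) / P(E)
# Illustrative numbers only (not from the slides).
p_h = 0.01            # prior P(H)
p_e_given_h = 0.9     # likelihood P(E|H)
p_e_given_not_h = 0.2

# Total probability: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)    # ~0.043: a strong likelihood ratio barely moves a tiny prior
```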
Previously, in 573…
© Daniel S. Weld 12
Joint Distribution
• All you need: can answer any question via inference by enumeration
• But… exponential in both time & space
• Solution: exploit conditional independence

© Daniel S. Weld 13
Joint Distribution
• All you need to know: can answer any question via inference by enumeration
• But… exponential in both time & space
• Solution: exploit conditional independence

© Daniel S. Weld 14
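A minimal sketch of answering a query by enumeration over an explicit joint table; the two-variable joint and its numbers are invented for illustration, and the exponential size of such a table is exactly the problem noted above:

```python
# Explicit joint distribution over two Boolean variables (illustrative numbers).
# Keys are (Cavity, Toothache) assignments.
joint = {
    (True, True): 0.2, (True, False): 0.1,
    (False, True): 0.3, (False, False): 0.4,
}

def query(var_index, value, evidence):
    """P(X = value | evidence) by summing the matching entries of the joint."""
    num = sum(p for w, p in joint.items()
              if w[var_index] == value and all(w[i] == v for i, v in evidence.items()))
    den = sum(p for w, p in joint.items()
              if all(w[i] == v for i, v in evidence.items()))
    return num / den

# P(Cavity = true | Toothache = true)
print(query(0, True, {1: True}))   # 0.2 / 0.5 = 0.4
```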
Sample Bayes Net

[Network: Earthquake and Burglary are parents of Alarm; Alarm is parent of Nbr1Calls and Nbr2Calls; Radio depends on Earthquake]

  Pr(B=t) = 0.05, Pr(B=f) = 0.95

  Pr(A | E,B):
    e, b:   0.9  (0.1)
    e, ¬b:  0.2  (0.8)
    ¬e, b:  0.85 (0.15)
    ¬e, ¬b: 0.01 (0.99)

© Daniel S. Weld 15
Given Markov Blanket, X is Independent of All Other Nodes

  MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))

[Example nodes: Burglary, Alarm Sounded, Earthquake]
© Daniel S. Weld 16
Inference in BNs
• We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence
• The graphical independence representation yields efficient inference, organized by network shape
• Two simple exact algorithms: variable elimination (VE), junction trees
• Approximate: Markov chain sampling
© Daniel S. Weld 17
Learning
• Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
• Learning parameters for a Bayesian network
• Learning structure of Bayesian networks
• Naïve Bayes models
• Hidden variables (later)
Parameter Estimation Summary

  Estimate                       | Prior   | Hypothesis returned
  Maximum Likelihood Estimate    | Uniform | The most likely
  Maximum A Posteriori Estimate  | Any     | The most likely
  Bayesian Estimate              | Any     | Weighted combination
Parameter Estimation and Bayesian Networks

  E B R A J M
  T F T T F T
  F F F F F T
  F T F T T T
  F F F T T T
  F T F F F F
  ...

  P(A|E,B) = ?   P(A|E,¬B) = ?   P(A|¬E,B) = ?   P(A|¬E,¬B) = ?

  Prior Beta(2,3) + data = Beta(3,4)
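A minimal sketch of estimating one of the CPT entries above by counting rows of the data table (plain maximum-likelihood counting, with an optional add-one variant; the function and variable names are illustrative, not from the slides):

```python
# Rows are (E, B, R, A, J, M) taken from the table above.
data = [
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]

def p_a_given(e, b, laplace=False):
    """Estimate P(A = true | E = e, B = b) by counting matching rows."""
    matches = [row for row in data if row[0] == e and row[1] == b]
    a_true = sum(1 for row in matches if row[3])
    if laplace:                       # add-one smoothing over the two outcomes of A
        return (a_true + 1) / (len(matches) + 2)
    return a_true / len(matches) if matches else None   # undefined with no data

print(p_a_given(False, False))        # 1/2 from the five rows above
```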
Structure Learning as Search
• Local search
  1. Start with some network structure
  2. Try to make a change (add or delete or reverse an edge)
  3. See if the new network is any better
• What should the initial state be? Uniform prior over random networks? Based on prior knowledge? Empty network?
• How do we evaluate networks?
© Daniel S. Weld 20
Naïve Bayes

[Network: ClassValue is the parent of features F1, F2, F3, …, FN-1, FN]

• Assume that features are conditionally independent given the class variable
• Works well in practice
• Forces probabilities towards 0 and 1
Naïve Bayes for Text
• P(spam | w1 … wn)
• Independence assumption?

[Network: Spam? is the parent of word features: apple, dictator, Nigeria, …]
Naïve Bayes for Text
• Model a document in a given category as a generated bag of words, produced by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
• Smooth probability estimates with Laplace m-estimates, assuming a uniform prior over all words (p = 1/|V|) and m = |V|; equivalent to a virtual sample of seeing each word in each category exactly once.
Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
  Let Di be the subset of documents in D in category ci
  P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di
  Let ni be the total number of word occurrences in Ti
  For each word wj ∈ V
    Let nij be the number of occurrences of wj in Ti
    Let P(wj | ci) = (nij + 1) / (ni + |V|)
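A minimal Python sketch of the training pseudocode above; the document representation (a list of (word_list, category) pairs) and the function name are assumptions made for illustration:

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (word_list, category) pairs. Returns priors and word probabilities."""
    vocab = {w for words, _ in docs for w in words}            # V
    prior, word_prob = {}, defaultdict(dict)
    categories = {c for _, c in docs}
    for c in categories:
        docs_c = [words for words, cat in docs if cat == c]    # D_i
        prior[c] = len(docs_c) / len(docs)                     # P(c_i) = |D_i| / |D|
        counts = Counter(w for words in docs_c for w in words) # n_ij for every word
        n_c = sum(counts.values())                             # n_i
        for w in vocab:                                        # Laplace-smoothed estimate
            word_prob[c][w] = (counts[w] + 1) / (n_c + len(vocab))
    return prior, word_prob, vocab
```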
Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category:

  argmax_{ci ∈ C}  P(ci) ∏_{i=1..n} P(ai | ci)

where ai is the word occurring in the ith position in X
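A matching sketch of the test step, computed with log probabilities to avoid the underflow problem discussed a few slides below; skipping words that are not in V is an assumption of this sketch:

```python
import math

def classify(words, prior, word_prob, vocab):
    """Return argmax_c P(c) * prod_i P(a_i | c), computed with log probabilities."""
    best, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w in words:
            if w in vocab:                       # ignore words never seen in training
                score += math.log(word_prob[c][w])
        if score > best_score:
            best, best_score = c, score
    return best
```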
Naïve Bayes Time Complexity
• Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
• Test time: O(|C| Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed just to read in all the data.
Easy to Implement
• But…
• If you do… it probably won’t work…
Probabilities: Important Detail!

• P(spam | E1 … En) = ∏i P(spam | Ei)
  Any more potential problems here?

• We are multiplying lots of small numbers: danger of underflow!
  0.5^57 ≈ 7 × 10^-18

• Solution? Use logs and add!  p1 * p2 = e^(log(p1) + log(p2))
  Always keep in log form
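A tiny illustration of the underflow danger and of the log-space fix:

```python
import math

p = 0.5 ** 57                 # ~6.9e-18: still representable, but a longer
print(p)                      # product of small probabilities will hit 0.0
print(0.5 ** 2000)            # underflows to exactly 0.0 in double precision

log_p = 57 * math.log(0.5)    # keep everything as a sum of logs instead
print(math.exp(log_p))        # exponentiate only if/when the value is needed
```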
Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (i.e., the class with maximum posterior probability) are usually fairly accurate (?!?!?)
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability estimates are not accurate: output probabilities are generally very close to 0 or 1.
573 Core Topics
[Concept map: Agency, Problem Spaces, Search, Knowledge Representation (Logic-Based, Probabilistic), Inference, Planning, Supervised Learning, Reinforcement Learning]
Planning under uncertainty

[Agent–environment diagram: the agent receives percepts from the environment and emits actions; "What action next?"]

Assumptions: Static, Fully Observable, Stochastic, Instantaneous, Full, Perfect
Models of Planning

  Observation \ Uncertainty | Deterministic | Disjunctive | Probabilistic
  Complete                  | Classical     | Contingent  | MDP
  Partial                   | ???           | Contingent  | POMDP
  None                      | ???           | Conformant  | POMDP
Awkward
• Never any defn of what an MDP is!
• Missed the probabilistic STRIPS defn; should have done that before the 2-DBN
• All the assumptions about a Markov model could be simplified and eliminated – took too long!
Defn: Markov Model
• Q: set of states
• Init prob distribution
• A: transition probability distribution
E.g. Predict Web Behavior
• Q: set of states (pages)
• Init prob distribution (likelihood of site entry point)
• A: transition probability distribution (user navigation model)
• When will the visitor leave the site?
E.g. Predict Robot's Behavior
• Q: set of states
• Init prob distribution
• A: transition probability distribution
• Will it attack Dieter?
Probability Distribution, A
• Forward causality: the probability of st does not depend directly on the values of future states.
• The probability of the new state could depend on the whole history of states visited: Pr(st | st-1, st-2, …, s0)
• Markovian assumption: Pr(st | st-1, st-2, …, s0) = Pr(st | st-1)
• Stationary model assumption: Pr(st | st-1) = Pr(sk | sk-1) for all k.
Representing A
• Q: set of states (s0, s1, …, s6)
• Init prob distribution
• A: transition probabilities; how can we represent these?

[|Q| × |Q| matrix with rows and columns s0 … s6; entry pij, e.g. p12, is the probability of transitioning from si to sj]

  ∑j pij = ?  (each row must sum to 1)
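A minimal sketch of the matrix representation of A; the 3-state numbers are invented for illustration, and the assertion checks the row-sum property just noted:

```python
import numpy as np

# Transition matrix A for a 3-state Markov model (illustrative numbers).
# Entry A[i, j] = Pr(next state = s_j | current state = s_i).
A = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

assert np.allclose(A.sum(axis=1), 1.0)   # every row is a probability distribution

pi0 = np.array([1.0, 0.0, 0.0])          # initial state distribution
print(pi0 @ A @ A)                       # distribution over states after two steps
```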
Factoring Q
• Represent Q simply as a set of states (s0 … s6)?
• Is there internal structure? Consider a robot domain: what is the state space?
A Factored Domain
• Six Boolean variables: has_user_coffee (huc), has_robot_coffee (hrc), robot_is_wet (w), has_robot_umbrella (u), raining (r), robot_in_office (o)
• How many states?  2^6 = 64
Representing Compactly
• Q: set of states (s0 … s6), init prob distribution
• How can we represent this efficiently?
• With a Bayes net (of course!) over the state variables (r, u, hrc, w, …)
Representing A Compactly
• Q: set of states (s0, s1, …), init prob distribution
• A: transition probabilities

[|Q| × |Q| matrix with rows and columns s0, s1, …, s35, …; entry pij is the probability of transitioning from si to sj]

  How big is the matrix version of A?  64 × 64 = 4096
2-Dynamic Bayesian Network

[Two slices, T and T+1, each with variables huc, hrc, w, u, r, o; arcs run from slice T into slice T+1]

  Values required per T+1 variable: 8 (huc), 4, 16, 4, 2, 2
  Total values required to represent the transition probability table = 36
  vs. 4096 required for a complete state-transition probability table
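A rough sketch of where 36 vs. 4096 comes from: each T+1 variable needs one value per assignment of its parents, i.e. 2^k values for k Boolean parents. The per-variable parent counts below are an assumption chosen to reproduce the slide's sizes (8, 4, 16, 4, 2, 2), not something given explicitly on the slide:

```python
# Assumed number of parents for each T+1 variable (chosen to match 8, 4, 16, 4, 2, 2).
parents = {"huc": 3, "hrc": 2, "w": 4, "u": 2, "r": 1, "o": 1}

factored = sum(2 ** k for k in parents.values())   # 8 + 4 + 16 + 4 + 2 + 2 = 36
flat = (2 ** len(parents)) ** 2                    # full 64 x 64 matrix = 4096

print(factored, flat)                              # 36 4096
```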
Dynamic Bayesian Network

[Two slices, T and T+1, each with variables huc, hrc, w, u, r, o]

• Also known as a Factored Markov Model
• Defined formally as:
  * Set of random vars
  * BN for initial state (slice 0)
  * 2-layer DBN for transitions
Observability
• Full Observability
• Partial Observability
• No Observability
Reward/cost
• Each action has an associated cost.
• Agent may accrue rewards at different stages. A reward may depend on:
  The current state
  The (current state, action) pair
  The (current state, action, next state) triplet
• Additivity assumption: costs and rewards are additive.
• Reward accumulated = R(s0) + R(s1) + R(s2) + …
Horizon
• Finite: plan till t stages.
  Reward = R(s0) + R(s1) + R(s2) + … + R(st)
• Infinite: the agent never dies.
  The reward R(s0) + R(s1) + R(s2) + … could be unbounded.
  Discounted reward: R(s0) + γR(s1) + γ²R(s2) + …
  Average reward: lim n→∞ (1/n) Σi R(si)
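A small sketch contrasting the discounted and average criteria on a truncated reward stream; the reward sequence and γ are invented for illustration:

```python
rewards = [1.0, 0.0, 2.0, 1.0, 0.0] * 200    # illustrative stream R(s0), R(s1), ...
gamma = 0.9

discounted = sum(gamma ** i * r for i, r in enumerate(rewards))   # geometric weighting
average = sum(rewards) / len(rewards)        # finite-n approximation of the limit

print(discounted, average)
```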
Goal for an MDP
• Find a policy which maximizes expected discounted reward over an infinite horizon for a fully observable Markov decision process.
• Why shouldn't the planner find a plan?  What is a policy?
Optimal value of a state
• Define V*(s), the 'value of a state', as the maximum expected discounted reward achievable from this state.
• Value of the state if we force the agent to do action "a" right now, but let it act optimally afterwards:
  Q*(a,s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
• V* should satisfy the following equation:
  V*(s) = max_{a∈A} Q*(a,s)
        = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s') }
Value iteration
• Assign an arbitrary value to each state (or use an admissible heuristic).
• Iterate over the set of states; in each iteration improve the value function as follows:
  Vt+1(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) Vt(s') }   ('Bellman backup')
• Stop the iteration appropriately: Vt approaches V* as t increases.
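A minimal value-iteration sketch implementing the backup above for a generic MDP given as dictionaries; the input format, the ε threshold, and the zero initialization are assumptions of the sketch:

```python
def value_iteration(S, A, R, cost, P, gamma=0.9, eps=1e-4):
    """S: states, A: actions, R[s]: state reward, cost[a]: action cost (added as on
    the slide), P[(s, a)]: dict mapping s' -> Pr(s'|a,s)."""
    V = {s: 0.0 for s in S}                        # arbitrary initial values
    while True:
        V_new = {}
        for s in S:
            V_new[s] = R[s] + max(                 # Bellman backup
                cost[a] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A)
        residual = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if residual < eps:                         # epsilon-convergence stopping rule
            return V
```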
Bellman Backup

[Diagram: from state s, each action a1, a2, a3 leads to successor states with values Vn; Qn+1(s,a) is computed for each action, and Vn+1(s) is the max over the Qn+1(s,a)]
Stopping Condition
• ε-convergence: a value function is ε-optimal if the error (residual) at every state is less than ε.
  Residual(s) = |Vt+1(s) − Vt(s)|
  Stop when max_{s∈S} Residual(s) < ε
Complexity of value iteration
• One iteration takes O(|S|² |A|) time.
• Number of iterations required: poly(|S|, |A|, 1/(1−γ))
• Overall, the algorithm is polynomial in the size of the state space!
• Thus exponential in the number of state variables.
Computation of optimal policy
• Given the value function V*(s), for each state do a Bellman backup; the action which maximises the inner product term is the optimal action.
• The optimal policy is stationary (time independent): intuitive for the infinite-horizon case.
Policy evaluation
• Given a policy Π: S → A, find the value of each state under this policy.
• V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s'|Π(s),s) V^Π(s')
• This is a system of linear equations involving |S| variables.
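Because the fixed-policy equations are linear, V^Π can be obtained with one linear solve, V = (I − γ P_Π)^{-1}(R + c_Π). A sketch, with the 2-state numbers invented for illustration:

```python
import numpy as np

def policy_evaluation(R, cost_pi, P_pi, gamma=0.9):
    """R: reward vector (n,), cost_pi: cost of the chosen action in each state (n,),
    P_pi[i, j] = Pr(s_j | pi(s_i), s_i). Solves (I - gamma * P_pi) V = R + cost_pi."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R + cost_pi)

# Tiny 2-state example (illustrative numbers only).
R = np.array([0.0, 1.0])
cost_pi = np.array([-0.1, -0.1])
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
print(policy_evaluation(R, cost_pi, P_pi))
```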
Bellman's principle of optimality
• A policy Π is optimal if V^Π(s) ≥ V^Π'(s) for all policies Π' and all states s ∈ S.
• Rather than finding the optimal value function, we can try to find the optimal policy directly by doing a policy-space search.
Policy iteration
• Start with any policy (Π0).
• Iterate:
  Policy evaluation: for each state find V^Πi(s).
  Policy improvement: for each state s, find the action a* that maximises Q^Πi(a,s).
    If Q^Πi(a*,s) > V^Πi(s), let Πi+1(s) = a*; else let Πi+1(s) = Πi(s)
• Stop when Πi+1 = Πi
• Converges faster than value iteration, but the policy evaluation step is more expensive.
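A compact policy-iteration sketch in the same dictionary format as the value-iteration example, using an exact linear solve for the evaluation step; the names and input format are assumptions:

```python
import numpy as np

def policy_iteration(S, A, R, cost, P, gamma=0.9):
    """S: list of states, A: list of actions, P[(s, a)]: dict s' -> Pr(s'|a,s)."""
    idx = {s: i for i, s in enumerate(S)}
    pi = {s: A[0] for s in S}                           # start with any policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R + c_pi exactly.
        P_pi = np.array([[P[(s, pi[s])].get(s2, 0.0) for s2 in S] for s in S])
        b = np.array([R[s] + cost[pi[s]] for s in S])
        V = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, b)
        # Policy improvement: pick the greedy action wrt Q for each state.
        new_pi = {}
        for s in S:
            q = {a: R[s] + cost[a] + gamma * sum(p * V[idx[s2]]
                 for s2, p in P[(s, a)].items()) for a in A}
            new_pi[s] = max(q, key=q.get)
        if new_pi == pi:                                # stop when the policy is stable
            return pi, V
        pi = new_pi
```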
Modified Policy iteration
• Rather than evaluating the actual value of the policy by solving a system of equations, approximate it by running value iteration with the policy held fixed.
RTDP iteration
• Start with the initial belief and initialize the value of each belief to its heuristic value.
• For the current belief:
  Save the action that minimises the current state value in the current policy.
  Update the value of the belief through a Bellman backup.
• Apply the minimum action and then randomly pick an observation.
• Go to the next belief assuming that observation.
• Repeat until the goal is achieved.
Fast RTDP convergence
• What are the advantages of RTDP?
• What are the disadvantages of RTDP?
• How to speed up RTDP?
Other speedups
• Heuristics
• Aggregations
• Reachability analysis
Going beyond full observability
• In the execution phase, we are uncertain where we are, but we have some idea of where we can be.
• A belief state = ?
Models of Planning

  Observation \ Uncertainty | Deterministic | Disjunctive | Probabilistic
  Complete                  | Classical     | Contingent  | MDP
  Partial                   | ???           | Contingent  | POMDP
  None                      | ???           | Conformant  | POMDP
Speedups
• Reachability analysis
• More informed heuristics
Algorithms for search
• A*: works for sequential solutions.
• AO*: works for acyclic solutions.
• LAO*: works for cyclic solutions.
• RTDP: works for cyclic solutions.