CS-424 Gregory Dudek
Today’s Lecture
• Reinforcement learning: further thoughts.
• Planning
Transition networks
• How do we determine strategies (a policy) in a problem defined by a transition network? The network was:
– Deterministic or stochastic
– Markovian (exhibited the Markov property).
– Fully observable (RN: accessible): we can directly observe (determine) exactly what state we are in during the update process.
• Computing the optimal policy is a Markov Decision Problem (MDP).
• If we don’t know the current state for sure, but can only infer it (probabilistically), then we have a partially observable system: a Partially Observable Markov Decision Problem (POMDP).
– How hard is it to compute the optimal policy?
Specific details on reinforcement
Simplest model:
Given that we know all transition probabilities M(i,j) and the immediate (short-term) reward R(i) associated with each state i,
we can compute the value function U() by solving a linear system:
U(i) = R(i) + Σ_j M(i,j) U(j)
This approach is referred to as adaptive dynamic programming.
In contrast,
• Sampling and TD methods update this system intermittently based on partial information.
(Note we have omitted the less-effective LMS algorithm in the textbook.)
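As an illustration, once M and R are known, the linear system above can be solved directly, which is the heart of adaptive dynamic programming. A minimal sketch in Python (the specific M and R are illustrative assumptions, not values from the slides; the absorbing final state keeps I − M invertible):

```python
import numpy as np

# A minimal sketch of value determination ("adaptive dynamic programming").
# M and R are illustrative assumptions, not values from the slides.
M = np.array([[0.0, 0.9, 0.1],     # M[i, j] = P(next state is j | state i)
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])    # state 2 is absorbing (terminal)
R = np.array([-0.04, -0.04, 1.0])  # immediate reward R(i) for each state i

# U(i) = R(i) + sum_j M(i,j) U(j)  =>  (I - M) U = R
U = np.linalg.solve(np.eye(3) - M, R)
print(U)  # long-term value of each state under the fixed transition model
```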
Types of learners
• 2 classes with respect to reinforcement learning:
– Passive learners: you just update the state transition/reward info for the states you are taken to, but do not control the sequence of states visited.
• A backgammon learner that merely observes another part of the system playing. A kid watching its parents.
– Active learners: the learner actively modifies the sequence of states visited in order to (presumably) acquire information.
Exploration versus Exploitation
• Fundamental tradeoff.
• We want to maximize return:
– Should we do what we know is best, based on incomplete information?
– Or should we seek information about unknown things, although this may not lead to rewards?
• Plenty of intuitive relevance.
• How do we combine these two processes?
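One standard way to combine the two (an illustrative sketch; the slides do not commit to a particular scheme) is epsilon-greedy action selection:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Combine exploration and exploitation (one standard scheme).

    With probability epsilon, explore: pick a random action.
    Otherwise exploit: pick the action with the highest estimated value.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```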
Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.
• Have a goal state specification and an initial state.
• Use a special-purpose planner to search for a solution.
Basic formalism
• Basic logical formalism derived from STRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions
– Shoe_untied()
– Door_open(MC)
• An operator (remember those?) is now a triple:
– Preconditions
– Additions
– Deletions
The Additions and Deletions together are called the effects of an operator.
(Operators were seen earlier in the context of search.)
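In code, such an operator might be represented as follows (a minimal sketch; representing conditions as frozensets of strings is an assumption for illustration, not the slides' notation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator: the (Preconditions, Additions, Deletions) triple."""
    name: str
    preconditions: frozenset  # conditions that must hold to apply the operator
    additions: frozenset      # conditions made true by the operator
    deletions: frozenset      # conditions made false by the operator

tie_shoes = Operator(
    name="Tie_shoes",
    preconditions=frozenset({"Shoe_untied()"}),
    additions=frozenset({"Shoe_tied()"}),
    deletions=frozenset({"Shoe_untied()"}),
)
```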
A plan is
• 4 components:
– A set of steps defined by a sequence of operator applications
– A set of constraints on the ordering of these steps. (Not necessarily a total ordering.)
– A set of variable binding constraints: set of things various operators can apply to.
– Set of causal links that specify what effects one action achieves that are needed by another.
Going forwards
• All state variables are true or false, but some may not be defined at a certain point in our state progression.
A planner based on this is a progression planner.
Idea: In a state S, we can apply an operator X = (P, A, D), leading to a new state T:
T = f_X(S) = (S − D) ∪ A
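Using the Operator sketch from above, progression is a one-line set operation (an illustration, not the slides' code):

```python
def progress(state, op):
    """Progression: apply operator op = (P, A, D) to state S.

    Returns T = f_X(S) = (S - D) | A, provided the preconditions hold.
    """
    assert op.preconditions <= state, "preconditions not satisfied"
    return (state - op.deletions) | op.additions

initial = frozenset({"Shoe_untied()"})
print(progress(initial, tie_shoes))  # frozenset({'Shoe_tied()'})
```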
Constancy
• Important caveat.
• When we go from one state to another, we assume that the only changes were those that resulted explicitly from the Additions and Deletions.
Given this assumption, the operator X computes the strongest provable postconditions.
In reality, even more might be deleted.
Aside: FOL with time
• One approach is a variation of first-order logic called situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply for certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– Reference: read RN Sec 7.6 or DAA Ch. 6.
Going backwards
• Remember backwards chaining?
• Start at the goal G.
• Assume the deletions aren’t there for some operator X.
– Why?
• Can chain backwards by adding what would have been deleted and removing what would have been added:
S = f_X⁻¹(G) = (G − A) ∪ D
Maybe we added too much (with D), or deleted too little?
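The inverse operation, again using the Operator sketch above (illustrative):

```python
def regress(goal, op):
    """Regression through operator op = (P, A, D), per the slide's formula:

        S = f_X^{-1}(G) = (G - A) | D

    (A full regression planner would also require op.preconditions to
    hold in S; that detail is outside the slide's formula.)
    """
    return (goal - op.additions) | op.deletions
```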
Means/ends analysis
• How can we get from the initial state to the final state?
– Assume the states and operators are given.
– What’s the right path? How do we measure distance?
• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
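Continuing the sketches above, a greedy one-step version of this difference-reduction idea might look like the following (illustrative only; real means/ends analysis recurses on the remaining differences):

```python
def difference(state, goal):
    """Means/ends distance: the number of goal conditions not yet satisfied."""
    return len(goal - state)

def choose_operator(state, goal, operators):
    """Pick an applicable operator that most reduces the difference to the goal."""
    applicable = [op for op in operators if op.preconditions <= state]
    return min(applicable,
               key=lambda op: difference(progress(state, op), goal),
               default=None)
```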
STRIPS
• STRIPS is an old planning language
• STanford Research Institute Problem Solver.
– Less expressive than situation calculus
– Initial state:
At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)
Schemas
• Basic operators assume a complete specification of the state in which they are applied.
• This can be tedious.
– An operator schema is a “generic” operator that has variables in it
• Related to axiom schemas
• Related to unification in logic (e.g. Prolog)
E.g.
Tie_shoes(h), Tie_necktie(h), Tie_boat_rope(h), Tie_straightjacket(h)
might all be abstracted by Tie_object(X,h)
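A schema can be sketched as a function that fills in the variables to produce a concrete operator; the condition names here are invented for illustration:

```python
def tie_object_schema(x, h):
    """Operator schema Tie_object(X, h): a generic operator with variables.

    Instantiating X yields specific operators like those on the slide
    (Tie_shoes, Tie_necktie, ...).  The condition strings are assumptions.
    """
    return Operator(
        name=f"Tie_{x}({h})",
        preconditions=frozenset({f"Untied({x},{h})"}),
        additions=frozenset({f"Tied({x},{h})"}),
        deletions=frozenset({f"Untied({x},{h})"}),
    )

tie_shoes_op = tie_object_schema("shoes", "h1")  # one instantiation
```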
Least Commitment Planning
• When we formulate a plan intuitively, we often think of doing things in a specific sequence, even when the sequencing is arbitrary.
– This may not be wise.
• This can lead to re-shuffling actions... which is undesirable.
Generate plans such that we have sets of applicable actions, but we don’t order the actions unless there is something (conditions) that demands it.
Partially ordered plan
[Diagram: a partially ordered plan over steps A through G; arrows impose ordering constraints between some steps, while other pairs of steps remain unordered.]
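Such a plan can be represented as a set of steps plus a set of ordering constraints; any topological order of the steps is a valid linearization. A sketch (the specific steps and constraints are made up to echo the diagram's labels):

```python
from graphlib import TopologicalSorter

# Steps plus ordering constraints (before, after); pairs not related by
# any chain of constraints may be executed in either order.
steps = {"A", "B", "C", "D", "E", "F", "G"}
orderings = {("A", "B"), ("A", "D"), ("B", "E"), ("D", "E"),
             ("C", "F"), ("E", "G"), ("F", "G")}

# Build a predecessor map and emit one valid total ordering (linearization).
graph = {s: set() for s in steps}
for before, after in orderings:
    graph[after].add(before)  # TopologicalSorter expects predecessors
print(list(TopologicalSorter(graph).static_order()))
```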
Terminology
• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.
For a plan:
• Sound
– Plan steps obey constraints on sequencing
– Successful
• Systematic
– Doesn’t “waste” effort
• Complete
– Generates a plan if one exists.
– Still may not terminate (cf. the halting problem)
• Plan refinement
– Improvement of an existing plan to make it better meet the constraints
Links & Conflicts
[Diagram: causal links run from a producer step to a consumer step; a clobberer step threatens one of the links.]
A conflict involves a link, and a step (the clobberer) that messes it up.
Refinement
Fix conflicts by creating a new plan from an old one.
– Keep the old structures (links, producers, consumers, constraints) but add new constraints.
• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it’s hitting.
– (if you can).
• If there are no conflicts, satisfy an unfulfilled requirement.
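In code, "move a clobberer before or after the link" amounts to proposing one extra ordering constraint per option. A hedged sketch (the link representation is an assumption; consistency checking of the resulting orderings is omitted):

```python
def resolve_conflict(orderings, link, clobberer):
    """Yield candidate refinements of a plan's ordering constraints.

    A causal link (producer, consumer, condition) is threatened by a
    clobberer that deletes the condition.  We try demotion (clobberer
    before the producer) and promotion (clobberer after the consumer),
    keeping all old constraints and adding one new one each time.
    """
    producer, consumer, condition = link
    for constraint in ((clobberer, producer), (consumer, clobberer)):
        yield orderings | {constraint}
```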
Applications of planning• Planning for Shakey the robot
– Climb boxes– Push things– Move around
• Blocks world– Moving blocks– Piling them onto one another– Clearing the tops of chosen blocks
• Really doing this suggested we need vision!
Configuration Space Planning
Issues