CS-424 Gregory Dudek
Today’s Lecture
• Reinforcement learning: further thoughts.
• Planning
Transition networks
• How do we determine strategies (a policy) in a problem defined by a transition network? The network was:
– Deterministic or stochastic
– Markovian (exhibited the Markov property).
– Fully observable (RN: accessible): we can directly observe (determine) exactly what state we are in during the update process.
• Computing the optimal policy is a Markov Decision Problem (MDP).
• If we don’t know the current state for sure, but can only infer it (probabilistically), then we have a partially observable system: a Partially Observable Markov Decision Problem (POMDP).
– How hard is it to compute the optimal policy?
Specific details on reinforcement
Simplest model:
Given that we know all transition probabilities M(i,j) and the immediate (short-term) reward R(i) associated with each state i,
we can compute the value function U() by solving a linear system:
U(i) = R(i) + Σ_j M(i,j) U(j)
This approach is referred to as adaptive dynamic programming.
In contrast,
• Sampling and TD methods update this system intermittently based on partial information.
(Note we have omitted the less-effective LMS algorithm in the textbook.)
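As an illustration, once M and R are known, the linear system above can be solved directly, which is the heart of adaptive dynamic programming. A minimal sketch in Python (the specific M and R are illustrative assumptions, not values from the slides; the absorbing final state keeps I − M invertible):

```python
import numpy as np

# A minimal sketch of value determination ("adaptive dynamic programming").
# M and R are illustrative assumptions, not values from the slides.
M = np.array([[0.0, 0.9, 0.1],     # M[i, j] = P(next state is j | state i)
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])    # state 2 is absorbing (terminal)
R = np.array([-0.04, -0.04, 1.0])  # immediate reward R(i) for each state i

# U(i) = R(i) + sum_j M(i,j) U(j)  =>  (I - M) U = R
U = np.linalg.solve(np.eye(3) - M, R)
print(U)  # long-term value of each state under the fixed transition model
```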
Types of learners
• 2 classes with respect to reinforcement learning:
– Passive learners: you just update the state transition/reward info for the states you are taken to, but do not control the sequence of states visited.
• A backgammon learner that merely observes another part of the system playing. A kid watching its parents.
– Active learners: the learner actively modifies the sequence of states visited in order to (presumably) acquire information.
Exploration versus Exploitation
• Fundamental tradeoff.
• We want to maximize return:
– Should we do what we know is best, based on incomplete information?
– Or should we seek information about unknown things, although this may not lead to rewards?
• Plenty of intuitive relevance.
• How do we combine these two processes?
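One standard way to combine the two (an illustrative sketch; the slides do not commit to a particular scheme) is epsilon-greedy action selection:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Combine exploration and exploitation (one standard scheme).

    With probability epsilon, explore: pick a random action.
    Otherwise exploit: pick the action with the highest estimated value.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```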
Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.
• Have a goal state specification and an initial state.
• Use a special-purpose planner to search for a solution.
Basic formalism
• Basic logical formalism derived from STRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions
– Shoe_untied()
– Door_open(MC)
• An operator (remember those?) is now a triple:
– Preconditions
– Additions
– Deletions
The Additions and Deletions together are called the effects of an operator.
(Operators were seen earlier in the context of search.)
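In code, such an operator might be represented as follows (a minimal sketch; representing conditions as frozensets of strings is an assumption for illustration, not the slides' notation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator: the (Preconditions, Additions, Deletions) triple."""
    name: str
    preconditions: frozenset  # conditions that must hold to apply the operator
    additions: frozenset      # conditions made true by the operator
    deletions: frozenset      # conditions made false by the operator

tie_shoes = Operator(
    name="Tie_shoes",
    preconditions=frozenset({"Shoe_untied()"}),
    additions=frozenset({"Shoe_tied()"}),
    deletions=frozenset({"Shoe_untied()"}),
)
```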
A plan is
• 4 components:
– A set of steps defined by a sequence of operator applications
– A set of constraints on the ordering of these steps. (Not necessarily a total ordering.)
– A set of variable binding constraints: set of things various operators can apply to.
– Set of causal links that specify what effects one action achieves that are needed by another.
Going forwards
• All state variables are true or false, but some may not be defined at a certain point in our state progression.
A planner based on this is a progression planner.
Idea: In a state S, we can apply an operator X = (P, A, D), leading to a new state T:
T = f_X(S) = (S − D) ∪ A
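Using the Operator sketch from above, progression is a one-line set operation (an illustration, not the slides' code):

```python
def progress(state, op):
    """Progression: apply operator op = (P, A, D) to state S.

    Returns T = f_X(S) = (S - D) | A, provided the preconditions hold.
    """
    assert op.preconditions <= state, "preconditions not satisfied"
    return (state - op.deletions) | op.additions

initial = frozenset({"Shoe_untied()"})
print(progress(initial, tie_shoes))  # frozenset({'Shoe_tied()'})
```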
Constancy
• Important caveat.
• When we go from one state to another, we assume that the only changes were those that resulted explicitly from the Additions and Deletions.
Given this assumption, the operator X computes the strongest provable postconditions.
In reality, even more might be deleted.
Aside: FOL with time
• One approach is a variation of first-order logic called situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply for certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– Reference: read RN Sec 7.6 or DAA Ch. 6.
Going backwards
• Remember backwards chaining?
• Start at the goal G.
• Assume the deletions aren’t there for some operator X.
– Why?
• Can chain backwards by adding what would have been deleted and removing what would have been added:
S = f_X⁻¹(G) = (G − A) ∪ D
Maybe we added too much (with D), or deleted too little?
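The inverse operation, again using the Operator sketch above (illustrative):

```python
def regress(goal, op):
    """Regression through operator op = (P, A, D), per the slide's formula:

        S = f_X^{-1}(G) = (G - A) | D

    (A full regression planner would also require op.preconditions to
    hold in S; that detail is outside the slide's formula.)
    """
    return (goal - op.additions) | op.deletions
```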
Means/ends analysis
• How can we get from the initial state to the final state?
– Assume the states and operators are given.
– What’s the right path? How do we measure distance?
• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
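Continuing the sketches above, a greedy one-step version of this difference-reduction idea might look like the following (illustrative only; real means/ends analysis recurses on the remaining differences):

```python
def difference(state, goal):
    """Means/ends distance: the number of goal conditions not yet satisfied."""
    return len(goal - state)

def choose_operator(state, goal, operators):
    """Pick an applicable operator that most reduces the difference to the goal."""
    applicable = [op for op in operators if op.preconditions <= state]
    return min(applicable,
               key=lambda op: difference(progress(state, op), goal),
               default=None)
```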
STRIPS
• STRIPS is an old planning language
• STanford Research Institute Problem Solver.
– Less expressive than situation calculus
– Initial state:
At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)
Schemas
• Basic operators assume a complete specification of the state in which they are applied.
• This can be tedious.
– An operator schema is a “generic” operator that has variables in it
• Related to axiom schemas
• Related to unification in logic (e.g. Prolog)
E.g.
Tie_shoes(h), Tie_necktie(h), Tie_boat_rope(h), Tie_straightjacket(h)
might all be abstracted by Tie_object(X,h)
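A schema can be sketched as a function that fills in the variables to produce a concrete operator; the condition names here are invented for illustration:

```python
def tie_object_schema(x, h):
    """Operator schema Tie_object(X, h): a generic operator with variables.

    Instantiating X yields specific operators like those on the slide
    (Tie_shoes, Tie_necktie, ...).  The condition strings are assumptions.
    """
    return Operator(
        name=f"Tie_{x}({h})",
        preconditions=frozenset({f"Untied({x},{h})"}),
        additions=frozenset({f"Tied({x},{h})"}),
        deletions=frozenset({f"Untied({x},{h})"}),
    )

tie_shoes_op = tie_object_schema("shoes", "h1")  # one instantiation
```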
Least Commitment Planning
• When we formulate a plan intuitively, we often think of doing things in a specific sequence, even when the sequencing is arbitrary.
– This may not be wise.
• This can lead to re-shuffling actions... which is undesirable.
Generate plans such that we have sets of applicable actions, but we don’t order the actions unless there is something (conditions) that demands it.
Partially ordered plan
[Diagram: a partially ordered plan over steps A through G; arrows impose ordering constraints between some steps, while other pairs of steps remain unordered.]
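Such a plan can be represented as a set of steps plus a set of ordering constraints; any topological order of the steps is a valid linearization. A sketch (the specific steps and constraints are made up to echo the diagram's labels):

```python
from graphlib import TopologicalSorter

# Steps plus ordering constraints (before, after); pairs not related by
# any chain of constraints may be executed in either order.
steps = {"A", "B", "C", "D", "E", "F", "G"}
orderings = {("A", "B"), ("A", "D"), ("B", "E"), ("D", "E"),
             ("C", "F"), ("E", "G"), ("F", "G")}

# Build a predecessor map and emit one valid total ordering (linearization).
graph = {s: set() for s in steps}
for before, after in orderings:
    graph[after].add(before)  # TopologicalSorter expects predecessors
print(list(TopologicalSorter(graph).static_order()))
```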
Terminology
• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.
For a plan:
• Sound
– Plan steps obey constraints on sequencing
– Successful
• Systematic
– Doesn’t “waste” effort
• Complete
– Generates a plan if one exists.
– Still may not terminate (cf. the halting problem)
• Plan refinement
– Improvement of an existing plan to make it better meet the constraints
Links & Conflicts
[Diagram: causal links run from a producer step to a consumer step; a clobberer step threatens one of the links.]
A conflict involves a link, and a step (the clobberer) that messes it up.
Refinement
Fix conflicts by creating a new plan from an old one.
– Keep the old structures (links, producers, consumers, constraints) but add new constraints.
• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it’s hitting.
– (if you can).
• If there are no conflicts, satisfy an unfulfilled requirement.
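In code, "move a clobberer before or after the link" amounts to proposing one extra ordering constraint per option. A hedged sketch (the link representation is an assumption; consistency checking of the resulting orderings is omitted):

```python
def resolve_conflict(orderings, link, clobberer):
    """Yield candidate refinements of a plan's ordering constraints.

    A causal link (producer, consumer, condition) is threatened by a
    clobberer that deletes the condition.  We try demotion (clobberer
    before the producer) and promotion (clobberer after the consumer),
    keeping all old constraints and adding one new one each time.
    """
    producer, consumer, condition = link
    for constraint in ((clobberer, producer), (consumer, clobberer)):
        yield orderings | {constraint}
```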
Applications of planning• Planning for Shakey the robot
– Climb boxes– Push things– Move around
• Blocks world– Moving blocks– Piling them onto one another– Clearing the tops of chosen blocks
• Really doing this suggested we need vision!
Configuration Space Planning
Issues