-
Reinforcement Learning with Selective Perception and Hidden State
By Andrew K. McCallum
Presented by Jerod Weinman
University of Massachusetts-Amherst
Reinforcement Learning with Selective Perception and Hidden State – p.1/31
-
Outline
Motivations
Utile Distinctions
U-TREE algorithm
Driving Experiment
Conclusions
Extensions
-
The Set Up...
Agents have opposite, yet intertwined, problems regarding internal state space:
Too many distinctions → Selective Perception
Too few distinctions → Short Term Memory
Most RL algorithms depend on knowledge engineers to design the state space.
-
Opposite and Related?
Selective perception creates hidden state on purpose.
Short term memory, which alleviates hidden state, allows agents selective perception.
The “black magic” of RL applications has been engineering state distinctions.
-
Motivating Statements
Learning closed-loop behaviors is useful.
Selective perception provides an efficient interface.
Environment interfaces suffer from hidden state; selective perception can make it worse.
Non-Markov hidden state problems can be solved with memory.
-
Motivating Statements
Learning selective perception and using memory is difficult.
Experience is expensive.
Agents must handle noisy perceptions and actions.
Final performance is to be balanced against training time.
-
Utile Distinctions
State-space should be dependent on the task at hand.
Learning should be proportional to task difficulty, not world complexity.
Perceptual aliasing
Agents should only make distinctions needed to predict future reward.
-
Distinctions for Learning
Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy.
Describe an environment and task for which an optimal policy may be calculated.
Find a minimum set of state distinctions adequate for representing that policy.
Recalculate a policy in the reduced internal state space, and the result is a non-optimal policy.
-
Utile Distinctions
Why? If states s1 and s2 are aliased, then the path through s1 may have slightly lower reward than through s2, i.e. r(s1) < r(s2), yet the utility of s1 may be higher than that of s2, i.e. U(s1) > U(s2).

r(s1) < r(s2)  but  U(s1) > U(s2)

Optimal policy calculations don’t care which aliased state the agent goes through, so they choose the lower cost path, which is the state with lower utility.
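A tiny numeric illustration of this point, with hypothetical values and an assumed discount factor γ = 0.9: the state on the lower-reward path can still have the higher utility once discounted future value is counted.

```python
gamma = 0.9  # assumed discount factor

# hypothetical aliased states: immediate reward vs. reachable future value
r = {"s1": 1.0, "s2": 2.0}        # r(s1) < r(s2): the path through s2 looks better
future = {"s1": 10.0, "s2": 3.0}  # value obtainable after passing through each state

# one-step utility: immediate reward plus discounted future value
U = {s: r[s] + gamma * future[s] for s in r}

# lower immediate reward, yet higher utility
assert r["s1"] < r["s2"]
assert U["s1"] > U["s2"]
```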
-
Utile Distinction Test
Distinguishes states that have different policy actions or different utilities.
Merges states that have the same policy action and same utility.
-
U-TREE Overview
Treats percepts as multi-dimensional vectors of features.
Allows the agent to ignore certain dimensions of perception; internal state space can be smaller than the space of all percepts.
Combines instance-based learning with utile distinctions.
Agent builds a tree for making state distinctions.
-
U-TREE Overview
Non-leaf nodes branch on present or past percepts and actions.
Training instances are deposited in leaves.
Suffix tree is like an order-k Markov model with varying k.
Factored state representation captures only necessary state distinctions.
Value function approximation is achieved by representing the value function with a more compact structure than a mapping from all world states.
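A minimal sketch of the tree structure these slides describe — the field names are hypothetical, not taken from the thesis:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """U-Tree node: internal nodes test one perceptual dimension
    (or the action) at some history index; leaves hold instances."""
    dimension: object = None   # None marks a leaf; else "action" or a feature index
    history: int = 0           # how many steps back in time the test looks
    children: dict = field(default_factory=dict)   # feature value -> child Node
    instances: list = field(default_factory=list)  # T(s), populated at leaves
    Q: dict = field(default_factory=dict)          # action -> Q(s, a)

    def is_leaf(self):
        return self.dimension is None
```

Leaves play the role of internal states; promoting a fringe node amounts to giving a leaf a `dimension` and children.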
-
Example
-
Agent and Environment
Finite set of actions, A = {a1, a2, ..., a|A|}.
Scalar range of possible rewards, R ⊂ ℝ.
Finite set of observations, O = {o1, o2, ..., o|O|}.
At time t, the agent executes action a_t ∈ A and receives an observation o_{t+1} ∈ O and reward r_{t+1} ∈ R.
-
Agent and Environment
Set of observations is the set of all values of a perceptual vector, with perceptual features {o^1, o^2, ..., o^D}.
Each feature is an element of a finite set O^d = {o^d_1, o^d_2, ..., o^d_|O^d|}.
The value of dimension d at time t is o^d_t, so an observation is written o_t = <o^1_t, o^2_t, ..., o^D_t>.
O is the Cartesian product of all feature sets, |O| = ∏_d |O^d|.
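The factored observation space can be sketched as follows; the feature names here are invented for illustration (the driving domain's actual features differ):

```python
from itertools import product

# hypothetical perceptual feature sets O^1 ... O^D
features = {
    "hear":   ["quiet", "honk"],
    "gaze":   ["forward", "left", "right"],
    "object": ["road", "car", "shoulder"],
}

# an observation is one value per dimension: the Cartesian product
observations = list(product(*features.values()))

# |O| is the product of the feature-set sizes
size = 1
for values in features.values():
    size *= len(values)

assert len(observations) == size == 18   # 2 * 3 * 3
```

This is why selective perception pays off: U-Tree can branch on just one dimension instead of all 18 joint values.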
-
Instance Chain
Agent records raw experiences in a transition instance, T_t = <T_{t-1}, a_{t-1}, o_t, r_t>.
Tree nodes add a distinction based on:
History index, j, indicating the number of steps backwards in time
Perception or action dimension, d
Every node is uniquely identified by the set of labels on the path from the root, the conjunction s.
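A transition instance can be sketched as a linked record; the names are hypothetical:

```python
from collections import namedtuple

# T_t = <T_{t-1}, a_{t-1}, o_t, r_t>: each instance points at its
# predecessor, so the chain itself is the agent's raw memory
Instance = namedtuple("Instance", ["prev", "action", "obs", "reward"])

t0 = Instance(prev=None, action=None, obs=("quiet", "forward"), reward=0.0)
t1 = Instance(prev=t0, action="gaze-left", obs=("honk", "left"), reward=-1.0)

# a history-index test walks backwards along prev
assert t1.prev.obs == ("quiet", "forward")
```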
-
U-TREE
An instance T_t is deposited in a leaf node whose conjunction s is satisfied by T_t and its predecessors.
T(s) is the set of instances associated with leaf s.
L(T_t) specifies the leaf to which instance T_t belongs.
Below the official leaves, fringes are added that provide “hypothesis” distinctions.
If more distinctions help predict future reward, the fringes are promoted to “official” distinctions.
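Depositing an instance can be sketched as a descent from the root, answering each node's test from the instance or, for a nonzero history index, one of its predecessors. The node and instance shapes here are hypothetical:

```python
def leaf_for(root, instance):
    """Walk from the root to the leaf whose conjunction the instance
    (together with its predecessors) satisfies, and deposit it there."""
    node = root
    while node.dimension is not None:          # dimension None marks a leaf
        inst = instance
        for _ in range(node.history):          # step backwards in time
            inst = inst.prev
        if node.dimension == "action":
            key = inst.action
        else:                                  # an index into the percept vector
            key = inst.obs[node.dimension]
        node = node.children[key]
    node.instances.append(instance)            # T(s) <- T(s) union {T_t}
    return node
```

The returned leaf is L(T_t) for the deposited instance.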
-
Hidden State Space
U-Tree leaves correspond to internal states of the agent.
Deep branches represent finely distinguished state space.
Shallow branches represent broadly distinguished state space.
Q(s, a) is the learned estimate of expected future discounted reward for a state-action pair.
All Q-values indicate expected values for the next step in the future.
-
U-TREE Algorithm
1. Begin with a tree that represents no distinctions: one root node, s, with T(s) = {}.
2. Agent takes a step in the environment
(a) Record transition T_t = <T_{t-1}, a_{t-1}, o_t, r_t>
(b) Associate T_t with the appropriate conjunction, s: T(s) ← T(s) ∪ {T_t}
-
U-TREE Algorithm
3. Perform one sweep of value iteration with leaves as states:

Q(s, a) ← R(s, a) + γ ∑_{s'} Pr(s' | s, a) U(s'),  where U(s) = max_a Q(s, a)

R(s, a) = ∑_{T_i ∈ T(s, a)} r_i / |T(s, a)|

Pr(s' | s, a) = |{T_i ∈ T(s, a) : L(T_{i+1}) = s'}| / |T(s, a)|
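The sweep above can be sketched directly from the sampled model; the leaf record fields (`T`, `next_counts`, `Q`) are hypothetical names for the quantities on this slide:

```python
def sweep(leaves, gamma=0.9):
    """One value-iteration sweep with leaves as states.
    Each leaf s is assumed to carry:
      s.T[a]           -- instances T(s, a) recorded for action a
      s.next_counts[a] -- {successor leaf s': count}, the sampled Pr(s' | s, a)
      s.Q[a]           -- the estimate being updated
    """
    # U(s) = max_a Q(s, a), taken from the previous sweep's values
    U = {s: max(s.Q.values(), default=0.0) for s in leaves}
    for s in leaves:
        for a, insts in s.T.items():
            n = len(insts)
            if n == 0:
                continue
            R = sum(i.reward for i in insts) / n          # R(s, a)
            future = sum(cnt / n * U[s2]                  # sum_s' Pr(s'|s,a) U(s')
                         for s2, cnt in s.next_counts[a].items())
            s.Q[a] = R + gamma * future
```

One sweep per step suffices here because each step changes the model only locally.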
-
U-TREE Algorithm
4. After every k steps, test whether the transition utility has changed enough to warrant new distinctions in the internal state space.
(a) Compare distributions of future discounted rewards associated with the same action from different nodes.
(b) The fringe could be expanded by all possible permutations of observations and actions to a fixed depth H with a maximum history index, j, yielding an enormous branching factor, exponential in H.
-
U-TREE Algorithm
4. Test for utile distinctions.
(c) Possible expansion pruning methods:
i. Don’t expand leaves containing zero (or few) instances.
ii. Don’t expand leaves whose instances exhibit little deviation in utility.
iii. Order the terms in the conjunction (i.e. perceptual dimensions, action) for expansion.
-
U-TREE Algorithm
4. Test for utile distinctions.
(d) Expected future discounted reward of instance T_i is

Q(T_i) = r_i + γ U(L(T_{i+1}))

(e) When a deep fringe node is promoted, all of its uncles and great-uncles are too.
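The per-instance values and the distribution comparison can be sketched as below. The thesis uses a Kolmogorov–Smirnov test; this is a hand-rolled two-sample KS statistic, and the data-structure names are hypothetical:

```python
def instance_q(inst, leaf_of, U, gamma=0.9):
    """Q(T_i) = r_i + gamma * U(L(T_{i+1})): the instance's reward plus
    the discounted utility of the leaf holding its successor."""
    if inst.next is None:
        return inst.reward
    return inst.reward + gamma * U[leaf_of[inst.next]]

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    def cdf(v, data):
        return sum(1 for d in data if d <= v) / len(data)
    return max(abs(cdf(p, xs) - cdf(p, ys)) for p in list(xs) + list(ys))
```

A fringe node whose instance Q-values differ sharply (large KS statistic) from its parent's is a candidate for promotion.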
-
U-TREE Algorithm
5. Choose next action based on Q-values of the corresponding leaf:

a_{t+1} = argmax_a Q(L(T_t), a)

Alternatively, explore by choosing a random action with probability ε.
6. Set t ← t + 1. Goto 2.
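Step 5 can be sketched as an ε-greedy rule over the current leaf's Q-values (names hypothetical):

```python
import random

def choose_action(leaf, actions, epsilon=0.1, rng=random):
    """Greedy in the current leaf's Q-values, exploring with
    probability epsilon via a uniformly random action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: leaf.Q.get(a, 0.0))
```

Decaying `epsilon` over training gives the "decreasing exploration policy" used in the driving experiments.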
-
Take the Agents Driving
-
Driving Experiment
Actions include gaze directions, and shift to gaze-lane.
Sensory system includes hearing, and several gaze gauges.
2,592 sensor states
3,536 world states not including the agent’s sensor system; otherwise 21,216 world states
Trying to solve the task with only perceptual distinctions would be disastrous.
-
Driving Experiment
Over 5,000 time steps with only slower cars:
Hand-written policy (32 leaves) makes 99 collisions.
Random actions make 788 collisions.
U-Tree trained with 10,000 time steps and a decreasing exploration policy (51 leaves) makes 67 collisions.
-
Driving Experiment
Over 5,000 time steps with slower and faster cars:
Random actions make 1,260 collisions, with 775 steps being honked at.
U-Tree trained with 18,000 steps and a decreasing exploration policy (143 leaves) makes 280 collisions, with 176 steps being honked at.
-
Discussion
“Chicken and egg” problem:
Distinctions → Utility → Policy → Distinctions
Difficulty with long memories
Difficulty with large conjunctions
Difficulty with hard to find rewards
Difficulty with loops in the environment
-
Discussion
Success with large perception spaces
Success with hidden state
Success with noise
Success with expensive experience
Applicable to general RL domains
-
Extensions
Better Statistical Tests
Utile-Clustered Branches
Information-Theoretic Splitting
Eliminate the Fringe
Options