
  • Reinforcement Learning with Selective Perception and Hidden State

    By Andrew K. McCallum

    Presented by Jerod Weinman

    University of Massachusetts-Amherst


  • Outline

    Motivations

    Utile Distinctions

    U-TREE algorithm

    Driving Experiment

    Conclusions

    Extensions


  • The Set Up...

    Agents have opposite, yet intertwining problems regarding internal state space.

    Too many distinctions ⇒ Selective Perception
    Too few distinctions ⇒ Short Term Memory

    Most RL algorithms depend on knowledge engineers to design the state space.


  • Opposite and Related?

    Selective perception creates hidden state on purpose.

    Short term memory, which alleviates hidden state, allows agents selective perception.

    The “black magic” of RL applications has been engineering state distinctions.


  • Motivating Statements

    Learning closed-loop behaviors is useful.

    Selective perception provides an efficient interface.

    Environment interfaces suffer from hidden state; selective perception can make it worse.

    Non-Markov hidden state problems can be solved with memory.


  • Motivating Statements

    Learning selective perception and using memory is difficult.

    Experience is expensive.

    Agents must handle noisy perceptions and actions.

    Final performance is to be balanced against training time.


  • Utile Distinctions

    State-space should be dependent on the task at hand.

    Learning should be proportional to task difficulty, not world complexity.

    Perceptual aliasing

    Agents should only make distinctions needed to predict future reward.


  • Distinctions for Learning

    Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy.

    Proof sketch:

    Describe an environment and task for which an optimal policy may be calculated.

    Find a minimum set of state distinctions adequate for representing that policy.

    Recalculate a policy in the reduced internal state space; the result is a non-optimal policy.


  • Utile Distinctions

    Why? If states s1 and s2 are aliased, the path through s1 may have slightly lower reward than through s2, i.e. r(s1) < r(s2), yet the utility of s1 may be higher than that of s2, i.e. U(s1) > U(s2).

    r(s1) < r(s2)   yet   U(s1) > U(s2)

    Optimal policy calculations don’t care which aliased state the agent goes through, so they choose the lower-cost (higher immediate reward) path, which leads through the state with lower utility.
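
    As a small numeric illustration (the rewards, successor utilities, and discount factor below are invented, not taken from the thesis): suppose γ = 0.9, the state reached after s1 is worth 10, and the state reached after s2 is worth 0. Then

        r(s1) = 0  <  r(s2) = 1
        U(s1) = 0 + 0.9 · 10 = 9
        U(s2) = 1 + 0.9 · 0  = 1

    A learner that cannot tell s1 from s2 routes through s2 for its higher immediate reward and forgoes the much larger discounted return available beyond s1.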

  • Utile Distinction Test

    Distinguishes states that have different policy actions or different utilities.

    Merges states that have the same policy action and same utility.


  • U-TREE Overview

    Treats percepts as multi-dimensional vectors of features.

    Allows the agent to ignore certain dimensions of perception.

    Internal state space can be smaller than the space of all percepts.

    Combines instance-based learning with utile distinctions.

    Agent builds a tree for making state distinctions.


  • U-TREE Overview

    Non-leaf nodes branch on present or past percepts and actions.

    Training instances are deposited in leaves.

    The suffix tree is like an order-n Markov model with varying n.

    Factored state representation captures only the necessary state distinctions.

    Value function approximation is achieved by representing the value function with a more compact structure than a mapping from all world states.

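    A minimal sketch of this tree structure in Python (the names and fields are illustrative, not McCallum's): internal nodes split on one perceptual or action dimension at some history offset, while leaves hold the instances deposited in them and one Q-value per action.

        from dataclasses import dataclass, field

        @dataclass
        class SplitNode:
            """Branches on one feature: `history` steps back, perceptual or action dimension `dim`."""
            history: int                                  # 0 = current step, 1 = one step back, ...
            dim: str                                      # e.g. "action", "hear-horn", "gaze-object"
            children: dict = field(default_factory=dict)  # feature value -> child node

        @dataclass
        class LeafNode:
            """One internal state of the agent."""
            instances: list = field(default_factory=list)  # transition instances deposited here
            q: dict = field(default_factory=dict)          # action -> estimated future discounted reward

    Promoting a fringe distinction turns a leaf into a split node only where the extra distinction helps predict reward, which is what keeps the internal state space small.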

  • Example


  • Agent and Environment

    Finite set of actions, A = {a1, a2, ..., a|A|}.

    Scalar range of possible rewards, R = [rmin, rmax].

    Finite set of observations, O = {o1, o2, ..., o|O|}.

    At time t, the agent executes action a_t ∈ A and receives an observation o_{t+1} ∈ O and reward r_{t+1} ∈ R.


  • Agent and Environment

    The set of observations is the set of all values of a perceptual vector with D perceptual features.

    Each feature d takes values in a finite set O_d.

    The value of dimension d at time t is o_t[d], so an observation is written o_t = ( o_t[1], o_t[2], ..., o_t[D] ).

    O is the Cartesian product of all the feature sets, so |O| = ∏_d |O_d|.

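    A toy calculation of the percept-space size (the feature names and value counts below are invented for illustration):

        from math import prod

        # Hypothetical perceptual dimensions and the number of values each can take.
        feature_values = {"hear-horn": 2, "gaze-object": 6, "gaze-side": 3}

        num_percepts = prod(feature_values.values())
        print(num_percepts)  # 2 * 6 * 3 = 36 possible percept vectors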

  • Instance Chain

    The agent records raw experience in a transition instance, T_t = ⟨ T_{t-1}, a_{t-1}, o_t, r_t ⟩.

    Tree nodes add a distinction based on:

    A history index, j, indicating the number of steps backwards in time.

    A perception or action dimension, d.

    Every node is uniquely identified by the set of labels on the path from the root, the conjunction s.

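    A sketch of the instance chain in Python (field and function names are assumptions, not the thesis's notation): each instance points back to its predecessor, and a node label (history index, dimension) is answered by walking that chain backwards.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Instance:
            prev: Optional["Instance"]  # T_{t-1}, the preceding instance in the chain
            action: str                 # a_{t-1}, the action that produced this step
            obs: dict                   # o_t, mapping perceptual dimension -> value
            reward: float               # r_t

        def feature(inst: Instance, history: int, dim: str):
            """Value of dimension `dim`, `history` steps before `inst`; None if the chain is too short."""
            for _ in range(history):
                if inst.prev is None:
                    return None
                inst = inst.prev
            return inst.action if dim == "action" else inst.obs.get(dim)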

  • U-TREE

    An instance T_t is deposited in the leaf node whose conjunction s is satisfied by T_t and its predecessors.

    T(s) is the set of instances associated with leaf s.

    L(T_t) specifies the leaf to which instance T_t belongs.

    Below the official leaves, fringes are added that provide “hypothesis” distinctions.

    If these extra distinctions help predict future reward, the fringe nodes are promoted to “official” distinctions.

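    Depositing an instance is then a tree descent. A sketch under the same assumptions as above, with minimal stand-in node types so the snippet stands alone:

        class Leaf:
            def __init__(self):
                self.instances, self.q = [], {}  # T(s) and Q(s, .)

        class Split:
            def __init__(self, history, dim, children):
                self.history, self.dim, self.children = history, dim, children

        def leaf_of(node, inst, feature):
            """L(T_t): descend from `node`, answering each split test from the instance chain."""
            while isinstance(node, Split):
                node = node.children[feature(inst, node.history, node.dim)]
            return node

        def deposit(root, inst, feature):
            leaf = leaf_of(root, inst, feature)
            leaf.instances.append(inst)          # T(s) <- T(s) U {T_t}
            return leaf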

  • Hidden State Space

    U-Tree leaves correspond to internal states of the agent.

    Deep branches represent finely distinguished state space.

    Shallow branches represent broadly distinguished state space.

    Q(s, a) is the learned estimate of expected future discounted reward for a state-action pair.

    All Q-values indicate expected values for the next step in the future.


  • U-TREE Algorithm

    1. Begin with a tree that represents no distinctions: one root node, s, with T(s) = {}.

    2. The agent takes a step in the environment:

    (a) Record the transition T_t = ⟨ T_{t-1}, a_{t-1}, o_t, r_t ⟩.

    (b) Associate T_t with the appropriate conjunction s: T(s) ← T(s) ∪ {T_t}.


  • U-TREE Algorithm

    3. Perform one sweep of value iteration with the leaves as states:

    Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s' | s, a) U(s'),   where U(s') = max_{a'} Q(s', a')

    R(s, a) = ( Σ_{T_i ∈ T(s,a)} r_i ) / |T(s, a)|

    Pr(s' | s, a) = |{ T_i ∈ T(s, a) : L(T_{i+1}) = s' }| / |T(s, a)|

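    The sweep transcribes almost directly into code. In this sketch, the per-leaf grouping instances_by_action, the q dictionary, the next_leaf function standing in for L(T_{i+1}), and the discount value are all assumptions, not McCallum's identifiers:

        GAMMA = 0.9  # discount factor; the value is chosen for illustration

        def value_iteration_sweep(leaves, next_leaf):
            """One sweep of value iteration with the current leaves as states."""
            for s in leaves:
                for a, insts in s.instances_by_action.items():        # T(s, a)
                    if not insts:
                        continue
                    r_sa = sum(i.reward for i in insts) / len(insts)  # R(s, a)
                    counts = {}                                       # empirical Pr(s' | s, a)
                    for i in insts:
                        s_next = next_leaf(i)                         # L(T_{i+1})
                        counts[s_next] = counts.get(s_next, 0) + 1
                    expected_u = sum(
                        (n / len(insts)) * max(s2.q.values(), default=0.0)  # U(s')
                        for s2, n in counts.items()
                    )
                    s.q[a] = r_sa + GAMMA * expected_u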

  • U-TREE Algorithm

    4. After every k steps, test whether the transition utility has changed enough to warrant new distinctions in the internal state space.

    (a) Compare distributions of future discounted rewards associated with the same action from different nodes.

    (b) The fringe could be expanded by all possible permutations of observations and actions to a fixed depth H with a maximum history index j, yielding an enormous branching factor (exponential in H).

  • U-TREE Algorithm

    4. Test for utile distinctions.

    (c) Possible expansion pruning methods:

    i. Don’t expand leaves containing zero (or few) instances.

    ii. Don’t expand leaves whose instances exhibit little deviation in utility.

    iii. Order the terms in the conjunction (i.e. perceptual dimensions, action) for expansion.


  • U-TREE Algorithm

    4. Test for utile distinctions.

    (d) The expected future discounted reward of instance T_i is

    Q(T_i) = r_i + γ U(L(T_{i+1}))

    (e) When a deep fringe node is promoted, all of its uncles and great-uncles are too.

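    Steps (a)-(d) can be sketched together: compute each instance's expected future discounted reward, then test whether the distribution under a candidate fringe split differs from its parent leaf's. The Kolmogorov-Smirnov test and the 0.05 threshold below are illustrative choices of statistical comparison, and the q/next_leaf assumptions are the same as in the earlier sketches:

        from scipy.stats import ks_2samp

        GAMMA = 0.9

        def instance_return(inst, next_leaf):
            """Q(T_i) = r_i + gamma * U(L(T_{i+1}))."""
            s_next = next_leaf(inst)
            u_next = max(s_next.q.values(), default=0.0) if s_next is not None else 0.0
            return inst.reward + GAMMA * u_next

        def distinction_is_utile(parent_insts, fringe_insts, next_leaf, threshold=0.05):
            """Promote the fringe split if the two return distributions differ significantly."""
            if len(parent_insts) < 2 or len(fringe_insts) < 2:
                return False  # pruning rule i: too few instances to compare
            q_parent = [instance_return(i, next_leaf) for i in parent_insts]
            q_fringe = [instance_return(i, next_leaf) for i in fringe_insts]
            _, p_value = ks_2samp(q_parent, q_fringe)
            return p_value < threshold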

  • U-TREE Algorithm

    5. Choose the next action based on the Q-values of the corresponding leaf:

    a_t = argmax_a Q(L(T_t), a)

    Alternatively, explore by choosing a random action with probability ε.

    6. Set t ← t + 1. Goto 2.

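    The action choice is a standard epsilon-greedy rule over the current leaf's Q-values (a sketch; the exploration probability is illustrative):

        import random

        def choose_action(leaf, actions, epsilon=0.1):
            """argmax_a Q(L(T_t), a), or a random exploratory action with probability epsilon."""
            if random.random() < epsilon or not leaf.q:
                return random.choice(actions)
            return max(actions, key=lambda a: leaf.q.get(a, 0.0))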

  • Take the Agents Driving


  • Driving Experiment

    Actions include gaze directions, and shift to gaze-lane.

    Sensory system includes hearing, and several gaze gauges.

    2,592 sensor states

    3,536 world states not including the agent’s sensor system, otherwise 21,216 world states

    Trying to solve the task with only perceptual distinctions would be disastrous


  • Driving Experiment

    Over 5,000 time steps with only slower cars

    A hand-written policy (32 leaves) makes 99 collisions.

    Random actions make 788 collisions.

    U-Tree, trained with 10,000 time steps and a decreasing exploration policy (51 leaves), makes 67 collisions.


  • Driving Experiment

    Over 5,000 time steps with slower and faster cars

    Random actions make 1,260 collisions, with 775 steps being honked at.

    U-Tree, trained with 18,000 steps and a decreasing exploration policy (143 leaves), makes 280 collisions, with 176 steps being honked at.


  • Discussion

    “Chicken and egg” problem

    Distinctions ⇒ Utility ⇒ Policy

    Difficulty with long memories

    Difficulty with large conjunctions

    Difficulty with hard-to-find rewards

    Difficulty with loops in the environment


  • Discussion

    Success with large perception spaces

    Success with hidden state

    Success with noise

    Success with expensive experience

    Applicable to general RL domains


  • Extensions

    Better Statistical Tests

    Utile-Clustered Branches

    Information-Theoretic Splitting

    Eliminate the Fringe

    Options

