
  • Reinforcement Learning with Selective Perception and Hidden State

    By Andrew K. McCallum

    Presented by Jerod Weinman

    University of Massachusetts-Amherst


  • Outline

    Motivations

    Utile Distinctions

    U-TREE algorithm

    Driving Experiment

    Conclusions

    Extensions


  • The Set Up...

    Agents have opposite, yet intertwining problems regarding internal state space.

    Too many distinctions ⇒ Selective Perception
    Too few distinctions ⇒ Short Term Memory

    Most RL algorithms depend on knowledge engineers to design the state space.


  • Opposite and Related?

    Selective perception creates hidden state on purpose.

    Short term memory, which alleviates hidden state, allows agents selective perception.

    The “black magic” of RL applications has been engineering state distinctions.


  • Motivating Statements

    Learning closed-loop behaviors is useful.

    Selective perception provides an efficient interface.

    Environment interfaces suffer from hidden state; selective perception can make it worse.

    Non-Markov hidden state problems can be solved with memory.


  • Motivating Statements

    Learning selective perception and using memory is difficult.

    Experience is expensive.

    Agents must handle noisy perceptions and actions.

    Final performance is to be balanced against training time.


  • Utile Distinctions

    State-space should be dependent on the task at hand.

    Learning should be proportional to task difficulty, not world complexity.

    Perceptual aliasing

    Agents should only make distinctions needed to predict future reward.


  • Distinctions for Learning

    Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy.

    Proof sketch:

    Describe an environment and task for which an optimal policy may be calculated.

    Find a minimum set of state distinctions adequate for representing that policy.

    Recalculate a policy in the reduced internal state space; the result is a non-optimal policy.


  • Utile Distinctions

    Why? If states s1 and s2 are aliased, the path through s1 may have slightly lower reward than through s2, i.e. r(s1) < r(s2), yet the utility of s1 may be higher than that of s2, i.e. U(s1) > U(s2).

    r(s1) < r(s2)   yet   U(s1) > U(s2)

    Optimal policy calculations don’t care which aliased state the agent goes through, so they choose the lower-cost (higher immediate reward) path, which leads through the state with lower utility.
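
    As a small numeric illustration (the rewards, successor utilities, and discount factor below are invented, not taken from the thesis): suppose γ = 0.9, the state reached after s1 is worth 10, and the state reached after s2 is worth 0. Then

        r(s1) = 0  <  r(s2) = 1
        U(s1) = 0 + 0.9 · 10 = 9
        U(s2) = 1 + 0.9 · 0  = 1

    A learner that cannot tell s1 from s2 routes through s2 for its higher immediate reward and forgoes the much larger discounted return available beyond s1.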

  • Utile Distinction Test

    Distinguishes states that have different policy actions or different utilities.

    Merges states that have the same policy action and same utility.


  • U-TREE Overview

    Treats percepts as multi-dimensional vectors of features.

    Allows the agent to ignore certain dimensions of perception.

    Internal state space can be smaller than the space of all percepts.

    Combines instance-based learning with utile distinctions.

    Agent builds a tree for making state distinctions.


  • U-TREE Overview

    Non-leaf nodes branch on present or past percepts and actions.

    Training instances are deposited in leaves.

    The suffix tree is like an order-n Markov model with varying n.

    Factored state representation captures only the necessary state distinctions.

    Value function approximation is achieved by representing the value function with a more compact structure than a mapping from all world states.

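    A minimal sketch of this tree structure in Python (the names and fields are illustrative, not McCallum's): internal nodes split on one perceptual or action dimension at some history offset, while leaves hold the instances deposited in them and one Q-value per action.

        from dataclasses import dataclass, field

        @dataclass
        class SplitNode:
            """Branches on one feature: `history` steps back, perceptual or action dimension `dim`."""
            history: int                                  # 0 = current step, 1 = one step back, ...
            dim: str                                      # e.g. "action", "hear-horn", "gaze-object"
            children: dict = field(default_factory=dict)  # feature value -> child node

        @dataclass
        class LeafNode:
            """One internal state of the agent."""
            instances: list = field(default_factory=list)  # transition instances deposited here
            q: dict = field(default_factory=dict)          # action -> estimated future discounted reward

    Promoting a fringe distinction turns a leaf into a split node only where the extra distinction helps predict reward, which is what keeps the internal state space small.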

  • Example


  • Agent and Environment

    Finite set of actions, A = {a1, a2, ..., a|A|}.

    Scalar range of possible rewards, R = [rmin, rmax].

    Finite set of observations, O = {o1, o2, ..., o|O|}.

    At time t, the agent executes action a_t ∈ A and receives an observation o_{t+1} ∈ O and reward r_{t+1} ∈ R.


  • Agent and Environment

    The set of observations is the set of all values of a perceptual vector with D perceptual features.

    Each feature d takes values in a finite set O_d.

    The value of dimension d at time t is o_t[d], so an observation is written o_t = ( o_t[1], o_t[2], ..., o_t[D] ).

    O is the Cartesian product of all the feature sets, so |O| = ∏_d |O_d|.

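    A toy calculation of the percept-space size (the feature names and value counts below are invented for illustration):

        from math import prod

        # Hypothetical perceptual dimensions and the number of values each can take.
        feature_values = {"hear-horn": 2, "gaze-object": 6, "gaze-side": 3}

        num_percepts = prod(feature_values.values())
        print(num_percepts)  # 2 * 6 * 3 = 36 possible percept vectors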

  • Instance Chain

    The agent records raw experience in a transition instance, T_t = ⟨ T_{t-1}, a_{t-1}, o_t, r_t ⟩.

    Tree nodes add a distinction based on:

    A history index, j, indicating the number of steps backwards in time.

    A perception or action dimension, d.

    Every node is uniquely identified by the set of labels on the path from the root, the conjunction s.

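    A sketch of the instance chain in Python (field and function names are assumptions, not the thesis's notation): each instance points back to its predecessor, and a node label (history index, dimension) is answered by walking that chain backwards.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Instance:
            prev: Optional["Instance"]  # T_{t-1}, the preceding instance in the chain
            action: str                 # a_{t-1}, the action that produced this step
            obs: dict                   # o_t, mapping perceptual dimension -> value
            reward: float               # r_t

        def feature(inst: Instance, history: int, dim: str):
            """Value of dimension `dim`, `history` steps before `inst`; None if the chain is too short."""
            for _ in range(history):
                if inst.prev is None:
                    return None
                inst = inst.prev
            return inst.action if dim == "action" else inst.obs.get(dim)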

  • U-TREE

    An instance T_t is deposited in the leaf node whose conjunction s is satisfied by T_t and its predecessors.

    T(s) is the set of instances associated with leaf s.

    L(T_t) specifies the leaf to which instance T_t belongs.

    Below the official leaves, fringes are added that provide “hypothesis” distinctions.

    If these extra distinctions help predict future reward, the fringe nodes are promoted to “official” distinctions.

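    Depositing an instance is then a tree descent. A sketch under the same assumptions as above, with minimal stand-in node types so the snippet stands alone:

        class Leaf:
            def __init__(self):
                self.instances, self.q = [], {}  # T(s) and Q(s, .)

        class Split:
            def __init__(self, history, dim, children):
                self.history, self.dim, self.children = history, dim, children

        def leaf_of(node, inst, feature):
            """L(T_t): descend from `node`, answering each split test from the instance chain."""
            while isinstance(node, Split):
                node = node.children[feature(inst, node.history, node.dim)]
            return node

        def deposit(root, inst, feature):
            leaf = leaf_of(root, inst, feature)
            leaf.instances.append(inst)          # T(s) <- T(s) U {T_t}
            return leaf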

  • Hidden State Space

    U-Tree leaves correspond to internal states of the agent.

    Deep branches represent finely distinguished state space.

    Shallow branches represent broadly distinguished state space.

    Q(s, a) is the learned estimate of expected future discounted reward for a state-action pair.

    All Q-values indicate expected values for the next step in the future.


  • U-TREE Algorithm

    1. Begin with a tree that represents no distinctions: one root node, s, with T(s) = {}.

    2. The agent takes a step in the environment:

    (a) Record the transition T_t = ⟨ T_{t-1}, a_{t-1}, o_t, r_t ⟩.

    (b) Associate T_t with the appropriate conjunction s: T(s) ← T(s) ∪ {T_t}.


  • U-TREE Algorithm

    3. Perform one sweep of value iteration with the leaves as states:

    Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s' | s, a) U(s'),   where U(s') = max_{a'} Q(s', a')

    R(s, a) = ( Σ_{T_i ∈ T(s,a)} r_i ) / |T(s, a)|

    Pr(s' | s, a) = |{ T_i ∈ T(s, a) : L(T_{i+1}) = s' }| / |T(s, a)|

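    The sweep transcribes almost directly into code. In this sketch, the per-leaf grouping instances_by_action, the q dictionary, the next_leaf function standing in for L(T_{i+1}), and the discount value are all assumptions, not McCallum's identifiers:

        GAMMA = 0.9  # discount factor; the value is chosen for illustration

        def value_iteration_sweep(leaves, next_leaf):
            """One sweep of value iteration with the current leaves as states."""
            for s in leaves:
                for a, insts in s.instances_by_action.items():        # T(s, a)
                    if not insts:
                        continue
                    r_sa = sum(i.reward for i in insts) / len(insts)  # R(s, a)
                    counts = {}                                       # empirical Pr(s' | s, a)
                    for i in insts:
                        s_next = next_leaf(i)                         # L(T_{i+1})
                        counts[s_next] = counts.get(s_next, 0) + 1
                    expected_u = sum(
                        (n / len(insts)) * max(s2.q.values(), default=0.0)  # U(s')
                        for s2, n in counts.items()
                    )
                    s.q[a] = r_sa + GAMMA * expected_u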

  • U-TREE Algorithm

    4. After every k steps, test whether the transition utility has changed enough to warrant new distinctions in the internal state space.

    (a) Compare distributions of future discounted rewards associated with the same action from different nodes.

    (b) The fringe could be expanded by all possible permutations of observations and actions to a fixed depth H with a maximum history index j, yielding an enormous branching factor (exponential in H).

  • U-TREE Algorithm

    4. Test for utile distinctions.

    (c) Possible expansion pruning methods:

    i. Don’t expand leaves containing zero (or few) instances.

    ii. Don’t expand leaves whose instances exhibit little deviation in utility.

    iii. Order the terms in the conjunction (i.e. perceptual dimensions, action) for expansion.


  • U-TREE Algorithm

    4. Test for utile distinctions.

    (d) The expected future discounted reward of instance T_i is

    Q(T_i) = r_i + γ U(L(T_{i+1}))

    (e) When a deep fringe node is promoted, all of its uncles and great-uncles are too.

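    Steps (a)-(d) can be sketched together: compute each instance's expected future discounted reward, then test whether the distribution under a candidate fringe split differs from its parent leaf's. The Kolmogorov-Smirnov test and the 0.05 threshold below are illustrative choices of statistical comparison, and the q/next_leaf assumptions are the same as in the earlier sketches:

        from scipy.stats import ks_2samp

        GAMMA = 0.9

        def instance_return(inst, next_leaf):
            """Q(T_i) = r_i + gamma * U(L(T_{i+1}))."""
            s_next = next_leaf(inst)
            u_next = max(s_next.q.values(), default=0.0) if s_next is not None else 0.0
            return inst.reward + GAMMA * u_next

        def distinction_is_utile(parent_insts, fringe_insts, next_leaf, threshold=0.05):
            """Promote the fringe split if the two return distributions differ significantly."""
            if len(parent_insts) < 2 or len(fringe_insts) < 2:
                return False  # pruning rule i: too few instances to compare
            q_parent = [instance_return(i, next_leaf) for i in parent_insts]
            q_fringe = [instance_return(i, next_leaf) for i in fringe_insts]
            _, p_value = ks_2samp(q_parent, q_fringe)
            return p_value < threshold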

  • U-TREE Algorithm

    5. Choose the next action based on the Q-values of the corresponding leaf:

    a_t = argmax_a Q(L(T_t), a)

    Alternatively, explore by choosing a random action with probability ε.

    6. Set t ← t + 1. Goto 2.

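    The action choice is a standard epsilon-greedy rule over the current leaf's Q-values (a sketch; the exploration probability is illustrative):

        import random

        def choose_action(leaf, actions, epsilon=0.1):
            """argmax_a Q(L(T_t), a), or a random exploratory action with probability epsilon."""
            if random.random() < epsilon or not leaf.q:
                return random.choice(actions)
            return max(actions, key=lambda a: leaf.q.get(a, 0.0))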

  • Take the Agents Driving


  • Driving Experiment

    Actions include gaze directions, and shift to gaze-lane.

    Sensory system includes hearing, and several gaze gauges.

    2,592 sensor states

    3,536 world states not including the agent’s sensor system, otherwise 21,216 world states

    Trying to solve the task with only perceptual distinctions would be disastrous


  • Driving Experiment

    Over 5,000 time steps with only slower cars

    A hand-written policy (32 leaves) makes 99 collisions.

    Random actions make 788 collisions.

    U-Tree, trained with 10,000 time steps and a decreasing exploration policy (51 leaves), makes 67 collisions.


  • Driving Experiment

    Over 5,000 time steps with slower and faster cars

    Random actions make 1,260 collisions, with 775 steps being honked at.

    U-Tree, trained with 18,000 steps and a decreasing exploration policy (143 leaves), makes 280 collisions, with 176 steps being honked at.


  • Discussion

    “Chicken and egg” problem

    Distinctions ⇒ Utility ⇒ Policy

    Difficulty with long memories

    Difficulty with large conjunctions

    Difficulty with hard-to-find rewards

    Difficulty with loops in the environment


  • Discussion

    Success with large perception spaces

    Success with hidden state

    Success with noise

    Success with expensive experience

    Applicable to general RL domains


  • Extensions

    Better Statistical Tests

    Utile-Clustered Branches

    Information-Theoretic Splitting

    Eliminate the Fringe

    Options

