Model-Free Methods
9.54/fall14/slides/Reinforcement Learning 2
TRANSCRIPT

Page 1: Model-Free Methods

Page 2:

[Figure: backup diagram rooted at state S1, with actions A1, A2, A3 branching to successor states S1, S2, S3; two of the transitions carry rewards R = 2 and R = -1]

Model-based: use all branches

In model-based methods we update Vπ(S) using all the possible successors S'. In model-free methods we take a step and update based on that single sample.
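To make the contrast concrete, here is a minimal Python sketch (not from the slides; the transition table P, reward table R, policy pi, and all names are illustrative assumptions). The model-based backup averages over every possible successor S', while the model-free backup uses only the one transition actually sampled.

```python
# Sketch only: assumes P[s][a] is a dict {s2: probability}, R[s][a][s2] is the
# reward for that transition, and pi[s] is the action the policy takes in s.

def model_based_backup(V, s, pi, P, R, gamma=0.9):
    """Model-based: expected update over ALL possible successors S'."""
    a = pi[s]
    return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())

def model_free_update(V, s, r, s2, alpha=0.1, gamma=0.9):
    """Model-free: move V(s) toward the target from ONE sampled step (r, s2)."""
    return V[s] + alpha * (r + gamma * V[s2] - V[s])
```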

Page 3:

[Figure: the same backup diagram as Page 2 (S1 with actions A1, A2, A3 and successors S1, S2, S3; rewards R = 2 and R = -1)]

Model-based: use all branches

In model-free we take a step and update based on this sample:

V(S1) ← V(S1) + α [r + γ V(S3) − V(S1)]

<V> ← <V> + α (V − <V>)    (the same form as a running-average update: the estimate <V> moves toward each new sample V)

Page 4:

[Figure: from state St, the chosen action A leads to S1 with reward r1; S2 and S3 are the other possible successors]

On-line: take an action A from St, ending at S1

<V> ← <V> + α (V − <V>)

Page 5:

TD Prediction Algorithm

Terminology: prediction means computing Vπ(S) for a given policy π.

Prediction error: δ = r + γ V(S') − V(S). Expected: V(S); observed: r + γ V(S').
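A minimal sketch of this TD(0) prediction loop in Python. The environment interface (env.reset(), env.step()) and the tabular value store are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict

def td0_prediction(env, pi, alpha=0.1, gamma=0.99, n_episodes=1000):
    """Estimate V_pi(S) from sampled transitions using the TD(0) update.
    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    and pi(s) returns the policy's action in state s."""
    V = defaultdict(float)                        # V(S), initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))     # follow the given policy
            delta = r + gamma * V[s_next] - V[s]  # prediction error: observed - expected
            V[s] += alpha * delta
            s = s_next
    return V
```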

Page 6:

[Figure: the same diagram as Page 4: from St, action A leads to S1 with reward r1, while S2 and S3 go unexplored]

Learning a Policy: the exploration problem. Take an action A from St, ending at S1

Update St, then update S1. We may never explore the alternative actions to A.

Page 7:

From Value to Action

• Based on V(S), an action can be selected

• 'Greedy' selection (select the action A with the current maximum expected future reward) is not good enough

• Need for ‘exploration’

• For example: ‘ε-greedy’

• With probability 1−ε take the action with the maximum expected return; with probability ε take one of the other actions

• Can be a more complex decision

• Learning here is done in episodes

Page 8:

TD Policy Learning

ε-greedy

ε-greedy performs exploration. The scheme can be more complex, e.g. changing ε with time or with conditions.
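A small sketch of ε-greedy selection over Q-values, plus one possible way of changing ε with time (the decay schedule and all names are assumptions, not the slides' method; here the exploratory draw is uniform over all actions, a common variant).

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Example schedule: shrink epsilon as training proceeds."""
    return max(eps_min, eps_start * decay ** episode)
```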

Page 9:

TD ‘Actor-Critic’

Terminology: prediction is the same as policy evaluation, i.e. computing Vπ(S). This is the job of the 'critic'; the 'actor' is the part that selects actions.

Motivated by brain modeling

Page 10:

‘Actor-critic’ scheme -- standard drawing

Motivated by brain modeling (e.g., the ventral striatum is the critic, the dorsal striatum is the actor)
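A rough sketch of one common tabular version of this scheme (an assumption about the details, not the slides' exact formulation): the critic maintains V(S) and computes the TD error δ, and the actor keeps action preferences that it nudges by that same δ.

```python
import math, random
from collections import defaultdict

V = defaultdict(float)        # critic: state values V(S)
H = defaultdict(float)        # actor: action preferences H[(s, a)]

def actor_select(s, actions):
    """Actor's policy: softmax over its preferences."""
    weights = [math.exp(H[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def actor_critic_step(s, a, r, s_next, alpha_v=0.1, alpha_h=0.1, gamma=0.99):
    delta = r + gamma * V[s_next] - V[s]   # critic's TD (prediction) error
    V[s] += alpha_v * delta                # critic: policy evaluation update
    H[(s, a)] += alpha_h * delta           # actor: reinforce a when delta > 0
```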

Page 11:

Q-learning

• The main algorithm used for model-free RL
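The transcript does not reproduce the update itself; in its standard tabular form, Q-learning applies the TD update to Q with a max over the next actions. A sketch with assumed names and storage:

```python
from collections import defaultdict

Q = defaultdict(float)    # tabular Q[(s, a)], initialized to 0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Standard Q-learning: bootstrap from the best next action, regardless of
    which action the behaviour policy actually takes next (off-policy)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```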

Page 12:

Q-values (state-action)

[Figure: the backup diagram from Page 2, with the action branches labeled by their values Q(S1, A1) and Q(S1, A3)]

Qπ (S,a) is the expected return starting from S, taking the action a, and thereafter following policy π

Page 13:

Q-value (state-action)

• The same update is done on Q-values rather than on V

• Used in most practical algorithms and some brain models

• Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π:

Qπ(s, a) = E[ r(t+1) + γ r(t+2) + γ² r(t+3) + … | s(t) = s, a(t) = a ]
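One practical consequence (a sketch, not from the slides): with Q-values a policy can be read off directly, with no transition model needed to look ahead to successor states.

```python
def greedy_action(Q, s, actions):
    """Pick the action with the largest Q(s, a); no model of S' is required."""
    return max(actions, key=lambda a: Q[(s, a)])
```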

Page 14:

Q-values (state-action)

[Figure: the same diagram as Page 12, with Q(S1, A1) and Q(S1, A3) labeling the action branches]

Page 15:

SARSA

It is called SARSA because it uses s(t), a(t), r(t+1), s(t+1), a(t+1). A step like this uses the current π, so that each S has its a = π(S).
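A one-step sketch of the update this naming refers to, written so the five quantities s(t), a(t), r(t+1), s(t+1), a(t+1) appear explicitly (tabular Q assumed):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """The SARSA target uses the action a_next that the current policy
    actually chose at s_next, so the update is on-policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```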

Page 16:

SARSA RL Algorithm

Epsilon-greedy: with probability ε do not select the greedy action, but instead choose with equal probability among the other actions.
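Putting the pieces together, a minimal episode loop in the spirit of this slide; the environment interface (env.reset, env.step) and all names are assumptions, and the exploratory draw here is uniform over all actions, a common variant of the slide's description.

```python
import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, n_episodes=5000):
    """Tabular SARSA with epsilon-greedy selection.
    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)

    def select(s):
        if random.random() < epsilon:                     # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])      # greedy

    for _ in range(n_episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # on-policy TD update
            s, a = s_next, a_next
    return Q
```

The target uses the action the ε-greedy policy actually picks at s_next, which is what separates SARSA from Q-learning's max over next actions.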

Page 17:

On Convergence

• Using episodes:

• Some of the states are ‘terminals’

• When the computation reaches a terminal s, it stops.

• Re-start at a new state s drawn according to some probability distribution

• At the starting state, each action has a non-zero probability (exploration)

• As the number of episodes goes to infinity, Q(S,A) will converge to Q*(S,A).
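A sketch of the episodic structure these bullets describe: restart from a start-state distribution, run until a terminal state, and keep every action's selection probability above zero (here via ε-greedy). It reuses the Q table and q_learning_update sketched on the Q-learning page; the environment function and all names are assumptions.

```python
import random

def run_episodes(step_fn, start_states, is_terminal, actions,
                 epsilon=0.1, n_episodes=10000):
    """Episodic loop. Assumed: step_fn(s, a) -> (s_next, r)."""
    for _ in range(n_episodes):
        s = random.choice(start_states)          # restart distribution (uniform here)
        while not is_terminal(s):                # a terminal state ends the episode
            if random.random() < epsilon:        # every action keeps probability > 0
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r = step_fn(s, a)
            q_learning_update(s, a, r, s_next, actions)   # update sketched earlier
            s = s_next
```

Because terminal states are never updated, their entries stay at 0 in the Q table, so the bootstrap term vanishes at the end of an episode, matching the "computation stops at a terminal s" bullet above.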