
Page 1:

Lecture Slides for Introduction to Machine Learning, 2nd Edition
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Page 2: CHAPTER 18: Reinforcement Learning

Page 3: Introduction

Game-playing: Sequence of moves to win a game

Robot in a maze: Sequence of actions to find a goal

The agent has a state in an environment, takes an action, sometimes receives a reward, and the state changes

Credit-assignment problem: which of the earlier actions deserve credit for the eventual reward?

Learn a policy


Page 4: Single State: K-armed Bandit

Among K levers, choose the one that pays best

Q(a): value of action a; the reward received is r_a

Set Q(a) = r_a

Choose a* if Q(a*) = max_a Q(a)

If rewards are stochastic, keep an expected reward estimate instead:

$$Q_{t+1}(a) \leftarrow Q_t(a) + \eta \left[ r_{t+1}(a) - Q_t(a) \right]$$

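A minimal sketch of this update in Python, assuming a Gaussian reward per lever and an occasional uniform-random pull to keep sampling all levers; the function name, step size η, and reward model are illustrative, not from the slides.

```python
import random

def k_armed_bandit(true_means, n_steps=1000, eta=0.1, epsilon=0.1, seed=0):
    """Keep a running estimate per lever:
    Q_{t+1}(a) <- Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))."""
    rng = random.Random(seed)
    K = len(true_means)
    Q = [0.0] * K
    for _ in range(n_steps):
        if rng.random() < epsilon:          # occasionally try a random lever
            a = rng.randrange(K)
        else:                               # exploit: a* = argmax_a Q(a)
            a = max(range(K), key=lambda b: Q[b])
        r = rng.gauss(true_means[a], 1.0)   # stochastic reward r_a
        Q[a] += eta * (r - Q[a])            # move the estimate toward the sample
    return Q

print(k_armed_bandit([1.0, 2.0, 1.5]))      # estimates settle near the true means
```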

Page 5: Elements of RL (Markov Decision Processes)

s_t: state of the agent at time t

a_t: action taken at time t

In s_t, action a_t is taken, the clock ticks, reward r_{t+1} is received, and the state changes to s_{t+1}

Next state probability: P(s_{t+1} | s_t, a_t)

Reward probability: p(r_{t+1} | s_t, a_t)

Initial state(s), goal state(s)

Episode (trial) of actions from initial state to goal

(Sutton and Barto, 1998; Kaelbling et al., 1996)
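To make these elements concrete, here is a hypothetical sketch of an episodic environment in Python; the `GridMDP` name and the deterministic transitions are simplifying assumptions (the slides allow stochastic P and p).

```python
import random

class GridMDP:
    """States 0..n-1 on a line; actions -1/+1; reward 1 on reaching the goal."""
    def __init__(self, n=5):
        self.n, self.start, self.goal = n, 0, n - 1

    def step(self, s, a):
        # In s_t take a_t; the clock ticks, r_{t+1} arrives, state becomes s_{t+1}.
        s_next = min(max(s + a, 0), self.n - 1)
        return s_next, (1.0 if s_next == self.goal else 0.0)

def episode(mdp, policy, max_steps=100):
    """One trial of actions from the initial state to the goal."""
    s, trace = mdp.start, []
    for _ in range(max_steps):
        s_next, r = mdp.step(s, policy(s))
        trace.append((s, r))
        if s_next == mdp.goal:
            break
        s = s_next
    return trace

print(episode(GridMDP(), policy=lambda s: random.choice([-1, 1])))
```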


Page 6: Policy and Cumulative Reward

Policy: $\pi : \mathcal{S} \rightarrow \mathcal{A}$, $a_t = \pi(s_t)$

Value of a policy: $V^{\pi}(s_t)$

Finite horizon:
$$V^{\pi}(s_t) = E[r_{t+1} + r_{t+2} + \cdots + r_{t+T}] = E\left[\sum_{i=1}^{T} r_{t+i}\right]$$

Infinite horizon, discounted:
$$V^{\pi}(s_t) = E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots] = E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$$

where $0 \le \gamma < 1$ is the discount rate.
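A quick check of the two definitions on one sampled reward sequence; the reward numbers below are made up for illustration.

```python
def finite_horizon_return(rewards):
    # E[r_{t+1} + ... + r_{t+T}] for a single sampled sequence
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    # sum_i gamma^(i-1) * r_{t+i}, with 0 <= gamma < 1
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.5]
print(finite_horizon_return(rewards))   # 1.5
print(discounted_return(rewards))       # 0.9^2 * 1.0 + 0.9^3 * 0.5 = 1.1745
```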

Page 7: Bellman's Equation

$$V^{*}(s_t) = \max_{\pi} V^{\pi}(s_t), \quad \forall s_t$$

$$V^{*}(s_t) = \max_{a_t} E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right] = \max_{a_t} E\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1}\right] = \max_{a_t} E\left[r_{t+1} + \gamma V^{*}(s_{t+1})\right]$$

$$V^{*}(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$

This is Bellman's equation. In terms of state-action values, $V^{*}(s_t) = \max_{a_t} Q^{*}(s_t, a_t)$ (the value of $s_t$ is the value of the best action in $s_t$), and

$$Q^{*}(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})$$

Page 8: Model-Based Learning

Environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known

There is no need for exploration

Can be solved using dynamic programming

Solve for the optimal value function:
$$V^{*}(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$

Optimal policy:
$$\pi^{*}(s_t) = \arg\max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$

Page 9: Value Iteration

[Algorithm figure: value iteration]
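Since the algorithm figure did not survive extraction, here is a sketch of tabular value iteration under the usual assumptions: P and the expected rewards are known, and the Bellman backup is iterated until the largest change falls below a tolerance. The data layout (`P[s][a]` as a list of `(prob, next_state)` pairs) is my own choice.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a ( R[s][a] + gamma * sum_s' P(s'|s,a) V(s') )."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Two states, two actions: action 1 moves to state 1, where it earns reward 1.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 0)], [(1.0, 1)]]]
R = [[0.0, 0.0], [0.0, 1.0]]
print(value_iteration(P, R))   # approaches [9.0, 10.0] up to the tolerance
```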

Page 10: Policy Iteration

[Algorithm figure: policy iteration]
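A matching sketch of policy iteration, reusing the assumed `P`, `R` layout above: evaluate the current policy, then improve it greedily, until the policy is stable.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    n = len(P)
    pi, V = [0] * n, [0.0] * n
    while True:
        # Policy evaluation (iterative): V(s) <- R[s][pi(s)] + gamma * E[V(s')]
        while True:
            delta = 0.0
            for s in range(n):
                v = R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in range(n):
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(len(P[s]))]
            best = q.index(max(q))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V

P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 0)], [(1.0, 1)]]]
R = [[0.0, 0.0], [0.0, 1.0]]
print(policy_iteration(P, R))  # policy [1, 1]: always move toward state 1
```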

Page 11: Temporal Difference Learning

Environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is not known; model-free learning

There is a need for exploration to sample from $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$

Use the reward received in the next time step to update the value of the current state (action)

The update is driven by the temporal difference between the current value estimate and the reward plus the discounted value of the next state (action)


Page 12: Exploration Strategies

ε-greedy: with probability ε, choose one action uniformly at random; choose the best action with probability 1−ε

Probabilistic (softmax over the action values):
$$P(a \mid s) = \frac{\exp Q(s, a)}{\sum_{b=1}^{|\mathcal{A}|} \exp Q(s, b)}$$

Move smoothly from exploration to exploitation

Decrease ε in time

Annealing, with a temperature T that is decreased gradually:
$$P(a \mid s) = \frac{\exp\left[ Q(s, a)/T \right]}{\sum_{b=1}^{|\mathcal{A}|} \exp\left[ Q(s, b)/T \right]}$$
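Both strategies in a short Python sketch; the helper names and the example Q values are mine.

```python
import math
import random

def epsilon_greedy(Q_s, epsilon, rng=random):
    """With probability epsilon choose uniformly; otherwise the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_s))
    return max(range(len(Q_s)), key=lambda a: Q_s[a])

def softmax_action(Q_s, T=1.0, rng=random):
    """P(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T).
    Large T explores almost uniformly; as T -> 0 this becomes greedy."""
    weights = [math.exp(q / T) for q in Q_s]
    return rng.choices(range(len(Q_s)), weights=weights)[0]

Q_s = [0.2, 1.0, 0.5]
print(epsilon_greedy(Q_s, epsilon=0.1))
print(softmax_action(Q_s, T=0.5))
```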

Page 13: Deterministic Rewards and Actions

In general:
$$Q^{*}(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})$$

Deterministic: a single possible reward and next state, so
$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$

used as an update rule (backup):
$$\hat{Q}(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1})$$

Starting at zero, Q values increase, never decrease.

Page 14:

[Figure: the two paths A and B to the goal, with their Q values; γ = 0.9]

Consider the value of the action marked by '*':

If path A is seen first, Q(*) = 0.9 × max(0, 81) ≈ 73; then, when B is seen, Q(*) = 0.9 × max(100, 81) = 90

Or, if path B is seen first, Q(*) = 0.9 × max(100, 0) = 90; then, when A is seen, Q(*) = 0.9 × max(100, 81) = 90

Q values increase but never decrease

Page 15: Nondeterministic Rewards and Actions

When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep averages (expected values) instead of making assignments

Q-learning (Watkins and Dayan, 1992), a backup of the current estimate toward the sampled target:
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta \left( r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right)$$

Off-policy vs on-policy (Sarsa)

Learning V (TD-learning: Sutton, 1988):
$$V(s_t) \leftarrow V(s_t) + \eta \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$$

Page 16: Q-learning

[Algorithm figure: Q-learning]
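With the figure missing, a sketch of tabular Q-learning with an ε-greedy behavior policy; the `Chain` toy environment and its interface (`reset`, `step`, `n_states`, `n_actions`) are assumptions for illustration.

```python
import random

class Chain:
    """Deterministic chain 0..4; actions 0 (left) / 1 (right); reward 1 at state 4."""
    n_states, n_actions = 5, 2
    def reset(self):
        return 0
    def step(self, s, a):
        s2 = min(max(s + (1 if a == 1 else -1), 0), 4)
        return s2, float(s2 == 4), s2 == 4   # next state, reward, done

def q_learning(env, n_episodes=500, eta=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Off-policy: the target uses max_a' Q(s',a'), whatever action comes next."""
    rng = random.Random(seed)
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda b: Q[s][b])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += eta * (target - Q[s][a])   # backup toward sampled target
            s = s2
    return Q

print(q_learning(Chain())[0])   # action 1 (right) ends up preferred in state 0
```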

Page 17: Sarsa

[Algorithm figure: Sarsa]
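The on-policy counterpart; the only change from the Q-learning sketch is that the update uses the action actually chosen in s′, not the max (same assumed `Chain`-style interface as above).

```python
import random

def sarsa(env, n_episodes=500, eta=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """On-policy: Q(s,a) <- Q(s,a) + eta * (r + gamma * Q(s',a') - Q(s,a))."""
    rng = random.Random(seed)
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]

    def choose(s):   # the behavior policy is also the policy being learned
        if rng.random() < epsilon:
            return rng.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda b: Q[s][b])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = choose(s)
        while not done:
            s2, r, done = env.step(s, a)
            a2 = choose(s2)
            target = r if done else r + gamma * Q[s2][a2]   # uses a', not max
            Q[s][a] += eta * (target - Q[s][a])
            s, a = s2, a2
    return Q

# print(sarsa(Chain())[0])   # with the Chain environment from the previous sketch
```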

Page 18: Eligibility Traces

Keep a record of previously visited states (actions):

$$e_t(s, a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

$$Q(s, a) \leftarrow Q(s, a) + \eta\, \delta_t\, e_t(s, a), \quad \forall s, a$$

Page 19: Sarsa(λ)

[Algorithm figure: Sarsa(λ)]
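A sketch of Sarsa(λ) with the replacing traces defined on the Eligibility Traces slide; looping over all pairs on every step is the simplest (not the fastest) implementation, and the environment interface is the same assumed one as before.

```python
import random

def sarsa_lambda(env, n_episodes=500, eta=0.1, gamma=0.9, lam=0.8,
                 epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]

    def choose(s):
        if rng.random() < epsilon:
            return rng.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda b: Q[s][b])

    for _ in range(n_episodes):
        e = [[0.0] * env.n_actions for _ in range(env.n_states)]  # traces
        s, done = env.reset(), False
        a = choose(s)
        while not done:
            s2, r, done = env.step(s, a)
            a2 = choose(s2)
            delta = r + (0.0 if done else gamma * Q[s2][a2]) - Q[s][a]
            e[s][a] = 1.0                        # visited pair: trace set to 1
            for si in range(env.n_states):       # every pair moves by its trace
                for ai in range(env.n_actions):
                    Q[si][ai] += eta * delta * e[si][ai]
                    e[si][ai] *= gamma * lam     # then decays by gamma * lambda
            s, a = s2, a2
    return Q
```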

Page 20: Generalization

Tabular: Q(s, a) or V(s) stored in a table

Regressor: use a learner with parameters $\theta$ to estimate Q(s, a) or V(s)

Train by gradient descent on the squared TD error:
$$E^{t}(\theta) = \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2$$
$$\Delta\theta = \eta \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \nabla_{\theta} Q(s_t, a_t)$$

With eligibility traces:
$$\Delta\theta = \eta\, \delta_t\, e_t, \quad \delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \quad e_t = \gamma \lambda\, e_{t-1} + \nabla_{\theta} Q(s_t, a_t)$$
with $e_0$ all zeros.
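For a linear model Q(s, a) = θ·φ(s, a), the gradient is just φ(s, a), so one TD step takes a few lines; the feature vectors below are made up for illustration.

```python
def td_step_linear(theta, phi, phi_next, r, eta=0.1, gamma=0.9):
    """theta <- theta + eta * (r + gamma * Q(s',a') - Q(s,a)) * grad Q(s,a),
    with Q = theta . phi and grad Q = phi for a linear regressor."""
    q = sum(t * f for t, f in zip(theta, phi))
    q_next = sum(t * f for t, f in zip(theta, phi_next))
    delta = r + gamma * q_next - q                 # the TD error
    return [t + eta * delta * f for t, f in zip(theta, phi)]

theta = [0.0, 0.0, 0.0]
theta = td_step_linear(theta, [1.0, 0.0, 1.0], [0.0, 1.0, 0.0], r=1.0)
print(theta)   # [0.1, 0.0, 0.1]
```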

Page 21: Partially Observable States

The agent does not know its state but receives an observation $p(o_{t+1} \mid s_t, a_t)$, which can be used to infer a belief about states

Partially Observable MDP (POMDP)

Page 22: The Tiger Problem

Two doors, behind one of which there is a tiger

p: the probability that the tiger is behind the left door

R(a_L) = −100p + 80(1−p),  R(a_R) = 90p − 100(1−p)

We can sense with a reward of R(a_S) = −1

We have unreliable sensors, with P(o_L | z_L) = P(o_R | z_R) = 0.7

Page 23:

If we sense $o_L$, our belief in the tiger's position changes:
$$p' = P(z_L \mid o_L) = \frac{P(o_L \mid z_L)\, P(z_L)}{P(o_L)} = \frac{0.7p}{0.7p + 0.3(1-p)}$$

The expected rewards of the actions given the observation:
$$R(a_L \mid o_L) = r(a_L, z_L) P(z_L \mid o_L) + r(a_L, z_R) P(z_R \mid o_L) = \frac{-100 \cdot 0.7p + 80 \cdot 0.3(1-p)}{P(o_L)}$$
$$R(a_R \mid o_L) = r(a_R, z_L) P(z_L \mid o_L) + r(a_R, z_R) P(z_R \mid o_L) = \frac{90 \cdot 0.7p - 100 \cdot 0.3(1-p)}{P(o_L)}$$
$$R(a_S \mid o_L) = -1$$
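The belief update as code, assuming the 0.7-reliable sensor above; the function names are mine.

```python
def belief_after_oL(p, reliability=0.7):
    """P(z_L | o_L) by Bayes' rule, with P(o_L|z_L) = P(o_R|z_R) = reliability."""
    p_oL = reliability * p + (1 - reliability) * (1 - p)
    return reliability * p / p_oL, p_oL

def rewards_given_oL(p):
    p2, _ = belief_after_oL(p)
    return (-100 * p2 + 80 * (1 - p2),    # R(a_L | o_L)
            90 * p2 - 100 * (1 - p2),     # R(a_R | o_L)
            -1.0)                         # R(a_S | o_L)

print(belief_after_oL(0.5))    # (0.7, 0.5): sensing o_L shifts belief left
print(rewards_given_oL(0.5))   # (-46.0, 33.0, -1.0)
```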

Page 24:

The value of the belief state after sensing, choosing the best action for each observation:
$$V' = \sum_j \max_i R(a_i \mid o_j)\, P(o_j) = \max\left( R(a_L \mid o_L), R(a_S \mid o_L), R(a_R \mid o_L) \right) P(o_L) + \max\left( R(a_L \mid o_R), R(a_S \mid o_R), R(a_R \mid o_R) \right) P(o_R)$$

Written out in terms of p, the four combinations of acting after $o_L$ and after $o_R$ give
$$\max\big( -100p + 80(1-p),\; -43p - 46(1-p),\; 33p + 26(1-p),\; 90p - 100(1-p) \big)$$
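A numeric check of this computation (names are mine; the −1 cost of the sense action itself is left out, matching the expression above): acting on the observation beats acting blindly for middling beliefs.

```python
def v_act_now(p):
    """Best immediate action: max(R(a_L), R(a_R))."""
    return max(-100 * p + 80 * (1 - p), 90 * p - 100 * (1 - p))

def v_after_sensing(p, reliability=0.7):
    """sum_j max_i R(a_i|o_j) P(o_j); the P(o_j) factors are absorbed so
    every branch is linear in p."""
    q, r = reliability, 1 - reliability
    term_L = max(-100 * q * p + 80 * r * (1 - p),    # a_L after o_L
                 90 * q * p - 100 * r * (1 - p))     # a_R after o_L
    term_R = max(-100 * r * p + 80 * q * (1 - p),    # a_L after o_R
                 90 * r * p - 100 * q * (1 - p))     # a_R after o_R
    return term_L + term_R

for p in (0.2, 0.5, 0.8):
    print(p, v_act_now(p), v_after_sensing(p))
# at p = 0.5: act now = -5, after sensing = 29.5 = 33*0.5 + 26*0.5
```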

Page 25:

[Figure: the expected rewards of the actions as a function of the belief p]

Page 26:

Let us say the tiger can move from one room to the other with probability 0.8. Before we act, the belief must be updated:
$$p' = 0.2p + 0.8(1-p)$$
$$V' = \max\big( -100p' + 80(1-p'),\; 33p' + 26(1-p'),\; 90p' - 100(1-p') \big)$$

Page 27:

When planning for episodes of length two, we can take $a_L$, take $a_R$, or sense and wait:
$$V_2 = \max\big( -100p + 80(1-p),\; 90p - 100(1-p),\; -1 + V' \big)$$

where $V'$ is the value of acting on the belief refined by sensing.

Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © The MIT Press (V1.0)