
Page 1

AN INTRODUCTION TO REINFORCEMENT LEARNING IN ROBOTICS

Nemec Bojan, Jozef Stefan Institute

Page 2

Humanoid robots

A humanoid robot is a robot with an appearance similar to the human body, allowing interaction with made-for-human tools and environments. Desired capabilities:

• self-maintenance
• autonomous learning
• avoiding harmful situations to people, property, and itself
• safe interaction with human beings and the environment

A humanoid robot is an autonomous robot that can adapt to changes in its environment or in itself.

Page 3

Robot learning

Supervised Learning
• learning approaches to regression & classification, neural networks
• learning from examples, learning from a teacher

Unsupervised Learning

Reinforcement Learning
• learning approaches to sequential decision making
• learning from a critic, learning from delayed reward

Page 4

Reinforcement learning

Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. The basic ingredients are:

• environment states S
• actions A
• rewards R
• state value V
• action-value Q

[Diagram: the agent moves through successive states S0, S1, S2 by taking actions and receiving rewards.]

Page 5

$x_{t+1} = f(x_t)$ is MDP

$x_{t+1} = f(x_t, x_{t-1}, \ldots, x_{t-n})$ is NOT MDP

Example:
Agent – driver
Environment – car on a road
States – car position, velocity, acceleration ($x, \dot{x}, \ddot{x}$)
Action – steering wheel angle; the action is the result of a policy
Reward – success or fail

Markov decision process (MDP) - A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. The outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).

The environment is typically formulated as a finite-state Markov decision process

Page 6

The basic reinforcement learning model applied to MDPs consists of:


• a set of environment states S;

• a set of actions A; and

• a set of scalar "rewards" in R.

At each time t, the agent perceives its state $s_t \in S$ and the set of possible actions $A(s_t)$. It chooses an action and receives from the environment the new state $s_{t+1}$ and a reward $r_t$. Based on these interactions, the reinforcement learning agent must develop a policy $\pi : S \to A$ which maximizes the quantity $R = r_0 + r_1 + \cdots + r_n$ for MDPs which have a terminal state, or the quantity

$$R = \sum_{t} \gamma^t r_t$$

for MDPs without terminal states.

A policy π determines which action should be performed in each state; a policy is a mapping from states to actions
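As a small illustration of the second quantity, a MATLAB-style computation of the discounted return for a finite reward sequence (the rewards below are invented):

r = [0 0 0 1];                          % example rewards r_0 ... r_n
gamma = 0.9;                            % discount factor
R = sum(gamma .^ (0:numel(r)-1) .* r);  % R = 0.9^3 = 0.729 here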

Page 7

Value Function

The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state:

$$V^{\pi}(s) = E[R \mid s_t = s] = E\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$$

Action Value Function

$$Q^{\pi}(s,a) = E[R \mid s_t = s, a_t = a] = E\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$

The optimal value function V* is the value function obtained by following the optimal policy π* : S → A, the policy that maximizes V (or Q).

Page 8

Value function update methods

Monte Carlo Method

Repeat forever:
• Choose a random policy
• Generate an entire episode
• For each state s appearing in the episode, compute the return R = R + r(s)
• The value function V(s) is the average return

$$V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$$

where α is the learning rate.

Drawback: V can only be updated after R is computed from a complete simulation run, so the method applies only to episodic tasks.
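A minimal MATLAB-style sketch of this update; the episode, its rewards, and the table size are invented for illustration:

V = zeros(1, 5);                       % value table for 5 states
alpha = 0.1;                           % learning rate
episode = [3 4 5];                     % states visited in one generated episode
rewards = [0 0 1];                     % reward received after each visit
for i = 1:numel(episode)
    G = sum(rewards(i:end));           % return R observed from step i onward
    s = episode(i);
    V(s) = V(s) + alpha * (G - V(s));  % move V(s) toward the observed return
end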

Page 9

Temporal difference learning

TD learning is a prediction method that combines Monte Carlo ideas with dynamic programming (DP) ideas. Neuroscience research has found evidence of this mechanism in human and animal brains.

The TD method updates the value (or action-value) function immediately after each visit to a new state:

$$V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$$

Benefits: suitable for non-episodic tasks, suitable for on-line implementation, better convergence.
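A minimal MATLAB-style sketch of one TD(0) update; the value table and the single observed transition are invented:

V = zeros(1, 5); alpha = 0.1; gamma = 0.9;
s = 3; s_next = 4; r = 0;              % one observed transition (s, r, s_next)
delta = r + gamma * V(s_next) - V(s);  % TD error
V(s) = V(s) + alpha * delta;           % update immediately, no full episode needed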


Page 10

Eligibility traces – TD(λ)

An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. Eligibility traces are a bridge from TD to Monte Carlo methods.

Update algorithm (repeat for all states s in each step):

$$V(s) \leftarrow V(s) + \alpha\,\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]\, e_t(s)$$

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
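A minimal MATLAB-style sketch of one TD(λ) step with the accumulating trace above; all values and the single transition are invented:

nS = 5; V = zeros(1, nS); e = zeros(1, nS);
alpha = 0.1; gamma = 0.9; lambda = 0.9;
s = 3; s_next = 4; r = 0;              % one observed transition
e = gamma * lambda * e;                % decay all traces
e(s) = e(s) + 1;                       % bump the trace of the visited state
delta = r + gamma * V(s_next) - V(s);  % TD error of this transition
V = V + alpha * delta * e;             % update every state by its trace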

Page 11

Random walk example

[Figure: value function of the 5-state random walk (rewards 0 on the left exit, 1 on the right exit) after 10 steps and after 100 steps; V plotted over states 1–5, comparing the true values with the TD, ET (eligibility traces), MC1, and MC2 estimates.]
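A runnable MATLAB-style sketch of this experiment using the TD(λ) step above, under the classic assumptions the slide does not spell out: start in the middle state, reward 1 only on exiting to the right, undiscounted returns:

nS = 5; V = zeros(1, nS);
alpha = 0.1; gamma = 1.0; lambda = 0.9;
for ep = 1:100
    e = zeros(1, nS); s = 3;                 % traces reset, start in the middle
    while true
        s_next = s + sign(randn);            % step left or right, prob 0.5 each
        r = (s_next == 6);                   % reward 1 only on the right exit
        e = gamma * lambda * e; e(s) = e(s) + 1;
        if s_next < 1 || s_next > nS
            delta = r - V(s);                % terminal: no bootstrap term
        else
            delta = r + gamma * V(s_next) - V(s);
        end
        V = V + alpha * delta * e;
        if s_next < 1 || s_next > nS, break; end
        s = s_next;
    end
end
disp(V)                                      % approaches the true values 1/6 ... 5/6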

Page 12

Direct Q-Learning methods

SARSA

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$

Because it uses the value of the action actually chosen next for the state-action update, SARSA learns the Q values of the very policy it follows (on-policy method).

Example: ε-greedy policy

$$a = \begin{cases} \arg\max_a Q(s,a) & \text{with probability } 1-\varepsilon \ \text{(exploitation)} \\ \text{random } a & \text{with probability } \varepsilon \ \text{(exploration)} \end{cases}$$

Q-Learning

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$

Because it uses the maximum future value for the state-action update, Q-learning learns the Q values of the exploitation (greedy) policy while following an exploration/exploitation policy (off-policy method).
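A minimal MATLAB-style sketch contrasting the two updates on a single invented transition; the table sizes, ε, and the transition itself are assumptions, and two separate Q-tables keep the comparison side by side:

nS = 4; nA = 2;
Qs = zeros(nS, nA); Qq = zeros(nS, nA);    % SARSA table, Q-learning table
alpha = 0.4; gamma = 0.99; epsilon = 0.1;
s = 1;
if rand < epsilon                          % epsilon-greedy: explore ...
    a = randi(nA);
else                                       % ... or exploit the greedy action
    [~, a] = max(Qs(s, :));
end
r = -1; s_next = 2;                        % assumed outcome of taking a in s
% SARSA (on-policy): bootstrap with the action the policy actually picks next
if rand < epsilon, a_next = randi(nA); else, [~, a_next] = max(Qs(s_next, :)); end
Qs(s, a) = Qs(s, a) + alpha * (r + gamma * Qs(s_next, a_next) - Qs(s, a));
% Q-learning (off-policy): bootstrap with the greedy (max) action instead
Qq(s, a) = Qq(s, a) + alpha * (r + gamma * max(Qq(s_next, :)) - Qq(s, a));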

Page 13

Cliff-walking problem

[Figure: cliff-walking grid world with the paths learned by Q-learning and SARSA; stepping into the cliff gives R = -100, every other transition R = -1.]

Page 14

Case study – Ball in a cup game

Page 15

Goal: swing a ball on a rope to the desired height with zero velocity.

State variables

hand position and velocity, rope angle and angular velocity

State boxing

• x1 = [0.3 0] (hand position)
• x2 = [-1 0 1] (hand velocity)
• x3 = [0 : 360], 18 values (rope angle)
• x4 = [-1.5 0 1.5] (rope angular velocity)
• Number of states = 342

Reward function:

if (x > 0.65) || (x < -0.05)                  % hand outside the allowed range
    r = -500 - 5*abs(x_dot);                  % large penalty, worse at high speed
elseif (abs(theta - D1) < 0.2) && (abs(theta_dot) < 0.3)
    r = 1000;                                 % success: rope angle near the target D1
                                              % with near-zero angular velocity
else
    r = theta^2;                              % shaping: larger rope angles score better
end

Learning algorithm

Q-Learning, with α = 0.4, γ = 0.99, λ = 0.9

States: a discrete representation of the continuous variables $x, \dot{x}, \theta, \dot{\theta}$.

Page 16

Actions – acceleration [-1,-0.5,0,+0.5,+1]
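A sketch of how such state boxing can be implemented: each continuous variable is mapped to a box by counting exceeded thresholds, and the four box indices are combined into one state number. The thresholds and bin width below are illustrative, not the exact boxes from the previous slide:

x = 0.2; x_dot = 0.5; theta = 45; theta_dot = 0.7;  % example continuous state
i1 = sum(x > [0 0.3]) + 1;               % hand position box
i2 = sum(x_dot > [-1 1]) + 1;            % hand velocity box
i3 = min(floor(theta / 20) + 1, 18);     % rope angle box, 20-degree bins
i4 = sum(theta_dot > [-1.5 1.5]) + 1;    % angular velocity box
state = sub2ind([3 3 18 3], i1, i2, i3, i4);  % single index into the Q-table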

Simulation results of Q Learning

Page 17

Experimental result of Q Learning

Page 18

A central issue in trajectory generation and modification is the choice of the representation (or encoding) of the trajectory.

Trajectory representation

Dynamic Motion Primitives: a second-order system + trajectory modulation
• Goal (g)
• Kernel function weights (w)
• Discrete time / continuous time

Page 19

Dynamic Motion Primitives – DMPs

Second-order system + trajectory modulation, with the time evolution of a phase variable. Learning of Dynamic Motion Primitives can be accomplished by regression.
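The slide's equations did not survive extraction; for reference, a standard form of the discrete DMP (after Ijspeert et al.), consistent with the goal g and kernel-function weights w mentioned above, is:

$$\tau \dot{z} = \alpha_z\big(\beta_z (g - y) - z\big) + f(x), \qquad \tau \dot{y} = z, \qquad \tau \dot{x} = -\alpha_x x$$

$$f(x) = \frac{\sum_i w_i \Psi_i(x)}{\sum_i \Psi_i(x)}\; x\,(g - y_0)$$

Here the $\Psi_i$ are the kernel (basis) functions; fitting the weights $w_i$ to a demonstrated trajectory is the regression step named on this slide.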

Page 20

Policy gradient methods

Policy gradient methods are a class of reinforcement learning techniques that optimize parameterized policies with respect to the expected return.

$$J(\theta) = E\!\left\{\sum_{k=0}^{H} \gamma^k r_k\right\}, \qquad \max_\theta \{J\}$$

γ – discount factor in [0, 1], r – reward, H – horizon

Update using the gradient update rule:

$$\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$$

α – learning rate. Note: gradient ascent converges only to a local maximum!

Page 21

Gradient estimation technique

Finite-difference Methods

Random variation of the parameters: $\theta_{h,i} = \theta_h + \Delta\theta_i$, $i = 1, \ldots, k$

Experimental evaluation of J: $\Delta\hat{J}_i = J(\theta_{h,i}) - J_{ref}$

Gradient estimation using regression (the rows of $\Delta\Theta$ are the variations $\Delta\theta_i$, and $\Delta\hat{J}$ stacks the $\Delta\hat{J}_i$):

$$\nabla_\theta J \approx \left(\Delta\Theta^T \Delta\Theta\right)^{-1} \Delta\Theta^T \Delta\hat{J}$$

Parameter update:

$$\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$$

Evaluating the variations $\theta_{h,1}, \theta_{h,2}, \ldots, \theta_{h,k}$ takes one episode each; one gradient update constitutes one transition.
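A compact MATLAB-style sketch of the whole loop; J_eval is a toy quadratic standing in for a real rollout evaluation, and k, σ, and α are invented:

J_eval = @(th) -(th(1)-1)^2 - (th(2)+2)^2;   % stand-in for an experimental rollout
theta = [0; 0]; alpha = 0.1; k = 8; sigma = 0.05;
for it = 1:100
    Jref = J_eval(theta);                    % reference return of current parameters
    dTheta = sigma * randn(k, 2);            % k random parameter variations
    dJ = zeros(k, 1);
    for i = 1:k
        dJ(i) = J_eval(theta + dTheta(i, :)') - Jref;  % evaluate each variation
    end
    grad = (dTheta' * dTheta) \ (dTheta' * dJ);        % regression gradient estimate
    theta = theta + alpha * grad;            % gradient ascent update
end
disp(theta')                                 % approaches the maximizer [1 -2]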

Page 22

Liquid pouring learning - simulation

[Figure: one transition of the pouring motion; J is estimated using a balance.]

Page 23

Experimental results

Page 24

Goal learning