An introduction to cognitive robotics EMJD ICE Summer School - 2013 Lucio Marcenaro – University of Genova (ITALY)


Page 1: Lucio marcenaro tue summer_school

An introduction to cognitive robotics
EMJD ICE Summer School - 2013

Lucio Marcenaro – University of Genova (ITALY)

Page 2: Lucio marcenaro tue summer_school

Cognitive robotics?

• Robots with intelligent behavior:
– Learn and reason
– Complex goals
– Complex world

• Robots are ideal vehicles for developing and testing cognitive capabilities:
– Learning
– Adaptation
– Classification

Page 3: Lucio marcenaro tue summer_school

Cognitive robotics

• Traditional behavior-modeling approaches are problematic and untenable.

• Perception, action and the notion of symbolic representation need to be addressed in cognitive robotics.

• Cognitive robotics views animal cognition as a starting point for the development of robotic information processing.

Page 4: Lucio marcenaro tue summer_school

Cognitive robotics

• “Immobile” Robots and Engineering Operations
– Robust space probes, ubiquitous computing

• Robots That Navigate
– Hallway robots, Field robots, Underwater explorers, stunt air vehicles

• Cooperating Robots
– Cooperative Space/Air/Land/Underwater vehicles, distributed traffic networks, smart dust.

Page 5: Lucio marcenaro tue summer_school

Some applications (1)

Page 6: Lucio marcenaro tue summer_school

Some applications (2)

Page 7: Lucio marcenaro tue summer_school

Other examples

Page 8: Lucio marcenaro tue summer_school

Outline

• Lego Mindstorms

• Simple Line Follower

• Advanced Line Follower

• Learning to follow the line

• Conclusions

Page 9: Lucio marcenaro tue summer_school

The NXT Unit – an embedded system

• 64K RAM, 256K Flash

• 32-bit ARM7 microcontroller

• 100 x 64 pixel LCD graphical display

• Sound channel with 8-bit resolution

• Bluetooth wireless communications

• Stores multiple programs
– Programs selectable using buttons

Page 10: Lucio marcenaro tue summer_school

The NXT unit

(Motor ports)

(Sensor ports)

Page 11: Lucio marcenaro tue summer_school

Motors and Sensors

Page 12: Lucio marcenaro tue summer_school

NXT Motors

• Built-in rotation sensors

Page 13: Lucio marcenaro tue summer_school

NXT Rotation Sensor

• Built in to motors

• Measure degrees or rotations

• Reads + and -

• Degrees: accuracy +/- 1

• 1 rotation = 360 degrees

Page 14: Lucio marcenaro tue summer_school

Viewing Sensors

• Connect sensor

• Turn on NXT

• Choose “View”

• Select sensor type

• Select port

Page 15: Lucio marcenaro tue summer_school

NXT Sound Sensor

• The sound sensor can measure in dB and dBA

– dB: in detecting standard [unadjusted] decibels, all sounds are measured with equal sensitivity. Thus, these sounds may include some that are too high or too low for the human ear to hear.

– dBA: in detecting adjusted decibels, the sensitivity of the sensor is adapted to the sensitivity of the human ear. In other words, these are the sounds that your ears are able to hear.

• Sound Sensor readings on the NXT are displayed in percent [%]. The lower the percent the quieter the sound.

http://mindstorms.lego.com/Overview/Sound_Sensor.aspx

Page 16: Lucio marcenaro tue summer_school

NXT Ultrasonic/Distance Sensor

• Measures distance/proximity

• Range: 0-255 cm

• Precision: +/- 3cm

• Can report in centimeters or inches

http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx

Page 17: Lucio marcenaro tue summer_school


NXT Non-standard sensors: HiTechnic.com

• Compass

• Gyroscope

• Accelerometer/tilt sensor

• Color sensor

• IRSeeker

• Prototype board with A/D converter for the I2C bus

Page 18: Lucio marcenaro tue summer_school

LEGO Mindstorms for NXT

(NXT-G)

NXT-G graphical programming language

Based on the LabVIEW programming language G

Program by drawing a flow chart

Page 19: Lucio marcenaro tue summer_school

NXT-G PC program interface

Interface elements: Toolbar, Workspace, Configuration Panel, Help & Navigation, Controller, Palettes, Tutorials, Web Portal, Sequence Beam

Page 20: Lucio marcenaro tue summer_school

Issues of the standard firmware

• Only one data type

• Unreliable Bluetooth communication

• Limited multi-tasking

• Complex motor control

• Simplistic memory management

• Not suitable for large programs

• Not suitable for development of own tools or blocks

Page 21: Lucio marcenaro tue summer_school

Other programming languages and environments

– Java leJOS

– Microsoft Robotics Studio

– RobotC

– NXC - Not eXactly C

– NXT Logo

– Lego NXT Open source firmware and software development kit

Page 22: Lucio marcenaro tue summer_school

leJOS

• A Java Virtual Machine for NXT

• Freely available
– http://lejos.sourceforge.net/

• Replaces the NXT-G firmware

• A leJOS plug-in is available for the free Eclipse development environment

• Faster than NXT-G

Page 23: Lucio marcenaro tue summer_school

Example leJOS Program

import lejos.nxt.Motor;
import lejos.nxt.SensorPort;
import lejos.nxt.UltrasonicSensor;

UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4);

// Drive forward; spin away whenever an obstacle is closer than 25 cm
Motor.A.forward();
Motor.B.forward();
while (true) {
    if (sonar.getDistance() < 25) {
        Motor.A.forward();
        Motor.B.backward();
    } else {
        Motor.A.forward();
        Motor.B.forward();
    }
}

Page 24: Lucio marcenaro tue summer_school

Event-driven Control in leJOS

• The Behavior interface
– boolean takeControl()
– void action()
– void suppress()

• Arbitrator class
– Constructor gets an array of Behavior objects
• takeControl() checked for highest index first
– start() method begins event loop

Page 25: Lucio marcenaro tue summer_school

Event-driven example

class Go implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);

    // Take control while the path ahead is clear
    public boolean takeControl() {
        return sonar.getDistance() > 25;
    }

Page 26: Lucio marcenaro tue summer_school

Event-driven example

    public void action() {
        Motor.A.forward();
        Motor.B.forward();
    }

    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}

Page 27: Lucio marcenaro tue summer_school

Event-driven example

class Spin implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);

    // Take control when an obstacle is within 25 cm
    public boolean takeControl() {
        return sonar.getDistance() <= 25;
    }

Page 28: Lucio marcenaro tue summer_school

Event-driven example

    public void action() {
        Motor.A.forward();
        Motor.B.backward();
    }

    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}

Page 29: Lucio marcenaro tue summer_school

Event-driven example

public class FindFreespace {
    public static void main(String[] a) {
        Behavior[] b = new Behavior[] {new Go(), new Spin()};
        Arbitrator arb = new Arbitrator(b);
        arb.start();
    }
}

Page 30: Lucio marcenaro tue summer_school

Simple Line Follower

• Use the light sensor as a switch

• If measured value > threshold: ON state (white surface)

• If measured value < threshold: OFF state (black surface)

Page 31: Lucio marcenaro tue summer_school

Simple Line Follower

• The robot does not travel on the line itself, but along its edge

• Turning left until an “OFF” to “ON” transition is detected

• Turning right until an “ON” to “OFF” transition is detected

Page 32: Lucio marcenaro tue summer_school

Simple Line Follower

import lejos.nxt.Button;
import lejos.nxt.ColorSensor;
import lejos.nxt.LCD;
import lejos.nxt.MotorPort;
import lejos.nxt.NXTMotor;
import lejos.nxt.SensorPort;
import lejos.robotics.Color;

NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);

while (!Button.ESCAPE.isDown()) {
    int currentColor = cs.getLightValue();
    LCD.drawInt(currentColor, 5, 11, 3);
    if (currentColor < 30) {
        // dark surface: over the line
        rightM.setPower(50);
        leftM.setPower(10);
    } else {
        // light surface: off the line
        rightM.setPower(10);
        leftM.setPower(50);
    }
}

Page 33: Lucio marcenaro tue summer_school

Simple Line Follower

• DEMO

Page 34: Lucio marcenaro tue summer_school

Advanced Line Follower

• Use the light sensor as an analog sensor

• Sensor values range between 0 and 100

• The sensor takes the average light detected over a small area

Page 35: Lucio marcenaro tue summer_school

Advanced Line Follower

• Subtract the current reading of the sensor from what the sensor should be reading
– Use this value to directly control the direction and power of the wheels

• Multiply this value by a constant: how strongly should the wheels turn to correct the path?

• Add a base value to be sure that the robot is always moving forward (see the formula below)
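
Written out, this is a simple proportional controller; the names below are the ones used in the code on the next slide:

error = currentColor - targetValue
rightPower = targetPower + amplify * error
leftPower = targetPower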

Page 36: Lucio marcenaro tue summer_school

Advanced Line Follower

import lejos.nxt.Button;
import lejos.nxt.ColorSensor;
import lejos.nxt.MotorPort;
import lejos.nxt.NXTMotor;
import lejos.nxt.SensorPort;
import lejos.robotics.Color;

NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
int targetValue = 30;  // desired reading on the line edge
int amplify = 7;       // proportional gain
int targetPower = 50;  // base forward power
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);

rightM.setPower(targetPower);
leftM.setPower(targetPower);

while (!Button.ESCAPE.isDown()) {
    int currentColor = cs.getLightValue();
    int difference = currentColor - targetValue;
    int ampDiff = difference * amplify;
    int rightPower = ampDiff + targetPower;
    int leftPower = targetPower;
    rightM.setPower(rightPower);
    leftM.setPower(leftPower);
}

Page 37: Lucio marcenaro tue summer_school

Advanced Line Follower

• DEMO

Page 38: Lucio marcenaro tue summer_school

Learn how to follow

• Goal
– Make robots do what we want
– Minimize/eliminate programming

• Proposed solution: Reinforcement Learning
– Specify desired behavior using rewards
– Express rewards in terms of sensor states
– Use machine learning to induce desired actions

• Target platform
– Lego Mindstorms NXT

Page 39: Lucio marcenaro tue summer_school

Example: Grid World

• A maze-like problem
– The agent lives in a grid
– Walls block the agent’s path

• Noisy movement: actions do not always go as planned
– 80% of the time, the preferred action is taken (if there is no wall there)
– 10% of the time, North takes the agent West; 10% East
– If there is a wall in the direction the agent would have been taken, the agent stays put

• The agent receives rewards each time step
– Small “living” reward each step (can be negative)
– Big rewards come at the end (good or bad)

• Goal: maximize the sum of rewards

Page 40: Lucio marcenaro tue summer_school

Markov Decision Processes

• An MDP is defined by:
– A set of states s ∈ S
– A set of actions a ∈ A
– A transition function T(s,a,s’)
• Probability that a from s leads to s’, i.e., P(s’ | s,a)
• Also called the model (or dynamics)
– A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
– A start state
– Maybe a terminal state

• MDPs are non-deterministic search problems
– Reinforcement learning: MDPs where we don’t know the transition or reward functions

Page 41: Lucio marcenaro tue summer_school

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means:
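
The equation on this slide is missing from the transcript; the Markov property it states is the standard one:

P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1}, …, S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)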

Andrej Andreevič Markov (1856-1922)

Page 42: Lucio marcenaro tue summer_school

Solving MDPs: policies

• In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal

• In an MDP, we want an optimal policy π*: S → A
– A policy gives an action for each state
– An optimal policy maximizes expected utility if followed
– An explicit policy defines a reflex agent

Optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s

Page 43: Lucio marcenaro tue summer_school

Example Optimal Policies

Figure: optimal policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0

Page 44: Lucio marcenaro tue summer_school

MDP Search Trees

• Each MDP state gives an expectimax-like search tree

Diagram: from a state s, taking action a leads to the q-state (s, a); the transition (s,a,s’) then lands in s’ with probability T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’). s is a state, (s, a) is a q-state.

Page 45: Lucio marcenaro tue summer_school

Utilities of Sequences

• In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards

• What preferences should an agent have over reward sequences?

• More or less? [1,2,2] or [2,3,4]

• Now or later? [1,0,0] or [0,0,1]

Page 46: Lucio marcenaro tue summer_school

Discounting

• It’s reasonable to maximize the sum of rewards

• It’s also reasonable to prefer rewards now to rewards later

• One solution: values of rewards decay exponentially

Page 47: Lucio marcenaro tue summer_school

Discounting

• Typically discount rewards by γ < 1 each time step
– Sooner rewards have higher utility than later rewards
– Also helps the algorithms converge

• Example: discount of 0.5 (worked out below):
– U([1,2,3]) = 1·1 + 0.5·2 + 0.25·3
– U([1,2,3]) < U([3,2,1])
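
Carrying the arithmetic through (not shown on the slide):

U([1,2,3]) = 1 + 0.5·2 + 0.25·3 = 2.75
U([3,2,1]) = 3 + 0.5·2 + 0.25·1 = 4.25
so U([1,2,3]) < U([3,2,1]).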

Page 48: Lucio marcenaro tue summer_school

Stationary Preferences

• Theorem: if we assume stationary preferences

• Then: there are only two ways to define utilities (both written out below)
– Additive utility
– Discounted utility
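
The formulas are missing from the transcript; these are the standard definitions the slide refers to. Stationarity means that if [a1, a2, …] is preferred to [b1, b2, …], then [r, a1, a2, …] is preferred to [r, b1, b2, …]. The two resulting utility forms are:

Additive utility:   U([r0, r1, r2, …]) = r0 + r1 + r2 + …
Discounted utility: U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …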

Page 49: Lucio marcenaro tue summer_school

Quiz: Discounting

• Given:

– Actions: East, West and Exit (available in exit states a, e)
– Transitions: deterministic

• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are East and West equally good when in state d?

Diagram: states a, b, c, d, e in a row, with exit rewards 10 and 1 at the two ends.

Page 50: Lucio marcenaro tue summer_school

Infinite Utilities?!

• Problem: infinite state sequences have infinite rewards

• Solutions:
– Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)
– Discounting: use 0 < γ < 1
• Smaller γ means a smaller “horizon” – shorter-term focus
– Absorbing state: guarantee that for every policy, a terminal state will eventually be reached

Page 51: Lucio marcenaro tue summer_school

Recap: Defining MDPs

• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transitions P(s’|s,a) (or T(s,a,s’))
– Rewards R(s,a,s’) (and discount γ)

• MDP quantities so far:
– Policy = choice of action for each state
– Utility (or return) = sum of discounted rewards


Page 52: Lucio marcenaro tue summer_school

Optimal Quantities

• Why? Optimal values define optimal policies!

• Define the value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

• Define the value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally

• Define the optimal policy:
π*(s) = optimal action from state s


Page 53: Lucio marcenaro tue summer_school

Gridworld V*(s)

• Optimal value function V*(s)

Page 54: Lucio marcenaro tue summer_school

Gridworld Q*(s,a)

• Optimal Q function Q*(s,a)

Page 55: Lucio marcenaro tue summer_school

Values of States

• Fundamental operation: compute the value of a state

– Expected utility under optimal action

– Average sum of (discounted) rewards

• Recursive definition of value (the Bellman equations, written out below)
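
The equations themselves are not in the transcript; these are the standard Bellman optimality equations the slide shows:

V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]
V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]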

Page 56: Lucio marcenaro tue summer_school

Why Not Search Trees?

• We’re doing way too much work with search trees

• Problem: states are repeated
– Idea: only compute needed quantities once

• Problem: the tree goes on forever
– Idea: do depth-limited computations, but with increasing depths until the change is small
– Note: deep parts of the tree eventually don’t matter if γ < 1

Page 57: Lucio marcenaro tue summer_school

Time-limited Values

• Key idea: time-limited values

• Define Vk(s) to be the optimal value of s if the game ends in k more time steps

– Equivalently, it’s what a depth-k search tree would give from s

Page 58: Lucio marcenaro tue summer_school

k=0

Page 59: Lucio marcenaro tue summer_school

k=1

Page 60: Lucio marcenaro tue summer_school

k=2

Page 61: Lucio marcenaro tue summer_school

k=3

Page 62: Lucio marcenaro tue summer_school

k=4

Page 63: Lucio marcenaro tue summer_school

k=5

Page 64: Lucio marcenaro tue summer_school

k=6

Page 65: Lucio marcenaro tue summer_school

k=7

Page 66: Lucio marcenaro tue summer_school

k=100

Page 67: Lucio marcenaro tue summer_school

Value Iteration

• Problems with the recursive computation:

– Have to keep all the Vk*(s) around all the time

– Don’t know which depth k(s) to ask for when planning

• Solution: value iteration

– Calculate values for all states, bottom-up

– Keep increasing k until convergence

Page 68: Lucio marcenaro tue summer_school

Value Iteration

• Idea:
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1 (update written out below)
– This is called a value update or Bellman update
– Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
– Basic idea: approximations get refined towards optimal values
– Policy may converge long before values do
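
The update equation is missing from the transcript; the standard value-iteration (Bellman) update it describes is:

V*_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*_i(s’) ]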

Page 69: Lucio marcenaro tue summer_school

Practice: Computing Actions

• Which action should we choose from state s:
– Given the optimal values V*?
– Given the optimal q-values Q*?
– Lesson: actions are easier to select from Q’s! (both cases written out below)
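
The two expressions are missing from the transcript; the standard forms are:

From values (needs a one-step look-ahead with the model):
π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]

From q-values (trivial):
π*(s) = argmax_a Q*(s,a)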

Page 70: Lucio marcenaro tue summer_school

Utilities for Fixed Policies

• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

• Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards (return) starting in s and following π

• Recursive relation (one-step look-ahead / Bellman equation): written out below

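The equation itself is not in the transcript; the standard fixed-policy Bellman equation is:

V^π(s) = Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ·V^π(s’) ]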

Page 71: Lucio marcenaro tue summer_school

Policy Evaluation

• How do we calculate the V^π’s for a fixed policy π?

• Idea one: modify the Bellman updates (iterative update written out below)
– Efficiency: O(S²) per iteration

• Idea two: without the maxes it’s just a linear system; solve with Matlab (or whatever)
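
The update is missing from the transcript; the standard iterative policy-evaluation update is:

V^π_0(s) = 0
V^π_{k+1}(s) ← Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ·V^π_k(s’) ]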

Page 72: Lucio marcenaro tue summer_school

Policy Iteration

• Problems with value iteration:
– Considering all actions on every iteration is slow: it takes |A| times longer than policy evaluation
– But the policy often doesn’t change between iterations: time wasted

• Alternative to value iteration:
– Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
– Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities (slow but infrequent)
– Repeat the two steps until the policy converges

• This is policy iteration
– It’s still optimal!
– Can converge faster under some conditions

Page 73: Lucio marcenaro tue summer_school

Policy Iteration

• Policy evaluation: with the current policy π fixed, find the values using simplified Bellman updates
– Iterate until the values converge

• Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead

(both updates written out below)
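
The two update equations are missing from the transcript; the standard forms are:

Evaluation (fixed policy π_i):
V^{π_i}_{k+1}(s) ← Σ_s’ T(s,π_i(s),s’) [ R(s,π_i(s),s’) + γ·V^{π_i}_k(s’) ]

Improvement:
π_{i+1}(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V^{π_i}(s’) ]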

Page 74: Lucio marcenaro tue summer_school

Comparison

• In value iteration:
– Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)

• In policy iteration:
– Several passes update utilities with a frozen policy
– Occasional passes update the policy

• Hybrid approaches (asynchronous policy iteration):
– Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Page 75: Lucio marcenaro tue summer_school

Reinforcement Learning

• Basic idea:
– Receive feedback in the form of rewards
– The agent’s utility is defined by the reward function
– Must learn to act so as to maximize expected rewards
– All learning is based on observed samples of outcomes

Page 76: Lucio marcenaro tue summer_school

Reinforcement Learning

• Reinforcement learning:

– Still assume an MDP:

• A set of states s S

• A set of actions (per state) A

• A model T(s,a,s’)

• A reward function R(s,a,s’)

– Still looking for a policy (s)

– New twist: don’t know T or R• I.e. don’t know which states are good or what the actions do

• Must actually try actions and states out to learn

Page 77: Lucio marcenaro tue summer_school

Model-Based Learning

• Model-based idea:
– Learn the model empirically through experience
– Solve for values as if the learned model were correct

• Step 1: Learn the empirical MDP model
– Count outcomes for each s, a
– Normalize to give an estimate of T(s,a,s’)
– Discover R(s,a,s’) when we experience (s,a,s’)

• Step 2: Solve the learned MDP
– Iterative policy evaluation, for example


Page 78: Lucio marcenaro tue summer_school

Example: Model-Based Learning

• Episodes:

Setting: γ = 1; exit rewards are +100 at (4,3) and -100 at (4,2).

Model estimated from the episodes below:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Page 79: Lucio marcenaro tue summer_school

Model-Free Learning

• Want to compute an expectation weighted by P(x) (written out below):

• Model-based: estimate P(x) from samples, compute expectation

• Model-free: estimate expectation directly from samples

• Why does this work? Because samples appear with the right frequencies!
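
The expressions are missing from the transcript; the standard forms being contrasted are:

E[f(x)] = Σ_x P(x)·f(x)
Model-based: estimate P̂(x) from samples, then E[f(x)] ≈ Σ_x P̂(x)·f(x)
Model-free:  E[f(x)] ≈ (1/N) Σ_i f(x_i), with samples x_i drawn from P(x)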

Page 80: Lucio marcenaro tue summer_school

Example: Direct Estimation

• Episodes:


(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Setting: γ = 1, living reward R = -1; exit rewards are +100 at (4,3) and -100 at (4,2).

V(2,3) ≈ (96 - 103) / 2 = -3.5
V(3,3) ≈ (99 + 97 - 102) / 3 = 31.3

Page 81: Lucio marcenaro tue summer_school

Sample-Based Policy Evaluation?

• Who needs T and R? Approximate the expectation with samples (drawn from T!)


Almost! But we only actually make progress when we move to i+1.

Page 82: Lucio marcenaro tue summer_school

Temporal-Difference Learning

• Big idea: learn from every experience!

– Update V(s) each time we experience (s,a,s’,r)

– Likely s’ will contribute updates more often

• Temporal difference learning

– Policy still fixed!

– Move values toward value of whatever successor occurs: running average!


Sample of V(s), update to V(s), and the same update in incremental form (all written out below):
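
The three expressions are missing from the transcript; the standard TD(0) forms are:

Sample of V(s): sample = R(s,π(s),s’) + γ·V^π(s’)
Update to V(s): V^π(s) ← (1 - α)·V^π(s) + α·sample
Same update:    V^π(s) ← V^π(s) + α·(sample - V^π(s))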

Page 83: Lucio marcenaro tue summer_school

Exponential Moving Average

• Exponential moving average (written out below)
– Makes recent samples more important
– Forgets about the past (distant past values were wrong anyway)
– Easy to compute from the running average

• Decreasing learning rate can give converging averages
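
The formula is missing from the transcript; the standard exponential moving average (the running average used above) is:

x̄_n = (1 - α)·x̄_{n-1} + α·x_n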

Page 84: Lucio marcenaro tue summer_school

Example: TD Policy Evaluation

Take γ = 1, α = 0.5

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Page 85: Lucio marcenaro tue summer_school

Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation

• However, if we want to turn values into a (new) policy, we’re sunk (see below):

• Idea: learn Q-values directly

• Makes action selection model-free too!

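The equation is missing from the transcript; the point is that extracting a policy from values requires the model, while Q-values do not:

π(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V(s’) ]   (needs T and R)
π(s) = argmax_a Q(s,a)                                    (model-free)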

Page 86: Lucio marcenaro tue summer_school

Active Learning

• Full reinforcement learning
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You can choose any actions you like
– Goal: learn the optimal policy
– … what value iteration did!

• In this case:
– The learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and find out what happens…

Page 87: Lucio marcenaro tue summer_school

Detour: Q-Value Iteration

• Value iteration: find successive approximations to the optimal values
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1

• But Q-values are more useful!
– Start with Q0*(s,a) = 0, which we know is right (why?)
– Given Qi*, calculate the q-values for all q-states for depth i+1 (update written out below)
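
The update equation is missing from the transcript; the standard Q-value iteration update is:

Q*_{i+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·max_a’ Q*_i(s’,a’) ]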

Page 88: Lucio marcenaro tue summer_school

Q-Learning

• Q-Learning: sample-based Q-value iteration

• Learn Q*(s,a) values
– Receive a sample (s,a,s’,r)
– Consider your old estimate Q(s,a)
– Consider your new sample estimate
– Incorporate the new estimate into a running average (update written out below)
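
The expressions are missing from the transcript; the standard Q-learning update (the same one shown later on the “Q-Learning” algorithm slide) is:

sample = r + γ·max_a’ Q(s’,a’)
Q(s,a) ← (1 - α)·Q(s,a) + α·sample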

Page 89: Lucio marcenaro tue summer_school

Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy
– If you explore enough
– If you make the learning rate small enough…
– …but do not decrease it too quickly!
– Basically, it doesn’t matter how you select actions (!)

• Neat property: off-policy learning
– Learn the optimal policy without following it (some caveats)

Page 90: Lucio marcenaro tue summer_school

Q-Learning

• Discrete sets of states and actions

– States form an N-dimensional array

• Unfolded into one dimension in practice

– Individual actions selected on each time step

• Q-values

– 2D array (indexed by state and action)

– Expected rewards for performing actions

Page 91: Lucio marcenaro tue summer_school

Q-Learning

• Table of expected rewards (“Q-values”)

– Indexed by state and action

• Algorithm steps

– Calculate state index from sensor values

– Calculate the reward

– Update previous Q-value

– Select and perform an action

• Q(s,a) ← (1 - α) Q(s,a) + α (r + γ max_a' Q(s',a'))
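
To make the steps above concrete, here is a minimal, self-contained Java sketch of a tabular Q-learner. It is illustrative only: the class and method names are not from the slides, and states/actions are assumed to be already discretized into integer indices as described on the following slides.

import java.util.Random;

public class QLearner {
    private final double[][] q;   // Q-values indexed by [state][action]
    private final double alpha;   // learning rate
    private final double gamma;   // discount factor
    private final double epsilon; // exploration probability
    private final Random rng = new Random();

    public QLearner(int numStates, int numActions,
                    double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    // Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
    public void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best);
    }

    // Epsilon-greedy selection over the current Q-values for state s
    public int selectAction(int s) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length);
        int best = 0;
        for (int a = 1; a < q[s].length; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }
}

On the robot, s would be one of the 63 state indices and a one of the five line-follow actions described on the next slides.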

Page 92: Lucio marcenaro tue summer_school

Q-Learning and Robots

• Certain sensors provide continuous values
– Sonar
– Motor encoders

• Q-Learning requires discrete inputs
– Group continuous values into discrete “buckets”
– [Mahadevan and Connell, 1992]

• Q-Learning produces discrete actions
– Forward
– Back-left/Back-right

Page 93: Lucio marcenaro tue summer_school

Creating Discrete Inputs

• Basic approach

– Discretize continuous values into sets

– Combine each discretized tuple into a single index

• Another approach

– Self-Organizing Map

– Induces a discretization of continuous values

– [Touzet 1997] [Smith 2002]

Page 94: Lucio marcenaro tue summer_school

Q-Learning Main Loop

• Select action

• Change motor speeds

• Inspect sensor values
– Calculate updated state

– Calculate reward

• Update Q values

• Set “old state” to be the updated state

Page 95: Lucio marcenaro tue summer_school

Calculating the State (Motors)

• For each motor:

– 100% power

– 93.75% power

– 87.5% power

• Six motor states

Page 96: Lucio marcenaro tue summer_school

Calculating the State (Sensors)

• No disparity: STRAIGHT

• Left/Right disparity

– 1-5: LEFT_1, RIGHT_1

– 6-12: LEFT_2, RIGHT_2

– 13+: LEFT_3, RIGHT_3

• Seven total sensor states

• 63 states overall

Page 97: Lucio marcenaro tue summer_school

Calculating Reward

• No disparity => highest value

• Reward decreases with increasing disparity

Page 98: Lucio marcenaro tue summer_school

Action Set for Line Follow

• MAINTAIN

– Both motors unchanged

• UP_LEFT, UP_RIGHT

– Accelerate motor by one motor state

• DOWN_LEFT, DOWN_RIGHT

– Decelerate motor by one motor state

• Five total actions

Page 99: Lucio marcenaro tue summer_school

Q-learning line follower

Page 100: Lucio marcenaro tue summer_school

Conclusions

• Lego Mindstorms NXT as a convenient platform for “cognitive robotics”

• Executing a task with “rules”

• Learning how to execute a task

– MDP

– Reinforcement learning

• Q-learning applied to Lego Mindstorms

Page 101: Lucio marcenaro tue summer_school

Thank you!

• Questions?