An introduction to cognitive robotics EMJD ICE Summer School - 2013 Lucio Marcenaro – University of Genova (ITALY)


Page 1: Lucio marcenaro tue summer_school

An introduction to cognitive robotics
EMJD ICE Summer School - 2013

Lucio Marcenaro – University of Genova (ITALY)

Page 2: Lucio marcenaro tue summer_school

Cognitive robotics?

• Robots with intelligent behavior:
– Learn and reason
– Complex goals
– Complex world

• Robots are ideal vehicles for developing and testing cognitive capabilities:
– Learning
– Adaptation
– Classification

Page 3: Lucio marcenaro tue summer_school

Cognitive robotics

• Traditional behavior-modeling approaches are problematic and untenable.

• Perception, action and the notion of symbolic representation need to be addressed in cognitive robotics.

• Cognitive robotics views animal cognition as a starting point for the development of robotic information processing.

Page 4: Lucio marcenaro tue summer_school

Cognitive robotics

• “Immobile” Robots and Engineering Operations
– Robust space probes, ubiquitous computing

• Robots That Navigate
– Hallway robots, Field robots, Underwater explorers, stunt air vehicles

• Cooperating Robots
– Cooperative Space/Air/Land/Underwater vehicles, distributed traffic networks, smart dust.

Page 5: Lucio marcenaro tue summer_school

Some applications (1)

Page 6: Lucio marcenaro tue summer_school

Some applications (2)

Page 7: Lucio marcenaro tue summer_school

Other examples

Page 8: Lucio marcenaro tue summer_school

Outline

• Lego Mindstorms

• Simple Line Follower

• Advanced Line Follower

• Learning to follow the line

• Conclusions

Page 9: Lucio marcenaro tue summer_school

The NXT Unit – an embedded system

• 64K RAM, 256K Flash

• 32-bit ARM7 microcontroller

• 100 x 64 pixel LCD graphical display

• Sound channel with 8-bit resolution

• Bluetooth wireless communications

• Stores multiple programs
– Programs selectable using buttons

Page 10: Lucio marcenaro tue summer_school

The NXT unit

(Motor ports)

(Sensor ports)

Page 11: Lucio marcenaro tue summer_school

Motors and Sensors

Page 12: Lucio marcenaro tue summer_school

NXT Motors

• Built-in rotation sensors

Page 13: Lucio marcenaro tue summer_school

NXT Rotation Sensor

• Built in to motors

• Measure degrees or rotations

• Reads + and -

• Degrees: accuracy +/- 1

• 1 rotation = 360 degrees

Page 14: Lucio marcenaro tue summer_school

Viewing Sensors

• Connect sensor

• Turn on NXT

• Choose “View”

• Select sensor type

• Select port

Page 15: Lucio marcenaro tue summer_school

NXT Sound Sensor

• The sound sensor can measure in dB and dBA

– dB: in detecting standard [unadjusted] decibels, all sounds are measured with equal sensitivity. Thus, these sounds may include some that are too high or too low for the human ear to hear.

– dBA: in detecting adjusted decibels, the sensitivity of the sensor is adapted to the sensitivity of the human ear. In other words, these are the sounds that your ears are able to hear.

• Sound Sensor readings on the NXT are displayed in percent [%]. The lower the percent the quieter the sound.

http://mindstorms.lego.com/Overview/Sound_Sensor.aspx

Page 16: Lucio marcenaro tue summer_school

NXT Ultrasonic/Distance Sensor

• Measures distance/proximity

• Range: 0-255 cm

• Precision: +/- 3cm

• Can report in centimeters or inches

http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx

Page 17: Lucio marcenaro tue summer_school


NXT Non-standard sensors: HiTechnic.com

• Compass

• Gyroscope

• Accelerometer/tilt sensor

• Color sensor

• IRSeeker

• Prototype board with A/D converter for the I2C bus

Page 18: Lucio marcenaro tue summer_school

LEGO Mindstorms for NXT

(NXT-G)

NXT-G graphical programming language

Based on the LabVIEW programming language G

Program by drawing a flow chart

Page 19: Lucio marcenaro tue summer_school

NXT-G PC program interface

Interface elements: Toolbar, Workspace, Configuration Panel, Help & Navigation, Controller, Palettes, Tutorials, Web Portal, Sequence Beam

Page 20: Lucio marcenaro tue summer_school

Issues of the standard firmware

• Only one data type

• Unreliable Bluetooth communication

• Limited multi-tasking

• Complex motor control

• Simplistic memory management

• Not suitable for large programs

• Not suitable for development of own tools or blocks

Page 21: Lucio marcenaro tue summer_school

Other programming languages and environments

– Java leJOS

– Microsoft Robotics Studio

– RobotC

– NXC - Not eXactly C

– NXT Logo

– Lego NXT Open source firmware and software development kit

Page 22: Lucio marcenaro tue summer_school

leJOS

• A Java Virtual Machine for NXT

• Freely available
– http://lejos.sourceforge.net/

• Replaces the NXT-G firmware

• A leJOS plug-in is available for the free Eclipse development environment

• Faster than NXT-G

Page 23: Lucio marcenaro tue summer_school

Example leJOS Program

import lejos.nxt.Motor;
import lejos.nxt.SensorPort;
import lejos.nxt.UltrasonicSensor;

UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4);

// Drive forward; spin away whenever an obstacle is closer than 25 cm
Motor.A.forward();
Motor.B.forward();
while (true) {
    if (sonar.getDistance() < 25) {
        Motor.A.forward();
        Motor.B.backward();
    } else {
        Motor.A.forward();
        Motor.B.forward();
    }
}

Page 24: Lucio marcenaro tue summer_school

Event-driven Control in leJOS

• The Behavior interface
– boolean takeControl()
– void action()
– void suppress()

• Arbitrator class
– Constructor gets an array of Behavior objects
• takeControl() checked for highest index first
– start() method begins event loop

Page 25: Lucio marcenaro tue summer_school

Event-driven example

class Go implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);

    // Take control while the path ahead is clear
    public boolean takeControl() {
        return sonar.getDistance() > 25;
    }

Page 26: Lucio marcenaro tue summer_school

Event-driven example

    public void action() {
        Motor.A.forward();
        Motor.B.forward();
    }

    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}

Page 27: Lucio marcenaro tue summer_school

Event-driven example

class Spin implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);

    // Take control when an obstacle is within 25 cm
    public boolean takeControl() {
        return sonar.getDistance() <= 25;
    }

Page 28: Lucio marcenaro tue summer_school

Event-driven example

    public void action() {
        Motor.A.forward();
        Motor.B.backward();
    }

    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}

Page 29: Lucio marcenaro tue summer_school

Event-driven example

public class FindFreespace {
    public static void main(String[] a) {
        Behavior[] b = new Behavior[] {new Go(), new Spin()};
        Arbitrator arb = new Arbitrator(b);
        arb.start();
    }
}

Page 30: Lucio marcenaro tue summer_school

Simple Line Follower

• Use the light sensor as a switch

• If measured value > threshold: ON state (white surface)

• If measured value < threshold: OFF state (black surface)

Page 31: Lucio marcenaro tue summer_school

Simple Line Follower

• The robot does not travel on the line itself, but along its edge

• Turning left until an “OFF” to “ON” transition is detected

• Turning right until an “ON” to “OFF” transition is detected

Page 32: Lucio marcenaro tue summer_school

Simple Line Follower

import lejos.nxt.Button;
import lejos.nxt.ColorSensor;
import lejos.nxt.LCD;
import lejos.nxt.MotorPort;
import lejos.nxt.NXTMotor;
import lejos.nxt.SensorPort;
import lejos.robotics.Color;

NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);

while (!Button.ESCAPE.isDown()) {
    int currentColor = cs.getLightValue();
    LCD.drawInt(currentColor, 5, 11, 3);
    if (currentColor < 30) {
        // dark surface: over the line
        rightM.setPower(50);
        leftM.setPower(10);
    } else {
        // light surface: off the line
        rightM.setPower(10);
        leftM.setPower(50);
    }
}

Page 33: Lucio marcenaro tue summer_school

Simple Line Follower

• DEMO

Page 34: Lucio marcenaro tue summer_school

Advanced Line Follower

• Use the light sensor as an analog sensor

• Sensor values range between 0 and 100

• The sensor takes the average light detected over a small area

Page 35: Lucio marcenaro tue summer_school

Advanced Line Follower

• Subtract the current reading of the sensor from what the sensor should be reading
– Use this value to directly control the direction and power of the wheels

• Multiply this value by a constant: how strongly should the wheels turn to correct the path?

• Add a base value to be sure that the robot is always moving forward (see the formula below)
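
Written out, this is a simple proportional controller; the names below are the ones used in the code on the next slide:

error = currentColor - targetValue
rightPower = targetPower + amplify * error
leftPower = targetPower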

Page 36: Lucio marcenaro tue summer_school

Advanced Line Follower

import lejos.nxt.Button;
import lejos.nxt.ColorSensor;
import lejos.nxt.MotorPort;
import lejos.nxt.NXTMotor;
import lejos.nxt.SensorPort;
import lejos.robotics.Color;

NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
int targetValue = 30;  // desired reading on the line edge
int amplify = 7;       // proportional gain
int targetPower = 50;  // base forward power
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);

rightM.setPower(targetPower);
leftM.setPower(targetPower);

while (!Button.ESCAPE.isDown()) {
    int currentColor = cs.getLightValue();
    int difference = currentColor - targetValue;
    int ampDiff = difference * amplify;
    int rightPower = ampDiff + targetPower;
    int leftPower = targetPower;
    rightM.setPower(rightPower);
    leftM.setPower(leftPower);
}

Page 37: Lucio marcenaro tue summer_school

Advanced Line Follower

• DEMO

Page 38: Lucio marcenaro tue summer_school

Learn how to follow

• Goal
– Make robots do what we want
– Minimize/eliminate programming

• Proposed solution: Reinforcement Learning
– Specify desired behavior using rewards
– Express rewards in terms of sensor states
– Use machine learning to induce desired actions

• Target platform
– Lego Mindstorms NXT

Page 39: Lucio marcenaro tue summer_school

Example: Grid World

• A maze-like problem
– The agent lives in a grid
– Walls block the agent’s path

• Noisy movement: actions do not always go as planned
– 80% of the time, the preferred action is taken (if there is no wall there)
– 10% of the time, North takes the agent West; 10% East
– If there is a wall in the direction the agent would have been taken, the agent stays put

• The agent receives rewards each time step
– Small “living” reward each step (can be negative)
– Big rewards come at the end (good or bad)

• Goal: maximize the sum of rewards

Page 40: Lucio marcenaro tue summer_school

Markov Decision Processes

• An MDP is defined by:
– A set of states s ∈ S
– A set of actions a ∈ A
– A transition function T(s,a,s’)
• Probability that a from s leads to s’, i.e., P(s’ | s,a)
• Also called the model (or dynamics)
– A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
– A start state
– Maybe a terminal state

• MDPs are non-deterministic search problems
– Reinforcement learning: MDPs where we don’t know the transition or reward functions

Page 41: Lucio marcenaro tue summer_school

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means:
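
The equation on this slide is missing from the transcript; the Markov property it states is the standard one:

P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1}, …, S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)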

Andrej Andreevič Markov (1856-1922)

Page 42: Lucio marcenaro tue summer_school

Solving MDPs: policies

• In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal

• In an MDP, we want an optimal policy π*: S → A
– A policy gives an action for each state
– An optimal policy maximizes expected utility if followed
– An explicit policy defines a reflex agent

Optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s

Page 43: Lucio marcenaro tue summer_school

Example Optimal Policies

Figure: optimal policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0

Page 44: Lucio marcenaro tue summer_school

MDP Search Trees

• Each MDP state gives an expectimax-like search tree

Diagram: from a state s, taking action a leads to the q-state (s, a); the transition (s,a,s’) then lands in s’ with probability T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’). s is a state, (s, a) is a q-state.

Page 45: Lucio marcenaro tue summer_school

Utilities of Sequences

• In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards

• What preferences should an agent have over reward sequences?

• More or less? [1,2,2] or [2,3,4]

• Now or later? [1,0,0] or [0,0,1]

Page 46: Lucio marcenaro tue summer_school

Discounting

• It’s reasonable to maximize the sum of rewards

• It’s also reasonable to prefer rewards now to rewards later

• One solution: values of rewards decay exponentially

Page 47: Lucio marcenaro tue summer_school

Discounting

• Typically discount rewards by γ < 1 each time step
– Sooner rewards have higher utility than later rewards
– Also helps the algorithms converge

• Example: discount of 0.5 (worked out below):
– U([1,2,3]) = 1·1 + 0.5·2 + 0.25·3
– U([1,2,3]) < U([3,2,1])
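
Carrying the arithmetic through (not shown on the slide):

U([1,2,3]) = 1 + 0.5·2 + 0.25·3 = 2.75
U([3,2,1]) = 3 + 0.5·2 + 0.25·1 = 4.25
so U([1,2,3]) < U([3,2,1]).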

Page 48: Lucio marcenaro tue summer_school

Stationary Preferences

• Theorem: if we assume stationary preferences

• Then: there are only two ways to define utilities (both written out below)
– Additive utility
– Discounted utility
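
The formulas are missing from the transcript; these are the standard definitions the slide refers to. Stationarity means that if [a1, a2, …] is preferred to [b1, b2, …], then [r, a1, a2, …] is preferred to [r, b1, b2, …]. The two resulting utility forms are:

Additive utility:   U([r0, r1, r2, …]) = r0 + r1 + r2 + …
Discounted utility: U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …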

Page 49: Lucio marcenaro tue summer_school

Quiz: Discounting

• Given:

– Actions: East, West and Exit (available in exit states a, e)
– Transitions: deterministic

• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are East and West equally good when in state d?

Diagram: states a, b, c, d, e in a row, with exit rewards 10 and 1 at the two ends.

Page 50: Lucio marcenaro tue summer_school

Infinite Utilities?!

• Problem: infinite state sequences have infinite rewards

• Solutions:
– Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)
– Discounting: use 0 < γ < 1
• Smaller γ means a smaller “horizon” – shorter-term focus
– Absorbing state: guarantee that for every policy, a terminal state will eventually be reached

Page 51: Lucio marcenaro tue summer_school

Recap: Defining MDPs

• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transitions P(s’|s,a) (or T(s,a,s’))
– Rewards R(s,a,s’) (and discount γ)

• MDP quantities so far:
– Policy = choice of action for each state
– Utility (or return) = sum of discounted rewards


Page 52: Lucio marcenaro tue summer_school

Optimal Quantities

• Why? Optimal values define optimal policies!

• Define the value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

• Define the value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally

• Define the optimal policy:
π*(s) = optimal action from state s


Page 53: Lucio marcenaro tue summer_school

Gridworld V*(s)

• Optimal value function V*(s)

Page 54: Lucio marcenaro tue summer_school

Gridworld Q*(s,a)

• Optimal Q function Q*(s,a)

Page 55: Lucio marcenaro tue summer_school

Values of States

• Fundamental operation: compute the value of a state

– Expected utility under optimal action

– Average sum of (discounted) rewards

• Recursive definition of value (the Bellman equations, written out below)
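
The equations themselves are not in the transcript; these are the standard Bellman optimality equations the slide shows:

V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]
V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]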

Page 56: Lucio marcenaro tue summer_school

Why Not Search Trees?

• We’re doing way too much work with search trees

• Problem: states are repeated
– Idea: only compute needed quantities once

• Problem: the tree goes on forever
– Idea: do depth-limited computations, but with increasing depths until the change is small
– Note: deep parts of the tree eventually don’t matter if γ < 1

Page 57: Lucio marcenaro tue summer_school

Time-limited Values

• Key idea: time-limited values

• Define Vk(s) to be the optimal value of s if the game ends in k more time steps

– Equivalently, it’s what a depth-k search tree would give from s

Page 58: Lucio marcenaro tue summer_school

k=0

Page 59: Lucio marcenaro tue summer_school

k=1

Page 60: Lucio marcenaro tue summer_school

k=2

Page 61: Lucio marcenaro tue summer_school

k=3

Page 62: Lucio marcenaro tue summer_school

k=4

Page 63: Lucio marcenaro tue summer_school

k=5

Page 64: Lucio marcenaro tue summer_school

k=6

Page 65: Lucio marcenaro tue summer_school

k=7

Page 66: Lucio marcenaro tue summer_school

k=100

Page 67: Lucio marcenaro tue summer_school

Value Iteration

• Problems with the recursive computation:

– Have to keep all the Vk*(s) around all the time

– Don’t know which depth k(s) to ask for when planning

• Solution: value iteration

– Calculate values for all states, bottom-up

– Keep increasing k until convergence

Page 68: Lucio marcenaro tue summer_school

Value Iteration

• Idea:
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1 (update written out below)
– This is called a value update or Bellman update
– Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
– Basic idea: approximations get refined towards optimal values
– Policy may converge long before values do
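
The update equation is missing from the transcript; the standard value-iteration (Bellman) update it describes is:

V*_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*_i(s’) ]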

Page 69: Lucio marcenaro tue summer_school

Practice: Computing Actions

• Which action should we choose from state s:
– Given the optimal values V*?
– Given the optimal q-values Q*?
– Lesson: actions are easier to select from Q’s! (both cases written out below)
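
The two expressions are missing from the transcript; the standard forms are:

From values (needs a one-step look-ahead with the model):
π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V*(s’) ]

From q-values (trivial):
π*(s) = argmax_a Q*(s,a)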

Page 70: Lucio marcenaro tue summer_school

Utilities for Fixed Policies

• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

• Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards (return) starting in s and following π

• Recursive relation (one-step look-ahead / Bellman equation): written out below

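The equation itself is not in the transcript; the standard fixed-policy Bellman equation is:

V^π(s) = Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ·V^π(s’) ]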

Page 71: Lucio marcenaro tue summer_school

Policy Evaluation

• How do we calculate the V^π’s for a fixed policy π?

• Idea one: modify the Bellman updates (iterative update written out below)
– Efficiency: O(S²) per iteration

• Idea two: without the maxes it’s just a linear system; solve with Matlab (or whatever)
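
The update is missing from the transcript; the standard iterative policy-evaluation update is:

V^π_0(s) = 0
V^π_{k+1}(s) ← Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ·V^π_k(s’) ]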

Page 72: Lucio marcenaro tue summer_school

Policy Iteration

• Problems with value iteration:
– Considering all actions on every iteration is slow: it takes |A| times longer than policy evaluation
– But the policy often doesn’t change between iterations: time wasted

• Alternative to value iteration:
– Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
– Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities (slow but infrequent)
– Repeat the two steps until the policy converges

• This is policy iteration
– It’s still optimal!
– Can converge faster under some conditions

Page 73: Lucio marcenaro tue summer_school

Policy Iteration

• Policy evaluation: with the current policy π fixed, find the values using simplified Bellman updates
– Iterate until the values converge

• Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead

(both updates written out below)
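
The two update equations are missing from the transcript; the standard forms are:

Evaluation (fixed policy π_i):
V^{π_i}_{k+1}(s) ← Σ_s’ T(s,π_i(s),s’) [ R(s,π_i(s),s’) + γ·V^{π_i}_k(s’) ]

Improvement:
π_{i+1}(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V^{π_i}(s’) ]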

Page 74: Lucio marcenaro tue summer_school

Comparison

• In value iteration:
– Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)

• In policy iteration:
– Several passes update utilities with a frozen policy
– Occasional passes update the policy

• Hybrid approaches (asynchronous policy iteration):
– Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Page 75: Lucio marcenaro tue summer_school

Reinforcement Learning

• Basic idea:
– Receive feedback in the form of rewards
– The agent’s utility is defined by the reward function
– Must learn to act so as to maximize expected rewards
– All learning is based on observed samples of outcomes

Page 76: Lucio marcenaro tue summer_school

Reinforcement Learning

• Reinforcement learning:

– Still assume an MDP:

• A set of states s S

• A set of actions (per state) A

• A model T(s,a,s’)

• A reward function R(s,a,s’)

– Still looking for a policy (s)

– New twist: don’t know T or R• I.e. don’t know which states are good or what the actions do

• Must actually try actions and states out to learn

Page 77: Lucio marcenaro tue summer_school

Model-Based Learning

• Model-based idea:
– Learn the model empirically through experience
– Solve for values as if the learned model were correct

• Step 1: Learn the empirical MDP model
– Count outcomes for each s, a
– Normalize to give an estimate of T(s,a,s’)
– Discover R(s,a,s’) when we experience (s,a,s’)

• Step 2: Solve the learned MDP
– Iterative policy evaluation, for example


Page 78: Lucio marcenaro tue summer_school

Example: Model-Based Learning

• Episodes:

Setting: γ = 1; exit rewards are +100 at (4,3) and -100 at (4,2).

Model estimated from the episodes below:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Page 79: Lucio marcenaro tue summer_school

Model-Free Learning

• Want to compute an expectation weighted by P(x) (written out below):

• Model-based: estimate P(x) from samples, compute expectation

• Model-free: estimate expectation directly from samples

• Why does this work? Because samples appear with the right frequencies!
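
The expressions are missing from the transcript; the standard forms being contrasted are:

E[f(x)] = Σ_x P(x)·f(x)
Model-based: estimate P̂(x) from samples, then E[f(x)] ≈ Σ_x P̂(x)·f(x)
Model-free:  E[f(x)] ≈ (1/N) Σ_i f(x_i), with samples x_i drawn from P(x)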

Page 80: Lucio marcenaro tue summer_school

Example: Direct Estimation

• Episodes:


(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Setting: γ = 1, living reward R = -1; exit rewards are +100 at (4,3) and -100 at (4,2).

V(2,3) ≈ (96 - 103) / 2 = -3.5
V(3,3) ≈ (99 + 97 - 102) / 3 = 31.3

Page 81: Lucio marcenaro tue summer_school

Sample-Based Policy Evaluation?

• Who needs T and R? Approximate the expectation with samples (drawn from T!)


Almost! But we only actually make progress when we move to i+1.

Page 82: Lucio marcenaro tue summer_school

Temporal-Difference Learning

• Big idea: learn from every experience!

– Update V(s) each time we experience (s,a,s’,r)

– Likely s’ will contribute updates more often

• Temporal difference learning

– Policy still fixed!

– Move values toward value of whatever successor occurs: running average!


Sample of V(s), update to V(s), and the same update in incremental form (all written out below):
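
The three expressions are missing from the transcript; the standard TD(0) forms are:

Sample of V(s): sample = R(s,π(s),s’) + γ·V^π(s’)
Update to V(s): V^π(s) ← (1 - α)·V^π(s) + α·sample
Same update:    V^π(s) ← V^π(s) + α·(sample - V^π(s))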

Page 83: Lucio marcenaro tue summer_school

Exponential Moving Average

• Exponential moving average (written out below)
– Makes recent samples more important
– Forgets about the past (distant past values were wrong anyway)
– Easy to compute from the running average

• Decreasing learning rate can give converging averages
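
The formula is missing from the transcript; the standard exponential moving average (the running average used above) is:

x̄_n = (1 - α)·x̄_{n-1} + α·x_n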

Page 84: Lucio marcenaro tue summer_school

Example: TD Policy Evaluation

Take γ = 1, α = 0.5

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right -1

(4,3) exit +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(4,2) exit -100

(done)

Page 85: Lucio marcenaro tue summer_school

Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation

• However, if we want to turn values into a (new) policy, we’re sunk (see below):

• Idea: learn Q-values directly

• Makes action selection model-free too!

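The equation is missing from the transcript; the point is that extracting a policy from values requires the model, while Q-values do not:

π(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·V(s’) ]   (needs T and R)
π(s) = argmax_a Q(s,a)                                    (model-free)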

Page 86: Lucio marcenaro tue summer_school

Active Learning

• Full reinforcement learning
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You can choose any actions you like
– Goal: learn the optimal policy
– … what value iteration did!

• In this case:
– The learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and find out what happens…

Page 87: Lucio marcenaro tue summer_school

Detour: Q-Value Iteration

• Value iteration: find successive approximations to the optimal values
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1

• But Q-values are more useful!
– Start with Q0*(s,a) = 0, which we know is right (why?)
– Given Qi*, calculate the q-values for all q-states for depth i+1 (update written out below)
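
The update equation is missing from the transcript; the standard Q-value iteration update is:

Q*_{i+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ·max_a’ Q*_i(s’,a’) ]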

Page 88: Lucio marcenaro tue summer_school

Q-Learning

• Q-Learning: sample-based Q-value iteration

• Learn Q*(s,a) values
– Receive a sample (s,a,s’,r)
– Consider your old estimate Q(s,a)
– Consider your new sample estimate
– Incorporate the new estimate into a running average (update written out below)
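
The expressions are missing from the transcript; the standard Q-learning update (the same one shown later on the “Q-Learning” algorithm slide) is:

sample = r + γ·max_a’ Q(s’,a’)
Q(s,a) ← (1 - α)·Q(s,a) + α·sample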

Page 89: Lucio marcenaro tue summer_school

Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy
– If you explore enough
– If you make the learning rate small enough…
– …but do not decrease it too quickly!
– Basically, it doesn’t matter how you select actions (!)

• Neat property: off-policy learning
– Learn the optimal policy without following it (some caveats)

Page 90: Lucio marcenaro tue summer_school

Q-Learning

• Discrete sets of states and actions

– States form an N-dimensional array

• Unfolded into one dimension in practice

– Individual actions selected on each time step

• Q-values

– 2D array (indexed by state and action)

– Expected rewards for performing actions

Page 91: Lucio marcenaro tue summer_school

Q-Learning

• Table of expected rewards (“Q-values”)

– Indexed by state and action

• Algorithm steps

– Calculate state index from sensor values

– Calculate the reward

– Update previous Q-value

– Select and perform an action

• Q(s,a) ← (1 - α) Q(s,a) + α (r + γ max_a' Q(s',a'))
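
To make the steps above concrete, here is a minimal, self-contained Java sketch of a tabular Q-learner. It is illustrative only: the class and method names are not from the slides, and states/actions are assumed to be already discretized into integer indices as described on the following slides.

import java.util.Random;

public class QLearner {
    private final double[][] q;   // Q-values indexed by [state][action]
    private final double alpha;   // learning rate
    private final double gamma;   // discount factor
    private final double epsilon; // exploration probability
    private final Random rng = new Random();

    public QLearner(int numStates, int numActions,
                    double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    // Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
    public void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best);
    }

    // Epsilon-greedy selection over the current Q-values for state s
    public int selectAction(int s) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length);
        int best = 0;
        for (int a = 1; a < q[s].length; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }
}

On the robot, s would be one of the 63 state indices and a one of the five line-follow actions described on the next slides.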

Page 92: Lucio marcenaro tue summer_school

Q-Learning and Robots

• Certain sensors provide continuous values
– Sonar
– Motor encoders

• Q-Learning requires discrete inputs
– Group continuous values into discrete “buckets”
– [Mahadevan and Connell, 1992]

• Q-Learning produces discrete actions
– Forward
– Back-left/Back-right

Page 93: Lucio marcenaro tue summer_school

Creating Discrete Inputs

• Basic approach

– Discretize continuous values into sets

– Combine each discretized tuple into a single index

• Another approach

– Self-Organizing Map

– Induces a discretization of continuous values

– [Touzet 1997] [Smith 2002]

Page 94: Lucio marcenaro tue summer_school

Q-Learning Main Loop

• Select action

• Change motor speeds

• Inspect sensor values
– Calculate updated state

– Calculate reward

• Update Q values

• Set “old state” to be the updated state

Page 95: Lucio marcenaro tue summer_school

Calculating the State (Motors)

• For each motor:

– 100% power

– 93.75% power

– 87.5% power

• Six motor states

Page 96: Lucio marcenaro tue summer_school

Calculating the State (Sensors)

• No disparity: STRAIGHT

• Left/Right disparity

– 1-5: LEFT_1, RIGHT_1

– 6-12: LEFT_2, RIGHT_2

– 13+: LEFT_3, RIGHT_3

• Seven total sensor states

• 63 states overall

Page 97: Lucio marcenaro tue summer_school

Calculating Reward

• No disparity => highest value

• Reward decreases with increasing disparity

Page 98: Lucio marcenaro tue summer_school

Action Set for Line Follow

• MAINTAIN

– Both motors unchanged

• UP_LEFT, UP_RIGHT

– Accelerate motor by one motor state

• DOWN_LEFT, DOWN_RIGHT

– Decelerate motor by one motor state

• Five total actions

Page 99: Lucio marcenaro tue summer_school

Q-learning line follower

Page 100: Lucio marcenaro tue summer_school

Conclusions

• Lego Mindstorms NXT as a convenient platform for “cognitive robotics”

• Executing a task with “rules”

• Learning how to execute a task

– MDP

– Reinforcement learning

• Q-learning applied to Lego Mindstorms

Page 101: Lucio marcenaro tue summer_school

Thank you!

• Questions?