
Vision-Based Reaching Using Modular Deep Networks: from Simulation to the Real World

Fangyi Zhang, Jürgen Leitner, Ben Upcroft, Peter Corke1

Abstract— In this paper we describe a deep network architecture that maps visual input to control actions for a robotic planar reaching task with 100% reliability in real-world trials. Our network is trained in simulation and fine-tuned with a limited number of real-world images. The policy search is guided by a kinematics-based controller (K-GPS), which works more effectively and efficiently than ε-Greedy. A critical insight in our system is the need to introduce a bottleneck in the network between the perception and control networks, and to initially train these networks independently.

I. INTRODUCTION

Sensor-based robot tasks such as visual reaching are implemented today by human-written code. Typically a perception system will identify the target in some local or global coordinate system, and the robot motion planner and joint control system will ensure that the required collision-free motion is executed. Understanding complex visual scenes has traditionally been a hard problem, but in recent years deep learning techniques have led to significant improvements in robustness and performance [1]. Motion planning is normally implemented using classical algorithmic approaches, and joint control is implemented using classical control engineering techniques.

Inspired by the success of deep learning in computer vision, some pioneers have used deep neural networks to perform robotic tasks based directly on image data, such as object grasping and bottle cap screwing [2], [3], [4], [5]. However, most existing works achieve good but not perfect performance by collecting large-scale datasets or adding constraints for optimization. To reduce the cost of collecting large-scale real datasets for robotic applications, we propose to train in simulation and then transfer to a real-world robotic setup.

Our work is inspired by DeepMind's Deep Q-Network approach [6], which showed that a deep learnt system was able to take images and directly synthesise control actions for Atari video games, albeit all in simulation. While this result is an important and exciting breakthrough, we have found that it does not transfer directly to robotic applications in a naive way, as the trained deep network's perception is quite brittle when dealing with real cameras observing real scenes [7]. In this paper, we conduct further

*This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016). Computational resources & services used in this work were partially provided by the HPC and Research Support Group, Queensland University of Technology.

1The authors are with the Australian Centre for Robotic Vision (ACRV), Queensland University of Technology (QUT), Brisbane, Australia. [email protected]

Fig. 1. A modular neural network (where perception and control are separated by a bottleneck) (A) maps raw-pixel images directly to motor actions. It allows the efficient transfer of deep learnt vision-based target reaching from simulation (B) to a real-world Baxter setup (C).

investigations on perception adaptation and policy search for vision-based target reaching on a robotic arm, with the aim of adapting pre-trained networks for real-world applications in a feasible and low-cost way. In particular, we

• propose a low-cost way of adapting visuomotor policies from simulation to real-world robotic settings using modular neural networks, where we separate perception and control by adding a bottleneck (Fig. 1A).

• introduce a new policy search method for Q-learning, using a kinematics-based controller to guide the search for a feasible policy. The guidance from a conventional controller makes it possible, and quicker, to obtain feasible policies for a robotic planar reaching task where ε-Greedy works poorly.

• demonstrate learning of a robotic planar reaching task in simulation (Fig. 1B) and transfer to a real-world setup using about 1000 real images, achieving a success rate of 100% in 20 reaching tests on Baxter (Fig. 1C).

II. RELATED WORK

Following its success in computer vision, deep learning is becoming more prominent in robotics research. In particular, the Atari game playing Deep Q-Network (DQN) [6] has inspired robotics researchers to look at deep learning for tasks that require a close interaction between vision and action.


The DQN showed that visual inputs – taken from a game emulator – can be directly mapped to the best action to take – the button to press – with a deep neural network. The network successfully learned to play 49 different games solely based on the reward provided by the game score. This is done using Q-learning [8], a form of reinforcement learning, which aims to maximise the future expected reward given a current state s and a policy π:

$$Q(s, a) = \max_{\pi} \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \mid \pi\right], \qquad (1)$$

This is the maximum sum of rewards r_t, discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s) in state s taking an action a. The novelty of DQN is that it allows the state to be represented by a high-dimensional input, a computer screen image. This is possible by using convolutional network layers to extract the relevant features [1]. Once the function is learned, the action with the maximum Q-value for a given input image will be performed.
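As a concrete illustration of Eq. 1 (a minimal sketch, not taken from the paper's implementation; the discount factor value is an assumption), the one-step bootstrapped target and the greedy action selection used once Q is learned can be written as:

```python
import numpy as np

def q_learning_target(reward, q_next, gamma=0.99, terminal=False):
    """One-step bootstrapped target y = r + gamma * max_a' Q(s', a').

    q_next is the vector of Q-values for the next state; gamma=0.99 is an
    assumed discount factor (the paper does not state its value).
    """
    if terminal:
        return reward
    return reward + gamma * np.max(q_next)

def greedy_action(q_values):
    """Once Q is learned, the action with the maximum Q-value is executed."""
    return int(np.argmax(q_values))
```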

Inspired by the DQN, we implemented a simple simulator for robotic planar reaching, providing a virtual environment for a DQN to learn in, as shown in Fig. 1B. Through the simulator, a DQN agent learned to control the 3-joint arm to reach towards a blue circle target [7]. However, we found that a deep network trained in simulation was not directly transferable to the real world, even though configuration and sensing noise was added during training to make it as robust as possible.

Some other works also aim to use deep learning for robotic applications, and have achieved promising performance. A convolutional neural network (CNN) was trained to cope with the obstacle avoidance problem for mobile robots [9]. A CNN-based policy representation architecture (deep visuomotor policies) and its guided policy search (GPS) method were introduced in [2]. The deep visuomotor policies map joint angles and camera images directly to joint torques, achieving good performance in the tasks of hanging a coat hanger, inserting a block into a kids' toy, and tightening a bottle cap.

By using a large-scale dataset with 800,000 grasp attempts, a CNN was trained to predict the success probability of a sequence of motions aimed at grasping on a 7 DoF robotic manipulator with a 2-finger gripper [3]. Combined with a simple derivative-free optimization algorithm (CEM [10]), the grasping system achieved a success rate of 80%. Another example of deep learning for grasping is self-supervised grasp learning in the real world, where force sensors were used to autonomously label samples [4]. After training with 50K trials using a staged learning method, a CNN achieved a grasping success rate of around 70%.

Recurrent neural networks (RNNs) have also been used to learn manipulation tasks such as cutting objects, e.g., by combining an RNN with model predictive control (DeepMPC) [5]. It achieved state-of-the-art performance in experiments on a large-scale dataset of 1488 material cuts across 20 diverse classes, and in 450 real-world robotic experiments.

These works show that, with appropriate network architectures and effective training methods, deep learning is feasible in robotic manipulation and brings performance improvements. However, they are either too expensive in terms of the time and resources needed for large-scale dataset collection, or not robust enough to find global solutions, and very few of them focus on transferring skills from simulation to the real world.

Transfer learning (TL) attempts to develop methods to transfer knowledge between tasks (i.e., to transfer knowledge learned in one or more source tasks to a related target task) [11]. Many TL methods have been introduced in the areas of data mining [12] and reinforcement learning [13]. Progressive neural networks were developed both to leverage transfer and to avoid catastrophic forgetting, enabling the learning of complex sequences of tasks; their effectiveness has been validated on a wide variety of reinforcement learning tasks such as Atari and 3D maze games [14].

However, TL methods for robotic applications are still very limited. Most methods need manually designed task mapping information, such as a similarity-based approach to skill transfer for robots [15]. To reduce the number of real-world images for pose CNN training in GPS [2], a method of adapting visual representations from simulated to real environments was proposed. The method achieved a success rate of 79.2% in a "hook loop" task with 10 times fewer real-world images [16].

Previously it was shown that a DQN trained in simulation had poor performance in real-world reaching, due to the brittleness of the transferred perception [7]. Adapting a neural network from simulation to the real world necessitates an extra fine-tuning step. However, fine-tuning directly on a real robot is expensive in terms of resources and time.

III. PROBLEM FORMULATION

A canonical problem in robotics is to reach for the object to be interacted with. This target reaching task is defined as controlling a robot arm such that its end-effector reaches a specific target in operational space. An n-DoF robot's movement is represented in joint or configuration space C ∈ S^n; the joint angles define the robot pose q ∈ C. A controller then reduces the error between the robot's current pose and the to-be-reached pose q∗. We consider the case where the robot only has visual perception. In this case the target T is represented by its image coordinates (Tx, Ty).

The setup consists of a Baxter robot which aims to reach an arbitrarily placed blue target in a 2D vertical plane, using three joints of its left arm (Fig. 1C). The robot's perception is limited to raw-pixel images I from an external monocular camera. A single coloured, blue circle is used as the target.

We limit the robot's actions a to 9 discrete options – the analogue of the buttons in the Atari scenario. Each action either rotates clockwise, rotates counter-clockwise, or halts a specific joint qi of the 3 DoF arm. The rotation actions are fixed at 0.04 radians per execution. Only a single action ai is executed at any given time step.
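The following is a minimal sketch of this discrete action set; the 0.04 radian step size is from the text, while the index ordering and the helper name action_to_joint_delta are assumptions for illustration:

```python
import numpy as np

STEP = 0.04  # radians per execution, as stated in the text

def action_to_joint_delta(action, n_joints=3):
    """Map a discrete action index (0..8) to a joint increment.

    For each joint there are three options: rotate clockwise (-STEP),
    halt (0), or rotate counter-clockwise (+STEP). Only one joint moves
    per time step.
    """
    assert 0 <= action < 3 * n_joints
    joint = action // 3
    option = action % 3            # 0: clockwise, 1: halt, 2: counter-clockwise
    delta = np.zeros(n_joints)
    delta[joint] = (-STEP, 0.0, STEP)[option]
    return delta
```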

As deep learning approaches require a considerable number of input images, purely learning in the real world is infeasible with current techniques.

[Fig. 2 architecture details, recovered from the figure. DeepMind DQN (left): input 4×84×84; Conv1: 8×8 conv + ReLU, stride 4, 32 filters; Conv2: 4×4 conv + ReLU, stride 2, 64 filters; Conv3: 3×3 conv + ReLU, stride 1, 64 filters; FC1: fully connected + ReLU, 512 units; FC2: fully connected, 9 units (Q-values). Modular network (right): input 84×84; Conv1: 7×7 conv + ReLU, stride 2; Conv2: 4×4 conv + ReLU, stride 2; Conv3: 3×3 conv + ReLU, stride 1 (64 filters each); a fully connected layer to the 5-unit bottleneck (BN, the scene configuration θ); FC_c1: fully connected + ReLU, 400 units; FC_c2: fully connected + ReLU, 300 units; FC_c3: fully connected, 9 units (Q-values).]

Fig. 2. Deep networks are used to predict Q-values given raw-pixel inputs. The DeepMind DQN architecture (left) consists of three conv layers and two FC layers. It maps 4 consecutive input images to a Q-value for each possible output action. We propose modular neural networks (right), composed of perception and control modules. The perception module, which consists of three conv layers and a FC layer, extracts the physically relevant information from the visual input (a single frame). The control module selects the best action given the current state. The interface between the two is a bottleneck, which has physical meaning (the scene configuration θ).

Inspired by the Atari games, a simulator was developed to provide a virtual environment for the robot to learn reaching in (Fig. 1B). The simulated arm consists of four links and three joints, whose configurations are consistent with the specifications of a Baxter arm, including joint constraints. The target to reach is represented by a blue circle with a radius of 0.05 m. A reach is deemed successful if the robot reaches and keeps its end-effector within 0.09 m of the centre of the target for four consecutive actions, which is equivalent to about 4 pixels in the input image to the neural network. Although the simulator is low-fidelity, our experimental results show that it is sufficient to learn and transfer reaching skills to a real robot (see Section V).
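A minimal sketch of such a simulator step is given below; the link lengths are illustrative placeholders rather than Baxter's actual specifications, and the arm is simplified to three moving links:

```python
import numpy as np

LINK_LENGTHS = [0.37, 0.37, 0.23]   # metres; illustrative values, not Baxter's actual specs
SUCCESS_DIST = 0.09                 # m, distance threshold for a successful reach
SUCCESS_STEPS = 4                   # consecutive actions required within the threshold

def end_effector(q):
    """Forward kinematics of a planar arm with cumulative joint angles."""
    x, y, angle = 0.0, 0.0, 0.0
    for qi, li in zip(q, LINK_LENGTHS):
        angle += qi
        x += li * np.cos(angle)
        y += li * np.sin(angle)
    return np.array([x, y])

def check_success(q, target, counter):
    """Return (success, updated_counter): success requires staying within
    SUCCESS_DIST of the target centre for SUCCESS_STEPS consecutive actions."""
    d = np.linalg.norm(end_effector(q) - target)
    counter = counter + 1 if d <= SUCCESS_DIST else 0
    return counter >= SUCCESS_STEPS, counter
```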

IV. METHODOLOGY

A DQN trained in simulation does not transfer naively to real-world reaching, due to its brittle perception [7]. To adapt the perception in a low-cost way, we propose a modular deep learning architecture that learns robotic reaching from visual inputs only, inspired by DeepMind's DQN system.

A. Modular neural networks

The modularity stems from the well-established approach in robotics of connecting perception with control via a very strict, well-defined interface. The advantage of a separate perception module is that a small number of real-world samples is enough to fine-tune it for a real-world setup after training in simulation.

The original DeepMind DQN uses three convolutional (conv) layers, followed by two fully-connected (FC) layers (Fig. 2, left). The intuition is that the conv part focuses on perception, i.e., extracting the relevant information from the visual input, and the FC part performs the robot control. We verify this by separating the network into perception and control modules. To create a useful interface, we propose a bottleneck layer (BN) which represents the minimal description of the scene – herein a combination of the robot's state q and the target location (Tx, Ty) (Fig. 2, right).

Based on the physical representation that is contained in the bottleneck, we can train and assess the performance of the perception and control modules independently.

B. Perception module

The perception module learns how to extract the scene configuration θ from raw-pixel images I.

1) Architecture: Compared to the original DQN, instead of using 4 consecutive images we use a single grey-scale image as input, which is sufficient for planar reaching in stepping position control mode where no temporal information is necessary. Before being input to the perception module, images from the simulator (RGB, 160×210) are converted to grey-scale and scaled to 84×84. To be able to train the module based on our physical intuition at the bottleneck, a FC layer is added to map the high-dimensional features from Conv3 to the scene configuration θ.
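A sketch of this perception module in PyTorch is shown below; the layer sizes are read off Fig. 2, the ReLU placement follows the figure labels, and the flattened feature size is inferred at construction time rather than quoted from the paper:

```python
import torch
import torch.nn as nn

class PerceptionModule(nn.Module):
    """84x84 grey-scale image -> scene configuration theta (3 joint angles + 2 target coords)."""

    def __init__(self, theta_dim=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2), nn.ReLU(),   # Conv1
            nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        with torch.no_grad():  # infer the flattened feature size for the FC layer
            n_feat = self.conv(torch.zeros(1, 1, 84, 84)).numel()
        self.fc = nn.Linear(n_feat, theta_dim)  # FC layer mapping features to the bottleneck theta

    def forward(self, image):
        h = self.conv(image)
        return self.fc(h.flatten(start_dim=1))
```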

The network is initialized with random weights, apart from the first conv layer. Conv1 is initialized with the weights of a GoogLeNet [17] pre-trained on the ImageNet dataset [18]. A weight conversion, based on standard RGB to grey-scale methods, is necessary as GoogLeNet has three input channels (RGB) compared to our single grey-scale channel.
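One plausible way to perform this conversion (an assumption; the paper only says "standard RGB to grey-scale methods", so the ITU-R BT.601 luminance coefficients used here are illustrative):

```python
import numpy as np

def rgb_conv_to_grey(weights_rgb):
    """Collapse a pretrained RGB conv kernel to a single grey-scale channel.

    weights_rgb has shape (out_channels, 3, kH, kW); the three input channels
    are combined with luminance coefficients so that the converted filter
    responds to grey-scale images roughly as the original did to colour ones.
    """
    coeffs = np.array([0.299, 0.587, 0.114]).reshape(1, 3, 1, 1)
    return (weights_rgb * coeffs).sum(axis=1, keepdims=True)  # (out, 1, kH, kW)
```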

2) Training method: The perception module is trained using supervised learning. The training is first conducted in simulation, then fine-tuned with a small number of real samples to enable the transfer to the real robot setup. Each sample contains an image It – from either the simulator or the real setup – and its ground-truth scene configuration θt. In training, the weights are updated through backpropagation [19] using the quadratic cost function

$$C^i = \frac{1}{2m} \sum_{t=1}^{m} \left\| y(I_t)^i - \theta_t^i \right\|^2, \qquad (2)$$

where C^i is the cost for θ_t^i, an element of θ_t; m is the number of samples; and y(I_t)^i is the prediction of θ_t^i given I_t.
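Eq. 2 can be transcribed directly, for example as follows (a sketch with predictions and labels held as plain arrays):

```python
import numpy as np

def quadratic_cost(pred, truth):
    """Eq. 2: per-element quadratic cost over a batch of m samples.

    pred and truth have shape (m, 5): the predicted and ground-truth scene
    configurations theta. Returns one cost C^i per element of theta.
    """
    m = pred.shape[0]
    return ((pred - truth) ** 2).sum(axis=0) / (2.0 * m)
```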

C. Control module

The control module decides which action a to take based on the current state of the scene θ.

1) Architecture: The control module has 3 FC layers, with 400 and 300 units in the two hidden layers (FC_c1 and FC_c2), as shown in Fig. 2 (right).

Algorithm 1: DQN with K-GPS

1: Initialize replay memory D
2: Initialize Q-function Q(θ, a) with random weights
3: for iteration = 1, K do
4:   if previous trial finished then
5:     Start a new trial:
6:     Randomly generate two arm poses q and q∗
7:     Get the end-effector position (Tx, Ty) for the pose q∗
8:     Use (Tx, Ty) as the target in this trial, and q as the starting pose
9:   end
10:  With probability ε select an action a_t by argmin_a ‖q(a) − q∗‖
11:  otherwise select a_t = argmax_a Q(θ_t, a)
12:  Execute a_t and observe reward r_t and state θ_{t+1}
13:  Add the new sample (θ_t, a_t, r_t, θ_{t+1}) to D
14:  Sample a random minibatch from D
15:  Update Q(θ, a) using the minibatch
16: end

The input to the control module is the scene configuration θ. Its output is the Q-values for the 9 possible actions. Our experiments in Section V-B show that too few layers, or too few units in each layer, will make a network sensitive to training parameters.
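A matching sketch of the control module, with the unit counts from Fig. 2 and the text; ReLU on the hidden layers is a conventional choice assumed here:

```python
import torch.nn as nn

class ControlModule(nn.Module):
    """Scene configuration theta (5 values) -> Q-values for the 9 actions."""

    def __init__(self, theta_dim=5, n_actions=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(theta_dim, 400), nn.ReLU(),   # FC_c1
            nn.Linear(400, 300), nn.ReLU(),         # FC_c2
            nn.Linear(300, n_actions),              # FC_c3: Q-values
        )

    def forward(self, theta):
        return self.net(theta)
```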

2) Training method: The control module is trained in simulation to learn a policy using Q-learning. We propose a FC network to approximate the optimal Q-value function defined in Eq. 1, where the action with the maximum activation will be executed. Here, the state s is represented by θ, i.e., Q(θ, a). The implementation is based on DeepMind's DQN, where a replay memory and a target Q network are used to stabilize the training [6].

The reward r of each action is determined by the Euclidean distance d between the end-effector and the centre of the target, as shown in the equation

$$r = \begin{cases} \lambda(\delta/d - 1) & \text{if } d > \delta \\ 0 & \text{if } d \le \delta,\ n < 4 \\ 1 & \text{if } d \le \delta,\ n \ge 4 \end{cases} \qquad (3)$$

where δ is the threshold for reaching a target (δ = 0.05 m); λ is a constant discount factor (λ = 10⁻³); and n is the number of consecutive steps for which d has been smaller than δ.
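A direct transcription of Eq. 3 (a sketch; the running counter n is assumed to be maintained by the caller):

```python
def reward(d, n, delta=0.05, lam=1e-3):
    """Eq. 3: negative shaped reward outside the threshold, +1 once the
    end-effector has stayed within delta of the target for 4 steps.

    d: distance between end-effector and target centre [m]
    n: number of consecutive steps with d <= delta so far
    """
    if d > delta:
        return lam * (delta / d - 1.0)   # negative, approaches 0 as d -> delta
    return 1.0 if n >= 4 else 0.0
```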

With this reward function, an agent will receive negative rewards before getting close enough to a target. The presence of negative rewards can result in policies that reach a target in fewer steps. Such a reward function helps an agent take temporal costs into account during policy search, i.e., fewer steps is better. By giving a positive reward only when d is smaller than the threshold δ for more than 4 consecutive steps, the reward function guides an agent to stay close to a target rather than just pass through it. Designing a good reward function is an active topic in the field of reinforcement learning. This reward function allows an agent to learn planar reaching, but we do not claim optimality.

In Q-learning, ε-Greedy is usually used for policy search, as in the original DQN. However, experiments show that the ε-Greedy method works poorly in our planar reaching task (Section V-B). Therefore, we introduce a kinematics-based controller to guide the policy search (K-GPS) to cope with this problem. The guidance controller selects actions by the following optimization:

$$\operatorname*{argmin}_a \left\| q(a) - q^* \right\|, \qquad (4)$$

where a is any of the 9 available actions, q(a) is the resulting pose after taking action a from the current pose, and q∗ is the desired pose. Our DQN with K-GPS algorithm is shown in Algorithm 1.

During training, a replay memory D, used to store samples of (θ_t, a_t, r_t, θ_{t+1}), is first initialized. The control module, which acts as the Q-function Q(θ, a), is initialized with random weights. At the beginning of each trial, the starting arm pose and target position are randomly generated. To guarantee that a random target (Tx, Ty) is reachable by the arm, we first randomly select an arm pose q∗ and use the position of its end-effector as the target position. Then q∗ is used as the desired pose to guide the policy search. In each iteration, the action is selected by the kinematics-based controller with probability ε, and otherwise by the control module. During training, ε decreases linearly from 1 to 0.1, i.e., the guidance gets weaker over the course of training. After an action is executed, a new sample (θ_t, a_t, r_t, θ_{t+1}) is added to D. The control module is updated using a minibatch of random samples from D.
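A condensed sketch of the K-GPS action selection and the linear ε schedule described above; the replay-memory handling and Q-update follow the standard DQN recipe and are not shown, and q_network is assumed to be a callable mapping θ to the 9 Q-values:

```python
import random
import numpy as np

def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.1):
    """Linear decay of the guidance probability from 1 to 0.1 over training."""
    frac = min(step / float(total_steps), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def kgps_select_action(theta, q, q_star, q_network, epsilon):
    """With probability epsilon follow the kinematics-based guidance
    controller, otherwise act greedily on the learned Q-function."""
    if random.random() < epsilon:
        return guidance_action(q, q_star)
    q_values = q_network(theta)          # assumed callable: theta -> 9 Q-values
    return int(np.argmax(q_values))
```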

V. EXPERIMENTS AND RESULTS

We first assessed the accuracy of perception modules trained under different conditions; then evaluated the performance of control modules and analyzed the benefits of using K-GPS; and finally tested the performance of a network composed of perception and control modules on real-world end-to-end reaching.

A. Perception training and fine-tuning

1) Experimental setup: To see the benefit of fine-tuning perception with real images, we trained 6 different agents using different methods, as shown in Table I. SIM was trained from scratch purely in simulation; RW was trained from scratch using real images; FT25–100 were trained by fine-tuning SIM with different percentages of real images. Please note that the percentage of real or simulated images refers to the composition of a minibatch for weight updates, not of a training set.
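One way such a fixed real/simulated ratio per minibatch could be realised is sketched below; the batch size of 128 is from Section V-A.1, while the sampling (with replacement, for simplicity) is an illustrative assumption:

```python
import random

def mixed_minibatch(real_samples, sim_samples, real_fraction, batch_size=128):
    """Draw a minibatch in which real_fraction of the samples are real images
    and the remainder are simulated, e.g. real_fraction=0.75 for FT75."""
    n_real = int(round(real_fraction * batch_size))
    batch = random.choices(real_samples, k=n_real) + \
            random.choices(sim_samples, k=batch_size - n_real)
    random.shuffle(batch)
    return batch
```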

When training in simulation, images were generated by the simulator and labelled with the ground-truth scene configuration θ. When training or fine-tuning with real images, 1418 images were used with their ground-truth scene configurations. The real images were collected in the real-world setup (Fig. 1C), with different arm poses uniformly distributed in the joint space. For convenience, instead of using a real target, a blue circle was painted on each real image as a virtual target, using the simulator.

Fig. 3. A real image (A, 640×480) from a webcam is first cropped and scaled to the simulator size (B, 160×210), and a virtual target is added. When input to a perception module, B is converted to grey-scale and scaled to 84×84 (C).

Fig. 3 shows how a real image was collected. Fig. 3B is an example real image with a virtual target. As introduced in Section IV-B, before being input to a perception module, Fig. 3B is converted to Fig. 3C. To yield more robust results, the real data was augmented by transforming the images with variations in rotation, translation, noise, and brightness (as is commonly done for CNN training in computer vision).
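A sketch of such augmentation with NumPy/SciPy; the transformation ranges are assumptions, since the paper only lists the types of variation:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(image, rng=np.random):
    """Randomly rotate, translate, add noise to, and re-brighten a grey-scale image."""
    img = rotate(image, angle=rng.uniform(-5, 5), reshape=False, mode='nearest')
    img = shift(img, shift=rng.uniform(-3, 3, size=2), mode='nearest')
    img = img * rng.uniform(0.8, 1.2)                      # brightness
    img = img + rng.normal(0.0, 2.0, size=img.shape)       # additive noise
    return np.clip(img, 0, 255)
```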

For both training and fine-tuning, we used a minibatch size of 128 and a learning rate between 0.1 and 0.01. RMSProp [20] was also adopted. The modules trained from scratch converged after 4 million update steps (about 6 days on a Tesla K40 GPU). Those fine-tuned from SIM converged after 2 million update steps (about 3 days). These results show that fine-tuning helps to reduce the training time.

The performance was evaluated using the perception error, defined as the Euclidean norm e between the predicted scene configuration θ̂ and the ground truth θ∗, e = ‖θ̂ − θ∗‖. In general, a smaller e means smaller errors in both joint angle and target coordinate recognition.

The evaluation was conducted in three different scenarios: "Simulation", "Real" and "Live on Baxter". In "Simulation", 400 images, uniformly distributed in the scene configuration space, were used for evaluation. Similarly, 400 real-world images were collected for evaluation ("Real"). Please note that these 400 real-world images are different from those used for training/fine-tuning, although they were collected at the same time. To make the test as close as possible to the real application scenario (Fig. 1C), we also conducted a live evaluation directly on Baxter ("Live on Baxter"), where 40 different scene configurations were tested.

2) Quantitative results: The evaluation results are shown in Table II, where eµ and eσ denote the mean and standard deviation of e. As expected, the perception modules performed well in the scenarios they were trained/fine-tuned for, but poorly otherwise. This can be observed from the eµ results in Table II. The module trained with only simulated images (SIM) had a small error in simulation ("Simulation") but performed very poorly in the real environments

Fig. 4. The distributions of e = ‖θ̂ − θ∗‖ for FT25–100 in "Simulation" and "Live on Baxter". The blue circles and green diamonds represent outliers.

TABLE I
PERCEPTION MODULES AND CONDITIONS: the ratio of real or simulated images is that in a minibatch for weight updating.

Nets    Training Conditions
SIM     Trained from scratch in simulation
RW      Trained from scratch using real-world images
FT25    Fine-tuned from SIM using 25% real and 75% simulated images
FT50    Fine-tuned from SIM using 50% real and 50% simulated images
FT75    Fine-tuned from SIM using 75% real and 25% simulated images
FT100   Fine-tuned from SIM using 100% real-world images

TABLE II
PERCEPTION ERROR: Euclidean norm (e) between θ̂ and θ∗

Nets     Simulation (eµ / eσ)   Real (eµ / eσ)      Live on Baxter (eµ / eσ)
SIM      0.0132 / 0.0086        13.9231 / 0.8764    13.5346 / 1.4357
RW       0.5368 / 0.1905        0.0232 / 0.0457     0.3080 / 0.1375
FT25     0.0117 / 0.0077        0.0248 / 0.0441     0.2189 / 0.0909
FT50     0.0130 / 0.0078        0.0235 / 0.0450     0.1920 / 0.1093
FT75     0.0149 / 0.0095        0.0208 / 0.0460     0.1354 / 0.1225
FT100    0.4977 / 0.1620        0.0193 / 0.0489     0.1334 / 0.1530

("Real" and "Live on Baxter"). Similarly, the modules trained (RW) or fine-tuned (FT100) with only real-world images performed well in real environments but poorly in simulation. In contrast, the modules fine-tuned with both simulated and real images can cope with both scenarios.

The results for FT25–100 show that the percentage of real-world images in a minibatch plays a role in balancing the performance between real and simulated environments, which can be observed more clearly in Fig. 4, a boxplot of the results of FT25–100 in "Simulation" and "Live on Baxter". The more real images were present in a minibatch during training, the smaller the error in the real world became. Similarly for the performance in simulation: more simulated images yielded a smaller error. In particular, FT25 had the smallest eµ in "Simulation"; FT100 had the smallest eµ in "Real" and "Live on Baxter". However, when balancing eµ and eσ, FT75 had the best performance in "Live on Baxter", as is indicated more clearly in Fig. 4.

Comparing SIM with FT25–100 in Table II, we can see that the modules fine-tuned with less than 50% simulated images (FT75 and FT100) obtained a bigger error than SIM in "Simulation". This indicates that fine-tuning needs at least 50% simulated samples in a minibatch to prevent a network from forgetting pre-mastered skills (scene configuration recognition in simulation). We can also observe that a network fine-tuned using real images (FT100) had a smaller error than a network trained from scratch (RW) in the real setup. This indicates that fine-tuning from a pre-trained network helps to obtain better performance, not just to reduce the training time.

In Table II, the results for "Live on Baxter" were similar to those for "Real", but with bigger errors. Since the pose of the external camera that observed the scene was not fixed, tiny variations were to be expected between different setups. This explains the bigger errors in the "Live on Baxter" setup. This performance drop was already significantly reduced thanks to the addition of translations and rotations to real images during data augmentation. For comparison, we tested a perception module trained without image translation and rotation in the "Live on Baxter" scenario, and it achieved very poor performance (eµ = 0.7303, eσ = 0.1013). In the future, this kind of camera pose variance can be avoided by using an on-board camera.

3) Qualitative results: In addition to the quantitative results, we also obtained some qualitative results from the tests on Baxter. When no target was present in the scene, the perception network output constant values (with small variance). When two targets were present, neither target was recognized accurately and the target recognition outputs were inconsistent. Yet in both cases, the joint angles were recognized accurately. When part of the robot body or arm was occluded by a white T-shirt, the arm poses were still recognized, but with a slightly bigger error in most cases. These qualitative results show that the fine-tuned networks were robust to small-range disturbances in perception.

B. Control training

1) ε-Greedy vs. K-GPS: To understand the benefit of using K-GPS for Q-learning, we trained 6 control modules for the planar reaching task with different degrees of freedom (1, 2 and 3 DoF) using either ε-Greedy or K-GPS for comparison. In the 1 DoF case, only q2 was rotatable; the 2 DoF case used q2 and q3; and the 3 DoF case used all three joints. In training, we used a learning rate between 0.1 and 0.01 and a minibatch size of 64. The ε probability decreased from 1 to 0.1 within 1 million training steps for 1 DoF reaching, 2 million steps for 2 DoF, and 3 million steps for 3 DoF. ε-Greedy and K-GPS used the same settings for ε.

Fig. 5 shows the learning curves for all 6 modules, which indicate the success rate of a module after a certain number of training steps. As introduced in Section III, when counting the success rate, a reach is deemed successful if the end-effector reaches and stays within 0.09 m (about 4 pixels in the network's input image) of the centre of a target for four consecutive actions.

Fig. 5. The learning curves (success rate [%] versus training steps [×10⁵]) of control modules for different degrees of freedom, trained using ε-Greedy or K-GPS.

TABLE III
PERFORMANCE OF ε-GREEDY AND K-GPS

       Error Distance d [m]                  Success Rate α [%]
DoF    ε-Greedy           K-GPS              ε-Greedy   K-GPS
1      0.0151 ± 0.0135    0.0084 ± 0.0075    100.00     100.00
2      0.1509 ± 0.2656    0.0311 ± 0.0249    83.75      99.50
3      0.2124 ± 0.2144    0.0431 ± 0.0294    50.25      98.50

TABLE IV
AVERAGE COMPLETION STEPS: K-GPS VS GUIDANCE CONTROLLER

DoF    K-GPS      Guidance
1      23.9073    24.0650
2      45.3008    47.8075
3      70.6858    76.9175

In the learning curves, the success rate was measured using 200 reaching tests. In these tests, the target positions and starting arm poses were uniformly distributed in the scene configuration space.

From Fig. 5, we can see that in 1 DoF reaching, the modules trained using K-GPS and ε-Greedy both converged to a success rate of 100% after around 1 million steps (4 hours on one 2.66 GHz 64-bit Intel Xeon processor core). In 2 DoF reaching, K-GPS converged to almost 100% after 2 million steps (8 hours), while ε-Greedy converged to around 80% after 4 million steps (16 hours). In 3 DoF reaching, K-GPS converged to almost 100% after 4 million steps (16 hours), while ε-Greedy converged to about 40% after 6 million steps (24 hours). The learning curves show that K-GPS was feasible for all degrees of freedom, whereas ε-Greedy worked well only in 1 DoF reaching and got worse as the degrees of freedom increased.

To further analyze the modules trained using K-GPS and ε-Greedy, we evaluated the converged modules in simulation using the metrics of error distance d and success rate α, where d is the closest Euclidean distance between the end-effector and the target reached by a control module. To get a more accurate indication of performance, 400 reaching tests were used.


Fig. 6. End-to-end reaching with real targets (A) and occlusions (B-E).

The results are shown in Table III. The success rate results in Table III are consistent with what we found in Fig. 5. The error distance results show that K-GPS achieved a smaller error distance than ε-Greedy for all degrees of freedom.

Beyond the evaluation in simulation, we also tested the network trained using K-GPS for 3 DoF reaching (Network C) in real environments, where joint angles were read from encoders and target coordinates came from the simulator (virtual targets). In the test, the joint angles from the encoders had an average error of 0.0015 radians with a standard deviation of 0.0015 radians. Network C achieved a success rate of 100% in 20 reaching tests, with an average error distance of 0.0337 m, as shown in Table V (Net C). The good performance in the real world indicates that the control module is robust to small-range sensing noise.

In addition to the comparison between K-GPS and ε-Greedy, we also compared K-GPS with the guidance controller in terms of policy efficiency. 200 successful trials were selected to evaluate their average number of completion steps, i.e., how many actions a network took to reach a target. The results are shown in Table IV, from which we can see that the policies learned by K-GPS needed fewer completion steps than the guidance controller (Guidance); the more degrees of freedom, the more pronounced the difference. This is because the guidance controller selects an action by performing an optimization in joint space at each step, rather than aiming for the maximum accumulated reward that Q-learning cares about. This indicates that, with an appropriate reward function, the policies learned by Q-learning with K-GPS are able to surpass the guidance controller.

2) Qualitative results for architecture comparison: We also tested FC networks with different numbers of hidden layers and units in each layer. The qualitative results show that a network with only one hidden layer was enough for 1 DoF reaching but insufficient for the 2 and 3 DoF cases. The number of units in each layer also influenced the performance. For 3 DoF reaching, the architecture introduced in Section IV-C worked the best; at least two hidden layers with 200 and 150 units were needed, otherwise the network was very sensitive to training parameters.

C. End-to-end reaching

To test the performance of a network composed of perception and control modules in the real world, we conducted an end-to-end reaching evaluation, where the inputs were raw-pixel images and the outputs were actions.

TABLE V
REAL-WORLD REACHING PERFORMANCE

Nets            dµ [m]    dµ [pixels]    dσ [m]    dσ [pixels]    α [%]
C               0.0337    1.4141         0.0114    0.4783         100.00
EE1 (SIM+C)     0.4860    20.3926        0.2907    12.1978        0
EE2 (FT75+C)    0.0612    2.5680         0.0294    1.2336         100.00

For comparison, we evaluated two networks: one composed of SIM and C (EE1); the other composed of FT75 and C (EE2). SIM was trained purely in simulation; FT75 achieved the best performance in the real world.

Similar to the settings in Section V-B, the performance was evaluated using the metrics of success rate α and error distance d. Each network was evaluated with 20 reaching tests in the real world. As mentioned in Section V-B, we used virtual targets for more convenient quantitative result collection. The results are shown in Table V, where dµ and dσ are the mean and standard deviation of the error distances. The error distance in pixels in the input image (84×84) is also listed in the table.

From Table V, we can see that, as expected, EE1 worked poorly in the real world, since SIM was not fine-tuned for real environments. In contrast, EE2 worked well and achieved a success rate of 100%.

Comparing EE2 with C, we can see that the end-to-end performance was slightly worse than that of the control module alone in terms of error distance. The average error distance of C was 0.0337 m (1.4141 pixels), while that of EE2 was 0.0612 m (2.5680 pixels). The factors causing the bigger error came from the perception: the low resolution of the input images and the filter size and stride settings in FT75 limited the accuracy of perception. The perception accuracy could be improved by increasing the input image resolution and optimizing the filter size and stride settings in the conv layers. An extra end-to-end fine-tuning might also help improve the performance [2].

In addition to the quantitative results, we also tested EE2 with real targets and even occlusions of the robot body, and obtained some qualitative results. In these tests, a printed blue circle was used as the target and a white T-shirt was used to occlude part of the Baxter body, as shown in Fig. 6. In the case without occlusion (A), real targets could always be reached. In Cases B, C and E, most targets could be reached, with a bigger error distance than in Case A. In Case D, only a few targets could be reached, with the biggest error among all five cases. These qualitative results indicate that EE2 did have some robustness to occlusions, although occlusions did not appear in the training. A video showing the end-to-end reaching tests can be found at https://wiki.qut.edu.au/x/WQM9DQ.

VI. CONCLUSION AND DISCUSSION

Herein we proposed a method to transfer learnt vision-based reaching from simulation to a real-world setup. The proposed modular deep network architecture presents a feasible and fast way to transfer. The method can be used to separate a task into a number of sub-tasks, making it tractable to learn and transfer complex skills.

The physical intuition in the target reaching task, namely that one part of the network performs visual perception while the other part performs control, was verified by the introduction of a bottleneck layer. This bottleneck represents the scene configuration, i.e., the robot pose and target position.

A kinematics-based guided policy search (K-GPS) method was also introduced, which is more effective and efficient than ε-Greedy in the robotic planar reaching task. The K-GPS method assumes that we already have sufficient knowledge of the task to be learned, and a model of it. It can be applied to other robotic applications where kinematic models are available, but it is not suitable for challenging tasks that lack such models.

The time cost of training a perception network was reduced by fine-tuning from pre-trained networks. The time cost can be further reduced by optimizing the network architecture and training methods, as well as by fine-tuning from a more general pre-trained model. With the development of deep learning in robotics, more and better pre-trained models will become available, making the fine-tuning method more practical and efficient.

Results in the real planar reaching setup, where a Baxter robot only had an external viewpoint of itself and the target, show a 100% success rate in 20 trials, with an average error distance of 2.5680 pixels. The modular networks trained using our method were robust to small-range disturbances, even though those disturbances did not appear in training.

To further validate the method, more investigation is needed on more complicated tasks, such as 7 DoF target reaching in a real 3D space using continuous control.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.

[2] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.

[3] S. Levine, P. P. Sampedro, A. Krizhevsky, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," in International Symposium on Experimental Robotics (ISER), 2016.

[4] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 3406–3413.

[5] I. Lenz, R. Knepper, and A. Saxena, "DeepMPC: Learning deep latent features for model predictive control," in Robotics: Science and Systems (RSS), 2015.

[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[7] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. Corke, "Towards vision-based deep reinforcement learning for robotic motion control," in Australasian Conference on Robotics and Automation (ACRA), 2015.

[8] A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[9] L. Tai, S. Li, and M. Liu, "A deep-network solution towards model-less obstacle avoidance," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.

[10] R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, New York, 2004.

[11] L. Torrey and J. Shavlik, "Transfer learning," in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, E. S. Olivas, J. D. M. Guerrero, M. M. Sober, J. R. M. Benedito, and A. J. S. Lopez, Eds. Hershey, PA, USA: IGI Global, 2009, ch. 11, pp. 242–264.

[12] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[13] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," The Journal of Machine Learning Research, vol. 10, pp. 1633–1685, 2009.

[14] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.

[15] T. Fitzgerald, A. Goel, and A. Thomaz, "A similarity-based approach to skill transfer," in Women in Robotics Workshop at Robotics: Science and Systems Conference (RSS), 2015.

[16] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell, "Towards adapting deep visuomotor representations from simulated to real environments," arXiv preprint arXiv:1511.07111, 2015.

[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.

[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.

[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.

[20] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.