From Virtual Demonstration to Real-World Manipulation ... ?· From Virtual Demonstration to Real-World…

Download From Virtual Demonstration to Real-World Manipulation ... ?· From Virtual Demonstration to Real-World…

Post on 26-Jul-2018




0 download

Embed Size (px)


<ul><li><p>From Virtual Demonstration to Real-World Manipulation Using LSTM and MDN</p><p>Rouhollah Rahmatizadeh, Pooya Abolghasemi, Aman Behal, Ladislau BoloniDepartment of Computer Science</p><p>University of Central Florida, United Statesrrahmati,pooya.abolghasemi,,</p><p>Abstract</p><p>Robots assisting the disabled or elderly must perform com-plex manipulation tasks and must adapt to the home environ-ment and preferences of their user. Learning from demonstra-tion is a promising choice, that would allow the non-technicaluser to teach the robot different tasks. However, collectingdemonstrations in the home environment of a disabled useris time consuming, disruptive to the comfort of the user, andpresents safety challenges. It would be desirable to performthe demonstrations in a virtual environment.In this paper we describe a solution to the challenging prob-lem of behavior transfer from virtual demonstration to a phys-ical robot. The virtual demonstrations are used to train a deepneural network based controller, which is using a Long ShortTerm Memory (LSTM) recurrent neural network to generatetrajectories. The training process uses a Mixture Density Net-work (MDN) to calculate an error signal suitable for the mul-timodal nature of demonstrations. The controller learned inthe virtual environment is transferred to a physical robot (aRethink Robotics Baxter). An off-the-shelf vision componentis used to substitute for geometric knowledge available in thesimulation and an inverse kinematics module is used to allowthe Baxter to enact the trajectory.Our experimental studies validate the three contributions ofthe paper: (1) the controller learned from virtual demonstra-tions can be used to successfully perform the manipulationtasks on a physical robot, (2) the LSTM+MDN architecturalchoice outperforms other choices, such as the use of feedfor-ward networks and mean-squared error based training signalsand (3) allowing imperfect demonstrations in the training setalso allows the controller to learn how to correct its manipu-lation mistakes.</p><p>IntroductionAssistive robotics, whether in the form of wheelchairmounted robotic arms or mobile robots with manipulators,promises to improve the independence and quality of lifeof the disabled and the elderly. Such robots can help usersto perform Activities of Daily Living (ADLs) such as self-feeding, dressing, grooming, personal hygiene and leisureactivities. Almost all ADLs involve some type of objectmanipulation. While most current systems rely on remote</p><p>Copyright c 2018, Association for the Advancement of ArtificialIntelligence ( All rights reserved.</p><p>control, there are ongoing research efforts to make assis-tive robots more autonomous (Endres, Trinkle, and Burgard2013; Miller et al. 2012; Bollini, Barry, and Rus 2011). Oneof the challenges of assistive robotics is the uniqueness ofevery home and the need to adapt to the preferences anddisabilities of the user. As the user is a non-programmer,learning from demonstration (LfD) is a promising way toadapt the robot behavior to the specific environment, objectsand preferences. However, collecting large numbers of phys-ical demonstrations from a disabled user is challenging. Forinstance, the manipulated objects and the environment canbe fragile and the wheelchair-bound users might have dif-ficulty recovering from a failed manipulation task such as adropped cup. Similar considerations hinder from-the-scratchreinforcement learning performed in a home environment.</p><p>In this paper we propose an approach where the usersdemonstrate the tasks to be performed in a virtual environ-ment. This allows the safe collection of sufficient demon-strations to train a deep neural network based robot con-troller. The trained controller is then transferred to the phys-ical robot. The general flow is illustrated in Figure 1.</p><p>In the remainder of this paper we discuss related work,describe the approach in detail, and through an experimen-tal study validate the three contributions outlined in the ab-stract.</p><p>Related workVirtual training to physical execution. The desirability oftransferring learning from a simulated robot to a physicalone had been recognized by many researchers. This need isespecially strong in the case of reinforcement learning-basedapproaches, where the exploration phase can be very expen-sive on a physical robot. Unfortunately, policies learned insimulation do not work well when transferred navely to aphysical robot. First, no simulation can be perfect, thus thephysical robot will inevitably have a suboptimal policy leav-ing to a slightly different state from the expected one. Thiswould be acceptable if the policy corrects the error, keep-ing the difference bounded. In practice, it was found thatif a policy was learned from demonstrations, initially smalldifferences between the simulation and the real world tendto grow larger and larger as the state diverges farther andfarther from the settings in which the demonstrations tookplace an aggravated version of the problem that led to the</p><p>arX</p><p>iv:1</p><p>603.</p><p>0383</p><p>3v4 </p><p> [cs</p><p>.RO</p><p>] 2</p><p>2 N</p><p>ov 2</p><p>017</p></li><li><p>Virtual world: training the network Physical world: inference from the network</p><p>Demonstration of the task by user in the simulation</p><p>Robot performs the task in real-world based on the trajectory </p><p>generated by the networkTraining an LSTM network </p><p>on demonstrations</p><p>Dataset of trajectories for training</p><p>Current state of the environment </p><p>and gripper</p><p>Multilayer LSTM NN</p><p>Gripper state at next </p><p>time-step</p><p>Current state of the environment </p><p>and gripper</p><p>Gripper state at next </p><p>time-step</p><p>Inverse kinematics</p><p>Multilayer LSTM NN</p><p>Figure 1: The general flow of our approach. The demonstrations of the ADL manipulation tasks are collected in a virtualenvironment. The collected trajectories are used to train the neural network controller.</p><p>development of algorithms such as DAgger (Ross, Gordon,and Bagnell 2011).</p><p>A number of different approaches had been developed todeal with this problem. One approach is to try to bring thesimulation closer to reality through learning. This approachhad been taken by Grounded Simulation Learning (Farchyet al. 2013) and improved by Grounded Action Transforma-tion (Hanna and Stone 2017) on the task of teaching a Naorobot to walk faster.</p><p>Another approach for improving the virtual to physicaltransfer is to increase the generality of the policy learnedin simulation through domain randomization (Tobin et al.2017). (Christiano et al. 2016) note that the sequence ofstates learned by the controller in simulation might be rea-sonable even if the exact controls in the physical worldare different. Their approach computes what the next statewould be, and relies on learned deep inverse dynamicsmodel to decide on the actions to achieve the equivalentreal world state. (James, Davison, and Johns 2017) train aCNN+LSTM controller to perform a multi-step task of pick-ing up a cube and dropping it to a table. The training isbased on demonstrations acquired from a programmed op-timal controller, with domain randomization in the form ofvariation of environmental characteristics, lighting and tex-tures.</p><p>Yet another possible approach is to learn an invariant fea-ture space that can transfer information between the domainsthrough a sort of analogy making (Gupta et al. 2017).</p><p>In the work described in this paper, we took a differentapproach. We had to accept that the simulation is nowhereclose to the physical environment, thus the robot will makemistakes. Instead of domain randomization, we rely on thenatural imperfections of demonstrations done by (possiblydisabled) humans, but also on the ability of humans to cor-rect the errors they made. Using this training data, we foundthat the physical robot also learned to correct its mistakes.</p><p>Autonomous trajectory execution through Learning fromDemonstration (LfD). LfD extracts policies from exam-ple state to action mappings (Argall et al. 2009) demon-strated by a user. Successful LfD applications include au-tonomous helicopter maneuvers (Abbeel, Coates, and Ng2010), playing table tennis (Kober, Oztop, and Peters 2011;Calinon et al. 2010), object manipulation (Pastor et al.2009), and making coffee (Sung, Jin, and Saxena 2015).</p><p>The greatest challenge of LfD is generalization to un-seen situations. One obvious way to mitigate this problemis by acquiring a large number of demonstrations cover-ing as many situations as possible. Some researchers pro-posed cloud-based and crowdsourced data collection tech-niques (Kehoe et al. 2013; Forbes et al. 2014; Crick et al.2011), and some others proposed to use simulation envi-ronments (Fang, Bartels, and Beetz 2016). Another direc-tion is to use smaller number of demonstrations, but changethe learning model to generalize better. One possible tech-nique is to hand-engineer task-specific features (Calinon,Guenter, and Billard 2007; Calinon et al. 2009). Using deepneural networks might help to eliminate the need to hand-engineer features. For instance, (Levine et al. 2016) usedfeed-forward neural networks to map a robots visual inputto control commands. The visual input is processed usinga convolutional neural network (CNN) to extract 2D fea-ture points, then it is aggregated with the robots currentjoint configuration, and fed into a feed-forward neural net-work. The resulting neural network will predict the next jointconfiguration of the robot. A similar neural network archi-tecture is designed to address the grasping problem (Pintoand Gupta 2016). In (Mayer et al. 2006), LSTMs are usedfor a robot to learn to autonomously tie knots after a pre-processing step to reduce the noise from the training data.</p></li><li><p>Our approachCollecting demonstrations in a virtual environmentTo allow users to demonstrate their ADLs, we designed inthe Unity3D game engine a virtual environment modeling atable with an attached shelf that can hold various objects,and a simple two-finger gripper that can be opened andclosed to grasp and carry an object. The user can use themouse and keyboard to open/close the gripper, as well as tomove and rotate it, giving it 7 degrees of freedom. Unity3Dsimulates the basic physics of the real world including grav-ity, collisions between objects as well as friction, but there isno guarantee that these will accurately match the real world.</p><p>We represent the state of the virtual environment as thecollection of the poses of the M movable objects q ={o1 . . . oM}. The pose of an object is represented by thevector containing the position and rotation quaternion withrespect to the origin o = [px, py, pz, rx, ry, rz, rw]. Dur-ing each step of a demonstration, at time step t we recordthe state of the environment qt and the pose of the end-effector augmented with the open/close status of the gripperet. Thus a full demonstration can be recorded as a list ofpairs d = {(q1, e1) . . . (qT , eT )}.</p><p>For our experiments we considered two manipulationtasks that are regularly found as components of ADLs: pickand place and pushing to a desired pose.</p><p>The pick and place task involves picking up a small boxlocated on top of the table, and placing it into a shelf abovethe table. The robot needs to move the gripper from its initialrandom position to a point close to the box, open the gripper,position the fingers around the box, close the gripper, movetowards the shelf in an orientation where it will not collidewith the shelf, enter the shelf, and finally open the gripperto release the box. This task is a clearly segmented multi-step task, where the steps need to be executed in a particularorder. On the other hand, the execution of the task dependsvery little on the details of the physics: as long as the boxcan be picked up by the gripper, its weight or friction doesnot matter.</p><p>The pushing to desired pose task involves moving and ro-tating a box of size 10 7 7cm to a desired area only bypushing it on the tabletop. In this task, the robot is not al-lowed to grasp the object. The box is initially positioned ina way that needs to be rotated by 90 to fit inside the desiredarea which is 3cm wider than the box in each direction. Therobot starts from an initial gripper position, moves the grip-per close to the box and pushes the box at specific points atits sides to rotate and move it. If necessary, the gripper needsto circle around the box to find the next contact point. Thistask does not have a clearly sequenced set of steps. It is un-clear how many times does the box needs to be pushed. Fur-thermore, the completion of the task depends on the physics:the weight of the box and the friction between the box andtable impacts the way the box moves in response to pushes.</p><p>The demonstrations were collected from a single user, inthe course of multiple sessions. In each session, the user per-formed a series of demonstrations for each task. The qualityof demonstrations varied: in some of them, the user couldfinish the task only after several tries. For instance, some-</p><p>times the grasp was unsuccessful, or the user dropped theobject in an incorrect position and had to pick it up again.After finishing a demonstration, the user was immediatelypresented with a new instance of the problem, with randomlygenerated initial conditions. All the experiments had beenrecorded in the trajectory representation format presentedabove, at a recording frequency of 33Hz. However, we foundthat the neural network controller can be trained more effi-ciently if the trajectories are sampled at a lower rate, with arate of 4Hz to giving the best results.</p><p>To improve learning, we extended our training data by ex-ploiting both the properties of the individual tasks and tra-jectory recording technique. First, we noticed that in the pickand place task the user can put the object to any locationon the shelf. Thus we were able to generate new synthetictraining data by shifting the existing demonstration trajecto-ries parallel with the shelf. As the pushing to desired posetask requires a specific coordinate and pose to succeed, thisapproach is not possible for the second task.</p><p>The second observation was that by recording the demon-stration at 33Hz but presenting the training trajectories atonly 4Hz, we have extra trajectory points. These trajectorypoints can be used to generate multiple independent trajecto-ries at a lower temporal resolution. The process of the trajec-tory generation by frequency reduction is shown in Figure 2.Table 1 describes the size of the final dataset.</p><p> ...</p><p>Original trajectory</p><p> ... </p><p> ... </p><p> ... </p><p> ... </p><p>.</p><p>.</p><p>.</p><p>8 generated trajectories</p><p>Waypoints</p><p>Figure 2: Creating multiple trajectories from a demonstra-tion recorded at a higher frequency.</p><p>Task Pick andplace</p><p>Push topose</p><p>Raw demonstrations 650 1614Demonstrations after shift 3900 -Low frequency demonstrations 31,200 12,912Total no. of waypoints 645,198 369,477Avg. demonstration waypoints 20.68 28.61</p><p>Table 1: The size of the datasets for the two studied tasks</p><p>The neural network based robot controllerThe robot contro...</p></li></ul>