
Online Learning of a Full Body Push Recovery Controller for Omnidirectional Walking

Seung-Joon Yi∗†, Byoung-Tak Zhang†, Dennis Hong‡ and Daniel D. Lee∗

Abstract— Bipedal humanoid robots are inherently unstable to external perturbations, especially when they are walking on uneven terrain in the presence of unforeseen collisions. In this paper, we present a push recovery controller for position-controlled humanoid robots which is tightly integrated with an omnidirectional walk controller. The high level push recovery controller learns to integrate three biomechanically motivated push recovery strategies with a zero moment point based omnidirectional walk controller. Reinforcement learning is used to map the robot walking state, consisting of foot configuration and onboard sensory information, to the best combination of the three biomechanical responses needed to reject external perturbations. Experimental results show how this online method can stabilize an inexpensive, commercially-available DARwIn-OP small humanoid robot.

Keywords: Bipedal Omnidirectional Walking, Full Body Push Recovery, Reinforcement Learning, Online Learning

I. INTRODUCTION

Due to their small footprint and high center of mass, bipedal humanoid robots are not statically stable. Open loop dynamic walking methods typically assume flat, well-defined surfaces for walking that ensure the pre-defined joint trajectories exhibit dynamic stability. However, these techniques can fail due to unevenness of the surface, collisions with obstacles, and unknown modeling errors of the robot or its actuators. Active stabilization is therefore crucial for operation of humanoid robots in realistic environments, and has been a topic of great interest in humanoid research.

There have been two main approaches proposed to handle active stabilization of humanoid robots. The first approach uses sensitive force sensors and an accurate full body model of the robot to calculate the proper joint torques required to reject any external disturbance forces [1], [2], [3], [4], [5]. The other is a biomechanically motivated approach which focuses on fast, reactive push recovery behaviors using a simplified model of the robot [6], [7], [8], [9], [10], [11], [12].

One common drawback of these approaches is that most physical implementations require specialized hardware such as triaxial force and torque sensors and torque-controlled actuators, in addition to rapid computational processing and feedback control, making their implementation on inexpensive, commercially-available, position-controlled humanoid robots such as the Nao1 or DARwIn-OP2 infeasible.

∗ GRASP Laboratory, University of Pennsylvania, Philadelphia, PA 19104. {yiseung,ddlee}@seas.upenn.edu
† BI Laboratory, Seoul National University, Seoul, Korea. [email protected]
‡ RoMeLa Laboratory, Virginia Tech, Blacksburg, VA 24061. [email protected]

Fig. 1. Overview of the omnidirectional walk controller integrated with the full body push recovery controller.


In previous work, we described a machine learning approach to learn a hierarchical full body push recovery controller for such hardware constrained humanoid robots [13]. In this paper, we address several shortcomings of the prior work. The first issue is that our previous work, along with most other humanoid push recovery implementations, assumes that the robot is either standing still or walking in place, and does not address more general situations when the robot is walking with nonzero velocity. The second issue is that a simulator was used to learn the controller, like other approaches utilizing machine learning [4], [5], which may not correctly model the robot and environmental properties.

Thus, the aim of this work is twofold. The first is to design an integrated walk controller with push recovery for position-controlled humanoid robots without force sensors that can stabilize the robot against external disturbances during omnidirectional walking. The second is to use the physical robot and experimental apparatus instead of a simulated environment to learn the push recovery controller parameters in an online fashion.

We first design a step-based omnidirectional walk controller that integrates with the biomechanically motivated push recovery controllers.

1 http://www.aldebaran-robotics.com/
2 http://www.robotis.com/xe/darwin_en


Our method generates foot and torso trajectories in real time based upon dynamic stability criteria, which can be switched to reject external perturbations. The current foot configuration and walking phase are taken into consideration to determine the control inputs for the push recovery controllers. The robot is then used in conjunction with a motorized moving platform to apply a series of controlled perturbations to the robot. The parameters for the hierarchical controller are then learned in an online fashion using reinforcement learning. Experimental results show that our method can be successfully applied to the DARwIn-OP robot platform.

The remainder of the paper proceeds as follows. Section II reviews the three biomechanically motivated push recovery controllers and their implementation on position controlled humanoid robots. Section III explains the details of the step-based omnidirectional walk controller, which is fully integrated with the push recovery controller. Section IV explains how the DARwIn-OP robot is used to learn the appropriate controller from experience in an online fashion. Section V shows the experimental results using the learned controller. Finally, we conclude with a discussion of outstanding issues and potential future directions arising from this work.

II. BIOMECHANICALLY MOTIVATED PUSH RECOVERY CONTROL FOR POSITION CONTROLLED ROBOTS

Biomechanical studies show that humans display three distinctive motion patterns in response to sudden external perturbations, which we denote as the ankle, hip and step push recovery strategies [7]. Although there has been some theoretical analysis of these strategies using simplified models, physical implementation of such analytical controllers on a generic, position controlled humanoid robot is not straightforward, and there has been little research on how to optimally mix these three strategies as humans do.

Instead of fully relying on analytic models, we suggested a machine learning approach to determine the appropriate push recovery controller which minimizes a predetermined cost function [14]. To generate the proper combination of push recovery strategies, a hierarchical high level controller uses current proprioceptive and inertial sensory information to determine the optimal combination of the three recovery strategies. In this section, we review the implementation of the three biomechanically motivated push recovery controllers on a position controlled robot, as shown in Figure 2.

A. Ankle push recovery

The ankle controller applies control torque on the ankle joints to keep the center of mass within the base of support. For position controlled actuators, the ankle torque can be controlled either by controlling the auxiliary zero moment point (ZMP), which accelerates the torso to apply effective torque at the ankle [13], [15], or by modulating the target angle of the ankle servo. In this work we use the latter method, which we found to be effective for the particular robot used in our experiments. The controller can then be simply implemented as follows:

Fig. 2. Three biomechanically motivated push recovery strategies and the corresponding controllers for a position-controlled robot. (a) Ankle strategy, which applies control torque on the ankle joints. (b) Hip strategy, which uses angular acceleration of the torso and limbs to apply a counteractive ground reaction force. (c) Step strategy, which moves the base of support to a new position.


\Delta\theta_{ankle} = x_{ankle} \qquad (1)

where Δθ_ankle is the joint angle bias and x_ankle is the original ankle controller input. In addition to the ankle joints, we also modulate the arm position to apply additional effective torque at the ankles in a similar way, unless overridden by the hip controller.

B. Hip push recovery

The hip controller uses angular acceleration of the torso and limbs to generate a backward ground reaction force to pull the center of mass back towards the base of support. For maximum effect, a bang-bang profile of the following form can be used

\tau_{hip}(t) = \tau_{max}\,u(t) - 2\tau_{max}\,u(t - T_{R1}) + \tau_{max}\,u(t - T_{R2}) \qquad (2)


where u(t) is the unit step function, τ_max is the maximum torque that the joint can apply, T_R1 is the time the torso stops accelerating, and T_R2 is the time the torso comes to a stop. After T_R2, the torso angle should return to the initial position. This two-stage control scheme can be approximately implemented with position controlled actuators as

\Delta\theta_{torso} =
\begin{cases}
x_{hip} & 0 \le t < T_{R2} \\
x_{hip}\,\dfrac{T_{R3} - t}{T_{R3} - T_{R2}} & T_{R2} \le t < T_{R3}
\end{cases}
\qquad (3)

where Δθ_torso is the torso pose bias, T_R3 is the time at which the torso angle completes its return to the initial position, and x_hip is the hip controller input. Arm rotation can also be used during hip push recovery to apply additional ground reaction force on the robot.
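To make (2)-(3) concrete, the following minimal sketch (in Python) evaluates the position-controlled approximation of the bang-bang hip profile as a torso pose bias over time. The timing constants and the controller input value are illustrative placeholders, not the settings used on the robot.

    # Sketch of the position-controlled hip strategy of Eq. (3): hold a constant torso
    # bias x_hip until T_R2, then return linearly to the initial pose by T_R3.
    # T_R2, T_R3 and x_hip below are illustrative values only.
    def torso_bias(t, x_hip, T_R2=0.2, T_R3=0.4):
        """Torso pose bias (rad) at time t (s) after the hip strategy is triggered."""
        if t < 0.0:
            return 0.0
        if t < T_R2:              # bang-bang phase: torso accelerates, then decelerates
            return x_hip
        if t < T_R3:              # linear return of the torso angle to the initial pose
            return x_hip * (T_R3 - t) / (T_R3 - T_R2)
        return 0.0                # recovery behavior finished

    if __name__ == "__main__":
        for t in (0.0, 0.1, 0.2, 0.3, 0.4):
            print(t, round(torso_bias(t, x_hip=0.3), 3))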

C. Step push recovery

The step strategy effectively moves the base of support by taking a step. It is implemented by overriding the walk's step controller to insert a new step with relative target foot position x_capture. To avoid lifting the current support foot, the state of the robot must be monitored to determine the current foot configuration as well as the perturbation direction. Only during the appropriate walking phase is the walk control overridden with the new target foot position.
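The monitoring and gating described above could be organized roughly as in the sketch below; the support-foot encoding, the phase threshold, and the sign convention for the push direction are our own illustrative simplifications rather than details given in the paper.

    # Sketch of the gating that decides whether a capture step may be inserted:
    # the foot that has to move must not be the current support foot, and the
    # walk phase must still be early enough to change the landing target.
    def may_insert_capture_step(support_foot, push_direction_y, walk_phase,
                                max_override_phase=0.3):
        """support_foot: 'left' or 'right'; push_direction_y > 0 means a push toward the left."""
        if walk_phase > max_override_phase:
            return False                      # too late in the step to move the swing foot target
        stepping_foot = 'right' if support_foot == 'left' else 'left'
        if push_direction_y > 0 and stepping_foot != 'left':
            return False                      # stepping toward the push would lift the support foot
        if push_direction_y < 0 and stepping_foot != 'right':
            return False
        return True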

III. A STEP-BASED OMNIDIRECTIONAL WALK CONTROLLER WITH PUSH RECOVERY CONTROL

In this section, we describe the details of our omnidirectional walk controller, which is integrated with the full body push recovery controller. As active push recovery requires walk phase modulation and real-time modification of the landing position, we cannot directly use a typical periodic, time-based walk controller. Instead, we use a step-based walk controller that generates a series of steps with individually adjustable parameters such as duration, support foot selection and landing position. The overall walk control is separated into a step controller and a trajectory controller, where the step controller determines the parameters for each step and the trajectory controller generates foot and torso position trajectories based upon that information. In this section we cover each controller in detail, and show how they are integrated with the high level push recovery controller described in the previous section.

A. Step controller

The step controller determines the parameters of each step, which is defined as

STEP_i = \{SF,\, WT,\, t_{STEP},\, L_i, T_i, R_i,\, L_{i+1}, T_{i+1}, R_{i+1}\} \qquad (4)

where SF denotes the support foot, WT denotes the type of the step, t_STEP is the duration of the step, and L_i, T_i, R_i and L_i+1, T_i+1, R_i+1 are the initial and final 2D poses of the left foot, torso and right foot in (x, y, θ) coordinates. L_i, T_i, R_i are determined by the final feet and torso poses from the last step, and L_i+1, R_i+1 are calculated using the commanded walk velocity and the current foot configuration to enable omnidirectional walking.

Fig. 3. The step-based walk controller. (a) An example of walking behavior composed of two steps, STEP_i and STEP_i+1. (b) Corresponding ZMP and torso trajectories p and x. Timing parameters of φ_1 = 0.2 and φ_2 = 0.8 are used.

Foot reachability and self-collision constraints are also taken into account when calculating target foot poses. The target torso pose T_i+1 is set to the midpoint of L_i+1 and R_i+1 by default, so that the transition occurs at the most stable posture, where the center of mass (CoM) lies on the midpoint of the two support positions.
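A minimal sketch of how the next foot and torso targets might be computed from the commanded velocity is shown below. It works in 2D translation only (the θ component is dropped), and the nominal stance width, step time and reachability box are hypothetical values, not the constraints used on DARwIn-OP.

    # Sketch of the step controller target computation: the swing foot target is the
    # nominal stance pose advanced by the commanded velocity, clipped to a reachability
    # box, and the default torso target T_{i+1} is the midpoint of the two foot targets.
    import numpy as np

    def next_step_targets(L_i, R_i, support_foot, vel_cmd,
                          t_step=0.50, foot_y=0.05, reach=np.array([0.06, 0.04])):
        """L_i, R_i, vel_cmd: np.array([x, y]). Returns (L_{i+1}, T_{i+1}, R_{i+1})."""
        stride = np.asarray(vel_cmd, float) * 2.0 * t_step    # swing foot covers two step periods
        if support_foot == 'left':
            nominal = L_i + np.array([0.0, -2.0 * foot_y])    # nominal right foot placement
            target = np.clip(nominal + stride, nominal - reach, nominal + reach)
            L_i1, R_i1 = L_i.copy(), target
        else:
            nominal = R_i + np.array([0.0, 2.0 * foot_y])     # nominal left foot placement
            target = np.clip(nominal + stride, nominal - reach, nominal + reach)
            L_i1, R_i1 = target, R_i.copy()
        T_i1 = 0.5 * (L_i1 + R_i1)    # most stable default: CoM over the midpoint of the feet
        return L_i1, T_i1, R_i1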

Once we have the initial and final poses of the feet and torso, we can define the reference ZMP trajectory p_i(φ) as the following piecewise-linear function for the left support case

p_i(\phi) =
\begin{cases}
T_i\left(1 - \dfrac{\phi}{\phi_1}\right) + L_i\,\dfrac{\phi}{\phi_1} & 0 \le \phi < \phi_1 \\[6pt]
L_i & \phi_1 \le \phi < \phi_2 \\[6pt]
T_{i+1}\left(1 - \dfrac{1-\phi}{1-\phi_2}\right) + L_i\,\dfrac{1-\phi}{1-\phi_2} & \phi_2 \le \phi < 1
\end{cases}
\qquad (5)

or for the right support case

p_i(\phi) =
\begin{cases}
T_i\left(1 - \dfrac{\phi}{\phi_1}\right) + R_i\,\dfrac{\phi}{\phi_1} & 0 \le \phi < \phi_1 \\[6pt]
R_i & \phi_1 \le \phi < \phi_2 \\[6pt]
T_{i+1}\left(1 - \dfrac{1-\phi}{1-\phi_2}\right) + R_i\,\dfrac{1-\phi}{1-\phi_2} & \phi_2 \le \phi < 1
\end{cases}
\qquad (6)

where φ is the walk phase and φ_1, φ_2 are the timing parameters determining the transition between the single support and double support phases.
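For one coordinate of the pose, (5) and (6) translate directly into the sketch below; it assumes φ has already been normalized to [0, 1) and that the support foot pose passed in is L_i or R_i depending on the support side.

    # Sketch of the piecewise-linear reference ZMP of Eqs. (5)-(6), one coordinate.
    def reference_zmp(phi, T_i, T_i1, support_pose, phi1=0.2, phi2=0.8):
        """support_pose is L_i for left support or R_i for right support."""
        if phi < phi1:                           # double support: ZMP moves from torso to support foot
            return T_i * (1.0 - phi / phi1) + support_pose * (phi / phi1)
        if phi < phi2:                           # single support: ZMP stays on the support foot
            return support_pose
        w = (1.0 - phi) / (1.0 - phi2)           # double support: ZMP moves toward the next torso pose
        return T_i1 * (1.0 - w) + support_pose * w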

Finally, the step controller allows for inter-step overrides from the push recovery controller. The hip and step push recovery controllers can temporarily stop the walking to stabilize the robot after each push recovery behavior is initiated, and the step push recovery controller can initiate a special capture step to reject large perturbations. The walk type parameter WT is used to describe these special types of steps.

B. Trajectory controller

The trajectory controller generates the foot and torso trajectories for the current step defined in (4). First we define the single support walk phase φ_single as

\phi_{single} =
\begin{cases}
0 & 0 \le \phi < \phi_1 \\[4pt]
\dfrac{\phi - \phi_1}{\phi_2 - \phi_1} & \phi_1 \le \phi < \phi_2 \\[4pt]
1 & \phi_2 \le \phi < 1
\end{cases}
\qquad (7)


then we use the following parameterized trajectory function

f_x(\phi) = \phi^{\alpha} + \beta\,\phi\,(1 - \phi) \qquad (8)

to generate the foot trajectories

L_i(\phi_{single}) = L_{i+1}\, f_x(\phi_{single}) + L_i\,(1 - f_x(\phi_{single})) \qquad (9)

R_i(\phi_{single}) = R_{i+1}\, f_x(\phi_{single}) + R_i\,(1 - f_x(\phi_{single})) \qquad (10)
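Chaining (7)-(10) for one coordinate of the swing foot gives the sketch below; the values of α and β are left as arguments since the paper does not state the ones used on the robot.

    # Sketch of the swing foot interpolation of Eqs. (7)-(10), one coordinate.
    def single_support_phase(phi, phi1=0.2, phi2=0.8):
        """Eq. (7): map the step phase to the single support phase in [0, 1]."""
        if phi < phi1:
            return 0.0
        if phi < phi2:
            return (phi - phi1) / (phi2 - phi1)
        return 1.0

    def f_x(phi_s, alpha=2.0, beta=0.5):
        """Eq. (8): parameterized interpolation profile (alpha, beta are placeholders)."""
        return phi_s ** alpha + beta * phi_s * (1.0 - phi_s)

    def swing_foot(phi, start_pose, target_pose, alpha=2.0, beta=0.5):
        """Eqs. (9)-(10): interpolate the swing foot from its initial to its final pose."""
        s = f_x(single_support_phase(phi), alpha, beta)
        return target_pose * s + start_pose * (1.0 - s)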

The torso trajectory x_i(t) is calculated according to the following ZMP criterion for a linear inverted pendulum model

p = x - t_{ZMP}^{2}\,\ddot{x} \qquad (11)

The piecewise linear ZMP trajectory we use in (5), (6) yields the following solution for x_i(φ) with zero ZMP error during the step period

x_i(\phi) =
\begin{cases}
p_i(\phi) + a_i^{p}\, e^{\phi/\phi_{ZMP}} + a_i^{n}\, e^{-\phi/\phi_{ZMP}} - \phi_{ZMP}\, m_i \sinh\dfrac{\phi - \phi_1}{\phi_{ZMP}} & 0 \le \phi < \phi_1 \\[6pt]
p_i(\phi) + a_i^{p}\, e^{\phi/\phi_{ZMP}} + a_i^{n}\, e^{-\phi/\phi_{ZMP}} & \phi_1 \le \phi < \phi_2 \\[6pt]
p_i(\phi) + a_i^{p}\, e^{\phi/\phi_{ZMP}} + a_i^{n}\, e^{-\phi/\phi_{ZMP}} - \phi_{ZMP}\, n_i \sinh\dfrac{\phi - \phi_2}{\phi_{ZMP}} & \phi_2 \le \phi < 1
\end{cases}
\qquad (12)

where φ_ZMP = t_ZMP / t_STEP and m_i, n_i are the ZMP slopes, which are defined as follows for the left support case

m_i = (L_i - T_i)/\phi_1 \qquad (13)

n_i = -(L_i - T_{i+1})/(1 - \phi_2) \qquad (14)

and for the right support case

m_i = (R_i - T_i)/\phi_1 \qquad (15)

n_i = -(R_i - T_{i+1})/(1 - \phi_2) \qquad (16)

The parameters a_i^p and a_i^n can then be uniquely determined by the boundary conditions x_i(0) = T_i and x_i(1) = T_i+1. This analytic solution has zero ZMP error during the step period but discontinuous jerk at the transitions. However, as this transition occurs in the middle of the most stable double support stance, we found this does not hamper stability.
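Since p_i(0) = T_i and p_i(1) = T_i+1 by (5)-(6), the boundary conditions reduce (12) to a 2x2 linear system in a_i^p and a_i^n. The sketch below solves that system and evaluates (12) for one coordinate; it reflects our reading of the equations rather than code from the paper.

    # Sketch: solve the boundary conditions x_i(0) = T_i, x_i(1) = T_{i+1} of Eq. (12)
    # for the exponential coefficients a_i^p, a_i^n (one coordinate).  The slopes m, n
    # follow Eqs. (13)-(16) and phi_zmp = t_ZMP / t_STEP.
    import numpy as np

    def torso_coefficients(m, n, phi1, phi2, phi_zmp):
        A = np.array([[1.0, 1.0],
                      [np.exp(1.0 / phi_zmp), np.exp(-1.0 / phi_zmp)]])
        b = np.array([-phi_zmp * m * np.sinh(phi1 / phi_zmp),           # from x_i(0) = T_i
                       phi_zmp * n * np.sinh((1.0 - phi2) / phi_zmp)])  # from x_i(1) = T_{i+1}
        a_p, a_n = np.linalg.solve(A, b)
        return a_p, a_n

    def torso_position(phi, p_phi, a_p, a_n, m, n, phi1, phi2, phi_zmp):
        """Eq. (12): torso position at step phase phi, given the reference ZMP p_i(phi)."""
        x = p_phi + a_p * np.exp(phi / phi_zmp) + a_n * np.exp(-phi / phi_zmp)
        if phi < phi1:
            x -= phi_zmp * m * np.sinh((phi - phi1) / phi_zmp)
        elif phi >= phi2:
            x -= phi_zmp * n * np.sinh((phi - phi2) / phi_zmp)
        return x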

In addition to calculating foot trajectories based upon the predetermined target foot poses, an intra-step override from the push recovery controller can also modify the foot landing position while stepping. When step push recovery is triggered during the initial phase of walking, the target landing position is overridden and a new foot trajectory is generated under foot reachability and maximum foot velocity constraints.
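The intra-step override can be pictured as in the sketch below: the requested capture target is shortened so that the swing foot can still reach it in the remaining single support time, then clipped to a reachability box. The velocity limit, reachability bound and timing values are illustrative assumptions.

    # Sketch of the intra-step landing override under foot velocity and reachability limits.
    import numpy as np

    def override_landing(current_swing, capture_target, support_pose, phi_single,
                         t_step=0.50, phi1=0.2, phi2=0.8, v_max=0.5, reach=0.08):
        """All poses are np.array([x, y]); phi_single is the single support phase in [0, 1]."""
        remaining_t = (1.0 - phi_single) * (phi2 - phi1) * t_step   # single support time left
        max_travel = v_max * remaining_t                            # distance the foot can still cover
        delta = np.asarray(capture_target, float) - np.asarray(current_swing, float)
        dist = np.linalg.norm(delta)
        if dist > max_travel and dist > 0.0:
            delta *= max_travel / dist                              # respect the foot velocity limit
        new_target = np.asarray(current_swing, float) + delta
        lo = np.asarray(support_pose, float) - reach                # reachability box around support foot
        hi = np.asarray(support_pose, float) + reach
        return np.clip(new_target, lo, hi)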

C. Push recovery controller

The push recovery controller consists of the three low level biomechanically motivated push recovery strategies and a high level controller that generates the control inputs for them based on sensory feedback and current state information. There is no closed-form analytic solution for combining the individual low level controllers, so we use a machine learning approach that trains the overall controller to optimize a cost function from experience.

We formalize the high level controller as a reinforcement learning problem with the following state

S = \{\theta_{IMU},\, \theta_{gyro},\, \theta_{foot},\, \phi,\, FD,\, HCS\} \qquad (17)

where θ_IMU and θ_gyro are the torso pose and gyroscope data from the inertial sensors, θ_foot is the support foot pose calculated using forward kinematics and onboard sensors, φ is the walk phase, FD is the foot displacement, and HCS is the hip controller state. The learned action is defined as the combined inputs for the three low level controllers

A = \{x_{ankle},\, x_{hip},\, x_{capture}\} \qquad (18)

and the reward is defined as

R = \left|\theta_{gyro}\right|^{2} + \frac{g}{z_0}\left|\theta_{IMU}\right|^{2} \qquad (19)

where g is the gravitational constant and z_0 is the CoM height.

IV. ONLINE LEARNING OF THE PUSH RECOVERY CONTROLLER

We have implemented the integrated walk controller with push recovery on a commercially available small humanoid robot and trained the controller in an online fashion. In this section, we present the details of the learning setup.

A. Hardware setup

We use the commercially available DARwIn-OP humanoid robot developed by Robotis Co., Ltd. and the RoMeLa laboratory. It is 45 cm tall, weighs 2.8 kg, and has 20 degrees of freedom. It has a USB camera for visual feedback, and a 3-axis accelerometer and 3-axis gyroscope for inertial sensing. Position-controlled Dynamixel servos are used as actuators; they are controlled by a custom microcontroller connected to an Intel Atom-based embedded PC at a control frequency of 100 Hz.

To generate repeatable external perturbations, a motorized moving platform was constructed using Dynamixel servomotors. The motorized platform is attached to the same bus connecting the actuators of the robot, and is controlled by the same I/O process. To generate maximum peak acceleration, the platform is slowly accelerated in one direction and then suddenly accelerated in the opposite direction. We have found the platform can generate accelerations greater than 0.5 g while carrying the robot, providing perturbations large enough to make the robot fall.
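The platform motion described above might be commanded along the lines of the sketch below: a slow position ramp in one direction followed by an abrupt return, which the position servo turns into a sharp acceleration in the opposite direction. The interface and all timing and distance values are placeholders, not the actual Dynamixel commands.

    # Sketch of the platform position command used to create a sharp, repeatable perturbation.
    def platform_command(t, ramp_time=1.5, ramp_dist=0.08):
        """Commanded platform position (m) at time t (s) after the trial starts."""
        if t < ramp_time:
            return ramp_dist * (t / ramp_time)   # slow ramp: the robot tracks this easily
        return 0.0                               # sudden return: peak acceleration in the other direction

    if __name__ == "__main__":
        for t in (0.0, 0.5, 1.0, 1.4, 1.5, 1.6):
            print(t, round(platform_command(t), 3))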

B. Learning setup

In previous work [13], a policy gradient reinforcement learning algorithm was used to learn the push recovery controller in a simulated environment. However, the same approach cannot be directly applied to the physical robot for online learning. The main issue with machine learning approaches on physical robots is that the amount of training data is severely restricted. Unlike in simulation, small humanoid robots cannot continuously operate for long periods of time due to heat buildup at the actuators, and the process of gathering training data is slower and may require more human intervention.


Fig. 4. (a) Learning setup, including the servo controlled platform connected to the DARwIn-OP robot. (b) Commanded position of the platform to induce an instantaneous disturbance. (c) Actual acceleration of the platform.

Also, the sensory noise present in physical experiments is not well modeled in simulation.

To overcome these problems, we simplify the reinforcement learning formulation of the high level controller as follows. First, we use the heuristic disturbance variable θ_d as a single continuous state variable, which is defined as

\theta_d = s(\theta_{IMU} + k_{gyro}\,\theta_{gyro}) \qquad (20)

where s is a moving average smoothing function and k_gyro is a heuristic mixing parameter. The rest of the state variables are discretized to form the discrete state variables

S_d = \{\hat{\phi},\, \widehat{FD},\, \widehat{HCS}\} \qquad (21)

Simple parametric functions with preset parameters k_ankle, k_hip, k_step are then used to describe the possible controller actions:

x_{ankle} =
\begin{cases}
0 & |\theta_d| < \theta^{TH}_{ankle}(S_d) \\[4pt]
k_{ankle}\,(\theta_d - \theta^{DB}_{ankle}(S_d)) & \theta_d \le -\theta^{TH}_{ankle}(S_d) \\[4pt]
k_{ankle}\,(\theta_d + \theta^{DB}_{ankle}(S_d)) & \theta_d \ge \theta^{TH}_{ankle}(S_d)
\end{cases}
\qquad (22)

x_{hip} =
\begin{cases}
0 & |\theta_d| < \theta^{TH}_{hip}(S_d) \\[4pt]
-k_{hip} & \theta_d \le -\theta^{TH}_{hip}(S_d) \\[4pt]
k_{hip} & \theta_d \ge \theta^{TH}_{hip}(S_d)
\end{cases}
\qquad (23)

x_{capture} =
\begin{cases}
0 & |\theta_d| < \theta^{TH}_{step}(S_d) \\[4pt]
-k_{step} & \theta_d \le -\theta^{TH}_{step}(S_d) \\[4pt]
k_{step} & \theta_d \ge \theta^{TH}_{step}(S_d)
\end{cases}
\qquad (24)

Finally, to reduce the effects of incorrect actions during normal walking, a negative training set is generated by letting the robot walk on the platform without any external perturbations; this prevents the controller from learning overly sensitive corrections.
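Taken together, (20)-(24) reduce the high level controller to a moving-average disturbance estimate plus three deadband rules whose thresholds are stored per discrete state. The sketch below shows that structure; the smoothing window, mixing gain, preset gains and default thresholds are placeholders, and the per-state thresholds are the quantities adjusted by learning.

    # Sketch of the simplified high level controller of Eqs. (20)-(24).  Each discrete
    # state S_d (a hashable tuple) owns its own threshold parameters, which are the
    # values tuned online; the gains K_* are the preset parameters of Eqs. (22)-(24).
    from collections import deque

    K_GYRO, K_ANKLE, K_HIP, K_STEP = 0.05, 0.6, 0.25, 0.04   # placeholders

    class HighLevelController:
        def __init__(self, window=5):
            self.history = deque(maxlen=window)   # moving-average smoother s(.) of Eq. (20)
            self.params = {}                      # S_d -> (th_ankle, db_ankle, th_hip, th_step)

        def disturbance(self, theta_imu, theta_gyro):
            self.history.append(theta_imu + K_GYRO * theta_gyro)
            return sum(self.history) / len(self.history)

        def action(self, theta_imu, theta_gyro, S_d):
            th_a, db_a, th_h, th_s = self.params.setdefault(S_d, (0.05, 0.02, 0.15, 0.25))
            d = self.disturbance(theta_imu, theta_gyro)
            x_ankle = 0.0 if abs(d) < th_a else K_ANKLE * (d - db_a if d < 0.0 else d + db_a)
            x_hip = 0.0 if abs(d) < th_h else (K_HIP if d > 0.0 else -K_HIP)
            x_capture = 0.0 if abs(d) < th_s else (K_STEP if d > 0.0 else -K_STEP)
            return x_ankle, x_hip, x_capture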

Fig. 5. Learning process of the push recovery controller. (a) Initial parameter set, which does not trigger the hip strategy. (b) Learned parameter set after training, which triggers the hip strategy when a large disturbance is applied. The same amount of disturbance is applied to the robot in both cases.

V. EXPERIMENTAL RESULTS

A. Learning the push recovery controller

We trained the push recovery controller using the servo controlled platform. The robot is set to walk in different directions, and a platform movement is triggered at different phases of the walk. The robot is then repositioned by the human operator to start a new trial. Each trial takes roughly 10 seconds on average, and 20 trials are done per episode to improve the controller. Figure 5 shows a comparison of the robot behavior before and after 20 episodes of training with the same amount of perturbation. We can see that the robot learns to apply the appropriate push recovery behavior to reject large perturbations during walking.
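The paper does not detail the online update rule itself, so the sketch below only captures the episodic structure of the procedure above (perturbation trials, per-trial reward accumulated with (19), parameters updated between episodes). The robot and platform interfaces, the CoM height, and the update callback are abstract placeholders.

    # Structural sketch of the episodic online training loop; robot, platform and
    # update_params are abstract placeholders rather than real interfaces.
    def run_training(robot, platform, controller, update_params,
                     episodes=20, trials_per_episode=20, g=9.81, z0=0.2):
        for episode in range(episodes):
            rewards = []
            for trial in range(trials_per_episode):
                robot.start_walking(random_direction=True)
                platform.trigger_perturbation(phase=robot.walk_phase())
                total = 0.0
                while robot.trial_active():                            # roughly 10 s per trial
                    gyro, imu = robot.read_inertial()
                    robot.apply(controller.action(imu, gyro, robot.discrete_state()))
                    total += gyro ** 2 + (g / z0) * imu ** 2           # reward term of Eq. (19)
                rewards.append(total)
                robot.reposition_by_operator()                         # manual reset between trials
            update_params(controller, rewards)                         # adjust thresholds once per episode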

B. Testing the push recovery controller

We let the robot with the learned controller walk freely over a flat surface and applied a number of pushes of various directions and magnitudes. Figure 6 shows some examples of the robot's responses to external disturbances3. We see that the learned controller can successfully trigger appropriate push recovery behaviors to keep the omnidirectionally walking robot from falling down.

3 http://www.youtube.com/watch?v=N5Sh407Mqjg

Fig. 6. Responses of the learned push recovery controller against perturbations during omnidirectional walking. (a) Step strategy during walking forward. (b) Step strategy during sidestepping. (c) Hip strategy during walking forward. (d) Hip strategy during turning. (e) Hip and step strategies during walking forward. (f) Hip and step strategies during sidestepping.

VI. CONCLUSIONS




We have demonstrated an integrated controller that enables full body push recovery during omnidirectional walking for humanoid robots without specialized sensors and actuators. Three low level biomechanically motivated push recovery strategies are implemented on a position controlled humanoid robot and integrated with a non-periodic, step-based omnidirectional walk controller. The proper parameters for the hierarchical controller are learned online using a commercially-available small humanoid robot along with a servo-controlled moving platform. Experimental results show that the trained controller can successfully initiate a full body push recovery behavior under external perturbations while walking in an arbitrary direction.

Possible future work includes incorporating more sophisticated learning algorithms to better utilize the limited training data, and implementing these algorithms on full-size humanoid robots.

ACKNOWLEDGMENTS

We acknowledge the support of the NSF PIRE program under contract OISE-0730206. This work was also partially supported by the IT R&D program of MKE/KEIT (KI002138, MARS), the industrial strategic technology development program (10035348) funded by the Korean government (MKE), and the BK21-IT program.

REFERENCES

[1] S.-H. Hyon and G. Cheng, "Disturbance rejection for biped humanoids," in ICRA, Apr. 2007, pp. 2668–2675.

[2] J. Park, J. Haan, and F. C. Park, "Convex optimization algorithms for active balancing of humanoid robots," IEEE Transactions on Robotics, vol. 23, no. 4, pp. 817–822, 2007.

[3] S.-H. Hyon, R. Osu, and Y. Otaka, "Integration of multi-level postural balancing on humanoid robots," in ICRA, 2009, pp. 1549–1556.

[4] T. Matsubara, J. Morimoto, J. Nakanishi, S.-H. Hyon, J. G. Hale, and G. Cheng, "Learning to acquire whole-body humanoid center of mass movements to achieve dynamic tasks," Advanced Robotics, vol. 22, no. 10, pp. 1125–1142, 2008.

[5] A. Yamaguchi, S.-H. Hyon, and T. Ogasawara, "Reinforcement learning for balancer embedded humanoid locomotion," in Humanoids, Dec. 2010, pp. 308–313.

[6] B. Jalgha, D. C. Asmar, and I. Elhajj, "A hybrid ankle/hip preemptive falling scheme for humanoid robots," in ICRA, 2011.

[7] M. Abdallah and A. Goswami, "A biomechanically motivated two-phase strategy for biped upright balance control," in ICRA, 2005, pp. 2008–2013.

[8] J. Pratt, J. Carff, and S. Drakunov, "Capture point: A step toward humanoid push recovery," in Humanoids, 2006, pp. 200–207.

[9] T. Komura, H. Leung, S. Kudoh, and J. Kuffner, "A feedback controller for biped humanoids that can counteract large perturbations during gait," in ICRA, 2005, pp. 1989–1995.

[10] D. N. Nenchev and A. Nishio, "Ankle and hip strategies for balance recovery of a biped subjected to an impact," Robotica, vol. 26, no. 5, pp. 643–653, 2008.

[11] B. Stephens, "Humanoid push recovery," in Humanoids, 2007.

[12] B.-K. Cho and J.-H. Oh, "Practical experiment of balancing for a hopping humanoid biped against various disturbances," in IROS, 2010, pp. 4464–4470.

[13] S.-J. Yi, B.-T. Zhang, and D. D. Lee, "Online learning of uneven terrain for humanoid bipedal walking," in AAAI, 2010.

[14] S.-J. Yi, B.-T. Zhang, D. Hong, and D. D. Lee, "Learning full body push recovery control for small humanoid robots," in ICRA, 2011.

[15] S. Kajita, M. Morisawa, K. Harada, K. Kaneko, F. Kanehiro, K. Fujiwara, and H. Hirukawa, "Biped walking pattern generator allowing auxiliary ZMP control," in IROS, Oct. 2006, pp. 2993–2999.
