A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm

Martin Riedmiller, Heinrich Braun
Institut für Logik, Komplexität und Deduktionssysteme
University of Karlsruhe, W-7500 Karlsruhe
[email protected]

Abstract - A new learning algorithm for multi-layer feedforward networks, RPROP, is proposed. To overcome the inherent disadvantages of pure gradient descent, RPROP performs a local adaptation of the weight-updates according to the behaviour of the error function. In substantial difference to other adaptive techniques, the effect of the RPROP adaptation process is not blurred by the unforeseeable influence of the size of the derivative, but depends only on the temporal behaviour of its sign. This leads to an efficient and transparent adaptation process. The promising capabilities of RPROP are shown in comparison to other well-known adaptive techniques.

I. INTRODUCTION

A. Backpropagation Learning

Backpropagation is the most widely used algorithm for supervised learning with multi-layered feed-forward networks. The basic idea of the backpropagation learning algorithm [1] is the repeated application of the chain rule to compute the influence of each weight in the network with respect to an arbitrary error function E:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial s_i} \cdot \frac{\partial s_i}{\partial net_i} \cdot \frac{\partial net_i}{\partial w_{ij}} \qquad (1)$$

where w_ij is the weight from neuron j to neuron i, s_i is the output, and net_i is the weighted sum of the inputs of neuron i.

Once the partial derivative for each weight is known, the aim of minimizing the error function is achieved by performing a simple gradient descent:

$$w_{ij}(t+1) = w_{ij}(t) - \epsilon \, \frac{\partial E}{\partial w_{ij}}(t) \qquad (2)$$

Obviously, the choice of the learning rate ε, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error from falling below a certain value.
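For concreteness, the gradient-descent update of equation (2) can be written as a minimal sketch; the function name, the NumPy array interface, and the default value lr=0.1 are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def gradient_descent_step(w, grad, lr=0.1):
    """One plain gradient-descent step, eq. (2): move against the
    gradient, scaled by the learning rate lr (epsilon in the text).
    As discussed above, too small a value slows convergence, while
    too large a value leads to oscillation."""
    return w - lr * grad
```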

An early way proposed to get rid of the above problem is to introduce a momentum term:

$$\Delta w_{ij}(t) = -\epsilon \, \frac{\partial E}{\partial w_{ij}}(t) + \mu \, \Delta w_{ij}(t-1) \qquad (3)$$

where the momentum parameter μ scales the influence of the previous step on the current one. The momentum term is believed to render the learning procedure more stable and to accelerate convergence in shallow regions of the error function. However, as practical experience has shown, this is not always true. It turns out, in fact, that the optimal value of the momentum parameter μ is just as problem-dependent as the learning rate ε, and that no general improvement can be accomplished.

B. Adaptive Learning Algorithms

Many algorithms have been proposed so far to deal with the problem of appropriate weight-update by doing some sort of parameter adaptation during learning. They can roughly be separated into two categories: global and local strategies. Global adaptation techniques make use of knowledge of the state of the entire network (e.g. the direction of the previous weight-step) to modify global parameters, whereas local strategies use only weight-specific information (e.g. the partial derivative) to adapt weight-specific parameters.


Besides the fact that local adaptation strategies are more closely related to the concept of neural learning and are better suited for parallel implementations, their superiority over global learning algorithms has been impressively demonstrated in a recently published technical report [2].

The majority of both global and local adaptive algorithms performs a modification of a (possibly weight-specific) learning-rate according to the observed behaviour of the error function. Examples of such algorithms are the Delta-Bar-Delta technique [3] or the SuperSAB algorithm [4].

The adapted learning rate is eventually used to calculate the weight-step. What is often disregarded is that the size of the actually taken weight-step Δw_ij depends not only on the (adapted) learning-rate, but also on the partial derivative ∂E/∂w_ij. So the effect of the carefully adapted learning-rate can be drastically disturbed by the unforeseeable behaviour of the derivative itself. This was one of the reasons that led to the development of RPROP: to avoid the problem of 'blurred adaptivity', RPROP changes the size of the weight-update Δw_ij directly, i.e. without considering the size of the partial derivative.

C. Other Acceleration Techniques

A great variety of further modifications of the backpropagation procedure has been proposed (e.g. the use of modified error functions, or sophisticated weight initialization techniques), which all promise to accelerate the speed of convergence considerably. In our experience, some of them worked slightly better on several problems, but for other problems they did not improve convergence or even gave worse results.

So the only modification we found useful and uncritical in use was to simply 'clip' the logistic activation function at a value that can still be reasonably distinguished from the asymptotic boundary value. This results in an always non-zero derivative, preventing the unit from getting stuck. Especially on more difficult problems, this technique worked far more stably than adding a small constant value to the derivative of the activation function, as proposed in [5].
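As an illustration of the clipping trick described above, here is a minimal sketch; the margin of 0.01 is an assumed example value, since the paper does not state the constant it used:

```python
import numpy as np

def clipped_logistic(x, margin=0.01):
    """Logistic activation clipped away from its asymptotic values 0 and 1.
    `margin` is an assumed value; the text only requires that the output can
    be 'reasonably distinguished' from the boundary."""
    s = 1.0 / (1.0 + np.exp(-x))
    return np.clip(s, margin, 1.0 - margin)

def logistic_derivative(s):
    """Derivative s * (1 - s) computed from the clipped output; since s never
    reaches 0 or 1, the derivative never vanishes and the unit cannot get stuck."""
    return s * (1.0 - s)
```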

II. RPROP

A. Description

RPROP stands for 'resilient propagation' and is an efficient new learning scheme that performs a direct adaptation of the weight step based on local gradient information. In crucial difference to previously developed adaptation techniques, the effort of adaptation is not blurred by gradient behaviour whatsoever.

To achieve this, we introduce for each weight its individual update-value Δ_ij, which solely determines the size of the weight-update.

This adaptive update-value evolves during the learning process based on its local sight on the error function E, according to the following learning-rule:

$$\Delta_{ij}^{(t)} = \begin{cases} \eta^{+} \cdot \Delta_{ij}^{(t-1)}, & \text{if } \frac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \frac{\partial E}{\partial w_{ij}}^{(t)} > 0 \\ \eta^{-} \cdot \Delta_{ij}^{(t-1)}, & \text{if } \frac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \frac{\partial E}{\partial w_{ij}}^{(t)} < 0 \\ \Delta_{ij}^{(t-1)}, & \text{else} \end{cases}$$

where 0 < η− < 1 < η+. In words: every time the partial derivative of the corresponding weight w_ij changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum, the update-value Δ_ij is decreased by the factor η−; if the derivative retains its sign, the update-value is slightly increased by the factor η+ in order to accelerate convergence in shallow regions.
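The rule above, together with the sign-based weight-step that RPROP then takes, can be phrased as a short sketch. This is not the paper's pseudocode: the function name rprop_update, the NumPy interface, and the defaults η+ = 1.2, η− = 0.5 and the lower bound on Δ are assumptions (the surviving text only requires 0 < η− < 1 < η+); Δ₀ = 0.1 and Δmax = 50.0 are the standard choices quoted in the summary below. The published algorithm additionally reverts the previous weight-step when the sign flips; that backtracking detail is omitted here for brevity.

```python
import numpy as np

def rprop_update(w, grad, grad_prev, delta,
                 eta_plus=1.2, eta_minus=0.5,
                 delta_min=1e-6, delta_max=50.0):
    """One RPROP step over a weight array `w` (hypothetical interface).

    `delta` holds the per-weight update-values Delta_ij, `grad` and
    `grad_prev` the current and previous partial derivatives dE/dw_ij."""
    same_sign = grad * grad_prev              # > 0: sign kept, < 0: sign flipped

    # Adapt the update-values from the sign behaviour of the derivative only.
    delta = np.where(same_sign > 0, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(same_sign < 0, np.maximum(delta * eta_minus, delta_min), delta)

    # Weight-step: direction from the sign of the derivative, size from delta;
    # the magnitude of the derivative plays no role.
    w = w - np.sign(grad) * delta
    return w, delta, grad

# Typical initialisation (Delta_0 = 0.1 is the standard choice quoted later):
# delta = np.full_like(w, 0.1); grad_prev = np.zeros_like(w)
```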

10-5-10 Encoder

Algo.    ε/Δ₀    μ/ν     #epochs   σ     WR(ε/Δ₀)
BP       1.9     0.0     121       30    [1.1,2.6]
SSAB     2.0     0.8     55        11    [0.1,6.0]
QP       1.5     1.75    21        3     [0.1,3.0]
RPROP    2.0     -       19        3     [0.05,2.0]

As can be seen, the adaptive procedures RPROP, Quickprop and SuperSAB do much better than the original backpropagation algorithm, with respect to the convergence time as well as to the robustness of the choice of their parameter values (indicated by the width of the WR). The Quickprop algorithm still outperforms SuperSAB by a factor of 2.5, and is about 6 times as fast as original backpropagation.

The best result is achieved using RPROP, which learned the task in an average time of only 19 epochs. As shown by the width of the WR, the choice of the initial update-value Δ₀ is not critical either.

C. The 12-2-12 Encoder Problem

The 12-2-12 Encoder task is to learn an autoassociation of 12 input/output patterns. The network consists of 12 neurons in both the input and the output layer, and a hidden layer of only 2 neurons ('Tight Encoder'). The difficulty of this task is to find a sensitive encoding/decoding of a 12-bit input/output vector in only 2 hidden neurons. The family of 'Tight Encoder' tasks demonstrates the capability of a learning algorithm to find a difficult solution in weight-space. The table shows the results of the several learning algorithms, and the respective best parameter setting:

12-2-12 'Tight Encoder'

Algo.    ε/Δ₀    μ/ν     #epochs    σ     WR(ε/Δ₀)
BP       -       -       > 15,000   -     -
SSAB     1.0     0.95    534        90    [0.01,4.5]
QP       0.05    1.3     405        60    [0.05,1.1]
RPROP    1.0     -       322        112   [0.0001,5.0]

Although a wide variety of parameter values has been tested, original backpropagation was not able to find a solution for the task in less than 15,000 epochs. SuperSAB finally did its job, but acceptable learning times were achieved only when the initial momentum parameter was varied considerably. So despite the adaptivity of the learning-rates, several experiments were needed to find an optimal parameter set. Quickprop converges fast, but a number of trials were also needed to find a good value for each of its two parameters.

The best result was obtained using RPROP. On the one side, the algorithm converges considerably faster than all others; on the other side, a good parameter choice can easily be found (very broad WR). Figure 1 demonstrates the (averaged) behaviour of the adaptive algorithms on the 12-2-12 encoder task with respect to the variation of their initial learning parameter.



Figure 1: Behaviour of the average learning time for the 12-2-12 encoder task when varying the parameter ε resp. Δ₀. The figure shows the dependency of the several adaptive learning algorithms on a good estimate for their initial parameter values.

D. Nine Men's Morris

To show the performance of the learning procedures on more realistic problems, characterized by bigger networks and larger pattern sets, a network was trained to play the endgame of Nine Men's Morris [6]. The entire network is built up of two identical partial networks, linked by a 'comparator neuron'. Two alternative moves are presented to the respective partial networks, and the network has to decide which move is the better one. Each partial network has an input layer with 60 units, two hidden layers with 30 and 10 neurons respectively, and a single output unit. The pattern set consists of 90 patterns, each encoding two alternative moves and the desired output of the comparator neuron. The results for the Nine Men's Morris problem are listed below (see also Fig. 2):

Nine Men's Morris

Algo.    ε/Δ₀    μ/ν     #epochs   σ     WR(ε/Δ₀)
BP       0.2     0.5     98        34    [0.03,0.2]
SSAB     0.05    0.9     34        4     [0.01,0.3]
QP       0.005   1.75    34        12    [0.001,0.02]
RPROP    0.05    -       23        3     [0.05,0.3]
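The pairwise architecture described above can be sketched roughly as follows. The 60-30-10-1 layer sizes come from the text; the function names, the logistic activations, the use of shared weights for the two partial networks, and the way the comparator combines the two scores are illustrative assumptions:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def partial_net(move, weights):
    """Forward pass of one 60-30-10-1 partial network; both moves are scored
    with the same weights (an assumption; the text only says the two partial
    networks are identical)."""
    W1, W2, W3 = weights                  # shapes (60, 30), (30, 10), (10, 1)
    h1 = logistic(move @ W1)
    h2 = logistic(h1 @ W2)
    return logistic(h2 @ W3).item()       # scalar quality score for this move

def compare_moves(move_a, move_b, weights):
    """Comparator output for two encoded moves: feeding the difference of the
    two scores through a logistic unit is an assumed way to realize the
    'comparator neuron'; an output > 0.5 favours move_a."""
    return logistic(partial_net(move_a, weights) - partial_net(move_b, weights))
```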

After several trials to find good parameter values, SuperSAB is able to learn the task in approximately 1/3 of the time used by the original backpropagation algorithm. Quickprop also took an average of 34 epochs to learn the task, but the choice of its parameters was much easier compared to SuperSAB.

    d "

    25

    20

    1 510

    5n

    BP-uperSA B '..----.QP --RPROP-0 10 20 3 0 40 50 60 70 80# e p o c h s

Figure 2: Decrease of the error over learning time for the Nine Men's Morris task.

A further improvement was achieved using RPROP, which took only 23 epochs to learn. Again, the choice of the initial update-value Δ₀ was found to be fairly uncritical.

E. The Two Spirals Problem

The difficulty of the 'Two Spirals Problem' has been demonstrated in many attempts to solve the problem with backpropagation and several elaborate modifications. The pattern set consists of 194 patterns, describing the points of two distinct spirals in the x-y plane. The network is built up of 2 input units, three hidden layers with 5 units each, and 1 output unit with symmetric activation functions. Each unit is connected to every unit in earlier layers (short-cut connections [7]).

The results reported so far are an average of 20,000 epochs over three(!) different runs for backpropagation (using both an increasing learning-rate and momentum-factor), and an average of 11,000 epochs using a nonlinear 'cross-entropy' error function. Using Quickprop, an average of 7,900 epochs over three different runs is reported, but it should be noted that a (non-standard) arctan error function and (non-standard) weight decay were needed to obtain this result.

In order to get statistically relevant results, we tested RPROP on 20 different runs. A run was considered successful if a solution was found in less than 15,000 epochs. Weights were initialized randomly within [-0.25, 0.25]. The maximal update-value Δmax was chosen considerably small, i.e. Δmax = 0.001, to avoid an early occurrence of stuck units. The parameters η+, η− were set to their standard values. The result is listed below:

Two Spirals

Algorithm   Runs < 15,000 ep.   Min.    Average
RPROP       19/20               1500    4987



In comparison to the reported results, RPROP converges 4 times faster than backpropagation, and about 1.5 times faster than Quickprop, although both backpropagation and Quickprop needed non-standard extensions to converge at all. The fact that the pure RPROP algorithm was used, and that no further modifications had to be applied, again demonstrates the advantage of the new algorithm with respect to simplicity of use.
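The short-cut connections used in the two-spirals network (each unit connected to every unit in earlier layers [7]) can be sketched as follows. The 2-5-5-5-1 layer sizes come from the text; the function name, the weight layout, and the choice of tanh as the 'symmetric activation function' are assumptions:

```python
import numpy as np

def shortcut_forward(x, weights):
    """Forward pass with short-cut connections: every layer sees the
    concatenation of the network input and all previous hidden layers.

    `weights` is a list of matrices whose row count grows accordingly,
    e.g. shapes (2, 5), (7, 5), (12, 5), (17, 1) for a 2-5-5-5-1 network."""
    carried = x                            # running concatenation of all earlier outputs
    for W in weights:
        out = np.tanh(carried @ W)         # symmetric activation (tanh assumed)
        carried = np.concatenate([carried, out])
    return out                             # output of the final (single) unit
```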

F. Summary

In the following, the best results of the several learning algorithms are listed in an overview. The figures show the average number of epochs used on the learning tasks described above, plus the results on an additional figure recognition task. The second row from below shows the results of the RPROP algorithm using the standard parameter choice (Δ₀ = 0.1, Δmax = 50.0). As can be seen, very good results can be achieved even without annoying parameter tuning.

Average number of required epochs
Problem:   10-5-10   12-2-12   9 Men's Morris   Figure Rec.

IV. CONCLUSION

The proposed new learning algorithm RPROP is an easy-to-implement and easy-to-compute local learning scheme, which modifies the update-values for each weight according to the behaviour of the sequence of signs of the partial derivatives in each dimension of the weight-space.

The number of learning steps is significantly reduced in comparison to the original gradient-descent procedure as well as to other adaptive procedures, whereas the expense of computation of the RPROP adaptation process is held considerably small.

Another important feature, especially relevant in practical application, is the robustness of the new algorithm against the choice of its initial parameter. RPROP is currently being tested on further learning tasks, for example on pattern sets containing continuous target values. The results obtained so far are very promising and confirm the quality of the new algorithm with respect to both convergence time and robustness.

REFERENCES
[1] D. E. Rumelhart and J. McClelland. Parallel Distributed Processing. 1986.
[2] W. Schiffmann, M. Joost, and R. Werner. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, Institute of Physics, 1993.
[3] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), 1988.
[4] T. Tollenaere. SuperSAB: Fast adaptive backpropagation with good scaling properties. Neural Networks, 3(5), 1990.
[5] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, 1988.
[6] H. Braun, J. Feulner, and V. Ullrich. Learning strategies …
[7] K. Lang and M. Witbrock. Learning to tell two spirals apart. In Proceedings of the 1988 Connectionist Models Summer School.