Characterization of a Class of Sigmoid Functions with Applications to Neural Networks




Pergamon — Neural Networks, Vol. 9, No. 5, pp. 819–835, 1996

Copyright © 1996 Elsevier Science Ltd. All rights reserved. Printed in Great Britain.

0893-6080/96 $15.00 + .00    0893-6080(95)00107-7

CONTRIBUTED ARTICLE

Characterization of a Class of Sigmoid Functions with Applications to Neural Networks

ANIL MENON, KISHAN MEHROTRA, CHILUKURI K. MOHAN AND SANJAY RANKA

Syracuse University

(Received 21 November 1994; revised and accepted 15 August 1995)

Abstract—We study two classes of sigmoids: the simple sigmoids, defined to be odd, asymptotically bounded, completely monotone functions in one variable, and the hyperbolic sigmoids, a proper subset of simple sigmoids and a natural generalization of the hyperbolic tangent. We obtain a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and describe composition rules that illustrate how such functions may be synthesized from others. These results are applied to two problems. First we show that with respect to simple sigmoids, the continuous Cohen–Grossberg–Hopfield model can be reduced to the (associated) Legendre differential equations. Second, we show that the effect of using simple sigmoids as node transfer functions in a one-hidden-layer feedforward network with one summing output may be interpreted as representing the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending and complementing earlier results in this area. Copyright © 1996 Elsevier Science Ltd

Keywords—Sigmoid functions, Hypergeometric series, Legendre equation, Cohen–Grossberg–Hopfield model, Additive model, Fourier transform.

1. INTRODUCTION

Sigmoid functions, whose graphs are “S-shaped” curves, appear in a great variety of contexts, such as the transfer functions used in many neural networks.¹ Their ubiquity is no accident; these curves are among the simplest non-linear curves, striking a graceful balance between linear and non-linear behavior. Grossberg (1973) gives an illuminating and pertinent discussion of this point.

Figure 1 shows three sigmoidal functions, viz. the hyperbolic tangent y = tanh(x) (graph A), the “logistic” sigmoid y = 1/(1 + exp(−x)) (graph B), and the “algebraic” sigmoid y = x/√(1 + x²) (graph C). Figure 2 shows their inverses, x = tanh⁻¹(y), x = ln(y/(1 − y)), and x = y/√(1 − y²), labeled A′, B′ and C′ respectively. In a few cases, sigmoid curves can be described by formulae; this rubric includes power series expansions (e.g., hyperbolic tangent), integral expressions (e.g., error function), compositions of simpler functions (e.g., the Gudermannian function), inverses of functions definable by formulae [e.g., the “complexified” Langevin function, a

Acknowledgements: We would like to acknowledge the helpful comments of the reviewers of this paper. We would also like to thank K. George Joseph for several useful discussions.

Requests for reprints should be sent to K. Mehrotra, School of Computer and Information Science, 2-120 Center for Science and Technology, Syracuse University, Syracuse, NY 13244-4100, USA; e-mail: kishan@top.cis.syr.edu

¹ Other examples of the use of sigmoid functions are the logistic function in population models, the hyperbolic tangent in spin models, the Langevin function in magnetic dipole models, the Gudermannian function in special functions theory, the (cumulative) distribution functions in mathematical statistics, the piecewise approximators in nonlinear approximation theory, and the hysteresis curves in certain nonlinear systems.

FIGURE 1. Some sigmoids.




FIGURE 2. Some inverse sigmoids.

sigmoid defined as the inverse of the function 1/x − cot(x)], and differential equations.

Although the level of abstraction in many problems is such that one does not need to work with explicit formulae,² it is useful to study networks with specific transfer functions, as demonstrated by the following considerations.

² For example, in neural net approximation theory, significant results can be obtained about the existence of realizations within preassigned tolerances, with very few constraints on the nature of the node transfer function; classic results along these lines are found in Cybenko (1989), Diaconis and Shahshahani (1984), Funahashi (1989), and Hornik et al. (1989).

³ Another example of the non-equivalence of “sigmoids” is offered by the work of Macintyre and Sontag (1993) on the Vapnik–Chervonenkis (VC) dimension of feedforward networks, which showed that it is finite only for a class of sigmoidal functions they call the exp-RA functions. They showed that analyticity of the transfer function is crucial, and cannot be relaxed, for instance by making the function C∞.

1. In determining whether a single layered feedforward net is uniquely determined by its corresponding input–output map, Sussmann (1992) in his elegant proof of uniqueness specifically used the properties of the tanh(.) function. A later analysis by Albertini et al. (1993) obtained the same results with fewer assumptions on the node transfer function, but still requires such functions to be odd, and satisfy certain “independence” properties. With respect to the uniqueness problem, all node transfer functions are not equivalent.³

2. Without tractable analytical forms to work with, many problems relating to sigmoids are resistant to theory. Neural net theory offers many examples. For example, there have been claims in the literature about the advantage (with respect to computability, training times, etc.) of certain sigmoidal transfer functions over others in back-propagation networks (Elliot, 1993; Barrington, 1993). Some theoretical support comes from considering the first derivatives (if defined) of the various transfer functions proposed; the first derivatives are partially responsible for controlling the step size in the weight adjustment phase of the back-propagation algorithms, which in turn influences the rate of convergence. Explicit expressions for sigmoids are useful in such considerations.

3. The dynamical system describing the continuous Cohen–Grossberg–Hopfield (CGH) model (Grossberg, 1988), also known as the additive STM model and the Cohen–Grossberg model, raises an intriguing query. If one assumes a tanh(.) node transfer function, one can show that the CGH model is transformable to the Legendre differential equation (see Section 6.1). An important question is whether this relationship is robust with respect to the choice of the transfer function.

4. The recent study of sigmoidal derivatives by Minai and Williams (1993) is another case in point; they derived a connection with Eulerian numbers [see Graham et al. (1989) for definitions], but restricted their inquiry to the very specific logistic sigmoid. Any generalization of their results requires a careful look at sigmoids representable by formulae.

5. There are other related issues. For instance, the hyperbolic tangent and logistic sigmoid are essentially equivalent, in that one can be obtained from the other by simple translation and scaling transformations. Specifically,

1/(1 + exp(−x)) − 1/2 = (1/2) tanh(x/2).   (1.1)
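Equation (1.1) states that the logistic sigmoid is a shifted, rescaled hyperbolic tangent; a quick numerical sketch (Python; the helper name `logistic` is ours) confirms the identity at several points:

```python
import math

def logistic(x):
    """The 'logistic' sigmoid 1/(1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Check eqn (1.1): logistic(x) - 1/2 == (1/2) * tanh(x/2)
for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    lhs = logistic(x) - 0.5
    rhs = 0.5 * math.tanh(x / 2.0)
    assert abs(lhs - rhs) < 1e-12
```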

Many sigmoids have power series expansions which alternate in sign. Many have inverses with hypergeometric series expansions. On the other hand, many sigmoids have no such simple forms, or obvious connections with well-known sigmoids. It is natural to ask whether these varied analytical expressions for sigmoids have anything in common. It is difficult to answer such questions without a thorough understanding of the analytical expressions for sigmoid functions.

In view of these considerations, this paper undertakes a study of two classes of sigmoids: the simple sigmoids, defined to be odd, asymptotically bounded, completely monotone functions in one variable, and the hyperbolic sigmoids, a proper subset of simple sigmoids and a natural generalization of the hyperbolic tangent. The class of hyperbolic sigmoids includes a surprising number of well-known sigmoids.



The main contributions of the paper are as follows:

1. Simple sigmoids, hyperbolic sigmoids and their inverses are completely characterized in Sections 4 and 5.
2. Using series inversion techniques, in Section 5, we obtain the series expansions of hyperbolic sigmoids from those of their inverses. These results extend those of Minai and Williams (1993) for the logistic function.
3. In Section 4, we study the composition of simple sigmoids via differentiation, addition, multiplication, and functional composition. These results also completely specify the relationship between Euler's incomplete beta function and the parametrized sigmoids.
4. In Section 6.1 we show that the continuous Cohen–Grossberg–Hopfield equations belong to the class of non-homogeneous Legendre differential equations if the neural transfer function is a simple sigmoid.
5. In Section 6.2 we establish a connection between Fourier transforms and feedforward nets with one summing output and one hidden layer whose nodes contain simple sigmoidal transfer functions.

We do not purport to have discovered a general framework to describe all sigmoids; indeed, such a quest is largely meaningless; nor are we arguing for limiting the notion of sigmoids to the classes considered in this paper. Simple sigmoids are rather special sigmoids, but their regular structure often makes the related theory tractable, paving the way for more general analysis.

2. PRELIMINARIES

Notation: ℝ and ℝ⁺ denote real space, and the set of positive real numbers, respectively; (a, b) and [a, b] denote the open and closed intervals from a to b. If A is a set, then |A| is the cardinality of A. Given a function f, its domain and range are denoted by Dom(f) and Ran(f), respectively; f⁽ᵏ⁾ refers to the kth derivative of f (if it exists). Occasionally, we shall use f′(x) in place of f⁽¹⁾(x). If a function is k times continuously differentiable on a given interval I, then we write f ∈ Cᵏ(I). C∞ functions are called smooth functions. The term “propositions” refers to results cited from external sources.

The concepts of real analytic functions (Krantz & Parks, 1992), absolutely monotonic and completely monotonic functions (Widder, 1946) and hypergeometric functions (Erdélyi et al., 1953), are central to what follows; for convenience they are reviewed below.


DEFINITION 2.1 (real analyticity). Let U ⊆ ℝ be an open set. A function f : U → ℝ is said to be real analytic⁴ at x₀ ∈ U, if the function may be represented by a convergent power series on some interval of positive radius centered at x₀, i.e., f(x) = Σ_{j=0}^∞ a_j(x − x₀)ʲ. The function is said to be real analytic on V ⊆ U, if it is real analytic at each x₀ ∈ V. ■

DEFINITION 2.2 (monotonicity). A function f : ℝ → ℝ is absolutely monotonic in (a, b) iff it has non-negative derivatives of all orders, i.e., f ∈ C∞((a, b)) and,

f⁽ᵏ⁾(x) ≥ 0,  a < x < b,  k = 0, 1, 2, . . .   (2.1)

A function f : ℝ → ℝ is completely monotonic in (a, b), iff f(−x) is absolutely monotonic in (−b, −a). Equivalently, f is completely monotonic in (a, b) iff f ∈ C∞((a, b)) and,

(−1)ᵏ f⁽ᵏ⁾(x) ≥ 0,  a < x < b,  k = 0, 1, 2, . . .   (2.2)

A function f : ℝ → ℝ is completely convex in (a, b), iff f ∈ C∞((a, b)), and for all non-negative k and x ∈ (a, b), (−1)ᵏ f⁽²ᵏ⁾(x) ≥ 0. ■

A fundamental property of absolutely monotone and completely monotone functions is that they are necessarily real analytic on their domains.⁵ Additionally, if f is absolutely monotone on an interval I ⊆ ℝ, then it is non-negative, nondecreasing, convex, and continuous on I.

DEFINITION 2.3. The generalized Gauss hypergeometric (GH) series pFq(α₁, . . ., α_p; γ₁, . . ., γ_q; z) is defined by,

pFq(α₁, . . ., α_p; γ₁, . . ., γ_q; z) = Σ_{k=0}^∞ [(α₁)_k (α₂)_k · · · (α_p)_k / ((γ₁)_k (γ₂)_k · · · (γ_q)_k)] zᵏ/k!,  ∀i : γᵢ ≠ 0, −1, −2, . . .   (2.3)

where (a)_n = (a)(a + 1) · · · (a + n − 1) is the rising factorial or Pochhammer's polynomial in a. By definition, (a)₀ = 1. The αᵢs are the numerator

⁴ Real analytic functions are also referred to as regular, holomorphic, or monogenic functions.

⁵ This is the celebrated “Bernstein's theorem” (Pólya, 1974). In full, Bernstein's theorem asserts that given a function f(z), if infinitely many of its derivatives f⁽ⁿ¹⁾, f⁽ⁿ²⁾, . . . are of constant sign in the open interval I, and if the sequence n₁, n₂, . . . does not increase more rapidly than a geometric progression (i.e., there is a fixed quantity C, such that ∀k : n_{k+1}/n_k < C), then f(z) is analytic on the interval I.



parameters, and the γᵢs are referred to as the denominator parameters of the GH series. ■

In particular, the classical GH series⁶ in z, ₂F₁(α, β; γ; z), is defined by,

₂F₁(α, β; γ; z) = F(α, β; γ; z) = Σ_{k=0}^∞ [(α)_k (β)_k / (γ)_k] zᵏ/k!.   (2.4)

REMARK 2.1. The pFq representation of a hypergeometric series, though standard, is not a unique one. For example, the series Σ_{k=0}^∞ zᵏ/k! could be viewed as ₀F₀(;; z), or as ₁F₁(1; 1; z), or as ₃F₃(1, 1, 1; 1, 1, 1; z), etc. We shall henceforth use the “minimal” representation, in this case ₀F₀(;; z).

REMARK 2.2. In general, the parameters αᵢ and γᵢ, as well as the variable z, are allowed to be complex; however, we follow common practice and restrict our attention to real values, i.e., ∀i : αᵢ, γᵢ, z ∈ ℝ. Even with this restriction, the hypergeometric function is amazingly versatile. Spanier and Oldham (1987) list over 170 functions that are representable in terms of hypergeometric functions. The hypergeometric function is a periodic table à la Mendeleev for mathematical functions; different functions get neatly pegged into various groups⁷ by the values of the parameters and the form of the dependent variable.
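Partial sums of the series (2.3) are easy to evaluate by accumulating the ratio of successive terms; the following sketch (Python; `pFq` is our name, not a library routine) checks the partial sums against the known case y·F(1, 1/2; 3/2; y²) = tanh⁻¹(y):

```python
import math

def pFq(num, den, z, terms=120):
    """Partial sum of the generalized hypergeometric series (2.3).

    num, den: lists of numerator/denominator parameters.
    Successive terms are related by the ratio
    prod(a + k) / prod(g + k) * z / (k + 1).
    """
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        ratio = 1.0
        for a in num:
            ratio *= a + k
        for g in den:
            ratio /= g + k
        term *= ratio * z / (k + 1)
    return s

# Sanity check: y * 2F1(1, 1/2; 3/2; y^2) = tanh^{-1}(y) for |y| < 1
y = 0.5
assert abs(y * pFq([1.0, 0.5], [1.5], y * y) - math.atanh(y)) < 1e-10
```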

3. SIMPLE AND HYPERBOLIC SIGMOIDS

DEFINITION 3.1 (simple sigmoids). A function σ : ℝ → (−1, 1) is said to be a simple sigmoid if it satisfies the following conditions:

1. σ(.) is a smooth function, i.e., σ(.) is C∞.
2. σ(.) is an odd function, i.e., σ(−x) = −σ(x).
3. σ(.) has y = ±1 as horizontal asymptotes, i.e., lim_{x→∞} σ(x) = 1.
4. σ(x)/x is a completely convex function in (0, 1). ■

Simple sigmoids are required to be odd smooth functions bounded by horizontal asymptotes; the constraints impose a degree of standardization on the

⁶ The classical GH series is referred to as the Gauss function in the literature (Spanier and Oldham, 1987).

⁷ “There must be many universities today where 95 percent, if not 100 percent, of the functions studied by physics, engineering, and even mathematics students, are covered by this single symbol F(a, b; c; z).” —W. W. Sawyer, cited by Graham et al. (1989).

kinds of sigmoids being considered. The following results clarify the implications of the fourth constraint.

PROPOSITION 3.1 (Feller, 1965). A function f : (0, 1) → ℝ is absolutely monotone on (0, 1) iff it possesses a power series expansion with non-negative coefficients, converging for 0 < x < 1. ■

LEMMA 3.1. A function f : (0, 1) → ℝ is completely monotone on (0, 1) iff it possesses an alternating power series expansion, converging for 0 < x < 1.

Proof.⁸ If f is completely monotone in (0, 1), then the power series expansion of f in (0, 1) has to be alternating because (−1)ᵏ f⁽ᵏ⁾ ≥ 0. On the other hand, consider an alternating power series f(x) converging for all 0 < x < 1 and its derivatives:

f(x) = a₀ − a₁x + a₂x² − a₃x³ + · · ·
(−1)f⁽¹⁾(x) = a₁ − 2a₂x + 3a₃x² − · · ·   (3.1)
f⁽²⁾(x) = 2a₂ − 6a₃x + · · ·

where aᵢ ≥ 0 ∀i. From real analysis we know that each of (−1)ⁿ f⁽ⁿ⁾(x) has the same convergence properties as eqn (3.1). Also, the sum of a convergent infinite alternating series is always less than or equal to the first term. This fact, along with the above equations implies that (−1)ᵏ f⁽ᵏ⁾(x) ≥ 0, i.e., f(x) is completely monotone on (0, 1). ■

COROLLARY 3.1. σ(x)/x is a completely convex function in (0, 1) iff σ(√x)/√x is a completely monotone function in (0, 1).

Proof. If σ(x)/x is completely convex in (0, 1), then it has to be analytic in (0, 1) (Widder, 1946). Also, σ(x)/x is an even function, implying that its power series expansion will consist only of even powers in x, which alternate in sign. From Lemma 3.1, σ(√x)/√x will hence be completely monotone in (0, 1). The same argument suffices for the converse. ■

If a simple sigmoid is also strictly increasing, then a much stronger statement can be made, as demonstrated by the following proposition.

PROPOSITION 3.2 (Krantz & Parks, 1992). Let y = σ(x) be a strictly increasing simple sigmoid (i.e., ∀x ∈ ℝ, σ′(x) > 0). Then:

1. q ≡ σ⁻¹ : (−1, 1) → ℝ exists.

⁸ Lemma 3.1 appears to be “folklore”; we have been unable to find a reference.



2. q(y) is a strictly increasing function, analytic in the interval (−1, 1).

3. q′(y) = 1/σ′(q(y)), where q′ and σ′ are the first derivatives of q and σ respectively.

4. q(y)/y is absolutely monotone in (0, 1). ■

REMARK 3.1. If σ(x)/x is completely monotone on (0, 1) and σ is invertible then q(y)/y is absolutely monotone on (0, 1), where q denotes the inverse of σ. The converse is also true, and is an immediate consequence of Lemma 3.1.

REMARK 3.2. A simple sigmoid has two horizontal asymptotes, hence its inverse (if it exists) will have two vertical asymptotes (i.e., lim_{y→±1} q(y) → ±∞). It will be seen that as they have been defined, sigmoids and their inverses are quite similar; both are odd, increasing, univalent, analytical functions. However, the two differ fundamentally in that sigmoids are asymptotically bounded, while their inverses are not.

Simple sigmoids encompass many of the often used sigmoids described by formulae. The hyperbolic tangent and its close relative, the “exponential” or logistic sigmoid, are often used in many neural network theoretical studies and applications. For example, most of the “spin-glass explanations” of the CGH net use the hyperbolic tangent.⁹ The hyperbolic tangent has, among others, the following properties:

1. It is an odd, strictly increasing analytical function, asymptotically bounded by the lines y = ±1.
2. Its inverse tanh⁻¹(y) has a GH expansion given by yF(1, 1/2; 3/2; y²).
3. The first derivative of tanh⁻¹(y) is given by 1/(1 − y²) = ₁F₀(1; ; y²), i.e., the GH expansion of the first derivative of tanh⁻¹(y) is dependent on only one numerator parameter.

It can be shown that many other simple sigmoids, such as that of Elliot (1993), and the Gudermannian (Section 4.2), also have inverses with classical GH series representations.¹⁰ The function tanh⁻¹(y)/y satisfies a second order linear homogeneous differential equation, with three regular singular points, located at 0, 1 and ∞. A sigmoid with similar analytical behavior could be expected to have an

⁹ Stochastic versions of neural nets often replace deterministic state assignment rules by probabilistic ones, obtained from some distribution—usually the Gibbs distribution (e.g., Boltzmann machines and stochastic CGH models). Computing expected values for the states of the system then leads to the hyperbolic tangent function. See Hertz et al. (1991) for a typical example.

¹⁰ The phenomenon is not unduly surprising. A heuristic explanation is that if the graphs of two functions “look” the same, their respective differential equations are usually members of the same family.

inverse that is a solution to some second order Fuchsian equation.¹¹ Since any second order Fuchsian equation with three singularities can be transformed into the Gauss hypergeometric differential equation, one solution of which is the classical GH series (Klein–Bôcher theorem; Whittaker & Watson, 1927), it follows that the inverses would have classical series expansions. These considerations motivate the following definition.

DEFINITION 3.2 (hyperbolic sigmoids). A function σ : ℝ → (−1, 1) is said to be a hyperbolic sigmoid function if it satisfies the following conditions:

1. σ is a real analytic, odd, strictly increasing sigmoid, such that lim_{x→∞} σ(x) = 1.

2. Let q : (−1, 1) → ℝ denote the inverse of σ, and q′ its first derivative. Then,

(a) q(y)/y has a Gauss hypergeometric series expansion in y² with at most three parameters.

(b) q′(y) has a Gauss hypergeometric series expansion in y² with at most one parameter.

4. CHARACTERIZATION: INVERSE HYPERBOLIC SIGMOIDS

The following result is a complete characterization for the inverses of hyperbolic sigmoids. Proofs are presented in the Appendix.

THEOREM 4.1 (inverses). Let y = σ(x) be a hyperbolic sigmoid, and let q : (−1, 1) → ℝ be its inverse. Then, either

q(y) = yF(α, 1/2; 3/2; y²),  α ≥ 0   (4.1)

or

q(y) = yF(α, −; −; y²) = y/(1 − y²)^α,  α > 0   (4.2)

where, by F(α, −; −; y²), we mean F(α, β; β; y²) (β ∈ ℝ). ■

Notation. Each inverse hyperbolic sigmoid is denoted by q_α and is characterized by a single parameter α.

COROLLARY 4.1. The set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids.

¹¹ Fuchsian equations are linear differential equations each of whose singular points are regular (Rainville, 1964).



A proof of Corollary 4.1 may be given along the following lines. If σ is a hyperbolic sigmoid, then it is simple on the interval (−1, 1). This follows from Theorem 4.1. The series representation for its inverse in (−1, 1) has non-negative coefficients, and this implies q(y)/y is absolutely monotone (Proposition 3.1). Hence σ(x)/x is completely monotone, and therefore simple (Lemma 3.1 and Remark 3.1). The converse is not true. Simple sigmoids need not be hyperbolic. The error function erf(.) is simple, but one can use the study of Carlitz (1963) on this function to show that it does not have an inverse representable by a classical hypergeometric series. It follows that erf(.) is not a hyperbolic sigmoid, and hence the set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids. ■

For specific values of its parameters, the hypergeometric function often reduces to other well known special functions. When inverse hyperbolic sigmoids are characterized by eqn (4.1), there is an intimate connection with Euler's incomplete beta function.

PROPOSITION 4.1 (Spanier and Oldham, 1987). Let α, β and γ be such that γ = β + 1. Then,

F(α, β; γ; z) = β z^{−β} B(β; 1 − α; z)   (4.3)

where B(ν; μ; z) is the incomplete beta function, defined by ∫₀^z t^{ν−1}(1 − t)^{μ−1} dt, where 0 ≤ z ≤ 1. In particular,

(1/2) B(1/2; 1 − α; z²) = ∫₀^{tanh⁻¹(z)} cosh^{2(α−1)}(t) dt. ■

Spanier and Oldham (1987) give a detailed description of the many properties of this important special function. The following corollary is an immediate consequence of Theorem 4.1 and Proposition 4.1. It gives the connection between inverse hyperbolic sigmoids and Euler's incomplete beta function.

COROLLARY 4.2. If q_α(y) = yF(α, 1/2; 3/2; y²), then

q_α(y) = (1/2) B(1/2; 1 − α; y²). ■

The relationship between hyperbolic sigmoids and the incomplete beta function also makes explicit the relationship between the hyperbolic tangent function, and inverse hyperbolic sigmoids of form yF(α, 1/2; 3/2; y²). Other consequences include:

1. The availability of good approximations for small values of y and (1 − y).
2. Rapidly converging series expansions for y close to 1.
3. Connections with other indefinite integrals of powers of trigonometric or hyperbolic functions.
4. Connections with statistics via the function L(p, q) (Spanier and Oldham, 1987).

When inverse hyperbolic sigmoids are characterized by eqn (4.2), we can use the identity,

cosh(tanh⁻¹(y)) = 1/√(1 − y²)   (4.4)

to show that,

y cosh^{2α}(tanh⁻¹(y)) = y/(1 − y²)^α.   (4.5)

The fundamental role played by the hyperbolic tangent is once again evident. Here, it relates the two types of hyperbolic sigmoids defined by eqns (4.1) and (4.2).
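Identities (4.4) and (4.5) are easy to confirm numerically; in this Python sketch the helper name `cosh_artanh` is ours:

```python
import math

def cosh_artanh(y):
    """Left side of eqn (4.4): cosh(tanh^{-1}(y)) for |y| < 1."""
    return math.cosh(math.atanh(y))

# Eqn (4.4): cosh(tanh^{-1}(y)) = 1/sqrt(1 - y^2)
for y in [-0.8, -0.2, 0.4, 0.9]:
    assert abs(cosh_artanh(y) - 1.0 / math.sqrt(1.0 - y * y)) < 1e-12

# Eqn (4.5): y * cosh^{2a}(tanh^{-1}(y)) = y / (1 - y^2)^a
for y in [-0.8, 0.4]:
    for a in [0.5, 1.0, 3.0]:
        lhs = y * cosh_artanh(y) ** (2 * a)
        rhs = y / (1.0 - y * y) ** a
        assert abs(lhs - rhs) < 1e-9
```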

4.1. New Inversesfrom Old


LEMMA 4.1. If q_α : (−1, 1) → ℝ is an inverse hyperbolic sigmoid, then the functions q_{α+1} and q_{α−1} defined by:

q_{α+1}(y) = [y^{2(1−α)}/(2α)] (d/dy)[y^{2α−1} q_α(y)],  α > 0   (4.6)

q_{α−1}(y) = [y^{2α−1}/((3 − 2α)(1 − y²)^{α−2})] (d/dy)[y^{2(1−α)}(1 − y²)^{α−1} q_α(y)],  α > 2   (4.7)

are also inverse hyperbolic sigmoids. Also, there exist functions K₁(α, z), K₂(α, z) and K₃(α, z) such that the following relation holds:

K₁(α, z) q_{α−1}(y) + K₂(α, z) q_α(y) + K₃(α, z) q_{α+1}(y) = 0.   (4.8)

Proof. Equation (4.6), which defines q_{α+1}(y), results from the following identity:

(α)_n z^{α−1} F(α + n, β; γ; z) = (dⁿ/dzⁿ)[z^{α+n−1} F(α, β; γ; z)].   (4.9)



In the following, we will use F(θ) as an abbreviation for F(θ; β; γ; z). Equation (4.7) follows from the identity:

(γ − α)_n z^{γ−α−1}(1 − z)^{α+β−γ−n} F(α − n) = (dⁿ/dzⁿ)[z^{γ−α+n−1}(1 − z)^{α+β−γ} F(α)].   (4.10)

Equation (4.8) relating q_{α−1}(y), q_α(y), and q_{α+1}(y), is a consequence of the identity:

(γ − α)F(α − 1) + (2α − γ − αz + βz)F(α) + α(z − 1)F(α + 1) = 0.   (4.11)

Inverse hyperbolic sigmoids come in two forms; one form has three parameters [eqn (4.1)], while the other has two “missing” parameters [eqn (4.2)]. Subject to a minor condition, the latter form is always obtainable from the former.

LEMMA 4.2. Let q_α = yF(α, 1/2; 3/2; y²), where α > 1. Then the function q_{α−1} defined by:

q_{α−1}(y) = y(1 − y²) q′_α(y) = yF(α − 1, −; −; y²)   (4.12)

is an inverse hyperbolic sigmoid, with parameter α − 1. ■

For inverse hyperbolic sigmoids with “missing” parameters, there is a very simple composition rule.

LEMMA 4.3. If q_α(y) = y/(1 − y²)^α and q_{α′}(y) = y/(1 − y²)^{α′} are two inverse hyperbolic sigmoids with α, α′ > 0, then the function (q_α(y) q_{α′}(y))/y is also an inverse hyperbolic sigmoid with parameter (α + α′). ■
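The composition rule of Lemma 4.3 can be checked numerically; in this Python sketch the helper name `q` is ours:

```python
def q(y, alpha):
    """Inverse hyperbolic sigmoid of the 'missing parameter' form (4.2)."""
    return y / (1.0 - y * y) ** alpha

# Lemma 4.3: (q_a(y) * q_a'(y)) / y equals q_{a+a'}(y) on (-1, 1), y != 0
for y in [-0.9, -0.3, 0.2, 0.7]:
    for a, ap in [(0.5, 1.0), (2.0, 0.25)]:
        assert abs(q(y, a) * q(y, ap) / y - q(y, a + ap)) < 1e-9
```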

In general, the set of inverse hyperbolic sigmoids is not closed under multiplication or addition. But if q_α and q_{α′} are inverses of two hyperbolic sigmoids then their sum would also be an inverse hyperbolic sigmoid q_μ for some μ ∈ ℝ, i.e., q_α + q_{α′} = Kq_μ, for some K, if and only if

q_α + q_{α′} = Kq_μ ⇔ (α)_n + (α′)_n = K(μ)_n  ∀n ≥ 1   (4.13)

which in turn, is possible¹² if and only if α = α′, or α = 0, or α′ = 0.

¹² Equation (4.13), with K = 1, provides an amusing application for Fermat's last theorem; if we accept that for all n > 2, there cannot exist positive integers a, b and c satisfying the identity aⁿ + bⁿ = cⁿ, then we may conclude that the sum of inverse hyperbolic sigmoids with different integral parameters cannot be an inverse hyperbolic sigmoid with an integral parameter.

The definition of hyperbolic sigmoids implies that their inverses have GH expansions in y². Theorem 4.2 relaxes this requirement by only requiring GH expansions in some odd, injective C¹ function g(y). A proof is provided in the Appendix.

THEOREM 4.2. Let σ : ℝ → (−1, 1) be a real analytic, odd, strictly increasing sigmoid, such that its inverse q : (−1, 1) → ℝ has a GH series expansion in some injective, odd, increasing C¹ function g(.), with at most three parameters, convergent in (−1, 1). Also let q′ have a GH series expansion in g(.), with at most one parameter. Then, either

q(y) = g(y)F(α, 1/2; 3/2; (g(y))²)   (4.14)

or,

q(y) = g(y)F(α, −; −; (g(y))²) = g(y)/(1 − (g(y))²)^α  with α > 0   (4.15)

provided

g′(y) ∝ (1 − y²)^{α + …}

where g′(.) is the first derivative of g(.). ■

In the case g(y) = y, we obtain the characterization for inverse hyperbolic sigmoids. Another interesting special case is when g(y) = q(y), where q(y) is an inverse hyperbolic sigmoid (since q(y) is an injective, smooth, odd, increasing function the conditions of the theorem are satisfied). The elementary composition rules presented here allow the generation of an infinite variety of inverse hyperbolic sigmoids.¹³ The next section presents some examples.

4.2. Examples

Any function of the form y/(1 − y²)^α, where α > 0, is the inverse of a hyperbolic sigmoid. For example, for α = 1/2, the function y/√(1 − y²) is the inverse of the hyperbolic sigmoid x/√(1 + x²).
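For the α = 1/2 pair, the inversion can be verified directly; a small Python sketch (the names `q` and `sigma` are ours) checks that the two functions undo each other:

```python
import math

def q(y):
    """Inverse sigmoid y / (1 - y^2)^(1/2), for |y| < 1."""
    return y / math.sqrt(1.0 - y * y)

def sigma(x):
    """The corresponding hyperbolic sigmoid x / (1 + x^2)^(1/2)."""
    return x / math.sqrt(1.0 + x * x)

# q and sigma are mutual inverses on their domains
for x in [-5.0, -1.0, 0.0, 0.3, 4.0]:
    assert abs(q(sigma(x)) - x) < 1e-9
```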

Of all inverse hyperbolic sigmoids of the form

¹³ An interesting case is the piecewise rational sigmoid defined by Elliot (1993), σ(x) = x/(1 + |x|). Although its inverse q(y) = y/(1 − |y|) does not fit in an obvious way into the framework developed in the last few sections, it is easy to relax the conditions placed on g(y), in Theorem 4.2, so as to include this sigmoid as well.



yF(α, 1/2; 3/2; y²), the function tanh(.) is noteworthy: first, it corresponds to the case α = 1; secondly, all inverse hyperbolic sigmoids with integral values of α may be generated from tanh(x) by a process of differentiation (Lemma 4.1); and thirdly, it is a function often encountered in neural nets. As was mentioned in the Introduction, the logistic function may be thought of as a translated and scaled version of the hyperbolic tangent.

There is a good example of the hypergeometric composition described in Theorem 4.2. Since tan(βy) is an odd, injective, smooth, increasing function of y (for some constant β > 0), from Theorem 4.2, one may conclude that for positive α, the function tan(βy)F(α, 1/2; 3/2; tan²(βy)) is the inverse of some real analytic, odd, strictly increasing sigmoid. It turns out that the inverse Gudermannian function¹⁴ may be obtained from this function, by choosing α = 1 as follows:

gd⁻¹(y) = ln(sec(y) + tan(y)),  −π/2 < y < π/2
        = 2 tan(y/2) F(1, 1/2; 3/2; tan²(y/2)).

Many such examples could be generated.¹⁵
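The Gudermannian identity above is easy to check numerically, using the closed form F(1, 1/2; 3/2; t²) = artanh(t)/t (a sketch of ours, not from the paper):

```python
import math

def gd_inv(y):
    # ln(sec(y) + tan(y))
    return math.log(1.0 / math.cos(y) + math.tan(y))

def gd_inv_hypergeom(y):
    # 2 t F(1, 1/2; 3/2; t^2) with t = tan(y/2),
    # where F(1, 1/2; 3/2; t^2) = artanh(t) / t
    return 2.0 * math.atanh(math.tan(y / 2.0))

for y in [-1.2, -0.3, 0.4, 1.5]:
    assert abs(gd_inv(y) - gd_inv_hypergeom(y)) < 1e-9
```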

5. CHARACTERIZATION: HYPERBOLIC SIGMOIDS

It is often desirable and necessary to work with sigmoids themselves, rather than their inverses. In this section, we obtain power series expansions of sigmoids.

If x = q(y) is an inverse hyperbolic sigmoid, then y = σ(x) = q⁻¹(x) must have a Maclaurin series expansion of the following form:

σ(x) = Σ_{k≥0} b_{2k+1} x^{2k+1}/(2k + 1)!.

We are interested in determining the coefficients {b_{2k+1}}_{k≥0} associated with the inverse hyperbolic sigmoids

y/(1 − y²)^α and yF(α, 1/2; 3/2; y²).

5.1. Hyperbolic Sigmoids of the First Kind

When an inverse hyperbolic sigmoid is of the form

¹⁴ The inverse Gudermannian function finds use in relating circular and hyperbolic functions, without the use of complex functions.

¹⁵ The tables of Spanier and Oldham (1987), and Hansen (1975) in particular, contain many such functions and expansions.

y/(1 − y²)^α, a remarkably explicit form for the coefficients {b_{2k+1}}_{k≥0} may be given:

THEOREM 5.1 (hyperbolic sigmoids-I). If the inverse sigmoid is given by y/(1 − y²)^α, α > 0, then in some neighborhood of the origin, we have the valid expansion

σ(x) = Σ_{k≥0} b_{2k+1} x^{2k+1}/(2k + 1)!

where,

b_{2k+1} = (−1)^k (2k)! binom((2k + 1)α, k),   (5.1)

with binom(a, k) = a(a − 1)⋯(a − k + 1)/k! the generalized binomial coefficient.

Proof. See the Appendix. ■
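The closed form in Theorem 5.1 can be checked against the defining relation q(σ(x)) = x; the following sketch (ours) uses the generalized binomial coefficient and a truncated series:

```python
import math

def gen_binom(a, k):
    # generalized binomial coefficient: a(a-1)...(a-k+1) / k!
    p = 1.0
    for j in range(k):
        p *= a - j
    return p / math.factorial(k)

def b(k, alpha):
    # b_{2k+1} = (-1)^k (2k)! binom((2k+1) alpha, k), as in eqn (5.1)
    return (-1) ** k * math.factorial(2 * k) * gen_binom((2 * k + 1) * alpha, k)

def sigma_series(x, alpha, terms=12):
    # truncated Maclaurin series for the hyperbolic sigmoid of the first kind
    return sum(b(k, alpha) * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

alpha, x = 1.0, 0.1
y = sigma_series(x, alpha)
assert abs(y / (1.0 - y * y) ** alpha - x) < 1e-12   # q(sigma(x)) = x
```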

5.2. Hyperbolic Sigmoids of the Second Kind

When an inverse hyperbolic sigmoid is of the form x = yF(α, 1/2; 3/2; y²), the problem is much harder. The Lagrange inversion formula leads to an intractable expression. Kamber's formulae, as presented by Goodman (1983), can be used to give explicit expressions for the coefficients. Unfortunately, the resulting expressions involve determinants, and are of little computational value. The method of repeated differentiation is more successful. The starting point for this line of attack is the observation that if x = q(y) is an inverse hyperbolic sigmoid, then:

dx/dy = (d/dy) q(y) = q′(y) = 1/(1 − y²)^α.   (5.2)

From Theorem 3.2, we see that for y = σ(x),

dy/dx = (d/dx) σ(x) = 1/q′(y) = (1 − y²)^α.   (5.3)

By virtue of eqn (5.3), we can compute the higher derivatives of σ(·) and hence compute

b_{2k+1} = d^{2k+1} σ(x)/dx^{2k+1} |_{x=0}.

Note that dy/dx is expressed in terms of y; this necessitates the use of the chain rule. For example, to calculate the second derivative:

d²y/dx² = (dy/dx) (d/dy)(1 − y²)^α = (1 − y²)^α (−2αy)(1 − y²)^{α−1} = −2αy(1 − y²)^{2α−1}.   (5.4)


The following theorem presents an efficient way to implement this procedure.

THEOREM 5.2 (hyperbolic sigmoids-IIA). Let the inverse hyperbolic sigmoid be q = yF(α, 1/2; 3/2; y²), and σ ≡ q⁻¹. Let D ≡ d/dx. Then,

D^n(y) = D^n(σ(x)) = G_{n−1}(y)(1 − y²)^{nα}   (5.5)

where G_n : (−1, 1) → ℝ is a function satisfying the recursion

G₀(y) = 1,
G_n(y) = (d/dy)G_{n−1}(y) − (2nαy/(1 − y²)) G_{n−1}(y),  n ≥ 1.   (5.6)

In particular, b_{2k} = 0, and b_{2k+1} = D^{2k+1}(σ(x))|_{x=0} = G_{2k}(0).

Proof. Proved by induction on n. ■

While the procedure implicit in Theorem 5.2 is efficient, it does involve the computation of the derivative of G_n(y). Equation (5.6) is a partial difference equation with variable coefficients. Therefore, there is little hope of solving it in any generality and obtaining a closed form expression. Even more sophisticated methods, such as Truesdell's generating function technique and Weisner's group theoretic approach (McBride, 1970), do not give any special insight into the nature of the polynomials G_n(y).¹⁶ The next theorem offers a somewhat different approach to the method of repeated derivatives.

THEOREM 5.3 (hyperbolic sigmoids-IIB). Let

σ(x) = Σ_{k≥0} b_{2k+1} x^{2k+1}/(2k + 1)!

be an expansion for a hyperbolic sigmoid, whose inverse is of the form yF(α, 1/2; 3/2; y²), valid in some neighborhood of the origin. Then b_{2k} = 0, and b_{2k+1} = C(2k + 1, k), where the sequence C(n, k)

¹⁶ Equation (5.6) is a differential-difference system of the ascending type; it can then be shown that the polynomials {G_n(y)}_{n≥1} satisfy Truesdell's F-equation. Unfortunately, the resulting generating function for G_n(y) is too complicated for any practical use.

satisfies:

C(1, 0) = 1
C(n, k) = 0 for k ≥ n or k < 0
C(n + 1, k) = (2k − n + 1)C(n, k) − 2(nα − k + 1)C(n, k − 1),  n ≥ 1,   (5.7)

where n and k are natural numbers. Furthermore, D^n(σ(x)), the nth derivative of σ, is given by:

D^n(y) = D^n(σ(x)) = Σ_{k=0}^{n−1} C(n, k) y^{2k−n+1}(1 − y²)^{nα−k},  for n ≥ 1.   (5.8)

Proof. See the Appendix. ■

The recursive system described by eqn (5.7) does not involve any differentiation. The desired value b_{2k+1} may be obtained by computing the value of C(2k + 1, k). Equation (5.8) gives information about the shapes of the derivatives of the hyperbolic sigmoid. From eqn (5.7),

b₁ = 1,  b₃ = −2α,   (5.9)
b₅ = 4α(7α − 3),  b₇ = −8α(127α² − 123α + 30).   (5.10)
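The recursion (5.7) runs directly in exact arithmetic; the short script below (ours) reproduces the coefficients quoted in eqns (5.9) and (5.10) for a sample α:

```python
from fractions import Fraction

def C_table(n_max, alpha):
    # C(n, k) from the recursion (5.7), computed exactly over the rationals
    C = {(1, 0): Fraction(1)}
    get = lambda n, k: C.get((n, k), Fraction(0))
    for n in range(1, n_max):
        for k in range(n + 1):
            C[(n + 1, k)] = ((2 * k - n + 1) * get(n, k)
                             - 2 * (n * alpha - k + 1) * get(n, k - 1))
    return C

a = Fraction(3, 2)                        # a sample value of alpha
C = C_table(7, a)
assert C[(1, 0)] == 1                     # b1 = 1
assert C[(3, 1)] == -2 * a                # b3 = -2 alpha
assert C[(5, 2)] == 4 * a * (7 * a - 3)   # b5 = 4 alpha (7 alpha - 3)
assert C[(7, 3)] == -8 * a * (127 * a * a - 123 * a + 30)   # b7
```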

Theorem 5.3 may be viewed as a generalization of the work of Minai and Williams (1993) on the derivatives of the logistic sigmoid. They obtained relations similar to eqn (5.7).¹⁷ In general, eqn (5.7) is a partial difference equation with variable coefficients, and the system does not appear to be related to any well known sets of numbers. Obtaining a closed form solution for the numbers C(n, k) appears to be intractable.

6. APPLICATIONS

In this section, we present two applications. The first shows that if the neural network transfer function is a hyperbolic sigmoid, then the dynamical equations describing the CGH neural network (Hopfield & Tank, 1986; Grossberg, 1988) can be transformed into a set of non-homogeneous associated Legendre differential equations. Some conclusions regarding the behavior of the CGH model can be drawn as the outputs saturate (i.e., output → ±1).

The second application derives an interesting connection between Fourier transforms and one-hidden-layer feedforward nets (one-HL nets). Subject

¹⁷ Interestingly, in the case of the logistic sigmoid, these relations happened to be the recursions corresponding to the Eulerian numbers (Graham et al., 1989). In other words, the coefficients arising in the computation of higher order derivatives of the logistic sigmoid turn out to be the Eulerian numbers.


to an additional minor constraint, we show that the use of one-HL nets with simple sigmoidal transfer functions for function approximation is tantamount to assuming that the function being approximated is the product of two functions: one the derivative of a bounded non-negative function, and the other satisfying some linear nth order differential equation, where n is the number of nodes in the hidden layer.

6.1. Continuous CGH Nets and Legendre Differential Equations

The continuous CGH network with N neurons is described by the following dynamics:¹⁸

du_i/dt + g_i u_i = Σ_j T_ij v_j + Z_i ≡ E_i = −∂E/∂v_i,  ∀i ∈ {1, …, N}   (6.1)

where u_i and v_i are the net input and net output of the ith neuron, respectively, Z_i is a constant external excitation, and E is the "energy" of the network, given by:

E = −(1/2) Σ_{i,j} T_ij v_i v_j − Σ_i v_i Z_i.   (6.2)
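A finite-difference check (ours, with made-up numbers) confirms that E_i = Σ_j T_ij v_j + Z_i is indeed −∂E/∂v_i for the quadratic energy (6.2), provided T is symmetric:

```python
# Made-up symmetric weights T, excitations Z, and outputs v.
T = [[0.0, 0.5, -0.2],
     [0.5, 0.0, 0.3],
     [-0.2, 0.3, 0.0]]
Z = [0.1, -0.4, 0.2]
v = [0.3, -0.1, 0.5]

def energy(v):
    # quadratic energy of eqn (6.2)
    quad = sum(T[i][j] * v[i] * v[j] for i in range(3) for j in range(3))
    return -0.5 * quad - sum(v[i] * Z[i] for i in range(3))

eps = 1e-6
for i in range(3):
    vp = list(v); vp[i] += eps
    vm = list(v); vm[i] -= eps
    dE = (energy(vp) - energy(vm)) / (2 * eps)   # central difference for dE/dv_i
    E_i = sum(T[i][j] * v[j] for j in range(3)) + Z[i]
    assert abs(E_i + dE) < 1e-8                  # E_i = -dE/dv_i
```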

Assume −1 < v_i < 1. Let v_i = σ(u_i), where σ(·) is a hyperbolic sigmoid. Let q ≡ σ⁻¹, or, u_i = q(v_i). There are two cases to consider.

Case I. Let q(v_i) = v_i F(α, 1/2; 3/2; v_i²). In this case

dq/dv_i = 1/(1 − v_i²)^α   (6.3)

and substituting eqn (6.3) into eqn (6.1), we get:

(1 − v_i²)^{−α} dv_i/dt + g_i q(v_i) = E_i.   (6.4)

The following sequence of operations is applied to eqn (6.4):

1. Substitute y_i = dv_i/dt, and differentiate with respect to v_i.

¹⁸ The synaptic weights T_ij are assumed to be symmetric. It makes the derivations simpler with no loss in generality.

2. Multiply throughout by (1 − v_i²)^{α+1}.
3. Differentiate once more with respect to v_i.

Equation (6.4) is then transformed into:

(1 − v_i²) d²y_i/dv_i² − 2(1 − α) v_i dy_i/dv_i + 2α y_i = Q_i   (6.5)

where

Q_i = d/dv_i [(1 − v_i²)^{α+1} dE_i/dv_i] + 2g_i v_i.

Finally, substitute y_i = Z_i(1 − v_i²)^{α/2} in eqn (6.5), yielding

(1 − v_i²) d²Z_i/dv_i² − 2v_i dZ_i/dv_i + [α(α + 1) − α²/(1 − v_i²)] Z_i = R_i   (6.6)

where R_i = (1 − v_i²)^{−α/2} Q_i. The associated Legendre differential equation is of the form (Spanier & Oldham, 1987),

(1 − x²) d²f/dx² − 2x df/dx + [ν(ν + 1) − μ²/(1 − x²)] f = 0.   (6.7)

It is clear that the left hand side in eqn (6.6) is the associated Legendre differential equation with parameters μ = −α [eqn (6.5) requires us to choose μ = −α, rather than +α], and ν = α. In other words, the continuous CGH model with a neural transfer function given by q(v_i) = v_i F(α, 1/2; 3/2; v_i²) is reducible to the non-homogeneous associated Legendre differential equation with parameters μ = −α and ν = α.

Case II. Let q(v_i) = v_i F(α, −; −; v_i²). An analogous approach leads to the very same conclusion as in Case I, i.e., it is possible to transform the continuous CGH equation with the above transfer function to a non-homogeneous associated Legendre equation. However, the right hand side of the transformed equation is complicated and we do not consider this case further.

We emphasize that the link between the continuous CGH equation and the Legendre differential equation is not accidental, given that it can be established for all hyperbolic sigmoidal transfer functions. For u_i = tanh⁻¹(v_i), α = 1, and the above equations have a rather elementary form.

An immediate application of the above transformation is in studying the saturation behavior of the CGH neural net. By saturation, we mean that the


outputs of the neurons tend to ±1. As discussed by Hopfield and Tank (1986), this usually occurs when the network is heading towards a critical point (local or global). Saturation implies that as a node output v_i → ±1, the quantity R_i → 0. In other words, we may study the saturation behavior of the continuous CGH model by considering the homogeneous version of eqn (6.6), viz.,

(1 − v_i²) d²Z_i/dv_i² − 2v_i dZ_i/dv_i + [α(α + 1) − α²/(1 − v_i²)] Z_i = 0.   (6.8)

From the theory of associated Legendre equations, it is seen that eqn (6.8) has a solution in terms of the associated Legendre functions P_ν^{(μ)}(x) and Q_ν^{(μ)}(x) (Erdélyi et al., 1953). Here, μ = −α, ν = α, and x = v_i, and we have:

Z_i = (1 − v_i²)^{−α/2} dv_i/dt = c₁ P_α^{(−α)}(v_i) + c₂ Q_α^{(−α)}(v_i).   (6.9)

Neglecting the effect of g_i, as is common practice, we obtain from eqn (6.4):

E_i = (1 − v_i²)^{−α} dv_i/dt.   (6.10)

Equation (6.9), in conjunction with eqn (6.10), implies:

E_i = −∂E/∂v_i = (1 − v_i²)^{−α/2} [c₁ P_α^{(−α)}(v_i) + c₂ Q_α^{(−α)}(v_i)].   (6.11)

Equation (6.11) in conjunction with eqn (6.2) implies that the overall energy at saturation may be written as follows:

E = −Σ_i ∫^{v_i} (1 − t²)^{−α/2} [c₁ P_α^{(−α)}(t) + c₂ Q_α^{(−α)}(t)] dt.   (6.12)

Here, E_i does not depend on E_j for i ≠ j. Thus, to a crude first approximation, the CGH network "dissociates" at saturation into independent units, and the quadratic energy function may be written as a linear sum of non-linear univalent functions, given by eqns (6.11) and (6.12).

We wish to stress the possibilities revealed by dealing with the CGH equation in a general context. For example, eqn (6.6) reads

(1 − v_i²) d²Z_i/dv_i² − 2v_i dZ_i/dv_i + [α(α + 1) − α²/(1 − v_i²)] Z_i = R_i   (6.13)

where R_i = (1 − v_i²)^{−α/2} Q_i, and

Q_i = d/dv_i [(1 − v_i²)^{α+1} dE_i/dv_i] + 2g_i v_i.

Consider the case when Q_i = K is a constant. Then the above equation reduces to the non-homogeneous equation,

(1 − v_i²) d²Z_i/dv_i² − 2v_i dZ_i/dv_i + [α(α + 1) − α²/(1 − v_i²)] Z_i = K(1 − v_i²)^{−α/2}   (6.14)

which may be solved using the special functions defined and described by Babister (1967). Equation (6.14) first arose in the context of solving for Poisson's equation in spherical polar coordinates.

6.2. Fourier Transforms and Feedforward Nets

There have been many different attempts to describe the behavior of feedforward networks, such as the group theoretic analysis of the perceptron proposed by Minsky and Papert (1988), the space partition (via hyperplanes) interpretation discussed by Lippmann (1987) (and many others), the metric synthesis viewpoint introduced by Pao and Sobajic (1987), and the statistical interpretation emphasized by White (1989). Gallant and White (1988) showed that a one-HL feedforward net with "monotone cosine" squashing at the hidden layer, and a summing output node, embeds as a special case a "Fourier network" that yields a Fourier series approximation to a given function as its output. We present a related construction in this section; it is shown that one-hidden-layer (one-HL) nets with simple sigmoidal convex transfer functions (at the hidden layer), and a single summing output, can be thought of as performing trigonometric approximation (regression). Specifically, the inverse Fourier transform of the function (to be learned) is approximated as a linear combination of weighted sinusoids.

The result is a consequence of a connection between a class of simple sigmoids and Fourier transforms that facilitates a novel interpretation of one-HL feedforward nets. The following classic theorem due to Polya (1949) is a starting point.

PROPOSITION 6.1 (Polya, 1949). A real valued and continuous function f(x), defined for all real x and satisfying the following properties:

● f(0) = 1,
● f(x) = f(−x),
● f(x) is convex for x > 0,
● lim_{x→∞} f(x) = 0,

is always a characteristic function (Fourier transform) of an absolutely continuous distribution function,¹⁹ i.e.,

f(x) = ℱ(h(t); x) = ∫_{−∞}^{∞} exp(ixt) h(t) dt.

Furthermore, the density h(t) is an even function, and is continuous everywhere except possibly at t = 0. ■
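Polya's conditions are satisfied, for instance, by f(x) = exp(−|x|), whose density is the Cauchy kernel h(t) = 1/(π(1 + t²)). The numerical check below is our illustration, not part of the paper:

```python
import math

def h(t):
    # Cauchy density: the transform pair of exp(-|x|)
    return 1.0 / (math.pi * (1.0 + t * t))

def char_fn(x, T=2000.0, n=400000):
    # h is even, so Integral exp(ixt) h(t) dt = 2 Integral_0^T cos(xt) h(t) dt
    # (midpoint rule; T truncates the tail)
    dt = T / n
    return 2.0 * dt * sum(math.cos(x * (k + 0.5) * dt) * h((k + 0.5) * dt)
                          for k in range(n))

for x in [0.0, 0.5, 1.5]:
    assert abs(char_fn(x) - math.exp(-abs(x))) < 1e-3
```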

The following result connects simple sigmoids with Fourier transforms.

THEOREM 6.1. Let σ(x) be a simple sigmoid. If σ(x)/x is a convex function, then it is the Fourier transform of an absolutely continuous distribution function, i.e.,

σ(x)/x = ℱ(h(t); x) = ∫_{−∞}^{∞} exp(ixt) h(t) dt.   (6.15)

Proof. It suffices to prove that σ(x)/x satisfies the conditions of Polya's theorem; σ(x) being simple is bounded, and hence lim_{x→∞} σ(x)/x = 0. Also, σ(−x)/(−x) = −σ(x)/(−x) = σ(x)/x. Since σ(x) is completely monotone in (0, 1), it follows that lim_{x→0} σ(x)/x = K (some positive constant). There is no loss of generality in assuming K = 1, since one can always scale σ(·) appropriately. Finally, the convexity of σ(x)/x ensures that all of the conditions of Polya's theorem are satisfied and the conclusion follows. ■

REMARK 6.1. Polya's theorem is a sufficient but not necessary condition for f(x) to be the Fourier transform of some function h(t). Hence, Theorem 6.1 is also only a sufficient condition for a simple sigmoid to be a Fourier transform. The case in point is the non-convex function tanh(x), which is still the Fourier transform of a well defined function [this and other examples may be found in Oberhettinger (1973)]. In other words, the conclusions we draw in the next few paragraphs may be valid for some non-convex simple sigmoids as well.

¹⁹ Recall that an absolutely continuous function F(x) is a distribution function if it can be written in the form F(x) = ∫_{−∞}^{x} h(t) dt, where h(t) is called the density of F(x).


REMARK 6.2. In eqn (6.15), h(t) is an even function. Hence the transform is a Fourier cosine transform. The sine component vanishes during the course of the integration.

Consider a one-HL net, with k input nodes, n hidden layer nodes with convex simple sigmoidal transfer functions σ(·), and one summing output node. Let w_ij denote the weight of the connection between the ith node in the hidden layer and the jth node in the input layer; similarly, let c_i denote the weight of the connection between the ith hidden layer node and the output node. Then the output O may be expressed as,

O = Σ_{i=1}^{n} c_i v_i = Σ_{i=1}^{n} c_i σ(u_i) = Σ_{i=1}^{n} c_i σ( Σ_{j=1}^{k} w_ij x_j + θ_i )   (6.17)

where u_i and θ_i are the input and bias for the ith hidden node, respectively. Since σ(·) is a convex simple sigmoid, using Theorem 6.1, eqn (6.17) may be rewritten as,

O = Σ_{i=1}^{n} c_i σ(u_i) = Σ_{i=1}^{n} c_i u_i ℱ(h(t); u_i)   (6.18)

where ℱ(h(t); u_i) denotes that ℱ(h(t); x) is evaluated at the point

u_i = Σ_{j=1}^{k} w_ij x_j + θ_i.
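Equation (6.17) is a standard one-HL forward pass; a minimal sketch of it (ours, with σ(x) = x/√(1 + x²) as a stand-in simple sigmoid and made-up weights):

```python
import math

def sigma(x):
    # stand-in simple sigmoid (not necessarily the one the paper analyzes)
    return x / math.sqrt(1.0 + x * x)

def one_hl_output(x, W, theta, c):
    # x: k inputs; W: n x k hidden weights; theta: n biases; c: n output weights
    u = [sum(W[i][j] * x[j] for j in range(len(x))) + theta[i]
         for i in range(len(W))]
    return sum(c[i] * sigma(u[i]) for i in range(len(c)))

# made-up network with k = 2 inputs and n = 3 hidden nodes
W = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
theta = [0.1, 0.0, -0.2]
c = [0.5, -1.0, 0.7]
O = one_hl_output([0.2, -0.7], W, theta, c)
assert abs(O) < sum(abs(ci) for ci in c)   # |O| bounded since |sigma| < 1
```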

Using the well known property of Fourier transforms (Davies, 1978), that if f(x) = ℱ(h(t); x), then x f(x) = iℱ(h′(t); x) = ℱ(−ih′(t); x), where h′(·) is the first derivative of h(·), and i = √−1, eqn (6.18) may be rewritten²⁰ as,

O = Σ_{i=1}^{n} c_i ℱ(−ih′(t); u_i).   (6.19)

Equation (6.19) can be recognized as being analogous to the Heaviside expansion formula in Laplace transform theory,²¹ which allows the reconstruction

²⁰ In eqn (6.19), the i term in ℱ(−ih′(t); u) converts the Fourier cosine transform representation of σ(x)/x (see Remark 6.2) into a Fourier sine transform.

²¹ For convenience, we restate a simple version of the formula: If the Laplace transform of a function h(t) is given by f(z), i.e., f(z) = ℒ(h(t); z) = ∫₀^∞ h(t) exp(−zt) dt, and f(z) has only first order poles at z₁, z₂, …, z_n, then h(t) = Σ_{k=1}^{n} F_k(z_k), where F_k(z_k) is the residue or pole-coefficient of f(z) exp(zt) (Bohn, 1963).

Page 13: Characterization of a Class of Sigmoid Functions with Applications to Neural Networks

A Clamof SigmoidFunctions

of a time varying function using information relating to its spectral components. Equation (6.19) suggests that one-HL nets with convex simple sigmoidal transfer functions can be thought of as implementing a spectral reconstruction of the output, using the weighted inputs u_i to evaluate the associated pole coefficients (residues) of the Heaviside expansion.

In particular, it can be demonstrated that the results of Gallant and White (1988) are implied by eqn (6.19). In what follows, we shall use ℱ_s(h; x) and ℱ_c(h; x) to indicate the Fourier sine and cosine transforms of h(t).

Since h(t), the continuous distribution function corresponding to σ(x)/x, is an even function (from Polya's theorem), it follows that σ(x) = xℱ(h(t); x) = xℱ_c(h(t); x). Using the property of Fourier transforms that xℱ_c(g(t); x) = ℱ_s(−g′(t); x) (Davies, 1978), we may conclude that σ(x) = ℱ_s(−h′(t); x).

Let u_i = u + r_i, where the r_i are appropriate functions of the x_j (since the u_i are functions of the inputs x_j). Then

O(u) = Σ_{i=1}^{n} c_i ℱ_s(−h′(t); u + r_i).   (6.20)

From the frequency shifting property of Fourier transforms (Davies, 1978),

ℱ_s(f(t); x + a) = ℱ_s(f(t) cos(at); x) + ℱ_c(f(t) sin(at); x),

it follows that

O(u) = Σ_{i=1}^{n} c_i ℱ_s(−h′(t); u + r_i)
     = Σ_{i=1}^{n} c_i {ℱ_s(−h′(t) cos(r_i t); u) + ℱ_c(−h′(t) sin(r_i t); u)}
     = ℱ_s( Σ_{i=1}^{n} c_i (−h′(t)) cos(r_i t); u ) + ℱ_c( Σ_{i=1}^{n} c_i (−h′(t)) sin(r_i t); u ),   (6.21)

so that, formally,

ℱ⁻¹(O(u)) = −h′(t) Σ_{i=1}^{n} c_i sin((r_i + u)t).   (6.22)

But we may choose u arbitrarily: let u = 0, implying

r_i = u_i = Σ_{j=1}^{k} w_ij x_j + θ_i,

and eqn (6.22) becomes,


ℱ⁻¹(O(u)) = −h′(t) Σ_{i=1}^{n} c_i sin(( Σ_{j=1}^{k} w_ij x_j + θ_i ) t).   (6.23)

Equation (6.23) may be used as a starting point for an analysis identical to that adopted by Gallant and White (1988) in their study of one-HL nets with "cosine squashing" functions. It is then straightforward to show that the weights may be chosen (hardwired) so that the one-HL net embeds as a special case a Fourier network, which yields a Fourier series approximation to a given function as its output. In this sense, the results of this section extend those of Gallant and White.

More generally, one can draw similar conclusions by considering sigmoids that are the Laplace transforms of some function; for example, tanh(x)/x is the Laplace transform of sgn(sin(πt/2)), where sgn(x) is +1, 0 or −1 depending on whether x is greater than, equal to, or less than zero (Spanier & Oldham, 1987). A similar analysis would lead to a connection with real exponential approximation (rather than trigonometric approximation). Efficient algorithms, such as Prony's, exist for certain restricted forms of the exponential approximation problem (Su, 1971).
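The Laplace-transform pair quoted above can be confirmed numerically; the sketch below (ours) integrates exp(−xt)·sgn(sin(πt/2)) by the midpoint rule:

```python
import math

def square(t):
    # sgn(sin(pi t / 2)): +1 on (0, 2), -1 on (2, 4), and so on
    return 1.0 if math.sin(math.pi * t / 2.0) >= 0.0 else -1.0

def laplace_square(x, T=60.0, n=300000):
    # midpoint-rule approximation of Integral_0^inf exp(-x t) square(t) dt;
    # the grid is chosen so cell boundaries land on the jumps at even integers
    dt = T / n
    return dt * sum(math.exp(-x * (k + 0.5) * dt) * square((k + 0.5) * dt)
                    for k in range(n))

for x in [0.5, 1.0, 2.0]:
    assert abs(laplace_square(x) - math.tanh(x) / x) < 1e-4
```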

Also related are the considerations of Marks and Arabshahi (1994) on the multidimensional Fourier transforms of the output of a one-HL feedforward net; they showed that the transform of the output is the sum of certain scaled Dirac delta functions. Here, we view the sigmoid itself as the Fourier transform of some function; the main advantage of our interpretation is the algorithms it suggests for training one-HL nets of the type considered in this section.

Another potential use of eqn (6.23) is in exploring the "goodness" of the approximation obtained by a one-HL net with simple sigmoidal transfer functions. In the last 200 years, much has been learned about the errors associated with exponential and trigonometric approximation, and ways to deal with them; however, consideration of these issues is beyond the scope of this paper.

7. CONCLUSION

We have analyzed the behavior of important classes of sigmoid functions, called simple and hyperbolic sigmoids, instances of which are extensively used as node transfer functions in artificial neural network implementations. We have obtained a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and have described composition rules that illustrate how such functions may be synthesized from others. We have obtained power series expansions of hyperbolic sigmoids, and suggested procedures for obtaining coefficients of the expansions. For a large class of node functions, we have shown that the continuous CGH net equations can be reduced to Legendre differential equations. The fact that the connection between Legendre differential equations and the CGH equation holds for such a wide variety of sigmoids, and is not just an accidental consequence of a particular sigmoid, strongly indicates that further exploration of this connection is warranted. Finally, we have shown that a large class of feedforward networks represent the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending an earlier result due to Gallant and White.

REFERENCES

Albertini, F., Sontag, E., & Maillot, V. (1993). Uniqueness of weights for neural networks. In R. Mammone (Ed.), Artificial neural networks with applications in speech and vision. London: Chapman and Hall.

Babister, A. W. (1967). Transcendental functions satisfying nonhomogeneous linear differential equations. New York: Macmillan.

Bohn, E. V. (1963). The transform analysis of linear systems. Reading, MA: Addison-Wesley.

Carlitz, L. (1963). The inverse of the error function. Pacific J. Math., 13(2), 459-470.

Cybenko, G. (1989). Approximation by superposition of a sigmoidal function. Math. Control, Signals and Systems, 2, 303-314.

Davies, B. (1978). Integral transforms and their applications. New York: Springer Verlag.

Diaconis, P., & Shahshahani, M. (1984). On nonlinear functions of linear combinations. SIAM J. Sci. Stat. Comput., 5, 175-191.

Elliot, L. D. (1993). A better activation function for artificial neural networks. Technical Report TR 93-8, Institute for Systems Research, University of Maryland, College Park, MD.

Erdélyi, A., Magnus, W., Oberhettinger, F., & Tricomi, F. G. (1953). Higher transcendental functions (Vol. 1). New York: McGraw-Hill.

Feller, W. (1965). An introduction to probability theory and its applications (Vol. II). New York: John Wiley.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183-192.

Gallant, A. R., & White, H. (1988). There exists a neural network that does not make avoidable mistakes. In IEEE International Conf. on Neural Networks (Vol. 1, pp. 657-664). San Diego, CA.

Goodman, A. W. (1983). Univalent functions (Vol. I). New York: Mariner Publishing Co.

Graham, R. L., Knuth, D. E., & Patashnik, O. (1989). Concrete mathematics. Reading, MA: Addison-Wesley.

Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies Appl. Math., 52, 217-257.

Grossberg, S. (1988). Nonlinear neural networks: principles, mechanisms and architectures. Neural Networks, 1, 17-61.

Hansen, E. R. (1975). A table of series and products. Englewood Cliffs, NJ: Prentice-Hall.

Harrington, P. (1993). Sigmoid transfer functions in backpropagation neural networks. Analyt. Chem., 65(15), 2167-2168.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). An introduction to the theory of neural computation (Vol. 1), Santa Fe Institute studies in the sciences of complexity. Redwood City, CA: Addison-Wesley.

Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625-633.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multi-layer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Krantz, S. G., & Parks, H. R. (1992). A primer of real analytic functions. Berlin: Birkhäuser Verlag.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 4-22.

Macintyre, A., & Sontag, E. (1993). Finiteness results for sigmoidal "neural" networks. In Proc. 25th Annual Symp. Theory of Computing, San Diego. New York: Association for Computing Machinery (ACM).

Marks, R. J., & Arabshahi, P. (1994). Fourier analysis and filtering of a single hidden layer perceptron. In Int. Conf. Artificial Neural Networks (IEEE/ENNS), Sorrento, Italy.

McBride, E. B. (1970). Obtaining generating functions (Vol. 21). Berlin: Springer Verlag.

Minai, A., & Williams, R. (1993). On the derivatives of the sigmoid. Neural Networks, 6(6), 845-853.

Minsky, M., & Papert, S. A. (1988). Perceptrons: an introduction to computational geometry. Cambridge, MA: The MIT Press.

Oberhettinger, F. (1973). Fourier transforms of distributions and their inverses. New York: Academic Press.

Pao, Y. H., & Sobajic, D. J. (1987). Metric synthesis and concept discovery with connectionist networks. In Proc. IEEE Systems, Man and Cybernetics Conf., Alexandria, VA.

Polya, G. (1949). Remarks on the characteristic function. In Proc. 4th Berkeley Symp. Math. Statist. & Probab. (pp. 115-123).

Polya, G. (1974). On the zeroes of the derivatives of a function and its analytic character. In R. P. Boas (Ed.), George Polya: collected works (pp. 178-189). Cambridge, MA: MIT Press.

Rainville, E. D. (1964). Intermediate differential equations. New York: Macmillan.

Spanier, J., & Oldham, K. (1987). An atlas of functions. Washington, DC: Hemisphere.

Su, K. L. (1971). Time-domain synthesis of linear networks. Englewood Cliffs, NJ: Prentice-Hall.

Sussmann, H. J. (1992). Uniqueness of weights for minimal feedforward nets with a given input-output map. Neural Networks, 5(4), 589-593.

White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1, 425-464.

Whittaker, E. T., & Watson, G. N. (1927). Modern analysis (4th ed.). Cambridge: Cambridge University Press.

Widder, D. V. (1946). The Laplace transform. Princeton, NJ: Princeton University Press.

Wilf, H. S. (1989). Generatingfunctionology. New York: Academic Press.

APPENDIX

THEOREM 4.1. Let y = σ(x) be a hyperbolic sigmoid, and let q : (−1, 1) → ℝ be its inverse. Then, either

q(y) = yF(α, 1/2; 3/2; y²),  α ≥ 1,   (A.1)

or

q(y) = yF(α, −; −; y²) = y/(1 − y²)^α,  α > 0,   (A.2)

where, by F(α, −; −; y²), we mean F(α, β; β; y²) (β ∈ ℝ).

Proof. Since σ(·) is hyperbolic, by definition q(x)/x is described by a


GH series with at most three parameters. There are four possibilities:

q(x) = x ₃F₀(α₁, α₂, α₃; ; x²)   (Case 1)   (A.3)
q(x) = x ₂F₁(α₁, α₂; γ₁; x²)   (Case 2)   (A.4)
q(x) = x ₁F₂(α₁; γ₁, γ₂; x²)   (Case 3)   (A.5)
q(x) = x ₀F₃(; γ₁, γ₂, γ₃; x²)   (Case 4)   (A.6)

The following proposition shows why there is no need to consider Cases 1, 3, and 4 as possible forms for hyperbolic sigmoids.

PROPOSITION A (Spanier & Oldham, 1987). Let ₚF_q(α₁, …, α_p; γ₁, …, γ_q; z) be a GH series in z with p + q parameters. If none of the numerator parameters are non-positive integers, i.e., ∀i: α_i ≠ 0, −1, −2, …, then the convergence behavior of ₚF_q is as follows:

p < q + 1: ₚF_q necessarily converges for all finite z.
p = q + 1: convergence of ₚF_q is limited to −1 < z < 1, and depends on the parameters α_i and γ_i.   (A.7)
p > q + 1: ₚF_q necessarily diverges for all non-zero z.

Since lim_{z→±1} q(z) → ±∞, but q is finite in the interval (−1, 1), it follows that if a GH series is to represent q(·), then it has to converge in the interval (−1, 1), but diverge at z = ±1.

This rules out non-positive integral values for the numerator parameters; otherwise, the series would converge for all z ∈ ℝ (and not just in the interval (−1, 1)). Yet, even if the numerator parameters do not have non-positive integral values, in three of the above cases the balance of numerator to denominator parameters is such that the series diverges for all z (Case 1), or converges for all z (Cases 3 and 4). That leaves just one case to consider, viz., the classical series ₂F₁(α₁, α₂; γ₁; z) = F(α, β; γ; z), i.e., we may take q(x) = xF(α, β; γ; x²).

Since q(·) has to be a GH series with at most three parameters, some of the parameters are allowed to be "missing". In other words, Case 2 spawns the following possibilities:

q(x) = xF(α, β; γ; x²)   (Case 2(a))   (A.8)
q(x) = xF(α, β; −; x²)   (Case 2(b))   (A.9)
q(x) = xF(α, −; γ; x²)   (Case 2(c))   (A.10)
q(x) = xF(α, −; −; x²)   (Case 2(d))   (A.11)
q(x) = xF(−, −; γ; x²)   (Case 2(e))   (A.12)
q(x) = xF(−, −; −; x²)   (Case 2(f))   (A.13)

Proposition A can be used once again to weed out all but two of the above set, viz., Cases 2(a) and 2(d). The rest lead to inappropriate divergence or convergence behavior in the interval. The following property of GH functions will be needed.

PROPOSITION B (Spanier & Oldham, 1987). If y = F(α, β; γ; x), then

dy/dx = (αβ/γ) F(α + 1, β + 1; γ + 1; x). ■

Case 2(a): three parameter GH series:

q(x) = xF(α, β; γ; x²) = x Σ_{k≥0} [(α)_k (β)_k / (γ)_k] x^{2k}/k!.   (A.14)

Let

Λ : (−1, 1) → ℝ⁺,

with

Λ(x) ≡ dq/dx.

Then

Λ(x) = dq/dx = (d/dx)(xF(α, β; γ; x²))
     = F(α, β; γ; x²) + x (d/dx) F(α, β; γ; x²)
     = F(α, β; γ; x²) + (2αβ/γ) x² F(α + 1, β + 1; γ + 1; x²)
     = Σ_{k≥0} [(α)_k (β)_k / (γ)_k] (2k + 1) x^{2k}/k!
     = Σ_{k≥0} [(α)_k (β)_k (3/2)_k / ((γ)_k (1/2)_k)] x^{2k}/k!,   (A.15)

using the identity (3/2)_k/(1/2)_k = 2k + 1. From the definition of hyperbolic sigmoids, Λ(x) is representable by a GH function with at most three parameters; we must therefore make the identification β = 1/2 and γ = 3/2. From the symmetry of the GH function in its numerator parameters, the alternative identification α = 1/2 need not be considered separately. It follows that

y = q(x) = xF(α, 1/2; 3/2; x²),
dq/dx = Λ(x) = F(α, −; −; x²) = 1/(1 − x²)^α.   (A.16)

The parameter α cannot take an arbitrary real value. The behavior of q(x) at the endpoints of its interval requires that

lim_{x→±1} q(x) → ±∞ ⟺ lim_{x→±1} Λ(x) → +∞.   (A.17)

Equations (A.16) and (A.17) taken together imply that α > 0. This is a necessary but not sufficient condition. The following two propositions allow us to pin down α's value more precisely.

PROPOSITION C (Erdélyi et al., 1953). If α and β are different from 0, −1, …, then F(α, β; γ; z) converges absolutely for |z| < 1. For z = 1:

F(α, β; γ; z) converges absolutely if (α + β − γ) < 0   (A.18)
F(α, β; γ; z) converges conditionally if 0 ≤ (α + β − γ) < 1   (A.19)
F(α, β; γ; z) diverges if 1 ≤ (α + β − γ). ■   (A.20)

PROPOSITION D (Erdélyi et al., 1953). If (γ − α − β) > 0, then

F(α, β; γ; 1) = Γ(γ)Γ(γ − α − β) / (Γ(γ − α)Γ(γ − β))

where

Γ(x) = ∫₀^∞ exp(−t) t^{x−1} dt

is Euler's gamma function. ■

If α < 1, from Proposition C we see that the series converges absolutely at z = x² = 1. From Proposition D, this in turn implies that q(x)/x would have a finite value at the endpoint of its domain interval. Therefore, α ≥ 1. The final form for the three parameter GH representation of q(x) is, therefore, xF(α, 1/2; 3/2; x²) with α ≥ 1.


Case 2(d): one-parameter GH series. In this case,

q(x) = xF(α, −; −; x²) = x Σ_{k≥0} (α)_k x^{2k}/k! = x/(1 − x²)^α.

The situation is much simpler, since we have to place bounds on the value of one parameter alone. An argument almost identical to the one above allows us to conclude that for q(x)/x to satisfy the properties of a hyperbolic sigmoid, it is both necessary and sufficient that we take α > 0. ■
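As a numerical aside (ours), the one-parameter series above can be summed directly and compared with (1 − x²)^{−α}:

```python
import math

def pochhammer(a, k):
    # rising factorial (a)_k = a (a+1) ... (a+k-1)
    p = 1.0
    for j in range(k):
        p *= a + j
    return p

def q_series(x, alpha, terms=60):
    # x * F(alpha, -; -; x^2) = x * sum_k (alpha)_k x^{2k} / k!
    return x * sum(pochhammer(alpha, k) * x ** (2 * k) / math.factorial(k)
                   for k in range(terms))

for alpha in [0.5, 1.0, 2.5]:
    for x in [0.2, 0.6]:
        assert abs(q_series(x, alpha) - x / (1.0 - x * x) ** alpha) < 1e-9
```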

THEOREM 4.2. Let σ : ℝ → (−1, 1) be a real analytic, odd, strictly increasing sigmoid, such that its inverse q : (−1, 1) → ℝ has a GH series expansion in some injective, odd, increasing C¹ function g(·), with at most three parameters, convergent in (−1, 1). Also let q′ have a GH series expansion in g(·), with at most one parameter. Then, either

q(y) = g(y)F(α, 1/2; 3/2; (g(y))²) = g(y) Σ_{k≥0} [(α)_k (1/2)_k / (3/2)_k] (g(y))^{2k}/k!,  with α ≥ 1,   (A.21)

or

q(y) = g(y)F(α, −; −; (g(y))²) = g(y)/(1 − (g(y))²)^α,  with α > 0,   (A.22)

provided

lim_{y→±1} g′(y)/(1 − (g(y))²)^α → +∞,

where g′(·) is the first derivative of g(·).

Proof. The proof of Theorem 4.2 is very similar to that for Theorem 4.1; we sketch the first case. With q(x) = g(x)F(α, 1/2; 3/2; (g(x))²), we have

q′(x) = g′(x)F(α, −; −; (g(x))²) = g′(x)/(1 − (g(x))²)^α,   (A.23)

where g′(x) is the first derivative of g(x). Since g′(x) > 0 for all x ∈ Dom(g), and α > 0, it follows that q′(x) > 0 for all x ∈ Dom(q), i.e., q(x) is a strictly increasing function. The analyticity, continuity and oddness of q(·) follow from the respective properties of the GH function. We ensure that lim_{x→1} q(x) → ∞ by forcing the derivative q′(x) to go to infinity at the endpoints of its interval. ■

THEOREM 5.1. If the inverse sigmoid is given by y/(1 − y²)^α, α > 0, then in some neighborhood of the origin, we have the valid expansion

σ(x) = Σ_{k≥0} b_{2k+1} x^{2k+1}/(2k + 1)!

where,

b_{2k+1} = (−1)^k (2k)! binom((2k + 1)α, k),   (A.24)

with binom(a, k) = a(a − 1)⋯(a − k + 1)/k! the generalized binomial coefficient.

Proof. We will need the Lagrange inversion formula, stated below (Wilf, 1989).

Consider the functional equation u = tφ(u). Suppose f(u) and φ(u) are analytic in some neighborhood of the origin (u-plane), with φ(0) = 1. Then there is a neighborhood of the origin (in the t-plane) in which the equation u = tφ(u) has exactly one root for u. Let

f(u(t)) = Σ_{n≥0} c_n t^n

be the Maclaurin expansion of f(u(t)) in t, and

f′(u)[φ(u)]^n = Σ_{k≥0} b_k^{(n)} u^k

be the Maclaurin expansion of the function f′(u)[φ(u)]^n. Then

c_n = (1/n) b_{n−1}^{(n)}.

Here, y ↔ u, x ↔ t, and φ(u) = (1 − u²)^a. Take f(u) = u = y, and the theorem follows from the Lagrange inversion formula. ■
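Theorem 5.1 lends itself to a numerical sanity check. The sketch below is our own (function names are ours): for a = 1, the inverse η(y) = y/(1 − y²) can be inverted in closed form by solving the quadratic x = y/(1 − y²), giving σ(x) = (√(1 + 4x²) − 1)/(2x), and the coefficients (A.24) reduce to signed Catalan numbers.

```python
from math import comb, sqrt

def a_coeff(a, k):
    # a_{2k+1} = (-1)^k / (2k+1) * binom((2k+1)a, k); for integer a the
    # generalized binomial coefficient reduces to math.comb.
    return (-1) ** k * comb((2 * k + 1) * a, k) / (2 * k + 1)

def sigma_series(x, a, terms=25):
    # Partial sum of the expansion in Theorem 5.1 / eqn (A.24).
    return sum(a_coeff(a, k) * x ** (2 * k + 1) for k in range(terms))

def sigma_exact(x):
    # Closed-form inverse of eta(y) = y/(1 - y^2), i.e., the case a = 1.
    return (sqrt(1 + 4 * x * x) - 1) / (2 * x)

print([a_coeff(1, k) for k in range(5)])          # [1.0, -1.0, 2.0, -5.0, 14.0]
print(abs(sigma_series(0.1, 1) - sigma_exact(0.1)))  # negligible
```

For a = 1 the sequence 1, 1, 2, 5, 14, … is the Catalan numbers with alternating signs, and the partial sums match the closed-form inverse near the origin.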

THEOREM 5.3. Let

σ(x) = Σ_{k≥0} b_{k+1} x^{k+1}/(k + 1)!

be an expansion for a hyperbolic sigmoid, with an inverse of the form yF(a, 1/2; 3/2; y²), valid in some neighborhood of the origin. Then b_{2k} = 0 and b_{2k+1} = C(2k + 1, k), where we define the sequence C(n, k) as follows:

C(1, 0) = 1;  C(n, k) = 0 ∀ k ≥ n, k < 0;

C(n + 1, k) = (2k − n + 1)C(n, k) − 2(na − k + 1)C(n, k − 1), n ≥ 1.   (A.25)

Here, n and k are natural numbers. D^n(σ(x)), the nth derivatives of σ, are given by:

D^n(σ(x)) = Σ_{k=0}^{n−1} C(n, k) y^{2k−n+1}(1 − y²)^{na−k},  where y = σ(x).   (A.26)

Proof. This theorem was obtained by a process identical to that described in Minai and Williams (1993) for the derivatives of the logistic sigmoid. We therefore restrict ourselves to an outline.

It is given that y = η(x) = xF(a, 1/2; 3/2; x²), and x = σ(y). It can be shown that

D(x) = (d/dy) σ(y) = 1/η′(x) = (1 − x²)^a.

Consider the derivatives of the polynomials f_{k,l}(x) = x^k(1 − x²)^l:

D(f_{k,l}(x)) = (d/dy) f_{k,l}(x) = k x^{k−1}(1 − x²)^{l+a} − 2l x^{k+1}(1 − x²)^{l+a−1}

= (k) f_{k−1,l+a}(x) + (−2l) f_{k+1,l+a−1}(x)

= L(f_{k,l}(x)) + R(f_{k,l}(x)).   (A.27)

In eqn (A.27) we have split the effect of the operator D, the derivative with respect to the variable y, into the sum of the actions of two operators L and R (Minai and Williams, 1993, use analogous operators for the logistic sigmoid). With the polynomials f_{k,l}, these operators are defined by:

L(A f_{k,l}(x)) = kA f_{k−1,l+a}(x),   (A.28)

R(A f_{k,l}(x)) = −2lA f_{k+1,l+a−1}(x),   (A.29)

where A is a constant. The main advantage of introducing these operators is that they give a systematic way of visualizing the production of D^{n+1}(x) from D^n(x); L and R may be thought of as being applied to a binary tree of expressions, where each node is some polynomial f_{k,l}(x), and the root is the polynomial f_{0,a} = (1 − x²)^a. The action of L on each node of this tree is to produce a left child, given by eqn (A.28), and that of R is to produce a right child, given by eqn (A.29). L acting upon f_{k,l}(x) does three things: it multiplies the operand by k (the degree of x), reduces the degree of x by 1, and increases the degree of (1 − x²) by a. On the other hand, R increases the degree of x by 1, increases that of (1 − x²) by (a − 1), and multiplies the operand by −2l, where l is the degree of (1 − x²). Figure 3 depicts the process for the first four levels:

FIGURE 3. Binary "derivation" tree for hyperbolic sigmoids.
n = 1:   C(1,0) f_{0,a}(x)
n = 2:   C(2,0) f_{−1,2a}(x)   C(2,1) f_{1,2a−1}(x)
n = 3:   C(3,0) f_{−2,3a}(x)   C(3,1) f_{0,3a−1}(x)   C(3,2) f_{2,3a−2}(x)
n = 4:   C(4,0) f_{−3,4a}(x)   C(4,1) f_{−1,4a−1}(x)   C(4,2) f_{1,4a−2}(x)   C(4,3) f_{3,4a−3}(x)

By a detailed study of this "derivative" tree, the following observations may be proved:

1. The nth level of the tree corresponds to the nth derivative of σ(y): D^n(x) = L(D^{n−1}(x)) + R(D^{n−1}(x)) (the root of the tree is designated n = 1, and D⁰(f_{k,l}(x)) = f_{k,l}(x)).

2. At the nth level, the tree has n nodes, and the kth node (k runs from 0 through n − 1) is a polynomial in x, given by C(n, k) f_{2k−n+1,na−k}(x) = C(n, k) x^{2k−n+1}(1 − x²)^{na−k}, where C(n, k) is a constant. It can be seen that the nth derivatives of σ then satisfy eqn (A.26).

3. There are two sources contributing to the value of C(n, k): one is the action of R on the (k − 1)th term, and the other is that of L on the kth term, on the (n − 1)th level.

Induction arguments in conjunction with the above observations then give:

C(1, 0) = 1;  C(n, k) = 0 ∀ k ≥ n, k < 0;

C(n + 1, k) = (2k − n + 1)C(n, k) − 2(na − k + 1)C(n, k − 1), n ≥ 1.   (A.30)

Now, all terms in D^n(x) containing an x term having positive degree will vanish when evaluated at x = 0. For even n, all the nodes have an x term with an odd degree, and hence D^n(x) vanishes identically at x = 0. For odd n, all terms except the term corresponding to k = (n − 1)/2 vanish at x = 0. Since b_n = D^n(x)|_{x=0}, it follows that b_{2k} = 0 and b_{2k+1} = C(2k + 1, k). ■
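Theorem 5.3 can also be checked numerically; the following sketch is ours, not the paper's. For a = 1 the inverse is xF(1, 1/2; 3/2; x²) = artanh x, so σ = tanh, and the values b_{2k+1} = C(2k + 1, k) produced by recurrence (A.25) should reproduce the Maclaurin derivatives of tanh, namely 1, −2, 16, −272, … (tanh y = y − y³/3 + 2y⁵/15 − 17y⁷/315 + …).

```python
from functools import lru_cache

def make_C(a):
    """Build C(n, k) from recurrence (A.25)/(A.30) for a given parameter a."""
    @lru_cache(maxsize=None)
    def C(n, k):
        if k < 0 or k >= n:
            return 0          # C(n, k) = 0 for k >= n or k < 0
        if n == 1:
            return 1          # base case C(1, 0) = 1
        m = n - 1             # rewrite C(m+1, k) in terms of level m
        return (2 * k - m + 1) * C(m, k) - 2 * (m * a - k + 1) * C(m, k - 1)
    return C

C = make_C(a=1)  # a = 1 corresponds to sigma = tanh
odd_derivs = [C(2 * k + 1, k) for k in range(4)]
print(odd_derivs)  # [1, -2, 16, -272]
```

These are exactly the odd-order derivatives of tanh at 0 (e.g., −272/7! = −17/315), as the theorem predicts for the hyperbolic tangent case.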