
Fault Tolerant Learning Using Kullback-Leibler Divergence

John Sum
Institute of Electronic Commerce
National Chung Hsing University
Taichung 402, Taiwan
Email: [email protected]

Chi-sing Leung
Department of Electronic Engineering
City University of Hong Kong
Kowloon Tong, Hong Kong
Email: [email protected]

Lipin Hsu
Department of Information Management
Chung Shan Medical University
Taichung 402, Taiwan
Email: [email protected]

Abstract— In this paper, an objective function for training a fault tolerant neural network is derived based on the idea of Kullback-Leibler (KL) divergence. The new objective function is then applied to a radial basis function (RBF) network subject to multiplicative weight noise. Simulation results demonstrate that the RBF network trained with the new objective function has better fault tolerance than one trained by explicit regularization. As the KL divergence is related to Bayesian learning, the relation between the proposed objective function and other Bayesian-type objective functions is also discussed.

I. INTRODUCTION

Once a neural network model has been obtained numerically on a conventional computer, the model must finally be implemented in physical hardware. However, this mapping can never be perfect [8]. One imperfection is the finite-bit number representation found in digital hardware [15], [18], which makes the digital implementation of a neural network suffer from multiplicative weight noise. To alleviate this problem, much work has been done in the last two decades. Some of it aimed at understanding the effect of such noise on a neural network, while other work aimed at training a neural network that is able to tolerate such an effect.

Stevenson et al. [25], amongst the first group, gave a comprehensive analysis of the probability of output error of the threshold-logic Madaline model due to input and weight noise. Choi and Choi [12] used a statistical sensitivity approach to derive different output sensitivity measures for a multilayer perceptron. Piche [22] followed a signal-to-noise ratio (SNR) approach to derive a set of measures for the output sensitivity of the sigmoidal Madaline and applied it to develop a weight accuracy selection algorithm that determines the precision requirement for hardware implementation. Townsend and Tarassenko [26] considered a radial basis function (RBF) network with multiple outputs and derived an output sensitivity matrix for an RBF network that suffers from perturbations in the input data, basis function centers and output weights.

Output sensitivity is just one viewpoint for understanding the effect of noise on a neural network. An alternative viewpoint is the actual network performance, i.e. the generalization ability. In this regard, Catala and Parra proposed a fault tolerance parameter model and studied the degradation of an RBF network when the RBF centers, widths and the corresponding weights are corrupted by multiplicative noise [9]. Following Choi and Choi's statistical sensitivity approach [12], Bernier et al. derived error sensitivity measures for the MLP [2], [4] and the RBF network [6]. Fontenla-Romero et al. derived an error sensitivity measure for functional networks [13].

While noise can be harmful to a neural network, Murray and Edwards [20] investigated the advantages of adding noise to a neural network during training. They found that adding noise during training can actually improve the generalization ability of a neural network. Bishop [7] showed that adding small additive white noise during training is equivalent to Tikhonov regularization. Jim et al. [16] further noticed that adding multiplicative weight noise can improve not only the generalization ability but also the convergence behavior when training a recurrent neural network.

While the above works have provided a clearer picture of the effect of noise on network performance, other researchers developed training methods aiming to improve the fault tolerance of a neural network. To reduce the magnitude of the weights, Cavalieri and Mirabella [10] proposed a modified backpropagation learning algorithm for the multilayer perceptron, in which a weight magnitude control step is added in each training epoch. Simon [24] developed a distributed fault tolerance learning scheme for the optimal interpolation net, in which learning is formulated as a nonlinear programming problem: minimizing the training error subject to an equality constraint on weight magnitude. Extending their previous work, Parra and Catala [21] demonstrated how a fault tolerant RBF network can be obtained by using a simple weight decay regularizer [19]. Bernier et al. developed a method called explicit regularization to attain fault tolerant MLPs [3], [5] and RBF networks [6].

In this paper, we are also interested in developing a learning algorithm to train a neural network that is able to tolerate weight noise. Specifically, a fault tolerant learning method using the Kullback-Leibler (KL) divergence [17] is derived and applied to train an RBF network that suffers from multiplicative weight noise. In the next section, an objective function based on the KL divergence is derived. A learning algorithm for the RBF network is then derived in Section III. Section IV presents simulation results. A discussion of the new objective function in relation to others is presented in Section V. Finally, the conclusion is given in Section VI.


II. KLD-BASED OBJECTIVE FUNCTION

Assume that a measured data set D is collected from an unknown system with input-output joint probability distribution P0(x, y). Without loss of generality, we assume that x, y ∈ R. We further assume that the unknown system belongs to a set of models that can be parameterized by an M-dimensional parametric vector θ, given by θ = (θ1, θ2, · · · , θM)^T. The corresponding input-output behavior is represented by P(x, y|θ). An implementation of a model θ, denoted by θ̃, is a random model generated by a probability function P(θ̃|θ). We also call this implementation a faulty network model. Normally, the probability P(θ̃|θ) is determined by the fault model or the noise model given in advance. Clearly, a faulty neural network model θ̃ is also an unknown system to us; its true performance cannot be checked until the nominal model θ has been implemented.

The marginal probability distribution will thus be given by

$$
P(x, y \mid \theta) = \int P(x, y \mid \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta},
$$

and, treating the training data as sampled from an unknown but stochastic system P0(x, y), the performance of a fault tolerant neural network can naturally be quantified by the Kullback-Leibler (KL) divergence [17],

$$
D\big(P_0(x, y)\,\|\,P(x, y \mid \theta)\big) = \int\!\!\int P_0(x, y)\, \log \frac{P_0(x, y)}{P(x, y \mid \theta)}\, dx\, dy.
$$

Therefore, θ̂ can be given by

$$
\hat{\theta} = \arg\min_{\theta} \big\{ D\big(P_0(x, y)\,\|\,P(x, y \mid \theta)\big) \big\}, \tag{1}
$$

where

$$
D\big(P_0(x, y)\,\|\,P(x, y \mid \theta)\big) = \text{constant} - L(\theta) \tag{2}
$$

and

$$
L(\theta) = \int\!\!\int P_0(x, y)\, \log \left\{ \int P(x, y \mid \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta} \right\} dx\, dy. \tag{3}
$$

Then θ̂ can be obtained by

$$
\hat{\theta} = \arg\max_{\theta} \{ L(\theta) \}. \tag{4}
$$

For D = {(xk, yk)}, k = 1, · · · , N, with N large, the Law of Large Numbers gives (dropping the term due to P0(x), which is independent of θ)

$$
L(\theta) = \frac{1}{N} \sum_{k=1}^{N} \log \int P(y_k \mid x_k, \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta}. \tag{5}
$$

Hence,

$$
\hat{\theta} = \arg\max_{\theta} \left\{ \frac{1}{N} \sum_{k=1}^{N} \log \int P(y_k \mid x_k, \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta} \right\}, \tag{6}
$$

which depends on the noise model P(θ̃|θ).
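As a concrete illustration of Equation (5), the inner integral can be approximated by Monte-Carlo sampling of faulty models from P(θ̃|θ). The following Python sketch is not part of the paper; the function name, the generic `predict` and `sample_faulty` hooks, and the Gaussian observation model with variance Se are assumptions made for illustration only.

```python
import numpy as np

def kl_objective_mc(theta, x, y, predict, sample_faulty, Se, n_samples=200):
    """Monte-Carlo estimate of L(theta) in Eq. (5).

    predict(xk, theta_tilde) -> model output for a faulty parameter vector.
    sample_faulty(theta)     -> one draw of theta_tilde from P(theta_tilde | theta).
    Se                       -> variance of the Gaussian observation noise (an assumption).
    """
    L = 0.0
    for xk, yk in zip(x, y):
        # Average the Gaussian likelihood P(yk | xk, theta_tilde, theta)
        # over sampled faulty realizations theta_tilde.
        lik = np.mean([
            np.exp(-(yk - predict(xk, sample_faulty(theta))) ** 2 / (2.0 * Se))
            / np.sqrt(2.0 * np.pi * Se)
            for _ in range(n_samples)
        ])
        L += np.log(lik)
    return L / len(x)
```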

III. FT LEARNING FOR RBF

In the fault-free radial basis function (RBF) network approach, we assume that the data set D is generated by an RBF network,

$$
f(x) = \sum_{i=1}^{M} \theta_i \phi_i(x) + e, \tag{7}
$$

where θ = (θ1, · · · , θM)^T is the RBF weight vector, e is a zero-mean Gaussian random variable with variance Se, and φ(x) = (φ1(x), φ2(x), · · · , φM(x))^T, with the φi(x), i = 1, 2, · · · , M, being the radial basis functions

$$
\phi_i(x) = \exp\!\left( -\frac{(x - c_i)^2}{\sigma} \right). \tag{8}
$$

The ci, for all i = 1, 2, · · · , M, are the radial basis function centers and σ is a positive parameter controlling the width of the basis functions.
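For reference, the basis functions of Equation (8) can be evaluated for a batch of inputs as in the short Python sketch below. This is an illustrative helper, not code from the paper; the function name and the vectorized layout are our own.

```python
import numpy as np

def rbf_design_matrix(x, centers, sigma):
    """Evaluate the Gaussian basis functions of Eq. (8) for a batch of inputs.

    Returns Phi with Phi[k, i] = phi_i(x_k) = exp(-(x_k - c_i)^2 / sigma).
    As in Eq. (8), sigma divides the squared distance directly.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)               # shape (N, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)   # shape (1, M)
    return np.exp(-(x - centers) ** 2 / sigma)                  # shape (N, M)
```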

A faulty implementation of θ is denoted by θ̃, in which each element of θ̃ is a random variable defined as follows:

$$
\tilde{\theta}_i = \theta_i + \beta_i \theta_i, \quad \forall\, i = 1, 2, \cdots, M,
$$

where βi is a zero-mean Gaussian noise of variance Sβ,

$$
P(\beta_i) = \frac{1}{\sqrt{2\pi S_\beta}} \exp\!\left( -\frac{\beta_i^2}{2 S_\beta} \right) \tag{9}
$$

for all i = 1, 2, · · · , M. Given an input x, the noise vector β and the parametric vector θ, the output of an implementation can thus be given by

$$
P(y \mid x, \beta, \theta) = \frac{1}{\sqrt{2\pi S_e}} \exp\!\left( -\frac{\big(y - \sum_{i=1}^{M} \phi_i(x)(1 + \beta_i)\theta_i\big)^2}{2 S_e} \right). \tag{10}
$$

The marginal probability over all possible β is then defined as follows:

$$
P(y \mid x, \theta) = \int P(y \mid x, \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta} = \int P(y \mid x, \beta, \theta)\, P(\beta)\, d\beta.
$$

Putting the definitions of P(βi) in Equation (9) and P(y|x, β, θ) in Equation (10) into the above and integrating over all possible β,

$$
P(y \mid x, \theta) = \frac{1}{\sqrt{2\pi S(x, \theta)}} \exp\!\left( -\frac{\big(y - \hat{f}(x, \theta)\big)^2}{2 S(x, \theta)} \right), \tag{11}
$$

where

$$
\hat{f}(x, \theta) = \int y\, P(y \mid x, \theta)\, dy = \phi^T(x)\, \theta, \tag{12}
$$

$$
S(x, \theta) = S_e + S_\beta \sum_{i=1}^{M} \phi_i^2(x)\, \theta_i^2. \tag{13}
$$

In the sequel, L(θ) in Equation (5) can then be written as follows:

$$
L(\theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2N} \sum_{k=1}^{N} \log S(x_k, \theta) - \frac{1}{2N} \sum_{k=1}^{N} \frac{\big(y_k - \phi^T(x_k)\theta\big)^2}{S(x_k, \theta)} \tag{14}
$$


and θ̂ equals

$$
\arg\min_{\theta} \left\{ \frac{1}{2N} \sum_{k=1}^{N} \log S(x_k, \theta) + \frac{1}{2N} \sum_{k=1}^{N} \frac{\big(y_k - \phi^T(x_k)\theta\big)^2}{S(x_k, \theta)} \right\}.
$$
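As an illustration, the objective to be minimized above can be evaluated directly once the design matrix Φ with entries Φ[k, i] = φi(xk) is available. The Python sketch below is our minimal reading of Equations (13)-(14), not code from the paper.

```python
import numpy as np

def kld_objective(theta, Phi, y, Se, Sb):
    """Objective to be minimised, i.e. -L(theta) of Eq. (14) without the constant.

    Phi[k, i] = phi_i(x_k); theta is the output weight vector;
    Se is the output-noise variance and Sb the multiplicative weight-noise variance.
    """
    S = Se + Sb * (Phi ** 2) @ (theta ** 2)     # S(x_k, theta) of Eq. (13)
    residual = y - Phi @ theta                  # y_k - phi(x_k)^T theta
    return 0.5 * np.mean(np.log(S) + residual ** 2 / S)
```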

By using the idea of gradient descent on this objective (i.e. on −L(θ), up to the additive constant), a training algorithm can thus be obtained by the following recursive equation:

$$
\theta(t+1) = \theta(t) - \mu\, \frac{\partial}{\partial \theta}\big[-L(\theta(t))\big], \tag{15}
$$

where µ is a small positive value corresponding to the step size and

$$
\frac{\partial\,[-L(\theta)]}{\partial \theta}
= \frac{S_\beta}{N} \sum_{k=1}^{N} \left( \frac{1}{S(x_k, \theta)} - \frac{\big(y_k - \phi^T(x_k)\theta\big)^2}{S^2(x_k, \theta)} \right) G(x_k)\, \theta
\;-\; \frac{1}{N} \sum_{k=1}^{N} \frac{y_k - \phi^T(x_k)\theta}{S(x_k, \theta)}\, \phi(x_k), \tag{16}
$$

where

$$
G(x_k) = \mathrm{diag}\big\{ \phi_1^2(x_k), \phi_2^2(x_k), \cdots, \phi_M^2(x_k) \big\}.
$$

The initial condition θ(0) is set to be a small random vector close to the null vector.
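A minimal batch implementation of the update in Equations (15)-(16) might look as follows. This is a sketch under stated assumptions: the number of epochs and the initialization scale are not specified in the paper, and the default step size only anticipates the µ = 0.01 used later in Section IV.

```python
import numpy as np

def train_kld_rbf(Phi, y, Se, Sb, mu=0.01, n_epochs=5000, rng=None):
    """Batch gradient descent of Eq. (15) using the gradient of Eq. (16).

    Phi[k, i] = phi_i(x_k); returns the trained output weight vector theta.
    Illustrative sketch only; epoch count and initialisation scale are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, M = Phi.shape
    theta = 0.01 * rng.standard_normal(M)          # small random vector near the null vector
    for _ in range(n_epochs):
        S = Se + Sb * (Phi ** 2) @ (theta ** 2)    # S(x_k, theta), Eq. (13)
        r = y - Phi @ theta                        # residuals y_k - phi(x_k)^T theta
        G_theta = (Phi ** 2) * theta               # row k holds G(x_k) theta
        grad = (Sb / N) * ((1.0 / S - r ** 2 / S ** 2) @ G_theta) \
               - (1.0 / N) * ((r / S) @ Phi)       # gradient of the minimised objective, Eq. (16)
        theta -= mu * grad                         # descent step, Eq. (15)
    return theta
```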

IV. SIMULATIONS

To study the performance of our proposed objective function, a simulated experiment based on sinc function approximation has been conducted. The sinc function is a common benchmark example [11], [27] in neural network research. The data are generated by a function of the following form:

$$
y = \mathrm{sinc}(x) + e, \tag{17}
$$

where the noise term e is a zero-mean Gaussian noise with variance Se. The RBF network to be studied has 37 RBF nodes, whose centers are located at {−4.5, −4.25, · · · , 4.25, 4.5}, and σ is set to 0.1.

In the experiment, eight different values of Se have been examined:

{0.0025, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040}.

Eight training data sets, each consisting of 100 samples, are then generated for the different values of Se. Figure 1 shows the data set for which Se equals 0.010. A testing data set consisting of another 1001 noise-free samples, with x equal to −5, −4.99, · · · , 4.99, 5, is generated for validation.
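For concreteness, the data generation step could be sketched as below. The uniform sampling of the 100 training inputs over [−5, 5] and the use of the unnormalized sinc, sin(x)/x, are assumptions on our part; the paper only specifies the sample sizes, the noise variance Se, the test grid, the 37 centers and σ = 0.1.

```python
import numpy as np

def make_sinc_data(Se, n_train=100, rng=None):
    """Generate sinc regression data following Eq. (17).

    Assumptions: training inputs drawn uniformly from [-5, 5]; sinc(x) taken
    as sin(x)/x (np.sinc(x/pi)). The 1001 noise-free test points follow the
    grid -5, -4.99, ..., 4.99, 5 given in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_train = rng.uniform(-5.0, 5.0, size=n_train)
    y_train = np.sinc(x_train / np.pi) + np.sqrt(Se) * rng.standard_normal(n_train)
    x_test = np.linspace(-5.0, 5.0, 1001)
    y_test = np.sinc(x_test / np.pi)
    return x_train, y_train, x_test, y_test

centers = np.arange(-4.5, 4.51, 0.25)   # the 37 RBF centres {-4.5, -4.25, ..., 4.5}
sigma = 0.1
```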

For each Se, two sets of RBF networks are trained. The first set is trained by the explicit regularization method of Bernier et al.:

$$
J(\theta) = \frac{1}{N} \sum_{k=1}^{N} \big(y_k - \phi^T(x_k)\theta\big)^2 + \frac{S_\beta}{N}\, \theta^T \sum_{k=1}^{N} G(x_k)\, \theta, \tag{18}
$$

while the second set is trained by our method, Equation (14). In each set of networks, 11 different RBF networks, corresponding to the 11 different values of Sβ listed below, are trained:

$$
\{0, 0.01^2, 0.02^2, \cdots, 0.09^2, 0.1^2\}.
$$
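Since Equation (18) is quadratic in θ, its minimizer can be written in closed form. The sketch below is our reading of Equation (18), not code taken from [6].

```python
import numpy as np

def train_explicit_reg_rbf(Phi, y, Sb):
    """Closed-form minimiser of the explicit-regularisation objective, Eq. (18).

    Setting the gradient of Eq. (18) to zero gives
        (Phi^T Phi + Sb * sum_k G(x_k)) theta = Phi^T y,
    where sum_k G(x_k) is the diagonal matrix of the column sums of Phi**2.
    """
    A = Phi.T @ Phi + Sb * np.diag((Phi ** 2).sum(axis=0))
    return np.linalg.solve(A, Phi.T @ y)
```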

Fig. 1. The training and testing data sets for the sinc function experiment.

Fig. 2. The average training and testing MSE against Se (panels: training and testing error for Sβ = 0.0025 and Sβ = 0.0100; legend: Bernier et al. method vs. our method).

For the Bernier et al. regularization method, the objective function in Equation (18) is quadratic in θ, so the weight vector can be obtained directly. For our method, as the objective function is not quadratic, the model is obtained by the gradient descent algorithm defined in Equation (15), with µ set to 0.01.

Suppose a network model trained for specified Se and Sβ has been obtained and its weight vector is given by θ̂. Then 10000 random networks are generated in accordance with θ̂ and the following equation:

$$
\tilde{\theta} = \big( (1 + \beta_1)\hat{\theta}_1, (1 + \beta_2)\hat{\theta}_2, \cdots, (1 + \beta_M)\hat{\theta}_M \big)^T,
$$

where βi ∼ N(0, Sβ) for all i = 1, · · · , M. Their training mean square errors (MSE) and testing mean square errors are measured by fitting the network model θ̃ to the training and testing data sets corresponding to the specific Se.
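The fault-injection evaluation just described can be sketched as follows (an illustrative implementation; the function name and interface are our own).

```python
import numpy as np

def faulty_mse(theta_hat, Phi, y, Sb, n_networks=10000, rng=None):
    """Average MSE over faulty implementations of a trained weight vector.

    Each faulty network multiplies every weight by (1 + beta_i) with
    beta_i ~ N(0, Sb), as in the equation above, and is evaluated on (Phi, y).
    """
    rng = np.random.default_rng() if rng is None else rng
    M = theta_hat.shape[0]
    mse = 0.0
    for _ in range(n_networks):
        theta_tilde = (1.0 + np.sqrt(Sb) * rng.standard_normal(M)) * theta_hat
        mse += np.mean((y - Phi @ theta_tilde) ** 2)
    return mse / n_networks
```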

To highlight the performance differences, the average MSE against the values of Se is plotted in Figure 2. The dot-dash lines with circles correspond to the results obtained using the Bernier et al. method, while the solid lines with triangles correspond to the results obtained by our method. Owing to the page limit, we only show the cases where Sβ equals 0.0025 and 0.01. It is clear from Figure 2 that both methods can generate neural networks fitting the training data well, with almost identical performance. Their performances on the testing data sets, however, are different: our method works better than the Bernier et al. method.


V. DISCUSSION

Before concluding the paper, a few remarks should be added. First, the objective function derived here is similar but not identical to the one Amari derived for stochastic perceptron learning on p. 1398 of [1]. In [1], the noise is an additive white noise that corrupts the outputs of the hidden nodes and output nodes, while the noise we consider here is a multiplicative weight noise corrupting only the output weights.

Second, the objective function derived in this paper can equally be applied to simple additive weight noise. For weights corrupted by additive white noise of zero mean and variance Sβ, S(xk, θ) reduces to Se + Sβ θ^T θ. Then, applying the gradient descent method, a training algorithm similar to the one presented in this paper can be obtained. If furthermore we assume Sβ ≪ Se, it can readily be shown that the objective function in Equation (14) reduces to the well-known weight decay objective function [19].

Third, our objective function is in fact related to other Bayesian-type objective functions. In this paper, we have assumed that there is no prior information on the distribution of θ. If P(θ) is known in advance and the data set D is assumed to be generated by a set of fault models with conditional probability P(θ̃|θ), the estimate of θ can be obtained by the Bayesian Model Averaging (BMA) technique [14], [23]. BMA is essentially an extension of Bayesian inference for estimating a model with parameter uncertainty:

$$
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}, \tag{19}
$$

$$
P(D \mid \theta) = \int P(D \mid \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta}. \tag{20}
$$

In this case, an RBF network can be obtained by maximizing P(θ|D), i.e. maximizing

$$
\log P(\theta) - \log P(D) + \sum_{k=1}^{N} \log\!\left( \int P(x_k, y_k \mid \tilde{\theta}, \theta)\, P(\tilde{\theta} \mid \theta)\, d\tilde{\theta} \right),
$$

where N is the total number of data points in the set D. Comparing the above objective function with ours, the difference lies only in the a priori term log P(θ). Therefore, our method can be treated as a special case of the BMA method in which P(θ) is assumed to be constant.

VI. CONCLUSION

In this paper, we have derived, from a new approach based on the KL divergence, an objective function for training an RBF network that is able to tolerate multiplicative weight noise. For an RBF network suffering from multiplicative output-weight noise, simulation results have indicated that the RBF network trained with the new objective function outperforms the one trained using Bernier et al.'s explicit regularizer. As the KL divergence is closely related to Bayesian learning, a discussion has been included at the end of the paper to elucidate the relations between our proposed objective function and other Bayesian-type objective functions.

ACKNOWLEDGEMENT

The work presented in this paper is supported by a research grant from the City University of Hong Kong (Project No. 7001850) and a grant from the NSC of Taiwan (Project No. NSC 95-2221-E-040-009).

REFERENCES

[1] Amari S.I., Information geometry of the EM and em algorithms for neural networks, Neural Networks, Vol. 8(9), 1379-1408, 1995.

[2] Bernier J.L. et al., An accurate measure for multilayer perceptron tolerance to weight deviations, Neural Processing Letters, Vol. 10(2), 121-130, 1999.

[3] Bernier J.L. et al., Obtaining fault tolerant multilayer perceptrons using an explicit regularization, Neural Processing Letters, Vol. 12, 107-113, 2000.

[4] Bernier J.L. et al., A quantitative study of fault tolerance, noise immunity and generalization ability of MLPs, Neural Computation, Vol. 12, 2941-2964, 2000.

[5] Bernier J.L. et al., Improving the tolerance of multilayer perceptrons by minimizing the statistical sensitivity to weight deviations, Neurocomputing, Vol. 31, 87-103, 2000.

[6] Bernier J.L. et al., Assessing the noise immunity and generalization of radial basis function networks, Neural Processing Letters, Vol. 18(1), 35-48, 2003.

[7] Bishop C.M., Training with noise is equivalent to Tikhonov regularization, Neural Computation, Vol. 7, 108-116, 1995.

[8] Burr J.B., Digital neural network implementation, in Neural Networks: Concepts, Applications, and Implementations, Volume 2, P. Antognetti and V. Milutinovic (eds.), 237-285, Prentice Hall, 1991.

[9] Catala M.A. and X.L. Parra, Fault tolerance parameter model of radial basis function networks, IEEE ICNN'96, Vol. 2, 1384-1389, 1996.

[10] Cavalieri S. and O. Mirabella, A novel learning algorithm which improves the partial fault tolerance of multilayer neural networks, Neural Networks, Vol. 12, 91-106, 1999.

[11] Chen S., Local regularization assisted orthogonal least squares regression, Neurocomputing, 559-585, 2006.

[12] Choi J.Y. and C.H. Choi, Sensitivity of multilayer perceptrons with differentiable activation functions, IEEE Transactions on Neural Networks, Vol. 3, 101-107, 1992.

[13] Fontenla-Romero O. et al., A measure of fault tolerance for functional networks, Neurocomputing, Vol. 62, 327-347, 2004.

[14] Hoeting J.A. et al., Bayesian model averaging: A tutorial, Statistical Science, Vol. 14, No. 4, 382-417, 1999.

[15] Holt J. and J. Hwang, Finite precision error analysis of neural network hardware implementations, IEEE Transactions on Computers, Vol. 42, 281-290, 1993.

[16] Jim K.C., C.L. Giles and B.G. Horne, An analysis of noise in recurrent neural networks: Convergence and generalization, IEEE Transactions on Neural Networks, Vol. 7, 1424-1438, 1996.

[17] Kullback S., Information Theory and Statistics, Wiley, 1959.

[18] Lam P.M., C.S. Leung and T.T. Wong, Noise-resistant fitting for spherical harmonics, IEEE Transactions on Visualization and Computer Graphics, Vol. 12(2), 254-265, March-April 2006.

[19] Moody J.E., Note on generalization, regularization, and architecture selection in nonlinear learning systems, First IEEE-SP Workshop on Neural Networks for Signal Processing, 1991.

[20] Murray A.F. and P.J. Edwards, Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training, IEEE Transactions on Neural Networks, Vol. 5(5), 792-802, 1994.

[21] Parra X. and A. Catala, Fault tolerance in the learning algorithm of radial basis function networks, Proc. IJCNN 2000, Vol. 3, 527-532, 2000.

[22] Piche S.W., The selection of weight accuracies for Madalines, IEEE Transactions on Neural Networks, Vol. 6, 432-445, 1995.

[23] Santafe G., J.A. Lozano and P. Larranaga, Bayesian model averaging of naive Bayes for clustering, IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, Vol. 36, No. 5, 1149-1161, 2006.

[24] Simon D., Distributed fault tolerance in optimal interpolative nets, IEEE Transactions on Neural Networks, Vol. 12(6), 1348-1357, 2001.

[25] Stevenson M., R. Winter and B. Widrow, Sensitivity of feedforward neural networks to weight errors, IEEE Transactions on Neural Networks, Vol. 1, 71-80, 1990.

[26] Townsend N.W. and L. Tarassenko, Estimations of error bounds for neural network function approximators, IEEE Transactions on Neural Networks, Vol. 10(2), 217-230, 1999.

[27] Vapnik V., S. Golowich and A. Smola, Support vector method for function approximation, regression estimation, and signal processing, Neural Information Processing Systems, Vol. 9, 281-287, 1997.