
Please cite this article as: O. Adigun and B. Kosko, Noise-boosted bidirectional backpropagation and adversarial learning. Neural Networks (2019), https://doi.org/10.1016/j.neunet.2019.09.016.


2019 Special Issue

Noise-boosted bidirectional backpropagation and adversarial learning

Olaoluwa Adigun, Bart Kosko ∗

Department of Electrical and Computer Engineering, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564, USA

Article info

Article history: Available online xxxx

Keywords: Bidirectional backpropagation; Neural networks; Noise benefit; Stochastic resonance; Expectation–Maximization algorithm; Bidirectional associative memory

Abstract

Bidirectional backpropagation trains a neural network with backpropagation in both the backward and forward directions using the same synaptic weights. Special injected noise can then improve the algorithm's training time and accuracy because backpropagation has a likelihood structure. Training in each direction is a form of generalized expectation–maximization because backpropagation itself is a form of generalized expectation–maximization. This requires backpropagation invariance in each direction: The gradient log-likelihood in each direction must give back the original update equations of the backpropagation algorithm. The special noise makes the current training signal more probable as bidirectional backpropagation climbs the nearest hill of joint probability or log-likelihood. The noise for injection differs for classification and regression even in the same network because of the constraint of backpropagation invariance. The backward pass in a bidirectionally trained classifier estimates the centroid of the input pattern class. So the feedback signal that arrives back at the input layer of a classifier tends to estimate the local pattern-class centroid. Simulations show that noise speeded convergence and improved the accuracy of bidirectional backpropagation on both the MNIST test set of hand-written digits and the CIFAR-10 test set of images. The noise boost further applies to regular and Wasserstein bidirectionally trained adversarial networks. Bidirectionality also greatly reduced the problem of mode collapse in regular adversarial networks.

© 2019 Published by Elsevier Ltd.

1. Introduction: From adaptive resonance to noise-boosted bidirectional backpropagation and adversarial learning

What is a feedback signal that arrives at the input layer of a neural network?

Grossberg answered this question with his adaptive resonance theory or ART: The feedback signal is an expectation (Grossberg, 1976, 1982, 1988). The neural network expects to see this feedback signal or pattern given the current input signal that stimulated the network and given the pattern associations that it has learned. So any synaptic learning should depend on the match or mismatch between the input signal and the feedback expectation (Grossberg, 2017).

Grossberg gave this ART answer in the special case of a 2-layer neural network. The two layers defined the input and output fields of neurons. The two layers can also define two stacked contiguous layers of neurons in a larger network with multiple such stacked layers (Bengio et al., 2009). The topological point is that the neural signals flow bidirectionally between the two layers. There are no hidden or intervening layers. The synapses of the forward flow differ in general from the synapses of the backward flow.

∗ Corresponding author. E-mail address: [email protected] (B. Kosko).

A bidirectional associative memory or BAM results if the two synaptic webs are the same (Kosko, 1987, 1988, 1990, 1991a). A BAM is in this sense a minimal 2-layer neural network. The input signal passes forward through a synaptic matrix M. Then the backward signal passes through the transpose M^T of the same matrix M. This minimal BAM structure holds for stacked contiguous layers that reverberate through the same weight matrix M (Graves, Mohamed, & Hinton, 2013; Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010).

The basic BAM stability theorem results if the network uses the transpose M^T for the backward pass because then the forward and backward network Lyapunov functions are equal. The BAM theorem states that every real rectangular matrix M is bidirectionally stable for threshold or sigmoidal neurons: Any input stimulation quickly leads to an equilibrium or resonating bidirectional fixed point of a fixed input vector and a fixed output vector. This BAM stability holds for the wide class of Cohen–Grossberg neuron nonlinearities (Cohen & Grossberg, 1983; Kosko, 1990, 1991a). It still holds even if the synaptic weights simultaneously change in accord with a Hebbian or competitive learning law (Kosko, 1991a). It also holds for time-lags and for many other recent extensions of the basic neuron models (Ali, Yogambigai, Saravanan, & Elakkia, 2019; Bhatia & Golman, 2019; Maharajan, Raja, Cao, Rajchakit, & Alsaedi, 2018; Wang, Chen, & Liu, 2018).
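As an illustration of this back-and-forth recall, the sketch below runs a small random threshold BAM until it reaches a bidirectional fixed point. The matrix size, random seed, and variable names are illustrative assumptions, not values from the paper's simulations.

```python
import numpy as np

def threshold(v):
    # Bipolar threshold neurons with zero threshold.
    return np.where(v >= 0.0, 1.0, -1.0)

rng = np.random.default_rng(0)
n, p = 8, 5
M = rng.uniform(-1.0, 1.0, size=(p, n))    # forward synaptic matrix (illustrative)
x = threshold(rng.uniform(-1.0, 1.0, n))   # random bipolar input

for sweep in range(100):
    y = threshold(M @ x)          # forward pass through M
    x_next = threshold(M.T @ y)   # backward pass through the transpose M^T
    if np.array_equal(x_next, x): # resonating bidirectional fixed point
        print("bidirectional fixed point after", sweep + 1, "sweeps")
        break
    x = x_next
print("x* =", x)
print("y* =", y)
```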



Fig. 1. Two bidirectional representations of the 4-bit bipolar permutation function in Table 1. (a) A 3-layer bidirectional associative memory (BAM) exactly represents the permutation function and its inverse. The forward or left-to-right direction encodes the permutation mapping. The backward or right-to-left direction encodes the inverse mapping. Noiseless bidirectional backpropagation trained the BAM network. The best set of weights and thresholds used a single hidden layer with 5 threshold neurons. (b) A noise-boosted 3-layer BAM that exactly represents the same permutation function and its inverse. Bidirectional backpropagation with NEM-noise injection trained the network. The NEM noise boost reduced the number of hidden neurons from 5 to 4.

Simple fixed-point stability need not hold in general when the BAM contains one or more hidden layers of nonlinear units. It does hold for the two 3-layer BAM networks in Fig. 1 because they represent the invertible permutation mapping in Table 1. Simulations showed that such multilayer feedback BAMs still often converged to fixed points. Fig. 3 and Tables 2–7 show how this convergence behavior varied with the number of hidden layers and the number of neurons per layer.

1.1. BAMs extend to multilayer networks

The ART and BAM concepts apply in the more general case when the network has any number of hidden or intervening neural layers between the input and output layers. This paper considers just such generalized or extended ART–BAM networks for supervised learning.

The extended BAM's feedback structure depends on the neural network's reverse mapping N^T : R^K → R^n from output vectors y ∈ R^K back to vectors x in the input pattern space R^n. Suppose that the vector input pattern x ∈ R^n stimulates the multilayer neural network N. It produces the vector output y = N(x) ∈ R^K. Then the feedback signal x′ is the signal N^T(y) that the network N produces when it maps back from the output y = N(x) through its web of synapses to the input layer: x′ = N^T(y) = N^T(N(x)).

We use the notation N^T(y) to denote this feedback signal. The backward direction uses only the transpose matrices M^T of the synaptic matrices M that the network N uses in the forward direction. So these deep networks define generalized BAM networks with complete unidirectional sweeps from front to back and vice versa.

Table 1
4-bit bipolar permutation function (self-bijection) and its inverse that the 3-layer BAM networks represent in Figs. 1(a) and (b). The inverse simply maps from the output y back to the corresponding input x.

Input x          Output y
[− − − −]        [+ + − −]
[− − − +]        [+ + + −]
[− − + −]        [+ + + +]
[− − + +]        [+ + − +]
[− + − −]        [+ − − +]
[− + − +]        [− + − +]
[− + + −]        [+ − − −]
[− + + +]        [− + − −]
[+ − − −]        [+ − + +]
[+ − − +]        [− + + +]
[+ − + −]        [+ − + −]
[+ − + +]        [− + + −]
[+ + − −]        [− − + −]
[+ + − +]        [− − − −]
[+ + + −]        [− − − +]
[+ + + +]        [− − + +]
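A minimal sketch of these sweeps in a deeper network, assuming logistic hidden units and illustrative layer sizes: the forward sweep uses the weight matrices and the backward sweep reuses their transposes.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
sizes = [16, 10, 8, 4]                               # n, hidden, hidden, K (illustrative)
Ws = [rng.normal(0.0, 0.5, (sizes[i + 1], sizes[i])) # one matrix per contiguous layer pair
      for i in range(len(sizes) - 1)]

def forward(x):                  # N(x): left-to-right sweep through the weights
    a = x
    for W in Ws:
        a = logistic(W @ a)
    return a

def backward(y):                 # N^T(y): right-to-left sweep through the transposes
    a = y
    for W in reversed(Ws):
        a = logistic(W.T @ a)
    return a

x = rng.uniform(0.0, 1.0, sizes[0])
y = forward(x)                   # what the network emits
x_fb = backward(y)               # what the network "expects to see" at the input
print(y.round(3), x_fb.round(3))
```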

The transpose notation N^T also makes clear that the feedback or backward signal N^T(y) differs from the set-theoretic inverse or pullback N^{−1}(y). The pullback mapping N^{−1} : 2^{R^K} → 2^{R^n} always exists. It maps sets B in the output space to sets A in the input space: A = N^{−1}(B) = {x ∈ R^n : N(x) ∈ B} if B ⊂ R^K. So the set-theoretic inverse N^{−1}(y) = N^{−1}({y}) = A partitions the input pattern space R^n into the two sets A and A^c. All pattern vectors x in A map to y. All other inputs map elsewhere.

The point inverse N^{−1} : R^K → R^n exists only in the rare bijective case that the network N : R^n → R^K is both one-to-one and onto and n = K. The 3-layer BAM networks in Figs. 1(a)–(b) are just such rare cases of bijective networks because they exactly represent the permutation mapping in Table 1. The bidirectional learning algorithms below do not require that the networks N have a point inverse N^{−1}.

Table 2
BAM stability without training. The network used no hidden neurons. The input–output configuration was n ←→ p. There were n input threshold neurons and p output threshold neurons. We picked the BAM matrix values uniformly from the bipolar interval [−1, 1].

Network configuration    % of convergence    Max run to converge    Max period for non-convergence
Bipolar
  5 ←→ 5                 100%                5 runs                 0
  10 ←→ 10               100%                7 runs                 0
  50 ←→ 50               100%                15 runs                0
  100 ←→ 100             100%                24 runs                0
  500 ←→ 500             100%                60 runs                0
Binary
  5 ←→ 5                 100%                6 runs                 0
  10 ←→ 10               100%                10 runs                0
  50 ←→ 50               100%                13 runs                0
  100 ←→ 100             100%                20 runs                0
  500 ←→ 500             100%                30 runs                0

Table 3
Stability of bidirectional associative memory (BAM) networks without training. We picked the synaptic weights uniformly in the bipolar interval [−1, 1]. The networks used either one or two hidden layers of threshold neurons and so did not always yield a BAM fixed point. The network configuration was n ←→ h1 ←→ p for networks with one hidden layer and n ←→ h1 ←→ h2 ←→ p for networks with two hidden layers. The term n equaled the number of input neurons, h1 was the size of the first hidden layer in terms of its threshold neurons, h2 was the size of the second hidden layer, and p was the number of output neurons. There were 50 hidden neurons in total.

Network configuration         % of convergence    Max run to converge    Max run for non-convergence
Bipolar
  5 ↔ 50 ↔ 5                  89.6%               9 runs                 5 runs
  10 ↔ 50 ↔ 10                79.1%               15 runs                10 runs
  50 ↔ 50 ↔ 50                43.4%               99 runs                100 runs
  100 ↔ 50 ↔ 100              53.6%               85 runs                84 runs
  500 ↔ 50 ↔ 500              99.4%               10 runs                2 runs
Binary
  5 ↔ 50 ↔ 5                  79.9%               9 runs                 9 runs
  10 ↔ 50 ↔ 10                62.5%               22 runs                19 runs
  50 ↔ 50 ↔ 50                37.5%               99 runs                100 runs
  100 ↔ 50 ↔ 100              57.8%               91 runs                99 runs
  500 ↔ 50 ↔ 500              99.4%               10 runs                2 runs
Bipolar
  5 ↔ 25 ↔ 25 ↔ 5             81.3%               8 runs                 99 runs
  10 ↔ 25 ↔ 25 ↔ 10           62.7%               14 runs                100 runs
  50 ↔ 25 ↔ 25 ↔ 50           44.6%               44 runs                100 runs
  100 ↔ 25 ↔ 25 ↔ 100         51.9%               30 runs                100 runs
  500 ↔ 25 ↔ 25 ↔ 500         54.7%               26 runs                100 runs
Binary
  5 ↔ 25 ↔ 25 ↔ 5             63.6%               11 runs                99 runs
  10 ↔ 25 ↔ 25 ↔ 10           43.4%               34 runs                100 runs
  50 ↔ 25 ↔ 25 ↔ 50           50.8%               71 runs                100 runs
  100 ↔ 25 ↔ 25 ↔ 100         72.1%               38 runs                100 runs
  500 ↔ 25 ↔ 25 ↔ 500         99.5%               12 runs                100 runs

Extending BAMs to multilayer networks entails relaxing BAM fixed-point stability in general. Stability still holds between any two contiguous fields that reverberate in isolation or subject only to fixed or slowly changing inputs from adjoining neural layers. Some of the figures below show how simple fixed-point BAM stability tends to fall off as the number of hidden layers increases. We often take as the network's output its first forward and backward results. We can also let the BAM reverberate in multiple back-and-forth sweeps before we record an equilibrium or quasi-equilibrium output.

1.2. The implicit bidirectionality of classifier networks

An important special case is the modern deep classifier network (Bishop, 2006; Jordan & Mitchell, 2015; LeCun, Bengio, & Hinton, 2015; Mohri, Rostamizadeh, & Talwalkar, 2018). These feedforward networks map images or videos or other patterns to K softmax output neurons. The input neurons are most often identity functions because they act as data registers.

Classifier networks map patterns to probability vectors. A softmax output activation has the ratio form of an exponential divided by a sum of K such exponentials. So the output vector y = N(x) defines a probability vector of length K. The network output y is a point in the (K−1)-dimensional simplex S^{K−1} ⊂ R^K.
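A short softmax sketch (with illustrative output-layer inputs) confirms that the output is a probability vector on the simplex:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())        # subtract the max for numerical stability
    return e / e.sum()

o = np.array([2.0, -1.0, 0.5, 0.0])   # output-layer inputs o_k (illustrative)
y = softmax(o)
print(y, y.sum())                     # nonnegative entries that sum to 1
```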

Supervised learning for the classifier network uses 1-in-K encoding for the K standard basis or unit bit vectors e_1, . . . , e_K. Input pattern x ∈ C_k ⊂ R^n should map to the corresponding kth unit bit vector e_k ∈ S^{K−1} if C_k is the kth input pattern class. An ideal classifier emits N(x) = e_k if and only if x ∈ C_k. Then the K set-theoretic pullbacks N^{−1}(e_k) of the K basis vectors e_k partition the input pattern space R^n into the K desired pattern classes C_k: R^n = ∪_{k=1}^{K} N^{−1}(e_k) = ∪_{k=1}^{K} C_k.

A classifier network defines a bidirectional feedback system despite its feedforward mapping of patterns to probabilities. Most users simply ignore the backward pass and thus ignore the overall bidirectional structure. The reverse network N^T : S^{K−1} → R^n can always map probability descriptions y back to patterns in the input space through the reverse mapping N^T by using the transpose of all synaptic weight matrices. This holds even when the classifier encodes time-varying patterns in simple recurrent loops among their hidden layers (Hochreiter & Schmidhuber, 1997).


Table 4
Stability of multilayered BAMs without training. We picked the synaptic weight values uniformly in the bipolar interval [−1, 1]. The networks used one or two hidden layers of threshold neurons. The configuration was n ←→ h1 ←→ p for BAM networks with one hidden layer and n ←→ h1 ←→ h2 ←→ p for BAM networks with two hidden layers. The term n equaled the number of input threshold neurons, h1 was the size of the first hidden layer in terms of its threshold neurons, h2 was the size of the second hidden layer, and p was the number of output threshold neurons. There were 200 hidden neurons in total.

Network configuration           % of convergence    Max run to converge    Max run for non-convergence
Bipolar
  5 ↔ 200 ↔ 5                   90.4%               7 runs                 5 runs
  10 ↔ 200 ↔ 10                 77.4%               12 runs                8 runs
  50 ↔ 200 ↔ 50                 27.0%               99 runs                100 runs
  100 ↔ 200 ↔ 100               2.0%                99 runs                100 runs
  500 ↔ 200 ↔ 500               1.9%                99 runs                100 runs
  1000 ↔ 200 ↔ 1000             54.5%               99 runs                100 runs
Binary
  5 ↔ 200 ↔ 5                   90.4%               8 runs                 6 runs
  10 ↔ 200 ↔ 10                 61.7%               25 runs                12 runs
  50 ↔ 200 ↔ 50                 3.1%                98 runs                100 runs
  100 ↔ 200 ↔ 100               0.0%                No convergence         100 runs
  500 ↔ 200 ↔ 500               1.8%                99 runs                100 runs
  1000 ↔ 200 ↔ 1000             54.5%               99 runs                100 runs
Bipolar
  5 ↔ 100 ↔ 100 ↔ 5             78.7%               9 runs                 100 runs
  10 ↔ 100 ↔ 100 ↔ 10           55.2%               13 runs                100 runs
  50 ↔ 100 ↔ 100 ↔ 50           2.2%                90 runs                100 runs
  100 ↔ 100 ↔ 100 ↔ 100         0.9%                96 runs                100 runs
  500 ↔ 100 ↔ 100 ↔ 500         1.5%                96 runs                100 runs
  1000 ↔ 100 ↔ 100 ↔ 1000       2.3%                99 runs                100 runs
Binary
  5 ↔ 100 ↔ 100 ↔ 5             61.3%               11 runs                100 runs
  10 ↔ 100 ↔ 100 ↔ 10           30.3%               30 runs                100 runs
  50 ↔ 100 ↔ 100 ↔ 50           0.0%                No convergence         100 runs
  100 ↔ 100 ↔ 100 ↔ 100         0.0%                No convergence         100 runs
  500 ↔ 100 ↔ 100 ↔ 500         13.3%               99 runs                100 runs
  1000 ↔ 100 ↔ 100 ↔ 1000       82.0%               89 runs                100 runs

Table 5
Classification accuracies of multilayer BAMs trained on the MNIST dataset of handwritten digits. We computed both the one-shot classification accuracy and the long-shot accuracy over 100 bidirectional sweeps through the trained networks.

Hidden layer configuration            One-shot accuracy    Long-shot accuracy    Long-shot squared error
Bipolar
  1024                                98.37%               98.21%                0.0
  512 ←→ 512                          98.15%               98.17%                0.0
  256 ←→ 256 ←→ 256 ←→ 256            97.67%               97.35%                0.0
Binary
  1024                                98.47%               98.40%                0.0
  512 ←→ 512                          98.49%               98.45%                0.0
  256 ←→ 256 ←→ 256 ←→ 256            97.90%               97.85%                0.0

Table 6
Classification accuracies of multilayer BAMs trained on the CIFAR-10 dataset. We computed both the one-shot accuracy and the long-shot accuracy over 100 bidirectional sweeps through the trained networks.

Hidden layer configuration            One-shot accuracy    Long-shot accuracy    Long-shot squared error
Bipolar
  1024                                55.65%               54.3%                 2.1 × 10⁻⁶
  512 ←→ 512                          53.54%               42.27%                4.6 × 10⁻⁶
  256 ←→ 256 ←→ 256 ←→ 256            50.38%               27.76%                2.8 × 10⁻⁵
Binary
  1024                                55.26%               52.61%                0.0
  512 ←→ 512                          52.28%               33.33%                5.3 × 10⁻⁶
  256 ←→ 256 ←→ 256 ←→ 256            48.09%               26.02%                1.9 × 10⁻⁶

Table 7
Centroid analysis for the backward pass of a multilayered BAM trained on the MNIST handwritten-digit dataset. We compared the Euclidean distance d^(t) between the backward-pass vector x = N^T(y) at time t and the sample class centroids over bidirectional sweeps through the trained network. The simulations used d_max = max_{1⩽t⩽100} d^(t), d_min = min_{1⩽t⩽100} d^(t), and d̄ = (1/N) Σ_{t=1}^{100} d^(t).

                                      Correct classification                       Misclassification
Hidden layer configuration            Min (d_min)    Max (d_max)   Mean (d̄)        Min (d_min)   Max (d_max)   Mean (d̄)
Bipolar
  1024                                −1.0 × 10⁻²    2.4965        5.8 × 10⁻³       −0.2704       4.1620        2.3864
  512 ←→ 512                          −3.3 × 10⁻²    4.0669        3.0 × 10⁻³       −0.4219       3.4808        0.2280
  256 ←→ 256 ←→ 256 ←→ 256            −2.1 × 10⁻²    0.9085        0.0014 × 10⁻³    −0.6152       2.7178        0.0856
Binary
  1024                                −2.1 × 10⁻²    3.6311        8.3 × 10⁻³       −3.2937       0.4729        −0.4167
  512 ←→ 512                          −8.1 × 10⁻³    5.0859        4.0 × 10⁻³       −1.8951       2.6759        −0.0700
  256 ←→ 256 ←→ 256 ←→ 256            −7.2 × 10⁻³    4.3520        0.0052 × 10⁻³    −4.9765       1.3177        −3.1570

The reverse pass through N^T can thereby answer why-type causal questions: Why did this observed output occur? What type of input pattern caused this result?

Pattern answers to these why-type questions are versions of Grossberg's network feedback expectations. The images in Fig. 5 show such answers or expectations from a deep BAM network with 7 hidden layers. The bidirectional network trained on CIFAR-10 images. The convolutional classifier in Fig. 6 shows similar reverse-pass images N^T(y) or N^T(e_k) from a deep convolutional network in BAM mode. Simple unidirectional backpropagation training produced the uninformative reverse-pass images in the (b) panels of both figures. Proper bidirectional training with the B-BP algorithm below produced reverse-pass images that closely match the sample centroids of the pattern classes.

The forward pass through the network N answers comparatively simpler what-if questions: What happens if this input stimulates the system? What effect will this input cause? Bidirectional processing can help even here as many of the simulations show. Modern unidirectional classifiers N simply ignore the reverse-mapping information housed in the transpose matrices M^T of the reverse pass N^T. The equilibrium forward pass through the network N differs in general from the first forward pass through N. Modern feedforward classifiers simply take this first forward pass as the network's final output.

This implied BAM feedback structure requires only that output vectors y ∈ S^{K−1} pass back through the transpose matrices of all the synaptic matrices that the forward pass used to produce a given output from a given input. This backward pass also requires a related backward pass through any windows or masks in convolutional classifiers (Fukushima, 1980; Krizhevsky, Sutskever, & Hinton, 2012; LeCun, Bengio, et al., 1995). The last section shows how the simulations passed output information backward through the convolutional filters of deep convolutional networks.

Then even a convolutional classifier can produce a feedback signal x′ = N^T(y) at the input layer (and at any hidden layer). The feedback signal x′ looks like a handwritten digit if the classifier network trains on the MNIST digit data. It looks like some other image if it trains on the CIFAR image test set. So we can train the network in the backward direction N^T if we compute some form of error based on the triggering input pattern x ∈ C_k and based on what the network expects to see x′ = N^T(y) = N^T(N(x)) in the Grossberg sense of ART. This insight leads to the bidirectional supervised learning algorithm below.

We also point out that the backward direction in such classifiers performs a form of statistical regression. This holds because the input neurons are identity units. It does not hold if the input neurons have logistic or other nonlinear form. The regression structure leads to an optimal mean-squared structure in the backward direction as we explain below.

The regression structure in the backward direction dictates both a different training regimen and a different noise-injection algorithm from the forward classifier. The backward tendency toward mean-squared estimation implies that such classifiers estimate the K class centroids c_k in the backward direction. So the feedback signal x′ = N^T(y) shows what the classifier network expects to "see" or perceive at the input field given the current pattern stimulus x: It shows what the network expects the local class centroid to look like. That explains why the backward-pass signals in Figs. 5 and 6 so closely match the sample class centroids in their (a) panels.
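The centroid claim is easy to check numerically. The sketch below uses synthetic Gaussian clusters as a stand-in for image classes and verifies that the class centroid beats any other candidate under total squared error; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n = 3, 6
X = rng.normal(size=(300, n))               # synthetic "patterns" (stand-in for images)
labels = rng.integers(0, K, size=300)       # synthetic class labels

centroids = np.stack([X[labels == k].mean(axis=0) for k in range(K)])

# The centroid c_k minimizes the total squared error over class C_k:
k = 0
cls = X[labels == k]
err_centroid = ((cls - centroids[k]) ** 2).sum()
err_other = ((cls - cls[0]) ** 2).sum()     # any other candidate does no better
print(err_centroid <= err_other)            # True
```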

1.3. Bidirectional backpropagation through BP invariance

The new bidirectional backpropagation (B-BP) algorithm extends these ART and BAM concepts to supervised learning in multilayer networks (Adigun & Kosko, 2016, 2019). B-BP trains the neural network in both the forward and backward directions using the same weights. Using the same weights entails using the matrix transposes M^T in the backward direction for any synaptic weight matrices M that the network N uses in the forward direction. So this bidirectional processing through the backward-pass network N^T converts feedforward networks N into BAM networks. The B-BP algorithm shows how to extend ordinary unidirectional BP (Rumelhart, Hinton, & Williams, 1986; Werbos, 1974) without overwriting in either direction.

The added computational cost of B-BP is slight compared with ordinary or forward-only BP. This holds because the time complexity for BP in either direction is O(n) for n input–output training samples. The complexity in the B-BP case remains O(n) because O(n) + O(n) = O(n).

Fig. 1 shows two such 3-layer BAM networks after B-BP training. All neurons are threshold on–off neurons with zero threshold. Each feedback network exactly represents the same 4-bit bipolar permutation mapping and its inverse from Table 1. Each network maps a 4-bit vector x ∈ {−1, 1}^4 to a 4-bit vector y ∈ {−1, 1}^4. Each network's backward direction maps y back to the same input x. So the input x = (−1, 1, 1, 1) maps to y = (−1, 1, −1, −1) and conversely. So the vector pair (x, y) is a BAM fixed point of each network. This holds for all 16 vector pairs in Table 1 for each BAM network. There are more than 20 trillion such 4-bit permutation mappings and associated inverse mappings because there are (2^4)! or 16! ways to permute the 16 input vectors in Table 1.
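The count follows from permuting the 16 vertices of {−1, 1}^4:

```python
import math
print(math.factorial(16))   # 20922789888000, about 2.1 * 10**13 permutations
```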

The two BAM networks differ both in their synaptic values and in their number of hidden neurons. The BAM network in Fig. 1(a) needs 5 hidden threshold neurons to learn the permutation mapping in Table 1 with B-BP. Simulations with 4 or fewer hidden neurons failed to find a representation. The BAM network in Fig. 1(b) needs only 4 hidden neurons because the proper additive noise boosted the B-BP learning as we will show below. The noise obeyed the sufficient condition for a B-BP noise boost in Theorem 4. The neurons were steep logistic functions during B-BP learning. We rounded them off to threshold neurons with zero thresholds after learning.

An earlier theorem showed that a 3-layer threshold BAM network can always exactly represent an n-bit permutation and its inverse if the network uses 2^n hidden threshold neurons (Adigun & Kosko, 2016, 2019). That architecture would require 16 hidden neurons in this case. So the B-BP algorithm reduced this exponential number of hidden neurons in n to a linear number of hidden neurons.

The main problem with bidirectional learning is overwriting. Learning in one direction tends to overwrite or undo learning in the other direction. This may explain why users have applied BP almost exclusively in the forward direction for classifiers or regressors. Running simple BP in the backward direction will only degrade the prior learning in the forward direction.

B-BP solves the overwriting problem with two performance measures. Each direction gets its own performance measure based on whether that direction performs classification or regression or some other task. B-BP sums the two error or log-likelihood performance measures. The overall forward-and-backward sweeps minimize this summed error and so maximize the network's total log-likelihood.

Fig. 2 shows how the two performance measures overcome the problem of overwriting for the two 3-layer BAMs in Fig. 1. The first two sub-figures in Fig. 2 show that learning in one direction overwrites learning in the other direction with the same performance measure. The third figure shows rapid learning in both directions because it used the sum of the two correct performance measures. The performance measures were correct in the sense that they preserved the BP learning equations for a given choice of output-neuron structure and directional functionality. We call this backpropagation invariance.

BP invariance requires that the log-likelihoods must correspond to the functions that the forward and backward passes perform. A deep classifier gives a canonical example. It uses a cross-entropy performance measure for its forward pass of classification. The classifier implicitly uses a squared-error performance measure for its backward pass of regression. The overwhelming number of classifiers in practice simply ignore this backward-pass structure.

Fig. 2. Backpropagation overwrites in one direction as it trains in the other direction if it uses a single performance measure or error function. Bidirectional backpropagation avoids overwriting because it uses the sum of two error functions that match the terminal neuron structure and functionality of each direction. Simulations used NEM noise injection in each direction for the 3-layer BAM network in Fig. 1(b) that learned the 4-bit bipolar permutation in Table 1 and its inverse. The forward and backward performance measures were each a double cross entropy in Eq. (28) because the input and output neurons were logistic (later rounded off to threshold neurons). (a) Unidirectional BP training on the forward pass reduced the forward error but overwrote the learning of the inverse map in the backward direction. (b) Unidirectional BP training on the backward pass reduced the backward training error but overwrote the learning of the permutation map in the forward direction. (c) NEM noise injection with bidirectional BP trained the network in accord with the logistic NEM condition in Eq. (58) and with the sum of the forward and backward cross entropies. There was no overwriting in either direction. The NEM noise injection speeded convergence in both directions.

Fig. 3. Maximum number of runs or iterations that a BAM needed to converge if it did converge. BAMs with no hidden layers always converged in accord with the basic BAM theorem. We varied the size of the threshold BAM networks and compared bipolar encoding with binary encoding. We picked all synaptic weights uniformly from the bipolar interval [−1, 1]. (a) No hidden layer (classical BAM). (b) One hidden layer of threshold neurons. (c) Two hidden layers of threshold neurons.

B-BP's sum of performance measures stems from the bidirectional network's total log-likelihood structure. The forward direction has the likelihood function p(y|x, Θ) for input vector x and output vector y and all network parameters Θ. The backward direction has the converse likelihood function p(x|y, Θ). This likelihood structure holds in general at the layer level so long as the layer neurons and performance measure obey BP invariance.

The back-and-forth structure of bidirectionality gives the joint or total likelihood function as the product p(y|x, Θ) p(x|y, Θ). So the network log-likelihood L is just the sum L = log p(y|x, Θ) + log p(x|y, Θ). The negative log-likelihoods define the forward and backward error functions or performance measures. Then the gradient ∇L = ∇ log p(y|x, Θ) + ∇ log p(x|y, Θ) leads to the B-BP algorithm and its noise boost by way of the generalized EM algorithm.

The forward direction of the classifier uses the cross entropy between the desired output target distribution and the actual K softmax outputs. The cross entropy is just the negative of the logarithm of a one-trial multinomial probability density p(y|x, Θ). So the statistical structure of the forward pass corresponds to rolling a K-sided die. The backward pass uses the squared error at the input. This arises from taking the logarithm of a vector normal distribution p(x|y, Θ). It also implies that the backward learning estimates the kth local pattern-class centroid because the centroid minimizes the squared error.
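A rough sketch of this summed performance measure for a single-hidden-layer classifier, assuming logistic hidden units, softmax forward outputs, identity input units, and a backward sweep that starts from the 1-in-K target; the paper's B-BP algorithm fixes these details and also supplies the gradient updates, which the sketch omits.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(3)
n, h, K = 8, 6, 4                      # illustrative layer sizes
W = rng.normal(0.0, 0.3, (h, n))       # input -> hidden weights
U = rng.normal(0.0, 0.3, (K, h))       # hidden -> output weights

def bbp_error(x, t):
    # Forward pass x -> y: cross entropy against the 1-in-K target t.
    a_h = logistic(W @ x)
    a_y = softmax(U @ a_h)
    E_forward = -np.sum(t * np.log(a_y))
    # Backward pass through the transposes: squared error against the input x,
    # since identity input neurons make the backward direction a regression.
    a_h_back = logistic(U.T @ t)
    x_back = W.T @ a_h_back
    E_backward = 0.5 * np.sum((x - x_back) ** 2)
    return E_forward + E_backward      # B-BP descends this summed error

x = rng.normal(size=n)                 # stand-in input pattern
t = np.eye(K)[1]                       # 1-in-K target vector
print(bbp_error(x, t))
```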

The key point is that BP invariance must hold for proper bidirectional learning: The network can perform classification or regression or any other function that leaves the backpropagation likelihood structure invariant. The bidirectional network itself can perform different functions in different directions. Its directional likelihood structure then allows noise boosting in those different directions.

1.4. NEM noise-boosted bidirectional backpropagation

We show that injecting carefully chosen noise (not blind noise) into the input and output layers both speeds the convergence of B-BP and improves its accuracy. This special noise is just that noise that makes the signal more probable. It is NEM or Noisy EM noise because the ordinary unidirectional backpropagation algorithm turns out to be a special case of the generalized expectation–maximization algorithm (Audhkhasi, Osoba, & Kosko, 2016) and because this NEM noise always speeds the EM algorithm's ascent up the nearest hill of probability (Osoba & Kosko, 2016). We can also inject NEM noise into hidden units. This may require more involved matrix transformations.

We develop below the NEM noise inequalities that apply to B-BP training. These theoretical conditions follow the joint forward-and-backward log-likelihood structure of B-BP itself. The NEM noise is again just that noise n that makes the current signal y more probable: p(y|n) ⩾ p(y) on average. This simple inequality leads to an average likelihood ratio that suffices for an average noise boost. Boosting B-BP injects this noise into the output (or hidden) neurons at each likelihood-gradient step for the joint or bidirectional log-likelihood L = log p(y|x, Θ) + log p(x|y, Θ). This summed log-likelihood L also gives a general proof strategy for B-BP results: Prove the result in each direction and then combine directions. So we will review and extend the earlier unidirectional results for BP (Audhkhasi et al., 2016; Kosko, Audhkhasi, & Osoba, 2019) and then apply them as lemmas to B-BP.

We show last how to apply B-BP to generative adversarial networks (GANs) and how to boost their performance with bidirectional NEM noise. Adversarial networks combine at least two networks that train while they try to deceive each other. The generator network tends to generate ever better fake patterns from training patterns while the discriminator network tries to detect the fake patterns. We test these bidirectional results on the standard vanilla and Wasserstein GANs using the MNIST and CIFAR-10 image datasets. Bidirectional BP largely removed the problem of mode collapse in the standard GANs. Mode collapse is a type of training stutter where the network keeps emitting the same pattern. Wasserstein GANs tend to avoid the problem of mode collapse but at the cost of increased computation. NEM noise further improved GAN performance.

The overall insight is that neural feedback matters. Feedback can help a neural system as well as hurt it. The same is true of noise. They both require careful analysis if a user wants to apply them to a given nonlinear neural system.

2. Backpropagation and noise injection

The BP algorithm trains a neural network to approximate some mapping from the input space X to the output space Y (Bishop, 2006; Hinton, Rumelhart, & Williams, 1986; Jordan & Mitchell, 2015; LeCun et al., 2015; Werbos, 1974). The mapping is a simple function for an ideal classifier. The BP algorithm is itself a special case of generalized expectation–maximization (GEM) for maximum-likelihood estimation with latent or hidden parameters (Audhkhasi et al., 2016).

The reduction of GEM to BP follows from the key gradient identity ∇ log p(y|x, Θ) = ∇Q(Θ|Θ^(i)) that we re-derive below in (9). The BP noise benefit follows in turn from the EM-noise-boost results in Osoba and Kosko (2016). These EM results show that injecting noise helps the maximum-likelihood system bound faster up the nearest hill of probability on average if the noise satisfies a positivity condition that involves a likelihood ratio. We next review these results and then extend them to the bidirectional case.

2.1. Backpropagation invariance and expectation–maximization

BP training has a probabilistic structure based on maximum-likelihood estimation. BP trains a network by iteratively minimizing some error or performance measure E(Θ) that depends on the difference between the network's actual or observed output value y = N(x) and its desired or target value t. The parameter vector Θ describes the network's current configuration of synaptic weights and neuronal coefficients. There is some probability p(y|x, Θ) that a neural network N with parameters Θ will emit output y given input x. The probability p(y|x, Θ) defines the network's output-layer likelihood function.

BP invariance requires that the network error E(Θ) equals the negative of the network's log-likelihood function L: E(Θ) = −L = −log p(y|x, Θ). Then minimizing the network error E(Θ) maximizes the network log-likelihood log p(y|x, Θ) and vice versa. So BP performs maximum-likelihood estimation (Bishop, 2006):

$$\Theta^{*} = \arg\min_{\Theta} E(\Theta) = \arg\max_{\Theta} \log p(\mathbf{y}\mid\mathbf{x},\Theta). \qquad (1)$$

BP invariance helps explain why BP is a special case of generalized expectation–maximization. EM itself generalizes maximum likelihood to the case of hidden variables or missing data (Dempster, Laird, & Rubin, 1977). EM iteratively climbs the nearest hill of probability or log-likelihood as it alternates between a forward expectation step and a backward maximization step. The forward step computes or estimates an expectation with respect to the posterior density of the hidden variables given the current data and parameter estimates. The expectation defines a surrogate likelihood function Q. The backward step maximizes this surrogate likelihood given the current data and given the current estimate of the parameters. Generalized EM replaces the complete maximization of the surrogate with a gradient estimate or partial maximization. So both BP and generalized EM are back-and-forth gradient algorithms that find local maximum-likelihood parameters. The formal argument below that BP equals generalized EM shows how entropy minimization achieves this result if BP invariance holds.

BP invariance holds in particular for the two common cases of classification and regression. Classification requires both that the error function E(Θ) be cross entropy and that the output neurons have a softmax or other nonlinear form that produces a one-shot multinomial likelihood p(y|x, Θ). Regression requires instead that E(Θ) be squared error and that the output neurons be linear or identity neurons. Then the likelihood p(y|x, Θ) must equal a multidimensional Gaussian density. So its negative log-likelihood −L just equals the squared error E(Θ) up to an additive constant. Then both the classifier and regression networks have the same BP learning laws.

BP is a special case of the GEM algorithm because the gradient of the network (layer) likelihood log p(y|x, Θ) equals the gradient of EM's surrogate likelihood function Q(Θ|Θ^(i)) (Audhkhasi et al., 2016): ∇_Θ log p(y|x, Θ^(i)) = ∇_Θ Q(Θ^(i)|Θ^(i)) at each iteration i for the network's weight parameters Θ^(i) and input x.

We now restate the recent BP-as-GEM theorem (Audhkhasi et al., 2016) and then sketch its proof. The theorem states that the backpropagation update equation for a differentiable likelihood function p(y|x, Θ) at epoch i:

$$\Theta^{(i+1)} = \Theta^{(i)} + \eta\,\nabla_{\Theta}\log p(\mathbf{y}\mid\mathbf{x},\Theta)\Big|_{\Theta=\Theta^{(i)}} \qquad (2)$$

equals the GEM update equation at epoch i:

$$\Theta^{(i+1)} = \Theta^{(i)} + \eta\,\nabla_{\Theta}Q(\Theta\mid\Theta^{(i)})\Big|_{\Theta=\Theta^{(i)}} \qquad (3)$$

if GEM uses the differentiable Q-function

$$Q(\Theta\mid\Theta^{(i)}) = \mathbb{E}_{\mathbf{h}\mid\mathbf{y},\mathbf{x},\Theta^{(i)}}\big[\log p(\mathbf{h},\mathbf{y}\mid\mathbf{x},\Theta)\big]. \qquad (4)$$

The EM algorithm takes the expectation of log p(y|x, Θ) with respect to the hidden posterior p(h|y, x, Θ^(i)). The "EM trick" rewrites the conditional probability p(h|y, x, Θ) = p(h, y|x, Θ)/p(y|x, Θ) as p(y|x, Θ) = p(h, y|x, Θ)/p(h|y, x, Θ). Then taking hidden-posterior expectations of the log-likelihood log p(y|x, Θ) gives

$$\log p(\mathbf{y}\mid\mathbf{x},\Theta) = \mathbb{E}_{\mathbf{h}\mid\mathbf{y},\mathbf{x},\Theta^{(i)}}\big[\log p(\mathbf{y}\mid\mathbf{x},\Theta)\big] \qquad (5)$$

$$= \mathbb{E}_{\mathbf{h}\mid\mathbf{y},\mathbf{x},\Theta^{(i)}}\bigg[\log\frac{p(\mathbf{h},\mathbf{y}\mid\mathbf{x},\Theta)}{p(\mathbf{h}\mid\mathbf{y},\mathbf{x},\Theta)}\bigg] \qquad (6)$$

$$= Q(\Theta\mid\Theta^{(i)}) + H(\Theta\mid\Theta^{(i)}) \qquad (7)$$

if Q(Θ|Θ^(i)) is the EM surrogate likelihood with cross entropy H(Θ|Θ^(i)) = −E_{h|y,x,Θ^(i)}[log p(h|y, x, Θ)]. Taking gradients gives

$$\nabla\log p(\mathbf{y}\mid\mathbf{x},\Theta) = \nabla Q(\Theta\mid\Theta^{(i)}) + \nabla H(\Theta\mid\Theta^{(i)}). \qquad (8)$$

The entropy inequality H(Θ^(i)|Θ^(i)) ⩽ H(Θ|Θ^(i)) holds for all Θ because Jensen's inequality and the concavity of the logarithm imply that the Shannon entropy H(Θ^(i)|Θ^(i)) minimizes the cross entropy H(Θ|Θ^(i)). Hence ∇H(Θ^(i)|Θ^(i)) = 0. This gives the master equation:

$$\nabla\log p(\mathbf{y}\mid\mathbf{x},\Theta^{(i)}) = \nabla Q(\Theta^{(i)}\mid\Theta^{(i)}). \qquad (9)$$

So the BP and GEM gradients are identical at each iteration i.

We remark that the master gradient equation (9) holds at each hidden layer of the neural network. This allows NEM-noise injection in hidden layers as well as in output layers (Kosko et al., 2019). We focus on noise injection in output layers and just summarize how the layer-level equation applies. Suppose the network has k hidden layers h_k, . . . , h_1. The numbering starts in the forward direction with the first hidden layer h_1 after the input (identity) layer x. Then the total network likelihood is the probability density p(y, h_k, . . . , h_1 | x, Θ^(i)). The "chain rule" or multiplication theorem of probability factors this likelihood into a product of layer likelihoods:

$$p(\mathbf{y},\mathbf{h}_{k},\ldots,\mathbf{h}_{1}\mid\mathbf{x},\Theta^{(i)}) = p(\mathbf{y}\mid\mathbf{h}_{k},\ldots,\mathbf{h}_{1},\mathbf{x},\Theta^{(i)})\,p(\mathbf{h}_{k}\mid\mathbf{h}_{k-1},\ldots,\mathbf{h}_{1},\mathbf{x},\Theta^{(i)})\cdots p(\mathbf{h}_{2}\mid\mathbf{h}_{1},\mathbf{x},\Theta^{(i)})\,p(\mathbf{h}_{1}\mid\mathbf{x},\Theta^{(i)}) \qquad (10)$$

where we assume the input-data prior p(x) = 1 for simplicity. Taking logarithms at iteration i gives the total network log-likelihood L(x) as the sum of the layer log-likelihoods:

$$L(\mathbf{x}) = L(\mathbf{y}\mid\mathbf{x}) + L(\mathbf{h}_{k}\mid\mathbf{x}) + \cdots + L(\mathbf{h}_{1}\mid\mathbf{x}) \qquad (11)$$

where L(h_k|x) = log p(h_k | h_{k−1}, . . . , h_1, x, Θ^(i)). Then the above BP-as-GEM argument (5)–(9) applies at layer k if BP invariance holds at that layer.

We next present the likelihood structure for three different neural networks. The networks differ in the structure of their terminal-layer neurons and in the function that these networks perform. We consider three different likelihood structures based on the output activation of the networks. The two main neural models are classifiers and regressors. The third type of network is the logistic network.

A regression network maps the input vector space R^I to the output space R^K. I is the number of input neurons. K is the number of output neurons. A regression network uses identity activation functions at the output layer. The target vector t is a Gaussian random vector. It has a mean vector a^t and an identity or white covariance matrix I (which can generalize to a nonwhite covariance matrix):

$$\mathbf{t} \sim \mathcal{N}(\mathbf{t}\mid\mathbf{a}^{t},\mathbf{I}). \qquad (12)$$

The likelihood p_reg(t|x, Θ) of a regression network is

$$p_{\mathrm{reg}}(\mathbf{t}\mid\mathbf{x},\Theta) = \frac{1}{(2\pi)^{K/2}}\exp\Big\{-\frac{\lVert\mathbf{t}-\mathbf{a}^{t}\rVert^{2}}{2}\Big\} \qquad (13)$$

where ∥·∥ is the Euclidean norm. Then the log-likelihood L_reg of the regression network is a constant minus the squared error:

$$L_{\mathrm{reg}} = \log p_{\mathrm{reg}}(\mathbf{t}\mid\mathbf{x},\Theta) \qquad (14)$$

$$= \log(2\pi)^{-K/2} - \frac{\lVert\mathbf{t}-\mathbf{a}^{t}\rVert^{2}}{2} \qquad (15)$$

$$= \log(2\pi)^{-K/2} - \frac{1}{2}\sum_{k=1}^{K}\big(t_{k}-a_{k}^{t}\big)^{2}. \qquad (16)$$

So maximizing L_reg with respect to Θ minimizes the squared error E_reg of regression:

$$E_{\mathrm{reg}} = \frac{1}{2}\sum_{k=1}^{K}\big(t_{k}-a_{k}^{t}\big)^{2}. \qquad (17)$$

A classifier network maps the input space R^I to a length-K probability vector in [0, 1]^K. The output neurons of a classification network use a softmax or Gibbs activation function. So an output neuron has the form of an exponential divided by a sum of K such exponentials. The target vector t is a unit bit vector. It has a 1 in the kth slot and 0s elsewhere. The probability that the kth scalar component t_k equals 1 is

$$p(t_{k}=1\mid\mathbf{x},\Theta) = a_{k}^{t}. \qquad (18)$$

The target t defines a one-shot multinomial or categorical random variable. So the (output-layer) likelihood p_class(t|x, Θ) of a classifier network is the multinomial product

$$p_{\mathrm{class}}(\mathbf{t}\mid\mathbf{x},\Theta) = \prod_{k=1}^{K}\big(a_{k}^{t}\big)^{t_{k}}. \qquad (19)$$

The log-likelihood L_class is the negative cross entropy:

$$L_{\mathrm{class}} = \log p_{\mathrm{class}}(\mathbf{t}\mid\mathbf{x},\Theta) \qquad (20)$$

$$= \log\prod_{k=1}^{K}\big(a_{k}^{t}\big)^{t_{k}} \qquad (21)$$

$$= \sum_{k=1}^{K} t_{k}\log\big(a_{k}^{t}\big). \qquad (22)$$

Maximizing the log-likelihood L_class with respect to Θ minimizes the output cross entropy E_class:

$$E_{\mathrm{class}} = -\sum_{k=1}^{K} t_{k}\log\big(a_{k}^{t}\big). \qquad (23)$$

A logistic network N maps the input space R^I to the unit hypercube [0, 1]^K if K is the number of output logistic neurons. Bipolar logistic output neurons map inputs to the bipolar hypercube [−1, 1]^K by shifting and scaling the logistic activation functions. The target vector t consists of K independent Bernoulli variables. The kth output target t_k equals 1 with success probability a_k^t, just as an ordinary coin flip gives heads with some fixed probability:

$$p_{\mathrm{log}}(t_{k}\mid\mathbf{x},\Theta) = \big(a_{k}^{t}\big)^{t_{k}}\big(1-a_{k}^{t}\big)^{1-t_{k}}. \qquad (24)$$

The likelihood p_log(t|x, Θ) of a logistic network equals the product of the K independent Bernoulli probabilities. So it equals the probability of flipping K independent coins:

$$p_{\mathrm{log}}(\mathbf{t}\mid\mathbf{x},\Theta) = \prod_{k=1}^{K}\big(a_{k}^{t}\big)^{t_{k}}\big(1-a_{k}^{t}\big)^{1-t_{k}}. \qquad (25)$$

Then the log-likelihood L_log of a logistic network equals the negative of the double cross entropy:

$$L_{\mathrm{log}} = \log p_{\mathrm{log}}(\mathbf{t}\mid\mathbf{x},\Theta) \qquad (26)$$

$$= \sum_{k=1}^{K} t_{k}\log\big(a_{k}^{t}\big) + \sum_{k=1}^{K}(1-t_{k})\log\big(1-a_{k}^{t}\big). \qquad (27)$$

So maximizing the log-likelihood L_log minimizes the error function E_log or double cross entropy:

$$E_{\mathrm{log}} = -\sum_{k=1}^{K} t_{k}\log\big(a_{k}^{t}\big) - \sum_{k=1}^{K}(1-t_{k})\log\big(1-a_{k}^{t}\big). \qquad (28)$$
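The three performance measures (17), (23), and (28) reduce to a few lines of code; the target and activation vectors below are illustrative values, not data from the paper.

```python
import numpy as np

def squared_error(t, a):          # regression network, Eq. (17)
    return 0.5 * np.sum((t - a) ** 2)

def cross_entropy(t, a):          # softmax classifier, Eq. (23)
    return -np.sum(t * np.log(a))

def double_cross_entropy(t, a):   # logistic network, Eq. (28)
    return -np.sum(t * np.log(a) + (1 - t) * np.log(1 - a))

t_reg   = np.array([0.5, -1.2])              # regression target
a_ident = np.array([0.3, -1.0])              # identity-output activations
t_class = np.array([0.0, 1.0, 0.0])          # 1-in-K target
a_soft  = np.array([0.2, 0.7, 0.1])          # softmax activations
t_logis = np.array([1.0, 0.0, 1.0])          # Bernoulli targets
a_logis = np.array([0.9, 0.2, 0.6])          # logistic activations

print(squared_error(t_reg, a_ident),
      cross_entropy(t_class, a_soft),
      double_cross_entropy(t_logis, a_logis))
```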

The BP algorithm uses gradient descent or its variants to update a network's weights and other parameters. The BP learning laws remain invariant for all three networks (regression, classification, and logistic) because taking the gradient of their network likelihood gives the same partial derivatives for updating their weights (Kosko et al., 2019).

Suppose that the neural network has just one hidden layer. All results apply to any finite number of hidden layers. The weight matrix W connects the input layer to the hidden layer. The weight matrix U connects the hidden layer to the output layer. The learning laws are the same for L_reg, L_class, and L_log (Kosko et al., 2019). Then the chain rule expands the partial derivative of the log-likelihood L with respect to the synaptic weight u_kj as

$$\frac{\partial L}{\partial u_{kj}} = \frac{\partial L}{\partial o_{k}^{t}}\,\frac{\partial o_{k}^{t}}{\partial u_{kj}} \qquad (29)$$

$$= \frac{\partial L}{\partial a_{k}^{t}}\,\frac{\partial a_{k}^{t}}{\partial o_{k}^{t}}\,\frac{\partial o_{k}^{t}}{\partial u_{kj}} \qquad (30)$$

$$= \big(t_{k}-a_{k}^{t}\big)\,a_{j}^{h} \qquad (31)$$

if weight u_kj connects the kth output neuron to the jth hidden neuron. This holds because t_k = 1 and t_j = 0 for j ≠ k in 1-in-K encoding. The term o_k^t is the input or argument of the kth output neuron. The term a_k^t is the activation of the kth output neuron. The term a_j^h is the activation of the jth hidden neuron.

The partial derivative of the log-likelihood L with respect to the synaptic weight w_ji expands as

$$\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial a_{j}^{h}}\,\frac{\partial a_{j}^{h}}{\partial o_{j}^{h}}\,\frac{\partial o_{j}^{h}}{\partial w_{ji}} \qquad (32)$$

$$= \bigg(\sum_{k=1}^{K}\frac{\partial L}{\partial o_{k}^{t}}\,\frac{\partial o_{k}^{t}}{\partial a_{j}^{h}}\bigg)\frac{\partial a_{j}^{h}}{\partial o_{j}^{h}}\,\frac{\partial o_{j}^{h}}{\partial w_{ji}} \qquad (33)$$

$$= \bigg(\sum_{k=1}^{K}\big(t_{k}-a_{k}^{t}\big)u_{kj}\bigg)a_{j}^{h}\big(1-a_{j}^{h}\big)x_{i} \qquad (34)$$

if weight w_ji connects the jth hidden neuron to the ith input neuron. The term o_j^h is the input of the jth hidden neuron. The term a_j^h is the logistic activation of the jth hidden neuron. The activation x_i is the (identity) activation of the ith input neuron.

2.2. Noisy expectation–maximization

The Noisy Expectation–Maximization (NEM) theorem states the general sufficient condition for a noise benefit in the EM algorithm. The theorem shows that noise injection can only shorten the EM algorithm's walk up the nearest hill of log-likelihood on average if the noise satisfies the NEM positivity condition. It holds for additive or multiplicative or any other type of measurable noise injection (Osoba & Kosko, 2016). The noise is just the noise that makes the current signal more probable.

The basic NEM theorem for additive noise states that a noise benefit holds on average at each iteration i if the following positivity condition holds (Osoba, Mitaim, & Kosko, 2013):

$$\mathbb{E}_{\mathbf{x},\mathbf{h},\mathbf{n}\mid\Theta^{*}}\bigg[\log\frac{p(\mathbf{x}+\mathbf{n},\mathbf{h}\mid\Theta^{(i)})}{p(\mathbf{x},\mathbf{h}\mid\Theta^{(i)})}\bigg] \geq 0. \qquad (35)$$

Then the EM noise benefit

$$Q(\Theta^{(i)}\mid\Theta^{*}) \leq Q_{N}(\Theta^{(i)}\mid\Theta^{*}) \qquad (36)$$

holds on average at iteration i:

$$\mathbb{E}_{\mathbf{x},\mathbf{N}\mid\Theta^{(i)}}\big[Q(\Theta^{*}\mid\Theta^{*}) - Q_{N}(\Theta^{(i)}\mid\Theta^{*})\big] \leq \mathbb{E}_{\mathbf{x}\mid\Theta^{(i)}}\big[Q(\Theta^{*}\mid\Theta^{*}) - Q(\Theta^{(i)}\mid\Theta^{*})\big] \qquad (37)$$

where Θ* denotes the maximum-likelihood vector of parameters, Q_N(Θ^(i)|Θ*) = E_{h|y,n,Θ*}[log p(x + n, h|Θ^(i))], and Q(Θ^(i)|Θ*) = E_{h|y,Θ*}[log p(x, h|Θ^(i))].

The intuition behind the NEM sufficient condition is that NEM noise is just that added noise n that makes the current signal x more probable on average: p(x + n|Θ) ⩾ p(x|Θ). Rearranging and taking logarithms and expectations gives the NEM sufficient condition (35). The noise-boosted likelihood is closer on average at each iteration to the maximum-likelihood outcome than is the noiseless likelihood (Adigun & Kosko, 2018; Osoba & Kosko, 2016). Kullback–Leibler divergence measures the closeness.

2.3. NEM noise benefits in backpropagation

We now present the basic NEM theorems for backpropagation. These results inject noise only into the output layer of a neural network. They extend to noise injection in the hidden layers as well (Kosko et al., 2019). The first three noise-boost theorems appear in Kosko et al. (2019). A version of the classification noise-boost theorem also appeared in Audhkhasi et al. (2016). We restate and briefly reprove these basic noise results. We then extend these noise-boost results to the new and more general case of bidirectional backpropagation. We then further extend these results to the main types of adversarial networks.

The additive NEM-noise sufficient condition for BP training is the average positivity inequality

$$\mathbb{E}_{\mathbf{t},\mathbf{h},\mathbf{n}\mid\mathbf{x},\Theta^{*}}\bigg[\log\frac{p(\mathbf{t}+\mathbf{n},\mathbf{h}\mid\mathbf{x},\Theta^{(i)})}{p(\mathbf{t},\mathbf{h}\mid\mathbf{x},\Theta^{(i)})}\bigg] \geq 0 \qquad (38)$$

at each training iteration i where t is the output target, x is the input vector, and n is the noise.

The next results instantiate the expectation in (38) based on the network function and the demands of BP invariance. Then we extend these unidirectional results to the bidirectional case to obtain the corresponding NEM noise boosts for B-BP.

We start with the BP NEM-noise boost in a regression network. BP invariance requires that the output neurons have identity activations and that the network's output performance measure is squared error. The squared-error constraint corresponds to an output likelihood structure that is a multidimensional normal probability density.

Theorem 1 (NEM Noise Benefit for a Regression Network (Kosko et al., 2019)). A backpropagation NEM noise benefit holds for a regression network at iteration i with Gaussian target vector t ∼ N(a^t, I) if the injected noise n satisfies the inequality

$$\mathbb{E}_{\mathbf{t},\mathbf{h},\mathbf{n}\mid\mathbf{x},\Theta^{*}}\big[\mathbf{n}^{T}\big(2\mathbf{t}-2\mathbf{a}^{t}+\mathbf{n}\big)\big] \leq 0 \qquad (39)$$

where a^t is the output activation vector.

Proof. A NEM noise benefit holds on average if the followingpositivity condition from (38) holds at training iteration i:

Et,h,n|x,Θ∗

[log

p(t+ n,h|x, Θ (i)

)p(t,h|x, Θ (i)

) ]⩾ 0.

Rewrite the ratio of probability density functions as

p(t+ n,h|x, Θ (i)

)p(t,h|x, Θ (i)

) =p(t+ n|h, x, Θ (i)

)p(h|x, Θ (i)

)p(t|h, x, Θ (i)

)p(h|x, Θ (i)

) (40)

=p(t+ n|h, x, Θ (i)

)p(t|h, x, Θ (i)

) . (41)

Taking logarithms gives

logp(t+ n,h|x, Θ (i)

)p(t,h|x, Θ (i)

) = logp(t+ n|h, x, Θ (i)

)p(t|h, x, Θ (i)

) (42)

= log p(t+ n|h, x, Θ (i))

− log p(t|h, x, Θ (i)) (43)

= log[2π−

K2 exp

{−∥t+ n− at∥2

2

}]

Page 10: Noise-boosted bidirectional backpropagation and ...sipi.usc.edu/~kosko/N-BBP-Neural-Networks-18November2019.pdf · Please cite this article as: O.Adigun and B.Kosko, Noise-boosted

Please cite this article as: O. Adigun and B. Kosko, Noise-boosted bidirectional backpropagation and adversarial learning. Neural Networks (2019),https://doi.org/10.1016/j.neunet.2019.09.016.

10 O. Adigun and B. Kosko / Neural Networks xxx (xxxx) xxx

− log[2π−

K2 exp

{−∥t− at∥2

2

}](44)

=∥t− at∥2

2−∥t+ n− at∥2

2(45)

=12

(∥t∥2 + ∥at∥2 − 2tTat − ∥t∥2 − ∥at∥2

− ∥n∥2 − 2nT t+ 2tTat + 2nTat)

(46)

=nT

2

(2at − 2t− n

). (47)

So the NEM sufficient condition becomes

Et,h,n|x,Θ∗[nT

2

(2at − 2t− n

)]⩾ 0. (48)

The inequality has the equivalent form

Et,h,n|x,Θ∗

[nT (2t− 2at + n

)]⩽ 0. ■ (49)
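The expectation in (39) suggests the per-sample screening rule that the training algorithms below use: draw candidate noise and inject it only when the realized noise satisfies the inequality. The following NumPy sketch is our own minimal illustration of that screening step; the function name, the zero-mean Gaussian noise source, and the scale sigma are assumptions and not part of the theorem.

import numpy as np

def nem_noise_regression(t, a_t, sigma=0.1, rng=None):
    # Draw candidate zero-mean Gaussian noise and keep it only if it
    # satisfies the per-sample form of (39): n.(2t - 2a_t + n) <= 0.
    rng = rng if rng is not None else np.random.default_rng()
    n = rng.normal(0.0, sigma, size=t.shape)
    if np.dot(n, 2.0 * t - 2.0 * a_t + n) <= 0.0:
        return n                      # NEM noise: inject it
    return np.zeros_like(t)           # otherwise inject no noise

# hypothetical target t and output activation a_t of a 4-output regression net
t = np.array([1.0, -0.5, 0.3, 2.0])
a_t = np.array([0.8, -0.2, 0.4, 1.5])
noisy_target = t + nem_noise_regression(t, a_t)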

The next BP NEM-noise boost applies to classifier networks with softmax output neurons. BP invariance requires that the output performance measure is the network cross entropy with implied 1-in-K encoding of labeled patterns as unit bit vectors. The cross-entropy constraint corresponds to an output likelihood structure of a categorical or one-shot multinomial probability density.

Theorem 2 (NEM Noise Benefit for a Classification Network (Audhkhasi et al., 2016)). A backpropagation NEM noise benefit with maximum-likelihood training (BP) holds for a softmax classification network for a multinomial target vector t ∼ Multinoulli(p_k = a_{tk}) for k ∈ {1, 2, . . . , K} at iteration i if a hyperplane inequality holds on average:

E_{t,h,n|x,Θ*}[nᵀ log a_t] ⩾ 0   (50)

where a_t is the output activation vector and where p_k is the probability that a given pattern belongs to the kth class.

Proof. A NEM noise benefit holds on average at training iteration i if

E_{t,h,n|x,Θ*}[ log( p(t + n, h|x, Θ^(i)) / p(t, h|x, Θ^(i)) ) ] ⩾ 0.

Rewrite the ratio of the probability density functions as in (40)–(43). Then the softmax/multinomial structure gives the NEM-noise hyperplane condition:

log( p(t + n, h|x, Θ^(i)) / p(t, h|x, Θ^(i)) ) = log( p(t + n|h, x, Θ^(i)) / p(t|h, x, Θ^(i)) )   (51)
= log p(t + n|h, x, Θ^(i)) − log p(t|h, x, Θ^(i))   (52)
= log ∏_{k=1}^{K} (a_{tk})^{t_k + n_k} − log ∏_{k=1}^{K} (a_{tk})^{t_k}   (53)
= Σ_{k=1}^{K} (t_k + n_k) log(a_{tk}) − Σ_{k=1}^{K} t_k log(a_{tk})   (54)
= Σ_{k=1}^{K} n_k log(a_{tk})   (55)
= nᵀ log a_t.   (56)

Taking a NEM-based expectation gives the final average form of the hyperplane inequality:

E_{t,h,n|x,Θ*}[nᵀ log a_t] ⩾ 0. ■   (57)
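The hyperplane condition (50) also admits a per-sample screening rule. The short NumPy sketch below is our own illustration; the Gaussian candidate noise and the scale sigma are assumptions.

import numpy as np

def nem_noise_classifier(a_t, sigma=0.05, rng=None):
    # Keep candidate noise only if it lies on the positive side of the
    # NEM hyperplane in (50): n.log(a_t) >= 0.
    rng = rng if rng is not None else np.random.default_rng()
    n = rng.normal(0.0, sigma, size=a_t.shape)
    return n if np.dot(n, np.log(a_t)) >= 0.0 else np.zeros_like(a_t)

# hypothetical softmax activations and 1-in-K target of a 3-class classifier
a_t = np.array([0.7, 0.2, 0.1])
t = np.array([1.0, 0.0, 0.0])
noisy_target = t + nem_noise_classifier(a_t)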

The next BP NEM-noise boost applies to logistic networks. These networks are important for bidirectional processing as in the threshold networks of Figs. 1(a) and (b). BP invariance requires that a layer of logistic neurons have a performance measure that we call double cross entropy (28). It corresponds to a likelihood structure that is a product of Bernoulli probabilities. The output N(x) of a logistic network is not a discrete probability density as in the case of the softmax classifier. The output N(x) is instead a discrete fuzzy set (Kosko, 2018).

Theorem 3 (NEM Noise Benefit for a Logistic Network (Kosko et al., 2019)). A backpropagation NEM noise benefit holds for a logistic network with K independent Bernoulli target neurons t_k ∼ Bernoulli(p_k = a_{tk}) at iteration i if the following inequality holds:

E_{t,h,n|x,Θ*}[nᵀ(log(a_t) − log(1 − a_t))] ⩾ 0   (58)

where a_t is the output activation and where p_k is the success probability that the kth output neuron equals 1.

Proof. A NEM noise benefit holds on average if the following positivity condition holds at each training iteration i:

E_{t,h,n|x,Θ*}[ log( p(t + n, h|x, Θ^(i)) / p(t, h|x, Θ^(i)) ) ] ⩾ 0.

Then rewriting the ratio of the probability density functions as in (40)–(43) and using the product-Bernoulli (double-cross-entropy) structure gives a noise hyperplane condition:

log( p(t + n, h|x, Θ^(i)) / p(t, h|x, Θ^(i)) ) = log( p(t + n|h, x, Θ^(i)) / p(t|h, x, Θ^(i)) )   (59)
= log p(t + n|h, x, Θ^(i)) − log p(t|h, x, Θ^(i))   (60)
= log ∏_{k=1}^{K} (a_{tk})^{t_k + n_k} (1 − a_{tk})^{1 − t_k − n_k} − log ∏_{k=1}^{K} (a_{tk})^{t_k} (1 − a_{tk})^{1 − t_k}   (61)
= Σ_{k=1}^{K} [(t_k + n_k) log(a_{tk}) + (1 − t_k − n_k) log(1 − a_{tk})] − Σ_{k=1}^{K} [t_k log(a_{tk}) + (1 − t_k) log(1 − a_{tk})]   (62)
= Σ_{k=1}^{K} [n_k log(a_{tk}) − n_k log(1 − a_{tk})]   (63)
= nᵀ(log a_t − log(1 − a_t)).   (64)

Taking NEM expectations gives the final form as

E_{t,n|x,Θ*}[nᵀ(log a_t − log(1 − a_t))] ⩾ 0. ■

Similar unidirectional NEM-noise benefits hold for recurrent backpropagation when training recurrent neural networks with time-varying patterns (Adigun & Kosko, 2017).

The next section extends the above unidirectional-BP results to the more general case of bidirectionality. This culminates in the general Theorem 4 that shows how we can derive families of bidirectional NEM-noise benefits from network architectures that obey BP invariance. These noise benefits include B-BP versions of the unidirectional Theorems 1–3 among many others. We omit their bidirectional proofs for simplicity because they directly apply Theorem 4 to the proofs above.


3. Bidirectional backpropagation (B-BP)

The B-BP algorithm trains a multilayered neural network with backpropagation in both directions over the same web of synaptic connections (Adigun & Kosko, 2016, 2019). Algorithm 1 shows the steps in the B-BP algorithm in the more general case if we inject NEM noise into the input and output layers of a classifier network. Similar steps can inject noise in hidden layers and in all the layers of bidirectional regression networks.

B-BP is also a form of maximum likelihood estimation. The forward pass maps the input x to the output N(x). The backward pass maps the output y back through the network to N^T(y) given the current network parameters Θ. The algorithm jointly maximizes the forward likelihood p_f(y|x, Θ) and the backward likelihood p_b(x|y, Θ). So B-BP finds the weight or parameter vector Θ* that maximizes the joint network likelihood:

Θ* = argmax_Θ p_f(y|x, Θ) p_b(x|y, Θ).   (65)

This likelihood structure holds more generally in terms of the forward and backward layer likelihoods per (10).

This joint optimization is the same as the joint optimization of the directional (and layer) log-likelihoods:

Θ* = argmax_Θ [log p_f(y|x, Θ) + log p_b(x|y, Θ)]   (66)

because the logarithm is monotone increasing. The B-BP algorithm also uses gradient descent or any of its variants to iteratively update the parameters.
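The joint objective in (65)–(66) runs the forward and backward passes over the same synaptic weights. The PyTorch sketch below is a minimal illustration of that weight sharing for a one-hidden-layer classifier-regression network: the backward pass reuses the transposes of the forward weight matrices. The class name, layer sizes, and ReLU hidden activations are our own illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn.functional as F

class BidirectionalNet(torch.nn.Module):
    # One hidden layer; forward pass N(x) and backward pass N^T(y)
    # share the same two weight matrices W1 and W2.
    def __init__(self, n_in=3072, n_hidden=512, n_out=10):
        super().__init__()
        self.W1 = torch.nn.Parameter(0.01 * torch.randn(n_hidden, n_in))
        self.W2 = torch.nn.Parameter(0.01 * torch.randn(n_out, n_hidden))

    def forward(self, x):                      # N: x -> softmax output
        h = torch.relu(F.linear(x, self.W1))
        return torch.softmax(F.linear(h, self.W2), dim=-1)

    def backward_pass(self, y):                # N^T: y -> identity input layer
        h = torch.relu(F.linear(y, self.W2.t()))
        return F.linear(h, self.W1.t())        # identity activations

A deeper bidirectional network would simply chain more such shared layers in the forward direction and their transposes in the reverse direction.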

3.1. B-BP likelihood functions

We next present four different B-BP structures based on the corresponding likelihood structure of the network. The four structures are double regression, double classification, double logistic, and the mixed case of classification and regression. Their function and network structure dictate their log-likelihoods in accord with BP invariance. BP invariance ensures that all the different network structures still use the same BP updates on a given directional training pass.

3.1.1. B-BP double regression
A bidirectional network is a double regression network if it performs regression in both directions. So the input and output layers both use identity activations. The output y is a Gaussian random vector with mean a_y and with identity or white covariance matrix I. The input x is also a Gaussian random vector with mean a_x and identity or white covariance matrix I:

p_f(y|x, Θ) = (1/(2π)^(K/2)) exp{−∥y − a_y∥²/2}   (67)

and

p_b(x|y, Θ) = (1/(2π)^(I/2)) exp{−∥x − a_x∥²/2}   (68)

where I is the dimension of the input layer and K is the dimension of the output layer. The forward log-likelihood L_f(Θ) and backward log-likelihood L_b(Θ) are

L_f(Θ) = log p(y|x, Θ) = log(2π)^(−K/2) − ∥y − a_y∥²/2   (69)

and

L_b(Θ) = log p(x|y, Θ) = log(2π)^(−I/2) − ∥x − a_x∥²/2.   (70)

The error functions are the squared error E_f(Θ) for the forward pass and the squared error E_b(Θ) for the backward pass. B-BP minimizes the joint error E(Θ):

E(Θ) = E_f(Θ) + E_b(Θ)   (71)
= (1/2) Σ_{k=1}^{K} (y_k − a_{yk})² + (1/2) Σ_{i=1}^{I} (x_i − a_{xi})².   (72)

Then Adigun and Kosko (2019) give the detailed B-BP update rules for double regression.

3.1.2. B-BP double classification
A double classifier is a neural network that acts as a classifier in both the forward and backward directions. So both the input and output layers use softmax or Gibbs activations.

The output y with activation a_{yk} defines the multinomial probability p_f(y_k = 1|x, Θ). The input x with a_{xi} defines the dual multinomial probability p_b(x_i = 1|y, Θ):

p_f(y|x, Θ) = ∏_{k=1}^{K} (a_{yk})^{y_k}   (73)

and

p_b(x|y, Θ) = ∏_{i=1}^{I} (a_{xi})^{x_i}.   (74)

Then the log-likelihoods L_f and L_b are negative cross entropies:

L_f(Θ) = log p_f(y|x, Θ) = Σ_{k=1}^{K} y_k log(a_{yk})   (75)

and

L_b(Θ) = log p_b(x|y, Θ) = Σ_{i=1}^{I} x_i log(a_{xi}).   (76)

The forward error E_f(Θ) is the forward cross entropy. The backward error E_b(Θ) is likewise the backward cross entropy. B-BP for double classification minimizes this joint error E(Θ):

E(Θ) = E_f(Θ) + E_b(Θ)   (77)
= −Σ_{k=1}^{K} y_k log(a_{yk}) − Σ_{i=1}^{I} x_i log(a_{xi}).   (78)

Adigun and Kosko (2019) give the update rules for B-BP with double classification. Training in each direction applies ordinary BP updates because of BP invariance. Training occurs in one direction at a time.

3.1.3. B-BP double logistic networks
A double logistic network has logistic neurons at both the input and output layers. Figs. 1(a) and (b) show such double logistic networks where the logistic sigmoids are so steep as to give threshold neurons.

The probability of the logistic-vector output y is a product of K independent Bernoulli probabilities. The kth output activation a_{yk} is the probability p_f(y_k = 1|x, Θ). The probability of the logistic-vector input x is a product of I independent Bernoulli probabilities. So the ith input activation a_{xi} is the probability p_b(x_i = 1|y, Θ). Logistic neurons can also map to the bipolar interval [−1, 1] by simple scaling and translation of the usual binary logistic function. We trained the bidirectional threshold networks in Figs. 1(a) and (b) as double-logistic networks with steep logistic sigmoid activations. Then we replaced the steep logistics after B-BP training with threshold functions.


The forward likelihood p_f(y|x, Θ) and the backward likelihood p_b(x|y, Θ) have the product-Bernoulli form

p_f(y|x, Θ) = ∏_{k=1}^{K} (a_{yk})^{y_k} (1 − a_{yk})^{1−y_k}   (79)

and

p_b(x|y, Θ) = ∏_{i=1}^{I} (a_{xi})^{x_i} (1 − a_{xi})^{1−x_i}.   (80)

The log-likelihoods L_f(Θ) and L_b(Θ) define double cross entropies:

L_f(Θ) = log p_f(y|x, Θ)   (81)
= Σ_{k=1}^{K} [y_k log(a_{yk}) + (1 − y_k) log(1 − a_{yk})]   (82)

and

L_b(Θ) = log p_b(x|y, Θ)   (83)
= Σ_{i=1}^{I} [x_i log(a_{xi}) + (1 − x_i) log(1 − a_{xi})].   (84)

Then the joint error function E(Θ) for a double logistic network is the sum

E(Θ) = E_f(Θ) + E_b(Θ)   (85)
= −Σ_{k=1}^{K} [y_k log(a_{yk}) + (1 − y_k) log(1 − a_{yk})] − Σ_{i=1}^{I} [x_i log(a_{xi}) + (1 − x_i) log(1 − a_{xi})].   (86)

The BP update rules for double-logistic training are the same as for double classification because of BP invariance.

3.1.4. Mixed case: B-BP classification and regression
The mixed bidirectional case occurs when the forward pass is a classifier and the backward pass is a regressor. This is the implied structure of the modern neural classifier used only in the forward direction because it maps identity input neurons to output softmax neurons. The forward pass through the network N estimates the class membership of a given input pattern x. So the K output neurons have softmax or Gibbs activations. The backward pass through N^T estimates the class centroid (in the common least-squares case) given an output class-label unit bit vector or other length-K probability vector y. So the input neurons use identity or linear activations.

The forward likelihood is a multinomial random variable as in the case of double classification. The backward likelihood is a Gaussian random vector as in the case of double regression. BP invariance holds for these likelihoods given the corresponding softmax output neurons and identity input neurons.

The total mixed error E(Θ) sums the forward cross entropy E_f(Θ) and the backward squared error E_b(Θ):

E(Θ) = E_f(Θ) + E_b(Θ)   (87)
= −Σ_{k=1}^{K} y_k log(a_{yk}) + (1/2) Σ_{i=1}^{I} (x_i − a_{xi})².   (88)

The original B-BP sources (Adigun & Kosko, 2016, 2019) give the gradient update rules for B-BP in the mixed case as well as for double regression and double classification. BP invariance greatly simplifies the forward and backward updates because then all network types use the same BP updates.

3.2. B-BP algorithm for classifier networks

We first extend the B-BP algorithm to classifier networks for bidirectional representation. The mapping from Y to X or vice versa is not a well-defined point function in most cases. So the kth unit basis vector y need not map back to a unique value. It maps instead to a pre-image set f^(−1)(y) as discussed above. The reverse pass through N^T gives a point x′ in the input pattern space R^n. Then Adigun and Kosko (2019) show that the regression backward pass over the network N tends to map to the centroid of the pre-image in a classification-regression bidirectional network.

The backward pass of a classifier network maps the class labels or output K-length probability vectors back to the input pattern space. The output space Y is the (K−1)-simplex S^(K−1) of K-length probability vectors for a classifier network. This holds because the network maps vector pattern inputs to K softmax output neurons and uses 1-in-K encoding.

So the backward pass for bidirectional representation maps any given y to a centroidal measure of its class. Let the mth sample (x(m), t(m)) be a pair of sample input and target vectors that belongs to the kth class C_k. B-BP trains the network along the forward pass N(x(m)) with t(m) as the target. The forward error E_f is the cross entropy between N(x(m)) and t(m).

Backward-pass training can in principle use any x′. The simplest case uses the backward-emitted vector x′ = N^T(y) after the input x has emitted the output vector y in the forward direction: y = N(x). Simulations show that somewhat better performance tends to result if the backward pass replaces y with the supervised class label k or the unit vector e_k if the stimulating input vector x belongs to the kth input class.

The backward pass can also use the kth class population centroid c_k of input pattern class C_k or its sample-mean estimate c̄_k:

c̄_k = (1/N_k) Σ_{x(m)∈C_k} x(m)   (89)

where N_k is the number of input sample patterns x(m) from class C_k. Then the weak and strong laws of large numbers state that the sample centroid c̄_k converges to the population or true class centroid c_k in probability or with probability one if the sample vectors are random samples (independent and identically distributed) with respective finite covariances or finite means (Hogg, McKean, & Craig, 2012). This ergodic result still holds in the mean-squared sense when the sample vectors are correlated if the random vectors are wide-sense stationary and if the finite scalar covariances are asymptotically zero (Gubner, 2006). This correlated result also holds in probability since convergence in mean-square implies convergence in probability.

Using the centroid estimate c̄_k makes sense in the backward direction because locally the population centroid c_k minimizes the squared error of regression. It likewise minimizes the total mean-squared error of vector quantization (Kosko, 1991b).

The K-means clustering algorithm is the standard way to estimate K centroids from training data (Jain, 2010). This algorithm is a form of competitive learning and sometimes called a self-organizing map (Kohonen, 1990). K-means clustering enjoys a NEM noise-boost because the algorithm is itself a special case of the EM algorithm for tuning a Gaussian mixture model if the class membership functions are binary indicator functions (Osoba & Kosko, 2013).

The projection x̄_k of the kth class sample centroid c̄_k onto the training samples in C_k obeys

x̄_k = argmin_{x(m)∈C_k} ∥x(m) − c̄_k∥²   (90)

where x̄_k ∈ C_k. The sample centroid need not lie in the pattern class C_k. We can compute the backward error directly if it does by taking the difference between the backward pass x′ and the sample class centroid. We otherwise take the difference between x′ and that training vector in C_k that lies closest to the sample class centroid. Then the backward error E_b equals the squared error between x′ = N^T(y(m)) and the projection x̄_k.
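The centroid estimate (89) and its projection (90) are simple to compute from the labeled training set. The NumPy sketch below is a minimal illustration under the assumption that the samples are flattened vectors; the function name is our own.

import numpy as np

def centroids_and_projections(X, labels, K):
    # X has shape (N, I); labels are integer class indices 0..K-1.
    centroids, projections = [], []
    for k in range(K):
        Xk = X[labels == k]                        # samples of class C_k
        c_k = Xk.mean(axis=0)                      # sample centroid as in (89)
        dists = np.sum((Xk - c_k) ** 2, axis=1)    # squared distances to c_k
        projections.append(Xk[np.argmin(dists)])   # nearest class sample as in (90)
        centroids.append(c_k)
    return np.stack(centroids), np.stack(projections)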

4. NEM noise-boosted B-BP learning

This section shows how NEM noise boosts the B-BP algorithm. B-BP seeks to jointly maximize the log-likelihood of the network's forward pass and its backward pass. So B-BP is also a form of maximum-likelihood estimation.

The above BP-as-GEM theorem states that the gradient of the network log-likelihood equals the gradient of the EM surrogate likelihood Q(Θ|Θ^(n)): ∇ log p = ∇Q. This gradient equality gives in the bidirectional case

∇_Θ log p(y|x, Θ) + ∇_Θ log p(x|y, Θ) = ∇_Θ Q_f(Θ|Θ^(i)) + ∇_Θ Q_b(Θ|Θ^(i))   (91)

if Q_f(Θ|Θ^(i)) and Q_b(Θ|Θ^(i)) are the surrogate likelihoods for the respective forward pass and backward pass. So B-BP is also a special case of generalized EM. These results also hold at the layer level so long as the layer likelihood obeys BP invariance in each direction.

The above gradient result also holds in general for any number M of directions and for any neural-network structure so long as the structure maintains BP invariance in each direction. We consider next the special case of classification in the forward direction because it is by far the most common neural network in practice.

4.1. NEM-boosted B-BP for classifier networks

B-BP is a special case of GEM in each direction. So the injection of noise that satisfies the NEM positivity condition improves the average performance of B-BP. The noise can be additive or multiplicative or can have any other measurable form. The beneficial NEM noise is just that noise n_x and n_y that makes the data more probable on average.

The forward likelihood p(y|x, Θ) for a classifier network is the one-shot multinomial probability density. So the classifier uses softmax output neurons and a cross-entropy performance measure. The backward likelihood p(x|y, Θ) is an I-dimensional vector Gaussian density. So the implied backward regression network uses identity input neurons (output neurons in the backward direction) and a squared-error performance measure. Then BP invariance holds: both passes use the same BP updates.

The NEM positivity condition holds for B-BP training of a classifier network with noise injection n_x in the input neurons and with n_y in the output neurons if the following NEM positivity conditions hold:

E_{y,h,n|x,Θ*}[n_yᵀ log a_y] ⩾ 0   (92)

and

E_{x,h,n|y,Θ*}[n_xᵀ(2x − 2a_x + n_x)] ⩽ 0   (93)

if x is the sample class centroid, a_y is the forward pass N(x), and a_x is the backward pass N^T(y) through the classifier network. Injecting the noise pair (n_x, n_y) that satisfies this condition at the input and output layers produces the NEM noise benefit in the B-BP training of the classifier (classifier-regression) network.

Data: {x(i), y(i)}_{i=1}^{N}, batch size M, learning rate α, number of epochs L, number of iterations T, annealing factor η, noise variance σ, and other hyperparameters for optimization.
Result: Trained weight Θ.
Compute: The class sample centroids c̄_k for all the K classes:
    c̄_k = (1/N_k) Σ_{j=1}^{N_k} x_k(j)
Compute: The projection x̄_k of class centroid c̄_k onto the class sample set:
    x̄_k = argmin_{x(i)∈C_k} ∥x(i) − c̄_k∥²
while epoch l : 1 → L do
    while iteration t : 1 → T do
        • Select a batch of M samples {x(m), y(m)}_{m=1}^{M}
        • Forward pass: Compute the K-dimensional output softmax activation vector a_y(m):
            a_y(m) = N(x(m))
        • Backward pass: Compute the I-dimensional input identity activation vector a_x(m):
            a_x(m) = N^T(y(m))
        • Compute the projection x̄(m) for sample x(m):
            x̄(m) = argmin_{x̄_k} ∥x̄_k − x(m)∥²
        • Generate noise n_y(m) for the output layer:
            if n_y(m)ᵀ log a_y(m) ⩾ 0 then
                y(m) ← y(m) + n_y(m)
            end
        • Generate noise n_x(m) for the input layer:
            if n_x(m)ᵀ(2x̄(m) − 2a_x(m) + n_x(m)) ⩽ 0 then
                x̄(m) ← x̄(m) + n_x(m)
            end
        • Compute the cross entropy E_f(Θ):
            E_f(Θ) = −(1/M) Σ_{m=1}^{M} y(m)ᵀ log a_y(m)
        • Compute the squared error E_b(Θ):
            E_b(Θ) = (1/M) Σ_{m=1}^{M} ∥a_x(m) − x̄(m)∥²
        • Back-propagate the error functions and update the weights:
            Θ^(t+1) = Θ^(t) − α∇_Θ E_f(Θ) − α∇_Θ E_b(Θ)
    end
end
Algorithm 1: B-BP algorithm for NEM noise injection into the output and input layers of a classifier neural network trained with bidirectional backpropagation.

Algorithm 1 gives the update rule for the ith epoch of NEM-boosted B-BP training of such a classifier network.
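The PyTorch sketch below illustrates one batch update in the spirit of Algorithm 1 and the positivity conditions (92)–(93). It assumes a model object that exposes the forward pass and a backward pass through shared weights, as in the earlier BidirectionalNet sketch; the function name, the Gaussian candidate noise, and the scale sigma are our own assumptions.

import torch

def bbp_nem_step(model, opt, x, y_onehot, x_proj, sigma=0.05):
    a_y = model(x)                                   # forward pass N(x)
    a_x = model.backward_pass(y_onehot)              # backward pass N^T(y)

    # Screen candidate output noise with the hyperplane condition (92).
    n_y = sigma * torch.randn_like(y_onehot)
    keep_y = (n_y * torch.log(a_y.detach() + 1e-12)).sum(dim=1, keepdim=True) >= 0
    y_noisy = y_onehot + torch.where(keep_y, n_y, torch.zeros_like(n_y))

    # Screen candidate input noise with the quadratic condition (93).
    n_x = sigma * torch.randn_like(x_proj)
    keep_x = (n_x * (2 * x_proj - 2 * a_x.detach() + n_x)).sum(dim=1, keepdim=True) <= 0
    x_noisy = x_proj + torch.where(keep_x, n_x, torch.zeros_like(n_x))

    # Joint error: forward cross entropy plus backward squared error.
    E_f = -(y_noisy * torch.log(a_y + 1e-12)).sum(dim=1).mean()
    E_b = ((a_x - x_noisy) ** 2).sum(dim=1).mean()
    loss = E_f + E_b

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

Each screened noise vector perturbs only the targets (the 1-in-K label vector and the class projection), so the update still reduces to ordinary BP in each direction.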


4.2. B-BP NEM theorem

We now summarize the above noise-boosting results as the B-BP NEM theorem. A joint NEM noise benefit results if we inject noise that satisfies the average positivity condition on both the forward and backward passes of B-BP. This includes B-BP versions of the unidirectional Theorems 1–3 above as special cases.

We first define the relevant surrogate likelihood functions for generalized EM. The surrogate log-likelihood Q_f(Θ|Θ*) for the forward pass of B-BP without noise injection is

Q_f(Θ|Θ*) = E_{h|y,x,Θ*}[log p(y, h|x, Θ)].   (94)

Then the surrogate log-likelihood Q_f^N(Θ|Θ*) for the forward pass of B-BP with noise injection is

Q_f^N(Θ|Θ*) = E_{h|y,n_y,x,Θ*}[log p(y + n_y, h|x, Θ)]   (95)

if n_y is the injected noise into the output layer. The surrogate log-likelihood Q_b(Θ|Θ*) for the backward pass of B-BP without noise injection is

Q_b(Θ|Θ*) = E_{h|y,x,Θ*}[log p(x, h|y, Θ)].   (96)

The surrogate log-likelihood Q_b^N(Θ|Θ*) for the backward pass of B-BP with noise injection is

Q_b^N(Θ|Θ*) = E_{h|y,n_x,x,Θ*}[log p(x + n_x, h|y, Θ)]   (97)

if n_x is the injected noise into the input layer. We can now state and prove the general noise-benefit theorem for B-BP.

Theorem 4 (NEM Noise Benefit with B-BP). Suppose the forward positivity condition holds on average at iteration i:

E_{y,h,n_y|Θ*}[ log( p_f(y + n_y, h|x, Θ) / p_f(y, h|x, Θ) ) ] ⩾ 0.   (98)

Suppose further that the backward positivity condition also holds on average at iteration i:

E_{x,h,n_x|Θ*}[ log( p_b(x + n_x, h|y, Θ) / p_b(x, h|y, Θ) ) ] ⩾ 0.   (99)

Then the bidirectional EM noise benefits

Q_f(Θ^(i)|Θ*) ⩽ Q_f^N(Θ^(i)|Θ*)   (100)

and

Q_b(Θ^(i)|Θ*) ⩽ Q_b^N(Θ^(i)|Θ*)   (101)

hold on average at each iteration i:

E_{y,n_y|Θ^(i)}[Q_f(Θ*|Θ*) − Q_f^N(Θ^(i)|Θ*)] ⩽ E_{y|Θ^(i)}[Q_f(Θ*|Θ*) − Q_f(Θ^(i)|Θ*)]   (102)

E_{x,n_x|Θ^(i)}[Q_b(Θ*|Θ*) − Q_b^N(Θ^(i)|Θ*)] ⩽ E_{x|Θ^(i)}[Q_b(Θ*|Θ*) − Q_b(Θ^(i)|Θ*)].   (103)

The joint EM noise benefit also holds on average at iteration i:

E_{y,n_y|Θ^(i)}[Q_f^N(Θ^(i)|Θ*)] + E_{x,n_x|Θ^(i)}[Q_b^N(Θ^(i)|Θ*)] ⩾ E_{y|Θ^(i)}[Q_f(Θ^(i)|Θ*)] + E_{x|Θ^(i)}[Q_b(Θ^(i)|Θ*)].   (104)

Proof. We show that B-BP training maximizes the sum of the surrogate log-likelihoods Q_f(Θ|Θ^(i)) and Q_b(Θ|Θ^(i)).

The bidirectional version of the master gradient equation (9) shows that the bidirectional log-likelihood equals the sum of the surrogate log-likelihoods and entropies. So the forward log-likelihood equals

log p_f(y|x, Θ^(i)) = Q_f(Θ^(i)|Θ*) + H(Θ^(i)|Θ*).   (105)

The backward log-likelihood likewise equals

log p_b(x|y, Θ^(i)) = Q_b(Θ^(i)|Θ*) + H(Θ^(i)|Θ*).   (106)

The unidirectional BP-as-GEM theorem (Audhkhasi et al., 2016) states that the gradient of the network log-likelihood equals the gradient of the EM surrogate likelihood Q(Θ|Θ^(n)). Then the gradient of the sum of the directional likelihoods equals

∇_Θ log p_f(y|x, Θ) + ∇_Θ log p_b(x|y, Θ) = ∇_Θ Q_f(Θ|Θ^(i)) + ∇_Θ Q_b(Θ|Θ^(i))   (107)

where Q_f(Θ|Θ^(i)) and Q_b(Θ|Θ^(i)) are the surrogate log-likelihoods for the respective forward pass and backward pass. This holds because the above entropy inequality implies that both entropy gradients are null. Then the gradient update rule for B-BP at the ith iteration

Θ^(i+1) = Θ^(i) + η ∇_Θ (log p_f(y|x, Θ) + log p_b(x|y, Θ)) |_{Θ=Θ^(i)}   (108)

equals the GEM update equation

Θ^(i+1) = Θ^(i) + η ∇_Θ (Q_f(Θ|Θ^(i)) + Q_b(Θ|Θ^(i))) |_{Θ=Θ^(i)}.   (109)

So B-BP is also a special case of GEM. It then inherits GEM's NEM noise boost in each direction. □

5. Generative adversarial networks and bidirectional training

An adversarial network consists of two or more neural networks that try to trick each other. They use feedback among the neural networks and sometimes within neural networks.

The standard generative adversarial network (GAN) consists of two competing neural networks. One network generates patterns to trick or fool the other network that tries to tell whether a generated pattern is real or fake. The generator network G acts as a type of art forger while the discriminator network D acts as a type of art expert or detective.

GAN training helps the discriminator D get better at detecting real data from fake or synthetic data. It trains the generator G so that D finds it harder to distinguish the real data from G's fake or generated data.

Goodfellow et al. (2014) showed that some forms of this adversarial process reduce to a two-person minimax game between D and G in terms of a value function V(D, G):

min_G max_D V(D, G) = E_{x∼p_r(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (110)

if p_r is the probability density of the real data x and if p_z is the probability density of the latent or hidden variable z.

GAN training alternates between training D and G. It trains D for a constant G and then trains G for a constant D. The probability p_g is the probability density of the generated data G(z) that G produces. Optimizing the value function in (110) is the same as minimizing the Jensen–Shannon (JS) divergence between p_r and p_g (Goodfellow et al., 2014). This is the same as maximizing the product of the likelihood functions p(y = 1|x) and p(y = 0|G(z)) with respect to D and minimizing the likelihood function p(y = 0|G(z)) with respect to G. The likelihood functions are Bernoulli probability densities.


Minimizing the JS divergence can lead to a vanishing gradient as the discriminator D gets better at distinguishing real data from fake data (Arjovsky & Bottou, 2017). The JS divergence can also correlate poorly with the quality of the generated samples (Arjovsky, Chintala, & Bottou, 2017). It also often leads to mode collapse: Training converges to the generation of a single image (Metz, Poole, Pfau, & Sohl-Dickstein, 2016). We use the term vanilla GAN to refer to a GAN trained to minimize the JS divergence. We found that bidirectionality greatly reduced the instances and severity of mode collapse in regular vanilla GANs.

Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN) that minimizes the Earth-Mover distance between p_r and p_g. The Kantorovich–Rubinstein duality (Villani, 2009) shows that we can define the Earth-Mover distance as

W(p_r, p_g) = sup_{∥f∥_L⩽1} E_{x∼p_r}[f(x)] − E_{x∼p_g}[f(x)]   (111)

where the supremum runs over the set of all 1-Lipschitz functions f : χ → R. Then Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville (2017) proposed adding a gradient penalty term that enforces the Lipschitz constraint. This gives the modified value function L(p_r, p_g) as

L(p_r, p_g) = E_{x∼p_r}[D(x)] − E_{z∼p_z}[D(G(z))] + λ E_{x̂∼p_x̂}[(∥∇_x̂ D(x̂)∥_2 − 1)²]   (112)

where x̂ is a convex sum of x ∼ p_r and G(z) with z ∼ p_z. This approach performed better than using weight clipping (Arjovsky & Bottou, 2017) to enforce the Lipschitz constraint. Salimans, Goodfellow, Zaremba, Cheung, and Radford (2016) proposed heuristics such as feature matching, minibatch discrimination, and batch normalization to improve the performance and convergence of GAN training. Algorithm 3 uses (112) for the B-BP training of a WGAN.

5.1. Bidirectional GAN training

Bidirectional GAN training uses a new backward training pass for the discriminator D network. The bidirectional training uses the appropriate joint forward–backward performance measure so that training in one direction does not overwrite training in the reverse direction. Such training does not apply to the usual unidirectional backpropagation training of the generator G as we explain below. The forward pass of the discriminator D tries to distinguish the pattern x from the generated pattern G(z). The backward pass helps the set-inverse map D^T better distinguish the two.

We cast the backward pass as multidimensional regression. Then the backward-pass training becomes maximum-likelihood estimation for a conditional Gaussian probability density. This holds because maximizing the likelihood L_b of the backward pass minimizes the backward error E_b when E_b is squared error (Adigun & Kosko, 2019; Audhkhasi et al., 2016; Osoba & Kosko, 2016).

B-BP did not train the generator G because we found that B-BP did not outperform unidirectional BP for this task. We show briefly that this holds because of a constraint inequality: Training the generator G on the backward pass to maximize a likelihood function L_b^G can only harm the GAN's performance on average.

The argument for this negative result is that B-BP constrains the training of G more than unidirectional BP does. GAN training seeks solutions G* and D* that optimize the value function V(D, G) that measures the similarity between the real samples and the fake or generated samples:

min_G max_D V(D, G).   (113)

Consider the optimal solution D*_G for a given G:

D*_G = argmax_D V(D, G)   (114)

and define C(G) = V(D*_G, G). The best generator G* for the value function obeys

G* = argmin_G C(G).   (115)

The definition of argmin implies that

C(G*) ⩽ C(G)   (116)

for all G. Let G*_b be that constrained G that further partially or totally maximizes the backward-pass likelihood L_b^G. Then

C(G*) ⩽ C(G*_b)   (117)

since the additional B-BP constraint only shrinks the search space for the minimum.

5.2. Bidirectional training of a vanilla GAN

Each bidirectional training iteration of a vanilla GAN involves three steps. The first step trains the discriminator D to maximize the backward-pass likelihood L_b^D. This likelihood measures the mismatch between the real samples and the generated samples on the backward pass. The likelihood was a multivariate Gaussian probability density function.

The backward error E_b^D measures the mismatch between M generated samples {G(z(m))}_{m=1}^{M} and M real or authentic samples {x(m)}_{m=1}^{M}:

E_b^D = Σ_{m=1}^{M} ( ∥G(z(m)) − D^T(a_r(m))∥² + ∥x(m) − D^T(a_g(m))∥² )   (118)

where a_r(m) = D(x(m)) and a_g(m) = D(G(z(m))). The term D^T denotes the backward pass over the discriminator network D.

The second step maximizes the forward-pass likelihood L_f^D for the discriminator D. This maximization is the same as minimizing the forward-pass error E_f^D because the error just equals the negative log-likelihood:

E_f^D = −Σ_{m=1}^{M} ( log D(x(m)) + log(1 − D(G(z(m)))) ).   (119)

The forward-pass error is the cross entropy because the network acts as a binary classifier with a Bernoulli probability density since the samples are either real or fake.

The third step trains the generator network G. The goal is to make G(z) resemble x. So we train G to minimize the error function E^G:

E^G = −Σ_{m=1}^{M} log D(G(z(m))).   (120)

E^G measures the mismatch between the real and generated samples on the forward pass. Algorithm 2 lists the steps for the B-BP training of a vanilla GAN.
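The three errors (118)–(120) are easy to express for one batch. The PyTorch sketch below is our own minimal illustration: D, D_T (a callable assumed to run the backward pass through the discriminator's transposed weights), and G are assumptions, and in practice each error would drive its own optimizer step for D or G.

import torch

def vanilla_gan_bbp_losses(D, D_T, G, x_real, z):
    x_fake = G(z)
    a_r = D(x_real)                         # D's response to real data
    a_g = D(x_fake.detach())                # D's response to fakes (G held fixed)

    # Backward-pass error (118) for the discriminator.
    E_b_D = ((x_fake.detach() - D_T(a_r)) ** 2).sum() + \
            ((x_real - D_T(a_g)) ** 2).sum()

    # Forward-pass error (119) for the discriminator.
    E_f_D = -(torch.log(a_r + 1e-12) + torch.log(1 - a_g + 1e-12)).sum()

    # Generator error (120): make D score the fakes as real.
    E_G = -torch.log(D(x_fake) + 1e-12).sum()
    return E_b_D, E_f_D, E_G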

5.3. Bidirectional training of a Wasserstein GAN

We next show how B-BP can train a Wasserstein GAN. The measure of similarity is now the Earth-Mover distance between p_r and p_g in (112). The algorithm involves three steps for each training iteration. The first two steps train D with B-BP. The third step trains G with unidirectional BP. Algorithm 3 lists the steps for the B-BP training of a Wasserstein GAN.


Data: {x(i)}_{i=1}^{N}, learning rate α, batch size M, starting point for backward training n*, number of epochs L, number of iterations T, and other hyperparameters.
Result: Trained weights Θ_d for discriminator D and Θ_g for generator G.
while epoch l : 1 → L do
    while iteration t : 1 → T do
        • Select M noise samples {z(m)}_{m=1}^{M} from p_z(z);
        • Select M real samples {x(m)}_{m=1}^{M} from p_r(x);
        if n* ⩽ l then
            • Generate noise n_x(m) for the input layer of the discriminator:
                if n_x(m)ᵀ(2x(m) − 2a_x(m) + n_x(m)) ⩽ 0 then
                    x_n(m) ← x(m) + n_x(m)
                else
                    x_n(m) ← x(m)
                end
            • Update Θ_d for the backward-pass training of D:
                Θ_d ⇐ −∇_{Θ_d} Σ_{m=1}^{M} ( ∥G(z(m)) − D^T(a_r(m))∥² + ∥x_n(m) − D^T(a_g(m))∥² )
        end
        • Generate the respective noise n_r(m) and n_g(m) for the real and synthetic data at the output layer of the discriminator D:
            if n_r(m)ᵀ log[ D(x(m)) / (1 − D(x(m))) ] ⩽ 0 then
                n_r(m) ← 0
            end
            if n_g(m)ᵀ log[ D(G(z(m))) / (1 − D(G(z(m)))) ] ⩽ 0 then
                n_g(m) ← 0
            end
        • Update Θ_d for the forward-pass training of D:
            Θ_d ⇐ −∇_{Θ_d} Σ_{m=1}^{M} ( log[ D(x(m)) / (1 − D(G(z(m)))) ] + n_r(m)ᵀ log[ D(x(m)) / (1 − D(x(m))) ] + n_g(m)ᵀ log[ D(G(z(m))) / (1 − D(G(z(m)))) ] )
        • Update Θ_g for the training of G:
            Θ_g ⇐ −∇_{Θ_g} Σ_{m=1}^{M} ( log D(G(z(m))) + n_g(m)ᵀ log[ D(G(z(m))) / (1 − D(G(z(m)))) ] )
    end
end
Algorithm 2: B-BP algorithm for training a vanilla generative adversarial network (GAN) with NEM noise injection. The algorithm injects NEM noise into both the input and output layers of the discriminator D.

Data: {x(i)}_{i=1}^{N}, learning rate α, batch size M, starting point for backward training n*, number of epochs L, number of iterations T, gradient penalty coefficient λ, and other hyperparameters for the optimization.
Result: Trained weights Θ_d for discriminator D and weights Θ_g for generator G.
while epoch l : 1 → L do
    while iteration t : 1 → T do
        • Select M noise samples {z(m)}_{m=1}^{M} from p_z(z);
        • Select M real samples {x(m)}_{m=1}^{M} from p_r(x);
        • Select a random number ϵ ∼ U[0, 1];
        • Compute {x̂(m)}_{m=1}^{M} as follows:
            x̂(m) = ϵ x(m) + (1 − ϵ) G(z(m))
        if n* ⩽ l then
            • Generate noise n_x(m) for the input layer of the discriminator:
                if n_x(m)ᵀ(2x(m) − 2a_x(m) + n_x(m)) ⩽ 0 then
                    x_n(m) ← x(m) + n_x(m)
                else
                    x_n(m) ← x(m)
                end
            • Update Θ_d for the backward-pass training of D as follows:
                Θ_d ⇐ −∇_{Θ_d} Σ_{m=1}^{M} ( ∥G(z(m)) − D^T(a_r(m))∥² + ∥x_n(m) − D^T(a_g(m))∥² )
        end
        • Update Θ_d for the forward-pass training of D as follows:
            Θ_d ⇐ −∇_{Θ_d} Σ_{m=1}^{M} ( (D(x(m)) − D(G(z(m)))) + λ(∥∇_x̂ D(x̂(m))∥_2 − 1)² )
        • Update Θ_g for the training of G:
            Θ_g ⇐ −∇_{Θ_g} Σ_{m=1}^{M} D(G(z(m)))
    end
end
Algorithm 3: B-BP algorithm for training a Wasserstein generative adversarial network (WGAN) with NEM noise injection. Unidirectional BP updates the generator network. Bidirectional BP updates the discriminator network. NEM noise injects into the input layer of the discriminator.
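The forward-pass update of D in Algorithm 3 uses the gradient penalty from (112). The PyTorch sketch below is a minimal illustration of that critic loss for samples flattened to (batch, features); the function name is our own, the default λ = 10 follows the common choice in Gulrajani et al. (2017), and x_fake is assumed to be detached from the generator's graph.

import torch

def wgan_gp_discriminator_loss(D, x_real, x_fake, lam=10.0):
    # Interpolate between real and generated samples as in (112).
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    # Gradient of the critic output with respect to the interpolates.
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]
    penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()
    # Critic loss to minimize: fake score minus real score plus penalty.
    return D(x_fake).mean() - D(x_real).mean() + lam * penalty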

6. Bidirectional simulation results

We simulated the performance of NEM-noise-boosted B-BP for different bidirectional network structures on the MNIST and CIFAR-10 image data sets. We also simulated multilayer BAMs to test the extent to which they converged or stabilized to BAM fixed points. Stability tended to fall off with an increased number of hidden layers. A general finding was that bipolar coding tended to outperform binary in most cases (Tables 2–6). This appears to reflect the correlation-coefficient inequalities from the appendix to the original BAM paper (Kosko, 1988).

The classifier simulations used CIFAR-10 images (Krizhevsky & Hinton, 2009) for multilayer-perceptron (MLP) BAMs and for convolutional neural network (CNN) BAMs. The BAM structure of each network N : R^n → R^K arose from using the transpose matrices M^T on the backward pass through the reverse network N^T : R^K → R^n. B-BP trained the BAM models. We compared the effects of injecting NEM noise into both the input and output neurons with noiseless unidirectional BP and noiseless B-BP. We trained each model with 50,000 samples and tested the models on 10,000 test samples. The dataset used the following K = 10 pattern classes C_k: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The forward pass through N mapped an image pattern x ∈ C_k to an output probability vector y = N(x). The error compared this probability vector y with its class-label or unit bit-vector basis e_k. The BAM classifiers used 10 output neurons with softmax activations. So the neural output vectors y always defined a probability vector of length 10.

The B-BP trained models jointly minimized the forward and backward error functions. The forward pass used a one-shot multinomial probability density for its likelihood function. So the forward error E_f was cross entropy. The backward pass mapped the class-label or unit basis vector e_k to the pattern vector x′ = N^T(e_k). Some experiments mapped the raw output y = N(x) back through N^T to produce x′ = N^T(y). We computed the error by comparing this backward-passed vector x′ with the nearest class-C_k vector to the sample centroid of class C_k (sample centroids need not always lie in C_k). The input neurons involved in the backward pass used an I-dimensional vector normal probability density as their likelihood function. That gave the negative log-likelihood as the squared error. So the backward error E_b was just this squared error.

We next describe the NEM-boosted performance of the multilayer and convolutional BAM classifiers. Then we describe the NEM-boosted performance of B-BP applied to vanilla and Wasserstein GANs.

6.1. Deep bidirectional neural classifiers and NEM B-BP

Deep bidirectional classifiers N are multilayer BAMs that run forward through the matrix-based synaptic webs and run backward through the matrix transposes in N^T. B-BP trained such classifiers with cross entropy in the forward direction and squared error in the backward direction as discussed above. Let I denote the number of neurons at the input layer. The multilayer BAMs used 3072 input neurons for the CIFAR-10 dataset because each sample image had dimension 32 × 32 × 3.

We tested BAM classifiers with 1, 3, 5, and 7 hidden layers of rectified-linear unit (ReLU) activations. The BAM models used 512 neurons at each hidden layer. We used the dropout pruning method (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) to prune hidden neurons. A dropout value of 0.8 for the hidden layers reduced overfitting. So there was a 20% probability of pruning a given hidden neuron.

We compared B-BP with NEM noise injection with noiseless B-BP and with B-BP that injects only blind or dithering noise. We injected the noise only into the output and input layers of the classifier networks. Injecting NEM noise into hidden layers involves a more complicated algorithm that depends on the type of hidden neurons and the structure of the corresponding layer likelihood as in (10) (Kosko et al., 2019). Injecting NEM noise into hidden layers also tends to have less benefit than injecting it into the output layer of a classifier network.

The simulation results showed that B-BP NEM noise injection outperformed noiseless B-BP and blind-noise B-BP both in training speed-ups and classification accuracy. Table 8 shows that NEM noise injection performed best with respect to classification accuracy. Blind noise achieved better classification accuracy than noiseless B-BP in some cases. It performed worse in others. NEM noise injection outperformed blind noise injection in all cases.

Fig. 4 shows the highest and lowest benefit in classification accuracy with NEM noise injection. Table 8 shows that NEM noise improved classification accuracy compared with noiseless B-BP and with noiseless unidirectional BP. A BAM classifier with 7 hidden layers had a B-BP NEM-noise-boosted classification accuracy of 59.8% compared with 56.9% for noiseless B-BP.

Table 9 shows that NEM noise injection always speeded up B-BP training on the CIFAR-10 dataset. Injected blind noise also slowed down B-BP training. The speed-up in training measured how many fewer epochs it took the trained network to reach the final point when trained with noiseless B-BP. The benchmark was the output cross entropy of noiseless B-BP after 100 training epochs. A negative speed-up meant that training took more iterations to reach the noiseless benchmark.

NEM noise injection always produced a positive speed-up in training compared with noiseless B-BP. The best NEM noise speed-up was 68% in the BAM classifier with 1 hidden layer. Blind noise produced a speed-up in some cases and a slow-down in others. Blind-noise benefits tended to be slight. Table 9 shows that blind noise only slowed down training.

6.2. Deep convolutional neural classifiers and NEM B-BP

We trained bidirectional convolutional neural network (CNN) classifiers using the CIFAR-10 dataset. The CNNs used 4 convolutional layers and 2 fully connected layers. B-BP trained the convolutional masks for the bidirectional mapping. The forward pass ran a convolution (time-reversed correlation) with the masks. The backward pass ran the transpose convolution (Dumoulin & Visin, 2016) with the same masks. Figs. 5 and 6 show that the backward passes in these deep networks estimate pattern-class centroids.

The forward pass across convolutional layers used downsampling with the pooling operations. The backward pass with the convolutional mask used upsampling. The transpose convolution operation used the stride parameter to upsample along the backward pass. A convolutional mask was thus a bidirectional filter with the B-BP algorithm.

The size of all the convolutional masks was 3 × 3. The CNN models used 32 masks at the first convolutional layer and used 64 masks at the second convolutional layer. The third and fourth layers used 128 convolutional masks each. The first fully connected layer fc1 connected the output of the convolutional layers to the second fully connected layer fc2. The fc1 layer used 2048 neurons and the fc2 layer used 1024 neurons. The fc2 layer connected to the 10 neurons of the output layer.

The forward pass used average pooling after the second, third, and fourth convolutional layers to downsample the features. We used a pooling factor of 2. The backward pass used the stride factor of the transpose convolution operation at the second, third, and fourth convolutional layers to upsample the features. The stride factor was 2.
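A PyTorch sketch of this mask sharing at the first convolutional layer appears below: the forward pass convolves with a 3 × 3 mask bank and the backward pass applies the transpose convolution with the same weight tensor. The shapes are illustrative assumptions; at the deeper layers the text above pairs average pooling on the forward pass with a stride of 2 in the transpose convolution on the backward pass.

import torch
import torch.nn.functional as F

# One 3x3 convolutional mask bank shared by both passes:
# 32 output channels over 3 input channels, as at the first layer.
W = torch.nn.Parameter(0.01 * torch.randn(32, 3, 3, 3))

x = torch.randn(8, 3, 32, 32)                    # batch of CIFAR-10-sized images
h = F.conv2d(x, W, padding=1)                    # forward pass: 3 -> 32 channels

y = torch.randn(8, 32, 32, 32)                   # feedback signal at that layer
x_back = F.conv_transpose2d(y, W, padding=1)     # backward pass: 32 -> 3 channels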

The forward-pass performance measure or error E_f was cross entropy. The backward error E_b was squared error since these were classification-regression bidirectional networks. Different modes of B-BP trained the CNNs.

Table 10 shows that there was a slight boost in classification accuracy with NEM noise injection in B-BP. B-BP with NEM noise injection outperformed the other modes of B-BP. It also performed better than unidirectional BP.

Fig. 6 compares the backward pass of the class-label or unit-bit-vector basis vectors e_k through the reverse network N^T.


Table 8
Classification accuracy of multilayered BAM networks trained with backpropagation on the CIFAR-10 dataset. The deep bidirectional networks used 1, 3, 5, and 7 hidden layers with 512 ReLU neurons for each hidden layer. Blind or NEM noise injected into the output and input neurons if at all.

Network configuration     1 hidden layer    3 hidden layers    5 hidden layers    7 hidden layers
Unidirectional BP         53.89%            57.43%             57.83%             57.12%
B-BP without noise        54.26%            58.74%             58.18%             56.90%
B-BP with blind noise     53.86%            57.54%             58.09%             55.49%
B-BP with NEM noise       55.10%            59.43%             59.64%             59.80%

Table 9
Speed-ups and slow-downs in bidirectional training due to noise injection in the output softmax neurons of multilayer classifier BAMs trained on the CIFAR-10 dataset. The percentages are changes in the output cross entropy after 100 training epochs compared with noiseless B-BP.

                   B-BP with blind noise    B-BP with NEM noise
1 hidden layer     −6.0%                    30.0%
3 hidden layers    −26.0%                   40.0%
5 hidden layers    −8.0%                    28.0%
7 hidden layers    −16.0%                   32.0%

Fig. 4. NEM-noise benefits in classification accuracy for multilayered BAMs and B-BP training using the CIFAR-10 dataset. Simulations compared NEM-boosted B-BP with noiseless B-BP and with B-BP with injected blind noise. (a) Multilayer BAM classifier with 1 hidden layer and 1024 hidden ReLU neurons: NEM noise and blind noise injection outperformed noiseless B-BP. (b) Multilayer BAM with 7 hidden layers with 512 ReLU neurons at each hidden layer: NEM-boosted B-BP outperformed both noiseless B-BP and B-BP with blind noise injection.

The backward-pass patterns on the B-BP trained CNNs were similar to the targets x. The backward-pass patterns that arrived at the input of the unidirectional-BP trained network did not visually resemble the target images. Table 11 shows the speed-up in training due to noise injection. Injecting NEM noise in B-BP training always outperformed injecting blind noise.

Fig. 5. Backward pass of the class-label unit-bit-vector output basis vectors e_k for a multilayered BAM trained on the CIFAR-10 dataset. The multilayered BAM used 7 hidden layers and swept forward and backward through those layers. (a) The sample class centroids of the 10 image pattern classes. (b) Backward-pass prediction of the class-label output vectors e_k with unidirectional BP: The network failed to produce meaningful backward-pass feedback signals. (c) Backward-pass prediction of the class-label output vectors e_k with B-BP and NEM noise injection: The feedback signals closely matched the corresponding sample class centroids.

Table 10
Classification accuracy of convolutional neural networks trained on CIFAR-10 images.

                          Accuracy
Unidirectional BP         76.06%
B-BP without noise        76.44%
B-BP with blind noise     77.13%
B-BP with NEM noise       78.64%

6.3. NEM B-BP and adversarial networks

The adversarial simulations used the MNIST digit data (LeCun, Bottou, Bengio, & Haffner, 1998) for both vanilla GANs and Wasserstein GANs.


Table 11
Speed-up in training due to noise injection in B-BP trained CNN models on the CIFAR-10 dataset. The reference is B-BP without noise injection.

                          Speed-up
B-BP with blind noise     16.0%
B-BP with NEM noise       24.0%

Fig. 6. Backward pass of the class-label unit-bit-vector output basis vectors e_k for a multilayered convolutional BAM CNN trained on the CIFAR-10 dataset. (a) The sample class centroids of the 10 image pattern classes C_k. (b) Backward-pass prediction of the class-label output vectors e_k with unidirectional BP only: The network failed to produce meaningful backward-pass feedback signals. (c) Backward-pass prediction of the class-label output vectors e_k with B-BP and NEM noise injection in the CNN: The feedback signals closely matched the corresponding sample class centroids on the CIFAR-10 dataset.

Separate simulations used the CIFAR-10 images (Krizhevsky & Hinton, 2009) for the vanilla and Wasserstein GANs. The image data did require that we extend the vanilla GAN to a deep convolutional network.

6.3.1. Training a vanilla GAN with B-BP
We now describe the simulation results for the B-BP training of vanilla GANs. The GAN's generator G and discriminator D were multilayered neural networks. The generator network G mapped the space of the latent variable z to the space of generated samples. The discriminator network D mapped these samples to a single output logistic neuron. We trained the vanilla GANs only on the MNIST dataset.

Fig. 7. Performance of the vanilla and Wasserstein GANs on the MNIST dataset. (a) Inception scores for regular vanilla GANs trained on the MNIST dataset. Training with bidirectional BP and NEM noise injection gave the best performance in terms of the highest inception score (the exponentiated Kullback–Leibler entropic distance in Eq. (121)). All GANs trained with unidirectional BP suffered from mode collapse: They often produced the same image. The B-BP trained vanilla GANs did not suffer from mode collapse. NEM noise injection also improved the quality of the generated images for both unidirectional BP and bidirectional BP. (b) Inception scores for the Wasserstein GANs trained on the MNIST dataset. Training with B-BP and NEM noise gave the best inception scores. B-BP and NEM noise injection only slightly improved the WGAN performance and generated images.

We modeled the latent variable z as a bipolar uniform random variable: z ∼ U[−1, 1]. We used Goodfellow's GAN inception score IS(G) (Salimans et al., 2016) based on the exponentiated average Kullback–Leibler divergence:

IS(G) = exp( E_{x∼p_g}[ D_KL( p(y|x) ∥ p(y) ) ] )   (121)

for the generator G. The Kullback–Leibler divergence or relative entropy D_KL( p(y|x) ∥ p(y) ) measures the pseudo-distance between the conditional probability density p(y|x) and the corresponding unconditional marginal density p(y). A higher inception score IS(G) tends to describe generated images of better quality. So a larger inception score corresponds to better GAN performance. Table 12 shows the inception scores for the vanilla GANs trained on the MNIST dataset. Table 13 shows the inception scores for the Wasserstein GANs trained on the same dataset.
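Given a matrix of classifier probabilities p(y|x) for a batch of generated images, (121) reduces to a few lines of NumPy. The sketch below is our own illustration; the classifier that supplies p(y|x) (an Inception-style network or a trained digit classifier) is outside the sketch, and the function name is assumed.

import numpy as np

def inception_score(probs, eps=1e-12):
    # probs has shape (N, K): p(y|x) for N generated images over K classes.
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                          # IS(G) in (121)

# example with 4 generated images and K = 3 classes
probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.05, 0.90],
                  [0.30, 0.40, 0.30]])
print(inception_score(probs))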

The generator network G used 100 input neurons with identity activations. It had a single hidden layer with 128 logistic sigmoid neurons. The output layer contained 784 logistic neurons.

The discriminator network D used 784 input neurons with identity activations. It had a single hidden layer of 128 logistic neurons. The output layer had a single logistic neuron.


Fig. 8. Sample generated digit images from vanilla GANs trained with unidirectional BP and B-BP on the MNIST data set. These generated samples came from the trained GANs after 100 training epochs. (a) Unidirectional BP. (b) Bidirectional BP without noise. (c) Unidirectional BP with NEM noise. (d) Bidirectional BP with NEM noise.

Fig. 9. Sample generated images from deep-convolutional GANs trained with unidirectional BP and B-BP on the CIFAR-10 data set. We generated these samples after training the GANs over 100 epochs. (a) Unidirectional BP. (b) Bidirectional BP without noise. (c) Unidirectional BP with NEM noise. (d) Bidirectional BP with NEM noise.

NEM-boosted B-BP gave the best performance for both vanilla and Wasserstein GANs. NEM noise gave more pronounced benefits for vanilla GANs. We presume this was because the Wasserstein GANs involved more processing and had less room for improvement. Fig. 7 compares the performance of the vanilla and Wasserstein GANs trained on the MNIST dataset of handwritten digits. Fig. 8 shows sample MNIST-like images that the vanilla GANs generated. Fig. 11 shows sample MNIST-like images that the Wasserstein GANs generated. Fig. 10 compares the performance of the deep-convolutional and Wasserstein GANs trained on the CIFAR-10 image dataset. Fig. 9 shows sample CIFAR-10-like images that the deep-convolutional GANs generated. Fig. 12 shows sample CIFAR-10-like images that the Wasserstein GANs generated.

We also trained a deep-convolutional GAN (DCGAN) on the CIFAR-10 image set. The DCGAN used the same error function as did the vanilla GAN for the MNIST dataset.


Fig. 10. Performance of deep convolutional GANs (DCGANs) and Wasserstein GANs on the CIFAR-10 dataset. (a) DCGAN inception scores: NEM noise outperformed noiseless unidirectional BP and noiseless B-BP in the discriminator. (b) Wasserstein GAN inception scores: NEM noise also performed best but with less marked effects than in (a).

The DCGAN had 3 convolutional layers and no max-pooling layer (Radford, Metz, & Chintala, 2015). The forward pass convolved the input data with masks W*_i. The backward pass used the transposed masks W*_i (in true bidirectional fashion) to convolve signals. The reverse convolution was the transpose convolution operation (Dumoulin & Visin, 2016). There was no pooling at the convolutional layers for either the forward or backward pass.

The discriminator used 3 convolutional layers with leaky rectified-linear (ReLU) activations and a fully connected layer between the input and output layers.

Table 12
Inception scores for vanilla GANs trained on the MNIST dataset.

                                       Inception score
Unidirectional BP                      3.96 ± 0.07
Unidirectional BP with NEM noise       5.15 ± 0.10
B-BP discriminator                     6.66 ± 0.10
B-BP discriminator with NEM noise      7.04 ± 0.11

Table 13
Inception scores for Wasserstein GANs trained on the MNIST dataset.

                                       Inception score
Unidirectional BP                      7.67 ± 0.10
Unidirectional BP with NEM noise       7.71 ± 0.11
B-BP discriminator                     7.75 ± 0.12
B-BP discriminator with NEM noise      7.80 ± 0.12

Table 14
Inception scores for deep-convolutional GANs trained on the CIFAR-10 dataset.

                                       Inception score
Unidirectional BP                      5.55 ± 0.20
Unidirectional BP with NEM noise       5.96 ± 0.15
B-BP discriminator                     5.66 ± 0.16
B-BP discriminator with NEM noise      6.03 ± 0.17

Table 15
Inception scores for Wasserstein GANs trained on the CIFAR-10 dataset.

                                       Inception score
Unidirectional BP                      5.80 ± 0.46
Unidirectional BP with NEM noise       5.98 ± 0.29
B-BP discriminator                     5.96 ± 0.23
B-BP discriminator with NEM noise      6.10 ± 0.25

The generator G also used 3 convolutional layers with rectified-linear activations and a fully connected layer between the input and output layers. The generator G used 128 input neurons with identity activations.

Table 14 shows that NEM-boosted B-BP gave the best inception-score performance for the DCGAN trained on the CIFAR-10 image set, just as Table 12 shows for the vanilla GAN trained on the MNIST data set. The inception score IS(G) in (121) uses an exponentiated Kullback–Leibler entropic distance to measure the image quality. A higher inception score corresponds to better GAN performance. Table 15 shows likewise that Wasserstein GANs performed best with NEM noise and a bidirectional discriminator.

7. Conclusions

NEM noise injection improved the performance of bidirectional backpropagation on classifiers both in terms of increased accuracy and shorter training time.

Fig. 11. Generated digit images from Wasserstein GANs trained with unidirectional BP and B-BP on the MNIST data set. We generated these samples after training the GANs for 100 epochs. (a) Unidirectional BP. (b) Bidirectional BP without noise. (c) Unidirectional BP with NEM noise. (d) Bidirectional BP with NEM noise.


Fig. 12. Generated images from Wasserstein GANs trained with unidirectional BP and B-BP on the CIFAR-10 image data set. We generated these samples after training the GANs for 100 epochs. (a) Unidirectional BP. (b) Bidirectional BP without noise. (c) Unidirectional BP with NEM noise. (d) Bidirectional BP with NEM noise.

It often required fewer hidden neurons or layers to achieve the same performance as noiseless B-BP. The NEM noise benefits were even more pronounced for convolutional BAM classifiers. NEM noise injects just that noise that makes the current signal more probable. It almost always outperformed simply injecting blind noise. B-BP also outperformed ordinary unidirectional BP. Injecting NEM noise into a vanilla generative adversarial network improved its inception score and greatly reduced the problem of mode collapse. It also improved the performance of Wasserstein GANs but the benefit was less pronounced. NEM noise can also inject into the hidden-layer neurons of these multilayer BAM networks.
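A minimal sketch of this screening idea for a softmax output layer follows, assuming the hyperplane form of the NEM condition from the NEM papers cited below: keep only those noise samples n with n . log(a) >= 0, where a is the current softmax output vector, so the injected noise makes the current training target more probable, while blind noise skips the screen.

    # Sketch (ours) of NEM-screened output noise versus blind noise.
    import numpy as np

    def nem_output_noise(a, scale=0.1, rng=np.random.default_rng(0)):
        """a: softmax output vector. Returns additive noise that passes the NEM screen."""
        while True:
            n = rng.normal(0.0, scale, size=a.shape)
            if np.dot(n, np.log(a + 1e-12)) >= 0.0:  # assumed NEM hyperplane condition
                return n

    a = np.array([0.7, 0.2, 0.1])                    # current softmax activations
    t_noisy = np.array([1.0, 0.0, 0.0]) + nem_output_noise(a)  # noise-injected target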

References

Adigun, O., & Kosko, B. (2016). Bidirectional representation and backpropagation learning. In International joint conference on advances in big data analytics (pp. 3–9).

Adigun, O., & Kosko, B. (2017). Using noise to speed up video classification with recurrent backpropagation. In International joint conference on neural networks (pp. 108–115). IEEE.

Adigun, O., & Kosko, B. (2018). Training generative adversarial networks with bidirectional backpropagation. In 2018 17th IEEE international conference on machine learning and applications (pp. 1178–1185). IEEE.

Adigun, O., & Kosko, B. (2019). Bidirectional backpropagation. IEEE Transactions on Systems, Man, and Cybernetics: Systems.

Ali, M., Yogambigai, J., Saravanan, S., & Elakkia, S. (2019). Stochastic stability of neutral-type Markovian-jumping BAM neural networks with time varying delays. Journal of Computational and Applied Mathematics, 349, 142–156.

Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In International conference on learning representations.

Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Audhkhasi, K., Osoba, O., & Kosko, B. (2016). Noise-enhanced convolutional neural networks. Neural Networks, 78, 15–23.

Bengio, Y., et al. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

Bhatia, S., & Golman, R. (2019). Bidirectional constraint satisfaction in rational strategic decision making. Journal of Mathematical Psychology, 88, 48–57.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, (5), 815–826.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 39(1), 1–22.

Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680). NIPS.

Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). IEEE.


Grossberg, S. (1976). Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics, 23(4), 187–202.

Grossberg, S. (1982). How does a brain build a cognitive code? In Studies of mind and brain (pp. 1–52). Springer.

Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1(1), 17–61.

Grossberg, S. (2017). Towards solving the hard problem of consciousness: The varieties of brain resonances and the conscious experiences that they support. Neural Networks, 87, 38–95.

Gubner, J. A. (2006). Probability and random processes for electrical and computer engineers. Cambridge University Press.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

Hinton, G. E., Rumelhart, D., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(9), 533–536.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Hogg, R., McKean, J., & Craig, A. (2012). Introduction to mathematical statistics (7th ed.). Pearson Education.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.

Jordan, M., & Mitchell, T. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349, 255–260.

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480.

Kosko, B. (1987). Adaptive bidirectional associative memories. Applied Optics, 26(23), 4947–4960.

Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man and Cybernetics, 18(1), 49–60.

Kosko, B. (1990). Unsupervised learning in noise. IEEE Transactions on Neural Networks, 1(1), 44–57.

Kosko, B. (1991a). Neural networks and fuzzy systems: A dynamical systems approach to machine intelligence. Prentice Hall.

Kosko, B. (1991b). Stochastic competitive learning. IEEE Transactions on Neural Networks, 2(5), 522–529.

Kosko, B. (2018). Additive fuzzy systems: From generalized mixtures to rule continua. International Journal of Intelligent Systems, 33(8), 1573–1623.

Kosko, B., Audhkhasi, K., & Osoba, O. (2019). Noise can speed backpropagation learning and deep bidirectional pretraining. Elsevier, submitted for publication.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks: Vol. 3361(10), (p. 1995).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Maharajan, C., Raja, R., Cao, J., Rajchakit, G., & Alsaedi, A. (2018). Impulsive Cohen–Grossberg BAM neural networks with mixed time-delays: An exponential stability analysis issue. Neurocomputing, 275, 2588–2602.

Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. MIT Press.

Osoba, O., & Kosko, B. (2013). Noise-enhanced clustering and competitive learning algorithms. Neural Networks, 37, 132–140.

Osoba, O., & Kosko, B. (2016). The noisy expectation-maximization algorithm for multiplicative noise injection. Fluctuation and Noise Letters, 1650007.

Osoba, O., Mitaim, S., & Kosko, B. (2013). The noisy expectation–maximization algorithm. Fluctuation and Noise Letters, 12(3), 1350012-1–1350012-30.

Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., & Radford, A. (2016). Improved techniques for training GANs. arXiv preprint arXiv:1606.03498.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 1929–1958.

Villani, C. (2009). Optimal transport: Old and new. Springer, Berlin.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research (JMLR), 11(Dec), 3371–3408.

Wang, F., Chen, Y., & Liu, M. (2018). pth moment exponential stability of stochastic memristor-based bidirectional associative memory (BAM) neural networks with time delays. Neural Networks, 98, 192–202.

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences (Doctoral dissertation). Applied Mathematics, Harvard University, MA.