
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 6, NO. 3, MAY 1995

On Sequential Construction of Binary Neural Networks

Marco Muselli

Abstract—A new technique, called sequential window learning (SWL), for the construction of two-layer perceptrons with binary inputs is presented. It generates the number of hidden neurons together with the correct values for the weights, starting from any binary training set. The introduction of a new type of neuron, having a window-shaped activation function, considerably increases the convergence speed and the compactness of the resulting networks. Furthermore, a preprocessing technique, called hamming clustering (HC), is proposed for improving the generalization ability of constructive algorithms for binary feedforward neural networks. Its insertion in the sequential window learning is straightforward. Tests show the good performances of the proposed methods in terms of network complexity and recognition accuracy.

I. INTRODUCTION

THE backpropagation algorithm [1] has been applied both to classification and approximation problems, showing a remarkable flexibility. Nevertheless, some important drawbacks restrict its application range, particularly when dealing with real world data:

1) The network architecture must be fixed a priori, i.e., the number of layers and the number of units for each layer must be determined by the user before the training process.
2) The learning time is, in general, very high and consequently the maximum number of weights one can consider is reduced.
3) Classification problems are not tackled in a natural way because the cost function does not directly depend on the number of wrongly classified patterns.

A variety of solutions has been proposed to solve such problems; among these, an important contribution comes from constructive methods [2]. Such techniques successively add units in the hidden layer until all the input-output relations of a given training set are satisfied. In general, the convergence time is very low, since at any iteration the learning process involves the weights of only one neuron. On the contrary, in the backpropagation procedure, all the weights in the network are modified at the same time to minimize the cost function value.

In this paper we focus on the construction of binary feedforward neural networks with a single hidden layer. Every input and output in the net can only assume two possible states, coded by the integer values +1 and −1. Such an architecture is general enough to implement any boolean function (with one or more output values) if a sufficient number of hidden units is provided [2]. Moreover, this kind of neural network greatly simplifies the extraction of symbolic rules from the connection weights [3].

Some constructive methods are specifically devoted to the synthesis of binary feedforward neural networks [4]-[7]; they take advantage of this particular situation and lead in a short time to interesting solutions. A natural approach is determined by the procedure of sequential learning [6], but its implementation presents some practical problems. First of all, the weights in the output layer grow exponentially, leading to intractable numbers even for a few hidden units. Then, difficulties arise when the output dimension is greater than one, since no extension of the standard procedure is given.

Furthermore, the proposed algorithm for the training of hidden neurons is not efficient and considerably increases the time needed for the synthesis of the whole network. On the other hand, faster methods like the perceptron algorithm [8] or optimal methods like the pocket algorithm [9] cannot be used because of the particular construction made by the procedure of sequential learning.

The present work describes a method of solving these problems and proposes a well-suited algorithm, called sequential window learning (SWL), for the training of two-layer perceptrons with binary inputs and outputs. In particular, we introduce a new type of neuron, having a window-shaped activation function; this unit allows the definition of a fast training algorithm based on the solution of linear algebraic equations.

Moreover, a procedure for increasing the generalization ability of constructive methods is presented. Such a procedure, called hamming clustering (HC), explores the local properties of the training set, leaving any global examination to the constructive method. In this way we can obtain a good balance between locality and capacity, which is an important trade-off for the treatment of real world problems [10].

HC is also able to recognize irrelevant inputs inside the current training set, so that useless connections can be removed. The complexity of the resulting networks is then reduced, leading to simpler architectures. This fact is strictly related to the Vapnik-Chervonenkis dimension of the system, which depends on the number of weights in the neural network [11].

The structure of this paper is as follows. Section II introduces the formalism and describes the modifications and extensions to the procedure of sequential learning. In Section III the window neuron is defined with its properties, and the related training algorithm is examined in detail. Section IV presents the hamming clustering and its insertion in the sequential window learning, while Section V shows the simulation results and some comparisons with other training algorithms. Conclusions and final observations are the matter of Section VI.

Manuscript received August 18, 1993; revised February 11, 1994. The author is with the Istituto per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy. IEEE Log Number 9404865.

II. THE PROCEDURE OF SEQUENTIAL LEARNING

Throughout this paper, we consider two-layer feedforward perceptrons with binary inputs and outputs; let n be the number of inputs, h the number of units in the hidden layer (initially unknown), and m the number of outputs.

The procedure of sequential learning starts from a training set containing p input-output relations (ξ^μ, ζ^μ), μ = 1, ..., p. All the components ξ_i^μ, i = 1, ..., n, and ζ_k^μ, k = 1, ..., m, are binary, coded by the values −1 and +1. For the sake of simplicity, a new component ξ_0^μ = +1 is always added to each input pattern, so that the bias of the hidden neurons becomes an additional weight.

Then, let us introduce the following notations:
1) X_j, j = 1, ..., h, is the jth hidden neuron, having activation function σ_x.
2) w_ji, j = 1, ..., h, i = 0, 1, ..., n, is the weight for the connection between the ith input and the hidden neuron X_j; w_j0 is the bias of the unit X_j.
3) S_j^μ, j = 0, 1, ..., h, is the output of X_j caused by the application of the pattern ξ^μ to the network inputs; we set S_0^μ = +1 by definition. All the binary values S_j^μ form a vector S^μ = (S_0^μ, S_1^μ, ..., S_h^μ) called the internal representation of the pattern ξ^μ.
4) Y_k, k = 1, ..., m, is the kth output neuron, having activation function σ_y.
5) v_kj, k = 1, ..., m, j = 0, 1, ..., h, is the weight for the connection between the hidden unit X_j and the output neuron Y_k; v_k0 is the bias of Y_k.
6) O_k^μ, k = 1, ..., m, is the output of Y_k caused by the application of the pattern ξ^μ to the network inputs.

The activation functions σ_x and σ_y, respectively for hidden and output neurons, can be different, but they must provide binary values in the set {−1, +1}. Consequently, the internal representations S^μ also have binary components.

The procedure of sequential learning adds hidden units, following a suitable rule, until all the relations contained in the training set are satisfied. The standard version, proposed by Marchand et al. [6], applies only to neural networks with a single output (m = 1); obviously, we can always construct a distinct network for each output by simply iterating the basic algorithm, but in general the resulting configuration has too many neurons and weights.

Let us begin our examination, however, from the case m = 1; we shall give later some solutions for approaching generic output dimensions. Let P+ and P- be the following sets

P+ = {ξ^μ : ζ^μ = +1, μ = 1, ..., p}
P− = {ξ^μ : ζ^μ = −1, μ = 1, ..., p}

where ζ^μ is the output pattern (a single binary value) corresponding to the input pattern ξ^μ of the training set.

While we leave the activation function σ_x of the hidden units free, let σ_y be the well-known sign function

   σ_y(u) = +1 if u ≥ 0, −1 if u < 0.   (1)

Since we are dealing with the case m = 1, let us denote with v_j, j = 1, ..., h, the weights for the output neuron Y and with v_0 the corresponding bias.

The kernel of the procedure of sequential learning is the addition of a new unit in the hidden layer; for this aim a suitable training algorithm is applied. It provides the weights of a new unit X_j, starting from a particular training set, in most cases different from the original one. Let Q_j^+ be the set of the patterns ξ^μ for which the desired output is S_j^μ = +1, whereas Q_j^− contains the patterns ξ^μ for which we want S_j^μ = −1.

When the training algorithm stops we obtain the following four sets (some of which can eventually be empty):
1) R_j^+ contains the patterns ξ^μ ∈ Q_j^+ rightly classified by X_j (S_j^μ = +1).
2) R_j^− contains the patterns ξ^μ ∈ Q_j^− rightly classified by X_j (S_j^μ = −1).
3) W_j^− contains the patterns ξ^μ ∈ Q_j^+ wrongly classified by X_j (S_j^μ = −1).
4) W_j^+ contains the patterns ξ^μ ∈ Q_j^− wrongly classified by X_j (S_j^μ = +1).

In the procedure of sequential learning each unit X_j is assigned an arbitrary variable s_j having values in the set {−1, +1}. By setting the sign of such a variable we determine the class of the patterns which will be eliminated from the current training set after the creation of the hidden neuron X_j.

In particular, if s_j > 0 the learning algorithm for the unit X_j must provide W_j^+ = ∅; in this case the patterns contained in R_j^+ will be removed from the training set. In the same way, if s_j < 0 the condition W_j^− = ∅ is required and R_j^− will be considered for the elimination.

We can join together these two cases by introducing an equivalent formulation of the procedure of sequential learning. It always requires W_j^+ = ∅ after each insertion in the hidden layer; the class of the patterns removed is now determined by a proper definition of the sets Q_j^+ and Q_j^− for the training of the neuron X_j. In this formulation the main steps of the algorithm are the following:

1) An arbitrary variable s_j is chosen in the set {−1, +1}.
2) A new hidden unit X_j is generated, starting from the training set

      Q_j^+ = P+ if s_j = +1,  Q_j^+ = P− if s_j = −1
      Q_j^− = P− if s_j = +1,  Q_j^− = P+ if s_j = −1.

   The constraint W_j^+ = ∅ must be satisfied in the generation.
3) The resulting set R_j^+ is subtracted from the current training set {P+, P−}.


These three steps are iterated until the current training set contains only patterns from one class (P+ = ∅ or P− = ∅).

In practice, each neuron X_j must be active (S_j^μ = +1) for some patterns ξ^μ having ζ^μ = s_j and must remain inactive (S_j^μ = −1) for any pattern ξ^μ for which ζ^μ = −s_j. Neurons that satisfy this condition can always be found; for example, the grandmother cell of a pattern ξ^μ ∈ P+ verifies such a property [6]. A neural network containing only grandmother cells in its hidden layer, however, has no practical interest; so a suitable training algorithm for the hidden neurons is required. This is the object of Section III.
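To make the iteration above concrete, the following Python sketch (not the author's code) shows the outer loop of the equivalent formulation for m = 1. The helper train_hidden_unit is a hypothetical placeholder for the hidden-unit training algorithm of Section III: it is assumed to return the weights of the new unit together with the set R_plus of patterns of Q_plus that the unit activates, while leaving every pattern of Q_minus inactive (W_plus empty).

```python
# Sketch of the equivalent formulation of sequential learning for m = 1.
# Patterns are tuples of +/-1 with the constant component xi_0 = +1 prepended.
# train_hidden_unit is a placeholder for the algorithm of Section III.

def sequential_learning(P_plus, P_minus, train_hidden_unit, choose_s):
    P_plus, P_minus = set(P_plus), set(P_minus)
    hidden_weights, s_values = [], []
    while P_plus and P_minus:                 # stop when one class is exhausted
        s = choose_s(P_plus, P_minus)         # arbitrary choice in {-1, +1}
        Q_plus = P_plus if s == +1 else P_minus
        Q_minus = P_minus if s == +1 else P_plus
        w, R_plus = train_hidden_unit(Q_plus, Q_minus)
        if not R_plus:                        # a grandmother cell always exists,
            raise RuntimeError("no pattern removed")  # so this should not happen
        hidden_weights.append(w)
        s_values.append(s)
        if s == +1:
            P_plus -= R_plus                  # remove the covered patterns
        else:
            P_minus -= R_plus
    return hidden_weights, s_values
```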

Now, we are interested in the choice of output weights v_j, j = 0, 1, ..., h, that correctly satisfy all the input-output relations contained in the training set. After the iterated execution of the three main steps above, a possible assignment for the weights v_j is the following [6]

   v_j = s_j 2^{h−j}   for j = 1, ..., h.   (2)

Unfortunately, these values grow exponentially with the number h of hidden neurons; thus, even for small sizes of the resulting networks, the needed range of values makes both the simulation on a conventional computer and the implementation on a physical support extremely difficult or impossible.

To overcome this problem, let us subdivide the hidden neurons into g groups, each of which contains adjacent units X_j with the same value of s_j. More formally, if h_l, l = 1, ..., g, is the index of the last neuron belonging to the lth group, then we have

   s_j = s_{h_l}   for j = h_{l−1}+1, ..., h_l

having set h_0 = 0 by definition. The following result is then valid.

Theorem 1: A correct choice for the output weights v_j in the procedure of sequential learning is the following

   v_j = s_j ( 1 + Σ_{i=h_l+1}^{h} |v_i| )   (3)

for j = h_{l−1}+1, ..., h_l, l = 1, ..., g.

Proof: We show that (3) leads to a correct output O^μ for a generic ξ^μ ∈ P+; the complementary case (ξ^μ ∈ P−) can be treated in a similar way. If ξ^μ ∈ P+, then, from the iterated execution of the main steps, we obtain two possible situations:

1) There exists j* (1 ≤ j* ≤ h) such that s_{j*} = +1 and S_{j*}^μ = +1.
2) ξ^μ belongs to the residual training set when execution stops; thus, we have s_h = −1.

In the first case let l* denote the group of hidden neurons containing X_{j*} (1 ≤ l* ≤ g). Then we have

   s_j = s_{j*} = +1   for j = h_{l*−1}+1, ..., h_{l*}

whereas S_j^μ = −1 for every j < j*, since ξ^μ has not been removed from the training set by any previous hidden unit. So, the input to the output neuron Y is given by

   v_0 + Σ_{j=1}^{h} v_j S_j^μ   (4)

and the choice (3) guarantees that this quantity is nonnegative. Thus, by applying (1) we obtain O^μ = ζ^μ = +1 as desired. In the second case we have instead S_j^μ = −1 for every j = 1, ..., h, and a direct evaluation of (4) with the weights (3) again gives a nonnegative input to Y.

Hence, in both cases, O^μ = ζ^μ = +1.

From (3) we obtain two extreme cases:
1) If all the s_j are equal, then the output weights v_j have constant (binary) values, whereas the bias v_0 linearly grows with h.
2) If the s_j vary alternately, that is s_{j+1} = −s_j for j = 1, ..., h−1, then we return to the standard assignment (2).


Since the values of the variables sj can be chosen in an arbitrary way, we have a method for controlling the growth of output weights.
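As a numerical illustration of the assignment (3) as reconstructed above (the bias v_0 is not covered by the formula and is therefore left out), the following sketch computes the output weights from a given sequence of s_j values and reproduces the two extreme cases mentioned in the text; the grouping follows the definition of the indexes h_l.

```python
# Output weights according to the reconstructed assignment (3):
# v_j = s_j * (1 + sum of |v_i| over all units i after the group containing j).
def output_weights(s):
    h = len(s)                                   # s[j-1] is s_j, j = 1..h
    # h_l values: index of the last unit of each group of equal s_j
    last = [j for j in range(1, h + 1) if j == h or s[j] != s[j - 1]]
    v = [0.0] * (h + 1)                          # v[1..h] are used
    tail = 0.0                                   # sum of |v_i| after the group
    for h_l in reversed(last):
        pos = last.index(h_l)
        start = last[pos - 1] + 1 if pos > 0 else 1
        for j in range(start, h_l + 1):
            v[j] = s[j - 1] * (1 + tail)
        tail += sum(abs(v[j]) for j in range(start, h_l + 1))
    return v[1:]

# Extreme cases discussed in the text:
print(output_weights([+1] * 5))          # all s_j equal -> weights stay at +1
print(output_weights([+1, -1, +1, -1]))  # alternating s_j -> powers of two, as in (2)
```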

A. Generalization of the Procedure of Sequential Learning

The procedure of sequential learning can be extended in two ways to construct binary feedforward neural networks with a generic number of outputs. The difference between the two extensions lies in the generation of the output weights v_kj.

In the first method the assignment (3) is naturally generalized, whereas in the latter the weights v_kj are found by a proper training algorithm.

In this last case the computing time needed for the construction of the whole network is higher, but such a drawback is often balanced by a greater compactness of the resulting architectures and consequently by a better generalization ability. On the other hand, the availability of a fast method for the construction of binary neural networks, starting from a generic training set, is of great practical interest.

In a natural extension of the procedure of sequential learning to the case m > 1, the variables s_j must be replaced by a matrix [s_kj], k = 1, ..., m, j = 1, ..., h, filled by values in the set {−1, 0, +1}. In fact, a hidden neuron X_j could push one output neuron toward positive values (s_kj = +1) and another unit toward negative values (s_kj = −1). The choice s_kj = 0 means that no interaction exists between the units X_j and Y_k (consequently v_kj = 0).

For the same reason we must consider a number g_k of groups for each output, and the last neuron of each group explicitly depends on the output it refers to. Thus we have a matrix of indexes [h_kl], k = 1, ..., m, l = 1, ..., g_k, in which the length of each row depends on the output k.

Furthermore, the procedure starts from m pairs of pattern sets P_k^+ and P_k^−, k = 1, ..., m, obtained from the input-output relations of the training set in the following way

   P_k^+ = {ξ^μ : ζ_k^μ = +1, μ = 1, ..., p}
   P_k^− = {ξ^μ : ζ_k^μ = −1, μ = 1, ..., p}.

With these notations a natural extension of the procedure of sequential learning is shown in Fig. 1. At step 4 the set of input patterns Q_h^+ for the training of a new hidden unit X_h is determined in such a way that

   Q_h^+ ⊆ P_k^+  if s_kh = +1
   Q_h^+ ⊆ P_k^−  if s_kh = −1

for any k ∈ K, where K is a subset of output indexes. Among the possible ways of determining Q_h^+ and K, a simple method giving good results is the following:

1) For every k = 1, ..., m take the set U_k given by the patterns of P_k^+ (if s_kh = +1) or of P_k^− (if s_kh = −1).
2) Put in K (initially empty) the output index k corresponding to the set U_k with the greatest number of elements, and set Q_h^+ = U_k.
3) Modify the sets U_k, k = 1, ..., m, in the following way: U_k = U_k ∩ Q_h^+.
4) Let k' ∉ K be the output associated with the set U_k' with the greatest size. If the number of patterns in U_k' exceeds a given threshold τ, then put k' in the set K and repeat Steps 2-4; otherwise stop.

Such a simple method leads to the construction of hidden layers having a limited number of neurons, in general much smaller than that of the networks obtained by repeated applications of the standard procedure for m = 1. In the simulations we have always used this method with the choice τ = n.

Theorem 1 can easily be extended to this case and shows the correctness of this approach: it is able to construct a two-layer neural network satisfying all the input-output relations contained in a given training set.

More compact nets can be obtained in many cases by applying a suitable training algorithm for the output layer. This second technique also assures the feasibility of the neural network, but, as shown later, the convergence is theoretically asymptotic. In most practical cases, however, the training time turns out to be acceptable.

A possible implementation of this second extension is shown in Fig. 2. The choice of the variables s_k is made only at the first step; the user cannot modify their values during the construction of the hidden layer. The dynamic adaptation of such quantities is provided by the auxiliary variables t_k, whose sign is indirectly affected by the training algorithm for output neurons (through the sets V_k^+ and V_k^−).

Step 4 chooses the output index k' that determines the addition of a new hidden unit X_h. This neuron must be active (S_h^μ = +1) for some patterns ξ^μ ∈ Q_{k'}^+ and provide S_h^μ = −1 for all the patterns ξ^μ ∈ Q_{k'}^−, as in the standard version. The disjointness of the sets R_h^+ obtained by subsequent choices of the same output index k is warranted by the updating of U_k^+ and U_k^− at Step 6.

The sets T_k^+ and T_k^− contain the internal representations of the input patterns of P_k^+ and P_k^−. They are obtained at Step 7 through the following relations

   T_k^+ = {S^μ : ξ^μ ∈ P_k^+}
   T_k^− = {S^μ : ξ^μ ∈ P_k^−}

where the components of S^μ are given by

   S_j^μ = σ_x( Σ_{i=0}^{n} w_ji ξ_i^μ ),  j = 1, ..., h,  with S_0^μ = +1.

After the application of the training algorithm for output neurons, we obtain for every unit Y_k two sets V_k^+ and V_k^−

   V_k^+ = {ξ^μ ∈ P_k^+ : O_k^μ = ζ_k^μ = +1}
   V_k^− = {ξ^μ ∈ P_k^− : O_k^μ = ζ_k^μ = −1}.

Such a training algorithm plays a fundamental role in this second extension of the procedure of sequential learning. If the output layer is trained by an algorithm that finds, at least in an asymptotic way, the optimal configuration (i.e., the weight matrix [v_kj] which makes the minimum number of errors on the current training set), then correct binary neural networks are always constructed.


PROCEDURE OF SEQUENTIAL LEARNING (Natural Extension)

INPUTS
P_k^+, P_k^− = sets of patterns forming the training set of the kth output.

OUTPUTS
h = number of hidden units.
[w_ji] = weight matrix for the connections between input and hidden layer.
[v_kj] = weight matrix for the connections between hidden and output layer.

TEMPORARY ITEMS
s_kj = arbitrary variable in the set {−1, 0, +1}.
g_k = number of groups of consecutive neurons having the same value of s_kj.
h_kl = index of the last neuron belonging to the lth group (for the kth output).
Q_j^+, Q_j^− = sets of patterns for the training of the hidden neuron X_j.
R_j^+, R_j^−, W_j^+, W_j^− = sets of patterns generated by the training algorithm for the unit X_j.

ALGORITHM
1. Set h = 0; s_k0 = 0; h_k0 = 0; g_k = 0 for k = 1, ..., m.
2. Set h = h + 1.
3. Assign to the variables s_kh, k = 1, ..., m, a value in the set {−1, +1}.
4. Choose in an arbitrary way a set of patterns Q_h^+ ≠ ∅ and a set of output indexes K ≠ ∅ such that Q_h^+ ⊆ P_k^+ if s_kh = +1 and Q_h^+ ⊆ P_k^− if s_kh = −1, for every k ∈ K.
6. For every k ∈ K, if s_kh ≠ s_{k,h−1}, then:
   6a. Set g_k = g_k + 1.
7. Set h_{k,g_k} = h for every k ∈ K.
8. Apply the training algorithm for hidden neurons, starting from the sets Q_h^+ and Q_h^−, in order to obtain the weights w_hi, i = 0, 1, ..., n, for a new unit X_h having W_h^+ = ∅.
9. For every k ∈ K, if s_kh = +1,
   then: 9a. Set P_k^+ = P_k^+ \ R_h^+
   otherwise: 9b. Set P_k^− = P_k^− \ R_h^+.
10. If there exists an output index k such that P_k^+ ≠ ∅ and P_k^− ≠ ∅, go to step 2.
11. For every k = 1, ..., m, set:
    v_kj = s_kj ( Σ_{i=h_kl+1}^{h} |v_ki| + 1 )   for j = h_{k,l−1}+1, ..., h_kl,  l = 1, ..., g_k.

Fig. 1. Natural extension of the procedure of sequential learning.

Algorithms of this kind are available in the literature; the most popular is probably the pocket algorithm [9]. It can be shown that the probability of reaching the optimal configuration approaches unity as the training time increases. Unfortunately, there is no known bound for the training time actually required, and other training algorithms, not as good from a theoretical point of view but more efficient, are often used [12].
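The pocket algorithm itself is not detailed in the paper; the sketch below is a minimal textbook-style version, given only to illustrate the idea used for the output layer: perform ordinary perceptron updates and keep "in the pocket" the weight vector that has made the fewest errors on the training set so far. Parameters and the stopping rule are illustrative.

```python
import random

def pocket_algorithm(patterns, targets, epochs=1000, seed=0):
    """Train a single threshold unit; patterns include the bias component +1.
    Returns the best ('pocket') weight vector seen during training."""
    rng = random.Random(seed)
    samples = list(zip(patterns, targets))
    w = [0.0] * len(patterns[0])

    def errors(weights):
        return sum(1 for x, t in samples
                   if (1 if sum(wi * xi for wi, xi in zip(weights, x)) >= 0 else -1) != t)

    pocket, pocket_err = list(w), errors(w)
    for _ in range(epochs):
        x, t = rng.choice(samples)
        out = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if out != t:                       # perceptron update on a misclassified pattern
            w = [wi + t * xi for wi, xi in zip(w, x)]
            err = errors(w)
            if err < pocket_err:           # keep the best weights found so far
                pocket, pocket_err = list(w), err
    return pocket, pocket_err
```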

The properties of the pocket algorithm, however, allow us to formulate the following result.

Theorem 2: The extension with output training of the pro- cedure of sequential learning is (asymptotically) able to con- struct a two-layer perceptron satisfying all the input-output relations contained in a given binary training set.

Proof: Let us refer to the implementation in Fig. 2; the repeated execution of Steps 3-9 causes the addition of some hidden neurons for every output.


PROCEDURE OF SEQUENTIAL LEARNING (Extension with Output Training)

INPUTS
P_k^+, P_k^− = sets of patterns forming the training set of the kth output.

OUTPUTS
h = number of hidden units.
[w_ji] = weight matrix for the connections between input and hidden layer.
[v_kj] = weight matrix for the connections between hidden and output layer.

TEMPORARY ITEMS
s_k = arbitrary binary variable.
t_k = auxiliary binary variable.
Q_k^+, Q_k^− = sets of patterns for the training of hidden neurons.
U_k^+, U_k^− = sets of input patterns activating hidden units.
T_k^+, T_k^− = sets of current internal representations for the training of the output neuron Y_k.
V_k^+, V_k^− = sets of input patterns correctly classified by Y_k.
R_j^+, R_j^−, W_j^+, W_j^− = pattern sets generated by the training algorithm for the unit X_j.

ALGORITHM
1. Assign to the variables s_k, k = 1, ..., m, arbitrary binary values.
2. For every k = 1, ..., m, set:
   h = 0; t_k = s_k; U_k^+ = P_k^+; U_k^− = P_k^−
   Q_k^+ = P_k^+ if s_k = +1, P_k^− if s_k = −1
   Q_k^− = P_k^− if s_k = +1, P_k^+ if s_k = −1.
3. Set h = h + 1.
4. Choose the output index k' corresponding to the set Q_{k'}^+ with the greatest size.
5. Apply the training algorithm for hidden neurons, starting from the sets Q_{k'}^+ and Q_{k'}^−, in order to obtain the weights w_hi, i = 0, 1, ..., n, for a new unit X_h having W_h^+ = ∅.
6. If t_{k'} = +1,
   then: 6a. Set U_{k'}^+ = U_{k'}^+ \ R_h^+
   otherwise: 6b. Set U_{k'}^− = U_{k'}^− \ R_h^+.
7. For every k = 1, ..., m compute the sets of internal representations T_k^+ and T_k^− starting from P_k^+ and P_k^−.
8. Apply the training algorithm for output neurons, starting from the sets T_k^+ and T_k^−, in order to obtain the weight matrix [v_kj]. Let V_k^+ and V_k^− be the sets of input patterns correctly classified by the output neuron Y_k.
9. For every k = 1, ..., m, if s_k = +1 and U_k^+ \ V_k^+ ≠ ∅, or s_k = −1 and U_k^− \ V_k^− = ∅,
   then: 9a. Set t_k = +1; Q_k^+ = U_k^+ \ V_k^+; Q_k^− = U_k^−
   otherwise: 9b. Set t_k = −1; Q_k^+ = U_k^− \ V_k^−; Q_k^− = U_k^+.
10. If there exists an output index k such that Q_k^+ ≠ ∅ and Q_k^− ≠ ∅, go to step 3.

Fig. 2. Extension with output training of the procedure of sequential learning.


Let I_k denote the set of indexes of the hidden units generated when Step 4 chooses the kth output (I_k ⊆ {1, ..., h}). Moreover, let R_j^+, for j ∈ I_k, be the set of input patterns correctly classified by the neuron X_j.

By construction the sets R_j^+, j ∈ I_k, are pairwise disjoint. Thus, in the worst case, all the input patterns belonging to P_k^+ or P_k^− will be contained in the union of the sets R_j^+, for j ∈ I_k. Let us suppose, without loss of generality, that there exists a subset J ⊆ I_k such that the union of the sets R_j^+, j ∈ J, contains all such patterns.

In this case, as derived from Theorem 1, the neuron Y_k correctly classifies all the input patterns in the training set, with regard to the kth output, if a choice (5) of the weights v_kj analogous to (3) is made, where |J| is the number of elements in the set J. The properties of the pocket algorithm assure that solution (5) can be found asymptotically.

The two extended versions of the procedure of sequential learning are practically useful if a fast training algorithm for the hidden neurons is available. This algorithm must correctly classify all the input patterns belonging to a given class and some input patterns belonging to the opposite class. No method in the current literature, except that proposed in [6], pursues this object; so a suitable technique will be described in the following section.

III. THE WINDOW NEURON

Currently available constructive methods for binary training sets build neural networks that exclusively contain threshold units. In these neurons the output is given by (1), here reported for a generic input pattern ξ

   output = σ_y( Σ_{i=0}^{n} w_i ξ_i ).

As mentioned above, a new component ξ_0 = +1 is added to the input vector ξ to consider the neuron bias as a normal weight. In the following we use the term threshold network to indicate a two-layer perceptron containing only threshold units. A well-known result is the following [2]: given a training set made of p binary input-output relations (ξ^μ, ζ^μ), μ = 1, ..., p, it is always possible to find a threshold network that correctly satisfies these relations.

Now, let us introduce a new kind of neuron having a window-shaped activation function; its output is given by

   output = +1 if | Σ_{i=0}^{n} w_i ξ_i | ≤ δ,  −1 otherwise.   (6)

The real value δ is called amplitude and is meaningful only from an implementation point of view. For the sake of simplicity, in the whole description we could set δ = 0, but when the computations are made by a machine (a computer or a dedicated support), the summation in (6) can move away from its theoretical value because of precision errors. Thus, the introduction of a small amplitude δ allows the practical use of the window neuron.
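A minimal sketch of the two activation functions, written directly from (1) and the reconstructed definition (6); the tolerance delta plays the role of the amplitude δ, and the grandmother-cell weights used in the check follow the construction given below in the proof of Theorem 3.

```python
def threshold_output(weights, pattern):
    """Threshold unit (1): +1 if the weighted sum is nonnegative, -1 otherwise.
    pattern[0] is the constant component xi_0 = +1 carrying the bias."""
    s = sum(w * x for w, x in zip(weights, pattern))
    return +1 if s >= 0 else -1

def window_output(weights, pattern, delta=1e-6):
    """Window unit (6): +1 only if the weighted sum falls inside [-delta, +delta]."""
    s = sum(w * x for w, x in zip(weights, pattern))
    return +1 if abs(s) <= delta else -1

# A window unit with weights (-n, xi_1, ..., xi_n) acts as a grandmother cell:
# it activates only on the pattern it was built from.
target = (+1, -1, +1, +1, -1)
w = (-len(target),) + target
print(window_output(w, (+1,) + target))        # +1: the stored pattern
print(window_output(w, (+1, -1, -1, +1, -1)))  # -1: any other pattern
```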

A window neuron can always be substituted by three threshold neurons; in fact, the output of a generic window neuron can be written as the logic AND of two threshold units, one checking Σ_i w_i ξ_i ≥ −δ and the other checking Σ_i w_i ξ_i ≤ δ, and the AND itself can be realized by a third threshold unit. On the contrary, it seems that a direct correspondence in the opposite sense does not exist. Nevertheless, the following result shows the generality of the window neuron.

Theorem 3: It is always possible to find a two-layer perceptron containing only window neurons in the hidden layer which correctly satisfies all the input-output relations of a given binary training set.

Proof: The construction follows the same steps as for threshold networks. Let (ξ^μ, ζ^μ), μ = 1, ..., p, be the p binary input-output relations contained in a given training set. For the sake of simplicity, let us consider the case m = 1 (output pattern ζ^μ with a single binary value); the neural network for generic m can be obtained by iterating the following procedure.

Let s be the output value (−1 or +1) associated with the least number of input patterns in the training set. For every ξ^μ having ζ^μ = s, a window neuron X_μ is added to the hidden layer with weights

   w_μ0 = −n;  w_μi = ξ_i^μ  for i = 1, ..., n.

Such a unit is a grandmother cell [2] for the pattern ξ^μ (it is active only in correspondence of ξ^μ), as can be shown by simple inspection.

Now, a threshold output neuron performing the logic operation OR (NOR) if s = +1 (s = −1) completes the construction.

A two-layer perceptron with window neurons in the hidden layer will be called a window network in the following. Two results are readily determined. The parity operation [13] can be realized by a window network containing ⌊(n + 1)/2⌋ hidden units. A possible choice of weights is

   w_j0 = n − 4j + 2;  w_ji = +1

for i = 1, ..., n and j = 1, ..., ⌊(n + 1)/2⌋. In this configuration the jth hidden unit is active if the input pattern contains 2j − 1 components with value +1. Then, the output neuron executes a logic OR and produces the correct result.
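The parity construction is easy to verify numerically. The sketch below enumerates all n-bit patterns and checks that the window network with weights w_j0 = n − 4j + 2, w_ji = +1, followed by a logic OR, outputs +1 exactly when the number of +1 components is odd; the verification itself is an illustration, not part of the paper.

```python
from itertools import product

def window(weights, pattern, delta=1e-9):
    s = sum(w * x for w, x in zip(weights, pattern))
    return +1 if abs(s) <= delta else -1

def parity_window_network(pattern):
    """Window network for parity: the jth hidden unit fires when the pattern
    contains exactly 2j-1 components equal to +1; the output neuron is an OR."""
    n = len(pattern)
    hidden = []
    for j in range(1, (n + 1) // 2 + 1):
        weights = (n - 4 * j + 2,) + (1,) * n      # bias w_j0, then w_ji = +1
        hidden.append(window(weights, (1,) + pattern))
    return +1 if any(h == +1 for h in hidden) else -1

n = 6
for pattern in product((-1, +1), repeat=n):
    ones = pattern.count(+1)
    expected = +1 if ones % 2 == 1 else -1         # odd number of +1 components
    assert parity_window_network(pattern) == expected
print("parity reproduced on all", 2 ** n, "patterns")
```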


A single window neuron performs the symmetry operation [13]. A possible choice of weights is the following

   w_0 = 0
   w_i = 2^{⌊n/2⌋ − i}   for i = 1, ..., ⌊n/2⌋
   w_i = 0   for i = (n + 1)/2, if n odd
   w_i = −w_{n−i+1}   for i = n − ⌊n/2⌋ + 1, ..., n.

In these cases window networks are considerably more compact than the corresponding threshold networks. Unfortunately, this is not a general result, since there exist linearly separable training sets that lead to more complex window networks.

A. The Training Algorithm for Window Neurons

Given a generic training set (ξ^μ, ζ^μ), μ = 1, ..., p, we wish to find a learning algorithm that provides the weights w_i, i = 0, 1, ..., n, for a window neuron that correctly classifies all the patterns ξ^μ for which ζ^μ = −1 and the maximum number of patterns ξ^μ for which ζ^μ = +1.

The following result plays a fundamental role.

Theorem 4: It is always possible to construct a window neuron that provides the desired outputs for a given set of linearly independent input patterns.

Proof: Consider the matrix A, having size p × (n + 1), whose rows are formed by the input patterns ξ^μ of the training set; let r be the rank of this matrix. If q ≤ r let B denote a nonsingular minor containing q linearly independent input patterns of the training set. Suppose, without loss of generality, that B is formed by the first q rows (j = 1, ..., q) and the first q columns (i = 0, 1, ..., q − 1) of A.

Then, consider the following system of linear algebraic equations

   B w = z   (7)

where the jth component of the vector z is given by

   z_j = (1 − ζ^{μ_j}) − Σ_{i=q}^{n} w_i ξ_i^{μ_j}   (8)

and (ξ^{μ_j}, ζ^{μ_j}) is the input-output relation of the training set associated with the jth row of A. The weights w_i, for i = q, ..., n, are arbitrarily chosen.

By solving (7) we obtain the weights w_i for a window neuron that correctly satisfies the q relations (ξ^{μ_j}, ζ^{μ_j}), for j = 1, ..., q. In fact, we have from (8)

   Σ_{i=0}^{n} w_i ξ_i^{μ_j} = Σ_{i=0}^{q−1} w_i ξ_i^{μ_j} + Σ_{i=q}^{n} w_i ξ_i^{μ_j} = 1 − ζ^{μ_j}

hence the output of the window neuron is ζ^{μ_j} for j = 1, ..., q, if 0 ≤ δ < 2.

Then, the main object of correctly classifying all the input patterns ξ^μ having ζ^μ = −1 can be reached by following two steps:
1) Put in B only patterns ξ^{μ_j} with ζ^{μ_j} = +1.
2) Search for a window neuron that provides output −1 for the greatest number of patterns not contained in the minor B.

The following theorem offers an operative approach.

Theorem 5: If the minor B has dimension q ≤ n and contains only patterns ξ^{μ_j}, j = 1, ..., q, for which ζ^{μ_j} = +1, then the window neuron obtained by solving (7) gives output +1 for all the input patterns linearly dependent on ξ^{μ_1}, ..., ξ^{μ_q}. Moreover, if the arbitrary weights w_i, i = q, ..., n, are linearly independent in R as a Q-vector space, then every input pattern which is linearly independent of ξ^{μ_1}, ..., ξ^{μ_q} in Q^{n+1} yields output −1.

Proof: Since ζ^{μ_j} = +1 for j = 1, ..., q, from (7) we obtain

   Σ_{i=0}^{n} w_i ξ_i^{μ_j} = 0   for j = 1, ..., q.

Now, consider an input pattern ξ^ν not contained in the minor B; the q + 1 vectors (ξ_0^ν, ..., ξ_{q−1}^ν)^t, (ξ_0^{μ_1}, ..., ξ_{q−1}^{μ_1})^t, ..., (ξ_0^{μ_q}, ..., ξ_{q−1}^{μ_q})^t are linearly dependent in Q^q, Q being the rational field (t denotes transposition). Thus, there exist constants λ_1, ..., λ_q ∈ Q, some of which are different from zero, such that

   ξ_i^ν = Σ_{j=1}^{q} λ_j ξ_i^{μ_j}   for i = 0, 1, ..., q − 1   (9)

and consequently

   Σ_{i=0}^{n} w_i ξ_i^ν = Σ_{i=q}^{n} w_i ( ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{μ_j} ).   (10)

If the patterns ξ^ν, ξ^{μ_1}, ..., ξ^{μ_q} are linearly dependent in Q^{n+1}, then the right-hand term of (10) is null for some rational constants λ_j, j = 1, ..., q, that satisfy (9). Hence, the corresponding output of the window neuron is +1.

On the contrary, if the vectors ξ^ν, ξ^{μ_1}, ..., ξ^{μ_q} are linearly independent in Q^{n+1}, then there exists at least one index i (q ≤ i ≤ n) such that

   ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{μ_j} ≠ 0.

But the terms ξ_i^ν − Σ_j λ_j ξ_i^{μ_j} are rational numbers; so, if the w_i, i = q, ..., n, are linearly independent in R as a Q-vector space, we obtain

   Σ_{i=0}^{n} w_i ξ_i^ν = Σ_{i=q}^{n} w_i ( ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{μ_j} ) ≠ 0.

Hence, the output of the window neuron will be −1 if the amplitude δ is small enough.


TRAINING ALGORITHM FOR WINDOW NEURONS

INPUTS
P+, P− = sets of patterns forming the training set.

OUTPUTS
w_i = weights for the resulting window neuron.

TEMPORARY ITEMS
q = dimension of the current minor B.
I = set of pattern components contained in the current minor B.
J = set of pattern indexes contained in the current minor B.
i = currently examined pattern component.
μ = currently examined pattern index.
v = number of patterns in P+ correctly classified by the current window neuron.
i_max, μ_max, v_max = optimal values of i, μ, and v in the current iteration.

ALGORITHM
1. Choose a pattern ξ^{μ_1} ∈ P+ and a component i_1 (0 ≤ i_1 ≤ n) in an arbitrary way.
2. Set: q = 1; I = {i_1}; J = {μ_1}; w_i = √ν_i for i = 0, 1, ..., n.
3. Set: i_max = 0; μ_max = 0; v_max = 0.
4. For every i ∉ I (0 ≤ i ≤ n) and for every μ ∉ J with ξ^μ ∈ P+:
   4a. Generate a minor B with dimension q + 1 starting from the sets I = {i_1, ..., i_q, i} and J = {μ_1, ..., μ_q, μ}.
   4b. If such a minor B is nonsingular, then:
       4ba. Solve the corresponding system (7).
       4bb. If the weight vector [w_i] correctly classifies all the patterns of P−, compute the number v of patterns belonging to P+ which are correctly classified. If v > v_max, set: i_max = i; μ_max = μ; v_max = v.
5. If v_max > 0, then:
   5a. Set: i_{q+1} = i_max; μ_{q+1} = μ_max; q = q + 1.
   5b. Go to step 3.

Fig. 3. Training algorithm for window neurons.

Several choices for the real numbers w_i are available in the literature, but in most cases they require a representation range that is too large for a practical implementation. In the simulations we have used the following choice [14]

   w_i = √ν_i   for i = 0, 1, 2, ...

where the ν_i are positive squarefree integers (not divisible by any perfect square), sorted in increasing order (by definition ν_0 = 1).

Theorem 5 gives a method of reaching our desired result: to correctly satisfy all the input-output relations (ξ^μ, ζ^μ) for which ζ^μ = −1 and some pairs (ξ^μ, ζ^μ) having ζ^μ = +1. The final step is to maximize the total number of correct outputs.

In all likelihood, the optimal configuration could be reached only through an exhaustive inspection of all the possible solutions; unfortunately, this leads to a prohibitive computing time. Good results can be reached by employing a simple greedy procedure: the minor B is built by successively adding the row and the column of A that maximize the number of patterns with positive output correctly classified by the corresponding window neuron. Obviously, the number of correctly classified patterns with negative output must be kept maximum.

A detailed description of the resulting training algorithm is shown in Fig. 3. It employs two sets I and J for the progressive construction of minor B; these sets are carefully increased to maximize the number of patterns with correct output. The algorithm stops when it cannot successfully add rows and columns to B.
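The greedy construction of Fig. 3 can be sketched as follows, under the reconstruction given above: the minor B is grown one row and one column at a time, the linear system (7) is solved with the free weights fixed at square roots of squarefree integers, and a candidate is accepted only if all the patterns of P− remain inactive while the number of correctly classified P+ patterns increases. numpy is used for the linear algebra; the data layout, tolerances, and tie-breaking are illustrative choices, not the author's implementation.

```python
import numpy as np

# First few squarefree integers; enough for the small examples of a sketch.
SQUAREFREE = [1, 2, 3, 5, 6, 7, 10, 11, 13, 14, 15, 17, 19, 21, 22, 23]

def window_out(w, x, delta=1e-6):
    return +1 if abs(np.dot(w, x)) <= delta else -1

def solve_minor(P_plus_rows, I, J):
    """Solve system (7) for the components in I using the P+ patterns indexed
    by J; the remaining weights keep the fixed values sqrt(nu_i).
    Returns the full weight vector, or None if the minor is singular."""
    n1 = len(P_plus_rows[0])                       # n + 1 components (bias included)
    assert n1 <= len(SQUAREFREE)
    w = np.sqrt(np.array(SQUAREFREE[:n1], dtype=float))
    B = np.array([[P_plus_rows[j][i] for i in I] for j in J], dtype=float)
    if abs(np.linalg.det(B)) < 1e-9:
        return None
    free = [i for i in range(n1) if i not in I]
    # z_j makes the total weighted sum zero, so every pattern of the minor activates.
    z = np.array([-sum(w[i] * P_plus_rows[j][i] for i in free) for j in J])
    w[I] = np.linalg.solve(B, z)
    return w

def train_window_neuron(P_plus, P_minus):
    P_plus = [np.array((1,) + tuple(p), dtype=float) for p in P_plus]
    P_minus = [np.array((1,) + tuple(p), dtype=float) for p in P_minus]
    I, J = [0], [0]                                 # arbitrary starting column/row
    best_w, best_v = None, 0
    while True:
        found = None
        for i in range(len(P_plus[0])):
            if i in I:
                continue
            for j in range(len(P_plus)):
                if j in J:
                    continue
                w = solve_minor(P_plus, I + [i], J + [j])
                if w is None:
                    continue
                if any(window_out(w, x) != -1 for x in P_minus):
                    continue                        # every P- pattern must stay inactive
                v = sum(window_out(w, x) == +1 for x in P_plus)
                if v > best_v:
                    best_v, found = v, (i, j, w)
        if found is None:
            return best_w, best_v                   # no row/column can be added
        i, j, best_w = found
        I.append(i)
        J.append(j)
```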


This learning method can easily be modified if different goals are to be pursued. Its insertion in the procedure of sequential learning is straightforward and leads to a general technique called SWL, which exhibits interesting features, as we will see in the tests.

IV. THE HAMMING CLUSTERING

Although SWL allows the efficient construction of a two-layer perceptron starting from a given binary training set, its generalization ability depends on a variety of factors that are not directly controllable. Concerning this, HC approaches the solution of two important problems:

1) To increase the algorithm locality for improving the generalization ability of the produced neural network in real world applications [10].
2) To find redundant inputs which do not affect one or more outputs. By removing the corresponding connections, the number of weights in the resulting network, and consequently its complexity, is reduced. This also improves the generalization ability of the system [11], [15].

A natural method of proceeding is grouping the patterns belonging to the same class that are close to each other according to the Hamming distance. This produces some clusters in the input space which determine the class extension. The clustering must be made in such a way that it can directly be inserted in the SWL algorithm.

The concept of template plays a fundamental role in HC; it is very similar to the concept of schema widely used in the theory of genetic algorithms [16]. Let us denote with the symbols '+' and '−' the two binary states (corresponding to the integer values +1 and −1); with this notation the pattern ξ = (+1, −1, −1, +1, −1)^t is equivalent to the string +−−+−.

In HC a template is a string of binary components that can also contain don't care symbols, denoted by the symbol '0'. A template represents the set of patterns which are obtained by expanding the don't care symbols. For example, in the space of binary patterns with five components, the template +0−0− is equivalent to the set {+−−−−, +−−+−, ++−−−, ++−+−}.

The template shows two important properties.
1) It has a direct correspondence with a logic AND among the pattern components. For example, the template +0−0− above corresponds to the operation ξ_1 AND (NOT ξ_3) AND (NOT ξ_5), where ξ_1, ξ_3, and ξ_5 are, respectively, the first, the third, and the fifth component of a generic pattern ξ. Only the patterns in the equivalent set generate the value +1 as a result.

2) The construction of a window neuron that performs the AND associated with a given template is straightforward. Every weight must be set to the corresponding value in the template, whereas the bias must be equal to the number of binary variables in the template, changed in sign. For example, the template +0−0− corresponds to the window neuron having weights

      w_0 = −3, w_1 = +1, w_2 = 0, w_3 = −1, w_4 = 0, w_5 = −1.

   This unit is active only for the patterns in the equivalent set of the template above and realizes the desired logic operation AND.
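Property 2 translates directly into code. The sketch below builds the window-neuron weights from a template string over the alphabet {'+', '-', '0'} and checks the unit on two patterns, using the reconstructed window activation (6); it is only an illustration of the correspondence.

```python
def template_to_window(template):
    """Weights of the window neuron realizing the AND of a template:
    each weight copies the template value, the bias is minus the number
    of binary (non-don't-care) positions."""
    values = {'+': 1, '-': -1, '0': 0}
    w = [values[c] for c in template]
    bias = -sum(1 for c in template if c != '0')
    return [bias] + w

def window_out(w, pattern, delta=1e-9):
    s = sum(wi * xi for wi, xi in zip(w, [1] + list(pattern)))
    return +1 if abs(s) <= delta else -1

w = template_to_window("+0-0-")              # weights (-3, +1, 0, -1, 0, -1)
print(window_out(w, (+1, -1, -1, +1, -1)))   # pattern in the equivalent set -> +1
print(window_out(w, (-1, -1, -1, +1, -1)))   # first component violated -> -1
```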

HC proceeds by extending and subdividing clusters of templates; every cluster contains templates having don’t care symbols in the same locations. Let P+ and P- be the sets of patterns forming the given training set; the method initially considers only one cluster containing all the patterns of P+. Then, this cluster undergoes the following two actions

Extension: Each binary component in the cluster is replaced one at a time by the don't care symbol and the corresponding number of conflicts with the patterns of P− is computed.

Subdivision: The binary component with the minimum number of conflicts is considered. If this number is greater than zero, the cluster is subdivided into two subsets: the first contains the templates that do not lead to conflicts (with the selected binary component replaced by the don't care symbol) and the latter is formed by the remaining templates (unchanged).

These two actions are then applied to the resulting clusters and the procedure is iterated until no more extensions can be done.
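A compact sketch of the extension/subdivision loop described above. A cluster is represented by a list of template strings sharing the same don't care positions (P+ is assumed nonempty); the conflict count of a candidate component is the number of P− patterns falling into the equivalent set of some extended template. The data structures, the tie-breaking, and the termination test are illustrative choices, not the paper's implementation.

```python
def matches(template, pattern):
    """True if the +/-1 pattern belongs to the equivalent set of the template."""
    return all(t == '0' or (t == '+') == (x == +1) for t, x in zip(template, pattern))

def pattern_to_template(p):
    return ''.join('+' if x == +1 else '-' for x in p)

def hamming_clustering(P_plus, P_minus):
    clusters = [sorted({pattern_to_template(p) for p in P_plus})]
    finished = []
    while clusters:
        cluster = clusters.pop()
        candidates = [i for i in range(len(cluster[0]))
                      if any(t[i] != '0' for t in cluster)]
        best = None
        for i in candidates:                        # extension: try each component
            extended = {t[:i] + '0' + t[i + 1:] for t in cluster}
            conflicts = sum(any(matches(t, p) for t in extended) for p in P_minus)
            if best is None or conflicts < best[1]:
                best = (i, conflicts)
        if best is None:                            # only don't care symbols left
            finished.append(cluster)
            continue
        i, conflicts = best                         # subdivision on the best component
        if conflicts == 0:
            clusters.append(sorted({t[:i] + '0' + t[i + 1:] for t in cluster}))
            continue
        ok = [t for t in cluster
              if not any(matches(t[:i] + '0' + t[i + 1:], p) for p in P_minus)]
        rest = [t for t in cluster if t not in ok]
        if ok:
            clusters.append(sorted({t[:i] + '0' + t[i + 1:] for t in ok}))
            clusters.append(rest)
        else:
            finished.append(rest)                   # no further extension is possible
    return finished
```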

The method is better understood if we consider in detail a clarifying example. Let P+ and P− be the sets of five-component patterns (11) generated by the logic operation

   ζ = (ξ_1 AND ξ_3) OR ξ_5.   (12)

Initially the method considers only one cluster, containing all the patterns of P+, on which the first execution of extension and subdivision is performed.

The selected component is ξ_2, which does not generate conflicts with the patterns of P−; thus, at the end of the first subdivision, we obtain a single cluster with five templates, each of which contains a don't care symbol in the second location. Extension and subdivision are then executed again.


Fig. 4. Neural network corresponding to example (12).

The component ξ_2 is not considered any more, since it now contains only don't care symbols. Again, the selected component ξ_4 does not generate conflicts with the patterns of P−, and we obtain a single cluster with four templates. The next application of extension and subdivision yields the generation of two clusters, and a last iteration gives the desired result: the final clusters {0000+} and {+0+00}.

The resulting clusters directly correspond to the logic operation (12) from which the training set was derived. In fact, the template 0000+ (first cluster) corresponds to the component ξ_5 that forms the second operand of the Boolean OR, whereas the template +0+00 corresponds to the first operand ξ_1 AND ξ_3. The binary neural network providing the correct output ζ is then easily constructed by placing a hidden window neuron for each template and adding an output threshold neuron that performs the required operation OR. Such a network is shown in Fig. 4.

This example shows how HC reaches the proposed goals: first of all, the boolean function generating the training set has been found. Then, the number of connections in the resulting network has been minimized; finally, redundant inputs have been determined. These three results are of great practical importance.

Furthermore, the equivalent threshold network can directly be built; so HC has general validity. It generates AND-OR networks, in which the hidden layer performs AND operations among the inputs and the final layer computes OR operations among the outputs of hidden neurons. Now, a fundamental result of the theory of logic networks says: every boolean function (hence every training set) can be written in the AND-OR form [17]. Consequently, the network above is general.

More compact neural networks can exist, however. For example, the well-known parity function needs 2^{n−1} hidden neurons in the AND-OR configuration, whereas, as shown in Section III, a window network with ⌊(n + 1)/2⌋ hidden units can provide the correct output.

This remark shows the importance of a deep integration between SWL and HC. The following definition introduces a possible approach: if a pattern of P+ belongs to the equivalent set of a given template, then we say that this template covers that pattern. For example, the template 0000+ covers the pattern +−−−+ in the training set (11). Then we use the term covering to denote the number of patterns of P+ covered by the templates of a given cluster.

With these definitions, a practical method for integrating HC and the training algorithm for window neurons follows these steps:
1) Starting from the pattern sets P+ and P− (training set for the window neuron), perform the HC and reach the final clusters of templates.
2) Choose the cluster having maximum covering and consider the binary components of this cluster (leave out don't care symbols).
3) Construct a new training set from P+ and P−, in which only the selected components are kept.
4) Generate the corresponding window neuron and remove the connections corresponding to the neglected components (by zeroing their weights).

Such a method performs the training of the window neuron on a reduced number of components and recovers the computing time lost in the execution of HC. Moreover, it can successfully be inserted in the procedure of sequential learning.
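A small sketch of steps 1)-4) above. It assumes that the final clusters produced by HC are available as lists of template strings and that a training routine for window neurons (such as the earlier sketch) is passed in as the callable train; both names are illustrative.

```python
def covering(cluster, P_plus):
    """Number of P+ patterns covered by the templates of a cluster."""
    def matches(t, p):
        return all(c == '0' or (c == '+') == (x == +1) for c, x in zip(t, p))
    return sum(any(matches(t, p) for t in cluster) for p in P_plus)

def reduced_training(clusters, P_plus, P_minus, train):
    """Pick the cluster with maximum covering, keep only its binary components,
    train the window neuron on the reduced patterns and re-expand the weights
    with zeros for the neglected components."""
    best = max(clusters, key=lambda c: covering(c, P_plus))
    keep = [i for i in range(len(best[0])) if any(t[i] != '0' for t in best)]
    reduce = lambda p: tuple(p[i] for i in keep)
    w_red, _ = train([reduce(p) for p in P_plus], [reduce(p) for p in P_minus])
    weights = [0.0] * (len(best[0]) + 1)            # +1 for the bias component
    weights[0] = w_red[0]
    for pos, i in enumerate(keep):
        weights[i + 1] = w_red[pos + 1]
    return weights
```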

V. TESTS AND RESULTS

The proposed techniques were tested on several benchmarks to point out their performances, also in comparison with other constructive methods. These groups of tests regard different situations, each of which emphasizes a particular characteristic of the learning algorithm.

The first group concerns the so-called exhaustive simulations; in these cases the whole truth table of a Boolean function is given as a training set and the algorithm is required to minimize the number of weights in the resulting network.

The second set of trials refers to binary generalizations; the training set is binary and incomplete. SWL must return a configuration that minimizes the number of errors on a given test set.

Finally, the third group of simulations concerns problems of real generalization; in these cases the training set contains real patterns and is again incomplete. An A/D conversion is performed, and the resulting set of binary patterns is used by SWL for generating a neural network that tries to minimize the number of errors on a given test set.

The reported computing times refer to a C code running on a DECstation 5000/200; they provide only a first indication about the convergence speed of the proposed techniques. In fact, a correct evaluation requires the comparison with other algorithms and consequently the definition of a proper environment in which the tests have to be made. For the sake of brevity, we defer such an evaluation to a subsequent paper.

A. Exhaustive Simulations

Three groups of trials analyzed the goodness of the configurations obtained by SWL. In these simulations HC was not used, because it improves only the generalization ability of a learning algorithm.


Fig. 5. Simulation results for the random function: number of units in the networks generated by the tiling, upstart, and SWL algorithms versus the number of inputs.

Fig. 6. Generalization results for the parity function: correctness percentage on the test set versus the number of patterns in the training set, for SWL with and without HC.

Parity Function: Training was made for n = 2, 3, ..., 10, and SWL was always able to find the optimal configuration with ⌊(n + 1)/2⌋ hidden neurons. The time required for the construction of the network went from 0.02 seconds (n = 2) to 62 seconds (n = 10).

Symmetry Function: Also in this case the simulations with n = 2, 3, ..., 10 always yielded the minimum network containing only one window neuron. Less than two seconds was sufficient for every trial.

Random Function: A group of tests involves the random function, whose output is given by a uniform probability distribution. It allows an interesting comparison with other constructive methods, like the tiling [4] and upstart [5] algorithms. Fig. 5 shows the number of units in the neural networks realized by the three techniques for n = 4, 5, ..., 10. Every point is the average of 50 trials with different training sets.

The networks generated by SWL are more compact and the difference increases with the number n of inputs. This shows the efficiency of SWL in terms of configuration complexity.

The computational task is very heavy, however: the CPU time required grows exponentially with n from 0.03 (n = 4) to 524 seconds (n = 10).

B. Binary Generalizations

Parity and symmetry functions were also used for the performance analysis on binary generalizations. A third group of trials concerns the monk problems [18], an interesting benchmark both for connectionist and symbolic methods.

Fig. 7. Generalization results for the symmetry function: correctness percentage on the test set versus the number of patterns in the training set, for SWL with and without HC.

TABLE I. Results for the application of SWL to the monk problems.

Parity Function: With a number n = 10 of inputs, we considered a training set of p randomly chosen patterns. The remaining 1024 − p patterns formed the test set. Fifty different trials were executed for each value of p = 100, 200, ..., 900, obtaining the correctness percentages shown in Fig. 6.

Since the parity function does not present any locality property (the value at a point is uncorrelated with the values in its neighborhood), the application of HC always leads to poorer performances.

Symmetry Function: Tests on the symmetry function were performed in the same setting as for parity. Thus, the number of inputs was again n = 10, and 50 runs were made for each value of p = 100, 200, ..., 900. The results are shown in Fig. 7.

Also in this case the lack of locality makes HC useless; the obtained generalization ability is however very interesting.

Monk Problems: Three classification problems have been proposed [18] as benchmarks for learning algorithms. Since their inputs are not binary, a conversion was needed before the application of SWL. We used one bit for each possible value, so that the resulting binary neural network had n = 17 inputs and a single output. In this way the results in Table I were obtained.

Here HC improves the performance on all the tests and reduces the number of weights in the networks, allowing an easier rule extraction.

The computational effort is rather low: about 1 second for problems #1 and #2, about 5 seconds for problem #3.

C. Real Generalizations

Finally, two tests were devoted to classification problems with real inputs. In these cases the application of SWL (and any other binary constructive method) is influenced by the A/D conversion used in the preprocessing phase. Among the variety of proposed methods, we chose the Gray code [17], which maps close integer numbers to similar binary strings (with respect to the Hamming distance).
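The Gray-code preprocessing can be sketched as follows: each real input is quantized to an integer with the chosen number of bits, and the integer is converted to its reflected Gray code, so that neighboring quantization levels differ in a single bit. The quantization range and the ±1 coding are assumptions of the example.

```python
def gray_code(value, bits):
    """Reflected binary Gray code of an integer, as a tuple of +/-1 components."""
    g = value ^ (value >> 1)
    return tuple(+1 if (g >> (bits - 1 - k)) & 1 else -1 for k in range(bits))

def quantize(x, lo, hi, bits):
    """Map a real input in [lo, hi] to its Gray-coded binary string."""
    levels = (1 << bits) - 1
    v = round((x - lo) / (hi - lo) * levels)
    return gray_code(min(max(v, 0), levels), bits)

# Close real values map to binary strings at small Hamming distance:
a, b = quantize(5.10, 0.0, 10.0, 6), quantize(5.25, 0.0, 10.0, 6)
print(a, b, sum(x != y for x, y in zip(a, b)))
```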


Fig. 8. Original (a) and reconstructed (b) pattern for the test on circle recognition.

Iris Problem: The dataset of Anderson-Fisher [19] for the classification of three species of Iris starting from petal and sepal dimensions is a classical test on real generalization. If we use an eight-bit Gray code for each real input we obtain a training set containing 100 strings of length n = 32.

The neural network constructed by SWL with HC correctly classifies every pattern in the training set and makes three errors (the theoretical minimum) on the test set. Such a network contains six hidden neurons and only 37 connections in two layers.

The CPU time needed for the construction is 10 seconds.

Circle Recognition: The last trial concerns the recognition of a circle from randomly placed points. Let us consider the pattern in Fig. 8(a), in which white and black zones correspond to positive and negative output, respectively. One thousand random points formed the training set, while another 1000 points were contained in the test set. Since the areas of the two regions are equal, about 50% of the points in either set fall inside the circle.

By using a Gray code with six bits for each real input, we obtained from SWL with HC a binary neural network containing 21 hidden units and 119 connections in two layers. No error was made on the training set, whereas 4.1% is the percentage of errors encountered on the test set. This result shows that the quantization effects of the A/D conversion were overcome by HC, leading to an interesting generalization ability also with real inputs. The reconstructed pattern is presented in Fig. 8(b).

Five seconds of computation was sufficient for obtaining this result.

VI. CONCLUSION

A new constructive method for the training of two-layer perceptrons with binary inputs and outputs has been presented. This algorithm is based on two main blocks:

1) A modification of the procedure of sequential learning [6], with practical extensions to the construction of neural networks with any number of outputs.
2) The introduction of a new kind of neuron, having a window-shaped activation function, which allows a fast and efficient training algorithm.

Tests on this procedure, called sequential window learning, show interesting results both in terms of computational speed and compactness of constructed networks.

Moreover, a new preprocessing algorithm, called hamming clustering, has been introduced for improving the generalization ability of constructive methods. Its application is particularly useful in case of training sets with high locality; again, some simulations have shown the characteristics of HC.

Currently, SWL and HC are being applied to real world problems, such as handwritten character recognition and genomic sequence analysis, giving some first promising results.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[2] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[3] S. I. Gallant, Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press, 1993.
[4] M. Mézard and J.-P. Nadal, "Learning in feedforward layered networks: The tiling algorithm," J. Physics A, vol. 22, pp. 2191-2204, 1989.
[5] M. Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks," Neural Computation, vol. 2, pp. 198-209, 1990.
[6] M. Marchand, M. Golea, and P. Ruján, "A convergence theorem for sequential learning in two-layer perceptrons," Europhysics Lett., vol. 11, pp. 487-492, 1990.
[7] D. L. Gray and A. N. Michel, "A training algorithm for binary feedforward neural networks," IEEE Trans. Neural Networks, vol. 3, pp. 176-194, 1992.
[8] F. Rosenblatt, Principles of Neurodynamics. New York: Spartan, 1962.
[9] S. I. Gallant, "Perceptron-based learning algorithms," IEEE Trans. Neural Networks, vol. 1, pp. 179-191, 1990.
[10] L. Bottou and V. Vapnik, "Local learning algorithms," Neural Computation, vol. 4, pp. 888-900, 1992.
[11] E. B. Baum and D. Haussler, "What size net gives valid generalization?" Neural Computation, vol. 1, pp. 151-160, 1989.
[12] M. Frean, "A 'thermal' perceptron learning rule," Neural Computation, vol. 4, pp. 946-957, 1992.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318-362.
[14] D. A. Marcus, Number Fields. New York: Springer-Verlag, 1977.
[15] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," J. ACM, vol. 36, pp. 929-965, 1989.
[16] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[17] H. W. Gschwind and E. J. McCluskey, Design of Digital Computers. New York: Springer-Verlag, 1975.
[18] S. B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, K. De Jong, S. Džeroski, S. E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R. S. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang, The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Carnegie Mellon Univ. Rep. CMU-CS-91-197, 1991.
[19] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 8, pp. 376-386, 1936.

Marco Muselli was born in 1962. He received the B.A. degree in electronic engineering from the University of Genoa, Italy, in 1985.

Mr. Muselli is currently a Researcher at the Istituto per i Circuiti Elettronici of CNR in Genoa. His research interests include neural networks, global optimization, genetic algorithms, and characterization of nonlinear systems.