
NEURAL NETWORKS IN FAULT DETECTION: A Case Study¹

    D.R. Hush, C.T. Abdallah, G.L. Heileman, and D. Docampo

EECE Department, University of New Mexico,

    Albuquerque, NM 87131, USA.

Abstract: In this paper we study the application of neural networks in the area of fault detection. In particular, neural networks are used for fault detection in real vibrational data. The study is one of the first to include a large set of real vibrational data and to illustrate the potential as well as the limitations of neural networks for fault detection.

1. Introduction

There has been considerable work in the areas of fault detection and isolation, reviewed in [4], [15], [8]. We first comment on standard designs to detect faults using hardware redundancy. This approach relies on duplicating the functionality of some critical components in the system, and on using majority voting logic to decide on the presence, location, and type of faults. Such an approach is of course costly and requires the anticipation of fault types and locations.

On the other hand, the analytical redundancy (also termed functional redundancy) approach exploits the inherent redundancy in the relationships between the system's inputs and outputs. The approach is appealing in terms of hardware cost, but can also suffer from poor models and may require sophisticated signal processing techniques in order to extract useful information out of the data.

There are basically two generic ways to approach the analytical fault detection problem: the model-based approach and the data-based approach.

In the model-based approach, the engineer has access to a model of the system whose behavior is being monitored. The model could be analytical or knowledge-based. Most applications of this approach have dealt with linear systems, since they can be easily described and studied.

¹The research of D. Hush, C.T. Abdallah, and G. Heileman was supported by a grant from Chadwick-Helmuth under Contract W-300445. The authors are grateful to JP Cain of Chadwick-Helmuth for his help in collecting the data and his guidance. The authors also acknowledge fruitful discussions and help from Mr. Jose Luis Alba, of the University of Vigo, Spain, and the generous support of ISTEC.

In the data-based approach one bypasses the step of obtaining a mathematical model and deals directly with the data. This is more appealing when the process being monitored is not known to be linear, or when a model is too complicated to be extracted from the data. It is this approach which we will concentrate on in this paper in order to evaluate the potential of neural networks as fault detectors. We can divide the data-based approach into time-domain techniques and frequency-domain techniques. Note that either technique is actually a front-end of the final fault detection system. In fact, these techniques provide different indicators which are then combined in a classifier to provide a test for the existence and type of faults. This paper discusses the potential, as well as the limitations, of neural networks in fault detection and possible accommodation in vibrational systems.

In order to address the issues discussed above, this report is organized as follows. Section 2 provides a study of neural networks in fault detection. Our approach to the fault detection and isolation problem, along with our results obtained using real data, is presented in Section 3. We have also investigated a trending approach using Fuzzy ART, the results of which are presented in Section 4. Finally, our conclusions and recommendations are provided in Section 5.

2. Applications of Neural Networks in Fault Detection

It is conceivable that a neural net can be used as a monitoring device, in order to detect major changes in the operation of the system. Specifically, one approach may be that the neural net is trained on a well-behaving system, and then operated with no further training in parallel to the actual system. The neural net output will then be compared to that of the physical system, and any anomalies in the output of our system will be detected.


This approach, labeled "trending", was not originally used in our research. However, we have included a section to discuss some preliminary results using Fuzzy ART and the trending approach at the end of the paper.

In some cases, the neural net may also be used to accommodate the change of behavior as part of a hierarchical control system [11]. In others, it is used simply as a fault detection device where the clustering capabilities of networks such as ART or CMAC are called upon [7]. As always, it is a bad idea to blindly apply the measured data to the input of a NN. First and foremost, the given data may be high dimensional, which may greatly increase the number of weights in the NN and slow down its training algorithm. Instead, it is advisable to pre-process the data in order to obtain some important features, thus reducing the dimensionality of the training data, and with it the number of required weights. Such pre-processing may be as simple as obtaining the FFT coefficients [10, 14] for a stationary time series, or more advanced approaches for nonstationary time series or for data which may not be separable based on spectral content alone.
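As an illustration of this kind of front-end, the sketch below reduces a stationary time series to per-segment magnitude-FFT features. It is a minimal example: the segment length, window, and number of retained coefficients are illustrative assumptions, not values taken from any particular study.

```python
import numpy as np

def fft_features(x, seg_len=256, n_keep=32):
    """Reduce a 1-D time series to per-segment magnitude-FFT features.

    Each non-overlapping segment of seg_len samples is Hamming-windowed
    and transformed; keeping only the first n_keep magnitude coefficients
    shrinks the input dimension (and hence the NN weight count).
    """
    n_seg = len(x) // seg_len
    window = np.hamming(seg_len)
    feats = np.empty((n_seg, n_keep))
    for i in range(n_seg):
        seg = x[i * seg_len:(i + 1) * seg_len] * window
        feats[i] = np.abs(np.fft.rfft(seg))[:n_keep]
    return feats

# Example: 4 s of a 44,100-sps signal -> (689, 32) feature matrix
x = np.random.randn(4 * 44100)
print(fft_features(x).shape)
```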

The general idea behind using a NN for fault detection can be summarized in the following steps:

1. Use a signal processing technique to obtain a figure of merit f (spectrum, cepstrum, k factor, etc.) for the different time signals.

2. If the figure of merit is high dimensional, use a feature extraction algorithm to reduce its dimensionality while keeping most of its information content. The resulting signal is f_e.

3. Train the neural network on f_e in either supervised or unsupervised mode, as discussed below.

Neural networks can be trained in one of two modes: supervised learning, when input/output data is available or when both good and bad data are available, and unsupervised learning, when only output data is available or when one has no knowledge of the quality of the data.

When one has access to input/output data and would like to obtain a functional mapping from the input to the output, a supervised learning scheme is used. Such an approach may also be used if one has labeled data ("good", "bad") in order to separate the good from the bad. In a fault detection application, where the designer knows that some data correspond to the presence and the absence of particular faults, this approach becomes obvious. The difficulty with this approach is that some data labeled "normal" may in fact be bad. Also, in order to work properly, a representative sample of all faults should be included in the training set.

One needs to use unsupervised learning when the collected data is not labeled. A neural network is then operated as a pattern classifier, trying to discover similarities between some data and anomalies amongst others. Such an approach was used in [2], where an ART NN was used to cluster normal behavior of the process. After training has stopped, the NN weights are fixed, and when substantially different time series data is presented to the network, it is detected as falling sufficiently outside any of the clusters formed so far.
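The following sketch captures this novelty-detection idea in a deliberately simplified form: the ART machinery of [2] is replaced by a plain distance-based clusterer, with the radius threshold standing in for ART's vigilance parameter. All names and parameter values are illustrative assumptions.

```python
import numpy as np

class NoveltyDetector:
    """Distance-based stand-in for the clustering scheme of [2]:
    learn clusters on normal data, freeze them, then flag inputs
    that fall sufficiently far outside every cluster."""

    def __init__(self, radius):
        self.radius = radius       # max distance to count as "inside" a cluster
        self.centers = []

    def fit(self, X):
        # Greedy one-pass clustering of normal data (ART uses a vigilance
        # test instead; this keeps the example self-contained).
        for x in X:
            if not self.centers:
                self.centers.append(x.copy())
                continue
            d = [np.linalg.norm(x - c) for c in self.centers]
            j = int(np.argmin(d))
            if d[j] <= self.radius:
                self.centers[j] = 0.9 * self.centers[j] + 0.1 * x  # nudge center
            else:
                self.centers.append(x.copy())
        return self

    def is_novel(self, x):
        # After training, the centers stay fixed; anything outside all
        # clusters is reported as a potential fault.
        return min(np.linalg.norm(x - c) for c in self.centers) > self.radius
```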

3. Detecting Faulty Bearings: A Feasibility Study

The purpose of this study is to explore automated methods for detecting faults in the viscous damper bearing of a helicopter drive shaft. This portion of the paper describes the design of a system that accomplishes this task using neural networks.

3.1. Data Description

The data used in this portion of the study are time series data measured by an accelerometer located on the outer bracket housing of the helicopter shaft (measured with respect to the shaft 1P). The time series were sampled at a rate of 44,100 sps. The bandwidth of the antialiasing filter was approximately 20 kHz. Spectrograms of the data (shown below) suggest that the signals are stationary and that they were sufficiently oversampled (i.e., there was virtually no power above 10 kHz).

A total of 21 runs were made using 17 different bearings. The duration of the runs varied from 4 to 7 seconds. Runs 1 and 11 were discarded because they represent startup runs from the two different data gathering trips, and were both suspect in one way or another.

Thus, the input data for this portion of the study consists of time series data collected from the outer bracket accelerometer for Runs 2-10 and 12-21. All experiments described in this section used approximately 4 seconds of data from each run. Approximately 2.8 seconds (70%) were used for training and 1.2 seconds (30%) for testing. The stationary nature of the signals, coupled with the abundance of data samples, made it unnecessary to perform full cross-validation in our experiments. As we shall see, all experiments produced extremely close agreement between training and test performance with a single split of the data.

The second level of categorization placed the runs into two groups: Good and Bad.


Run Numbers   Level I Grouping   Level II Grouping
7,12,14       New                Good
8,13,15,16    Used               Good
2,9,17,19     Ball Spall         Bad
10,18,20      Inner Race         Bad
5             Outer Race         Bad
6,21          Other              Bad

Table 1: Summary of Categorizations for Runs 2-10 and 12-21.

Runs that were categorized as New or Used were considered Good, and the rest were considered Bad. Table 1 summarizes the two levels of groupings.

3.2. System Design

The ultimate goal here is to design a system that will correctly categorize time series data from a helicopter drive shaft into one or both of the groupings shown in Figure 1. Our focus is on the Level II grouping (i.e., good versus bad). The system designed here is data-driven, i.e., data with known categorizations are used to design various components of the system. Such a system can only be guaranteed to work if the data used in the design is truly representative of data that will be presented to the system in the future. It is unclear whether this is actually true for this particular problem. It has been our experience (as well as others' [?]) that there is wide variability in the signatures produced by both good and faulty bearings, and that the overlap in these signatures is quite large. In other words, the signature of a good bearing is often nearly identical to that of a faulty bearing (and vice versa). This problem stems from the fact that there are numerous factors that contribute to the variability among bearing signatures, and the quality of the bearing is only one of these factors (and not always the dominant one). Thus, while the system designed below may achieve good generalization performance on the data gathered for this study, its performance on future data is difficult to predict.

An alternative approach would be to treat each bearing as a separate problem. A fault could be detected by monitoring the signature of the bearing over the course of its lifetime and raising a flag when the bearing's signature begins to deviate appreciably from its starting value. This approach has been used with a great deal of success on problems very similar to the one under consideration here [?]. Although a different type of data would be required to investigate this approach (i.e., we would need data collected over the lifetime of several different bearings), we present a preliminary study of this problem using a Fuzzy ART neural network.

The basic approach followed here is outlined in Figure ??. This is a traditional pattern recognition system consisting of three major components: preprocessing, feature extraction, and classification. The purpose of the preprocessing stage is generally to compensate for known distortions introduced by the sensor and/or the environment. This stage often involves operations such as scaling, resampling, and/or filtering. The only preprocessing performed on the bearing data was to normalize each of the runs so that it had zero mean and unit standard deviation.

The second stage is generally the most difficult to optimize. The purpose of feature extraction is to extract features from the preprocessed data which provide the greatest discrimination between pattern classes. This is often difficult because the best features are usually not known ahead of time. Perfecting this stage generally involves an in-depth knowledge of the problem at hand, good intuition, and a fair amount of trial and error. In the bearing classification problem we don't know ahead of time what features will work well, so our approach has been to use features that are either commonly used for other types of time series data and/or have been found to work well in other studies involving the detection of faulty bearings from accelerometer data. To this end we settled on the following three feature sets:

1. Spectrograms: A time-varying estimate of the magnitude of the Fourier transform of the data.

2. Linear Prediction Coefficients (LPC): Coefficients of an optimal Mth-order (FIR) linear predictor for the data.

3. Cepstrum: The inverse Fourier transform of the logarithm of the magnitude of the Fourier transform of the data.

In addition to their ability to carry discriminatory information, a good feature set should also have the following properties:

Invariance: The features should be invariant to superfluous variations in the data, i.e., variations in the data that provide no discriminatory information.

Dimensionality Reduction: The training process and generalization performance of the classifier (the last stage) suffer from the curse of dimensionality. It is therefore important to reduce the dimensionality (i.e., the number of features) as much as possible at the feature extraction stage.

Simplified Representation: Ideally, the features should take on a representation that permits optimal discrimination with the simplest possible classifier. For example, it is common to strive for representations in which the two classes are linearly separable.

The final stage in Figure ?? is the classifier. We investigated a wide variety of classifiers in this study, with a focus on the following neural network and fuzzy classifiers:

    1. Multilayer Perceptrons (MLPs)

    2. Radial Basis Functions (RBFs)

    3. Fuzzy ARTMAP

In addition to these we investigated the traditional linear, quadratic, and nearest-neighbor classifiers.

All systems designed in this report were comprised of the five stages shown in Figure ??. The normalization stage (zero mean and unit standard deviation) has already been described. The details of the remaining four stages are described in the sections that follow.

3.3. Feature Computation

The computational aspects of the three feature sets, Spectrograms, LPC coefficients, and Cepstral coefficients, are described below.

Spectrograms: The spectrograms were computed as follows. The first 4 seconds (176,000 samples) of each run were partitioned into 500 consecutive non-overlapping time segments, and a spectral estimate was formed from the 352 samples in each segment. Spectral estimates were obtained using Welch's method (averaging modified periodograms). Each modified periodogram was formed by applying a Hamming window of length 64 (256) to the time samples, computing the FFT, and then taking the squared magnitude for each resulting frequency bin. A total of 6 (2) periodograms were averaged to form the spectral estimate for each 352-sample segment. In the end, each estimate contained 33 (129) frequencies at equally spaced intervals from d.c. to Nyquist (i.e., 0 to 22 kHz). (The values in parentheses correspond to the high frequency resolution variant considered in Section 3.4.)
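A sketch of this computation for the low-resolution (length-64 window) case is given below, using the parameters quoted above. The exact window overlap is not stated in the text, so the step size here is an assumption chosen so that six windows fit in a 352-sample segment.

```python
import numpy as np

def welch_psd(seg, win_len=64, n_avg=6):
    """Welch spectral estimate for one 352-sample segment: average the
    squared-magnitude FFTs of n_avg Hamming-windowed sub-segments."""
    window = np.hamming(win_len)
    step = (len(seg) - win_len) // (n_avg - 1)   # overlap so 6 windows fit
    psd = np.zeros(win_len // 2 + 1)             # 33 bins: d.c. ... Nyquist
    for k in range(n_avg):
        sub = seg[k * step:k * step + win_len] * window
        psd += np.abs(np.fft.rfft(sub)) ** 2
    return psd / n_avg

def spectrogram(run, n_seg=500, seg_len=352):
    """500-segment spectrogram of the first n_seg * seg_len samples of a run."""
    return np.array([welch_psd(run[i * seg_len:(i + 1) * seg_len])
                     for i in range(n_seg)])

run = np.random.randn(176_000)        # stand-in for one 4-second run
print(spectrogram(run).shape)         # (500, 33)
```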

LPC Coefficients: Linear Prediction Coefficients (LPCs) were computed as follows. Approximately 4 seconds (179,200 samples) of each run were partitioned into 700 consecutive (overlapping) time segments of length 1024. Consecutive time segments were overlapped by 768 samples (i.e., the segment window was shifted by 256 samples at each step). From each 1024-sample segment, 16 Linear Prediction Coefficients were computed using the Levinson-Durbin algorithm [12].

Cepstrum: Cepstral coefficients were computed from the LPC coefficients using the method described in [?] (problem 12.32, p. 833). A total of 16 cepstral coefficients are computed for each time segment.
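The sketch below illustrates both computations. The Levinson-Durbin recursion follows its standard textbook form, and the cepstral step uses the common LPC-to-cepstrum recursion c_n = a_n + (1/n) sum_{k=1}^{n-1} k c_k a_{n-k}; whether this matches the exact convention of the cited reference is an assumption (sign conventions vary with how the predictor polynomial is defined).

```python
import numpy as np

def lpc(seg, order=16):
    """Order-16 linear prediction coefficients via the Levinson-Durbin
    recursion on the biased autocorrelation of one time segment."""
    n = len(seg)
    r = np.correlate(seg, seg, mode='full')[n - 1:n + order]  # r[0..order]
    a = np.zeros(order)          # a[k-1] holds coefficient a_k
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err            # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a

def lpc_cepstrum(a):
    """Cepstral coefficients from LPC coefficients via the recursion
    c_n = a_n + (1/n) * sum_{k<n} k * c_k * a_{n-k}."""
    p = len(a)
    c = np.zeros(p)
    for m in range(1, p + 1):
        c[m - 1] = a[m - 1] + sum((k / m) * c[k - 1] * a[m - k - 1]
                                  for k in range(1, m))
    return c
```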

Plots of the spectrograms for each of the 19 runs used in this study are shown in Figures ??-??. Several observations can be made from these plots.

1. Except for Run 2, there is no noticeable power in the upper half of the frequency range (i.e., from 11 to 22 kHz).

2. In many cases there are strong similarities between the runs for good and bad bearings. For example, compare Runs 8 and 10, or Runs 13 and 19, or Runs 12 and 21.

3. Two separate runs of the same bearing often have noticeable differences. For example, compare Runs 7 and 12.

4. Although there are notable exceptions, the following trend seems apparent. Good bearings (both new and used) tend to have most of their spectral power concentrated in one low frequency peak. As the bearing becomes faulty, the power tends to spread into higher frequencies, often introducing a second and/or third peak. For example, see the sequence of Runs 13, 16, 18 and 20. Runs 13 and 16 are the same (normal) bearing (16 is a take-apart version of 13), and Runs 18 and 20 have increasing amounts of damage.

3.4. Feature Selection

The purpose of this section is to determine which of the individual features in the three feature sets are useful for discrimination. Those that are not can be discarded. There are numerous ways of approaching this problem [?, 5]. The method used here forms an estimate of the Bayes Classification Error (i.e., the minimum attainable classification error) for each individual feature and discards features with a high (close to 50%) error. That is, each feature is used by itself as a discriminator, and an estimate of its best performance is obtained. Those features with high individual error rates can safely be discarded, since they will never contribute significantly to the discrimination problem.

It should be noted that this method filters individual features only. It makes no attempt to reject features which carry redundant discriminatory information (e.g., features that are highly correlated).

The Bayes Classification Error for each feature is estimated using the k-nearest neighbor (k-NN) classifier (with reject in the case of a tie) and the leave-one-out method of cross-validation [5]. Separate estimates of classification error are obtained for odd and even values of k. In theory, the classification errors for odd and even values of k provide an upper and lower bound (respectively) on the actual Bayes error. Also, as k grows large (and the number of samples approaches infinity), these bounds become tighter, so that in the limit the even and odd classification errors merge at the true Bayes error [5]. To estimate the Bayes error, the classification error of the k-NN classifier is estimated (using leave-one-out) for values of k ranging from 1 to 20, and an estimate of the convergence point for the even and odd curves is obtained.
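A minimal sketch of this procedure for a single scalar feature follows. The tie-rejection and the sweep over k follow the description above, while the data layout (labels in {0, 1}) is an assumption.

```python
import numpy as np

def knn_loo_error(x, labels, k):
    """Leave-one-out error of a k-NN classifier on one scalar feature,
    rejecting ties (as in the Bayes-error estimation procedure of [5])."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = len(x)
    errors = ties = 0
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                      # leave sample i out
        nearest = np.argsort(d)[:k]
        votes = np.bincount(labels[nearest], minlength=2)
        if votes[0] == votes[1]:
            ties += 1                      # reject ties
        elif np.argmax(votes) != labels[i]:
            errors += 1
    return errors / (n - ties) if n > ties else 0.0

def bayes_error_curves(x, labels, k_max=20):
    """Odd-k values bound the Bayes error from above, even-k from below."""
    return {k: knn_loo_error(x, labels, k) for k in range(1, k_max + 1)}
```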

Plots of the Bayes error estimates for the Spectral features are shown in Figure ??. The two plots correspond to estimates for the low and high frequency resolution features. Both plots have the same general shape. The lower resolution features, however, achieve uniformly better classification scores. This, coupled with the fact that initial attempts to classify the data using the high resolution features provided no improvement over the low resolution features (in fact the performance was slightly worse), prompted us to omit the high resolution features from further study.

For the low resolution spectral features, the plot in Figure ?? clearly shows the highest degree of discrimination in the frequency range 5-8 kHz, the same region where the second peak often appears for a bad bearing. In addition, the error estimate for most of the remaining frequencies is below 40%, suggesting that it may be unwise to simply discard them. On the other hand, a visual inspection of the spectrograms in Figures ??-??, coupled with the results of previous studies and a scientific intuition about this problem, suggests that the upper half of the spectrum does not carry information useful for discriminating between good and bad bearings. The discrimination in this frequency range suggested by the plots in Figure ?? is most likely due to other types of variations in the data (not variations in the quality of the bearing). For this reason we chose to discard the upper 17 (of 33) frequencies. The lower 16 are all retained for further processing.

The Bayes error estimates for the LPC and Cepstral features are shown in Figure ??. Both contain 3 or 4 coefficients with relatively high classification error, but because there are so few, and the dimensionality of these features is low to begin with (only 16), we decided to retain all of them.

3.5. Dimensionality Reduction

The previous section explored methods of discarding individual features. In this section we reduce the dimensionality of the features that were retained by projecting them onto a lower dimensional space. This projection can account for correlation between features, thereby reducing the number of redundant features. Numerous projection methods have been proposed, both linear and nonlinear [5]. We use a linear method that projects feature vectors onto directions with the largest "separability" (defined below).

The projection is of the form

    x = Py

where y is an n-dimensional feature vector, x is the m-dimensional projected vector, and P is the m × n projection matrix. The matrix P is designed to maximize the separability measure defined next.

The separability measure used here is called divergence. The divergence, D, is defined as the difference between the expected value of the log-likelihood function given classes ω_2 and ω_1, that is

    D = E[ l_12(x) | ω_2 ] − E[ l_12(x) | ω_1 ]                    (1)

where E[·] represents the expected value operator and l_12(x) is the log-likelihood function, defined as

    l_12(x) = ln [ p(x|ω_1) / p(x|ω_2) ],                          (2)

where p(x|ω_i) is the conditional density function of x (the feature vector) given the class ω_i (i = 1, 2). (Note that in theory l_12(x) can be used as the optimal discriminator between ω_1 and ω_2.)

The divergence measure of separability is similar to the Bhattacharyya distance. In fact, the two yield identical solutions in the equal covariance Gaussian problem (and nearly identical solutions when the covariances are not equal). In the general Gaussian problem, the rows of P that maximize D are determined as follows. The first row is given by

    p_1 = (Σ_1 + Σ_2)^(−1) (μ_1 − μ_2) / [ (μ_1 − μ_2)^T (Σ_1 + Σ_2)^(−1) (μ_1 − μ_2) ]^(1/2)        (3)


where μ_1, μ_2, Σ_1 and Σ_2 are the mean vectors and covariance matrices for classes ω_1 and ω_2, respectively. This projection capitalizes on the difference in mean between the two classes. Note that when Σ_1 = Σ_2, p_1 is identical to the projection provided by Fisher's Linear Discriminant [5].

The remaining rows of P capitalize on the difference in covariance between classes. They are chosen as the first m − 1 eigenvectors of the matrix Σ_1^(−1) Σ_2. Eigenvectors from this matrix are ordered according to the coefficients a_i = λ_i + 1/λ_i, where λ_i are the corresponding eigenvalues. Larger values of a_i correspond to more significant eigenvectors. The number of eigenvectors chosen for incorporation into P is determined by plotting the a_i (in order, from largest to smallest) and keeping those that fall to the left of the knee of the curve. After performing this procedure on the data in this study, a total of 8 eigenvectors were kept for the Spectral features, and 12 for the LPC and Cepstral features. The overall projections, including the first component due to p_1, are summarized in the table below.

Feature Set   Input Dimension   Projected Dimension
Spectral      16                9
LPC           16                13
Cepstral      16                13
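A sketch of how such a projection matrix might be computed, assuming Equation (3) and the eigenvector ranking described above; good_feats and bad_feats are hypothetical arrays of class-labeled training features (one row per sample).

```python
import numpy as np

def divergence_projection(X1, X2, m):
    """Build the m x n projection matrix P described above: one row from the
    mean difference (Eq. 3), the rest from eigenvectors of inv(S1) @ S2
    ranked by a_i = lam_i + 1/lam_i."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    Sinv = np.linalg.inv(S1 + S2)
    d = mu1 - mu2
    p1 = Sinv @ d / np.sqrt(d @ Sinv @ d)           # mean-difference direction

    lam, V = np.linalg.eig(np.linalg.inv(S1) @ S2)  # covariance-difference dirs
    lam, V = lam.real, V.real                       # eigenvalues are real here
    order = np.argsort(-(lam + 1.0 / lam))          # larger a_i = more useful
    rows = [p1] + [V[:, j] for j in order[:m - 1]]
    return np.vstack(rows)

# Project 16-dimensional features down to m dimensions: x = P @ y
# P = divergence_projection(good_feats, bad_feats, m=9)
```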

Scatter plots of the first two dimensions for each of the three projected feature sets are shown in Figures ??, ?? and ??. Note that there is significant overlap in all three cases, with the worst overlap in the Spectral features.

3.6. Bayes Error Estimates

Before proceeding with classifier design, it is useful to estimate the Bayes classification error for the final feature sets. These estimates are formed in exactly the same way as in Section 3.4, except that in this case the estimates are based on the entire feature vector (instead of individual components). These estimates can sometimes be biased [?, ?], but they provide useful target values to strive for when designing classifiers in the next section.

Plots of the Bayes error estimates for the three feature sets are shown in Figures ??, ?? and ??. For the Spectral features, the even and odd curves appear to converge near a classification error of 20%. For the LPC coefficients, the curves are approaching a value between 4 and 5%. The curves for the Cepstral features are not as well behaved as the first two, suggesting that the Cepstral data may be multimodal and that the estimates formed here are probably biased high [?, ?]. At any rate, these curves suggest an error rate in the neighborhood of 15%.

As we shall see, the best classifiers for the Spectral and LPC features came close to these estimates, although both fell short by a few percentage points. The best classifier for the Cepstral features, however, performed much better than the 15% estimate obtained here (verifying that it is indeed biased high).

3.7. Classifier Design

All of the training and testing performed in this section and the next assumes that the occurrence of good and bad bearings is equally likely, i.e., the prior probability for both of these events is assumed to be 0.5. Although this is probably not true in practice, we have no knowledge of the actual priors, so we have made the worst case assumption, i.e., that the prior probabilities provide no help in discriminating between the two categories. If the true priors were known, they could easily be incorporated into the classifiers below, and would most certainly improve their performance.

The following classifiers were trained and tested on each of the three projected feature sets described above.

    on each of the three projected feature sets describedabove.

Linear: The linear classifier is the simplest and best known of all pattern classifiers. It is important to include this classifier in any type of pattern recognition study since it often provides the best solution, and even when it does not, it provides a useful benchmark. No well-trained classifier should ever perform worse than the linear classifier on a nonlinear problem. There is an abundance of different training algorithms for the linear classifier, depending on one's objective [3, 5, 13]. Most of these attempt to optimize a specific performance criterion. Two of the most common criteria are the least squares (LS) criterion and the perceptron (P) criterion. Minimizing the LS criterion has the advantage that it produces the optimal solution to the equal covariance Gaussian problem. On the other hand, if the data are not Gaussian, or the covariance matrices are not equal, the LS solution can be very poor. The perceptron criterion, on the other hand, works reasonably well in these situations. We trained the linear classifier using both criteria. Optimizing the LS criterion is equivalent to finding the least squares solution to a set of overdetermined linear equations, which we do using singular value decomposition. To optimize the perceptron criterion we use the Pocket algorithm (with Ratchet) [6], which is often more efficient than the more popular perceptron learning algorithm found in the literature; a sketch of the Pocket algorithm follows.
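A sketch of the Pocket algorithm with ratchet, following Gallant's description [6] as we read it: ordinary perceptron updates run underneath, while a "pocketed" weight vector is replaced only when a candidate has both a longer run of consecutive correct classifications and a higher whole-set accuracy. The epoch count and random seed are illustrative assumptions.

```python
import numpy as np

def pocket_ratchet(X, y, epochs=100, rng=np.random.default_rng(0)):
    """Pocket algorithm with ratchet [6].

    X: (n, d) features (append a bias column first); y: labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    pocket_w, pocket_correct, pocket_run = w.copy(), 0, 0
    run_len = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w) > 0:
                run_len += 1
                # Ratchet: adopt w only if its streak AND its total
                # accuracy both beat the pocketed weights.
                if run_len > pocket_run:
                    correct = int(np.sum(y * (X @ w) > 0))
                    if correct > pocket_correct:
                        pocket_w, pocket_correct = w.copy(), correct
                        pocket_run = run_len
            else:
                w = w + y[i] * X[i]       # standard perceptron correction
                run_len = 0
    return pocket_w
```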

Quadratic: The quadratic classifier is a natural extension of the linear classifier. It is typically trained to implement the optimal solution to the Gaussian problem under the unequal covariance assumption. Since this is the most general version of the Gaussian problem, the quadratic classifier should always be considered when sufficient training data are available.

1-Nearest Neighbor (1-NN): The 1-NN classifier [3, 5, 13] is arguably the most popular of all traditional nonparametric classifiers. It is often referred to as a "distance classifier" [13] because its most common implementation uses Euclidean distance to measure closeness when searching for the nearest neighbor. The two major design components of a 1-NN classifier are

- determining the number of prototypes, and
- positioning the prototypes.

We use the LVQ algorithm [9] to position the prototypes, as sketched below. The number of prototypes was determined using the search procedure described below.
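A minimal LVQ1-style update is sketched below; the text does not say which LVQ variant or learning-rate schedule was used, so both are assumptions here.

```python
import numpy as np

def lvq1(X, y, prototypes, proto_labels, lr=0.05, epochs=50,
         rng=np.random.default_rng(0)):
    """LVQ1 update [9]: pull the nearest prototype toward same-class
    samples and push it away from other-class samples."""
    P = prototypes.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = int(np.argmin(np.linalg.norm(P - X[i], axis=1)))
            sign = 1.0 if proto_labels[j] == y[i] else -1.0
            P[j] += sign * lr * (X[i] - P[j])
    return P
```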

Radial Basis Function (RBF): The RBF network is the first of the two neural network classifiers considered in this study. It is often viewed as a mixture-of-Gaussians model. The major design components of this network include

- determining the number of Gaussians (i.e., the number of hidden layer nodes),
- determining the mean vectors and covariance matrices for each Gaussian, and
- determining the optimal weighting of the Gaussians (i.e., determining the output layer weights).

The mean vectors were determined using the K-Means algorithm [3] (which we initialized using the Maximum-Distance procedure [13]). After fixing the mean vectors, the covariance matrices were computed in the usual fashion (i.e., as the sample covariance for each cluster), except that the off-diagonal elements were set to zero. This represents a simplification of the full-blown mixture-of-Gaussians model, but an extension of the traditional RBF approach that uses covariance matrices of the form σ²I.

To train the output layer weights we used both the LS algorithm and the Pocket algorithm. The LS algorithm is the traditional choice for the RBF network, but the Pocket algorithm is likely to yield better results for classification problems.

The number of Gaussians (hidden layer nodes) was determined using the search procedure described below; a sketch of the overall RBF design follows.
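The sketch below assembles the design steps just described: K-Means centers, per-cluster diagonal covariances, and least-squares output weights via an SVD-based solver. Random K-Means seeding replaces the Maximum-Distance initialization, and the small variance floor is an added numerical safeguard; both are assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, rng=np.random.default_rng(0)):
    """Plain K-Means; the study seeds it with the Maximum-Distance
    procedure [13], random seeding is used here for brevity."""
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)
    return C, assign

def design_rbf(X, y, k):
    """RBF design: K-Means centers, diagonal covariances, LS output weights."""
    C, assign = kmeans(X, k)
    var = np.ones((k, X.shape[1]))
    for j in range(k):
        members = X[assign == j]
        if len(members) > 1:
            var[j] = members.var(axis=0) + 1e-6   # diagonal covariance only
    H = np.exp(-0.5 * (((X[:, None, :] - C[None]) ** 2) / var[None]).sum(-1))
    H = np.hstack([H, np.ones((len(X), 1))])      # bias term
    w, *_ = np.linalg.lstsq(H, y, rcond=None)     # SVD-based least squares
    return C, var, w
```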

Multilayer Perceptron (MLP): The MLP is the second of the two neural network classifiers used in this study. We used a very conventional network with no intralayer connections, a tanh activation function in the hidden layer nodes, and a linear activation function in the output layer node. The major design components of this network include

- determining the size of the network (i.e., the number of layers and nodes), and
- determining the weights of the network.

We used two different strategies to design the MLPs in this study. The first is a conventional approach which uses the backpropagation algorithm to learn the weights, and the search procedure described below to determine the number of nodes. The second strategy is unique to this project. It builds a 2-hidden-layer network, with 10 nodes in the first hidden layer and 5 nodes in the second hidden layer, one node at a time. The 10 nodes in the first hidden layer act as linear discriminators for each of the C(5, 2) = 10 pairs of Level I groupings in Table 1 (the sixth group is ignored). That is, the first node is trained to discriminate between New and Used bearings, the second between New and Ball Spall damaged bearings, etc. Once the 10 nodes in the first hidden layer are trained, their weights are fixed and the 5 nodes in the second hidden layer are designed. Each of these nodes acts as a linear discriminator between one of the Level I groups and the other four. For example, the first node is trained to discriminate between New bearings and all other bearings, while the second is trained to discriminate between Used bearings and all other bearings, etc. Finally, with the parameters of the hidden layer nodes fixed, the output layer node is trained to discriminate between good and bad bearings. Since all nodes are trained on an individual basis, the LS or Pocket algorithm is used. The advantages of this approach are that it is much faster than backpropagation, and that it eliminates the need to search for the network size.

Fuzzy ARTMAP: The ARTMAP neural network architecture is described in detail by Carpenter et al. [1]. A block diagram of this architecture is shown in Figure ??.

The network consists of two fuzzy ART (adaptive resonance theory) modules, labeled ARTa and ARTb, along with an inter-ART module. Each of the fuzzy ART modules clusters the data supplied to it according to a specific similarity metric (which makes use of the fuzzy AND operation). In a supervised learning scenario, input vectors (labeled a in Figure ??) are presented at the ARTa module, and the corresponding desired outputs (labeled O in Figure ??) are presented at the ARTb module. The purpose of the inter-ART module is to determine whether the mapping between the presented inputs and outputs is the desired one. Weights in the ARTa, ARTb, and inter-ART modules are adjusted during learning to achieve the desired mapping.
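A minimal fuzzy ART module is sketched below, following the standard formulation (complement coding, a choice function, and a vigilance test); the parameter values are illustrative, and the inter-ART map field of the full ARTMAP architecture is omitted for brevity.

```python
import numpy as np

class FuzzyART:
    """Minimal fuzzy ART module: complement-coded inputs, a choice
    function to rank categories, and a vigilance test to accept one."""

    def __init__(self, vigilance=0.8, alpha=0.001, beta=1.0):
        self.rho, self.alpha, self.beta = vigilance, alpha, beta
        self.w = []                               # one weight vector per cluster

    def present(self, a, learn=True):
        I = np.concatenate([a, 1.0 - a])          # complement coding, a in [0,1]^d
        # Choice function T_j = |min(I, w_j)| / (alpha + |w_j|), best first
        order = sorted(range(len(self.w)),
                       key=lambda j: -np.minimum(I, self.w[j]).sum()
                                     / (self.alpha + self.w[j].sum()))
        for j in order:
            match = np.minimum(I, self.w[j]).sum() / I.sum()
            if match >= self.rho:                 # vigilance test passed
                if learn:                         # move weights toward input
                    self.w[j] = (self.beta * np.minimum(I, self.w[j])
                                 + (1 - self.beta) * self.w[j])
                return j
        if learn:                                 # no resonance: new cluster
            self.w.append(I.copy())
            return len(self.w) - 1
        return -1                                 # frozen net: input is novel
```

With learning disabled, an input that resonates with no existing category is reported as novel; this frozen-weight mode corresponds to the trending experiments of Section 4.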

A critical component in the design of the 1-NN, RBF, and MLP classifiers described above is the determination of their size (i.e., the number of prototypes or nodes). The same general procedure was used to resolve this issue in all three cases. A sequential search was performed, starting with the smallest possible size and then increasing it until the classifier performance began to level off. This method works quite well in practice, and is supported by the fact that, in theory, the curve of performance versus size is unimodal.
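In pseudocode form the search might look like the sketch below, where train_and_score is a hypothetical routine that trains a classifier of the given size and returns its (test) score; the stopping tolerance is an assumption.

```python
def search_size(train_and_score, sizes, tol=0.005):
    """Sequential size search: grow the classifier until the score curve
    levels off, then keep the smallest size that achieved the best score."""
    best_size = sizes[0]
    best_score = train_and_score(sizes[0])
    for size in sizes[1:]:
        score = train_and_score(size)
        if score <= best_score + tol:     # no meaningful improvement: stop
            break
        best_size, best_score = size, score
    return best_size, best_score

# e.g. search_size(lambda k: evaluate_rbf(k), sizes=range(2, 41, 4))
```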

Plots of performance versus size for the Spectral features are shown in Figures ??-??. Each plot shows two curves, one for training performance and one for testing performance. Note that the training and testing curves are nearly identical in all three figures. The RBF plot has two pairs of curves, one for LS training and one for Pocket training. A significant improvement is obtained with Pocket training. Overall, these plots show that an increase in size yielded little or no improvement for any of the classifiers. This suggests that the linear classifier may be optimal for this feature set (since different types of nonlinearities don't seem to help).

A specific size (and corresponding performance) must be chosen for the 1-NN, RBF, and MLP classifiers so that they can be compared with the others in the next section. The table below shows the sizes chosen for this purpose.

Optimal Size Classifiers for Spectral Features

Classifier     Size   % Classification Error (Train/Test)
1-NN           6      27.77 / 26.09
RBF (LS)       18     33.75 / 34.44
RBF (Pocket)   18     27.76 / 27.39
MLP            10     23.79 / 23.73

Plots of performance versus size for the LPC features are shown in Figures ??-??. In this case there are significant improvements in all three classifiers as the size is increased from the minimum. Also, as expected, these improvements gradually diminish as the classifier approaches its optimal size. The specific sizes chosen for each classifier are summarized in the table below.

Optimal Size Classifiers for LPC Features

Classifier     Size   % Classification Error (Train/Test)
1-NN           38     7.00 / 6.91
RBF (LS)       18     27.77 / 28.18
RBF (Pocket)   26     19.23 / 20.11
MLP            30     6.90 / 7.54

Finally, plots of performance versus size for the Cepstral features are shown in Figures ??-??. Here we see improvements with the RBF and 1-NN classifiers as the size is increased from the minimum, but not with the MLP classifier. The performance of the MLP classifier starts with a much lower classification error than the other two, but shows very little improvement as its size is increased. The sizes chosen for the classifiers in this case are summarized in the table below.

Optimal Size Classifiers for Cepstral Features

Classifier     Size   % Classification Error (Train/Test)
1-NN           26     2.43 / 2.46
RBF (LS)       14     16.06 / 16.45
RBF (Pocket)   22     2.31 / 2.24
MLP            10     7.09 / 7.56

Note that in all of the experiments with the RBF network above, the performance with the Pocket algorithm was far superior to that of the LS algorithm.

3.8. Classification Results

The classification results for the Spectral feature set are summarized in Table 2. The best classifier for this feature set is the linear classifier trained with the Pocket algorithm. Although one of the MLP networks achieved slightly better performance, the additional complexity of this classifier does not justify its choice given such a small increase in performance. Note that the classification performance of the linear classifier is within 3% of the Bayes estimate for this feature set.

Classifier                  Training   Test
Linear (LS)                 24.65      24.26
Linear (Pocket)             22.07      22.63
Quadratic                   24.09      24.56
1-NN (LVQ)                  27.77      26.09
RBF (LS)                    33.75      34.44
RBF (Pocket)                27.76      27.39
MLP (BP)                    23.79      23.73
MLP (Constructive/LS)       22.56      23.83
MLP (Constructive/Pocket)   21.26      22.15
Fuzzy ARTMAP                4.2        27.5

Table 2: Classification Results for Spectral Features (% classification error).

Classifier                  Training   Test
Linear (LS)                 16.80      17.54
Linear (Pocket)             15.77      16.16
Quadratic                   10.18      10.78
1-NN (LVQ)                  7.00       6.91
RBF (LS)                    27.77      28.18
RBF (Pocket)                19.23      20.11
MLP (BP)                    6.90       7.54
MLP (Constructive/LS)       7.40       7.45
MLP (Constructive/Pocket)   5.65       5.86
Fuzzy ART                   2.22       31.0

Table 3: Classification Results for LPC Features (% classification error).

The classification results for the LPC features are summarized in Table 3. There is wide variation in performance among the various classifiers. The best performance was achieved by the 10-5-1 MLP network which used the Pocket algorithm to train the nodes. The performance achieved by this network is within 1% of the Bayes estimate for this feature set.

The classification results for the Cepstral features are summarized in Table 4. As in the previous case, there is wide variation in performance among the various classifiers. The best overall performance was achieved by the RBF network (trained with the Pocket algorithm), with the 1-NN classifier (trained with LVQ) running a close second. The performance achieved by this network is well below the 15% Bayes error estimate for this feature set.

Classifier                  Training   Test
Linear (LS)                 8.84       9.07
Linear (Pocket)             6.76       6.97
Quadratic                   5.71       6.14
1-NN (LVQ)                  2.43       2.46
RBF (LS)                    16.06      16.45
RBF (Pocket)                2.31       2.24
MLP (BP)                    7.09       7.56
MLP (Constructive/LS)       3.98       3.83
MLP (Constructive/Pocket)   5.06       5.33
Fuzzy ART                   4.80       9.70

Table 4: Classification Results for Cepstral Features (% classification error).

Run     Feature    % Good   % Bad
Run 3   Spectrum   25.2     74.8
        LPC        85.4     14.6
        Cepstrum   14.4     85.6
Run 4   Spectrum   32.2     67.8
        LPC        35.0     65.0
        Cepstrum   60.1     39.9

Table 5: Classification Results for Runs 3 and 4.

3.9. Tests on Unknown Bearings

In this section we describe the results of tests performed on Runs 3 and 4, whose true classifications are unknown (but believed to be normal). Neither of these runs was used in the training and testing above. Both runs were processed by the three different pattern recognition systems (one for each feature type). In each system the optimal classifier (determined in the previous section) was used, i.e., the linear, MLP, and RBF classifiers were used for the Spectral, LPC, and Cepstral features, respectively. The classification results are summarized in Table 5. Several observations can be made regarding these results.

There is a large variation across feature sets.

There is a large variation in the consistency of the results between Run 3 and Run 4.

None of the results are in close agreement with the classification error rates predicted in previous sections.

The explanation for these poor results is that the data used to design these systems was not representative of all future data. This was alluded to as a potential flaw with this approach earlier in the study. Runs 3 and 4 are significantly different (in ways that could not be predicted) from the runs used to design these systems. This is verified in the plots in Figures ?? and ??. Here we show histograms of the classifier outputs for the Spectral and Cepstral systems. A comparison is made between the output values for the training data (both Good and Bad) and the test data from Runs 3 and 4. Note that in both cases the output values for the test data have a significantly different distribution than the training data. In the Cepstral system, the outputs are essentially zero for all test inputs. This system uses an RBF classifier, which is well known to produce near-zero outputs for data that is dissimilar to the training data.

These results tell us that the systems designed here are not likely to produce meaningful classifications of future data. A more promising approach, which was mentioned earlier in this report, is investigated in the next section.

4. Trending Using Fuzzy ART

We have also investigated a trending approach to the detection of faulty bearings. In order to perform a trending analysis, it is necessary to monitor the same bearing over an extended time period. In four of the runs supplied to us, the same bearing was used; these include Runs 13 (normal), 16 (normal), 18 (inner race), and 20 (inner race, severe). Note that the last two runs correspond to a damaged bearing.

Our experimental setup for the trending analysis consisted of a single fuzzy ART (clustering) module. The spectrogram data computed for Run 13 was supplied as input, and the network formed a single cluster. A histogram of the output node value for this cluster, for each of the 500 time segments, is shown in Figure ??. Notice that the output node produced the same value for all 500 inputs.

At this point the fuzzy ART network weights were frozen, and the spectrogram data computed for Run 16 was supplied as input. The histogram for Run 16 is shown in Figure ??. This histogram is only slightly different from the previous one (i.e., it is nearly flat).

The fuzzy ART histogram corresponding to the spectrogram data computed for Run 18 is shown in Figure ??. In this (damaged bearing) case, the histogram is significantly different from the previous two. For the final case (severe damage), Run 20, the histogram (Figure ??) is extremely ragged.

This experiment suggests that a trending approach, based on learning the features associated with a normal bearing and monitoring these features over time, may be a feasible alternative to the other experiments performed in this study.

5. Conclusions and Future Directions

We have presented an overview of fault detection and isolation techniques, with special emphasis on vibrational data and the usage of neural networks, and we have presented applications of these techniques to real vibrational data. Based on this study, we have determined that the approach which presents the best probability of success is the trending approach, where a particular system is monitored over its lifetime and faults are detected as deviations from normal behavior. On the other hand, an approach relying on combining data from different systems is doomed to failure, as shown in Section 3. As a direction of future research, we intend to study the trending approach using different feature sets and different neural network structures.


References

[1] G.A. Carpenter. Neural Networks: From Foundations to Applications Short Course, May 1991. Wang Institute of Boston University, Tyngsboro, MA.

[2] T.P. Caudell and D.S. Newman. An adaptive resonance architecture to define normality and detect novelties in time series and databases. In Proc. World Congress on Neural Networks, pages IV-166–IV-176, Portland, OR, 1993.

[3] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, NY, 1973.

[4] P.M. Frank. Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy: a survey and some new results. Automatica, 26:459–474, 1990.

[5] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1972.

[6] S.I. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993.

[7] M.J. Healy, T.P. Caudell, and S.D.G. Smith. A neural architecture for pattern sequence verification through inferencing. IEEE Transactions on Neural Networks, 4(1):9–20, 1993.

[8] R. Isermann. Process fault detection based on modeling and estimation methods. Automatica, 20:387–404, 1984.

[9] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, Germany, 1988.

[10] R. Kuczewski and D. Eames. Helicopter fault detection and classification with neural networks. In Proc. Int. Joint Conf. on Neural Networks, pages II-947–II-956, Baltimore, MD, 1992.

[11] M. Polycarpou and A. Vemuri. Learning methodology for failure detection and accommodation. IEEE Contr. Sys. Mag., 15(3):16–24, 1995.

[12] S.D. Stearns and D.R. Hush. Digital Signal Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1990.

[13] J.T. Tou and R.C. Gonzalez. Pattern Recognition Principles. Addison-Wesley, Reading, MA, 1974.

[14] F.B. Weiskopf, F.G. Arcella, J.S. Lin, and R.W. Newman. A hybrid system approach to machinery monitoring and diagnosis. In Proc. Int. Mach. Monit. & Diag. Conf., pages 25–30, Chicago, IL, 1993.

[15] A.S. Willsky. A survey of design methods for failure detection in dynamic systems. Automatica, 12:601–611, 1976.