
Page 1: Deep learning for infrared based condition monitoring of ...lib.ugent.be/fulltxt/RUG01/002/300/697/RUG01-002300697_2016_0001_AC.pdf · Deep learning for infrared based condition monitoring

Deep learning for infrared based condition monitoring of rotating machinery

by

Petar Petrov

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in de industriële wetenschappen: elektronica-ICT

Supervisors: Prof. dr. ir. Sofie Van Hoecke, Prof. dr. Steven Verstockt
Counsellors: Olivier Janssens, Viktor Slavkovikj

Faculty of Engineering and Architecture
Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle

Academic year 2015-2016



Admission to lease-loan

The author gives permission to make this master's dissertation available for consultation and to copy parts of it for personal use. Any other use is subject to the copyright terms, in particular the obligation to state the source when quoting results from this master's dissertation.

Petar Petrov, August 2016


Deep learning for infrared based condition monitoring of rotating machinery

Petar Petrov

Supervisor(s): Sofie Van Hoecke, Steven Verstockt, Olivier Janssens, Viktor Slavkovikj

Abstract—This master's dissertation explores the use of fully connected and convolutional neural networks for condition monitoring of rotating machinery using infrared thermography. Recent state-of-the-art improvements in deep learning are evaluated to aid this goal.

Keywords—Convolutional neural network, condition monitoring, infrared thermography, deep learning

I. INTRODUCTION

Predictive maintenance is required to minimise the cost associated with downtime and failure of rotating machinery. Continuous condition monitoring with single or multiple sensors is needed to accomplish this task. Current fault detection in rotating machinery is focused on vibration analysis [1]. Detection of insufficient lubrication using this method is error-prone and requires human resources [2]. Infrared thermography is a non-intrusive way of measuring the spatial temperature distributions caused by insufficient lubrication. Manual feature extraction coupled with support vector machines or random forests is the most recent method of classifying bearing faults based on thermal images [3]. The use of feature learning as opposed to manual feature extraction is advisable, as thermal images are difficult for humans to interpret. Feature learning can be accomplished with neural networks. Additionally, since infrared thermograms can be treated as images, convolutional neural networks (CNNs) can be applied, as they are currently the state of the art in image classification [4]. The combination of condition monitoring using infrared thermography and neural networks is still mostly unexplored.

II. EXPERIMENT SETUP

A. The dataset

The raw data is identical to that used in [3]. It consists of infrared videos at VGA resolution. Each video is recorded at 6 frames per second and is 10 minutes long. An example frame can be seen in Fig. 1. All frames within an individual video are very similar.

There are four bearing conditions, and the rotating machinery can be balanced or imbalanced, resulting in 8 different classes. For each class there are 5 separate videos, allowing 5-fold cross-validation. Because of the similarity of frames within a video, all frames of one video are strictly used in either the training or the validation set, never both. The name of each class can be seen in Table I. The dataset is created by extracting one frame per second.
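The video-level split described above can be sketched as follows. This is an illustrative reconstruction, not the thesis code; the constants (8 classes, 5 videos per class) come from the text, everything else is an assumption.

```python
# Hypothetical sketch of the 5-fold, video-level split described above.
# All frames of one video go strictly to either training or validation,
# so each fold holds out one whole video per class.

N_CLASSES, VIDEOS_PER_CLASS = 8, 5

def make_folds(n_folds=5):
    """Return (train_videos, val_videos) pairs, one per fold."""
    folds = []
    for k in range(n_folds):
        train, val = [], []
        for c in range(N_CLASSES):
            for v in range(VIDEOS_PER_CLASS):
                vid = (c, v)  # a video is identified by (class, recording)
                (val if v == k else train).append(vid)
        folds.append((train, val))
    return folds

train, val = make_folds()[0]
assert len(train) == 32 and len(val) == 8   # 8x4 train videos, 8x1 held out
assert not set(train) & set(val)            # no video leaks across the split
```

Splitting at the video level rather than the frame level is what prevents near-duplicate frames from appearing on both sides of the split.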

B. Preprocessing

The preprocessing consists of five steps that eliminate unneeded information. The first step lowers the resolution of the frames to 160x120. This is done to compensate for the memory limitations of the GPU; no necessary information should be lost.

Fig. 1. An example of a frame from a raw video.

TABLE I
Abbreviations of all the classes.

Class name | Bearing and machine condition    | Balance
HB         | Healthy bearing                  | Yes
MILB       | Mildly in-lubricated bearing     | Yes
EILB       | Extremely in-lubricated bearing  | Yes
ORF        | Bearing with outer raceway fault | Yes
HB IM      | Healthy bearing                  | No
MILB IM    | Mildly in-lubricated bearing     | No
EILB IM    | Extremely in-lubricated bearing  | No
ORF IM     | Bearing with outer raceway fault | No

In the second step, the temperature in the frames is converted from absolute to relative. This is done by reading the ambient temperature from the first frame of each video; the ambient temperature is defined by the pixel value of a thermocouple visible in all frames.
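This step can be sketched in numpy as below. The thermocouple pixel location is a made-up placeholder, since its coordinates are not given here.

```python
import numpy as np

# Sketch of step two: convert absolute to relative temperature by
# subtracting the ambient value read from a fixed thermocouple pixel
# in the first frame of each video. The pixel coordinates are
# illustrative assumptions.

THERMOCOUPLE_YX = (5, 5)  # assumed location of the thermocouple pixel

def to_relative(frames):
    """frames: (n, h, w) array of temperature values for one video."""
    ambient = frames[0][THERMOCOUPLE_YX]
    return frames - ambient

video = np.full((3, 120, 160), 25.0)
video[1, 60, 80] = 31.0          # a warm spot on the bearing housing
rel = to_relative(video)
assert rel[0, 5, 5] == 0.0       # the ambient reference maps to zero
assert rel[1, 60, 80] == 6.0     # six degrees above ambient
```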

The third step normalizes the position of the bearing housing within the image so that unwanted parts can be cropped. Two different methods were explored to complete this task. The manual method consists of locating the position of a certain dent in every video, which allows the vertical and horizontal alignment to be corrected; however, the pitch, yaw and roll are not taken into account. The automatic method uses a local feature detector/descriptor and RANSAC [5] to find the needed homography. This method produces almost perfect results, but some frames show irregularities that could be circumvented with more finetuning. In the end, the manual method sufficed for the cropping job and no further finetuning of the automatic method was done. The output frame can be seen in Fig. 2.

Fig. 2. Preprocessed frame.

The last two steps of the preprocessing are scaling the image's values to the interval [0, 1] and zero centering.
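These two operations can be sketched in a few lines of numpy. The scope of the statistics (per frame versus per dataset) is an assumption here; the thesis may compute them differently.

```python
import numpy as np

# Sketch of the last two preprocessing steps: min-max scaling to [0, 1]
# followed by zero centering (subtracting the mean).

def scale_and_center(x):
    x = (x - x.min()) / (x.max() - x.min())  # values now span [0, 1]
    return x - x.mean()                      # zero centered

frames = np.array([[0.0, 10.0], [20.0, 30.0]])
out = scale_and_center(frames)
assert abs(out.mean()) < 1e-9                # centered around zero
assert out.max() - out.min() == 1.0          # full [0, 1] range preserved
```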

III. RESULTS

Both fully connected and convolutional neural networks have been tested. Each type of network has been optimised step by step.

A. Fully connected neural network

Models with between 2 and 6 consecutive fully connected layers have been tested. Additionally, the number of neurons in each layer was varied to larger and smaller values, and the effect of dropout [6] was examined. The rectified linear unit (ReLU) is used as activation function in all layers. Categorical crossentropy is used as loss function, combined with softmax in the output layer. Gradient descent was performed with a learning rate of 5e-3, Nesterov momentum (90%) [7] and an L2 regularization strength of 5e-3.
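The update rule implied by these hyperparameters can be sketched on a toy quadratic loss. This uses one common formulation of Nesterov momentum and is not the thesis implementation; the toy loss is an assumption made for illustration.

```python
# Sketch of SGD with learning rate 5e-3, Nesterov momentum 0.9 and
# L2 regularization strength 5e-3, on the toy loss 0.5*(w - 3)^2.

lr, mu, l2 = 5e-3, 0.9, 5e-3

def grad(w):
    # gradient of 0.5*(w - 3)^2 plus the L2 penalty term l2*w
    return (w - 3.0) + l2 * w

w, v = 0.0, 0.0
for _ in range(5000):
    g = grad(w + mu * v)       # gradient at the look-ahead point
    v = mu * v - lr * g        # velocity update
    w += v                     # parameter update

# converges near the regularized optimum 3 / (1 + l2)
assert abs(w - 3.0 / (1.0 + l2)) < 1e-3
```

Note how L2 regularization pulls the optimum slightly toward zero, matching the preference for small weights discussed in chapter 2.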

When dropout is not used, a small network of 2 layers with 50 and 20 nodes respectively achieved the best results, with an overall accuracy of 58.8%. The accuracy for the bearing condition is 79.5%, and for the balance of the machine 64.5%.

The best result was achieved by a 6-layer model with 500, 400, 300, 200, 100 and 20 nodes per layer respectively, with dropout applied after every layer. An overall accuracy of 64.0% was obtained: 84.8% for the bearing condition and 74.5% for the balance of the rotating machinery. The confusion matrix for all the classes can be seen in Fig. 3.

A saliency map is created using guided backpropagation [8][9] to give insight into which regions of the input image are important to the neural network. Fig. 4 shows the saliency map of a correctly classified image of ORF that achieves maximum output. The bright region in the middle corresponds to one of the two large bolts of the housing; the right arc corresponds to the outline of the housing. These two features are what helps the network classify ORF correctly.

Target \ Predicted:  HB    MILB  EILB  ORF   HB_IM  MILB_IM  EILB_IM  ORF_IM
HB                   20 %  0 %   0 %   0 %   60 %   0 %      0 %      20 %
MILB                 0 %   60 %  0 %   0 %   20 %   20 %     0 %      0 %
EILB                 0 %   0 %   40 %  0 %   0 %    0 %      60 %     0 %
ORF                  20 %  0 %   0 %   74 %  0 %    0 %      0 %      6 %
HB_IM                0 %   0 %   0 %   0 %   98 %   0 %      2 %      0 %
MILB_IM              0 %   20 %  0 %   0 %   20 %   40 %     20 %     0 %
EILB_IM              0 %   0 %   0 %   0 %   0 %    20 %     80 %     0 %
ORF_IM               0 %   0 %   0 %   0 %   0 %    0 %      0 %      100 %

Fig. 3. Confusion matrix of the 6-layer fully connected network.

Fig. 4. Absolute saliency map of the correctly classified ORF frames by the 6-layer fully connected network.

B. Convolutional neural network

The best performing CNN was found by starting from a medium-sized network and tweaking the different aspects of the architecture: the number of filters per convolution layer, the number of pooling layers, the number and size of the fully connected layers, the number of convolution layers, the activation function and the use of batch normalization [10]. The final architecture consists of 6 convolution layers with 4 filters of size 3, with a max pooling layer after every 2 convolution layers. The last part of the network consists of 3 fully connected layers with 512 nodes each. All layers use a leaky ReLU [11] as activation function, and dropout is applied after every fully connected layer. The same hyperparameters are used as for the fully connected neural network, except that the learning rate is lowered to 1e-3.

The achieved overall accuracy is 57.2%. The accuracy for the bearing condition is 81.2%, and for the balance of the machine 62.9%. The confusion matrix for all the classes can be seen in Fig. 5.


Target \ Predicted:  HB    MILB  EILB  ORF   HB_IM  MILB_IM  EILB_IM  ORF_IM
HB                   40 %  0 %   0 %   0 %   60 %   0 %      0 %      0 %
MILB                 0 %   40 %  0 %   0 %   0 %    60 %     1 %      0 %
EILB                 0 %   0 %   79 %  0 %   0 %    1 %      20 %     0 %
ORF                  20 %  0 %   0 %   80 %  0 %    0 %      0 %      0 %
HB_IM                40 %  0 %   0 %   0 %   40 %   0 %      20 %     0 %
MILB_IM              0 %   20 %  0 %   0 %   0 %    80 %     0 %      0 %
EILB_IM              0 %   5 %   41 %  0 %   0 %    0 %      53 %     0 %
ORF_IM               0 %   5 %   0 %   15 %  0 %    0 %      0 %      80 %

Fig. 5. Confusion matrix of the final CNN.

Fig. 6 shows the saliency map of a correctly classified image of EILB that achieves maximum output for the CNN. The brightest region can be found in the bottom left; it corresponds to a dark, empty area of the input frame. The bright region in the middle can be interpreted as a region where the network detects a feature that changes the temperature distribution; it also corresponds to the location of one of the two large bolts of the housing.

Fig. 6. Absolute saliency map of the correctly classified EILB frames by the final CNN.

IV. CONCLUSION

The difference in validation loss/accuracy between the cross-validation sets is quite significant, exceeding 50% for most of the models. Some of the validation sets show severe overfitting that is not compensated by the usual remedies such as regularization and dropout. It can be said that the networks suffer from model instability.

Convolutional neural networks were not able to achieve a better result than the fully connected networks in terms of validation accuracy. The CNN suffers most in distinguishing the balance of the rotating machinery. The bearing condition is reasonably well classified by both the fully connected network and the CNN.

The saliency maps of the fully connected network suggest that the network classifies using physical features, such as the two large bolts on the housing, instead of the spatial temperature distribution. This unwanted behaviour can be attributed to the small number of different videos per class. The saliency maps of the final CNN are more difficult to interpret, and further research is needed to fully understand these images.

REFERENCES

[1] J. Liu, W. Wang, and F. Golnaraghi, "An extended wavelet spectrum for bearing fault diagnostics," 2008.

[2] P. Bokoski, J. Petrovcic, B. Musizza, and A. Juricic, "Detection of lubrication starved bearings in electrical motors by means of vibration analysis," 2010.

[3] Olivier Janssens, "Thermal image based fault diagnosis for rotating machinery," 2015.

[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," arXiv:1409.0575 [cs], Sept. 2014.

[5] Martin A. Fischler and Robert C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, June 1981.

[6] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580 [cs], July 2012.

[7] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139–1147.

[8] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, "Striving for Simplicity: The All Convolutional Net," arXiv:1412.6806 [cs], Dec. 2014.

[9] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps," arXiv:1312.6034 [cs], Dec. 2013.

[10] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv:1502.03167 [cs], Feb. 2015.

[11] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, 2013, vol. 30.


Contents

Admission to lease-loan  i
Extended abstract  v
Table of contents  vi
List of used abbreviations  vii

1 Introduction  1

2 Related literature  2
  2.1 Infrared Thermography  2
  2.2 Basic concepts of neural networks  3
      2.2.1 Artificial Neural Networks  3
      2.2.2 Convolutional Neural Networks  6
  2.3 State-of-the-art deep convolutional neural networks  8
      2.3.1 Data augmentation and preprocessing  8
      2.3.2 Overfitting  9
      2.3.3 ReLU Variations  10
      2.3.4 Gradient descent  11
      2.3.5 Network architecture  13
      2.3.6 Pooling layers  15

3 Experiment setup  17
  3.1 Data  17
  3.2 Preprocessing  18
      3.2.1 Reducing dimension  19
      3.2.2 Ambient temperature  19
      3.2.3 Cropping  21
      3.2.4 Scaling and mean subtraction  24
  3.3 Practical hardware & software setup  24

4 Results  25
  4.1 Fully Connected Neural Networks  25
      4.1.1 Use of dropout  31
  4.2 Convolutional Neural Networks  35
      4.2.1 Number of filters  36
      4.2.2 Number of pooling layers  38
      4.2.3 Depth and width of fully connected layers  41
      4.2.4 Number of convolution layers  45
      4.2.5 Activation functions  46
      4.2.6 Batch normalization  48
      4.2.7 Overfitting  49
      4.2.8 Final architecture  50

5 Conclusion  58
  5.1 Future work  59

Bibliography  61

List of Figures  65

List of Tables  67


Used abbreviations

AKAZE Accelerated KAZE features. 23

BN Batch Normalization. 48, 49, 66

BRIEF Binary Robust Independent Elementary Features. viii

BRISK Binary Robust Invariant Scalable Keypoints. 23

CM Condition Monitoring. 1, 2

CNN Convolutional Neural Network. 2, 9, 35, 53, 58, 59

CNNs Convolutional Neural Networks. 1, 6, 7, 10, 25, 37, 38, 58, 65, 66

conv convolutional. 6, 14, 15, 65

EILB Extremely In-lubricated Bearing. 29, 35, 50, 53, 55, 57

EILB IM Imbalanced Extremely In-lubricated Bearing. 29, 35, 55, 57

ELU Exponential Linear Unit. 10, 47, 58

FAST Features from accelerated segment test. viii

fc fully connected. 4, 7, 13, 14, 25, 26, 31, 32, 58, 67

GB Gigabyte. 24

GPU Graphics Processing Unit. 24

HB Healthy Bearing. 30, 35, 50, 55, 57, 58

HB IM Imbalanced Healthy Bearing. 30, 55, 57, 58

ILSVRC ImageNet Large Scale Visual Recognition Challenge. 13, 14

IRT Infrared Thermography. 1, 2

LReLU Leaky ReLU. 10, 11, 47, 48, 50, 66


MILB Mildly In-lubricated Bearing. 32, 33, 35, 50, 53, 55, 59

MILB IM Imbalanced Mildly In-lubricated Bearing. 32, 35, 55

MSRA MicroSoft Research Asia. 14

NN Neural Network. 4–6, 8–10, 18, 24, 25, 29, 30, 58

NNs Neural Networks. 3, 6, 9, 12, 25, 31, 58, 60

ORB Oriented FAST and Rotated BRIEF. 23

ORF Bearing with Outer Raceway Fault. 29, 35, 50, 55, 57

ORF IM Imbalanced Bearing with Outer Raceway Fault. 35, 55, 57

PCA-whitening Principal Component Analysis. 9

pool pooling. 7

PReLU Parametric ReLU. 10, 11, 47, 58

RAM Random Access Memory. viii, 24

RANSAC Random sample consensus. 22

ReLU Rectified Linear Unit. vii, viii, 3, 10, 11, 13, 14, 35, 46–48, 66

ROI Region Of Interest. 2

RReLU Random ReLU. 11, 47, 58

SIFT Scale-Invariant Feature Transform. 59

SPP Spatial Pyramid Pooling. 15

SURF Speeded Up Robust Features. 59

VRAM Video RAM. 24

ZCA-whitening Zero-phase Component Analysis. 9


Chapter 1

Introduction

The downtime and failures of rotating machinery have a significant influence on the production process and can increase costs by a significant amount. Predictive maintenance is therefore needed to keep that cost at a minimum. This is achieved through the use of continuous Condition Monitoring (CM) and fault detection, in which a single sensor or multiple sensors can be used. Vibration analysis is the most widely used method for CM of rotating machines [1]. Together with acoustic emission, multiple faults can be detected and identified, such as imbalance, misalignment, and outer- and inner-raceway faults. Classifying insufficient lubrication in bearings with these methods is subpar and error-prone [2]. The use of lubrication analysis is not an option, as it lowers the operational uptime and demands human resources. Online particle counting is restricted to closed-loop systems and comes with a high cost.

Infrared Thermography (IRT) provides a non-intrusive way of measuring spatial temperature distributions. Different levels of lubrication result in different temperature distributions. This knowledge has been successfully used to classify faults in rotating machines with the use of manual feature extraction, support vector machines and random decision forests [3]. The nature of manual feature extraction leaves room for optimisation of the classification between healthy bearings and outer raceway faults.

Thermal images provide information that is difficult for humans to interpret, which makes manual feature extraction hard and inconvenient. Feature learning can provide a solution to this problem. Since IRT output can be seen as images, we can recast the problem as an image classification task. Convolutional Neural Networks (CNNs) are currently the state of the art for this type of classification.

This work explores the possibility of using deep CNNs to classify faults in rotating machinery based on IRT. The focus is set on bearing lubrication faults and the outer raceway fault. Additionally, imbalance of the rotating machines will be tested.


Chapter 2

Related literature

2.1 Infrared Thermography

The purpose of this section is to provide the reader with some insight into condition monitoring using Infrared Thermography (IRT).

Infrared thermography can be divided into active and passive thermography. Active thermography uses external heat/cold stimulation. Passive thermography, on the other hand, monitors the object without external heat stimulation and relies on abnormalities in the temperature distribution. Almost all IRT-based CM methods make use of passive thermography [4].

Most current methods for condition monitoring using IRT make use of feature extraction and feature selection. A widespread method is a histogram analysis of the ROI of the hottest region [3] [5] [6] [7]. Statistical analysis and/or a classifier can then be applied to classify the different faults in the rotating machinery.

Segmenting the thermal image into multiple pieces instead of using one whole image can help the classification process [8]. Creating those regions allows for dispersion-based region selection when choosing the ROI [8]. This also suggests that a Convolutional Neural Network (CNN) could achieve a better result than other classifiers.

A less lubricated bearing creates more friction and thus more heat, which can be detected with IRT. Outer raceway faults, on the other hand, do not increase the absolute temperature by a significant margin. However, such a fault establishes an overall higher relative temperature, especially in the outer lip seal area [9]. This means that outer raceway faults can be correctly detected using thermal imaging [10].


2.2 Basic concepts of neural networks

The purpose of this section is to familiarise the reader with, or refresh the reader's memory regarding, neural networks. Subsequently, basic CNNs are introduced and explained. The third section presents the current state-of-the-art methods that were researched.

2.2.1 Artificial Neural Networks

The brain consists of small units, called neurons. They receive signals from other neurons through their small extensions, called dendrites, and send signals to other neurons through their long extension, called the axon (Figure 2.1). The neuron cell body itself performs an augmentation of the received signals and forwards the result to the axon.

Figure 2.1: Simplified version of a neuron. 1

Structure

The units used in artificial Neural Networks (NNs), called nodes, resemble the biological ones [11]. A node has multiple input signals X that are augmented with certain weights W and a node-specific bias b (Equation 2.1).

f(X, W, b) = Σ_i X_i · W_i + b    (2.1)

The result of this calculation is sent to a so-called activation function. This function enables the node to act as a basic classifier, but can also introduce non-linearity. In the past, the most-used activation functions were the sigmoid (Equation 2.2) and the hyperbolic tangent (Equation 2.3). Currently, the Rectified Linear Unit (ReLU) (Equation 2.4) and its variants are used the most.

The variants of the ReLU are discussed in the next section.

1Image source: ”Neuron Structure and Function”, http://www.rise.duke.edu/


f(x) = 1 / (1 + e^(−x))    (2.2)

f(x) = tanh(x)    (2.3)

f(x) = 0 if x < 0,  x if x ≥ 0    (2.4)
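Equations 2.1 to 2.4 can be made concrete with a small sketch of a single node; the function and variable names here are illustrative only.

```python
import math

# Sketch of one node: weighted sum plus bias (Equation 2.1), followed
# by an activation function (Equations 2.2 and 2.4; tanh of Equation
# 2.3 is available directly as math.tanh).

def sigmoid(x):  # Equation 2.2
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):     # Equation 2.4
    return max(0.0, x)

def node(xs, ws, b, act):
    s = sum(x * w for x, w in zip(xs, ws)) + b  # Equation 2.1
    return act(s)

out = node([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], b=0.2, act=relu)
assert abs(out - 0.5) < 1e-9  # weighted sum is positive, so ReLU passes it
assert abs(sigmoid(0.0) - 0.5) < 1e-12
```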

The graphical representation of the whole node is displayed in Figure 2.2. Note that the output is a single value, but it can have more than one output connection.

Figure 2.2: Node with three inputs.

Multiple nodes are structurally organised in a so-called layer. There are no connections between nodes in the same layer. A NN is formed when multiple layers are combined by connecting the nodes of neighbouring layers. This type of layer is called a dense layer or a fully connected (fc) layer.

Figure 2.3: Neural network that classifies two classes.

The first layer is called the input layer and contains no nodes; its purpose is to hold the input data and distribute it to the second layer. All layers besides the first and last are called hidden layers. The depth of a NN


is represented by the number of hidden layers plus the last layer, the output layer. The size of the output layer is determined by the number of classes to be classified; the output values represent which class the NN predicts for the input data.

Loss

It is necessary to determine how well or poorly a network performs. A so-called loss is used to quantify this. It is computed by comparing the ground truth2 with the prediction of the network: a large loss value means the prediction is bad, and a small loss value means the prediction was close to the truth. The loss function depends on the type of problem. For multi-class classification, this function can be the categorical crossentropy (Equation 2.6) or a multi-class hinge loss (Equation 2.5). When using categorical crossentropy, the activation function of the output layer is usually the softmax function.

L_i = Σ_{j ≠ i} max(0, predicted_j − truth_i + 1)    (2.5)

L_i = −truth_i + log Σ_j e^(predicted_j)    (2.6)

Multiple sets of parameters, i.e. weights, can achieve the same loss. The set with the smallest parameter values is preferred. To take this preference into account, an additional value, the regularization (Equation 2.7), is added to the average loss (Equation 2.8). The amount of regularization can be adjusted through the regularization strength, denoted λ.

R = λ Σ_k Σ_l W_{k,l}²    (2.7)

L = (Σ_i L_i) / N + R    (2.8)
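A small numeric sketch of Equations 2.6 to 2.8, using plain Python floats; the function names mirror the symbols above and are otherwise assumptions.

```python
import math

# Categorical crossentropy (Equation 2.6) and the L2 regularization
# term (Equation 2.7) added to the loss (Equation 2.8).

def crossentropy(predicted, truth_idx):
    # L_i = -predicted[truth] + log(sum_j exp(predicted[j]))
    return -predicted[truth_idx] + math.log(sum(math.exp(p) for p in predicted))

def l2(weights, lam):
    # R = lambda * sum of squared weights
    return lam * sum(w * w for row in weights for w in row)

scores = [2.0, 1.0, 0.1]
loss = crossentropy(scores, truth_idx=0)
# this equals -log(softmax probability of the true class), so it is >= 0
assert abs(loss + math.log(math.exp(2.0) / sum(math.exp(s) for s in scores))) < 1e-9

total = loss + l2([[0.1, -0.2], [0.3, 0.0]], lam=5e-3)
assert total > loss  # regularization always adds a non-negative penalty
```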

Backpropagation and gradient descent

A NN gives the best results when its parameters (weights and biases) have reached values that are optimal for classifying the problem. The difficulty lies in finding those optimal values. Parameters are initialised randomly and are far from optimal. To reach optimal values, the parameters have to be updated in such a way that the loss gets smaller. This process is called training a network.

2The objectively correct value(s).


This mechanism amounts to minimising the loss by adjusting the parameters. This can be done by taking proportional steps in the direction of the negative gradient of the loss function with respect to the weights and the biases; this is called gradient descent. Because parameters are present in every layer and layers are chained together, there is no way to calculate the gradients of all parameters in one step. The way to overcome this is to use the derivative chain rule.

The first step is calculating the gradient of the loss with respect to the parameters in the output layer. The next step is calculating the gradient of the output layer with respect to the parameters in the last hidden layer. Because the gradient of the loss with respect to the parameters in the output layer was calculated in the previous step, we can determine the gradient of the loss with respect to the parameters in the last hidden layer (Equation 2.9). This step is applied recursively to find the gradient of the loss with respect to all the parameters (Equation 2.16). The process is called backpropagation.

dloss/dparam_hidden = dloss/dparam_out ∗ dparam_out/dparam_hidden (2.9)
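A minimal numeric illustration of this chain rule on a two-parameter scalar "network" (entirely hypothetical values, checked against a numerical gradient):

```python
# A scalar "network": h = w1*x, y = w2*h, loss = (y - t)^2.
x, t = 2.0, 1.0
w1, w2 = 0.5, -0.3

h = w1 * x
y = w2 * h
dloss_dy = 2 * (y - t)        # gradient at the output
dloss_dw2 = dloss_dy * h      # output-layer parameter gradient
dloss_dh = dloss_dy * w2      # propagate back through the output layer
dloss_dw1 = dloss_dh * x      # hidden-layer parameter gradient, via the chain rule

# Verify against a central-difference numerical gradient.
eps = 1e-6
loss = lambda a, b: (b * (a * x) - t) ** 2
num = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
print(abs(dloss_dw1 - num) < 1e-6)  # True
```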

The gradient descent step is scaled by a small factor to restrain it from jumping over the minimum. For NNs this factor is called the learning rate; it determines the speed at which the NN trains. Parameters are updated using the gradient and the learning rate (Equation 2.17). More advanced forms of gradient descent are discussed in subsection 2.3.4.

The data given to a NN during training is called the training set. The training set is split into multiple pieces, so-called batches. The loss is calculated after providing the NN with a batch; this is called an iteration. When all batches of the training set have been passed to the network, one epoch has completed. The training phase of a NN consists of multiple epochs.
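The batch/iteration/epoch bookkeeping can be sketched as follows (`train_step` is a hypothetical callback standing in for the forward pass, loss computation and parameter update):

```python
import numpy as np

def run_epochs(X, y, batch_size, n_epochs, train_step):
    # One iteration = one batch; one epoch = every batch of the training set once.
    n = len(X)
    iterations = 0
    for _ in range(n_epochs):
        order = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            train_step(X[idx], y[idx])            # forward pass, loss, update
            iterations += 1
    return iterations

X, y = np.zeros((100, 4)), np.zeros(100)
n_iters = run_epochs(X, y, batch_size=32, n_epochs=3, train_step=lambda xb, yb: None)
print(n_iters)  # 12: ceil(100/32) = 4 batches per epoch, times 3 epochs
```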

The number of hidden layers, the learning rate, the regularization strength and the batch size of a NN are hyperparameters. These are parameters that have to be tuned manually. Hyperparameters are optimized by searching over a fixed set of values. Research has shown that it is even more efficient to use random search [12].

2.2.2 Convolutional Neural Networks

Training a NN on large input data, such as images, requires many nodes and thus many parameters. This vastly increases the training time and memory requirements, which is very undesirable. CNNs provide a solution to these problems by making use of two other types of layers. The main type is the convolutional (conv) layer. Each node of the convolution layer is only connected to a subregion of the previous layer; this property is called local connectivity. Additionally, the nodes are organised into groups, where each group shares its parameters. This property is called parameter


sharing. Both properties lower the memory requirements and computational time. The other type of layer is the pooling (pool) layer, which subsamples the data. CNNs also make use of the standard hidden layers; in this context, they are called fully connected (fc) layers.

Figure 2.4: Representation of the 3D structure of CNNs.

The convolution layer consists of nodes structured in a 3D volume that receives and outputs 3D data (Figure 2.4). Each node has the same depth as the input and a certain width and height determined by the layer's field size. The spacing between the nodes is defined by the stride of the convolution layer; this stride indirectly determines the width and height of the output. The implementation of a node boils down to a small filter operation, a dot product between the input and the weights. To make everything work correctly, a border of zeros enlarges the width and height of the input. The size of this border is called the zero-padding. The validity of a combination of stride, field size and zero-padding can be determined with a simple formula (Equation 2.10). If the division does not yield an integer, the given configuration cannot be used.

output = (input − fieldsize + 2 ∗ zeropad) / stride + 1 (2.10)

The properties of a convolution layer can be written compactly (Equation 2.11). For example, a conv3-64 layer has a field size of 3 and produces an output of depth 64, meaning that there are 64 filters, or sets of weights. Mostly a stride of 1 is used, which, for a field size of 3, leads to a zero-padding of 1 to preserve the width and height. With a 224x224x3 input, the output of the above example is 224x224x64.

conv[fieldsize]− [#filters] (2.11)
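A small helper, assuming the standard output-size rule (the division of Equation 2.10 plus one, which reproduces the conv3-64 example), can verify such configurations:

```python
def conv_output_size(input_size, field_size, zero_pad, stride):
    # Output width/height of a convolution layer; the division must be exact.
    numer = input_size - field_size + 2 * zero_pad
    if numer % stride != 0:
        raise ValueError("invalid combination of stride, field size and zero-padding")
    return numer // stride + 1

# The conv3-64 example: 224x224 input, field size 3, zero-padding 1, stride 1.
print(conv_output_size(224, 3, 1, 1))  # 224
```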

It is common to insert pooling layers between the convolution layers. These layers subsample the data to lower the number of parameters and decrease training time. Furthermore, the pooling layers reduce overfitting and add some


translation invariance to the network. The nodes in the pooling layer are structured in a 3D volume, like in the convolution layer. Each node sees a small region in a single depth slice; the size of this region is defined by the field size. The spacing between the nodes is determined by the stride, the same as with convolution layers. The nodes perform an operation that requires no additional parameters. Most of the time this operation is a “max” (Figure 2.5), but an “average” can also be used. Additional operations are discussed in subsection 2.3.6.

Figure 2.5: Simplified example of a pooling layer. 3
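A naive version of such a “max” pooling operation over one depth slice might look like this (a sketch assuming the field tiles the input exactly):

```python
import numpy as np

def max_pool(x, field, stride):
    # Each output node takes the maximum over a field x field region of one slice.
    h = (x.shape[0] - field) // stride + 1
    w = (x.shape[1] - field) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+field, j*stride:j*stride+field].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 9, 1, 0],
              [2, 3, 4, 5]])
print(max_pool(x, field=2, stride=2))
# [[6. 7.]
#  [9. 5.]]
```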

2.3 State-of-the-art deep convolutional neural networks

This section discusses the different methods, tricks and structures that were found during the literature research. Each of them solves a problem, leading to better performance of the network. They will serve as possible optimisation methods for the final NN.

2.3.1 Data augmentation and preprocessing

Data augmentation is an essential part of deep learning. A NN performs better the more image data it is given. By augmenting existing images, the available data is enlarged without creating duplicates, leading to increased performance of the NN [13].

Common practice is to crop the image to a certain fixed size. All possible crop positions can be taken into account, leading to a potentially huge increase in data [13]. This data can be doubled in size by horizontally flipping it. Another method is scaling the image to the fixed size [13]. All these methods can be combined with each other.

The next step is the preprocessing of the data. The simplest form is subtracting the mean of all images, which zero-centers the data. It is important to note that this mean is calculated over the training images only, but subtracted from

3Image source: Stanford course CS231n.


the samples in the validation and test set as well. Mean subtraction is the first step in global contrast normalization. It is followed by normalization, which is performed by dividing each data point by its standard deviation [14]. A more advanced method for preprocessing is Principal Component Analysis (PCA-whitening) [15]. PCA-whitening is used to decorrelate the various features. Unfortunately, PCA-whitening leads to loss of localization information due to being a global filter and applying rotations [16]. This is not helpful when using a CNN, because a CNN relies on local connectivity. To bypass this issue, Zero-phase Component Analysis (ZCA-whitening) can be used instead [16] [17].
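The mean-subtraction and normalization steps can be sketched as below, using a single global mean and standard deviation for simplicity (one variant among several; the function name is my own). The statistics come from the training set only but are applied to the other sets as well:

```python
import numpy as np

def global_contrast_normalize(train, other, eps=1e-8):
    # Zero-center with the training-set mean, then divide by the training-set
    # standard deviation; eps avoids division by zero for constant data.
    mean = train.mean()
    std = train.std() + eps
    return (train - mean) / std, (other - mean) / std

train = np.random.rand(10, 8, 8).astype(np.float32)
val = np.random.rand(4, 8, 8).astype(np.float32)
train_n, val_n = global_contrast_normalize(train, val)
print(abs(train_n.mean()) < 1e-4)  # True: the training data is zero-centered
```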

2.3.2 Overfitting

Neural networks can have a huge number of parameters. This increases the risk of the NN learning the noise in the data during training, especially if there is not a lot of data. It leads to an increased disparity between the training and validation loss/accuracy. This phenomenon is described as overfitting of the model and plays a big part in the engineering of a NN.

The standard way of reducing overfitting is using regularization (subsubsection 2.2.1). For complex data structures, this isn't enough. Another type of regularization is dropout [18]. Dropout deactivates a percentage of the nodes in a given layer during a training iteration (Figure 2.6) and ultimately reduces overfitting. The deactivated nodes are randomly chosen each iteration and do not participate in the training process for the time they are disabled. The optimal percentage for dropout is 50% [19].

Figure 2.6: Dropout deactivates a percentage of the nodes
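Dropout itself is only a random mask; here is a sketch of the common "inverted dropout" variant, which rescales the surviving nodes so that no change is needed at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    # During training, deactivate a fraction p of the nodes and rescale the
    # survivors by 1/(1-p); at test time the layer is a no-op.
    if not training:
        return x
    mask = (rng.uniform(size=x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones(1000)
y = dropout(x, p=0.5)
print(0.4 < (y > 0).mean() < 0.6)  # True: roughly half the nodes stay active
```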

Another way of reducing overfitting is adding noise to the gradient (Equation 2.12) [20]. This noise is Gaussian and epoch dependent. Additionally, it helps a lot with simple or bad network initializations, so that optimisation of the initialization is not strictly required.


grad = grad + G(0, η / (1 + t)^0.55) (2.12)

G = Gaussian distribution, η ∈ {0.01, 0.3, 1.0}, t = epoch
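A sketch of this annealed gradient noise, treating the second argument of G in Equation 2.12 as the variance (my reading of the notation; names are hypothetical):

```python
import numpy as np

def noisy_gradient(grad, t, eta=0.3, rng=np.random.default_rng(0)):
    # Add epoch-dependent Gaussian noise whose variance eta / (1+t)^0.55
    # shrinks as training progresses.
    sigma = (eta / (1 + t) ** 0.55) ** 0.5
    return grad + rng.normal(0.0, sigma, size=np.shape(grad))

g = np.zeros(10000)
early = noisy_gradient(g, t=0)    # strong noise at the first epoch
late = noisy_gradient(g, t=500)   # noise has almost vanished
print(early.std() > late.std())   # True: the noise decays with the epoch number
```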

2.3.3 ReLU Variations

The choice of activation function matters as much as preprocessing in terms of training speed. Since its introduction, the Rectified Linear Unit (ReLU) (Equation 2.4) has dominated in CNNs due to its simplicity and its solution to the vanishing gradient problem.

Nevertheless, ReLUs suffer from a problem where, in rare cases, a large negative gradient passes through them and pushes the weights in such a way that the node gets deactivated. This can lead to a large part of the NN deactivating and serving as “dead weight”. The solution to this problem is using a small negative slope on the negative side of the ReLU instead of zero. This is called a Leaky ReLU (LReLU) (Figure 2.7) (Equation 2.13) [21].

f(x) = { a ∗ x, if x < 0; x, if x ≥ 0 } (2.13)

Since its introduction, a couple of variants have popped up. One of them is the Exponential Linear Unit (ELU) (Equation 2.14) [22]. It uses an exponential function in the negative part of the ReLU instead of a small slope a. It also allows nodes to filter out noise by saturating quickly for large negative inputs. The derivative of the ELU can be computed without problems for use in backpropagation (Equation 2.15).

f(x) = { a ∗ (e^x − 1), if x < 0; x, if x ≥ 0 } (2.14)

f′(x) = { f(x) + a, if x < 0; 1, if x ≥ 0 } (2.15)
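The three activations side by side, as a sketch in NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # Equation 2.13: small slope a on the negative side
    return np.where(x < 0, a * x, x)

def elu(x, a=1.0):
    # Equation 2.14: exponential saturation towards -a on the negative side
    return np.where(x < 0, a * (np.exp(x) - 1.0), x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # negatives clipped to zero
print(leaky_relu(x))  # negatives scaled by a
print(elu(x))         # negatives saturate towards -a
```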

Another popular variant of the LReLU is the Parametric ReLU (PReLU) [23]. It builds on the LReLU by making the alpha a learnable parameter. To suppress an explosion of additional parameters, it is suggested to keep the parameter layer-wise. This trick should reduce training time.


Figure 2.7: Left: A normal ReLU. Right: A LReLU with a small negative slope ’a’. 4

A third variant of the LReLU is the Random ReLU (RReLU) [24]. It is identical to the LReLU, except that a random value for alpha is chosen from a fixed set of values during each training iteration. At test time, the average of the fixed set of values is taken and set as the alpha of each node. The RReLU achieves the best performance compared to the LReLU, PReLU and normal ReLU [24]. Due to its random nature, the RReLU also reduces overfitting.

2.3.4 Gradient descent

Gradient descent itself can be improved considerably through the use of more complex functions. The simple, yet effective, method of momentum (Equation 2.18) is based on the intuition that a ball rolling down a hill should accelerate, not move at a constant speed. The most popular variant is Nesterov momentum (Equation 2.19) [25], which brings better performance by looking ahead instead of using only the current gradient. The disadvantage of momentum, in both versions, is that it can overshoot, increasing the possibility of getting stuck in a local minimum.

grad = dloss / dparam (2.16)

param = param − learnrate ∗ grad (2.17)

Vanilla gradient descent

v = µ ∗ v − learnrate ∗ grad
param = param + v (2.18)

µ ∈ [0.0, 1.0], mostly µ = 0.9

4Image source: Stanford course CS231n.


Momentum-based gradient descent

v_previous = v
v = µ ∗ v − learnrate ∗ grad
param = param − µ ∗ v_previous + (1 + µ) ∗ v (2.19)

Nesterov momentum-based gradient descent
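Both update rules can be sketched directly from Equations 2.18 and 2.19; the toy run below minimises f(p) = p² with Nesterov momentum (hypothetical settings):

```python
def momentum_step(param, v, grad, lr=0.01, mu=0.9):
    # Classical momentum (Equation 2.18)
    v = mu * v - lr * grad
    return param + v, v

def nesterov_step(param, v, grad, lr=0.01, mu=0.9):
    # Nesterov momentum in the form of Equation 2.19
    v_prev = v
    v = mu * v - lr * grad
    return param - mu * v_prev + (1 + mu) * v, v

# Toy run: minimise f(p) = p^2, whose gradient is 2p.
p, v = 5.0, 0.0
for _ in range(200):
    p, v = nesterov_step(p, v, grad=2 * p, lr=0.05)
print(abs(p) < 1e-3)  # True: converged close to the minimum at 0
```

Swapping in `momentum_step` gives a similar, slightly slower descent on this toy problem.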

Another approach to achieving better performance in gradient descent is tuning parameters locally instead of globally with one rate. Adagrad (Equation 2.20) makes it possible to increase the learning rate of parameters that are updated infrequently and decrease that of parameters which update frequently [26]. Unfortunately, the nature of the formula makes it too aggressive for use in deep NNs.

cache = cache + grad²
param = param − learnrate ∗ grad / (√cache + ε) (2.20)

RMSprop (Equation 2.21) tries to fix the aggressiveness of Adagrad by using a moving average over the gradients [27]. An extra parameter, the decay rate, is used and typically takes a value from {0.9, 0.99, 0.999}.

cache = decay ∗ cache + (1 − decay) ∗ grad²
param = param − learnrate ∗ grad / (√cache + ε) (2.21)

Another adaptation of Adagrad is Adadelta (Equation 2.22) [28]. By using the magnitude of recent gradients and update steps, it achieves even faster results. Like RMSprop it keeps an exponential moving average over the gradients, but it also keeps one over the update steps.

cache = decay ∗ cache + (1 − decay) ∗ grad²
step = decay ∗ step + (1 − decay) ∗ ∆param²
param = param − (√(step + ε) / √(cache + ε)) ∗ learnrate ∗ grad (2.22)

Here ∆param denotes the parameter update of the previous iteration.
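As one concrete example of these adaptive schemes, the RMSprop update of Equation 2.21 in NumPy (a sketch with hypothetical settings):

```python
import numpy as np

def rmsprop_step(param, cache, grad, lr=0.01, decay=0.9, eps=1e-8):
    # Equation 2.21: moving average over squared gradients scales the step
    # per parameter.
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy run: minimise f(p) = p^2 with a per-parameter adaptive step.
p, cache = 5.0, 0.0
for _ in range(2000):
    p, cache = rmsprop_step(p, cache, grad=2 * p)
print(abs(p) < 0.1)  # True: settles in a small band around the minimum
```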

Ultimately, every algorithm tries to use learning rate decay. This means that the learning rate of the network is much higher at the beginning of training than later on in the training session. This adaptation depends on the number of epochs the network has gone through. Equation 2.23 shows the most


commonly used formulas, with “t” representing the number of epochs and “k” a hyperparameter.

learnrate = learnrate − k
learnrate = learnrate ∗ e^(−k ∗ t)
learnrate = learnrate / (1 + k ∗ t) (2.23)
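The three schedules as small helpers (pure Python; `lr0` denotes the initial learning rate):

```python
import math

def step_decay(lr, k):
    # learnrate = learnrate - k, applied once per epoch
    return lr - k

def exponential_decay(lr0, k, t):
    # learnrate = lr0 * exp(-k*t)
    return lr0 * math.exp(-k * t)

def inverse_decay(lr0, k, t):
    # learnrate = lr0 / (1 + k*t)
    return lr0 / (1 + k * t)

lr0 = 0.01
print(exponential_decay(lr0, k=0.1, t=10))  # ~0.00368
print(inverse_decay(lr0, k=0.1, t=10))      # 0.005
```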

2.3.5 Network architecture

After introducing all these methods, the question remains of how to put everything together. Three different architectures, using state-of-the-art methods, are discussed on the following pages. These architectures have all participated successfully in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [29].

The runner-up of ILSVRC 2014 was VGG [30]. The architecture of their 11- to 19-layer network consists of blocks of convolution layers followed by a max-pooling layer. These blocks are repeated five times, followed by three fc layers. All the convolution layers have a stride of 1, zero-padding of 1 and a field size of 3. The reason behind the field size is that multiple convolution layers of field size 3 are functionally the same as one convolution layer with a bigger field size. The advantage of this approach is the reduced number of parameters needed for training. Additionally, more activation functions are present, which leads to better non-linearity. The number of filters in the convolution blocks is respectively 64, 128, 256 and 512. The activation function used after every convolution layer is a simple ReLU. The last two fc layers have a length of 4096 and a dropout rate of 50%. The training was done with a momentum of 0.9, a batch size of 256 and a starting learning rate of 0.01; the learning rate was lowered three times by a factor of 10 during training. The standard data augmentations of cropping and horizontal flipping were performed. Additionally, the network was trained with different scales of the dataset, before the crop.

At the time of writing, the performance of VGG has been surpassed by the winner of ILSVRC 2014, GoogLeNet [31]. GoogLeNet uses a special structure in its architecture, the inception module. These modules consist of parallel convolution layers of field sizes 1, 3 and 5 and a max-pooling layer of field size 3. Additionally, a convolution layer of field size 1 is placed before the field size 3 and 5 convolution layers and after the max-pooling layer to reduce the amount of computation. At the end, the outputs of the parallel layers are concatenated. The inception modules are placed in pairs and followed by a max-pooling layer of stride 2 and field size 3. A ReLU activation function is only used in the convolution layers of field size 1; the other convolution layers lack activations. It is curious to note that the architecture starts with a convolution layer of field size 7 instead of using three layers of field size 3, which would do the same job with fewer parameters; no explanation is given for this choice. Another curiosity is the claim that an average pool


layer before the last fc layer is more effective, precision-wise, than a fc layer. The data augmentation follows the same trend as VGG: a combination of resizing and cropping is used to increase the amount of data in the dataset. There is no mention of horizontal flipping. The learning rate is gradually decreased by 4% every 8 epochs. A momentum of 0.9 is used for gradient descent.

ILSVRC 2015 was won with a deep residual framework [32]. The Microsoft Research Asia (MSRA) team combined the philosophy of the VGG network, the practices of GoogLeNet and the ideas of Highway Networks [33]. The residual framework starts, in the fashion of GoogLeNet, with a single convolution layer with field size 7 and stride 2. The framework ends with an average pooling layer instead of fc layer(s), to reduce the number of parameters. The core consists of building blocks made from a max-pooling layer and a couple of convolution layers of field size 3. To retain the degree of complexity, the number of filters used in the convolution layers is doubled after every max-pooling layer, just like in VGG. MSRA observed that increased depth in such a network did not lead to a decreased error, but a bigger error instead, meaning that a shallower network could perform better than a deeper one. This is where the residual framework came in. The idea is to create a shortcut between the blocks and sum the input with the output of that block (Figure 2.8) [34]. Features that are found early in the network are thereby transferred through the deeper layers and eventually to the output layer, without being lost along the way. This does not introduce more parameters and is computationally fast. To make the sum possible, zero-padding is used to keep the shortcuts parameter free. Another solution in use is convolution layers with field size 1 before and after the conv3 layers. By choosing a smaller number of filters for the first conv1 layer, the number of parameters is reduced and thus also the computational cost. The last conv1 layer in the block serves to scale the dimensions of the data back up, by using a bigger number of filters, to allow the shortcut sum. Nonlinearity is applied through a ReLU after all convolution layers in the block except the last one, and after the sum of the shortcut and the output of the block.
The network makes use of batch normalisation [35] instead of dropout for the ILSVRC task. In the end, the architecture of residual frameworks allows for a varied number of layers, up to more than a thousand.


Figure 2.8: Left: the standard block used in the residual framework. Right: a block with conv1 layers that scale the dimensions of the data to reduce computations, used in larger variants of the residual framework. 5

2.3.6 Pooling layers

All the architectures in subsection 2.3.5 use input images of size 224x224. The original data is larger than this, meaning that cropping and resizing are used to scale it down to 224x224. This can lead to a loss of precious information. Spatial Pyramid Pooling (SPP) allows the use of images of varied sizes, without changing the dimensionality of the fully connected layers (Figure 2.9) [36]. It converts the data to a vector of fixed size by making use of spatial pyramids [37]. This is based on the Bag-of-Words model.

Figure 2.9: Principle of SPP. 6

Another pooling layer from the literature is the fractional max-pooling layer [38]. It uses pooling filters of size [1, 2]x[1, 2], where the size of each filter is chosen randomly. The filters can be reshuffled every epoch. If used at the beginning of a network, this can lead to increased spatial invariance by serving as randomized data augmentation through warping of the image (Figure 2.10).

5 Image source: [32]
6 Image source: [36]


Figure 2.10: Demonstration of the fractional max pooling layer with an image of parrots.7

7Image source: [38]


Chapter 3

Experiment setup

The experiment serves as a proof of concept for using thermal imaging to identify the lubrication level of a bearing, imbalance faults and outer raceway faults. There are 4 bearing conditions, and each of them can occur balanced or imbalanced. Table 3.1 gives an overview of all the classes.

Class name | Bearing condition                | Is balanced?
HB         | Healthy bearing                  | Yes
MILB       | Mildly in-lubricated bearing     | Yes
EILB       | Extremely in-lubricated bearing  | Yes
ORF        | Bearing with outer raceway fault | Yes
HB IM      | Healthy bearing                  | No
MILB IM    | Mildly in-lubricated bearing     | No
EILB IM    | Extremely in-lubricated bearing  | No
ORF IM     | Bearing with outer raceway fault | No

Table 3.1: Abbreviations of all the classes.

3.1 Data

The data used is a set of videos made with a thermal camera. Each video is recorded at a rate of 6 frames per second, is 10 minutes long and has a resolution of 640x480. It contains a thermal view of an active bearing housing in a test setup (Figure 3.1). The position of the camera in the test setup is nearly identical in every video, with small variations in height, rotation and distance from the bearing. Each frame also shows a thermocouple, which helps to determine the ambient temperature at the time of recording, and the FLIR logo.


Figure 3.1: An example of a frame from a video.

There are 5 videos available for each class. The individual frames of one video are almost completely identical; the only visible movement is in the reflections on the surface of the rotating shaft. The bearing experiences small micro-vibrations that are not visible to the eye. These vibrations act as a form of data augmentation, as they translate the bearing by a few pixels.

Exactly one frame is extracted every 6 frames, resulting in 600 frames per video, 3000 frames per class and 24000 frames in total. Note that frames from a single video are extremely similar, making randomly sampled cross-validation very prone to overfitting. The frames of a single video must be used in only the training set or only the validation set. Because there are 5 separate videos for each class, 5-fold cross-validation is used. A validation set corresponds to a set of 8 groups of frames (one for each class), while a training set corresponds to a set of 24 groups of frames (four for each class). All the frames of a group come from a single video of a bearing.

3.2 Preprocessing

Each frame undergoes a series of preprocessing steps before being used in a NN. This is necessary to make the training as fast and efficient as possible, and is accomplished by removing unneeded information and simplifying the frame for the classifier. These are the steps:

1. Reduce image dimensions.


2. Subtract ambient temperature.

3. Crop unnecessary information.

4. Scale values.

5. Subtract mean value.

3.2.1 Reducing dimension

During the extraction of the frames, the resolution of each image is lowered from 640x480 to 160x120. This is done to compensate for the memory limitations of the GPU. Because heat regions are more important than fine detail, no essential information is lost. The scaling is done with bilinear interpolation.

3.2.2 Ambient temperature

In the second step, the ambient temperature value is subtracted from the frame. This temperature shift makes the temperature values relative instead of absolute by eliminating external thermal sources. It also removes background noise in the darker regions by forcing them to zero. This is realised with the thermocouple found in each image, whose pixel value represents the ambient temperature. The deviation of this temperature during a video does not exceed a pixel value of 2-3 compared to the first frame. For simplicity, the thermocouple pixel is read once, in the first frame of the video, and the value is applied to the remaining frames. The thermocouple is indicated by the red crosshair in Figure 3.2.
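A sketch of this step (the example coordinates and the clipping of negative values to zero are my assumptions about the implementation):

```python
import numpy as np

def subtract_ambient(frames, thermo_xy):
    # Read the ambient temperature once, from the thermocouple pixel in the
    # first frame, and subtract it from every frame of the video.
    x, y = thermo_xy
    ambient = frames[0][y, x]
    return np.clip(frames - ambient, 0, None)   # negatives forced to zero

frames = np.full((3, 120, 160), 40.0)
frames[:, 60, 80] = 25.0                        # hypothetical thermocouple pixel
out = subtract_ambient(frames, thermo_xy=(80, 60))
print(out[0, 0, 0], out[0, 60, 80])  # 15.0 0.0
```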


Figure 3.2: Unedited frame with a red crosshair showing the thermocouple and a bluecrosshair showing the alignment point.

The coordinate of the thermocouple is extracted manually from each video. Increasing the contrast and brightness facilitated the process of finding the location of the thermocouple (Figure 3.3).

Figure 3.3: High contrast frame with the thermocouple visible on the right side.


3.2.3 Cropping

The third step of the preprocessing is the trickiest one. Three items have to be removed from each frame to eliminate most of the unnecessary and misleading information: the ”FLIR” logo, the thermocouple and the rotating shaft. The shaft is removed because its reflectivity makes it provide no useful information.

The problem with removing the thermocouple and the shaft is that the position of the bearing is slightly different in each video. This makes it impossible to remove them using a simple threshold without cropping too much of the housing.

The position of the housing has to be aligned across all the videos. This can be done manually or automatically. Both options are explored in the following paragraphs.

Manual alignment

In each of the videos, the housing contains a certain dent that serves as a reference point. The position of this dent is shown with a blue crosshair in Figure 3.2. Going through the videos manually, the coordinate of the center of the dent is saved. Knowing these coordinates allows each frame to be repositioned by shifting it in the horizontal and vertical direction (Figure 3.4). Additionally, the smallest and largest horizontal and vertical positions of the reference points are used to determine the optimal cropping size for the whole dataset. This means none of the images contain a black strip.

Figure 3.4: Cropped frames from two different videos that were repositioned by more than 90 px.
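A simplified sketch of this shift-and-crop logic (pure translation, no pitch/yaw/roll, as described; names are hypothetical):

```python
import numpy as np

def align_and_crop(frames, ref_points):
    # Shift each frame so its reference dent lands on a common position, then
    # crop to the largest region that is valid for every video (no black strips).
    ref_points = np.asarray(ref_points)           # one (x, y) dent per video
    target = ref_points.min(axis=0)               # common position after shifting
    shifts = ref_points - target                  # per-video (dx, dy)
    h, w = frames[0].shape
    out_w = w - shifts[:, 0].max()
    out_h = h - shifts[:, 1].max()
    return [f[dy:dy + out_h, dx:dx + out_w] for f, (dx, dy) in zip(frames, shifts)]

frames = [np.zeros((120, 160)), np.zeros((120, 160))]
out = align_and_crop(frames, ref_points=[(50, 60), (70, 75)])
print(out[0].shape, out[1].shape)  # (105, 140) (105, 140)
```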

The overall result of this method is good, but far from perfect. The position of the housing is corrected in the horizontal and vertical direction, but pitch, yaw


and roll are not taken into account. A minority of the videos show clear indications that yaw correction is needed to perfectly normalize the position. Despite this shortcoming, it was possible to crop all the unnecessary items with a simple threshold method (Figure 3.4).

Automatic alignment

Using homography, it is possible to reposition all the frames to the position of a reference frame. A local feature detector/descriptor is used to find key points, which are then matched with the key points of the reference frame. The homography matrix is found using RANSAC [39].
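The thesis uses OpenCV's RANSAC-based homography estimation; as a self-contained illustration of the RANSAC idea itself, the toy below robustly estimates a pure 2D translation between matched key points despite an outlier (all values hypothetical):

```python
import numpy as np

def ransac_translation(src, dst, n_iter=200, tol=2.0, rng=np.random.default_rng(0)):
    # Toy RANSAC: repeatedly fit a model from a minimal sample (here one match
    # suffices for a translation), count inliers, keep the best, refit on inliers.
    best_t, best_inliers = None, -1
    for _ in range(n_iter):
        i = rng.integers(len(src))
        t = dst[i] - src[i]
        residuals = np.linalg.norm(dst - (src + t), axis=1)
        inliers = (residuals < tol).sum()
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    mask = np.linalg.norm(dst - (src + best_t), axis=1) < tol
    return (dst[mask] - src[mask]).mean(axis=0)

src = np.array([[10., 10.], [40., 20.], [15., 50.], [60., 60.], [5., 5.]])
dst = src + np.array([7., 3.])
dst[4] = [90., 90.]                    # one bad match (outlier)
print(ransac_translation(src, dst))    # [7. 3.]
```

A full homography needs four point correspondences per minimal sample instead of one, which is what `cv2.findHomography(..., cv2.RANSAC)` does internally.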

Each video has two large bolts in the foreground that are oriented differently from video to video. Additionally, the reflections in the shaft differ from frame to frame. Key points generated by the feature detector/descriptor algorithm have to be removed if they are positioned in those regions (Figure 3.5). This removal can only happen in the reference frame, as the position of the housing differs in each video.

Figure 3.5: Matches between reference frame (right) and image (left) using KAZE.

Some videos contain a lot of key points, while others contain very few. Some videos had all their key points positioned on the two large bolts and the shaft. This leads to a completely distorted homography, making those videos unusable in the classifier. Different feature detector/descriptor algorithms were tried to solve this problem (Table 3.2).


Algorithm | # of key points | # of matches after masking
ORB       | 4               | 0
KAZE      | 17              | 10
AKAZE     | 7               | 0
BRISK     | 2               | 0

Table 3.2: Comparison between feature detector/descriptor algorithms on a problematic video.

The methods ORB [40], BRISK [41] and AKAZE [42] were unable to find any key points matching the reference frame on some of the videos (Table 3.2). Only KAZE [43] was able to find matches in a non-problematic region (Figure 3.6). A few of those were bad matches, meaning that the homography was not very good.

Figure 3.6: Comparison between the key points of feature detector/descriptor algorithmson a problematic frame. Going horizontally from top left to bottom right:ORB, KAZE, BRISK and AKAZE.

The best homography matrices were obtained by using the original unedited frames. Despite this, black strips were present in some of the images, because of the big differences in height, pitch and yaw between the frame and the reference frame.


The frames with a low key point count were also not optimally transformed. The results for frames with a higher number of key points and matches were close to perfect.

Finally, the feature detector/descriptor approach was not used further, due to the possibility of irregularities depending on the choice of reference frame. Fine-tuning the automatic alignment is possible, but was not done, as perfect alignment between the videos is not needed; it is only needed to crop parts of the image. The manual alignment suffices for this job (Figure 3.7). This is the last step in creating the dataset.

Figure 3.7: Final preprocessed image of size 108x103, cropped using the manual method.

3.2.4 Scaling and mean subtraction

What follows is done at run time to make the training of the NN faster. For convenience, the pixel values of every image are scaled to be between 0 and 1. This is simply done by dividing by 255, the maximum value a pixel can take. Additionally, the mean over all images in the train set is taken and subtracted from every pixel of all images in the train and validation sets. This is done to center the pixel values around 0.
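The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration assuming 8-bit grayscale frames stacked into arrays; note that the mean is computed on the train set only and reused for the validation set.

```python
# Minimal sketch of the run-time preprocessing described above:
# scale 8-bit pixels to [0, 1], then subtract the per-pixel train-set mean.
import numpy as np

def preprocess(train, val):
    train = train.astype(np.float32) / 255.0   # pixel values into [0, 1]
    val = val.astype(np.float32) / 255.0
    mean = train.mean(axis=0)                  # per-pixel mean over the train set only
    return train - mean, val - mean            # centre both sets around 0

# Hypothetical shapes: 10 train and 4 validation frames of 103x108 pixels.
train = np.random.randint(0, 256, size=(10, 103, 108), dtype=np.uint8)
val = np.random.randint(0, 256, size=(4, 103, 108), dtype=np.uint8)
train_p, val_p = preprocess(train, val)
```

Subtracting the train-set mean from the validation set (rather than its own mean) keeps the two sets on the same scale without leaking validation statistics into training.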

3.3 Practical hardware & software setup

All the tests were conducted on a machine with 12 GB of RAM and a GPU with 6 GB of VRAM.

The software libraries used are OpenCV 3.0, NumPy 1.11, Theano 0.8, Lasagne 0.1 and Matplotlib 1.4.


Chapter 4

Results

In order to find the optimal NN architecture, simple NNs are tested first, followed by more advanced configurations. First, simple fully connected neural networks are tested. Some optimisations to the fully connected network are tried out to see if a better result can be achieved. Further tests are based on Convolutional Neural Networks. These are explored more thoroughly, as there are more tricks and hyperparameters to tune. The following optimisations of the architecture were performed:

• Number of filters per convolution layer

• Number of pooling layers

• Number of fully connected layers

• Number of convolution layers

• The activation function of all layers

• The use of batch normalization

4.1 Fully Connected Neural Networks

To find a good model, the depth of the network (2-6 layers) is varied using a reasonable number of neurons per layer (further referred to as the width of a fully connected layer). Additionally, narrower and wider networks are tested.

Nesterov momentum with a learning rate of 5e-3 is used as the parameter update function. The learning rate was chosen using grid search. Adadelta was also considered as a candidate, but its train/validation loss was worse than that of Nesterov momentum.

An L2 regularization strength of 5e-3 was applied. Not using L2 regularization yields worse results, while using a higher regularization strength causes a diverging validation loss. This is shown in Figure 4.1.
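The update rule can be illustrated on a toy 1-D problem. The sketch below uses the learning rate (5e-3) and L2 strength (5e-3) from the text; the momentum coefficient of 0.9 and the quadratic toy loss are assumptions for illustration only.

```python
# Sketch of the Nesterov-momentum update with L2 weight decay on a toy
# 1-D loss f(w) = (w - 2)^2. Momentum 0.9 is an assumed default; the
# learning rate and L2 strength follow the text.

LR, MU, L2 = 5e-3, 0.9, 5e-3

def grad(w):
    return 2.0 * (w - 2.0) + 2.0 * L2 * w   # loss gradient plus L2 term

w, v = 0.0, 0.0
for _ in range(2000):
    v_prev = v
    v = MU * v - LR * grad(w)               # velocity update
    w += -MU * v_prev + (1.0 + MU) * v      # Nesterov look-ahead correction

print(w)  # approaches the regularized optimum 2 / (1 + L2)
```

With L2 regularization the minimum shifts slightly below 2, which is why a very large strength pulls the weights too far toward zero and hurts the validation loss.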


(Plot omitted: validation loss over 80 epochs, one curve per regularization strength 5e-2, 5e-3 and 5e-4.)

Figure 4.1: Comparison of the same model between different regularization strengths.

The loss and accuracy in Table 4.1 are calculated using 5-fold cross-validation. The σ represents the standard deviation of the loss and accuracy respectively.
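The aggregation behind the Loss/σ and Accuracy/σ columns can be sketched with the standard library. The per-fold numbers below are the per-set medians from Table 4.2, used as a stand-in; whether the thesis uses the population or sample standard deviation is an assumption (population is used here).

```python
# Sketch: aggregating 5-fold cross-validation results into mean and sigma,
# one (loss, accuracy) pair per validation set.
from statistics import mean, pstdev

fold_losses = [1.82, 1.59, 1.35, 1.57, 1.91]   # per-set medians from Table 4.2
fold_accs = [37.5, 62.5, 87.5, 50.0, 25.0]

loss, loss_sigma = mean(fold_losses), pstdev(fold_losses)
acc, acc_sigma = mean(fold_accs), pstdev(fold_accs)
print(round(loss, 2), round(acc, 1))  # 1.65 52.5, matching the table's average row
```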

layer 1   layer 2   layer 3   layer 4   layer 5   layer 6   Loss   σ      Accuracy (%)   σ

  50        20                                              1.64   0.18   58.8           19.0
 100        20                                              1.68   0.17   58.8           21.3
 500       100                                              1.66   0.19   53.0           21.1
 200       100        50        20                          1.66   0.19   54.1           19.0
 500       200        80        20                          1.64   0.19   50.0           23.7
1000       400       200        50                          1.69   0.19   52.8           21.3
 200       150       100        50        20        20      1.69   0.18   50.8           24.2
 500       400       300       200       100        20      1.69   0.18   52.5           21.5
1000       600       300       200       100        50      1.68   0.19   49.8           23.9

Table 4.1: Tested fully connected models without dropout.

The validation set plays a big role in the validation accuracy and loss of the model. Depending on which set is chosen, the accuracy per bearing can be as low as 25% and as high as 87.5%. Table 4.2 displays the median validation loss and accuracy of the models in Table 4.1 for each validation set.


Validation Set   Validation Loss   Validation Accuracy (%)

1                1.82              37.5
2                1.59              62.5
3                1.35              87.5
4                1.57              50
5                1.91              25

Average          1.65              52.5

Table 4.2: Result median for each set using the fully connected models.

Remark that validation set 1 requires the full 300 epochs to start converging, while the rest of the sets needed no more than 70 epochs. Additionally, the best result for validation sets 3, 4 and 5 was achieved by stopping the learning process earlier, because of a steady increase of the validation loss after a certain number of epochs.

The disparity between the validation sets can also be seen in Figure 4.2. It is interesting to note that the training proceeds in an identical manner regardless of the validation set that is used. This can be seen in Figure 4.3.


(Plots omitted: validation loss and validation accuracy over 80 epochs, one curve per validation set 1-5.)

Figure 4.2: Validation loss and accuracy of the first network in Table 4.1 after 80 epochs.


(Plots omitted: train loss and train accuracy over 80 epochs, one curve per validation set 1-5.)

Figure 4.3: Train loss and accuracy of the first network in Table 4.1 after 80 epochs.

The NN starts overfitting after 10-15 epochs on all the validation sets. This can be seen by comparing the loss between the first graph in Figure 4.2 and Figure 4.3. The difference in loss between validation and training is very large, especially on validation set 5.

The confusion matrix for the first model in Table 4.1 is displayed in Figure 4.4. The model successfully identifies EILB, EILB IM and ORF (80%), but has a difficult time distinguishing the other bearing conditions, which sit around 45%.


Target \ Predicted   HB     MILB   EILB   ORF    HB_IM   MILB_IM   EILB_IM   ORF_IM

HB                   38 %    0 %    0 %    2 %   41 %     0 %      19 %       0 %
MILB                  0 %   40 %    0 %    0 %    0 %    39 %      20 %       1 %
EILB                  0 %    0 %   80 %    0 %    0 %    20 %       0 %       0 %
ORF                  20 %    0 %    0 %   80 %    0 %     0 %       0 %       0 %
HB_IM                48 %    0 %    0 %    0 %   52 %     0 %       0 %       0 %
MILB_IM               0 %   20 %   35 %    0 %    0 %    40 %       5 %       0 %
EILB_IM               0 %    0 %    0 %    0 %   20 %     0 %      80 %       0 %
ORF_IM               20 %    0 %    0 %   20 %    0 %     0 %       0 %      60 %

Figure 4.4: Confusion matrix of the first network in Table 4.1.

Using the exact same prediction result, but simplifying Figure 4.4 by taking balanced and imbalanced classes of the same bearing condition together, the confusion matrix in Figure 4.5 can be made. It is interesting to note that HB scores significantly higher in Figure 4.5 than in Figure 4.4 (HB/HB IM). The explanation for this phenomenon is that the NN fails to distinguish between HB and HB IM, but succeeds in classifying HB. This can also be seen in Figure 4.6, as the overall accuracy there is lower.

Target \ Predicted   HB     MILB   EILB   ORF

HB                   89 %    0 %   10 %    1 %
MILB                  0 %   69 %   30 %    1 %
EILB                 10 %   10 %   80 %    0 %
ORF                  20 %    0 %    0 %   80 %

Figure 4.5: Confusion matrix of the first network in Table 4.1. Balanced and imbalanced classes of the same condition are considered as one.


Taking the balanced and imbalanced classes together, the model struggles more with distinguishing the balance of the rotating machinery (Figure 4.6) than the bearing condition (Figure 4.5).

Target \ Predicted   Balanced   Imbalanced

Balanced             65 %       35 %
Imbalanced           36 %       64 %

Figure 4.6: Confusion matrix of the first network in Table 4.1. All classes with the same balance are considered as one.

The accuracy for detecting the bearing condition is 79.5%. The accuracy for imbalance detection is 64.5%. The overall achieved accuracy is 58.8%.

4.1.1 Use of dropout

In this section dropout is applied to the fully connected NNs (unlike the previous section). Reduced overfitting is expected. Dropout is used between all the layers. The rest of the hyperparameters are identical to the tests in the previous section.
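Inverted dropout, the common formulation, can be sketched in a few lines. This is an illustrative stand-in, not the Lasagne implementation used in the thesis: each activation is kept with probability (1 - p) and scaled by 1/(1 - p), so no rescaling is needed at test time.

```python
# Minimal sketch of inverted dropout between fully connected layers.
import random

def dropout(activations, p, rng):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
print(out)  # dropped units are 0.0, surviving units are doubled
```

At test time the layer is simply the identity, which is why the train-time scaling is applied.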

layer 1   layer 2   layer 3   layer 4   layer 5   layer 6   Loss   σ      Accuracy (%)   σ

 50        20                                               1.70   0.15   59.0           17.3
200       100        50        20                           1.73   0.18   55.7           20.6
500       400       300       200       100        20       1.72   0.08   64.0            9.4

Table 4.3: Fully connected models using dropout.

Including dropout in the models has an overall positive effect. The shallower networks do not experience any improvement, while the deeper networks show a significant accuracy increase. This can be explained by the use of more dropout in the deeper model.

Another curiosity is that the 6-layer models are less dependent on the validation set that is being used. This can clearly be seen in Table 4.3, looking at the σ for the accuracy and loss. Note that 6-layer models have to be stopped at a certain epoch, because the validation loss starts increasing and the accuracy decreasing. This is graphically shown in Figure 4.7.


(Plot omitted: validation loss over 300 epochs.)

Figure 4.7: Loss of a 6-layer model in Table 4.3.

The confusion matrices of the networks using dropout display a similar picture as the networks without dropout. The same phenomenon of the model failing to distinguish between balanced and imbalanced occurs, but this time it is also prominent in MILB and MILB IM (Figure 4.8). The overall accuracy of the model is 64.0%. This is an improvement compared to the best fully connected model without dropout from Table 4.1.

Target \ Predicted   HB     MILB   EILB   ORF    HB_IM   MILB_IM   EILB_IM   ORF_IM

HB                   20 %    0 %    0 %    0 %   60 %     0 %       0 %      20 %
MILB                  0 %   60 %    0 %    0 %   20 %    20 %       0 %       0 %
EILB                  0 %    0 %   40 %    0 %    0 %     0 %      60 %       0 %
ORF                  20 %    0 %    0 %   74 %    0 %     0 %       0 %       6 %
HB_IM                 0 %    0 %    0 %    0 %   98 %     0 %       2 %       0 %
MILB_IM               0 %   20 %    0 %    0 %   20 %    40 %      20 %       0 %
EILB_IM               0 %    0 %    0 %    0 %    0 %    20 %      80 %       0 %
ORF_IM                0 %    0 %    0 %    0 %    0 %     0 %       0 %     100 %

Figure 4.8: Confusion matrix of the 6-layer model in Table 4.3.


Correctly classifying MILB is the most difficult task for the model (Figure 4.9). This also applied to the model without the use of dropout (Figure 4.5). The accuracy for detection of the bearing condition is 84.8%.

Target \ Predicted   HB     MILB   EILB   ORF

HB                   89 %    0 %    1 %   10 %
MILB                 20 %   70 %   10 %    0 %
EILB                  0 %   10 %   90 %    0 %
ORF                  10 %    0 %    0 %   90 %

Figure 4.9: Confusion matrix of the 6-layer model in Table 4.3. Balanced and imbalanced classes of the same bearing condition are put together.

Dropout also achieved better accuracy for distinguishing between balanced and imbalanced models (Figure 4.10). The accuracy for imbalance detection is 74.5%.

Target \ Predicted   Balanced   Imbalanced

Balanced             54 %       46 %
Imbalanced            5 %       95 %

Figure 4.10: Confusion matrix of the 6-layer model in Table 4.3. All classes with the same balance are put together.

To visualise which parts of the input image have the most effect on the output of the network, guided backpropagation [44][45] is applied. Each saliency map is made by averaging the best (= network had highest confidence) correctly classified frames from each validation set. The result can be seen in Figure 4.11. The brighter spots influence the network output the most.


(Saliency maps omitted: eight panels titled "Absolute saliency for" HB, HB_IM, MILB, MILB_IM, EILB, EILB_IM, ORF and ORF_IM, each on the preprocessed image grid.)

Figure 4.11: Saliency maps of the correctly classified frames with the highest confidence by the 6-layer network.


The saliency maps between classes of the same bearing condition are very similar (Figure 4.11). The saliency map of HB is completely noisy and there are no particular regions with more bright spots. On all the other saliency maps there is a brighter spot visible in the middle of the image. This spot corresponds to the location of a large bolt on the housing. On the saliency maps of MILB, MILB IM, EILB and EILB IM there is a second bright spot on the bottom. This corresponds to the location of the other large bolt. These observations imply that the network is learning some classes based on the orientation of one or two bolts instead of the difference in temperature on the housing of the bearing. Note that the bolts are oriented differently in almost every video.

Lastly, the saliency maps of ORF and ORF IM have a bright arc on the right side of the image. This arc corresponds to the right-side outline of the housing.

4.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are the state of the art for image classification. Deep CNNs are considered top tier in generalising data based on images. Unfortunately, vanishing gradients can form a problem, and a deep network can perform worse than a shallow one. ReLUs circumvent vanishing gradients, but have problems of their own (subsection 2.3.3). To be able to justify the depth of the network and the different optimisation tricks, a build-up approach is needed.

The strategy is to start with a standard, medium-sized model. Hyperparameters are tweaked using grid search to identify the optimal structure of the network. The best performing model is carried over to the next tests.

Table 4.4 shows the starting network. All the convolution layers that are used in the tests have a filter size of 3. The reason is described in subsection 2.2.2. After every fully connected layer, dropout is applied. It is assumed that dropout prevents overfitting and has no detrimental effects.


Layer type   Filter size   Stride   Zero pad   Dropout

conv         3             1        /          No
conv         3             1        /          No
max pool     2             2        1          /
conv         3             1        /          No
conv         3             1        /          No
max pool     2             2        1          /
conv         3             1        /          No
conv         3             1        /          No
max pool     2             2        1          /
fc           256           /        /          Yes
fc                         /        /          Yes

Table 4.4: The starting network architecture.

The hyperparameters used in all the tests, unless specified otherwise, are: a regularization strength of 5e-3, a batch size of 600 and Nesterov momentum. 5-fold cross-validation is applied.

4.2.1 Number of filters

The first property to adjust in the standard model is the number of filters each of the convolution layers has. A learning rate of 5e-3 is used and the tests are run for 40 epochs. Table 4.5 shows the results of the best epoch of each model.

# of filters   Validation Loss   σ      Validation Accuracy (%)   σ

 2             1.49              0.27   47.6                      23.0
 4             1.43              0.24   57.2                      17.1
 8             1.50              0.29   50.1                      16.7
16             1.54              0.25   53.8                      22.1
32             1.51              0.27   53.1                      21.3

Table 4.5: Results on different numbers of filters in every convolution layer.

Increasing the number of filters per convolution layer to 16 or more gives no benefit. Having only 2 filters in a convolution layer yields the worst validation accuracy compared to the other models. The network with 4 filters per convolution layer has the best overall performance. The validation loss of all the models during the epochs can be seen in Figure 4.12.


(Plots omitted: validation loss and validation accuracy over 40 epochs, one curve per filter count 2, 4, 8, 16 and 32.)

Figure 4.12: Validation loss and accuracy comparison between CNNs with 2 to 32 filters per convolution layer.

Another possible structure for the number of filters per convolution layer is a linearly increasing or decreasing amount. Table 4.6 compares the results with the flat structure of 4 filters per convolution layer from Table 4.5.


# of filters         Validation Loss   σ      Validation Accuracy (%)   σ

4, 4, 4, 4, 4, 4     1.43              0.24   57.2                      17.1
4, 4, 8, 8, 16, 16   1.53              0.24   57.1                      19.0
16, 16, 8, 8, 4, 4   1.51              0.26   50.8                      21.6

Table 4.6: Comparison between flat, increasing and decreasing amount of filters in the three convolution layer groups.

Table 4.6 shows that an increasing amount of filters per convolution layer performs better than a decreasing amount. Having 4 filters per convolution layer still outperforms both of those structures. The validation loss during training is displayed in Figure 4.13.

(Plot omitted: validation loss over 40 epochs for the flat (4), increasing (4, 8, 16) and decreasing (16, 8, 4) filter structures.)

Figure 4.13: Loss comparison between CNNs with flat, increasing and decreasing amounts of filters per convolution layer.

Subsequent models have 4 filters per convolution layer.

4.2.2 Number of pooling layers

The maximum number of pooling layers that can be used is determined by the input image size. In this case, the size of the image is 103x108. The output size after 5 pooling layers is only 3x3, as Table 4.7 shows. This means that a maximum of 5 layers can be used, as more layers would generate an output size that is too small to be meaningful. It can be argued that the output size of 3x3 is also too small, setting the range of pooling layers to be tested between 0 and 4.


# pooling layers   Width   Height

0                  103     108
1                   51      54
2                   25      27
3                   12      13
4                    6       6
5                    3       3

Table 4.7: Size of output using n pooling layers with no padding.

The optimal number of pooling layers is tested by making use of a network with 6 convolution layers and 2 fully connected layers. The parameters of the layers are the same as in Table 4.4. Dropout is used after each fully connected layer. The different networks are shown in Table 4.8. A learning rate of 5e-3 is used and the tests are run for 40 epochs.

Network 0:   conv x6, fc x2 (no pooling)
Network 1:   conv x6, pool, fc x2
Network 2:   conv x3, pool, conv x3, pool, fc x2
Network 3:   conv x2, pool, conv x2, pool, conv x2, pool, fc x2
Network 4:   conv, pool, conv, pool, conv x2, pool, conv x2, pool, fc x2

Table 4.8: The network models for testing the optimal amount of pooling layers.

The performance of the networks in Table 4.8 is shown in Table 4.9. More pooling layers have a better effect on the validation loss. However, the model with 4 pooling layers has the worst overall accuracy and experienced an explosively diverging loss after the 14th epoch.


# of pooling layers   Validation Loss   σ      Validation Accuracy (%)   σ

0                     1.65              0.15   51.4                      17.9
1                     1.55              0.22   56.2                      17.7
2                     1.53              0.28   57.8                      21.4
3                     1.43              0.24   57.2                      17.1
4                     1.41              0.18   44.3                      12.

Table 4.9: Results using different numbers of pooling layers. Networks used are shown in Table 4.8.

Plotting the training and validation loss of the networks gives more insight into the training. The training loss plot in Figure 4.14 shows that an increasing number of pooling layers slows down the learning process and that the convergence point is the same for all the networks.

(Plot omitted: training loss over 40 epochs, one curve per network with 0 to 4 pooling layers.)

Figure 4.14: Training loss of the networks in Table 4.8.

Looking at the validation loss in Figure 4.15, it can be concluded that more poolinglayers provide a better result.


(Plot omitted: validation loss over 40 epochs, one curve per network with 0 to 4 pooling layers.)

Figure 4.15: Validation loss of the networks in Table 4.8.

Due to the model with 4 pooling layers having an explosively diverging loss, subsequent models are created with 3 evenly distributed pooling layers.

4.2.3 Depth and width of fully connected layers

In this section, depth is defined as the number of fully connected layers after all the convolution and pooling layers. Width is defined as the number of neurons each fully connected layer has.

The size and depth of the fully connected layers are tested with a method similar to section 4.1. Depths of 1 to 4 are tested, followed by a narrower and a wider model. A flat structure is used, as it gives the best results in section 4.1. The network model used is 6 convolution layers with a filter size of 3, a stride of 1 and 8 filters, followed by the fully connected layers described in Table 4.10. Dropout is applied after each fully connected layer. A learning rate of 1e-3 is used and the tests are run for 120 epochs.


# fully connected layers   Layer width   Loss   σ      Accuracy (%)   σ

1                          128           1.58   0.22   50.6           18.9
1                          256           1.53   0.25   53.5           20.6
1                          512           1.52   0.26   53.6           20.1
2                          128           1.42   0.24   48.0           21.3
2                          256           1.44   0.24   56.7           17.0
2                          512           1.44   0.31   52.1           21.0
3                          128           1.36   0.28   46.0           22.6
3                          256           1.36   0.22   57.3           20.1
3                          512           1.34   0.26   56.9           19.6
4                          128           1.51   0.14   47.8           14.5
4                          256           1.42   0.22   44.4           20.3
4                          512           1.35   0.29   46.0           14.7

Table 4.10: Validation loss and accuracy of networks with varying fully connected layer depth and width.

Figure 4.16 and Figure 4.17 are created by merging the results of models with the same number of fully connected layers and models with the same number of nodes.


(Plots omitted: train loss and validation loss over 120 epochs, one curve per depth of 1 to 4 dense layers.)

Figure 4.16: Loss comparison between using 1, 2, 3 and 4 fully connected layers.

From Table 4.10 and Figure 4.16 it can be concluded that more fully connected layers provide a better validation loss but slow down the training process. More nodes per fully connected layer decrease the validation loss and improve the speed of training, as seen in Figure 4.17.


(Plots omitted: train loss and validation loss over 120 epochs, one curve per width of 128, 256 and 512 nodes per layer.)

Figure 4.17: Loss comparison between models with 128, 256 and 512 nodes per fully connected layer.

Four fully connected layers do not offer a significant improvement compared to three, but slow down training significantly (Figure 4.16). The model with 3 fully connected layers and 512 nodes per layer is chosen for subsequent tests. It has the best individual performance together with the one with 3 fully connected layers and 256 nodes per layer (Table 4.10), but trained faster than the latter.


4.2.4 Number of convolution layers

Networks with a varying number of convolution layers are tested to find an optimum. Each convolution layer has 4 filters, a filter size of 3 and a stride of 1. There are 3 pooling layers evenly distributed between the convolution layers. After those, there are 3 fully connected layers with 512 nodes each. Dropout is applied after each fully connected layer. A learning rate of 8e-4 is used and the tests are run for 180 epochs.

# convolution layers   Loss   σ      Accuracy (%)   σ

 4                     1.45   0.31   53.9           19.7
 6                     1.35   0.26   57.0           20.0
 8                     1.52   0.24   41.2           22.2
10                     1.44   0.27   43.4           21.1

Table 4.11: Comparison between networks with 4 to 10 convolution layers.

A deeper network with more convolution layers does not yield better results, as Table 4.11 shows. The model with the best validation loss and accuracy is the one with 6 convolution layers (Figure 4.18). The deeper models also do not generalise better, as the deviation is the same or larger and they experience more overfitting.


(Plots omitted: validation loss and validation accuracy over 180 epochs, one curve per network with 4, 6, 8 and 10 convolution layers.)

Figure 4.18: Comparison between the models in Table 4.11 based on validation loss and accuracy.

Subsequent models use 6 convolution layers with 3 pooling layers evenly distributed between them.

4.2.5 Activation functions

The network might benefit from the use of another activation function instead of the ReLU. The activation functions introduced in subsection 2.3.3 are compared to each other in Table 4.12. The learning rate is set at 1e-3 and the number of epochs is raised to 180.


Activation function   Loss   σ      Accuracy (%)   σ

ReLU                  1.34   0.26   56.9           19.9
LReLU                 1.34   0.26   57.0           19.6
PReLU                 1.43   0.29   51.6           24.9
RReLU                 1.44   0.27   43.4           21.1
ELU                   1.50   0.33   52.6           25.2

Table 4.12: Comparison between networks with different activation functions.

The models using PReLU and RReLU achieve the worst results in both validation loss and validation accuracy. The ReLU and LReLU models have an almost identical loss and accuracy curve (Figure 4.19) and achieve the best validation results across the models. The network using ELU accelerates the training process and reaches its minimum loss in a third of the epochs compared to the other models, but has a higher minimum validation loss and diverges faster. Additionally, the 4th validation set explosively diverges after 63 epochs.
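The compared activation functions can be written out for a scalar input. A minimal sketch; the leak factor 0.01 for LReLU and alpha = 1.0 for ELU are common defaults and an assumption here, while PReLU learns the leak per channel and RReLU randomizes it during training.

```python
# The activation functions compared above, for a scalar input.
import math

def relu(x):
    return max(0.0, x)

def lrelu(x, leak=0.01):          # leak of 0.01 is an assumed default
    return x if x > 0 else leak * x

def elu(x, alpha=1.0):            # alpha of 1.0 is an assumed default
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(relu(-2.0), lrelu(-2.0), round(elu(-2.0), 3))
```

The smooth negative tail of ELU keeps gradients flowing for negative inputs, which is consistent with the faster initial training observed above.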

(Plots omitted: validation loss and validation accuracy over 180 epochs for ReLU and LReLU.)

Figure 4.19: Validation loss and accuracy of the models using ReLU and LReLU. The results are almost identical.

The model using ELU was also compared with the rest of the models by disregarding the 4th validation set that causes the divergence. The validation loss of the model using ELU is the highest, but its validation accuracy remains at a higher percentage for more epochs.


(Plots omitted: validation loss and validation accuracy over 180 epochs for LReLU, PReLU, RReLU and ELU.)

Figure 4.20: Validation loss and accuracy of the models in Table 4.12 without the 4th validation set.

LReLU is chosen as the best activation function for this architecture. ReLU has almost identical, but slightly worse, results.

4.2.6 Batch normalization

As described in [35], faster training is expected with the use of Batch Normalization (BN). Table 4.13 shows the results after 60 epochs with a learning rate of 1e-3, compared to the network using LReLU in the previous section. The batch normalized model has the exact same architecture, but with batch normalization applied after every convolution and fully connected layer.

Batch normalized   Loss   σ      Accuracy (%)   σ

No                 1.34   0.26   57.0           19.6
Yes                1.45   0.30   50.6           24.1

Table 4.13: Comparison between use and no use of batch normalization.

BN allows the network to reach its minimum validation loss very fast, in fewer than 10 epochs. It also does an outstanding job on the training loss, lowering it to 0.9 in 5 epochs, whereas the normal model needed 180 epochs to reach this value. In terms of validation, the results are less impressive. The validation loss diverges quite early in the training process, even though a small learning rate is used (Figure 4.21). The validation accuracy remains around 50% and does not improve with more epochs.
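The transformation BN [35] applies can be sketched as follows (forward pass only, with per-feature statistics over the mini-batch; the learned scale gamma and shift beta are part of the model):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Batch Normalization forward pass over a mini-batch.
    # x: (batch, features); gamma/beta: learned scale and shift per feature.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta
```

At test time the batch statistics are replaced by running averages collected during training, which this sketch omits.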


Figure 4.21: Comparison in validation between use of Batch Normalization (BN) and no BN.

4.2.7 Overfitting

Attempts to lower the apparent amount of overfitting in some of the validation sets have not succeeded. Stronger regularization has a negligible effect on the validation data. Inserting dropout layers after every convolution layer reduced the deviation of the validation loss and accuracy, but gave no better average results. Tweaking the dropout layers to an 80% drop rate slowed the training process and failed to lower the overfitting. Using the same model but training two separate networks, one only on the bearing condition and the other only on the balance, has also not been helpful.
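For reference, the (inverted) dropout [19] used in these experiments can be sketched as below; the 80% drop rate is the value tried above, and the rescaling of the kept units keeps the expected activation unchanged so no adjustment is needed at test time:

```python
import numpy as np

def dropout(x, drop_rate, train=True, rng=None):
    # Inverted dropout: randomly zero units at train time and scale the
    # survivors by 1/(1 - drop_rate); identity at test time.
    if not train or drop_rate == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    keep = (rng.random(x.shape) >= drop_rate).astype(x.dtype)
    return x * keep / (1.0 - drop_rate)
```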


4.2.8 Final architecture

Layer type   Filter size   # of filters /   Stride   Dropout   Activation function
                           # of nodes

conv         3             4                1        No        LReLU
conv         3             4                1        No        LReLU
max pool     2             /                2        /         /
conv         3             4                1        No        LReLU
conv         3             4                1        No        LReLU
max pool     2             /                2        /         /
conv         3             4                1        No        LReLU
conv         3             4                1        No        LReLU
max pool     2             /                2        /         /
fc           /             512              /        Yes       LReLU
fc           /             512              /        Yes       LReLU
fc           /             512              /        Yes       LReLU

Table 4.14: Final network architecture.

The proposed network for achieving the best results is displayed in Table 4.14. The final results were achieved with the following hyperparameters:

• Learning rate: 1e-3

• Regularization strength: 5e-3

• Batch size: 600

• Gradient method: Nesterov momentum (90%)

• Epochs:

1. Validation set 1: 70

2. Validation set 2: 85

3. Validation set 3: 97

4. Validation set 4: 80

5. Validation set 5: 46
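As a minimal illustration of the gradient method listed above [25], one Nesterov momentum step can be sketched as follows (pure-numpy sketch; grad_fn stands in for the network's gradient computation, and the lr/momentum values match the hyperparameter list):

```python
import numpy as np

def nesterov_update(w, v, grad_fn, lr=1e-3, momentum=0.9):
    # One Nesterov momentum step: evaluate the gradient at the
    # look-ahead point w + momentum * v, then update velocity and weights.
    v_new = momentum * v - lr * grad_fn(w + momentum * v)
    w_new = w + v_new
    return w_new, v_new
```

Evaluating the gradient at the look-ahead point is what distinguishes Nesterov momentum from classical momentum; it corrects the velocity before the step is taken.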

The accuracy for the detection of the bearing condition is 81.2%. The accuracy for the imbalance detection is 62.9%. The overall achieved accuracy is 57.2%.

Figure 4.22, Figure 4.23 and Figure 4.24 show the performance of the proposed architecture and hyperparameters. The values are an average over the 5-fold cross-validation. Note that the model can distinguish between HB, MILB, EILB and ORF better than between balanced and imbalanced.


Target \ Predicted   HB     MILB   EILB    ORF    HB_IM   MILB_IM   EILB_IM   ORF_IM

HB                   60 %   0 %    0 %     0 %    40 %    0 %       0 %       0 %
MILB                 0 %    40 %   0 %     0 %    0 %     20 %      20 %      20 %
EILB                 0 %    0 %    100 %   0 %    0 %     0 %       0 %       0 %
ORF                  20 %   0 %    0 %     80 %   0 %     0 %       0 %       0 %
HB_IM                60 %   0 %    0 %     0 %    40 %    0 %       0 %       0 %
MILB_IM              0 %    40 %   25 %    0 %    0 %     20 %      15 %      0 %
EILB_IM              0 %    20 %   15 %    0 %    5 %     0 %       60 %      0 %
ORF_IM               20 %   0 %    0 %     13 %   0 %     0 %       0 %       67 %

Figure 4.22: Confusion matrix of the proposed network architecture.

Target \ Predicted   HB      MILB   EILB    ORF

HB                   100 %   0 %    0 %     0 %
MILB                 0 %     60 %   30 %    10 %
EILB                 3 %     10 %   87 %    0 %
ORF                  20 %    0 %    0 %     80 %

Figure 4.23: Confusion matrix of the proposed network architecture. Balanced and imbalanced results are put together.


Target \ Predicted   Balanced   Imbalanced

Balanced             75 %       25 %
Imbalanced           48 %       52 %

Figure 4.24: Confusion matrix of the proposed network architecture. Balanced results are put together and separated from the imbalanced.

Figure 4.22 and Figure 4.23 show that the model succeeds in identifying the healthy bearings, but fails to correctly classify the lubrication level and the balance. Outer raceway faults are only confused with healthy bearings.
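The merged matrices in Figure 4.23 and Figure 4.24 can be derived from the full 8-class matrix by summing its balanced and imbalanced halves. A sketch operating on raw counts (the figures show row-normalised percentages), with the class order assumed as HB, MILB, EILB, ORF followed by their imbalanced counterparts:

```python
import numpy as np

def merge_condition(cm8):
    # Collapse the 8x8 count matrix (rows = targets, cols = predictions)
    # to 4x4 over bearing condition, ignoring balance.
    cm8 = np.asarray(cm8, dtype=float)
    cols = cm8[:, :4] + cm8[:, 4:]     # merge predicted balance
    return cols[:4, :] + cols[4:, :]   # merge target balance

def merge_balance(cm8):
    # Collapse to 2x2 over balanced vs. imbalanced, ignoring condition.
    cm8 = np.asarray(cm8, dtype=float)
    cols = np.stack([cm8[:, :4].sum(axis=1), cm8[:, 4:].sum(axis=1)], axis=1)
    return np.stack([cols[:4].sum(axis=0), cols[4:].sum(axis=0)])
```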


Figure 4.25: Train and validation loss and accuracy of the proposed network architecture after 40 epochs.


The final results are very inconsistent when looking at Figure 4.25. The performance depends strongly on the validation set used, going as high as 87.5% and as low as 37.5%. This holds for both the fully connected and the convolutional neural network. Note that most validation sets require early stopping as they diverge. It is interesting to note that the training process is completely identical for every cross-validation set.

Figure 4.26 displays the individual results of every validation set. Note that the results are from 8 classes, but are split for the viewer's convenience. The individual confusion matrices of the different validation sets suggest that the model has not found the correct features for detecting balance and imbalance. The model also classifies 50% of MILB as EILB on 3 of the validation sets.


Validation set 1:

Target \ Predicted   HB      MILB   EILB   ORF
HB                   100 %   0 %    0 %    0 %
MILB                 0 %     50 %   50 %   0 %
EILB                 13 %    0 %    87 %   0 %
ORF                  0 %     0 %    0 %    100 %

Target \ Predicted   Balanced   Imbalanced
Balanced             75 %       25 %
Imbalanced           85 %       15 %

Validation set 2:

Target \ Predicted   HB      MILB   EILB    ORF
HB                   100 %   0 %    0 %     0 %
MILB                 0 %     50 %   50 %    0 %
EILB                 0 %     0 %    100 %   0 %
ORF                  0 %     0 %    0 %     100 %

Target \ Predicted   Balanced   Imbalanced
Balanced             75 %       25 %
Imbalanced           6 %        94 %

Validation set 3:

Target \ Predicted   HB      MILB    EILB    ORF
HB                   100 %   0 %     0 %     0 %
MILB                 0 %     100 %   0 %     0 %
EILB                 0 %     0 %     100 %   0 %
ORF                  0 %     0 %     0 %     100 %

Target \ Predicted   Balanced   Imbalanced
Balanced             100 %      0 %
Imbalanced           25 %       75 %

Validation set 4:

Target \ Predicted   HB      MILB   EILB    ORF
HB                   100 %   0 %    0 %     0 %
MILB                 0 %     50 %   50 %    0 %
EILB                 0 %     0 %    100 %   0 %
ORF                  0 %     0 %    0 %     100 %

Target \ Predicted   Balanced   Imbalanced
Balanced             50 %       50 %
Imbalanced           50 %       50 %

Validation set 5:

Target \ Predicted   HB      MILB   EILB   ORF
HB                   100 %   0 %    0 %    0 %
MILB                 0 %     50 %   0 %    50 %
EILB                 0 %     50 %   50 %   0 %
ORF                  100 %   0 %    0 %    0 %

Target \ Predicted   Balanced   Imbalanced
Balanced             75 %       25 %
Imbalanced           75 %       25 %

Figure 4.26: Confusion matrices for the 5-fold cross-validation sets. Topmost is the first validation set.


Table 4.15 shows that the average precision¹¹ and recall¹² are very close to each other. ORF and EILB are the easiest classes to classify. MILB_IM is the most difficult to classify, with a poor F1-score of 28.6%.

Class      Precision (%)   Recall (%)   F1-score (%)

HB         60.0            37.5         46.2
MILB       40.0            40.0         40.0
EILB       100.0           71.8         83.6
ORF        80.0            85.8         82.8
HB_IM      40.0            46.9         43.2
MILB_IM    20.0            50.0         28.6
EILB_IM    60.0            62.9         61.4
ORF_IM     66.7            76.9         71.5

Total      58.3            59.0         57.2

Table 4.15: Precision, recall and F1-score based on all the validation sets.
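These metrics follow directly from a confusion matrix of raw counts (rows = targets, columns = predictions); a minimal sketch, assuming every class is predicted and present at least once so no division by zero occurs:

```python
import numpy as np

def precision_recall_f1(cm):
    # Per-class precision, recall and F1 from a count-valued confusion matrix.
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)  # of the predictions for a class, how many are true
    recall = tp / cm.sum(axis=1)     # of the true items of a class, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```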

A saliency map, using guided backpropagation [44][45], is created to find out which parts of the image are important for the network. Each saliency map is made by averaging the best (= network had highest confidence) correctly classified frames from each validation set. The result can be seen in Figure 4.27.
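As a sketch of the idea (not the thesis code), guided backpropagation modifies the ReLU backward pass so that gradient flows only where the forward activation was positive and the incoming gradient is positive. A toy single-hidden-layer example with illustrative weights:

```python
import numpy as np

def guided_saliency(x, W1, W2, cls):
    # Forward pass: linear -> ReLU -> linear class scores.
    a = W1 @ x                  # pre-activations
    h = np.maximum(a, 0.0)      # ReLU
    scores = W2 @ h
    # Seed the backward pass with a one-hot gradient at the chosen class score.
    g = np.zeros_like(scores)
    g[cls] = 1.0
    gh = W2.T @ g
    # Guided backprop rule: keep gradient only where the forward input was
    # positive AND the backward gradient is positive.
    ga = gh * (a > 0) * (gh > 0)
    return np.abs(W1.T @ ga)    # absolute saliency per input element
```

In the real network this rule is applied at every ReLU layer; the maps in Figure 4.27 are the result of such a backward pass through the full CNN.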

¹¹ Precision is the ratio that defines how many of the predicted items for a class are true.
¹² Recall is the ratio that defines how many of the true items of a class are correctly predicted.


[Eight panels: absolute saliency maps for HB, HB_IM, MILB, MILB_IM, EILB, EILB_IM, ORF and ORF_IM.]

Figure 4.27: Saliency maps of the correctly classified frames with the highest confidence by the final network.


The saliency maps in Figure 4.27 show a difference between balanced and imbalanced for bearings with the same condition, unlike the saliency maps of the fully connected network in Figure 4.11. All the saliency maps have a bright region in the bottom left that corresponds to dark empty space in the input frame. A possible explanation is that the network looks at changes in temperature around the housing. Another possible explanation is that the network has found a subtle feature in the background that helps the classification.

All saliency maps also have a brighter region on the bottom right of the image. The location of this brighter region corresponds with the location of the bottom bolt on the housing. The saliency maps of HB_IM, EILB, EILB_IM and ORF also have a bright region that corresponds to the location of the middle bolt of the housing. These spots are quite washed out in most of the saliency maps and do not directly suggest that the network is looking at the orientation of the bolt, but it is possible.

The saliency maps of HB, HB_IM, ORF and ORF_IM have a bright region on the top right side. This region also corresponds to dark empty space in the input frame. No real interpretation can be deduced from this.


Chapter 5

Conclusion

Both fully connected neural networks and convolutional neural networks experience model instability. The difference in validation loss and accuracy between the different cross-validation sets is large (Figure 4.25 and Figure 4.2). Some of the validation sets experience severe overfitting that is not fully compensated by the use of dropout, stronger regularization or changes to the network architecture.

Convolutional Neural Networks did not succeed in providing a better result than fully connected NNs. The final architecture achieved an overall accuracy of 57.2%, while the fully connected network using dropout achieved 64.0%. The main difference lies in balance/imbalance detection: 62.9% for the CNN against 74.5% for the fully connected network. The accuracy based on the bearing condition is close for both classifiers: 81.2% for the CNN versus 84.8%. The similarity between the frames of an individual video is significant. This can be observed by looking at the validation accuracy (Figure 4.25). The accuracy values for a single validation set jump between multiples of 12.5% (= 100% / 8 classes), with very few in-between values.

The biggest positive effect for the CNN was achieved by optimising the fully connected layers, the number of pooling layers and the number of convolution layers. The number of filters used in the network has a negligible effect on the validation loss and accuracy, as long as there are more than 2. The use of the activation functions RReLU and PReLU surprisingly has a negative effect on the validation loss and accuracy. The model using ELU performed the worst on validation loss, but the best on validation accuracy. Batch normalization reaches a low validation loss after just 5 epochs, but does not improve the overall result.

The model suffers the most in distinguishing between balanced and imbalanced classes. The similarity between the two is too large, especially for HB/HB_IM (Figure 4.22 and Figure 4.8). The data provided does not suffice to train a model that stably distinguishes between balanced and imbalanced rotating machines for every validation set (Figure 4.26). Additionally, both the fully connected and the convolutional neural networks have trouble classifying the lubrication level, especially the mildly insufficiently lubricated bearings (MILB).

The saliency maps of the fully connected NN (Figure 4.11) suggest that the network is not learning spatial temperature distributions, but is focusing on physical features like the two large bolts on the housing. This unwanted behaviour can be explained by the small number of available videos per class, as the network can only work with 4 different videos per class during training. All the frames extracted from a video can be seen as an artificial extension of one data point through small data augmentation. This can also explain the large difference in validation loss/accuracy between the validation sets.

The saliency maps of the final CNN are more difficult to interpret (Figure 4.27). There are bright regions in the dark empty space of the input frame, especially for MILB. There are also bright regions on the saliency map that correspond to the location of the two large bolts of the housing. These are not as prominent as those on the saliency maps of the fully connected network (Figure 4.11). More research is probably needed to fully understand and interpret the current saliency maps of the final CNN.

Although the proposed models score reasonably in distinguishing between the four bearing conditions (Figure 4.9 and Figure 4.23), it can be argued that the fully connected network is not using the right features in the thermal images, as suggested by the saliency maps (Figure 4.11).

5.1 Future work

One possibility for future work is further investigation towards a more optimal model. Looking at residual neural networks [32] is one of the paths that was not explored in this thesis. By further eliminating the vanishing gradient problem, a deeper model could be used that would successfully distinguish between balanced and imbalanced rotating machines. Another possibility would be to specifically train and optimise a network to only recognise the balance in rotating machinery. These could be coupled together with the proposed model in an ensemble to achieve better results. Model stability could also be slightly improved with the use of transfer learning and fine-tuning of the existing model, although this is more of an uncertainty.

Saliency maps and occlusion maps could be further investigated to understand the learning process and features of the networks.

Another possibility for future work is in the field of preprocessing. The automatic method for repositioning the frames could be fine-tuned and the irregularities could be fixed. Trying out the SIFT and SURF algorithms is a potential solution. Additionally, the physical features that misguide the network, like the two large bolts on the housing, could be manually removed. The background could be thresholded and set to a value of 0 to prevent the network from searching for features in these regions.

The best possible way to increase the performance of the proposed method is the expansion of the dataset with more videos of the current classes. This could both resolve the balance misclassification and the learning of physical features. In the end, neural networks perform better when there is more data available to train from. This could also remove the need for the preprocessing suggested above, as the network would be able to generalise based on the correct features.


Bibliography

[1] Jie Liu, Wilson Wang, and Farid Golnaraghi. An extended wavelet spectrum for bearing fault diagnostics. 2008.

[2] P. Bokoski, J. Petrovcic, B. Musizza, and A. Juricic. Detection of lubrication starved bearings in electrical motors by means of vibration analysis. Tribology International, 2010.

[3] Olivier Janssens. Thermal image based fault diagnosis for rotating machinery. 2015.

[4] S. Bagavathiappan, B.B. Lahiri, T. Saravanan, John Philip, and T. Jayakumar. Infrared thermography for condition monitoring – A review. Infrared Physics & Technology, 60:35–55, September 2013.

[5] M. Eftekhari. A novel indicator of stator winding inter-turn fault in induction motor using infrared thermal imaging. 2013.

[6] Achmad Widodo, Djoeli Satrijo, Toni Prahasto, Gang-Min Lim, and Byeong-Keun Choi. Confirmation of Thermal Images and Vibration Signals for Intelligent Machine Fault Diagnostics. International Journal of Rotating Machinery, 2012:1–10, 2012.

[7] H. Fandino-Toro, O. Cardona-Morales, J. Garcia-Alvarez, and G. Castellanos-Dominguez. Bearing Fault Identification using Watershed-Based Thresholding Method. In Giorgio Dalpiaz, Riccardo Rubini, Gianluca D'Elia, Marco Cocconcelli, Fakher Chaari, Radoslaw Zimroz, Walter Bartelmus, and Mohamed Haddar, editors, Advances in Condition Monitoring of Machinery in Non-Stationary Operations, pages 137–147. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.

[8] Lixiang Duan, Mingchao Yao, Jinjiang Wang, Tangbo Bai, and Laibin Zhang. Segmented infrared image analysis for rotating machinery fault diagnosis. Infrared Physics & Technology, 77:267–276, July 2016.

[9] Raiko Schulz, Mia Loccufier, Steven Verstockt, Kurt Stockman, and Sofie Van Hoecke. Outer raceway fault detection and localization for deep groove ball bearings by using thermal imaging. In 11th European Conference on Non-Destructive Testing (ECNDT 2014). http://www.ndt.net, 2014.

[10] Gang-Min Lim. Fault diagnosis of rotating machine by thermography method on support vector machine, 2014.


[11] Keisuke Yonehara, Karl Farrow, Alexander Ghanem, Daniel Hillier, Kamill Balint, Miguel Teixeira, Josephine Juttner, Masaharu Noda, Rachael L. Neve, Karl-Klaus Conzelmann, and Botond Roska. The First Stage of Cardinal Direction Selectivity Is Localized to the Dendrites of Retinal Ganglion Cells. Neuron, 79(6):1078–1085, September 2013.

[12] James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[14] Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[15] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. arXiv:1512.00809 [stat], December 2015.

[16] Anthony Bell and Terrence Sejnowski. Edges are the independent components of natural scenes. 1996.

[17] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[18] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 [cs], July 2012.

[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[20] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv:1511.06807 [cs, stat], November 2015.

[21] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.

[22] Djork-Arne Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [cs], November 2015.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs], February 2015.


[24] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv:1505.00853 [cs, stat], May 2015.

[25] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.

[26] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[27] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[28] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs], September 2014.

[30] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs], September 2014.

[31] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842 [cs], September 2014.

[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], December 2015.

[33] Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. Highway Networks. arXiv:1505.00387 [cs], May 2015.

[34] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[35] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs], February 2015.

[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729 [cs], June 2014.


[37] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006.

[38] Benjamin Graham. Fractional Max-Pooling. arXiv:1412.6071 [cs], December 2014.

[39] Martin A. Fischler and Robert C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM, 24(6):381–395, June 1981.

[40] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011.

[41] Stefan Leutenegger, Margarita Chli, and Roland Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In 2011 International Conference on Computer Vision, pages 2548–2555. IEEE, 2011.

[42] Lu Feng, Zhuangzhi Wu, and Xiang Long. Fast Image Diffusion for Feature Detection and Description. International Journal of Computer Theory and Engineering, 8(1):58–62, February 2016.

[43] Pablo Fernandez Alcantarilla, Adrien Bartoli, and Andrew J. Davison. KAZE features. In European Conference on Computer Vision, pages 214–227. Springer, 2012.

[44] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [cs], December 2014.

[45] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [cs], December 2013.


List of Figures

2.1  Simplified version of a neuron.
2.2  Node with three inputs.
2.3  Neural network that classifies two classes.
2.4  Representation of the 3D structure of CNNs.
2.5  Simplified example of a pooling layer.
2.6  Dropout deactivates a percentage of the nodes.
2.7  Left: A normal ReLU. Right: A LReLU with a small negative slope 'a'.
2.8  Left: The standard block used in the residual framework. Right: Block with conv1 layers that scale the dimensions of the data to reduce computations. Used in larger variants of the residual framework.
2.9  Principle of SPP.
2.10 Demonstration of the fractional max pooling layer with an image of parrots.
3.1  An example of a frame from a video.
3.2  Unedited frame with a red crosshair showing the thermocouple and a blue crosshair showing the alignment point.
3.3  High contrast frame with the thermocouple visible on the right side.
3.4  Cropped frames from two different videos that were repositioned with more than 90px.
3.5  Matches between reference frame (right) and image (left) using KAZE.
3.6  Comparison between the key points of feature detector/descriptor algorithms on a problematic frame. Going horizontally from top left to bottom right: ORB, KAZE, BRISK and AKAZE.
3.7  Final preprocessed image of size 108x103. Cropped using the manual method.
4.1  Comparison of the same model between different regularization strengths.
4.2  Validation loss and accuracy of the first network in Table 4.1 after 80 epochs.
4.3  Train loss and accuracy of the first network in Table 4.1 after 80 epochs.
4.4  Confusion matrix of the first network in Table 4.1.
4.5  Confusion matrix of the first network in Table 4.1. Balanced and imbalanced classes of the same condition are considered as one.
4.6  Confusion matrix of the first network in Table 4.1. All classes with the same balance are considered as one.
4.7  Loss of a 6-layer model in Table 4.3.
4.8  Confusion matrix of the 6-layer model in Table 4.3.
4.9  Confusion matrix of the 6-layer model in Table 4.3. Balanced and imbalanced classes of the same bearing condition are put together.
4.10 Confusion matrix of the 6-layer model in Table 4.3. All classes with the same balance are put together.
4.11 Saliency maps of the correctly classified frames with the highest confidence by the 6-layer network.
4.12 Validation loss and accuracy comparison between CNNs with 2 to 32 filters per convolution layer.
4.13 Loss comparison between CNNs with flat, increasing and decreasing amount of filters per convolution layer.
4.14 Training loss of the networks in Table 4.8.
4.15 Validation loss of the networks in Table 4.8.
4.16 Loss comparison between using 1, 2, 3 and 4 fully connected layers.
4.17 Loss comparison between models with 128, 256 and 512 nodes per fully connected layer.
4.18 Comparison between the models in Table 4.11 based on validation loss and accuracy.
4.19 Validation loss and accuracy of the models using ReLU and LReLU. The results are almost identical.
4.20 Validation loss and accuracy of the models in Table 4.12 without the 4th validation set.
4.21 Comparison in validation between use of Batch Normalization (BN) and no BN.
4.22 Confusion matrix of the proposed network architecture.
4.23 Confusion matrix of the proposed network architecture. Balanced and imbalanced results are put together.
4.24 Confusion matrix of the proposed network architecture. Balanced results are put together and separated from the imbalanced.
4.25 Train and validation loss and accuracy of the proposed network architecture after 40 epochs.
4.26 Confusion matrices for the 5-fold cross-validation sets. Topmost is the first validation set.
4.27 Saliency maps of the correctly classified frames with the highest confidence by the final network.

List of Tables

3.1  Abbreviations of all the classes. . . . . . . . . . . . . . . . .  17
3.2  Comparison between feature detector/descriptor algorithms on a
     problematic video. . . . . . . . . . . . . . . . . . . . . . . .  23

4.1  Tested fully connected models without dropout. . . . . . . . . .  26
4.2  Result median for each set using the fully connected models. . .  27
4.3  Fully connected models using dropout. . . . . . . . . . . . . . .  31
4.4  The starting network architecture. . . . . . . . . . . . . . . .  36
4.5  Results with different numbers of filters in every convolution
     layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  36
4.6  Comparison between flat, increasing and decreasing numbers of
     filters in the three convolution layer groups. . . . . . . . . .  38
4.7  Size of output using n pooling layers with no padding. . . . . .  39
4.8  The network models for testing the optimal number of pooling
     layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
4.9  Results using different numbers of pooling layers. Networks used
     are shown in Table 4.8. . . . . . . . . . . . . . . . . . . . . .  40
4.10 Validation loss and accuracy on networks with varying fully
     connected layer depth and width. . . . . . . . . . . . . . . . .  42
4.11 Comparison between networks with 4 to 10 convolution layers. . .  45
4.12 Comparison between networks with different activation functions.  47
4.13 Comparison between use and no use of batch normalisation. . . . .  49
4.14 Final network architecture. . . . . . . . . . . . . . . . . . . .  50
4.15 Precision, recall and F1-score based on all the validation sets.  55