
International Journal of Neural Systems, Vol. 30, No. 6 (2020) 2050034 (15 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0129065720500343

3D-Convolutional Neural Network with Generative Adversarial Network and Autoencoder for Robust Anomaly Detection in Video Surveillance

Wonsup Shin∗, Seok-Jun Bu† and Sung-Bae Cho‡

Department of Computer Science, Yonsei University
50 Yonsei-ro, Sudaemoon-gu, Seoul 03722, South Korea

∗[email protected]
†sjbuhan@yonsei.ac.kr
‡[email protected]

Accepted 28 March 2020
Published Online 28 May 2020

As the surveillance devices proliferate, various machine learning approaches for video anomaly detection have been attempted. We propose a hybrid deep learning model composed of a video feature extractor trained by generative adversarial network with deficient anomaly data and an anomaly detector boosted by transferring the extractor. Experiments with the UCSD pedestrian dataset show that it achieves 94.4% recall and 86.4% precision, which is competitive performance in video anomaly detection.

Keywords: Video anomaly detection; machine learning; transfer learning; generative adversarial network; 3D CNN.

1. Introduction

In recent years, the increasing use of surveillance cameras with limited manpower has made automatic video surveillance systems more important.1 A crucial challenge is automatically detecting anomalies, which are roughly defined as unusual, uncommon or irregular events.2,3 For example, on a highway, a stopped car or person can be defined as an anomaly. Conversely, on a sidewalk, all objects except people walking along the road can be regarded as anomalies. In other words, objects in video that are unusual or rarely occur in a given domain are considered anomalies. Due to this domain-dependent nature and the difficulty of sample collection, video anomaly detection (VAD) is still one of the challenging issues in the field of computer vision, and various researchers are working to solve it.4–7

An anomaly is classified into three types, point anomaly, contextual anomaly and collective anomaly, according to the characteristics of its occurrence factors.8 In video, anomalies are detected by the appearance pattern, motion pattern or both patterns of the objects within the frame sequence. Patterns of objects in video can cause contextual anomalies, which are abnormal in a specific context only, and collective anomalies, which are normal from a local point of view but abnormal in general. Due to the nature of video, however, the point anomaly is not considered in this study.9 Figure 1 shows some examples of anomalies in video. In this example, only people walking on two feet are considered normal. The left-most image in Fig. 1 shows an anomaly caused by the appearance pattern: the object is a person in a wheelchair who moves at a speed similar to a typical pedestrian but has a different appearance. The middle image in Fig. 1 shows an anomaly caused by the motion pattern: the object is a person who looks like a typical pedestrian but crosses the grass, which ordinary pedestrians do not. The right-most image in Fig. 1 shows an anomaly caused by both patterns: the object is a person riding a bicycle, whose appearance and speed both differ from typical pedestrians.

‡Corresponding author.


Fig. 1. Examples of video anomalies. The red part of each image represents anomaly objects. The anomalies in the left to right images were generated by the appearance pattern, the motion pattern and both patterns of the object, respectively.

Fig. 2. The mapping diagram between conventional anomaly types and object patterns in video.

Therefore, methods for VAD should be able to extract high-quality spatial-temporal features of objects within the video. Figure 2 shows a diagram mapping object patterns to anomaly types.

Deep neural network (DNN) is one of the key technologies for modeling anomalies.6,36 Especially, models like the long short-term memory (LSTM), which can learn temporal information, and the convolutional neural network (CNN), which can extract spatial information, are making tremendous achievements in the field of video feature extraction and classification.10,11 Intuitively, a model combining the two seems well suited to extracting the characteristics of video. In the meantime, some works have shown that a 3D CNN-based autoencoder alone can achieve higher performance.24

Most deep VAD classifiers have been modeled in an unsupervised learning manner,12–15 because it is difficult to collect enough labeled anomaly data to learn the internal variability of anomalies. However, labeled data provide very important and valuable information that can guide the learning direction of the VAD classifier. Therefore, when even a small amount of labeled abnormal data is available, unsupervised learning, which does not consider any label information, has its limitations in terms of data efficiency; it is hard to regard unsupervised learning as the best way to solve VAD. Even semi-supervised learning-based approaches in VAD exclude labeled abnormal data from consideration; these approaches are only interested in configuring a training set with normal data.5 In this context, research on supervised learning methods is important in terms of using the available data as efficiently as possible.

Previously, we developed an anomaly detection model based on supervised learning using a generative adversarial network (GAN).33 This method dealt with the shortage of labeled anomaly data in supervised learning. The key idea was to use the GAN discriminator as the base model and apply transfer learning to the classifier. Since the GAN can generate data that do not exist in the dataset, the base model learned from a large amount of data very similar to the actual data, addressing the problem of insufficient labeled data. However, the GAN discriminator cannot extract the features of the data effectively, because it extracts features for classifying real and fake data.34 The features obtained in this way do not exactly match the task of the classifier; the method lacked consideration of the ability to extract high-level spatial-temporal features.

In this paper, we propose another VAD model that overcomes the limitations of our previous work. The proposed method can solve the lack of labeled data,36 while making it possible to extract features suitable for the classifier. The main idea is to convert the GAN generator into an autoencoder model based on 3D CNN and use the encoder of the autoencoder as the base model.


Table 1. The summary of the related works.

Category              Author                        Feature extraction method    Learning category
CNN-based model       Zhou et al.40                 2D CNN + 3D CNN              Supervised
                      Smeureanu et al.17            Multi-task Fast RCNN         Unsupervised
                      Hinami et al.18               Pretrained VGG net           Unsupervised
                      Sabokrou et al.20             Pretrained AlexNet           Unsupervised
                      Chu et al.21                  3D CNN + similarity tree     Unsupervised
                      Sultani et al.41              3D-ConvNet                   Supervised
                      Wang and Bai44                Parallel CNN                 Supervised
                      Sun et al.22                  2D CNN                       Unsupervised
                      Ullah et al.38                3D CNN                       Supervised
Reconstructive model  Xu et al.3                    SDAE                         Unsupervised
                      Chen et al.23                 Trajectory clustering + AE   Unsupervised
                      Hasan et al.12                HOG, HOF + AE                Unsupervised
                      Medel and Savakis19           ConvLSTM-AE                  Unsupervised
                      Chong and Tay13               Spatiotemporal-AE            Unsupervised
                      Zhao et al.24                 3D Conv-AE                   Unsupervised
Generative model      Ahmadlou and Adeli45          Probabilistic NN             Supervised
                      Dimokranitou26                AAE                          Unsupervised
                      Sun et al.28                  VAE                          Unsupervised
                      Huang et al.29                CDBN                         Unsupervised
                      Sabokrou et al.30             AVID (GAN)                   Unsupervised
                      Lee et al.31                  STAN (GAN + LSTM)            Unsupervised
                      Shin et al. (2018)33          GAN + LRCN                   Supervised
                      Ravanbakhsh et al. (2019)27   GAN                          Unsupervised

Unlike a discriminator, and like a classifier, the encoder aims to extract features that can model the underlying information of the data. At the same time, because it learns in the GAN framework, it preserves the advantage of being able to learn with abundant data. The contributions of this paper are summarized as follows.

• We propose an end-to-end deep VAD model that utilizes labeled anomaly data, unlike the conventional methods.

• We propose a training method for a robust feature extraction model using the GAN framework even when the anomaly data is insufficient.

• We propose an effective transfer learning method for the VAD classifier.

The rest of this paper is organized as follows. The related works for VAD are introduced in Sec. 2. The proposed model is presented in Sec. 3. Section 4 illustrates the experimental results for evaluating the proposed model. We conclude this paper in Sec. 5.

2. Related Works

Numerous approaches have been studied to solve the problem of VAD. As mentioned previously, most studies suggested methods based on unsupervised learning. Table 1 summarizes the related works introduced here.

The anomaly detection model is divided into feature extraction and anomaly detection parts. Feature extraction is categorized as CNN-based model, reconstructive model and generative model. In addition, the anomaly detection part can be constructed using a classifier such as a one-class SVM or an end-to-end model using a fully connected structure or autoencoder structure. Since the key idea of the proposed method lies in the feature extraction, we introduce the related research by focusing on the feature extraction.

First, most feature extraction methods using CNN are based on transferring pre-trained models. Smeureanu et al. used pretrained VGG networks for feature extraction.17 Similarly, Sabokrou et al. used pretrained AlexNet.20 These methods alleviate the issue of building a high-quality feature extraction model through transfer learning. However, the domain the pretrained model was trained on and the target domain are very different.


As another approach to CNN-based feature extraction, end-to-end models based on supervised or unsupervised learning have also been developed. For example, there have been attempts at supervised learning by designing a structure that ends in a fully connected network.25,40,41 These methods are intuitive, but do not consider the issues of supervised learning in the VAD domain mentioned in the previous section. Besides, many models using 3D CNN-based feature extraction have been proposed, because Tran et al. demonstrated that 3D CNNs can outperform RNNs in the video domain.16 For example, Chu et al. proposed a 3D CNN-based method for the VAD problem,21 which showed excellent performance for pixel-unit detection.

The reconstructive models mainly extract features using an autoencoder (AE). Often, end-to-end anomaly detection is performed using the reconstruction loss as a measure. Xu et al. first applied the AE-based unsupervised learning method to the VAD domain.3 For each frame, AEs composed of 2D convolutional networks were applied to extract video features. Chen et al.23 combined hand-crafted feature extraction methods with AE; many hand-crafted feature extraction methods such as trajectory clustering, histogram of gradients (HOG) and histogram of optical flow (HOF) were combined with AE.10,23 Thereafter, methods of extracting features using only the AE were proposed. Some researchers devised a feature extraction method using an AE composed of ConvLSTM.11,19 On the other hand, the AE structure using a 3D convolutional network has been studied steadily.10,24,25 Zhao et al. proposed a 3D convolutional network-based AE,24 which outperformed the LSTM-based methods.

The generative model learns the probability distribution of normal data using models such as the deep belief network (DBN),29 variational autoencoder (VAE),28 adversarial autoencoder (AAE)26 and generative adversarial network (GAN). The GAN is currently being studied most actively in anomaly detection domains.7,27,34 However, to the best of our knowledge, all the GAN-based methods operate in an unsupervised learning manner except our previous work.33 Ravanbakhsh et al. suggested a way to use GAN.27 The key idea is to use the discriminator as an ideal classifier immediately after training the GAN: since training proceeds only on normal data, abnormal data tend to be classified as fake. Inspired by the idea of Ravanbakhsh et al., Sabokrou et al. proposed an ensemble method to detect anomalies using both GAN's generator and discriminator.30 Lee et al. devised a GAN model with bi-directional LSTM to improve the sensitivity to temporal features.31 Our previous study constructed a VAD classifier based on supervised learning and used transfer learning to solve the issue of constructing a high-quality feature extractor.33 It built a base model for transfer learning through a GAN trained with the target dataset: a discriminator trained on a large amount of generated data together with the actual data is theoretically the same as one trained on a large amount of data with the same distribution as the target domain. It used the feature extraction part of the discriminator as the base model. This study showed that training the base model through GAN is promising. However, it lacked consideration of the network structure and the transfer strategy.

3. The Proposed Method

This paper proposes a classification model for VAD that learns in a supervised manner. The overall structure of the proposed model is shown in Fig. 3. It consists of a preprocessing step and two learning phases. The video clip x is created by windowing in the preprocessing step and used as input. The first learning phase trains a GAN that receives a video clip as input and outputs a fake video clip of the same size. The generator aims to generate video clips as realistic as possible to deceive the discriminator of the GAN. At the same time, the discriminator aims to distinguish between the generated data and the real data. In this process, the decoder of the generator learns to generate data that trick the discriminator better, and the encoder of the generator learns more informative feature extraction to help the decoder generate data. In addition, each component of the GAN updates its weights not only with the actual data, but also with a large number of fake data generated by the generator.

In the second phase, we transfer the learned encoder of the generator to the VAD classifier and fine-tune the VAD classifier in a supervised manner. In this process, the classifier obtains a feature extractor learned with a rich dataset that is very similar to the target data.


Fig. 3. Architecture of the proposed method composed of a preprocessing step and two learning phases. In phase 1, we train the GAN so that the encoder of the generator can extract useful features. In phase 2, we transfer the encoder of the generator to the VAD classifier and fine-tune it. Here, Enc, Dec and Dis mean the encoder and decoder of the generator, and the discriminator, respectively.

The data generated by the GAN follow a distribution similar to that of the real data. For more details about transfer learning, see Sec. 3.2. After both phases are completed, the performance of the classifier is verified using the test dataset.

3.1. GAN for video generation

This section describes the structure of the GAN model used in this paper. For the input video clip $x$, the generator $G$ outputs a reconstruction $\hat{x}$; that is, the generator has the relation $G(x) = \hat{x}$. The discriminator $D$ receives $x$ and $\hat{x}$ as inputs and predicts whether the given input is real or generated data. At the same time, the generator tries to generate data that the discriminator has difficulty classifying, following the original mechanism of GAN. The learning loss between the two models is given by Eq. (1), through which each component of the GAN updates its weights toward the Nash equilibrium between generator $G$ and discriminator $D$.

$$\min_G \max_D V(D,G) = \min_G \max_D \mathbb{E}_{x\sim p_d(x)}[\log D(x)] + \mathbb{E}_{x\sim p_d(x)}[\log(1 - D(G(x)))], \quad (1)$$

where $p_d$ is the probability distribution of real data, and $V(D,G)$ is the objective function.
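To make the adversarial objective concrete, the following is a minimal PyTorch sketch of one training step implied by Eq. (1). It assumes `G` and `D` are `torch.nn.Module` instances, with `D` outputting a probability; the Adam-style optimizers and the non-saturating binary cross-entropy form are illustrative assumptions, not details specified in the paper.

```python
# Minimal sketch of one adversarial update for Eq. (1) (assumptions noted above).
import torch
import torch.nn.functional as F

def gan_step(G, D, clips, opt_g, opt_d):
    fake = G(clips)                                   # x_hat = G(x)

    # Discriminator ascent on V(D, G): push real clips toward 1, fakes toward 0
    d_real = D(clips)
    d_fake = D(fake.detach())
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator descent: make D classify the generated clip as real
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```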

3.1.1. Generator

In order for the encoder Enc to learn properly, it is essential that the decoder Dec can generate high-quality fake data. To improve the quality of the generated data, we construct the generator as a U-net structure using 3D convolutions. U-net is an AE-based network designed to minimize the energy loss defined by the distance between the original and reconstructed data, and weights are directly connected between layers in symmetric positions.32 According to the original GAN formulation, the similarity between the distribution of the data generated by the generator and the actual data is a critical factor in the convergence of the loss function $V(D,G)$. From a practical point of view, the U-net can be a partial solution to the instability associated with convergence. Intuitively, it can be understood that the GAN is trained so that the sampled data follow the auto-encoded vectors from the actual data, rather than random normal vectors. We formulate the generator function $G(\cdot)$ as an encoding-decoding process with convolution-pooling operations and an up-sampling function, $\phi_C(\cdot)$, $\phi_P(\cdot)$ and $\phi_P^{-1}(\cdot)$, as follows:

$$G(x) = \mathrm{Dec}(\mathrm{Enc}(x)), \quad (2)$$
$$\mathrm{Enc}(x) = \phi_P(\phi_C(x)), \quad (3)$$
$$\mathrm{Dec}(x) = \phi_P^{-1}(\mathrm{Enc}(x)). \quad (4)$$


The 3D convolution provides strong spatial-temporal feature extraction, and the CNN structure is highly parallel. In detail, the 3D convolution, denoted by the activation $a^{i\alpha\beta\gamma}_l$ of the $i$th node for convolution filter position $(\alpha, \beta, \gamma)$ in the $l$th layer, is calculated by Eq. (5):

$$a^{i\alpha\beta\gamma}_l = f\Bigg(\sum_{k=1}^{K_l}\sum_{h=1}^{H_l}\sum_{w=1}^{W_l}\sum_{d=1}^{D_l} \theta^{khwd}_{l,i}\, a^{k(\alpha+h)(\beta+w)(\gamma+d)}_{l-1} + b^i_l\Bigg), \quad (5)$$

where $\theta^{khwd}_{l,i}$ is the weight of the neural network at index $(h, w, d)$ of the $k$th channel in the $i$th kernel connected to the $l$th layer. Each of $K_l$, $H_l$, $W_l$ and $D_l$ denotes the number of kernels in the $(l-1)$th layer, and the height, width and depth, respectively.
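As an illustration of the generator of Eqs. (2)–(5), here is a minimal PyTorch sketch of a 3D-convolutional U-net. The channel counts (16, 32), the two conv/deconv layers, stride-2 pooling, skip connections and batch normalization follow Sec. 4.1.1; the padding choices, the `ConvTranspose3d` up-sampling and the 16-frame window are assumptions.

```python
# Sketch of the 3D-convolutional U-net generator, assuming the layer sizes of
# Sec. 4.1.1; padding and the window size p = 16 are assumptions.
import torch
import torch.nn as nn

class UNetGenerator3D(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        # Encoder Enc(x) = phi_P(phi_C(x)): strided 3D convolutions
        self.enc1 = nn.Sequential(nn.Conv3d(in_ch, 16, 5, stride=2, padding=2),
                                  nn.BatchNorm3d(16), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(16, 32, 3, stride=2, padding=1),
                                  nn.BatchNorm3d(32), nn.ReLU())
        # Decoder Dec: mirrored 3D deconvolutions (up-sampling phi_P^{-1})
        self.dec2 = nn.Sequential(nn.ConvTranspose3d(32, 16, 3, stride=2,
                                  padding=1, output_padding=1),
                                  nn.BatchNorm3d(16), nn.ReLU())
        self.dec1 = nn.ConvTranspose3d(16 + 16, in_ch, 5, stride=2,
                                       padding=2, output_padding=1)

    def forward(self, x):                 # x: (batch, ch, frames, H, W)
        e1 = self.enc1(x)
        z = self.enc2(e1)                 # latent spatio-temporal features
        d2 = self.dec2(z)
        # U-net skip connection between symmetric layers
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.sigmoid(d1)          # fake clip x_hat = G(x)

clip = torch.randn(2, 1, 16, 120, 180)    # window of 16 resized 120x180 frames
fake = UNetGenerator3D()(clip)
assert fake.shape == clip.shape           # generator outputs a same-sized clip
```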

3.1.2. Discriminator

The discriminator must also be meticulously designed to provide a more beneficial loss to the generator. We construct the discriminator based on the long-term recurrent convolutional network (LRCN) model. The LRCN consists of a CNN, which models the spatial features using convolution operations $\psi_C(\cdot)$ and pooling operations $\psi_P(\cdot)$, and an LSTM, which models the temporal features using gate operations and a recurrent loop $\psi_L(\cdot)$.10 For every video clip $V = \{x_1, \ldots, x_p\}$ with window size $p$, the convolutional encoding function $\psi_{cp}(\cdot)$ at step $t$ is defined as:

$$\psi_{cp}(x_t) = \psi_P(\psi_C(x_t)). \quad (6)$$

The convolution operation, which preserves the spatial relationship between features by learning filters that extract correlations, is known to reduce the translational variance between features. Given the $t$th input state, the encoding network performs the convolution operation $\psi_C(\cdot)$ using an $m \times m$ sized filter $w$:

$$\psi^l_C(x_t) = \sum_{a=0}^{m-1}\sum_{b=0}^{m-1} w_{ab}\, x^{l-1}_{(i+a)(j+b)}. \quad (7)$$

The summary statistic of nearby outputs is derived from $\psi_P(\cdot)$ by the max-pooling operation. Because the dimension of the output vectors from the convolutional layer is increased by the number of convolutional filters, it must be carefully controlled. Pooling refers to a dimension reduction process used in CNNs in order to impose a capacity bottleneck and facilitate faster computation. The max-pooling operation has the effect of feature selection and dimension reduction over a $k \times k$ sized area with pooling stride $\tau$:

$$\psi^l_P(x_t) = \max x^{l-1}_{tij\times\tau}. \quad (8)$$

The purpose of the convolution-pooling operation is to extract the spatial features from the video clips and deliver them to the following LSTM.

In the LSTM, in order to make the network easily figure out the relationships in image sequences over a large time scale, we use the gate operations defined in Eqs. (9)–(13).43 The input gate $i_t$, forget gate $f_t$, cell state $c_t$, output gate $o_t$ and hidden value $h_t$ at time step $t$ are defined as follows:

$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + W_{ci}\circ c_{t-1} + b_i), \quad (9)$$
$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}\circ c_{t-1} + b_f), \quad (10)$$
$$c_t = f_t\circ c_{t-1} + i_t\circ \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c), \quad (11)$$
$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}\circ c_t + b_o), \quad (12)$$
$$\psi_L(\cdot) = h_t = o_t\circ \tanh(c_t), \quad (13)$$

where $\circ$ denotes the element-wise product and $b$ denotes the bias term. After the time-series features are extracted by the LSTM, a typical DNN is used to complete the mapping function $\psi(\cdot)$:

$$\psi^l(x_i(t_p)) = \psi^{l-1}_L(\psi^{l-2}_{Cp}(x_i(t_p))), \quad (14)$$
$$p(y_i \mid x_i(t)) = \arg\max \frac{\exp(\psi^{l-1}(x_i(t_\omega))w_l + b_l)}{\sum \exp(\psi^{l-1}(x_i(t_\omega)) + b_l)}, \quad (15)$$

where a softmax activation function is used in the last layer of the DNN so that the output vector is encoded as a probability of real or fake, and $i$ denotes the index of the video clip in the overall dataset.

The weights of the discriminator are optimized using the backpropagation algorithm based on gradient descent, by minimizing the cross-entropy loss function $L_{CE}$:

$$\psi^l(x_i(t_p)) = \psi^{l-1}_L(\phi^{l-2}_C(x_i(t_p))), \quad (16)$$
$$L_{CE} = -\sum_i y_i \log(\hat{y}_i). \quad (17)$$
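A minimal PyTorch sketch of the LRCN discriminator of Eqs. (6)–(17) may help. The three time-distributed convolutions (7×7, 5×5 and 3×3 filters, 16 channels), the 1024-unit LSTM and the 256/2-node fully connected layers follow Sec. 4.1.1; the strides, padding and the flattened feature size for 120×180 frames are assumptions.

```python
# Sketch of the LRCN discriminator (layer sizes per Sec. 4.1.1; strides and
# padding are assumptions for 16-frame clips of 120x180 gray images).
import torch
import torch.nn as nn

class LRCNDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # psi_C / psi_P applied per frame (time-distributed 2D convolutions)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU())
        # psi_L: LSTM over per-frame feature vectors (Eqs. (9)-(13))
        self.lstm = nn.LSTM(input_size=16 * 15 * 23, hidden_size=1024,
                            batch_first=True)
        self.fc = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                nn.Linear(256, 2))    # real/fake logits

    def forward(self, x):                  # x: (batch, ch, frames, H, W)
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.cnn(frames).reshape(b, t, -1)
        _, (h_t, _) = self.lstm(feats)     # last hidden state
        return torch.softmax(self.fc(h_t[-1]), dim=1)  # Eq. (15)
```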


Table 2. A guide to transfer learning methods for four major scenarios.

Category    Checkup question / diagnosis                      Scenario 1   Scenario 2   Scenario 3   Scenario 4
Checkup     Is there a lot of new training data?              No           Yes          No           Yes
            Is the new data similar to the existing data?     Yes          Yes          No           No
Diagnosis   Learn only the classifier                         ++           +            −            +−
            Learn the classifier and some other layers        −            −            ++           −
            Learn full-network                                −            +            +−           ++

Fig. 4. The pseudocode of the proposed VAD method.

3.2. Transfer learning

This section describes how to learn the VAD classifier. The VAD classifier $C$ is composed of a CNN-based feature extraction part $\mathrm{Enc}_C$ and a DNN-based classification part $\mathrm{Cls}_C$, like a general classifier for image-domain classification problems:

$$C(x) = \mathrm{Cls}_C(\mathrm{Enc}_C(x)). \quad (18)$$

The feature extraction part of the VAD classifier has the same structure as the encoder of the generator. Through this, it is possible to transfer the encoder learned in the previous phase to the VAD classifier. The strategy of transferring the base model to the target model is also an important issue of transfer learning. Shin and Cho provided a guide to transfer learning for a DNN model consisting of a feature extractor and a classifier for the four major scenarios shown in Table 2.33 Table 2 categorizes how transfer learning can be applied in each scenario. The discriminator of the GAN is a good base model of transfer learning for the VAD classifier, because it is a well-formed model trained on rich fake data generated by the generator and some real data. Moreover, the datasets used for training the discriminator and the VAD classifier are very similar. Finally, since the amount of labeled data in VAD is usually small, we fine-tune the VAD classifier according to the left-most scenario in Table 2.
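A minimal sketch of this transfer step, reusing the hypothetical UNetGenerator3D from the earlier generator sketch: the trained encoder weights are copied into Enc_C and frozen, and only the classification head Cls_C is fine-tuned, following the left-most scenario of Table 2 (little new data, similar domain). The head sizes are borrowed from Sec. 4.1.1 and the sigmoid output from Sec. 3.3; everything else is an assumption.

```python
# Sketch of phase 2: transfer the GAN generator's encoder and fine-tune only
# the classifier head (left-most scenario of Table 2).
import copy
import torch
import torch.nn as nn

def build_vad_classifier(trained_gen):
    # Enc_C: copy of the generator's encoder (same structure, same weights)
    encoder = nn.Sequential(copy.deepcopy(trained_gen.enc1),
                            copy.deepcopy(trained_gen.enc2))
    for p in encoder.parameters():        # freeze the transferred features
        p.requires_grad = False
    # Cls_C: DNN head; 32*4*30*45 is the encoder output size for 16x120x180 clips
    head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 4 * 30 * 45, 256),
                         nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
    return nn.Sequential(encoder, head)   # C(x) = Cls_C(Enc_C(x)), Eq. (18)

# Fine-tune only the head on the labeled clips:
# clf = build_vad_classifier(G)
# opt = torch.optim.Adam([p for p in clf.parameters() if p.requires_grad], 1e-4)
```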

3.3. Anomaly detection

The last fully connected layer of the VAD classifier is designed to have sigmoid activation. The final output A has a value between 0 and 1, and we treat this value as the anomaly score. Thus, when A exceeds the threshold T, the given input is treated as an anomaly. The equal error rate (EER) is used as the threshold; the EER is the point where the false positive rate is equal to the false negative rate. Equation (19) represents the prediction τ on the input data. Figure 4 shows the training algorithm of the proposed method.

$$\tau = \begin{cases} \text{Normal} & \text{if } A < T, \\ \text{Abnormal} & \text{otherwise.} \end{cases} \quad (19)$$
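A small sketch of computing the EER threshold T of Eq. (19) from validation scores, assuming scikit-learn's roc_curve; the variable names y_val and A_val are placeholders.

```python
# Pick the threshold where the false positive rate equals the false negative
# rate (the EER point), then apply the decision rule of Eq. (19).
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(y_true, scores):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))    # point where FPR == FNR
    return thresholds[idx]

# T = eer_threshold(y_val, A_val)
# prediction = np.where(A_test < T, "Normal", "Abnormal")   # Eq. (19)
```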

4. Experiments

4.1. Datasets and experimental settings

To verify that the proposed method is useful, we conduct experiments using the UCSD pedestrian-1 and pedestrian-2 datasets35 and the Subway entrance and exit datasets.42 The UCSD dataset consists of surveillance camera video that captures a sidewalk; it defines everything except people walking on the sidewalk as abnormal. The Subway dataset consists of surveillance camera video that captures the subway ticket gates; it defines no payment, wrong-direction movement and loitering as abnormal.


Fig. 5. The process to determine the ground truth (label) with data windowing.


Figure 5 shows the data preprocessing process to determine the ground truth (label). Here, p is the window size, which should be larger than 1, and n is the number of video clips V. As we said in Sec. 1, we do not consider point anomalies; in other words, an anomaly occurs over several frames, and an alarm delayed by a few frames is unnoticeable to humans. Thus, we simply label a video clip as an anomaly if more than half of its frames are anomalous. Additionally, due to limitations in learning speed and memory capacity, each image is resized to 180 pixels horizontally and 120 pixels vertically.
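A minimal sketch of this windowing and labeling step, assuming OpenCV for resizing; the window size p = 16 and the non-overlapping windows are assumptions.

```python
# Resize frames to 180x120, cut the video into windows of p frames, and label
# a clip as anomalous when more than half of its frame labels are anomalous.
import numpy as np
import cv2

def make_clips(frames, frame_labels, p=16):
    clips, labels = [], []
    for s in range(0, len(frames) - p + 1, p):        # window of size p
        window = [cv2.resize(f, (180, 120)) for f in frames[s:s + p]]
        clips.append(np.stack(window))
        # majority labeling: anomalous if more than half the frames are
        labels.append(int(np.sum(frame_labels[s:s + p]) > p / 2))
    return np.asarray(clips, dtype=np.float32) / 255.0, np.asarray(labels)
```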

Figure 6 shows the specification of all the experiments. In experiment 1, we use machine learning methods, our previous works and supervised learning-based state-of-the-art methods for comparison. The results labeled "GAN+LRCN" represent the performance in Ref. 33, and the results marked "AE+LRCN" are the performance of the model in which the base model is replaced by an AE rather than a GAN. The results labeled "LRCN" are from the experiments using the LRCN model from the beginning.

Fig. 6. The specification of the experiments.

4.1.1. Model structures

The generator of the proposed method consists of two 3D convolutional networks38 and two 3D deconvolutional networks with 5 × 5 and 3 × 3 filter sizes, respectively. Each layer uses stride 2 for pooling. Along the U-net structure, symmetric layers have skip connections. The numbers of channels are 16 and 32 in order. Also, batch normalization is performed for each layer to ensure learning stability.


The discriminator consists of three 2D time-distributed convolutional networks with 7×7, 5×5 and 3×3 filter sizes, an LSTM layer that receives the output of the last convolutional network, and two fully connected layers. In order to prevent the number of parameters from becoming too large, we set the number of channels of all convolutional networks to 16, and the output size of the LSTM is fixed to 1024. The numbers of nodes in the fully connected layers are set to 256 and 2, respectively. The fully connected layers of the VAD classifier of the proposed method are also designed to be identical to those of the discriminator. We set the activation function of the output layer to softmax and the activation function of all the other layers to ReLU. We use the same network structure for all the experiments.

4.2. Performance of VAD

Figure 7 shows the comparison of 10-fold cross-validation results of the proposed method and various comparable models33,37,40,41 on the UCSD pedestrian-1 dataset. The experimental results show that the proposed method yields the highest performance. Table 3 shows the confusion matrix of the proposed method. The proposed model achieves about 94.4% recall and about 86.4% precision.

Fig. 7. 10-fold cross-validation result for each method on the UCSD pedestrian-1 dataset. Experimental results show that the proposed method has the highest performance. In addition, deep learning methods perform better than the conventional machine learning methods.

Table 3. Confusion matrix result of the proposed method.

                         Actual
                         Anomaly    Normal
Predicted   Anomaly      1021       158
            Normal       60         1079

Here, the precision represents the probability of no false alarm, and the recall represents the probability of missing no anomaly. Figure 8 shows the result of a detailed comparison with the previous works. The proposed method shows recall performance similar to the previous works. However, it achieves about 8% higher precision and about 5% higher accuracy, which implies that the proposed method is more robust in anomaly detection than the previous works. Table 4 shows the mean, standard deviation and t-test results indicating whether the performance difference between the proposed method and the comparable methods is significant. The values of the third column are the results of a t-test between the proposed method and each comparable method. As a result of the experiment, the t-test value is less than 0.05 for both methods, supporting that the performance improvement of the proposed method is statistically significant.


Fig. 8. Recall, precision, accuracy and f1-score performance comparison results with the previous works.

Table 4. The results of mean accuracy, standard deviation and t-test.

             Mean     Stdev    t-value
GAN+LRCN     0.8634   0.008    0.0081
AE+LRCN      0.8540   0.015    0.0017
Proposed     0.8860   0.026    —

Moreover, the proposed method has the highest average accuracy among all the methods, but at the same time the highest standard deviation. This implies that the proposed model may be somewhat unstable. However, the absolute magnitude of the standard deviation is too small to be regarded as a disadvantage compared to the base models. In addition, we conduct a performance comparison with the conventional supervised learning-based VAD models on four datasets. For this comparative study, we adopt the supervised learning models mentioned in Sec. 2. We use the equal error rate (EER) and area under the curve (AUC) as the performance metrics. Table 5 shows the experimental results. As can be seen, the proposed method produces significantly better performance on all four datasets. In particular, the proposed method yields higher performance in all the cases than our previous study.33

4.3. Ability of data generation

To demonstrate that the generator can produce realistic video clips, we visualize and analyze fake video clips generated by the trained generator. Figure 9 shows four generated video clips plotted frame by frame, one clip per row; following each row, you can see the object move little by little from frame to frame. Figure 10 plots one generated video clip at three-frame intervals, in which the motion of the object appears more clearly. Table 6 shows quantitatively whether the generated data are similar to the actual data. The values in the row "Actual" are the result of comparing the similarity between randomly sampled real data, and the values in the row "Generated" are the result of comparing the similarity between randomly sampled real data and generated data. We repeated the comparisons 30 times and averaged them for a fair comparison. They show that the fake data are very similar to the actual data. Interestingly, the similarity in the row "Generated" is slightly higher than in the row "Actual". This is because the generated data are blurry compared to the actual data: the generator might have learned to follow the distribution of the actual data rather than their detailed characteristics.
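The similarity measurement behind Table 6 can be sketched as follows, assuming grayscale clips scaled to [0, 1] and scikit-image's structural_similarity; pairing clips by position and frame-wise averaging are assumptions.

```python
# Average SSIM and cosine similarity between two sets of clips, compared
# frame by frame (a sketch of the measurement behind Table 6).
import numpy as np
from skimage.metrics import structural_similarity

def clip_similarity(clips_a, clips_b):
    ssims, cosines = [], []
    for a, b in zip(clips_a, clips_b):                # a, b: (p, H, W) in [0, 1]
        ssims += [structural_similarity(fa, fb, data_range=1.0)
                  for fa, fb in zip(a, b)]
        va, vb = a.ravel(), b.ravel()
        cosines.append(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return float(np.mean(ssims)), float(np.mean(cosines))
```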

4.4. Case analysis of misclassification

In this section, we analyze the UCSD pedestrian-1 dataset using the t-SNE method and analyze the false-positive cases to examine the anomaly detection performance of the proposed model more closely. The t-SNE is a method of embedding high-dimensional data on a two-dimensional plane, so that neighboring data on the plane share similar features.39

Table 5. Comparison of anomaly detection performance on four benchmark datasets.

                     Ped1           Ped2           Sub Ent        Sub Exit
Methods              EER    AUC     EER    AUC     EER    AUC     EER    AUC
2D + 3D CNN40        0.24   0.85    0.24   0.86    —      0.93    —      0.92
3D CNN25             0.13   0.94    0.11   0.95    —      —       —      —
GAN+LRCN33           0.14   0.92    0.20   0.85    0.18   0.88    0.23   0.87
The proposed         0.11   0.95    0.11   0.93    0.18   0.90    0.16   0.96


Fig. 9. Plotting the generated video clips. One row means one image sequence. You can see that the object moves slightly in every frame.

Fig. 10. Plotting a generated video clip by 3-frame units. You can see the object change over time more clearly.


Table 6. Comparison of similarity between generated data and actual data.

             SSIM      Cosine Sim.
Actual       0.6201    0.9800
Generated    0.6341    0.9806

The left side of Fig. 11 shows the result of clustering the test data used in this experiment. The red dots indicate abnormal data, and the blue dots indicate normal data.

Since a number of clusters have been created, the data are considered to have a wide variety of characteristics. Besides, in the middle of the t-SNE plot, we can see that the abnormal data and the normal data are mixed together. This indicates that anomaly detection in the dataset used in our experiments is very difficult, and it supports the superiority of the proposed model, which detects about 97% of the anomalies in this dataset. The right side of Fig. 11 also shows the data for the top four misclassification cases. Each image represents the first frame of each video. All misclassified cases are false positives; that is, normal data incorrectly classified as abnormal.
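A small sketch of the t-SNE visualization used for Fig. 11, assuming scikit-learn and matplotlib and a features array taken from the classifier's last feature layer; the perplexity value is an assumption.

```python
# Embed the classifier's last-layer features into 2D with t-SNE and color by
# label, as in Fig. 11; `features` and `labels` are assumed numpy arrays.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1], c="b", s=4, label="normal")
    plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1], c="r", s=4, label="abnormal")
    plt.legend(); plt.show()
```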

However, in all the cases, the objects defined as anomalies tend to be only temporarily included at the start or end of a video block. This means that the classification performance of the model is greatly reduced in video blocks where the anomaly starts or ends. In the right images of Fig. 11, the objects in the red boxes are classified as abnormal. Because of this, it is presumed that the misclassification stems not only from the classifier itself.

Above all, an error of a few frames at the boundary between abnormal and normal cases, in video captured by modern equipment, is not noticeable at a human level. Therefore, the misclassification cases of the proposed model are not a problem in practice.

4.5. Discussion

As most of the CNN-based models for anomaly detection assume a situation in which anomaly types and characteristics are clearly defined, they simply classify the objects inside the data and decide the abnormality with predefined rules. They target anomaly detection in specific domains such as identification of thermal infrared faces,44,45 estimation of concrete,46 and detection of vehicle types,47 whereas the proposed model attempts to detect abstract anomalies as the unsupervised learning-based models do.

Fig. 11. t-SNE results in the last layer and the analysis of the top four false positive cases. When false positives occur, the anomaly object is occluded by another object or lies at the boundary point where the data label changes.


That is to say, we focus on the supervised approach to anomaly detection, unlike the previous works based on unsupervised and semi-supervised approaches, which cannot use the label information on anomalies.

However, general anomaly detection basically requires an additional mechanism to cope with unseen anomalies. We can tackle this problem by extending the proposed supervised approach or adopting unsupervised and semi-supervised approaches. In the future, we will pursue both directions.

5. Concluding Remarks

It is an important issue to develop a model based on supervised learning to work out the problem of VAD. Previously, we developed a method to address the shortage of labeled anomaly data that occurs in supervised learning. The key idea was to create a base model using the GAN and use it for transfer learning. In this paper, we have proposed another novel supervised learning-based method to address the problems of the data shortage as well as high-level feature extraction. The proposed model attempts to construct a source domain as similar to the target domain as possible for higher performance. Experimental results show that it is the best among the competitive models in accuracy, and the amount of improvement is statistically significant. In addition, we have discussed the advantages and disadvantages of the model through careful investigation of the video generation and false positive errors.

As a future work, we will explore ways to address the unseen data problem, one of the main limitations of supervised learning, in VAD.

Acknowledgment

This work was supported by the Defense Acquisition Program Administration and the Agency for Defense Development under the contract (UD190016ED).

References

1. S. Fleck and W. Straßer, Privacy sensitive surveillance for assisted living — A smart camera approach, in Handbook of Ambient Intelligence and Smart Environments (Springer, 2010), pp. 985–1014.

2. Y. Cong, J. Yuan and J. Liu, Sparse reconstruction cost for abnormal event detection, IEEE Conf. on Computer Vision and Pattern Recognition (2011), pp. 3449–3456.

3. D. Xu, Y. Yan, E. Ricci and N. Sebe, Detecting anomalous events in videos by learning deep representations of appearance and motion, Computer Vision and Image Understanding 156 (2017) 117–127.

4. A. Tome and L. Salgado, Anomaly detection in crowded scenarios using local and global Gaussian mixture models, Int. Conf. on Advanced Concepts for Intelligent Vision Systems (2017), pp. 363–374.

5. B. R. Kiran, D. M. Thomas and R. Parakkal, An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos, arXiv preprint, arXiv:1801.03149 (2018).

6. R. Chalapathy and S. Chawla, Deep learning for anomaly detection: A survey, arXiv preprint, arXiv:1901.03407 (2019).

7. J. Y. Kim and S. B. Cho, Detecting intrusive malware with a hybrid generative deep learning model, Int. Conf. on Intelligent Data Engineering and Automated Learning (2018), pp. 499–507.

8. M. Ahmed and A. N. Mahmood, Network traffic pattern analysis using improved information theoretic co-clustering based collective anomaly detection, Int. Conf. on Security and Privacy in Communication Systems (2014), pp. 204–219.

9. B. Barz, E. Rodner, Y. G. Garcia and J. Denzler, Detecting regions of maximal divergence for spatio-temporal anomaly detection, IEEE Trans. on Pattern Analysis and Machine Intelligence 41 (2018) 1088–1101.

10. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, IEEE Conf. on Computer Vision and Pattern Recognition (2015), pp. 2625–2634.

11. N. Ballas, L. Yao, C. Pal and A. Courville, Delving deeper into convolutional networks for learning video representations, arXiv preprint, arXiv:1511.06432 (2015).

12. M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury and L. S. Davis, Learning temporal regularity in video sequences, IEEE Conf. on Computer Vision and Pattern Recognition (2016), pp. 733–742.

13. Y. S. Chong and Y. H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, Int. Symposium on Neural Networks (2017), pp. 189–196.

14. N. Srivastava, E. Mansimov and R. Salakhudinov, Unsupervised learning of video representations using LSTMs, Int. Conf. on Machine Learning (2015), pp. 843–852.

15. Z. Chen, Y. Tian, W. Zeng and T. Huang, Detecting abnormal behaviors in surveillance videos based on fuzzy clustering and multiple autoencoders, IEEE Int. Conf. on Multimedia and Expo (2015), pp. 1–6.


16. D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, IEEE Int. Conf. on Computer Vision (2015), pp. 4489–4497.

17. S. Smeureanu, R. T. Ionescu, M. Popescu and B. Alexe, Deep appearance features for abnormal behavior detection in video, Int. Conf. on Image Analysis and Processing (2017), pp. 779–789.

18. R. Hinami, T. Mei and S. I. Satoh, Joint detection and recounting of abnormal events by learning deep generic knowledge, IEEE Int. Conf. on Computer Vision (2017), pp. 3619–3627.

19. J. R. Medel and A. Savakis, Anomaly detection in video using predictive convolutional long short-term memory networks, arXiv preprint, arXiv:1612.00390 (2016).

20. M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed and R. Klette, Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes, Computer Vision and Image Understanding 172 (2018) 88–97.

21. W. Chu, H. Xue, C. Yao and D. Cai, Sparse coding guided spatiotemporal feature learning for abnormal event detection in large videos, IEEE Trans. on Multimedia 21(1) (2018) 246–255.

22. J. Sun, J. Shao and C. He, Abnormal event detection for video surveillance using deep one-class learning, Multimedia Tools and Applications 78(3) (2019) 3633–3647.

23. Z. Chen, Y. Tian, W. Zeng and T. Huang, Detecting abnormal behaviors in surveillance videos based on fuzzy clustering and multiple auto-encoders, IEEE Int. Conf. on Multimedia and Expo (2015), pp. 1–6.

24. Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu and X. S. Hua, Spatio-temporal autoencoder for video anomaly detection, ACM Int. Conf. on Multimedia (2017), pp. 1933–1941.

25. J. Yu, K. C. Yow and M. Jeon, Joint representation learning of appearance and motion for abnormal event detection, Machine Vision and Applications 29(7) (2018) 1157–1170.

26. A. Dimokranitou, Adversarial autoencoders for anomalous event detection in images, Dissertation (2017).

27. M. Ravanbakhsh, E. Sangineto, M. Nabi and N. Sebe, Training adversarial discriminators for cross-channel abnormal event detection in crowds, IEEE Conf. on Applications of Computer Vision (2019), pp. 1896–1904.

28. J. Sun, X. Wang, N. Xiong and J. Shao, Learning sparse representation with variational auto-encoder for anomaly detection, IEEE Access 6 (2018) 33353–33361.

29. S. Huang, D. Huang and X. Zhou, Learning multimodal deep representations for crowd anomaly event detection, Mathematical Problems in Engineering 2018 (2018) 1–13.

30. M. Sabokrou, M. Pourreza, M. Fayyaz, R. Entezari, M. Fathy, J. Gall and E. Adeli, AVID: Adversarial visual irregularity detection, arXiv preprint, arXiv:1805.09521 (2018).

31. S. Lee, H. G. Kim and Y. M. Ro, STAN: Spatio-temporal adversarial networks for abnormal event detection, Int. Conf. on Acoustics, Speech and Signal Processing (2018), pp. 1323–1327.

32. O. Ronneberger, P. Fischer and T. Brox, U-net: Convolutional networks for biomedical image segmentation, Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (2015), pp. 234–241.

33. W. Shin and S. B. Cho, CCTV image sequence generation and modeling method for video anomaly detection using generative adversarial network, Int. Conf. on Intelligent Data Engineering and Automated Learning (2018), pp. 457–467.

34. J. Y. Kim, S. J. Bu and S. B. Cho, Zero-day malware detection using transferred generative adversarial networks based on deep autoencoders, Information Sciences 460 (2018) 83–102.

35. V. Mahadevan, W.-X. Li, V. Bhalodia and N. Vasconcelos, Anomaly detection in crowded scenes, IEEE Conf. on Computer Vision and Pattern Recognition (2010), pp. 1975–1981.

36. B. T. Morris and M. Trivedi, Understanding vehicular traffic behavior from video: A survey of unsupervised approaches, Journal of Electronic Imaging 22(4) (2013) 041113.

37. F. Chen, Z. Liu and M. T. Sun, Anomaly detection by using random projection forest, IEEE Int. Conf. on Image Processing (2015), pp. 1210–1214.

38. F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq and S. W. Baik, Violence detection using spatiotemporal features with 3D convolutional neural network, Sensors 19(11) (2019) 2472.

39. T. Gandhi and M. M. Trivedi, Image based estimation of pedestrian orientation for improving path prediction, IEEE Intelligent Vehicles Symposium (2008), pp. 506–511.

40. S. Zhou, W. Shen, D. Zeng, M. Fang, Y. Wei and Z. Zhang, Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes, Signal Processing: Image Communication 47 (2016) 358–368.

41. W. Sultani, C. Chen and M. Shah, Real-world anomaly detection in surveillance videos, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (2018), pp. 6479–6488.

42. A. Adam, E. Rivlin, I. Shimshoni and D. Reinitz, Robust real-time unusual event detection using multiple fixed-location monitors, IEEE Trans. Pattern Anal. Mach. Intell. 30(3) (2008) 555–560.

43. S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997) 1735–1780.


44. P. Wang and X. Bai, Regional parallel structure based CNN for thermal infrared face identification, Integrated Computer-Aided Engineering 25(3) (2018) 247–260.

45. M. Ahmadlou and H. Adeli, Enhanced probabilistic neural network with local decision circles: A robust classifier, Integrated Computer-Aided Engineering 17(3) (2010) 197–210.

46. M. H. Rafiei, W. H. Khushefati, R. Demirboga and H. Adeli, Supervised deep restricted Boltzmann machine for estimation of concrete, ACI Materials Journal 112(2) (2017) 237–244.

47. M. A. Molina-Cabello, R. M. Luque-Baena, E. Lopez-Rubio and K. Thurnhofer-Hemsi, Vehicle type detection by ensembles of convolutional neural networks operating on super resolved images, Integrated Computer-Aided Engineering 25(4) (2018) 321–333.
