
Smart Surveillance as an Edge Network Service: from Haar-Cascade, SVM to a Lightweight CNN

Seyed Yahya Nikouei, Yu Chen, Sejun Song, Ronghua Xu, Baek-Young Choi, Timothy R. Faughnan

Abstract—Edge computing efficiently extends the realm of information technology beyond the boundary defined by the cloud computing paradigm. By performing computation near the source and destination, edge computing promises to address the challenges in many delay-sensitive applications, such as real-time human surveillance. Leveraging ubiquitously connected cameras and smart mobile devices, it enables video analytics at the edge. In recent years, many smart video surveillance approaches have been proposed for object detection and tracking using Artificial Intelligence (AI) and Machine Learning (ML) algorithms. This work explores the feasibility of two popular human-object detection schemes, Haar-Cascade and HOG feature extraction with an SVM classifier, at the edge, and introduces a lightweight Convolutional Neural Network (L-CNN) for human detection that leverages the depthwise separable convolution for less computation. Single Board Computers (SBCs) are used as edge devices for tests, and the algorithms are validated using real-world campus surveillance video streams and open data sets. The experimental results are promising: the final algorithm is able to track humans with decent accuracy, at a resource consumption affordable by edge devices, in a real-time manner.

Keywords—Edge Computing, Smart Surveillance, Lightweight Convolutional Neural Network (L-CNN).

I. INTRODUCTION

Nowadays almost every person can be connected to the network using their pocket-sized mobile devices, wherever and whenever. The advancement of cyber-physical technologies and their interconnection through elastic communication networks facilitate the concept of Smart Cities, which improve the quality of life of residents. Attracted by the convenient lifestyle in bigger cities, the world's population has been increasingly concentrated in urban areas at an unprecedented scale and speed [10].

The fast pace of urbanization [11] poses many opportunities and challenges. The recent concept of Smart Cities has attracted the attention of urban planners and researchers seeking to enhance the security and well-being of residents. The proliferation of information and communication technologies (ICT) connects cyber-physical systems and social entities, and facilitates many smart community services. One of

S. Y. Nikouei, Y. Chen, and R. Xu are with the Department of Electrical and Computer Engineering, Binghamton University, SUNY, Binghamton, NY 13902 USA, email: {snikoue1, ychen, rxu22}@binghamton.edu.

S. Song and B. Y. Choi are with the School of Computing and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110 USA, email: {songsej, choiby}@umkc.edu.

T. R. Faughnan is with the New York State University Police, Binghamton University, SUNY, Binghamton, NY 13902 USA, email: [email protected].

the most essential smart community services is intelligent resident surveillance [7]. It enables a broad spectrum of promising applications, including access control in areas of interest, human identity or behavior recognition, detection of anomalous behaviors, interactive surveillance using multiple cameras, crowd flux statistics, and congestion analysis [26].

Many of these smart surveillance applications require significant computing and storage resources to handle the massive contextual data created by video sensors. A typical low frame rate (1.25 Hz) wide area motion imagery (WAMI) sequence alone can generate over 100 MB of data per second (400 GB per hour). According to a recent study, video data dominates real-time traffic and creates a heavy workload on communication networks: online video accounted for 74% of all online traffic in 2017, and 78% of mobile traffic will be video data by 2021 [13]. Thus, it is important to handle this massive data transfer in new ways. The cloud computing paradigm provides excellent flexibility and scales with the increasing number of surveillance cameras. In practice, however, there are significant hurdles for the remote cloud-based smart surveillance architecture.

Key surveillance applications such as monitoring and tracking need real-time capability. However, processing raw video data from widely distributed video sensors, such as Closed-Circuit Television (CCTV) cameras and mobile cameras, not only incurs uncertainty in data transfer and timing but also poses significant overhead and delay to the communication networks [10]. It may also cause data security and privacy issues by providing more attack opportunities for adversaries. Therefore, current surveillance applications serve off-line forensics analysis instead of acting as a proactive tool to deter suspicious activities before damage is caused.

The surveillance community has been aware of the growing demand for human resources to interpret the data due to the ubiquitous deployment of networked static and mobile cameras, and has made many efforts in past decades [34]. For example, many automated anomaly detection algorithms have been investigated using machine learning [39] and statistical analysis [19] approaches. Although these intelligent approaches are powerful, they are computationally very expensive; hence they are implemented as central cloud services. Researchers are also trying to help operating personnel become aware of events using event-driven visualization mechanisms [18], re-configuring the networked cameras [36], and mapping conventional real-time images to 3D camera images [45]. However, the traditional human-in-the-loop solutions are still challenged significantly by the demands of real-time surveillance systems due to the


lack of scalability. For example, video analysis mostly relies on teams of specially trained officers manually watching thousands of hours of videos of different formats and quality, looking for one specific target. Due to manual coordination and tracking and mechanical pan-tilt-zoom (PTZ) remote control, it is challenging to achieve adequate real-time surveillance.

Edge computing as a surveillance service is considered the answer to these shortcomings [2], [4], [6]. Edge computing technology migrates more computing tasks to the connected smart things (sensors and actuators) at the edge of the network [40]. In general, edge computing possesses the following advantages over cloud computing:

1) Real-time response: applications or services are directly executed on-site or near-site, so communication delays are minimized, which is essential to delay-sensitive, mission-critical tasks such as smart surveillance;

2) Lower network workload: raw data generated by sensors or monitors is consumed at the edge of the network instead of being outsourced to a remote cloud center. While the processed results may be sent to the cloud for further analysis, the communication overhead is much lower than outsourcing tasks to the cloud;

3) Lower energy consumption: most edge devices are energy constrained; by nature, the algorithms deployed at the edge are lightweight, which reduces the total energy consumed for processing and data transmission; and

4) Data security and privacy: the less data is sent, the fewer opportunities are available for adversaries to compromise the confidentiality and integrity of the data; also, it is easier to enforce security and privacy policies in a local network than to request collaboration among multiple network domains under different administrations.

In this paper, we propose to devolve more intelligence to the edge to significantly improve many smart tasks such as fast object detection and tracking. Adopting recent research on machine learning, we choose the Convolutional Neural Network (CNN) algorithm, which incurs comparatively less pre-processing overhead than other human image classification algorithms. We tailored the CNN to fit resource-constrained edge devices, based on the observation that surveillance systems are mainly for the safety and security of human beings. According to our experimental study, the lightweight CNN (L-CNN) algorithm can process an average of 1.79 and up to 2.06 frames per second (FPS) on the selected edge device, a Raspberry Pi 3 board. This meets the design goals considering the limited computing power and the requirements of the application.

In summary, the major contributions of this work are highlighted below:

1) Aiming at intelligent surveillance as an edge network service, a thorough study of two well-known human-object detection schemes, Haar-Cascade and HOG+SVM, has been conducted, which evaluates their feasibility of running on resource-limited edge devices;

2) A lightweight Convolutional Neural Network (L-CNN) is applied to enable real-time human-object identification at the network edge [35];

Fig. 1. Edge-Fog-Cloud hierarchical architecture.

3) Instead of simulation, system-oriented research has been conducted. The L-CNN, SSD GoogleNet, Haar-Cascade, and HOG+SVM algorithms are implemented on a Raspberry Pi 3 Model B board as the edge device; and

4) An extensive experimental validation study has been conducted using real-world surveillance video data. Compared with SSD GoogleNet, Haar-Cascade, and HOG+SVM, the L-CNN is a promising approach for delay-sensitive, mission-critical applications like real-time smart surveillance.

The rest of the paper is organized as follows. Section II provides background on closely related work. Section III examines Haar-Cascade and HOG+SVM at the edge. Then, Section IV introduces the proposed lightweight CNN architecture, and the training of the L-CNN is discussed in Section V. Section VI presents the results of the algorithms implemented on a Raspberry Pi 3 Model B and a Tinker board. At last, Section VII wraps up this paper with conclusions and a discussion of our on-going efforts.

II. BACKGROUND KNOWLEDGE AND RELATED WORK

A. Smart Surveillance as an Edge Service

The surveillance community has been aware of the growing demand for human resources to interpret the data due to the ubiquitous deployment of networked static and mobile cameras, and has made many efforts in past decades [34].

Traditional surveillance systems depend on human operators to manipulate the processing of captured video [8]. However, there are many shortcomings with this approach. Not only is it unrealistic to expect a human operator to maintain full concentration on the video for a long time, but it is also not scalable as the number of cameras as sensors grows significantly. More recently proposed smart systems, considered the second generation of surveillance systems, aim at minimizing the role that human operators play in object detection, and the responsibility of abnormal behavior detection is taken over by various more intelligent machine learning algorithms [11], [44]. The algorithms automatically process the collected video frames in a cloud to detect, track, and report any unusual circumstances.


Figure 1 presents an edge-fog-cloud hierarchical architecture in which the functions of a smart surveillance system are classified into three levels:

• Level 1: each object of interest is identified through low-level feature extraction from video data;

• Level 2: the behavior or intention of each object of interest is detected/recognized, raising quick alarms; and

• Level 3: anomalous or suspicious activity profiles are built, historical statistical analysis is performed, and the decision-making algorithm is fine-tuned through online training.

Ideally, the minimum delay and communication overhead would be achieved if all the functions were conducted on-site at the network edge where the sensor is located, and the decision made instantly [9], [10]. However, it is not realistic to accomplish the operations of both Level 1 and Level 2 on the edge devices. Therefore, once the Level 1 detection and tracking tasks are done, the results are outsourced to the fog layer for further data contextualization and decision making. The computationally expensive Level 3 functions can be positioned at the fog computing level or even further away in cloud centers, considering the constraints on edge processing power; functions like long-term profile building based on the geo-location of the camera are not required to be accomplished instantly. In a smart surveillance system, the Level 1 functions are fundamental. More specifically, human object detection is vital, as missing any object in the frame will lead to undetected behavior. Also, the false positive rate should be minimized, because wrongly identifying an object in the frame as a human will possibly result in false alarms.

B. Human-Object Detection

Haar-like feature extraction is well suited for face and eye detection [14]. Haar models are lightweight and very fast, which makes them appreciated candidates for edge implementation. However, the human body can have a different appearance under different ambient lighting, which makes it harder for this type of model to achieve high detection accuracy [20].

Grids of Histograms of Oriented Gradients (HOG) can produce reliable features for human detection [15]. While Haar features fail to detect humans when the body angle toward the camera changes, HOG features continue to perform well. HOG features are fed to a Support Vector Machine (SVM) classifier to create a human detection algorithm called HOG+SVM [33]. One downside of using HOG feature extraction at the edge is that this method creates a burden in a limited-resource environment. Performance results are discussed and compared in Section VI.
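As a concrete illustration, OpenCV ships a pre-trained HOG+SVM pedestrian detector that runs in a few lines on an edge device. The sketch below is a minimal example, assuming a local frame.jpg; the window stride, padding, and scale values are illustrative defaults, not the tuning used in this paper.

```python
import cv2

# Pre-trained HOG descriptor + linear SVM people detector bundled with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame.jpg")            # assumed input frame
rects, weights = hog.detectMultiScale(
    frame,
    winStride=(8, 8),                      # slide step of the 64x128 window
    padding=(8, 8),
    scale=1.05)                            # pyramid scale factor (see Fig. 4)
for (x, y, w, h) in rects:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", frame)
```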

Scale-Invariant Feature Transform (SIFT) is another well-known algorithm for human detection through extracting distinctive invariant features from images; it provides features that can be used to perform reliable matching between different views of an object or scene [22].

C. Machine Learning at the Edge

Powerful machine learning algorithms are recognized as the solution to take full advantage of big data in many areas [29], [48]. However, when big data comes to the edge, the demand for computing and storage resources makes them unfit. The edge environment necessitates lightweight but robust algorithms. Applications of some simple machine learning algorithms have been investigated in environments with constrained resources, such as wireless sensor networks (WSNs) and IoT devices [3]. There are efforts to build large-scale distributed machine learning systems that leverage heterogeneous computing devices and reduce the communication overhead [1], [42]. Although none of the reported algorithms is ideally fit for the resource-constrained edge environment, they have laid a solid foundation for us. In fact, the community has recognized the importance of efficient models for mobile and embedded applications [5], [24], [47].

GoogleNet [43] and Microsoft ResNet [21] are widely used and well-known architectures for image classification because of their high accuracy. They can take a picture as input and conduct classification for up to one thousand different objects. This type of network has as many filters as needed to create a feature map that can differentiate between the possible object classes [23], [25]. If only human objects are required to be classified or detected, the network architecture needs fewer filters to reach the same accuracy. However, the authors could not find any deep learning network designed specifically for detecting human objects; rather, the networks tend to target more generalized use cases.

Recent attempts have been made to create faster deep learning networks that require fewer resources without losing performance. As its name implies, SqueezeNet achieves the same performance as AlexNet but takes less memory [27]. MobileNet is another architecture created to work on resource-constrained devices [24]. It is not only memory efficient, but it also runs very fast because of the different convolutional architecture it is built on. It has been mathematically proven that this network creates less computational burden while having fewer parameters [41]. MobileNet produces results comparable to GoogleNet, which is one of the best-performing architectures in terms of accuracy.

III. HAAR-CASCADE AND SVM AT THE EDGE

In this paper, we focus on fast and accurate detection of humans as objects of interest (human objects), as it is vital for the algorithm to give the exact position coordinates of the object of interest for tracking purposes. Otherwise, an abnormality detection algorithm based on human behavior may not function properly given incomplete information. Although discussed comprehensively in the literature, an overview of the Haar-Cascade and HOG+SVM algorithms is provided in this section. Their wide usage for human detection in surveillance makes them notable candidates for edge application, and exploring them gives insight into their weaknesses on edge devices.

A. Haar-Cascade

Haar-like features consist of three general shapes. Figure 2 shows examples from the Haar-like feature set. These filters convolve over an input image, and at each position the sum of


Fig. 2. Examples of Haar-like features. (a) Two rectangular features. (b) Three rectangular features. (c) Four rectangular features.

pixel values in the black rectangles is subtracted from the sum of pixel values in the white rectangles. When the capturing angle changes, the same filter may produce very different results. A simpler classifier may miss an object if the features are not totally reliable for detection. One may also argue that with a better image set for training, Haar-Cascade will provide more accurate results [30].

When all possible scenarios are considered, even a 24 × 24 image will produce more than 160 thousand features, since the filters in Fig. 2 can have any combination of sizes, rotations, and positions. The learning process is computationally expensive and needs to take place on CPU clusters, which might not be available to many. However, once the training is finished, a feature set is ready and only the selected features are stored for future classification. Thus, the computational complexity of the overall algorithm is small in the execution phase.
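To illustrate how light the execution phase is, the sketch below loads one of the pre-trained cascade files distributed with OpenCV (the full-body cascade) and runs it on a grayscale frame; the file name and tuning parameters are illustrative, not the exact configuration evaluated here.

```python
import cv2

# Load a pre-trained cascade shipped with OpenCV; training happened offline.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_fullbody.xml")

gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)
bodies = cascade.detectMultiScale(
    gray,
    scaleFactor=1.1,      # step between pyramid scales
    minNeighbors=3,       # overlapping hits required to accept a detection
    minSize=(48, 96))     # ignore windows smaller than this
print(len(bodies), "human object(s):", bodies)
```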

In the training phase, the best-performing features are selected. The selection is performed by the AdaBoost (Adaptive Boosting) algorithm, which is constructed from classifiers called "weak learners". The algorithm generates a weighted sum of the results of the weak learners (Eq. 1), where h(x) is a weak learner for input x. During the learning process, each weak learner receives a weight in the summation based on an error calculation (Eq. 2) over the boosted classifier from the previous iteration. The goal is to minimize the error value (as shown in Eq. 2), where i ranges over every input at learning iteration t.

F_T(x) = \sum_t f_t(x), \quad \text{where } f_t(x) = \alpha_t h(x)    (1)

E_t = \sum_i E[F_{t-1}(x_i) + \alpha_t h(x_i)]    (2)
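A minimal NumPy sketch of this boosting loop, using one-feature decision stumps as the weak learners h(x); the stump form and the exhaustive threshold search are illustrative assumptions, not the exact Haar training pipeline.

```python
import numpy as np

def adaboost_train(X, y, T):
    """Train T weak learners; X: (n, d) features, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # sample weights
    learners = []                              # (feature, threshold, polarity, alpha)
    for _ in range(T):
        best = None
        for j in range(d):                     # pick the lowest weighted-error stump
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] < thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # alpha_t in Eq. 1
        pred = pol * np.where(X[:, j] < thr, 1, -1)
        w *= np.exp(-alpha * y * pred)         # emphasize misclassified samples
        w /= w.sum()
        learners.append((j, thr, pol, alpha))
    return learners

def adaboost_predict(X, learners):
    """Strong classifier F_T(x) = sum_t alpha_t * h_t(x), thresholded at zero."""
    F = np.zeros(X.shape[0])
    for j, thr, pol, alpha in learners:
        F += alpha * pol * np.where(X[:, j] < thr, 1, -1)
    return np.sign(F)
```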

Positive and negative images are collected for training, where positive images contain the object of interest against different backgrounds, including the positions and coordinates of the sample. In practice, many images are used more than once by mirroring them or cropping their edges. Negative images do not include the object of interest. In this training, around 2000 positive and 1000 negative images are used. The result of training is a file containing the specific best-performing features to be executed on input images. Based on those results, a rigid regression is further applied in areas where features give positive results, to output more accurate coordinates.

As revealed by the training procedure and the simplistic workflow of the algorithm, Haar-Cascade object classification will not perform well if the training set does not contain all possible angles, as shown in Section VI. Also, if the object of interest

is far away from the camera, which is the usual case in outdoor surveillance videos, the algorithm may fail to detect the object. Furthermore, the simple structure may imply a loss of robustness. However, it is used for low-power and real-time applications because of its fast detection.

B. SVM Classifiers

Pixel values cannot be trusted directly, because too many parameters affect them; other features are therefore sought. Figure 3(a) depicts a filter that is placed on each pixel (in white) with its four neighboring pixels (in black). The X and Y derivatives are calculated simply by subtracting the horizontal and vertical neighboring pixel values, respectively, around the white pixel. In particular, the X derivatives are fired by vertical lines and the Y derivatives by horizontal lines, which makes the overall features sensitive to lines and object edges. Changing the representation format to amplitude and angle results in unsigned gradients for each pixel. In practice, a filter convolves over the image and at each step calculates the gradient of a given pixel. Because of the unsigned gradients, the angular values lie between 0° and 180°. If nine bins of 20° each are considered, the amplitude of a gradient can be placed in the respective bin based on its angular value. It is worth mentioning that if the angular value is not at the center of a bin, the amplitude is divided between the two bins whose centers are closest to the angle. If the input image has more than one channel, such as RGB, the channel with the highest amplitude is chosen, and its respective angle is used in the histogram representation.

Figure 3(b) shows one such histogram, with amplitudes normalized by the highest amplitude value. This figure is taken from a patch of pixels crossed by a line, so the angular range of 120° to 140° has the highest abundance.
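The binning just described can be made concrete with a short NumPy sketch that computes the 9-bin unsigned-gradient histogram for a single grayscale cell; the 8 × 8 cell size and the split around bin centers are assumptions following the standard HOG formulation.

```python
import numpy as np

def hog_cell_histogram(cell):
    """9-bin unsigned-gradient histogram of one grayscale cell (e.g., 8x8)."""
    cell = cell.astype(float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]      # horizontal neighbor difference
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]      # vertical neighbor difference
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned gradient: 0..180
    hist = np.zeros(9)
    for m, a in zip(mag.ravel(), ang.ravel()):
        b = a / 20.0 - 0.5                        # position between bin centers
        lo = int(np.floor(b)) % 9
        frac = b - np.floor(b)
        hist[lo] += m * (1 - frac)                # split amplitude between the
        hist[(lo + 1) % 9] += m * frac            # two nearest bins (wrap-around)
    return hist
```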

Fig. 3. (a) HOG convolution filter. (b) HOG representation of an image. (c) Cell representation of the HOG [32].


Fig. 4. The image pyramid used in the HOG feature extraction method.

As an example, the HOG algorithm output is depicted as an image in Fig. 3(c), where each 64 × 64 cell is one gradient cell (cells are enlarged to be visible to human eyes and to reduce computation).

In an attempt to capture all details at different distances from the camera location, a pyramid of the image is usually employed. The image at its original resolution is considered first; then some pixels in each row and column are discarded to create a lower-resolution version of the same image, and the same HOG algorithm generates another feature map. The steps iterate until it is no longer feasible to conduct classification on the image. Figure 4 shows an image pyramid, where the top-left is the actual input and the bottom-right has the fewest pixels, but each pixel's size is the largest, which preserves the dimensionality of the input image.
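A pyramid like the one in Fig. 4 can be generated with a simple loop; the scale factor and the 64 × 128 minimum window below are assumed values matching a typical HOG pedestrian window.

```python
import cv2

def pyramid(image, scale=1.25, min_size=(64, 128)):
    """Yield progressively downsampled copies of `image` (cf. Fig. 4)."""
    yield image
    while True:
        w = int(image.shape[1] / scale)
        h = int(image.shape[0] / scale)
        if w < min_size[0] or h < min_size[1]:
            break                        # too small for the detection window
        image = cv2.resize(image, (w, h))
        yield image

# Example: run HOG feature extraction on every level.
# for level in pyramid(cv2.imread("frame.jpg")):
#     features = extract_hog(level)     # hypothetical per-level HOG call
```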

The SVM classifies objects of interest at each stage, so multiple detection reports are possible. In different scenes, fine-tuning of the HOG variables might be needed to determine the number of maps generated. Figure 5 is an example where the output detects a human object several times because several feature maps are provided to the SVM. Using the general pre-tuned variables adds an extra step: keeping only one of the bounding boxes and discarding the rest. One commonly used method is to keep the biggest bounding box as the object. This approach may lead to inaccurate detection, and the effect is more noticeable when multiple human objects are close to each other. Although the detection rate can be improved by fine-tuning the filter size and variables, in practice it is non-trivial to reconfigure cameras once they have been installed.
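The biggest-box heuristic mentioned above amounts to a greedy suppression over overlapping detections; the sketch below is one plausible implementation, with the IoU threshold chosen arbitrarily.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def keep_biggest(boxes, thr=0.3):
    """Greedily keep the largest box among each overlapping group."""
    kept = []
    for b in sorted(boxes, key=lambda b: b[2] * b[3], reverse=True):
        if all(iou(b, k) < thr for k in kept):
            kept.append(b)
    return kept
```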

As explained above, the HOG algorithm extracts features, and an SVM trained on those features classifies the humans. The COCO image set [31] archive for the person class is used for training

Fig. 5. False multiple detections for a single human object.

with around 20K images. Unfortunately, while the feature extraction provides useful information, SVM and HOG are too expensive for edge devices, where these computing-intensive tasks are repeatedly executed for each frame.

IV. LIGHTWEIGHT CNN

Recently, CNNs have been widely applied as a powerful tool for object classification. However, fitting CNNs into network edge devices is considered a challenging task due to the very strict constraints on resources. Even if the time-consuming and computing-intensive training can be outsourced to the cloud and the network layer architecture is simplified, edge devices still cannot afford the storage space for the parameters and filter weights of these deep neural networks, nor the computation required. Therefore, a lightweight CNN is desired for the edge environment.

In designing the L-CNN architecture, Depthwise Separable Convolution [24], [41] is employed to reduce the computational cost of the CNN itself without sacrificing much of the accuracy of the whole network. Also, the network is specialized for human detection to avoid the unnecessarily large number of filters in each layer. This yields a network implementable at the edge.

A. Depthwise Separable Convolution

By splitting each conventional convolution layer into two parts, a depthwise convolution and a pointwise convolution, the computational complexity becomes more suitable for edge devices. More specifically, the conventional convolution takes an input F, of dimensionality D_f × D_f with M channels, and maps it into G, which has N channels of dimension D_g × D_g. This is done by the filter bank K, a set of N filters, each of size D_k × D_k with M channels, as calculated in (Eq. 3):

G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1, l+j-1, m}    (3)

The computational complexity is

CC_{Conventional} = D_k \times D_k \times M \times N \times D_f \times D_f    (4)

Figure 6 compares the depthwise separable convolution filters with the conventional convolution, where the same result is computed in parts to minimize the complexity of the operation. The depthwise separable convolution consists of two parts. The first is a depthwise convolution layer of M filters of size D_k × D_k × 1, which generates M outputs. Next is a pointwise convolution layer in which the filters are N channels of 1 × 1 filters; together, the two layers produce an output G equivalent to (Eq. 3). With the input F as before, the depthwise layer computes (Eq. 5):

\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1, l+j-1, m}    (5)


Fig. 6. Comparison between (a) the conventional convolution and (b) the depthwise separable convolution.

where K̂ is the depthwise convolutional filter, with the special dimension D_k × D_k × M; the m-th filter in K̂ is applied to the m-th channel of F. The computational complexity of the depthwise separable convolution is

CC_{Depth} = D_k \times D_k \times M \times D_f \times D_f + N \times M \times D_f \times D_f    (6)

Based on (Eq. 6) and (Eq. 4), the computational complexity is reduced by the factor calculated in (Eq. 7) [24]. This makes a faster and more efficient network that is an ideal fit for edge devices.

\frac{CC_{Depth}}{CC_{Conventional}} = \frac{1}{N} + \frac{1}{D_k^2}    (7)
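The complexity comparison of Eqs. 3–7 can be checked numerically; the NumPy sketch below implements both convolution forms with valid padding and stride 1 (the dimension choices are arbitrary examples).

```python
import numpy as np

def conv2d(F, K):
    """Conventional convolution (Eq. 3): F is Df x Df x M, K is Dk x Dk x M x N."""
    Df, _, M = F.shape
    Dk, _, _, N = K.shape
    Dg = Df - Dk + 1                     # valid padding, stride 1
    G = np.zeros((Dg, Dg, N))
    for k in range(Dg):
        for l in range(Dg):
            patch = F[k:k + Dk, l:l + Dk, :]
            G[k, l, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return G

def depthwise_separable(F, K_dw, K_pw):
    """Depthwise (Eq. 5, K_dw: Dk x Dk x M) then pointwise (K_pw: M x N)."""
    Df, _, M = F.shape
    Dk = K_dw.shape[0]
    Dg = Df - Dk + 1
    G_dw = np.zeros((Dg, Dg, M))
    for k in range(Dg):
        for l in range(Dg):
            G_dw[k, l, :] = (F[k:k + Dk, l:l + Dk, :] * K_dw).sum(axis=(0, 1))
    return G_dw @ K_pw                   # 1x1 pointwise convolution mixes channels

# Multiply-accumulate counts from Eq. 4 and Eq. 6 for sample dimensions:
Df, Dk, M, N = 32, 3, 64, 128
cc_conv = Dk * Dk * M * N * Df * Df
cc_dsc = Dk * Dk * M * Df * Df + N * M * Df * Df
print(cc_dsc / cc_conv, 1 / N + 1 / Dk**2)   # both ~0.119, matching Eq. 7
```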

Immediately after each convolutional step, there is a Batch Normalization layer and a ReLU layer to introduce nonlinearity.

B. The L-CNN Architecture

The proposed L-CNN network architecture has 23 layers, counting depthwise and pointwise convolutions as separate layers, not counting the final softmax classifier and the regression layers that produce a bounding box around the detected object. A simple fully connected neural network classifier takes the prior probabilities of each window of objects, identifies the objects within the proposed window, and adds the label for the output bounding boxes at the end of the network. Figure 7 depicts the filter specification for each layer of the network.

Downsizing is achieved by striding in the filters; no dedicated pooling layer is added, which saves computation. The first convolutional layer of the L-CNN architecture is a conventional convolution, but in the rest of the network depthwise along with pointwise convolutions are used. The L-CNN is focused on human object detection: the network is used only for pedestrian detection, which further simplifies the network and decreases the number of parameters to store.

Introduced in late 2016, the Single Shot MultiBox Detector (SSD) method is faster than R-CNN [32] and more

Fig. 7. L-CNN network layers specification.

accurate than YOLO [37]. The name comes from the fact that results are generated in one feed-forward pass through the network, with no need for the extra steps taken in R-CNN. It is a unified framework for object detection with a single network. For training purposes, the SSD architecture needs more layers than a conventional CNN; once installed, it receives the input image and outputs the coordinates of each object detected in the image, along with a label for the object. It modifies the proposal generator to get a class probability instead of the existence of an object in the proposal. Instead of the classical sliding window checking each window for an object to report, SSD at its early convolutional layers creates a set of default bounding boxes over different aspect ratios, then scales them with each feature map through the convolutional layers along the image itself. In the end, it checks for the presence of each object category based on the prior probabilities of the objects in the bounding boxes and finally adjusts the bounding box to better fit the detection, which means adding five layers and using the outputs of two layers in the SSD application. One downside of SSD is that the detection accuracy for smaller objects is low if prior probability extraction is performed in only one layer. In smart surveillance, this can lead to a loss of generalization, because the goal is to detect every human object regardless of its distance from the camera or angle toward it. However, if the outputs of feature maps from different layers are used [38],


the detection rate can be increased.
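To make the default-box idea concrete, the sketch below generates the prior boxes for one feature map in the style of the SSD paper [32]; the scales, aspect ratios, and the extra unit-ratio box follow that formulation, and the specific values are illustrative.

```python
import numpy as np

def default_boxes(fmap_size, s_k, s_k1, ratios=(1.0, 2.0, 0.5)):
    """Default boxes for one fmap_size x fmap_size SSD feature map.
    s_k, s_k1: the scale of this feature map and of the next one."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for a in ratios:
                boxes.append((cx, cy, s_k * np.sqrt(a), s_k / np.sqrt(a)))
            s_extra = np.sqrt(s_k * s_k1)        # extra box for aspect ratio 1
            boxes.append((cx, cy, s_extra, s_extra))
    return np.array(boxes)                       # (cx, cy, w, h) in [0, 1]

priors = default_boxes(fmap_size=8, s_k=0.2, s_k1=0.34)
print(priors.shape)                              # (8*8*4, 4) candidate boxes
```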

V. L-CNN TRAINING

A. CNN Training

A CNN needs to be well trained before being deployed to conduct the task of classification. Usually, the training process requires substantial computing resources and large storage space that allow the training images to be loaded and fed to the network in batches. Also, the filters and other parameters must be pre-loaded into memory. In addition, the back-propagation operation incurs math-intensive matrix and differential calculations. Clearly, the edge environment is not an ideal place for training.

There are several widely used frameworks that serve this purpose, such as TensorFlow [1], Keras [12], and Caffe [28]. A brief introduction of each gives a clear view of its weak and strong points. TensorFlow is an open source software library for machine learning and artificial intelligence in general. One big benefit of this framework is that many GPUs can collaborate to increase training speed. Also, a light version of TensorFlow was recently introduced for mobile devices, which allows loading CNN models without additional libraries. However, architecture definitions in TensorFlow can be lengthy, so other platforms such as TFLearn are used to make them more compact.

Keras gained popularity for its user-friendly, easy-to-learn environment. It uses TensorFlow as a back-end engine. The Keras libraries, accessed using Python, create a bridge between Python syntax and TensorFlow. The libraries are designed to let the user generate and test deep models as fast as possible. The trade-off is that, by allowing spontaneous coding in Python, the low-level flexibility of TensorFlow is sacrificed. Moreover, one of the problems that make Keras not the best choice for edge devices is that, while the OpenCV library supports deep learning models, it still fails to import Keras-based networks. To use Keras, the library itself has to be installed on the edge device, and the results then need to be loaded into the OpenCV library.

Introduced in 2014 and written in C++, Caffe is a well-known tool in the deep learning community. It is a low-level library for working with CNNs. Its speed makes Caffe well suited for research experiments and industry deployment: Caffe can process over 60M images per day with a single NVIDIA K40 GPU [28]. On the other hand, Caffe has two main weaknesses. The lack of documentation on its commands makes coding a hard job; specifically, in the SSD realization of the Caffe model, there are layers needed for SSD deployment, but very little information is provided on their functionality. Additionally, a Caffe architecture is written as a plain text file, which is harder to manage as more layers are included in the architecture.

In this work, the proposed L-CNN is trained using MXNet because of its implementation of SSD and good documentation. Apache MXNet works well for a variety of machine learning tasks. The architecture is simply coded through the programming language, and during training a .json file is generated; as a result, the fully trained network is fast and implementable on edge devices. Also, MXNet is a flexible and scalable deep learning framework that many cloud centers provide today. Writing a deep learning architecture becomes an easy task because of its support for several languages such as C++, Java, and Python. MXNet has a great community and documentation that help one reach results faster. Moreover, the SSD branch of MXNet has a responsive community that helps programmers familiarize themselves with SSD. The L-CNN used MXNet in Python, as the structure can be generated easily with Python functions. The L-CNN architecture file, called a symbol file and ready for training, has fewer than 80 lines with the addition of the SSD code at the end. Through the training phase, a .json text file of the architecture along with another file containing the network weights is created [46], which can later be used instead of the symbol file for better speed. These outputs are closely analogous to Caffe's files (the .prototxt design file and the .caffemodel weights).

Images from VOC07 and VOC12 [17], along with ImageNet [16], are used to train the proposed L-CNN network, where 85% of the total set is used for training and 15% for validation. ImageNet is the biggest image set, containing more than 14 million images from more than one thousand different classes of objects. In this application, only the images from the human class are used. ImageNet provides bounding box coordinates for some of the images in particular classes. Note that ImageNet uses synsets to name the classes, so a file with the same image-list format as VOC07 is needed. Although the synset naming system is machine-readable, it is harder for humans to understand. A combination of sub-classes of human images, with coordinates from the ImageNet website, is employed.

The images have to be the same size as the input of the network. The network accepts colored (RGB) images with a size of 224 × 224 pixels. Thus, for training, blobs of 16 images, each holding three-dimensional data, were created, and likewise blobs of 16 images for validation; 75% of this set was used for training and 25% for validation. Lightning Memory-Mapped Database (LMDB) files were produced, which store data as {key, value} pairs. Converting the image set to such a format leads to faster reading. Furthermore, before the training, the data is normalized by calculating the mean value of each RGB channel using Caffe platform packages.
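A minimal preprocessing step matching this description might look as follows; the per-channel mean values are placeholders, since the actual means are computed from the training set.

```python
import cv2
import numpy as np

# Placeholder per-channel RGB means; real values come from the training set.
RGB_MEAN = np.array([123.0, 117.0, 104.0], dtype=np.float32)

def preprocess(path):
    """Resize a frame to the 224x224 network input and subtract channel means."""
    img = cv2.imread(path)                         # BGR, H x W x 3
    img = cv2.resize(img, (224, 224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)
    return img - RGB_MEAN                          # zero-centered input blob
```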

Training is done on a server machine with 28 Intel(R) Xeon(R) CPU cores at a base frequency of 2.4 GHz and 256 GB of physical memory. Training took 9.7 days. Several stop criteria are used, such as a maximum of 400 iterations, where each epoch consists of 250 batches; for every iteration one validation test took place, and every 40 iterations a snapshot of the weights was saved to preserve progress. The training error is calculated as in Eq. 8:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (8)

where \hat{y}_i is the value calculated by the network and y_i is the actual value. This Mean Square Error represents the error in object detection; the linear regression used for fine-tuning the bounding box uses a different error measure.


VI. EXPERIMENTAL RESULTS

A. Experimental Setup

All of the above-discussed methods are implemented on an edge device, a Raspberry Pi 3 Model B with an ARMv7 1.2 GHz processor and 1 GB of RAM.

The Raspberry Pi is a Single Board Computer (SBC). SBCs run a full operating system and have sufficient peripherals (memory, CPU, power regulation) to start execution without additional hardware; they originally targeted industrial platforms such as vending machines. The Raspberry Pi Foundation made SBCs accessible to almost anyone at low cost (less than $100) by delivering the Raspberry Pi product family. Given merits like commodity hardware, support for high-level programming languages (e.g., Python), and running popular variants of Unix-based operating systems, the Raspberry Pi is an ideal platform for edge computing.

The CPU and memory utilization of the algorithms are captured by a third-party application named memory profiler. This software is used for Python applications and can track the CPU and memory used by a process; it saves the data and later plots it using the Python Matplotlib library. Frames Per Second (FPS) is the major parameter to evaluate the performance of these algorithms. Figure 8 shows the average FPS over 30 seconds of run time for each algorithm. Note that the other CNN architectures needed to be retrained using the SSD platform model so that they could be used for detection rather than classification and compared with the other object detection algorithms.
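A profiling loop of this kind can be reproduced with the psutil package; the sketch below measures average FPS plus process CPU and resident memory, where `detector` stands in for any of the evaluated frame-level detectors (a hypothetical callable, not this paper's exact harness).

```python
import time
import psutil

def measure(detector, frames):
    """Return (avg FPS, CPU %, memory MB) for running `detector` over frames."""
    proc = psutil.Process()
    proc.cpu_percent(None)                 # prime the CPU counter
    start = time.time()
    for frame in frames:
        detector(frame)                    # hypothetical frame -> detections call
    elapsed = time.time() - start
    fps = len(frames) / elapsed
    cpu = proc.cpu_percent(None)           # CPU % since priming
    mem = proc.memory_info().rss / 2**20   # resident set size in MB
    return fps, cpu, mem
```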

B. Results and Discussions

Figure 8 summarizes the experimental results. The fastest algorithm is Haar-Cascade; the proposed L-CNN is second and very close to the best, while the other algorithms are very slow. The figure also shows that Haar-Cascade is the best in terms of resource efficiency, and again the L-CNN is second and very close. However, in terms of average false positive rate (FPR), our L-CNN achieved a very decent performance (6.6%) and a False Negative Rate (FNR) of 18.1%, much better than Haar-Cascade (26.3% and 34.9%). In fact, the L-CNN's accuracy is comparable to SSD GoogleNet (5.3% and 15.6%), but the latter demands far more resources and runs at an extremely low speed

Fig. 8. Performance in FPS, CPU and memory utilization, Average False Positive Rate (FPR%), and Average False Negative Rate (FNR%).

(0.39 FPS), which makes it not suitable for the edge. In contrast, the average speed of the L-CNN is 1.79 FPS, and a peak performance of 2.06 FPS is obtained. This is 64% faster than the MobileNet results and, along with lower memory usage, makes the L-CNN the best choice.

It is worth mentioning that GoogleNet does not use a huge memory portion, contrary to other reports, because this is a reduced SSD-based GoogleNet. As shown in Fig. 8 and Table I, with fewer classes, fewer parameters (and thus less memory) are needed to reach the same accuracy. To compute these accuracy measures, real-life surveillance video is used along with the VOC12 test dataset, so the percentages reported here may be higher than the general-purpose numbers reported in other literature.

Figures 9 (a) to (d) show the results of Haar-Cascade, HOG+SVM, GoogleNet, and L-CNN in processing a sample surveillance video. The footage is re-sized to 224 × 224 pixels for all algorithms. The smaller the image, the less computation it requires; also, the deep model architecture only accepts fixed-size images. Therefore, to compare all the algorithms fairly, they are all fed images of the same size. Because in practice surveillance videos are not allowed to be exposed to the public, the figures included in this paper are footage from an online open-source video.

The Haar-Cascade algorithm gives false detections by misidentifying the stone and the tripod as humans, as shown in Fig. 9 (a). Meanwhile, the HOG+SVM algorithm does not make the same mistakes, as illustrated in Fig. 9 (b). However, two other issues are observed. First, the bounding box is not fixed around the human objects, which may lead to inaccurate tracking performance in later steps. Secondly, in the middle of the frame, objects that are very close are considered as one in some frames, although in later frames two separate boxes are created for each person. Figures 9 (c) and 9 (d) verify the high accuracy achieved by the CNNs at the edge.

Figure 10 highlights the results of the L-CNN algorithm in processing video frames in which human objects are captured from various angles and distances. These are challenging scenarios for detection algorithms to decide whether or not the objects are human beings. Not only do the visible features vary with angle and distance, but also sometimes

Fig. 9. (a) Haar-Cascade. (b) HOG+SVM. (c) SSD-GoogleNet. (d) L-CNN.


Fig. 10. L-CNN: a human object from various angles and distances.

the human body is only partially visible or in different postures. For example, in the upper-right subfigure, the legs of the worker standing in the middle are overlapped, and the second person has only the head and part of the left arm captured. In the bottom-left subfigure, both legs of the pedestrian are not visible. Many algorithms either cannot identify that it is a human body, or incur a very high false positive rate.

Table I compares different CNN architectures with the proposed L-CNN algorithm, including several well-known architectures such as VGG, GoogleNet, and the lightweight MobileNet. Note that SSD networks, because of their changed architecture and their need for special training images with object contours, are dissimilar to classification CNNs; the SSD CNNs in this table are architectures trained using SSD. This test is performed on a desktop machine without a graphics card, with an Intel(R) Core(TM) i7 (3.40 GHz and 16 GB of RAM). The result matches our intuition very well: many heavy algorithms are not good choices for an edge device, as they require up to 20 times more memory space.

TABLE I. MEMORY UTILITY OF CNNS.

Architecture      Memory (MB)
VGG               2459.8
SSD-GoogleNet     320.4
SqueezeNet        145.3
MobileNet         172.2
SSD-L-CNN         139.5

VII. CONCLUSIONS

To make proactive urban surveillance and human behavior recognition and prediction edge network services, timely and accurate human object detection at the edge is the essential first step. While there are many algorithms for human detection, they are not suitable for the edge computing environment. In this paper, leveraging the depthwise separable convolution, a lightweight CNN architecture is introduced for human object detection at the edge. This model was trained using the VOC07 and VOC12 datasets, which contain the coordinates of the objects of interest. The MXNet platform was used for training, and OpenCV libraries are later used for implementation on the edge device.

This paper has also studied the advantages and constraints of two widely used human object detection algorithms, namely the Haar-Cascade object detector and the HOG+SVM human detector, in the context of edge computing. Along with GoogLeNet, they are implemented on a Raspberry Pi as an edge device for a comparison study. The experimental results have verified that the proposed L-CNN algorithm meets the design goals. The L-CNN achieves a satisfactory FPS (a maximum of 2.06 and an average of 1.79) and high accuracy (a false positive rate of 6.6% and a false negative rate of 18.1%); it uses two times fewer parameters than GoogleNet and occupies 2.6 times less memory than SSD GoogleNet.

With the capability of immediate human object identification, our on-going efforts include the following tasks: (1) lightweight object tracking, (2) human behavior recognition, (3) suspicious activity prediction and early alarm, and (4) video clip marking for batch replay. Our ultimate goal is a proactive surveillance system that enables a safer and more secure community by identifying suspicious activities and raising alerts before damage is caused.

REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.

[2] E. Ahmed and M. H. Rehmani, "Mobile edge computing: opportunities, solutions, and challenges," 2017.

[3] J. Ahn, J. Paek, and J. Ko, "Machine learning-based image classification for wireless camera sensor networks," in Embedded and Real-Time Computing Systems and Applications (RTCSA), 2016 IEEE 22nd International Conference on. IEEE, 2016, pp. 103–103.

[4] M. Ali, R. Dhamotharan, E. Khan, S. U. Khan, A. V. Vasilakos, K. Li, and A. Y. Zomaya, "SeDaSC: secure data sharing in clouds," IEEE Systems Journal, vol. 11, no. 2, pp. 395–404, 2017.

[5] D. Anisimov and T. Khanova, "Towards lightweight convolutional neural networks for object detection," in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on. IEEE, 2017, pp. 1–8.

[6] H. Cao, M. Wachowicz, C. Renso, and E. Carlini, "An edge-fog-cloud platform for anticipatory learning process designed for internet of mobile things," arXiv preprint arXiv:1711.09745, 2017.

[7] A. Cenedese, A. Zanella, L. Vangelista, and M. Zorzi, "Padova smart city: An urban internet of things experimentation," in World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2014 IEEE 15th International Symposium on a. IEEE, 2014, pp. 1–6.

[8] F. F. Chamasemani and L. S. Affendey, "Systematic review and classification on video surveillance systems," International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 7, p. 87, 2013.

[9] N. Chen, Y. Chen, E. Blasch, H. Ling, Y. You, and X. Ye, "Enabling smart urban surveillance at the edge," in 2017 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 2017, pp. 109–119.

[10] N. Chen, Y. Chen, S. Song, C.-T. Huang, and X. Ye, "Smart urban surveillance using fog computing," in Edge Computing (SEC), IEEE/ACM Symposium on. IEEE, 2016, pp. 95–96.

[11] N. Chen, Y. Chen, Y. You, H. Ling, P. Liang, and R. Zimmermann, "Dynamic urban surveillance video stream processing using fog computing," in Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE, 2016, pp. 105–112.

[12] F. Chollet et al., "Keras," 2015.


[13] Cisco, "Cisco visual networking index: Forecast and methodology, 2016–2021," https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-indexvni/mobile-white-paper-c11-520862.html, 2017.

[14] M. Cristani, R. Raghavendra, A. Del Bue, and V. Murino, "Human behavior analysis in video surveillance: A social signal processing perspective," Neurocomputing, vol. 100, pp. 86–97, 2013.

[15] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results," http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[18] C.-T. Fan, Y.-K. Wang, and C.-R. Huang, "Heterogeneous information fusion and visualization for a large-scale intelligent video surveillance system," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 4, pp. 593–604, 2017.

[19] T. Fuse and K. Kamiya, "Statistical anomaly detection in human dynamics monitoring using a hierarchical Dirichlet process hidden Markov model," IEEE Transactions on Intelligent Transportation Systems, 2017.

[20] S. Hannuna, M. Camplani, J. Hall, M. Mirmehdi, D. Damen, T. Burghardt, A. Paiement, and L. Tao, "DS-KCF: a real-time tracker for RGB-D data," Journal of Real-Time Image Processing, pp. 1–20, 2016.

[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[22] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.

[23] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[25] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification," Journal of Sensors, vol. 2015, 2015.

[26] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 34, no. 3, pp. 334–352, 2004.

[27] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.

[28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.

[29] K. Kolomvatsos and C. Anagnostopoulos, "Reinforcement learning for predictive analytics in smart cities," in Informatics, vol. 4, no. 3. Multidisciplinary Digital Publishing Institute, 2017, p. 16.

[30] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Image Processing. 2002. Proceedings. 2002 International Conference on, vol. 1. IEEE, 2002, pp. I–I.

[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European conference on computer vision. Springer, 2014, pp. 740–755.

[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European conference on computer vision. Springer, 2016, pp. 21–37.

[33] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[34] J. Ma, Y. Dai, and K. Hirota, "A survey of video-based crowd anomaly detection in dense scenes," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 21, no. 2, pp. 235–246, 2017.

[35] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, "Real-time human detection as an edge service enabled by a lightweight CNN," arXiv preprint arXiv:1805.00330, 2018.

[36] C. Piciarelli, L. Esterle, A. Khan, B. Rinner, and G. L. Foresti, "Dynamic reconfiguration in camera networks: a short survey," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 5, pp. 965–977, 2016.

[37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[39] M. Ribeiro, A. E. Lazzaretti, and H. S. Lopes, "A study of deep convolutional auto-encoders for anomaly detection in videos," Pattern Recognition Letters, 2017.

[40] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.

[41] L. Sifre, "Rigid-motion scattering for image classification," Ph.D. dissertation, PSU, 2014.

[42] P. Sun, Y. Wen, T. N. B. Duong, and S. Yan, "Timed dataflow: Reducing communication overhead for distributed machine learning systems," in Parallel and Distributed Systems (ICPADS), 2016 IEEE 22nd International Conference on. IEEE, 2016, pp. 1110–1117.

[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[44] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, 2013.

[45] J. Wu, "Mobility-enhanced public safety surveillance system using 3D cameras and high speed broadband networks," GENI NICE Evening Demos, 2015.

[46] J. Z. Zhang, "MXNet port of SSD: Single shot multibox object detector," https://github.com/zhreshold/mxnet-ssd, 2018.

[47] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint arXiv:1707.01083, 2017.

[48] L. Zhou, S. Pan, J. Wang, and A. V. Vasilakos, "Machine learning on big data: Opportunities and challenges," Neurocomputing, vol. 237, pp. 350–361, 2017.