
YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers

Rachel Huang*
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, United States
[email protected]

Jonathan Pedoeem*
Electrical Engineering
The Cooper Union
New York, United States
[email protected]

Abstract—This paper focuses on YOLO-LITE, a real-time object detection model developed to run on portable devices such as a laptop or cellphone lacking a Graphics Processing Unit (GPU). The model was first trained on the PASCAL VOC dataset and then on the COCO dataset, achieving a mAP of 33.81% and 12.26% respectively. With only 7 layers and 482 million FLOPS, YOLO-LITE runs at about 21 FPS on a non-GPU computer and 10 FPS after being implemented onto a website. This speed is 3.8× faster than the fastest state-of-the-art model. Based on the original object detection algorithm YOLO, YOLO-LITE was designed to create a smaller, faster, and more efficient model, increasing the accessibility of real-time object detection to a variety of devices.

Index Terms—object detection; YOLO; neural networks; deep learning; non-GPU; mobile; web

I. INTRODUCTION

Vision is not only the ability to see a picture in one's head; it is also the ability to understand and infer from the image that is seen. The understanding of images is not objective either. It is clear that this problem is not a simple one.

The ability to replicate vision in computers is necessary to advance day-to-day technology. Object detection begins to address this issue. It is the process of predicting the bounding boxes of an object and also classifying it in a given image. An efficient and accurate object detection algorithm is key to the success of autonomous vehicles, augmented reality devices, and other intelligent systems. A lightweight algorithm can be applied to many everyday devices, such as an Internet-connected doorbell or thermostat.

Currently, state-of-the-art object detection used in cars relies heavily on sensor output from expensive radars and depth sensors. Other techniques that are solely computer based require an immense amount of GPU power, and even then are not always real-time, making them impractical for everyday applications.

YOLO-LITE tries to address this problem. Using the You Only Look Once (YOLO) [1] algorithm as a starting point, YOLO-LITE is an attempt to get a real-time object detection algorithm running on a standard non-GPU computer.

A. Goal

Our goal with YOLO-LITE was to develop an architecture that can run at a minimum of ∼10 frames per second (FPS) on a non-GPU powered computer with a mAP of 30% on PASCAL VOC. This goal was determined by looking at the state of the art and setting a reasonable benchmark to reach.

*equal authorship

Fig. 1. Example images passed through our YOLO-LITE COCO model.

B. Contributions

YOLO-LITE offers two main contributions to the field of object detection. This model:

1) Demonstrates the capability of shallow networks for fast non-GPU object detection applications.

2) Suggests that batch normalization is not necessary for shallow networks and, in fact, slows down the overall speed of the network.


II. LITERATURE REVIEW

There has been much work in developing object detection algorithms using a standard camera with no additional sensors. State-of-the-art object detection algorithms use deep neural networks.

Convolutional Neural Networks (CNNs) are the main class of architectures used for computer vision. Instead of having only fully-connected layers, a CNN has convolution layers where a filter is convolved with different parts of the input to create the output. The use of a convolution layer allows an input (such as an image) to be transformed into a smaller space. In addition, a convolution layer tends to have fewer weights to learn than a fully-connected layer, as filters do not need an assigned weight from every input to every output.
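As a rough illustration of this weight-sharing advantage (our own example, not figures from the paper; the layer sizes are arbitrary and chosen only for scale), the snippet below compares the parameter count of a single 3x3 convolutional layer with that of a fully-connected layer reading the same 224x224x3 input:

```python
# Illustrative parameter counts (hypothetical layer sizes, not from the paper).
in_h, in_w, in_c = 224, 224, 3   # input resolution and channels
out_c = 16                       # output feature maps / hidden units
k = 3                            # 3x3 convolution kernel

# A conv filter is shared across every spatial position, so its weight count
# does not depend on the input resolution.
conv_params = k * k * in_c * out_c + out_c        # weights + biases -> 448

# A fully-connected layer needs one weight from every input value to every unit.
fc_params = (in_h * in_w * in_c) * out_c + out_c  # -> 2,408,464

print(conv_params, fc_params)
```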

A. R-CNN

Region-based convolutional neural networks (R-CNN) [2] use region proposals to find objects in images. From each region proposal, a feature vector is extracted and fed into a convolutional neural network. For each class, the feature vectors are evaluated with Support Vector Machines (SVMs). Although R-CNN achieves high accuracy, the model is not able to reach real-time speed, even with Fast R-CNN [3] and Faster R-CNN [4], due to the expensive training process and the inefficiency of region proposal.

B. YOLO

YOLO was developed as a one-step process involving detection and classification. Bounding box and class predictions are made after a single evaluation of the input image. The fastest architecture of YOLO is able to achieve 45 FPS, and a smaller version, Tiny-YOLO, achieves up to 240 FPS on a computer with a GPU.

The idea of YOLO differs from other traditional systems in that bounding box predictions and class predictions are made simultaneously. The input image is first divided into an S × S grid. Next, B bounding boxes are defined in every grid cell, each with a confidence score. Confidence here refers to the probability that an object exists in each bounding box and is defined as:

$C = \Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}$ (1)

where IOU, intersection over union, represents a fraction between 0 and 1 [1]. Intersection is the overlapping area between the predicted bounding box and the ground truth, and union is the total area covered by both the predicted and ground-truth boxes, as illustrated in Figure 2. Ideally, the IOU should be close to 1, indicating that the predicted bounding box is close to the ground truth.

Fig. 2. Illustration depicting the definitions of intersection and union.
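A minimal Python sketch of Equation 1 (illustrative only, not the authors' Darknet code): the IOU of two axis-aligned boxes given as corner coordinates, scaled by an assumed objectness probability to form the confidence score.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)       # overlapping area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                          # total covered area
    return inter / union if union > 0 else 0.0

# Confidence per Eq. 1: Pr(Object) times IOU between prediction and ground truth.
# The 0.9 objectness value is made up for illustration.
predicted, truth = (10, 10, 50, 50), (12, 14, 48, 52)
confidence = 0.9 * iou(predicted, truth)
print(round(confidence, 3))
```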

Simultaneously, while the bounding boxes are made, each grid cell also predicts C class probabilities. The following equation shows the class-specific probability for each grid cell [1]:

$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}.$ (2)

YOLO uses the following equation to calculate the loss [1] and ultimately optimize confidence:

$$
\begin{aligned}
\text{Loss} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2. \quad (3)
\end{aligned}
$$

The loss function is used to correct the center and the bounding box of each prediction. The x and y variables refer to the center of each prediction, while w and h refer to the bounding box dimensions; the hatted variables denote the corresponding ground-truth values. The $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ weights are used to increase the emphasis on boxes that contain objects and to lower the emphasis on boxes with no objects. C refers to the confidence, and p(c) refers to the classification prediction. The indicator $\mathbb{1}^{\text{obj}}_{ij}$ is 1 if the jth bounding box in the ith cell is responsible for the prediction of the object, and 0 otherwise; $\mathbb{1}^{\text{obj}}_{i}$ is 1 if the object is in cell i and 0 otherwise. The loss indicates the performance of the model, with a lower loss indicating higher performance.
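The sketch below is a simplified NumPy rendering of Equation 3, not the Darknet implementation. It assumes the noobj indicator is simply the complement of the obj indicator, handles the class term at the cell level, and uses the λ_coord = 5 and λ_noobj = 0.5 weights reported in the original YOLO paper [1]; the tensor layout is invented for illustration.

```python
import numpy as np

def yolo_box_loss(pred, truth, resp, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified localization and confidence terms of Eq. 3 for one image.

    pred, truth: (S*S, B, 5) arrays of [x, y, w, h, C] per box
    resp:        (S*S, B) 1 if box j of cell i is responsible for an object
    """
    x, y, w, h, conf = (pred[..., k] for k in range(5))
    tx, ty, tw, th, tconf = (truth[..., k] for k in range(5))

    # Localization: centers directly, box sizes through square roots.
    loss = lambda_coord * np.sum(resp * ((x - tx) ** 2 + (y - ty) ** 2))
    loss += lambda_coord * np.sum(
        resp * ((np.sqrt(w) - np.sqrt(tw)) ** 2 + (np.sqrt(h) - np.sqrt(th)) ** 2))

    # Confidence: full weight for responsible boxes, down-weighted otherwise.
    conf_err = (conf - tconf) ** 2
    loss += np.sum(resp * conf_err) + lambda_noobj * np.sum((1 - resp) * conf_err)
    return loss

def yolo_class_loss(pred_probs, true_probs, obj_cell):
    """Last term of Eq. 3: squared class-probability error for cells with objects.

    pred_probs, true_probs: (S*S, num_classes); obj_cell: (S*S,) 0/1 mask.
    """
    return np.sum(obj_cell[:, None] * (pred_probs - true_probs) ** 2)
```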

While loss is used to gauge the performance of a model, the accuracy of predictions made by object detection models is calculated through the average precision equation shown below:

$\text{avgPrecision} = \sum_{k=1}^{n} P(k)\,\Delta r(k).$ (4)


TABLE I
PERFORMANCE OF EACH VERSION OF YOLO

Model       | Layers | FLOPS (B)    | FPS | mAP  | Dataset
YOLOv1      | 26     | not reported | 45  | 63.4 | VOC
YOLOv1-Tiny | 9      | not reported | 155 | 52.7 | VOC
YOLOv2      | 32     | 62.94        | 40  | 48.1 | COCO
YOLOv2-Tiny | 16     | 5.41         | 244 | 23.7 | COCO
YOLOv3      | 106    | 140.69       | 20  | 57.9 | COCO
YOLOv3-Tiny | 24     | 5.56         | 220 | 33.1 | COCO

P(k) here refers to the precision at threshold k, while ∆r(k) refers to the change in recall.
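A small sketch of Equation 4 (our own illustration, not the evaluation code used in the paper), assuming the precision/recall pairs are already ordered so that recall is non-decreasing:

```python
import numpy as np

def average_precision(precision, recall):
    """Eq. 4: sum of precision at each threshold times the change in recall."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    delta_r = np.diff(np.concatenate(([0.0], recall)))  # change in recall per threshold
    return float(np.sum(precision * delta_r))

# Made-up precision/recall values at three confidence thresholds.
print(average_precision(precision=[1.0, 0.8, 0.6], recall=[0.2, 0.5, 0.9]))  # 0.68
```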

The neural network architecture of YOLO contains 24 convolutional layers and 2 fully-connected layers. YOLO was improved with later versions such as YOLOv2 and YOLOv3 in order to minimize localization errors and increase mAP. As seen in Table I, a condensed version of YOLOv2 called Tiny-YOLOv2 [5] achieves the highest FPS of 244, a mAP of 23.7%, and the lowest number of floating point operations (FLOPS) at 5.41 billion.

When Tiny-YOLOv2 runs on a non-GPU laptop (Dell XPS 13), the model's speed decreases from 244 FPS to about 2.4 FPS. With this constraint, real-time object detection is not easily accessible on many devices without a GPU, such as most cellphones or laptops.

III. EXPERIMENTATION

While some works [6], [7], [8] focused on creating original convolution layers or pruning methods in order to shrink the size of the network, YOLO-LITE focuses on taking what already existed and pushing it to its limits of accuracy and speed. Additionally, YOLO-LITE focuses on speed rather than the overall physical size of the network and weights.

Experimentation was done with an agile mindset. Using Tiny-YOLOv2 as a starting point, different features were removed and added, and each variant was trained on PASCAL VOC 2007 & 2012 for about 10-12 hours. The trials were then tested on the validation set of PASCAL VOC 2007 to calculate mAP. PASCAL VOC was used for the development of the architecture, since its small size allows for quicker training. The best performing model was used as the platform for the next round of iterations.

While there was a focus on trying to intuit what would improve mAP and FPS, it was hard to find good indicators. From the beginning, it was assumed that the FLOPS count would correlate with FPS; this proved to be true. However, adding more filters, making filters bigger, and adding more layers did not easily translate to an improved mAP. It was difficult to hypothesize how changes to the architecture size would affect the mAP.

A. Setup

Darknet, the framework created to develop YOLO, was used to train and test the models. The training was done on an Alienware Aura R7 with an Intel i7 CPU and an Nvidia 1070 GPU. Testing for frames per second was done on a Dell XPS 13 laptop, using Darkflow's live demo example script.

Fig. 3. Example images with image segmentation from the COCO dataset [10].

B. PASCAL VOC and COCO Datasets

YOLO-LITE was trained on two datasets. The model was first trained using a combination of PASCAL VOC 2007 and 2012 [9], which contains 20 classes with approximately 5,000 training images.

The highest performing model trained on PASCAL VOC was then retrained on the second dataset, COCO 2014 [10], containing 80 classes with approximately 40,000 training images. Figure 3 shows some example images with object segmentation taken from the COCO dataset.

TABLE II
PASCAL VOC AND COCO DATASETS

Dataset                | Training Images | Number of Classes
PASCAL VOC 2007 + 2012 | 5,011           | 20
COCO 2014              | 40,775          | 80

C. Indicators for Speed and Precision

Table III reveals what was successful and what was not when developing YOLO-LITE. The loss, which is reported in Table III, was not a good indicator of mAP. While a high loss indicates a low mAP, there is no exact relationship between the two. This is because the terms in Equation 3 are not defined exactly by the mAP but rather by a combination of different features. The training time, when taken in conjunction with the number of epochs, was a very good indicator of FPS, as seen from Trials 3 and 6. The FLOPS count was also a good indicator, but it does not take into consideration the calculations and time necessary for batch normalization; it was therefore not as good an indicator as the epoch rate.


TABLE III
RESULTS FOR EACH TRIAL RUN ON PASCAL VOC

Model                          | Layers | mAP    | FPS  | FLOPS   | Time     | Loss
Tiny-YOLOv2 (TY)               | 9      | 40.48% | 2.4  | 6.97 B  | 12 hours | 1.26
TY-No Batch Normalization (NB) | 9      | 35.83% | 3    | 6.97 B  | 12 hours | 0.85
Trial 1                        | 7      | 12.64% | 1.56 | 28.69 B | 10 hours | 1.91
Trial 2                        | 9      | 30.24% | 6.94 | 1.55 B  | 5 hours  | 1.37
Trial 2 (NB)                   | 9      | 23.49% | 12.5 | 1.55 B  | 6 hours  | 1.56
Trial 3                        | 7      | 34.59% | 9.5  | 482 M   | 10 hours | 1.68
Trial 3 (NB)                   | 7      | 33.57% | 21   | 482 M   | 7 hours  | 1.64
Trial 4                        | 8      | 2.35%  | 5.2  | 1.03 B  | 10 hours | 1.93
Trial 5                        | 7      | 0.55%  | 3.5  | 426 M   | 7 hours  | 2.4
Trial 6                        | 7      | 29.33% | 9.7  | 618 M   | 11 hours | 1.91
Trial 7                        | 8      | 16.84% | 5.7  | 482 M   | 7 hours  | 2.3
Trial 8                        | 8      | 24.22% | 7.8  | 490 M   | 13 hours | 1.3
Trial 9                        | 7      | 28.64% | 21   | 846 M   | 12 hours | 1.5
Trial 10                       | 7      | 23.44% | 8.2  | 1.661 B | 10 hours | 1.55
Trial 11                       | 7      | 15.91% | 21   | 118 M   | 12 hours | 1.35
Trial 12                       | 8      | 26.90% | 6.9  | 71 M    | 9 hours  | 1.74
Trial 12 (NB)                  | 8      | 25.16% | 15.6 | 71 M    | 12 hours | 1.35
Trial 13                       | 8      | 39.04% | 5.8  | 1.083 B | 11 hours | 1.42
Trial 13 (NB)                  | 8      | 33.03% | 10.5 | 1.083 B | 16 hours | 0.77

TABLE IV
ARCHITECTURE OF EACH TRIAL ON PASCAL VOC

Trial         | Architecture Description
TY-NB         | Same architecture as TY, but with no batch normalization.
Trial 1       | First 3 layers same as TY. Layer 4 has 512 3x3 filters. Layer 5 has 1024 3x3 filters. Layers 6 & 7 same as the last 2 layers of TY.
Trial 2       | Same architecture as TYV, but input image size shrunk to 208x208.
Trial 2 (NB)  | Same architecture as Trial 2, but no batch normalization.
Trial 3       | First 4 layers same as TYV. Layer 5 has 128 3x3 filters. Layer 6 has 128 3x3 filters. Layer 7 has 256 3x3 filters. Layer 8 has 125 1x1 filters.
Trial 3 (NB)  | Same architecture as Trial 3, but no batch normalization.
Trial 4       | Layer 1: 5 3x3 filters. Layer 2: 5 3x3 filters. Layer 3: 16 3x3 filters. Layer 4: 64 2x2 filters. Layer 5: 256 2x2 filters. Layer 6: 128 2x2 filters. Layer 7: 512 1x1 filters. Layer 8: 125 1x1 filters.
Trial 5       | L1: 8 3x3 filters. L2: 16 3x3 filters. L3: 32 1x1 filters. L4: 64 1x1 filters. L5: 64 1x1 filters. L6: 125 1x1 filters.
Trial 6       | Same as Trial 3, but the activation functions were changed to ReLU instead of Leaky ReLU.
Trial 7       | L1: 32 3x3 filters. L2: 34 3x3 filters. L3: 64 1x1 filters. L4: 128 3x3 filters. L5: 256 3x3 filters. L6: 1024 1x1 filters. L7: 125 1x1 filters.
Trial 8       | Same as Trial 3, but with one more layer before L7 with 256 3x3 filters.
Trial 9       | Same as Trial 3 (NB), but with the input raised to 300x300.
Trial 10      | Same as Trial 3 (NB), but with the input raised to 416x416.
Trial 11      | Same as Trial 3 (NB), but with the input lowered to 112x112.
Trial 12      | —
Trial 12 (NB) | Same architecture as Trial 12, but no batch normalization.
Trial 13      | Same architecture as TY but with one fewer layer; it does not have layer 8 of TY.
Trial 13 (NB) | Same architecture as Trial 13, but no batch normalization.

Trials 4, 5, 8, and 10 showed that there was no clear relationship between adding more layers and filters and improving accuracy.

D. Image Size

It was determined when comparing Trial 2 to Tiny-YOLOv2 that reducing the input image size by half can more than double the speed of the network (6.94 FPS vs 2.4 FPS) but will also affect the mAP (30.24% vs 40.48%). Reducing the input image size means that less of the image is passed through the network. This allows the network to be leaner, but also means that some data is lost. We determined that, for our purposes, it was better to take the speedup over the mAP.

E. Batch Normalization

Batch normalization [11] offers many different improvements, most notably speeding up training time. Trials have shown that batch normalization improves accuracy over the same network without it. YOLOv2 and v3 have also seen improvements in training and mAP by implementing batch normalization [5], [12]. While there is much empirical evidence showing the benefits of using batch normalization, it was found to be unnecessary while developing YOLO-LITE.


To understand why that is the case, it is necessary to understand what batch normalization is.

Batch normalization entails taking the output of one layer and transforming it to have a mean of zero and a standard deviation of one before inputting it to the next layer. The idea behind batch normalization is that, during training with mini-batches, it is hard for the network to learn the true ground-truth distribution of the data, as each mini-batch may have a different mean and variance. This issue is described as covariate shift. The covariate shift makes it difficult to properly train the model, as certain features may be disproportionately saturated by activation functions; this is what is referred to as the vanishing gradient problem. By keeping the inputs on the same scale, batch normalization stabilizes the network, which in turn allows the network to train more quickly. At test time, estimated values from training are used to batch normalize the test image.

As YOLO-LITE is a small network, it does not suffer greatly from covariate shift and, in turn, the vanishing gradient problem. Therefore, the assumption made by Ioffe et al. that batch normalization is necessary does not apply. In addition, the batch normalization calculation that needs to take place between each layer holds up the network and slows down the whole feedforward process. During the feedforward pass, each input value has to be updated; for example, the initial input layer of 224 × 224 × 3 has over 150k values that need to be normalized. This calculation happening at every layer builds up the time necessary for the forward pass. This has led us to do away with batch normalization.
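The sketch below (a NumPy approximation, not the Darknet code) shows the inference-time transform and why it touches every value in a feature map: normalizing a single 224×224×3 tensor already involves roughly 150k elements, and the same per-element work repeats after every layer that uses batch normalization.

```python
import numpy as np

def batch_norm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization: normalize each channel with the
    statistics estimated during training, then scale and shift."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.rand(224, 224, 3).astype(np.float32)  # one input-sized feature map
mean, var = np.zeros(3), np.ones(3)                  # per-channel running statistics
gamma, beta = np.ones(3), np.zeros(3)                # learned scale and shift
y = batch_norm_inference(x, mean, var, gamma, beta)
print(x.size)  # 150528 values touched for this single tensor
```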

F. Pruning

Pruning is the idea of cutting certain weights based on their importance. It has been shown that a simple pruning method of removing any weight below a certain threshold can reduce the number of parameters in AlexNet by 9× and VGGNet by 13× with little effect on accuracy [13].

It has also been suggested [13] that pruning, along with quantization of the weights and Huffman coding, can greatly shrink the network size and also speed up the network by 3 or 4 times (AlexNet, VGGNet).

Pruning YOLO-LITE showed no improvement in accuracy or speed. This is not surprising, as the results mentioned in [13] are on networks that have many fully-connected layers, while YOLO-LITE consists mainly of convolutional layers; this explains the lack of results. A method such as the one suggested by Li et al. for pruning convolutional neural networks may be more promising [14]: they suggest pruning whole filters instead of select weights, as sketched below.
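A minimal sketch of the filter-level pruning idea from Li et al. [14] (our own illustration, not an implementation from either paper): rank each filter of a convolutional layer by the L1 norm of its weights and keep only the strongest. The shapes and keep ratio are illustrative.

```python
import numpy as np

def filters_to_keep(conv_weights, keep_ratio=0.75):
    """Rank the filters of one conv layer by L1 norm and return the indices
    of those to keep.  conv_weights: (num_filters, in_channels, k, k)."""
    l1 = np.abs(conv_weights).reshape(conv_weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(keep_ratio * conv_weights.shape[0]))
    return np.sort(np.argsort(l1)[-n_keep:])

# Example: prune a quarter of a 128-filter 3x3 layer with 64 input channels.
w = np.random.randn(128, 64, 3, 3)
print(len(filters_to_keep(w)))  # 96 filters survive
```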

IV. RESULTS

There were 18 different trials attempted during the experimentation phase. Figure 4 shows the mAP and the FPS for each of these trials and for Tiny-YOLOv2.

Fig. 4. Comparison of trials attempted while developing YOLO-LITE.

While all the development for YOLO-LITE was done on PASCAL VOC, the best trial was also run on COCO. Table V shows the top results achieved on both datasets.

TABLE V
RESULTS FROM TRIAL 3 NO BATCH NORMALIZATION (NB)

Dataset    | mAP    | FPS
PASCAL VOC | 33.77% | 21
COCO       | 12.26% | 21

A. Architecture

Table VI and Table VII show the architectures of Tiny-YOLOv2 and the best performing trial of YOLO-LITE, Trial 3-no batch. Tiny-YOLOv2 is composed of 9 convolutional layers, a total of 3,181 filters, and 6.97 billion FLOPS.

In contrast, Trial 3-no batch of YOLO-LITE consists of only 7 layers with a total of 749 filters. Comparing the FLOPS of the two models, Tiny-YOLOv2 requires about 14× more FLOPS than YOLO-LITE Trial 3-no batch. A lighter model containing a reduced number of layers enables the faster performance of YOLO-LITE.
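As a rough illustration of how such FLOPS figures arise (our own back-of-the-envelope helper, not the count used by Darknet, which may follow a different convention), a convolutional layer costs roughly one multiply-add per kernel weight per output element:

```python
def conv_flops(out_h, out_w, in_c, out_c, k, ops_per_mac=2):
    """Approximate FLOPs of one k x k convolution: one multiply-add per kernel
    weight per output element, counted as 2 FLOPs by default."""
    return ops_per_mac * out_h * out_w * in_c * out_c * k * k

# Example: layer C6 of Trial 3 (Table VII), assuming a 7x7 feature map at that depth.
print(conv_flops(7, 7, 128, 256, 3) / 1e6, "MFLOPs")  # ~28.9
```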

V. COMPARISON WITH OTHER FAST OBJECT DETECTION NETWORKS

When it comes to real-time object detection algorithms for non-GPU devices, the competition for YOLO-LITE is pretty slim. YOLO's tiny architecture, which was the starting point for YOLO-LITE, is among the quickest object detection algorithms. While it is much quicker than the bigger YOLO architecture, it is hardly real-time on non-GPU computers (∼2.4 FPS).

Google has an object detection API that has a model zoo with several lightweight architectures [15]. The most impressive was SSD MobileNet V1. This architecture clocks in at 5.8 FPS on a non-GPU laptop with a mAP of 21%.

MobileNet [16] uses depthwise separable convolutions, as opposed to YOLO's method, to lighten a model for real-time object detection.


TABLE VI
TINY-YOLOV2-VOC ARCHITECTURE

Layer         | Filters | Size | Stride
Conv1 (C1)    | 16      | 3×3  | 1
Max Pool (MP) |         | 2×2  | 2
C2            | 32      | 3×3  | 1
MP            |         | 2×2  | 2
C3            | 64      | 3×3  | 1
MP            |         | 2×2  | 2
C4            | 128     | 3×3  | 1
MP            |         | 2×2  | 2
C5            | 256     | 3×3  | 1
MP            |         | 2×2  | 2
C6            | 512     | 3×3  | 1
MP            |         | 2×2  | 2
C7            | 1024    | 3×3  | 1
C8            | 1024    | 3×3  | 1
C9            | 125     | 1×1  | 1

TABLE VII
YOLO-LITE: TRIAL 3 ARCHITECTURE

Layer | Filters | Size | Stride
C1    | 16      | 3×3  | 1
MP    |         | 2×2  | 2
C2    | 32      | 3×3  | 1
MP    |         | 2×2  | 2
C3    | 64      | 3×3  | 1
MP    |         | 2×2  | 2
C4    | 128     | 3×3  | 1
MP    |         | 2×2  | 2
C5    | 128     | 3×3  | 1
MP    |         | 2×2  | 2
C6    | 256     | 3×3  | 1
C7    | 125     | 1×1  | 1
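For reference, a tf.keras rendering of Table VII might look like the sketch below. This is our own approximation, not the released Darknet cfg: the 224×224 input resolution is taken from the batch-normalization discussion above, and the leaky ReLU activations are assumed to match Tiny-YOLOv2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def yolo_lite_trial3(input_shape=(224, 224, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    # C1-C5, each followed by 2x2 max pooling with stride 2 (Table VII).
    for filters in (16, 32, 64, 128, 128):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.1)(x)          # assumed activation
        x = layers.MaxPooling2D(2, 2)(x)
    # C6: 256 3x3 filters, no pooling.
    x = layers.Conv2D(256, 3, padding="same")(x)
    x = layers.LeakyReLU(0.1)(x)
    # C7: 125 1x1 filters = 5 boxes x (4 coords + 1 confidence + 20 VOC classes).
    outputs = layers.Conv2D(125, 1, padding="same")(x)
    return tf.keras.Model(inputs, outputs)

yolo_lite_trial3().summary()
```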

The idea of depthwise separable convolutions combines depthwise convolution and pointwise convolution: depthwise convolution applies one filter to each channel, and pointwise convolution then applies a 1×1 convolution across channels [16]. This technique aims to lighten a model while maintaining the same amount of information learned in each convolution. The use of depthwise convolutions in MobileNet potentially explains the higher mAP of SSD MobileNet on COCO.
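A quick way to see the saving (our own illustration with arbitrary layer dimensions, not figures from MobileNet or this paper) is to count multiply-adds for a standard convolution versus a depthwise convolution followed by a 1×1 pointwise convolution:

```python
def standard_conv_macs(h, w, in_c, out_c, k):
    """Multiply-adds for a standard k x k convolution."""
    return h * w * in_c * out_c * k * k

def depthwise_separable_macs(h, w, in_c, out_c, k):
    """Depthwise k x k conv (one filter per input channel) plus a 1x1 pointwise conv."""
    return h * w * in_c * k * k + h * w * in_c * out_c

h, w, in_c, out_c, k = 112, 112, 32, 64, 3    # arbitrary example layer
print(standard_conv_macs(h, w, in_c, out_c, k) /
      depthwise_separable_macs(h, w, in_c, out_c, k))  # roughly 8x fewer multiply-adds
```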

TABLE VIII
COMPARISON OF STATE OF THE ART ON COCO DATASET

Model            | mAP    | FPS
Tiny-YOLOv2      | 23.7%  | 2.4
SSD MobileNet V1 | 21%    | 5.8
YOLO-LITE        | 12.26% | 21

Table VIII shows how YOLO-LITE compares. YOLO-LITE is 3.6× faster than SSD MobileNet V1 and 8.8× faster than Tiny-YOLOv2.

A. Web Implementation

After successfully training models for both VOC and COCO, the architectures along with their respective weights files were converted and implemented as a web-based model², also accessible from a cellphone. Although YOLO-LITE runs at about 21 FPS locally on a Dell XPS 13 laptop, once pushed onto the website the model runs at around 10 FPS. The FPS may differ depending on the device.

VI. CONCLUSION

YOLO-LITE achieved its goal of bringing object detection to non-GPU computers. In addition, YOLO-LITE offers several contributions to the field of object detection. First, YOLO-LITE shows that shallow networks have immense potential for lightweight real-time object detection. Running at 21 FPS on a non-GPU computer is very promising for such a small model. Second, YOLO-LITE shows that the use of batch normalization should be questioned when it comes to small, shallow networks. Progress in lightweight real-time object detection is the last frontier in making object detection usable in everyday applications.

VII. FUTURE WORK

Future work may include techniques to increase the mAP for both the COCO and VOC models.

While the FPS of YOLO-LITE is at the level necessary for real-time use on non-GPU computers, the accuracy needs improvement in order for it to be a viable model. As mentioned in [17], [12], pre-training the network to classify on ImageNet has had good results when transferring the network back to object detection.

Redmon et al. [1] use R-CNN in combination with YOLO to increase mAP. As previously mentioned, R-CNN finds bounding boxes and classifies each bounding box separately. Once classification is complete, post-processing is used to clean up localization errors [1]. Although less efficient, an improved version of R-CNN (Fast R-CNN) yields a higher accuracy of 66.9% while YOLO achieves 63.4%. When combining R-CNN and YOLO, the model achieves 75% mAP trained on PASCAL VOC.

Another possible improvement is to adopt the feature in YOLOv3 where predictions are made at multiple scales. This has helped improve the mAP in YOLOv3 [12] and could potentially help improve the mAP of YOLO-LITE.

Filter pruning as set out by Li et al. [14] is another possible improvement. While standard weight pruning did not help YOLO-LITE, pruning out whole filters can potentially make the network leaner and allow for a more guided training process to learn better weights. While it is not clear whether this will greatly improve the mAP, it can potentially speed up the network and will also make the overall size of the network in memory smaller.

²https://reu2018dl.github.io/


A final improvement can come from ShuffleNet. ShuffleNet uses group convolution and channel shuffling in order to decrease computation while maintaining the same amount of information during training [6]. Implementing group convolution in YOLO-LITE may improve mAP.
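A minimal sketch of the channel-shuffle operation (our own NumPy illustration of the idea in [6], not ShuffleNet's implementation): after a group convolution, channels are interleaved so that information can flow between groups.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave the channels of a (height, width, channels) feature map."""
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)   # split channels into groups
    x = x.transpose(0, 1, 3, 2)                # swap the group and within-group axes
    return x.reshape(h, w, c)                  # flatten back to interleaved channels

x = np.arange(2 * 2 * 6).reshape(2, 2, 6)
print(channel_shuffle(x, groups=2)[0, 0])      # [0 3 1 4 2 5]: channels from both groups mixed
```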

VIII. RELEVANT LINKS

More information on the web implementation of YOLO-LITE can be found at https://reu2018dl.github.io/. For the cfg and weights files from training on PASCAL VOC and COCO, visit https://github.com/reu2018dl/yolo-lite.

IX. ACKNOWLEDGEMENTS

This research was done through the National Science Foundation under DMS Grant Number 1659288. A special thanks to Dr. Cuixian Chen, Dr. Yishi Wang, and Summerlin Thai Thompson for their support.

REFERENCES

[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013. [Online]. Available: http://arxiv.org/abs/1311.2524
[3] R. B. Girshick, "Fast R-CNN," CoRR, vol. abs/1504.08083, 2015. [Online]. Available: http://arxiv.org/abs/1504.08083
[4] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015. [Online]. Available: http://arxiv.org/abs/1506.01497
[5] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2017.
[6] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint arXiv:1707.01083, 2017.
[7] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[9] PASCAL, "The PASCAL visual object classes homepage," http://host.robots.ox.ac.uk/pascal/VOC/index.html, last accessed 2018-07-18.
[10] "COCO dataset," http://cocodataset.org/#home, accessed 2018-07-23.
[11] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[12] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[13] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," arXiv preprint arXiv:1608.08710, 2016.
[15] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in IEEE CVPR, vol. 4, 2017.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[17] J. Redmon, "YOLO: Real-time object detection," https://pjreddie.com/darknet/yolo/, last accessed 2018-06-24.