
Real-time Object Detection
CS 229 Course Project

Zibo Gong, Tianchang He, and Ziyi Yang

Department of Electrical Engineering, Stanford University

December 17, 2016

Abstract

Object detection is a key problem in computer vision. We report our work on object detection using neural networks and other computer vision features. We use the Faster Region-based Convolutional Neural Network (Faster R-CNN) method for detection, then match the detected object using features from both the neural network and conventional features such as histograms of oriented gradients (HoG). We achieve real-time performance and satisfactory matching results.

1 Introduction

Object detection is a challenging and exciting task in computer vision. Detection can be difficult since there are all kinds of variations in orientation, lighting, background and occlusion that can result in completely different images of the very same object. With the advances of deep learning and neural networks, we can finally tackle such problems in real time without hand-crafting various heuristics.

We installed and trained the Faster R-CNN model [3] on the Caffe deep learning framework. Faster R-CNN is a region-based detection method: we first used a region proposal network (RPN) to generate detection proposals, then employed the same network structure as Fast R-CNN to classify the object and refine the bounding box. Furthermore, we extracted features from the detected objects using an algorithm we developed, and finally matched each detected object with those stored in a database.

2 Related Work

In 2012, Krizhevsky et al. [1] trained a deep convolutional neural network to classify the images of the LSVRC-2010 ImageNet contest into 1,000 classes with much better precision than previous work, which marked the beginning of the widespread use of deep learning in computer vision. In 2014, Jia et al. [2] created a clean and modifiable deep learning framework: Caffe.

In 2015, Ren et al. [3] proposed Faster R-CNN with its Region Proposal Network. The network shares convolutional features across the whole image, which makes region proposals nearly cost-free. Besides Faster R-CNN, there are many other approaches to improving CNN performance. More recent approaches such as YOLO [4] and SSD [5] feed the whole image directly into the neural network and obtain the predicted boxes with scores, further reducing running time compared to Faster R-CNN. You Only Look Once (YOLO) applies a single neural network to the full image; the network divides the image into regions and predicts bounding boxes and probabilities for each region. The Single Shot MultiBox Detector (SSD) discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.

3 Dataset and Features

For general object detection, we used the SUN2012 database from the MIT CSAIL lab, which contains images with objects labeled by polygons, as shown in Fig. 1.


Figure 1: An example from the SUN2012 database.

Figure 2: Illustration of the Faster R-CNN training process.

For each object category, we extracted the bounding boxes for object detection and subsequent classification tasks. For training, we used images from three categories to build up a database, and we split the whole dataset into a training set (90%) and a test set (10%). We also tested detecting automobiles from different perspectives. The car images come from the KITTI database, which contains about 14,000 images; we chose fully visible and partly occluded automobiles for training and testing. The KITTI database also provides the observation angle of each object, which we used to label the perspective (front, back, left or right) of each automobile.
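To make the data preparation concrete, here is a minimal sketch of the 90/10 split (the helper and the file names are illustrative, not our actual script):

```python
import random

def split_dataset(image_paths, train_frac=0.9, seed=0):
    """Shuffle image paths and split them into training and test sets."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = int(train_frac * len(paths))
    return paths[:n_train], paths[n_train:]

# Hypothetical usage with KITTI-style file names:
# train_set, test_set = split_dataset(["000001.png", "000002.png"])
```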

4 Methods

4.1 Faster Region-based Convolutional Neural Network

We trained the Faster Region-based Convolutional Neural Network (Faster R-CNN) model on the Caffe deep learning framework in Python. Faster R-CNN is a region-based detection method: it first uses a region proposal network (RPN) to generate detection proposals, then uses the same network structure as Fast R-CNN to classify the object and refine the bounding box. The strategy is visualized in Fig. 2.
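At inference time, detection with the trained model looks roughly like the following (a sketch assuming the public py-faster-rcnn codebase; the prototxt and caffemodel paths are placeholders):

```python
import caffe
import cv2
from fast_rcnn.test import im_detect  # from the py-faster-rcnn codebase

# Placeholder paths for a trained VGG16 Faster R-CNN model.
net = caffe.Net('faster_rcnn_test.prototxt',
                'vgg16_faster_rcnn.caffemodel', caffe.TEST)

im = cv2.imread('frame.jpg')        # BGR image, as Caffe/OpenCV expect
scores, boxes = im_detect(net, im)  # RPN proposals + Fast R-CNN scoring
# scores: (num_proposals, num_classes); boxes: (num_proposals, 4*num_classes)
```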

The training strategy of this model was as follows. First, we trained the RPN end-to-end by back-propagation and stochastic gradient descent, choosing the ZF and VGG16 networks to extract features for the RPN; all layers were initialized from a model pre-trained for ImageNet classification. We used a learning rate of 0.0001 for 40k mini-batches, a momentum of 0.9 and a weight decay of 0.0005. Second, we trained a separate detection network with Fast R-CNN for 20k mini-batches, using the proposals generated by the RPN from the first step; this detection network was also initialized from a pre-trained model, and at this point the two networks did not share convolutional layers. Third, we used the detection network to initialize RPN training for another 40k mini-batches, keeping the shared convolutional layers fixed. Finally, with the shared convolutional layers still fixed, we fine-tuned the unique layers of Fast R-CNN for 20k mini-batches.
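The optimization settings above translate into a Caffe solver configuration roughly like this (a sketch using the standard Caffe Python bindings; only the hyperparameters come from our training, the rest are assumptions):

```python
from caffe.proto import caffe_pb2

# Solver settings for the first RPN training stage.
solver = caffe_pb2.SolverParameter()
solver.base_lr = 0.0001                # learning rate
solver.momentum = 0.9
solver.weight_decay = 0.0005
solver.lr_policy = 'fixed'             # assumption: the real schedule may differ
solver.max_iter = 40000                # 40k mini-batches
solver.snapshot_prefix = 'rpn_stage1'  # placeholder name

with open('rpn_solver.prototxt', 'w') as f:
    f.write(str(solver))  # protobuf text format, readable by Caffe
```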


4.2 Conventional CV Features

a) Color Histograms

After finding the bounding box that outlines an object, we construct three histograms of the R, G and B values of all pixels within the box. The color of the object is found by combining the most frequent values: the R, G and B components of the object's color are the peaks of the respective channel histograms. A clustering algorithm is usually applied to find the color, but the color histogram is much more computationally efficient. Moreover, since the image has already been segmented by Faster R-CNN, the color histogram achieves good performance at a low computational expense.
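A minimal sketch of this color estimate (our own illustration of the method, not the exact project code):

```python
import numpy as np

def dominant_color(image, box):
    """Estimate an object's color as the peak of each RGB channel histogram.

    image: HxWx3 uint8 RGB array; box: (x1, y1, x2, y2) bounding box.
    """
    x1, y1, x2, y2 = box
    patch = image[y1:y2, x1:x2]
    color = []
    for c in range(3):  # R, G, B channels
        hist, _ = np.histogram(patch[..., c], bins=256, range=(0, 256))
        color.append(int(np.argmax(hist)))  # peak bin = most frequent value
    return tuple(color)
```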

b) Histogram of Oriented Gradients (HoG)

Histogram of Oriented Gradients (HoG) is a feature descriptor for object detection. It builds a histogram over localized portions of an image: the histogram is binned by gradient orientation (usually 8 directions), and each bin accumulates gradient magnitude. In our project, the image of an object was typically divided into 16 cells (4 by 4) and a histogram was computed for each cell, giving a HoG feature vector of total length 16 × 8 = 128.
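The 128-dimensional descriptor can be reproduced with scikit-image (a sketch; the 64×64 working size is an assumption chosen so that 16×16-pixel cells yield a 4×4 grid):

```python
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

def hog_descriptor(patch):
    """Compute the HoG vector above: 4x4 cells x 8 orientation bins."""
    gray = resize(rgb2gray(patch), (64, 64))  # assumed working size
    return hog(gray, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(1, 1))        # length 16 * 8 = 128
```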

c) Scale Invariant Feature Transform (SIFT) Descriptor

The SIFT descriptor is also built from histograms of gradients; however, it has the advantage of being invariant to rotation, translation and resizing. We used the SIFT descriptor to extract keypoints from the image. We first applied Gaussian filters of different scales and took their differences; the keypoints are the local extrema across the layers of this Difference of Gaussians (DoG). After determining a keypoint's location and patch size, the dominant gradient direction is found and the patch is rotated so that its dominant direction is aligned vertically. Finally, the same procedure as for HoG is carried out to assemble a feature vector of length 128.
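In practice the whole DoG-plus-descriptor procedure is available off the shelf, e.g. in OpenCV (a sketch; on OpenCV 3.x use cv2.xfeatures2d.SIFT_create() instead):

```python
import cv2

def sift_descriptors(patch_bgr):
    """Detect DoG keypoints and compute 128-dim SIFT descriptors."""
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()  # OpenCV >= 4.4
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (num_keypoints, 128)
```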

4.3 Matching Algorithm

In our project, we used the k-Nearest Neighbor (kNN) algorithm to match the extracted features against the database. kNN is a simple non-parametric method: it compares the incoming example with every example in the database and outputs the stored example with the smallest distance. Note that the distance function can be defined in multiple ways. The simplest is the Euclidean distance between feature vectors. Another definition that is useful here is the cosine distance, defined by

d(x_i, x_j) = 1 - \frac{x_i^T x_j}{|x_i|\,|x_j|}.    (1)

In our project, we tried both distance definitions and used the one that performed better, which turned out to be the Euclidean distance (see Section 5.3).
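A minimal sketch of the matching step supporting both distances (pure NumPy; the feature vectors are assumed to be extracted upstream):

```python
import numpy as np

def knn_match(query, database, k=10, metric='euclidean'):
    """Return indices of the k database feature vectors closest to `query`."""
    if metric == 'euclidean':
        d = np.linalg.norm(database - query, axis=1)
    else:  # cosine distance, Eq. (1)
        norms = np.linalg.norm(database, axis=1) * np.linalg.norm(query)
        d = 1.0 - database.dot(query) / norms
    return np.argsort(d)[:k]
```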

5 Results & Discussion

5.1 Object Detection

We successfully detected backpacks (Fig. 3(a)), towels (Fig. 3(b)), clocks, bottles, and automobiles from different perspectives (front, back, left and right) (Fig. 3(c)). The automobile detection reaches high precision, as shown in Table 1:

Perspective   Average Precision
Front         81.57%
Back          86.40%
Left          87.56%
Right         79.31%

Table 1: Average precision of detecting cars from different perspectives.


Figure 3: Object Detection.

In the poster session, we presented a demo of automobile detection. We took a video at a parking lot and detected automobiles in every frame; each detected automobile is bounded with a box, as shown in the video, and the detection runs in real time.

5.2 Color Detection

We successfully determined the color of each detected object (more precisely, the color of the object within its bounding box) using the algorithm described above. The result is shown in Fig. 4: Fig. 4(a) shows the detected color and Fig. 4(b) shows the object. To the naked eye, the two colors look very alike.

Figure 4: Color Detection.

5.3 Object Matching

First, we picked the better distance definition for kNN. We tested the two distance definitions on a set of 400 test images of cars with k = 10, using RGB color and HoG features. The results are shown in Table 2. Since Euclidean distance gives the better result, we used it in all further experiments.

We then tried different combinations of features, tested on 1,000 images of cars. The results are shown in Table 3. The fully connected layer outperforms the HoG features. We also found that using color as a supplementary feature reduces the matching scope, improving both running time and matching precision. An example matching result is shown in Fig. 5. In the poster session, we also showed in the video that, after detecting an automobile, we could successfully match it with a car in our database (a RAV4 in our demo) and then look up the corresponding model and color on car.com.

Distance Definition   Precision
Euclidean             70.5%
Cosine                66.3%

Table 2: Precision for different definitions of distance.


Feature Combination             Average Precision
HoG                             70.7%
HoG + Color                     73.2%
Fully Connected Layer           77.6%
Fully Connected Layer + Color   78.1%

Table 3: Average precision of different feature combinations.

5.4 Pipeline

The pipeline of our project is as follows: the input image/video is fed into the trained neural network, which outputs the object classification and bounding box. The features of the object (including the fully connected layer activations from the CNN and the conventional CV features) are then extracted and compared against the database. The final output is the closest match from the database.
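Putting the pieces together, the end-to-end flow looks roughly like this (a sketch reusing the hypothetical helpers from the earlier snippets; color-space conversions are elided for brevity):

```python
import numpy as np

def match_object(image, net, feature_db):
    """Detect the highest-scoring object and return its nearest database match.

    Assumes the im_detect, hog_descriptor and knn_match helpers sketched above.
    """
    scores, boxes = im_detect(net, image)  # detection (Sec. 4.1)
    fg = scores[:, 1:]                     # drop the background class
    i, cls = np.unravel_index(np.argmax(fg), fg.shape)
    cls += 1                               # shift back to full class index
    x1, y1, x2, y2 = boxes[i, 4 * cls:4 * cls + 4].astype(int)
    patch = image[y1:y2, x1:x2]            # crop the detected object
    feats = hog_descriptor(patch)          # feature extraction (Sec. 4.2)
    return knn_match(feats, feature_db, k=1)  # nearest match (Sec. 4.3)
```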

Such a pipeline is not only able to tackle the general task of object detection and recognition; it could also be used by companies such as online shopping sites. Such companies have the manpower and computational resources to build it into a mobile application that takes real-time images or video and matches them on a server, and they can further benefit from their huge databases of objects.

6 Conclusion & Future Work

In conclusion, we trained Faster R-CNN to detect objects in real time. We then extracted features, for instance color, HoG, SIFT descriptors, and the output of the last layer of the Faster R-CNN network. Finally, we compared each detected object with those in our database and determined the best match based on the extracted features.

For future work, we could conduct more thorough tests of our matching algorithm using a larger dataset and more rigorous diagnostics, for example plotting the error rate against the size of the dataset or the confusion matrix of the test set.

We tried feature combinations such as the fully connected layer, RGB color and HoG, but other combinations might be useful; viable candidates include rectangle features and features from other layers of the neural network. Moreover, if we confine our attention to kNN, there are many other distance functions to try: the Manhattan distance, histogram intersection distance and Chebyshev distance can all be implemented with ease (see the sketch below). Matching quality can also be improved by better detection. We recognized that the observation angle is very important for object matching, so it is essential to perform "car face" alignment; algorithms similar to those used for face recognition could be applied in our project.
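For reference, the alternative distances mentioned above are straightforward in NumPy (a sketch; x and y are feature vectors of equal length):

```python
import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    return np.max(np.abs(x - y))

def histogram_intersection_distance(x, y):
    # For non-negative histograms normalized to sum to 1, the intersection
    # sum(min(x, y)) is a similarity in [0, 1]; one minus it acts as a distance.
    return 1.0 - np.sum(np.minimum(x, y))
```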

Figure 5: An example of a matching result.


References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[2] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," arXiv preprint arXiv:1506.02640, 2015.

[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, "SSD: Single shot multibox detector," arXiv preprint arXiv:1512.02325, 2015.
