
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS

Simultaneously Detecting and Counting Dense Vehicles from Drone Images

Wei Li, Hongliang Li, Senior Member, IEEE, Qingbo Wu, Member, IEEE, Xiaoyu Chen, and King Ngi Ngan, Fellow, IEEE

Abstract—Unmanned Aerial Vehicles (UAVs) are an essential component in the realization of Industry 4.0. With drones helping to improve industrial safety and efficiency in utilities, construction and communication, there is an urgent need for drone-based intelligent applications. In this paper, we develop a unified framework to simultaneously detect and count vehicles from drone images. We first explore why state-of-the-art detectors fail in highly dense drone scenes, which provides useful insights for our design. Then, we propose an effective loss, specifically designed for scale-adaptive anchor generation, that pushes the anchors towards matching the ground-truth boxes as closely as possible. Inspired by attention mechanisms in the human visual system, we maximize the mutual information between object classes and features by combining bottom-up cues with top-down attention, specifically designed for feature extraction. Finally, we build a counting layer with a regularized constraint related to the number of vehicles. Extensive experiments demonstrate the effectiveness of our approach. For both tasks, the proposed method achieves state-of-the-art results on all four challenging datasets; in particular, it reduces error by a larger factor than previous methods.

Index Terms—Convolutional neural networks, deep learning, intelligent vehicles, object detection, object recognition, unmanned aerial vehicles, vehicle detection.

I. INTRODUCTION

DUE to their advantages of mobility and rapidity, drone smart solutions are widely used and integrated into industrial automation, such as utilities, construction, intelligent transportation and emergency response [1]–[3]. Drone-based object detection and counting have become increasingly important for intelligent applications, such as tracking criminals [4], abnormal event detection [5], and scene understanding [6]–[9]. In this paper, we focus on simultaneously locating vehicles and predicting their number. As discussed in [10], the greatest challenge is that the objects are too dense and small. For instance, the CARPK dataset contains nearly 90K car annotations in 1448 images: each 1280×720 image has approximately 60 targets, and the maximum number of cars is 188. The denseness significantly increases the difficulty of vehicle localization.

Manuscript received Month xx, 2xxx; revised Month xx, xxxx; accepted Month x, xxxx. This work was supported in part by the National Natural Science Foundation of China under Grant 61831005, Grant 61525102, Grant 61601102, and Grant 61871078. (Corresponding author: Hongliang Li.)

The authors are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

There are very few studies combining the two tasks, due to the lack of bounding-box annotations. For the counting task, most early methods focus on counting objects of known location in a static environment with fixed cameras [11], [12]. These methods are inflexible and cannot be directly applied to unconstrained drone video [10]. Deep learning-based counting methods avoid these problems and learn a nonlinear mapping that directly outputs density maps [13]–[15]. Kang et al. compared density maps on several crowd analysis tasks and showed that density maps of different resolutions achieve different performance [15]. However, many counting methods cannot generate precise bounding boxes for the objects of interest [10]. For the detection task, there are two popular families of frameworks: two-stage and single-stage. The two-stage framework, represented by R-CNN [16] and its variants [17]–[19], extracts proposals first, followed by their classification and regression. The single-stage framework, such as YOLO [20]–[22] and SSD [23], [24], applies object classifiers and bounding-box regressors in an end-to-end manner without extracting proposals. Most state-of-the-art methods focus on detecting general objects in natural images, where most targets are sparsely distributed and few in number. Due to the intrinsic differences between UAV images and natural images, traditional CNN-based methods tend to miss many densely distributed small objects.

To overcome this difficulty, Hsieh et al. introduced a spatially regularized loss combined with a Faster-RCNN framework, named LPN, to simultaneously count and localize objects [10]. LPN learns the general adjacency relationship between object proposals and encourages region proposals to be placed in correct locations. This strategy is motivated by the observation that groups of object instances often follow certain layout patterns; for example, cars are often parked in a row or column. The assumption suits environments of static objects (e.g., parking lots). However, when the objects are moving or arranged in irregular positions, LPN is unsuitable. This approach can be viewed as constructing a regularized term from contextual spatial information. In this paper, we focus on exploring a more general approach to simultaneously detect and count vehicles in unconstrained drone environments.



Fig. 1. Overview of our approach. We propose a unified fully convolutional neural network to localize and count vehicles in drone-based images. It consists of three parts: a base network, a feature extractor, and a vehicle detector. We introduce a scale-adaptive strategy to generate suitable anchor boxes. A circular flow is embedded in the feature extractor by combining bottom-up cues with top-down attention. Finally, we build the counting layer and introduce a counting regularized term into the original loss.

Rethinking classic detection networks, all of them, whether two-stage or single-stage, can be divided into three key steps: (1) defining anchors for generating candidate boxes; (2) extracting features of object proposals; and (3) defining classifiers for class scores and regressors for bounding boxes. For the first step, Hsieh et al. chose default box sizes approximately four times smaller than the default setting of Faster-RCNN and showed that adjusting the anchor box sizes is effective for detection and counting in dense scenes. This is because vehicles in drone images are small and have a more fixed size than general objects. However, the empirical setting of anchor sizes is still suboptimal, and we find that their scales cannot match the size of the targets well, as shown in Fig. 3(a). To solve this problem, we propose a simple yet effective loss function that pushes the anchors towards matching the ground-truth boxes with an adaptive scale. The loss is based on the expected squared error and encourages the squared distance between anchors and their matched boxes to be as small as possible. In addition, to handle small objects, Hsieh et al. [10] selected the conv4-3 layer instead of the conv5-3 layer to extract features in the second step. This empirical operation cannot extract integrated object information, especially for dense small objects. Therefore, we start by maximizing the mutual information between classes and features. By analyzing the impact of bottom-up and top-down attention, we find that the top-down attention mechanism provides stronger semantic information as explicit supervision for detecting dense and small objects. In this step, a circular flow is embedded in the feature extractor by combining bottom-up cues with top-down attention. For the third step, in classic detection frameworks [17]–[21], [23], the softmax loss is employed for object classification and a bounding-box regression loss (e.g., Smooth-L1) is employed for object localization. For the detection and counting task, we design a counting regularized term that leverages the number of objects, combined with object detection, in a unified network. Specifically, we first build a new counting layer and then introduce a counting regularized constraint into our loss function to further improve the performance.

To evaluate the effectiveness and reliability of our approach, we conduct experiments on the CARPK, PUCPR+, VisDrone2018-car and UAVDT datasets. CARPK, VisDrone2018-car and UAVDT are currently the most challenging large-scale drone-based datasets. PUCPR+ is built from a subset of the PKLot dataset, where the scene is close to the aerial view. All four datasets provide bounding-box annotations that support counting. In our experiments, we evaluate the performance of both detection and counting. Overall, this paper proposes an efficient framework for simultaneously detecting and counting vehicles in drone-based scenes. The main contributions can be summarized as follows.

- First, we analyze the limitation of anchor settings in existing methods for small-scale objects in drone images. A novel loss is designed for generating scale-adaptive anchors.

- Second, starting from maximizing the mutual information, we analyze the impact of top-down attention and apply the circular flow to guide feature extraction.

- Third, an effective counting regularized constraint is introduced into the object detection task to further improve the performance.

- Finally, our model obtains impressive results, significantly advancing the most recent results on these four challenging datasets.


The remainder of the paper is organized as follows. We elaborate on our method in Section II, followed by the experimental evaluation in Section III. Section IV summarizes the proposed method and future work.

II. OUR APPROACH

We focus on simultaneously detecting and counting vehicles in drone-based images by learning a unified CNN framework. The main challenge compared with natural image tasks is that the objects are small and dense. Our overall approach is illustrated in Fig. 1. To provide insights into the small and dense vehicle detection problem, we analyze the weaknesses of existing methods. Specifically, we study the influence of anchor selection and feature extraction for small-size objects and propose corresponding solutions. Finally, we build the counting layer after the feature extraction layer and propose the counting regularization constraint for learning our network to improve performance.

A. Scale-adaptive Anchor Generation

Drone images show more specific scale changes than natural images. Because the images are taken from a high altitude, the objects tend to be dense and small. In the RPN [18], Ren et al. chose default box sizes of (128×128, 256×256, 512×512) for generic objects. Hsieh et al. [10] chose default anchors approximately four times smaller than the default setting of RPN and showed that adjusting the anchor box sizes is effective for counting in dense scenes. These findings make us wonder: do the empirical settings of the anchor sizes match the scales of drone objects well?

(a) VOC-car   (b) Drone-car

Fig. 2. The scale distribution of objects in natural images and drone images. The blue dots indicate the normalized ground-truth boxes, and the red dots indicate anchor boxes generated by RPN [18].

First, we qualitatively analyze the scale changes of the bounding boxes between natural images and drone images. As shown in Fig. 2, vehicles in natural images span large scales, while they are smaller than 50×50 in drone images. Intuitively, the scales of anchors generated by Faster-RCNN are not at all compatible with drone images. Then, we quantitatively analyze the maximum overlap between the bounding boxes. Given the anchor boxes A, the max overlap o_i between each ground-truth bounding box Q_i and the generated candidate boxes P = {P_t, t = 1, ..., T} can be calculated as:

o_i = \max_t \mathrm{IoU}(Q_i, P_t) = \max_t \frac{|Q_i \cap P_t|}{|Q_i \cup P_t|}    (1)

(a) RPN-small   (b) Our Method

Fig. 3. The distribution in drone images between the ground-truth bounding boxes and the matched anchor boxes. The first row shows the qualitative results of the scale distribution. The second row shows the quantitative results of the max-overlap distribution. (a) is RPN-small [10]; (b) is our method.

where Q_i denotes the i-th ground-truth bounding box. We report the proportion of max overlaps over the IoU range from 0 to 1 using histogram statistics. Specifically, we bin the set O = {o_1, o_2, ..., o_i, ..., o_n} into 10 containers and return the number of elements in each container, where n is the number of all ground-truth bounding boxes, as shown in the second row of Fig. 3. Interestingly, we find that for RPN-small [10], only 38.7% of O is greater than 0.5 and 11.1% of O is greater than 0.7. This result implies that the anchor boxes generated by RPN and RPN-small have a large scale bias with respect to the ground-truth boxes. The large scale bias leads to inaccurate features of the object regions and increases the difficulty of learning the coordinates of the bounding boxes.
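As a concrete illustration of Eq. (1) and the histogram statistics above, the following Python sketch computes the max overlap o_i for every ground-truth box and bins the values into 10 containers. It assumes axis-aligned boxes in (x1, y1, x2, y2) format; the function names are ours and not part of the paper.

```python
import numpy as np

def iou_matrix(gt, anchors):
    """Pairwise IoU between ground-truth boxes and candidate boxes,
    both given as (N, 4) arrays in (x1, y1, x2, y2) format."""
    x1 = np.maximum(gt[:, None, 0], anchors[None, :, 0])
    y1 = np.maximum(gt[:, None, 1], anchors[None, :, 1])
    x2 = np.minimum(gt[:, None, 2], anchors[None, :, 2])
    y2 = np.minimum(gt[:, None, 3], anchors[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    area_an = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    union = area_gt[:, None] + area_an[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def max_overlap_histogram(gt, anchors, bins=10):
    """o_i = max_t IoU(Q_i, P_t) for every ground-truth box, binned into
    `bins` equal-width containers over [0, 1], as in Fig. 3."""
    o = iou_matrix(gt, anchors).max(axis=1)
    counts, edges = np.histogram(o, bins=bins, range=(0.0, 1.0))
    return o, counts, edges
```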

The above analysis motivates us to explore a suitable strategy for generating the anchor boxes. A good generic anchor generator should achieve high scale matching with the ground-truth boxes. We formulate a loss that minimizes the expected squared error for measuring the matching degree. We assume that A = {A_1, ..., A_k, ..., A_K} are the anchor boxes with a total number of K, where A_k denotes the k-th anchor. S_k denotes the ground-truth boxes matched to A_k. Given S_k, the best-matched ground-truth box \bar{B} is the conditional expectation E[B | B ∈ S_k], where B denotes the random variable of a bounding box. Thus, the loss L_mat can be computed as follows:

L_{mat} = E[(B - \bar{B})^2]
        = E[(B - E[B \mid B \in S_k])^2]
        = E\big[\, E\{(B - E[B \mid B \in S_k])^2 \mid B \in S_k\} \,\big]
        = E[\mathrm{Var}(B \mid B \in S_k)]
        = \sum_{k=1}^{K} P(B \in S_k)\, \mathrm{Var}[B \mid B \in S_k]    (2)


where P(B ∈ S_k) represents the probability that B belongs to S_k, and Var(B | B ∈ S_k) is the variance of the random variable B conditional on B ∈ S_k. Let n_k be the number of boxes in S_k and n be the total number of boxes B; then,

P(B \in S_k) = \frac{n_k}{n}    (3)

\mathrm{Var}(B \mid B \in S_k) = E[(B - E[B])^2 \mid B \in S_k]
                               = E[(B - A_k)^2 \mid B \in S_k]
                               = \frac{1}{n_k} \sum_{b \in S_k} (b - A_k)^2    (4)

where b is a bounding box in S_k. The L_mat in Eq. (2) can be rewritten as

L_{mat} = \frac{1}{n} \sum_{k=1}^{K} \sum_{b \in S_k} (b - A_k)^2    (5)

The optimized anchors \hat{A} can be computed as

\hat{A} = \mathop{\arg\min}_{\{A_1, ..., A_k, ..., A_K\}} \frac{1}{n} \sum_{k=1}^{K} \sum_{b \in S_k} (b - A_k)^2    (6)

We find that L_mat is proportional to the objective function of k-means. Therefore, we can use k-means to solve this problem and generate the scale-adaptive anchors \hat{A}. Similar to YOLO9000 [21] and YOLOv3 [22], we apply the Jaccard distance between two bounding boxes for k-means:

d(Q_i, Q_j) = 1 - \frac{|Q_i \cap Q_j|}{|Q_i \cup Q_j|}    (7)

where {Q_i, Q_j} is a pair of normalized bounding boxes. However, YOLO requires a fixed-size square input image, and its final anchors are tied to the input size of the network. Unlike YOLO, the anchors generated by our method are independent of the input size and have adaptive scales, which makes our method better suited to drone-based tasks.

Finally, we set the number of cluster centers to K = 7 according to cross-validation. Intuitively, the clustered anchor boxes represent the shapes of most bounding boxes. For our method, we also report the proportion of max overlaps over the IoU range from 0 to 1, as shown in Fig. 3(b). Compared with RPN-small, our anchor boxes match the ground-truth boxes much better, with 87% of O over 0.5. The quantitative results in Section III are shown in Table I.
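Since L_mat reduces to the k-means objective, the scale-adaptive anchors can be obtained by clustering the normalized box sizes with the Jaccard distance of Eq. (7). The sketch below is a minimal Python version of this step; it assumes the boxes are represented by their (width, height) only, with IoU computed between co-centered boxes as in YOLO9000, and K = 7 as in the paper.

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between co-centered boxes described only by (w, h);
    wh has shape (N, 2), centers has shape (K, 2)."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def kmeans_anchors(wh, K=7, iters=100, seed=0):
    """Cluster normalized box sizes with the distance d = 1 - IoU (Eq. 7)."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=K, replace=False)]
    for _ in range(iters):
        # minimizing 1 - IoU is the same as maximizing IoU
        assign = np.argmax(wh_iou(wh, centers), axis=1)
        new_centers = np.array([wh[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # K scale-adaptive anchor sizes
```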

B. Feature Extraction with Circular Flow

To handle small objects, Hsieh et al. selected the conv4-3 layer instead of the conv5-3 layer of VGG-16 to obtain more object details [10]. However, this empirical operation cannot extract integrated object information, especially for dense small objects.

Inspired by the attention mechanisms in the human visual system, bottom-up and top-down attention has been used widely in computer vision [25]–[27]. Both of these attention mechanisms support, improve and develop with each other. We believe that top-down attention can also improve the discriminability of features in drone images. Let C = {0, 1} denote the classes, f_bu represent the bottom-up cues, and f_td represent the top-down attention. Inspired by previous work on mutual information-based feature selection [28], we maximize the mutual information between C and F = {f_bu, f_td} as follows:

MI(F, C) = \sum_{f} p(f) \sum_{c} p(c \mid f) \log \frac{p(c \mid f)}{p(c)}
         = -\sum_{f \in F} p(f) \log p(f) + \sum_{c} p(c) \sum_{f \in F} p(f \mid c) \log p(f \mid c)
         = H(F) - H(F \mid C)
         = H(F) - H(\{f_{bu}, f_{td}\} \mid C)    (8)

where p(f) is the prior probability of the feature, p(c) is the probability that the proposal belongs to class c, and p(f | c) is the conditional likelihood. In the above formula, the first term H(F) is a constant that denotes the prior entropy. The second term encourages a low conditional entropy, i.e., a well-learned feature for C must be gathered from {f_bu, f_td} without great uncertainty. However, direct computation of the conditional entropy is difficult and infeasible because it requires traversing the whole space to compute p(f | c) [29]. As discussed in [30], a more salient region produces a smaller conditional entropy, so we deploy the saliency map to measure the conditional entropy [31], [32]. To verify the validity of top-down attention, we qualitatively assess whether the gathered layer has stronger activation than a single layer in the regions of each object.
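The decomposition MI(F, C) = H(F) − H(F | C) in Eq. (8) can be checked numerically on any discrete joint distribution; the toy Python snippet below does exactly that and is only an illustration of the identity, not part of the proposed network.

```python
import numpy as np

def mutual_information(p_fc):
    """MI(F, C) = H(F) - H(F|C) for a discrete joint distribution p(f, c),
    given as a 2-D array whose entries sum to 1 (rows: f, columns: c)."""
    eps = 1e-12
    p_f = p_fc.sum(axis=1)                      # marginal over classes
    p_c = p_fc.sum(axis=0)                      # marginal over features
    h_f = -np.sum(p_f * np.log(p_f + eps))      # prior entropy H(F)
    p_f_given_c = p_fc / (p_c[None, :] + eps)   # p(f|c)
    h_f_given_c = -np.sum(p_fc * np.log(p_f_given_c + eps))  # H(F|C)
    return h_f - h_f_given_c
```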

We start by extracting a feature X ∈ R^{W×H×D} from a single layer (f_bu or f_td) or the gathered layer ({f_bu, f_td}), where W, H and D denote the width, height and number of channels. Then, we use the sub-classifier of each anchor to weight the feature maps and calculate the saliency map by taking the maximum value along the channel direction, since we focus only on the single class of vehicles. The activation map M ∈ R^{W×H} can be calculated as

M(x, y) = \max_{d} \Big( \sum_{k} w_{k,d}\, X_d(x, y) \Big)    (9)

where d is the index of the d-th channel, w_k ∈ R^{1×1×D} represents the D-dimensional weight of the k-th sub-classifier corresponding to the k-th anchor, and X_d(x, y) represents the d-th feature map at spatial location (x, y). Specifically, we choose one convolutional layer of ResNet-101 as a single cue. Without loss of generality, we choose three convolutional layers (res3b3x, res4b22x or res5cx) of ResNet-101 according to the size of their feature maps; these three layers correspond to the last layer of each residual block. Their saliency maps are shown in Fig. 4(c)(d)(e). We observe that features at higher layers carry abstract semantic information, while lower features contain rich details. The high-level features from res4b22x focus more on the object regions than those from res5cx. Therefore, we regard res3b3x as the bottom-up cues and the high-level features from res4b22x as the top-down attention [33], [34].


(a) GT   (b) gathered layer   (c) res3b3x   (d) res4b22x   (e) res5cx

Fig. 4. The saliency map of the single layer and the gathered layer. The first column is the ground truth. The next columns correspond to the gathered layer, res3b3x, res4b22x and res5cx. The gathered layer highlights the important regions of each object better than the single layer. For example, many background regions are suppressed by the gathered layer in the second image.

The saliency map of the gathered layer constructed by our method is shown in Fig. 4(b). The gathered layer has stronger activation than the single layer in the object regions. This fact qualitatively indicates that top-down attention helps improve the discriminability of the features.
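A literal reading of Eq. (9) can be written in a few lines of Python; the shapes below (spatial dimensions first, channels last) and the variable names are our assumptions for illustration.

```python
import numpy as np

def activation_map(X, W):
    """M(x, y) = max_d sum_k w_{k,d} * X_d(x, y)  (Eq. 9).
    X: feature tensor of shape (Wd, Hd, D) from a single or gathered layer.
    W: (K, D) weights of the K per-anchor sub-classifiers."""
    w = W.sum(axis=0)                 # sum over the K sub-classifiers -> (D,)
    weighted = X * w[None, None, :]   # scale channel d by sum_k w_{k,d}
    return weighted.max(axis=2)       # channel-wise maximum -> (Wd, Hd) map
```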

Further, we quantitatively analyze the performance of each single layer and verify the validity of top-down attention. As shown in Table II, the performance of res3b3x is much worse than that of the other two layers, although res3b3x contains more details. One explanation is that the features of the res3b3x layer suffer from more background interference. We verify this by error analysis: we investigate the distribution of false detections due to localization bias (LOC), background (BG) and missed detection (MD). In Fig. 5(a), we find that when the features are extracted from res3b3x, 37% of the error is caused by the background, 50% by localization bias, and the remainder by missed detection. In Fig. 5, the pie charts show that the largest error is caused by localization bias. The figure also implies that single-layer cues alone have difficulty detecting dense small targets and distinguishing them from the background. This result prompts us to enhance the underlying features with the top-down attention mechanism. We use a top-down flow as explicit supervision, as shown in Fig. 6. The two mechanisms have complementary advantages: top-down attention provides more abstract semantic information, and the bottom-up flow supplements the details of the bottom layer. Inspired by HyperNet [35], RON [36], and FPN [37], we explore three different connections: (1) reverse res5cx to res4b22x; (2) reverse res5cx to res3b3x; and (3) reverse res4b22x to res3b3x. Specifically, we first upsample the next layer to the same size as the current layer and then concatenate it with the current layer, as {res4b22x →(up-sample) res3b3x →(concat) res4b22x′}. In our implementation, the connection from {f_td = res4b22x} to {f_bu = res3b3x} achieves better performance than the other connections. The quantitative results in Section III also demonstrate the validity of top-down attention, as shown in Table II.
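The gathering step described above (upsample the deeper, top-down map to the spatial size of the shallower, bottom-up map and concatenate along channels) is framework-agnostic. The paper's implementation uses MatConvNet; the snippet below is a PyTorch-style sketch under our own naming, not the authors' code.

```python
import torch
import torch.nn.functional as F

def gather_features(f_bu, f_td):
    """Gathered layer {f_bu, f_td}: upsample the top-down map to the
    bottom-up map's resolution and concatenate along channels.
    f_bu: bottom-up cues, e.g. res3b3x, shape (N, C1, H, W)
    f_td: top-down attention, e.g. res4b22x, shape (N, C2, H/2, W/2)"""
    f_td_up = F.interpolate(f_td, size=f_bu.shape[-2:],
                            mode='bilinear', align_corners=False)
    return torch.cat([f_bu, f_td_up], dim=1)
```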


[Fig. 5 data — res3b3x: LOC 50%, BG 37%, MD 13%; res4b22x: LOC 66%, MD 26%, BG 8%; res5cx: LOC 62%, BG 22%, MD 16%]

Fig. 5. The distribution of false detection types on the CARPK dataset when using the res3b3x, res4b22x and res5cx layers as features, broken down into localization bias (LOC), background (BG) and missed detection (MD).

Fig. 6. Feature extraction in a circular flow with bottom-up cues and top-down attention supervision.

Our features have some advantages compared to the features extracted from ResNet-101. The differences are as follows: 1) the features extracted from ResNet-101 come from only a single layer, such as res3b3x; 2) on top of the single-layer features, our method introduces high-level semantic information into feature extraction by concatenating the feature maps of res4b22x.

C. Multitask Loss with a Counting Regularized Term

The goal of this paper is to simultaneously detect and count objects. Previous work often treats these as two separate tasks: object detection and object counting.


We believe that both tasks can be mutually supportive. It is natural to make a more reliable counting decision if all of the objects are detected; conversely, if the number of objects is known in advance, detection becomes more reliable. Recently, Hariharan et al. [38] proposed a mutually beneficial method to simultaneously perform detection and segmentation, and He et al. [39] incorporated a segmentation mask into an object detection framework. Our work draws inspiration from these impressive works, with the different goal of learning a count-assisted object detector. We introduce a counting regularized term into the object detection task and design a novel counting constraint module that leverages the number of objects, combined with object classification and box regression, in a unified network.

We describe the designed module as follows. First, we develop a new branch for the counting constraint. A counting-alignment layer is built next to the feature extractor, as shown in Fig. 1. The layer uses 1×1 filters to determine the existence of objects. We feed its outputs into average pooling to extract the K-dimensional counting feature f_c. The predicted number of objects T_pred is obtained from the K-dimensional regressor w_c. Then, we define the ground-truth number of objects as T_gt. The counting regularized term L_count can be computed as:

L_{count} = \| w_c \cdot f_c - T_{gt} \|_1    (10)

For the losses of object classification and box regression, we follow [40]. The confidence loss for classification is the softmax loss over the two classes' confidences, and the localization loss for box regression is the Huber loss. The overall objective loss function L is the sum of the confidence loss (conf), the localization loss (loc) and the counting regularized term:

L = L_{conf} + L_{loc} + \lambda L_{count}    (11)

where the weight term λ is set to 0.1 by cross-validation. Intuitively, the counting regularized term helps eliminate background interference and missed detections to a certain extent. Finally, the estimated number of objects is defined as the total number of predicted bounding boxes.
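To make the counting branch concrete, the following PyTorch-style sketch wires up the pieces named above (1×1 filters, average pooling to f_c, the regressor w_c, and the loss of Eqs. (10)-(11)). It is our reading of the description, with assumed layer names and shapes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CountingBranch(nn.Module):
    """Counting layer: 1x1 filters over the gathered feature, global average
    pooling to a K-dimensional counting feature f_c, and a linear regressor
    w_c that predicts the number of vehicles T_pred."""
    def __init__(self, in_channels, K=7):
        super().__init__()
        self.exist = nn.Conv2d(in_channels, K, kernel_size=1)  # object-existence maps
        self.pool = nn.AdaptiveAvgPool2d(1)                    # average pooling -> f_c
        self.regress = nn.Linear(K, 1, bias=False)             # regressor w_c

    def forward(self, features):
        f_c = self.pool(self.exist(features)).flatten(1)       # (N, K)
        return self.regress(f_c).squeeze(1)                    # T_pred, shape (N,)

def total_loss(l_conf, l_loc, t_pred, t_gt, lam=0.1):
    """Eqs. (10)-(11): L = L_conf + L_loc + lambda * ||T_pred - T_gt||_1."""
    l_count = torch.abs(t_pred - t_gt).mean()
    return l_conf + l_loc + lam * l_count
```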

D. Training Strategy

Training our model with the multitask loss is challenging. There is a significant difference in convergence rate between the detection loss and the counting regularization, and it is difficult to achieve good performance by simply combining them. To address this issue, we train our model with a progressive training strategy. The idea is motivated by the multistep training strategy, which first trains the model with the loss of one task while freezing the losses of the other tasks [41]; after a specified number of iterations, the loss of the next task is enabled. Similarly, our method consists of two stages. Stage One: in the first 50 epochs, we set L_StageI = L_conf + L_loc to train the whole network with a learning rate of 1e-4. The trained model is used to initialize the second stage.

[Fig. 7 axes: Loss (0–9) vs. Epoch (0–60); curves: Joint training, Progressive training]

Fig. 7. The loss curves of different strategies on the CARPK dataset [10]. The blue curve indicates the multi-loss joint training strategy; the red curve indicates our progressive training strategy.

Stage Two: for the next 10 epochs, the loss function is augmented with the counting regularized term, i.e., L_StageII = L_StageI + λL_count, where λ = 0.1. The learning rate remains 1e-4.

This strategy ensures that the regularized term is trained complementarily in our end-to-end model. As shown in Fig. 7, the loss of joint training decreases with strong oscillations, while the loss of our strategy decreases steadily in the first stage. With the counting regularization, the loss further decreases until it converges to a lower value in the second stage. This result indicates that the counting constraint imposes an effective penalty on the vehicle detection task, which is consistent with our goal of detecting and counting vehicles from drone images.
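The two-stage schedule can be expressed as a small training loop; the sketch below follows the epoch counts and learning rate reported above, but the model interface (a forward pass that returns the detection losses and the predicted count) and the data loader are assumptions of ours.

```python
def train_progressive(model, loader, optimizer, lam=0.1):
    """Stage One (epochs 0-49): detection losses only.
    Stage Two (epochs 50-59): add the counting regularized term."""
    for epoch in range(60):
        use_count = epoch >= 50
        for images, targets in loader:
            l_conf, l_loc, t_pred = model(images, targets)
            loss = l_conf + l_loc
            if use_count:
                l_count = (t_pred - targets['count'].float()).abs().mean()
                loss = loss + lam * l_count
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```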

III. EXPERIMENTS

The experiment section is organized as follows: we first introduce the basic experiment settings and implementation details in Part A; then, the proposed strategies described in Section II are evaluated and analyzed on the CARPK benchmark [10] in Part B; finally, in Part C, our method is compared with the state-of-the-art methods on four benchmarks.

A. Dataset and Evaluation Metrics

Datasets. We evaluate our method and the state-of-the-art methods on three recent drone-based object counting and detection benchmarks: CARPK [10], VisDrone2018 [42] and UAVDT [43]. In addition, PUCPR+ [10] is considered, where the scenes are close to the drone view.

CARPK, proposed by Hsieh et al. [10], is currently the largest parking-lot dataset for counting and detection. A total of ~90K bounding-box annotations are provided in 1448 images. The dataset consists of 989 images in the training set and 459 images in the test set, with ~42K annotations in the training set and ~48K annotations in the test set. We randomly split the training set into train and val parts with a 1:1 ratio; the hyperparameter settings in our experiments are tuned on these split train and val sets. The PUCPR+ dataset was collected by Hsieh et al. from a subset of the PKLot dataset [12] and has nearly 17K annotations.


The maximum number of cars in a single image is 188 for the CARPK dataset and 331 for the PUCPR+ dataset. Therefore, these two datasets are very challenging for vehicle detection and counting tasks. Both datasets can be downloaded from the following link: https://lafi.github.io/LPN/.

VisDrone2018 is a large-scale visual object detection benchmark collected over a very wide area across 14 different cities in China. For the object detection task, it consists of 6,471 images in the training set, 548 images in the validation set and 3,190 images for testing. It has a total of 10 categories, mainly focusing on humans and vehicles, with ~54.2K annotations. For our evaluation, we focus on the car category, with ~15.9K annotations in the training and validation sets. The training set contains 6,133 images with cars, and the validation set contains 515 images with cars. It can be downloaded from the following link: http://www.aiskyeye.com.

UAVDT [43] is currently the largest drone vehicle dataset with labeled bounding boxes. It consists of 50 video sequences from unconstrained scenes, of which 30 are for training and 20 for testing. There are 24,143 frames in the training set and 16,592 frames in the test set, with nearly 800K labeled bounding boxes. The dataset is highly challenging due to dense small objects. It can be downloaded from the following link: https://sites.google.com/site/daviddo0323/.

Evaluation Metrics. For detection evaluation, unlike LPN, we use the average precision (AP) at IoU thresholds of 0.5 and 0.7 as the metric (higher is better) [44], because we output detected bounding boxes instead of only proposals. Compared with average recall, these metrics are more suitable for the detection task, since they consider both the recall and the false detection rate [44], [45]. For counting evaluation, we use the mean absolute error (MAE) and root mean squared error (RMSE) as the metrics [10] (lower is better).
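For reference, the counting metrics are computed per test image and aggregated as below; this short helper is ours and only restates the standard MAE/RMSE definitions.

```python
import numpy as np

def counting_errors(pred_counts, gt_counts):
    """MAE and RMSE over the per-image predicted and ground-truth counts."""
    diff = np.asarray(pred_counts, dtype=float) - np.asarray(gt_counts, dtype=float)
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse
```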

Implementation details. In our fully convolutional network, the base network is ResNet-101 [46], pretrained on ImageNet [47]. The network architecture is shown in Fig. 8. Similar to SSD [23], in the training stage we set the input size to 300×300. For data augmentation, a patch is randomly sampled from each training image following SSD [23], [40]. In the test stage, we do not fix the image size and set the confidence threshold to 0.05. We train the network for 50K iterations with a batch size of 1. The stochastic gradient descent (SGD) solver is adopted to optimize the network, and the base learning rate is set to 1e-4.

B. Ablation Analysis

In this subsection, we investigate four aspects: 1) the scale-adaptive method to generate anchor boxes; 2) feature extraction with the circular flow; 3) the counting regularized constraint; and 4) the progressive training strategy.

1) Anchor generation with adaptive scale: To analyze the effectiveness of the scale-adaptive strategy, we select the res3b3x, res4b22x or res5cx layer for extracting features in our fully convolutional neural network. In the baseline setting, the default box sizes are set to 16×16, 40×40, and 100×100, following the setting of Hsieh et al. [10], named LPN-ac.

[Fig. 8 blocks: input image (unfixed resolution); ResNet-101 architecture; res3b3x (8x↓); res4b4x (16x↓); res4b4x_up (8x↓); res3b3x_res4b4x_circular (8x↓); score_cls (conv1-1036-K); score_reg (conv3-1036-4K); flow_count (conv1-1036-K); avg_count (avg-pooling); constrain_count (conv1-K-1)]

Fig. 8. The CNN architecture of our method. The convolutional layers are denoted as “conv(kernel size)-(number of channels)-(dimension of output)”. “(factor)x↓” means that the size of the original image is factor times larger than the size of the feature maps. U indicates the up-sampling operation. C indicates the concatenation of two feature maps.

In Table I, we report the results using our anchors, which are generated with the adaptive scale (SA). Compared with the baseline strategy, SA achieves better performance regardless of which layer is used to extract the features, especially for the vehicle detection task. For example, with res4b22x it outperforms the baseline by 2.1 in MAE and 2.18 in RMSE for the vehicle counting task, and the detection performance reaches 87.9% vs. 87.3% at IoU=0.5 and 34.8% vs. 31.5% at IoU=0.7. These results demonstrate that the adaptive scale is effective for reducing localization bias and false positives.

TABLE I
RESULTS ON THE CARPK DATASET BY GENERATING ANCHORS WITH THE SCALE-ADAPTIVE STRATEGY (SA). COMPARED WITH LPN-ac, OUR STRATEGY ACHIEVES BETTER PERFORMANCE.

Layer      Method   MAE     RMSE    [email protected]  [email protected]
res3b3x    LPN-ac   24.44   33.23   68.9    22.6
res3b3x    SA       14.59   21.54   82.7    25.4
res4b22x   LPN-ac    8.39   10.68   87.3    31.5
res4b22x   SA        6.29    8.50   87.9    34.8
res5cx     LPN-ac    9.17   12.23   62.7    12.4
res5cx     SA        6.92    9.20   73.7    15.9

2) Feature extraction with circular flow: We conduct experiments with the 3 single layers and two-way flows, as shown in Table II. The results show the following: (1) the performance of the feature from res4b22x is higher than that of the other two single layers; (2) the flows with the top-down attention mechanism lead to better performance than the single-layer features; and (3) the flow from res4b22x to res3b3x is better than the flow from res5cx to res4b22x.


In particular, the AP at IoU=0.7 increases by 14.5 points with the setting {res4b22x →(up-sample) res3b3x →(concat) res4b22x′}. This strategy leads to a further improvement of our method, with MAE=5.82, RMSE=7.64, [email protected]=89.3% and [email protected]=60.8%. The improvement is attributed to the more effective features obtained with the circular flow.

TABLE II
RESULTS ON THE CARPK DATASET BY USING DIFFERENT LAYERS WITH THE CIRCULAR FLOW. THE FLOWS FROM RES4B22X TO RES3B3X ACHIEVE BETTER PERFORMANCE.

Layer               MAE     RMSE    [email protected]  [email protected]
res3b3x             14.59   21.54   82.7    25.4
res4b22x             6.29    8.45   87.9    34.8
res5cx               6.92    9.20   73.7    15.9
res3b3x+res5cx       8.11   11.31   80.1    22.6
res5cx+res4b22x      7.89   10.27   87.2    46.3
res3b3x+res4b22x     5.82    7.64   89.3    60.8

TABLE III
RESULTS ON THE CARPK DATASET BY ADDING THE COUNTING REGULARIZED CONSTRAINT.

Method       MAE    RMSE   [email protected]  [email protected]
SA+TA        5.82   7.64   89.3    60.8
SA+TA+CRT    5.42   7.38   89.8    61.4

TABLE IV
RESULTS ON THE CARPK DATASET BY USING THE PROGRESSIVE TRAINING STRATEGY.

Method                  MAE    RMSE   [email protected]  [email protected]
Joint Training          6.93   9.57   89.4    60.3
Progressive Training    5.42   7.38   89.8    61.4

3) The counting regularized constraint: In Table III, all four metrics are improved with the counting regularized term. Our final results achieve MAE=5.42, RMSE=7.38, [email protected]=89.8% and [email protected]=61.4%, which are much better than the state-of-the-art methods.

4) The progressive training strategy: To analyze the effectiveness of the progressive training strategy, we select the joint training strategy as the baseline setting. As shown in Table IV, the progressive training strategy achieves better performance.

C. Performance evaluation

In this subsection, we first compare our method with other methods on the CARPK, PUCPR+, VisDrone2018-car and UAVDT datasets. Then, we report our runtime and show qualitative results, including both correct detections and failure cases.

1) Comparisons with other methods: We compare our proposed method with other methods on the four challenging datasets. Our method is divided into three variants: (1) SA, the scale-adaptive strategy; (2) SA+CF, the scale-adaptive strategy combined with circular-flow features; and (3) SA+CF+CRT, all of our strategies combined. The other methods include Faster-RCNN (FRCN), FRCN-small with the anchor boxes adjusted by Hsieh et al. [10], LPN [10], SSD [23], YOLO9000 [21] and YOLOv3 [22]. Hsieh et al. [10] focused on the object counting task, and their output object proposals are not accurate enough for the detection task. Therefore, only the counting results of FRCN, FRCN-small and LPN are taken from [10]. The results of FRCN+ and FRCN-small+ were produced specifically for this study with some adjustments. Different from [10], the additional adjustments are the following: (1) refining the proposals after the ROI-pooling layer as the final detections, following FRCN; and (2) setting the maximum number of predicted boxes per image to 300 instead of 100. This is motivated by the fact that vehicles in drone images are densely distributed and numerous. The results on the CARPK dataset are reported in Table V. We observe the following: (1) the performances of FRCN+ and FRCN-small+ are better than those of the executions in [10], even without the spatial layout information; (2) our method sets a new benchmark with MAE=5.42 and RMSE=7.38 on the CARPK dataset; (3) YOLOv3 performs better than YOLO9000 thanks to its multiscale outputs; and (4) our method consistently achieves the best performance against these methods on the tasks of simultaneously detecting and counting vehicles.

TABLE V
RESULTS ON THE CARPK DATASET COMPARED WITH THE STATE-OF-THE-ART METHODS. SA, SA+CF AND SA+CF+CRT ARE OUR METHODS.

Method                   MAE     RMSE    [email protected]  [email protected]
FRCN [10], [18]          47.45   57.39   -       -
FRCN-small [10], [18]    24.32   37.62   -       -
LPN [10]                 23.80   36.79   -       -
SSD [23]                 37.33   42.32   68.7    25.9
YOLO9000 [21]            38.59   43.18   20.9     3.5
YOLOv3 [22]               7.92   11.08   85.3    47.0
FRCN+                    18.54   21.68   58.4    16.0
FRCN-small+              13.11   20.33   87.1    50.8
SA                        6.29    8.50   87.9    34.8
SA+CF                     5.82    7.64   89.3    60.8
SA+CF+CRT                 5.42    7.38   89.8    61.4

To verify the generalization of our method, we also evaluate it on the PUCPR+, VisDrone2018-car and UAVDT datasets. As shown in Table VI, Table VII and Table VIII, our method clearly achieves impressive results. Qualitative results are shown in Fig. 9 and Fig. 10.

2) Runtime: We use MatConvNet [48] to conduct our experiments, implemented on a single NVIDIA TITAN X. On this platform, our fully convolutional network runs at 9 FPS, while Faster-RCNN runs at 3 FPS. An advantage of our method is that its runtime does not increase with the number of objects, whereas the runtime of FRCN and other two-stage methods increases linearly with the number of windows.

IV. CONCLUSION

This paper introduces a novel method for simultaneously detecting and counting vehicles in drone-based scenes.


Fig. 9. The qualitative results of our method. The green boxes in the first row are the ground truth. The yellow boxes are correct detections, the red boxes are missed boxes, the purple boxes are boxes with location bias, and the blue boxes are located in the background.

TABLE VI
RESULTS ON THE PUCPR+ DATASET COMPARED WITH THE STATE-OF-THE-ART METHODS. SA, SA+CF AND SA+CF+CRT ARE OUR METHODS.

Method                   MAE      RMSE     [email protected]  [email protected]
FRCN [10], [18]          111.40   149.35   -       -
FRCN-small [10], [18]     39.88    47.67   -       -
LPN [10]                  22.76    34.46   -       -
SSD [23]                 119.24   132.22   32.6     7.1
YOLO9000 [21]             97.96   133.25   12.3     4.5
YOLOv3 [22]                5.24     7.14   95.0    45.4
FRCN+                     72.16    76.63   11.8     0.9
FRCN-small+               62.40    67.33   42.8     7.4
SA                         6.80     9.89   86.8    33.6
SA+CF                      3.72     5.28   93.3    44.2
SA+CF+CRT                  3.92     5.06   92.9    55.4

TABLE VII
RESULTS ON THE VISDRONE2018-CAR DATASET COMPARED WITH THE STATE-OF-THE-ART METHODS. SA, SA+CF AND SA+CF+CRT ARE OUR METHODS.

Method          MAE      RMSE     [email protected]  [email protected]
SSD [23]        158.83   161.60   31.0    15.4
YOLO9000 [21]    23.12    26.56   24.1    14.3
YOLOv3 [22]      26.45    32.04   60.2    37.2
FRCN+            78.61    84.32   55.0    28.7
FRCN-small+      78.23    83.25   54.3    28.1
SA               15.15    19.65   53.9    24.5
SA+CF             9.53    18.08   69.8    43.6
SA+CF+CRT         9.04    17.49   70.3    44.9

This problem is challenging because the targets are small and dense. We analyze the reasons behind this problem, including how to select suitable anchor boxes and how to extract features of small objects. We then propose a scale-adaptive strategy to select anchors and apply a circular flow to guide feature extraction. In addition, we add a counting regularized constraint to the object detection task to further improve performance.

TABLE VIII
RESULTS ON THE UAVDT DATASET COMPARED WITH THE STATE-OF-THE-ART METHODS. SA, SA+CF AND SA+CF+CRT ARE OUR METHODS.

Method          MAE     RMSE    [email protected]  [email protected]
SSD [23]        11.44   18.33   39.4    25.5
YOLO9000 [21]   12.59   16.73   18.7     7.6
YOLOv3 [22]     11.58   21.50   38.0    20.3
FRCN+            8.62   17.31   52.0    21.3
FRCN-small+      8.21   18.53   52.5    20.8
SA               8.68   13.19   52.3    10.9
SA+CF            9.02   16.10   54.1    24.7
SA+CF+CRT        7.67   10.95   60.1    27.8

Fig. 10. Visualization of failure cases of our method. The red boxes are missed boxes; the remaining boxes are our detection results. The color of each box indicates its confidence, as shown by the color bars; a color closer to yellow indicates a higher score. The missed detections are mainly caused by perspective, occlusion and truncation.


The experiments demonstrate that our contributions lead to state-of-the-art performance on four large-scale benchmarks. In future work, we will investigate counting constraints based on density maps, for example by building density map estimation and object detection into an integrated task. We also intend to further extend our method and expect wide application of the proposed approach in many other drone-based object detection tasks.

REFERENCES

[1] C. Luo, L. Yu, and P. Ren, “A vision-aided approach to perching a bioinspired unmanned aerial vehicle,” IEEE Transactions on Industrial Electronics, vol. 65, no. 5, pp. 3976–3984, May 2018.

[2] Q. Fu, Q. Quan, and K. Y. Cai, “Robust pose estimation for multirotor UAVs using off-board monocular vision,” IEEE Transactions on Industrial Electronics, vol. 64, no. 10, pp. 7942–7951, Oct. 2017.

[3] S. Islam, P. X. Liu, and A. E. Saddik, “Robust control of four-rotor unmanned aerial vehicle with disturbance uncertainty,” IEEE Transactions on Industrial Electronics, vol. 62, no. 3, pp. 1563–1571, Mar. 2015.

[4] X. Wang, “Intelligent multi-camera video surveillance: A review,” Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, Jan. 2013.

[5] Y. Yuan, Y. Feng, and X. Lu, “Statistical hypothesis detector for abnormal event detection in crowded scenes,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3597–3608, Nov. 2017.

[6] Y. Yuan, Z. Jiang, and Q. Wang, “HDPA: Hierarchical deep probability analysis for scene parsing,” in 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 313–318, Jul. 2017.

[7] F. Meng, H. Li, Q. Wu, K. N. Ngan, and J. Cai, “Seeds-based part segmentation by seeds propagation and region convexity decomposition,” IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 310–322, Feb. 2018.

[8] K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan, “Biologically inspired features for scene classification in video surveillance,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 307–313, Feb. 2011.

[9] Q. Wu, H. Li, F. Meng, and K. N. Ngan, “Toward a blind quality metric for temporally distorted streaming video,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 367–378, Jun. 2018.

[10] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, “Drone-based object counting by spatially regularized regional proposal network,” in IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 4165–4173. IEEE, Oct. 2017.

[11] M. Ahrnbom, K. Astrom, and M. Nilsson, “Fast classification of empty and occupied parking spaces using integral channel features,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1609–1615, Jun. 2016.

[12] P. R. De Almeida, L. S. Oliveira, A. S. Britto Jr, E. J. Silva Jr, and A. L. Koerich, “PKLot – a robust dataset for parking lot classification,” Expert Systems with Applications, vol. 42, no. 11, pp. 4937–4949, Jul. 2015.

[13] L. Boominathan, S. S. S. Kruthiventi, and R. V. Babu, “CrowdNet: A deep convolutional network for dense crowd counting,” in Proceedings of the 2016 ACM on Multimedia Conference, pp. 640–644. ACM, 2016.

[14] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 833–841. IEEE, Jun. 2015.

[15] D. Kang, Z. Ma, and A. B. Chan, “Beyond counting: Comparisons of density maps for crowd analysis tasks – counting, detection, and tracking,” IEEE Transactions on Circuits and Systems for Video Technology, DOI 10.1109/TCSVT.2018.2837153, pp. 1–1, 2018.

[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. IEEE, Jun. 2014.

[17] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. IEEE, Dec. 2015.

[18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.

[19] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Advances in Neural Information Processing Systems 29, pp. 379–387. Curran Associates, Inc., 2016.

[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. IEEE, Jun. 2016.

[21] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. IEEE, Jul. 2017.

[22] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” CoRR, vol. abs/1804.02767, 2018.

[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision, vol. 9905, pp. 21–37. Springer, Oct. 2016.

[24] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.

[25] A. Borji, “Boosting bottom-up and top-down visual features for saliency estimation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 438–445, Jun. 2012.

[26] H. Tian, Y. Fang, Y. Zhao, W. Lin, R. Ni, and Z. Zhu, “Salient region detection by fusing bottom-up and top-down features extracted from a single image,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4389–4398, Oct. 2014.

[27] S. He and R. W. H. Lau, “Exemplar-driven top-down saliency detection via deep association,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5723–5732, Jun. 2016.

[28] M. Bennasar, Y. Hicks, and R. Setchi, “Feature selection using joint mutual information maximisation,” Expert Systems with Applications, vol. 42, no. 22, pp. 8520–8532, 2015.

[29] D. P. Palomar and S. Verdu, “Representation of mutual information via input estimates,” IEEE Transactions on Information Theory, vol. 53, no. 2, pp. 453–470, Feb. 2007.

[30] Y. Li, Y. Zhou, J. Yan, Z. Niu, and J. Yang, “Visual saliency based onconditional entropy,” pp. 246–257, 2010.

[31] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutionalnetworks: Visualising image classification models and saliency maps,”arXiv preprint arXiv:1312.6034, 2013.

[32] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learningdeep features for discriminative localization,” in 2016 IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929,Jun. 2016.

[33] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang,C. Huang, W. Xu, D. Ramanan, and T. S. Huang, “Look and think twice:Capturing top-down visual attention with feedback convolutional neuralnetworks,” in 2015 IEEE International Conference on Computer Vision(ICCV), pp. 2956–2964, Dec. 2015.

[34] A. Roy and S. Todorovic, “Combining bottom-up, top-down, andsmoothness cues for weakly supervised image segmentation,” in 2017IEEE Conference on Computer Vision and Pattern Recognition (CVPR),pp. 7282–7291, Jul. 2017.

[35] T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards accurateregion proposal generation and joint object detection,” in IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR), pp. 845–853,Jun. 2016.

[36] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, “Ron: Reverseconnection with objectness prior networks for object detection,” in 2017IEEE Conference on Computer Vision and Pattern Recognition (CVPR),pp. 5244–5252, Jul. 2017.

[37] T. Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie,“Feature pyramid networks for object detection,” in 2017 IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR), pp. 936–944,Jul. 2017.

[38] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in European Conference on Computer Vision, pp. 297–312. Springer, 2014.

[39] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, Oct. 2017.

[40] P. Hu and D. Ramanan, “Finding tiny faces,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1522–1530, Jul. 2017.

[41] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun, “An end-to-end textspotter with explicit alignment and attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5020–5029, 2018.

[42] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Vision meets drones: A challenge,” arXiv preprint arXiv:1804.07437, 2018.

[43] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” in The European Conference on Computer Vision (ECCV), Sep. 2018.

[44] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[45] J. Hosang, R. Benenson, P. Dollar, and B. Schiele, “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, pp. 814–830, Apr. 2016.

[46] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE, Jun. 2016.

[47] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, Jun. 2009.

[48] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proceedings of the 23rd ACM International Conference on Multimedia, pp. 689–692. ACM, 2015.

Wei Li received the B.Sc. degree in electrical and information engineering from Henan Polytechnic University, Jiaozuo, China, in 2011. He is currently pursuing the Ph.D. degree under the supervision of Prof. H. Li at the University of Electronic Science and Technology of China, Chengdu, China. His research interests include image recognition, object detection, and machine learning. He is a TPC member for the IEEE International Conference on Visual Communications and Image Processing (VCIP 2018) and serves as a reviewer for several international journals and conferences, such as IEEE-TIE, IEEE-TCSVT, Elsevier-JVCI, and VCIP.

Hongliang Li (SM’12) received his Ph.D. degree in Electronics and Information Engineering from Xi’an Jiaotong University, China, in 2005. From 2005 to 2006, he joined the Visual Signal Processing and Communication Laboratory (VSPC) of the Chinese University of Hong Kong (CUHK) as a Research Associate. From 2006 to 2008, he was a Postdoctoral Fellow at the same laboratory in CUHK. He is currently a Professor in the School of Information and Communication Engineering, University of Electronic Science and Technology of China. His research interests include image segmentation, object detection, image and video coding, visual attention, and multimedia communication systems.

Dr. Li has authored or co-authored numerous technical articles in well-known international journals and conferences. He is a co-editor of a Springer book titled “Video Segmentation and Its Applications”. Dr. Li is involved in many professional activities. He is an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology and the Journal on Visual Communications and Image Representation, and the Area Editor of Signal Processing: Image Communication, Elsevier Science. He served as a Technical Program Chair for VCIP 2016 and PCM 2017, General Chair of ISPACS 2010, Publicity Chair of IEEE VCIP 2013, Local Chair of IEEE ICME 2014, and a TPC member for a number of international conferences, e.g., ICME 2013, ICME 2012, ISCAS 2013, PCM 2007, PCM 2009, and VCIP 2010. He is a senior member of the IEEE.

Qingbo Wu (M’15) received the B.E. degree in Education of Applied Electronic Technology from Hebei Normal University in 2009, and the Ph.D. degree in signal and information processing from the University of Electronic Science and Technology of China in 2015. From February 2014 to May 2014, he was a Research Assistant with the Image and Video Processing (IVP) Laboratory at the Chinese University of Hong Kong. From October 2014 to October 2015, he was a visiting scholar with the Image & Vision Computing (IVC) Laboratory at the University of Waterloo. He is currently an Associate Professor in the School of Information and Communication Engineering, University of Electronic Science and Technology of China. His research interests include image/video coding, quality evaluation, and perceptual modeling and processing.

Xiaoyu Chen received his B.Sc. degree in Applied Physics from the University of Electronic Science and Technology of China (UESTC) in 2015. He is currently a fourth-year Ph.D. student in signal and information processing at UESTC. His main research interests include computer vision and machine learning, especially the application of deep learning to object detection.

King Ngi Ngan (M’79-SM’91-F’00) received the Ph.D. degree in electrical engineering from Loughborough University, Loughborough, U.K.

He was previously a Full Professor with Nanyang Technological University, Singapore, and the University of Western Australia, Crawley, WA, Australia. He has been appointed as a Chair Professor of the University of Electronic Science and Technology of China, Chengdu, China, under the National Thousand Talents Program since 2012. He is currently a Chair Professor with the Department of Electronic Engineering, Chinese University of Hong Kong, Hong Kong, China, and the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China. He has published extensively, including 3 authored books, 7 edited volumes, and more than 380 refereed technical papers, and has edited 9 special issues in journals. In addition, he holds 15 patents in the areas of image/video coding and communications. He holds honorary and visiting professorships at numerous universities in China, Australia, and Southeast Asia.

Prof. Ngan is a Fellow of IET (U.K.) and IEAust (Australia), and was an IEEE Distinguished Lecturer from 2006 to 2007. He was an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, the Journal on Visual Communications and Image Representation, the EURASIP Journal of Signal Processing: Image Communication, and the Journal of Applied Signal Processing. He chaired and co-chaired a number of prestigious international conferences on image and video processing, including the 2010 IEEE International Conference on Image Processing, and has served on the advisory and technical committees of numerous professional organizations.