


CBNet: A Novel Composite Backbone Network Architecture for Object Detection

Yudong Liu,1 Yongtao Wang,1 Siwei Wang,1 TingTing Liang,1 Qijie Zhao,1 Zhi Tang,1 Haibin Ling2

1 Wangxuan Institute of Computer Technology, Peking University
2 Department of Computer Science, Stony Brook University

{bahuangliuhe,wyt,wangsiwei,liangtingting,zhaoqijie,tangzhi}@pku.edu.cn, [email protected]

Abstract

In existing CNN-based detectors, the backbone network is a very important component for basic feature1 extraction, and the performance of the detectors highly depends on it. In this paper, we aim to achieve better detection performance by building a more powerful backbone from existing backbones like ResNet and ResNeXt. Specifically, we propose a novel strategy for assembling multiple identical backbones by composite connections between the adjacent backbones, to form a more powerful backbone named Composite Backbone Network (CBNet). In this way, CBNet iteratively feeds the output features of the previous backbone, namely high-level features, as part of the input features to the succeeding backbone, in a stage-by-stage fashion, and finally the feature maps of the last backbone (named Lead Backbone) are used for object detection. We show that CBNet can be very easily integrated into most state-of-the-art detectors and significantly improve their performance. For example, it boosts the mAP of FPN, Mask R-CNN and Cascade R-CNN on the COCO dataset by about 1.5 to 3.0 percent. Meanwhile, experimental results show that the instance segmentation results can also be improved. In particular, by simply integrating the proposed CBNet into the baseline detector Cascade Mask R-CNN, we achieve a new state-of-the-art result on the COCO dataset (mAP of 53.3) with a single model, which demonstrates the great effectiveness of the proposed CBNet architecture. Code will be made available at https://github.com/PKUbahuangliuhe/CBNet.

1 Introduction

Object detection is one of the most fundamental problems in computer vision, and it serves a wide range of applications such as autonomous driving, intelligent video surveillance, and remote sensing. In recent years, great progress has been made in object detection thanks to the booming development of deep convolutional networks (Krizhevsky, Sutskever, and Hinton 2012), and a number of excellent detectors have been proposed, e.g., SSD (Liu et al. 2016), Faster R-CNN (Ren et al. 2015), RetinaNet (Lin et al. 2018), FPN (Lin et al. 2017a), Mask R-CNN (He et al. 2017), and Cascade R-CNN (Cai and Vasconcelos 2018).

1 Hereafter, basic feature refers in particular to the features that are extracted by the backbone network and used as the input to the other functional modules in the detector, such as the detection head, RPN and FPN.

[Figure 1 image omitted: Backbone 1, Backbone 2, ..., Backbone K (Assistant Backbones and the Lead Backbone), each starting from conv1, with the Lead Backbone feeding the detection head/FPN/RPN, e.g., Mask R-CNN.]

Figure 1: Illustration of the proposed Composite Backbone Network (CBNet) architecture for object detection. CBNet assembles multiple identical backbones (Assistant Backbones and a Lead Backbone) by composite connections between the parallel stages of adjacent backbones. In this way, CBNet iteratively feeds the output features of the previous backbone as part of the input features to the succeeding backbone, in a stage-by-stage fashion, and finally outputs the features of the last backbone, namely the Lead Backbone, for object detection. The red arrows represent composite connections.

Backbone Network            APbox  AP50  AP75
ResNet50                    41.5   60.1  45.5
ResNet101                   43.3   61.7  47.2
ResNeXt101                  45.9   64.4  50.2
ResNeXt152                  48.3   67.0  52.8
Dual-ResNeXt152 (ours)      50.0   68.8  54.6
Triple-ResNeXt152 (ours)    50.7   69.8  55.5

Table 1: Results of the state-of-the-art detector Cascade Mask R-CNN on the COCO test-dev dataset (Lin et al. 2014) with different existing backbones and the proposed Composite Backbone Networks (Dual-ResNeXt152 and Triple-ResNeXt152), reproduced with Detectron (Girshick et al. 2018; Cai and Vasconcelos 2018). It shows that deeper and larger backbones bring better detection performance, while our Composite Backbone Network architecture can further strengthen existing very powerful backbones for object detection, such as ResNeXt152.



Generally speaking, in a typical CNN-based object detector, a backbone network is used to extract basic features for detecting objects; it is usually designed for the image classification task and pretrained on the ImageNet dataset (Deng et al. 2009). Not surprisingly, if a backbone can extract more representational features, its host detector will perform better accordingly. In other words, a more powerful backbone brings better detection performance, as demonstrated in Table 1. Hence, starting from AlexNet (Krizhevsky, Sutskever, and Hinton 2012), deeper and larger (i.e., more powerful) backbones have been exploited by state-of-the-art detectors, such as VGG (Simonyan and Zisserman 2014), ResNet (He et al. 2016), DenseNet (Huang et al. 2017), and ResNeXt (Xie et al. 2017). Despite the encouraging results achieved by state-of-the-art detectors based on deep and large backbones, there is still plenty of room for performance improvement. Moreover, it is very expensive to achieve better detection performance by designing a novel, more powerful backbone and pre-training it on ImageNet. In addition, since almost all of the existing backbone networks are originally designed for the image classification task, directly employing them to extract basic features for object detection may result in suboptimal performance.

To deal with the issues mentioned above, as illustrated in Figure 1, we propose to assemble multiple identical backbones, in a novel way, to build a more powerful backbone for object detection. In particular, the assembled backbones are treated as a whole, which we call a Composite Backbone Network (CBNet). More specifically, CBNet consists of multiple identical backbones (called Assistant Backbones and a Lead Backbone) and composite connections between neighboring backbones. From left to right, the output of each stage in an Assistant Backbone, namely higher-level features, flows to the parallel stage of the succeeding backbone as part of its inputs through composite connections. Finally, the feature maps of the last backbone, named the Lead Backbone, are used for object detection. Obviously, the features extracted by CBNet for object detection fuse the high-level and low-level features of multiple backbones and hence improve the detection performance. It is worth mentioning that we do not need to pretrain CBNet in order to train a detector integrated with it. Instead, we only need to initialize each assembled backbone of CBNet with the pretrained model of the single backbone, which is widely and freely available today, such as ResNet and ResNeXt. In other words, adopting the proposed CBNet is more economical and efficient than designing a novel, more powerful backbone and pre-training it on ImageNet.

On the widely tested MS-COCO benchmark (Lin et al. 2014), we conduct experiments by applying the proposed Composite Backbone Network to several state-of-the-art object detectors, such as FPN (Lin et al. 2017a), Mask R-CNN (He et al. 2017) and Cascade R-CNN (Cai and Vasconcelos 2018). Experimental results show that the mAPs of all the detectors consistently increase by 1.5 to 3.0 percent, which demonstrates the effectiveness of our Composite Backbone Network. Moreover, with our Composite Backbone Network, the results of instance segmentation are also improved. In particular, using Triple-ResNeXt152, i.e., a Composite Backbone Network of three ResNeXt152 (Xie et al. 2017) backbones, we achieve a new state-of-the-art result on the COCO dataset, that is, an mAP of 53.3, outperforming all published object detectors.

To summarize, the major contributions of this work are two-fold:
• We propose a novel method to build a more powerful backbone for object detection by assembling multiple identical backbones, which can significantly improve the performance of various state-of-the-art detectors.
• We achieve a new state-of-the-art result on the MS COCO dataset with a single model, that is, an mAP of 53.3 for object detection.
In the rest of the paper, after reviewing related work in Sec. 2, we describe the proposed CBNet for object detection in detail in Sec. 3. Then, we report the experimental validation in Sec. 4. Finally, we draw conclusions in Sec. 5.

2 Related work

Object detection. Object detection is a fundamental problem in computer vision. The state-of-the-art methods for general object detection can be briefly categorized into two major branches. The first branch contains one-stage methods such as YOLO (Redmon et al. 2016), SSD (Liu et al. 2016), RetinaNet (Lin et al. 2017b), FSAF (Zhu, He, and Savvides 2019) and NAS-FPN (Ghiasi, Lin, and Le 2019). The other branch contains two-stage methods such as Faster R-CNN (Ren et al. 2015), FPN (Lin et al. 2017a), Mask R-CNN (He et al. 2017), Cascade R-CNN (Cai and Vasconcelos 2018) and Libra R-CNN (Pang et al. 2019). Although breakthroughs have been made and encouraging results have been achieved by recent CNN-based detectors, there is still large room for performance improvement. For example, on the MS COCO benchmark (Lin et al. 2014), the best publicly reported mAP is only 52.5 (Peng et al. 2018), which is achieved by a model ensemble of four detectors.

Backbone for object detection. The backbone is a very important component of a CNN-based detector for extracting basic features for object detection. Following the original works applying deep learning to object detection (e.g., R-CNN (Girshick et al. 2014) and OverFeat (Sermanet et al. 2013)), almost all of the recent detectors adopt the pretraining and fine-tuning paradigm, that is, they directly use networks pre-trained for the ImageNet classification task as their backbones. For instance, VGG (Simonyan and Zisserman 2014), ResNet (He et al. 2016) and ResNeXt (Xie et al. 2017) are widely used by state-of-the-art detectors. Since these backbone networks were originally designed for the image classification task, directly employing them to extract basic features for object detection may result in suboptimal performance. More recently, two sophisticatedly designed backbones, i.e., DetNet (Li et al. 2018) and FishNet (Sun et al. 2018), were proposed for object detection. Although these two backbones are specifically designed for the object detection task, they still need to be pretrained on the ImageNet classification task before training (fine-tuning) the detector based on them.


[Figure 2 image omitted: (a) two composed backbones, each starting from conv1; (b) an unrolled recurrent network at t = 0 and t = 1.]

Figure 2: Comparison between (a) our proposed CBNet architecture (K = 2) and (b) the unrolled architecture of RCNN (Liang and Hu 2015) (T = 2).

It is well known that designing and pretraining a novel and powerful backbone like these requires considerable manpower and computation cost. As an alternative, we propose a more economical and efficient solution to build a more powerful backbone for object detection: assembling multiple identical existing backbones (e.g., ResNet and ResNeXt).

Recurrent Convolutional Neural Network. As shown in Figure 2, the proposed architecture of the Composite Backbone Network is somewhat similar to an unfolded recurrent convolutional neural network (RCNN) (Liang and Hu 2015). However, the proposed CBNet is quite different from this network. First, as illustrated in Figure 2, the architecture of CBNet is actually quite different, especially in the connections between the parallel stages. Second, in RCNN, the parallel stages of different time steps share their parameters, while in the proposed CBNet, the parallel stages of the backbones do not share parameters. Moreover, if we use RCNN as the backbone of a detector, we need to pretrain it on ImageNet; when we use CBNet, we do not.

3 Proposed method

This section elaborates the proposed CBNet in detail. We first describe its architecture and variants in Section 3.1 and Section 3.2, respectively. Then, we describe the structure of the detection network with CBNet in Section 3.3.

3.1 Architecture of CBNet

The architecture of the proposed CBNet consists of K identical backbones (K ≥ 2). Specifically, we call the case of K = 2 (as shown in Figure 2.a) Dual-Backbone (DB) for simplicity, and the case of K = 3 Triple-Backbone (TB).

As illustrated in Figure 1, the CBNet architecture consists of two types of backbones: the Lead Backbone B_K and the Assistant Backbones B_1, B_2, ..., B_{K-1}. Each backbone comprises L stages (generally L = 5), and each stage consists of several convolutional layers with feature maps of the same size. The l-th stage of a backbone implements a nonlinear transformation F^l(·).

In a traditional convolutional network with only one backbone, the l-th stage takes the output of the previous (l-1)-th stage (denoted as x^{l-1}) as input, which can be expressed as:

x^l = F^l(x^{l-1}),  l ≥ 2.    (1)

Unlike this, in the CBNet architecture, we employ the Assistant Backbones B_1, B_2, ..., B_{K-1} to enhance the features of the Lead Backbone B_K, by iteratively feeding the output features of the previous backbone as part of the input features to the succeeding backbone, in a stage-by-stage fashion. To be more specific, the input of the l-th stage of backbone B_k is the fusion of the output of the previous (l-1)-th stage of B_k (denoted as x_k^{l-1}) and the output of the parallel stage of the previous backbone B_{k-1} (denoted as x_{k-1}^l). This operation can be formulated as follows:

x_k^l = F_k^l(x_k^{l-1} + g(x_{k-1}^l)),  l ≥ 2,    (2)

where g(·) denotes the composite connection, which consists of a 1×1 convolutional layer and a batch normalization layer to reduce the channels, followed by an upsampling operation. As a result, the output features of the l-th stage in B_{k-1} are transformed into the input of the same stage in B_k and added to the original input feature maps before going through the corresponding layers. Considering that this composition style feeds the output of the adjacent higher-level stage of the previous backbone to the succeeding backbone, we call it Adjacent Higher-Level Composition (AHLC).

For the object detection task, only the outputs of the Lead Backbone, x_K^l (l = 2, 3, ..., L), are taken as the input of the RPN/detection head, while the output of each stage of the Assistant Backbones is forwarded into its adjacent backbone. Moreover, the backbones B_1, B_2, ..., B_{K-1} in CBNet can adopt various backbone architectures, such as ResNet (He et al. 2016) or ResNeXt (Xie et al. 2017), and can be initialized directly from the pre-trained model of the corresponding single backbone.

3.2 Other possible composite styles

Same Level Composition (SLC). An intuitive and simple composite style is to fuse the output features from the same-level stages of the backbones. This operation of Same Level Composition (SLC) can be formulated as:

x_k^l = F_k^l(x_k^{l-1} + x_{k-1}^{l-1}),  l ≥ 2.    (3)

To be more specific, Figure 3.b illustrates the structure of SLC when K = 2.

Adjacent Lower-Level Composition (ALLC). Contrary to AHLC, another intuitive composite style is to feed the output of the adjacent lower-level stage of the previous backbone to the succeeding backbone. This operation of Adjacent Lower-Level Composition (ALLC) can be formulated as:

x_k^l = F_k^l(x_k^{l-1} + g(x_{k-1}^{l+1})),  l ≥ 2.    (4)

To be more specific, Figure 3.c illustrates the structure of ALLC when K = 2.



Figure 3: Four kinds of composite styles for the Dual-Backbone architecture (an Assistant Backbone and a Lead Backbone). (a) Adjacent Higher-Level Composition (AHLC). (b) Same Level Composition (SLC). (c) Adjacent Lower-Level Composition (ALLC). (d) Dense Higher-Level Composition (DHLC). The composite connections, denoted by the blue boxes, represent some simple operations, i.e., element-wise operations, scaling, a 1×1 convolutional layer and a batch normalization layer.

Dense Higher-Level Composition (DHLC). In DenseNet (Huang et al. 2017), each layer is connected to all subsequent layers within a stage to build dense connections. Inspired by it, we can utilize dense composite connections in our CBNet architecture. The operation of DHLC can be expressed as follows:

x_k^l = F_k^l(x_k^{l-1} + Σ_{i=l}^{L} g_i(x_{k-1}^i)),  l ≥ 2.    (5)

As shown in Figure 3.d, when K = 2, we assemble the features from all the higher-level stages in the Assistant Backbone and add the composite features to the output features of the previous stage in the Lead Backbone.

3.3 Architecture of detection network with CBNet

The CBNet architecture is applicable to various off-the-shelf object detectors without additional modifications to their network architectures. In practice, we attach the layers of the Lead Backbone to the functional networks, i.e., the RPN (Ren et al. 2015) and the detection head (Redmon et al. 2016; Ren et al. 2015; Zhang et al. 2018; Lin et al. 2017a; He et al. 2017; Cai and Vasconcelos 2018).
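Concretely, the Lead Backbone's stage outputs plug into the usual FPN/RPN exactly as a single backbone's would. Below is a hedged wiring sketch using torchvision's FeaturePyramidNetwork as a generic stand-in; the c2-c5 naming and the dummy shapes are illustrative assumptions:

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Channel widths of ResNet-style C2..C5 outputs; the Lead Backbone exposes the
# same interface as a single backbone, so the detector wiring is unchanged.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

feats = OrderedDict(
    c2=torch.randn(1, 256, 200, 304), c3=torch.randn(1, 512, 100, 152),
    c4=torch.randn(1, 1024, 50, 76),  c5=torch.randn(1, 2048, 25, 38),
)
pyramid = fpn(feats)  # 256-channel maps, ready for the RPN / detection head
```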

4 Experiments

In this section, we present experimental results on the bounding box detection task and instance segmentation task of the challenging MS-COCO benchmark (Lin et al. 2014). Following the protocol in MS-COCO, we use the trainval35k set for training, which is a union of 80k images from the train split and a random 35k subset of images from the 40k-image validation split. We report COCO AP on the test-dev split for comparisons, which is evaluated on the evaluation server.

4.1 Implementation details

Baseline methods in this paper are reproduced by ourselves based on the Detectron framework (Girshick et al. 2018). All the baselines are trained with the single-scale strategy, except Cascade Mask R-CNN ResNeXt152. Specifically, the short side of the input image is resized to 800, and the longer side is limited to 1333. We conduct most experiments on a machine with 4 NVIDIA Titan X GPUs, CUDA 9.2 and cuDNN 7.1.4. In addition, we train Cascade Mask R-CNN with Dual-ResNeXt152 on a machine with 4 NVIDIA P40 GPUs and Cascade Mask R-CNN with Triple-ResNeXt152 on a machine with 4 NVIDIA V100 GPUs. The data augmentation is simply flipping the images. For most of the original baselines, the batch size on a single GPU is two images. Due to the GPU memory cost of CBNet, we put one image on each GPU for training the detectors using CBNet. Meanwhile, we set the initial learning rate to half of the default value and train for the same number of epochs as the original baselines. It is worth noting that we do not change any other configuration of these baselines except the reduction of the initial learning rate and batch size.
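The learning-rate reduction follows the familiar linear scaling rule: halving the per-GPU batch size goes with halving the learning rate. A tiny sketch of the arithmetic, where the 0.02 base value is our assumed Detectron-style default, not a quoted configuration:

```python
# Only the halving itself is from the paper; the defaults below are assumptions.
base_lr = 0.02          # assumed baseline initial learning rate
base_ims_per_gpu = 2    # baseline batch size per GPU
ims_per_gpu = 1         # CBNet fits only one image per GPU

lr = base_lr * ims_per_gpu / base_ims_per_gpu  # -> 0.01, half the default
```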

During inference, we completely follow the configuration of the original baselines (Girshick et al. 2018). For Cascade Mask R-CNN with different backbones, we run both single-scale and multi-scale tests. For the other baseline detectors, we run the single-scale test, in which the short side of the input image is resized to 800 and the longer side is limited to 1333. Note that we do not utilize Soft-NMS (Bodla et al. 2017) during inference, for fair comparison.

4.2 Detection results

To demonstrate the effectiveness of the proposed CBNet, we conduct a series of experiments with the baselines of state-of-the-art detectors, i.e., FPN (Lin et al. 2017a), Mask R-CNN (He et al. 2017) and Cascade R-CNN (Cai and Vasconcelos 2018), and the results are reported in Table 2.


Baseline detector                    Single  DB  TB   APbbox  AP50  AP75   APmask  AP50  AP75
FPN + ResNet101                      ✓                39.4    61.5  42.8   -       -     -
                                             ✓        41.0    62.4  44.5   -       -     -
                                                 ✓    41.7    64.0  45.5   -       -     -
Mask R-CNN + ResNet101               ✓                40.0    61.2  43.6   35.9    57.9  38.0
                                             ✓        41.8    62.9  45.6   37.0    59.5  39.3
                                                 ✓    42.4    64.0  46.7   38.1    59.9  40.8
Cascade R-CNN + ResNet101            ✓                42.8    62.1  46.3   -       -     -
                                             ✓        44.3    62.5  48.1   -       -     -
                                                 ✓    44.9    63.9  48.9   -       -     -
Cascade Mask R-CNN + ResNeXt152      ✓                48.3    67.0  52.8   41.0    64.1  44.2
                                             ✓        50.0    68.8  54.6   42.0    64.6  45.6
                                                 ✓    50.7    69.8  55.5   43.3    66.9  46.8

Table 2: Detection results on the MS-COCO test-dev set. We report both object detection and instance segmentation results on four kinds of detectors to demonstrate the effectiveness of CBNet. Single: with the baseline single backbone. DB: with the Dual-Backbone architecture. TB: with the Triple-Backbone architecture. The APbbox columns show the object detection results, while the APmask columns show the instance segmentation results.

In each row of Table 2, we compare a baseline (provided by Detectron (Girshick et al. 2018)) with its variants using the proposed CBNet, and one can see that our CBNet consistently improves all of these baselines by a significant margin. More specifically, the mAPs of these baselines increase by 1.5 to 3 percent.

Furthermore, as presented in Table 3, a new state-of-the-art detection result of 53.3 mAP on the MS-COCO benchmark is achieved by the Cascade Mask R-CNN baseline equipped with the proposed CBNet. Notably, this result is achieved with a single model, without any other improvement to the baseline besides taking CBNet as the backbone. Hence, this result demonstrates the great effectiveness of the proposed CBNet architecture.

Moreover, as shown in Table 2, the proposed CBNet also improves the performance of the baselines for instance segmentation. Compared with bounding box prediction (i.e., object detection), pixel-wise classification (i.e., instance segmentation) tends to be more difficult and requires more representational features. These results demonstrate the effectiveness of CBNet once again.

4.3 Comparisons of different composite styles

We further conduct experiments to compare the suggested composite style AHLC with the other possible composite styles illustrated in Figure 3, including SLC, ALLC, and DHLC. All of these experiments are conducted based on the Dual-Backbone architecture and the FPN ResNet101 baseline.

SLC vs. AHLC. As presented in Table 4, SLC gets an even worse result than the original baseline. We think the major reason is that the SLC architecture brings serious parameter redundancy. To be more specific, the features extracted by the same stage of the two backbones in CBNet are similar, hence SLC cannot learn more semantic information than a single backbone. In other words, the network parameters are not fully utilized but bring much difficulty to training, leading to a worse result.

ALLC vs. AHLC. As shown in Table 4, there is a great gap between ALLC and AHLC. We infer that, in our CBNet, if we directly add the lower-level (i.e., shallower) features of the previous backbone to the higher-level (i.e., deeper) ones of the succeeding backbone, the semantic information of the latter will be largely harmed. On the contrary, if we add the deeper features of the previous backbone to the shallower ones of the succeeding backbone, the semantic information of the latter can be largely enhanced.

DHLC vs. AHLC. The results in Table 4 show that DHLC does not bring as much performance improvement as AHLC, although it adds more composite connections. We infer that the success of the Composite Backbone Network lies mainly in the composite connections between adjacent stages, while the other composite connections do not enrich the features much, since they are too far apart.

Obviously, CBNets of these composite styles have the same number of network parameters (i.e., about twice as many as a single backbone), but only AHLC brings the optimal detection performance improvement. These experimental results prove that merely increasing parameters or adding an additional backbone may not bring a better result. Moreover, they also show that composite connections should be added properly. Hence, these results demonstrate that the suggested composite style AHLC is effective and nontrivial.

DB  Composite style  APbox  AP50  AP75
-   -                39.4   61.5  42.8
✓   SLC              38.9   60.8  42.0
✓   ALLC             36.5   57.6  39.6
✓   DHLC             40.7   61.8  44.0
✓   AHLC             41.0   62.4  44.5

Table 4: Comparison between different composite styles; the baseline is FPN ResNet101 (Lin et al. 2017a). DB: with/without Dual-Backbone. "SLC" represents Same Level Composition, "ALLC" represents Adjacent Lower-Level Composition, "DHLC" represents Dense Higher-Level Composition, and "AHLC" represents Adjacent Higher-Level Composition.


Method                                          Backbone           APbox  AP50  AP75  APS   APM   APL
one stage:
SSD512 (Liu et al. 2016)                        VGG16              28.8   48.5  30.3  10.9  31.8  43.5
RetinaNet (Lin et al. 2017b)                    ResNeXt101         40.8   61.1  44.1  24.1  44.2  51.2
RefineDet (Zhang et al. 2018)*                  ResNet101          41.8   62.9  45.7  25.6  45.1  54.1
CornerNet (Zhang et al. 2018)*                  Hourglass-104      42.2   57.8  45.2  20.7  44.8  56.6
M2Det (Zhao et al. 2018)*                       VGG16              44.2   64.6  49.3  29.2  47.9  55.1
FSAF (Zhu, He, and Savvides 2019)*              ResNeXt101         44.6   65.2  48.6  29.7  47.1  54.6
NAS-FPN (Ghiasi, Lin, and Le 2019)              AmoebaNet          48.3   -     -     -     -     -
two stage:
Faster R-CNN (Ren et al. 2015)                  VGG16              21.9   42.7  -     -     -     -
R-FCN (Dai et al. 2016)                         ResNet101          29.9   51.9  -     10.8  32.8  45.0
FPN (Lin et al. 2017a)                          ResNet101          36.2   59.1  39.0  18.2  39.0  48.2
Mask R-CNN (He et al. 2017)                     ResNet101          39.8   62.3  43.4  22.1  43.2  51.2
Cascade R-CNN (Cai and Vasconcelos 2018)        ResNet101          42.8   62.1  46.3  23.7  45.5  55.2
Libra R-CNN (Pang et al. 2019)                  ResNeXt101         43.0   64.0  47.0  25.3  45.6  54.6
SNIP (model ensemble) (Singh and Davis 2018)*   -                  48.3   69.7  53.7  31.4  51.6  60.7
SNIPER (Singh, Najibi, and Davis 2018)*         ResNet101          47.6   68.5  53.4  30.9  50.6  60.7
Cascade Mask R-CNN (Girshick et al. 2018)*      ResNeXt152         50.2   68.2  54.9  31.9  52.9  63.5
MegDet (model ensemble) (Peng et al. 2018)*     -                  52.5   -     -     -     -     -
ours (single model):
Cascade Mask R-CNN*                             Dual-ResNeXt152    52.8   70.6  58.0  34.9  55.4  65.3
Cascade Mask R-CNN*                             Triple-ResNeXt152  53.3   71.9  58.5  35.5  55.8  66.7

Table 3: Object detection comparison between our methods and state-of-the-art detectors on the COCO test-dev set. *: utilizing multi-scale testing.

4.4 Sharing weights for CBNet

Due to using more backbones, CBNet brings a sharp increase in network parameters. To further demonstrate that the improvement of detection performance mainly comes from the composite architecture rather than the increase in network parameters, we conduct experiments on FPN with the configuration of sharing the weights of the two backbones in Dual-ResNet101, and the results are shown in Table 5. We can see that when sharing the weights of the backbones in CBNet, the increase in parameters is negligible, yet the detection result is still much better than the baseline (e.g., mAP 40.4 vs. 39.4). When we do not share the weights, the further improvement is modest (mAP from 40.4 to 41.0), which proves that it is the composite architecture that dominantly boosts the performance, rather than the increase in network parameters.

Baseline detector   DB  Share  APbox  mb
FPN + ResNet101     -   -      39.4   470
                    ✓   ✓      40.4   492
                    ✓   -      41.0   815

Table 5: Comparison of with/without sharing weights for the Dual-Backbone architecture. DB: with/without Dual-Backbone. Share: with/without sharing weights. APbox: detection results on the COCO test-dev dataset. mb: model size (in MB).
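A minimal sketch of what this ablation changes, assuming a generic torchvision backbone as a stand-in: the shared variant reuses one module instance for both backbone positions, so their parameters are tied, while the unshared variant deep-copies it and roughly doubles the backbone parameters.

```python
import copy
from torchvision.models import resnet101  # assumed stand-in backbone

lead = resnet101()

# Shared weights: assistant and lead are literally the same module, so the
# parameter count barely grows (only the composite connections are new).
assistant_shared = lead

# Unshared weights: an independent copy doubles the backbone parameters.
assistant_unshared = copy.deepcopy(lead)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(lead), n_params(assistant_unshared))  # equal counts, separate tensors
```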

4.5 Number of backbones in CBNet

We conduct experiments to investigate the relationship between the number of backbones in CBNet and the detection performance, taking FPN-ResNet101 as the baseline, and the results are shown in Figure 4. It can be noted that the detection mAP steadily increases with the number of backbones and tends to converge when the number of backbones reaches three. Hence, considering the speed and memory cost, we suggest utilizing the Dual-Backbone and Triple-Backbone architectures.

Figure 4: Object detection results on the MS-COCO test-dev dataset using different numbers of backbones in the CBNet architecture, based on FPN ResNet101.



Figure 5: An accelerated version of CBNet (K = 2).

Baseline detector   DB  Ψ  APbox  fps
FPN + ResNet101     -   -  39.4   8.1
                    ✓   -  41.0   5.5
                    ✓   ✓  40.8   6.9

Table 6: Performance comparison between the original DB and the accelerated version. DB: with/without Dual-Backbone. Ψ: with/without the acceleration modification of the CBNet architecture illustrated in Figure 5.

4.6 An accelerated version of CBNet

The major drawback of the proposed CBNet is that it slows down the inference speed of the baseline detector, since it uses more backbones to extract features and thus increases the computation complexity. For example, as shown in Table 6, DB increases the AP of FPN by 1.6 percent but slows down the detection speed from 8.1 fps to 5.5 fps. To alleviate this problem, we further propose an accelerated version of CBNet, illustrated in Figure 5, obtained by removing the two early stages of the Assistant Backbone. As demonstrated in Table 6, this accelerated version significantly improves the speed (from 5.5 fps to 6.9 fps) while barely harming the detection accuracy (i.e., AP drops only from 41.0 to 40.8).
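One way to read Figure 5 (our interpretation of the diagram, not a confirmed detail of the authors' implementation) is that the Assistant Backbone drops its first two stages and instead grows from an early feature map of the Lead Backbone, so the low-level computation runs only once. A sketch under that assumption, reusing the CompositeConnection signature from the Section 3.1 sketch:

```python
def accelerated_dual_forward(lead_stages, assistant_stages, composites, x):
    """Accelerated K=2 CBNet (Figure 5), assuming the assistant reuses the
    lead's early feature map instead of recomputing its first two stages.

    assistant_stages: stage modules 3..L only (stages 1-2 removed)
    composites[l]: fuses assistant stage l output into lead stage l (AHLC)
    """
    # Early stages run once; their output seeds both paths (our assumption).
    h = lead_stages[1](lead_stages[0](x))
    shared_low = h

    # Assistant path: remaining stages on top of the shared low-level feature.
    a_outs, a = {}, shared_low
    for l, stage in enumerate(assistant_stages, start=3):
        a = stage(a)
        a_outs[l] = a

    # Lead path: AHLC fusion (Eq. 2) only where assistant outputs exist.
    outs = [shared_low]
    for l, stage in enumerate(lead_stages[2:], start=3):
        h = stage(h + composites[l](a_outs[l], h.shape[-2:]))
        outs.append(h)
    return outs
```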

4.7 Effectiveness of basic feature enhancement by CBNet

We think the root cause that our CBNet performs much better than a single backbone network for the object detection task is that it can extract more representational basic features than the original single backbone network, which was originally designed for the classification problem. To verify this, as illustrated in Figure 6, we visualize and compare the intermediate feature maps extracted by our CBNet and by the original single backbone in the detectors for some examples. The example image in Figure 6 contains two foreground objects: a person and a tennis ball. Obviously, the person is a large object and the tennis ball is a small one. Hence, we correspondingly visualize the large-scale feature maps (for detecting small objects) and the small-scale feature maps (for detecting large objects) extracted by our CBNet and by the original single backbone. One can see that the feature maps extracted by our CBNet consistently have stronger activation values at the foreground objects and weaker activation values at the background. This visualization example shows that our CBNet is more effective at extracting representational basic features for object detection.


Figure 6: Visualization comparison of the features extracted by our CBNet (Dual-ResNet101) and by the original backbone (ResNet101). The baseline detector is FPN-ResNet101. For each backbone, we visualize Res2 and Res5, according to the sizes of the foreground objects, by averaging the feature maps along the channel dimension. It is noteworthy that the feature maps from CBNet are more representational, since they have stronger activation values at the foreground objects and weaker ones at the background. Best viewed in color.
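The visualization itself is simple to reproduce; a hedged sketch of the channel-mean heatmap the caption describes (function and variable names are ours):

```python
import torch

def channel_mean_heatmap(feat: torch.Tensor) -> torch.Tensor:
    """Collapse a (C, H, W) feature map to an (H, W) heatmap by averaging
    along the channel dimension, then min-max normalize it for display."""
    heat = feat.mean(dim=0)
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# e.g., heat = channel_mean_heatmap(res2_feats[0])  # first image in the batch
```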

5 Conclusion

In this paper, a novel network architecture called Composite Backbone Network (CBNet) is proposed to boost the performance of state-of-the-art object detectors. CBNet consists of a series of backbones with the same network structure and uses composite connections to link these backbones. Specifically, the output of each stage in a previous backbone flows to the parallel stage of the succeeding backbone as part of its inputs, through composite connections. Finally, the feature maps of the last backbone, namely the Lead Backbone, are used for object detection. Extensive experimental results demonstrate that the proposed CBNet is beneficial for many state-of-the-art detectors, such as FPN, Mask R-CNN, and Cascade R-CNN, improving their detection accuracy. To be more specific, the mAPs of the detectors mentioned above on the COCO dataset are increased by about 1.5 to 3 percent, and a new state-of-the-art result on COCO, with an mAP of 53.3, is achieved by simply integrating CBNet into the Cascade Mask R-CNN baseline. Simultaneously, experimental results show that CBNet is also very effective for improving instance segmentation performance. Additional ablation studies further demonstrate the effectiveness of the proposed architecture and the composite connection module.

References


Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, 5561-5569.
Cai, Z., and Vasconcelos, N. 2018. Cascade R-CNN: Delving into high quality object detection. In CVPR.
Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 379-387.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009, 248-255. IEEE.
Ghiasi, G.; Lin, T.-Y.; and Le, Q. V. 2019. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7036-7045.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 580-587.
Girshick, R.; Radosavovic, I.; Gkioxari, G.; Dollár, P.; and He, K. 2018. Detectron. https://github.com/facebookresearch/detectron.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV, 2980-2988. IEEE.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.
Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; and Sun, J. 2018. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 334-350.
Liang, M., and Hu, X. 2015. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3367-3375.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740-755. Springer.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017a. Feature pyramid networks for object detection. In CVPR.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017b. Focal loss for dense object detection. In ICCV.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2018. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. In ECCV, 21-37. Springer.

Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; and Lin, D. 2019. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 821-830.
Peng, C.; Xiao, T.; Li, Z.; Jiang, Y.; Zhang, X.; Jia, K.; Yu, G.; and Sun, J. 2018. MegDet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6181-6189.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In CVPR.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 91-99.
Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Singh, B., and Davis, L. S. 2018. An analysis of scale invariance in object detection - SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3578-3587.
Singh, B.; Najibi, M.; and Davis, L. S. 2018. SNIPER: Efficient multi-scale training. In Advances in Neural Information Processing Systems, 9333-9343.
Sun, S.; Pang, J.; Shi, J.; Yi, S.; and Ouyang, W. 2018. FishNet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, 762-772.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 5987-5995. IEEE.
Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; and Li, S. Z. 2018. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4203-4212.
Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; and Ling, H. 2018. M2Det: A single-shot object detector based on multi-level feature pyramid network. arXiv preprint arXiv:1811.04533.
Zhu, C.; He, Y.; and Savvides, M. 2019. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 840-849.