

RepNAS: Searching for Efficient Re-parameterizing Blocks

Mingyang Zhang, Xinyi Yu, Jingtao Rong, Linlin Ou
College of Information Engineering, Zhejiang University of Technology

Hangzhou, People’s Republic of China
[email protected]

Feng Gao
Zhejiang Lab

Hangzhou, People’s Republic of China
[email protected]

Abstract

In the past years, significant improvements in the field of neural architecture search (NAS) have been made. However, it is still challenging to search for efficient networks because of the gap between the search constraint and the real inference time. To search for a high-performance network with low inference time, several previous works set a computational complexity constraint for the search algorithm. However, many factors affect the speed of inference (e.g., FLOPs, MACs), and the correlation between any single indicator and the latency is not strong. Recently, several re-parameterization (Rep) techniques have been proposed to convert multi-branch architectures into single-path ones, which are inference-friendly. Nevertheless, the multi-branch architectures are still human-defined and inefficient. In this work, we propose a new search space that is suitable for structural re-parameterization techniques. RepNAS, a one-stage NAS approach, is presented to efficiently search the optimal diverse branch block (ODBB) for each layer under a branch number constraint. Our experimental results show that the searched ODBBs easily surpass the manual diverse branch blocks (DBBs) with efficient training. Code is available at https://github.com/bestfleer/RepNAS .

1. Introduction

Designing low-latency neural networks is critical to applications on mobile devices, e.g., mobile phones, security cameras, augmented reality glasses, self-driving cars, and many others. Neural architecture search (NAS) is expected to automatically find a neural network whose performance surpasses human-designed ones under the most common constraints, such as latency, FLOPs and parameter number. One of the key factors for the success of NAS is

Figure 1. The comparison between ODBBs and other architecture-searched models. The inference time is tested on a Tesla V100 GPU with a batch size of 128, full precision (fp32).

the artificially designed search space. Many previous NAS methods are designed to search in the DARTS space [19, 24, 20] and MobileNet-like spaces [12, 23, 25]. The DARTS space contains a multi-branch topology that combines various paths with different operations and complexities. The multi-branch design can enrich the feature space and improve performance [22]. Multi-branch structures have also been used in earlier human-designed networks, e.g., the Inception models [22] and DenseNet models [13]. However, these multi-branch architectures result in a longer inference time, which is unfriendly for real tasks. Therefore, many applicable NAS methods adopt MobileNet-like spaces, keeping only two branches (skip connection and convolutional operation) in the searched networks to realize a tradeoff between the inference time and the performance.


To maintain the inference speed of a single branch while retaining the advantages of multiple branches, several Rep techniques [8, 4, 6, 7, 5] have been introduced. Rep techniques retain a multi-branch topology at training time while keeping a single path at running time, and thus realize a balance of accuracy and efficiency. The multiple branches can be fused into one path because the operations are linear; for example, a 3x3 Conv and a parallel 1x1 Conv can be replaced by a single 3x3 Conv by padding the 1x1 kernel to 3x3 and element-wise adding it to the 3x3 kernel. It is worth noting that the fused network has a lower inference time while keeping almost the same accuracy as the multi-branch network.
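To make this pad-and-add fusion concrete, the following is a minimal PyTorch sketch (not the authors' released code; the shapes are arbitrary) that merges a parallel 1x1 Conv into a 3x3 Conv by zero-padding its kernel and checks that the fused convolution reproduces the two-branch output.

```python
import torch
import torch.nn.functional as F

# Two parallel branches applied to the same input: a 3x3 Conv and a 1x1 Conv (bias-free).
x = torch.randn(1, 8, 32, 32)
w3 = torch.randn(16, 8, 3, 3)  # 3x3 branch kernel
w1 = torch.randn(16, 8, 1, 1)  # 1x1 branch kernel

y_two_branch = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1)

# Fuse: zero-pad the 1x1 kernel to 3x3 (center tap) and add it to the 3x3 kernel.
w_fused = w3 + F.pad(w1, [1, 1, 1, 1])
y_fused = F.conv2d(x, w_fused, padding=1)

print(torch.allclose(y_two_branch, y_fused, atol=1e-5))  # True
```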

However, prior works that utilize Rep techniques to train models have some limitations. 1) The branch number and branch types of each block are fixed. Multi-branch training requires extra memory (increasing linearly with the branch number) to save the intermediate feature representations. Therefore, the branch number and branch types of each block are manually fixed because of memory constraints. For instance, the RepVGG block only contains a 1x1 Conv, a 3x3 Conv and a skip connection, while the ACB has a 1x3 Conv and a 3x1 Conv. 2) The role of each branch is unclear, which means some branches may be counterproductive. Ding [6] experimented with various micro Diverse Branch Block (DBB) structures to explore which structure works better. However, on the one hand, a manual micro DBB (all blocks are the same) is always suboptimal; many NAS works [10, 25, 12] revealed that the optimal structures of different blocks differ from each other. On the other hand, it is infeasible to manually design a macro DBB, whose search space reaches approximately 1.5 × 10^36 (assuming 30 blocks and 4 branches per block).

To address limitation 1), we devise a multi-branch search space, called the Rep search space. Unlike previous works that use Rep techniques, each block in this paper can preserve an arbitrary number of branches and all block architectures are independent. To cope with the growing memory consumption, the total branch number of the model can be flexibly adjusted. To address limitation 2), a gradient-based NAS method, RepNAS, is proposed to automatically search the optimal branches for each block without parameter retraining. Compared with previous gradient-based NAS methods [19, 24] that only learn the importance within one edge, RepNAS learns the net-wise importance of each branch. Moreover, RepNAS can be used under low GPU memory conditions by setting a branch number constraint. In each training iteration, the branches of low importance are sequentially pruned until the memory constraint is met. The importance of the branches is updated in the same round of back-propagation as the gradients of the network parameters. As training progresses, the importance of each branch is estimated more and more accurately; meanwhile, the redundant branches hardly participate in the training. Once the training process finishes, the optimized DBB structures are obtained together with the optimized network parameters.

To summarize, our main contributions are as follows:

• A Rep search space is proposed in this paper, which allows the searched model to preserve arbitrary branches in training and to be fused into one path in inference. To the best of our knowledge, it is the first time that Rep techniques are applied to NAS.

• To fit the new search space, RepNAS is presented to automatically search the optimal DBB in one stage. The searched model can be converted to a single-path model and directly deployed without time-consuming retraining.

• Extensive experiments on models of various sizes demonstrate that the searched ODBBs outperform both the human-designed DBBs and NAS models under a similar inference time.

2. Related Work

2.1. Structural Re-parameterization

Structural re-parameterization techniques have been widely used to improve model performance by injecting several branches in training and fusing these branches in inference. RepVGG [8] simply inserts a 1 × 1 Conv and a residual connection alongside a 3 × 3 Conv, which brings a huge improvement to VGG-like networks. A concurrent work, DBB [6], summarizes more structural re-parameterization techniques and proposes a Diverse Branch Block that can be inserted into any convolutional network. We list the existing techniques as follows.

A Conv for Conv-BN. A BN [14] can be fused into the parameters F and b of its preceding Conv for inference. The jth fused channel parameters F'_{j,:,:,:} and b'_j can be formulated as

F'_{j,:,:,:} = \frac{\gamma_j}{\sigma_j} F_{j,:,:,:}, \qquad b'_j = -\frac{\mu_j \gamma_j}{\sigma_j} + \beta_j    (1)

where \mu, \sigma, \gamma and \beta denote the accumulated channel-wise mean, standard deviation, learned scaling factor and bias of the BN, respectively.
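A minimal PyTorch sketch of Eq. (1), assuming σ is the running standard deviation sqrt(var + eps); it folds a BatchNorm (in eval mode) into a bias-free preceding convolution and checks that the outputs match. It is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(8, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16).eval()
bn.running_mean.uniform_(-1, 1)    # pretend these statistics were accumulated in training
bn.running_var.uniform_(0.5, 2.0)

gamma, beta = bn.weight, bn.bias
mu, sigma = bn.running_mean, torch.sqrt(bn.running_var + bn.eps)

# Eq. (1): scale each output channel's kernel and build the fused bias.
w_fused = conv.weight * (gamma / sigma).reshape(-1, 1, 1, 1)
b_fused = beta - mu * gamma / sigma

x = torch.randn(1, 8, 32, 32)
y_ref = bn(conv(x))
y_fused = F.conv2d(x, w_fused, b_fused, padding=1)
print(torch.allclose(y_ref, y_fused, atol=1e-5))  # True
```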

A Conv for branch Conv addition. Convs with different kernel sizes in different branches can be fused into one Conv, provided no nonlinear activation separates them. The kernel of each Conv is first zero-padded to the maximum kernel size among the branches, and then the branches are merged into a single Conv by

F' = F_1 + F_2 + \cdots + F_N, \qquad b' = b_1 + b_2 + \cdots + b_N    (2)

where N denotes the branch number.

A Conv for sequential Convs. A sequence of a 1 × 1 Conv followed by a K × K Conv can be merged into one single K × K Conv:

F' = F^{(2)} \ast \mathrm{TRANS}(F^{(1)}), \qquad b'_j = \sum_{d=1}^{D} \sum_{u=1}^{K} \sum_{v=1}^{K} b^{(1)}_d F^{(2)}_{j,d,u,v} + b^{(2)}_j    (3)

where TRANS represents the transpose, the superscripts (1) and (2) denote the 1 × 1 Conv and the K × K Conv, respectively, and D is the output channel number of the 1 × 1 Conv.
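The sketch below is one way to realize this merge for bias-free convolutions, assuming the 1 × 1 Conv uses no padding and the K × K Conv uses padding K//2; an einsum stands in for the transposed-convolution form of Eq. (3), contracting the 1 × 1 weights into the input-channel dimension of the K × K weights. It is an illustrative check, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

C_in, D, C_out, K = 8, 12, 16, 3
x = torch.randn(1, C_in, 32, 32)
w1 = torch.randn(D, C_in, 1, 1) * 0.1    # 1x1 Conv: C_in -> D
w2 = torch.randn(C_out, D, K, K) * 0.1   # KxK Conv: D -> C_out

# Sequential two-conv computation.
y_seq = F.conv2d(F.conv2d(x, w1), w2, padding=K // 2)

# Merged KxK kernel: F'[o, c, :, :] = sum_d F2[o, d, :, :] * F1[d, c, 0, 0].
w_merged = torch.einsum('odhw,dc->ochw', w2, w1[:, :, 0, 0])
y_merged = F.conv2d(x, w_merged, padding=K // 2)

print(torch.allclose(y_seq, y_merged, atol=1e-4))  # True
```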

A Conv for average pooling. A k × k average pooling can be viewed as a special k × k Conv. Its kernel parameter F' is

F'_{d,c,:,:} = \frac{1}{k^2}\,\mathbf{1}_{k\times k} \ \text{if } d = c, \quad \text{otherwise } 0    (4)

where \mathbf{1}_{k\times k} is the k × k all-ones matrix, i.e., each channel is averaged independently.
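The following sketch builds such a kernel and checks it against F.avg_pool2d; it is a small self-check of the transform, not the authors' code.

```python
import torch
import torch.nn.functional as F

C, k = 8, 3
x = torch.randn(1, C, 32, 32)

# Kernel of shape (C, C, k, k): 1/k^2 on the matching-channel slice, 0 elsewhere.
w = torch.zeros(C, C, k, k)
for c in range(C):
    w[c, c] = 1.0 / (k * k)

y_pool = F.avg_pool2d(x, kernel_size=k, stride=1)
y_conv = F.conv2d(x, w, stride=1)
print(torch.allclose(y_pool, y_conv, atol=1e-6))  # True
```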

However, the architecture of each block is task-specific. For simple tasks or small-scale datasets, too many branches lead to overfitting. Besides, the output of each branch needs to be saved in GPU memory for the backward pass. Limited GPU memory therefore prevents structural re-parameterization techniques from being applied freely.

2.2. Neural Architecture Search

The purpose of neural architecture search is to automatically find the optimal network structure with reinforcement learning (RL) [28, 29], evolutionary algorithms (EA) [20, 10, 25], or gradient-based [19, 24] methods. RL-based and EA-based methods need to evaluate each sampled network by retraining it on the dataset, which is time-consuming. Gradient-based methods can simultaneously train and search the optimal subnet by assigning a learnable weight to each candidate operation. However, gradient-based approaches can produce incorrect ranking results: a subnet may have top proxy accuracy while not performing as expected. Moreover, since gradient-based approaches need more memory for training, they are hard to apply to large-scale datasets. To address the above problems, some one-stage NAS methods [24, 12, 9] have been proposed to simultaneously optimize the architecture and the parameters. Once the supernet training is complete, the top-performing subnet is obtained without retraining.

3. Re-parameterization Neural Architecture Search

In this section, we first present an overview of our proposed approach for searching optimal diverse branch blocks and discuss the differences from other NAS works. We then propose a new search space based on the Rep techniques mentioned in Eqs. (1)-(4). Afterward, the RepNAS approach is presented to fit the proposed search space.

3.1. Overview

The goal of Rep techniques is to improve the training of a CNN by inserting various branches with different kernel sizes. The inserted branches can be linearly fused into the original convolutional branch after training, such that no extra computational complexity (FLOPs, latency, memory footprint or model size) is added. However, training with various branches costs a large amount of GPU memory, and it is hard to optimize a network with too many branches. The core idea of the proposed method is to prune out some branches across different blocks in a differentiable way, as shown in Figure 2. It has two essential steps:

(1) Given a CNN architecture (e.g., MobileNets, VGGs), we net-wisely insert several linear operations into the original convolutional operations as their branches. For each branch, a learnable architecture parameter that represents its importance is set. During training, we optimize both the architecture parameters and the network parameters simultaneously by discretely sampling branches. Once training finishes, we obtain a pruned architecture with optimized network parameters.

(2) In inference, the remaining branches are directly fused into the original convolutional operations, such that the multi-branch architecture can be converted to a single-path architecture without a performance drop. No extra finetuning is required in this step.

Compared with many NAS works, cumbersome architectures with various branches and skip connections are no longer an obstacle to deployment in RepNAS. In contrast to prior structural re-parameterization works, the block architecture of each layer is designed automatically without any extra time consumption. The whole optimization is done in one stage.

3.2. Rep Search Space Design

NAS methods are usually designed to search for the optimal subnetwork in the DARTS space or a MobileNet-like space. The former contains multi-branch architectures, which makes the searched networks difficult to deploy because of their large inference time. The latter, which draws on expert experience in designing mobile networks, includes efficient networks; however, it does not take multi-branch architectures into consideration. Many human-designed networks [22, 13] demonstrate that multi-branch architectures can improve model performance by enhancing the feature representation at various scales. To combine the advantages of multi-branch and single-path architectures, a more flexible search space is proposed based on recent structural re-parameterization works. As shown in Figure 3, each block contains 7 branches (1×1, K×K, 1×1-K×K, 1×1-AVG, 1×K, K×1 and skip connection). Different from previous search spaces, we release the heuristic constraints to offer higher flexibility to the final architecture, which means each block can preserve arbitrary branches and all block architectures are independent. It is worth noting that the multiple branches will be fused into a single path after searching and thus have no impact on inference.


Figure 2. The flow diagram of our proposed approach. Our approach automatically designs the architectures of ODBBs at the macro level by pruning the branches with low importance (red numbers). The architecture search and the network training finish at the same time (left dashed box). The multi-branch architecture of each block is then converted to one path for acceleration (right dashed box).

The new search space reaches approximately 2.29 × 10^105 architectures. Compared with either the micro (1.1 × 10^18) or the macro search space (1.9 × 10^93), the proposed search space has greater potential to offer better architectures and to evaluate the effectiveness of NAS algorithms. However, such an enlarged search space brings challenges for preceding NAS approaches: 1) incorrect architecture ratings [3]; 2) a large memory cost because of the multiple branches [1].

3.3. Weight Sharing for Network Parameters

Many previous NAS methods share weights across architectures during supernet training while decoupling the weights of different branches in the same block. However, this strategy does not work in the proposed Rep search space because of the surging number of subnets. In each training iteration, only a few branches can be sampled, so the weights are updated a limited number of times. Therefore, the performance of the sampled subnets grows slowly, which limits the learning of the architecture parameters α. Inspired by BigNAS [25] and slimmable networks [26], we also share the network parameters across the same block. For any branch that needs a convolutional operation, the weights of this branch are inherited from the K × K convolutional operation in the main branch (see the right part of Figure 3).

We represent its weights as

F^{H,W} = F^{K,K}_{:,:,\,L_H:R_H,\,L_W:R_W},
L_H = \lfloor K/2 \rfloor - \lfloor H/2 \rfloor, \quad R_H = \lfloor K/2 \rfloor + \lfloor H/2 \rfloor,
L_W = \lfloor K/2 \rfloor - \lfloor W/2 \rfloor, \quad R_W = \lfloor K/2 \rfloor + \lfloor W/2 \rfloor    (5)

where ⌊·⌋ is the floor operation, and H × W and K × K denote the kernel sizes of the inherited branch and the main branch, respectively. Equipped with weight sharing, the supernet converges faster; meanwhile, the ranking of the branches can be evaluated precisely. We discuss the details in the Ablation Study.
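A small sketch of the center cropping in Eq. (5), using 0-indexed Python slicing (start = (K - H) // 2), which picks the same central H × W window as the inclusive indices above. Because basic slicing returns a view, using the cropped kernel in a branch keeps the autograd connection to the shared K × K weights. The helper name is ours, for illustration only.

```python
import torch

def inherit_weights(w_main: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Center-crop an (O, C, K, K) kernel to an (O, C, h, w) kernel (Eq. 5)."""
    K = w_main.shape[-1]
    top, left = (K - h) // 2, (K - w) // 2
    return w_main[:, :, top:top + h, left:left + w]

w_main = torch.randn(16, 8, 3, 3, requires_grad=True)  # shared KxK kernel of the main branch
w_1x1 = inherit_weights(w_main, 1, 1)   # 1x1 branch reuses the central tap
w_1x3 = inherit_weights(w_main, 1, 3)   # 1xK branch reuses the central row
print(w_1x1.shape, w_1x3.shape)         # torch.Size([16, 8, 1, 1]) torch.Size([16, 8, 1, 3])
```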

3.4. Searching for Rep Blocks

To overcome the incorrect architecture ratings, an elegant solution is one-stage differentiable NAS, which is expected to simultaneously optimize the architecture parameters α and the network parameters ω in a differentiable way. This can be formulated as the following optimization:

\alpha^*, \omega^* = \arg\min_{\alpha,\omega} L(\omega, \alpha; D_{train})    (6)

where L(\omega, \alpha; D_{train}) = \mathbb{E}_{(x,y)\in D_{train}}[CE(y', y)] denotes the loss function computed on the specified training dataset.

Many previous NAS works are designed to search in the DARTS search space, where a heuristic rule allows each edge to preserve only one branch.


Figure 3. Left: the design of the new search space based on several structural re-parameterization works [8, 6, 4]. Each block can preserve an arbitrary number of branches and is independent of the other blocks. Right: the parameters of each branch are inherited from the original operation by center cropping.

Therefore, in prior differentiable NAS works [19, 24], the probability of the jth branch in the ith block (with N branches) can be written as

P(\alpha_{i,j}) = \frac{\exp(\alpha_{i,j})}{\sum_{k=1}^{N}\exp(\alpha_{i,k})}    (7)

Though Eq. (6) can be optimized with gradient descent as in most neural network training [19], it would suffer from the huge performance gap between the supernet and its child networks [1]. Instead, a continuous and differentiable reparameterization trick, Gumbel softmax [15], is used in NAS approaches [9, 24]. With a Gumbel random variable, Eq. (7) can be rewritten as

Z_{i,j} = \frac{\exp((\alpha_{i,j}+G_{i,j})/\lambda)}{\sum_{k=1}^{N}\exp((\alpha_{i,k}+G_{i,k})/\lambda)}    (8)

where G_{i,j} = -\log(-\log(U_{i,j})) and U_{i,j} is a uniform random variable. Z_{i,j} can be approximated as a one-hot vector as the temperature \lambda \to 0. This relaxation realizes the discretization of the probability distribution. In the multi-branch search space proposed in Rep Search Space Design, each block can preserve an arbitrary number of branches, so we cannot directly obtain the probability as in Eq. (7) or Eq. (8). To fit the new space, we regard whether each branch is retained or not as a binary classification. Hence, the discretization probability of the jth branch in the ith block can be given as

Z^0_{i,j} = \frac{\exp((\alpha^0_{i,j}+G^0_{i,j})/\lambda_{i,j})}{\exp((\alpha^0_{i,j}+G^0_{i,j})/\lambda_{i,j}) + \exp((\alpha^1_{i,j}+G^1_{i,j})/\lambda_{i,j})}    (9)

Z^1_{i,j} = \frac{\exp((\alpha^1_{i,j}+G^1_{i,j})/\lambda_{i,j})}{\exp((\alpha^0_{i,j}+G^0_{i,j})/\lambda_{i,j}) + \exp((\alpha^1_{i,j}+G^1_{i,j})/\lambda_{i,j})}    (10)

where 0 and 1 represent preserving and pruning the branch, respectively. Different from Eq. (8), each branch in the new space is independent. Thus, the temperature \lambda can be set differently according to the requirements. Furthermore, we can combine Eq. (9) and Eq. (10) into a sigmoid form:

Z_{i,j} = f(\alpha_{i,j}) = \frac{1}{1+\exp((\alpha^0_{i,j}+G^0_{i,j}-\alpha^1_{i,j}-G^1_{i,j})/\lambda_{i,j})} = \frac{1}{1+\exp((\alpha_{i,j}+\zeta_{i,j})/\lambda_{i,j})}    (11)

where \zeta \sim \mathrm{Logistic}(0, 1). We only need to optimize \alpha, instead of \alpha^0 and \alpha^1, through gradient descent. Its gradient can be given as

\frac{\partial L}{\partial \alpha_{i,j}} = \frac{\partial L}{\partial x_i} O^T_{i,j} f(\alpha_{i,j})(1-f(\alpha_{i,j}))/\lambda_{i,j}    (12)

where x_i and O_{i,j} denote the output of the ith block and of the jth branch, respectively. In each training iteration, we first compute a threshold s according to the global ranking. Subsequently, the branches whose ranking is below s are pruned out and do not participate in the current forward or backward pass. Thanks to the independence of each branch, we can easily control the activation of each branch through its temperature \lambda:

\lim_{\lambda_{i,j}\to 0^+} Z_{i,j} = 0, \quad \text{if } R_{i,j} < 0 \text{ and } \mathrm{rank}(R_{i,j}) < s
\lim_{\lambda_{i,j}\to 0^-} Z_{i,j} = 0, \quad \text{if } R_{i,j} > 0 \text{ and } \mathrm{rank}(R_{i,j}) < s
\lim_{\lambda_{i,j}\to 0^-} Z_{i,j} = 1, \quad \text{if } R_{i,j} < 0 \text{ and } \mathrm{rank}(R_{i,j}) > s
\lim_{\lambda_{i,j}\to 0^+} Z_{i,j} = 1, \quad \text{if } R_{i,j} > 0 \text{ and } \mathrm{rank}(R_{i,j}) > s    (13)

where R_{i,j} = \alpha_{i,j} + \zeta_{i,j} is a random variable and \mathrm{rank}(R_{i,j}) denotes the global ranking of branch (i, j). In implementation, we sort the importance of each branch by Eq. (11) and only keep the top-k branches for the forward pass.
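The snippet below is a minimal sketch of this gating step under the assumption that a larger R_{i,j} = α_{i,j} + ζ_{i,j} marks a more important branch: it samples the Logistic noise, keeps the k globally top-ranked branches, and uses a straight-through trick so that the hard 0/1 gates still pass gradients to α. The sign convention and the straight-through estimator are our assumptions for illustration, not a statement of the authors' exact implementation.

```python
import torch

def sample_branch_gates(alpha: torch.Tensor, k: int, lam: float = 1.0) -> torch.Tensor:
    """alpha: (num_blocks, num_branches) architecture parameters.
    Returns hard 0/1 gates Z with gradients routed to alpha (straight-through)."""
    u = torch.rand_like(alpha).clamp(1e-6, 1 - 1e-6)
    zeta = torch.log(u) - torch.log1p(-u)          # Logistic(0, 1) noise
    r = alpha + zeta                               # R_{i,j} in Eq. (13)
    z_soft = torch.sigmoid(r / lam)                # relaxed gate (assumed sign convention)

    # Global top-k: keep the k branches with the largest R across all blocks.
    keep = torch.zeros_like(r)
    topk_idx = r.flatten().topk(k).indices
    keep.view(-1)[topk_idx] = 1.0

    # Straight-through: forward uses the hard gate, backward uses the soft one.
    return keep + z_soft - z_soft.detach()

alpha = torch.zeros(30, 7, requires_grad=True)     # e.g., 30 blocks x 7 candidate branches
gates = sample_branch_gates(alpha, k=66)           # keep 66 branches in total
print(int(gates.detach().round().sum()))           # 66
```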


Table 1. Experimental configurations.

Dataset | Arch. | Epochs | Batch size | Init LR | Weight decay | Data augmentation
CIFAR-10 | VGG-16 | 600 | 128 | 0.1 | 1e-4 | same as [6]
ImageNet | ODBB-A0/A1/A2/B1/B2 / ResNet-18 | 150 | 256 | 0.1 | 1e-4 | same as [8]
ImageNet | ODBB-B3 / ResNet-101 | 240 | 256 | 0.1 | 1e-4 | same as [8]
COCO | ResNet-18 / ODBB-A0 | 140 | 114 | 5e-4 | 0 | same as [27]
COCO | ResNet-101 / ODBB-B3 | 140 | 96 | 3.75e-4 | 0 | same as [27]

Algorithm 1 The algorithm detail of RepNAS.
Require: unpruned network, branch parameters θ, architecture parameters α, and branch number constraint C
Initialize θ, α
while not converged do
    Calculate the importance Z of each branch by Eq. (11) with λ = 1
    Discretize Z by Eq. (13) to meet C
    Sample the branches where Z = 1
    Get a batch of data and forward it to obtain L, multiplying Z after each feature map X
    Backward L to both θ and α
    Update θ with ∂L/∂θ, update α with ∂L/∂α
end while

The importance Z_{i,j} of each unpruned branch (i, j) converges to 1 by Eq. (13). Afterward, we simply multiply Z_{i,j} with the output of the unpruned branch (i, j), such that ∂L/∂α_{i,j} can be obtained by the chain rule. The whole algorithm is shown in Algorithm 1.
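As a rough end-to-end illustration of Algorithm 1 (a sketch under assumptions, not the released implementation), the loop below scales every branch output by its sampled gate and updates θ and α in the same backward pass. `supernet`, its `blocks`/`branches`/`head` attributes and `loader` are hypothetical stand-ins, and `sample_branch_gates` refers to the sketch above; the hyper-parameters mirror Table 1 and Section 4.1.

```python
import torch
import torch.nn.functional as F

def train_repnas(supernet, alpha, loader, branch_constraint, epochs=1):
    opt_theta = torch.optim.SGD(supernet.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    opt_alpha = torch.optim.Adam([alpha], lr=1e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for x, y in loader:
            # Eq. (11) with lambda = 1, then Eq. (13): keep the globally top-ranked branches.
            gates = sample_branch_gates(alpha, k=branch_constraint)
            out = x
            for i, block in enumerate(supernet.blocks):          # hypothetical block list
                # Multiply each branch output by its gate Z_{i,j}; pruned branches get 0
                # (the paper skips them entirely to save the intermediate activations).
                out = sum(gates[i, j] * branch(out) for j, branch in enumerate(block.branches))
            logits = supernet.head(out)                          # hypothetical classifier head
            loss = F.cross_entropy(logits, y)
            opt_theta.zero_grad()
            opt_alpha.zero_grad()
            loss.backward()        # one backward pass updates both theta and alpha
            opt_theta.step()
            opt_alpha.step()
```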

4. Experiments

We first compare ODBB with the baseline, random search, DBB [6] and ACB [4] on CIFAR-10 as a quick sanity check. To further demonstrate the effectiveness of ODBB across various model sizes, experiments on a large-scale dataset, ImageNet-1k [17], are conducted. Finally, the impact of weight sharing and the effect of the branch number constraint on model performance are given in the ablation study.

4.1. Quick Sanity Check on CIFAR-10

We use VGG-16 [21] as the benchmark architecture. The convolutional operations in the benchmark architecture are replaced by ODBB, DBB and ACB, respectively. For a fair comparison, the data augmentation techniques and training hyper-parameter settings follow DBB and ACB, as given in Table 1. To optimize the architecture parameters α simultaneously, we use the Adam [16] optimizer with a 0.0001 learning rate and (0.5, 0.999) betas. One indicator of the effectiveness of a NAS approach is whether its search result exceeds that of random search. To produce the random search result, 25 architectures are randomly sampled from the supernet. We train each of them for 100 epochs and pick the best one for an entire 600-epoch retraining. The comparison results are shown in Table 3. ODBB improves VGG-16 on CIFAR-10 by 0.85% and surpasses DBB and ACB. The architecture generated by random search also slightly improves the performance of the benchmark model but falls behind ODBB by 0.77%, which demonstrates the effectiveness of our proposed NAS algorithm.

4.2. Performance Improvements on ImageNet

To reveal the generalization ability of our method, we then search for a series of models on ImageNet-1k, which comprises 1.28M training images and 50K validation images from 1000 classes. RepVGG-series [8] models are used as the benchmark architectures. We replace each original RepVGG block with the block designed in Rep Search Space Design. For a fair comparison, the total branch number of all ODBB models is limited to be equal to that of the RepVGGs. The training strategies of each network are listed in Table 1. We also use the Adam optimizer with a 0.0001 learning rate and (0.5, 0.999) betas to optimize α. We discuss the impact of the branch number on performance in the Ablation Study.

Results are summarized in Table 4. RepVGG is stacked with several multi-branch blocks that contain a 1×1 Conv, a 3×3 Conv and a skip connection, and it can also be fused into a single path in inference. We search for more powerful architectures for the blocks in RepVGG. For RepVGG-A0-A2, which have 22 layers, ODBB achieves 0.55%, 0.38% and 0.24% better accuracy than the original baselines, respectively. To the best of our knowledge, RepVGG-B3 with ODBB refreshes the performance record for plain models from 80.52% to 80.97%. Noting that the ACB, DBB and RepVGG blocks are special cases of our proposed search space, our proposed NAS can indeed search for optimal architectures beyond the human-defined ones.

4.3. Comparison with Other NAS

We compare ODBB-series models with other models searched from the DARTS and mobile search spaces on ImageNet-1k.


Table 2. RepNAS performance on ImageNet with comparisons to other NAS methods. We group the models according to their Top-1 accuracy. The inference time is tested on an NVIDIA Tesla V100 GPU / NVIDIA Xavier with a batch size of 128/1, full precision (fp32).

Model | Parameters (M) | FLOPs (B) | Inference (s) | Search Space | Search+Retrain Cost (GPU Days) | Top-1 (%)
ODBB(A1) | 12.78 | 2.4 | 0.031/0.028 | Rep (Ours) | 24+0 | 75.24
DARTS | 4.9 | 0.59 | 0.067/0.178 | DARTS | 1+24 | 73.1
SNAS | 4.3 | 0.52 | 0.063/0.167 | DARTS | 1.5+24 | 72.7
NASNet-A | 5.3 | 0.56 | 0.078/0.195 | DARTS | 1800+24 | 74.0
AmoebaNet-B | 4.9 | 0.56 | 0.218 | DARTS | 3150+24 | 74.0
ODBB(A2) | 25.49 | 5.1 | 0.038/0.031 | Rep (Ours) | 24+0 | 76.86
SPOS | 3.5 | 0.32 | 0.061/0.053 | ShuffleNet | 24+24 | 74.0
BigNASModel-S | 4.5 | 0.24 | 0.045/0.049 | MobileNet | 40+0 | 76.5
Once-For-All | 4.4 | 0.23 | 0.051/0.057 | MobileNet | 40+24 | 76.4
ODBB(B3) | 110.96 | 26.2 | 0.107/0.106 | Rep (Ours) | 50+0 | 80.97
BigNASModel-L | 9.5 | 1.1 | 0.186/0.265 | MobileNet | 80+0 | 80.9

Table 3. Top-1 Acc. of the VGG-16 baseline, random search, DBB, ACB and ODBB on CIFAR-10.

Method | Top-1 Acc. (%) | Improvement (%)
Baseline | 93.83 | -
Random search | 93.91 | 0.08
DBB [6] | 94.45 | 0.62
ACB [4] | 94.13 | 0.30
ODBB (Ours) | 94.68 | 0.85

Table 4. Top-1 Acc. of several benchmark architectures on ImageNet. The structural re-parameterization techniques of RepVGG are not included in ACB and DBB; thus, we only compare the performance with the original RepVGG.

Arch. | Param. (M) | Branches | Original (%) | ODBB (%)
A0 | 8.3 | 66 | 72.41 | 72.96
A1 | 12.78 | 66 | 74.46 | 75.24
A2 | 25.49 | 66 | 76.48 | 76.86
B1 | 14.33 | 84 | 78.37 | 78.61
B2 | 80.31 | 84 | 78.78 | 79.10
B3 | 110.96 | 84 | 80.52 | 80.97

We verify the real inference time on two different hardware platforms, including a GPU (NVIDIA Tesla V100) and an embedded device (NVIDIA Xavier). Table 2 presents the results and shows that: 1) ODBB-series models searched from the Rep search space consistently outperform networks searched from the other search spaces at a lower inference time, since ODBB-series networks are stacked only with 3x3 convolutions and have no branches. 2) RepNAS can directly search on ImageNet-1k and combines training with searching, which brings high performance at a low GPU-day cost. We show the architectures of the searched ODBB-series in the Appendix.

4.4. Transferability

To verify the transferability of the searched models to other computer vision tasks, object detection experiments are conducted on the MS COCO dataset [18] with the CenterNet framework [27]. The detailed implementation follows [27]: ImageNet-1k pretrained backbones with three up-convolutional networks are finetuned on COCO. The input resolution is 512×512. We use Adam to optimize the overall parameters. Specifically, the baseline backbones of CenterNet are ResNet-series models, and we replace them with our searched ODBB-series. All backbone networks are pretrained on ImageNet-1k with the training schedules in Table 1. After finetuning on COCO, the ODBB-series are fused into one path for fast inference. The results are shown in Table 5. For both light and heavy models, ODBB-A0 and ODBB-B3 surpass ResNet-18 and ResNet-101 by 3.3% and 2.7% AP with comparable inference speed, respectively, which demonstrates that our searched models have outstanding performance on other computer vision tasks.

4.5. Ablation Study

Training Under Branch Number Constraint. Training multi-branch models requires linearly increasing GPU memory to save the intermediate feature maps for the forward and backward passes. How to search and train a multi-branch model in a memory-friendly way is significant for the application of structural re-parameterization techniques. For simplicity, we reduce the branch number of the network to meet various memory constraints by pushing the temperature λ of the low-ranked branches to 0⁻.


Table 5. Object detection on COCO validation.

Backbone | COCO AP (%) | Inference (s)
ResNet-18 | 28.1 | 0.065
ResNet-101 | 34.6 | 0.141
ODBB(A0) | 31.4 | 0.048
ODBB(B3) | 37.3 | 0.148

Figure 4. The tradeoff curve of top-1 Acc. and branch number. We compare ODBB with other manual block architectures (i.e., the original RepVGG block, DBB, ACB) and the single-path model that only has one 3 × 3 Conv in each block.

Figure 5. Top-1 Acc. on ImageNet of the supernet and the sampled high-performing subnet.

To prove its efficiency and effectiveness over other human-designed blocks, we also train DBB and ACB on the benchmark RepVGG-A0 by replacing the original RepVGG block, respectively. All experiments in this subsection are conducted on RepVGG-A0 with the same training settings. The results shown in Figure 4 reveal that ODBB can surpass the other blocks with fewer branches. For instance, an ODBB model containing only 50 branches achieves a higher top-1 Acc. than the original RepVGG block (66 branches), DBB (88 branches) and ACB (66 branches). Besides, we found that the ODBB with 75 branches is enough to obtain the same results as the ODBB with more branches.

Efficacy of Weight Sharing. We train and search ODBB(A0) with the following two methods: 1) the weights of the different branches in the same block are shared; 2) the weights of the different branches are updated independently. Figure 5 presents the comparison on ImageNet-1k with ODBB(A0). It is apparent that the ODBB(A0) updated with weight sharing converges faster than the one updated independently. Besides, we sample the high-performing subnet every five training epochs according to the ranking of the architecture parameters α. Each sampled model is trained from scratch with the scheme in Table 1. As shown in Figure 5, the subnets sampled from the weight-sharing supernet have more stable and higher accuracy in each period. This phenomenon illustrates that the weight-sharing supernet serves as a good indicator of the ranking.

5. Limitations

The searched networks have the same channels and depth as the RepVGG [8] networks, which contain a huge number of parameters. Therefore, the limited attention to the parameter number hinders the deployment of our searched models on CPU devices. However, VGG-like models can be easily pruned by many existing fast channel pruning methods [5, 2, 11] for parameter reduction.

6. Conclusion

In this work, we first propose a new search space, the Rep space, where each block is architecture-independent and can preserve arbitrary branches. Any subnet searched from the Rep space has a fast inference speed since it can be converted to a single-path model for inference. To efficiently train the supernet, block-wise weight sharing is used in supernet training. To fit the new search space, a new one-stage NAS method is presented, and the optimal diverse branch blocks can be obtained without retraining. Extensive experiments demonstrate that the proposed RepNAS can search multi-branch networks of various sizes, named the ODBB-series, which strongly outperform previous NAS networks under a similar inference time. Moreover, compared with other networks that utilize Rep techniques, the ODBB-series also achieves state-of-the-art top-1 accuracy on ImageNet across various model sizes. In future work, we will consider adding the channel number of each block to the search space, which can further boost the performance of RepNAS.

References

[1] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1294–1303, 2019.

[2] Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards efficient model compression via learned global ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1518–1528, 2020.

[3] Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845, 2019.

[4] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1911–1920, 2019.

[5] Xiaohan Ding, Tianxiang Hao, Jianchao Tan, Ji Liu, Jungong Han, Yuchen Guo, and Guiguang Ding. Lossless cnn channel pruning via decoupling remembering and forgetting. arXiv preprint arXiv:2007.03260, 2020.

[6] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10886–10895, 2021.

[7] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition. arXiv preprint arXiv:2105.01883, 2021.

[8] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.

[9] Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1761–1770, 2019.

[10] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision, pages 544–560. Springer, 2020.

[11] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.

[12] Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin. Dsnas: Direct neural architecture search without parameter retraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12084–12092, 2020.

[13] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.

[15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2016.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[19] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2018.

[20] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019.

[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[22] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[23] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.

[24] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: Stochastic neural architecture search. In International Conference on Learning Representations, 2018.

[25] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neural architecture search with big single-stage models. In European Conference on Computer Vision, pages 702–717. Springer, 2020.

[26] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In International Conference on Learning Representations, 2018.

[27] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.

[28] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[29] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.


Appendix: Model Architecture

The design of the channels and depth of the ODBB-series networks follows RepVGG [8]; we show it in Table 6.

Table 6. The channel number and depth of each stage.

Name | Layers of each stage | Channels of each stage
A0 | 1, 2, 4, 14, 1 | 64, 48, 96, 192, 1280
A1 | 1, 2, 4, 14, 1 | 64, 64, 128, 256, 1280
A2 | 1, 2, 4, 14, 1 | 64, 96, 192, 384, 1408
B1 | 1, 4, 6, 16, 1 | 64, 128, 256, 512, 2048
B2 | 1, 4, 6, 16, 1 | 64, 160, 320, 640, 2560
B3 | 1, 4, 6, 16, 1 | 64, 192, 384, 768, 2560

Figure 6. Searched result of ODBB-A0

Figure 7. Searched result of ODBB-A1


Figure 8. Searched result of ODBB-A2

Figure 9. Searched result of ODBB-B1


Figure 10. Searched result of ODBB-B2

Figure 11. Searched result of ODBB-B3
