
CARS: Continuous Evolution for Efficient Neural Architecture Search

Zhaohui Yang1∗, Yunhe Wang2, Xinghao Chen2, Boxin Shi3,4, Chao Xu1, Chunjing Xu2, Qi Tian2†, Chang Xu5

1 Key Laboratory of Machine Perception (Ministry of Education), Peking University. 2 Huawei Noah’s Ark Lab. 3 Peng Cheng Laboratory.

4 National Engineering Laboratory for Video Technology, Peking University. 5 School of Computer Science, University of Sydney.

{zhaohuiyang,shiboxin}@pku.edu.cn; [email protected]; {yunhe.wang,xinghao.chen,tian.qi1,xuchunjing}@huawei.com; [email protected]

Abstract

Searching techniques in most existing neural architecture search (NAS) algorithms are mainly dominated by differentiable methods for efficiency reasons. In contrast, we develop an efficient continuous evolutionary approach for searching neural networks. Architectures in the population that share parameters within one supernet in the latest iteration are tuned over the training dataset for a few epochs. The search in the next evolution iteration directly inherits both the supernet and the population, which accelerates the generation of optimal networks. A non-dominated sorting strategy is further applied to preserve only the results on the Pareto front for accurately updating the supernet. Several neural networks with different model sizes and performance are produced after the continuous search, which takes only 0.4 GPU days. As a result, our framework provides a series of networks with the number of parameters ranging from 3.7M to 5.1M under mobile settings. These networks surpass those produced by state-of-the-art methods on the benchmark ImageNet dataset.

Convolutional neural networks have made great progress in a large range of computer vision tasks, such as recognition (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2015; He et al. 2016), detection (Girshick 2015; Liu et al. 2016), and segmentation (He et al. 2017). Over-parameterized deep neural networks can produce impressive recognition performance but consume huge computational resources at the same time. Besides, designing novel network architectures relies heavily on human experts’ knowledge and experience, and may take many trials before achieving meaningful results (He et al. 2016; Simonyan and Zisserman 2015; Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015; Huang et al. 2017). To accelerate this process and make it automated, automatic network architecture search (NAS) has been proposed, and the learned architectures have exceeded human-designed architectures on a variety of tasks. However, these searching methods usually require large computational resources to find architectures of acceptable performance.

∗ This work was done while visiting Huawei Noah’s Ark Lab.
† Corresponding author.

Techniques for searching network architectures are mainly clustered into three groups: Evolution Algorithm (EA) based, Reinforcement Learning (RL) based, and gradient-based. EA-based works (Miikkulainen et al. 2019; Real et al. 2017; Real et al. 2019; Xie and Yuille 2017) initialize a set of models and evolve them toward better architectures, which is time consuming, e.g. the approach of Real et al. (2017) takes 3150 GPU days for training. RL-based works (Zoph and Le 2017; Cai et al. 2018; Zhong et al. 2018) use a controller to predict a sequence of operations and train different architectures to obtain the reward. Given an architecture, both kinds of methods have to train it for a large number of epochs and then evaluate its performance to guide the evolution or to optimize the controller, which makes the searching stage less efficient. Gradient-based methods like DARTS (Liu, Simonyan, and Yang 2019) first train a supernet, introduce an attention mechanism on the connections while searching, and then remove weak connections after searching. This phase is conducted by gradient descent optimization and is quite efficient. However, it suffers from a lack of variety in the searched architectures.

Although some experiments in previous works (Real et al. 2019) show that evolutionary algorithms perform better than RL-based approaches in finding neural architectures with better performance, the searching cost of EA is much more expensive due to the evaluation procedure of each individual, i.e. every neural network in the evolutionary algorithm is evaluated independently. Moreover, there could be some architectures with extremely poor performance in the search space. If we directly follow the weight sharing approach proposed by ENAS (Pham et al. 2018), the supernet has to be trained to compensate for those poor regions of the search space. It is therefore necessary to reform existing evolutionary algorithms for an efficient yet accurate neural architecture search.

In this paper, we propose an efficient EA-based neural architecture search framework. To maximally utilize the knowledge learned in the last evolution iteration, a continuous evolution strategy is developed. Specifically, a supernet is first initialized with considerable cells and blocks. Individuals in the evolutionary algorithm, representing architectures derived from the supernet, are generated through several benchmark operations (i.e. crossover and mutation). A non-dominated sorting strategy is adopted to select several excellent architectures with different model sizes and accuracies, and the corresponding cells in the supernet are updated for subsequent optimization. The evolution procedure in the next iteration continues based on the updated supernet and the multi-objective solution set obtained by the non-dominated sorting. In addition, we propose a protection mechanism to avoid the Small Model Trap problem. The proposed continuous evolution architecture search (CARS) can provide a series of models on the Pareto front with high efficiency. The superiority of our method is verified on benchmark datasets against state-of-the-art methods.

Related Works

Network Architecture Search

The network architecture search problem typically contains two steps: network parameter optimization and network architecture optimization. The network parameter optimization step adjusts the parameters in the standard layers (i.e. convolution, batch normalization, fully connected layers), which is similar to training a standard neural network (e.g. ResNet). The network architecture optimization step learns accurate network architectures with superior performance.

The parameter optimization step can use independent or shared optimization. Independent optimization learns each network separately, e.g. AmoebaNet (Real et al. 2019) takes thousands of GPU days to evaluate thousands of models. To reduce training time, (Elsken, Metzen, and Hutter 2019; Elsken, Metzen, and Hutter 2018) initialize parameters by network morphism. The one-shot method (Bender et al. 2018) goes a step further by sharing all the parameters of different architectures in one supernet. Rather than training thousands of different architectures, only one supernet needs to be optimized.

The architecture optimization step includes RL-based, EA-based, and gradient-based approaches. RL-based methods (Zoph and Le 2017; Zoph et al. 2018; Pham et al. 2018) use recurrent networks as network architecture controllers, and the performance of the generated architectures is utilized as the reward for training the controller. The controller converges during training and finally outputs architectures with superior performance. EA-based approaches (Xie and Yuille 2017; Real et al. 2019; Wang et al. 2018; Shu et al. 2019) search architectures with the help of evolutionary algorithms. The validation performance of each individual is utilized as the fitness to evolve the next generation. Gradient-based approaches (Liu, Simonyan, and Yang 2019; Xie et al. 2019; Wu et al. 2019) view the network architecture as a set of learnable parameters and optimize the architecture by the standard back-propagation algorithm.

Multi-objective Network Architecture Search

Considering multiple complementary objectives, i.e. performance, the number of parameters, floating-point operations (FLOPs), and latency, there is no single architecture that surpasses all the others on all the objectives. Therefore, a set of architectures on the Pareto front is desired. Many different works have been proposed to deal with multi-objective network architecture search. NEMO (Kim et al. 2017) targets speed and accuracy. DPPNet and LEMONADE (Dong et al. 2018; Elsken, Metzen, and Hutter 2019) consider device-related and device-agnostic objectives. MONAS (Hsu et al. 2018) targets accuracy and energy. NSGANet (Lu et al. 2019) considers FLOPs and accuracy. These methods are less efficient because the models are optimized separately. In contrast, our architecture optimization and parameter optimization steps are conducted iteratively, rather than first fully training a set of parameters and then optimizing architectures. Besides, the parameters of different architectures are shared, which makes the search much more efficient.

Approach

In this section, we develop a novel continuous evolutionary approach for searching neural architectures, which includes two procedures, i.e. parameter optimization and architecture optimization.

We use a Genetic Algorithm (GA) for architecture evolution, because GA can cover a vast search space. We maintain a set of architectures (a.k.a. connections) C = {C_1, · · · , C_N}, where N is the population size. The architectures in the population are gradually updated according to the proposed pNSGA-III method during the architecture optimization step. To make the searching stage efficient, we maintain a supernet N that shares parameters W among the different architectures, which dramatically reduces the computational complexity compared to separately training these different architectures during the search.

Supernet of CARS

Different architectures are sampled from the supernet N, and each network N_i can be represented by a set of full-precision parameters W_i and a set of binary connection parameters (i.e. {0, 1}) C_i. A 0-element connection means the network does not use this connection to transform the data flow, and a 1-element connection means the network uses this connection. From this point of view, each network N_i can be represented as a (W_i, C_i) pair.

The full-precision parameters W are shared by a set of networks. If the network architectures are fixed, the optimal W can be optimized by standard back-propagation so that it fits all the networks N_i and achieves higher recognition performance. After the parameters have converged, we can alternately optimize the binary connections C by the GA. These two stages form the main optimization of the proposed continuous evolution algorithm and are processed in an alternating manner. We introduce these two kinds of updates in the following.
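The (W_i, C_i) encoding can be made concrete with a short sketch. This is only an illustration under assumed names and shapes (the connection granularity and the "one weight matrix per connection" simplification are not from the paper):

```python
import numpy as np

# Shared supernet weights W: one weight tensor per candidate connection.
# Here each "connection" is reduced to a single weight matrix for brevity.
rng = np.random.default_rng(0)
num_connections = 14                      # e.g. 14 candidate edges in a DARTS-style cell
W = [rng.standard_normal((16, 16)) for _ in range(num_connections)]

def sample_architecture(p=0.5):
    """An individual C_i: a binary mask over the candidate connections."""
    return rng.random(num_connections) < p

def masked_weights(W, C):
    """W_i = W ⊙ C_i: keep only the weights of active connections."""
    return [w for w, keep in zip(W, C) if keep]

C1 = sample_architecture()
W1 = masked_weights(W, C1)                # the (W_i, C_i) pair defines network N_i
print(f"architecture uses {C1.sum()} of {num_connections} connections")
```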

Parameter Optimization

Parameter W is the collection of all the parameters in the network. The parameter W_i of the i-th individual is W_i = W ⊙ C_i, i ∈ {1, · · · , N}, where ⊙ is the mask operation that keeps the parameters of the complete graph only at the positions corresponding to 1-elements in the connection C_i. With input data X fed into the network, the prediction of this network is P_i = N_i(X, W_i), where N_i is the i-th architecture and W_i are the sampled weights. Given the ground truth Y, the prediction loss is denoted as L_i. The gradient of W_i is calculated as

$$ dW_i = \frac{\partial L_i}{\partial W_i} = \frac{\partial L_i}{\partial (W \odot C_i)}. \tag{1} $$

Parameters W should fit all the individuals, and thus the gradients of all the networks are accumulated to compute the gradient of W:

$$ dW = \frac{1}{N}\sum_{i=1}^{N} dW_i = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial L_i}{\partial (W \odot C_i)}. \tag{2} $$

Any layer is only optimized by the networks that use it during the forward pass. By collecting the gradients of the individuals in the population, the parameters W are updated through the SGD algorithm.

As we maintain a large set of architectures with shared weights in the supernet, we borrow the idea of stochastic gradient descent and use mini-batches of architectures to update the parameters. Accumulating the gradients of all the networks would take too much time for a single gradient step, and thus we use a mini-batch of architectures to update the shared weights. We use N_b different architectures, where N_b < N, with architecture indices {n_1, · · · , n_b}, to update the parameters. The efficient parameter update of Eqn. 2 is detailed in Eqn. 3:

$$ dW \approx \frac{1}{N_b}\sum_{j=1}^{N_b} \frac{\partial L_{n_j}}{\partial W_{n_j}}. \tag{3} $$

Hence, the gradient over a mini-batch of architectures is taken as an approximation of the averaged gradients of all N different individuals. The time cost of each update can be largely reduced, and an appropriate mini-batch size gives a balance between efficiency and accuracy.
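To make the update of Eqn. 3 concrete, the toy sketch below accumulates masked gradients over an architecture mini-batch for a simple linear model with an analytic gradient. It is an illustration only: the model, the shapes, and the function names are assumptions rather than the CARS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, Nb = 8, 16, 4                  # feature dim, population size, architecture mini-batch
W = rng.standard_normal(D)           # shared supernet weights (a toy linear model)
population = [rng.random(D) < 0.5 for _ in range(N)]   # binary connection masks C_i

def grad_step(W, masks, X, y, lr=0.1):
    """One update of the shared weights using Nb sampled architectures (cf. Eqn. 3)."""
    dW = np.zeros_like(W)
    for C in masks:
        Wi = W * C                   # W_i = W ⊙ C_i
        pred = X @ Wi
        # dL_i/dW is non-zero only where C_i = 1, so inactive connections are untouched
        dW += (X.T @ (pred - y)) * C / len(X)
    return W - lr * dW / len(masks)  # average over the architecture mini-batch

X, y = rng.standard_normal((32, D)), rng.standard_normal(32)
batch = [population[i] for i in rng.choice(N, size=Nb, replace=False)]
W = grad_step(W, batch, X, y)
```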

Architecture Optimization

For the architecture optimization procedure, we use an evolutionary algorithm with the non-dominated sorting strategy introduced in NSGA-III (Deb and Jain 2014). Denote {N_1, · · · , N_N} as N different networks and {F_1, · · · , F_M} as M different measurements we want to minimize. In general, these measurements, including the number of parameters, floating-point operations, latency, and performance, may conflict with each other, which increases the difficulty of discovering an optimal solution that minimizes all of them.

In practice, if architecture N_i dominates N_j, the performance of N_i is no worse than that of N_j on all the measurements and is better on at least one metric. Formally, the definition of domination is given below.

Algorithm 1 Continuous Evolution for Multi-objective Architecture Search

Input: supernet N, connections C = {C_1^(0), · · · , C_N^(0)}, offspring expand ratio t, the number of continuous evolution iterations I_evo, multi-objectives {F_1, · · · , F_M}, the number of parameter optimization epochs E_param between evolution iterations, criterion C.

1: Warm up the parameters of the supernet N.
2: for i = 1, · · · , I_evo do
3:   for step = 1, · · · , E_param do
4:     Get a training mini-batch of data X and targets Y, and randomly sample N_b different networks N_{n_1}, · · · , N_{n_{N_b}} from the maintained architectures.
5:     Calculate the averaged loss of the sampled architectures, L = (1/N_b) Σ_{i=1}^{N_b} C(N_i(X), Y).
6:     Calculate the derivative with respect to the parameters W according to Eqn. 3 and update the network parameters.
7:   end for
8:   Update {C_1^(i), · · · , C_N^(i)} within the Pareto front using pNSGA-III.
9: end for
Output: Architectures C = {C_1^(I_evo), · · · , C_N^(I_evo)} satisfying various objective requirements.

Definition 1. Consider two networks N_i and N_j, and a series of measurements {F_1, · · · , F_M} that we want to minimize. If

$$ F_k(N_i) \le F_k(N_j), \ \forall k \in \{1, \cdots, M\}, \qquad F_k(N_i) < F_k(N_j), \ \exists k \in \{1, \cdots, M\}, \tag{4} $$

then N_i is said to dominate N_j, i.e. N_i ⪯ N_j.

According to the above definition, if N_i dominates N_j, then N_j can be replaced by N_i during the evolution procedure, since N_i performs better on at least one metric and no worse on the others. By exploiting this approach, we can select a series of excellent neural architectures from the population generated in the current iteration. These networks are then utilized for updating the corresponding parameters in the supernet.
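Definition 1 translates directly into a small helper. The sketch below, with illustrative objective values loosely based on Table 1, is one possible implementation rather than the paper's code:

```python
def dominates(fi, fj):
    """N_i dominates N_j iff it is no worse on every objective (to minimize)
    and strictly better on at least one (Definition 1)."""
    return all(a <= b for a, b in zip(fi, fj)) and any(a < b for a, b in zip(fi, fj))

def non_dominated(objectives):
    """Indices of the individuals on the first Pareto front."""
    front = []
    for i, fi in enumerate(objectives):
        if not any(dominates(fj, fi) for j, fj in enumerate(objectives) if j != i):
            front.append(i)
    return front

# objectives to minimize: (number of parameters in M, validation error in %)
objs = [(2.4, 3.00), (3.3, 2.66), (3.6, 2.62), (3.3, 3.10)]
print(non_dominated(objs))   # [0, 1, 2]: the last model is dominated by the second
```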

Although the above non-dominated sorting with the NSGA-III method (Deb and Jain 2014) can help us select better models for updating the parameters, there exists a Small Model Trap phenomenon during the searching procedure. Specifically, since the parameters in the supernet still need optimization, the accuracy of each individual architecture in the current iteration may not reflect the performance it can eventually achieve, as discussed in NAS-Bench-101 (Ying et al. 2019). Thus, smaller models with fewer parameters but higher current test accuracy tend to dominate larger models that currently have lower fitness but the potential to achieve higher accuracy.

Figure 1: Comparison between different evolution strategies. The supernet is trained on the training set and evaluated on the validation set. The left panel shows five evolution iterations using NSGA-III, which suffers from the Small Model Trap and biases the distribution toward smaller models. The right panel shows the evolution iterations using pNSGA-III, which protects larger models.

Therefore, we propose to improve the conventional NSGA-III to protect these larger models, namely pNSGA-III. More specifically, the pNSGA-III algorithm takes the increasing speed of the performance into consideration. Take the validation performance and the number of parameters as an example. The NSGA-III method performs non-dominated sorting considering these two objectives and selects individuals according to the sorted Pareto stages. The proposed pNSGA-III, besides considering the number of parameters and the performance, also performs a non-dominated sorting that considers the performance increasing speed and the number of parameters. The two sets of Pareto stages are then merged, and individuals are gradually selected from the first stages to fill the next generation. In this way, large networks whose performance increases more slowly can be kept in the population.
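A simplified reading of this selection rule can be sketched as follows. The reference-point handling of the real NSGA-III is omitted and all names are illustrative, so this is only an approximation of the authors' procedure, not their implementation:

```python
def dominates(fi, fj):  # same helper as in the previous sketch
    return all(a <= b for a, b in zip(fi, fj)) and any(a < b for a, b in zip(fi, fj))

def pareto_fronts(objectives):
    """Split the population into successive non-dominated fronts."""
    remaining, fronts = list(range(len(objectives))), []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def pnsga3_select(params, accuracy, accuracy_speed, n_survivors):
    """Merge the fronts of (params, -accuracy) with those of (params, -speed),
    so large models that are still improving quickly are protected."""
    f_acc = pareto_fronts(list(zip(params, [-a for a in accuracy])))
    f_spd = pareto_fronts(list(zip(params, [-s for s in accuracy_speed])))
    survivors, stage = [], 0
    while len(survivors) < n_survivors and (stage < len(f_acc) or stage < len(f_spd)):
        merged = (f_acc[stage] if stage < len(f_acc) else []) + \
                 (f_spd[stage] if stage < len(f_spd) else [])
        for i in merged:                       # fill the next generation stage by stage
            if i not in survivors and len(survivors) < n_survivors:
                survivors.append(i)
        stage += 1
    return survivors
```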

Figure 1 illustrates the models selected in every iteration using the conventional NSGA-III and the modified pNSGA-III, respectively. It is clear that pNSGA-III provides models with a wide range of model sizes and performance comparable to that of NSGA-III.

Continuous Evolution for CARS

In summary, the proposed CARS searches for optimal neural architectures in two stages: 1) architecture optimization and 2) parameter optimization. In addition, parameter warmup is introduced for the first few epochs.

Parameter Warmup. Since the shared weights of our supernet are initialized randomly, if a set of architectures were also initialized randomly, the blocks used most frequently across the architectures would be trained more often than the others. Thus, following one-shot NAS methods (Bender et al. 2018; Guo et al. 2019), we use a uniform sampling strategy to initialize the parameters in the supernet. In this way, the supernet trains each possible architecture with the same probability, which means each path in the search space is sampled equally. For example, in the DARTS (Liu, Simonyan, and Yang 2019) pipeline, there are 8 different operations for each node, including convolutions, pooling, identity mapping, and no connection, so each operation is trained with a probability of 1/8. This warmup step initializes the shared weights of the supernet for a few epochs.

Architecture Optimization. After initializing the parameters of the supernet, we first randomly sample N different architectures, where N is the number of individuals maintained on the Pareto front by pNSGA-III and is defined as a hyper-parameter. During the architecture evolution step, we first generate t×N offspring, where t is the hyper-parameter controlling the number of offspring. We then use pNSGA-III to select N individuals from the (t+1)×N individuals.

Parameter Optimization. Given a set of architectures, we use the proposed mini-batch architecture update scheme for parameter optimization according to Eqn. 3.

Algorithm 1 summarizes the detailed procedure of the proposed continuous evolutionary algorithm for searching neural architectures.

Search Time Analysis

During the searching stage of CARS, the training data is split into a train set and a val set; the train set is used for parameter optimization, while the val set is used for the architecture optimization stage. Assume the average training time of one epoch on the train set is T_tr and the inference time on the val set is T_val. The first warmup stage takes E_warm epochs, requiring T_warm = E_warm × T_tr to initialize the parameters in the supernet N.

Assume the architectures evolve for I_evo iterations in total, and each iteration contains a parameter optimization stage and an architecture optimization stage. The parameter optimization stage trains the supernet for E_param epochs on the train set during one evolution iteration; thus, the time cost of parameter optimization in one evolution iteration is T_param = E_param × T_tr × N_b if we consider the architecture mini-batch size N_b. For the architecture optimization stage, all the individuals can be evaluated in parallel, so the searching time of this stage is T_arch = T_val. The total time cost of I_evo evolution iterations is therefore T_evo = I_evo × (T_param + T_arch), and the overall searching time of CARS is

$$ T_{total} = T_{warm} + T_{evo} = E_{warm} \times T_{tr} + I_{evo} \times (E_{param} \times T_{tr} \times N_b + T_{arch}). \tag{5} $$


Table 1: Comparison with state-of-the-art image classifiers on the CIFAR-10 dataset. The multi-objectives used for architecture optimization are performance and model size.

Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method
DenseNet-BC (Huang et al. 2017) | 3.46 | 25.6 | - | manual
PNAS (Liu et al. 2018a) | 3.41 | 3.2 | 225 | SMBO
ENAS + cutout (Pham et al. 2018) | 2.91 | 4.2 | 4 | RL
NASNet-A + cutout (Zoph et al. 2018) | 2.65 | 3.3 | 2000 | RL
AmoebaNet-A + cutout (Real et al. 2019) | 3.12 | 3.1 | 3150 | evolution
Hierarchical evolution (Liu et al. 2018b) | 3.75 | 15.7 | 300 | evolution
SNAS (mild) + cutout (Xie et al. 2019) | 2.98 | 2.9 | 1.5 | gradient
SNAS (moderate) + cutout (Xie et al. 2019) | 2.85 | 2.8 | 1.5 | gradient
SNAS (aggressive) + cutout (Xie et al. 2019) | 3.10 | 2.3 | 1.5 | gradient
DARTS (first) + cutout (Liu, Simonyan, and Yang 2019) | 3.00 | 3.3 | 1.5 | gradient
DARTS (second) + cutout (Liu, Simonyan, and Yang 2019) | 2.76 | 3.3 | 4 | gradient
RENA (Zhou et al. 2018) | 3.87 | 3.4 | - | RL
NSGANet (Lu et al. 2019) | 3.85 | 3.3 | 8 | evolution
LEMONADE (Elsken, Metzen, and Hutter 2019) | 3.05 | 4.7 | 80 | evolution
CARS-A | 3.00 | 2.4 | 0.4 | evolution
CARS-B | 2.87 | 2.7 | 0.4 | evolution
CARS-C | 2.84 | 2.8 | 0.4 | evolution
CARS-D | 2.95 | 2.9 | 0.4 | evolution
CARS-E | 2.86 | 3.0 | 0.4 | evolution
CARS-F | 2.79 | 3.1 | 0.4 | evolution
CARS-G | 2.74 | 3.2 | 0.4 | evolution
CARS-H | 2.66 | 3.3 | 0.4 | evolution
CARS-I | 2.62 | 3.6 | 0.4 | evolution

Experiments

In this section, we first introduce the datasets, backbone, and evolution details used in our experiments. We then search a set of architectures with the proposed CARS on the CIFAR-10 (Krizhevsky and Hinton 2009) dataset and show the necessity of the proposed pNSGA-III over the popularly used NSGA-III (Deb and Jain 2014) during continuous evolution. We consider three objectives: besides performance, we separately consider device-aware latency and device-agnostic model size. After that, we evaluate the searched architectures on the CIFAR-10 dataset and the ILSVRC2012 (Deng et al. 2009) large-scale recognition dataset to demonstrate the effectiveness of CARS.

Experimental Settings

Datasets. Our CARS experiments are performed on the CIFAR-10 (Krizhevsky and Hinton 2009) and ILSVRC2012 (Deng et al. 2009) large-scale image classification tasks. These two datasets are popular recognition benchmarks. We search architectures on the CIFAR-10 dataset, and the searched architectures are evaluated on both CIFAR-10 and ILSVRC2012.

Supernet Backbones. To illustrate the effectiveness of our method, we evaluate CARS on the state-of-the-art NAS system DARTS (Liu, Simonyan, and Yang 2019). DARTS is a differentiable NAS system that searches for cells and shares the searched cells from shallow layers to deeper layers. The search space contains 8 different blocks, including four types of convolution, two kinds of pooling, skip connection, and no connection; the candidate operations are listed explicitly below for reference. DARTS searches for two kinds of cell topology: the normal cell and the reduction cell. A normal cell is used for layers whose input and output features have the same spatial size, while a reduction cell is used for layers that downsample the input feature maps. After searching for these two kinds of cells, the network is constructed by stacking the searched cells.
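The candidate operations implied by the notation in the appendix tables can be written out as a short reference; the string names and the genotype encoding below are illustrative shorthand, not DARTS' actual identifiers.

```python
# Candidate operations in the DARTS-style search space used here: four convolutions,
# two poolings, identity (skip), and no connection (names follow the appendix tables).
OPS = ["Sep3", "Sep5", "Dil3", "Dil5", "Max3", "Avg3", "Skip", "None"]

# Each of the four intermediate nodes in a cell takes two inputs chosen from the
# previous cells C(k-2), C(k-1) and earlier nodes, so a cell can be encoded as a
# list of (operation, input) pairs per node, e.g. for one node:
example_node = [("Sep5", "C(k-2)"), ("Sep3", "C(k-1)")]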

Evolution Details. In the DARTS method, each intermediate node in a cell is connected to two previous nodes, so each node has its own search space. Crossover and mutation are conducted on the corresponding nodes with the same search space. Both the crossover ratio and the mutation ratio are set to 0.25, and we randomly generate new architectures with a probability of 0.5. For the crossover operation, each node has a probability of 0.5 of crossing over its connections; for the mutation operation, each node has a probability of 0.5 of being randomly reassigned. A rough sketch of this offspring generation is given below.
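The following sketch mirrors the offspring generation described above. The index-based genotype encoding and helper names are assumptions; only the stated ratios (0.25 crossover, 0.25 mutation, 0.5 random generation, and the 0.5 per-node ratios) are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_NODES, GENES_PER_NODE = 4, 2          # DARTS cell: 4 nodes, 2 (op, input) genes each

def random_genotype():
    # each gene drawn uniformly from that node's own search space (index-encoded here)
    return rng.integers(0, 8, size=(NUM_NODES, GENES_PER_NODE))

def crossover(a, b, node_ratio=0.5):
    child = a.copy()
    swap = rng.random(NUM_NODES) < node_ratio     # each node swapped with ratio 0.5
    child[swap] = b[swap]
    return child

def mutate(a, node_ratio=0.5):
    child = a.copy()
    hit = rng.random(NUM_NODES) < node_ratio      # each node reassigned with ratio 0.5
    child[hit] = rng.integers(0, 8, size=(hit.sum(), GENES_PER_NODE))
    return child

def make_offspring(population, p_cross=0.25, p_mut=0.25):
    r = rng.random()
    if r < p_cross:
        i, j = rng.choice(len(population), size=2, replace=False)
        return crossover(population[i], population[j])
    if r < p_cross + p_mut:
        return mutate(population[rng.integers(len(population))])
    return random_genotype()                      # remaining 0.5: a fresh random architecture

parents = [random_genotype() for _ in range(4)]
offspring = [make_offspring(parents) for _ in range(2 * len(parents))]   # t = 2 as an example
```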

Experiments on CIFAR-10

Our experiments on CIFAR-10 include a comparison of NSGA methods, searching, and evaluation. For the NSGA method comparison, we compare NSGA-III with the proposed pNSGA-III. After that, we run CARS on CIFAR-10 and search for a set of architectures that separately consider latency, model size, and performance. Finally, we examine the searched architectures under different computing resources.

Search on CIFAR-10. To search for a set of architectures using CARS, we split the CIFAR-10 training set into 25,000 images used during searching, named the train set, and 25,000 images used for evaluation, named the val set. The split strategy is the same as in DARTS and SNAS. We search for 500 epochs in total, and the parameter warmup stage lasts for the first 50 epochs. After that, we initialize the population, which maintains 128 different architectures, and gradually evolve it using the proposed pNSGA-III.


Table 2: An overall comparison on the ILSVRC2012 dataset. The CARS models are the architectures searched on the CIFAR-10 dataset and then evaluated on the ILSVRC2012 dataset.

Architecture | Top-1 Acc (%) | Top-5 Acc (%) | Params (M) | +× (M) | Search Cost (GPU days) | Search Method
ResNet50 (He et al. 2016) | 75.3 | 92.2 | 25.6 | 4100 | - | manual
InceptionV1 (Szegedy et al. 2015) | 69.8 | 90.1 | 6.6 | 1448 | - | manual
MobileNetV2 (1×) (Sandler et al. 2018) | 72.0 | 90.4 | 3.4 | 300 | - | manual
ShuffleNetV2 (2×) (Ma et al. 2018) | 74.9 | 90.1 | 7.4 | 591 | - | manual
PNAS (Liu et al. 2018a) | 74.2 | 91.9 | 5.1 | 588 | 224 | SMBO
SNAS (mild) (Xie et al. 2019) | 72.7 | 90.8 | 4.3 | 522 | 1.5 | gradient
DARTS (Liu, Simonyan, and Yang 2019) | 73.3 | 91.3 | 4.7 | 574 | 4 | gradient
PARSEC (Casale, Gordon, and Fusi 2019) | 74.0 | 91.6 | 5.6 | 548 | 1 | gradient
NASNet-A (Zoph et al. 2018) | 74.0 | 91.6 | 5.3 | 564 | 2000 | RL
NASNet-B (Zoph et al. 2018) | 72.8 | 91.3 | 5.3 | 488 | 2000 | RL
NASNet-C (Zoph et al. 2018) | 72.5 | 91.0 | 4.9 | 558 | 2000 | RL
AmoebaNet-A (Real et al. 2019) | 74.5 | 92.0 | 5.1 | 555 | 3150 | evolution
AmoebaNet-B (Real et al. 2019) | 74.0 | 91.5 | 5.3 | 555 | 3150 | evolution
AmoebaNet-C (Real et al. 2019) | 75.7 | 92.4 | 6.4 | 570 | 3150 | evolution
CARS-A | 72.8 | 90.8 | 3.7 | 430 | 0.4 | evolution
CARS-B | 73.1 | 91.3 | 4.0 | 463 | 0.4 | evolution
CARS-C | 73.3 | 91.4 | 4.2 | 480 | 0.4 | evolution
CARS-D | 73.3 | 91.5 | 4.3 | 496 | 0.4 | evolution
CARS-E | 73.7 | 91.6 | 4.4 | 510 | 0.4 | evolution
CARS-F | 74.1 | 91.8 | 4.5 | 530 | 0.4 | evolution
CARS-G | 74.2 | 91.9 | 4.7 | 537 | 0.4 | evolution
CARS-H | 74.7 | 92.2 | 4.8 | 559 | 0.4 | evolution
CARS-I | 75.2 | 92.5 | 5.1 | 591 | 0.4 | evolution

The parameter optimization stage is run on the train set and trains for 10 epochs during one evolution iteration. The architecture optimization stage uses the val set to update the architectures according to the proposed pNSGA-III. We separately search by considering performance together with either model size or latency.

NSGA-III vs. pNSGA-III. We run CARS with the different NSGA methods during the architecture optimization stage for comparison; the objectives are model size and validation performance. We visualize the growth trend for the different architecture optimization methods after the parameter warmup stage. As Figure 1 shows, CARS with NSGA-III encounters the Small Model Trap problem, since small models tend to eliminate large models during the architecture optimization stage. In contrast, architecture optimization with pNSGA-III protects larger models, which have the potential to increase their accuracy in later epochs but converge more slowly than small models at the beginning. To search for several models under various computing resources, it is essential to keep larger models in the population rather than dropping them during the architecture optimization stage.

Search Time Analysis. For the experiment considering model size and performance, the training time of one epoch on the train set, T_tr, takes around 1 minute, and the inference time on the val set, T_val, is around 5 seconds. The first initialization stage trains for 50 epochs, so its time cost T_warm is around an hour. The continuous evolution algorithm evolves the architectures for I_evo iterations. For the architecture optimization stage in one iteration, the parallel evaluation time is T_arch = T_val. For the parameter optimization stage, we set N_b to 1 in our experiments, since different batch sizes do not affect the overall growth trend of the individuals in the population (as discussed in the supplementary material), and we train the supernet N for 10 epochs in one evolution iteration. Thus the parameter optimization time T_param is about 10 minutes, the time cost of all evolution iterations T_evo is around 9 hours, and the total searching time T_total is around 0.4 GPU days. For the experiment considering latency and performance, the runtime latency of each model is evaluated during the architecture optimization step, so the searching time is around 0.5 GPU days.
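As a rough arithmetic check of Eqn. 5 under these numbers, with I_evo = 45 inferred from 500 total epochs, 50 of which are warmup, at 10 epochs per evolution iteration (the per-epoch times are the approximate values quoted above):

```python
# Rough cost check under the settings reported above (all values approximate).
T_tr, T_val = 60.0, 5.0                        # seconds: one epoch on train, one pass on val
E_warm, I_evo, E_param, N_b = 50, 45, 10, 1    # 45 = (500 - 50) / 10 evolution iterations

T_warm = E_warm * T_tr
T_evo = I_evo * (E_param * T_tr * N_b + T_val)
T_total = T_warm + T_evo
print(f"total search time ≈ {T_total / 3600:.1f} h ≈ {T_total / 86400:.2f} GPU days")
# ≈ 8.4 h ≈ 0.35 GPU days, consistent with the reported 0.4 GPU days
```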

Evaluate on CIFAR-10. After finishing the CARS search, our method keeps N = 128 different architectures, and we evaluate the architectures that have a similar number of parameters to previous works (Liu, Simonyan, and Yang 2019; Xie et al. 2019) for comparison. We retrain the searched architectures on the CIFAR-10 dataset with all the training data and evaluate them on the test set. All the training hyper-parameters are the same as in DARTS (Liu, Simonyan, and Yang 2019).

We compare the searched architectures with state-of-the-art methods that use a similar search space in Table 1. All the searched architectures can be found in the supplementary material. Our searched architectures have parameter counts varying from 2.4M to 3.6M on the CIFAR-10 dataset, and the performance of these architectures is on par with state-of-the-art NAS methods. In contrast, if we evolve with the NSGA-III method, we only obtain a set of architectures with approximately 2.4M parameters and no larger models, and those models perform relatively poorly.


Figure 2: CARS-H and DARTS. On the top are the normal and reduction blocks of CARS-H, and on the bottom are the normal and reduction blocks of DARTS (second order).

Compared to previous methods like DARTS and SNAS, our method is capable of searching for complete architectures over a large range of the search space. Our CARS-G model achieves accuracy comparable to DARTS (second order), with an approximately 2.75% error rate and a smaller model size. With the same 3.3M parameters as DARTS (second order), our CARS-H achieves a lower test error, 2.66% vs. 2.76%. For the small models, our searched CARS-A/C/D also achieve results comparable to SNAS. Besides, our large model CARS-I achieves a lower error rate of 2.62% with slightly more parameters. The overall trend from CARS-A to CARS-I is that the error rate gradually decreases as the model size increases. These models are all Pareto solutions. Compared to other multi-objective methods like RENA (Zhou et al. 2018), NSGANet (Lu et al. 2019), and LEMONADE (Elsken, Metzen, and Hutter 2019), our searched architectures also show superior performance.

Comparison on Searched Blocks. To gain an explicit understanding of the proposed method, we further visualize the normal and reduction blocks searched by CARS and DARTS in Figure 2. The CARS-H and DARTS (second order) models have a similar number of parameters (3.3M), but CARS-H has higher accuracy. As shown in Figure 2, there are more parameters in the CARS-H reduction block for preserving more useful information, while the normal block of CARS-H is smaller than that of DARTS (second order) to avoid unnecessary computation. This is mainly because the proposed EA-based method has a much larger search space, and the genetic operations can effectively jump out of local optima, which demonstrates its superiority.

Evaluate on ILSVRC2012

For the architectures searched on the CIFAR-10 dataset, we evaluate their transferability on the ILSVRC2012 dataset. We use 8 Nvidia Tesla V100 GPUs to train in parallel, and the batch size is set to 640. We train for 250 epochs in total; the learning rate is set to 0.5 with a linear decay scheduler, and we warm up (Goyal et al. 2017) the learning rate for the first five epochs due to the large batch size. Momentum is set to 0.9 and weight decay to 3e-5. Label smoothing is also included with a smoothing ratio of 0.1. A sketch of this recipe is given below.
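The following PyTorch-style sketch restates the recipe above. The model is a placeholder rather than a CARS architecture, the linear warmup-plus-decay schedule is one plausible reading of "linear decay with warmup", and the label_smoothing argument of nn.CrossEntropyLoss requires a recent PyTorch version.

```python
import torch
from torch import nn

# Placeholder model; in the paper's setup the batch size is 640 across 8 V100 GPUs.
model = nn.Linear(224 * 224 * 3, 1000)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)     # label smoothing ratio 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9, weight_decay=3e-5)

epochs, warmup_epochs = 250, 5
def lr_lambda(epoch):
    if epoch < warmup_epochs:                             # linear warmup for the first 5 epochs
        return (epoch + 1) / warmup_epochs
    return 1.0 - (epoch - warmup_epochs) / (epochs - warmup_epochs)   # linear decay to 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```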

The evaluation results show the transferability of our searched architectures. Our models cover an extensive range of parameters, from 3.7M to 5.1M, with 430 to 590 MFLOPs. For different deployment environments, we can easily select an architecture that satisfies the constraints.

Figure 3: CARS-Lat models are searched on the CIFAR-10 dataset considering mobile-device latency and validation performance. The Top-1 accuracy is the searched models’ performance on the ILSVRC2012 dataset.

Our CARS-I surpasses PNAS by 1% in Top-1 accuracy with the same number of parameters and comparable FLOPs. CARS-G shows superior results over DARTS, with 0.9% higher Top-1 accuracy at the same number of parameters. Also, CARS-D surpasses SNAS (mild) by 0.6% Top-1 accuracy with the same number of parameters. Compared with the different NASNet and AmoebaNet models, our method also provides various models that achieve higher accuracy with the same number of parameters. By using the proposed pNSGA-III, larger architectures like CARS-I are protected during the architecture optimization stages. Because of the efficient parameter sharing, we can search a set of superior transferable architectures in a single search.

For the experiments that search architectures by runtime latency and performance, we use the latency measured on a mobile phone as one objective and the performance as the other; these two objectives are used for generating the next population. We evaluate the searched architectures on ILSVRC2012 in Figure 3. The searched architectures cover an actual runtime latency range from 40ms to 90ms and surpass their counterparts.

Conclusion

Evolutionary-algorithm-based NAS methods are able to find high-performance models, but their searching time is extremely long, mainly because each candidate network is trained separately. To make this process efficient, we propose continuous evolution architecture search, namely CARS, which maximally utilizes the knowledge learned in the latest evolution iteration, such as the architectures and parameters. A supernet is constructed with considerable cells and blocks. Individuals are generated through the benchmark operations of an evolutionary algorithm. A non-dominated sorting strategy is utilized to select architectures with different model sizes and high accuracies for updating the supernet. Experiments on benchmark datasets show that the proposed CARS can provide a number of architectures on the Pareto front with high efficiency, e.g. the searching cost on the CIFAR-10 benchmark is only 0.4 GPU days. The searched models are superior to models produced by state-of-the-art methods in terms of both model size/latency and accuracy.

References

[Bender et al. 2018] Bender, G.; Kindermans, P.-J.; Zoph, B.; Vasudevan, V.; and Le, Q. V. 2018. Understanding and simplifying one-shot architecture search. ICML.
[Cai et al. 2018] Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018. Efficient architecture search by network transformation. AAAI.
[Casale, Gordon, and Fusi 2019] Casale, F. P.; Gordon, J.; and Fusi, N. 2019. Probabilistic neural architecture search. arXiv.
[Deb and Jain 2014] Deb, K., and Jain, H. 2014. An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: Solving problems with box constraints. TEC 18(4).
[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. CVPR.
[Dong et al. 2018] Dong, J.-D.; Cheng, A.-C.; Juan, D.-C.; Wei, W.; and Sun, M. 2018. DPP-Net: Device-aware progressive search for pareto-optimal neural architectures. ECCV.
[Elsken, Metzen, and Hutter 2018] Elsken, T.; Metzen, J. H.; and Hutter, F. 2018. Simple and efficient architecture search for convolutional neural networks. ICLR.
[Elsken, Metzen, and Hutter 2019] Elsken, T.; Metzen, J. H.; and Hutter, F. 2019. Efficient multi-objective neural architecture search via lamarckian evolution. ICLR.
[Girshick 2015] Girshick, R. 2015. Fast R-CNN. ICCV.
[Goyal et al. 2017] Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv.
[Guo et al. 2019] Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; and Sun, J. 2019. Single path one-shot neural architecture search with uniform sampling. arXiv.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. CVPR.
[He et al. 2017] He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. ICCV.
[Hsu et al. 2018] Hsu, C.-H.; Chang, S.-H.; Juan, D.-C.; Pan, J.-Y.; Chen, Y.-T.; Wei, W.; and Chang, S.-C. 2018. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv.
[Huang et al. 2017] Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. CVPR.
[Kim et al. 2017] Kim, Y.-H.; Reddy, B.; Yun, S.; and Seo, C. 2017. NEMO: Neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy. ICML Workshop.
[Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. NeurIPS.
[Liu et al. 2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. ECCV.
[Liu et al. 2018a] Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; and Murphy, K. 2018a. Progressive neural architecture search. ECCV.
[Liu et al. 2018b] Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; and Kavukcuoglu, K. 2018b. Hierarchical representations for efficient architecture search. ICLR.
[Liu, Simonyan, and Yang 2019] Liu, H.; Simonyan, K.; and Yang, Y. 2019. DARTS: Differentiable architecture search. ICLR.
[Lu et al. 2019] Lu, Z.; Whalen, I.; Boddeti, V.; Dhebar, Y.; Deb, K.; Goodman, E.; and Banzhaf, W. 2019. NSGA-Net: A multi-objective genetic algorithm for neural architecture search. GECCO.
[Ma et al. 2018] Ma, N.; Zhang, X.; Zheng, H.-T.; and Sun, J. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. ECCV.
[Miikkulainen et al. 2019] Miikkulainen, R.; Liang, J.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Shahrzad, H.; Navruzyan, A.; Duffy, N.; et al. 2019. Evolving deep neural networks. Artificial Intelligence in the Age of Neural Networks and Brain Computing.
[Pham et al. 2018] Pham, H.; Guan, M. Y.; Zoph, B.; Le, Q. V.; and Dean, J. 2018. Efficient neural architecture search via parameter sharing. ICML.
[Real et al. 2017] Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y. L.; Tan, J.; Le, Q. V.; and Kurakin, A. 2017. Large-scale evolution of image classifiers. ICML.
[Real et al. 2019] Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. V. 2019. Regularized evolution for image classifier architecture search. AAAI.
[Sandler et al. 2018] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. CVPR.
[Shu et al. 2019] Shu, H.; Wang, Y.; Jia, X.; Han, K.; Chen, H.; Xu, C.; Tian, Q.; and Xu, C. 2019. Co-evolutionary compression for unpaired image translation. ICCV.
[Simonyan and Zisserman 2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. ICLR.
[Szegedy et al. 2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. CVPR.
[Wang et al. 2018] Wang, Y.; Xu, C.; Qiu, J.; Xu, C.; and Tao, D. 2018. Towards evolutionary compression. SIGKDD.
[Wu et al. 2019] Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; and Keutzer, K. 2019. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. CVPR.
[Xie and Yuille 2017] Xie, L., and Yuille, A. 2017. Genetic CNN. ICCV.
[Xie et al. 2019] Xie, S.; Zheng, H.; Liu, C.; and Lin, L. 2019. SNAS: Stochastic neural architecture search. ICLR.
[Ying et al. 2019] Ying, C.; Klein, A.; Real, E.; Christiansen, E.; Murphy, K.; and Hutter, F. 2019. NAS-Bench-101: Towards reproducible neural architecture search. ICML.
[Zhong et al. 2018] Zhong, Z.; Yan, J.; Wu, W.; Shao, J.; and Liu, C.-L. 2018. Practical block-wise neural network architecture generation. CVPR.
[Zhou et al. 2018] Zhou, Y.; Ebrahimi, S.; Arık, S. Ö.; Yu, H.; Liu, H.; and Diamos, G. 2018. Resource-efficient neural architect. arXiv.
[Zoph and Le 2017] Zoph, B., and Le, Q. V. 2017. Neural architecture search with reinforcement learning. ICLR.
[Zoph et al. 2018] Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. CVPR.


Appendix

In this supplementary material, we list all the architectures searched by CARS on the CIFAR-10 dataset and explore the effect of the hyper-parameters used in CARS.

Network Architectures

In Table 3, we list all the searched architectures considering model size and validation performance. In Table 4, the searched architectures considering latency and validation performance are listed. Notations are the same as in DARTS (Liu, Simonyan, and Yang 2019).

Evolution Trend of NSGA-III

Figure 4: The overall evolution trend of all 128 individuals during 15 iterations of evolution. The multi-objective continuous evolution is conducted with NSGA-III.

In this section, we visualize the evolution trend when NSGA-III (Deb and Jain 2014) is used to guide architecture optimization. Figure 4 illustrates the first 15 generations, drawn from light to dark. The Small Model Trap phenomenon becomes more obvious during evolution, and larger models tend to be eliminated. Our proposed pNSGA-III protects large models and thus alleviates the Small Model Trap problem.

Impact of Batch Size in pNSGA-III

Figure 5: The evolution trend when using pNSGA-III with different batch sizes N_b ∈ {1, 2, 3, 4}.

For the parameter optimization stage in the continuous evolution, N_b different architectures are sampled from the maintained population to update the parameters. We evaluated different batch sizes N_b ∈ {1, 2, 3, 4} and found that they all grow with a similar trend. Thus we use N_b = 1 in our experiments, which is similar to ENAS (Pham et al. 2018).


Table 3: Architectures searched by CARS with the DARTS backbone on the CIFAR-10 dataset, considering model size and performance. Notations are the same as in DARTS: Sepk denotes separable convolution with kernel size k; Dilk denotes dilated separable convolution with kernel size k; Max3/Avg3 denotes max/average pooling of size 3; Skip denotes the identity connection; None denotes no connection. C(k-2) and C(k-1) denote the previous two cells connected to the current cell C(k). N1 to N4 denote the four intermediate nodes within cell C(k); each node row lists two operations (one per line).

Architecture | Cell Type | N1 | N2 | N3 | N4
CARS-A | Normal | Skip C(k-2) | Max3 C(k-2) | Max3 C(k-2) | Sep3 C(k-2)
       |        | Sep5 C(k-1) | Avg3 C(k-1) | Max3 C(k-1) | Dil5 N1
       | Reduce | Avg3 C(k-2) | Max3 C(k-2) | Max3 C(k-2) | Dil5 C(k-2)
       |        | Max3 C(k-1) | Skip C(k-1) | Dil5 C(k-1) | Skip N1
CARS-B | Normal | Sep5 C(k-2) | Sep3 C(k-2) | Dil3 C(k-2) | Avg3 C(k-2)
       |        | Dil3 C(k-1) | Avg3 N1 | Max3 C(k-1) | Skip C(k-1)
       | Reduce | Sep5 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2) | Dil3 N2
       |        | Skip C(k-1) | Max3 C(k-1) | Avg3 C(k-1) | Max3 C(k-2)
CARS-C | Normal | Sep5 C(k-2) | Skip C(k-2) | Skip C(k-2) | Sep5 C(k-2)
       |        | Skip C(k-1) | Skip C(k-1) | Max3 C(k-1) | Sep3 C(k-1)
       | Reduce | Max3 C(k-2) | Sep5 C(k-2) | Dil5 C(k-2) | Sep5 C(k-2)
       |        | Max3 C(k-1) | Sep5 C(k-1) | Max3 C(k-1) | Dil3 C(k-1)
CARS-D | Normal | Sep5 C(k-2) | Skip C(k-2) | Skip C(k-2) | Sep5 C(k-2)
       |        | Dil3 C(k-1) | Avg3 C(k-1) | Max3 C(k-1) | Sep3 C(k-1)
       | Reduce | Max3 C(k-2) | Max3 C(k-2) | Dil5 C(k-2) | Sep5 C(k-1)
       |        | Max3 C(k-1) | Sep3 C(k-1) | Max3 C(k-1) | Dil3 C(k-1)
CARS-E | Normal | Sep3 C(k-2) | Skip C(k-2) | Avg3 C(k-1) | Skip N2
       |        | Sep3 C(k-1) | Sep3 N1 | Sep3 N1 | Skip N3
       | Reduce | Skip C(k-2) | Avg3 C(k-2) | Sep3 N1 | Avg3 C(k-2)
       |        | Dil3 C(k-1) | Skip N1 | Max3 C(k-2) | Sep3 N3
CARS-F | Normal | Skip C(k-2) | Sep5 C(k-2) | Sep5 N2 | Skip C(k-2)
       |        | Sep5 C(k-1) | Skip N1 | Max3 C(k-2) | Sep3 C(k-1)
       | Reduce | Avg3 C(k-2) | Dil3 C(k-2) | Sep5 C(k-1) | Max3 C(k-2)
       |        | Sep5 C(k-1) | Dil5 C(k-1) | Skip N1 | Max3 C(k-1)
CARS-G | Normal | Max3 C(k-2) | Sep3 C(k-2) | Dil5 C(k-2) | Avg3 C(k-2)
       |        | Dil5 C(k-1) | Skip C(k-1) | Sep4 C(k-1) | Sep3 C(k-1)
       | Reduce | Max3 C(k-2) | Sep3 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2)
       |        | Sep3 C(k-1) | Sep5 C(k-1) | Skip C(k-1) | Dil3 C(k-1)
CARS-H | Normal | Sep5 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2) | Sep5 N1
       |        | Sep3 C(k-1) | Dil5 N1 | Skip C(k-1) | Max3 C(k-2)
       | Reduce | Sep5 C(k-2) | Sep3 C(k-2) | Dil3 N1 | Sep5 C(k-2)
       |        | Max3 C(k-1) | Skip C(k-1) | Max3 C(k-2) | Avg3 N2
CARS-I | Normal | Sep3 C(k-2) | Skip C(k-2) | Skip N1 | Sep3 C(k-2)
       |        | Sep3 C(k-1) | Sep5 C(k-1) | Sep3 N2 | Dil5 N3
       | Reduce | Dil3 C(k-2) | Max3 C(k-2) | Skip C(k-1) | Dil3 C(k-1)
       |        | Skip C(k-1) | Max3 N1 | Sep5 N2 | Max3 N3


Table 4: Architectures searched by CARS with the DARTS backbone on the CIFAR-10 dataset, considering latency and performance. Notations are the same as in Table 3.

Architecture | Top-1 (%) | Latency (ms) | Cell Type | N1 | N2 | N3 | N4
CARS-Lat-A | 62.6 | 41.9 | Normal | Skip C(k-2) | Max3 C(k-1) | Avg3 C(k-2) | Max3 C(k-1)
           |      |      |        | Skip N1 | Skip N2 | Skip N1 | Skip N3
           |      |      | Reduce | Skip C(k-2) | Sep5 C(k-1) | Dil5 C(k-2) | Max3 N1
           |      |      |        | Skip C(k-2) | Max3 C(k-1) | Skip C(k-1) | Avg3 N3
CARS-Lat-B | 67.8 | 44.9 | Normal | Skip C(k-2) | Skip C(k-1) | Skip C(k-1) | Dil3 N1
           |      |      |        | Skip N1 | Skip N2 | Max3 C(k-2) | Max3 N1
           |      |      | Reduce | Max3 C(k-2) | Max3 C(k-1) | Skip C(k-1) | Max3 C(k-2)
           |      |      |        | Sep3 C(k-2) | Max3 C(k-1) | Dil5 C(k-2) | Avg3 N1
CARS-Lat-C | 69.5 | 45.6 | Normal | Skip C(k-2) | Avg3 C(k-1) | Skip C(k-2) | Skip C(k-1)
           |      |      |        | Max3 C(k-1) | Skip N2 | Dil3 N1 | Skip N3
           |      |      | Reduce | Skip C(k-2) | Sep5 C(k-1) | Avg3 C(k-1) | Sep5 N1
           |      |      |        | Max3 C(k-2) | Max3 C(k-1) | Dil5 N1 | Skip N3
CARS-Lat-D | 71.9 | 57.6 | Normal | Sep3 C(k-2) | Skip C(k-1) | Skip C(k-2) | Skip C(k-1)
           |      |      |        | Skip C(k-1) | Avg3 N2 | Dil3 N1 | Skip N3
           |      |      | Reduce | Dil5 C(k-2) | Sep5 C(k-1) | Avg3 C(k-1) | Sep5 N1
           |      |      |        | Max3 C(k-2) | Max3 C(k-1) | Dil5 N1 | Skip N3
CARS-Lat-E | 72.0 | 60.8 | Normal | Dil5 C(k-2) | Skip C(k-1) | Skip C(k-1) | Avg3 N1
           |      |      |        | Skip C(k-1) | Avg3 N1 | Skip C(k-2) | Max3 C(k-1)
           |      |      | Reduce | Skip C(k-2) | Dil3 C(k-1) | Sep3 C(k-2) | Sep3 N1
           |      |      |        | Dil3 C(k-2) | Avg3 N2 | Sep3 C(k-1) | Sep5 N1
CARS-Lat-F | 72.2 | 64.5 | Normal | Sep5 C(k-2) | Skip C(k-1) | Skip C(k-2) | Avg3 C(k-1)
           |      |      |        | Dil3 C(k-1) | Max3 C(k-2) | Skip C(k-2) | Skip C(k-1)
           |      |      | Reduce | Sep5 C(k-2) | Max3 C(k-1) | Sep5 C(k-1) | Skip N1
           |      |      |        | Sep5 C(k-2) | Sep5 C(k-1) | Sep5 N2 | Dil3 N3
CARS-Lat-G | 74.0 | 89.3 | Normal | Sep5 C(k-2) | Skip C(k-1) | Sep3 C(k-2) | Sep5 N1
           |      |      |        | Dil3 C(k-1) | Max3 C(k-2) | Skip C(k-2) | Skip C(k-1)
           |      |      | Reduce | Sep5 C(k-2) | Max3 C(k-1) | Sep3 C(k-2) | Avg3 N1
           |      |      |        | Sep5 C(k-2) | Sep5 C(k-1) | Sep5 N2 | Dil3 N3