
ABSTRACT

GUAN, HUI. Reuse-Centric Programming System Support of Machine Learning. (Under the direction of Xipeng Shen and Hamid Krim.)

Modern machine learning, especially deep learning, has shown dramatic progress. Its effective

adoption, however, faces a fundamental question: how to create models that efficiently deliver

reliable predictions to meet the requirements of diverse applications running on various systems.

This thesis introduces reuse-centric optimization, a novel direction for addressing the fundamental

question. Reuse-centric optimization centers around harnessing reuse opportunities for enhancing

computing efficiency. It generalizes the reuse principle in traditional compilers to a higher level and

a larger scope through innovations in both programming systems and machine learning algorithms

and their synergy. Its exploitation of computation reuse spans across the boundaries of machine

learning algorithms, implementations, and infrastructures; the types of reuse it covers range from

pre-trained Neural Network building blocks to preprocessed results for model training and even

memory bits; the scopes of reuse it leverages go from training pipelines of deep learning to variants

of Neural Networks in ensembles; the benefits it generates include (1) up to 9X faster k-means

configurations, (2) up to 186X speedup for finding a good smaller Convolution Neural Network

(CNN), (3) up to 2X faster ensemble training with data sharing, (4) the elimination of all space cost

in protecting parameters of CNNs.

© Copyright 2020 by Hui Guan

All Rights Reserved

Reuse-Centric Programming System Support of Machine Learning

by Hui Guan

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Electrical Engineering

Raleigh, North Carolina

2020

APPROVED BY:

Huiyang Zhou Andrew Rindos

Xipeng Shen, Co-chair of Advisory Committee

Hamid Krim, Co-chair of Advisory Committee

ACKNOWLEDGEMENTS

I would like to thank my Ph.D. advisor, Dr. Xipeng Shen. I am very fortunate to have worked on my thesis

under his guidance for the last four years of my Ph.D. I have learned a lot from him, including how

to find and solve a research problem, how to present the work to audiences, and how to expand the

discovery to a broader scope. I am very grateful for his visionary advice, constructive discussion,

insightful feedback, and generous support. His passion for conducting impactful research and solving

challenging problems has greatly motivated me during my Ph.D. journey and will continue to have

a profound influence on my academic career. I could not have imagined having a better advisor for

my Ph.D. study.

I would also like to thank my co-advisor, Dr. Hamid Krim. I have received his guidance since I joined his

research lab in 2014. During the past six years, Hamid has offered me invaluable advice and countless

comments that have guided me through challenging problems. He gave me the freedom to explore various

projects. He believed in me and gave me endless support. He convinced me to take numerous math

courses for a solid math background. He mentored me with his professional expertise, brilliant

thinking, and vast patience. I feel so fortunate to have Hamid as my co-advisor.

Besides my advisors, I would like to thank the rest of my dissertation committee members

(Dr. Huiyang Zhou, Dr. Min Chi, and Dr. Andrew Rindos) for their great support and valuable

suggestions. I am thankful to Dr. Huiyang Zhou, an expert in Computer Architecture and Systems,

for his tremendous contributions to many of my research projects. I am also grateful to Dr. Min Chi

for her insightful comments on writing good research proposals and identifying promising research

topics. I also appreciate the many internship opportunities provided by Dr. Andrew Rindos at IBM

and his generous support for funding my research projects.

I would like to thank my friends and lab mates for their continued support. This dissertation

would not be possible without the intellectual contribution of Yufei Ding, Lin Ning, Randall Pittman,

Laxmikant Mokadam, and Zhen Lin. Moreover, I am thankful to Guoyang Chen, Yue Zhao, Weijie

Zhou, Zifan Nan, Guoqiang Zhang, Yuanchao Xu, Chencheng Ye, Weiqi Sun, Shuai Yang, Dong Xu,

Jou-An Chen, and Lei Zhang for making my experience in the PICTure Research Group and graduate

school exciting and fun. It was a pleasure working together with them all. I would also like to thank

Xing Pan for his long-lasting support, my roommate Jie Wang for her company, and many other

friends for making my Ph.D. life a memorable and enjoyable experience.

It has been an honor to work with many great researchers outside of NCSU. I would like to thank

Dr. Seung-Hwan Lim and Dr. Robert Patton from Oak Ridge National Labs. I have been collaborating

with them since 2017 and was given the great opportunity to work on large-scale systems, TITAN

and SummitDev supercomputers. The collaborations have resulted in numerous exciting ideas and fruitful

publications. I also appreciate the great help from my recommendation letter writers, Dr. Chen Ding

and Dr. Michael Carbin, for my faculty job search. I am grateful for the enjoyable discussions with

them, their valuable suggestions on both research and job search, and their huge efforts in realizing

my academic dreams.


Finally, I would like to give my sincere thanks to my family: Mom, Dad, and my brother. Their

encouragement traveled thousands of miles from home to the US to give me courage. I could

not accomplish what I did without their love and support.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1 INTRODUCTION
  1.1 Motivation
  1.2 Contribution and Thesis Outline

Chapter 2 Background
  2.1 K-Means Clustering
  2.2 Convolutional Neural Networks
    2.2.1 CNN Architectures
    2.2.2 CNN Pruning
  2.3 DNN Training
    2.3.1 DNN Training Pipeline
    2.3.2 Data-Parallel DNN Training

Chapter 3 Reuse-Centric K-Means Configuration
  3.1 Introduction
  3.2 Proposed Techniques
    3.2.1 Overview of K-Means Configuration
    3.2.2 Overview of the Acceleration Techniques
    3.2.3 Reuse-Based Filtering
    3.2.4 Center Reuse
    3.2.5 Two-Phase Design
  3.3 Evaluations
    3.3.1 Methodology
    3.3.2 Speedups on Heuristic Search
    3.3.3 Speedups on the Attainment of Error Surfaces
    3.3.4 Quality Influence of Center Reuse
    3.3.5 Sensitivity Analysis and Insights
  3.4 Related Work
  3.5 Conclusions

Chapter 4 Composability-Based Fast CNN Pruning
  4.1 Introduction
  4.2 Composability-Based CNN Pruning: Idea and Challenges
  4.3 Overview of Wootz Framework
  4.4 Hierarchical Tuning Block Identifier
    4.4.1 Optimal Tuning Block Definition Problem
    4.4.2 Hierarchical Compression-Based Algorithm
  4.5 Composability-Based Pruning and Wootz Compiler
    4.5.1 Mechanisms
    4.5.2 Wootz Compiler and Scripts
  4.6 Evaluations
    4.6.1 Experiment Settings
    4.6.2 Validation of the Composability Hypothesis
    4.6.3 Results of Wootz
  4.7 Related Work
  4.8 Conclusions

Chapter 5 Efficient Ensemble Training with Data Sharing
  5.1 Introduction
  5.2 Overview of FLEET
  5.3 Resource Allocation Algorithms
    5.3.1 Problem Definition
    5.3.2 Complexity Analysis
    5.3.3 Greedy Allocation Algorithm
  5.4 Implementation
  5.5 Evaluations
    5.5.1 Experiment Settings
    5.5.2 End-to-End Speedups
    5.5.3 Overhead
  5.6 Related Work
  5.7 Conclusions

Chapter 6 In-Place Zero-Space Memory Protection for CNN
  6.1 Introduction
  6.2 Premises and Scopes
  6.3 In-Place Zero-Space ECC
    6.3.1 WOT
    6.3.2 Full Design of In-Place Zero-Space ECC
  6.4 Evaluations
    6.4.1 Experiment Settings
    6.4.2 WOT results
    6.4.3 Fault injection results
  6.5 Related Work
  6.6 Future Directions
  6.7 Conclusions

Chapter 7 Conclusions and Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Data statistics.
Table 3.2 Speedups on stochastic hill climbing.
Table 3.3 Speedups on the attainment of error surfaces.
Table 3.4 Speedup in parallel settings.
Table 3.5 Speedups of center reuse across k with different k_step.
Table 3.6 Speedups of center reuse across k with different d_step.
Table 3.7 Speedups and distance savings for the first iteration of k-means with reuse-based filtering across k.
Table 4.1 Dataset statistics.
Table 4.2 Median accuracies of default networks (init, final) and block-trained networks (init+, final+).
Table 4.3 Speedups and configuration savings for ResNet-50 by composability-based pruning (when 1, 4, or 16 machines are used for both baseline and composability-based methods, as the "#nodes" column indicates). Notations are at the table bottom.
Table 4.4 Speedups and configuration savings for Inception-V3 by composability-based pruning.
Table 4.5 Speedups by composability-based pruning with different subspace sizes.
Table 4.6 Extra speedups brought by improved tuning block definitions.
Table 5.1 The job of different processes.
Table 5.2 Notations.
Table 5.3 Mean and standard deviation of the running length of DNNs in seconds (80 GPUs, 100 DNNs).
Table 5.4 Scheduling and checkpointing overhead.
Table 6.1 Accuracy and weight distribution of 8-bit quantized CNN models on ImageNet. The percentage rows use absolute values.
Table 6.2 Accuracy drop of VGG16, ResNet-16, and SqueezeNet under different memory fault rates.

LIST OF FIGURES

Figure 1.1 Overview of the reuse scopes, types, and benefits of reuse-centric optimization.
Figure 1.2 Reuse is a principle for code optimizations in compilers.
Figure 2.1 CNN and CNN Pruning. Conv1 and Conv2 are the first two consecutive convolutional layers in the CNN.
Figure 2.2 A DNN training pipeline [Pit18].
Figure 2.3 An illustration of data-parallel DNN training [Ser18].
Figure 3.1 Overview of k-means–based applications and where our three acceleration techniques are applied.
Figure 3.2 Illustration of how upper bound and lower bound can help avoid distance computation to some center c. Circles and double circles represent centers in current and previous iteration respectively, and b(x) is the so-far nearest center of point x.
Figure 3.3 Illustration of how the configuration with k = k1 can help save distance computation in the first iteration of another configuration with k = k2. b'(x) is the closest center of point x when k = k1; c1, c2, c3 are the initial centers and b(x) is the so-far nearest center of point x when k = k2.
Figure 3.4 Illustration of how the configuration with F1 can help save distance computation in the first iteration of another configuration with F2, where b'(x) is computed from the closest center of point x in feature space F1, while c1, c2, ..., ck2 and b(x) are the initial centers and the so-far nearest center of point x in feature space F2 respectively.
Figure 3.5 Illustration of center reuse across k. The two graphs represent the k-means on two configurations with k equaling three (left) and two (right) respectively. The double circles in the left graph show the three centers attained in the exploration of that configuration. These centers are grouped to get two group centers c'1, c'2, which are then used as the initial centers (marked as circles in the right picture) for exploring the latter configuration.
Figure 3.6 An example of an error curve and the illustration of curve segmentation.
Figure 3.7 Approximated classification error curves from the first phase for two datasets.
Figure 3.8 Classification error curves for two datasets.
Figure 3.9 Center reuse across k with inc/dec k on dataset connect. acc. is short for accumulated.
Figure 3.10 Center reuse across feature sets with inc/dec d.
Figure 3.11 Reuse-based filtering performance on different k and different numbers of landmarks (#lms) on adult (dim=11).
Figure 3.12 Reuse-based filtering performance on different k and different numbers of landmarks (#lms) on adult (dim=59).
Figure 4.1 Complementary relation with prior work for CNN pruning. Prior works have designed heuristic criteria to quickly determine the importance of a filter [Li16; Hu16; Mol16; Luo17a], or to combine with reinforcement learning for selecting the set of promising configurations [He18; Ash17]. This work tries to accelerate the explorations of the remaining promising configurations through computation reuse via composability (block pre-training) supported with a compiler-based framework.
Figure 4.2 Overview of Wootz framework.
Figure 4.3 Formats for the specifications of promising subspaces (a) and pruning objectives (b).
Figure 4.4 Sequitur applies to a concatenated sequence of layers of four networks pruned at rates: 0%, 30%, 50%.
Figure 4.5 Illustration of composability-based network pruning. Ellipses are pruned tuning blocks; rectangles are original tuning blocks; diamonds refer to the activation map reconstruction error. Different colors of pruned tuning blocks correspond to different pruning options.
Figure 4.6 Accuracy curves of the default and block-trained networks on dataset CUB200. Each network has 70% least important filters pruned at all convolution modules.
Figure 4.7 Accuracies of pruned networks of ResNet-50 after training. The model size of full ResNet-50 is 25.6 million.
Figure 5.1 An illustration of the ensemble training pipeline in FLEET. P1 and P2 are preprocessors and T1-T8 are trainers. There are four training groups, (T1), (T2, T3), (T4), (T5, T6, T7, T8), which train the four DNNs D1-D4 respectively. Edges indicate transfers of preprocessed images.
Figure 5.2 Illustration of the dataflow implementation. Two DNNs, D1 and D2, are trained using four GPUs (Ranks 0-3) by two training groups, (T1) and (T2, T3, T4). T1 and T2 are training group masters. Sizes of QP, QT and QD are 2048 images, 2048 images and 10 batches respectively.
Figure 5.3 Correlations between model size of a DNN and the training rate and the number of epochs until convergence.
Figure 5.4 The profiled training rates (images/sec) of 100 DNNs in an ensemble with ImageNet.
Figure 5.5 The averaged speedups over the baseline in terms of the end-to-end time for training a 100-DNN ensemble. The error bars show the variations.
Figure 5.6 Waiting time per GPU.
Figure 6.1 Large weight (beyond [−64, 63]) distributions in 8-byte (64-bit data) blocks for SqueezeNet on ImageNet. For instance, the first bar in (a) shows that of all the 8-byte data blocks storing weights, around 380 have a large weight at the first byte.
Figure 6.2 Hardware design for in-place zero-space ECC protection.
Figure 6.3 Changes of the total number of large values (beyond [−64, 63]) in the first 7 positions of 8-byte (64-bit data) blocks before the throttling step during the WOT training process.
Figure 6.4 Accuracy curves before and after the throttling step during the WOT training process.
Figure 7.1 Summary of the thesis.

CHAPTER 1

INTRODUCTION

The performance of machine learning (ML) has been improving rapidly in recent years. Convolutional

Neural Networks (CNNs) can surpass human performance on the ImageNet image recognition

competition [He15]; Transformer-based models can achieve state-of-the-art performance on many

natural language processing tasks [Dev18]; AlphaGo Zero can master the game of Go without human

knowledge [Sil17]. A wave of excitement about ML and deep learning has proliferated from academia

to industry, transforming prototypes in research labs to valid solutions to real-world problems.

The effective adoption of ML, however, faces a fundamental question: how to create models

that efficiently deliver reliable predictions to meet the requirements of diverse applications running

on various systems. On the one hand, ML techniques are increasingly adopted in a diverse range

of applications such as self-driving cars [Boj16], personalized recommendation [Par18], customer

assistance [Rah17], manufacturing [Zho17], healthcare [Mio18], and drug discovery [Che18a]. It is not

yet well understood how to quickly and automatically discover models that best fit a specific task. On the

other hand, given the various hardware systems to support (e.g., supercomputers, clusters, personal

computers, phones, robotics, smartwatches, and embedded devices), it remains an open question

how to design and implement ML software systems that meet the various hardware constraints.

This thesis introduces reuse-centric optimization, a novel direction for addressing the funda-

mental question. It pioneers the systematic explorations of reuse in High-Performance Machine

Learning. Reuse-centric optimization centers around harnessing reuse opportunities for enhancing

computing efficiency. Computation/data/memory reuse has been one of the primary schemes in

programming systems for low-level code optimizations. Reuse-centric optimization generalizes the

principle to a higher level and a larger scope through innovations in both programming systems and

ML algorithms and their synergy. Its exploitation of computation reuse spans across the boundaries


[Figure 1.1 content: types of reuse (pre-trained neural network building blocks, preprocessed results for model training, memory bits, similar computation results); scopes of reuse (algorithms, implementations, infrastructures, spanning programming systems and machine learning); benefits of reuse (5-9X faster k-means configuration [Guan et al. ICDE'18], Chapter 3; up to 186X faster CNN pruning [Guan et al. PLDI'19], Chapter 4; up to 2X faster ensemble training with data sharing [Guan et al. MLSys'20], Chapter 5; zero space cost memory fault protection for CNN [Guan et al. NeurIPS'19], Chapter 6; more, Chapter 7).]

Figure 1.1: Overview of the reuse scopes, types, and benefits of reuse-centric optimization.

Original block:
1. a ← b + c
2. b ← a - d
3. c ← b + c
4. d ← a - d

Rewritten block:
1. a ← b + c
2. b ← a - d
3. c ← b + c
4. d ← b

Figure 1.2: Reuse is a principle for code optimizations in compilers.

of ML algorithms, implementations, and infrastructures; the types of reuse it covers range from pre-

trained Neural Network building blocks to preprocessed results and even memory bits; the scopes

of reuse it leverages go from training pipelines of deep learning to variants of Neural Networks in

ensembles; the benefits it generates extend from orders of magnitude faster search for a good smaller

Convolution Neural Network (CNN) to the elimination of all space cost in protecting parameters of

CNNs. An overview of the reuse scopes, types, and benefits of reuse-centric optimization is shown

in Figure 1.1.

1.1 Motivation

Reuse has been an effective approach in programming systems for improving the performance and

reliability of programs. Consider the four-statement basic block shown in Figure 1.2. The occurrence

of a – d in the fourth operation is redundant because a and d are not redefined. The compiler can

rewrite this block so that it computes a – d only once. The second evaluation of a – d is replaced

with a copy from b. Although this kind of optimization can be applied generally to any program,

such instruction-level, semantic-preserving program transformations have produced only limited

speedups.

Reuse-centric optimization aims to generalize the reuse principle to a higher level for larger

benefits. It expands the scopes of reuse from low-level instructions to high-level algorithms and

workflows; it exploits more types of reuse that come with the larger scope; it relaxes the semantic

constraints strictly followed in traditional compilers; it leverages domain knowledge and demands


new types of analysis and transformations; the benefits it generates are usually 10-100X speedups

instead of 5-20% as often seen in traditional optimizing compilers.

A few prior studies [Din15a; Din15b] have shown the dramatic performance improvements of

reuse-based algorithm-level optimization in distance-related data analysis algorithms. This thesis

aims to provide a more systematic exploration of reuse-centric optimization. As a general paradigm,

reuse-centric optimization can be applied to many domains. This thesis has been mainly focused

on its development in the ML domain due to the remarkable popularity and enormous impact of

ML techniques. Specifically, a set of simple yet effective reuse-centric optimization techniques are

developed for efficient model discovery and reliable model inference.

Motivation for Efficient Model Discovery. The broad range of ML applications leads to the di-

verse needs of ML models and deployment environments. Some applications require large-scale

models running on supercomputers to meet accuracy requirements while others need small models

deployed on resource-limited devices for energy efficiency or latency constraints. However, cre-

ating models for a specific task, called model discovery, is a time-consuming process due to the

enormous configuration space and the slowness of learning algorithms. One example is k-means

configuration, which is to find a configuration of k-means (e.g., #clusters, feature sets) that maximizes

some objective. It is a time-consuming process due to the iterative nature of k-means. A more

notorious example is CNN pruning, which tries to find a smaller CNN architecture by removing some

components from the network. It is an important method to adapt a large CNN model attained on

general datasets to a more specialized task or to fit a device with stricter space or power constraints.

Finding the best-pruned network is time-consuming due to the combinatorial problem space and

the day-long model training time. Efficient model discovery has been a major challenge to customize

ML techniques to fit the requirements of various applications and deployment contexts.

Motivation for Reliable Model Inference. ML algorithms are being actively explored for safety-

critical applications such as autonomous vehicles and aerospace, where it is essential to ensure

the reliability of inference results. The prediction results from a deployed ML model could cause

catastrophic consequences such as car accidents and life-threatening operations. One of the key

threats to reliable inference is memory faults. Traditional methods such as error correction codes

(ECC) and Triple Modular Redundancy (TMR) usually incur substantial memory overhead and

energy costs. The costs worsen the limit on model size and capacity and increase the cost of the

overall solution. Other threats to reliable inference include new fault models resulting from novel

hardware design, incorrect training sets, software implementation errors and even bad model

choices and algorithm design.

1.2 Contribution and Thesis Outline

This thesis reports our systematic explorations of reuse-centric optimization for improving the

efficiency and reliability of ML through innovations in algorithms and programming systems. The

contributions of this thesis are:


• A set of reuse-centric approaches to accelerate k-means configuration by promoting multi-

level computation reuse across the explorations of different configurations. The approaches

produce 5-9X speedups.

• A compiler-based framework called Wootz that, for the first time, enables composability-based

CNN pruning by generalizing Teacher-Student Network training for pre-training common

convolutional layers. Wootz produces up to 186X speedups for CNN pruning.

• A flexible ensemble Deep Neural Network (DNN) training framework called FLEET that effi-

ciently trains a heterogeneous set of DNNs with data sharing. FLEET produces up to 1.97X

speedups over the state-of-the-art framework that was designed for homogeneous DNN

ensemble training.

• In-place zero-space cost ECC assisted with a new training scheme called weight distribution-

oriented training that can provide the first known zero space cost memory protection for

CNNs.

All these techniques center around exploiting reuse opportunities, with the first three contributing

to efficient model discovery and the last one focusing on reliable model inference. As shown

in Figure 1.1, reuse-centric k-means configuration shows the power of reuse-centric optimization

for efficient model discovery at the implementation level; Wootz at the algorithm and compiler

level; FLEET at the infrastructure level. In-place zero-space cost ECC further demonstrates the

power of reuse-centric optimization for reliable inference by memory reuse — that is, reusing the

highest-order bit in CNN parameters.

The outline of the rest of the thesis is as follows:

Chapter 2 provides the background of several ML algorithms (k-means, DNNs) and model

training pipelines that are important for following the rest of the thesis.

Chapter 3 describes reuse-centric k-means configuration, which consists of a set of reuse-centric

approaches for faster k-means configuration. We propose reuse-based filtering, center reuse, and a

two-phase design to capitalize on the reuse opportunities at three levels: validation, k, and feature

sets. Meanwhile, we provide some important insights on how to effectively apply the acceleration

techniques to tap into their full potential.

Chapter 4 describes a compiler-based framework named Wootz for faster CNN pruning. The

framework includes a compression-based algorithm to efficiently identify reuse opportunities when

training a collection of pruned CNN models, a new training scheme called composability-based

CNN pruning for pre-training reusable neural network building blocks, and a compiler and scripts

to automate the optimization.

Chapter 5 describes a flexible ensemble training framework called FLEET that efficiently trains

a heterogeneous set of DNNs. We theoretically prove that optimal resource allocation is NP-hard

and propose a greedy algorithm to efficiently allocate resources for training each DNN with data

sharing. We integrate data-parallel DNN training into ensemble training to mitigate the differences


in training rates and introduce checkpointing into this context to address the issue of different

convergence speeds.

Chapter 6 presents zero-space cost memory fault protection for CNN. The work, for the first

time, enables CNN memory fault protection with zero space cost through bit-level memory reuse.

The design capitalizes on the bit-level reuse opportunities by embedding the error correction codes

into non-informative bits. It further amplifies the opportunities by introducing a novel training

scheme, Weight Distribution-Oriented Training (WOT), to regularize the weight distributions of

CNNs such that they become more amenable for zero-space protection.

Chapter 7 summarizes the thesis and discusses future work for building efficient and reliable

ML systems.


CHAPTER 2

BACKGROUND

In this chapter, we provide background on the machine learning (ML) models (k-means

clustering and DNNs) and the model training pipelines used in the thesis.

ML is the study of algorithms that make computers learn to perform a specific task without

using explicit instructions. ML builds mathematical models based on sampled data (also called

training data) to make predictions or decisions. It has been used in a wide range of applications

including computer vision [Kri12], recommender systems [Par18], manufacturing [Zho17], health

care [Mio18], and many others.

ML can be classified into several broadly-defined categories. Supervised learning builds mathe-

matical models based on training data that contain both inputs and desired outputs. The desired

outputs are also called labels. For example, when we want to build a classifier to recognize whether

a dog is in an image, the training data should contain both images with and without dogs and

the label for each image. Both classification and regression algorithms are supervised learning.

Classification algorithms are used when the outputs have a limited set of values while regression

algorithms fit situations where the outputs are continuous, such as price, length, weight, and

conversion rate. When an algorithm has to learn from training data where only part of the data

has labels, the algorithm falls into the category of semi-supervised learning. Unsupervised learning

builds mathematical models based on only inputs and is typically used to discover patterns or

structures in the data such as clustering of data points or a distribution that generates the training

data. K-means clustering is an unsupervised learning algorithm while k-nearest neighbor (KNN)

and CNNs are supervised learning algorithms. Other types of ML learning algorithms include active

learning, reinforcement learning, and meta learning.


An essential step to perform ML is to train a mathematical model on some training data and then

make predictions based on the trained model. Next, we explain two types of ML models, k-means

clustering and Convolutional Neural Networks (CNNs), that we experiment with in the thesis.

2.1 K-Means Clustering

K-means clustering is an unsupervised ML algorithm that aims to partition training data into k

clusters where each data point belongs to its nearest cluster. The distance between a data point and

a cluster is calculated as the distance between the data point and the center of the cluster, called the

cluster center. Each cluster has one cluster center, which is calculated as the mean of the data points that

belong to the cluster. K-means clustering partitions training data by minimizing within-cluster

variances measured by squared Euclidean distances. The optimization problem is NP-hard. The

most commonly used heuristic algorithm to solve the problem is Lloyd’s algorithm. There are faster

alternatives such as Yinyang k-means [Din15b].

Lloyd's algorithm works in the following way. Given an initial set of $K$ cluster centers, $c_1^{(0)}, \cdots, c_K^{(0)}$,

and $N$ data points $x_1, \cdots, x_N$, the algorithm alternates between two steps until the convergence criterion is met:

1. Assignment step: Assign each data point to its nearest cluster, so that in step $t$ each cluster holds the set of data points $S_i^{(t)} = \{x_n : \|x_n - c_i^{(t-1)}\|^2 \le \|x_n - c_k^{(t-1)}\|^2 \;\forall k,\; n = 1, 2, \cdots, N\}$, where $i = 1, 2, \cdots, K$.

2. Update step: Update each cluster center based on the equation $c_i^{(t)} = \frac{1}{|S_i^{(t)}|} \sum_{x_n \in S_i^{(t)}} x_n$ for $i = 1, 2, \cdots, K$.

The algorithm converges when the assignments do not change or a maximum number of steps is

reached.
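
To make the two steps concrete, below is a minimal NumPy sketch of Lloyd's algorithm; the function name, random initialization, and iteration cap are illustrative choices, not details taken from the thesis.

```python
import numpy as np

def lloyd_kmeans(X, K, max_steps=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = None
    for _ in range(max_steps):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: assignments did not change
        assign = new_assign
        # Update step: each center becomes the mean of its assigned points.
        for i in range(K):
            members = X[assign == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, assign
```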

As one of the most popular data mining algorithms [Wu08], k-means clustering has been used

in many applications. Its uses go far beyond simple clustering. An important use of k-means, for

instance, is for model-free classifier construction. After clustering some training data in the feature

space, the method may use the labels of each cluster to classify any new data item falling into that

cluster [Fri01]. Another example is to use k-means for feature learning by using the centroids of the

clusters to produce features [Coa12].
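
As a small illustration of the model-free classifier construction mentioned above (a generic sketch, not the exact procedure of [Fri01]), the helpers below label each cluster by the majority label of its members, reusing the lloyd_kmeans sketch from earlier in this section; integer class labels are assumed.

```python
import numpy as np

def build_kmeans_classifier(X_train, y_train, K):
    """Model-free classifier: label each cluster by majority vote of its members."""
    centers, assign = lloyd_kmeans(X_train, K)   # from the sketch above
    cluster_labels = np.array([
        np.bincount(y_train[assign == i]).argmax() if np.any(assign == i) else 0
        for i in range(K)
    ])
    return centers, cluster_labels

def predict(x, centers, cluster_labels):
    """Classify a new point by the label of its nearest cluster center."""
    nearest = ((centers - x) ** 2).sum(axis=1).argmin()
    return cluster_labels[nearest]
```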

We experiment with k-means clustering to evaluate our reuse-centric optimization in Chapter 3.

2.2 Convolutional Neural Networks

Neural networks are a class of ML models that consist of neurons and connections. A neuron

receives many inputs from the preceding neurons and produces one output. It calculates the output

as a weighted sum of the inputs followed by a non-linear activation function. A connection wires

two neurons with a weight, transforming the output of the predecessor neuron as the input of the



Figure 2.1: CNN and CNN Pruning. Conv1 and Conv2 are the first two consecutive convolutional layers in the CNN.

successor neuron. The successor neuron uses the connection’s weight to multiply the input. A layer

contains a collection of neurons. Two layers can have their neurons connected in various patterns

but neurons within a layer are not connected.

A Deep Neural Network (DNN) is a neural network that contains many layers. Modern DNNs can

contain up to thousands of layers. The weights inside a DNN can be adjusted to learn the mapping

between the inputs and the desired outputs through a learning process called training. After training,

one can use the well-trained DNN to make predictions based on given inputs. This process involves

only a forward pass from the first layer to the last layer and is referred to as inference. Recent years

have witnessed rapid progress in the development of DNNs and their successful applications to the

understanding of images, texts, and other data from sciences to industry [Pat18; Mat18; Rat12].

Convolutional Neural Networks (CNNs) are a major class of DNN models. The core of a CNN

consists of many convolutional layers, and most computations at a layer are convolutions between

its neuron values and a set of filters on that layer. A filter consists of a number of weights on input

connections, as Figure 2.1 (a) illustrates. CNNs are important for a broad range of learning tasks,

from face recognition [Law97], to image classification [Kri12], object detection [Ren15], human pose

estimation [Tom14], sentence classification [Kim14], and even speech recognition and time series

data analysis [LeC95].

We next enumerate several popular CNN architectures and then give some background on CNN

pruning that is important to understand Chapter 4 of the thesis.

2.2.1 CNN Architectures

Many CNN architectures have been proposed in recent years. We list the CNN architectures used in our

experiments.

AlexNet [Kri12] was proposed in 2012 and has had a large impact on the field of ML. The model

significantly decreased the error rate of image classification compared with previous ML-based

approaches. It contains five convolutional layers and three fully connected layers. The total number

of parameters is 61 million.

VGG16 [Sim14] improves on AlexNet by replacing large kernel-sized filters with multiple 3x3

kernel-sized filters and stacking more convolutional layers. It has 13 convolutional layers and three

fully connected layers. The total number of parameters is 138 million. VGG16 has many variants such

as VGG11 and VGG19, which differ in the number of layers. Due to their strong generalization


ability, VGG16 and its variants are also widely used in many other computer vision tasks (e.g., image

segmentation).

Inceptions are a class of parameter-efficient CNNs featured by a novel modular design [Sze15;

Iof15; Sze16]. Inception-V1 (also called GoogLeNet) [Sze15] has 7 million parameters. It contains nine

inception modules and each module contains a set of convolutional layers structured in a certain

way. There are two auxiliary loss layers connected to intermediate layers to provide additional

regularization during training. Inception-V2 [Iof15] and Inception-V3 [Sze16] further improve the

accuracy of Inception-V1 on ImageNet by incorporating many tweaks such as factorizing NxN

convolutions into 1xN or Nx1 asymmetric convolutions and adding batch normalization layers.

These tweaks help reduce the number of parameters or improve model accuracy.

ResNets [He16] are a class of CNNs featured by bypass layers. A bypass layer connects two

convolutional layers by skipping the convolutional layers between them. It allows gradients to flow

more easily backward to alleviate the vanishing gradient problem. Similar to Inceptions, ResNets also

follow a modular design. One example of the ResNet family is ResNet-50, which contains 16 residual

modules and one fully connected layer. Each residual module has 3 convolutional layers. The

model has 25 million parameters and a total of 50 convolutional layers.

DenseNets [Hua17] build on dense blocks. A dense block connects each layer to all the preceding

layers inside the block. This means each layer will take the activation maps from all preceding layers

directly and its output activation map will also be used as inputs to all the subsequent layers. This

kind of connectivity pattern can achieve similar accuracy to ResNet on ImageNet with half the number

of parameters.

SqueezeNet [Ian16b] is a compact model designed for mobile applications. It can achieve a

similar accuracy to AlexNet but has only 1.2 million parameters. SqueezeNet contains eight Fire

modules (similar to inception modules or residual modules) and has a total of 26 convolutional

layers.

We experiment with the above CNN models to evaluate our reuse-centric optimization in Chap-

ters 4, 5, and 6. Recent advances in DNNs have led to the emergence of many open-source frame-

works including TensorFlow [Aba15], PyTorch [Pas19], Caffe [Jia14], TVM [Che18b], CNTK [Sei16],

Keras [Cho15] and many others. We use TensorFlow and PyTorch to train CNNs in our experiments.

Details of DNN training pipelines are elaborated in Section 2.3.

2.2.2 CNN Pruning

CNN pruning is a method that reduces the size and complexity of a CNN model by removing some

parts, such as weights or filters, of the CNN model and then retraining the reduced model, as

Figure 2.1 (b) illustrates. It is an important approach to adapting large CNNs trained on general

datasets to meet the needs of more specialized tasks [Tia17; Ye18]. An example is to adapt a general

image recognition network trained on a general image set (e.g., ImageNet [Rus15]) such that the

smaller CNN (after retraining) can accurately distinguish different bird species, dog breeds, or car

models [Luo17a; Mol16; Liu17a; Ye18]. Compared to designing a CNN from scratch for each specific



Figure 2.2: A DNN training pipeline [Pit18].

task, CNN pruning is an easier and more effective way to achieve a high-quality network [Mol16;

Gor18; O’K18; Liu17a; Tia17]. Moreover, CNN pruning is an important method for fitting a CNN

model on a device with limited storage or computing power [Han15a; Yan17].

For a CNN with $L$ convolutional layers, let $W_i = \{W_i^j\}$ represent the set of filters on its $i$-th

convolutional layer, and $W$ denote the entire set of filters (i.e., $W = \cup_{i=1}^{L} W_i$). For a given training

dataset $D$, a typical objective of CNN pruning is to find the smallest subset of $W$, denoted as $W'$, such

that the accuracy reachable by the pruned network $f(W', D)$ (after being re-trained) has a tolerable

loss (a predefined constant $\alpha$) from the accuracy of the original network $f(W, D)$. Besides space,

the pruning may seek other objectives, such as maximizing the inference speed [Yu17], or

minimizing the amount of computation [He18] or energy consumption [Yan17].

The optimization problem is challenging because the entire network configuration space is as

large as $2^{|W|}$ and it is time-consuming to evaluate a configuration, which involves the re-training

of the pruned CNN. Previous work simplifies the problem to identifying and removing the least

important filters. Many efficient methods for determining the importance of a filter have been pro-

posed [Liu17b; Hu16; Li16; Mol16; Luo17a; He17].

The pruning problem then becomes determining how many of the least important filters to remove

from each convolutional layer. Let $\gamma_i$ be the number of filters removed from the $i$-th layer in a

pruned CNN and $\gamma = (\gamma_1, \cdots, \gamma_L)$. Each $\gamma$ specifies a configuration. The size of the configuration

space is still combinatorial, as large as $\prod_{i=1}^{L} |\Gamma_i|$, where $|\Gamma_i|$ is the number of choices $\gamma_i$ can take.
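
As a toy illustration of this combinatorial growth (hypothetical numbers, not from the thesis), the sketch below enumerates the configurations of a 3-layer CNN where each layer may be pruned at a rate of 0%, 30%, or 50%, the rates used in the Figure 4.4 example; even this tiny space has 27 configurations, each of which would require re-training to evaluate.

```python
from itertools import product

# Hypothetical example: 3 convolutional layers, each prunable at 0%, 30%, or 50%.
prune_rate_choices = [[0.0, 0.3, 0.5]] * 3

# Every gamma = (gamma_1, gamma_2, gamma_3) is one pruning configuration.
configurations = list(product(*prune_rate_choices))
print(len(configurations))  # 3**3 = 27 configurations to re-train and evaluate
```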

Prior efforts have concentrated on how to reduce the configuration space to a promising sub-

space [Hoo11; He18; Ash17]. But CNN training is slow and the reduced space still often takes days to

explore. Our work introduced in Chapter 4 focuses on a complementary direction, accelerating the

examinations of the promising configurations.

2.3 DNN Training

This section provides the necessary background on the DNN training pipeline and data-parallel DNN

training.



Figure 2.3: An illustration of data-parallel DNN training [Ser18].

2.3.1 DNN Training Pipeline

DNNs are commonly trained using stochastic gradient descent (SGD) [Bot10]. To train a DNN, an

objective function is required to evaluate the model’s prediction compared with the desired outputs

for given inputs. An objective function is also called loss function or cost function if we want to

minimize the value calculated by the function. The output of a loss function is simply referred to

as the loss. The gradient in gradient descent refers to the gradient of the loss with respect to the variables being trained. The

weights in a DNN are trained by moving their values in the negative direction of the gradients so that

the loss is reduced. A learning rate is used to control how much change we make to each weight. It

usually takes hundreds of thousands of iterations to train a DNN.
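
For reference, the basic SGD weight update described above can be written as follows (a standard formulation, not an equation quoted from the thesis), where $\eta$ is the learning rate and $\mathcal{L}$ is the loss computed on the current batch:

$$ w^{(t+1)} = w^{(t)} - \eta \, \nabla_{w} \mathcal{L}\big(w^{(t)}\big) $$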

A typical DNN training pipeline is an iterative process containing three main stages: data fetching,

preprocessing, and training, as shown in Figure 2.2. In each iteration, data is fetched to the main

memory and then run through a sequence of preprocessing operations such as decoding, rotation,

cropping, and scaling. The preprocessed data is arranged into batches and consumed by the training

stage. The batch size is the number of data samples used simultaneously per step.

Modern computing clusters and data centers have evolved into a hybrid structure that con-

tains both CPUs and GPUs on each node. These heterogeneous CPU-GPU clusters are particularly

useful for DNN training as CPUs and GPUs can work together to accelerate the training pipeline.

Compared to the training stage, preprocessing is usually less computation-intensive. To pipeline

the preprocessing and DNN training, typically preprocessing is performed on CPUs while training

on another batch of data happens simultaneously on GPUs.
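
Below is a minimal PyTorch-style sketch of this overlap; `dataset`, `model`, and `loss_fn` are assumed to exist and are not from the thesis, and the worker count, batch size, and learning rate are illustrative. Using DataLoader worker processes for CPU-side fetching and preprocessing while the GPU runs the training step is one common way to realize the pipelining described above.

```python
import torch
from torch.utils.data import DataLoader

# Assumed to exist: dataset (yields (image, label) pairs), model, loss_fn.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4,     # CPU workers fetch and preprocess batches
                    pin_memory=True)   # faster host-to-GPU copies
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for inputs, labels in loader:          # fetching and preprocessing run in the workers
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)  # training step runs on the GPU
    loss.backward()
    optimizer.step()
```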

2.3.2 Data-Parallel DNN Training

Data-parallel DNN training trains a single DNN using multiple training pipelines where each pipeline

handles a different subset of data. As illustrated in Figure 2.3, each pipeline fetches a different subset


of data from storage and preprocesses it independently. In the training stage, gradients are

calculated by each pipeline and are reduced so that every pipeline has the same averaged gradients.

The averaged gradients are used to update the model to make sure each pipeline has the same copy

of model parameters.
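
A minimal sketch of this gradient-averaging step with torch.distributed (one possible realization of what Figure 2.3 shows, assuming the process group has already been initialized and every rank holds an identical copy of `model`):

```python
import torch.distributed as dist

def average_gradients(model):
    """All-reduce the gradients so every training pipeline holds the same average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across ranks
            param.grad /= world_size                            # then average them

# After loss.backward() on each rank's local batch:
#   average_gradients(model)
#   optimizer.step()   # every rank applies the same averaged update
```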

Pipelines in data-parallel DNN training can run either on the same computing node using

intra-node communication (single node multiple GPU training) or different nodes using inter-node

communication (multiple-node multiple-GPU training). For the existing communication interfaces

(e.g., MPI), intra-node communication is usually more efficient than inter-node communication.

Thus, it is preferred to allocate pipelines on the same computing node rather than on different

nodes. As it is common to run only one pipeline on a single GPU, the number of GPUs available

to train a DNN model practically limits the maximum number of pipelines that can be created in

data-parallel DNN training. Data-parallel DNN training is used to train CNNs in Chapter 5.


CHAPTER 3

REUSE-CENTRIC K-MEANS CONFIGURATION

3.1 Introduction

The effectiveness of k-means in applications depends on many factors, such as the features used for

clustering and the resulting number of clusters. As a result, algorithm configuration is essential for

k-means–based data mining [Kal12; Ber12]. On the other hand, as an iterative algorithm, k-means is

very time-consuming to run on large datasets. The configuration of k-means for a dataset requires

many runs of k-means in various settings. The time-consuming nature of k-means and the required

repeated runs of k-means in its configuration make k-means–based data mining a time-consuming

process, a problem being continuously exacerbated by the rapid growth of data in this era.

There are some general methods proposed for speeding up the configuration process of algo-

rithms [Hol12]. They have mostly focused on how to reduce the number of trial configurations. How

to accelerate the examination of the remaining configurations through historical information reuse

is a complementary direction that has not received sufficient exploration. And how to effectively

accomplish it for k-means remains a largely unexplored problem.

This chapter presents a systematic exploration in that direction. It introduces the concept of

reuse-centric k-means configuration, which promotes information reuse across the explorations of

different configurations of k-means. The motivating observation is that the explorations of different

configurations of k-means share lots of common and similar computations. Effectively reusing the

computations could largely cut the configuration time with little or no effect on the quality of the

final results.


To materialize the idea, this work strives to answer three main research questions:

• What historical information is essentially useful for k-means configuration?

• How to efficiently reuse the information to maximize the reuse benefits?

• Whether and how much do the reuse-based optimizations affect the final results?

Specifically, we have designed two techniques, called reuse-based filtering and center reuse, to

promote computation reuse across trials of different configurations.

Reuse-based filtering takes advantage of the clusters and the distance between a point and

its nearest center unveiled in a previous trial of k-means. Through the reuse, it is able to leverage

triangle inequality to avoid some distance calculations: it uses computationally efficient lower bounds of the distances between a point and potential centers to filter out centers that cannot be the nearest to the point, avoiding the calculation of the distances to those centers. (§ 3.2.3)

Center reuse is to use the clustering results of some earlier trials to initialize cluster centers for

some later trials on different configurations. The reuse helps make the later trials converge faster

and hence saves configuration time. (§ 3.2.4)

For both types of reuse, we have explored the opportunities in multiple levels: across validations,

across k , and across feature sets. Besides their effectiveness in drastically cutting configuration

time, an appealing property of these techniques is their simplicity. They are designed to be simple

to implement and deploy to ensure their applicability in general data mining applications.

In addition to the two techniques, we have also explored the use of a two-phase design to

speed up the configuration process when a full error surface is needed for meeting various desired

trade-offs among multiple quality metrics (e.g., different weights of the classification errors over the

classification time). (§ 3.2.5)

We evaluate the efficiency and effectiveness of these techniques by way of both the configuration

speed and quality of the final results, in both sequential and parallel settings. Our results show that

these techniques can work together in synergy, speeding up a heuristic search-based configuration

process by up to 5.8X. When they are used to speed up the attainment of the error surface of k-

means–based classifiers, they shorten the process by a factor of 9.1. All the optimization techniques

we propose cause no change to the final k-means results except for the center reuse technique. We

conduct a focused study on its influence, which concludes that the disparity it causes is negligible (less than 3%). We further provide some sensitivity studies to reveal how the optimization techniques perform in various settings, and point out some important insights (such as the preferred directions of configuration explorations) on how to deploy them to tap into their full potential. (§ 3.3)

Overall, this work makes the following major contributions:

• It provides the first systematic study on how historical information reuse may help speed up

the k-means configuration process.

• It proposes a set of novel techniques to effectively promote information reuse across explo-

rations of different k-means configurations.


• Through sensitivity studies, it reports the performance of the techniques in various settings, and provides some important insights on the suitable ways to deploy these techniques.

• It demonstrates large (5–9X) speed benefits from these techniques, and confirms that they cause only little disparity to the quality of k-means results.

3.2 Proposed Techniques

We describe, in this section, our proposed techniques. Before that, we first discuss the factors and objectives necessary to consider in k-means configuration.

3.2.1 Overview of K-Means Configuration

Understanding the usage of k-means in real applications helps understand the purpose and objec-

tives of k-means configuration.

Even though k-means is a clustering tool, it is often used as a module for a purpose beyond

simple clustering. In k-means–based data classification, for example, through k-means, training

data are grouped into clusters, which are then used for classifying test data: The cluster centers are

used as compact representations of the data, and each center has an associated class label decided

by its data-point members. The classification of a testing data point is then made to the class of its

closest cluster center.

Figure 3.1a outlines a general structure of applications that use k-means. Data are first projected

onto some feature space. K-means clustering subsequently runs on the projected data to form some

clusters, which are then used by the application for some follow-up purposes (e.g., classification).

K-means configuration is a process of finding the configuration (e.g., the number of clusters k

and feature sets) that can maximize certain objectives. The objectives are often aimed at maximizing

the quality of the ultimate results of the application (e.g., classification accuracy); some internal

metrics of clustering (e.g., within and across cluster distances) could be relevant but are usually

secondary to the application-level objectives. Cross-validation (e.g., on data classification) is often

used in the process to help assess the quality of a configuration.

Figure 3.1a also illustrates two important factors of k-means to configure and to impact the

applications in some way. The first is the set of features to extract or select from the raw data, and

the second is k , the number of clusters to form. Although the configuration involves only two

factors, even with the fast Yinyang k-means algorithm [Din15b], on a dataset of modest size, the

configuration is still computationally intensive (days) when exploring all combinations of k and

feature sets. There are some other factors that could also be worth tuning, such as the definition of distance and the way feature extraction is done. However, the two factors (k and feature sets) have the

largest numbers of variants and hence dominate the configuration space. Our discussion in this

work focuses on them; thanks to the combinatorial nature of the space, the speedups attained for

their configurations directly translate to the overall speedups of the whole configuration process

despite the presence of other secondary factors one may wish to tune.


(a) A general structure of k-means–based applications with k-means configuration.

(b) The overview of the acceleration techniques.

Figure 3.1: Overview of k-means–based applications and where our three acceleration techniques are applied.

Our acceleration techniques pertain to the most time-consuming k-means clustering step,

circled with a dash-lined rectangle in Figure 3.1a.

3.2.2 Overview of the Acceleration Techniques

Our acceleration techniques consist of three stages: reuse-based filtering, center reuse, and a two-

phase design. The first two materialize the idea of reuse-centric k-means configuration, which saves

computations in the configuration process through promoting the reuse of computation results from

the trials of some earlier configurations. The last technique uses a two-phase design to first quickly

get an estimated surface of classification errors, and then uses it to help focus the explorations on

valuable configurations. The first two are generally applicable for all k-means–based data mining

tasks, while the last one is especially useful when a detailed relation between configurations and

the final results of the application (e.g., classification accuracy) is needed.

The techniques work at different aspects of the problem and can function in synergy. The

dash-lined boxes in Figure 3.1b illustrate the scopes they each work on.

Reuse-based filtering reuses the clusters obtained in the trial of one configuration (with feature

set S and k value) to speed up the first iteration of k-means in a later trial of some other configuration.

It concentrates on the first iteration of k-means because in modern k-means (e.g., Yinyang K-

means [Din15b]), the later iterations are already highly optimized, and each takes a much shorter

time than the first iteration does. For instance, in our experiments with nine datasets of different

sizes and dimensions (listed in Table 3.1), the first iteration of Yinyang k-means takes 10-40% of

the entire k-means time.


Center reuse sets good initial centers for k-means by leveraging the centers from earlier trials. It

works across all three levels: across the iterations in feature selection, iterations of k value exploration,

and cross-validations. It significantly helps shorten the time for k-means to converge in the algorithm

configuration.

The two-phase design aims at reducing the number of configurations to explore for each set of

features. It hence contributes to the computational savings within, rather than across, the explorations of a given set of features.

Reuse-based filtering and the two-phase design do not alter the clustering results. Center reuse

could lead to clustering results different from the ones attained by using random centers. However,

later in Section 3.3.4, we will show that the influence causes negligible impact on the results of

algorithm configurations.

We next explain each of the three techniques in detail.

3.2.3 Reuse-Based Filtering

K-means is time consuming primarily because of its calculations of the distances from data points to

potential cluster centers. In the standard k-means, each iteration needs to compute n ×k distances

(n is the number of points, k is the number of cluster centers), from every data point to every cluster

center in order to identify which cluster center is the closest to the data point. Modern k-means

algorithms (e.g., Yinyang k-means [Din15b]) successfully avoid many distance calculations in later iterations of k-means, but they all still need the n × k distance calculations in the first iteration of k-means. In our experiments, we observe that the first iteration accounts for up to 40% of the entire

k-means time. We call it the first iteration problem. Algorithm configuration of k-means needs many

runs of k-means; every one of them suffers from the first iteration problem.

To alleviate the issue, we propose reuse-based filtering. It is based on the well-known geometric

property of Triangle Inequality (TI).

3.2.3.1 TI and Its Prior Use for K-Means

We provide the formal definition of TI and landmark as follows.

Theorem 3.2.1. Triangle Inequality (TI): Let q, p, L be three points in a metric space (e.g., Euclidean space) and d(x, y) be the distance between any two points x, y in the space. Triangle Inequality states that d(q, p) ≤ d(q, L) + d(p, L). Point L is called a landmark.

TI has been used by previous work [Din15b; Dra12; Ham10; Elk03] to avoid unnecessary distance

calculations in k-means, except for its first iteration. The basic idea in those works is to use the

cluster centers in the previous iteration as landmarks to help quickly attain the lower bounds and

upper bounds between each data point and the new centers in the current iteration. If the lower

bound between a point x and a center c is even larger than the upper bound between x and its

so-far nearest center b (x ) (in this current iteration), there is no need to calculate d (x , c ). Figure 3.2


Figure 3.2: Illustration of how an upper bound and a lower bound can help avoid the distance computation to some center c. Circles and double circles represent centers in the current and previous iteration respectively, and b(x) is the so-far nearest center of point x.

illustrates this procedure. The idea has not been applied to the first iteration of k-means because

there is no previous iteration that it can leverage.
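As a concrete illustration of this bound test, the small sketch below checks whether the distance to a candidate center can be skipped; the function name and arguments are hypothetical, and the bound shown is the standard landmark-based lower bound rather than the exact bookkeeping of any particular k-means implementation.

```python
import numpy as np

def distance_or_skip(x, c, d_x_landmark, d_c_landmark, d_x_best):
    """Triangle-inequality test: d(x, c) >= |d(x, L) - d(c, L)|.
    If this lower bound already exceeds the distance from x to its
    so-far nearest center, c cannot be the nearest center of x and
    the exact distance d(x, c) is never computed."""
    lower_bound = abs(d_x_landmark - d_c_landmark)
    if lower_bound >= d_x_best:
        return None                    # distance computation avoided
    return np.linalg.norm(x - c)       # bound too loose; compute exactly
```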

3.2.3.2 Basic Idea of Reuse-Based Filtering

Our reuse-based filtering is inspired by the prior use of TI in k-means. Its basic approach is to

leverage the results from the exploration of an earlier configuration to help produce the lower/upper

bounds of distances for the exploration of later configurations, whereby, TI can then be applied to

identify and avoid the unnecessary distance computations.

The nature of algorithm configuration poses several special challenges for materializing the idea

that do not appear in the prior use of TI for accelerating a single run of k-means.

• In different iterations of a single k-means, distances are all based on the same set of data fea-

tures, and the number of cluster centers is also identical. But in algorithm configuration, these

factors could all differ in the exploration of different configurations. That causes complexities

in how to reuse distances, and how to effectively define landmarks.

• How to ensure that the acceleration to the first iteration does not interfere with the acceleration

of the later iterations of k-means. When modern k-means algorithms apply TI to later iterations

to avoid unnecessary distance calculations, they leverage the n ×k distances from the first

iteration to help attain some tight distance bounds for TI to work effectively [Din15b; Dra12;

Ham10; Elk03]. If reuse-based filtering avoids computing many of the distances in the first

iteration, it could pose risks for the acceleration of the later iterations to work properly.

We next explain our design of reuse-based filtering and how it addresses the two special concerns.

3.2.3.3 Detailed Design of Reuse-Based Filtering

We explain the design of reuse-based filtering in two levels: across k and across feature sets.


Figure 3.3: Illustration of how the configuration with k = k1 can help save distance computation in the first iteration of another configuration with k = k2. b'(x) is the closest center of point x when k = k1; c1, c2, c3 are the initial centers and b(x) is the so-far nearest center of point x when k = k2.

Reuse across k . This reuse happens among the configurations that share the same set of features,

with different k values. Suppose that the reuse is from one configuration with k = k1 to another with

k = k2. Compared to the previous usage of triangle inequality to eliminate unnecessary distance

computations, as shown in Figure 3.2, we cannot build that one-to-one previous-center relationship

between two configurations with different k . Instead, we could use the closest center b ′(x ) for each

point x in the configuration with k = k1 as the landmark for all the initial centers in the configuration

with k = k2. Figure 3.3 provides the illustration. Note that the distance from x to b'(x) can be directly reused from the trial with k = k1; the only extra distance computations we need to carry out are those from the centers in k = k1 to the initial centers in k = k2. In total, there are k1 × k2 such distance computations, which is negligible in comparison to the cost of computing the distances from every point to every center (i.e., n × k2), where n is the total number of points. Similar to the previous usage of triangle inequality [Din15b; Elk03], this optimization does not change the final cluster results, as distance computations to some center c will be eliminated only when c cannot be the closest center to the point x.
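A minimal sketch of this first-iteration filtering is given below, assuming Euclidean distance; all names are illustrative. It reuses each point's previous nearest center b'(x) as its landmark and needs only the k1 × k2 landmark-to-center distances beyond the reused point-to-landmark distances.

```python
import numpy as np

def first_iteration_across_k(X, new_centers, prev_assign, prev_centers, d_prev):
    """First k-means iteration of the k = k2 trial, reusing the k = k1 trial.

    prev_assign[i] : index of b'(x_i), the nearest center of x_i when k = k1
    prev_centers   : (k1, d) centers from the k = k1 trial (the landmarks)
    d_prev[i]      : reused distance d(x_i, b'(x_i)) from the k = k1 trial
    """
    # Only k1 x k2 extra distances: every landmark to every new initial center.
    d_land_new = np.linalg.norm(
        prev_centers[:, None, :] - new_centers[None, :, :], axis=2)

    assign = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        lm = prev_assign[i]
        # Lower bounds: d(x, c_j) >= d(L, c_j) - d(x, L) for every new center.
        lb = d_land_new[lm] - d_prev[i]
        best_j, best_d = -1, np.inf
        # Visit centers in increasing lower-bound order; once a bound exceeds
        # the best exact distance found so far, all remaining centers can be
        # skipped without changing the assignment.
        for j in np.argsort(lb):
            if lb[j] >= best_d:
                break
            d = np.linalg.norm(x - new_centers[j])
            if d < best_d:
                best_j, best_d = j, d
        assign[i] = best_j
    return assign
```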

Reuse across feature sets. This reuse happens among the configurations that have the same

number of clusters k , but use different sets of features. As our optimization is based on triangle

inequality, which requires the distance to be defined in the same metric space, we need to be con-

servative about distance reuses across different feature sets. Before we give the detailed explanation

about how reuse across feature sets is applied, we first introduce a theorem on distances defined in

two different, but highly related, metric spaces.

Theorem 3.2.2. Consider any two pairs of points (x^F1, c^F1) and (x^F2, c^F2), where x^F1 and c^F1 are in feature space F1 with feature set S1, while x^F2 and c^F2 are in feature space F2 with feature set S2. If x^F2 and c^F2 keep only a subset of the dimensions of x^F1 and c^F1 respectively, i.e., S2 ⊂ S1, then the distance between x^F2 and c^F2 must be no larger than that between x^F1 and c^F1 in any p-norm space. That is to say, d(x^F2, c^F2) ≤ d(x^F1, c^F1).


Figure 3.4: Illustration of how the configuration with F1 can help save distance computation in the first iteration of another configuration with F2, where b'(x) is computed from the closest center of point x in feature space F1, while c1, c2, ..., ck2 and b(x) are the initial centers and the so-far nearest center of point x in feature space F2 respectively.

The theorem follows directly from the distance definition in any p-norm space, in which the distance does not decrease as more dimensions are included.

For distance reuse across feature sets, a distance computed in feature space F1 can be used for

bound computations in feature space F2 without affecting the final clustering result as long as S2

is a subset of S1. In particular, for each center c'^F1 obtained in feature space F1, we remove those dimensions that are not used in F2 to build a corresponding center c'^F2 in feature space F2.

Figure 3.4 gives the illustration of how distance reuse can help eliminate unnecessary distance

computations to some center c across feature sets. Compared to the reuse across k shown in

Figure 3.3, our lower bound computation directly uses d(x^F1, b'^F1(x)) calculated in feature space F1, which is itself an upper bound of d(x, b'(x)) in feature space F2. These bound computations are a simple extension of the traditional triangle inequality, and as a consequence, our method will remove the distance computation to some center c only if c cannot possibly be the closest center to the point x.
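The sketch below illustrates the two pieces involved: projecting the F1 centers into F2 by dropping the non-shared dimensions, and forming lower bounds from the reused F1 distances. Function names and argument conventions are assumptions made for illustration.

```python
import numpy as np

def project_centers(centers_f1, feats_f1, feats_f2):
    """Project centers from feature space F1 onto F2 (assumes S2 is a subset
    of S1): keep only the F1 columns whose features also appear in F2."""
    keep = [feats_f1.index(f) for f in feats_f2]
    return centers_f1[:, keep]

def cross_feature_lower_bounds(d_land_new_f2, d_x_b_f1):
    """By Theorem 3.2.2, d(x^F2, b'^F2(x)) <= d(x^F1, b'^F1(x)), so the reused
    F1 distance is a valid upper bound on d(x, L) in F2, and
    d(x, c_j) >= d(L, c_j) - d(x^F1, b'^F1(x)) remains a correct lower bound."""
    return d_land_new_f2 - d_x_b_f1
```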

Further, our accelerations based on both reuse across k and reuse across feature space can easily

be combined with previous accelerations focusing on the later iterations of k-means [Din15b; Elk03].

Instead of using the exact distance results for bound computation as shown in Figure 3.2, we can

easily replace the exact distance with corresponding bounds obtained in our optimized first iteration.

Although our optimization may theoretically affect the efficiency of the optimization applied in the

later iteration of k-means, our empirical experience shows that these two accelerations are mostly

orthogonal to each other. (Details in Section 3.3.5.3.)

3.2.4 Center Reuse

The second technique we propose for accelerating configuration of k-means is called center reuse.

The idea is simple. As is well known, the convergence speed of k-means is highly sensitive to the qual-


ity of the initial centers. Some initial centers can make k-means converge in much fewer iterations

than others. The basic idea of center reuse is to use the cluster centers attained in the exploration of

some earlier configuration as the initial centers for the explorations of later configurations.

Center reuse is based on the following hypothesis:

Hypothesis 3.2.3. In algorithmic configuration, effectively using centers from an earlier run of k-

means to initialize later runs of k-means could shorten the convergence process while causing negligi-

ble effects on the result of the algorithm configuration.

Specifically, we consider center reuse in three scenarios, corresponding to the different levels of

the explorations of k-means configurations shown in Figure 3.1b.

Reuse across validations. This reuse is among the different folds in cross validations in the

exploration of a certain configuration. Recall that when exploring a given k

and a set of input features, cross-validation is often used to examine the quality of the final results

when that configuration is used. Take k-means–based classification as an example: cross-validation

computes the errors of the classifier produced through k-means in that configuration. A V -fold cross

validation builds V classifiers with each on a slightly different training dataset. The center reuse at

this level is to use the cluster centers attained during the training of the first of the V classifiers as

the initial centers for the k-means in the constructions of the other V −1 classifiers.

Reuse across k . This reuse happens among the configurations that share the same set of features,

but different k values. Because of the difference in k , the centers may not be directly reusable. Our

empirical investigation shows that the problem can be handled through a simple design. Suppose

that the reuse is from one configuration with k = k1 to another with k = k2. If k2 > k1, in addition to

using the centers attained in exploring the earlier configuration, we add k2−k1 randomly generated

centers as needed. If k2 < k1, we cluster the k1 centers into k2 groups and then take the group centers

as the initial centers for the exploration of the latter configuration. Figure 3.5 illustrates this case.

Reuse across feature sets. This reuse happens among the configurations that use different sets

of features. Suppose that the reuse is from configuration C1 with feature set S1 to configuration C2

with feature set S2. The differences in the feature sets make direct center reuse difficult. Our design

is to reuse the values of overlapped features between S1 and S2, and generate the values of the other

features of S2 (if there are any). Our experiments show that the generation can be as simple as using

the mean value of each feature.

When two configurations differ in both k and feature sets, center reuse first applies the reuse

across feature sets to handle the dimension differences in the features, and then applies the reuse

across k to set the initial cluster centers.
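The following sketch puts the two steps together: map the old centers into the new feature space, then adjust their number. It is a simplified illustration (feature sets are represented as lists of feature names, the extra centers for a larger k are drawn as random data points, and scikit-learn's one-iteration k-means stands in for the center-grouping step), not the exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def reuse_centers(prev_centers, prev_feats, new_feats, k_new, X_new, seed=0):
    """Initialize centers for a new configuration from an earlier trial."""
    rng = np.random.default_rng(seed)
    # Step 1 (across feature sets): reuse overlapping dimensions and fill the
    # remaining ones with the per-feature mean of the new training data.
    centers = np.tile(X_new.mean(axis=0), (len(prev_centers), 1))
    for j, f in enumerate(new_feats):
        if f in prev_feats:
            centers[:, j] = prev_centers[:, prev_feats.index(f)]
    # Step 2 (across k): add random centers if k grows, or group the old
    # centers into k_new clusters (one k-means iteration) if k shrinks.
    k_old = len(centers)
    if k_new > k_old:
        extra = X_new[rng.choice(len(X_new), k_new - k_old, replace=False)]
        centers = np.vstack([centers, extra])
    elif k_new < k_old:
        centers = KMeans(n_clusters=k_new, n_init=1, max_iter=1,
                         random_state=seed).fit(centers).cluster_centers_
    return centers
```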

Our empirical investigation shows that center reuse is beneficial for accelerating k-means

configuration—that is, Hypothesis 3.2.3 holds—when the following principles are followed:

• Reuse across validations should always be applied. Even though the training datasets of different

validations differ, the datasets come from the same source and share the same distributions.


Figure 3.5: Illustration of center reuse across k. The two graphs represent the k-means on two configurations with k equaling three (left) and two (right) respectively. The double circles in the left graph show the three centers attained in the exploration of that configuration. These centers are grouped to get two group centers c'1, c'2, which are then used as the initial centers (marked as circles in the right picture) for exploring the latter configuration.

Their cluster centers are hence similar. The reuse can always shorten the convergence process

significantly.

• Reuse across k can be applied to either direction: from a smaller k to a larger one or from a

larger k to a smaller one. In the case of increasing k values, as the extra centers are added

randomly, the complexity is the same compared with the random initialization. In the case

of decreasing k values, we group centers to get initial centers using k-means with only one

iteration. The overhead is also negligible. Even if the two k values differ a lot, center-reuse-based initialization degrades to random initialization and won't bring extra costs. Specifically, in our experiments, we notice that center reuse always gives substantial benefits even when the two k values differ by up to 25% of the maximum k value (e.g., the difference is 200 and the maximum k is 1000). Thus, reuse across k can always be applied.

• Reuse across feature sets is applied between configurations with different sets of features. Since we randomly generate the values for dimensions that do not overlap between the two feature sets, reuse across feature sets degrades to random initialization in the worst case (i.e., the two feature sets are completely different). If PCA-based step-wise feature selection is used, there is always overlap between two feature sets. Thus reuse across feature sets can always be applied. We observed that reuse across feature sets can give substantial benefits even when the two feature sets differ by 15% of the maximum dimension (e.g., the difference is 8 and the maximum dimension is 60).

Section 3.3 will provide the details of our empirical experiments on the effectiveness of center

reuse in saving computations, and on the effects it has on the quality of k-means configuration.

3.2.5 Two-Phase Design

The two techniques presented so far form the basis for our reuse-centric k-means configuration. In

this part, we introduce another complementary technique named two-phase design.


Figure 3.6: An example of an error curve and the illustration of curve segmentation. (The plot shows classification error versus k for the Default configuration; the elbow is marked, with segment boundaries k_l = 70 and k_r = 490.)

This technique is particularly useful when the error surface of the target application is desired.

For k-means–based classification, for instance, the error surface indicates how the classification

error changes when k and feature sets change. The surface is composed of a set of error curves, with

each corresponding to one set of input features. The black curve in Figure 3.6 illustrates such an

error curve when a particular feature set is considered while k changes.

The error surface is useful when the criterion for the best configuration varies. For instance,

a k-means–based classifier tends to give a higher classification accuracy when k gets larger. But at the

same time, the classification time also gets longer. In some situations, a user may want different

trade-offs between the accuracy and the time in different scenarios. Having the error surface in

hand can help meet the needs without rerunning the algorithm configuration every time the desired

trade-off changes.

The two-phase design is based on the observation that some points on an error surface more

critically affect the accuracy of the error surface than some other points do. For instance, on the

curve shown in Figure 3.6, the parts outside the elbow area are close to straight lines, and are hence

easy to approximate through curve interpolation on only several sampling points, but the elbow area

is more subtle in shape, and would require more sampling points to get a reasonable interpolation

result. Meanwhile, the elbow point is often selected as the desired configuration for its good balance

between the increase of time overhead and the decrease of classification errors.

The idea of the two-phase design is to first quickly obtain an estimated shape of the error surface,

based on which it then conducts a focused exploration of the configurations (e.g., those that fall into the

elbow area) that are most important for the accuracy of the final error surface. The two phases in the

design happen during the exploration of each given set of input features. To ease the understanding,

we explain the technique and how it works by drawing on k-means–based classification as an

example use of k-means.

The first phase in the design, specifically, tries to quickly get the approximate classification errors

at a small number of sampled k values. It employs both the reuse-based filtering and center reuse

for speed. At the same time, it adopts two approximation methods. The first is to use only the first

fold of cross-validation; the second is to replace k-means clustering with only a one-step clustering.


Figure 3.7: Approximated classification error curves from the first phase for two datasets: (a) sensorless and (b) adult. (Each panel plots classification error versus k for the Default and FirstPhase methods.)

The one-step clustering assigns points to clusters based on their distances to the centers produced

from our center reuse scheme; no center updates or point reassignments are done. The rationale of

our first phase approximation method comes from the statistical similarity across different folds of

a dataset, and the reasonable quality of the cluster centers produced from the center reuse.
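A minimal sketch of this one-step approximation for k-means–based classification is shown below; integer class labels and Euclidean distance are assumed, and the empty-cluster fallback is an arbitrary choice of the sketch.

```python
import numpy as np

def one_step_error(X_tr, y_tr, X_val, y_val, centers):
    """First-phase approximation: assign training points to the reused
    centers in a single pass (no center updates or reassignments), label
    each center by the majority class of its members, then classify the
    validation points by their nearest center."""
    assign = np.argmin(
        np.linalg.norm(X_tr[:, None, :] - centers[None, :, :], axis=2), axis=1)
    labels = np.array([
        np.bincount(y_tr[assign == c]).argmax() if np.any(assign == c) else 0
        for c in range(len(centers))])
    pred = labels[np.argmin(
        np.linalg.norm(X_val[:, None, :] - centers[None, :, :], axis=2), axis=1)]
    return np.mean(pred != y_val)
```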

These two approximation methods could incur some deviation from the exact classification

errors, but as we observed, their Pearson correlation coefficient with the errors from Yinyang k-means is always higher than 0.95 for all the datasets we tested. Figure 3.7 shows approximated classification

error curves from the first phase for two example datasets with ten k values sampled from the range

[20, 1010]. “Default” here refers to the default k-means with random initialization while “FirstPhase”

refers to our first-phase approximation method. Even though the errors from the first phase are

higher than the errors from standard k-means, the trends of the error curves are well approximated.

Based on the shape of that curve, the second phase identifies the critical sections (i.e., the elbow

section) of the curve and selects some important configurations to conduct more focused and

detailed explorations to subsequently get the precise accuracy at those points. Through interpolation

across those points, it finally obtains the error curve. Compared to uniform sampling, this two-phase

design allows better error curves to be attained with detailed explorations of fewer configurations.

Two notes are worth making about the second phase. The first is how to identify the critical sections; we also call this step curve segmentation. Given the range of k, [k_min, k_max], the way we segment the curve and do non-uniform sampling is as follows:

1. Find the elbow point on the curve; k_elbow is the corresponding k value.

2. For the two sub-ranges [k_min, k_elbow] and [k_elbow, k_max], find the elbow point for each partial curve in the range. The corresponding k values are k_l and k_r.

3. The range is then split into three parts: [k_min, k_l], [k_l, k_r], and [k_r, k_max]. Different step sizes can be used when exploring the three sub-ranges.


Let s_1, s_2, and s_3 be the step sizes for sampling the ranges [k_min, k_l], [k_l, k_r], and [k_r, k_max], respectively. We set the step sizes of the sampling as follows:

s_2 = \frac{m_2 (k_l - k_{min}) + m_1 m_2 (k_r - k_l) + m_1 (k_{max} - k_r)}{m_1 m_2 (n_k - 1)},   (3.1)

s_1 = m_1 s_2,   (3.2)

s_3 = m_2 s_2,   (3.3)

where n_k is the total number of k values to sample, while m_1 and m_2 are parameters determining the degree of discrimination in sampling for the different segments; we set them to 0.5 and 2.
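The step sizes can be computed directly from the segment boundaries; the short helper below is only a restatement of Equations (3.1)-(3.3), with names chosen for illustration.

```python
def sampling_stepsizes(k_min, k_l, k_r, k_max, n_k, m1=0.5, m2=2.0):
    """Step sizes s1, s2, s3 for the segments [k_min, k_l], [k_l, k_r],
    and [k_r, k_max], following Equations (3.1)-(3.3)."""
    s2 = (m2 * (k_l - k_min) + m1 * m2 * (k_r - k_l) + m1 * (k_max - k_r)) \
         / (m1 * m2 * (n_k - 1))
    return m1 * s2, s2, m2 * s2
```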

The second note is about elbow point detection. While the notion of elbow points is well known, there is no broadly accepted definition. Various techniques [Zha08; Sat11] have been

proposed to detect the knee point of a curve. We adopt a lightweight approach similar to [Sat11].

The idea is to draw a line from the first to the last point of the curve, and then find the data point

that is farthest away from that line. The dots on the curve in Figure 3.6 illustrates the boundaries of

the curve segments obtained through the method.
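A lightweight version of this farthest-from-the-line test is sketched below. Rescaling both axes to [0, 1] before measuring distances is an assumption of the sketch (the k range is much wider than the error range), not a detail stated by the method itself.

```python
import numpy as np

def elbow_point(ks, errors):
    """Return the k at the elbow: connect the first and last points of the
    (k, error) curve with a straight line and pick the point farthest from it."""
    ks = np.asarray(ks, dtype=float)
    errors = np.asarray(errors, dtype=float)
    x = (ks - ks.min()) / (ks.max() - ks.min() + 1e-12)
    y = (errors - errors.min()) / (errors.max() - errors.min() + 1e-12)
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    line = (p2 - p1) / np.linalg.norm(p2 - p1)
    pts = np.column_stack([x, y]) - p1
    # Distance from each point to the line = norm of the rejection component.
    dist = np.linalg.norm(pts - np.outer(pts @ line, line), axis=1)
    return ks[int(np.argmax(dist))]
```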

3.3 Evaluations

We conduct a series of experiments to evaluate the proposed techniques. Specifically, we focus on

the following questions:

• Q1: How much computation can the optimizations save? How much can they speed up k-

means configuration, in both sequential and parallel settings?

• Q2: Do the optimizations degrade the configuration results?

• Q3: How are the benefits affected by dataset attributes and problem settings (e.g., k , data

dimensions, landmarks)?

• Q4: How to deploy the optimizations (e.g., reuse from smaller k or from larger k, smaller feature

sets or larger feature sets) to maximize the benefits?

This section answers the first question by reporting the overall computation savings and

speedups brought by the techniques in § 3.3.2 and § 3.3.3, answers the second question in § 3.3.4,

and answers the third and fourth ones through some sensitivity studies in § 3.3.5.

3.3.1 Methodology

Our experiments use k-means–based classification as a concrete usage scenario of k-means config-

urations. It is worth noting that the techniques are general and applicable to other usages of k-means.

All the experiments run on an HPE Apollo 2000 server with two Intel Haswell CPUs (14 cores/CPU,

2.3-3.3GHz) and 128GB RAM. We use nine large, real-world data sets taken from the UCI machine


Table 3.1: Data statistics.

Dataset             size(B)   n       #attr   #c   #d    d_step   k_step
gamma               1.2e6     1.9e4   10      2    8     1        10
sensorless          4.4e6     5.8e4   49      11   10    1        10
credit [Yeh09]      3.0e6     3.0e4   24      2    13    1        10
gassensor [Hue16]   1.7e6     1.4e4   11      2    16    1        10
miniboone           3.4e7     1.3e5   50      2    34    2        20
adult               2.0e7     4.5e4   14      2    59    2        10
connect             3.1e7     6.8e4   42      2    61    2        10
activity [Ang13]    1.1e7     1.0e5   561     6    157   8        10
census              2.0e8     1.4e5   68      2    186   8        20

learning repository [Asu07]. The statistics of the datasets including data size (size), the number of

instances (n), the number of attributes (#attr) and the number of classes (#c) are listed in Table

3.1. We used PCA as the feature projection method to extract features from attributes and adopted

the step-wise feature selection method to select feature sets. We retained a maximum number of

components that cumulatively explain 99% of variation. The minimum number of dimensions

is two. The overall range of feature dimensions to consider in the configuration explorations is

shown in the “#d” column. The “d_step” and “k_step” columns show the default step sizes to increase or decrease the dimension and the number of clusters, respectively.

Yinyang k-means [Din15b] is used in the baseline implementations of the algorithm configura-

tion to minimize time. Yinyang k-means is one of the state-of-the-art algorithms proposed by Ding

and others recently. The algorithm filters unnecessary distance calculations by using continuously

maintained lower bounds on the distances of each point to the cluster centers as well as an upper

bound to the cluster center to which it was assigned. Even though the bounds yield a significant

speedup compared to the standard k-means (over 9X on average) [Din15b], the algorithm needs

to compute one full iteration of k-means to initialize bounds in the beginning, which requires the

computation of all distances to the centers.

K-means++ [Art07] is a commonly used initialization method for better approximation to

the optimal k-means solution (i.e., minimizing the within-cluster sum of squares). However, our

experiment shows that for configuring k-means–based data classification, it does not outperform

random-based center initialization in either the speed or the quality of the produced classifier.

Random initialization is hence used in our baseline.

Euclidean distance is used in all the experiments as selections of various distance metrics are

not a focus of this work. Our acceleration techniques apply to various metric spaces, except that

cross-feature reuse-based filtering requires p-norm spaces as stated in Theorem 3.2.2.

3.3.2 Speedups on Heuristic Search

This part reports the speedups brought to the k-means configuration process by our reuse-based fil-

tering and center reuse. Algorithm configurations typically employ some heuristic search algorithms


Table 3.2: Speedups on stochastic hill climbing.

Dataset       time(s)*   reuse-based   center reuse          center reuse across
                         filtering     across validations    k and feature sets
gamma         2808.1     3.09          3.58                  2.05
sensorless    10730.6    3.22          3.40                  1.73
credit        5713.5     3.30          4.05                  2.04
gassensor     1901.9     4.37          4.39                  2.33
miniboone     56636.8    1.56          3.73                  1.83
adult         9904.4     4.30          5.80                  2.18
connect       23621.7    1.54          4.42                  1.63
activity      6569.6     1.11          2.76                  1.91
census        79872.7    2.30          4.14                  1.68

* time(s) refers to the k-means clustering time in seconds for all 200 configurations without our optimizations.

to explore the configuration space. Our optimizations are largely orthogonal to what search algo-

rithms are used. Our experiments use stochastic hill climbing. Hill climbing is an iterative algorithm

that starts with an arbitrary solution and then attempts to find a better solution by changing the

solution in some way. If the change produces a better solution, then it is taken as the new solution.

Otherwise, a new change is examined. The process is repeated until no further improvements can

be found or some stopping criterion is met. Stochastic hill climbing makes the change at random.

The tuning objective is set to consider both the classification accuracy and the response time

of the built classifier. Specifically, it is to find the smallest k (hence giving the fastest classification

response) that can achieve a classification accuracy over a given threshold (90%). The stopping

criterion is that the maximum number of configurations (200) is tested. The baseline method repeats

the following process until it meets the stop criterion:

1. Choose the number of dimensions d and the number of clusters k from the search space

randomly;

2. Run k-means–based classification with the specified configuration and get the average classi-

fication error through 10-fold cross-validation;

3. If the average classification error reaches a predefined error threshold and the k value is

smaller than that in the current best configuration, then take it as the current best solution.

Our computation reuse techniques accelerate the second step. Each randomly generated con-

figuration could have different d and/or k than what a previous trial uses. We design the following

reuse strategy to select a historic trial for computation reuse:

• If there are previous trials of k-means that use the same d as the current one, then we reuse

the distances from the trial with the largest k for reuse-based filtering and the cluster centers

from the trial with the nearest k for center reuse.


Table 3.3: Speedups on the attainment of error surfaces.

(Computation-reuse speedups are measured at #k=32. RBF = reuse-based filtering; CR = center reuse. The "% of k saving" columns give the percentage of k values saved by the two-phase design.)

Dataset      RBF:       RBF: across    CR: across     CR:        CR: across     % of k saving               Overall speedup
             across k   feature sets   validations    across k   feature sets   #k=8    #k=16   #k=32       #k=8   #k=16  #k=32
gamma        3.42       3.12           4.81           2.78       1.44           11.51%  29.52%  16.81%      4.57   6.30   5.70
sensorless   3.61       3.80           4.50           2.37       1.59           21.41%  31.00%  24.21%      5.34   6.73   6.94
credit       3.65       3.70           5.31           2.81       1.82           6.69%   23.82%  15.09%      4.98   6.64   6.77
gassensor    4.31       4.34           5.75           3.32       3.19           26.54%  38.77%  11.95%      6.54   8.59   6.74
miniboone    1.91       1.97           4.56           2.57       2.02           3.85%   49.24%  48.98%      4.50   8.24   9.17
adult        4.65       5.05           6.43           3.86       3.88           12.99%  24.64%  22.71%      6.76   8.27   9.07
connect      1.59       1.74           4.13           2.91       2.28           11.19%  13.88%  5.89%       4.58   5.07   4.98
activity     1.21       1.23           2.77           1.84       2.07           15.25%  17.36%  13.79%      3.02   3.49   3.34
census       2.71       2.78           4.89           2.63       2.61           16.57%  33.12%  28.91%      5.34   7.32   7.64

• Otherwise, we reuse the distances from the trial whose d is larger than but closest to the current one for reuse-based filtering, and the cluster centers from the trial with the closest d (which can be smaller or larger) for center reuse, as sketched below.
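The selection of reuse sources can be expressed as a small lookup over the trial history; the record fields below (d, k) are illustrative, not the actual data structures of the implementation.

```python
def pick_reuse_sources(history, d, k):
    """Pick previous trials to reuse from, following the strategy above.
    `history` is a list of dicts, one per finished trial, e.g. {'d': 8, 'k': 200}."""
    same_d = [h for h in history if h['d'] == d]
    if same_d:
        filtering_src = max(same_d, key=lambda h: h['k'])           # largest k
        center_src = min(same_d, key=lambda h: abs(h['k'] - k))     # nearest k
    else:
        larger_d = [h for h in history if h['d'] > d]
        filtering_src = min(larger_d, key=lambda h: h['d']) if larger_d else None
        center_src = min(history, key=lambda h: abs(h['d'] - d)) if history else None
    return filtering_src, center_src
```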

To enable the above reuse strategy, the cluster centers from previous trials of k-means in the first

fold of cross-validation have to be stored. Also, the point-to-center distances from previous trials of

k-means need to be updated whenever a configuration with a larger k and the same d is tested.

The speedups from each technique are listed in Table 3.2. For reuse-based filtering, we observe

that it has only minor effects on the speeds of the later iterations of Yinyang k-means (Details in

Section 3.3.5.3). So we report the speedup of reuse-based filtering for the first iteration. The speedup

comes from the distance computation saved by TI-based filtering. Center reuse affects the entire

k-means clustering and thus has the dominant influence on the overall speedup. It confirms that

our simple design for center reuse across k and feature sets works well. The next subsection reports

the overall effects when all the techniques are used together in attaining error surfaces.

3.3.3 Speedups on the Attainment of Error Surfaces

Our second experiment studies the benefits of our techniques on the attainment of classification

error surfaces. As Section 3.2.5 has mentioned, error surfaces capture the relations between config-

urations and the errors of the corresponding classifiers. They could be helpful when the criterion

for the best configuration varies. In this experiment, we apply all three techniques that Section 3.2

proposes to accelerate the attainment of the error surfaces.

Uniform search of the configuration space is a simple but frequently used method to attain error

surfaces. It is used in the baseline implementation. Uniform search evaluates every combination of

all the d values and several uniformly sampled k values.

Our proposed method uses the two-phase design to reduce the number of k values to be evaluated

for building a classification error curve. Computation reuse techniques are applied to save the

clustering time for each sampled configuration. Since the configurations for uniform search to be


tested are already known, our method starts with the largest d and the largest k and applies reuse to

smaller d values and k values.

We first apply all the three techniques to speed up sequential uniform search, and then apply

them to parallel uniform search on the 28-core parallel machine.

3.3.3.1 Sequential

In this setting, only one thread is used for the configuration process. The range of dimensions to

search is listed in Table 3.1 and the range of k values is [20, 1010]. The number of sampled k values is 8, 16, or 32.

The speedups from all three techniques are listed in Table 3.3. With our two-phase design, we

could reduce the number of k to be evaluated for recovering the error curve without affecting the

benefits from the computation reuse. As a consequence, the overall speedup is up to 9.17X.

The overall speedup on the dataset activity is not as high as on the other datasets. Specifically,

the dominant acceleration factor, center reuse across validations, produces a smaller speedup

compared with that on the other datasets. In contrast, the results from the dataset census, which has similarly large dimensions but about 14 times larger data size, show much larger speedups for reuse-based filtering and for center reuse across all three levels. This is because when a dataset is relatively small but has a large feature dimensionality, the training sets are likely to follow different distributions, and thus the centers resulting from one fold of the training set are not as good for another fold.

3.3.3.2 Parallel

The parallel results are interesting to examine because our techniques, especially the computation

reuses, bring data dependences to the exploration of different configurations: for a configuration to

reuse results from another, it has to wait for the results to be produced. They hence could hamper

the parallel search.

Table 3.4 presents results of our algorithms in parallel settings when the two computation reuse

techniques are applied. In order to run our algorithms in parallel, some dependencies incurred

by the computation reuse have to be removed. When scheduling the task to each thread, we use

the following strategy to break dependencies: if the number of threads supported is no larger than the number of feature sets to be evaluated, then we remove only the dependencies caused by reuse across feature sets. Each thread examines a subset of the feature sets and all the sampled k values. The larger

the dimension is, the longer the k-means clustering time is. To balance the workload among the

threads, we assign each thread feature sets in an alternating manner. For example, if the dimensions

are from two to five and the number of threads is two, then the first thread runs dimensions two and

four while the second thread runs dimensions three and five. When the number of threads is larger than the number of feature sets, some dependencies caused by reuse across k are also removed.
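The alternating assignment amounts to a round-robin split of the dimensions over the threads, as in the following small sketch (names are illustrative).

```python
def assign_feature_sets(dims, n_threads):
    """Round-robin (alternating) assignment of feature-set dimensions to
    threads, e.g. dims [2, 3, 4, 5] on 2 threads -> [[2, 4], [3, 5]]."""
    return [dims[t::n_threads] for t in range(n_threads)]
```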

As shown in Table 3.4, we have good speedups with various numbers of threads and numbers

of sampled k . The larger the number of sampled k is, the larger the speedup is. This is because we

have a larger ratio of reusable distance computation for a larger set of sampled k .


Table 3.4: Speedup in parallel settings.

Dataset      #threads   #k=5   #k=10   #k=20   #k=30
gamma        2          3.66   4.06    4.53    4.60
gamma        4          3.59   4.13    4.35    4.52
gamma        8          3.26   3.70    4.13    4.19
gamma        16         2.91   3.18    3.23    3.29
sensorless   2          3.85   4.23    4.33    4.58
sensorless   4          2.23   3.16    4.08    4.58
sensorless   8          1.69   3.72    3.84    4.02
sensorless   16         1.40   3.46    3.84    3.67
credit       2          3.73   4.01    4.77    5.12
credit       4          3.96   4.49    4.89    5.08
credit       8          3.71   4.40    4.60    4.88
credit       16         3.69   4.02    4.30    4.75
gassensor    2          4.32   4.87    5.26    5.42
gassensor    4          4.27   4.69    5.02    5.63
gassensor    8          3.69   4.47    4.66    4.90
gassensor    16         2.55   3.95    4.06    4.75
miniboone    2          3.92   4.12    4.38    4.57
miniboone    4          3.86   4.14    4.16    4.27
miniboone    8          3.70   3.87    4.22    4.22
miniboone    16         3.97   4.29    4.54    4.71
adult        2          5.65   6.01    6.45    6.68
adult        4          5.48   5.32    5.98    6.67
adult        8          4.87   5.64    6.06    6.31
adult        16         2.41   3.06    5.77    5.94
connect      2          4.01   4.11    4.34    4.40
connect      4          4.01   4.24    4.39    4.51
connect      8          3.78   4.00    4.31    4.34
connect      16         3.68   3.68    4.15    3.13
activity     2          2.37   2.45    2.58    2.64
activity     4          2.32   2.42    2.51    2.58
activity     8          2.15   2.24    2.39    2.42
activity     16         2.15   2.27    2.44    2.50


Figure 3.8: Classification error curves for two datasets: (a) sensorless and (b) adult. (Each panel plots classification error versus k for the Default and CtrReuse methods.)

3.3.4 Quality Influence of Center Reuse

Recall that among all the three optimizations we introduce, only center reuse might affect the quality

of the resulting classifiers due to the sensitivity of k-means to initial centers. In this part, we provide

the details of our empirical measurements of the effects of center reuse on the quality of both

classification and clustering results.

Since k-means is sensitive to initialization, we perform 100 runs of k-means–based classification

with different random seeds for each configuration. The classification error is averaged over the 100

runs. The metrics we use to measure the discrepancy between the error curve from the baseline

and that from our center-reuse based technique are Mean Absolute Error (MAE) and Mean Percent

Error (MPE).

Given a list of values [a_1, a_2, ..., a_n] and its approximation [\hat{a}_1, \hat{a}_2, ..., \hat{a}_n], MAE and MPE are defined as follows:

MAE = \frac{\sum_{i=1}^{n} |a_i - \hat{a}_i|}{n},   (3.4)

MPE = 100\% \times \frac{\sum_{i=1}^{n} |(a_i - \hat{a}_i)/a_i|}{n}.   (3.5)
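These two metrics translate directly into a few lines of NumPy; the sketch below restates Equations (3.4) and (3.5) and is not a separate measurement procedure.

```python
import numpy as np

def mae_mpe(a, a_hat):
    """Mean Absolute Error and Mean Percent Error between a curve `a`
    and its approximation `a_hat` (Equations 3.4 and 3.5)."""
    a, a_hat = np.asarray(a, dtype=float), np.asarray(a_hat, dtype=float)
    mae = np.mean(np.abs(a - a_hat))
    mpe = 100.0 * np.mean(np.abs((a - a_hat) / a))
    return mae, mpe
```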

For the datasets in Table 3.1, the range of MAE is from 2.67E−04 to 1.75E−03 and the range

of MPE is from 0.15% to 2.63%. Figure 3.8 shows the classification error curves from two datasets.

“Default” here refers to the Yinyang k-means with random initialization while “CtrReuse” refers to

the Yinyang k-means with center-reuse initialization. Even though center reuse yields a different

error curve, the MAE is lower than 0.002 and the MPE is lower than 3%, indicating that center reuse has little influence on classification quality.

We also validated the minor influence of center reuse on clustering quality through traditional

internal metrics including Davies-Bouldin index [Dav79], Dunn index [Dun74], and Silhouette

coefficient [Rou87]. Since the objective of k-means is to minimize the within-cluster sum of squares


(WCSS), we also included WCSS as a metric. We used Pearson correlation coefficient (PCC) to

measure the correlation between the k-versus-metric value curves with center-reuse initialization and those with random initialization. For the datasets in Table 3.1, the PCCs are higher than 0.969 for all the indexes and higher than 0.9997 for WCSS. The MPEs for all the indexes and WCSS are mostly within 5%. The

results indicate that center reuse also has negligible influence on clustering quality.

3.3.5 Sensitivity Analysis and Insights

Experiments on heuristic search and the attainment of error surfaces have shown the efficacy of our

acceleration techniques. To better take advantage of reuse-based optimization, we provide some

insights by answering the following questions:

• For center reuse across k , should we reuse centers resulting from a smaller k or a larger k ?

• For center reuse across feature sets, should we reuse centers resulting from a smaller feature

set or a larger feature set?

• For reuse-based filtering, how does the number of landmarks affect the speedup? How does

the optimization affect later iterations of k-means? How is the speedup related with k and the

size of feature sets?

We next answer the questions in detail. We report the detailed measurements on four representative datasets.

3.3.5.1 Insights for Center Reuse across k

To compare the speedup of reusing centers from a smaller k with that from a larger k , we compare

two methods: CtrReuseK-inc, which uses k centers to initialize k + k_step centers by randomly adding k_step centers, and CtrReuseK-dec, which uses k centers to initialize k − k_step centers by merging centers through one iteration of k-means. The range of k values is 20 to 1020. The d_step is one. The baseline method is the default method for k-means configuration.

Experiment results show that center reuse by merging centers from a larger k generally gives

larger speedups than that by randomly adding centers from a smaller k . Figures 3.9a and 3.9b

show the detailed results on datasets adult and connect: CtrReuseK-dec gives much larger speedups

than CtrReuseK-inc especially when the dimensions are larger than eight. It is worth noting that

because CtrReuseK-dec starts from the largest k , it has a longer startup time. However, its larger

speedups on other k values lead to a much shorter overall configuration time. This is confirmed by the accumulated time of the configuration process, as Figures 3.9c-3.9f show for dimensions 10 and 58.

Table 3.5 shows the speedups of center reuse across k with different step sizes. Because decreasing k gives better speedups, we list only the speedups from the method CtrReuseK-dec, except for the column ‘10’, where the speedups from CtrReuseK-inc are listed in parentheses. According to the table, the larger the k_step, the smaller the speedup. When the k_step is less than 500,


Figure 3.9: Center reuse across k with increasing/decreasing k on datasets adult and connect. Panels: (a) adult: overall speedups; (b) connect: overall speedups; (c) adult: accumulated time (dim=10); (d) connect: accumulated time (dim=10); (e) adult: accumulated time (dim=58); (f) connect: accumulated time (dim=58).


Table 3.5: Speedups of center reuse across k with different k_step.

Dataset      k_step=10    20     50     100    200    500
sensorless   4.11(3.96)   3.08   1.88   1.71   1.72   1.13
miniboone    4.31(4.06)   3.22   2.20   1.68   1.32   1.15
adult        4.26(3.85)   3.10   2.13   2.06   1.46   1.23
connect      4.95(4.20)   3.49   2.26   1.93   1.56   1.16

Figure 3.10: Center reuse across feature sets with increasing/decreasing d. Panels: (a) adult: overall speedups; (b) adult: accumulated time; (c) connect: overall speedups; (d) connect: accumulated time.

which means the change is less than 50% of the maximum k value, center reuse gives significant

speedups.

3.3.5.2 Insights for Center Reuse across Feature Sets

We conduct similar experiments to examine the influence of reuse directions across feature sets.

We use CtrReuseDim-inc and CtrReuseDim-dec for the increasing and decreasing directions, and d_step for the step size. CtrReuseDim-inc fills extra dimensions with the mean value of each dimension, and CtrReuseDim-dec just removes the extra dimensions. The range of k values is from 20 to 1020 and k_step is 200. The range of d values is from 2 to 59 and d_step is one. The baseline method is the default method.

Experiment results show that center reuse by removing dimensions from centers with a larger d generally performs better than center reuse by adding dimensions to centers with a smaller d. Figure 3.10 shows the detailed results on datasets adult and connect.


Table 3.6: Speedups of center reuse across feature sets with different d_step.

Dataset      d_step=1     2      4      8      16     32
sensorless   1.82(1.26)   1.14   1.03   1.0    -      -
adult        2.62(2.36)   2.10   1.67   1.31   1.22   1.07
connect      2.43(1.94)   2.15   2.06   1.27   1.08   1.01
activity     4.19(3.32)   3.36   2.70   2.20   1.62   1.28

Table 3.7: Speedups and distance savings for the first iteration of k-means with reuse-based filtering across k.

Dataset      #landmarks=200   500          1000         2000         3000         4000         5000
sensorless   2.77(89.7%)      3.07(92.5%)  3.20(93.4%)  3.14(92.9%)  3.06(91.5%)  2.95(90.1%)  2.71(85.3%)
miniboone    1.26(50.5%)      1.29(54.8%)  1.37(57.6%)  1.39(60.0%)  1.41(61.1%)  1.47(61.8%)  1.47(62.1%)
adult        2.52(74.9%)      2.96(81.6%)  3.47(84.9%)  3.62(86.1%)  3.58(85.4%)  3.45(84.0%)  3.29(82.4%)
activity     1.05(16.7%)      1.03(16.7%)  1.01(15.3%)  0.97(12.3%)  0.95(9.9%)   0.94(7.4%)   0.93(5.7%)

Table 3.6 shows the speedups of center reuse across feature sets with different step sizes. Since decreasing d gives similar or even better speedups, we list only the speedups from the method CtrReuseDim-dec, except for the column ‘1’, where the speedups from CtrReuseDim-inc are listed in parentheses. According to the table, the larger the d_step, the smaller the speedup. When the d_step is less than 50% of the maximum size of the feature set, the speedups from center reuse are significant.

3.3.5.3 Insights for Reuse-Based Filtering

This part investigates the influence of the number of landmarks for reuse-based filtering, and the

impact of the filtering on later iterations of k-means.

The number of landmarks determines the number of distances pruned through reuse-based filtering. As the number of landmarks grows, the lower bounds on the distances from each point to the cluster centers become tighter and thus more exact distance calculations can be pruned; however, the overhead of calculating the lower bounds also becomes more significant.

Table 3.7 shows the speedups and corresponding distance savings for the first iteration of k-

means with reuse-based filtering across k . According to the results, decent speedups manifest

when the number of landmarks is around √n. The observation is consistent with previous studies on fast kNN with triangle inequality [Wan11; Lu12; Din15a].

As mentioned in Section 3.2.3, reuse-based filtering helps with the first iteration of k-means, but

could possibly degrade the efficiency of the later iterations of k-means. The results in Section 3.3.2

and 3.3.3 have already shown that the overall effects are positive, leading to large speedups. Fig-

ures 3.11 and 3.12 show the detailed results of adult on three numbers of landmarks (#l m s ) to shed

some insights in further depth when d = 11 and d = 59. Figures 3.11d and 3.12d report that the

looser bounds due to the use of reuse-based filtering in the first iteration cause about 0-20% extra

distance calculations in later iterations. However, the large savings in the first iteration (reflected by


the up to 5.1X speedups in Figure 3.11b and up to 3.4X speedups in Figure 3.12b) still lead to 10-60%

overall distance savings, as Figures 3.11f and 3.12f report. Similar positive results are observed for other d values.

3.4 Related Work

Our work falls in the category of algorithm configuration. Algorithm configuration or tuning is to find

the configuration of a given algorithm that maximizes some performance metric. As a combinatorial

problem, the algorithm configuration space is often enormous, and the tuning is notoriously time-

consuming.

Many studies have attempted to help shorten the process. Most of these prior methods fall into

three categories [Hol12]: (1) using racing procedures to help eliminate candidate configurations

that are significantly outperformed by other configurations [Bir10; Bal07; Bir02], (2) using stochastic

local search (SLS) methods to intelligently search the configuration space [LI10; Hut07; Hoo04],

(3) using sequential model-based optimization methods to build models to help quickly identify

promising configurations [Hut11; Hut09]. These methods mainly aim at reducing the number of

configurations that need to be tried to find appropriate ones. In this work, we tackle the k-means

configuration problem from the angle of computation reuse that is complementary to previous

methods.

Reuse-centric optimization, especially center reuse, at a high level shares a spirit with transfer learning, which stores the knowledge gained in solving a source task and applies it to other problems with similar properties. Both concepts are motivated by the fact that knowledge learned previously can help solve new problems faster or with better solutions [Pan10]. This work materializes the high-level concept by answering some open questions on what knowledge is beneficial to reuse for k-means configuration, and how to reuse that knowledge effectively. The set of novel techniques it proposes is designed to leverage the specific properties of k-means configuration to address those open challenges.

3.5 Conclusions

This chapter introduced the concept of reuse-centric k-means configuration to promote information reuse across the explorations of different configurations of k-means. It was shown that our computation-reuse promotion techniques, reuse-based filtering and center reuse, can largely cut the configuration time of k-means-based data classification. We also introduced a two-phase design which, working in synergy with the other two techniques, reduced the time of the uniform-search-based attainment of classification error surfaces by a factor of 9. In addition, through a series of sensitivity studies and in-depth analyses, we provided some important insights on how to tap into the full potential of the techniques.


Figure 3.11: Reuse-based filtering performance on different k and different numbers of landmarks (#lms) on adult (dim=11). Panels: (a) first iteration ratio over k; (b) first iteration speedups; (c) first iteration distance savings; (d) extra distances in other iterations; (e) clustering speedup over k; (f) overall distance savings.


Figure 3.12: Reuse-based filtering performance on different k and different numbers of landmarks (#lms) on adult (dim=59). Panels: (a) first iteration ratio over k; (b) first iteration speedups; (c) first iteration distance savings; (d) extra distances in other iterations; (e) clustering speedup over k; (f) overall distance savings.


CHAPTER 4

COMPOSABILITY-BASED FAST CNN PRUNING

4.1 Introduction

CNN pruning is an important method to adapt a large CNN model trained on general datasets to

fit a more specialized task or a smaller device. The key challenge is deciding which filters to

remove in order to maximize the quality of the pruned networks while satisfying the constraints. It

is time-consuming due to the enormous configuration space and the slowness of CNN training. The

long CNN pruning process is a major barrier to timely solution delivery in Artificial Intelligence

(AI) product development.

Prior efforts have, however, come mostly from the machine learning community [Li16; Hu16;

Mol16; Luo17a; He18]. They leverage DNN algorithm-level knowledge to reduce the enormous

configuration space to a smaller space (called promising subspace) that is likely to contain a good

solution, and then evaluate these remaining configurations to find the best, as Figure 4.1 illustrates.

Although these prior methods help mitigate the problem, network pruning remains a time-consuming process. One reason is that, despite their effectiveness, no prior techniques can guarantee the inclusion of the desirable configuration in a much-reduced subspace. As a result, to decrease the risk of missing the desirable configuration, practitioners often end up with a still quite large subspace of network configurations that takes days for many machines to explore. It is also often the case that modifications need to be made to the CNN models, datasets, or hardware settings throughout the development process of an AI product; each of these changes could make the result of a prior pruning run obsolete and call for a rerun of the entire pruning process. Our conversations


with AI product developers indicate that the long pruning process is one of the major hurdles to shortening the time-to-market of AI products.

Figure 4.1: Complementary relation with prior work on CNN pruning. Prior works have designed heuristic criteria to quickly determine the importance of a filter [Li16; Hu16; Mol16; Luo17a], or have combined such criteria with reinforcement learning to select the set of promising configurations [He18; Ash17]. This work accelerates the exploration of the remaining promising configurations through computation reuse via composability (block pre-training), supported with a compiler-based framework.

This study distinctively examines the problem from the programming systems perspective.

Specifically, rather than improving the attainment of the promising subspace as all prior work focuses

on, we try to drastically speed up the evaluations of the remaining configurations in the promising

subspace through cross-network computation reuse via a compiler-based framework, a direction

complementary to prior solutions. We achieve the goal through innovations on three fronts.

First, we empirically uncover the existence of composability in the training of a collection of

pruned CNN models, and reveal the opportunity that the composability creates for saving computa-

tions in CNN pruning. The basic observation that leads to this finding is that two CNN networks

in the promising subspace often differ in only some layers. In the current CNN pruning methods,

the two networks are both trained from scratch and then tested for accuracy. A question we ask

is whether the training results of the common layers can be reused across networks to save some

training time. More generally, we view the networks in a promising subspace as compositions of a

set of building blocks (a block is a sequence of CNN layers). The question is if we first pre-train (some

of) these building blocks and then assemble them into the to-be-explored networks, can we shorten

the evaluations of these networks and the overall pruning process? Through a set of experiments,

we empirically validate the hypothesis, based on which, we propose composability-based CNN

pruning to capture the idea of reusing pre-trained blocks for pruning (§ 4.2).

Second, we propose a novel hierarchical compression-based algorithm, which, for a given CNN

and promising subspace, efficiently identifies the set of blocks to pre-train to maximize the benefits

of computation reuse. We prove that identifying the optimal set of blocks to pre-train is NP-hard.

Our proposed algorithm provides a linear-time heuristic solution by applying Sequitur [NM97], a

hierarchical compression algorithm, to the CNN configurations in the promising subspace (§ 4.4).


Finally, based on all those findings, we developed Wootz1, the first compiler-based framework

that, for an arbitrary CNN (in Caffe Prototxt format) and other inputs, automatically generates

TensorFlow code to build Teacher-Student learning structures to materialize composability-based

CNN pruning (§ 4.3,§ 4.5).

We evaluate the technique on a set of CNNs and datasets with various target accuracies. For

ResNet-50 and Inception-V3, it shortens the pruning process by up to 186.7X and 30.2X respectively.

Meanwhile, the models it finds are significantly more compact (up to 70% smaller) than those found by

the default pruning scheme for the same target accuracy (§ 4.6).

4.2 Composability-Based CNN Pruning: Idea and Challenges

The fundamental reason for Wootz to produce large speedups is its effective capitalization of computation reuse, which is built on the composability in CNN pruning empirically unveiled in this study. Two pruned networks in a promising subspace often

differ in only some of the layers. The basic idea of composability-based CNN pruning is to reuse

the training results of the common layers across the pruned networks. Although the idea may look

straightforward, to the best of our knowledge, no prior CNN pruning work has employed such reuse,

probably due to a series of open questions and challenges:

• First, there are bi-directional data dependencies among the layers of a CNN. In CNN training,

for an input image, there is a forward propagation that uses a lower layer’s output, which is

called activation maps, to compute the activation maps of a higher layer; it is followed by a

backward propagation, which updates the weights of a lower layer based on the errors com-

puted with the higher layer’s activation maps. As a result of the bi-directional dependencies,

even just one-layer differences between two networks could cause very different weights to

be produced for a common (either higher or lower) layer in the two networks. Therefore, it

remains unclear whether the training results of a common layer could help with the training

of different networks.

• Second, if a pre-trained layer could help, it is an open question how to maximize the benefits. A

pre-trained sequence of consecutive layers may have a larger impact than a single pre-trained

layer does on the whole network, but it may also take more time to produce and have fewer

chances to be reused. How to determine which sets of layers or sequences of layers to pre-train

to maximize the gains has not been explored before.

• Third, how to pre-train just a piece of a CNN? The standard CNN back-propagation training algorithm uses input labels as the ground truth to compute the errors of the current network configuration and adjust the weights. If we just want to train a piece of a CNN, what ground truth should we use? What software architecture should be built to do the pre-training, and to do it efficiently?

1 The name is after Wootz steel, the legendary pioneering steel alloy developed in the 6th century BC; Wootz blades give the sharpest cuts.


Figure 4.2: Overview of the Wootz framework. Inputs: a CNN to prune, the promising subspace, datasets and meta data, and the objectives of pruning. Components: the hierarchical tuning block identifier (which produces the definitions of tuning blocks), the Wootz compiler (which produces the multiplexing model), the pre-training scripts (which produce the pre-trained tuning blocks), and the exploration scripts. Output: the best network found.

• Fourth, existing DNN frameworks support only the standard DNN training and inference.

Users have to write code to do CNN pruning themselves, which is already complicated for

general programmers. It would add even more challenges to ask them to additionally write

the code to pre-train CNN pieces, and then reuse the results during the evaluations of the

networks.

For the first question, we conduct a series of experiments on 16 large CNNs (four popular

CNN models trained on four datasets). Section 4.6.2 reports the details; here we just state the

key observations. Pre-trained layers bring a network to a much better starting point, making the initial accuracies of the network 50-90% higher than those of the network without pre-trained layers. That leads to 30-100% savings in the training time of the network. Moreover, it helps the network

converge to a significantly higher level of accuracy (by 1%-4%). These findings empirically confirm

the potential of composability-based CNN pruning.

To effectively materialize the potential, we have to address the other three challenges. Wootz

offers the solution.

4.3 Overview of Wootz Framework

This section gives an overview of Wootz. Wootz is a software framework that automatically enables

composability-based CNN pruning. As Figure 4.2 shows, its input consists of four parts:

• The to-be-pruned CNN model, written in Caffe Prototxt (with a minor extension), which is a

user-friendly text format (from Caffe) for CNN model specifications [Jia14].

• The promising subspace, which contains the set of pruned network configurations worth exploring, following the format in Figure 4.3 (a). The subspace may come from the user or some third-party tools that reduce the configuration space for CNN pruning [Hoo11; He18; Ash17].


    ''' An example of a promising subspace specification that contains two
    configurations. Each number is a pruning rate for a convolutional layer.
    For example, the first configuration means the first and third layers are
    pruned with pruning rate 0.3, and the second and fourth layers are not pruned. '''
    configs = [[0.3, 0, 0.3, 0], [0.5, 0, 0.3, 0]]

    ''' The configurations should be either a Numpy array or a Python list that
    can be serialized using Pickle as below. Users only need to provide
    configs_path to the compiler. '''
    pickle.dump(configs, open(configs_path, "wb"))

(a) Promising subspace specifications.

    # Format:
    [min, max] [ModelSize, Accuracy]
    constraint [ModelSize, Accuracy] [<, >, <=, >=] [Value]

    # Example:
    min ModelSize
    constraint Accuracy > 0.8

(b) Pruning objectives specifications.

Figure 4.3: Formats for the specifications of promising subspaces (a) and pruning objectives (b).

• The dataset for training and testing, along with some meta data on the training (e.g., learning

rates, maximum training steps), following the format used in Caffe Solver Prototxt [Caf].

• The objectives of the CNN pruning, including the constraints on model size or accuracy,

following the format shown in Figure 4.3 (b).

The Wootz framework consists of four main components as shown in Figure 4.2. (1) The hierar-

chical tuning block identifier tries to define the set of tuning blocks. A tuning block is a sequence of

pruned consecutive CNN layers taken as a unit for pre-training. Suitable definitions of tuning blocks

help maximize reuse while minimizing the pre-training overhead. (2) From the given CNN model

specified in Prototxt, the Wootz compiler generates a multiplexing model, which is a function written

in TensorFlow that, when invoked, specifies the structure of the full to-be-pruned CNN model, the

network structure—which implements a Teacher-Student scheme—for pre-training tuning blocks,

or pruned networks assembled with pre-trained tuning blocks, depending on the arguments the

function receives. (3) The pre-training scripts are generic Python functions that, when run,

pre-train each tuning block based on the outputs from the first two components of Wootz. (4) The

final component, exploration scripts, explores the promising pruned networks assembled with the

pre-trained tuning blocks. The exploration of a network includes first fine-tuning the entire network

and then testing it for accuracy. The exploration order is automatically picked by the exploration

scripts based on the pruning objectives to produce the best network as early as possible. Both the


pre-training scripts and the exploration scripts can run on one machine or multiple machines in a

distributed environment through MPI.

Wootz is designed to help pruning methods that have their promising subspace known up

front. There are methods that do not provide the subspace explicitly [Zha18d]. They, however, still

need to tune the pruning rate for each layer and the exploration could also contain potentially

avoidable computations. Extending Wootz to harvest those opportunities is a direction worth future

exploration.

Next, we explain the hierarchical tuning block identifier in § 4.4, and the other components in

§ 4.5.

4.4 Hierarchical Tuning Block Identifier

Composability-based CNN pruning faces a trade-off between the pre-training cost and the time

savings the pre-training results bring. The tradeoff depends on the definitions of the unit for pre-

training, that is, the definition of tuning blocks. A tuning block is a unit for pre-training; it consists of

a sequence of consecutive CNN layers pruned at certain rates. It can have various sizes, depending

on the number of CNN layers it contains. The smaller it is, the less pre-training time it takes and the

more reuses it tends to have across networks, but at the same time, its impact on the training time of

a network tends to be smaller.

So for a given promising subspace of networks, a question for composability-based CNN pruning

is how to define the best sets of tuning blocks. The solution depends on the appearing frequencies of

each sequence of layers in the subspace, their pre-training times, and the impact of the pre-training

results on the training of the networks. For a clear understanding of the problem and its complexity,

we define the optimal tuning block definition problem as follows.

4.4.1 Optimal Tuning Block Definition Problem

Let $A$ be a CNN consisting of $L$ layers, represented as $A_1 \cdot A_2 \cdot A_3 \cdots A_L$, where $\cdot$ stands for layer stacking and $A_i$ stands for the $i$-th layer (counting from the input layer). $C = \{A^{(1)}, A^{(2)}, \ldots, A^{(N)}\}$ is a set of $N$ networks derived from filter pruning of $A$, where $A^{(n)}$ represents the $n$-th derived network and $A_i^{(n)}$ stands for the $i$-th layer of $A^{(n)}$, $i = 1, 2, \ldots, L$.

The optimal tuning block definition problem is to identify a set of tuning blocks $B = \{B_1, B_2, \ldots, B_K\}$ such that the following two conditions are met:

1. Every $B_k$, $k = 1, 2, \ldots, K$, is part of a network in $C$; that is, for every $B_k$ there exist $A^{(n)}$, $n \in \{1, 2, \ldots, N\}$, and $l$, $1 \le l \le L - b_k + 1$, such that $B_k = A_l^{(n)} \cdot A_{l+1}^{(n)} \cdots A_{l+b_k-1}^{(n)}$, where $b_k$ is the number of layers contained in $B_k$.

2. $B$ is an optimal choice; that is, $B = \arg\min_B \big( \sum_{k=1}^{K} T(B_k) + \sum_{n=1}^{N} T(A^{(n,B)}) \big)$, where $T(B_k)$ is the time taken to pre-train block $B_k$, and $T(A^{(n,B)})$ is the time taken to train $A^{(n,B)}$ to reach the accuracy objective (see footnote 2); $A^{(n,B)}$ is the block-trained version of $A^{(n)}$ with $B$ as the tuning blocks.

A restricted version of the problem is that only a predefined set of pruning rates (e.g., {30%, 50%,

70%}) are used when pruning a layer in A to produce the set of pruned networks in C —which is a

common practice in filter pruning.

Even this restricted version is NP-hard, provable through a reduction from the classic knapsack problem [Hoc95] (detailed proof omitted for the sake of space). A polynomial-time

solution is hence in general hard to find, if ever possible. The NP-hardness motivates our design of a

heuristic algorithm, which does not aim to find the optimal solution but to come up with a suitable

solution efficiently. The algorithm does not use the training time as an explicit objective to optimize

but focuses on layer reuse. It is a hierarchical compression-based algorithm, described next.

4.4.2 Hierarchical Compression-Based Algorithm

Our algorithm leverages Sequitur [NM97] to efficiently identify the frequent sequences of pruned

layers in the network collection C . As a linear-time hierarchical compression algorithm, Sequitur

infers a hierarchical structure from a sequence of discrete symbols. For a given sequence of symbols,

it derives a context-free grammar (CFG), with each rule in the CFG reducing a repeatedly appearing

string into a single rule ID. Figure 4.4 gives an example. Its top part shows the concatenated sequence

of layers of four networks pruned at various rates; the subscripts of the numbers indicate the pruning

rate, that is, the fraction of the least important filters of a layer that are removed. The lower part in

Figure 4.4 shows the CFG produced by Sequitur on the string. A full expansion of rule r0 would give the original string. The result can also be represented as a Directed Acyclic Graph (DAG), as illustrated in Figure 4.4, with each node corresponding to one rule.
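As a small illustrative sketch (ours, not part of Wootz), the rule frequencies shown in Figure 4.4 can be derived from the grammar by counting how often each rule appears when the start rule is fully expanded; the function and argument names below are assumptions, and the recursion is sufficient for the small grammars Sequitur produces here.

    from collections import Counter

    def rule_frequencies(grammar, start="r0"):
        # grammar: dict mapping a rule ID to its body, a list of symbols
        # (each symbol is either another rule ID or a terminal layer token such as "1(.3)").
        freq = Counter()
        def expand(rule, multiplicity):
            freq[rule] += multiplicity
            for sym in grammar[rule]:
                if sym in grammar:          # non-terminal: recurse into the sub-rule
                    expand(sym, multiplicity)
        expand(start, 1)
        return freq                          # e.g., freq["r8"] == 4 for the grammar in Figure 4.4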

Applying Sequitur to the concatenated sequence of all networks in the promising subspace,

our hierarchical compression-based algorithm gets the corresponding CFG and the DAG. Let R be

the collection of all the rules in the CFG, and S be the solution to the tuning block identification

problem which is initially empty. Our algorithm then heuristically fills S with subsequences of CNN

layers (represented as rules in the CFG) that are worth pre-training.

It does so based on the appearing frequencies of the rules in the promising subspace and their sizes (i.e., the number of layers a rule contains). It employs two heuristics: (1) a rule cannot be put

into S if it appears in only one network (i.e., its appearing frequency is one); (2) a rule is preferred

over its children rules only if that rule appears as often as its most frequently appearing descendant.

The first heuristic is to ensure that the pre-training result of the sequence can benefit more than

one network. The second heuristic is based on the following observation: A pre-trained sequence

typically has a larger impact than its subsequences all together have on the quality of a network;

however, the extra benefits are usually modest. For instance, a ResNet network assembled from 4-block-long pre-trained sequences has an initial accuracy of 0.716, 3.1% higher than the same network assembled from 1-block-long pre-trained sequences. The higher initial accuracy helps save extra training steps (epochs) for the network, but the saving is limited (up to 20% of the overall training time). Moreover, a longer sequence usually has a lower chance to be reused. For these reasons, we employ the aforementioned heuristics to help keep S small and hence the pre-training overhead low, while still achieving a good number of reuses.

Figure 4.4: Sequitur applied to a concatenated sequence of layers of four networks pruned at rates 0%, 30%, and 50%. Notation: N(d) denotes the N-th convolution module pruned by a fraction d of its filters; ❶ marks the end of the first network's sequence (likewise ❷, ❸, ❹). The four networks concatenated into a string:

    1(.3) 2(.3) 3(.3) 4(.5) 5(0) ❶   1(.3) 2(.3) 3(.5) 4(.5) 5(0) ❷   1(.5) 2(.3) 3(.3) 4(.5) 5(0) ❸   1(0) 2(.3) 3(.5) 4(.5) 5(0) ❹

The CFG produced by Sequitur on the string (frequency: rule ID → rule body):

    1: r0 → r1 r2 ❶ r1 r3 ❷ r6 r8 r2 ❸ r7 r8 r3 ❹
    2: r1 → r5 r8      2: r2 → r9 r4       2: r3 → r10 r4     4: r4 → r11 r12
    2: r5 → 1(.3)      1: r6 → 1(.5)       1: r7 → 1(0)       4: r8 → 2(.3)
    2: r9 → 3(.3)      2: r10 → 3(.5)      4: r11 → 4(.5)     4: r12 → 5(0)

The result can also be drawn as a DAG with one node per rule.

2 In our framework, T(x) is not statically known or approximated, but instead explicitly computed (via training) for each x (i.e., $B_k$ or $A^{(n,B)}$).

Specifically, the algorithm takes a post-order (children before parent) traversal of the DAG that

Sequitur produces. (Before that, all edges between two nodes on the DAG are combined into one

edge.) At a node, it checks its frequency. If it is greater than one, it checks whether its frequency

equals the largest frequency of its children. If so, it marks itself as a potential tuning block, unmarks

its children, and continues the traversal. Otherwise, it puts a "dead-end" mark on itself, indicating

that it is not worth going further up in the DAG from this node. When the traversal reaches the root

of the DAG or has no path to continue, the algorithm puts all the potential tuning blocks into S as

the solution and terminates.
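The following is a minimal sketch (ours, not Wootz's implementation) of that traversal. It assumes each DAG node object exposes its rule frequency (freq) and its child nodes (children), and it folds the "dead-end" logic into the two frequency checks applied at every node.

    def select_tuning_blocks(root):
        marked, visited = set(), set()
        def visit(node):
            if node in visited:
                return
            visited.add(node)
            for child in node.children:          # post-order: children before parent
                visit(child)
            if node.freq <= 1:
                return                           # heuristic 1: must be shared by more than one network
            child_freqs = [c.freq for c in node.children]
            if not child_freqs or node.freq == max(child_freqs):
                marked.add(node)                 # heuristic 2: prefer the longer rule over its children
                marked.difference_update(node.children)
        visit(root)
        return marked                            # the set S of tuning blocks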

Note that a side product from the process is a composite vector for each network in the promising

subspace. As a tuning block is put into S , the algorithm, by referencing the CFG produced by Sequitur,

records the ID of the tuning block in the composite vectors of the networks that can use the block.

Composite vectors will be used in the global fine-tuning phase as described in the next section.

The hierarchical compression-based algorithm is designed to be simple and efficient. More

detailed modeling of the time savings and pre-training cost of each sequence for various CNNs could

potentially help yield better definitions of tuning blocks, but it would add significant complexities

and runtime overhead. Our exploration in § 4.6.3 shows that the hierarchical compression-based

algorithm gives a reasonable trade-off.

4.5 Composability-Based Pruning and Wootz Compiler

The core operations in composability-based CNN pruning include the pre-training of tuning blocks and the global fine-tuning of networks assembled with the pre-trained blocks. This section first explains the mechanisms we have designed to support these operations efficiently, and then describes the implementation of the Wootz compiler and scripts, which automatically materialize the mechanisms for an arbitrary CNN.

4.5.1 Mechanisms

4.5.1.1 Pre-Training of Tuning Blocks

The standard CNN back propagation training algorithm uses input labels as the ground truth to

compute errors of the current network and adjusts the weights iteratively. To train a tuning block,

the first question is what ground truth to use to compute errors. Inspired by Teacher-Student


networks [Buc06; Ba14; Hin15], we adopt a similar Teacher-Student mechanism to address the

problem.

We construct a network structure that contains both the pruned block to pre-train and the

original full CNN model. They are put side by side as shown in Figure 4.5 (a) with the input to the

counterpart of the tuning block in the full model also flowing into the pruned tuning block as its

input, and the output activation map of the counterpart block flowing into the pruned tuning block

as the "ground truth" of its output. When the standard back propagation algorithm is applied to the

tuning block in this network structure, it effectively minimizes the reconstruction error between the

output activation maps from the pruned tuning block and the ones from its unpruned counterpart

in the full network. (In CNN pruning, the full model has typically already been trained beforehand to

perform well on the datasets of interest.) This design essentially uses the full model as the "teacher"

to train the pruned tuning blocks. Let $O_k$ and $O'_k$ be the vectorized output activation maps from the unpruned and the pruned tuning block, respectively, and let $W'_k$ be the weights in the pruned tuning block. The optimization objective in this design is $\min_{W'_k} \frac{1}{|O_k|}\,\|O_k - O'_k\|_2^2$. Only the parameters in the pruned tuning block are updated in this phase to ensure the pre-trained blocks are reusable.
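A minimal TensorFlow sketch of this objective follows; it is illustrative rather than the code Wootz generates. teacher_out and student_out are assumed to be the activation-map tensors of the unpruned counterpart and of the pruned tuning block, the pruned block's variables are assumed to live under the scope "pruned_block", and the learning rate is a placeholder.

    import tensorflow as tf

    def block_pretrain_op(teacher_out, student_out, lr=0.2):
        # Freeze the teacher's activation maps; they serve as the "ground truth".
        target = tf.stop_gradient(teacher_out)
        # Mean squared reconstruction error, i.e., (1/|O_k|) * ||O_k - O'_k||_2^2 (averaged over the batch).
        loss = tf.reduce_mean(tf.square(target - student_out))
        # Update only the pruned tuning block's parameters so the block stays reusable.
        block_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="pruned_block")
        train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, var_list=block_vars)
        return loss, train_op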

This Teacher-Student design has three appealing properties. First, it addresses the missing

“ground truth” problem for tuning block pre-training. Second, as the full CNN model runs along with

the pre-training of the tuning blocks, it provides the inputs and "ground truth" for the tuning blocks

on the fly; there is no need to save to storage the activation maps which can be space-consuming

considering the large number of input images for training a CNN. Third, the structure is friendly

for concurrently pre-training multiple tuning blocks. As Figure 4.5 (b) shows, connections can be

added between the full model and multiple pruned blocks; the pre-training of these blocks can then

happen in one run, and the activation maps produced by a block in the full model can be seamlessly

reused across the pre-training of multiple pruned blocks.

4.5.1.2 Global Fine-Tuning

The local training phase outputs a bag of pre-trained pruned tuning blocks, as shown in Figure 4.5 (c) (tuning blocks in the original network could also be included). At the beginning of the global fine-tuning phase is an assembly step, which, logically, assembles these tuning blocks into each of the networks in the promising subspace. Physically, this step just needs to initialize the pruned networks in the promising subspace with the weights in the corresponding tuning blocks. We call the resulting network a block-trained network. Recall that one of the side products of the tuning block identification step is a composite vector for each network, which records the tuning blocks the network can use; these vectors are used in this assembly step. Figure 4.5 (d) gives a conceptual illustration; three networks are assembled with different sets of pre-trained tuning blocks.
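A minimal sketch of that initialization (illustrative, not Wootz's code) restores each tuning block's variables from the checkpoint produced during pre-training; block_ckpts is an assumed mapping from a block's variable scope to its checkpoint path.

    import tensorflow as tf

    def init_from_block_checkpoints(sess, block_ckpts):
        # block_ckpts: dict mapping a tuning block's variable scope -> its checkpoint path
        for scope, ckpt_path in block_ckpts.items():
            block_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope)
            tf.train.Saver(var_list=block_vars).restore(sess, ckpt_path)
        # Variables not covered by any block are expected to be initialized separately.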

As a pruned block with only a subset of parameters has a smaller model capacity, a global fine-

tuning step is required to further recover the accuracy performance of a block-trained network.

This step runs the standard CNN training on the block-trained networks. All the parameters in

the networks are updated during the training. Compared with training a default pruned network,

fine-tuning a block-trained network usually takes much less training time, as the network starts with a much better set of parameter values, as shown in § 4.6.

Figure 4.5: Illustration of composability-based network pruning. Panels: (a) pre-training; (b) concurrent pre-training; (c) pre-trained blocks; (d) global fine-tuning. Ellipses are pruned tuning blocks; rectangles are original tuning blocks; diamonds refer to the activation-map reconstruction error. Different colors of pruned tuning blocks correspond to different pruning options.

4.5.2 Wootz Compiler and Scripts

The Wootz compiler and scripts offer an automatic way to materialize the mechanisms for an arbitrary

CNN model. The proposed method is not restricted to a particular DNN framework, though we

demonstrate its ability using TensorFlow.

We first provide brief background on TensorFlow [Aba15] that is closely relevant to this part.

TensorFlow offers a set of APIs for defining, training, and evaluating a CNN. To specify the structure

of a CNN, one needs to call APIs in a Python script, which arranges a series of operations into a

computational graph. In a TensorFlow computational graph, nodes are operations that consume and

produce tensors, and edges are tensors that represent values flowing through the graph. CNN model

parameters are held in TensorFlow variables, which represent tensors whose values can be changed

by operations. Because a CNN model can have hundreds of variables, it is a common practice to

name variables in a hierarchical way using variable scopes to avoid name clashes. A popular option

to store and reuse the parameters of a CNN model is TensorFlow checkpoints. Checkpoints are binary

files that map variable names to tensor values. The tensor value of a variable can be restored from a

checkpoint by matching the variable name.

TensorFlow APIs with other assistant libraries (e.g., Slim [Sil16]) offer conveniences for standard

CNN model training and testing, but not for CNN pruning, let alone composability-based pruning.

Asking a general programmer to implement composability-based pruning in TensorFlow for each

CNN model would place a tremendous burden on the programmer. She would need to write code to

identify tuning blocks, create TensorFlow code to implement the customized CNN structures to


pre-train each tuning block, generate checkpoints, and use them when creating the block-trained

CNN networks for global fine-tuning.

Wootz compiler and scripts mitigate the difficulty by automating the process. The fundamental

motivating observation is that the code for two different CNN models follows the same pattern. Differences are mostly in the code specifying the structure of the CNN models (both the original structure and the extended structures for pre-training and global fine-tuning). The idea is to build code templates and

use the compiler to automatically adapt the templates based on the specifications of the models.

4.5.2.1 Multiplexing Model

An important decision in our design of Wootz is to take Prototxt as the format of the input to-be-pruned CNN model. If the input were TensorFlow code, our compiler, which has to derive code for the pre-training and fine-tuning of the pruned models, would need to analyze users' TensorFlow code, which could be written in various ways and would be complex to analyze. Prototxt, in contrast, has a clean, fixed format. It is easy for programmers to write and simple for our compiler to analyze.

Given a to-be-pruned CNN model specified in Prototxt, the compiler first generates the multi-

plexing model, which is a piece of TensorFlow code defined as a Python function. It is multiplexing

in the sense that an invocation of the code specifies the structure of the original CNN model, or the

structure for pre-training, or the global fine tuning model; which of the three modes is used at an

invocation of the multiplexing model is determined by one of its input arguments, mode_to_use.

The multiplexing design allows easy code reuse as the three modes share much common code

for model specifications. Another argument, prune_info, conveys to the multiplexing model the

pruning information, including the set of tuning blocks to pre-train in this invocation and their

pruning rates.

The compiler-based code generation needs to provide mainly two-fold support. It needs to

map CNN model specifications in Prototxt to TensorFlow APIs. Our implementation, specifically,

generates calls to TensorFlow-Slim API [SG16] to add various CNN layers based on the parsing

results of the Prototxt specifications. The other support is to generate the code to also specify the

derived network structure for pre-training each tuning block contained in prune_info. Note that

the layers contained in a tuning block are the same as a section of the full model except for the

number of filters in the layers and the connections flowing into the block. The compiler hence

emits code for specifying each of the CNN layers again, but with connections flowing from the full

network, and sets the "depth" argument of the layer-adding API call (a TensorFlow-Slim API [SG16])

with the info retrieved from prune_info such that the layer’s filters can change with prune_info at

different calls of the multiplexing model. In addition, the compiler encloses the code with condition

checks to determine, based on prune_info, at an invocation of the multiplexing model whether the

layer should be actually added into the network for pre-training. The code generation for the global

fine-tuning is similar but simpler. In such a form, the generated multiplexing model is adaptive to

the needs of different modes and the various pruning settings.
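To make the idea concrete, here is a much-simplified sketch of what a generated multiplexing model might look like. The two stand-in layers, the mode names, and the scope names are all illustrative assumptions, not the code the Wootz compiler actually emits.

    import tensorflow as tf
    slim = tf.contrib.slim

    def multiplexing_model(inputs, mode_to_use, prune_info=None):
        # mode_to_use selects among 'original', 'pretrain', and 'finetune' (illustrative names).
        def depth(layer_name, full_depth):
            # Scale a layer's filter count by its pruning rate from prune_info, if any.
            return int(full_depth * (1.0 - (prune_info or {}).get(layer_name, 0.0)))

        # The full (original) model; two layers stand in for the real network structure.
        conv1 = slim.conv2d(inputs, 64, [3, 3], scope='conv1')
        conv2 = slim.conv2d(conv1, 128, [3, 3], scope='conv2')

        if mode_to_use == 'original':
            return conv2
        if mode_to_use == 'pretrain':
            # A pruned copy of conv2 fed by the full model's conv1 output; conv2's
            # activation maps serve as the "ground truth" (Teacher-Student scheme).
            pruned_conv2 = slim.conv2d(conv1, depth('conv2', 128), [3, 3], scope='pruned_block/conv2')
            return conv2, pruned_conv2
        if mode_to_use == 'finetune':
            # The assembled pruned network: every layer's depth follows prune_info.
            p1 = slim.conv2d(inputs, depth('conv1', 64), [3, 3], scope='ft/conv1')
            return slim.conv2d(p1, depth('conv2', 128), [3, 3], scope='ft/conv2')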


Once the multiplexing model is generated, it is registered at the nets factory in the Slim Model Library [Sil16] with its unique model name. The nets factory is part of the functional programming style that the Slim Model Library is based on. It contains a dictionary mapping a model name to its corresponding model function, for easy retrieval and use of the models in other programs.

4.5.2.2 Pre-Training Scripts

The pre-training scripts contain a generic pre-training Python code and a wrapper that is adapted

from a Python template by the Wootz Compiler to the to-be-pruned CNN model and meta data. The

pre-training Python code retrieves the multiplexing model from nets factory based on the registered

name, and repeatedly invokes the model function with the appropriate arguments, with each call

generating one of the pre-train networks. After defining the loss function, it launches a TensorFlow

session to run the pre-training process.

The wrapper calls the pre-training Python code with required arguments such as model name

and the set of tuning blocks to train. As the tuning blocks coexisting in a pruned network cannot

have overlapping layers, one pruned network can only enable the training of a limited set of tuning

blocks. We design a simple algorithm to partition the entire set of tuning blocks returned by the

Hierarchical Tuning Block Identifier into groups. The pre-training Python script is called to train

only one group at a time. The partition algorithm is as follows:

    # Inputs:  B - the entire set of tuning blocks
    # Outputs: G - the set of groups of tuning blocks
    B.sort(key=lowest_conv_layer)      # sort by the lowest conv layer each block contains
    G = [[B[0]]]
    for b in B[1:]:
        for g in G:
            if not any(overlap(b, e) for e in g):
                g.append(b)            # b shares no layer with this group
                break
        else:
            G.append([b])              # b overlaps every existing group; start a new one

The meta data contains the training configurations such as dataset name, dataset directory,

learning rate, maximum training steps and batch size for pre-training of tuning blocks. The set of

options to configure are predefined, similar to the Caffe Solver Prototxt [Caf]. The compiler parses

the meta data and specifies those configurations in the wrapper.

Executing the wrapper produces pre-trained tuning blocks that are stored as TensorFlow check-

points. The mapping between the checkpoint files and the trained tuning blocks is also recorded for

the model variable initialization in the global fine-tuning phase. The pre-training script can run on

a single node or multiple nodes in parallel to concurrently train multiple groups through MPI.


4.5.2.3 Exploration Scripts

Exploration scripts contain a generic global fine-tuning Python code and a Python-based wrapper.

The global fine-tuning code invokes the multiplexing model to generate the pruned network ac-

cording to the configuration to evaluate. It then initializes the network through the checkpoints

produced in the pre-train process and launches a TensorFlow session to train the network.

In addition to feeding the global fine-tuning Python code with required arguments (e.g. the con-

figuration to evaluate), the Python-based wrapper provides code to efficiently explore the promising

subspace. The order of the exploration is dynamically determined by the objective function.

The compiler first parses the file that specifies the objective of pruning to get the metric that

needs to be minimized or maximized. The order of explorations is determined by the corresponding

MetricName. In case the MetricName is ModelSize, the best exploration order is to start from the

smallest model and proceed to larger ones. If the MetricName is Accuracy, the best exploration order

is the opposite order as a larger model tends to give a higher accuracy.

To facilitate concurrent explorations on multiple machines, the compiler generates a task as-

signment file based on the order of explorations and the number of machines to use specified by

the user in the meta data. Let c be the number of configurations to evaluate and p the number of machines available; the i-th node evaluates the (i + p * j)-th smallest (or largest) model, where 0 ≤ j ≤ ⌊c/p⌋.
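A minimal sketch of that round-robin assignment, with 0-indexed nodes for simplicity (names are illustrative):

    def tasks_for_node(i, p, c):
        # c configurations sorted by the exploration order; p machines; returns the indices node i evaluates.
        return list(range(i, c, p))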

4.6 Evaluations

We conduct a set of experiments to examine the efficacy of Wootz. Our experiments are designed to answer the following major questions: 1) Does pre-training the tuning blocks of a CNN help the training of that CNN reach a given accuracy sooner? We refer to this as the composability hypothesis; its validity is the prerequisite for composability-based CNN pruning to work. 2) How much benefit can we get from composability-based CNN pruning in both the speed and the quality of network pruning, with the pre-training overhead counted? 3) How much extra benefit can we get from the hierarchical tuning block identifier?

We first describe the experiment settings (datasets, learning rates, machines, etc.) in § 4.6.1,

then report our experiment results in § 4.6.2 and § 4.6.3 to answer each of the three questions.

4.6.1 Experiment Settings

4.6.1.1 Models and Datasets

Our experiments use four popular CNN models: ResNet-50 and ResNet-101, as representatives of

the Residual Network family [He16], and Inception-V2 and Inception-V3, as representatives of the

Inception family [Sze15]. They have 50, 101, 34, 48 layers respectively. These models represent a

structural trend in CNN designs, in which, several layers are encapsulated into a generic module of

a fixed structure—which we call convolution module—and a network is built by stacking many such

modules together. Such CNN models hold the state-of-the-art accuracy in many challenging deep learning tasks. The structures of these models are described in input Caffe Prototxt files (see footnote 3) and converted to the multiplexing models by the Wootz compiler.

Table 4.1: Dataset statistics. The three Size columns give the total, training, and test set sizes; the last four columns give the accuracies of the trained full models.

Dataset              Total      Train      Test    Classes  ResNet-50  ResNet-101  Inception-V2  Inception-V3
General
  ImageNet [Rus15]   1,250,000  1,200,000  50,000  1000     0.752      0.764       0.739         0.780
Special
  Flowers102 [Nil08] 8,189      6,149      2,040   102      0.973      0.975       0.972         0.968
  CUB200 [Wel10]     11,788     5,994      5,794   200      0.770      0.789       0.746         0.760
  Cars [Kra13]       16,185     8,144      8,041   196      0.822      0.845       0.789         0.801
  Dogs [Kho11]       20,580     12,000     8,580   120      0.850      0.864       0.841         0.835

For preparation, we adapt the four CNN models trained on ImageNet [Rus15] (ILSVRC 2012) to

each of four specific image classification tasks with the domain-specific datasets, Flowers102 [Nil08],

CUB200 [Wel10], Cars [Kra13], and Dogs [Kho11]. This gives us 16 trained full CNN models. The accuracies of the trained ResNets and Inceptions on the test datasets are listed in the Accuracy columns of Table 4.1.

The four datasets for CNN pruning are commonly used in fine-grained recognition [Kra16; Fu17;

Mol16; How17; Zha17], which is a typical usage scenario of CNN pruning. Table 4.1 reports the

statistics of the four datasets, including the data size for training (Train), the data size for testing

(Test), and the number of classes (Classes). For all experiments, network training is performed on

the training sets while accuracy results are reported on the testing sets.

4.6.1.2 Baseline for Comparison

In CNN pruning, the full CNN model to prune has typically been already trained on the datasets

of interest. When filters in the CNN are pruned, a new model with fewer filters is created, which

inherits the remaining parameters of the affected layers and the unaffected layers in the full model.

The promising subspace consists of such models. The baseline approach trains these models as they

are. Although there are prior studies on accelerating CNN pruning, what they propose are all various

ways to reduce the configuration space to a promising subspace. To the best of our knowledge, when

exploring the configurations in the promising subspace, they all use the baseline approach. As our

method is the first for speeding up the exploration of the promising space, we compare our results

with those from the baseline approach.

We refer to a pruned network in the baseline approach as a default network, and to the one initialized with pre-trained tuning blocks in our method as a block-trained network.

4.6.1.3 Promising Subspace

The 16 trained CNNs contain up to hundreds of convolutional layers. A typical practice is to use

the same pruning rate for the convolutional layers in one convolution module. We adopt the same

strategy. The importance of a filter is determined by its ℓ1 norm, as previous work [Li16] proposes.

3 We add to Prototxt a new construct "module" for specifying the boundaries of convolution modules.


Following prior CNN pruning practice [Li16; Luo17a], the top layer of a convolution module is kept

unpruned; it helps ensure the dimension compatibility of the module.
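For illustration, the following is a minimal sketch of ℓ1-norm filter ranking for one convolutional layer (our own sketch of the criterion from [Li16], not Wootz's code); weights is assumed to have shape (height, width, in_channels, num_filters).

    import numpy as np

    def least_important_filters(weights, prune_rate):
        # One l1 score per filter; the filters with the smallest scores are pruned first.
        l1_scores = np.abs(weights).sum(axis=(0, 1, 2))
        num_to_prune = int(round(prune_rate * l1_scores.size))
        return np.argsort(l1_scores)[:num_to_prune]   # indices of the filters to remove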

There are many ways to select the promising subspace, i.e., the set of promising configurations

worth evaluating. Previous works select configurations either manually [Li16; Luo17a] or based on

reinforcement learning with various rewards or algorithm design [He18; Ash17]. As that is orthogo-

nal to the focus of this work, to avoid bias from that factor, our experiment forms the promising

spaces through random sampling [Ber12] of the entire pruning space. A promising space contains

500 pruned networks, whose sizes follow a close-to-uniform distribution. In the experiments, the

pruning rate for a layer can be one of Γ = {30%, 50%, 70%}.
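A rough sketch of such random sampling is shown below; it is illustrative only and, unlike the actual experiments, it does not enforce the close-to-uniform distribution of model sizes. The rate 0 stands for leaving a module unpruned.

    import random

    GAMMA = [0.0, 0.3, 0.5, 0.7]   # 0 means the convolution module is left unpruned

    def sample_promising_subspace(num_modules, size=500, seed=0):
        rng = random.Random(seed)
        return [[rng.choice(GAMMA) for _ in range(num_modules)] for _ in range(size)]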

4.6.1.4 Objective of Pruning

There are different pruning objectives including minimizing model size, computational cost, mem-

ory footprint or energy consumption. Even though an objective of pruning affects the choice of the

best configuration, all objectives require the evaluation of the set of promising configurations. Our

composability-based CNN pruning aims at accelerating the training of a set of pruned networks

and thus can work with any objective of pruning.

For demonstration purposes, we set the objective of pruning as finding the smallest network (min ModelSize) that meets a given accuracy threshold (Accuracy >= thr_acc). We get a spectrum of thr_acc values by varying the accuracy drop rate α relative to the full model's accuracy from -0.02 to 0.08. We

include negative drop rates because it is possible that pruning makes the model more accurate.

4.6.1.5 Meta Data on Training

The meta data on the training in both the baseline approach and our composability-based approach

are as follows. Pre-training of tuning blocks takes 10,000 steps for all ResNets, with a batch size 32,

a fixed learning rate 0.2, and a weight decay 0.0001; it takes 20,000 steps for all Inceptions, with

batch size 32, a fixed learning rate 0.08, and a weight decay 0.0001. The global fine-tuning in the

composability-based approach and the network training in the baseline approach uses the same

training configurations: max number of steps 30,000, batch size 32, weight decay 0.00001, fixed

learning rate 0.0014.

All the experiments are performed with TensorFlow 1.3.0 on machines each equipped with a

16-core 2.2GHz AMD Opteron 6274 (Interlagos) processor, 32 GB of RAM and an NVIDIA K20X GPU

with 6 GB of DDR5 memory. One network is trained on one GPU.

4.6.2 Validation of the Composability Hypothesis

We first present empirical validations of the composability hypothesis (i.e., pre-training tuning blocks

helps CNN reach an accuracy sooner) as its validity is the prerequisite for the composability-based

CNN pruning to work.

4 We experimented with other learning rates and dynamic decay schemes. No single choice works best for all networks. We decided on 0.001 as it gives the overall best results for the baseline approach.


Figure 4.6: Accuracy curves of the default and block-trained networks on dataset CUB200. Each network has the 70% least important filters pruned at all convolution modules. Panels: (a) ResNet-50; (b) Inception-V3.

Table 4.2: Median accuracies of default networks (init, final) and block-trained networks (init+, final+).

Model          Accuracy type   Flowers102   CUB200   Cars    Dogs
ResNet-50      init            0.035        0.012    0.012   0.010
               init+           0.926        0.662    0.690   0.735
               final           0.962        0.707    0.800   0.754
               final+          0.970        0.746    0.821   0.791
ResNet-101     init            0.048        0.021    0.009   0.028
               init+           0.932        0.698    0.663   0.733
               final           0.968        0.741    0.832   0.785
               final+          0.977        0.767    0.844   0.814
Inception-V2   init            0.030        0.011    0.011   0.010
               init+           0.881        0.567    0.552   0.630
               final           0.960        0.705    0.785   0.732
               final+          0.966        0.725    0.806   0.771
Inception-V3   init            0.029        0.011    0.009   0.012
               init+           0.866        0.571    0.542   0.563
               final           0.959        0.711    0.796   0.728
               final+          0.965        0.735    0.811   0.755

Figure 4.7: Accuracies of pruned networks of ResNet-50 after training. The model size of the full ResNet-50 is 25.6 million. Panels: (a) Flowers102; (b) Cars.


Table 4.2 reports the median of the initial and final accuracies of all 500 block-trained networks

and their default counterparts for each of the models on every dataset. The mean is very close (less

than 1%) to the median in all the settings. In this experiment, the tuning blocks are simply the CNN

modules in each network. Overall, block-trained networks yield better final accuracies than default

networks do with one-third less training time.

To show details, the two graphs in Figure 4.6 give the accuracy curves attained during the training of one of the pruned networks derived from ResNet-50 and from Inception-V3, respectively. Dataset CUB200 is used.

The initial accuracies (init) are close to zero for the default version, while 53.4% and 40.5% for

the block-trained version (init+). Moreover, the default version gets only 65.3% and 67.3% final

accuracies (final) respectively, while the block-trained version achieves 72.5% and 70.5% after only

two-thirds of the training time. Results on other pruned networks show a similar trend.

The results offer strong evidence for the composability hypothesis, showing that pre-training the

tuning blocks of a CNN can indeed help the training of that CNN reach a given accuracy sooner.

The benefits do not come for free; overhead is incurred by the pre-training of the tuning blocks. We

next report the performance of Wootz as a whole.

4.6.3 Results of Wootz

We first evaluate the performance of composability-based network pruning and then report the

extra benefits from the hierarchical tuning block identifier.

4.6.3.1 Basic Benefits

To measure the basic benefits from the

composability-based method, these experiments use every convolution module in these networks

as a tuning block. The extra benefits from hierarchical tuning block identification are reported later.

Figure 4.7 shows the final accuracies of all the 500 ResNet-50 variants trained with or without leveraging composability on the Flowers102 and Cars datasets. For reference, we also plot the

accuracies of the well-trained full ResNet-50 on the two datasets. The block-trained network gives a

clearly better final accuracy overall, which echoes the results reported in the previous subsection.

Tables 4.3 and 4.4 report the comparisons between the block-trained version and the default

version, in both speed and network size, at various levels of tolerable accuracy drop rate α

(negative means higher accuracy than the large network gives). The results are collected when 1, 4,

or 16 machines are used for concurrent training for both the baseline and our method (indicated by

the "#nodes" column). The time of the block-trained version already takes the pre-training time of

tuning blocks into account ("overhead" in Tables 4.3 and 4.4 show the percentage in overall time).

For the objective of pruning, the exploration order Wootz adopts is to start from the smallest models

and proceed to larger ones.

The results show that the composability-based method avoids up to 99.6% of trial configurations

and reduces the evaluation time by up to 186X for ResNet-50; up to 96.7% reduction and 30X

speedups for Inception-V3. The reduction of trial configurations is because the method improves


the accuracy of the pruned networks as Figure 4.7 shows. As a result, the exploration meets a desirable

configuration sooner. For instance, in Flower102 (α = 0), the third smallest network can already

reach the target accuracy in the block-trained version while the 297th network meets the target in

the default version. This not only shortens the exploration time, but also yields more compact (up

to 70% smaller) networks as the “model size” columns in Tables 4.3 and 4.4 show. Another reason for

the speedup is that the training of a block-trained network takes fewer iterations to reach its final

accuracy level than the default version, as Figure 4.6 has illustrated. Even when configurations are

not reduced (e.g., Flowers102, α = −1%), the block-trained exploration finishes sooner.

Table 4.5 shows the speedups by composability-based pruning with different subspace sizes.

The speedups are higher as the number of configurations to explore increases. It is because the

time for pre-training tuning blocks weighs less as the total time increases and the reduction of

configurations becomes more significant for a larger set. Another observation is that, when the

number of configurations is only four, there is still a significant speedup in most cases. The block

training time is the time spent on pre-training all the tuning block variants (48 for ResNet-50 and 27

for Inception-V3). The speedup could be higher if the tuning block identifier is applied, as shown next.

4.6.3.2 Extra Benefits from Tuning Blocks Identification

Hierarchical tuning block identifier balances the overhead of training tuning blocks and the time

savings they bring to the fine-tuning of pruned networks. Table 4.6 reports the extra speedups

brought when it is used.

For datasets Flowers102 and CUB200, we experiment with two types of collections of configu-

rations with N = 8. The first type, “collection-1”, is a randomly sampled collection as mentioned

earlier, and the second type, “collection-2”, is attained by setting one pruning rate for a sequence

of convolution modules, similar to the prior work [Li16] to reduce module-wise meta-parameters.

For each type, we repeat the experiments five times with a new collection created each time. Each

tuning block identified from the first collection tends to contain only one convolution module due

to the independence in choosing the pruning rate for each module. But the average number of

tuning blocks is less than the total number of possible pruned convolution modules (41 versus 48

for ResNet-50 and 27 versus 33 for Inception-V3) because of the small collection size. The latter

one has tuning blocks that contain a sequence of convolution modules as they are set to use one

pruning rate.

The extra speedups from the algorithm are substantial for both types of collections, but larger on the second type, because some longer, frequently shared tuning blocks benefit many networks in those collections. Because some tuning blocks selected by the algorithm are sequences of convolution modules that frequently appear in the collections, the total number of tuning blocks becomes smaller (e.g., 23 instead of 27 on Inception-V3).


Table 4.3: Speedups and configuration savings for ResNet-50 by composability-based pruning (when 1, 4, or 16 machines are used for both the baseline and the composability-based method, as the "#nodes" column indicates). Notations are at the table bottom. Columns: Dataset; α; #nodes; thr_acc; #configs (base, comp); time in hours (base, comp); model size (base, comp); speedup (X); overhead.

Flowers102-1%

1416

0.983500500500

500500500

2858.7718.1184.9

1912.7481.0125.5

100% 100%1.51.51.5

0.4%0.5%1.8%

0%1416

0.973297300304

3416

1639.4412.6103.3

16.95.24.7

45.4% 29.3%97.079.322.0

40.4%43.5%48.3%

1%1416

0.9636816

1416

31.010.45.2

8.33.22.9

29.6% 27.6%3.73.31.8

82.8%70.6%78.3%

CUB2004%

1416

0.739323324336

2416

1807.3454.0118.7

12.73.13.1

46.6% 28.5%142.3146.538.3

53.7%74.4%74.4%

5%1416

0.731297300304

1416

1654.7418.8105.5

8.92.82.7

45.4% 27.6%185.9149.639.1

77.1%81.4%83.7%

6%1416

0.724154156160

1416

840.1214.253.8

8.32.62.5

38.0% 27.6%101.282.421.5

82.6%86.7%89.7%

Cars-1%

1416

0.830500500500

100100112

2864.9720.4185.3

362.490.927.1

100% 35.7%7.97.96.8

1.9%2.5%8.4%

0%1416

0.822332332336

111216

1848.6461.4115.9

44.412.15.2

46.9% 30.4%41.638.122.3

15.4%18.8%44.0%

1%1416

0.814189192192

246

1026.4259.765.5

12.84.94.1

40.4% 28.5%80.253.016.0

53.4%46.7%55.7%

Dogs6%

1416

0.799500500500

123124128

2848.1709.8178.0

441.1111.228.3

60.0% 36.9%6.56.46.3

1.6%2.0%8.1%

7%1416

0.791434436448

707280

2445.4606.2149.3

251.863.918.0

51.9% 34.2%9.79.58.3

2.7%3.6%12.7%

8%1416

0.782297300304

111216

1632.8411.7102.4

42.310.13.2

45.4% 30.4%38.640.832.0

16.2%22.7%71.6%

* thr_acc: the accuracy corresponding to an accuracy drop rate α. base: the baseline approach. comp: the composability-based approach. speedup: Time_base / Time_comp, with the overhead counted in Time_comp. overhead: the block training time over the total time of comp.


Table 4.4: Speedups and configuration savings for Inception-V3 by composability-based pruning.

Dataset | α | #nodes | thr_acc | #configs (base) | #configs (comp) | time (h, base) | time (h, comp) | model size (base/comp) | speedup (X) | overhead
Flowers102 | -1% | 1 | 0.978 | 500 | 500 | 3018.8 | 2023.5 | 100% / 100% | 1.5 | 0.5%
Flowers102 | -1% | 4 | 0.978 | 500 | 500 | 756.7 | 508.1 | 100% / 100% | 1.5 | 0.7%
Flowers102 | -1% | 16 | 0.978 | 500 | 500 | 194.8 | 133.6 | 100% / 100% | 1.5 | 2.7%
Flowers102 | 0% | 1 | 0.968 | 244 | 10 | 1428.6 | 47.3 | 43.2% / 32.4% | 30.2 | 23.3%
Flowers102 | 0% | 4 | 0.968 | 244 | 12 | 358.2 | 13.9 | 43.2% / 32.4% | 25.8 | 26.4%
Flowers102 | 0% | 16 | 0.968 | 256 | 16 | 94.8 | 6.5 | 43.2% / 32.4% | 14.6 | 56.4%
Flowers102 | 1% | 1 | 0.958 | 27 | 1 | 152.6 | 13.9 | 33.9% / 31.0% | 11.0 | 79.0%
Flowers102 | 1% | 4 | 0.958 | 28 | 4 | 39.6 | 5.8 | 33.9% / 31.0% | 6.8 | 63.3%
Flowers102 | 1% | 16 | 0.958 | 32 | 16 | 11.2 | 5.6 | 33.9% / 31.0% | 2.2 | 71.0%
CUB200 | 4% | 1 | 0.720 | 74 | 3 | 420.2 | 21.9 | 41.4% / 33.7% | 19.2 | 49.8%
CUB200 | 4% | 4 | 0.720 | 76 | 4 | 106.4 | 6.7 | 41.4% / 33.7% | 15.9 | 54.5%
CUB200 | 4% | 16 | 0.720 | 80 | 16 | 27.6 | 6.0 | 41.4% / 33.7% | 4.6 | 60.6%
CUB200 | 5% | 1 | 0.710 | 44 | 1 | 247.8 | 14.1 | 38.5% / 31.5% | 17.6 | 77.5%
CUB200 | 5% | 4 | 0.710 | 44 | 4 | 61.7 | 5.4 | 38.5% / 31.5% | 11.4 | 67.6%
CUB200 | 5% | 16 | 0.710 | 48 | 16 | 16.4 | 5.2 | 38.5% / 31.5% | 3.2 | 70.6%
CUB200 | 6% | 1 | 0.700 | 29 | 1 | 162.5 | 12.8 | 35.9% / 31.0% | 12.7 | 85.1%
CUB200 | 6% | 4 | 0.700 | 32 | 4 | 44.5 | 5.3 | 35.9% / 31.0% | 8.4 | 68.7%
CUB200 | 6% | 16 | 0.700 | 32 | 16 | 10.8 | 5.1 | 35.9% / 31.0% | 2.1 | 71.9%
Cars | -1% | 1 | 0.811 | 271 | 20 | 1586.8 | 85.6 | 40.1% / 33.5% | 18.5 | 12.8%
Cars | -1% | 4 | 0.811 | 272 | 20 | 398.1 | 22.4 | 40.1% / 33.5% | 17.8 | 16.3%
Cars | -1% | 16 | 0.811 | 272 | 32 | 99.4 | 11.1 | 40.1% / 33.5% | 9.0 | 32.8%
Cars | 0% | 1 | 0.801 | 84 | 3 | 480.3 | 21.8 | 36.9% / 31.3% | 22.0 | 50.2%
Cars | 0% | 4 | 0.801 | 84 | 4 | 120.5 | 7.2 | 36.9% / 31.3% | 16.7 | 50.6%
Cars | 0% | 16 | 0.801 | 96 | 16 | 33.8 | 6.7 | 36.9% / 31.3% | 5.0 | 54.7%
Cars | 1% | 1 | 0.791 | 33 | 1 | 186.4 | 14.2 | 34.4% / 31.0% | 13.1 | 77.0%
Cars | 1% | 4 | 0.791 | 36 | 4 | 50.7 | 6.8 | 34.4% / 31.0% | 7.5 | 54.0%
Cars | 1% | 16 | 0.791 | 48 | 16 | 16.4 | 6.2 | 34.4% / 31.0% | 2.6 | 59.1%
Dogs | 6% | 1 | 0.776 | 416 | 201 | 2470.7 | 786.0 | 100% / 47.9% | 3.1 | 1.4%
Dogs | 6% | 4 | 0.776 | 416 | 204 | 618.2 | 199.3 | 100% / 47.9% | 3.1 | 1.8%
Dogs | 6% | 16 | 0.776 | 416 | 208 | 153.2 | 52.7 | 100% / 47.9% | 2.9 | 6.9%
Dogs | 7% | 1 | 0.766 | 311 | 129 | 1822.2 | 503.2 | 56.0% / 41.4% | 3.6 | 2.2%
Dogs | 7% | 4 | 0.766 | 312 | 132 | 456.1 | 128.0 | 56.0% / 41.4% | 3.6 | 2.8%
Dogs | 7% | 16 | 0.766 | 320 | 144 | 116.2 | 36.4 | 56.0% / 41.4% | 3.2 | 10.0%
Dogs | 8% | 1 | 0.756 | 201 | 82 | 1164.1 | 322.9 | 47.9% / 39.0% | 3.6 | 3.4%
Dogs | 8% | 4 | 0.756 | 204 | 84 | 294.8 | 83.1 | 47.9% / 39.0% | 3.5 | 4.4%
Dogs | 8% | 16 | 0.756 | 208 | 96 | 75.0 | 26.1 | 47.9% / 39.0% | 2.9 | 13.9%

Table 4.5: Speedups by composability-based pruning with different subspace sizes.

Dataset | alpha | subspace size | ResNet-50 base time (h) | ResNet-50 comp time (h) | ResNet-50 speedup (X) | Inception-V3 base time (h) | Inception-V3 comp time (h) | Inception-V3 speedup (X)
Flowers102 | 0% | 4 | 22.7 | 13.4 | 1.7 | 20.3 | 16.8 | 1.2
Flowers102 | 0% | 16 | 90.9 | 12.8 | 7.1 | 76.7 | 20.6 | 3.7
Flowers102 | 0% | 64 | 364.8 | 21 | 17.4 | 224.7 | 25.4 | 8.8
Flowers102 | 0% | 256 | 1460.7 | 13.5 | 108.2 | 809.4 | 40.7 | 19.9
CUB200 | 3% | 4 | 22.8 | 11 | 2.1 | 23.6 | 26 | 0.9
CUB200 | 3% | 16 | 93.8 | 11.4 | 8.2 | 83.5 | 30 | 2.8
CUB200 | 3% | 64 | 369.6 | 15.5 | 23.8 | 292.5 | 29.2 | 10
CUB200 | 3% | 256 | 1472.9 | 20.7 | 71.2 | 1128.9 | 18.1 | 62.4


Table 4.6: Extra speedups brought by improved tuning block definitions.

Dataset | α | ResNet-50 thr_acc | ResNet-50 extra speedup (X), collection-1 | ResNet-50 extra speedup (X), collection-2 | Inception-V3 thr_acc | Inception-V3 extra speedup (X), collection-1 | Inception-V3 extra speedup (X), collection-2
Flowers102 | 0% | 0.973 | 1.05 | 0.98 | 0.968 | 1.12 | 1.14
Flowers102 | 1% | 0.963 | 1.19 | 1.21 | 0.958 | 1.08 | 1.15
Flowers102 | 2% | 0.953 | 1.06 | 1.14 | 0.949 | 1.15 | 1.23
CUB200 | 3% | 0.747 | 1.04 | 1.08 | 0.737 | 1.00 | 1.03
CUB200 | 4% | 0.739 | 1.04 | 1.20 | 0.729 | 1.08 | 1.09
CUB200 | 5% | 0.731 | 1.11 | 1.15 | 0.722 | 1.03 | 1.04
geometric mean | | | 1.08 | 1.12 | | 1.08 | 1.11

4.7 Related Work

Recent years have seen many studies on speeding up the training and inference of CNN, both in

software and hardware. Due to the large volume of this work, it is hard to list it all; some examples are work on software

optimizations [Han15a; Zhu18; Luo18; Iof15] and work on special hardware designs [Buc18; Fow18;

Ovt15; Sha18; Mos18; Eck18; Luo17b]. These studies are orthogonal to CNN pruning. Although they

can potentially apply to the training of pruned CNNs, they are not specifically designed for CNN

pruning. They focus on speeding up the computations within one CNN network. In contrast, our

work exploits cross-network computation reuse, exploiting the special properties of CNN pruning—

many configurations to explore, common layers shared among them, and most importantly, the

composability unveiled in this work. We next concentrate on prior work closely related to CNN

pruning.

Deep neural networks are known to have many redundant parameters and thus could be pruned

to more compact architectures. Network pruning can work at different granularity levels such as

weights/connections [Han15b; LeC90; Agh17], kernels [Wen16] and filters/channels [Li16; Mol16;

Luo17a]. Filter-level pruning is a naturally structured way of pruning without introducing sparsity,

avoiding creating the need for sparse libraries or specialized hardware. Given a well-trained network,

different metrics have been proposed to evaluate filter importance, such as Taylor expansion [Mol16], the ℓ1 norm of neuron weights [Li16], Average Percentage of Zeros [Hu16], feature maps’ reconstruction

errors [Luo17a; He17], and scaling factors of batch normalization layers [Liu17b]. These techniques,

along with general algorithm configuration techniques [Hoo11; Ber12; Sno12] and recent reinforce-

ment learning-based methods [He18; Ash17], show promise in reducing the configuration space

worth exploring. Our work distinctively aims at reducing the evaluation time of the remaining

configurations by eliminating redundant training.

Another line of work in network pruning conducts pruning dynamically at runtime [Fig17;

McG17; Lin17]. Instead of finding the best small network, they try to generate networks that can

adaptively activate only part of the network for inference on a given input. Because each part of the

generated network may be needed for some inputs, the overall size of the generated network could

be still large. They are not designed to make the network meet the limited resource constraints on a

system.


Sequitur [NM97] has been applied to various tasks, including program and data pattern analy-

sis [Lau05; Chi01; Lar99; Law03; Chi02; Wal10]. We have not seen its use in CNN pruning.

Several studies train student networks to mimic the output of a teacher network [Buc06; Ba14;

Hin15]. Our method of pre-training tuning blocks is inspired by this line of work but operates at a different level: rather than training an entire network, we train pieces of a network. We are not aware of prior use of such a scheme at this level.

4.8 Conclusions

This chapter described a novel composability-based approach to accelerating CNN pruning via

computation reuse. We designed a hierarchical compression-based algorithm to efficiently identify

tuning blocks for pre-training and effective reuse. We further developed Wootz, the first compiler-

based software framework that automates the application of the composability-based approach to

an arbitrary CNN model. Experiments show that network pruning enabled by Wootz shortens the

state-of-the-art pruning process by up to 186X while producing significantly better pruned networks.

The long exploration time of CNN pruning has been a major barrier for timely delivery of many AI

products. The promising results of Wootz indicate its potential for significantly lowering the barrier,

and hence reducing the time to market for AI products.


CHAPTER

5

EFFICIENT ENSEMBLE TRAINING WITH

DATA SHARING

5.1 Introduction

An essential step to apply DNNs to a new data set is hyper-parameter tuning—that is, the selection of

an appropriate network architecture and hyper-parameters (e.g., the number of layers, the number

of filters at each layer, and the learning rate scheduling). It is called Neural Architecture Search

(NAS) when the tuned parameters determine a DNN’s architecture. Many different search strategies

have been proposed such as random search [Ber12; Li19], reinforcement learning [Zop16; Zop18],

evolutionary methods [Sal17], and Bayesian Optimization [Kan18]. Most existing methods used

today need to train a large set of DNN candidates with different architectures (e.g. 450 networks

being trained concurrently in [Zop18]) to identify the best model for a particular task.

An effective strategy for shortening the process of hyperparameter tuning and NAS is to con-

currently train a set of DNNs on a cluster of nodes1, which is referred to as ensemble training of

DNNs. We refer to an ensemble of DNNs with the same architecture as a homogeneous ensemble.

Otherwise, the ensemble is called heterogeneous ensemble.

A common ensemble training strategy is to duplicate a training pipeline on multiple nodes to

train DNNs in parallel. A typical DNN training pipeline is an iterative process including data fetching,

preprocessing, and training. For the ease of description, we refer to data fetching and preprocessing

together as preprocessing. In ensemble training, training steps are not identical because we train

1A “node” in this chapter refers to a machine in a cluster; one node may contain one or more CPUs and GPUs


models with different architectures and configurations. However, preprocessing is redundant across

the pipelines, resulting in unnecessary CPU usage and even poor pipeline performance.

To eliminate the redundancies, Pittman et al. [Pit18] proposed data sharing where the common

preprocessing operations are shared across training pipelines of all DNNs in an ensemble. They

demonstrated that data sharing is an effective strategy to reduce computational resource utilization

and improve pipeline efficiency. Their solution, however, assumes relatively homogeneous compu-

tational needs for DNNs in an ensemble. It may perform poorly for a heterogeneous ensemble due to variance in DNN training that stems from two algorithmic characteristics.

The first algorithmic characteristic is varying training rate. Training rate of a DNN is the compute

throughput of the processing units (e.g., CPUs and GPUs) used for training the DNN. Each DNN in a heterogeneous ensemble can have different computational needs and thus a different training rate on the same computing resources [Can16; Sze17]. When synchronized data fetching is employed for data sharing to ensure that each DNN is trained on the entire dataset, a DNN that consumes preprocessed data more slowly than the others forces them to wait before the current set of cached batches can be evicted. This waiting lowers the utilization of computing resources in the cluster and delays the overall training time of the ensemble.

The second one is varying convergence speed. Due to the differences in network architecture

or hyper-parameter settings, some DNNs may require a larger number of epochs (one epoch goes

through all data samples once) to converge than others [Kri12; He16; Hua17; Zag16]. There can

be scenarios where a subset of DNNs in the ensemble have already converged while the shared

preprocessing operations must keep producing preprocessed data for the remaining DNNs. Resources

allocated to these converged DNNs will be under-utilized until the training of all the DNNs is

completed.

To address the issues, we propose FLEET, a flexible ensemble training framework for efficiently

training a heterogeneous set of DNNs. We build FLEET via several technical innovations. First, we

formalize the essence of the problem into an optimal resource allocation problem. We analyze the

computational complexity of the problem and present an efficient greedy algorithm that groups a

subset of DNNs into a unit (named flotilla) and effectively maps DNNs to GPUs in a flotilla on the

fly. The algorithm incurs marginal runtime overhead while balancing the progressing pace of DNNs.

Second, we develop a set of techniques to seamlessly integrate distributed data-parallel training of

DNN, preprocessing sharing, and runtime DNN-to-GPU assignments together into FLEET, the first

ensemble DNN training framework for heterogeneous DNNs. We introduce checkpointing into this

context to address the issue of different convergence speeds. FLEET features flexible and efficient

communications and effective runtime resource allocations.

Experiments on 100 heterogeneous DNNs on SummitDev, a machine at the Oak Ridge Leadership Computing Facility (Sec 5.5.1), demonstrate that FLEET can speed up the ensemble training by 1.12-1.92X over

the default training method, and 1.23-1.97X over the state-of-the-art framework that was designed

for homogeneous ensemble training.


Table 5.1: The job of different processes.

Process Type | Job Description
Preprocessor | Fetch data from storage, preprocess the data, and send the preprocessed data to its paired training group master.
Training Group Master | Receive the preprocessed data from its paired preprocessor, scatter it within its training group, broadcast the data to other training group masters, and train the DNN using the assigned batch of data.
Training Worker | Receive the assigned batch of data from its training group master and use it to train the DNN.

Figure 5.1: An illustration of the ensemble training pipeline in FLEET. P1 and P2 are preprocessors and T1-T8 are trainers. There are four training groups, (T1), (T2, T3), (T4), (T5, T6, T7, T8), which train the four DNNs D1-D4 respectively. Edges indicate transfers of preprocessed images.

5.2 Overview of FLEET

This section gives an overview of FLEET. FLEET is a flexible pipeline software architecture for efficient

ensemble training of heterogeneous DNNs. It provides flexibility for configuring the scheduling of

DNNs on nodes and GPUs via separation of preprocessing and training into different processes and

a collection of communication schemes. It creates efficiency via heterogeneity-conscious runtime

resource allocation and scheduling, plus sharing of preprocessing results among DNNs.

FLEET uses two types of processes, called preprocessor and trainer, to perform preprocessing

and training separately. A trainer group contains at least one trainer process and is responsible

for training one DNN in the ensemble. A trainer process uses one GPU for training. When a trainer

group contains more than one trainer process, they perform data-parallel DNN training for one

DNN. Each trainer group has a trainer as the training group master and zero or more trainers as

the training workers. The preprocessors communicate directly with only some master trainers, and

those master trainers forward the preprocessed data to other trainers. Figure 5.1 illustrates the

ensemble training pipeline in FLEET. The job of each process is summarized in Table 5.1.

Efficiency and Flexibility. Two important features of FLEET are its efficiency and flexibility.

The efficiency of FLEET comes from its novel resource allocation strategy developed for DNN

ensemble training. The strategy is powered by some fundamental understanding of this resource

allocation problem, and a greedy scheduling algorithm designed specifically for heterogeneous


ensemble training. The algorithm seamlessly integrates data-parallel distributed training with

ensemble training. As illustrated in Figure 5.1, different numbers of GPUs can be allocated to each DNN so that the DNNs can reach a similar training rate, avoiding the pipeline inefficiency caused by the slowest DNNs. It overcomes the NP-hardness of the resource allocation problem through a greedy design, grouping DNNs into multiple flotillas and periodically (re)allocating GPUs to the remaining DNNs in a globally efficient manner. It further leverages check-pointing to mitigate the issue of varying

convergence speeds among DNNs. Together, FLEET is able to achieve efficient ensemble training

while enabling data sharing to save CPU usage.

The flexibility of FLEET is in two aspects. First, decoupling preprocessing and training using

different processes2 provides the flexibility in configuring the number of preprocessors such that

the preprocessing throughput can match the trainers’ throughput without creating too many pre-

processors that may waste computing resource and power. Second, as each trainer is associated

with one GPU, resources for training can be allocated in the granularity of GPUs (rather than nodes

as in prior work [Pit18]). Each GPU in a node can be assigned independently to DNNs. Each DNN

in an ensemble can be trained using different numbers of GPUs concurrently, giving flexibility for

handling the heterogeneity in DNNs.

Two-fold Enabling Techniques. The key technical contributions that make FLEET possible

are two-fold. The first is theoretical, consisting of a deep understanding of the resource allocation

problem and some novel algorithms for assigning DNNs to GPUs. The second is empirical, consisting

of a set of solutions to the various challenges for implementing FLEET above the array of complex

software components (TensorFlow, Horovod, Python, MPI, etc.) on a heterogeneous Multi-GPU

supercomputer like SummitDev [Sum]. We present the two-fold contributions in the next two

sections respectively.

5.3 Resource Allocation Algorithms

Efficient ensemble training is essentially an optimal resource allocation problem. The resources

involve CPUs and GPUs in the modern heterogeneous computing clusters. Under the context of data

sharing, an optimal CPU allocation sets the number of preprocessors to be the one that just meets

the computing requirement of training DNNs. GPU allocation, however, is much more complex

and determines the pipeline efficiency. We formalize it as an optimal resource allocation problem

and analyze its computational complexity; the understanding motivates our later designs of the

practical algorithms and the FLEET architecture. We next start with the problem definition.

5.3.1 Problem Definition

There are two possible paradigms for scheduling DNNs on GPUs. A local paradigm assigns a DNN

to a GPU immediately when the GPU becomes vacant. A global paradigm periodically examines

2 The reason we use processes instead of threads is the Global Interpreter Lock in Python. As FLEET is built on TensorFlow, which is driven from Python, multi-processing brings maximum parallelism into the training pipeline.


Table 5.2: Notations.

Notation | Description
N | the number of DNNs in an ensemble
M | the number of GPUs available in a cluster
K | the number of DNN flotillas
D | the list of DNNs in an ensemble, D = [D_1, ..., D_N]
F | the list of flotillas of DNNs, F = [F_1, ..., F_K]
F_k | the k-th flotilla of DNNs, F_k = [D_1^(k), ..., D_{N_k}^(k)]
D_i^(k) | the i-th DNN in the k-th flotilla
N_k | the number of DNNs in the k-th flotilla
A | the list of GPU allocations, A = [A_1, ..., A_K]
A_k | an N_k-by-M matrix, the GPU allocation for the k-th flotilla of DNNs
a_{i,j}^(k) | whether the j-th GPU is assigned to D_i^(k)
m_i^(k) | \sum_{j=1}^{M} a_{i,j}^(k), the number of GPUs assigned to D_i^(k)
r_i^(k)(m) | the training rate of D_i^(k) trained with m GPUs

the remaining DNNs and does a global (re)assignment of the DNNs to all GPUs. The local paradigm

is relatively easy to understand; the global paradigm has the potential to avoid the local optimal

but is more difficult to design. Particularly, to effectively realize the global paradigm, several open

questions must be answered: Is an optimal scheduling algorithm feasible? If so, what is it? If not,

how to efficiently approximate it? This section focuses on the global paradigm and explores these

open questions. For easy reference, we put into Table 5.2 the important notations used in the rest of

this chapter.

In this scheduling problem, the entire execution trains N DNNs on M GPUs in K rounds. The

beginning of a round is the time for globally (re)scheduling remaining DNNs on GPUs. The set of

DNNs being trained in each round is called a flotilla. So there are K flotillas being trained in the

execution, one flotilla a round.

Theoretically, a round can be a time period of an arbitrary length. We first focus on a simple

case where a round finishes when and only when the training of all the DNNs in a flotilla finishes (i.e., each DNN converges or reaches the maximum number of training epochs). In this setting, the GPUs that finish their work in the current flotilla earlier than other GPUs would have some idle waiting time. The

simplicity of this setting, however, makes the analysis easy to understand. We will briefly discuss the

complexities of the more general settings at the end of Section 5.3.2.

We now give a formal definition of the resource allocation problem in the focused setting. Each

DNN in the ensemble is placed into at least one of the flotillas F_k, k = 1, ..., K, such that the list of K flotillas F = [F_1, ..., F_K] covers all the DNNs. Each flotilla, F_k = [D_1^(k), ..., D_{N_k}^(k)], contains no more than M DNNs (i.e., N_k ≤ M) so that each DNN in the flotilla can have at least one GPU.

Let A = [A_1, ..., A_K] be the GPU assignments for the K flotillas of DNNs. Each assignment A_k is an N_k-by-M matrix (a_{i,j}^{(k)}) with

a_{i,j}^{(k)} =
\begin{cases}
1, & \text{if the } j\text{-th GPU is assigned to the model } D_i^{(k)},\\
0, & \text{otherwise,}
\end{cases}

\text{s.t.} \quad \sum_{i=1}^{N_k} a_{i,j}^{(k)} \le 1 \;\; (j = 1, 2, \cdots, M), \qquad \sum_{j=1}^{M} a_{i,j}^{(k)} \ge 1 \;\; (i = 1, 2, \cdots, N_k).

An optimal resource allocation is an allocation strategy of available GPUs in a cluster to DNNs

in an ensemble such that the end-to-end training time of the DNNs is minimized. The definition is

as follows:

Definition 5.3.1. Optimal Resource Allocation. Given a DNN ensemble D and a cluster of nodes with M GPUs in total, let T(D | F, A) be the end-to-end time to finish the training of all the DNNs according to the flotilla list F and the corresponding GPU assignments A. The optimal resource allocation problem is to find a schedule (F*, A*) such that

F^*, A^* = \arg\min_{F, A} T(D \mid F, A)   (5.1)
         = \arg\min_{F, A} \sum_{k=1}^{K} T(F_k \mid A_k),   (5.2)

where T(F_k | A_k) is the time spent on training the DNNs in F_k with the assignment A_k for some epochs.
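To make the objective concrete, the following minimal Python sketch (not part of FLEET; the helper arguments rate, epochs, and dataset_size and the per-DNN time model are illustrative assumptions) estimates T(D | F, A) for a candidate schedule under the simple setting above, in which a round ends only when its slowest DNN finishes:

# Minimal sketch (illustrative only): estimate T(D | F, A) for a candidate schedule.
# Assumptions: rate(i, m) returns the profiled training rate (images/sec) of DNN i on m GPUs,
# epochs[i] is the number of epochs DNN i trains in its round, and dataset_size is the number
# of training images per epoch.

def flotilla_time(flotilla, gpu_counts, rate, epochs, dataset_size):
    # One round lasts as long as its slowest DNN, i.e., T(F_k | A_k).
    return max(epochs[i] * dataset_size / rate(i, gpu_counts[i]) for i in flotilla)

def schedule_time(flotillas, assignments, rate, epochs, dataset_size):
    # End-to-end time T(D | F, A): the sum of the per-round times (Eq. 5.2).
    return sum(flotilla_time(f, a, rate, epochs, dataset_size)
               for f, a in zip(flotillas, assignments))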

5.3.2 Complexity Analysis

In this part, we argue that the Optimal Resource Allocation problem is NP-hard in general. The

argument comes from the classic results in Parallel Task System Scheduling. As Du and Leung have

proved [Du89], finding an optimal non-preemptive schedule for a Parallel Task System with the

precedence constraints consisting of chains is strongly NP-hard for each n > 2 (n is the number

of processors). And, when the precedence constraints are empty, the problem is strongly NP-hard

for each n ≥ 5. The Optimal Resource Allocation problem can be viewed as a parallel task system

scheduling problem with each DNN as a task and each GPU as a parallel processor. One subtle

aspect is that even though the DNNs are independent, to leverage shared preprocessing data among

DNNs, a newly freed GPU does not take on a new DNN until the new round starts. It could be viewed

as there are some pseudo precedence constraints between the DNNs in two adjacent rounds. So in

general, the optimal solution is unlikely to be found in polynomial time. Recall that our discussion

has been assuming that a new round starts only when the training of the DNNs in the previous


Algorithm 1 Greedy Algorithm
Input: D, M  // DNN ensemble and the number of GPUs
Output: F, A  // A list of flotillas and the corresponding GPU assignments
1: R = profile(D)  // Profile the training rate of each DNN trained with m = 1, ..., M GPUs.
2: F, A, cands, k = [], [], D, 1
3: while |cands| > 0 do
4:   Fk, mk = createFlotilla(cands, R, M)  // Step 1: Create a new flotilla from the candidate DNNs; return the flotilla of DNNs (Fk) and the GPU count vector (mk).
5:   Ak = getGPUAssignment(Fk, mk)  // Step 2: Figure out a GPU assignment for the flotilla.
6:   dels = train(Fk, Ak)  // Step 3: Load the latest checkpoints if available; train the DNNs in the flotilla for some epochs; return the converged models (dels).
7:   cands -= dels  // Remove converged models from the candidates (cands).
8:   F.append(Fk); A.append(Ak); k += 1

round is all done. If the condition is relaxed such that a round can be a time period of an arbitrary

length, the problem becomes even more complex to solve.

5.3.3 Greedy Allocation Algorithm

Motivated by the complexity in finding optimal solutions to the problem, we have designed a

greedy algorithm for FLEET to assign DNNs to GPUs efficiently. It is worth noting that, even though

the Optimal Resource Allocation problem connects with the classic Parallel Task System Scheduling,

several special aspects of it make it unique and demand new algorithm designs. First, unlike what

is often assumed in classic scheduling problems, the length of a task (DNN training) is hard, if possible at all, to predict: there is no known method that can accurately predict the number of

epochs (and hence the time) needed for a DNN to converge. Second, the relations among tasks

(DNNs) are "fluid". The trainings of two DNNs are theoretically independent: one does not depend

on another’s data or control. But when they are put into the same flotilla, they become related: They

would share the same preprocessed data and hence need to keep a similar progressing pace. These

special aspects make the problem different from prior problems and call for new algorithms to be

designed.

This section describes our algorithm. It first introduces four principles we followed in developing

the solution and then elaborates our greedy algorithm. We will explain the solution in the context of

the global paradigm and discuss how it is also applicable to the local paradigm at the end of this

section.

5.3.3.1 Principles

A resource allocation strategy involves grouping the DNNs into flotillas and assigning the DNNs in

each flotilla to the GPUs. We develop our solution by following four principles. The core of these

principles is to organize tasks with less variation and dependencies at the flotilla level (Principles 1

and 2) and at the node level (Principles 3 and 4).


Principle 1. DNNs in the same flotilla should be able to reach a similar training rate (e.g., images per

sec) if a proper number of GPUs are assigned to each of the DNNs.

This principle helps ensure a balanced pace of all GPUs, which helps the DNNs consume the shared preprocessed data at a similar rate, minimizing the waiting time of certain GPUs. This may result in multiple flotillas being created if not all DNNs in the ensemble are similar.

Principle 2. Pack into one flotilla as many DNNs as possible.

The reason for this principle is two-fold. First, the throughput of multi-GPU training scales

sublinearly3 with the number of GPUs due to the communication overhead of exchanging gradients.

The principle is to help maintain good efficiency of the DNNs. Second, it allows more DNNs to share

preprocessed data.

Principle 3. When assigning multiple GPUs to a DNN, try to use GPUs in the same node.

This principle is to reduce the variation in communication latency: inter-node communications

are slower and have more variations than intra-node communications.

Principle 4. Try to assign DNNs that need a small number of GPUs to the same node.

This principle is similar to Principle 2 but at the node level. The rationale is that, although it

is hard to reduce the communication overhead of DNNs that need to be trained using multiple

nodes, we can minimize the communication overhead of DNNs that need a small number of GPUs

by assigning them to the GPUs in the same node.

Based on the four principles, we propose a greedy algorithm to solve the resource allocation

problem, as described below.

5.3.3.2 Algorithm

The greedy algorithm is shown in Algorithm 1. It uses training rates of the DNNs, R = {ri (m )}, i =

1, · · · , N , m = 1, · · · , M , which are attained through a short profiling process (line 1). We propose

profiling of fewer than 50 batches of training for each DNN. We defer the detailed profiling process

to Section 5.5.1.

The greedy algorithm dynamically determines the grouping of the DNNs in an ensemble based on whether each DNN has converged and on its training rate. Once a flotilla is created,

an optimal GPU assignment can be derived. Initially, all DNNs are considered as candidates (cands)

when a new flotilla needs to be created (line 2). The greedy algorithm then iterates over three main

steps, flotilla creation (line 4), GPU allocation (line 5), and training (line 6), until all the DNNs in the

ensemble have converged (i.e., cands is empty).

We next describe the three steps in detail.

3 If all DNNs in an ensemble had perfect linear scaling in throughput, training the DNNs one after another would be the optimal strategy. That is, however, often not the case in our observation. Another practical reason for concurrently training multiple DNNs is hyperparameter tuning. By checking the intermediate training results of those DNNs, the unpromising ones can be discarded.


Algorithm 2 createFlotilla
Input: cands, R, M  // The indices of the DNNs that are not converged, the training rates of the DNNs, and the number of GPUs available
Output: Fk, mk  // The k-th flotilla and the GPU count vector
1: Dfast, rfast = fastestDNN(cands, R)  // Find the DNN with the largest training rate on a single GPU.
2: Fk, Mk, mk = [Dfast], 1, [1]
3: while |Fk| < |cands| do
4:   Dbest, rbest, Mbest = findNext(rfast, R, cands, Fk, M − Mk)  // Find the next DNN, its training rate, and its required GPU count.
5:   if Dbest == −1 then
6:     break
7:   Fk.append(Dbest)
8:   mk.append(Mbest)
9:   Mk += Mbest
10: while Mk < M do
11:   Dslow = slowestDNN(Fk, mk, R)  // in terms of speed on the currently assigned GPUs
12:   mk[slow] += 1
13:   Mk += 1

5.3.3.2.1 Flotilla Creation

This first step selects a set of DNNs from candidates to create a new flotilla whose DNNs are trained

concurrently with data sharing, following Principles 1 and 2. The algorithm first identifies the largest

training rate with a single GPU, r_fast = max{r_1(1), ..., r_{|cands|}(1)}, and the corresponding DNN, D_fast, from the candidate set of DNNs. Then r_fast is used as the reference training rate to search

for other DNNs that can be placed in the same flotilla. Mathematically, the algorithm searches for

the next DNN that can be placed into the flotilla by solving the following optimization problem:

\min_{D_i \in cands \setminus F_k,\; m = 1, \cdots, M} \; |r_i(m) - r_{fast}|
\quad \text{s.t.} \quad |r_i(m) - r_{fast}| \le \delta, \;\; m \le M - M_k,   (5.3)

where δ is the threshold that determines if two training rates are close, and Mk is the total number

of GPUs that are already assigned to DNNs. In our experiments, δ is set to 20 (images/sec). The

algorithm stops adding DNNs to a flotilla if no solution exists to Eq. 5.3.

After a flotilla is formed, if there are still GPUs available, we iteratively assign the next GPU to the DNN in the flotilla that currently has the smallest training rate, until all the GPUs are assigned. Because the DNN with the smallest training rate determines the pipeline efficiency, assigning extra GPUs to the slowest DNN improves pipeline efficiency.
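For illustration, the search in Eq. 5.3 can be solved by simple enumeration, as in the sketch below (a hypothetical stand-in for the findNext step of Algorithm 2, not FLEET's implementation; the data layout and the default δ of 20 images/sec follow the text but are otherwise assumptions):

# Minimal sketch (illustrative) of the candidate search in Eq. 5.3.
# rates[i][m] is the profiled training rate of DNN i with m GPUs (m >= 1);
# delta defaults to 20 images/sec as in the experiments.

def find_next(r_fast, rates, cands, flotilla, gpus_left, delta=20.0):
    best = (-1, None, None)                   # (DNN index, training rate, GPU count)
    best_gap = float("inf")
    for i in cands:
        if i in flotilla:                     # search over cands - F_k only
            continue
        for m in range(1, gpus_left + 1):     # respect m <= M - M_k
            gap = abs(rates[i][m] - r_fast)
            if gap <= delta and gap < best_gap:
                best, best_gap = (i, rates[i][m], m), gap
    return best                               # (-1, None, None) means Eq. 5.3 has no solution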

Algorithm 2 shows the flotilla creation algorithm. Flotilla creation first searches for the reference

training rate (line 1, time complexity O(N)), then iteratively finds the best candidate DNN to add to the flotilla (lines 3-11, time complexity O(N_k × N × M)), and finally assigns all the remaining GPUs


Algorithm 3 getGPUAssignment
Input: Fk, mk  // The k-th flotilla and the GPU count vector
Output: Ak
1: j = 1  // The currently available GPU with the smallest index.
2: Ak = 0_{Nk,M}  // The GPU assignment matrix of dimension Nk × M.
3: remaining = {1, ..., N}  // The indices of the DNNs to allocate GPUs to.
4: assigned = {}  // The indices of the DNNs that have been assigned GPUs.
5: for all i ∈ remaining do
6:   if m_i^(k) % GPUsPerNode == 0 then
7:     assigned.add(i)
8:     j = assignGPUs(Ak, i, j, m_i^(k))
9: remaining -= assigned
10: memo, assigned = {}, {}
11: for all i ∈ remaining do
12:   if −m_i^(k) % GPUsPerNode not in memo then
13:     memo[m_i^(k) % GPUsPerNode] = i
14:   else
15:     for ii ∈ {i, memo[−m_i^(k) % GPUsPerNode]} do
16:       assigned.add(ii)
17:       j = assignGPUs(Ak, ii, j, m_ii^(k))
18:     del memo[−m_i^(k) % GPUsPerNode]
19: remaining -= assigned
20: if |remaining| > 0 then
21:   m^(k), bestScore, bestA, jcopy, Acopy = [], ∞, None, j, clone(Ak)
22:   for all i ∈ remaining do
23:     m^(k).append((i, m_i^(k)))
24:   for permutation in allPermutations(m^(k)) do
25:     j, Ak = jcopy, clone(Acopy)
26:     for i, m_i^(k) in permutation do
27:       j = assignGPUs(Ak, i, j, m_i^(k))
28:     score = calculateScore(Ak)  // The score is the objective in Eq. 5.4.
29:     if score < bestScore then
30:       bestScore, bestA = score, Ak
31:   Ak = bestA

Algorithm 4 assignGPUs
Input: Ak, i, j, m_i^(k)
Output: j  // The index of the next available GPU to assign
1: while m_i^(k) > 0 do
2:   a_{i,j}^(k) = 1; j += 1; m_i^(k) −= 1


available to the DNNs in the flotilla (lines 12-15, time complexity O(N_k × M)). So the time complexity of flotilla creation is O(N_k × N × M).

The flotilla creation step produces a flotilla of DNNs as well as the GPU count vector that specifies

the number of GPUs assigned to each DNN. We next explain how to properly assign GPUs to each

DNN based on the GPU count vector and considering GPU locality.

5.3.3.2.2 GPU Assignment

This procedure assigns GPUs to the DNNs in a flotilla, following Principles 3 and 4. The goal is to find an assignment A_k that minimizes the number of nodes involved in training each DNN. Let c(·) be the function that counts the number of nodes involved in training a DNN given its GPU assignment a_i^{(k)}, the i-th row of the assignment matrix A_k. The GPU assignment is then the optimization problem

\min_{A_k} \sum_{i=1}^{N_k} \frac{c(a_i^{(k)})}{m_i^{(k)}}
\quad \text{s.t.} \quad \sum_{j=1}^{M} a_{i,j}^{(k)} = m_i^{(k)}, \;\; i = 1, \cdots, N_k,   (5.4)

where c(a_i^{(k)}) / m_i^{(k)} is the number of nodes involved in training the i-th DNN, scaled by m_i^{(k)}, the number of GPUs assigned to it. The solution space is as large as M! / \prod_{i=1}^{N_k} (m_i^{(k)}!).
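The objective in Eq. 5.4 is straightforward to evaluate for a concrete assignment. The sketch below is only an illustration (not FLEET's calculateScore routine; representing an assignment as per-DNN lists of GPU indices is our assumption): it counts the nodes spanned by each DNN's GPUs and sums the scaled counts.

# Minimal sketch (illustrative) of the score in Eq. 5.4 for one flotilla.
# assignment[i] is the list of GPU indices given to the i-th DNN; gpus_per_node is the
# (uniform) number of GPUs per node, e.g., 4 on SummitDev.

def nodes_used(gpu_ids, gpus_per_node):
    # c(a_i): the number of distinct nodes touched by this DNN's GPUs.
    return len({g // gpus_per_node for g in gpu_ids})

def assignment_score(assignment, gpus_per_node):
    # Sum of c(a_i) / m_i over the DNNs in the flotilla (Eq. 5.4); lower is better.
    return sum(nodes_used(gpus, gpus_per_node) / len(gpus) for gpus in assignment)

For instance, assignment_score([[0, 1, 2, 3], [4, 5], [6, 7]], gpus_per_node=4) evaluates to 1/4 + 1/2 + 1/2 = 1.25; splitting one of the 2-GPU DNNs across two nodes would raise its term from 1/2 to 1, which is exactly what Principles 3 and 4 try to avoid.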

Instead of exhaustively searching for an optimal solution in the space, we propose a greedy

approach that assigns GPUs to each DNN in an incremental fashion. For example, if the j -th GPU is

already assigned to a DNN, then the next GPU to be assigned to the DNN is the (j+1)-th GPU. The solution space is thus reduced to the space of possible permutations of the GPU count vector (N_k!).

The algorithm is shown in Algorithm 3. This algorithm assumes the number of GPUs per node is

the same among nodes (GPUsPerNode), which holds in major supercomputers. It first prunes the

factorial solution space by identifying and assigning GPUs to the DNNs whose required GPU counts meet certain conditions, in O(N_k) time complexity. It then searches for the optimal GPU assignment

strategy for the remaining DNNs. It assigns GPUs to DNNs in the following order:

1. the DNNs whose required number of GPUs is a multiple of the number of GPUs per node;

(lines 5-11)

2. the pairs of DNNs whose sum of the required number of GPUs is a multiple of the number of

GPUs per node; (lines 12-24)

3. the remaining DNNs by searching for an optimal assignment of GPUs. (lines 25-39)

Let N'_k be the number of remaining DNNs. The solution space is N'_k!. Most of the time, N'_k is a small number less than five. However, enumerating all the possible solutions is still of factorial


time complexity. We set the maximum number of solutions to explore as 1024, reducing the time

complexity to O (1). The time complexity of GPU assignment is thus O (Nk ).

The flotilla creation and GPU assignment steps ensure that DNNs in the same flotilla can achieve

similar training rate to improve GPU utilization. We next describe how the training step addresses

the varying convergence speed issue via check-pointing.

5.3.3.2.3 Training

The training step trains the DNNs on their assigned GPUs concurrently with data sharing. Due to

the architectural difference of DNNs in a heterogeneous ensemble, these DNNs require a different

number of epochs to converge. With data sharing, converged models need to wait for the un-

converged models to complete, leading to the waste of computing resources. We leverage check-

pointing to address the varying convergence speed issue. Specifically, each flotilla is trained until fewer than α·M GPUs remain active for training; α is set to 0.8 in all our experiments. We monitor whether

a model is converged at the end of each epoch. Once a model is converged, it is marked as complete

and its GPUs are released. If the total number of GPUs that are not released falls below α ·M , the

training of all the DNNs in the flotilla stops. The parameters, loss history, and epoch count of all the

DNNs are check-pointed for recovering their training later. A DNN marked as complete will not be

packed into any of the following flotillas.
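The overall shape of this step can be sketched as follows (purely illustrative, not FLEET's code; the callbacks train_one_epoch, has_converged, and save_checkpoint as well as the max_epochs bound are assumptions):

# Minimal sketch (assumed structure) of training one flotilla with the alpha*M stopping rule.

def train_flotilla(flotilla, gpu_counts, total_gpus,
                   train_one_epoch, has_converged, save_checkpoint,
                   alpha=0.8, max_epochs=90):
    active = {i: gpu_counts[i] for i in flotilla}   # GPUs still held by unconverged DNNs
    converged = set()
    for _ in range(max_epochs):
        for i in list(active):
            train_one_epoch(i)                      # one data-parallel epoch on its GPUs
            if has_converged(i):
                converged.add(i)                    # mark complete and release its GPUs
                del active[i]
        if sum(active.values()) < alpha * total_gpus:
            break                                   # too few GPUs remain busy; end the round
    for i in flotilla:
        save_checkpoint(i)                          # parameters, loss history, epoch count
    return converged                                # these DNNs will not join later flotillas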

5.3.3.3 Application in the Local Paradigm

Although the discussion has been assuming the global paradigm, the greedy algorithm applies to the local paradigm of resource allocation as well. The training proceeds as follows: (1) At the beginning,

the algorithm forms the first flotilla of DNNs and starts training them. (2) Whenever a DNN is done,

the algorithm fills the released GPUs with new DNNs. If no DNN remains untrained, terminate when

all current training is done.

5.4 Implementation

This section describes an efficient training pipeline implementation of FLEET. We focus on the

following two main implementation challenges:

Challenge 1: Recall that FLEET has two types of processes, preprocessor and trainer. The number

of preprocessors needs to be set to meet the requirement of trainers’ throughput. Thus, it is necessary

for FLEET to support creating a different number of processes per node on a cluster and also enable

flexible communications between preprocessors and trainers.

Challenge 2: With data-parallel DNN training, preprocessed data from a preprocessor is received by its paired training group master, scattered to trainers within the group (including the training group master), and broadcast to the other training group masters. How do we build the dataflow to enable an efficient training pipeline?

We next describe the solutions and the implementation details.


Figure 5.2: Illustration of the dataflow implementation. Two DNNs, D1 and D2, are trained using four GPUs (Ranks 0-3) by two training groups, (T1) and (T2, T3, T4). T1 and T2 are training group masters. The sizes of QP, QT, and QD are 2048 images, 2048 images, and 10 batches, respectively.

Communications between Preprocessors and Trainers. A preprocessor is a process created

through the fork operation. The number of preprocessors can be controlled by the number of

trainer group masters that execute the fork operation. We establish the communications between a

preprocessor and its paired trainer group master through a server process. A server process holds

Python objects and allows other processes to manipulate them using proxies. A proxy is an object in

the multiprocessing package of Python and refers to a shared object which lives (presumably) in

another process. A preprocessor sends the processed data to its training group master by writing to

a Numpy object using the object’s proxy.
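The following self-contained Python sketch illustrates the server-process/proxy pattern described above (it is not FLEET's code; the class and function names are ours, and real preprocessing is replaced by random data):

# Minimal sketch (illustrative) of a preprocessor writing into a shared object that lives
# in a manager's server process, accessed through a proxy.
import numpy as np
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class SharedBuffer:
    """Holds preprocessed batches; the instance lives in the manager's server process."""
    def __init__(self):
        self._batches = []
    def put(self, batch):
        self._batches.append(np.asarray(batch))
    def get_all(self):
        batches, self._batches = self._batches, []
        return batches

class BufferManager(BaseManager):
    pass

BufferManager.register("SharedBuffer", SharedBuffer)

def preprocessor(buffer_proxy, num_batches=4):
    for _ in range(num_batches):
        batch = np.random.rand(32, 224, 224, 3).astype(np.float32)  # stand-in for real preprocessing
        buffer_proxy.put(batch)                                     # write through the proxy

if __name__ == "__main__":
    manager = BufferManager()
    manager.start()                              # launches the server process
    shared = manager.SharedBuffer()              # proxy to the object in the server process
    worker = Process(target=preprocessor, args=(shared,))
    worker.start()
    worker.join()
    print("received", len(shared.get_all()), "preprocessed batches")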

Dataflow Implementation. Pipelining is the essential scheme that organizes the different stages of DNN processing together; it allows the stages to run in parallel. For example, while a DNN is trained

on some set of data, preprocessors can be preprocessing another set of data. Figure 5.2 illustrates

the dataflow implementation in FLEET.

The dataflow contains three pipelined steps: (1) Training group masters receive preprocessed

data from their paired preprocessor and put the data into a preprocessed queue QP . (2) Preprocessed

data from QP are broadcast to all the training group masters through MPI. Each training group

master receives all the preprocessed data, but handles the data differently, depending on whether

data-parallel training is used. If a training group contains only one trainer (i.e., only one GPU is used

to train a DNN), the training group master puts all the data into its trainer queue QT . Otherwise, the

training group master scatters the data to its trainer queue QT and the distribution queue QD ∗ . The

data in the distribution queue is sent to the trainer queue QT of each training group worker via MPI

point-to-point communication in a separate thread. (3) Each trainer (T1-T4) reads preprocessed

data from the trainer queue QT to QD , creates batches, and feeds each batch to the DNN model for

training.
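The sketch below gives a rough, in-process illustration of these three pipelined hand-offs (it is not FLEET's implementation: FLEET moves data between processes over MPI broadcast and point-to-point messages, whereas this sketch uses thread queues for brevity):

# Minimal sketch (illustrative) of the three pipelined stages using in-process queues.
import queue
import threading

QP = queue.Queue(maxsize=2048)   # preprocessed queue on a training group master
QT = queue.Queue(maxsize=2048)   # trainer queue
QD = queue.Queue(maxsize=10)     # batch queue feeding the model

def receive_from_preprocessor(num_items):
    for i in range(num_items):
        QP.put(("image", i))          # step (1): master receives preprocessed data into QP

def distribute():
    while True:                       # step (2): forward data from QP to the trainer queue(s)
        item = QP.get()
        if item is None:
            QT.put(None)
            break
        QT.put(item)

def make_batches(batch_size=4):
    batch = []
    while True:                       # step (3): a trainer forms batches and feeds the model
        item = QT.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            QD.put(batch)
            batch = []

threads = [threading.Thread(target=distribute), threading.Thread(target=make_batches)]
for t in threads:
    t.start()
receive_from_preprocessor(8)
QP.put(None)                          # signal the end of the stream
for t in threads:
    t.join()
print("batches ready for training:", QD.qsize())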


Figure 5.3: Correlations between the model size of a DNN and its training rate and number of epochs until convergence.

5.5 Evaluations

We conduct a set of experiments to examine the efficacy of FLEET by answering the following

questions: (1) How much speedup can FLEET bring to ensemble training of heterogeneous DNNs?

(2) How do the pros and cons of the two paradigms in FLEET designs, local and global, play out in

handling the variations among DNNs? More specifically, does the greedy scheduling algorithm in

FLEET produce favorable schedules? How much waiting time does the round-by-round scheme in

FLEET cause, compared to eager scheduling schemes? (3) What is the overhead of runtime profiling,

scheduling, and checkpointing in FLEET?

We first describe the experiment settings (machines, baselines, etc.) in Section 5.5.1 and then

report our experiment results in Sections 5.5.2 and 5.5.3 to answer the questions.

5.5.1 Experiment Settings

5.5.1.1 DNNs

The DNNs used in this experiment are derived from DenseNets [Hua17] and ResNets [He16]. Both

models are the state-of-the-art network architectures that achieve high performance in various

learning tasks. We select these networks as the basis because, as structural DNNs, they are composed

of many Convolutional blocks, which have a standard interface making a block ready to be connected

with any other blocks. As a result, it is easy to derive new DNNs from them—one just needs to remove

or insert some Convolutional blocks.

We derive 100 experimental DNNs from six popular DNNs: DenseNet-121/169/201 and ResNet-

50/101/152. The first three are variations of DenseNets [Hua17]. The three variations share the same


structure, but differ in the number of DNN layers, indicated by their suffixes. The latter three are

variations of ResNets [He16].

The sizes of the DNN models vary from 232MB to 1.19GB, and their training rates on a single GPU vary from 21 to 176 images/sec. Different DNNs have different GPU

memory requirements and thus require different batch sizes to maximize GPU utilization. For

each, we use the maximum batch size that can fit into GPU’s memory. Figure 5.3 outlines the

relations between the training rates and model sizes of the DNNs, as well as the relations between

convergence rates (i.e., the number of epochs needed for the DNNs to converge) and their model

sizes. As model size increases, the training rate tends to drop as more computations are involved

in the DNN, but there are no clear correlations with the convergence rate. It is the reason that the

resource allocation algorithm in FLEET primarily considers training rate explicitly, and relies on the

periodical (re)scheduling to indirectly adapt to the variations of DNNs in the converging rates.

5.5.1.2 System

All experiments are conducted on SummitDev [Sum], a development machine for Summit super-

computer at Oak Ridge National Lab. Each node is equipped with two IBM POWER8 CPUs and

256GB DRAM, and four NVIDIA Tesla P100 GPUs. Each POWER8 CPU has 10 cores with 8 HW threads

each. The default SMT level is set to one unless noted otherwise. The number of cores allocated per

GPU is five in all the experiments. NVLink 1.0 is the connection among all GPUs and between CPUs

and GPUs within a node. EDR InfiniBand connects different nodes in a full fat-tree. The file system

is an IBM Spectrum Scale file system, which provides 2.5 TB/s for sequential I/O and 2.2 TB/s for

random I/O. Our experiments show that thanks to the large I/O throughput of the file system, I/O is

not the bottleneck of DNN training. The used CUDA version is 9.2.

FLEET is built on Tensorflow 1.12 (as the core training engine), Horovod v0.15.2 [Ser18] (as

the basis for distributed DNN training), and mpi4py v3.0.0 (for the pipeline construction). We set

inter_op_parallelism_threads and intra_op_parallelism_threads to # logical cores

for parallel TensorFlow operations on CPU.
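For concreteness, these two TensorFlow 1.x knobs are set roughly as follows (a sketch; querying the logical core count at runtime is our assumption rather than FLEET's exact code):

# Minimal sketch (assumes TensorFlow 1.x, as used by FLEET) of sizing TensorFlow's CPU
# thread pools to the number of logical cores.
import multiprocessing
import tensorflow as tf

num_logical_cores = multiprocessing.cpu_count()
config = tf.ConfigProto(
    inter_op_parallelism_threads=num_logical_cores,   # parallelism across independent ops
    intra_op_parallelism_threads=num_logical_cores)   # parallelism within a single op
session = tf.Session(config=config)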

5.5.1.3 Profiling

To minimize the overhead of profiling, we only profile the training rates of each DNN in the ensemble

with the number of GPUs varying from one to Mt (Mt <M ), where Mt is determined based on the

training rates of each DNN on a single GPU. For profiling on m (m = 1, · · · , Mt ) GPUs, we train a

DNN for a maximum of 48 batches and use the training time of the last 20 batches to calculate the

exact training rate: ri (m ), i = 1, · · · , N . Based on the profiled training rates, we estimate the training

rates of each DNN when m >Mt .

Specifically, the profiling has three steps:

1. Collect the training rates of each DNN on a single GPU, R (1) = {ri (1)}, i = 1, · · · , N .


Figure 5.4: The profiled training rates (images/sec) of 100 DNNs in an ensemble with ImageNet.

2. Estimate the number of GPUs required to make the DNN that has the smallest training rate

on a single GPU achieve the largest single-GPU training rate, M_a = \lceil \max(R(1)) / \min(R(1)) \rceil.

3. Collect the training rates of each DNN with the number of GPUs varying from two to Mt =

max(M_a, M_b), where M_b = 2 × GPUsPerNode.

Note that steps 1 and 3 can be done in parallel because the trainings of different DNNs with different numbers of GPUs are independent. The training rate of the i-th DNN with a number of GPUs higher than M_t is estimated via the following equation:

r_i(m) = m \times \frac{r_i(M_b)}{M_b} \left( \frac{r_i(M_b)}{r_i(M_b-1)} \times \frac{M_b-1}{M_b} \right)^{m - M_b}.   (5.5)

The formula for M_b and Equation 5.5 are the result of performance modeling based on our observations of the DNN performance trend illustrated in Figure 5.4. This modeling achieves a good tradeoff between the

profiling cost and the performance prediction accuracy.
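A minimal sketch of this rate model follows (illustrative only; r is assumed to map a GPU count to the profiled training rate of one DNN, and single_gpu_rates holds r_i(1) for all DNNs):

# Minimal sketch (illustrative) of step 2 and of the extrapolation in Eq. 5.5.
import math

def estimate_Ma(single_gpu_rates):
    # Step 2: GPUs needed for the slowest DNN to reach the fastest single-GPU rate.
    return math.ceil(max(single_gpu_rates) / min(single_gpu_rates))

def extrapolate_rate(r, m, Mb):
    # Eq. 5.5: scale the rate at Mb linearly with m, damped by the observed per-GPU
    # efficiency loss between Mb-1 and Mb GPUs.
    damping = (r[Mb] / r[Mb - 1]) * (Mb - 1) / Mb
    return m * (r[Mb] / Mb) * damping ** (m - Mb)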

The profiling process also measures the throughput of a range of preprocessors (#cores=1, 2, 4,

8, 16, 32) in the pipeline. This step is quick since preprocessing does not exhibit large variations.

Based on the profiled information, FLEET calculates the minimum number of preprocessors that can meet the demands of any M DNNs (with one DNN running on each GPU), and uses it to set

the number of preprocessors.
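A hedged sketch of that sizing computation (the helper names and the worst-case-demand estimate are our assumptions, not FLEET's exact procedure):

# Minimal sketch (illustrative) of choosing the number of preprocessors: pick the smallest
# profiled preprocessor count whose throughput covers the demand of M DNNs, each running
# on one GPU at its single-GPU training rate.

def choose_num_preprocessors(single_gpu_rates, preproc_throughput, M):
    # single_gpu_rates: r_i(1) for all DNNs (images/sec)
    # preproc_throughput: {number of preprocessors: images/sec} from profiling
    demand = M * max(single_gpu_rates)          # upper bound on the consumption rate
    for n in sorted(preproc_throughput):
        if preproc_throughput[n] >= demand:
            return n
    return max(preproc_throughput)              # fall back to the largest profiled setting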

5.5.1.4 Counterparts for Comparisons

• Baseline The baseline uses the default TensorFlow to train each DNN on one GPU indepen-

dently. Each DNN trainer has a preprocessor that preprocesses data for itself independently. A

GPU randomly picks one yet-to-be-trained DNN whenever it becomes free, until no DNNs are left.

• Homogeneous Training This is the state-of-the-art framework recently published [Pit18] for

ensemble DNN training. This framework allows the DNNs that get trained at the same time to


Figure 5.5: The averaged speedups over the baseline in terms of the end-to-end time for training a 100-DNN ensemble. The error bars show the variations.

share the preprocessed data. But it is designed for homogeneous DNN training, assuming no

variations among DNNs or the situation where the number of DNNs is no greater than the

number of GPUs. In our experiments, when there are more DNNs than GPUs, the framework

randomly picks a subset of the remaining DNNs to train, one DNN per GPU with shared

preprocessed data. After that subset is done, it picks another subset and repeats the process

until all DNNs are done.

• FLEET-G This is FLEET in the global paradigm.

• FLEET-L This is FLEET in the local paradigm as described in Section 5.3.3.3. Its difference

from FLEET-G is that as soon as a DNN is done, the released GPUs are immediately used to

train some remaining DNNs; which DNNs are picked is determined by the greedy algorithm

as in FLEET-G, but only locally (for the newly released GPUs) rather than globally.

5.5.2 End-to-End Speedups

Figure 5.5 reports the speedups of the three methods over the baseline method, in terms of the

end-to-end ensemble training time of the 100 DNNs. All runtime overhead for FLEET is included.

We repeat each measurement multiple times and report the average and error bars.

It shows the results in eight settings. The prior homogeneous framework shows large slowdowns

in the first four settings where the number of GPUs is less than the number of DNNs. The slowdowns


Figure 5.6: Waiting time per GPU.

are due to the waiting of other GPUs for the slowest DNN to finish in each round, shown in Figure 5.6.

In the other four settings, the homogeneous framework performs similarly to the baseline: as there are more GPUs than DNNs, there is only one round, in which the two methods use resources similarly. The sharing of preprocessing in the homogeneous framework does not generate speedups

for these DNN trainings because the preprocessing is not the bottleneck for them.

FLEET-G gives the best overall performance, producing 1.12-1.92X speedups over the baseline.

The speedups come primarily from its better resource allocation to the DNNs. The

bottom of Table 5.3 reports the mean and standard deviations of the running lengths of DNNs in

the first five flotillas in FLEET-G (80GPU,100DNN). In comparison to the data in the baseline and

FLEET-L (top rows in Table 5.3), the DNNs show much smaller variations in length, which indicate

the effectiveness of the GPU allocations in FLEET-G in evening out the differences among DNNs. At

the beginning, we expected the catch of FLEET-G to be the waiting time of some GPUs after they finish their work in a round. Our experiments, however, show the opposite effect. As Figure 5.6 shows, the average waiting time per GPU is smallest for FLEET-G. The reason is that the other methods all suffer long waiting times at the end; because of their suboptimal resource allocation, some GPUs have to work long after others to finish up the last few DNNs. FLEET-L gives notable but smaller speedups because its local view leads to less favorable resource-allocation decisions.

Overall, FLEET gives larger speedups when #GPU > #DNN. It is worth noting that in such a

setting, there are still many flotillas to schedule and the FLEET scheduler plays an important role.

The reason is that in many cases, the FLEET scheduler assigns multiple GPUs to one DNN. For

instance, 20 flotillas were created when training 100 DNNs on 120 GPUs and 22 flotillas were created when training 100 DNNs on 160 GPUs. When #GPU < #DNN, the speedups from FLEET are smaller but still substantial: 1.13X-1.20X for three out of the four such settings in Figure 5.5.


Table 5.3: Mean and standard deviation of the running length of DNNs in seconds (80 GPUs, 100 DNNs).

Technique | Flotilla ID | Mean | Std. Dev.
Baseline | - | 10372.2 | 4178.9
FLEET-L | - | 6213 | 3580.0
FLEET-G | 0 | 2067.9 | 54.7
FLEET-G | 1 | 335.6 | 48.4
FLEET-G | 2 | 2291.9 | 26.0
FLEET-G | 3 | 415.5 | 51.9
FLEET-G | 4 | 1072.2 | 364.0
FLEET-G | 5 | 2322.3 | 216.0

Table 5.4: Scheduling and checkpointing overhead.

(#GPU, #DNN) | Total Training Time (sec) | Scheduling Overhead (sec) | Scheduling Overhead (%) | Checkpointing Overhead (sec) | Checkpointing Overhead (%)
(20,100) | 55200.1 | 20.1 | 0.037 | 1496.0 | 2.7
(40,100) | 30204.8 | 15.8 | 0.054 | 1156.0 | 3.8
(60,100) | 24495.0 | 14.0 | 0.060 | 986.0 | 4.0
(80,100) | 21891.0 | 12.0 | 0.057 | 816.0 | 3.7
(100,100) | 18359.1 | 10.1 | 0.058 | 782.0 | 4.3
(120,100) | 15323.9 | 9.9 | 0.068 | 680.0 | 4.4
(140,100) | 13366.3 | 9.3 | 0.073 | 680.0 | 5.1
(160,100) | 11825.2 | 10.2 | 0.092 | 748.0 | 6.3

5.5.3 Overhead

Table 5.4 reports the breakdown of the runtime overhead of FLEET-G. The overhead of scheduling

and checkpointing is at most 0.1% and 6.3%, respectively, of the end-to-end training time in all the settings. Recall that, due to the wall-clock-time limit of SummitDev, we used the small Caltech256 dataset. For large datasets (e.g., ImageNet), the overhead would become negligible. The profiling overhead is independent of the dataset size and depends solely on the ensemble size. Recall that profiling needs each DNN to train for only a few steps, and the profiling runs proceed in parallel. Its overhead is marginal for typical DNN trainings on large datasets that take hours or days. On recent GPUs, a feature called Multi-Process Service (MPS) could potentially allow multiple DNNs to be co-scheduled to a single GPU and

run concurrently. It is not considered in the current FLEET. To consider it, some co-run predictive

models could help, which could quickly predict the performance of a DNN when it co-runs with a

set of other DNNs on one GPU. The predictive performance can then be combined with the existing

performance models in FLEET to guide the scheduling of DNNs.

5.6 Related Work

Much research has been done to accelerate the training of a single DNN over distributed sys-

tems, such as TensorFlow from Google [Dea12; Aba15], Project Adam from Microsoft [Chi14], FireCaffe [Ian16a], PipeDream [Har18], and GPipe [Hua18]. All those studies have focused on improving

the training speed of an individual DNN rather than ensemble training.


Recently, ensemble training has started drawing more attention. Besides the work by Pittman et

al. [Pit18], there are some other efforts [Gar18; Los16] on ensemble training, but they focus on

designing lightweight methods to form high-performing ensembles instead of improving pipeline

efficiency of ensemble training. HiveMind [Nar18] is a system designed for accelerating the training

of multiple DNNs on a single GPU by fusing common operations (e.g., preprocessing) across models.

It, however, lacks the essential support for distributed DNN training.

Another line of research that is relevant to this work is task scheduling on clusters or workflow

management systems. The scheduling of a set of tasks or workloads on clusters or multiproces-

sor systems has been extensively studied in the literature [Tur92; Shm95; Urg02; Aug11; Zah08;

Gra15; Cho16; Xu18; Ous13; Che16; Del14; Fei97]. Recent works including Gandiva [Xia18] and Tiresias [Gu19] design GPU cluster managers tailored for DNN workloads. They, however, lack the flexibility supported in FLEET. First, they treat different jobs as independent black boxes. Without the MPI communication mechanisms we put into FLEET to enable flexible data exchanges between TensorFlow-based workers and preprocessors, these schedulers cannot flexibly adjust the number of workers for a DNN training. Second, as they treat the DNNs as separate jobs, they cannot support coordination across the DNNs in an ensemble, such as the sharing of preprocessed data and coordinated checkpointing at appropriate times.

Load balancing techniques for parallel computers such as nearest neighbor assignment to

dynamically distribute workloads have been studied in [Kum94]. The way FLEET distributes DNN

training workloads is fundamentally different because the task assignment is not fixed at the

beginning but dynamically determined based on both the convergence status and the training rate

of each DNN.

HPC schedulers and workflow management systems schedule batch jobs on HPC systems. They, however, cannot work directly with DNN ensembles for the same two reasons: they treat jobs as independent black boxes and hence cannot flexibly adjust the number of workers for a DNN training without FLEET's MPI communication mechanisms, and they cannot support coordination across the DNNs in an ensemble, such as the sharing of preprocessed data and coordinated checkpointing at appropriate times.

5.7 Conclusions

This chapter presented a systematic exploration on enabling flexible efficient ensemble training

for heterogeneous DNNs. It addressed a two-fold challenge. First, it formalized the essence of the problem into an optimal resource allocation problem, analyzed its computational complexity, and

presented an efficient greedy algorithm to effectively map DNNs to GPUs on the fly. Second, it

developed a set of techniques to seamlessly integrate distributed data-parallel training of DNN,

preprocessing sharing, and runtime DNN-to-GPU assignments together into a software framework,

FLEET. Experiments on 100 heterogeneous DNNs on SummitDev demonstrated that FLEET can


speed up the ensemble training by 1.12-1.92X over the default training method, and 1.23-1.97X over

the state-of-the-art framework that was designed for homogeneous ensemble training.


CHAPTER 6

IN-PLACE ZERO-SPACE MEMORY PROTECTION FOR CNN

6.1 Introduction

As CNNs are increasingly explored for safety-critical applications such as autonomous vehicles and

aerospace, reliability of CNN inference is becoming an important concern. A key threat is memory

faults (e.g., bit flips in memory), which may result from environment perturbations, temperature

variations, voltage scaling, manufacturing defects, wear-out, and radiation-induced soft errors.

These faults change the stored data (e.g., CNN parameters), which may cause large deviations of

the inference results [Li17; Rea18; Rea16]. In this work, fault rate is defined as the ratio between the

number of bit flips experienced before correction is applied and the total number of bits.

Existing solutions have resorted to general memory fault protection mechanisms, such as Error

Correction Codes (ECC) hardware [Sri15], spatial redundancy, and radiation hardening [Yu10].

Being CNN-oblivious, these protections incur large costs. ECC, for instance, uses eight extra bits

in protecting 64-bit memory; spatial redundancy requires at least two extra copies of CNN parameters to correct one error (Triple Modular Redundancy (TMR) [Lyo62]); radiation hardening is

subject to substantial area overhead and hardware cost. The spatial, energy, and hardware costs

are especially concerning for safety-critical CNN inferences; as they often execute on resource-

constrained (mobile) devices, the costs worsen the limit on model size and capacity, and increase

the cost of the overall AI solution.

To address the fundamental tension between the needs for reliability and the needs for space/en-

ergy/cost efficiency, this work proposes the first zero space cost memory protection for CNNs. The


design capitalizes on the opportunities brought by the distinctive properties of CNNs. It further am-

plifies the opportunities by introducing a novel training scheme, Weight Distribution-Oriented Train-

ing (WOT), to regularize the weight distributions of CNNs such that they become more amenable

for zero-space protection. It then introduces a novel protection method, in-place zero-space ECC,

which removes all space cost of ECC protection while preserving protection guarantees.

Experiments on VGG16, ResNet-18, and SqueezeNet validate the effectiveness of the proposed

solution. Across all tested scenarios, the method provides protections consistently comparable

to those offered by existing hardware ECC logic, while removing all space costs. It hence offers a

promising replacement of existing protection schemes for CNNs.

6.2 Premises and Scopes

This work focuses on protections of 8-bit quantized CNN models. On the one hand, although the

optimal bit width for a network depends on its weight distribution and might be lower than 8, we

have observed that 8-bit quantization is a prevalent, robust, and general choice to reduce model

size and latency while preserving accuracy. In our experiments, both activations and weights are

quantized to 8-bit. Existing libraries that support quantized CNNs (e.g., NVIDIA TensorRT [Mig17], Intel MKL-DNN [Mkl], Google’s GEMMLOWP [Jac17], Facebook’s QNNPACK [Qnn]) mainly target fast operators using 8-bit rather than lower bit widths. On the other hand, previous studies [Li17; Rea18] have suggested that CNNs should use data types that provide just-enough numeric value range and precision to increase their fault tolerance. Our explorations of using higher precision, including float32, for representing CNN parameters also show that 8-bit quantized models are the most resilient to memory faults.

The quantization algorithm we used is symmetric range-based linear quantization, which is well supported by major CNN frameworks (e.g., TensorFlow [Jac18], PyTorch [Zmo18]). Specifically, let $X$ be a floating-point tensor and $X^q$ be its 8-bit quantized version; $X$ can be either weights or activations from a CNN. The quantization is based on the following formula:

$$X^q = \mathrm{round}\left(X \cdot \frac{2^{n-1}-1}{\max\{|X|\}}\right), \qquad (6.1)$$

where $n$ is the number of bits used for quantization; in our case, $n = 8$. The number of bits used for accumulation is 32. Biases, if they exist, are quantized to 32-bit integers.
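For concreteness, the following is a minimal PyTorch sketch of Eq. 6.1 for per-tensor quantization; the function name and the returned scale are illustrative and are not the interface of the libraries mentioned above. The sketch assumes the input tensor is not all zeros.

import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int = 8):
    # Symmetric range-based linear quantization of Eq. 6.1 (per tensor).
    # Returns the integer tensor X^q and the scale needed to dequantize,
    # i.e., x is approximately x_q * scale with scale = max|x| / (2^(n-1) - 1).
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for n = 8
    scale = x.abs().max() / qmax                 # assumes max|x| > 0
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return x_q, scale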

Our work protects only weights for two reasons. Firstly, weights are usually kept in the mem-

ory. The longer they are kept, the higher the number of bit flips they will suffer from. This easily

results in a high fault rate (e.g. 1e-3) for weights. Activations, however, are useful only during an

inference process. Given the slight chance of having a bit flip during an inference process (usually

in milliseconds), protecting activations is not as pressing as protecting weights. Secondly, previous

work [Rea18] has shown that activations are much less sensitive to faults compared with weights.


Table 6.1: Accuracy and weight distribution of 8-bit quantized CNN models on ImageNet. The percentage rows use absolute values.

Model            AlexNet  VGG16   VGG16_bn  Inception-V3  ResNet-18  ResNet-34  ResNet-50  ResNet-152  SqueezeNet
#weights         61.1M    138.4M  138.4M    27.1M         11.7M      21.8M      25.5M      60.1M       1.2M
Accuracy (%)
  Float32        56.52    71.59   73.36     69.54         69.76      73.31      76.13      78.31       58.09
  Int8           55.8     71.51   72.01     68.07         69.07      72.83      75.33      77.79       57.01
Percentage (%)
  [0, 32)        95.09    97.69   98.83     97.98         99.66      99.76      99.65      99.49       95.16
  [32, 64)       4.88     2.27    1.16      1.96          0.32       0.23       0.34       0.49        4.62
  [64, 128]      0.03     0.04    0.01      0.06          0.02       0.01       0.01       0.01        0.22

Figure 6.1: Large-weight (beyond [−64, 63]) distribution across byte positions in 8-byte (64-bit data) blocks for SqueezeNet on ImageNet. For instance, the first bar shows that, of all the 8-byte data blocks storing weights, around 380 have a large weight at the first byte. (Plot: x-axis, position 1–8; y-axis, count 0–400.)

Error Correction Codes (ECC) are commonly used in computer systems to correct memory faults. They are usually described as a (k, d, t) code, with code words of length k, data of length d, and t-bit error correction; the number of required check bits is k − d. For example, the widely used SEC-DED (72, 64, 1) code stores 64 data bits in a 72-bit code word with 8 check bits, correcting single-bit errors and detecting double-bit errors.

6.3 In-Place Zero-Space ECC

Our proposed method, in-place zero-space ECC, builds on the following observation: weights of a well-trained CNN are mostly small values. The Percentage rows in Table 6.1 show the distributions of the absolute values of weights in some popular 8-bit quantized CNN models. The absolute values of more than 99% of the weights are less than 64. Even though eight bits are used to represent each weight, if we already know that the absolute value of a weight is less than 64, at most seven bits are needed to represent the value: in 8-bit two's complement, every value in [−64, 63] has its second most significant bit equal to the sign bit. That redundant bit could be used for other purposes, such as error correction; we call it a non-informative bit.

The core idea of in-place zero-space ECC is to use non-informative bits in CNN parameters to

store error check bits. For example, the commonly used SEC-DED (64, 57, 1) code uses seven check


bits to protect 57 data bits for single error correction; together they form a 64-bit code word. If seven out of eight consecutive weights are in the range [−64, 63], we then have seven non-informative bits, one per small weight, and the 57 data bits are exactly the eight bits of the remaining weight plus the seven informative bits of each small weight (8 + 7 × 7 = 57). The essential idea of in-place ECC is to use these non-informative bits to

store the error check bits for the eight weights. By embedding the check bits into the data, it can

hence avoid all space cost.

For in-place ECC to work, there cannot be more than one large weight in every eight consecutive weights. Moreover, the implementation has to record the locations of the large weights so that the decoding step can locate the error check bits in the data. This bookkeeping would become unnecessary if the large weights were regularly distributed, for example, if a large weight could appear only in the last byte of an 8-byte block. In practice, however, the distributions of large weights in CNNs are close to uniform, as Figure 6.1 shows.

6.3.1 WOT

To eliminate the need to store large-weight locations in in-place ECC, we enhance our design by introducing a new training scheme, namely weight distribution-oriented training (WOT). WOT aims to regularize the spatial distribution of large weights such that large values can appear only at specific places. We first formalize the WOT problem and then elaborate on our regularized training process.

Let $W_l$ be the float32 parameters (including both weights and biases) in the $l$-th convolutional layer and $W_l^q$ be their values after quantization. Note that WOT applies to fully-connected layers as well, even though our discussion focuses on convolutional layers. WOT minimizes the sum of the standard cross-entropy loss $f(\{W_l^q\}_{l=1}^L)$ and a weighted weight-regularization loss (the Frobenius norm with hyperparameter $\lambda$), subject to constraints on the weight distribution:

$$\min_{\{W_l^q\}} \; f(\{W_l^q\}_{l=1}^L) + \lambda \sum_{l=1}^{L} \|W_l^q\|_F^2, \qquad (6.2)$$
$$\text{s.t.} \quad W_l^q \in S_l, \quad l = 1, \dots, L. \qquad (6.3)$$

The weights of a layer form a four-dimensional tensor; if flattened, it is a vector of length $N_l \times C_l \times H_l \times W_l$, where $N_l$, $C_l$, $H_l$, and $W_l$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in the $l$-th convolutional layer. WOT adds constraints to each 64-bit data block in the flattened weight vectors. Recall that, for in-place ECC to protect a 64-bit data block, we need seven non-informative bits (i.e., seven small weights in the range [−64, 63]) to store the seven check bits. To regularize the positions of large values, the constraint on the weights of the $l$-th convolutional layer is given by $S_l = \{X \mid$ every value in the first seven positions of each 64-bit data block of $X$ lies in $[-64, 63]\}$. We next describe two potential solutions to this optimization problem.
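Before turning to them, the following minimal PyTorch sketch shows the unconstrained part of the objective in Eq. 6.2; the function name is illustrative, the default $\lambda$ matches the value used in the experiments of Section 6.4.2, and the constraint $W_l^q \in S_l$ is not encoded here because the two training schemes below enforce it separately (by projection in the ADMM variant, by throttling in QATT).

import torch
import torch.nn.functional as F

def wot_objective(model: torch.nn.Module, logits: torch.Tensor,
                  targets: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    # Cross-entropy loss plus the lambda-weighted squared Frobenius norm of the
    # convolutional weights (Eq. 6.2). For simplicity the regularizer is taken
    # over the float32 master weights; the chapter's formulation regularizes
    # the quantized values used in the forward pass.
    ce = F.cross_entropy(logits, targets)
    reg = sum((w ** 2).sum() for w in model.parameters() if w.dim() == 4)
    return ce + lam * reg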


6.3.1.1 ADMM-based Training

The above optimization problem can be formulated in the Alternating Direction Method of Multi-

pliers (ADMM) framework and solved in a way similar to an earlier work [Zha18c]. The optimization

problem (Eq. 6.2) is equivalent to:

$$\min_{\{W_l^q\}} \; f(\{W_l^q\}_{l=1}^L) + \lambda \sum_{l=1}^{L} \|W_l^q\|_F^2 + \sum_{l=1}^{L} g_l(W_l^q), \qquad (6.4)$$

where $g_l(W_l^q) = 0$ if $W_l^q \in S_l$, and $g_l(W_l^q) = +\infty$ otherwise. Rewriting Eq. 6.4 in the ADMM framework leads to:

$$\min_{\{W_l^q\}} \; f(\{W_l^q\}_{l=1}^L) + \lambda \sum_{l=1}^{L} \|W_l^q\|_F^2 + \sum_{l=1}^{L} g_l(Z_l), \qquad (6.5)$$
$$\text{s.t.} \quad W_l^q = Z_l, \quad l = 1, \dots, L. \qquad (6.6)$$

ADMM alternates between the optimization of the model parameters $\{W_l^q\}_{l=1}^L$ and the auxiliary variables $\{Z_l\}_{l=1}^L$ by repeating the following three steps for $k = 1, 2, \dots$:

$$\{W_l^{q,k+1}\}_{l=1}^L = \arg\min_{\{W_l^q\}_{l=1}^L} \; f(\{W_l^q\}_{l=1}^L) + \lambda \sum_{l=1}^{L} \|W_l^q\|_F^2 + \gamma \sum_{l=1}^{L} \|W_l^q - Z_l^k + U_l^k\|_F^2, \qquad (6.7)$$
$$\{Z_l^{k+1}\}_{l=1}^L = \arg\min_{\{Z_l\}_{l=1}^L} \; \sum_{l=1}^{L} g_l(Z_l) + \sum_{l=1}^{L} \frac{\gamma}{2} \|W_l^{q,k+1} - Z_l + U_l^k\|_F^2, \qquad (6.8)$$
$$U_l^{k+1} = U_l^k + W_l^{q,k+1} - Z_l^{k+1}, \qquad (6.9)$$

until the two conditions are met: $\|W_l^{q,k+1} - Z_l^{k+1}\|_F^2 \le \epsilon$ and $\|Z_l^{k+1} - Z_l^k\|_F^2 \le \epsilon$.

Problem 6.7 can be solved using stochastic gradient descent (SGD), as the objective function is differentiable. The optimal solution to Problem 6.8 is the projection of $W_l^{q,k+1} + U_l^k$ onto the set $S_l$. In the implementation, we set a value in a 64-bit data block to 63 or −64 if the value is not in the eighth position and is larger than 63 or smaller than −64.

Previous work has successfully applied the ADMM framework to CNN weight pruning [Zha18c] and CNN weight quantization [Ren19] and has shown remarkable compression results. When applied to our problem, however, experiments show that ADMM-based training cannot effectively reduce the number of large values in the first seven positions of a 64-bit data block. Moreover, as ADMM-based training cannot guarantee that the constraint in Eq. 6.3 is satisfied, it is necessary to bound the remaining large quantized values in the first seven positions to 63 or −64 after the training, resulting in large accuracy drops. Instead of ADMM-based training, WOT adopts an alternative approach described below.


6.3.1.2 QAT with Throttling (QATT)

Our empirical explorations indicate that a simple quantization-aware training (QAT) procedure combined with weight throttling can make the weights meet the constraint without jeopardizing the accuracy of an 8-bit quantized model. The training process iterates the following major steps for each batch:

1. QAT: It involves forward propagation using the quantized parameters ($\{W_l^q\}_{l=1}^L$ and $\{b_l^q\}_{l=1}^L$) to compute the loss defined in Equation 6.2, back propagation using the quantized parameters, an update step that applies the float32 gradients to the float32 parameters ($\{W_l\}_{l=1}^L$ and $\{b_l\}_{l=1}^L$), and a quantization step that derives the new quantized parameters from their float32 versions.

2. Throttling: It forces the quantized weights to meet the constraint defined in Eq. 6.3: if any value in the first seven bytes of a 64-bit data block is larger than 63 (or smaller than −64), set the value to 63 (or −64). The float32 versions are updated accordingly (see the sketch after this list).
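A minimal PyTorch sketch of the throttling step is given below. It assumes per-tensor symmetric quantization (so a dequantized value is the int8 value times a scale of max|W|/127) and a contiguous flattened weight vector whose length is a multiple of 8; the function and argument names are illustrative.

import torch

def throttle(w_q: torch.Tensor, w_fp32: torch.Tensor, scale: float):
    # Clamp the quantized weights so that only the eighth value of every
    # 64-bit (8-weight) block may fall outside [-64, 63], and keep the
    # float32 master copy consistent with the clamped int8 values.
    q_blocks = w_q.view(-1, 8)              # one row per 64-bit data block
    fp_blocks = w_fp32.view(-1, 8)
    first7 = q_blocks[:, :7]
    clamped = first7.clamp(-64, 63)         # the throttling itself
    changed = clamped != first7
    q_blocks[:, :7] = clamped
    fp_blocks[:, :7] = torch.where(changed, clamped.float() * scale, fp_blocks[:, :7])
    return w_q, w_fp32

The same clamping of the first seven positions also serves as the projection onto $S_l$ used in the ADMM variant described above.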

After the training, all of the values in the first seven positions of a 64-bit data block are ensured to

be within the range of [−64, 63], eliminating the need of storing large value positions for the in-place

ECC. It is worth noting that with WOT, all tested CNNs converge without noticeable accuracy loss

compared to the 8-bit quantized versions as Section 6.4 shows.

6.3.2 Full Design of In-Place Zero-Space ECC

In this part, we provide the full design of in-place zero-space ECC. For a given CNN, it first applies

WOT to regularize the CNN. After that, it conducts in-place error check encoding. The encoding uses

the same encoding algorithm as the standard error-correction encoding methods do; the difference

lies only in where the error check bits are placed.

There are various error-correction encoding algorithms. In principle, our proposed in-place

ECC could be generalized to various codes; we focus our implementation on SEC-DED codes because of their popularity in existing hardware-based memory protection for CNNs.

Our in-place ECC features the same protection guarantees as the popular SEC-DED (72,64,1)

code but at zero-space cost. The in-place ECC uses the SEC-DED (64, 57, 1) code instead of (72, 64, 1)

to protect a 64-bit data block with the same protection strength. It distributes the seven error check

bits into the non-informative bits in the first seven weights.
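To make the bit layout concrete, the following NumPy sketch shows where the seven check bits live and how the small weights are recovered after decoding. The check bits themselves would come from a standard (64, 57, 1) SEC-DED encoder, which is not reimplemented here, and the function names are illustrative.

import numpy as np

def embed_check_bits(block: np.ndarray, check_bits: np.ndarray) -> np.ndarray:
    # block: 8 int8 weights; WOT guarantees the first seven lie in [-64, 63],
    #        so their bit 6 duplicates the sign bit (bit 7) and is free.
    # check_bits: 7 values in {0, 1} from a (64, 57, 1) SEC-DED encoder.
    words = block.view(np.uint8).copy()
    for i, c in enumerate(check_bits):
        words[i] = (words[i] & 0b10111111) | (int(c) << 6)   # overwrite bit 6
    return words.view(np.int8)

def recover_weights(words: np.ndarray) -> np.ndarray:
    # After ECC decoding, restore each small weight by copying its sign bit
    # (bit 7) back into the non-informative bit (bit 6).
    w = words.view(np.uint8).copy()
    sign = (w[:7] >> 7) & 1
    w[:7] = (w[:7] & 0b10111111) | (sign << 6)
    return w.view(np.int8)

With WOT, the eighth byte of each block is the only one that may hold a large weight, so this layout is fixed and no location metadata is needed.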

As the ECC check bits are stored in-place, a minor extension to the existing ECC hardware is

required to support ECC decoding. As shown in Figure 6.2, the in-place ECC check bits and data

bits are swizzled to the right inputs to the standard ECC logic. The output of the ECC logic is then

used to recover the original weights: for each small weight (the first seven bytes in an 8-byte data block),

simply copy the sign bit to its non-informative bit. As only additional wiring is needed to implement

this copy operation, no latency overhead is incurred to the standard ECC logic.


Figure 6.2: Hardware design for in-place zero-space ECC protection. (Diagram: the seven in-place check bits P0–P6 and the 57 data bits of an 8-weight (64-bit) block are swizzled into the inputs of the existing ECC logic with the (64, 57) SEC-DED Hamming code; its output restores the 8-weight block by copying each small weight's sign bit into its non-informative bit.)

6.4 Evaluations

We conducted a set of experiments to examine the efficacy of the proposed techniques in fault

protection and overhead. We first describe our experiment settings in Section 6.4.1 and then report

the effects of WOT and the proposed fault protection technique in Sections 6.4.2 and 6.4.3.

6.4.1 Experiment Settings

6.4.1.1 Models, Datasets, and Machines

The models we used in the fault injection experiments include VGG16 [Sim14], ResNet-18 [He16],

and SqueezeNet [Ian16b]. We choose these CNN models as representatives because: 1) VGG is a

typical CNN with stacked convolutional layers and widely used in transfer learning because of its

robustness. 2) ResNets are representatives of CNNs with modular structure (e.g. Residual Module)

and are widely used in advanced computer vision tasks such as object detection. 3) SqueezeNet

has far fewer parameters and represents CNNs that are designed for mobile applications. The accuracies of these models are listed in Table 6.1. By default, we use the ImageNet dataset [Den09]

(ILSVRC 2012) for model training and evaluation. All the experiments are performed with PyTorch

1.0.1 on machines equipped with a 40-core 2.2GHz Intel Xeon Silver 4114 processor, 128GB of RAM,

and an NVIDIA TITAN Xp GPU with 12GB memory. Distiller [Zmo18] is used for 8-bit quantization.

The CUDA version is 10.1.

6.4.1.2 Counterparts for Comparisons

We compare our method (denoted as in-place) with the following three counterparts:

• No Protection (faulty): The CNN has no memory protection.


• Parity Zero (zero): It adds one parity bit to detect single bit errors in an eight-bit data block

(e.g., a single weight parameter). Once an error is detected, the weight is set to zero.1

• SEC-DED (ecc): It is the traditional SEC-DED (72, 64, 1) code-based protection used in computer systems [Sri15].

There are some previous proposals of memory protections [Rea16; Azi18], which are, however, designed for special CNN accelerators and provide no protection guarantees. Parity and ECC represent the industry state of the art for memory protection that works generally across processors and offers protection guarantees; hence they serve as the counterparts for our comparison.

6.4.2 WOT results

We evaluate the effectiveness of WOT using the CNNs shown in Table 6.1. All the models are pre-trained

on ImageNet (downloaded from TorchVision2). We set λ to 0.0001 for all of the CNNs. Model training

uses stochastic gradient descent with a constant learning rate 0.0001 and momentum 0.9. Batch size

is 32 for VGG16_bn and ResNet-152, 64 for ResNet-50 and VGG16, and 128 for the remaining models.

Training stops as long as the model accuracy after weight throttling reaches its 8-bit quantized

version.

Figure 6.3 shows the changes of the total number of large values that are beyond [−64,63] in

the first seven positions of 8-byte blocks during the training on six of the CNNs. WOT successfully

reduces this number from more than 3,500–80,000 to near 0 for the models before throttling during

the training process. The remaining few large values in non-eighth positions are set to -64 or 63 at

the end of WOT. Note that VGG16_bn has around 10000 large values in the non eighth positions

after 8k iterations. Although more iterations further reduce this number, VGG16_bn can already

reach its original accuracy after weight throttling.

The accuracy curves of the models in the WOT training are shown in Figure 6.4. Overall, after

WOT training, the original accuracy of all the six networks are fully recovered. During the training,

the gap between the accuracy before throttling and after throttling is gradually reduced. For example,

the top-1 accuracy of SqueezeNet after 8-bit quantization is 57.01%. After the first iteration of WOT,

the accuracy before weight throttling is 31.38% and drops to 11.54% after throttling. WOT increases

the accuracy to 57.11% after 46k iterations with batch size 128 (around 4 epochs). All the other CNNs

are able to recover their original accuracy in only a few thousand iterations. An exception is

VGG16, which reaches an accuracy of 71.50% (only 0.01% accuracy loss) after 20 epochs of training.

6.4.3 Fault injection results

In this set of experiments, we inject faults into CNN models and report the accuracy drops of CNN models protected using different strategies. The fault model is random bit flips. Faults are injected into

1 We have tried setting a detected faulty weight to the average of its neighbors but found that it performs worse than Parity Zero.

2https://pytorch.org/docs/master/torchvision/


Table 6.2: Accuracy drop of VGG16, ResNet-18, and SqueezeNet under different memory fault rates.

Model       Strategy   ECC HW   Space          Accuracy drop (%) under different fault rates
                       (Y/N)    overhead (%)   1e-06          1e-05          1e-04          1e-03
VGG16       faulty     N        0              0.31 ± 0.08    0.47 ± 0.09    1.35 ± 0.2     21.93 ± 5.7
            zero       N        12.5           0.27 ± 0.05    0.36 ± 0.08    0.43 ± 0.13    1.04 ± 0.31
            ecc        Y        12.5           0.0 ± 0.0      0.02 ± 0.02    0.35 ± 0.06    0.96 ± 0.14
            in-place   Y        0              0.0 ± 0.0      0.02 ± 0.02    0.37 ± 0.07    0.93 ± 0.23
ResNet-18   faulty     N        0              -0.09 ± 0.1    0.35 ± 0.23    4.35 ± 1.12    72.96 ± 1.48
            zero       N        12.5           -0.06 ± 0.08   -0.08 ± 0.13   0.59 ± 0.3     4.35 ± 1.21
            ecc        Y        12.5           0.0 ± 0.0      0.0 ± 0.01     -0.03 ± 0.08   2.8 ± 0.31
            in-place   Y        0              0.0 ± 0.0      0.0 ± 0.01     -0.08 ± 0.09   2.96 ± 0.81
SqueezeNet  faulty     N        0              0.12 ± 0.13    0.69 ± 0.31    9.39 ± 2.37    64.83 ± 0.5
            zero       N        12.5           0.09 ± 0.12    0.11 ± 0.2     0.66 ± 0.29    8.16 ± 2.4
            ecc        Y        12.5           0.0 ± 0.0      0.0 ± 0.0      0.12 ± 0.09    5.37 ± 0.66
            in-place   Y        0              0.0 ± 0.0      0.0 ± 0.0      0.12 ± 0.09    5.19 ± 1.08

the weights of the CNNs with memory fault rates varying from 1e-9 to 1e-3. The number of faulty bits is the product of the number of bits used to represent the weights of a CNN and the memory fault rate. We repeated each fault injection experiment ten times.
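A minimal PyTorch sketch of this fault model follows; the function name, the fixed seed, and operating on a flattened int8 copy are illustrative choices, not the exact harness used in the experiments.

import torch

def inject_bit_flips(w_q: torch.Tensor, fault_rate: float, seed: int = 0) -> torch.Tensor:
    # Flip (numel * 8 * fault_rate) uniformly random bits in an int8 weight
    # tensor, following the random bit-flip fault model described above.
    g = torch.Generator().manual_seed(seed)
    total_bits = w_q.numel() * 8
    n_faults = int(round(total_bits * fault_rate))
    flat = w_q.flatten().clone().view(torch.uint8)   # reinterpret bytes, keep bit patterns
    # Duplicate positions are possible but negligible at the fault rates studied here.
    for pos in torch.randint(total_bits, (n_faults,), generator=g).tolist():
        byte, bit = divmod(pos, 8)
        flat[byte] ^= (1 << bit)
    return flat.view(torch.int8).view_as(w_q)

The faulty weights are then loaded back into the model for inference to measure the accuracy drop.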

Table 6.2 shows the mean accuracy drops with standard deviations under different memory fault

rates and the overheads introduced by the protection strategies for each model. Overall, the in-place

ECC protection and standard SEC-DED show similar accuracy drop patterns under various fault

rate settings as expected because they provide the same error correction capability, i.e., correcting a

single bit error and detecting double bit errors in a 64-bit data block. Both of the methods provide

stronger fault protection compared with the Parity Zero method. The space overhead is the ratio

between the extra number of bytes introduced by a protection strategy and the number of bytes

required to store weights. Parity Zero and SEC-DED encode 8-byte data with extra eight check bits

on average, making their space overhead 12.5%. In contrast, in-place ECC has zero space cost.

The fault injection experiments give the following insights on memory fault protection for CNNs.

First, larger models tend to suffer less from memory faults. For example, when the fault rate is 1e-4 and no protection is applied, the accuracy drops of VGG16, ResNet-18, and SqueezeNet (less than 2%, 8%, and 16%, respectively) increase as the model size decreases (138M, 12M, and 1.2M parameters, respectively).

in-place ECC and standard SEC-DED can almost guarantee the same accuracy as the fault-free

model. Overall, the experiments confirm the potential of in-place zero-space ECC as an efficient

replacement of the standard ECC without compromising the protection quality.

6.5 Related Work

There are some early studies on fault tolerance of earlier neural networks (NN) [Pha95; Pro93; TH17];

they examined the performance degradation of NNs with various fault models on networks that

differ from modern CNNs in both network topology and model complexity.

Fault tolerance of deep neural networks (DNNs) has recently drawn increasing attention. Li et al. [Li17] studied soft error propagation in DNN accelerators and proposed to leverage symptom-


based error detectors for detecting errors and a hardware-based technique, selective latch hardening,

for detecting and correcting data-path faults. Recent work [Rea18; Are18] conducted empirical

studies to quantify the fault tolerance of DNNs to memory faults and revealed that DNN fault

tolerance varies with respect to model, layer type, and structure. Zhang et al. [Zha18b] proposed

fault-aware pruning with retraining to mitigate the impact of permanent faults for systolic array-

based CNN accelerators (e.g., TPUs). They focused only on faults in the data-path and ignored

faults in the memory. Qin et al. [Qin17] studied the performance degradation of 16-bit quantized

CNNs under different bit flip rates and proposed to set values of detected erroneous weights as

zeros to mitigate the impact of faults. These prior works focused mainly on the characterization

of DNN’s fault tolerance with respect to various data types and network topologies. While several

software-based protection solutions were explored, they are preliminary. Some can only detect but

not correct errors (e.g., detecting extreme values [Li17]); others have limited protection capability

(e.g. setting faulty weights to zero [Qin17]).

Some prior work proposes designs of energy-efficient DNN accelerators by exploiting fault

tolerance of DNNs [Tem12; Kim18; Zha18a]. An accelerator design [Rea16] optimizes SRAM power

by reducing the supply voltage. It leverages active hardware fault detection coupled with bit masking

that shifts data towards zero to mitigate the impact of bit flips on DNNs’ model accuracy without

the need for re-training. Similar hardware fault detection techniques were later exploited in [Wha17;

Sal18; Zha18a; Hac19] to improve fault tolerance of DNNs. Azizimazreah et al. [Azi18] proposed

a novel memory cell designed to eliminate soft errors while achieving a low power consumption.

These designs target special accelerators rather than general DNN reliability protection; they are still subject to various costs and, unlike existing ECC protection, offer no protection guarantees. The current work aims to reduce the space cost of protection to zero without compromising the reliability of existing protections.

6.6 Future Directions

Besides 8-bit quantization, there are proposals of even lower-bit quantizations for CNNs, in which there may be fewer non-informative bits in the weight values. It is, however, worth noting that 8-bit quantization is the de facto standard in most existing CNN frameworks; it has repeatedly proven in practice to be a robust choice that offers an excellent balance between model size and accuracy. Improving the reliability of such models is hence essential. That said, creating zero-space protections that work well with other quantizations is a direction worth future exploration.

A second direction worth exploring is to extend the in-place zero-space protection to other

error encoding methods (e.g., BCH [Cos83]). Some of them require more parity bits, for which the

regularized training may need to be extended to create more free bits in data.

Finally, in-place zero-space ECC is in principle applicable to neural networks beyond CNNs; empirically assessing its efficacy there is left to future studies.


6.7 Conclusions

This chapter presented in-place zero-space ECC assisted with a new training scheme named WOT

to protect CNN memory. The protection scheme removes all space cost of ECC without compromis-

ing the reliability offered by ECC, opening new opportunities for enhancing the accuracy, energy

efficiency, reliability, and cost effectiveness of CNN-driven AI solutions.


Figure 6.3: Changes of the total number of large values (beyond [−64, 63]) in the first seven positions of 8-byte (64-bit data) blocks before the throttling step during the WOT training process. (Panels: (a) AlexNet, (b) VGG16_bn, (c) ResNet-18, (d) ResNet-34, (e) ResNet-50, (f) SqueezeNet; x-axis: iterations (k); y-axis: count.)


Figure 6.4: Accuracy curves before and after the throttling step during the WOT training process. (Panels: (a) AlexNet, (b) VGG16_bn, (c) ResNet-18, (d) ResNet-34, (e) ResNet-50, (f) SqueezeNet; x-axis: iterations (k); y-axis: accuracy (%); curves: WOT before throttling, WOT after throttling, original accuracy.)


CHAPTER 7

CONCLUSIONS AND FUTURE WORK

Technical advances in machine learning (ML) have changed ML from a bespoke solution for special

tasks to a practical technology that can be deployed almost everywhere. This thesis focuses on

resolving challenges in adopting ML in real-world applications running on various systems. It

demonstrates the benefits of reuse-centric optimization for (1) efficient model discovery via reuse-

centric k-means configuration in Chapter 3, composability-based fast CNN pruning in Chapter 4,

and ensemble training with data sharing in Chapter 5, and (2) reliable model inference via in-place

zero-space memory protection for CNN in Chapter 6. All of these optimizations share the common

reuse principle and the semantic-changing optimization philosophy. A high-level summary of the

thesis and research scope is shown in Figure 7.1.

Efficient Model Discovery. We start with k-means clustering and demonstrate in Chapter 3 that implementation-level reuse opportunities can be leveraged to accelerate k-means algorithm configuration by up to 9X. We then focus on efficient CNN model discovery, which is notoriously slow and one of the major hurdles to shortening the time-to-market of AI products. Chapter 4 presents a

compiler-based framework that speeds up CNN pruning by up to 186X via algorithm-level reuse.

We empirically uncover the existence of composability in the training of a collection of pruned CNN

models, design a compression-based algorithm to identify reuse opportunities, and propose a new

training scheme called composability-based CNN pruning to get benefits from reuse. Chapter 5

studies resource allocation problems when training a heterogeneous set of DNNs and presents a

flexible ensemble training framework called FLEET that achieves up to 2X speedup in ensemble

training. The framework integrates data-parallel distributed training, checkpointing, sharing of

preprocessed data, and efficient DNNs-to-GPUs mapping to best exploit infrastructure-level reuse

opportunities.


Figure 7.1: Summary of the thesis. (Diagram: reuse-centric optimization covers model discovery, including k-means configuration (Chap. 3), CNN pruning (Chap. 4), and ensemble training (Chap. 5), and model inference, including memory faults (Chap. 6); it generalizes the semantic-preserving optimization of traditional optimizing compilers toward the new paradigm of semantic-changing program optimization.)

Reliable Model Inference. As CNNs are increasingly adopted in safety-critical applications, making CNNs a reliable solution becomes a matter of urgency. Chapter 6 presents in-place zero-space ECC, assisted by a new training scheme, weight distribution-oriented training, to improve the reliability of CNN inference. The new method provides the first known zero-space-cost memory

protection for CNNs without compromising the reliability offered by traditional ECC.

Future Work. This thesis has shed some light on how reuse opportunities can be leveraged to

ensure the efficiency and reliability of ML via innovations in both algorithms and programming

systems. Many fundamental questions on reuse-centric optimization remain for future exploration.

An immediate direction is to generalize the success in efficient CNN pruning to a much

broader range of ML models (e.g. Recurrent Neural Networks), other search strategies (e.g., re-

inforcement learning and evolutionary algorithms), and other search objectives (e.g., accuracy).

The recent progress on neural architecture search (NAS) has shown some promising results on

automated model discovery. NAS helps eliminate the tedious and error-prone manual design ef-

forts. However, due to the lack of understanding of fundamental properties in trading off accuracy

for performance, it requires large-scale search which is too costly for many users. To reduce the

search cost, it is necessary to develop more efficient model search and training algorithms and then

automate them via novel programming system designs.

A more fundamental direction is to answer the many open questions lying at the foundation

level of reuse-centric optimization. What is the full taxonomy of reuse opportunities? Is it possible

to automatically uncover the hidden reuse opportunities in an arbitrary ML algorithm? Is there a

principled way to create and leverage reuse opportunities? How can we characterize the tradeoffs

between reuse benefits and the myriad other objectives (accuracy, size, energy, security, reliability,

interpretability, etc.) of ML? Are reuse-centric optimizations composable? How can we make them

work synergistically and avoid conflicts? What are the connections with other optimizations (in


both software and hardware) of deep learning? Answers to these questions will build the theoreti-

cal foundation for reuse-centric optimizations for deep learning. They will also pave the way for

generalizing the optimization to problems even beyond ML.

The achievements in improving CNN reliability elucidate the importance of a comprehensive

characterization of ML systems. Just as the small-weight observation motivates our in-place zero-

cost protection strategy, a comprehensive characterization of ML systems could expose more opti-

mization opportunities for improving system reliability and achieving the ultimate goal of making

AI dependable. The characterization should consider all layers from domain-specific requirements

(How much reliability is required? Is the application accuracy-critical or performance-critical?),

the choice of ML models (How robust is the model to adversarial attacks or faults?), and the relia-

bility of software stacks (Is the kernel implementation bug-free?) and computing hardware (Does

the hardware exhibit a relatively high fault rate?). The potential optimizations can go beyond reuse-centric approaches and require software-hardware co-design.

These explorations are just the beginning of the journey to the development of a new optimiza-

tion paradigm called semantic-changing program optimization, shown in Figure 7.1. Programming

systems work has long followed the contract that the program to be compiled and the program to

be executed should be equivalent. In other words, we conduct semantic-preserving optimization

for our programs. Now, with the wide adoption of ML in real applications, ML has become part

of our programs and software. ML has seemingly opened an opportunity to develop the new optimization paradigm, semantic-changing program optimization. In this new paradigm, we change the semantics or intentions of our programs by optimizing algorithms or workflows, rather than just instructions, to obtain larger performance gains. Reuse-centric optimization is only the tip of the iceberg. The new paradigm grows out of the principles of optimizing compilers and is inspired by approximate computing. As this thesis has shown, the new paradigm has much potential, and there are many open questions to answer: Are there principled ways to

characterize the nature of applicable situations? What is the set of techniques to develop to best

leverage these opportunities? What are the connections to modern programming language designs?

How to go beyond ML? What are the relations to other aspects of software (e.g., reliability, security)?


BIBLIOGRAPHY

[Aba15] Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.Software available from tensorflow.org. 2015.

[Agh17] Aghasi, A. et al. “Net-Trim: Convex Pruning of Deep Neural Networks with PerformanceGuarantee”. Advances in Neural Information Processing Systems. 2017, pp. 3180–3189.

[Ang13] Anguita, D. et al. “A Public Domain Dataset for Human Activity Recognition usingSmartphones.” ESANN. 2013.

[Are18] Arechiga, A. P. & Michaels, A. J. “The Effect of Weight Errors on Neural Networks”. 2018IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC).IEEE. 2018.

[Art07] Arthur, D. & Vassilvitskii, S. “k-means++: The advantages of careful seeding”. Proceed-ings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Societyfor Industrial and Applied Mathematics. 2007, pp. 1027–1035.

[Ash17] Ashok, A. et al. “N2N Learning: Network to Network Compression via Policy GradientReinforcement Learning”. arXiv preprint arXiv:1709.06030 (2017).

[Asu07] Asuncion, A. & Newman, D. UCI machine learning repository. 2007.

[Aug11] Augonnet, C. et al. “StarPU: a unified platform for task scheduling on heterogeneousmulticore architectures”. Concurrency and Computation: Practice and Experience 23.2(2011), pp. 187–198.

[Azi18] Azizimazreah, A. et al. “Tolerating Soft Errors in Deep Learning Accelerators with Reli-able On-Chip Memory Designs”. 2018 IEEE International Conference on Networking,Architecture and Storage (NAS). IEEE. 2018, pp. 1–10.

[Ba14] Ba, J. & Caruana, R. “Do deep nets really need to be deep?” Advances in neural informa-tion processing systems. 2014, pp. 2654–2662.

[Bal07] Balaprakash, P. et al. “Improvement strategies for the F-Race algorithm: Sampling designand iterative refinement”. International Workshop on Hybrid Metaheuristics. Springer.2007, pp. 108–122.

[Ber12] Bergstra, J. & Bengio, Y. “Random search for hyper-parameter optimization”. Journal ofMachine Learning Research 13.Feb (2012), pp. 281–305.

[Bir02] Birattari, M. et al. “A racing algorithm for configuring metaheuristics”. Proceedings of the4th Annual Conference on Genetic and Evolutionary Computation. Morgan KaufmannPublishers Inc. 2002, pp. 11–18.

[Bir10] Birattari, M. et al. “F-Race and iterated F-Race: An overview”. Experimental methods forthe analysis of optimization algorithms. Springer, 2010, pp. 311–336.


[Boj16] Bojarski, M. et al. “End to end learning for self-driving cars”. preprint arXiv:1604.07316(2016).

[Bot10] Bottou, L. “Large-scale machine learning with stochastic gradient descent”. Proceedingsof COMPSTAT’2010. Springer, 2010, pp. 177–186.

[Buc06] Bucilua, C. et al. “Model compression”. Proceedings of the 12th ACM SIGKDD inter-national conference on Knowledge discovery and data mining. ACM. 2006, pp. 535–541.

[Buc18] Buckler, M. et al. “EVA2: Exploiting Temporal Redundancy in Live Computer Vision”.arXiv preprint arXiv:1803.06312 (2018).

[Caf] Caffe Solver Prototxt.https://github.com/BVLC/caffe/wiki/Solver-Prototxt.

[Can16] Canziani, A. et al. “An analysis of deep neural network models for practical applications”.arXiv preprint arXiv:1605.07678 (2016).

[Che18a] Chen, H. et al. “The rise of deep learning in drug discovery”. Drug discovery today 23.6(2018), pp. 1241–1250.

[Che18b] Chen, T. et al. “{TVM}: An automated end-to-end optimizing compiler for deep learning”.13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI}18). 2018, pp. 578–594.

[Che16] Cheng, D. et al. “Improving performance of heterogeneous mapreduce clusters withadaptive task tuning”. IEEE Transactions on Parallel and Distributed Systems 28.3 (2016),pp. 774–786.

[Chi14] Chilimbi, T. et al. “Project adam: Building an efficient and scalable deep learning trainingsystem”. 11th {USENIX} Symposium on Operating Systems Design and Implementation(OSDI). 2014, pp. 571–582.

[Chi01] Chilimbi, T. M. “Efficient representations and abstractions for quantifying and exploitingdata reference locality”. ACM SIGPLAN Notices. Vol. 36. 5. ACM. 2001, pp. 191–202.

[Chi02] Chilimbi, T. M. & Hirzel, M. “Dynamic hot data stream prefetching for general-purposeprograms”. ACM SIGPLAN Notices. Vol. 37. 5. ACM. 2002, pp. 199–209.

[Cho15] Chollet, F. et al. Keras. https://keras.io. 2015.

[Cho16] Chowdhury, M. et al. “{HUG}: Multi-Resource Fairness for Correlated and Elastic De-mands”. 13th {USENIX} Symposium on Networked Systems Design and Implementation({NSDI} 16). 2016, pp. 407–424.

[Coa12] Coates, A. & Ng, A. Y. “Learning feature representations with k-means”. Neural networks:Tricks of the trade. Springer, 2012, pp. 561–580.

[Cos83] Costello, D. J. Error Control Coding: Fundamentals and Applications. prentice Hall, 1983.


[Dav79] Davies, D. L. & Bouldin, D. W. “A cluster separation measure”. IEEE transactions onpattern analysis and machine intelligence 2 (1979), pp. 224–227.

[Dea12] Dean, J. et al. “Large scale distributed deep networks”. Advances in neural informationprocessing systems. 2012, pp. 1223–1231.

[Del14] Delimitrou, C. & Kozyrakis, C. “Quasar: resource-efficient and QoS-aware cluster man-agement”. ACM SIGARCH Computer Architecture News. Vol. 42. 1. ACM. 2014, pp. 127–144.

[Den09] Deng, J. et al. “Imagenet: A large-scale hierarchical image database”. 2009 IEEE confer-ence on computer vision and pattern recognition. Ieee. 2009, pp. 248–255.

[Dev18] Devlin, J. et al. “Bert: Pre-training of deep bidirectional transformers for language un-derstanding”. arXiv preprint arXiv:1810.04805 (2018).

[Din15a] Ding, Y. et al. “Top: A framework for enabling algorithmic optimizations for distance-related problems”. Proceedings of the VLDB Endowment 8.10 (2015), pp. 1046–1057.

[Din15b] Ding, Y. et al. “Yinyang k-means: A drop-in replacement of the classic k-means withconsistent speedup”. Proceedings of the 32nd International Conference on MachineLearning (ICML-15). 2015, pp. 579–587.

[Dra12] Drake, J. & Hamerly, G. “Accelerated k-means with adaptive distance bounds”. 5th NIPSworkshop on optimization for machine learning. 2012, pp. 42–53.

[Du89] Du, J. & Leung, J. Y.-T. “Complexity of scheduling parallel task systems”. SIAM Journalon Discrete Mathematics 2.4 (1989), pp. 473–487.

[Dun74] Dunn, J. C. “Well-separated clusters and optimal fuzzy partitions”. Journal of cybernetics4.1 (1974), pp. 95–104.

[Eck18] Eckert, C. et al. “Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Net-works”. arXiv preprint arXiv:1805.03718 (2018).

[Elk03] Elkan, C. “Using the triangle inequality to accelerate k-means”. ICML. Vol. 3. 2003,pp. 147–153.

[Fei97] Feitelson, D. G. et al. “Theory and practice in parallel job scheduling”. Workshop on JobScheduling Strategies for Parallel Processing. Springer. 1997, pp. 1–34.

[Fig17] Figurnov, M. et al. “Spatially adaptive computation time for residual networks”. arXivpreprint (2017).

[Fow18] Fowers, J. et al. A Configurable Cloud-Scale DNN Processor for Real-Time AI. IEEE, 2018.

[Fri01] Friedman, J. et al. The elements of statistical learning. Vol. 1. Springer series in statisticsSpringer, Berlin, 2001.


[Fu17] Fu, J. et al. “Look closer to see better: Recurrent attention convolutional neural networkfor fine-grained image recognition”. Conf. on Computer Vision and Pattern Recognition.2017.

[Gar18] Garipov, T. et al. “Loss surfaces, mode connectivity, and fast ensembling of dnns”. Ad-vances in Neural Information Processing Systems. 2018, pp. 8789–8798.

[Gor18] Gordon, A. et al. “Morphnet: Fast & simple resource-constrained structure learning ofdeep networks”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).2018.

[Gra15] Grandl, R. et al. “Multi-resource packing for cluster schedulers”. ACM SIGCOMM Com-puter Communication Review 44.4 (2015), pp. 455–466.

[Gu19] Gu, J. et al. “Tiresias: A {GPU} cluster manager for distributed deep learning”. 16th{USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19).2019, pp. 485–500.

[Hac19] Hacene, G. B. et al. “Training Modern Deep Neural Networks for Memory-Fault Robust-ness” (2019).

[Ham10] Hamerly, G. “Making k-means even faster”. Proceedings of the 2010 SIAM internationalconference on data mining. SIAM. 2010, pp. 130–140.

[Han15a] Han, S. et al. “Deep compression: Compressing deep neural networks with pruning,trained quantization and huffman coding”. arXiv preprint arXiv:1510.00149 (2015).

[Han15b] Han, S. et al. “Learning both weights and connections for efficient neural network”.Advances in neural information processing systems. 2015, pp. 1135–1143.

[Har18] Harlap, A. et al. “Pipedream: Fast and efficient pipeline parallel dnn training”. arXivpreprint arXiv:1806.03377 (2018).

[He15] He, K. et al. “Delving deep into rectifiers: Surpassing human-level performance onimagenet classification”. Proceedings of the IEEE international conference on computervision. 2015, pp. 1026–1034.

[He16] He, K. et al. “Deep residual learning for image recognition”. Proceedings of the IEEEconference on computer vision and pattern recognition. 2016, pp. 770–778.

[He18] He, Y. & Han, S. “ADC: Automated Deep Compression and Acceleration with Reinforce-ment Learning”. arXiv preprint arXiv:1802.03494 (2018).

[He17] He, Y. et al. “Channel pruning for accelerating very deep neural networks”. InternationalConference on Computer Vision (ICCV). Vol. 2. 2017, p. 6.

[Hin15] Hinton, G. et al. “Distilling the knowledge in a neural network”. preprint arXiv:1503.02531(2015).


[Hoc95] Hochbaum, D. S. Approximation Algorithms for NP-Hard Problems. PWS PublishingCompany, 1995.

[Hol12] Holger H. Hoos. Automated Algorithm Configuration and Parameter Tuning. Vol. 3.(Edited by Y. Hamadi, E. Monfroy, and F. Saubion) Springer Berlin Heidelberg, 2012.

[Hoo11] Hoos, H. H. “Automated algorithm configuration and parameter tuning”. Autonomoussearch. Springer, 2011, pp. 37–71.

[Hoo04] Hoos, H. H. & Stützle, T. Stochastic local search: Foundations and applications. Elsevier,2004.

[How17] Howard, A. G. et al. “Mobilenets: Efficient convolutional neural networks for mobilevision applications”. arXiv preprint arXiv:1704.04861 (2017).

[Hu16] Hu, H. et al. “Network trimming: A data-driven neuron pruning approach towardsefficient deep architectures”. arXiv preprint arXiv:1607.03250 (2016).

[Hua17] Huang, G. et al. “Densely connected convolutional networks”. Proceedings of the IEEEconference on computer vision and pattern recognition. 2017, pp. 4700–4708.

[Hua18] Huang, Y. et al. “Gpipe: Efficient training of giant neural networks using pipeline paral-lelism”. arXiv preprint arXiv:1811.06965 (2018).

[Hue16] Huerta, R. et al. “Online decorrelation of humidity and temperature in chemical sensorsfor continuous monitoring”. Chemometrics and Intelligent Laboratory Systems 157(2016), pp. 169–176.

[Hut07] Hutter, F. et al. “Automatic algorithm configuration based on local search”. AAAI. Vol. 7.2007, pp. 1152–1157.

[Hut09] Hutter, F. et al. “An experimental investigation of model-based parameter optimisa-tion: SPO and beyond”. Proceedings of the 11th Annual conference on Genetic andevolutionary computation. ACM. 2009, pp. 271–278.

[Hut11] Hutter, F. et al. “Sequential model-based optimization for general algorithm configu-ration”. International Conference on Learning and Intelligent Optimization. Springer.2011, pp. 507–523.

[Ian16a] Iandola, F. N. et al. “Firecaffe: near-linear acceleration of deep neural network trainingon compute clusters”. Proceedings of the IEEE Conference on Computer Vision andPattern Recognition. 2016, pp. 2592–2600.

[Ian16b] Iandola, F. N. et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and<0.5 MB model size”. arXiv preprint arXiv:1602.07360 (2016).

[Mkl] Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN). https://intel.github.io/mkl-dnn/. Accessed: 2019-08-16.


[Iof15] Ioffe, S. & Szegedy, C. “Batch normalization: Accelerating deep network training byreducing internal covariate shift”. International conference on machine learning. 2015,pp. 448–456.

[Jac17] Jacob, B. et al. gemmlowp: a small self-contained low-precision GEMM library. 2017.

[Jac18] Jacob, B. et al. “Quantization and training of neural networks for efficient integer-arithmetic-only inference”. Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition. 2018, pp. 2704–2713.

[Jia14] Jia, Y. et al. “Caffe: Convolutional Architecture for Fast Feature Embedding”. arXivpreprint arXiv:1408.5093 (2014).

[Kal12] Kalogeratos, A. & Likas, A. “Dip-means: an incremental clustering method for estimatingthe number of clusters”. Advances in neural information processing systems. 2012,pp. 2393–2401.

[Kan18] Kandasamy, K. et al. “Neural architecture search with bayesian optimisation and optimaltransport”. Advances in Neural Information Processing Systems. 2018, pp. 2016–2025.

[Kho11] Khosla, A. et al. “Novel dataset for fine-grained image categorization: Stanford dogs”.Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC). Vol. 2. 2011, p. 1.

[Kim18] Kim, S. et al. “Energy-efficient neural network acceleration in the presence of bit-levelmemory errors”. IEEE Transactions on Circuits and Systems I: Regular Papers 99 (2018),pp. 1–14.

[Kim14] Kim, Y. “Convolutional neural networks for sentence classification”. arXiv preprintarXiv:1408.5882 (2014).

[Kra13] Krause, J. et al. “3d object representations for fine-grained categorization”. ComputerVision Workshops (ICCVW), 2013 IEEE International Conference on. IEEE. 2013, pp. 554–561.

[Kra16] Krause, J. et al. “The unreasonable effectiveness of noisy data for fine-grained recogni-tion”. European Conference on Computer Vision. Springer. 2016, pp. 301–320.

[Kri12] Krizhevsky, A. et al. “Imagenet classification with deep convolutional neural networks”.Advances in neural information processing systems. 2012, pp. 1097–1105.

[Kum94] Kumar, V. et al. “Scalable load balancing techniques for parallel computers”. Journal ofParallel and Distributed computing 22.1 (1994), pp. 60–79.

[Lar99] Larus, J. R. “Whole program paths”. ACM SIGPLAN Notices. Vol. 34. 5. ACM. 1999, pp. 259–269.

[Lau05] Lau, J. et al. “Motivation for variable length intervals and hierarchical phase behavior”.IEEE International Symposium on Performance Analysis of Systems and Software, 2005.ISPASS 2005. IEEE. 2005, pp. 135–146.


[Law03] Law, J. & Rothermel, G. “Whole program path-based dynamic impact analysis”. SoftwareEngineering, 2003. Proceedings. 25th International Conference on. IEEE. 2003, pp. 308–318.

[Law97] Lawrence, S. et al. “Face recognition: A convolutional neural-network approach”. IEEEtransactions on neural networks 8.1 (1997), pp. 98–113.

[LeC95] LeCun, Y. & Bengio, Y. “Convolutional networks for images, speech, and time series”.The handbook of brain theory and neural networks 3361.10 (1995), p. 1995.

[LeC90] LeCun, Y. et al. “Optimal brain damage”. Advances in neural information processingsystems. 1990, pp. 598–605.

[Li17] Li, G. et al. “Understanding error propagation in deep learning neural network (DNN)accelerators and applications”. Proceedings of the International Conference for HighPerformance Computing, Networking, Storage and Analysis. ACM. 2017, p. 8.

[Li16] Li, H. et al. “Pruning filters for efficient convnets”. arXiv preprint arXiv:1608.08710 (2016).

[Li19] Li, L. & Talwalkar, A. “Random search and reproducibility for neural architecture search”.arXiv preprint arXiv:1902.07638 (2019).

[Lin17] Lin, J. et al. “Runtime Neural Pruning”. Advances in Neural Information ProcessingSystems. 2017, pp. 2178–2188.

[Liu17a] Liu, J. et al. “Sparse Deep Transfer Learning for Convolutional Neural Network.” AAAI.2017, pp. 2245–2251.

[Liu17b] Liu, Z. et al. “Learning efficient convolutional networks through network slimming”.2017 IEEE International Conference on Computer Vision (ICCV). IEEE. 2017, pp. 2755–2763.

[LI10] López-Ibánez, M. et al. “Exploratory analysis of stochastic local search algorithms inbiobjective optimization”. Experimental methods for the analysis of optimization algo-rithms. Springer, 2010, pp. 209–222.

[Los16] Loshchilov, I. & Hutter, F. “Sgdr: Stochastic gradient descent with warm restarts”. arXivpreprint arXiv:1608.03983 (2016).

[Lu12] Lu, W. et al. “Efficient processing of k nearest neighbor joins using mapreduce”. Pro-ceedings of the VLDB Endowment 5.10 (2012), pp. 1016–1027.

[Luo17a] Luo, J.-H. et al. “ThiNet: A Filter Level Pruning Method for Deep Neural Network Com-pression”. arXiv preprint arXiv:1707.06342 (2017).

[Luo18] Luo, L. et al. “Parameter Hub: High Performance Parameter Servers for Efficient Dis-tributed Deep Neural Network Training”. arXiv preprint arXiv:1801.09805 (2018).

[Luo17b] Luo, T. et al. “Dadiannao: A neural network supercomputer”. IEEE Transactions onComputers 66.1 (2017), pp. 73–88.


[Lyo62] Lyons, R. E. & Vanderkulk, W. “The use of triple-modular redundancy to improve com-puter reliability”. IBM journal of research and development 6.2 (1962), pp. 200–209.

[Mat18] Mathuriya, A. et al. “CosmoFlow: using deep learning to learn the universe at scale”.SC18: International Conference for High Performance Computing, Networking, Storageand Analysis. IEEE. 2018, pp. 819–829.

[McG17] McGill, M. & Perona, P. “Deciding how to decide: Dynamic routing in artificial neuralnetworks”. arXiv preprint arXiv:1703.06217 (2017).

[Mig17] Migacz, S. “8-bit inference with tensorrt”. GPU technology conference. Vol. 2. 2017, p. 7.

[Mio18] Miotto, R. et al. “Deep learning for healthcare: review, opportunities and challenges”.Briefings in bioinformatics 19.6 (2018), pp. 1236–1246.

[Mol16] Molchanov, P. et al. “Pruning convolutional neural networks for resource efficient transferlearning”. arXiv preprint arXiv:1611.06440 (2016).

[Mos18] Moshovos, A. et al. “Value-Based Deep-Learning Acceleration”. IEEE Micro 38.1 (2018),pp. 41–55.

[Nar18] Narayanan, D. et al. “Accelerating deep learning workloads through efficient multi-model execution”. NIPS Workshop on Systems for Machine Learning (December 2018).2018.

[NM97] Nevill-Manning, C. G. & Witten, I. H. “Identifying hierarchical structure in sequences: Alinear-time algorithm”. J. Artif. Intell. Res.(JAIR) 7 (1997), pp. 67–82.

[Nil08] Nilsback, M.-E. & Zisserman, A. “Automated flower classification over a large number ofclasses”. Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth IndianConference on. IEEE. 2008, pp. 722–729.

[O’K18] O’Keeffe, S. & Villing, R. “Evaluating pruned object detection networks for real-timerobot vision”. Autonomous Robot Systems and Competitions (ICARSC), 2018 IEEEInternational Conference on. IEEE. 2018, pp. 91–96.

[Ous13] Ousterhout, K. et al. “Sparrow: distributed, low latency scheduling”. Proceedings of theTwenty-Fourth ACM Symposium on Operating Systems Principles. ACM. 2013, pp. 69–84.

[Ovt15] Ovtcharov, K. et al. “Accelerating deep convolutional neural networks using specializedhardware”. Microsoft Research Whitepaper 2.11 (2015).

[Pan10] Pan, S. J. & Yang, Q. “A survey on transfer learning”. IEEE Transactions on knowledgeand data engineering 22.10 (2010), pp. 1345–1359.

[Par18] Park, J. et al. “Deep learning inference in facebook data centers: Characterization, per-formance optimizations and hardware implications”. arXiv preprint arXiv:1811.09886(2018).


[Pas19] Paszke, A. et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Li-brary”. Advances in Neural Information Processing Systems 32. Ed. by Wallach, H. et al.Curran Associates, Inc., 2019, pp. 8024–8035.

[Pat18] Patton, R. M. et al. “167-PFlops deep learning for electron microscopy: from learningphysics to atomic manipulation”. Proceedings of the International Conference for HighPerformance Computing, Networking, Storage, and Analysis. IEEE Press. 2018, p. 50.

[Pha95] Phatak, D. S. & Koren, I. “Complete and partial fault tolerance of feedforward neuralnets”. IEEE Transactions on Neural Networks 6.2 (1995), pp. 446–456.

[Pit18] Pittman, R. et al. “Exploring flexible communications for streamlining DNN ensembletraining pipelines”. SC18: International Conference for High Performance Computing,Networking, Storage and Analysis. IEEE. 2018, pp. 807–818.

[Pro93] Protzel, P. W. et al. “Performance and fault-tolerance of neural networks for optimization”.IEEE transactions on Neural Networks 4.4 (1993), pp. 600–614.

[Qin17] Qin, M. et al. “Robustness of neural networks against storage media errors”. arXivpreprint arXiv:1709.06173 (2017).

[Qnn] QNNPACK: Open source library for optimized mobile deep learning. https://code.fb.com/ml-applications/qnnpack/. Accessed: 2019-08-16.

[Rah17] Rahman, A. et al. “Programming challenges of chatbot: Current and future prospective”. 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC). IEEE. 2017, pp. 75–78.

[Rat12] Ratnaparkhi, A. A. & Pilli, E. “Networks”. 2016 International Conference on Emerging Trends in Communication Technologies (ETCT) (2012), pp. 1–6.

[Rea16] Reagen, B. et al. “Minerva: Enabling low-power, highly-accurate deep neural network accelerators”. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE. 2016, pp. 267–278.

[Rea18] Reagen, B. et al. “Ares: A framework for quantifying the resilience of deep neural networks”. 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE. 2018, pp. 1–6.

[Ren19] Ren, A. et al. “ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers”. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM. 2019, pp. 925–938.

[Ren15] Ren, S. et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”. Advances in Neural Information Processing Systems. 2015, pp. 91–99.

[Rou87] Rousseeuw, P. J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65.

[Rus15] Russakovsky, O. et al. “ImageNet large scale visual recognition challenge”. International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252.

[SG16] Guadarrama, S. & Silberman, N. TensorFlow-Slim: a lightweight library for defining, training and evaluating complex models in TensorFlow. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim. 2016.

[Sal18] Salami, B. et al. “On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation”. 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE. 2018, pp. 322–329.

[Sal17] Salimans, T. et al. “Evolution strategies as a scalable alternative to reinforcement learning”. arXiv preprint arXiv:1703.03864 (2017).

[Sat11] Satopaa, V. et al. “Finding a ‘kneedle’ in a haystack: Detecting knee points in system behavior”. Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on. IEEE. 2011, pp. 166–171.

[Sei16] Seide, F. & Agarwal, A. “CNTK: Microsoft’s open-source deep-learning toolkit”. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 2135–2135.

[Ser18] Sergeev, A. & Del Balso, M. “Horovod: fast and easy distributed deep learning in TensorFlow”. arXiv preprint arXiv:1802.05799 (2018).

[Sha18] Sharma, H. et al. “Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network”. 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE. 2018.

[Shm95] Shmoys, D. B. et al. “Scheduling Parallel Machines On-line”. SIAM J. Comput. 24.6 (1995), pp. 1313–1331.

[Sil16] Silberman, N. & Guadarrama, S. TensorFlow-Slim image classification model library. https://github.com/tensorflow/models/tree/master/research/slim. 2016.

[Sil17] Silver, D. et al. “Mastering the game of Go without human knowledge”. Nature 550.7676 (2017), pp. 354–359.

[Sim14] Simonyan, K. & Zisserman, A. “Very deep convolutional networks for large-scale image recognition”. arXiv preprint arXiv:1409.1556 (2014).

[Sno12] Snoek, J. et al. “Practical Bayesian optimization of machine learning algorithms”. Advances in Neural Information Processing Systems. 2012, pp. 2951–2959.

[Sri15] Sridharan, V. et al. “Memory errors in modern systems: The good, the bad, and the ugly”. ACM SIGPLAN Notices 50.4 (2015), pp. 297–310.

[Sum] Summit User Guide. Oak Ridge Leadership Computing Facility. https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/. Accessed 3/3/2019. 2019.

[Sze17] Sze, V. et al. “Efficient processing of deep neural networks: A tutorial and survey”. Proceedings of the IEEE 105.12 (2017), pp. 2295–2329.

[Sze15] Szegedy, C. et al. “Going deeper with convolutions”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.

[Sze16] Szegedy, C. et al. “Rethinking the inception architecture for computer vision”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2818–2826.

[Tem12] Temam, O. “A defect-tolerant accelerator for emerging high-performance applications”. ACM SIGARCH Computer Architecture News. Vol. 40. 3. IEEE Computer Society. 2012, pp. 356–367.

[Tia17] Tian, Q. et al. “Deep LDA-pruned nets for efficient facial gender classification”. Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE. 2017, pp. 512–521.

[Tom14] Tompson, J. J. et al. “Joint training of a convolutional network and a graphical model for human pose estimation”. Advances in Neural Information Processing Systems. 2014, pp. 1799–1807.

[TH17] Torres-Huitzil, C. & Girau, B. “Fault and error tolerance in neural networks: A review”. IEEE Access 5 (2017), pp. 17322–17341.

[Tur92] Turek, J. et al. “Scheduling parallelizable tasks: Putting it all on the shelf”. ACM SIGMETRICS Performance Evaluation Review. Vol. 20. 1. ACM. 1992, pp. 225–236.

[Urg02] Urgaonkar, B. et al. “Resource overbooking and application profiling in shared hosting platforms”. ACM SIGOPS Operating Systems Review 36.SI (2002), pp. 239–254.

[Wal10] Walkinshaw, N. et al. “Using compression algorithms to support the comprehension of program traces”. Proceedings of the Eighth International Workshop on Dynamic Analysis. ACM. 2010, pp. 8–13.

[Wan11] Wang, X. “A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality”. Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE. 2011, pp. 1293–1299.

[Wel10] Welinder, P. et al. “Caltech-UCSD birds 200” (2010).

[Wen16] Wen, W. et al. “Learning structured sparsity in deep neural networks”. Advances in Neural Information Processing Systems. 2016, pp. 2074–2082.

[Wha17] Whatmough, P. N. et al. “14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications”. 2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE. 2017, pp. 242–243.

[Wu08] Wu, X. et al. “Top 10 algorithms in data mining”. Knowledge and Information Systems 14.1 (2008), pp. 1–37.

[Xia18] Xiao, W. et al. “Gandiva: Introspective cluster scheduling for deep learning”. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018, pp. 595–610.

[Xu18] Xu, L. et al. “A heterogeneity-aware task scheduler for Spark”. 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. 2018, pp. 245–256.

[Yan17] Yang, T.-J. et al. “Designing energy-efficient convolutional neural networks using energy-aware pruning”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[Ye18] Ye, J. et al. “Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers”. arXiv preprint arXiv:1802.00124 (2018).

[Yeh09] Yeh, I.-C. & Lien, C.-h. “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients”. Expert Systems with Applications 36.2 (2009), pp. 2473–2480.

[Yu10] Yu, F.-X. et al. “Overview of Radiation Hardening Techniques for IC Design”. Information Technology Journal 9.6 (2010), pp. 1068–1080.

[Yu17] Yu, J. et al. “Scalpel: Customizing DNN pruning to the underlying hardware parallelism”. ACM SIGARCH Computer Architecture News. Vol. 45. 2. ACM. 2017, pp. 548–560.

[Zag16] Zagoruyko, S. & Komodakis, N. “Wide residual networks”. arXiv preprint arXiv:1605.07146 (2016).

[Zah08] Zaharia, M. et al. “Improving MapReduce performance in heterogeneous environments”. OSDI. Vol. 8. 4. 2008, p. 7.

[Zha18a] Zhang, J. et al. “Thundervolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators”. Proceedings of the 55th Annual Design Automation Conference. ACM. 2018, p. 19.

[Zha18b] Zhang, J. J. et al. “Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator”. 2018 IEEE 36th VLSI Test Symposium (VTS). IEEE. 2018, pp. 1–6.

[Zha18c] Zhang, T. et al. “A systematic DNN weight pruning framework using alternating direction method of multipliers”. Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 184–199.

[Zha18d] Zhang, T. et al. “ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs”. arXiv preprint arXiv:1807.11091 (2018).

[Zha17] Zhao, B. et al. “Diversified visual attention networks for fine-grained object classification”. IEEE Transactions on Multimedia 19.6 (2017), pp. 1245–1256.

[Zha08] Zhao, Q. et al. “Knee point detection in BIC for detecting the number of clusters”. International Conference on Advanced Concepts for Intelligent Vision Systems. Springer. 2008, pp. 664–673.

[Zho17] Zhong, R. Y. et al. “Intelligent manufacturing in the context of Industry 4.0: a review”. Engineering 3.5 (2017), pp. 616–630.

[Zhu18] Zhu, H. et al. “TBD: Benchmarking and Analyzing Deep Neural Network Training”. arXiv preprint arXiv:1803.06905 (2018).

[Zmo18] Zmora, N. et al. Neural Network Distiller. https://doi.org/10.5281/zenodo.1297430. 2018.

[Zop16] Zoph, B. & Le, Q. V. “Neural architecture search with reinforcement learning”. arXiv preprint arXiv:1611.01578 (2016).

[Zop18] Zoph, B. et al. “Learning transferable architectures for scalable image recognition”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8697–8710.
