


IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019 1349

Toward Knowledge as a Service Over Networks: A Deep Learning Model Communication Paradigm

Ziqian Chen, Ling-Yu Duan, Member, IEEE, Shiqi Wang, Member, IEEE, Yihang Lou, Tiejun Huang, Senior Member, IEEE, Dapeng Oliver Wu, Fellow, IEEE, and Wen Gao, Fellow, IEEE

Abstract— The advent of artificial intelligence and the Internet of Things has led to a seamless transition turning big data into big knowledge. Deep learning models, which assimilate knowledge from large-scale data, can be regarded as an alternative but promising modality of knowledge for artificial intelligence services. Yet the compression, storage, and communication of deep learning models toward better knowledge services, especially over networks, pose a set of challenging problems in both the industrial and academic realms. This paper presents a deep learning model communication paradigm based on multiple model compression, which greatly exploits the redundancy among multiple deep learning models in different application scenarios. We analyze the potential and demonstrate the promise of the compression strategy for deep learning model communication through a set of experiments. Moreover, the interoperability in deep learning model communication, which is enabled by the standardization of compact deep learning model representation, is also discussed and envisioned.

Index Terms— Deep learning, deep learning model communication, neural network compression, knowledge centric network.

I. INTRODUCTION

THE evolution of ubiquitous computing and the widely deployed Internet of Things (IoT) [1] is sensing and processing a tremendous growth of data, which is unstructured, noisy, and highly redundant.

Manuscript received July 19, 2018; revised December 11, 2018; accepted February 21, 2019. Date of publication March 14, 2019; date of current version May 15, 2019. This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFB1001501, in part by the National Natural Science Foundation of China under Grant 61661146005 and Grant U1611461, in part by the Shenzhen Municipal Science and Technology Program under Grant JCYJ20170818141146428, and in part by the National Research Foundation, Prime Minister’s Office, Singapore, through the NRF-NSFC Grant, under Grant NRF2016NRF-NSFC001-098. (Corresponding author: Ling-Yu Duan.)

Z. Chen and Y. Lou are with the National Engineering Laboratory of Video Technology, the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]).

L.-Y. Duan, T. Huang, and W. Gao are with the National Engineering Laboratory of Video Technology, the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]; [email protected]; [email protected]).

S. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: [email protected]).

D. O. Wu is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611-6130 USA (e-mail: dpwu@ufl.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSAC.2019.2904360

Fig. 1. Illustration of knowledge communication with deep learning models, where the models are required to be compressed and transmitted sequentially.

With the advance of artificial intelligence, there has been an increasing demand for turning the raw data into knowledge, which is of paramount significance for decision, optimization, and analytics [2]. Deep learning models such as AlexNet, VGG, and ResNet [3]–[5], which are mostly learned from large-scale data, can be regarded as a specific modality of knowledge implying the unique data statistics and the mechanisms of domain applications. To facilitate knowledge assimilation and utilization with deep learning techniques, knowledge distribution is essential to bridging the gap between the computationally expensive learning process and the limited capabilities afforded by end devices. This raises fundamental challenges in deep learning model communication over networks, which are critical for a wide variety of applications.

The temporally dynamic data sensed from the ambient environments are constantly creating new knowledge that can play a complementary role to the existing model. In order to understand the current state and predict the future, knowledge updating at a certain level is highly desired. Generally speaking, the new model can be generated at the cloud side and then distributed to the IoT, such that knowledge is propagated to facilitate better intelligent applications. As illustrated in Fig. 1, deep learning models are delivered frequently for better utilization with the services of the IoT. Alternatively, the distributed generation of deep learning models at the edge also requires the models to be further transmitted and fused. For the same task or intrinsically similar tasks, there exists high correlation among different models, even with different domains of training data.



In particular, they may share some common layers, or the learned weights may have high correlations. This motivates a new paradigm that exploits the inter-model redundancy in model communication, aiming to compactly and efficiently deliver deep learning models with the help of external information. This differs in nature from the state-of-the-art deep neural network compression methodologies [6]–[9], which target the compact representation of an isolated model.

The concept of Knowledge Centric Networking (KCN), which was proposed in [2], reinforces the importance of knowledge penetrating both sensing and networking. In KCN, knowledge creation, composition, and distribution are the three vital components that drive the networking to be smarter and better managed. In line with the concept of KCN, one key ingredient in providing better services with knowledge is exchanging knowledge in terms of deep learning models. This requires extending traditional neural network compression in terms of motivation, application scope, and methodology. In this context, important and challenging issues regarding deep learning model communication arise accordingly. Furthermore, regarding the transmission and communication of deep learning models, we have also witnessed a strong demand for the standardization of deep learning model representation to enable interoperability.

Motivated by these substantially different principles, in this paper we investigate a new deep learning model communication paradigm based on inter-model redundancy removal, aiming to provide better knowledge services over networks. In other words, we investigate deep learning model communication through model compression based on the redundancy among different models. The main contributions of this paper are summarized as follows:

• The neural network model communication paradigm, which reformulates the exchange of knowledge in the context of KCN and artificial intelligence, is presented and analyzed.

• The model prediction and compression based on the removal of inter-model redundancy, from the viewpoint of communication and transmission, are investigated to pursue better coding efficiency.

• The performance of deep learning model communication for knowledge utilization is validated from various perspectives, and the future standardization of deep model compression is discussed and envisioned.

II. RELATED WORK

In this section, we first review recent deep learning model compression techniques, indicate the limitations of these methods, and detail their relationship with the proposed scheme. Subsequently, KCN is introduced based on a few examples regarding visual data, to better illustrate the idea of knowledge communication.

A. Deep Learning and Network Compression

Deep learning based methodologies have achieved great success in many fields including computer vision, speech recognition, natural language processing, etc. Recently, a vast number of deep learning models have been designed, trained, and widely distributed for practical applications. For example, for the Convolutional Neural Network (CNN), various CNN-based architectures, including VGG [3], ResNet [5], GoogleNet [10], and DenseNet [11], have been widely adopted in computer vision tasks such as image classification [3], [5], facial recognition [12], and object detection [13]. Moreover, the Recurrent Neural Network (RNN) [14] has also been widely employed due to its capability in characterizing dynamic temporal behaviors. The Generative Adversarial Network (GAN) [15], [16] and its variations, which can also be built upon CNNs, have also attracted a lot of interest in generative tasks.

Despite the great promise of deep learning in these tasks, most deep neural networks suffer from enormous storage and computation costs due to their huge number of parameters. For example, VGG-16 [3] has more than 60M parameters, and it is widely acknowledged that there exists abundant parameter redundancy in deep neural networks [17], such that deep learning models can potentially be compressed without sacrificing performance. Therefore, numerous deep neural network compression methods are arising, and the present approaches can be divided into six categories: 1) parameter pruning; 2) matrix factorization; 3) filter selection; 4) quantization; 5) knowledge transfer; and 6) network redesign.

1) Parameter Pruning: The premier design philosophy of parameter pruning is to prune the neurons with weak responses or unnecessary neurons within the network. Han et al. [6] proposed to prune the neurons below a given threshold. Guo et al. [18] proposed Dynamic Network Surgery (DNS) to prune parameters during gradient descent steps with a mask matrix. Dong et al. [19] proposed a layer-wise Optimal Brain Surgeon (OBS) that prunes a considerable number of parameters with marginal degradation in performance. Regarding the fully-connected layers, circulant structures based on the Fast Fourier Transform have also been proposed [20], [21].

2) Matrix Factorization: The whole set of neuron weights can be regarded as a matrix, which can be further reconstructed based on low-rank methods. In [22], low-rank and sparse decomposition are applied for matrix compression. In [23], fully-connected layer compression based on domain-adaptive low-rank decomposition was proposed. In [24], the global reconstruction error was considered to further optimize the compression process.

3) Filter Selection: The strategy of filter selection treats the whole weight matrix as the combination of multiple filters, such that unnecessary filters can be removed to achieve the same goal as pruning. In [7], LASSO regression based channel selection and a least squares reconstruction strategy were proposed. Moreover, the method in [25] makes use of frequency-domain information to prune the channels of models, and the work in [26] accomplishes channel-level decomposition to prune filters. In [27], the batch normalization layer was utilized to select the unimportant channels.

4) Quantization: Quantization is a classical method for compactly representing weights. The work in [28] first proposed to quantize the deep model using vector quantization. Incremental network quantization (INQ) was proposed in [29], which combined network pruning and INQ in an integrated compression framework. In [30], pruning and quantization were combined based on the principles of Bayesian inference and minimum description length. Dong et al. [19] proposed an extreme quantization approach to reach an extremely high compression ratio with satisfactory performance, and the Bloomier filter was further used in [31] to obtain a better compression ratio.

5) Knowledge Transfer: Knowledge distillation aims to transfer knowledge from a large-scale teacher network into a small-scale student model. The work in [32] proposed to learn the classification distributions with knowledge distillation. In [9], the knowledge from the teacher network was learned with the neurons in the higher hidden layers.

6) Network Redesign: The neural network can be redesigned based on light-weight network structures or well-designed modules. Binary networks, such as BinaryNet [33] and XNORNet [34], are typical examples of network redesign. SqueezeNet [35], which achieves the same accuracy as AlexNet, was designed with a fire module that uses 1×1 convolutional filters. Analogously, MobileNet [36] also utilizes 1×1 convolutions as well as the Depthwise Separable Convolution (DSC). Xie et al. [37] adjusted the original ResNet structure by applying an aggregated transformation to the filters, which is named ResNeXt. In [38], ResNet was redesigned with more weight-sharing blocks.

In this work, we are not simply aiming to propose a new deep model compression method that reduces the number of parameters and economizes weight storage space. Instead, we attempt to view and resolve the problem of deep model communication from a completely different perspective, and propose to compress models based on the redundancy among different models. This scheme can also be feasibly combined with the existing compression methods mentioned above to form an integrated compression and communication framework. Moreover, it is also worth noting that distributing the training task also raises concerns about communication cost, which motivates the deep gradient compression approach in [39]. In particular, in [39] techniques such as momentum correction and local gradient clipping were used to achieve deep gradient compression. Our approach is different from this work in principle: we aim to decrease the communication cost of deep model transmission, while distributed training targets reducing the transmission cost of the continuously exchanged gradients. As such, in our work the inter-model redundancy among different models is exploited to reduce the transmission cost, in contrast to the significant gradient changes.

B. Knowledge Centric Networking

The concept of KCN [2], which is inspired by the rapid development of deep learning and refers to the knowledge-oriented networking paradigm, has largely reformed traditional Content Centric Networking (CCN) with machine learning based knowledge creation, composition, and distribution. In KCN, the created knowledge, which serves as the modality for transmission and utilization instead of the raw data, can greatly facilitate the removal of redundancy and enhance the intelligence level of the networking system.

Herein, surveillance video data, which are regarded as the biggest big data, are used to illustrate knowledge creation and distribution. A traditional CCN based strategy deals with surveillance video streams, which are generated by typical video coding standards such as H.265/HEVC [40] and AVS2/IEEE 1857.4 [41]. Though numerous efforts have been devoted to improving the coding performance, the state-of-the-art compression efficiency still cannot meet the exponential increase in the demand for high-efficiency coding algorithms due to the video data explosion. For surveillance video, instead of raw pixels, image/video analyses are performed relying on the extracted features, which can also be regarded as one kind of knowledge modality. As such, conveying powerful and discriminative features becomes an alternative and promising way towards future knowledge communication. This paradigm is also in line with the analyze-then-compress framework, which has been detailed in [42] and [43]. For feature compression, both handcrafted and CNN features have been widely investigated [44]–[46]. In particular, MPEG has also accomplished the Compact Descriptors for Visual Search (CDVS) standard [47], which was developed based on the compact representation of handcrafted features. The subsequent Compact Descriptors for Video Analysis (CDVA) standard [48] combines both handcrafted and deep learning features, such that much better performance can be achieved for video analysis. Though numerous approaches have been proposed for feature compression, in the context of knowledge creation and communication the scope of knowledge can be further extended in several ways. In particular, the deep learning model can be regarded as a special modality of knowledge, and the transmission as well as the standardization of compact deep learning models are of special interest to facilitate knowledge communication.

III. DEEP LEARNING MODEL COMPRESSION IN A KNOWLEDGE COMMUNICATION FRAMEWORK

In this section, the problem of deep learning model compression is considered within a knowledge communication framework, and the application scenarios are demonstrated through both cloud computing and edge computing. Subsequently, knowledge communications with deep learning models are illustrated in the context of KCN, and the concepts regarding knowledge creation, composition, and distribution based on deep learning models are further extended.

A. Application Scenarios

1) Knowledge Communication Based on Cloud Computing: The IoT devices are constantly generating data, which are subject to further utilization to sense the current dynamics, make decisions, and forecast the future. In cloud computing, the outsourcing of storage and computational capabilities to a centralized party fills the large gap between the low capabilities of the IoT and the high demands of computationally expensive applications. As such, the data are usually converged to the cloud server side, where they can be progressively used for model training. Moreover, with the gradually enabled capabilities of intelligent tasks performed at the IoT, the trained models can subsequently be distributed to the IoT to improve the availability of the data with compact representation, feature extraction, and analytics.

Fig. 2. Illustration of the knowledge communication framework with edge computing. Deep models can be transmitted from edge to edge, edge to cloud, as well as edge to the IoT devices.

Again, an excellent example is visual sensing in the smart city. In particular, the front-end cameras in a smart city constantly capture visual data, which are further transmitted to the cloud side. Given the constantly generated data, the deep learning models can be gradually learned or fine-tuned. As such, the temporally dynamic visual data from different locations are consistently providing new knowledge and enhancing the capability of the current model. Under such circumstances, the deployed deep learning models at the front end need to be updated frequently, such that there is a great demand to economize the bandwidth and cost of delivering and distributing the deep models. Due to the high redundancy between sequentially generated deep learning models, deep learning model compression based on inter-model prediction is strongly desired.

2) Knowledge Communication Based on Edge Computing: One of the most challenging aspects of cloud computing lies in the growing quantity of data pulled from the IoT and pushed by the cloud, such that the bandwidth for data transportation limits the performance of cloud services. Edge computing [49], which leverages the computation resources between the central cloud server and the data sources, is emerging as an alternative paradigm that enables computation to be performed close to the data sources. By offloading computations from the cloud and caching data at the edge side, the user experience can be significantly improved. An illustration of such a communication framework is shown in Fig. 2, where deep learning models are transmitted from edge to edge, edge to cloud, as well as edge to the IoT devices.

In the context of edge computing, knowledge communications can also find their own applications between edge and edge, as well as between cloud and edge. Regarding visual sensing for smart city applications, the front-end cameras generate new data from different areas, which are subsequently transmitted to the corresponding local edge for further processing. As such, the communications of knowledge in terms of deep learning models at different granularity levels can be summarized as follows:

(a) Knowledge communication between edge and cloud: Each edge node can generate a corresponding deep learning model based on the local data. The deep learning models can then be uploaded to the cloud server for convergence. In particular, the cloud servers can gather the deep learning models instead of the data from different edges, and then distribute multiple deep learning models to the front end for utilization and decision-level fusion. In this process, the exploitation of inter-model redundancy based on prediction and compensation can greatly improve the compression efficiency of a set of deep learning models.

(b) Knowledge communication between edge and edge: The model generated at one edge node can be specifically requested by another edge node for performing particular tasks. For example, the model trained at one local area can be transmitted to another edge node for further fine-tuning based on its own data. This also requires frequent knowledge communication between edge nodes in terms of deep learning models. Again, the removal of the redundancy between different models generated in the temporal domain can greatly facilitate frequent deep learning model updates over networks.

B. Towards KCN

The deep learning model communication can be fully explained within the KCN paradigm. In particular, KCN aims to achieve knowledge-oriented networking based on knowledge creation, composition, and distribution. For knowledge creation, the scope of knowledge has been extended to the deep learning model, as the parameters of the model are highly representative of the data as well as the specific task. Moreover, the deep learning models are also capable of conveying meaningful information [50], [51] according to visualization. The knowledge from multiple domains or learned with multiple modalities can be further composited for decision and prediction. The knowledge compositions at different levels aim to achieve different specific objectives, and pre-trained and half-trained models are also involved, as in this case abundant knowledge has already been absorbed from external training samples such as ImageNet [52]. For knowledge distribution, the transmission of the deep learning model, especially multiple deep learning models, is essentially important. As such, the compression of deep learning models by fully exploiting and removing the redundancy serves as the foundation for efficient knowledge distribution. Moreover, to provide a standardized bitstream syntax that enables interoperability in the context of knowledge distribution and communication, the standardization of compact model representation for reducing the computation and memory footprint becomes an emerging and important topic for investigation.

IV. DEEP LEARNING MODEL COMPRESSION WITH INTER MODEL REDUNDANCY REMOVAL

The application scope of deep learning model compression extends far beyond deployment on embedded systems with constrained hardware resources. In essence, any deep learning model communication scenario involves certain model compression, either explicitly or implicitly. In this section, we show how to exploit the inter-model redundancy to facilitate the communication of deep learning models, such that efficient knowledge distribution can be further supported as a service over networks.

A. Single Model Compression

In this subsection, we first investigate single model compression to compactly represent the deep neural network. In particular, the weights are quantized and entropy coded to remove the precision and statistical redundancy. As such, we are not aiming to squeeze the single model to be as small as possible, due to the fact that it will be used as the reference for future prediction. Such a single model compression method also lays the foundation for assessing the performance of the multiple model compression discussed in Section IV-B.

The weights in the deep neural network are quantized based on a scalar quantization approach. In analogy to video coding, given the weight w, it can be quantized to q_w with the quantization step as follows,

q_w = \mathrm{round}\!\left( \frac{w \times 10^{s\_bits} + f \times 10^{q\_bits}}{10^{q\_bits}} \right). \quad (1)

Here, s_bits is a fixed parameter for the amplification of the weights, and q_bits is the parameter that determines the quantization step. More specifically, a higher q_bits corresponds to coarser quantization levels, leading to worse prediction accuracy but fewer coding bits.

Fig. 3. Illustration of the weight distributions. (a) The weight distributions of different deep learning models for image classification; (b) the weight distributions of deep learning models with similar structures for different tasks. The x axis is divided with equal spacing 0.0002, and only weights within the range of [−0.1, 0.1] in (a) and [−0.04, 0.04] in (b) are displayed.

The parameter f is introduced to maintain an equal expected value for the input weight and the de-quantized weight [8]. In particular, we provide some typical weight distributions, as illustrated in Fig. 3. It is observed that the weights follow a typical Generalized Gaussian Distribution [53], such that f should be defined in the range [0, 0.5]. Here, we choose f = 0.4. To obtain the final bitstream, the quantized weights are further subjected to binarization and lzma coding [54].

At the receiver side, the quantized weight q_w is further de-quantized to recover the deep neural network, and the de-quantized weight w' is calculated as follows,

w' = q_w \times 10^{q\_bits - s\_bits}. \quad (2)
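For illustration, the quantization pipeline in (1) and (2), together with the lzma entropy coding stage, can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation; the helper names, the int32 serialization, and the default parameter values are our own assumptions:

import lzma
import numpy as np

def quantize(w, s_bits=12, q_bits=8, f=0.4):
    # Eq. (1): amplify the weights by 10^s_bits and quantize them
    # with step 10^q_bits and rounding offset f.
    return np.round((w * 10.0**s_bits + f * 10.0**q_bits) / 10.0**q_bits)

def dequantize(qw, s_bits=12, q_bits=8):
    # Eq. (2): recover the de-quantized weights at the receiver side.
    return qw * 10.0 ** (q_bits - s_bits)

# Sender: quantize, then entropy code the indices with lzma.
w = np.random.laplace(scale=0.01, size=100000).astype(np.float32)
bitstream = lzma.compress(quantize(w).astype(np.int32).tobytes())

# Receiver: entropy decode and de-quantize.
qw = np.frombuffer(lzma.decompress(bitstream), dtype=np.int32)
w_rec = dequantize(qw)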

Basically, we aim to provide a conceptually meaningful way for deep learning model communication by removing inter-model redundancy, and this fundamental single model compression approach is adopted here; it is also feasible to extend it to more sophisticated methods such as network pruning [6]. In Section IV-C, we demonstrate that when the prediction model is pruned, it is still feasible to apply the proposed inter-model prediction approach to achieve redundancy removal and better compression performance.

B. Multiple Model Redundancy Analysis

In this subsection, we statistically investigate the redundancy existing in multiple deep learning models based on the application scenarios described above. More specifically, in the context of sequential deep model transmission, it is necessary to study the difference of models (DoM) between the previously transmitted model and the current to-be-encoded one. Moreover, when a set of deep learning models is required to be transmitted simultaneously, the redundancy among the models should also be fully exploited based on a well-designed model prediction method. As such, we analyze the rationality of multiple model prediction from two perspectives. First, we compare the weight distributions of the original models and the DoMs, to illustrate the variation of weight energy and the influence of deep model prediction on the theoretically derived rate-distortion relationship. Subsequently, the Kullback–Leibler (K-L) divergence is calculated to measure how the probability distribution of one model diverges from another, providing a statistical analysis of deep learning model prediction.


Fig. 4. Comparisons of the weight distributions between the original deep learning models and the DoMs. The x axis is divided with equal spacing 0.0002, and only weights within the range of [−0.04, 0.04] are displayed. (a) Comparisons based on the fine-tuned models (V1-V3). (b) Comparisons based on deep learning models with the same structure aiming at different tasks. Here, "Det" represents detection, "Rrv" represents image retrieval, and "Cls" represents classification.

1) Weight Distribution Comparisons: First, we focus on the scenario where temporally dynamic data are used for training, leading to different deep learning models at different points in time. Here, we adopt the VGG model for investigation, and the 3D landmark dataset [55] is used to train the deep learning models (V1-V3). The best model within every 10 epochs is extracted. In other words, VGG-V1 represents the initial model, VGG-V2 represents the best model within 10 epochs, and VGG-V3 represents the updated model after 20 epochs. Subsequently, the DoMs between VGG-V1 and VGG-V3, as well as between VGG-V2 and VGG-V3, are computed. The comparisons of the weight distributions are illustrated in Fig. 4, and it is obvious that the energy of the DoM has been significantly shrunk due to the inter-model prediction, leading to a higher peak at zero and a narrower distribution.

Subsequently, the weight distributions of different deep models trained for different tasks1 with similar structures are also investigated, including object detection, image instance retrieval, and classification. Again, the DoM between them is computed, and the comparisons of the distributions are illustrated in Fig. 4(b). In this scenario, similar observations can be made, due to the high redundancy among multiple models corresponding to different tasks. These comparisons demonstrate the value of multiple model prediction, which significantly reduces the weight energy to achieve a more compact representation. Moreover, the compact distributions of the residuals are able to decrease the influence of random bit errors occurring during transmission. In particular, the DoM strategy also enhances robustness to random bit errors, as part of the knowledge is preserved and inherited from the prediction model.

Here, we apply the widely adopted zero-mean Laplace distribution to model the residuals, which is a special case of the Generalized Gaussian Distribution [53] and shows a good trade-off between fidelity and complexity,

f_{Lap}(x) = \frac{\Lambda}{2} \cdot e^{-\Lambda \cdot |x|}, \quad (3)

1 http://www.vlfeat.org/matconvnet/pretrained/

where x represents the coefficient to be modeled. It indicates the weight in the neural network for single model compression, and the DoM residue for inter-model prediction. The parameter Λ is called the Laplacian distribution parameter. Here, Λ is obtained with the maximum likelihood estimation,

\Lambda = 1 \Big/ \left( \frac{1}{N} \sum_{i=1}^{N} |x_i| \right), \quad (4)

where N is the number of samples.

Moreover, considering the quantization process with quantization step Q = 10^{q\_bits} and rounding offset f = 0.4, the distortion and rate can be modeled as:

D = \int_{-(Q - fQ)}^{Q - fQ} x_i^2 f_{Lap}(x_i)\,dx_i + 2 \sum_{n=1}^{\infty} \int_{nQ - fQ}^{(n+1)Q - fQ} (x_i - nQ)^2 f_{Lap}(x_i)\,dx_i,

R = -P_0 \log_2 P_0 - 2 \sum_{n=1}^{\infty} P_n \log_2 P_n. \quad (5)

Here, it is worth mentioning that D indicates the signal-level weight distortion instead of the final utility of the deep learning model, such as the classification accuracy. The probabilities P_0 and P_n of the residuals being quantized to the zero-th and n-th quantization levels are computed based on the Laplace distribution,

P_0 = \int_{-(Q - fQ)}^{Q - fQ} f_{Lap}(x)\,dx, \qquad P_n = \int_{nQ - fQ}^{(n+1)Q - fQ} f_{Lap}(x)\,dx. \quad (6)

As such, the distortion and rate models based on the Laplacian distribution are given by

D = \frac{\Lambda Q \cdot e^{f\Lambda Q}(2 + \Lambda Q - 2f\Lambda Q) + 2 - 2e^{\Lambda Q}}{\Lambda^2 (1 - e^{\Lambda Q})}, \quad (7)

R = \frac{1}{\ln 2} \Big( -\big(1 - e^{-(1-f)\Lambda Q}\big) \ln\big(1 - e^{-(1-f)\Lambda Q}\big) + e^{-(1-f)\Lambda Q} \Big( \ln 2 - \ln\big(1 - e^{-\Lambda Q}\big) - f\Lambda Q + \frac{\Lambda Q}{1 - e^{-\Lambda Q}} \Big) \Big), \quad (8)

where D indicates the signal-level weight distortion and R represents the theoretically derived rate.

Based on the theoretical derivations of rate and distortion, we show the Laplacian distribution parameter Λ and the corresponding rate-distortion values in Table I. It is obvious that the DoM based representation strategies can achieve significant rate savings at similar weight distortion, due to the fact that the energy of the DoM is significantly shrunk compared to the original model.
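As a numerical companion to Table I, the Laplacian parameter in (4) and the closed-form distortion and rate in (7) and (8) can be evaluated directly. The following Python sketch is our own illustration under the same modeling assumptions; Q denotes the quantization step in the same (amplified) domain as the modeled coefficients x:

import numpy as np

def laplace_lambda(x):
    # Eq. (4): maximum likelihood estimate of the Laplacian parameter.
    return 1.0 / np.mean(np.abs(x))

def distortion(lam, Q, f=0.4):
    # Eq. (7): expected signal-level weight distortion.
    num = lam * Q * np.exp(f * lam * Q) * (2 + lam * Q - 2 * f * lam * Q) \
          + 2 - 2 * np.exp(lam * Q)
    return num / (lam**2 * (1 - np.exp(lam * Q)))

def rate(lam, Q, f=0.4):
    # Eq. (8): theoretically derived rate in bits per coefficient.
    a = np.exp(-(1 - f) * lam * Q)
    return (1.0 / np.log(2)) * (
        -(1 - a) * np.log(1 - a)
        + a * (np.log(2) - np.log(1 - np.exp(-lam * Q))
               - f * lam * Q + lam * Q / (1 - np.exp(-lam * Q))))

Feeding the weights of a single model and the corresponding DoM residues into laplace_lambda and evaluating distortion and rate at the same Q reproduces the qualitative trend of Table I: the DoM yields a larger Λ and hence a much lower rate at comparable distortion.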

TABLE I
COMPARISONS OF THE LAPLACIAN DISTRIBUTION PARAMETER AND THE CORRESPONDING RATE AND DISTORTION BETWEEN THE SINGLE MODEL AND THE DOM. HERE, D1 AND R1 ARE OBTAINED WITH q_bits = 8, AND D2 AND R2 ARE OBTAINED WITH q_bits = 9

TABLE II
K-L DIVERGENCE OF DEEP MODELS WITH THE SAME STRUCTURE BUT FINE-TUNED AT DIFFERENT EPOCHS

2) Relative Entropy Analyses: In order to further investigate the model prediction from the perspective of information-theoretic analysis, we calculate the K-L divergence between different models to study the relative entropy of one model with respect to another. The K-L divergence [56] measures the distance between two distributions and is calculated as follows,

D_{KL}(\Phi \| \Psi) = \sum_{i} \Phi(i) \log\left( \frac{\Phi(i)}{\Psi(i)} \right), \quad (9)

where Ψ represents the weight distribution of the prediction model and Φ represents the weight distribution of the to-be-compressed model.

In particular, the VGG models trained on the 3D landmark dataset as described above are used, and a model whose weights are generated randomly within the range of [−0.1, 0.1] is additionally used for comparison. It is worth mentioning that the randomly generated model has the same number of parameters as the others. The results are shown in Table II, from which we can observe that the relative entropy is much lower between the sequentially trained models than between a trained model and the random model. This further demonstrates the potential of removing the redundancy between different models when the models are obtained in a meaningful way. Again, similar trends can be observed for the models trained for different tasks, as shown in Table III. These results imply that significant coding efficiency can be achieved based on model prediction for inter-model redundancy removal, and that redundancy exists among different models even when they are trained for different tasks.
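The entries in Tables II and III can be approximated by histogramming the flattened weights of the two models over a common support and applying (9). Below is a minimal Python sketch; the bin count, range, and smoothing constant eps are our own choices, not the paper's:

import numpy as np

def kl_divergence(w_cur, w_pred, bins=1000, rng=(-0.1, 0.1), eps=1e-12):
    # Eq. (9): D_KL(Phi || Psi), where Phi is the weight distribution of
    # the to-be-compressed model and Psi that of the prediction model.
    phi, _ = np.histogram(w_cur, bins=bins, range=rng)
    psi, _ = np.histogram(w_pred, bins=bins, range=rng)
    phi = phi / phi.sum() + eps   # normalize and smooth to avoid log(0)
    psi = psi / psi.sum() + eps
    return float(np.sum(phi * np.log(phi / psi)))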

C. Multiple Model Compression

The potential redundancies among different deep learning models inspire us to explore the space of inter-model prediction for efficient deep learning model communication. The principle of the inter-model prediction framework is to select the most appropriate reference to reduce the weight energy and remove the redundancies. In this subsection, the inter-model prediction and compression approaches under different scenarios are discussed. In particular, they mainly correspond to the sequential transmission of deep learning models over time, and the simultaneous transmission of multiple models.

TABLE III
K-L DIVERGENCE OF DIFFERENT VGG BASED MODELS TARGETING DIFFERENT TASKS

Fig. 5. The utilization of the DoM in sequential deep learning model communication, where V2 is the newly generated to-be-compressed model and V1 is the previously transmitted model.

1) Sequential Transmission Over Time: As discussed above, the edge or cloud sides are able to gradually receive new data, which can be further utilized to enhance the capability of existing deep learning models. For example, the most straightforward way is to fine-tune the existing models and transmit the newly generated models to the IoT devices or intelligent front end, as illustrated in Fig. 5. In this scenario, the cloud or edge side stores the previously transmitted models, which can be used to predict the current to-be-compressed one. As such, the DoM can be computed to achieve a compact representation.

One straightforward method to extract the DoM between the prediction and the to-be-compressed models is computing the difference for each corresponding weight, layer by layer, as illustrated in Algorithm 1. Subsequently, the weight difference can be quantized and entropy coded according to the method described in Section IV-A. At the receiver side, the weight difference is entropy decoded and de-quantized to recover the DoM. Given the DoM, the newly generated model can be recovered, such that new knowledge is conveyed to improve the overall system performance. It is worth mentioning that in some scenarios, only a portion of the parameters or layers in the deep learning models is supposed to be updated, while the others remain unchanged. In this case, the DoM can be calculated based on the comparison between the partially updated weights and the corresponding prediction weights.

Algorithm 1 The Computation of the DoM
Input: The prediction model Mp and the to-be-compressed model Mc. The larger numbers of layers, kernels, and weights inside each kernel between Mp and Mc are denoted as NL, NK, and NW.
Output: The DoM MDoM between Mp and Mc.
for L = 0 to NL − 1 do
    for I = 0 to NK − 1 do
        for J = 0 to NW − 1 do
            if Mc(L, I, J) does not exist then Mc(L, I, J) = 0
            if Mp(L, I, J) does not exist then Mp(L, I, J) = 0
            MDoM(L, I, J) = Mc(L, I, J) − Mp(L, I, J)
return MDoM
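A Python counterpart of Algorithm 1, together with the receiver-side reconstruction, is sketched below. The per-layer dictionary representation of a model and the reuse of the quantize/dequantize helpers from Section IV-A are our own assumptions for illustration:

import numpy as np

def compute_dom(m_pred, m_cur):
    # Algorithm 1: layer-wise difference between the to-be-compressed
    # model and the prediction model; missing entries are zero-filled.
    dom = {}
    for name in set(m_pred) | set(m_cur):
        wc = m_cur.get(name)
        wp = m_pred.get(name)
        if wc is None:
            wc = np.zeros_like(wp)
        if wp is None:
            wp = np.zeros_like(wc)
        dom[name] = wc - wp
    return dom

def apply_dom(m_pred, dom):
    # Receiver side: recover the newly generated model from the stored
    # prediction model and the (de-quantized) DoM.
    return {name: m_pred.get(name, 0.0) + d for name, d in dom.items()}

A typical round trip then quantizes and entropy codes each dom[name] with the helpers of Section IV-A, and the receiver calls apply_dom on the de-quantized residues.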

Another practical scenario that needs to be considered is that the deep learning models to be compressed or used for prediction could be corrupted. Generally speaking, there are two types of distortions that lead to the corruption of the pristine deep learning model. On the one hand, the degradation in precision caused by quantization is commonly observed. On the other hand, the model can be shrunk by reducing the number of weights or by changing the structure of the model. Moreover, three different scenarios are considered: the to-be-compressed model is pristine while the prediction model is corrupted; the to-be-compressed model is corrupted while the prediction model is pristine; and both the to-be-compressed model and the prediction model are corrupted.

In this context, the DoM can still be derived between the prediction and the to-be-compressed models. For example, when the prediction model is pruned, the missing layers can be filled with zero padding. Therefore, even if the prediction model is corrupted, it is not necessary to directly transmit the whole model. As such, deep model compression approaches aiming at reducing the inner redundancy inside a model can also be combined with the proposed inter-model prediction method.

2) Simultaneous Deep Learning Model Set Compression: In deep learning model communication, different deep learning models targeting different tasks or learned with different data may be required to be transmitted simultaneously. Due to the fact that some popular deep learning architectures have high generalization ability and can handle different tasks well, the employed deep learning models usually have similar or shared structures. For example, the VGG model can be retrained or redesigned to achieve object detection, face recognition, and classification. Based on the redundancy removal between different models, a set of deep learning models can be compactly represented based on one root model for inter-model prediction, as illustrated in Fig. 6.

First, a root model is selected, and the rest of the to-be-compressed models are represented based on the comparisons of these models with the root model. The root model is directly compressed with the single model compression method. As such, only one integrated model needs to be transmitted, and the other models can be represented with DoMs.

Fig. 6. The utilization of the DoM in compressing a set of deep learning models, where the DoM is obtained based on the selected root model and the to-be-transmitted model.

At the receiver side, all the models are reconstructed based on the received root model and the DoMs. In this process, the optimization problem is modeled as follows,

\max \sum_{i=1}^{n} U^i \quad \text{subject to} \quad R_{Root} + \sum_{i=1}^{n-1} R^i_{DoM} \leq R_c, \quad (10)

where n is the number of models, U^i represents the utility of the i-th model (such as the analysis accuracy), R_{Root} represents the size of the root model, and R^i_{DoM} denotes the size of the i-th DoM. R_c represents the constraint on the total size of the transmitted models. Analogous to image set compression [57], [58], the performance of the proposed compression scheme is highly dependent on the selection of the root model, and the best root model should have the following desired property,

\forall j \leq n, \quad \sum_{i=1}^{n} Dis(M_{Root}, M_i) \leq \sum_{i=1}^{n} Dis(M_j, M_i), \quad (11)

where M_i indicates the i-th model and Dis represents the model distance. Here, the K-L divergence and the shared memory are utilized to characterize the model distance and select the best root model among all the to-be-compressed models. In particular, the sum of the K-L distances from the root model to the rest of the models is minimized, aiming to select the model that is closest to all the other models for enhanced coding performance. Moreover, the shared memory that characterizes the shared structures between different models is also taken into consideration.
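Ignoring the shared-memory term for brevity (our simplification), root model selection according to (11) reduces to picking the model whose total K-L distance to all the others is minimal, as in the following sketch, which reuses the kl_divergence helper defined in Section IV-B:

def select_root(models, dist=kl_divergence):
    # models: a list of flattened weight arrays. Returns the index of the
    # model minimizing the sum of distances to all other models, Eq. (11).
    def total_dist(j):
        return sum(dist(models[i], models[j])
                   for i in range(len(models)) if i != j)
    return min(range(len(models)), key=total_dist)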

D. Discussions on Further Standardization

To apply deep learning model communication in real application scenarios, we can perceive great promise as well as major challenges, especially for the standardization of compact deep learning model representation. Indeed, there is a strong demand for defining standardized deep learning model compression formats with a full suite of relevant technologies to enable interoperability. Generally speaking, such a compression format needs to be compact, task independent, and sufficiently robust, and should have the following desirable properties:

• ensure the interoperability of deep learning model communication;

• reduce the load of transmission and storage of deep learning models;

• provide a strict format for hardware-supported deep learning model compression;
• enable high-level performance of implementations conforming to the standard; and
• simplify the design of deep learning model compression for knowledge communication applications.

TABLE IV
EXPERIMENT SETUP IN DIFFERENT APPLICATION SCENARIOS. WE ALSO LIST THE APPLICATION SCENARIOS BASED ON THE ORDER OF OUR EXPERIMENTS

To achieve compact deep learning model compression, inter-prediction techniques can become the dominating coding tools, leading to significant improvement in coding efficiency. Even so, the rapid development of deep learning creates many challenging research problems. The most important issue is that the dynamically evolving deep neural network architectures require the standardization to be more flexible, such that newly developed network structures can be accommodated by the existing standard. In view of the great demand for compact representation of deep neural networks, MPEG has also initiated the coded representation of neural networks, and many use cases have been identified [59]–[61]. In particular, after investigating the current deep learning model compression approaches [62], the inter-model prediction as well as the evaluation framework for deep learning model transmission and compression were also proposed in the standardization process [63], [64]. It is envisioned that a wide spectrum of novel technologies and functionalities will be developed in the near future, making the standard more flexible and versatile in the knowledge communication environment.

V. VALIDATIONS

A. Experimental Setup

In order to validate the performance of inter prediction in deep learning model communication, extensive experiments are conducted on three tasks: classification, image retrieval, and object detection. For the different tasks, we use different models (such as VGG and ResNet) and training data to well reflect the performance. In particular, for image classification, we use the Cifar10 and Cifar100 datasets [65]. For the detection and instance image retrieval tasks, we fine-tune models pre-trained on ImageNet. For retrieval, we fine-tune the pre-trained VGG using a triplet loss on the 3D landmark dataset [55], and testing is conducted on the Holidays [66], Oxford5k [67], and Paris6k [68] datasets. For the detection tasks, we use Pascal VOC [69] as our training dataset based on the pre-trained models.

For the models trained on the Cifar10 or Cifar100 datasets, the learning rate is set to 0.001, the batch size is set to 128, and the softmax loss is used to train the models for 150 epochs. Regarding object detection, Faster R-CNN is employed, and we train the pre-trained VGG and Resnet101 with the learning rate set to 0.005 and the batch size set to 1 for 5 epochs [13]. For the retrieval tasks, we train VGG and Resnet50 using the above parameter settings: the learning rate is set to 0.001, the batch size is set to 64, and 551 clusters are selected (22,156 images, 5,974 of them queries) for training while 162 clusters are selected (6,403 images, 1,691 of them queries) for triplet loss training. Each training triplet contains 1 query, 1 positive, and 5 hard negative images. Regarding quantization, we set s_bits to 12, and different quantization levels are achieved by varying q_bits.
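For reference, the training and quantization settings described above can be collected in a single configuration sketch (a plain restatement of the text; the structure and field names are our own):

EXPERIMENT_CONFIG = {
    "classification": {"model": "VGG16", "data": ["Cifar10", "Cifar100"],
                       "lr": 1e-3, "batch_size": 128, "epochs": 150,
                       "loss": "softmax"},
    "detection":      {"model": ["VGG", "Resnet101"],
                       "framework": "Faster R-CNN", "data": "Pascal VOC",
                       "lr": 5e-3, "batch_size": 1, "epochs": 5},
    "retrieval":      {"model": ["VGG", "Resnet50"], "data": "3D landmark",
                       "lr": 1e-3, "batch_size": 64, "loss": "triplet",
                       "triplet": {"query": 1, "positive": 1,
                                   "hard_negative": 5}},
    "quantization":   {"s_bits": 12, "q_bits": "varied"},
}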

The experimental setup is summarized in Table IV. First, the sequential transmission scenario is investigated, in which the models are required to be transmitted one after another. We validate this scenario with different tasks among detection, retrieval, and classification, as well as partial updating cases for retrieval; the results are listed in Table IV (rows 1 and 2). In this context, we also consider the cases where the to-be-transmitted model or the prediction model is corrupted; both pruned and corrupted models are considered, as listed in Table IV (rows 3 and 4). Finally, we study simultaneous deep learning model transmission, where a set of deep learning models is compressed to facilitate many applications simultaneously.

B. Experimental Results on Sequential Model Transmission

1) Pristine Model for Prediction, Pristine To-Be-Compressed Model: In practice, we individually study the deep learning models for the different tasks. For the classification tasks, we train deep models with the VGG16 structure from scratch on the Cifar10 and Cifar100 datasets for 150 epochs, and we select the two models corresponding to the last 40 epochs and the last 20 epochs. For object detection, we train a deep model with the Resnet101 structure from the ImageNet pre-trained model and fine-tune it on Pascal VOC using Faster R-CNN. Due to the fast convergence, we choose the models from epochs 2 and 3. For image retrieval, we train a deep model with the VGG16 structure from the ImageNet pre-trained model and fine-tune it on the 3D landmark dataset as a feature extractor (the fully connected layers are removed). We choose the best two models in terms of the overall performance on all the test datasets.


Fig. 7. Performance comparisons in the scenario of sequential model transmission under the three given tasks. Here it is worth mentioning that a model size approaching zero in the DoM case implies that very few bits are transmitted for model updating. (a) Detection; (b) classification; (c) retrieval.

TABLE V
THE PERFORMANCES OF THE SELECTED MODELS FOR THE THREE TASKS WHEN THE MODELS ARE PRISTINE. "CLS" REPRESENTS THE CLASSIFICATION TASKS, "DET" REPRESENTS OBJECT DETECTION, AND "RRV" REPRESENTS THE INSTANCE IMAGE RETRIEVAL TASK

The detailed performances as well as the model structures are listed in Table V. It is also observed that the performance improves as the models are updated, which implies that with more data, better knowledge can be extracted to facilitate these analytics applications. The performance comparisons with and without inter-model prediction are shown in Fig. 7. For the setting without inter-model prediction, we compress the to-be-transmitted model (V2 in Table V) with single model compression using quantization parameters varying from 4 to 11, such that different compressed model sizes with the corresponding performances are plotted. Regarding inter-model prediction, we calculate the DoM between V2 and V1 based on the proposed approaches and then compress the DoM using quantization parameters from 4 to 11 as well. The performances in terms of the model sizes are compared. From Fig. 7, we can observe that the DoM strategy greatly reduces the size of the deep learning models at the same analysis performance, which provides useful evidence that the proposed inter-model prediction approach is effective for deep learning model communication. Though the results from 0 to the largest model size are displayed, the performance varies, and a significant performance drop appears when extreme compression is applied to the strategy without the DoM. Regarding the strategy without the DoM, at least the backbone of the model needs to be transmitted, such that in Fig. 7 the without-DoM curve approximately starts from the middle point.

TABLE VI
THE PERFORMANCE OF THE SELECTED MODELS FOR INSTANCE IMAGE RETRIEVAL. VGG REPRESENTS THE DEEP MODEL VGG16, AND RES50 REPRESENTS THE DEEP MODEL RESNET-50

Furthermore, we investigate the performance of inter model prediction in more complicated scenarios. First, the prediction and the current to-be-compressed models are trained sequentially across a long time interval, in which case the correlations between the different models are decreased. Here we choose the ImageNet pretrained model, which is further finetuned for 15 and 20 epochs on 3D landmarks for model generation. Subsequently, we consider the circumstance in which only part of the model is updated with the other layers fixed. In particular, we finetune the ImageNet pretrained Resnet50 on 3D landmarks and fix the parameters before the “Conv4” blocks. The to-be-compressed model is then generated after finetuning for 20 epochs, achieving a completely different performance.

The configurations are listed in Table VI and the corresponding performance is shown in Table VII. Here, it is worth mentioning that the performances of these retrieval models are generally improved with retraining, as shown in Table VI. As such, retraining plays an important role and is particularly considered in our experiments. The quantization parameters (q_bits) are set from 6 to 9 to maintain the analysis performance. We also directly compress VGG-epoch20 and Res50-epoch20 to investigate the potential coding efficiency. Though long-term reference is essentially worse than short-term reference, a significant compression performance improvement can still be achieved compared to directly transmitting the to-be-compressed model. The compression ratios when the performances are maintained are shown in Table VIII. Again, similar observations can be obtained, which further illustrates the advantage of inter model prediction.
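The q_bits selection described in Table VIII can be viewed as a simple sweep: reusing the quantize and reconstruct helpers sketched earlier, pick the coarsest setting whose retrieval mAP stays within the 0.00001 tolerance of the uncompressed to-be-compressed model. The search direction and the evaluate_map callable are our assumptions, not the authors' procedure.

    def select_q_bits(dom, prediction_state, evaluate_map, reference_map,
                      candidates=range(4, 12), tol=1e-5):
        # Sweep from the coarsest to the finest quantization and keep the
        # first setting whose mAP difference stays within tol; cf. Table VIII,
        # where differences above 0.00001 count as performance variation.
        for q_bits in candidates:
            coded = {n: quantize(t, q_bits) for n, t in dom.items()}
            candidate = reconstruct(prediction_state, coded)
            if abs(reference_map - evaluate_map(candidate)) <= tol:
                return q_bits
        return max(candidates)  # fall back to the finest quantization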


TABLE VII
PERFORMANCE COMPARISONS FOR DIFFERENT QUANTIZATION PARAMETERS (Q_BITS). THE ΔMAP IS OBTAINED AS THE MAXIMAL DIFFERENCE BETWEEN THE MAP VALUES OF THE ORIGINAL TO-BE-COMPRESSED MODELS (VGG-EPOCH20 AND RESNET-EPOCH20) AND THE MAP VALUES OF THE CORRESPONDING COMPRESSED MODELS OVER THE THREE DATASETS (HOLIDAYS, OXFORD5K, AND PARIS6K)

TABLE VIII
COMPARISONS OF THE COMPRESSION RATIOS. THE QUANTIZATION PARAMETER Q_BITS IS OBTAINED BY SELECTING THE HIGHEST QUANTIZATION PARAMETER THAT DOES NOT LEAD TO A VARIATION OF THE RETRIEVAL PERFORMANCE. HERE, PERFORMANCE DIFFERENCES LARGER THAN 0.00001 ARE REGARDED AS PERFORMANCE VARIATION

Fig. 8. Comparisons of the compression ratios for different quantization parameters with different DoM intervals.

We also compare the compression ratios in terms of different quantization parameters with and without DoM. As demonstrated in Fig. 8, the DoM based scheme achieves a higher compression ratio than the naive single model compression scheme without inter model prediction. Moreover, inter model prediction is able to further enhance the compression performance by involving more shared information. We also plot the long-interval inter model prediction with different deep learning model structures, such as partially updating Resnet with and without inter model prediction. As shown in Fig. 9, when utilizing inter model prediction, a higher compression ratio can also be achieved.

2) Corrupted Model for Prediction: In real application scenarios, the prediction model is usually corrupted, as it may be compressed and transmitted to the receiver side in a lossy manner. Thus, it is important to investigate whether inter model prediction is still beneficial in such circumstances.

Fig. 9. Comparisons of the compression ratios for different quantization parameters on Resnet50 for the partially updating scenario. The performance of VGG DoM in the entire updating scenario is also provided for illustration.


Here, we take the classification task on Cifar10 and Cifar100 into consideration. In particular, the prediction models are generated by quantizing the original model with quantization parameters ranging from 9 to 11. From Table IX, we can observe that, with the increase of the quantization parameters of the prediction models for Cifar10 (i.e., lower levels of corruption of the prediction models), better performances and higher compression ratios can be achieved for the same to-be-compressed models. In other words, the more information that can be obtained from the prediction model, the better the achievable performance. Regarding Cifar100, it is also interesting to see that the corruption has only a slight influence on inter model prediction. As such, even when the precision of the prediction model is decreased, we are still able to utilize inter model prediction for better coding performance.
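In this corrupted-prediction setting, the encoder must form the residual against the same lossy copy of the prediction model that the receiver holds; otherwise the quantization error of the prediction would leak into the reconstruction. A minimal sketch reusing the helpers above, where the pred_q_bits interface is our assumption:

    def encode_with_corrupted_prediction(current_state, prediction_state,
                                         pred_q_bits, dom_q_bits):
        # The prediction model was itself transmitted lossily (quantized
        # with pred_q_bits), so the DoM is computed against that corrupted
        # copy, exactly as the receiver will reconstruct it.
        corrupted_pred = {n: dequantize(*quantize(t, pred_q_bits))
                          for n, t in prediction_state.items()}
        dom = compute_dom(current_state, corrupted_pred)
        return {n: quantize(t, dom_q_bits) for n, t in dom.items()}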

3) Pruned Model for Prediction, Pruned To-Be-Compressed Model: Another important scenario is that the prediction and to-be-compressed models are pruned instead of being pristine. For example, some single-model deep compression approaches, such as that of Han et al. [6], can be incorporated such that a lightweight model is delivered. More specifically, we use Han et al.'s pruning algorithm [6] to generate the lightweight models, and these pruned models achieve 2x (prune50%) and 5x (prune20%) compression compared to the original models, as shown in Table X. In particular, three different DoM models are generated for performance validation, and the compression ratio is calculated between the model size for transmission and the to-be-compressed model size.
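Before detailing the three DoM configurations below, the pruning step can be pictured as global magnitude pruning in the spirit of Han et al. [6]; the keep_ratio interface and the masking are our simplification (the original pipeline also retrains the surviving weights and entropy-codes the result):

    import torch

    def magnitude_prune(state_dict, keep_ratio):
        # Zero out all but the largest-magnitude fraction of the weights,
        # e.g. keep_ratio=0.2 retains 20% of the coefficients (prune20%).
        magnitudes = torch.cat([t.abs().flatten()
                                for t in state_dict.values()])
        k = max(1, int(keep_ratio * magnitudes.numel()))
        threshold = torch.topk(magnitudes, k).values.min()
        return {name: t * (t.abs() >= threshold)
                for name, t in state_dict.items()}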


TABLE IX
PERFORMANCE COMPARISONS FOR DIFFERENT PREDICTION MODELS WITH DIFFERENT QUANTIZATION PARAMETERS (Q_BITS)

TABLE X

PERFORMANCE COMPARISONS WHEN THE TO-BE-COMPRESSED

MODEL IS CORRUPTED. P2 REPRESENTS prune20%, P5 REPRESENTS

prune50%, AND U REPRESENTS UN-PRUNED


1) DoM1 (P2-P2): The DoM between the pruned VGG-epoch0 (prediction) and the pruned VGG-epoch20 (to-be-compressed). The VGG-epoch0 is pruned with 20% of the coefficients retained, and the pruned VGG-epoch20 is obtained by retraining the pruned VGG-epoch0 for 20 epochs. The DoM between the two pruned models is transmitted.

2) DoM2 (P5-P2): The DoM between the pruned VGG-epoch0 (prediction) and the pruned VGG-epoch20 (to-be-compressed). The VGG-epoch0 is pruned with 20% of the coefficients retained and the VGG-epoch20 is pruned with 50% of the coefficients retained. A pre-processing step is performed to facilitate the DoM computation by zero padding the prediction model, i.e., adding extra parameters with values equal to 0 (a minimal sketch of this padding is given after the list).

3) DoM3 (U-P2): The DoM between the pruned VGG-epoch0 (prediction) and the pristine VGG-epoch20 (to-be-compressed). Regarding the prediction model VGG-epoch0, only 20% of the parameters are preserved after pruning, while the to-be-compressed model VGG-epoch20 is pristine. The DoM between the pruned model and the pristine model is calculated with the same zero-padding pre-processing as in DoM2 (P5-P2), and the DoM is finally transmitted.
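The zero-padding pre-processing in DoM2 and DoM3 can be read as treating every coefficient that is absent from the pruned prediction model as an explicit zero, so that the two models align elementwise before the subtraction. A minimal sketch under that reading (with a dense representation, element-level zeros are already in place, so only missing entries need filling; pruned_v0 and pristine_v20 are hypothetical state dictionaries):

    import torch

    def zero_pad_prediction(prediction_state, reference_state):
        # Every parameter tensor present in the to-be-compressed (reference)
        # model but missing from the pruned prediction contributes zeros,
        # so compute_dom can subtract the two dense state dictionaries.
        return {name: prediction_state.get(name, torch.zeros_like(tensor))
                for name, tensor in reference_state.items()}

    # DoM3 (U-P2): pruned prediction vs. pristine to-be-compressed model.
    padded = zero_pad_prediction(pruned_v0, pristine_v20)
    dom3 = compute_dom(pristine_v20, padded)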

From Table X, we can observe that inter model prediction can potentially decrease the cost of transmitting the pruned models. Therefore, in the scenario of transmitting an already compressed model, we are still able to make use of the inter-model redundancy. We also compare the compression ratios in terms of the quantization steps for the pruned 20% model in Fig. 10 with different inter model prediction intervals. Similar tendencies as in Figs. 8 and 9 can be observed, which confirms that inter model prediction is able to further enhance the utility of lightweight models.

Fig. 10. Performance comparisons in terms of the compression ratio when the to-be-compressed model is pruned to 20% with different inter model prediction intervals.

TABLE XI
ILLUSTRATION OF THE TO-BE-COMPRESSED MODELS WITH THE CORRESPONDING PERFORMANCES AND MODEL SIZES. THE PERFORMANCES FOR RETRIEVAL ARE TESTED ON HOLIDAY/OXFORD5K/PARIS USING MAP. THE PERFORMANCE FOR CLASSIFICATION IS TESTED ON IMAGENET TOP-5 ACCURACY, AND DETECTION IS TESTED ON PASCAL VOC USING MAP

C. Experimental Results on Simultaneous Model Transmission

In this subsection, the performance of simultaneous model compression is studied. Here, we consider three models with similar structures (VGG) but fine-tuned for different tasks, including image retrieval (3D landmark), object detection (Pascal VOC), as well as the pretrained ImageNet VGG-16 model. More details are provided in Table XI. Though these deep learning models are all built upon the VGG structure, there are still some differences among the tasks. For classification, the fully connected layers are required, while for object detection, RPN layers [13] are taken into consideration. As such, the shared memory can be computed based on similar network layers. The retrieval model shares with the other models the layers before the fully connected layers, amounting to 56.1 MB, while the detection and classification models further share two fully connected layers, for a total shared memory of 473 MB. For better coding efficiency, we also calculate the K-L divergence between the models to reveal their pairwise correlations.
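The K-L divergence used here to gauge pairwise model correlation can be approximated from histograms of the shared-layer weights; the binning scheme and the smoothing constant in the following sketch are our choices, not specified in the paper:

    import numpy as np

    def kl_divergence(weights_p, weights_q, bins=256):
        # Approximate D_KL(P || Q) between two models' weight distributions
        # on their shared layers, using a common histogram support.
        lo = min(weights_p.min(), weights_q.min())
        hi = max(weights_p.max(), weights_q.max())
        p, _ = np.histogram(weights_p, bins=bins, range=(lo, hi))
        q, _ = np.histogram(weights_q, bins=bins, range=(lo, hi))
        p = p.astype(np.float64) + 1e-10   # smooth empty bins
        q = q.astype(np.float64) + 1e-10
        p /= p.sum()
        q /= q.sum()
        return float(np.sum(p * np.log(p / q)))

A natural root-model choice is then the model with the largest shared memory and the smallest total divergence to the remaining models, which is consistent with selecting the classification model below.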


Fig. 11. Performance comparison for simultaneous model transmission (retrieval).

Fig. 12. Performance comparison for simultaneous model transmission (detection).

From Table XI, considering both the shared memory and the distances, the classification model is selected as the root model for better coding efficiency. Subsequently, we calculate the DoM on the shared layers, and fill zeros into the layers that do not exist in the root model. The DoM is quantized with different quantization parameters, and we also separately compress the two to-be-compressed models with single model compression for comparison. The performance is shown in Figs. 11 and 12, from which we can see that significant performance improvement can be achieved by the proposed scheme for both the detection and retrieval tasks. From Fig. 11, it is noticed that when the transmitted model sizes are between 4 and 6 MB, the DoM strategy is able to maintain the performance while the performance begins to drop for the w/o-DoM strategy. Moreover, all the curves with the DoM strategy lie above the w/o-DoM curves, indicating that better performance can be achieved under the same transmitted model sizes. Analogously, in Fig. 12, when the performance is maintained around 70%, less data needs to be conveyed for the DoM than for the w/o-DoM strategy. As a result, significant performance improvement can be achieved by the proposed scheme, which further illustrates that it is effective in both the sequential and simultaneous transmission scenarios.
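Putting the pieces together, the simultaneous scheme can be sketched as transmitting the root model once plus one quantized DoM per dependent model, with zeros predicting the layers that the root does not contain; the packaging below reuses the earlier helpers and is our assumption rather than the paper's exact format:

    def encode_simultaneous(root_state, dependent_states, q_bits):
        # Compress a set of models jointly: the root is sent directly and
        # each dependent model is sent as a quantized DoM against the root,
        # with zero prediction for layers the root does not share.
        package = {"root": root_state, "doms": {}}
        for task, state in dependent_states.items():
            padded_root = zero_pad_prediction(root_state, state)
            dom = compute_dom(state, padded_root)
            package["doms"][task] = {n: quantize(t, q_bits)
                                     for n, t in dom.items()}
        return package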

VI. ENVISIONING THE FUTURE

Despite significant recent progress on deep neural network compression, there remains considerable room for improvement in the scenario of deep learning model communication. In this paper, we have emphasized the great potential of exploiting the inter-model redundancy of deep learning models in a variety of application scenarios. As such, the application scope as well as the methodology of traditional deep neural network compression have been extended, and the feasibility of the new deep learning model communication paradigm has been demonstrated and validated through extensive experiments. In the future, it is envisioned that the need for advanced deep model compression algorithms and the relevant standards will become more pronounced.

Driven by the demand for efficient deep learning model communication, there are many open problems to be solved. One of the most important issues, as we have hinted, is selecting an appropriate prediction that leads to better redundancy removal. This requires more flexible prediction methodologies at multiple levels. Another important problem concerns the standardization of deep model compression, especially in the context of the fast advances of deep learning algorithms and network architectures.

REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Comput. Netw., vol. 54, no. 15, pp. 2787–2805, Oct. 2010.

[2] D. Wu, Z. Li, J. Wang, Y. Zheng, M. Li, and Q. Huang. (2017). “Vision and challenges for knowledge centric networking (KCN).” [Online]. Available: https://arxiv.org/abs/1707.00805

[3] K. Simonyan and A. Zisserman. (2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.

[6] S. Han, H. Mao, and W. J. Dally. (2015). “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.” [Online]. Available: https://arxiv.org/abs/1510.00149

[7] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proc. Int. Conf. Comput. Vis. (ICCV), vol. 2, Oct. 2017, pp. 1389–1397.

[8] G. J. Sullivan, “Efficient scalar quantization of exponential and Laplacian random variables,” IEEE Trans. Inf. Theory, vol. 42, no. 5, pp. 1365–1374, Sep. 1996.

[9] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression by distilling knowledge from neurons,” in Proc. AAAI, 2016, pp. 3560–3566.

[10] C. Szegedy et al., “Going deeper with convolutions,” in Proc. CVPR, Jun. 2015, pp. 1–9.

[11] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2017, vol. 1, no. 2, pp. 1–3.

[12] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1988–1996.

[13] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.

[14] D. Bahdanau, K. Cho, and Y. Bengio. (2014). “Neural machine translation by jointly learning to align and translate.” [Online]. Available: https://arxiv.org/abs/1409.0473

[15] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.

[16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. (2017). “Unpaired image-to-image translation using cycle-consistent adversarial networks.” [Online]. Available: https://arxiv.org/abs/1703.10593

[17] M. Denil, B. Shakibi, L. Dinh, and N. De Freitas, “Predicting parameters in deep learning,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2148–2156.

[18] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387.

[19] X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4860–4874.


[20] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, “An exploration of parameter redundancy in deep networks with circulant projections,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 2857–2865.

[21] Y. Cheng, X. Y. Felix, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. (2015). “Fast neural networks with circulant projections.” [Online]. Available: https://arxiv.org/abs/1502.03436v1

[22] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 7370–7379.

[23] M. Masana, J. van de Weijer, L. Herranz, A. D. Bagdanov, and J. M. Alvarez, “Domain-adaptive deep network compression,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 4289–4297.

[24] S. Lin, R. Ji, X. Guo, and X. Li, “Towards convolutional neural networks compression via global error reconstruction,” in Proc. IJCAI, 2016, pp. 1753–1759.

[25] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond filters: Compact feature map for portable deep model,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3703–3711.

[26] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. (2015). “Compression of deep convolutional neural networks for fast and low power mobile applications.” [Online]. Available: https://arxiv.org/abs/1511.06530

[27] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. (2018). “Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers.” [Online]. Available: https://arxiv.org/abs/1802.00124

[28] Y. Gong, L. Liu, M. Yang, and L. Bourdev. (2014). “Compressing deep convolutional networks using vector quantization.” [Online]. Available: https://arxiv.org/abs/1412.6115

[29] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. (2017). “Incremental network quantization: Towards lossless CNNs with low-precision weights.” [Online]. Available: https://arxiv.org/abs/1702.03044

[30] C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3290–3300.

[31] B. Reagen et al. (2017). “Weightless: Lossy weight encoding for deep neural network compression.” [Online]. Available: https://arxiv.org/abs/1711.04686

[32] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2654–2662.

[33] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. (2016). “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1.” [Online]. Available: https://arxiv.org/abs/1602.02830

[34] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.

[35] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. (2016). “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size.” [Online]. Available: https://arxiv.org/abs/1602.07360

[36] A. G. Howard et al. (2017). “MobileNets: Efficient convolutional neural networks for mobile vision applications.” [Online]. Available: https://arxiv.org/abs/1704.04861

[37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995.

[38] V. Chandrasekhar et al., “Compression of deep neural networks for image instance retrieval,” in Proc. IEEE Data Compress. Conf. (DCC), Apr. 2017, pp. 300–309.

[39] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. (2017). “Deep gradient compression: Reducing the communication bandwidth for distributed training.” [Online]. Available: https://arxiv.org/abs/1712.01887

[40] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[41] IEEE Approved Draft Standard for 2nd Generation IEEE 1857 Video Coding, IEEE Standard 1857.4-2018, 2014. [Online]. Available: http://www.ieee1857.org/1857.4.asp

[42] B. Girod et al., “Mobile visual search,” IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, Jul. 2011.

[43] A. Redondi, L. Baroffio, L. Bianchi, M. Cesana, and M. Tagliasacchi, “Compress-then-analyze versus analyze-then-compress: What is best in visual sensor networks?” IEEE Trans. Mobile Comput., vol. 15, no. 12, pp. 3000–3013, Dec. 2016.

[44] J. Chao and E. Steinbach, “Keypoint encoding for improved feature extraction from compressed video at low bitrates,” IEEE Trans. Multimedia, vol. 18, no. 1, pp. 25–39, Jan. 2016.

[45] E. Eriksson, G. Dán, and V. Fodor, “Predictive distributed visual analysis for video in wireless sensor networks,” IEEE Trans. Mobile Comput., vol. 15, no. 7, pp. 1743–1756, Jul. 2016.

[46] E. Eriksson, G. Dán, and V. Fodor, “Coordinating distributed algorithms for feature extraction offloading in multi-camera visual sensor networks,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 11, pp. 3288–3299, Nov. 2017.

[47] L.-Y. Duan et al., “Overview of the MPEG-CDVS standard,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.

[48] L.-Y. Duan et al. (2017). “Compact descriptors for video analysis: The emerging MPEG standard.” [Online]. Available: https://arxiv.org/abs/1704.08141

[49] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.

[50] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.

[51] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. (2015). “Understanding neural networks through deep visualization.” [Online]. Available: https://arxiv.org/abs/1506.06579

[52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.

[53] M. K. Varanasi and B. Aazhang, “Parametric generalized Gaussian density estimation,” J. Acoust. Soc. Amer., vol. 86, no. 4, pp. 1404–1415, Oct. 1989.

[54] N. Ranganathan and S. Henriques, “High-speed VLSI designs for Lempel-Ziv-based data compression,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 40, no. 2, pp. 96–106, Feb. 1993.

[55] F. Radenovic, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 3–20.

[56] S. Kullback, Information Theory and Statistics. North Chelmsford, MA, USA: Courier Corporation, 1997.

[57] X. Zhang, Y. Zhang, W. Lin, S. Ma, and W. Gao, “An inter-image redundancy measure for image set compression,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 1274–1277.

[58] X. Zhang et al., “Rate-distortion optimized sparse coding with ordered dictionary for image set compression,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 12, pp. 3387–3397, Dec. 2018.

[59] W. Bailer, Use Cases and Requirements for Coded Representation of Neural Networks, document ISO/IEC JTC1/SC29/WG11/N17338, Gwangju, South Korea, 2018.

[60] S. Chun and W. Bailer, Call for Test Data for Compressed Representation of Neural Networks, document ISO/IEC JTC1/SC29/WG11/N17680, San Diego, CA, USA, 2018.

[61] S. Chun and W. Bailer, Use Cases and Requirements for Compressed Representation of Neural Networks, document ISO/IEC JTC1/SC29/WG11/N17509, San Diego, CA, USA, 2018.

[62] Z. Chen et al., A Survey of Deep Network Compression, document ISO/IEC JTC1/SC29/WG11/m42165, Gwangju, South Korea, 2018.

[63] Z. Chen, Y. Lou, S. Wang, T. Huang, and L.-Y. Duan, Additional Requirement of Neural Network Compression for Friendly Transmission (NNCT) Service, document ISO/IEC JTC1/SC29/WG11/m42601, San Diego, CA, USA, 2018.

[64] Z. Chen, Y. Lou, S. Wang, T. Huang, and L.-Y. Duan, Suggestion Towards the Evaluation Framework of Neural Network Compression for Friendly Transmission Service, document ISO/IEC JTC1/SC29/WG11/m42602, San Diego, CA, USA, 2018.

[65] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Univ. Toronto, Toronto, ON, Canada, 2012.

[66] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. Eur. Conf. Comput. Vis., 2008, pp. 304–317.

[67] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8.

[68] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.

[69] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.


Ziqian Chen received the B.S. degree in computer science from the Dalian University of Technology, Liaoning, China, in 2017. He is currently pursuing the master's degree with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. His current research interests include image retrieval and deep learning model compression.

Ling-Yu Duan (M’06) received the M.Sc. degree in automation from the University of Science and Technology of China, Hefei, China, in 1999, the M.Sc. degree in computer science from the National University of Singapore (NUS), Singapore, in 2002, and the Ph.D. degree in information technology from The University of Newcastle, Callaghan, NSW, Australia, in 2008. He has been the Associate Director of the Rapid-Rich Object Search Laboratory (ROSE), a joint laboratory between Nanyang Technological University (NTU), Singapore, and PKU, China, since 2012. He has been with the Peng Cheng Laboratory, Shenzhen, China, since 2019. He is currently a Full Professor with the National Engineering Laboratory of Video Technology (NELVT), School of Electronics Engineering and Computer Science, Peking University (PKU), China. His research interests include multimedia indexing, search, and retrieval, mobile visual search, visual feature coding, and video analytics. He received the EURASIP Journal on Image and Video Processing Best Paper Award in 2015, the Ministry of Education Technology Invention Award (First Prize) in 2016, the National Technology Invention Award (Second Prize) in 2017, the China Patent Award for Excellence (2017), and the National Information Technology Standardization Technical Committee “Standardization Work Outstanding Person” Award in 2015. He was a Co-Editor of the MPEG Compact Descriptor for Visual Search (CDVS) Standard (ISO/IEC 15938-13). He serves as the Co-Chair of the MPEG Compact Descriptor for Video Analytics (CDVA). He is an Associate Editor of the ACM Transactions on Intelligent Systems and Technology (ACM TIST) and the ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM).

Shiqi Wang (M’15) received the B.S. degree in computer science from the Harbin Institute of Technology in 2008 and the Ph.D. degree in computer application technology from Peking University in 2014. From 2014 to 2016, he was a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. From 2016 to 2017, he was with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, as a Research Fellow. He is currently an Assistant Professor with the Department of Computer Science, City University of Hong Kong. He has proposed over 40 technical proposals to the ISO/MPEG, ITU-T, and AVS standards. His research interests include video compression, image/video quality assessment, and image/video search and analysis.

Yihang Lou received the B.S. degree in software engineering from the Dalian University of Technology, Liaoning, China, in 2015. He is currently pursuing the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. His current interests include large-scale image retrieval and video content analysis.

Tiejun Huang (M’01–SM’12) received the B.S. and M.S. degrees in automation from the Wuhan University of Technology, Wuhan, China, in 1992, and the Ph.D. degree from the School of Information Technology and Engineering, Huazhong University of Science and Technology, Wuhan, in 1999. He was a Post-Doctoral Researcher and a Research Faculty Member with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, from 1999 to 2001. He was the Associate Director and the Director of the Research Center for Digital Media, Chinese Academy of Sciences, from 2001 to 2003 and from 2003 to 2006, respectively. He is currently a Professor with the National Engineering Laboratory for Video Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing. His research interests include digital media technology, digital libraries, and digital rights management. He is a member of the Association for Computing Machinery.

Dapeng Oliver Wu (S’98–M’04–SM’06–F’13) is currently a Professor with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. His research interests are in the areas of networking, communication, signal processing, computer vision, machine learning, smart grid, and information and network security. He received the University of Florida Term Professorship Award in 2017, the University of Florida Research Foundation Professorship Award in 2009, the AFOSR Young Investigator Program (YIP) Award in 2009, the ONR Young Investigator Program (YIP) Award in 2008, the NSF CAREER Award in 2007, the IEEE Circuits and Systems for Video Technology (CSVT) Transactions Best Paper Award for Year 2001, and the Best Paper Awards from IEEE GLOBECOM 2011 and the International Conference on Quality of Service in Heterogeneous Wired/Wireless Networks (QShine) 2006. He serves as the Editor-in-Chief of the IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING.

Wen Gao (M’92–SM’05–F’09) received the Ph.D. degree in electronics engineering from The University of Tokyo, Tokyo, Japan, in 1991. Before joining Peking University, he was a Professor of computer science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He has been the Director of the Peng Cheng Laboratory, Shenzhen, China, since 2018. He is currently a Professor of computer science with the School of Electronics Engineering and Computer Science, Institute of Digital Media, Peking University, Beijing, China. He has authored extensively, including five books and over 600 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, pattern recognition, multimedia information retrieval, multimodal interfaces, and bioinformatics. He is a member of the China Engineering Academy. He has been the Chair of a number of prestigious international conferences on multimedia and video signal processing, such as the IEEE International Conference on Multimedia and Expo and ACM Multimedia, and has served on the advisory and technical committees of numerous professional organizations. He has served or serves on the Editorial Board of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, the EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation.