
DeepEye: Resource Efficient Local Execution of Multiple Deep Vision Models using Wearable Commodity Hardware

Akhil Mathur‡, Nicholas D. Lane‡†, Sourav Bhattacharya‡, Aidan Boran‡, Claudio Forlivesi‡, Fahim Kawsar‡

‡Nokia Bell Labs, †University College London

ABSTRACT
Wearable devices with in-built cameras present interesting opportunities for users to capture various aspects of their daily life and are potentially also useful in supporting users with low vision in their everyday tasks. However, state-of-the-art image wearables available in the market are limited to capturing images periodically and do not provide any real-time analysis of the data that might be useful for the wearers.

In this paper, we present DeepEye - a match-box sized wearable camera that is capable of running multiple cloud-scale deep learning models locally on the device, thereby enabling rich analysis of the captured images in near real-time without offloading them to the cloud. DeepEye is powered by a commodity wearable processor (Snapdragon 410) which ensures its wearable form factor. The software architecture for DeepEye addresses a key limitation of executing multiple deep learning models on constrained hardware, namely their limited runtime memory. We propose a novel inference software pipeline that targets the local execution of multiple deep vision models (specifically, CNNs) by interleaving the execution of computation-heavy convolutional layers with the loading of memory-heavy fully-connected layers.

Beyond this core idea, the execution framework incorporates a memory caching scheme and a selective use of model compression techniques that further minimize memory bottlenecks. Through a series of experiments, we show that our execution framework significantly outperforms the baseline approaches in terms of inference latency, memory requirements and energy consumption.

Keywords
Wearables; deep learning; embedded devices; computer vision; local execution

1. INTRODUCTION
Mobile vision systems have been built for decades, with recent example systems from research including SenseCam [1], Gabriel [2] and ThirdEye [3]. All of these systems capture images from devices worn by the user, which are then processed by vision algorithms for extracting high-level information in service of a range of applications; for example, cognitive assistance, physical analytics, mobile health, memory augmentation and lifelogging. Now even commercial systems of this type are for sale, with the most popular examples being Google Glass [4] and the Narrative Clip [5]. Fortuitously, as wearable cameras start to become mainstream, the ability of computational models to understand the images they collect has also undergone a revolution. Powered by rapid innovation in the area of deep learning, for some image tasks (recognizing a face [6], locating an image geographically [7], or deciding if an image is memorable or not [8]) these models are even equalling human performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MobiSys'17, June 19-23, 2017, Niagara Falls, NY, USA
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4928-4/17/06 ... $15.00
DOI: http://dx.doi.org/10.1145/3081333.3081359

However, currently there exists a huge gap between this growing computational ability to understand images and the ability of resource-constrained wearable devices to execute these models. While this is an active area of research [9, 10, 11, 12, 13, 14], still today virtually no wearable devices, either research prototypes or commercially sold ones, locally process the images they collect with deep learning models; they are limited, at best, to processing them in the cloud (e.g., systems like [15]), which exposes users to privacy risks as highly sensitive personal images are handled by a third party. In fact, many devices like the Narrative Clip simply capture images and transfer them to the cloud for later analysis – performing no local computation on the image at all. Alternatively, research prototypes like Gabriel [2], which is based on Google Glass, do perform some mixture of local and cloud-based processing of images using shallow image modeling algorithms (that are not as accurate as deep learning models); but this processing comes at a cost, with Gabriel having only 1 or 2 hours of battery life (reported in [2]).

The aim of this paper is to explore the future of wearable vision systems, where devices can locally process images with bleeding-edge deep learning algorithms towards realizing a wide range of mobile vision based applications. In this work, we do not investigate what it will take for a single deep vision model to run efficiently – instead we attempt to run multiple concurrent deep learning algorithms on a wearable form-factor prototype. We believe this is a critical system design issue that will enable important ubiquitous computing and mobile computing applications, many of which are already developed and studied [1, 16, 17, 18, 19, 20, 21], but remain impractical for large-scale deployment because suitable wearable devices are not yet available.

In this work we present the design, implementation and evaluation of DeepEye; a one-of-a-kind wearable that is capable of executing 5 state-of-the-art deep vision models entirely on the device, on images that are captured periodically (with a default sampling interval of 30 seconds). DeepEye is unlike any current wearable vision device in its capability to process images without cloud support, an ability that affords critical privacy assurances to users, who can trust that their images never have to leave the device. In its current form the prototype device weighs 60 gm and has dimensions of 44 x 68 x 12 mm (illustrations are available in Figures 1 and 8). Although it has general support for all forms of Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) [22] (the two most popular forms of deep learning), we evaluate it within this paper while supporting deep models that perform the following image tasks: face recognition, counting salient objects in a scene, visual scene recognition, object detection, age and gender assessment of faces and, more unusually, the prediction of the memorability of an image [8]. Using these models it would be possible to use DeepEye to realize a wide range of vision applications; for example, a life-logging application that smartly filters images based on their expected memorability. Importantly, under this workload DeepEye still manages to maintain a battery life of nearly 17 hours while taking only 10.10 seconds (depending on the model set used) to compute all deep models.

Two core enablers allow DeepEye to provide these levels of efficient mobile image processing. The first is the hardware design, which combines a Qualcomm Snapdragon 410 processor (the same one found in many smartwatches) with a custom integrated carrier board of our own design consisting of a 5-megapixel camera sensor and a 2400 mAh LiPo battery. The second enabler is an inference pipeline optimized specifically to cope with the needs of multiple models. Multiple deep model inference execution presents specific bottlenecks separate from those of single models; for instance, it becomes impossible to always keep all layers of each model in memory. Our inference pipeline optimizes at the level of individual deep learning layers and builds on a fundamental inter-model optimization insight: namely, that CNNs are comprised of a combination of computation-heavy convolutional layers and memory-heavy fully-connected layers. While convolutional layers tax the memory resources lightly, they are computationally demanding; in contrast, fully-connected layers have the exact opposite resource demands. Due to these orthogonal demands on memory and compute, it is possible to schedule and batch layers together from multiple models in a way that better utilizes the resources of constrained devices and avoids the bottlenecks that prevent multiple deep models from being executed. This core idea is complemented by a layer caching scheme and a selective use of SVD-based layer compression that further minimize memory bottlenecks.

Finally – to complement our system and resource efficiency focus – we investigate the accuracy of the deep models used within DeepEye by testing them against images collected by actual wearable cameras (specifically Narrative Clips [5]). This is important because these models were not trained with data captured by wearable cameras, which will negatively impact their performance; typically their accuracy numbers are assessed with data that is not representative of wearable devices. To address this, more than 18,000 images were provided by the authors of [23], which we have further labeled with image entities (≈4000) using Amazon Mechanical Turk (detailed in § 5.6). Overall, our findings indicate that the deep models' accuracy does (as expected) decline from typically reported values, primarily due to the nature of egocentric images (e.g., low lighting conditions, object occlusion). As such, there is a clear need to fine-tune these deep models for objects and scenarios commonly found in egocentric images.

The scientific contributions of this work include:
• We design the proof-of-concept DeepEye wearable. This is a first-of-its-kind device that is capable of executing multiple deep learning based vision algorithms purely locally. Comparable wearables from research, and those available commercially, at best may be able to run just a single example deep model – but none are designed to, or capable of, executing multiple models.
• We develop an inference pipeline for wearables designed to meet the challenges faced when multiple models are being executed (large memory footprint, sheer computational overhead). This pipeline primarily relies on increasing processor utilization by scheduling layers with orthogonal bottlenecks.
• We build prototype hardware to demonstrate the feasibility of our design. This prototype relies on a commodity SoC (containing both a CPU and a GPU) and is capable of capturing and processing images with 5 deep models within reasonable resource bounds.
• We perform an extensive evaluation of the performance of DeepEye, including its inference pipeline, hardware, and even the accuracy of its models using images typical of wearable devices. We show that, relative to alternatives such as serially performing inference using all models, DeepEye offers more than 1.7x gains in execution latency.

Device              Weight (gm)  Lifetime (hours)  Cloud?  Inference?
DeepEye             60           16                No      Yes
Gabriel [2]         42           1                 Yes     Yes
SenseCam [1]        93.5         24                No      No
Autographer [24]    59           24                No      No
Narrative Clip [5]  19           30                No      No

Table 1: Comparison of DeepEye with recent and/or popular wearables from the literature and available commercially

Figure 1: DeepEye prototype. The device sits comfortably in the front pocket of the user's shirt.

2. DeepEye OVERVIEW
In this section we begin by systematically describing DeepEye and summarizing the capabilities of this wearable. DeepEye extends the current capabilities of wearables by allowing simultaneous execution of multiple large-scale deep vision models locally in an efficient manner. The main use-cases of running multiple vision models include: (i) logging of everyday life, while capturing various aspects of the environment, (ii) providing navigation and social support for partially blind people in real-world situations, and (iii) enabling privacy-aware inferencing on personal devices by running all inferences locally, completely avoiding cloud-based services. Moreover, as new sensors become increasingly available on mobile platforms, application developers are adding new features that require a number of different inference tasks to be executed simultaneously.

2.1 Existing Camera Wearables
In Table 1, we summarize various camera-based wearable devices proposed in the literature as well as those commercially available in the market, along with their form factors and capabilities. A common observation is that most camera-capture wearables, such as the Narrative Clip or Autographer, only capture and store images without providing any kind of context-inferencing support. Further, devices such as Gabriel [2], which provide inference on the collected images, do so by offloading computations to a cloud backend. Contrary to existing wearables, DeepEye makes a significant leap forward by addressing the technical challenges in simultaneously executing a number of very large vision models locally, which has not been attempted previously. Figure 1 illustrates a fully functional prototype of DeepEye.

Figure 2: Software architecture of DeepEye

2.2 Design Considerations
The primary design goal for DeepEye is to support the execution of multiple deep vision models locally on the device, which can enable a new class of applications for camera-based wearables. In this vein, the following three key design considerations have shaped the design of DeepEye:
• Multiple Model Execution: Study how a system will behave, and what needs to be optimized, when a number of models need to be executed locally and periodically on the device. This means no single model can reside in memory constantly. This is also similar to the situation where a single large deep model is too big to reside completely in memory; this can become an increasingly critical issue given the direction of deep models towards being hundreds and even thousands of layers deep – much larger than the 10 to 20 layers more often seen today.
• Wearable Continuous Vision Device: The device should have a battery life able to last the typical waking hours of a day, within a form factor that is comfortable for the user. Ideally this means a form factor not significantly larger than existing devices today (such as the Narrative Clip), while still enabling a transformative increase in the number of vision-based deep models applicable to collected images.
• Minimal Accuracy Loss Optimizations: The device should minimize the accuracy losses associated with common optimization techniques which trade off accuracy for resource reduction. As such, the core techniques we target in our design seek to increase the utilization of the device SoC and the efficiencies available in multiple-model situations.

2.3 Architecture and Dataflow
We briefly describe the operation of DeepEye to provide context for how data is processed, and thus how the optimizations discussed in the following section are applied. Figure 2 depicts the overall architecture of our system, and its main components are as follows:
• Layer Store Initialization: Before DeepEye ever runs it must be initialized based on the collection of deep models to be executed. Each layer of each model is separated, and dependencies from the model architecture are noted (a sketch of such a layer store appears after this list). DeepEye treats layers as the unit that is processed; once all layers of a model are processed, the output of the entire model is determined. At this phase, logical dependencies between models can also be configured. For example, some models may only operate on faces, and therefore should only be activated when a face is detected (by a face recognition model). Note: for the purposes of our experiments, all models execute regardless of such dependencies.
• Sampling and Pre-processing: At a configurable sampling rate, images are taken and held temporarily for each deep model to be applied. At this phase the various types of pre-processing required by each model are applied so that their input layer can be initialized when it runs.
• Inference Engine: As described in the next section, this engine manages when different CNN layers are processed and how layers are cached to reduce overheads such as the paging in and out of layers. Individual layers are executed by DeepEye with the aim of reducing the resources consumed by all models collectively. The execution time of a total collection of models defines an upper bound on the camera sampling frequency; the camera is prohibited from sampling photos at a rate faster than the system can process them locally. DeepEye offers image inferences in the form of local API calls upon which applications can then be built.
• BLAS: In order to optimize numeric operations on the hardware, DeepEye uses the optimized BLAS library for numeric computations. The inference engine passes the layers to be executed on the hardware to BLAS, which in turn distributes their computations across the multiple processors available on the device.
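To make the layer-store idea concrete, the following is a minimal sketch in Python. The names (LayerRecord, build_layer_store) and the on-disk layout are hypothetical; the paper does not prescribe DeepEye's actual data structures.

    # Hypothetical layer store: split each model into per-layer records so
    # that individual layers, not whole models, become the unit of scheduling.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class LayerRecord:
        model: str          # owning model, e.g. "FaceNet"
        name: str           # e.g. "conv1", "fc2"
        kind: str           # "conv", "pool" or "fc"
        weight_file: str    # on-disk location of the layer parameters
        depends_on: List[str] = field(default_factory=list)  # intra-model order

    def build_layer_store(model_specs):
        store = []
        for model, layers in model_specs.items():
            prev = None
            for name, kind in layers:
                store.append(LayerRecord(
                    model=model, name=name, kind=kind,
                    weight_file=f"{model}/{name}.t7",      # illustrative path
                    depends_on=[prev] if prev else []))
                prev = name
        return store

    # e.g. build_layer_store({"FaceNet": [("conv1", "conv"), ("fc1", "fc")]})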

3. MULTIPLE MODEL DEEP INFERENCE PIPELINE

In this section we describe the inference pipeline of DeepEye. We begin by providing a primer on the Convolutional Neural Network (CNN), a popular deep learning model architecture for computer vision. Next, we describe the major bottlenecks in running one or multiple CNN models on resource-constrained devices. To address these bottlenecks, we propose a new execution pipeline which includes a number of optimization strategies and a novel scheduling technique.

3.1 CNNs on Embedded Devices
Here we focus on the challenges associated with running CNN models on resource-constrained embedded devices. We begin by providing a brief overview of how a CNN model generates an inference from raw data, and then explain the challenges in executing the various layers of a CNN on embedded devices such as DeepEye.

Convolutional Neural Network Primer. CNNs are an alternative to DNNs (Deep Neural Networks) that share many architectural similarities. As shown in Figure 3, a CNN is composed of one or more convolutional layers, pooling or sub-sampling layers, and fully connected layers (this final type being equivalent to those used in DNNs). The aim of these layers is to extract simple representations at high resolution from the input data, and then convert these into more complex representations at much coarser resolutions within subsequent layers. This is achieved by first applying convolutional filters (with small kernel width) to capture local data properties. Next follow max or min pooling layers, which cause representations to be invariant to translations and also act as a form of dimensionality reduction. Finally, fully connected layers (i.e., a DNN) are applied to complete the classification. Inference under a CNN operates on a single segment of data (i.e., an image) at a time. Data is provided to the convolutional layers at the head of the architecture; this can be considered a form of feature extraction before the fully connected layers are engaged. Inference then proceeds exactly as in a DNN until ultimately a classification is reached.

Figure 3: A CNN mixes convolutional and feed-forward layers.
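For illustration, the following minimal sketch expresses this conv-pool-FC structure in modern PyTorch (not the Lua-based Torch used by DeepEye); the layer sizes are arbitrary:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # Convolutional layers: compute-heavy, relatively few parameters.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),            # pooling: translation invariance
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2))
            # Fully connected layers: memory-heavy, most of the parameters.
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 56 * 56, 256), nn.ReLU(),
                nn.Linear(256, num_classes))

        def forward(self, x):               # x: (batch, 3, 224, 224)
            return self.classifier(self.features(x))

    logits = TinyCNN()(torch.randn(1, 3, 224, 224))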

CNN bottlenecks on resource-constrained devices. As has been reported in prior literature [9], the memory overhead of executing a deep model is dominated by loading the weight parameters of the fully-connected (FC) layers. In comparison, the convolutional kernels are small in size and take significantly less time to load. On the other hand, the computational overhead of deep model execution is dominated by the convolution operations, which are an order of magnitude more costly than the matrix multiplication operations in the FC layers. In the context of wearable and embedded devices, slow disk reads and memory bus speeds can be a major bottleneck in loading FC layers into memory, and increase the latency of model execution. We study this bottleneck through an experiment on eight popular vision-based CNN models, where we run the models on an embedded platform (Qualcomm Snapdragon 410c) and measure the load and execution times of all the individual layers. In Figure 4 we summarize the results from this experiment and show the breakdown of the average time (over 100 executions per model) taken to load the fully-connected layers and execute the convolutional layers. The experiment reveals that on this embedded platform, the loading of fully-connected layers is the most expensive operation, exceeding even the loading and execution of the convolutional layers by a factor of 1.31 (ObjectNet) to 2.15 (MemNet). As such, if we can explore opportunities to reduce the loading time of FC layers, it will speed up the entire inference pipeline. In §3.2, we propose employing layer caching and layer compression techniques on the FC layers to achieve latency gains in the execution of deep models on constrained devices.
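A sketch of the measurement methodology, under the assumption that each layer's parameters live in their own file (the file layout and helper are illustrative, not DeepEye's actual code):

    import time
    import torch

    def profile_layer(layer, weight_file, x):
        t0 = time.monotonic()
        layer.load_state_dict(torch.load(weight_file))  # page parameters in
        t_load = time.monotonic() - t0
        t0 = time.monotonic()
        with torch.no_grad():
            y = layer(x)                                # execute the layer
        t_exec = time.monotonic() - t0
        return t_load, t_exec, y

    # Summing t_load over FC layers and t_exec over convolutional layers
    # reproduces the per-model comparison shown in Figure 4.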

Bottlenecks in running multiple CNNs on embedded devices. As the demand for simultaneously running a number of deep models increases, we quickly run into a situation where the available memory on an embedded platform becomes insufficient to hold all the deep models. In this situation, a common strategy is to run the deep models sequentially, i.e., loading one model at a time, running inference, and clearing the model's memory before moving on to the next model. However, this repeated paging-in and paging-out of large models adds significant overhead to the inference process, especially in the case of continuous inference systems such as DeepEye.

Figure 4: Understanding the bottleneck in deep vision model execution on embedded platforms. Loading model parameters is the single most time consuming task, which was seen across eight different models popular within the computer vision community. (Ratio of FC-layer load time to convolutional load-and-execute time per model: ObjectNet 1.31x, GenderNet 1.54x, AgeNet 1.73x, SalientNet 1.82x, FaceNet 1.82x, SceneNet 1.98x, EmotionNet 2.05x, MemNet 2.15x.)

Moreover, there can be cases where a deep model is so large that it cannot fit into the available memory on an embedded platform. In these scenarios, a common approach is to break the deep model into its constituent layers, which are then paged into memory and executed in series. Even then, the bottleneck associated with loading model parameters into memory still exists, and in fact is amplified by the presence of multiple deep models. In §3.2, we present a new technique for scheduling the execution of deep model layers which provides latency gains over series execution. Our approach exploits a key property of CNN models – that the execution of layers follows a strict order (i.e., FC layers are executed on the output of the convolution layers) – and aims to interleave the execution of convolution layers with the loading of FC layers.

3.2 DeepEye Inference Engine
In this section, we present an overview of the inference pipeline used by DeepEye for executing multiple CNN models on resource-constrained devices. The key innovation in the pipeline is a novel approach of interleaving the loading of fully-connected layers with the execution of convolutional layers. We also discuss how our inference pipeline incorporates layer caching, model compression, and image hashing (detailed in § 3.3) to provide additional latency gains over the baseline approaches. Note that the pipeline is independent of the underlying processor, and can be applied to any available processor (e.g., CPU, DSP) on the system.

Figure 5: Overview of the DeepEye Inference Engine

Figure 5 illustrates the various components of the deep inference engine. The inference engine is initialized offline by segregating all model layers into computation-heavy and memory-heavy layers, i.e., convolutional and fully-connected layers respectively. Next, the engine allows various kinds of model compression techniques to be applied on the pool of layers. Currently, DeepEye supports the SVD-based layer compression technique proposed in the DeepX toolkit [25]; more techniques could be added in the future. In the online phase, while computing the inference, the engine reads an input image from the camera, calculates a perceptual hash vector [26] (detailed in the subsequent sections) for the image, and compares it with the hash values of the last five images. If the hash similarity is above a certain threshold (i.e., the images are similar), the inference pipeline is terminated and the inference output corresponding to the nearest hash vector is returned. If the image hash does not match the previous five images, the inference pipeline continues by applying various pre-processing steps on the image to generate the necessary input expected by the individual models in the pipeline. Once the preprocessing of the image is completed, the results are immediately fed to the input layer (convolutional layers) of all the models. Thereafter, the caching and interleaved execution of convolution and FC layers take place (detailed in § 3.3), which finally results in an inference output for each model. The inference results are then written to disk or passed on to any user-facing apps through API calls.
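The online phase can be summarized with the following control-flow sketch; every helper name here (similar_recent, preprocess_for, run_interleaved, remember) is a hypothetical placeholder for a component detailed in § 3.3, not an actual DeepEye API:

    def on_capture(image, models, recent):
        cached = similar_recent(image, recent)   # perceptual-hash check (§3.3)
        if cached is not None:
            return cached                        # pre-empt: reuse old predictions
        inputs = {m: preprocess_for(m, image) for m in models}  # crop/resize/etc.
        preds = run_interleaved(models, inputs)  # cached + interleaved layers
        remember(recent, image, preds)           # keep hashes of last five images
        return preds                             # write to disk / expose via API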

3.3 Optimization Techniques
In the following subsections, we discuss each of the optimization techniques employed by the DeepEye engine.

Runtime Interleaving: The core runtime optimization in DeepEye's execution pipeline is based on the idea of processing different layers from a pool of deep vision models in parallel to reduce execution latency. As discussed above, the loading of FC layers and the execution of convolution layers are the two most expensive operations in running a deep model on embedded devices. Interestingly, due to the feed-forward inference in deep models, the memory-heavy FC layers are only executed after the input is fully processed by the convolution layers – as such, the loading of FC layers can potentially be interleaved with the execution of convolution layers. Naturally, the case of multiple model inferences allows for greater opportunities to parallelize the aforementioned execution and loading processes across multiple models.

Figure 6: Runtime interleaving of convolutional and FC layers. Convolution layers execute asynchronously across models in one thread while FC layers load asynchronously across models in another, with event-based synchronization between the two.

Figure 6 depicts an overview of runtime interleaving for an example case of five deep models. As indicated before, we first segregate all model layers into two pools, namely convolution layers and fully-connected layers. We next spawn two threads, a convolution-execution thread and a data-loading thread. In the convolution-execution thread, the convolution filter parameters of all models are loaded into memory and the convolution operations begin on the pre-processed input data for each model. We take advantage of the underlying BLAS numeric library used in our implementation (detailed in § 4.2) to perform efficient multi-threaded numerical operations (e.g., convolutions) – as such, we do not spawn separate convolution-execution threads for each model. Instead we adopt a FIFO queue based execution strategy for the convolutional layers across models. For example, as shown in Figure 6, three layers of convolution operations for the red model are carried out by this thread, followed by two layers of the yellow model and three layers of the grey model. The data-loading thread, which is spawned in parallel with the convolution thread, is responsible for loading the FC layer parameters of all models into memory, again in a pipelined manner (i.e., one model after the other).

The objective of the convolution-execution thread is to perform all convolutions on the input image, and pass the results of the final convolution layer of each model to the data-loading thread. When the data-loading thread finishes loading the FC layer parameters for a model, it can use the pre-computed convolution outputs from the convolution-execution thread and proceed to obtain the final classification results.
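The following is a runnable sketch of this two-thread scheme, under simplifying assumptions: each model object exposes hypothetical run_convolutions(), load_fc_weights() and run_fc() steps, and a per-model queue provides the event-based synchronization shown in Figure 6:

    import threading
    import queue

    def interleaved_inference(models, inputs):
        conv_out = {m.name: queue.Queue(maxsize=1) for m in models}
        results = {}

        def convolution_execution():
            # FIFO across models: one executor thread suffices because the
            # underlying BLAS already multi-threads each convolution.
            for m in models:
                conv_out[m.name].put(m.run_convolutions(inputs[m.name]))

        def data_loading():
            # Load the memory-heavy FC parameters while convolutions run,
            # finishing each model as soon as its conv output is available.
            for m in models:
                fc_params = m.load_fc_weights()    # disk -> memory (bottleneck)
                features = conv_out[m.name].get()  # event-based synchronization
                results[m.name] = m.run_fc(fc_params, features)

        t1 = threading.Thread(target=convolution_execution)
        t2 = threading.Thread(target=data_loading)
        t1.start(); t2.start()
        t1.join(); t2.join()
        return results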

The latency gain G from runtime interleaving can be computed as:

    G = (t_conv + t_fc_load) / max{t_conv, t_fc_load}    (1)

where t_conv represents the time taken to load the convolution filters and execute the convolution operations for all models, and t_fc_load represents the time taken to load all FC layer parameters into memory. The numerator in Equation 1 corresponds to the baseline case where convolution execution and FC-layer loading are done in series, while the denominator corresponds to the interleaved execution, where the total time is the maximum of the two threads' execution times. Note that the eventual execution of the FC layers on the output generated from the convolutional layers is common to both the series and interleaved approaches – as such, it is not accounted for while calculating the latency gain (G) due to interleaving.

The denominator of Equation 1 can be simplified using the following identity, which expresses the maximum of two numbers in terms of their sum and difference:

    max{a, b} = (1/2)(a + b + |a − b|)    (2)

By substituting (2) into (1), we get:

    G = 2 / (1 + λ)    (3)

where λ = |t_fc_load − t_conv| / (t_fc_load + t_conv) is the interleaving factor and captures the difference between t_conv and t_fc_load. As λ increases, the interleaving gain G decreases; when λ = 0 (i.e., the time taken to load the FC layers equals the time taken to run the convolutions), the gain G is maximized at 2x. In § 5, we experiment with different values of λ to showcase interleaving gains in multiple scenarios.
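As a worked example (the timings below are made up for illustration, not measured on DeepEye):

    # Hypothetical timings: t_conv = 4 s, t_fc_load = 6 s.
    t_conv, t_fc_load = 4.0, 6.0
    series      = t_conv + t_fc_load                       # 10.0 s
    interleaved = max(t_conv, t_fc_load)                   #  6.0 s
    lam = abs(t_fc_load - t_conv) / (t_fc_load + t_conv)   #  0.2
    G = 2 / (1 + lam)                                      #  1.67x == series / interleaved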

Caching of Memory-Heavy Layers: For continuous-sensing applications and devices such as DeepEye, caching layer parameters in memory can be potentially useful, as it saves the significant time required to load thousands of parameters into memory for each inference. As loading the parameters of the fully connected layers is one of the main time-consuming operations in the inference process, the ideal strategy would be to keep all parameters of these FC layers in memory. However, the limited runtime memory of embedded devices becomes a bottleneck to loading all the layer parameters, more so in the case of executing multiple deep models.

DeepEye adopts a greedy approach for caching the parameters of the FC layers to maximize memory utilization. It monitors the available runtime memory and caches the k largest FC layers from the model pool that can fit into memory. In case the available memory cannot fit all k layers, it caches the k − 1 largest layers, and again greedily repeats the process to cache the next-largest layers into the remaining memory. As layer caching effectively reduces the total load time for the FC layers (t_fc_load), it is likely to reduce the value of λ and increase the latency gains from interleaving.
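A sketch of this greedy policy (the layer sizes and memory budget in the example are illustrative):

    def choose_cached_layers(fc_layers, budget_bytes):
        """fc_layers: list of (name, size_bytes). Greedily pin the largest
        FC layers that still fit into the remaining memory budget."""
        cached, remaining = [], budget_bytes
        for name, size in sorted(fc_layers, key=lambda l: l[1], reverse=True):
            if size <= remaining:        # largest first, skip what doesn't fit
                cached.append(name)
                remaining -= size
        return cached

    # e.g. choose_cached_layers([("fc6", 400e6), ("fc7", 70e6), ("fc8", 16e6)], 100e6)
    # pins fc7 and fc8, leaving fc6 to be paged in on every inference.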

Model Compression: In addition to layer caching, another potential approach to reduce the load times of the memory-heavy FC layers is to employ model compression techniques. The goal of model compression is to reduce the size and execution latency of a single deep model, at the expense of some accuracy loss. In our work, we use the SVD-based layer factorization approach proposed in the DeepX toolkit [25, 27] to compress the FC layers of all deep models. The SVD factorization technique takes the weight matrix containing the matrix multiplication parameters for two adjacent FC layers, and factorizes it into two matrices which together contain fewer parameters than the original matrix. As a result of this factorization, the total number of parameters in the FC layers is reduced, which in turn lowers the load time of the FC layers (t_fc_load). This also reduces the value of λ (as the difference between t_fc_load and t_conv shrinks), and hence increases the latency gains (G) from interleaving. However, it is interesting to note that if too much compression (C) is applied to the FC layers such that t_conv > C · t_fc_load, i.e., if the convolution execution time starts dominating the FC loading operation, then the gains due to compression will no longer be observable.
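A simplified numpy sketch of such SVD-based factorization (an illustration in the spirit of DeepX, not the toolkit's actual implementation):

    import numpy as np

    def svd_compress(W, k):
        # Replace an m x n weight matrix by two layers of total size k(m + n).
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W1 = U[:, :k] * s[:k]   # m x k   (first, smaller FC layer)
        W2 = Vt[:k, :]          # k x n   (second, smaller FC layer)
        return W1, W2           # W is approximated by W1 @ W2

    W = np.random.randn(4096, 4096)
    W1, W2 = svd_compress(W, k=410)   # ~80% fewer parameters to load:
    # 4096*4096 = 16.8M  vs  410*(4096+4096) = 3.4M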

Pre-empt using Hashing: In continuous image capture systems such as DeepEye, it is likely that consecutively captured images are similar in their content – this situation often arises in everyday life logging, e.g., when the wearer is stationary or the capture device is kept on a table. If the system can identify such similar images, it need not run the entire inference pipeline on them, and a significant amount of computation and energy can be saved over time. Although stationary conditions can be adequately detected by inertial sensors such as accelerometers, in this work we rely on an image-based hashing technique to detect image similarity. More specifically, we apply a perceptual hashing algorithm [26] to the input images and store hash values of the last five images along with their predictions. Perceptual hashing functions are widely used to establish the perceptual similarity of multimedia content – this is done by extracting a series of features from the content (i.e., the raw image) and applying a hash function to those features. DeepEye uses the Radial Variance Based Hash function as implemented in [26]. When a new image is captured, we first compute its hash and then calculate its Hamming distance to all five previously stored hashes. If the Hamming distance is found to be smaller than a predefined threshold, we pre-empt execution of all deep models and return the classes stored for the image with the smallest Hamming distance. In our implementation we use a distance threshold of 26, as suggested by the developers of the hashing algorithm.
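A sketch of the pre-emption check; radial_hash below is a hypothetical stand-in for a binding to the radial variance hash of [26]:

    def hamming(h1: bytes, h2: bytes) -> int:
        # Bit-level Hamming distance between two equal-length hash digests.
        return sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))

    def should_preempt(image, recent, threshold=26):
        """recent: list of (hash, predictions) for the last five images.
        Returns the cached predictions of a near-duplicate image, else None."""
        if not recent:
            return None
        h = radial_hash(image)      # hypothetical perceptual-hash binding
        best_hash, best_preds = min(recent, key=lambda r: hamming(h, r[0]))
        return best_preds if hamming(h, best_hash) < threshold else None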

4. EMBEDDED IMPLEMENTATION
In this section we describe the hardware implementation of DeepEye with a quad-core Qualcomm Snapdragon 410 processor. We also describe our current software implementation designed to run multiple deep models on DeepEye using the Torch framework.

Figure 7: DeepEye Hardware Prototype

4.1 Hardware Prototype
DeepEye is powered by a quad-core Qualcomm Snapdragon 410 processor on a custom integrated carrier board, and includes a 2400 mAh LiPo battery. We intend to open-source the design of the DeepEye hardware soon to help other researchers build similar wearable devices.

Snapdragon 410 Series APQ8016. The DeepEye wearable camera consists of a System on Chip (SoC) mounted on a carrier board, as shown in Figure 8. The SoC has a quad-core ARM processor (Snapdragon 410 Series APQ8016 from Qualcomm), a Qualcomm Adreno 306 GPU, 8GB of flash storage and 700MB of usable RAM. The carrier board allows the SoC to be plugged in and provides ports for a Camera Serial Interface (CSI) camera, a UART for programming and debugging, and I2C and SPI GPIOs for connecting sensors.

Figure 8: DeepEye Qualcomm SoC and Custom Integrated Carrier Board

Custom Integrated Carrier Board. The carrier board is powered by a 3.7V LiPo battery. The SoC provides power management and recharging circuitry while the carrier board provides a micro-USB connector for charging. The DeepEye wearable runs a custom build of Linaro Debian Linux based on the 4.3 Linux kernel. The custom build adds driver support for the high-speed CSI camera and removes unused devices (HDMI bridge, DSI panel, USB hub) via a custom Linux kernel and device tree. The camera module uses a 5-megapixel Omnivision OV5640 camera sensor.

The DeepEye wearable has a small form factor (44 x 68 x 12 mm), and includes a 2400 mAh LiPo battery. The device weighs 24 gm without the battery and 60 gm with the battery attached, thus making it practical for everyday use.

4.2 Software Implementation
Our current software implementation for DeepEye uses an optimized version of the Torch deep learning framework for performing inference on the embedded device. The motivations for using Torch were the availability of state-of-the-art vision models for this framework, and its excellent portability to embedded platforms. By removing non-essential components from the Torch framework, we were able to keep these benefits while also reducing the overhead of the framework relative to a custom C++ runtime. We now discuss the major components of our implementation:


BLAS Runtime. When executing deep model inference, the main computational load lies in performing matrix multiplications [28, 25, 27]. In our current implementation of DeepEye we use the highly optimized BLAS library [29] for running the convolutions and the matrix multiplications needed in the fully-connected layers. In the current implementation we use Torch as our deep learning inference engine, which supports multi-threading for deep model execution. The Torch framework loads the model layers into memory and sends them to the BLAS library for computation. Thereafter, BLAS distributes the actual computational tasks across the CPU or GPU cores as specified by the model. Note that DeepEye is agnostic to the underlying deep learning framework and can be easily integrated with other engines supporting multi-threaded execution, e.g., TensorFlow and Theano.
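To see why a fast GEMM routine dominates end-to-end performance, note that a convolution can itself be lowered to one large matrix multiplication via the well-known im2col transformation, so both convolutional and FC layers ultimately reduce to BLAS GEMM calls. A minimal numpy sketch (illustrative; not DeepEye's actual code path):

    import numpy as np

    def conv2d_gemm(x, w):
        """x: (C, H, W) input; w: (F, C, k, k) filters; valid padding, stride 1."""
        C, H, W = x.shape
        F, _, k, _ = w.shape
        Ho, Wo = H - k + 1, W - k + 1
        # im2col: unfold every k x k patch into one column.
        cols = np.empty((C * k * k, Ho * Wo))
        for i in range(Ho):
            for j in range(Wo):
                cols[:, i * Wo + j] = x[:, i:i + k, j:j + k].ravel()
        out = w.reshape(F, -1) @ cols      # the GEMM that BLAS accelerates
        return out.reshape(F, Ho, Wo)

    y = conv2d_gemm(np.random.randn(3, 32, 32), np.random.randn(8, 3, 3, 3))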

Control Subsystem. We implement the data-flow runtime described in §2 in the following manner. The DeepEye camera captures an image every 30 seconds and saves it to disk. This image is then loaded into the Torch framework and a number of pre-processing operations are performed on it. Although these operations are specific to each deep model, they broadly include cropping, resizing and mean normalization of the input image.

Next, the deep vision models are read from storage and loaded into shared memory in a layer-wise manner by Torch; that is, each layer is paged into memory, where it operates on its input data and generates the input for the next layer. Thereafter, the layer is paged out of memory and the next layer is paged in. As described in §3, these load/execute steps are done in series in the baseline case, and with interleaving (and other optimizations) applied in our proposed approach.

DeepEye is capable of using both the CPU and the GPU on the Snapdragon 410c for running convolution operations. The Snapdragon's on-board GPU, however, lacks OpenCL support; therefore we implemented an OpenGL-based Convolution Engine (described in the next subsection) which can run convolution operations using OpenGL shaders. When the system has to compute a convolution (the dominant operation in the initial layers of a CNN), it can execute the operation on the quad-core CPU (using Torch) or on the GPU (using the OpenGL engine).

OpenGL Convolution Acceleration. OpenGL is a 3-D drawing specification which allows applications to accelerate graphics using an accelerated rendering pipeline implemented on the GPU. Recently a subset of OpenGL, named OpenGL ES, has made it possible for other computational algorithms to execute on the GPU, if they can map their data into graphical objects such as bitmaps, textures and vertex buffers. In the case of deep learning models, our key idea is to represent the input to a convolution layer as a bitmap object, then apply an OpenGL shader to it to convert it into another bitmap, which is then fed to the subsequent layers.

5. EVALUATION
In this section, we first benchmark the performance of the various optimization techniques described in §3, viz. layer interleaving, layer caching, and model compression, against the baseline approach of executing deep models in a layer-wise fashion. Next, we present two case studies inspired by real-world applications of DeepEye and highlight the latency and energy gains achieved by our proposed inference pipeline. Note that we take a CPU-centric approach in our experiments – i.e., all experiments were conducted on the on-board quad-core CPU of DeepEye. Later, we also discuss using the on-board GPU for running the inference pipeline and highlight why it does not provide any latency gains. Our main findings can be summarized as follows:

• DeepEye can last for more than 33 hours on a single battery charge, assuming an image is captured and analyzed by the deep learning models every minute.

• Our proposed execution approach of layer interleaving and layer caching results in nearly 2x gains in runtime over the baseline method of series execution.

• The interleaved execution technique can work in tandem with model compression, and serve as a way to offset the accuracy losses due to compression while providing comparable latency gains.

5.1 Methodology
We used the Qualcomm Dragonboard 410c, the development board for the Snapdragon 410 processor, to perform the experiments evaluating the performance of DeepEye and its optimization techniques. For energy measurements, we instrumented the power lead of the Dragonboard 410c to connect an ammeter to it. We also disabled those peripherals (e.g., HDMI port, SD card, WiFi chip) on the Dragonboard which are not present on DeepEye, to get accurate energy measurements. As a representative workload for our experiments, we used eight state-of-the-art pre-trained deep learning models trained for a variety of computer vision tasks. In the next section, we present more details about the deep models used for evaluation.

Deep Models: In total, we used eight CNN models of varying sizes and computational complexities that are representative of common vision tasks. All these models are pre-trained and available as open-source models from the Torch or Caffe framework websites. A summary of these models is also presented in Table 2.

FaceNet [30]: Face detection is a popular task for any image capturing system, and is a necessary step for algorithms which aim to detect smiles or emotions in an image. Here, we use a face detection CNN model that can detect faces in a wide range of orientations.

AgeNet and GenderNet [31]: The AgeNet and GenderNet models aim to predict the age (8 categories) and gender (2 categories) of a person from an input image. The models have been trained using around 20K images.

EmotionNet [32]: This model aims to recognize the emotion on a person's face by analyzing the input image. The input to this model is a cropped RGB image of the face, and it outputs confidence scores for 8 emotion categories, including angry, disgust, happy, sad, surprise, fear and neutral. The authors of this model also offer additional emotion recognition models, all of which when used together offer state-of-the-art performance – however, for the sake of experimentation, we only use the model that operates on RGB values.

SalientNet [33]: This model aims to compute the saliency of an input image by predicting the existence and the number of salient objects in a given scene. Models such as SalientNet could be particularly useful to filter out non-salient images from lifelog outputs.

ObjectNet [34]: Detecting the various objects present in an image is of utmost importance to both of our representative use-cases. For this, we consider a popular CNN model named VGG which supports more than 1000 object classes (e.g., dog, car).

MemNet [35]: The aim of this model is to determine how memorable an image is (its actual output being a "memorability score"). Experiments indicate MemNet has the ability to select memorable images with similar success to human labelers. This model could also be potentially useful in assisting lifeloggers to browse through the thousands of images captured by DeepEye.

Scenario         Model       Size    Architecture
Lifelogging      SceneNet    240MB   c:8; p:3; fc:3
Lifelogging      MemNet      233MB   c:8; p:3; fc:3
Lifelogging      SalientNet  233MB   c:8; p:3; fc:3
Vision Support   AgeNet      46MB    c:3; p:3; fc:3
Vision Support   GenderNet   46MB    c:3; p:3; fc:3
Vision Support   EmotionNet  419MB   c:5; p:3; fc:3
Both             FaceNet     230MB   c:8; p:3; fc:3
Both             ObjectNet   553MB   c:13; p:5; fc:3

(c: convolution layers; p: pooling layers; fc: fully connected layers)

Table 2: Representative deep models and scenarios used for evaluating DeepEye

SceneNet [36]: A total of 476 scene types are recognized by this model. Example scenes include bedrooms, kitchens and forests.

Scenarios. We logically group the deep models into two scenarios to illustrate practical use-cases of DeepEye, namely i) lifelogging, and ii) vision support. Lifelogging, i.e., capturing and archiving our everyday lives, has been a primary use-case for camera-based wearable devices such as the Microsoft SenseCam and Narrative Clip. For demonstrating a lifelogging scenario with DeepEye, we use models which can detect objects, places and faces in an image, along with inferring the saliency and memorability of the image. Another important application for camera wearables is to support users with visual impairment, in that the wearable can analyze the captured images and provide non-visual (e.g., audio or haptic) feedback to the user. For demonstrating this scenario, we chose models that can detect the faces or objects in an image, along with inferring the age, gender and emotion of people in the image. Both scenarios, along with the models associated with them, are summarized in Table 2.

5.2 Interleaving Analysis
Here we evaluate the impact of layer interleaving on the overall inference time of deep models. We first show the gains achieved in inference time by using layer interleaving in both application scenarios mentioned earlier. Then we empirically validate the sensitivity of the interleaving gains, as captured in Equation 3, by varying the value of the interleaving factor λ.

Setup. To evaluate the benefits of interleaving, we run the five deep models in each scenario in both a series (baseline) and an interleaved manner. No other optimization (e.g., caching, compression) is applied to the models. Next, in order to explore the sensitivity of interleaving as per Equation 3, we constructed model pipelines of various sizes in both scenarios. For example, in a scenario with 5 models, there are 5C2 possible pipelines of 2 models, 5C3 pipelines of 3 models, and so on. In total, we generated 52 different pipelines across both scenarios, with each pipeline having a different length and combination of models. We executed each pipeline in the baseline Series condition and calculated the value of the interleaving factor λ for it. Then we ran each pipeline in an interleaved way and calculated the inference time gain over the baseline condition. Below we present our findings for both experiments.
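A sketch of this enumeration for one scenario (using the five Lifelogging-applicable models from Table 2):

    # Pipelines of length >= 2 drawn from one 5-model scenario.
    from itertools import combinations

    scenario = ["SceneNet", "MemNet", "SalientNet", "FaceNet", "ObjectNet"]
    pipelines = [p for r in range(2, 6) for p in combinations(scenario, r)]
    assert len(pipelines) == 26   # C(5,2)+C(5,3)+C(5,4)+C(5,5) = 10+10+5+1
    # The two scenarios give 2 * 26 = 52 pipelines in total.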

Results. Figure 9 illustrates the reduction in inference time in the interleaved case for both scenarios. We clearly observe a reduction in inference time in both scenarios (1.29x in the Lifelogging scenario and 1.40x in the Vision Support scenario), although the magnitude of the reduction differs.

Figure 9: Latency comparison between series and interleaved execution. The values on the bars show the latency gains due to interleaved execution: (a) Lifelogging, 1.29x; (b) Vision Support, 1.40x.

Figure 10: Latency gain under different values of λ

To explore this further, in Figure 10 we plot the latency gain (due to interleaving) against 10 different values of λ obtained from executing various combinations of model pipelines. We can observe that as the value of λ increases, the latency gains due to interleaving decrease. As λ is proportional to the difference between the execution time of the convolution layers and the load time of the FC layers, this implies that the interleaving gains are maximized when both of these times are of the same order. For example, the Lifelogging scenario has a higher value of λ than the Vision Support scenario (due to the choice of models in them), and therefore exhibits lower latency gains.

To sum up, we found that interleaved execution provides a way to maximize resource usage on the device, and delivers latency gains without any loss in inference accuracy.

5.3 Interleaving Meets Model Compression
Our findings from the interleaving analysis show that although interleaving outperforms the baseline series execution, its latency gain has an upper bound of 2x, reached when t_fc_load equals t_conv (Equation 3). However, this is difficult to achieve in practice, as the loading times of FC layers are higher than the execution times of the convolution layers, as shown in the model profiles in Figure 4. In this section, we investigate whether model compression techniques proposed in the literature can reduce the load times of the FC layers, thereby increasing the latency gains of the interleaved execution.

Setup. For studying model compression, we chose four deep models from our model pool, namely SceneNet, MemNet, SalientNet, and FaceNet. To each model, we apply four levels of compression using the SVD-based weight factorization scheme proposed in prior literature such as the DeepX toolkit [25, 27, 37]. The SVD-based compression approach reduces the total parameters (and operations) in a fully-connected layer by factorizing it into two smaller layers, with some loss in overall inference accuracy. Our experiments apply four levels of compression (10%, 20%, 70%, 80%) to all FC layers of the deep models, and we report the latency gains in each compression setting. The motivation for applying high compression (70%, 80%) is to empirically show that the compression gains peak at a certain point, after which compression does not provide any additive gains to the interleaved execution.

Results. In Figure 11, we compare the performance of the interleaved execution approach against the series baseline and model compression for our selected 4-model pipeline.

Figure 11: Performance of layer interleaving in combination with model compression approaches. The values on the bars show the latency gain in each case over the baseline Series execution approach. (At 10%/20%/70%/80% compression: Interleaved alone 1.3x throughout; Compression alone 1.1x/1.2x/1.9x/2.4x; Interleaved+Compression 1.4x/1.6x/2.1x/2.1x.)

Three key insights emerge from Figure 11. First, as the compression ratio increases, the latency gain also increases for both the compression-only and the compression+interleaving approach, with the latter outperforming all other techniques; for example, in the 20% compression scenario, applying interleaving with compression provides a latency gain of 1.6x, as opposed to 1.2x with compression alone. Second, when the amount of compression applied to the model is too high (e.g., 80%), the interleaved execution no longer benefits from it, because the convolution execution starts dominating the overall runtime in the interleaved case. Finally, interleaving can be used as a tool to offset the accuracy loss caused by model compression. For example, we found that with 30% compression applied to the models, the pipeline runs 1.4x faster in the baseline case (not represented in the figure). However, the same 1.4x speedup can be achieved by applying only 10% compression and interleaving the execution of layers, as shown in Figure 11. As such, interleaving provides a way to obtain the same latency gains with less compression and hence smaller accuracy losses.

5.4 Caching Analysis

In this section, we explore the benefits of caching memory-heavy layers on the overall runtime of the deep model pipeline. For continuous sensing wearables, caching memory-heavy layer parameters in memory is potentially useful, as it saves the significant time required to load thousands of parameters into memory for each inference.

Setup. For this experiment, four deep models from Table 2, namely SceneNet, MemNet, SalientNet, and FaceNet, are chosen as the workload. Although DeepEye has roughly 570 MB of available runtime memory, other low-end wearable devices often have much less. As such, for a detailed evaluation of the caching approach, we simulate three more cases with runtime memory of 150MB, 300MB, and 400MB. In each case, we adopt a greedy approach to cache the n largest FC layers from the model pool into memory: if the available memory cannot fit all n layers, we cache the n−1 largest layers, and greedily repeat the process to fit the next-largest layers into the remaining memory (a sketch of this policy is shown below). In the following, we compare the latency gains from layer caching in the series and interleaved execution approaches.
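One way to realize this greedy policy is sketched below; the layer names and sizes are hypothetical and for illustration only.

```python
def plan_fc_cache(fc_layers, budget_bytes):
    """Greedily pin the largest FC layers that fit in the memory budget."""
    cached, remaining = [], budget_bytes
    for name, size in sorted(fc_layers, key=lambda x: x[1], reverse=True):
        if size <= remaining:      # skip layers that no longer fit,
            cached.append(name)    # but keep trying smaller ones
            remaining -= size
    return cached

# Hypothetical FC-layer sizes in bytes (MB * 2**20), illustration only.
layers = [("FaceNet.fc6", 150 * 2**20), ("SceneNet.fc6", 120 * 2**20),
          ("MemNet.fc6", 110 * 2**20), ("SalientNet.fc6", 90 * 2**20)]
print(plan_fc_cache(layers, budget_bytes=300 * 2**20))
# -> ['FaceNet.fc6', 'SceneNet.fc6']  (270 MB used of the 300 MB budget)
```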

Figure 12: Performance of layer interleaving in combination with layer caching, for cache sizes of 150MB, 300MB, 400MB, and 570MB (DeepEye). The values on the bars show the latency gain in each case over the baseline Series execution approach.

Scenario        | Latency per inference | Estimated Battery Life
----------------|-----------------------|-----------------------
Lifelogging     | 10.10s (1.74x gain)   | 33 hours
Vision Support  | 8.23s (1.88x gain)    | 43 hours

Table 3: Performance of DeepEye in the two application scenarios

Results. Figure 12 shows that as the runtime memory available on the device increases, we are able to cache more FC-layer parameters in memory and hence achieve higher latency gains for both the series and interleaved caching scenarios. For example, with 570MB of available memory (as is the case on DeepEye), the caching-series execution approach gives a 1.51x latency gain, whereas caching-interleaved execution provides a 2.41x speedup over the baseline. Interestingly, the interleaved-caching approach provides a similar speedup (1.5x) with only 400MB of runtime memory, whereas the caching-series approach would need an additional 170MB of runtime memory to achieve the same speedup. As such, constrained devices with limited runtime memory can benefit from implementing the interleaved-caching mechanism for deep model execution.

The findings from the above experiments highlight the gains in inference latency obtained by applying the various optimization techniques used by DeepEye. In the next section, we present two case studies where we apply these optimizations to two real-world applications of continuous camera capture devices.

5.5 Case Studies

In this section, we evaluate the performance of DeepEye on the two scenarios described in the previous section, namely i) lifelogging and ii) vision support. Each scenario consists of five deep learning vision models as summarized in Table 2. We begin by presenting the overall performance when these scenarios are run on DeepEye, and then present detailed latency and energy results.

DeepEye Performance: We applied layer caching, compression, and interleaving to the pool of deep model layers and calculated the latency and energy consumed for performing one inference. Battery life is estimated from the per-inference energy consumption, assuming that an inference is made every 1 minute. In Table 3, we show the per-inference latency and battery life of DeepEye when it executes each of the application scenarios. We observe that DeepEye provides at least a 1.74x latency gain for both application scenarios, and the estimated battery life of the device is more than 33 hours.


Execution time: Here we provide a detailed breakdown of the inference execution time for both scenarios. For each scenario, we compare the baseline Series execution approach against the interleaved execution approach, with various optimizations (caching, compression) added to it. For scenario #1 (Lifelogging), we cached the largest fully-connected layers of FaceNet, MemNet and SalientNet in memory, while for use-case #2 (Visual Impairment Support), we stored in memory the first fully-connected layers of VGG, GenderNet and AgeNet. As for model compression, we applied 20% compression to each FC layer using the SVD-based factorization approach discussed earlier.

In Figures 13a and 13b, we plot the total time taken to run the two model pipelines on the DeepEye CPU under four different execution conditions. The results are averaged over 10 iterations. We observe that the baseline approach of loading and running the layers in series without caching has the worst runtime performance, the time taken to run the entire inference pipeline being 17.57 seconds (Scenario 1) and 15.51 seconds (Scenario 2). On the contrary, layer interleaving in combination with caching and compression outperforms all other approaches: the runtimes for the model pipelines in Scenario 1 and Scenario 2 are 10.10 seconds (~1.74x faster than baseline) and 8.23 seconds (~1.88x faster than baseline), respectively.

Figure 13: Latency comparison when executing the scenario pipelines on DeepEye, for (a) the Lifelogging and (b) the Vision Support scenario. S = Series, I = Interleaved, C-I = Caching+Interleaved, C-C-I = Compression+Caching+Interleaved.

Figure 14: Energy comparison when executing the scenario pipelines on DeepEye, for (a) the Lifelogging and (b) the Vision Support scenario. S = Series, I = Interleaved, C-I = Caching+Interleaved, C-C-I = Compression+Caching+Interleaved.

Energy Performance. We measured the average current flow during each execution of a model pipeline and used it to calculate the total energy consumption. In Figures 14a and 14b, we plot the energy consumed to run the two scenarios on DeepEye under the various experimental conditions. Our findings show that by applying caching and compression in the interleaved execution case, DeepEye requires nearly 1.43x less energy than the baseline approach. Although the energy saving for a single execution may not seem significant, these savings compound over time and can result in a few additional hours of battery life, as we discuss next.

In Figure 15, we show the effect of image capture frequency on the expected battery life of DeepEye. These calculations were done by adding the energy consumption during idle time, the energy cost of taking a picture, and the cost of one execution of the model pipeline. We observe that when an image is captured and analyzed every 30 seconds, DeepEye's battery lasts 16.85 hours for the Lifelogging scenario and 21.6 hours for the Vision Support scenario using our proposed optimization techniques. In the more reasonable case of an image being captured and analyzed every 1 minute, the battery lasts for 33 hours (~9 hours more than the baseline) for Lifelogging and 43 hours (~12 hours more than the baseline) for Vision Support.
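The battery-life calculation itself reduces to simple duty-cycle arithmetic, as sketched below; the capacity and power numbers here are hypothetical placeholders rather than measured DeepEye values, so the output is illustrative only and will not match Table 3.

```python
def battery_life_hours(capacity_joules, idle_watts,
                       capture_joules, inference_joules, period_s):
    """Estimated lifetime when one image is captured and analyzed
    every period_s seconds on top of a constant idle draw."""
    avg_watts = idle_watts + (capture_joules + inference_joules) / period_s
    return capacity_joules / avg_watts / 3600.0

# Hypothetical numbers: ~27 kJ battery (2000 mAh at 3.8 V), 50 mW idle,
# 0.5 J per capture, 2 J per inference, one inference per minute.
print(battery_life_hours(27_000, 0.05, 0.5, 2.0, period_s=60))
```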

Hashing Performance. Now we present the findings of applying the perceptual hashing technique (discussed in §3) to the dataset of 18,735 egocentric images described in §5.6. We chose a window size of 5, that is, we compute and store the hash values of the five most recent images in the workload. For each new image, we compute its hash and calculate its hamming distance to all previously stored hashes. With a distance threshold of 26 (as suggested by the developers of the hashing algorithm), we found that for roughly 18% of all images in the dataset, we could pre-empt the execution of all deep models and return the classes stored for the image with the smallest hamming distance (a sketch of this gating logic follows). This approach results in significant energy savings and increases the battery life of the device.
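A sketch of the gating logic is shown below, using the open-source imagehash package as a stand-in for the pHash implementation [26] we use; hash lengths differ across implementations, so the threshold of 26 should be treated as specific to the original library.

```python
from PIL import Image
import imagehash

WINDOW, THRESHOLD = 5, 26
recent = []  # (hash, cached_inference_results) for the last WINDOW images

def lookup(path):
    """Return cached results for a near-duplicate recent image, or None
    to signal that the full deep model pipeline must run."""
    h = imagehash.phash(Image.open(path))
    if recent:
        prev_hash, prev_results = min(recent, key=lambda e: h - e[0])
        if h - prev_hash <= THRESHOLD:  # '-' gives the hamming distance
            return prev_results
    return None

def remember(path, results):
    """Store the hash and model outputs of a freshly analyzed image."""
    recent.append((imagehash.phash(Image.open(path)), results))
    if len(recent) > WINDOW:
        recent.pop(0)
```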

Figure 15: Comparison of the estimated battery life of DeepEye against the baseline under different image capture frequencies (30s to 120s), for (a) the Lifelogging and (b) the Vision Support scenario. This does not include energy gains from perceptual hashing, which would further increase the battery life of DeepEye.

GPU execution. Finally, we evaluated the runtime of our model pipelines by running them on both the CPU and the GPU. While executing the models on the GPU, we observed a 1.6x gain in the execution of the convolution layers over the CPU-only approach. However, as the CPU and GPU share memory on the Snapdragon 410c, the loading time of fully-connected layers into memory remains the primary bottleneck. As such, even with faster convolution operations on the GPU, we do not gain much in terms of overall inference time; moreover, using the GPU consumes additional energy, which reduces the battery life of the wearable. Therefore, in the current version of DeepEye, we only employ the CPU for model computations.

5.6 Inference Accuracy

In this subsection, we discuss the accuracy of the deep learning models used in our experiments on egocentric images. Note that the deep models used in our experiments were pre-trained on a wide variety of images, not specifically on egocentric images captured from a wearable. As such, we would like to understand how well these models perform on egocentric images, and whether they will need extensive re-training in the future.

Dataset. For our evaluation, we used the egocentric image dataset from the authors of [23], which consists of a total of 18,735 images captured by 7 users wearing a Narrative Clip for 20 days. The Narrative Clip has the same camera resolution as DeepEye; therefore, its images are representative of what would be captured by DeepEye. The dataset is rich in diversity, as the users were wearing the Narrative Clip in different contexts: while attending a conference, on holiday, during the weekend, and during the week. A sample of the images in the dataset is shown in Figure 16.


Figure 16: Examples of egocentric images captured from a Narrative Clip

Figure 17: Accuracy of the pre-trained deep models (FaceNet, ObjectNet, SceneNet, SalientNet, AgeNet, GenderNet, EmotionNet) on egocentric images, compared against state-of-the-art accuracies. Numbers on the bars represent the accuracy loss relative to the state-of-the-art, which occurs because the models are not fine-tuned for egocentric images.


Ground Truth Collection. The dataset from [23] is largely unlabeled for evaluating the models used in our experiments, with the exception of face labels. For a subset of 4,912 images in the dataset, the authors have labeled whether the image contains a 'face' or not. As such, these images could be used to evaluate the accuracy of FaceNet in our experiments. To collect the ground truth for evaluating the other models, we ran two crowdsourcing tasks on Amazon Mechanical Turk.

In the first task, we asked workers to choose labels for the face images (extracted from the original dataset), viz. their age (8 age categories), gender (male/female) and emotion (7 categories). The label choices were the same as the output classes of AgeNet, GenderNet and EmotionNet. In the second task, we asked workers to label 4,000 different images from the dataset, viz. the number of salient objects in them (for evaluating SalientNet), the objects seen in the image (for evaluating ObjectNet) and the places seen in the image (for evaluating SceneNet). While the original ObjectNet and SceneNet models consist of 1000 and 205 classes respectively, we only asked the workers to choose from object and scene classes that are expected to be captured in routine life (e.g., laptop, office, home, building, cup, fruit). We did not evaluate the accuracy of MemNet, as it was difficult for crowd workers to label the memorability of an image captured by somebody else.

Results. Before presenting the results, it is worth pointing out that none of the models used in our experiments were fine-tuned for egocentric images. Egocentric images typically have poor lighting conditions and multiple objects in them, and they may even have the object or person of interest occluded by other objects, thereby making it challenging to obtain an accurate inference.

In Figure 17, we plot the accuracy of the models as observed on the egocentric image database. For ObjectNet and SceneNet, we present the top-5 accuracy, that is, whether the ground-truth category is within the top-5 predicted categories (a minimal sketch of this metric is shown below); for the other models, we present the top-1 accuracy. We observe that the face-based models, such as FaceNet, AgeNet, GenderNet, and EmotionNet, have lower accuracy drops (numbers shown on top of the bars) compared to the state-of-the-art accuracies. On the other hand, models which aim to analyze the scene (e.g., SceneNet or ObjectNet) exhibit much higher accuracy drops. Specifically for SceneNet, images captured indoors (e.g., at home in the evening) had a high number of misclassifications; we believe the poor indoor lighting conditions made it challenging for the classifier to infer the location. Overall, these results suggest that fine-tuning the pre-trained models on egocentric images is required to achieve their reported accuracies.
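For completeness, the top-k metric is straightforward to compute; the helper below is our own sketch, not tied to any specific framework.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """scores: (n_samples, n_classes) class scores; labels: (n_samples,).
    A sample counts as correct if its true label appears among the k
    highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # k best classes per sample
    return (topk == labels[:, None]).any(axis=1).mean()
```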

6. DISCUSSION

We now discuss a key set of issues related to the design and evaluation of DeepEye.

Beyond Just Vision-based Deep Models. We believe the optimization techniques described will operate successfully even in models that are not image based. Our design of DeepEye and the optimizations we discovered exploit characteristics of layers (such as fully-connected feed-forward layers in relation to other layer types) that should be suitable for other mobile platforms targeting other uses of deep learning, for example, audio understanding and emotion recognition.

Relevance of Deep Learning Specific Accelerators. Existing accelerators such as [38, 39] are not able to execute multiple deep models as DeepEye does. While the individual models we use (such as AlexNet- and VGG-based models) may execute more efficiently on such purpose-built hardware, how these accelerators would support multi-model situations remains unstudied.

Complementary to Many Existing Optimizations. Because model-centric techniques that use sparse coding [27], SVD [40], or node pruning [41] alter the layers of the deep models themselves, they are compatible with the scheduling and caching approaches we develop here. As we show in the evaluation, when the proposed methods and such model compression techniques are combined, the performance impact with respect to the accuracy trade-off can be impressive, and better than the trade-offs possible with the original technique alone.

User Trials of DeepEye. Currently, DeepEye has not yet been deployed. The focus of this work has been the development of hardware and software techniques, which at this stage are evaluated through a trace dataset. However, we will begin deploying these devices for limited periods of time moving forward.

Improving Deep Model Accuracy. In this work, we report some of the first observations of deep models running on egocentric first-person images as captured by wearable cameras like DeepEye. To our knowledge, only our work along with [15] has evaluated the accuracy of these techniques on such image workloads. We found that the inference accuracies of all models drop when they are applied to egocentric images. These accuracy numbers are expected to improve when techniques such as fine-tuning are applied to the models using training images captured by DeepEye.

Privacy Concerns. Like any wearable camera, DeepEye raises significant potential privacy concerns that must be respected. For this reason, DeepEye never retains any images at any stage; only inference results are kept. This is quite unlike most wearables that perform heavy image processing, which will most often rely on the cloud. Users also benefit from a very simple mental model of where their data is and is not: they can assure themselves that data never leaves DeepEye unless they install an application that does so.

7. RELATED WORK

DeepEye follows in a line of wearable and mobile vision systems that have been prototyped for decades; relatively recent examples include SenseCam [1], Gabriel [2] and ThirdEye [3]. A central difference from these systems is that DeepEye performs image processing using deep learning models; this is significant because it brings state-of-the-art models for various image tasks to a wearable platform, rather than either simplified learning algorithms executing locally or complex models running remotely. While a few mobile systems (such as smartphones [42, 43]) and virtually no wearable systems (i.e., those with a smaller form factor) use a single deep learning model for local processing, DeepEye integrates multiple concurrently executing models. DeepEye explores an area that is only just emerging (cloud-free deep learning for mobile vision), and moreover leaps beyond existing single-model systems to study what we believe will be an important upcoming step: how these systems can cope with (and benefit from) multiple executing deep learning models.

More importantly, DeepEye contributes significantly as an enabler of experiments to study these issues, because all hardware and software will be open-sourced. In the area of mobile vision, perhaps the central systems question has been how these systems can be made energy (and more broadly resource) efficient [12, 14]. The DeepEye wearable and its cloud-free focus allow us to further understand when and if the cloud should be incorporated, especially in multiple-deep-model scenarios. Few studies of this type currently exist on the performance of deep models on mobile hardware; those that do exist (such as [28]) have not yet examined the multi-model problem and, more importantly, have not made the kind of observations for future optimizations in this area that we have made in this paper (see earlier sections). SDKs for using mobile GPUs with deep models are starting to arrive (such as [11]). In contrast, other directions being taken start at a more algorithmic level and consider how the models themselves can be changed to be more efficient [13, 44]. Interestingly, most of the understanding of mobile and optimized deep learning currently exists in commercial settings (such as [42]), although software approaches to cloud offloading [9] and device-side acceleration [45] are arriving. The experiments we report here are some of the first to focus specifically on large deep models designed for image processing on a wearable. Examples of deep learning on wearable form factors are very rare up to this point and have not considered the CNN architectures needed for images; in comparison, work like [46] and [47] considers LSTMs and RBMs, respectively.

Finally, a key part of the future of deep learning on mobiles and wearables will undoubtedly include consideration of custom hardware. Such hardware options are evolving in research [48, 39] and in industry [38]; DeepEye contributes to this examination by exploring the limits of conventional hardware through the combination of a conventional processor and a custom supporting daughter board. Interesting ideas in co-design involving low-level ML components and sensor arrays are also emerging towards more efficient vision processing systems [10]. Because DeepEye leverages general-purpose hardware, its observations will translate more easily to other, more popular SoCs than deep learning specific hardware options would. Furthermore, rethinking and optimizing the software-based inference pipeline is just as needed as any hardware innovation because, as we highlighted here, many optimization opportunities exist within the available inference algorithms.

8. CONCLUSION

In this paper, we empirically studied two key areas pertaining to mobile vision systems moving forward. First, the raw feasibility, and overall system performance, of wearable form-factor devices in meeting the needs of cloud-free deep learning-based vision processing of periodically captured images. We studied this within the context of a powerful conventional small form-factor processor and a purpose-built daughter board, under a real-world image workload of ≈4,000 images collected from a user trial with an actual commercial wearable device. Second, we focused on a key aspect of execution support for these models, namely the challenges presented by simultaneous inference of multiple models on the same image. This will be a building-block behavior of vision systems, as most applications require multiple types of analysis (i.e., multiple models) to be performed on a single image. We highlighted the core bottlenecks of executing multiple deep models on constrained devices, and proposed methods to manage these issues in the inference process when concurrent images are processed. We believe our techniques will be of broad interest to mobile systems and vision researchers at large, and will lead to more research in this space.

9. REFERENCES

[1] S. Hodges, L. Williams, E. Berry, S. Izadi, J. Srinivasan, A. Butler, G. Smyth, N. Kapur, and K. Wood, "SenseCam: A retrospective memory aid," in UbiComp 2006: Ubiquitous Computing. Springer, 2006, pp. 177–193.

[2] K. Ha, Z. Chen, W. Hu, W. Richter, P. Pillai, and M. Satyanarayanan, "Towards wearable cognitive assistance," in Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, ser. MobiSys '14. New York, NY, USA: ACM, 2014, pp. 68–81. [Online]. Available: http://doi.acm.org/10.1145/2594368.2594383

[3] S. Rallapalli, A. Ganesan, K. Chintalapudi, V. N. Padmanabhan, and L. Qiu, "Enabling physical analytics in retail stores using smart glasses," in Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, ser. MobiCom '14. New York, NY, USA: ACM, 2014, pp. 115–126. [Online]. Available: http://doi.acm.org/10.1145/2639108.2639126

[4] "Google Glass," https://developers.google.com/glass/distribute/glass-at-work.

[5] "Narrative Clip 1," http://getnarrative.com/narrative-clip-1.

[6] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[7] T. Weyand, I. Kostrikov, and J. Philbin, "PlaNet - photo geolocation with convolutional neural networks," CoRR, vol. abs/1602.05314, 2016. [Online]. Available: http://arxiv.org/abs/1602.05314

[8] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and predicting image memorability at a large scale," in International Conference on Computer Vision (ICCV), 2015.

[9] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "MCDNN: An execution framework for deep neural networks on resource-constrained devices," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, 2016.

[10] M. Philipose, "Efficient object detection via adaptive online selection of sensor-array elements," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, ser. AAAI '14. AAAI Press, 2014, pp. 2817–2823. [Online]. Available: http://dl.acm.org/citation.cfm?id=2892753.2892942

[11] L. N. Huynh, R. K. Balan, and Y. Lee, "DeepSense: A GPU-based deep convolutional neural network framework on commodity mobile devices," in Proceedings of the 2016 Workshop on Wearable Systems and Applications, ser. WearSys '16. New York, NY, USA: ACM, 2016, pp. 25–30. [Online]. Available: http://doi.acm.org/10.1145/2935643.2935650

[12] R. LiKamWa, B. Priyantha, M. Philipose, L. Zhong, and P. Bahl, "Energy characterization and optimization of image sensing toward continuous mobile vision," in Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, ser. MobiSys '13. New York, NY, USA: ACM, 2013, pp. 69–82. [Online]. Available: http://doi.acm.org/10.1145/2462456.2464448

[13] H. Shen, S. Han, M. Philipose, and A. Krishnamurthy, "Fast video classification via adaptive cascading of deep models," arXiv preprint arXiv:1611.06453, 2016.

[14] P. Bahl, M. Philipose, and L. Zhong, "Vision: Cloud-powered sight for all: Showing the cloud what you see," in Proceedings of the Third ACM Workshop on Mobile Cloud Computing and Services. ACM, 2012, pp. 53–60.

[15] D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, and I. Essa, "Predicting daily activities from egocentric images using deep learning," in Proceedings of the 2015 ACM International Symposium on Wearable Computers, ser. ISWC '15. New York, NY, USA: ACM, 2015, pp. 75–82. [Online]. Available: http://doi.acm.org/10.1145/2802083.2808398

[16] P. Kelly, A. Doherty, E. Berry, S. Hodges, A. M. Batterham, and C. Foster, "Can we use digital life-log images to investigate active and sedentary travel behaviour? Results from a pilot study," Int J Behav Nutr Phys Act, vol. 8, no. 44, p. 44, 2011.

[17] A. Grimes, M. Bednar, J. D. Bolter, and R. E. Grinter, "EatWell: Sharing nutrition-related memories in a low-income community," in Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work. ACM, 2008, pp. 87–96.

[18] R. Sarvas and D. M. Frohlich, From Snapshots to Social Media - The Changing Picture of Domestic Photography. Springer Science & Business Media, 2011.

[19] I. Li, A. K. Dey, and J. Forlizzi, "Understanding my data, myself: Supporting self-reflection with ubicomp technologies," in Proceedings of the 13th International Conference on Ubiquitous Computing. ACM, 2011, pp. 405–414.

[20] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2847–2854.

[21] A. J. Sellen, A. Fogg, M. Aitken, S. Hodges, C. Rother, and K. Wood, "Do life-logging technologies support memory for the past? An experimental study using SenseCam," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2007, pp. 81–90.

[22] L. Deng and D. Yu, "Deep learning: Methods and applications," Tech. Rep. MSR-TR-2014-21, January 2014. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=209355

[23] M. Dimiccoli, M. Bolaños, E. Talavera, M. Aghaei, S. G. Nikolov, and P. Radeva, "SR-Clustering: Semantic regularized clustering for egocentric photo streams segmentation," arXiv preprint arXiv:1512.07143, 2015.

[24] "Autographer," http://getnarrative.com/narrative-clip-1.

[25] N. D. Lane, S. Bhattacharya, C. Forlivesi, P. Georgiev, L. Jiao, L. Qendro, and F. Kawsar, "DeepX: A software accelerator for low-power deep learning inference on mobile devices," in IPSN 2016.

[26] "Perceptual Hashing," http://www.phash.org/.

[27] S. Bhattacharya and N. D. Lane, "Sparsification and separation of deep learning layers for constrained resource inference on wearables," in ACM Conference on Embedded Networked Sensor Systems (SenSys), 2016.

[28] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, and F. Kawsar, "An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices," in Proceedings of the 2015 International Workshop on Internet of Things Towards Applications, ser. IoT-App '15. New York, NY, USA: ACM, 2015, pp. 7–12. [Online]. Available: http://doi.acm.org/10.1145/2820975.2820980

[29] "BLAS library," http://www.netlib.org/blas/.

[30] S. S. Farfade, M. J. Saberian, and L.-J. Li, "Multi-view face detection using deep convolutional neural networks," in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015, pp. 643–650.

[31] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 34–42.

[32] G. Levi and T. Hassner, "Emotion recognition in the wild via convolutional neural networks and mapped binary patterns," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 503–510.

[33] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech, "Salient object subitizing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4045–4054.

[34] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[35] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and predicting image memorability at a large scale," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2390–2398.

[36] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014, pp. 487–495.

[37] N. Lane, S. Bhattacharya, A. Mathur, C. Forlivesi, and F. Kawsar, "DXTK: Enabling resource-efficient deep learning on mobile and embedded devices with the DeepX toolkit," in Proceedings of the 8th EAI International Conference on Mobile Computing, Applications and Services, ser. MobiCASE '16. Brussels, Belgium: ICST, 2016, pp. 98–107. [Online]. Available: https://doi.org/10.4108/eai.30-11-2016.2267463

[38] "Movidius," http://www.movidius.com/.

[39] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 269–284. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541967

[40] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in INTERSPEECH, 2013, pp. 2365–2369.

[41] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.

[42] "How Google Translate squeezes deep learning onto a phone," http://googleresearch.blogspot.co.uk/2015/07/how-google-translate-squeezes-deep.html.

[43] "Happy Halloween! Baidu Research Introduces FaceYou," http://uk.reuters.com/article/idUKnMKWYbQbJa+1ea+MKW20151029.

[44] S. Venkataramani, V. Bahl, X.-S. Hua, J. Liu, J. Li, M. Phillipose, B. Priyantha, and M. Shoaib, "Sapphire: An always-on context-aware computer vision system for portable devices," in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2015, pp. 1491–1496.

[45] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar, "DeepX: A software accelerator for low-power deep learning inference on mobile devices," in Proceedings of the 15th International Conference on Information Processing in Sensor Networks, 2016.

[46] T. Beltramelli and S. Risi, "Deep-Spying: Spying using smartwatch and deep learning," CoRR, vol. abs/1512.05616, 2015. [Online]. Available: http://arxiv.org/abs/1512.05616

[47] S. Bhattacharya and N. D. Lane, "From smart to deep: Robust activity recognition on smartwatches using deep learning," in Proceedings of the Second Workshop on Sensing Systems and Applications Using Wrist Worn Smart Devices, 2016.

[48] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, "RedEye: Analog convnet image sensor architecture for continuous mobile vision," in International Symposium on Computer Architecture (ISCA), 2016.