

CIRCE: Real-Time Caching for Instance Recognition on Cloud Environments and Multi-Core Architectures

Luca Lovagnini
University of Pisa
Pisa, Italy
[email protected]

Wenxiao Zhang
Hong Kong University of Science and Technology
Hong Kong SAR, China
[email protected]

Farshid Hassani Bijarbooneh
Hong Kong University of Science and Technology
Hong Kong SAR, China
[email protected]

Pan Hui
University of Helsinki
Hong Kong University of Science and Technology
[email protected]

ABSTRACT
In the smartphone era, instance recognition (IR) applications are widely used on mobile devices. However, IR applications are computationally expensive for a single mobile device, and they are usually delegated to back end servers. Nevertheless, even by exploiting the computational power offered by cloud services, IR tasks are not executed in real-time (i.e., in less than 30ms), which is crucial for interactive mobile applications. In this work, we present caching for instance recognition on cloud environments (CIRCE), a similarity caching (SC) framework designed to improve the performance of mobile IR applications in terms of execution time. Additionally, we introduce a parallel version of the Hessian-Affine detector combined with the SIFT descriptor, and a novel cache hit threshold (CHT) algorithm. Finally, we present two new public image datasets that we created to evaluate CIRCE. By using a cache size of just a few hundred elements, we obtain a hit ratio of at least 66% and a precision of at least 97%. In case of a cache hit, our system performs IR tasks in at most 14ms on three different applications.

ACM Reference Format:
Luca Lovagnini, Wenxiao Zhang, Farshid Hassani Bijarbooneh, and Pan Hui. 2018. CIRCE: Real-Time Caching for Instance Recognition on Cloud Environments and Multi-Core Architectures. In 2018 ACM Multimedia Conference (MM '18), October 22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3240508.3240697

1 INTRODUCTION
With the ascent of smartphones, users are moving from traditional desktop computers to mobile devices [34], and mobile cloud computing (MCC) is now a hot topic for both the research and industrial communities [27]. In particular, MCC is a common solution for developing mobile augmented reality (AR) applications [9], where instance recognition (IR) plays an important role. However, IR tasks are currently not performed in real-time (i.e., in less than 30ms) [21], failing to meet the interactivity requirement of AR applications. For this reason, we face the challenging problem: can the back end system (BES) take advantage of recently computed IR tasks to speed up visually similar IR tasks in the future?

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM '18, October 22–26, 2018, Seoul, Republic of Korea
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5665-7/18/10...$15.00
https://doi.org/10.1145/3240508.3240697

Our first contribution is CIRCE, a similarity caching framework for IR applications. In CIRCE, query image codes and the corresponding results are stored in a cache structure in order to answer similar future IR tasks. Several parallel approaches have been used to achieve real-time results. In particular, as part of our second contribution, we propose the first parallel implementation of the Hessian affine detector combined with the SIFT descriptor [26], which achieves at least an order of magnitude speedup. We also propose a precise cache hit threshold (CHT) algorithm, which decides whether or not a query image contains the same instance as a cached image. Finally, as with any other caching mechanism, CIRCE requires that the query image frequency distribution follows a form of power law. Our last contribution is the proposal of two new image datasets respecting this requirement. Additionally, we used the popular Oxford buildings dataset [28] to evaluate CIRCE with three different IR applications: our approach executes IR tasks in 14ms in case of a cache hit on a six-core processor, with a precision of 97% in the worst case, by caching only a few hundred images.

The remainder of this paper is organized as follows: we first review previous works and describe how our system differs from them in Section 2. We describe the design and implementation of CIRCE in Section 3. In Section 4, we focus on the parallel version of the Hessian affine detector combined with the SIFT descriptor. Experimental results are reported in Section 5, and finally we conclude our work in Section 6.

2 RELATED WORKS
2.1 Approximate Nearest Neighbors
The nearest neighbor (NN) problem [19] is popular in computer science, especially in content based image retrieval (CBIR), multimedia databases (MDB), information retrieval (IRet), machine learning (ML), and many other fields. The (k-)NN problem is hard to solve when we have hundreds of thousands of elements in hundreds of dimensions, so several approximate nearest neighbor (ANN) methods have been proposed [1].

Locality sensitive hashing (LSH) [19] has been the focus of the research community for solving the ANN problem for many years.


                      ANN             SC              CIRCE
Applications          CBIR, ML, ...   MDB, CDS, ...   IR
Elements Number       Up to 10^9      Up to 10^5      Up to 10^4
NN                    k > 1           k > 1           k = 1
Precision             Approximate     Approximate     High
Codes Dimensions      Hundreds        Hundreds        Thousands
Real Time             ✗               ✗               ✓
CHT                   ✗               Manual          Automatic
Metric Spaces Only    ✗               ✓               ✗

Table 1: Comparison of ANN methods, SC systems, and CIRCE.

Different ANN methods have been proposed based on tree data structures (e.g., randomized k-d trees [32]), graphs (e.g., hierarchical navigable small world [23]), and product quantization techniques [20].

2.2 Similarity Caching
As mentioned in [8], similarity caching (SC) is a variant of classical caching, where an algorithm can return an element from the cache that is similar, but not necessarily identical, to the query element. In SC systems, the cached elements are some representation of previously received queries (e.g., image codes).

In 2009, Chierichetti et al. [8] discussed the concept of SC, providing the first theoretical analysis. Pandey et al. [25] proposed an SC solution for contextual advertisement systems (CDS), where the application context is very different from IR applications. In 2012, Falchi et al. [13–15] applied SC to CBIR systems, while in 2015 the authors of [6] used a list of clusters [7] to embed the cache in a metric space index. Notice that the results of all these SC works are based on metric space notions (e.g., the triangle inequality).

2.3 Similarity Thresholds
In 2006, TRECVid [33] was used for copy detection. In this work, the similarity threshold is defined as the normalized detection cost ratio, a score between 0 and 1 (lower is better) representing the trade-off between false positives and false negatives. This process requires heavy parameter tuning. In addition, copy detection is a very different application context from IR. Furon and Jégou [16] addressed the problem of image detection. Instead of defining an absolute threshold, they used notions from extreme value theory [10] for outlier detection. The result of this approach is a set of images of variable size, while CIRCE returns exactly one element (i.e., the instance label).

2.4 CIRCE versus the Related Works
Table 1 shows the differences between ANN, SC, and CIRCE. Below, we describe each entry in the table:
Number of Elements: in SC systems, hundreds of thousands of elements are cached. ANN methods require at least hundreds of thousands of elements, but they can be used with millions or even billions of elements. As we will show in Section 5.4.2, CIRCE needs to cache only a few hundred elements to obtain a large number of cache hits with high precision (but it can also efficiently cache tens of thousands of elements, as shown in Section 5.5).

Figure 1: CIRCE's architecture. The starred modules are our original implementation.

NN: in ANN methods and SC techniques, the number of returned elements (i.e., k) is greater than one (e.g., CBIR systems return hundreds of images). In CIRCE, we always return one element: each cached image representation has exactly one associated element.
Precision: since SC and ANN solutions return multiple elements, it is acceptable if some of those elements are error prone, so SC and ANN solutions can be used for applications where approximate results are acceptable. Instead, CIRCE returns only one element, so it is crucial to minimize the number of incorrect results.
Codes dimensions: we have seen that both ANN and SC require spaces with at most hundreds of dimensions. CIRCE uses image representations in thousands of dimensions for two reasons: first, it has been shown that large image representations are more precise than compact ones [2, 11, 21]. Second, since CIRCE is designed to cache at most a few thousand elements, we are not heavily affected by the curse of dimensionality.
Real-time constraints: CIRCE performs IR tasks in real-time, but none of the previously cited works have real-time constraints.
CHT: in all previous SC works, the CHT was manually tuned. Instead, CIRCE automatically generates the CHT during the setup process. We explain the CHT generation in Section 3.3.
Metric Spaces: SC approaches are based on metric space notions, so they can be used only when the cached object representations rely on such spaces. ANN methods can be used in non-metric spaces, depending on the considered method. CIRCE is more flexible, since it can be used in both metric and non-metric spaces.

3 CIRCE DESIGN
In this section we describe CIRCE's design, including our SC system for IR applications, design choices, and implementation details.

3.1 System Architecture
Figure 1 shows an overview of the system architecture, which includes four modules, as described in this section.

3.1.1 Descriptors Module. The descriptors module generates the descriptor matrix for a given input image in all the phases. This module wraps two image feature detector and descriptor implementations. For an input image i, each of these algorithms generates a descriptor matrix D ∈ R^{n×d}, where n is the number of keypoints in i and d is the descriptor dimension. In particular, CIRCE uses:
SURF: proposed by Bay et al. [3]. We use the parallel implementation from OpenCV [5] (a usage sketch follows this list).
PHA: the Hessian Affine (HA) detector combined with the SIFT descriptor [26] is the state-of-the-art for feature detection and description [18, 24]. The parallel HA algorithm (PHA) is part of our contribution, and it generates SIFT descriptors in 128 dimensions.
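As a concrete illustration of this module's output, the sketch below extracts a SURF descriptor matrix with OpenCV (3.x, with the xfeatures2d contrib module). The Hessian threshold shown is illustrative and not the value used in CIRCE.

```cpp
// Minimal sketch of the descriptors module for the SURF branch, using OpenCV 3.x
// with the xfeatures2d contrib module. The hessianThreshold value is illustrative,
// not the one used in CIRCE.
#include <opencv2/core.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

// Returns the n x d descriptor matrix D for one input image (d = 64 for SURF).
cv::Mat computeSurfDescriptors(const cv::Mat& image) {
    cv::Ptr<cv::xfeatures2d::SURF> surf =
        cv::xfeatures2d::SURF::create(400.0 /* hessianThreshold, illustrative */);
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;  // one row per detected keypoint
    surf->detectAndCompute(image, cv::noArray(), keypoints, descriptors);
    return descriptors;   // D in R^{n x d}
}
```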

3.1.2 Training Module. The training module generates the training object by using the descriptor matrices produced by the descriptors module.


The training object is used by the encoding module to generate image codes. This module uses:
k-means: this popular clustering algorithm is necessary to generate the vector of locally aggregated descriptors (VLAD) codes described in the next module. We use the parallel Elkan algorithm [12] implemented in VLFeat [35] (see the sketch after this list).
LCS: the local coordinate system (LCS) [11] exploits PCA to improve the precision of VLAD codes. We implement our own LCS.
GMM: just as k-means is used to generate VLAD codes, Gaussian mixture models (GMM) are used to encode the FV described in the next module. We use the VLFeat version of this training algorithm.
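To make the training step concrete, the following sketch clusters the stacked training descriptors into k centers. CIRCE relies on VLFeat's parallel Elkan k-means; OpenCV's cv::kmeans is used here purely as an illustration, and the termination criteria and number of attempts are arbitrary.

```cpp
// Illustrative sketch of the training step: cluster the stacked training descriptor
// matrix Dt into k centers. CIRCE uses VLFeat's parallel Elkan k-means; OpenCV's
// cv::kmeans is shown only to make the step concrete.
#include <opencv2/core.hpp>

cv::Mat trainVladVocabulary(const cv::Mat& Dt /* CV_32F, one descriptor per row */,
                            int k /* number of centers, e.g., 64 or 128 */) {
    cv::Mat labels, centers;  // centers: k x d matrix
    cv::kmeans(Dt, k, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 100, 1e-4),
               3 /* attempts */, cv::KMEANS_PP_CENTERS, centers);
    return centers;           // the trainer object used by the VLAD encoder
}
```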

3.1.3 Encoding Module. The encoding module takes the training object and the image descriptors from the two previous modules to create image codes. This module is used during both the setup and querying phases. The implementation relies on two different image encoding algorithms. We use the training object generated by the training module to encode the image descriptor matrix D ∈ R^{n×d} and obtain a vector v in d′ dimensions, where d′ depends on the encoding algorithm as described below:
VLAD: the vector of locally aggregated descriptors (VLAD) [21] uses k-means to generate codes in k × d dimensions (a minimal sequential sketch follows this list). Our parallel version of VLAD uses important improvements from previous works, such as residual normalization and LCS, which are not available in VLFeat.
Fisher Vectors: we use the FV [22] implementation provided by VLFeat, which uses GMM to generate codes in 2 × k × d dimensions.
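For reference, a minimal sequential VLAD encoder is sketched below: each descriptor is hard-assigned to its nearest k-means center, the residuals are accumulated per center, and the resulting k × d vector is power- and L2-normalized. The residual-normalization and LCS refinements used in CIRCE, as well as its parallelization, are omitted.

```cpp
// Minimal sequential VLAD encoder: nearest-center assignment, per-center residual
// accumulation, then power (signed square-root) and L2 normalization.
#include <opencv2/core.hpp>
#include <cmath>

cv::Mat encodeVlad(const cv::Mat& D /* n x d descriptors, CV_32F */,
                   const cv::Mat& centers /* k x d k-means centers, CV_32F */) {
    const int k = centers.rows, d = centers.cols;
    cv::Mat code = cv::Mat::zeros(1, k * d, CV_32F);
    for (int r = 0; r < D.rows; ++r) {
        int best = 0;
        double bestDist = cv::norm(D.row(r), centers.row(0), cv::NORM_L2SQR);
        for (int c = 1; c < k; ++c) {
            double dist = cv::norm(D.row(r), centers.row(c), cv::NORM_L2SQR);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        for (int j = 0; j < d; ++j)   // residual of the descriptor w.r.t. its center
            code.at<float>(0, best * d + j) += D.at<float>(r, j) - centers.at<float>(best, j);
    }
    for (int j = 0; j < k * d; ++j) { // power normalization
        float v = code.at<float>(0, j);
        code.at<float>(0, j) = (v >= 0.f ? 1.f : -1.f) * std::sqrt(std::fabs(v));
    }
    double n = cv::norm(code, cv::NORM_L2);
    if (n > 0) code = code / n;       // global L2 normalization
    return code;                      // the k x d dimensional image code
}
```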

We do not consider recent convolutional neural network techniques since they do not provide results in real-time, even when exploiting high-end GPUs [4, 31, 36].

3.1.4 Cache Module. In the cache module, the image codes provided by the encoding module are used either for cache warm-up (Section 3.2.2) or querying (Section 3.2.3). We implement our own parallel cache mechanism, including:
Threshold trainer: automatically generates our CHT, as described in Section 3.3.
Parallel image codes similarity: manages new incoming queries and implements a parallel brute-force approach to find the most similar cached code according to a similarity function (e.g., the Euclidean distance).
Cache data structure: our cache data structure follows a least recently used (LRU) policy. To implement the parallel brute-force approach, we use an array of pointers to a linked list, where the array size is equal to the cache capacity (see the sketch after this list). Notice that classic LRU caches exploit hash tables instead of arrays.
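The sketch below illustrates the lookup side of this design under our assumptions about the data layout: an LRU list owns the entries, an array of node iterators exposes them by position, and an OpenMP scan finds the closest cached code. The names and structure are illustrative, not CIRCE's actual implementation.

```cpp
// Sketch of the parallel brute-force lookup over an LRU cache exposed through
// an index array (illustrative layout, not CIRCE's actual code).
#include <omp.h>
#include <limits>
#include <list>
#include <string>
#include <vector>
#include <opencv2/core.hpp>

struct CacheEntry {
    cv::Mat code;       // cached image code (e.g., a VLAD vector)
    std::string label;  // instance label returned on a cache hit
};

struct SimilarityCache {
    std::list<CacheEntry> lru;                            // front = most recently used
    std::vector<std::list<CacheEntry>::iterator> index;   // array view for the parallel scan

    // Returns the position in `index` of the most similar cached code (or -1 if empty),
    // and the corresponding distance in `bestDist`.
    int mostSimilar(const cv::Mat& query, double& bestDist) const {
        int bestIdx = -1;
        bestDist = std::numeric_limits<double>::max();
        #pragma omp parallel
        {
            int localIdx = -1;
            double localBest = std::numeric_limits<double>::max();
            #pragma omp for nowait
            for (int i = 0; i < (int)index.size(); ++i) {
                double dist = cv::norm(query, index[i]->code, cv::NORM_L2);
                if (dist < localBest) { localBest = dist; localIdx = i; }
            }
            #pragma omp critical
            if (localBest < bestDist) { bestDist = localBest; bestIdx = localIdx; }
        }
        return bestIdx;
    }
};
```

With this layout, the LRU bookkeeping stays on the linked list while the fixed-size array gives every thread random access to the cached codes, which is what makes the OpenMP scan straightforward.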

3.2 System Workflow
CIRCE's workflow is divided into three phases: training, setup, and querying. The first two phases are performed offline, while the querying phase involves user interaction. The workflow of these phases is shown in Figure 2.

3.2.1 Training. In this phase, we use a dataset of training images to generate a trainer object, i.e., k-means for VLAD or GMM for FV codes, respectively. We use the chosen descriptor algorithm (e.g., SURF) to generate the training descriptor matrix Dt, which contains the descriptors from all the training images.

3.2.2 Setup. In this phase, we perform the setup of our system using a dataset of setup images, which are query images previously collected by the BES. We then define the warm-up images (WI) as the subset of setup images containing one image per instance. We use the WI to initialize our cache data structure. For each WI, we must define a set of true positive warm-up images (TPWI), that is, a set of images containing the same instance as the considered WI. The TPWI are used to generate the CHT applied during the query phase. As a result, the setup image dataset is the union of all the WI and all the corresponding TPWI. Finally, for each setup image, we provide a ground truth file which contains all the corresponding TPWI and the BES result.

The first step in the setup phase is to compute the descriptors for each setup image, obtaining the setup descriptor matrices. Then we generate each image code (i.e., VLAD or FV codes) from its setup matrix by utilizing the trainer object from the training phase. With the obtained image codes for the setup images, we initialize the cache data structure with the WI codes. Notice that the result associated with each WI code is obtained from the corresponding ground truth file previously computed by the BES. In addition, we use the WI codes, the TPWI codes, and the ground truth files to generate the CHT with the algorithm described in Section 3.3.

3.2.3 Querying. In this phase, CIRCE receives a query image from the user and finds the most similar cached image. If the two images are visually similar, we have a cache hit and the system returns the instance label of the cached image.

First, we compute the query image descriptor matrix. We then obtain the query code by using the trainer object, and we perform a parallel brute-force lookup operation on our cache data structure by using the parallel image codes similarity tool. As a result, we have the most similar cached code and the query code. If the distance between these two codes is at most the CHT, then we have a cache hit and CIRCE returns the instance label associated with the cache hit element. Otherwise, the query image and the cache hit image contain two different instances, so we need to forward the query image to the BES. In both cases, we update the cache data structure following a classic LRU policy.

3.3 Cache Hit Threshold Algorithm
As discussed in Section 2.3, similarity thresholds designed for different applications cannot help us generate a CHT for IR applications. In this section, we describe our own algorithm to generate the CHT.

The receiver operating characteristic (ROC) curve [30] is a suitable way of visualizing a classifier's performance in order to select a suitable decision threshold. This curve is obtained by plotting the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis. Our solution is partially based on the process described in [17], where we try to minimize the distance from the optimal point (0, 1). Algorithm 1 lists the pseudocode of our CHT method, where the default parameters are: minT = 0.5, maxT = 1.5, step = 0.1, k = 10, and s is the Euclidean distance. Our algorithm tests different CHT values (line 3), where for each warm-up image i ∈ WI (line 4) we retrieve the k most similar setup images in SI according to the similarity function s (line 6). By using the corresponding TPWI, we define the TPR and FPR for the considered CHT (from line 7 to line 17), and compute the distance from the optimal point in the ROC space (line 18). Finally, we choose the CHT with the smallest distance from the optimal point (from line 19 to line 21).


Figure 2: CIRCE's workflow. A small star means that we partially implemented the block (e.g., we implement PHA but we use the SURF implementation from OpenCV), while blocks with big stars are completely implemented by us. The color of each rounded block corresponds to the same module shown in Figure 1, while white squares are objects. Dashed lines or blocks are optional steps. The dotted lines show how the same object is used in different phases (e.g., the trainer object).

Notice that we retrieve the k most similar setup images, since returning only the most similar image is too restrictive and reduces the number of cache hits.

4 PARALLEL HESSIAN-AFFINE
In this section we describe our parallel Hessian-affine (PHA) implementation, which extends the HA implementation in [26]. We choose OpenMP as the parallel framework. In Section 5.2, we will see the significant speedup obtained by our approach.

Figure 3 shows the flowchart of the whole PHA algorithm. The HA algorithm is mainly composed of two nested loops: iterations of the outer loop refer to each octave, and iterations of the inner loop refer to each scale of the given octave. For each inner iteration, we compute the next scale by using a Gaussian blur filter, which is then used to call hessianResponse. Next, we use the previous, current, and next scales to compute the keypoints and descriptors for the given octave and scales with findLevelKeypoints. According to our analysis, more than 85% of the execution time is spent in findLevelKeypoints, 10% in hessianResponse, and around 5% in computing the Gaussian blur filter.

Because of the data dependency described above, we need to compute a chain of Gaussian blurs and make all the hessianResponse calls before calling findLevelKeypoints. Unfortunately, the chain of Gaussian blurs cannot be parallelized, but it takes only 5% of the overall algorithm execution time. Instead, the parallel calls to hessianResponse scale well on the tested architecture.

Our PHA algorithm splits the keypoint detection and descriptor building tasks, and each thread stores all the keypoints from findLevelKeypoints into a synchronized queue.

Algorithm 1: CHT Trainer Algorithm
Input:   WI: warm-up image codes.
         TPWI(i): true positive warm-up image codes for the image code i ∈ WI.
         SI = ⋃_{i ∈ WI} TPWI(i) ∪ WI: setup images.
Output:  bestCht: best CHT.
Params:  s: similarity function. minT, maxT, step, k.
 1  bestCht = 0;
 2  bestDist = ∞;
 3  for cht = minT; cht ≤ maxT; cht += step do
 4      foreach i ∈ WI do
 5          tp = 0, tn = 0, fp = 0, fn = 0;
 6          Top_{k,i} = getMostSimilars(i, k, s, SI);
 7          foreach j ∈ Top_{k,i} do
 8              if j ∈ TPWI(i) ∧ s(i, j) ≤ cht then
 9                  tp += 1;   // true positives
10              else if j ∈ TPWI(i) ∧ s(i, j) > cht then
11                  fn += 1;   // false negatives
12              else if j ∉ TPWI(i) ∧ s(i, j) ≤ cht then
13                  fp += 1;   // false positives
14              else
15                  tn += 1;   // true negatives
16          TPR = tp / (tp + fn);
17          FPR = fp / (fp + tn);
18          dist = sqrt(FPR² + (1 − TPR)²);
19          if dist < bestDist then
20              bestDist = dist;
21              bestCht = cht;
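For readers who prefer a concrete form, the following is a direct C++ transcription of Algorithm 1. The containers used to hold the codes (a map from image id to its VLAD/FV vector) and the helper getMostSimilars are assumptions made only for this sketch; they are not CIRCE's actual data structures.

```cpp
// Direct C++ transcription of Algorithm 1 (sketch). The data layout is assumed
// for illustration only.
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using ImageId = std::string;

double s(const cv::Mat& a, const cv::Mat& b) {   // similarity function (Euclidean distance)
    return cv::norm(a, b, cv::NORM_L2);
}

// Returns the k setup images most similar to i according to s (line 6 of Algorithm 1).
std::vector<ImageId> getMostSimilars(const ImageId& i, int k,
                                     const std::map<ImageId, cv::Mat>& codes,
                                     const std::vector<ImageId>& SI) {
    std::vector<std::pair<double, ImageId>> ranked;
    for (const ImageId& j : SI)
        if (j != i) ranked.emplace_back(s(codes.at(i), codes.at(j)), j);
    std::sort(ranked.begin(), ranked.end());
    std::vector<ImageId> top;
    for (int t = 0; t < k && t < (int)ranked.size(); ++t) top.push_back(ranked[t].second);
    return top;
}

double trainCht(const std::vector<ImageId>& WI,
                const std::map<ImageId, std::set<ImageId>>& TPWI,
                const std::map<ImageId, cv::Mat>& codes,
                const std::vector<ImageId>& SI,
                double minT = 0.5, double maxT = 1.5, double step = 0.1, int k = 10) {
    double bestCht = 0.0;
    double bestDist = std::numeric_limits<double>::max();
    for (double cht = minT; cht <= maxT; cht += step) {                   // line 3
        for (const ImageId& i : WI) {                                     // line 4
            int tp = 0, tn = 0, fp = 0, fn = 0;                           // line 5
            for (const ImageId& j : getMostSimilars(i, k, codes, SI)) {   // lines 6-7
                bool sameInstance = TPWI.at(i).count(j) > 0;
                bool withinCht = s(codes.at(i), codes.at(j)) <= cht;
                if (sameInstance && withinCht)       ++tp;                // line 9
                else if (sameInstance && !withinCht) ++fn;                // line 11
                else if (withinCht)                  ++fp;                // line 13
                else                                 ++tn;                // line 15
            }
            // Lines 16-18 (as in the pseudocode, no guard against tp + fn == 0).
            double TPR = tp / double(tp + fn);
            double FPR = fp / double(fp + tn);
            double dist = std::sqrt(FPR * FPR + (1.0 - TPR) * (1.0 - TPR));
            if (dist < bestDist) { bestDist = dist; bestCht = cht; }      // lines 19-21
        }
    }
    return bestCht;
}
```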


Figure 3: Flow chart of the PHA algorithm. Each column can be executed in parallel by one thread, while white horizontal bars define parallel regions.

Once a thread finishes computing all the keypoints from all its assigned octaves and scales, it pops keypoints from the synchronized queue and calls onHessianKeypointDetected to compute the corresponding descriptors. The process is repeated while there are threads still computing keypoints or the synchronized queue is not empty. Splitting the keypoint and descriptor computation balances the workload in our PHA algorithm.
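The sketch below shows this two-stage coordination pattern with OpenMP: threads first detect keypoints for their assigned octaves and push them into a shared synchronized queue, then drain the queue to compute descriptors. Detection and description themselves (hessianResponse, findLevelKeypoints, the SIFT descriptor) are abstracted behind placeholder callbacks, so only the workload split discussed above is illustrated.

```cpp
// Producer/consumer sketch of PHA's keypoint/descriptor split (coordination only;
// the detector and descriptor stages are placeholder callbacks).
#include <omp.h>
#include <mutex>
#include <queue>
#include <vector>

struct Keypoint { float x, y, scale; };

struct SyncQueue {
    std::queue<Keypoint> q;
    std::mutex m;
    void push(const Keypoint& kp) { std::lock_guard<std::mutex> lk(m); q.push(kp); }
    bool pop(Keypoint& kp) {
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        kp = q.front(); q.pop(); return true;
    }
};

void parallelHessianAffine(int numOctaves, SyncQueue& queue,
                           std::vector<Keypoint> (*detectOctave)(int),      // keypoint stage
                           void (*describeKeypoint)(const Keypoint&)) {     // descriptor stage
    int producersRunning = 0;
    #pragma omp parallel
    {
        #pragma omp single
        producersRunning = omp_get_num_threads();   // implicit barrier after single

        // Stage 1: each thread detects keypoints for the octaves assigned to it.
        #pragma omp for schedule(dynamic) nowait
        for (int o = 0; o < numOctaves; ++o)
            for (const Keypoint& kp : detectOctave(o))
                queue.push(kp);

        #pragma omp atomic
        --producersRunning;

        // Stage 2: compute descriptors while keypoints may still arrive.
        Keypoint kp;
        while (true) {
            if (queue.pop(kp)) {
                describeKeypoint(kp);
            } else {
                int left;
                #pragma omp atomic read
                left = producersRunning;
                if (left == 0) {                                 // no producer left:
                    while (queue.pop(kp)) describeKeypoint(kp);  // final drain, then done
                    break;
                }
            }
        }
    }
}
```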

During our PHA evaluation, we noticed an important load imbalance during the descriptor computation stage: some cores on the tested architecture compute only one descriptor during the whole algorithm execution, while each core usually computes hundreds of descriptors. We call these time-consuming descriptors big descriptors; they represent less than 1% of all the computed descriptors and can be removed. First, in our observations, discarding these big descriptors does not result in any drop in precision. Second, all the big descriptors are obtained from the biggest image patches (i.e., regions around a given keypoint). In particular, let p be the patch size for a given keypoint and let w and h be the input image width and height, respectively; if the ratio r = p / max(w, h) exceeds a given threshold t, we do not compute the corresponding descriptor. Threshold t is the only parameter to tune in PHA, and its optimal value depends on the input size and the thread pool size. Big descriptors are always removed using t = 0.3 in our experiments.

5 EXPERIMENTS
This section presents our experimental results for CIRCE. First, we describe the system setup and datasets. Second, we evaluate PHA using parallel metrics. We then present precision and time efficiency results for the feature extraction and image encoding techniques. Finally, we show the efficiency of our caching mechanism.

5.1 Experiments Setup
We deployed CIRCE on a computer with a six-core Intel Core i7 processor and 32GB of RAM. CIRCE is completely developed in C++ and built with the Intel C++ Compiler. External libraries include OpenCV and VLFeat. Unless specified otherwise, we repeat each test ten times and average the obtained results.

Figure 4: Instances from the movie poster and painting datasets. The leftmost column shows the original instances. The second column shows good quality WI. The two rightmost columns are TPWI images.

5.1.1 IR Applications. We implement three IR applications to test CIRCE, and for the sake of simplicity, we assume that each query image contains exactly one instance to be recognized:
Movie poster recognition: the query image contains a movie poster and the BES returns the rating and position of the input poster.
Painting recognition: the query image contains a painting and the BES returns the title of the painting.
Building recognition: a subset of landmark recognition, where the BES identifies the building in the query image.

5.1.2 Image Datasets. In this section, we present the three image datasets used to evaluate the image descriptors and our SC approach. We created the first two datasets presented in this section.
Posters dataset: this dataset contains 3949 movie poster images from 33 movies. The images have been taken in seven different movie theaters in order to test the robustness of our approach in different environments. In addition, we create a training dataset of 476 different movie posters, including the 33 original posters. We took many challenging pictures, where overexposure, object obstruction, and bad points of view are common in this dataset.


Figure 5: Discrete probability distribution of the three image datasets used for cache evaluation. (a) Posters distribution. (b) Paintings distribution. (c) Adapted Oxford distribution. Each panel plots the discrete probability distribution against the instance popularity ranking, together with a trend line.

The first row of Figure 4 shows one instance from this dataset. In particular, notice that the WI are good quality images, while the TPWI are challenging images. Figure 5a shows that the query popularity distribution closely follows a power law: images belonging to the six most popular posters (less than 20% of the instances) represent more than 60% of the whole dataset.
Paintings dataset: this dataset contains 112 different paintings from the Uffizi museum in Florence, with a total of 4465 images. We define a set of 131 images as the training dataset. As for the posters dataset, we took many challenging pictures. The second row of Figure 4 shows a painting instance. In Figure 5b, we can see the query distribution of the paintings dataset, which fits a straight line more closely, with 20% of the categories accounting for about 50% of the images.
Oxford buildings dataset: this is a popular dataset of 5062 images divided into eleven different Oxford buildings, with 55 query images. We use this dataset to compare our image encoding implementations with the state of the art. We use the Paris buildings dataset as the training dataset [29]. To evaluate the building recognition application, we modify the Oxford dataset by using only the okay and good images from each category, since we assume a reasonable use of the IR application by the user. The resulting dataset includes 512 images in 11 categories. Figure 5c shows the distribution of the adapted Oxford dataset: the first two categories (less than 20% of the instances) include more than 52% of the whole dataset.

As discussed above, for each image dataset we define a set of good quality WI. In addition, for each image dataset, we define a set of bad quality WI. This categorization is used to test how cache performance is affected by the quality of the WI. Moreover, we take bad WI at three different levels of difficulty for each dataset:
Easy level: we take the posters dataset as the easy level for bad WI: these images have very light illumination changes, almost no object obstruction, and not too difficult point of view changes.
Mid level: since we did not create the Oxford dataset, we take it as the mid level for bad WI: difficult points of view, object obstruction, different light conditions, but nothing impossible to identify.
Hard level: finally, we take the painting dataset as the hard level for bad WI: heavy underexposure, macro details only, and almost full obstruction in these images.

Bad quality WI will be used in section 5.4.1.

5.2 PHA Evaluation
In this test we evaluate the performance of PHA. We do not report the results obtained from the paintings dataset, since they are similar to those from the posters dataset.

Figure 6: PHA and HA timings w.r.t. image size (average descriptor time in ms versus image size in px, for the Oxford and posters datasets).

Figure 7: SURF and PHA average descriptor time (ADT) versus image size, for the Oxford and posters datasets.

Since image size is an important factor in obtaining the best time and precision performance, we evaluate the speedup obtained at the highest degree of parallelism (i.e., twelve threads for our experimental setup) in relation to image resizing. In Figure 6, we can see that the original HA implementation obtains real-time results (under 30ms) only with an image size of 200 pixels. Instead, our PHA is able to obtain real-time results with an image size of up to 800 pixels for Oxford and 900 pixels for posters. Figure 6 also shows an excellent speedup of 10.44 for Oxford when the input images are resized to 600 pixels on the larger side, and a speedup of 9.25 for the same resizing for posters. We do not see any meaningful difference in precision between HA and PHA.

We also compare the performance of PHA and SURF, and the results are shown in Figure 7. We can see that SURF is slightly faster than PHA: for Oxford, the former takes around 10ms on average to generate the descriptors, while the latter takes 13ms. Similarly, for posters SURF takes 8ms on average, while PHA takes 11ms. Later, in Section 5.3, we will show that PHA is more accurate than SURF.

5.3 Encoders Evaluation
In this section, we compare the two image encoding techniques used in CIRCE: VLAD and FV. Both encoders' precision and efficiency depend on the number of centers used, i.e., the value k.


Figure 8: (a) VLAD and FV mAP vs. number of centers; (b) PHA and SURF mAP using VLAD; (c) VLAD and FV average encoding time (AET) vs. number of centers. Precision of the two encoding techniques is shown in Figure 8a, the precision with different descriptors is shown in Figure 8b, and Figure 8c shows their efficiency.

In Figure 8a, we can see the precision of the two encoders when using PHA on Oxford images. In particular, our VLAD implementation is more precise than FV up to k = 128, while FV is slightly more precise at k = 256. However, the number of dimensions for FV codes is too large when using so many centers. Next, we evaluate VLAD precision and efficiency using the two descriptors described in the previous section. As we can see from Figure 8b, PHA again beats SURF in terms of precision, with a gap of up to 7% in mAP when using 64 centers.

Second, we consider encoding efficiency: as we can see from Figure 8c, FV is faster than VLAD on Oxford images. Since VLAD is shown to be more precise than FV while using half the dimensions, we decide to use it for the rest of this work. We obtain consistent results with posters and paintings too.

5.4 Cache Evaluation
In this section we evaluate our caching approach. The two metrics used in this section are the cache hit ratio (HR) and precision (P).

Supposing that the tested dataset has n images, the test workflow for a given configuration is: a) we train our system by generating the trainer object; b) we set up CIRCE as described in Section 3.2.2: we first initialize the cache with the WI, then use the WI and TPWI to generate the CHT, where the TPWI are defined by randomly selecting n/2 non-query images; c) we randomly submit min(1500, n/2) queries, excluding setup images (i.e., WI and TPWI). Unless specified otherwise, the cache size for all the experiments is ten thousand elements.

5.4.1 Testing Multiple Criteria. In the following tests, we collect results for all the combinations of: a) good and bad quality WI; b) SURF and PHA descriptors; c) k = 64, 128 centroids for VLAD; d) paintings, posters, and the adapted Oxford datasets. This makes a total of 240 runs (10 runs per test).

The first analysis is how cache performance changes based on the WI quality. We group the results per dataset and WI quality from all the 240 tests described above, regardless of the descriptor and the number of centers used. Table 2 shows that when we use bad quality WI for the painting dataset (which are the worst ones in terms of quality), almost all the submitted queries correspond to a cache hit, but only 1% of them are correct. However, as soon as we use good quality WI, we have a HR of at least 0.66 and a precision of at least 0.97. We cannot compare these results with other works since CIRCE is the first SC system for IR applications and we use two novel datasets, but the precision of our approach is clear. This analysis shows that the quality of the WI is very important.

                          Bad WI            Good WI
Dataset     Level         HR      P         HR      P
posters     Easy          0.86    0.98      0.66    0.99
Oxford      Middle        0.81    0.90      0.75    0.97
painting    Hard          1.00    0.01      0.85    0.98

Table 2: Cache performance depending on WI quality per dataset.

Criteria      Value    HR      P
Centers       64       0.82    0.95
              128      0.75    0.98
Descriptor    SURF     0.78    0.95
              PHA      0.80    0.98

Table 3: Cache performance based on different criteria.

Our next analysis compares the different criteria described above. First, we group results by descriptor, then by the number of VLAD centers (e.g., the SURF results refer to the average over all tested configurations where SURF has been used, regardless of the number of VLAD centers, WI quality, or image dataset). The results shown in Table 3 do not consider bad WI for the painting dataset, whose only purpose is to show the importance of WI quality in Section 5.4.1. Using more centers results in fewer but more precise cache hits. Moreover, PHA has both higher HR and P than SURF, but notice that we are also considering bad WI (except for painting). From this test, it is clear that to maximize HR and speed, we have to use the SURF descriptor with 64 VLAD centers. Instead, to maximize precision, we have to use PHA with 128 VLAD centers. We will use these two configurations in the next tests. In addition, from now on we will use good quality WI only.

5.4.2 Cache Size Test. Up to now, we have never completely filled our cache, whose size is set to ten thousand elements. In this test we evaluate cache performance while varying the cache size from 200 to 1600 elements. We do not consider the adapted Oxford dataset, since it has only 512 images. We compare our HR with the optimal one obtained by Belady's optimal cache replacement algorithm. Notice that the optimal solution is based on the generated CHT, which decides whether a cache hit happens or not.

Figure 9a shows that we need to cache only four hundred painting images and six hundred poster images to obtain the optimal results. This excellent result is due to the good quality of our codes and the power-law query distribution of our datasets, which make it highly probable that a query matches a cached element. From this test we can confirm that the SURF configuration generates more cache hits. Figure 9b shows that the system precision is not affected by the cache size, and that PHA is more precise than SURF.


Figure 9: Cache performance in terms of hit ratio, precision, and lookup time. (a) Hit ratio vs. cache size. (b) Precision vs. cache size. (c) Lookup time vs. cache size (small scale). (d) Lookup time vs. cache size (large scale). (e) Hit ratio vs. cache hit threshold. (f) Precision vs. cache hit threshold.

Descriptor    k      DT (s)    ET (s)    ALT (s)
SURF          64     0.0092    0.0007    0.0201
PHA           128    0.0120    0.0014    0.0166

Table 4: Descriptor, encoding, and available lookup times using the optimal configurations.

5.5 Scalability Test
In this section, we answer the question: given a configuration of our system, how many elements can we cache while still answering a query in real-time when a cache hit occurs? In case of a cache hit, the query time is given by the sum of the descriptor time (DT), the encoding time (ET), and the cache lookup time. The time performance up to ET is summarized in Table 4: we define the available lookup time (ALT) as ALT = 30 − DT − ET. As we can see from Figure 9c, a cache lookup takes less than one millisecond, even for VLAD codes with 128 centroids and PHA descriptors in 16384 dimensions. Notice that the lookup time does not increase linearly, because of our parallel image code similarity implementation. From this test, we can state that when a cache hit occurs, CIRCE is able to perform IR tasks in 14ms in the worst case on the tested architecture.
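As a worked example, converting the Table 4 values from seconds to milliseconds gives the per-configuration budgets:

ALT_SURF = 30 − DT − ET = 30 − 9.2 − 0.7 = 20.1 ms
ALT_PHA  = 30 − DT − ET = 30 − 12.0 − 1.4 = 16.6 ms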

Then we test how many elements we can cache with a real-time guarantee. This is useful for those datasets where the query frequency distribution does not follow a power law. To test the efficiency of the cache lookup operation, we create fake image codes in 4096 dimensions when using SURF with 64 centroids, and in 16384 dimensions when using PHA with 128 centroids. As we can see from Figure 9d, using the PHA configuration we can cache up to ten thousand codes and still obtain real-time performance, and with the smaller SURF codes we can cache even fifty thousand codes.

5.6 Cache Hit Threshold Test
In this test, we evaluate the quality of our CHT algorithm. In particular, we test the cache performance on our three datasets by manually setting the CHT and compare it with the value selected by our CHT algorithm. We use the SURF configuration for this test, starting with a CHT of 1 up to 1.5, with a step of 0.05 between tests.

Figure 9f shows that the adapted Oxford dataset has a precision of 1 up to a threshold of 1.2. However, Figure 9e shows a low HR for such a threshold, so our algorithm gives up around 0.02 in precision to improve the HR from 0.4 to 0.8. Our algorithm assigns a CHT of 1.1 to the poster dataset because of its high precision. However, notice that if our algorithm chose a CHT of 1.2 for the poster dataset, we would obtain 20% more cache hits while losing only 2% precision.

6 CONCLUSION AND FUTURE WORKS
In this paper, we presented CIRCE, a similarity caching system to perform IR tasks in real-time when a cache hit occurs. Our implementation of PHA reaches a speedup of over 10 times with a thread pool of twelve threads. In addition, we defined a new algorithm based on ROC curves to generate a CHT for IR applications. By using a cache size of just six hundred elements, we obtained a hit ratio of at least 66% and a precision of at least 97%. By exploiting multi-core architectures, CIRCE is able to perform similar and frequent IR tasks in at most 14ms in the worst case, which is far below the real-time threshold.

For future work, we plan to relax the real-time constraint in order to use modern neural network techniques to generate more precise and compact image codes. In addition, we intend to improve our CHT algorithm for cases where it is possible to greatly improve the hit ratio at the cost of a small drop in precision.

ACKNOWLEDGEMENT
We thank Nicola Tonellotto, Relja Arandjelovic and Fabrizio Falchi for helping us to improve this work. The research of Pan Hui was supported in part by the General Research Fund from the Research Grants Council of Hong Kong under Grant 26211515 and Grant 16214817.


REFERENCES
[1] Mohammad Reza Abbasifard, Bijan Ghahremani, and Hassan Naderi. 2014. A survey on nearest neighbor search methods. International Journal of Computer Applications 95, 25 (2014).
[2] Relja Arandjelovic and Andrew Zisserman. 2013. All about VLAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1578–1585.
[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In Computer Vision–ECCV 2006. Springer, 404–417.
[4] Sebastian Bittel, Vitali Kaiser, Marvin Teichmann, and Martin Thoma. 2015. Pixel-wise segmentation of street with neural networks. arXiv preprint arXiv:1511.00513 (2015).
[5] G. Bradski. 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).
[6] Nieves R. Brisaboa, Ana Cerdeira-Pena, Veronica Gil-Costa, Mauricio Marin, and Oscar Pedreira. 2015. Efficient similarity search by combining indexing and caching strategies. In International Conference on Current Trends in Theory and Practice of Informatics. Springer, 486–497.
[7] Edgar Chávez and Gonzalo Navarro. 2005. A compact space decomposition for effective metric indexing. Pattern Recognition Letters 26, 9 (2005), 1363–1376.
[8] Flavio Chierichetti, Ravi Kumar, and Sergei Vassilvitskii. 2009. Similarity caching. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 127–136.
[9] Pei-Hsuan Chiu, Po-Hsuan Tseng, and Kai-Ten Feng. 2014. Cloud computing based mobile augmented reality interactive system. In Wireless Communications and Networking Conference (WCNC), 2014 IEEE. IEEE, 3320–3325.
[10] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. 2001. An Introduction to Statistical Modeling of Extreme Values. Vol. 208. Springer.
[11] Jonathan Delhumeau, Philippe-Henri Gosselin, Hervé Jégou, and Patrick Pérez. 2013. Revisiting the VLAD image representation. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 653–656.
[12] Charles Elkan. 2003. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 147–153.
[13] Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Fausto Rabitti. 2008. A metric cache for similarity search. In Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval. ACM, 43–50.
[14] Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Fausto Rabitti. 2009. Caching content-based queries for robust and efficient image retrieval. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 780–790.
[15] Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Fausto Rabitti. 2012. Similarity caching in large-scale image retrieval. Information Processing & Management 48, 5 (2012), 803–818.
[16] Teddy Furon and Hervé Jégou. 2013. Using extreme value theory for image detection. Ph.D. Dissertation. INRIA.
[17] Karimollah Hajian-Tilaki. 2013. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian Journal of Internal Medicine 4, 2 (2013), 627.
[18] Antti Hietanen, Jukka Lankinen, Joni-Kristian Kämäräinen, Anders Glent Buch, and Norbert Krüger. 2016. A comparison of feature detectors and descriptors for object class matching. Neurocomputing 184 (2016), 3–12.
[19] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, 604–613.
[20] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128.
[21] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 3304–3311.
[22] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Perez, and Cordelia Schmid. 2012. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 9 (2012), 1704–1716.
[23] Yu A. Malkov and D. A. Yashunin. 2016. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv preprint arXiv:1603.09320 (2016).
[24] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. 2005. A comparison of affine region detectors. International Journal of Computer Vision 65, 1-2 (2005), 43–72.
[25] Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. 2009. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web. ACM, 441–450.
[26] Michal Perd'och, Ondrej Chum, and Jiri Matas. 2009. Efficient representation of local geometry for large scale object retrieval. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 9–16.
[27] Sarah Perez. 2010. Mobile cloud computing: $9.5 billion by 2014. Technical Report, ReadWriteMobile (2010).
[28] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on. IEEE, 1–8.
[29] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2008. Lost in quantization: Improving particular object retrieval in large scale image databases. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 1–8.
[30] Foster J. Provost, Tom Fawcett, Ron Kohavi, et al. 1998. The case against accuracy estimation for comparing induction algorithms. In ICML, Vol. 98. 445–453.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[32] Chanop Silpa-Anan and Richard Hartley. 2008. Optimised KD-trees for fast image descriptor matching. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 1–8.
[33] Alan F. Smeaton, Paul Over, and Wessel Kraaij. 2006. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval. ACM, 321–330.
[34] Lexton Snol. 2009. More smartphones than desktop PCs by 2011. PC World (September 2009). Online: http://www.pcworld.com/article/171380/. Retrieved December 3, 2009.
[35] A. Vedaldi and B. Fulkerson. 2008. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/. (2008).
[36] Liang Zheng, Yi Yang, and Qi Tian. 2017. SIFT meets CNN: A decade survey of instance retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
