
A MapReduce-based indoor visual localization system using affine invariant features


Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. Taeshik Shon.
Corresponding author: Tien-Ruey Hsiang.

E-mail address: [email protected] (T.-R. Hsiang).


Tien-Ruey Hsiang a,*, Yu Fu b, Ching-Wei Chen a, Sheng-Luen Chung b

a Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
b Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan

Article history: Available online xxxx

Abstract

This paper proposes a vision-based indoor localization service system that adopts affine scale invariant features (ASIFT) in the MapReduce framework. Compared to prior vision-based localization methods that use scale invariant features or bag-of-words to match database images, the proposed system with ASIFT achieves a better localization hit rate, especially when the query image has a large viewing angle difference from the most similar database image. The heavy computation imposed by ASIFT feature detection and image registration is handled by processes designed in the MapReduce framework to speed up the localization service. Experiments on a Hadoop computation cluster demonstrate the performance of the localization system. The improved localization hit rate is demonstrated by comparing the proposed approach to previous work based on scale invariant feature matching and visual vocabulary.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

To successfully localize oneself is the foundation of many context-aware services and autonomous robotics applications. In context-aware services, a relevant service can be provided to a user by combining the user's location with other context such as time and the user's habits or behavior. For autonomous robots, path planning and navigation are based on knowledge of the robot's current position.

Numerous indoor localization techniques have been studied for many years. Due to the unavailability of the GPS (Global Positioning System) signal and the limited accuracy of A-GPS (Assisted GPS) indoors, indoor localization techniques have been developed using alternative technologies such as wireless-signal-emitting sensors, proximity sensors, and cameras. This paper focuses on vision-based localization techniques.

Given a query image, vision-based localization techniques find the position in a map where the query image was taken. Appearance-based maps [1,2] and feature maps [3–6] are the two types of maps used in vision-based localization. An appearance-based map consists of database images of the target environment. A feature map, on the other hand, stores a significant number of features extracted from the environment instead of raw sensing data in order to reduce database storage.

Vision-based localization requires heavy computation because a query image is matched against the entire database in order to check all position candidates. Generally, thousands of images [7,8] are used to construct a map.

Several approaches have been proposed to quickly process large amounts of map data. The methods in [6,9] retain only long-term features in the map. Those in [2,4,7,10] use selected data dimensions to represent each feature or database image. Coarse-to-fine or hierarchical approaches [4,5] quickly eliminate unlikely database images or features, then compute the localization result.


Another difficulty in developing vision-based localization techniques is determining the most similar image in the presence of changes in viewing angle, illumination, scale, and occlusion. Although scale invariant features such as SIFT [11] are a common way to handle this problem, more robust solutions often adopt perspective or affine invariant features [4]. The results in [12,13] demonstrate that image registration using ASIFT features outperforms standard scale invariant features. However, the computation induced by perspective or affine invariant features remains formidable.

This paper proposes an ASIFT (Affine-SIFT) vision-based localization system using MapReduce in order to handle the heavy computation caused by estimating affine invariant features and performing localization in large databases. MapReduce is a computation framework for cloud computing environments. The work in [14] divides scientific computing problems into four types and discusses the latency caused by the lack of caching in Hadoop during iterative MapReduce computation. Many applications that induce heavy computation, such as image rendering [15] and object recognition [16], have adopted the MapReduce framework to reduce overall computation time.

The localization service proposed in this paper starts when a client sends a query image to the localization system. The client is considered a thin device, which can be either a smart phone or a robot equipped with a camera capable of sending an image to the localization system. With only limited hardware resources, ASIFT feature matching is unrealizable on the client itself. Therefore, the client offloads the computation to the external localization system, which first localizes the client topologically by ASIFT feature matching, then computes the coordinates by triangulation. The proposed localization system utilizes two layers of MapReduce procedures: ASIFT feature detection is computed in one layer and topological localization is performed in the other.

The main contributions of this paper are twofold: the adaptation of an ASIFT vision-based localization system to the MapReduce framework, and a better hit rate in successfully localizing the client. Experiments demonstrate the improved hit rate by comparing the proposed approach to localization approaches based on SIFT and bag-of-words, and show the reduced computation time under different scales of hardware resources.

The rest of this paper is organized as follows. Section 2 presents our method for building a 3D feature map for the proposed localization system. The localization system is detailed in Section 3, including the localization algorithm and its implementation in the MapReduce framework. Section 4 provides the experimental verification, and Section 5 concludes this paper.

2. SIFT feature-based point cloud

A map used for localization must be precise and consistent because the error of the map propagates to the localization result. This section describes our map building process and demonstrates the resulting 3D feature map.

The 3D feature map contains colored 3D points and associated feature descriptors. Each colored 3D point represents a sensed point of an object, and the corresponding feature descriptor is a distinguishable representation for feature matching. An environment is thus modeled by colored 3D points. The feature map is constructed by moving a Microsoft Kinect, an off-the-shelf RGB-D sensor, through the environment and registering multiple scans [17,18] in the global coordinate system.

The relative pose between the two places where two Kinect scans are taken is computed by RANSAC-based robust estimation of a 6D rigid transformation. The 6D rigid transform, comprising rotation and translation, is described by

$$p' = Rp + t, \qquad (1)$$

where $p$ is the 3D coordinate of a sensed point in one scan, $p'$ is the 3D coordinate of the same point in the other scan, $R$ is a $3 \times 3$ rotation matrix, and $t$ is a $3 \times 1$ translation vector. In order to establish the correspondence between $p$ and $p'$, SIFT features [11] are extracted and matched between the two images. Given multiple correspondences, RANSAC [19] removes incorrect correspondences and computes the rotation matrix and the translation vector from the correct ones. By computing the relative pose between consecutive scans, colored 3D points can be registered in the global coordinate system. However, the generated colored 3D points can still contain inconsistencies due to accumulated errors.
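For concreteness, the following is a minimal sketch of this RANSAC estimation step (the system itself uses MRPT, described below; the array shapes, iteration count, and inlier threshold here are illustrative assumptions). Given matched 3D points P and Q from two scans, it fits Eq. (1) to random 3-point samples by SVD and keeps the transform supported by the most inliers:

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares R, t mapping points P onto Q (Kabsch/SVD method)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:       # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t

def ransac_rigid(P, Q, iters=500, thresh=0.05):
    """RANSAC over minimal 3-point samples; refit on the best inlier set."""
    best_inliers = np.zeros(len(P), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(P), 3, replace=False)
        R, t = rigid_transform(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)  # residual per match
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return rigid_transform(P[best_inliers], Q[best_inliers])
```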

GraphSLAM [20] generates a consistent map. By representing the posterior of the full SLAM problem as a graph, many nonlinear quadratic constraints are built, and the consistent map is obtained by minimizing the sum of these constraints. For the details of GraphSLAM, the reader is referred to [20]. Fig. 1 shows an example of inconsistent colored 3D points of an environment and its consistent version after applying GraphSLAM. In this example, 17 Kinect scans are used and 4,063,946 points are generated. An obvious inconsistency in Fig. 1a shows a whiteboard appearing as two unaligned scans on the right.
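As a simplified illustration of the idea (not the MRPT-based implementation used here), the sketch below optimizes a small 2D pose graph by minimizing the sum of squared relative-pose constraints with a generic nonlinear least-squares solver; the poses, constraints, and choice of SciPy are assumptions for illustration only, and angle wrap-around is ignored:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(flat_poses, edges):
    """Sum-of-constraints objective: each edge (i, j, meas) penalizes the
    difference between the predicted and measured pose of j in i's frame."""
    poses = flat_poses.reshape(-1, 3)          # rows are (x, y, theta)
    res = list(poses[0])                       # anchor pose 0 at the origin
    for i, j, meas in edges:
        xi, yi, ti = poses[i]
        xj, yj, tj = poses[j]
        c, s = np.cos(ti), np.sin(ti)
        dx, dy = xj - xi, yj - yi
        pred = np.array([c * dx + s * dy, -s * dx + c * dy, tj - ti])
        res.extend(pred - meas)
    return np.array(res)

edges = [(0, 1, np.array([1.0, 0.0, 0.0])),    # odometry-like constraints
         (1, 2, np.array([1.0, 0.0, 0.0])),
         (0, 2, np.array([2.05, 0.1, 0.0]))]   # loop-closure constraint
sol = least_squares(residuals, np.zeros(9), args=(edges,))
print(sol.x.reshape(-1, 3))                    # consistent poses
```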

Instead of adopting dense colored 3D points as the map for localization, this paper uses SIFT feature-based colored 3D points as the map, because a 3D feature map requires less storage [2–6]. In addition to the colors and the 3D coordinates of the SIFT features in each scan, the SIFT feature descriptors are also saved. Fig. 2 shows a 3D feature map of the same environment as Fig. 1b. The number of stored colored 3D points is reduced from 4,063,946 to 31,881.
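A map entry can thus be pictured as follows; this is only a sketch, since the paper does not specify the storage layout, so the field names and types are assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapFeature:
    """One entry of the 3D feature map (hypothetical field layout)."""
    xyz: np.ndarray         # 3D coordinate in the global frame, shape (3,)
    rgb: tuple              # color of the sensed point
    descriptor: np.ndarray  # 128-dim SIFT descriptor used for matching
    scan_id: int            # index of the Kinect scan the feature came from

# {F_n} for environment n is the union over scans i of the per-scan sets {F_n,i}:
feature_map = [MapFeature(np.array([1.2, 0.4, 2.1]), (180, 175, 170),
                          np.zeros(128, dtype=np.float32), scan_id=0)]
```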

Our generation of 3D feature maps is implemented with two libraries: SiftGPU [21] and MRPT [22]. SiftGPU detects and matches SIFT features. MRPT removes outliers among the matched features and estimates the 6D rigid transformations between consecutive Kinect scans with RANSAC. The poses at which the scans are obtained are optimized by the GraphSLAM implementation provided in MRPT [22].

Fig. 1. Inconsistent colored 3D points versus consistent colored 3D points. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. SIFT feature-based 3D feature map.

Since the map building error propagates to the localization error, the error of the generated 3D feature map must be estimated. This paper evaluates the error by computing the size difference between an object in the 3D feature map and the same object in the real environment. The measured objects include the dimensions of the environments and of pieces of furniture. The average and standard deviation of the map building errors are 8.73 cm and 2.94 cm, respectively.

3. Vision-based localization system in MapReduce framework

This section describes the proposed hierarchical localization algorithm in the vision-based localization system and details the implementation of the vision-based localization system in the Hadoop MapReduce framework.

3.1. Hierarchical localization algorithm

The hierarchical localization algorithm depicted in Fig. 3 first localizes the query image topologically by matching images, then computes the coordinates by triangulation. The input of the hierarchical localization algorithm is a query image from a client. ASIFT features [12] are extracted from the query image in order to increase the tolerance to differences in viewing angle between the query image and its most similar scans. Fig. 4 shows an image pair with a large difference in viewing angle; this pair cannot be successfully registered by regular SIFT matching.



Fig. 3. Hierarchical localization algorithm.

Fig. 4. An image pair with a large difference in viewing angle.


ASIFT features from the query image are matched against the sets of SIFT features from the scans in the 3D feature map to obtain the most similar scan as the topological localization result. The set of extracted ASIFT features is denoted $\{F_{Query}\}$. A winner-take-all strategy is adopted to select the scan with the most feature matches, and RANSAC [19] is applied to remove outliers in the matching process. In Fig. 3, the set of SIFT features detected from the $i$th scan in the $n$th environment is $\{F_{n,i}\}$; $\bigcup_i \{F_{n,i}\}$ is the 3D feature map of the $n$th environment, denoted $\{F_n\}$. $\{M_{n,i}\}$ contains the matched SIFT features between $\{F_{Query}\}$ and $\{F_{n,i}\}$, and $\{M_{max}\}$ contains the matched features between the query image and the most similar scan.
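The sketch below illustrates this matching and winner-take-all selection with OpenCV. The ratio test and the use of a homography as the RANSAC geometric model are our assumptions, since the paper only states that RANSAC removes outliers:

```python
import cv2
import numpy as np

def inlier_match_count(desc_q, kp_q, desc_s, kp_s):
    """|M_n,i|: RANSAC-filtered matches between {F_Query} and {F_n,i}."""
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc_q, desc_s, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        return 0
    src = np.float32([kp_q[m.queryIdx].pt for m in good])
    dst = np.float32([kp_s[m.trainIdx].pt for m in good])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # outlier removal
    return 0 if mask is None else int(mask.sum())

def most_similar_scan(query, scans):
    """Winner-take-all: the scan with the most matches defines {M_max}."""
    return max(scans, key=lambda s: inlier_match_count(query.desc, query.kp,
                                                       s.desc, s.kp))
```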

Similar to [23], the triangulation in the proposed system adopts the angular relations of the matched features in $\{M_{max}\}$ relative to the optical center of the camera. Fig. 5 shows the construction of the relation between matched features. Suppose the query image is captured at $C = (x_C, y_C, z_C)$; the relation can be represented as

$$(m_{max,i} - C) \cdot (m_{max,j} - C) = \left|m_{max,i} - C\right|\left|m_{max,j} - C\right| \cos(\theta_{ij}), \qquad (2)$$

where $m_{max,i}$ and $m_{max,j}$ are the coordinates of matched features and $\theta_{ij}$ is the included angle. In the pinhole camera model, $C$ is the optical center of the camera.

Fig. 5. The pinhole camera model and triangulation from two matched features in the 3D feature map.

In Fig. 5, the projection $p_i$ of the matched feature $m_{max,i}$ is

$$s\,p_i = s\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = \begin{bmatrix} f_u & 0 & c_u \\ 0 & f_v & c_v \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} I_{3\times 3} & 0_{3\times 1} \end{bmatrix}\begin{bmatrix} x^{Cam}_{m_{max,i}} \\ y^{Cam}_{m_{max,i}} \\ z^{Cam}_{m_{max,i}} \\ 1 \end{bmatrix}, \qquad (3)$$

where $f_u$ and $f_v$ are the focal lengths along the horizontal and vertical axes of the image, $c_u$ and $c_v$ represent the principal point, $s$ is a scale factor, and $\big(x^{Cam}_{m_{max,i}}, y^{Cam}_{m_{max,i}}, z^{Cam}_{m_{max,i}}\big)^T$ is the coordinate of $m_{max,i}$ in the camera coordinate system. By substituting $s$ with $z^{Cam}_{m_{max,i}}$ in Eq. (3), the coordinate of $m_{max,i}$ is

$$\begin{bmatrix} x^{Cam}_{m_{max,i}} \\ y^{Cam}_{m_{max,i}} \\ z^{Cam}_{m_{max,i}} \end{bmatrix} = \begin{bmatrix} z^{Cam}_{m_{max,i}}\,\frac{u_i - c_u}{f_u} \\ z^{Cam}_{m_{max,i}}\,\frac{v_i - c_v}{f_v} \\ z^{Cam}_{m_{max,i}} \end{bmatrix}. \qquad (4)$$

Likewise, the coordinate of $m_{max,j}$ in the camera coordinate system is

$$\begin{bmatrix} x^{Cam}_{m_{max,j}} \\ y^{Cam}_{m_{max,j}} \\ z^{Cam}_{m_{max,j}} \end{bmatrix} = \begin{bmatrix} z^{Cam}_{m_{max,j}}\,\frac{u_j - c_u}{f_u} \\ z^{Cam}_{m_{max,j}}\,\frac{v_j - c_v}{f_v} \\ z^{Cam}_{m_{max,j}} \end{bmatrix}. \qquad (5)$$

The angle $\theta_{ij}$ (i.e., $\angle m_{max,i}\,C\,m_{max,j}$) can be obtained by

$$\theta_{ij} = \arccos\left( \frac{ x^{Cam}_{m_{max,i}}\,x^{Cam}_{m_{max,j}} + y^{Cam}_{m_{max,i}}\,y^{Cam}_{m_{max,j}} + z^{Cam}_{m_{max,i}}\,z^{Cam}_{m_{max,j}} }{ \big\|\big(x^{Cam}_{m_{max,i}}, y^{Cam}_{m_{max,i}}, z^{Cam}_{m_{max,i}}\big)\big\|\; \big\|\big(x^{Cam}_{m_{max,j}}, y^{Cam}_{m_{max,j}}, z^{Cam}_{m_{max,j}}\big)\big\| } \right). \qquad (6)$$

By substituting $x^{Cam}_{m_{max,i}}$, $y^{Cam}_{m_{max,i}}$, $x^{Cam}_{m_{max,j}}$, and $y^{Cam}_{m_{max,j}}$ using Eqs. (4) and (5), Eq. (6) becomes

$$\theta_{ij} = \arccos\left( \frac{ \left(\frac{u_i - c_u}{f_u}\right)\left(\frac{u_j - c_u}{f_u}\right) + \left(\frac{v_i - c_v}{f_v}\right)\left(\frac{v_j - c_v}{f_v}\right) + 1 }{ \sqrt{\left(\frac{u_i - c_u}{f_u}\right)^2 + \left(\frac{v_i - c_v}{f_v}\right)^2 + 1}\; \sqrt{\left(\frac{u_j - c_u}{f_u}\right)^2 + \left(\frac{v_j - c_v}{f_v}\right)^2 + 1} } \right). \qquad (7)$$

Since $\{M_{max}\}$ contains multiple feature matches, a nonlinear solver is applied to the resulting overdetermined problem to obtain the coordinate of the camera optical center $C$ of the query image as the localization result.
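As an illustration of this last step, the sketch below recovers $C$ with a generic nonlinear least-squares solver. The paper does not name its solver, so SciPy, the initial guess, and the residual form (a difference of cosines, following Eq. (2)) are assumptions:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import least_squares

def localize_camera(points_3d, angles, c0):
    """points_3d: (N,3) map coordinates of the matched features in {M_max};
    angles[(i, j)]: theta_ij computed from image coordinates via Eq. (7);
    c0: initial guess for the optical center C."""
    pairs = list(combinations(range(len(points_3d)), 2))

    def residuals(C):
        res = []
        for i, j in pairs:
            vi, vj = points_3d[i] - C, points_3d[j] - C
            cos_pred = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))
            res.append(cos_pred - np.cos(angles[(i, j)]))  # Eq. (2) residual
        return res

    return least_squares(residuals, c0).x

# Usage sketch: C = localize_camera(pts, thetas, c0=pts.mean(axis=0) + [0, 0, 1])
```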

3.2. Localization system in Hadoop MapReduce framework

The computation of the topological localization becomes heavy as the size of the feature map increases. Since SIFT feature matching between the query image and each database scan is independent, the matching is well suited to parallelization, which speeds up the localization service.

MapReduce [24] is a framework for processing large amounts of data in parallel on large clusters of computers. A job under the MapReduce framework divides the input data into parts that are independently processed by mappers. The outputs from the mappers are sorted and sent to the reducers.

The localization system proposed in this paper consists of two MapReduce iterations, as shown in Fig. 6. One detects ASIFT features in the query image received from a client, and the other performs the hierarchical localization algorithm.


Fig. 6. Localization system in MapReduce framework.


ASIFT feature detection [12] transforms an image by simulating possible affine distortions and detects SIFT features in the transformed images. Under the MapReduce framework, the affine transform and the SIFT detection for each transformed image can be performed by a mapper.

The input of the first MapReduce iteration is the set of combinations of tilt and rotation angles used in the affine transforms. Each mapper computes one affine transform and detects SIFT features in the transformed image. The reducer of the first MapReduce iteration collects all detected ASIFT features and stores them in HDFS.
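A minimal Hadoop Streaming-style sketch of this first-iteration mapper is given below, assuming OpenCV for SIFT; the image path, key format, and the simplified tilt simulation (full ASIFT also applies an anti-aliasing filter before subsampling) are illustrative assumptions, not the authors' implementation:

```python
#!/usr/bin/env python
# Streaming mapper for iteration 1: each stdin line is one "tilt,rotation" pair.
import sys
import cv2

QUERY = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # path is an assumption
sift = cv2.SIFT_create()

for line in sys.stdin:
    tilt, rot = map(float, line.split(","))
    h, w = QUERY.shape
    # simulate one affine distortion: rotate, then compress one axis by 1/tilt
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rot, 1.0)
    img = cv2.warpAffine(QUERY, M, (w, h))
    img = cv2.resize(img, (w, max(1, int(h / tilt))))
    kps, desc = sift.detectAndCompute(img, None)
    for kp, d in zip(kps, desc if desc is not None else []):
        # emit key\tvalue: keypoint position plus its descriptor
        print("%f,%f\t%s" % (kp.pt[0], kp.pt[1],
                             ",".join(str(int(v)) for v in d)))
```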

The second MapReduce iteration takes the indices of the scans in each environment as input. The ASIFT features detected in the previous iteration are retrieved from HDFS and copied as a cache file to each mapper. Each mapper in this iteration acquires several sets of SIFT features detected from scans and matches the SIFT features between the query image and each scan. The reducer adopts the winner-take-all strategy to find the scan most similar to the query image and completes the triangulation. The feature coordinates required for triangulation are acquired from the feature map stored in HDFS. The ASIFT features detected in the first MapReduce iteration are copied as a cache file because the number of ASIFT features is large, which greatly affects the latency of accessing them from HDFS. By employing a cache file at each mapper in the second MapReduce iteration, mappers can access the ASIFT features locally and reduce latency. The benefit of using cache files is shown in Section 4.3.
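The second-iteration mapper can be sketched in the same streaming style; the cache-file name, the scan storage layout, and the stand-in matcher (mutual nearest neighbors instead of the full RANSAC-filtered matching of Section 3.1) are all assumptions for illustration:

```python
#!/usr/bin/env python
# Streaming mapper for iteration 2: each stdin line is a scan index "n,i".
import sys
import numpy as np

query_desc = np.load("asift_cache.npy")  # local DistributedCache copy (assumed)

def load_scan_desc(n, i):
    """Stand-in for reading the descriptors of scan {F_n,i} from storage."""
    return np.load("scan_%d_%d.npy" % (n, i))

def match_count(dq, ds):
    """Stand-in matcher: counts mutual nearest neighbors in descriptor space."""
    d = np.linalg.norm(dq[:, None, :] - ds[None, :, :], axis=2)
    nn_q, nn_s = d.argmin(axis=1), d.argmin(axis=0)
    return int(sum(nn_s[j] == i for i, j in enumerate(nn_q)))

for line in sys.stdin:
    n, i = map(int, line.split(","))
    print("%d\t%d,%d" % (match_count(query_desc, load_scan_desc(n, i)), n, i))

# The reducer applies winner-take-all: the (n, i) with the largest match count
# is the topological result, after which triangulation (Section 3.1) runs.
```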

4. Experimental results

Three types of experiments are conducted to demonstrate the feasibility and performance of the proposed localization system in the Hadoop MapReduce framework. The first two evaluate the localization accuracy when query images are captured by different cameras, and the computation time of a localization service under different amounts of computing resources. The third demonstrates the benefit of detecting ASIFT features in the query image by comparing our localization hit rate to those of other approaches: a visual vocabulary-based approach [7] and general SIFT feature matching.

4.1. Environments and computing platform

The experiments are conducted in several indoor environments, including two houses and three offices. The first house is composed of one living room, one dining room, and one studying room. The second house is composed of one living room, one kitchen, one doorway, one corridor, two bedrooms, and one storage room.

Table 1
Dimensions of rooms in each environment.

Environment                 Dimensions (m)   Environment          Dimensions (m)
Living room (House A)       5.4 × 4.5        Office A             8.5 × 6.35
Dining room (House A)       3.3 × 2.1        Office B             6.3 × 4.2
Studying room (House A)     6.3 × 4.2        Office C             8.5 × 5.6
Living room (House B)       7.2 × 4.2        Kitchen (House B)    3.0 × 2.2
Single bedroom (House B)    3.8 × 2.6        Doorway (House B)    3.7 × 2.3
Double bedroom (House B)    4.0 × 3.2        Corridor (House B)   3.4 × 1.0
Storage room (House B)      3.4 × 2.2


Table 2
Localization error.

Camera                           Average error (m)   Standard deviation (m)
Kinect                           0.52                0.16
Logitech QuickCam Ultra Vision   0.57                0.18


A total of 276 scans are taken and 160,451 SIFT features are detected to construct the 3D feature map using the procedure described in Section 2. Table 1 lists the dimensions of the individual rooms in each environment.

The computing platform for the experiments is a 16-node Hadoop cluster interconnected by Gigabit Ethernet; each node contains a 2.34 GHz Intel quad-core processor and 8 GB of RAM [25]. Among the 16 nodes, one is the job tracker/namenode and the others are task trackers/datanodes. The namenode helps a client application locate the required data on the datanodes. The Hadoop platform is version 0.20.2.

4.2. Localization accuracy

100 photos randomly taken with a Kinect and a Logitech webcam in the test environments are used to evaluate the localization accuracy. In addition to the database scans captured in the test environments described in Section 4.1, images from the COsy Localization Database [8] are used to enlarge our database: part A of the COLD-Freiburg dataset on the extended path, part B of the COLD-Freiburg dataset on the standard path, the COLD-Ljubljana dataset on the extended path, part A of the COLD-Saarbrucken dataset on the extended path, and part B of the COLD-Saarbrucken dataset on the extended path. For each path, one image sequence captured under sunny weather is selected to supplement our database.

The localization errors and standard deviations for the different cameras are shown in Table 2. The cameras are calibrated in advance in order to obtain the focal lengths and the principal point needed to compute the relative angle in Eq. (7). Fig. 7 shows consecutive localization results using query images taken by the Kinect. The number in each circle of a given color indicates the order of the localization requests. Red circles represent the localization results of the proposed approach, and blue circles are the localization results of GraphSLAM; for interpretation of the colors, the reader is referred to the web version of this article. The white arrows are the poses at which the Kinect scans are taken.

4.3. Computation time

The computation time affects the response time of a localization service request. The computation time of the localization system consists of the time for executing the localization algorithm, the data transmission latency between different nodes in Hadoop, and the time for the MapReduce framework processes such as splitting, sorting, and shuffling.

The computation time varies according to the available computing resources in the cluster, the number of database images, and the method of data distribution to the mappers. The computation times of the localization system under different numbers of database images and mappers are provided in Fig. 8. In Fig. 8a, the number of mappers is 120, and the relation between computation time and the amount of processed data is shown as the number of database images increases. Fig. 8b illustrates the relation between computation time and the number of mappers when the number of database images is fixed at 10,748. The computation time does not decrease further once the number of mappers exceeds 120, because the best setting for the number of mappers lies between half and twice the total number of available CPU cores [24]. When computational resources are limited, increasing the number of mappers may not reduce the computation time.

A time analysis of each process in the localization algorithm is provided in Table 3. Since the ASIFT features of the query image are preserved in a cache file and there is no data transmission latency when the cache file is accessed, the access time for the ASIFT features can be made smaller than the time for accessing the relatively fewer SIFT features of a database image.

The scalability of the current MapReduce localization algorithm is affected by two factors. One is the initialization of each mapper, which becomes relatively costly when a mapper's lifespan is too short. The other is imbalanced computation across mappers. Because a mapper matches the query image against several database images and the cost of image matching depends on the complexity of the image texture, two mappers may require different amounts of time to finish even if they process the same number of images.

4.4. Comparison to visual dictionary-based approach and SIFT-based approach

The advantage of adopting ASIFT features for the query image is the increased probability that the query image is registered to one of the database images. The goal of the proposed localization system is to provide localization service anywhere in the modeled environment, not only at positions near the trajectory along which the database images were captured.




Fig. 7. Consecutive localization result.

Fig. 8. Computation time.

Table 3
Analysis of computation time.

Process                             MapReduce iteration   Time (s)
ASIFT feature detection             1st                   25–31
Get ASIFT from cache file           2nd                   0.5–2
Get SIFT of a database image        2nd                   3–5
SIFT feature matching and RANSAC    2nd                   0.5–2
Triangulation                       2nd                   3–6


Consequently, query images under consideration can appear very different from the most similar database image in terms of viewing angle, which generally causes SIFT matching to perform poorly.

In order to demonstrate the above benefit, 500 query images are captured at random poses in the test environments. In each environment, the feature map is constructed by rotating the Kinect in the middle of the environment to capture data. Fig. 9 shows the directions and positions at which the database scans are taken in the six largest environments from a bird's-eye view.



Fig. 9. The scans of six test environments (bird’s eye view).


In Fig. 9, the six environments are represented as colored point clouds in order to provide a better visualization. The white arrows in the middle of each environment point to the directions and positions of the data scans.

The localization hit rate, that is, the success probability of localization, is compared among the proposed localization approach, the visual vocabulary approach used in FAB-MAP [7], and general SIFT feature matching. SIFT matching is a common approach in vision-based localization [4]. FAB-MAP, based on a visual vocabulary, is used for global localization and loop closure detection in SLAM. A localization result is marked correct if there is an overlapping scene between the returned image and the query image. For the proposed method and the general SIFT feature matching method, the most similar database image is decided from the number of matched features with the winner-take-all strategy. FAB-MAP [7], on the other hand, developed a probabilistic approach using a visual vocabulary for appearance-based place matching. Given a query image, FAB-MAP computes the probability that the query image resembles a database image or was captured at

Table 4
Localization hit rate of three approaches.

Image matching approach   Hit rate (%)
ASIFT                     46.4
Visual vocabulary         21.4
SIFT                      27.6


an unvisited position. The database image associated with the highest probability is regarded as the closest database image. Table 4 shows the localization hit rates of the three approaches.

5. Conclusions and future work

This paper proposed a vision-based indoor localization system using the MapReduce framework. Compared to existing vision-based localization approaches, the proposed system tolerates larger image differences and successfully localizes by matching ASIFT features of the query image to SIFT features of the database images. The two processes with heavy computational cost, ASIFT feature detection in the query image and registration of the query image to the environment database, are performed in the MapReduce framework in order to speed up the response to a localization request. Our experiments show the performance improvement of the proposed localization system. A comparison of the localization hit rate among the proposed approach, the visual vocabulary method, and general SIFT feature matching is provided for environments modeled by numerous database images. To improve load balancing among MapReduce processes, a preliminary study that initiates tasks according to the number of features has shown promising results; a refined version will be incorporated into our system to improve the quality of the localization service.

References

[1] Wolf J, Burgard W, Burkhardt H. Robust vision-based localization by combining an image-retrieval system with Monte Carlo localization. IEEE Trans Robot 2005;21(2):208–16.
[2] Pretto A, Menegatti E, Jitsukawa Y, Ueda R, Arai T. Image similarity based on discrete wavelet transform for robots with low-computational resources. Robot Autonom Syst 2010;58(7):879–88.
[3] Leonard JJ, Durrant-Whyte HF. Mobile robot localization by tracking geometric beacons. IEEE Trans Robot Automat 1991;7(3):376–82.
[4] Wang J, Zha H, Cipolla R. Coarse-to-fine vision-based localization by indexing scale-invariant features. IEEE Trans Syst Man Cybernet Part B: Cybernet 2006;36(2):413–22.
[5] Rady S, Wagner A, Badreddin E. Hierarchical localization using entropy-based feature map and triangulation techniques. In: Proceedings of the IEEE international conference on systems, man and cybernetics, Istanbul; 2010. p. 519–25.
[6] Bacca B, Salvi J, Cufi X. Appearance-based mapping and localization for mobile robots using a feature stability histogram. Robot Autonom Syst 2011;59(10):840–57.
[7] Cummins M, Newman P. FAB-MAP: probabilistic localization and mapping in the space of appearance. Int J Robot Res 2008;27(6):647–65.
[8] Pronobis A, Caputo B. COLD: COsy localization database. Int J Robot Res 2009;28(5):588–94.
[9] Dayoub F, Duckett T. An adaptive appearance-based map for long-term topological localization of mobile robots. In: International conference on intelligent robots and systems, Nice; 2008. p. 3364–9.
[10] Csurka G, Dance CR, Fan L, Willamowski J, Bray C. Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision; 2004. p. 1–22.
[11] Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004;60(2):91–110.
[12] Yu G, Morel J-M. ASIFT: an algorithm for fully affine invariant comparison. Image Processing On Line; 2011.
[13] Cheng L, Li M, Liu Y, Cai W, Chen Y, Yang K. Remote sensing image matching by integrating affine invariant feature extraction and RANSAC. Comput Electr Eng 2012;38(4):1023–32.
[14] Srirama SN, Jakovits P, Vainikko E. Adapting scientific computing problems to clouds using MapReduce. Future Generat Comput Syst 2012;28(1):184–92.
[15] Okamoto Y, Oishi T, Ikeuchi K. Image-based network rendering of large meshes for cloud computing. Int J Comput Vis 2011;94(1):12–22.
[16] Bistry H, Zhang J. A cloud computing approach to complex robot vision tasks using smart camera systems. In: IEEE/RSJ 2010 international conference on intelligent robots and systems, Taipei; 2010. p. 3195–200.
[17] Andreasson H, Lilienthal AJ. 6D scan registration using depth-interpolated local image features. Robot Autonom Syst 2010;58(2):157–65.
[18] Henry P, Krainin M, Herbst E, Ren X-F, Fox D. RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments. In: Proceedings of the 12th international symposium on experimental robotics; 2010.
[19] Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 1981;24(6):381–95.
[20] Thrun S, Montemerlo M. The GraphSLAM algorithm with applications to large-scale mapping of urban structures. Int J Robot Res 2006;25(5–6):403–30.
[21] Wu C-C. SiftGPU: a GPU implementation of scale invariant feature transform (SIFT); 2007. <http://cs.unc.edu/ccwu/siftgpu>.
[22] MRPT: the Mobile Robot Programming Toolkit. <http://www.mrpt.org/>.
[23] Betke M, Gurvits L. Mobile robot localization using landmarks. IEEE Trans Robot Automat 1997;13(2):251–63.
[24] Hadoop: Apache Hadoop software library. <http://hadoop.apache.org/>.
[25] National Center for High-Performance Computing (NCHC). <http://hadoop.nchc.org.tw/>.

Tien-Ruey Hsiang is an assistant professor in the Department of Computer Science and Information Engineering, Taiwan Tech. His research interests include geometric algorithms in robotics, computer vision, wireless sensor networks, mobile networks, and cloud computing.

Yu Fu received his Ph.D. degree in Electrical Engineering from Taiwan Tech. He is now a project supervisor at AVerMedia.

Ching-Wei Chen is currently pursuing his master's degree in the Department of Computer Science and Information Engineering, Taiwan Tech. He is working on a cloud service platform with an emphasis on image localization.

Sheng-Luen Chung is a professor in the Department of Electrical Engineering, Taiwan Tech. His primary research focuses on context awareness, human behavior analysis and recognition, and interactive multimedia systems.
