

GPU-accelerated Hierarchical Panoramic Image Feature Retrieval for Indoor Localization

Feng Hu
The Graduate Center, CUNY
365 5th Ave, New York
New York, United States
[email protected]

Zhigang Zhu
The Graduate Center and City College, CUNY
138th St & Convent Ave
New York, United States
[email protected]

Jianting Zhang
The Graduate Center and City College, CUNY
138th St & Convent Ave
New York, United States
[email protected]

ABSTRACT
Indoor localization has many applications, such as commercial Location Based Services (LBS), robotic navigation, and assistive navigation for the blind. This paper formulates the indoor localization problem as a multimedia retrieval problem by modeling visual landmarks with a panoramic image feature, and calculating a user's location via a GPU-accelerated parallel retrieval algorithm. To solve the scene similarity problem, we apply a multi-image retrieval strategy and a 2D aggregation method to estimate the final retrieved location. Experiments on real data from a campus building demonstrate real-time responses (14 fps) and robust localization.

CCS Concepts
• Information systems → Information retrieval; Rank aggregation; • Theory of computation → Parallel algorithms; • Computing methodologies → Image representations;

Keywords
Indoor localization; environment modeling; feature indexing and retrieval; GPU acceleration; candidate aggregation

1. INTRODUCTION
Localization is an important task for human beings, since people always need to know where they are. As a subset of the general localization task, indoor localization is becoming a hot research area as more and more Location Based Services (LBS) emerge. Applications include museum guidance, shopping mall tour guides, and assistive localization for the visually impaired, to name a few.

For outdoor localization, Global Navigation Satellite Systems (GNSS), such as the Global Positioning System (GPS),

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICMR '16, June 6–9, 2016, New York, NY, USA
© 2020 ACM. ISBN ***-****-**-***/**/**. . . $15.00
DOI: **.***/***

Figure 1: An example of retrieval and aggregation results of a multi-frame omnidirectional feature of an input on a multi-path database

the Galileo system, and the BeiDou system, are widely used to provide accurate localization services worldwide, for example in vehicle navigation, running tracking, and many other smartphone-based LBS. However, due to the significant signal attenuation caused by construction materials and the microwave scattering by roofs, walls and other objects, using GNSS for indoor localization tasks is not practical.

Other sensor-based localization methods have been developed for indoor localization in recent years; RFID and WiFi are two frequently used ones. However, an RFID-based system, for example [1], requires a large number of tags to be installed in the environment before the user can use his/her localization device. WiFi-based localization methods [6] can achieve relatively high accuracy, but they depend on the number of access points available, and cannot be used in environments where few or no Internet routers are available.

Visual information based localization is attracting more and more interest in both the computer vision community and the robotics community, since visual information encodes very rich knowledge about the indoor environment. This visual information includes all the visible physical objects in the environment, for example, doors, windows, or notice boards. Even though the general problem of object recognition is not fully solved, and thus the object

arXiv:2006.08861v1 [cs.CV] 16 Jun 2020


recognition based localization method is not mature, there are still many other ways to exploit this visual information, for example, by extracting primitive geometric elements of the environment, such as vertical lines, to differentiate environmental landmarks and distinguish them for accurate localization.

In this paper, we propose a GPU-accelerated indoor localization approach, formulating the visual-based indoor localization problem as a multimedia retrieval problem in two stages. First, we model the indoor environment using the omnidirectional image feature proposed in [2]. Second, for any new omnidirectional image captured at an unknown position by the user, the user's location can be calculated in real time by retrieving the most similar image feature from the database via a GPU-accelerated parallel retrieval and candidate aggregation algorithm. Fig. 1 shows an example of the retrieval and aggregation result for an input omnidirectional feature, together with its neighboring features, against a multi-path indoor environment database. The red stars are the candidates retrieved by the GPU, the blue circle is the estimated location, and the green star is the ground truth location. The candidate density distribution, which is used to determine the final estimate, is also shown in the right picture. Details are provided in Section 3.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 explains the main retrieval and aggregation algorithms. Section 4 shows the experimental results on data from a real campus building. Section 5 concludes and points out some possible future directions.

2. RELATED WORK
Direct image retrieval entails retrieving those image(s) in a collection that satisfy a user's need, either using a keyword associated with the image, which forms the category of concept-based image retrieval, or using the contents (e.g. color, shape) of the image, which forms the category of content-based image retrieval [4]. In our work, we do not directly use images as the retrieval keys; instead, we extract and use a one-dimensional omnidirectional feature from the omnidirectional images.

The applications of content-based image retrieval include sketch-based image search [3], place recognition in changing environments [5], etc., but little research has been conducted on using content-based image retrieval for localization, especially indoor localization. One major reason is that to represent a specific location using images, all the visual information around this location has to be collected and stored to make the landmark distinguishable, which requires many images, instead of a single image, if a normal Field Of View (FOV) camera is used. Also, the features used for retrieval should depend on the location only, independent of the camera's orientation, but this is hard for normal FOV cameras: scene images are usually captured from different perspectives, and therefore generate different features even when the images are captured at the same location.

Using a GPU to accelerate the image retrieval process was explored in [7], where each image is divided into patches and SIFT features are detected and indexed for each patch before being stored into a database for retrieval. However, that work only uses a single GPU kernel and a one-level parallel algorithm, which does not exploit the parallelism of the data and is

Figure 2: Parallel retrieval with the GPU in the remote server

not efficient when the database scale increases. In our work, a hierarchical multi-level parallel algorithm is used, exploiting both database and algorithm parallelism with multiple GPU kernels.

3. ALGORITHMS
Before a localization system is available for use, all the landmarks of interest within the area where the localization service is provided shall be modeled with omnidirectional image features and stored in the remote server database. Once this is done, the task of localizing a user from newly captured images becomes the problem of finding the most similar image feature within the database. In this paper, we use the omnidirectional image feature proposed in [2] to model the indoor environment, and the same type of feature, extracted with the same procedure, is later used for retrieval. Since an omnidirectional image captures all the visual information around the user in a single shot, each location can be modeled with one omnidirectional feature, instead of the multiple features needed with a normal field of view camera [8]. The FFT magnitude of the one-dimensional omnidirectional vector is used as a rotation-invariant omnidirectional feature.
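The rotation invariance of this feature follows from a basic Fourier property: rotating the panoramic camera circularly shifts the one-dimensional omnidirectional signal, and a circular shift changes only the phase of its DFT, never its magnitude. A minimal NumPy sketch (the 360-sample signal length and function name are illustrative assumptions, not from the paper):

```python
import numpy as np

def omni_feature(signal):
    # FFT magnitude of a 1-D omnidirectional signal; a circular shift
    # (i.e. a camera rotation) multiplies each DFT bin by a unit-magnitude
    # phase factor, so the magnitude spectrum is rotation-invariant.
    return np.abs(np.fft.fft(signal))

signal = np.random.rand(360)     # e.g. one sample per degree of the panorama
rotated = np.roll(signal, 45)    # the same scene viewed 45 degrees rotated
assert np.allclose(omni_feature(signal), omni_feature(rotated))
```

This is why a single feature per location suffices: two users standing at the same spot but facing different directions produce the same feature vector.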

When the database scales up for a large scene, it is very time consuming to sequentially match the input frame features against all the images in the database. We therefore use a General Purpose GPU (GPGPU) to parallelize the query procedure and accelerate the query speed. Section 3.1 explains the GPU-accelerated parallel feature retrieval algorithm.

Since many man-made structures and objects look very similar, particularly in indoor environments, accurate localization using only a single image is still challenging. We therefore adopt a multiple-video-frame querying mechanism, extracting a sequence of omnidirectional features from a short video clip, which greatly reduces the probability of false positive retrievals. Section 3.2 discusses the multi-frame candidate aggregation algorithm.

3.1 GPU-accelerated parallel retrieval
Without a GPU, the straightforward algorithm is to sequentially compare the input feature vector with all the feature vectors in a database. This becomes problematic when the database scale increases, for example, to hundreds of thousands of features or more, while real-time performance is required. We therefore design a parallel feature retrieval algorithm with NVIDIA GPU acceleration.

The retrieval process has three levels of parallelism. First, we divide the search database into multiple subspaces, for example, counting each floor's data as one subspace, and all the subspaces can be searched in parallel with CUDA streams. Second, within each subspace, we process all the frames of an input query video clip in parallel, instead of retrieving with each input frame one after another. Third, we use multiple threads to compare each input frame with multiple database frames simultaneously, rather than comparing with them one by one.
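The three levels above can be expressed compactly on the CPU with NumPy broadcasting; the following is an illustrative stand-in for the CUDA implementation (function and variable names are ours, not the authors'), with the per-subspace loop playing the role of the streams and the broadcast distance matrix standing in for the per-thread comparisons:

```python
import numpy as np

def retrieve_top_n(queries, subspaces, n=15):
    """Brute-force nearest-feature search over all subspaces.

    queries:   (M, D) array, one omnidirectional feature per query frame
    subspaces: list of (|n_i|, D) arrays, one per floor/subspace
    Returns, per subspace, the indices of the top-n closest database
    frames for every query frame.
    """
    results = []
    for db in subspaces:              # level 1: one CUDA stream per subspace
        # levels 2+3: all M query frames vs. all |n_i| database frames at once
        d = np.linalg.norm(queries[:, None, :] - db[None, :, :], axis=2)
        results.append(np.argsort(d, axis=1)[:, :n])
    return results

# toy usage: 3 query frames, 2 subspaces of 100 frames each, 16-D features
rng = np.random.default_rng(0)
q = rng.random((3, 16))
dbs = [rng.random((100, 16)), rng.random((100, 16))]
top = retrieve_top_n(q, dbs, n=5)
assert len(top) == 2 and top[0].shape == (3, 5)
```

On the GPU the same work is distributed rather than looped, but the data layout (subspace × query frame × database frame) is identical.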

A diagram of this multi-level parallel retrieval process is illustrated in Fig. 2. When a client sends a localization request for a frame q_m, it actually sends the (M−1)/2 frames before q_m and the (M−1)/2 frames after q_m to the server as well (e.g. M = 3 or 5). The CPU of the server copies each frame's feature to the GPU, and receives the top N candidates calculated by the GPUs for each frame after processing. The database is divided into n_s subspaces, and subspace n_i has |n_i| modeled frames. All the subspaces are searched in parallel. Take the frame q_m as an example: q_m is compared with all the frames f_t^i (t = 1, 2, ..., |n_i|) within each subspace i (i = 1, 2, ..., n_s). The same procedure is carried out for all frames. Finally, all the returned candidates of the M query frames are returned to the CPU for aggregation.

The detailed algorithm is shown in Algorithm 1. In the initialization step, the server receives the data from the client and prepares for retrieval. In Step 2, for each retrieval, the server selects not only the current testing frame but also multiple other frames near it. Then, all the selected frames are copied to each CUDA stream one by one (Step 5), where the database of each subspace is stored and computation is handled concurrently. Within each stream there are multiple threads, e.g. 32×256 threads, and each thread is responsible for comparing a query frame with a database frame (Step 7). Streams are executed simultaneously, and all the threads within each stream are also carried out in parallel. CUDA synchronization is called to make sure all the tasks within each thread (Step 9) and all the work in each stream (Step 12) are finished before the program moves to the next step. Finally, the GPU collects the returned candidates.

3.2 Multi-frame candidate aggregation
For many indoor places, such as narrow corridors, modeling the environment by video-taping a single walk-through is sufficient. However, to cover large areas, such as wide corridors or rooms, multiple paths of video capture are needed. Also, after the multi-frame retrieval process, a large number of position candidates are returned by the GPU, so a candidate aggregation algorithm is needed to aggregate all these candidates into a final location estimate to be delivered to the user. In the ideal case, all the candidates would be located at the position where the input image was captured. However, for various reasons (noise, scene similarity, etc.), some candidates may be far away from the ground truth position, even though the majority will cluster around the ground truth area. In this

Algorithm 1: GPU multi-stream parallel retrieving

Data: M query frames received from CPU
Result: Location candidates FinalResult calculated by GPU
1  initialization;
2  selectNearbyFrames();
3  for each selected frame qm do
4      for each CUDA stream do
5          memoryCopyFromCPUtoGPU(qm);
6          for each CUDA thread do
7              calculateDistance(qm, fi);
8          end
9          synchronizeThreads();
10     end
11     synchronizeStream();
12     Results = selectTopCandidates();
13     memoryCopyFromGPUtoCPU(Results);
14 end

Algorithm 2: 2D multi-frame localization candidates aggregation

Data: Location candidates in the floor plan coordinate system
Result: Aggregated final location coordinate FinalPosition
1  initialization;
2  for each candidate Ci do
3      Accumulate locationDistriArray(Ci);
4  end
5  rankLocationDistribution(locationDistriArray);
6  TopC = findTopDensityArea(locationDistriArray);
7  for index < NumOfTopC do
8      if topDensityArea(index) > NumOfAllCandidates × tolerPer then
9          FinalPosition = topDensityArea(index);
10         break;
11     end
12 end

paper, we design and implement a 2D multi-frame aggregation algorithm that uses the density of the candidates' distribution to calculate the most likely location estimate. The algorithm is shown in Algorithm 2.

In the initialization step, the candidates' indexes in the modeling images are mapped to their actual 2D floor plan coordinates by checking the geo-referenced mapping table pre-generated while modeling the environment. In our experiments, we discretize the floor plan coordinates by setting the units of the x (west to east) and y (south to north) directions to 30 cm, which is identical to the ground tiles' width. Then in Step 2, we count each tile's number of candidates with a 2D binning operation, and obtain the location distribution array locationDistriArray. We sort this array in descending order in Step 5, and obtain the first TopC (currently TopC = 10) tiles in Step 6. In a perfect case, the top tile of these TopC candidates would be the best estimate, since the largest number of candidates fall into it. This is, however, not always the


case, because the tile with the most candidates may be a false positive, and its nearby tiles may have very few candidates. To increase the robustness of the estimate, in Step 7 we set a tolerance circle around each tile, with a radius of 3 m in our experiments, and count the total number of candidates within this circle. If the count is greater than a threshold percentage tolerPer (e.g. tolerPer = 20%) of the total number of candidates, the corresponding tile is the final position, and is returned to the user.

4. EXPERIMENTS
Experiments on a database of an eight-floor building are carried out to demonstrate the capability of the proposed system, with real-time responses (14 fps) and robust localization results. In the experiment described below, a whole floor of a campus building is modeled with multiple paths (5 parallel paths in this case), and we test the parallel retrieval and aggregation algorithms for localization. We first capture video sequences of the entire floor along these five paths as the training databases, and then capture two other video sequences as the testing data sets.

For each testing frame, we query the entire training database and find the most similar stored features, i.e. those with the smallest Euclidean distance. Many scenes in the environment may be very similar, so the returned feature with the smallest distance may not necessarily be the real match. We therefore select the top N (N = 15 in this experiment) features as candidates. Also, to address the scene similarity problem, for each query frame we use not only the frame itself as the query key, but also M−1 (M = 11) other frames, (M−1)/2 before and (M−1)/2 after it, to generate candidates. We test all these frames against all P (P = 5) paths, whose frames are all labeled in the floor plan coordinate system, and retrieve all the candidates. Consequently, there are N × M × P candidates in total.
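The candidate budget implied by these parameters can be checked with a toy NumPy sketch (the 200 database frames per path are a made-up number for illustration; N, M, and P are the paper's values):

```python
import numpy as np

N, M, P = 15, 11, 5                     # top-N, query frames, modeled paths
rng = np.random.default_rng(1)

# toy distances of the M query frames to 200 database frames in each path
dists = rng.random((M, P, 200))
top_n = np.argsort(dists, axis=2)[:, :, :N]   # top-N per frame per path

assert top_n.shape == (M, P, N)
assert top_n.size == N * M * P == 825   # total candidates, as in Fig. 1
```

This 825 matches the number of red-star candidates plotted in Fig. 1 for a single localization request.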

Fig. 1 shows an example of the result. The red stars are the 825 location candidates of a query frame and its 10 related frames. Note that several frames may return the same location, so each red star may represent more than one candidate. In the left image of Fig. 1, the coordinate system has the same scale as the floor plan. Each unit in both the horizontal and vertical directions corresponds to 30 cm in physical space. The green star shows the ground truth location and the blue circle shows the final estimate for the testing frame. The distribution of the candidates is also illustrated in the right plot.

The x and y coordinates in the right figure of Fig. 1 are the floor plan coordinates, and the height of each point represents the number of candidates falling at that position. Since all the modeling frames captured near the testing frame have similar contents and thus similar features, the majority of the candidates fall near the location where the testing frame was captured. In this example, as shown in Fig. 1, the testing frame's ground truth location is in the top left corner of the floor plan (the green star), and, as we can see, the majority of the candidates fall in that corner area. The rightmost corner of the right figure of Fig. 1 also shows that the number of candidates (represented by the height) near the ground truth position is larger than elsewhere. A video demo of the whole experiment can be found at this link 1, visualizing the accuracy and robustness

1https://youtu.be/gPGGcAfFsvY

of the parallel retrieval and aggregation algorithms.

5. CONCLUSION AND FUTURE WORK
This paper proposes a framework that converts the traditional visual information based indoor localization problem into a GPU-accelerated image and video retrieval problem. An omnidirectional image feature is used to model each environmental landmark, and is stored on a remote server for retrieval/localization. To solve the scene similarity problem and ensure real-time performance, we propose a multi-frame retrieval technique and a 2D candidate aggregation algorithm. In future work, multiple modalities (e.g. Bluetooth beacons + images) can be integrated to resolve the scene repetition problem. Also, more testing on larger scale scenes, for example at campus scale with multiple buildings, can be carried out toward a more practical localization solution.

6. ACKNOWLEDGMENTS
The authors would like to thank the US National Science Foundation (NSF) Emerging Frontiers in Research and Innovation (EFRI) Program for its support under Award No. EFRI 1137172. The authors would like to thank Jeury Mejia for his assistance in collecting data, developing mobile phone apps and running the experiments. Thanks are also given to Dr. Hao Tang at BMCC/CUNY for his valuable discussions.

7. REFERENCES
[1] G. Cicirelli, A. Milella, and D. Di Paola. RFID tag localization by using adaptive neuro-fuzzy inference for mobile robot applications. Industrial Robot: An International Journal, 39(4):340-348, 2012.

[2] F. Hu, Z. Zhu, and J. Zhang. Mobile panoramic vision for assisting the blind via indexing and localization. In Computer Vision - ECCV 2014 Workshops, pages 600-614. Springer, 2014.

[3] C. Jin, Z. Wang, T. Zhang, Q. Zhu, and Y. Zhang. A novel visual-region-descriptor-based approach to sketch-based image retrieval. In International Conference on Multimedia Retrieval, ICMR '15, pages 267-274, Shanghai, China, 2015.

[4] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: Image search with relative attribute feedback. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2973-2980. IEEE, 2012.

[5] D. Mishkin, M. Perdoch, and J. Matas. Place recognition with WxBS retrieval. In CVPR 2015 Workshop on Visual Place Recognition in Changing Environments, 2015.

[6] N. Paisios. Mobile Accessibility Tools for the Visually Impaired. PhD thesis, New York University, 2012.

[7] P. Sattigeri, J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias. Implementation of a fast image coding and retrieval system using a GPU. In Emerging Signal Processing Applications (ESPA), 2012 IEEE International Conference on, pages 5-8. IEEE, 2012.

[8] B. Taylor, D.-J. Lee, D. Zhang, and G. Xiong. Smart phone-based indoor guidance system for the visually impaired. In Control Automation Robotics Vision (ICARCV), 2012 12th International Conference on, pages 871-876, Dec 2012.