
An Adaptive Similarity Search in Massive Datasets

Trong Nhan Phan 1, Josef Küng 1, and Tran Khanh Dang 2

1 Institute for Application Oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria
{nphan,jkueng}@faw.jku.at

2 Faculty of Computer Science and Engineering, HCMC University of Technology, Ho Chi Minh City, Vietnam
[email protected]

Abstract. Similarity search is an important task in different fields of study as well as in various application domains. The era of big data, however, has been posing challenges to existing information systems in general and to similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search in massive datasets with MapReduce. Additionally, our proposed scheme is both applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. Moreover, we embed our collaborative refinements to effectively eliminate irrelevant data objects as well as unnecessary computations. Furthermore, we evaluate our proposed methods with two different document models known as shingles and terms. Last but not least, we conduct intensive empirical experiments not only to verify these methods themselves but also to compare them with a previous related work on real datasets. The results confirm the effectiveness of our proposed methods and show that they outperform the previous work in terms of query processing.

Keywords: Similarity search · Massive datasets · Scalability · Adaptivity · Collaborative filtering · Cosine · MapReduce · Hadoop

1 Introduction

The essential role of similarity search has been recognized not only in diverse fields of study such as machine learning, data mining, clustering, and information retrieval but also in a wide range of applications and processes such as duplicate detection, decision support systems, search engines, and data clustering, to name a few. Its main objective is to look for objects that are potentially similar to one another. There are different kinds of similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbor (k-NN) queries [13, 14].

In general, similarity search proceeds in two main phases: (1) the candidate generation phase; and (2) the verification phase. The former produces candidate pairs that are potentially similar, while the latter verifies which pairs are truly similar by their similarity scores. The similarity search task, unfortunately, is time-consuming. For instance, a naive pairwise similarity search in which all possible objects are

© Springer-Verlag Berlin Heidelberg 2016. A. Hameurlain et al. (Eds.): TLDKS XXIII, LNCS 9480, pp. 45–74, 2016. DOI: 10.1007/978-3-662-49175-1_3


considered and computed for their self-join similarity yields a quadratic complexity O(n²). Such a high cost demands either better innovations or further improvements in similarity computing.

The issue has received much attention from both academia and industry worldwide. Various indexes and approximate but efficient approaches have been proposed to deal with it [5, 6]. Nevertheless, it becomes more challenging than ever in the era of big data. With the amount of data rapidly increasing, traditional processing mechanisms are under high pressure with respect to their effectiveness and efficiency. Consequently, the state of the art tends to exploit parallelism, either by optimizing parallel algorithms [1] or by deploying computations on a parallel paradigm like MapReduce [4, 8, 12, 19, 22], to improve large-scale similarity search as dataset sizes never stop growing.

Aware of this trend and motivated by the state of the art, we propose an adaptive similarity search in large data collections with MapReduce. This paper is an extension of our work [14]. Our goal is to achieve efficient large-scale processing of big data volumes. Our main contributions are summarized as follows:

1. We present a general similarity search scheme aimed at scalability and embed into it collaborative strategic refinements, which greatly reduce the candidate size and thereby eliminate unnecessary computations and costs.

2. We effectively implement the proposed scheme with the MapReduce paradigm, which supports large-scale data processing.

3. We show that the proposed scheme flexibly adapts itself to well-known similarity searches including pairwise similarity, search-by-example, range search, and k-Nearest Neighbor search.

4. These methods are consolidated by empirical experiments with real datasets from DBLP [7] and Gutenberg [16] on the Apache Hadoop framework [3]. In addition, we employ two document models known as terms and shingles to evaluate our proposed methods. Furthermore, these methods are evaluated and compared to the related work in [10], which shows how beneficial they can be when processing large amounts of data.

The rest of the paper is organized as follows: Sect. 2 reviews related work that is close to our approach. Section 3 introduces basic concepts associated with our current work. Next, we propose the general similarity search scheme in Sect. 4 and show how the scheme applies to diverse similarity search cases in Sect. 5. Relevant experiments and analyses are then given in Sect. 6. Finally, we discuss some challenges as well as open issues for our research in Sect. 7 before making our final remarks in Sect. 8.

2 Related Work

Due to the importance of similarity search, much of the literature has responded to the call for its improvement against newly imposed challenges, whilst traditional mechanisms are unable to react suitably and gradually become outdated. Fenz et al. present an efficient similarity search in very large string sets [11]. They propose a state set index


based on a prefix index. The state set index is interpreted as a nondeterministic finite automaton. Each character of a string is mapped to a state, and the last character defines an accepting state. Besides, they use edit distance with equal weights for operations and tune the parameters of the labeling alphabet size and the index length. Nevertheless, their approach is sequential and considers a set of strings instead of document objects, unlike our approach with MapReduce.

Xiao et al. introduce efficient similarity joins for near-duplicate detection [23]. They propose an exact similarity join algorithm, together with a positional filtering principle combined with both prefix and suffix filtering, to detect near-duplicates. Their approach, however, does not take parallelism into account, which may limit its capability to process big data volumes.

Zhang et al. present a unified approximate nearest neighbor search scheme combining a data structure and hashing [24]. In their approach, they employ the pruning strategy from a k-means clustering tree and fast distance computation via Hamming distance. Their goal, however, addresses only k-Nearest Neighbor queries. Moreover, these methods are implemented without any parallel mechanism.

Meanwhile, Alabduljalil et al. present optimized parallel algorithms for computing exact all-pairs similarity search [1]. The authors propose a hybrid indexing that combines forward indexing and inverted indexing, on which the similarity computation is performed. In addition, they develop a partitioning method for static filtering and parallelism. The basic idea is to ensure that dissimilar objects are in different partitions. Though their methods are compatible with the MapReduce paradigm, only mappers are actually involved. Besides, they introduce a circular assignment that assigns tasks computing the similarity between partitions to remove unwanted I/O and computations early. Nevertheless, they assume that the normalization for the Cosine measure is already done before computing the similarity scores. We believe that this missing normalization step is important to handle effectively due to its high extra overheads.

Vernica et al. introduce efficient parallel set-similarity joins using MapReduce [22]. They propose a 3-stage approach for the self-join case: (1) build a list of word frequencies in increasing order; (2) generate a list of record-ID pairs; and (3) output the pairs of records. Moreover, they also extend their approach to set joins and balance the workloads based on term frequencies in a round-robin manner. Nevertheless, the duplicate values in each Map job appear redundant, and how the similarity score is calculated is not clearly shown.

Elsayed et al. present pairwise document similarity in large collections with MapReduce [10]. Their main aim is to employ the MapReduce paradigm to compute pairwise similarity by accumulating the inner product of term frequencies between a pair as follows: (1) building a standard inverted index in which each term is associated with a list of documents to which it belongs and its corresponding term frequency; and (2) calculating and summing all of the individual values of a pair to generate its final similarity score. The approach resembles the Cosine measure in that the inner product of term frequencies between a pair of documents is used to produce the similarity score. Normalization and strategic filtering, however, are not addressed as they are in our approach. Moreover, there is redundancy in calculating the inner product of all pairs when given a query. In other words, the proposed method does not


make the best use of the query to avoid unnecessary computation, as our proposed methods do.

Li et al. show batch text similarity search with MapReduce [12]. They propose a two-phase approach, briefly as follows: (1) generate the word frequency dictionary, generate vectors of all texts in the database according to the word frequency dictionary, and then generate the PLT inverted file; and (2) transform the query text into vector texts, calculate the prefix for each vector text, and finally match the texts which meet the requirement in the PLT inverted file. The basic idea is to first build a word frequency dictionary. Each input is converted into vector texts by reference to the dictionary. Prefixes of each vector text are then generated and stored in a PLT inverted file of the form <word, textid, length, threshold value>. Whenever there is a query text search, the query text is transformed into vector texts which are then processed for their prefixes. In the end, words in each prefix are searched in the PLT inverted file to find the text pairs that satisfy the given similarity threshold. Unfortunately, this approach consumes lots of computation and produces large numbers of prefixes, which easily slows down the whole system for large datasets.

De Francisci Morales et al. propose an approach known as scaling out all-pairs similarity search with MapReduce [8]. They build inverted indexes from documents and use the Cosine measure as a metric. In addition, they eliminate some terms based on a threshold and pruning techniques. Moreover, these eliminated terms are later retrieved, or distributed to the reduce phase, to contribute to the final similarity scores. The normalization phase, however, is not mentioned.

3 Preliminaries

3.1 Concepts

A workset Ω consists of a set of N documents Di, which is represented as Ω = {D1, D2, D3, …, Dn}, and each document Di is composed of a set of words termk, which is shown as Di = {term1, term2, term3, …, termk}. In general, each document Di may share its terms with others, and we define common terms as those contained in all the considered documents in the workset Ω. Meanwhile, each termk has its own term frequency tfik, defined as the number of times the termk occurs in the document Di. The inverse document frequency idfik shows how popular a termk of a document Di is across all the documents. In addition, there is another way to represent a document: by a set of K-shingles or n-grams [17, 20]. Given a document Di as a string of characters, K-shingles are defined as any substring of length K found in the document. This concept is exploited in the field of natural language processing to represent documents and avoid the mismatch when two documents share the same number of terms but at different positions. With this model, a document Di is alternatively represented by a set of shingles such as Di = {SH1, SH2, …, SHk}, and the length of a document is the total number of shingles belonging to the document. Last but not least, the sign [,] indicates a list, the sign [[,], [,]] denotes a list of lists, the sign [,]ord denotes an ordered list, and (u.v) gives the inner product between u and v.


In this paper, we utilize the Cosine measure, which is popular and employed by thework in [1, 4, 10, 23], to compute the similarity between a pair of documents Di and Dj,whose formulae are defined as follows:

$$ sim(D_i, D_j) = \sum_{k=1}^{t} W_{ik} \cdot W_{jk} \quad (1) $$

$$ W_{ik} = \frac{tf_{ik} \cdot \log\frac{N}{n_k}}{\sqrt{\sum_{k=1}^{t}\left[(tf_{ik})^2 \cdot \left(\log\frac{N}{n_k}\right)^2\right]}} \quad (2) $$

In Eqs. (1) and (2), nk represents the total number of documents sharing the same termk, and idfik is computed as idfik = log(N/nk). All of the documents, however, have to be normalized before being further processed. We call Wik the normalized weight of termk in the document Di, which is computed by Eq. (2). The purpose of integrating normalization is to prevent large documents from overwhelming small ones and to make the similarity scores fall into the interval [0, 1], which is easy for humans to interpret. Two documents are similar when their similarity score is close to 1, and vice versa. Besides, bringing normalization into the processing is realistic, not an assumption, in the context of big data because of its computation costs. Last but not least, we also exploit an inverted index, which maps a termk to the document Di to which it originally belongs, in order to speed up the subsequent processing.
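For illustration, Eqs. (1) and (2) can be sketched directly (a base-10 logarithm is assumed for idf, matching the worked idf values in the running example later in the paper; the function names are ours):

```python
import math

def normalized_weights(doc_tf, df, n_docs):
    """Eq. (2): tf-idf weights of one document, L2-normalized so that
    similarity scores fall into [0, 1]. doc_tf: term -> tf_ik; df: term -> n_k."""
    w = {t: tf * math.log10(n_docs / df[t]) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

def cosine(wi, wj):
    """Eq. (1): inner product over the terms shared by two normalized vectors."""
    return sum(wi[t] * wj[t] for t in wi.keys() & wj.keys())
```

With normalization done up front, the self-similarity cosine(w, w) of any document evaluates to 1, the upper end of the [0, 1] interval.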

3.2 MapReduce Paradigm

MapReduce is a parallel programming paradigm which targets many large-scale computing problems [9]. The basic idea is to divide a large problem into independent sub-problems which are then tackled in parallel by two operations known as Map and Reduce. Its mechanism is deployed on commodity machines, one of which acts as the master node while the others serve as worker nodes. The master delivers m Map jobs and r Reduce jobs to workers. Workers assigned Map jobs are called mappers, whilst workers assigned Reduce jobs are called reducers. In addition, Map jobs are specified by a Map function while Reduce jobs are defined by a Reduce function.

An overview of the MapReduce paradigm is illustrated in Fig. 1, where there are m mappers and r reducers. Each mapper has its local data in order to store the intermediate key-value pairs. Before reaching reducers, the intermediate key-value pairs are shuffled and sorted by their keys. Generally, the single flow of MapReduce can be briefly described as follows:

1. The input is partitioned in a distributed file system (e.g., Hadoop Distributed File System – HDFS) [3], which produces key-value pairs of the form [key1, value1];

2. Mappers execute the Map function to generate intermediate key-value pairs of the form [key2, value2];


3. The shuffling process groups these pairs into [key2, [value2]] according to the keys;
4. Reducers execute the Reduce function to output the result;
5. The result is finally written back into the distributed file system.
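The five-step flow above can be mimicked in-process; the sketch below uses the classic word-count example (not the paper's similarity pipeline) to show the map, shuffle/sort, and reduce stages:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # Step 2: emit one [key2, value2] pair per word occurrence.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Step 4: aggregate all values grouped under the same key.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Steps 2-4 of the flow: map, shuffle/sort by key, then reduce.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))          # shuffle/sort
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reduce_fn(key, (v for _, v in group)))
    return dict(out)

counts = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
```

In a real deployment, Hadoop performs the partitioning, shuffling, and result write-back across machines; this single-process version only illustrates the data flow.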

4 The Proposed Scheme

In this section, we propose an overview scheme that derives similarity scores between pairs of documents with MapReduce. From a general point of view and for simplicity, we first present the scheme for the traditional self-join case without any query parameters, in which we want to find pairwise similarity. Other specific cases following the scheme are presented in Sect. 5 of the paper. As illustrated in Fig. 2, the whole process consists of four MapReduce phases. Moreover, each phase is equipped with filtering strategies in order to eliminate dissimilar pairs and reduce overheads including storage, communication, and computing costs. Brief descriptions of these phases, along with the filtering strategies, are given as follows:

Fig. 1. The overview of MapReduce

• Phase 1 (MapReduce-1): Building the customized inverted index. At the first MapReduce phase, sets of documents known as worksets are the inputs used to build the customized inverted index. The data input is split into chunks which are then processed in the form of key-value pairs. In addition, the Prior Filter is applied to discard common words. The reason is that they contribute nothing to the final similarity score but burden the whole process.

• Phase 2 (MapReduce-2): Normalizing candidate pairs. At this phase, the customized inverted index is normalized. In parallel, Query Term Filtering and Lonely Term Filtering are applied to filter out, respectively, terms which are not in the given query document and terms which exist in only a single document. In addition, the key-value pairs are ranked in descending order of their values.

• Phase 3 (MapReduce-3): Building the normalized inverted index. The key-value pairs from MapReduce-2 are then fed to MapReduce-3 so that the normalized inverted index is generated. Besides, Pre-pruning-1 is applied to reduce the candidate size when a query document is given.

• Phase 4 (MapReduce-4): Computing similarity pairs. In the final phase, the normalized inverted index is employed to compute the similarity between pairs. Again, this phase filters candidate pairs according to specific query strategies, like Range Query Filtering for range queries or k-NN Query Filtering for k-Nearest Neighbor (k-NN) queries, before outputting similarity pairs. Moreover, it is worth noting that Pre-pruning-2 is utilized to reduce the candidate size at the Map task of this phase before the similarity score is actually calculated. More details of each phase are given in Sect. 5 of the paper, depending on the specific similarity search cases.

Fig. 2. The overview scheme [14]


The overall MapReduce operations are summarized in Table 1. In general, let Di be the ith document of the workset, termk be the kth word of the whole workset, tfik be the term frequency of the termk in the document Di, idfik be the inverse document frequency of the termk in the document Di, Wik be the normalized weight of the termk in the document Di, Mi be the total weight of all the terms in the document Di, Wi be the largest weight of any termk in the document Di, and SIM(Di, Dj) be the similarity score between a document pair. A special character, e.g., @, is employed to semantically separate the sub-values in the values of a pair. More specifically, the intermediate key-value pairs after the Map-1 method are of the form [termk, Di], which are then fed to the Reduce-1 method so that we acquire the customized inverted index of the form [termk, [Di@tfik@idfik]]. In order to normalize the weight of the termk in the document Di, the Map-2 method is in charge of emitting its intermediate key-value pairs of the form [Di, termk@tfik@idfik], and then the Reduce-2 method executes the normalization process according to Eqs. (1) and (2) before outputting an ordered list of the form [Di, [termk@Wik]]ord. After that, the Map-3 method builds the normalized inverted index from the ordered list by emitting its intermediate key-value pairs of the form [termk, Di@Mi@Wi@Wik], and the Reduce-3 method processes them and outputs the ordered key-value pairs of the form [termk, [Di@Mi@Wi@Wik]]ord. Finally, the Map-4 method computes the partial product of each corresponding pair and emits the intermediate key-value pairs of the form [Dij, (Wik.Wjk)]. After that, Reduce-4 aggregates the final similarity score of each pair, which has the output of the form [Dij, SIM(Di, Dj)].

5 Similarity Search Cases

The proposed scheme is applicable not only to popular similarity searches like pairwise similarity and search-by-example but also to those with query strategies such as range search and k-NN search. In each subsection below, we show in detail how it applies to the specific similarity searches.

Table 1. The overall MapReduce operations [14]

Task | Input | Output
MAP-1 | [worksets] | [termk, Di]
REDUCE-1 | [termk, Di] | [termk, [Di@tfik@idfik]]
MAP-2 | [termk, [Di@tfik@idfik]] | [Di, termk@tfik@idfik]
REDUCE-2 | [Di, termk@tfik@idfik] | [Di, [termk@Wik]]ord
MAP-3 | [Di, [termk@Wik]]ord | [termk, Di@Mi@Wi@Wik]
REDUCE-3 | [termk, Di@Mi@Wi@Wik] | [termk, [Di@Mi@Wi@Wik]]ord
MAP-4 | [termk, [Di@Mi@Wi@Wik]]ord | [Dij, (Wik·Wjk)]
REDUCE-4 | [Dij, (Wik·Wjk)] | [Dij, SIM(Di, Dj)]


5.1 Pairwise Similarity

Pairwise similarity search is the case in which we want to find all possible similar pairs. In other words, each document is paired with every other to give their similarity. Following the scheme, worksets are initially passed to mappers at the Map-1 method, which produces intermediate key-value pairs of the form [termk, Di]. They are then retrieved by reducers at the Reduce-1 method to output the key-value pairs of the form [termk, [Di@tfik@idfik]], where tfik and idfik are derived. At this step, common words whose idfik equals 0 are discarded by the Prior Filter. For example, assume that there are three documents named D1, D2, and D3, and each document contains its corresponding words as in the input illustrated in Fig. 3. After the Map-1 method, we have a list of intermediate key-value pairs [[A, D1], [B, D1], [B, D1], [C, D1], [A, D1], [E, D1], [C, D2], [A, D2], [D, D3], [B, D3], [A, D3], [E, D3]]. The list is then accessed by reducers at the Reduce-1 method. The common word A is discarded by the Common Term Filtering while the lonely word D is marked as Terms Not Proceeded-{TNP}. The reason why the lonely word D is not discarded right away but marked with a special sign at this phase is that it should keep joining the normalization step later on, even though it does not contribute to any similarity score in the end. Therefore, we have the output list of key-value pairs as follows: [[B, [D1@[email protected], D3@[email protected]]], [C, [D1@[email protected], D2@[email protected]]], [{TNP}, [D3@[email protected]]], [E, [D1@[email protected], D3@[email protected]]]].
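The first phase on the running example can be sketched as an in-memory stand-in for Map-1/Reduce-1 (a base-10 logarithm is assumed for idf, which reproduces the 0.176 and 0.477 values above; the function name is ours):

```python
import math
from collections import defaultdict

def mapreduce_1(worksets):
    """Build the customized inverted index [term, [(doc, tf, idf)]].
    Common terms (idf == 0) are dropped by the Common Term Filtering;
    lonely terms are kept but re-keyed '{TNP}' so they still join the
    normalization step later on."""
    # Map-1: emit one (term, doc) pair per word occurrence.
    pairs = [(t, d) for d, words in worksets.items() for t in words]
    # Shuffle: group postings by term, accumulating tf_ik.
    postings = defaultdict(lambda: defaultdict(int))
    for term, doc in pairs:
        postings[term][doc] += 1
    # Reduce-1: attach idf_ik and apply the filters.
    n = len(worksets)
    index = {}
    for term, docs in postings.items():
        idf = math.log10(n / len(docs))
        if idf == 0:                                 # Common Term Filtering
            continue
        key = term if len(docs) > 1 else "{TNP}"     # lonely-term marker
        index.setdefault(key, []).extend(
            (doc, tf, round(idf, 3)) for doc, tf in docs.items())
    return index

docs = {"D1": ["A", "B", "B", "C", "A", "E"],
        "D2": ["C", "A"],
        "D3": ["D", "B", "A", "E"]}
index = mapreduce_1(docs)
```

Running this on the three example documents drops the common word A and re-keys the lonely word D under {TNP}, matching the Reduce-1 output listed above.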

Fig. 3. MapReduce-1 operation [14]

Next, the key-value pairs from the first MapReduce are normalized in the second MapReduce. The intermediate key-value pairs after the Map-2 method have the form [Di, termk@tfik@idfik]. The Reduce-2 method normalizes these pairs into an ordered list of the form [Di, [termk@Wik]]ord. The values are sorted by their sizes and then by their Wik. Figure 4 shows the ongoing example at the second MapReduce. The mappers at the Map-2 method output the intermediate key-value pairs as the list [[D1, B@[email protected]], [D3, B@[email protected]], [D1, C@[email protected]], [D2, C@[email protected]], [D3, {TNP}@[email protected]], [D1, E@[email protected]], [D3, E@[email protected]]]. These pairs are later normalized by the reducers at the Reduce-2 method, which gives us the normalized and ordered output list [[D1, [[email protected], [email protected], [email protected]]], [D3, [[email protected], [email protected]]], [D2, [[email protected]]]]. It is worth noting that the lonely term {TNP} is filtered out by the Lonely Term Filtering at the Reduce-2 method.
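The Reduce-2 normalization on the running example can be sketched as below (a stand-in implementing Eq. (2) per document; note that the {TNP} term participates in the norm before being dropped, which reproduces the 0.8165 and 0.3271 weights above):

```python
import math

def reduce_2(doc_terms):
    """Normalize tf·idf products per document (Eq. 2) and order the
    terms by descending weight, as in the [Di, [termk@Wik]]ord output.
    doc_terms: term -> (tf, idf)."""
    raw = {t: tf * idf for t, (tf, idf) in doc_terms.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    weights = {t: round(v / norm, 4) for t, v in raw.items()}
    # Lonely Term Filtering: {TNP} joined the norm but is dropped here.
    weights.pop("{TNP}", None)
    return sorted(weights.items(), key=lambda kv: -kv[1])

d1 = reduce_2({"B": (2, 0.176), "C": (1, 0.176), "E": (1, 0.176)})
d3 = reduce_2({"B": (1, 0.176), "{TNP}": (1, 0.477), "E": (1, 0.176)})
```

Keeping {TNP} inside the norm is what makes D3's weights (0.3271) smaller than they would be if the lonely term had been discarded in the first phase.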

After the normalization, the third MapReduce builds the normalized inverted index. The mappers at the Map-3 method emit the intermediate key-value pairs of the form [termk, Di@Mi@Wi@Wik]. The reducers at the Reduce-3 method output the ordered key-value pairs of the form [termk, [Di@Mi@Wi@Wik]]ord. Figure 5 presents the ongoing example at this phase. We have the list after the Map-3 method as follows:

[[B, [email protected]@[email protected]], [E, [email protected]@[email protected]], [B, [email protected]@[email protected]], [C, [email protected]@[email protected]], [E, [email protected]@[email protected]], [C, [email protected]@[email protected]]]

And we have the list after Reduce-3 method as follows:

[[B, [[email protected]@[email protected], [email protected]@[email protected]]], [E, [[email protected]@[email protected], [email protected]@[email protected]]], [C, [[email protected]@[email protected], [email protected]@[email protected]]]]

Fig. 4. MapReduce-2 operation [14]

Finally, the fourth MapReduce computes the partial product of each corresponding term of a pair, which has the form [Dij, (Wik.Wjk)], at the Map-4 method, leading to the final similarity score of each pair, which has the form [Dij, SIM(Di, Dj)], after the Reduce-4 method. The running example is closed at this phase in Fig. 6. The intermediate key-value pairs [[D13, 0.2881], [D12, 0.0718], [D13, 0.1440]] after the Map-4 method are aggregated into the final similarity scores [[D13, 0.4321], [D12, 0.0718]] at the Reduce-4 method. Last but not least, Query Parameter Filtering is optionally applied to obtain closer results when query parameters are given.
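The final phase can be sketched as follows (the inverted index and weights below are simplified hypothetical values, not the figures from the running example; only the Wik field is kept per posting):

```python
from collections import defaultdict
from itertools import combinations

def mapreduce_4(inverted):
    """Map-4: for each term's posting list, emit one partial product
    (Dij, Wik*Wjk) per document pair; Reduce-4: sum products per pair."""
    partial = []
    for term, postings in inverted.items():
        for (di, wi), (dj, wj) in combinations(postings, 2):
            partial.append((di + dj, wi * wj))    # Map-4 emit
    scores = defaultdict(float)
    for pair, product in partial:                 # Reduce-4 aggregate
        scores[pair] += product
    return {p: round(s, 4) for p, s in scores.items()}

inverted = {"t1": [("D1", 0.6), ("D2", 0.5)],
            "t2": [("D1", 0.8), ("D2", 0.3)]}
scores = mapreduce_4(inverted)
```

Because only documents that share a term ever meet in a posting list, pairs with no common term never generate a partial product at all.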

5.2 Search-by-Example

Search-by-example is a well-known similarity search case where a pivot object is given as an example for the search. The goal is to find the objects most similar to the pivot. In this case, not only are lonely words in the pivot discarded, but words which do not exist in the pivot are also ignored, by the Lonely Term Filtering and the Query Term Filtering at the Reduce-2 method. The reason is that they do not contribute to the similarity between a pair but make the process bulky. Doing so significantly reduces overheads such as storage, communication, and computing costs through the whole chain of MapReduce jobs.

Fig. 5. MapReduce-3 operation [14]

Fig. 6. MapReduce-4 operation [14]


Let us return to the example as illustrated in Fig. 7; this time, the document D3 is considered as the pivot. According to the proposed scheme, the intermediate key-value pairs emitted from the mappers at the Map-1 method are of the form [termk, Di], and the key-value pairs output by the reducers at the Reduce-1 method are of the form [termk, [Di@tfik@idfik]]. Specifically, the mappers at the Map-1 method emit a list of intermediate key-value pairs [[A, D1], [B, D1], [B, D1], [C, D1], [A, D1], [E, D1], [C, D2], [A, D2], [D, D3], [B, D3], [A, D3], [E, D3]]. The list is later retrieved by the reducers to build the customized inverted index. Again, the Common Term Filtering filters out the word A, which is common among the documents, whereas the lonely word D belonging to D3 is marked as Terms Not Proceeded-{TNP}. Consequently, the key-value pair list from the reducers at the Reduce-1 method is output as follows: [[B, [D1@[email protected], D3@[email protected]]], [C, [D1@[email protected], D2@[email protected]]], [{TNP}, [D3@[email protected]]], [E, [D1@[email protected], D3@[email protected]]]].

We then come to the normalization phase in the second MapReduce operation, as illustrated in Fig. 8. At this phase, the intermediate key-value pairs emitted by the Map-2 method are of the form [Di, termk@tfik@idfik] before being normalized at the Reduce-2 method into an ordered list of the form [Di, [termk@Wik]]ord. More concretely, the mappers at the Map-2 method emit the intermediate key-value pairs as the list [[D1, B@[email protected]], [D3, B@[email protected]], [D1, C@[email protected]], [D2, C@[email protected]], [D3, {TNP}@[email protected]], [D1, E@[email protected]], [D3, E@[email protected]]]. These pairs are then normalized by the reducers at the Reduce-2 method, which gives us the normalized and ordered output list [[D1, [[email protected], [email protected]]], [D3, [[email protected], [email protected]]]]. It is worth noting that the lonely term {TNP} is filtered by Lonely Term Filtering at the Reduce-2 method. Besides, Query Term Filtering is active because we are in the case of

Fig. 7. MapReduce-1 operation when given the pivot [14]

56 T.N. Phan et al.


search-by-example. Thus, the word C in D1 and the word C in D2, which are not included in the pivot D3, are discarded in advance, as shown in Fig. 8.
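The normalization step itself can be sketched as unit-length (cosine) normalization of the tf*idf weights, returned in descending order. This is our reading of the Reduce-2 output form; the raw weights below are hypothetical.

```python
from math import sqrt

# A sketch of the Reduce-2 normalization, assuming cosine (unit-length)
# normalization of tf*idf weights; the input weights are hypothetical.
def reduce2_normalize(doc_id, term_weights):
    """Return the ordered list [Di, [termk@Wik]]ord with unit-normalized
    weights, sorted by weight in descending order."""
    norm = sqrt(sum(w * w for w in term_weights.values()))
    ordered = sorted(term_weights.items(), key=lambda tw: tw[1], reverse=True)
    return doc_id, [f"{t}@{w / norm:.4f}" for t, w in ordered]

print(reduce2_normalize("D1", {"B": 0.35, "E": 0.18}))
```

After this step each document vector has unit length, so the cosine similarity of a pair reduces to the inner product of the normalized weights computed later at MapReduce-4.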

The other two MapReduce operations (i.e., MapReduce-3 and MapReduce-4) conform to the proposed scheme, as in the examples in Figs. 5 and 6. Furthermore, search-by-example can be leveraged by the query strategies presented in Sect. 5.3, which show how early candidate pairs are filtered to reduce the candidate size and make them fit the query.

5.3 Query Strategies

Most similarity searches are also accompanied by query strategies such as range search or k-NN search. The range search adds a similarity threshold ε so that only pairs whose similarity is greater than or equal to the threshold are returned as the final result. Meanwhile, the k-NN search looks for the k most similar objects from the candidate sets. As a consequence, the parameters ε and k are utilized to filter objects so that the final result, on the one hand, is as close as possible to users' needs and the search process, on the other hand, is significantly improved. In order to exploit them for the proposed scheme, both Pre-pruning-1, for the case where a query document is given, and Pre-pruning-2, for the other cases, are attached but are not mutually exclusive.

In the case of pairwise similarity, we do not actually want to find all-pair similarity, since it is rarely used outside a specific range of applications and its entire result is not completely utilized. Moreover, such a big process consumes much time and many resources, which is not really suitable for most application scenarios, especially real-time intensive ones. Thus, the threshold ε is provided to filter the necessary pairs from the total candidates to meet specific needs. Pre-pruning-2 at the Map-4 method follows this line of thought. It employs the two inequalities below, with the latter adopted from [1], to perform its candidate filtering:

Fig. 8. MapReduce-2 operation when given the pivot [14]


$$\mathrm{sim}(D_i, D_j) = \sum_{k=1}^{t} W_{ik} \cdot W_{jk} \geq \varepsilon \qquad (3)$$

$$\mathrm{sim}(D_i, D_j) \leq \min\left(M_i \cdot W_j,\; M_j \cdot W_i\right) = \sigma \qquad (4)$$

From inequalities (3) and (4), the filtering rule is to keep those pairs whose σ is greater than or equal to the threshold ε. Let us go back to the example of pairwise similarity in Sect. 5.1. At the Map-4 method, as illustrated in Fig. 9, the pair D1 and D3 has σ = 0.5341 whilst the pair D1 and D2 has σ = 0.1437. Assuming that the threshold ε has the value 0.4, the pair D1 and D2 is discarded early. Meanwhile, Pre-pruning-1 is able to get rid of unnecessary pairs even sooner when a query object is given, and this supporting process takes place at the Reduce-3 method. It is worth noting that the key-value pairs at this phase have the form [termk, Di@Mi@Wi@Wik], so the above filtering rule can be applied directly. For the instance of search-by-example in Sect. 5.2, the Pre-pruning-1 method indicated in Fig. 10 checks for each candidate pair whether σ is greater than or equal to ε. The value of σ is computed as 0.4006, which is the minimum of 0.4006 and 0.5342. Assuming the threshold ε has the value 0.4, the pair D1 and D3 is therefore further processed to obtain its final similarity.
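A hedged sketch of this pruning rule follows. We read inequality (4) as the standard upper bound in which Mi is the largest normalized weight of Di and Wi is the sum of its weights; the names, weight vectors, and numbers below are illustrative, not the paper's data.

```python
# Sketch of Pre-pruning-2 built on inequality (4): the cosine similarity
# never exceeds sigma = min(Mi*Wj, Mj*Wi), under our reading that Mi is the
# maximum normalized weight of Di and Wi the sum of its weights.
def sigma(weights_i, weights_j):
    M_i, W_i = max(weights_i), sum(weights_i)
    M_j, W_j = max(weights_j), sum(weights_j)
    return min(M_i * W_j, M_j * W_i)

def prune(candidates, eps):
    """Discard early every pair whose upper bound sigma falls below eps."""
    return [(a, b) for a, b, wa, wb in candidates if sigma(wa, wb) >= eps]

candidates = [("D1", "D3", [0.62, 0.45], [0.55, 0.50]),
              ("D1", "D2", [0.62, 0.45], [0.20, 0.10])]
print(prune(candidates, eps=0.4))
# [('D1', 'D3')]
```

Because σ only bounds the true similarity from above, surviving pairs still have their exact similarity computed afterwards; only pairs that cannot possibly reach ε are dropped.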

On the other hand, a k-NN query also comes with a query object. Pre-pruning-1 takes the parameter k into account to filter objects before their similarity is computed. In other words, each mapper at the Map-3 method approximately emits the top-k key-value pairs, whose number depends on the total number of running mappers, as in Eq. (5) below:

Fig. 9. MapReduce-4 operation with Pre-pruning-2 [14]


$$\text{top-}k \text{ pairs for each mapper} = \max\left(\frac{k}{\#\text{Mappers}},\; 1\right), \quad k \in \mathbb{N} \qquad (5)$$

This is possible because the key-value pair input of the Map-3 method has already been ordered by size and normalized weights from the second MapReduce operation. Moreover, the probability that a pair is the most similar is high when each combined object has its largest size and normalized weights. As a consequence, Eq. (5) helps reduce both unnecessary computing and the candidate size.
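Equation (5) amounts to the one-liner below. Rounding the quotient up and flooring it at one pair per mapper are our assumptions about how the fraction is made integral; the mapper count of 168 is used only by way of example.

```python
from math import ceil

# Sketch of Eq. (5): each mapper keeps roughly k / #mappers of its best
# candidates, and at least one. Rounding up is our assumption.
def top_k_pairs_per_mapper(k, n_mappers):
    return max(ceil(k / n_mappers), 1)

print(top_k_pairs_per_mapper(500, 168))  # -> 3
print(top_k_pairs_per_mapper(100, 168))  # -> 1
```

Since each mapper's input is already sorted by size and normalized weight, keeping only its local top slice still preserves the global top-k candidates with high probability.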

6 Experiments

6.1 Environment Settings

In order to run our experiments with MapReduce, we employ the stable version 1.2.1 of Hadoop [3]. The Hadoop framework is deployed on a cluster of commodity machines called Alex, which has 48 nodes with 8 CPU cores and either 96 or 48 GB RAM per node [2]. In general, we leave the Hadoop configuration in default mode as much as possible, for we want to keep the most initial settings a commodity machine may have, even though some parameters could be tuned or optimized to fit the Alex cluster. The configured capacity is set to 5 GB per node, so the 48-node cluster has 240 GB in total. The number of reducers for a reduce operation is set to 168. The available heap size of the cluster is about 629 MB, and each HDFS file has a 64 MB block size. It is worth noting that Alex carries the overhead of other concurrent parallel tasks, i.e., these nodes are not exclusively reserved for our experiments. Last but not least, each benchmark has a fresh run; in other words, data from the old benchmark are removed before the new benchmark starts. All the experiments for one type of query

Fig. 10. MapReduce-3 operation with Pre-pruning-1 [14]


are run consecutively so that their testing environments are kept as close to one another as possible.

6.2 Datasets

In this paper, we use the DBLP datasets [7], on which we perform similarity search over the titles of publications. On the other hand, we also use the Gutenberg datasets [16], whose project is the first provider of free electronic books, to work with a large number of long text files.

With DBLP Datasets. The datasets used for pairwise similarity and search-by-example are synthetically partitioned into ten packages whose sizes increase from 50 MB to 500 MB, respectively. In the cases of the range query and the k-NN query, the datasets are enlarged up to 700 MB. In addition, the replication factor for the DBLP datasets is set to 47, which means that data are replicated to every node in the cluster.

With Gutenberg Datasets. The datasets for pairwise similarity and search-by-example are divided into five packages containing 1000, 1500, 2000, 2500, and 3000 files, respectively. These files, which are randomly selected from the Gutenberg repository, have sizes ranging from 15 KB to 100 KB. Unlike the DBLP datasets, the replication factor for the Gutenberg datasets is kept at 3, its default block replication.

6.3 Experiment Measurement

For our experiments, we evaluate our proposed methods and the related work step by step as follows:

• The naïve self-join: indicates the self-join approach without any filtering.
• The filtering self-join: implies the self-join approach with filtering.
• Search-by-example: mentions the case when an object is given as an example for the search.
• Range-query case: shows the case of similarity search when a similarity threshold ε is given.
• k-NN query case: denotes the case of searching for the k most similar objects from the candidate sets.
• Pivot case: refers to the case of doing pairwise similarity when a query is given.
• The work in 2008 [10]: builds the standard inverted index and term frequencies and then employs them to compute the similarity of a pair by their inner product. This work consumes only two MapReduce phases because the normalization phase is omitted.

Moreover, we examine the two document models, known as the terms model and the shingles model, described in Sect. 3.1. The former represents a document as a set of terms whilst the latter represents a document as a set of shingles. Furthermore, we are interested in


both the performance and the data volume of the proposed methods and the related method, which are measured as follows:

Performance Measurement. We measure the execution time of the MapReduce jobs, known as the total processing time. The measured time spans from the moment the MapReduce jobs start running to the moment they finish writing the result to the distributed file system. Moreover, we also consider the time of each MapReduce job separately. From this point of view, better performance means less processing time.

Data Volume Measurement. We observe how much data is produced throughout the MapReduce jobs and then written to the distributed file system. The goal is to find out how much the data volumes output and written to the distributed file system influence the overall performance.

6.4 Empirical Evaluations with DBLP Datasets

In this section, we perform some performance measurements to examine our methods. First, Fig. 11 shows the pairwise similarity case with the DBLP datasets among the naïve self-join, the filtering self-join, the work in 2008 [10], and search-by-example. Apart from the work in 2008, the other approaches are based on our proposed scheme in Sect. 4. Besides, we also compare the search-by-example case with the pairwise similarity case. The dataset size is increased turn by turn from 50 MB to 500 MB. From Fig. 11a, the result shows that our proposed methods outperform the work in 2008 in terms of query processing time. More concretely, on average the naïve self-join is 68.38 % faster than the work in 2008, the filtering self-join is 69.41 % faster, and search-by-example is 73.03 % faster. The main reason is that the work in 2008 computes the term frequency right away at the mappers instead of at the reducers, whose main goal is to perform reduced computations; in other words, the functionality of the mappers is misused from the beginning. Moreover, the work in 2008 computes all possible candidates without filtering whilst our approach filters. On the other hand, there is no big difference among the naïve approach, the filtering self-join, and search-by-example while the dataset size is still small, that is, under a specific threshold. The reason is the operation cost of the whole system. Once the dataset size is significantly increased, a big gap among them emerges. On average, the naïve self-join consumes 3.5 % more CPU time than the filtering self-join and 15.1 % more CPU time than search-by-example.

In terms of data volumes, Fig. 11b shows the correlation of data quantity among the approaches throughout the MapReduce operations with the DBLP datasets. The work in 2008 outputs a smaller amount of data in the end. More specifically, the work in 2008 produces 75.21 % less data than the naïve self-join, 72.61 % less data than the filtering self-join, and 17.70 % less data than search-by-example, respectively, on average. The reason is that the proposed scheme needs to normalize inputs before computing the similarity, which produces extra data whenever no filtering accompanies it. On the other hand, it is worth noticing that the work in 2008 computes the similarity score of two documents by summing the inner products of the term frequencies, which are not


normalized yet. Normalization is essential because term weights should be high if the terms are frequent in relevant documents but infrequent in the collection as a whole. If normalization were taken into account, the work in 2008 would suffer more computations and data volumes. Nevertheless, we implement it as the original version, i.e., without normalization. Furthermore, the result indicates how important the applied refinements are: they allow the filtering self-join to save 8.67 % of data quantity and search-by-example to save 69.78 %, on average, compared to the naïve self-join. Last but not least, the amount of data output from the MapReduce-2 operation to the MapReduce-4 operation, when filtering is applied, is on average just 0.04 % of the whole data output of the search-by-example case itself. As a consequence, search-by-example produces less data than the naïve self-join. In summary, the data output volume without filtering is nearly double the data input size due to normalization. In addition, the MapReduce mechanism always writes intermediate outputs to HDFS, whose disk access costs are expensive. Filtering strategies are, therefore, essential to reduce the candidate size and the related computing costs as well.

On the other side, we conduct experiments with query strategies when the DBLP dataset size is increased step by step from 300 MB to 700 MB, as shown in Fig. 12. The data values in Fig. 12a indicate that there is no big difference in query processing among range queries whose similarity thresholds are set to 90 %, 70 %, and 50 %. Likewise, the values in Fig. 12b point to the same conclusion for k-NN queries where the parameter k is set to 100, 300, and 500, respectively. Moreover, the two kinds of query strategies mostly have the same performance; in other words, neither the parameter ε for range queries nor the parameter k for k-NN queries creates a big gap between them. Last but not least, the two kinds of query strategies perform from 2.67 % to 4 % faster than search-by-example without pre-pruning.

6.5 Empirical Evaluations with Gutenberg Datasets

When working with the Gutenberg datasets for pairwise similarity, both the naïve self-join and the work in 2008 fail right away with 3000 Gutenberg files in the second Reduce

Fig. 11. Similarity-computing performance with DBLP datasets among the naïve self-join, the filtering self-join, the work in 2008, and search-by-example; (a) the total processing time; and (b) the saved data volume [14] (Color figure online)


operation because the reducers run out of memory. This means that the reducers work with massive numbers of key-value pairs that exceed their memory capacity. Nevertheless, the filtering self-join, the pivot case (i.e., the case that does pairwise similarity when given a query), and search-by-example avoid that problem because they are equipped with filtering strategies that help reduce the candidate size. Figure 13 therefore illustrates the similarity-computing performance among the filtering self-join, the pivot case, and search-by-example on the Gutenberg datasets. In general, their performances, as seen in Fig. 13a, are not much different when the number of files increases from 1000 to 2000. A gap appears, however, when the number of files increases from 2000 to 3000. This implies that both the pivot case and search-by-example perform better than the filtering self-join, with average rates of 7.06 % and 5.74 %, respectively. The key reason is that both the pivot case and search-by-example deal with candidates more efficiently than the filtering self-join does. Beyond Lonely Term Filtering, both the pivot case and search-by-example take advantage of Query Term Filtering when given a pivot object. In other words, both of them avoid the outbreak case in the filtering self-join, which has to compute similarity scores between one document and every other document in the corpus.

Fig. 12. Query strategies with DBLP datasets; (a) range query case; and (b) k-NN query case [14] (Color figure online)

Fig. 13. Similarity-computing performance with Gutenberg datasets between the filtering self-join and search-by-example; (a) the total processing time; and (b) the saved data volume (Color figure online)


Besides, the saved-data experiments with the Gutenberg datasets in Fig. 13b also favour search-by-example over the others. In general, search-by-example emits less data than both the filtering self-join and the pivot case do. Thanks to Query Term Filtering and Pre-pruning-2, the data output of search-by-example at 90 % similarity is, on average, 59.84 % less than that of the filtering self-join and 3.13 % less than that of the pivot case. Meanwhile, Fig. 14 demonstrates the performance of the query strategies with the Gutenberg datasets. From the data collected, we see almost the same trend as when examining range queries and k-NN queries with the DBLP datasets. Overall, the performances of both range queries, in Fig. 14a, and k-NN queries, in Fig. 14b, are not so different. On average, compared to search-by-example, the range queries perform 1.24 % to 2.67 % faster when the similarity threshold changes among 90 %, 70 %, and 50 %, whilst the k-NN queries have speed-up rates from 3.18 % to 3.85 % when the parameter k is set to 100, 300, and 500, respectively.

6.6 Empirical Evaluations Between Terms and Shingles

In these experiments, we want to evaluate our methods with shingles instead of terms. In other words, each document in the Gutenberg datasets is respectively represented as a set of terms and as a set of shingles. Because of the data outbreak of pairwise similarity in the work in 2008 and the naïve self-join, Fig. 15 shows the similarity-computing performance with the Gutenberg datasets and shingles only among the filtering self-join, the pivot case, and search-by-example. In terms of the total processing time illustrated in Fig. 15a, search-by-example tends to perform better than the others while the filtering self-join tends to consume more processing time than the others. Nevertheless, there are no big gaps among them. More specifically, the gap in total processing time between the filtering self-join and search-by-example is around 0.26 %, the gap between the filtering self-join and the pivot case is around 0.86 %, and the gap between the pivot case and search-by-example is around 0.15 %.

On the other hand, there are visible differences among these similarity searches from the viewpoint of the saved data volume, as seen in Fig. 15b. As in the same kind of experiments with terms, the filtering strategies really work. In general, search-by-example keeps emitting the least total data output whilst the filtering

Fig. 14. Query strategies with Gutenberg datasets; (a) range query case; and (b) k-NN query case (Color figure online)


self-join emits the most. Besides, the total data output in the pivot case is a bit more than that of search-by-example but much less than that of the filtering self-join. More specifically, the percentage difference of total data output between the filtering self-join and search-by-example is around 46.07 %, that between the filtering self-join and the pivot case is around 41.77 %, and that between the pivot case and search-by-example is around 7.26 %. The reason behind the big gap between the filtering self-join and the others is mostly Query Term Filtering, which in this case filters shingles by the query shingles.

To compare the similarity-computing performance when the documents are represented as terms and as shingles, we compare our methods separately in pairs. For the performance comparison, we show turn by turn the total processing time not only of the whole MapReduce operation, known as Total MR, but also of the four MapReduce sub-operations, called MR-1, MR-2, MR-3, and MR-4, respectively. From now on, the left axis presents the four sub-operations whereas the right axis presents the whole operation. Firstly, we compare the filtering self-join with terms and with shingles. Figure 16 demonstrates the performance of the filtering self-join between them. Generally, the Total MR with shingles performs better than that with terms. On the one hand, the most time-consuming MapReduce operation is MR-1, which has to process large amounts of data and compute the term frequency as well as the inverse document frequency at the same time. On the other hand, MR-4 also takes time to produce candidate pairs and the final similarity scores. The others, MR-2 and MR-3, consume less time in comparison with MR-1 and MR-4. More specifically, in the comparison between terms and shingles, the maximum difference for MR-1 is about 10.14 %, for MR-2 about 11.43 %, for MR-3 about 7.5 %, for MR-4 about 44.32 %, and for Total MR about 14.25 %.

Fig. 15. Similarity-computing performance with Gutenberg datasets and shingles among the filtering self-join, pivot case, and search-by-example; (a) the total processing time; and (b) the saved data volume (Color figure online)


In terms of data output, the total data output with shingles is generally much less than that with terms, as shown in Fig. 17. Although the MR-3 operation takes little time to complete, it emits the most data compared to the others. On the contrary, MR-4 takes more time than MR-3 but produces the least data output. More specifically, in the comparison between terms and shingles, the maximum difference for MR-1 is about 7.66 %, for MR-2 about 19.82 %, for MR-3 about 37.37 %, for MR-4 about 27.85 %, and for Total MR about 22.67 %.

Secondly, we compare the pivot case with terms and with shingles. Figure 18 demonstrates the performance of the pivot case between them. In general, the Total MR with shingles performs better than that with terms only in the data package of 3000 files; in the other data packages, the Total MR with terms performs better. As usual, the most time-consuming MapReduce operation is MR-1, while the others do not take much time to complete their jobs. In particular, the maximum difference for MR-1 is about 14.05 %, for MR-2 about 19.15 %, for MR-3 about 9.09 %, for MR-4 about 29.79 %, and for Total MR about 6.62 %.

From the point of view of data output, the total data output with shingles is generally more than that with terms, as shown in Fig. 19. In this case, MR-1 emits the most data output while MR-4 produces the least. By observation, the total data output from MR-4 is much less than that of MR-1, MR-2, and MR-3, with either terms or shingles. Specifically, in the comparison between terms

Fig. 16. The performance of the filtering self-join with terms and shingles (Color figure online)


and shingles, the maximum difference for MR-1 is about 7.66 %, for MR-2 about 76.86 %, for MR-3 about 65.09 %, for MR-4 about 40.04 %, and for Total MR about 19.20 %.

Thirdly, we compare search-by-example with terms and with shingles when the similarity threshold is set to 90 %. Figure 20 illustrates the performance of 90 %-similarity search-by-example between them. Similarly to the pivot case, the Total MR with shingles performs better than that with terms only in the data package of 3000 files; in the other data packages, the Total MR with terms performs better. Once

Fig. 17. The data output of the filtering self-join with terms and shingles (Color figure online)

Fig. 18. The performance of pivot case with terms and shingles (Color figure online)


again, the most time-consuming MapReduce operation is MR-1, while the others do not take much time to complete their jobs. What is more, the maximum difference for MR-1 is about 12.62 %, for MR-2 about 24.39 %, for MR-3 about 11.11 %, for MR-4 about 17.07 %, and for Total MR about 7.37 %.

In the meantime, the total data output with shingles is generally more than that with terms, as indicated in Fig. 21. At this time, MR-1 emits the most data output while MR-4 produces the least. Nevertheless, the total data output from MR-4 is much less than that of MR-1, MR-2, and MR-3, with either terms or

Fig. 19. The data output of pivot case with terms and shingles (Color figure online)

Fig. 20. The performance of 90 %-similarity search-by-example with terms and shingles (Color figure online)


shingles. In the comparison between terms and shingles, the maximum difference for MR-1 is about 7.66 %, for MR-2 about 76.88 %, for MR-3 about 71.08 %, for MR-4 about 73.90 %, and for Total MR about 17.27 %.

Finally, we compare k-NN queries with terms and with shingles when the parameter k is set to 500. Figure 22 presents the performance of 500-NN queries between them. In these experiments, we observe that the Total MR with shingles performs slightly better than that with terms only in the data package of 3000 files; in the other data packages, the Total MR with terms performs somewhat better. Normally, the

Fig. 21. The data output of 90 %-similarity search-by-example with terms and shingles (Color figure online)

Fig. 22. The performance of 500-NN queries with terms and shingles (Color figure online)


most time-consuming MapReduce operation is MR-1, while the others do not take much time to complete their jobs. Moreover, the maximum difference for MR-1 is about 6.67 %, for MR-2 about 23.91 %, for MR-3 about 3.13 %, for MR-4 about 6.25 %, and for Total MR about 2.41 %.

Meanwhile, the total data output with shingles is not much more than that with terms, as indicated in Fig. 23. In this case, MR-1 emits the most data output while MR-4 produces the least. The total data output from MR-3 is, however, approximately as small as that from MR-4. Consequently, the total data output from both MR-3 and MR-4 is much less than that of MR-1 and MR-2, with either terms or shingles. In the comparison between terms and shingles, the maximum difference for MR-1 is about 7.65 %, for MR-2 about 76.86 %, for MR-3 about 30.54 %, for MR-4 about 2.77 %, and for Total MR about 5.69 %.

7 Discussion

When doing experiments with terms and shingles, we observe that Query Term Filtering produces less data output when applied to terms than when applied to shingles. In other words, more terms than shingles are filtered by the same method. More concretely, the number of terms is approximately 1.3x that of shingles on average in the filtering self-join. Nevertheless, the number of shingles is approximately 1.15x that of terms on average in the pivot case, approximately

Fig. 23. The data output of 500-NN queries with terms and shingles (Color figure online)

70 T.N. Phan et al.


1.09x that of terms on average in search-by-example, and approximately 1.03x that of terms on average in 500-NN queries. As a consequence, these numbers indicate that Query Term Filtering works more effectively with terms than with shingles, which results in better performance. The main reason might come from the unique characteristic of shingles. It is worth noting that shingles are generated from consecutive terms bounded by the length K. Thus, the probability of a collision between two random shingles is often small, while there is a high probability of collision between two random terms in the corpus. Because of this, the number of terms that might be shared with the query is large whilst the number of shingles that might be included in the query is small. As a result, we observe that there are more filtered terms than filtered shingles against the given query.
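This contrast can be sketched in a few lines; the toy documents and helper names below are ours, not part of the paper's implementation:

```python
# A minimal sketch contrasting term sets with k-shingle sets for two short
# documents: individual terms collide often, but k-shingles (k consecutive
# terms) rarely do, so fewer shingles match any given query.

def terms(text):
    """Split a document into its set of distinct terms."""
    return set(text.lower().split())

def k_shingles(text, k=3):
    """Build the set of distinct k-shingles (k consecutive terms)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the lazy dog sleeps while the quick cat jumps"

shared_terms = terms(doc_a) & terms(doc_b)
shared_shingles = k_shingles(doc_a) & k_shingles(doc_b)

print(len(shared_terms))     # several terms collide ...
print(len(shared_shingles))  # ... but almost no 3-shingles do
```

Even for these two short documents, five terms are shared while only a single 3-shingle is, which mirrors the filtering behaviour observed above.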

Besides, drawing on the experiments conducted so far, we further describe the factors that matter most to achieving high performance with MapReduce as follows:

• The MapReduce operations should not be too complex due to limited computing resources. Moreover, neither complicated computations nor unoptimized processing improves the performance;

• The fewer computations the similarity measure requires, the higher efficiency the whole system gets [13]. There are different metrics to measure how similar a pair of objects is, but each has its own characteristics and computational complexity. Which one to choose depends on the specific application and domain, due to the fact that it adds its complexity to the whole computing process;

• Natural language processing such as filtering useless symbols is essential for a proper document model, which leads to effectively filtering either terms or shingles. If these special symbols are not handled properly, they might easily cause unexpected errors when we process data strings with a programming language such as Python;

• Load balancing is also a big issue when working with MapReduce, for the overall performance is always finalized by the last MapReduce job. Though some implementations of MapReduce, like Hadoop, try to distribute the workload during execution, there is still a need for adaptive load-balancing strategies so that the amounts of key-value pairs output by either mappers or reducers are relatively equal among them;

• The ways of implementing the Map and Reduce functions with key-value pairs also affect the entire system and its performance. Hence, an optimized execution plan is preferred;

• Depending on the characteristics of the cluster of commodity machines, the environment settings in general and the configurations in particular can be further optimized to improve the overall performance.
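To make the key-value pattern these points revolve around concrete, the following toy sketch simulates one map-shuffle-reduce round in memory; the function names and documents are ours, not Hadoop's API:

```python
from collections import defaultdict

# A toy in-memory simulation of map -> shuffle -> reduce. Each mapper emits
# (term, doc_id) pairs, the shuffle groups pairs by key, and each reducer
# builds one posting list. Keeping per-key output small and evenly sized
# among mappers and reducers is what the load-balancing point above targets.

def map_phase(doc_id, text):
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    return term, sorted(doc_ids)

docs = {1: "big data similarity search", 2: "similarity joins with big data"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
index = dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())

print(index["similarity"])  # → [1, 2]
```

In a real Hadoop job the shuffle is performed by the framework between the two phases; the sketch only shows how the choice of keys and values shapes the intermediate data volume.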

On the other hand, clustering techniques may be useful when integrated into the proposed methods. The idea behind them is to partition the search space into several sub-spaces such that only some of the spaces are promoted as candidates for similarity search. From this point of view, clustering techniques help reduce the search space. The way of clustering, however, strongly influences this goal. For instance, if there are only a few clusters, most of the unnecessary shingles will still be considered, and this does not help at



all in comparison with the simple clustering methods or without them. Otherwise, in case there are too many clusters, the extra cost of cluster checking becomes large and harms the overall performance. Thus, finding the trade-off that optimally balances the two cases and well partitions the search space is essentially open research. One possible solution, in our case, comes from the family of phonetic hashing methods (e.g., Double Metaphone [15]) to strengthen the power of pivots. The basic idea is to transform the original search space into another one by grouping terms that are similar in writing and pronunciation. In this way, the new search space becomes less jumbled and easier to prune, for most of the similar terms belong to the same cluster. On the other side, one-way hashing functions can optionally be used not only to support the clustering process but also to help reduce the size of k-shingles when the k parameter is large. This aims at saving data transferred throughout the network and written into the distributed file system. Last but not least, one-way hashing functions enhance data security, for real text data as well as query data should not be revealed at the time of similarity search.
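The space-saving side of this idea can be sketched as follows; the 8-byte truncation and the helper name are our own choices for illustration, not the paper's implementation (the phonetic grouping via Double Metaphone would require a dedicated library and is not shown):

```python
import hashlib

# A hedged sketch of one-way hashing for k-shingles: when k is large, each
# shingle is replaced by a short fixed-size digest, so less data crosses the
# network and the raw text is never exposed in the distributed file system.

def shingle_digest(shingle, n_bytes=8):
    """Map a k-shingle to a fixed-size one-way digest."""
    return hashlib.sha1(shingle.encode("utf-8")).digest()[:n_bytes]

shingle = "an adaptive similarity search in massive datasets"
digest = shingle_digest(shingle)

print(len(shingle))  # 49 characters of raw text ...
print(len(digest))   # ... reduced to 8 opaque bytes
```

Equal shingles still map to equal digests, so set-based similarity computations carry over unchanged, at the cost of a small collision probability inherent to truncated hashes.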

Other problems arising from our research work are how to assure data quality as well as data freshness. They require much effort on many intensive tasks to pre-process data before feeding it into MapReduce operations. Normally, an update policy may alternatively be set up when running MapReduce-1 operations, either done automatically at fixed intervals or executed manually on request. Additionally, other similarity measures or variants of simple forms such as Cosine, Dice, edit distance, or Hamming distance [17, 18], and similarity computing methods like locality-sensitive hashing [20, 21], should also be seriously considered when extracting similarity scores, due to the fact that the unique characteristics of application domains and similarity measures themselves may prefer different optimizations. For instance, the research studies in [13, 14] show that MapReduce-based similarity search with Jaccard outperforms that with Cosine. The reason is that the Cosine measure demands more computations to normalize the weights of documents, which adds more overhead in terms of performance and data volume.
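The cost gap between the two measures can be seen in a small sketch; the toy documents are ours and the functions are plain set/vector formulations, not the experimental implementation of [13, 14]:

```python
import math
from collections import Counter

# Jaccard needs only intersection and union sizes, while Cosine must
# additionally weight the term vectors and normalize by their lengths --
# the extra work that adds overhead in the MapReduce setting.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm  # the normalization step Jaccard avoids

a = "similarity search in massive datasets"
b = "adaptive similarity search with mapreduce"

print(round(jaccard(a, b), 3))  # → 0.25
print(round(cosine(a, b), 3))   # → 0.4
```

In a distributed job the norms must be computed and shipped alongside the term weights, which is exactly the extra data volume the discussion refers to.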

Moreover, approximate searches should be a shining point when integrated into our methods in order to further improve the overall performance whilst striving to ensure the accuracy of the results. Furthermore, it would be great if the proposed methods could adapt to other well-known similarity search cases such as pairwise similarity and similarity joins. Last but not least, more comparisons with the state of the art would help consolidate and enhance our methods. These matters are left as our future work.

8 Summary

In this paper, we propose an adaptive similarity search scheme supporting large-scale processing with MapReduce in massive datasets. In addition, we equip our proposed scheme with collaborative strategic refinements that not only promote the potential scalability of the MapReduce paradigm but also eliminate unnecessary computations as well as diminish candidate sizes. Besides, our proposed methods are flexibly adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range



queries, and k-NN queries. Moreover, these methods are verified by many empirical experiments on real datasets with the Hadoop framework, which is deployed on commodity machines. Furthermore, we model our documents with distinct n-grams known as shingles, together with a term representation, so that we can observe the difference between them under the proposed methods. Last but not least, we discuss other challenges as our future work in the context of big data, together with other open research issues, in order to further strengthen and enhance our methods supporting data-intensive applications.

Acknowledgements. We would like to give our thanks to Mr. Faruk Kujundžić, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex Cluster.

References

1. Alabduljalil, M.A., Tang, X., Yang, T.: Optimizing parallel algorithms for all pairs similarity search. In: Proceedings of the 6th ACM International Conference on Web Search and Data Mining, USA, pp. 203–212 (2013)

2. Alex cluster. http://www.jku.at/content/e213/e174/e167/e186534. Accessed 4 Feb 2014

3. Apache Software Foundation: Hadoop: A Framework for Running Applications on Large Clusters Built of Commodity Hardware (2006)

4. Baraglia, R., De Francisci Morales, G., Lucchese, C.: Document similarity self-join with MapReduce. In: Proceedings of the 10th IEEE International Conference on Data Mining, pp. 731–736 (2010)

5. Dang, T.K., Küng, J.: The SH-tree: a super hybrid index structure for multidimensional data. In: Mayr, H.C., Lazanský, J., Quirchmayr, G., Vogel, P. (eds.) DEXA 2001. LNCS, vol. 2113, pp. 340–349. Springer, Heidelberg (2001)

6. Dang, T.K.: Solving approximate similarity queries. Int. J. Comput. Syst. Sci. Eng. 22(1–2), 71–89 (2007). CRL Publishing Ltd., UK

7. DBLP data set. http://dblp.uni-trier.de/xml/. Accessed 8 Mar 2014

8. De Francisci Morales, G., Lucchese, C., Baraglia, R.: Scaling out all pairs similarity search with MapReduce. In: Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval, pp. 25–30 (2010)

9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pp. 137–150. USENIX Association (2004)

10. Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, Companion Volume, Columbus, Ohio, pp. 265–268 (2008)

11. Fenz, D., Lange, D., Rheinländer, A., Naumann, F., Leser, U.: Efficient similarity search in very large string sets. In: Ailamaki, A., Bowers, S. (eds.) SSDBM 2012. LNCS, vol. 7338, pp. 262–279. Springer, Heidelberg (2012)

12. Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with MapReduce. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds.) APWeb 2011. LNCS, vol. 6612, pp. 412–423. Springer, Heidelberg (2011)

13. Phan, T.N., Küng, J., Dang, T.K.: An elastic approximate similarity search in very large datasets with MapReduce. In: Hameurlain, A., Dang, T.K., Morvan, F. (eds.) Globe 2014. LNCS, vol. 8648, pp. 49–60. Springer, Heidelberg (2014)

14. Phan, T.N., Küng, J., Dang, T.K.: An efficient similarity search in large data collections with MapReduce. In: Dang, T.K., Wagner, R., Neuhold, E., Takizawa, M., Küng, J., Thoai, N. (eds.) FDSE 2014. LNCS, vol. 8860, pp. 44–57. Springer, Heidelberg (2014)

15. Philips, L.: The double metaphone search algorithm. C/C++ Users J. 18(6), 38–43 (2000)

16. Project Gutenberg. http://www.gutenberg.org/. Accessed 8 Mar 2014

17. Rajaraman, A., Ullman, J.D.: Finding similar items. In: Mining of Massive Datasets, 1st edn., pp. 71–127. Cambridge University Press (2011). Chapter 3

18. Rong, C., Lu, W., Wang, X., Du, X., Chen, Y., Tung, A.K.H.: Efficient and scalable processing of string similarity join. IEEE Trans. Knowl. Data Eng. 25(10), 2217–2230 (2013)

19. Szmit, R.: Locality sensitive hashing for similarity search using MapReduce on large scale data. In: Kłopotek, M.A., Koronacki, J., Marciniak, M., Mykowiecka, A., Wierzchoń, S.T. (eds.) IIS 2013. LNCS, vol. 7912, pp. 171–178. Springer, Heidelberg (2013)

20. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 563–570 (2008)

21. Ture, F., Elsayed, T., Lin, J.: No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. In: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 943–952 (2011)

22. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, USA, pp. 495–506 (2010)

23. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International World Wide Web Conference, pp. 131–140 (2008)

24. Zhang, D., Yang, G., Hu, Y., Jin, Z., Cai, D., He, X.: A unified approximate nearest neighbor search scheme by combining data structure and hashing. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 681–687 (2013)
