TRANSCRIPT
Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas Oard
Niveda Krishnamoorthy
Overview
• Pairwise Similarity
• MapReduce Framework
• Proposed Algorithm
  • Inverted Index Construction
  • Pairwise Document Similarity Calculation
• Results
Pairwise Similarity of Documents
• PubMed – "More like this"
• Similar blog posts
• Google – Similar pages
MapReduce
• Programming framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
MapReduce Model
Example – Word Frequency
Consider two files:
  File 1: Hello World Bye World
  File 2: Hello Hadoop Goodbye Hadoop
The expected word counts are:
  Hello, 2
  World, 2
  Bye, 1
  Hadoop, 2
  Goodbye, 1
Map Phase
Each mapper reads one file and emits a <word, 1> pair per token:
  Map 1 (Hello World Bye World) emits: <Hello,1> <World,1> <Bye,1> <World,1>
  Map 2 (Hello Hadoop Goodbye Hadoop) emits: <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>
Reduce Phase
The map outputs are shuffled and sorted so that all values for a key are grouped together:
  <Hello, (1,1)>
  <World, (1,1)>
  <Bye, (1)>
  <Hadoop, (1,1)>
  <Goodbye, (1)>
Each reducer (Reduce 1–5, one per key) sums the grouped values:
  Hello, 2
  World, 2
  Bye, 1
  Hadoop, 2
  Goodbye, 1
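As a concrete illustration, here is a minimal single-process Python sketch of this word-count job (the shuffle is simulated with a dictionary; the function names are illustrative, not Hadoop's actual API):

    from collections import defaultdict

    def map_phase(document):
        # Emit a <word, 1> pair for every token in the document.
        return [(word, 1) for word in document.split()]

    def shuffle_and_sort(pairs):
        # Group values by key, as the framework does between map and reduce.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Sum the 1s emitted for each word to get its total count.
        return key, sum(values)

    files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
    pairs = [p for doc in files for p in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(pairs).items())
    print(counts)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}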
Pairwise Document Similarity
MapReduce Algorithm (scalable and efficient):
• Inverted Index Computation
• Pairwise Similarity
Constructing Inverted Index (Map Phase)
Each mapper reads one document and emits <term, (docID, term frequency)> pairs:
  Map 1 – Document 1 (A A B C) emits: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
  Map 2 – Document 2 (B D D) emits: <B,(d2,1)> <D,(d2,2)>
  Map 3 – Document 3 (A B B E) emits: <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Constructing Inverted Index (Reduce Phase)
After the shuffle and sort, each reducer (Reduce 1–5, one per term) collects the postings for its term, yielding the inverted index:
  <A, [(d1,2), (d3,1)]>
  <B, [(d1,1), (d2,1), (d3,2)]>
  <C, [(d1,1)]>
  <D, [(d2,2)]>
  <E, [(d3,1)]>
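Continuing the Python sketch, the indexing job can be simulated as follows (again a single-process stand-in for the Hadoop job; the reduce is the identity, since each term's grouped postings already form its index entry):

    from collections import Counter, defaultdict

    def index_map(doc_id, text):
        # Emit <term, (docID, term frequency)> for each distinct term.
        return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

    docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            index[term].append(posting)  # shuffle: group postings by term
    print(index["B"])  # [('d1', 1), ('d2', 1), ('d3', 2)]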
Space-Saving Techniques
• Group by document ID, not pairs
• Golomb compression for postings (individual postings and lists of postings)
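The slides only name Golomb compression; as a rough sketch, assuming the usual scheme of a unary quotient plus truncated-binary remainder, typically applied to the gaps between sorted document IDs (the paper's parameter choice is not given here):

    def golomb_encode(n, m):
        # Golomb-encode a non-negative integer n with parameter m >= 1,
        # returning the code as a bit string.
        q, r = divmod(n, m)
        bits = "1" * q + "0"               # quotient in unary, 0-terminated
        b = m.bit_length() - 1             # floor(log2(m))
        if (1 << b) != m:                  # m not a power of two:
            b += 1                         # truncated binary for the remainder
            cutoff = (1 << b) - m
            if r < cutoff:
                bits += format(r, "0%db" % (b - 1))
            else:
                bits += format(r + cutoff, "0%db" % b)
        elif b > 0:                        # m a power of two (Rice coding):
            bits += format(r, "0%db" % b)  # plain b-bit remainder
        return bits

    # Example: encode the d-gaps of the sorted doc IDs [4, 13, 24] with m = 4.
    print([golomb_encode(g, 4) for g in [4, 9, 11]])  # ['1000', '11001', '11011']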
Pairwise Document Similarity (Map Phase)
Each mapper takes one term's postings list and, for every pair of documents in it, emits the product of the two term weights as a partial similarity score:
  Map 1 – <A,[(d1,2),(d3,1)]> emits: <(d1,d3),2>
  Map 2 – <B,[(d1,1),(d2,1),(d3,2)]> emits: <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
  (C, D, and E occur in only one document each, so they generate no pairs.)
Pairwise Document Similarity (Reduce Phase)
The shuffle and sort groups the partial scores by document pair:
  <(d1,d2), [1]>
  <(d2,d3), [2]>
  <(d1,d3), [2,2]>
Each reducer (Reduce 1–3) sums the partial scores into the final similarity:
  <(d1,d2), 1>
  <(d2,d3), 2>
  <(d1,d3), 4>
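In the same sketch style, the second job emits a weight product for every pair of documents sharing a term and then sums these partial scores (raw term frequencies stand in here for the BM25 weights the paper uses):

    from collections import defaultdict
    from itertools import combinations

    def similarity_map(postings):
        # For every pair of documents sharing this term, emit the product
        # of their term weights as a partial similarity contribution.
        return [((min(d1, d2), max(d1, d2)), w1 * w2)
                for (d1, w1), (d2, w2) in combinations(postings, 2)]

    index = {"A": [("d1", 2), ("d3", 1)],
             "B": [("d1", 1), ("d2", 1), ("d3", 2)],
             "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
    grouped = defaultdict(list)
    for postings in index.values():
        for pair, score in similarity_map(postings):
            grouped[pair].append(score)  # shuffle: group by document pair
    sims = {pair: sum(scores) for pair, scores in grouped.items()}
    print(sims)  # {('d1', 'd3'): 4, ('d1', 'd2'): 1, ('d2', 'd3'): 2}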
Experimental Setup
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk each)
• Similarity function: BM25 (see the formula after this list)
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents

Procedure
• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • A fraction of the terms with the highest document frequency is eliminated; the 99% cut removes 9,093 terms
  • 3.7 billion intermediate pairs vs. 8.1 trillion without the cut
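For reference, a standard formulation of the BM25 term weight (the exact parameter settings used in the paper are not given in these slides) is:

    \[
    w_{t,d} = \mathrm{IDF}(t)\cdot
              \frac{\mathrm{tf}_{t,d}\,(k_1+1)}
                   {\mathrm{tf}_{t,d}+k_1\left(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\right)}
    \]

where tf_{t,d} is the term frequency, |d| the document length, avgdl the average document length in the collection, and k_1 and b are tuning parameters.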
Linear space and time complexity
[Figure: Running time of pairwise similarity comparisons]
[Figure: Effect of df-cut on number of intermediate pairs]
Observations
• Complexity: O(n²)
• A df-cut of 99 percent eliminates meaning-bearing terms (e.g., Cornell, arthritis) as well as some irrelevant terms (e.g., sleek, frail)
• The df-cut can be relaxed to 99.9 percent
Discussion
• The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent significantly affect the quality of the results?
• The results have not been evaluated.
Thank you