NDD project presentation
Posted on 30-Jun-2015
1
Weighted SimHash: A Random Projection Approach for Detecting Near-Duplicate Documents in Large Collections
Md Mishfaq Ahmed, Graduate Student, Department of CS, University of Memphis
2
Introduction
• Near-duplicate documents (NDD): identical in core content but differing in small portions of the document
– Harder to detect than exact duplicates
– Exact duplicates: standard methods exist
– Near duplicates: several approaches exist, but no widely accepted method to identify them
3
Near Duplicate: main sources
• News articles
• Web documents (web pages) differing only in advertisements and/or timestamps
– As many as 40% of the pages on the web are duplicates of other pages
4
Near Duplicate: main sources
• NDD techniques are also useful for sequences that are not documents (such as DNA sequences)
• Replication for reliability
– In file systems: the main content of an important document is replicated and stored in different places
5
Earlier Approaches for NDD
• A naive solution: compare a document with every document in the collection, word by word
– Prohibitively expensive on large datasets
• Another approach: convert documents into canonical forms until they become exact duplicates
• A more viable approach: approximation and probabilistic methods
– Trade-off: precision and recall ↔ manageable speed
6
Earlier Approaches for NDD
7
Shingling based methods
• A document d = a sequence of tokens
• Encode d as a set of unique k-grams
– k-gram = a contiguous sequence of k tokens
• Measure the overlap or similarity between two k-gram sets
• The sum of overlaps or similarities across the entire set gives the similarity between two documents
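The shingling idea above can be sketched as follows. This is a minimal illustration, not the exact method from the talk: the helper names, whitespace tokenization, and the use of Jaccard similarity as the set-overlap measure are assumptions.

```python
def kgrams(text, k=3):
    """Encode a document as the set of its unique k-grams (k contiguous tokens).

    Tokenization by whitespace is a simplifying assumption.
    """
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Set similarity |A ∩ B| / |A ∪ B| between two k-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate sentences differing in one word.
d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(kgrams(d1, 3), kgrams(d2, 3))  # shares 4 of 10 distinct 3-grams
```

A single changed word destroys up to k of the k-grams around it, which is why shingling is sensitive to small edits while still scoring near duplicates far above unrelated documents.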
9
Projection based methods: SimHash
Example:
– d1: word1 + word2 + word3
– d2: word1 + word4
10
SimHash: Example
• Document d1: word1 + word2 + word3
11
SimHash: Example
– Document d2: word1+word4
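The SimHash fingerprinting shown in the example slides can be sketched as follows. The use of MD5 as the per-term hash and a 64-bit fingerprint are illustrative choices, not necessarily the ones used in the talk.

```python
import hashlib

def simhash(terms, bits=64):
    """Classic SimHash: project each term onto {-1, +1}^bits via its hash,
    sum the projections per bit, and keep only the sign of each sum."""
    sums = [0] * bits
    for term in terms:
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1
    # Intermediate sums near zero correspond to volatile fingerprint bits.
    return sum(1 << i for i in range(bits) if sums[i] > 0)

def hamming(f1, f2):
    """Number of bit positions where two fingerprints differ."""
    return bin(f1 ^ f2).count("1")

# The example documents from the slides: d1 and d2 share word1.
fp1 = simhash(["word1", "word2", "word3"])
fp2 = simhash(["word1", "word4"])
```

Near duplicates share most of their terms, so most of their per-bit sums agree in sign and their fingerprints end up within a small Hamming distance of each other.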
12
Projection based methods: Probabilistic Simhash
• Key observations:
– Projection is already probabilistic
– Bits in a fingerprint are mutually independent
– Intermediate values are ignored while generating fingerprints
• They are useful for gaining insight into the volatility of a bit
13
Projection based methods: Probabilistic Simhash
• Key observations:
– For another document d that is not a near duplicate of d1, the fingerprint of d is most likely to differ from that of d1 at the bit position whose intermediate value is closest to zero
14
Projection based methods: Probabilistic Simhash
• Implementation
– A unique data structure per document to rank bits (or sets of bits) according to volatility
• Stores bit positions
– When comparing two fingerprints:
• Compare bits with higher volatility first
• Ensures quicker identification of non-duplicates
• Reduces the number of bit comparisons for non-duplicates
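The volatility-ordered comparison described above can be sketched as follows, with fingerprints kept as bit lists for clarity. The helper names and representation are assumptions; only the idea (rank bits by closeness of their intermediate sum to zero, compare the most volatile first, stop early) comes from the slides.

```python
def simhash_with_volatility(sums):
    """Given the per-bit intermediate sums from SimHash generation, return
    the fingerprint bits plus bit positions ranked most-volatile-first
    (smallest |intermediate value| = closest to flipping)."""
    bits = [1 if s > 0 else 0 for s in sums]
    order = sorted(range(len(sums)), key=lambda i: abs(sums[i]))
    return bits, order

def is_near_duplicate(bits1, bits2, order, k):
    """Compare volatile bits first; stop as soon as mismatches exceed k."""
    mismatches = 0
    for i in order:
        if bits1[i] != bits2[i]:
            mismatches += 1
            if mismatches > k:
                return False  # early exit for non-duplicates
    return True
```

Because non-duplicates are most likely to disagree exactly at the volatile positions, putting those positions first lets the comparison bail out after only a few bit checks in the common (non-duplicate) case.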
15
Projection based methods: Probabilistic Simhash
• Drawback:
– Overhead of an extra data structure per document, apart from the fingerprint
16
Our Approach: Weighted SimHash
• Main idea:
– Terms with higher inverse document frequency (IDF) are better at finding NDDs
• Consider two documents D1, D2 and two terms t1 (high IDF) and t2 (low IDF):
– Case I: both D1 and D2 have t1
– Case II: both D1 and D2 have t2
– Case III: neither of them has t1
– Case IV: neither of them has t2
• D1, D2 are more likely to be NDDs in Case I than in Case II
• D1, D2 are more likely to be NDDs in Case IV than in Case III
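A quick numeric illustration of the IDF notion used above (the corpus and helper name are made up for this sketch): a term appearing in few documents gets a high IDF, while a term appearing everywhere gets an IDF of zero and so carries no evidence either way.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df). Rare terms get high IDF."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else float("inf")

# Tiny toy corpus of token sets.
docs = [{"the", "simhash"}, {"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
high = idf("simhash", docs)  # appears in 1 of 4 docs: log(4) ≈ 1.386
low = idf("the", docs)       # appears in all 4 docs: log(1) = 0
```

This is why sharing a high-IDF term (Case I) is strong evidence of duplication, while jointly lacking a low-IDF term (Case IV) is also informative: almost every document contains the common term, so its absence from both is a surprising coincidence.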
18
Weighted SimHash: Key Steps
19
Weighted SimHash
• Generation of fingerprint:
– Terms with higher IDFs contribute more to the sums that form the more significant bits (towards the left end of the fingerprint)
– Terms with lower IDFs contribute more to the sums that form the less significant bits (towards the right end of the fingerprint)
– This increases the chance of mismatches in the leading bits for non-duplicates
– How to achieve this? A multiplication factor
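The idea above can be sketched as follows. The exact form of mf_t = f(IDF_t, bp) is not given in the slides, so the linear interpolation below between a lower and an upper bound (mf_lo, mf_hi, matching the MF(lower, upper) pairs in the results slide) is one plausible instantiation, not the authors' definition; the hash choice is likewise an assumption.

```python
import hashlib

def weighted_simhash(term_idfs, bits=64, mf_lo=0.5, mf_hi=1.5):
    """Sketch of Weighted SimHash fingerprint generation.

    Assumed multiplication factor: a high-IDF term gets weight mf_hi at the
    MSB sliding down to mf_lo at the LSB, and a low-IDF term the reverse,
    so high-IDF terms dominate the leading (most significant) bits.
    """
    max_idf = max(term_idfs.values()) or 1.0
    sums = [0.0] * bits
    for term, term_idf in term_idfs.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        w = term_idf / max_idf           # normalized IDF in [0, 1]
        for bp in range(bits):           # bp = 0 is the MSB here
            frac = bp / (bits - 1)       # 0 at the MSB, 1 at the LSB
            # High-IDF terms (w near 1) weigh more toward the MSB,
            # low-IDF terms (w near 0) more toward the LSB.
            mf = mf_lo + (mf_hi - mf_lo) * (w * (1 - frac) + (1 - w) * frac)
            bit_of_term = (h >> (bits - 1 - bp)) & 1
            sums[bp] += mf if bit_of_term else -mf
    return [1 if s > 0 else 0 for s in sums]
```

With this shaping, two non-duplicates tend to disagree early (in the high-IDF-dominated leading bits), which is what lets the left-to-right scan in a later slide stop quickly.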
20
Weighted SimHash
• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), where bp is the bit position
[Figure: multiplication factor (y-axis, 0 to 2.5) vs. bit position (x-axis, MSB to LSB), with separate curves for a high-IDF term, a mid-IDF term, and a low-IDF term.]
23
Weighted SimHash
• Example (generation of fingerprint):
– Document D2: word1 + word4
24
Weighted SimHash
• Example (generation of fingerprint):
– Document D2: word1 + word4
25
Weighted SimHash
• Finding near duplicates
– Compare the fingerprint of the query document with that of each document in the collection
• Start the scan from the most significant (left-most) bit
• Count the number of mismatches
• If the number of mismatches exceeds k (the allowed Hamming distance threshold): not a near duplicate; stop the scan and go to the next document
• If the number of mismatches is within the threshold after scanning the entire fingerprint: a near duplicate has been found
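The scan above can be sketched as follows, with fingerprints represented as bit lists (MSB first) and hypothetical helper names:

```python
def find_near_duplicates(query_fp, collection_fps, k):
    """Scan each candidate fingerprint from the most significant bit,
    counting mismatches and abandoning the candidate as soon as the
    count exceeds the Hamming distance threshold k."""
    matches = []
    for doc_id, fp in collection_fps.items():
        mismatches = 0
        for i in range(len(query_fp)):  # i = 0 is the left-most (MSB) bit
            if query_fp[i] != fp[i]:
                mismatches += 1
                if mismatches > k:
                    break  # not a near duplicate; move to the next document
        else:
            matches.append(doc_id)  # stayed within k over the whole scan
    return matches
```

Since Weighted SimHash pushes mismatches for non-duplicates toward the leading bits, the inner loop usually breaks after a handful of comparisons, which is where the runtime advantage over plain SimHash comes from.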
26
Experiment
• Reuters dataset: almost 10k documents
– 10 documents randomly selected
– Each of the 10 documents was very slightly modified (at most two words changed per doc) to produce 20 variants per selection
– 200 documents in total, which we consider near duplicates of their respective selections
– The 10 original docs were then used as source queries
27
Experiment: Procedure
28
Results
Figure: Comparison of percentage recall for all the 20 query documents for the SimHash and Weighted SimHash methods, with k (the Hamming distance threshold, 0–10) on the x-axis and recall percentage (0–100) on the y-axis.
29
Results
Figure: Precision comparison between random projection (SimHash) and Weighted SimHash, with k (the Hamming distance threshold, 0–16) on the x-axis and precision percentage (0–100) on the y-axis. The figure shows no real difference between the two methods in terms of precision.
30
Results
Figure: Average execution time per query for each of the methods. For cosine similarity the threshold is 0.95.

Method                      Average execution time (ms)
Cosine Similarity (0.95)    4656
SimHash                     3820
WSH with MF(0.1, 1.0)       3560
WSH with MF(0.3, 1.0)       3490
WSH with MF(0.5, 1.0)       3587
WSH with MF(0.5, 1.5)       3664
WSH with MF(1.0, 1.3)       3680
WSH with MF(1.0, 1.6)       3440
WSH with MF(1.0, 2.0)       3560
31
Limitations of WSH
• Dependence on IDF
– Web search: IDF is unknown
– Heuristics can be used:
• IDF from the first 1000 documents
32
Limitations of WSH
• Difficulty setting the lower and upper bounds on the multiplication factor
– May vary from collection to collection
33
Conclusion
• Batch processing of a document collection:
– Runtime: WSH better than SimHash
– Precision and recall: WSH and SimHash are comparable
35
Conclusion
• Further work on SimHash:
– How much can the fingerprint be allowed to be altered?
36
Thank you