NDD project presentation
Posted on 30-Jun-2015
1
Weighted SimHash: A Random Projection Approach for Detecting Near-Duplicate Documents in Large Collections
Md Mishfaq Ahmed, Graduate Student, Department of CS, University of Memphis
2
Introduction
• Near-duplicate documents (NDD): identical in core content but differing in small portions of the document
– Harder to detect than exact duplicates
– Exact duplicates: standard methods exist
– Near duplicates: several approaches exist, but no widely accepted method to identify them
3
Near Duplicate: main sources
• News articles
• Web documents (web pages) differing only in advertisements and/or timestamps
– As many as 40% of the pages on the web are duplicates of other pages
4
Near Duplicate: main sources
• NDD techniques are also useful for sequences that are not documents (such as DNA sequences)
• Replication for reliability
– In file systems: the main content of an important document is replicated and stored in different places
5
Earlier Approaches for NDD
• A naive solution: compare a document with every document in the collection, word by word
– Prohibitively expensive on large datasets
• Another approach: convert documents into canonical forms until they become exact duplicates
• A more viable approach: approximation and probabilistic methods
– Trade-off: precision and recall ↔ manageable speed
6
Earlier Approaches for NDD
7
Shingling based methods
• A document d = a sequence of tokens
• Encode d as a set of unique k-grams
– k-gram = a contiguous sequence of k tokens
• Measure the overlap or similarity between two k-gram sets
• The sum of overlaps or similarities across the entire set gives the similarity between two documents
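The shingling idea above can be sketched as follows. This is a minimal illustration, not the exact method from the talk: the helper names, whitespace tokenization, and the use of Jaccard similarity as the set-overlap measure are assumptions.

```python
def kgrams(text, k=3):
    """Encode a document as the set of its unique k-grams (k contiguous tokens).

    Tokenization by whitespace is a simplifying assumption.
    """
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Set similarity |A ∩ B| / |A ∪ B| between two k-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate sentences differing in one word.
d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(kgrams(d1, 3), kgrams(d2, 3))  # shares 4 of 10 distinct 3-grams
```

A single changed word destroys up to k of the k-grams around it, which is why shingling is sensitive to small edits while still scoring near duplicates far above unrelated documents.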
9
Projection based methods: SimHash
Example:
– d1: word1 + word2 + word3
– d2: word1 + word4
10
SimHash: Example
• Document d1: word1 + word2 + word3
11
SimHash: Example
– Document d2: word1+word4
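The SimHash fingerprinting shown in the example slides can be sketched as follows. The use of MD5 as the per-term hash and a 64-bit fingerprint are illustrative choices, not necessarily the ones used in the talk.

```python
import hashlib

def simhash(terms, bits=64):
    """Classic SimHash: project each term onto {-1, +1}^bits via its hash,
    sum the projections per bit, and keep only the sign of each sum."""
    sums = [0] * bits
    for term in terms:
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1
    # Intermediate sums near zero correspond to volatile fingerprint bits.
    return sum(1 << i for i in range(bits) if sums[i] > 0)

def hamming(f1, f2):
    """Number of bit positions where two fingerprints differ."""
    return bin(f1 ^ f2).count("1")

# The example documents from the slides: d1 and d2 share word1.
fp1 = simhash(["word1", "word2", "word3"])
fp2 = simhash(["word1", "word4"])
```

Near duplicates share most of their terms, so most of their per-bit sums agree in sign and their fingerprints end up within a small Hamming distance of each other.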
12
Projection based methods: Probabilistic Simhash
• Key observations:
– Projection is already probabilistic
– Bits in a fingerprint are mutually independent
– Intermediate values are ignored while generating fingerprints
• They are useful for gaining insight into the volatility of a bit
13
Projection based methods: Probabilistic Simhash
• Key observations:
– For another document d that is not a near duplicate of d1, the fingerprint of d is most likely to differ from that of d1 at the bit position whose intermediate value is closest to zero
14
Projection based methods: Probabilistic Simhash
• Implementation
– A unique data structure per document to rank bits (or sets of bits) according to volatility
• Stores bit positions
– When comparing two fingerprints:
• Compare bits with higher volatility first
• Ensures quicker identification of non-duplicates
• Reduces the number of bit comparisons for non-duplicates
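The volatility-ordered comparison described above can be sketched as follows, with fingerprints kept as bit lists for clarity. The helper names and representation are assumptions; only the idea (rank bits by closeness of their intermediate sum to zero, compare the most volatile first, stop early) comes from the slides.

```python
def simhash_with_volatility(sums):
    """Given the per-bit intermediate sums from SimHash generation, return
    the fingerprint bits plus bit positions ranked most-volatile-first
    (smallest |intermediate value| = closest to flipping)."""
    bits = [1 if s > 0 else 0 for s in sums]
    order = sorted(range(len(sums)), key=lambda i: abs(sums[i]))
    return bits, order

def is_near_duplicate(bits1, bits2, order, k):
    """Compare volatile bits first; stop as soon as mismatches exceed k."""
    mismatches = 0
    for i in order:
        if bits1[i] != bits2[i]:
            mismatches += 1
            if mismatches > k:
                return False  # early exit for non-duplicates
    return True
```

Because non-duplicates are most likely to disagree exactly at the volatile positions, putting those positions first lets the comparison bail out after only a few bit checks in the common (non-duplicate) case.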
15
Projection based methods: Probabilistic Simhash
• Drawback:
– Overhead of an extra data structure per document, apart from the fingerprint
16
Our Approach: Weighted SimHash
• Main idea:
– Terms with higher inverse document frequency (IDF) are better at finding NDDs
• Consider two documents D1, D2 and two terms t1 (high IDF) and t2 (low IDF):
– Case I: both D1 and D2 have t1
– Case II: both D1 and D2 have t2
– Case III: neither of them has t1
– Case IV: neither of them has t2
• D1, D2 are more likely to be NDDs in Case I than in Case II
• D1, D2 are more likely to be NDDs in Case IV than in Case III
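A quick numeric illustration of the IDF notion used above (the corpus and helper name are made up for this sketch): a term appearing in few documents gets a high IDF, while a term appearing everywhere gets an IDF of zero and so carries no evidence either way.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df). Rare terms get high IDF."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else float("inf")

# Tiny toy corpus of token sets.
docs = [{"the", "simhash"}, {"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
high = idf("simhash", docs)  # appears in 1 of 4 docs: log(4) ≈ 1.386
low = idf("the", docs)       # appears in all 4 docs: log(1) = 0
```

This is why sharing a high-IDF term (Case I) is strong evidence of duplication, while jointly lacking a low-IDF term (Case IV) is also informative: almost every document contains the common term, so its absence from both is a surprising coincidence.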
18
Weighted SimHash: Key Steps
19
Weighted SimHash
• Generation of fingerprint:
– Terms with higher IDFs contribute more to the sums that form the more significant bits (towards the left end of the fingerprint)
– Terms with lower IDFs contribute more to the sums that form the less significant bits (towards the right end of the fingerprint)
– This increases the chance of mismatches in the leading bits for non-duplicates
– How to achieve this? A multiplication factor
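The idea above can be sketched as follows. The exact form of mf_t = f(IDF_t, bp) is not given in the slides, so the linear interpolation below between a lower and an upper bound (mf_lo, mf_hi, matching the MF(lower, upper) pairs in the results slide) is one plausible instantiation, not the authors' definition; the hash choice is likewise an assumption.

```python
import hashlib

def weighted_simhash(term_idfs, bits=64, mf_lo=0.5, mf_hi=1.5):
    """Sketch of Weighted SimHash fingerprint generation.

    Assumed multiplication factor: a high-IDF term gets weight mf_hi at the
    MSB sliding down to mf_lo at the LSB, and a low-IDF term the reverse,
    so high-IDF terms dominate the leading (most significant) bits.
    """
    max_idf = max(term_idfs.values()) or 1.0
    sums = [0.0] * bits
    for term, term_idf in term_idfs.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        w = term_idf / max_idf           # normalized IDF in [0, 1]
        for bp in range(bits):           # bp = 0 is the MSB here
            frac = bp / (bits - 1)       # 0 at the MSB, 1 at the LSB
            # High-IDF terms (w near 1) weigh more toward the MSB,
            # low-IDF terms (w near 0) more toward the LSB.
            mf = mf_lo + (mf_hi - mf_lo) * (w * (1 - frac) + (1 - w) * frac)
            bit_of_term = (h >> (bits - 1 - bp)) & 1
            sums[bp] += mf if bit_of_term else -mf
    return [1 if s > 0 else 0 for s in sums]
```

With this shaping, two non-duplicates tend to disagree early (in the high-IDF-dominated leading bits), which is what lets the left-to-right scan in a later slide stop quickly.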
20
Weighted SimHash
• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), where bp is the bit position
[Figure: multiplication factor (y-axis, 0 to 2.5) vs. bit position (x-axis, MSB to LSB), with separate curves for a high-IDF term, a mid-IDF term, and a low-IDF term.]
23
Weighted SimHash
• Example (generation of fingerprint):
– Document D2: word1 + word4
24
Weighted SimHash
• Example (generation of fingerprint):
– Document D2: word1 + word4
25
Weighted SimHash
• Finding near duplicates
– Compare the fingerprint of the query document with that of each document in the collection
• Start the scan from the most significant (left-most) bit
• Count the number of mismatches
• If the number of mismatches exceeds k (the allowed Hamming distance threshold): not a near duplicate; stop the scan and go to the next document
• If the number of mismatches is within the threshold after scanning the entire fingerprint: a near duplicate has been found
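The scan above can be sketched as follows, with fingerprints represented as bit lists (MSB first) and hypothetical helper names:

```python
def find_near_duplicates(query_fp, collection_fps, k):
    """Scan each candidate fingerprint from the most significant bit,
    counting mismatches and abandoning the candidate as soon as the
    count exceeds the Hamming distance threshold k."""
    matches = []
    for doc_id, fp in collection_fps.items():
        mismatches = 0
        for i in range(len(query_fp)):  # i = 0 is the left-most (MSB) bit
            if query_fp[i] != fp[i]:
                mismatches += 1
                if mismatches > k:
                    break  # not a near duplicate; move to the next document
        else:
            matches.append(doc_id)  # stayed within k over the whole scan
    return matches
```

Since Weighted SimHash pushes mismatches for non-duplicates toward the leading bits, the inner loop usually breaks after a handful of comparisons, which is where the runtime advantage over plain SimHash comes from.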
26
Experiment
• Reuters dataset: almost 10k documents
– 10 documents randomly selected
– Each of the 10 documents was very slightly modified (at most two words changed per doc) to produce 20 variants per selection
– 200 documents in total, which we consider near duplicates of their respective selections
– The 10 original docs were then used as source queries
27
Experiment: Procedure
28
Results
Figure: Comparison of percentage recall for all the 20 query documents for the SimHash and Weighted SimHash methods, with k (the Hamming distance threshold, 0–10) on the x-axis and recall percentage (0–100) on the y-axis.
29
Results
Figure: Precision comparison between random projection (SimHash) and Weighted SimHash, with k (the Hamming distance threshold, 0–16) on the x-axis and precision percentage (0–100) on the y-axis. The figure shows no real difference between the two methods in terms of precision.
30
Results
Figure: Average execution time per query for each of the methods. For cosine similarity the threshold is 0.95.

Method                      Average execution time (ms)
Cosine Similarity (0.95)    4656
SimHash                     3820
WSH with MF(0.1, 1.0)       3560
WSH with MF(0.3, 1.0)       3490
WSH with MF(0.5, 1.0)       3587
WSH with MF(0.5, 1.5)       3664
WSH with MF(1.0, 1.3)       3680
WSH with MF(1.0, 1.6)       3440
WSH with MF(1.0, 2.0)       3560
31
Limitations of WSH
• Dependence on IDF
– Web search: IDF is unknown
– Heuristics can be used:
• IDF from the first 1000 documents
32
Limitations of WSH
• Difficulty setting the lower and upper bounds on the multiplication factor
– May vary from collection to collection
33
Conclusion
• Batch processing of a document collection:
– Runtime: WSH better than SimHash
– Precision and recall: WSH and SimHash are comparable
35
Conclusion
• Further work on SimHash:
– How much can the fingerprint be allowed to be altered?
36
Thank you