Detecting Near-Duplicates for Web Crawling
Manku, Jain, Sarma
Presented By
Venkatesh Katari
Overview
• Why do we care?
• Purpose of the paper
• Proposed solution for finding near-duplicates
• Pros
• Cons
• Future research
Why Do We Care?
• Why do we want to detect near-duplicates?
– Save storage
– Search quality
• Web mirrors
• Clustering for “related documents” query
• Data extraction
• Plagiarism
• Spam detection
• Duplicates in domain-specific corpora
Purpose of The Paper?
• This paper addresses the following issues:
1. Finding near-duplicates on the web
2. Handling the scale of the web
• Tens of billions of documents indexed
• Millions of pages crawled every day
3. Which features to select when detecting duplicates
4. Algorithms for single-query and batch processing
5. Survey of other techniques in this field
What are Near-Duplicates?
• Documents with identical content that differ only in a small portion of the document
– Advertisements
– Counters
– Timestamps
Simplified Crawl Architecture
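[Figure: the crawler traverses links on the Web and fetches HTML documents. Each newly crawled document (one document) is checked against the entire Web index with the question "Near-duplicate?"; it is then either inserted into the index or trashed.]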
Feature-set per document
• Shingles from page content
• Connectivity information
• Anchor text, anchor window
• Phrases
• Document vector from page content
– Case-folding
– Stop-word removal
– Stemming
– Computing term frequencies and weighting each term by its inverse document frequency
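The document-vector steps above can be sketched roughly as follows. This is only an illustration: the stop-word list and the sample documents are made up, stemming is left out to keep the example free of third-party libraries, and the plain tf × log(N/df) weighting is one common choice rather than the weighting used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Case-fold, split on non-alphanumerics, and drop stop-words (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = ["The cat sat on the mat", "The dog sat on the log", "A cat and a dog"]
vectors = tfidf_vectors(docs)
```

Each resulting dict maps a feature (term) to a weight, which is the kind of input a fingerprinting step can consume.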
Simhash
• Dimensionality-reduction technique
• Obtain an f-bit fingerprint for each document
• A pair of documents is treated as near-duplicates if and only if their fingerprints are at most k bits apart
• Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
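The "at most k bits apart" test is a Hamming-distance check. A minimal sketch, assuming fingerprints are held as Python integers (the function names and the sample value are illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
    """True if the two fingerprints differ in at most k bit positions."""
    return hamming_distance(fp_a, fp_b) <= k

fp = 0x0123456789ABCDEF          # arbitrary 64-bit example value
flipped3 = fp ^ 0b111            # differs from fp in exactly 3 bits
flipped4 = fp ^ 0b1111           # differs from fp in exactly 4 bits
```

With k = 3, `flipped3` counts as a near-duplicate of `fp` and `flipped4` does not.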
Simhash
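• Doc. → (feature, weight) pairs with weights w1, w2, …, wn
• Each feature is hashed; each hash bit contributes +weight (bit 1) or -weight (bit 0)
– 100110, w1 → w1 -w1 -w1 w1 w1 -w1
– 110000, w2 → w2 w2 -w2 -w2 -w2 -w2
– 001001, wn → -wn -wn wn -wn -wn wn
• Add column-wise: 13, 108, -22, -5, -32, 55
• Take the sign of each sum: 110001 (the fingerprint)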
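The hash, weight, add and sign steps can be sketched as below. The choice of BLAKE2b as the per-feature hash is an assumption made only so the example is deterministic; the slides do not say which hash function is used, and the helper names are illustrative.

```python
import hashlib

def fingerprint_from_hashes(hashed_features, f):
    """Combine (hash_value, weight) pairs into an f-bit fingerprint."""
    sums = [0.0] * f
    for h, weight in hashed_features:
        for i in range(f):
            # Bit set: add the weight; bit clear: subtract it.
            sums[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(f):
        if sums[i] > 0:          # sign step: positive sum -> bit 1
            fingerprint |= 1 << i
    return fingerprint

def simhash(weighted_features, f=64):
    """Simhash of a {feature: weight} dict; f must be at most 64 here."""
    hashed = []
    for feature, weight in weighted_features.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big") & ((1 << f) - 1)
        hashed.append((h, weight))
    return fingerprint_from_hashes(hashed, f)

# Same 6-bit hashes as on the slide, with made-up weights 4, 2, 1.
small = fingerprint_from_hashes([(0b100110, 4), (0b110000, 2), (0b001001, 1)], 6)
```

With these made-up weights the heaviest feature wins every column, so `small` equals its hash, 100110. Documents that share most of their heavily weighted features tend to agree in most bit positions.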
Method One
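• Keep the fingerprints in S pre-sorted; for a 64-bit query Q, generate all Q' with hd(Q, Q') ≤ k = 3 and do an exact probe for each
• Alternatively, precompute S', all fingerprints at most k bits away from a fingerprint in S, and sort it; then do an exact probe for the 64-bit Q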
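To get a feel for the cost of probing every Q' within k bits of Q, the probes can be enumerated and counted. In this sketch a Python set stands in for the sorted table, and all names are illustrative.

```python
from itertools import combinations
from math import comb

def neighbors(q, f=64, k=3):
    """Yield every f-bit value within Hamming distance k of q (q included)."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield q ^ mask

def probes_needed(f=64, k=3):
    """Number of exact probes required for one query."""
    return sum(comb(f, r) for r in range(k + 1))

def find_near_duplicates(q, fingerprints, f=64, k=3):
    """Probe a set of stored fingerprints with every neighbor of q."""
    return [c for c in neighbors(q, f, k) if c in fingerprints]
```

For f = 64 and k = 3 this comes to 1 + 64 + 2016 + 41664 = 43745 exact probes per query, which motivates looking for something cheaper.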
Method Two
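• Observation 1: Consider 2^d f-bit fingerprints in sorted order
– Most of the 2^d combinations of the d most significant bits exist
– Can quickly do an exact probe on the first d' (≤ d) bits
• Observation 2: if hd(Q, Q') = 3, the differing bits fall in at most 3 places, so a suitably chosen block of bits of Q' is an exact match for Q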
Final implementation
Example
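[Figure: each 64-bit fingerprint in S is split into four 16-bit pieces A, B, C, D. Four tables are kept, with A, B, C and D respectively moved to the front. The 64-bit query Q is split into Q1, Q2, Q3, Q4, and each Qi is matched by exact search on the leading 16 bits of the corresponding table.]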
Example: Analysis
• 64 bits split into 4 pieces
• 4 tables with permuted fingerprints
• Exact search on 16 bits
• If there are 2^34 (on the order of 10 billion) fingerprints
– Each probe returns about 2^(34-16) = 2^18 fingerprints
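A rough sketch of the four-table idea follows. It substitutes Python dicts keyed by a 16-bit piece for the sorted, permuted tables described in the slides; the names and sample values are illustrative. Because at most 3 bits differ and there are 4 pieces, at least one piece of a true near-duplicate matches the query exactly, so every near-duplicate shows up as a candidate in some table.

```python
from collections import defaultdict

F_BITS, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F_BITS // BLOCKS  # 16

def block(fp, i):
    """Return the i-th 16-bit piece of a fingerprint (0 = most significant)."""
    shift = F_BITS - BLOCK_BITS * (i + 1)
    return (fp >> shift) & ((1 << BLOCK_BITS) - 1)

def build_tables(fingerprints):
    """One table per piece, mapping the piece value to the fingerprints having it."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in fingerprints:
        for i in range(BLOCKS):
            tables[i][block(fp, i)].append(fp)
    return tables

def query(tables, q, k=K):
    """Exact search on each 16-bit piece, then a Hamming check on the candidates."""
    matches = set()
    for i in range(BLOCKS):
        for candidate in tables[i].get(block(q, i), ()):
            if bin(candidate ^ q).count("1") <= k:
                matches.add(candidate)
    return matches

stored = [0xAAAA_BBBB_CCCC_DDDD, 0x1111_2222_3333_4444]
tables = build_tables(stored)
q3 = stored[0] ^ (1 | 1 << 16 | 1 << 32)            # 3 flipped bits, piece A intact
q4 = stored[0] ^ (1 | 1 << 16 | 1 << 32 | 1 << 48)  # 4 flipped bits, one per piece
```

Here `query(tables, q3)` finds `stored[0]` through the table keyed on piece A, while `q4` differs in every piece and is correctly not reported.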
Batch Algorithm
• Tens of billions of pages indexed
• Crawl millions of pages each day
• Quickly find all new pages that have a near-duplicate in the index
MapReduce Framework
• MapReduce framework used within Google
– Massively parallel
• Map phase
– Operates individually on a set of objects
• Reduce phase
– Aggregates results of the mapped objects
Batch Algorithm
• Suppose 8B existing fingerprints (~32 GB after compression): File F
• 1M batch query fingerprints (~8 MB): File B
• F is stored in GFS
– Split into chunks of roughly 64 MB
– Each chunk replicated at 3 random nodes
• B is stored with a much higher replication factor
Batch Algorithm (continued)
• Map phase
– Near-duplicate detection between each chunk Fi and the whole of B
– Build multiple tables for B (in memory)
– Scan Fi and probe into B
– Output near-duplicates in B
• Reduce phase
– Merge outputs
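The map and reduce phases can be imitated in a single process as below. This is only a sketch under simplifying assumptions: plain Python lists stand in for the GFS chunks, a dict keyed by 16-bit pieces stands in for the in-memory tables over B, and no real MapReduce API is involved. All names and sample fingerprints are made up.

```python
from collections import defaultdict

BLOCKS, BLOCK_BITS, K = 4, 16, 3

def pieces(fp):
    """Split a 64-bit fingerprint into (piece index, 16-bit piece value) keys."""
    return [(i, (fp >> (BLOCK_BITS * i)) & 0xFFFF) for i in range(BLOCKS)]

def build_batch_tables(batch):
    """In-memory tables for B, keyed by (piece index, piece value)."""
    tables = defaultdict(list)
    for fp in batch:
        for key in pieces(fp):
            tables[key].append(fp)
    return tables

def map_chunk(chunk, tables, k=K):
    """Scan one chunk Fi and emit the fingerprints of B with a near-duplicate in it."""
    found = set()
    for existing in chunk:
        for key in pieces(existing):
            for candidate in tables.get(key, ()):
                if bin(candidate ^ existing).count("1") <= k:
                    found.add(candidate)
    return found

def reduce_outputs(outputs):
    """Merge the per-chunk outputs into one result set."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged

existing = [0x0000_0000_0000_0000, 0xFFFF_0000_FFFF_0000, 0x1234_5678_9ABC_DEF0]
chunks = [existing[:2], existing[2:]]            # stand-ins for chunks F1, F2
batch = [0x0000_0000_0000_0007, 0x1234_5678_9ABC_DEF1, 0xFFFF_FFFF_FFFF_FFFF]
tables = build_batch_tables(batch)
near_dups = reduce_outputs(map_chunk(c, tables) for c in chunks)
```

Each `map_chunk` call depends only on its own chunk and the shared tables for B, which is why the scan parallelizes across chunks; the last batch fingerprint has no near-duplicate in F and is left out of `near_dups`.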
Pros
• Addressed near-duplicate detection in a web-crawling system
• Proposed algorithms for single and batch cases
• Experiments to validate the suitability of simhash
• Mini-survey of near-duplicate detection techniques in the paper
Cons
• Weight selection for the feature set
• Handling of continuously changing IDF
• How to find near-duplicates when data is present in different formats
• Inadequate results
References
• G. Manku, A. Jain, A. Das Sarma. Detecting near duplicates for web crawling. WWW 2007, pp. 141-150, 2007.
• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.
• J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.
• Articles from Wikipedia etc.
Future Research
• Considering document size while detecting near-duplicates
• Pruning the space of existing fingerprints
• Categorizing web pages
• Removal of portions of web pages with ads and timestamps
Q & A