detecting near-duplicates for web crawling manku, jain, sarma presented by venkatesh katari

Detecting Near-Duplicates for Web Crawling

Manku, Jain, Sarma

Presented By

Venkatesh Katari

Overview:

• Why do we care?

• Purpose of the paper

• Proposed solution for finding near-duplicates

• Pros

• Cons

• Future research

Why Do We Care?

• Why do we want to detect near-duplicates?

– Save storage

– Search quality

• Web mirrors

• Clustering for “related documents” query

• Data extraction

• Plagiarism

• Spam detection

• Duplicates in domain-specific corpora

Purpose of The Paper?

• This paper addresses the following issues:

1. Finding near duplicates on the web.

2. Handling the scale of the web

• Tens of billions of documents indexed

• Millions of pages crawled every day

3. Which features to select when detecting duplicates

4. Algorithms for single-query and batch processing

5. Survey of other techniques in this field

What are Near-Duplicates?

• Documents with identical content that differ only in a small portion, such as:

– Advertisements

– Counters

– Timestamps

Simplified Crawl Architecture

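[Figure: simplified crawl architecture. The crawler traverses links on the Web and fetches HTML documents. Each newly crawled document (one document) is checked against the entire web index with the test "Near-duplicate?", and is then either inserted into the web index or trashed.]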

Feature-set per document

• Shingles from page content

• Connectivity information

• Anchor text, anchor window

• Phrases

• Document vector from page content

- case-folding

- stop-word removal

- stemming

- computing term frequencies and weighting each term by its inverse document frequency
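The document-vector steps listed above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the stop-word list, the suffix-stripping `stem` function and the plain `tf * log(N / df)` weighting are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}
SUFFIXES = ("ing", "ed", "s")


def stem(token):
    """Toy stemmer: strip one common suffix (a stand-in for a real stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Case-fold, split into alphanumeric tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def document_vectors(corpus):
    """Return one {term: tf * idf} dictionary per document in the corpus."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        )
    return vectors
```

A term that occurs in every document gets weight 0 under this weighting, so it contributes nothing to the fingerprint computed from the vector.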

Simhash

• Dimensionality-reduction technique

• Obtain an f-bit fingerprint for each document

• A pair of documents is treated as near-duplicate if and only if their fingerprints are at most k bits apart (Hamming distance ≤ k)

• Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
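The near-duplicate test on fingerprints is a Hamming-distance comparison. A minimal sketch, with function names that are illustrative rather than taken from the paper:

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, k=3):
    """True if the two fingerprints are at most k bits apart."""
    return hamming_distance(fp_a, fp_b) <= k
```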

Simhash

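[Figure: simhash computation, shown for a 6-bit example]

1. Doc. → (feature, weight) pairs with weights w1, w2, ..., wn

2. Hash each feature → (hash, weight) pairs:

   100110   w1
   110000   w2
   001001   wn

3. For each bit position, take +weight where the hash bit is 1 and -weight where it is 0:

    w1  -w1  -w1   w1   w1  -w1
    w2   w2  -w2  -w2  -w2  -w2
   -wn  -wn   wn  -wn  -wn   wn

4. Add the columns: 13, 108, -22, -5, -32, 55

5. Take the sign of each sum: 110001 (the fingerprint)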
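The computation in the figure above can be sketched in a few lines. The per-feature hash is an assumption here (the top f bits of SHA-256); any well-mixed f-bit hash would serve, and this sketch does not claim to match the hash used in the paper.

```python
import hashlib


def feature_hash(feature, f=64):
    """Map a feature string to an f-bit integer (assumed hash: SHA-256, top f bits)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - f)


def simhash_from_hashes(hashed_features, f):
    """Combine (f-bit hash, weight) pairs into one f-bit fingerprint."""
    totals = [0.0] * f
    for h, weight in hashed_features:
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1  # i = 0 is the most significant bit
            totals[i] += weight if bit else -weight
    fingerprint = 0
    for total in totals:
        # A positive sum gives a 1 bit; zero or negative gives a 0 bit.
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint


def simhash(weighted_features, f=64):
    """Fingerprint a document given as a {feature: weight} dictionary."""
    pairs = ((feature_hash(feat, f), w) for feat, w in weighted_features.items())
    return simhash_from_hashes(pairs, f)
```

With a single feature the fingerprint equals that feature's hash, and a feature with a dominant weight pulls the fingerprint toward its own hash bits.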

Method One

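• Option 1: keep the fingerprints in S pre-sorted. For a 64-bit query Q, generate all Q' with hd(Q, Q') ≤ k = 3 and do an exact probe for each Q'.

• Option 2: precompute S', the set of all fingerprints at most k bits away from a fingerprint in S, and sort it. For a 64-bit query Q, do an exact probe into S'.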
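To see why probing every Q' is expensive, the sketch below enumerates all fingerprints within k bits of a query and counts them. For f = 64 and k = 3 this is 1 + 64 + 2016 + 41664 = 43,745 exact probes per query; by the same count, the precomputed set S' can be up to 43,745 times larger than S. The function names are illustrative, and a Python set stands in for the sorted table.

```python
from itertools import combinations
from math import comb


def neighbours_within(q, f=64, k=3):
    """Yield every f-bit value whose Hamming distance from q is at most k."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            candidate = q
            for p in positions:
                candidate ^= 1 << p  # flip bit p
            yield candidate


def probes_per_query(f=64, k=3):
    """Number of exact probes needed when every neighbour of Q is tried."""
    return sum(comb(f, r) for r in range(k + 1))


def find_near_duplicates(q, stored, f=64, k=3):
    """Brute force: probe the stored set once per neighbour of q."""
    return {c for c in neighbours_within(q, f, k) if c in stored}
```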

Method Two

• Observation 1: Consider 2^d f-bit fingerprints in sorted order

– Most of the 2^d bit combinations in the d most significant bits exist

– Can quickly do an exact probe on the first d' (≤ d) bits

• Observation 2:

[Figure: fingerprints Q and Q' with hd(Q, Q') = 3; a block of their bits is an exact match]

Final implementation

Example

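[Figure: each 64-bit fingerprint in S is split into four 16-bit blocks A, B, C, D. Four tables are kept, one per block (A, B, C, D). The 64-bit query Q is split the same way into Q1, Q2, Q3, Q4, and each table is probed by exact search on 16 bits.]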

Example: Analysis

• 64 bits split into 4 pieces

• 4 tables with permuted fingerprints

• Exact search on 16 bits

• If there are 2^34 (≈ 17 billion) fingerprints

– Each probe returns about 2^(34-16) = 2^18 fingerprints
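The four-table scheme can be sketched as below. It relies on a pigeonhole argument: if Q and Q' differ in at most 3 bits and the 64 bits are split into 4 blocks, at least one block is identical in both. As a simplification, dictionaries keyed by the 16-bit block stand in for the sorted, permuted tables, and the names `blocks`, `build_tables` and `query` are illustrative.

```python
from collections import defaultdict

F, BLOCKS = 64, 4
BLOCK_BITS = F // BLOCKS  # 16 bits per block


def blocks(fp):
    """Split a 64-bit fingerprint into four 16-bit blocks, most significant first."""
    mask = (1 << BLOCK_BITS) - 1
    return [(fp >> (F - BLOCK_BITS * (i + 1))) & mask for i in range(BLOCKS)]


def build_tables(fingerprints):
    """One table per block position, mapping block value -> fingerprints."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in fingerprints:
        for i, block in enumerate(blocks(fp)):
            tables[i][block].append(fp)
    return tables


def query(tables, q, k=3):
    """Exact search on 16 bits in each table, then check the full Hamming distance."""
    matches = set()
    for i, block in enumerate(blocks(q)):
        for candidate in tables[i].get(block, ()):
            if bin(candidate ^ q).count("1") <= k:
                matches.add(candidate)
    return matches
```

With 2^34 uniformly distributed fingerprints, each 16-bit probe would return about 2^(34-16) = 262,144 candidates, each of which still needs the full Hamming-distance check.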

Batch Algorithm

• Tens of billions of pages indexed

• Millions of pages crawled each day

• Goal: quickly find all new pages that have a near-duplicate in the index

MapReduce Framework

• MapReduce framework used within Google

– massively parallel

• Map phase:

– operate individually on a set of objects

• Reduce phase:

– aggregate the results of the mapped objects

Batch Algorithm

• Suppose 8B existing fingerprints (~32 GB after compression): file F

• 1M batch query fingerprints (~8 MB): file B

• F is stored in GFS

– split into chunks of roughly 64 MB

– each chunk replicated at 3 random nodes

• B is stored with a much higher replication factor

Batch Algorithm (continued)

• Map phase:

– Duplicate detection between each chunk Fi and the whole of B

– Build multiple tables for B (in memory)

– Scan Fi and probe into B

– Output near-duplicates in B

• Reduce phase:

– Merge outputs
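The map and reduce phases above can be sketched in plain Python. This runs sequentially on one machine and only mirrors the data flow; it is not the MapReduce API. The block-keyed dictionaries for B, the function names and the tiny `chunk_size` are assumptions for illustration.

```python
from collections import defaultdict

F, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F // BLOCKS
MASK = (1 << BLOCK_BITS) - 1


def block(fp, i):
    """The i-th 16-bit block of a 64-bit fingerprint, most significant first."""
    return (fp >> (F - BLOCK_BITS * (i + 1))) & MASK


def build_batch_tables(batch):
    """In-memory tables for the batch file B: block value -> query fingerprints."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for q in batch:
        for i in range(BLOCKS):
            tables[i][block(q, i)].append(q)
    return tables


def map_chunk(chunk, batch_tables):
    """Map task: scan one chunk Fi and probe each fingerprint into the tables for B."""
    found = set()
    for fp in chunk:
        for i in range(BLOCKS):
            for q in batch_tables[i].get(block(fp, i), ()):
                if bin(fp ^ q).count("1") <= K:
                    found.add(q)
    return found


def reduce_outputs(outputs):
    """Reduce task: merge the near-duplicates reported by every map task."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged


def batch_near_duplicates(existing, batch, chunk_size=2):
    """Return the fingerprints in B that have a near-duplicate in F."""
    tables = build_batch_tables(batch)
    chunks = [existing[i:i + chunk_size] for i in range(0, len(existing), chunk_size)]
    return reduce_outputs(map_chunk(c, tables) for c in chunks)
```

Building the tables over B rather than over F keeps the in-memory structure small (B is about 8 MB), while each map task streams through its own chunk of F.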

Pros

• Addressed near-duplicate detection in a web-crawling system

• Proposed algorithms for single and batch cases

• Experiments to validate the suitability of simhash

• Mini-survey of near-duplicate detection techniques in the paper

Cons

• Weight selection for the feature set

• Handling of continuously changing IDF

• How to find near-duplicates when data is present in different formats

• Inadequate results

References

• G. Manku, A. Jain, A. Das Sarma. Detecting near duplicates for web crawling. WWW 2007, pp. 141-150, 2007.

• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.

• J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.

• Articles from Wikipedia etc.

Future Research

• Considering document size while detecting near duplicates

• Pruning the space of existing fingerprints

• Categorizing web pages

• Removal of portions of web pages with ads and timestamps
