Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma Presented By Venkatesh Katari

Post on 11-Jan-2016

Page 1: Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma Presented By Venkatesh Katari

Detecting Near-Duplicates for Web Crawling

Manku, Jain, Sarma

Presented By

Venkatesh Katari

Page 2

Overview:

• Why do we care?
• Purpose of the paper
• Proposed solution for finding near-duplicates
• Pros
• Cons
• Future research

Page 3

Why Do We Care?

• Why do we want to detect near-duplicates?
– Save storage
– Search quality
• Web mirrors
• Clustering for “related documents” query
• Data extraction
• Plagiarism
• Spam detection
• Duplicates in domain-specific corpora

Page 4

Purpose of the Paper

• This paper addresses the following issues:
1. Finding near-duplicates on the web
2. Handling the scale of the web
• Tens of billions of documents indexed
• Millions of pages crawled every day
3. Which features to select when detecting duplicates
4. Algorithms for single-query and batch processing
5. Survey of other techniques in this field

Page 5

What are Near-Duplicates?

• Documents with identical content that differ only in a small portion of the document, such as:
– Advertisements
– Counters
– Timestamps

Page 6

Simplified Crawl Architecture

[Figure: simplified crawl architecture. The crawler traverses links on the Web and fetches newly-crawled HTML document(s). Each document (one document at a time) is checked against the entire Web index with a "Near-duplicate?" test; the outcome is either insert (into the index) or trash.]

Page 7

Feature-set per document

• Shingles from page content
• Connectivity information
• Anchor text, anchor window
• Phrases
• Document vector from page content
– case-folding
– stop-word removal
– stemming
– computing term frequencies and weighting each term by its inverse document frequency
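The document-vector steps listed above can be illustrated with a minimal Python sketch. It is not taken from the paper: the stop-word list, the tokenizer, the `doc_freq` table and the IDF variant `log(N / (1 + df))` are all assumptions chosen for illustration, and stemming is left out because it would need an external library.

```python
import math
import re
from collections import Counter
from typing import Dict

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def weighted_features(text: str, doc_freq: Dict[str, int], num_docs: int) -> Dict[str, float]:
    """Return {term: tf * idf} for one document.

    doc_freq maps a term to the number of documents containing it;
    num_docs is the corpus size. Both are hypothetical inputs.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # case-folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    term_freq = Counter(tokens)                           # term frequencies
    return {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in term_freq.items()
    }


features = weighted_features("The cat and the CAT", {"cat": 9}, 100)
print(features)
```

The resulting dictionary of (feature, weight) pairs is the kind of input that a fingerprinting step such as simhash consumes.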

Page 8

Simhash

• Dimensionality-reduction technique
• Obtain an f-bit fingerprint for each document
• A pair of documents is treated as near-duplicate if and only if their fingerprints differ in at most k bits
• Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
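The near-duplicate criterion on this slide is a Hamming-distance test on fingerprints. A minimal sketch, assuming fingerprints are held as Python integers (the function names are illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
    """Apply the slide's criterion: fingerprints at most k bits apart."""
    return hamming_distance(fp_a, fp_b) <= k


# The two values differ in two bit positions, so they pass the k = 3 test.
print(is_near_duplicate(0b110001, 0b110111))  # True
```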

Page 9

Simhash

[Figure: worked example with 6-bit hashes]

Doc. → (feature, weight) pairs → (hash, weight) pairs:

feature 1: hash 100110, weight w1 → w1 -w1 -w1 w1 w1 -w1
feature 2: hash 110000, weight w2 → w2 w2 -w2 -w2 -w2 -w2
feature n: hash 001001, weight wn → -wn -wn wn -wn -wn wn

add (per bit position): 13, 108, -22, -5, -32, 55
sign: 110001 (fingerprint)
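The computation in this figure (hash each feature, add its weight for a 1 bit and subtract it for a 0 bit, then take the sign of each total) can be sketched in Python. The use of MD5 as the per-feature hash is an assumption for illustration; the slides do not say which hash function is used.

```python
import hashlib
from typing import Dict


def feature_hash(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (MD5 is an arbitrary choice; f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features: Dict[str, float], f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint from {feature: weight}."""
    totals = [0.0] * f
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            # Bit set: add the weight. Bit clear: subtract it.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i in range(f):
        # Sign step: positive totals become 1 bits, the rest 0 bits.
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


doc_a = {"near": 1.0, "duplicate": 2.0, "crawl": 1.5, "web": 0.5}
doc_b = {"near": 1.0, "duplicate": 2.0, "crawl": 1.5, "index": 0.5}
print(bin(simhash(doc_a) ^ simhash(doc_b)).count("1"))  # bits that differ
```

Documents that share most of their heavily weighted features tend to produce fingerprints that differ in few bits, which is what makes the Hamming-distance test usable.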

Page 10

Method One

• Keep the fingerprints in S pre-sorted
• For a 64-bit query Q, generate all Q’ with hd(Q,Q’) ≤ k = 3
• Do an exact probe into S for each Q’

Page 11

Method Two

• From the fingerprints in S, precompute S’: all fingerprints at most k bits away from a fingerprint in S (sorted)
• For a 64-bit query Q, do an exact probe into S’
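Methods One and Two can be contrasted with a small Python sketch. It is an illustration under assumptions: fingerprints are plain integers, the sorted tables are Python lists searched with `bisect`, and all function names are hypothetical.

```python
from bisect import bisect_left
from itertools import combinations
from math import comb
from typing import Iterable, Iterator, List


def neighbours(fp: int, f: int = 64, k: int = 3) -> Iterator[int]:
    """Yield every f-bit value within Hamming distance k of fp (fp included)."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            flipped = fp
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def contains(sorted_fps: List[int], value: int) -> bool:
    """Exact probe into a sorted list via binary search."""
    i = bisect_left(sorted_fps, value)
    return i < len(sorted_fps) and sorted_fps[i] == value


def method_one(q: int, sorted_s: List[int], f: int = 64, k: int = 3) -> List[int]:
    """Probe pre-sorted S once for every Q' with hd(Q, Q') <= k."""
    return [c for c in neighbours(q, f, k) if contains(sorted_s, c)]


def build_s_prime(s: Iterable[int], f: int = 64, k: int = 3) -> List[int]:
    """Method Two: precompute and sort every value within k bits of S."""
    return sorted({n for fp in s for n in neighbours(fp, f, k)})


def method_two(q: int, sorted_s_prime: List[int]) -> bool:
    """A single exact probe into S' tells whether Q has a near-duplicate in S."""
    return contains(sorted_s_prime, q)


# Number of values within 3 bits of a 64-bit fingerprint.
print(sum(comb(64, r) for r in range(4)))  # 43745
```

The count printed at the end shows the trade-off: with f = 64 and k = 3, Method One issues that many probes per query, while Method Two multiplies the stored table by roughly the same factor. In this sketch `method_two` only reports whether a near-duplicate exists; returning which fingerprint matched would require storing the originating fingerprint alongside each entry of S’.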

Page 12

Final implementation

• Observation 1: Consider 2^d f-bit fingerprints in sorted order
– Most of the 2^d combinations in the d most significant bits exist
– Can quickly do an exact probe on the first d’ (≤ d) bits
• Observation 2:
[Figure: Q and Q’ with hd(Q,Q’) = 3; the part of the fingerprint that contains none of the differing bits is an exact match]

Page 13

Example

[Figure: each 64-bit fingerprint in S is split into four 16-bit blocks A, B, C, D, and the fingerprints are stored in four tables, one per block (A, B, C, D). The 64-bit query Q is split the same way into Q1, Q2, Q3, Q4, and each table is probed with an exact search on 16 bits.]

Page 14

Example: Analysis

• 64 bits split into 4 pieces
• 4 tables with permuted fingerprints
• Exact search on 16 bits
• If there are 2^34 (≈ 17 billion) fingerprints
– Each probe gives about 2^(34-16) = 2^18 fingerprints
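The four-table scheme in this example can be sketched as follows. This is a simplification under stated assumptions: instead of sorted tables of permuted fingerprints, each table is a Python dictionary keyed by one 16-bit block, which plays the role of the exact search on 16 bits. The scheme relies on k = 3 being smaller than the number of blocks, so at least one block of a near-duplicate matches exactly.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

F_BITS = 64
BLOCKS = 4
BLOCK_BITS = F_BITS // BLOCKS  # 16


def block(fp: int, j: int) -> int:
    """Return the j-th 16-bit block of fp (j = 0 is the most significant)."""
    shift = F_BITS - (j + 1) * BLOCK_BITS
    return (fp >> shift) & ((1 << BLOCK_BITS) - 1)


def build_tables(fingerprints: Iterable[int]) -> List[Dict[int, List[int]]]:
    """Build one table per block, keyed by that block's value."""
    tables: List[Dict[int, List[int]]] = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in fingerprints:
        for j in range(BLOCKS):
            tables[j][block(fp, j)].append(fp)
    return tables


def query(tables: List[Dict[int, List[int]]], q: int, k: int = 3) -> Set[int]:
    """Probe each table on one block of q, then verify the Hamming distance."""
    matches: Set[int] = set()
    for j in range(BLOCKS):
        for candidate in tables[j].get(block(q, j), []):
            if bin(candidate ^ q).count("1") <= k:
                matches.add(candidate)
    return matches


stored = 0x0123456789ABCDEF
tables = build_tables([stored])
q = stored ^ (1 << 0) ^ (1 << 20) ^ (1 << 40)  # 3 flipped bits, in 3 different blocks
print(query(tables, q) == {stored})  # True

# Expected candidates per probe for 2^34 stored fingerprints and a 16-bit key.
print(2 ** (34 - 16))  # 262144
```

Each probe narrows the search to the fingerprints sharing one 16-bit block with the query; the final Hamming-distance check discards the candidates that differ in more than k bits overall.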

Page 15

Batch Algorithm

• Tens of billions of pages indexed
• Crawl millions of pages each day
• Quickly find all new pages having a near-duplicate in the index

Page 16

MapReduce Framework

• MapReduce framework used within Google
– massively parallel
• Map phase:
– operate individually on a set of objects
• Reduce phase:
– aggregate results of the mapped objects

Page 17

Batch Algorithm

• Suppose 8B existing fingerprints (~32GB after compression): File F
• 1M batch query fingerprints (~8MB): File B
• F stored in GFS
– chunked into pieces of roughly 64MB
– each chunk replicated at 3 random nodes
• B stored with a much higher replication factor

Page 18

Batch Algorithm (continued)

• Map phase:
– Near-duplicate detection between each chunk Fi and the whole of B
– Build multiple tables for B (in memory)
– Scan Fi and probe into B
– Output near-duplicates in B
• Reduce phase:
– Merge outputs
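The map and reduce phases above can be mimicked in a single Python process. This is only a sketch of the data flow, not the MapReduce API: chunks are plain lists, the tables for B are dictionaries keyed by 16-bit blocks (an assumed stand-in for the in-memory tables), and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

K = 3            # maximum Hamming distance
BLOCKS = 4       # number of 16-bit blocks in a 64-bit fingerprint
BLOCK_BITS = 16
MASK = (1 << BLOCK_BITS) - 1


def blocks(fp: int) -> List[int]:
    """Split a 64-bit fingerprint into four 16-bit blocks."""
    return [(fp >> (BLOCK_BITS * j)) & MASK for j in range(BLOCKS)]


def build_batch_tables(batch: Iterable[int]) -> List[Dict[int, List[int]]]:
    """Build the in-memory tables for the batch file B."""
    tables: List[Dict[int, List[int]]] = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in batch:
        for j, b in enumerate(blocks(fp)):
            tables[j][b].append(fp)
    return tables


def map_chunk(chunk: Iterable[int], tables: List[Dict[int, List[int]]]) -> Set[int]:
    """Map phase: scan one chunk Fi and probe every fingerprint into B."""
    found: Set[int] = set()
    for existing in chunk:
        for j, b in enumerate(blocks(existing)):
            for candidate in tables[j].get(b, []):
                if bin(candidate ^ existing).count("1") <= K:
                    found.add(candidate)
    return found


def reduce_outputs(outputs: Iterable[Set[int]]) -> Set[int]:
    """Reduce phase: merge the per-chunk outputs."""
    merged: Set[int] = set()
    for out in outputs:
        merged |= out
    return merged


chunks_of_f = [[0xAAAA0000FFFF1234], [0x1]]            # stand-ins for chunks Fi
batch_b = [0xAAAA0000FFFF1235, 0xFFFFFFFFFFFFFFFF]      # stand-in for file B
tables_b = build_batch_tables(batch_b)
print(reduce_outputs(map_chunk(c, tables_b) for c in chunks_of_f))
```

Building the tables on the small file B and streaming each chunk of F past them keeps the in-memory state small on every mapper. With the figures from the previous slide, 32GB in chunks of roughly 64MB comes to about 512 chunks, each of which can be scanned independently.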

Page 19

Pros

• Addressed near-duplicate detection in a web-crawling system
• Proposed algorithms for single and batch cases
• Experiments to validate the suitability of simhash
• Mini-survey of near-duplicate detection techniques in the paper

Page 20

Cons

• Weight selection for the feature set
• Handling of continuously changing IDF
• How to find near-duplicates when data is present in different formats
• Inadequate results

Page 21

References

• G. Manku, A. Jain, A. Das Sarma. Detecting near-duplicates for web crawling. In Proc. WWW 2007, pages 141-150, 2007.
• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.
• J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.
• Articles from Wikipedia etc.

Page 22

Future Research

• Considering document size while detecting near-duplicates
• Pruning the space of existing fingerprints
• Categorizing web pages
• Removal of portions of web pages with ads and timestamps

Page 23

Q & A