challenges in large-scale web crawling
TRANSCRIPT
![Page 1: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/1.jpg)
introduction to WEB CRAWLING & extraction
by Nate Murray
Wednesday, September 14, 2011
![Page 2: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/2.jpg)
WHO AM I ?
![Page 3: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/3.jpg)
Nate Murray
AT&T Interactive (Yellowpages.com)
TB-scale data since 2009
Various crawlers since 2005
![Page 4: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/4.jpg)
what is WEB CRAWLING?
![Page 5: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/5.jpg)
definition: web crawler
a program that browses the web.
![Page 6: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/6.jpg)
definition: web extraction
transforming unstructured web data into structured data
![Page 7: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/7.jpg)
definition: web extraction
transforming semistructured web data into structured data
![Page 8: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/8.jpg)
motivation
![Page 9: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/9.jpg)
motivation: bookmark buddies
![Page 10: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/10.jpg)
motivation: bookmark buddies
URL, Title, Users
![Page 11: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/11.jpg)
motivation:
![Page 12: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/12.jpg)
motivation: business hours
![Page 13: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/13.jpg)
motivation: business hours

| Day  | Openness    | Openness    |
|------|-------------|-------------|
| Mon  | Closed      | Closed      |
| Tue  | 11:30-14:30 | 17:30-22:00 |
| Wed  | 11:30-14:30 | 17:30-22:00 |
| Thur | 11:30-14:30 | 17:30-22:00 |
| Fri  | 11:30-14:30 | 17:30-22:00 |
| Sat  | 12:00-14:30 | 17:00-22:00 |
| Sun  | -           | 17:00-21:00 |
![Page 14: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/14.jpg)
motivation:
![Page 15: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/15.jpg)
motivation: recommend videos
![Page 16: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/16.jpg)
motivation: recommend videos
Users
![Page 17: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/17.jpg)
motivation:
![Page 18: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/18.jpg)
motivation: vertical search
![Page 19: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/19.jpg)
motivation: vertical search
Image, SKU, Name, Price, Rating
![Page 20: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/20.jpg)
motivation:
![Page 21: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/21.jpg)
DESIRED PROPERTIES
![Page 22: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/22.jpg)
DESIRED PROPERTIES
SPEED
![Page 23: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/23.jpg)
CONSTRAINTS
![Page 24: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/24.jpg)
CONSTRAINTS
• Politeness: it’s easy to burden small servers
• Distributed (for any significant crawl)
• Linear Scalability: n machines = n*m pages-per-second
• Even partitioning: every machine should perform equal work
• Minimum overlap: crawl each page exactly once
![Page 35: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/35.jpg)
BASIC ALGORITHM
![Page 36: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/36.jpg)
Initialize:
    UrlsDone = {}
    UrlFrontier = {'google.com/index.html', ..}
Repeat:
    url = UrlFrontier.getNext()
    ip = DNSlookup(url.getHostname())
    html = DownloadPage(ip, url.getPath())
    UrlsDone.insert(url)
    newUrls = parseForLinks(html)
    For each newUrl:
        If not UrlsDone.contains(newUrl) then
            UrlFrontier.insert(newUrl)
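The basic algorithm can be sketched in Python roughly as follows. This is a minimal single-machine sketch, not the speaker's implementation: `crawl`, `parse_for_links`, `LinkParser` and the `fetch` callable are hypothetical names, and the DNS lookup and download steps are assumed to live inside `fetch`. One deliberate deviation from the pseudocode: a `queued` set also tracks URLs already sitting in the frontier, so the same link is not enqueued twice before it is crawled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse_for_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """fetch(url) -> html string; it stands in for DNSlookup + DownloadPage."""
    urls_done = set()
    frontier = deque(seeds)
    queued = set(seeds)  # everything ever put on the frontier
    while frontier and len(urls_done) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        urls_done.add(url)
        for new_url in parse_for_links(url, html):
            if new_url not in queued:
                queued.add(new_url)
                frontier.append(new_url)
    return urls_done
```

Passing `fetch` in as a parameter keeps the loop testable without touching the network; a real fetcher would add timeouts, error handling and the politeness rules discussed later in the deck.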
![Page 49: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/49.jpg)
architecture overview
[diagram: CRAWL PLANNER, URL QUEUE, FETCHER, INTERNET, STORAGE, connected by flows labeled "URLs" and "Web Data"]
![Page 50: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/50.jpg)
CHALLENGES
![Page 51: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/51.jpg)
challenges:
depends on your ambitions
![Page 52: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/52.jpg)
challenges:
Google’s Index Size:
1998 - 26 million
2005 - 8 billion
2008 - 1 trillion
http://www.nytimes.com/2005/08/15/technology/15search.html
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
![Page 53: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/53.jpg)
challenges:
small crawls are easy
< 10MM
![Page 55: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/55.jpg)
challenges:
large crawls are interesting
![Page 56: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/56.jpg)
challenges:
• DNS Lookup
• URLs Crawled
• Politeness
• URL Frontier
• Queueing URLs
• Extracting URLs
![Page 63: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/63.jpg)
challenges: DNS LOOKUP
![Page 65: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/65.jpg)
challenges: DNS LOOKUP
can easily be a bottleneck
![Page 66: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/66.jpg)
challenges: DNS LOOKUP
• consider running your own DNS servers
  • djbdns
  • PowerDNS
  • etc.
![Page 67: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/67.jpg)
challenges: DNS LOOKUP
• be aware of software limitations
  • gethostbyname / gethostbyaddr block the caller and are synchronized in many resolver libraries
  • same with many “default” DNS clients
![Page 68: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/68.jpg)
challenges: DNS LOOKUP
You’ll know when you need it
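One common way to keep blocking lookups off the critical path is to resolve each distinct hostname once and run the lookups on a thread pool. The sketch below assumes that approach; `resolve_all` is a hypothetical helper, and the `lookup` parameter exists only so the resolver can be swapped out (for a caching resolver, or a fake in tests). How much parallelism you actually get depends on the platform's resolver library.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_all(hostnames, lookup=socket.gethostbyname, workers=32):
    """Resolve each distinct hostname once, in parallel.

    Returns {hostname: ip}; hostnames that fail to resolve map to None.
    """
    unique = sorted(set(hostnames))  # the same host is never looked up twice

    def safe_lookup(host):
        try:
            return lookup(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(unique, pool.map(safe_lookup, unique)))
```

A crawler would typically keep the returned mapping as a cache in front of the fetcher; honoring DNS TTLs is left out here.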
![Page 69: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/69.jpg)
challenges: URLs CRAWLED
![Page 71: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/71.jpg)
challenges: URLs CRAWLED
1 machine, store in memory
NAPKIN CALCULATION
~50 bytes per URL
e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
+8 bytes for time-last-crawled as long
e.g. System.currentTimeMillis() -> 1314392455712
x 100 million
=~ 5.4 gigabytes (58 bytes x 100 million = 5.8e9 bytes, i.e. about 5.4 GiB)
![Page 77: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/77.jpg)
can we do better?
![Page 78: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/78.jpg)
BLOOM FILTERS
![Page 79: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/79.jpg)
BLOOM FILTERS
answers the question:
is this item in the set?
![Page 80: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/80.jpg)
BLOOM FILTERS
answers either:
• yes, probably
• definitely not
Have we crawled: http://www.xcombinator.com?
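To make the "yes, probably / definitely not" behavior concrete, here is a minimal Bloom filter sketch. It is illustrative only: the class name, the use of SHA-256, and the double-hashing trick for deriving the k bit positions are assumptions of this example, not something the slides specify.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "yes, probably"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage mirrors a set: `crawled.add(url)` after fetching, `url in crawled` before enqueueing. An added URL is always reported as present; a URL that was never added is occasionally reported as present too, which for a crawler means a page is skipped.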
![Page 85: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/85.jpg)
challenges: URLs CRAWLED
1 machine, bloom filter
NAPKIN CALCULATION
100 million URLs
1 in 100 million chance of false positive
=~ 457 megabytes
see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8
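The 457 megabyte figure can be reproduced with the standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper name `bloom_size` is made up for this sketch, and the last line repeats the earlier in-memory estimate (50 + 8 bytes per URL) for comparison.

```python
import math


def bloom_size(n, p):
    """Return (bits, hash_count) for n items at false-positive rate p."""
    bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    hashes = round(bits / n * math.log(2))
    return bits, hashes


bits, hashes = bloom_size(100_000_000, 1e-8)
bloom_mib = bits / 8 / 2**20                    # size of the bit array in MiB
exact_gib = (50 + 8) * 100_000_000 / 2**30      # URL + timestamp per entry, in GiB

print(f"bloom filter: {bloom_mib:.0f} MiB with {hashes} hash functions")
print(f"exact set:    {exact_gib:.1f} GiB")
```

Both slide figures are in binary units (MiB and GiB), which is why they come out slightly below the decimal byte counts.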
![Page 88: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/88.jpg)
BLOOM FILTER
![Page 89: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/89.jpg)
BLOOM FILTER
drawbacks
• probabilistic - occasional errors
• estimate # of items ahead of time
• can’t delete
solutions
• acceptable
• not hard, see Dynamic BFs
• pick granularity (days)
• cascade them
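One possible reading of "pick granularity (days), cascade them" is to keep one filter per day and check them in turn, so that "deleting" becomes dropping a whole day's filter once it is too old. The sketch below assumes that reading; `DailyFilterCascade` is a hypothetical name, and a plain `set` stands in for the per-day Bloom filter (any object with `add` and `in` support can be passed as `make_filter`).

```python
class DailyFilterCascade:
    def __init__(self, keep_days, make_filter=set):
        self.keep_days = keep_days
        self.make_filter = make_filter  # factory for one day's filter
        self.by_day = {}                # day number -> filter

    def _expire(self, today):
        # Drop every filter that is keep_days old or older.
        for day in [d for d in self.by_day if d <= today - self.keep_days]:
            del self.by_day[day]

    def add(self, url, day):
        if day not in self.by_day:
            self.by_day[day] = self.make_filter()
        self.by_day[day].add(url)
        self._expire(day)

    def seen_recently(self, url, today):
        self._expire(today)
        return any(url in day_filter for day_filter in self.by_day.values())
```

With Bloom filters plugged in, the false-positive rates of the individual filters add up across the cascade, so each one would need to be sized a little more conservatively.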
![Page 98: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/98.jpg)
references: BLOOM FILTERS
http://en.wikipedia.org/wiki/Bloom_filter
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
![Page 99: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/99.jpg)
challenges: POLITENESS
![Page 100: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/100.jpg)
obey robots.txt
![Page 101: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/101.jpg)
rule of thumb:
wait 2 seconds between requests to the same IP
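Both politeness rules can be sketched with the standard library. `allowed_by_robots` wraps `urllib.robotparser.RobotFileParser`, and `PerIpDelay` tracks the earliest time each IP may be contacted again. The helper names, the injectable `clock`, and the user agent string in the usage note are assumptions of this example; only the 2 second default comes from the slide.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PerIpDelay:
    def __init__(self, delay_seconds=2.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock
        self.next_allowed = {}  # ip -> earliest time of the next request

    def seconds_to_wait(self, ip):
        return max(0.0, self.next_allowed.get(ip, 0.0) - self.clock())

    def record_request(self, ip):
        self.next_allowed[ip] = self.clock() + self.delay
```

A fetcher would call `seconds_to_wait(ip)` before each request, sleep or requeue the URL if the answer is non-zero, and call `record_request(ip)` once the request is sent, for example after `allowed_by_robots(text, "examplebot", url)` returns true. This state lives on one machine, which is exactly the limitation the following slides address.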
![Page 102: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/102.jpg)
centralized politeness
• SPOF (single point of failure)
• contention
![Page 111: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/111.jpg)
challenges:
POLITENESS
Options:
• central database
• distributed locks (Paxos/sigma/ZooKeeper)
• controlled URL distribution
http://en.wikipedia.org/wiki/Paxos_(computer_science)
http://zookeeper.apache.org/
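The two politeness rules above (obey robots.txt, wait about 2 seconds per IP) can be sketched for a single machine as follows. This is a minimal illustration, not the speaker's implementation: `PolitenessGate` and `allowed_by_robots` are hypothetical names, and the sketch assumes every URL for a given IP reaches the same machine (the "controlled URL distribution" option), so no shared lock is needed.

```python
import time
import urllib.robotparser


class PolitenessGate:
    """Tracks, per IP, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds=2.0, clock=time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock          # injectable so the gate can be tested
        self.next_allowed = {}      # ip -> earliest allowed fetch time

    def try_acquire(self, ip):
        """Return True and reserve the slot if `ip` may be fetched now."""
        now = self.clock()
        if now < self.next_allowed.get(ip, float("-inf")):
            return False
        self.next_allowed[ip] = now + self.delay_seconds
        return True


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against an already-downloaded robots.txt body."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A fetch loop would call `allowed_by_robots` once per candidate URL and `try_acquire` just before each request, requeueing the URL when the gate returns False.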
![Page 112: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/112.jpg)
challenges:
URL FRONTIER
![Page 113: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/113.jpg)
url frontier
![Page 114: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/114.jpg)
idea:
consistently distribute URLs based on IP
![Page 115: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/115.jpg)
modulo

| IP | SHA-1 | bucket (mod 5) |
| --- | --- | --- |
| 174.132.225.106 | 4dd14b0b... | 2 |
| 74.125.224.115 | cf4b7594... | 1 |
| 157.166.255.19 | 0ac4d141... | 4 |
| 69.22.138.129 | 6c1584fa... | 4 |
| 98.139.50.166 | 327252c5... | 3 |
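A minimal sketch of the modulo scheme in the table. The slide does not say exactly what is hashed or how the digest is reduced, so this assumes the dotted-quad string is hashed with SHA-1 and the full hex digest is taken as an integer; the buckets it prints may therefore differ from the ones shown. `bucket_for_ip` is a hypothetical name.

```python
import hashlib


def bucket_for_ip(ip, num_machines):
    """Map an IP string to a machine index via SHA-1 modulo the machine count."""
    digest = hashlib.sha1(ip.encode("ascii")).hexdigest()
    return int(digest, 16) % num_machines


for ip in ["174.132.225.106", "74.125.224.115", "157.166.255.19"]:
    print(ip, bucket_for_ip(ip, 5))
```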
![Page 116: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/116.jpg)
benefits:
same IP always goes to same machine
simple
![Page 117: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/117.jpg)
drawbacks:
susceptible to skew
can’t add / remove nodes without pain
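To make the "pain" concrete: with modulo assignment, a key keeps its bucket when going from 5 to 6 machines only if its hash gives the same remainder for both, which for uniformly distributed hashes is about 1 key in 6. The sketch below measures this on synthetic IPs; the sample addresses and function names are illustrative assumptions.

```python
import hashlib


def bucket(key, n):
    """SHA-1 of the key, reduced modulo the number of machines."""
    return int(hashlib.sha1(key.encode("ascii")).hexdigest(), 16) % n


def fraction_moved(keys, old_n, new_n):
    """Share of keys whose bucket changes when the machine count changes."""
    moved = sum(1 for k in keys if bucket(k, old_n) != bucket(k, new_n))
    return moved / len(keys)


# Synthetic sample, not real crawl data.
sample_ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10000)]
print(fraction_moved(sample_ips, 5, 6))  # expected to be roughly 5/6
```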
![Page 118: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/118.jpg)
consistent hashing
![Page 119: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/119.jpg)
source: http://michaelnielsen.org/blog/consistent-hashing/
![Page 120: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/120.jpg)
![Page 121: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/121.jpg)
![Page 122: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/122.jpg)
![Page 123: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/123.jpg)
benefits:
only ~ 1/(n+1) of URLs move on add/remove
virtual nodes help with skew
robust (no SPOF)
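A minimal sketch of a consistent-hash ring with virtual nodes, to illustrate the benefits listed above. `HashRing`, the `crawler-N` node names, the `node#i` virtual-node labels and the default of 100 virtual nodes are all assumptions made for illustration; production rings (see the further reading on Chord and Dynamo) handle replication and membership changes as well.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted positions on the ring
        self._owner = {}    # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many small arcs of the ring, which evens out skew.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        """Return the node owning the first point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing([f"crawler-{i}" for i in range(5)])
print(ring.node_for("174.132.225.106"))
```

Adding a sixth node to this ring should move only about one key in six, and every moved key lands on the new node; the modulo scheme moves most keys instead.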
![Page 124: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/124.jpg)
drawbacks:
naive solution won’t work for large sites
![Page 125: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/125.jpg)
further reading:
Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al.
Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007
Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.
![Page 126: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/126.jpg)
challenges:
QUEUEING URLS
![Page 131: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/131.jpg)
situation:
URL
not recently crawled
allowed by robots.txt
polite
![Page 132: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/132.jpg)
how do you order them?
(within a single machine)
![Page 133: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/133.jpg)
http://yachtmaintenanceco.com/
http://www.amsterdamports.nl/
http://www.4s-dawn.com/
http://www.embassysuiteslittlerock.com/
http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
http://mdgroover.iweb.bsu.edu
http://music.imbc.com/
http://www.robertjbradshaw.com
http://www.kerkattenhoven.be
http://www.escolania.org/
http://www.musiciansdfw.org/
http://www.ariana.org/
hash each lane:
1 2 3
![Page 148: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/148.jpg)
![Page 149: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/149.jpg)
![Page 150: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/150.jpg)
![Page 151: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/151.jpg)
![Page 152: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/152.jpg)
![Page 153: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/153.jpg)
![Page 154: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/154.jpg)
![Page 155: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/155.jpg)
ERLANG
look up: Erlang B / C / Engset
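The slide only points at the Erlang formulas without applying them. As one hedged illustration, the Erlang B blocking probability can be computed with the standard recurrence B(E, 0) = 1, B(E, m) = E·B(E, m−1) / (m + E·B(E, m−1)). Treating fetch threads as the "servers" and the offered load as arrival rate times mean fetch time is an assumption about how the model maps onto a crawler, not something stated in the talk.

```python
def erlang_b(offered_load, servers):
    """Blocking probability for `servers` servers offered `offered_load` erlangs."""
    blocking = 1.0  # B(E, 0) = 1: with no servers every request is blocked
    for m in range(1, servers + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking


# Hypothetical numbers: 40 erlangs of fetch work offered to 50 threads.
print(erlang_b(40.0, 50))
```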
![Page 156: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/156.jpg)
as many threads as possible
![Page 157: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/157.jpg)
don’t sort input URLs
![Page 158: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/158.jpg)
http://abcnews.go.com/
http://abcnews.go.com/2020/ABCNEWSSpecial/
http://abcnews.go.com/2020/story?id=207269&page=1
http://abcnews.go.com/2020/story?id=207269&page=1
http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
http://abcnews.go.com/International/News/story?id=203089&page=1
http://abcnews.go.com/International/Pope/
http://abcnews.go.com/International/story?id=81417&page=1
fetch, wait, fetch, wait, fetch, wait
![Page 166: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/166.jpg)
http://yachtmaintenanceco.com/
http://www.amsterdamports.nl/
http://www.4s-dawn.com/
http://www.embassysuiteslittlerock.com/
http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
http://mdgroover.iweb.bsu.edu
http://music.imbc.com/
http://www.robertjbradshaw.com
http://www.kerkattenhoven.be
http://www.escolania.org/
http://www.musiciansdfw.org/
http://www.ariana.org/
no waiting!
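The contrast between the two lists (one host, so fetch and wait alternate; many hosts, so no waiting) suggests ordering URLs so that consecutive fetches hit different hosts. Below is a minimal round-robin sketch. It groups by hostname rather than by resolved IP to stay free of DNS lookups, which is a simplification of the per-IP rule, and `interleave_by_host` is a hypothetical name.

```python
from collections import deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts, keeping per-host order."""
    lanes = {}  # host -> queue of its URLs, in first-seen host order
    for url in urls:
        host = urlsplit(url).hostname or ""
        lanes.setdefault(host, deque()).append(url)

    ordered = []
    while lanes:
        for host in list(lanes):
            ordered.append(lanes[host].popleft())
            if not lanes[host]:
                del lanes[host]
    return ordered
```

Interleaving only reduces idle time; once a single host dominates what is left, the per-IP delay still has to be enforced before each fetch.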
![Page 169: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/169.jpg)
challenges:
EXTRACTING URLS
![Page 170: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/170.jpg)
challenges:
EXTRACTING URLS
the internet is full of garbage
![Page 176: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/176.jpg)
challenges:
EXTRACTING URLS
enormous pages
terrible markup
ridiculous urls
☃.net/
“unicode snowman dot net”
![Page 181: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/181.jpg)
challenges:
EXTRACTING URLS
be prepared:
use a streaming XML parser
use a library that handles bad markup
be aware that URLs aren’t always ASCII
use a URL normalizer
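A minimal sketch of the four recommendations using only the Python standard library. `html.parser.HTMLParser` is event-driven, can be fed in chunks, and tolerates malformed markup, so it stands in here for "a streaming parser that handles bad markup"; the talk does not name a specific library. `LinkExtractor` and `normalize` are hypothetical names, and the normalization rules (lowercase host, IDNA-encode non-ASCII hosts, percent-encode the path and query, drop the fragment) are a simplified assumption about what a URL normalizer should do.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags as the document streams through."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(base_url, href):
    """Resolve a link against its page; return an ASCII-only URL or None."""
    parts = urlsplit(urljoin(base_url, href.strip()))
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    try:
        host = parts.hostname.encode("idna").decode("ascii")
        port = parts.port
    except (UnicodeError, ValueError):
        return None  # unencodable host or invalid port
    if port:
        host = f"{host}:{port}"
    # Keep existing percent-escapes; encode spaces and non-ASCII as UTF-8.
    path = quote(parts.path or "/", safe="/%:@!$&'()*+,;=~-._")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?~-._")
    return urlunsplit((parts.scheme, host, path, query, ""))


extractor = LinkExtractor()
extractor.feed('<body><a href="/a b">x<p><A HREF="http://☃.net/">snowman</a>')
for href in extractor.links:
    print(normalize("http://Example.com/dir/page.html", href))
```

For enormous pages, `feed` can be called repeatedly with chunks read from the network, and a crawler may simply stop feeding after a size limit.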
![Page 182: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/182.jpg)
SOFTWARE
![Page 185: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/185.jpg)
software advice:
• goals determine scale
• someone else has already done it
![Page 186: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/186.jpg)
2 second crawler:
function wgetspider() {
  wget --html-extension --convert-links --mirror \
    --page-requisites --progress=bar --level=5 \
    --no-parent --no-verbose \
    --no-check-certificate "$@"
}
$ wgetspider http://www.ischool.berkeley.edu/
![Page 191: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/191.jpg)
java crawlers:
• Heritrix (Internet Archive)
• Nutch (Lucene)
• Bixo (Hadoop / Cascading)
http://crawler.archive.org/
http://nutch.apache.org/
http://bixo.101tec.com/
![Page 196: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/196.jpg)
extraction packages:
• mechanize
• BeautifulSoup & urllib2
• Scrapy
http://wwwsearch.sourceforge.net/mechanize/
http://www.crummy.com/software/BeautifulSoup/
http://scrapy.org/
![Page 202: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/202.jpg)
wrapper induction(ish)
• Ariel
• RoadRunner
• TemplateMaker
• scrubyt
http://ariel.rubyforge.org/index.html
http://www.dia.uniroma3.it/db/roadRunner/
http://code.google.com/p/templatemaker/
http://scrubyt.rubyforge.org/files/README.html
![Page 203: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/203.jpg)
QUESTIONS?
![Page 204: Challenges in Large-Scale Web Crawling](https://reader033.vdocuments.site/reader033/viewer/2022042814/5553b2e4b4c905d9448b4bd8/html5/thumbnails/204.jpg)
FEEDBACK:
@xcombinator