Current Challenges in Web Crawling
ICWE 2013 Tutorial
Denis Shestakov, Department of Media Technology, School of Science, Aalto University, Finland
firstname.lastname@aalto.fi
Version 1.4: 08.07.2013


DESCRIPTION

Tutorial given at ICWE'13, Aalborg, Denmark, on 08.07.2013. Abstract: Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research. To cite this tutorial, please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49

TRANSCRIPT

Page 1: Current challenges in web crawling

Current Challenges in Web Crawling
ICWE 2013 Tutorial
Denis Shestakov
Department of Media Technology, School of Science, Aalto University, Finland
firstname.lastname@aalto.fi

Version 1.4: 08.07.2013

Page 2: Current challenges in web crawling


Speaker’s Bio

- Postdoc in Web Services Group, Aalto University, Finland
- PhD dissertation on the limited coverage of web crawlers
- Over ten years of experience in the area

Page 3: Current challenges in web crawling


Speaker’s Bio

- http://www.linkedin.com/in/dshestakov
- http://www.mendeley.com/profiles/denis-shestakov/
- http://www.tml.tkk.fi/~denis/

Page 4: Current challenges in web crawling


Tutorial Outline

OVERVIEW
- Web crawling in a nutshell
- Web structure & statistics
- Large-scale crawling

Coffee Break

CHALLENGES
- Collaborative web crawling
- Crawling the deep Web
- Crawling multimedia content
- Future directions

Page 5: Current challenges in web crawling


PART I: OVERVIEW

Visualization of http://media.tkk.fi/webservices by the aharef.info applet

Page 6: Current challenges in web crawling


Outline of Part I

Overview of Web Crawling
- Web Crawling in a Nutshell
  - Applications
  - Industry vs. Academia
  - Web Ecosystem and Crawling
- Web Structure & Statistics
- Large-scale Crawling
  - Basic architecture
  - Implementations
  - Design issues and considerations

Page 7: Current challenges in web crawling


Web Crawling in a Nutshell

- Automatic harvesting of web content
- Done by web crawlers (also known as robots, bots, or spiders)
- Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate the already visited ones, and add the rest to the queue
- Then repeat
- A set of policies is involved (like 'ignore links to images', etc.)
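
This loop fits in a few lines of code. A minimal sketch in Python (fetch_page and extract_links are assumed helper functions standing in for an HTTP client and an HTML link parser):

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
        # Minimal crawl loop: fetch, extract links, skip visited, repeat
        frontier = deque(seed_urls)          # the URL queue
        visited = set(seed_urls)
        pages = []
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            html = fetch_page(url)           # download the page
            if html is None:
                continue                     # fetch failed; move on
            pages.append((url, html))
            for link in extract_links(url, html):
                if link not in visited:      # eliminate already-visited URLs
                    visited.add(link)
                    frontier.append(link)
        return pages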

Page 8: Current challenges in web crawling


Web Crawling in a Nutshell

Example:

1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract the URLs inside the blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start again at Step 1

Page 9: Current challenges in web crawling


Web Crawling in a Nutshell

- In essence: a simple and naive process
- However, a number of imposed 'restrictions' make it much more complicated
- Most complexities are due to the operating environment (the Web)
- For example, do not overload web servers (challenging, as the distribution of web pages across web servers is non-uniform)
- Or avoiding web spam (not only useless but consumes resources and often spoils the collected content)

Page 10: Current challenges in web crawling


Web Crawling in a Nutshell

Crawler Agents

- First in 1993: the Wanderer (written in Perl)
- Over 1100 different crawler signatures (User-Agent strings in the HTTP request header) mentioned at http://www.crawltrack.net/crawlerlist.php
- Educated guess on the overall number of different crawlers: at least several thousand
- Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
- Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)

Page 11: Current challenges in web crawling


Web Crawling in a Nutshell

Crawler Agents

I For advanced things, you may modify the code of existingprojects for programming language preferred

I Crawlers play a big role on the WebI Bring more traffic to certain web sites than human visitorsI Generate sizeable portion of traffic to any (public) web siteI Crawler traffic important for emerging web sites

Page 12: Current challenges in web crawling


Web Crawling in a Nutshell

Classification
- General/universal crawlers
  - Not so many of them; lots of resources required
  - Big web search engines
- Topical/focused crawlers
  - Pages/sites on a certain topic
  - Though crawling everything in one specific (e.g., national) web segment is rather general
- Batch crawling
  - One or several (static) snapshots
- Incremental/continuous crawling
  - Re-visiting
  - Resources divided between fetching newly discovered pages and re-downloading previously crawled pages
  - Search engines

Page 13: Current challenges in web crawling


Applications of Web Crawling

Web Search Engines
- Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
- One of three underlying technology stacks

Page 14: Current challenges in web crawling


Applications of Web Crawling

Web Search Engines
- One of three underlying technology stacks
- BTW, what are the other two, and which is the most 'crucial'?

Page 15: Current challenges in web crawling


Applications of Web Crawling

Web Search Engines

- What are the other two, and which is the most 'crucial'? The query processor (particularly, ranking)

Page 16: Current challenges in web crawling


Applications of Web Crawling

Web Archiving

- Digital preservation
- A 'librarian' view of the Web
- The biggest: Internet Archive
- Quite huge collections
- Batch crawls
- Primarily, collections of national web sites: web sites at country-specific TLDs or physically hosted in a country
- There are quite many, and some are huge! See the list of Web Archiving Initiatives on Wikipedia

Page 17: Current challenges in web crawling


Applications of Web Crawling

Vertical Search Engines

I Data aggregating from many sources on certain topicI E.g., apartment search, car search

Page 18: Current challenges in web crawling


Applications of Web Crawling

Web Data Mining

I “To get data to be actually mined”I Usually using focused crawlersI For example, opinion miningI Or digests of current happenings on the Web (e.g., what

music people listen now)

Page 19: Current challenges in web crawling


Applications of Web Crawling

Web Monitoring

- Monitoring sites/pages for changes and updates

Page 20: Current challenges in web crawling


Applications of Web Crawling

Detection of malicious web sites
- Typically part of an anti-virus, firewall, or search engine service
- Building a list of such web sites and informing users about the potential threat of visiting them

Page 21: Current challenges in web crawling


Applications of Web Crawling

Web site/application testing

- Crawl a web site to check navigation through it, the validity of its links, etc.
- Regression/security/... testing of a rich internet application (RIA) via crawling
- Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)

Page 22: Current challenges in web crawling


Applications of Web Crawling

Fighting crime! :) Well, copyright violations
- Crawl to find (media) items under copyright, or links to them
- Regularly re-visit 'suspicious' web sites, forums, etc.
- Tasks like finding terrorist chat rooms also go here

Page 23: Current challenges in web crawling


Applications of Web Crawling

Web Scraping

- Extracting particular pieces of information from a group of typically similar pages
- Used when an API to the data is not available
- Interestingly, scraping may be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API

Page 24: Current challenges in web crawling


Applications of Web Crawling

Web Mirroring

- Copying web sites
- Often hosting copies on different servers to ensure constant accessibility

Page 25: Current challenges in web crawling


Industry vs. Academia

In the web crawling domain
- Huge lag between industrial and academic web crawlers
  - Research-wise and development-wise
  - The algorithms, techniques, and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
- Industrial crawlers operate at web scale (= dozens of billions of pages)
  - Only a few (three?) academic crawlers have dealt with more than one billion pages
  - Academic scale is rather hundreds of millions

Page 26: Current challenges in web crawling


Industry vs. Academia

- Re-crawling
  - Batch crawls in academia
  - Regular re-crawls by industrial crawlers
- Evaluation of crawled data
  - And hence corrections/improvements to crawlers
  - Direct evaluation by users of search engines
  - To some extent, artificial evaluation of academic crawls

Page 27: Current challenges in web crawling


Industry vs. Academia

- Industrial (search engines') crawlers are much more appreciated
  - Eventually they attract visitors (= revenue/prestige/influence/...)
  - It makes perfect sense to trick them
- Academic crawlers just consume resources (e.g., network bandwidth)
  - They don't bring anything in return
  - There is no point in tricking them (assuming the site administrator bothers to differentiate them from search engines' bots)

Page 28: Current challenges in web crawling


Web Ecosystem and Crawling

Pull vs. Push model
- Web content providers (site owners)
- Web aggregators (crawler operators)
- The aggregator pulls content
- Content is not pushed to aggregators

Page 29: Current challenges in web crawling


Web Ecosystem and Crawling

Why not Push?

- Pull is just easier for both parties
- No 'agreement' needed between provider and aggregator
- No specific protocols required for content providers: serving content is enough
- Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed

Page 30: Current challenges in web crawling


Web Ecosystem and Crawling

Why not Push?

- Still, the pull model has several disadvantages
- What are these?

Page 31: Current challenges in web crawling


Web Ecosystem and Crawling

Why not Push?

- Still, the pull model has several disadvantages
- A push model would avoid redundant requests from crawlers and give providers more control over their content

Page 32: Current challenges in web crawling


Web Ecosystem and Crawling

Crawler politeness

- Content providers possess some control over crawlers
- Via special protocols defining access to parts of a site
- Via direct banning of agents hitting a site too often

Page 33: Current challenges in web crawling


Web Ecosystem and Crawling

Crawler politeness

- robots.txt says what can(not) be crawled
- Sitemaps is a newer protocol specifying access restrictions and other info
- In the example below, no agent may visit any URL starting with "yoursite/notcrawldir", except the agent called "goodsearcher"

Example:
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
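
Python's standard library can interpret such files directly; a small sketch checking the example above (the agent name "somebot" and the site URL are illustrative):

    import urllib.robotparser

    lines = [
        "User-agent: *",
        "Disallow: /notcrawldir",
        "",
        "User-agent: goodsearcher",
        "Disallow:",
    ]
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(lines)

    # Everyone except 'goodsearcher' must stay out of /notcrawldir:
    print(rp.can_fetch("somebot", "http://yoursite.example/notcrawldir/p.html"))       # False
    print(rp.can_fetch("goodsearcher", "http://yoursite.example/notcrawldir/p.html"))  # True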

Page 34: Current challenges in web crawling


Web Structure & Statistics

Some numbers
- The number of pages per host is not uniform: most hosts contain only a few pages, while others contain millions
- Roughly 100 links on a page
- According to Google statistics (over 4 billion pages, 2010): fetching a page transfers 320KB on average (textual content plus all embedded resources)
- A page has 10-100KB of textual (HTML) content on average
- One trillion URLs known by Google/Yahoo in 2008

Page 35: Current challenges in web crawling


Web Structure & Statistics

Some numbers
- 20 million web pages in 1995 (indexed by AltaVista)
- One trillion (10^12) URLs known by Google/Yahoo in 2008
  - The 'independent' search engine Majestic12 (P2P crawling) confirms one billion items
- This doesn't mean one trillion indexed pages
- Supposedly, the index has dozens of times fewer pages
- Cool crawler fact: the IRLbot crawler (running on a single server) downloaded 6.4 billion pages over 2 months
  - Throughput: 1000-1500 pages per second
  - Over 30 billion discovered URLs

Page 36: Current challenges in web crawling


Web Structure & Statistics

Bow-tie model of the Web

Illustration taken from http://dx.doi.org/doi:10.1038/35012155

Page 37: Current challenges in web crawling


Basic Crawler Architecture

Crawler crawls the Web

Illustration taken from CMSC 476/676 course slides by Charles Nicholas

Page 38: Current challenges in web crawling


Basic Crawler Architecture

Typically in a distributed fashion

Illustration taken from CMSC 476/676 course slides by Charles Nicholas

Page 39: Current challenges in web crawling


Basic Crawler Architecture

URL Frontier
- Includes multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy
- Prioritization also helps

Page 40: Current challenges in web crawling


Basic Crawler Architecture

Crawler Architecture

Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.

Page 41: Current challenges in web crawling


Basic Crawler Architecture

DNS
- Given a URL (host name), retrieve its IP address
- A distributed service: lookup latencies can be high (seconds)
- A critical component
- Common implementations of DNS lookup (e.g., nslookup) are synchronous: one request at a time
- Remedies: asynchronous DNS resolving, pre-caching, batch DNS resolving
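
One way to hide lookup latency, sketched here with worker threads and a shared cache rather than a true asynchronous resolver (host names are illustrative):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}  # pre-cache: host -> IP, filled as the crawl proceeds

    def resolve(host):
        # Resolve one host, consulting the cache first
        if host not in dns_cache:
            dns_cache[host] = socket.gethostbyname(host)
        return host, dns_cache[host]

    hosts = ["example.com", "example.org", "example.net"]
    # Overlap lookups in worker threads instead of resolving one at a time
    with ThreadPoolExecutor(max_workers=10) as pool:
        for host, ip in pool.map(resolve, hosts):
            print(host, ip)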

Page 42: Current challenges in web crawling


Basic Crawler Architecture

Content seen?
- If the fetched page is already in the base/index, don't process it
- Document fingerprints (shingles)
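
A rough sketch of shingle-based fingerprints (here word 5-grams hashed into a set; two pages whose sets overlap heavily are near-duplicates):

    import hashlib

    def shingles(text, k=5):
        # k-word shingles of a document, each reduced to a short hash
        words = text.split()
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16]
            for i in range(max(len(words) - k + 1, 1))
        }

    def resemblance(a, b):
        # Jaccard overlap of two shingle sets; close to 1.0 = near-duplicate
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / max(len(sa | sb), 1)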

Filtering
- Filter out URLs due to 'politeness' and restrictions on the crawl
- Fetched robots.txt files are cached to avoid retrieving them repeatedly

Duplicate URL Elimination
- Check whether an extracted and filtered URL has already been passed to the frontier (batch crawling)
- More complicated in continuous crawling (requires a different URL frontier implementation)

Page 43: Current challenges in web crawling


Basic Crawler Architecture

Distributed Crawling

- Run multiple crawl threads, under different processes (often at different nodes)
- Nodes can be geographically distributed
- Partition the hosts being crawled among the nodes

Page 44: Current challenges in web crawling


Basic Crawler Architecture

Host Splitter

Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.

Page 45: Current challenges in web crawling


Implementations

- Popular languages: Perl, Java, Python, C/C++
- Libraries available for HTTP fetching, HTML parsing, and asynchronous DNS resolving
- Open-source, in Java: Heritrix, Nutch

Page 46: Current challenges in web crawling


Implementations

Simple code example in Perl
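
The Perl snippet shown on this slide is not preserved in the transcript. A rough equivalent in Python, using only the standard library (the seed URL and the 50-page stop condition are illustrative):

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    seeds = ["http://media.tkk.fi/webservices"]     # the deck's example site
    frontier, visited = deque(seeds), set(seeds)
    link_re = re.compile(r'href="([^"#]+)"', re.I)  # crude <a href> extraction

    while frontier and len(visited) < 50:           # stop condition
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", "replace")
        except Exception:
            continue                                # skip fetch errors
        print(url)
        for href in link_re.findall(html):
            link = urljoin(url, href)               # resolve relative links
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)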

Page 47: Current challenges in web crawling


Large-scale Crawling

Objectives
- High web coverage
- High page freshness
- High content quality
- High download rate

Internal (I) and External (E) factors
- Amount of hardware (I)
- Network bandwidth (I)
- Rate of web growth (E)
- Rate of web change (E)
- Amount of malicious content (i.e., spam, duplicates) (E)

Page 48: Current challenges in web crawling


Large-scale Crawling

- Architecture of a sequential crawler
- Seeds: the list of starting URLs
- Order of page visits determined by the frontier data structure
- Stop condition (e.g., X pages fetched)

Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu's Web Data Mining (Springer, 2007)

Page 49: Current challenges in web crawling


Large-scale Crawling

Graph Traversal
- Breadth-first search
  - Implemented with a QUEUE (FIFO)
  - Finds pages with the shortest paths from the seeds
- Depth-first search
  - Implemented with a STACK (LIFO)
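
The only difference between the two traversals is which end of the frontier the next URL is taken from; a sketch (URLs are illustrative):

    from collections import deque

    frontier = deque(["http://seed.example/a", "http://seed.example/b"])

    bfs_next = frontier.popleft()   # breadth-first: FIFO, take from the front
    dfs_next = frontier.pop()       # depth-first: LIFO, take from the back

    # In both cases, newly extracted links are appended to the same end:
    frontier.append("http://seed.example/c")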

Page 50: Current challenges in web crawling


Large-scale Crawling

Some implementation notes
- Get only the first part of each page (10-100KB)
- Detect redirection loops
- Handle all possible errors (e.g., server not responding), timeouts, etc.
- Deal with lots of invalid HTML
- Take care with dynamic pages
  - Some are 'spider traps' (think of the Next month link on a calendar)
  - E.g., limit the number of pages per host

Page 51: Current challenges in web crawling


Large-scale Crawling

Delays in crawling
- Resolving the host name to an IP address
- Connecting a socket to the server and sending the request
- Receiving the requested page in response
- Overlap these delays by fetching many pages concurrently (see the sketch below)
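
A sketch of overlapping those delays with a thread pool (URLs are illustrative); each worker sits in its DNS/connect/receive waits independently, so the delays overlap instead of adding up:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        # Fetch one URL; return (url, body) or (url, None) on error
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return url, resp.read()
        except Exception:
            return url, None

    urls = ["http://example.com/", "http://example.org/", "http://example.net/"]
    with ThreadPoolExecutor(max_workers=20) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, "failed" if body is None else f"{len(body)} bytes")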

Page 52: Current challenges in web crawling


Large-scale Crawling

Architecture of a concurrent crawler

Illustration taken from Ch. 8 "Web Crawling" by Filippo Menczer in Bing Liu's Web Data Mining (Springer, 2007)

Page 53: Current challenges in web crawling


Large-scale Crawling

Design points: frontier data structure

- Most links on a page refer to the same site/server
  - Note: remember virtual hosting
- Problem with a single FIFO queue: too many requests to the same server
- A common policy is to delay the next request by, say, 10x the time it took to download the last page from that server
- The 'Mercator' scheme: add additional queues behind the frontier queue (a sketch follows)
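
A rough, much-simplified sketch of such a scheme: one FIFO per host plus a heap of next-allowed-fetch times (the fixed last_download_secs parameter stands in for the measured download time):

    import time
    import heapq
    from collections import defaultdict, deque

    host_queues = defaultdict(deque)   # one 'back' FIFO queue per host
    ready_heap = []                    # entries: (next_allowed_time, host)

    def enqueue(host, url):
        # Add a URL to its host queue; schedule the host if it was idle
        if not host_queues[host]:
            heapq.heappush(ready_heap, (time.time(), host))
        host_queues[host].append(url)

    def next_url(delay_factor=10.0, last_download_secs=0.5):
        # Pop the next politely fetchable URL, then reschedule its host
        # with a delay proportional to the last download time
        next_time, host = heapq.heappop(ready_heap)
        time.sleep(max(0.0, next_time - time.time()))
        url = host_queues[host].popleft()
        if host_queues[host]:
            heapq.heappush(
                ready_heap,
                (time.time() + delay_factor * last_download_secs, host))
        return url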

Page 54: Current challenges in web crawling


Large-scale Crawling

Design points: URL seen test

I To not add multiple instances of URL to the frontierI For batch crawling, two operations required: insertion and

membership testingI For continuous crawling, one more operation: deletionI URLs compressed (e.g., 10-byte hash value)I In-memory implementations: hash table, Bloom filterI Search engines keep all URLs in-memory in the crawling

cluster (hash table partitioned across nodes; partitioningcan be based on host part of URL)
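
A sketch of the in-memory variant, storing 10-byte digests rather than full URLs:

    import hashlib

    seen = set()  # in-memory hash-table variant of the URL-seen test

    def url_seen(url):
        # True if url was seen before; records it otherwise.
        # A 10-byte digest is kept instead of the full URL to save memory.
        h = hashlib.sha1(url.encode()).digest()[:10]
        if h in seen:
            return True
        seen.add(h)
        return False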

Page 55: Current challenges in web crawling


Large-scale Crawling

Design points: URL seen test

I If in-memory not possible, disk-based hash table used withcaching

I Limits crawling rate to tens of pages per second – disklookups are slow

I To scale, sequential read/writes are faster and thus usedI ’Mercator/IRLbot’ scheme: combining (reading-writing)

sorted URL (visited) hashes on disk with hashes of ’justextracted’ URLs

I Delay due to batch merging manageable
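
A toy, in-memory rendering of one merge pass (Python lists stand in for the sorted on-disk hash files):

    def merge_batch(visited_sorted, new_hashes):
        # Sequentially merge the sorted list of visited-URL hashes with a
        # sorted batch of just-extracted hashes; return the updated list
        # plus the hashes that were genuinely new (still to be crawled)
        batch = sorted(set(new_hashes))
        merged, unseen = [], []
        i = j = 0
        while i < len(visited_sorted) or j < len(batch):
            if j == len(batch):
                merged.append(visited_sorted[i])
                i += 1
            elif i == len(visited_sorted):
                merged.append(batch[j])
                unseen.append(batch[j])
                j += 1
            elif visited_sorted[i] < batch[j]:
                merged.append(visited_sorted[i])
                i += 1
            elif visited_sorted[i] > batch[j]:
                merged.append(batch[j])
                unseen.append(batch[j])
                j += 1
            else:                        # equal: URL already visited
                merged.append(visited_sorted[i])
                i += 1
                j += 1
        return merged, unseen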

Page 56: Current challenges in web crawling


PART II: CHALLENGES

Page 57: Current challenges in web crawling


Outline of Part II

Challenges in Web Crawling
- Collaborative Crawling
- Deep Web Crawling
  - Crawling content behind search forms
  - Crawling JavaScript-rich web sites
- Crawling Multimedia
- Other Challenges in Crawling
- Future Directions
- References

Page 58: Current challenges in web crawling


Collaborative Crawling

Main considerations
- Lots of redundant crawling
- To get data (often on a specific topic), one needs to crawl broadly
  - Often a lack of expertise when a large crawl is required
  - Often, a lot is crawled but only a small subset is used
- Too many redundant requests for content providers
- Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
  - Parties specify filters to select the pages they need
- The crawler as a common service

Page 59: Current challenges in web crawling


Collaborative Crawling

Some requirements
- A filter language for specifying conditions
- Efficient filter processing (millions of filters to process)
- Efficient fetching (hundreds of pages per second)
- Support for real-time requests
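
The exact filter language from the paper behind these slides is not reproduced here; as a hypothetical illustration, keyword filters can be inverted so that matching a fetched document costs time proportional to its terms, not to the number of registered filters:

    from collections import defaultdict

    filter_index = defaultdict(set)     # term -> ids of subscribing parties

    def register(subscriber, terms):
        # A party registers the keywords its filter requires
        for term in terms:
            filter_index[term.lower()].add(subscriber)

    def match(document_text):
        # Stream one fetched document against the whole filter index
        hits = set()
        for term in set(document_text.lower().split()):
            hits |= filter_index.get(term, set())
        return hits

    register("crawl-group", ["frontier", "crawling"])
    print(match("Notes on the URL frontier of a crawler"))  # {'crawl-group'}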

Page 60: Current challenges in web crawling


Collaborative Crawling

New component

- Processes a stream of documents against a filter index

Page 61: Current challenges in web crawling


Collaborative Crawling

Filter processing architecture

Page 62: Current challenges in web crawling


Collaborative Crawling

Filter processing architecture

Page 63: Current challenges in web crawling


Collaborative Crawling

- Based on 'The Architecture and Implementation of an Extensible Web Crawler' by Hsieh, Gribble, and Levy, 2010 (illustrations on slides 61-62 are from Hsieh's slides)
- E.g., 80legs provides similar crawling services
- In a way, this reconsiders the pull/push model of content delivery on the Web

Page 64: Current challenges in web crawling


Deep Web Crawling

Visualization of http://amazon.com by aharef.info applet

Page 65: Current challenges in web crawling


Deep Web Crawling

In a nutshell
- The problem lies in the yellow nodes (designating web form elements)

Page 66: Current challenges in web crawling


Deep Web Crawling

See slides on deep Web crawling at http://goo.gl/Oohoo

Page 67: Current challenges in web crawling


Crawling Multimedia Content

- The Web is now a multimedia platform
- Images, video, and audio are an integral part of web pages (not just supplementing them)
- Almost all crawlers, however, treat the Web as a textual repository
- One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications
- Hence, no real need to harvest multimedia
- But state-of-the-art multimedia retrieval and computer vision techniques already provide adequate search quality
- E.g., searching for images with a cat and a man based on actual image content (not the text around/close to the image)
- In the case of video: a set of frames plus audio (which can be converted to textual form)

Page 68: Current challenges in web crawling


Crawling Multimedia Content

Challenges in crawling multimedia
- Bigger load on web sites, since the files are bigger
- More apparent copyright issues
- More resources (e.g., bandwidth, storage space) required from a crawler
- More complicated duplicate resolution
- Re-visiting policy

Page 69: Current challenges in web crawling


Crawling Multimedia Content

Approaches
- Utilize metadata (fetch and analyse a small metadata file to decide on a full download)
- Intelligent crawling: better ranking of URLs in the frontier (based on the specified domain of the crawl)
- Move from the pull to the push model
- API-directed crawling
  - Access to data via predefined APIs
  - Requires annotation/discovery of such APIs
- Technically: use an additional component for the multimedia crawl
  - With its own URL queue
  - The main crawler component provides it with URLs to multimedia
  - In return, it sends feedback to the main crawler to better score links in the frontier

Page 70: Current challenges in web crawling


Crawling Multimedia Content

- Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu)
- Focus on web archiving issues
- Uses several crawlers
  - A 'standard' crawler for regular web pages
  - An API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube)
  - A deep Web crawler able to extract information from pre-defined web sites
- Data can be exported as WARC (Web ARChive) files and as RDF

Page 71: Current challenges in web crawling


Other Crawling Challenges

Ordering policy
- Resources are limited, while the number of pages to visit is essentially infinite
- The decision has to be made based on the URL itself
- PageRank-like metrics can be used
- More complicated in the case of incremental crawls

Focused crawling
- Avoid links leading to content outside the topic of interest
- The content of a page can be taken into account when deciding where a particular link leads
- Setting a good seed is a challenge

Page 72: Current challenges in web crawling


Other Crawling Challenges

Re-visiting policy

Generating good seed URLs

Avoiding redundant content

- Avoid visiting duplicate pages (different URLs leading to identical or near-identical content)
  - Near-duplicates can be very tricky (think of a news item propagating across the Web)
- Avoid crawler traps
- Avoid useless content (i.e., web spam)

Page 73: Current challenges in web crawling


Future Directions

- Collaborative crawling, mixed pull-push model
- Understanding site structure
- Deep Web crawling
- Media content crawling
- Social network crawling

Page 74: Current challenges in web crawling


References: Crawl Datasets

Use these for building your crawls, web graph analysis, web data mining tasks, etc.

ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for the disks)

ClueWeb12:
- Almost 900 million English web pages

Page 75: Current challenges in web crawling


References: Crawl Datasets

Use these for building your crawls, web graph analysis, web data mining tasks, etc.

Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as an Amazon Web Services public dataset (pay for processing)

Page 76: Current challenges in web crawling


References: Crawl Datasets

Use these for building your crawls, web graph analysis, web data mining tasks, etc.

Internet Archive:
- See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/
- Crawl of 2011
- 80TB of WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request

Page 77: Current challenges in web crawling


References: Crawl Datasets

LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- A variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications

ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed

Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities' web sites

Page 78: Current challenges in web crawling


References: Literature

- For beginners: the Udacity CS101 course; http://www.udacity.com/overview/Course/cs101
- Intermediate: Chapter 20 of Introduction to Information Retrieval by Manning, Raghavan, and Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
- Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017

Page 79: Current challenges in web crawling


References: Literature

- See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/
- Feel free to join the group!
- Check the 'Deep Web' group too: http://www.mendeley.com/groups/601801/deep-web/