Information Retrieval (9) Prof. Dragomir R. Radev [email protected]


Page 1: Information Retrieval (9) Prof. Dragomir R. Radev

Information Retrieval (9)

Prof. Dragomir R. Radev
[email protected]

Page 2:

IR Winter 2010

…14. Webometrics

The Bow-tie model…

Page 3:

Brief history of the Web

• FTP/Gopher
• WWW (1989)
• Archie (1990)
• Mosaic (1993)
• Webcrawler (1994)
• Lycos (1994)
• Yahoo! (1994)
• Google (1998)

Page 4:

Size

• The Web is the largest repository of data, and it grows exponentially.
– 320 million Web pages [Lawrence & Giles 1998]
– 800 million Web pages, 15 TB [Lawrence & Giles 1999]
– 20 billion Web pages indexed [now]
• Amount of data
– roughly 200 TB [Lyman et al. 2003]

Page 5:

Zipfian properties

• In-degree
• Out-degree
• Visits to a page
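All three of these quantities follow power-law (Zipfian) distributions. A minimal sketch of what Zipf's law predicts for rank–frequency ratios; the exponent s = 1 (classic Zipf) is an illustrative assumption:

```python
# Zipf's law: the r-th most common value has relative frequency ~ C / r^s.

def zipf_freq(rank, s=1.0, c=1.0):
    """Relative frequency predicted for a given rank."""
    return c / rank ** s

# With s = 1, rank 1 is twice as frequent as rank 2
# and ten times as frequent as rank 10.
print(zipf_freq(1) / zipf_freq(2))   # 2.0
print(zipf_freq(1) / zipf_freq(10))  # 10.0
```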

Page 6:

Bow-tie model of the Web

[Figure: bow-tie structure of the Web — SCC: 56M pages, IN: 44M, OUT: 44M, TENDRILS: 44M, DISCONNECTED: 17M; about 24% of pages are reachable from a given page.]

Broder et al. WWW 2000, Dill et al. VLDB 2001
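The 24% figure can be sanity-checked from the component sizes in the figure: a path u → v exists roughly when u lies in IN or SCC and v lies in SCC or OUT. A back-of-the-envelope sketch, treating the rounded component sizes as exact:

```python
# Bow-tie component sizes from the slide, in millions of pages.
scc, inn, out, tend, disc = 56, 44, 44, 44, 17
n = scc + inn + out + tend + disc  # ~205M pages in the crawl

# A random pair (u, v) is connected u -> v (roughly) when u is in
# IN or SCC and v is in SCC or OUT; tendrils and disconnected
# components contribute almost nothing.
p_reachable = (inn + scc) * (scc + out) / n**2
print(round(p_reachable, 2))  # 0.24
```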

Page 7:

Measuring the size of the web

• Using extrapolation methods
• Random queries and their coverage by different search engines
• Overlap between search engines
• HTTP requests to random IP addresses

Page 8:

Bharat and Broder 1998

• Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them

Page 9:

Example (from Bharat & Broder)

A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
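The overlap method is essentially capture–recapture: if two engines sample pages independently, then the fraction of B's index also found in A estimates A's coverage of the whole Web, so N ≈ |A| · |B| / |A ∩ B|. A sketch with made-up index sizes:

```python
# Capture-recapture estimate of total Web size from two engines'
# index sizes and their measured overlap. The numbers below are
# invented for illustration, not taken from the Bharat & Broder study.

def estimate_web_size(size_a, size_b, overlap):
    """N such that |A|/N ~ |A n B|/|B| under independent sampling."""
    return size_a * size_b / overlap

# Two engines indexing 100M and 80M pages, sharing 40M:
print(estimate_web_size(100e6, 80e6, 40e6))  # 200000000.0
```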

Page 10:

What makes Web IR different?
• Much bigger
• No fixed document collection
• Users
• Non-human users
• Varied user base
• Miscellaneous user needs
• Dynamic content
• Evolving content
• Spam
• Infinite size – size is whatever can be indexed!

Page 11:

IR Winter 2010

…15. Crawling the Web
– Hypertext retrieval & Web-based IR
– Document closures
– Focused crawling…

Page 12:

Web crawling
• The HTTP/HTML protocols
• Following hyperlinks
• Some problems:
– Link extraction
– Link normalization
– Robot exclusion
– Loops
– Spider traps
– Server overload
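Link normalization can be sketched with Python's standard library. The example below is an illustration, not the crawler the slides use: it resolves a relative link against the page it came from, lowercases the scheme and host, and drops the fragment so that equivalent URLs compare equal.

```python
# Sketch of link normalization using only the standard library.
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base, href):
    """Resolve href against base and canonicalize the result."""
    absolute = urljoin(base, href)
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    # Lowercase scheme/host, drop the fragment, default empty paths to "/".
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

print(normalize("http://EXAMPLE.com/a/b.html", "../c.html#top"))
# http://example.com/c.html
```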

Page 13:

Example
• U-M’s root robots.txt file: http://www.umich.edu/robots.txt

User-agent: *
Disallow: /~websvcs/projects/
Disallow: /%7Ewebsvcs/projects/
Disallow: /~homepage/
Disallow: /%7Ehomepage/
Disallow: /~smartgl/
Disallow: /%7Esmartgl/
Disallow: /~gateway/
Disallow: /%7Egateway/
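Python's standard library can interpret such rules directly. A sketch using `urllib.robotparser` on an abbreviated copy of the rules above (no network access; the user-agent name `mybot` is made up):

```python
# Sketch: honoring robots.txt rules with the standard library parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /~homepage/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed prefix -> must not fetch; anything else is allowed.
print(rp.can_fetch("mybot", "http://www.umich.edu/~homepage/x.html"))  # False
print(rp.can_fetch("mybot", "http://www.umich.edu/index.html"))        # True
```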

Page 14:

Example crawler

• E.g., poacher
– http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
– Included in clairlib

Page 15:

&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc.
# Sets up the robot object and registers its hook functions.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}

Page 16:

Focused crawling

• Topical locality
– Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04)
• The radius-1 hypothesis
– Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random Web page)
• Focused crawling
– Keeping a priority queue of the most relevant pages
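The priority-queue frontier can be sketched in a few lines with `heapq`. The relevance scores here are assumed to come from some external classifier, and the URLs are placeholders:

```python
# Minimal focused-crawl frontier: a max-priority queue of URLs,
# implemented as a min-heap over negated relevance scores.
import heapq

frontier = []   # heap of (-score, url)
seen = set()    # avoid re-enqueuing the same URL (loops, duplicates)

def enqueue(url, score):
    """Add a newly discovered URL with its estimated relevance."""
    if url not in seen:
        seen.add(url)
        heapq.heappush(frontier, (-score, url))

enqueue("http://a.example/", 0.9)
enqueue("http://b.example/", 0.2)
enqueue("http://c.example/", 0.7)

_, best = heapq.heappop(frontier)
print(best)  # http://a.example/ -- most relevant page is crawled first
```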

Page 17:

Challenges in indexing the web

• Page importance varies a lot
• Anchor text
• User modeling
• Detecting duplicates
• Dealing with spam (content-based and link-based)

Page 18:

Duplicate detection

• Shingles:
– TO BE OR
– BE OR NOT
– OR NOT TO
– NOT TO BE
• Then use the Jaccard coefficient (size of intersection / size of union) to determine similarity
• Hashing
• Shingling (separate lecture)
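A minimal sketch of 3-word shingling and the Jaccard coefficient, using the phrase from the slide:

```python
# Word-level shingling and Jaccard similarity between shingle sets.

def shingles(text, k=3):
    """Set of k-word shingles, matching the 3-word examples above."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """|intersection| / |union| of two shingle sets."""
    return len(a & b) / len(a | b)

s1 = shingles("to be or not to be")
s2 = shingles("to be or not to go")
print(sorted(s1))      # the four shingles from the slide, lowercased
print(jaccard(s1, s2)) # 0.6
```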

Page 19:

Document closures for Q&A

[Figure: document closure for question answering — pages connected by links (P → L → P); the query terms "capital" and "spain" are covered by following a link to a page mentioning "Madrid, Spain".]

Page 20:

Document closures for IR

[Figure: document closure for IR — pages connected by links (P → L → P); the query "Physics Michigan" is covered jointly by a "Physics Department" page and a linked "University of Michigan" page.]

Page 21:

The link-content hypothesis

• Topical locality: a page is similar in content to the pages that point to it.
• Davison (TF*IDF, 100K pages)
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling) – probability of losing the topic

Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001

Menczer’s fit of content similarity as a function of link distance δ:

σ(δ) ≈ α1 · e^(−α2 · δ) + 0.03,  with α1 = 1.8, α2 = 0.6
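Davison’s numbers compare TF*IDF cosine similarities between page pairs (same-domain, linked, sibling, random). A bare-bones cosine over raw term counts (toy documents, term frequency only rather than full TF*IDF) illustrates why linked pages score well above random ones:

```python
# Cosine similarity between term-count vectors, as a simplified stand-in
# for the TF*IDF similarity used in Davison's measurements.
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine of two sparse term-count vectors (dicts/Counters)."""
    dot = sum(d1[t] * d2[t] for t in d1 if t in d2)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

# Toy documents: a page, a page it links to, and an unrelated page.
page = Counter("web crawling and web indexing".split())
linked = Counter("crawling the web".split())
random_page = Counter("cooking pasta recipes".split())

print(cosine(page, linked) > cosine(page, random_page))  # True
```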