Brass: A Queueing Manager for Warrick

Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA IWAW 2007 Vancouver, BC June 23, 2007



Page 1: Brass: A Queueing Manager  for Warrick

Brass: A Queueing Manager for Warrick

Frank McCown, Amine Benjelloun, and Michael L. Nelson

Old Dominion University, Computer Science Department

Norfolk, Virginia, USA

IWAW 2007, Vancouver, BC, June 23, 2007


Agenda

• Dangers facing websites
• Web-repository crawling
• Comparing web crawling with web-repository crawling
• All about Brass
• Alternate Warrick deployments


Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg


A couple weeks ago I… accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH! So right after I signed up with a better hosting company I had to figure out a plan B.


Crawling the Crawlers

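[Diagram: repositories Repo1, Repo2, ..., Repon are filled by web crawling the World Wide Web; web-repository crawling then retrieves the stored content from those repositories.]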
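The idea in the diagram can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Warrick's actual code: the `reconstruct` function, the `LinkExtractor` helper, and the two in-memory dictionaries standing in for repositories are all hypothetical. It shows a breadth-first crawl that asks each repository for a stored copy of a URL, keeps the first copy found, and follows same-site links from the recovered page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def reconstruct(start_url, repos):
    """Breadth-first crawl of the repositories, not of the live site.

    `repos` maps a repository name to a lookup function that returns the
    stored HTML for a URL, or None if the repository holds no copy.
    """
    site = urlparse(start_url).netloc
    frontier = deque([start_url])
    seen = {start_url}
    recovered = {}
    while frontier:
        url = frontier.popleft()
        for name, lookup in repos.items():
            html = lookup(url)
            if html is not None:
                recovered[url] = (name, html)
                break
        else:
            continue  # no repository has this page, so it cannot be recovered
        parser = LinkExtractor()
        parser.feed(recovered[url][1])
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                frontier.append(link)
    return recovered


# Hypothetical in-memory repositories standing in for real caches.
repo1 = {"http://foo.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>'}
repo2 = {"http://foo.org/a.html": "<p>page A</p>"}
repos = {"repo1": repo1.get, "repo2": repo2.get}
result = reconstruct("http://foo.org/", repos)
```

In this toy run the start page comes from `repo1`, `a.html` comes from `repo2`, and `b.html` is lost because no repository stored it.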


• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/


Cached Image


Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

[Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions]


Examples of Lost Websites Recovered with Warrick


Web Crawler


Web-Repository Crawler


Issues

Web crawling
• Limit hit rate per host
• Websites periodically unavailable
• Portions of website off-limits (robots.txt, passwords)
• Deep web
• Spam
• Duplicate content
• Flash and JavaScript interfaces
• Crawler traps

Web-repo crawling
• Limit hit rate per repo
• Limited hits per day (API query quotas)
• Repos periodically unavailable
• Flash and JavaScript interfaces
• Can only recover what repos have stored
• Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
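The two repository-side limits listed above (a hit rate per repo and a daily query quota) could be tracked along these lines. This is an illustrative sketch only: the `RepoBudget` class, its parameters, and the numbers in the usage lines are assumptions, not values or code from Warrick or any repository's API.

```python
class RepoBudget:
    """Tracks a daily query quota and a minimum delay between queries."""

    def __init__(self, daily_quota, min_interval_s):
        self.daily_quota = daily_quota
        self.min_interval_s = min_interval_s
        self.used_today = 0
        self.day = None
        self.last_query_at = None

    def try_acquire(self, now_s):
        """Return True and record a query if one is allowed at time now_s."""
        day = int(now_s // 86400)
        if day != self.day:  # a new day resets the quota
            self.day = day
            self.used_today = 0
        if self.used_today >= self.daily_quota:
            return False  # daily quota exhausted
        if (self.last_query_at is not None
                and now_s - self.last_query_at < self.min_interval_s):
            return False  # too soon after the previous query
        self.used_today += 1
        self.last_query_at = now_s
        return True


# Hypothetical limits: 2 queries per day, at least 1 second apart.
budget = RepoBudget(daily_quota=2, min_interval_s=1.0)
results = [budget.try_acquire(t) for t in (0.0, 0.5, 1.0, 2.0, 86400.0)]
```

Passing the clock value in as `now_s` keeps the sketch deterministic; a real crawler would more likely read the time from `time.time()`.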


Problems with Warrick

• Requires user to download, install, and run from the command line:
  warrick.pl -d -r -o log.txt -c -wr ia http://foo.org/

• Google API keys are no longer available

• Screen-scrapes Google’s web user interface, which can cause Google to blacklist an IP address


Solution: Brass

• Queueing system using ODU nodes, so API query limits can be spread across several machines

• Uses Google API keys that we obtained before Google stopped issuing them

• Easy-to-use web interface utilizing email to notify user when reconstructions are complete
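Spreading query limits across several machines might look like the following. It is a rough sketch, not the Brass implementation: the `Node` class, the `dispatch` function, the node names, the per-job query estimates, and the example addresses are all hypothetical. Each queued reconstruction request carries the user's email address so that a notification could be sent when the job finishes.

```python
from collections import deque


class Node:
    """A worker machine with its own remaining query budget for the day."""

    def __init__(self, name, queries_left):
        self.name = name
        self.queries_left = queries_left


def dispatch(jobs, nodes):
    """Assign queued jobs in FIFO order to the node with the most quota left.

    `jobs` is a list of (url, email, estimated_queries) tuples. Jobs that no
    node can afford today are returned separately so they can wait.
    """
    queue = deque(jobs)
    assignments = []
    waiting = []
    while queue:
        url, email, cost = queue.popleft()
        node = max(nodes, key=lambda n: n.queries_left)
        if node.queries_left >= cost:
            node.queries_left -= cost
            assignments.append((url, node.name, email))
        else:
            waiting.append((url, email, cost))
    return assignments, waiting


# Hypothetical nodes and requests.
nodes = [Node("node-a", 1000), Node("node-b", 1000)]
jobs = [
    ("http://foo.org/", "user1@example.com", 800),
    ("http://bar.example/", "user2@example.com", 700),
    ("http://baz.example/", "user3@example.com", 500),
]
assignments, waiting = dispatch(jobs, nodes)
```

Here the first two jobs land on different nodes, and the third waits because neither node has 500 queries left for the day.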


Brass Architecture


Other Warrick Deployments

• GUI interface for client executable
  – Installation difficulties
  – Lack of Google API keys
• Web interface along with client application which makes queries
  – Browser plug-in, Flash, or applet
  – Must manage Google API keys
  – Browser must be left open with continued Internet access


Conclusions

• Warrick interface is almost ready for the public

• Web interface will likely greatly increase Warrick usage

• Collection of usage data will allow us to better understand what kinds of websites the public is interested in recovering


Frank McCown

And that’s everything there is to know about Brass!

Thanks, Dad, but I just wanted to know when you were going to change my diaper…