Introducing Web Archiving and WSDL Research Group Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA)

Upload: sawood-alam

Post on 22-Jan-2017


TRANSCRIPT

Page 1: Introducing Web Archiving and WSDL Research Group

Introducing Web Archiving and WSDL Research Group

Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)

Page 2: Introducing Web Archiving and WSDL Research Group

About Me

Sawood Alam

Lexical Signature

Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.

● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current

Page 3: Introducing Web Archiving and WSDL Research Group

She Calls Me Dad!

Page 4: Introducing Web Archiving and WSDL Research Group

Agenda

● Archiving and Web archiving
● Purpose and importance
● Scope of the web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group

Page 5: Introducing Web Archiving and WSDL Research Group

What is an Archive?

● Accumulation of historical records
● Long term storage and preservation
● Less frequently used
● Physical or digital

Page 6: Introducing Web Archiving and WSDL Research Group

What is Web Archiving?

● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible

Page 7: Introducing Web Archiving and WSDL Research Group

Why Do We Care About Archiving?

Web content decays rapidly!

● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction

Page 8: Introducing Web Archiving and WSDL Research Group

Issues and Challenges

● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility

Page 9: Introducing Web Archiving and WSDL Research Group

Web Archiving Efforts

● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving

Page 11: Introducing Web Archiving and WSDL Research Group

Memento

<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
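The listing on this slide is a Memento TimeMap in link format: each entry is a URI in angle brackets followed by `rel` and, for mementos, `datetime` parameters. As a rough illustration of how a client might read such a listing, here is a minimal Python sketch. The function name `parse_timemap` and the regular expressions are assumptions made for this example, not part of any Memento tool, and the sketch only handles quoted parameter values like those shown on the slide.

```python
import re
from datetime import datetime, timezone

# One link-value: <URI> followed by zero or more ;-separated quoted parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of dicts with 'uri', 'rel', and (for mementos) 'datetime'."""
    links = []
    for uri, params in LINK_RE.findall(text):
        entry = {"uri": uri}
        entry.update(dict(PARAM_RE.findall(params)))
        if "datetime" in entry:
            # Memento datetimes use the HTTP-date form, always in GMT.
            entry["datetime"] = datetime.strptime(
                entry["datetime"], "%a, %d %b %Y %H:%M:%S GMT"
            ).replace(tzinfo=timezone.utc)
        links.append(entry)
    return links


sample = (
    '<http://example.com>; rel="original",\n'
    '<http://web.archive.org/web/20020120142510/http://example.com/>; '
    'rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT",\n'
    '<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; '
    'rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT",\n'
)

for link in parse_timemap(sample):
    # A rel value may hold several space-separated relation types.
    if "memento" in link["rel"].split():
        print(link["datetime"].isoformat(), link["uri"])
```

Splitting `rel` on whitespace before testing for `memento` keeps the sketch tolerant of combined relation values such as `first memento`.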

Page 12: Introducing Web Archiving and WSDL Research Group

Archive X-Ray!

● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries

Page 13: Introducing Web Archiving and WSDL Research Group
Page 14: Introducing Web Archiving and WSDL Research Group
Page 15: Introducing Web Archiving and WSDL Research Group
Page 16: Introducing Web Archiving and WSDL Research Group
Page 17: Introducing Web Archiving and WSDL Research Group

MemGator

https://github.com/oduwsdl/memgator

Page 18: Introducing Web Archiving and WSDL Research Group

MemGator

http://memgator.cs.odu.edu:1208/

Page 19: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 20: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 21: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 22: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 23: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 24: Introducing Web Archiving and WSDL Research Group

Memento Aggregator

Page 25: Introducing Web Archiving and WSDL Research Group

From: Michael Nelson [mailto:[email protected]]

Sent: Wednesday, December 02, 2015 12:33 PM

To: Jones, Gina

Cc: Rourke, Patrick; Grotke, Abigail

Subject: Re: WebSciDL

Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.

regards,

Michael

On Wed, 2 Dec 2015, Jones, Gina wrote:

> Hi Michael, we have a slight configuration issue with the current OW

> set up for our webarchives. I think, from looking at the logs, that

> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.

> Do you know who is running this scraper? Itʼs not part of memento is it?

>

> Gina Jones

> Web Archiving Team

> Library of Congress

From: Ilya Kreymer <[email protected]>

Date: Wed, 2 Dec 2015 10:33:56 -0800

Subject: high traffic on oldweb!

To: Herbert Van de Sompel <[email protected]>, Sawood Alam <[email protected]>

Hi Herbert, Sawood,

Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..

I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.

Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)

Ilya

Broadcasting is Bad

Page 26: Introducing Web Archiving and WSDL Research Group

Memento Routing

Page 27: Introducing Web Archiving and WSDL Research Group

Long Tail of Archives

Page 28: Introducing Web Archiving and WSDL Research Group

While the IA was Down...

$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
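The pipeline on this slide keeps the first four characters of each CDXJ line (the year), drops metadata lines, and counts how often each year occurs. The same tally can be sketched in Python. The function name `mementos_per_year` and the sample lines are made up for this illustration, and the sketch assumes, as the `cut -c-4` step implies, that every data line begins with a timestamp whose first four characters are the year.

```python
from collections import Counter


def mementos_per_year(cdxj_lines):
    """Count data lines by the four-digit year that starts each line."""
    counts = Counter()
    for line in cdxj_lines:
        year = line[:4]
        # Metadata lines begin with a marker such as "@" (as filtered out by
        # grep in the command above); skip anything not starting with a year.
        if len(year) == 4 and year.isascii() and year.isdigit():
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical input, shaped like the lines the shell pipeline would see.
sample = [
    '@context ["http://tools.ietf.org/html/rfc7089"]',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20020328012821 {"uri": "http://archive.example/20020328012821/http://example.org/"}',
    '20050803080544 {"uri": "http://archive.example/20050803080544/http://example.org/"}',
]

for year, count in mementos_per_year(sample).items():
    print(f"{count:7d} {year}")
```

Unlike `uniq -c`, which only merges adjacent duplicates, the counter does not depend on the input being sorted by time.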

Page 29: Introducing Web Archiving and WSDL Research Group

Archive Profile

● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things

com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
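To make the routing idea concrete, the following Python sketch shows one way an aggregator could consult such profiles before deciding which archives to query, instead of broadcasting to all of them. Everything here is an assumption for illustration: the helper names `surt_host_key` and `candidate_archives`, the archive names, and the assignment of the slide's three sample keys to those archives. The key function also only covers the host-level keys shown on the slide and drops a leading `www.`, which real profiles may or may not do.

```python
from urllib.parse import urlsplit


def surt_host_key(uri):
    """Build a host-level key such as 'com,cnn)/' from a URI."""
    host = (urlsplit(uri).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Reverse the domain labels so related hosts share a common prefix.
    return ",".join(reversed(host.split("."))) + ")/"


# Hypothetical profiles: archive name -> {key: statistics}.
PROFILES = {
    "archive-a": {
        "com,cnn)/": {"frequency": 40, "spread": 2},
        "uk,co,bbc)/": {"frequency": 20, "spread": 1},
    },
    "archive-b": {
        "com,usatoday)/": {"frequency": 5, "spread": 1},
    },
}


def candidate_archives(uri, profiles):
    """Return the archives whose profile suggests they hold the URI's host."""
    key = surt_host_key(uri)
    return sorted(name for name, profile in profiles.items() if key in profile)


print(candidate_archives("http://www.cnn.com/world", PROFILES))
print(candidate_archives("http://example.org/", PROFILES))
```

An empty result means no profile predicts a hit, so the aggregator could skip the lookup or fall back to a default set of archives.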

Page 30: Introducing Web Archiving and WSDL Research Group

Research Opportunities

● Information retrieval
● Information visualization
● Client- and server-side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternate long-term archiving techniques
● Predicting “important” events on the Web and archiving them in a timely manner

Page 32: Introducing Web Archiving and WSDL Research Group

ODU Sailing Center

Page 33: Introducing Web Archiving and WSDL Research Group

WSDL Feast

Page 34: Introducing Web Archiving and WSDL Research Group

WSDL Whiteboards

Page 35: Introducing Web Archiving and WSDL Research Group

WSDL Surprise

Page 36: Introducing Web Archiving and WSDL Research Group

WSDL Ping Pong Table

Page 37: Introducing Web Archiving and WSDL Research Group

WSDL Travels