Introducing Web Archiving and WSDL Research Group
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
About Me
Sawood Alam
Lexical Signature
Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.
● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current
She Calls Me Dad!
Agenda
● Archiving and Web archiving
● Purpose and importance
● Scope of Web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group
What is an Archive?
● Accumulation of historical records
● Long term storage and preservation
● Less frequently used
● Physical or digital
What is Web Archiving?
● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible
Why Do We Care About Archiving?
Web contents decay rapidly!
● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction
Issues and Challenges
● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility
Web Archiving Efforts
● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving
Tools and Techniques
● Heritrix, PhantomJS, Wget, cURL
● OpenWayback, PyWB
● TimeTravel, MemGator
● CarbonDate, Warrick, Synchronicity
● Preserve Me!
● WARCreate, WAIL, Mink
● Browsertrix
● And many more...
Memento
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
rel="memento";
datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>;
rel="memento";
datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
rel="memento";
datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>;
rel="memento";
datetime="Sun, 13 Dec 2009 01:50:14 GMT",
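The TimeMap excerpt above is a comma-separated list of links, each a URI in angle brackets followed by semicolon-separated attributes. As a rough illustration of how a client might read such a listing, here is a minimal Python sketch. The function name parse_timemap and the regular expressions are assumptions made for this example, not part of any Memento library, and the sketch only handles the simple quoted-attribute form shown on the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by zero or more ;-separated name="value" attributes.
ENTRY_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of dicts with 'uri', 'rel' and (if present) 'datetime'."""
    entries = []
    for uri, attrs_text in ENTRY_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attrs_text))
        entry = {"uri": uri, "rel": attrs.get("rel", "").split()}
        if "datetime" in attrs:
            # Memento datetimes use the HTTP date format, e.g.
            # "Sun, 20 Jan 2002 14:25:10 GMT".
            entry["datetime"] = parsedate_to_datetime(attrs["datetime"])
        entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = (
        '<http://example.com>; rel="original",\n'
        '<http://web.archive.org/web/20020120142510/http://example.com/>;\n'
        ' rel="memento";\n'
        ' datetime="Sun, 20 Jan 2002 14:25:10 GMT",\n'
    )
    for entry in parse_timemap(sample):
        print(entry["rel"], entry["uri"], entry.get("datetime"))
```

Matching quoted values as a unit keeps the commas inside the datetime strings from being mistaken for entry separators.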
Archive X-Ray!
● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries
MemGator
https://github.com/oduwsdl/memgator
MemGator
http://memgator.cs.odu.edu:1208/
Memento Aggregator
From: Michael Nelson [mailto:[email protected]]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <[email protected]>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <[email protected]>, Sawood Alam <[email protected]>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
Memento Routing
Long Tail of Archives
While the IA was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
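The pipeline above keeps the first four characters of each CDXJ line (the year of the memento datetime), drops metadata lines, and counts the repeats. The same tally can be sketched in Python as follows. The function name mementos_per_year is hypothetical, and the sketch assumes each memento line starts with a 14-digit datetime key; any line that does not start with four digits is treated as metadata and skipped.

```python
from collections import Counter


def mementos_per_year(cdxj_text):
    """Count memento lines per year in a CDXJ TimeMap given as a string."""
    counts = Counter()
    for line in cdxj_text.splitlines():
        year = line[:4]
        # Memento lines are assumed to start with a YYYYMMDDhhmmss key;
        # metadata lines (starting with "@" or similar) do not.
        if len(year) == 4 and year.isdigit():
            counts[year] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    import sys

    # Reads CDXJ from standard input, e.g. piped from the memgator command.
    for year, count in mementos_per_year(sys.stdin.read()):
        print(f"{count:7d} {year}")
```

Unlike uniq -c, the Counter does not depend on the input being sorted, so it still works if lines from several archives arrive interleaved.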
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
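To make the routing idea concrete, the sketch below turns a URI into a host-level key of the form shown above (labels reversed and comma-separated, followed by ")/") and polls only the archives whose profile contains that key. Everything here is an assumption for illustration: the function names surt_host_key and route, the archive names, the choice to drop a leading "www.", and the restriction to host-level keys. Real profiles may use deeper keys and different normalization.

```python
from urllib.parse import urlsplit


def surt_host_key(uri):
    """Build a host-level key such as 'uk,co,bbc)/' from an absolute URI."""
    host = urlsplit(uri).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assumed normalization, not confirmed by the slides
    return ",".join(reversed(host.split("."))) + ")/"


def route(uri, profiles):
    """Return names of archives whose profile mentions the URI's host key."""
    key = surt_host_key(uri)
    return [name for name, profile in profiles.items() if key in profile]


if __name__ == "__main__":
    # Hypothetical archives; the first reuses the three sample keys above.
    profiles = {
        "archive-a": {
            "com,cnn)/": {"frequency": 40, "spread": 2},
            "uk,co,bbc)/": {"frequency": 20, "spread": 1},
            "com,usatoday)/": {"frequency": 5, "spread": 1},
        },
        "archive-b": {
            "uk,co,bbc)/": {"frequency": 20, "spread": 1},
        },
    }
    print(route("http://www.cnn.com/world", profiles))
    print(route("http://www.bbc.co.uk/news", profiles))
```

An aggregator using such a lookup would skip archives that are unlikely to hold any mementos for the requested URI instead of broadcasting every query to all of them.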
Research Opportunities
● Information retrieval
● Information visualization
● Client and server side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternate long term archiving techniques
● Predicting "important" events on the Web and archiving them in a timely manner
Web Science and Digital Libraries Research Group
ws-dl.cs.odu.edu
ws-dl.blogspot.com
@WebSciDL
github.com/oduwsdl
flickr.com/photos/124419986@N07
ODU Sailing Center
WSDL Feast
WSDL Whiteboards
WSDL Surprise
WSDL Ping Pong Table
WSDL Travels
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
[email protected]
@ibnesayeed
www.cs.odu.edu/~salam