Tools for Managing the Past Web
2014 Archive-It Partners Meeting, November 18, 2014. Presented by Michele Weigle.
Michele C. Weigle
Web Science and Digital Libraries (WS-DL) Group
Department of Computer Science
Old Dominion University
Norfolk, VA
Includes joint work with Michael L. Nelson and our PhD students, Yasmin AlNoamany, Ahmed AlSum (PhD 2014), Justin Brunelle, Mat Kelly, Hany SalahEldeen
Archive-It Partners Meeting
November 18, 2014
Outline
Start-Up and Implementation Grants
– WARCreate
– WAIL
– Mink
– Assessing Memento Damage
Web Archiving Incentive
– Thumbnail Summarization
– Detecting Off-Topic Mementos
WARCreate, WAIL, Mink: https://ws-dl.cs.odu.edu/Software
Archive What I See Now
• Standard web archiving tools are difficult for non-IT experts.
• "Save Page As" is not suitable for archiving purposes.
• Pages are behind authentication.
• Pages change quickly, but current state needs archiving.
NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014
How we're addressing the problem
Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012
Kelly, Weigle, and Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session
WARCreate
• Google Chrome extension
• Archive the current state of the page in standard Web Archive (WARC) format
• Compatible with Wayback
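To make the WARC format concrete, here is a minimal Python sketch of one WARC/1.0 "response" record wrapped around a captured HTTP response. It is not WARCreate's code (WARCreate is a Chrome extension). The function name and the example response are hypothetical, and real WARC files usually also carry warcinfo and request records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes in a single WARC/1.0 'response' record."""
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from content.
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return header_block + http_response + b"\r\n\r\n"


# Hypothetical captured response; a real capture would come from the browser.
example_http = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello, archive</body></html>"
)
record = build_warc_response_record("http://example.com/", example_http)
```

Appending further records to the same file is how several captures would share one WARC.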
WARCreate - Work in Progress
• New modes of operation
– record mode
• while activated, add capture of each page visited to the WARC
– countdown mode
• every interval, refresh and add new capture of page
– event mode
• add new capture of page every time it dynamically reloads or refreshes
WARCreate - Work in Progress
• Uploading created WARCs to Archive-It or other archives
– consideration of data integrity
– merging local WARCs with crawled WARCs
• how do we account for your www.facebook.com vs. my www.facebook.com?
– privacy
What to do with created WARCs?
Kelly, Weigle, and Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session
Kelly, Nelson, and Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013
WAIL
• Load created WARCs into a Wayback instance on your local computer
• Single-click install of Wayback (and other archiving tools)
• Includes IIPC's OpenWayback 2.0 and Heritrix 3.2
• Available for Windows, OS X (Linux coming soon!)
WAIL - Work in Progress
• More tools
– integration with Ilya Kreymer's pywb
• User interface enhancements
– ease of installation
– intuitive GUI
– configuration of Wayback display and Heritrix crawls
Bridging the gap between the past web and the live web
Mink
• Google Chrome extension
• For each page you visit, displays the number of archived versions available
• Provides access by date
• Allows for submission to public archiving services
Kelly, Nelson and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.
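The count of archived versions can be read from a Memento TimeMap, which lists the known mementos of a page in link format. The Python sketch below parses a small inline TimeMap and counts the entries whose rel value includes "memento". The archive.example.org URIs and the function name are made up for illustration, and this is not Mink's code.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <uri> followed by any number of ; key="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return a sorted list of (datetime, uri) for every memento link."""
    mementos = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


# Hypothetical TimeMap; a real one would be fetched from an archive or aggregator.
EXAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example.org/timemap/link/http://example.com/>'
    '; rel="self"; type="application/link-format",\n'
    '<http://archive.example.org/web/20100101000000/http://example.com/>'
    '; rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20120601000000/http://example.com/>'
    '; rel="memento"; datetime="Fri, 01 Jun 2012 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20140101000000/http://example.com/>'
    '; rel="last memento"; datetime="Wed, 01 Jan 2014 00:00:00 GMT"\n'
)

mementos = parse_mementos(EXAMPLE_TIMEMAP)
memento_count = len(mementos)  # the number a badge would display
```

Because the list is sorted by datetime, the same structure also supports access by date.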
Mink - Work in Progress
• Pick public archives (Memento Aggregator) or private archive (local computer)
Tools
WARCreate
Mink
WAIL
https://ws-dl.cs.odu.edu/Software
How damaged are these mementos?
M = 0.17, D = 0.09 (live web)
M = 0.24, D = 0.41 (missing main)
M = 0.29, D = 0.36 (missing logo + navigation)
Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", IEEE/ACM Digital Libraries (DL) 2014, Best Student Paper
M = percentage missing
D = our damage metric
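The two measures can diverge because M counts every missing resource equally, while a damage score can weight resources by how much they matter to the page. The Python sketch below shows that idea with a simple weighted fraction. The weights and resource names are hypothetical inputs, and this is not the formula from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedResource:
    uri: str
    weight: float  # assumed importance of the resource to the page
    missing: bool  # True if the archive failed to capture it


def percent_missing(resources):
    """M: fraction of embedded resources that are missing, all counted equally."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources):
    """Simplified stand-in for D: missing weight divided by total weight."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


# Page A: one missing resource, but it is the main image.
page_a = [
    EmbeddedResource("main.jpg", 6, True),
    EmbeddedResource("style.css", 2, False),
    EmbeddedResource("icon1.png", 1, False),
    EmbeddedResource("icon2.png", 1, False),
]
# Page B: two missing resources, both minor icons.
page_b = [
    EmbeddedResource("main.jpg", 4, False),
    EmbeddedResource("style.css", 4, False),
    EmbeddedResource("icon1.png", 1, True),
    EmbeddedResource("icon2.png", 1, True),
]

m_a, d_a = percent_missing(page_a), weighted_damage(page_a)  # 0.25, 0.6
m_b, d_b = percent_missing(page_b), weighted_damage(page_b)  # 0.5, 0.2
```

Page B is missing more resources than page A yet scores lower on damage, which is how M can rise while D falls.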
Good News:
Although M is steady/increasing, D is decreasing
Sampled 45,000 URI-Ms
– one URI-M per year for each of ~1,850 URI-Rs
– URI-Rs from Bitly URIs shared over Twitter and Archive-It collections
Browsing TimeMaps
How were these 4 thumbnails chosen?
Which tells you more about the past of www.apple.com?
700 thumbnails (not even all of them!)
32 sampled thumbnails
AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
Thumbnail Summarization
• Process
– compare HTML of consecutive mementos
• more efficient than image diff
– when diff threshold passed, generate thumbnail
– return data + thumbnail as JSON
• Considerations
– diff threshold too low -> near-duplicate images
– diff threshold too high -> miss important changes
• Work in Progress
– Wayback plugin
– embeddable version
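The process above can be sketched in Python as follows. Python's difflib stands in for the HTML comparison here; the cited paper may use a different similarity measure. The 0.3 threshold, the function name, and the sample pages are assumptions for illustration only.

```python
import json
from difflib import SequenceMatcher


def select_for_thumbnails(mementos, threshold=0.3):
    """mementos: chronological list of (datetime_string, html).

    Keeps the first memento, then every memento whose HTML differs from
    its immediate predecessor by at least `threshold` (0.0 = identical).
    """
    if not mementos:
        return []
    selected = [mementos[0][0]]
    for (_, previous_html), (stamp, html) in zip(mementos, mementos[1:]):
        matcher = SequenceMatcher(None, previous_html, html, autojunk=False)
        difference = 1.0 - matcher.ratio()
        if difference >= threshold:
            selected.append(stamp)  # a thumbnail would be generated here
    return selected


# Hypothetical mementos of one page: a tiny edit, then a redesign.
example_mementos = [
    ("2005-01-01", "<html><body><h1>Store</h1><p>iPod</p></body></html>"),
    ("2005-02-01", "<html><body><h1>Store</h1><p>iPod.</p></body></html>"),
    ("2014-09-09", "<html><body><div class='hero'>iPhone 6 launch event</div>"
                   "<ul><li>Mac</li><li>iPad</li><li>Watch</li></ul></body></html>"),
]

selected = select_for_thumbnails(example_mementos)
summary_json = json.dumps({"thumbnails": selected})
```

Lowering the threshold would also select the February capture, which is the near-duplicate case noted under Considerations.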
Thumbnail Summary Screencast
Have you ever had this problem?
May 21, 2012
nothing but spam
May 16, 2013
Detecting Off-Topic Mementos
• Goal: Build a tool to alert curators of potential off-topic mementos in a collection
• Compare text of mementos
– Intersection of top terms (TF)
– Cosine similarity
– Jaccard similarity coefficient
– Clustering with topic modeling
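Two of the text measures listed above, cosine similarity and the Jaccard coefficient, can be sketched in a few lines of Python. Comparing every memento against the first capture, the 0.2 threshold, and the sample texts are assumptions for illustration, not the tool's settings.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count alphanumeric terms."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Shared terms divided by all distinct terms."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def flag_off_topic(texts, threshold=0.2):
    """Return indices of mementos whose cosine similarity to the first
    memento falls below the threshold."""
    baseline = term_frequencies(texts[0])
    return [
        i for i, text in enumerate(texts[1:], start=1)
        if cosine_similarity(baseline, term_frequencies(text)) < threshold
    ]


# Hypothetical extracted text from three mementos of one page.
texts = [
    "Occupy protest march in the city park today",
    "Occupy protest continues in the city park",
    "Cheap pills buy now discount pills online",
]
flagged = flag_off_topic(texts)
```

A curator-facing tool would present the flagged mementos for review rather than remove them automatically.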
Test Collections
Turns out to be rather difficult
• Egyptian Revolution
– lots of non-English pages
• Occupy Movement
– lots of Facebook and social media pages
– template extractors have trouble with these
• Boston Marathon Bombing
but we're making progress (stay tuned!)
Storytelling For Archives
Archived collections
Storytelling services
Archived enriched stories
AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013.
Story Types
• Fixed Page – Fixed Time (same URI, same time): differences in GeoIP, mobile, etc.
• Fixed Page – Sliding Time (same URI, different time): evolution of a single page (or domain) through time
• Sliding Page – Fixed Time (different URI, same time): different perspectives on a point in time
• Sliding Page – Sliding Time (different URI, different time): broadest possible coverage of a collection
Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
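The four story types follow from two questions: does the URI vary, and does the time vary? A minimal Python sketch, assuming a story is a list of (URI, datetime) pairs and that "fixed time" means all captures fall within a tolerance window. The one-day tolerance and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def story_type(mementos, time_tolerance=timedelta(days=1)):
    """Classify a list of (uri, datetime) pairs along the page and time axes."""
    uris = {uri for uri, _ in mementos}
    times = [when for _, when in mementos]
    span = max(times) - min(times) if times else timedelta(0)
    page_axis = "Fixed Page" if len(uris) <= 1 else "Sliding Page"
    time_axis = "Fixed Time" if span <= time_tolerance else "Sliding Time"
    return page_axis, time_axis


# One page followed across years: evolution of a single page.
single_page = [
    ("http://www.apple.com/", datetime(2005, 1, 1)),
    ("http://www.apple.com/", datetime(2014, 1, 1)),
]
# Several pages captured on the same day: perspectives on a point in time.
one_day = [
    ("http://example.com/news", datetime(2013, 4, 15, 9)),
    ("http://example.org/blog", datetime(2013, 4, 15, 18)),
]

single_page_type = story_type(single_page)
one_day_type = story_type(one_day)
```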
Tools for Storytelling
• Tools for Curators
– create stories from your collections
• candidate mementos automatically selected
– use existing stories to augment your collections
• Tools for Users
– use existing tools like Storify to view the stories of a collection
Start-Up and Implementation Grants
– WARCreate
– WAIL
– Mink
– Assessing Memento Damage
Web Archiving Incentive
– Thumbnail Summarization
– Detecting Off-Topic Mementos
Tools for Managing the Past Web
WARCreate, WAIL, Mink: https://ws-dl.cs.odu.edu/Software
Michele C. Weigle
@weiglemc
http://www.cs.odu.edu/~mweigle/
Web Science and Digital Libraries (WS-DL) Group
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/