Tools for Managing the Past Web

Tools for Managing the Past Web

Michele C. Weigle
Web Science and Digital Libraries (WS-DL) Group, Department of Computer Science, Old Dominion University, Norfolk, VA

Includes joint work with Michael L. Nelson and our PhD students, Yasmin AlNoamany, Ahmed AlSum (PhD 2014), Justin Brunelle, Mat Kelly, Hany SalahEldeen

Archive-It Partners Meeting, November 18, 2014


DESCRIPTION

Tools for Managing the Past Web 2014 Archive-It Partners Meeting November 18, 2014 Presented by Michele Weigle

TRANSCRIPT

Page 1: Tools for Managing the Past Web

Tools for Managing the Past Web

Michele C. Weigle

Web Science and Digital Libraries (WS-DL) Group

Department of Computer Science

Old Dominion University

Norfolk, VA

Includes joint work with Michael L. Nelson and our PhD students, Yasmin AlNoamany, Ahmed AlSum (PhD 2014), Justin Brunelle, Mat Kelly, Hany SalahEldeen

Archive-It Partners Meeting

November 18, 2014

Page 2: Tools for Managing the Past Web

Outline

Start-Up and Implementation Grants
– WARCreate
– WAIL
– Mink
– Assessing Memento Damage

Web Archiving Incentive
– Thumbnail Summarization
– Detecting Off-Topic Mementos

WARCreate, WAIL, Mink: https://ws-dl.cs.odu.edu/Software

Page 3: Tools for Managing the Past Web

Archive What I See Now

• Standard web archiving tools are difficult for non-IT experts to use.

• "Save Page As" is not suitable for archiving purposes.

• Pages are behind authentication.

• Pages change quickly, but current state needs archiving.


NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014


Page 4: Tools for Managing the Past Web

How we're addressing the problem

WARCreate

Google Chrome extension

Archive the current state of the page in standard Web Archive (WARC) format

Compatible with Wayback

Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012

Kelly, Weigle, and Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session
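To make "standard Web Archive (WARC) format" concrete, here is a minimal Python sketch of how a single WARC/1.0 response record is laid out. It is not WARCreate's code (WARCreate is a Chrome extension); the function name `build_warc_response_record` and the sample HTTP response are assumptions made up for illustration.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record (illustrative helper, not WARCreate's API)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Record layout: header block, blank line, content block, then two CRLFs.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"


# Hypothetical captured response; a real capture would hold the bytes the browser received.
captured = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello, archive</body></html>"
)
record = build_warc_response_record("http://example.com/", captured)
print(record.decode("utf-8"))
```

A complete WARC file typically starts with a `warcinfo` record and holds one record per captured resource (HTML, images, stylesheets); this sketch shows only a single record.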

Page 5: Tools for Managing the Past Web

WARCreate - Work in Progress

• New modes of operation
– record mode
  • while activated, add capture of each page visited to the WARC
– countdown mode
  • every interval, refresh and add new capture of page
– event mode
  • add new capture of page every time it dynamically reloads or refreshes

Page 6: Tools for Managing the Past Web

WARCreate - Work in Progress

• Uploading created WARCs to Archive-It or other archives
– consideration of data integrity
– merging local WARCs with crawled WARCs
  • how do we account for your www.facebook.com vs. my www.facebook.com?
– privacy

Page 7: Tools for Managing the Past Web

What to do with created WARCs?

WAIL

Load created WARCs into a Wayback instance on your local computer

Single-click install of Wayback (and other archiving tools)

Includes IIPC's OpenWayback 2.0 and Heritrix 3.2

Available for Windows, OS X (Linux coming soon!)

Kelly, Weigle, and Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session

Kelly, Nelson, and Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013

Page 8: Tools for Managing the Past Web

WAIL - Work in Progress

• More tools
– integration with Ilya Kreymer's pywb
• User interface enhancements
– ease of installation
– intuitive GUI
– configuration of Wayback display and Heritrix crawls

Page 9: Tools for Managing the Past Web

Bridging the gap between the past web and the live web

Mink

Google Chrome extension

For each page you visit, displays the number of archived versions available

Provides access by date

Allows for submission to public archiving services

Kelly, Nelson and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.
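A count of archived versions and access by date can both be derived from a Memento TimeMap. The sketch below parses a link-format TimeMap of the kind described in RFC 7089 and counts the entries whose `rel` includes `memento`. It is an illustration in Python, not Mink's implementation; the host `archive.example.org` and the sample TimeMap are invented, and the parser assumes well-formed, double-quoted attributes.

```python
import re
from email.utils import parsedate_to_datetime

# Invented sample TimeMap in application/link-format (all URIs are placeholders).
SAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example.org/timemap/link/http://example.com/>; rel="self"; type="application/link-format",\n'
    '<http://archive.example.org/timegate/http://example.com/>; rel="timegate",\n'
    '<http://archive.example.org/web/20100101000000/http://example.com/>; rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20120521000000/http://example.com/>; rel="memento"; datetime="Mon, 21 May 2012 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20130516000000/http://example.com/>; rel="last memento"; datetime="Thu, 16 May 2013 00:00:00 GMT"\n'
)

# A link is <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (datetime, URI-M) pairs, oldest first, for every memento link."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(timemap_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


mementos = parse_mementos(SAMPLE_TIMEMAP)
print(f"{len(mementos)} archived versions")
for when, uri in mementos:
    print(when.date(), uri)
```

Splitting the TimeMap on commas would not work, because the `datetime` values themselves contain commas; matching each `<URI>` together with its quoted attributes avoids that problem.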

Page 10: Tools for Managing the Past Web

Mink - Work in Progress

• Pick public archives (Memento Aggregator) or private archive (local computer)

Page 11: Tools for Managing the Past Web

Tools

WARCreate

Mink

WAIL

https://ws-dl.cs.odu.edu/Software


Page 13: Tools for Managing the Past Web

How damaged are these mementos?

M = percentage missing

D = our damage metric

M = 0.17, D = 0.09 (live web)

M = 0.24, D = 0.41 (missing main)

M = 0.29, D = 0.36 (missing logo + navigation)

Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", IEEE/ACM Digital Libraries (DL) 2014, Best Student Paper
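M and D can diverge because M counts every missing embedded resource equally, while a damage metric can weight each resource by how much it matters to the page. The Python sketch below illustrates that idea only. The `importance` weights and the formula in `weighted_damage` are assumptions invented for this example; they are not the D defined in the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    uri: str
    importance: float  # assumed weight; a real metric would derive this from the page
    missing: bool


def percentage_missing(resources):
    """M: fraction of embedded resources that failed to load."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources):
    """Illustrative D: share of total importance carried by the missing resources."""
    total = sum(r.importance for r in resources)
    if total == 0:
        return 0.0
    return sum(r.importance for r in resources if r.missing) / total


# Two invented pages with the same M but very different damage.
page_a = [
    Resource("main-image.jpg", 5.0, missing=True),
    Resource("style.css", 1.0, missing=False),
    Resource("logo.png", 1.0, missing=False),
    Resource("footer.png", 1.0, missing=False),
]
page_b = [
    Resource("tracking-pixel.gif", 0.1, missing=True),
    Resource("main-image.jpg", 5.0, missing=False),
    Resource("style.css", 1.0, missing=False),
    Resource("logo.png", 1.0, missing=False),
]

for name, page in (("page_a", page_a), ("page_b", page_b)):
    print(name, round(percentage_missing(page), 3), round(weighted_damage(page), 3))
```

Both pages are missing one resource in four, so M is identical, but the page that lost its main image scores far higher on the weighted measure than the page that lost a tracking pixel.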

Page 14: Tools for Managing the Past Web

Good News: Although M is steady/increasing, D is decreasing

M = percentage missing

D = our damage metric

Sampled 45,000 URI-Ms

- one URI-M per year for each of ~1850 URI-Rs

- URI-Rs from Bitly URIs shared over Twitter and Archive-It collections


Page 16: Tools for Managing the Past Web

Browsing TimeMaps

How were these 4 thumbnails chosen?

Page 17: Tools for Managing the Past Web

Which tells you more about the past of www.apple.com?

700 thumbnails (not even all of them!)

32 sampled thumbnails

AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014

Page 18: Tools for Managing the Past Web

Thumbnail Summarization

• Process
– compare HTML of consecutive mementos
  • more efficient than image diff
– when diff threshold passed, generate thumbnail
– return data + thumbnail as JSON
• Considerations
– diff threshold too low -> near duplicate images
– diff threshold too high -> miss important changes
• Work in Progress
– wayback plugin
– embeddable version
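The process on this slide can be sketched in a few lines of Python: compare the HTML of each memento with the one before it, and mark a capture for a thumbnail only when the difference passes a threshold. This is a rough sketch and not the authors' implementation. `difflib.SequenceMatcher` stands in for whatever HTML comparison the real tool uses, the `thumbnail` field is a placeholder filename (nothing is rendered), and the threshold of 0.3 and the sample captures are invented.

```python
import difflib
import json


def select_thumbnail_captures(mementos, threshold=0.3):
    """mementos: list of (timestamp, html) tuples, oldest first."""
    selected = []
    previous_html = None
    for timestamp, html in mementos:
        if previous_html is None:
            distance = 1.0  # always keep the first capture
        else:
            ratio = difflib.SequenceMatcher(None, previous_html, html).ratio()
            distance = 1.0 - ratio
        if distance >= threshold:
            selected.append({
                "datetime": timestamp,
                "distance": round(distance, 3),
                "thumbnail": f"thumb-{timestamp}.png",  # placeholder name only
            })
        previous_html = html  # compare consecutive mementos
    return selected


# Invented captures: a one-character edit, then a redesign.
captures = [
    ("20100101", "<html><body><h1>Welcome</h1><p>Old layout</p></body></html>"),
    ("20100201", "<html><body><h1>Welcome</h1><p>Old layout.</p></body></html>"),
    ("20110101", "<html><body><div id='hero'>Completely redesigned storefront "
                 "with new products</div></body></html>"),
]

summary = select_thumbnail_captures(captures)
print(json.dumps(summary, indent=2))
```

With these sample captures the near-identical second capture is skipped. Setting `threshold=0.0` selects every capture, which is the "near duplicate images" case listed under Considerations.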

Page 19: Tools for Managing the Past Web

Thumbnail Summary Screencast



Page 21: Tools for Managing the Past Web

Have you ever had this problem?


May 21, 2012

nothing but spam

May 16, 2013

Page 22: Tools for Managing the Past Web

Detecting Off-Topic Mementos

• Goal: Build a tool to alert curators of potential off-topic mementos in a collection
• Compare text of mementos
– Intersection of top terms (TF)
– Cosine similarity
– Jaccard similarity coefficient
– Clustering with topic modeling
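Three of the text comparisons listed here (top-term intersection, cosine similarity, Jaccard coefficient) are simple to compute from term frequencies. The Python sketch below shows one way to do it. Comparing each later memento against the first capture, the 0.15 cutoff, and the sample texts are all assumptions for illustration; the sketch also skips stop-word removal, stemming, and boilerplate extraction.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercased word counts (no stop-word removal or stemming)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm = (math.sqrt(sum(v * v for v in tf_a.values()))
            * math.sqrt(sum(v * v for v in tf_b.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(tf_a, tf_b):
    a, b = set(tf_a), set(tf_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def top_term_overlap(tf_a, tf_b, k=10):
    """Share of the first text's top-k terms that are also top-k in the second."""
    top_a = {t for t, _ in tf_a.most_common(k)}
    top_b = {t for t, _ in tf_b.most_common(k)}
    return len(top_a & top_b) / max(len(top_a), 1)


def flag_off_topic(first_text, later_texts, cutoff=0.15):
    """Flag later mementos whose cosine similarity to the first falls below cutoff."""
    reference = term_frequencies(first_text)
    return [cosine_similarity(reference, term_frequencies(t)) < cutoff
            for t in later_texts]


# Invented memento texts for one URI.
first = "Campaign news and events for the presidential election candidate"
later_on_topic = "Latest campaign events and election news from the candidate"
later_spam = "Cheap watches buy now discount pills free shipping"

flags = flag_off_topic(first, [later_on_topic, later_spam])
print(flags)
```

A curator-facing tool would likely report the flagged mementos for review and not remove them automatically, since a low score can also come from a language change or a heavily templated page.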

Page 23: Tools for Managing the Past Web

Test Collections


Page 24: Tools for Managing the Past Web

Turns out to be rather difficult

• Egyptian Revolution

– lots of non-English pages

• Occupy Movement

– lots of Facebook and social media pages

– template extractors have trouble with these

• Boston Marathon Bombing

but we're making progress (stay tuned!)

Page 25: Tools for Managing the Past Web

Storytelling For Archives

Archived collections

Storytelling services

Archived enriched stories

AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013.

Page 26: Tools for Managing the Past Web

Story Types

Fixed Page – Fixed Time (same URI, same time): differences in GeoIP, mobile, etc.

Fixed Page – Sliding Time (same URI, different times): evolution of a single page (or domain) through time

Sliding Page – Fixed Time (different URIs, same time): different perspectives on a point in time

Sliding Page – Sliding Time (different URIs, different times): broadest possible coverage of a collection

Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality

Page 27: Tools for Managing the Past Web

Tools for Storytelling

• Tools for Curators
– create stories from your collections
  • candidate mementos automatically selected
– use existing stories to augment your collections
• Tools for Users
– use existing tools like Storify to view the stories of a collection

Page 28: Tools for Managing the Past Web

Tools for Managing the Past Web

Start-Up and Implementation Grants
– WARCreate
– WAIL
– Mink
– Assessing Memento Damage

Web Archiving Incentive
– Thumbnail Summarization
– Detecting Off-Topic Mementos

WARCreate, WAIL, Mink: https://ws-dl.cs.odu.edu/Software

Michele C. Weigle
[email protected]
@weiglemc
http://www.cs.odu.edu/~mweigle/

Web Science and Digital Libraries (WS-DL) Group
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/