digital preservation - odu

Post on 11-May-2015

DESCRIPTION

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

TRANSCRIPT

Digital Preservation Research at Old Dominion University

Justin F. Brunelle

The MITRE Corporation

Old Dominion University

(And hopefully MITRE, soon)

Why are we listening?

• Overview of the problem

• BRIEF introduction to ODU WSDL group research

• Memento

• I’ll be skipping around, so don’t hesitate to interrupt me

Digital Preservation

• Using the past Web

– Focus of our research

• Temporal Browsing

– Sessions in the past

• Recovering Lost Pages

– Is it really gone?

• 404s

– How to fix broken links?

1. Same URI maps to the same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)

2. Same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)

3. Different URI maps to the same or very similar content at the same or at a later time (time A: U1 → C1; time B: U1 → 404, U2 → C1)

4. The content cannot be found at any URI (time A: U1 → C1; time B: U1 → ??)
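The four cases can be sketched as a small classifier. This is only an illustration: the `web_now` snapshot, the similarity measure (`difflib.SequenceMatcher`) and the 0.9 threshold are assumptions made for the example, not part of the slides, and the cases are checked in slide order, so a URI that both changed and moved is reported as case 2.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for "very similar"; not from the slides


def similar(a, b):
    """True when two representations are the same or very similar."""
    return SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD


def classify(uri, old_content, web_now):
    """Return the case number (1-4) for content seen at `uri` at time A.

    `web_now` is a hypothetical snapshot of the Web at time B, mapping
    URI -> content. A URI absent from the snapshot stands for a 404.
    """
    current = web_now.get(uri)
    if current is not None:
        # Case 1: same URI, same or very similar content.
        # Case 2: same URI, different content.
        return 1 if similar(old_content, current) else 2
    # The URI is gone: look for the old content under a different URI.
    for other_uri, content in web_now.items():
        if other_uri != uri and similar(old_content, content):
            return 3
    return 4  # the content cannot be found at any URI
```

For example, `classify("http://example.org/old", "hello world", {"http://example.org/a": "hello world"})` falls into case 3, because the old URI is gone but the same content lives at another URI.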

Change on the Web

Time to Talk About Saving Everything?

Dinner for one or two costs more than a 1 TB disk

Wikis have popularized versioning

Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:

http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate

http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg

http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg

Also related projects with cool URI / permalink focus: http://www.citability.org/ http://data.gov/ http://data.gov.uk/

Fortress Model

• Get a lot of money

• Buy lots of storage

• Hire lots of people

• “Look upon my archive ye Mighty, and despair!”

Alternate Methods

• Lazy Preservation (McCown)

– “How much preservation do I get if I do absolutely nothing?”

• Just-In-Time Preservation (Klein)

– Wait for it to disappear, then find a “good ‘nuff” version

• Shared Infrastructure Preservation

– Push content to sites that might preserve it

– arXiv.org, IA, WebCite…

• Server Enhanced Preservation

– Create archival-ready resources

And Soon…

• Social Preservation

– Preserving resources using 3rd party Web Services

– Repository for OAI-ORE ReMs

– Social network feel

– Lazy-esque, server-side reconstruction

But I digress…

• Few years away…

• Preliminary research

• And now back to the prior research…

Web Infrastructure (McCown, 2007)

WayBack Machine

http://web.archive.org/web/*/http://www.thecribs.com/

http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/

From these we can create time-based:

• indexes

• IDF values

• PageRank
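One possible reading of "time-based IDF values" is an inverse document frequency computed only over snapshots captured up to a given date. The sketch below assumes the textbook formula log(N / df) and a toy corpus invented for illustration; it is not taken from the research described here.

```python
import math
from datetime import date


def idf_as_of(term, snapshots, cutoff):
    """IDF of `term` over snapshots captured on or before `cutoff`.

    `snapshots` is a list of (capture_date, text) pairs. Returns None when
    the term occurs in no snapshot inside that window.
    """
    docs = [set(text.lower().split())
            for captured, text in snapshots if captured <= cutoff]
    containing = sum(1 for words in docs if term.lower() in words)
    if containing == 0:
        return None
    return math.log(len(docs) / containing)


# Toy corpus, invented for illustration only.
SNAPSHOTS = [
    (date(2001, 9, 11), "breaking news web"),
    (date(2005, 1, 1), "web archive news"),
    (date(2009, 6, 1), "memento web archive"),
]
```

With this corpus, `idf_as_of("memento", SNAPSHOTS, date(2006, 1, 1))` returns `None`, since the term only appears in a later snapshot: the same term can have a different weight depending on the point in time being examined.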

Batch Recovery For Sites

http://warrick.cs.odu.edu/

Free limo rides for life?!

Reconstruction Diagram

added 20%

identical 50%

changed 33%

missing 17%
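The four categories in the diagram can be computed by comparing an original site with its reconstruction. In this sketch both sites are plain URI-to-content mappings and every percentage uses the number of original resources as its denominator; the slide does not state the denominators, so that choice is an assumption and the toy numbers below are not meant to reproduce the slide's figures.

```python
def reconstruction_stats(original, recovered):
    """Percentages of identical, changed, missing and added resources.

    Both arguments map a URI to its content. All percentages are relative
    to the number of original resources (an assumption, see above).
    """
    if not original:
        raise ValueError("original site is empty")
    counts = {"identical": 0, "changed": 0, "missing": 0}
    for uri, content in original.items():
        if uri not in recovered:
            counts["missing"] += 1
        elif recovered[uri] == content:
            counts["identical"] += 1
        else:
            counts["changed"] += 1
    # Resources in the reconstruction that the original never had.
    counts["added"] = sum(1 for uri in recovered if uri not in original)
    return {name: 100 * n / len(original) for name, n in counts.items()}
```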

Real-Time Recovery for URIs

Synchronicity - www.cs.odu.edu/~mklein/

Memento wants to make navigating the Web’s Past Easy

http://www.mementoweb.org

http://groups.google.com/group/memento-dev

What are you talking about?

• Uniform Resource Identifier (URI) ~= URL

• Resource:– <HTML>

• Representation

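W3C Web Architecture: Resource – URI – Representation

• A URI identifies a Resource

• A Representation represents the Resource

• Dereferencing the URI yields the Representation

W3C Web Architecture: Resource – URI – Representation (with content negotiation)

• A URI identifies a Resource

• Representation 1 and Representation 2 each represent the Resource

• Dereferencing the URI with content negotiation selects one of the Representations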

Resources

Resources have Representations

Resources have Representations that Change over Time

Only the Current Representation is Available from a Resource

Old Representations are Lost Forever

Finding Archived Resources

Go to http://www.archive.org/ and search for http://cnn.com

On http://web.archive.org/web/*/http://cnn.com, select desired datetime

Archived Resources

• http://web.archive.org/web/20010911203610/http://www.cnn.com/ is an archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)

• http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
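The Wayback Machine URI above carries its capture datetime as a 14-digit stamp. A minimal sketch of pulling that stamp and the original URI back out follows; the function name and the regular expression are illustrative, and treating the stamp as UTC follows the datetime shown on the slide.

```python
import re
from datetime import datetime, timezone

# http://web.archive.org/web/<YYYYMMDDhhmmss>/<original URI>
WAYBACK_URI = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri_m):
    """Split a Wayback Machine URI into (capture datetime, original URI)."""
    match = WAYBACK_URI.match(uri_m)
    if match is None:
        raise ValueError("not a dated Wayback Machine URI: " + uri_m)
    stamp, uri_r = match.groups()
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, uri_r
```

Note that the Wikipedia `oldid` URI has no such stamp in it, which is one reason a uniform, protocol-level way of asking for a datetime is attractive.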

Navigating Archived Resources

• http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)

• Its "Pentagon" link leads to http://en.wikipedia.org/wiki/The_Pentagon, the current version

Current and Past Web are Not Integrated

• Current and Past Web based on same technology.

• But, going from Current to Past Web is a matter of (manual) discovery.

• Memento wants to make going from Current to Past Web an (HTTP) protocol matter.

• Memento wants to integrate Current and Past Web.

One Memento HTTP Navigation

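Memento HTTP Flow

R = original resource (URI-R), G = TimeGate (URI-G), M = Memento (URI-M), B = TimeBundle

1. HEAD R, Accept-Datetime → Link G

2. GET G, Accept-Datetime → 302 M, Vary, TCN, Link R, B, M

3. GET M, Accept-Datetime → 200, Content-Datetime, Link R, B, M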
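The three request/response pairs of the flow can be sketched as one client function. To keep the example runnable offline, the HTTP transport is passed in as a hypothetical `send(method, uri, headers)` callable and is backed by canned responses copied from the slides. The function name, the exact-case header lookup and the error handling are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Tuple

Headers = Dict[str, str]
# Hypothetical transport hook: send(method, uri, request_headers) -> (status, headers)
Send = Callable[[str, str, Headers], Tuple[int, Headers]]

TIMEGATE_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="timegate"')


def find_memento(uri_r: str, accept_datetime: str, send: Send) -> str:
    """Walk R -> G -> M and return the URI of the selected Memento."""
    wanted = {"Accept-Datetime": accept_datetime}

    # Step 1: HEAD R. The original resource points at its TimeGate via Link.
    status, headers = send("HEAD", uri_r, wanted)
    link = TIMEGATE_LINK.search(headers.get("Link", ""))
    if status != 200 or link is None:
        raise LookupError("no TimeGate advertised for " + uri_r)
    uri_g = link.group(1)

    # Step 2: GET G. The TimeGate negotiates on datetime and redirects to M.
    status, headers = send("GET", uri_g, wanted)
    if status != 302 or "Location" not in headers:
        raise LookupError("TimeGate did not redirect to a Memento")
    uri_m = headers["Location"]

    # Step 3: GET M. The Memento reports its own datetime in Content-Datetime.
    status, headers = send("GET", uri_m, wanted)
    if status != 200 or "Content-Datetime" not in headers:
        raise LookupError("response is not a Memento")
    return uri_m


# Canned responses (headers abridged) so the sketch runs without a network.
CANNED = {
    ("HEAD", "http://cnn.com/"): (200, {
        "Link": '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'}),
    ("GET", "http://web.archive.org/web/timegate/http://cnn.com"): (302, {
        "Location": "http://web.archive.org/web/20010911203610/http://www.cnn.com"}),
    ("GET", "http://web.archive.org/web/20010911203610/http://www.cnn.com"): (200, {
        "Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}),
}


def canned_send(method: str, uri: str, request_headers: Headers) -> Tuple[int, Headers]:
    """Stand-in transport that replays the canned responses above."""
    return CANNED[(method, uri)]
```

Calling `find_memento("http://cnn.com/", "Tue, 11 Sep 2001 20:35:00 GMT", canned_send)` walks all three steps. Swapping `canned_send` for a real HTTP client would be the only change needed to try it against a live server that speaks this protocol.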

Scenario

• cnn.com includes Link to TimeGate at Internet Archive

• URI-R on one server, URI-G & URI-M on another

Memento HTTP Flow: URI-R

HEAD R, Accept-Datetime

HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
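The `Accept-Datetime` value in the request above uses the usual HTTP date format. A small sketch of producing it with Python's standard library follows; the helper name is made up for the example, and the input must be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_value(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # format_datetime(..., usegmt=True) requires the datetime to be in UTC.
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


# The datetime requested on the slide.
wanted = datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc)
header_line = "Accept-Datetime: " + accept_datetime_value(wanted)
```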

Memento HTTP Flow: Success – URI-R

Link G

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1

Memento HTTP Flow: URI-G

GET G, Accept-Datetime

GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close

Memento HTTP Flow: Success – URI-G

302 M, Vary, Link R, B, M

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
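On the server side, the TimeGate has to pick one capture for the requested datetime. The slides do not say how that choice is made; the sketch below assumes a "closest capture wins" policy, which happens to agree with the redirect shown above. The capture list is taken from the datetimes in the response, and the function names are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Capture datetimes that appear in the response above (all GMT).
CAPTURES = [
    datetime(2000, 9, 15, 11, 28, 26, tzinfo=timezone.utc),  # first-memento
    datetime(2001, 9, 11, 20, 30, 51, tzinfo=timezone.utc),  # prev-memento
    datetime(2001, 9, 11, 20, 36, 10, tzinfo=timezone.utc),  # redirect target
    datetime(2001, 9, 11, 20, 47, 33, tzinfo=timezone.utc),  # next-memento
    datetime(2008, 7, 8, 9, 34, 33, tzinfo=timezone.utc),    # last-memento
]


def select_capture(accept_datetime, captures):
    """Return the capture closest in time to the Accept-Datetime value.

    "Closest" is an assumed policy, not one stated in the slides.
    """
    wanted = parsedate_to_datetime(accept_datetime)
    return min(captures, key=lambda captured: abs(captured - wanted))


def wayback_stamp(captured):
    """14-digit stamp as used in the archive URIs above."""
    return captured.strftime("%Y%m%d%H%M%S")
```

For the request datetime of 20:35:00 this picks the 20:36:10 capture (70 seconds away) over the 20:30:51 one (249 seconds away).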

Memento HTTP Flow: URI-M

GET M, Accept-Datetime

GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close

Memento HTTP Flow: Success – URI-M

200, Content-Datetime, Link R, B, M

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
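A client can compare the `Content-Datetime` it received with the `Accept-Datetime` it asked for to see how far the returned Memento is from the requested moment. The sketch below takes the header names from the slides; the function name and the plain-dict headers are assumptions.

```python
from email.utils import parsedate_to_datetime


def memento_offset_seconds(request_headers, response_headers):
    """Seconds between the Memento's datetime and the requested datetime.

    Positive means the Memento was captured after the requested moment.
    """
    wanted = parsedate_to_datetime(request_headers["Accept-Datetime"])
    actual = parsedate_to_datetime(response_headers["Content-Datetime"])
    return (actual - wanted).total_seconds()


# Values from the request and response shown above.
offset = memento_offset_seconds(
    {"Accept-Datetime": "Tue, 11 Sep 2001 20:35:00 GMT"},
    {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"},
)
```

Here the Memento is 70 seconds later than the requested datetime, which a user agent could surface to the reader as "archived about a minute after the time you asked for".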

What does it all mean?

• Cutting edge technology

• Existing Infrastructure

• Redefining Web surfing

• MAJOR “real world” implications

Closing Thoughts

• Preservation not for privileged priesthood

– http://doi.acm.org/10.1145/1592761.1592794

– http://booktwo.org/notebook/wikipedia-historiography/

• No more hoary stories about format obsolescence

– http://blog.dshr.org/2010/09/reinforcing-my-point.html

• Don't desiccate resources; leave them on the web

• Endless metadata is not preservation…

• Archiving as branded service, not infrastructure

– http://blog.dshr.org/2010/06/jcdl-2010-keynote.html

Acknowledgements

• Slides borrowed from:

• Dr. Michael L. Nelson:

– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative

– http://www.slideshare.net/phonedude/review-of-web-archiving

– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web

• Martin Klein:

– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages
