technology module: web archiving · preservation standard. web archiving software/services examples...

@digitalPOWRR

This POWRR Institute is generously funded by the

Technology Module: Web Archiving

http://digitalpowrr.niu.edu/


Expected Outcomes

Become familiar with the practice of web archiving

Become familiar with common terminology

Learn about common tools/services currently available to perform this work

Consider an instructor-provided scenario for using Webrecorder in an archival setting

Use Webrecorder to capture several websites

Web Archiving Overview

Process of collecting portions of the World Wide Web to ensure information is preserved in an archive for future researchers.

Typically employ “web crawlers” for scheduled, automated capture.

Internet Archive began crawling in 1996.

Wayback Machine launched 2001 - making crawls publicly available.

Bulk archiving requires special software for capture and use.

Various kinds of web content can be captured, depending on particular needs.

The Web ARChive (WARC) format is now an ISO standard (ISO 28500), is used by the Library of Congress (LOC), and is the de facto preservation standard.
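To make the WARC format a little less abstract, here is a minimal sketch of what a single WARC record looks like when built by hand. It assumes a simple "resource" record with only a handful of headers; the function name `build_warc_record` is hypothetical, and real capture tools write richer records (including full HTTP requests and responses).

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/plain") -> bytes:
    """Build one WARC 'resource' record (illustrative sketch only)."""
    headers = [
        "WARC/1.0",                                    # version line opens every record
        "WARC-Type: resource",                         # kind of record
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique ID for this record
        "WARC-Date: "
        + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        f"WARC-Target-URI: {target_uri}",              # the URL that was captured
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # size of the block that follows
    ]
    # Headers end with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


record = build_warc_record(
    "http://www.naropa.edu/about-naropa/events/40.php", b"hello"
)
print(record.decode("utf-8"))
```

A WARC file is essentially many such records written one after another, which is why a single file can hold multiple pages or sites.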

Web Archiving Software/Services Examples

Heritrix

HTTrack

NutchWAX

WAIL

WARCreate

wget

Archive-It (Internet Archive’s paid service)

Preservica (web archiving component built in)

ArchiveFacebook (Firefox extension)

DocNow (suite of Twitter-specific archiving tools)

Webrecorder

Web Curator Tool
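As one illustration of the command-line tools in this list, GNU wget can write what it downloads into a WARC file. The sketch below only assembles and prints such a command; the file names `naropa-urls.txt` and `naropa-40th` and the helper `wget_warc_command` are hypothetical, and the options shown should be checked against your installed version of wget.

```python
import shlex


def wget_warc_command(url_list_file: str, warc_name: str) -> list[str]:
    """Assemble a wget command that saves its downloads to a WARC file."""
    return [
        "wget",
        f"--input-file={url_list_file}",  # read seed URLs from a file, one per line
        "--page-requisites",              # also fetch images, CSS, and scripts each page needs
        f"--warc-file={warc_name}",       # write request/response data to <warc_name>.warc.gz
        "--wait=1",                       # pause one second between requests
    ]


command = wget_warc_command("naropa-urls.txt", "naropa-40th")
print(shlex.join(command))
```

Because wget does not run JavaScript, a capture made this way is likely to miss dynamic content.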

More on Software/Tools

Web archiving software can cover various aspects:

creation of the archived content

scheduling crawls

indexing/searching of the content

viewing the content

making that content available to the public

Some software only performs one function. IIPC divides tools into the following categories: Acquisition, Replay, Search & Discovery, Analysis, Utilities.

Most institutions that do web archiving at a large scale subscribe to Archive-It or use a combination of open source tools to build their own service.

“Good Enough” Web Archiving

Before embarking on a web archiving endeavor, it’s important to consider the following questions:

What volume of web content do I need to archive?

At what level do I need to archive content? Entire websites? A page here or there?

Do I need to replicate/capture the appearance and behavior of the site? OR just scrape content from it?

How often do I need to capture the content on the site/pages? Do I really need automation or can I do it manually?

“Good Enough” Web Archiving

If you only need to save a webpage here or there, you have a couple of options.

Wayback Machine’s “Save Page Now” feature (stores copy in the Wayback Machine). You can save the resulting archived link for your own use in the future.

Available at https://archive.org/web/, or you can download the Save Page Now browser extension for Chrome.

Archive.is is a similar service, but allows you to download a zip file of the site you saved.
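For those who want to script the Wayback Machine option, the sketch below builds two URLs: one that asks Save Page Now to capture a page, and one that asks the Wayback Machine availability API whether a capture already exists. It only constructs the URLs and makes no network requests; the helper names are hypothetical, and the endpoints reflect the Internet Archive's publicly documented patterns, which may change.

```python
from urllib.parse import urlencode

SAVE_PREFIX = "https://web.archive.org/save/"
AVAILABILITY_API = "https://archive.org/wayback/available"


def save_page_now_url(page_url: str) -> str:
    """URL that, when visited, asks the Wayback Machine to capture page_url."""
    return SAVE_PREFIX + page_url


def availability_url(page_url: str) -> str:
    """URL that returns JSON describing the closest existing capture, if any."""
    return AVAILABILITY_API + "?" + urlencode({"url": page_url})


page = "http://www.naropa.edu/about-naropa/events/40.php"
print(save_page_now_url(page))
print(availability_url(page))
```

Opening the first URL in a browser has the same effect as pasting the page address into the Save Page Now box.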


“Good Enough” Web Archiving

You can save the page(s) as HTML or PDF/A.

If you save to PDF, the formatting may be compromised, and some aspects of dynamic content will not be captured.

If you save HTML, your browser will also save associated CSS and JavaScript files. These can be a pain to save/organize.

Apps like SiteSucker will download the entire contents of a website (including media) & replicate the directory structure for you.

This can be time consuming if you’re doing a lot of sites.

Using HTML versions of websites can be cumbersome.

Potential “Good Enough” Solution: Webrecorder

Developed by Rhizome, the born-digital art organization.

Free and easy to use

It doesn’t “crawl”; it records the dynamic web live, as you view it.

This means that anything you want to save in your archive needs to be opened/played; nothing opens or plays automatically. This includes videos.

They will host 5 GB of your recordings/WARCs for free, but you must create a free account. Or you can download the WARCs you create and keep them locally.

They also have a free web archive viewer app, Webrecorder Player, that plays your WARCs.

Why Webrecorder?

The Archivist does not anticipate needing to archive a tremendous amount of web content at this time.

The Archivist feels comfortable enough creating web archives “on the fly”: when special events come up, or on a manageable schedule (for example, archiving major portions of the university website every semester).

The Archivist anticipates needing to capture social media sites.

The Archivist does not want to install anything that requires sysadmin knowledge.

The Archivist does not have a budget to pay for the Archive-It service.

The Archivist likes the idea of creating a WARC, which will ensure she can later use it in third-party applications. She also likes that the WARC will contain multiple pages/sites relating to a particular event.

Web Archiving Case Study

The Naropa University Archivist is contacted by a staff member from the University’s internal Development office, looking for information on alumni donations made for the 40th Anniversary that was celebrated in 2014. The Archivist looks in the usual places to find mention of the event (news releases, etc.), but is unable to locate anything.

After some head scratching, she gets a sinking feeling upon realizing that the Communications office had stopped sending the Archives formal press releases, and instead published this information on their website directly, removing it after a period of 6 months.

The archivist realizes that the campus website has become a documentary black hole.

What can the archivist do to start to plug this gaping hole?

Naropa University 40th Anniversary URLs to Capture

1. http://www.naropa.edu/about-naropa/events/40.php

2. http://www.naropa.edu/media/press-releases/press-2014/naropa-university-day.php

3. https://www.facebook.com/NaropaUniversity/photos/a.126003793681.106597.54736648681/10152056106173682/?type=3&hc_ref=PAGES_TIMELINE

4. https://www.buddhistdoor.net/news/naropa-university-celebrates-40th-anniversary

5. http://www.dailycamera.com/news/boulder/ci_26522937/boulders-naropa-celebrates-40-years-contemplative-education

6. https://dc.shambhala.org/2014/11/30/radical-compassion-report-naropas-40th-anniversary/

7. https://www.poets.org/poetsorg/stanza/celebration-naropas-40th-anniversary

8. http://litseen.com/jack-kerouac-school-of-disembodied-poetics-40th-anniversary/

9. https://www.centerforthehumanities.org/programming/naropa-at-40

10. http://www.beatdom.com/naropa-turns-40/

11. https://twitter.com/search?q=naropa%2040th%20anniversary&src=typd

12. https://www.facebook.com/NaropaUniversity/


Before Using the Tool – Do Some Scoping!

Regardless of the tool you use to create web archives, it’s important to know exactly what you are capturing, and how much is there.

Scoping is the term most frequently used when talking about what we tell a crawler to capture and what not to capture.

With Webrecorder, YOU are essentially the crawler. Webrecorder will only record what you show it. Therefore, you want to know if there is embedded content you should be capturing, or if there are internal links you should be following.
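A quick scoping pass over a seed list can be sketched in a few lines. The example below tidies each URL (for instance, percent-encoding stray spaces in a search query), groups the URLs by website, and flags social media hosts that will probably need more hands-on clicking and scrolling. The `SOCIAL_HOSTS` set and the function names are assumptions made for illustration, and only a few of the case-study URLs are used.

```python
from collections import defaultdict
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: hosts we expect to need extra manual interaction during recording.
SOCIAL_HOSTS = {"facebook.com", "twitter.com"}


def normalize(url: str) -> str:
    """Trim whitespace, lowercase the host, and percent-encode the query."""
    parts = urlsplit(url.strip())
    query = quote(parts.query, safe="=&%")  # leave existing %xx escapes alone
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, query, parts.fragment)
    )


def host_of(url: str) -> str:
    """Return the host name without a leading 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def scope_report(urls):
    """Group seed URLs by host and mark which hosts are social media."""
    groups = defaultdict(list)
    for url in urls:
        clean = normalize(url)
        groups[host_of(clean)].append(clean)
    return {
        host: {"urls": found, "social_media": host in SOCIAL_HOSTS}
        for host, found in groups.items()
    }


seeds = [
    "http://www.naropa.edu/about-naropa/events/40.php",
    "http://www.naropa.edu/media/press-releases/press-2014/naropa-university-day.php",
    "https://www.facebook.com/NaropaUniversity/",
    "https://twitter.com/search?q=naropa 40th anniversary&src=typd",
]
report = scope_report(seeds)
for host, info in report.items():
    label = "social media" if info["social_media"] else "website"
    print(f"{host}: {len(info['urls'])} URL(s), {label}")
```

A report like this does not replace looking at each page, but it gives a rough sense of how many sites are in scope before recording begins.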

Using Webrecorder

1. Open Chrome or Firefox.

2. Go to https://webrecorder.io/

3. Choose a name for your recording.

4. Paste in the first URL from our list in the “URL to Record” box.

5. Press “Record”.

6. When you get to the page, scroll through it. Mouse over any animations you want to capture.

7. Press play on any videos, sound bites, or navigate through any photo galleries. Remember, whatever you click on gets recorded and added to your archive.

8. When you’re ready to add the next URL, click on “Temporary Collection” in the upper left corner. Click “NEW” under Recordings. Repeat until you’re done!


Viewing Your Web Archive

1. Click on “Temporary Collection”, and then click the “Download Collection” button in the upper left corner.

2. Your browser will download a WARC file of all the material you recorded.

3. Open Webrecorder Player.

4. Click on “Load Web Archives”.

5. Select your unzipped WARC file.

6. Have a look around and see what you recorded and what you didn’t.

7. If you were going to save the file locally, you can also rename your WARC to something more meaningful to you and add it to your own local storage.
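Besides opening the file in a viewer, it can be useful to list what a downloaded WARC contains. The sketch below reads the header block of each record using only the Python standard library and prints the captured URLs. It is a simplified reader written for illustration (the function names are hypothetical, and the demo uses a tiny in-memory sample rather than a real download); a dedicated library such as warcio would be more robust for real collections.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record body so the next pass starts at the next record.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def open_warc(path):
    """Open a .warc or .warc.gz file for reading (for use with a real file)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# Tiny in-memory sample with two records, standing in for a real download.
sample = (
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\nContent-Length: 5\r\n\r\n"
    b"hello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/a\r\nContent-Length: 2\r\n\r\n"
    b"hi\r\n\r\n"
)
records = list(iter_warc_headers(io.BytesIO(sample)))
for rec in records:
    print(rec.get("WARC-Type"), rec.get("WARC-Target-URI"))
```

With a real download, `open_warc("my-collection.warc.gz")` (a hypothetical file name) would take the place of the in-memory sample.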

Internet Archive

Intended for materials to be made available to everyone (public domain, CC license).

Geographically distributed copies.

No frills (and no charge!) service.

Can handle text, audio, video, and images.

Institutions of all sizes are taking advantage of this service.

Can set up an account and landing page for your institution. Example: https://archive.org/details/illinoisstateuniversity


QUESTIONS?

Technology Module: Web Archiving


