1 archiving and preserving the web kristine hanna internet archive july 2008

1

Archiving and Preserving the Web

Kristine Hanna

Internet Archive

July 2008

2

Open Source Technology primarily developed by Internet Archive and IIPC

• Heritrix: web harvester to capture the content

• Wayback Machine: access tool for rendering and viewing content. Displays archived web pages--surf the web as it was.

• NutchWAX: Search engine. Standard full-text search

Open Source Technology

3

Heritrix development

2.0 (2008)

• Duplicate reduction (saving storage)

• Prioritization of seeds, domains, URLs

• Adapting to WARC format

2.2 (September 2008) and 2.4 (2009)

• Adaptive and continuous revisit crawling at a large scale

– Ability to run one never-ending 'master crawl' on the same 'scope' without breaking up the crawl

• Improving checkpointing for stable long-running crawls

– Essentially a 'snapshot' of the entire state of the crawl, so if anything goes wrong, we can pick up from exactly that 'snapshot' point, with all internal queues/counters in exactly the same state.

• Better crawling of web video content

• Improving the usability and documentation features
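The checkpointing idea above (a snapshot of queues and counters that a crawl can resume from) can be illustrated with a minimal sketch. This is not Heritrix code: the `CrawlState` class, its fields, and the JSON file layout are all hypothetical, chosen only to show the save-and-resume pattern.

```python
import json
import os
import tempfile
from collections import deque


class CrawlState:
    """Toy crawl state: a URL queue plus counters (hypothetical, not Heritrix)."""

    def __init__(self, seeds):
        self.frontier = deque(seeds)  # URLs waiting to be fetched
        self.seen = set(seeds)        # URLs that have ever been queued
        self.fetched = 0              # counter of completed fetches

    def step(self, discovered=()):
        """Pretend to fetch the next URL and enqueue newly discovered links."""
        url = self.frontier.popleft()
        self.fetched += 1
        for link in discovered:
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)
        return url

    def checkpoint(self, path):
        """Write the whole state to disk via a temp file and an atomic rename."""
        snapshot = {
            "frontier": list(self.frontier),
            "seen": sorted(self.seen),
            "fetched": self.fetched,
        }
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(snapshot, handle)
        os.replace(tmp, path)  # a crash never leaves a half-written checkpoint

    @classmethod
    def restore(cls, path):
        """Rebuild a state object with queue and counters exactly as saved."""
        with open(path) as handle:
            snapshot = json.load(handle)
        state = cls([])
        state.frontier = deque(snapshot["frontier"])
        state.seen = set(snapshot["seen"])
        state.fetched = snapshot["fetched"]
        return state
```

A real crawler would also need to quiesce worker threads before snapshotting, which this sketch leaves out.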

4

NutchWAX Development

0.12 (September)

• De-duplication of archive content during indexing.

• Adds support for WARC files

• Addresses high priority bugs

• Built on most recent versions of Nutch/Hadoop

• Distributed computing system scales to 100 million documents.

• OpenSearch interface to integrate with numerous third-party systems
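De-duplication during indexing can be pictured as skipping any payload whose content digest has already been seen. The sketch below assumes records arrive as `(url, payload_bytes)` pairs and uses SHA-1 as the digest; both choices are illustrative assumptions and do not describe the actual NutchWAX implementation.

```python
import hashlib


def index_unique(records):
    """Return (url, digest) pairs to index, skipping payloads already seen.

    `records` is an iterable of (url, payload_bytes) pairs; this shape is
    an assumption for illustration, not the NutchWAX input format.
    """
    seen_digests = set()
    kept = []
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            continue  # identical content was already indexed once
        seen_digests.add(digest)
        kept.append((url, digest))
    return kept
```

Keying on the content digest rather than the URL means repeated captures of an unchanged page are indexed only once.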

1.0 (December)

• Improve and simplify installation, indexing and service deployment of Nutch

• Provide NutchWAX documentation

5

Wayback Development

1.4 (July)

• Configurable/customizable error messages

• Per-website support for exclusions framework, including date ranges

• Anchoring date during replay to prevent "drift" through a replay session

• Anchoring window, to limit embedded content to a defined time range within a replay session

• Index format change to "identity format"

• Proxy mode embedding of timelines, banners, etc.

1.6 (December)

• Performance optimizations and better documentation

• Ability to play back HTTPS

• Improved packaging, installation and documentation

• Formal support for Windows platform

• Improved video replay

• Thumbnails and/or document titles in the UI

• In-page difference between two captures (visual comparison as you move through time)
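The anchoring date and anchoring window listed for 1.4 can be sketched as a capture-selection rule: every resource in a replay session resolves to the capture nearest one fixed date, optionally only within a time range. The function name, arguments, and data shape below are hypothetical and are not taken from the Wayback code base.

```python
from datetime import datetime, timedelta


def pick_capture(captures, anchor, window=None):
    """Choose the capture closest in time to `anchor`.

    captures: list of datetime objects for one URL (hypothetical index data).
    anchor:   the date fixed at the start of the replay session.
    window:   optional timedelta; captures further than this from the
              anchor are ignored, and None is returned if none qualify.
    """
    candidates = captures
    if window is not None:
        candidates = [c for c in captures if abs(c - anchor) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c - anchor))
```

Because the anchor stays fixed for the session, following links does not gradually pull the viewer toward whatever date each intermediate capture happened to have.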

6

IA Projects

• Using Open Source tools

• Collaborating with Partners

7

National Libraries

Ongoing thematic crawls, event based harvests, and domain snapshots

• Iceland

• Czech Republic

• Germany

• France

• UK

• Ireland

• Norway

• Australia

• Denmark

• US

• Sweden

8

Topic/Event crawls

Library of Congress

• National elections – 2000, 2002, 2004, 2006, 2008

• Supreme Court Nomination

• War in Iraq

• Crisis in Darfur

• Egyptian Elections

• Olympics

• .gov

• Papal Election

9

Community Web archiving

• Hurricane Katrina collection

– Contributors: The Internet Archive, the Library of Congress, CDL, a group of universities, and many individual contributors

– Spans content generated between September 4 and November 8, 2005

– 1,700 web sites / 61 million pages, all text searchable

Public access at http://websearch.archive.org/katrina/

• Tsunami Collection

– Contributors: The Internet Archive, Singapore Internet Research Centre, Web Archivist

– 1,500 sites / 4 million pages, all text searchable

Public access at http://tsunami.archive.org/

10

Virginia Tech University

Web archiving as a result of crisis and tragedy

• Tragedy at Virginia Tech

3 million documents, all text searchable, accessible to the public at http://www.dl-vt-416.org/

• Northern Illinois University

11

World Wide Web of Humanities

• Collaboration between IA, Hanzo Web and Oxford Internet Institute. Funded by NEH and JISC

• Objective is to support new methodologies for digital humanities research built around large collections of web and digitized data, using automated tools to extract, index, and analyze the data

• Chose a well-rounded set of humanities materials that will allow us to test the tools against a variety of types of documents and resource types

• Will build focused research collections around the topics of World Wars I and II

12

K-12

• Collaboration with LOC and CDL

• Chose 3 high schools from around the country (California, Illinois and Louisiana)

• http://www.archive-it.org/k12

13

Around the World in Two Billion Pages

• Mellon Award - unique global snapshot of the Web

– Crawled from June 2007 to December 2007

– Over 60 countries participated

– Started with 18,000 seeds (websites)

– Completed with 2 billion pages

http://wa.archive.org/aroundtheworld/

14

Archive-It

(state archives, state university and public libraries, university libraries, and non-governmental nonprofits)

– Web-based application that allows users to harvest, manage, and preserve collections of born-digital content.

– Own institution’s websites, topics/subjects/events and/or government records

– Functions include: setting crawl frequencies, defining scope, cataloging with metadata, managing and analyzing collections, and full-text search

– Includes hosting and storage

15

Video

• 2007:

– IA engineers crawled over one million YouTube videos. Broad crawls off of home page links (most popular, most viewed)

– Started crawling embedded videos for LOC Election ‘08 collection

• 2008: NDIIPP project with UNC, 8 weekly crawls

– Broad crawls: 2 weekly crawls from YouTube home page, prioritized based on popularity

– Focused/topical crawls: 3 weekly crawls with specific IDs or search queries provided by UNC

– Broad and/or focused: last 3 crawls (TBA)

16

Video Harvests

• Difficult to interact with YouTube and other proprietary Flash video players

• Configuration is a moving target, since these video hosting sites may change their software at any time.

• Highly customized scoping rules need to be added to capture all the URLs relevant to embedded Flash videos

• Replay (through the Wayback Machine) is complicated by some of the same issues we face with Flash in general

17

What’s Next for Internet Archive and Web Archiving

• Collaboration and Partnerships

– Continue to act as a technology partner in providing web archiving services

– Continue to develop Open Source software

– Develop common tools, storage formats and standards through the IIPC, and with our partners

• Multiple copies around the world

– Within IA’s own repository, and with partners such as LC, BnF, Library of Alexandria
