
1

Archiving and Preserving the Web

Dan Avery, Kristine Hanna, Merrilee Proffitt

Internet Archive, RLG

April 2006

2

Agenda

• RLG
• Internet Archive
• Archive-It
• Challenges
• The Future
• Q&A

3

The importance of archiving the web

• The web contains much of what will be the basis of scholarship in the future
– record of events
– official publications
– personal viewpoints
– ephemeral material

4

RLG’s interest

• RLG’s mission includes working with its member organizations to enhance their ability to provide research resources

• RLG members have long been participating in web archiving, but so far, this has been an activity restricted to large organizations

5

Members active in web archiving

• Bibliothèque Nationale de France
• British National Library
• California Digital Library
• Library of Congress
• National Library of Australia
• National Library of New Zealand

6

Archive-It pilot partners

• Indiana University
• International Institute of Social History
• University of Toronto
• Swarthmore/Haverford College

7

About Internet Archive

• Founded in 1996
• Largest public web archive
• 60 billion pages, 55 million sites
• Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day

8

What do we collect? Web Archive

• Take a broad snapshot of the web every 2 months
• 2 billion pages a month
• Websites from every domain (.org, .com, .edu, etc.)
• Content in 21 languages

9

Policy

• We follow Oakland Archive Policy, 2002

• Founded by commercial and non-commercial organizations
• Opt-out policy
• We collect it all, and make it inaccessible if requested by site owner

• Site owner directly blocks harvester on website
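The usual way for a site owner to block a harvester directly is a robots.txt file at the site root. The following is a minimal sketch of how such a rule is evaluated, using Python's standard `urllib.robotparser`. The user-agent token `ia_archiver`, the example.org URLs, and the helper name `may_harvest` are assumptions for illustration and are not taken from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out.
# The "ia_archiver" token is an assumption made for this sketch.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_harvest(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The blocked harvester may not fetch anything on the site.
    print(may_harvest("ia_archiver", "http://example.org/index.html"))  # False
    # Any other crawler is only kept out of /private/.
    print(may_harvest("other-bot", "http://example.org/index.html"))    # True
```

A crawler that honors the file would run a check of this kind before each request and skip any URL for which the result is False.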

10

Access to Web Archive

• Entire archive accessible for free to the public via the website at www.archive.org

• Receive 100 hits/second
• 60k unique users per day
• Evolving/Fluid: through public use we hope to find out what is important and to continuously improve

11

Why try to collect and preserve it all?

• Web has no boundaries, no limits
• What will be important?
• What is there today may be gone tomorrow
– “Capture now, ask why later”
– “Grab it while you can, work it out later”
– “Lose as little as possible”

12

How do we collect it?

Open Source Technology primarily developed by Internet Archive and IIPC

• Heritrix: web crawler
• Wayback Machine: access tool for rendering and viewing files
• NutchWAX: search engine
• ARC file: archival record format (ISO work item)
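To illustrate how the Wayback Machine listed on this slide addresses an archived page, here is a small sketch that builds a replay URL from a capture time and the original URL. It assumes the public service's commonly seen layout of a `web.archive.org/web/` prefix followed by a 14-digit timestamp (YYYYMMDDhhmmss). The helper name `wayback_url` and the sample capture time are hypothetical.

```python
from datetime import datetime

# Assumed replay prefix of the public Wayback Machine service.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, captured: datetime) -> str:
    """Build a replay URL: prefix / 14-digit timestamp / original URL."""
    timestamp = captured.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical capture time, chosen only for the example.
    print(wayback_url("http://www.rlg.org/", datetime(2006, 4, 1, 12, 0, 0)))
```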

13

Wayback Machine

14

Preservation

• Store multiple copies of each archive
• 1300 machines/servers
• Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)

• Standard storage boxes, open source design

15

Next Steps

Institutions:
• need to create collections around primary source web material
• want to do more than broad crawling with specific and complete web archives
• want a technology partner that could harvest, index, access, store and preserve their collections for them

16

1. Partner Contract Crawls

• In 2002, began to form partnerships with Library of Congress, NARA and other national libraries, including Australia and France.
  – Library of Congress collections:
    • Iraq War: 450,000,000 documents and growing
    • U.S. National Elections
      – 2000: 131,331,973 documents
      – 2004: 87,481,265 documents
    • Supreme Court Nomination 2005: 100 million documents

17

2. Archive-It

• Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions to expand our archiving services and develop an application that acknowledges resource constraints
• Developed Archive-It, a web-based service that allows partners to create, manage, search and store their web archives through an easy-to-use web interface
• Does not require technical expertise or infrastructure
• Pilot launched in September 2005
• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July

18

Pilot Partners

• Center for Research Libraries
• Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH)
• University of Texas
• Library of Virginia
• State Archives South Dakota
• State Archives North Carolina
• State Archives Alabama
• Minnesota Historical Society
• Institut d'Etude Politique de Grenoble

19

Archive-It

• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July

20

Archive-It Collections

• Some samples:
– Virginia’s political landscape, 2005 (Gov. Mark Warner)
– Hurricane Katrina
– Jamestown 2007 Commemoration

21

Archive-It Access

• All collections are accessible for free to the general public, with text search, at:
– www.archive-it.org
– Partners’ websites with links

• Plus, member web application with login

22

Demo

23

Dan’s slides

Tech

24

Challenges we face

• Making the collections useful for a variety of end users (e.g. general public, researchers)

• Making sure we capture the best and most relevant content

• Continuing to develop our tools for access and harvesting (crawler.archive.org)

25

Internet Archive’s priorities

• Collaboration and Partnerships
– Continue to act as a technology partner in providing web archiving services to government and memory institutions
– Continue to develop Open Source software
– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)
– Open Content Alliance (OCA) digital books project
• Multiple copies across the world
– Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria

26

RLG’s web archiving program

• Collaborative collection development
• Descriptive metadata for web archives
• Usability/user studies
• Intellectual property concerns
• Web Archiving 101
• Web archiving services and software
