1
Archiving and Preserving the Web
Dan Avery, Kristine Hanna (Internet Archive)
Merrilee Proffitt (RLG)
April 2006
2
Agenda
• RLG
• Internet Archive
• Archive-It
• Challenges
• The Future
• Q&A
3
The importance of archiving the web
• The web contains much of what will be the basis of scholarship in the future
– record of events
– official publications
– personal viewpoints
– ephemeral material
4
RLG’s interest
• RLG’s mission includes working with its member organizations to enhance their ability to provide research resources
• RLG members have long been participating in web archiving, but so far, this has been an activity restricted to large organizations
5
Members active in web archiving
• Bibliothèque Nationale de France
• British Library
• California Digital Library
• Library of Congress
• National Library of Australia
• National Library of New Zealand
6
Archive-It pilot partners
• Indiana University
• International Institute of Social History
• University of Toronto
• Swarthmore/Haverford College
7
About Internet Archive
• Founded in 1996
• Largest public web archive
• 60 billion pages, 55 million sites
• Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day
8
What do we collect? Web Archive
• Take a broad snapshot of the web every 2 months
• 2 billion pages a month
• Websites from every domain (.org, .com, .edu, etc.)
• Content in 21 languages
9
Policy
• We follow the Oakland Archive Policy (2002)
• Founded by commercial and noncommercial organizations
• Opt-out policy
• We collect it all, and make it inaccessible if requested by site owner
• Site owner directly blocks harvester on website
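
A site owner blocking a harvester is commonly done with a robots.txt file. The sketch below shows how a crawler might check such a file before fetching a page. The user-agent token `ia_archiver`, the sample robots.txt, and the example.org URLs are assumptions for illustration and are not taken from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out.
# The token "ia_archiver" is an assumed harvester name, not stated in the slides.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.org/index.html"
    print("ia_archiver allowed:", may_harvest(ROBOTS_TXT, "ia_archiver", page))
    print("othercrawler allowed:", may_harvest(ROBOTS_TXT, "othercrawler", page))
```

Here the owner excludes one named harvester from the whole site while leaving other crawlers restricted only under /private/.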
10
Access to Web Archive
• Entire archive accessible for free to the public via the website at www.archive.org
• Receives 100 hits/second
• 60k unique users per day
• Evolving/fluid: through public use we hope to find out what is important and to continuously improve
11
Why try to collect and preserve it all?
• Web has no boundaries, no limits
• What will be important?
• What is there today may be gone tomorrow
– “Capture now, ask why later”
– “Grab it while you can, work it out later”
– “Lose as little as possible”
12
How do we collect it?
Open source technology, primarily developed by Internet Archive and IIPC
• Heritrix: web crawler
• Wayback Machine: access tool for rendering and viewing files
• NutchWAX: search engine
• ARC file: archival record format (ISO work item)
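
To make the archival record format concrete, here is a minimal sketch of reading one record header from an ARC file. It assumes the five-field, space-separated version-1 header layout (URL, IP address, 14-digit archive date, content type, length in bytes). The sample line is invented, and `ArcHeader` and `parse_arc_header` are hypothetical names, not part of any Internet Archive tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one record header line, assuming the 5-field v1 layout."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    url, ip_address, date, content_type, length = fields
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


if __name__ == "__main__":
    # Invented sample header line for illustration only.
    sample = "http://www.example.org/ 192.0.2.1 20060401120000 text/html 1234"
    print(parse_arc_header(sample))
```

A reader would use the length field to know how many bytes of captured content follow the header before the next record begins.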
13
Wayback Machine
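
As a rough illustration of how an archived page is addressed, the sketch below builds a Wayback Machine style URL from an original URL and a capture time. The `web.archive.org/web/<timestamp>/<url>` pattern and the 14-digit timestamp are assumptions based on the public service, not something stated in the slides, and `wayback_url` is a hypothetical helper.

```python
from datetime import datetime

# Assumed public access prefix; not taken from the slides.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build an access URL for the capture of original_url nearest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.rlg.org/", datetime(2006, 4, 1)))
```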
14
Preservation
• Store multiple copies of each archive
• 1300 machines/servers
• Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)
• Standard storage boxes, open source design
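
Keeping multiple copies is most useful when the copies are compared from time to time. The sketch below shows one way such a fixity check might work: hash every copy and report any site whose digest differs from the majority. The site names, the choice of SHA-256, and the `divergent_copies` helper are assumptions for illustration, and the slides do not describe how the Internet Archive actually verifies its copies.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def divergent_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the sites whose copy differs from the most common digest."""
    if not copies:
        return []
    digests = {site: sha256_hex(data) for site, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(site for site, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    # Hypothetical copies of the same archive file held at three sites.
    copies = {
        "us": b"archived page bytes",
        "alexandria": b"archived page bytes",
        "amsterdam": b"archived page bytez",  # simulated corruption
    }
    print("Copies needing repair:", divergent_copies(copies))
```

With three or more copies, a damaged copy can be identified and replaced from one of the copies that still agree.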
15
Next Steps
Institutions:
• need to create collections around primary source web material
• want to do more than broad crawling, with specific and complete web archives
• want a technology partner that could harvest, index, access, store and preserve their collections for them
16
1. Partner Contract Crawls
• In 2002, began to form partnerships with the Library of Congress, NARA and other national libraries, including Australia and France.
– Library of Congress collections:
  • Iraq War: 450,000,000 documents and growing
  • U.S. National Elections
    – 2000: 131,331,973 documents
    – 2004: 87,481,265 documents
  • Supreme Court Nomination 2005: 100 million documents
17
2. Archive-It
• Last year, in early 2005, we received requests from state archivists, university librarians and other memory institutions to expand our archiving services and develop an application that acknowledges resource constraints
• Developed Archive-It, a web-based service that allows partners to create, manage, search and store their web archives through an easy-to-use web interface
• Does not require technical expertise or infrastructure
• Pilot launched in September 2005
• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July
18
Pilot Partners
• Center for Research Libraries
• Research Libraries Group (U of Toronto, Indiana University, Haverford and Swarthmore Colleges, IISH)
• University of Texas
• Library of Virginia
• State Archives South Dakota
• State Archives North Carolina
• State Archives Alabama
• Minnesota Historical Society
• Institut d'Études Politiques de Grenoble
19
Archive-It
• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July
20
Archive-It Collections
• Some samples:
– Virginia’s political landscape, 2005 (Gov. Mark Warner)
– Hurricane Katrina
– Jamestown 2007 Commemoration
21
Archive-It Access
• All collections are accessible for free to the general public, with text search, at:
– www.archive-it.org
– Partners’ websites, with links
• Plus, member web application with login
22
Demo
23
Dan’s slides
Tech
24
Challenges we face
• Making the collections useful for a variety of end users (e.g. general public, researchers)
• Making sure we capture the best and most relevant content
• Continuing to develop our tools for access and harvesting (crawler.archive.org)
25
Internet Archive’s priorities
• Collaboration and Partnerships
– Continue to act as a technology partner in providing web archiving services to government and memory institutions
– Continue to develop Open Source software
– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)
– Open Content Alliance (OCA) digital books project
• Multiple copies across the world
– Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria
26
RLG’s web archiving program
• Collaborative collection development.
• Descriptive metadata for web archives.
• Usability/user studies
• Intellectual property concerns
• Web Archiving 101
• Web archiving services and software