Web Archiving at the National Library of Australia
Russell Latham, Senior Web Archivist,
National Library of Australia
“The Web's ever-expanding size, the dynamic and ephemeral nature of its content, and how this is to be captured, stored and made accessible for the long-term are some of the key questions being addressed by electronic archiving programs.”
PADI, http://www.nla.gov.au/padi/topics/92.html
What is web archiving?
  A web archive is not the same as the live web
  Brings a different value to web content
  Creating artefacts from the web
    Preserved snapshots, slices, gobbets of time
  Challenge of timeliness
    At certain times some things are more interesting and valuable
  Focus on the future and long term access (preservation objective)
History: web archiving at the NLA
  April Fools Day 1996: ‘Electronic Unit’ established
  May 1998: public access to PANDORA titles
  July 1998: first PANDORA ‘partner’ began participation
    10th participant joined in 2003
  June 2001: PANDAS v.1 released
    Web archiving workflow system developed by NLA
  2002: Digital Archiving Branch
    Our own identity at last!
    Began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections
  August 2002: PANDAS v.2 released
  July 2003: joined IIPC
  2004: PANDORA added to UNESCO Australian Memory of the World Register
  July 2005: first .au domain harvest
    Subsequent harvests in 2006, 2007, 2008 & 2009
  December 2006: “Web Archiving and Digital Preservation Branch”
  July 2007: PANDAS v.3 released (at last!)
  2010: PANDORA search moved to Trove
  May 2010: Proposal for whole-of-govt ‘opt-out’ arrangements through SIGB
What we collect
  Selective approach
  Collaboration with PANDORA participating agencies
  Modest in size
  High quality, timely, high value collection, described and searchable
  Accessible to the public
Searching the collections
  Subjects
  Browse list
  Collections
  Agency based
  Trove – Archived Websites
  Trove – bibliographic
  Search engines
Collections
  National Events
    Iraq war, 2003
    Asia Tsunami, 2004
    Bali Bombing, 2002
  Political Events
    Elections
    CHOGM
    National Apology
  Topic Based
    Extreme sports
    Seven Network
  Natural events
    Floods
    Cyclones
    Bushfires
Agency based
Use the partners page
http://pandora.nla.gov.au/partners.html
1996 Federal Election
1998 Federal Election
2001 Federal Election
2004 Federal Election
2007 Federal Election
2010 Federal Election
Australian web domain harvests
  Annual domain harvests 2005-2009
  Working with the Internet Archive
  Covers .au top level domain and a bit more …
  No public access
  Quantity over quality; content not assessed or described; opportunistic rather than timely
Comparative statistics
PANDORA
  Files: 115 million
  Size: 5.03 TB
Domain Harvest
                2005         2006         2007         2008         2009
  Unique files  185 million  596 million  516 million  1 billion    765 million
  Hosts crawled 811,523      1,046,038    1,247,614    3,038,658    1,074,645
  Size (TB)     6.69         19.04        18.47        34.55        24.29
Domain Harvests
  Files: 3 billion
  Size: 103 TB
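The totals quoted for the domain harvests can be checked against the per-year figures. The following is a minimal sketch, not part of the original presentation. The variable names are hypothetical, and the rounded "1 billion" figure for 2008 is assumed to mean 1,000 million, so the file total is only approximate.

```python
# Per-year figures transcribed from the domain harvest statistics.
# Assumption: "1 billion" (2008) is a rounded value, treated here as 1,000 million.
harvests = {
    2005: {"files_millions": 185, "size_tb": 6.69},
    2006: {"files_millions": 596, "size_tb": 19.04},
    2007: {"files_millions": 516, "size_tb": 18.47},
    2008: {"files_millions": 1000, "size_tb": 34.55},
    2009: {"files_millions": 765, "size_tb": 24.29},
}

# PANDORA selective archive figures, for comparison.
pandora = {"files_millions": 115, "size_tb": 5.03}

# Sum the yearly figures to reproduce the combined domain harvest totals.
total_files_millions = sum(h["files_millions"] for h in harvests.values())
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

print(f"Domain harvests: {total_files_millions} million files, {total_size_tb} TB")
print(f"Size relative to PANDORA: {total_size_tb / pandora['size_tb']:.1f}x")
print(f"Files relative to PANDORA: {total_files_millions / pandora['files_millions']:.1f}x")
```

Summed this way, the yearly figures come to roughly 3 billion files and 103 TB, which agrees with the combined totals quoted above. Hosts crawled are not summed, because the same hosts are likely to recur from one annual harvest to the next.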