Post on 25-Dec-2015
http://www.webarchiv.cz
WebArchiv
Czech Web Archive
IIPC 2007, Paris
http://www.webarchiv.cz IIPC 2007
WebArchiv – overview
The Czech WebArchiv was originally funded by the Ministry of Culture and launched in 2000.
Since then the project has been implemented by the National Library in cooperation with the Moravian Library and the Institute of Computer Science of Masaryk University.
Both large-scale automated harvesting of the entire Czech national web and selective archiving are being carried out, including thematic, event-based collections (using Heritrix).
Due to copyright law, the complete archive is accessible only on-site, from within the library (using Wayback).
Archived resources which are covered by a written agreement with their publisher are accessible online using WERA.
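The two access rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the class, the `has_publisher_agreement` flag and the function name are hypothetical and are not taken from the WebArchiv software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchivedResource:
    """Hypothetical record for one archived resource."""
    url: str
    has_publisher_agreement: bool  # assumed flag: a written agreement exists


def access_interface(resource: ArchivedResource, on_site: bool) -> Optional[str]:
    """Return the interface that may serve the resource, or None if access is denied."""
    if on_site:
        # Inside the library every archived file is reachable (Wayback).
        return "wayback"
    if resource.has_publisher_agreement:
        # Online access is limited to resources covered by an agreement (WERA).
        return "wera"
    return None


agreed = ArchivedResource("http://www.example.cz/", has_publisher_agreement=True)
other = ArchivedResource("http://other.example.cz/", has_publisher_agreement=False)
print(access_interface(agreed, on_site=False))
print(access_interface(other, on_site=False))
print(access_interface(other, on_site=True))
```

Keeping the on-site check first mirrors the slide: on-site access covers the whole archive, so the agreement flag only matters for online requests.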
WebArchiv – Workflows
Prague:
  Resource selection
  Cataloguing for the National Bibliography (MARC21)
  Providing Dublin Core metadata for interested publishers
  Making archive access agreements with publishers
Brno:
  Running WebArchiv hardware
  Software localization, maintenance and development
  Pre-harvesting resource analysis
  Harvesting, indexing, access
Results so far:
  4 harvesting rounds of the .cz domain (2001, 2002, 2004, 2006)
  5 event-oriented harvests
  Harvests of sites under agreements, several times per year
  5.4 TB archive with 136 million files
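The last figure implies an average stored size per file, which can be checked with a short calculation. It assumes decimal terabytes (1 TB = 10^12 bytes) and says nothing about compression, since the slide does not specify either; the constant and function names are illustrative.

```python
ARCHIVE_BYTES = 5.4e12      # 5.4 TB, assuming decimal terabytes
FILE_COUNT = 136_000_000    # 136 million files


def average_file_size_kb(total_bytes: float, file_count: int) -> float:
    """Average size per file in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / file_count / 1000


avg_kb = average_file_size_kb(ARCHIVE_BYTES, FILE_COUNT)
print(f"{avg_kb:.1f} kB per file")
```

Under these assumptions the result is roughly 40 kB per file.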
WebArchiv – Tools
Software tools:
  Web-based Dublin Core metadata creator
  National Bibliography Number (NBN) generator
  Heritrix crawler
  NutchWAX, WERA – full-text indexing & public archive access
  wa-cz – locally developed infrastructure
  WayBack – Wayback Machine-like interface for the whole archive, limited access
Hardware:
  3 HP ProLiant servers, 5.8 TB SATA disc array
  Awaiting transfer of the archive files to the National Library's central storage facility (25+ TB, mirrored, FC+SATA) later this year
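To illustrate what a Dublin Core metadata creator might produce for a publisher, here is a minimal sketch that renders Dublin Core elements as HTML meta tags using the common `DC.` prefix convention. It is not the WebArchiv tool; the function name and the sample values are hypothetical.

```python
from html import escape

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_meta_tags(fields: dict) -> str:
    """Render Dublin Core fields as HTML <meta> tags for a page header."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        # Escape quotes and angle brackets so the value is safe in an attribute.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical sample values, for illustration only.
sample = {
    "title": "Example Journal",
    "publisher": "Example Publisher",
    "language": "cs",
    "identifier": "http://www.example.cz/",
}
tags = dublin_core_meta_tags(sample)
print(tags)
```

A publisher could paste the resulting tags into the head of a page so that a harvester or cataloguer can read the description directly from the resource.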
WebArchiv – Infrastructure
[Diagram: infrastructure workflow. A1 new crawl; A2 end crawl -> index; A3 update fulltext; A4 update host list]
WebArchiv – Future Work
  Workflow management application
  Harvesting of bohemical (Czech-related) resources outside the .cz domain
    language analysis
    feedback from Heritrix about dropped URLs from the .cz crawl
  Adaptive incremental harvesting, incremental indexing
  Selective harvesting on demand
  Full-text indexing of the whole archive
  Identification of similar documents
  Permanent linking into the archive (permanent ID)
  Integration of the archive into the planned National Digital Library (selection of software 2008)
  Long-term preservation (via NDL system)
  Implementation of digital library standards: OAI-PMH, METS, SRU/SRW
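As a small illustration of one of the standards named in the last item, the sketch below builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `from` and `set` arguments come from the OAI-PMH protocol; the base URL is a placeholder and the function name is hypothetical, since the slides describe this only as planned work.

```python
from typing import Optional
from urllib.parse import urlencode


def oai_list_records_url(base_url: str,
                         metadata_prefix: str = "oai_dc",
                         from_date: Optional[str] = None,
                         set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec is not None:
        params["set"] = set_spec     # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real WebArchiv service.
request_url = oai_list_records_url("https://archive.example.org/oai",
                                   from_date="2007-01-01")
print(request_url)
```

The `oai_dc` prefix requests unqualified Dublin Core, which every OAI-PMH repository is required to support.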
Archive daily ingest
[Chart: number of files ingested per day, y axis 0 to 3,000,000, x axis 1.9.2001 to 1.3.2007. Labelled series: cz2001, cz2002, cz2004, cz2005, cz2006, agreements. Crawlers: NEDLIB harvester, Heritrix.]
People
  Librarians, project management: National Library, 3.5 FTE
  IT management: Moravian Library, 1 part-time
  IT: Masaryk University, 6 part-time