http://www.webarchiv.cz

WebArchiv

Czech Web Archive

IIPC 2007, Paris

WebArchiv – overview

The Czech WebArchiv was originally funded by the Ministry of Culture and launched in 2000.

Since then the project has been implemented by the National Library in cooperation with the Moravian Library and the Institute of Computer Science of Masaryk University.

Both large-scale automated harvesting of the entire Czech national web and selective archiving are being carried out, including thematic, event-based collections (using Heritrix).

Due to copyright law, the archive as a whole can be accessed only on-site, from within the library (using Wayback).

Archived resources which are covered by a written agreement with their publisher are accessible online using WERA.

WebArchiv – Workflows

Prague:
- Resource selection
- Cataloguing for the National Bibliography (MARC21)
- Providing Dublin Core metadata for interested publishers
- Making archive access agreements with publishers

Brno:
- Running WebArchiv hardware
- Software localization, maintenance and development
- Pre-harvesting resource analysis
- Harvesting, indexing, access

Results so far:
- 4 harvesting rounds of the .cz domain (2001, 2002, 2004, 2006)
- 5 event-oriented harvests
- Harvests of sites under agreements several times per year
- 5.4 TB archive with 136 million files
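As a rough sanity check on the archive totals above, the average size of an archived file can be estimated from the two figures. This is only a sketch: it assumes "TB" means decimal terabytes (10^12 bytes), which the slides do not specify, and the names used are illustrative.

```python
# Rough average file size from the totals quoted on the slide.
# Assumption: "TB" means decimal terabytes (10**12 bytes); the slides do not say.
ARCHIVE_BYTES = 5_400_000_000_000   # 5.4 TB
ARCHIVE_FILES = 136_000_000         # 136 million files


def average_file_size(total_bytes: int, file_count: int) -> float:
    """Return the mean size of one archived file in bytes."""
    return total_bytes / file_count


avg = average_file_size(ARCHIVE_BYTES, ARCHIVE_FILES)
print(f"average file size: {avg / 1000:.1f} kB")  # prints "average file size: 39.7 kB"
```

Under that assumption the archive averages roughly 40 kB per file. With binary terabytes the figure would be somewhat higher.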

WebArchiv – Tools

Software tools:
- Web-based Dublin Core metadata creator
- National Bibliography Number (NBN) generator
- Heritrix crawler
- NutchWAX, WERA – full-text indexing & public archive access
- wa-cz – locally developed infrastructure
- WayBack – Wayback Machine-like interface for the whole archive, limited access

Hardware:
- 3 HP ProLiant servers, 5.8 TB SATA disc array
- Awaiting transfer of the archive files to the National Library’s central storage facility (25+ TB, mirrored, FC+SATA) later this year
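To illustrate what a web-based Dublin Core metadata creator, as named in the software tools above, might hand to a publisher, here is a minimal sketch that renders Dublin Core elements as HTML meta tags. It does not reproduce the project's own tool: the function name, the sample record and the example.cz URL are all hypothetical. Only the `DC.` naming convention and the element names come from Dublin Core itself.

```python
from html import escape


def dublin_core_meta(fields: dict[str, str]) -> str:
    """Render Dublin Core elements as HTML <meta> tags, one per line."""
    # The schema link declares what the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields.items():
        # Escape the value so quotes and ampersands cannot break the attribute.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical record for a publisher's site; all values are placeholders.
record = {
    "title": "Example Czech periodical",
    "publisher": "Example Publisher",
    "language": "cs",
    "identifier": "http://www.example.cz/",
}
print(dublin_core_meta(record))
```

A publisher could paste the resulting tags into the head of a page, where a crawler or cataloguer can read them alongside the harvested content.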

WebArchiv – Infrastructure

A1 new crawl; A2 end crawl -> index; A3 update fulltext; A4 update host list

WebArchiv – Future Work

- Workflow management application
- Harvesting of Bohemica (Czech-related resources) outside the .cz domain
  - language analysis
  - feedback from Heritrix about dropped URLs from the .cz crawl
- Adaptive incremental harvesting, incremental indexing
- Selective harvesting on demand
- Full-text indexing of the whole archive
- Identification of similar documents
- Permanent linking into the archive (permanent ID)
- Integration of the archive into the planned National Digital Library (selection of software 2008)
- Long-term preservation (via NDL system)
- Implementation of digital library standards: OAI-PMH, METS, SRU/SRW
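Of the standards named above, OAI-PMH is the simplest to picture: a harvester requests metadata over plain HTTP using a small set of verbs. The sketch below builds a `ListRecords` request URL. The endpoint is hypothetical, since the slides list OAI-PMH only as future work and name no address; the verb and the `metadataPrefix` and `from` parameters are part of the protocol.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides list OAI-PMH as future work and name no URL.
BASE_URL = "http://archive.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL, from_date="2007-01-01"))
```

The `oai_dc` prefix requests unqualified Dublin Core, the metadata format every OAI-PMH repository is required to support.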

[Figure: Archive daily ingest. Vertical axis marked from 500,000 to 3,000,000 (unit not given); horizontal axis in two-month steps from 1.9.2001 to 1.3.2007. Labelled harvests: cz2001, cz2002, cz2004, cz2005, cz2006, agreements. Crawlers shown: NEDLIB harvester, Heritrix.]

People

- Librarians, project management: National Library – 3.5 FTE
- IT management: Moravian Library – 1 part-time
- IT: Masaryk University – 6 part-time
