
WARC standard revision workshop

Clément Oury

IIPC General Assembly open workshops

Stanford, April 28th, 2015

IIPC General Assembly – Stanford – April 28th, 2015

Summary of the presentation

Current status of the WARC standard
The revision process
Identify, discuss and prioritize revision needs
Set up an organization and agenda for further work

The WARC format

A container format designed to store any kind of digital content
– Along with relevant metadata
– Extension of the ARC format designed in 1996

WARC improvements
– Assigns a unique identifier to each record
– New record types:
  • To describe the harvesting process: warcinfo, request, response, metadata records
  • To store information on deduplication: revisit records
  • To store segmented files: continuation records
  • To record outputs of a file format migration: conversion records
  • To record non-web material: resource records
– New named fields for each record
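
To make the record structure above concrete, here is a minimal Python sketch that serializes one resource record: a version line, a few named fields (including a unique WARC-Record-ID), a blank line, the content block, and two closing CRLFs. It assumes the general WARC 1.0 layout. The helper name `build_warc_record`, the sample URI and the sample payload are made up for illustration, and production files are better written with an established tool.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, payload, content_type):
    """Serialize a single WARC record as bytes (illustrative helper, not a library API)."""
    headers = [
        ("WARC-Type", record_type),
        # Each record gets its own unique identifier, here a UUID-based URN.
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % (k, v) for k, v in headers)
    # Header block, blank line, content block, then two CRLFs closing the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


# Hypothetical non-web document stored as a "resource" record.
record = build_warc_record(
    "resource", "file:///example/report.pdf", b"%PDF-1.4 ...", "application/pdf"
)
print(record.decode("utf-8", errors="replace"))
```

Keeping the headers as an ordered list of pairs makes it easy to append further named fields without touching the serialization step.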

Usage of WARC format

Widely adopted by the web archiving community
– Most institutions have switched from ARC to WARC format
– Harvesting: Heritrix, Wget, WARCreate
– Data management/preservation: JWAT, JHOVE2
– Indexing and access: Solr, OpenWayback

But also adopted beyond the web archiving community
– To store e-periodicals and e-books: LOCKSS project
– To store all files ingested in a long-term repository: Danish Bit Repository

Some usage issues discussed in the WARC implementation guidelines

The WARC standard

Published as “ISO 28500” on May 15th, 2009
– Standardization process had started in 2006
– Mainly ensured by IIPC members under the ISO umbrella

ISO group: TC 46 / SC 4 / WG 12
– TC 46: Information and documentation
– SC 4: Technical interoperability
– WG 12: WARC file format

ISO standards generally reviewed after 5 years
– ISO members voted in 2014 in favor of the revision

The revision process

A maximum period of 36 months
A two-step approach
– IIPC draft / IIPC WG
– ISO validated standard / ISO WG

Proposed agenda in 2015
– WARC revision workshop: now!
– June: presentation of revision process during TC 46 meeting
– May-September: first IIPC draft
– October (?): ISO WG meeting

The revision process – why?

Amend or improve the current standard, on several topics
– clarify potential ambiguities or inconsistencies in the standard;
– offer better solutions to record some information, e.g. by adding new named fields or even new record types;
– take into account some needs not identified when the original standard was designed (e.g. use of WARC for documents other than web archives);
– perform minor editorial revisions.

Afterwards, no change possible until the next revision!

Revision needs – active discussions

Clarification
– Is it allowed to add new named fields?
  • New record types are allowed…
  • But nothing is indicated on new named fields

Two new named fields for deduplication
– WARC-Refers-To-Target-URI
– WARC-Refers-To-Date

A proposal to record screenshots?
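
As a sketch of how the two deduplication fields listed above might be used, the Python snippet below assembles a subset of the named fields of a revisit record that points back to an earlier capture of the same payload. Only WARC-Refers-To-Target-URI and WARC-Refers-To-Date come from this slide. The other field names and the profile URI are taken from WARC 1.0, and the function name `revisit_headers`, the sample URIs and the dates are hypothetical.

```python
import uuid

# Profile URI defined in WARC 1.0 for revisits with an identical payload digest.
IDENTICAL_PAYLOAD_PROFILE = (
    "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
)


def revisit_headers(target_uri, capture_date, original_uri, original_date):
    """Return a subset of the named fields of a revisit record (illustrative only)."""
    return [
        ("WARC-Type", "revisit"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", capture_date),
        ("WARC-Target-URI", target_uri),
        ("WARC-Profile", IDENTICAL_PAYLOAD_PROFILE),
        # The two proposed fields: where and when the original payload was captured.
        ("WARC-Refers-To-Target-URI", original_uri),
        ("WARC-Refers-To-Date", original_date),
    ]


# Hypothetical example: a page recrawled in April whose payload was first stored in January.
fields = revisit_headers(
    "http://example.org/", "2015-04-28T10:00:00Z",
    "http://example.org/", "2015-01-15T08:30:00Z",
)
for name, value in fields:
    print("%s: %s" % (name, value))
```

With these two fields an access tool can locate the original capture by URI and date instead of depending only on a record identifier.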

Revision needs – WARC for data mining

WAT: Web Archive Transformation
– Specified by Internet Archive to store metadata extracted from WARC files
– Metadata (HTML headers, HTML metadata, links…) recorded in metadata records with a JSON structure

WET: WARC Encapsulated Text
– Designed by Common Crawl
– Contains only text content extracted from WARC files

Official recommendation as informative appendix?
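
To illustrate the WAT idea, this Python sketch extracts the title and outgoing links from an HTML payload and serializes them as JSON, the kind of derived data that could be stored in the block of a metadata record. The JSON keys (`title`, `links`), the class name `LinkExtractor` and the function name `extract_metadata` are invented for the example. They are not the actual WAT schema specified by Internet Archive.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the page title and the href of every <a> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text found between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return a JSON string with hypothetical keys, not the real WAT layout."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return json.dumps({"title": parser.title.strip(), "links": parser.links})


sample = (
    "<html><head><title>IIPC</title></head>"
    "<body><a href='http://netpreserve.org/'>home</a></body></html>"
)
print(extract_metadata(sample))
```

A WET-style derivative would follow the same pattern but keep only the visible text instead of links and metadata.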

Revision needs – open questions

Is WARC format suited for non-web material?

Is WARC format suited for server side archiving?

How to improve the use of unique IDs?

Next steps

Set up a working group: who’s in?
– Should we share the work?

What tools?
– Using IIPC GitHub?

Agenda?
– Phone calls?
