annick le follic bibliothèque nationale de france tallinn, 2015-01-29 1
1
Statistics on web archives using ISO metrics
Annick Le Follic, Bibliothèque nationale de France
Tallinn, 2015-01-29
2
BnF needs
Objectives
Characterize BnF web collections
Manage the activity of the digital legal deposit team
Describe BnF web data to be preserved
Two main kinds of metrics: harvesting and preservation
From experimentation…
Scripts and Heritrix reports from Internet Archive engineers
A dedicated application developed by BnF engineers
3
International environment
… to the standardization of the metrics
Definition of concepts and standards with an ISO working group
Dedicated statistics have to be included in the library's general performance statistics
Experience sharing within the IIPC
Many libraries have changed from ARC to WARC
BnF developed a specific tool (NAS_qual) for its first internal broad crawl in 2010
4
Benefits of ISO report for BnF
Adoption of strict definitions of terms
Main metrics chosen for collection development
Statistic | Purpose | Example
Number of targets | Objectives of the collection | 8,000 targets
Total number of URL | Quantity of information in web archive | 14 billion URL
Total compressed size stored | Overall size of web archive | 200 terabytes
Number of container files | Number of conservation units in archive | 18,000 WARC files
At a more refined level, collection characterisation
Statistic | Purpose | Example
Distribution by top level domain | Geographic distribution | 70 % of collection in .fr TLD
Distribution by format types | Document type characterisation | 60 % of collection in text/html
5
BnF method
A table lists and characterizes all possible metrics
A code and a name for each one
The source report
The calculation method
The needed tool (scripts, NAS_qual…)
Main difficulties
Difference of scale between broad and selective crawl
Compressed or uncompressed size
Collected or processed URLs
6
 | Production | Preservation
Sources | NetarchiveSuite (and Heritrix) | SPAR
Statistics tools | NAS_qual | SPAR indicators
Exploitation | Excel files | Excel files
7
Description of top level domain
Metrics description
Name: Number of TLD
Description: Number of unique top level domains harvested
Metrics elaboration
Data source: hosts-report.txt (N files)
Calculation: Extract the TLD from each host name, then count the distinct TLDs that have at least one URL harvested. Be careful: the same TLD can occur several times across several hosts reports, so it must be counted only once.
Purpose: To characterize a collection in terms of geographic distribution (e.g. France)
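The calculation described on this slide could be sketched as follows. This is only an illustration, not the NAS_qual implementation: it assumes each hosts report is a whitespace-separated text file whose first column is the number of URLs and whose third column is the host name, and the function names `tld_of` and `count_urls_by_tld` are hypothetical.

```python
from collections import Counter


def tld_of(host):
    """Return the last label of a host name, or None if there is no usable TLD."""
    host = host.strip().lower().split(":")[0].rstrip(".")  # drop port and trailing dot
    if "." not in host:
        return None  # pseudo-hosts such as "dns:" have no TLD
    tld = host.rsplit(".", 1)[-1]
    return None if tld.isdigit() else tld  # skip IPv4 addresses


def count_urls_by_tld(reports):
    """Merge several hosts reports (iterables of lines) into URL counts per TLD.

    Assumed column layout (not confirmed by the slides): urls, bytes, host, ...
    """
    totals = Counter()
    for lines in reports:
        for line in lines:
            fields = line.split()
            if len(fields) < 3 or not fields[0].isdigit():
                continue  # header or malformed line
            urls, host = int(fields[0]), fields[2]
            tld = tld_of(host)
            if tld is None or urls == 0:
                continue  # only TLDs with at least one harvested URL count
            totals[tld] += urls  # the same TLD may appear in several reports
    return totals
```

With this sketch, the "Number of TLD" metric is `len(count_urls_by_tld(reports))`, and the same counter gives the number of URLs per TLD. Because the counts are merged into one counter, a TLD that appears in several reports is still counted once.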
8
Statistics on top level domains
Starting with a seed list of .fr domains, we can see that French scope also includes a large part of .com domains, and also European domains
TLD | Number of URL | %
fr | 1,050,488,163 | 43.3 %
com | 952,199,484 | 39.3 %
net | 105,871,664 | 4.4 %
org | 104,451,350 | 4.3 %
eu | 29,396,613 | 1.2 %
9
Description of MIME types
Metrics description
Name: Number of MIME types
Description: Number of unique MIME types harvested
Metrics elaboration
Data source: mimetype-report.txt (N files)
Calculation: Count the distinct MIME types. Be careful: the same MIME type can occur several times across several MIME type reports, so it must be counted only once.
Purpose: Get a distribution by content types comparable to other documents in a library
Purpose: Help preservation tasks
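A minimal sketch of this calculation, under assumptions: each MIME type report is taken to be a whitespace-separated text file with the number of URLs in the first column and the MIME type in the third. The function names `merge_mime_reports` and `by_category` are hypothetical and are not part of NAS_qual.

```python
from collections import Counter


def merge_mime_reports(reports):
    """Merge several MIME type reports (iterables of lines) into URL counts per type.

    Assumed column layout (not confirmed by the slides): urls, bytes, mime type.
    """
    totals = Counter()
    for lines in reports:
        for line in lines:
            fields = line.split(None, 2)
            if len(fields) < 3 or not fields[0].isdigit():
                continue  # header or malformed line
            mime = fields[2].strip().lower()  # MIME types are case-insensitive
            totals[mime] += int(fields[0])  # same type may occur in several reports
    return totals


def by_category(totals):
    """Group URL counts by the top-level part of the type (text, image, ...)."""
    categories = Counter()
    for mime, urls in totals.items():
        categories[mime.split("/", 1)[0]] += urls
    return categories
```

Here the "Number of MIME types" metric is `len(merge_mime_reports(reports))`, and `by_category` gives the kind of per-category distribution that a table of text, image, application, video and audio counts would need.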
10
Statistics on MIME types
We can note that around 4 million audio and video files will need special attention to be preserved
MIME type (by categories) | Number of URL | %
text | 1,34, 647,190 | 55.7 %
image | 947,101,138 | 39.3 %
application | 114,668,770 | 4.8 %
video | 2,120,719 | 0.1 %
audio | 1,837,207 | 0.1 %
11
Description of WARC files volume
Metrics description
Name: WARC files volume (compressed, including WARC metadata)
Description: Size in bytes (and in Go, i.e. gigabytes) of all the conservation units produced / data harvested
Metrics elaboration
Data source: Manifest of storage servers
Calculation: Add up the sizes of all the WARC files of one or several harvests (configurable).
Purpose: Manage the storage space
Purpose: Help preservation tasks
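The summation could look like the sketch below. The slides do not describe the manifest format, so the layout used here (one tab-separated line per file: harvest identifier, file name, size in bytes) is purely an assumption for illustration, as is the function name `total_warc_bytes`.

```python
def total_warc_bytes(manifest_lines, harvest_ids=None):
    """Sum the sizes of WARC files listed in a manifest.

    Hypothetical manifest layout: harvest_id <TAB> filename <TAB> size_in_bytes.
    If harvest_ids is given, only files from those harvests are counted.
    """
    total = 0
    for line in manifest_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        harvest_id, filename, size = line.split("\t")
        if not filename.endswith((".warc", ".warc.gz")):
            continue  # ignore anything that is not a WARC container
        if harvest_ids is not None and harvest_id not in harvest_ids:
            continue  # harvest not selected
        total += int(size)
    return total
```

Passing `harvest_ids={"42"}` restricts the total to one harvest, while leaving it as `None` sums every WARC file in the manifest, which matches the "one or several harvests (configurable)" wording of the calculation.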
12
Statistics on WARC files volume
BnF counts ARC and WARC files in a similar way
99.35 Tio in 2014
567.38 Tio for the entire BnF web collections
Question: BnF is still undecided whether bytes should be converted to Go or Gio, To or Tio (decimal or binary units)?
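The difference between the two unit families can be made concrete with a small conversion sketch. Go and To are the decimal units (10^9 and 10^12 bytes), while Gio and Tio are the binary ones (2^30 and 2^40 bytes). The names `UNITS` and `convert` are illustrative only.

```python
# Size of each unit in bytes ("o" stands for octet, i.e. one byte).
UNITS = {
    "o": 1,
    "Go": 10**9,    # gigaoctet (decimal)
    "To": 10**12,   # teraoctet (decimal)
    "Gio": 2**30,   # gibioctet (binary)
    "Tio": 2**40,   # tebioctet (binary)
}


def convert(value, from_unit, to_unit):
    """Convert a size between any two of the units defined in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


if __name__ == "__main__":
    for tio in (99.35, 567.38):
        print(f"{tio} Tio = {convert(tio, 'Tio', 'To'):.2f} To")
```

A tebioctet is about 10 % larger than a teraoctet, so the choice matters at this scale: 567.38 Tio corresponds to roughly 624 To, and 99.35 Tio to roughly 109 To.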
13
Communication to users
Comments by the digital legal deposit team
Describe the web archive
Discussion with the IT team
Define annual storage volume
Define number of crawlers
Content librarians network
Cooperate on selective crawls
BnF managers and readers
Disseminate figures in the annual report, on the BnF website, and through the legal deposit observatory
Number of metrics, in the order of the groups listed above:
15 metrics
5 metrics
4 metrics
2 metrics or more