TRANSCRIPT
27 APRIL 2015
STUDYING A NATION’S WEB DOMAIN OVER TIME:
ANALYTICAL AND METHODOLOGICAL
CONSIDERATIONS
NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE FOR INTERNET STUDIES AND NETLAB, AARHUS UNIVERSITY
DITTE LAURSEN, SENIOR RESEARCHER AND CURATOR, THE DANISH NETARCHIVE
JANNE NIELSEN, RESEARCH ASSISTANT, NETLAB, AARHUS UNIVERSITY
STUDYING A NATION’S WEB DOMAIN OVER TIME
Niels Brügger, Ditte Laursen & Janne Nielsen, 27 APRIL 2015
OVERVIEW OF PRESENTATION
1. The project
› Why study the development of a nation’s web domain?
› How to study the development of a nation’s web domain? An outline of an analytical design
2. Methodological challenges
3. Solutions
4. Results
› Registry of .dk domains
› Corpus creation
5. Next steps
THE PROJECT
What has the entire Danish web looked like in the past, and how has it developed?
What are the methodological challenges in conducting such a study?
What kind of research infrastructure do we need to conduct such a study?
WHY STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN?
› It is an important part of a nation’s cultural heritage
› It is a backdrop for all other types of web entities and activities
› It can identify some of the patterns of the developments of the web and relate them to the web of today
HOW TO STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN?
An outline of an analytical design: a gross list of possible ’probes’
› Size
› Space
› Structure
› Aliveness
› Content
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SIZE — BYTES
› How small/big is a nation’s web domain?
› The size of different file types and of file types in general
› How big/small are websites?
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SPACE – GEOLOCATION
› Where are websites located?
› Search the text for geographic references, e.g. postcodes in footers
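A minimal sketch of the postcode probe, assuming that a Danish postcode appears as a four-digit number directly followed by a capitalised place name. The pattern, the function name `find_postcodes` and the sample footer are illustrative assumptions, not part of the project.

```python
import re

# Assumption: four digits followed by a capitalised place name,
# e.g. "8000 Aarhus". A year followed by a name ("2015 Niels") would
# also match, so real footers need further filtering.
POSTCODE_RE = re.compile(r"\b(\d{4})\s+([A-ZÆØÅ][a-zæøå]+)")

def find_postcodes(text):
    """Return (postcode, place) pairs found in the text."""
    return POSTCODE_RE.findall(text)

# Made-up footer text, for illustration only.
footer = "Eksempelvej 1, 8000 Aarhus C"
print(find_postcodes(footer))
```

Matching against a list of known postcodes would reduce false positives, at the cost of maintaining that list for each period studied.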
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? NETWORKS
Website internal/external hyperlinks
› Are websites closed or open towards the web?
› How flat/deep are websites?
Web domain internal/external hyperlinks
› Centrality based on in-links
› How well-linked is the national web domain to the rest of the web?
› Which other domain names are the most linked-to?
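In-link centrality can be approximated by counting links that arrive at a host from a different host. The sketch below assumes the hyperlinks are already available as (source URL, target URL) pairs and uses the host name as a stand-in for the domain name. The function names and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def host(url):
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()

def inlink_counts(links):
    """Count links arriving at each host from a different host."""
    counts = Counter()
    for source, target in links:
        src, dst = host(source), host(target)
        if src and dst and src != dst:
            counts[dst] += 1
    return counts

# Made-up link pairs, for illustration only.
links = [
    ("http://a.example/index.html", "http://b.example/"),
    ("http://c.example/page", "http://b.example/about"),
    ("http://b.example/", "http://b.example/contact"),  # internal, ignored
]
print(inlink_counts(links).most_common(1))
```

The same counter answers the question about the most linked-to domain names outside the national domain if the targets are filtered by top-level domain first.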
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? ALIVENESS – UPDATING
› Domain names: number of new/inactive/disappeared domain names
› Updating: number of web objects having been changed since last archiving
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 1
Closedness
› How many websites are password protected?
File and software types
› Which file types are the most prevalent?
› Which software types are the most widespread?
Language
› Does the national language prevail, or foreign languages?
HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 2
Textual elements on webpage
› Background color
› Most used fonts
› Length of webpages
› Placing of menu items (left-aligned and vertical, or top-aligned and horizontal)
Semantics
› Word frequencies
› Where specific issues or topics are to be found, and how they spread
METHODOLOGICAL CHALLENGES
The web of the past is gone
Possible solution: using (national) web archives
› DK: Legal Deposit law effective July 2005
› DK: web material within the ccTLD .dk and websites on other domains aimed at a Danish audience
› DK: 2015: approx. 1 million active domain names within the ccTLD .dk; 583 terabytes
No 1:1 relation between archive and the Danish web domain
METHODOLOGICAL CHALLENGES
No 1:1 relation between Danish national archive and the Danish national web domain
› Not everything has been archived
› Unsystematic, no register, no original to compare with
› Archiving takes time, e.g. the link structure becomes inconsistent
› Deduplication may affect the subsequent use of the archived material
› Archiving strategies may be changed between two archivings
› Parts of domains may be harvested more than once
NETLAB WORKSHOP ON WEB ARCHIVING, 18 MARCH 2015
PARTS OF DOMAINS MAY BE HARVESTED MORE THAN ONCE
[Diagram: a harvester (web crawler/spider) follows links from a start URL (levels 0 to 3) into URLs belonging to domain A, domain B, domain C, …]
METHODOLOGICAL CHALLENGES
› Main harvest: objects within a domain which have been harvested in the job to which the harvest of the domain was assigned
› By-harvest: objects within a domain which have been harvested in another job than the one to which the harvest of the domain was assigned
[Diagram: three harvest jobs (JOB 1, JOB 2, JOB 3) containing main harvests (MH) of domains A to F and by-harvests (BH) B1, B2 and D1]
SOLUTIONS
Not to use the archive after all
› Use the registry of .dk domains
Corpus creation
› Selection of harvests
› Selection of one version of each domain (consisting of the main harvest and possibly by-harvests)
REGISTRY OF .DK DOMAINS
Size and aliveness: 2006, 2009, 2012, 2015
› What is the total number of domain names over time?
› How many domain names have disappeared compared to the previous year? (and which ones)
› How many domain names have been created compared to the previous year? (and which ones)
› How many domain names have changed hands compared to the previous year? (and which ones)
› What is the relationship between ownership and domains over time? (cf. long tail)
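The first four questions reduce to set operations on two registry snapshots. The sketch below assumes each snapshot has been loaded as a mapping from domain name to owner name. The function name `compare_registries` and the toy data are hypothetical.

```python
def compare_registries(earlier, later):
    """Compare two {domain: owner} snapshots of a registry.

    Returns three sets: domains created, domains that disappeared,
    and domains present in both snapshots whose owner name differs.
    """
    created = later.keys() - earlier.keys()
    disappeared = earlier.keys() - later.keys()
    changed_hands = {
        domain
        for domain in earlier.keys() & later.keys()
        if earlier[domain] != later[domain]
    }
    return created, disappeared, changed_hands

# Toy snapshots, for illustration only.
snapshot_2012 = {"a.dk": "Owner 1", "b.dk": "Owner 2", "c.dk": "Owner 3"}
snapshot_2015 = {"b.dk": "Owner 2", "c.dk": "Owner 4", "d.dk": "Owner 5"}
print(compare_registries(snapshot_2012, snapshot_2015))
```

Comparing owner names as plain strings treats a respelled name as a change of hands, so some normalisation of names may be needed before counting.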
RESULTS: REGISTRY OF .DK DOMAINS
Number of domain names over time
Year Domain names
2005 629,344
2009 973,456
2012 1,163,250
2015 1,277,035
RESULTS: REGISTRY OF .DK DOMAINS
New and disappearing domain names from 2005 to 2015
Period Created Disappeared
2005-2009 470,925 126,813
2009-2012 416,081 226,287
2012-2015 369,002 255,217
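The created and disappeared counts can be cross-checked against the totals from the previous chart: for each period, created minus disappeared should equal the growth in the number of domain names. A short check in Python, with the figures typed in from the two charts (the variable and function names are illustrative):

```python
# Total number of domain names per year, from the registry chart.
totals = {2005: 629_344, 2009: 973_456, 2012: 1_163_250, 2015: 1_277_035}

# (created, disappeared) per period, from the chart above.
flows = {
    (2005, 2009): (470_925, 126_813),
    (2009, 2012): (416_081, 226_287),
    (2012, 2015): (369_002, 255_217),
}

def net_change(created, disappeared):
    """Net growth implied by the created and disappeared counts."""
    return created - disappeared

for (start, end), (created, disappeared) in flows.items():
    growth = totals[end] - totals[start]
    print(start, end, net_change(created, disappeared), growth)
```

The two numbers printed for each period agree (344,112, then 189,794, then 113,785), so the two charts are mutually consistent.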
RESULTS: REGISTRY OF .DK DOMAINS
Number of domain names which have changed hands over time
• In 2015, 14% of the domains from 2012 had a different owner name
• In both 2012 and 2015, just under 10% of all owners owned 50% of the Danish domains
• An observation: if you own more than three domains, you are part of the top 10% of domain owners
Year Domains Owners Anonymous
2012 1,163,250 513,326 46,727
2015 1,277,035 549,978 58,710
RESULTS: REGISTRY OF .DK DOMAINS
Relationship between ownership and domain names over time, with anonymous registrants removed. The chart shows 2012; there is no visible difference between 2012 and 2015.
Parameter 2012 2015
Max 3422 3786
Mean 2.175 2.215
Median 1 1
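The means are consistent with dividing the non-anonymous domains by the number of owners, using the domain, owner and anonymous counts reported earlier. This reading is an inference from the numbers, not something the slides state, and the function name is illustrative.

```python
# Counts per year as reported in the registry results.
registry = {
    2012: {"domains": 1_163_250, "owners": 513_326, "anonymous": 46_727},
    2015: {"domains": 1_277_035, "owners": 549_978, "anonymous": 58_710},
}

def mean_domains_per_owner(domains, owners, anonymous):
    """Assumed definition: domains of non-anonymous registrants per owner."""
    return (domains - anonymous) / owners

for year, counts in registry.items():
    print(year, round(mean_domains_per_owner(**counts), 3))
```

Rounded to three decimals this gives 2.175 for 2012 and 2.215 for 2015, matching the table. A median of 1 alongside a maximum in the thousands is what a long-tail distribution of ownership looks like.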
PRE/POST-STEPS: REGISTRY OF .DK DOMAINS
Pre-steps
› DK Hostmaster has shifted from ISO-8859 to UTF-8
› Earlier attempts at handling the data assumed space-separated data sets when in fact they are fixed-width fields
› Data from DK Hostmaster contains dirt, e.g. tab characters and, in one year, some sort of header
Post-steps
› Ask the same questions for several years (all years, up to four times a year)
› Further investigation into which domains have disappeared
› New questions emerged in the process
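The pre-steps can be sketched as a small fixed-width parser that takes the encoding as a parameter. The column layout, the field names and the sample record below are hypothetical; the real widths would have to be read off the DK Hostmaster files, and `iso-8859-1` is only an assumed example of the older encoding.

```python
# Hypothetical column layout: (field name, start, end).
FIELDS = [("domain", 0, 30), ("owner", 30, 60), ("created", 60, 70)]

def parse_record(raw, encoding="utf-8"):
    """Decode one fixed-width record and slice it into named fields."""
    # Replace each stray tab with one space so column positions are kept.
    line = raw.decode(encoding).rstrip("\r\n").replace("\t", " ")
    return {name: line[start:end].strip() for name, start, end in FIELDS}

def parse_dump(lines, encoding="utf-8", header_lines=0):
    """Parse raw byte lines, skipping leading header lines and blank lines."""
    records = []
    for raw in list(lines)[header_lines:]:
        if raw.strip():
            records.append(parse_record(raw, encoding))
    return records

# Made-up record: the owner name contains spaces, which is why splitting
# on whitespace would break it into several fields.
sample = f"{'eksempel.dk':<30}{'Søren Sørensen ApS':<30}{'2005-01-01':<10}\n"
old_style = sample.encode("iso-8859-1")
new_style = sample.encode("utf-8")
print(parse_record(old_style, "iso-8859-1") == parse_record(new_style))
```

Decoding before slicing matters: in UTF-8 the letter ø takes two bytes, so slicing the raw bytes would shift every column after it.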
CORPUS CREATION
Collaboration between researchers, curators, developers and management at the archive
› How is a broad crawl performed? I.e. several ”steps”
› When were broad crawls performed?
› How to find the most complete version of a domain within a certain timespan within a broad crawl?
› What do we mean when we talk about a ”web element”, a ”web page”, a ”version” etc.?
› What could a corpus creation algorithm look like?
› How many resources are needed to test and implement the creation of a corpus?
CORPUS CREATION
Use of broad crawls
› Internationally recognized as a suitable web harvesting strategy for national archives
› 2-4 broad crawls each year of all domains from .dk as well as Danish websites published under other extensions
› Comprehensive in nature and consistent over time
CORPUS CREATION
Selection of broad crawls
› Four broad crawls, one from each of the years 2006, 2009, 2012 and 2015 (first crawl of the year)
CORPUS CREATION
Selection of one harvested version of each domain
› Domain version from ’main harvest’
› Inclusion of unique materials from the ’by-harvest’ if the material is within our selected time span
[Diagram: three harvest jobs (JOB 1, JOB 2, JOB 3) containing main harvests (MH) of domains A to F and by-harvests (BH) B1, B2 and D1]
CORPUS CREATION
Test of the algorithm
› Tested on the first broad crawl from January 2006 (1TB, only websites <10MB)
› This harvest consists of 127 jobs
› Each job consists of several domains
› We produce an 18GB crawl log enhanced with job IDs
CORPUS CREATION
Test of the algorithm
› Using IBM BigInsights we can run the algorithm on this large spreadsheet
› The algorithm locates the objects that are not included in a main harvest (’by-harvests’)
› There might be duplicates; in these cases, the algorithm identifies and selects the objects closest to the time of the main harvest
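The selection logic described above could look roughly like this in plain Python. It is a sketch under stated assumptions, not the implementation run on IBM BigInsights: the crawl log is assumed to be a list of rows with `url`, `domain`, `job_id` and `time`, the mappings `main_job` and `main_time` are assumed to be known for each domain, and the check against the selected time span is left out.

```python
from datetime import datetime

def select_corpus(rows, main_job, main_time):
    """Pick one harvested copy per URL.

    rows: list of dicts with keys url, domain, job_id, time.
    main_job: {domain: job to which the domain's harvest was assigned}.
    main_time: {domain: time of the domain's main harvest}.
    """
    corpus = {}
    # Pass 1: main harvest objects always win.
    for row in rows:
        if row["job_id"] == main_job[row["domain"]]:
            corpus.setdefault(row["url"], row)
    main_urls = set(corpus)
    # Pass 2: by-harvest objects that the main harvest did not capture.
    for row in rows:
        url = row["url"]
        if row["job_id"] == main_job[row["domain"]] or url in main_urls:
            continue
        reference = main_time[row["domain"]]
        best = corpus.get(url)
        # Among duplicates, keep the copy closest in time to the main harvest.
        if best is None or abs(row["time"] - reference) < abs(best["time"] - reference):
            corpus[url] = row
    return corpus

# Toy crawl log, for illustration only.
rows = [
    {"url": "http://b.dk/", "domain": "b.dk", "job_id": 1, "time": datetime(2006, 1, 10)},
    {"url": "http://b.dk/logo.gif", "domain": "b.dk", "job_id": 2, "time": datetime(2006, 1, 12)},
    {"url": "http://b.dk/logo.gif", "domain": "b.dk", "job_id": 3, "time": datetime(2006, 1, 25)},
    {"url": "http://b.dk/", "domain": "b.dk", "job_id": 3, "time": datetime(2006, 1, 25)},
]
corpus = select_corpus(rows, {"b.dk": 1}, {"b.dk": datetime(2006, 1, 10)})
for url, row in sorted(corpus.items()):
    print(url, "job", row["job_id"])
```

In the toy log the front page is taken from job 1 (the main harvest), and the logo, which the main harvest missed, is taken from job 2 because that copy is closer in time than the one from job 3.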
NEXT STEPS
From test to implementation
› How to get from crawl logs to the material that the crawl logs refer to and that we want to analyze? Should WARC files be opened? Should a subset of an index be used?
› Start making some of the analyses
Dissemination and networking
› Book chapters and papers
› An open workshop in Aarhus, Denmark in 2016 for other national web archives and scholars wanting to do similar projects, aiming at establishing transnational ’best practice’ and analytical design