Towards Multidimensional Web Archive Access (IIPC 2016)


Upload: timelessfuture

Post on 12-Apr-2017


Page 1: Towards Multidimensional Web Archive Access (IIPC 2016)

Towards Multidimensional Web Archive Access

Creating & Analyzing Representations of Aggregated Web Content

Hugo Huurdeman Thaer Samar Jaap Kamps Arjen de Vries

Page 2: Towards Multidimensional Web Archive Access (IIPC 2016)

Introduction

• Web archives: exceptionally rich potential scholarly data source

• Important: temporal & hierarchical aspects; however, current access is usually at single-page level

Page 3: Towards Multidimensional Web Archive Access (IIPC 2016)

Introduction

• Focus: how can we provide insights into the multidimensional aspects of the archive?

• i.e. moving from singular representations of time-stamped pages to larger aggregated representations

• Illustrated by previous work on scholarly access & examples from Dutch Web archive


Page 4: Towards Multidimensional Web Archive Access (IIPC 2016)

1 Scholars’ Needs: literature analysis

Page 5: Towards Multidimensional Web Archive Access (IIPC 2016)

1.1 Exploratory Study

• Exploratory analysis of scholars’ research tasks (journal papers) [see: Huurdeman15]

• scholars using temporal Web data

• Focus on corpus generation, analysis and dissemination


Page 6: Towards Multidimensional Web Archive Access (IIPC 2016)

1.1 Exploratory Study

• Method:

• querying EBSCOhost using the CMMC (Communication & Mass Media Complete), and LISTA (Library, Information Science & Technology Abstracts) databases

• selecting all journal papers (2007-2015) which contain longitudinal analyses (excl. computer science papers)

Page 7: Towards Multidimensional Web Archive Access (IIPC 2016)

1.2 Results: Scholars’ Corpora

• Observation: of the 18 resulting papers, most scholars did not use institutional Web archives as their data source

• Corpus definition:

• 1. by selecting webpages or websites, e.g. based on authoritative lists (13)

• 2. by querying regular search engines (5)

• 3. by taking a sample of webpages (4)

• or a combination thereof

Page 8: Towards Multidimensional Web Archive Access (IIPC 2016)

1.3 Results: Dimensions

• Some research examples:

• quality of answers in question-answering sites over time (Chua et al., 2013)

• hyperlinking in news websites across time (Karlsson et al., 2015)

• electoral web spheres at election times (Xenos & Bennett, 2007)

• Various hierarchical and temporal dimensions

Page 9: Towards Multidimensional Web Archive Access (IIPC 2016)

1.3.1 Results: Hierarchical Dimension

• Level of analysis (based on Brügger, 2013):

• page element: 4 papers (22%), e.g. mission statements

• web page: 6 papers (33%), e.g. blog pages

• web site*: 7 papers (39%), e.g. political actors’ sites

• web sphere: 1 paper (6%), e.g. electoral web sphere

[Figure: diagram labelled web sphere (1), website (7), webpage (8), page element (4)]
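The percentages in the list follow from dividing each count by the 18 papers in the sample. A minimal sketch of that arithmetic, using the counts from the bulleted list (the variable names are illustrative and not from the slides):

```python
# Counts per level of analysis, taken from the bulleted list on this slide.
counts = {
    "page element": 4,
    "web page": 6,
    "web site": 7,
    "web sphere": 1,
}

total = sum(counts.values())  # 18 papers in total

# Share of papers per level, rounded to whole percentages.
shares = {level: round(100 * n / total) for level, n in counts.items()}

for level, pct in shares.items():
    print(f"{level}: {counts[level]} papers ({pct}%)")
```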

Page 10: Towards Multidimensional Web Archive Access (IIPC 2016)

1.3.2 Results: Temporal Dimension

[Figure: #Papers per temporal scope on a timeline (2000, 2005, 2010); scopes shown: timepoints, singular timerange, multiple timeranges; values shown: 5 (28%), 8 (44%), 5 (28%)]

Page 11: Towards Multidimensional Web Archive Access (IIPC 2016)

1.3 Dimensions: Wrapup

• Scholars’ focus: not just on pages, but also on page elements, web sites and web spheres

• at timepoints, singular timerange, multiple timeranges

• Various ways to define a corpus: queries, samples and selections (e.g. URL lists)

• How are these needs reflected in Web archive data and access functionality?

Page 12: Towards Multidimensional Web Archive Access (IIPC 2016)

2 Dimensions of the Web archive: data and access

Page 13: Towards Multidimensional Web Archive Access (IIPC 2016)

2.1 Web Archive Data

• Usually stored in (W)ARC files, each containing one or more (W)ARC records

• resources of various kinds
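To make the record structure concrete, here is a minimal sketch that parses the header of one WARC record and reads its capture timestamp. It assumes the WARC 1.0 layout (a version line followed by `Name: value` fields); the sample header values are invented for illustration, and `parse_warc_header` is a hypothetical helper, not part of any toolset mentioned in the slides. Older ARC files use a different header layout that this sketch does not cover.

```python
from datetime import datetime, timezone


def parse_warc_header(header_block: str) -> dict:
    """Parse the named fields of one WARC record header into a dict."""
    lines = header_block.strip().splitlines()
    fields = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Invented header for illustration; in a real file it is followed by a
# blank line and the record's content block.
sample = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.org/index.html
WARC-Date: 2016-04-12T10:00:00Z
Content-Type: application/http; msgtype=response
Content-Length: 1024
"""

record = parse_warc_header(sample)
captured = datetime.strptime(
    record["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
print(record["WARC-Type"], record["WARC-Target-URI"], captured.year)
```

Each record carries its own target URI and capture date, which is why a page-by-page, timestamp-by-timestamp view is the most direct way to read the data.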

Page 14: Towards Multidimensional Web Archive Access (IIPC 2016)

2.1 Data: Dimensions

• (1) temporal dimension: versions of Web content accumulated over time

• timestamped (W)ARC records

• crawl dates

• last-modified dates

Page 15: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: timeline with year labels 1997, 2000, 2004, 2008, 2012, 2016]

Page 16: Towards Multidimensional Web Archive Access (IIPC 2016)

2.2 Data: Dimensions

• (2) hierarchical dimension: “web sphere, web site, web page, page element”

• stored in (W)ARC files

• as “flat” (W)ARC records

Page 17: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: tree diagram: one web sphere containing several websites, each website containing pages, each page containing elements]

• Web sphere: e.g. set of websites; category of sites

• Website: e.g. all pages under a host or domain; all homepages; all homepages+1

• Element: e.g. .css, .jpg file

Issue: delineating the granularities
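One way to delineate the website and homepages+1 granularities can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: a "website" is approximated by the URL host, and "1 level deep" is approximated by URL path depth (a link-based definition would need the link graph instead). All function names and URLs are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host, used here as a stand-in for 'web site'."""
    return (urlsplit(url).hostname or "").lower()


def path_depth(url: str) -> int:
    """Number of non-empty path segments: 0 for a homepage, 1 for homepage+1."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def group_by_site(urls, max_depth=1):
    """Group URLs per host, keeping pages up to max_depth below the homepage."""
    sites = {}
    for url in urls:
        if path_depth(url) <= max_depth:
            sites.setdefault(host_of(url), []).append(url)
    return sites


urls = [
    "http://www.example.nl/",
    "http://www.example.nl/nieuws",
    "http://www.example.nl/nieuws/2015/item.html",
    "http://other.example.org/about",
]
sites = group_by_site(urls)
for host, pages in sites.items():
    print(host, len(pages))
```

Grouping by host treats `www.example.nl` and `example.nl` as different sites; grouping by registered domain instead is one of the delineation choices the slide alludes to.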

Page 18: Towards Multidimensional Web Archive Access (IIPC 2016)

2.3 Access: current limits

• Open question: how to support these dimensions?

• current support in interfaces:

• most: selecting URLs, timestamps (Wayback Machine)

• many: querying contents of the archive, temporal filters

• few: selecting categories, facet filters

• usually still page-level results, i.e. individual pages

• How to provide aggregated results using different hierarchical and temporal dimensions?

• scaling from page to site and ‘sphere’ level

• moving from single timestamp to time periods

Page 19: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: grid of hierarchical levels (web sphere, web site, web page, page element) against time (2000, 2005, 2010)]

Page 20: Towards Multidimensional Web Archive Access (IIPC 2016)

3 Exploring Aggregations: aggregated representations in the Dutch Web archive

Page 21: Towards Multidimensional Web Archive Access (IIPC 2016)

Flickr: koninklijkebibliotheek

Statistics:

• 10,000+ websites

• 35,000+ harvests

• 16+ terabytes

National Library of the Netherlands: Web archive since 2007

Page 22: Towards Multidimensional Web Archive Access (IIPC 2016)

3.1 Data: extraction and processing

• extracting all homepages + all pages 1 level deep

• matching with seedlist

• adding KB metadata

• cleaning, processing, data enrichment (e.g. NER)

• generating aggregations

• ~900K XML files
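The matching and aggregation steps can be sketched roughly as below. This is a simplified illustration, not the project's actual pipeline: the record fields, the seedlist layout and the `aggregate` function are all assumptions, and the cleaning and NER steps are left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical seedlist: host -> curator metadata (field names are assumptions).
seedlist = {
    "www.example.nl": {"title": "Example", "category": "History"},
}

# Hypothetical extracted page records (homepage plus one page below it).
pages = [
    {"url": "http://www.example.nl/", "harvest": "20130218", "text": "welkom bij example"},
    {"url": "http://www.example.nl/contact", "harvest": "20130218", "text": "contact en adres"},
    {"url": "http://unlisted.example.org/", "harvest": "20130218", "text": "not in seedlist"},
]


def aggregate(pages, seedlist):
    """Build one site-level record per (host, harvest) for hosts in the seedlist."""
    sites = defaultdict(lambda: {"pages": 0, "text": []})
    for page in pages:
        host = urlsplit(page["url"]).hostname
        if host not in seedlist:
            continue  # no match with the seedlist: skip this page
        entry = sites[(host, page["harvest"])]
        entry["pages"] += 1
        entry["text"].append(page["text"])
        entry["metadata"] = seedlist[host]  # attach curator metadata
    return dict(sites)


summary = aggregate(pages, seedlist)
for (host, harvest), entry in summary.items():
    print(host, harvest, entry["pages"], entry["metadata"]["category"])
```

Keying the result on (host, harvest) keeps both dimensions available, so the same records can later be rolled up by site, by category or by time period.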

Page 23: Towards Multidimensional Web Archive Access (IIPC 2016)

Single page

Page 24: Towards Multidimensional Web Archive Access (IIPC 2016)

Site summary

Single pages

Page 25: Towards Multidimensional Web Archive Access (IIPC 2016)

3.2 Potential Use: Explorations

• Potential for analysis and visualization

• Examples via Dutch Web archive:

• I. (aggregated) degree of change (hierarchical): homepages+1, ssdeep (content text, links, images)

• II. (aggregated) content summaries (temporal): homepages+1, tf-idf

Page 26: Towards Multidimensional Web Archive Access (IIPC 2016)

3.2.1 Examining aggregated degree of change
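The slides name ssdeep (fuzzy hashing) for measuring change in content text, links and images. The sketch below illustrates the same idea with a stand-in: it uses `difflib.SequenceMatcher` from the Python standard library instead of ssdeep, so the numbers are not comparable to those in the charts. The 0-100 scale and the unweighted mean used for the overall score are assumptions, as are the function names and sample data.

```python
from difflib import SequenceMatcher


def change_score(old: str, new: str) -> float:
    """Degree of change between two versions on a 0-100 scale (0 = identical)."""
    similarity = SequenceMatcher(None, old, new).ratio()
    return round(100 * (1 - similarity), 1)


def site_change(prev: dict, curr: dict) -> dict:
    """Score content, links and images separately, plus their unweighted mean."""
    scores = {
        key: change_score(prev[key], curr[key])
        for key in ("content", "links", "images")
    }
    # Assumption: the overall score is the plain average of the three parts.
    scores["overall"] = round(sum(scores.values()) / len(scores), 1)
    return scores


# Invented site-level aggregates for two consecutive harvests.
prev = {
    "content": "film programme and exhibitions",
    "links": "/agenda /collectie /contact",
    "images": "logo.png banner.jpg",
}
curr = {
    "content": "film programme and exhibitions",
    "links": "/agenda /collection /shop /contact",
    "images": "logo.png banner.jpg",
}

scores = site_change(prev, curr)
print(scores)
```

Scoring the three parts separately is what makes a redesign distinguishable from a routine content update: a redesign tends to move links and images, while a news update mostly moves content.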

Page 27: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: grid of hierarchical levels (web sphere, web site, web page, page element) against time (2010, 2015); example: eyefilm.nl]

Page 28: Towards Multidimensional Web Archive Access (IIPC 2016)

Example: eyefilm.nl (2010-2015)

[Figure: degree of change per harvest, y-axis 0-120, x-axis harvest dates 20100722 to 20150518; series: content, links, images, overall; two points annotated "redesign"]

Page 29: Towards Multidimensional Web Archive Access (IIPC 2016)

Example: escherinhetpaleis.nl (2010-2015)

[Figure: degree of change per harvest, y-axis 0-100, x-axis harvest dates 20090226 to 20150510; series: content, links, images, overall]

[Figure: second chart, y-axis 0-120, x-axis harvest dates 20100722 to 20150518; series: content, links, images, overall]

Page 30: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: grid of hierarchical levels (web sphere, web site, web page, page element) against time (2010, 2015); example: UNESCO classifications]

Page 31: Towards Multidimensional Web Archive Access (IIPC 2016)

Changerate (type of site)

Changes per UNESCO category (all per-quarter harvests, n=~600, 2009-2015)

[Figure: chart, y-axis 0-60, x-axis UNESCO category codes 01, 02, 03, 04, 05, 06, 08, 09, 16, 17, 18, 19, 20, 22, 23, 24, 25, 30, 31; series: average of content, average of images, average of links, average of combined; annotated categories: Meteorology, Law & government, History, Sports, Agriculture]

Page 32: Towards Multidimensional Web Archive Access (IIPC 2016)

Changerate (all sites)

Changerate (all per-quarter harvests, 2009-2015)

[Figure: chart, y-axis 0-35, x-axis quarters 2009Q3 to 2015Q1; series: average of content, average of links, average of images, average of combined]
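Moving from individual harvest dates to quarters is a small aggregation step. The sketch below shows one way to do it, assuming harvest dates in YYYYMMDD form (as on the chart axes) and one change score per harvest; the scores are invented and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def quarter_of(timestamp: str) -> str:
    """Map a YYYYMMDD harvest date to a label such as '2013Q2'."""
    year, month = timestamp[:4], int(timestamp[4:6])
    return f"{year}Q{(month - 1) // 3 + 1}"


def mean_per_quarter(scores):
    """Average (timestamp, score) pairs per quarter, in chronological order."""
    buckets = defaultdict(list)
    for timestamp, score in scores:
        buckets[quarter_of(timestamp)].append(score)
    # Labels of the form YYYYQn sort chronologically as plain strings.
    return {q: mean(values) for q, values in sorted(buckets.items())}


# Invented change scores, for illustration only.
scores = [("20130218", 10.0), ("20130311", 20.0), ("20130413", 30.0)]
quarterly = mean_per_quarter(scores)
print(quarterly)
```

The same bucketing works for any other period (month, year) by swapping out `quarter_of`.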

Page 33: Towards Multidimensional Web Archive Access (IIPC 2016)

3.2.2 Examining aggregated content summaries

Page 34: Towards Multidimensional Web Archive Access (IIPC 2016)

3.2.2 Exploring Content Summaries

• Examine textual contents of a website

• for example, nu.nl

• most popular Dutch news site (Alexa, 2016)

• daily crawls by KB

• Exploration: different temporal site-level summarizations
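A site-level content summary per time period can be sketched with tf-idf, the weighting named earlier in the deck. The sketch treats the aggregated text of one period as one document and keeps the highest-scoring terms. The specific tf-idf variant (length-normalised term frequency, idf = log(N/df)), the whitespace tokenisation and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_summaries(docs: dict, top_n: int = 3) -> dict:
    """Return the top_n highest-scoring tf-idf terms for each period's text."""
    tokenized = {period: text.lower().split() for period, text in docs.items()}
    # Document frequency: in how many periods does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n_docs = len(tokenized)
    summaries = {}
    for period, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        summaries[period] = ranked[:top_n]
    return summaries


# Invented one-line "site texts" per year, for illustration only.
docs = {
    "2013": "verkiezingen kabinet nieuws sport",
    "2014": "wk voetbal nieuws sport",
    "2015": "vluchtelingen europa nieuws sport",
}
summaries = tfidf_summaries(docs, top_n=2)
print(summaries)
```

Terms that occur in every period (here "nieuws" and "sport") score zero, so each summary is left with what is distinctive for that period. Changing the keys of `docs` from years to months or days gives the other temporal granularities.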

Page 35: Towards Multidimensional Web Archive Access (IIPC 2016)

2014

2015

Page 36: Towards Multidimensional Web Archive Access (IIPC 2016)

Jan’13 Feb’13 Mar’13 Apr’13

May’13 Jun’13 Jul’13 Aug’13

Sep’13 Oct’13 Nov’13 Dec’13

Page 37: Towards Multidimensional Web Archive Access (IIPC 2016)

Daily (2012)

Page 38: Towards Multidimensional Web Archive Access (IIPC 2016)

[Figure: NER-based summaries per year for organizations, persons and places; year labels shown: 2012, 2013, 2014, 2015]

Page 39: Towards Multidimensional Web Archive Access (IIPC 2016)

3.2.3 Next: combining approaches

[Figure: chart, y-axis 0-120, x-axis harvest dates 20100722 to 20150518; four series]

Page 40: Towards Multidimensional Web Archive Access (IIPC 2016)

4 Conclusion: Towards Multidimensional Web Archive Access

Page 41: Towards Multidimensional Web Archive Access (IIPC 2016)

4.1 Conclusion

• Gap between researchers’ needs and data/access

• Researchers’ needs

• rich access, e.g. different analytical levels, temporal ranges

• Archive access

• mainly access at single page level (URLs and queries)

• Calls for new approaches to provide access to aggregated contents

• temporally and hierarchically

Page 42: Towards Multidimensional Web Archive Access (IIPC 2016)

4.2 Our approach

• Starting from a selection instead of a query

• Potential support for exploratory stages of (re)search

• Potential support for analysis and comparisons

• Issues: which levels of a website to summarize

• experimental focus on homepages and underlying pages

• deeper layers: additional richness, additional issues

• custom file formats vs standardized formats

• Integration into access interfaces

Page 43: Towards Multidimensional Web Archive Access (IIPC 2016)


4.3 Ongoing and Future Work

• Further extending our approach; integration into WebARTist toolset

• providing new ways to explore material in the archive (without using queries)

• Creating aggregated representations of unarchived contents

• see “Lost but Not Forgotten: Finding Pages on the Unarchived Web” (2015)

“Corpus Creation”

“Analysis”

“Dissemination”

Page 44: Towards Multidimensional Web Archive Access (IIPC 2016)

References

• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1).

• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.

• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015)

• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.

• Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099

• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.

• Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at Web Archives as Scholarly Sources conference, Aarhus University, Denmark.

• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.

• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.

• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.

• Rogers, R. (2013). Digital Methods. MIT Press.

Page 45: Towards Multidimensional Web Archive Access (IIPC 2016)

Thanks & Acknowledgements

• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Sanna Kumpulainen

• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.

• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).

Page 47: Towards Multidimensional Web Archive Access (IIPC 2016)

webarchiving.nl

@webart12

Page 48: Towards Multidimensional Web Archive Access (IIPC 2016)

Towards Multidimensional Web Archive Access

Creating & Analyzing Representations of Aggregated Web Content

Hugo Huurdeman Thaer Samar Jaap Kamps Arjen de Vries

[email protected] @timelessfuture