Building Scalable Web Archives Florent Carpentier, Leïla Medjkoune Internet Memory Foundation IIPC GA, Paris, May 2014


TRANSCRIPT

Page 1: Building  Scalable  Web Archives

Building Scalable Web Archives

Florent Carpentier, Leïla Medjkoune Internet Memory Foundation

IIPC GA, Paris, May 2014

Page 2: Building  Scalable  Web Archives

Internet Memory Foundation

Internet Memory Foundation (European Archive)
Established in 2004 in Amsterdam and then in Paris:
• Mission: Preserve Web content by building a shared web archiving platform
• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN, the National Library of Ireland, etc.

Internet Memory Research
Spin-off of IMF established in June 2011 in Paris
• Mission: Operate large-scale or selective crawls & develop new technologies (processing and extraction)

Page 3: Building  Scalable  Web Archives

Internet Memory Foundation
Focused crawling:
• Automated crawls through the Archivethe.Net shared platform
• Quality-focused crawls: video capture (YouTube channels), Twitter crawls, complex crawls

Large-scale crawling:
• In-house developed distributed software
• Scalable crawler: MemoryBot
• Also designed for focused crawls and complex scoping

Page 4: Building  Scalable  Web Archives

Research projects

Web Archiving and Preservation
• Living Web Archives (2007-2010)
• Archives to Community MEMories (2010-2013)
• SCAlable Preservation Environment (2010-2013)

Web-scale Data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)

Page 5: Building  Scalable  Web Archives

MemoryBot design (1)
• Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project
• URL store designed for large-scale crawls (DRUM)
• Built in Erlang: a distributed, fault-tolerant system language
• Distributed (consistent hashing)
• Robust: topology-change adaptation, memory-usage regulation, process isolation
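The "distributed (consistent hashing)" point above means each crawler node owns a slice of the URL hash space, so a topology change only remaps a small fraction of URLs. A minimal sketch in Python (MemoryBot itself is written in Erlang and its internals are not published on this slide; the class and node names below are hypothetical):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring assigning URLs to crawler nodes.

    Illustrative only: virtual nodes smooth the distribution, and the
    ring is a sorted list of (hash, node) pairs searched by bisection.
    """

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning `key`: first ring entry at or after its hash."""
        idx = bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node1", "node2", "node3"])
owner = ring.node_for("http://example.org/page")
```

Removing one node only reassigns the keys that node owned; all other URL-to-node assignments stay stable, which is what makes topology-change adaptation cheap.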

Page 6: Building  Scalable  Web Archives

MemoryBot design (2)

Page 7: Building  Scalable  Web Archives

MemoryBot performance

• Good throughput with a slow decrease
• 85 resources written per second, slowing to 55 after 4 weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
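A back-of-the-envelope estimate of what those throughput figures imply over the 4-week crawl, assuming a roughly linear decline from 85 to 55 resources/second (an assumption; the slide does not give the actual curve):

```python
# Rough total resources fetched in 4 weeks, assuming the rate falls
# linearly from 85 to 55 resources/second (assumption, not from slide).
start_rate = 85          # resources/second at crawl start
end_rate = 55            # resources/second after 4 weeks
seconds = 4 * 7 * 24 * 3600  # seconds in 4 weeks
avg_rate = (start_rate + end_rate) / 2
total = avg_rate * seconds
print(f"~{total / 1e6:.0f} million resources")  # ~169 million resources
```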

Page 8: Building  Scalable  Web Archives

MemoryBot counters

Page 9: Building  Scalable  Web Archives

MemoryBot counters

Page 10: Building  Scalable  Web Archives

MemoryBot – quality

• Support for HTTPS, retries on server failure, configurable URL canonicalisation

• Scope: domain suffixes, language, hop sequences, white lists, black lists

• Priorities
• Trap detection (URL pattern identification, within-PLD duplicate detection)
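Configurable URL canonicalisation, mentioned above, maps the many spellings of one URL to a single stable form so the URL store does not fetch duplicates. A sketch of common normalisation rules (lowercase host, strip default ports and fragments, sort query parameters); these particular rules are illustrative choices, not MemoryBot's documented behaviour:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalise(url):
    """Return a stable canonical form of `url` (hypothetical rule set)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)  # fragment dropped
    scheme = scheme.lower()
    host = netloc.lower()
    # Strip default ports (http:80, https:443).
    if (scheme, host.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        host = host.rsplit(":", 1)[0]
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 canonicalise identically.
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path or "/", query, ""))

print(canonicalise("HTTP://Example.org:80/a?b=2&a=1#frag"))
# http://example.org/a?a=1&b=2
```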

Page 11: Building  Scalable  Web Archives

MemoryBot – multi-crawl

• Easier management
• Politeness observed across different crawls
• Better resource utilisation
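Cross-crawl politeness means all crawl jobs share one per-host rate limit, so two concurrent crawls never hammer the same server. A minimal sketch of such a shared gate (the class name and the 2-second default delay are hypothetical):

```python
import time
from collections import defaultdict

class PolitenessGate:
    """Shared per-host gate: no two fetches of one host closer than `delay` s.

    Illustrative sketch; a real multi-crawl scheduler would also need
    locking or a single-owner process per host.
    """

    def __init__(self, delay=2.0):
        self.delay = delay
        self.next_allowed = defaultdict(float)  # host -> earliest fetch time

    def wait_time(self, host, now=None):
        """Seconds the caller must wait before contacting `host`,
        and reserve the slot so other crawls queue behind it."""
        now = time.monotonic() if now is None else now
        wait = max(0.0, self.next_allowed[host] - now)
        self.next_allowed[host] = now + wait + self.delay
        return wait

gate = PolitenessGate(delay=2.0)
print(gate.wait_time("example.org", now=0.0))  # 0.0 — first request goes now
print(gate.wait_time("example.org", now=0.5))  # 1.5 — second crawl must wait
```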

Page 12: Building  Scalable  Web Archives

IM Infrastructure
Green datacenters:
• Through a collaboration with NoRack
• Designed for massive storage (petabytes of data)
• Highly scalable, low consumption
• Reduces storage and processing costs

Repository:
• HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system
• HBase: a distributed key-value index (temporal archives)
• MapReduce: a distributed execution framework
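Using HBase as a temporal archive index typically hinges on row-key design: a common pattern stores the reversed host plus an inverted timestamp, so all captures of one URL sort together with the newest first. A sketch of that pattern (the actual IM schema is not described on the slide; this function is hypothetical):

```python
# Illustrative HBase-style row key for a temporal web archive:
# reversed host groups a site's pages together; the inverted,
# zero-padded timestamp sorts versions newest-first lexicographically.
MAX_TS = 2**63 - 1

def row_key(host, path, ts):
    """Build a sortable row key for one capture of host+path at time ts."""
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{reversed_host}{path}|{MAX_TS - ts:019d}"

k_new = row_key("www.example.org", "/index.html", 1_400_000_000)
k_old = row_key("www.example.org", "/index.html", 1_300_000_000)
assert k_new < k_old  # newer capture sorts first in a lexicographic scan
```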

Page 13: Building  Scalable  Web Archives

IM Platform (1)

Data storage:• temporal aspect (versions)

Organised data:• Fast and easy access to content• Easy processing distribution (Big Data)

Several views on same data:• Raw, extracted and/or analysed

Takes care of data replication: • No (W)ARC synchronisation required

Page 14: Building  Scalable  Web Archives

IM Platform (2)

Extensive characterisation and data-mining actions:
• Process and reprocess information at any time depending on needs/requests
• Extract information such as MIME type, text resources, image metadata, etc.
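A reprocessing pass like the MIME-type extraction above maps naturally onto MapReduce: a mapper emits one (MIME type, 1) pair per archived record and a reducer sums them. A self-contained sketch with a plain list standing in for WARC records (record fields here are hypothetical):

```python
from collections import Counter

def mime_mapper(records):
    """Mapper sketch: emit (mime_type, 1) per record, stripping charset params."""
    for rec in records:
        mime = rec.get("content_type", "unknown").split(";")[0].strip()
        yield mime, 1

# Stand-in for records read from (W)ARC files in the repository.
records = [
    {"url": "http://a/", "content_type": "text/html; charset=utf-8"},
    {"url": "http://a/img.png", "content_type": "image/png"},
    {"url": "http://a/page2", "content_type": "text/html"},
]

# Reducer sketch: sum the counts per MIME type.
counts = Counter()
for mime, n in mime_mapper(records):
    counts[mime] += n
print(counts)  # Counter({'text/html': 2, 'image/png': 1})
```

Because the raw records stay in the repository, the same job can be rerun with a different extractor whenever a new need or request arrives.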

Page 15: Building  Scalable  Web Archives

SCAlable Preservation Environment (SCAPE)

QA/Preservation challenges?
• Growing size of web archives
• Ephemeral and heterogeneous content
• Costly tools/actions

Develop scalable quality assurance tools
Enhance existing characterisation tools

Page 16: Building  Scalable  Web Archives

Visual automated QA: Pagelizer

• Visual and structural comparison tool developed by UPMC as part of SCAPE

• Trained and enhanced through a collaboration with IMF

• Wrapped by the IMF team to be used at large scale within its platform
• Allows comparison of two web page snapshots
• Provides a similarity score as an output
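The idea of a similarity score between two page snapshots can be illustrated with a deliberately simple pixel-agreement metric; Pagelizer's real metric combines visual and structural features and is far more sophisticated, so everything below is a toy stand-in:

```python
def similarity(img_a, img_b):
    """Toy similarity score: fraction of identical pixels between two
    equally-sized screenshots given as flat lists of pixel values."""
    assert len(img_a) == len(img_b), "snapshots must have equal dimensions"
    matches = sum(1 for a, b in zip(img_a, img_b) if a == b)
    return matches / len(img_a)

# Tiny stand-ins for a live-page and an archived-page screenshot.
live = [0, 0, 1, 1, 0, 1, 0, 0]
archived = [0, 0, 1, 0, 0, 1, 0, 0]
print(similarity(live, archived))  # 0.875 — one pixel of eight differs
```

A QA workflow would then compare the score against a threshold to flag captures that likely diverge from the live page.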

Page 17: Building  Scalable  Web Archives

Visual automated QA: Pagelizer

• Tested on 13,000 pairs of URLs (Firefox & Opera)
• 75% correct assessments
• Whole workflow runs in around 4 seconds per pair:
• 2 seconds for the screenshot (depends on the page rendered)
• 2 seconds for the comparison
• Runtime already halved since the initial tests (MapReduce)
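The per-pair cost above fixes the scale of a full QA run, and shows why distributing it with MapReduce matters. A quick calculation from the slide's numbers (the 8-worker figure is a hypothetical example, not from the slide):

```python
# Cost of assessing the test set sequentially vs. distributed.
pairs = 13_000          # URL pairs tested (from the slide)
sec_per_pair = 4        # whole workflow, seconds per pair (from the slide)
total_hours = pairs * sec_per_pair / 3600
print(f"{total_hours:.1f} h sequential")        # 14.4 h
print(f"{total_hours / 8:.1f} h on 8 workers")  # 1.8 h (hypothetical split)
```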

Page 18: Building  Scalable  Web Archives

Next steps
Improvements are to be made in:
• Performance
• Robustness
• Correctness

New test in progress on a large-scale crawl:
• Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!

Page 19: Building  Scalable  Web Archives

Thank you. Any questions?

http://internetmemory.org - http://archivethe.net
florent.carpentier@[email protected]