Page 1: Building Scalable Web Archives

Building Scalable Web Archives

Florent Carpentier, Leïla Medjkoune Internet Memory Foundation

IIPC GA, Paris, May 2014

Page 2: Building Scalable Web Archives

Internet Memory Foundation

Internet Memory Foundation (European Archive)
Established in 2004 in Amsterdam and then in Paris:
• Mission: Preserve Web content by building a shared WA platform
• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN, The National Library of Ireland, etc.

Internet Memory Research
Spin-off of IMF established in June 2011 in Paris
• Mission: Operate large-scale or selective crawls & develop new technologies (processing and extraction)

Page 3: Building Scalable Web Archives

Internet Memory Foundation

Focused crawling:
• Automated crawls through the Archivethe.Net shared platform
• Quality-focused crawls: video capture (YouTube channels), Twitter crawls, complex crawls

Large-scale crawling:
• In-house developed distributed software
• Scalable crawler: MemoryBot
• Also designed for focused crawls and complex scoping

Page 4: Building Scalable Web Archives

Research projects

Web Archiving and Preservation
• Living Web Archives (2007-2010)
• Archives to Community MEMories (2010-2013)
• SCAlable Preservation Environment (2010-2013)

Web-scale Data Archiving and Extraction
• Living Knowledge (2009-2012)
• Longitudinal Analytics of Web Archive data (2010-2013)

Page 5: Building Scalable Web Archives

MemoryBot design (1)

• Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project
• URL store designed for large-scale crawls (DRUM)
• Built in Erlang: a distributed and fault-tolerant system language
• Distributed (consistent hashing; see the sketch after this list)
• Robust: topology change adaptation, memory usage regulation, process isolation
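The slide only names the technique; as an illustration, here is a minimal Python sketch of how a distributed crawler could use consistent hashing to assign hosts to crawler nodes. The node names, the number of virtual points and the hash function are assumptions, not details of MemoryBot.

```python
# A minimal sketch of assigning hosts to crawler workers with consistent
# hashing; illustrative only, not MemoryBot's actual implementation.
import bisect
import hashlib
from urllib.parse import urlsplit

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on a hash ring so that
        # adding or removing a node only remaps a small share of hosts.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, url):
        # Hash the host, not the full URL, so one worker owns a host
        # and can enforce politeness for it locally.
        host = urlsplit(url).hostname or ""
        idx = bisect.bisect(self._keys, self._hash(host)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.node_for("http://example.org/page"))
```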

Page 6: Building Scalable Web Archives

MemoryBot design (2)

Page 7: Building Scalable Web Archives

MemoryBot performance

• Good throughput with only a slow decrease over time
• 85 resources written per second, slowing to 55 after 4 weeks on a cluster of nine 8-core servers (32 GiB of RAM); see the back-of-the-envelope figures below
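For scale, a back-of-the-envelope computation from these figures (the rates are from the slide; the roughly linear slowdown over four weeks is an assumption):

```python
# Rough scale implied by the rates on this slide (85 resources/s at the
# start, 55 resources/s after four weeks on the nine-server cluster).
start_rate, end_rate = 85, 55            # resources written per second
seconds_per_day = 86_400

print(start_rate * seconds_per_day)      # ~7.3 million resources/day initially
print(end_rate * seconds_per_day)        # ~4.8 million resources/day after 4 weeks

# Cumulative total over 4 weeks, assuming a roughly linear slowdown
# (an assumption, not a figure from the slide): ~169 million resources.
average_rate = (start_rate + end_rate) / 2
print(int(average_rate * seconds_per_day * 28))
```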

Page 8: Building Scalable Web Archives

MemoryBot counters

Page 9: Building Scalable Web Archives

MemoryBot counters

Page 10: Building Scalable Web Archives

MemoryBot – quality

• Support of HTTPS, retries on server failure, configurable URL canonicalisation
• Scope: domain suffixes, language, hops sequence, white lists, black lists
• Priorities
• Trap detection (URL pattern identification, within-PLD duplicate detection); see the sketch after this list
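As an illustration of the trap-detection ideas named above, here is a minimal Python sketch of a repeated-path-segment heuristic and of within-PLD (pay-level domain) duplicate detection via content hashes. The threshold and the naive PLD extraction are assumptions, not MemoryBot's actual rules.

```python
# A minimal sketch of two trap heuristics: flagging URLs with repeated
# path segments, and spotting duplicate content within one PLD.
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

def looks_like_url_trap(url, max_repeats=3):
    # Calendar and session loops often repeat segments, e.g. /a/b/a/b/a/b/.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) >= max_repeats for s in segments)

class PldDuplicateDetector:
    def __init__(self):
        self._seen = defaultdict(set)   # PLD -> set of content hashes

    @staticmethod
    def _pld(url):
        # Naive PLD: last two host labels; a real crawler would use the
        # public suffix list instead.
        parts = (urlsplit(url).hostname or "").split(".")
        return ".".join(parts[-2:])

    def is_duplicate(self, url, body: bytes) -> bool:
        digest = hashlib.sha1(body).hexdigest()
        hashes = self._seen[self._pld(url)]
        if digest in hashes:
            return True
        hashes.add(digest)
        return False

detector = PldDuplicateDetector()
print(looks_like_url_trap("http://example.org/cal/2014/cal/2014/cal/2014/"))
print(detector.is_duplicate("http://example.org/a", b"<html>same</html>"))
print(detector.is_duplicate("http://sub.example.org/b", b"<html>same</html>"))
```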

Page 11: Building Scalable Web Archives

MemoryBot – multi-crawl

• Easier management
• Politeness observed across different crawls (see the sketch after this list)
• Better resource utilisation
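A minimal sketch of what cross-crawl politeness can look like: every crawl job consults one shared per-host timer before fetching, so two crawls never hit the same host back to back. The 2-second delay and the in-memory, single-process structure are illustrative assumptions.

```python
# A minimal sketch of politeness shared across crawl jobs.
import time
from urllib.parse import urlsplit

class SharedPoliteness:
    def __init__(self, min_delay=2.0):
        self._min_delay = min_delay
        self._next_ok = {}              # host -> earliest next fetch time

    def wait_turn(self, url):
        host = urlsplit(url).hostname or ""
        now = time.monotonic()
        ready_at = self._next_ok.get(host, now)
        if ready_at > now:
            time.sleep(ready_at - now)
        # Reserve the next slot for whichever crawl asks next.
        self._next_ok[host] = max(ready_at, now) + self._min_delay

politeness = SharedPoliteness()
for crawl, url in [("crawl-A", "http://example.org/1"),
                   ("crawl-B", "http://example.org/2")]:
    politeness.wait_turn(url)          # crawl-B waits for crawl-A's slot
    print(crawl, "may fetch", url)
```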

Page 12: Building Scalable Web Archives

IM Infrastructure

Green datacenters:
• Through a collaboration with NoRack
• Designed for massive storage (petabytes of data)
• Highly scalable / low consumption
• Reduces storage and processing costs

Repository:
• HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system
• HBase: a distributed key-value index (temporal archives; see the row-key sketch after this list)
• MapReduce: a distributed execution framework
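The slide does not describe the table schema; as an illustration, a common row-key design for temporal archives in HBase-style stores sorts all captures of a URL together (reversed host, then path, then capture timestamp). A minimal sketch, not the IM platform's actual schema:

```python
# Illustrative row-key scheme for a temporal archive table: all versions
# of one page become adjacent rows and are easy to scan by prefix.
from urllib.parse import urlsplit

def archive_row_key(url: str, capture_ts: str) -> str:
    """capture_ts is a 14-digit timestamp, e.g. '20140519120000'."""
    parts = urlsplit(url)
    reversed_host = ",".join(reversed((parts.hostname or "").split(".")))
    path = parts.path or "/"
    return f"{reversed_host}){path}:{capture_ts}"

print(archive_row_key("http://www.example.org/about", "20140519120000"))
# -> org,example,www)/about:20140519120000
```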

Page 13: Building Scalable Web Archives

IM Platform (1)

Data storage:
• Temporal aspect (versions)

Organised data:
• Fast and easy access to content
• Easy processing distribution (Big Data)

Several views on the same data:
• Raw, extracted and/or analysed

Takes care of data replication:
• No (W)ARC synchronisation required

Page 14: Building Scalable Web Archives

IM Platform (2)

Extensive characterisation and data mining actions:
• Process and reprocess information at any time, depending on needs/requests
• Extract information such as MIME type, text resources, image metadata, etc. (see the sketch below)
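As a local, non-distributed illustration of this kind of reprocessing, the sketch below walks WARC response records and emits URL/MIME-type pairs. The warcio library and the file name are assumptions; on the platform itself such extractions run as MapReduce jobs over HDFS/HBase.

```python
# A minimal, local-only sketch of extracting MIME types from a WARC file.
from warcio.archiveiterator import ArchiveIterator

def extract_mime_types(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            mime = record.http_headers.get_header("Content-Type") or "unknown"
            yield url, mime.split(";")[0].strip()

# Hypothetical usage with an illustrative file name:
# for url, mime in extract_mime_types("crawl-00001.warc.gz"):
#     print(mime, url)
```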

Page 15: Building Scalable Web Archives

SCAlable Preservation Environment (SCAPE)

QA/Preservation challenges:
• Growing size of web archives
• Ephemeral and heterogeneous content
• Costly tools/actions

• Develop scalable quality assurance tools
• Enhance existing characterisation tools

Page 16: Building Scalable Web Archives

Visual automated QA: Pagelyzer

• Visual and structural comparison tool developed by UPMC as part of SCAPE
• Trained and enhanced through a collaboration with IMF
• Wrapped by the IMF team to be used at large scale within its platform
• Allows comparison of two web page snapshots
• Provides a similarity score as an output (see the sketch below)
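Pagelyzer combines visual and structural features; purely as an illustration of producing a similarity score from two snapshots, here is a minimal pixel-level sketch in Python. The RMS-based score, the image size and the file names are assumptions, not Pagelyzer's actual method.

```python
# A minimal sketch of a pixel-level similarity score between two page
# snapshots, scaled to [0, 1] where 1.0 means identical images.
import math
from PIL import Image, ImageChops

def similarity(path_a: str, path_b: str, size=(1024, 768)) -> float:
    a = Image.open(path_a).convert("RGB").resize(size)
    b = Image.open(path_b).convert("RGB").resize(size)
    diff = ImageChops.difference(a, b)
    # Root-mean-square difference over all channels.
    histogram = diff.histogram()
    squares = sum(count * (idx % 256) ** 2 for idx, count in enumerate(histogram))
    rms = math.sqrt(squares / (size[0] * size[1] * 3))
    return 1.0 - rms / 255.0

# Hypothetical usage with illustrative file names:
# print(similarity("live.png", "archived.png"))
```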

Page 17: Building Scalable Web Archives

Visual automated QA: Pagelyzer

• Tested on 13 000 pairs of URLs (Firefox & Opera)
• 75% correct assessments
• Whole workflow runs in around 4 seconds per pair:
  • 2 seconds for the screenshot (depends on the page rendered)
  • 2 seconds for the comparison
• Runtime already halved since the initial tests (MapReduce)

Page 18: Building Scalable Web Archives

Next steps

Improvements are to be made:
• Performance
• Robustness
• Correctness

A new test is in progress on a large-scale crawl:
• Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!

Page 19: Building Scalable Web Archives

Thank you. Any questions?

http://internetmemory.org - http://archivethe.net
florent.carpentier@[email protected]

