TPDL 2016 Doctoral Consortium - Web Archive Profiling
![Page 1: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/1.jpg)
Web Archive Profiling For Efficient Memento Aggregation
Sawood Alam
Old Dominion University, Norfolk, Virginia - 23529
Advisor: Michael L. Nelson
Doctoral Consortium TPDL’16
September 5, 2016
Supported in part by the International Internet Preservation Consortium (IIPC)
![Page 2: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/2.jpg)
Motivation
![Page 3: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/3.jpg)
Motivation
![Page 4: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/4.jpg)
Motivation
![Page 5: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/5.jpg)
Memento Aggregator
![Page 6: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/6.jpg)
Memento Aggregator
![Page 7: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/7.jpg)
Memento Aggregator
![Page 8: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/8.jpg)
Memento Aggregator
![Page 9: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/9.jpg)
Memento Aggregator
![Page 10: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/10.jpg)
Memento Aggregator
![Page 11: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/11.jpg)
From: Michael Nelson [mailto:[email protected]]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <[email protected]>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <[email protected]>, Sawood Alam <[email protected]>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
![Page 12: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/12.jpg)
Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful; both clients and archives suffer
![Page 13: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/13.jpg)
Memento Routing
![Page 14: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/14.jpg)
Routing Pros & Cons
● Pros
  ○ Minimizes traffic and resource consumption
  ○ Improves throughput
● Cons
  ○ Upfront profile maintenance cost
  ○ May miss Mementos (false negatives)
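The trade-off between broadcasting and routing can be sketched in a few lines. This is a minimal illustration, not MemGator's implementation: the archive names, the profile contents, and the `broadcast` and `route` helpers are all hypothetical, and each profile is simplified to a plain set of URI-keys.

```python
# Hypothetical profiles: each archive is summarized by a set of URI-keys
# (SURT-style host prefixes). Real profiles would carry more detail.
PROFILES = {
    "archive-a": {"uk,co,bbc)/", "com,cnn)/"},
    "archive-b": {"uk,co,bbc)/"},
    "archive-c": {"org,example)/"},
}


def broadcast(profiles):
    """Baseline: query every archive, regardless of its holdings."""
    return sorted(profiles)


def route(uri_key, profiles):
    """Query only the archives whose profile contains the lookup key.

    An archive whose profile is stale or too coarse is skipped even if it
    holds the resource, which is the false-negative risk of routing.
    """
    return sorted(name for name, keys in profiles.items() if uri_key in keys)


if __name__ == "__main__":
    key = "com,cnn)/"
    print("broadcast:", broadcast(PROFILES))
    print("routed:   ", route(key, PROFILES))
```

In this toy setup, routing sends one request where broadcasting sends three; the saving comes at the cost of keeping `PROFILES` up to date.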
![Page 15: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/15.jpg)
Why Small Archives Matter?
![Page 16: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/16.jpg)
Why Small Archives Matter?
● 400B+ web pages at IA do not cover everything
● Top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
![Page 17: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/17.jpg)
While the IA was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
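The same per-year tally can be computed in Python once a CDXJ TimeMap has been saved. This is a minimal sketch, assuming each memento line starts with a 14-digit datetime and that metadata lines start with a non-digit character (such as the `@` filtered by the `grep` in the pipeline); the function name and the sample lines are made up for illustration and are not real MemGator output.

```python
from collections import Counter


def mementos_per_year(lines):
    """Count memento lines by their first four characters (the year)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and metadata lines, which do not start with a digit.
        if not line or not line[0].isdigit():
            continue
        counts[line[:4]] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        '@context ["http://example.org/context"]',
        '20020120142510 {"uri": "http://archive.example/2002/example.org"}',
        '20020523101112 {"uri": "http://archive.example/2002b/example.org"}',
        '20050101000000 {"uri": "http://archive.example/2005/example.org"}',
    ]
    for year, count in mementos_per_year(sample).items():
        print(count, year)
```

Unlike `uniq -c`, which only merges adjacent duplicates, the `Counter` does not depend on the lines being in chronological order.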
![Page 18: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/18.jpg)
Research Questions
● What do individual web archives hold?
● How much do we need to know about an archive’s holdings?
● What is the optimal level of summarization for better accuracy and increased freshness?
● What are various ways to learn about archives’ holdings?
● How can archives’ profiles be stored and updated to scale efficiently?
![Page 19: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/19.jpg)
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
![Page 20: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/20.jpg)
Profiling Policies
● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
  ○ bbc.co.uk/images/logo.png?w=90
  ○ cnn.com/2014/03/15/?id=128734
● TLD-only Profiling (1 TLD = 1 Profile Key)
  ○ com)/
  ○ uk)/
● Middle Ground
  ○ uk,co)/
  ○ uk,co,bbc)/images
  ○ uk,co,bbc)/0/2/1
  ○ com,cnn)/ 201309 ar
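The policies above differ mainly in how much of a URI survives in the profile key. The following is a minimal sketch of that idea, assuming a SURT-like key (reversed host labels, then `)/`, then the path) truncated to a chosen number of host and path segments. The `uri_key` function and its parameters are hypothetical and cover only the host/path-prefix style of key; it writes complete keys in the same SURT-like form, and it does not attempt the other key styles listed.

```python
from urllib.parse import urlsplit


def uri_key(uri, host_segments=None, path_segments=None):
    """Build a SURT-like profile key, e.g. 'uk,co,bbc)/images'.

    host_segments and path_segments cap how many reversed host labels and
    leading path segments are kept; None keeps all of them.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to locate the host
    parts = urlsplit(uri)
    labels = (parts.hostname or "").lower().split(".")[::-1]
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s]
    if path_segments is not None:
        segments = segments[:path_segments]
    key = ",".join(labels) + ")/" + "/".join(segments)
    # Keep the query string only when nothing was truncated.
    if host_segments is None and path_segments is None and parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    uri = "bbc.co.uk/images/logo.png?w=90"
    print(uri_key(uri))        # complete key
    print(uri_key(uri, 1, 0))  # TLD-only key
    print(uri_key(uri, 3, 1))  # middle ground: three host labels, one path segment
```

Shorter keys collapse many URI-Rs into one entry, which keeps the profile small at the price of coarser predictions.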
![Page 21: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/21.jpg)
Available Profiling Resources
[Figure labels: Client request, Archive Response, CDX Records]
![Page 22: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/22.jpg)
Profiling Strategies
● Sample URI Profiling
● CDX Profiling
● Response Cache Profiling
● Fulltext Search Profiling
![Page 23: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/23.jpg)
Sample Profile
![Page 24: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/24.jpg)
Probability Rank
![Page 25: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/25.jpg)
Archives
| Archive | URI-Rs | URI-Ms | Index Size |
|---|---|---|---|
| Archive-It | 1.9B | 5.3B | 1.8TB |
| UKWA | 0.7B | 1.7B | 0.5TB |
| Stanford | 12M | 25M | 8.3GB |
![Page 26: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/26.jpg)
Sample Query Sets
| Sample (1M URIs Each) | In Archive-It | In UKWA | In Stanford | Union {AIT, UK, SU} |
|---|---|---|---|---|
| DMOZ | 4.097% | 3.594% | 0.034% | 7.575% |
| MementoProxy | 4.182% | 0.408% | 0.046% | 4.527% |
| IAWayback | 3.716% | 0.519% | 0.039% | 4.165% |
| UKWayback | 0.108% | 0.034% | 0.002% | 0.134% |
![Page 27: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/27.jpg)
Evaluation
● Generate profiles with 23 policies
● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Efficiency
![Page 28: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/28.jpg)
Resource Requirement
![Page 29: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/29.jpg)
CDX Size vs URI-M (UKWA 10 Years)
Alpha: 175 bytes per CDX line
![Page 30: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/30.jpg)
URI-M vs URI-R (UKWA 10 Years)
Gamma: 2.46
K: 2.686
Beta: 0.911
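The slide gives only the fitted constants. As a rough sketch, and assuming (the transcript does not state this) that K and Beta parameterize a Heaps'-law-style fit of the form URI-R ≈ K · URI-M^Beta and that Gamma is the average number of URI-Ms per URI-R, the two readings can be compared as follows. The function names are hypothetical, and the input of 181M URI-Ms is borrowed from the UKWA figures quoted elsewhere in this deck purely as an illustrative value.

```python
K = 2.686     # constant from the slide
BETA = 0.911  # constant from the slide
GAMMA = 2.46  # constant from the slide


def uri_r_power_law(uri_m, k=K, beta=BETA):
    """Estimate the URI-R count as K * URI-M^Beta (assumed form of the fit)."""
    return k * uri_m ** beta


def uri_r_ratio(uri_m, gamma=GAMMA):
    """Estimate the URI-R count as URI-M / Gamma (assumed meaning of Gamma)."""
    return uri_m / gamma


if __name__ == "__main__":
    uri_m = 181_000_000  # illustrative input
    print(f"power-law estimate: {uri_r_power_law(uri_m):,.0f}")
    print(f"ratio estimate:     {uri_r_ratio(uri_m):,.0f}")
```

Under these assumptions the power-law form grows sublinearly, so the ratio of URI-Ms to URI-Rs rises slowly as an archive accumulates more captures.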
![Page 31: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/31.jpg)
Space Cost (UKWA 7 Years)
Phi: 8.5e-07 -- 0.70583
![Page 32: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/32.jpg)
Time Cost (UKWA 7 Years)
Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours
![Page 33: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/33.jpg)
Archive-It
![Page 34: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/34.jpg)
Fulltext Search Cost
![Page 35: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/35.jpg)
Partial Knowledge
![Page 36: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/36.jpg)
Cost vs Accuracy
| Group | Policies | Cost | Accuracy |
|---|---|---|---|
| G1 | H1P0/TLD | Bound by # of TLDs | ≈ 0.01 |
| G2 | H3P0, DDom, DSub, DPth, DQry | < 0.01 | ≈ 0.78 |
| G3 | DIni | ≈ 2 * G2 | ≈ 0.88 |
| G4 | HxP1 | ≈ 5 * G3 | ≈ 0.94 |
| G5 | Higher HmPn | 0.4 -- 0.7 | Not Explored |
| G6 | URIR | 1.0 | 1.0 |
![Page 37: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/37.jpg)
Work Plan
✓ Baseline Profiling Through CDX Files
✓ Profile Serialization
✓ Fulltext Search Profiling
✓ Sample URI Dataset
➢ Instrumenting Memento Aggregator
➢ Multidimensional Profiling
![Page 38: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/38.jpg)
Publications
TPDL15 Web Archive Profiling Through CDX Summarization
TCDL15 Profiling Web Archives - For Efficient Memento Query Routing
IJDL16 Web Archive Profiling Through CDX Summarization
JCDL16 Poster: MemGator - A Portable Concurrent Memento Aggregator
TPDL16 Web Archive Profiling Through Fulltext Search
RFC Object Resource Stream (ORS) and CDX-JSON (CDXJ) Formats
C4LJ MemGator - A Portable Concurrent Memento Aggregator Architecture
JCDL17 Scalable, Maintainable, and Extensible Web Archive Profile Serialization for Efficient Lookup
JCDL17 URI, Time, and Language Profiling from Live Archives via URI Sampling and Fulltext Search
SIGIR17 Memento Aggregator Routing Based on Probability Distribution of Memento Availability with Archive Profiles
IJDL17 Archive X-Ray - Web Archive Profiling for Efficient Memento Aggregation
![Page 39: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/39.jpg)
Future Work
● Language profiles
● Evaluation of combination profiles such as URI-Key along with Datetime
● Utilize archive profiles to generate a rank-ordered list of archives
● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
![Page 40: TPDL 2016 Doctoral Consortium - Web Archive Profiling](https://reader031.vdocuments.site/reader031/viewer/2022021502/58841cdc1a28ab485c8b4a37/html5/thumbnails/40.jpg)
Conclusions
● Generated profiles with different policies for three archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 80% routing accuracy with <1% relative cost while maintaining 0.9 recall